Instructions & Opcodes

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the Ori IR instruction representation: in-memory layout, opcode encoding, operand model, instruction flags, creation/iteration APIs, the master descriptor table, and opcode categories. All offsets are from ptxas v13.0.88 (37.7 MB stripped x86-64 ELF).

Instruction Object Layout

Every Ori instruction is a 296-byte C++ object allocated from the Code Object's arena. Instructions are linked into per-basic-block doubly-linked lists via pointers at offsets +0 and +8. The allocator at sub_7DD010 allocates exactly 296 bytes per instruction and zeroes the object before populating it.

Memory Layout (296 bytes)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +0 | 8 | ptr | prev | Previous instruction in BB linked list (nullptr for head) |
| +8 | 8 | ptr | next | Next instruction in BB linked list (nullptr for tail) |
| +16 | 4 | i32 | id | Unique instruction ID (monotonically increasing within function) |
| +20 | 4 | i32 | ref_count | Reference/use count (incremented by sub_7E6090) |
| +24 | 4 | i32 | bb_index | Basic block index (bix) this instruction belongs to |
| +28 | 4 | u32 | reserved_28 | Reserved / padding |
| +32 | 4 | u32 | control_word | Scheduling control word (stall cycles, yield, etc.) |
| +36 | 4 | u32 | flags_36 | Instruction flags (bits 19-21 = subtype, see below) |
| +40 | 8 | ptr | sched_slot | Scheduling state pointer |
| +48 | 8 | u64 | flag_bits | Extended flag bits (bit 5 = volatile, bit 27 = reuse) |
| +56 | 8 | ptr | def_instr | Defining instruction (for SSA def-use chains) |
| +64 | 8 | ptr | reserved_64 | Reserved / register class info |
| +72 | 4 | u32 | opcode | Full opcode word (lower 12 bits = base opcode, bits 12-13 = modifier) |
| +76 | 4 | u32 | opcode_aux | Auxiliary opcode data (sub-operation, comparison predicate) |
| +80 | 4 | u32 | operand_count | Total number of operands (destinations + sources) |
| +84 | var | u32[N*2] | operands[] | Packed operand array (8 bytes per operand slot) |
| +88 | 4 | u32 | operands[0].extra | High word of first operand slot |
| +100 | 1 | u8 | type_flags | Data type / modifier flags (bits 0-2 = data type code) |
| +104 | 4 | u32 | reserved_104 | Reserved |
| +112 | 8 | ptr | use_chain | Use chain linked list head (for CSE) |
| +120 | 8 | ptr | reserved_120 | Reserved |
| +136 | 4 | i32 | reserved_136 | Reserved |
| +160 | 8 | ptr | enc_buf | Encoding buffer pointer (populated during code generation) |
| +168 | 8 | ptr | reserved_168 | Reserved |
| +184 | 4 | u32 | enc_mode | Encoding mode selector |
| +200 | 8 | u64 | imm_value | Immediate value (for instructions with constant operands) |
| +208 | 16 | xmm | sched_params | Scheduling parameters (loaded via _mm_load_si128) |
| +240 | 4 | u32 | reserved_240 | Reserved |
| +244 | 1 | u8 | reserved_244 | Reserved |
| +248 | 8 | i64 | sentinel_248 | Initialized to -1 (0xFFFFFFFFFFFFFFFF) |
| +256 | 8 | i64 | sentinel_256 | Initialized to 0xFFFFFFFF |
| +264 | 8 | i64 | bb_ref | Basic block reference / block index storage |
| +272 | 8 | i64 | reserved_272 | Reserved |
| +280 | 16 | u128 | reserved_280 | Zeroed on creation |
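The fixed header of this layout can be checked mechanically. The following is a minimal sketch, assuming an LP64 target (8-byte pointers); the struct name OriInstr and the opaque tail array are hypothetical and do not come from the binary. Only the fields up to the first operand slot are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical partial mirror of the 296-byte instruction object.
 * Fields after the first operand slot are left as an opaque tail. */
typedef struct OriInstr {
    struct OriInstr *prev;      /* +0  */
    struct OriInstr *next;      /* +8  */
    int32_t  id;                /* +16 */
    int32_t  ref_count;         /* +20 */
    int32_t  bb_index;          /* +24 */
    uint32_t reserved_28;       /* +28 */
    uint32_t control_word;      /* +32 */
    uint32_t flags_36;          /* +36 */
    void    *sched_slot;        /* +40 */
    uint64_t flag_bits;         /* +48 */
    void    *def_instr;         /* +56 */
    void    *reserved_64;       /* +64 */
    uint32_t opcode;            /* +72 */
    uint32_t opcode_aux;        /* +76 */
    uint32_t operand_count;     /* +80 */
    uint32_t operand0[2];       /* +84: first 8-byte operand slot */
    uint8_t  tail[296 - 92];    /* +92..+295: not modeled in this sketch */
} OriInstr;

/* Compile-time checks against the documented offsets. */
static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(OriInstr, next) == 8, "next at +8");
static_assert(offsetof(OriInstr, flag_bits) == 48, "flag_bits at +48");
static_assert(offsetof(OriInstr, opcode) == 72, "opcode at +72");
static_assert(offsetof(OriInstr, operand_count) == 80, "operand_count at +80");
static_assert(offsetof(OriInstr, operand0) == 84, "operands start at +84");
static_assert(sizeof(OriInstr) == 296, "object is 296 bytes");
```

If any documented offset were inconsistent with natural alignment, one of the static assertions would fail at compile time.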

Linked-List Pointers

Instructions form a doubly-linked list within each basic block. The Code Object stores the global list head at offset +272 and tail at offset +280:

Code Object +272  -->  head instruction (prev = nullptr)
                            |
                            v  (+8 = next)
                       instruction 2
                            |
                            v
                       instruction 3
                            |
                            v  ...
Code Object +280  -->  tail instruction (next = nullptr)

The linked-list traversal pattern appears in hundreds of functions throughout ptxas:

// Forward iteration over all instructions
for (instr = *(ptr*)(code_obj + 272); instr != nullptr; instr = *(ptr*)(instr + 8)) {
    uint32_t opcode = *(uint32_t*)(instr + 72);
    uint32_t num_ops = *(uint32_t*)(instr + 80);
    // process instruction...
}
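As a self-contained illustration of the same traversal, the sketch below walks raw byte buffers using the documented offsets (+272 head, +8 next, +72 opcode). The function name count_base_opcode and the memcpy-based load helpers are hypothetical; they are not functions from ptxas.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    CODE_OBJ_HEAD_OFF = 272,  /* Code Object: head instruction pointer */
    INSTR_NEXT_OFF    = 8,    /* instruction: next pointer */
    INSTR_OPCODE_OFF  = 72    /* instruction: opcode word */
};

/* memcpy-based loads avoid alignment and aliasing issues on raw buffers. */
static const uint8_t *load_ptr(const uint8_t *base, size_t off) {
    const uint8_t *p;
    memcpy(&p, base + off, sizeof p);
    return p;
}

static uint32_t load_u32(const uint8_t *base, size_t off) {
    uint32_t v;
    memcpy(&v, base + off, sizeof v);
    return v;
}

/* Count instructions whose base opcode (modifier bits 12-13 cleared)
 * equals `wanted`. */
size_t count_base_opcode(const uint8_t *code_obj, uint32_t wanted) {
    size_t n = 0;
    for (const uint8_t *instr = load_ptr(code_obj, CODE_OBJ_HEAD_OFF);
         instr != NULL;
         instr = load_ptr(instr, INSTR_NEXT_OFF)) {
        uint32_t base = load_u32(instr, INSTR_OPCODE_OFF) & 0xFFFFCFFFu;
        if (base == wanted)
            n++;
    }
    return n;
}
```

A call such as count_base_opcode(code_obj, 19) would count MOV instructions regardless of their modifier bits.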

Opcode Encoding

The opcode field at offset +72 is a 32-bit word with a structured layout.

Opcode Word Format

 31              16  15  14  13  12  11            0
+------------------+---+---+---+---+---------------+
|    upper flags   |   |   | M | M |  base opcode  |
+------------------+---+---+---+---+---------------+
                            ^   ^
                            |   bit 12: modifier bit 0
                            bit 13: modifier bit 1

M = modifier bits (stripped by the 0xFFFFCFFF mask)
base opcode = 12-bit instruction class identifier (0-4095)

The mask 0xFFFFCFFF (clear bits 12-13) is used throughout InstructionClassifier, MBarrierDetector, OperandLowering, and many other subsystems to extract the base instruction class, stripping sub-operation modifier bits:

uint32_t raw_opcode = *(uint32_t*)(instr + 72);
uint32_t base_opcode = raw_opcode & 0xFFFFCFFF;

Additionally, bit 12 (the low modifier bit) feeds into operand count calculations. The expression (opcode >> 11) & 2 isolates bit 12 and scales it to 0 or 2:

// Effective operand count adjustment (appears in 50+ functions)
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2;  // 0 or 2
int dst_count = *(uint32_t*)(instr + 80) - adj;
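The field extractions above can be collected into small helpers. This is a sketch only; the ori_* names are hypothetical. Note that (w >> 11) & 2 is nonzero exactly when bit 12 of the opcode word is set, since bit 12 lands on bit 1 after the shift.

```c
#include <stdint.h>

/* Base instruction class: opcode word with modifier bits 12-13 cleared. */
static inline uint32_t ori_base_opcode(uint32_t w) {
    return w & 0xFFFFCFFFu;
}

/* Two-bit sub-operation modifier stored in bits 12-13. */
static inline uint32_t ori_modifier(uint32_t w) {
    return (w >> 12) & 3u;
}

/* Operand-count adjustment as seen in decompiled code: 0 or 2.
 * Equivalent to testing bit 12 and scaling the result by 2. */
static inline uint32_t ori_count_adj(uint32_t w) {
    return (w >> 11) & 2u;
}

/* Effective operand count after subtracting the adjustment. */
static inline uint32_t ori_effective_count(uint32_t w, uint32_t operand_count) {
    return operand_count - ori_count_adj(w);
}
```

For example, an opcode word of 0x1061 (STG with modifier 1) yields base opcode 97 and an adjustment of 2, while 0x2061 yields the same base opcode with no adjustment.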

Canonical Opcode Reference

The opcode value stored at instruction+72 is the index into the ROT13 name table at InstructionInfo+4184. There is a single numbering system: the ROT13 table index is the runtime opcode. This was verified by tracing sub_BEBAC0 (getName), which computes InstructionInfo + 4184 + 16 * opcode with no remapping.

The following table lists frequently-referenced opcodes from decompiled code, with their canonical SASS mnemonic names from the ROT13 table. Each opcode appears in 10+ decompiled functions reading *(instr+72).

| Base Opcode | SASS Mnemonic | Category | Reference Count |
| --- | --- | --- | --- |
| 0 | ERRBAR | Error barrier (internal) | Sentinel in scheduler |
| 1 | IMAD | Integer multiply-add | 100+ functions |
| 7 | ISETP | Integer set-predicate | sub_7E0030 switch |
| 18 | FSETP | FP set-predicate | sub_7E0030 switch |
| 19 | MOV | Move | 80+ functions |
| 23 | PLOP3 | Predicate 3-input logic | sub_7E0030 case 23 |
| 25 | NOP | No-op | Scheduling, peephole |
| 52 | AL2P_INDEXED | BB boundary pseudo-opcode | sub_6820B0, 100+ |
| 54 | BMOV_B | Barrier move (B) | sub_7E6090 case 54 |
| 61 | BAR | Barrier synchronization | Sync passes |
| 67 | BRA | Branch | sub_74ED70, CFG builders |
| 71 | CALL | Function call | sub_7B81D0, ABI, spill |
| 72 | RET | Return | sub_74ED70 (with 67) |
| 77 | EXIT | Exit thread | sub_7E4150, CFG sinks |
| 93 | OUT_FINAL | Tessellation output (final) | sub_734AD0, 25+ |
| 94 | LDS | Load shared | sub_7E0650 case 94 |
| 95 | STS | Store shared | sub_7E0030, 40+ |
| 96 | LDG | Load global | Memory analysis |
| 97 | STG | Store global | sub_6820B0, 30+ |
| 102 | ATOM | Atomic | Encoding switch |
| 104 | RED | Reduction | Encoding switch |
| 111 | MEMBAR | Memory barrier | Sync passes |
| 119 | SHFL | Warp shuffle | sub_7E0030 case 119 |
| 122 | DFMA | Double FP fused mul-add | sub_7E0030 case 122 |
| 130 | HSET2 | Half-precision set (packed) | 20+ functions |
| 135 | INTRINSIC | Compiler intrinsic (pseudo) | ISel, lowering |
| 137 | SM73_FIRST | SM gen boundary (real instr) | Strength reduction |
| 183 | sm_82+ opcode | Extended mem operation | & 0xFFFFCFFF mask |

Important caveats:

  1. Opcode 52 (AL2P_INDEXED in name table) is universally used as a basic block delimiter in 100+ decompiled functions. The SASS mnemonic name may be vestigial; no decompiled code uses it for attribute-to-patch operations.

  2. SM boundary markers (136=SM70_LAST, 137=SM73_FIRST, etc.) have marker names in the ROT13 table but are valid runtime opcodes. Instructions with these opcode values exist in the IR and are processed by optimization passes (e.g., strength reduction operates on opcode 137).

  3. Earlier versions of this page had a "Selected Opcode Values" table that assigned incorrect SASS mnemonics based on behavioral inference rather than the ROT13 name table. Those labels (93=BRA/CALL, 95=EXIT, 97=CALL/label, 130=MOV) were wrong. The correct labels are: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2. Branch/call/exit are at 67=BRA, 71=CALL, 77=EXIT.

Opcode Ranges by SM Generation

The ROT13 opcode name table in sub_BE7390 (InstructionInfo constructor) includes explicit SM generation boundary markers:

| Marker Opcode | Decoded Name | Meaning |
| --- | --- | --- |
| 136 | SM70_LAST | Last sm_70 (Volta) opcode |
| 137 | SM73_FIRST | First sm_73 (Volta+) opcode |
| 171 | SM73_LAST | Last sm_73 opcode |
| 172 | SM82_FIRST | First sm_82 (Ampere) opcode |
| 193 | SM82_LAST | Last sm_82 opcode |
| 194 | SM86_FIRST | First sm_86 (Ampere+) opcode |
| 199 | SM86_LAST | Last sm_86 opcode |
| 200 | SM89_FIRST | First sm_89 (Ada) opcode |
| 205 | SM89_LAST | Last sm_89 opcode |
| 206 | SM90_FIRST | First sm_90 (Hopper) opcode |
| 252 | SM90_LAST | Last sm_90 opcode |
| 253 | SM100_FIRST | First sm_100 (Blackwell) opcode |
| 280 | SM100_LAST | Last sm_100 opcode |
| 281 | SM104_FIRST | First sm_104 (Blackwell Ultra) opcode |
| 320 | SM104_LAST | Last sm_104 opcode |
| 321 | LAST | Sentinel (end of table) |

This gives a clear partitioning: opcodes 0-136 are the base sm_70+ ISA, 137-171 extend to sm_73, and so on up through sm_104. Each SM generation only adds opcodes; no base opcodes are removed.
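The marker ranges lend themselves to a simple lookup. The sketch below maps an opcode index to the SM generation that introduced it, using only the boundaries listed above; the type and function names are hypothetical, and marker opcodes are treated as belonging to the range they delimit.

```c
#include <stddef.h>

typedef struct {
    int first;  /* first opcode index in the range (inclusive) */
    int last;   /* last opcode index in the range (inclusive)  */
    int sm;     /* SM generation number, e.g. 70 for sm_70     */
} SmOpcodeRange;

static const SmOpcodeRange k_sm_ranges[] = {
    {   0, 136,  70 },
    { 137, 171,  73 },
    { 172, 193,  82 },
    { 194, 199,  86 },
    { 200, 205,  89 },
    { 206, 252,  90 },
    { 253, 280, 100 },
    { 281, 320, 104 },
};

/* Returns the SM generation for `opcode`, or -1 for out-of-range values
 * and for the LAST sentinel at index 321. */
int sm_generation_for_opcode(int opcode) {
    for (size_t i = 0; i < sizeof k_sm_ranges / sizeof k_sm_ranges[0]; i++) {
        if (opcode >= k_sm_ranges[i].first && opcode <= k_sm_ranges[i].last)
            return k_sm_ranges[i].sm;
    }
    return -1;
}
```

A pass that needs to reject opcodes newer than its target could compare the returned generation against the target's SM number.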

Operand Model

Packed Operand Encoding

Each operand occupies 8 bytes (two 32-bit words) in the operand array starting at instruction offset +84. The first word carries the type, modifier bits, and index. The second word carries additional data (extended flags, immediate bits, etc.).

Word 0 (at instr + 84 + 8*i):

 31  30  29  28  27  26  25  24  23  22  21  20  19                  0
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
| S |  type(3) |       modifier (8 bits)        |    index (20 bits)   |
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
  ^   ^                                           ^
  |   bits 28-30: operand type                    bits 0-19: register/symbol index
  bit 31: sign/negative flag (S)

Word 1 (at instr + 88 + 8*i):

 31                                                                  0
+--------------------------------------------------------------------+
|               extended data / immediate bits / flags                |
+--------------------------------------------------------------------+

Operand Type Field (bits 28-30)

| Value | Type | Index Meaning |
| --- | --- | --- |
| 0 | Unused / padding | |
| 1 | Register | Index into *(code_obj+88) + 8*index register descriptor array |
| 2 | Predicate register | Index into predicate register file |
| 3 | Uniform register | UR file index |
| 4 | Address/offset | Memory offset value |
| 5 | Symbol/constant | Index into *(code_obj+152) symbol table |
| 6 | Predicate guard | Guard predicate controlling conditional execution |
| 7 | Immediate | Encoded immediate value |

Operand Extraction Pattern

This exact extraction pattern appears in 50+ functions across scheduling, regalloc, encoding, and optimization passes:

uint32_t operand_word = *(uint32_t*)(instr + 84 + 8 * i);

int  type   = (operand_word >> 28) & 7;     // bits 28-30
int  index  = operand_word & 0xFFFFF;        // bits 0-19 (also seen as 0xFFFFFF)
int  mods   = (operand_word >> 20) & 0xFF;   // bits 20-27
bool is_neg = (operand_word >> 31) & 1;      // bit 31

// Register operand check (most common pattern)
if (type == 1) {
    reg_descriptor = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * index);
    reg_file_type  = *(uint32_t*)(reg_descriptor + 64);
    reg_number     = *(uint32_t*)(reg_descriptor + 12);
}

Some functions use a 24-bit index mask (& 0xFFFFFF) instead of 20-bit, packing additional modifier bits into the upper nibble of the index field.
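The word 0 layout can be expressed as a pack/unpack pair, which makes the field boundaries easy to test. This is a sketch using the 20-bit index variant; the OperandFields type and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    unsigned type;   /* bits 28-30: operand type       */
    unsigned mods;   /* bits 20-27: modifier bits      */
    uint32_t index;  /* bits 0-19: register/symbol idx */
    bool     neg;    /* bit 31: sign/negative flag     */
} OperandFields;

/* Split operand word 0 into its four fields. */
OperandFields operand_unpack(uint32_t w) {
    OperandFields f;
    f.type  = (w >> 28) & 7u;
    f.mods  = (w >> 20) & 0xFFu;
    f.index = w & 0xFFFFFu;
    f.neg   = (w >> 31) != 0;
    return f;
}

/* Reassemble operand word 0; out-of-range field bits are masked off. */
uint32_t operand_pack(OperandFields f) {
    return ((uint32_t)f.neg << 31)
         | ((uint32_t)(f.type & 7u) << 28)
         | ((uint32_t)(f.mods & 0xFFu) << 20)
         | (f.index & 0xFFFFFu);
}
```

For instance, the word 0x9120002A unpacks to a negated register operand (type 1) with modifier byte 0x12 and index 42, and packing those fields reproduces the same word.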

Operand Classification Predicates

Small predicate functions at 0xB28E00-0xB28E90 provide the instruction selection interface for operand queries:

| Address | Function | Logic |
| --- | --- | --- |
| sub_B28E00 | getRegClass | Returns register class; 1023 = wildcard, 1 = GPR |
| sub_B28E10 | isRegOperand | ((word >> 28) & 7) == 1 |
| sub_B28E20 | isPredOperand | ((word >> 28) & 7) == 2 |
| sub_B28E40 | isImmOperand | ((word >> 28) & 7) == 7 |
| sub_B28E80 | isConstOperand | ((word >> 28) & 7) == 5 |
| sub_B28E90 | isUReg | ((word >> 28) & 7) == 3 |
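These predicates are simple enough to restate in C. The sketch below mirrors the logic column; the enum and function names are hypothetical stand-ins, not the symbols used inside ptxas. The explicit parentheses matter because == binds tighter than & in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Operand type codes from bits 28-30 of operand word 0. */
enum OperandType {
    OPND_UNUSED    = 0,
    OPND_REG       = 1,
    OPND_PRED      = 2,
    OPND_UREG      = 3,
    OPND_ADDR      = 4,
    OPND_SYM_CONST = 5,
    OPND_GUARD     = 6,
    OPND_IMM       = 7
};

static inline unsigned operand_type(uint32_t word) {
    return (word >> 28) & 7u;
}

static inline bool is_reg_operand(uint32_t word)   { return operand_type(word) == OPND_REG; }
static inline bool is_pred_operand(uint32_t word)  { return operand_type(word) == OPND_PRED; }
static inline bool is_ureg_operand(uint32_t word)  { return operand_type(word) == OPND_UREG; }
static inline bool is_const_operand(uint32_t word) { return operand_type(word) == OPND_SYM_CONST; }
static inline bool is_imm_operand(uint32_t word)   { return operand_type(word) == OPND_IMM; }
```

Because the type field excludes bit 31, a negated register operand such as 0x90000005 still classifies as a register.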

Destination vs. Source Operand Split

Destinations come first in the operand array, followed by sources. The boundary is computed from the operand_count field and the modifier bits in the opcode:

uint32_t total_ops = *(uint32_t*)(instr + 80);
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2;  // 0 or 2
int first_src_index = total_ops - adj;             // or total_ops + ~adj + 1
// Destinations: operands[0 .. first_src_index-1]
// Sources:      operands[first_src_index .. total_ops-1]

For most instructions, adj = 0 and the split point equals operand_count. Instructions with bit 12 set in the opcode word (the bit that (opcode >> 11) & 2 tests) shift the split by 2, indicating 2 extra destination operands (e.g., predicated compare-and-swap operations that write both a result register and a predicate).

Predicate Guard Operand

The last operand (at index operand_count - 1) can be a predicate guard (type 6) controlling conditional execution. The guard predicate check in sub_7E0E80:

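bool has_pred_guard(const uint8_t *instr) {
    // Index of the last effective operand: operand_count - adj - 1 (x + ~y == x - y - 1)
    int last_idx = *(uint32_t*)(instr + 80) + ~((*(uint32_t*)(instr + 72) >> 11) & 2);
    uint32_t last_op = *(uint32_t*)(instr + 84 + 8 * last_idx);
    // Unsigned compare: true when the low nibble is in the range 2..8
    return ((last_op & 0xF) - 2) < 7;  // type bits in low nibble
}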

Instruction Flags and Modifiers

Opcode Modifier Bits (offset +72, bits 12-13)

Bits 12-13 of the opcode word encode sub-operation modifiers. The 0xFFFFCFFF mask strips them to yield the base opcode. Common uses:

| Modifier | Meaning |
| --- | --- |
| 0 | Default operation |
| 1 | .HI or alternate form |
| 2 | .WIDE or extended form |
| 3 | Reserved / architecture-specific |

Extended Flag Bits (offset +48)

The 64-bit flag word at offset +48 accumulates flags throughout the compilation pipeline:

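| Bit | Hex Mask | Flag | Set By |
| --- | --- | --- | --- |
| 6 | 0x40 | Live-out | sub_7E6090 (def-use builder) |
| 16 | 0x10000 | Has single def | sub_7E6090 |
| 25 | 0x2000000 | Has prior use | sub_7E6090 |
| 27 | 0x8000000 | Same-block def | sub_7E6090 |
| 33 | 0x200000000 | Source-only ref | sub_7E6090 |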
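The masks in this table can be written as 64-bit constants. This is a sketch with hypothetical macro and function names; the point it illustrates is that bit 33 lies above the 32-bit range, so the shift must be performed on a 64-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag masks for the 64-bit word at instruction offset +48. */
#define ORI_FLAG_LIVE_OUT        (UINT64_C(1) << 6)
#define ORI_FLAG_SINGLE_DEF      (UINT64_C(1) << 16)
#define ORI_FLAG_PRIOR_USE       (UINT64_C(1) << 25)
#define ORI_FLAG_SAME_BLOCK_DEF  (UINT64_C(1) << 27)
#define ORI_FLAG_SOURCE_ONLY_REF (UINT64_C(1) << 33)  /* needs a 64-bit shift */

/* True when any bit of `mask` is set in `flag_bits`. */
static inline bool ori_flag_test(uint64_t flag_bits, uint64_t mask) {
    return (flag_bits & mask) != 0;
}

/* Returns `flag_bits` with `mask` set, leaving other bits unchanged. */
static inline uint64_t ori_flag_set(uint64_t flag_bits, uint64_t mask) {
    return flag_bits | mask;
}
```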

Control Word (offset +32)

The control word encodes scheduling metadata added by the instruction scheduler. It is initialized to zero and populated during scheduling (phases ~150+):

  • Stall cycles (how many cycles to wait before issuing the next instruction)
  • Yield hint (whether the warp scheduler should yield after this instruction)
  • Dependency barrier assignments
  • Reuse flags (register reuse hints for the hardware register file cache)

The stall cycle field is checked during scoreboard computation at sub_A08910. The control word format is the same as the SASS encoding control field.

Data Type Flags (offset +100)

The byte at offset +100 encodes the instruction's data type in its low 3 bits:

uint8_t type_code = *(uint8_t*)(instr + 100) & 7;

These correspond to SASS data type suffixes (.F32, .F64, .U32, .S32, .F16, .B32, etc.). The exact encoding is architecture-specific and queried through the InstructionInfo descriptor table.

ROT13 Opcode Name Table

All SASS opcode mnemonic strings in the binary are ROT13-encoded. This is lightweight obfuscation, not a security measure. The InstructionInfo constructor at sub_BE7390 populates a name table at object offset +4184 with 16-byte {char* name, uint64_t length} entries.

Table Structure

InstructionInfo object:
  +0       vtable pointer (off_233ADC0)
  +8       parent pointer
  ...
  +4184    opcode_names[0].name_ptr    -> "REEONE"   (ROT13 of ERRBAR)
  +4192    opcode_names[0].length      -> 6
  +4200    opcode_names[1].name_ptr    -> "VZNQ"     (ROT13 of IMAD)
  +4208    opcode_names[1].length      -> 4
  ...
  +9320    opcode_names[321].name_ptr  -> "YNFG"     (ROT13 of LAST)
  +9328    opcode_names[321].length    -> 4
  +9336    encoding_category_map[0..321]  (322 x int32, from unk_22B2320)
  +10624   (end of encoding category map)

Total: 322 named opcodes (indices 0-321). The 0x508 bytes at +9336 are not additional name entries -- they are a 322-element int32 array mapping each opcode index to an encoding category number (see Encoding Category Map below).
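Decoding the table entries only requires rotating ASCII letters by 13 positions; digits and underscores pass through unchanged. The sketch below is a generic ROT13 routine operating on a {pointer, length} pair like the table entries; the function names are hypothetical and this is not the routine used inside ptxas.

```c
#include <stdbool.h>
#include <stddef.h>

/* Rotate a single ASCII letter by 13; other characters are unchanged. */
static char rot13_char(char c) {
    if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
    if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
    return c;
}

/* Decode `len` characters from `src` into `dst` and NUL-terminate.
 * `dst` must have room for len + 1 bytes. */
void rot13_decode(const char *src, size_t len, char *dst) {
    for (size_t i = 0; i < len; i++)
        dst[i] = rot13_char(src[i]);
    dst[len] = '\0';
}

/* Compare an encoded table entry against a plain mnemonic without
 * materializing the decoded string. */
bool rot13_equals(const char *encoded, size_t len, const char *plain) {
    for (size_t i = 0; i < len; i++) {
        if (plain[i] == '\0' || rot13_char(encoded[i]) != plain[i])
            return false;
    }
    return plain[len] == '\0';
}
```

Since ROT13 is its own inverse, the same routine also encodes a plain mnemonic for searching the binary's string data.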

Full Decoded Opcode Table (Base ISA, sm_70+)

| Idx | ROT13 | SASS | Category |
| --- | --- | --- | --- |
| 0 | REEONE | ERRBAR | Error barrier (internal) |
| 1 | VZNQ | IMAD | Integer multiply-add |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add wide |
| 3 | VNQQ3 | IADD3 | 3-input integer add |
| 4 | OZFX | BMSK | Bit mask |
| 5 | FTKG | SGXT | Sign extend |
| 6 | YBC3 | LOP3 | 3-input logic |
| 7 | VFRGC | ISETP | Integer set-predicate |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address |
| 10 | FUS | SHF | Funnel shift |
| 11 | SSZN | FFMA | FP fused multiply-add |
| 12 | SNQQ | FADD | FP add |
| 13 | SZHY | FMUL | FP multiply |
| 14 | SZAZK | FMNMX | FP min/max |
| 15 | SFJMNQQ | FSWZADD | FP swizzle add |
| 16 | SFRG | FSET | FP set |
| 17 | SFRY | FSEL | FP select |
| 18 | SFRGC | FSETP | FP set-predicate |
| 19 | ZBI | MOV | Move |
| 20 | FRY | SEL | Select |
| 21 | C2E | P2R | Predicate to register |
| 22 | E2C | R2P | Register to predicate |
| 23 | CYBC3 | PLOP3 | Predicate 3-input logic |
| 24 | CEZG | PRMT | Byte permute |
| 25 | ABC | NOP | No-op |
| 26 | IBGR | VOTE | Warp vote |
| 27 | PF2E_32 | CS2R_32 | Control/status to register (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status to register (64-bit) |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | CFZGRFG | PSMTEST | PSM test |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference (4-way) |
| 33 | VQC | IDP | Integer dot product |
| 34 | VQR | IDE | Integer dot expand |
| 35 | V2V | I2I | Integer to integer conversion |
| 36 | V2VC | I2IP | Integer to integer (packed) |
| 37 | VZAZK | IMNMX | Integer min/max |
| 38 | CBCP | POPC | Population count |
| 39 | SYB | FLO | Find leading one |
| 40 | SPUX | FCHK | FP check (NaN/Inf) |
| 41 | VCN | IPA | Interpolate attribute |
| 42 | ZHSH | MUFU | Multi-function unit (SFU) |
| 43 | S2S | F2F | Float to float conversion |
| 44 | S2S_K | F2F_X | Float to float (extended) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer (extended) |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float (extended) |
| 49 | SEAQ | FRND | FP round |
| 50 | SEAQ_K | FRND_X | FP round (extended) |
| 51 | NY2C | AL2P | Attribute to patch |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch (indexed) |
| 53 | OERI | BREV | Bit reverse |
| 54 | OZBI_O | BMOV_B | Barrier move (B) |
| 55 | OZBI_E | BMOV_R | Barrier move (R) |
| 56 | OZBI | BMOV | Barrier move |
| 57 | F2E | S2R | Special register to register |
| 58 | O2E | B2R | Barrier to register |
| 59 | E2O | R2B | Register to barrier |
| 60 | YRCP | LEPC | Load effective PC |
| 61 | ONE | BAR | Barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier (indexed) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base |
| 66 | QRCONE | DEPBAR | Dependency barrier |
| 67 | OEN | BRA | Branch |
| 68 | OEK | BRX | Branch indirect |
| 69 | WZC | JMP | Jump |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return |
| 73 | OFFL | BSSY | Branch sync stack push |
| 74 | OERNX | BREAK | Break |
| 75 | OCG | BPT | Breakpoint trap |
| 76 | XVYY | KILL | Kill thread |
| 77 | RKVG | EXIT | Exit |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync |
| 80 | ZNGPU | MATCH | Warp match |
| 81 | ANABFYRRC | NANOSLEEP | Nanosleep |
| 82 | ANABGENC | NANOTRAP | Nano trap |
| 83 | GRK | TEX | Texture fetch |
| 84 | GYQ | TLD | Texture load |
| 85 | GYQ4 | TLD4 | Texture load 4 |
| 86 | GZZY | TMML | Texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with derivatives |
| 88 | GKD | TXQ | Texture query |
| 89 | YQP | LDC | Load constant |
| 90 | NYQ | ALD | Attribute load |
| 91 | NFG | AST | Attribute store |
| 92 | BHG | OUT | Tessellation output |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output (final) |
| 94 | YQF | LDS | Load shared |
| 95 | FGF | STS | Store shared |
| 96 | YQT | LDG | Load global |
| 97 | FGT | STG | Store global |
| 98 | YQY | LDL | Load local |
| 99 | FGY | STL | Store local |
| 100 | YQ | LD | Load (generic) |
| 101 | FG | ST | Store (generic) |
| 102 | NGBZ | ATOM | Atomic |
| 103 | NGBZT | ATOMG | Atomic global |
| 104 | ERQ | RED | Reduction |
| 105 | NGBZF | ATOMS | Atomic shared |
| 106 | DFCP | QSPC | Query space |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control (no scoreboard) |
| 108 | PPGY | CCTL | Cache control |
| 109 | PPGYY | CCTLL | Cache control (L2) |
| 110 | PPGYG | CCTLT | Cache control (texture) |
| 111 | ZRZONE | MEMBAR | Memory barrier |
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |
| 116 | CVKYQ | PIXLD | Pixel load |
| 117 | VFOREQ | ISBERD | Indexed set binding for redirect |
| 118 | VFORJE | ISBEWR | Indexed set binding for write |
| 119 | FUSY | SHFL | Warp shuffle |
| 120 | JNECFLAP | WARPSYNC | Warp synchronize |
| 121 | ZVRYQ | MYELD | Yield (internal) |
| 122 | QSZN | DFMA | Double FP fused multiply-add |
| 123 | QNQQ | DADD | Double FP add |
| 124 | QZHY | DMUL | Double FP multiply |
| 125 | QFRGC | DSETP | Double FP set-predicate |
| 126 | UNQQ2 | HADD2 | Half-precision add (packed) |
| 127 | UNQQ2_S32 | HADD2_F32 | Half-precision add (F32 accum) |
| 128 | USZN2 | HFMA2 | Half FP fused multiply-add (packed) |
| 129 | UZHY2 | HMUL2 | Half-precision multiply (packed) |
| 130 | UFRG2 | HSET2 | Half-precision set (packed) |
| 131 | UFRGC2 | HSETP2 | Half-precision set-predicate (packed) |
| 132 | UZZN_16 | HMMA_16 | Half MMA (16-wide) |
| 133 | UZZN_32 | HMMA_32 | Half MMA (32-wide) |
| 134 | VZZN | IMMA | Integer MMA |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo) |

Opcode Categories

The 322 named opcodes group into these functional categories:

Integer ALU (18 opcodes): IMAD, IMAD_WIDE, IADD3, IADD, IMNMX, IABS, BMSK, SGXT, LOP3, ISETP, LEA, SHF, POPC, FLO, BREV, IDP, IDE, PRMT

FP32 ALU (9 opcodes): FFMA, FADD, FMUL, FMNMX, FSWZADD, FSET, FSEL, FSETP, FCHK

FP64 ALU (4 opcodes): DFMA, DADD, DMUL, DSETP

FP16 Packed (6 opcodes): HADD2, HADD2_F32, HFMA2, HMUL2, HSET2, HSETP2

Conversion (12 opcodes): F2F, F2I, I2F, I2I, F2FP, F2IP, I2FP, I2IP, FRND, and their _X extended variants

Data Movement (6 opcodes): MOV, UMOV, MOVM, SEL, USEL, PRMT

Special Function (1 opcode): MUFU (sin, cos, rsqrt, rcp, etc.)

Predicate (4 opcodes): PLOP3, P2R, R2P, VOTE

Memory -- Global (4 opcodes): LDG, STG, LD, ST

Memory -- Shared (4 opcodes): LDS, STS, LDSM, STSM

Memory -- Local (2 opcodes): LDL, STL

Memory -- Constant (2 opcodes): LDC, LDCU

Atomic/Reduction (6 opcodes): ATOM, ATOMG, ATOMS, RED, REDUX, REDAS

Texture (6 opcodes): TEX, TLD, TLD4, TMML, TXD, TXQ

Surface (4 opcodes): SULD, SUST, SUATOM, SURED

Control Flow (12 opcodes): BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, KILL, BPT

Synchronization (6 opcodes): BAR, BAR_INDEXED, DEPBAR, MEMBAR, WARPSYNC, NANOSLEEP

Tensor Core / MMA (25+ opcodes): HMMA_*, IMMA_*, BMMA_*, DMMA, GMMA, QMMA_*, OMMA_*, and their sparse (_SP_) variants

Uniform Register (30+ opcodes): All U-prefixed variants (UIMAD, UIADD3, UMOV, USEL, ULOP3, ULEPC, etc.) that operate on uniform registers shared across the warp

Blackwell sm_100+ (28 opcodes): ACQBLK, CGABAR_*, CREATEPOLICY, ELECT, ENDCOLLECTIVE, FENCE_G/S/T, LDTM, STTM, MEMSET, ACQSHMINIT, UTCBAR_*, UTCMMA_*, UTCSHIFT_*, UTCCP_*, TCATOMSWS, TCLDSWS, TCSTSWS, VIRTCOUNT, UGETNEXTWORKID, FADD2, FFMA2, FMUL2, FMNMX3, CREDUX, QFMA4, QADD4, QMUL4, WARPGROUP

Instruction Descriptor Table

The InstructionInfo class at sub_BE7390 (inheriting from the base class at sub_738E20) provides a per-opcode descriptor table consulted by every pass in the compiler. The derived constructor calls the base class constructor sub_738E20, then populates the ROT13 name table, allocates the per-opcode descriptor block, and queries SM-specific configuration knobs. The resulting object is ~11,240 bytes inline plus a 10,288-byte dynamically allocated descriptor block.

Construction Sequence

sub_BE7390(this, parent_context) executes in this order:

  1. Base class init (sub_738E20): sets vtable, stores parent pointer, allocates the opcode-to-descriptor mapping array (512 bytes, 64 QWORD slots), zeroes all four descriptor data areas (+744..+3624), queries SM version and stores at +3728, allocates per-opcode property array (4 * sm_opcode_count bytes at +4112), allocates a reference-counted descriptor block (24 bytes at +4136), queries knobs 812/867/822/493 for configuration. Sets +4132 = 8 and +4176 = 0 (init incomplete).
  2. Override vtable: +0 = off_233ADC0 (derived vtable).
  3. Populate ROT13 name table: 322 inline entries (indices 0-321) at offsets +4184..+9328, each 16 bytes ({char* name_ptr, u64 length}).
  4. Bulk-copy encoding category map: qmemcpy(+9336, unk_22B2320, 0x508) -- 322-entry int32 array (1288 bytes) mapping opcode index to encoding category number. The source table varies by arch constructor (see below).
  5. Initialize post-table fields: zero offsets +10624..+10680.
  6. Store sentinels: +11200 = -2, +11224 = 0xFFFFFFFF.
  7. Set constants: +4048 = 2, +4056 = 10, +3733 = 1.
  8. Descriptor defaults (sub_1370BD0): populates scheduling templates and operand defaults at +192..+704.
  9. Override property mode: +4132 = 7 (overwriting base class's 8).
  10. Allocate descriptor block: 10,288 bytes via the MemoryManager, partitioned into 3 sections.
  11. Query SM-specific config: reads parent->+1664->+72->+55080 and stores result at +10648.

InstructionInfo Object Layout

The complete byte-level field map, derived from sub_BE7390 (derived constructor), sub_738E20 (base constructor), and sub_1370BD0 (descriptor defaults init).

Region 1: Vtable, Parent, and Core Identity (+0 to +91)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +0 | 8 | ptr | vtable | off_233ADC0 (derived); base chain: off_21DB6E8 / off_21B4790 |
| +8 | 8 | ptr | parent_ctx | Parent compilation context pointer |
| +44 | 8 | u64 | operand_counts | Packed pair 0x100000001: lo=1 dst, hi=1 src (base default) |

Region 2: Scheduling Defaults and Flags (+92 to +159)

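| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +92 | 16 | xmm | sched_defaults | Scheduling parameter defaults (loaded from xmmword_2029FE0) |
| +108 | 4 | i32 | desc_idx_a | Descriptor index sentinel = 0 |
| +112 | 4 | i32 | desc_idx_b | Descriptor index sentinel = -1 (0xFFFFFFFF) |
| +116 | 1 | u8 | flag_116 | = 0 |
| +117 | 1 | u8 | flag_117 | = 0 |
| +118 | 1 | u8 | flag_118 | = 1 |
| +120 | 3 | u8[3] | flags_120 | All = 0 |
| +136 | 4 | i32 | sentinel_136 | = -1 (0xFFFFFFFF) |
| +148 | 8 | u64 | reserved_148 | = 0 |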

Region 3: Opcode-to-Descriptor Mapping (+160 to +191)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +160 | 8 | ptr | mapping_allocator | MemoryManager used for mapping array |
| +168 | 8 | ptr | mapping_array | Dynamically allocated QWORD array (initial: 512 bytes, 64 entries) |
| +176 | 4 | i32 | mapping_count | Current entry count (initially 63) |
| +180 | 4 | i32 | mapping_capacity | Current capacity (initially 64) |
| +184 | 8 | u64 | packed_flags | = 0x4000000000 (bit 38: descriptor config flag) |

Region 4: Descriptor Defaults (+192 to +704, set by sub_1370BD0)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +192 | 8 | u64 | default_operand_cfg | Packed 0x200000002: lo=2, hi=2 |
| +200 | 4 | u32 | default_dst_count | = 4 |
| +208 | 4 | u32 | default_modifier | = 2 |
| +216 | 16 | xmm | sched_template_a | Scheduling template (from xmmword_233B1E0) |
| +240 | 4 | u32 | default_operand_w | = 4 |
| +448 | 8 | u64 | section_marker_448 | = 1 |
| +456 | 4 | u32 | section_id_456 | = 2 |
| +464 | 4 | u32 | section_id_464 | = 3 |
| +472 | 16 | xmm | sched_template_b | Scheduling template (from xmmword_233B1F0) |
| +496 | 4 | u32 | default_value_496 | = 5 |

Gaps within +204..+447 and +500..+695 are zero-initialized by sub_1370BD0.

Region 5: Primary Descriptor Data (+744 to +2155)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +744 | 8 | u64 | desc_data_start | Primary area header = 0 |
| +752..+2155 | 1404 | u8[] | desc_data | Zero-initialized per-opcode descriptor records |

Region 6: Secondary Descriptor Area (+2156 to +2211)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +2156 | 8 | u64 | secondary_header | = 0 |
| +2164..+2211 | 48 | u8[] | secondary_data | Zero-initialized |

Region 7: Tertiary Descriptor Area (+2212 to +3623)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +2212 | 8 | u64 | tertiary_header | = 0 |
| +2220..+3623 | 1404 | u8[] | tertiary_data | Zero-initialized |
| +2372 | 4 | u32 | desc_record_type_a | = 4 (set by derived constructor) |
| +2400 | 4 | u32 | desc_record_type_b | = 4 (set by derived constructor) |

Region 8: Quaternary Descriptor Area and Target Config (+3624 to +3735)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +3624 | 8 | u64 | quaternary_header | = 0 |
| +3640..+3664 | 32 | u64[4] | quat_ptrs | All = 0 |
| +3672 | 1 | u8 | is_sm75_plus | = 1 if SM ID >= 16389, else 0 |
| +3673 | 1 | u8 | target_flag_bit6 | Bit 6 of *(target+1080) |
| +3674 | 1 | u8 | target_flag_bit7 | Bit 7 of *(target+1080) |
| +3675..+3682 | 8 | u8[8] | zero_pad | All = 0 |
| +3684 | 32 | u128[2] | zero_pad_3684 | = 0 |
| +3716..+3717 | 2 | u8[2] | flags_3716 | = 0 |
| +3720 | 4 | u32 | value_3720 | = 0 |
| +3724 | 1 | u8 | flag_3724 | = 1 |
| +3725 | 1 | u8 | flag_3725 | = 0 |
| +3728 | 4 | u32 | sm_opcode_count | SM version / total opcode count from arch query |
| +3732 | 1 | u8 | knob_812_flag | Knob 812 derived flag |
| +3733 | 1 | u8 | derived_flag | = 1 (set by derived constructor; base leaves at 0) |

Region 9: Scheduling Configuration (+4016 to +4111)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +4016 | 16 | u128 | sched_config_a | = 0 |
| +4032 | 8 | u64 | sched_config_b | = 0 |
| +4040 | 16 | xmm | sched_constants | Loaded from xmmword_21B4EE0 |
| +4048 | 4 | u32 | constant_2 | = 2 (derived overrides base default 0) |
| +4056 | 4 | u32 | constant_10 | = 10 (derived overrides base default 0x7FFFFFFF) |
| +4060..+4064 | 8 | u32[2] | zero_pad | = 0 |
| +4072 | 8 | u64 | sched_ptr | = 0 |
| +4080 | 8 | u64 | sched_ext | = 0 |
| +4088 | 1 | u8 | flag_4088 | = 0 |
| +4089 | 1 | u8 | knob_867_flag | = 1 if knob absent; = (knob_value == 1) otherwise |
| +4090 | 1 | u8 | flag_4090 | = 0 |
| +4092 | 4 | u32 | knob_822_value | Default 7; overridden by knob 822 |
| +4096 | 4 | u32 | knob_493_value | Default 5; overridden by knob 493 |

Region 10: Per-Opcode Property Array (+4112 to +4183)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +4112 | 8 | ptr | property_array | Allocated: 4 * sm_opcode_count bytes; 4 bytes per opcode |
| +4120 | 4 | u32 | property_count | = 4 * !hasExtendedPredicates (0 or 4) |
| +4124 | 4 | u32 | property_aux | = 0 |
| +4128 | 1 | u8 | property_init_flag | = 1 |
| +4132 | 4 | u32 | property_mode | Base sets 8, derived overwrites to 7 |
| +4136 | 8 | ptr | ref_counted_block | 24-byte block: [refcount=2, data=0, allocator_ptr] |
| +4144..+4160 | 24 | u64[3] | rc_aux | All = 0 |
| +4176 | 1 | u8 | init_complete | = 0 initially; set to 1 after full initialization |

Region 11: ROT13 Opcode Name Table (+4184 to +10623)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +4184 | 5152 | struct[322] | opcode_names[0..321] | 322 inline entries, each 16 bytes: {char* name, u64 len} |
| +9336 | 1288 | int32[322] | encoding_category_map[0..321] | Per-opcode encoding category; bulk-copied from arch-specific static table (see below) |

Total: 322 named opcodes. Index N name is at offset 4184 + 16*N. The getName accessor at sub_BEBAC0 computes this + 4184 + 16 * opcode directly. Encoding category for opcode N is at +9336 + 4*N.
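The offset arithmetic for this region can be verified with a few constants. The sketch below encodes the documented bases (+4184 for names, +9336 for categories) and checks that the two arrays are contiguous; the constant and function names are hypothetical, and read_encoding_category assumes `info` points at a buffer laid out like the InstructionInfo object.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    OPCODE_COUNT     = 322,
    NAME_TABLE_OFF   = 4184,   /* opcode_names[0]          */
    NAME_ENTRY_SIZE  = 16,     /* {char* name, u64 length} */
    CATEGORY_MAP_OFF = 9336,   /* encoding_category_map[0] */
    CATEGORY_MAP_END = 10624
};

/* The name table ends exactly where the category map begins. */
_Static_assert(NAME_TABLE_OFF + NAME_ENTRY_SIZE * OPCODE_COUNT == CATEGORY_MAP_OFF,
               "name table and category map are contiguous");
_Static_assert(CATEGORY_MAP_OFF + 4 * OPCODE_COUNT == CATEGORY_MAP_END,
               "category map is 322 int32 entries");

size_t name_entry_offset(unsigned opcode) {
    return NAME_TABLE_OFF + (size_t)NAME_ENTRY_SIZE * opcode;
}

size_t category_offset(unsigned opcode) {
    return CATEGORY_MAP_OFF + (size_t)4 * opcode;
}

/* Read the encoding category for `opcode`, or -1 if out of range. */
int32_t read_encoding_category(const uint8_t *info, unsigned opcode) {
    int32_t cat = -1;
    if (opcode < (unsigned)OPCODE_COUNT)
        memcpy(&cat, info + category_offset(opcode), sizeof cat);
    return cat;
}
```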

Encoding Category Map

The 1288-byte block at +9336 is a 322-element int32 array that maps each opcode index to an encoding category number. The SASS mnemonic lookup function (sub_1377C60) uses this to resolve a (mnemonic, arch) pair to a binary encoding format descriptor.

Arch-specific source tables:

| Constructor | Source Table | Content |
| --- | --- | --- |
| sub_7A5D10 (base) | unk_21C0E00 | Identity map: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped: some entries differ from identity |
| sub_BE7390 | unk_22B2320 | Arch-remapped: some entries differ from identity |

The base constructor uses a pure identity map where opcode N maps to encoding category N. Arch-specific constructors override selected entries so the same mnemonic at different opcode indices can map to different encoding formats. For example, DMMA at opcode index 180 maps to encoding category 434 on one arch, while DMMA at opcode index 215 maps to encoding category 515 on another.

Reader: sub_1377C60 (SASS mnemonic lookup)

// After matching mnemonic string v11 to opcode index v18 via ROT13 comparison:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336);  // encoding_category_map[v18]
// v84 is then FNV-1a hashed together with arch discriminator v16,
// and looked up in the hash table at *(a1 + 10672) to find the
// encoding format descriptor for this (category, arch) pair.

The hash table at +10672 stores entries of the form {encoding_category, arch_code, format_value}, keyed by FNV-1a of (encoding_category, arch_discriminator). This is the central mechanism that maps a SASS mnemonic string plus target architecture to the correct binary encoding format.
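For readers unfamiliar with FNV-1a, the sketch below shows the standard 32-bit variant applied to a (category, arch) pair. The hash width, the byte order, and the order in which the two values are fed into the hash are assumptions made for illustration; the source only establishes that FNV-1a is used, not these details. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a over a byte buffer. */
uint32_t fnv1a32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                       /* xor first ...     */
        h *= 0x01000193u;                /* ... then multiply */
    }
    return h;
}

/* Hypothetical key layout: category then arch discriminator,
 * four little-endian bytes each. */
uint32_t encoding_key_hash(uint32_t category, uint32_t arch) {
    uint8_t key[8];
    for (int i = 0; i < 4; i++) {
        key[i]     = (uint8_t)(category >> (8 * i));
        key[4 + i] = (uint8_t)(arch >> (8 * i));
    }
    return fnv1a32(key, sizeof key);
}
```

Under these assumptions, two categories such as 434 and 515 hash to different buckets for the same arch discriminator, which is what lets one mnemonic resolve to different encoding formats.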

Region 12: Descriptor Block Control (+10624 to +10687)

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +10624 | 8 | u64 | block_ctrl_a | = 0 |
| +10632 | 8 | u64 | block_ctrl_b | = 0 |
| +10648 | 4 | u32 | arch_config | SM-specific config from target+55080/55088 |
| +10656 | 8 | ptr | descriptor_block | Pointer to allocated 10,288-byte per-opcode descriptor block |
| +10664 | 8 | ptr | block_allocator | MemoryManager that allocated the descriptor block |
| +10672 | 8 | ptr | encoding_lookup_table | Hash table for (encoding_category, arch) -> format descriptor lookup; read by sub_1377C60 |
| +10680 | 8 | u64 | block_aux_b | = 0 |

Region 13: Sentinels and Architecture Handler (+11200 to +11240)

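| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +11200 | 4 | i32 | sentinel | = -2 (0xFFFFFFFE) |
| +11208 | 8 | ptr | arch_handler | = parent_ctx->+16 (MemoryManager) |
| +11216 | 8 | u64 | zero_11216 | = 0 |
| +11224 | 8 | u64 | sentinel_11224 | = 0xFFFFFFFF |
| +11232 | 1 | u8 | flag_11232 | = 0 |
| +11236 | 4 | u32 | zero_11236 | = 0 |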

Per-Opcode Descriptor Block (10,288 bytes)

Allocated by the derived constructor and stored at +10656. The block is 10288 / 8 = 1286 QWORD entries, partitioned into three sections:

+--------------------+  block + 0
| Section 0 header   |  QWORD[0] = 0
+--------------------+  block + 8
| Section 0 payload  |  QWORD[1..640]  = all zero (memset)
| (640 slots)        |  Per-opcode descriptors for opcodes 0..639
+--------------------+  block + 5128
| Section 1 header   |  QWORD[641] = 0
+--------------------+  block + 5136
| Section 1 payload  |  QWORD[642..1283]  (NOT explicitly zeroed)
| (642 slots)        |  Modifier-variant descriptors (opcode | 0x1000, etc.)
+--------------------+  block + 10272
| Section 2 (16B)    |  QWORD[1284] = parent_ctx  (back-pointer)
|                    |  QWORD[1285] = instr_info   (self back-pointer)
+--------------------+  block + 10288

Section 0 (5,128 bytes): 641 QWORD slots. Only the payload (slots 1..640, 5,120 bytes) is explicitly zeroed. Each slot corresponds to a base opcode index. With 322 named opcodes, ~318 slots remain spare.

Section 1 (5,144 bytes): 643 QWORD slots. The header is zeroed but the payload is NOT explicitly zeroed -- it relies on the arena allocator's default behavior or lazy initialization during opcode registration. Likely stores modifier-variant descriptors (e.g., entries for opcode | 0x1000 when bits 12-13 carry sub-operation modifiers).

Section 2 (16 bytes): Two back-pointers for navigating from the descriptor block back to its owning objects (parent compilation context and the InstructionInfo instance).
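The section boundaries follow directly from the QWORD indices. The sketch below checks the byte offsets shown in the diagram at compile time. The mapping of base opcode N to payload slot N of section 0 is an assumption drawn from the diagram's "opcodes 0..639" annotation, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    QWORD_SIZE         = 8,
    SEC0_HEADER_QWORD  = 0,
    SEC0_PAYLOAD_QWORD = 1,     /* QWORD[1..640]               */
    SEC0_PAYLOAD_SLOTS = 640,
    SEC1_HEADER_QWORD  = 641,
    SEC1_PAYLOAD_QWORD = 642,   /* QWORD[642..1283]            */
    SEC2_QWORD         = 1284,  /* parent_ctx, then instr_info */
    BLOCK_QWORDS       = 1286
};

static_assert(SEC1_HEADER_QWORD * QWORD_SIZE == 5128, "section 1 header at +5128");
static_assert(SEC1_PAYLOAD_QWORD * QWORD_SIZE == 5136, "section 1 payload at +5136");
static_assert(SEC2_QWORD * QWORD_SIZE == 10272, "section 2 at +10272");
static_assert(BLOCK_QWORDS * QWORD_SIZE == 10288, "block is 10,288 bytes");

/* Byte offset of the section 0 slot for `base_opcode` (assumed mapping:
 * opcode N -> payload slot N). Returns (size_t)-1 when out of range. */
size_t section0_slot_offset(unsigned base_opcode) {
    if (base_opcode >= (unsigned)SEC0_PAYLOAD_SLOTS)
        return (size_t)-1;
    return ((size_t)SEC0_PAYLOAD_QWORD + base_opcode) * QWORD_SIZE;
}
```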

Architecture-Specific Sub-Tables (sub_896D50, 26,888 bytes)

The architecture-specific extended property object is NOT stored inside InstructionInfo. It is lazily allocated by sub_7A4650, which gates on target+372 == 0x8000 (sm_80 / Ampere targets). The allocation is 26,888 bytes, constructed by sub_896D50(block, parent_context).

sub_896D50 Object Layout

| Offset | Size | Type | Field | Description |
| --- | --- | --- | --- | --- |
| +0 | 8 | ptr | vtable | off_21DADF8 |
| +8 | 8 | ptr | parent_ctx | From construction parameter |
| +40 | 8 | ptr | allocator_base | MemoryManager from parent->+16 |

Property Array A (at sub-object +56):

| Sub-offset | Type | Description |
| --- | --- | --- |
| +56 | ptr | Array pointer: 64 bytes per entry, 772 entries (49,408 bytes allocated) |
| +64 | i32 | Count = 771 |
| +68 | i32 | Capacity = 772 |

Each 64-byte entry: bytes [0..11] initialized to 0xFF (pipeline-unassigned sentinel), bytes [12..63] zeroed. Stores latency, throughput, port mask, and register class requirements per opcode.

Property Array B (at sub-object +80):

| Sub-offset | Type | Description |
| --- | --- | --- |
| +80 | ptr | Array pointer: 36 bytes per entry, 772 entries (27,792 bytes allocated) |
| +88 | i32 | Count = 771 |
| +92 | i32 | Capacity = 772 |

Each 36-byte entry: all zeroed. Stores encoding class, format identifiers, operand encoding rules.

Property Array C (at sub-object +176):

| Sub-offset | Type | Description |
| --- | --- | --- |
| +176 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +184 | i32 | Count = 34 |
| +188 | i32 | Capacity = 35 |

Each 16-byte entry: zeroed. Stores functional unit properties for major FU categories.

Property Array D (at sub-object +200):

| Sub-offset | Type | Description |
| --- | --- | --- |
| +200 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +208 | i32 | Count = 34 |

Parallel table for alternate functional unit configurations.

Dimension Table (at sub-object +472):

| Sub-offset | Type | Description |
| --- | --- | --- |
| +472 | ptr | 168-byte block: [count=40, entries[0..39]], 4 bytes per entry, zero-initialized |

Alphabetical SASS Name Table (at sub-object +11360):

Starting at offset +11360, sub_896D50 populates an alphabetically sorted ROT13 name table using the same {char*, u64} format. Unlike the InstructionInfo name table (indexed by opcode), this table is sorted by decoded mnemonic name and includes modifier variants:

  • OZZN.168128 (BMMA.168128)
  • PPGY.P.YQP.VINYY (CCTL.C.LDC.IVALL)
  • VZNQ.JVQR.ERNQ.NO (IMAD.WIDE.READ.AB)
  • VZZN.FC.{168128.*|16864.*8.*8} (IMMA.SP.{...} -- regex patterns for variant matching)

This table is used for SASS assembly parsing and opcode-to-encoding resolution, where a single base opcode may map to multiple encoding variants distinguished by modifier suffixes.
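The ROT13 names in this table can be decoded with a few lines of C. This is a generic ROT13 sketch, not a reconstruction of any routine in ptxas; the function name `rot13_in_place` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Decode (or encode; ROT13 is its own inverse) a string in place.
   Digits, dots, and regex punctuation such as '{', '|', '*' are left
   unchanged, which is why suffixes like ".168128" survive as-is. */
static inline void rot13_in_place(char *s) {
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + 13) % 26);
    }
}
```

Applied to a writable copy of "PPGY.P.YQP.VINYY", this produces "CCTL.C.LDC.IVALL", matching the decoded form listed above.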

Knob-derived fields:

| Sub-offset | Type | Source |
| --- | --- | --- |
| +108 | i32 | Knob 803 value (instruction scheduling latency override) |
| +468 | u8 | = 0 |
| +469 | u8 | = 1 |
| +470 | u8 | = 1 |

Accessor Stubs

40+ tiny vtable accessor stubs at 0x859F80-0x85A5F0 and 0x868500-0x869700 provide virtual dispatch access to per-opcode properties. Typical pattern:

// latency_offset stands for the field's byte offset inside the 64-byte entry
int getLatency(ArchSpecificInfo* self, int opcode) {
    return *(int*)((char*)self->property_array_a + 64 * opcode + latency_offset);
}
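To make the stub pattern concrete, here is a minimal sketch of a Property Array A entry being initialized and then read. It models the array as raw bytes; the helper names and the `field_offset` parameter are hypothetical, since the individual field offsets inside the 64-byte entry are not documented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { PROP_A_ENTRY_BYTES = 64, PROP_A_SENTINEL_BYTES = 12 };

/* Initialize one entry as described for Property Array A:
   bytes [0..11] = 0xFF (pipeline-unassigned sentinel), bytes [12..63] = 0. */
static inline void prop_a_init_entry(uint8_t *entry) {
    assert(entry != NULL);
    memset(entry, 0xFF, PROP_A_SENTINEL_BYTES);
    memset(entry + PROP_A_SENTINEL_BYTES, 0,
           PROP_A_ENTRY_BYTES - PROP_A_SENTINEL_BYTES);
}

/* Read a 32-bit field from the entry for `opcode`.
   field_offset is a placeholder: real per-field offsets are unknown. */
static inline int32_t prop_a_read_i32(const uint8_t *array, int opcode,
                                      size_t field_offset) {
    int32_t value;
    assert(array != NULL && opcode >= 0);
    assert(field_offset + sizeof value <= PROP_A_ENTRY_BYTES);
    memcpy(&value,
           array + (size_t)PROP_A_ENTRY_BYTES * (size_t)opcode + field_offset,
           sizeof value);
    return value;
}
```

Reading through `memcpy` avoids unaligned or aliasing-unsafe loads, which the original stubs do not need to worry about because they are compiled machine code.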

PTX Text-Generation Operand Accessor API

The PTX text generation subsystem (instruction pretty-printer, dispatcher at sub_5D4190) converts Ori IR instructions into PTX assembly text. The ~580 formatter functions at 0x4DA340-0x5A9FFF query a PTX instruction context object through a stable API of 48 small accessor helpers concentrated at 0x707000-0x710FFF.

PTX Instruction Context Object

The accessor functions do NOT operate on the 296-byte Ori IR instruction directly. They take a PTX instruction context object that contains pre-decoded fields for text generation. The object is at least 2,496 bytes, since the highest mapped field is an 8-byte pointer at +2488. The raw Ori instruction is accessible at *(context + 1096). Each formatter receives this context as argument a1 and a pool allocator table as argument a2.

Partial field map of the PTX instruction context (offsets used by accessors):

| Offset | Size | Type | Field | Accessed By |
| --- | --- | --- | --- | --- |
| +544 | 8 | ptr | predicate_ptr | has_predicate, get_opcode_string |
| +564 | 4 | u32 | saturation_code | get_saturation_mode (== 12 means saturate) |
| +596 | 4 | u32 | field_operand_count | get_field_a..get_field_d |
| +600 | 1 | u8 | flag_byte_a | Bit 0: precision, bit 6: addressing, bit 7: addr_mode |
| +604 | 1 | u8 | rounding_mode | Bits 0-2: rounding mode code (3 bits) |
| +605 | 1 | u8 | scale_byte | Bits 4-7: scale code (4 bits, 16 entries) |
| +609 | 1 | u8 | base_addr_byte | Bits 2-3: base address mode (2 bits, 4 entries) |
| +611 | 1 | u8 | param_flags | Bits 4-5: parameter variant selector |
| +615 | 1 | u8 | ftz_byte | Bits 6-7: FTZ flag code (2 bits, 4 entries) |
| +620 | 1 | u8 | variant_index | Variant string lookup index (8 bits, 256 entries) |
| +627 | 1 | u8 | flag_byte_b | Bits 0-1: extended_op, 2-3: flag_b, 4-5: modifier/variant |
| +640 | 4 | i32 | precision_code | Index into precision string table |
| +648 | var | ptr[] | operand_names | Per-operand name string pointer array (8B per slot) |
| +800 | 4 | u32 | operand_count | Number of operands for comparison/count accessors |
| +816 | var | ptr[] | reg_operands | Register operand pointer array (8B per slot) |
| +944 | var | u32[] | operand_types | Per-operand type code array (4B per slot) |
| +1024 | var | ptr[] | src_part0 | Source part 0 pointer array (8B per slot) |
| +1264 | var | ptr[] | src_part1 | Source part 1 pointer array (8B per slot) |
| +1504 | var | ptr[] | data_types_0 | Data type array, part 0 (8B per slot) |
| +1744 | var | ptr[] | data_types_1 | Data type array, part 1 (8B per slot) |
| +1984 | var | u32[] | target_sm | Target SM version array (4B per slot) |
| +2120 | 8 | ptr | opcode_name | Opcode mnemonic string pointer |
| +2488 | 8 | ptr | string_intern | String interning table for modifier deduplication |

Accessor Catalog

Tier 1: Core Accessors (>200 callers)

Used by nearly every formatter function. These are the fundamental building blocks of PTX text generation.

| Address | Name | Size | Callers | Signature | Logic |
| --- | --- | --- | --- | --- | --- |
| sub_710860 | getDataType | 39B | 2953 | (ctx, idx, part) -> u8 | part ? **(ctx+1744+8*idx) & 0x3F : **(ctx+1504+8*idx) & 0x3F |
| sub_70B910 | getSrcPart0 | 12B | 1656 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1024) |
| sub_70B8E0 | getRegOperand | 12B | 1449 | (ctx, idx) -> ptr | *(ctx + 8*idx + 816) |
| sub_70B920 | getSrcPart1 | 12B | 1296 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1264) |
| sub_70B700 | hasPredicate | 14B | 946 | (ctx) -> bool | *(ctx + 544) != 0 |
| sub_70B780 | getPredicateName | 151B | 514 | (ctx, pool) -> str | Allocates "@" + opcode_name; inserts "!" if negated |
| sub_70CA60 | getOperandType | 11B | 480 | (ctx, idx) -> u32 | *(ctx + 4*idx + 944) |
| sub_70B710 | getOpcodeString | 111B | 348 | (ctx, pool) -> str | Allocates "@" + *(ctx+2120) from arena pool |
| sub_70FA00 | getTargetSM | 10B | 286 | (ctx, idx) -> u32 | *(ctx + 4*idx + 1984) |
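The Tier 1 accessors are thin offset reads, which the following sketch models over a raw byte buffer. The C function names are illustrative stand-ins for the sub_* routines, and the double dereference in `get_data_type` follows the `**(ctx+...)` logic in the table; the sketch assumes the stored pointers have the host's native pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load a pointer stored at byte offset `off` inside the context object. */
static inline const void *ctx_load_ptr(const uint8_t *ctx, size_t off) {
    const void *p;
    memcpy(&p, ctx + off, sizeof p);
    return p;
}

/* *(ctx + 8*idx + 816) */
static inline const void *get_reg_operand(const uint8_t *ctx, int idx) {
    return ctx_load_ptr(ctx, 816 + 8 * (size_t)idx);
}

/* *(ctx + 8*idx + 1024) */
static inline const void *get_src_part0(const uint8_t *ctx, int idx) {
    return ctx_load_ptr(ctx, 1024 + 8 * (size_t)idx);
}

/* *(ctx + 544) != 0 */
static inline int has_predicate(const uint8_t *ctx) {
    return ctx_load_ptr(ctx, 544) != NULL;
}

/* Follow the per-operand pointer, then keep the low 6 bits of the
   first byte it points to. */
static inline uint8_t get_data_type(const uint8_t *ctx, int idx, int part) {
    const uint8_t *p =
        ctx_load_ptr(ctx, (part ? 1744 : 1504) + 8 * (size_t)idx);
    assert(p != NULL);
    return (uint8_t)(*p & 0x3F);
}
```

Keeping the offsets as literals mirrors how the decompiled accessors look; a production reimplementation would more likely wrap them in a struct definition.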

Tier 2: Modifier and Property Accessors (10-200 callers)

Used by instruction-class families (memory ops, float ops, texture ops, etc.).

| Address | Name | Size | Callers | Signature | Logic |
| --- | --- | --- | --- | --- | --- |
| sub_70CA70 | getTypeSuffix | 427B | 191 | (ctx, pool) -> str | Iterates *(ctx+796) type codes; looks up in off_2032300[] with interning |
| sub_70CD20 | getOperandOffset | 122B | 158 | (ctx, idx) -> str | off_2032300[*(ctx+4*idx+944)]; resolves via string interning for codes <= 0x39 |
| sub_707CE0 | getAddressOperand | 22B | 93 | (ctx) -> str | off_2033DE0[*(ctx+600) >> 7] |
| sub_70B930 | getOperandCount | 7B | 68 | (ctx) -> u32 | *(ctx + 800) |
| sub_70B4C0 | getBaseAddress | 22B | 46 | (ctx) -> str | off_2032700[(*(ctx+609) >> 2) & 3] |
| sub_709A10 | getVariantString | 73B | 46 | (ctx) -> str | off_2033060[*(ctx+620)] resolved via string interning |
| sub_70B6E0 | hasPredicate_v2 | 14B | 42 | (ctx) -> bool | *(ctx + 544) != 0 (identical body to hasPredicate) |
| sub_709760 | getComparisonOp | 127B | 21 | (ctx, pool) -> str | Iterates *(ctx+800) operand names from +648 array with " , " separator |
| sub_709FE0 | getRoundingMode | 11B | 17 | (ctx) -> u8 | *(ctx + 604) & 7 |
| sub_70A500 | getSaturationMode | 13B | 15 | (ctx) -> bool | *(ctx + 564) == 12 |
| sub_709910 | getVariantCount | 14B | 13 | (ctx) -> u8 | (*(ctx+627) >> 4) & 3 |
| sub_708E40 | getExtendedOperand | 29B | 10 | (ctx, idx) -> str | off_2033720[(*(ctx+627) >> (idx==1 ? 0 : 2)) & 3] |

Tier 3: Instruction-Class-Specific Accessors (<10 callers)

Used by specific instruction families (MMA/tensor, texture, guardrail formatters).

| Address | Name | Size | Callers | Signature | Purpose |
| --- | --- | --- | --- | --- | --- |
| sub_70FA10 | checkTargetSM | 66B | 7 | (ctx, idx, str) -> bool | sscanf(str, "sm_%d") then compare to *(ctx+1984+4*idx) |
| sub_70C890 | getOperandDetail | ~300B | varies | (ctx, pool, maxlen, type) -> str | Complex: hex parse, fallback to sub_707380, type-dispatch |
| sub_70A810 | getScaleString | 22B | varies | (ctx) -> str | off_2032BA0[(*(ctx+605) >> 4) & 0xF] |
| sub_70B3F0 | getFtzFlag | 22B | varies | (ctx) -> str | off_20327C0[(*(ctx+615) >> 6) & 3] |
| sub_707530 | getPrecisionString | 12B | varies | (ctx) -> str | off_2033FA0[*(ctx+640)] |
| sub_707C60 | getAddressingMode | 12B | varies | (ctx) -> bool | (*(ctx+600) & 0x40) != 0 |
| sub_707C80 | getScopeString | 22B | varies | (ctx) -> str | off_2033E00[(*(ctx+600) & 0x40) != 0] |
| sub_7075E0 | getLayoutString | 22B | varies | (ctx) -> str | off_2033EE0[*(ctx+600) & 1] -- WMMA/TCGEN05 |
| sub_707BE0 | getShapeString | 22B | varies | (ctx) -> str | off_2033E30[(*(ctx+600) & 4) != 0] -- WMMA/TCGEN05 |
| sub_7075C0 | getInstrFlagA | 7B | varies | (ctx) -> u8 | *(ctx+600) & 1 -- WMMA/rsqrt |
| sub_707BC0 | getInstrFlagB | varies | varies | (ctx) -> varies | Secondary flag accessor -- WMMA/rsqrt |
| sub_70D3B0 | getFieldA | 91B | 2 | (ctx) -> str | Returns ".transA" if operand count matches MMA shape |
| sub_70D410 | getFieldB | 99B | 2 | (ctx) -> str | Returns ".transB" (symmetric with getFieldA) |
| sub_70D480 | getFieldC | 91B | 2 | (ctx) -> str | MMA field C modifier string |
| sub_70D4E0 | getFieldD | 91B | 2 | (ctx) -> str | MMA field D modifier string |
| sub_70D360 | getModifier | 76B | 1 | (ctx, pool) -> str | Reads operand at index 3 or 5 depending on byte 627 |
| sub_70D2F0 | getImmediate | 107B | 1 | (ctx, pool) -> str | Reads operand at +672, conditionally appends second value |
| sub_70FCB0 | getParamA | varies | varies | (ctx) -> u64 | Dispatch on (*(ctx+611) & 0x30): selects guardrail constant |
| sub_70FCF0 | getParamB | varies | varies | (ctx) -> u64 | Similar dispatch on different bit field |
| sub_70E670 | getParamC | varies | varies | (ctx) -> u64 | Third parameter accessor |

Static String Tables

The accessor functions perform table-driven lookups using static string pointer arrays in .rodata. Each table is indexed by a small bit-field extracted from the context object:

| Table Address | Entries | Indexed By | Content |
| --- | --- | --- | --- |
| off_2032300 | >57 | Operand type code | Type suffix strings (.f32, .u16, .b64, etc.) |
| off_2032700 | 4 | (*(ctx+609) >> 2) & 3 | Base address mode strings |
| off_20327C0 | 4 | (*(ctx+615) >> 6) & 3 | FTZ flag strings (empty, .ftz, etc.) |
| off_2032BA0 | 16 | (*(ctx+605) >> 4) & 0xF | Scale modifier strings |
| off_2033060 | 256 | *(ctx+620) | Variant name strings |
| off_2033720 | 4 | (*(ctx+627) >> N) & 3 | Extended operand strings |
| off_2033DE0 | 2 | *(ctx+600) >> 7 | Address operand strings |
| off_2033E00 | 2 | (*(ctx+600) & 0x40) != 0 | Scope strings (.cta, .gpu, etc.) |
| off_2033E30 | 2 | (*(ctx+600) & 4) != 0 | Shape strings -- WMMA/TCGEN05 |
| off_2033EE0 | 2 | *(ctx+600) & 1 | Layout strings -- WMMA/TCGEN05 |
| off_2033FA0 | indexed by int | *(ctx+640) | Precision strings for texture ops |
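The index expressions in this table are plain shift-and-mask operations on single context bytes. The sketch below collects a few of them; function names are illustrative, `ctx` is modeled as a byte buffer, and `table_lookup` is a hypothetical bounds-checked helper rather than anything present in the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index computations, one per static table (offsets from the field map). */
static inline unsigned rounding_mode(const uint8_t *ctx)         { return ctx[604] & 7u; }
static inline unsigned scale_index(const uint8_t *ctx)           { return (ctx[605] >> 4) & 0xFu; }
static inline unsigned base_address_index(const uint8_t *ctx)    { return (ctx[609] >> 2) & 3u; }
static inline unsigned ftz_index(const uint8_t *ctx)             { return (ctx[615] >> 6) & 3u; }
static inline unsigned address_operand_index(const uint8_t *ctx) { return ctx[600] >> 7; }
static inline unsigned scope_index(const uint8_t *ctx)           { return (ctx[600] & 0x40) != 0; }

/* Bounds-checked lookup into a static string pointer table. */
static inline const char *table_lookup(const char *const *table,
                                       unsigned entries, unsigned index) {
    assert(table != NULL && index < entries);
    return table[index];
}
```

Because every mask matches its table's entry count (2, 4, or 16), the computed index can never run past the end of the corresponding table, which is presumably why the original accessors carry no bounds checks.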

Architectural Notes

  1. String interning: String-returning accessors for type codes <= 0x39 go through a string interning table at *(ctx+2488). The pattern is: look up a candidate string from the static table, then pass it through sub_426D60 (hash lookup) or sub_7072A0 (insert-and-return). This deduplicates PTX modifier strings across the entire text generation pass.

  2. Pool allocation: Accessors that construct new strings (prefixing "@", joining with separators) receive a pool allocator parameter. They allocate from the formatter's 50KB temp buffer via sub_4280C0 (get pool) -> sub_424070 (alloc from pool) -> sub_42BDB0 (abort on failure).

  3. Duplicate functions: sub_70B700 (hasPredicate, 946 callers) and sub_70B6E0 (hasPredicate_v2, 42 callers) have bytewise-identical bodies. Both return *(a1+544) != 0. These are likely methods of different classes (base and derived, or two sibling classes) that survive as separate symbols, which suggests the binary was linked without identical-code folding.

  4. MMA/tensor accessors: getFieldA through getFieldD, getLayoutString, and getShapeString are used exclusively by WMMA, HMMA, and TCGEN05 instruction formatters. They decode matrix operation modifiers (.transA, .transB, .row, .col) from compressed bit fields.

Instruction Creation

Allocation: sub_7DD010

The primary instruction allocator, sub_7DD010, is called from pass code that needs to create new instructions. It performs the following steps:

  1. Allocates 296 bytes from the Code Object's arena allocator (vtable+16, size 296)
  2. Zeroes the entire 296-byte object
  3. Initializes sentinel fields: offset +248 = -1, +256 = 0xFFFFFFFF, +264 and +272 = 0xFFFFFFFF00000000
  4. Loads scheduling parameter defaults from xmmword_2027620 into offset +208
  5. Appends the new instruction to the Code Object's instruction index array at +368 (resizable, 1.5x growth policy)
  6. Assigns a unique instruction index: *(instr + 264) = index
  7. Invalidates cached analysis (RPO at +792)

The instruction is created unlinked -- it is not yet in any basic block's linked list.
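Step 5's growable index array can be sketched as follows. This is an illustration only: `InstrIndex` is a hypothetical stand-in for the array at Code Object +368, `realloc` substitutes for the arena allocator, and the exact rounding of the 1.5x growth step is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void   **items;     /* instruction pointers */
    uint32_t count;     /* slots in use */
    uint32_t capacity;  /* slots allocated */
} InstrIndex;           /* hypothetical layout */

/* Append an instruction and return its assigned index, or -1 if the
   allocation fails. Capacity grows by roughly 1.5x when full. */
static inline int64_t instr_index_append(InstrIndex *ix, void *instr) {
    assert(ix != NULL);
    if (ix->count == ix->capacity) {
        uint32_t new_cap = ix->capacity + ix->capacity / 2;   /* 1.5x */
        if (new_cap < ix->count + 1)
            new_cap = ix->count + 1;   /* handles capacity 0 and 1 */
        void **grown = realloc(ix->items, (size_t)new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        ix->items = grown;
        ix->capacity = new_cap;
    }
    ix->items[ix->count] = instr;
    return (int64_t)ix->count++;
}
```

The returned value plays the role of the index that step 6 stores into the instruction.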

Linking: sub_925510 (Insert Before)

sub_925510 inserts instruction a2 before instruction a3 in the doubly-linked list of Code Object a1:

void InsertBefore(CodeObject* ctx, Instr* instr, Instr* before) {
    // 1. Check whether moving the instruction impacts scheduling state
    if (IsScheduleRelevant(instr, ctx))
        UpdateScheduleState(ctx, instr);

    // 2. Notify observers (observer chain lives at Code Object +1952)
    NotifyObservers((char*)ctx + 1952, instr);

    // 3. Unlink from current position. The head and next guards are a
    //    reconstruction so that a freshly created, still-unlinked
    //    instruction (prev and next both null) is handled safely.
    if (instr->prev) {
        instr->prev->next = instr->next;
        if (instr->next)
            instr->next->prev = instr->prev;
        else
            ctx->tail = instr->prev;   // was tail
    } else if (ctx->head == instr) {
        ctx->head = instr->next;        // was head
        if (instr->next)
            instr->next->prev = nullptr;
        else
            ctx->tail = nullptr;        // was the only instruction
    }

    // 4. Insert before target
    instr->next = before;
    instr->bb_index = before->bb_index;
    instr->prev = before->prev;
    if (before->prev)
        before->prev->next = instr;
    if (before == ctx->head)
        ctx->head = instr;
    before->prev = instr;

    // 5. Post-insert bookkeeping
    PostInsertUpdate(ctx, instr);
}

Removal: sub_9253C0

sub_9253C0 (634 callers) removes an instruction from its linked list:

  1. Checks if the instruction affects scheduling state (same check as insert)
  2. Notifies the observer chain at Code Object +1952
  3. Unlinks from the doubly-linked list (updating head/tail pointers at +272/+280)
  4. Optionally updates the instruction map at Code Object +1136 (if a3 flag is set)
  5. Handles debug info cleanup if the debug flag at byte +1421 bit 5 is set
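Step 3, the unlink itself, is the standard doubly-linked-list removal with head and tail maintenance. A minimal sketch follows, using simplified stand-in types (`Instr` keeps only the +0/+8 links, `InstrList` stands in for the head/tail pair at Code Object +272/+280). Clearing the removed node's links at the end is an assumption, not something confirmed from the binary.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Instr {
    struct Instr *prev;   /* +0 */
    struct Instr *next;   /* +8 */
} Instr;

typedef struct {
    Instr *head;          /* Code Object +272 */
    Instr *tail;          /* Code Object +280 */
} InstrList;

/* Remove `instr` from `list`. Precondition: instr is currently linked
   into this list. */
static inline void instr_unlink(InstrList *list, Instr *instr) {
    assert(list != NULL && instr != NULL);
    if (instr->prev)
        instr->prev->next = instr->next;
    else
        list->head = instr->next;      /* instr was the head */
    if (instr->next)
        instr->next->prev = instr->prev;
    else
        list->tail = instr->prev;      /* instr was the tail */
    instr->prev = instr->next = NULL;  /* assumed: leave node unlinked */
}
```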

Instruction Removal Check: sub_7E0030

sub_7E0030 is called from both sub_9253C0 and sub_925510 to check whether removing an instruction is legal. It examines:

  • Whether the instruction is an STS (store shared, base opcode 95) with specific operand count and data type patterns (operand_count - adj == 5 with data type codes 1, 2, or 4 prevent removal)
  • Whether a target-specific scheduler hook (vtable offset 2128 on the SM backend at compilation context +1584) vetoes the removal
  • Whether the instruction is a PLOP3 (predicate logic, opcode 23) writing to a special register (register file type 9 at descriptor +64)
  • Whether the dead-code check (sub_7DF3A0) clears the instruction, excluding opcodes 93 (OUT_FINAL), 124 (DMUL), and 248 (SM90+ opcode) which have required side effects
  • Whether the opcode class has a "must keep" flag in the per-opcode property array at Code Object +776 (byte[4*opcode + 2] & 4)
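Two of these checks reduce to simple predicates, sketched below. The function names are hypothetical, `props` models the per-opcode property array at Code Object +776 as raw bytes, and how sub_7E0030 combines the individual checks into a final verdict is not shown because it is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes the text lists as excluded from the dead-code check because
   they have required side effects. */
static inline int has_required_side_effect(uint32_t base_opcode) {
    return base_opcode == 93 || base_opcode == 124 || base_opcode == 248;
}

/* "Must keep" bit from the property array: byte[4*opcode + 2] & 4. */
static inline int opcode_must_keep(const uint8_t *props, size_t props_len,
                                   uint32_t base_opcode) {
    size_t idx = 4 * (size_t)base_opcode + 2;
    assert(props != NULL && idx < props_len);
    return (props[idx] & 4) != 0;
}
```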

Instruction Iteration

Forward Walk

The standard forward walk over a Code Object's instruction list, starting from the head pointer:

// code_obj->head is at +272, tail at +280
instr_ptr instr = *(ptr*)(code_obj + 272);
while (instr) {
    // process instruction
    instr = *(ptr*)(instr + 8);  // next
}

Reverse Walk

instr_ptr instr = *(ptr*)(code_obj + 280);  // tail
while (instr) {
    // process instruction
    instr = *(ptr*)(instr + 0);  // prev
}

Block-Scoped Iteration

When iterating within a specific basic block (used by scheduling, regalloc, and peephole passes), the block's head instruction pointer at block_entry +0 is the starting point, and iteration continues until the next block boundary (opcode 52, named AL2P_INDEXED in the ROT13 table but universally used as a BB delimiter pseudo-opcode) or the list tail:

// Block info at code_obj+976, 40 bytes per block
ptr block_head = *(ptr*)(*(ptr*)(code_obj + 976) + 40 * block_index);
for (ptr instr = block_head; instr != nullptr; instr = *(ptr*)(instr + 8)) {
    uint32_t op = *(uint32_t*)(instr + 72) & 0xFFFFCFFF;  // clear modifier bits 12-13
    if (op == 52)  // BB boundary
        break;
    // process instruction
}
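A self-contained version of the same walk, using a simplified `Instr` struct instead of raw offsets, is sketched here. The struct is not the real 296-byte layout: its `opcode` member merely stands in for the opcode word at +72, and the helper names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Instr {
    struct Instr *prev;
    struct Instr *next;
    uint32_t      opcode;   /* stands in for the opcode word at +72 */
} Instr;

enum { OPCODE_BB_BOUNDARY = 52 };

/* Strip modifier bits 12-13 to recover the base opcode. */
static inline uint32_t base_opcode(const Instr *instr) {
    assert(instr != NULL);
    return instr->opcode & 0xFFFFCFFFu;
}

/* Count instructions from block_head up to, but not including, the next
   BB boundary pseudo-opcode or the end of the list. */
static inline size_t count_block_instrs(const Instr *block_head) {
    size_t n = 0;
    for (const Instr *i = block_head; i != NULL; i = i->next) {
        if (base_opcode(i) == OPCODE_BB_BOUNDARY)
            break;
        ++n;
    }
    return n;
}
```

The mask matters: a boundary instruction carrying a modifier (for example 52 | 0x2000) would be missed by a plain comparison against 52.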

Def-Use Chain Iterator: sub_7E6090

The complex def-use chain builder sub_7E6090 (650 lines decompiled) is the core instruction analysis function. Called from sub_8E3A80 and numerous optimization passes, it:

  1. Walks all instructions in program order
  2. For each register operand (type == 1 via (word >> 28) & 7), updates the register descriptor's def/use counts at offsets +20 and +24
  3. Builds use chains via linked list nodes allocated from the arena (16-byte nodes with {next, instruction_ptr})
  4. Sets flag bits in register descriptors (+48) for live-out, same-block-def, has-prior-use, and source-only-ref
  5. Tracks the single-definition instruction at register descriptor +56
  6. Handles CSE matching: compares operand arrays of instructions with matching opcode, operand count, and auxiliary data to detect redundant computations
  7. Takes parameter a5 as a bitmask of register file types to process (bit per register class)
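The operand test in step 2 and the use-chain nodes in step 3 can be sketched as below. Names are illustrative. The `{next, instruction_ptr}` node shape comes from the text, while pushing at the front of the chain and the bit numbering of the register-file mask are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OPERAND_TYPE_REGISTER = 1 };

/* Operand type lives in bits 28-30 of the operand word. */
static inline unsigned operand_type(uint32_t word) {
    return (word >> 28) & 7u;
}

/* Use-chain node: two pointers, 16 bytes on a 64-bit target. */
typedef struct UseNode {
    struct UseNode *next;
    void           *instr;
} UseNode;

/* Link `node` in front of `head` and return the new chain head.
   (Front insertion is an assumption; the real order is not documented.) */
static inline UseNode *use_chain_push(UseNode *head, UseNode *node,
                                      void *instr) {
    assert(node != NULL);
    node->next  = head;
    node->instr = instr;
    return node;
}

/* Test whether a register file type is enabled in an a5-style bitmask,
   assuming bit N corresponds to file type N. */
static inline int file_selected(uint32_t mask, unsigned file_type) {
    assert(file_type < 32);
    return (int)((mask >> file_type) & 1u);
}
```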

Instruction Lowering Handler -- sub_65D640 (48 KB)

The central PTX-to-Ori instruction lowering handler lives at sub_65D640. It is installed at vtable offset +32 in the ISel Phase 1 dispatch table (sub_660CE0) and called through the vtable for every PTX instruction during lowering.

Signature: int64 sub_65D640(context*, bb_ref, ptx_node*, ori_instr*)

The function reads the PTX opcode from *(*(ptx_node+32)+8) and dispatches through a ~60-case switch. An entry gate (sub_44AC80) diverts certain opcode types to an alternate handler (sub_656600). The function calls sub_A2FD90 (operand setter) 59 times to populate Ori operands on the resulting instructions.

Opcode Case Map

| Case(s) | PTX family | Handler | Description |
| --- | --- | --- | --- |
| 5 | prmt (byte permute) | inline | Decodes 8-bit per-byte channel mask, sets 2 operands |
| 6 | prmt (extended) | inline | Two-operand permute with address computation via sub_6294E0 |
| 10 | mov (special) | inline | Clears immediate flag for float type 109 |
| 12 | (delegated) | sub_659F90 | -- |
| 13 | multi-operand expansion | inline | Expands via sub_62E840, resolves type 87 (address) and 97 (register) operands |
| 17, 18, 24 | mov/cvt variants | sub_652FA0 | -- |
| 19, 20, 23 | surface ops | inline | ~200 lines: multi-register data, sub_6273E0 operand classification, up to 4 data regs + address |
| 34, 35 | load/store | inline | Optional address resolution gated on (ptx_node+61 & 0xC) |
| 45, 238 | conversion | inline | Rewrites operand type to 20 (integer), binds address via sub_6294E0 |
| 68, 71 | register indirect rewrite | inline | Checks operand size == 8, rewrites descriptor to type 110 |
| 81 | instruction expansion | inline | Creates IADD3 (opcode 38) with constant 0, reg class 12 |
| 82 | instruction expansion | inline | Rewrites to opcode 162 with IADD3 operand |
| 84 | load expansion | inline | Creates IADD3 with offset, flags 0x2000 |
| 85 | operand reorder | inline | 3-operand shuffle |
| 87 | reg class adjustment | inline | Table lookup at dword_2026C60, swaps operands 1/2, sets opcode 150 |
| 88 | matrix config | inline | MMA dimension table at dword_2026C48, sets fields 179/180 |
| 104 | 4-wide load | inline | Creates 4-operand instruction, address binding via sub_6294E0 |
| 110 | (delegated) | sub_652610 | -- |
| 123 | generic addressing | inline | Converts flat-to-specific addresses; SM-version-dependent multi-instruction sequences |
| 124, 125 | cvta / isspacep | inline | Address space conversion; creates CVTA opcode 538/539 on SM > 0x1A |
| 130 | instruction fusion | inline | Fuses instruction if operand count is not 3 or 4 |
| 165 | (delegated) | sub_65BF40 | -- |
| 175--178 | texture addr_mode | inline | Resolves .addr_mode_0/1/2 attributes from texture descriptor |
| 179 | atomic address mode | inline | Classifies atomic op type, creates SEL + ATOM sequence |
| 180 | (delegated) | sub_65CE90 | -- |
| 181, 182 | (delegated) | sub_64FF20 | -- |
| 183 | conditional atomic | inline | State space 0x20: rewrites to opcode 71 with mask 0xFF01010101 |
| 184--190 | surface/texture lowering | inline | Handles SULD/SUST/SURED (opcodes 449-456); SM-dependent operand resolution |
| 197, 198 | call site lowering | inline | Same-module vs cross-module call dispatch |
| 201--204, 208--211 | wide load/store | inline | .v2/.v4 multi-element operations with IADD3 offset computation |
| 206, 207, 212, 213 | 3-op wide load/store | inline | 3-operand variants of wide memory operations |
| 221, 222 | TMA operations | inline | Sets field 197 with value 365/366 |

Addressing Mode Types

ptxas handles four distinct addressing mode categories during instruction lowering, all resolved by sub_65D640:

1. Texture Addressing Modes (per-dimension)

Cases 175--178 resolve .addr_mode_0, .addr_mode_1, .addr_mode_2 attributes from texture descriptors. These are the PTX txq query targets.

The function walks the texture descriptor's attribute linked list at *(descriptor+16)+24, comparing each attribute name string:

// Pseudocode for cases 175-178:
addr_mode_0 = addr_mode_1 = addr_mode_2 = 0;
found = false;
for (node = attr_list_head; node != NULL; node = *node) {
    name = *(node[1] + 16);    // attribute name string
    value = *(*(node[1] + 24) + 16);  // integer value
    if (strcmp(name, "addr_mode_0") == 0)  { addr_mode_0 = value; found = true; }
    else if (strcmp(name, "addr_mode_1") == 0)  { addr_mode_1 = value; found = true; }
    else if (strcmp(name, "addr_mode_2") == 0)  { addr_mode_2 = value; found = true; }
}

For 2D textures ((state space byte & 0xB0) == 0x20), the function checks addr_mode_0 == addr_mode_1. For 3D textures (masked value 0x30), it checks that all three modes are equal. If the modes are uniform, the instruction gets a single addressing mode flag (field 91 = 1 for clamp_to_border). If they differ, it delegates to sub_64FC90 for a multi-instruction lowering that handles per-dimension mode selection.
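The uniformity test can be written as a small predicate. This is a sketch under stated assumptions: the function name is made up, the same 0xB0 mask is assumed for the 3D case, and the `default` branch (dimensionalities other than 2D and 3D) is a guess, since the text does not describe that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every relevant dimension uses the same addressing mode.
   modes[0..2] hold addr_mode_0, addr_mode_1, addr_mode_2. */
static inline int addr_modes_uniform(uint8_t state_space, const int modes[3]) {
    assert(modes != NULL);
    switch (state_space & 0xB0) {
    case 0x20:  /* 2D: compare the first two dimensions */
        return modes[0] == modes[1];
    case 0x30:  /* 3D: all three must match */
        return modes[0] == modes[1] && modes[1] == modes[2];
    default:    /* assumption: treated as uniform; not documented */
        return 1;
    }
}
```

Note the explicit parentheses around the mask in the `switch` expression: in C, `==` binds tighter than `&`, so an unparenthesized `byte & 0xB0 == 0x20` would not perform this test.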

2. Generic-to-Specific Address Conversion (case 123)

Converts flat/generic pointers to specific memory space pointers. The address space ID from *(ptx_node+40) selects the conversion strategy:

| Space ID | Memory space | Strategy |
| --- | --- | --- |
| 4 | shared | sub_654A90 (direct conversion) |
| 5 | combined | OR of global + shared + local conversions |
| 6 | local | sub_64F7A0 with register pair 101/102 |
| 7 | generic (flat) | SM-dependent: sub_654FB0 (SM <= 0x1A) or SHR/AND extraction + SEL mux (SM > 0x1A) |
| 8 | global | sub_64F7A0 with register pair 98/99 |

For generic space on older architectures (SM <= 0x1A with feature flag via sub_61AF90), a simpler single-instruction path is used. On newer architectures, a multi-instruction sequence extracts the space tag from the upper address bits.

3. Address Space Conversion (cases 124--125, cvta/isspacep)

The cvta (Convert Address) and isspacep (Is Space Predicate) instructions convert between generic and specific address spaces. For global space (type 8) on SM > 0x1A, the handler creates CVTA with opcode 538 (isspacep) or 539 (cvta) and sets register class 7 with width 4 or 16 bytes.

4. Memory Addressing Modes (implicit)

Memory addressing modes for load/store/atomic instructions are not enumerated as named constants. Instead, they emerge from the operand construction patterns in cases 19--23, 34--35, 81--84, 104, 201--213:

| Pattern | PTX syntax | Ori representation |
| --- | --- | --- |
| Register indirect | [%rd1] | Operand type 87 from sub_629E40 |
| Register + offset | [%rd1+16] | Register operand + immediate via sub_6273E0 |
| Constant bank | c[2][0x100] | Constant operand via sub_620320 (type 12) |
| Immediate address | .local space | Constant value via sub_620320 |
| Base + index | [%rd1], %r2 | Two-operand form |

ISel Phase 1 Dispatch Vtable

sub_660CE0 constructs a 17-slot vtable at context offset +3784 for the ISel Phase 1 instruction handlers:

| Offset | Handler | Size | Role |
| --- | --- | --- | --- |
| +0 | sub_650840 | -- | Primary handler |
| +8 | sub_64EEB0 | -- | Operand handler |
| +16 | sub_64F270 | -- | Type handler |
| +24 | sub_6575D0 | 49 KB | Register-class-to-opcode dispatch |
| +32 | sub_65D640 | 48 KB | Instruction lowering (this function) |
| +40 | sub_64EDD0 | -- | Auxiliary handler |
| +128 | sub_64EEC0 | -- | Lowering helper |

Key Function Reference

| Address | Size | Function | Description |
| --- | --- | --- | --- |
| sub_7DD010 | 1.3KB | Instruction::create | Allocate and initialize 296-byte instruction |
| sub_7E0030 | 3.6KB | Instruction::canRemove | Check if instruction removal is legal |
| sub_7E0650 | 0.7KB | Instruction::hasPredGuard | Check if instruction has predicate guard |
| sub_7E0E80 | 0.1KB | Instruction::lastOpIsPred | Quick predicate-guard check on last operand |
| sub_7E6090 | 10KB | DefUseChain::build | Build def-use chains for all instructions |
| sub_7DDCA0 | 0.2KB | Observer::notify | Walk observer chain and notify |
| sub_9253C0 | 0.5KB | Instruction::remove | Remove instruction from linked list (634 callers) |
| sub_925510 | 0.5KB | Instruction::insertBefore | Insert instruction before another (13 callers) |
| sub_917A60 | 6.8KB | InstrInfo::getRegClass | Opcode-to-register-class mapping (221 callers) |
| sub_91A0F0 | 5.6KB | InstrInfo::resolveRegClass | Resolve operand register class with constraints |
| sub_9314F0 | 0.4KB | RegClass::query | Register class query (1,547 callers) |
| sub_738E20 | 10KB | InstrDescTable::init | Base instruction descriptor table constructor |
| sub_BE7390 | 16KB | InstructionInfo::init | InstructionInfo constructor (ROT13 table + descriptors) |
| sub_896D50 | 21KB | InstrMnemTable::init | Architecture-specific mnemonic table initializer |
| sub_65D640 | 48KB | InstrLowering::handle | PTX-to-Ori instruction lowering handler (60+ opcode cases, addressing mode resolution) |
| sub_660CE0 | 0.3KB | InstrLowering::initVtable | Constructs ISel Phase 1 dispatch vtable (17 slots) |
| sub_6575D0 | 49KB | RegClassOpcodeDispatch::handle | Register-class-to-opcode dispatch (vtable +24 sibling) |
| sub_6D9690 | 94KB | Instruction::encode | Master SASS instruction encoder |
| sub_B28E00 | varies | isReg/isPred/isImm | Operand type predicates (isel infrastructure) |
| sub_5D4190 | 12.9KB | PTXFormatter::dispatch | PTX text generation dispatcher (580 formatters) |
| sub_710860 | 39B | PTXCtx::getDataType | Data type accessor (2,953 callers) |
| sub_70B8E0 | 12B | PTXCtx::getRegOperand | Register operand accessor (1,449 callers) |
| sub_70B910 | 12B | PTXCtx::getSrcPart0 | Source part 0 accessor (1,656 callers) |
| sub_70B700 | 14B | PTXCtx::hasPredicate | Predicate presence check (946 callers) |
| sub_70CA60 | 11B | PTXCtx::getOperandType | Operand type code accessor (480 callers) |
| sub_70B710 | 111B | PTXCtx::getOpcodeString | Opcode string with "@" prefix (348 callers) |
| sub_70FA00 | 10B | PTXCtx::getTargetSM | Target SM version accessor (286 callers) |