SASS Opcode Catalog

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Complete reference table of all SASS opcode mnemonics known to ptxas v13.0.88. Extracted from the ROT13-encoded opcode name table in the InstructionInfo constructor (sub_7A5D10, vtable off_233ADC0). The table stores exactly 322 named entries (indices 0--321) at object offset +0x1058, with each entry occupying 16 bytes (8-byte string pointer + 8-byte length). A parallel constructor sub_BE7390 initializes an identical table. Immediately after the name table, a 322-element identity-mapped index array (0x508 bytes of 4-byte integers 0..321) is bulk-copied from unk_21C0E00 to object offset +0x2478; this is a separate data structure (encoding category map), not additional opcode names.

All SASS mnemonic strings in the ptxas binary are ROT13-obfuscated. The cleartext names shown here are the result of applying ROT13 decoding to the stored strings.
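As a minimal sketch of the decoding step (not the code ptxas itself uses), ROT13 can be applied with Python's standard `codecs` module. The sample strings are taken from the tables on this page; the helper name `rot13` is only illustrative.

```python
import codecs

def rot13(s: str) -> str:
    """ROT13 is its own inverse; digits, '_' and '.' pass through unchanged."""
    return codecs.decode(s, "rot_13")

# Stored (obfuscated) strings as they appear in the name table.
samples = ["VZNQ", "VZNQ_JVQR", "FZ70_YNFG", "UZZN_16816"]
decoded = [rot13(s) for s in samples]
print(decoded)
```

Because the transform is an involution, the same helper also produces the stored form from a cleartext mnemonic.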

Table Organization

Opcodes are partitioned by SM generation through explicit boundary markers embedded in the table:

| Index | Marker | Range |
|---|---|---|
| 0--135 | Base ISA | sm_70 (Volta) and all later architectures |
| 136 | SM70_LAST | End of sm_70 range |
| 137--171 | sm_73+ | Volta extensions (uniform registers, tensor shapes) |
| 171 | SM73_LAST | End of sm_73 range |
| 172--193 | sm_82+ | Ampere additions (MMA shapes, gather, REDUX) |
| 193 | SM82_LAST | End of sm_82 range |
| 194--199 | sm_86+ | Ampere+ additions (conversion packed, SUQUERY) |
| 199 | SM86_LAST | End of sm_86 range |
| 200--205 | sm_89+ | Ada Lovelace additions (QMMA shapes) |
| 205 | SM89_LAST | End of sm_89 range |
| 206--252 | sm_90+ | Hopper additions (GMMA, CGA barriers, fences, TMA) |
| 252 | SM90_LAST | End of sm_90 range |
| 253--280 | sm_100+ | Blackwell datacenter additions (UTC, QFMA4, MEMSET) |
| 280 | SM100_LAST | End of sm_100 range |
| 281--320 | sm_104+ | Blackwell Ultra additions (uniform FP, new conversions) |
| 320 | SM104_LAST | End of sm_104 range |
| 321 | LAST | Sentinel (end of table) |

Each SM generation only adds opcodes; no base opcodes are removed. The Ori IR uses the 12-bit index into this table as the base opcode field (instruction offset +72, lower 12 bits). Bits 12--13 of the opcode word encode sub-operation modifiers (.HI, .WIDE, etc.) and are stripped by the 0xFFFFCFFF mask to recover the base index.
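The masking described above can be sketched as follows. This is an illustration under stated assumptions, not decompiled code: the function name is hypothetical, the extra `& 0xFFF` step is an assumption made so the result is exactly the 12-bit index, and the text does not say which modifier value corresponds to .HI or .WIDE.

```python
MODIFIER_MASK = 0xFFFFCFFF   # clears bits 12 and 13, as described above
BASE_INDEX_MASK = 0xFFF      # lower 12 bits hold the name-table index

def split_opcode_word(word: int) -> tuple[int, int]:
    """Return (base_index, modifier_bits) for a 32-bit opcode word."""
    modifier_bits = (word >> 12) & 0x3
    # Assumption: bits above 13 are not part of the index, so mask them too.
    base_index = (word & MODIFIER_MASK) & BASE_INDEX_MASK
    return base_index, modifier_bits

# Hypothetical word: base index 1 (IMAD) with the modifier field set to 2.
example = (2 << 12) | 1
print(split_opcode_word(example))
```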

Encoding Format Summary

SASS instructions use three widths, selected per opcode during encoding:

| Format Code | Width | Usage |
|---|---|---|
| 0x1 | 64-bit | Simple moves, branches, barriers, NOPs, short-form ALU |
| 0x2 | 128-bit | Most ALU, load/store, texture, tensor core, atomics |
| 0x8 | 256-bit | IMAD.WIDE variants with 16 constant-bank operand slots |

The 3-level opcode hierarchy within the encoded instruction word is: major (9 bits, at bits [8:16]) / minor (8 bits, at bits [17:24]) / sub-opcode (7 bits, at bits [25:31]). See the encoding page for full details.
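A small sketch of extracting the three fields from the low 32 bits of an instruction word, assuming bit 0 is the least significant bit and that the ranges above are inclusive (which matches the stated widths of 9, 8, and 7 bits). The helper names and the test values are illustrative only.

```python
def field(word: int, lo: int, hi: int) -> int:
    """Extract the inclusive bit range [lo:hi] from word."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)

def opcode_hierarchy(word: int) -> tuple[int, int, int]:
    """Return (major, minor, sub_opcode) per the layout described above."""
    major = field(word, 8, 16)    # 9 bits
    minor = field(word, 17, 24)   # 8 bits
    sub = field(word, 25, 31)     # 7 bits
    return major, minor, sub

# Arbitrary values packed into the three fields, then recovered.
packed = (0x1A3 << 8) | (0x7F << 17) | (0x55 << 25)
print(opcode_hierarchy(packed))
```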

Duplicate Mnemonic Entries

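Several entries in the table share a SASS mnemonic with an earlier index. These are not errors in the table -- they are distinct IR opcodes that produce the same assembly mnemonic but differ in binary encoding, operand width, or functional-unit routing. Five of them are analyzed here; other repeats visible in the sm_90 and sm_104 tables (for example LEPC at indices 60 and 223, or MOV at indices 19 and 290) are not analyzed. The five analyzed duplicates fall into two categories: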

Category A -- SM-generation re-introduction. The same operation is re-implemented for a newer GPU generation with a different SASS major opcode and encoding path, typically because the tensor core or ALU microarchitecture changed:

| Later Index | Earlier Index | Mnemonic | Why re-introduced |
|---|---|---|---|
| 215 (sm_90) | 180 (sm_82) | DMMA | Hopper warpgroup-aware TC path (enc. cat. 515 vs 434) |
| 220 (sm_90) | 14 (sm_70) | FMNMX | Hopper adds 5-entry operand sub-mode table (enc. cat. 534 vs 510) |

Category B -- Operand-width extension. Blackwell Ultra (sm_104) adds 64-bit operand variants of existing integer ALU instructions. The SASS printer appends a .64 suffix at render time; the IR name table stores the same base mnemonic for both widths:

| Later Index | Earlier Index | Mnemonic | What the later index adds |
|---|---|---|---|
| 284 (sm_104) | 37 (sm_70) | IMNMX | 32-bit form, new encoding path |
| 285 (sm_104) | 37 (sm_70) | IMNMX | 64-bit form (IMNMX.64, .64.UI, .64.LO) |
| 288 (sm_104) | 7 (sm_70) | ISETP | 64-bit comparison (ISETP.64, .64.UI, .64.LO) |

Binary evidence: in the constructor sub_7A5D10, indices 284 and 285 store identical "VZAZK" string pointers at adjacent 16-byte slots (v2+8728 and v2+8744). The SASS printer (sub_7CB560) maps them to IMNMX vs IMNMX.64 based on operand metadata.
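The slot offsets quoted above follow directly from the table layout (base offset +0x1058, 16 bytes per entry). A quick arithmetic check, with a hypothetical helper name:

```python
NAME_TABLE_OFFSET = 0x1058   # object offset of entry 0 (decimal 4184)
ENTRY_SIZE = 16              # 8-byte string pointer + 8-byte length
ENTRY_COUNT = 322

def name_slot_offset(index: int) -> int:
    """Object offset of the string-pointer field for a given opcode index."""
    return NAME_TABLE_OFFSET + ENTRY_SIZE * index

print(name_slot_offset(284), name_slot_offset(285))
# The first byte past the table is where the category map begins.
print(hex(name_slot_offset(ENTRY_COUNT)))
```

Indices 284 and 285 land on the two adjacent slots cited in the decompilation, and the end of the table coincides with the +0x2478 offset of the category map.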

Base ISA -- sm_70 (Volta) and Later (Indices 0--135)

These opcodes are available on all SM architectures supported by ptxas v13.0.

Integer Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 1 | VZNQ | IMAD | Integer multiply-add (32-bit) |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add, 32x32->64 result |
| 3 | VNQQ3 | IADD3 | Three-input integer add with carry |
| 4 | OZFX | BMSK | Generate bitmask from position and width |
| 5 | FTKG | SGXT | Sign-extend from specified bit position |
| 6 | YBC3 | LOP3 | Three-input logic operation (arbitrary LUT) |
| 7 | VFRGC | ISETP | Integer compare and set predicate (32-bit; re-introduced at index 288 for sm_104 with 64-bit support) |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address (shift-add) |
| 10 | FUS | SHF | Funnel shift (concatenate two regs, shift) |
| 33 | VQC | IDP | Integer dot product (4-element) |
| 34 | VQR | IDE | Integer dot expand |
| 37 | VZAZK | IMNMX | Integer min/max (32-bit only; re-introduced at indices 284--285 for sm_104 with 32/64-bit split) |
| 38 | CBCP | POPC | Population count (count set bits) |
| 39 | SYB | FLO | Find leading one (bit scan) |
| 53 | OERI | BREV | Bit reverse |

FP32 Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 11 | SSZN | FFMA | FP32 fused multiply-add |
| 12 | SNQQ | FADD | FP32 add |
| 13 | SZHY | FMUL | FP32 multiply |
| 14 | SZAZK | FMNMX | FP32 min/max (base encoding cat. 510; re-introduced at index 220 for sm_90 with extended operand modes) |
| 15 | SFJMNQQ | FSWZADD | FP32 swizzle add (cross-lane partial reduction) |
| 16 | SFRG | FSET | FP32 compare and set result register |
| 17 | SFRY | FSEL | FP32 select (conditional move) |
| 18 | SFRGC | FSETP | FP32 compare and set predicate |
| 40 | SPUX | FCHK | FP check for NaN/Inf/denorm |
| 42 | ZHSH | MUFU | Multi-function unit: RCP, RSQ, SIN, COS, EX2, LG2, RCP64H, RSQ64H |

FP64 Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 122 | QSZN | DFMA | FP64 fused multiply-add |
| 123 | QNQQ | DADD | FP64 add |
| 124 | QZHY | DMUL | FP64 multiply |
| 125 | QFRGC | DSETP | FP64 compare and set predicate |

FP16 Packed Arithmetic

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 126 | UNQQ2 | HADD2 | Packed FP16x2 add |
| 127 | UNQQ2_S32 | HADD2_F32 | Packed FP16x2 add with FP32 accumulator |
| 128 | USZN2 | HFMA2 | Packed FP16x2 fused multiply-add |
| 129 | UZHY2 | HMUL2 | Packed FP16x2 multiply |
| 130 | UFRG2 | HSET2 | Packed FP16x2 compare and set |
| 131 | UFRGC2 | HSETP2 | Packed FP16x2 compare and set predicate |

Type Conversion

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 35 | V2V | I2I | Integer to integer conversion (width/sign change) |
| 36 | V2VC | I2IP | Integer to integer, packed variant |
| 43 | S2S | F2F | Float to float conversion (precision change) |
| 44 | S2S_K | F2F_X | Float to float, extended (with carry chain) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer, extended |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float, extended |
| 49 | SEAQ | FRND | FP round to integer (within FP format) |
| 50 | SEAQ_K | FRND_X | FP round, extended |

Data Movement

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 19 | ZBI | MOV | Move register to register |
| 20 | FRY | SEL | Predicated select (ternary conditional) |
| 21 | C2E | P2R | Pack predicate registers into GPR |
| 22 | E2C | R2P | Unpack GPR bits into predicate registers |
| 24 | CEZG | PRMT | Byte-level permute (4-byte shuffle) |
| 41 | VCN | IPA | Interpolate pixel attribute (fragment shader) |
| 57 | F2E | S2R | Read special register to GPR |
| 27 | PF2E_32 | CS2R_32 | Control/status register to GPR (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status register to GPR (64-bit) |

Predicate Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 23 | CYBC3 | PLOP3 | Three-input predicate logic (arbitrary LUT) |
| 26 | IBGR | VOTE | Warp-wide vote (ballot/any/all/unanimity) |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference, 4-way |

Memory -- Load/Store

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 89 | YQP | LDC | Load from constant memory bank c[bank][offset] |
| 90 | NYQ | ALD | Attribute load (vertex/fragment attributes) |
| 91 | NFG | AST | Attribute store |
| 94 | YQF | LDS | Load from shared memory |
| 95 | FGF | STS | Store to shared memory |
| 96 | YQT | LDG | Load from global memory |
| 97 | FGT | STG | Store to global memory |
| 98 | YQY | LDL | Load from local memory (per-thread stack) |
| 99 | FGY | STL | Store to local memory |
| 100 | YQ | LD | Load, generic address space |
| 101 | FG | ST | Store, generic address space |

Atomic and Reduction

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 102 | NGBZ | ATOM | Atomic operation (generic address space) |
| 103 | NGBZT | ATOMG | Atomic operation (global memory) |
| 104 | ERQ | RED | Reduction (global memory, fire-and-forget) |
| 105 | NGBZF | ATOMS | Atomic operation (shared memory) |

Cache and Memory Control

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 106 | DFCP | QSPC | Query address space type |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control, no scoreboard wait |
| 108 | PPGY | CCTL | Cache control (invalidate/writeback/etc.) |
| 109 | PPGYY | CCTLL | Cache control, L2 level |
| 110 | PPGYG | CCTLT | Cache control, texture cache |
| 111 | ZRZONE | MEMBAR | Memory barrier (fence) |

Texture Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 83 | GRK | TEX | Texture fetch (filtered sample) |
| 84 | GYQ | TLD | Texture load (unfiltered, integer coords) |
| 85 | GYQ4 | TLD4 | Texture gather (fetch 4 texels for bilinear) |
| 86 | GZZY | TMML | Query texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with explicit derivatives |
| 88 | GKD | TXQ | Texture query (dimensions, levels, format) |

Surface Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |

Graphics Pipeline

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 51 | NY2C | AL2P | Attribute location to patch offset |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch, indexed variant |
| 92 | BHG | OUT | Tessellation output emit |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output emit (final, cut primitive) |
| 116 | CVKYQ | PIXLD | Pixel information load (coverage, sample mask) |
| 117 | VFOREQ | ISBERD | Indexed set buffer for read (bindless) |
| 118 | VFORJE | ISBEWR | Indexed set buffer for write (bindless) |

Control Flow

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 67 | OEN | BRA | Branch (relative) |
| 68 | OEK | BRX | Branch indirect (register target) |
| 69 | WZC | JMP | Jump (absolute) |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return from function |
| 73 | OFFL | BSSY | Push convergence point onto branch sync stack |
| 74 | OERNX | BREAK | Break out of convergence region |
| 77 | RKVG | EXIT | Thread exit |
| 76 | XVYY | KILL | Kill thread (discard fragment) |
| 75 | OCG | BPT | Breakpoint trap (debugger) |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync (pop convergence stack, reconverge) |

Synchronization and Warp

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 54 | OZBI_O | BMOV_B | Barrier move (barrier register, B variant) |
| 55 | OZBI_E | BMOV_R | Barrier move (barrier register, R variant) |
| 56 | OZBI | BMOV | Barrier move |
| 58 | O2E | B2R | Barrier register to GPR |
| 59 | E2O | R2B | GPR to barrier register |
| 61 | ONE | BAR | Named barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier, indexed variant |
| 66 | QRCONE | DEPBAR | Dependency barrier (wait for scoreboard) |
| 80 | ZNGPU | MATCH | Warp match (find lanes with same value) |
| 119 | FUSY | SHFL | Warp shuffle (cross-lane data exchange) |
| 120 | JNECFLAP | WARPSYNC | Warp-wide synchronization barrier |
| 81 | ANABFYRRC | NANOSLEEP | Thread sleep for specified nanoseconds |
| 82 | ANABGENC | NANOTRAP | Nano trap (lightweight trap) |

System and Miscellaneous

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal pseudo-instruction) |
| 25 | ABC | NOP | No-operation |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | PFZGRFG | CSMTEST | CSM (compute shader model) test |
| 60 | YRCP | LEPC | Load effective PC (get current instruction address) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA (thread block) ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base address |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base address |
| 121 | LVRYQ | YIELD | Yield execution (internal, scheduler hint) |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo-opcode, lowered before encoding) |

Tensor Core (Base)

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 132 | UZZN_16 | HMMA_16 | FP16 matrix multiply-accumulate, 16-wide |
| 133 | UZZN_32 | HMMA_32 | FP16 matrix multiply-accumulate, 32-wide |
| 134 | VZZN | IMMA | Integer matrix multiply-accumulate |

sm_73 Extensions (Indices 137--171)

Volta+ additions. Primarily introduces uniform register variants and additional tensor core shapes.

Uniform Register Operations

Uniform registers (UR0--UR63) hold values shared across the warp, enabling scalar execution of warp-uniform computations.

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 138 | HOERI | UBREV | Uniform bit reverse |
| 139 | HOZFX | UBMSK | Uniform bitmask |
| 140 | HPYRN | UCLEA | Uniform clear address |
| 141 | HVFRGC | UISETP | Uniform integer set-predicate |
| 142 | HYQP | ULDC | Uniform load constant |
| 143 | HYRN | ULEA | Uniform load effective address |
| 144 | HC2HE | UP2UR | Uniform predicate to uniform register |
| 145 | HYBC3 | ULOP3 | Uniform three-input logic |
| 146 | HCYBC3 | UPLOP3 | Uniform predicate three-input logic |
| 147 | HFRY | USEL | Uniform select |
| 148 | HFTKG | USGXT | Uniform sign-extend |
| 149 | HSYB | UFLO | Uniform find leading one |
| 150 | HVNQQ3 | UIADD3 | Uniform three-input integer add |
| 151 | HVZNQ | UIMAD | Uniform integer multiply-add |
| 152 | HZBI | UMOV | Uniform move |
| 153 | HCEZG | UPRMT | Uniform byte permute |
| 154 | IBGRH | VOTEU | Uniform vote |
| 155 | HCBCP | UPOPC | Uniform population count |
| 156 | HFUS | USHF | Uniform funnel shift |

Additional sm_73 Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 157 | FPNGGRE | SCATTER | Scatter write |
| 158 | S2SC | F2FP | Float to float, packed conversion |
| 159 | UZZN_1688 | HMMA_1688 | FP16 MMA, 16x8x8 shape |
| 160 | UZZN_16816 | HMMA_16816 | FP16 MMA, 16x8x16 shape |
| 161 | OZZN | BMMA | Binary (1-bit) matrix multiply-accumulate |
| 162 | GGHPPGY | TTUCCTL | Tensor texture unit cache control |
| 163 | GGHZNPEB | TTUMACRO | Tensor texture unit macro |
| 164 | E2HE | R2UR | GPR to uniform register |
| 165 | ZBIZ | MOVM | Move with mask |
| 166 | YQFZ | LDSM | Load from shared memory to matrix register |
| 167 | YQGENZ | LDTRAM | Load from TRAM (transposed shared memory) |
| 168 | SBBGCEVAG | FOOTPRINT | Texture footprint query |
| 169 | F2HE | S2UR | Special register to uniform register |
| 170 | OEKH | BRXU | Branch indirect, uniform target |

sm_82 Extensions (Indices 172--193)

Ampere additions. New MMA shapes, gather/scatter metadata, and reduction variants.

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 173 | TNGURE | GATHER | Gather (multi-address load) |
| 174 | TRAZRGNQNGN | GENMETADATA | Generate metadata (for sparse MMA) |
| 175 | FCZRGNQNGN | SPMETADATA | Sparse metadata |
| 176 | OZZN_88128 | BMMA_88128 | Binary MMA, 8x8x128 shape |
| 177 | OZZN_168128 | BMMA_168128 | Binary MMA, 16x8x128 shape |
| 178 | OZZN_168256 | BMMA_168256 | Binary MMA, 16x8x256 shape |
| 179 | PYZNQ | CLMAD | Carry-less multiply-add (GF(2) arithmetic) |
| 180 | QZZN | DMMA | FP64 matrix multiply-accumulate (Ampere; encoding category 434; re-introduced at index 215 for Hopper with different TC path) |
| 181 | UZZN_FC_1688 | HMMA_SP_1688 | FP16 sparse MMA, 16x8x8 |
| 182 | USZN2_ZZN | HFMA2_MMA | FP16 FMA2, MMA variant |
| 183 | UZAZK2 | HMNMX2 | Packed FP16x2 min/max |
| 184 | VZZN_88 | IMMA_88 | Integer MMA, 8x8 shape |
| 185 | VZZN_FC_88 | IMMA_SP_88 | Integer sparse MMA, 8x8 |
| 186 | VZZN_16816 | IMMA_16816 | Integer MMA, 16x8x16 |
| 187 | VZZN_16832 | IMMA_16832 | Integer MMA, 16x8x32 |
| 188 | VZZN_FC_16832 | IMMA_SP_16832 | Integer sparse MMA, 16x8x32 |
| 189 | NEEVIRF | ARRIVES | Async barrier arrive signal |
| 190 | YQTQRCONE | LDGDEPBAR | Load-global dependency barrier |
| 191 | YQTFGF | LDGSTS | Load-global, store-to-shared (async copy) |
| 192 | ERQHK | REDUX | Warp-wide reduction (uniform result) |

sm_86 Extensions (Indices 194--199)

Ampere+ (GA106/GA107) additions.

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 195 | S2VC | F2IP | Float to integer, packed |
| 196 | HS2SC | UF2FP | Uniform float to float, packed |
| 197 | V2SC | I2FP | Integer to float, packed |
| 198 | FHDHREL | SUQUERY | Surface query (dimensions, format) |

sm_89 Extensions (Indices 200--205)

Ada Lovelace additions. Quarter-precision MMA shapes for FP8/INT4.

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 201 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA, 16x8x16 (FP8) |
| 202 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA, 16x8x32 |
| 203 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA, 16x8x32 |
| 204 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA, 128x64 |

sm_90 Extensions (Indices 206--252)

Hopper additions. Major expansion: CGA (Cooperative Grid Array) barriers, fences, GMMA (Group MMA), TMA (Tensor Memory Accelerator), and collective operations.

CGA Barriers and Synchronization

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 207 | NPDOYX | ACQBLK | Acquire block (CTA resource acquisition) |
| 208 | PTNONE_NEI | CGABAR_ARV | CGA barrier arrive |
| 209 | PTNONE_TRG | CGABAR_GET | CGA barrier get (query state) |
| 210 | PTNONE_FRG | CGABAR_SET | CGA barrier set |
| 211 | PTNONE_JNVG | CGABAR_WAIT | CGA barrier wait |
| 212 | PTNREEONE | CGAERRBAR | CGA error barrier |

Collective and Election

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 213 | PERNGRCBYVPL | CREATEPOLICY | Create scheduling/cache policy |
| 214 | PIGN | CVTA | Convert address space (generic to specific) |
| 215 | QZZN | DMMA | FP64 matrix multiply-accumulate (Hopper re-introduction; encoding category 515 vs 434 for index 180; uses warpgroup-aware tensor core path, shared dispatch with CVTA at case 0xD6/0xD7 in sub_6575D0) |
| 216 | RYRPG | ELECT | Elect a leader lane in warp |
| 217 | RAQPBYYRPGVIR | ENDCOLLECTIVE | End collective operation scope |

Fences

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 218 | SRAPR_T | FENCE_G | Fence, global scope |
| 219 | SRAPR_F | FENCE_S | Fence, shared/CTA scope |
| 220 | SZAZK | FMNMX | FP32 min/max (Hopper re-introduction; encoding category 534 vs 510 for index 14; adds 5-entry operand sub-mode table via dword_2026FC0 for extended rounding/precision modes not in base encoding) |

GMMA (Group Matrix Multiply-Accumulate)

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 221 | TZZN | GMMA | Group (warpgroup) matrix multiply-accumulate |

Memory Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 222 | YQPH | LDCU | Load constant, uniform (warp-coherent constant load) |
| 223 | YRCP | LEPC | Load effective PC (sm_90 variant) |
| 224 | ZNCN | MAPA | Map address (for TMA address translation) |
| 225 | CERRKVG | PREEXIT | Pre-exit (cleanup before thread exit) |
| 226 | E2HE_U | R2UR_H | Register to uniform register, high half |
| 227 | ERQNF | REDAS | Reduction, async (fire-and-forget with arrive) |

Configuration

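| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 228 | FRGZNKERT | SETMAXREG | Set maximum register count for dynamic partitioning |
| 229 | FRGFZRZFVMR | SETSMEMSIZE | Set shared memory size dynamically |
| 230 | FGNF | STAS | Store async (to shared, with barrier) |
| 231 | FGFZ | STSM | Store to shared memory, matrix layout |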
231FGFZSTSMStore to shared memory, matrix layout

Synchronization Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 232 | FLAPF_ONFVP | SYNCS_BASIC | Sync scope, basic |
| 233 | FLAPF_YQ_HAVSZ | SYNCS_LD_UNIFM | Sync scope with uniform load |

Uniform Block Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 234 | HOYXPC | UBLKCP | Uniform block copy |
| 235 | HOYXERQ | UBLKRED | Uniform block reduction |
| 236 | HOYXCS | UBLKPF | Uniform block prefetch |
| 237 | HPIGN | UCVTA | Uniform convert address space |
| 238 | HYRCP | ULEPC | Uniform load effective PC |
| 239 | HZNCN | UMAPA | Uniform map address |

TMA (Tensor Memory Accelerator) Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 240 | HGZNPPGY | UTMACCTL | TMA cache control |
| 241 | HGZNPZQSYHFU | UTMACMDFLUSH | TMA command flush |
| 242 | HGZNYQT | UTMALDG | TMA load global |
| 243 | HGZNCS | UTMAPF | TMA prefetch |
| 244 | HGZERQT | UTMREDG | TMA reduction global |
| 245 | HGZNYFG | UTMALST | TMA load/store |

Vector Min/Max Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 246 | IUZAZK | VHMNMX | Vector half min/max (FP16x2) |
| 247 | IVNQQ | VIADD | Vector integer add |
| 248 | IVNQQZAZK | VIADDMNMX | Vector integer add with min/max |
| 249 | IVZAZK | VIMNMX | Vector integer min/max |
| 250 | IVZAZK3 | VIMNMX3 | Vector integer three-input min/max |
| 251 | JNECTEBHC | WARPGROUP | Warpgroup collective operation |

sm_100 Extensions (Indices 253--280)

Blackwell datacenter additions. UTC (Unified Tensor Core) operations, quad-precision FP, FP32x2 packed operations, and tensor core swizzle load/store.

Packed FP32 and Reduction

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 254 | PERQHK | CREDUX | CTA-scope reduction (cross-warp) |
| 255 | SNQQ2 | FADD2 | Packed FP32x2 add |
| 256 | SSZN2 | FFMA2 | Packed FP32x2 fused multiply-add |
| 257 | SZAZK3 | FMNMX3 | FP32 three-input min/max |
| 258 | SZHY2 | FMUL2 | Packed FP32x2 multiply |

Tensor Memory

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 259 | YQGZ | LDTM | Load via tensor memory (5th-gen tensor core) |
| 260 | HTRGARKGJBEXVQ | UGETNEXTWORKID | Uniform get next work ID (dynamic scheduling) |

UTC (Unified Tensor Core) Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 261 | HGPONE_1PGN | UTCBAR_1CTA | UTC barrier, 1 CTA scope |
| 262 | HGPONE_2PGN | UTCBAR_2CTA | UTC barrier, 2 CTA scope |
| 263 | HGPPC_1PGN | UTCCP_1CTA | UTC copy, 1 CTA scope |
| 264 | HGPPC_2PGN | UTCCP_2CTA | UTC copy, 2 CTA scope |
| 265 | HGPZZN_1PGN | UTCMMA_1CTA | UTC MMA, 1 CTA scope |
| 266 | HGPZZN_2PGN | UTCMMA_2CTA | UTC MMA, 2 CTA scope |
| 267 | HGPFUVSG_1PGN | UTCSHIFT_1CTA | UTC shift, 1 CTA scope |
| 268 | HGPFUVSG_2PGN | UTCSHIFT_2CTA | UTC shift, 2 CTA scope |

Tensor Core Swizzle

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 269 | IVEGPBHAG | VIRTCOUNT | Virtual thread count query |
| 270 | GPNGBZFJF | TCATOMSWS | Tensor core atomic with swizzle |
| 271 | GPYQFJF | TCLDSWS | Tensor core load with swizzle |
| 272 | GPFGFJF | TCSTSWS | Tensor core store with swizzle |

Quad-Precision FP

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 273 | DSZN4 | QFMA4 | Quad-element FP fused multiply-add |
| 274 | DNQQ4 | QADD4 | Quad-element FP add |
| 275 | DZHY4 | QMUL4 | Quad-element FP multiply |

Additional sm_100

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 276 | ZRZFRG | MEMSET | Memory set (block fill) |
| 277 | NPDFUZVAVG | ACQSHMINIT | Acquire shared memory and initialize |
| 278 | FGGZ | STTM | Store via tensor memory |
| 279 | SRAPR_G | FENCE_T | Fence, tensor scope |

sm_104 Extensions (Indices 281--320)

Blackwell Ultra additions. Uniform FP operations, additional integer widths, conversion variants, MMA shape extensions, and MXQ sparse variants.

Integer Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 282 | VNQQ | IADD | Integer add (two-input, distinct from IADD3) |
| 283 | HIVNQQ | UVIADD | Uniform vector integer add |
| 284 | VZAZK | IMNMX | Integer min/max, 32-bit operands (sm_104 re-introduction; new Blackwell Ultra encoding path distinct from base index 37) |
| 285 | VZAZK | IMNMX | Integer min/max, 64-bit operands (SASS prints as IMNMX.64; consecutive with 284 to form the 32/64-bit pair; .64.UI and .64.LO sub-modifiers select unsigned/low-half comparison modes) |
| 286 | HVZAZK | UIMNMX | Uniform integer min/max |
| 287 | HIVZAZK | UVIMNMX | Uniform vector integer min/max |
| 288 | VFRGC | ISETP | Integer set-predicate (sm_104 re-introduction; supports 64-bit operand comparison as ISETP.64 with .64.UI/.64.LO sub-modifiers; new encoding path, case 0x120 in sub_7482B0 and sub_8380A0) |
| 289 | HVFRGC | UISETP | Uniform integer set-predicate (sm_104 re-introduction of index 141; pairs with ISETP index 288 for 64-bit uniform comparison) |

Data Movement Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 290 | ZBI | MOV | Move (sm_104 variant) |
| 291 | HZBI | UMOV | Uniform move (sm_104 variant) |
| 292 | FRY | SEL | Select (sm_104 variant) |
| 293 | HFRY | USEL | Uniform select (sm_104 variant) |

Uniform FP Operations

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 294 | HSNQQ | UFADD | Uniform FP add |
| 295 | HSFRY | UFSEL | Uniform FP select |
| 296 | HSSZN | UFFMA | Uniform FP fused multiply-add |
| 297 | HSZHY | UFMUL | Uniform FP multiply |
| 298 | HSFRG | UFSET | Uniform FP compare and set |
| 299 | HSFRGC | UFSETP | Uniform FP compare and set predicate |

Uniform Conversion

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 300 | HV2V | UI2I | Uniform integer to integer conversion |
| 301 | HV2VC | UI2IP | Uniform integer to integer, packed |
| 302 | HS2S | UF2F | Uniform float to float |
| 303 | HSEAQ | UFRND | Uniform FP round |
| 304 | HS2V | UF2I | Uniform float to integer |
| 305 | HS2VC | UF2IP | Uniform float to integer, packed |
| 306 | HV2S | UI2F | Uniform integer to float |
| 307 | HV2SC | UI2FP | Uniform integer to float, packed |
| 308 | HVNOF | UIABS | Uniform integer absolute value |
| 309 | PF2HE | CS2UR | Control/status register to uniform register |
| 310 | HS2SC | UF2FP | Uniform float to float, packed (sm_104 variant) |

MMA Extensions

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 311 | ZKDZZN_FS_16832 | MXQMMA_SF_16832 | Mixed-quantized structured-sparse MMA, 16x8x32 |
| 312 | BZZN_16864 | OMMA_16864 | Operand MMA, 16x8x64 shape |
| 313 | BZZN_FC_168128 | OMMA_SP_168128 | Operand sparse MMA, 16x8x128 |
| 314 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA (sm_104 variant) |
| 315 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA (sm_104 variant) |
| 316 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA (sm_104 variant) |
| 317 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA (sm_104 variant) |
| 318 | DZZN_FS_16832 | QMMA_SF_16832 | Quarter-precision structured sparse MMA |
| 319 | DZZN_FS_FC_16864 | QMMA_SF_SP_16864 | Quarter-precision structured+unstructured sparse MMA |

Boundary Markers

| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 136 | FZ70_YNFG | SM70_LAST | End of sm_70 base ISA |
| 137 | FZ73_SVEFG | SM73_FIRST | Start of sm_73 extensions |
| 171 | FZ73_YNFG | SM73_LAST | End of sm_73 |
| 172 | FZ82_SVEFG | SM82_FIRST | Start of sm_82 extensions |
| 193 | FZ82_YNFG | SM82_LAST | End of sm_82 |
| 194 | FZ86_SVEFG | SM86_FIRST | Start of sm_86 extensions |
| 199 | FZ86_YNFG | SM86_LAST | End of sm_86 |
| 200 | FZ89_SVEFG | SM89_FIRST | Start of sm_89 extensions |
| 205 | FZ89_YNFG | SM89_LAST | End of sm_89 |
| 206 | FZ90_SVEFG | SM90_FIRST | Start of sm_90 extensions |
| 252 | FZ90_YNFG | SM90_LAST | End of sm_90 |
| 253 | FZ100_SVEFG | SM100_FIRST | Start of sm_100 extensions |
| 280 | FZ100_YNFG | SM100_LAST | End of sm_100 |
| 281 | FZ104_SVEFG | SM104_FIRST | Start of sm_104 extensions |
| 320 | FZ104_YNFG | SM104_LAST | End of sm_104 |
| 321 | YNFG | LAST | End-of-table sentinel |
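Because each generation's range ends at its `*_LAST` marker, the marker indices above are enough to classify any opcode index by the generation that introduced it. The following is a minimal sketch of that lookup; the function name and the choice to treat marker indices as part of their own range are assumptions made for illustration.

```python
import bisect

# (index of *_LAST marker, generation label), taken from the table above.
LAST_MARKERS = [
    (136, "sm_70"), (171, "sm_73"), (193, "sm_82"), (199, "sm_86"),
    (205, "sm_89"), (252, "sm_90"), (280, "sm_100"), (320, "sm_104"),
]
_BOUNDS = [bound for bound, _ in LAST_MARKERS]

def generation_of(index: int) -> str:
    """Return the SM generation whose range contains the opcode index."""
    if not 0 <= index <= 320:
        raise ValueError("index outside the generation ranges (321 is the LAST sentinel)")
    # First marker whose index is >= the opcode index closes its range.
    return LAST_MARKERS[bisect.bisect_left(_BOUNDS, index)][1]

for idx in (1, 180, 215, 285):
    print(idx, generation_of(idx))
```

For example, both DMMA entries resolve to different generations: index 180 falls in the sm_82 range and index 215 in the sm_90 range.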

Encoding Category Map at unk_21C0E00

The 0x508 bytes (1288 bytes) at unk_21C0E00 are not additional opcode names. They are a 322-element int32 array mapping each opcode index to an encoding category number -- a level of indirection between opcode indices and binary encoding format descriptors.

Binary Evidence

  1. RSI is loaded with 0x21C0E00 (at 0x7A5D9F: mov $0x21c0e00, %esi)
  2. RDI is set to obj+0x2478 (at 0x7A5D82: lea 0x2478(%rbx), %rdi)
  3. RCX is set to 161 (at 0x7A5D22: mov $0xa1, %r13d; 0x7A5D69: mov %r13, %rcx)
  4. The rep movsq at 0x7A791D copies 161 quadwords = 1288 bytes = 322 x 4 bytes

The destination offset +0x2478 (decimal 9336) is immediately after the 322-entry name table, which occupies bytes +4184 through +9335 (322 x 16 = 5152 bytes). Three arch-specific constructors each populate this array from a different static source table:

| Constructor | Source Table | Map Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped (selected entries differ) |
| sub_BE7390 | unk_22B2320 | Arch-remapped (selected entries differ) |
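The size arithmetic in the binary evidence and the identity property of the base map can be checked with a short sketch. The blob below is synthetic (built in the script, not read from the binary), and the helper names are hypothetical; little-endian int32 layout is assumed because the target is x86-64.

```python
import struct

ENTRY_COUNT = 322

def parse_category_map(blob: bytes, count: int = ENTRY_COUNT) -> list[int]:
    """Decode `count` little-endian int32 values from the start of blob."""
    return list(struct.unpack(f"<{count}i", blob[: count * 4]))

def is_identity(category_map: list[int]) -> bool:
    return all(value == i for i, value in enumerate(category_map))

# Synthetic stand-in for the 0x508 bytes at unk_21C0E00.
blob = struct.pack(f"<{ENTRY_COUNT}i", *range(ENTRY_COUNT))
category_map = parse_category_map(blob)

print(len(blob), 161 * 8, hex(len(blob)))
print(is_identity(category_map))
```

Running the same check against bytes dumped from the other two source tables would show which entries are remapped, under the same layout assumption.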

Reader: sub_1377C60 (SASS Mnemonic Lookup)

The SASS mnemonic lookup function at sub_1377C60 reads this map at line 292:

v84 = *(_DWORD *)(a1 + 4 * v18 + 9336);  // encoding_category_map[opcode_index]

After matching an input mnemonic string against the ROT13 name table (with inline decoding at lines 264-273), the function reads encoding_category_map[opcode_index] and uses the result as a hash key -- combined with a 24-bit architecture discriminator via FNV-1a -- to look up the encoding format descriptor in the hash table at InstructionInfo+10672.

This is why duplicate mnemonics (e.g. DMMA at indices 180 and 215, or FMNMX at indices 14 and 220) can have different encoding categories (434 vs 515, 510 vs 534): the category map provides the indirection needed to select different binary encoders for the same mnemonic across architectures. The opcode name table has exactly 322 entries and no more.
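To make the lookup-key idea concrete, here is a heavily hedged sketch of FNV-1a hashing over a category number and a 24-bit architecture tag. Only the use of FNV-1a and the 24-bit discriminator come from the text above; the 32-bit hash width, the byte packing, and the names `fnv1a_32` and `descriptor_key` are assumptions chosen for illustration and are not confirmed by the binary.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h

def descriptor_key(category: int, arch: int) -> int:
    # Hypothetical packing: 4-byte category followed by a 3-byte (24-bit) arch tag.
    payload = category.to_bytes(4, "little") + (arch & 0xFFFFFF).to_bytes(3, "little")
    return fnv1a_32(payload)

# Categories 434 and 515 are the two DMMA categories named above;
# the arch tag value 1 is a placeholder.
print(hex(descriptor_key(434, 1)), hex(descriptor_key(515, 1)))
```

The point of the sketch is only that two indices sharing a mnemonic feed different category numbers into the key, so they can resolve to different format descriptors.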

Opcode Category Summary

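| Category | Base ISA | sm_73+ | sm_82+ | sm_86+ | sm_89+ | sm_90+ | sm_100+ | sm_104+ | Total |
|---|---|---|---|---|---|---|---|---|---|
| Integer ALU | 16 | 10 | 1 | 0 | 0 | 2 | 0 | 5 | 34 |
| FP32 | 10 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 15 |
| FP64 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 |
| FP16 | 6 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 8 |
| Conversion | 10 | 1 | 0 | 3 | 0 | 0 | 0 | 10 | 24 |
| Data Movement | 9 | 5 | 0 | 0 | 0 | 2 | 0 | 5 | 21 |
| Predicate/Vote | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| Load/Store | 11 | 3 | 2 | 0 | 0 | 5 | 2 | 0 | 23 |
| Atomic/Reduce | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6 |
| Cache/Fence | 6 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 11 |
| Texture | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| Surface | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Control Flow | 13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 15 |
| Sync/Warp | 10 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 14 |
| Tensor Core | 3 | 3 | 10 | 0 | 4 | 1 | 9 | 9 | 39 |
| TMA | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
| Uniform Block | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 6 | 10 |
| CGA/Collective | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
| Graphics | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| System/Misc | 7 | 0 | 1 | 0 | 0 | 4 | 2 | 0 | 14 |
| Boundaries | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 16 |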

Encoding Format Correlation

From the encoding page analysis, the approximate distribution of 64-bit vs 128-bit formats for the base ISA:

64-bit format (format code 0x1): NOP, BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, BPT, KILL, RTT, BAR, DEPBAR, WARPSYNC, BMOV, B2R, R2B, S2R, CS2R, MOV (short form), YIELD, ERRBAR, NANOSLEEP, NANOTRAP, SHFL. These are primarily control-flow, barriers, and simple data movement instructions that need fewer operand bits.

128-bit format (format code 0x2): All ALU operations (IMAD, IADD3, FFMA, FADD, FMUL, LOP3, ISETP, FSETP, etc.), all memory operations (LDG, STG, LDS, STS, LDL, STL, LD, ST, LDC), all atomics (ATOM, ATOMG, ATOMS, RED), all texture operations (TEX, TLD, TLD4, TMML, TXD, TXQ), all surface operations, tensor core operations (HMMA, IMMA, BMMA, GMMA, etc.), conversion instructions, and most uniform register operations.

256-bit format (format code 0x8): IMAD.WIDE variants with 16 constant-bank operand slots. Extremely rare -- only 2 encoder functions use this format.

The 64-bit short-form encoders cover 27 opcode classes across 174 encoder functions total. The 128-bit encoders cover the remaining ~75+ opcode classes across 912+ encoder functions.

SM100 Encoding Variant Counts

Per-opcode variant counts for the SM100 (Blackwell datacenter) SASS encoder, extracted from the 683 concrete encoding handler functions at 0xED1520--0xFA5F10. Each function encodes one (opcode, operand-form) pair -- e.g., FFMA reg,reg,reg vs FFMA reg,reg,imm vs FFMA reg,reg,pred. The "Enc ID" column is the numeric value written to *(WORD*)(a2+12) by each handler, which maps to the SASS binary major opcode through the encoding dispatch megafunctions. The "SASS Mnemonic" column gives the canonical name from the 322-entry ROT13 opcode name table in InstructionInfo. Where two encoder IDs map to the same mnemonic (e.g. IADD3 IDs 0+1, LOP3 IDs 4+10), both are listed; the "Combined" column gives the merged count for that instruction.

Source: sweep report p1.14-sweep-0xED1000-0xFA6000.txt, ptxas v13.0.88.

Integer ALU

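| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 0 | 8 | IADD3 | 13 (IDs 0+1) | 23F1DF8, 23F1F08 |
| 1 | 5 | IADD3 | | 23F1DF8, 23F1F08 |
| 15 | 19 | IMAD | 19 | 23F1DF8, 23F2018 |
| 40 | 23 | IMAD (wide) | 23 | 23F1DF8, 23F21B0 |
| 42 | 34 | IMAD (extended) | 34 | 23F1DF8, 23F21B0 |
| 4 | 4 | LOP3 | 12 (IDs 4+10) | 23F2018 |
| 10 | 8 | LOP3 | | 23F2018 |
| 34 | 33 | ISETP | 33 | 23F1DF8, 23F29A8 |
| 30 | 2 | IMNMX | 2 | 23F1D70 |
| 43 | 13 | FLO | 13 | 23F1D70, 23F1DF8 |
| 44 | 4 | IABS | 4 | 23F1F08, 23F1F90 |
| 47 | 5 | POPC | 5 | 23F1F08, 23F1F90 |
| 49 | 2 | BREV | 2 | 23F1DF8 |
| 21 | 5 | SHF | 5 | 23F1DF8, 23F1F08 |
| 84 | 6 | SHF | 6 | 23F1F08, 23F1F90 |
| Subtotal | 171 | | | |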

FP32 ALU

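| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 13 | 30 | FFMA | 30 | 23F2018..23F2EF8 |
| 14 | 11 | FADD | 11 | 23F1F90, 23F2E70 |
| 22 | 18 | FMUL | 18 | 23F1DF8..23F2678 |
| 31 | 2 | FMNMX | 2 | 23F1D70 |
| 35 | 30 | FSETP | 30 | many formats |
| 33 | 2 | FSET/CSET | 2 | 23F2238 |
| 38 | 2 | FSWZADD | 2 | 23F2128 |
| 103 | 9 | extended FMA | 9 | 23F1DF8..23F2678 |
| Subtotal | 104 | | | |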

FP64 ALU

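| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 59 | 6 | DFMA | 6 | 23F2678, 23F2EF8 |
| 91 | 2 | DADD | 2 | 23F1DF8 |
| 57 | 5 | DMUL | 5 | 23F1F08 |
| 65 | 6 | DSETP | 6 | 23F2678, 23F2EF8 |
| Subtotal | 19 | | | |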

FP16 / Half-Precision

| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 23 | 18 | HFMA2/HMUL2 | 18 | 23F1DF8..23F2678 |
| 37 | 34 | HSETP2/DSETP | 34 | 23F1DF8, 23F21B0 |
| Subtotal | 52 | | | |

Data Movement

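| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 18 | 78 | MOV | 78 | many formats |
| 32 | 28 | SEL | 28 | 23F1D70, 23F1DF8 |
| 71 | 45 | P2R/R2P | 45 | many formats |
| 19 | 3 | PRMT | 3 | 23F1C60, 23F1D70 |
| 20 | 3 | LEA | 3 | 23F1DF8, 23F1F08 |
| 6 | 5 | S2R | 5 | 23F1F08, 23F1F90 |
| 7 | 2 | CS2R | 2 | 23F2018 |
| Subtotal | 164 | | | |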

Memory

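| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 27 | 24 | LDG/STG | 24 | 23F1F08, 23F29A8 |
| 77 | 18 | LDS/STS | 18 | 23F29A8 |
| 94 | 16 | LDL/STL | 16 | 23F29A8 |
| 74 | 6 | ST | 6 | 23F1DF8, 23F1F08 |
| 50 | 5 | ATOM/ATOMG | 5 | 23F1DF8, 23F1F08 |
| 81 | 6 | RED | 6 | 23F1F08, 23F1F90 |
| 100 | 3 | SULD | 3 | 23F1DF8, 23F1F08 |
| Subtotal | 78 | | | |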

Tensor Core

| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 78 | 35 | HMMA/IMMA | 35 | 23F1DF8, 23F29A8 |
| 90 | 5 | BMMA/QMMA | 5 | 23F2678 |
| Subtotal | 40 | | | |

Texture

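| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 5 | 1 | TLD | 1 | 23F1F08 |
| 8 | 2 | TEX | 2 | 23F1DF8, 23F1F90 |
| 9 | 1 | TLD4 | 1 | 23F1F08 |
| 88 | 2 | TEX (variant) | 2 | 23F1F08 |
| Subtotal | 6 | | | |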

Predicate / Warp

| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 79 | 7 | PLOP3 | 7 | 23F1F08..23F2018 |
| 82 | 6 | VOTE | 6 | 23F1F08, 23F1F90 |
| 48 | 7 | SHFL | 7 | 23F1D70, 23F1DF8 |
| Subtotal | 20 | | | |

Control Flow / Sync

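| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 17 | 1 | BRA | 1 | 23F1F08 |
| 73 | 10 | BAR | 10 | 23F1F08, 23F2238 |
| 92 | 1 | DEPBAR | 1 | 23F1F08 |
| 98 | 1 | MEMBAR | 1 | 23F1F08 |
| 11 | 14 | MUFU | 14 | 23F1F08, 23F1F90 |
| 45 | 1 | NOP | 1 | 23F1D70 |
| 46 | 1 | YIELD/EXIT | 1 | 23F2238 |
| Subtotal | 29 | | | |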

Totals

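| Category | Encoder Functions | Distinct Opcodes |
|---|---|---|
| Integer ALU | 171 | 15 (across 10 mnemonics) |
| FP32 ALU | 104 | 8 |
| FP64 ALU | 19 | 4 |
| FP16 | 52 | 2 |
| Data Movement | 164 | 7 |
| Memory | 78 | 7 |
| Tensor Core | 40 | 2 |
| Texture | 6 | 4 |
| Predicate/Warp | 20 | 3 |
| Control/Sync | 29 | 7 |
| Total | 683 | 59 |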

The top 5 instructions by variant count -- MOV (78), P2R/R2P (45), HMMA/IMMA (35), IMAD extended (34), HSETP2/DSETP (34) -- account for 226 of 683 encoders (33%). MOV alone accounts for 11.4% of all encoder functions because every possible source type (GPR, uniform reg, immediate, constant bank, predicate, special reg) and every destination type requires a separate encoder with a distinct operand signature and bitfield extraction sequence.

The 21 encoding format descriptors (xmmword groups) cluster into three tiers by usage: heavy (165+141+101 = 407 functions across 3 formats), medium (87+47+36 = 170 across 3 formats), and light (106 functions across 15 formats). The heavy-tier formats (23F1F08, 23F1DF8, 23F29A8) are the simple/compact, primary ALU, and memory/load-store formats respectively -- these three alone cover 60% of all SM100 encoders.
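The totals and percentages quoted in this section can be re-derived from the per-category subtotals. The sketch below only restates numbers already given above; the variable names are illustrative.

```python
# Encoder-function subtotals per category, from the tables above.
encoders_by_category = {
    "Integer ALU": 171, "FP32 ALU": 104, "FP64 ALU": 19, "FP16": 52,
    "Data Movement": 164, "Memory": 78, "Tensor Core": 40, "Texture": 6,
    "Predicate/Warp": 20, "Control/Sync": 29,
}
top5 = {"MOV": 78, "P2R/R2P": 45, "HMMA/IMMA": 35,
        "IMAD (extended)": 34, "HSETP2/DSETP": 34}
tiers = {"heavy": 165 + 141 + 101, "medium": 87 + 47 + 36, "light": 106}

def share(part: int, whole: int) -> float:
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

total = sum(encoders_by_category.values())
print(total, sum(top5.values()), sum(tiers.values()))
print(share(sum(top5.values()), total), share(top5["MOV"], total),
      share(tiers["heavy"], total))
```

The category subtotals and the three format tiers both sum to the same 683 handlers, which is a useful consistency check on the sweep data.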

Internal Index vs. Numeric Opcode

The index in this table (the position within the ROT13 name array) is the value stored in the Ori IR instruction's opcode field at offset +72 (lower 12 bits). However, this index is distinct from the encoded SASS major opcode in the binary instruction word. The mapping between IR opcode index and SASS binary major opcode is performed by the encoding dispatch tables (the "six megafunctions" at 0x10C0B20--0x10E32E0, which switch on up to 370 opcode category values from 0x0 through 0x171). A single IR opcode index may map to multiple SASS major opcodes depending on operand types and modifier bits, and vice versa.

Known IR-index-to-numeric correlations (confirmed from switch statements across multiple independent functions):

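| IR Index | Numeric (encoding switch) | Mnemonic |
|---|---|---|
| 1 | 0x59 | IMAD |
| 3 | 0x29 | IADD3 |
| 25 | (64-bit, no major) | NOP |
| 52 | (pseudo) | BB boundary |
| 77 | (64-bit, no major) | EXIT |
| 91 | 0x1E | ATOM |
| 95 | (64-bit, no major) | EXIT/RET |
| 96 | 0x38 | LDG |
| 221 | 0xDF | GMMA |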

Extended Mnemonic Table (sub_896D50)

A second, much larger mnemonic table is constructed by sub_896D50 (21KB, vtable off_21DA9F8). This "extended" table serves a different purpose from the primary 322-entry table: it is used during SASS disassembly input parsing (string-to-index lookup), whereas the primary table is used during encoding (index-to-string). The two tables share the same base class (sub_A2B110) but have different vtables and different object layouts.

Table Dimensions

| Property | Primary (sub_7A5D10) | Extended (sub_896D50) |
| --- | --- | --- |
| Entry count | 322 (indices 0--321) | 773 (indices 0--772) |
| Effective mnemonics | 306 (excl. 16 boundary markers) | 772 (excl. NONE sentinel) |
| Entry size | 16 bytes (8B ptr + 8B len) | 16 bytes (8B ptr + 8B len) |
| Object offset | +0x1058 (+4184) | +0x2C60 (+11360) |
| Ordering | By IR opcode index | Alphabetical by ROT13 name |
| Encoding category map | 322 x int32 at +0x2478 | 772 x int32 at +0x5CB0 (+23728), from unk_21D92E0 |
| Vtable | off_233ADC0 | off_21DA9F8 |
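The offsets in this table are mutually consistent: in both objects the encoding category map begins exactly where the 16-byte name entries end. A short check, using the slot counts from the "Entry count" row (the helper name `entry_offset` is illustrative):

```python
ENTRY_SIZE = 16  # 8-byte string pointer + 8-byte length


def entry_offset(table_base: int, index: int) -> int:
    """Object offset of name-table entry `index`."""
    return table_base + ENTRY_SIZE * index


PRIMARY_BASE, PRIMARY_COUNT = 0x1058, 322
EXTENDED_BASE, EXTENDED_COUNT = 0x2C60, 773

# One past the last entry of each name table.
primary_end = entry_offset(PRIMARY_BASE, PRIMARY_COUNT)
extended_end = entry_offset(EXTENDED_BASE, EXTENDED_COUNT)
print(hex(primary_end), hex(extended_end))
```

The results are 0x2478 and 0x5CB0, the two category-map offsets listed above.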

Why 772 Entries?

The extended table is 2.4x larger because it expands each base mnemonic into its modifier-qualified SASS forms. For example, the primary table stores one IMAD entry (index 1), but the extended table stores eight (the base form plus seven dot-variants):

| Extended entry | ROT13 | Description |
| --- | --- | --- |
| IMAD | VZNQ | Base form |
| IMAD.HI | VZNQ.UV | High-half variant |
| IMAD.WIDE | VZNQ.JVQR | 32x32->64 |
| IMAD.WIDE.READ.AB | VZNQ.JVQR.ERNQ.NO | Paired read, A+B |
| IMAD.WIDE.READ.CH | VZNQ.JVQR.ERNQ.PU | Paired read, C high |
| IMAD.WIDE.READ.CL | VZNQ.JVQR.ERNQ.PY | Paired read, C low |
| IMAD.WIDE.WRITE.DH | VZNQ.JVQR.JEVGR.QU | Paired write, D high |
| IMAD.WIDE.WRITE.DL | VZNQ.JVQR.JEVGR.QY | Paired write, D low |
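The ROT13 column can be reproduced with Python's standard `rot_13` text codec. This is only a decoding aid for strings pulled out of the binary, not a reconstruction of how ptxas itself decodes them; the helper name `rot13` is illustrative.

```python
import codecs


def rot13(text: str) -> str:
    # ROT13 rotates only A-Z/a-z; digits, dots and underscores pass through.
    return codecs.encode(text, "rot_13")


stored = [
    "VZNQ",
    "VZNQ.UV",
    "VZNQ.JVQR",
    "VZNQ.JVQR.ERNQ.NO",
    "VZNQ.JVQR.JEVGR.QY",
]
for s in stored:
    print(f"{s:20} -> {rot13(s)}")
```

Because ROT13 is its own inverse, the same function converts a cleartext mnemonic back into the stored form.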

Entry Composition

The 771 populated entries (from the decompiled string assignments at a1+11360 through a1+23712) break down as:

| Category | Count | Examples |
| --- | --- | --- |
| SASS base mnemonics (also in primary table) | 244 | IMAD, FADD, LDG, BRA, MOV, ... |
| SASS dot-modified variants | 125 | FENCE.G, ISETP.64, BAR.SYNC.DEFER_BLOCKING, HMMA.SP.16832.F16.* |
| SASS new base names (not in primary) | 81 | BGMMA, RPCMOV, SYNCS, MOV32I, SHL, SHR, LOP, BITEXTRACT |
| Mercury internal descriptors | 321 | MERCURY_addmin_srcs_r_ur_0, MERCURY_mbarrier_try_wait_... |
| Total SASS | 450 | |
| Total (SASS + Mercury) | 771 | |

Of the 450 SASS entries, 7 carry annotation text in parentheses: F2F (not F64), F2I (not *64), FRND (not F64), I2F (not F64), NANOSLEEP (with Rb), NANOTRAP (with Rb), WARPSYNC (with Rb). These annotations indicate operand-type restrictions or register-variant qualifiers used by the SASS parser to disambiguate instruction forms.

32-Bit Immediate Forms

These mnemonics represent SASS instructions with a 32-bit immediate operand packed directly into the instruction word. They do not appear as separate entries in the primary IR opcode table because the immediate form is selected during encoding based on operand type, not during IR construction:

| ROT13 | Mnemonic | Description |
| --- | --- | --- |
| SNQQ32V | FADD32I | FP32 add with 32-bit immediate |
| SSZN32V | FFMA32I | FP32 FMA with 32-bit immediate |
| SZHY32V | FMUL32I | FP32 multiply with 32-bit immediate |
| UNQQ2_32V | HADD2_32I | FP16x2 add with 32-bit immediate |
| USZN2_32V | HFMA2_32I | FP16x2 FMA with 32-bit immediate |
| UZHY2_32V | HMUL2_32I | FP16x2 multiply with 32-bit immediate |
| VNQQ32V | IADD32I | Integer add with 32-bit immediate |
| VNQQ2 | IADD2 | Two-input integer add (32I related) |
| VZHY32V | IMUL32I | Integer multiply with 32-bit immediate |
| VZHY32V.JVQR | IMUL32I.WIDE | Integer multiply-wide with 32-bit immediate |
| VFPNQQ32V | ISCADD32I | Integer scaled-add with 32-bit immediate |
| YBC32V | LOP32I | Logic operation with 32-bit immediate |
| ZBI32V | MOV32I | Move 32-bit immediate to register |
| ZBI64VHE | MOV64IUR | Move 64-bit immediate to uniform register |
| HYBC32V | ULOP32I | Uniform logic with 32-bit immediate |

Mercury Pseudo-Instructions (321 Entries)

Mercury pseudo-instructions are the single largest category in the extended table. They are not real SASS instructions -- they are internal pseudo-instructions representing Mercury IR operations that need mnemonic-string identity for diagnostic and dump output. They follow a rigid naming convention:

MERCURY_{operation}_{srcs|dests}_{regclass}_{variant_index}

Register class codes in the mnemonic:

  • r = GPR (R0--R255)
  • ur = Uniform register (UR0--UR63)
  • p = Predicate register (P0--P6)
  • simm = Signed immediate
  • uimm = Unsigned immediate
  • r2 / ur2 = Register pair

Representative entries (decoded from ROT13):

| ROT13 | Cleartext | Operation |
| --- | --- | --- |
| ZREPHEL__vage | MERCURY__intr | Generic intrinsic placeholder |
| ZREPHEL_nqqzva_fepf_e_he_0 | MERCURY_addmin_srcs_r_ur_0 | Fused add-min, GPR + uniform |
| ZREPHEL_nqqznk_fepf_he_e_0 | MERCURY_addmax_srcs_ur_r_0 | Fused add-max, uniform + GPR |
| ZREPHEL_ngbz_pnf_vag_npd_ery_... | MERCURY_atom_cas_int_acq_rel_... | Atomic CAS with acquire-release |
| ZREPHEL_flapf_neevir_n1g0_n0g1_... | MERCURY_syncs_arrive_a1t0_a0t1_... | Sync arrive with token spec |
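The naming convention above is regular enough to split mechanically. The parser below is a sketch under two assumptions that go beyond the text: the register-class field may hold several underscore-separated codes (as in `r_ur`), and the operation name may itself contain underscores. `MercuryName` and `parse_mercury` are hypothetical names; names that do not fit the pattern, such as `MERCURY__intr`, are rejected rather than guessed at.

```python
from typing import NamedTuple, Optional


class MercuryName(NamedTuple):
    operation: str
    role: str          # "srcs" or "dests"
    regclasses: tuple  # e.g. ("r", "ur")
    variant: int


def parse_mercury(name: str) -> Optional[MercuryName]:
    """Split a cleartext MERCURY_* mnemonic into its convention fields.

    Returns None when the name does not follow
    MERCURY_{operation}_{srcs|dests}_{regclass}_{variant_index}.
    """
    prefix = "MERCURY_"
    if not name.startswith(prefix):
        return None
    body = name[len(prefix):]
    for role in ("srcs", "dests"):
        # rpartition returns an empty separator when the marker is absent.
        operation, sep, tail = body.rpartition(f"_{role}_")
        if not sep:
            continue
        *regclasses, variant = tail.split("_")
        if operation and regclasses and variant.isdigit():
            return MercuryName(operation, role, tuple(regclasses), int(variant))
    return None


print(parse_mercury("MERCURY_addmin_srcs_r_ur_0"))
print(parse_mercury("MERCURY__intr"))
```

Entries shown with a trailing `...` in the table are truncated in this document and are deliberately not used as inputs here.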

New Base Mnemonics

These mnemonics appear in the extended table but have no base-name match anywhere in the primary 322-entry table. Some are legacy forms (pre-Volta mnemonics preserved for disassembly compatibility); others are specialized operations:

| ROT13 | Mnemonic | Category |
| --- | --- | --- |
| NPDOHYX | ACQBULK | CGA bulk resource acquire |
| OVGRKGENPG | BITEXTRACT | Bitfield extract |
| QRPBZCERFF | DECOMPRESS | Data decompression |
| VQC4N | IDP4A | Integer dot-product accumulate (4-element) |
| VZHY | IMUL | Integer multiply (non-fused, legacy) |
| VFPNQQ | ISCADD | Integer scaled-add (legacy LEA form) |
| YQTZP | LDGMC | Load global with memory consistency |
| YQG | LDT | Load from texture memory |
| YBC | LOP | Two-input logic operation (legacy) |
| CFRGC | PSETP | Predicate set-predicate |
| ERQT | REDG | Reduction, global (explicit address space) |
| FUY | SHL | Shift left (legacy, replaced by SHF) |
| FUE | SHR | Shift right (legacy, replaced by SHF) |
| FCNEFVSL | SPARSIFY | Convert dense to sparse format |
| FGG | STT | Store to texture memory |
| GNGBZT | TATOMG | Texture atomic, global scope |
| IVFRG | VISET | Vector integer set |
| JNECTEBHCFRG | WARPGROUPSET | Configure warpgroup parameters |

Modifier Suffix Patterns

Five distinct modifier suffix patterns are used in the extended table's dot-separated SASS mnemonics:

Pattern 1 -- Sub-operation mode. The suffix selects a functional sub-operation within a single hardware instruction. CCTL is a typical example, with 7 variants:

| Extended Mnemonic | Sub-operation |
| --- | --- |
| CCTL.C | Clean |
| CCTL.C.LDC | Clean via constant cache |
| CCTL.C.LDC.IVALL | Clean constant cache, invalidate all |
| CCTL.E.LDC | Evict via constant cache |
| CCTL.I | Invalidate |
| CCTL.LDCU | Load constant, uniform path |
| CCTL.QFAULT | Query fault status |

Also: SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK (8 variants); and BPT.DRAIN, BPT.PAUSE.

Pattern 2 -- Operand width. The .64 suffix (with optional .HI/.LO half-selectors) indicates 64-bit operand mode. Added for sm_104 (Blackwell Ultra):

| Extended Mnemonic | Base Opcode |
| --- | --- |
| ISETP.64, ISETP.64.HI, ISETP.64.LO | ISETP (idx 288) |
| IMNMX.64, IMNMX.64.HI, IMNMX.64.LO | IMNMX (idx 285) |
| IADD.64, IADD.64.HI, IADD.64.LO | IADD (idx 282) |
| IADD2.64, IADD2.64.HI, IADD2.64.LO | IADD2 |
| MOV.64, MOV.64.HI, MOV.64.LO | MOV (idx 290) |
| SEL.64, SEL.64.HI, SEL.64.LO | SEL (idx 292) |
| UMOV.64, USEL.64, UIADD3.64, UIMNMX.64, UISETP.64 | Uniform 64-bit variants |

Pattern 3 -- Data access direction. IMAD.WIDE has 5 sub-variants controlling which 32-bit half of the 64-bit accumulator is read or written. These correspond to the 256-bit instruction format (format code 0x8) with 16 constant-bank operand slots:

| Extended Mnemonic | Meaning |
| --- | --- |
| IMAD.WIDE | Default wide multiply-add |
| IMAD.WIDE.READ.AB | Read both A and B input halves |
| IMAD.WIDE.READ.CL / .CH | Read accumulator low / high half |
| IMAD.WIDE.WRITE.DL / .DH | Write result low / high half |
| IMAD.HI | High-half result only |

Pattern 4 -- Scope qualifier. Fences, barriers, UTC operations, and synchronization carry scope suffixes:

| Extended Mnemonic | Scope |
| --- | --- |
| FENCE.G | Global (GPU-wide) |
| FENCE.S | Shared/CTA |
| FENCE.T | Tensor (sm_100+) |
| UTCBAR.1CTA, UTCBAR.2CTA | 1-CTA / 2-CTA scope |
| UTCBAR.1CTA.FLUSH | 1-CTA with flush |
| BAR.SYNC.DEFER_BLOCKING | Deferred blocking sync |
| USETMAXREG.RELEASE | Release variant |
| USETSHMSZ.FLUSH | Flush variant |

Pattern 5 -- Shape and type descriptor. Tensor core operations carry shape geometry and data type. Brace-delimited alternation syntax indicates a single encoder handling multiple shapes:

| Extended Mnemonic | Meaning |
| --- | --- |
| HMMA.F32.{16816.F16\|16816.E8M7\|1688.E8M10} | FP16 MMA with FP32 accum, multiple shapes |
| HMMA.SP.16832.F16.* | Sparse FP16 MMA, 16x8x32 |
| IMMA.{8816.*\|8832.*} | Integer MMA, 8x8x16 or 8x8x32 |
| IMMA.SP.{16832.*\|16864.*4.*4} | Sparse integer MMA |
| QMMA.SF.SP | Structured + unstructured sparse |
| MUFU.EX2.LOW_ACC.{F16x2, BF16x2} | Low-accuracy EX2 for half types |

Top Opcodes by Dot-Variant Count

| Base Opcode | Variants | Category |
| --- | --- | --- |
| HMMA | 8 | Tensor core shape + sparse + FP type |
| SYNCS | 8 | Scope-aware synchronization modes |
| CCTL | 7 | Cache control sub-operations |
| IMAD | 7 | .HI, .WIDE, .WIDE.READ.*, .WIDE.WRITE.* |
| IMMA | 6 | Tensor core shape + sparse |
| QMMA | 6 | Shape + structured/unstructured sparse |
| USYNCS | 6 | Uniform sync scope modes |
| MUFU | 5 | .EX2, .RCP, .RSQ, .EX2 with half-precision |
| IADD | 4 | .64, .64.HI, .64.LO, .XOR |
| WARPGROUP | 3 | .ARRIVE, .DEPBAR, .WAIT |
| RPCMOV | 3 | .32, .32.READ, .64 |
| UTCBAR | 3 | .1CTA, .1CTA.FLUSH, .2CTA |
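A table like this can be rebuilt by grouping dot-modified mnemonics on the text before the first dot. The sketch below does this for a subset of the names listed on this page; `base_opcode` is an illustrative helper, and the input list is a sample, not the full extended table.

```python
from collections import Counter

# Dot-modified mnemonics copied from the category lists on this page (sample only).
extended = [
    "CCTL.C", "CCTL.C.LDC", "CCTL.C.LDC.IVALL", "CCTL.E.LDC",
    "CCTL.I", "CCTL.LDCU", "CCTL.QFAULT",
    "USYNCS.ARRIVE", "USYNCS.ARRIVE.MULTICAST", "USYNCS.CAS.EXCH",
    "USYNCS.CCTL", "USYNCS.LD", "USYNCS.PHASECHK",
    "UTCBAR.1CTA", "UTCBAR.1CTA.FLUSH", "UTCBAR.2CTA",
    "WARPGROUP.ARRIVE", "WARPGROUP.DEPBAR", "WARPGROUP.WAIT",
]


def base_opcode(mnemonic: str) -> str:
    """Everything before the first dot is treated as the base opcode."""
    return mnemonic.split(".", 1)[0]


# Only dot-modified names count as variants; bare base names are skipped.
variant_counts = Counter(base_opcode(m) for m in extended if "." in m)
for base, count in variant_counts.most_common():
    print(base, count)
```

For this sample the counts are CCTL 7, USYNCS 6, UTCBAR 3 and WARPGROUP 3, matching the corresponding rows above.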

Complete New SASS Mnemonics by Category

The following 206 SASS mnemonics appear only in the extended table -- they have no corresponding entry in the base 322-entry name table. Many represent modifier-suffixed forms of base opcodes; others are entirely new operations.

GMMA type-specialized (8): BGMMA, BGMMA_GSB, HGMMA, HGMMA_GSB, IGMMA, IGMMA_GSB, QGMMA, QGMMA_GSB

UTC type-specialized (20): UTCHMMA.1CTA, UTCHMMA.2CTA, UTCIMMA.1CTA, UTCIMMA.2CTA, UTCMXQMMA.1CTA, UTCMXQMMA.2CTA, UTCOMMA.1CTA, UTCOMMA.2CTA, UTCQMMA.1CTA, UTCQMMA.2CTA, UTCBAR.1CTA.FLUSH, UTCATOMSWS, UTCLDSWS, UTCSTSWS, UTCBAR.1CTA, UTCBAR.2CTA, UTCCP.1CTA, UTCCP.2CTA, UTCSHIFT.1CTA, UTCSHIFT.2CTA

DLC/DPC operations (13): UDLCBAR, UDLCCP, UDLCHMMA, UDLCIMMA, UDLCQMMA, UDPCBLKCP, UDPCBLKL2CCTL, UDPCBLKRED, UDPCTMACCTL, UDPCTMAL2CCTL, UDPCTMALDG, UDPCTMAREDG, UDPCTMASTG

Synchronization (17): SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK, SYNCSU.ARRIVE.A1T0, SYNCSU.ARRIVE.MULTICAST.A1T0, WARPGROUP.ARRIVE, WARPGROUP.DEPBAR, WARPGROUP.WAIT, WARPGROUPSET, BAR.SYNC.DEFER_BLOCKING, BPT.DRAIN, BPT.PAUSE

Uniform sync (6): USYNCS.ARRIVE, USYNCS.ARRIVE.MULTICAST, USYNCS.CAS.EXCH, USYNCS.CCTL, USYNCS.LD, USYNCS.PHASECHK

Integer 64-bit variants (20): IADD.64, IADD.64.HI, IADD.64.LO, IADD.XOR, IADD2, IADD2.64, IADD2.64.HI, IADD2.64.LO, IMNMX.64, IMNMX.64.HI, IMNMX.64.LO, ISETP.64, ISETP.64.HI, ISETP.64.LO, MOV.64, MOV.64.HI, MOV.64.LO, SEL.64, SEL.64.HI, SEL.64.LO

Uniform scalar extended (27): UIADD3.64, UIMNMX.64, UISETP.64, UMOV.64, USEL.64, ULOP, ULOP32I, UMEMSETS.64, UPSETP, UR2UP, USHL, USHR, UCCTL, UBLKL2CCTL, UCGABAR_ARV, UCGABAR_GET, UCGABAR_SET, UCGABAR_WAIT, USETMAXREG, USETMAXREG.RELEASE, USETSHMSZ, USETSHMSZ.FLUSH, UREDGR, UREGPRERELEASE, USTGR, UTRACEEVENT, UVIRTCOUNT

IMAD/IMUL variants (8): IMAD.HI, IMAD.WIDE.READ.AB, IMAD.WIDE.READ.CH, IMAD.WIDE.READ.CL, IMAD.WIDE.WRITE.DH, IMAD.WIDE.WRITE.DL, IMUL.WIDE, IMUL32I.WIDE

Tensor core shapes (28): HMMA.16816.F16.*, HMMA.1688.F16.*, HMMA.F32.{...} (4 entries), HMMA.SP.{...} (4 entries), IMMA.{...} (3 entries), IMMA.SP.{...} (3 entries), DMMA.1684, DMMA.1688, DMMA.16816, BMMA.88128, BMMA.168128, BMMA.168256, QMMA.16816, QMMA.16832, QMMA.SF, QMMA.SF.SP, QMMA.SP.16832, QMMA.SP.16864, OMMA.SP

FP extensions (16): FADD32I, FFMA32I, FMUL32I, FHADD, FHADD2, FHFMA, FHFMA2, FHMUL2, UFHADD, UFHFMA, UFMNMX, MUFU.EX2, MUFU.RCP, MUFU.RSQ, MUFU.EX2.{F16x2, BF16x2}, MUFU.EX2.LOW_ACC.{F16x2, BF16x2}

Cache control (7): CCTL.C, CCTL.C.LDC, CCTL.C.LDC.IVALL, CCTL.E.LDC, CCTL.I, CCTL.LDCU, CCTL.QFAULT

Texture extensions (8): TATOMG, TTUCLOSE, TTUGO, TTULD, TTULD_CLOSE, TTUMACROFUSE, TTUOPEN, TTUST

Fence/scope (3): FENCE.G, FENCE.S, FENCE.T

Data movement (8): MOV32I, MOV64IUR, RPCMOV, RPCMOV.32, RPCMOV.32.READ, RPCMOV.64, CS2R (base without size), DECOMPRESS

Memory (4): LDGMC, LDT, STT, REDG

Other new (13): ACQBULK, BRA_IMM, JMP_IMM, JMXU, NONE, PSETP, HADD2_32I, HFMA2_32I, HMUL2_32I, IADD32I, IMUL, LOP, LOP32I

Parallel Constructor Regions

The ROT13 string data for the extended table exists in two parallel regions that differ only in their MERCURY entries:

| Region | Address Range | SASS Entries | MERCURY Entries |
| --- | --- | --- | --- |
| 1 | 0x2039000--0x203A500 | 139 unique | 32 |
| 2 | 0x21CA000--0x21CB100 | 139 unique | 40 |

Region 2 has 8 additional MERCURY entries not in region 1, all for sm_100/sm_104 cluster barrier and atomic operations: MERCURY_barrier_cluster_arrive_sync_unaligned_* (4), MERCURY_atom_shared_cta_popc_inc_* (3), MERCURY_atom_shared_cta_int_acq_rel_* (1). This indicates at least two InstructionInfo variant objects for different target architectures, where the newer variant gains additional Mercury instruction templates.

Hash Table for O(1) Lookup

After populating the flat sorted array, sub_896D50 constructs a hash table for O(1) mnemonic lookup during SASS parsing. The hash table is allocated as a 488-byte header object with three backing arrays:

| Array | Slot size | Slots | Total bytes | Purpose |
| --- | --- | --- | --- | --- |
| 1 | 64 bytes | 772 | 49,408 | Open-addressing hash (key prefix + metadata) |
| 2 | 36 bytes | 772 | 27,792 | Auxiliary data per mnemonic |
| 3 | 16 bytes | 35 | 560 | Overflow / collision chain |

Array 1 slots are initialized to 0xFF (empty sentinel). The hash function used for lookup is the same FNV-1a variant used by sub_1377C60 for the primary table.
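To make the lookup scheme concrete, here is a toy model of an FNV-1a keyed open-addressing table. It is a sketch under stated assumptions, not the ptxas implementation: it uses the standard 32-bit FNV-1a constants (the text only says "FNV-1a variant", so the real function may differ), it assumes linear probing, and it does not model the auxiliary or overflow arrays. `MnemonicIndex` and the indices assigned below are hypothetical.

```python
FNV_OFFSET_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193   # standard 32-bit FNV prime
EMPTY = 0xFF                # sentinel the text reports for unused slots


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


class MnemonicIndex:
    """Toy open-addressing table mapping a mnemonic string to a table index."""

    def __init__(self, slots: int):
        self.slots = [EMPTY] * slots  # EMPTY marks an unused slot

    def _probe(self, key: str):
        start = fnv1a_32(key.encode("ascii")) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)  # linear probing (assumed)

    def insert(self, key: str, index: int) -> None:
        for pos in self._probe(key):
            if self.slots[pos] == EMPTY or self.slots[pos][0] == key:
                self.slots[pos] = (key, index)
                return
        raise RuntimeError("table full")

    def lookup(self, key: str):
        for pos in self._probe(key):
            if self.slots[pos] == EMPTY:
                return None  # hit an unused slot: key is absent
            if self.slots[pos][0] == key:
                return self.slots[pos][1]
        return None


table = MnemonicIndex(772)  # same slot count as arrays 1 and 2
for i, name in enumerate(["FENCE.G", "IMAD", "IMAD.HI"]):  # indices are illustrative
    table.insert(name, i)
print(table.lookup("IMAD.HI"), table.lookup("BOGUS"))
```

The array sizes in the table above also check out arithmetically: 64 x 772 = 49,408, 36 x 772 = 27,792, and 16 x 35 = 560.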

Object Tail Configuration

After building the tables and hash structure, the constructor:

  1. Queries 15 knobs via context+1664 (knobs 1, 2, 5, 11, 14, 18, 22, 25, 28, 273, 774, 775, 803, 983, 998) to conditionally register feature-gated instruction families at context+1728
  2. Stores knob 803's value at obj+108
  3. Sets the vtable to off_21DA9F8 (line 2438 in decompiled source)
  4. Writes feature bitmask 0x48018BA65 at obj+26856
  5. Stores the hash table pointer at obj+26832 and the arena pointer at obj+26840
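The constants in the steps above can be unpacked with a few lines of arithmetic. This sketch only decodes the numbers; it does not claim to know what each bit of the mask enables, and the field labels in the dictionary are hypothetical names for the offsets listed.

```python
FEATURE_MASK = 0x48018BA65  # value written at obj+26856 per the list above

# Positions of the bits that are set in the mask.
set_bits = [bit for bit in range(FEATURE_MASK.bit_length())
            if (FEATURE_MASK >> bit) & 1]
print(set_bits)
print(FEATURE_MASK.bit_length())

# Hypothetical labels for the tail fields, shown in decimal and hex.
tail_fields = {"hash_table_ptr": 26832, "arena_ptr": 26840, "feature_mask": 26856}
for name, offset in tail_fields.items():
    print(f"{name}: +{offset} = +{offset:#x}")
```

The mask has 13 bits set and needs 35 bits, so it cannot be a 32-bit store. The three tail offsets are +0x68d0, +0x68d8 and +0x68e8.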

Key Functions

| Address | Size | Role | Confidence |
| --- | --- | --- | --- |
| sub_7A5D10 | -- | InstructionInfo constructor; initializes the 322-entry ROT13 opcode name table at object offset +0x1058 and the 322-entry encoding category identity map at +0x2478 (vtable off_233ADC0) | 0.92 |
| sub_BE7390 | -- | Parallel InstructionInfo constructor; initializes an identical 322-entry name table | 0.90 |
| sub_7CB560 | -- | SASS printer; maps duplicate opcode indices (e.g., 284 vs 285) to distinct mnemonic strings (IMNMX vs IMNMX.64) based on operand metadata | 0.85 |
| sub_6575D0 | 49KB | Register-class-to-opcode dispatch; handles DMMA (index 215) shared dispatch with CVTA at cases 0xD6/0xD7 | 0.85 |
| sub_7482B0 | -- | Encoding path for ISETP (index 288, sm_104); handles case 0x120 for 64-bit integer set-predicate | 0.80 |
| sub_8380A0 | -- | Encoding path for ISETP (index 288, sm_104); second handler for case 0x120 | 0.80 |
| sub_896D50 | 21KB | Extended mnemonic table constructor; builds the 772-entry alphabetically-sorted SASS mnemonic lookup table at object offset +11360, with parallel 772-entry encoding category map from unk_21D92E0, plus 3-array hash table for O(1) string lookup during disassembly parsing (vtable off_21DA9F8) | 0.90 |
| sub_A2B110 | -- | Base class constructor shared by both primary (sub_7A5D10) and extended (sub_896D50) mnemonic table objects | 0.85 |