NVPTX Machine Opcode Reference
This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.
| Constraint table | word_3F3E6C0 (static .data array of 16-bit entries) |
| Constraint emitter | sub_B612D0 (104KB, 179-case switch) |
| Copy-type mapper | sub_3494EA0 (12.7KB, maps opcodes 1--0x12 to families 440--503) |
| Register class builder | sub_B5BA00 (21KB, 111 cases) |
| Operand type classifier | sub_34961A0 (26.6KB, reads byte_444C4A0) |
| ISel entry | sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) |
| Intrinsic lowering switch | sub_33B0210 (343KB, hundreds of NVVM intrinsics) |
Opcode Numbering Scheme
Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).
The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.
The Constraint Table: word_3F3E6C0
Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:
| Bits | Field | Purpose |
|---|---|---|
[7:0] (low byte) | constraint_class | Index into the 179-case switch in sub_B612D0 |
[15:8] (high byte) | register_class_id | Target register class for the instruction's primary result |
The access pattern, decompiled from sub_B612D0:
uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF; // low byte
uint8_t register_class = (entry >> 8) & 0xFF; // high byte
switch (constraint_class) {
case 0x00: ... // simple 2-input ALU
case 0x01: ... // 3-input FMA
...
case 0xB2: ... // maximum observed class
}
The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.
179 Constraint Classes
The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:
| Class Range | Pattern | Descriptor Count | Typical Instructions |
|---|---|---|---|
| 0x00--0x0F | Simple ALU (2 inputs, 1 output) | 3 | add, sub, mul, and, or, xor |
| 0x10--0x1F | Ternary (3 inputs, 1 output) | 4 | fma, madc, selp |
| 0x20--0x3F | Load/store variants | 2--5 | ld, st with address space and vector width |
| 0x40--0x5F | Conversion and move | 2--3 | cvt, mov, bitcast |
| 0x60--0x7F | Atomic and barrier | 3--6 | atom.*, membar, fence |
| 0x80--0x9F | Texture/surface | 4--12 | tex., sust., suld.* |
| 0xA0--0xAF | Tensor core (MMA) | 6--16 | hmma, imma, wmma, mma |
| 0xB0 | Maximum operand (17 inputs) | 18 | Complex intrinsic (opcode 176) |
| 0xB1--0xB2 | Miscellaneous high-operand-count | variable | Specialized instructions |
The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.
Register Class IDs in the High Byte
The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:
| ID | Register Class | PTX Type | PTX Prefix | Vtable Address |
|---|---|---|---|---|
| 14 | Int32Regs | .b32 | %r | off_4A025A0 |
| 22 | Int16Regs | .b16 | %rs | off_4A02720 |
| 40 | Float32Regs | .f32 | %f | off_4A02620 |
| 43 | Float16Regs | .b16 | %h | off_4A02760 |
| 50 | Int64Regs | .b64 | %rd | off_4A024A0 |
| 51 | Float64Regs | .f64 | %fd | off_4A02520 |
| 52 | Int128Regs | .b128 | %rq | off_4A02460 |
| 78 | PredRegs | .pred | %p | off_4A027A0 |
| 86 | SpecialRegs | (varies) | (varies) | off_4A026E0 |
Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.
Identified Opcode Families
The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.
Copy and Move Family (440--503)
These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.
The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:
| Opcode Range | Family | Description |
|---|---|---|
| 440--443 | Type-preserving moves | Same-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the byte_444C4A0 classification table. |
| 444--470 (approx.) | Cross-class moves | Bitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit mov instructions in PTX because the source and destination register types differ. |
| 471--490 (approx.) | Paired/wide moves | 128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by sub_3497B40. |
| 491--503 (approx.) | ABI parameter copies | .param-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing. |
The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:
struct OperandTypeEntry { // 16 bytes at byte_444C4A0[16 * mvt - 16]
uint8_t type_code; // +0: 12=i32, 13=i64, 15=f32, etc.
uint8_t size_class; // +1: size in register-width units
uint8_t register_bank; // +2: bank identifier
uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
uint8_t reserved[12]; // +4: padding
};
The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.
Call ABI Family (505--573)
These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.
| Opcode | Name | PTX Equivalent | Operands |
|---|---|---|---|
| 315 | CallSeqBegin | (pseudo) call frame setup | chain, seq_id, zero |
| 316 | CallSeqEnd_Outer | (pseudo) outer call frame teardown | chain, glue, callee_ref, callee_ref_hi |
| 505 | DeclareParam | .param .align A .b8 param[N] | chain, alignment, param_index, byte_size |
| 506 | DeclareScalarParam | .param .bW paramN | chain, alignment, param_index, widened_size |
| 507 | DeclareRetParam | .param .align A .b8 retval[N] | chain, alignment, byte_size, zero |
| 508 | DeclareRetScalarParam | .param .bW retval | chain, 1, widened_size, zero |
| 510 | CallDirect | call (retval), func, (params) | chain, callee, params... |
| 511 | CallDirectNoProto | call func, (params) (old-style) | chain, callee, params... |
| 512 | CallIndirect | call (retval), %rd, (params) | chain, func_ptr, params... |
| 513 | CallIndirectNoProto | call %rd, (params) | chain, func_ptr, params... |
| 514 | CallStart | (pseudo) actual call emission point | CallProto result |
| 515 | LoadRetParam | ld.param.bW retvalN | call_result, 1, element_index |
| 516 | LoadRetParamLast | ld.param.bW retvalN (last) | call_result, 1, element_index |
| 517 | CallSeqEnd | (pseudo) inner call frame teardown | last_load, chain, flag |
| 518 | CallProto | .callprototype | chain, callee, proto_string |
| 521 | DeclareRetParam_Ext | .param for return (ext path) | CallSeqEnd result, seq_id |
| 527 | StoreCalleeRetAddr | (pseudo) callee return addr | chain, proto_symbol |
| 528 | StoreRetValToParam | st.param.bW retvalN (return) | chain, value, offset |
The call sequence follows a strict emission order:
CallSeqBegin(315)
for each argument:
DeclareParam(505) or DeclareScalarParam(506)
StoreV1/V2/V4(571/572/573) — store argument values
DeclareRetParam(507) or DeclareRetScalarParam(508) [if callee returns]
CallProto(518)
CallStart(514) [actual call point]
for each return value:
LoadRetParam(515) or LoadRetParamLast(516)
CallSeqEnd(517)
DeclareRetParam_Ext(521) [if prototype present]
CallSeqEnd_Outer(316)
Vector Load/Store Family (568--573)
These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:
| Opcode | Name | PTX Equivalent | Vector Width |
|---|---|---|---|
| 568 | LoadV1 | ld.param.b32 / ld.param.b64 | 1 element |
| 569 | LoadV2 | ld.param.v2.b32 / ld.param.v2.b64 | 2 elements |
| 570 | LoadV4 | ld.param.v4.b32 / ld.param.v4.b64 | 4 elements |
| 571 | StoreV1 | st.param.b32 / st.param.b64 | 1 element |
| 572 | StoreV2 | st.param.v2.b32 / st.param.v2.b64 | 2 elements |
| 573 | StoreV4 | st.param.v4.b32 / st.param.v4.b64 | 4 elements |
The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):
accumulated_operand_count == 3 -> StoreV1 (571), width=1
accumulated_operand_count == 4 -> StoreV2 (572), width=2
accumulated_operand_count == 6 -> StoreV4 (573), width=4
other -> fatal error (unreachable)
The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).
Atomic Family (294--317, 462)
Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):
| Opcode Range | PTX Instruction | Types |
|---|---|---|
| 294--297 | atom.add | f32, f64, i32, i64 |
| 302--305 | atom.min | s32, s64, u32, u64 |
| 314--317 | atom.max | s32, s64, u32, u64 |
| 462 | atom.cas | generic (compare-and-swap) |
Within the PTX emission layer, the atomic operation is encoded in a packed operand word:
| Bits | Field | Values |
|---|---|---|
[7:4] | scope | 0=gpu (default), 1=cta, 2=sys |
[23:16] (BYTE2) | operation | 0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas |
Note that operation codes 0x02 and 0x04 are absent -- there is no signed atomic add or a second OR variant, matching the PTX ISA specification.
On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.
Barrier and Fence Family (287--290)
| Opcode | PTX Instruction | Scope |
|---|---|---|
| 287 | membar.gpu | GPU |
| 288 | membar.cta | CTA (thread block) |
| 289 | membar.sys | System |
| 290 | fence.sc.cluster | Cluster (SM 90+) |
The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.
Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):
| Operand Encoding | PTX Instruction |
|---|---|
bits[3:0]=0, bits[7:4]=0 | barrier.cluster.arrive |
bits[3:0]=0, bits[7:4]=1 | barrier.cluster.arrive.relaxed |
bits[3:0]=1, bits[7:4]=0 | barrier.cluster.wait |
bits[3:0]=1, bits[7:4]=1 | barrier.cluster.wait.relaxed |
NVPTXISD Custom DAG Opcodes (22--499)
These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:
| DAG Opcode | Identity | Notes |
|---|---|---|
| 22 | NVPTXISD::TargetAddr | Data pointer computation |
| 24 | NVPTXISD::Wrapper | Global address wrapping |
| 149 | NVPTXISD::ATOMIC_LOAD | Atomic load (lowered from IR atomic) |
| 152 | NVPTXISD::SELECT_CC | Conditional select (ternary) |
| 189 | NVPTXISD::MoveParam | Thread index and parameter moves |
| 193--196 | NVPTXISD::MIN/MAX | Min/max variants (2- and 3-source) |
| 197 | NVPTXISD::CTPOP | Population count |
| 198--204 | NVPTXISD::ConstantPool | Constant pool entry variants |
| 208 | NVPTXISD::CMPXCHG | Compare-and-exchange |
| 213--214 | NVPTXISD::STORE_SIGNED | Store with sign-extension flag |
| 215 | NVPTXISD::AddrSpaceCast | Address space conversion (within lowering) |
| 230 | NVPTXISD::DeclareLocal | Declare local variable / address of param |
| 233--234 | NVPTXISD::AddrSpaceCast pair | Two-step address space cast |
| 245--274 | NVPTXISD::MathOp_RN/RZ/RM/RP | Rounded math (add, mul, sqrt, div, fma) |
| 310 | NVPTXISD::Annotation | PTX .pragma annotation |
| 321 | NVPTXISD::StackRestore | Stack pointer restore |
| 322 | NVPTXISD::StackAlloc | Dynamic stack allocation |
| 330 | NVPTXISD::FunctionAddr | Function address (for indirect calls) |
| 335 | NVPTXISD::BinaryArith | Two-operand arithmetic |
| 371 | NVPTXISD::DynAreaOffset | Dynamic alloca offset |
| 499 | NVPTXISD::ConditionalBranch | Conditional branch with .param alloc |
The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:
| Intrinsic ID | NVPTXISD Opcode | PTX Operation |
|---|---|---|
| 63 | 249 | add.rz |
| 64 | 255 | mul.rz |
| 89 | 267 | fma.rz |
| 170 | 245 | add.rm |
| 172 | 274 | mul.rm |
| 250 | 271 | fma.rm |
| 308 | 270 | add.rp |
| 309 | 272 | mul.rp |
| 310 | 273 | fma.rp |
| 325 | 248 | sqrt.rz |
| 328 | 254 | sqrt.rm |
| 335 | 246 | sqrt.rp |
| 348 | 250 | div.rz |
| 349 | 256 | div.rm |
| 355 | 269 | div.rp |
MMA / Tensor Core Opcodes
Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.
Pre-Blackwell (SM 70--90) families:
| Function | Family | PTX Base | Min SM |
|---|---|---|---|
sub_21E0360 | HMMA load A/B | wmma.load.a / wmma.load.b | 70 |
sub_21E0630 | HMMA load C | wmma.load.c | 70 |
sub_21DFBF0 | HMMA store C | wmma.store.c | 70 |
sub_21E0870 | HMMA MMA | wmma.mma / mma | 70 |
sub_21E1280 | IMMA load A/B | wmma.load.a (int) | 72 |
sub_21E15D0 | IMMA load C | wmma.load.c (int) | 72 |
sub_21E1830 | IMMA store C | wmma.store.c (int) | 72 |
sub_21E1D20 | IMMA MMA | mma (integer, with saturation) | 72 |
sub_21E2280 | BMMA MMA | mma (binary, b1.and.popc / b1.xor.popc) | 75 |
Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.
Blackwell tcgen05 (SM 100+):
Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:
| Bit | Field | Values |
|---|---|---|
| 0 | scaleD | 0 or 1 |
| 1 | negA | 0=positive, 1=negative |
| 2 | negB | 0=positive, 1=negative |
| 3 | transA | 0=normal, 1=transposed |
| 4 | transB | 0=normal, 1=transposed |
| 5 | sparsity | structured sparsity enable |
[8:6] | type encoding | mxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16 |
Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (SM 100 decimal).
MMA Shape and Type Encoding
The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:
Shape codes (bits [39:32]):
| Code | Shape | PTX String | Min SM |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | 70 |
| 0x02 | m8n8k16 | "m8n8k16" | 72 |
| 0x03 | m8n8k32 | "m8n8k32" | 75 |
| 0x04 | m8n8k64 | "m8n8k64" | 75 |
| 0x05 | m8n8k128 | "m8n8k128" | 75 |
| 0x10 | m16n8k4 | "m16n8k4" | 80 |
| 0x11 | m16n8k8 | "m16n8k8" | 75 |
| 0x12 | m16n8k16 | "m16n8k16" | 80 |
| 0x13 | m16n8k32 | "m16n8k32" | 75 |
| 0x14 | m16n8k64 | "m16n8k64" | 75 |
| 0x15 | m16n8k128 | "m16n8k128" | 75 |
| 0x16 | m16n8k256 | "m16n8k256" | 75 |
| 0x17 | m16n16k16 | "m16n16k16" | 90 |
| 0x18 | m32n8k16 | "m32n8k16" | 90? |
| 0x19 | m16n16k8 | "m16n16k8" | 70 |
Data type codes (in aty/bty fields):
| Code | Type | Bits | PTX |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Special Register Access
Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:
| Operand | Register | Description |
|---|---|---|
| 0x26 | %tid.x | Thread ID, X |
| 0x27 | %tid.y | Thread ID, Y |
| 0x28 | %tid.z | Thread ID, Z |
| 0x29 | %ntid.x | Block dimension, X |
| 0x2A | %ntid.y | Block dimension, Y |
| 0x2B | %ntid.z | Block dimension, Z |
| 0x2C | %ctaid.x | Block ID, X |
| 0x2D | %ctaid.y | Block ID, Y |
| 0x2E | %ctaid.z | Block ID, Z |
| 0x2F | %nctaid.x | Grid dimension, X |
| 0x30 | %nctaid.y | Grid dimension, Y |
| 0x31 | %nctaid.z | Grid dimension, Z |
| 0x5E | (dynamic) | %warpid / %laneid (via sub_3958DA0) |
| 0x5F | (dynamic) | %nwarpid or similar (via sub_3958DA0) |
Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.
Address Space Conversion
The cvta instruction family is emitted by sub_21E7FE0:
| Operand Value | Suffix | Full Instruction |
|---|---|---|
| 0 | (none) | cvta (generic) |
| 1 | .global | cvta.to.global / cvta.global |
| 3 | .shared | cvta.to.shared / cvta.shared |
| 4+ | .local | cvta.to.local / cvta.local |
Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).
Constraint Emission Pipeline
The full path from opcode to emitted constraint:
sub_B612D0(emitter_state, opcode):
// Step 1: Table lookup
entry = word_3F3E6C0[opcode - 1]
reg_class = entry >> 8
constraint_class = entry & 0xFF
// Step 2: Build descriptor array on stack
switch (constraint_class):
case 0x00:
// Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
desc[0] = {kind=0, value=sub_A778C0(state, reg_class, flags)}
desc[1] = {kind=1, value=sub_A778C0(state, reg_class, flags)}
desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
sub_A78010(state, desc, 3)
case 0x01:
// Ternary FMA: {op0, op1, op2, result}
desc[0..2] = three input constraints
desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
sub_A78010(state, desc, 4)
...
case 0xB0:
// 17-input complex: 17 input constraints + 1 output
for i in 0..16:
desc[i] = {kind=i, value=...}
desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
sub_A78010(state, desc, 18)
Key helper functions:
| Address | Function | Purpose |
|---|---|---|
sub_A778C0 | createRegClassConstraint(state, regclass, flags) | Build input operand constraint for a specific register class |
sub_A77AD0 | createAnyRegConstraint(state, flags) | Build an unconstrained ("any register") input constraint |
sub_A79C90 | composeConstraints(state, desc, N) | Merge N descriptors into a single composite constraint |
sub_B5BA00 | createOutputConstraint(state, regclass_id) | Build the output/result constraint |
sub_A78010 | emitConstraint(state, desc_array, N) | Finalize and emit the constraint with N entries |
sub_B612D0 | emitInstrConstraint(state, opcode) | Top-level entry: table lookup + switch + emit |
The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.
Complete Identified Opcode Summary
The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.
| Opcode | Identity | Family | Evidence Source |
|---|---|---|---|
| 0--~430 | Generic LLVM TargetOpcode | LLVM standard | upstream LLVM 20.0.0 |
| 440--443 | Type-preserving moves | Copy | register coalescer (sub_3494EA0) |
| 444--503 | Cross-class / wide / ABI copies | Copy | register coalescer (sub_3494EA0) |
| 294--297 | atom.add (f32/f64/i32/i64) | Atomic | DAG legalization (sub_20BED60) |
| 302--305 | atom.min (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 314--317 | atom.max (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 315 | CallSeqBegin | Call ABI | LowerCall (sub_3040BF0) |
| 316 | CallSeqEnd_Outer | Call ABI | LowerCall |
| 462 | atom.cas | Atomic | DAG legalization |
| 499 | ConditionalBranch | Control | intrinsic lowering |
| 505 | DeclareParam | Call ABI | LowerCall |
| 506 | DeclareScalarParam | Call ABI | LowerCall |
| 507 | DeclareRetParam | Call ABI | LowerCall |
| 508 | DeclareRetScalarParam | Call ABI | LowerCall |
| 510 | CallDirect | Call ABI | LowerCall |
| 511 | CallDirectNoProto | Call ABI | LowerCall |
| 512 | CallIndirect | Call ABI | LowerCall |
| 513 | CallIndirectNoProto | Call ABI | LowerCall |
| 514 | CallStart | Call ABI | LowerCall |
| 515 | LoadRetParam | Call ABI | LowerCall |
| 516 | LoadRetParamLast | Call ABI | LowerCall |
| 517 | CallSeqEnd | Call ABI | LowerCall |
| 518 | CallProto | Call ABI | LowerCall |
| 521 | DeclareRetParam_Ext | Call ABI | LowerCall |
| 527 | StoreCalleeRetAddr | Call ABI | LowerCall |
| 528 | StoreRetValToParam | Call ABI | LowerCall |
| 568 | LoadV1 | Vector Param | LowerCall |
| 569 | LoadV2 | Vector Param | LowerCall |
| 570 | LoadV4 | Vector Param | LowerCall |
| 571 | StoreV1 | Vector Param | LowerCall |
| 572 | StoreV2 | Vector Param | LowerCall |
| 573 | StoreV4 | Vector Param | LowerCall |
| 4905--4940 | tcgen05.mma (10 shape variants) | Tensor Core | Blackwell emission (sub_21E8CD0) |
Gaps and Unknown Ranges
The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:
| Range | Likely Contents | Evidence |
|---|---|---|
| 430--439 | Transition zone (generic-to-target boundary) | Adjacent to copy family |
| 574--~800 | Global/shared/local loads and stores | Large gap between param-store and first identified general opcode |
| 800--~1500 | Texture and surface instructions | sub_33B0210 intrinsic switch references hundreds of tex/surf intrinsics |
| 1500--~3000 | Shuffle, vote, match, redux | Warp-level intrinsic families |
| 3000--~4000 | WGMMA, TMA, bulk operations | Hopper-era instruction families |
| 4000--4904 | Additional tensor/cluster instructions | Bridging pre-Blackwell and tcgen05 |
Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
Constraint emission (179-case switch on word_3F3E6C0) | sub_B612D0 | 104KB | -- |
| Register class set builder (111 cases) | sub_B5BA00 | 21KB | -- |
| Operand type decoder (101 cases) | sub_B6B200 | 44KB | -- |
createRegClassConstraint(state, regclass, flags) | sub_A778C0 | -- | -- |
createAnyRegConstraint(state, flags) | sub_A77AD0 | -- | -- |
composeConstraints(state, desc, N) | sub_A79C90 | -- | -- |
emitConstraint(state, desc_array, N) | sub_A78010 | -- | -- |
| Opcode-to-copy-type mapping (switch, families 440--503) | sub_3494EA0 | 12.7KB | -- |
Operand-type classification (reads byte_444C4A0) | sub_34961A0 | 26.6KB | -- |
| Register-pair decomposition (wide/paired registers) | sub_3497B40 | 16.5KB | -- |
NVPTXTargetLowering::LowerCall (call ABI opcodes) | sub_3040BF0 | 88KB | -- |
| Intrinsic lowering switch (NVVM intrinsic to opcode) | sub_33B0210 | 343KB | -- |
NVPTXDAGToDAGISel::Select (ISel entry) | sub_3090F90 | 91KB | -- |
| MMA instruction builder (packed descriptor) | sub_21E74C0 | 17KB | -- |
| Atomic operation PTX emission (base) | sub_21E5E70 | -- | -- |
| L2 cache-hinted atomic PTX emission (SM 80+) | sub_21E6420 | -- | -- |
| Memory barrier PTX emission | sub_21E94F0 | -- | -- |
| Cluster barrier PTX emission (SM 90+) | sub_21E8EA0 | -- | -- |
| Special register PTX emission | sub_21E86B0 | -- | -- |
| Cluster special register PTX emission (SM 90+) | sub_21E9060 | -- | -- |
| Address space conversion (cvta) PTX emission | sub_21E7FE0 | -- | -- |
| tcgen05 Blackwell MMA emission (SM 100+) | sub_21E8CD0 | -- | -- |
| Register class to encoded ID mapping | sub_21583D0 | -- | -- |
| Register class to PTX type suffix | sub_2163730 | -- | -- |
| Register class to PTX register prefix | sub_21638D0 | -- | -- |
Global Data References
| Symbol | Address | Purpose |
|---|---|---|
word_3F3E6C0 | 0x3F3E6C0 | Constraint table (16-bit entries, indexed by opcode-1) |
byte_444C4A0 | 0x444C4A0 | MVT/operand type table (16-byte entries, indexed by MVT enum) |
word_4456340 | 0x4456340 | MVT to vector element count (16-bit entries) |
word_4456580 | 0x4456580 | MVT to scalarized MVT (16-bit entries) |
byte_3F252E0 | 0x3F252E0 | Constraint type classification table |
qword_502A920 | 0x502A920 | SM processor table (45 entries, stride-2) |
Cross-References
- Pattern Database -- detailed constraint descriptor layout and emission sub-functions
- Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
- Code Generation -- pipeline overview including ISel, RA, and machine-level passes
- InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
- Register Allocation -- greedy RA that consumes constraint table data
- AsmPrinter -- the PTX emission layer that converts these opcodes to text