NVPTX Machine Opcode Reference

This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.


Constraint table	`word_3F3E6C0` (static `.data` array of 16-bit entries)
Constraint emitter	`sub_B612D0` (104KB, 179-case switch)
Copy-type mapper	`sub_3494EA0` (12.7KB, maps opcodes 1--0x12 to families 440--503)
Register class builder	`sub_B5BA00` (21KB, 111 cases)
Operand type classifier	`sub_34961A0` (26.6KB, reads `byte_444C4A0`)
ISel entry	`sub_3090F90` (91KB, `NVPTXDAGToDAGISel::Select`)
Intrinsic lowering switch	`sub_33B0210` (343KB, hundreds of NVVM intrinsics)

Opcode Numbering Scheme

Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).

The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.

The Constraint Table: `word_3F3E6C0`

Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:

Bits	Field	Purpose
`[7:0]` (low byte)	`constraint_class`	Index into the 179-case switch in `sub_B612D0`
`[15:8]` (high byte)	`register_class_id`	Target register class for the instruction's primary result

The access pattern, decompiled from sub_B612D0:

uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF;         // low byte
uint8_t register_class   = (entry >> 8) & 0xFF;  // high byte

switch (constraint_class) {
    case 0x00: ...  // simple 2-input ALU
    case 0x01: ...  // 3-input FMA
    ...
    case 0xB2: ...  // maximum observed class
}

The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.

179 Constraint Classes

The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:

Class Range	Pattern	Descriptor Count	Typical Instructions
0x00--0x0F	Simple ALU (2 inputs, 1 output)	3	add, sub, mul, and, or, xor
0x10--0x1F	Ternary (3 inputs, 1 output)	4	fma, madc, selp
0x20--0x3F	Load/store variants	2--5	ld, st with address space and vector width
0x40--0x5F	Conversion and move	2--3	cvt, mov, bitcast
0x60--0x7F	Atomic and barrier	3--6	atom.*, membar, fence
0x80--0x9F	Texture/surface	4--12	tex., sust., suld.*
0xA0--0xAF	Tensor core (MMA)	6--16	hmma, imma, wmma, mma
0xB0	Maximum operand (17 inputs)	18	Complex intrinsic (opcode 176)
0xB1--0xB2	Miscellaneous high-operand-count	variable	Specialized instructions

The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.

Register Class IDs in the High Byte

The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:

ID	Register Class	PTX Type	PTX Prefix	Vtable Address
14	`Int32Regs`	`.b32`	`%r`	`off_4A025A0`
22	`Int16Regs`	`.b16`	`%rs`	`off_4A02720`
40	`Float32Regs`	`.f32`	`%f`	`off_4A02620`
43	`Float16Regs`	`.b16`	`%h`	`off_4A02760`
50	`Int64Regs`	`.b64`	`%rd`	`off_4A024A0`
51	`Float64Regs`	`.f64`	`%fd`	`off_4A02520`
52	`Int128Regs`	`.b128`	`%rq`	`off_4A02460`
78	`PredRegs`	`.pred`	`%p`	`off_4A027A0`
86	`SpecialRegs`	(varies)	(varies)	`off_4A026E0`

Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.

Identified Opcode Families

The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.

Copy and Move Family (440--503)

These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.

The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:

Opcode Range	Family	Description
440--443	Type-preserving moves	Same-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the `byte_444C4A0` classification table.
444--470 (approx.)	Cross-class moves	Bitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit `mov` instructions in PTX because the source and destination register types differ.
471--490 (approx.)	Paired/wide moves	128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by `sub_3497B40`.
491--503 (approx.)	ABI parameter copies	`.param`-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing.

The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:

struct OperandTypeEntry {    // 16 bytes at byte_444C4A0[16 * mvt - 16]
    uint8_t type_code;       // +0: 12=i32, 13=i64, 15=f32, etc.
    uint8_t size_class;      // +1: size in register-width units
    uint8_t register_bank;   // +2: bank identifier
    uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
    uint8_t reserved[12];    // +4: padding
};

The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.

Call ABI Family (505--573)

These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.

Opcode	Name	PTX Equivalent	Operands
315	`CallSeqBegin`	(pseudo) call frame setup	chain, seq_id, zero
316	`CallSeqEnd_Outer`	(pseudo) outer call frame teardown	chain, glue, callee_ref, callee_ref_hi
505	`DeclareParam`	`.param .align A .b8 param[N]`	chain, alignment, param_index, byte_size
506	`DeclareScalarParam`	`.param .bW paramN`	chain, alignment, param_index, widened_size
507	`DeclareRetParam`	`.param .align A .b8 retval[N]`	chain, alignment, byte_size, zero
508	`DeclareRetScalarParam`	`.param .bW retval`	chain, 1, widened_size, zero
510	`CallDirect`	`call (retval), func, (params)`	chain, callee, params...
511	`CallDirectNoProto`	`call func, (params)` (old-style)	chain, callee, params...
512	`CallIndirect`	`call (retval), %rd, (params)`	chain, func_ptr, params...
513	`CallIndirectNoProto`	`call %rd, (params)`	chain, func_ptr, params...
514	`CallStart`	(pseudo) actual call emission point	CallProto result
515	`LoadRetParam`	`ld.param.bW retvalN`	call_result, 1, element_index
516	`LoadRetParamLast`	`ld.param.bW retvalN` (last)	call_result, 1, element_index
517	`CallSeqEnd`	(pseudo) inner call frame teardown	last_load, chain, flag
518	`CallProto`	`.callprototype`	chain, callee, proto_string
521	`DeclareRetParam_Ext`	`.param` for return (ext path)	CallSeqEnd result, seq_id
527	`StoreCalleeRetAddr`	(pseudo) callee return addr	chain, proto_symbol
528	`StoreRetValToParam`	`st.param.bW retvalN` (return)	chain, value, offset

The call sequence follows a strict emission order:

CallSeqBegin(315)
  for each argument:
    DeclareParam(505) or DeclareScalarParam(506)
    StoreV1/V2/V4(571/572/573) — store argument values
  DeclareRetParam(507) or DeclareRetScalarParam(508)  [if callee returns]
  CallProto(518)
  CallStart(514)                                       [actual call point]
  for each return value:
    LoadRetParam(515) or LoadRetParamLast(516)
  CallSeqEnd(517)
  DeclareRetParam_Ext(521)                             [if prototype present]
CallSeqEnd_Outer(316)

Vector Load/Store Family (568--573)

These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:

Opcode	Name	PTX Equivalent	Vector Width
568	`LoadV1`	`ld.param.b32` / `ld.param.b64`	1 element
569	`LoadV2`	`ld.param.v2.b32` / `ld.param.v2.b64`	2 elements
570	`LoadV4`	`ld.param.v4.b32` / `ld.param.v4.b64`	4 elements
571	`StoreV1`	`st.param.b32` / `st.param.b64`	1 element
572	`StoreV2`	`st.param.v2.b32` / `st.param.v2.b64`	2 elements
573	`StoreV4`	`st.param.v4.b32` / `st.param.v4.b64`	4 elements

The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):

accumulated_operand_count == 3  ->  StoreV1 (571), width=1
accumulated_operand_count == 4  ->  StoreV2 (572), width=2
accumulated_operand_count == 6  ->  StoreV4 (573), width=4
other                           ->  fatal error (unreachable)

The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).

Atomic Family (294--317, 462)

Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):

Opcode Range	PTX Instruction	Types
294--297	`atom.add`	f32, f64, i32, i64
302--305	`atom.min`	s32, s64, u32, u64
314--317	`atom.max`	s32, s64, u32, u64
462	`atom.cas`	generic (compare-and-swap)

Within the PTX emission layer, the atomic operation is encoded in a packed operand word:

Bits	Field	Values
`[7:4]`	scope	0=gpu (default), 1=cta, 2=sys
`[23:16]` (BYTE2)	operation	0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas

Note that operation codes 0x02 and 0x04 are absent -- there is no signed atomic add or a second OR variant, matching the PTX ISA specification.

On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.

Barrier and Fence Family (287--290)

Opcode	PTX Instruction	Scope
287	`membar.gpu`	GPU
288	`membar.cta`	CTA (thread block)
289	`membar.sys`	System
290	`fence.sc.cluster`	Cluster (SM 90+)

The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.

Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):

Operand Encoding	PTX Instruction
`bits[3:0]=0, bits[7:4]=0`	`barrier.cluster.arrive`
`bits[3:0]=0, bits[7:4]=1`	`barrier.cluster.arrive.relaxed`
`bits[3:0]=1, bits[7:4]=0`	`barrier.cluster.wait`
`bits[3:0]=1, bits[7:4]=1`	`barrier.cluster.wait.relaxed`

NVPTXISD Custom DAG Opcodes (22--499)

These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:

DAG Opcode	Identity	Notes
22	`NVPTXISD::TargetAddr`	Data pointer computation
24	`NVPTXISD::Wrapper`	Global address wrapping
149	`NVPTXISD::ATOMIC_LOAD`	Atomic load (lowered from IR atomic)
152	`NVPTXISD::SELECT_CC`	Conditional select (ternary)
189	`NVPTXISD::MoveParam`	Thread index and parameter moves
193--196	`NVPTXISD::MIN/MAX`	Min/max variants (2- and 3-source)
197	`NVPTXISD::CTPOP`	Population count
198--204	`NVPTXISD::ConstantPool`	Constant pool entry variants
208	`NVPTXISD::CMPXCHG`	Compare-and-exchange
213--214	`NVPTXISD::STORE_SIGNED`	Store with sign-extension flag
215	`NVPTXISD::AddrSpaceCast`	Address space conversion (within lowering)
230	`NVPTXISD::DeclareLocal`	Declare local variable / address of param
233--234	`NVPTXISD::AddrSpaceCast` pair	Two-step address space cast
245--274	`NVPTXISD::MathOp_RN/RZ/RM/RP`	Rounded math (add, mul, sqrt, div, fma)
310	`NVPTXISD::Annotation`	PTX `.pragma` annotation
321	`NVPTXISD::StackRestore`	Stack pointer restore
322	`NVPTXISD::StackAlloc`	Dynamic stack allocation
330	`NVPTXISD::FunctionAddr`	Function address (for indirect calls)
335	`NVPTXISD::BinaryArith`	Two-operand arithmetic
371	`NVPTXISD::DynAreaOffset`	Dynamic alloca offset
499	`NVPTXISD::ConditionalBranch`	Conditional branch with `.param` alloc

The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:

Intrinsic ID	NVPTXISD Opcode	PTX Operation
63	249	`add.rz`
64	255	`mul.rz`
89	267	`fma.rz`
170	245	`add.rm`
172	274	`mul.rm`
250	271	`fma.rm`
308	270	`add.rp`
309	272	`mul.rp`
310	273	`fma.rp`
325	248	`sqrt.rz`
328	254	`sqrt.rm`
335	246	`sqrt.rp`
348	250	`div.rz`
349	256	`div.rm`
355	269	`div.rp`

MMA / Tensor Core Opcodes

Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.

Pre-Blackwell (SM 70--90) families:

Function	Family	PTX Base	Min SM
`sub_21E0360`	HMMA load A/B	`wmma.load.a` / `wmma.load.b`	70
`sub_21E0630`	HMMA load C	`wmma.load.c`	70
`sub_21DFBF0`	HMMA store C	`wmma.store.c`	70
`sub_21E0870`	HMMA MMA	`wmma.mma` / `mma`	70
`sub_21E1280`	IMMA load A/B	`wmma.load.a` (int)	72
`sub_21E15D0`	IMMA load C	`wmma.load.c` (int)	72
`sub_21E1830`	IMMA store C	`wmma.store.c` (int)	72
`sub_21E1D20`	IMMA MMA	`mma` (integer, with saturation)	72
`sub_21E2280`	BMMA MMA	`mma` (binary, `b1.and.popc` / `b1.xor.popc`)	75

Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.

Blackwell tcgen05 (SM 100+):

Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:

Bit	Field	Values
0	scaleD	0 or 1
1	negA	0=positive, 1=negative
2	negB	0=positive, 1=negative
3	transA	0=normal, 1=transposed
4	transB	0=normal, 1=transposed
5	sparsity	structured sparsity enable
`[8:6]`	type encoding	mxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16

Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (SM 100 decimal).

MMA Shape and Type Encoding

The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:

Shape codes (bits [39:32]):

Code	Shape	PTX String	Min SM
0x01	m8n8k4	`"m8n8k4"`	70
0x02	m8n8k16	`"m8n8k16"`	72
0x03	m8n8k32	`"m8n8k32"`	75
0x04	m8n8k64	`"m8n8k64"`	75
0x05	m8n8k128	`"m8n8k128"`	75
0x10	m16n8k4	`"m16n8k4"`	80
0x11	m16n8k8	`"m16n8k8"`	75
0x12	m16n8k16	`"m16n8k16"`	80
0x13	m16n8k32	`"m16n8k32"`	75
0x14	m16n8k64	`"m16n8k64"`	75
0x15	m16n8k128	`"m16n8k128"`	75
0x16	m16n8k256	`"m16n8k256"`	75
0x17	m16n16k16	`"m16n16k16"`	90
0x18	m32n8k16	`"m32n8k16"`	90?
0x19	m16n16k8	`"m16n16k8"`	70

Data type codes (in aty/bty fields):

Code	Type	Bits	PTX
1	b1	1	`"b1"`
2	s4	4	`"s4"`
3	u4	4	`"u4"`
4	s8	8	`"s8"`
5	u8	8	`"u8"`
6	f16	16	`"f16"`
7	bf16	16	`"bf16"`
8	tf32	19	`"tf32"`
9	f64	64	`"f64"`
10	f32	32	`"f32"`
11	s32	32	`"s32"`

Special Register Access

Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:

Operand	Register	Description
0x26	`%tid.x`	Thread ID, X
0x27	`%tid.y`	Thread ID, Y
0x28	`%tid.z`	Thread ID, Z
0x29	`%ntid.x`	Block dimension, X
0x2A	`%ntid.y`	Block dimension, Y
0x2B	`%ntid.z`	Block dimension, Z
0x2C	`%ctaid.x`	Block ID, X
0x2D	`%ctaid.y`	Block ID, Y
0x2E	`%ctaid.z`	Block ID, Z
0x2F	`%nctaid.x`	Grid dimension, X
0x30	`%nctaid.y`	Grid dimension, Y
0x31	`%nctaid.z`	Grid dimension, Z
0x5E	(dynamic)	`%warpid` / `%laneid` (via `sub_3958DA0`)
0x5F	(dynamic)	`%nwarpid` or similar (via `sub_3958DA0`)

Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.

Address Space Conversion

The cvta instruction family is emitted by sub_21E7FE0:

Operand Value	Suffix	Full Instruction
0	(none)	`cvta` (generic)
1	`.global`	`cvta.to.global` / `cvta.global`
3	`.shared`	`cvta.to.shared` / `cvta.shared`
4+	`.local`	`cvta.to.local` / `cvta.local`

Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).

Constraint Emission Pipeline

The full path from opcode to emitted constraint:

sub_B612D0(emitter_state, opcode):
    // Step 1: Table lookup
    entry = word_3F3E6C0[opcode - 1]
    reg_class = entry >> 8
    constraint_class = entry & 0xFF

    // Step 2: Build descriptor array on stack
    switch (constraint_class):
        case 0x00:
            // Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
            desc[0] = {kind=0, value=sub_A778C0(state, reg_class, flags)}
            desc[1] = {kind=1, value=sub_A778C0(state, reg_class, flags)}
            desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 3)
        case 0x01:
            // Ternary FMA: {op0, op1, op2, result}
            desc[0..2] = three input constraints
            desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 4)
        ...
        case 0xB0:
            // 17-input complex: 17 input constraints + 1 output
            for i in 0..16:
                desc[i] = {kind=i, value=...}
            desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 18)

Key helper functions:

Address	Function	Purpose
`sub_A778C0`	`createRegClassConstraint(state, regclass, flags)`	Build input operand constraint for a specific register class
`sub_A77AD0`	`createAnyRegConstraint(state, flags)`	Build an unconstrained ("any register") input constraint
`sub_A79C90`	`composeConstraints(state, desc, N)`	Merge N descriptors into a single composite constraint
`sub_B5BA00`	`createOutputConstraint(state, regclass_id)`	Build the output/result constraint
`sub_A78010`	`emitConstraint(state, desc_array, N)`	Finalize and emit the constraint with N entries
`sub_B612D0`	`emitInstrConstraint(state, opcode)`	Top-level entry: table lookup + switch + emit

The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.

Complete Identified Opcode Summary

The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.

Opcode	Identity	Family	Evidence Source
0--~430	Generic LLVM `TargetOpcode`	LLVM standard	upstream LLVM 20.0.0
440--443	Type-preserving moves	Copy	register coalescer (`sub_3494EA0`)
444--503	Cross-class / wide / ABI copies	Copy	register coalescer (`sub_3494EA0`)
294--297	`atom.add` (f32/f64/i32/i64)	Atomic	DAG legalization (`sub_20BED60`)
302--305	`atom.min` (s32/s64/u32/u64)	Atomic	DAG legalization (`sub_20BED60`)
314--317	`atom.max` (s32/s64/u32/u64)	Atomic	DAG legalization (`sub_20BED60`)
315	`CallSeqBegin`	Call ABI	`LowerCall` (`sub_3040BF0`)
316	`CallSeqEnd_Outer`	Call ABI	`LowerCall`
462	`atom.cas`	Atomic	DAG legalization
499	`ConditionalBranch`	Control	intrinsic lowering
505	`DeclareParam`	Call ABI	`LowerCall`
506	`DeclareScalarParam`	Call ABI	`LowerCall`
507	`DeclareRetParam`	Call ABI	`LowerCall`
508	`DeclareRetScalarParam`	Call ABI	`LowerCall`
510	`CallDirect`	Call ABI	`LowerCall`
511	`CallDirectNoProto`	Call ABI	`LowerCall`
512	`CallIndirect`	Call ABI	`LowerCall`
513	`CallIndirectNoProto`	Call ABI	`LowerCall`
514	`CallStart`	Call ABI	`LowerCall`
515	`LoadRetParam`	Call ABI	`LowerCall`
516	`LoadRetParamLast`	Call ABI	`LowerCall`
517	`CallSeqEnd`	Call ABI	`LowerCall`
518	`CallProto`	Call ABI	`LowerCall`
521	`DeclareRetParam_Ext`	Call ABI	`LowerCall`
527	`StoreCalleeRetAddr`	Call ABI	`LowerCall`
528	`StoreRetValToParam`	Call ABI	`LowerCall`
568	`LoadV1`	Vector Param	`LowerCall`
569	`LoadV2`	Vector Param	`LowerCall`
570	`LoadV4`	Vector Param	`LowerCall`
571	`StoreV1`	Vector Param	`LowerCall`
572	`StoreV2`	Vector Param	`LowerCall`
573	`StoreV4`	Vector Param	`LowerCall`
4905--4940	`tcgen05.mma` (10 shape variants)	Tensor Core	Blackwell emission (`sub_21E8CD0`)

Gaps and Unknown Ranges

The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:

Range	Likely Contents	Evidence
430--439	Transition zone (generic-to-target boundary)	Adjacent to copy family
574--~800	Global/shared/local loads and stores	Large gap between param-store and first identified general opcode
800--~1500	Texture and surface instructions	`sub_33B0210` intrinsic switch references hundreds of tex/surf intrinsics
1500--~3000	Shuffle, vote, match, redux	Warp-level intrinsic families
3000--~4000	WGMMA, TMA, bulk operations	Hopper-era instruction families
4000--4904	Additional tensor/cluster instructions	Bridging pre-Blackwell and tcgen05

Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.

Function Map

Function	Address	Size	Role
Constraint emission (179-case switch on `word_3F3E6C0`)	`sub_B612D0`	104KB	--
Register class set builder (111 cases)	`sub_B5BA00`	21KB	--
Operand type decoder (101 cases)	`sub_B6B200`	44KB	--
`createRegClassConstraint(state, regclass, flags)`	`sub_A778C0`	--	--
`createAnyRegConstraint(state, flags)`	`sub_A77AD0`	--	--
`composeConstraints(state, desc, N)`	`sub_A79C90`	--	--
`emitConstraint(state, desc_array, N)`	`sub_A78010`	--	--
Opcode-to-copy-type mapping (switch, families 440--503)	`sub_3494EA0`	12.7KB	--
Operand-type classification (reads `byte_444C4A0`)	`sub_34961A0`	26.6KB	--
Register-pair decomposition (wide/paired registers)	`sub_3497B40`	16.5KB	--
`NVPTXTargetLowering::LowerCall` (call ABI opcodes)	`sub_3040BF0`	88KB	--
Intrinsic lowering switch (NVVM intrinsic to opcode)	`sub_33B0210`	343KB	--
`NVPTXDAGToDAGISel::Select` (ISel entry)	`sub_3090F90`	91KB	--
MMA instruction builder (packed descriptor)	`sub_21E74C0`	17KB	--
Atomic operation PTX emission (base)	`sub_21E5E70`	--	--
L2 cache-hinted atomic PTX emission (SM 80+)	`sub_21E6420`	--	--
Memory barrier PTX emission	`sub_21E94F0`	--	--
Cluster barrier PTX emission (SM 90+)	`sub_21E8EA0`	--	--
Special register PTX emission	`sub_21E86B0`	--	--
Cluster special register PTX emission (SM 90+)	`sub_21E9060`	--	--
Address space conversion (cvta) PTX emission	`sub_21E7FE0`	--	--
tcgen05 Blackwell MMA emission (SM 100+)	`sub_21E8CD0`	--	--
Register class to encoded ID mapping	`sub_21583D0`	--	--
Register class to PTX type suffix	`sub_2163730`	--	--
Register class to PTX register prefix	`sub_21638D0`	--	--

Global Data References

Symbol	Address	Purpose
`word_3F3E6C0`	`0x3F3E6C0`	Constraint table (16-bit entries, indexed by opcode-1)
`byte_444C4A0`	`0x444C4A0`	MVT/operand type table (16-byte entries, indexed by MVT enum)
`word_4456340`	`0x4456340`	MVT to vector element count (16-bit entries)
`word_4456580`	`0x4456580`	MVT to scalarized MVT (16-bit entries)
`byte_3F252E0`	`0x3F252E0`	Constraint type classification table
`qword_502A920`	`0x502A920`	SM processor table (45 entries, stride-2)

Cross-References

Pattern Database -- detailed constraint descriptor layout and emission sub-functions
Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
Code Generation -- pipeline overview including ISel, RA, and machine-level passes
InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
Register Allocation -- greedy RA that consumes constraint table data
AsmPrinter -- the PTX emission layer that converts these opcodes to text

Keyboard shortcuts

CICC Reverse Engineering Reference