Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVPTX Machine Opcode Reference

This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.

Constraint tableword_3F3E6C0 (static .data array of 16-bit entries)
Constraint emittersub_B612D0 (104KB, 179-case switch)
Copy-type mappersub_3494EA0 (12.7KB, maps opcodes 1--0x12 to families 440--503)
Register class buildersub_B5BA00 (21KB, 111 cases)
Operand type classifiersub_34961A0 (26.6KB, reads byte_444C4A0)
ISel entrysub_3090F90 (91KB, NVPTXDAGToDAGISel::Select)
Intrinsic lowering switchsub_33B0210 (343KB, hundreds of NVVM intrinsics)

Opcode Numbering Scheme

Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).

The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.

The Constraint Table: word_3F3E6C0

Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:

BitsFieldPurpose
[7:0] (low byte)constraint_classIndex into the 179-case switch in sub_B612D0
[15:8] (high byte)register_class_idTarget register class for the instruction's primary result

The access pattern, decompiled from sub_B612D0:

uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF;         // low byte
uint8_t register_class   = (entry >> 8) & 0xFF;  // high byte

switch (constraint_class) {
    case 0x00: ...  // simple 2-input ALU
    case 0x01: ...  // 3-input FMA
    ...
    case 0xB2: ...  // maximum observed class
}

The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.

179 Constraint Classes

The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:

Class RangePatternDescriptor CountTypical Instructions
0x00--0x0FSimple ALU (2 inputs, 1 output)3add, sub, mul, and, or, xor
0x10--0x1FTernary (3 inputs, 1 output)4fma, madc, selp
0x20--0x3FLoad/store variants2--5ld, st with address space and vector width
0x40--0x5FConversion and move2--3cvt, mov, bitcast
0x60--0x7FAtomic and barrier3--6atom.*, membar, fence
0x80--0x9FTexture/surface4--12tex., sust., suld.*
0xA0--0xAFTensor core (MMA)6--16hmma, imma, wmma, mma
0xB0Maximum operand (17 inputs)18Complex intrinsic (opcode 176)
0xB1--0xB2Miscellaneous high-operand-countvariableSpecialized instructions

The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.

Register Class IDs in the High Byte

The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:

IDRegister ClassPTX TypePTX PrefixVtable Address
14Int32Regs.b32%roff_4A025A0
22Int16Regs.b16%rsoff_4A02720
40Float32Regs.f32%foff_4A02620
43Float16Regs.b16%hoff_4A02760
50Int64Regs.b64%rdoff_4A024A0
51Float64Regs.f64%fdoff_4A02520
52Int128Regs.b128%rqoff_4A02460
78PredRegs.pred%poff_4A027A0
86SpecialRegs(varies)(varies)off_4A026E0

Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.

Identified Opcode Families

The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.

Copy and Move Family (440--503)

These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.

The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:

Opcode RangeFamilyDescription
440--443Type-preserving movesSame-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the byte_444C4A0 classification table.
444--470 (approx.)Cross-class movesBitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit mov instructions in PTX because the source and destination register types differ.
471--490 (approx.)Paired/wide moves128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by sub_3497B40.
491--503 (approx.)ABI parameter copies.param-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing.

The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:

struct OperandTypeEntry {    // 16 bytes at byte_444C4A0[16 * mvt - 16]
    uint8_t type_code;       // +0: 12=i32, 13=i64, 15=f32, etc.
    uint8_t size_class;      // +1: size in register-width units
    uint8_t register_bank;   // +2: bank identifier
    uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
    uint8_t reserved[12];    // +4: padding
};

The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.

Call ABI Family (505--573)

These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.

OpcodeNamePTX EquivalentOperands
315CallSeqBegin(pseudo) call frame setupchain, seq_id, zero
316CallSeqEnd_Outer(pseudo) outer call frame teardownchain, glue, callee_ref, callee_ref_hi
505DeclareParam.param .align A .b8 param[N]chain, alignment, param_index, byte_size
506DeclareScalarParam.param .bW paramNchain, alignment, param_index, widened_size
507DeclareRetParam.param .align A .b8 retval[N]chain, alignment, byte_size, zero
508DeclareRetScalarParam.param .bW retvalchain, 1, widened_size, zero
510CallDirectcall (retval), func, (params)chain, callee, params...
511CallDirectNoProtocall func, (params) (old-style)chain, callee, params...
512CallIndirectcall (retval), %rd, (params)chain, func_ptr, params...
513CallIndirectNoProtocall %rd, (params)chain, func_ptr, params...
514CallStart(pseudo) actual call emission pointCallProto result
515LoadRetParamld.param.bW retvalNcall_result, 1, element_index
516LoadRetParamLastld.param.bW retvalN (last)call_result, 1, element_index
517CallSeqEnd(pseudo) inner call frame teardownlast_load, chain, flag
518CallProto.callprototypechain, callee, proto_string
521DeclareRetParam_Ext.param for return (ext path)CallSeqEnd result, seq_id
527StoreCalleeRetAddr(pseudo) callee return addrchain, proto_symbol
528StoreRetValToParamst.param.bW retvalN (return)chain, value, offset

The call sequence follows a strict emission order:

CallSeqBegin(315)
  for each argument:
    DeclareParam(505) or DeclareScalarParam(506)
    StoreV1/V2/V4(571/572/573) — store argument values
  DeclareRetParam(507) or DeclareRetScalarParam(508)  [if callee returns]
  CallProto(518)
  CallStart(514)                                       [actual call point]
  for each return value:
    LoadRetParam(515) or LoadRetParamLast(516)
  CallSeqEnd(517)
  DeclareRetParam_Ext(521)                             [if prototype present]
CallSeqEnd_Outer(316)

Vector Load/Store Family (568--573)

These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:

OpcodeNamePTX EquivalentVector Width
568LoadV1ld.param.b32 / ld.param.b641 element
569LoadV2ld.param.v2.b32 / ld.param.v2.b642 elements
570LoadV4ld.param.v4.b32 / ld.param.v4.b644 elements
571StoreV1st.param.b32 / st.param.b641 element
572StoreV2st.param.v2.b32 / st.param.v2.b642 elements
573StoreV4st.param.v4.b32 / st.param.v4.b644 elements

The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):

accumulated_operand_count == 3  ->  StoreV1 (571), width=1
accumulated_operand_count == 4  ->  StoreV2 (572), width=2
accumulated_operand_count == 6  ->  StoreV4 (573), width=4
other                           ->  fatal error (unreachable)

The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).

Atomic Family (294--317, 462)

Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):

Opcode RangePTX InstructionTypes
294--297atom.addf32, f64, i32, i64
302--305atom.mins32, s64, u32, u64
314--317atom.maxs32, s64, u32, u64
462atom.casgeneric (compare-and-swap)

Within the PTX emission layer, the atomic operation is encoded in a packed operand word:

BitsFieldValues
[7:4]scope0=gpu (default), 1=cta, 2=sys
[23:16] (BYTE2)operation0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas

Note that operation codes 0x02 and 0x04 are absent -- there is no signed atomic add or a second OR variant, matching the PTX ISA specification.

On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.

Barrier and Fence Family (287--290)

OpcodePTX InstructionScope
287membar.gpuGPU
288membar.ctaCTA (thread block)
289membar.sysSystem
290fence.sc.clusterCluster (SM 90+)

The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.

Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):

Operand EncodingPTX Instruction
bits[3:0]=0, bits[7:4]=0barrier.cluster.arrive
bits[3:0]=0, bits[7:4]=1barrier.cluster.arrive.relaxed
bits[3:0]=1, bits[7:4]=0barrier.cluster.wait
bits[3:0]=1, bits[7:4]=1barrier.cluster.wait.relaxed

NVPTXISD Custom DAG Opcodes (22--499)

These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:

DAG OpcodeIdentityNotes
22NVPTXISD::TargetAddrData pointer computation
24NVPTXISD::WrapperGlobal address wrapping
149NVPTXISD::ATOMIC_LOADAtomic load (lowered from IR atomic)
152NVPTXISD::SELECT_CCConditional select (ternary)
189NVPTXISD::MoveParamThread index and parameter moves
193--196NVPTXISD::MIN/MAXMin/max variants (2- and 3-source)
197NVPTXISD::CTPOPPopulation count
198--204NVPTXISD::ConstantPoolConstant pool entry variants
208NVPTXISD::CMPXCHGCompare-and-exchange
213--214NVPTXISD::STORE_SIGNEDStore with sign-extension flag
215NVPTXISD::AddrSpaceCastAddress space conversion (within lowering)
230NVPTXISD::DeclareLocalDeclare local variable / address of param
233--234NVPTXISD::AddrSpaceCast pairTwo-step address space cast
245--274NVPTXISD::MathOp_RN/RZ/RM/RPRounded math (add, mul, sqrt, div, fma)
310NVPTXISD::AnnotationPTX .pragma annotation
321NVPTXISD::StackRestoreStack pointer restore
322NVPTXISD::StackAllocDynamic stack allocation
330NVPTXISD::FunctionAddrFunction address (for indirect calls)
335NVPTXISD::BinaryArithTwo-operand arithmetic
371NVPTXISD::DynAreaOffsetDynamic alloca offset
499NVPTXISD::ConditionalBranchConditional branch with .param alloc

The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:

Intrinsic IDNVPTXISD OpcodePTX Operation
63249add.rz
64255mul.rz
89267fma.rz
170245add.rm
172274mul.rm
250271fma.rm
308270add.rp
309272mul.rp
310273fma.rp
325248sqrt.rz
328254sqrt.rm
335246sqrt.rp
348250div.rz
349256div.rm
355269div.rp

MMA / Tensor Core Opcodes

Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.

Pre-Blackwell (SM 70--90) families:

FunctionFamilyPTX BaseMin SM
sub_21E0360HMMA load A/Bwmma.load.a / wmma.load.b70
sub_21E0630HMMA load Cwmma.load.c70
sub_21DFBF0HMMA store Cwmma.store.c70
sub_21E0870HMMA MMAwmma.mma / mma70
sub_21E1280IMMA load A/Bwmma.load.a (int)72
sub_21E15D0IMMA load Cwmma.load.c (int)72
sub_21E1830IMMA store Cwmma.store.c (int)72
sub_21E1D20IMMA MMAmma (integer, with saturation)72
sub_21E2280BMMA MMAmma (binary, b1.and.popc / b1.xor.popc)75

Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.

Blackwell tcgen05 (SM 100+):

Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:

BitFieldValues
0scaleD0 or 1
1negA0=positive, 1=negative
2negB0=positive, 1=negative
3transA0=normal, 1=transposed
4transB0=normal, 1=transposed
5sparsitystructured sparsity enable
[8:6]type encodingmxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16

Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (SM 100 decimal).

MMA Shape and Type Encoding

The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:

Shape codes (bits [39:32]):

CodeShapePTX StringMin SM
0x01m8n8k4"m8n8k4"70
0x02m8n8k16"m8n8k16"72
0x03m8n8k32"m8n8k32"75
0x04m8n8k64"m8n8k64"75
0x05m8n8k128"m8n8k128"75
0x10m16n8k4"m16n8k4"80
0x11m16n8k8"m16n8k8"75
0x12m16n8k16"m16n8k16"80
0x13m16n8k32"m16n8k32"75
0x14m16n8k64"m16n8k64"75
0x15m16n8k128"m16n8k128"75
0x16m16n8k256"m16n8k256"75
0x17m16n16k16"m16n16k16"90
0x18m32n8k16"m32n8k16"90?
0x19m16n16k8"m16n16k8"70

Data type codes (in aty/bty fields):

CodeTypeBitsPTX
1b11"b1"
2s44"s4"
3u44"u4"
4s88"s8"
5u88"u8"
6f1616"f16"
7bf1616"bf16"
8tf3219"tf32"
9f6464"f64"
10f3232"f32"
11s3232"s32"

Special Register Access

Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:

OperandRegisterDescription
0x26%tid.xThread ID, X
0x27%tid.yThread ID, Y
0x28%tid.zThread ID, Z
0x29%ntid.xBlock dimension, X
0x2A%ntid.yBlock dimension, Y
0x2B%ntid.zBlock dimension, Z
0x2C%ctaid.xBlock ID, X
0x2D%ctaid.yBlock ID, Y
0x2E%ctaid.zBlock ID, Z
0x2F%nctaid.xGrid dimension, X
0x30%nctaid.yGrid dimension, Y
0x31%nctaid.zGrid dimension, Z
0x5E(dynamic)%warpid / %laneid (via sub_3958DA0)
0x5F(dynamic)%nwarpid or similar (via sub_3958DA0)

Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.

Address Space Conversion

The cvta instruction family is emitted by sub_21E7FE0:

Operand ValueSuffixFull Instruction
0(none)cvta (generic)
1.globalcvta.to.global / cvta.global
3.sharedcvta.to.shared / cvta.shared
4+.localcvta.to.local / cvta.local

Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).

Constraint Emission Pipeline

The full path from opcode to emitted constraint:

sub_B612D0(emitter_state, opcode):
    // Step 1: Table lookup
    entry = word_3F3E6C0[opcode - 1]
    reg_class = entry >> 8
    constraint_class = entry & 0xFF

    // Step 2: Build descriptor array on stack
    switch (constraint_class):
        case 0x00:
            // Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
            desc[0] = {kind=0, value=sub_A778C0(state, reg_class, flags)}
            desc[1] = {kind=1, value=sub_A778C0(state, reg_class, flags)}
            desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 3)
        case 0x01:
            // Ternary FMA: {op0, op1, op2, result}
            desc[0..2] = three input constraints
            desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 4)
        ...
        case 0xB0:
            // 17-input complex: 17 input constraints + 1 output
            for i in 0..16:
                desc[i] = {kind=i, value=...}
            desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
            sub_A78010(state, desc, 18)

Key helper functions:

AddressFunctionPurpose
sub_A778C0createRegClassConstraint(state, regclass, flags)Build input operand constraint for a specific register class
sub_A77AD0createAnyRegConstraint(state, flags)Build an unconstrained ("any register") input constraint
sub_A79C90composeConstraints(state, desc, N)Merge N descriptors into a single composite constraint
sub_B5BA00createOutputConstraint(state, regclass_id)Build the output/result constraint
sub_A78010emitConstraint(state, desc_array, N)Finalize and emit the constraint with N entries
sub_B612D0emitInstrConstraint(state, opcode)Top-level entry: table lookup + switch + emit

The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.

Complete Identified Opcode Summary

The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.

OpcodeIdentityFamilyEvidence Source
0--~430Generic LLVM TargetOpcodeLLVM standardupstream LLVM 20.0.0
440--443Type-preserving movesCopyregister coalescer (sub_3494EA0)
444--503Cross-class / wide / ABI copiesCopyregister coalescer (sub_3494EA0)
294--297atom.add (f32/f64/i32/i64)AtomicDAG legalization (sub_20BED60)
302--305atom.min (s32/s64/u32/u64)AtomicDAG legalization (sub_20BED60)
314--317atom.max (s32/s64/u32/u64)AtomicDAG legalization (sub_20BED60)
315CallSeqBeginCall ABILowerCall (sub_3040BF0)
316CallSeqEnd_OuterCall ABILowerCall
462atom.casAtomicDAG legalization
499ConditionalBranchControlintrinsic lowering
505DeclareParamCall ABILowerCall
506DeclareScalarParamCall ABILowerCall
507DeclareRetParamCall ABILowerCall
508DeclareRetScalarParamCall ABILowerCall
510CallDirectCall ABILowerCall
511CallDirectNoProtoCall ABILowerCall
512CallIndirectCall ABILowerCall
513CallIndirectNoProtoCall ABILowerCall
514CallStartCall ABILowerCall
515LoadRetParamCall ABILowerCall
516LoadRetParamLastCall ABILowerCall
517CallSeqEndCall ABILowerCall
518CallProtoCall ABILowerCall
521DeclareRetParam_ExtCall ABILowerCall
527StoreCalleeRetAddrCall ABILowerCall
528StoreRetValToParamCall ABILowerCall
568LoadV1Vector ParamLowerCall
569LoadV2Vector ParamLowerCall
570LoadV4Vector ParamLowerCall
571StoreV1Vector ParamLowerCall
572StoreV2Vector ParamLowerCall
573StoreV4Vector ParamLowerCall
4905--4940tcgen05.mma (10 shape variants)Tensor CoreBlackwell emission (sub_21E8CD0)

Gaps and Unknown Ranges

The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:

RangeLikely ContentsEvidence
430--439Transition zone (generic-to-target boundary)Adjacent to copy family
574--~800Global/shared/local loads and storesLarge gap between param-store and first identified general opcode
800--~1500Texture and surface instructionssub_33B0210 intrinsic switch references hundreds of tex/surf intrinsics
1500--~3000Shuffle, vote, match, reduxWarp-level intrinsic families
3000--~4000WGMMA, TMA, bulk operationsHopper-era instruction families
4000--4904Additional tensor/cluster instructionsBridging pre-Blackwell and tcgen05

Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.

Function Map

FunctionAddressSizeRole
Constraint emission (179-case switch on word_3F3E6C0)sub_B612D0104KB--
Register class set builder (111 cases)sub_B5BA0021KB--
Operand type decoder (101 cases)sub_B6B20044KB--
createRegClassConstraint(state, regclass, flags)sub_A778C0----
createAnyRegConstraint(state, flags)sub_A77AD0----
composeConstraints(state, desc, N)sub_A79C90----
emitConstraint(state, desc_array, N)sub_A78010----
Opcode-to-copy-type mapping (switch, families 440--503)sub_3494EA012.7KB--
Operand-type classification (reads byte_444C4A0)sub_34961A026.6KB--
Register-pair decomposition (wide/paired registers)sub_3497B4016.5KB--
NVPTXTargetLowering::LowerCall (call ABI opcodes)sub_3040BF088KB--
Intrinsic lowering switch (NVVM intrinsic to opcode)sub_33B0210343KB--
NVPTXDAGToDAGISel::Select (ISel entry)sub_3090F9091KB--
MMA instruction builder (packed descriptor)sub_21E74C017KB--
Atomic operation PTX emission (base)sub_21E5E70----
L2 cache-hinted atomic PTX emission (SM 80+)sub_21E6420----
Memory barrier PTX emissionsub_21E94F0----
Cluster barrier PTX emission (SM 90+)sub_21E8EA0----
Special register PTX emissionsub_21E86B0----
Cluster special register PTX emission (SM 90+)sub_21E9060----
Address space conversion (cvta) PTX emissionsub_21E7FE0----
tcgen05 Blackwell MMA emission (SM 100+)sub_21E8CD0----
Register class to encoded ID mappingsub_21583D0----
Register class to PTX type suffixsub_2163730----
Register class to PTX register prefixsub_21638D0----

Global Data References

SymbolAddressPurpose
word_3F3E6C00x3F3E6C0Constraint table (16-bit entries, indexed by opcode-1)
byte_444C4A00x444C4A0MVT/operand type table (16-byte entries, indexed by MVT enum)
word_44563400x4456340MVT to vector element count (16-bit entries)
word_44565800x4456580MVT to scalarized MVT (16-bit entries)
byte_3F252E00x3F252E0Constraint type classification table
qword_502A9200x502A920SM processor table (45 entries, stride-2)

Cross-References

  • Pattern Database -- detailed constraint descriptor layout and emission sub-functions
  • Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
  • Code Generation -- pipeline overview including ISel, RA, and machine-level passes
  • InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
  • Register Allocation -- greedy RA that consumes constraint table data
  • AsmPrinter -- the PTX emission layer that converts these opcodes to text