Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SM100 Blackwell

The SM100 architecture (Blackwell, datacenter) represents the largest single-ISA definition embedded in nvlink v13.0.88. The instruction encoding infrastructure spans approximately 8 MB of .text across four major regions, defines 3,200+ encoding/decoding functions, 118 instruction families organized under 3 major opcodes, and introduces SM100-specific features including tcgen05 tensor intrinsics, unified tensor core (UTC) instructions, and a new ROT13-obfuscated mnemonic set internally codenamed "MERCURY" (ZREPHEL in ROT13).

This page documents the complete ISA encoding layer as reconstructed from the embedded ptxas backend, covering the 128-bit instruction word format, the opcode space partitioning, the encoder/decoder/descriptor infrastructure, and all SM100-specific instruction families.

Sub-variant Matrix

nvlink registers three SM100 sub-variants through the architecture dispatch at sub_15C0CE0:

Sub-variantELF FlagDescription
sm_100BaseFull Blackwell datacenter ISA
sm_100aAcceleratedEnables all SM100 features including tcgen05 MMA
sm_100fForward-compatibleFeature-subset for forward binary compatibility

All three sub-variants share the same encoding tables. The a suffix enables the full feature set; the f suffix restricts to the forward-compatible subset. The architecture dispatch registers 7 callbacks per sub-variant (nv.info emitter, resource usage table, instruction encoding table, compute capability array, perf-stats handler, cpf_optx handler, codegen options).

Instruction Word Format

All SM100 SASS instructions are 128 bits (16 bytes) wide. The instruction word is stored as two 64-bit halves at offsets +544 (bits 0-63) and +552 (bits 64-127) in the internal instruction representation structure.

Fixed Opcode Fields (Bits 0-31)

The first 32 bits encode the instruction identity through five fields set by the first five calls to the bitfield insertion primitive sub_4C28B0:

127                              64 63                                0
+----------------------------------+----------------------------------+
|     Modifier / Operand Fields    |  Fmt2 | Mod | SubOp | Minor | EF | MajOp |
+----------------------------------+----------------------------------+

Bit Range    Width   Name              Description
---------    -----   ----              -----------
  [3:0]      4 bits  Major Opcode      Instruction format class (1, 2, or 3)
  [6:4]      3 bits  Encoding Format    Sub-format / encoding variant (usually 0)
  [16:8]     9 bits  Minor Opcode      Instruction family (118 unique values)
  [24:17]    8 bits  Sub-Opcode        Specific instruction within family
  [31:25]    7 bits  Modifier           Data type, addressing mode, operation variant

Major Opcode Distribution

Major OpcodeClassEncoder CountPercentageDescription
1ALU/Scalar55837.2%Integer arithmetic, float arithmetic, conversion, comparison, bitfield, shift, move, special register
2Vector/Memory/Control97762.7%Memory load/store, texture, tensor core, control flow, barrier, predicate, warp shuffle, async copy
3Special20.1%Half-precision extended format (HSETP2 wide variant only)

Encoding Formats

FormatEncoder CountTypical Usage
Format 1308Simpler register-register forms, 3-operand ALU, branches
Format 21,227Extended forms with immediates, predicated variants, memory ops
Format 31Wide format -- only HSETP2 (opcode 0x6F sub 0x04, half-precision paired comparison)

Operand Field Encoding

Bits 32-127 encode operand references, modifier flags, and immediates. The layout varies by instruction class but follows consistent patterns:

Bit RangeTypical Content
48-53Source operand modifiers (negate, absolute value, swizzle)
54-57Rounding mode, conversion type
58-63Destination modifiers, saturation
64-79Extended opcode / sub-function qualifiers (word 1)
80-127Instruction-specific operand fields, immediates
134-132 (offset 0x84)Format2 field -- secondary format indicator (3 bits)

Register Operand Types

The register encoder sub_4C4D60 writes a structured operand field at the specified bit offset:

FieldWidthValues
is_output1 bit0=source, 1=destination
register type4 bits0=GPR, 2=uniform, 3=pair, 4=quad, 5=predicate, 7=barrier, 8=special, 10=64-bit, 11=128-bit
register number10 bits0-1023

The decoder sub_4C60F0 uses a register class parameter: 2=GPR, 3=predicate, 9=constant buffer reference, 10=uniform register.

Encoding Infrastructure

The SM100 ISA definition is split across four binary regions, totaling ~8 MB of template-instantiated functions:

Region Map

Address RangeSizeContentFunctions
0x620000--0x84DD702.2 MBSM100+ SASS encoders (table 1)1,537
0x84DD70--0xA482901.7 MBInstrDesc initializers1,613
0xDA0000--0xE436D0660 KBSASS encoders (table 2)438
0xE43DC0--0xF15A50840 KBSASS decoders648

Combined totals: 1,975 encoder functions, 648 decoder functions, 1,613 descriptor initializers = 4,236 template-instantiated functions.

Encoder Architecture

Every encoder follows an identical structural template:

__int64 __fastcall encode_XXX(__int64 buf, __int64 ir_instr)
{
    sub_4C2A60(buf);                          // Initialize encoding buffer
    sub_4C28B0(buf, 0, 4, major_opcode);      // Set bits[3:0] = major
    sub_4C28B0(buf, 4, 3, 0);                 // Set bits[6:4] = format ext
    sub_4C28B0(buf, 8, 9, minor_opcode);      // Set bits[16:8] = minor
    sub_4C28B0(buf, 17, 8, sub_opcode);       // Set bits[24:17] = sub
    sub_4C28B0(buf, 25, 7, modifier);         // Set bits[31:25] = modifier
    sub_4C2A90(buf, ir_instr, pred_idx);      // Encode predicate guard
    sub_4C4D60(buf, ir_instr, 0, 0x50);       // Encode register operand 0
    sub_4C4D60(buf, ir_instr, 1, 0x60);       // Encode register operand 1
    sub_4C52F0(buf, ir_instr, 2, 0x70);       // Encode immediate operand
    sub_4C5C30(buf, ir_instr, 3, 0x80);       // Encode predicate operand
    // ... instruction-specific modifier encoding via sub_A4xxxx/sub_A5xxxx ...
    sub_A50D10(lookup_table, opcode_value);    // Primary opcode mapping
}

Core Infrastructure Functions

AddressSignatureUsageDescription
sub_4C28B0(buf, bit_offset, width, value)All encodersBitfield insertion into 128-bit instruction word
sub_4C2A60(buf)All encodersClear operand remap table (offsets 468-531, 16 DWORD slots), reset operand counter at offset 532
sub_4C2A90(buf, ir, variant)All encodersEncode predicate register + 5-bit scheduling field; variant 0=standard, 1=extended immediate
sub_4C4D60(buf, ir, op_idx, offset)1,964/1,975Register operand encoder: 1-bit output flag + 4-bit type + 10-bit register number
sub_4C52F0(buf, ir, op_idx, offset)715/1,975Constant/immediate operand encoder: 5-bit type + register number
sub_4C5C30(buf, ir, op_idx, offset)875/1,975Predicate/special operand encoder with operand remapping

Modifier Encoding Helpers

AddressReferencesModifier
sub_A50D101,975 (all)Primary opcode mapping through lookup table
sub_A50CF0267Secondary/auxiliary opcode bits
sub_A4D6A086Rounding mode
sub_A4D92081Type conversion
sub_A4F1C075Destination negate
sub_A4F21075Absolute value
sub_A4DDD065Saturation
sub_A4F12066Source negate
sub_A4D7D043Data type size

Decoder Architecture

Decoders mirror encoders -- each unpacks a 128-bit instruction word into the internal IR:

__int64 __fastcall decode_XXX(__int64 buf, __int64 ir_out)
{
    sub_4C5F90(buf, ir_out);                       // Finalize init
    sub_4C60F0(buf, ir_out, 0, 0x50, 2);           // Decode register (GPR)
    sub_4C60F0(buf, ir_out, 1, 0x60, 10);          // Decode register (uniform)
    sub_4C6380(buf, ir_out, 2, 0x70, 9);           // Decode constant buffer
    sub_4C6DC0(buf, ir_out, 3, 0x80, 3);           // Decode predicate
    sub_50C790(lookup, bit_val);                    // Read 1-bit flag
    sub_50BBA0(lookup, val);                        // Read 5-bit predicate field
    // ... field extraction via sub_50xxxx/sub_51xxxx ...
}
AddressUsageDescription
sub_4C5F90All 648 decodersFinalize instruction decoding
sub_4C60F0634/648Decode register operand with class parameter
sub_4C6380288/648Decode constant buffer operand
sub_4C6DC0321/648Decode predicate/barrier operand
sub_50C790All 648Read 1-bit field from encoded instruction (universal)
sub_50C770188/648Read multi-bit flag field
sub_50BBA0101/648Read 5-bit predicate register field

Dispatch Functions

Two functions serve as the entry points for the entire encoding/decoding pipeline:

Encoder dispatch (sub_E43C20, 92 lines): Reads the instruction opcode from *(a2+12) as a 16-bit value. Special-cases opcodes 120-121 (control flow) to sub_4C8810. Otherwise performs binary search in the 24-byte dispatch table at off_1EA4380. Each table entry contains {opcode_byte0, opcode_byte1, encode_func_ptr, this_offset}. Supports vtable-aware dispatch (checks LSB of function pointer).

Decoder dispatch (sub_EFE6C0, 93 lines): Reads the 128-bit instruction word from a1[68] (offset 544). Extracts the format field (bits>>4)&7 and immediate offset 16*(bits&0xF). Delegates format 2/3 to sub_4CB100. Extracts the 9-bit major opcode (bits>>8)&0x1FF. For format 1, adds 121 to create a separate opcode space. Binary search in dispatch table at off_1E957E0. Calls sub_A49B50 to finalize.

InstrDesc Initializers

The 1,613 descriptor initializers at 0x84DD70--0xA48290 populate instruction metadata objects defining:

  • Operand count and types (register, immediate, memory, predicate)
  • Instruction properties (flags, latencies, execution unit assignment)
  • Scheduling hints and resource requirements

Each descriptor writes a class ID to *(a2+12). The class distribution spans 56 unique IDs (0-103), with ID 18 being the largest (78 descriptors, integer arithmetic class).

Complete Instruction Set Map

The following tables enumerate all 118 instruction families across the three major opcodes. Within each family, sub-opcodes define specific instruction variants (register-register, register-immediate, predicated, etc.).

Major Opcode 1: ALU/Scalar (558 encoders)

MinorFamilyVariantsOperand RangeDescription
0x01ALU-MISC-A2R1-2Miscellaneous ALU type A
0x02ALU-MISC-B7R2-3, I0-1Miscellaneous ALU type B
0x03ALU-MISC-C1R2Single-variant ALU
0x04INT-SHIFT6R2-3, I0-1Integer shift operations
0x06INT-ARITH12R2-3, I0-1Integer arithmetic (ADD, SUB)
0x08INT-MUL3R2-3Integer multiply
0x09INT-MAD-EXT26R2-4, I0-1Integer multiply-add extended -- largest family in major 1
0x0AFLOAT-CVT3R2-3Float conversion
0x0CCVT1R2Type conversion
0x0DINT-MAD10R2-3, I0-1Integer multiply-add
0x0EINT-WIDE6R2-3, I0-1Integer wide operations
0x0FINT-CMP2R2-3Integer comparison
0x10IMM-LOAD4R2-3, I1Immediate value load
0x12BIT-OP4R2-3Bit manipulation
0x17SHIFT-EXT8R2-4, I0-1Extended shift operations
0x18ISCADD/LEA10R2-4, I0-1, P0-1Integer scale-add / load effective address
0x19SHIFT-WIDE5R2-3, I0-1Wide shift variants
0x1ACONST-LOAD4R2-3, I0-1Constant buffer load
0x1BINT-ABS3R2-3Integer absolute value
0x1CINT-NEG1R2Integer negate
0x1DINT-MIN/MAX1R2Integer minimum/maximum
0x1EBIT-EXT4R2-3, I0-1Bit field extract/insert
0x22INT-SPECIAL1R2Special integer operation
0x24INT-DIV2R2-3Integer divide helpers
0x27LD-GLOBAL-1282I4128-bit global memory load (immediate-only form)
0x33UNIFORM-ALU2R2-3Uniform register ALU
0x39WARP-VOTE2R2-3, P1Warp voting operations

Major Opcode 2: Vector/Memory/Control (977 encoders)

MinorFamilyVariantsOperand RangeDescription
0x27P-LD-1282R3, I1, P1Predicated 128-bit load
0x29ATOM-SHARED24R2-3, I1-2, P0-1Shared memory atomic operations
0x3BTEXTURE-SURF10R4-5, I1, P0-1Texture/surface operations
0x3EEXIT/BRK2R5, P1, M1Exit and break (7 operands each)
0x3FCONT1R5, P1, M1Continue (7 operands)
0x40LONGJMP1R5, P1, M1Long jump
0x42PREBRK1R5, P1, M1Pre-break
0x43PCNT1R5, P1, M1Pre-continue
0x44PRET1R5, P1, M1Pre-return
0x45SSY1R5, P1, M1Set synchronization point
0x46CAL1R5, P1, M1Function call
0x49ATOM-GLOBAL6R2-3, I1-2, P0-1Global memory atomic operations
0x4AREDUCE4R2-3, I0-1Reduction operations
0x4CMEM-FENCE4R2-3, I0-1, P0-1Memory fence instructions
0x4ECORE-VEC63+R3-4, I1, P1Core vector ALU -- largest family
0x4FCORE-VEC-EXT25R2-4, I0-2, P0-1Extended core vector ALU
0x52INT-SET-PRED6R2-3, I0-1Integer set predicate
0x54LOAD-SHARED-EXT6R3, P1Extended shared memory load
0x56WARP-BARRIER4+R1-2, P0-1Warp-level barrier operations
0x59MMA-TENSOR30R4-7, I0-1, P0-2Tensor core MMA -- 2nd largest family
0x5CLOAD-GLOBAL6R3-4, P0-1Global memory load
0x61STORE-GLOBAL5R1-3, I1-4, P0-2Global memory store
0x62STORE-SHARED6R2-3, P0-2Shared memory store
0x6BTEXTURE-LOAD4R4-7, I1, P0-1Texture load operations
0x6DSURFACE-OP5R4-7, I1, P0-1Surface load/store
0x70ASYNC-COPY4R2-3, I1, P0-1Asynchronous memory copy
0x71WARP-GROUP3R2-3, P0-1Warp group operations
0x7EBARRIER-OP6R1-2, P0-1Barrier operations
0x81UNIFORM-PRED4R0-1, I1, P0-1Uniform predicate operations
0x84CONTROL-FLOW5R0-3, I0-1, P0-1Branch, call, return
0x94TMA-ACCESS4R3-4, P0-1Tensor memory accelerator access
0x95DEPBAR6R0-2, P0-1Dependency barrier
0xA4HMMA-A5R5-6, I1, P0-1Half-precision MMA type A
0xA5HMMA-B4R4-5, I0-1, P0-1Half-precision MMA type B
0xA6IMMA3R4-5, P0-1Integer MMA
0xA8MMA-F16xF168R5-6, I1, P1-2FP16 x FP16 matrix multiply
0xA9MMA-MIXED8R4-5, I0-1, P0-1Mixed-precision MMA
0xABMMA-INT4R4-5, P0-1Integer matrix multiply-accumulate
0xACMMA-INT-V8R4-5, P0-1Integer MMA variant
0xADLOAD-UNIFORM9R1-2, P0-1Uniform register load
0xAEMMA-TF32xTF3210R5-6, I1, P1-2TF32 x TF32 MMA
0xAFMMA-F64xF6410R5-6, I1, P1-2FP64 x FP64 MMA
0xB0MMA-REDUCED10R4-6, I0-1, P1-2Reduced-precision MMA
0xB1MMA-SPARSE5R4-5, P1-2Sparse MMA
0xB2MMA-WIDE5R5-6, I1, P1-2Wide MMA
0xB3MMA-SPECIAL5R4, P1Special MMA variant
0xCFPRED-LOGIC3R1-2, P0-1Predicate logic operations
0xD4SELECT2R1-2, P0-1Conditional select / cmov
0xDFIMM-MOV14R1-2, I1-3, P0-1Immediate move -- most immediate-rich family
0xE1STORE-EXT4R1-3, I1-3, P0-2Extended store operations
0xEEWARP-SHUFFLE3R2-3, P0-1Warp shuffle operations

Major Opcode 3: Special (1 encoder)

MinorFamilyVariantsOperandsDescription
0x6FHSETP2-WIDE1R4, I1, P2Half-precision set predicate, paired comparison -- the only format 3 instruction in the entire ISA

Instruction Families by Functional Class

The 118 instruction families group naturally into functional classes:

Integer Arithmetic (13 families, ~110 encoders)

Integer arithmetic dominates the ALU opcode space. The key families:

  • IADD3 (minor 0x005 in the 0xDA region): 3-input integer add with carry. 26 encoding variants covering register-register, register-immediate, and different data widths. All variants include 1 immediate + 1 predicate + 3-4 register operands.
  • IMAD (minors 0x016, 0x017): Integer multiply-add in standard (0x016, 7 variants) and wide (0x017, 6 variants) forms.
  • ISCADD/LEA (minor 0x18): Integer scale-add / load effective address. 10 variants. Used extensively in address computation.
  • ALU-MISC (minor 0x012 in 0xDA region): The largest single class with 63 distinct encodings. Includes BFE, BFI, FLO, POPC, LEA, PRMT, and many more.

Floating-Point Arithmetic (5 families, ~30 encoders)

  • FADD (minor 0x007): 11 variants covering FP32 add, multiply, FMA.
  • HADD2/HFMA2 (minors 0x06A, 0x06D): Half-precision packed arithmetic. 5 and 4 variants respectively. HFMA2 uses 2 predicate operands.
  • HSETP2 (minor 0x06F): Half-precision set predicate with paired output. 7 variants including the sole format 3 instruction.
  • DFMA (minor 0x020): Double-precision FMA with 4 variants.

Memory Operations (15 families, ~90 encoders)

  • LDG (minor 0x025): Global memory load. 9 variants with up to 4 register operands.
  • STG (minor 0x02E): Global memory store. 8 variants with up to 4 immediates.
  • LDS/STS (minors 0x00D, 0x0E3, 0x0EC): Shared memory load and store. 6 variants each.
  • LDL (minor 0x067): Local memory / stack. 13 variants with 2-4 immediates for complex addressing.
  • LDC (minor 0x060): Constant buffer load. 6 variants, all with 4 immediate operands.
  • Async Copy (minor 0x0B8): 6 variants for asynchronous data movement.

Texture and Surface (5 families, ~30 encoders)

  • TEX (minor 0x05A): Texture fetch. 13 variants. The most operand-rich instructions in the entire ISA: up to 7 register + 1 immediate + 1 predicate operands. The largest encoder function (sub_DC6680, 8,794 bytes, 302 lines) belongs to this family.
  • TXQ (minor 0x08B): Texture query. 4 variants with 4-7 register operands.
  • SUST (minor 0x0CE): Surface store. 4 variants.
  • Texture/Surface load (minor 0x03B): 6 variants.

Tensor Core / MMA (16 families, ~120 encoders)

The tensor core families constitute the second-largest functional class:

FamilyMinor(s)VariantsDescription
HMMA-A/B0xA4, 0xA59Half-precision matrix multiply-accumulate
MMA-F16xF160xA86FP16 x FP16 with 5-6 register operands
MMA-MIXED0xA96Mixed-precision MMA
IMMA0xA6, 0xAB6Integer MMA
IMMA-V0xAC6Integer MMA variant
MMA-TF320xAE6TF32 x TF32 matrix multiply
MMA-F640xAF6FP64 x FP64 double-precision MMA
MMA-REDUCED0xB06Reduced-precision MMA
MMA-SPARSE0xB13Sparsity-aware MMA
MMA-WIDE0xB23Wide accumulator MMA
MMA-SPECIAL0xB31Special MMA variant
TENSOR-LD0xB42Tensor memory load
TENSOR-ST0xB62Tensor memory store
TMA0x09F6Tensor Memory Accelerator operations
WGMMA0x0C22Warp Group MMA (Blackwell-new)

Control Flow (10 families, ~25 encoders)

  • BRA (minor 0x034): Branch with 3 variants (conditional, unconditional, indirect).
  • EXIT/BRK (minor 0x3E): Exit/break with 7-operand encoding (5 registers + 1 predicate + 1 memory).
  • CALL/RET (minors 0x079, 0x07B): Function call and return.
  • SSY/CAL/PCNT/PRET (minors 0x42-0x46): Structured control flow primitives, each with 7 operands.
  • YIELD/NANOSLEEP (minor 0x07C): Thread yield with 2 variants.

Synchronization (6 families, ~25 encoders)

  • BAR.SYNC (minor 0x085): Barrier synchronization.
  • WARP-BARRIER (minor 0x56): Warp-level barrier with 4+ variants.
  • DEPBAR (minor 0x096): Dependency barrier for scoreboard management.
  • BARRIER-OP (minor 0x0A7): Extended barrier operations.
  • FENCE (minor 0x08D): Memory fence with 4 variants.

SM100-Specific Instructions

ROT13 Mnemonic Table

The SM100 opcode table constructor at sub_1782540 (111,076 bytes, 3,227 lines) initializes ~400+ instruction mnemonics using ROT13 encoding. The ROT13 prefix ZREPHEL decodes to MERCURY, the internal codename for Blackwell. Key SM100-specific mnemonics decoded from the binary:

ROT13DecodedDescription
OZZNBMMABlock matrix multiply-accumulate
QZZNDMMADense matrix multiply-accumulate
DZZNQMMAQuantized matrix multiply-accumulate
GPTRA05TCGEN05Tensor Core Generation 5 intrinsic
HGPONEUTCBARUnified Tensor Core barrier
HGPPCUTCPCUnified Tensor Core program counter
HGPUZZNUTCHMMAUnified Tensor Core half-precision MMA
ZKEZN.FCMXQMA.SPMixed-precision quantized MMA, sparse variant
FLAPFSYNCSSynchronization primitives (Blackwell-specific)
NPDOHYXACQBULKAcquire bulk (barrier operation)
NPDFUZVAVGACQSHMINITAcquire shared memory init
NY2CAL2PAttribute to parameter (legacy compat)
NEEVIRFARRIVESBarrier arrival notification
NGRASTAttribute store
NGBZATOMAtomic operation
NGBZTATOMGAtomic global
SNQQFADDFloat add
SZHYFMULFloat multiply
VZNQIMADInteger multiply-add
VZNQ_JVQRIMAD_WIDEInteger multiply-add wide
VNQQ3IADD33-input integer add
OZFXBMSKBit mask
FTKGSGXTSign extend
YBC3LOP33-input logic operation
VFRGCISETPInteger set predicate
PPGYCCTLCache control
OFLAPBSYNCBlock synchronization
SRAPRFENCEMemory fence
REEONEERRBARError barrier

tcgen05 Intrinsics

SM100 introduces the tcgen05 (Tensor Core Generation 5) subsystem with dedicated PTX-level intrinsics. The code generation infrastructure resides at 0x16E0000--0x16E3AB0 and spans 14+ functions handling type classification, bounds checking, tensor memory address computation, MMA operand setup, and argument type mapping.

The two PTX instruction mnemonics are tcgen05.mma (standard) and tcgen05.mma.ws (warp-specialized). Modifier suffixes include _expand16bit, _pack16bit, _maxabs, _minabs, _fused, _blockscale, and _ashift.

Instruction Type Classifier

sub_16E0A70 (17,302 bytes, 322 lines) classifies a tcgen05 MMA instruction into one of 54 type IDs by chaining predicate functions against the instruction object's type field at *(*(a1+8)). Each predicate is a trivial equality check (*a1 == N); the classifier tests them in priority order and returns the first matching type ID, or 0 if no predicate matches.

The 54 predicate functions live at 0x12B5670--0x12B5C20 (16 bytes each, all identical in structure). The complete classifier chain and resulting type ID assignment:

PriorityPredicateInternal ValueType IDType
1sub_12B567011Base types
2sub_12B568022Base types
3sub_12B569033Base types
4sub_12B56A044Base types
5sub_12B56B055Base types
6sub_12B56C066Base types
7sub_12B56D077Base types
8sub_12B56E088Base types
9sub_12B57001818Extended type A
10sub_12B57101919Extended type A
11sub_12B57202121Extended type A
12sub_12B57302323Extended type A
13sub_12B57402424Extended type A
14sub_12B57801010Extended type B
15sub_12B57901111Extended type B
16sub_12B57A01313Extended type B
17sub_12B57B01515Extended type B
18sub_12B57C01616Extended type B
19sub_12B58D02525Blockscale types
20sub_12B58E02626Blockscale types
21sub_12B59602727Blockscale types
22sub_12B59502929Blockscale types
23sub_12B59402828Blockscale types
24sub_12B59703333Quantized types
25sub_12B58F03232Quantized types
26sub_12B59003030Quantized types
27sub_12B59103131Quantized types
28sub_12B5B503434Mixed-precision types
29sub_12B59203535Mixed-precision types
30sub_12B59303636Mixed-precision types
31sub_12B59803737Mixed-precision types
32sub_12B5B304242Sparse types
33sub_12B5B404343Sparse types
34sub_12B59903838FP8/FP6/FP4 types
35sub_12B5A803939FP8/FP6/FP4 types
36sub_12B5A904040FP8/FP6/FP4 types
37sub_12B5AA04141FP8/FP6/FP4 types
38sub_12B5AB04444Ashift types
39sub_12B5AC04545Ashift types
40sub_12B5AD04646Ashift types
41sub_12B5B004747Fused types
42sub_12B5B104848Fused types
43sub_12B5AF05050Pack/Expand types
44sub_12B5AE04949Pack/Expand types
45sub_12B5B205151Pack/Expand types
46sub_12B5BA05252MXQ types
47sub_12B5BB05656MXQ types
48sub_12B5BF05353MXQ types
49sub_12B5C005454MXQ types
50sub_12B5BC05858Maxabs types
51sub_12B5BE05959Maxabs types
52sub_12B5C105555Extended sparse
53sub_12B5C206060Extended sparse

Type IDs 9, 12, 14, 17, 20, 22, 57 are absent from the enum -- either reserved for future expansion or used by related subsystems not routed through this classifier.

The type ID groupings inferred from the classification order and the modifier suffix strings:

  • IDs 1--8: Base MMA types (standard precision combinations -- the 8 fundamental tcgen05.mma configurations)
  • IDs 10--16: Extended base types with non-standard precision or accumulator widths
  • IDs 18--24: Extended type A variants (wider accumulator or non-standard rounding)
  • IDs 25--29: Blockscale variants (_blockscale modifier -- block-level scaling for MX formats)
  • IDs 30--33: Quantized MMA types (tcmma_*_q / tcmma_*_mxq internal names)
  • IDs 34--37: Mixed-precision variants (asymmetric A/B input types)
  • IDs 38--41: FP8/FP6/FP4 narrow-type variants (e4m3, e5m2, e3m2, e2m3, e2m1 combinations)
  • IDs 42--43: Sparse MMA variants (structured sparsity -- 2:4 or 4:8 patterns)
  • IDs 44--46: Ashift variants (_ashift modifier -- arithmetic shift on matrix A)
  • IDs 47--48: Fused variants (_fused modifier -- fused accumulation)
  • IDs 49--51: Pack/Expand variants (_pack16bit, _expand16bit modifiers)
  • IDs 52--56: MXQ (Mixed-precision Quantized) types (_blockscale + quantization)
  • IDs 58--59: Maxabs/Minabs variants (_maxabs, _minabs reduction modifiers)
  • ID 55: Extended sparse variant
  • ID 60: Extended sparse variant (last in chain)

Guardrails Code Generation

sub_16E1DB0 (10,365 bytes, 325 lines) generates inline PTX boundary-checking code that validates tensor memory accesses before tcgen05.mma execution. This is a compiler-inserted safety mechanism -- the guardrails are weak functions that can be overridden at link time.

Symbol selection (line 67--73): The function selects one of two guardrail symbol prefixes based on the instruction opcode at offset 776:

OpcodeSymbol Prefix
149__cuda__sm10x_tcgen05_guardrails_are_columns_allocated_
other__cuda__sm10x_tcgen05_guardrails_in_physical_bounds_

Both prefixes get nCols appended to form the final guardrail variable name (e.g., __cuda__sm10x_tcgen05_guardrails_in_physical_bounds_nCols).

nCols computation (line 82--318): The number of columns accessed by the MMA instruction is computed via a multi-level dispatch. The first dispatch key is BYTE2(v63) & 0x1C -- a 3-bit field extracted from the instruction descriptor at offset 632:

Field ValuenCols Computation
0x10 (16)Constant: mov.u32 %s, 8
0x14 (20)Dynamic: reads descriptor field at offset 634, dispatches on (byte >> 5) -- see sub-table below
0x18 (24)Register: copies from operand at offset 664 via mov.u32 %s, %s
0x1C (28)Register: copies from operand at offset 656 via mov.u32 %s, %s
otherFalls through to instruction-opcode-based dispatch

Instruction-opcode nCols table (lines 103--128): When none of the above field values match, the byte at offset 620 (v4, signed char interpreted as instruction opcode) determines nCols:

Opcode (signed)Opcode (unsigned)nCols BasenCols (with scale)Interpretation
-87 (0xA9)16916 / 8scaled16-col MMA types
-83 (0xAD)17316 / 8scaled16-col MMA types
-78 (0xB2)17816 / 8scaled16-col MMA types
-86 (0xAA)1702 / 1scaled2-col MMA types
-82 (0xAE)1742 / 1scaled2-col MMA types
-85 (0xAB)1714 / 2scaled4-col MMA types
-84 (0xAC)1728 / 4scaled8-col MMA types
-81 (0xAF)1758 / 4scaled8-col MMA types
-80 (0xB0)1768 / 4scaled8-col MMA types
-79 (0xB1)1778 / 4scaled8-col MMA types

The nCols value is modified by two flags in the instruction descriptor at offset 624 (v62.m128i_i8[8]):

  • & 0x03 (CTA group flag): If nonzero, the base nCols is doubled (nCols = nCols_wide instead of nCols_narrow)
  • & 0x78 (scale factor flag): If nonzero, nCols is multiplied by sub_12AB660(v5) where v5 = (byte >> 3) & 0xF is a 4-bit index into the scale lookup table at dword_1F24300 (values indexed 0--7)

Dynamic descriptor dispatch (field value 0x14, lines 158--318): When the field value is 0x14, a secondary 3-bit dispatch on *(a1+634) >> 5 selects one of 6 nCols computation strategies:

CasePTX EmittedSemantics
1mov.u32 %s, 8Constant 8 columns
2and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8; mov.u32 %s, %s;Extract bits[22:17] from descriptor, multiply by 8
3 (with FP16 inputs)mov.u32 %s, 4Constant 4 columns for FP16
3 (other)and.b32 %s, 0x1F000000, %s; shr.u32 %s, %s, 24; mul.lo.u32 %s, %s, 16; setp.eq.u32 %s, %s, 128; selp.u32 %s, 2, 4, %s;Extract bits[28:24], multiply by 16, select 2 or 4 based on result
4and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8; cvt.rp.f32.u32 %s, %s; div.rp.f32 %s, %s, 64.0; cvt.rpi.u32.f32 %s, %s; mul.lo.u32 %s, %s, 2;Extract bits[22:17], multiply by 8, divide by 64 rounding up, multiply by 2
5mov.u32 %s, 2Constant 2 columns

Case 3 checks whether either matrix A (offset 611, bits[5:4]) or matrix B (offset 627, bits[5:4]) uses FP16 format (== 1). If so, nCols is fixed at 4; otherwise, a full bitfield extraction from the descriptor determines the value dynamically.

Case 4 is notable for using floating-point arithmetic (cvt.rp.f32.u32, div.rp.f32, cvt.rpi.u32.f32) to compute a ceiling division -- nCols = ceil(bits * 8 / 64) * 2. This computes the number of 64-byte-aligned column groups needed.

Tensor Memory Address Computation

Five functions generate the inline PTX for computing tensor memory (tmem) addresses used by tcgen05.mma operands:

AddressSizeSymbolOperandSource Offset
sub_16E2410111 lines__cuda_sm_100_tcgen05_tmem_addrGeneral tmemOffset 648 (opcode-dependent)
sub_16E261093 lines__cuda_sm10x_tcgen05_mma_tmemDDestination DOffset 648 (slot 0)
sub_16E27D093 lines__cuda_sm10x_tcgen05_mma_tmemASource AOffset 656 (slot 1)
sub_16E299099 lines__cuda_sm10x_tcgen05_mma_scaleTmemAScale AOffset 648 + 8*N (dynamic)
sub_16E2B8099 lines__cuda_sm10x_tcgen05_mma_scaleTmemBScale BOffset 648 + 8*N (dynamic)

A sixth function at sub_16E2D70 (93 lines) handles __cuda_sm10x_tcgen05_mma_spMetaTmem -- the sparsity metadata tensor memory address, reading from offset 672 (slot 3).

Each function follows the same two-path pattern:

  1. 32-bit tmem address (when the address operand type is sub_12B5850 or sub_12B56C0, internal type 6): Emits a simple add.u32:

    add.u32 %s, <symbol>, %s, %s;
    
  2. 64-bit tmem address (all other types): Wraps in a scoped register block with 64-to-32-bit conversion:

    {.reg .b32 __cuda_sm_100_tcgen05_tmem_addr_base;
    cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s;
    add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s;
    }
    

The scale operand functions (sub_16E2990, sub_16E2B80) dynamically select their operand slot: scaleTmemA uses slot 5 - (cta_group_flag == 0) and scaleTmemB uses slot 6 - (cta_group_flag == 0), where cta_group_flag = (*(a1+627) & 0x30).

Return Value / Input Operand Marshalling

Two functions handle parameter passing to/from the tcgen05 MMA helper functions:

Output marshalling (sub_16E1A80, 47 lines): Iterates over the operand list at offset 648, emitting ld.param.b32 %s, [%s + %d] for each register result. The return array symbol depends on the opcode:

  • Opcode 143: __cuda_sm_100_tcgen05_ld_red_funcRetArr (load-reduce variant)
  • Opcode 160: __cuda_sm_100_tcgen05_ld_funcRetArr (standard load variant)

For reduce operations (sub_12A9290 returns true), an additional operand is appended from offset 656.

Input marshalling (sub_16E1BC0, 30 lines): Iterates over the input operand list at offset 648 (slot determined by *(a1+596) - 1), emitting st.param.b32 [%s + %d], %s for each register input. Uses the symbol __cuda_sm_100_tcgen05_st_funcInputArr.

Arguments Type Mapper

sub_16E3AB0 (18,623 bytes, 337 lines) performs the same type classification as sub_16E0A70 but operates on an array of tcgen05 instruction arguments rather than a single instruction. It iterates over the argument array (count at offset 796, result stored at offset 944), calling the identical predicate chain to assign numeric type IDs.

Guardrail Trap Functions

The guardrails infrastructure defines five weak trap functions that are called when bounds violations are detected. These are declared as PTX prototypes embedded in the binary:

SymbolParametersDescription
__cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_boundsoob_access_col_no, instr_kindOut-of-bounds column access
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_accesscol_no_accessed, alloced_mask, instr_kindAccess to unallocated column
__cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignmentdp_lane, matrix_kind, valid_alignment_kindMisaligned datapath access
__cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_envidesc_sp_enabled, idesc, mma_kind, ptx_target, is_family_portableSparsity used in unsupported context
__cuda_sm10x_tcgen05_guardrails_check_column_allocationstart_col_num, num_of_cols, inst_kindColumn allocation verification (returns retVal)

Two additional check functions return a result value:

SymbolParametersDescription
__cuda_sm10x_tcgen05_guardrails_check_physical_boundsstart_col_num, num_of_cols, inst_kindPhysical bounds check (returns retVal)
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignmenttmemAddr, iDesc, cta_group, hasWS, hasSP, matrix_kindDatapath alignment check (returns retVal)

All are declared .weak .func (overridable at link time) and use .FORCE_INLINE for the check variants, ensuring they are inlined at every call site.

Additional tcgen05 Symbols

Beyond MMA, the tcgen05 infrastructure references additional symbols for load, store, and load-reduce operations:

SymbolUsage
__cuda_sm_100_tcgen05_tmem_addrGeneral tmem address for ld/st operations
__cuda_sm_100_tcgen05_ld_funcRetArrReturn array for tcgen05 loads
__cuda_sm_100_tcgen05_ld_red_funcRetArrReturn array for tcgen05 load-reduce
__cuda_sm_100_tcgen05_st_funcInputArrInput array for tcgen05 stores
__cuda_sm_100_tcgen05_ld_immhalfSplitOffHalf-split offset for loads (opcode 0xAA)
__cuda_sm_100_tcgen05_ld_red_immhalfSplitOffHalf-split offset for load-reduce (opcode 0xAA)
__cuda_sm_100_tcgen05_st_immhalfSplitOffHalf-split offset for stores (opcode 0xAA)

The EIATTR system tracks tcgen05 MMA usage through two compatibility attributes: EICOMPAT_ATTR_INST_TCGEN05_MMA (current) and EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED (legacy).

Decoder Opcode Classes for Blackwell-New Instructions

The decoder region (0xE43DC0--0xF15A50) reveals Blackwell-specific instruction classes through high opcode IDs (>300) not present in earlier architectures:

Decoder OpcodeDecodersDescription
35615Uniform compute extensions (new register file operations)
3576New barrier primitives
3586New synchronization primitives
36810Asynchronous copy/compute operations
289-299~20TMA (Tensor Memory Accelerator) operations

SM100 Compiler Backend

The SM100-specific compiler backend extends from 0x1782540 through 0x17B9300 and includes:

Opcode Table Constructor

sub_1782540 (111,076 bytes, 3,227 lines) is the SM100 opcode table constructor. It calls the parent class constructor sub_19B11F0 and initializes all SM100 instruction mnemonics with name + length at offset +11,360. The vtable is at off_2415E98. This is over 3x larger than the SM70 equivalent (sub_1769B50, 24,230 bytes) reflecting the expanded ISA.

Instruction Property Initializer

sub_17884A0 (44,713 bytes, 1,603 lines) sets latencies, throughputs, and execution unit assignments for every SM100 opcode. Companion to the opcode table constructor.

Scheduling Table

sub_178AA00 (35,422 bytes, 1,205 lines) initializes the SM100-specific scheduling resource tables defining per-instruction throughputs and latencies for each execution unit type (ALU, SFU, LDST, TEX, MMA).

SM100-Specific Optimization Passes

AddressSizeLinesFunction
sub_179BD1016,544 B649SM100 peephole optimizer
sub_179E62010,854 B455SM100 instruction combiner
sub_179EF109,214 B352SM100 pattern matcher
sub_179F6D07,078 B294SM100 dead code handler
sub_17A213033,823 B1,065SM100 instruction legalization (main)
sub_17A7A4017,754 B601SM100 type legalization
sub_17A861029,094 B889SM100 lowering pass
sub_17AB9D036,177 B1,221SM100 instruction selection
sub_17ADA4035,411 B1,270SM100 complex instruction selection
sub_17B04A023,212 B871SM100 operand folding

Master Encoder Functions

Two monumental encoding functions handle the final SASS binary emission:

  • sub_17F2670 (156,611 bytes, 4,858 lines): The master instruction encoder -- the largest function in the entire nvlink binary. Dispatches to individual encoding routines for all SASS instruction types. ~640 local variables, 0x2C8-byte stack frame.
  • sub_17F9AE0 (61,531 bytes, 2,150 lines): Secondary encoder for less common instruction types.

Format Descriptor Tables

Each instruction references a 16-byte (128-bit) format descriptor loaded via SSE instructions. These descriptors define the operand layout template:

Encoder-Side Descriptors

AddressEncodersUsage
xmmword_1E30DA0166Default/common format
xmmword_1E30E3054Integer arithmetic formats
xmmword_1E30DC034Memory operation formats
xmmword_1E30DB019Comparison/predicate formats
xmmword_1E30EF014MOV/immediate formats
xmmword_1E30F1016Integer extended formats
xmmword_1E30E5012Integer multiply formats
xmmword_1E30EA010Bitfield/logic formats

Decoder-Side Descriptors

AddressDecodersUsage
xmmword_1F46278125Common 64-bit instruction format
xmmword_1F46388106Standard 3-operand format
xmmword_1F46AF899Extended operand format
xmmword_1F4663084Memory instruction format
xmmword_1F46E2872Wide instruction format
xmmword_1F461F064Predicate instruction format

The universal operand descriptor at xmmword_1F460E0/xmmword_1F460F0 (32 bytes) is referenced by all 4,236 encoder+decoder+descriptor functions -- it defines the register class mapping for the SM100 architecture.

Internal Data Structures

Instruction Representation Object (~560 bytes)

The a1 parameter across all encoder/decoder functions points to:

Offset  Size   Description
------  ----   -----------
  0       4    Flags / instruction ID
  4       4    Scheduling control bits
  8      16    Format descriptor (128-bit SSE copy from xmmword table)
 12       4    Operand count metadata field 1
 16       4    Operand count metadata field 2
 24-60   40    Operand register indices (10 x 4 bytes)
 64-100  40    Operand type/class (10 x 4 bytes, -1 = unused)
104-140  40    Operand modifier flags (10 x 4 bytes)
144       4    Active operand count
148       4    Immediate encoding offset 1
152       4    Immediate encoding offset 2
156-276 120    Reserved / operand extension data
276-404 128    Decoded operand output buffer
404-468  64    Modifier flag output buffer
452       4    Modifier count field 1
456       4    Modifier count field 2
468-532  64    Operand remap table (16 DWORD slots)
532       4    Operand remap counter
536       8    Pointer to encoding/decoding lookup table
544       8    Instruction word 0 (bits 0-63 of 128-bit SASS)
552       8    Instruction word 1 (bits 64-127 of 128-bit SASS)
556       4    Decoded immediate / offset value

Encoder Dispatch Table Entry (24 bytes)

Offset  Size  Description
------  ----  -----------
  0       1   Opcode byte 0 (minor opcode low)
  1       1   Opcode byte 1 (minor opcode high / sub-opcode)
  8       8   Function pointer to encoder (LSB=1 indicates vtable indirection)
 16       8   this-pointer offset adjustment

Statistical Summary

MetricValue
Total encoding functions1,975
Total decoding functions648
Total descriptor initializers1,613
Total template instantiations4,236
Unique instruction families118
Unique descriptor class IDs56 (ranging 0-103)
Largest encodersub_DC6680 TEX (8,794 bytes, 302 lines)
Smallest encoder~4,661 bytes, 167 lines
Only format 3 instructionHSETP2 (minor 0x6F, sub 0x04)
Max register operands7 (texture fetch)
Max immediate operands4 (constant buffer load, local memory)
Max predicate operands2 (half-precision comparison, store ops)
SM100 opcode table size111,076 bytes / 3,227 lines
SM70 opcode table size24,230 bytes / 733 lines
SM100-to-SM70 ISA size ratio~4.6x

Confidence Assessment

ClaimConfidenceVerification
ISA class string "Blackwell" for sm_100CONFIRMEDDecompiled sub_484F50 line 609: "Blackwell"; string at 0x1d40b6e
__CUDA_ARCH__=1000CONFIRMEDString at 0x1d40b59; decompiled line 617
Three sub-variants: sm_100, sm_100a, sm_100fCONFIRMEDStrings at 0x1d40b78/0x1d40bae/0x1d40be2; dispatch table in sub_15C0CE0
Dispatch table: sm_100 encoding table = sub_15C3840CONFIRMEDDecompiled sub_15C0CE0 line 126: sub_448E70(qword_2A644A8, "sm_100", sub_15C3840)
8 MB encoding infrastructure across four regionsHIGHRegion sizes from systematic binary analysis; addresses consistent with function catalog
1,975 encoders + 648 decoders + 1,613 descriptors = 4,236HIGHCounts from comprehensive sweep; cross-checked with dispatch table sizes
118 instruction families under 3 major opcodesHIGHFamily count from systematic opcode table analysis
sub_4C28B0 bitfield insertion at bits[3:0], [6:4], [16:8], [24:17], [31:25]HIGHEncoder template pattern confirmed across multiple decompiled encoders
ROT13 mnemonic table: ZREPHEL = MERCURYHIGHROT13 decoding of string found in opcode table constructor
tcgen05 intrinsics: tcgen05.mma, tcgen05.mma.wsHIGHPTX mnemonic strings referenced in sub_16E0A70 type classifier
54 tcgen05 type IDs from predicate chainHIGHDecompiled sub_16E0A70 at 322 lines with complete predicate chain
Guardrail symbols (__cuda__sm10x_tcgen05_guardrails_*)HIGHSymbol name strings referenced in decompiled sub_16E1DB0
Encoder dispatch sub_E43C20 (92 lines)HIGHAddress and size consistent with function catalog
Decoder dispatch sub_EFE6C0 (93 lines)HIGHAddress and size consistent with function catalog
SM100 opcode table constructor sub_1782540 (111,076 B)HIGHOne of the largest functions; address consistent
Master encoder sub_17F2670 (156,611 B) -- largest function in binaryHIGHSize claim from function boundary analysis; consistent with claimed 4,858 lines
Format descriptor tables (xmmword_1E30DA0, etc.)HIGHRodata addresses from decompiled encoder functions
Major opcode 3 has only 1 encoder (HSETP2)MEDIUMDerived from opcode distribution analysis; single format-3 instruction

For general Blackwell architecture details, see the ptxas wiki: Blackwell and cicc wiki: SM100 Blackwell.

Cross-References

Sibling Wikis