SM100 Blackwell
The SM100 architecture (Blackwell, datacenter) represents the largest single-ISA definition embedded in nvlink v13.0.88. The instruction encoding infrastructure spans approximately 8 MB of .text across four major regions, defines 3,200+ encoding/decoding functions, 118 instruction families organized under 3 major opcodes, and introduces SM100-specific features including tcgen05 tensor intrinsics, unified tensor core (UTC) instructions, and a new ROT13-obfuscated mnemonic set internally codenamed "MERCURY" (ZREPHEL in ROT13).
This page documents the complete ISA encoding layer as reconstructed from the embedded ptxas backend, covering the 128-bit instruction word format, the opcode space partitioning, the encoder/decoder/descriptor infrastructure, and all SM100-specific instruction families.
Sub-variant Matrix
nvlink registers three SM100 sub-variants through the architecture dispatch at sub_15C0CE0:
| Sub-variant | ELF Flag | Description |
|---|---|---|
sm_100 | Base | Full Blackwell datacenter ISA |
sm_100a | Accelerated | Enables all SM100 features including tcgen05 MMA |
sm_100f | Forward-compatible | Feature-subset for forward binary compatibility |
All three sub-variants share the same encoding tables. The a suffix enables the full feature set; the f suffix restricts to the forward-compatible subset. The architecture dispatch registers 7 callbacks per sub-variant (nv.info emitter, resource usage table, instruction encoding table, compute capability array, perf-stats handler, cpf_optx handler, codegen options).
Instruction Word Format
All SM100 SASS instructions are 128 bits (16 bytes) wide. The instruction word is stored as two 64-bit halves at offsets +544 (bits 0-63) and +552 (bits 64-127) in the internal instruction representation structure.
Fixed Opcode Fields (Bits 0-31)
The first 32 bits encode the instruction identity through five fields set by the first five calls to the bitfield insertion primitive sub_4C28B0:
127 64 63 0
+----------------------------------+----------------------------------+
| Modifier / Operand Fields | Fmt2 | Mod | SubOp | Minor | EF | MajOp |
+----------------------------------+----------------------------------+
Bit Range Width Name Description
--------- ----- ---- -----------
[3:0] 4 bits Major Opcode Instruction format class (1, 2, or 3)
[6:4] 3 bits Encoding Format Sub-format / encoding variant (usually 0)
[16:8] 9 bits Minor Opcode Instruction family (118 unique values)
[24:17] 8 bits Sub-Opcode Specific instruction within family
[31:25] 7 bits Modifier Data type, addressing mode, operation variant
Major Opcode Distribution
| Major Opcode | Class | Encoder Count | Percentage | Description |
|---|---|---|---|---|
| 1 | ALU/Scalar | 558 | 37.2% | Integer arithmetic, float arithmetic, conversion, comparison, bitfield, shift, move, special register |
| 2 | Vector/Memory/Control | 977 | 62.7% | Memory load/store, texture, tensor core, control flow, barrier, predicate, warp shuffle, async copy |
| 3 | Special | 2 | 0.1% | Half-precision extended format (HSETP2 wide variant only) |
Encoding Formats
| Format | Encoder Count | Typical Usage |
|---|---|---|
| Format 1 | 308 | Simpler register-register forms, 3-operand ALU, branches |
| Format 2 | 1,227 | Extended forms with immediates, predicated variants, memory ops |
| Format 3 | 1 | Wide format -- only HSETP2 (opcode 0x6F sub 0x04, half-precision paired comparison) |
Operand Field Encoding
Bits 32-127 encode operand references, modifier flags, and immediates. The layout varies by instruction class but follows consistent patterns:
| Bit Range | Typical Content |
|---|---|
| 48-53 | Source operand modifiers (negate, absolute value, swizzle) |
| 54-57 | Rounding mode, conversion type |
| 58-63 | Destination modifiers, saturation |
| 64-79 | Extended opcode / sub-function qualifiers (word 1) |
| 80-127 | Instruction-specific operand fields, immediates |
| 134-132 (offset 0x84) | Format2 field -- secondary format indicator (3 bits) |
Register Operand Types
The register encoder sub_4C4D60 writes a structured operand field at the specified bit offset:
| Field | Width | Values |
|---|---|---|
| is_output | 1 bit | 0=source, 1=destination |
| register type | 4 bits | 0=GPR, 2=uniform, 3=pair, 4=quad, 5=predicate, 7=barrier, 8=special, 10=64-bit, 11=128-bit |
| register number | 10 bits | 0-1023 |
The decoder sub_4C60F0 uses a register class parameter: 2=GPR, 3=predicate, 9=constant buffer reference, 10=uniform register.
Encoding Infrastructure
The SM100 ISA definition is split across four binary regions, totaling ~8 MB of template-instantiated functions:
Region Map
| Address Range | Size | Content | Functions |
|---|---|---|---|
0x620000--0x84DD70 | 2.2 MB | SM100+ SASS encoders (table 1) | 1,537 |
0x84DD70--0xA48290 | 1.7 MB | InstrDesc initializers | 1,613 |
0xDA0000--0xE436D0 | 660 KB | SASS encoders (table 2) | 438 |
0xE43DC0--0xF15A50 | 840 KB | SASS decoders | 648 |
Combined totals: 1,975 encoder functions, 648 decoder functions, 1,613 descriptor initializers = 4,236 template-instantiated functions.
Encoder Architecture
Every encoder follows an identical structural template:
__int64 __fastcall encode_XXX(__int64 buf, __int64 ir_instr)
{
sub_4C2A60(buf); // Initialize encoding buffer
sub_4C28B0(buf, 0, 4, major_opcode); // Set bits[3:0] = major
sub_4C28B0(buf, 4, 3, 0); // Set bits[6:4] = format ext
sub_4C28B0(buf, 8, 9, minor_opcode); // Set bits[16:8] = minor
sub_4C28B0(buf, 17, 8, sub_opcode); // Set bits[24:17] = sub
sub_4C28B0(buf, 25, 7, modifier); // Set bits[31:25] = modifier
sub_4C2A90(buf, ir_instr, pred_idx); // Encode predicate guard
sub_4C4D60(buf, ir_instr, 0, 0x50); // Encode register operand 0
sub_4C4D60(buf, ir_instr, 1, 0x60); // Encode register operand 1
sub_4C52F0(buf, ir_instr, 2, 0x70); // Encode immediate operand
sub_4C5C30(buf, ir_instr, 3, 0x80); // Encode predicate operand
// ... instruction-specific modifier encoding via sub_A4xxxx/sub_A5xxxx ...
sub_A50D10(lookup_table, opcode_value); // Primary opcode mapping
}
Core Infrastructure Functions
| Address | Signature | Usage | Description |
|---|---|---|---|
sub_4C28B0 | (buf, bit_offset, width, value) | All encoders | Bitfield insertion into 128-bit instruction word |
sub_4C2A60 | (buf) | All encoders | Clear operand remap table (offsets 468-531, 16 DWORD slots), reset operand counter at offset 532 |
sub_4C2A90 | (buf, ir, variant) | All encoders | Encode predicate register + 5-bit scheduling field; variant 0=standard, 1=extended immediate |
sub_4C4D60 | (buf, ir, op_idx, offset) | 1,964/1,975 | Register operand encoder: 1-bit output flag + 4-bit type + 10-bit register number |
sub_4C52F0 | (buf, ir, op_idx, offset) | 715/1,975 | Constant/immediate operand encoder: 5-bit type + register number |
sub_4C5C30 | (buf, ir, op_idx, offset) | 875/1,975 | Predicate/special operand encoder with operand remapping |
Modifier Encoding Helpers
| Address | References | Modifier |
|---|---|---|
sub_A50D10 | 1,975 (all) | Primary opcode mapping through lookup table |
sub_A50CF0 | 267 | Secondary/auxiliary opcode bits |
sub_A4D6A0 | 86 | Rounding mode |
sub_A4D920 | 81 | Type conversion |
sub_A4F1C0 | 75 | Destination negate |
sub_A4F210 | 75 | Absolute value |
sub_A4DDD0 | 65 | Saturation |
sub_A4F120 | 66 | Source negate |
sub_A4D7D0 | 43 | Data type size |
Decoder Architecture
Decoders mirror encoders -- each unpacks a 128-bit instruction word into the internal IR:
__int64 __fastcall decode_XXX(__int64 buf, __int64 ir_out)
{
sub_4C5F90(buf, ir_out); // Finalize init
sub_4C60F0(buf, ir_out, 0, 0x50, 2); // Decode register (GPR)
sub_4C60F0(buf, ir_out, 1, 0x60, 10); // Decode register (uniform)
sub_4C6380(buf, ir_out, 2, 0x70, 9); // Decode constant buffer
sub_4C6DC0(buf, ir_out, 3, 0x80, 3); // Decode predicate
sub_50C790(lookup, bit_val); // Read 1-bit flag
sub_50BBA0(lookup, val); // Read 5-bit predicate field
// ... field extraction via sub_50xxxx/sub_51xxxx ...
}
| Address | Usage | Description |
|---|---|---|
sub_4C5F90 | All 648 decoders | Finalize instruction decoding |
sub_4C60F0 | 634/648 | Decode register operand with class parameter |
sub_4C6380 | 288/648 | Decode constant buffer operand |
sub_4C6DC0 | 321/648 | Decode predicate/barrier operand |
sub_50C790 | All 648 | Read 1-bit field from encoded instruction (universal) |
sub_50C770 | 188/648 | Read multi-bit flag field |
sub_50BBA0 | 101/648 | Read 5-bit predicate register field |
Dispatch Functions
Two functions serve as the entry points for the entire encoding/decoding pipeline:
Encoder dispatch (sub_E43C20, 92 lines): Reads the instruction opcode from *(a2+12) as a 16-bit value. Special-cases opcodes 120-121 (control flow) to sub_4C8810. Otherwise performs binary search in the 24-byte dispatch table at off_1EA4380. Each table entry contains {opcode_byte0, opcode_byte1, encode_func_ptr, this_offset}. Supports vtable-aware dispatch (checks LSB of function pointer).
Decoder dispatch (sub_EFE6C0, 93 lines): Reads the 128-bit instruction word from a1[68] (offset 544). Extracts the format field (bits>>4)&7 and immediate offset 16*(bits&0xF). Delegates format 2/3 to sub_4CB100. Extracts the 9-bit major opcode (bits>>8)&0x1FF. For format 1, adds 121 to create a separate opcode space. Binary search in dispatch table at off_1E957E0. Calls sub_A49B50 to finalize.
InstrDesc Initializers
The 1,613 descriptor initializers at 0x84DD70--0xA48290 populate instruction metadata objects defining:
- Operand count and types (register, immediate, memory, predicate)
- Instruction properties (flags, latencies, execution unit assignment)
- Scheduling hints and resource requirements
Each descriptor writes a class ID to *(a2+12). The class distribution spans 56 unique IDs (0-103), with ID 18 being the largest (78 descriptors, integer arithmetic class).
Complete Instruction Set Map
The following tables enumerate all 118 instruction families across the three major opcodes. Within each family, sub-opcodes define specific instruction variants (register-register, register-immediate, predicated, etc.).
Major Opcode 1: ALU/Scalar (558 encoders)
| Minor | Family | Variants | Operand Range | Description |
|---|---|---|---|---|
| 0x01 | ALU-MISC-A | 2 | R1-2 | Miscellaneous ALU type A |
| 0x02 | ALU-MISC-B | 7 | R2-3, I0-1 | Miscellaneous ALU type B |
| 0x03 | ALU-MISC-C | 1 | R2 | Single-variant ALU |
| 0x04 | INT-SHIFT | 6 | R2-3, I0-1 | Integer shift operations |
| 0x06 | INT-ARITH | 12 | R2-3, I0-1 | Integer arithmetic (ADD, SUB) |
| 0x08 | INT-MUL | 3 | R2-3 | Integer multiply |
| 0x09 | INT-MAD-EXT | 26 | R2-4, I0-1 | Integer multiply-add extended -- largest family in major 1 |
| 0x0A | FLOAT-CVT | 3 | R2-3 | Float conversion |
| 0x0C | CVT | 1 | R2 | Type conversion |
| 0x0D | INT-MAD | 10 | R2-3, I0-1 | Integer multiply-add |
| 0x0E | INT-WIDE | 6 | R2-3, I0-1 | Integer wide operations |
| 0x0F | INT-CMP | 2 | R2-3 | Integer comparison |
| 0x10 | IMM-LOAD | 4 | R2-3, I1 | Immediate value load |
| 0x12 | BIT-OP | 4 | R2-3 | Bit manipulation |
| 0x17 | SHIFT-EXT | 8 | R2-4, I0-1 | Extended shift operations |
| 0x18 | ISCADD/LEA | 10 | R2-4, I0-1, P0-1 | Integer scale-add / load effective address |
| 0x19 | SHIFT-WIDE | 5 | R2-3, I0-1 | Wide shift variants |
| 0x1A | CONST-LOAD | 4 | R2-3, I0-1 | Constant buffer load |
| 0x1B | INT-ABS | 3 | R2-3 | Integer absolute value |
| 0x1C | INT-NEG | 1 | R2 | Integer negate |
| 0x1D | INT-MIN/MAX | 1 | R2 | Integer minimum/maximum |
| 0x1E | BIT-EXT | 4 | R2-3, I0-1 | Bit field extract/insert |
| 0x22 | INT-SPECIAL | 1 | R2 | Special integer operation |
| 0x24 | INT-DIV | 2 | R2-3 | Integer divide helpers |
| 0x27 | LD-GLOBAL-128 | 2 | I4 | 128-bit global memory load (immediate-only form) |
| 0x33 | UNIFORM-ALU | 2 | R2-3 | Uniform register ALU |
| 0x39 | WARP-VOTE | 2 | R2-3, P1 | Warp voting operations |
Major Opcode 2: Vector/Memory/Control (977 encoders)
| Minor | Family | Variants | Operand Range | Description |
|---|---|---|---|---|
| 0x27 | P-LD-128 | 2 | R3, I1, P1 | Predicated 128-bit load |
| 0x29 | ATOM-SHARED | 24 | R2-3, I1-2, P0-1 | Shared memory atomic operations |
| 0x3B | TEXTURE-SURF | 10 | R4-5, I1, P0-1 | Texture/surface operations |
| 0x3E | EXIT/BRK | 2 | R5, P1, M1 | Exit and break (7 operands each) |
| 0x3F | CONT | 1 | R5, P1, M1 | Continue (7 operands) |
| 0x40 | LONGJMP | 1 | R5, P1, M1 | Long jump |
| 0x42 | PREBRK | 1 | R5, P1, M1 | Pre-break |
| 0x43 | PCNT | 1 | R5, P1, M1 | Pre-continue |
| 0x44 | PRET | 1 | R5, P1, M1 | Pre-return |
| 0x45 | SSY | 1 | R5, P1, M1 | Set synchronization point |
| 0x46 | CAL | 1 | R5, P1, M1 | Function call |
| 0x49 | ATOM-GLOBAL | 6 | R2-3, I1-2, P0-1 | Global memory atomic operations |
| 0x4A | REDUCE | 4 | R2-3, I0-1 | Reduction operations |
| 0x4C | MEM-FENCE | 4 | R2-3, I0-1, P0-1 | Memory fence instructions |
| 0x4E | CORE-VEC | 63+ | R3-4, I1, P1 | Core vector ALU -- largest family |
| 0x4F | CORE-VEC-EXT | 25 | R2-4, I0-2, P0-1 | Extended core vector ALU |
| 0x52 | INT-SET-PRED | 6 | R2-3, I0-1 | Integer set predicate |
| 0x54 | LOAD-SHARED-EXT | 6 | R3, P1 | Extended shared memory load |
| 0x56 | WARP-BARRIER | 4+ | R1-2, P0-1 | Warp-level barrier operations |
| 0x59 | MMA-TENSOR | 30 | R4-7, I0-1, P0-2 | Tensor core MMA -- 2nd largest family |
| 0x5C | LOAD-GLOBAL | 6 | R3-4, P0-1 | Global memory load |
| 0x61 | STORE-GLOBAL | 5 | R1-3, I1-4, P0-2 | Global memory store |
| 0x62 | STORE-SHARED | 6 | R2-3, P0-2 | Shared memory store |
| 0x6B | TEXTURE-LOAD | 4 | R4-7, I1, P0-1 | Texture load operations |
| 0x6D | SURFACE-OP | 5 | R4-7, I1, P0-1 | Surface load/store |
| 0x70 | ASYNC-COPY | 4 | R2-3, I1, P0-1 | Asynchronous memory copy |
| 0x71 | WARP-GROUP | 3 | R2-3, P0-1 | Warp group operations |
| 0x7E | BARRIER-OP | 6 | R1-2, P0-1 | Barrier operations |
| 0x81 | UNIFORM-PRED | 4 | R0-1, I1, P0-1 | Uniform predicate operations |
| 0x84 | CONTROL-FLOW | 5 | R0-3, I0-1, P0-1 | Branch, call, return |
| 0x94 | TMA-ACCESS | 4 | R3-4, P0-1 | Tensor memory accelerator access |
| 0x95 | DEPBAR | 6 | R0-2, P0-1 | Dependency barrier |
| 0xA4 | HMMA-A | 5 | R5-6, I1, P0-1 | Half-precision MMA type A |
| 0xA5 | HMMA-B | 4 | R4-5, I0-1, P0-1 | Half-precision MMA type B |
| 0xA6 | IMMA | 3 | R4-5, P0-1 | Integer MMA |
| 0xA8 | MMA-F16xF16 | 8 | R5-6, I1, P1-2 | FP16 x FP16 matrix multiply |
| 0xA9 | MMA-MIXED | 8 | R4-5, I0-1, P0-1 | Mixed-precision MMA |
| 0xAB | MMA-INT | 4 | R4-5, P0-1 | Integer matrix multiply-accumulate |
| 0xAC | MMA-INT-V | 8 | R4-5, P0-1 | Integer MMA variant |
| 0xAD | LOAD-UNIFORM | 9 | R1-2, P0-1 | Uniform register load |
| 0xAE | MMA-TF32xTF32 | 10 | R5-6, I1, P1-2 | TF32 x TF32 MMA |
| 0xAF | MMA-F64xF64 | 10 | R5-6, I1, P1-2 | FP64 x FP64 MMA |
| 0xB0 | MMA-REDUCED | 10 | R4-6, I0-1, P1-2 | Reduced-precision MMA |
| 0xB1 | MMA-SPARSE | 5 | R4-5, P1-2 | Sparse MMA |
| 0xB2 | MMA-WIDE | 5 | R5-6, I1, P1-2 | Wide MMA |
| 0xB3 | MMA-SPECIAL | 5 | R4, P1 | Special MMA variant |
| 0xCF | PRED-LOGIC | 3 | R1-2, P0-1 | Predicate logic operations |
| 0xD4 | SELECT | 2 | R1-2, P0-1 | Conditional select / cmov |
| 0xDF | IMM-MOV | 14 | R1-2, I1-3, P0-1 | Immediate move -- most immediate-rich family |
| 0xE1 | STORE-EXT | 4 | R1-3, I1-3, P0-2 | Extended store operations |
| 0xEE | WARP-SHUFFLE | 3 | R2-3, P0-1 | Warp shuffle operations |
Major Opcode 3: Special (1 encoder)
| Minor | Family | Variants | Operands | Description |
|---|---|---|---|---|
| 0x6F | HSETP2-WIDE | 1 | R4, I1, P2 | Half-precision set predicate, paired comparison -- the only format 3 instruction in the entire ISA |
Instruction Families by Functional Class
The 118 instruction families group naturally into functional classes:
Integer Arithmetic (13 families, ~110 encoders)
Integer arithmetic dominates the ALU opcode space. The key families:
- IADD3 (minor 0x005 in the 0xDA region): 3-input integer add with carry. 26 encoding variants covering register-register, register-immediate, and different data widths. All variants include 1 immediate + 1 predicate + 3-4 register operands.
- IMAD (minors 0x016, 0x017): Integer multiply-add in standard (0x016, 7 variants) and wide (0x017, 6 variants) forms.
- ISCADD/LEA (minor 0x18): Integer scale-add / load effective address. 10 variants. Used extensively in address computation.
- ALU-MISC (minor 0x012 in 0xDA region): The largest single class with 63 distinct encodings. Includes BFE, BFI, FLO, POPC, LEA, PRMT, and many more.
Floating-Point Arithmetic (5 families, ~30 encoders)
- FADD (minor 0x007): 11 variants covering FP32 add, multiply, FMA.
- HADD2/HFMA2 (minors 0x06A, 0x06D): Half-precision packed arithmetic. 5 and 4 variants respectively. HFMA2 uses 2 predicate operands.
- HSETP2 (minor 0x06F): Half-precision set predicate with paired output. 7 variants including the sole format 3 instruction.
- DFMA (minor 0x020): Double-precision FMA with 4 variants.
Memory Operations (15 families, ~90 encoders)
- LDG (minor 0x025): Global memory load. 9 variants with up to 4 register operands.
- STG (minor 0x02E): Global memory store. 8 variants with up to 4 immediates.
- LDS/STS (minors 0x00D, 0x0E3, 0x0EC): Shared memory load and store. 6 variants each.
- LDL (minor 0x067): Local memory / stack. 13 variants with 2-4 immediates for complex addressing.
- LDC (minor 0x060): Constant buffer load. 6 variants, all with 4 immediate operands.
- Async Copy (minor 0x0B8): 6 variants for asynchronous data movement.
Texture and Surface (5 families, ~30 encoders)
- TEX (minor 0x05A): Texture fetch. 13 variants. The most operand-rich instructions in the entire ISA: up to 7 register + 1 immediate + 1 predicate operands. The largest encoder function (
sub_DC6680, 8,794 bytes, 302 lines) belongs to this family. - TXQ (minor 0x08B): Texture query. 4 variants with 4-7 register operands.
- SUST (minor 0x0CE): Surface store. 4 variants.
- Texture/Surface load (minor 0x03B): 6 variants.
Tensor Core / MMA (16 families, ~120 encoders)
The tensor core families constitute the second-largest functional class:
| Family | Minor(s) | Variants | Description |
|---|---|---|---|
| HMMA-A/B | 0xA4, 0xA5 | 9 | Half-precision matrix multiply-accumulate |
| MMA-F16xF16 | 0xA8 | 6 | FP16 x FP16 with 5-6 register operands |
| MMA-MIXED | 0xA9 | 6 | Mixed-precision MMA |
| IMMA | 0xA6, 0xAB | 6 | Integer MMA |
| IMMA-V | 0xAC | 6 | Integer MMA variant |
| MMA-TF32 | 0xAE | 6 | TF32 x TF32 matrix multiply |
| MMA-F64 | 0xAF | 6 | FP64 x FP64 double-precision MMA |
| MMA-REDUCED | 0xB0 | 6 | Reduced-precision MMA |
| MMA-SPARSE | 0xB1 | 3 | Sparsity-aware MMA |
| MMA-WIDE | 0xB2 | 3 | Wide accumulator MMA |
| MMA-SPECIAL | 0xB3 | 1 | Special MMA variant |
| TENSOR-LD | 0xB4 | 2 | Tensor memory load |
| TENSOR-ST | 0xB6 | 2 | Tensor memory store |
| TMA | 0x09F | 6 | Tensor Memory Accelerator operations |
| WGMMA | 0x0C2 | 2 | Warp Group MMA (Blackwell-new) |
Control Flow (10 families, ~25 encoders)
- BRA (minor 0x034): Branch with 3 variants (conditional, unconditional, indirect).
- EXIT/BRK (minor 0x3E): Exit/break with 7-operand encoding (5 registers + 1 predicate + 1 memory).
- CALL/RET (minors 0x079, 0x07B): Function call and return.
- SSY/CAL/PCNT/PRET (minors 0x42-0x46): Structured control flow primitives, each with 7 operands.
- YIELD/NANOSLEEP (minor 0x07C): Thread yield with 2 variants.
Synchronization (6 families, ~25 encoders)
- BAR.SYNC (minor 0x085): Barrier synchronization.
- WARP-BARRIER (minor 0x56): Warp-level barrier with 4+ variants.
- DEPBAR (minor 0x096): Dependency barrier for scoreboard management.
- BARRIER-OP (minor 0x0A7): Extended barrier operations.
- FENCE (minor 0x08D): Memory fence with 4 variants.
SM100-Specific Instructions
ROT13 Mnemonic Table
The SM100 opcode table constructor at sub_1782540 (111,076 bytes, 3,227 lines) initializes ~400+ instruction mnemonics using ROT13 encoding. The ROT13 prefix ZREPHEL decodes to MERCURY, the internal codename for Blackwell. Key SM100-specific mnemonics decoded from the binary:
| ROT13 | Decoded | Description |
|---|---|---|
OZZN | BMMA | Block matrix multiply-accumulate |
QZZN | DMMA | Dense matrix multiply-accumulate |
DZZN | QMMA | Quantized matrix multiply-accumulate |
GPTRA05 | TCGEN05 | Tensor Core Generation 5 intrinsic |
HGPONE | UTCBAR | Unified Tensor Core barrier |
HGPPC | UTCPC | Unified Tensor Core program counter |
HGPUZZN | UTCHMMA | Unified Tensor Core half-precision MMA |
ZKEZN.FC | MXQMA.SP | Mixed-precision quantized MMA, sparse variant |
FLAPF | SYNCS | Synchronization primitives (Blackwell-specific) |
NPDOHYX | ACQBULK | Acquire bulk (barrier operation) |
NPDFUZVAVG | ACQSHMINIT | Acquire shared memory init |
NY2C | AL2P | Attribute to parameter (legacy compat) |
NEEVIRF | ARRIVES | Barrier arrival notification |
NGR | AST | Attribute store |
NGBZ | ATOM | Atomic operation |
NGBZT | ATOMG | Atomic global |
SNQQ | FADD | Float add |
SZHY | FMUL | Float multiply |
VZNQ | IMAD | Integer multiply-add |
VZNQ_JVQR | IMAD_WIDE | Integer multiply-add wide |
VNQQ3 | IADD3 | 3-input integer add |
OZFX | BMSK | Bit mask |
FTKG | SGXT | Sign extend |
YBC3 | LOP3 | 3-input logic operation |
VFRGC | ISETP | Integer set predicate |
PPGY | CCTL | Cache control |
OFLAP | BSYNC | Block synchronization |
SRAPR | FENCE | Memory fence |
REEONE | ERRBAR | Error barrier |
tcgen05 Intrinsics
SM100 introduces the tcgen05 (Tensor Core Generation 5) subsystem with dedicated PTX-level intrinsics. The code generation infrastructure resides at 0x16E0000--0x16E3AB0 and spans 14+ functions handling type classification, bounds checking, tensor memory address computation, MMA operand setup, and argument type mapping.
The two PTX instruction mnemonics are tcgen05.mma (standard) and tcgen05.mma.ws (warp-specialized). Modifier suffixes include _expand16bit, _pack16bit, _maxabs, _minabs, _fused, _blockscale, and _ashift.
Instruction Type Classifier
sub_16E0A70 (17,302 bytes, 322 lines) classifies a tcgen05 MMA instruction into one of 54 type IDs by chaining predicate functions against the instruction object's type field at *(*(a1+8)). Each predicate is a trivial equality check (*a1 == N); the classifier tests them in priority order and returns the first matching type ID, or 0 if no predicate matches.
The 54 predicate functions live at 0x12B5670--0x12B5C20 (16 bytes each, all identical in structure). The complete classifier chain and resulting type ID assignment:
| Priority | Predicate | Internal Value | Type ID | Type |
|---|---|---|---|---|
| 1 | sub_12B5670 | 1 | 1 | Base types |
| 2 | sub_12B5680 | 2 | 2 | Base types |
| 3 | sub_12B5690 | 3 | 3 | Base types |
| 4 | sub_12B56A0 | 4 | 4 | Base types |
| 5 | sub_12B56B0 | 5 | 5 | Base types |
| 6 | sub_12B56C0 | 6 | 6 | Base types |
| 7 | sub_12B56D0 | 7 | 7 | Base types |
| 8 | sub_12B56E0 | 8 | 8 | Base types |
| 9 | sub_12B5700 | 18 | 18 | Extended type A |
| 10 | sub_12B5710 | 19 | 19 | Extended type A |
| 11 | sub_12B5720 | 21 | 21 | Extended type A |
| 12 | sub_12B5730 | 23 | 23 | Extended type A |
| 13 | sub_12B5740 | 24 | 24 | Extended type A |
| 14 | sub_12B5780 | 10 | 10 | Extended type B |
| 15 | sub_12B5790 | 11 | 11 | Extended type B |
| 16 | sub_12B57A0 | 13 | 13 | Extended type B |
| 17 | sub_12B57B0 | 15 | 15 | Extended type B |
| 18 | sub_12B57C0 | 16 | 16 | Extended type B |
| 19 | sub_12B58D0 | 25 | 25 | Blockscale types |
| 20 | sub_12B58E0 | 26 | 26 | Blockscale types |
| 21 | sub_12B5960 | 27 | 27 | Blockscale types |
| 22 | sub_12B5950 | 29 | 29 | Blockscale types |
| 23 | sub_12B5940 | 28 | 28 | Blockscale types |
| 24 | sub_12B5970 | 33 | 33 | Quantized types |
| 25 | sub_12B58F0 | 32 | 32 | Quantized types |
| 26 | sub_12B5900 | 30 | 30 | Quantized types |
| 27 | sub_12B5910 | 31 | 31 | Quantized types |
| 28 | sub_12B5B50 | 34 | 34 | Mixed-precision types |
| 29 | sub_12B5920 | 35 | 35 | Mixed-precision types |
| 30 | sub_12B5930 | 36 | 36 | Mixed-precision types |
| 31 | sub_12B5980 | 37 | 37 | Mixed-precision types |
| 32 | sub_12B5B30 | 42 | 42 | Sparse types |
| 33 | sub_12B5B40 | 43 | 43 | Sparse types |
| 34 | sub_12B5990 | 38 | 38 | FP8/FP6/FP4 types |
| 35 | sub_12B5A80 | 39 | 39 | FP8/FP6/FP4 types |
| 36 | sub_12B5A90 | 40 | 40 | FP8/FP6/FP4 types |
| 37 | sub_12B5AA0 | 41 | 41 | FP8/FP6/FP4 types |
| 38 | sub_12B5AB0 | 44 | 44 | Ashift types |
| 39 | sub_12B5AC0 | 45 | 45 | Ashift types |
| 40 | sub_12B5AD0 | 46 | 46 | Ashift types |
| 41 | sub_12B5B00 | 47 | 47 | Fused types |
| 42 | sub_12B5B10 | 48 | 48 | Fused types |
| 43 | sub_12B5AF0 | 50 | 50 | Pack/Expand types |
| 44 | sub_12B5AE0 | 49 | 49 | Pack/Expand types |
| 45 | sub_12B5B20 | 51 | 51 | Pack/Expand types |
| 46 | sub_12B5BA0 | 52 | 52 | MXQ types |
| 47 | sub_12B5BB0 | 56 | 56 | MXQ types |
| 48 | sub_12B5BF0 | 53 | 53 | MXQ types |
| 49 | sub_12B5C00 | 54 | 54 | MXQ types |
| 50 | sub_12B5BC0 | 58 | 58 | Maxabs types |
| 51 | sub_12B5BE0 | 59 | 59 | Maxabs types |
| 52 | sub_12B5C10 | 55 | 55 | Extended sparse |
| 53 | sub_12B5C20 | 60 | 60 | Extended sparse |
Type IDs 9, 12, 14, 17, 20, 22, 57 are absent from the enum -- either reserved for future expansion or used by related subsystems not routed through this classifier.
The type ID groupings inferred from the classification order and the modifier suffix strings:
- IDs 1--8: Base MMA types (standard precision combinations -- the 8 fundamental
tcgen05.mmaconfigurations) - IDs 10--16: Extended base types with non-standard precision or accumulator widths
- IDs 18--24: Extended type A variants (wider accumulator or non-standard rounding)
- IDs 25--29: Blockscale variants (
_blockscalemodifier -- block-level scaling for MX formats) - IDs 30--33: Quantized MMA types (
tcmma_*_q/tcmma_*_mxqinternal names) - IDs 34--37: Mixed-precision variants (asymmetric A/B input types)
- IDs 38--41: FP8/FP6/FP4 narrow-type variants (e4m3, e5m2, e3m2, e2m3, e2m1 combinations)
- IDs 42--43: Sparse MMA variants (structured sparsity -- 2:4 or 4:8 patterns)
- IDs 44--46: Ashift variants (
_ashiftmodifier -- arithmetic shift on matrix A) - IDs 47--48: Fused variants (
_fusedmodifier -- fused accumulation) - IDs 49--51: Pack/Expand variants (
_pack16bit,_expand16bitmodifiers) - IDs 52--56: MXQ (Mixed-precision Quantized) types (
_blockscale+ quantization) - IDs 58--59: Maxabs/Minabs variants (
_maxabs,_minabsreduction modifiers) - ID 55: Extended sparse variant
- ID 60: Extended sparse variant (last in chain)
Guardrails Code Generation
sub_16E1DB0 (10,365 bytes, 325 lines) generates inline PTX boundary-checking code that validates tensor memory accesses before tcgen05.mma execution. This is a compiler-inserted safety mechanism -- the guardrails are weak functions that can be overridden at link time.
Symbol selection (line 67--73): The function selects one of two guardrail symbol prefixes based on the instruction opcode at offset 776:
| Opcode | Symbol Prefix |
|---|---|
| 149 | __cuda__sm10x_tcgen05_guardrails_are_columns_allocated_ |
| other | __cuda__sm10x_tcgen05_guardrails_in_physical_bounds_ |
Both prefixes get nCols appended to form the final guardrail variable name (e.g., __cuda__sm10x_tcgen05_guardrails_in_physical_bounds_nCols).
nCols computation (line 82--318): The number of columns accessed by the MMA instruction is computed via a multi-level dispatch. The first dispatch key is BYTE2(v63) & 0x1C -- a 3-bit field extracted from the instruction descriptor at offset 632:
| Field Value | nCols Computation |
|---|---|
0x10 (16) | Constant: mov.u32 %s, 8 |
0x14 (20) | Dynamic: reads descriptor field at offset 634, dispatches on (byte >> 5) -- see sub-table below |
0x18 (24) | Register: copies from operand at offset 664 via mov.u32 %s, %s |
0x1C (28) | Register: copies from operand at offset 656 via mov.u32 %s, %s |
| other | Falls through to instruction-opcode-based dispatch |
Instruction-opcode nCols table (lines 103--128): When none of the above field values match, the byte at offset 620 (v4, signed char interpreted as instruction opcode) determines nCols:
| Opcode (signed) | Opcode (unsigned) | nCols Base | nCols (with scale) | Interpretation |
|---|---|---|---|---|
| -87 (0xA9) | 169 | 16 / 8 | scaled | 16-col MMA types |
| -83 (0xAD) | 173 | 16 / 8 | scaled | 16-col MMA types |
| -78 (0xB2) | 178 | 16 / 8 | scaled | 16-col MMA types |
| -86 (0xAA) | 170 | 2 / 1 | scaled | 2-col MMA types |
| -82 (0xAE) | 174 | 2 / 1 | scaled | 2-col MMA types |
| -85 (0xAB) | 171 | 4 / 2 | scaled | 4-col MMA types |
| -84 (0xAC) | 172 | 8 / 4 | scaled | 8-col MMA types |
| -81 (0xAF) | 175 | 8 / 4 | scaled | 8-col MMA types |
| -80 (0xB0) | 176 | 8 / 4 | scaled | 8-col MMA types |
| -79 (0xB1) | 177 | 8 / 4 | scaled | 8-col MMA types |
The nCols value is modified by two flags in the instruction descriptor at offset 624 (v62.m128i_i8[8]):
& 0x03(CTA group flag): If nonzero, the base nCols is doubled (nCols = nCols_wide instead of nCols_narrow)& 0x78(scale factor flag): If nonzero, nCols is multiplied bysub_12AB660(v5)wherev5 = (byte >> 3) & 0xFis a 4-bit index into the scale lookup table atdword_1F24300(values indexed 0--7)
Dynamic descriptor dispatch (field value 0x14, lines 158--318): When the field value is 0x14, a secondary 3-bit dispatch on *(a1+634) >> 5 selects one of 6 nCols computation strategies:
| Case | PTX Emitted | Semantics |
|---|---|---|
| 1 | mov.u32 %s, 8 | Constant 8 columns |
| 2 | and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8; mov.u32 %s, %s; | Extract bits[22:17] from descriptor, multiply by 8 |
| 3 (with FP16 inputs) | mov.u32 %s, 4 | Constant 4 columns for FP16 |
| 3 (other) | and.b32 %s, 0x1F000000, %s; shr.u32 %s, %s, 24; mul.lo.u32 %s, %s, 16; setp.eq.u32 %s, %s, 128; selp.u32 %s, 2, 4, %s; | Extract bits[28:24], multiply by 16, select 2 or 4 based on result |
| 4 | and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8; cvt.rp.f32.u32 %s, %s; div.rp.f32 %s, %s, 64.0; cvt.rpi.u32.f32 %s, %s; mul.lo.u32 %s, %s, 2; | Extract bits[22:17], multiply by 8, divide by 64 rounding up, multiply by 2 |
| 5 | mov.u32 %s, 2 | Constant 2 columns |
Case 3 checks whether either matrix A (offset 611, bits[5:4]) or matrix B (offset 627, bits[5:4]) uses FP16 format (== 1). If so, nCols is fixed at 4; otherwise, a full bitfield extraction from the descriptor determines the value dynamically.
Case 4 is notable for using floating-point arithmetic (cvt.rp.f32.u32, div.rp.f32, cvt.rpi.u32.f32) to compute a ceiling division -- nCols = ceil(bits * 8 / 64) * 2. This computes the number of 64-byte-aligned column groups needed.
Tensor Memory Address Computation
Five functions generate the inline PTX for computing tensor memory (tmem) addresses used by tcgen05.mma operands:
| Address | Size | Symbol | Operand | Source Offset |
|---|---|---|---|---|
sub_16E2410 | 111 lines | __cuda_sm_100_tcgen05_tmem_addr | General tmem | Offset 648 (opcode-dependent) |
sub_16E2610 | 93 lines | __cuda_sm10x_tcgen05_mma_tmemD | Destination D | Offset 648 (slot 0) |
sub_16E27D0 | 93 lines | __cuda_sm10x_tcgen05_mma_tmemA | Source A | Offset 656 (slot 1) |
sub_16E2990 | 99 lines | __cuda_sm10x_tcgen05_mma_scaleTmemA | Scale A | Offset 648 + 8*N (dynamic) |
sub_16E2B80 | 99 lines | __cuda_sm10x_tcgen05_mma_scaleTmemB | Scale B | Offset 648 + 8*N (dynamic) |
A sixth function at sub_16E2D70 (93 lines) handles __cuda_sm10x_tcgen05_mma_spMetaTmem -- the sparsity metadata tensor memory address, reading from offset 672 (slot 3).
Each function follows the same two-path pattern:
-
32-bit tmem address (when the address operand type is
sub_12B5850orsub_12B56C0, internal type 6): Emits a simpleadd.u32:add.u32 %s, <symbol>, %s, %s; -
64-bit tmem address (all other types): Wraps in a scoped register block with 64-to-32-bit conversion:
{.reg .b32 __cuda_sm_100_tcgen05_tmem_addr_base; cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s; add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s; }
The scale operand functions (sub_16E2990, sub_16E2B80) dynamically select their operand slot: scaleTmemA uses slot 5 - (cta_group_flag == 0) and scaleTmemB uses slot 6 - (cta_group_flag == 0), where cta_group_flag = (*(a1+627) & 0x30).
Return Value / Input Operand Marshalling
Two functions handle parameter passing to/from the tcgen05 MMA helper functions:
Output marshalling (sub_16E1A80, 47 lines): Iterates over the operand list at offset 648, emitting ld.param.b32 %s, [%s + %d] for each register result. The return array symbol depends on the opcode:
- Opcode 143:
__cuda_sm_100_tcgen05_ld_red_funcRetArr(load-reduce variant) - Opcode 160:
__cuda_sm_100_tcgen05_ld_funcRetArr(standard load variant)
For reduce operations (sub_12A9290 returns true), an additional operand is appended from offset 656.
Input marshalling (sub_16E1BC0, 30 lines): Iterates over the input operand list at offset 648 (slot determined by *(a1+596) - 1), emitting st.param.b32 [%s + %d], %s for each register input. Uses the symbol __cuda_sm_100_tcgen05_st_funcInputArr.
Arguments Type Mapper
sub_16E3AB0 (18,623 bytes, 337 lines) performs the same type classification as sub_16E0A70 but operates on an array of tcgen05 instruction arguments rather than a single instruction. It iterates over the argument array (count at offset 796, result stored at offset 944), calling the identical predicate chain to assign numeric type IDs.
Guardrail Trap Functions
The guardrails infrastructure defines five weak trap functions that are called when bounds violations are detected. These are declared as PTX prototypes embedded in the binary:
| Symbol | Parameters | Description |
|---|---|---|
__cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds | oob_access_col_no, instr_kind | Out-of-bounds column access |
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access | col_no_accessed, alloced_mask, instr_kind | Access to unallocated column |
__cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment | dp_lane, matrix_kind, valid_alignment_kind | Misaligned datapath access |
__cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env | idesc_sp_enabled, idesc, mma_kind, ptx_target, is_family_portable | Sparsity used in unsupported context |
__cuda_sm10x_tcgen05_guardrails_check_column_allocation | start_col_num, num_of_cols, inst_kind | Column allocation verification (returns retVal) |
Two additional check functions return a result value:
| Symbol | Parameters | Description |
|---|---|---|
__cuda_sm10x_tcgen05_guardrails_check_physical_bounds | start_col_num, num_of_cols, inst_kind | Physical bounds check (returns retVal) |
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment | tmemAddr, iDesc, cta_group, hasWS, hasSP, matrix_kind | Datapath alignment check (returns retVal) |
All are declared .weak .func (overridable at link time) and use .FORCE_INLINE for the check variants, ensuring they are inlined at every call site.
Additional tcgen05 Symbols
Beyond MMA, the tcgen05 infrastructure references additional symbols for load, store, and load-reduce operations:
| Symbol | Usage |
|---|---|
__cuda_sm_100_tcgen05_tmem_addr | General tmem address for ld/st operations |
__cuda_sm_100_tcgen05_ld_funcRetArr | Return array for tcgen05 loads |
__cuda_sm_100_tcgen05_ld_red_funcRetArr | Return array for tcgen05 load-reduce |
__cuda_sm_100_tcgen05_st_funcInputArr | Input array for tcgen05 stores |
__cuda_sm_100_tcgen05_ld_immhalfSplitOff | Half-split offset for loads (opcode 0xAA) |
__cuda_sm_100_tcgen05_ld_red_immhalfSplitOff | Half-split offset for load-reduce (opcode 0xAA) |
__cuda_sm_100_tcgen05_st_immhalfSplitOff | Half-split offset for stores (opcode 0xAA) |
The EIATTR system tracks tcgen05 MMA usage through two compatibility attributes: EICOMPAT_ATTR_INST_TCGEN05_MMA (current) and EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED (legacy).
Decoder Opcode Classes for Blackwell-New Instructions
The decoder region (0xE43DC0--0xF15A50) reveals Blackwell-specific instruction classes through high opcode IDs (>300) not present in earlier architectures:
| Decoder Opcode | Decoders | Description |
|---|---|---|
| 356 | 15 | Uniform compute extensions (new register file operations) |
| 357 | 6 | New barrier primitives |
| 358 | 6 | New synchronization primitives |
| 368 | 10 | Asynchronous copy/compute operations |
| 289-299 | ~20 | TMA (Tensor Memory Accelerator) operations |
SM100 Compiler Backend
The SM100-specific compiler backend extends from 0x1782540 through 0x17B9300 and includes:
Opcode Table Constructor
sub_1782540 (111,076 bytes, 3,227 lines) is the SM100 opcode table constructor. It calls the parent class constructor sub_19B11F0 and initializes all SM100 instruction mnemonics with name + length at offset +11,360. The vtable is at off_2415E98. This is over 3x larger than the SM70 equivalent (sub_1769B50, 24,230 bytes) reflecting the expanded ISA.
Instruction Property Initializer
sub_17884A0 (44,713 bytes, 1,603 lines) sets latencies, throughputs, and execution unit assignments for every SM100 opcode. Companion to the opcode table constructor.
Scheduling Table
sub_178AA00 (35,422 bytes, 1,205 lines) initializes the SM100-specific scheduling resource tables defining per-instruction throughputs and latencies for each execution unit type (ALU, SFU, LDST, TEX, MMA).
SM100-Specific Optimization Passes
| Address | Size | Lines | Function |
|---|---|---|---|
sub_179BD10 | 16,544 B | 649 | SM100 peephole optimizer |
sub_179E620 | 10,854 B | 455 | SM100 instruction combiner |
sub_179EF10 | 9,214 B | 352 | SM100 pattern matcher |
sub_179F6D0 | 7,078 B | 294 | SM100 dead code handler |
sub_17A2130 | 33,823 B | 1,065 | SM100 instruction legalization (main) |
sub_17A7A40 | 17,754 B | 601 | SM100 type legalization |
sub_17A8610 | 29,094 B | 889 | SM100 lowering pass |
sub_17AB9D0 | 36,177 B | 1,221 | SM100 instruction selection |
sub_17ADA40 | 35,411 B | 1,270 | SM100 complex instruction selection |
sub_17B04A0 | 23,212 B | 871 | SM100 operand folding |
Master Encoder Functions
Two monumental encoding functions handle the final SASS binary emission:
sub_17F2670(156,611 bytes, 4,858 lines): The master instruction encoder -- the largest function in the entire nvlink binary. Dispatches to individual encoding routines for all SASS instruction types. ~640 local variables, 0x2C8-byte stack frame.sub_17F9AE0(61,531 bytes, 2,150 lines): Secondary encoder for less common instruction types.
Format Descriptor Tables
Each instruction references a 16-byte (128-bit) format descriptor loaded via SSE instructions. These descriptors define the operand layout template:
Encoder-Side Descriptors
| Address | Encoders | Usage |
|---|---|---|
xmmword_1E30DA0 | 166 | Default/common format |
xmmword_1E30E30 | 54 | Integer arithmetic formats |
xmmword_1E30DC0 | 34 | Memory operation formats |
xmmword_1E30DB0 | 19 | Comparison/predicate formats |
xmmword_1E30EF0 | 14 | MOV/immediate formats |
xmmword_1E30F10 | 16 | Integer extended formats |
xmmword_1E30E50 | 12 | Integer multiply formats |
xmmword_1E30EA0 | 10 | Bitfield/logic formats |
Decoder-Side Descriptors
| Address | Decoders | Usage |
|---|---|---|
xmmword_1F46278 | 125 | Common 64-bit instruction format |
xmmword_1F46388 | 106 | Standard 3-operand format |
xmmword_1F46AF8 | 99 | Extended operand format |
xmmword_1F46630 | 84 | Memory instruction format |
xmmword_1F46E28 | 72 | Wide instruction format |
xmmword_1F461F0 | 64 | Predicate instruction format |
The universal operand descriptor at xmmword_1F460E0/xmmword_1F460F0 (32 bytes) is referenced by all 4,236 encoder+decoder+descriptor functions -- it defines the register class mapping for the SM100 architecture.
Internal Data Structures
Instruction Representation Object (~560 bytes)
The a1 parameter across all encoder/decoder functions points to:
Offset Size Description
------ ---- -----------
0 4 Flags / instruction ID
4 4 Scheduling control bits
8 16 Format descriptor (128-bit SSE copy from xmmword table)
12 4 Operand count metadata field 1
16 4 Operand count metadata field 2
24-60 40 Operand register indices (10 x 4 bytes)
64-100 40 Operand type/class (10 x 4 bytes, -1 = unused)
104-140 40 Operand modifier flags (10 x 4 bytes)
144 4 Active operand count
148 4 Immediate encoding offset 1
152 4 Immediate encoding offset 2
156-276 120 Reserved / operand extension data
276-404 128 Decoded operand output buffer
404-468 64 Modifier flag output buffer
452 4 Modifier count field 1
456 4 Modifier count field 2
468-532 64 Operand remap table (16 DWORD slots)
532 4 Operand remap counter
536 8 Pointer to encoding/decoding lookup table
544 8 Instruction word 0 (bits 0-63 of 128-bit SASS)
552 8 Instruction word 1 (bits 64-127 of 128-bit SASS)
556 4 Decoded immediate / offset value
Encoder Dispatch Table Entry (24 bytes)
Offset Size Description
------ ---- -----------
0 1 Opcode byte 0 (minor opcode low)
1 1 Opcode byte 1 (minor opcode high / sub-opcode)
8 8 Function pointer to encoder (LSB=1 indicates vtable indirection)
16 8 this-pointer offset adjustment
Statistical Summary
| Metric | Value |
|---|---|
| Total encoding functions | 1,975 |
| Total decoding functions | 648 |
| Total descriptor initializers | 1,613 |
| Total template instantiations | 4,236 |
| Unique instruction families | 118 |
| Unique descriptor class IDs | 56 (ranging 0-103) |
| Largest encoder | sub_DC6680 TEX (8,794 bytes, 302 lines) |
| Smallest encoder | ~4,661 bytes, 167 lines |
| Only format 3 instruction | HSETP2 (minor 0x6F, sub 0x04) |
| Max register operands | 7 (texture fetch) |
| Max immediate operands | 4 (constant buffer load, local memory) |
| Max predicate operands | 2 (half-precision comparison, store ops) |
| SM100 opcode table size | 111,076 bytes / 3,227 lines |
| SM70 opcode table size | 24,230 bytes / 733 lines |
| SM100-to-SM70 ISA size ratio | ~4.6x |
Confidence Assessment
| Claim | Confidence | Verification |
|---|---|---|
| ISA class string "Blackwell" for sm_100 | CONFIRMED | Decompiled sub_484F50 line 609: "Blackwell"; string at 0x1d40b6e |
__CUDA_ARCH__=1000 | CONFIRMED | String at 0x1d40b59; decompiled line 617 |
| Three sub-variants: sm_100, sm_100a, sm_100f | CONFIRMED | Strings at 0x1d40b78/0x1d40bae/0x1d40be2; dispatch table in sub_15C0CE0 |
Dispatch table: sm_100 encoding table = sub_15C3840 | CONFIRMED | Decompiled sub_15C0CE0 line 126: sub_448E70(qword_2A644A8, "sm_100", sub_15C3840) |
| 8 MB encoding infrastructure across four regions | HIGH | Region sizes from systematic binary analysis; addresses consistent with function catalog |
| 1,975 encoders + 648 decoders + 1,613 descriptors = 4,236 | HIGH | Counts from comprehensive sweep; cross-checked with dispatch table sizes |
| 118 instruction families under 3 major opcodes | HIGH | Family count from systematic opcode table analysis |
sub_4C28B0 bitfield insertion at bits[3:0], [6:4], [16:8], [24:17], [31:25] | HIGH | Encoder template pattern confirmed across multiple decompiled encoders |
| ROT13 mnemonic table: ZREPHEL = MERCURY | HIGH | ROT13 decoding of string found in opcode table constructor |
tcgen05 intrinsics: tcgen05.mma, tcgen05.mma.ws | HIGH | PTX mnemonic strings referenced in sub_16E0A70 type classifier |
| 54 tcgen05 type IDs from predicate chain | HIGH | Decompiled sub_16E0A70 at 322 lines with complete predicate chain |
Guardrail symbols (__cuda__sm10x_tcgen05_guardrails_*) | HIGH | Symbol name strings referenced in decompiled sub_16E1DB0 |
Encoder dispatch sub_E43C20 (92 lines) | HIGH | Address and size consistent with function catalog |
Decoder dispatch sub_EFE6C0 (93 lines) | HIGH | Address and size consistent with function catalog |
SM100 opcode table constructor sub_1782540 (111,076 B) | HIGH | One of the largest functions; address consistent |
Master encoder sub_17F2670 (156,611 B) -- largest function in binary | HIGH | Size claim from function boundary analysis; consistent with claimed 4,858 lines |
Format descriptor tables (xmmword_1E30DA0, etc.) | HIGH | Rodata addresses from decompiled encoder functions |
| Major opcode 3 has only 1 encoder (HSETP2) | MEDIUM | Derived from opcode distribution analysis; single format-3 instruction |
For general Blackwell architecture details, see the ptxas wiki: Blackwell and cicc wiki: SM100 Blackwell.
Cross-References
nvlink Internal
- Embedded ptxas Overview -- SM100+ codec at
0xDA0000--0xF16000and0x620000--0xA70000 - Architecture Dispatch -- SM100/100a/100f vtable registration and callbacks
- Architecture Profiles -- SM100 profile in linker database
- Compatibility -- SM100 finalization compatibility and capability bitmasks
- SM103/110/120/121 -- derived Blackwell-family architectures
- Mercury Overview -- Mercury encoding is the default for SM100+
- FNLZR -- finalizer for Mercury/capmerc SM100+ output
- Capsule Mercury Format -- capmerc output format for SM100+
Sibling Wikis
- ptxas: Blackwell -- standalone ptxas SM100+ target documentation
- ptxas: tcgen05 -- standalone ptxas tcgen05 tensor intrinsics
- ptxas: Mercury Encoder -- Mercury pipeline for SM100+ encoding
- ptxas: Capsule Mercury -- capmerc format in standalone ptxas
- cicc: SM100 Blackwell -- cicc compiler SM100 Blackwell target