Tensor Core Intrinsics
All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.
ptxas supports five generations of tensor core operations spanning SM 70 through SM 100. The binary contains three major codegen handlers -- sub_5C7A50 (173KB, WMMA), sub_5C10A0 (120KB, MMA), and sub_5BBC30 (90KB, tcgen05.mma) -- plus four WGMMA handlers, eleven tcgen05 instruction handlers, and ~400 numeric MMA hash table entries. Together these constitute ~500KB of code generation logic dedicated to tensor core instructions, making this the single largest functional subsystem in ptxas.
| Component | Functions |
|---|---|
| WMMA codegen | sub_5C7A50 (173KB) -- wmma.mma instruction code generation |
| MMA codegen | sub_5C10A0 (120KB) -- mma.sync instruction code generation |
| TCGen05 MMA codegen | sub_5BBC30 (90KB) -- tcgen05.mma instruction code generation |
| WMMA load/store | sub_5A0EA0 (7.8KB), sub_5A8E40 (9.8KB), sub_5A6BD0 (8.8KB), sub_5A2D10 (8.1KB) |
| WGMMA handlers | sub_50AC70 (1.3KB), sub_4DA380 (295B), sub_4DA4B0 (295B), sub_4DA5E0 (311B) |
| MMA validators | sub_4C2FD0 (12.2KB), sub_49BBA0 (11.4KB), sub_4BFED0 (10.3KB) |
| Numeric MMA hash | ~400 entries at compilation context offset a1+816 |
| Prototype generator | sub_5FF700 (354KB) -- generates .weak .func PTX declarations |
| SASS MMA encoders | sub_6D4350, sub_6D7AF0, sub_6D5CB0, sub_6D69B0 |
Tensor Core Generations
ptxas tracks five distinct tensor core generations, each introducing new SASS opcodes, data types, and matrix shapes. The internal scheduling counters (visible in DUMPIR statistics output) reveal the SASS-level taxonomy.
| Gen | SM | SASS Opcodes | PTX API | Key Addition |
|---|---|---|---|---|
| 1st | 75 (Turing) | HMMA (m16n8k8, m8n8k16) | wmma.mma | FP16 tensor cores; INT8/INT4/B1 WMMA |
| 2nd | 80 (Ampere) | HMMA (m16n8k16), IMMA, DMMA, BMMA | wmma.mma, mma.sync | BF16, TF32, FP64, structured sparsity |
| 3rd | 89 (Ada) | HMMA (extended) | mma.sync | FP8 (E4M3, E5M2), block-scale MMA |
| 4th | 90 (Hopper) | WGMMA | wgmma.mma_async | Warpgroup MMA, async pipeline, sub-byte sparse |
| 5th | 100 (Blackwell) | UTCHMMA, UTCIMMA, tcmma | tcgen05.mma, tcgen05.mma.ws | Tensor memory (TMEM), warp-shared, block-scale |
Scheduling Unit Names
The binary's statistics printer functions (clones at 0x700-byte intervals from sub_ABBA50) emit per-unit throughput counters using the internal SASS operation classification:
| Counter | SASS Operation | Matrix Shape | Description |
|---|---|---|---|
| hmma1688 | HMMA m16n8k8 | 16x8x8 | 1st-gen FP16 MMA (Turing) |
| hmma1688f16 | HMMA m16n8k8 (f16 accum) | 16x8x8 | FP16 accumulation variant |
| hmma16816 | HMMA m16n8k16 | 16x8x16 | 2nd-gen FP16 MMA (Ampere+) |
| hmma16816f16 | HMMA m16n8k16 (f16 accum) | 16x8x16 | FP16 accumulation variant |
| hmmaSp1688 | HMMA.SP m16n8k8 | 16x8x(8*2) | Sparse FP16 (2:4 sparsity) |
| hmmaSp1688f16 | HMMA.SP m16n8k8 (f16 accum) | 16x8x(8*2) | Sparse FP16, f16 accum |
| imma16816 | IMMA m16n8k16 | 16x8x16 | Integer MMA (INT8) |
| imma16832 | IMMA m16n8k32 | 16x8x32 | Integer MMA (INT4/sub-byte) |
| immaSp8832 | IMMA.SP m8n8k32 | 8x8x(32*2) | Sparse integer (m8 variant) |
| immaSp16832 | IMMA.SP m16n8k32 | 16x8x(32*2) | Sparse integer (m16 variant) |
| dmma | DMMA | 8x8x4 | FP64 tensor MMA |
| fma64 | FMA64 | -- | FP64 FMA (non-tensor) |
Format strings from the binary:
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [est hmma16816=%d] [est hmma16816f16=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f]
These counters are emitted by the post-scheduling statistics pass. The scheduler treats each counter as a separate functional unit throughput class, with dedicated latency table entries:
| Ori Opcode | Latency Class | Description |
|---|---|---|
| 0x144 | 600 | Tensor fence |
| 0x145--0x146 | 759 | HMMA/BMMA tensor core operations |
| 0x147--0x148 | 757 or 761 | Narrow/wide DP tensor operations |
| 0x149 | 604 | Tensor sync |
PTX-Level Instruction Lowering
WMMA Instructions
The WMMA (Warp Matrix Multiply-Accumulate) API is the oldest tensor core interface. Five PTX instructions are registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Size | Purpose |
|---|---|---|---|
| wmma.load.a | sub_5A0EA0 | 7,779B | Load matrix fragment A |
| wmma.load.b | sub_5A8E40 | 9,757B | Load matrix fragment B |
| wmma.load.c | sub_5A6BD0 | 8,813B | Load accumulator fragment C |
| wmma.store.d | sub_5A2D10 | 8,074B | Store result fragment D |
| wmma.mma | sub_5C7A50 | 173KB | Matrix multiply-accumulate |
All five handlers allocate a 50,000-byte code generation buffer. The load/store handlers are ~8--10KB each and cover the combinatorial product of shape, layout, data type, and address space. The wmma.mma handler at 173KB is the largest single codegen handler in ptxas.
Instruction property accessors used by WMMA codegen:
| Accessor | Purpose |
|---|---|
| sub_7075C0 | Get instruction flag A (layout/type encoding) |
| sub_707BC0 | Get instruction flag B (variant/mode encoding) |
| sub_7075E0 | Get layout string (row/col) |
| sub_707BE0 | Get shape string (m16n16k16 etc.) |
| sub_70A810 | Get scale string (satfinite etc.) |
WMMA Shapes and Types
sm_70 (Volta/Turing) -- 1st generation:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m32n8k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m8n32k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
sm_72 (Turing) -- integer WMMA extension:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | s8, u8 | s32 | 2 x b32 | 2 x b32 | 8 x b32 |
| m32n8k16 | s8, u8 | s32 | 4 x b32 | 1 x b32 | 8 x b32 |
| m8n32k16 | s8, u8 | s32 | 1 x b32 | 4 x b32 | 8 x b32 |
| m8n8k32 | s4, u4 (sub-byte) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
| m8n8k128 | b1 (1-bit) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
sm_80 (Ampere) -- 2nd generation extensions:
| Shape | Data Types | Accumulator | Notes |
|---|---|---|---|
| m16n16k16 | bf16 | f32 | New BF16 support, 4 x b32 per fragment |
| m16n16k8 | tf32 | f32 | New TF32 support, 4 x b32 per fragment |
| m8n8k4 | f64 | f64 | Double-precision MMA, 1 x f64 per A/B, 2 x f64 per C/D |
All WMMA shapes support three address spaces (generic, global, shared) and optional descriptor-based addressing (_desc variants). The binary contains separate intrinsic registrations for each combination, with the full prototype available in the string table.
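The combinatorial registration described above can be modeled as a cartesian product over shape, layout, and address space; the name pattern below only loosely follows the `__cuda_sm70_wmma_*` spelling and is an assumption for illustration, not the binary's exact naming:

```python
from itertools import product

# Dimensions of the WMMA load.a variant space (from the shapes/spaces above)
shapes  = ["m16n16k16", "m32n8k16", "m8n32k16"]
layouts = ["row", "col"]
spaces  = ["", "_global", "_shared"]   # generic, global, shared

# Hypothetical registration keys, one intrinsic per combination
names = [f"wmma_load_a_{s}_{l}{sp}"
         for s, l, sp in product(shapes, layouts, spaces)]
print(len(names))  # 18 variants for load.a alone, before type and _desc splits
```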
MMA (mma.sync) Instructions
The mma.sync API uses a single PTX opcode mma dispatched to sub_5C10A0 (120KB). Unlike WMMA, it uses asymmetric matrix shapes with M and N decoupled from K, and operates at a single-warp granularity.
The numeric MMA hash table at a1+816 collapses the full variant space (shape + type + layout) into ~400 hash entries. Each entry maps a numeric string key (e.g., "2644314910") to a specific codegen handler function pointer, avoiding multi-dimensional dispatch.
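The single-level string-keyed dispatch can be sketched as below; the key-derivation function is hypothetical (the binary's actual numeric keys like "2644314910" are precomputed), but the structure matches the described design of collapsing (shape, type, layout) into one flat lookup:

```python
# Sketch of string-keyed single-level dispatch replacing a multi-dimensional
# lookup over (shape, type, layout). Key derivation is an assumption.

def make_key(shape, dtype, layout):
    # Hypothetical: fold the variant tuple into one numeric string key
    return str(abs(hash((shape, dtype, layout))) % 10**10)

mma_handlers = {}

def register(shape, dtype, layout, handler):
    mma_handlers[make_key(shape, dtype, layout)] = handler

def dispatch(shape, dtype, layout):
    return mma_handlers[make_key(shape, dtype, layout)]

register("m16n8k16", "f16", "row_col", lambda: "HMMA.16816")
print(dispatch("m16n8k16", "f16", "row_col")())
```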
sm_80+ MMA intrinsics (IDs 0x209--0x22F):
The 39 intrinsics registered under __cuda_sm_8x_mma_* cover:
| Intrinsic Pattern | Layout | Types (D, C, A, B) | Regs |
|---|---|---|---|
| mma_row_col_f16_f16_f16_f16 | row x col | f16, f16, f16, f16 | D: 4 x b32 |
| mma_row_col_f32_f16_f16_f16 | row x col | f32, f16, f16, f16 | D: 8 x b32 |
| mma_row_col_f32_f16_f16_f32 | row x col | f32, f32, f16, f16 | D: 8 x b32, C: 8 x b32 |
| mma_col_col_* | col x col | same set | same |
| mma_row_row_* | row x row | same set | same |
| mma_col_row_* | col x row | same set | same |
| mma_shfl_f16 | -- | shuffle f16 | D: 2 x b32 |
| mma_shfl_f32 | -- | shuffle f32 | D: 4 x b32 |
Prototype examples from the binary:
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_8x_mma_row_col_f16_f16_f16_f16
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3);
.weak .func (.param .align 16 .b32 mma_dst[8])
__cuda_sm_8x_mma_row_col_f32_f16_f16_f32
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 c4, .reg .b32 c5, .reg .b32 c6, .reg .b32 c7);
Note the .param .align 16 .b32 mma_dst[N] return convention -- MMA results are returned through aligned parameter space, not registers, because the warp-cooperative nature of the operation means each thread holds only a fragment.
MMA Shape Summary Across Generations
| Shape | Types | SM Floor | SASS Opcode | Notes |
|---|---|---|---|---|
| m8n8k4 | f64 | 80 | DMMA | Double-precision |
| m8n8k16 | f16 | 75 | HMMA | Original Turing shape |
| m8n8k32 | s4/u4 | 75 | IMMA | Sub-byte integer |
| m8n8k128 | b1 | 75 | BMMA | 1-bit (XOR/AND pop) |
| m16n8k8 | f16 | 75 | HMMA | Asymmetric Turing shape |
| m16n8k16 | f16, bf16, s8/u8 | 80 | HMMA, IMMA | Primary Ampere shape |
| m16n8k32 | s8/u8, s4/u4 | 80/90 | IMMA | Integer, sub-byte at sm_90 |
| m16n8k64 | s4/u4 | 90 | IMMA | Hopper sub-byte extension |
| m16n8k128 | s4/u4 (sparse), b1 | 90 | IMMA, BMMA | Hopper sparse sub-byte |
| m16n8k256 | b1 | 90/100 | BMMA | Extended 1-bit MMA |
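The SM-floor gating implied by the table above can be expressed as a simple lookup; the checker function is illustrative, not ptxas code, and the floors below are taken from the table:

```python
# Minimal SM gate mirroring the shape-summary table (subset shown)
SM_FLOOR = {
    ("m8n8k4",   "f64"):  80,
    ("m8n8k16",  "f16"):  75,
    ("m16n8k16", "bf16"): 80,
    ("m16n8k64", "s4"):   90,
}

def mma_supported(shape, dtype, sm):
    floor = SM_FLOOR.get((shape, dtype))
    return floor is not None and sm >= floor

print(mma_supported("m8n8k16", "f16", 75))  # True
print(mma_supported("m8n8k4", "f64", 75))   # False: DMMA needs sm_80
```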
MMA Validation
Three validator functions gate MMA features by SM version:
sub_4C2FD0 (12.2KB) -- WMMA/MMA master validator:
Performs three-way version checks:
- SM 75: base WMMA (f16)
- SM 80: extended types (BF16, TF32, FP64 -- "MMA with double types")
- SM 90: WGMMA features
- FP8: "mma with FP8 floating point type" (gated by sm_89+)
sub_49BBA0 (11.4KB) -- MMA type/scale validator:
Validates FP8 and block-scale configurations:
"mma with FP8 floating point type"-- sm_89+ gate"mma with FP8 floating point type and FP16 accumulation"-- additional FP16 accum check"mma with FP8 floating point type and .m16n8k16 shape"-- shape/type cross-validation"Sparse mma with block scale"-- block-scale + sparsity interaction.block_scalemodifier validation
sub_4BFED0 (10.3KB) -- WMMA shape validator:
Validates WMMA-specific shapes and the .aligned modifier:
- ".aligned modifier for wmma" -- alignment enforcement
- SM 75/80 version checks for shape legality
sub_490F90 -- Integer MMA validator:
Checks integer MMA shape validity: "Integer MMA with shape " -- validates m/n/k dimensions against the SM-level capability set.
sub_494210 (2.3KB) -- Sparse GMMA validator:
Validates sparse MMA metadata: "Sparse GMMA with " -- checks 2:4 sparsity pattern encoding.
sub_495900 -- WMMA floating-point validator:
Checks: "'wmma.mma' with floating point type" -- validates FP type compatibility with the target shape.
sub_4428E0 -- FP64 MMA validator:
Validates: "mma with .f64 type" -- gates double-precision MMA on sm_80+.
sm_90+ Sub-Byte MMA Intrinsics (IDs 0x23A--0x25F)
Hopper introduces 38 sub-byte MMA intrinsics covering s4/u4 sparse operations at warp granularity. These are distinct from the WGMMA API and provide backward-compatible sub-byte operations through the classical mma.sync interface.
Dense sub-byte (m8n8k32, m16n8k32, m16n8k64):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m8n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k64 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
Sparse sub-byte (m16n8k64, m16n8k128):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m16n8k64 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite, split _0/_1 | 16 |
| m16n8k128 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
The _0 and _1 suffixes on sparse m16n8k64 represent the two halves of a split operation -- the K dimension is decomposed into two steps for the sparsity pattern. The sparse variants take an additional e (metadata) operand encoding the 2:4 sparsity pattern.
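An illustrative packing of the 2:4 metadata operand: for every group of 4 K-elements, exactly 2 are kept, and their 2-bit indices are packed into the metadata word. The 4-bits-per-group layout here is an assumption for illustration, not the documented SASS encoding:

```python
def pack_metadata(kept_indices_per_group):
    """Pack one thread's 2:4 sparsity metadata fragment.

    kept_indices_per_group: list of (i0, i1) pairs, the two surviving
    element indices (0..3) within each group of four K-elements.
    """
    meta = 0
    for g, (i0, i1) in enumerate(kept_indices_per_group):
        assert 0 <= i0 < i1 < 4, "2:4 pattern keeps 2 of every 4 elements"
        meta |= (i0 | (i1 << 2)) << (4 * g)
    return meta

# 8 groups of 4 cover K=32 worth of metadata
print(hex(pack_metadata([(0, 1)] * 8)))  # 0x44444444
```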
Bit-operations (b1):
| Shape | Operation | SM | Intrinsic |
|---|---|---|---|
| m8n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| m16n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| m16n8k256 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
Prototype example (sparse m16n8k128):
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
(.reg .b32 a0, .reg .b32 a1, .reg .b32 a2, .reg .b32 a3,
.reg .b32 b0, .reg .b32 b1, .reg .b32 b2, .reg .b32 b3,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 e);
The sparse m16n8k128 variant takes 4 A-operands, 4 B-operands (double-width due to sparsity encoding), 4 C-operands, and 1 metadata operand e.
sm_100 Blackwell MMA (IDs 0x230--0x239)
Blackwell introduces 10 intrinsics for extended MMA operations:
HMMA/IMMA metadata helpers (__cuda_sm_10x_*_mdata_*):
| Intrinsic | Shape | Returns | Inputs |
|---|---|---|---|
| __cuda_sm_10x_hmma_mdata_m16n8k16 | m16n8k16 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_hmma_mdata_m16n8k32 | m16n8k32 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k32 | m16n8k32 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k64 | m16n8k64 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
These mdata functions compute sparse metadata for Blackwell's 5th-gen tensor cores. The e parameter is the sparsity selector, f_temp is a scratch register. The return array includes the transformed A-operands plus the computed metadata word.
Bit MMA (AND + XOR for sm_100):
| Intrinsic | Shape | Operation | Regs |
|---|---|---|---|
| __cuda_sm_10x_mma_bit_internal_and_m8n8k128 | m8n8k128 | AND | D: 2, A: 1, B: 1, C: 2 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k128 | m16n8k128 | AND | D: 4, A: 2, B: 1, C: 4 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k256 | m16n8k256 | AND | D: 4, A: 4, B: 2, C: 4 |
| __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 | m8n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 | m16n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 | m16n8k256 | XOR | same as AND |
Blackwell adds the AND reduction mode for 1-bit MMA (sm_90 only had XOR).
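The per-element semantics of 1-bit MMA are a population count over the K dimension: D = C + popc(op(A, B)), where op is XOR (sm_90+) or AND (new on sm_100). A scalar model for a single output element (the real instruction computes this for every (m, n) pair across the warp):

```python
def bmma_element(a_bits, b_bits, c, op):
    """One output element of 1-bit MMA over a K-wide bit vector."""
    x = (a_bits ^ b_bits) if op == "xor" else (a_bits & b_bits)
    return c + bin(x).count("1")

a, b = 0b1011, 0b0110            # K=4 bit vectors
print(bmma_element(a, b, 0, "xor"))  # 3  (1011 ^ 0110 = 1101 -> popc 3)
print(bmma_element(a, b, 0, "and"))  # 1  (1011 & 0110 = 0010 -> popc 1)
```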
WGMMA -- Warpgroup MMA (SM 90+)
WGMMA (Warp Group Matrix Multiply-Accumulate) operates at warpgroup granularity (4 warps, 128 threads) and uses an asynchronous pipeline protocol. Four PTX instructions are registered:
| PTX Instruction | Handler | Size | Role |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | 1,282B | Dispatch asynchronous MMA operation |
| wgmma.fence | sub_4DA380 | 295B | Open pipeline stage |
| wgmma.commit_group | sub_4DA4B0 | 295B | Close pipeline stage |
| wgmma.wait_group | sub_4DA5E0 | 311B | Wait for N committed groups |
WGMMA Pipeline Protocol
The hardware requires strict sequencing:
1. wgmma.fence -- open pipeline stage
2. wgmma.mma_async (1..N) -- asynchronous MMA operations sharing accumulators
3. wgmma.commit_group -- close pipeline stage
4. wgmma.wait_group N -- wait for N outstanding groups to complete
Between fence and wait, strict register constraints apply:
- No non-WGMMA definitions of accumulator registers
- No non-WGMMA reads of accumulator registers
- No non-WGMMA definitions of WGMMA input registers (including descriptors)
Violation of any constraint triggers pipeline serialization via sub_AE47B0, which collapses the pipeline to individual fence/mma/commit/wait per operation. The serialization reason is reported through warning codes 0x1D55--0x1D5E (see GMMA Pipeline).
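The accumulator-interference check that forces serialization can be sketched as a scan of the fence..wait window; the instruction representation below is invented for illustration, only the constraint itself comes from the text:

```python
# A window violates the pipeline contract if any non-WGMMA instruction
# defines or reads an accumulator register (or defines a WGMMA input).
def needs_serialization(window_insns, accum_regs):
    for insn in window_insns:
        if insn["op"].startswith("wgmma"):
            continue
        if accum_regs & (set(insn["defs"]) | set(insn["uses"])):
            return True  # would trigger sub_AE47B0-style serialization
    return False

window = [
    {"op": "wgmma.mma_async", "defs": ["d0"], "uses": ["descA", "descB"]},
    {"op": "fadd", "defs": ["t0"], "uses": ["d0"]},  # non-WGMMA read of accum
]
print(needs_serialization(window, {"d0"}))  # True
```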
WGMMA Descriptor Format
The wgmma.mma_async handler (sub_50AC70, 1,282 bytes) encodes the operation's matrix dimensions, data types, layout, and scale factors into the instruction. The A operand can be either a register operand or a descriptor -- a 64-bit value encoding the matrix base address, leading dimension, stride, and swizzle pattern. The B operand is always descriptor-based.
The descriptor format allows the hardware to fetch matrix data directly from shared memory via the TMA (Tensor Memory Accelerator), bypassing register file involvement for the B matrix operand entirely.
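A sketch of packing such a 64-bit shared-memory matrix descriptor, following the field layout publicly documented for Hopper in the PTX ISA (base address, leading-dimension and stride byte offsets in 16-byte units, swizzle mode in the top bits); treat the exact bit positions as an assumption for illustration:

```python
def make_smem_descriptor(addr, lbo, sbo, swizzle_mode):
    """Pack a 64-bit matrix descriptor (assumed Hopper-style field layout)."""
    assert addr % 16 == lbo % 16 == sbo % 16 == 0  # 16-byte granularity
    return ((addr >> 4) & 0x3FFF)            \
         | (((lbo >> 4) & 0x3FFF) << 16)     \
         | (((sbo >> 4) & 0x3FFF) << 32)     \
         | ((swizzle_mode & 0x3) << 62)

desc = make_smem_descriptor(addr=0x400, lbo=128, sbo=256, swizzle_mode=1)
print(hex(desc))  # 0x4000001000080040
```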
Ori Internal Encoding
| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator registers |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator set |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator set |
Compiler-Inserted Warpgroup Instructions
The GMMA pipeline passes (phases 85 and 87) insert three compiler-internal pseudo-operations prefixed with underscore:
| Pseudo-Op | SASS Output | Purpose |
|---|---|---|
| _warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup synchronization (arrive) |
| _warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup synchronization (wait) |
| _warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |
These are not directly written by the programmer. The compiler inserts them to manage register ownership transfer between the warpgroup's register file and the tensor core's accumulator pipeline.
TCGen05 -- 5th Generation Tensor Cores (SM 100+)
Blackwell introduces TCGen05 (Tensor Core Generation 5), which operates through tensor memory (TMEM) -- a dedicated on-chip memory visible only to the tensor core engine, separate from the register file and shared memory. Eleven PTX instructions are registered:
| PTX Instruction | Handler | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | sub_5BBC30 | 90KB | Tensor core MMA from TMEM |
| tcgen05.mma.ws | sub_58FA20 | 4,604B | Warp-shared MMA variant |
| tcgen05.ld | sub_574050 | -- | Load data into TMEM |
| tcgen05.ld.red | sub_578DB0 | -- | Load-reduce into TMEM |
| tcgen05.st | sub_571FE0 | -- | Store from TMEM |
| tcgen05.cp | sub_5427F0 | -- | Copy within TMEM |
| tcgen05.commit | sub_56C190 | -- | Commit pending operations |
| tcgen05.shift | sub_4F1A90 | -- | Shift TMEM contents |
| tcgen05.alloc | sub_569180 | -- | Allocate TMEM columns |
| tcgen05.dealloc | sub_58C7F0 | -- | Deallocate TMEM columns |
| tcgen05.relinquish_alloc_permit | sub_526370 | -- | Release allocation rights |
TCGen05 MMA Codegen
The tcgen05.mma handler (sub_5BBC30, 90KB) is the third-largest codegen handler in ptxas. It:
- Allocates a 50,000-byte code generation buffer
- Validates tcgen05 capability via sub_70FA00(*, 29)
- Handles standard, sparse (.sp), and warp-shared (.ws) variants
- Extracts sparse metadata via sub_70F0A0
- Generates TMEM address computation code
The tcgen05.mma.ws formatter (sub_58FA20, 4,604B, also used for tcgen05.shift) handles the warp-shared variant where multiple warps contribute to a single MMA operation.
TCGen05 SASS Encoding
At the SASS level, TCGen05 operations are encoded by four specialized Mercury encoder functions:
| Address | Handler | Purpose |
|---|---|---|
| sub_6D4350 | SASS MMA encoder | Primary MMA SASS emission |
| sub_6D7AF0 | SASS MMA encoder | Alternate MMA variant |
| sub_6D5CB0 | SASS MMA encoder | Additional MMA mode |
| sub_6D69B0 | SASS MMA encoder | Additional MMA mode |
The SASS encoder at sub_6D4350 references the tcmma operation namespace and validates block-scale configurations:
"tcmma_*_o must be specified with blockscale"-- output operand requires block-scale modifier"uri width for tcmma_*_o must be 2"-- output URI width constraint"tcmma_*_q with blockscale must have uri width of 2"-- scale factor operand constraint"tcmma_*_mxq must be specified with blockscale"-- MX quantization operand constraint"For UTCHMMA, #scaleU4 must be 0 in SPA 10.1."-- SM 100 vs 103 compatibility
The string "UTCHMMA" (Unified Tensor Core HMMA) and "tcmma" (Tensor Core MMA) are the internal SASS-level names for Blackwell's tensor core operations.
TCGen05 Guardrails
Blackwell includes a debug/validation mode activated by --g-tensor-memory-access-check. When enabled, ptxas wraps TMEM operations with guardrail trap functions. Ten guardrail intrinsics are registered (IDs 0x20--0x2A):
| Intrinsic | Trap Condition |
|---|---|
| tcgen05_guardrail_trap_phase_invalid_during_alloc | TMEM allocation during invalid phase |
| tcgen05_guardrail_trap_current_warp_owner_invalid | Warp accessing TMEM it does not own |
| tcgen05_guardrail_trap_unallocated_columns_access | Access to unallocated TMEM columns |
| tcgen05_guardrail_trap_unallocated_columns_being_dealloced | Deallocation of unallocated columns |
| tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | Dealloc of column not from alloc |
| tcgen05_guardrail_trap_allocation_granularity_invalid | Invalid allocation granularity |
| tcgen05_guardrail_trap_access_out_of_physical_bounds | Out-of-bounds TMEM access |
| tcgen05_guardrail_trap_invalid_datapath_alignment | TMEM datapath misalignment |
| tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | Sparsity mismatch in instruction descriptor |
| tcgen05_guardrail_trap_sp_used_in_unsupported_env | Sparsity in unsupported config |
Eight guardrail PTX instructions are registered in the opcode dispatch table:
| PTX Instruction | Check |
|---|---|
| _tcgen05.guardrails.is_phase_valid | Phase validity for alloc |
| _tcgen05.guardrails.is_current_warp_valid_owner | Warp ownership |
| _tcgen05.guardrails.are_columns_allocated | Column allocation status |
| _tcgen05.guardrails.in_physical_bounds | Physical bounds check |
| _tcgen05.guardrails.allocation_granularity | Granularity validation |
| _tcgen05.guardrails.datapath_alignment | Alignment validation |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | Sparsity consistency |
| _tcgen05.guardrails.check_sparse_usage | Sparse usage validation |
The guardrail check functions are .FORCE_INLINE and return a boolean retVal:
.FORCE_INLINE .func (.reg .b32 retVal)
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment
(.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group,
.reg .u32 hasWS, .reg .u32 hasSP, .reg .u32 matrix_kind);
The parameters reveal TMEM addressing structure: tmemAddr (TMEM base address), iDesc (instruction descriptor), cta_group (CTA group for cluster operations), hasWS (warp-shared flag), hasSP (sparse flag), matrix_kind (operand role).
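A toy model of the datapath-alignment check: a TMEM address splits into (lane, column) fields, and the lane must be aligned to the operation's datapath granularity. The field widths and the granularity rule below are assumptions for illustration, not the hardware encoding:

```python
def tmem_lane(tmem_addr):
    # Assumed: lane index lives in the high 16 bits of the TMEM address
    return (tmem_addr >> 16) & 0xFFFF

def datapath_alignment_ok(tmem_addr, granularity):
    """Model of a guardrail check: lane must be a multiple of granularity."""
    return tmem_lane(tmem_addr) % granularity == 0

print(datapath_alignment_ok(0x0020_0004, 32))  # True  (lane 32)
print(datapath_alignment_ok(0x0021_0004, 32))  # False (lane 33)
```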
Block-Scale MMA
Block-scale MMA allows per-block scaling factors for mixed-precision computation. In ptxas, this is gated by the .block_scale modifier on the PTX mma instruction:
- Validator sub_49BBA0 checks ".block_scale" and "Sparse mma with block scale" (sparsity + block-scale interaction)
- Additional intrinsic suffix __cuda_sm_100_tcgen05_ld_immhalfSplitOff and variants handle block-scale-aware loads
- The bf16x2.ue8m0x2 type string in the binary indicates UE8M0 (unsigned exponent-only) scale factor format for MX (microscaling) quantization
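UE8M0 is an 8-bit unsigned, exponent-only format: no sign bit, no mantissa, value = 2^(e - 127), with the bias assumed to follow the OCP Microscaling (MX) specification. Two scale factors pack into a ue8m0x2 word. A sketch decoder:

```python
def ue8m0_decode(e):
    """Decode one UE8M0 scale factor: pure power of two, bias 127 (assumed)."""
    assert 0 <= e <= 255
    return 2.0 ** (e - 127)

def ue8m0x2_unpack(packed):
    """Unpack two UE8M0 factors from a 16-bit ue8m0x2 value (low byte first)."""
    return ue8m0_decode(packed & 0xFF), ue8m0_decode((packed >> 8) & 0xFF)

print(ue8m0_decode(127))       # 1.0
print(ue8m0x2_unpack(0x807F))  # (1.0, 2.0)
```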
Intrinsic Registration Summary
Full Tensor Core Intrinsic ID Map
| ID Range | Count | SM | Category | SASS Target |
|---|---|---|---|---|
| 0x89--0x1FA (subset) | ~200 | 70+ | __cuda_sm70_wmma_* -- WMMA load/store/mma (f16) | HMMA |
| 0x1FB--0x208 | 14 | 80 | __cuda_sm80_* -- bf16/tf32/s4/s8/b1 MMA, createpolicy | HMMA, IMMA, DMMA, BMMA |
| 0x209--0x22F | 39 | 80+ | __cuda_sm_8x_mma_* -- direct MMA operations | HMMA, IMMA |
| 0x230--0x239 | 10 | 100 | __cuda_sm_10x_* -- hmma/imma mdata + bit MMA | UTCHMMA, UTCIMMA |
| 0x23A--0x25F | 38 | 90 | __cuda_sm_9x_mma_sub_byte_internal_* -- sub-byte sparse | IMMA |
TCGen05 Intrinsics (Not in Master ID Table)
TCGen05 operations are dispatched through the named opcode table, not the numeric ID table. The 11 tcgen05.* instructions are registered directly in the opcode hash map at a1+808:
| PTX Opcode | Codegen Handler |
|---|---|
| tcgen05.alloc | sub_569180 |
| tcgen05.relinquish_alloc_permit | sub_526370 |
| tcgen05.dealloc | sub_58C7F0 |
| tcgen05.ld | sub_574050 |
| tcgen05.ld.red | sub_578DB0 |
| tcgen05.st | sub_571FE0 |
| tcgen05.commit | sub_56C190 |
| tcgen05.cp | sub_5427F0 |
| tcgen05.shift | sub_4F1A90 |
| tcgen05.mma | sub_5BBC30 |
| tcgen05.mma.ws | sub_58FA20 |
OCG-Level MMA Operations
The OCG system at sub_6C9EB0 registers additional SASS-level MMA operations for SM100+:
- tcmma (2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit)
- tcbar, mmareadshma
- 16dp32bitt0t15, 16dp32bitt16t31
- sparsify, spfactor2to4
- tcshift, tcatomsws, tcldsws, tcstsws
These are internal Mercury-level operations that do not correspond 1:1 to PTX instructions. The tcmma variants encode the specific SASS datapath configuration: 2x64dp128bit means 2 datapaths, 64-element wide, 128-bit lane width.
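Per that naming convention, the datapath configuration can be parsed straight out of the variant name; the parser just mechanizes the scheme the text describes (the `suffix` field captures trailing lane-width qualifiers like `lw02lw13`):

```python
import re

def parse_datapath(name):
    """Split a tcmma variant name like '2x64dp128bit' into its fields."""
    m = re.fullmatch(r"(\d+)x(\d+)dp(\d+)bit(\w*)", name)
    if not m:
        return None
    return {"datapaths": int(m.group(1)), "width": int(m.group(2)),
            "lane_bits": int(m.group(3)), "suffix": m.group(4)}

print(parse_datapath("2x64dp128bitlw02lw13"))
# {'datapaths': 2, 'width': 64, 'lane_bits': 128, 'suffix': 'lw02lw13'}
```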
Data Type Matrix
Complete data type support across tensor core generations as visible in the binary:
| Data Type | PTX Type | Width | SM Floor | Use |
|---|---|---|---|---|
| FP16 | .f16 | 16b | 75 | Primary HMMA operand |
| BF16 | .bf16 | 16b | 80 | HMMA alternate format |
| TF32 | .tf32 | 19b (stored as 32b) | 80 | Reduced-precision FP32 |
| FP32 | .f32 | 32b | 75 | HMMA accumulator |
| FP64 | .f64 | 64b | 80 | DMMA (double-precision) |
| FP8 E4M3 | .e4m3 | 8b | 89 | Ada/Hopper FP8 MMA |
| FP8 E5M2 | .e5m2 | 8b | 89 | Ada/Hopper FP8 MMA |
| INT8 | .s8, .u8 | 8b | 72 | IMMA integer MMA |
| INT4 | .s4, .u4 | 4b | 72 | IMMA sub-byte MMA |
| INT1 | .b1 | 1b | 75 | BMMA 1-bit MMA (XOR/AND) |
| UE8M0 | .ue8m0x2 | 8b (packed) | 100 | Block-scale exponent factor |
| B1024 | .b1024 | 1024b | 100 | TMEM-width operand |
ELF Metadata
The cubin output includes tensor-core-specific EIATTR attributes:
| EIATTR | Purpose |
|---|---|
| EIATTR_SPARSE_MMA_MASK | Records which MMA operations use structured sparsity |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 1-CTA tcgen05 operations |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 2-CTA tcgen05 operations |
Knobs and Options
| Knob / Option | Purpose |
|---|---|
| suppress-sparse-mma-advisory-info | Suppress advisory info for mma.sp operations |
| --g-tensor-memory-access-check | Enable tcgen05 guardrail instrumentation |
Key Function Table
| Address | Size | Identity | Confidence |
|---|---|---|---|
| 0x5C7A50 | 173KB | WMMA.MMA codegen (all shapes/types/layouts) | 98% |
| 0x5C10A0 | 120KB | MMA.sync codegen (post-Volta shapes) | 98% |
| 0x5BBC30 | 90KB | TCGen05.MMA codegen (Blackwell TMEM-based) | 98% |
| 0x5A0EA0 | 7,779B | wmma.load.a formatter | 95% |
| 0x5A8E40 | 9,757B | wmma.load.b formatter | 95% |
| 0x5A6BD0 | 8,813B | wmma.load.c formatter | 95% |
| 0x5A2D10 | 8,074B | wmma.store.d formatter | 95% |
| 0x50AC70 | 1,282B | wgmma.mma_async handler | 99% |
| 0x4DA380 | 295B | wgmma.fence handler | 99% |
| 0x4DA4B0 | 295B | wgmma.commit_group handler | 99% |
| 0x4DA5E0 | 311B | wgmma.wait_group handler | 99% |
| 0x58FA20 | 4,604B | tcgen05.mma.ws / tcgen05.shift formatter | 95% |
| 0x4DA720 | 343B | tcgen05.mma.ws formatter | 90% |
| 0x569180 | -- | tcgen05.alloc handler | 90% |
| 0x526370 | -- | tcgen05.relinquish_alloc_permit handler | 90% |
| 0x58C7F0 | -- | tcgen05.dealloc handler | 90% |
| 0x574050 | -- | tcgen05.ld handler | 90% |
| 0x578DB0 | -- | tcgen05.ld.red handler | 90% |
| 0x571FE0 | -- | tcgen05.st handler | 90% |
| 0x56C190 | -- | tcgen05.commit handler | 90% |
| 0x5427F0 | -- | tcgen05.cp handler | 90% |
| 0x4F1A90 | -- | tcgen05.shift handler | 90% |
| 0x4C2FD0 | 12.2KB | WMMA/MMA master validator (sm_75/80/90) | 90% |
| 0x49BBA0 | 11.4KB | MMA type/scale validator (FP8, block-scale) | 90% |
| 0x4BFED0 | 10.3KB | WMMA shape validator | 90% |
| 0x490F90 | -- | Integer MMA shape validator | 85% |
| 0x494210 | 2,276B | Sparse GMMA validator | 85% |
| 0x495900 | -- | WMMA floating-point type validator | 85% |
| 0x496570 | -- | FP8 MMA shape validator | 85% |
| 0x4961F0 | -- | FP8 MMA accumulation validator | 85% |
| 0x4428E0 | -- | FP64 MMA type validator | 85% |
| 0x6D4350 | -- | SASS TCGen05 MMA encoder (UTCHMMA/tcmma) | 85% |
| 0x6D7AF0 | -- | SASS MMA encoder variant | 85% |
| 0x6D5CB0 | -- | SASS MMA encoder variant | 85% |
| 0x6D69B0 | -- | SASS MMA encoder variant | 85% |
| 0x50D4B0 | 1,187B | ldmatrix formatter | 90% |
| 0x4DAEA0 | -- | movmatrix formatter | 90% |
| 0x4F05D0 | -- | stmatrix formatter | 90% |
| 0x5D1660 | 46KB | Master intrinsic registration (608 entries) | 99% |
| 0x5D4190 | 41KB | Opcode dispatch (MMA hash table builder) | 99% |
| 0x5FF700 | 354KB | Prototype generator (.weak .func declarations) | 99% |
Cross-References
- Intrinsic Table Architecture -- Master registration, ID ranges, opcode dispatch
- GMMA/WGMMA Pipeline -- Phases 85/87, pipeline constraints, serialization
- Ada & Hopper Targets -- SM 89/90 feature gates, WGMMA details
- Turing & Ampere Targets -- SM 75--88 tensor core introduction
- Latency Model -- HMMA/IMMA/DMMA functional unit scheduling
- Register Model -- reg_type 6 (tensor/accumulator, allocator class 6)
- Mercury Encoder -- SASS encoding of MMA instructions
- ELF Output -- EIATTR_SPARSE_MMA_MASK, EIATTR_TCGEN05_*