
Tensor Core Intrinsics

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

ptxas supports five generations of tensor core operations spanning SM 70 through SM 100. The binary contains three major codegen handlers -- sub_5C7A50 (173KB, WMMA), sub_5C10A0 (120KB, MMA), and sub_5BBC30 (90KB, tcgen05.mma) -- plus four WGMMA handlers, eleven tcgen05 instruction handlers, and ~400 numeric MMA hash table entries. Together these constitute ~500KB of code generation logic dedicated to tensor core instructions, making this the single largest functional subsystem in ptxas.

  • WMMA codegen: sub_5C7A50 (173KB) -- wmma.mma instruction code generation
  • MMA codegen: sub_5C10A0 (120KB) -- mma.sync instruction code generation
  • TCGen05 MMA codegen: sub_5BBC30 (90KB) -- tcgen05.mma instruction code generation
  • WMMA load/store: sub_5A0EA0 (7.8KB), sub_5A8E40 (9.8KB), sub_5A6BD0 (8.8KB), sub_5A2D10 (8.1KB)
  • WGMMA handlers: sub_50AC70 (1.3KB), sub_4DA380 (295B), sub_4DA4B0 (295B), sub_4DA5E0 (311B)
  • MMA validators: sub_4C2FD0 (12.2KB), sub_49BBA0 (11.4KB), sub_4BFED0 (10.3KB)
  • Numeric MMA hash: ~400 entries at compilation context offset a1+816
  • Prototype generator: sub_5FF700 (354KB) -- generates .weak .func PTX declarations
  • SASS MMA encoders: sub_6D4350, sub_6D7AF0, sub_6D5CB0, sub_6D69B0

Tensor Core Generations

ptxas tracks five distinct tensor core generations, each introducing new SASS opcodes, data types, and matrix shapes. The internal scheduling counters (visible in DUMPIR statistics output) reveal the SASS-level taxonomy.

| Gen | SM | SASS Opcodes | PTX API | Key Addition |
|---|---|---|---|---|
| 1st | 75 (Turing) | HMMA (m16n8k8, m8n8k16) | wmma.mma | FP16 tensor cores; INT8/INT4/B1 WMMA |
| 2nd | 80 (Ampere) | HMMA (m16n8k16), IMMA, DMMA, BMMA | wmma.mma, mma.sync | BF16, TF32, FP64, structured sparsity |
| 3rd | 89 (Ada) | HMMA (extended) | mma.sync | FP8 (E4M3, E5M2), block-scale MMA |
| 4th | 90 (Hopper) | WGMMA | wgmma.mma_async | Warpgroup MMA, async pipeline, sub-byte sparse |
| 5th | 100 (Blackwell) | UTCHMMA, UTCIMMA, tcmma | tcgen05.mma, tcgen05.mma.ws | Tensor memory (TMEM), warp-shared, block-scale |

Scheduling Unit Names

The binary's statistics printer functions (clones at 0x700-byte intervals from sub_ABBA50) emit per-unit throughput counters using the internal SASS operation classification:

| Counter | SASS Operation | Matrix Shape | Description |
|---|---|---|---|
| hmma1688 | HMMA m16n8k8 | 16x8x8 | 1st-gen FP16 MMA (Turing) |
| hmma1688f16 | HMMA m16n8k8 (f16 accum) | 16x8x8 | FP16 accumulation variant |
| hmma16816 | HMMA m16n8k16 | 16x8x16 | 2nd-gen FP16 MMA (Ampere+) |
| hmma16816f16 | HMMA m16n8k16 (f16 accum) | 16x8x16 | FP16 accumulation variant |
| hmmaSp1688 | HMMA.SP m16n8k8 | 16x8x(8*2) | Sparse FP16 (2:4 sparsity) |
| hmmaSp1688f16 | HMMA.SP m16n8k8 (f16 accum) | 16x8x(8*2) | Sparse FP16, f16 accum |
| imma16816 | IMMA m16n8k16 | 16x8x16 | Integer MMA (INT8) |
| imma16832 | IMMA m16n8k32 | 16x8x32 | Integer MMA (INT4/sub-byte) |
| immaSp8832 | IMMA.SP m8n8k32 | 8x8x(32*2) | Sparse integer (m8 variant) |
| immaSp16832 | IMMA.SP m16n8k32 | 16x8x(32*2) | Sparse integer (m16 variant) |
| dmma | DMMA | 8x8x4 | FP64 tensor MMA |
| fma64 | FMA64 | -- | FP64 FMA (non-tensor) |

Format strings from the binary:

```
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [est hmma16816=%d] [est hmma16816f16=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f]
```

These counters are emitted by the post-scheduling statistics pass. The scheduler treats each counter as a separate functional unit throughput class, with dedicated latency table entries:

| Ori Opcode | Latency Class | Description |
|---|---|---|
| 0x144 | 600 | Tensor fence |
| 0x145--0x146 | 759 | HMMA/BMMA tensor core operations |
| 0x147--0x148 | 757 or 761 | Narrow/wide DP tensor operations |
| 0x149 | 604 | Tensor sync |

PTX-Level Instruction Lowering

WMMA Instructions

The WMMA (Warp Matrix Multiply-Accumulate) API is the oldest tensor core interface. Five PTX instructions are registered in the opcode dispatch table at sub_5D4190:

| PTX Instruction | Codegen Handler | Size | Purpose |
|---|---|---|---|
| wmma.load.a | sub_5A0EA0 | 7,779B | Load matrix fragment A |
| wmma.load.b | sub_5A8E40 | 9,757B | Load matrix fragment B |
| wmma.load.c | sub_5A6BD0 | 8,813B | Load accumulator fragment C |
| wmma.store.d | sub_5A2D10 | 8,074B | Store result fragment D |
| wmma.mma | sub_5C7A50 | 173KB | Matrix multiply-accumulate |

All five handlers allocate a 50,000-byte code generation buffer. The load/store handlers are ~8--10KB each and cover the combinatorial product of shape, layout, data type, and address space. The wmma.mma handler at 173KB is the largest single codegen handler in ptxas.
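
For reference, a schematic wmma sequence in standard PTX syntax shows what these handlers lower (illustrative only, not extracted from the binary; register names are placeholders):

```
// m16n16k16, f16 inputs, f32 accumulator -- one WMMA tile update
wmma.load.a.sync.aligned.row.m16n16k16.global.f16
    {%a0, %a1, %a2, %a3, %a4, %a5, %a6, %a7}, [%pA], %lda;
wmma.load.b.sync.aligned.col.m16n16k16.global.f16
    {%b0, %b1, %b2, %b3, %b4, %b5, %b6, %b7}, [%pB], %ldb;
wmma.load.c.sync.aligned.row.m16n16k16.global.f32
    {%c0, %c1, %c2, %c3, %c4, %c5, %c6, %c7}, [%pC], %ldc;
wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32
    {%d0, %d1, %d2, %d3, %d4, %d5, %d6, %d7},
    {%a0, %a1, %a2, %a3, %a4, %a5, %a6, %a7},
    {%b0, %b1, %b2, %b3, %b4, %b5, %b6, %b7},
    {%c0, %c1, %c2, %c3, %c4, %c5, %c6, %c7};
wmma.store.d.sync.aligned.row.m16n16k16.global.f32
    [%pD], {%d0, %d1, %d2, %d3, %d4, %d5, %d6, %d7}, %ldd;
```

Each combination of shape, layout, type, and address space qualifier seen here selects a different path through the load/store handlers.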

Instruction property accessors used by WMMA codegen:

| Accessor | Purpose |
|---|---|
| sub_7075C0 | Get instruction flag A (layout/type encoding) |
| sub_707BC0 | Get instruction flag B (variant/mode encoding) |
| sub_7075E0 | Get layout string (row/col) |
| sub_707BE0 | Get shape string (m16n16k16 etc.) |
| sub_70A810 | Get scale string (satfinite etc.) |

WMMA Shapes and Types

sm_70 (Volta/Turing) -- 1st generation:

| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m32n8k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m8n32k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |

sm_72 (Turing) -- integer WMMA extension:

| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | s8, u8 | s32 | 2 x b32 | 2 x b32 | 8 x b32 |
| m32n8k16 | s8, u8 | s32 | 4 x b32 | 1 x b32 | 8 x b32 |
| m8n32k16 | s8, u8 | s32 | 1 x b32 | 4 x b32 | 8 x b32 |
| m8n8k32 | s4, u4 (sub-byte) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
| m8n8k128 | b1 (1-bit) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |

sm_80 (Ampere) -- 2nd generation extensions:

| Shape | Data Types | Accumulator | Notes |
|---|---|---|---|
| m16n16k16 | bf16 | f32 | New BF16 support, 4 x b32 per fragment |
| m16n16k8 | tf32 | f32 | New TF32 support, 4 x b32 per fragment |
| m8n8k4 | f64 | f64 | Double-precision MMA, 1 x f64 per A/B, 2 x f64 per C/D |

All WMMA shapes support three address spaces (generic, global, shared) and optional descriptor-based addressing (_desc variants). The binary contains separate intrinsic registrations for each combination, with the full prototype available in the string table.

MMA (mma.sync) Instructions

The mma.sync API uses a single PTX opcode mma dispatched to sub_5C10A0 (120KB). Unlike WMMA, it uses asymmetric matrix shapes with M and N decoupled from K, and operates at a single-warp granularity.
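
A representative mma.sync instruction in standard PTX syntax (illustrative; register counts match the m16n8k16 f16-input, f32-accumulator fragment layout):

```
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    {%d0, %d1, %d2, %d3},    // D: 4 x b32 (f32 fragment)
    {%a0, %a1, %a2, %a3},    // A: 4 x b32 (8 packed f16)
    {%b0, %b1},              // B: 2 x b32 (4 packed f16)
    {%c0, %c1, %c2, %c3};    // C: 4 x b32 (f32 fragment)
```

The qualifier string (shape, layouts, and the dtype.atype.btype.ctype tuple) is exactly the variant space the dispatch below must resolve.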

The numeric MMA hash table at a1+816 collapses the full variant space (shape + type + layout) into ~400 hash entries. Each entry maps a numeric string key (e.g., "2644314910") to a specific codegen handler function pointer, avoiding multi-dimensional dispatch.

sm_80+ MMA intrinsics (IDs 0x209--0x22F):

The 39 intrinsics registered under __cuda_sm_8x_mma_* cover:

| Intrinsic Pattern | Layout | Types (D, C, A, B) | Regs |
|---|---|---|---|
| mma_row_col_f16_f16_f16_f16 | row x col | f16, f16, f16, f16 | D: 4 x b32 |
| mma_row_col_f32_f16_f16_f16 | row x col | f32, f16, f16, f16 | D: 8 x b32 |
| mma_row_col_f32_f16_f16_f32 | row x col | f32, f32, f16, f16 | D: 8 x b32, C: 8 x b32 |
| mma_col_col_* | col x col | same set | same |
| mma_row_row_* | row x row | same set | same |
| mma_col_row_* | col x row | same set | same |
| mma_shfl_f16 | -- | shuffle f16 | D: 2 x b32 |
| mma_shfl_f32 | -- | shuffle f32 | D: 4 x b32 |

Prototype examples from the binary:

```
.weak .func (.param .align 16 .b32 mma_dst[4])
    __cuda_sm_8x_mma_row_col_f16_f16_f16_f16
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3);

.weak .func (.param .align 16 .b32 mma_dst[8])
    __cuda_sm_8x_mma_row_col_f32_f16_f16_f32
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
     .reg .b32 c4, .reg .b32 c5, .reg .b32 c6, .reg .b32 c7);
```

Note the .param .align 16 .b32 mma_dst[N] return convention -- MMA results are returned through aligned parameter space, not registers, because the warp-cooperative nature of the operation means each thread holds only a fragment.

MMA Shape Summary Across Generations

| Shape | Types | SM Floor | SASS Opcode | Notes |
|---|---|---|---|---|
| m8n8k4 | f64 | 80 | DMMA | Double-precision |
| m8n8k16 | f16 | 75 | HMMA | Original Turing shape |
| m8n8k32 | s4/u4 | 75 | IMMA | Sub-byte integer |
| m8n8k128 | b1 | 75 | BMMA | 1-bit (XOR/AND pop) |
| m16n8k8 | f16 | 75 | HMMA | Asymmetric Turing shape |
| m16n8k16 | f16, bf16, s8/u8 | 80 | HMMA, IMMA | Primary Ampere shape |
| m16n8k32 | s8/u8, s4/u4 | 80/90 | IMMA | Integer, sub-byte at sm_90 |
| m16n8k64 | s4/u4 | 90 | IMMA | Hopper sub-byte extension |
| m16n8k128 | s4/u4 (sparse), b1 | 90 | IMMA, BMMA | Hopper sparse sub-byte |
| m16n8k256 | b1 | 90/100 | BMMA | Extended 1-bit MMA |

MMA Validation

Three validator functions gate MMA features by SM version:

sub_4C2FD0 (12.2KB) -- WMMA/MMA master validator:

Performs tiered SM-version checks:

  • SM 75: base WMMA (f16)
  • SM 80: extended types (BF16, TF32, FP64 -- "MMA with double types")
  • SM 90: WGMMA features
  • FP8: "mma with FP8 floating point type" (gated by sm_89+)

sub_49BBA0 (11.4KB) -- MMA type/scale validator:

Validates FP8 and block-scale configurations:

  • "mma with FP8 floating point type" -- sm_89+ gate
  • "mma with FP8 floating point type and FP16 accumulation" -- additional FP16 accum check
  • "mma with FP8 floating point type and .m16n8k16 shape" -- shape/type cross-validation
  • "Sparse mma with block scale" -- block-scale + sparsity interaction
  • .block_scale modifier validation

sub_4BFED0 (10.3KB) -- WMMA shape validator:

Validates WMMA-specific shapes and the .aligned modifier:

  • ".aligned modifier for wmma" -- alignment enforcement
  • SM 75/80 version checks for shape legality

sub_490F90 -- Integer MMA validator:

Checks integer MMA shape validity: "Integer MMA with shape " -- validates m/n/k dimensions against the SM-level capability set.

sub_494210 (2.3KB) -- Sparse GMMA validator:

Validates sparse MMA metadata: "Sparse GMMA with " -- checks 2:4 sparsity pattern encoding.

sub_495900 -- WMMA floating-point validator:

Checks: "'wmma.mma' with floating point type" -- validates FP type compatibility with the target shape.

sub_4428E0 -- FP64 MMA validator:

Validates: "mma with .f64 type" -- gates double-precision MMA on sm_80+.

sm_90+ Sub-Byte MMA Intrinsics (IDs 0x23A--0x25F)

Hopper introduces 38 sub-byte MMA intrinsics covering s4/u4 sparse operations at warp granularity. These are distinct from the WGMMA API and provide backward-compatible sub-byte operations through the classical mma.sync interface.

Dense sub-byte (m8n8k32, m16n8k32, m16n8k64):

| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m8n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k64 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |

Sparse sub-byte (m16n8k64, m16n8k128):

| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m16n8k64 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite, split _0/_1 | 16 |
| m16n8k128 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |

The _0 and _1 suffixes on sparse m16n8k64 represent the two halves of a split operation -- the K dimension is decomposed into two steps for the sparsity pattern. The sparse variants take an additional e (metadata) operand encoding the 2:4 sparsity pattern.
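
The same pattern surfaces in user-facing PTX as mma.sp, where the trailing operands are the metadata word and an immediate sparsity selector (corresponding to the _0/_1 halves above). An illustrative non-sub-byte instance in standard PTX syntax:

```
mma.sp.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    {%d0, %d1, %d2, %d3},
    {%a0, %a1},              // A is half-width: 2:4-compressed operand
    {%b0, %b1},
    {%c0, %c1, %c2, %c3},
    %e, 0x0;                 // metadata word + sparsity selector
```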

Bit-operations (b1):

| Shape | Operation | SM | Intrinsic |
|---|---|---|---|
| m8n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| m16n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| m16n8k256 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |

Prototype example (sparse m16n8k128):

```
.weak .func (.param .align 16 .b32 mma_dst[4])
    __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
    (.reg .b32 a0, .reg .b32 a1, .reg .b32 a2, .reg .b32 a3,
     .reg .b32 b0, .reg .b32 b1, .reg .b32 b2, .reg .b32 b3,
     .reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
     .reg .b32 e);
```

The sparse m16n8k128 variant takes 4 A-operands, 4 B-operands (double-width due to sparsity encoding), 4 C-operands, and 1 metadata operand e.

sm_100 Blackwell MMA (IDs 0x230--0x239)

Blackwell introduces 10 intrinsics for extended MMA operations:

HMMA/IMMA metadata helpers (__cuda_sm_10x_*_mdata_*):

| Intrinsic | Shape | Returns | Inputs |
|---|---|---|---|
| __cuda_sm_10x_hmma_mdata_m16n8k16 | m16n8k16 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_hmma_mdata_m16n8k32 | m16n8k32 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k32 | m16n8k32 | ret_dst[3] | a0, a1, e, f_temp |
| __cuda_sm_10x_imma_mdata_m16n8k64 | m16n8k64 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |

These mdata functions compute sparse metadata for Blackwell's 5th-gen tensor cores. The e parameter is the sparsity selector, f_temp is a scratch register. The return array includes the transformed A-operands plus the computed metadata word.

Bit MMA (AND + XOR for sm_100):

| Intrinsic | Shape | Operation | Regs |
|---|---|---|---|
| __cuda_sm_10x_mma_bit_internal_and_m8n8k128 | m8n8k128 | AND | D: 2, A: 1, B: 1, C: 2 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k128 | m16n8k128 | AND | D: 4, A: 2, B: 1, C: 4 |
| __cuda_sm_10x_mma_bit_internal_and_m16n8k256 | m16n8k256 | AND | D: 4, A: 4, B: 2, C: 4 |
| __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 | m8n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 | m16n8k128 | XOR | same as AND |
| __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 | m16n8k256 | XOR | same as AND |

Blackwell adds the AND reduction mode for 1-bit MMA (sm_90 only had XOR).

WGMMA -- Warpgroup MMA (SM 90+)

WGMMA (Warp Group Matrix Multiply-Accumulate) operates at warpgroup granularity (4 warps, 128 threads) and uses an asynchronous pipeline protocol. Four PTX instructions are registered:

| PTX Instruction | Handler | Size | Role |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | 1,282B | Dispatch asynchronous MMA operation |
| wgmma.fence | sub_4DA380 | 295B | Open pipeline stage |
| wgmma.commit_group | sub_4DA4B0 | 295B | Close pipeline stage |
| wgmma.wait_group | sub_4DA5E0 | 311B | Wait for N committed groups |

WGMMA Pipeline Protocol

The hardware requires strict sequencing:

```
wgmma.fence              -- open pipeline stage
wgmma.mma_async  (1..N)  -- asynchronous MMA operations sharing accumulators
wgmma.commit_group       -- close pipeline stage
wgmma.wait_group N       -- wait for N outstanding groups to complete
```

Between fence and wait, strict register constraints apply:

  1. No non-WGMMA definitions of accumulator registers
  2. No non-WGMMA reads of accumulator registers
  3. No non-WGMMA definitions of WGMMA input registers (including descriptors)

Violation of any constraint triggers pipeline serialization via sub_AE47B0, which collapses the pipeline to individual fence/mma/commit/wait per operation. The serialization reason is reported through warning codes 0x1D55--0x1D5E (see GMMA Pipeline).
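
Concretely, a legal pipeline in standard PTX syntax looks roughly like this (schematic; the shape, register count, and scale/transpose immediates are illustrative):

```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16
    {%d0, ..., %d31},        // accumulator fragment, held in registers
    %descA,                  // A: register fragment or 64-bit descriptor
    %descB,                  // B: always a 64-bit descriptor
    1, 1, 1, 0, 0;           // scale-d, scale-a, scale-b, trans-a, trans-b
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;   // block until 0 groups remain outstanding
```

Any non-WGMMA touch of {%d0..%d31} between the fence and the wait is exactly what trips the serialization path described above.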

WGMMA Descriptor Format

The wgmma.mma_async handler (sub_50AC70, 1,282 bytes) encodes the operation's matrix dimensions, data types, layout, and scale factors into the instruction. The A operand can be either a register operand or a descriptor -- a 64-bit value encoding the matrix base address, leading dimension, stride, and swizzle pattern. The B operand is always descriptor-based.

The descriptor format lets the hardware fetch matrix data directly from shared memory, bypassing register-file involvement for the B matrix operand entirely.
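
The public PTX ISA documents the 64-bit shared-memory matrix descriptor layout, summarized here for reference:

```
// WGMMA shared-memory matrix descriptor (64 bits):
//   bits  0-13  matrix start address      (encoded as addr >> 4)
//   bits 16-29  leading dimension offset  (bytes >> 4)
//   bits 32-45  stride dimension offset   (bytes >> 4)
//   bits 49-51  matrix base offset
//   bits 62-63  swizzle mode: 0 = none, 1 = 128B, 2 = 64B, 3 = 32B
// Remaining bit ranges are reserved.
```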

Ori Internal Encoding

| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator registers |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator set |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator set |

Compiler-Inserted Warpgroup Instructions

The GMMA pipeline passes (phases 85 and 87) insert three compiler-internal pseudo-operations prefixed with underscore:

| Pseudo-Op | SASS Output | Purpose |
|---|---|---|
| _warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup synchronization (arrive) |
| _warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup synchronization (wait) |
| _warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |

These are not directly written by the programmer. The compiler inserts them to manage register ownership transfer between the warpgroup's register file and the tensor core's accumulator pipeline.

TCGen05 -- 5th Generation Tensor Cores (SM 100+)

Blackwell introduces TCGen05 (Tensor Core Generation 5), which operates through tensor memory (TMEM) -- a dedicated on-chip memory visible only to the tensor core engine, separate from the register file and shared memory. Eleven PTX instructions are registered:

| PTX Instruction | Handler | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | sub_5BBC30 | 90KB | Tensor core MMA from TMEM |
| tcgen05.mma.ws | sub_58FA20 | 4,604B | Warp-shared MMA variant |
| tcgen05.ld | sub_574050 | -- | Load from TMEM into registers |
| tcgen05.ld.red | sub_578DB0 | -- | Load-reduce from TMEM |
| tcgen05.st | sub_571FE0 | -- | Store from registers into TMEM |
| tcgen05.cp | sub_5427F0 | -- | Copy from shared memory into TMEM |
| tcgen05.commit | sub_56C190 | -- | Commit pending operations |
| tcgen05.shift | sub_4F1A90 | -- | Shift TMEM contents |
| tcgen05.alloc | sub_569180 | -- | Allocate TMEM columns |
| tcgen05.dealloc | sub_58C7F0 | -- | Deallocate TMEM columns |
| tcgen05.relinquish_alloc_permit | sub_526370 | -- | Release allocation rights |

TCGen05 MMA Codegen

The tcgen05.mma handler (sub_5BBC30, 90KB) is the third-largest codegen handler in ptxas. It:

  1. Allocates a 50,000-byte code generation buffer
  2. Validates tcgen05 capability via sub_70FA00(*, 29)
  3. Handles standard, sparse (.sp), and warp-shared (.ws) variants
  4. Extracts sparse metadata via sub_70F0A0
  5. Generates TMEM address computation code

The tcgen05.mma.ws formatter (sub_58FA20, 4,604B, also used for tcgen05.shift) handles the warp-shared variant where multiple warps contribute to a single MMA operation.
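
At PTX level the instruction reads its operands through descriptors and a TMEM address; an illustrative form based on the public PTX ISA (operand names are placeholders):

```
// d is a TMEM address; a can be TMEM-resident or a shared-memory
// descriptor; b is a shared-memory descriptor; the instruction
// descriptor idesc packs shape/type/sparsity configuration.
tcgen05.mma.cta_group::1.kind::f16
    [%d_tmem], %a_desc, %b_desc, %i_desc, %p_enable;
```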

TCGen05 SASS Encoding

At the SASS level, TCGen05 operations are encoded by four specialized Mercury encoder functions:

| Address | Handler | Purpose |
|---|---|---|
| sub_6D4350 | SASS MMA encoder | Primary MMA SASS emission |
| sub_6D7AF0 | SASS MMA encoder | Alternate MMA variant |
| sub_6D5CB0 | SASS MMA encoder | Additional MMA mode |
| sub_6D69B0 | SASS MMA encoder | Additional MMA mode |

The SASS encoder at sub_6D4350 references the tcmma operation namespace and validates block-scale configurations:

  • "tcmma_*_o must be specified with blockscale" -- output operand requires block-scale modifier
  • "uri width for tcmma_*_o must be 2" -- output URI width constraint
  • "tcmma_*_q with blockscale must have uri width of 2" -- scale factor operand constraint
  • "tcmma_*_mxq must be specified with blockscale" -- MX quantization operand constraint
  • "For UTCHMMA, #scaleU4 must be 0 in SPA 10.1." -- SM 100 vs 103 compatibility

The strings "UTCHMMA" (Unified Tensor Core HMMA) and "tcmma" (Tensor Core MMA) are the internal SASS-level names for Blackwell's tensor core operations.

TCGen05 Guardrails

Blackwell includes a debug/validation mode activated by --g-tensor-memory-access-check. When enabled, ptxas wraps TMEM operations with guardrail trap functions. Ten guardrail intrinsics are registered (IDs 0x20--0x2A):

| Intrinsic | Trap Condition |
|---|---|
| tcgen05_guardrail_trap_phase_invalid_during_alloc | TMEM allocation during invalid phase |
| tcgen05_guardrail_trap_current_warp_owner_invalid | Warp accessing TMEM it does not own |
| tcgen05_guardrail_trap_unallocated_columns_access | Access to unallocated TMEM columns |
| tcgen05_guardrail_trap_unallocated_columns_being_dealloced | Deallocation of unallocated columns |
| tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | Dealloc of column not from alloc |
| tcgen05_guardrail_trap_allocation_granularity_invalid | Invalid allocation granularity |
| tcgen05_guardrail_trap_access_out_of_physical_bounds | Out-of-bounds TMEM access |
| tcgen05_guardrail_trap_invalid_datapath_alignment | TMEM datapath misalignment |
| tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | Sparsity mismatch in instruction descriptor |
| tcgen05_guardrail_trap_sp_used_in_unsupported_env | Sparsity in unsupported config |

Eight guardrail PTX instructions are registered in the opcode dispatch table:

| PTX Instruction | Check |
|---|---|
| _tcgen05.guardrails.is_phase_valid | Phase validity for alloc |
| _tcgen05.guardrails.is_current_warp_valid_owner | Warp ownership |
| _tcgen05.guardrails.are_columns_allocated | Column allocation status |
| _tcgen05.guardrails.in_physical_bounds | Physical bounds check |
| _tcgen05.guardrails.allocation_granularity | Granularity validation |
| _tcgen05.guardrails.datapath_alignment | Alignment validation |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | Sparsity consistency |
| _tcgen05.guardrails.check_sparse_usage | Sparse usage validation |

The guardrail check functions are .FORCE_INLINE and return a boolean retVal:

```
.FORCE_INLINE .func (.reg .b32 retVal)
    __cuda_sm10x_tcgen05_guardrails_check_datapath_alignment
    (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group,
     .reg .u32 hasWS, .reg .u32 hasSP, .reg .u32 matrix_kind);
```

The parameters reveal TMEM addressing structure: tmemAddr (TMEM base address), iDesc (instruction descriptor), cta_group (CTA group for cluster operations), hasWS (warp-shared flag), hasSP (sparse flag), matrix_kind (operand role).

Block-Scale MMA

Block-scale MMA allows per-block scaling factors for mixed-precision computation. In ptxas, this is gated by the .block_scale modifier on the PTX mma instruction:

  • Validator sub_49BBA0 checks ".block_scale" and "Sparse mma with block scale" (sparsity + block-scale interaction)
  • Additional intrinsic suffix __cuda_sm_100_tcgen05_ld_immhalfSplitOff and variants handle block-scale-aware loads
  • The bf16x2.ue8m0x2 type string in the binary indicates UE8M0 (unsigned exponent-only) scale factor format for MX (microscaling) quantization

Intrinsic Registration Summary

Full Tensor Core Intrinsic ID Map

| ID Range | Count | SM | Category | SASS Target |
|---|---|---|---|---|
| 0x89--0x1FA (subset) | ~200 | 70+ | __cuda_sm70_wmma_* -- WMMA load/store/mma (f16) | HMMA |
| 0x1FB--0x208 | 14 | 80 | __cuda_sm80_* -- bf16/tf32/s4/s8/b1 MMA, createpolicy | HMMA, IMMA, DMMA, BMMA |
| 0x209--0x22F | 39 | 80+ | __cuda_sm_8x_mma_* -- direct MMA operations | HMMA, IMMA |
| 0x230--0x239 | 10 | 100 | __cuda_sm_10x_* -- hmma/imma mdata + bit MMA | UTCHMMA, UTCIMMA |
| 0x23A--0x25F | 38 | 90 | __cuda_sm_9x_mma_sub_byte_internal_* -- sub-byte sparse | IMMA |

TCGen05 Intrinsics (Not in Master ID Table)

TCGen05 operations are dispatched through the named opcode table, not the numeric ID table. The 11 tcgen05.* instructions are registered directly in the opcode hash map at a1+808:

| PTX Opcode | Codegen Handler |
|---|---|
| tcgen05.alloc | sub_569180 |
| tcgen05.relinquish_alloc_permit | sub_526370 |
| tcgen05.dealloc | sub_58C7F0 |
| tcgen05.ld | sub_574050 |
| tcgen05.ld.red | sub_578DB0 |
| tcgen05.st | sub_571FE0 |
| tcgen05.commit | sub_56C190 |
| tcgen05.cp | sub_5427F0 |
| tcgen05.shift | sub_4F1A90 |
| tcgen05.mma | sub_5BBC30 |
| tcgen05.mma.ws | sub_58FA20 |

OCG-Level MMA Operations

The OCG system at sub_6C9EB0 registers additional SASS-level MMA operations for SM100+:

```
tcmma (2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit)
tcbar, mmareadshma
16dp32bitt0t15, 16dp32bitt16t31
sparsify, spfactor2to4
tcshift, tcatomsws, tcldsws, tcstsws
```

These are internal Mercury-level operations that do not correspond 1:1 to PTX instructions. The tcmma variants encode the specific SASS datapath configuration: 2x64dp128bit means 2 datapaths, 64-element wide, 128-bit lane width.

Data Type Matrix

Complete data type support across tensor core generations as visible in the binary:

| Data Type | PTX Type | Width | SM Floor | Use |
|---|---|---|---|---|
| FP16 | .f16 | 16b | 75 | Primary HMMA operand |
| BF16 | .bf16 | 16b | 80 | HMMA alternate format |
| TF32 | .tf32 | 19b (stored as 32b) | 80 | Reduced-precision FP32 |
| FP32 | .f32 | 32b | 75 | HMMA accumulator |
| FP64 | .f64 | 64b | 80 | DMMA (double-precision) |
| FP8 E4M3 | .e4m3 | 8b | 89 | Ada/Hopper FP8 MMA |
| FP8 E5M2 | .e5m2 | 8b | 89 | Ada/Hopper FP8 MMA |
| INT8 | .s8, .u8 | 8b | 72 | IMMA integer MMA |
| INT4 | .s4, .u4 | 4b | 72 | IMMA sub-byte MMA |
| INT1 | .b1 | 1b | 75 | BMMA 1-bit MMA (XOR/AND) |
| UE8M0 | .ue8m0x2 | 8b (packed) | 100 | Block-scale exponent factor |
| B1024 | .b1024 | 1024b | 100 | TMEM-width operand |

ELF Metadata

The cubin output includes tensor-core-specific EIATTR attributes:

| EIATTR | Purpose |
|---|---|
| EIATTR_SPARSE_MMA_MASK | Records which MMA operations use structured sparsity |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 1-CTA tcgen05 operations |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 2-CTA tcgen05 operations |

Knobs and Options

| Knob / Option | Purpose |
|---|---|
| suppress-sparse-mma-advisory-info | Suppress advisory info for mma.sp operations |
| --g-tensor-memory-access-check | Enable tcgen05 guardrail instrumentation |

Key Function Table

| Address | Size | Identity | Confidence |
|---|---|---|---|
| 0x5C7A50 | 173KB | WMMA.MMA codegen (all shapes/types/layouts) | 98% |
| 0x5C10A0 | 120KB | MMA.sync codegen (post-Volta shapes) | 98% |
| 0x5BBC30 | 90KB | TCGen05.MMA codegen (Blackwell TMEM-based) | 98% |
| 0x5A0EA0 | 7,779B | wmma.load.a formatter | 95% |
| 0x5A8E40 | 9,757B | wmma.load.b formatter | 95% |
| 0x5A6BD0 | 8,813B | wmma.load.c formatter | 95% |
| 0x5A2D10 | 8,074B | wmma.store.d formatter | 95% |
| 0x50AC70 | 1,282B | wgmma.mma_async handler | 99% |
| 0x4DA380 | 295B | wgmma.fence handler | 99% |
| 0x4DA4B0 | 295B | wgmma.commit_group handler | 99% |
| 0x4DA5E0 | 311B | wgmma.wait_group handler | 99% |
| 0x58FA20 | 4,604B | tcgen05.mma.ws / tcgen05.shift formatter | 95% |
| 0x4DA720 | 343B | tcgen05.mma.ws formatter | 90% |
| 0x569180 | -- | tcgen05.alloc handler | 90% |
| 0x526370 | -- | tcgen05.relinquish_alloc_permit handler | 90% |
| 0x58C7F0 | -- | tcgen05.dealloc handler | 90% |
| 0x574050 | -- | tcgen05.ld handler | 90% |
| 0x578DB0 | -- | tcgen05.ld.red handler | 90% |
| 0x571FE0 | -- | tcgen05.st handler | 90% |
| 0x56C190 | -- | tcgen05.commit handler | 90% |
| 0x5427F0 | -- | tcgen05.cp handler | 90% |
| 0x4F1A90 | -- | tcgen05.shift handler | 90% |
| 0x4C2FD0 | 12.2KB | WMMA/MMA master validator (sm_75/80/90) | 90% |
| 0x49BBA0 | 11.4KB | MMA type/scale validator (FP8, block-scale) | 90% |
| 0x4BFED0 | 10.3KB | WMMA shape validator | 90% |
| 0x490F90 | -- | Integer MMA shape validator | 85% |
| 0x494210 | 2,276B | Sparse GMMA validator | 85% |
| 0x495900 | -- | WMMA floating-point type validator | 85% |
| 0x496570 | -- | FP8 MMA shape validator | 85% |
| 0x4961F0 | -- | FP8 MMA accumulation validator | 85% |
| 0x4428E0 | -- | FP64 MMA type validator | 85% |
| 0x6D4350 | -- | SASS TCGen05 MMA encoder (UTCHMMA/tcmma) | 85% |
| 0x6D7AF0 | -- | SASS MMA encoder variant | 85% |
| 0x6D5CB0 | -- | SASS MMA encoder variant | 85% |
| 0x6D69B0 | -- | SASS MMA encoder variant | 85% |
| 0x50D4B0 | 1,187B | ldmatrix formatter | 90% |
| 0x4DAEA0 | -- | movmatrix formatter | 90% |
| 0x4F05D0 | -- | stmatrix formatter | 90% |
| 0x5D1660 | 46KB | Master intrinsic registration (608 entries) | 99% |
| 0x5D4190 | 41KB | Opcode dispatch (MMA hash table builder) | 99% |
| 0x5FF700 | 354KB | Prototype generator (.weak .func declarations) | 99% |

Cross-References