Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Tensor Core / MMA Builtins

Tensor core builtins implement the Warp Matrix Multiply-Accumulate (WMMA) and Warp Group MMA (WGMMA) interfaces, spanning IDs 678--770 across four SM generations. Each generation added new data types and matrix shapes, resulting in 91 registered builtins that cover half-precision, integer, binary, double-precision, TF32, BF16, and FP8 matrix operations. SM 100 (Blackwell) adds a fifth generation -- tcgen05 -- documented in Tensor / MMA Codegen.

Key Facts

PropertyValue
Builtin IDs678--770 (93 entries)
WGMMA handler (IDs 753--768)~800 lines in sub_12B3FD0 / sub_955A70
LLVM intrinsic range (WGMMA)5304--5447 (144-entry 5-D grid) plus 10654--10779 (N-dimension table)
NVVM loweringsub_955A70 (105KB), sub_12B3FD0 (103KB)
Backend emissionsub_21E74C0 (PTX builder), sub_36E9630 (tcgen05 ISD selection)
SM gatesSM 70+ HMMA, SM 72+ IMMA, SM 75+ BMMA, SM 80+ DMMA/TF32/BF16, SM 90+ WGMMA

WMMA Architecture Evolution

SM GenerationFeatureID RangeCount
SM 70 (Volta)HMMA: FP16 tensor core678--70730
SM 75 (Turing)IMMA: INT8/INT4, BMMA: binary708--74538
SM 80 (Ampere)DMMA: FP64, TF32, BF16746--76419
SM 90 (Hopper)WGMMA: warp-group MMA, FP8765--7684
SM 100 (Blackwell)tcgen05: MX formats, block-scale, sparsity(intrinsic path)--

HMMA -- Half-Precision (IDs 678--707, SM 70+)

The original tensor core builtins provide 16-bit floating-point matrix multiply for three tile shapes. Each shape has 10 operations: load A, load B, load C (f16 and f32 accumulators), store C (f16 and f32), and four MMA variants for input/output precision combinations.

ID RangeShapeBuiltin Prefix
678--68716x16x16__hmma_m16n16k16_*
688--69732x8x16__hmma_m32n8k16_*
698--7078x32x16__hmma_m8n32k16_*

Per-shape operations (10 each):

SuffixOperationDescription
ld_aLoad A fragmentLoad matrix A tile from memory
ld_bLoad B fragmentLoad matrix B tile from memory
ld_c_f16Load C (f16)Load accumulator as half-precision
ld_c_f32Load C (f32)Load accumulator as single-precision
st_c_f16Store C (f16)Store result as half-precision
st_c_f32Store C (f32)Store result as single-precision
mma_f16f16MMA f16->f16FP16 input, FP16 accumulator
mma_f32f16MMA f16->f32FP16 input, FP32 accumulator
mma_f16f32MMA f32->f16FP32 accumulator, FP16 output
mma_f32f32MMA f32->f32FP32 input and accumulator

IMMA -- Integer MMA (IDs 708--739, SM 75+)

Integer tensor core operations for INT8 and INT4 data types.

INT8 (IDs 708--731)

Three shapes (16x16x16, 32x8x16, 8x32x16), each with 8 operations:

SuffixDescription
ld_a_s8 / ld_a_u8Load A fragment (signed/unsigned INT8)
ld_b_s8 / ld_b_u8Load B fragment (signed/unsigned INT8)
ld_cLoad accumulator (INT32)
st_c_i32Store result (INT32)
mma_s8 / mma_u8INT8 MMA (signed/unsigned)

INT4 (IDs 732--739)

Single shape (8x8x32) with the same operation set but _s4 / _u4 type suffixes.

BMMA -- Binary MMA (IDs 740--745, SM 75+)

Binary (1-bit) matrix multiply with XOR-POPC and AND-POPC accumulation modes. Single shape: 8x8x128.

IDBuiltinDescription
740__bmma_m8n8k128_ld_a_b1Load A fragment (binary)
741__bmma_m8n8k128_ld_b_b1Load B fragment (binary)
742__bmma_m8n8k128_ld_cLoad accumulator
743__bmma_m8n8k128_st_c_i32Store result
744__bmma_m8n8k128_mma_xor_popc_b1Binary MMA (XOR + popcount)
745__bmma_m8n8k128_mma_and_popc_b1Binary MMA (AND + popcount)

Extended Tensor Core (IDs 746--764, SM 80+)

SM 80 (Ampere) added double-precision, TF32, and BF16 tensor operations.

DMMA -- Double Precision (IDs 746, 751--754)

IDBuiltinDescription
746__dmma_m8n8k4_mma_f64FP64 MMA
751__dmma_m8n8k4_st_c_f64Store FP64 result
752--754__dmma_m8n8k4_{ld_a,ld_b,ld_c}Load fragments

TF32 (IDs 747, 755--757)

IDBuiltinDescription
747__mma_tf32_m16n16k8_mma_f32TF32 MMA producing FP32
755--757__mma_tf32_m16n16k8_{ld_a,ld_b,ld_c}Load fragments

BF16 (IDs 748--750, 758--764)

IDBuiltinDescription
748__mma_bf16_m16n16k16_mma_f32BF16 16x16x16 MMA
749__mma_bf16_m32n8k16_mma_f32BF16 32x8x16 MMA
750__mma_bf16_m8n32k16_mma_f32BF16 8x32x16 MMA
758--764__mma_bf16_m*_{ld_a,ld_b}Load fragments for each shape

WMMA Lowering Details

Three-Table Lookup

WMMA builtins use a three-table structure for mapping builtin IDs to LLVM intrinsic IDs:

TableAddress (NVVM)ID RangeDescription
dword_3F14840Entries 0--29678--707HMMA (first-generation, FP16)
dword_3F147E0Entries 0--23708--731IMMA (INT8)
dword_3F147A0Entries 0--12732--744BMMA (binary) / INT4

The EDG-side parallel tables live at dword_42810C0 (678--709), dword_4281060 (708--731), dword_4281020 (732--744), addressed from sub_12AC1A0.

Fragment Size Determination

The number of register-level fragments varies by operation and data type:

ConditionFragment CountExample
First-gen WMMA, BF16, store4BF16 store_c
First-gen WMMA, default8FP16 mma
IMMA, intrinsic 8914/82802INT8 ld_a compact
BMMA2Binary operations
IMMA intrinsic 0x22BB/0x22BC/0x22C5/0x22C64INT4 load A/B
IMMA intrinsic 0x22BD/0x22BE/0x22C3/0x22C4/0x22CB--0x22CE1Sub-byte single-element
IMMA intrinsic 0x22B7/0x22BF/0x22C78INT8 full-width

MMA Codegen Flow

The MMA handler (sub_94E0D0 / sub_12AC5F0) processes 5 input operands:

  1. dest_ptr -- Pointer to output fragment storage
  2. A_fragment -- Matrix A input (loaded v100 times)
  3. B_fragment -- Matrix B input (loaded v95 times)
  4. C_fragment -- Accumulator input (loaded v101 times)
  5. rowcol -- Layout operand (validated 0--3 for MMA)

An optional satf flag (saturation, validated 0--1) is consumed for most intrinsics except ID 8279.

The handler emits the MMA call via sub_921880 and scatters results back to the destination fragment through v103 iterations of element-wise stores.

Fragment iteration counts per family (NVVM path, sub_94E0D0):

Familyv95 (load B)v100 (load A)v101 (load C)v103 (store D)
BMMA (b1)1122
IMMA (0x22C0-0x22C1)1488
IMMA (0x22B8-0x22B9 = 8888-8889)2288
IMMA (0x22C8-0x22C9 = 8904-8905)4188
HMMA (default, first-gen)88variesvaries (4 or 8)

The output fragment count is determined by bit-test: (0x300C003 >> (intrinsic_id + 127)) & 1 selects 4 vs 8 fragments.

Architecture Gating -- Exact Thresholds

The architecture version is stored at *(target_info + 252) as a DWORD.

FunctionGate ExpressionMinimum SMNotes
sub_21DFBF0 hmmastcv8 > 0x45SM 70FP16 store
sub_21E0360 hmmaldabv8 > 0x45SM 70FP16 load A/B
sub_21E0870 hmmammav8 > 0x45SM 70FP16 MMA
sub_21E1280 immaldabv8 > 0x47SM 72INT load; v8==72 && variant>1 rejected
sub_21E1D20 immammav8 > 0x47SM 72INT MMA; variant>1 && v8==72 rejected
sub_21E2280 bmmammav8 > 0x48SM 73/75Binary MMA
sub_36E9630 tcgen05arch >= 0x3E8SM 100Blackwell only

SM 72 (Xavier) has a unique partial IMMA implementation: only variant 0/1 shapes are supported, with explicit gating that blocks higher variants. This matches hardware reality where Xavier had limited INT8 tensor cores.

WGMMA -- Warp Group MMA (SM 90+ Hopper)

WGMMA operates on an entire warp group (4 warps, 128 threads) rather than a single warp. The system is split across four builtin IDs, 20 auxiliary IDs for fence/store/load operations, and two massive handler blocks totaling ~800 lines of lowering logic.

Builtin Registration

Four builtins are registered in sub_90AEE0 (NVVM) and sub_126A910 (EDG):

IDBuiltinData TypeLowering Case
765 (0x2FD)__wgmma_mma_async_f16FP16Full operand set (6 chained: A, B, C, scale, negate, sparsity)
766 (0x2FE)__wgmma_mma_async_bf16BF162-operand (no scale/negate)
767 (0x2FF)__wgmma_mma_async_tf32TF32Reduced operand set
768 (0x300)__wgmma_mma_async_f8FP8 (SM 90a+)Minimal (2 scale operands only)

WGMMA ID Space Overview

The full WGMMA ID range spans 745--770, subdivided into four functional groups:

ID RangeFunctionHandler
745--750 (0x2E9--0x2EE)Fence / commit / waitsub_12B1C20 / sub_953BA0
751--752 (0x2EF--0x2F0)Storesub_12B27B0 / sub_954350
753--764 (0x2F1--0x2FC)MMA async load (12 variants)inline / sub_9547E0
765--768 (0x2FD--0x300)MMA async compute (4 type builtins)inline ~800 lines / sub_12B2E10
769--770 (0x301--0x302)Warp-group barrierinline IR via sub_127FC40

WGMMA Fence / Commit / Wait (IDs 745--750)

sub_953BA0 (NVVM) / sub_12B1C20 (EDG) builds a red-black tree on first call with 7 entries keyed by builtin ID. Each entry packs:

struct wgmma_fence_entry {
    uint32_t id;           // builtin ID (745--751)
    uint32_t trans_a;      // transpose A flag
    uint32_t shape;        // shape code (0 or 1)
    uint32_t trans_b;      // transpose B flag
    uint32_t a_nregs;      // register count for A fragment
    uint32_t b_nregs;      // register count for B fragment
    uint32_t padding;      // unused alignment
    llvm_type *a_type;     // LLVM type for A (i64, i32, i16x2, i32x4)
    llvm_type *b_type;     // LLVM type for B
    llvm_type *c_type;     // LLVM type for C (i32x2, i32x8)
};

Decoded entries from local variables v47--v106:

IDtrans_ashapetrans_ba_nregsb_nregsA typeB typeC type
74501511i64i64--
74610199i32i32i32x2
747002588i16x2i16x2--
748002377i32x4i32x4i32x8
749002477i32x4i32x4i32x8
75000677i64i32x2i32x8

Output packed encoding (*a4, 64-bit):

BitsFieldSource
[3:0]trans_a*(entry+40)
[7:4]shape*(entry+48) << 4
[15:8]a_nregs*(entry+64) << 8
[27:16]b_nregs*(entry+72) << 16
[31:28]padding*(entry+80) << 28
[63:32]trans_b*(entry+56) << 32
[25]rowcol bit 1(rowcol & 2) == 0 ? 0x2000000 : 0x1000000
[27:26]rowcol bit 0((rowcol & 1) + 1) << 26

The fence dispatch validates the rowcol operand (must be 0--3) and emits a 4-argument call to intrinsic 9062 (llvm.nvvm.wgmma.fence.aligned) with 3 type overloads. Fragment operands are prepared via sub_94B510.

WGMMA Store (IDs 751--752)

sub_954350 / sub_12B27B0 builds a separate parameter lookup tree. Store operations validate rowcol (0 or 1) and emit a 5-argument call using intrinsic 9145 (llvm.nvvm.wgmma.store) with 2 type overloads. Operands: {constant, B_fragment, descriptor, rowcol, zero}.

WGMMA MMA Async Load (IDs 753--764)

sub_9547E0 (NVVM) / sub_12B2E10 (EDG) builds a 12-entry red-black tree at ctx+656:

IDShapenregsVariantFragment Type
753190--
754191--
755192i16x2
7562580--
7572581--
75825102i32x8
7592370i32x4
7602371i32x4
7612470i32x4
7622471i32x4
763670i32x2/i64
764671i32x2/i64

Output packed encoding (*a4, 64-bit):

BitsField
[63:32]*(entry+40) << 32
[31:4]*(entry+48) << 4 | rowcol
[1]*(entry+56) << 1

Emits intrinsic 9067 (llvm.nvvm.wgmma.mma.async) with 2 type overloads. Arguments: {constant, B_fragment, rowcol_value, zero_constant}. Results scattered via sub_94B940.

WGMMA MMA Async Compute -- The 800-Line Handler (IDs 765--768)

This is the primary WGMMA lowering path. It lives inline in the mega-switch of sub_955A70 (NVVM, lines ~2850--3138) and sub_12B3FD0 (EDG, lines ~2270--3138). The handler implements two completely different intrinsic selection strategies depending on which builtin ID triggered entry.

Argument Extraction

The handler walks the argument chain 7 levels deep from the call expression:

v263 = M dimension              (first constant argument)
v512 = accumulator fragments    (pointer to fragment array)
v528 = A descriptor             (64-bit matrix descriptor or register fragments)
v524 = B descriptor             (64-bit matrix descriptor)
v519 = scale factors            (A and D scale constants)
v264 = layout params            (rowcol encoding)
v516, v265 = shape params       (additional dimension info)
v540 = element type info        (integer type tag from AST)

Each constant argument is validated through sub_620FD0 (EDG) / sub_620FD0 (shared), which extracts the integer value and sets an overflow flag. On overflow:

"unexpected constant overflow in __wgmma_mma_async operand"

This check is applied 5 times: once for N dimension, once for each scale factor, and once for each negate/saturation bit.

Per-Builtin Argument Layouts

IDBuiltinOperand Chain
765 (0x2FD)_f166 chained: A, B, C, scaleA, scaleD, negate/saturation
766 (0x2FE)_bf16Separate branch (LABEL_56 path), 2-operand (no scale/negate)
767 (0x2FF)_tf32Rearranged arguments, fewer config bits
768 (0x300)_f8Simplest form, 2 matrix descriptors + config

Strategy 1: N-Dimension Dispatch (IDs 765--768, inner path)

When the element type is checked and the first argument yields an N dimension, the handler enters a 33-entry switch mapping N values to LLVM intrinsic IDs in the range 10654--10779:

NInteger-type IntrinsicFloat-type Intrinsic
81077410775
161069010691
241073410735
321074210743
401074610747
481075010751
561075410755
641075810759
721076210763
801076610767
881077010771
961077810779
1041065410655
1121065810659
1201066210663
1281066610667
1361067010671
1441067410675
1521067810679
1601068210683
1681068610687
1761069410695
1841069810699
1921070210703
2001070610707
2081071010711
2161071410715
2241071810719
2321072210723
2401072610727
2481073010731
2561073810739

The even/odd intrinsic ID pairing encodes the distinction between integer-element and float-element variants. Type discrimination uses the AST element type: if the element type is integer with width 10 (i.e., a 10-bit integer signaling bf16/tf32 internal encoding), the even (integer) intrinsic is selected; otherwise the odd (float) intrinsic.

N dimension validation:

if ((N & (N - 1)) != 0)
    error("N only supported for powers of two");

This is applied when the N value does not match any case in the 33-entry switch. The N values 8, 16, 32, 64, 128, 256 are powers of two; the intermediate values (24, 40, 48, ..., 248) are non-power-of-two multiples of 8 that are still valid WGMMA dimensions.

Strategy 2: 5-Dimensional Intrinsic Grid (IDs 753--764 path, shared)

For the full WGMMA async variants (handled through sub_12B2E10), the handler selects from a 144-entry intrinsic table spanning IDs 5304--5447, organized as a 5-dimensional grid:

DimensionValuesDescription
1. N{16, 32, 64, 128}Output column dimension
2. B_shared{false, true}Is B operand from shared memory? (sub_12A71A0 != 0)
3. is_s64{false, true}Is accumulator type s64/int? (type tag 2, subtype 10)
4. scale/negatevariesA scale nonzero? D scale nonzero?
5. variant{0x2FD, 0x2FE, 0x2FF, 0x300}Which builtin triggered entry

Base addresses and stride:

NBase IDStride per N
128530424 variants
64~532824
32~535224
16~537624
overflow~5400--5447remaining

Size-based opcode selection (for f16, ID 765):

Accumulator SizeOpcode (integer)Opcode (float)
1653325333
3253805381
6454045405
12853085309
other5356/54285357/5429

The mapping formula: base + N_offset + shared_offset + type_offset + variant_offset. The accumulator size is extracted by sub_12A71A0(expr) from the expression type chain.

WGMMA Config Bit Packing

Multiple boolean arguments are packed into a single configuration word passed to the final intrinsic call:

BitFieldSourceValue Semantics
0Accumulate / saturation flagFinal constant operand (v433)1 = accumulate into D, 0 = overwrite
1ScaleD / transpose flagv445 constant1 = transpose B descriptor
2Negate-C / layout flagv81 / v433 constant1 = negate accumulator input
3Sign bit for Bv427 constant (if present)Reserved / sign extension
4Negate-A / additional modev80 / v427 constant (if present)1 = negate A operand

Combined via: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).

After intrinsic selection, the handler:

  1. Converts the accumulator pointer to a vector pointer (.asvecptr tag)
  2. Extracts bitfield from constant operands for mode flags
  3. Calls sub_1285290 / sub_921880 with name hint "mmafrag"
  4. Scatters results via sub_94B940 / sub_1280F50 (size 4 = float elements)

WGMMA Validation Summary

All constant arguments pass through sub_620FD0, which extracts the integer value and sets an overflow flag.

CheckError MessageCondition
Constant overflow"unexpected constant overflow in __wgmma_mma_async operand"Any integer operand overflows extraction (5 occurrences)
N power-of-two"N only supported for powers of two"(N & (N - 1)) != 0 and N not in the 33-entry switch
rowcol range (fence)"'rowcol' operand can be 0 or 1 only"rowcol > 1 for load/store
rowcol range (MMA)(implicit -- validated 0--3)rowcol > 3 for MMA operations

WGMMA Support Functions

FunctionAddressEDG ParallelPurpose
sub_953BA00x953BA0sub_12B1C20Fence/commit/wait parameter lookup, builds packed 64-bit encoding
sub_9547E00x9547E0sub_12B2E10MMA async load parameter lookup, 12-entry red-black tree
sub_9543500x954350sub_12B27B0Store variant parameter lookup
sub_94B5100x94B510--Prepare fragment operand for WGMMA call
sub_94B9400x94B940sub_1280F50Scatter MMA results back to fragment outputs
sub_94B2B00x94B2B0--Extract fragment element at index (WMMA shared)
sub_12A71A00x12A71A0--Extract size/dimension from expression type (EDG-only)
sub_12A6F100x12A6F10--Validate constant integer in range (EDG-only)
sub_620FD00x620FD0--Extract constant integer with overflow detection (shared)

Packed MMA Descriptor Word

The MMA PTX string builder at sub_21E74C0 (AsmPrinter) / sub_35F_range (NVPTX backend) reads a packed 64-bit descriptor for all MMA instruction emission. The descriptor is stored at:

v22 = *(QWORD *)(*(QWORD *)(a1 + 16) + 16 * a2 + 8)
BitsFieldQuery KeyValues
[0]Row/col layout"rowcol"0=row, 1=col
[2:1]Matrix ID"mid"0=a, 1=b, 2=c, 3=d
[7:4]Binary opcode"opc"0=default, 1=.and.popc, 2=.xor.popc
[2:0]Rounding mode"rnd"0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz
[15:8]A element type"aty"Type enum 1--11
[23:16]B element type"bty"Type enum 1--11
[25:24]A layout"al"0=row, nonzero=col
[27:26]B layout"bl"0=row, nonzero=col
[28]Saturation"satf"1=.satfinite
[39:32]Shape enum"shape"0x01--0x19, 18 entries

Shape Enum

EnumShapePTX StringMin SMNotes
0x01m8n8k4"m8n8k4"SM 70Original Volta HMMA
0x02m8n8k16"m8n8k16"SM 72Integer MMA (s8/u8)
0x03m8n8k32"m8n8k32"SM 75Sub-byte (s4/u4)
0x04m8n8k64"m8n8k64"SM 75Extended sub-byte
0x05m8n8k128"m8n8k128"SM 75Binary MMA (b1)
0x06m8n32k16"m8n32k16"--Appears unused in standard paths
0x10m16n8k4"m16n8k4"SM 75Turing HMMA, f64 on Ampere
0x11m16n8k8"m16n8k8"SM 75Turing/Ampere HMMA
0x12m16n8k16"m16n8k16"SM 80Ampere HMMA (bf16, tf32)
0x13m16n8k32"m16n8k32"SM 75Ampere integer
0x14m16n8k64"m16n8k64"SM 75Sub-byte integer
0x15m16n8k128"m16n8k128"SM 75Extended sub-byte
0x16m16n8k256"m16n8k256"SM 75Binary/sub-byte (largest)
0x17m16n16k16"m16n16k16"SM 90Square shape, Hopper+
0x18m32n8k16"m32n8k16"SM 80Tall shape
0x19m16n16k8"m16n16k8"SM 70WMMA f16 path

Unknown shape codes hit the default branch and abort via BUG(). String emission uses fast-path integer stores: *(QWORD *)ptr = 0x36316B386E36316DLL emits "m16n8k16" as a single 8-byte write.

Type Enum

EnumTypeBitsPTX String
1b11"b1"
2s44"s4"
3u44"u4"
4s88"s8"
5u88"u8"
6f1616"f16"
7bf1616"bf16"
8tf3219"tf32"
9f6464"f64"
10f3232"f32"
11s3232"s32"

Any other type code produces fatal error: "Wrong MMA element type".

Shape x Type x Architecture Summary

ShapeA/B TypesAcc TypesMin SMNotes
m8n8k4f16f16, f32SM 70Original Volta
m16n8k4f64f64SM 80Ampere f64
m16n8k8f16f16, f32SM 75Turing+
m16n8k16f16, bf16, tf32f16, f32SM 80Ampere+
m16n16k8f16f16, f32SM 70WMMA path
m16n16k16f16, bf16f16, f32SM 90Hopper+
m32n8k16f16, bf16f16, f32SM 80Tall shape
m8n8k16s8, u8s32SM 72Integer MMA
m16n8k16s8, u8s32SM 75Turing+
m16n8k32s8, u8s32SM 75Turing+
m8n8k32s4, u4s32SM 75Sub-byte
m16n8k64s4, u4s32SM 75Sub-byte
m8n8k64s4, u4s32SM 75Extended sub-byte
m16n8k128s4, u4s32SM 75Extended sub-byte
m8n8k128b1s32SM 75Binary (.and.popc, .xor.popc)
m16n8k256b1s32SM 75Binary extended
WGMMA (N=8..256)f16, bf16, tf32, f8f16, f32SM 90Warp-group, 33 N values
tcgen05 (10 variants)mxf8f6f4, mxf4, mxf4nvf4, f16, bf16, tf32, i8, fp4variesSM 100See mma-codegen

tcgen05 Blackwell Overview (SM 100+)

Full tcgen05 documentation lives in Tensor / MMA Codegen. Key points summarized here for cross-reference:

Data type kinds (bits [8:6] of the tcgen05 operand, emitted by sub_35F3330):

ValueKindNotes
0mxf4nvf4MX FP4 with NV FP4
1f8f6f4FP8/FP6/FP4 standard
2mxf8f6f4MX variant of f8f6f4
3f16Half precision
4i88-bit integer (arch-conditional only)
5tf32TensorFloat-32
7mxf4MX FP4

Modifier fields:

ModifierBitsDescription
Weight stationary (.ws)bit 0NOT compatible with cta_group::2, mxf8f6f4, fp4
CTA groupbit 1cta_group::1 (clear) or cta_group::2 (set)
Scale vector size[3:2].scale_vec::1X/2X/4X with per-type constraints
Scale input accumulatorbit 4f16/tf32 only; NOT on sm_100a/sm_103a
Sparsitybit 5MXF4/MXF4NVF4 restricted to arch-conditional
Block scale alias[10:9].block16 (0) or .block32 (1)

Collector modes (emitted by sub_35F38B0):

ValueModifierConstraint
1.collector::a::lastuse--
2.collector::a::fillCannot combine with .ashift
3.collector::a::useCannot combine with .ashift

tcgen05 scaled MMA operand builder (sub_21E8CD0 / sub_35F3E90):

BitQueryClearSet
0"scaleD""0""1"
1"negA""1" (no negate)"-1" (negate)
2"negB""1""-1"
3"transA""0""1"
4"transB""0""1"

Note the asymmetry: scaleD/transA/transB emit boolean "0"/"1" strings, while negA/negB emit sign multiplier "1"/"-1" strings. This reflects the PTX encoding where negation is a multiplication factor and transpose is a boolean flag.

LLVM Intrinsic Reference

Intrinsic IDNameUsage
9062llvm.nvvm.wgmma.fence.alignedWGMMA fence (3 type overloads)
9067llvm.nvvm.wgmma.mma.asyncWGMMA MMA async load (2 type overloads)
9145llvm.nvvm.wgmma.storeWGMMA store (2 type overloads)
10654--10779llvm.nvvm.wgmma.mma.async.*Per-N-dimension variants (126 entries, even=int, odd=float)
5304--5447(WGMMA 5-D grid)Per-N x shared x type x scale x variant (144 entries)
4905--4940(tcgen05 ISD opcodes)tcgen05.mma variants (36 opcodes via 10-way shape switch)

NVPTX Backend Duplicate Functions

All MMA emission functions exist in two structurally identical copies:

AsmPrinter (0x21Dxxxx)NVPTX Backend (0x36Exxxx)Function
sub_21DFBF0sub_36E91F0hmmastc (HMMA store C)
sub_21E0360sub_36E72A0hmmaldab (HMMA load A/B)
sub_21E0630sub_36E7580hmmaldc (HMMA load C)
sub_21E0870sub_36E77C0hmmamma (HMMA MMA)
sub_21E1280sub_36E7B50immaldab (IMMA load A/B)
sub_21E15D0sub_36E7EA0immaldc (IMMA load C)
sub_21E1830sub_36E8110immastc (IMMA store C)
sub_21E1D20sub_36E8630immamma (IMMA MMA)
sub_21E2280sub_36E8BD0bmmamma (Binary MMA)
sub_21E8CD0sub_35F3E90tcgen05 scaled MMA operands

The pairs differ only in error reporting (sub_16BD130 vs sub_C64ED0) and reference counting functions (sub_1623A60/sub_161E7C0 vs sub_B96E90/sub_B91220).

Cross-References