
TCGen05 -- 5th Generation Tensor Cores

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

TCGen05 is the Blackwell-generation tensor core instruction family introduced with SM 100. It replaces Hopper's WGMMA with a descriptor-based programming model that operates on Tensor Memory (TMEM) -- a dedicated register-file-like storage visible only to the tensor core. ptxas implements TCGen05 as 13 PTX instruction mnemonics (plus 8 debug guardrails), backed by a 90KB MMA codegen function, 11 SASS opcode groups (28 encoding variants), and a set of compiler-inserted validation hooks. TCGen05 is absent on sm_120/sm_121 (consumer Blackwell).

| Property | Value |
|---|---|
| Target architectures | sm_100, sm_100a, sm_100f, sm_103, sm_103a, sm_103f, sm_110, sm_110a, sm_110f |
| NOT available | sm_120, sm_121 (consumer/DGX Spark) -- gated by SM version checks |
| Capability check | sub_70FA00(*, 29) -- returns true for tcgen05-capable targets |
| PTX instructions | 13: alloc, dealloc, relinquish_alloc_permit, ld, ld.red, st, commit, cp, shift, fence, wait, mma, mma.ws |
| Guardrail instructions | 8: is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage |
| SASS opcode range | Opcodes 122--139 (TMEM operations), 213--221 (TCGEN05_MMA/FENCE, TMEM extended), 342--372 (TCGEN05 control) |
| Codegen factory | 36864 (9 << 12) -- shared across all Blackwell targets |
| MMA codegen | sub_5BBC30 (90KB) |
| PTX validator | sub_4C5FB0 (28KB -- shared MMA/WMMA/tcgen05 validator) |
| Intrinsic handler | sub_6D7AF0 (19KB -- TCGen05 MMA handler) |
| Intrinsic validator | sub_6D69B0 (12KB -- TCGen05 MMA validator) |
| EIATTR markers | EIATTR_TCGEN05_1CTA_USED, EIATTR_TCGEN05_2CTA_USED |
| Version constraint | Objects using tcgen05 from CUDA 12.x cannot link with 13.0+; must rebuild |

Architecture Overview

Descriptor-Based Model

TCGen05 abandons the register-operand model of previous tensor core generations (WMMA, HMMA, WGMMA) in favor of descriptors. The instruction descriptor (idesc) encodes the matrix operation configuration -- dimensions, data types, data path width, sparsity, and layout. The descriptor is passed as an operand to tcgen05.mma rather than encoded in the instruction mnemonic.

This design decouples the instruction encoding from the operation specification. Where WGMMA required hundreds of distinct intrinsic hash entries to cover every shape/type/layout combination, tcgen05 uses a single instruction with different descriptor values. The ~400 numeric MMA hash entries in the intrinsic dispatch table (at a1+816 in sub_5D4190) map WGMMA variants; tcgen05 replaces that complexity with descriptor-driven dispatch.
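As a hedged illustration (register and operand names here are invented; the instruction form follows the public PTX ISA, not the binary), a single mnemonic plus a descriptor replaces the per-shape intrinsic explosion:

```ptx
// Sketch: the shape/type/layout configuration lives in %idesc, not in
// the mnemonic. %adesc/%bdesc are shared-memory matrix descriptors,
// [%tmem_d] is the TMEM accumulator, and predicate p enables
// accumulator input.
tcgen05.mma.cta_group::1.kind::f16 [%tmem_d], %adesc, %bdesc, %idesc, p;
```

Changing the shape or data type changes only the descriptor values, so one codegen path (sub_5BBC30) serves every configuration.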

Tensor Memory (TMEM)

TMEM is a dedicated storage region private to the tensor core unit. It is not part of the general register file and is not directly addressable by non-tensor-core instructions. TMEM is organized into columns that are allocated, used, and deallocated explicitly by the programmer.

Key properties from binary analysis:

  • Column-based allocation: tcgen05.alloc reserves columns; tcgen05.dealloc releases them
  • Two CTA granularities: Operations execute at .cta_group::1 (single CTA) or .cta_group::2 (CTA pair) granularity. A function cannot mix both -- ptxas enforces: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
  • Allocation tracking: The compiler inserts reserved shared memory variables to track allocation state:
    • __nv_reservedSMEM_tcgen05_partition -- partition identifier
    • __nv_reservedSMEM_allocation_phase -- current allocation phase
    • __nv_reservedSMEM_allocation_mask -- bitmask of allocated columns
    • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier -- mbarrier for allocation pipeline
    • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity -- parity tracking

TMEM Address Computation

Tensor memory addresses are computed through a standardized pattern visible in the TMEM address generator functions (sub_70E740, sub_70E940, sub_70EB00):

```ptx
cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s;
add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s;
```

Five named TMEM address roles exist for MMA operations:

| Address role | Intrinsic name | Purpose |
|---|---|---|
| D (destination) | __cuda_sm10x_tcgen05_mma_tmemD | Accumulator/output matrix |
| A (input) | __cuda_sm10x_tcgen05_mma_tmemA | Left input matrix |
| Scale A | __cuda_sm10x_tcgen05_mma_scaleTmemA | Scale factors for A |
| Scale B | __cuda_sm10x_tcgen05_mma_scaleTmemB | Scale factors for B |
| Sparse Meta | __cuda_sm10x_tcgen05_mma_spMetaTmem | Sparsity metadata |

Constraint from the binary: "URa must be uint32 when URa is TMEM" -- uniform registers addressing TMEM must use 32-bit unsigned integers. When addressing a global descriptor: "URa must be uint64 when URa is GDESC".

PTX Instruction Set

Lifecycle Instructions

| PTX instruction | Formatter address | Size | Purpose |
|---|---|---|---|
| tcgen05.alloc | 0x526370 | 1,287 B | Allocate TMEM columns for tensor core use |
| tcgen05.dealloc | 0x574050 | 2,130 B | Release allocated TMEM columns |
| tcgen05.relinquish_alloc_permit | 0x58C7F0 | 4,282 B | Relinquish allocation permit (multi-CTA coordination) |

The alloc instruction has two CTA-granularity variants visible in the prototype strings:

  • __cuda_sm10x_tcgen05_alloc_one_sm -- single-SM allocation (.cta_group::1)
  • __cuda_sm10x_tcgen05_alloc_two_sm -- two-SM allocation (.cta_group::2)

Both take a destination pointer argument (__cuda_sm10x_tc_alloc_dst_ptr_arg) and a column count (__cuda_sm10x_tc_alloc_num_cols_arg).
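A minimal lifecycle sketch, assuming the public PTX ISA forms (the buffer names and the 32-column count are illustrative, not taken from the binary):

```ptx
// One warp allocates 32 TMEM columns; the base address is written to
// shared memory so other warps can read it.
tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [sm_dst], 32;
ld.shared.b32 %taddr, [sm_dst];
// ... MMA work using %taddr ...
tcgen05.dealloc.cta_group::1.sync.aligned.b32 %taddr, 32;
// Tell the hardware this CTA will not allocate again.
tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;
```

The shared-memory round trip is what the `__nv_reservedSMEM_*` tracking variables above instrument in debug builds.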

Data Movement Instructions

| PTX instruction | Formatter address | Size | Purpose |
|---|---|---|---|
| tcgen05.ld | 0x578DB0 | 2,466 B | Load data from TMEM into registers |
| tcgen05.ld.red | 0x571FE0 | 2,066 B | Load with reduction |
| tcgen05.st | 0x56C190 | 1,842 B | Store register data into TMEM |
| tcgen05.cp | 0x4F1A90 | 903 B | Copy data from shared memory into TMEM |

Three intrinsic helper arrays support the ld/st/ld.red operations:

| Helper | Purpose |
|---|---|
| __cuda_sm_100_tcgen05_ld_funcRetArr | Return array descriptor for loads |
| __cuda_sm_100_tcgen05_ld_red_funcRetArr | Return array descriptor for load-reduce |
| __cuda_sm_100_tcgen05_st_funcInputArr | Input array descriptor for stores |

Each has a corresponding immhalfSplitOff parameter controlling split behavior:

  • __cuda_sm_100_tcgen05_ld_immhalfSplitOff
  • __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff
  • __cuda_sm_100_tcgen05_st_immhalfSplitOff
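A hedged ld/st sketch (public PTX ISA forms; the `.32x32b.x1` shape and register names are illustrative):

```ptx
// Asynchronous load from TMEM into a register; the result is not
// readable until the matching wait::ld completes.
tcgen05.ld.sync.aligned.32x32b.x1.b32 {%r0}, [%taddr];
tcgen05.wait::ld.sync.aligned;
// Store a register back into TMEM, then wait for visibility.
tcgen05.st.sync.aligned.32x32b.x1.b32 [%taddr], {%r0};
tcgen05.wait::st.sync.aligned;
```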

Synchronization Instructions

| PTX instruction | Formatter address | Size | Purpose |
|---|---|---|---|
| tcgen05.commit | 0x5427F0 | 1,575 B | Commit pending tensor core operations |
| tcgen05.fence | (inline) | -- | Fence preventing reordering of tcgen05 operations |
| tcgen05.wait | (inline) | -- | Wait for committed tcgen05 operations to complete |
| tcgen05.shift | 0x58FA20 | 4,604 B | Shift accumulator data within TMEM (shared formatter with mma) |
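A commit/fence sketch under the same assumptions (public PTX forms; `mbar` is an illustrative shared-memory mbarrier):

```ptx
// Order previously issued tcgen05 ops, then make the outstanding
// async MMA group observable through an mbarrier.
tcgen05.fence::before_thread_sync;
tcgen05.commit.cta_group::1.mbarrier::arrive::one.b64 [mbar];
// Consumers then poll [mbar] (mbarrier.try_wait) before reading D.
```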

Compute Instructions

| PTX instruction | Formatter address | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | 0x5BBC30 (codegen) / 0x58FA20 (formatter) | 90KB / 4,604 B | Matrix multiply-accumulate |
| tcgen05.mma.ws | 0x4DA720 (formatter) | 343 B | Warp-specialized MMA variant |

TCGen05.MMA -- Matrix Multiply-Accumulate

Codegen Function: sub_5BBC30 (90KB)

This is the largest per-instruction codegen function in the TCGen05 family, registered as the "tcgen05.mma" handler in sub_5D4190 (the intrinsic dispatch table builder). The function:

  1. Allocates a 50,000-byte working buffer
  2. Queries sub_70FA00(*, 29) to validate tcgen05 capability on the current target
  3. Processes the instruction descriptor to determine operation parameters
  4. Generates tensor memory addressing code for all operands (D, A, scaleA, scaleB, sparsity meta)
  5. Emits the final MMA instruction encoding

MMA Modifiers

The binary reveals a rich set of MMA modifiers extracted by functions in the sub_70D1F0--sub_70D410 cluster:

| Modifier | String | Purpose |
|---|---|---|
| .o128 | ".o128" | 128-bit output size |
| .transA | ".transA" | Transpose A matrix |
| .transB | ".transB" | Transpose B matrix |
| .negA | "_negA" | Negate A matrix |
| .negB | "_negB" | Negate B matrix |
| _expand16bit | "_expand16bit" | 16-bit expansion mode |
| _pack16bit | "_pack16bit" | 16-bit packing mode |
| _maxabs | "_maxabs" | Maximum absolute value reduction |
| _minabs | "_minabs" | Minimum absolute value reduction |
| _fused | "_fused" | Fused operation mode |
| _blockscale | "_blockscale" | Block scaling (MX format support) |
| _ashift | "_ashift" | A-matrix shift |
| _areuse | "_areuse" | A-matrix register reuse |
| _akeep | "_akeep" | A-matrix keep (preserve for reuse) |

Data Path Configurations

The MMA data path width determines the number of elements processed per cycle and the accumulator layout. Six configurations exist (the _16dp32bit path contributes two sub-variants):

| Data path string | Interpretation |
|---|---|
| _4dp256bit | 4 data paths, 256 bits each |
| _16dp32bit | 16 data paths, 32 bits each (two sub-variants: t0t15, t16t31) |
| _32dp32bit | 32 data paths, 32 bits each |
| _16dp256bit | 16 data paths, 256 bits each |
| _128dp256bit | 128 data paths, 256 bits each |

Constraint: "fused and l16dp32bit must be specified together" -- the fused mode requires the 16dp32bit data path.

Block Scaling (MX Format)

TCGen05 adds native block scaling support for microscaling (MX) format operations, visible through the tcmma prefix strings:

  • "tcmma_*_o must be specified with blockscale" -- output modifier requires blockscale
  • "uri width for tcmma_*_o must be 2" -- output uniform register index width must be 2
  • "tcmma_*_q with blockscale must have uri width of 2" -- quantization with blockscale
  • "tcmma_*_mxq must be specified with blockscale" -- MX quantization requires blockscale

Warp-Specialized MMA (.ws)

The .ws modifier enables warp-specialized execution where different warps in a warpgroup contribute to different phases of the MMA pipeline. Constraints from the binary:

  • "When using buffer1-3, WS modifier must be specified" -- triple buffering requires .ws
  • "ws opcode modifier not allowed with .2CTA" -- warp specialization is single-CTA only
  • "ws opcode modifier not allowed with areuse or akeep" -- .ws incompatible with A-matrix reuse
  • "ws opcode modifier not allowed with ashift" -- .ws incompatible with A-matrix shift

Triple-buffer register reuse strings for .ws mode:

| Buffer variant | Meaning |
|---|---|
| _breuse_bkeep_buffer1 | B-reuse + B-keep, buffer 1 |
| _breuse_buffer1 | B-reuse, buffer 1 |
| _breuse_bkeep_buffer2 | B-reuse + B-keep, buffer 2 |
| _breuse_buffer2 | B-reuse, buffer 2 |
| _breuse_bkeep_buffer3 | B-reuse + B-keep, buffer 3 |
| _breuse_buffer3 | B-reuse, buffer 3 |

Sparsity Support

TCGen05 supports structured sparsity through the sparsity metadata TMEM address (spMetaTmem). The _ashift modifier is constrained: "Ashift can only be specific when URa is in TMEM".
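A sparse-MMA sketch (public PTX form; names illustrative), showing the metadata operand that maps to spMetaTmem:

```ptx
// The extra TMEM operand carries structured-sparsity metadata for A.
tcgen05.mma.sp.cta_group::1.kind::f16
    [%tmem_d], %adesc, %bdesc, [%tmem_spmeta], %idesc, p;
```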

SASS Encoding

Opcode Map

TCGen05 SASS instructions span three opcode regions in the SM 100 SASS ISA. The encoding information comes from the latency model tables (sub_8E8A90 for sm_100) and the master instruction encoder (sub_6D9690, 94KB).

TMEM Operations (Opcodes 122--139)

| Opcode | Variants | Category | Encoding class | Operands |
|---|---|---|---|---|
| 122 | 2 | TMEM_OP / new ISA | F1F08, F1C60 | 3-op, reg10 |
| 123 | 6 | TMEM_LD (tensor mem load) | F1F08, F1DF8 | 2--3 op |
| 125 | 6 | TMEM_ST (tensor mem store) | F1F08, F1DF8 | 2--3 op |
| 127 | 9 | TMEM_ALLOC / FENCE | F1F08..F29A8 | 3--6 op |
| 129 | 3 | TMEM extended | F1F08 | 2 op |
| 130 | 26 | EXTENDED_MOV / TMEM_MVA | F1F08..F2678 | 2--9 op |
| 131 | 3 | EXTENDED_ALU / UTMA | F21B0 | 4--5 op |
| 133 | 1 | UTMA variant | F21B0 | 4 op |
| 139 | 4 | TCGEN05 operations | F21B0, F2568 | 4--8 op |

TCGEN05 MMA/FENCE (Opcodes 213--221)

| Opcode | Variants | Category | Encoding class | Operands |
|---|---|---|---|---|
| 213 | 6 | TCGEN05_MMA | F2678 | 5--7 op |
| 216 | 2 | TCGEN05_FENCE | F2678 | 3--4 op |
| 219 | 6 | TMEM_LD extended | F1C60..F2810 | 3--7 op |
| 220 | 1 | TMEM_ST extended | F1C60 | 3 op |
| 221 | 1 | TMEM_PREFETCH | F1C60 | 3 op |
| 255 | 1 | SETSTMEMADDR | F1F08 | 1 op |
| 269 | 4 | TMEM_ALLOC_FENCE ext | F2018, F1DF8 | 2--3 op |

TCGEN05 Control (Opcodes 342--372)

28 encoding variants across 10 opcodes. These are the primary tensor core pipeline control instructions:

| Opcode | Variants | Category | Encoding class | Operands |
|---|---|---|---|---|
| 342 | 1 | TCGEN05 ctrl A | F1F08 | 0 op (scheduling marker) |
| 343 | 1 | TCGEN05 ctrl B | F1F08 | 0 op (scheduling marker) |
| 344 | 14 | TCGEN05 execute | F1F08..F3008 | 2--7 op |
| 346 | 4 | TCGEN05 commit | F1F08, F2018 | 2--3 op |
| 349 | 1 | TCGEN05 sync | F1D70 | 0 op |
| 359 | 3 | TCGEN05 alloc | F1D70, F1F08 | 0--2 op |
| 369 | 1 | TCGEN05 dealloc | F1F08 | 0 op |
| 370 | 1 | TCGEN05 release A | F1D70 | 0 op |
| 371 | 1 | TCGEN05 release B | F1D70 | 0 op |
| 372 | 1 | TCGEN05 release C | F1D70 | 0 op |

Opcode 344 (TCGEN05 execute) has the most variants (14), spanning encoding classes from F1F08 to F3008 with 2 to 7 operands. This is the actual MMA dispatch instruction -- the wide encoding range reflects the variety of descriptor configurations, operand modes, and data path widths.

Encoding Class Distribution

The encoding classes used by TCGen05 SASS instructions:

| Class | Role | Usage |
|---|---|---|
| F1D70 | Control/sync | alloc (0-op), sync, release A/B/C |
| F1F08 | General | ctrl markers, execute, commit, alloc, dealloc, TMEM ops |
| F1C60 | Extended | TMEM_LD/ST extended, TMEM_PREFETCH |
| F1DF8 | Standard | TMEM_LD/ST, TMEM_ALLOC_FENCE ext |
| F2018 | Commit ext | TCGEN05 commit, TMEM_ALLOC_FENCE ext |
| F21B0 | ALU | TCGEN05 operations, UTMA |
| F2568 | TCGEN05 ops | TCGEN05 operations |
| F2678 | MMA/FENCE | TCGEN05_MMA, TCGEN05_FENCE |
| F29A8 | TMEM_ALLOC | TMEM_ALLOC/FENCE |
| F2810 | Extended | TMEM_LD extended |
| F3008 | Execute max | TCGEN05 execute (high-operand-count) |

Latency Model

The sm_100 latency table (sub_8E8A90) uses a two-part structure: a 3.0KB base table covering standard instructions and a 949-byte supplement dedicated to TCGEN05 operations. The sm_120 consumer Blackwell table (sub_8E9000 + sub_8E92E0, 5.5KB) is the largest individual table and does not include TCGEN05 entries (confirming the feature's absence on consumer silicon).

CTA Granularity

TCGen05 operations specify whether they execute at single-CTA or CTA-pair granularity through the .cta_group modifier:

| Granularity | Modifier | EIATTR | ELF marker |
|---|---|---|---|
| Single CTA | .cta_group::1 | EIATTR_TCGEN05_1CTA_USED | TC_1CTA |
| CTA pair | .cta_group::2 | EIATTR_TCGEN05_2CTA_USED | TC_2CTA |

The compiler emits the appropriate EIATTR marker into the output cubin based on which granularity the kernel uses. The CUDA runtime uses this to configure the CTA launch parameters.

The binary enforces exclusivity: a single function cannot mix .cta_group::1 and .cta_group::2 operations. The error message is explicit: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
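A sketch of code that trips this check (illustrative operands; the instruction forms follow the public PTX ISA):

```ptx
// Rejected: the same function uses both granularities.
tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [sm_dst], 32;
tcgen05.mma.cta_group::2.kind::f16 [%tmem_d], %adesc, %bdesc, %idesc, p;
```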

ELF/Cubin Markers

EIATTR Entries

| EIATTR name | Purpose |
|---|---|
| EIATTR_TCGEN05_1CTA_USED | Kernel uses tcgen05 at single-CTA granularity |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses tcgen05 at CTA-pair granularity |

EICOMPAT Attributes

| EICOMPAT name | Purpose |
|---|---|
| EICOMPAT_ATTR_INST_TCGEN05_MMA | Kernel uses tcgen05.mma instructions |
| EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED | Kernel uses deprecated (12.x-era) tcgen05.mma encoding |

Entry Fragment Markers

TMEM usage per-CTA is recorded in entry fragment markers:

| Marker | Version | Purpose |
|---|---|---|
| AT_ENTRY_FRAGMENT_TMEM_CTA1 | V1 | TMEM usage for single-CTA kernels |
| AT_ENTRY_FRAGMENT_TMEM_CTA2 | V1 | TMEM usage for CTA-pair kernels |
| AT_ENTRY_FRAGMENT_TMEM_CTA1_V2 | V2 | TMEM usage V2 format, single-CTA |
| AT_ENTRY_FRAGMENT_TMEM_CTA2_V2 | V2 | TMEM usage V2 format, CTA-pair |

Guardrail Debug Instrumentation

When compiling with -g (debug mode), ptxas inserts runtime validation checks around tcgen05 operations. These are controlled by the --g-tensor-memory-access-check / --gno-tensor-memory-access-check CLI options.

Guardrail Check Functions

Eight _tcgen05.guardrails.* pseudo-instructions insert inline validation code:

| Guardrail | Formatter address | Size | Validates |
|---|---|---|---|
| is_phase_valid | 0x4DDE70 | 775 B | Allocation phase is correct for the operation |
| are_columns_allocated | 0x4DBF20 | 599 B | Accessed columns are currently allocated |
| is_current_warp_valid_owner | 0x4DE180 | 791 B | Current warp owns the accessed TMEM region |
| in_physical_bounds | 0x4DB050 | 439 B | Column access is within physical TMEM bounds |
| allocation_granularity | 0x4F0960 | 839 B | Column count meets granularity requirements |
| datapath_alignment | 0x4DD580 | 735 B | TMEM address is aligned for the data path width |
| sp_consistency_across_idesc_mod | 0x500FA0 | 970 B | Sparsity settings in descriptor match modifier |
| check_sparse_usage | 0x4DDB80 | 743 B | Sparse mode usage is valid for the environment |

Guardrail Trap Functions

When a guardrail check fails, it calls a trap function that reports the violation and terminates:

| Trap intrinsic | Parameters |
|---|---|
| __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc | (.reg .b32 phase) |
| __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid | (.reg .b32 tmem_start_lane_accessed, .reg .b32 cur_warp_id, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access | (.reg .b32 col_no_accessed, .reg .b32 alloced_mask, .reg .b32 instr_kind) |
| __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced | (.reg .b32 col_no_being_dealloced, .reg .b32 alloced_mask) |
| __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | (.reg .b32 col_no_being_dealloced_not_returned_by_alloc, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid | (.reg .b32 nCols) |
| __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds | (.reg .b32 oob_access_col_no, .reg .b32 instr_kind) |
| __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment | (.reg .b32 dp_lane, .reg .b32 matrix_kind, .reg .b32 valid_alignment_kind, ...) |
| __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | (.reg .b32 idesc_sp_enabled, .reg .b32 mod_sp_enabled) |
| __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env | (.reg .b32 idesc_sp_enabled, .reg .b32 idesc, .reg .b32 mma_kind, .reg .b32 ptx_target, .reg .b32 is_family_portable) |

These are intrinsic IDs 0x20--0x2A (11 entries total including a mask creation helper) in the intrinsic table.

Guardrail Check Wrappers

The compiler also generates .FORCE_INLINE wrapper functions that combine multiple checks:

| Wrapper | Parameters |
|---|---|
| __cuda_sm10x_tcgen05_guardrails_check_phase_validity | (.reg .u32 dummyInp) |
| __cuda_sm10x_tcgen05_guardrails_check_column_allocation | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
| __cuda_sm10x_tcgen05_guardrails_check_datapath_validity | (.reg .u32 tmem_addr, .reg .u32 ld_or_st) |
| __cuda_sm10x_tcgen05_guardrails_check_physical_bounds | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
| __cuda_sm10x_tcgen05_guardrails_check_allocation_granularity | (.reg .u32 num_of_cols) |
| __cuda_sm10x_tcgen05_guardrails_check_datapath_alignment | (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group, ...) |

Bulk Copy Operations (cp.async.bulk.tensor)

TCGen05 is complemented by asynchronous bulk copy operations for loading data into tensor memory. These are registered as separate intrinsic IDs (0x2B--0x3C, 18 entries) under the __cuda_sm1xx_* naming convention:

| Operation | Codegen handler | Size |
|---|---|---|
| cp.async.bulk.tensor (1D--5D, tile/im2col, unicast/multicast) | sub_5AB460 | 45KB |
| cp.async.bulk | sub_593210 | -- |
| cp.async.mbarrier.arrive | sub_4DC180 | -- |

The cp.async.bulk.tensor handler is 45KB and covers all dimensionality variants (1D through 5D), both tile and im2col access patterns, and unicast/multicast delivery modes.
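A representative 2-D bulk tensor copy, sketched from the public PTX ISA (coordinate and buffer names are illustrative):

```ptx
// TMA-style tile load: global -> shared, completion counted in bytes
// on an mbarrier rather than through tcgen05.commit.
cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
    [smem_buf], [tmap, {%x0, %y0}], [mbar];
```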

SM Availability Gating

Capability Check

TCGen05 availability is gated by sub_70FA00(*, 29), which checks the target SM version. The check returns true for sm_100, sm_103, and sm_110 (and their a/f sub-variants) and false for sm_120/sm_121.

OCG Builtin Names

The OCG (Optimized Code Generation) layer uses short mnemonic names for tcgen05 operations visible in the builtin name lookup (sub_6C9EB0):

| OCG name | Full operation |
|---|---|
| tcmma | tcgen05.mma core multiply-accumulate |
| tcshift | tcgen05.shift accumulator data shift |
| gdesc | Global descriptor operations |
| memclear | Tensor memory clear |
| sparsify | Sparsity pattern application |

The .tcgen05op string identifies an Ori IR instruction as belonging to the tcgen05 family during the optimizer pipeline.

Version Compatibility

CUDA 12.x to 13.0 Breaking Change

ptxas v13.0.88 includes a linker-level version check for tcgen05 objects:

"Object '%s' cannot be linked due to version mismatch. Objects using tcgen05 in 12.x cannot be linked with 13.0 or later, they must be rebuilt with latest compiler"

The EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED attribute tags objects compiled with the 12.x-era tcgen05 encoding, which is binary-incompatible with the 13.0 encoding. The SASS instruction encoding for tcgen05.mma changed between CUDA 12.x and 13.0 -- objects must be recompiled.

SM 100 vs SM 103 Differences

Both sm_100 and sm_103 share the same tcgen05 instruction set and codegen factory (36864). They share all 7 dispatch-table handler functions. The differences between sm_100 and sm_103 are:

  • Different Handler A and Handler B capability accessor functions (sm_100: sub_609C30/sub_609BD0; sm_103: sub_608F20/sub_609D20)
  • Different intrinsic table initializers (sm_100: sub_60A910; sm_103: sub_60A700)
  • sm_103 may expose additional capability flags for GB300-specific features

Both targets produce identical SASS for tcgen05 instructions. The f sub-variants (sm_100f, sm_103f) allow cross-compilation within the family: sm_100f code can run on sm_103 hardware.

Compiler Pipeline

PTX Parsing and Validation

  1. Lexer (sub_720F00, 64KB): Recognizes tcgen05.* tokens during lexical analysis
  2. Validator (sub_4C5FB0, 28KB): Shared MMA/WMMA/tcgen05 validation function. Checks instruction legality for the current SM target, validates operand types, descriptor fields, and modifier combinations
  3. Instruction table (sub_46E000, 93KB): Registers tcgen05 instruction variants with their type combinations (e.g., .tcgen05op)

Intrinsic Dispatch

The intrinsic dispatch table builder (sub_5D4190, 41KB) registers tcgen05 handlers:

| Registration | PTX instruction | Handler | Size |
|---|---|---|---|
| Line 112 | tcgen05.mma | sub_5BBC30 | 90KB |
| Lifecycle | tcgen05.alloc | sub_569180 | -- |
| Lifecycle | tcgen05.relinquish_alloc_permit | sub_526370 | -- |
| Lifecycle | tcgen05.dealloc | sub_58C7F0 | -- |
| Data | tcgen05.ld | sub_574050 | -- |
| Data | tcgen05.ld.red | sub_578DB0 | -- |
| Data | tcgen05.st | sub_571FE0 | -- |
| Sync | tcgen05.commit | sub_56C190 | -- |
| Copy | tcgen05.cp | sub_5427F0 | -- |
| Compute | tcgen05.shift | sub_4F1A90 | -- |
| Compute | tcgen05.mma.ws | sub_58FA20 | -- |

Intrinsic Lowering

The TCGen05 MMA handler (sub_6D7AF0, 19KB) and validator (sub_6D69B0, 12KB) in the encoding zone handle the lowering from abstract intrinsic operations to concrete SASS encoding. The handler checks modifier consistency:

  • "fused and l16dp32bit must be specified together"
  • "Inputs vector length is inconsistent with layout and num modifiers"

TMEM Address Generation

The TMEM address generator cluster (sub_70E740, sub_70E940, sub_70EB00) generates PTX parameter passing code for tensor memory addresses:

```ptx
st.param.b32 [%s + %d], %s;
ld.param.b32 %s, [%s + %d];
```

SASS Encoding

The master instruction encoder (sub_6D9690, 94KB) handles the final binary encoding. TCGen05 instructions use the Mercury encoding pipeline (encoder factory 36864) with Blackwell-specific opcode tables.

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_4C5FB0 | 28KB | PTX validator (MMA/WMMA/tcgen05 shared) | HIGH |
| sub_4DA720 | 343 B | tcgen05.mma.ws formatter | HIGH |
| sub_4DB050 | 439 B | guardrails.in_physical_bounds formatter | HIGH |
| sub_4DBF20 | 599 B | guardrails.are_columns_allocated formatter | HIGH |
| sub_4DD580 | 735 B | guardrails.datapath_alignment formatter | HIGH |
| sub_4DDB80 | 743 B | guardrails.check_sparse_usage formatter | HIGH |
| sub_4DDE70 | 775 B | guardrails.is_phase_valid formatter | HIGH |
| sub_4DE180 | 791 B | guardrails.is_current_warp_valid_owner formatter | HIGH |
| sub_4F0960 | 839 B | guardrails.allocation_granularity formatter | HIGH |
| sub_4F1A90 | 903 B | tcgen05.shift / tcgen05.cp formatter | HIGH |
| sub_500FA0 | 970 B | guardrails.sp_consistency_across_idesc_mod formatter | HIGH |
| sub_526370 | 1,287 B | tcgen05.alloc / tcgen05.relinquish_alloc_permit formatter | HIGH |
| sub_5427F0 | 1,575 B | tcgen05.commit formatter | HIGH |
| sub_569180 | -- | tcgen05.alloc codegen handler | HIGH |
| sub_56C190 | 1,842 B | tcgen05.st formatter | HIGH |
| sub_571FE0 | 2,066 B | tcgen05.ld.red formatter | HIGH |
| sub_574050 | 2,130 B | tcgen05.dealloc formatter | HIGH |
| sub_578DB0 | 2,466 B | tcgen05.ld formatter | HIGH |
| sub_58C7F0 | 4,282 B | tcgen05.relinquish_alloc_permit / tcgen05.dealloc formatter | HIGH |
| sub_58FA20 | 4,604 B | tcgen05.shift + tcgen05.mma formatter | HIGH |
| sub_593210 | -- | cp.async.bulk codegen | HIGH |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D) | HIGH |
| sub_5BBC30 | 90KB | tcgen05.mma codegen (main) | HIGH |
| sub_6D69B0 | 12KB | TCGen05 MMA validator (encoding zone) | MED |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (encoding zone) | HIGH |
| sub_70BC30 | -- | TCGen05 parameter helper | MED |
| sub_70BCC0 | -- | TCGen05 parameter helper | MED |
| sub_70DEF0 | -- | TCGen05 parameter helper | MED |
| sub_70E0E0 | -- | SM100 guardrail bounds-check code generator | MED |
| sub_70E740 | -- | TMEM address generator (tmemD) | MED |
| sub_70E940 | -- | TMEM address generator (tmemA) | MED |
| sub_70EB00 | -- | TMEM address generator (scaleTmemA/B, spMetaTmem) | MED |
| sub_70FA00 | -- | Instruction capability checker (29 = tcgen05) | HIGH |
| sub_8E8A90 | 3.0KB + 949 B | SM 100 latency table (base + TCGEN05 supplement) | HIGH |

Cross-References