Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Blackwell Datacenter (sm_100, sm_100a, sm_103, sm_103a)

The Blackwell datacenter family introduces the fifth-generation tensor core instruction set (tcgen05), new floating-point formats (FP4, FP6, MX formats), and a sophisticated arch-conditional versus family-conditional feature gating system. sm_100/sm_100a targets the NVIDIA B200, while sm_103/sm_103a targets Blackwell Ultra (GB300 system). Both share the tcgen05 ISA but differ in __CUDA_ARCH values and minor tensor core configuration.

Architecture Identity

Six Blackwell arch constants are defined in sub_CD09E0:

NVVM EnumNumeric ValueImplied SM
NVVM_ARCH_BLACKWELL_10_01000sm_100
NVVM_ARCH_BLACKWELL_10_11010sm_101
NVVM_ARCH_BLACKWELL_10_31030sm_103
NVVM_ARCH_BLACKWELL_11_01100sm_110 (Jetson Thor)
NVVM_ARCH_BLACKWELL_12_01200sm_120
NVVM_ARCH_BLACKWELL_12_11210sm_121

Notable: sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line. Despite the rename, both remain in the Blackwell family (NVVM_ARCH_BLACKWELL_*). The numeric encoding follows the standard major*100 + minor*10 formula: 11100 + 010 = 1100.

SM Variant Table

Each Blackwell datacenter target has base, accelerated (a), and forward-compatible (f) sub-variants:

Variant__CUDA_ARCHPTX VersionProduct
sm_10010006B200 base
sm_100a10007B200 accelerated
sm_100f10007B200 forward-compatible
sm_10310306Blackwell Ultra / GB300 base
sm_103a10307Blackwell Ultra / GB300 accelerated
sm_103f10307Blackwell Ultra / GB300 forward-compatible

The undocumented sm_101 and sm_102 targets also exist in the processor table (ctor_605) with their own a/f variants. sm_101 maps to __CUDA_ARCH=1010 and sm_102 to __CUDA_ARCH=1020. No unique feature gates differentiate them from sm_100 in cicc.

Suffix Semantics

The sub-variant flags are stored in EDG frontend globals:

  • unk_4D045E8 — Major SM number (100, 103)
  • unk_4D045E4 — Accelerated flag; set for both a and f variants
  • unk_4D045E0 — Forward-compatible flag; set only for f variants

The f suffix implies a — whenever the forward-compatible flag is set, the accelerated flag is also set. In cicc v13.0, the f flag is set during CLI parsing and reset in sub_615CB0 but is never read by any compiler logic. It exists for future-proofing and potential ptxas-level differentiation.

Arch-Conditional vs. Family-Conditional Gating

Blackwell introduces a two-tier feature gating system that distinguishes between "arch-conditional" and "family-conditional" access to instructions. This pattern repeats across every tcgen05 handler.

The gate check at sub_30462A0, sub_304E6C0, and sub_36E9630 uses a complex encoding:

v = arch_version (offset +340 of arch struct)
if (v > 0x408) {           // 0x408 = 1032 = sm_103.2
    if (v - 1101 > 1)      // allows {1101, 1102} — sm_110a/sm_110f (Jetson Thor)
        goto ERROR;
} else if (v <= 0x3E8 || ((1LL << ((v & 0xFF) + 23)) & 0xC0000C03) == 0) {
    goto ERROR;             // 0x3E8 = 1000 = sm_100 base
}

The bitmask 0xC0000C03 selects specific sub-variants when shifted by (v & 0xFF) + 23. PTX version gates further refine access: family-conditional features require PTX >= 86, while arch-conditional features require PTX >= 88.

Features gated by both arch-conditional and family-conditional (broader access): tcgen05.fence, tcgen05.wait, tcgen05.relinquish.alloc, tcgen05.cp, tcgen05.commit, tcgen05.alloc, tcgen05.mma, and the ue8m0x2 type in cvt_packfloat.

Features gated by arch-conditional only (stricter): {fp6/fp4}x2 types in cvt_packfloat, INT8 type in tcgen05.mma, MXF4/MXF4NVF4 with sparsity, and explicit scale vector size.

tcgen05 — Tensor Core Generation 5

The tcgen05 instruction family is the primary new ISA extension for Blackwell datacenter. All tcgen05 instructions are handled in sub_30462A0 and sub_304E6C0.

Lifecycle Instructions

InstructionOpcodeISDOperandsPurpose
tcgen05.alloc100804765Basic allocationAllocate tensor core accumulator memory
tcgen05.alloc (multicast)100834770/477132-bit flag variantMulticast allocation
tcgen05.dealloc1014048274 operandsDeallocate tensor core memory
tcgen05.commit10090/100914772–4777Mask variantsCommit pending operations
tcgen05.fence1014348302 operandsMemory fence for tensor ops
tcgen05.wait1035150202 operandsWait for tensor ops to complete
tcgen05.relinquish.alloc1031149412 operandsRelinquish allocated tensor memory
tcgen05.cp.*1010147904 operandsCopy operations for tensor data

The commit instruction has multiple variants based on multicast mask size. Only 16-bit and 32-bit masks are valid; other sizes produce an error.

tcgen05.mma — Matrix Multiply-Accumulate

The main MMA instruction is handled in sub_304E6C0 (opcodes 10299–10309) and validated in sub_36E9630. The operand encoding packs configuration into bitfields:

Data types (bits 8–6 of operand):

ValueKindNotes
0kind::mxf4nvf4MX FP4 with NV FP4
1kind::f8f6f4Standard FP8/FP6/FP4
2kind::mxf8f6f4MX variant of f8f6f4
3kind::f16Half precision
4kind::i88-bit integer (arch-conditional only)
5kind::tf32TensorFloat-32
7kind::mxf4MX FP4

Scale vector sizes (bits 3–2):

ValueModifierConstraints
default.scale_vec::1XNot for mxf4nvf4 or mxf4
2.scale_vec::2XNot for mxf8f6f4
3.scale_vec::4XNot for mxf8f6f4 or mxf4

Block scale (bits 10–9): .block16 (16-element block scaling) or .block32 (32-element block scaling). Not supported for f16, tf32, f8f6f4, or i8.

Weight stationary (bit 0): .ws flag. Incompatible with cta_group::2, mxf8f6f4, and FP4 types.

Sparsity (bit 5): Restricted for MXF4 and MXF4NVF4 types on arch-conditional variants only.

Scale input accumulator (bit 4): Scales the accumulator input. Only usable with f16 and tf32 types. Notably, this is NOT supported on the a sub-variants (sm_100a at v=1001, sm_103a at v=1033) but IS supported on base variants (sm_100 at v=1000, sm_103 at v=1030) and sm_120+.

CTA group (bit 1): cta_group::1 (clear) or cta_group::2 (set).

Collector modes (from sub_35F38B0): .collector::a::fill, .collector::a::use, .collector::a::lastuse, and .collector::b with ::ws sub-variants. Constraint: cannot use collector::a::use or collector::a::fill with the ashift modifier.

tcgen05.cp Copy Shapes

The copy instruction shape emission at sub_35F5090 supports:

ShapeBits 3–1 Value
.128x256b0
.4x256b1
.128x128b2
.64x128b3
.32x128b4

Destination format modifiers: .b8x16 (base), .b6x16_p32 (6-bit with 32-bit padding), .b4x16_p64 (4-bit with 64-bit padding).

Multicast modes: .warpx2::02_13 (warp pairs 0,2 and 1,3), .warpx2::01_23 (warp pairs 0,1 and 2,3), .warpx4 (all 4 warps).

cvt_packfloat — Extended Numeric Formats

The cvt_packfloat intrinsic (sub_304FBD0 for validation, sub_35ED820 for emission) has a base requirement of SM >= 90 and PTX >= 78. Blackwell adds four new types:

CaseTypeGeneration
0.f32sm_90+
1.f16x2sm_90+
2.e4m3x2 (FP8 E4M3)sm_90+
3.e5m2x2 (FP8 E5M2)sm_90+
4.bf16x2 (BFloat16)sm_90+
5.e2m1x2 (FP4 E2M1)sm_100+
6.e2m3x2 (FP6 E2M3)sm_100+
7.e3m2x2 (FP6 E3M2)sm_100+
8.ue8m0x2 (UE8M0 scale)sm_100+

The ue8m0x2 type is gated by both arch-conditional and family-conditional paths, while {fp6/fp4}x2 types (e2m1x2, e2m3x2, e3m2x2) are arch-conditional only.

tcgen05 Commit with Mbarrier

The commit modifier emission at sub_35F4E30 combines tensor core commit with mbarrier synchronization:

  • .cta_group::1 / .cta_group::2 — Group selection
  • .mbarrier::arrive::one — Mbarrier arrive modifier
  • .shared::cluster — Shared memory cluster scope
  • .multicast::cluster — Multicast cluster scope

sm_100 vs. sm_103 Differences

Both families share the full tcgen05 ISA. Observable differences in cicc:

  • __CUDA_ARCH: 1000 vs. 1030
  • Tensor core operand range: sm_103 may handle wider operand loops (offset 760 vs. 600 for simpler variants in cases 10303/10308)
  • Scale input accumulator: Not available on a sub-variants of either family

No sm_103-specific feature gates exist beyond the __CUDA_ARCH value. Hardware differences between B200 and GB300 are resolved at the ptxas level.

Feature Flag Configuration

At the sm_100+ threshold (qword_4F077A8 > 109999), the master configurator sub_60E7C0 enables:

FlagCondition
unk_4D04184Unconditional
unk_4D04800Requires CUDA mode + C++20
dword_4D041ACGuarded by byte_4CF8172

Key Binary Locations

FunctionAddressSizeRole
sub_CD09E00xCD09E0NVVM arch enum (all Blackwell constants)NVVM arch enum (all Blackwell constants)
sub_1C1B1500x1C1B150Second arch enum copy (LLVM module metadata)Second arch enum copy (LLVM module metadata)
sub_30462A00x30462A0tcgen05 intrinsic handler (alloc/dealloc/commit/fence/wait/cp)tcgen05 intrinsic handler (alloc/dealloc/commit/fence/wait/cp)
sub_304E6C00x304E6C0tcgen05.mma intrinsic handler + SelectionDAG loweringtcgen05.mma intrinsic handler + SelectionDAG lowering
sub_36E96300x36E9630tcgen05.mma validation + ISD opcode selectiontcgen05.mma validation + ISD opcode selection
sub_304FBD00x304FBD0cvt_packfloat intrinsic handlercvt_packfloat intrinsic handler
sub_35ED8200x35ED820cvt_packfloat type string emissioncvt_packfloat type string emission
sub_35F33300x35F3330tcgen05.mma modifier emission (kind, scale, cta_group)tcgen05.mma modifier emission (kind, scale, cta_group)
sub_35F38B00x35F38B0tcgen05.mma modifier emission (ashift, collector)tcgen05.mma modifier emission (ashift, collector)
sub_35F4E300x35F4E30tcgen05 commit modifier emissiontcgen05 commit modifier emission
sub_35F50900x35F5090tcgen05.cp shape/format emissiontcgen05.cp shape/format emission
sub_95EB400x95EB40CLI arch string mappingCLI arch string mapping
sub_617BD00x617BD0compute_NNN string parsingcompute_NNN string parsing
ctor_6050x584510Processor variant string tableProcessor variant string table
ctor_3560x50C890LLVM processor description tableLLVM processor description table