Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)

The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.

Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:

if (SmVersion > 1032):          // above sm_103f
    if (SmVersion - 1101) > 1:  // only 1101 (sm_110a) and 1102 (sm_110f) pass
        → ERROR "tcgen05 supported only on arch-conditional..."

sm_120's SmVersion is 1200 → 1200 - 1101 = 99 > 1rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.

The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.

The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):

SmVersionTargettcgen05
1001sm_100aAllowed (bitmask bit 0)
1002sm_100fAllowed (bitmask bit 1)
1011sm_101aAllowed (bitmask bit 10)
1012sm_101fAllowed (bitmask bit 11)
1031sm_103aAllowed (bitmask bit 30)
1032sm_103fAllowed (bitmask bit 31)
1101sm_110aAllowed ((v-1101) <= 1)
1102sm_110fAllowed ((v-1101) <= 1)
1000, 1010, 1030, 1100base variantsBlocked (no suffix)
1200–1212all sm_120/121Blocked (v-1101 > 1)

From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.

Architecture Identity

NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:

NVVM EnumNumeric Value__CUDA_ARCHProduct
NVVM_ARCH_BLACKWELL_12_012001200RTX 50xx / RTX Blackwell Pro
NVVM_ARCH_BLACKWELL_12_112101210DGX Spark

The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.

SM Variant Table

Variant__CUDA_ARCHPTX Versiona flagf flag
sm_1201200600
sm_120a1200710
sm_120f1200711
sm_1211210600
sm_121a1210710
sm_121f1210711

The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.

Suffix Behavior

For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:

  • unk_4D045E4 (accelerated flag): Read in exactly one location (sub_6C4D80 line 167), but only for unk_4D045E8 == 90 — the sm_90a gate. The flag is never checked for sm_120.
  • unk_4D045E0 (forward-compatible flag): Set during CLI parsing, reset in sub_615CB0, but never read anywhere in the compiler logic.

The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.

SM 120 Exclusive Feature Gates

The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).

Feature 1: .offset.bindless Texture Intrinsics

Frontend gate: sub_1C36530 line 2724 Backend gate: sub_2C7B6A0 line 2160

When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.

The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.

68 intrinsic variants are classified by two functions:

  • Frontend: sub_1C303A0 — Checks three ID ranges:

    • Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
    • Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
    • Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
  • Backend: sub_CEA320 — Checks corresponding backend intrinsic IDs

These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.

Feature 2: 16-bit Texture Element Types

Frontend gate: sub_1C36530 line 3381 Backend gate: sub_2C7B6A0 line 2386

When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:

type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
             (is_16bit(type) && tex16_allowed_flag)

The tex16_allowed_flag differs by architecture:

  • sm < 120: True only for builtin ID 3811 (checked by sub_1C30390)
  • sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by sub_1C30470 on frontend, sub_CEA3F0 for backend IDs 10462–10477)

This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.

sm_120 vs. sm_121

Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.

sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.

Relationship to sm_100

What sm_120 Inherits from sm_100

sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.

What sm_120 Does NOT Have

sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:

  • No tcgen05.alloc / tcgen05.dealloc (no TMEM to allocate)
  • No tcgen05.mma (the async TMEM-based tensor core path)
  • No tcgen05.cp / tcgen05.commit / tcgen05.fence / tcgen05.wait
  • No tcgen05.relinquish.alloc

What sm_120 Has Instead

The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:

mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0

This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.

The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).

cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).

In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.

Tensor Core Instruction Timeline

GenerationSMInstructionMemory Model
Volta/Turingsm_70/75mma.sync (HMMA)Register-to-register, synchronous
Amperesm_80mma.sync (extended shapes)Register-to-register, synchronous
Hoppersm_90wgmma.mma_asyncShared memory → registers, async warpgroup
Blackwell datacentersm_100/103/110tcgen05.mmaTensor Memory (TMEM), fully async
Blackwell consumersm_120/121mma.sync.block_scale (LLVM 22+)Register-to-register, synchronous + MX scaling

sm_110 — Jetson Thor

sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.

Variant__CUDA_ARCHPTX Version
sm_11011006
sm_110a11007
sm_110f11007

Feature Flag Configuration

At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:

FlagPurpose
unk_4D047BCDisabled (set to 0) for sm_120+; enabled for all lower architectures
unk_4D0428CEnabled at sm_120+

The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.

Key Binary Locations

FunctionAddressSizeRole
sub_CD09E00xCD09E0NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1)NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1)
sub_95EB400x95EB40CLI arch string mappingCLI arch string mapping
sub_617BD00x617BD0compute_NNN string parsingcompute_NNN string parsing
ctor_6050x584510Processor variant table (PTX versions)Processor variant table (PTX versions)
ctor_3560x50C890LLVM processor description tableLLVM processor description table
sub_1C365300x1C36530Frontend verifier (.offset.bindless + f16 texture gates)Frontend verifier (.offset.bindless + f16 texture gates)
sub_2C7B6A00x2C7B6A0Backend verifier (.offset.bindless + f16 texture gates)Backend verifier (.offset.bindless + f16 texture gates)
sub_1C303A00x1C303A0.offset.bindless intrinsic classifier (frontend).offset.bindless intrinsic classifier (frontend)
sub_CEA3200xCEA320.offset.bindless intrinsic classifier (backend).offset.bindless intrinsic classifier (backend)
sub_1C304700x1C30470f16 texture exclusion list (frontend)f16 texture exclusion list (frontend)
sub_CEA3F00xCEA3F0f16 texture exclusion list (backend)f16 texture exclusion list (backend)
sub_6C4D800x6C4D80Accelerated flag reader (sm_90a only, not sm_120)Accelerated flag reader (sm_90a only, not sm_120)
sub_615CB00x615CB0Forward-compatible flag resetForward-compatible flag reset