Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)

The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.

Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:

if (SmVersion > 1032):          // above sm_103f
    if (SmVersion - 1101) > 1:  // only 1101 (sm_110a) and 1102 (sm_110f) pass
        → ERROR "tcgen05 supported only on arch-conditional..."

sm_120's SmVersion is 1200 → 1200 - 1101 = 99 > 1 → rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.

The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.

The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):

SmVersion	Target	tcgen05
1001	sm_100a	Allowed (bitmask bit 0)
1002	sm_100f	Allowed (bitmask bit 1)
1011	sm_101a	Allowed (bitmask bit 10)
1012	sm_101f	Allowed (bitmask bit 11)
1031	sm_103a	Allowed (bitmask bit 30)
1032	sm_103f	Allowed (bitmask bit 31)
1101	sm_110a	Allowed ((v-1101) <= 1)
1102	sm_110f	Allowed ((v-1101) <= 1)
1000, 1010, 1030, 1100	base variants	Blocked (no suffix)
1200–1212	all sm_120/121	Blocked (v-1101 > 1)

From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.

Architecture Identity

NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:

NVVM Enum	Numeric Value	`__CUDA_ARCH`	Product
`NVVM_ARCH_BLACKWELL_12_0`	1200	1200	RTX 50xx / RTX Blackwell Pro
`NVVM_ARCH_BLACKWELL_12_1`	1210	1210	DGX Spark

The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.

SM Variant Table

Variant	`__CUDA_ARCH`	PTX Version	`a` flag	`f` flag
`sm_120`	1200	6	0	0
`sm_120a`	1200	7	1	0
`sm_120f`	1200	7	1	1
`sm_121`	1210	6	0	0
`sm_121a`	1210	7	1	0
`sm_121f`	1210	7	1	1

The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.

Suffix Behavior

For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:

unk_4D045E4 (accelerated flag): Read in exactly one location (sub_6C4D80 line 167), but only for unk_4D045E8 == 90 — the sm_90a gate. The flag is never checked for sm_120.
unk_4D045E0 (forward-compatible flag): Set during CLI parsing, reset in sub_615CB0, but never read anywhere in the compiler logic.

The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.

SM 120 Exclusive Feature Gates

The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).

Feature 1: .offset.bindless Texture Intrinsics

Frontend gate: sub_1C36530 line 2724 Backend gate: sub_2C7B6A0 line 2160

When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.

The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.

68 intrinsic variants are classified by two functions:

Frontend: sub_1C303A0 — Checks three ID ranges:
- Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
- Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
- Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
Backend: sub_CEA320 — Checks corresponding backend intrinsic IDs

These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.

Feature 2: 16-bit Texture Element Types

Frontend gate: sub_1C36530 line 3381 Backend gate: sub_2C7B6A0 line 2386

When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:

type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
             (is_16bit(type) && tex16_allowed_flag)

The tex16_allowed_flag differs by architecture:

sm < 120: True only for builtin ID 3811 (checked by sub_1C30390)
sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by sub_1C30470 on frontend, sub_CEA3F0 for backend IDs 10462–10477)

This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.

sm_120 vs. sm_121

Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.

sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.

Relationship to sm_100

What sm_120 Inherits from sm_100

sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.

What sm_120 Does NOT Have

sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:

No tcgen05.alloc / tcgen05.dealloc (no TMEM to allocate)
No tcgen05.mma (the async TMEM-based tensor core path)
No tcgen05.cp / tcgen05.commit / tcgen05.fence / tcgen05.wait
No tcgen05.relinquish.alloc

What sm_120 Has Instead

The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:

mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0

This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.

The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).

cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).

In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.

Tensor Core Instruction Timeline

Generation	SM	Instruction	Memory Model
Volta/Turing	sm_70/75	`mma.sync` (HMMA)	Register-to-register, synchronous
Ampere	sm_80	`mma.sync` (extended shapes)	Register-to-register, synchronous
Hopper	sm_90	`wgmma.mma_async`	Shared memory → registers, async warpgroup
Blackwell datacenter	sm_100/103/110	`tcgen05.mma`	Tensor Memory (TMEM), fully async
Blackwell consumer	sm_120/121	`mma.sync.block_scale` (LLVM 22+)	Register-to-register, synchronous + MX scaling

sm_110 — Jetson Thor

sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.

Variant	`__CUDA_ARCH`	PTX Version
`sm_110`	1100	6
`sm_110a`	1100	7
`sm_110f`	1100	7

Feature Flag Configuration

At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:

Flag	Purpose
`unk_4D047BC`	Disabled (set to 0) for sm_120+; enabled for all lower architectures
`unk_4D0428C`	Enabled at sm_120+

The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.

Key Binary Locations

Function	Address	Size	Role
`sub_CD09E0`	`0xCD09E0`	NVVM arch enum (`NVVM_ARCH_BLACKWELL_12_0/12_1`)	NVVM arch enum (`NVVM_ARCH_BLACKWELL_12_0/12_1`)
`sub_95EB40`	`0x95EB40`	CLI arch string mapping	CLI arch string mapping
`sub_617BD0`	`0x617BD0`	`compute_NNN` string parsing	`compute_NNN` string parsing
`ctor_605`	`0x584510`	Processor variant table (PTX versions)	Processor variant table (PTX versions)
`ctor_356`	`0x50C890`	LLVM processor description table	LLVM processor description table
`sub_1C36530`	`0x1C36530`	Frontend verifier (`.offset.bindless` + f16 texture gates)	Frontend verifier (`.offset.bindless` + f16 texture gates)
`sub_2C7B6A0`	`0x2C7B6A0`	Backend verifier (`.offset.bindless` + f16 texture gates)	Backend verifier (`.offset.bindless` + f16 texture gates)
`sub_1C303A0`	`0x1C303A0`	`.offset.bindless` intrinsic classifier (frontend)	`.offset.bindless` intrinsic classifier (frontend)
`sub_CEA320`	`0xCEA320`	`.offset.bindless` intrinsic classifier (backend)	`.offset.bindless` intrinsic classifier (backend)
`sub_1C30470`	`0x1C30470`	f16 texture exclusion list (frontend)	f16 texture exclusion list (frontend)
`sub_CEA3F0`	`0xCEA3F0`	f16 texture exclusion list (backend)	f16 texture exclusion list (backend)
`sub_6C4D80`	`0x6C4D80`	Accelerated flag reader (sm_90a only, not sm_120)	Accelerated flag reader (sm_90a only, not sm_120)
`sub_615CB0`	`0x615CB0`	Forward-compatible flag reset	Forward-compatible flag reset

Keyboard shortcuts

CICC Reverse Engineering Reference