Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)
The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.
Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:
if (SmVersion > 1032): // above sm_103f
if (SmVersion - 1101) > 1: // only 1101 (sm_110a) and 1102 (sm_110f) pass
→ ERROR "tcgen05 supported only on arch-conditional..."
sm_120's SmVersion is 1200 → 1200 - 1101 = 99 > 1 → rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.
The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.
The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):
| SmVersion | Target | tcgen05 |
|---|---|---|
| 1001 | sm_100a | Allowed (bitmask bit 0) |
| 1002 | sm_100f | Allowed (bitmask bit 1) |
| 1011 | sm_101a | Allowed (bitmask bit 10) |
| 1012 | sm_101f | Allowed (bitmask bit 11) |
| 1031 | sm_103a | Allowed (bitmask bit 30) |
| 1032 | sm_103f | Allowed (bitmask bit 31) |
| 1101 | sm_110a | Allowed ((v-1101) <= 1) |
| 1102 | sm_110f | Allowed ((v-1101) <= 1) |
| 1000, 1010, 1030, 1100 | base variants | Blocked (no suffix) |
| 1200–1212 | all sm_120/121 | Blocked (v-1101 > 1) |
From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.
Architecture Identity
NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:
| NVVM Enum | Numeric Value | __CUDA_ARCH | Product |
|---|---|---|---|
NVVM_ARCH_BLACKWELL_12_0 | 1200 | 1200 | RTX 50xx / RTX Blackwell Pro |
NVVM_ARCH_BLACKWELL_12_1 | 1210 | 1210 | DGX Spark |
The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.
SM Variant Table
| Variant | __CUDA_ARCH | PTX Version | a flag | f flag |
|---|---|---|---|---|
sm_120 | 1200 | 6 | 0 | 0 |
sm_120a | 1200 | 7 | 1 | 0 |
sm_120f | 1200 | 7 | 1 | 1 |
sm_121 | 1210 | 6 | 0 | 0 |
sm_121a | 1210 | 7 | 1 | 0 |
sm_121f | 1210 | 7 | 1 | 1 |
The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.
Suffix Behavior
For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:
unk_4D045E4(accelerated flag): Read in exactly one location (sub_6C4D80line 167), but only forunk_4D045E8 == 90— the sm_90a gate. The flag is never checked for sm_120.unk_4D045E0(forward-compatible flag): Set during CLI parsing, reset insub_615CB0, but never read anywhere in the compiler logic.
The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.
SM 120 Exclusive Feature Gates
The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).
Feature 1: .offset.bindless Texture Intrinsics
Frontend gate: sub_1C36530 line 2724
Backend gate: sub_2C7B6A0 line 2160
When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.
The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.
68 intrinsic variants are classified by two functions:
-
Frontend:
sub_1C303A0— Checks three ID ranges:- Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
- Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
- Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
-
Backend:
sub_CEA320— Checks corresponding backend intrinsic IDs
These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.
Feature 2: 16-bit Texture Element Types
Frontend gate: sub_1C36530 line 3381
Backend gate: sub_2C7B6A0 line 2386
When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:
type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
(is_16bit(type) && tex16_allowed_flag)
The tex16_allowed_flag differs by architecture:
- sm < 120: True only for builtin ID 3811 (checked by
sub_1C30390) - sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by
sub_1C30470on frontend,sub_CEA3F0for backend IDs 10462–10477)
This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.
sm_120 vs. sm_121
Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.
sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.
Relationship to sm_100
What sm_120 Inherits from sm_100
sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.
What sm_120 Does NOT Have
sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:
- No
tcgen05.alloc/tcgen05.dealloc(no TMEM to allocate) - No
tcgen05.mma(the async TMEM-based tensor core path) - No
tcgen05.cp/tcgen05.commit/tcgen05.fence/tcgen05.wait - No
tcgen05.relinquish.alloc
What sm_120 Has Instead
The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:
mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0
This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.
The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).
cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).
In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.
Tensor Core Instruction Timeline
| Generation | SM | Instruction | Memory Model |
|---|---|---|---|
| Volta/Turing | sm_70/75 | mma.sync (HMMA) | Register-to-register, synchronous |
| Ampere | sm_80 | mma.sync (extended shapes) | Register-to-register, synchronous |
| Hopper | sm_90 | wgmma.mma_async | Shared memory → registers, async warpgroup |
| Blackwell datacenter | sm_100/103/110 | tcgen05.mma | Tensor Memory (TMEM), fully async |
| Blackwell consumer | sm_120/121 | mma.sync.block_scale (LLVM 22+) | Register-to-register, synchronous + MX scaling |
sm_110 — Jetson Thor
sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.
| Variant | __CUDA_ARCH | PTX Version |
|---|---|---|
sm_110 | 1100 | 6 |
sm_110a | 1100 | 7 |
sm_110f | 1100 | 7 |
Feature Flag Configuration
At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:
| Flag | Purpose |
|---|---|
unk_4D047BC | Disabled (set to 0) for sm_120+; enabled for all lower architectures |
unk_4D0428C | Enabled at sm_120+ |
The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.
Key Binary Locations
| Function | Address | Size | Role |
|---|---|---|---|
sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1) | NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1) |
sub_95EB40 | 0x95EB40 | CLI arch string mapping | CLI arch string mapping |
sub_617BD0 | 0x617BD0 | compute_NNN string parsing | compute_NNN string parsing |
ctor_605 | 0x584510 | Processor variant table (PTX versions) | Processor variant table (PTX versions) |
ctor_356 | 0x50C890 | LLVM processor description table | LLVM processor description table |
sub_1C36530 | 0x1C36530 | Frontend verifier (.offset.bindless + f16 texture gates) | Frontend verifier (.offset.bindless + f16 texture gates) |
sub_2C7B6A0 | 0x2C7B6A0 | Backend verifier (.offset.bindless + f16 texture gates) | Backend verifier (.offset.bindless + f16 texture gates) |
sub_1C303A0 | 0x1C303A0 | .offset.bindless intrinsic classifier (frontend) | .offset.bindless intrinsic classifier (frontend) |
sub_CEA320 | 0xCEA320 | .offset.bindless intrinsic classifier (backend) | .offset.bindless intrinsic classifier (backend) |
sub_1C30470 | 0x1C30470 | f16 texture exclusion list (frontend) | f16 texture exclusion list (frontend) |
sub_CEA3F0 | 0xCEA3F0 | f16 texture exclusion list (backend) | f16 texture exclusion list (backend) |
sub_6C4D80 | 0x6C4D80 | Accelerated flag reader (sm_90a only, not sm_120) | Accelerated flag reader (sm_90a only, not sm_120) |
sub_615CB0 | 0x615CB0 | Forward-compatible flag reset | Forward-compatible flag reset |