Blackwell (SM 100--121)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas v13.0.88 handles five Blackwell-era base targets -- sm_100, sm_103, sm_110, sm_120, sm_121 -- spanning datacenter, automotive, consumer, and DGX product lines. All share the codegen factory value 36864 (generation 9, 9 << 12) and the "Blackwell" family string internally, despite being distinct microarchitectures. The defining Blackwell feature is Capsule Mercury (capmerc) as the default binary output format, automatically enabled for SM numbers exceeding 99. The datacenter variants (sm_100, sm_103, sm_110) support tcgen05 (5th-generation tensor cores with dedicated tensor memory); the consumer variants (sm_120, sm_121) do not.
| SM targets | sm_100, sm_103, sm_110, sm_120, sm_121 (+ a and f sub-variants each) |
| Codegen factory | 36864 (0x9000, generation 9) |
| Family string | "Blackwell" (all five targets) |
| Default binary format | Capsule Mercury (capmerc) -- auto-enabled for SM > 99 |
| SASS encoding | 128-bit per instruction (Mercury-encoded) |
| Warp geometry | 16 warps, 240 dispatch slots (shared with Hopper sm_90) |
| Sub-variants per SM | 3: base, a (accelerated), f (feature-reduced) |
| Profile constructor | sub_6765E0 (54KB) |
| Capability dispatch | sub_607DB0 (7 hash maps, once-guarded) |
SM Version Table
| SM | Product | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|
sm_100 | B100 / B200 (datacenter) | 1000 | 36864 | 0x9000 | 0 (gen 9 base) |
sm_103 | GB300 (Blackwell Ultra) | 1030 | 36867 | 0x9003 | 3 |
sm_110 | Jetson Thor SoC | 1100 | 36868 | 0x9004 | 4 |
sm_120 | RTX 50xx / RTX Pro | 1200 | 36869 | 0x9005 | 5 |
sm_121 | DGX Spark | 1210 | 36869 | 0x9005 | 5 |
Codegen factory encoding: (9 << 12) | sub_variant. sm_100 is variant 0 (generation base). sm_103 is variant 3. sm_110 is variant 4. sm_120 and sm_121 appear to share variant 5 in the scheduling sub-architecture table at sub_8E4400.
Unreleased SM numbers referenced in the binary: The SASS formatter sub_583190 (rsqrt) checks for SM codes 102, 103, 107, 124, 130 in architecture-specific dispatch paths, suggesting internal/future variants beyond the five publicly exposed targets.
Sub-Variant System
Every Blackwell SM has three sub-variants. The base and a/f variants within an SM share all 7 dispatch table handler functions -- they are identical silicon with different compatibility metadata and feature exposure.
Profile Object Fields
The profile constructor sub_6765E0 builds profile objects for each sub-variant with these fields:
| SM | Base | a Variant | f Variant |
|---|---|---|---|
| sm_100 | sm_100 / compute_100 / lto_100 | sm_100a / compute_100a / lto_100a | sm_100f / compute_100f / lto_100f |
| sm_103 | sm_103 / compute_103 / lto_103 | sm_103a / compute_103a / lto_103a | sm_103f / compute_103f / lto_103f |
| sm_110 | sm_110 / compute_110 / lto_110 | sm_110a / compute_110a / lto_110a | sm_110f / compute_110f / lto_110f |
| sm_120 | sm_120 / compute_120 / lto_120 | sm_120a / compute_120a / lto_120a | sm_120f / compute_120f / lto_120f |
| sm_121 | sm_121 / compute_121 / lto_121 | sm_121a / compute_121a / lto_121a | sm_121f / compute_121f / lto_121f |
CUDA_ARCH Macro Values
| Sub-Variant | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Base | -D__CUDA_ARCH__=1000 | =1030 | =1100 | =1200 | =1210 |
| Accelerated | =100a0 | =103a0 | =110a0 | =120a0 | =121a0 |
| Feature-reduced | =100f0 | =103f0 | =110f0 | =120f0 | =121f0 |
Suffix Bit Flags in Profile Objects
From the decompiled profile constructor, suffixed variants set specific byte flags:
| Suffix | Flag Position | Evidence |
|---|---|---|
a (accelerated) | profile[4] = 1 | v79->m128i_i8[4] = 1; *(_BYTE *)(v82 + 4) = 1; |
f (feature-reduced) | profile[5] = 1 (on all 3 objects: sm, compute, lto) | v88->m128i_i8[5] = 1; *(_BYTE *)(v91 + 5) = 1; v94[5] = 1; |
The a flag is set on the SM and compute profile objects only. The f flag is set on all three (sm, compute, lto), reflecting the fact that f-compiled code must retain its feature-reduced metadata through linking.
isaClass Inheritance
Sub-variants reference their base SM's isaClass rather than defining a new one:
sm_100aandsm_100freference"(profile_sm_100)->isaClass"sm_103aandsm_103freference"(profile_sm_103)->isaClass"sm_120aandsm_120freference"(profile_sm_120)->isaClass"sm_121aandsm_121freference"(profile_sm_121)->isaClass"
This confirms that sub-variants share the instruction set architecture class with their base. The a/f distinction is purely in compatibility metadata, not in the ISA or codegen.
Capability Dispatch
sub_607DB0 registers handler functions into 7 parallel hash maps. All sub-variants of a given SM register the same function pointers.
Handler Assignments (Maps 1--3)
| SM | Handler A (Map 1) | Handler B (Map 2) | Intrinsic Init (Map 3) |
|---|---|---|---|
| sm_100 / 100a / 100f | sub_609C30 | sub_609BD0 | sub_60A910 |
| sm_103 / 103a / 103f | sub_608F20 | sub_609D20 | sub_60A700 |
| sm_110 / 110a / 110f | sub_609F30 | sub_608F50 | sub_60AA20 |
| sm_120 / 120a / 120f | sub_609E40 | sub_609C60 | sub_608DF0 |
| sm_121 / 121a / 121f | sub_609ED0 | sub_609BA0 | sub_60A4E0 |
Performance / Occupancy Handlers (Maps 6--7)
| SM | Handler E (Map 6) | Handler F (Map 7) |
|---|---|---|
| sm_100 / 100a / 100f | sub_609080 | sub_6098A0 |
| sm_103 / 103a / 103f | sub_609020 | sub_6091A0 |
| sm_110 / 110a / 110f | sub_609000 | sub_609280 |
| sm_120 / 120a / 120f | sub_608FE0 | sub_609520 |
| sm_121 / 121a / 121f | sub_609040 | sub_6097C0 |
Every Blackwell SM has unique handler functions in all 7 maps. This contrasts with Hopper where sm_90 and sm_90a share all handlers. Each Blackwell SM variant is architecturally distinct enough to warrant separate capability accessors, performance models, and intrinsic tables.
Warp Geometry
The warp geometry initializer at sub_8E4400 uses the codegen factory value to select dispatch parameters. All Blackwell targets (codegen factory > 36863) fall into the maximum bucket:
encoded > 36863 -> 16 warps, 240 dispatch slots
This is identical to Hopper (sm_90). The 16-warp / 240-slot geometry supports Blackwell's warpgroup execution model (4 warps per warpgroup, 4 warpgroups per SM partition).
Sub-Architecture Variant Table
The secondary variant assignment at sub_8E4400 maps codegen factory values to sub-architecture indices:
| Codegen Factory | Variant | SM |
|---|---|---|
| 36864 | 0 | sm_100 (base) |
| 36867 | 3 | sm_103 |
| 36868 | 4 | sm_110 |
| 36869 | 5 | sm_120, sm_121 |
These variant indices select different entries within the per-SM latency tables, allowing the scheduler to use silicon-specific pipeline timing.
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (
sub_8688F0offset +612). - Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From
sub_8E4400offset +18 (packed DWORD) and offset +22 (WORD). - Configurable Shared Memory: Valid shared memory sizes per CTA, selected by
cudaFuncSetAttribute.
All Blackwell targets share the 16-partition / 240-slot geometry (identical to Hopper sm_90). The a and f sub-variants within each SM share the same geometry -- differentiation is in compatibility metadata and feature exposure, not in resource limits. The primary distinction across Blackwell SMs is in the latency tables and tcgen05 availability, not in the scheduling partition structure.
Capsule Mercury (capmerc) -- Default Output Format
Capsule Mercury is automatically enabled for all Blackwell targets. When the SM architecture number exceeds 99, ptxas sets the capmerc flag at offset+81 in the compilation context. This applies to sm_100, sm_103, sm_110, sm_120, and sm_121 uniformly.
Three Output Modes
| Mode | String | Default For | SM Range |
|---|---|---|---|
mercury | "mercury" | sm_75 -- sm_90 | Turing through Hopper |
capmerc | "capmerc" | sm_100 -- sm_121 | All Blackwell |
sass | "sass" | None (explicit only) | Any |
Capsule Mercury vs Mercury
Both modes use the same Mercury encoder pipeline (phases 117--122). The capmerc distinction is at the ELF emission level:
- Mercury produces a fully-resolved SASS binary in
.text.<funcname>sections - Capsule Mercury wraps Mercury-encoded instructions in
.nv.capmerc<funcname>sections with a 328-byte capsule descriptor, plus.nv.merc.*debug/metadata sections
The capsule descriptor (constructed by sub_1C9C300, 24KB) contains the Mercury instruction stream, relocation metadata (R_MERCURY_* types), KNOBS compilation configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage).
Opportunistic Finalization
Capsule Mercury enables deferred finalization -- compiling once for one SM and reconstituting SASS for a different SM at link or load time.
| Level | Name | Behavior | Example |
|---|---|---|---|
| 0 | default | Standard finalization (compile-target only) | -- |
| 1 | none | No finalization; output stays as capmerc | -- |
| 2 | intra-family | Finalize within same SM family | sm_100 -> sm_103 |
| 3 | intra+inter | Finalize across SM families | sm_100 -> sm_120 |
The compatibility checker sub_60F290 determines whether a capmerc binary compiled for SM X can be finalized for SM Y. On success, ptxas emits: "applied for off-target %u -> %u finalization".
Self-Check Mechanism
The --self-check CLI option performs roundtrip verification:
- Generate capmerc output (Mercury encoding + metadata)
- Reconstitute SASS from the capmerc data
- Compare section-by-section; report error codes 17 (content mismatch), 18 (count mismatch), 19 (metadata mismatch)
The reconstituted SASS can be dumped with --out-sass for debugging self-check failures.
TCGen05 -- 5th Generation Tensor Cores
TCGen05 is the defining hardware feature of Blackwell datacenter parts. It introduces tensor memory (TMEM) as a dedicated register-like storage directly connected to the tensor core, eliminating the shared-memory bottleneck of previous WGMMA designs.
SM Availability
| SM | tcgen05 Available | Notes |
|---|---|---|
| sm_100 / 100a / 100f | Yes | Full datacenter tcgen05 |
| sm_103 / 103a / 103f | Yes | Blackwell Ultra -- same tcgen05 ISA |
| sm_110 / 110a / 110f | Yes | Jetson Thor -- full tcgen05 hardware |
| sm_120 / 120a / 120f | No | Consumer -- no TMEM, no tcgen05 |
| sm_121 / 121a / 121f | No | DGX Spark -- no TMEM, no tcgen05 |
The tcgen05 ISA is gated by SM version checks (visible as sub_70FA00(*, 29) capability queries). sm_120 and sm_121 fall back to inherited HMMA/IMMA/WGMMA tensor core paths.
PTX Instructions
Registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Formatter | Size |
|---|---|---|---|
tcgen05.alloc | sub_569180 | sub_526370 | 1287B |
tcgen05.relinquish_alloc_permit | sub_526370 | -- | -- |
tcgen05.dealloc | sub_58C7F0 | sub_574050 | 2130B |
tcgen05.ld | sub_574050 | sub_578DB0 | 2466B |
tcgen05.ld.red | sub_578DB0 | -- | -- |
tcgen05.st | sub_571FE0 | sub_56C190 | 1842B |
tcgen05.commit | sub_56C190 | sub_5427F0 | 1575B |
tcgen05.cp | sub_5427F0 | sub_4F1A90 | 903B |
tcgen05.shift | sub_4F1A90 | sub_58FA20 | 4604B |
tcgen05.mma | sub_5BBC30 (90KB) | -- | -- |
tcgen05.mma.ws | sub_58FA20 | sub_4DA720 | 343B |
The tcgen05.mma codegen handler at sub_5BBC30 is 90KB -- the largest single-instruction handler in ptxas -- reflecting the complexity of 5th-gen tensor core MMA with tensor memory operands, scale factors, sparsity, and accumulator management.
TCGen05 Guardrail Functions
Eight debug/validation functions provide runtime instrumentation for tensor memory operations when compiled with --g-tensor-memory-access-check (or -g-tmem-access-check):
| Guardrail | Formatter | Size | Validation |
|---|---|---|---|
_tcgen05.guardrails.is_phase_valid | sub_4DA720 | 775B | Phase lifecycle |
_tcgen05.guardrails.are_columns_allocated | sub_4DDE70 | 599B | Column allocation |
_tcgen05.guardrails.is_current_warp_valid_owner | sub_4DBF20 | 791B | Warp ownership |
_tcgen05.guardrails.in_physical_bounds | sub_4DB050 | 439B | Memory bounds |
_tcgen05.guardrails.allocation_granularity | sub_4F0960 | 839B | Allocation alignment |
_tcgen05.guardrails.datapath_alignment | sub_4DD580 | 735B | Data path checks |
_tcgen05.guardrails.sp_consistency_across_idesc_mod | sub_500FA0 | 970B | Sparse descriptor |
_tcgen05.guardrails.check_sparse_usage | sub_4DDB80 | 743B | Sparsity validation |
The bounds checker (sub_70E0E0, 296 lines decompiled) generates inline PTX code to validate tensor memory column counts, extracting bitfield positions from tcgen05 descriptors (e.g., and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8).
TCGen05 Intrinsics
| ID Range | Count | Category |
|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* (trap on validation failure) |
| 0x230--0x239 | 10 | __cuda_sm_10x_* (hmma/imma mdata + bit MMA) |
Additional tcgen05 helper intrinsics observed in decompiled code:
__cuda_sm_100_tcgen05_ld_red_immhalfSplitOff-- load-reduce with immediate half-split offset__cuda_sm_100_tcgen05_ld_immhalfSplitOff-- load with immediate half-split offset__cuda_sm_100_tcgen05_st_immhalfSplitOff-- store with immediate half-split offset__cuda_sm_100_tcgen05_ld_red_funcRetArr-- load-reduce returning array__cuda_sm_100_tcgen05_ld_funcRetArr-- load returning array
These helpers (decompiled in sub_70D910 and sub_70DDB0) generate inline PTX for complex tensor memory access patterns including array returns via ld.param.b32 sequences.
Bulk Copy Intrinsics (sm_1xx)
18 intrinsics in the __cuda_sm1xx_* namespace cover cp.async.bulk.tensor 1D--5D in tile and im2col modes. These extend the Hopper TMA infrastructure with Blackwell-specific enhancements.
SM 100 / SM 100a / SM 100f -- Blackwell Datacenter
sm_100 is the reference Blackwell architecture. Codegen factory 36864 (0x9000), CUDA_ARCH 1000.
Products
B100, B200 (datacenter GPU), paired as GB200 NVL72 superchips.
Key Features
- TCGen05: Full 5th-gen tensor core with TMEM (alloc, dealloc, ld, st, commit, cp, shift, mma, mma.ws)
- Capsule Mercury: Default output format (auto-enabled for SM > 99)
- WGMMA inherited: Warpgroup MMA from Hopper carries forward
- Cluster operations: Thread-block clusters, distributed shared memory (from Hopper)
- setmaxnreg: Dynamic register allocation (from Hopper)
- Uniform register ALU: UFADD, UFFMA, UFSEL, UFSETP, UVIADDR (Blackwell uniform register ISA additions)
Handler Functions
| Map | Function | Role |
|---|---|---|
| Handler A | sub_609C30 | Primary codegen capability accessor |
| Handler B | sub_609BD0 | Secondary codegen capability accessor |
| Intrinsic init | sub_60A910 | Intrinsic table population |
| Perf/occupancy E | sub_609080 | Performance statistics |
| Perf/occupancy F | sub_6098A0 | Occupancy calculator |
HW Latency Table
sub_8E8A90 (3.0KB) -- the base Blackwell latency table. Two-part structure: a 3.0KB base table for standard instructions plus a ~949-byte TCGEN05 supplement covering tensor core scheduling classes 745--772+.
Profile Object
From sub_6765E0:
SM name: "sm_100"
Compute name: "compute_100"
LTO name: "lto_100"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1000"
The profile constructor stores dword_29FE2C4 = 100 after constructing all sm_100 sub-variants, likely recording the current highest-registered base SM number.
SM 103 / SM 103a / SM 103f -- Blackwell Ultra
sm_103 is Blackwell Ultra, targeting the GB300 NVL72 platform. Codegen factory 36867 (0x9003), CUDA_ARCH 1030.
Products
GB300 (datacenter, Blackwell Ultra). Incremental silicon revision over sm_100.
Differentiation from sm_100
The SASS formatter sub_583190 (rsqrt instruction) explicitly checks for "sm_103" and applies a Blackwell Ultra-specific operand layout. The formatter dispatch first tests raw SM codes (102, 103, 107, 130, 124), then uses check_target_sm(instr, 0, "sm_103") for string-based validation.
From sweep data on the SM103-specific encoding path:
- Sets encoding flags XOR 0x10, XOR 0x40 for sm_103-specific instruction variants
- New operand layout for transcendental instructions (rsqrt, likely also rcp/sqrt)
sm_103 has a separate 618-byte supplementary latency table -- the smallest in the binary -- suggesting minimal scheduling parameter changes from sm_100.
Handler Functions
All unique from sm_100:
| Map | Function |
|---|---|
| Handler A | sub_608F20 |
| Handler B | sub_609D20 |
| Intrinsic init | sub_60A700 |
| Perf/occupancy E | sub_609020 |
| Perf/occupancy F | sub_6091A0 |
Profile Object
SM name: "sm_103"
Compute name: "compute_103"
LTO name: "lto_103"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1030"
SM 110 / SM 110a / SM 110f -- Jetson Thor
sm_110 targets the Jetson Thor SoC for automotive and robotics applications. Codegen factory 36868 (0x9004), CUDA_ARCH 1100.
Products
Jetson Thor (automotive-grade SoC with integrated GPU). Originally internally designated sm_101 before rename.
sm_101 Legacy Alias
sm_101 (with variants sm_101a and sm_101f) was the original internal name for Jetson Thor. It was renamed to sm_110 in a later CUDA release, but all three validation table entries are retained for backward compatibility:
| Table | Entry | PTX ISA | Purpose |
|---|---|---|---|
Base (unk_1D16220) | {101, 8, 6} | 8.6 | Accepts --gpu-name sm_101 in existing PTX files |
Accelerated (unk_1D161C0) | {101, 8, 6} | 8.6 | Accepts sm_101a |
Feature-reduced (unk_1D16160) | {101, 8, 8} | 8.8 | Accepts sm_101f |
The validation tables use bsearch() (sub_484B70 comparator), so both sm_101 and sm_110 are independently findable. However, sub_6765E0 (the profile constructor) registers only sm_110 / sm_110a / sm_110f -- there is no profile object for sm_101. After passing validation, sm_101 must resolve to the sm_110 profile through an internal aliasing path (likely in sub_4B1080, the target directive parser).
The PTX ISA version difference is notable: sm_101 requires PTX 8.6 (same as sm_100), while sm_110 requires PTX 9.0. This reflects the timeline -- sm_101 was named when the Jetson Thor target was first added alongside sm_100, before the sm_110 numbering and PTX 9.0 specification existed.
Key Characteristics
- Full tcgen05 hardware: Retains datacenter-class tensor memory and tensor core features
- Same WGMMA/cluster support: Inherits Hopper-era warpgroup and cluster operations
- SoC-specific constraints: Differentiated from sm_100 through capability flags, not through missing features -- the capability accessor functions (
sub_609F30,sub_608F50) return SoC-appropriate resource limits
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609F30 |
| Handler B | sub_608F50 |
| Intrinsic init | sub_60AA20 |
| Perf/occupancy E | sub_609000 |
| Perf/occupancy F | sub_609280 |
Profile Object
SM name: "sm_110"
Compute name: "compute_110"
LTO name: "lto_110"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1100"
Note: sm_110 uses different xmmword constants for profile fields [5]--[7] compared to sm_100, visible in the profile constructor where v98[5] = xmmword_2027D70; v98[6] = v103 (from v209); v98[7] = v104 (from v207). This encodes different hardware resource parameters.
SM 120 / SM 120a / SM 120f -- Blackwell Consumer
sm_120 targets consumer and enterprise workstation GPUs. Codegen factory 36869 (0x9005), CUDA_ARCH 1200.
Products
RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 (consumer). RTX Blackwell Pro (enterprise workstation).
The tcgen05 Gap
sm_120 is architecturally distinct from sm_100 despite sharing the "Blackwell" family string. The critical difference: no tcgen05 support. The entire tensor memory subsystem (alloc, dealloc, ld, st, commit, cp, shift, mma) is absent. Tensor operations fall back to:
- HMMA/IMMA inherited from sm_70--sm_89 (direct MMA path)
- WGMMA inherited from sm_90 (warpgroup async MMA)
This is gated by SM version checks in the capability accessor functions. The intrinsic table initializer (sub_608DF0) does not register tcgen05 intrinsic handlers for sm_120.
HW Latency Table
sm_120 has a distinct latency model, split into two parts:
| Function | Size | Content |
|---|---|---|
sub_8E9000 | 2.9KB | Base consumer Blackwell table |
sub_8E92E0 | 5.5KB | Extended table (largest individual table in binary) |
The 5.5KB extended table is larger than any other individual latency table, suggesting that sm_120's consumer pipeline has significantly different scheduling characteristics from datacenter sm_100. The consumer pipeline likely has different functional unit counts, memory latencies, and tensor core throughput profiles.
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609E40 |
| Handler B | sub_609C60 |
| Intrinsic init | sub_608DF0 |
| Perf/occupancy E | sub_608FE0 |
| Perf/occupancy F | sub_609520 |
Profile Object
SM name: "sm_120"
Compute name: "compute_120"
LTO name: "lto_120"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1200"
sm_120 uses a third distinct set of xmmword constants for profile fields, including xmmword_2027DC0 at field [6], confirming different hardware resource parameters from both sm_100 and sm_110.
SM 121 / SM 121a / SM 121f -- DGX Spark
sm_121 targets the DGX Spark desktop AI workstation. Codegen factory 36869 (0x9005), CUDA_ARCH 1210.
Products
NVIDIA DGX Spark (desktop AI workstation with Grace CPU + Blackwell GPU).
Relationship to sm_120
sm_121 shares the same codegen factory sub-variant (5) as sm_120 in the scheduling sub-architecture table, and inherits sm_120's xmmword profile constants. This suggests sm_121 is a binned or slightly modified sm_120 die, similar to how sm_86 relates to sm_80 in the Ampere generation.
Like sm_120, sm_121 has no tcgen05 support -- tensor operations use the HMMA/IMMA/WGMMA path.
Handler Functions
All unique from sm_120:
| Map | Function |
|---|---|
| Handler A | sub_609ED0 |
| Handler B | sub_609BA0 |
| Intrinsic init | sub_60A4E0 |
| Perf/occupancy E | sub_609040 |
| Perf/occupancy F | sub_6097C0 |
Profile Object
SM name: "sm_121"
Compute name: "compute_121"
LTO name: "lto_121"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1210"
Blackwell Uniform Register ISA
Blackwell extends the uniform register (UR) ISA introduced in Turing/Ampere with dedicated uniform ALU instructions:
| SASS Instruction | Operation | Notes |
|---|---|---|
UFADD | Uniform floating-point add | New in Blackwell |
UFFMA | Uniform fused multiply-add | New in Blackwell |
UFSEL | Uniform floating-point select | New in Blackwell |
UFSETP | Uniform FP set-predicate | New in Blackwell |
UVIADDR | Uniform integer-to-address | New in Blackwell |
These instructions execute on the uniform datapath (UDP, functional unit index 9), allowing floating-point uniform computations to stay in the UR file without round-tripping through the R file. Mercury encoding assigns major opcode 0x0E with 6 variants (sub_10C0550) for uniform ALU.
Architecture Version Threshold Checks
All Blackwell targets share codegen factory 36864+ (0x9000+). The binary uses these thresholds to gate Blackwell-specific features:
| Check Pattern | Threshold | Meaning |
|---|---|---|
encoded > 36863 | > sm_90 extended | Blackwell warp geometry (16 warps, 240 slots) |
codegen_factory >= 36864 | >= sm_100 | Blackwell generation features |
codegen_factory == 36864 | sm_100 exactly | sm_100-specific paths |
SM_number > 99 | sm_100+ | Capsule Mercury auto-enable |
sub_70FA00(*, 29) | -- | tcgen05 capability query (SM-specific) |
The tcgen05 gating is not a simple threshold -- it uses a per-SM capability query (sub_70FA00 with argument 29) that returns true for sm_100/103/110 and false for sm_120/121.
BB Initialization (Secondary Encoding)
The basic block initializer sub_6E8EB0 uses a secondary encoding space for Blackwell:
| Secondary Encoding | SM | Flags |
|---|---|---|
20480 (0x5000) | sm_100 | Instruction set flags for datacenter |
20484 (0x5004) | sm_103 | XOR 0x10, XOR 0x40 for Ultra variants |
This secondary encoding uses generation 5 in the BB init context (5 << 12), separate from the primary codegen factory's generation 9.
Scheduling Profile Differences
Blackwell targets share the 16-warp / 240-slot geometry with Hopper but have distinct latency tables:
| SM | Latency Table | Size | Structure |
|---|---|---|---|
| sm_100 | sub_8E8A90 | 3.0KB | Base + 949B TCGEN05 supplement |
| sm_103 | (supplementary) | 618B | Smallest table -- minimal delta from sm_100 |
| sm_110 | (shared with sm_100 or dedicated) | -- | Not separately identified in sweep |
| sm_120 | sub_8E9000 + sub_8E92E0 | 2.9KB + 5.5KB | Two-part consumer table |
| sm_121 | (likely shares sm_120's table) | -- | Same sub-variant index as sm_120 |
| Universal | sub_8E97B0 | 8.8KB | Fallback / universal |
The 5.5KB extended table for sm_120 is the largest individual latency table in the binary, reflecting the consumer microarchitecture's distinct pipeline design. The sm_100 table uses a supplement mechanism for TCGEN05-specific scheduling classes that the consumer sm_120 table does not need (since sm_120 lacks tcgen05).
Scheduling Class Assignments
From the opcode-to-scheduling-class mapper sub_89FBA0 (85KB), Blackwell-era opcodes use high-numbered scheduling classes:
| Class Range | Category | Architecture |
|---|---|---|
| 700--772+ | Mercury/Blackwell tensor ops | sm_100+ with tcgen05 |
| 745 | WGMMA primary | sm_90+ (Hopper carry-forward) |
| 744 | WGMMA variant | sm_90+ |
| 765--767 | BGMMA/QMMA (Blackwell-specific MMA types) | sm_100+ |
| 759 | HMMA/BMMA tensor core | sm_100+ |
| 757, 761 | Narrow/wide DP tensor | sm_100+ |
| 600, 604 | Tensor fence / tensor sync | sm_90+ |
Intrinsic Table
Blackwell intrinsic availability is cumulative -- all sm_70, sm_80, sm_8x, and sm_9x intrinsics carry forward. Blackwell adds two new intrinsic groups:
sm_10x Intrinsics (21 entries)
| ID Range | Count | Namespace | Category |
|---|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* | Trap on guardrail validation failure |
| 0x230--0x239 | 10 | __cuda_sm_10x_* | hmma/imma mdata + bit MMA (Blackwell-specific shapes) |
sm_1xx Intrinsics (18 entries)
| Namespace | Count | Category |
|---|---|---|
__cuda_sm1xx_* | 18 | cp.async.bulk.tensor 1D--5D tile/im2col (extended bulk copy) |
OCG (On-Chip Generated) Intrinsics
The OCG builtin name table at sub_6C9EB0 (13KB) contains the master list of Blackwell+ runtime-generated intrinsic names:
cp_async_bulk, cp_red_async_bulk, cp_async_tensor,
cp_async_prefetch_tensor, fence_view_async,
viaddmax, viaddmin, viadd, vimax, vimin, vimax3, vimin3,
write_async, tcbar, mmareadshma, tcmma,
tcshift, tcatomsws, tcldsws, tcstsws,
gdesc, breuse, bkeep, virtcount,
memclear, acqshminit, sparsify, spfactor2to4,
2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit,
16dp32bitt0t15, 16dp32bitt16t31,
selfcast, broadcast, findandset, align
These names represent the Blackwell SASS operations exposed through the OCG intrinsic interface, covering tensor core scheduling (tcbar, tcmma, tcshift), sparse operations (sparsify, spfactor2to4), integer min/max variants (viaddmax, viaddmin), and async memory operations.
New PTX Instructions (Blackwell-Specific)
Beyond tcgen05, Blackwell introduces or extends several instruction families visible in the opcode dispatch:
| Instruction | Category | Evidence |
|---|---|---|
tcgen05.* (11 instructions) | Tensor memory ops | sub_5D4190 registration |
fence_view_async | Memory ordering | OCG builtin table |
write_async | Async writes | OCG builtin table |
viaddmax / viaddmin | Integer add-with-max/min | OCG builtin table |
| BGMMA / QMMA | Block/quantized MMA | Scheduling classes 765--767 |
CLI Options -- Tensor Memory Checks
Two CLI options control tcgen05 guardrail instrumentation:
| Option | Short Form | Default | Description |
|---|---|---|---|
--g-tensor-memory-access-check | -g-tmem-access-check | Enabled with -g | Enable tensor memory access checks for tcgen05 operations |
--gno-tensor-memory-access-check | -gno-tmem-access-check | false | Disable checks (overrides the above) |
When enabled, the compiler inserts inline validation code (the 8 guardrail functions) around tcgen05 operations. These emit trap instructions if tensor memory invariants are violated at runtime -- useful for debugging TMEM allocation errors, bounds violations, and ownership conflicts.
Feature Comparison
| Feature | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Codegen factory | 36864 | 36867 | 36868 | 36869 | 36869 |
| Sub-arch variant | 0 | 3 | 4 | 5 | 5 |
| Family string | Blackwell | Blackwell | Blackwell | Blackwell | Blackwell |
| Capsule Mercury default | Yes | Yes | Yes | Yes | Yes |
| tcgen05 (tensor memory) | Yes | Yes | Yes | No | No |
| WGMMA (from Hopper) | Yes | Yes | Yes | Yes | Yes |
| Cluster operations | Yes | Yes | Yes | Yes | Yes |
| setmaxnreg | Yes | Yes | Yes | Yes | Yes |
| Uniform FP ALU (UFADD etc.) | Yes | Yes | Yes | Yes | Yes |
| BGMMA/QMMA | Yes | Yes | Yes | ? | ? |
| Guardrail instrumentation | Yes | Yes | Yes | N/A | N/A |
| HW latency table | 3.0KB + 949B | 618B (supp.) | -- | 2.9KB + 5.5KB | (shared w/ sm_120) |
a sub-variant | Yes | Yes | Yes | Yes | Yes |
f sub-variant | Yes | Yes | Yes | Yes | Yes |
| Products | B100/B200 | GB300 | Jetson Thor | RTX 50xx | DGX Spark |
SASS Instruction Encoding
Blackwell continues the 128-bit per-instruction format introduced in Turing. The Mercury encoder handles the SM100+ instruction set through a dedicated encoding subsystem spanning approximately 851KB in the address range 0xDFC000--0x107B000.
The encoding subsystem covers 16 instruction format groups with full Blackwell ISA support including:
- Standard ALU/FPU/memory operations (inherited from earlier architectures)
- TCGEN05 tensor memory operations (new encoding classes)
- BGMMA/QMMA block-scale and quantized MMA variants
- Extended bulk copy operations (UBLKCP variants)
- Sparse tensor operations
Function Map
| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
sub_607DB0 | 14KB | Capability dispatch (all Blackwell registrations) | all | 99% |
sub_608DF0 | ~1KB | Intrinsic table initializer | sm_120 | 85% |
sub_608F20 | ~1.2KB | Handler A (capability accessor) | sm_103 | 90% |
sub_608F50 | ~1.2KB | Handler B (capability accessor) | sm_110 | 90% |
sub_609000 | ~200B | Perf/occupancy E | sm_110 | 85% |
sub_609020 | ~200B | Perf/occupancy E | sm_103 | 85% |
sub_609040 | ~200B | Perf/occupancy E | sm_121 | 85% |
sub_609080 | ~200B | Perf/occupancy E | sm_100 | 85% |
sub_6091A0 | ~200B | Perf/occupancy F | sm_103 | 85% |
sub_609280 | ~200B | Perf/occupancy F | sm_110 | 85% |
sub_609520 | ~200B | Perf/occupancy F | sm_120 | 85% |
sub_6097C0 | ~200B | Perf/occupancy F | sm_121 | 85% |
sub_6098A0 | ~200B | Perf/occupancy F | sm_100 | 85% |
sub_609BA0 | ~48B | Handler B | sm_121 | 99% |
sub_609BD0 | ~48B | Handler B | sm_100 | 99% |
sub_609C30 | ~48B | Handler A | sm_100 | 99% |
sub_609C60 | ~48B | Handler B | sm_120 | 99% |
sub_609D20 | ~48B | Handler B | sm_103 | 99% |
sub_609E40 | ~48B | Handler A | sm_120 | 99% |
sub_609ED0 | ~48B | Handler A | sm_121 | 99% |
sub_609F30 | ~48B | Handler A | sm_110 | 99% |
sub_60A4E0 | ~1KB | Intrinsic table initializer | sm_121 | 85% |
sub_60A700 | ~1KB | Intrinsic table initializer | sm_103 | 85% |
sub_60A910 | ~1KB | Intrinsic table initializer | sm_100 | 85% |
sub_60AA20 | ~1KB | Intrinsic table initializer | sm_110 | 85% |
sub_60F290 | est. | Off-target capmerc compatibility checker | all | 75% |
sub_612DE0 | 47KB | Kernel finalizer / ELF builder | all | 80% |
sub_6765E0 | 54KB | Profile constructor (Blackwell entries at lines 600--1330) | all | 95% |
sub_703AB0 | 10KB | CLI option parser (capmerc/mercury/sass) | all | 90% |
sub_70D910 | 24 lines | tcgen05 immhalfSplitOff helper | sm_100 | 90% |
sub_70DDB0 | 47 lines | tcgen05 funcRetArr helper | sm_100 | 90% |
sub_70E0E0 | 296 lines | tcgen05 guardrail bounds checker | sm_100 | 90% |
sub_8E4400 | 3.3KB | Warp geometry initializer (Blackwell = 16 warps, 240 slots) | all | 90% |
sub_8E8A90 | 3.0KB | HW latency table (Blackwell datacenter) | sm_100 | 85% |
sub_8E9000 | 2.9KB | HW latency table (consumer base) | sm_120 | 85% |
sub_8E92E0 | 5.5KB | HW latency table (consumer extended) | sm_120 | 85% |
sub_8E97B0 | 8.8KB | Universal fallback latency table | all | 85% |
sub_89FBA0 | 85KB | Opcode-to-scheduling-class mapper | all | 90% |
sub_5BBC30 | 90KB | tcgen05.mma codegen handler | sm_100 | 98% |
sub_5D4190 | -- | Opcode dispatch table (tcgen05 registrations) | all | 99% |
sub_1C9C300 | 24KB | Capsule Mercury section processor | all | 85% |
sub_1C9B110 | 23KB | Mercury capsule builder | all | 85% |
Cross-References
- SM Architecture Map -- Validation tables, codegen factory encoding, suffix semantics
- Turing & Ampere (SM 75--88) -- Baseline features that Blackwell inherits
- Ada & Hopper (SM 89--90a) -- WGMMA, clusters, TMA -- carried forward to Blackwell
- TCGen05 -- 5th Gen Tensor Cores -- Detailed tcgen05 ISA, TMEM model, instruction encoding
- Capsule Mercury & Finalization -- Capmerc format, ELF structure, finalization levels
- Mercury Encoder -- Shared encoding pipeline for all Mercury/capmerc output
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with sm_10x and sm_1xx ranges
- Latency Model & HW Profiles -- Per-SM scheduling parameters and functional units
- Uniform Register Optimization -- UR-file passes, Blackwell uniform ALU additions
- CLI Options --
--gpu-name,--binary-kind, tensor memory check flags - SASS Instruction Encoding -- 128-bit format, Mercury encoder selection