Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Blackwell (SM 100--121)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas v13.0.88 handles five Blackwell-era base targets -- sm_100, sm_103, sm_110, sm_120, sm_121 -- spanning datacenter, automotive, consumer, and DGX product lines. All share the codegen factory value 36864 (generation 9, 9 << 12) and the "Blackwell" family string internally, despite being distinct microarchitectures. The defining Blackwell feature is Capsule Mercury (capmerc) as the default binary output format, automatically enabled for SM numbers exceeding 99. The datacenter variants (sm_100, sm_103, sm_110) support tcgen05 (5th-generation tensor cores with dedicated tensor memory); the consumer variants (sm_120, sm_121) do not.

SM targetssm_100, sm_103, sm_110, sm_120, sm_121 (+ a and f sub-variants each)
Codegen factory36864 (0x9000, generation 9)
Family string"Blackwell" (all five targets)
Default binary formatCapsule Mercury (capmerc) -- auto-enabled for SM > 99
SASS encoding128-bit per instruction (Mercury-encoded)
Warp geometry16 warps, 240 dispatch slots (shared with Hopper sm_90)
Sub-variants per SM3: base, a (accelerated), f (feature-reduced)
Profile constructorsub_6765E0 (54KB)
Capability dispatchsub_607DB0 (7 hash maps, once-guarded)

SM Version Table

SMProduct__CUDA_ARCH__Codegen FactoryHexVariant
sm_100B100 / B200 (datacenter)1000368640x90000 (gen 9 base)
sm_103GB300 (Blackwell Ultra)1030368670x90033
sm_110Jetson Thor SoC1100368680x90044
sm_120RTX 50xx / RTX Pro1200368690x90055
sm_121DGX Spark1210368690x90055

Codegen factory encoding: (9 << 12) | sub_variant. sm_100 is variant 0 (generation base). sm_103 is variant 3. sm_110 is variant 4. sm_120 and sm_121 appear to share variant 5 in the scheduling sub-architecture table at sub_8E4400.

Unreleased SM numbers referenced in the binary: The SASS formatter sub_583190 (rsqrt) checks for SM codes 102, 103, 107, 124, 130 in architecture-specific dispatch paths, suggesting internal/future variants beyond the five publicly exposed targets.

Sub-Variant System

Every Blackwell SM has three sub-variants. The base and a/f variants within an SM share all 7 dispatch table handler functions -- they are identical silicon with different compatibility metadata and feature exposure.

Profile Object Fields

The profile constructor sub_6765E0 builds profile objects for each sub-variant with these fields:

SMBasea Variantf Variant
sm_100sm_100 / compute_100 / lto_100sm_100a / compute_100a / lto_100asm_100f / compute_100f / lto_100f
sm_103sm_103 / compute_103 / lto_103sm_103a / compute_103a / lto_103asm_103f / compute_103f / lto_103f
sm_110sm_110 / compute_110 / lto_110sm_110a / compute_110a / lto_110asm_110f / compute_110f / lto_110f
sm_120sm_120 / compute_120 / lto_120sm_120a / compute_120a / lto_120asm_120f / compute_120f / lto_120f
sm_121sm_121 / compute_121 / lto_121sm_121a / compute_121a / lto_121asm_121f / compute_121f / lto_121f

CUDA_ARCH Macro Values

Sub-Variantsm_100sm_103sm_110sm_120sm_121
Base-D__CUDA_ARCH__=1000=1030=1100=1200=1210
Accelerated=100a0=103a0=110a0=120a0=121a0
Feature-reduced=100f0=103f0=110f0=120f0=121f0

Suffix Bit Flags in Profile Objects

From the decompiled profile constructor, suffixed variants set specific byte flags:

SuffixFlag PositionEvidence
a (accelerated)profile[4] = 1v79->m128i_i8[4] = 1; *(_BYTE *)(v82 + 4) = 1;
f (feature-reduced)profile[5] = 1 (on all 3 objects: sm, compute, lto)v88->m128i_i8[5] = 1; *(_BYTE *)(v91 + 5) = 1; v94[5] = 1;

The a flag is set on the SM and compute profile objects only. The f flag is set on all three (sm, compute, lto), reflecting the fact that f-compiled code must retain its feature-reduced metadata through linking.

isaClass Inheritance

Sub-variants reference their base SM's isaClass rather than defining a new one:

  • sm_100a and sm_100f reference "(profile_sm_100)->isaClass"
  • sm_103a and sm_103f reference "(profile_sm_103)->isaClass"
  • sm_120a and sm_120f reference "(profile_sm_120)->isaClass"
  • sm_121a and sm_121f reference "(profile_sm_121)->isaClass"

This confirms that sub-variants share the instruction set architecture class with their base. The a/f distinction is purely in compatibility metadata, not in the ISA or codegen.

Capability Dispatch

sub_607DB0 registers handler functions into 7 parallel hash maps. All sub-variants of a given SM register the same function pointers.

Handler Assignments (Maps 1--3)

SMHandler A (Map 1)Handler B (Map 2)Intrinsic Init (Map 3)
sm_100 / 100a / 100fsub_609C30sub_609BD0sub_60A910
sm_103 / 103a / 103fsub_608F20sub_609D20sub_60A700
sm_110 / 110a / 110fsub_609F30sub_608F50sub_60AA20
sm_120 / 120a / 120fsub_609E40sub_609C60sub_608DF0
sm_121 / 121a / 121fsub_609ED0sub_609BA0sub_60A4E0

Performance / Occupancy Handlers (Maps 6--7)

SMHandler E (Map 6)Handler F (Map 7)
sm_100 / 100a / 100fsub_609080sub_6098A0
sm_103 / 103a / 103fsub_609020sub_6091A0
sm_110 / 110a / 110fsub_609000sub_609280
sm_120 / 120a / 120fsub_608FE0sub_609520
sm_121 / 121a / 121fsub_609040sub_6097C0

Every Blackwell SM has unique handler functions in all 7 maps. This contrasts with Hopper where sm_90 and sm_90a share all handlers. Each Blackwell SM variant is architecturally distinct enough to warrant separate capability accessors, performance models, and intrinsic tables.

Warp Geometry

The warp geometry initializer at sub_8E4400 uses the codegen factory value to select dispatch parameters. All Blackwell targets (codegen factory > 36863) fall into the maximum bucket:

encoded >  36863  -> 16 warps, 240 dispatch slots

This is identical to Hopper (sm_90). The 16-warp / 240-slot geometry supports Blackwell's warpgroup execution model (4 warps per warpgroup, 4 warpgroups per SM partition).

Sub-Architecture Variant Table

The secondary variant assignment at sub_8E4400 maps codegen factory values to sub-architecture indices:

Codegen FactoryVariantSM
368640sm_100 (base)
368673sm_103
368684sm_110
368695sm_120, sm_121

These variant indices select different entries within the per-SM latency tables, allowing the scheduler to use silicon-specific pipeline timing.

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

SMRegs/SMMax Regs/ThreadMax Threads/CTAWarps/SMMax CTAs/SMSched PartitionsDispatch SlotsConfigurable Shared MemoryConf
sm_10065,5362551,024643216 / 24024048 / 100 / 132 / 164 / 228 KB90%
sm_10365,5362551,024643216 / 240240(same as sm_100)88%
sm_11065,5362551,024643216 / 240240(same as sm_100)85%
sm_12065,5362551,024643216 / 24024048 / 100 / 132 / 164 / 228 KB88%
sm_12165,5362551,024643216 / 240240(same as sm_120)85%

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block).
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

All Blackwell targets share the 16-partition / 240-slot geometry (identical to Hopper sm_90). The a and f sub-variants within each SM share the same geometry -- differentiation is in compatibility metadata and feature exposure, not in resource limits. The primary distinction across Blackwell SMs is in the latency tables and tcgen05 availability, not in the scheduling partition structure.

Capsule Mercury (capmerc) -- Default Output Format

Capsule Mercury is automatically enabled for all Blackwell targets. When the SM architecture number exceeds 99, ptxas sets the capmerc flag at offset+81 in the compilation context. This applies to sm_100, sm_103, sm_110, sm_120, and sm_121 uniformly.

Three Output Modes

ModeStringDefault ForSM Range
mercury"mercury"sm_75 -- sm_90Turing through Hopper
capmerc"capmerc"sm_100 -- sm_121All Blackwell
sass"sass"None (explicit only)Any

Capsule Mercury vs Mercury

Both modes use the same Mercury encoder pipeline (phases 117--122). The capmerc distinction is at the ELF emission level:

  • Mercury produces a fully-resolved SASS binary in .text.<funcname> sections
  • Capsule Mercury wraps Mercury-encoded instructions in .nv.capmerc<funcname> sections with a 328-byte capsule descriptor, plus .nv.merc.* debug/metadata sections

The capsule descriptor (constructed by sub_1C9C300, 24KB) contains the Mercury instruction stream, relocation metadata (R_MERCURY_* types), KNOBS compilation configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage).

Opportunistic Finalization

Capsule Mercury enables deferred finalization -- compiling once for one SM and reconstituting SASS for a different SM at link or load time.

LevelNameBehaviorExample
0defaultStandard finalization (compile-target only)--
1noneNo finalization; output stays as capmerc--
2intra-familyFinalize within same SM familysm_100 -> sm_103
3intra+interFinalize across SM familiessm_100 -> sm_120

The compatibility checker sub_60F290 determines whether a capmerc binary compiled for SM X can be finalized for SM Y. On success, ptxas emits: "applied for off-target %u -> %u finalization".

Self-Check Mechanism

The --self-check CLI option performs roundtrip verification:

  1. Generate capmerc output (Mercury encoding + metadata)
  2. Reconstitute SASS from the capmerc data
  3. Compare section-by-section; report error codes 17 (content mismatch), 18 (count mismatch), 19 (metadata mismatch)

The reconstituted SASS can be dumped with --out-sass for debugging self-check failures.

TCGen05 -- 5th Generation Tensor Cores

TCGen05 is the defining hardware feature of Blackwell datacenter parts. It introduces tensor memory (TMEM) as a dedicated register-like storage directly connected to the tensor core, eliminating the shared-memory bottleneck of previous WGMMA designs.

SM Availability

SMtcgen05 AvailableNotes
sm_100 / 100a / 100fYesFull datacenter tcgen05
sm_103 / 103a / 103fYesBlackwell Ultra -- same tcgen05 ISA
sm_110 / 110a / 110fYesJetson Thor -- full tcgen05 hardware
sm_120 / 120a / 120fNoConsumer -- no TMEM, no tcgen05
sm_121 / 121a / 121fNoDGX Spark -- no TMEM, no tcgen05

The tcgen05 ISA is gated by SM version checks (visible as sub_70FA00(*, 29) capability queries). sm_120 and sm_121 fall back to inherited HMMA/IMMA/WGMMA tensor core paths.

PTX Instructions

Registered in the opcode dispatch table at sub_5D4190:

PTX InstructionCodegen HandlerFormatterSize
tcgen05.allocsub_569180sub_5263701287B
tcgen05.relinquish_alloc_permitsub_526370----
tcgen05.deallocsub_58C7F0sub_5740502130B
tcgen05.ldsub_574050sub_578DB02466B
tcgen05.ld.redsub_578DB0----
tcgen05.stsub_571FE0sub_56C1901842B
tcgen05.commitsub_56C190sub_5427F01575B
tcgen05.cpsub_5427F0sub_4F1A90903B
tcgen05.shiftsub_4F1A90sub_58FA204604B
tcgen05.mmasub_5BBC30 (90KB)----
tcgen05.mma.wssub_58FA20sub_4DA720343B

The tcgen05.mma codegen handler at sub_5BBC30 is 90KB -- the largest single-instruction handler in ptxas -- reflecting the complexity of 5th-gen tensor core MMA with tensor memory operands, scale factors, sparsity, and accumulator management.

TCGen05 Guardrail Functions

Eight debug/validation functions provide runtime instrumentation for tensor memory operations when compiled with --g-tensor-memory-access-check (or -g-tmem-access-check):

GuardrailFormatterSizeValidation
_tcgen05.guardrails.is_phase_validsub_4DA720775BPhase lifecycle
_tcgen05.guardrails.are_columns_allocatedsub_4DDE70599BColumn allocation
_tcgen05.guardrails.is_current_warp_valid_ownersub_4DBF20791BWarp ownership
_tcgen05.guardrails.in_physical_boundssub_4DB050439BMemory bounds
_tcgen05.guardrails.allocation_granularitysub_4F0960839BAllocation alignment
_tcgen05.guardrails.datapath_alignmentsub_4DD580735BData path checks
_tcgen05.guardrails.sp_consistency_across_idesc_modsub_500FA0970BSparse descriptor
_tcgen05.guardrails.check_sparse_usagesub_4DDB80743BSparsity validation

The bounds checker (sub_70E0E0, 296 lines decompiled) generates inline PTX code to validate tensor memory column counts, extracting bitfield positions from tcgen05 descriptors (e.g., and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8).

TCGen05 Intrinsics

ID RangeCountCategory
0x20--0x2A11__cuda_sm10x_tcgen05_guardrail_trap_* (trap on validation failure)
0x230--0x23910__cuda_sm_10x_* (hmma/imma mdata + bit MMA)

Additional tcgen05 helper intrinsics observed in decompiled code:

  • __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff -- load-reduce with immediate half-split offset
  • __cuda_sm_100_tcgen05_ld_immhalfSplitOff -- load with immediate half-split offset
  • __cuda_sm_100_tcgen05_st_immhalfSplitOff -- store with immediate half-split offset
  • __cuda_sm_100_tcgen05_ld_red_funcRetArr -- load-reduce returning array
  • __cuda_sm_100_tcgen05_ld_funcRetArr -- load returning array

These helpers (decompiled in sub_70D910 and sub_70DDB0) generate inline PTX for complex tensor memory access patterns including array returns via ld.param.b32 sequences.

Bulk Copy Intrinsics (sm_1xx)

18 intrinsics in the __cuda_sm1xx_* namespace cover cp.async.bulk.tensor 1D--5D in tile and im2col modes. These extend the Hopper TMA infrastructure with Blackwell-specific enhancements.

SM 100 / SM 100a / SM 100f -- Blackwell Datacenter

sm_100 is the reference Blackwell architecture. Codegen factory 36864 (0x9000), CUDA_ARCH 1000.

Products

B100, B200 (datacenter GPU), paired as GB200 NVL72 superchips.

Key Features

  • TCGen05: Full 5th-gen tensor core with TMEM (alloc, dealloc, ld, st, commit, cp, shift, mma, mma.ws)
  • Capsule Mercury: Default output format (auto-enabled for SM > 99)
  • WGMMA inherited: Warpgroup MMA from Hopper carries forward
  • Cluster operations: Thread-block clusters, distributed shared memory (from Hopper)
  • setmaxnreg: Dynamic register allocation (from Hopper)
  • Uniform register ALU: UFADD, UFFMA, UFSEL, UFSETP, UVIADDR (Blackwell uniform register ISA additions)

Handler Functions

MapFunctionRole
Handler Asub_609C30Primary codegen capability accessor
Handler Bsub_609BD0Secondary codegen capability accessor
Intrinsic initsub_60A910Intrinsic table population
Perf/occupancy Esub_609080Performance statistics
Perf/occupancy Fsub_6098A0Occupancy calculator

HW Latency Table

sub_8E8A90 (3.0KB) -- the base Blackwell latency table. Two-part structure: a 3.0KB base table for standard instructions plus a ~949-byte TCGEN05 supplement covering tensor core scheduling classes 745--772+.

Profile Object

From sub_6765E0:

SM name:       "sm_100"
Compute name:  "compute_100"
LTO name:      "lto_100"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1000"

The profile constructor stores dword_29FE2C4 = 100 after constructing all sm_100 sub-variants, likely recording the current highest-registered base SM number.

SM 103 / SM 103a / SM 103f -- Blackwell Ultra

sm_103 is Blackwell Ultra, targeting the GB300 NVL72 platform. Codegen factory 36867 (0x9003), CUDA_ARCH 1030.

Products

GB300 (datacenter, Blackwell Ultra). Incremental silicon revision over sm_100.

Differentiation from sm_100

The SASS formatter sub_583190 (rsqrt instruction) explicitly checks for "sm_103" and applies a Blackwell Ultra-specific operand layout. The formatter dispatch first tests raw SM codes (102, 103, 107, 130, 124), then uses check_target_sm(instr, 0, "sm_103") for string-based validation.

From sweep data on the SM103-specific encoding path:

  • Sets encoding flags XOR 0x10, XOR 0x40 for sm_103-specific instruction variants
  • New operand layout for transcendental instructions (rsqrt, likely also rcp/sqrt)

sm_103 has a separate 618-byte supplementary latency table -- the smallest in the binary -- suggesting minimal scheduling parameter changes from sm_100.

Handler Functions

All unique from sm_100:

MapFunction
Handler Asub_608F20
Handler Bsub_609D20
Intrinsic initsub_60A700
Perf/occupancy Esub_609020
Perf/occupancy Fsub_6091A0

Profile Object

SM name:       "sm_103"
Compute name:  "compute_103"
LTO name:      "lto_103"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1030"

SM 110 / SM 110a / SM 110f -- Jetson Thor

sm_110 targets the Jetson Thor SoC for automotive and robotics applications. Codegen factory 36868 (0x9004), CUDA_ARCH 1100.

Products

Jetson Thor (automotive-grade SoC with integrated GPU). Originally internally designated sm_101 before rename.

sm_101 Legacy Alias

sm_101 (with variants sm_101a and sm_101f) was the original internal name for Jetson Thor. It was renamed to sm_110 in a later CUDA release, but all three validation table entries are retained for backward compatibility:

TableEntryPTX ISAPurpose
Base (unk_1D16220){101, 8, 6}8.6Accepts --gpu-name sm_101 in existing PTX files
Accelerated (unk_1D161C0){101, 8, 6}8.6Accepts sm_101a
Feature-reduced (unk_1D16160){101, 8, 8}8.8Accepts sm_101f

The validation tables use bsearch() (sub_484B70 comparator), so both sm_101 and sm_110 are independently findable. However, sub_6765E0 (the profile constructor) registers only sm_110 / sm_110a / sm_110f -- there is no profile object for sm_101. After passing validation, sm_101 must resolve to the sm_110 profile through an internal aliasing path (likely in sub_4B1080, the target directive parser).

The PTX ISA version difference is notable: sm_101 requires PTX 8.6 (same as sm_100), while sm_110 requires PTX 9.0. This reflects the timeline -- sm_101 was named when the Jetson Thor target was first added alongside sm_100, before the sm_110 numbering and PTX 9.0 specification existed.

Key Characteristics

  • Full tcgen05 hardware: Retains datacenter-class tensor memory and tensor core features
  • Same WGMMA/cluster support: Inherits Hopper-era warpgroup and cluster operations
  • SoC-specific constraints: Differentiated from sm_100 through capability flags, not through missing features -- the capability accessor functions (sub_609F30, sub_608F50) return SoC-appropriate resource limits

Handler Functions

MapFunction
Handler Asub_609F30
Handler Bsub_608F50
Intrinsic initsub_60AA20
Perf/occupancy Esub_609000
Perf/occupancy Fsub_609280

Profile Object

SM name:       "sm_110"
Compute name:  "compute_110"
LTO name:      "lto_110"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1100"

Note: sm_110 uses different xmmword constants for profile fields [5]--[7] compared to sm_100, visible in the profile constructor where v98[5] = xmmword_2027D70; v98[6] = v103 (from v209); v98[7] = v104 (from v207). This encodes different hardware resource parameters.

SM 120 / SM 120a / SM 120f -- Blackwell Consumer

sm_120 targets consumer and enterprise workstation GPUs. Codegen factory 36869 (0x9005), CUDA_ARCH 1200.

Products

RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 (consumer). RTX Blackwell Pro (enterprise workstation).

The tcgen05 Gap

sm_120 is architecturally distinct from sm_100 despite sharing the "Blackwell" family string. The critical difference: no tcgen05 support. The entire tensor memory subsystem (alloc, dealloc, ld, st, commit, cp, shift, mma) is absent. Tensor operations fall back to:

  • HMMA/IMMA inherited from sm_70--sm_89 (direct MMA path)
  • WGMMA inherited from sm_90 (warpgroup async MMA)

This is gated by SM version checks in the capability accessor functions. The intrinsic table initializer (sub_608DF0) does not register tcgen05 intrinsic handlers for sm_120.

HW Latency Table

sm_120 has a distinct latency model, split into two parts:

FunctionSizeContent
sub_8E90002.9KBBase consumer Blackwell table
sub_8E92E05.5KBExtended table (largest individual table in binary)

The 5.5KB extended table is larger than any other individual latency table, suggesting that sm_120's consumer pipeline has significantly different scheduling characteristics from datacenter sm_100. The consumer pipeline likely has different functional unit counts, memory latencies, and tensor core throughput profiles.

Handler Functions

MapFunction
Handler Asub_609E40
Handler Bsub_609C60
Intrinsic initsub_608DF0
Perf/occupancy Esub_608FE0
Perf/occupancy Fsub_609520

Profile Object

SM name:       "sm_120"
Compute name:  "compute_120"
LTO name:      "lto_120"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1200"

sm_120 uses a third distinct set of xmmword constants for profile fields, including xmmword_2027DC0 at field [6], confirming different hardware resource parameters from both sm_100 and sm_110.

SM 121 / SM 121a / SM 121f -- DGX Spark

sm_121 targets the DGX Spark desktop AI workstation. Codegen factory 36869 (0x9005), CUDA_ARCH 1210.

Products

NVIDIA DGX Spark (desktop AI workstation with Grace CPU + Blackwell GPU).

Relationship to sm_120

sm_121 shares the same codegen factory sub-variant (5) as sm_120 in the scheduling sub-architecture table, and inherits sm_120's xmmword profile constants. This suggests sm_121 is a binned or slightly modified sm_120 die, similar to how sm_86 relates to sm_80 in the Ampere generation.

Like sm_120, sm_121 has no tcgen05 support -- tensor operations use the HMMA/IMMA/WGMMA path.

Handler Functions

All unique from sm_120:

MapFunction
Handler Asub_609ED0
Handler Bsub_609BA0
Intrinsic initsub_60A4E0
Perf/occupancy Esub_609040
Perf/occupancy Fsub_6097C0

Profile Object

SM name:       "sm_121"
Compute name:  "compute_121"
LTO name:      "lto_121"
Family:        "Blackwell"
CUDA_ARCH:     "-D__CUDA_ARCH__=1210"

Blackwell Uniform Register ISA

Blackwell extends the uniform register (UR) ISA introduced in Turing/Ampere with dedicated uniform ALU instructions:

SASS InstructionOperationNotes
UFADDUniform floating-point addNew in Blackwell
UFFMAUniform fused multiply-addNew in Blackwell
UFSELUniform floating-point selectNew in Blackwell
UFSETPUniform FP set-predicateNew in Blackwell
UVIADDRUniform integer-to-addressNew in Blackwell

These instructions execute on the uniform datapath (UDP, functional unit index 9), allowing floating-point uniform computations to stay in the UR file without round-tripping through the R file. Mercury encoding assigns major opcode 0x0E with 6 variants (sub_10C0550) for uniform ALU.

Architecture Version Threshold Checks

All Blackwell targets share codegen factory 36864+ (0x9000+). The binary uses these thresholds to gate Blackwell-specific features:

Check PatternThresholdMeaning
encoded > 36863> sm_90 extendedBlackwell warp geometry (16 warps, 240 slots)
codegen_factory >= 36864>= sm_100Blackwell generation features
codegen_factory == 36864sm_100 exactlysm_100-specific paths
SM_number > 99sm_100+Capsule Mercury auto-enable
sub_70FA00(*, 29)--tcgen05 capability query (SM-specific)

The tcgen05 gating is not a simple threshold -- it uses a per-SM capability query (sub_70FA00 with argument 29) that returns true for sm_100/103/110 and false for sm_120/121.

BB Initialization (Secondary Encoding)

The basic block initializer sub_6E8EB0 uses a secondary encoding space for Blackwell:

Secondary EncodingSMFlags
20480 (0x5000)sm_100Instruction set flags for datacenter
20484 (0x5004)sm_103XOR 0x10, XOR 0x40 for Ultra variants

This secondary encoding uses generation 5 in the BB init context (5 << 12), separate from the primary codegen factory's generation 9.

Scheduling Profile Differences

Blackwell targets share the 16-warp / 240-slot geometry with Hopper but have distinct latency tables:

SMLatency TableSizeStructure
sm_100sub_8E8A903.0KBBase + 949B TCGEN05 supplement
sm_103(supplementary)618BSmallest table -- minimal delta from sm_100
sm_110(shared with sm_100 or dedicated)--Not separately identified in sweep
sm_120sub_8E9000 + sub_8E92E02.9KB + 5.5KBTwo-part consumer table
sm_121(likely shares sm_120's table)--Same sub-variant index as sm_120
Universalsub_8E97B08.8KBFallback / universal

The 5.5KB extended table for sm_120 is the largest individual latency table in the binary, reflecting the consumer microarchitecture's distinct pipeline design. The sm_100 table uses a supplement mechanism for TCGEN05-specific scheduling classes that the consumer sm_120 table does not need (since sm_120 lacks tcgen05).

Scheduling Class Assignments

From the opcode-to-scheduling-class mapper sub_89FBA0 (85KB), Blackwell-era opcodes use high-numbered scheduling classes:

Class RangeCategoryArchitecture
700--772+Mercury/Blackwell tensor opssm_100+ with tcgen05
745WGMMA primarysm_90+ (Hopper carry-forward)
744WGMMA variantsm_90+
765--767BGMMA/QMMA (Blackwell-specific MMA types)sm_100+
759HMMA/BMMA tensor coresm_100+
757, 761Narrow/wide DP tensorsm_100+
600, 604Tensor fence / tensor syncsm_90+

Intrinsic Table

Blackwell intrinsic availability is cumulative -- all sm_70, sm_80, sm_8x, and sm_9x intrinsics carry forward. Blackwell adds two new intrinsic groups:

sm_10x Intrinsics (21 entries)

ID RangeCountNamespaceCategory
0x20--0x2A11__cuda_sm10x_tcgen05_guardrail_trap_*Trap on guardrail validation failure
0x230--0x23910__cuda_sm_10x_*hmma/imma mdata + bit MMA (Blackwell-specific shapes)

sm_1xx Intrinsics (18 entries)

NamespaceCountCategory
__cuda_sm1xx_*18cp.async.bulk.tensor 1D--5D tile/im2col (extended bulk copy)

OCG (On-Chip Generated) Intrinsics

The OCG builtin name table at sub_6C9EB0 (13KB) contains the master list of Blackwell+ runtime-generated intrinsic names:

cp_async_bulk, cp_red_async_bulk, cp_async_tensor,
cp_async_prefetch_tensor, fence_view_async,
viaddmax, viaddmin, viadd, vimax, vimin, vimax3, vimin3,
write_async, tcbar, mmareadshma, tcmma,
tcshift, tcatomsws, tcldsws, tcstsws,
gdesc, breuse, bkeep, virtcount,
memclear, acqshminit, sparsify, spfactor2to4,
2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit,
16dp32bitt0t15, 16dp32bitt16t31,
selfcast, broadcast, findandset, align

These names represent the Blackwell SASS operations exposed through the OCG intrinsic interface, covering tensor core scheduling (tcbar, tcmma, tcshift), sparse operations (sparsify, spfactor2to4), integer min/max variants (viaddmax, viaddmin), and async memory operations.

New PTX Instructions (Blackwell-Specific)

Beyond tcgen05, Blackwell introduces or extends several instruction families visible in the opcode dispatch:

InstructionCategoryEvidence
tcgen05.* (11 instructions)Tensor memory opssub_5D4190 registration
fence_view_asyncMemory orderingOCG builtin table
write_asyncAsync writesOCG builtin table
viaddmax / viaddminInteger add-with-max/minOCG builtin table
BGMMA / QMMABlock/quantized MMAScheduling classes 765--767

CLI Options -- Tensor Memory Checks

Two CLI options control tcgen05 guardrail instrumentation:

OptionShort FormDefaultDescription
--g-tensor-memory-access-check-g-tmem-access-checkEnabled with -gEnable tensor memory access checks for tcgen05 operations
--gno-tensor-memory-access-check-gno-tmem-access-checkfalseDisable checks (overrides the above)

When enabled, the compiler inserts inline validation code (the 8 guardrail functions) around tcgen05 operations. These emit trap instructions if tensor memory invariants are violated at runtime -- useful for debugging TMEM allocation errors, bounds violations, and ownership conflicts.

Feature Comparison

Featuresm_100sm_103sm_110sm_120sm_121
Codegen factory3686436867368683686936869
Sub-arch variant03455
Family stringBlackwellBlackwellBlackwellBlackwellBlackwell
Capsule Mercury defaultYesYesYesYesYes
tcgen05 (tensor memory)YesYesYesNoNo
WGMMA (from Hopper)YesYesYesYesYes
Cluster operationsYesYesYesYesYes
setmaxnregYesYesYesYesYes
Uniform FP ALU (UFADD etc.)YesYesYesYesYes
BGMMA/QMMAYesYesYes??
Guardrail instrumentationYesYesYesN/AN/A
HW latency table3.0KB + 949B618B (supp.)--2.9KB + 5.5KB(shared w/ sm_120)
a sub-variantYesYesYesYesYes
f sub-variantYesYesYesYesYes
ProductsB100/B200GB300Jetson ThorRTX 50xxDGX Spark

SASS Instruction Encoding

Blackwell continues the 128-bit per-instruction format introduced in Turing. The Mercury encoder handles the SM100+ instruction set through a dedicated encoding subsystem spanning approximately 851KB in the address range 0xDFC000--0x107B000.

The encoding subsystem covers 16 instruction format groups with full Blackwell ISA support including:

  • Standard ALU/FPU/memory operations (inherited from earlier architectures)
  • TCGEN05 tensor memory operations (new encoding classes)
  • BGMMA/QMMA block-scale and quantized MMA variants
  • Extended bulk copy operations (UBLKCP variants)
  • Sparse tensor operations

Function Map

AddressSizeIdentitySMConfidence
sub_607DB014KBCapability dispatch (all Blackwell registrations)all99%
sub_608DF0~1KBIntrinsic table initializersm_12085%
sub_608F20~1.2KBHandler A (capability accessor)sm_10390%
sub_608F50~1.2KBHandler B (capability accessor)sm_11090%
sub_609000~200BPerf/occupancy Esm_11085%
sub_609020~200BPerf/occupancy Esm_10385%
sub_609040~200BPerf/occupancy Esm_12185%
sub_609080~200BPerf/occupancy Esm_10085%
sub_6091A0~200BPerf/occupancy Fsm_10385%
sub_609280~200BPerf/occupancy Fsm_11085%
sub_609520~200BPerf/occupancy Fsm_12085%
sub_6097C0~200BPerf/occupancy Fsm_12185%
sub_6098A0~200BPerf/occupancy Fsm_10085%
sub_609BA0~48BHandler Bsm_12199%
sub_609BD0~48BHandler Bsm_10099%
sub_609C30~48BHandler Asm_10099%
sub_609C60~48BHandler Bsm_12099%
sub_609D20~48BHandler Bsm_10399%
sub_609E40~48BHandler Asm_12099%
sub_609ED0~48BHandler Asm_12199%
sub_609F30~48BHandler Asm_11099%
sub_60A4E0~1KBIntrinsic table initializersm_12185%
sub_60A700~1KBIntrinsic table initializersm_10385%
sub_60A910~1KBIntrinsic table initializersm_10085%
sub_60AA20~1KBIntrinsic table initializersm_11085%
sub_60F290est.Off-target capmerc compatibility checkerall75%
sub_612DE047KBKernel finalizer / ELF builderall80%
sub_6765E054KBProfile constructor (Blackwell entries at lines 600--1330)all95%
sub_703AB010KBCLI option parser (capmerc/mercury/sass)all90%
sub_70D91024 linestcgen05 immhalfSplitOff helpersm_10090%
sub_70DDB047 linestcgen05 funcRetArr helpersm_10090%
sub_70E0E0296 linestcgen05 guardrail bounds checkersm_10090%
sub_8E44003.3KBWarp geometry initializer (Blackwell = 16 warps, 240 slots)all90%
sub_8E8A903.0KBHW latency table (Blackwell datacenter)sm_10085%
sub_8E90002.9KBHW latency table (consumer base)sm_12085%
sub_8E92E05.5KBHW latency table (consumer extended)sm_12085%
sub_8E97B08.8KBUniversal fallback latency tableall85%
sub_89FBA085KBOpcode-to-scheduling-class mapperall90%
sub_5BBC3090KBtcgen05.mma codegen handlersm_10098%
sub_5D4190--Opcode dispatch table (tcgen05 registrations)all99%
sub_1C9C30024KBCapsule Mercury section processorall85%
sub_1C9B11023KBMercury capsule builderall85%

Cross-References