Ada Lovelace & Hopper (SM 89--90a)

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

ptxas handles SM 89 (Ada Lovelace -- RTX 4090, L40S) and SM 90/90a (Hopper -- H100, H200) as adjacent but architecturally distinct targets. Ada shares the Ampere codegen factory (28673) and is stored internally as "Ampere"-family despite being a different microarchitecture. Hopper gets its own codegen factory (32768), its own family string "Hopper", and introduces the largest single-generation feature addition in ptxas: WGMMA, thread-block clusters, TMA, setmaxnreg, and distributed shared memory.

| | SM 89 (Ada) | SM 90 (Hopper) | SM 90a (Hopper accel) |
|---|---|---|---|
| Products | RTX 4090, RTX 4080, L40S, L4 | H100, H200 | H100, H200 (arch-locked) |
| Family string | "Ampere" | "Hopper" | "Hopper" |
| `__CUDA_ARCH__` | 890 | 900 | 90a0 |
| Codegen factory | 28673 (7 << 12 \| 1) | 32768 (8 << 12) | 32768 |
| Handler A | sub_609E10 | sub_609DB0 | sub_609DB0 (shared) |
| Handler B | sub_609CF0 | sub_609C00 | sub_609C00 (shared) |
| Intrinsic init | sub_60A810 | sub_60A5F0 | sub_60A5F0 (shared) |
| HW latency table | sub_8E8280 (3.1KB) | sub_8E8480 (5.2KB) | sub_8E8780 (4.6KB) |
| Suffix variants | None | a only (no f) | -- |
| Forward compat | Full (runs on sm_90+) | Full (base variant) | None (locked to H100/H200) |

The "a" Suffix -- Architecture-Accelerated

SM 90a is the first target to use the a suffix. It appears in the accelerated validation table (unk_1D161C0, 7 entries). The meaning is precise: sm_90a SASS executes only on the specific silicon it was compiled for (H100/H200) and will not run on any future architecture. This trades forward compatibility for access to features that may not survive to the next generation.

In ptxas, sm_90 and sm_90a share all 7 dispatch-table handler functions. The a suffix does not produce different handler code paths -- it produces different compatibility metadata in the output cubin. The ELF header records whether the binary is forward-compatible (base) or arch-locked (accelerated), and the CUDA driver enforces this at load time.

Compilation rules from ptxas help text:

  • sm_90a PTX must be compiled to sm_90a SASS (no cross-arch)
  • sm_90 PTX can compile to sm_90 or any later SASS target
  • No sm_90f variant exists; the f suffix starts with Blackwell
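These three rules condense into a small predicate. The sketch below is illustrative only (the helper and its parsing are not ptxas internals): an a-suffixed PTX target compiles solely to the identical SASS target, while a base target compiles to any equal-or-later one.

```python
def can_compile(ptx_target: str, sass_target: str) -> bool:
    """Hypothetical helper mirroring the sm_90/sm_90a rules above."""
    def parse(t: str) -> tuple[int, bool]:
        # "sm_90a" -> (90, True); "sm_90" -> (90, False)
        return int(t.removeprefix("sm_").rstrip("af")), t.endswith("a")

    p_num, p_accel = parse(ptx_target)
    s_num, _ = parse(sass_target)
    if p_accel:
        return ptx_target == sass_target   # arch-locked: no cross-arch
    return s_num >= p_num                  # base: forward-compatible

assert can_compile("sm_90a", "sm_90a")
assert not can_compile("sm_90a", "sm_100a")
assert can_compile("sm_90", "sm_100")
```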

SM 89 -- Ada Lovelace

Internal Classification

Ada is classified as Ampere-derived in the binary. The profile object constructed by sub_6765E0 stores "Ampere" as the family name and uses codegen factory value 28673 -- identical to sm_80 through sm_88. The compiler distinguishes Ada from Ampere through per-SM capability accessor functions, not through the factory ID.

The encoded SM version for sm_89 is 28677 (7 << 12 | 5), placing it as the 5th variant in the Ampere generation:

| Encoded Value | SM | Variant |
|---|---|---|
| 28673 | sm_80 | 7 << 12 \| 1 (base Ampere) |
| 28674 | sm_86 | 7 << 12 \| 2 |
| 28675 | sm_87 | 7 << 12 \| 3 |
| 28676 | sm_88 | 7 << 12 \| 4 |
| 28677 | sm_89 | 7 << 12 \| 5 (Ada) |
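The (generation << 12) | variant packing can be verified in a couple of lines. The helper names here are hypothetical; the bit layout is the documented one:

```python
def encode_sm(generation: int, variant: int) -> int:
    # Pack generation into the high bits, variant into the low 12 bits.
    return (generation << 12) | variant

def decode_sm(encoded: int) -> tuple[int, int]:
    return encoded >> 12, encoded & 0xFFF

assert encode_sm(7, 5) == 28677        # sm_89 (Ada)
assert encode_sm(8, 0) == 32768        # sm_90 (Hopper) codegen factory
assert decode_sm(28673) == (7, 1)      # sm_80, base Ampere
```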

Ada-Specific Features

Ada introduces 4th-generation tensor cores with FP8 (E4M3, E5M2) support. In the PTX validator, sub_4A6050 explicitly references "%s on sm_89" in its CVT instruction validation, confirming sm_89-specific conversion rules for new data types.

The intrinsic table initializer for sm_89 (sub_60A810) enables 39 additional MMA intrinsics:

| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x209--0x22F | 39 | `__cuda_sm_8x_mma_*` -- extended MMA shapes and types |

These intrinsics cover FP8 MMA operations, block-scale MMA, and additional type combinations beyond what sm_80--88 support. The MMA validator at sub_49BBA0 checks for "mma with FP8 floating point type" and validates against the target SM version.

Ada Scheduling Profile

The HW latency table for sm_89 (sub_8E8280, 3.1KB) is smaller than Hopper's (5.2KB), reflecting Ada's simpler pipeline structure -- no WGMMA async pipeline, no cluster operations. The register file geometry from sub_8E4400:

| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 8 | Threshold: encoded SM <= 36863 |
| Dispatch slots | 224 | Same as sm_80 class |
| Sub-architecture variant | 5 | From encoded value 28677 |

SM 90 / SM 90a -- Hopper

Profile Construction

Hopper is the first SM to get its own codegen factory value (32768 = 8 << 12) since the Turing/Ampere split. The profile object stores "Hopper" as the family name. Key profile fields:

| Field | sm_90 | sm_90a |
|---|---|---|
| SM name | "sm_90" | "sm_90a" |
| Compute name | "compute_90" | "compute_90a" |
| LTO name | "lto_90" | "lto_90a" |
| CUDA_ARCH | 900 | 90a0 |
| Family | "Hopper" | "Hopper" |
| Codegen factory | 32768 | 32768 |

Hopper Scheduling Profile

The HW latency tables for Hopper are substantially larger than any previous architecture, reflecting the async pipeline and tensor core scheduling complexity:

| Function | Size | Target | Notes |
|---|---|---|---|
| sub_8E8480 | 5.2KB | sm_90 | Base Hopper latency model |
| sub_8E8780 | 4.6KB | sm_90a | Arch-accelerated variant |

sm_90a gets its own latency table (4.6KB) distinct from sm_90 (5.2KB), even though they share all dispatch handler functions. This is the only architecture where base and a variants have separate scheduling profiles -- all Blackwell variants share their tables within each base SM.

Register file geometry from sub_8E4400:

| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 16 | Threshold: encoded SM > 36863 (32768 qualifies) |
| Dispatch slots | 240 | Maximum -- 2x the sm_80 class |
| Sub-architecture variant | 0 | From encoded value 32768 (base variant) |
| Max threads/CTA | 240 | From code object builder sub_A465F0 |

The jump from 8 warps / 224 slots (sm_89) to 16 warps / 240 slots (sm_90) is the largest warp geometry change in the binary. This doubling of warp capacity directly corresponds to Hopper's 4-warp warpgroup execution model.

Hopper Intrinsics

The intrinsic initializer for sm_90 (sub_60A5F0) enables 38 sub-byte MMA intrinsics:

| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x23A--0x25F | 38 | `__cuda_sm_9x_mma_sub_byte_internal_*` |

These cover sparse sub-byte MMA operations: s4/u4 sparse variants for m16n8k32, m16n8k64, and m16n8k128 shapes. These are Hopper-specific and do not appear in the Ada (sm_89) intrinsic table.
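Both ID ranges are inclusive, so the quoted counts check out arithmetically (the same applies to the sm_89 range 0x209--0x22F above):

```python
# Inclusive-range counts for the intrinsic ID tables quoted above.
ada_count = 0x22F - 0x209 + 1       # sm_89 extended MMA intrinsics
hopper_count = 0x25F - 0x23A + 1    # sm_90 sub-byte sparse MMA intrinsics
assert ada_count == 39
assert hopper_count == 38
```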

WGMMA -- Warpgroup Matrix Multiply-Accumulate

WGMMA is Hopper's signature instruction. It operates on warpgroups (4 consecutive warps) rather than single warps, and executes asynchronously -- the tensor core operates in parallel with the warp's regular instruction stream. ptxas handles WGMMA through four PTX instructions and a dedicated compiler pass infrastructure.

PTX Instructions

Registered in the opcode dispatch table at sub_5D4190:

| PTX Instruction | Codegen Handler | Formatter | Formatter Size |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | sub_4DA380 | 295B |
| wgmma.fence | sub_4DA380 | sub_4DA4B0 | 295B |
| wgmma.commit_group | sub_4DA4B0 | sub_4DA5E0 | 311B |
| wgmma.wait_group | sub_4DA5E0 | sub_505B00 | 1066B |

The formatters are the smallest named formatters in ptxas (295 bytes), reflecting WGMMA's compact text representation. wgmma.wait_group is significantly larger (1066B) because it must encode the pipeline depth parameter.

GMMA Pipeline Pass Infrastructure

The WGMMA pipeline optimizer is the largest single-architecture compiler subsystem in ptxas, spanning approximately 100KB of code across 15+ functions in the range 0xACE000--0xAE6000. It is active only for SM 90+ targets.

Call chain:

sub_AE4F70  (coordinator -- outside primary range)
 +-- sub_ACE480   (22.7KB)  WGMMA serialization warning emitter
 +-- sub_ADEB40   (43.1KB)  warpgroup.arrive/wait fence insertion
 +-- sub_AE17C0   (37.9KB)  pipeline stage builder
 |    +-- sub_AE0D20  (16.8KB)  live range builder
 |    +-- sub_AE06F0           GMMA operand classifier
 +-- sub_ADDDF0   (20.6KB)  pass entry (vtable dispatch)
      +-- sub_ADCA60  (21.7KB)  scheduling coordinator
           +-- sub_ADBD30  (23.9KB)  register pressure estimator
           |    +-- sub_ADAD60  (8.4KB)  live range limiter
           |    +-- sub_AD9C20  (14.4KB) register class allocator
           +-- sub_AD70B0  (22.6KB)  operand register assignment

Warpgroup Synchronization Injection

The fence insertion pass (sub_ADEB40, 43.1KB) scans for wgmma.mma_async operations and automatically injects warpgroup.arrive and warpgroup.wait instructions. These fences manage register ownership when asynchronous tensor core operations are in flight -- the hardware requires explicit handoff between the warpgroup's register file and the tensor core's accumulator registers.

Diagnostic messages emitted by the compiler:

  • "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
  • "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
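In spirit, the pass behaves like the toy rewrite below. This is a drastic simplification of a 43 KB pass, shown only to illustrate where the two fences land relative to the async MMA; the real pass places warpgroup.wait at the first use of GMMA-defined registers, not immediately after the instruction.

```python
def inject_warpgroup_fences(instrs: list[str]) -> list[str]:
    """Toy model of the fence-insertion pass: bracket each
    wgmma.mma_async with warpgroup.arrive / warpgroup.wait."""
    out: list[str] = []
    for ins in instrs:
        if ins.startswith("wgmma.mma_async"):
            out.append("warpgroup.arrive")   # registers handed to GMMA
            out.append(ins)
            out.append("warpgroup.wait 0")   # GMMA results safe to read
        else:
            out.append(ins)
    return out
```

Feeding it a three-instruction sequence with one wgmma.mma_async yields five instructions, with the fences bracketing the MMA.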

WGMMA Serialization Warnings

When the compiler cannot pipeline WGMMA operations, sub_ACE480 (22.7KB, 98% confidence) emits detailed performance advisories using codes 0x1D55--0x1D57 (7509--7511 decimal). Nine distinct serialization reasons are enumerated:

| Reason Code | Diagnostic Message |
|---|---|
| 1 | "presence of Extern calls in the function" |
| 2 | "wgmma pipeline crossing function boundary" |
| 3 | "insufficient register resources for the wgmma pipeline" |
| 4 | "program dependence on compiler-inserted warpgroup" |
| 5 | "ill formed pipeline stage in the function" |
| 6 | "non wgmma instructions defining accumulator registers" |
| 7 | "non wgmma instructions reading accumulator registers" |
| 8 | "non wgmma instructions defining input registers" |
| 9 | "insufficient register resources for the function" |

All messages are prefixed with "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...". The pass reads its configuration from offsets +26280 (1-byte enable) and +26288 (dword threshold) on the compilation context.

GMMA Live Range Management

The live range limiter (sub_ADAD60) enforces a maximum on simultaneously active live ranges within GMMA sequences:

"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"

When the threshold is exceeded, the system triggers register spilling or sequence splitting through sub_ADBD30 (register pressure estimator). The live range builder uses FNV-1a hashing (constants 16777619 and 0x811C9DC5) for instruction deduplication.
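The two constants are the standard 32-bit FNV-1a offset basis and prime, so the hash is presumably ordinary FNV-1a over some instruction encoding. A reference implementation for comparison:

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a using the constants quoted above."""
    h = 0x811C9DC5                       # offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

assert fnv1a_32(b"") == 0x811C9DC5
assert fnv1a_32(b"a") == 0xE40C292C      # standard FNV-1a test vector
```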

Thread-Block Clusters

Hopper introduces the concept of a thread-block cluster -- a group of cooperating CTAs that can access each other's shared memory (distributed shared memory). ptxas adds several PTX directives and special registers to support this.

Cluster Directives

The directive validator (sub_4CE6B0, 48KB) enforces mutual exclusivity of cluster configuration:

".reqnctapercluster and .maxclusterrank cannot both be specified"

Two shared-memory state spaces are distinguished:

  • .shared::cta -- CTA-local shared memory (pre-Hopper behavior)
  • .shared::cluster -- distributed shared memory accessible across CTAs in a cluster

Cluster Special Registers

Registered in sub_61B850 (special register table initializer):

| Special Register | Purpose |
|---|---|
| %clusterid | Cluster ID within the grid |
| %nclusterid | Number of clusters in the grid |
| %cluster_ctaid | CTA position within the cluster |
| %cluster_nctaid | Number of CTAs in the cluster |
| %cluster_ctarank | Linear rank of CTA within the cluster |
| %cluster_nctarank | Total CTAs in the cluster (linear) |
| %is_explicit_cluster | Whether this launch uses explicit clustering |
| %aggr_smem_size | Aggregate shared memory across cluster |

Distributed Shared Memory Intrinsics

The intrinsic handler OCG_DshmemHandler at sub_6C60B0 validates distributed shared memory operations:

  • "Cannot use both the selfcast and the broadcast modifier."
  • "Either the selfcast or the broadcast modifier must be used."

TMA -- Tensor Memory Accelerator

The Tensor Memory Accelerator (TMA) provides hardware-accelerated bulk data movement between global and shared memory, using tensor descriptors to specify multi-dimensional copy patterns. In ptxas, TMA is exposed through cp.async.bulk.tensor.

cp.async.bulk.tensor Codegen

The codegen handler (sub_5AB460, 45KB) is one of the largest single-instruction handlers in ptxas:

| Property | Value |
|---|---|
| Handler function | sub_5AB460 |
| Size | 45KB |
| Buffer allocation | 50,000 bytes |
| Registered name | "cp.async.bulk.tensor" |
| Dimensionality | 1D through 5D |
| Modes | tile, im2col |
| Cast variants | unicast, multicast |

The TMA intrinsic handler (OCG_CpAsyncTensorHandler at sub_6C8100) validates operand counts per mode:

  • "Must have 1 input with a1t0 and no multicast"
  • "Must have 2 inputs with a1t0 and multicast"
  • "Must have 2 input with a0tx and no multicast"
  • "Must have 3 inputs with a0tx and multicast"

Tensormap Instructions

The tensormap validator (sub_4A73C0, 10.8KB) handles tensor descriptor manipulation:

  • ".tile" mode validation
  • "Tensormap field with input value >= 13" / "with input value == 4" bounds checking
  • ".tensormap::generic" addressing mode
  • "Interger Immediate for ordinal" (sic -- typo preserved from binary)

Related async bulk-copy handlers:

| Handler | Function | Size |
|---|---|---|
| cp.async.bulk | sub_593210 | 5.1KB (formatter) |
| cp.async.mbarrier.arrive | sub_4DC180 | -- |
| OCG_CpAsyncBulkHandler | sub_6C3470 | 20KB |
| OCG_CpAsyncHandler | sub_6C2AE0 | 10KB |

setmaxnreg -- Dynamic Register Allocation

Hopper introduces setmaxnreg (PTX opcode 315) for dynamic register count adjustment. This allows kernels to change their register footprint at runtime, enabling techniques like CTA reconfiguration and warpgroup-level resource management.

setmaxnreg Pass

The handler (sub_97EC60, ~3.5KB, 90% confidence) walks the instruction list looking for opcode 315 and either processes or removes each occurrence. Five reasons for ignoring setmaxnreg are enumerated as "Potential Performance Loss" warnings:

| Reason | Message |
|---|---|
| 1 | "unable to determine register count at entry" |
| 2 | "to maintain minimum register requirements" |
| 3 | "to allow debugging" |
| 4 | "to maintain compatibility across compilation units" |
| 5 | "to maintain compatibility into 'extern' call" |

The setmaxnreg handling mode is controlled by knob 653.

CTA Reconfig Pragmas

The pragma validator (sub_97F540, ~4KB) enforces ordering constraints on setmaxreg.alloc and setmaxreg.dealloc:

  • "Found an 'alloc' pragma after 'dealloc'"
  • "Found incompatible thread count re-specification"
  • "Found a 'dealloc' pragma after 'alloc'"

The dealloc validator (sub_98D100, ~4.8KB) enforces register count bounds:

  • "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d)"
  • "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count"

Async Pipeline Features

Hopper extends the async copy pipeline introduced in Ampere with mbarrier (memory barrier) objects. ptxas implements mbarrier handling through a cluster of detector, classifier, and emitter functions:

| Function | Address | Identity |
|---|---|---|
| MBarrierDetector::isNonTrivialMBarrier | sub_A94440 | Checks "%mbarrier_" prefix |
| MBarrierDetector::classifyMBarrier | sub_A9A5F0 | Returns packed (type << 32) \| is_mbarrier |
| MBarrierDetector::resolveMBarrierBaseName | sub_A9A920 | Extracts base name from symbol |
| MBarrierEmitter::rewriteMBarrierOperand | sub_AA33C0 | Constructs "%%mbarrier_%s_%s" |

The mbarrier type classifier returns an enumeration of 0--12 for different operation kinds, covering arrive, arrive_drop, expect_tx, and their counted variants.
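The packed return value is easy to model. The helper names below are hypothetical, but the layout is the documented (type << 32) | is_mbarrier with the type enumeration in 0--12:

```python
def classify_result(kind: int, is_mbarrier: bool) -> int:
    """Pack an mbarrier operation kind (0-12) into the combined
    return value described above."""
    assert 0 <= kind <= 12
    return (kind << 32) | int(is_mbarrier)

def unpack_result(packed: int) -> tuple[int, bool]:
    return packed >> 32, bool(packed & 1)

assert unpack_result(classify_result(5, True)) == (5, True)
assert classify_result(0, False) == 0
```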

Warpgroup Attribute Management

The warpgroup attribute processor (sub_64BAF0, 30KB) handles kernel-level attributes introduced with Hopper:

| Attribute String | Purpose |
|---|---|
| "warpgroup-commit_batch" | WGMMA commit batch configuration |
| "warpgroup-wait" | WGMMA wait configuration |
| "warpgroup-arrive" | WGMMA arrive configuration |
| "setsmemsize-state" | Shared memory size pragma state |
| "setmaxreg-state" | Register limit pragma state |
| "func_begin" | Function entry marker |
| "CC-temp" | Calling convention temporary |

Architecture Version Threshold Checks

The binary uses the encoded SM version (codegen factory value) for feature gating throughout the compiler. Key thresholds observed:

| Check Pattern | Threshold | Meaning |
|---|---|---|
| profile[+372] > 28673 | > sm_80 base | Post-Ampere features |
| a2 <= 32767 | <= sm_89 class | Pre-Hopper geometry (7 warps, 208 slots) |
| a2 <= 36863 | <= sm_89 extended | Ampere/Ada geometry (8 warps, 224 slots) |
| a2 > 36863 | >= sm_90 | Hopper+ geometry (16 warps, 240 slots) |
| Codegen factory == 32768 | sm_90 exactly | Hopper-specific code paths |
| Codegen factory >= 32768 | sm_90+ | WGMMA, cluster, TMA enabled |

The register file descriptor at sub_8E4400 uses the encoded value to select warp geometry. The full cascade:

encoded <= 20479  ->  4 warps,  96 slots   (pre-Maxwell)
encoded <= 24575  ->  6 warps, 176 slots   (Pascal)
encoded <= 28672  ->  7 warps, 192 slots   (Volta)
encoded <= 32767  ->  7 warps, 208 slots   (Turing/Ampere/Ada)
encoded <= 36863  ->  8 warps, 224 slots   (Ampere extended)
encoded >  36863  -> 16 warps, 240 slots   (Hopper+)

Note: sm_89 (encoded 28677) falls in the <= 32767 range, giving it 7 warps / 208 slots. But the separate warp geometry lookup uses a different cascade where sm_89 (as Ampere class) gets 8 warps / 224 slots. The dual-cascade structure reflects the fact that different subsystems query different profile fields.
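Transcribed literally, the cascade is a simple threshold lookup (sm_89's encoded value 28677 lands in the 7-warp bucket here, per the note above):

```python
def warp_geometry(encoded: int) -> tuple[int, int]:
    """Literal transcription of the sub_8E4400 cascade above:
    returns (warps per scheduler, dispatch slots)."""
    if encoded <= 20479: return (4, 96)    # pre-Maxwell
    if encoded <= 24575: return (6, 176)   # Pascal
    if encoded <= 28672: return (7, 192)   # Volta
    if encoded <= 32767: return (7, 208)   # Turing/Ampere/Ada
    if encoded <= 36863: return (8, 224)   # Ampere extended
    return (16, 240)                       # Hopper+

assert warp_geometry(28677) == (7, 208)    # sm_89 in this cascade
assert warp_geometry(36864) == (16, 240)   # first Hopper+ value
```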

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block).
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

sm_90a shares sm_90's geometry -- the a suffix affects only compatibility metadata, not hardware resource limits. The jump from sm_89 (7 partitions, 208 slots, 48 warps) to sm_90 (8 partitions, 224 slots, 64 warps) is the largest single-generation scheduling capacity increase in the binary, directly supporting Hopper's 4-warp warpgroup execution model.
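As a first-order illustration of how these limits interact, the sketch below computes a CTA-per-SM bound from the sm_90 row (hypothetical helper; a real occupancy calculation also honors shared-memory configuration and register allocation granularity):

```python
def ctas_per_sm(threads_per_cta: int, regs_per_thread: int,
                regs_per_sm: int = 65_536, warps_per_sm: int = 64,
                max_ctas_per_sm: int = 32) -> int:
    """First-order concurrent-CTA bound using the sm_90 limits above."""
    warps_per_cta = (threads_per_cta + 31) // 32
    by_warps = warps_per_sm // warps_per_cta
    by_regs = regs_per_sm // (threads_per_cta * regs_per_thread)
    return min(by_warps, by_regs, max_ctas_per_sm)

# 256-thread CTAs at 32 registers/thread: warp- and register-limited to 8
assert ctas_per_sm(256, 32) == 8
```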

MMA Instruction Validators

Several validator functions gate MMA features by SM version:

| Validator | Address | Size | SM Strings Referenced |
|---|---|---|---|
| WMMA/MMA validator | sub_4C2FD0 | 12.2KB | "sm_90", "sm_80", "sm_75" |
| MMA scale/block validator | sub_49BBA0 | 11.4KB | "sm_%d", "mma with FP8 floating point type" |
| WMMA shape validator | sub_4BFED0 | 10.3KB | "sm_80", "sm_75" |
| CVT arch-specific | sub_4A6050 | 5.0KB | "%s on sm_89" |
| Special register validator | sub_49A5A0 | 3.5KB | "sm_90", "%laneid", "%warpid" |
| Instruction fusion validator | sub_4AA3E0 | 7.1KB | "sm_90" |
| Float instruction validator | sub_4A2CA0 | 3.7KB | "sm_90" |
| Function address validator | sub_4B1630 | 4.6KB | "sm_90", "sm_30", "sm_20" |

The WMMA/MMA validator at sub_4C2FD0 performs three-way version checks: sm_75 for base WMMA, sm_80 for extended types (BF16/TF32), sm_90 for WGMMA features. FP8 MMA is gated by both sm_89 (Ada) for the data types and sm_90 (Hopper) for the warpgroup shapes.

Post-Scheduling Statistics

Eight architecture-variant statistics printers (clones at 0x700-byte intervals from sub_ABBA50) emit DUMPIR statistics. The metrics cover Hopper-specific counters:

| Metric | Format String |
|---|---|
| MMA counts | "# [hmma1688=%d]" (and variants for imma, sparse, dmma) |
| Occupancy | "# [Occupancy = %f]" |
| Per-unit throughput | "# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] ..." |
| Issue throughput | "# [issue thru=%f] [adu thru=%f] [alu thru=%f] ..." |
| WGMMA serialization | "Potential Performance Loss: wgmma.mma_async ..." (9 variants) |
| Shared memory | "# [SharedMem Alloc thru=%f]" |

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_4A6050 | 5.0KB | CVT validator (sm_89 special cases) | 85% |
| sub_4C2FD0 | 12.2KB | WMMA/MMA validator (sm_75/80/90 version checks) | 90% |
| sub_4DA380 | 295B | wgmma.mma_async formatter | 99% |
| sub_4DA4B0 | 295B | wgmma.fence formatter | 99% |
| sub_4DA5E0 | 311B | wgmma.commit_group formatter | 99% |
| sub_505B00 | 1066B | wgmma.wait_group formatter | 99% |
| sub_50AC70 | -- | wgmma.mma_async codegen handler | 99% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (TMA) | 95% |
| sub_609CF0 | ~1.2KB | sm_89 handler B (capability accessor) | 90% |
| sub_609DB0 | ~1.2KB | sm_90 handler A (capability accessor) | 90% |
| sub_609E10 | ~1.2KB | sm_89 handler A (capability accessor) | 90% |
| sub_60A5F0 | ~1KB | sm_90 intrinsic table initializer | 85% |
| sub_60A810 | ~1KB | sm_89 intrinsic table initializer | 85% |
| sub_61B850 | 10KB | Special register table (cluster regs) | 99% |
| sub_64BAF0 | 30KB | Warpgroup/kernel attribute processor | 80% |
| sub_6C60B0 | -- | Distributed shared memory intrinsic handler | 65% |
| sub_6C8100 | -- | TMA (cp.async.tensor) intrinsic handler | 85% |
| sub_8E4400 | 3.3KB | Register file geometry initializer | 90% |
| sub_8E8280 | 3.1KB | sm_89 (Ada) HW latency table | 85% |
| sub_8E8480 | 5.2KB | sm_90 (Hopper) HW latency table | 85% |
| sub_8E8780 | 4.6KB | sm_90a HW latency table | 85% |
| sub_97EC60 | ~3.5KB | setmaxnreg handler (opcode 315) | 90% |
| sub_97F540 | ~4KB | CTA reconfig pragma validator | 90% |
| sub_98D100 | ~4.8KB | setmaxreg.dealloc validator | 90% |
| sub_A94440 | -- | MBarrierDetector::isNonTrivialMBarrier | 85% |
| sub_A9A5F0 | -- | MBarrierDetector::classifyMBarrier | 85% |
| sub_AA33C0 | -- | MBarrierEmitter::rewriteMBarrierOperand | 85% |
| sub_ACE480 | 22.7KB | WGMMA serialization warning emitter | 98% |
| sub_AD70B0 | 22.6KB | GMMA operand register assignment | 75% |
| sub_AD9C20 | 14.4KB | GMMA register class allocator | 75% |
| sub_ADAD60 | 8.4KB | GMMA live range limiter | 90% |
| sub_ADBD30 | 23.9KB | GMMA register pressure estimator | 80% |
| sub_ADCA60 | 21.7KB | GMMA scheduling coordinator | 85% |
| sub_ADDDF0 | 20.6KB | GMMA pass entry (vtable) | 80% |
| sub_ADEB40 | 43.1KB | WGMMA sync fence insertion | 95% |
| sub_AE0D20 | 16.8KB | GMMA live range builder | 80% |
| sub_AE17C0 | 37.9KB | GMMA pipeline stage builder | 85% |
| sub_AE4F70 | -- | GMMA pass coordinator (outside range) | 90% |
| sub_AE5030 | 15.5KB | GMMA scheduling wrapper (alt entry) | 75% |

Cross-References