Hopper (sm_90, sm_90a)

Hopper represents the largest single-generation feature expansion in cicc v13.0. The sm_90 gate at qword_4F077A8 > 89999 unlocks thread block clusters, distributed shared memory, Tensor Memory Access (TMA), Warpgroup Matrix Multiply-Accumulate (WGMMA), dynamic register count control, and a new fence instruction. The sm_90a "accelerated" sub-variant shares __CUDA_ARCH=900 with sm_90 but uses a higher PTX version and enables one additional feature gate in the EDG frontend.

Architecture Identity

The NVVM container format registers Hopper as NVVM_ARCH_HOPPER_9_0 with numeric value 900, assigned in sub_CD09E0 (line 255) and sub_1C1B150 (line 270) via the pattern v62(a1, "NVVM_ARCH_HOPPER_9_0", v64) => *a2 = 900.

| Variant | Subtarget Enum | __CUDA_ARCH | PTX Version | -opt-arch | -mcpu |
|---|---|---|---|---|---|
| sm_90 | 38 | 900 | 5 | sm_90 | sm_90 |
| sm_90a | 39 | 900 | 6 | sm_90a | sm_90a |

Both variants share __CUDA_ARCH=900. The distinction lies in the -opt-arch and -mcpu flags passed through the internal pipeline (sub_95EB40 lines 461–469, sub_12C8DD0 lines 435–457). The sm_90a variant is the only pre-Blackwell SM that uses PTX version 6; all sm_20 through sm_90 base variants use PTX version 5.

The a flag is stored in unk_4D045E4 and read in exactly one location: sub_6C4D80 line 167, where the check unk_4D045E8 != 90 || !unk_4D045E4 gates a specific sm_90a-only feature (error code 0xE90 = 3728).

Thread Block Cluster Infrastructure

Clusters are the headline Hopper feature. The compiler gates all cluster functionality at arch_id >= 90 (unk_4D045E8 > 89).

Frontend Attributes

The EDG frontend recognizes three cluster-related kernel attributes:

__cluster_dims__ — Attribute code k in sub_5C79F0. Processing in sub_5D1FE0 validates three integer arguments (x, y, z) and stores them at offsets +20, +24, +28 of the kernel metadata structure. Error codes 3685/3686 on invalid values. On sm_89 and below, diagnostic 3687 is emitted as a warning.

__launch_bounds__ 3rd parameter — The cluster dimension extension to __launch_bounds__ is processed in sub_5D2430. On sm_89 and below, diagnostic 3704 is emitted.

__block_size__ attribute — Handled in sub_5D1A60. At sm_90+, five block dimension arguments are parsed (including the cluster dimension). At sm_89 and below, diagnostic 3790 is emitted and only four arguments are accepted.

NVVM Metadata

Cluster configuration propagates through NVVM IR via several metadata keys:

| Metadata Key | Writers | Readers |
|---|---|---|
| nvvm.cluster_dim | sub_93AE30, sub_129A750 | sub_A84F90, sub_CE8EA0 |
| cluster_dim_x/y/z | sub_913C80, sub_1273830 | sub_CE8C00/40/80 |
| cluster_max_blocks | sub_913C80, sub_1273830 | (kernel metadata) |
| nvvm.blocksareclusters | sub_93AE30, sub_129A750 | sub_214DA90 |
| nvvm.maxclusterrank | (external) | sub_A84F90, sub_CE9030 |

The blocksareclusters metadata requires reqntid to be set — error message: "blocksareclusters requires reqntid" (sub_214DA90 line 111).

PTX Directives

The kernel attribute emitter at sub_214DA90 gates cluster directives at arch_id >= 90. When the gate passes, four directives may be emitted:

  • .blocksareclusters — Declares that thread blocks form clusters
  • .explicitcluster — Emitted when all three cluster dimensions are present
  • .reqnctapercluster X, Y, Z — Required CTA count per cluster
  • .maxclusterrank N — Maximum cluster rank
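The gate and directive set can be captured in a short sketch. The arch gate, directive spellings, and the "blocksareclusters requires reqntid" error come from this section; the metadata-dictionary shape and the function structure are assumptions, not the decompiled code:

```python
def emit_cluster_directives(arch_id, md):
    """Sketch of the sub_214DA90 cluster-directive gate (structure assumed)."""
    if arch_id < 90:                       # cluster directives are sm_90+ only
        return []
    lines = []
    if md.get("blocksareclusters"):
        if "reqntid" not in md:            # error string from sub_214DA90 line 111
            raise ValueError("blocksareclusters requires reqntid")
        lines.append(".blocksareclusters")
    dims = [md.get(k) for k in ("cluster_dim_x", "cluster_dim_y", "cluster_dim_z")]
    if all(d is not None for d in dims):   # all three dims present
        lines.append(".explicitcluster")
        lines.append(f".reqnctapercluster {dims[0]}, {dims[1]}, {dims[2]}")
    if "maxclusterrank" in md:
        lines.append(f".maxclusterrank {md['maxclusterrank']}")
    return lines
```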

Cluster Special Registers

The PTX emitter at sub_21E9060 handles 15 cluster special registers via a switch statement:

| Case | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Boolean: was the cluster explicitly set |
| 1 | %cluster_ctarank | CTA rank within the cluster |
| 2 | %cluster_nctarank | Number of CTAs in the cluster |
| 3–5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6–8 | %cluster_ctaid.{x,y,z} | CTA position within the cluster |
| 9–11 | %nclusterid.{x,y,z} | Cluster grid count |
| 12–14 | %clusterid.{x,y,z} | Cluster ID |
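The switch reduces to a lookup table. Register names and case numbering are from the table above; the function shape is an assumption:

```python
# Cases 0-14 of the sub_21E9060 switch, in order.
_CLUSTER_SREGS = (
    ["%is_explicit_cluster", "%cluster_ctarank", "%cluster_nctarank"]
    + [f"%cluster_nctaid.{a}" for a in "xyz"]   # cases 3-5
    + [f"%cluster_ctaid.{a}" for a in "xyz"]    # cases 6-8
    + [f"%nclusterid.{a}" for a in "xyz"]       # cases 9-11
    + [f"%clusterid.{a}" for a in "xyz"]        # cases 12-14
)

def cluster_sreg(case):
    """Map a sub_21E9060 switch case to its PTX special register name."""
    if not 0 <= case <= 14:
        raise ValueError("unknown cluster special register case")
    return _CLUSTER_SREGS[case]
```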

Cluster Barrier Operations

The barrier.cluster instruction is emitted from sub_21E8EA0 with two operation modes and two memory ordering modes:

| Opcode (bits 0–3) | Operation | Memory Mode (bits 4–7) | Qualifier |
|---|---|---|---|
| 0 | arrive | 0 | (default acquire/release) |
| 1 | wait | 1 | .relaxed |

Error strings: "bad cluster barrier op" for invalid opcode, "bad cluster barrier mem mode" for invalid memory mode.
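The operand decoding can be sketched as follows, assuming the two 4-bit fields and error strings described above; the exact spelling of the assembled instruction text is an assumption:

```python
def barrier_cluster_text(operand):
    """Decode a sub_21E8EA0 operand: op in bits 0-3, memory mode in bits 4-7."""
    op = operand & 0xF
    mem = (operand >> 4) & 0xF
    if op not in (0, 1):               # error string from the emitter
        raise ValueError("bad cluster barrier op")
    if mem not in (0, 1):
        raise ValueError("bad cluster barrier mem mode")
    text = "barrier.cluster." + ("arrive" if op == 0 else "wait")
    if mem == 1:                       # 0 = default acquire/release ordering
        text += ".relaxed"
    return text
```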

Three corresponding builtins are registered in sub_90AEE0:

| Builtin | ID |
|---|---|
| __nv_cluster_barrier_arrive_impl | 11 |
| __nv_cluster_barrier_wait_impl | 12 |
| __nv_cluster_barrier_arrive_relaxed_impl | 13 |

Cluster Query Builtins

Nine cluster information builtins are registered in sub_90AEE0:

| Builtin | ID | Purpose |
|---|---|---|
| __nv_clusterDimIsSpecifed_impl | 8 | Check if cluster dims are set |
| __nv_clusterRelativeBlockRank_impl | 9 | Block rank within cluster |
| __nv_clusterSizeInBlocks_impl | 10 | Total blocks in cluster |
| __nv_cluster_query_shared_rank_impl | 203 | Query shared memory rank |
| __nv_cluster_map_shared_rank_impl | 365 | Map to shared memory rank |
| __nv_clusterDim_impl | 405 | Get cluster dimensions |
| __nv_clusterRelativeBlockIdx_impl | 406 | Relative block index |
| __nv_clusterGridDimInClusters_impl | 407 | Grid dimension in clusters |
| __nv_clusterIdx_impl | 408 | Cluster index |

fence.sc.cluster Instruction

A new fence instruction is emitted from sub_21E94F0, the membar/fence printer. The opcode encoding uses the low 4 bits of the operand:

| Value | Instruction | Generation |
|---|---|---|
| 0 | membar.gpu | All |
| 1 | membar.cta | All |
| 2 | membar.sys | All |
| 4 | fence.sc.cluster | Hopper+ |
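The low-4-bit dispatch reduces to a lookup (values from the table above; the function shape is an assumption):

```python
def membar_fence_text(operand):
    """Decode the low 4 bits of a sub_21E94F0 operand into an instruction."""
    table = {
        0: "membar.gpu",
        1: "membar.cta",
        2: "membar.sys",
        4: "fence.sc.cluster",  # Hopper+ only
    }
    return table[operand & 0xF]
```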

A duplicate implementation exists in the NVPTX backend at sub_35F18E0.

Atomic Cluster Scope

At sm_90+, the atomic lowering paths (sub_12AE930 line 255, sub_9502D0 line 424) add cluster scope support. Scope value 2 now resolves to "cluster" instead of falling through to "gpu" as it does on sm_70–89. This enables atom.*.cluster operations for intra-cluster synchronization.
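The scope-2 resolution is a one-line decision (behavior from the paragraph above; the helper name is hypothetical):

```python
def atomic_scope_for_value2(arch_id):
    """Scope value 2 resolves to "cluster" at sm_90+ (sub_12AE930 line 255,
    sub_9502D0 line 424) and falls through to "gpu" on sm_70-89."""
    return "cluster" if arch_id >= 90 else "gpu"
```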

setmaxnreg — Dynamic Register Count

Hopper introduces dynamic register count adjustment via setmaxnreg.{inc,dec}.sync.aligned.u32.

NVVM IR validation (sub_BFC6A0 lines 1732–1754): Builtin IDs 9431–9432 correspond to nvvm.setmaxnreg.inc and nvvm.setmaxnreg.dec. Validation rules enforce that the register count must be a multiple of 8 and within the range [24, 256].
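The validation rules reduce to a simple predicate (rules as stated for sub_BFC6A0; the function name is hypothetical):

```python
def setmaxnreg_count_is_valid(count):
    """Register count must be a multiple of 8 and within [24, 256]."""
    return count % 8 == 0 and 24 <= count <= 256
```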

Inline assembly recognition (sub_FCDCB0, sub_21EA5F0): The compiler scans inline asm for setmaxnreg. followed by .sync.aligned.u32, extracting the immediate operand from either a $0 placeholder or a literal integer. Backend duplicates exist at sub_307BA30 and sub_3953170.

WGMMA — Warpgroup Matrix Multiply-Accumulate

WGMMA is Hopper's primary tensor core interface, superseding HMMA for large matrix operations.

Registered Builtins

Four type variants are registered in sub_90AEE0 (lines 2941–2944) with a duplicate table in sub_126A910:

| Builtin | ID | Accumulator Type |
|---|---|---|
| __wgmma_mma_async_f16 | 765 | FP16 |
| __wgmma_mma_async_bf16 | 766 | BF16 |
| __wgmma_mma_async_tf32 | 767 | TF32 |
| __wgmma_mma_async_f8 | 768 | FP8 |

Shape Selection

The WGMMA lowering at sub_955A70 (lines 2850–2910+) uses a switch on the M dimension (output rows) to select MachineInstr opcodes:

| M Dimension | Opcode |
|---|---|
| 8 | 10774 |
| 16 | 10690 |
| 24 | 10734 |
| 32 | 10742 |
| 40–88 (stride 8) | 10746–10770 |

Error on invalid M: "unexpected constant overflow in __wgmma_mma_async operand".
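A sketch of the M-dimension switch. The four small-M opcodes are listed explicitly above; the 40–88 range is interpolated linearly from its stated endpoints (10746–10770 over seven entries, i.e. +4 per step), which is an assumption:

```python
def wgmma_opcode_for_m(m):
    """M-dimension -> MachineInstr opcode, modeling the sub_955A70 switch."""
    special = {8: 10774, 16: 10690, 24: 10734, 32: 10742}
    if m in special:
        return special[m]
    if 40 <= m <= 88 and m % 8 == 0:
        # Assumed linear spacing: 10746 at M=40 up to 10770 at M=88.
        return 10746 + (m - 40) // 8 * 4
    raise ValueError("unexpected constant overflow in __wgmma_mma_async operand")
```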

Operand Modifiers

The NVPTX printer at sub_35F3330 emits WGMMA operand modifiers encoded in bitfields:

  • kind (bits 6–8): mxf4nvf4 (0), f8f6f4 (1), mxf8f6f4 (2), f16 (3), i8 (4), tf32 (5), mxf4 (7)
  • cta_group (bit 1): cta_group::1 (clear) or cta_group::2 (set)
  • scale (bits 2–3): Additional scaling modifier

TMA — Tensor Memory Access

TMA provides hardware-accelerated bulk data movement between global and shared memory, driven by a tensor map descriptor that encodes the multi-dimensional layout. Three independent subsystems in cicc cooperate to implement TMA: the intrinsic name parser (sub_A8E250), the SelectionDAG lowering handler (sub_33AD3D0), and the NVPTX ISel pattern matcher for CpAsyncBulkTensor (sub_36EC510).

TMA Descriptor Format (NVVM Container Tag 401)

The host-side tensor map descriptor is embedded in the NVVM container under tag 401. The tag is conditional on ExtOpt.Field344 (tag 301) having value 1, which identifies the Hopper TMA path. (Blackwell uses tag 402 for TCGen05Config instead, gated by Field344==4; the two are mutually exclusive.)

| Component | Size | Description |
|---|---|---|
| Fixed header | 44 bytes | Tensor map metadata (dimensions, strides, element type, interleave, swizzle, fill, OOB policy) |
| Per-descriptor entry | 16 bytes each | One entry per cp.async.bulk.tensor call site in the kernel |
| Total struct at offset 408 | 44 + 16*N bytes | N = number of distinct TMA operations |

The compiler serializes this into the NVVM container (sub_CDD2D0) so ptxas can validate shared memory allocation sizes and descriptor compatibility at link time.
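The payload size follows directly from the layout table (hypothetical helper name):

```python
def tma_container_struct_size(num_ops):
    """Size in bytes of the tag-401 payload: a 44-byte fixed header plus
    one 16-byte entry per distinct cp.async.bulk.tensor call site."""
    return 44 + 16 * num_ops
```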

TMA Descriptor ABI in Kernel Parameters

The EDG frontend detects TMA descriptor parameters during kernel registration stub generation. The detection function sub_8D4C10 (edg::get_tma_descriptor_flags) checks:

if (unk_4F068E0
    && arch > 0x9EFB
    && type_is_struct_or_class(type)
    && (*(type+140) & ~4) == 8
    && get_tma_descriptor_flags(type) & 4):
  insert copy_node(sub_7E7ED0, calling_convention=7)
  byte_at(node+88) |= 4   // TMA descriptor flag

This gives TMA descriptors a distinct ABI: calling convention 7 with flag bit 4, separate from normal struct-by-value passing. The copy node ensures the descriptor is materialized at the correct address space boundary before kernel launch.

TMA Intrinsic Name Parsing (sub_A8E250)

The intrinsic dispatcher sub_A8E250 (52 KB) matches TMA intrinsic names via string comparison and assigns internal opcode IDs. Two families exist:

Tensor-structured copies (require a tensor map descriptor):

| Intrinsic Pattern | Dimensions | Opcode |
|---|---|---|
| cp.async.bulk.tensor.g2s.tile.1d | 1D | 9222 |
| cp.async.bulk.tensor.g2s.tile.2d | 2D | 9223 |
| cp.async.bulk.tensor.g2s.tile.3d | 3D | 9224 |
| cp.async.bulk.tensor.g2s.tile.4d | 4D | 9225 |
| cp.async.bulk.tensor.g2s.tile.5d | 5D | 9226 |
| cp.async.bulk.tensor.g2s.im2col.3d | 3D | 9213 |
| cp.async.bulk.tensor.g2s.im2col.4d | 4D | 9214 |
| cp.async.bulk.tensor.g2s.im2col.5d | 5D | 9215 |
| cp.async.bulk.tensor.gmem.to.smem.1d | 1D | 8324 |
| cp.async.bulk.tensor.gmem.to.smem.2d | 2D | 8325 |
| cp.async.bulk.tensor.gmem.to.smem.3d | 3D | 8326 |
| cp.async.bulk.tensor.gmem.to.smem.4d | 4D | 8327 |
| cp.async.bulk.tensor.gmem.to.smem.5d | 5D | 8328 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.3d | 3D | 8329 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.4d | 4D | 8330 |
| cp.async.bulk.tensor.gmem.to.smem.im2col.w.5d | 5D | 8331 |

Unstructured bulk copies (byte-level, no tensor map descriptor):

| Intrinsic Pattern | Opcode |
|---|---|
| cp.async.bulk.global.to.shared.cluster | 8315 |
| cp.async.bulk.gmem.to.dsmem | 8316 |

Fragment-indexed TMA (from builtin IDs 411/412 via sub_9483E0):

| LLVM Intrinsic | Base Opcode | Index Range |
|---|---|---|
| llvm.nvvm.tma.load | 9233 | 9227–9232 (6 entries, indexed by fragment count) |
| llvm.nvvm.tma.store | 9257 | (corresponding store entries) |

TMA SelectionDAG Lowering (sub_33AD3D0)

The unified TMA handler sub_33AD3D0 receives a mode argument from the main intrinsic lowering switch in sub_33B0210:

| Case | Mode | Operation | Memory Direction |
|---|---|---|---|
| 0x179 | 2 | TMA load | global -> shared |
| 0x17A | 3 | TMA store | shared -> global |
| 0x17B | 5 | TMA prefetch | global (read-only) |
| 0x17C | 7 | TMA multicast load | global -> N shared (across cluster) |

Related cp.async handlers in the same dispatch table:

| Case | Handler | Operation |
|---|---|---|
| 0x175 | sub_33AC2B0 | cp.async (non-TMA async copy) |
| 0x176 | sub_33AC130 | cp.async.wait |
| 0x177 | sub_33AB690 | cp.async.bulk (non-tensor bulk copy) |
| 0x178 | goto LABEL_32 | No-op — commit/barrier (scheduling fence only) |

The 0x178 no-op is significant: it represents the cp.async.bulk commit/barrier intrinsic that exists purely for scheduling purposes. The compiler preserves it as a DAG ordering constraint even though it produces no data-flow SDNode.

CpAsyncBulkTensor G2S Lowering (sub_36EC510)

The 27 KB function sub_36EC510 (1185 lines) implements the complete cp.async.bulk.tensor global-to-shared lowering with full architecture gating and mode validation.

Architecture gates (read from offset+340 of the subtarget object):

| SM Value | Hex | Features Unlocked |
|---|---|---|
| >= 1000 | 0x3E8 | SM 90: tile mode (1D–5D), Im2Col mode (3D–5D) |
| >= 1032 | 0x408 | SM 100: adds 2CTA mode, Im2Col_W, Im2Col_W128 |

Mode bit decoding from operand v11:

| Bits | Mask | Meaning |
|---|---|---|
| 2–4 | v11 & 0x1C | Im2Col variant: Im2Col, Im2Col_W, Im2Col_W128 |
| 3–4 | v11 & 0x18 | 2CTA mode flag |

Validation error strings (emitted as fatal diagnostics):

  • "NumDims should be at least 3 for Im2Col or Im2Col_W or Im2Col_W128 mode" — Im2Col requires >= 3D tensors
  • "Im2Col_W and Im2Col_W128 modes are not supported on this architecture." — SM 90 does not support Im2Col_W/W128; requires SM 100+
  • "2CTA Mode for CpAsyncBulkTensorG2S not supported on this architecture" — 2CTA mode requires SM 100+
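The gates and their diagnostics can be sketched as follows. Error strings and SM thresholds come from this section; parameterizing by a symbolic mode name and a 2CTA flag (rather than decoding the raw v11 bits) is an assumption made for clarity:

```python
def check_g2s_support(mode, two_cta, num_dims, sm):
    """Model of the sub_36EC510 validation gates.

    mode: "Tile", "Im2Col", "Im2Col_W", or "Im2Col_W128" (symbolic stand-in
    for the v11 bit decode); sm: subtarget value at offset+340 (1000 = SM 90,
    1032 = SM 100)."""
    if mode in ("Im2Col", "Im2Col_W", "Im2Col_W128") and num_dims < 3:
        raise ValueError("NumDims should be at least 3 for Im2Col or "
                         "Im2Col_W or Im2Col_W128 mode")
    if mode in ("Im2Col_W", "Im2Col_W128") and sm < 1032:
        raise ValueError("Im2Col_W and Im2Col_W128 modes are not supported "
                         "on this architecture.")
    if two_cta and sm < 1032:
        raise ValueError("2CTA Mode for CpAsyncBulkTensorG2S not supported "
                         "on this architecture")
```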

TMA Builtin Codegen (EDG -> LLVM IR)

The EDG-to-LLVM builtin lowering handles TMA as builtin IDs 411 and 412 (hex 0x19B / 0x19C).

ID 411 (scatter/store path) — sub_12A7070 extracts TMA descriptor info, then an iterative loop builds a vector of per-element store nodes. The intrinsic table 0x107A–0x107F (4218–4223) selects among 6 entries indexed by element count. Approximately 300 lines of handler code (lines 1256–1501 of sub_12A71A0).

ID 412 (gather/load path) — Similar structure but for the load direction. Uses intrinsic table 0x1094–0x109A (4244–4250). Includes bitcast insertion (opcode 47) for type mismatches between the descriptor element type and the destination register type. Approximately 450 lines (lines 1503–1713).

Both paths use:

  • sub_12AA280 — TMA descriptor builder (constructs the multi-operand struct from the builtin arguments)
  • sub_12A9E60 — extractvalue emission (decomposes aggregate returns into individual registers)
  • sub_39FAC40 — Fragment count computation (determines how many load/store fragments the TMA operation expands into)

TMA Scheduling Constraints

TMA operations impose specific scheduling constraints visible in cicc's SelectionDAG construction:

  1. Chain dependencies by mode. Every TMA operation produces a memory chain in the SelectionDAG. The mode parameter determines the chain direction:

    | Mode | Reads | Writes | Chain Effect |
    |---|---|---|---|
    | 2 (load) | global | shared | Load chain |
    | 3 (store) | shared | global | Store chain |
    | 5 (prefetch) | global | (none) | Load chain |
    | 7 (multicast) | global | N x shared | Load chain |
  2. Commit-as-fence. Intrinsic ID 0x178 lowers to no-op (goto LABEL_32), functioning as a pure scheduling barrier. This prevents the DAG scheduler from reordering TMA operations past their commit point.

  3. Async qualifier hierarchy. The memory space qualifiers emitted by sub_35F4B50 form an ordered fence hierarchy:

    | Qualifier | Scope | Strength |
    |---|---|---|
    | .async | Unscoped | Weakest |
    | .async.global | Global memory domain | |
    | .async.shared::cta | CTA-local shared memory | |
    | .async.shared::cluster | Cluster shared memory (DSMEM) | Strongest |

Distributed Shared Memory

Hopper's cluster architecture enables distributed shared memory (DSMEM) across CTAs in a cluster. The NVPTX backend emits memory space qualifiers from two functions:

sub_35F4B50 — Async memory space qualifier emission (switch on operand):

| Line | Qualifier | Semantic |
|---|---|---|
| 20 | .async | Base async qualifier (unscoped) |
| 32 | .async.global | Async from global memory |
| 45 | .async.shared::cta | Async to CTA-local shared memory |
| 59 | .async.shared::cluster | Async to cluster distributed shared memory |
| 73 | .alias | Aliased access modifier (permits overlapping accesses) |

sub_35F4E30 — Commit modifier emission (switch on operand):

| Line | Qualifier | Semantic |
|---|---|---|
| 28 | .cta_group::1 | CTA group 1 selection |
| 38 | .cta_group::2 | CTA group 2 selection |
| 51 | .mbarrier::arrive::one | Single-thread mbarrier arrive |
| 67 | .shared::cluster | Cluster shared memory scope |
| 80 | .multicast::cluster | Multicast to all CTAs in cluster |

sub_35F4080 — Secondary .shared::cluster emission (line 68), used in non-commit contexts.

These qualifiers attach to cp.async.bulk and mbarrier instructions to specify the scope and direction of asynchronous data movement within the cluster.

Mbarrier Extensions — DMA Fence/Arrive/Wait

Hopper extends the async barrier (mbarrier) mechanism to coordinate TMA data movement. The TMA DMA pipeline follows a three-phase synchronization protocol:

Phase 1: Initialization

.mbarrier_init (emitted from sub_35F4AD0) initializes the async barrier with the expected transaction byte count. The arrive_expect_tx variant sets both the expected arrival count and the transaction byte count atomically.

Phase 2: Arrive (Producer Signals Completion)

When a TMA operation completes, it signals the mbarrier:

  • .mbarrier::arrive::one (sub_35F4E30 line 51) — single-thread arrive notification. The TMA hardware auto-arrives with the transferred byte count.
  • .cta_group::1 / .cta_group::2 (sub_35F4E30 lines 28/38) — selects which CTA group the arrive targets, enabling pipelined producer-consumer patterns where two groups alternate roles.

Phase 3: Wait (Consumer Blocks)

The consumer thread issues mbarrier.try_wait with a phase bit. The phase alternates each time the barrier completes a full cycle, enabling pipelined double-buffered access patterns. No additional cicc emission function is needed; the standard mbarrier wait path handles this.
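The three-phase protocol can be modeled as a small simulation. This is a conceptual sketch of the hardware barrier behavior described above, not compiler code; the class shape and names are assumptions:

```python
class MBarrierModel:
    """Behavioral model of a Hopper mbarrier in transaction-count mode.

    init sets the expected transaction byte count; TMA completions call
    arrive_tx with the bytes transferred; consumers poll try_wait against
    a phase parity bit that flips each time the barrier completes a cycle."""

    def __init__(self, expected_tx):
        self.expected_tx = expected_tx
        self.pending_tx = expected_tx
        self.phase = 0

    def arrive_tx(self, nbytes):
        """Producer side: TMA auto-arrives with the transferred byte count."""
        self.pending_tx -= nbytes
        if self.pending_tx <= 0:            # barrier cycle completes
            self.phase ^= 1                 # phase bit flips
            self.pending_tx = self.expected_tx

    def try_wait(self, parity):
        """Consumer side: True once the polled phase has been passed."""
        return self.phase != parity
```

The phase-bit alternation is what enables double buffering: while the consumer waits on parity 0 for buffer A, the producer can already be filling buffer B against parity 1.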

WGMMA Fence/Commit/Wait (Distinct Pipeline)

WGMMA has its own synchronization cycle, separate from TMA mbarriers:

| Builtin | IDs | Handler | LLVM Intrinsic |
|---|---|---|---|
| __wgmma_fence | 745–750 | sub_12B1C20 | 9062 (wgmma.fence.aligned, 3 type overloads) |
| __wgmma_commit_group | (same range) | sub_12B1C20 | (same dispatch) |
| __wgmma_wait_group | (same range) | sub_12B1C20 | (same dispatch) |

WGMMA fences synchronize the tensor core accumulator pipeline; TMA mbarriers synchronize the DMA engine. A typical Hopper kernel pipelines both: TMA loads data into shared memory (mbarrier-synchronized), then WGMMA consumes the data from shared memory (fence-synchronized). The two synchronization domains must not be confused in a reimplementation.

Feature Flag Configuration

The master feature configurator sub_60E7C0 sets the following flags at the sm_90+ threshold (qword_4F077A8 > 89999):

| Flag | Source |
|---|---|
| unk_4D043D0 | sub_60E7C0 |
| unk_4D041B0 | sub_60E7C0 |
| unk_4D04814 | sub_60E7C0 |
| unk_4D0486C | sub_60E7C0 (with C++ version check) |
| dword_4F07760 | sub_60E530 |
| dword_4D043F8 | sub_60E530 (at > 99999) |
| dword_4D041E8 | sub_60E530 (at > 99999) |

Key Binary Locations

| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_HOPPER_9_0) |
| ctor_356 | 0x50C890 | Subtarget registration (sm_90 enum 38, sm_90a enum 39) |
| sub_214DA90 | 0x214DA90 | Kernel attribute emitter (cluster PTX directives) |
| sub_21E9060 | 0x21E9060 | Cluster special register PTX emission |
| sub_21E8EA0 | 0x21E8EA0 | Cluster barrier instruction emission |
| sub_21E94F0 | 0x21E94F0 | Membar/fence printer (fence.sc.cluster) |
| sub_BFC6A0 | 0xBFC6A0 | setmaxnreg NVVM IR validation |
| sub_FCDCB0 | 0xFCDCB0 | setmaxnreg inline asm pattern matching |
| sub_955A70 | 0x955A70 | WGMMA lowering (M-dimension switch) |
| sub_90AEE0 | 0x90AEE0 | Builtin registration (WGMMA, cluster barriers/queries) |
| sub_A8E250 | 0xA8E250 | TMA intrinsic name parsing (52 KB) |
| sub_33AD3D0 | 0x33AD3D0 | TMA SelectionDAG lowering handler (modes 2/3/5/7) |
| sub_33AB690 | 0x33AB690 | cp.async.bulk non-tensor handler |
| sub_33AC2B0 | 0x33AC2B0 | cp.async handler |
| sub_33AC130 | 0x33AC130 | cp.async.wait handler |
| sub_36EC510 | 0x36EC510 | CpAsyncBulkTensor G2S lowering (27 KB, 1185 lines) |
| sub_9483E0 | 0x9483E0 | TMA descriptor extraction |
| sub_12AA280 | 0x12AA280 | TMA descriptor builder (EDG -> LLVM IR) |
| sub_12A7070 | 0x12A7070 | TMA scatter/store builtin handler |
| sub_8D4C10 | 0x8D4C10 | edg::get_tma_descriptor_flags |
| sub_35F4B50 | 0x35F4B50 | DSMEM qualifier emission |
| sub_35F4E30 | 0x35F4E30 | Commit modifier emission (mbarrier, multicast) |
| sub_35F4AD0 | 0x35F4AD0 | .mbarrier_init emission |
| sub_35F4080 | 0x35F4080 | Secondary .shared::cluster emission |