Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SM80-88 Ampere

The Ampere family in nvlink v13.0.88 covers four compute capabilities -- sm_80, sm_86, sm_87, and sm_88 -- all sharing the ISA class string "Ampere", the same ISel backend at 0xCA0000--0xDA0000 (1 MB), and the same 128-bit SASS instruction encoding introduced by Turing. The sm_88 variant is new in CUDA 13.0. None of the four Ampere targets carry 'a' or 'f' suffix variants (unlike Blackwell targets), so the profile database contains exactly 12 Ampere profile entries: 4 real, 4 virtual, and 4 LTO.

For general Ampere architecture details (hardware specs, PTX ISA requirements, codegen factory encoding, scheduler parameters, latency tables), see the ptxas wiki: Turing/Ampere. For cicc-level feature gates and optimizer flag configuration, see the cicc wiki: SM70-89. This page focuses on the nvlink-internal per-sub-architecture data: profile registration, capability vectors, dispatch table function pointers, and the ISel backend shared by all four.

Per-Architecture Profiles

Profile Registration in sub_484F50

The profile database initializer sub_484F50 (53,974 bytes) registers the four Ampere architectures in numeric order. Each architecture produces three profile objects via sub_484DB0 (real sm_, virtual compute_, LTO lto_), which are inserted into the global hash map at qword_2A5F8D8. The Ampere block spans lines 288--462 of the decompiled source.

SMRegistration Linessub_484DB0 ArgsVariables
sm_80288--331(0, 0, "sm_80", "sm_80", "Ampere", "-D__CUDA_ARCH__=800", "sm_80")v14 (real), v15 (virtual)
sm_86332--377(0, 0, "sm_86", "sm_86", "Ampere", "-D__CUDA_ARCH__=860", "sm_86")v22 (real), v23 (virtual)
sm_87378--420(0, 0, "sm_87", "sm_87", "Ampere", "-D__CUDA_ARCH__=870", "sm_87")v31 (real), v32 (virtual)
sm_88421--462(0, 0, "sm_88", "sm_88", "Ampere", "-D__CUDA_ARCH__=880", "sm_88")v39 (real), v40 (virtual)

The registration pattern for each is identical:

// 1. Create real profile (is_virtual=0, is_lto=0)
sm_80 = ArchProfile::create(0, 0, "sm_80", "sm_80", "Ampere",
                            "-D__CUDA_ARCH__=800", "sm_80");

// 2. Create virtual profile (is_virtual=1, is_lto=0)
compute_80 = ArchProfile::create(1, 0, "compute_80", "compute_80", "Ampere",
                                 "-D__CUDA_ARCH__=800", "compute_80");

// 3. Link real <-> virtual
sm_80->virtual_ptr = compute_80;       // offset +72
compute_80->virtual_ptr = compute_80;  // self-ref

// 4. Insert into hash map
LinkerHash::insert(qword_2A5F8D8, "sm_80", sm_80);
LinkerHash::insert(qword_2A5F8D8, "compute_80", compute_80);

// 5. Create LTO profile (is_virtual=1, is_lto=1)
lto_80 = ArchProfile::create(1, 1, "lto_80", "compute_80", NULL,
                             "-D__CUDA_ARCH__=800", "lto_80");
lto_80->virtual_ptr = compute_80;
LinkerHash::insert(qword_2A5F8D8, "lto_80", lto_80);

// 6. Build compatibility lists
list_append(compute_80->compat_list_2, sm_80);    // offset +64
list_append(sm_80->compat_list_2, compute_80);
list_append(sm_80->compat_list_1, sm_80);          // offset +56: family
list_append(sm_80->compat_list_0, sm_80);          // offset +48: cross-variant

// 7. Copy capability vectors from rodata
sm_80->capability[0] = xmmword_1D40F10;   // offset +80
sm_80->capability[1] = xmmword_1D40F40;   // offset +96
sm_80->capability[2] = xmmword_1D40F30;   // offset +112

After registering each sub-architecture, sub_484F50 links it into the Ampere family chain. sm_86 is appended to sm_80's family list (compat_list_0 and compat_list_1), sm_87 and sm_88 follow the same chaining to sm_86's lists (which transitively connect back to sm_80).

The dword_2A5F8CC = 80 assignment (line 326) sets the default minimum architecture to sm_80 -- any link operation that does not specify an explicit --arch flag defaults to this value.

Architecture Identity Matrix

Propertysm_80sm_86sm_87sm_88
ISA class string"Ampere""Ampere""Ampere""Ampere"
Family string"Ampere""Ampere""Ampere""Ampere"
__CUDA_ARCH__800860870880
Preprocessor define-D__CUDA_ARCH__=800-D__CUDA_ARCH__=860-D__CUDA_ARCH__=870-D__CUDA_ARCH__=880
Suffix variantsNoneNoneNoneNone
Profile byte[3] (finalization class)0000
ProductsGA100 (A100, A30)GA10x (RTX 3090/3080/3070/3060, A40, A16)GA10B (Jetson Orin AGX/NX/Nano)Undocumented (CUDA 13.0)
Same-decade group (SM/10)8888

All four share same-decade group 8, which also includes sm_89 (Ada). This means code compiled for sm_80 is compatible with any target in the 80--89 range via the same-decade rule (see the Compatibility page).

Capability Vectors

Each profile stores three 128-bit XMM vectors at offsets +80, +96, and +112. These encode hardware capability bitmasks checked by the finalization pipeline (sub_470DA0). The vectors are loaded from rodata constants in sub_484F50:

ArchitectureVec 0 (offset +80)Vec 1 (offset +96)Vec 2 (offset +112)Source Lines
sm_75 (Turing)xmmword_1D40F10xmmword_1D40F20xmmword_1D40F30283--287
sm_80xmmword_1D40F10xmmword_1D40F40xmmword_1D40F30325--331
sm_86xmmword_1D40F10xmmword_1D40F50xmmword_1D40F30370--374
sm_87xmmword_1D40F10xmmword_1D40F50 (copied from sm_86)xmmword_1D40F30416--420
sm_88xmmword_1D40F10xmmword_1D40F50 (copied from sm_87)xmmword_1D40F30458--462
sm_89 (Ada)xmmword_1D40F10xmmword_1D40F60xmmword_1D40F30499--505

The rodata symbols encode these capability tiers:

SymbolRoleArchitectures Using It
xmmword_1D40F10Universal base (Vec 0)All architectures sm_75--sm_121
xmmword_1D40F20Turing feature set (Vec 1)sm_75 only
xmmword_1D40F30Pre-Blackwell ISA version (Vec 2)sm_75 through sm_90a
xmmword_1D40F40Ampere-base feature set (Vec 1)sm_80, sm_90, sm_100, sm_103
xmmword_1D40F50Ampere-86+ feature set (Vec 1)sm_86, sm_87, sm_88
xmmword_1D40F60Ada/Thor/RTX-50 feature set (Vec 1)sm_89, sm_110, sm_120, sm_121
xmmword_1D40F70Blackwell ISA version (Vec 2)sm_100 onward

Key observations:

  • Vec 0 is constant across all architectures. It encodes the universal baseline capabilities.
  • Vec 1 differentiates sub-architectures. sm_80 gets xmmword_1D40F40 (Ampere base), while sm_86/87/88 share xmmword_1D40F50 (extended Ampere). The sm_86 value propagates to sm_87 and sm_88 through copy chains (v211 = v29 at line 375; v210 = v37 at line 418), not through independent rodata loads. This means sm_86, sm_87, and sm_88 are capability-identical from the finalization pipeline's perspective.
  • Vec 2 is constant within the pre-Blackwell group. All Ampere targets use xmmword_1D40F30.
  • The Ampere-to-Ada boundary is marked by the Vec 1 change from xmmword_1D40F50 (sm_86/87/88) to xmmword_1D40F60 (sm_89).

What Distinguishes Each Sub-Architecture

sm_80 (GA100 -- datacenter Ampere). The base Ampere target and generational anchor. It uses xmmword_1D40F40 for Vec 1, which differs from the sm_86+ value xmmword_1D40F50. This means sm_80's capability mask is a strict subset of sm_86's -- code finalized for sm_86 is not guaranteed to be re-finalizable for sm_80 without capability mask verification. sm_80's codegen factory is 28673 (0x7001), placing it at sub-variant 1 within generation 7. In the ptxas scheduler, sm_80 falls through to the default variant (0 or 1) and uses baseline latency tables.

sm_86 (GA10x -- consumer/enterprise Ampere). Extends sm_80 with xmmword_1D40F50 capability bits. The codegen factory is 28674 (0x7002, sub-variant 2). In the ptxas scheduler, sm_86 maps to sub-architecture variant 2, receiving tuned scheduling parameters distinct from sm_80's baseline. The ptxas latency table for sm_86 (sub_8E7D80, 4.4 KB) is the largest in the Ampere family, reflecting the different pipeline characteristics of the RTX 30xx consumer die versus the A100 datacenter die.

sm_87 (GA10B -- Jetson Orin). Capability-identical to sm_86 (same Vec 1 xmmword_1D40F50, copied at line 418 via v37 = _mm_load_si128(&v211) where v211 holds sm_86's vector). The codegen factory is 28675 (0x7003, sub-variant 3). The ptxas scheduler maps it to variant 3, giving Orin its own latency profile (sub_8E8070, 3.5 KB). This is the only material binary difference between sm_86 and sm_87 in nvlink -- the capability mask, ISel patterns, and encoding pipeline are identical.

sm_88 (undocumented -- CUDA 13.0). Capability-identical to sm_86 and sm_87 (same Vec 1, copied at line 458 via the sm_87 chain). The codegen factory is 28676 (0x7004, sub-variant 4). The ptxas scheduler maps it to variant 4, shared with sm_110 (Jetson Thor). No separate latency table was found in the ptxas sweep data -- sm_88 may share sm_86's or sm_87's table. No public product ships on sm_88; it may represent an unreleased Ampere derivative or internal validation target.

Dispatch Table (sub_15C0CE0)

The dispatch table initializer sub_15C0CE0 registers seven callback function pointers per architecture into hash maps (qword_2A644B8 through qword_2A64488). Each callback serves a specific role in the embedded compilation pipeline.

Slot Assignments

SlotHash Mapsm_75 (Turing)sm_80sm_86sm_87sm_88
B8Pre-compilationsub_15C2AA0sub_15C2BF0sub_15C2C80sub_15C2E30sub_15C2DA0
B0Compilationsub_15C2A70sub_15C2BC0sub_15C2CB0sub_15C2D10sub_15C2DD0
A8Backend initsub_15C3210sub_15C3310sub_15C3B60sub_15C3C60sub_15C3A60
A0Internal versionbyte_2A5EE40byte_2A5EE3Cbyte_2A5EE38byte_2A5EE34byte_2A5EE30
90Perf-statssub_15C1C80sub_15C1EB0sub_15C1EF0sub_15C1FD0sub_15C1E30
88Resource calcsub_15C2610sub_15C28B0sub_15C1FF0sub_15C2990sub_15C2530

Each compute_ prefix variant receives the same internal version constant as its real counterpart (e.g., compute_80 = byte_2A5EE3C, compute_86 = byte_2A5EE38).

Backend Init Functions (Slot A8)

The A8 slot handlers are the only callbacks with architecturally significant differences between Ampere sub-architectures. Each allocates a per-function codegen context via sub_189F230(a3) and writes a per-SM codegen factory value at offset +348:

SMA8 HandlerFactory ValueHexEncoding
sm_75sub_15C3210245770x6001Generation 6, sub-variant 1
sm_80sub_15C3310286730x7001Generation 7, sub-variant 1
sm_86sub_15C3B60286740x7002Generation 7, sub-variant 2
sm_87sub_15C3C60286750x7003Generation 7, sub-variant 3
sm_88sub_15C3A60286760x7004Generation 7, sub-variant 4

The codegen factory encodes (isa_generation << 12) | sub_variant. All four Ampere targets are generation 7 (bits 12--15 = 0x7), differing only in the low 12-bit sub-variant field. The factory value controls:

  1. Scheduler profile selection in sub_8E4400: all four fall into the 7-warp / 208-dispatch-slot bucket (factory range 24577--28676, threshold <= 32767).
  2. Sub-architecture variant for latency tuning: sm_80 gets default variant, sm_86 gets variant 2, sm_87 gets variant 3, sm_88 gets variant 4.
  3. Instruction encoding table selection: each sub-variant can enable/disable specific instruction forms.

All other fields at +344 (shared memory config base, 0x100000 = 1 MB) and the remaining context fields are identical across all four Ampere handlers.

Internal Version Numbers

The internal version numbers at qword_2A644A0 are stored as 4-byte integers at decreasing addresses, each 4 bytes apart. From the version mapping table in the SM89 Ada page:

AddressSMInternal VersionNotes
byte_2A5EE40sm_7514Turing baseline
byte_2A5EE3Csm_80(inferred 25)Ampere base
byte_2A5EE38sm_86(inferred 26)First sub-arch > 26 threshold triggers
byte_2A5EE34sm_87(inferred 27)Orin
byte_2A5EE30sm_88(inferred 28)Undocumented
byte_2A5EE2Csm_8929Ada (confirmed)

The specific internal versions for sm_80--sm_88 are inferred from the address spacing pattern and the known threshold at *(a1+376) > 26 (sm_86+, confirmed in the compilation driver sub_1112F30 line 991). The sm_89 value of 29 is confirmed from decompiled code. The gap between sm_75 (internal 14) and sm_80 (inferred 25) reflects deprecated architectures in the 15--24 range that were removed from the profile database but whose internal version slots remain allocated.

Ampere vs Turing: Generational Differences

The transition from Turing (sm_75, generation 6) to Ampere (sm_80, generation 7) manifests in nvlink at several levels. For hardware and ISA-level details, see ptxas wiki: Turing/Ampere. Within nvlink specifically:

Profile-Level Changes

Aspectsm_75 (Turing)sm_80 (Ampere)
ISA class string"Turing""Ampere"
ISA generation67
Codegen factory24577 (0x6001)28673 (0x7001)
Vec 1 capabilityxmmword_1D40F20xmmword_1D40F40
Same-decade group78
Default arch flag--dword_2A5F8CC = 80

Capability Boundary

The Vec 1 change from xmmword_1D40F20 (Turing) to xmmword_1D40F40 (Ampere) marks a hard capability boundary. Code finalized for sm_80 cannot be re-finalized for sm_75 -- the capability mask comparison in sub_470DA0 will fail because Ampere capabilities are a superset of Turing capabilities. The reverse direction (sm_75 code on sm_80 targets) is permitted by the same-decade rule only if both are in the same family linked list, which they are not (different decades: 7 vs 8).

ISel Backend Independence

Despite sharing the same 128-bit instruction encoding format, sm_75 and sm_80 use completely separate ISel backends:

PropertySM75 (Turing)SM80 (Ampere)
Address range0xF16000--0x100C000 (984 KB)0xCA0000--0xDA0000 (1 MB)
Mega-hubsub_FBB810 (280 KB)sub_D5FD70 (239 KB)
Pattern matchers276259
SASS opcodes--19 (in LTO subset)

The Ampere ISel is slightly smaller than Turing's despite being a later generation. The reduction in pattern matchers (276 to 259) reflects instruction set refinement rather than feature removal -- Ampere reorganized and consolidated some multi-variant patterns.

Feature Gates in cicc and ptxas

From the cicc wiki: SM70-89, the Ampere-specific features enabled at the sm_80 threshold include:

  • C++20 __VA_OPT__ support via unk_4D041B8
  • llvm.nvvm.branch.if.convergent intrinsic (sm_80+, distinct from the sm_70+ branch.if.all.convergent)
  • L2 cache hint atomics (PTX 7.3+)
  • cp.async.bulk patterns for asynchronous memory copy

From the ptxas wiki: Turing/Ampere, the sm_82 SASS opcode boundary defines 22 Ampere-era opcode slots (indices 172--193) covering sparse MMA, binary tensor core shapes, FP64 tensor MMA, async copy infrastructure, and warp-wide reduction.

Family Linkage

The Ampere family linked list built by sub_484F50 chains all four sub-architectures through compat_list_1 (offset +56) and compat_list_0 (offset +48):

sm_80.compat_list_1 -> { sm_80, sm_86, sm_87, sm_88 }
sm_86.compat_list_1 -> { sm_86 }
    also: sm_86 appended to sm_80's compat_list_0 and compat_list_1
sm_87.compat_list_1 -> { sm_87 }
    also: sm_87 appended to sm_86's compat_list_0 and compat_list_1
sm_88.compat_list_1 -> { sm_88 }
    also: sm_88 appended to sm_87's compat_list_0 and compat_list_1

Additionally, sm_89 (Ada) links into the Ampere family chain:

// sub_484F50 lines 507--510:
list_append(sm_80->compat_list_0, sm_89);       // v14[3].m128i_i64[0] = sm_80's list
list_append(sm_80->compat_list_1, sm_89);       // v14[3].m128i_i64[1] = sm_80's list
list_append(sm_86->compat_list_0, sm_89);       // v22[3].m128i_i64[0] = sm_86's list
list_append(sm_86->compat_list_1, sm_89);       // v22[3].m128i_i64[1] = sm_86's list

This means sm_89 (Ada) is in the same compatibility family as the Ampere targets. The same-decade rule (80/10 = 89/10 = 8) already ensures this, but the explicit linkage provides direct traversal without arithmetic.

Worked Compatibility Example: sm_80 vs sm_86

Consider linking a cubin compiled for sm_80 into a final binary targeting sm_86.

  1. sub_4878A0 parses both architecture strings: record_a = {arch_number=80, is_virtual=0, has_suffix=0}, record_b = {arch_number=86, is_virtual=0, has_suffix=0}.
  2. Neither record is virtual, so the function does not reject on the "both virtual" check.
  3. record_a.has_suffix is 0, so the function enters Case 1: profile_family_match(sm_80.family_list, sm_86).
  4. sub_465450 traverses sm_80's compat_list_1: { sm_80, sm_86, sm_87, sm_88 }. sm_86 is found in the list.
  5. Result: compatible. The linker accepts the sm_80 cubin into the sm_86 link.

The reverse direction (sm_86 cubin targeting sm_80) follows a different path:

  1. Parse: record_a = {86, ...}, record_b = {80, ...}.
  2. Case 1: profile_family_match(sm_86.family_list, sm_80).
  3. sm_86's compat_list_1 contains { sm_86 } -- it does not contain sm_80.
  4. Result: incompatible. sm_86 code cannot be linked for sm_80.

This asymmetry reflects the forward-compatibility guarantee: code compiled for a lower SM within a family runs on higher SMs, but not the reverse.

However, capability mask verification at finalization time (sub_470DA0) adds a second check: even if the linker accepts the cubin, re-finalization will compare the capability vectors. Since sm_80 uses xmmword_1D40F40 (Vec 1) and sm_86 uses xmmword_1D40F50 (Vec 1), the sm_86 target has a superset of sm_80's capabilities. The finalization check passes for sm_80-compiled code targeting sm_86 but would fail for sm_86-compiled code targeting sm_80 if any sm_86-specific capability bits were used.

ISel Backend (0xCA0000--0xDA0000)

All four Ampere sub-architectures share a single ISel backend. The backend is variant-agnostic -- it produces identical SASS encoding for sm_80, sm_86, sm_87, and sm_88. Any differences between sub-architectures are resolved upstream in the dispatch table (slot A8 codegen factory) and downstream in the scheduler (latency tables), not within the ISel code itself.

Address Map

RangeSizeSubsystemFunctions
0xCA0000--0xCDC000240 KBOperand emission + bitfield packing137
0xCDD5F0--0xCDD690<1 KBOperand type predicates15
0xCE2000--0xD5FD70510 KBISel pattern matchers259
0xD5FD70239 KBISel mega-hub dispatcher1
0xD9A400--0xDA000023 KBBinary instruction encoding17

Total analyzed functions (>3 KB): 413. The mega-hub at sub_D5FD70 is 239 KB (55,985 instructions, 1,340 callees) and too large for Hex-Rays to decompile. It is the third-largest function in the binary after the SM75 mega-hub (sub_FBB810, 280 KB) and the cuda_builtin_prototype_generator (sub_15B86A0, 345 KB).

Three-Phase Pipeline

Every IR instruction processed by this backend passes through three phases in sequence: instruction selection, operand emission, and binary encoding. The output is a 128-bit SASS instruction word ready for insertion into the device ELF .text section.

Phase 1: Instruction Selection. The mega-hub sub_D5FD70 iterates all 259 pattern matcher functions. Each matcher receives (ctx, ir_node, &pattern_id, &priority) and tests a single candidate encoding. The highest-priority match wins.

for each pattern_matcher in sm80_pattern_table[0..258]:
    pattern_matcher(ctx, ir_node, &pattern_id, &priority)
    if priority > best_priority:
        best_priority = priority
        best_id = pattern_id
sm80_emitter_table[best_id](ctx, ir_node)

Each pattern matcher queries: instruction attributes via sub_A49150(ctx, ir_node, slot_id) (up to 12 attribute checks per pattern), definition/use operand counts via sub_530FD0/sub_530FC0, individual operands via sub_530FB0(ir_node, index), and operand type predicates: isGPR (sub_CDD600), isPredicate (sub_CDD610), isUniformReg (sub_CDD630), isImmediate (sub_CDD670), isConstBuf (sub_CDD680).

Priority values range from 14 (least specific, fallback patterns) to 34 (most specific, heavily constrained patterns). Higher priority means more attribute checks and narrower operand constraints.

Phase 2: Operand Emission. The winning pattern dispatches to an operand emission function from Zone 1 (0xCA0000--0xCDC000). Each emitter handles one (opcode, format) combination and populates a structured instruction descriptor: opcode ID at *(a2+12), encoding format at *(a2+14), max operand slot count at *(a2+15), modifier flags via setter functions, and decoded register operands.

Phase 3: Binary Encoding. Zone 3 functions (0xD9A400--0xDA0000, 17 functions) produce the final 128-bit SASS binary instruction word using SSE2 intrinsics (_mm_or_si128) for efficient 128-bit bulk operations.

Instruction Set Coverage

The SM80 backend handles 19 distinct SASS opcodes with a total of 80 (opcode, format) emission variants:

Opcode IDMnemonicVariantsDescriptionMax Operands
34HMMA11Tensor Core half-precision matrix multiply-accumulate25
39S2R2Move special register to GPR10
40CS2R2Move control/status special register to GPR10
90IMAD4Integer multiply-add (32-bit)19
127FFMA12FP32 fused multiply-add25
195DSETP2FP64 set predicate (comparison)10
205LEA1Load effective address computation19
230IMAD.WIDE9Integer multiply-add with 64-bit result25
284DADD1FP64 addition25
285LDG9Global memory load25
289ISETP4Integer set predicate19/37
290IMNMX4Integer min/max selection19
292FSETP2FP32 set predicate19
293SEL4Select / conditional move19
294SHFL1Warp shuffle (inter-lane communication)3
295FADD4FP32 addition19
296FMUL4FP32 multiplication19
297MUFU4Multi-function unit (sin/cos/sqrt/rsq/rcp/lg2/ex2)19
299HADD22FP16x2 packed addition19

These 19 opcodes represent the core compute-intensive instructions that the linker's embedded ptxas must emit during LTO compilation. The full Ampere ISA is substantially larger; instructions that never appear in LTO-generated code (control flow, barriers, texture, surface, etc.) are handled by separate codec tables at 0xC00070--0xC50970.

ISel Pattern Classes

The 259 ISel pattern matchers divide into 9 functional classes:

ClassPatternsAddress RangeTarget Opcodes
Integer/comparison600xCE20F0--0xCF0040ISETP, IMNMX, IMAD
Floating-point300xCF0040--0xCFA770FADD, FMUL, FSETP, DSETP, DADD
Memory/load-store310xCFA770--0xD07000LDG, S2R, CS2R
Conversion/special170xD07000--0xD0E000MUFU, HADD2, SHFL
Wide multiply140xD0E000--0xD13000IMAD.WIDE
Fused multiply-add400xD13000--0xD39000FFMA
Complex ALU140xD39000--0xD3E000LEA, SEL (complex forms)
Tensor core270xD3E000--0xD52000HMMA (all TC formats)
Predicate/select260xD52000--0xD5FD70SEL, predicate combinations

Priority levels range from 14 to 34. The priority system ensures more specific patterns win over generic fallbacks. Within each class, patterns form a lattice from general to specific:

Priority RangeAttribute ChecksOperand Constraint LevelExample
14--162--3Lightly constrainedGPR+Imm fallback
17--194--5StandardGPR+Pred+UReg+Imm
20--236--7Moderately constrainedMulti-operand with specific attributes
24--278--9Heavily constrainedUReg-only with 9 attribute checks
28--3410--12Highly constrainedFMA with 12 attribute checks, priority 34

Encoding Formats

The format field at *(a2+14) selects the operand encoding layout. Formats observed across all 19 opcodes:

Format IDNameDescription
0RRRegister-Register (both sources in GPRs)
1RIRegister-Immediate (one immediate source)
2RCRegister-ConstantBuffer (one source from constant memory)
3RR.ALTRegister-Register alternate encoding
4RR.PRegister-Register with predicate output
5RI.PRegister-Immediate with predicate output
6RC.PRegister-ConstantBuffer with predicate output
7SHFLWarp shuffle encoding
8RR.3SRCRegister-Register with 3 register sources
9RI.P2Register-Immediate with dual predicate output
10RR.WIDERegister-Register with wide (64-bit) result
11RR.ADDRegister-Register addition-specific encoding
13--18TCA--TCETensor Core formats A through E
19RR.MEMRegister-Register memory-mapped encoding
23--24TC.ALT/TC.ALT2Tensor Core alternate compact formats
42--45TC.WIDE1--4Tensor Core wide formats 1 through 4

Emission Function Catalog

The following tables list all 80 operand emission functions in Zone 1, organized by opcode. Each row is one (opcode, format) combination.

HMMA (Tensor Core) -- 11 variants

AddressIdentityFormatSizeNotes
sub_CCE930sm80_emit_HMMA_TC.ALT233,185 B3-operand compact form
sub_CCD8E0sm80_emit_HMMA_TC.ALT2243,134 B3-operand compact form
sub_CCECD0sm80_emit_HMMA_TCB143,151 B25-operand
sub_CCF070sm80_emit_HMMA_TCA133,202 B25-operand
sub_CD12D0sm80_emit_HMMA_TCC153,227 B25-operand
sub_CD0E40sm80_emit_HMMA_TCD173,211 B25-operand
sub_CD0230sm80_emit_HMMA_TCE183,160 B25-operand
sub_CD6740sm80_emit_HMMA_TC.WIDE24311,127 BComplex multi-variant with fixup tables
sub_CD7310sm80_emit_HMMA_TC.WIDE44511,127 BComplex multi-variant with fixup tables
sub_CD7EE0sm80_emit_HMMA_TC.WIDE14211,188 BComplex multi-variant with fixup tables
sub_CD8AC0sm80_emit_HMMA_TC.WIDE34411,204 BComplex multi-variant with fixup tables

FFMA -- 12 variants

AddressIdentityFormatSize
sub_CC7380sm80_emit_FFMA_RR04,102 B
sub_CC4F80sm80_emit_FFMA_RI13,866 B
sub_CC7880sm80_emit_FFMA_RC24,384 B
sub_CC58D0sm80_emit_FFMA_RR.ALT34,148 B
sub_CAAFE0sm80_emit_FFMA_RR.P44,603 B
sub_CC7D20sm80_emit_FFMA_RI.P54,397 B
sub_CC8990sm80_emit_FFMA_RC.P64,413 B
sub_CC4230sm80_emit_FFMA_SHFL73,669 B
sub_CC3500sm80_emit_FFMA_RR.3SRC83,433 B
sub_CC6B60sm80_emit_FFMA_RI.P293,888 B
sub_CC4AF0sm80_emit_FFMA_RR.WIDE103,685 B
sub_CC5440sm80_emit_FFMA_RR.ADD113,701 B

LDG (Global Memory Load) -- 9 variants

AddressIdentityFormatSize
sub_CD5BE0sm80_emit_LDG_RR011,464 B
sub_CD4520sm80_emit_LDG_RI111,399 B
sub_CD5080sm80_emit_LDG_RC211,448 B
sub_CD39E0sm80_emit_LDG_RR.ALT311,356 B
sub_CBA5D0sm80_emit_LDG_RR.P43,480 B
sub_CBA210sm80_emit_LDG_RI.P53,245 B
sub_CBADD0sm80_emit_LDG_RC.P63,492 B
sub_CBC350sm80_emit_LDG_SHFL73,680 B
sub_CBA9D0sm80_emit_LDG_RR.3SRC83,476 B

The RR/RI/RC/RR.ALT base forms are substantially larger (~11 KB each) than the predicated (.P) and shuffle forms (~3.5 KB), reflecting the complex cache hierarchy modifiers (strongOrder, eviction, scope, cacheOp, memoryType) that the base forms must encode.

IMAD.WIDE (64-bit Multiply-Add) -- 9 variants

AddressIdentityFormatSize
sub_CDBBD0sm80_emit_IMAD.WIDE_RR012,692 B
sub_CDA300sm80_emit_IMAD.WIDE_RI112,627 B
sub_CDAF60sm80_emit_IMAD.WIDE_RC212,676 B
sub_CD96B0sm80_emit_IMAD.WIDE_RR.ALT312,577 B
sub_CD1C80sm80_emit_IMAD.WIDE_RR.P44,874 B
sub_CD1770sm80_emit_IMAD.WIDE_RI.P54,638 B
sub_CD2730sm80_emit_IMAD.WIDE_RC.P64,889 B
sub_CD2C90sm80_emit_IMAD.WIDE_SHFL75,078 B
sub_CD21D0sm80_emit_IMAD.WIDE_RR.3SRC84,873 B

IMAD.WIDE has the largest base-form emitters at ~12.7 KB each. The wide result format requires handling both halves of the 64-bit product, with separate register pair allocation.

Remaining Opcodes

AddressIdentityFormatSize
sub_CCA5D0sm80_emit_IMAD_RR04,835 B
sub_CC2A20sm80_emit_IMAD_RC23,583 B
sub_CBF4A0sm80_emit_IMAD_RR.P43,798 B
sub_CC3D20sm80_emit_IMAD_RR.3SRC84,215 B
sub_CABB10sm80_emit_S2R_TCC153,213 B
sub_CABEA0sm80_emit_S2R_RR.MEM193,216 B
sub_CAC230sm80_emit_CS2R_TCC153,314 B
sub_CAB750sm80_emit_CS2R_RR.MEM193,317 B
sub_CC9C30sm80_emit_DSETP_RI15,292 B
sub_CC5D30sm80_emit_DSETP_RR.ALT34,903 B
sub_CAE310sm80_emit_LEA_RR05,246 B
sub_CBED30sm80_emit_DADD_RR.ADD113,153 B
sub_CC46A0sm80_emit_ISETP_RR03,741 B
sub_CC6260sm80_emit_ISETP_RI13,785 B
sub_CC66E0sm80_emit_ISETP_RC23,917 B
sub_CC84F0sm80_emit_ISETP_RR.ALT33,961 B
sub_CB96B0sm80_emit_IMNMX_RR03,570 B
sub_CB9AD0sm80_emit_IMNMX_RI13,614 B
sub_CB8E80sm80_emit_IMNMX_RC23,467 B
sub_CB9280sm80_emit_IMNMX_RR.ALT33,511 B
sub_CC2660sm80_emit_FSETP_RR03,331 B
sub_CC3930sm80_emit_FSETP_RI13,376 B
sub_CC22A0sm80_emit_SEL_RR03,316 B
sub_CC2EA0sm80_emit_SEL_RI13,361 B
sub_CBF0F0sm80_emit_SEL_RC23,213 B
sub_CC1ED0sm80_emit_SEL_RR.ALT33,258 B
sub_CA02C0sm80_emit_SHFL_SHFL74,076 B
sub_CB7B20sm80_emit_FADD_RR07,415 B
sub_CBF900sm80_emit_FADD_RI127,061 B
sub_CA08C0sm80_emit_FADD_RC244,490 B
sub_CB71C0sm80_emit_FADD_RR.ALT36,705 B
sub_CB60E0sm80_emit_FMUL_RR05,715 B
sub_CBB1E0sm80_emit_FMUL_RI111,561 B
sub_CBC940sm80_emit_FMUL_RC214,951 B
sub_CB4BE0sm80_emit_FMUL_RR.ALT35,164 B
sub_CB4390sm80_emit_MUFU_RR04,743 B
sub_CB5A10sm80_emit_MUFU_RI15,377 B
sub_CB8410sm80_emit_MUFU_RC28,046 B
sub_CB3E20sm80_emit_MUFU_RR.ALT34,529 B
sub_CB5380sm80_emit_HADD2_RR05,434 B
sub_CB67F0sm80_emit_HADD2_RI15,648 B

Notable size outliers: sm80_emit_FADD_RC at 44,490 bytes is the single largest emission function, followed by sm80_emit_FADD_RI at 27,061 bytes. Both require extensive fixup tables for the constant-buffer and immediate operand forms of FP32 addition with negation, absolute value, saturation, and data type modifiers.

Bitfield Packing Functions

Zone 1 contains 75 bitfield packing functions. These translate the instruction descriptor's register IDs, opcode fields, and modifiers into bit positions within the 128-bit SASS instruction word:

ClassFunctionsShift-Pack OpsNotes
FADD/FMUL/MUFU/HADD2/SHFL2611--16Arithmetic + special function encoding
IMAD/FFMA/LEA313--15Multiply-add class encoding
FFMA415Fused multiply-add specific
FFMA/DSETP1213--16FMA and FP64 comparison encoding
HMMA (Tensor Core)416Tensor core with fixed 16-bitfield layout
HMMA/IMAD.WIDE1014--20Wide operand encoding (most complex)

Operand Type Predicates

Fifteen small predicate functions at 0xCDD5F0--0xCDD690 classify operands by type:

AddressIdentityCheck
sub_CDD5F0getRegFileExtract register file from register ID
sub_CDD600isGPROperand is a general-purpose register
sub_CDD610isPredicateOperand is a predicate register
sub_CDD630isUniformRegOperand is a uniform register
sub_CDD670isImmediateOperand is an immediate value
sub_CDD680isConstBufOperand is a constant buffer reference

Instruction Modifier Setters

AddressIdentityModifier
sub_509100setDnzModeDenormalized-number-as-zero mode
sub_509160setRoundingModeIEEE rounding mode (RN, RZ, RP, RM)
sub_509760setAbsoluteAbsolute value modifier
sub_509890setEvictFirstCache eviction hint
sub_509950setCacheLevelCache level targeting
sub_509B00setScopeMemory scope (CTA, GPU, SYS)
sub_509B20setEvictionEviction policy
sub_50AC80setCacheOpCache operation type
sub_50ACD0setMemoryTypeMemory type qualifier
sub_50B160setComparisonComparison predicate (LT, GT, EQ, NE, ...)
sub_50B300setNegationSource operand negation
sub_50B500setDataTypeData type (F32, F16, S32, U32, ...)
sub_50B900setStrongOrderStrong memory ordering
sub_50BD00setSaturationOutput saturation clamp
sub_50BDA0setAddrSpaceAddress space (global, shared, local)
sub_50C060setFtzModeFlush-to-zero for denormals

External Dependencies

The SM80 backend calls into shared infrastructure functions:

AddressIdentityRole
sub_530FB0getOperand(idx)Universal operand accessor (31,399 callers)
sub_530FC0getNumUses()Count use (input) operands
sub_530FD0getNumDefs()Count definition (output) operands
sub_A49150getInsnAttributeQuery instruction attribute by slot ID (30,768 callers)
sub_4FF010emitRegOperandEmit register operand to descriptor
sub_4FF150emitPredicateOperandEmit predicate operand to descriptor
sub_4FF280emitAddrOperandEmit address/memory operand to descriptor
sub_4C28B0setBitfieldCore primitive: pack value into 128-bit instruction at bit offset
sub_4C2A90initInsnFromTemplateInitialize instruction buffer from static template
sub_4C4D60encodeOperandSlot0Encode register into operand slot 0
sub_4C5C30encodeOperandSlot1Encode register into operand slot 1
sub_A50D10encodeRegIdTranslate virtual register ID to SASS encoding

Relationship to Other Backends

BackendMega-HubSizePattern Matchers
SM50-7x (shared)sub_126CA30239 KB~160
SM75 (Turing)sub_FBB810280 KB276
SM80 (Ampere)sub_D5FD70239 KB259
SM89/90 (Ada/Hopper)sub_119BF40231 KB~160

Despite having fewer ISel patterns than SM75, the SM80 mega-hub matches the SM50-7x hub at 239 KB. The SM89/90 backend at 1.9 MB is substantially larger overall, though its mega-hub is smaller, because it includes 750 instruction encoder template instantiations that the SM80 backend does not require.

Confidence Assessment

ClaimConfidenceVerification
ISA class string "Ampere" for sm_80/86/87/88CONFIRMEDDecompiled sub_484F50 lines 293, 337, 383, 426: "Ampere" for all four
__CUDA_ARCH__ values: 800, 860, 870, 880CONFIRMEDDecompiled sub_484F50 lines 294, 338, 384, 427
Codegen factory values: 28673, 28674, 28675, 28676CONFIRMEDDecompiled sub_15C3310 +348=28673, sub_15C3B60 +348=28674, sub_15C3C60 +348=28675, sub_15C3A60 +348=28676
Dispatch table: all 7 slots per SMCONFIRMEDDecompiled sub_15C0CE0 lines 75--102
Dispatch table: sm_88 encoding table = sub_15C3A60CONFIRMEDDecompiled sub_15C0CE0 line 98
Capability Vec 1: sm_80 = xmmword_1D40F40, sm_86/87/88 = xmmword_1D40F50CONFIRMEDDecompiled sub_484F50 lines 328 (F40), 370 (F50), 418 (copy), 458 (copy)
sm_86/87/88 capability-identical (same Vec 1 via copy chain)HIGHCopy chain v211->v210 traced through lines 375, 418, 458
sm_88 new in CUDA 13.0HIGHString at 0x1d40a9a; not in earlier toolkit versions
SM80 ISel backend at 0xCA0000--0xDA0000 (1 MB)HIGHAddress range consistent with function catalog and mega-hub location
259 ISel pattern matchers, 80 emission variantsHIGHDerived from systematic sweep of address range
Internal version numbers (sm_80=25 through sm_88=28)MEDIUMInferred from address spacing and *(a1+376) > 26 threshold
Family linkage: sm_89 links into sm_80/sm_86 chainsCONFIRMEDDecompiled sub_484F50 lines 507--510
FADD_RC largest emitter at 44,490 BMEDIUMSize from function boundary analysis
Three-phase pipeline (ISel, emission, encoding)HIGHArchitectural pattern consistent across all SM backends

Cross-References

Sibling Wikis

  • ptxas: Turing/Ampere -- standalone ptxas SM80 target documentation (codegen factory encoding, scheduler profiles, latency tables, SASS encoding format)
  • cicc: SM70-89 -- cicc compiler SM80 through SM88 feature gates (__VA_OPT__, convergent branches, L2 cache hint atomics)