SM80-88 Ampere
The Ampere family in nvlink v13.0.88 covers four compute capabilities -- sm_80, sm_86, sm_87, and sm_88 -- all sharing the ISA class string "Ampere", the same ISel backend at 0xCA0000--0xDA0000 (1 MB), and the same 128-bit SASS instruction encoding introduced by Turing. The sm_88 variant is new in CUDA 13.0. None of the four Ampere targets carry 'a' or 'f' suffix variants (unlike Blackwell targets), so the profile database contains exactly 12 Ampere profile entries: 4 real, 4 virtual, and 4 LTO.
For general Ampere architecture details (hardware specs, PTX ISA requirements, codegen factory encoding, scheduler parameters, latency tables), see the ptxas wiki: Turing/Ampere. For cicc-level feature gates and optimizer flag configuration, see the cicc wiki: SM70-89. This page focuses on the nvlink-internal per-sub-architecture data: profile registration, capability vectors, dispatch table function pointers, and the ISel backend shared by all four.
Per-Architecture Profiles
Profile Registration in sub_484F50
The profile database initializer sub_484F50 (53,974 bytes) registers the four Ampere architectures in numeric order. Each architecture produces three profile objects via sub_484DB0 (real sm_, virtual compute_, LTO lto_), which are inserted into the global hash map at qword_2A5F8D8. The Ampere block spans lines 288--462 of the decompiled source.
| SM | Registration Lines | sub_484DB0 Args | Variables |
|---|---|---|---|
| sm_80 | 288--331 | (0, 0, "sm_80", "sm_80", "Ampere", "-D__CUDA_ARCH__=800", "sm_80") | v14 (real), v15 (virtual) |
| sm_86 | 332--377 | (0, 0, "sm_86", "sm_86", "Ampere", "-D__CUDA_ARCH__=860", "sm_86") | v22 (real), v23 (virtual) |
| sm_87 | 378--420 | (0, 0, "sm_87", "sm_87", "Ampere", "-D__CUDA_ARCH__=870", "sm_87") | v31 (real), v32 (virtual) |
| sm_88 | 421--462 | (0, 0, "sm_88", "sm_88", "Ampere", "-D__CUDA_ARCH__=880", "sm_88") | v39 (real), v40 (virtual) |
The registration pattern for each is identical:
// 1. Create real profile (is_virtual=0, is_lto=0)
sm_80 = ArchProfile::create(0, 0, "sm_80", "sm_80", "Ampere",
"-D__CUDA_ARCH__=800", "sm_80");
// 2. Create virtual profile (is_virtual=1, is_lto=0)
compute_80 = ArchProfile::create(1, 0, "compute_80", "compute_80", "Ampere",
"-D__CUDA_ARCH__=800", "compute_80");
// 3. Link real <-> virtual
sm_80->virtual_ptr = compute_80; // offset +72
compute_80->virtual_ptr = compute_80; // self-ref
// 4. Insert into hash map
LinkerHash::insert(qword_2A5F8D8, "sm_80", sm_80);
LinkerHash::insert(qword_2A5F8D8, "compute_80", compute_80);
// 5. Create LTO profile (is_virtual=1, is_lto=1)
lto_80 = ArchProfile::create(1, 1, "lto_80", "compute_80", NULL,
"-D__CUDA_ARCH__=800", "lto_80");
lto_80->virtual_ptr = compute_80;
LinkerHash::insert(qword_2A5F8D8, "lto_80", lto_80);
// 6. Build compatibility lists
list_append(compute_80->compat_list_2, sm_80); // offset +64
list_append(sm_80->compat_list_2, compute_80);
list_append(sm_80->compat_list_1, sm_80); // offset +56: family
list_append(sm_80->compat_list_0, sm_80); // offset +48: cross-variant
// 7. Copy capability vectors from rodata
sm_80->capability[0] = xmmword_1D40F10; // offset +80
sm_80->capability[1] = xmmword_1D40F40; // offset +96
sm_80->capability[2] = xmmword_1D40F30; // offset +112
After registering each sub-architecture, sub_484F50 links it into the Ampere family chain. sm_86 is appended to sm_80's family list (compat_list_0 and compat_list_1), sm_87 and sm_88 follow the same chaining to sm_86's lists (which transitively connect back to sm_80).
The dword_2A5F8CC = 80 assignment (line 326) sets the default minimum architecture to sm_80 -- any link operation that does not specify an explicit --arch flag defaults to this value.
Architecture Identity Matrix
| Property | sm_80 | sm_86 | sm_87 | sm_88 |
|---|---|---|---|---|
| ISA class string | "Ampere" | "Ampere" | "Ampere" | "Ampere" |
| Family string | "Ampere" | "Ampere" | "Ampere" | "Ampere" |
__CUDA_ARCH__ | 800 | 860 | 870 | 880 |
| Preprocessor define | -D__CUDA_ARCH__=800 | -D__CUDA_ARCH__=860 | -D__CUDA_ARCH__=870 | -D__CUDA_ARCH__=880 |
| Suffix variants | None | None | None | None |
| Profile byte[3] (finalization class) | 0 | 0 | 0 | 0 |
| Products | GA100 (A100, A30) | GA10x (RTX 3090/3080/3070/3060, A40, A16) | GA10B (Jetson Orin AGX/NX/Nano) | Undocumented (CUDA 13.0) |
| Same-decade group (SM/10) | 8 | 8 | 8 | 8 |
All four share same-decade group 8, which also includes sm_89 (Ada). This means code compiled for sm_80 is compatible with any target in the 80--89 range via the same-decade rule (see the Compatibility page).
Capability Vectors
Each profile stores three 128-bit XMM vectors at offsets +80, +96, and +112. These encode hardware capability bitmasks checked by the finalization pipeline (sub_470DA0). The vectors are loaded from rodata constants in sub_484F50:
| Architecture | Vec 0 (offset +80) | Vec 1 (offset +96) | Vec 2 (offset +112) | Source Lines |
|---|---|---|---|---|
| sm_75 (Turing) | xmmword_1D40F10 | xmmword_1D40F20 | xmmword_1D40F30 | 283--287 |
| sm_80 | xmmword_1D40F10 | xmmword_1D40F40 | xmmword_1D40F30 | 325--331 |
| sm_86 | xmmword_1D40F10 | xmmword_1D40F50 | xmmword_1D40F30 | 370--374 |
| sm_87 | xmmword_1D40F10 | xmmword_1D40F50 (copied from sm_86) | xmmword_1D40F30 | 416--420 |
| sm_88 | xmmword_1D40F10 | xmmword_1D40F50 (copied from sm_87) | xmmword_1D40F30 | 458--462 |
| sm_89 (Ada) | xmmword_1D40F10 | xmmword_1D40F60 | xmmword_1D40F30 | 499--505 |
The rodata symbols encode these capability tiers:
| Symbol | Role | Architectures Using It |
|---|---|---|
xmmword_1D40F10 | Universal base (Vec 0) | All architectures sm_75--sm_121 |
xmmword_1D40F20 | Turing feature set (Vec 1) | sm_75 only |
xmmword_1D40F30 | Pre-Blackwell ISA version (Vec 2) | sm_75 through sm_90a |
xmmword_1D40F40 | Ampere-base feature set (Vec 1) | sm_80, sm_90, sm_100, sm_103 |
xmmword_1D40F50 | Ampere-86+ feature set (Vec 1) | sm_86, sm_87, sm_88 |
xmmword_1D40F60 | Ada/Thor/RTX-50 feature set (Vec 1) | sm_89, sm_110, sm_120, sm_121 |
xmmword_1D40F70 | Blackwell ISA version (Vec 2) | sm_100 onward |
Key observations:
- Vec 0 is constant across all architectures. It encodes the universal baseline capabilities.
- Vec 1 differentiates sub-architectures. sm_80 gets
xmmword_1D40F40(Ampere base), while sm_86/87/88 sharexmmword_1D40F50(extended Ampere). The sm_86 value propagates to sm_87 and sm_88 through copy chains (v211 = v29at line 375;v210 = v37at line 418), not through independent rodata loads. This means sm_86, sm_87, and sm_88 are capability-identical from the finalization pipeline's perspective. - Vec 2 is constant within the pre-Blackwell group. All Ampere targets use
xmmword_1D40F30. - The Ampere-to-Ada boundary is marked by the Vec 1 change from
xmmword_1D40F50(sm_86/87/88) toxmmword_1D40F60(sm_89).
What Distinguishes Each Sub-Architecture
sm_80 (GA100 -- datacenter Ampere). The base Ampere target and generational anchor. It uses xmmword_1D40F40 for Vec 1, which differs from the sm_86+ value xmmword_1D40F50. This means sm_80's capability mask is a strict subset of sm_86's -- code finalized for sm_86 is not guaranteed to be re-finalizable for sm_80 without capability mask verification. sm_80's codegen factory is 28673 (0x7001), placing it at sub-variant 1 within generation 7. In the ptxas scheduler, sm_80 falls through to the default variant (0 or 1) and uses baseline latency tables.
sm_86 (GA10x -- consumer/enterprise Ampere). Extends sm_80 with xmmword_1D40F50 capability bits. The codegen factory is 28674 (0x7002, sub-variant 2). In the ptxas scheduler, sm_86 maps to sub-architecture variant 2, receiving tuned scheduling parameters distinct from sm_80's baseline. The ptxas latency table for sm_86 (sub_8E7D80, 4.4 KB) is the largest in the Ampere family, reflecting the different pipeline characteristics of the RTX 30xx consumer die versus the A100 datacenter die.
sm_87 (GA10B -- Jetson Orin). Capability-identical to sm_86 (same Vec 1 xmmword_1D40F50, copied at line 418 via v37 = _mm_load_si128(&v211) where v211 holds sm_86's vector). The codegen factory is 28675 (0x7003, sub-variant 3). The ptxas scheduler maps it to variant 3, giving Orin its own latency profile (sub_8E8070, 3.5 KB). This is the only material binary difference between sm_86 and sm_87 in nvlink -- the capability mask, ISel patterns, and encoding pipeline are identical.
sm_88 (undocumented -- CUDA 13.0). Capability-identical to sm_86 and sm_87 (same Vec 1, copied at line 458 via the sm_87 chain). The codegen factory is 28676 (0x7004, sub-variant 4). The ptxas scheduler maps it to variant 4, shared with sm_110 (Jetson Thor). No separate latency table was found in the ptxas sweep data -- sm_88 may share sm_86's or sm_87's table. No public product ships on sm_88; it may represent an unreleased Ampere derivative or internal validation target.
Dispatch Table (sub_15C0CE0)
The dispatch table initializer sub_15C0CE0 registers seven callback function pointers per architecture into hash maps (qword_2A644B8 through qword_2A64488). Each callback serves a specific role in the embedded compilation pipeline.
Slot Assignments
| Slot | Hash Map | sm_75 (Turing) | sm_80 | sm_86 | sm_87 | sm_88 |
|---|---|---|---|---|---|---|
| B8 | Pre-compilation | sub_15C2AA0 | sub_15C2BF0 | sub_15C2C80 | sub_15C2E30 | sub_15C2DA0 |
| B0 | Compilation | sub_15C2A70 | sub_15C2BC0 | sub_15C2CB0 | sub_15C2D10 | sub_15C2DD0 |
| A8 | Backend init | sub_15C3210 | sub_15C3310 | sub_15C3B60 | sub_15C3C60 | sub_15C3A60 |
| A0 | Internal version | byte_2A5EE40 | byte_2A5EE3C | byte_2A5EE38 | byte_2A5EE34 | byte_2A5EE30 |
| 90 | Perf-stats | sub_15C1C80 | sub_15C1EB0 | sub_15C1EF0 | sub_15C1FD0 | sub_15C1E30 |
| 88 | Resource calc | sub_15C2610 | sub_15C28B0 | sub_15C1FF0 | sub_15C2990 | sub_15C2530 |
Each compute_ prefix variant receives the same internal version constant as its real counterpart (e.g., compute_80 = byte_2A5EE3C, compute_86 = byte_2A5EE38).
Backend Init Functions (Slot A8)
The A8 slot handlers are the only callbacks with architecturally significant differences between Ampere sub-architectures. Each allocates a per-function codegen context via sub_189F230(a3) and writes a per-SM codegen factory value at offset +348:
| SM | A8 Handler | Factory Value | Hex | Encoding |
|---|---|---|---|---|
| sm_75 | sub_15C3210 | 24577 | 0x6001 | Generation 6, sub-variant 1 |
| sm_80 | sub_15C3310 | 28673 | 0x7001 | Generation 7, sub-variant 1 |
| sm_86 | sub_15C3B60 | 28674 | 0x7002 | Generation 7, sub-variant 2 |
| sm_87 | sub_15C3C60 | 28675 | 0x7003 | Generation 7, sub-variant 3 |
| sm_88 | sub_15C3A60 | 28676 | 0x7004 | Generation 7, sub-variant 4 |
The codegen factory encodes (isa_generation << 12) | sub_variant. All four Ampere targets are generation 7 (bits 12--15 = 0x7), differing only in the low 12-bit sub-variant field. The factory value controls:
- Scheduler profile selection in
sub_8E4400: all four fall into the 7-warp / 208-dispatch-slot bucket (factory range 24577--28676, threshold <= 32767). - Sub-architecture variant for latency tuning: sm_80 gets default variant, sm_86 gets variant 2, sm_87 gets variant 3, sm_88 gets variant 4.
- Instruction encoding table selection: each sub-variant can enable/disable specific instruction forms.
All other fields at +344 (shared memory config base, 0x100000 = 1 MB) and the remaining context fields are identical across all four Ampere handlers.
Internal Version Numbers
The internal version numbers at qword_2A644A0 are stored as 4-byte integers at decreasing addresses, each 4 bytes apart. From the version mapping table in the SM89 Ada page:
| Address | SM | Internal Version | Notes |
|---|---|---|---|
byte_2A5EE40 | sm_75 | 14 | Turing baseline |
byte_2A5EE3C | sm_80 | (inferred 25) | Ampere base |
byte_2A5EE38 | sm_86 | (inferred 26) | First sub-arch > 26 threshold triggers |
byte_2A5EE34 | sm_87 | (inferred 27) | Orin |
byte_2A5EE30 | sm_88 | (inferred 28) | Undocumented |
byte_2A5EE2C | sm_89 | 29 | Ada (confirmed) |
The specific internal versions for sm_80--sm_88 are inferred from the address spacing pattern and the known threshold at *(a1+376) > 26 (sm_86+, confirmed in the compilation driver sub_1112F30 line 991). The sm_89 value of 29 is confirmed from decompiled code. The gap between sm_75 (internal 14) and sm_80 (inferred 25) reflects deprecated architectures in the 15--24 range that were removed from the profile database but whose internal version slots remain allocated.
Ampere vs Turing: Generational Differences
The transition from Turing (sm_75, generation 6) to Ampere (sm_80, generation 7) manifests in nvlink at several levels. For hardware and ISA-level details, see ptxas wiki: Turing/Ampere. Within nvlink specifically:
Profile-Level Changes
| Aspect | sm_75 (Turing) | sm_80 (Ampere) |
|---|---|---|
| ISA class string | "Turing" | "Ampere" |
| ISA generation | 6 | 7 |
| Codegen factory | 24577 (0x6001) | 28673 (0x7001) |
| Vec 1 capability | xmmword_1D40F20 | xmmword_1D40F40 |
| Same-decade group | 7 | 8 |
| Default arch flag | -- | dword_2A5F8CC = 80 |
Capability Boundary
The Vec 1 change from xmmword_1D40F20 (Turing) to xmmword_1D40F40 (Ampere) marks a hard capability boundary. Code finalized for sm_80 cannot be re-finalized for sm_75 -- the capability mask comparison in sub_470DA0 will fail because Ampere capabilities are a superset of Turing capabilities. The reverse direction (sm_75 code on sm_80 targets) is permitted by the same-decade rule only if both are in the same family linked list, which they are not (different decades: 7 vs 8).
ISel Backend Independence
Despite sharing the same 128-bit instruction encoding format, sm_75 and sm_80 use completely separate ISel backends:
| Property | SM75 (Turing) | SM80 (Ampere) |
|---|---|---|
| Address range | 0xF16000--0x100C000 (984 KB) | 0xCA0000--0xDA0000 (1 MB) |
| Mega-hub | sub_FBB810 (280 KB) | sub_D5FD70 (239 KB) |
| Pattern matchers | 276 | 259 |
| SASS opcodes | -- | 19 (in LTO subset) |
The Ampere ISel is slightly smaller than Turing's despite being a later generation. The reduction in pattern matchers (276 to 259) reflects instruction set refinement rather than feature removal -- Ampere reorganized and consolidated some multi-variant patterns.
Feature Gates in cicc and ptxas
From the cicc wiki: SM70-89, the Ampere-specific features enabled at the sm_80 threshold include:
- C++20
__VA_OPT__support viaunk_4D041B8 llvm.nvvm.branch.if.convergentintrinsic (sm_80+, distinct from the sm_70+branch.if.all.convergent)- L2 cache hint atomics (PTX 7.3+)
- cp.async.bulk patterns for asynchronous memory copy
From the ptxas wiki: Turing/Ampere, the sm_82 SASS opcode boundary defines 22 Ampere-era opcode slots (indices 172--193) covering sparse MMA, binary tensor core shapes, FP64 tensor MMA, async copy infrastructure, and warp-wide reduction.
Family Linkage
The Ampere family linked list built by sub_484F50 chains all four sub-architectures through compat_list_1 (offset +56) and compat_list_0 (offset +48):
sm_80.compat_list_1 -> { sm_80, sm_86, sm_87, sm_88 }
sm_86.compat_list_1 -> { sm_86 }
also: sm_86 appended to sm_80's compat_list_0 and compat_list_1
sm_87.compat_list_1 -> { sm_87 }
also: sm_87 appended to sm_86's compat_list_0 and compat_list_1
sm_88.compat_list_1 -> { sm_88 }
also: sm_88 appended to sm_87's compat_list_0 and compat_list_1
Additionally, sm_89 (Ada) links into the Ampere family chain:
// sub_484F50 lines 507--510:
list_append(sm_80->compat_list_0, sm_89); // v14[3].m128i_i64[0] = sm_80's list
list_append(sm_80->compat_list_1, sm_89); // v14[3].m128i_i64[1] = sm_80's list
list_append(sm_86->compat_list_0, sm_89); // v22[3].m128i_i64[0] = sm_86's list
list_append(sm_86->compat_list_1, sm_89); // v22[3].m128i_i64[1] = sm_86's list
This means sm_89 (Ada) is in the same compatibility family as the Ampere targets. The same-decade rule (80/10 = 89/10 = 8) already ensures this, but the explicit linkage provides direct traversal without arithmetic.
Worked Compatibility Example: sm_80 vs sm_86
Consider linking a cubin compiled for sm_80 into a final binary targeting sm_86.
sub_4878A0parses both architecture strings:record_a = {arch_number=80, is_virtual=0, has_suffix=0},record_b = {arch_number=86, is_virtual=0, has_suffix=0}.- Neither record is virtual, so the function does not reject on the "both virtual" check.
record_a.has_suffixis 0, so the function enters Case 1:profile_family_match(sm_80.family_list, sm_86).sub_465450traverses sm_80'scompat_list_1:{ sm_80, sm_86, sm_87, sm_88 }. sm_86 is found in the list.- Result: compatible. The linker accepts the sm_80 cubin into the sm_86 link.
The reverse direction (sm_86 cubin targeting sm_80) follows a different path:
- Parse:
record_a = {86, ...},record_b = {80, ...}. - Case 1:
profile_family_match(sm_86.family_list, sm_80). - sm_86's
compat_list_1contains{ sm_86 }-- it does not contain sm_80. - Result: incompatible. sm_86 code cannot be linked for sm_80.
This asymmetry reflects the forward-compatibility guarantee: code compiled for a lower SM within a family runs on higher SMs, but not the reverse.
However, capability mask verification at finalization time (sub_470DA0) adds a second check: even if the linker accepts the cubin, re-finalization will compare the capability vectors. Since sm_80 uses xmmword_1D40F40 (Vec 1) and sm_86 uses xmmword_1D40F50 (Vec 1), the sm_86 target has a superset of sm_80's capabilities. The finalization check passes for sm_80-compiled code targeting sm_86 but would fail for sm_86-compiled code targeting sm_80 if any sm_86-specific capability bits were used.
ISel Backend (0xCA0000--0xDA0000)
All four Ampere sub-architectures share a single ISel backend. The backend is variant-agnostic -- it produces identical SASS encoding for sm_80, sm_86, sm_87, and sm_88. Any differences between sub-architectures are resolved upstream in the dispatch table (slot A8 codegen factory) and downstream in the scheduler (latency tables), not within the ISel code itself.
Address Map
| Range | Size | Subsystem | Functions |
|---|---|---|---|
0xCA0000--0xCDC000 | 240 KB | Operand emission + bitfield packing | 137 |
0xCDD5F0--0xCDD690 | <1 KB | Operand type predicates | 15 |
0xCE2000--0xD5FD70 | 510 KB | ISel pattern matchers | 259 |
0xD5FD70 | 239 KB | ISel mega-hub dispatcher | 1 |
0xD9A400--0xDA0000 | 23 KB | Binary instruction encoding | 17 |
Total analyzed functions (>3 KB): 413. The mega-hub at sub_D5FD70 is 239 KB (55,985 instructions, 1,340 callees) and too large for Hex-Rays to decompile. It is the third-largest function in the binary after the SM75 mega-hub (sub_FBB810, 280 KB) and the cuda_builtin_prototype_generator (sub_15B86A0, 345 KB).
Three-Phase Pipeline
Every IR instruction processed by this backend passes through three phases in sequence: instruction selection, operand emission, and binary encoding. The output is a 128-bit SASS instruction word ready for insertion into the device ELF .text section.
Phase 1: Instruction Selection. The mega-hub sub_D5FD70 iterates all 259 pattern matcher functions. Each matcher receives (ctx, ir_node, &pattern_id, &priority) and tests a single candidate encoding. The highest-priority match wins.
for each pattern_matcher in sm80_pattern_table[0..258]:
pattern_matcher(ctx, ir_node, &pattern_id, &priority)
if priority > best_priority:
best_priority = priority
best_id = pattern_id
sm80_emitter_table[best_id](ctx, ir_node)
Each pattern matcher queries: instruction attributes via sub_A49150(ctx, ir_node, slot_id) (up to 12 attribute checks per pattern), definition/use operand counts via sub_530FD0/sub_530FC0, individual operands via sub_530FB0(ir_node, index), and operand type predicates: isGPR (sub_CDD600), isPredicate (sub_CDD610), isUniformReg (sub_CDD630), isImmediate (sub_CDD670), isConstBuf (sub_CDD680).
Priority values range from 14 (least specific, fallback patterns) to 34 (most specific, heavily constrained patterns). Higher priority means more attribute checks and narrower operand constraints.
Phase 2: Operand Emission. The winning pattern dispatches to an operand emission function from Zone 1 (0xCA0000--0xCDC000). Each emitter handles one (opcode, format) combination and populates a structured instruction descriptor: opcode ID at *(a2+12), encoding format at *(a2+14), max operand slot count at *(a2+15), modifier flags via setter functions, and decoded register operands.
Phase 3: Binary Encoding. Zone 3 functions (0xD9A400--0xDA0000, 17 functions) produce the final 128-bit SASS binary instruction word using SSE2 intrinsics (_mm_or_si128) for efficient 128-bit bulk operations.
Instruction Set Coverage
The SM80 backend handles 19 distinct SASS opcodes with a total of 80 (opcode, format) emission variants:
| Opcode ID | Mnemonic | Variants | Description | Max Operands |
|---|---|---|---|---|
| 34 | HMMA | 11 | Tensor Core half-precision matrix multiply-accumulate | 25 |
| 39 | S2R | 2 | Move special register to GPR | 10 |
| 40 | CS2R | 2 | Move control/status special register to GPR | 10 |
| 90 | IMAD | 4 | Integer multiply-add (32-bit) | 19 |
| 127 | FFMA | 12 | FP32 fused multiply-add | 25 |
| 195 | DSETP | 2 | FP64 set predicate (comparison) | 10 |
| 205 | LEA | 1 | Load effective address computation | 19 |
| 230 | IMAD.WIDE | 9 | Integer multiply-add with 64-bit result | 25 |
| 284 | DADD | 1 | FP64 addition | 25 |
| 285 | LDG | 9 | Global memory load | 25 |
| 289 | ISETP | 4 | Integer set predicate | 19/37 |
| 290 | IMNMX | 4 | Integer min/max selection | 19 |
| 292 | FSETP | 2 | FP32 set predicate | 19 |
| 293 | SEL | 4 | Select / conditional move | 19 |
| 294 | SHFL | 1 | Warp shuffle (inter-lane communication) | 3 |
| 295 | FADD | 4 | FP32 addition | 19 |
| 296 | FMUL | 4 | FP32 multiplication | 19 |
| 297 | MUFU | 4 | Multi-function unit (sin/cos/sqrt/rsq/rcp/lg2/ex2) | 19 |
| 299 | HADD2 | 2 | FP16x2 packed addition | 19 |
These 19 opcodes represent the core compute-intensive instructions that the linker's embedded ptxas must emit during LTO compilation. The full Ampere ISA is substantially larger; instructions that never appear in LTO-generated code (control flow, barriers, texture, surface, etc.) are handled by separate codec tables at 0xC00070--0xC50970.
ISel Pattern Classes
The 259 ISel pattern matchers divide into 9 functional classes:
| Class | Patterns | Address Range | Target Opcodes |
|---|---|---|---|
| Integer/comparison | 60 | 0xCE20F0--0xCF0040 | ISETP, IMNMX, IMAD |
| Floating-point | 30 | 0xCF0040--0xCFA770 | FADD, FMUL, FSETP, DSETP, DADD |
| Memory/load-store | 31 | 0xCFA770--0xD07000 | LDG, S2R, CS2R |
| Conversion/special | 17 | 0xD07000--0xD0E000 | MUFU, HADD2, SHFL |
| Wide multiply | 14 | 0xD0E000--0xD13000 | IMAD.WIDE |
| Fused multiply-add | 40 | 0xD13000--0xD39000 | FFMA |
| Complex ALU | 14 | 0xD39000--0xD3E000 | LEA, SEL (complex forms) |
| Tensor core | 27 | 0xD3E000--0xD52000 | HMMA (all TC formats) |
| Predicate/select | 26 | 0xD52000--0xD5FD70 | SEL, predicate combinations |
Priority levels range from 14 to 34. The priority system ensures more specific patterns win over generic fallbacks. Within each class, patterns form a lattice from general to specific:
| Priority Range | Attribute Checks | Operand Constraint Level | Example |
|---|---|---|---|
| 14--16 | 2--3 | Lightly constrained | GPR+Imm fallback |
| 17--19 | 4--5 | Standard | GPR+Pred+UReg+Imm |
| 20--23 | 6--7 | Moderately constrained | Multi-operand with specific attributes |
| 24--27 | 8--9 | Heavily constrained | UReg-only with 9 attribute checks |
| 28--34 | 10--12 | Highly constrained | FMA with 12 attribute checks, priority 34 |
Encoding Formats
The format field at *(a2+14) selects the operand encoding layout. Formats observed across all 19 opcodes:
| Format ID | Name | Description |
|---|---|---|
| 0 | RR | Register-Register (both sources in GPRs) |
| 1 | RI | Register-Immediate (one immediate source) |
| 2 | RC | Register-ConstantBuffer (one source from constant memory) |
| 3 | RR.ALT | Register-Register alternate encoding |
| 4 | RR.P | Register-Register with predicate output |
| 5 | RI.P | Register-Immediate with predicate output |
| 6 | RC.P | Register-ConstantBuffer with predicate output |
| 7 | SHFL | Warp shuffle encoding |
| 8 | RR.3SRC | Register-Register with 3 register sources |
| 9 | RI.P2 | Register-Immediate with dual predicate output |
| 10 | RR.WIDE | Register-Register with wide (64-bit) result |
| 11 | RR.ADD | Register-Register addition-specific encoding |
| 13--18 | TCA--TCE | Tensor Core formats A through E |
| 19 | RR.MEM | Register-Register memory-mapped encoding |
| 23--24 | TC.ALT/TC.ALT2 | Tensor Core alternate compact formats |
| 42--45 | TC.WIDE1--4 | Tensor Core wide formats 1 through 4 |
Emission Function Catalog
The following tables list all 80 operand emission functions in Zone 1, organized by opcode. Each row is one (opcode, format) combination.
HMMA (Tensor Core) -- 11 variants
| Address | Identity | Format | Size | Notes |
|---|---|---|---|---|
sub_CCE930 | sm80_emit_HMMA_TC.ALT | 23 | 3,185 B | 3-operand compact form |
sub_CCD8E0 | sm80_emit_HMMA_TC.ALT2 | 24 | 3,134 B | 3-operand compact form |
sub_CCECD0 | sm80_emit_HMMA_TCB | 14 | 3,151 B | 25-operand |
sub_CCF070 | sm80_emit_HMMA_TCA | 13 | 3,202 B | 25-operand |
sub_CD12D0 | sm80_emit_HMMA_TCC | 15 | 3,227 B | 25-operand |
sub_CD0E40 | sm80_emit_HMMA_TCD | 17 | 3,211 B | 25-operand |
sub_CD0230 | sm80_emit_HMMA_TCE | 18 | 3,160 B | 25-operand |
sub_CD6740 | sm80_emit_HMMA_TC.WIDE2 | 43 | 11,127 B | Complex multi-variant with fixup tables |
sub_CD7310 | sm80_emit_HMMA_TC.WIDE4 | 45 | 11,127 B | Complex multi-variant with fixup tables |
sub_CD7EE0 | sm80_emit_HMMA_TC.WIDE1 | 42 | 11,188 B | Complex multi-variant with fixup tables |
sub_CD8AC0 | sm80_emit_HMMA_TC.WIDE3 | 44 | 11,204 B | Complex multi-variant with fixup tables |
FFMA -- 12 variants
| Address | Identity | Format | Size |
|---|---|---|---|
sub_CC7380 | sm80_emit_FFMA_RR | 0 | 4,102 B |
sub_CC4F80 | sm80_emit_FFMA_RI | 1 | 3,866 B |
sub_CC7880 | sm80_emit_FFMA_RC | 2 | 4,384 B |
sub_CC58D0 | sm80_emit_FFMA_RR.ALT | 3 | 4,148 B |
sub_CAAFE0 | sm80_emit_FFMA_RR.P | 4 | 4,603 B |
sub_CC7D20 | sm80_emit_FFMA_RI.P | 5 | 4,397 B |
sub_CC8990 | sm80_emit_FFMA_RC.P | 6 | 4,413 B |
sub_CC4230 | sm80_emit_FFMA_SHFL | 7 | 3,669 B |
sub_CC3500 | sm80_emit_FFMA_RR.3SRC | 8 | 3,433 B |
sub_CC6B60 | sm80_emit_FFMA_RI.P2 | 9 | 3,888 B |
sub_CC4AF0 | sm80_emit_FFMA_RR.WIDE | 10 | 3,685 B |
sub_CC5440 | sm80_emit_FFMA_RR.ADD | 11 | 3,701 B |
LDG (Global Memory Load) -- 9 variants
| Address | Identity | Format | Size |
|---|---|---|---|
sub_CD5BE0 | sm80_emit_LDG_RR | 0 | 11,464 B |
sub_CD4520 | sm80_emit_LDG_RI | 1 | 11,399 B |
sub_CD5080 | sm80_emit_LDG_RC | 2 | 11,448 B |
sub_CD39E0 | sm80_emit_LDG_RR.ALT | 3 | 11,356 B |
sub_CBA5D0 | sm80_emit_LDG_RR.P | 4 | 3,480 B |
sub_CBA210 | sm80_emit_LDG_RI.P | 5 | 3,245 B |
sub_CBADD0 | sm80_emit_LDG_RC.P | 6 | 3,492 B |
sub_CBC350 | sm80_emit_LDG_SHFL | 7 | 3,680 B |
sub_CBA9D0 | sm80_emit_LDG_RR.3SRC | 8 | 3,476 B |
The RR/RI/RC/RR.ALT base forms are substantially larger (~11 KB each) than the predicated (.P) and shuffle forms (~3.5 KB), reflecting the complex cache hierarchy modifiers (strongOrder, eviction, scope, cacheOp, memoryType) that the base forms must encode.
IMAD.WIDE (64-bit Multiply-Add) -- 9 variants
| Address | Identity | Format | Size |
|---|---|---|---|
sub_CDBBD0 | sm80_emit_IMAD.WIDE_RR | 0 | 12,692 B |
sub_CDA300 | sm80_emit_IMAD.WIDE_RI | 1 | 12,627 B |
sub_CDAF60 | sm80_emit_IMAD.WIDE_RC | 2 | 12,676 B |
sub_CD96B0 | sm80_emit_IMAD.WIDE_RR.ALT | 3 | 12,577 B |
sub_CD1C80 | sm80_emit_IMAD.WIDE_RR.P | 4 | 4,874 B |
sub_CD1770 | sm80_emit_IMAD.WIDE_RI.P | 5 | 4,638 B |
sub_CD2730 | sm80_emit_IMAD.WIDE_RC.P | 6 | 4,889 B |
sub_CD2C90 | sm80_emit_IMAD.WIDE_SHFL | 7 | 5,078 B |
sub_CD21D0 | sm80_emit_IMAD.WIDE_RR.3SRC | 8 | 4,873 B |
IMAD.WIDE has the largest base-form emitters at ~12.7 KB each. The wide result format requires handling both halves of the 64-bit product, with separate register pair allocation.
Remaining Opcodes
| Address | Identity | Format | Size |
|---|---|---|---|
sub_CCA5D0 | sm80_emit_IMAD_RR | 0 | 4,835 B |
sub_CC2A20 | sm80_emit_IMAD_RC | 2 | 3,583 B |
sub_CBF4A0 | sm80_emit_IMAD_RR.P | 4 | 3,798 B |
sub_CC3D20 | sm80_emit_IMAD_RR.3SRC | 8 | 4,215 B |
sub_CABB10 | sm80_emit_S2R_TCC | 15 | 3,213 B |
sub_CABEA0 | sm80_emit_S2R_RR.MEM | 19 | 3,216 B |
sub_CAC230 | sm80_emit_CS2R_TCC | 15 | 3,314 B |
sub_CAB750 | sm80_emit_CS2R_RR.MEM | 19 | 3,317 B |
sub_CC9C30 | sm80_emit_DSETP_RI | 1 | 5,292 B |
sub_CC5D30 | sm80_emit_DSETP_RR.ALT | 3 | 4,903 B |
sub_CAE310 | sm80_emit_LEA_RR | 0 | 5,246 B |
sub_CBED30 | sm80_emit_DADD_RR.ADD | 11 | 3,153 B |
sub_CC46A0 | sm80_emit_ISETP_RR | 0 | 3,741 B |
sub_CC6260 | sm80_emit_ISETP_RI | 1 | 3,785 B |
sub_CC66E0 | sm80_emit_ISETP_RC | 2 | 3,917 B |
sub_CC84F0 | sm80_emit_ISETP_RR.ALT | 3 | 3,961 B |
sub_CB96B0 | sm80_emit_IMNMX_RR | 0 | 3,570 B |
sub_CB9AD0 | sm80_emit_IMNMX_RI | 1 | 3,614 B |
sub_CB8E80 | sm80_emit_IMNMX_RC | 2 | 3,467 B |
sub_CB9280 | sm80_emit_IMNMX_RR.ALT | 3 | 3,511 B |
sub_CC2660 | sm80_emit_FSETP_RR | 0 | 3,331 B |
sub_CC3930 | sm80_emit_FSETP_RI | 1 | 3,376 B |
sub_CC22A0 | sm80_emit_SEL_RR | 0 | 3,316 B |
sub_CC2EA0 | sm80_emit_SEL_RI | 1 | 3,361 B |
sub_CBF0F0 | sm80_emit_SEL_RC | 2 | 3,213 B |
sub_CC1ED0 | sm80_emit_SEL_RR.ALT | 3 | 3,258 B |
sub_CA02C0 | sm80_emit_SHFL_SHFL | 7 | 4,076 B |
sub_CB7B20 | sm80_emit_FADD_RR | 0 | 7,415 B |
sub_CBF900 | sm80_emit_FADD_RI | 1 | 27,061 B |
sub_CA08C0 | sm80_emit_FADD_RC | 2 | 44,490 B |
sub_CB71C0 | sm80_emit_FADD_RR.ALT | 3 | 6,705 B |
sub_CB60E0 | sm80_emit_FMUL_RR | 0 | 5,715 B |
sub_CBB1E0 | sm80_emit_FMUL_RI | 1 | 11,561 B |
sub_CBC940 | sm80_emit_FMUL_RC | 2 | 14,951 B |
sub_CB4BE0 | sm80_emit_FMUL_RR.ALT | 3 | 5,164 B |
sub_CB4390 | sm80_emit_MUFU_RR | 0 | 4,743 B |
sub_CB5A10 | sm80_emit_MUFU_RI | 1 | 5,377 B |
sub_CB8410 | sm80_emit_MUFU_RC | 2 | 8,046 B |
sub_CB3E20 | sm80_emit_MUFU_RR.ALT | 3 | 4,529 B |
sub_CB5380 | sm80_emit_HADD2_RR | 0 | 5,434 B |
sub_CB67F0 | sm80_emit_HADD2_RI | 1 | 5,648 B |
Notable size outliers: sm80_emit_FADD_RC at 44,490 bytes is the single largest emission function, followed by sm80_emit_FADD_RI at 27,061 bytes. Both require extensive fixup tables for the constant-buffer and immediate operand forms of FP32 addition with negation, absolute value, saturation, and data type modifiers.
Bitfield Packing Functions
Zone 1 contains 75 bitfield packing functions. These translate the instruction descriptor's register IDs, opcode fields, and modifiers into bit positions within the 128-bit SASS instruction word:
| Class | Functions | Shift-Pack Ops | Notes |
|---|---|---|---|
| FADD/FMUL/MUFU/HADD2/SHFL | 26 | 11--16 | Arithmetic + special function encoding |
| IMAD/FFMA/LEA | 3 | 13--15 | Multiply-add class encoding |
| FFMA | 4 | 15 | Fused multiply-add specific |
| FFMA/DSETP | 12 | 13--16 | FMA and FP64 comparison encoding |
| HMMA (Tensor Core) | 4 | 16 | Tensor core with fixed 16-bitfield layout |
| HMMA/IMAD.WIDE | 10 | 14--20 | Wide operand encoding (most complex) |
Operand Type Predicates
Fifteen small predicate functions at 0xCDD5F0--0xCDD690 classify operands by type:
| Address | Identity | Check |
|---|---|---|
sub_CDD5F0 | getRegFile | Extract register file from register ID |
sub_CDD600 | isGPR | Operand is a general-purpose register |
sub_CDD610 | isPredicate | Operand is a predicate register |
sub_CDD630 | isUniformReg | Operand is a uniform register |
sub_CDD670 | isImmediate | Operand is an immediate value |
sub_CDD680 | isConstBuf | Operand is a constant buffer reference |
Instruction Modifier Setters
| Address | Identity | Modifier |
|---|---|---|
sub_509100 | setDnzMode | Denormalized-number-as-zero mode |
sub_509160 | setRoundingMode | IEEE rounding mode (RN, RZ, RP, RM) |
sub_509760 | setAbsolute | Absolute value modifier |
sub_509890 | setEvictFirst | Cache eviction hint |
sub_509950 | setCacheLevel | Cache level targeting |
sub_509B00 | setScope | Memory scope (CTA, GPU, SYS) |
sub_509B20 | setEviction | Eviction policy |
sub_50AC80 | setCacheOp | Cache operation type |
sub_50ACD0 | setMemoryType | Memory type qualifier |
sub_50B160 | setComparison | Comparison predicate (LT, GT, EQ, NE, ...) |
sub_50B300 | setNegation | Source operand negation |
sub_50B500 | setDataType | Data type (F32, F16, S32, U32, ...) |
sub_50B900 | setStrongOrder | Strong memory ordering |
sub_50BD00 | setSaturation | Output saturation clamp |
sub_50BDA0 | setAddrSpace | Address space (global, shared, local) |
sub_50C060 | setFtzMode | Flush-to-zero for denormals |
External Dependencies
The SM80 backend calls into shared infrastructure functions:
| Address | Identity | Role |
|---|---|---|
sub_530FB0 | getOperand(idx) | Universal operand accessor (31,399 callers) |
sub_530FC0 | getNumUses() | Count use (input) operands |
sub_530FD0 | getNumDefs() | Count definition (output) operands |
sub_A49150 | getInsnAttribute | Query instruction attribute by slot ID (30,768 callers) |
sub_4FF010 | emitRegOperand | Emit register operand to descriptor |
sub_4FF150 | emitPredicateOperand | Emit predicate operand to descriptor |
sub_4FF280 | emitAddrOperand | Emit address/memory operand to descriptor |
sub_4C28B0 | setBitfield | Core primitive: pack value into 128-bit instruction at bit offset |
sub_4C2A90 | initInsnFromTemplate | Initialize instruction buffer from static template |
sub_4C4D60 | encodeOperandSlot0 | Encode register into operand slot 0 |
sub_4C5C30 | encodeOperandSlot1 | Encode register into operand slot 1 |
sub_A50D10 | encodeRegId | Translate virtual register ID to SASS encoding |
Relationship to Other Backends
| Backend | Mega-Hub | Size | Pattern Matchers |
|---|---|---|---|
| SM50-7x (shared) | sub_126CA30 | 239 KB | ~160 |
| SM75 (Turing) | sub_FBB810 | 280 KB | 276 |
| SM80 (Ampere) | sub_D5FD70 | 239 KB | 259 |
| SM89/90 (Ada/Hopper) | sub_119BF40 | 231 KB | ~160 |
Despite having fewer ISel patterns than SM75, the SM80 mega-hub matches the SM50-7x hub at 239 KB. The SM89/90 backend at 1.9 MB is substantially larger overall, though its mega-hub is smaller, because it includes 750 instruction encoder template instantiations that the SM80 backend does not require.
Confidence Assessment
| Claim | Confidence | Verification |
|---|---|---|
ISA class string "Ampere" for sm_80/86/87/88 | CONFIRMED | Decompiled sub_484F50 lines 293, 337, 383, 426: "Ampere" for all four |
__CUDA_ARCH__ values: 800, 860, 870, 880 | CONFIRMED | Decompiled sub_484F50 lines 294, 338, 384, 427 |
| Codegen factory values: 28673, 28674, 28675, 28676 | CONFIRMED | Decompiled sub_15C3310 +348=28673, sub_15C3B60 +348=28674, sub_15C3C60 +348=28675, sub_15C3A60 +348=28676 |
| Dispatch table: all 7 slots per SM | CONFIRMED | Decompiled sub_15C0CE0 lines 75--102 |
Dispatch table: sm_88 encoding table = sub_15C3A60 | CONFIRMED | Decompiled sub_15C0CE0 line 98 |
Capability Vec 1: sm_80 = xmmword_1D40F40, sm_86/87/88 = xmmword_1D40F50 | CONFIRMED | Decompiled sub_484F50 lines 328 (F40), 370 (F50), 418 (copy), 458 (copy) |
| sm_86/87/88 capability-identical (same Vec 1 via copy chain) | HIGH | Copy chain v211->v210 traced through lines 375, 418, 458 |
| sm_88 new in CUDA 13.0 | HIGH | String at 0x1d40a9a; not in earlier toolkit versions |
SM80 ISel backend at 0xCA0000--0xDA0000 (1 MB) | HIGH | Address range consistent with function catalog and mega-hub location |
| 259 ISel pattern matchers, 80 emission variants | HIGH | Derived from systematic sweep of address range |
| Internal version numbers (sm_80=25 through sm_88=28) | MEDIUM | Inferred from address spacing and *(a1+376) > 26 threshold |
| Family linkage: sm_89 links into sm_80/sm_86 chains | CONFIRMED | Decompiled sub_484F50 lines 507--510 |
| FADD_RC largest emitter at 44,490 B | MEDIUM | Size from function boundary analysis |
| Three-phase pipeline (ISel, emission, encoding) | HIGH | Architectural pattern consistent across all SM backends |
Cross-References
nvlink Internal
- Architecture Profiles -- SM80 family profiles in the linker database, struct layout, capability vector table
- SM75 Turing -- predecessor ISel backend
- SM89 Ada -- successor backend, dispatch table comparison
- Compatibility Checking -- same-decade rule, family linkage, capability mask verification
- Architecture Dispatch -- per-arch function pointer dispatch mechanism
- ISel Hubs -- SM80 mega-hub
sub_D5FD70(239 KB)
Sibling Wikis
- ptxas: Turing/Ampere -- standalone ptxas SM80 target documentation (codegen factory encoding, scheduler profiles, latency tables, SASS encoding format)
- cicc: SM70-89 -- cicc compiler SM80 through SM88 feature gates (
__VA_OPT__, convergent branches, L2 cache hint atomics)