SM103 / SM110 / SM120 / SM121

nvlink v13.0.88 registers four additional Blackwell-family architectures beyond the base SM100 datacenter target. All four share the "Blackwell" family name string at 0x1D40B6E, all use the 128-bit SASS instruction encoding defined by the SM100 ISA, and all reuse SM100 infrastructure for encoding, decoding, and descriptor initialization. The differences between them are confined to three areas: the architecture profile metadata (SM number, __CUDA_ARCH__ define, variant suffixes), the finalization compatibility remapping table, and the capability bitmask used to gate feature subsets during JIT re-finalization.

This page documents the profile registration, dispatch table sharing, finalization remapping, and capability bitmask system for sm_103, sm_110, sm_120, and sm_121 as observed in the binary.

Architecture Identity Matrix

| Architecture | Product Line | __CUDA_ARCH__ | Family String | ISA Class String | Same-Decade Group |
| --- | --- | --- | --- | --- | --- |
| sm_100 | Datacenter Blackwell (B200/B100) | 1000 | "Blackwell" | "(profile_sm_100)->isaClass" | 10 |
| sm_103 | Blackwell Ultra (GB300) | 1030 | "Blackwell" | "(profile_sm_103)->isaClass" | 10 |
| sm_110 | Jetson Thor | 1100 | "Blackwell" | "(profile_sm_110)->isaClass" | 11 |
| sm_120 | Consumer RTX 50xx / Enterprise Pro | 1200 | "Blackwell" | "(profile_sm_120)->isaClass" | 12 |
| sm_121 | DGX Spark | 1210 | "Blackwell" | "(profile_sm_121)->isaClass" | 12 |

Every architecture stores its ISA class as a (profile_sm_NNN)->isaClass string rather than a hardcoded human-readable name like "Hopper" or "Turing". This indirection means the ISA class is resolved at runtime through the profile object's field pointer rather than being a compile-time constant. The family name "Blackwell" is shared across all five 1xx architectures (a single rodata string with xrefs from all five registration blocks in sub_484F50).

Sub-variant Registration

Each of the four architectures registers three sub-variants (base, a, f) through the profile database initializer sub_484F50. The a suffix enables the full accelerated feature set; the f suffix marks the forward-compatible subset. For each base architecture, the database creates nine profile objects: three real (sm_NNN, sm_NNNa, sm_NNNf), three virtual (compute_NNN, compute_NNNa, compute_NNNf), and three LTO (lto_NNN, lto_NNNa, lto_NNNf).

| Architecture | Base Profiles | a Profiles | f Profiles |
| --- | --- | --- | --- |
| SM103 | sm_103, compute_103, lto_103 | sm_103a, compute_103a, lto_103a | sm_103f, compute_103f, lto_103f |
| SM110 | sm_110, compute_110, lto_110 | sm_110a, compute_110a, lto_110a | sm_110f, compute_110f, lto_110f |
| SM120 | sm_120, compute_120, lto_120 | sm_120a, compute_120a, lto_120a | sm_120f, compute_120f, lto_120f |
| SM121 | sm_121, compute_121, lto_121 | sm_121a, compute_121a, lto_121a | sm_121f, compute_121f, lto_121f |
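
The naming scheme in the table is regular enough to generate. The following is a minimal sketch, not the initializer's actual logic: `profile_names`, `KINDS`, and `SUFFIXES` are hypothetical names introduced here, and the output order is arbitrary rather than the order used by sub_484F50.

```python
# Hypothetical helper: reproduces the nine profile names per base architecture.
# The names and the iteration order are illustrative assumptions.
KINDS = ("sm", "compute", "lto")   # real, virtual, and LTO profiles
SUFFIXES = ("", "a", "f")          # base, accelerated, forward-compatible


def profile_names(sm_number: int) -> list[str]:
    """Return the nine profile names registered for one base architecture."""
    return [f"{kind}_{sm_number}{suffix}" for suffix in SUFFIXES for kind in KINDS]


if __name__ == "__main__":
    for sm in (103, 110, 120, 121):
        print(sm, profile_names(sm))
```

Each call yields three names per suffix, matching one row of the table.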

Registration Order

The architectures are registered in sub_484F50 in the following order within the Blackwell block, determined by the address order of their string references:

  1. sm_100 / sm_100a / sm_100f (xrefs at 0x485A..)
  2. sm_110 / sm_110a / sm_110f (xrefs at 0x485E..)
  3. sm_103 / sm_103a / sm_103f (xrefs at 0x4861..)
  4. sm_120 / sm_120a / sm_120f (xrefs at 0x4865..)
  5. sm_121 / sm_121a / sm_121f (xrefs at 0x4869..)

The ordering is notable: sm_103 is registered after sm_110, not after sm_100 as the numbering might suggest. The xref address ordering in sub_484F50 confirms the registration order itself; a plausible explanation is chronological, with sm_110 (Jetson Thor) added to the compiler toolchain before sm_103 (GB300 Blackwell Ultra).
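
The same ordering can be cross-checked against the rodata addresses of the base-name strings listed under Rodata String Addresses. The sketch below only sorts those published addresses; it assumes that string placement tracks registration order, and it omits sm_100 because its string address is not listed on this page. `order_by_address` is a hypothetical helper name.

```python
# Addresses of the base-name strings, copied from the Rodata String Addresses table.
# sm_100 is omitted because this page does not list its string address.
BASE_NAME_ADDR = {
    "sm_103": 0x1D40CDE,
    "sm_110": 0x1D40C2B,
    "sm_120": 0x1D40D91,
    "sm_121": 0x1D40E44,
}


def order_by_address(addr_by_name: dict[str, int]) -> list[str]:
    """Sort architecture names by ascending string address."""
    return sorted(addr_by_name, key=addr_by_name.get)


if __name__ == "__main__":
    print(order_by_address(BASE_NAME_ADDR))
```

Sorting places sm_110 ahead of sm_103, which agrees with the xref-based order above.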

Rodata String Addresses

| String | Address | Registering Function |
| --- | --- | --- |
| "sm_103" | 0x1D40CDE | sub_484F50 |
| "sm_103a" | 0x1D40D09 | sub_484F50 |
| "sm_103f" | 0x1D40D3A | sub_484F50 |
| "(profile_sm_103)->isaClass" | 0x1D40CF9 | sub_484F50 |
| "-D__CUDA_ARCH__=1030" | 0x1D40CC9 | sub_484F50 |
| "sm_110" | 0x1D40C2B | sub_484F50 |
| "sm_110a" | 0x1D40C56 | sub_484F50 |
| "sm_110f" | 0x1D40C87 | sub_484F50 |
| "(profile_sm_110)->isaClass" | 0x1D40C46 | sub_484F50 |
| "-D__CUDA_ARCH__=1100" | 0x1D40C16 | sub_484F50 |
| "sm_120" | 0x1D40D91 | sub_484F50 |
| "sm_120a" | 0x1D40DBC | sub_484F50 |
| "sm_120f" | 0x1D40DED | sub_484F50 |
| "(profile_sm_120)->isaClass" | 0x1D40DAC | sub_484F50 |
| "-D__CUDA_ARCH__=1200" | 0x1D40D7C | sub_484F50 |
| "sm_121" | 0x1D40E44 | sub_484F50 |
| "sm_121a" | 0x1D40E6F | sub_484F50 |
| "sm_121f" | 0x1D40EA0 | sub_484F50 |
| "(profile_sm_121)->isaClass" | 0x1D40E5F | sub_484F50 |
| "-D__CUDA_ARCH__=1210" | 0x1D40E2F | sub_484F50 |

Dispatch Table Sharing

The SM dispatch table initializer sub_15C0CE0 registers seven callback function pointers per architecture into hash maps (qword_2A644B8 through qword_2A64488). The callbacks serve these roles:

| Slot | Hash Map | Role |
| --- | --- | --- |
| 0 | qword_2A644B8 | cpf_optx (control-program-flow optimization) |
| 1 | (implicit) | nv.info attribute emitter |
| 2 | (implicit) | Resource usage table |
| 3 | (implicit) | Instruction encoding table |
| 4 | qword_2A644A0 | Compute capability byte array |
| 5 | qword_2A64490 | Perf-stats handler |
| 6 | qword_2A64488 | Codegen option handler |

Within each architecture family, all sub-variants share identical function pointers:

| Architecture | Sharing | Encoding Table Function |
| --- | --- | --- |
| sm_100, sm_100a, sm_100f | All 7 slots identical | sub_15C3840 |
| sm_103, sm_103a, sm_103f | All 7 slots identical | sub_15C3630 |
| sm_110, sm_110a, sm_110f | All 7 slots identical | sub_15C3950 |
| sm_120, sm_120a, sm_120f | All 7 slots identical | sub_15C1D20 |
| sm_121, sm_121a, sm_121f | All 7 slots identical | sub_15C3410 |

The encoding table accessor functions are small stubs (~100 bytes each) that return architecture-specific instruction encoding parameters. Each architecture gets its own accessor despite sharing the same underlying 128-bit instruction format. The differences between accessors appear to encode per-architecture feature availability (e.g., which MMA variants are supported, which memory ordering modes exist), although the parameter layout has not been fully decoded.
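
The sharing pattern for the encoding-table slot can be modeled as a name-keyed map in which all three sub-variants of a base architecture point at the same accessor. This is a sketch under assumptions: a Python dict stands in for the binary's hash map, the accessor is represented by its symbol name rather than a function pointer, and `build_encoding_slot_map` is a hypothetical name.

```python
# Encoding-table accessor per base architecture, as listed in the sharing table.
ENCODING_ACCESSOR = {
    "sm_100": "sub_15C3840",
    "sm_103": "sub_15C3630",
    "sm_110": "sub_15C3950",
    "sm_120": "sub_15C1D20",
    "sm_121": "sub_15C3410",
}


def build_encoding_slot_map() -> dict[str, str]:
    """Register base, 'a', and 'f' names against the same accessor symbol."""
    slot_map = {}
    for base, accessor in ENCODING_ACCESSOR.items():
        for suffix in ("", "a", "f"):
            slot_map[base + suffix] = accessor
    return slot_map


if __name__ == "__main__":
    for name, accessor in build_encoding_slot_map().items():
        print(f"{name:8s} -> {accessor}")
```

A lookup by any sub-variant name therefore resolves to the accessor of its base architecture.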

Complete Encoding Table Map

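| Function | Architecture | Slot |
| --- | --- | --- |
| sub_15C3210 | sm_75 | 3 |
| sub_15C3310 | sm_80 | 3 |
| sub_15C3B60 | sm_86 | 3 |
| sub_15C3C60 | sm_87 | 3 |
| sub_15C3A60 | sm_88 | 3 |
| sub_15C3740 | sm_89 | 3 |
| sub_15C3520 | sm_90 | 3 |
| sub_15C3840 | sm_100 | 3 |
| sub_15C3630 | sm_103 | 3 |
| sub_15C3950 | sm_110 | 3 |
| sub_15C1D20 | sm_120 | 3 |
| sub_15C3410 | sm_121 | 3 |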

Finalization Architecture Remapping

The finalization compatibility checker sub_4709E0 (2,609 bytes) and its companion sub_470DA0 (2,074 bytes) apply an internal architecture remapping table before performing compatibility comparisons. This remapping collapses certain architecture numbers into canonical equivalents:

| Input Arch | Remapped To | Interpretation |
| --- | --- | --- |
| 104 | 120 | Internal designation 104 maps to sm_120 (consumer Blackwell) |
| 130 | 107 | Internal designation 130 maps to 107 (within sm_100 family, decade 10) |
| 101 | 110 | Internal designation 101 maps to sm_110 (Jetson Thor) |

The decompiled code expresses these arch numbers as character constants, where each ASCII character's code equals the arch number: 'd' (100), 'h' (104, remapped to 120), 'g' (103), 'n' (110), 'y' (121). After remapping, the standard same-decade rule (arch / 10) determines family membership.

Remapping Semantics

The 104->120 remapping means that a cubin tagged with internal arch 104 is treated as sm_120-compatible for finalization purposes. Similarly, 130->107 places internal arch 130 into decade 10 (the sm_100 family), and 101->110 bridges internal arch 101 to the sm_110 family. These internal designations (101, 104, 130) never appear in user-facing --arch flags; they exist only within the finalization pipeline's compatibility-checking logic and likely represent early or experimental architecture IDs that were subsequently renumbered.
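
The remapping followed by the same-decade rule can be sketched as a lookup plus integer division. This is an illustration of the documented table only, not the control flow of sub_4709E0; `REMAP`, `canonical_arch`, and `decade` are hypothetical names.

```python
# The three remapping entries documented above; every other arch maps to itself.
REMAP = {104: 120, 130: 107, 101: 110}


def canonical_arch(arch: int) -> int:
    """Collapse an internal arch number to its canonical equivalent."""
    return REMAP.get(arch, arch)


def decade(arch: int) -> int:
    """Apply the same-decade rule (integer division by 10) after remapping."""
    return canonical_arch(arch) // 10


if __name__ == "__main__":
    for arch in (100, 101, 103, 104, 110, 120, 121, 130):
        print(arch, "->", canonical_arch(arch), "decade", decade(arch))
```

With this model, internal arch 104 lands in decade 12, 130 in decade 10, and 101 in decade 11.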

Special-Case Handling

sub_4709E0 contains explicit special-case logic for three architectures:

  • sm_110: Direct match check outside the decade rule, because decade 11 contains only sm_110
  • sm_121: Direct match check, because sm_121 shares decade 12 with sm_120 but has distinct finalization semantics
  • sm_100: Family-head check for the entire 100-decade (sm_100, sm_103, sm_107)

The function returns a 5-way error code: 0 = compatible, 24 = null input, 25 = version too high (>0x101), 26 = incompatible architecture, 27-30 = type-specific incompatibility. The a1[3] byte selects among finalization class types 0-4 through the lookup table dword_1D40660[].
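
For readers scripting against these return values, the five classes can be captured in a small lookup. The numeric codes come from the paragraph above; the function name `describe_result` and the wording of the labels are assumptions made for this sketch.

```python
def describe_result(code: int) -> str:
    """Map a finalization-check return code to the class described above."""
    if code == 0:
        return "compatible"
    if code == 24:
        return "null input"
    if code == 25:
        return "version too high"
    if code == 26:
        return "incompatible architecture"
    if 27 <= code <= 30:
        return "type-specific incompatibility"
    raise ValueError(f"undocumented return code: {code}")


if __name__ == "__main__":
    for code in (0, 24, 25, 26, 27, 30):
        print(code, describe_result(code))
```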

Capability Bitmasks

sub_470DA0 (can_finalize_with_capability_mask) extends the architecture check with a per-architecture capability bitmask. This function reads a mask pointer from a1+16, computes a target bitmask value based on the architecture number, and returns whether the required capabilities are satisfied.

Bitmask Assignment

| Architecture | Char Code | Bitmask Value | Binary |
| --- | --- | --- | --- |
| sm_100 | 'd' (100) | 1 | 0b00000001 |
| sm_110 | 'n' (110) | 2 | 0b00000010 |
| sm_103 | 'g' (103) | 8 | 0b00001000 |
| sm_121 | 'y' (121) | 64 | 0b01000000 |

The check evaluates (v12 & *v11) == v12 where v12 is the required bitmask for the target architecture and *v11 is the capability mask stored in the compilation unit. A compilation unit built with capability 1 (sm_100 only) cannot be re-finalized for sm_103 (requires bit 3) or sm_121 (requires bit 6). This mechanism controls which Blackwell sub-architectures a given compiled artifact is forward-compatible with.
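
A minimal model of this check, using the bitmask values from the table, is shown below. `REQUIRED_BIT` and `mask_allows` are hypothetical names; architectures without a table entry (such as sm_120) are rejected with a `KeyError` here because this page does not describe how the function treats them at this step.

```python
# Required capability bit per target architecture, from the Bitmask Assignment table.
REQUIRED_BIT = {100: 1, 110: 2, 103: 8, 121: 64}


def mask_allows(capability_mask: int, target_arch: int) -> bool:
    """Evaluate (required & mask) == required for the given target architecture."""
    required = REQUIRED_BIT[target_arch]  # KeyError for archs with no table entry
    return (required & capability_mask) == required


if __name__ == "__main__":
    sm100_only = 1
    print(mask_allows(sm100_only, 100))
    print(mask_allows(sm100_only, 103))
    print(mask_allows(sm100_only | 8, 103))
```

A unit whose mask is 1 passes only for sm_100; adding bit 3 (value 8) to the mask is what makes the sm_103 check succeed.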

Note that sm_120 does not appear in the bitmask table. Its finalization compatibility is handled entirely through the architecture remapping (104->120) and the same-decade rule (decade 12 includes both sm_120 and sm_121).

Capability Data in Profile Structs

Each profile struct stores three 128-bit capability vectors at offsets +80, +96, and +112 (loaded from xmmword_1D40F10--xmmword_1D40F70 via SSE instructions during sub_484F50 initialization). These vectors encode generation-specific feature bitmasks used by the finalization pipeline. The can_finalize_with_capability_mask function dereferences through the profile's capability pointer at a1+16 to reach these vectors.
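
Given a raw dump of a profile struct, the three vectors could be extracted as follows. This is a sketch under assumptions: the dump is a Python `bytes` object covering at least 128 bytes of the struct, the layout is little-endian (as on x86-64), and `read_capability_vectors` is a hypothetical helper, not a function in the binary.

```python
import struct

# Offsets of the three 128-bit capability vectors within a profile struct.
VECTOR_OFFSETS = (80, 96, 112)


def read_capability_vectors(profile_bytes: bytes) -> list[int]:
    """Return the three 128-bit vectors as Python integers (little-endian assumed)."""
    vectors = []
    for offset in VECTOR_OFFSETS:
        low, high = struct.unpack_from("<QQ", profile_bytes, offset)
        vectors.append((high << 64) | low)
    return vectors


if __name__ == "__main__":
    # Synthetic data for illustration only; not values taken from the binary.
    blob = bytes(80) + b"".join(n.to_bytes(16, "little") for n in (1, 2, 3))
    print([hex(v) for v in read_capability_vectors(blob)])
```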

Same-Decade Compatibility Groups

The same-decade rule (arch_number / 10, integer division) produces three distinct Blackwell compatibility groups:

| Decade | Architectures | Compatibility |
| --- | --- | --- |
| 10 | sm_100, sm_103 | sm_100 code runs on sm_103; sm_103 code does not run on sm_100 |
| 11 | sm_110 | Sole member; no cross-compatibility within decade |
| 12 | sm_120, sm_121 | sm_120 code runs on sm_121; sm_121 code does not run on sm_120 |

Despite sharing the "Blackwell" family name string and the same ISA encoding infrastructure, code compiled for sm_100 cannot run on sm_120 -- the decade boundary is a hard compatibility wall. The family name is informational only; actual compatibility is governed by the decade rule and the finalization remapping table.
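
The grouping can be reproduced with integer division. In the sketch below, `decade_groups` follows the documented rule directly, while `runs_on` adds an assumption: the "lower number runs on higher number" direction is inferred from the Compatibility column of the table and is not taken from decompiled code. Both function names are hypothetical.

```python
from collections import defaultdict

BLACKWELL_ARCHS = (100, 103, 110, 120, 121)


def decade_groups(archs) -> dict[int, list[int]]:
    """Group architecture numbers by arch // 10."""
    groups = defaultdict(list)
    for arch in archs:
        groups[arch // 10].append(arch)
    return dict(groups)


def runs_on(code_arch: int, device_arch: int) -> bool:
    """Assumed direction: same decade, and the device is not older than the code."""
    return code_arch // 10 == device_arch // 10 and code_arch <= device_arch


if __name__ == "__main__":
    print(decade_groups(BLACKWELL_ARCHS))
    print(runs_on(100, 103), runs_on(103, 100), runs_on(100, 120))
```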

SM101/SM110 Cross-Mapping Bridge

The compatibility checker sub_4878A0 contains a special bidirectional bridge between SM101 and SM110. When either the source or target architecture is 101 or 110, the normal same-decade comparison is bypassed. SM101 is an internal designation that maps to the sm_110 family (confirmed by the 101->110 remapping in sub_4709E0). This bridge allows artifacts tagged with the internal arch 101 to finalize for sm_110 and vice versa.
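
The condition that selects the bridge path can be written as a simple predicate. This sketch models only the test that bypasses the same-decade comparison; what sub_4878A0 does after taking the bridge is not modeled, and `takes_bridge_path` is a hypothetical name.

```python
# Arch numbers that trigger the bridge when they appear on either side.
BRIDGE_ARCHS = frozenset({101, 110})


def takes_bridge_path(source_arch: int, target_arch: int) -> bool:
    """True when either side is 101 or 110, so the decade comparison is bypassed."""
    return source_arch in BRIDGE_ARCHS or target_arch in BRIDGE_ARCHS


if __name__ == "__main__":
    for pair in ((101, 110), (110, 101), (100, 103), (110, 120)):
        print(pair, takes_bridge_path(*pair))
```

Note that the predicate fires whenever either side is 101 or 110, not only when the pair is exactly (101, 110).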

Instruction Encoding Sharing

All five Blackwell-family architectures (sm_100 through sm_121) share the same 128-bit SASS instruction encoding infrastructure documented on the SM100 Blackwell page. The 4,236 template-instantiated encoder/decoder/descriptor functions at 0x620000--0xF15A50 are common to all 1xx targets. The per-architecture encoding table accessors (slot 3 in the dispatch table) return architecture-specific parameters that modify which instruction families are available, but the encoding format itself is identical.

Shared Components

| Component | Address Range | Size | Shared By |
| --- | --- | --- | --- |
| SASS encoders (table 1) | 0x620000--0x84DD70 | 2.2 MB | All sm_1xx |
| InstrDesc initializers | 0x84DD70--0xA48290 | 1.7 MB | All sm_1xx |
| SASS encoders (table 2) | 0xDA0000--0xE436D0 | 660 KB | All sm_1xx |
| SASS decoders | 0xE43DC0--0xF15A50 | 840 KB | All sm_1xx |
| Encoder dispatch | sub_E43C20 | 92 lines | All sm_1xx |
| Decoder dispatch | sub_EFE6C0 | 93 lines | All sm_1xx |
| Opcode table constructor | sub_1782540 | 111 KB | All sm_1xx |
| Master instruction encoder | sub_17F2670 | 157 KB | All sm_1xx |

Per-Architecture Differences

The encoding table accessor functions (slot 3) return different parameter sets that control feature gating at the instruction level. While the exact parameter layouts have not been fully decoded, the pattern is consistent: each accessor populates a small structure (8-32 bytes, exact size not determined) that tells the encoder/decoder which instruction families and sub-opcodes are valid for the target architecture.

This means sm_103 (GB300) may support additional MMA instruction variants compared to sm_100, and sm_120 (consumer) may lack certain datacenter-specific instructions present in sm_100. The instruction format is the same; only the set of valid opcodes within that format varies.

Compiler Backend Sharing

The SM-specific compiler backend functions (instruction selector, peephole optimizer, legalization passes) are selected through the dispatch table rather than duplicated per architecture. The backend at 0x1782540--0x17B9300 is shared across all 1xx targets, with per-architecture behavior controlled through the dispatch table callbacks and the encoding table parameters.

Key shared backend functions:

| Address | Size | Function | Shared By |
| --- | --- | --- | --- |
| sub_1782540 | 111,076 B | Opcode table constructor | All sm_1xx |
| sub_17884A0 | 44,713 B | Instruction property initializer | All sm_1xx |
| sub_178AA00 | 35,422 B | Scheduling table initializer | All sm_1xx |
| sub_179BD10 | 16,544 B | Peephole optimizer | All sm_1xx |
| sub_17A2130 | 33,823 B | Instruction legalization | All sm_1xx |
| sub_17AB9D0 | 36,177 B | Instruction selection | All sm_1xx |
| sub_17F2670 | 156,611 B | Master instruction encoder | All sm_1xx |

SM-Specific Codegen Options

The per-SM codegen option handler sub_15C2E90 processes SM-specific compilation flags via string comparison dispatch. These options are common to all Blackwell targets:

| Option | Values | Description |
| --- | --- | --- |
| lds128convert | always / nonconst / never | 128-bit shared memory load conversion policy |
| stress-maxrregcount | Integer | Maximum register count override for stress testing |
| stress-noglobalregalloc | Boolean | Disable global register allocation |
| legacy-cvtf64 | Boolean | Enable legacy FP64 conversion behavior |
| perf-per-watt-opt-level | 0 / 1 / 2 | Performance-per-watt optimization level |
| stress-no-crp | Boolean | Disable constant register propagation |
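
String-comparison dispatch of this kind can be modeled as a table of per-option parsers. The option names and value domains below come from the table; everything else is an assumption for illustration, in particular the accepted Boolean spellings ("0"/"1"/"true"/"false") and the helper names `parse_codegen_option`, `_parse_bool`, and `_parse_choice`.

```python
def _parse_bool(text: str) -> bool:
    # Assumed spellings; the binary's accepted Boolean forms are not documented here.
    table = {"1": True, "true": True, "0": False, "false": False}
    if text not in table:
        raise ValueError(f"not a Boolean value: {text!r}")
    return table[text]


def _parse_choice(text: str, allowed: tuple) -> str:
    if text not in allowed:
        raise ValueError(f"expected one of {allowed}, got {text!r}")
    return text


OPTION_PARSERS = {
    "lds128convert": lambda v: _parse_choice(v, ("always", "nonconst", "never")),
    "stress-maxrregcount": int,
    "stress-noglobalregalloc": _parse_bool,
    "legacy-cvtf64": _parse_bool,
    "perf-per-watt-opt-level": lambda v: int(_parse_choice(v, ("0", "1", "2"))),
    "stress-no-crp": _parse_bool,
}


def parse_codegen_option(name: str, value: str):
    """Dispatch on the option name, then validate and convert its value."""
    if name not in OPTION_PARSERS:
        raise ValueError(f"unknown codegen option: {name!r}")
    return OPTION_PARSERS[name](value)


if __name__ == "__main__":
    print(parse_codegen_option("lds128convert", "nonconst"))
    print(parse_codegen_option("perf-per-watt-opt-level", "2"))
```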

Key Functions

| Address | Name | Size | Role |
| --- | --- | --- | --- |
| sub_484F50 | ArchProfileDB::init | 53,974 B | Registers all GPU architectures including sm_103/110/120/121 |
| sub_15C0CE0 | init_sm_dispatch_tables | 14,517 B | Registers 7 dispatch callbacks per architecture |
| sub_4709E0 | can_finalize_arch_check | 2,609 B | Finalization compatibility with arch remapping |
| sub_470DA0 | can_finalize_capability_mask | 2,074 B | Capability bitmask check for finalization |
| sub_15C3630 | sm_103 encoding table accessor | ~100 B | Returns sm_103-specific encoding parameters |
| sub_15C3950 | sm_110 encoding table accessor | ~100 B | Returns sm_110-specific encoding parameters |
| sub_15C1D20 | sm_120 encoding table accessor | ~100 B | Returns sm_120-specific encoding parameters |
| sub_15C3410 | sm_121 encoding table accessor | ~100 B | Returns sm_121-specific encoding parameters |
| sub_15C2E90 | process_sm_codegen_option | ~70 lines | SM-specific codegen option string dispatch |
| sub_4878A0 | arch_string_match | 328 B | Core compatibility checker with SM101/110 bridge |

Observations

  1. No separate ISA definition -- sm_103, sm_110, sm_120, and sm_121 do not define their own instruction encoding/decoding tables. They reuse the SM100 infrastructure entirely. The per-architecture differentiation happens through small encoding-table accessor stubs and capability bitmask checks, not through separate code.

  2. Heterogeneous numbering -- The five Blackwell architectures span three decades (10, 11, 12), creating three distinct compatibility groups despite sharing a single ISA. The three decades correspond to substantially different silicon configurations (datacenter, automotive/embedded, consumer) even though the instruction set is common.

  3. Internal designations -- The remapping table reveals three internal architecture numbers (101, 104, 130) that do not correspond to any user-visible sm_ target. These appear to be development or pre-release designations that were renumbered before public release.

  4. Asymmetric bitmasks -- The capability bitmask values are non-sequential (1, 2, 8, 64), leaving the bit positions for 4, 16, and 32 unassigned. This may leave room for future architectures to be inserted at those positions without renumbering existing masks.

  5. Registration order anomaly -- sm_103 is registered after sm_110 in the database initializer, suggesting sm_110 (Jetson Thor) was added to the compiler toolchain before sm_103 (Blackwell Ultra GB300) despite sm_103 having a lower SM number.

  6. sm_120 bitmask absence -- sm_120 has no entry in the capability bitmask table at sub_470DA0. Its finalization compatibility is handled solely through the remapping rule (104->120) and the same-decade rule. This may indicate that sm_120 was designed from the start to be the "base" consumer architecture within decade 12, with sm_121 being the enhanced derivative.

Confidence Assessment

| Claim | Confidence | Verification |
| --- | --- | --- |
| All four use "Blackwell" family string at 0x1D40B6E | CONFIRMED | Decompiled sub_484F50 lines 751/891/1032/1175: "Blackwell" for sm_110/sm_103/sm_120/sm_121 base profiles; string at 0x1d40b6e |
| ISA class strings (profile_sm_NNN)->isaClass | CONFIRMED | Strings confirmed: 0x1d40cf9 (sm_103), 0x1d40c46 (sm_110), 0x1d40dac (sm_120), 0x1d40e5f (sm_121) |
| __CUDA_ARCH__ values: 1030, 1100, 1200, 1210 | CONFIRMED | Strings at 0x1d40cc9, 0x1d40c16, 0x1d40d7c, 0x1d40e2f |
| All sub-variant strings (sm_NNNa, sm_NNNf, compute_, lto_) | CONFIRMED | All strings found in nvlink_strings.json at documented addresses (e.g., 0x1d40d14=sm_103a, 0x1d40c61=sm_110a, etc.) |
| Registration order: sm_100 -> sm_110 -> sm_103 -> sm_120 -> sm_121 | CONFIRMED | Decompiled sub_484F50: sm_110 at ~line 751, sm_103 at ~line 891; string address ordering 0x1d40c2b (sm_110) < 0x1d40cde (sm_103) confirms sm_110 registered before sm_103 |
| Dispatch table: sm_103 encoding = sub_15C3630 | CONFIRMED | Decompiled sub_15C0CE0 line 182: sub_448E70(qword_2A644A8, "sm_103f", sub_15C3630) |
| Dispatch table: sm_110 encoding = sub_15C3950 | CONFIRMED | Decompiled sub_15C0CE0 shows sm_110 with sub_15C3950 at A8 slot |
| Dispatch table: sm_120 encoding = sub_15C1D20 | CONFIRMED | Decompiled sub_15C0CE0 line 189: sub_448E70(qword_2A644A8, "sm_120", sub_15C1D20) |
| Dispatch table: sm_121 encoding = sub_15C3410 | CONFIRMED | Decompiled sub_15C0CE0 line 210: sub_448E70(qword_2A644A8, "sm_121", sub_15C3410) |
| Sub-variants share function pointers (sm_120/120a/120f identical) | CONFIRMED | Decompiled sub_15C0CE0 lines 187-207: sm_120, sm_120a, sm_120f all use sub_15C1D20 for encoding table |
| Finalization remapping: 104->120, 130->107, 101->110 | CONFIRMED | Decompiled sub_4709E0 lines 22-31 and sub_470DA0 lines 20-31 |
| Capability bitmask: d=1, n=2, g=8, y=64 | CONFIRMED | Decompiled sub_470DA0 lines 95-106 |
| sm_120 absent from capability bitmask table | CONFIRMED | Decompiled sub_470DA0 switch: only cases d/g/n/y; no case for 120 |
| Same-decade groups: 10={100,103}, 11={110}, 12={120,121} | HIGH | Derived from integer division rule confirmed in sub_4878A0 |
| SM101/110 bidirectional bridge | CONFIRMED | Decompiled sub_4878A0 line 55: v19 == 101 \|\| v20 == 101 \|\| v19 == 110 \|\| v20 == 110 |
| All 4 share SM100 encoding infrastructure | HIGH | Dispatch table shows distinct accessor stubs but same underlying encoder/decoder regions |

For general Blackwell architecture details, see the ptxas wiki: Blackwell. For SM120 consumer target specifics, see cicc wiki: SM120.

Cross-References

Sibling Wikis