SM Architecture Map
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas validates the --gpu-name target against three sorted lookup tables, constructs a profile object with family metadata and a CUDA_ARCH macro value, then populates seven parallel dispatch tables that drive capability checks, code generation factory selection, performance modeling, and occupancy calculation throughout the compiler. The default target is sm_75 (Turing). Every downstream decision -- instruction legality, encoder selection, register file geometry, scheduling latencies -- routes through the profile object built here.
| SM validation | sub_6765E0 (54KB, profile object construction) |
| Capability dispatch | sub_607DB0 (14KB, 7 parallel hash maps) |
| Default target | sub_6784B0 -- returns sm_75 when --gpu-name is omitted |
| Validation tables | 3 bsearch arrays: base (32 entries at unk_1D16220), f (6 entries at unk_1D16160), a (7 entries at unk_1D161C0) |
| Per-SM accessors | sub_609XXX cluster (24 functions, ~1.2KB each) |
| Per-SM intrinsic init | sub_60AXXX cluster (12 functions, ~1KB each) |
| Profile lookup | sub_608D70 (384 bytes, dispatcher registered via sub_42BEC0) |
Per-SM Deep Dives:
- Turing & Ampere (SM 75--88) -- Baseline feature set, codegen factory 24577/28673
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, codegen factory 32768
- Blackwell (SM 100--121) -- tcgen05, arch/family gating, codegen factory 36864
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
Complete SM Table
23 active SM base targets ship in ptxas v13.0.88 (plus 9 legacy and 2 internal/alias entries retained in the validation table for backward compatibility). Each base target optionally has a (accelerated) and/or f (feature-reduced) sub-variants. The CUDA_ARCH column shows the value the -D__CUDA_ARCH__ macro expands to.
| SM | __CUDA_ARCH | Family | Product | Codegen Factory | Status | Deep Dive |
|---|---|---|---|---|---|---|
sm_75 | 750 | Turing | TU10x (RTX 20xx) | 24577 | Production | turing-ampere |
sm_80 | 800 | Ampere | A100 | 28673 | Production | turing-ampere |
sm_86 | 860 | Ampere | A40/A10/RTX 30xx | 28673 | Production | turing-ampere |
sm_87 | 870 | Ampere | Orin (Jetson) | 28673 | Production | turing-ampere |
sm_88 | 880 | Ampere | -- | 28673 | Production | turing-ampere |
sm_89 | 890 | Ada Lovelace | AD10x (RTX 40xx) / L40S | 28673 | Production | ada-hopper |
sm_90 / sm_90a | 900 | Hopper | H100 / H200 | 32768 | Production | ada-hopper |
sm_100 / sm_100a / sm_100f | 1000 | Blackwell | B200 (datacenter) | 36864 | Production | blackwell |
sm_103 / sm_103a / sm_103f | 1030 | Blackwell Ultra | GB300 (datacenter) | 36864 | Production | blackwell |
sm_110 / sm_110a / sm_110f | 1100 | Jetson Thor | Thor SoC (auto/robotics) | 36864 | Production | blackwell |
sm_120 / sm_120a / sm_120f | 1200 | Blackwell (sm120) | RTX 50xx / RTX Pro | 36864 | Production | blackwell |
sm_121 / sm_121a / sm_121f | 1210 | Blackwell (sm120) | DGX Spark | 36864 | Production | blackwell |
The family name stored in the profile object (from sub_6765E0) uses NVIDIA's internal naming: "Turing", "Ampere", "Hopper", "Blackwell". Ada Lovelace (sm_89) is stored as Ampere-derived internally despite being a distinct microarchitecture. sm_120/121 use "Blackwell" internally despite being a different consumer microarchitecture from sm_100 datacenter Blackwell.
Suffix Semantics
ptxas uses three suffix modes to control forward compatibility. The distinction is critical: it determines which SASS binary a cubin can execute on.
| Suffix | Meaning | Forward Compatibility | Validation Table |
|---|---|---|---|
| (none) | Base feature set | Full forward-compat across generations | unk_1D16220 (32 entries) |
a (accelerated) | Architecture-locked, advanced features | No forward compat -- locked to specific silicon | unk_1D161C0 (7 entries) |
f (feature-reduced) | Same-family forward compat only | Forward-compat within family, not across | unk_1D16160 (6 entries) |
The base variant (no suffix) produces SASS that runs on the named architecture and all later ones: sm_80 code runs on sm_86, sm_89, sm_90, sm_100, etc. The a suffix locks the binary to exact silicon: sm_90a code runs only on H100/H200 hardware and will not execute on Blackwell. The f suffix allows forward compatibility within the same family: sm_100f code runs on sm_100 and sm_103 (both Blackwell datacenter) but not on sm_120 (different family).
Compilation rules from help text:
sm_90aPTX must be compiled tosm_90aSASS (no cross-arch compilation)sm_100fPTX can compile tosm_100forsm_103fSASS (same family)sm_100aPTX must compile tosm_100aSASS only- Base
sm_100PTX compiles to anysm_100+SASS
Sub-Variant Expansion
| Base | a Variant | f Variant | CUDA_ARCH (a) | CUDA_ARCH (f) |
|---|---|---|---|---|
sm_90 | sm_90a | -- | 90a0 | -- |
sm_100 | sm_100a | sm_100f | 100a0 | 100f0 |
sm_103 | sm_103a | sm_103f | 103a0 | 103f0 |
sm_101 | sm_101a | sm_101f | -- | -- |
sm_110 | sm_110a | sm_110f | 110a0 | 110f0 |
sm_120 | sm_120a | sm_120f | 120a0 | 120f0 |
sm_121 | sm_121a | sm_121f | 121a0 | 121f0 |
sm_75 through sm_89 have no a or f variants. sm_90 has only the a variant (no f). All Blackwell-era targets (sm_100+) have both a and f. sm_101 is a legacy alias for sm_110 (Jetson Thor, original internal designation); it passes validation but is not registered as a profile object, so its CUDA_ARCH values are not populated.
SM Validation Tables
Target name validation uses three sorted arrays searched via bsearch(). The CLI parser extracts the SM string from --gpu-name, strips any suffix, and searches the appropriate table.
Base Table -- unk_1D16220 (32 entries)
Contains all valid base SM names without suffix, sorted by numeric SM ID. Includes legacy architectures no longer supported for active compilation but retained for validation, plus two internal/alias entries. Each entry is 12 bytes: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}. The bsearch comparison (sub_484B70) compares the numeric sm_id extracted from the --gpu-name string via sscanf.
sm_10, sm_11, sm_12, sm_13, // Tesla (legacy, PTX 1.0--1.2)
sm_20, sm_21, // Fermi (legacy, PTX 2.0)
sm_30, sm_32, sm_35, sm_37, // Kepler (legacy, PTX 3.0--4.1)
sm_50, sm_52, sm_53, // Maxwell (legacy, PTX 4.0--4.2)
sm_60, sm_61, sm_62, // Pascal (legacy, PTX 5.0)
sm_70, sm_72, // Volta (legacy, PTX 6.0--6.1)
sm_75, // Turing (active, PTX 6.3)
sm_80, sm_82, sm_86, sm_87, sm_88, sm_89, // Ampere/Ada (active, PTX 6.2--7.8)
sm_90, // Hopper (active, PTX 7.8)
sm_100, sm_101, sm_103, sm_110, sm_120, sm_121 // Blackwell (active, PTX 8.6--9.0)
sm_82 (PTX 6.2): Undocumented internal Ampere target. Not registered in sub_6765E0 (no profile object). Serves as the SASS opcode generation boundary (SM82_FIRST/SM82_LAST, opcode indices 172--193). The anomalously low PTX version requirement (6.2 vs sm_80's 7.0) suggests it was an early development target added before PTX ISA versioning was finalized.
sm_101 (PTX 8.6): Original internal designation for Jetson Thor, renamed to sm_110 in a later CUDA release. Both entries coexist in the validation table for backward compatibility with PTX files referencing the old name. sub_6765E0 registers only sm_110; sm_101 is validation-only.
Accelerated Table -- unk_1D161C0 (7 entries)
sm_90a, sm_100a, sm_101a, sm_103a, sm_110a, sm_120a, sm_121a
One Hopper entry, six Blackwell entries. sm_101a is the legacy alias for sm_110a (Jetson Thor, original internal designation).
Feature-Reduced Table -- unk_1D16160 (6 entries)
sm_100f, sm_101f, sm_103f, sm_110f, sm_120f, sm_121f
No Hopper entry (sm_90 has no f variant). All Blackwell-era. sm_101f is the legacy alias for sm_110f.
Architecture Registration -- sub_6765E0
This 54KB function constructs profile objects for every SM version. Each profile contains:
| Field | Content | Example (sm_90) |
|---|---|---|
| SM name | "sm_90" | "sm_90" |
| Compute name | "compute_90" | "compute_90" |
| Family name | "Hopper" | "Hopper" |
CUDA_ARCH macro | Decimal integer | 900 |
| LTO name | "lto_90" | "lto_90" |
isaClass | Architecture class ID | -- |
The function registers each profile into three hash maps indexed by sm_XX, compute_XX, and lto_XX strings. This allows lookup by any of the three naming conventions used in different contexts (CLI, PTX .target directive, LTO linking).
Family assignment from sub_6765E0:
| SM Range | Family String | Notes |
|---|---|---|
| sm_75 | "Turing" | Single entry |
| sm_80, sm_86, sm_87, sm_88 | "Ampere" | Includes sm_88 (undocumented Ada variant) |
| sm_89 | "Ampere" | Ada Lovelace stored as Ampere-derived internally |
| sm_90/90a | "Hopper" | Single silicon, two feature levels |
| sm_100/100a/100f | "Blackwell" | Datacenter B200 |
| sm_103/103a/103f | "Blackwell" | Blackwell Ultra (GB300) |
| sm_110/110a/110f | "Blackwell" | Jetson Thor -- same family string despite different product |
| sm_120/120a/120f | "Blackwell" | Consumer/enterprise (RTX 50xx) -- different uarch, same string |
| sm_121/121a/121f | "Blackwell" | DGX Spark |
All sm_100 through sm_121 share the "Blackwell" family string internally, even though sm_110 (Jetson Thor) and sm_120 (consumer RTX 50xx) are distinct microarchitectures. The compiler distinguishes them through the capability dispatch tables, not through family name.
Capability Dispatch -- sub_607DB0
The capability dispatch initializer builds 7 parallel hash maps at initialization time, protected by a once-guard (byte_29FE1D8). Each map indexes sm_XX / compute_XX strings to per-architecture values or handler functions. Error recovery uses setjmp/longjmp.
| Map | Global | Purpose | Value Type |
|---|---|---|---|
| 1 | qword_29FE1D0 | Handler A (primary codegen) | Function pointer |
| 2 | qword_29FE1C8 | Handler B (secondary codegen) | Function pointer |
| 3 | qword_29FE1C0 | Intrinsic table initializer | Function pointer |
| 4 | qword_29FE1B8 | Capability flags | Byte value |
| 5 | qword_29FE1B0 | Profile registration | Registered via sub_42BEC0 |
| 6 | qword_29FE1A8 | Perf-stats / occupancy handler E | Function pointer |
| 7 | qword_29FE1A0 | Perf-stats / occupancy handler F | Function pointer |
Handler Function Assignments
Each SM version registers its own handler functions into these maps. Functions within the same suffix group (e.g., sm_100/100a/100f) share all handlers -- they are the same silicon with different feature exposure.
Map 1 -- Handler A (per SM):
| SM | Handler A | SM | Handler A |
|---|---|---|---|
| sm_75 | sub_609B70 | sm_100 | sub_609C30 |
| sm_80 | sub_609CC0 | sm_110 | sub_609F30 |
| sm_86 | sub_609D50 | sm_103 | sub_608F20 |
| sm_87 | sub_609F00 | sm_120 | sub_609E40 |
| sm_88 | sub_609E70 | sm_121 | sub_609ED0 |
| sm_89 | sub_609E10 | ||
| sm_90 | sub_609DB0 |
Map 2 -- Handler B (per SM):
| SM | Handler B | SM | Handler B |
|---|---|---|---|
| sm_75 | sub_609B40 | sm_100 | sub_609BD0 |
| sm_80 | sub_609C90 | sm_110 | sub_608F50 |
| sm_86 | sub_609D80 | sm_103 | sub_609D20 |
| sm_87 | sub_609DE0 | sm_120 | sub_609C60 |
| sm_88 | sub_609EA0 | sm_121 | sub_609BA0 |
| sm_89 | sub_609CF0 | ||
| sm_90 | sub_609C00 |
Map 3 -- Intrinsic table initializer (per SM):
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | ||
| sm_90 | sub_60A5F0 |
Shared Handler Groups
Sub-variants within a base SM share all handler functions, confirming they are identical silicon:
| Group | Members | Shared Handlers |
|---|---|---|
| Hopper | sm_90, sm_90a | All 7 maps |
| Blackwell DC | sm_100, sm_100a, sm_100f | All 7 maps |
| Blackwell Ultra | sm_103, sm_103a, sm_103f | All 7 maps |
| Jetson Thor | sm_110, sm_110a, sm_110f | All 7 maps |
| Consumer | sm_120, sm_120a, sm_120f | All 7 maps |
| DGX Spark | sm_121, sm_121a, sm_121f | All 7 maps |
Codegen Factory Values
The profile object stores an encoded architecture identifier at a known offset (visible as field +348 on the profile pointer chain, e.g., *(_QWORD *)(a1+1584)+348). This value is compared throughout the compiler to gate features:
| Codegen Factory | SM Range | SASS ISA Generation |
|---|---|---|
| 24577 | sm_75 | Turing (SM 7.5) |
| 28673 | sm_80 -- sm_89 | Ampere / Ada (SM 8.x) |
| 32768 | sm_90 | Hopper (SM 9.0) |
| 36864 | sm_100 -- sm_121 | Blackwell (SM 10.x -- 12.x) |
These values appear in feature-gating checks. For example, FMA/DFMA combining in the peephole optimizer checks profile[+372] > 28673 to require sm_70+ capability. The exact encoding formula is (isa_generation << 12) | variant, where the high bits identify the SASS instruction set generation.
Related encoded values seen in the binary:
12288 = sm_30 (Kepler) // 3 << 12
16385 = sm_50 (Maxwell) // 4 << 12 | 1
20481 = sm_50 alt (Maxwell) // 5 << 12 | 1
24576 = sm_60 (Pascal) // 6 << 12
24577 = sm_75 (Turing) // 6 << 12 | 1
28673 = sm_80 (Ampere) // 7 << 12 | 1
28674-28677 = sm_86/87/88/89 // 7 << 12 | 2..5
32768 = sm_90 (Hopper) // 8 << 12
36864 = sm_100 (Blackwell) // 9 << 12
36865-36869 = sm_103..121 // 9 << 12 | 1..5
Hardware Resource Geometry
ptxas assembles per-SM hardware parameters from three data sources: sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator properties). These parameters control register allocation limits, shared memory partitioning, occupancy calculations, and scheduling decisions throughout the compiler.
Universal Constants (sub_8688F0)
sub_8688F0 sets the baseline hardware profile shared by all SM 75+ targets. These values are architecture-invariant within the ptxas v13.0.88 binary:
| Parameter | Value | Binary Evidence | Profile Offset |
|---|---|---|---|
| Warp size | 32 threads | *(a1+1472) = 32 | +1472 |
| Max registers per thread | 255 | *(a1+612) = 0xFF0000003F | +612 |
| Register file per SM | 65,536 x 32-bit | Derived: max_warps = 65536 / (regcount * 32) | -- |
| Dependency barriers per warp | 6 | *(a1+604) = 6 | +604 |
| Named barriers per CTA | 16 | barrier_arrive_0 through barrier_arrive_15 intrinsics | -- |
| Static shared memory base | 48 KB (49,152 B) | *(a1+1484) = 49152 | +1484 |
| Shared memory config base | 1 MB (1,048,576 B) | *(v6+344) = 0x100000 in all per-SM inits | profile +344 |
The register file size of 65,536 registers is confirmed by the EIATTR_REGCOUNT formula (code 0x2F): max_warps_per_SM = total_registers / (regcount * warp_size), and by explicit reference in codegen/templates.md ("the entire physical register file is 65,536 32-bit registers shared across all active warps").
Per-SM Resource Geometry Table
Combines binary evidence (sub_8E4400 scheduling profile, sub_8688F0 baseline, sub_ABF250 occupancy properties, sub_60AXXX per-SM initializers) with NVIDIA public specifications for parameters not stored as scalar constants in the binary. Confidence column rates how directly the value was extracted from the binary vs. inferred from public documentation.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (
sub_8688F0offset +612). - Max Threads/CTA: Maximum threads per cooperative thread array (block). Not stored as a ptxas constant; derived from
warps_per_SM * warp_size / max_CTAs. - Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From
sub_8E4400offset +18 (packed DWORD) and offset +22 (WORD). The scheduler partition count is the number of warp scheduler units; dispatch slots is the total scheduling capacity. - Configurable Shared Memory: Valid shared memory sizes per CTA, selected by
cudaFuncSetAttribute. Stored as pointer-to-table at profile offsets +1488/+1496; sm_75 has 3 entries, later architectures have more.
sm_88 note: No known product ships on sm_88. It shares all handler functions with sm_86. Listed parameters are inherited; actual hardware behavior is unverifiable.
Scheduler Partition Geometry (sub_8E4400 Detail)
The packed DWORD at offset +18 of the warp-level profile encodes scheduler partition counts. The WORD at offset +22 is the dispatch slot count -- a scheduling capacity value distinct from the raw warp count.
| Codegen Factory Range | Packed DWORD | Hex | Partitions | Dispatch Slots | SM Era |
|---|---|---|---|---|---|
| <= 20479 | 458,759 | 0x00070007 | 7 | 96 | sm_50 (Maxwell) |
| 20480 -- 24575 | 786,444 | 0x000C000C | 12 | 176 | sm_60 (Pascal) |
| 24576 -- 28672 | 851,981 | 0x000D000D | 13 | 192 | sm_70 (Volta) |
| 28673 -- 32767 | 917,518 | 0x000E000E | 14 | 208 | sm_75 -- sm_89 |
| 32768 -- 36863 | 983,055 | 0x000F000F | 15 | 224 | sm_90 (Hopper) |
| > 36863 | 1,048,592 | 0x00100010 | 16 | 240 | sm_100+ (Blackwell) |
The dispatch slot count increases monotonically across generations, reflecting wider scheduling capacity. All sm_75 through sm_89 targets (Turing, Ampere, Ada Lovelace) share identical scheduling partition geometry despite their hardware differences -- the differentiation occurs in the per-SM latency tables, not in the partition structure.
Shared Memory Configuration Tables
ptxas stores configurable shared memory sizes as a pointer + count pair at profile offsets +1488 and +1496. The driver uses this table to validate cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, ...) calls.
sub_8688F0 sets the sm_75 configuration:
*(a1+1488) = &unk_21D9168 // pointer to shared memory size table
*(a1+1496) = 3 // 3 valid configurations
For sm_75 (Turing), the 3 entries correspond to 32 KB, 48 KB, and 64 KB configurable shared memory. The L1/shared partitioning on Turing splits the 96 KB unified data cache between L1 and shared memory.
For sm_80 (Ampere), the configurable shared memory extends to 164 KB, reflecting the larger shared memory/L1 combined capacity. sub_ABF250 records the maximum as 167,936 bytes (163.8 KB) for the base sm_60 path and 233,472 bytes (228 KB) for sm_70+ paths, though these values encode via xmmword constants that depend on the specific SM variant.
For sm_90+ (Hopper, Blackwell), sub_ABF250 populates a maximum configurable value of 233,472 bytes (228 KB), supporting the opt-in extended shared memory mode added in Hopper.
Register Allocation Mechanics
ptxas allocates registers in units determined by the register allocation granularity stored in sub_ABF250:
| SM Generation | Alloc Granularity | a2[6][1] | a2[6][2] | Notes |
|---|---|---|---|---|
| sm_30 -- sm_60 | 64 registers / warp | 63 | 1 | Allocates in blocks of 2 regs/thread |
| sm_70+ | 256 registers / warp | 255 | 2 | Allocates in blocks of 8 regs/thread |
The register allocation unit directly affects occupancy. With 256-register granularity on sm_75+, a kernel using 33 registers effectively consumes 40 (rounded up to the next multiple of 8), which means each warp uses 40 * 32 = 1280 of the 65,536 available registers, allowing up to 51 warps -- but capped by the hardware limit of 32 warps on sm_75.
The formula the GPU driver uses (from EIATTR_REGCOUNT documentation):
effective_regs = ceil(regcount / alloc_granularity) * alloc_granularity
regs_per_warp = effective_regs * warp_size
max_warps = min(registers_per_SM / regs_per_warp, hw_max_warps)
max_CTAs = min(max_warps / warps_per_CTA, hw_max_CTAs)
SM Version Encoding
The raw SM version number stored in profile objects and code object headers uses a packed integer format. This is the value at v4[93] in the code object builder (sub_A465F0):
| Encoded Value | SM Target | Code Object Version | Max Threads/CTA |
|---|---|---|---|
| 12288 | sm_30 | 0x70007 | 96 |
| 20481 | sm_50 | 0xC000C | 176 |
| 24576 | sm_60 | -- | -- |
| 28673 | sm_80 | -- | -- |
| 36864 | sm_90 | 0x100010 | 240 |
The code object builder (sub_A465F0 at 0xA465F0) maps these encoded SM versions to ELF code object version fields and thread-per-CTA limits. The magic number 0x16375564E is written at offset 0 of every code object header, with the SM version at offset +8.
Per-SM Capability Accessors -- sub_609XXX
The 24 functions in the sub_609XXX cluster (range 0x609280--0x609F60, ~1.2KB each) are the per-SM-version capability accessor functions. They are registered into Maps 1 and 2 of the dispatch tables and return architecture-specific values: register file sizes, feature flags, warp geometry, shared memory limits, and similar hardware parameters.
These are the functions that downstream code calls (through the dispatch table) to answer questions like "how many registers does this SM have?" or "is feature X available on this target?"
Profile Layering
Three levels of SM profile information cooperate:
Level 1: sub_607DB0 // Capability dispatch (7 hash maps)
| // -> feature flags, handler functions
v
Level 2: sub_6765E0 // Profile objects (name, family, CUDA_ARCH, lto)
| // -> identity metadata, isaClass
v
Level 3: sub_609XXX / sub_60AXXX // Per-SM accessor functions
// -> concrete hardware parameter values
Level 1 provides the dispatch infrastructure. Level 2 provides identity metadata for diagnostics and linking. Level 3 provides the actual numeric values that drive register allocation, scheduling, and instruction legality.
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for ptxas v13.0.88, returned by sub_6784B0 when no --gpu-name is specified. Codegen factory value: 24577.
- Base tensor core (WMMA m16n16k16 f16/f32, m32n8k16, m8n32k16)
- Integer MMA (IMMA int8/int4), binary MMA (BMMA b1)
- Base warp-level operations (shfl.sync, vote.sync, match.sync, redux.sync)
- Named barrier support (bar 0-15 with arrive/red/sync variants)
Ampere (sm_80 -- sm_88)
Codegen factory value: 28673. Shared with Ada Lovelace (sm_89).
- Extended tensor core: TF32, BF16, FP64 MMA shapes
cp.asyncfor asynchronous shared memory copies- L2 cache hints on atomic operations
createpolicyinstructions for cache management- 14 additional intrinsics (
__cuda_sm80_*: bf16/tf32/s4/s8/b1 MMA, createpolicy)
Ada Lovelace (sm_89)
Codegen factory value: 28673 (same as Ampere). Stored as "Ampere" internally despite being a distinct Ada Lovelace microarchitecture.
- Same codegen path as Ampere; differentiated through capability flags, not codegen factory
- 39 additional MMA intrinsics (
__cuda_sm_8x_mma_*)
Hopper (sm_90 / sm_90a)
Codegen factory value: 32768. sm_90a is architecture-locked (H100/H200 only).
- WGMMA (warpgroup MMA async):
wgmma.mma_async,wgmma.fence,wgmma.commit_group,wgmma.wait_group - Cluster operations:
barrier.cluster.arrive/wait, distributed shared memory - setmaxnreg: Dynamic register allocation limit
- Cluster special registers:
%clusterid,%cluster_ctaid,%cluster_ctarank, etc. - 38 sub-byte MMA intrinsics (
__cuda_sm_9x_mma_sub_byte_internal_*: s4/u4 sparse)
Blackwell Datacenter (sm_100, sm_103)
Codegen factory value: 36864. Both a and f sub-variants available.
- tcgen05: 5th-generation tensor core ISA (alloc, dealloc, ld, st, commit, cp, shift, mma) --
a/fsub-variants only - tcgen05 guardrails: 8 debug validation functions (phase validity, column allocation, bounds checking)
- Extended MMA: 10 Blackwell-specific hmma/imma + bit MMA intrinsics (
__cuda_sm_10x_*) - 11 tcgen05 guardrail trap intrinsics (
__cuda_sm10x_tcgen05_guardrail_trap_*) - 18 sm_1xx bulk copy intrinsics (
__cuda_sm1xx_*: cp.async.bulk.tensor 1D-5D tile/im2col)
Jetson Thor (sm_110)
Codegen factory value: 36864 (same as sm_100). Originally sm_101 before rename. Automotive/robotics SoC.
- Retains full tcgen05/TMEM hardware on
a/fsub-variants - Same Blackwell datacenter feature set for tensor operations
- Differentiated through capability flags for SoC-specific constraints
Blackwell Consumer (sm_120, sm_121)
Codegen factory value: 36864 (same as sm_100). Architecturally a distinct consumer microarchitecture despite sharing the "Blackwell" family string.
- No tcgen05: The entire tcgen05 ISA is absent on sm_120/121 -- gated by SM version checks
- Tensor core falls back to HMMA/IMMA/WGMMA inherited from sm_70--sm_90 path
- sm_120 = RTX 50xx consumer / RTX Blackwell Pro (enterprise)
- sm_121 = DGX Spark
Diagnostic Strings
| String | Context | Function |
|---|---|---|
"Turing" | Family name in profile object | sub_6765E0 |
"Ampere" | Family name in profile object | sub_6765E0 |
"Hopper" | Family name in profile object | sub_6765E0 |
"Blackwell" | Family name in profile object | sub_6765E0 |
"isaClass" | Architecture class reference on profile | sub_6765E0 |
"sm_%d" | SM name formatting | Multiple |
"compute_%d" | Compute name formatting | sub_6765E0 |
"lto_%d" | LTO name formatting | sub_6765E0 |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_607DB0 | 14KB | SM capability dispatch -- builds 7 hash maps | 99% |
sub_608D70 | 384B | Profile lookup dispatcher | 80% |
sub_608DF0 | ~1KB | sm_120 intrinsic table initializer | 85% |
sub_608F20 | ~1.2KB | sm_103 handler A (capability accessor) | 90% |
sub_608F50 | ~1.2KB | sm_110 handler B (capability accessor) | 90% |
sub_609280--sub_609F60 | ~1.2KB each | 24 per-SM capability accessor functions (Maps 1+2) | 90% |
sub_609F60 | 2.8KB | lds128convert option handler ("always", "nonconst", "never") | 90% |
sub_60A2E0--sub_60AD30 | ~1KB each | 12 per-SM intrinsic table initializers (Map 3) | 85% |
sub_60B040 | 4.5KB | Stress test options ("stress-maxrregcount", etc.) | 85% |
sub_6765E0 | 54KB | SM profile object construction (family, CUDA_ARCH, lto) | 95% |
sub_6784B0 | -- | Default architecture -- returns sm_75 | 99% |
sub_8688F0 | 31 lines | Universal HW profile baseline (warp size, regs, barriers, shmem) | 95% |
sub_8E4400 | 3.3KB | Warp-level HW profile: scheduler partitions, dispatch slots | 95% |
sub_ABF250 | ~600B | Occupancy property table: configurable shmem, reg alloc granularity | 90% |
sub_A95DC0 | ~1.8KB | Extended HW profile: architecture-specific shmem config | 85% |
sub_A465F0 | 14KB | Code object header builder (SM version -> ELF fields) | 88% |
Profile Object Layout (1936 bytes)
Every SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object that carries target-specific parameters throughout the compiler. This is the compilation unit's target descriptor -- the single structure that downstream code reads to answer "what hardware am I compiling for?"
Construction Sequence
1. sub_71BDE0(1936, a1) heap allocate 1936 bytes
2. sub_C1B7A0(profile) zero-fill + structural defaults (8 SSE blocks, 5 scalars)
3. sub_917990(a3) overlay: codegen factory default, tail constants
4. sub_60AXXX(a1,a2,a3,a4) per-SM: codegen factory, shmem base, capability flags
Key Fields -- Explicitly Initialized
These fields receive non-zero values during construction. Offsets are byte offsets from the profile object base pointer. Type column: D=DWORD(4B), Q=QWORD(8B), O=OWORD(16B), B=BYTE.
| Offset | Type | Default | Set By | Semantic Name | Confidence |
|---|---|---|---|---|---|
| +0 | Q | 0 | sub_C1B7A0 | object_base -- zeroed, likely vtable/class pointer | 75% |
| +112 | Q | 0x500000000 | sub_C1B7A0 | packed_config -- stores DWORD 5 at +112, DWORD 0 at +116 | 85% |
| +120 | D | 5 | sub_C1B7A0 | opt_level_default -- initial optimization level or block dimension | 85% |
| +132 | Q | 0xFFFFFFFF | sub_C1B7A0 | max_register_limit -- -1 sentinel = "no limit" | 85% |
| +340 | D | 1 | sub_C1B7A0 | enable_flag_A -- default-enabled capability | 85% |
| +344 | D | 0x100000 | per-SM init | shared_memory_config_base -- 1 MB for all SM 75+ targets | 95% |
| +348 | D | per-SM | per-SM init | codegen_factory -- ISA generation encoding (gen << 12) | variant | 99% |
| +424 | Q | 0x100000000 | sub_C1B7A0 | packed_enable -- stores DWORD 1 at +424, DWORD 0 at +428 | 90% |
| +428 | D | 0 (cond.) | per-SM init | conditional_feature_flag -- sm_90+ only: set to 0 when *(a2+355) is true | 85% |
| +432 | D | computed | per-SM init | module_base_address -- callback() - 0x6FFFFE64 or -1 if disabled | 95% |
| +588 | D | 0 | sub_917990 | cleared_field -- explicitly re-zeroed; used in 10+ consumer functions | 90% |
| +708 | D | 1 | per-SM init | enable_flag_D -- universally set to 1 by all per-SM initializers | 85% |
| +944 | D | 4 | sub_C1B7A0 | pipeline_depth -- possibly barrier count or pipeline stage limit | 85% |
| +1200 | Q | "NVIDIA" | sub_43A400 | vendor_string_ptr -- pointer to vendor identification string | 95% |
| +1208 | Q | (pointer) | sub_43A400 | associated_data_ptr -- assigned from callback result | 90% |
| +1216 | D | 1 | sub_43A400 | vendor_flag -- set to 1 during ELF builder initialization | 85% |
| +1385 | B | 0 (bits) | runtime | scheduling_feature_flags -- bitfield, 21+ consumer sites | 99% |
| +1536 | Q | 1832 | sub_C1B7A0 | dynamic_region_offset -- points to tail SSE constant region start | 90% |
| +1552 | Q | 0 | runtime | pipeline_progress -- monotonically increasing counter (values 0--21); scoreboard guards check 16--19 | 95% |
| +1584 | Q | nullsub_856 | sub_C1B7A0 | sm_backend_vtable_ptr -- THE central polymorphic pointer; initialized to null stub | 99% |
| +1684 | D | CLI value | per-SM init | cli_option_value -- *(a1+108) passthrough from compiler driver | 90% |
| +1840 | D | 1 | per-SM init | elf_section_data -- initially 1 (enable), later overwritten with ptr | 85% |
| +1880 | Q | 1 | per-SM init | barrier_tracking_ptr -- initially 1, later pointer to scoreboard data | 95% |
| +1892 | D | 2 | sub_917990 | tail_mode_value -- possibly versioning or encoding mode indicator | 85% |
| +1912 | D | 0 (cond.) | per-SM init | conditional_clear -- cleared when *(a2+233) is true (debug mode) | 85% |
| +1928 | D | 1 | per-SM init | output_config_value -- compilation output configuration | 85% |
SSE Constant Blocks
10 blocks of 16 bytes each are loaded from .rodata segments via _mm_load_si128. These likely contain per-register-class sizing parameters, pipeline configuration constants, or default opcode table pointers. Exact values require .rodata dump.
| Offset | Source | Set By |
|---|---|---|
| +184 | xmmword_20206F0 | sub_C1B7A0 |
| +280 | xmmword_2027950 | sub_C1B7A0 |
| +312 | xmmword_2027600 | sub_C1B7A0 |
| +680 | xmmword_2027620 | sub_C1B7A0 |
| +696 | xmmword_22B4ED0 | sub_C1B7A0 |
| +740 | xmmword_22B4EE0 | sub_C1B7A0 |
| +788 | xmmword_22B4EF0 | sub_C1B7A0 |
| +816 | xmmword_22B4F00 | sub_C1B7A0 |
| +1832 | xmmword_21DEBA0 | sub_917990 |
| +1908 | xmmword_21DEBB0 | sub_917990 |
Scheduling Feature Flags (+1385 bitfield)
The byte at offset +1385 is the most heavily accessed bitfield on the profile object (21+ consumer sites in 15+ decompiled functions). Each bit gates a scheduling or codegen behavior.
| Bit | Mask | Meaning | Evidence |
|---|---|---|---|
| 0 | 0x01 | Function has sync barriers | sub_792CD0 sets/clears; sub_75F680, sub_75F580 check |
| 1 | 0x02 | (unknown) | -- |
| 2 | 0x04 | Extended barrier model | sub_796D60 checks jointly with +1382 & 0x20 |
| 3 | 0x08 | Scoreboard tracking enabled | sub_925670, sub_925510, sub_9253C0 check jointly with +1880 and +1552 |
| 4 | 0x10 | (unknown) | -- |
| 5 | 0x20 | Scheduling feature flag | sub_793220, sub_A36360 (scoreboard encoder) check |
| 6 | 0x40 | Temporary analysis flag | sub_752E40 sets; sub_77F0D0 clears |
| 7 | 0x80 | Preserved across resets | sub_7F7DC0: *(a1+1385) &= 0x80 (all others cleared) |
Per-SM Initializer Differences
All 12 per-SM initializers (one per SM family) are structurally identical. Only two fields differ between families.
| Field | sm_75--sm_88 | sm_89 | sm_90 | sm_100--sm_121 |
|---|---|---|---|---|
+348 codegen_factory | 24577--28676 | 28677 | 0x8000 (32768) | 36864--36869 |
+428 conditional_feature_flag | not written | not written | written (if *(a2+355)) | written (if *(a2+355)) |
All other fields (+344, +432, +588, +708, +1684, +1840, +1880, +1912, +1928) are set identically across all SM families.
Critical Distinction: Profile Object vs SM Backend
The profile object (1936 bytes, this layout) is the compilation unit's target descriptor, stored somewhere in the compilation context. The pointer at context+1584 (sm_backend_vtable_ptr) points to a separate polymorphic SM backend object -- not to another profile object. Fields accessed as *(*(ctx+1584) + N) are on the SM backend, not on this profile object.
Commonly accessed SM backend fields (NOT on the 1936-byte profile):
| SM Backend Offset | Semantic | Consumer Count |
|---|---|---|
| +12 | arch_class_id (4=Maxwell, 10=Grace, 11=NVLink-Volta+) | 3+ |
| +372 | codegen_factory (THE feature-gating value, 37+ sites) | 37+ |
| +1037 | hw_capability_flags bitfield (SFU precision, barriers) | 20+ |
| +1312 | vtable dispatch for predicate marking capability | 2+ |
| +1320 | vtable function slot for optimization dispatch | 2+ |
| +1417 | Grace/ARM architecture flag (bit 7) | 1+ |
| +1418 | NVLink-capable flag (bit 0) | 1+ |
The profile's own +348 stores the codegen factory value at construction time. The SM backend's +372 is where all downstream code reads it. These are the same numeric value stored in two different objects.
Object Region Map
Offset Range Size Content
----------- ---- -------
[0..111] 112B Object header, vtable chain, zeroed bulk
[112..159] 48B Config scalars (opt level, register limit)
[160..343] 184B Zeroed + 3 SSE constant blocks (184, 280, 312)
[344..587] 244B Target identity (shmem base, codegen factory, enable flags)
[588..879] 292B Capability fields, 4 SSE constant blocks (680, 696, 740, 788)
[880..1063] 184B Scheduling config, barrier config (+944=4 default)
[1064..1279] 216B Extended config, vendor metadata (+1200="NVIDIA")
[1280..1535] 256B Architecture feature flags (+1385 bitfield, +1417/+1418)
[1536..1663] 128B Dynamic region pointer, phase counter, sm_backend ptr (+1584)
[1664..1831] 168B CLI passthrough, ELF section data, barrier tracking ptr
[1832..1935] 104B Tail region: 2 SSE constant blocks, mode value (+1892=2)
Cross-References
- Turing & Ampere (SM 75--88) -- Detailed feature flags for sm_75 through sm_89
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, sm_90a arch-lock
- Blackwell (SM 100--121) -- tcgen05, arch-conditional vs family-conditional gating
- TCGen05 -- 5th Gen Tensor Cores -- tcgen05 instruction set detail
- Intrinsic Table (608 Entries) -- Master intrinsic catalog with per-SM generation ranges
- Tensor Core Intrinsics -- WMMA/MMA/tcgen05 intrinsic lowering
- Latency Model & HW Profiles -- Per-SM scheduling parameters
- SASS Instruction Encoding -- Codegen factory -> encoder selection
- Mercury Encoder -- SM-dependent SASS encoding
- CLI Options --
--gpu-nameparsing and default target - Data Structures -- context+1584 polymorphic pointer documentation