Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SM Architecture Map

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas validates the --gpu-name target against three sorted lookup tables, constructs a profile object with family metadata and a CUDA_ARCH macro value, then populates seven parallel dispatch tables that drive capability checks, code generation factory selection, performance modeling, and occupancy calculation throughout the compiler. The default target is sm_75 (Turing). Every downstream decision -- instruction legality, encoder selection, register file geometry, scheduling latencies -- routes through the profile object built here.

SM validationsub_6765E0 (54KB, profile object construction)
Capability dispatchsub_607DB0 (14KB, 7 parallel hash maps)
Default targetsub_6784B0 -- returns sm_75 when --gpu-name is omitted
Validation tables3 bsearch arrays: base (32 entries at unk_1D16220), f (6 entries at unk_1D16160), a (7 entries at unk_1D161C0)
Per-SM accessorssub_609XXX cluster (24 functions, ~1.2KB each)
Per-SM intrinsic initsub_60AXXX cluster (12 functions, ~1KB each)
Profile lookupsub_608D70 (384 bytes, dispatcher registered via sub_42BEC0)

Per-SM Deep Dives:

Complete SM Table

23 active SM base targets ship in ptxas v13.0.88 (plus 9 legacy and 2 internal/alias entries retained in the validation table for backward compatibility). Each base target optionally has a (accelerated) and/or f (feature-reduced) sub-variants. The CUDA_ARCH column shows the value the -D__CUDA_ARCH__ macro expands to.

SM__CUDA_ARCHFamilyProductCodegen FactoryStatusDeep Dive
sm_75750TuringTU10x (RTX 20xx)24577Productionturing-ampere
sm_80800AmpereA10028673Productionturing-ampere
sm_86860AmpereA40/A10/RTX 30xx28673Productionturing-ampere
sm_87870AmpereOrin (Jetson)28673Productionturing-ampere
sm_88880Ampere--28673Productionturing-ampere
sm_89890Ada LovelaceAD10x (RTX 40xx) / L40S28673Productionada-hopper
sm_90 / sm_90a900HopperH100 / H20032768Productionada-hopper
sm_100 / sm_100a / sm_100f1000BlackwellB200 (datacenter)36864Productionblackwell
sm_103 / sm_103a / sm_103f1030Blackwell UltraGB300 (datacenter)36864Productionblackwell
sm_110 / sm_110a / sm_110f1100Jetson ThorThor SoC (auto/robotics)36864Productionblackwell
sm_120 / sm_120a / sm_120f1200Blackwell (sm120)RTX 50xx / RTX Pro36864Productionblackwell
sm_121 / sm_121a / sm_121f1210Blackwell (sm120)DGX Spark36864Productionblackwell

The family name stored in the profile object (from sub_6765E0) uses NVIDIA's internal naming: "Turing", "Ampere", "Hopper", "Blackwell". Ada Lovelace (sm_89) is stored as Ampere-derived internally despite being a distinct microarchitecture. sm_120/121 use "Blackwell" internally despite being a different consumer microarchitecture from sm_100 datacenter Blackwell.

Suffix Semantics

ptxas uses three suffix modes to control forward compatibility. The distinction is critical: it determines which SASS binary a cubin can execute on.

SuffixMeaningForward CompatibilityValidation Table
(none)Base feature setFull forward-compat across generationsunk_1D16220 (32 entries)
a (accelerated)Architecture-locked, advanced featuresNo forward compat -- locked to specific siliconunk_1D161C0 (7 entries)
f (feature-reduced)Same-family forward compat onlyForward-compat within family, not acrossunk_1D16160 (6 entries)

The base variant (no suffix) produces SASS that runs on the named architecture and all later ones: sm_80 code runs on sm_86, sm_89, sm_90, sm_100, etc. The a suffix locks the binary to exact silicon: sm_90a code runs only on H100/H200 hardware and will not execute on Blackwell. The f suffix allows forward compatibility within the same family: sm_100f code runs on sm_100 and sm_103 (both Blackwell datacenter) but not on sm_120 (different family).

Compilation rules from help text:

  • sm_90a PTX must be compiled to sm_90a SASS (no cross-arch compilation)
  • sm_100f PTX can compile to sm_100f or sm_103f SASS (same family)
  • sm_100a PTX must compile to sm_100a SASS only
  • Base sm_100 PTX compiles to any sm_100+ SASS

Sub-Variant Expansion

Basea Variantf VariantCUDA_ARCH (a)CUDA_ARCH (f)
sm_90sm_90a--90a0--
sm_100sm_100asm_100f100a0100f0
sm_103sm_103asm_103f103a0103f0
sm_101sm_101asm_101f----
sm_110sm_110asm_110f110a0110f0
sm_120sm_120asm_120f120a0120f0
sm_121sm_121asm_121f121a0121f0

sm_75 through sm_89 have no a or f variants. sm_90 has only the a variant (no f). All Blackwell-era targets (sm_100+) have both a and f. sm_101 is a legacy alias for sm_110 (Jetson Thor, original internal designation); it passes validation but is not registered as a profile object, so its CUDA_ARCH values are not populated.

SM Validation Tables

Target name validation uses three sorted arrays searched via bsearch(). The CLI parser extracts the SM string from --gpu-name, strips any suffix, and searches the appropriate table.

Base Table -- unk_1D16220 (32 entries)

Contains all valid base SM names without suffix, sorted by numeric SM ID. Includes legacy architectures no longer supported for active compilation but retained for validation, plus two internal/alias entries. Each entry is 12 bytes: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}. The bsearch comparison (sub_484B70) compares the numeric sm_id extracted from the --gpu-name string via sscanf.

sm_10, sm_11, sm_12, sm_13,                    // Tesla (legacy, PTX 1.0--1.2)
sm_20, sm_21,                                  // Fermi (legacy, PTX 2.0)
sm_30, sm_32, sm_35, sm_37,                    // Kepler (legacy, PTX 3.0--4.1)
sm_50, sm_52, sm_53,                           // Maxwell (legacy, PTX 4.0--4.2)
sm_60, sm_61, sm_62,                           // Pascal (legacy, PTX 5.0)
sm_70, sm_72,                                  // Volta (legacy, PTX 6.0--6.1)
sm_75,                                         // Turing (active, PTX 6.3)
sm_80, sm_82, sm_86, sm_87, sm_88, sm_89,      // Ampere/Ada (active, PTX 6.2--7.8)
sm_90,                                         // Hopper (active, PTX 7.8)
sm_100, sm_101, sm_103, sm_110, sm_120, sm_121 // Blackwell (active, PTX 8.6--9.0)

sm_82 (PTX 6.2): Undocumented internal Ampere target. Not registered in sub_6765E0 (no profile object). Serves as the SASS opcode generation boundary (SM82_FIRST/SM82_LAST, opcode indices 172--193). The anomalously low PTX version requirement (6.2 vs sm_80's 7.0) suggests it was an early development target added before PTX ISA versioning was finalized.

sm_101 (PTX 8.6): Original internal designation for Jetson Thor, renamed to sm_110 in a later CUDA release. Both entries coexist in the validation table for backward compatibility with PTX files referencing the old name. sub_6765E0 registers only sm_110; sm_101 is validation-only.

Accelerated Table -- unk_1D161C0 (7 entries)

sm_90a, sm_100a, sm_101a, sm_103a, sm_110a, sm_120a, sm_121a

One Hopper entry, six Blackwell entries. sm_101a is the legacy alias for sm_110a (Jetson Thor, original internal designation).

Feature-Reduced Table -- unk_1D16160 (6 entries)

sm_100f, sm_101f, sm_103f, sm_110f, sm_120f, sm_121f

No Hopper entry (sm_90 has no f variant). All Blackwell-era. sm_101f is the legacy alias for sm_110f.

Architecture Registration -- sub_6765E0

This 54KB function constructs profile objects for every SM version. Each profile contains:

FieldContentExample (sm_90)
SM name"sm_90""sm_90"
Compute name"compute_90""compute_90"
Family name"Hopper""Hopper"
CUDA_ARCH macroDecimal integer900
LTO name"lto_90""lto_90"
isaClassArchitecture class ID--

The function registers each profile into three hash maps indexed by sm_XX, compute_XX, and lto_XX strings. This allows lookup by any of the three naming conventions used in different contexts (CLI, PTX .target directive, LTO linking).

Family assignment from sub_6765E0:

SM RangeFamily StringNotes
sm_75"Turing"Single entry
sm_80, sm_86, sm_87, sm_88"Ampere"Includes sm_88 (undocumented Ada variant)
sm_89"Ampere"Ada Lovelace stored as Ampere-derived internally
sm_90/90a"Hopper"Single silicon, two feature levels
sm_100/100a/100f"Blackwell"Datacenter B200
sm_103/103a/103f"Blackwell"Blackwell Ultra (GB300)
sm_110/110a/110f"Blackwell"Jetson Thor -- same family string despite different product
sm_120/120a/120f"Blackwell"Consumer/enterprise (RTX 50xx) -- different uarch, same string
sm_121/121a/121f"Blackwell"DGX Spark

All sm_100 through sm_121 share the "Blackwell" family string internally, even though sm_110 (Jetson Thor) and sm_120 (consumer RTX 50xx) are distinct microarchitectures. The compiler distinguishes them through the capability dispatch tables, not through family name.

Capability Dispatch -- sub_607DB0

The capability dispatch initializer builds 7 parallel hash maps at initialization time, protected by a once-guard (byte_29FE1D8). Each map indexes sm_XX / compute_XX strings to per-architecture values or handler functions. Error recovery uses setjmp/longjmp.

MapGlobalPurposeValue Type
1qword_29FE1D0Handler A (primary codegen)Function pointer
2qword_29FE1C8Handler B (secondary codegen)Function pointer
3qword_29FE1C0Intrinsic table initializerFunction pointer
4qword_29FE1B8Capability flagsByte value
5qword_29FE1B0Profile registrationRegistered via sub_42BEC0
6qword_29FE1A8Perf-stats / occupancy handler EFunction pointer
7qword_29FE1A0Perf-stats / occupancy handler FFunction pointer

Handler Function Assignments

Each SM version registers its own handler functions into these maps. Functions within the same suffix group (e.g., sm_100/100a/100f) share all handlers -- they are the same silicon with different feature exposure.

Map 1 -- Handler A (per SM):

SMHandler ASMHandler A
sm_75sub_609B70sm_100sub_609C30
sm_80sub_609CC0sm_110sub_609F30
sm_86sub_609D50sm_103sub_608F20
sm_87sub_609F00sm_120sub_609E40
sm_88sub_609E70sm_121sub_609ED0
sm_89sub_609E10
sm_90sub_609DB0

Map 2 -- Handler B (per SM):

SMHandler BSMHandler B
sm_75sub_609B40sm_100sub_609BD0
sm_80sub_609C90sm_110sub_608F50
sm_86sub_609D80sm_103sub_609D20
sm_87sub_609DE0sm_120sub_609C60
sm_88sub_609EA0sm_121sub_609BA0
sm_89sub_609CF0
sm_90sub_609C00

Map 3 -- Intrinsic table initializer (per SM):

SMInitializerSMInitializer
sm_75sub_60A2E0sm_100sub_60A910
sm_80sub_60A3E0sm_110sub_60AA20
sm_86sub_60AC30sm_103sub_60A700
sm_87sub_60AD30sm_120sub_608DF0
sm_88sub_60AB30sm_121sub_60A4E0
sm_89sub_60A810
sm_90sub_60A5F0

Shared Handler Groups

Sub-variants within a base SM share all handler functions, confirming they are identical silicon:

GroupMembersShared Handlers
Hoppersm_90, sm_90aAll 7 maps
Blackwell DCsm_100, sm_100a, sm_100fAll 7 maps
Blackwell Ultrasm_103, sm_103a, sm_103fAll 7 maps
Jetson Thorsm_110, sm_110a, sm_110fAll 7 maps
Consumersm_120, sm_120a, sm_120fAll 7 maps
DGX Sparksm_121, sm_121a, sm_121fAll 7 maps

Codegen Factory Values

The profile object stores an encoded architecture identifier at a known offset (visible as field +348 on the profile pointer chain, e.g., *(_QWORD *)(a1+1584)+348). This value is compared throughout the compiler to gate features:

Codegen FactorySM RangeSASS ISA Generation
24577sm_75Turing (SM 7.5)
28673sm_80 -- sm_89Ampere / Ada (SM 8.x)
32768sm_90Hopper (SM 9.0)
36864sm_100 -- sm_121Blackwell (SM 10.x -- 12.x)

These values appear in feature-gating checks. For example, FMA/DFMA combining in the peephole optimizer checks profile[+372] > 28673 to require sm_70+ capability. The exact encoding formula is (isa_generation << 12) | variant, where the high bits identify the SASS instruction set generation.

Related encoded values seen in the binary:

12288  = sm_30 (Kepler)       // 3 << 12
16385  = sm_50 (Maxwell)      // 4 << 12 | 1
20481  = sm_50 alt (Maxwell)  // 5 << 12 | 1
24576  = sm_60 (Pascal)       // 6 << 12
24577  = sm_75 (Turing)       // 6 << 12 | 1
28673  = sm_80 (Ampere)       // 7 << 12 | 1
28674-28677 = sm_86/87/88/89  // 7 << 12 | 2..5
32768  = sm_90 (Hopper)       // 8 << 12
36864  = sm_100 (Blackwell)   // 9 << 12
36865-36869 = sm_103..121     // 9 << 12 | 1..5

Hardware Resource Geometry

ptxas assembles per-SM hardware parameters from three data sources: sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator properties). These parameters control register allocation limits, shared memory partitioning, occupancy calculations, and scheduling decisions throughout the compiler.

Universal Constants (sub_8688F0)

sub_8688F0 sets the baseline hardware profile shared by all SM 75+ targets. These values are architecture-invariant within the ptxas v13.0.88 binary:

ParameterValueBinary EvidenceProfile Offset
Warp size32 threads*(a1+1472) = 32+1472
Max registers per thread255*(a1+612) = 0xFF0000003F+612
Register file per SM65,536 x 32-bitDerived: max_warps = 65536 / (regcount * 32)--
Dependency barriers per warp6*(a1+604) = 6+604
Named barriers per CTA16barrier_arrive_0 through barrier_arrive_15 intrinsics--
Static shared memory base48 KB (49,152 B)*(a1+1484) = 49152+1484
Shared memory config base1 MB (1,048,576 B)*(v6+344) = 0x100000 in all per-SM initsprofile +344

The register file size of 65,536 registers is confirmed by the EIATTR_REGCOUNT formula (code 0x2F): max_warps_per_SM = total_registers / (regcount * warp_size), and by explicit reference in codegen/templates.md ("the entire physical register file is 65,536 32-bit registers shared across all active warps").

Per-SM Resource Geometry Table

Combines binary evidence (sub_8E4400 scheduling profile, sub_8688F0 baseline, sub_ABF250 occupancy properties, sub_60AXXX per-SM initializers) with NVIDIA public specifications for parameters not stored as scalar constants in the binary. Confidence column rates how directly the value was extracted from the binary vs. inferred from public documentation.

SMRegs/SMMax Regs/ThreadMax Threads/CTAWarps/SMMax CTAs/SMSched PartitionsDispatch SlotsConfigurable Shared MemoryConf
sm_7565,5362551,02432167 / 20820832 / 48 / 64 KB90%
sm_8065,5362552,04864327 / 20820848 / 100 / 132 / 164 KB90%
sm_8665,5362551,53648167 / 20820848 / 100 KB90%
sm_8765,5362551,53648167 / 20820848 / 100 / 164 KB90%
sm_8865,5362551,53648167 / 208208(same as sm_86)85%
sm_8965,5362551,53648167 / 20820848 / 100 KB90%
sm_9065,5362551,02464328 / 22422448 / 100 / 132 / 164 / 228 KB90%
sm_10065,5362551,024643216 / 24024048 / 100 / 132 / 164 / 228 KB90%
sm_10365,5362551,024643216 / 240240(same as sm_100)88%
sm_11065,5362551,024643216 / 240240(same as sm_100)85%
sm_12065,5362551,024643216 / 24024048 / 100 / 132 / 164 / 228 KB88%
sm_12165,5362551,024643216 / 240240(same as sm_120)85%

Column definitions:

  • Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
  • Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
  • Max Threads/CTA: Maximum threads per cooperative thread array (block). Not stored as a ptxas constant; derived from warps_per_SM * warp_size / max_CTAs.
  • Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
  • Max CTAs/SM: Maximum concurrent CTAs per SM.
  • Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD). The scheduler partition count is the number of warp scheduler units; dispatch slots is the total scheduling capacity.
  • Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute. Stored as pointer-to-table at profile offsets +1488/+1496; sm_75 has 3 entries, later architectures have more.

sm_88 note: No known product ships on sm_88. It shares all handler functions with sm_86. Listed parameters are inherited; actual hardware behavior is unverifiable.

Scheduler Partition Geometry (sub_8E4400 Detail)

The packed DWORD at offset +18 of the warp-level profile encodes scheduler partition counts. The WORD at offset +22 is the dispatch slot count -- a scheduling capacity value distinct from the raw warp count.

Codegen Factory RangePacked DWORDHexPartitionsDispatch SlotsSM Era
<= 20479458,7590x00070007796sm_50 (Maxwell)
20480 -- 24575786,4440x000C000C12176sm_60 (Pascal)
24576 -- 28672851,9810x000D000D13192sm_70 (Volta)
28673 -- 32767917,5180x000E000E14208sm_75 -- sm_89
32768 -- 36863983,0550x000F000F15224sm_90 (Hopper)
> 368631,048,5920x0010001016240sm_100+ (Blackwell)

The dispatch slot count increases monotonically across generations, reflecting wider scheduling capacity. All sm_75 through sm_89 targets (Turing, Ampere, Ada Lovelace) share identical scheduling partition geometry despite their hardware differences -- the differentiation occurs in the per-SM latency tables, not in the partition structure.

Shared Memory Configuration Tables

ptxas stores configurable shared memory sizes as a pointer + count pair at profile offsets +1488 and +1496. The driver uses this table to validate cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, ...) calls.

sub_8688F0 sets the sm_75 configuration:

*(a1+1488) = &unk_21D9168    // pointer to shared memory size table
*(a1+1496) = 3               // 3 valid configurations

For sm_75 (Turing), the 3 entries correspond to 32 KB, 48 KB, and 64 KB configurable shared memory. The L1/shared partitioning on Turing splits the 96 KB unified data cache between L1 and shared memory.

For sm_80 (Ampere), the configurable shared memory extends to 164 KB, reflecting the larger shared memory/L1 combined capacity. sub_ABF250 records the maximum as 167,936 bytes (163.8 KB) for the base sm_60 path and 233,472 bytes (228 KB) for sm_70+ paths, though these values encode via xmmword constants that depend on the specific SM variant.

For sm_90+ (Hopper, Blackwell), sub_ABF250 populates a maximum configurable value of 233,472 bytes (228 KB), supporting the opt-in extended shared memory mode added in Hopper.

Register Allocation Mechanics

ptxas allocates registers in units determined by the register allocation granularity stored in sub_ABF250:

SM GenerationAlloc Granularitya2[6][1]a2[6][2]Notes
sm_30 -- sm_6064 registers / warp631Allocates in blocks of 2 regs/thread
sm_70+256 registers / warp2552Allocates in blocks of 8 regs/thread

The register allocation unit directly affects occupancy. With 256-register granularity on sm_75+, a kernel using 33 registers effectively consumes 40 (rounded up to the next multiple of 8), which means each warp uses 40 * 32 = 1280 of the 65,536 available registers, allowing up to 51 warps -- but capped by the hardware limit of 32 warps on sm_75.

The formula the GPU driver uses (from EIATTR_REGCOUNT documentation):

effective_regs = ceil(regcount / alloc_granularity) * alloc_granularity
regs_per_warp  = effective_regs * warp_size
max_warps      = min(registers_per_SM / regs_per_warp, hw_max_warps)
max_CTAs       = min(max_warps / warps_per_CTA, hw_max_CTAs)

SM Version Encoding

The raw SM version number stored in profile objects and code object headers uses a packed integer format. This is the value at v4[93] in the code object builder (sub_A465F0):

Encoded ValueSM TargetCode Object VersionMax Threads/CTA
12288sm_300x7000796
20481sm_500xC000C176
24576sm_60----
28673sm_80----
36864sm_900x100010240

The code object builder (sub_A465F0 at 0xA465F0) maps these encoded SM versions to ELF code object version fields and thread-per-CTA limits. The magic number 0x16375564E is written at offset 0 of every code object header, with the SM version at offset +8.

Per-SM Capability Accessors -- sub_609XXX

The 24 functions in the sub_609XXX cluster (range 0x609280--0x609F60, ~1.2KB each) are the per-SM-version capability accessor functions. They are registered into Maps 1 and 2 of the dispatch tables and return architecture-specific values: register file sizes, feature flags, warp geometry, shared memory limits, and similar hardware parameters.

These are the functions that downstream code calls (through the dispatch table) to answer questions like "how many registers does this SM have?" or "is feature X available on this target?"

Profile Layering

Three levels of SM profile information cooperate:

Level 1: sub_607DB0                    // Capability dispatch (7 hash maps)
    |                                  //   -> feature flags, handler functions
    v
Level 2: sub_6765E0                    // Profile objects (name, family, CUDA_ARCH, lto)
    |                                  //   -> identity metadata, isaClass
    v
Level 3: sub_609XXX / sub_60AXXX       // Per-SM accessor functions
                                       //   -> concrete hardware parameter values

Level 1 provides the dispatch infrastructure. Level 2 provides identity metadata for diagnostics and linking. Level 3 provides the actual numeric values that drive register allocation, scheduling, and instruction legality.

Generation-Specific Features

Turing (sm_75)

sm_75 is the default architecture for ptxas v13.0.88, returned by sub_6784B0 when no --gpu-name is specified. Codegen factory value: 24577.

  • Base tensor core (WMMA m16n16k16 f16/f32, m32n8k16, m8n32k16)
  • Integer MMA (IMMA int8/int4), binary MMA (BMMA b1)
  • Base warp-level operations (shfl.sync, vote.sync, match.sync, redux.sync)
  • Named barrier support (bar 0-15 with arrive/red/sync variants)

Ampere (sm_80 -- sm_88)

Codegen factory value: 28673. Shared with Ada Lovelace (sm_89).

  • Extended tensor core: TF32, BF16, FP64 MMA shapes
  • cp.async for asynchronous shared memory copies
  • L2 cache hints on atomic operations
  • createpolicy instructions for cache management
  • 14 additional intrinsics (__cuda_sm80_*: bf16/tf32/s4/s8/b1 MMA, createpolicy)

Ada Lovelace (sm_89)

Codegen factory value: 28673 (same as Ampere). Stored as "Ampere" internally despite being a distinct Ada Lovelace microarchitecture.

  • Same codegen path as Ampere; differentiated through capability flags, not codegen factory
  • 39 additional MMA intrinsics (__cuda_sm_8x_mma_*)

Hopper (sm_90 / sm_90a)

Codegen factory value: 32768. sm_90a is architecture-locked (H100/H200 only).

  • WGMMA (warpgroup MMA async): wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group
  • Cluster operations: barrier.cluster.arrive/wait, distributed shared memory
  • setmaxnreg: Dynamic register allocation limit
  • Cluster special registers: %clusterid, %cluster_ctaid, %cluster_ctarank, etc.
  • 38 sub-byte MMA intrinsics (__cuda_sm_9x_mma_sub_byte_internal_*: s4/u4 sparse)

Blackwell Datacenter (sm_100, sm_103)

Codegen factory value: 36864. Both a and f sub-variants available.

  • tcgen05: 5th-generation tensor core ISA (alloc, dealloc, ld, st, commit, cp, shift, mma) -- a/f sub-variants only
  • tcgen05 guardrails: 8 debug validation functions (phase validity, column allocation, bounds checking)
  • Extended MMA: 10 Blackwell-specific hmma/imma + bit MMA intrinsics (__cuda_sm_10x_*)
  • 11 tcgen05 guardrail trap intrinsics (__cuda_sm10x_tcgen05_guardrail_trap_*)
  • 18 sm_1xx bulk copy intrinsics (__cuda_sm1xx_*: cp.async.bulk.tensor 1D-5D tile/im2col)

Jetson Thor (sm_110)

Codegen factory value: 36864 (same as sm_100). Originally sm_101 before rename. Automotive/robotics SoC.

  • Retains full tcgen05/TMEM hardware on a/f sub-variants
  • Same Blackwell datacenter feature set for tensor operations
  • Differentiated through capability flags for SoC-specific constraints

Blackwell Consumer (sm_120, sm_121)

Codegen factory value: 36864 (same as sm_100). Architecturally a distinct consumer microarchitecture despite sharing the "Blackwell" family string.

  • No tcgen05: The entire tcgen05 ISA is absent on sm_120/121 -- gated by SM version checks
  • Tensor core falls back to HMMA/IMMA/WGMMA inherited from sm_70--sm_90 path
  • sm_120 = RTX 50xx consumer / RTX Blackwell Pro (enterprise)
  • sm_121 = DGX Spark

Diagnostic Strings

StringContextFunction
"Turing"Family name in profile objectsub_6765E0
"Ampere"Family name in profile objectsub_6765E0
"Hopper"Family name in profile objectsub_6765E0
"Blackwell"Family name in profile objectsub_6765E0
"isaClass"Architecture class reference on profilesub_6765E0
"sm_%d"SM name formattingMultiple
"compute_%d"Compute name formattingsub_6765E0
"lto_%d"LTO name formattingsub_6765E0

Function Map

AddressSizeIdentityConfidence
sub_607DB014KBSM capability dispatch -- builds 7 hash maps99%
sub_608D70384BProfile lookup dispatcher80%
sub_608DF0~1KBsm_120 intrinsic table initializer85%
sub_608F20~1.2KBsm_103 handler A (capability accessor)90%
sub_608F50~1.2KBsm_110 handler B (capability accessor)90%
sub_609280--sub_609F60~1.2KB each24 per-SM capability accessor functions (Maps 1+2)90%
sub_609F602.8KBlds128convert option handler ("always", "nonconst", "never")90%
sub_60A2E0--sub_60AD30~1KB each12 per-SM intrinsic table initializers (Map 3)85%
sub_60B0404.5KBStress test options ("stress-maxrregcount", etc.)85%
sub_6765E054KBSM profile object construction (family, CUDA_ARCH, lto)95%
sub_6784B0--Default architecture -- returns sm_7599%
sub_8688F031 linesUniversal HW profile baseline (warp size, regs, barriers, shmem)95%
sub_8E44003.3KBWarp-level HW profile: scheduler partitions, dispatch slots95%
sub_ABF250~600BOccupancy property table: configurable shmem, reg alloc granularity90%
sub_A95DC0~1.8KBExtended HW profile: architecture-specific shmem config85%
sub_A465F014KBCode object header builder (SM version -> ELF fields)88%

Profile Object Layout (1936 bytes)

Every SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object that carries target-specific parameters throughout the compiler. This is the compilation unit's target descriptor -- the single structure that downstream code reads to answer "what hardware am I compiling for?"

Construction Sequence

1. sub_71BDE0(1936, a1)     heap allocate 1936 bytes
2. sub_C1B7A0(profile)      zero-fill + structural defaults (8 SSE blocks, 5 scalars)
3. sub_917990(a3)           overlay: codegen factory default, tail constants
4. sub_60AXXX(a1,a2,a3,a4)  per-SM: codegen factory, shmem base, capability flags

Key Fields -- Explicitly Initialized

These fields receive non-zero values during construction. Offsets are byte offsets from the profile object base pointer. Type column: D=DWORD(4B), Q=QWORD(8B), O=OWORD(16B), B=BYTE.

OffsetTypeDefaultSet BySemantic NameConfidence
+0Q0sub_C1B7A0object_base -- zeroed, likely vtable/class pointer75%
+112Q0x500000000sub_C1B7A0packed_config -- stores DWORD 5 at +112, DWORD 0 at +11685%
+120D5sub_C1B7A0opt_level_default -- initial optimization level or block dimension85%
+132Q0xFFFFFFFFsub_C1B7A0max_register_limit -- -1 sentinel = "no limit"85%
+340D1sub_C1B7A0enable_flag_A -- default-enabled capability85%
+344D0x100000per-SM initshared_memory_config_base -- 1 MB for all SM 75+ targets95%
+348Dper-SMper-SM initcodegen_factory -- ISA generation encoding (gen << 12) | variant99%
+424Q0x100000000sub_C1B7A0packed_enable -- stores DWORD 1 at +424, DWORD 0 at +42890%
+428D0 (cond.)per-SM initconditional_feature_flag -- sm_90+ only: set to 0 when *(a2+355) is true85%
+432Dcomputedper-SM initmodule_base_address -- callback() - 0x6FFFFE64 or -1 if disabled95%
+588D0sub_917990cleared_field -- explicitly re-zeroed; used in 10+ consumer functions90%
+708D1per-SM initenable_flag_D -- universally set to 1 by all per-SM initializers85%
+944D4sub_C1B7A0pipeline_depth -- possibly barrier count or pipeline stage limit85%
+1200Q"NVIDIA"sub_43A400vendor_string_ptr -- pointer to vendor identification string95%
+1208Q(pointer)sub_43A400associated_data_ptr -- assigned from callback result90%
+1216D1sub_43A400vendor_flag -- set to 1 during ELF builder initialization85%
+1385B0 (bits)runtimescheduling_feature_flags -- bitfield, 21+ consumer sites99%
+1536Q1832sub_C1B7A0dynamic_region_offset -- points to tail SSE constant region start90%
+1552Q0runtimepipeline_progress -- monotonically increasing counter (values 0--21); scoreboard guards check 16--1995%
+1584Qnullsub_856sub_C1B7A0sm_backend_vtable_ptr -- THE central polymorphic pointer; initialized to null stub99%
+1684DCLI valueper-SM initcli_option_value -- *(a1+108) passthrough from compiler driver90%
+1840D1per-SM initelf_section_data -- initially 1 (enable), later overwritten with ptr85%
+1880Q1per-SM initbarrier_tracking_ptr -- initially 1, later pointer to scoreboard data95%
+1892D2sub_917990tail_mode_value -- possibly versioning or encoding mode indicator85%
+1912D0 (cond.)per-SM initconditional_clear -- cleared when *(a2+233) is true (debug mode)85%
+1928D1per-SM initoutput_config_value -- compilation output configuration85%

SSE Constant Blocks

10 blocks of 16 bytes each are loaded from .rodata segments via _mm_load_si128. These likely contain per-register-class sizing parameters, pipeline configuration constants, or default opcode table pointers. Exact values require .rodata dump.

OffsetSourceSet By
+184xmmword_20206F0sub_C1B7A0
+280xmmword_2027950sub_C1B7A0
+312xmmword_2027600sub_C1B7A0
+680xmmword_2027620sub_C1B7A0
+696xmmword_22B4ED0sub_C1B7A0
+740xmmword_22B4EE0sub_C1B7A0
+788xmmword_22B4EF0sub_C1B7A0
+816xmmword_22B4F00sub_C1B7A0
+1832xmmword_21DEBA0sub_917990
+1908xmmword_21DEBB0sub_917990

Scheduling Feature Flags (+1385 bitfield)

The byte at offset +1385 is the most heavily accessed bitfield on the profile object (21+ consumer sites in 15+ decompiled functions). Each bit gates a scheduling or codegen behavior.

BitMaskMeaningEvidence
00x01Function has sync barrierssub_792CD0 sets/clears; sub_75F680, sub_75F580 check
10x02(unknown)--
20x04Extended barrier modelsub_796D60 checks jointly with +1382 & 0x20
30x08Scoreboard tracking enabledsub_925670, sub_925510, sub_9253C0 check jointly with +1880 and +1552
40x10(unknown)--
50x20Scheduling feature flagsub_793220, sub_A36360 (scoreboard encoder) check
60x40Temporary analysis flagsub_752E40 sets; sub_77F0D0 clears
70x80Preserved across resetssub_7F7DC0: *(a1+1385) &= 0x80 (all others cleared)

Per-SM Initializer Differences

All 12 per-SM initializers (one per SM family) are structurally identical. Only two fields differ between families.

Fieldsm_75--sm_88sm_89sm_90sm_100--sm_121
+348 codegen_factory24577--28676286770x8000 (32768)36864--36869
+428 conditional_feature_flagnot writtennot writtenwritten (if *(a2+355))written (if *(a2+355))

All other fields (+344, +432, +588, +708, +1684, +1840, +1880, +1912, +1928) are set identically across all SM families.

Critical Distinction: Profile Object vs SM Backend

The profile object (1936 bytes, this layout) is the compilation unit's target descriptor, stored somewhere in the compilation context. The pointer at context+1584 (sm_backend_vtable_ptr) points to a separate polymorphic SM backend object -- not to another profile object. Fields accessed as *(*(ctx+1584) + N) are on the SM backend, not on this profile object.

Commonly accessed SM backend fields (NOT on the 1936-byte profile):

SM Backend OffsetSemanticConsumer Count
+12arch_class_id (4=Maxwell, 10=Grace, 11=NVLink-Volta+)3+
+372codegen_factory (THE feature-gating value, 37+ sites)37+
+1037hw_capability_flags bitfield (SFU precision, barriers)20+
+1312vtable dispatch for predicate marking capability2+
+1320vtable function slot for optimization dispatch2+
+1417Grace/ARM architecture flag (bit 7)1+
+1418NVLink-capable flag (bit 0)1+

The profile's own +348 stores the codegen factory value at construction time. The SM backend's +372 is where all downstream code reads it. These are the same numeric value stored in two different objects.

Object Region Map

Offset Range   Size    Content
-----------    ----    -------
[0..111]       112B    Object header, vtable chain, zeroed bulk
[112..159]     48B     Config scalars (opt level, register limit)
[160..343]     184B    Zeroed + 3 SSE constant blocks (184, 280, 312)
[344..587]     244B    Target identity (shmem base, codegen factory, enable flags)
[588..879]     292B    Capability fields, 4 SSE constant blocks (680, 696, 740, 788)
[880..1063]    184B    Scheduling config, barrier config (+944=4 default)
[1064..1279]   216B    Extended config, vendor metadata (+1200="NVIDIA")
[1280..1535]   256B    Architecture feature flags (+1385 bitfield, +1417/+1418)
[1536..1663]   128B    Dynamic region pointer, phase counter, sm_backend ptr (+1584)
[1664..1831]   168B    CLI passthrough, ELF section data, barrier tracking ptr
[1832..1935]   104B    Tail region: 2 SSE constant blocks, mode value (+1892=2)

Cross-References