Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GPU Target Architecture

45 SM variants across 6 generations. Processor table at qword_502A920 (stride-2 layout: name + PTX version). Architecture gating throughout the binary controls feature availability.

SM tableqword_502A920 (45 entries, ctor_605 at 0x584510)
Arch detectionsub_95EB40 (38KB, CLI -> 3-column mapping)
NVVM arch enumsub_CD09E0 (14.5KB, NVVM_ARCH_* strings)
EDG arch gatessub_60E7C0 (~60 feature flags based on SM version)
Backend subtargetNVPTXSubtarget (feature offsets at +2498, +2584, +2843)
Target triplesnvptx64-nvidia-cuda, nvsass-nvidia-directx, nvsass-nvidia-spirv

Per-SM Deep Dives:

Complete SM Table

SM__CUDA_ARCHPTX VerGenerationSuffixStatusDeep Dive
sm_757505Turing--Productionsm70-89
sm_808005Ampere--Productionsm70-89
sm_828205Ampere--Undocumentedsm70-89
sm_868605Ampere--Productionsm70-89
sm_878705Ampere--Productionsm70-89
sm_888805Ada--Undocumentedsm70-89
sm_898905Ada--Productionsm70-89
sm_909005Hopper--Productionsm90
sm_90a9006HopperaProductionsm90
sm_10010006Blackwell--Productionsm100
sm_100a10007BlackwellaProductionsm100
sm_100f10007BlackwellfProductionsm100
sm_10110106Jetson Thor (pre-rename)--Undocumentedsm100
sm_101a10107Jetson Thor (pre-rename)aUndocumentedsm100
sm_101f10107Jetson Thor (pre-rename)fUndocumentedsm100
sm_10210206Blackwell--Undocumentedsm100
sm_102a10207BlackwellaUndocumentedsm100
sm_102f10207BlackwellfUndocumentedsm100
sm_10310306Blackwell--Productionsm100
sm_103a10307BlackwellaProductionsm100
sm_103f10307BlackwellfProductionsm100
sm_11011006Jetson Thor--Productionsm120
sm_110a11007Jetson ThoraProductionsm120
sm_110f11007Jetson ThorfProductionsm120
sm_12012006Blackwell (sm120)--Productionsm120
sm_120a12007Blackwell (sm120)aProductionsm120
sm_120f12007Blackwell (sm120)fProductionsm120
sm_12112106Blackwell (sm120)--Productionsm120
sm_121a12107Blackwell (sm120)aProductionsm120
sm_121f12107Blackwell (sm120)fProductionsm120

Legacy architectures also present in the table but not in the CLI mapping: sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_73.

Suffix Meanings

SuffixMeaningPTX VersionDetail
(none)Base feature set5 (legacy) or 6 (sm_100+)All architectures; sm70-89 has no suffix-gated logic
aAccelerated / advanced features6 (sm_90a) or 7 (sm_100a+)sm_90a enables one EDG gate (sm90); sm_100a+ enables tcgen05 arch-conditional path (sm100)
fForward-compatible feature set7Implies a; never read by cicc logic (sm120); reserved for ptxas

PTX Version Mapping

PTX VersionSM RangeNotes
5sm_20 through sm_90 (legacy/base)All pre-Blackwell base variants
6sm_90a, sm_100/101/102/103/110/120/121 (base)sm_90a is the sole pre-Blackwell PTX 6 target (sm90)
7sm_100a/f through sm_121a/f (extended features)Required for tcgen05 arch-conditional intrinsics (sm100)

Architecture Gating

Four subsystems cooperate to configure feature flags from the SM version. The master configurator sub_60E7C0 runs last and has the highest non-CLI priority. For the complete flag table per tier, see SM 70-89 Complete sub_60E7C0 Flag Table.

Feature Configuration Pipeline

CLI parser (sub_617BD0)              Sets byte_4CF8* override flags
    |
sub_60DFC0 (secondary)               Sets unk_4D041B8 at sm_80+ (__VA_OPT__)
    |
sub_60D650 (optimization level)      ~109 flags from -O level
    |
sub_60E7C0 (master SM configurator)  ~60 flags via SM threshold comparisons
    |--- sub_60E530 (tertiary)        Supplementary progressive unlocks
    |
sub_982C80 (NVPTX subtarget)         224-byte bitfield for LLVM backend

Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag. See CLI Flag Inventory for the complete CLI flag-to-pipeline routing and Optimization Levels for per-level flag differences.

EDG-Level Gates -- sub_60E7C0

Sets ~60 unk_4D04* feature flags based on SM version thresholds. Each flag is gated by a byte_4CF8* user-override check.

ThresholdSM BoundaryFeatures EnabledDetail
> 30399sm_75 (Turing)Base CUDA features, dynamic parallelismsm70-89 Turing
> 40000sm_80 (Ampere)C++20 __VA_OPT__, L2 cache hints, extended atomicssm70-89 Ampere
> 89999sm_90 (Hopper)Cluster ops, TMA, setmaxnreg, WGMMA fencesm90 Feature Flags
> 109999sm_100 (Blackwell)tcgen05, match instruction, dword_4D041ACsm100 Feature Flags
> 119999sm_120unk_4D047BC disabled, unk_4D0428Csm120 Feature Flags

Backend Subtarget Feature Offsets (NVPTXSubtarget)

OffsetPurposeStrideDetail
+2498Type legality flags (per MVT)259 bytesSee Type Legalization
+2584Float legality flags (per MVT)259 bytesSee Type Legalization
+2843Integer type support flag1 byte--
+2870Branch distance flag1 byteSee Block Placement
+2871Jump table eligibility flag1 byteSee BranchFolding

For the complete NVPTXSubtarget analysis, see NVPTX Target Infrastructure.

Intrinsic Verifier Architecture Gates -- sub_2C7B6A0

The NVVMIntrinsicVerifier (143KB) gates intrinsics by SM version. For the complete three-layer verification architecture, see NVVM IR Verifier.

GateSMIntrinsicsDetail
sm_72 (Volta)Convergent branch intrinsics, some atomic opssm70-89 Volta
sm_75 (Turing)Conversion type intrinsicssm70-89 Turing
sm_89 (Ada)Specific intrinsicssm70-89 Ada
sm_90 (Hopper)Cluster dimensions, TMA, WGMMAsm90 TMA, sm90 WGMMA
sm_100+ (Blackwell).offset.bindless intrinsics, tcgen05sm100 tcgen05, sm120 .offset.bindless

Feature Gate Matrix

This matrix shows which major compiler features are available at each SM tier. Each cell links to the detailed discussion in the per-SM deep-dive page.

Tensor Core / MMA Instructions

Featuresm_70-75sm_80-89sm_90/90asm_100/103sm_110sm_120/121
HMMA m16n16k16 (f16)YesYesYesYesYesYes
IMMA int8/int4, BMMAsm_75+YesYesYesYesYes
DMMA fp64, TF32, BF16--sm_80+YesYesYesYes
WGMMA async (f16/bf16/tf32/f8)----YesYesYes--
tcgen05.mma (MX formats)------a/f onlya/f onlyNo
mma.sync.block_scale----------Future

See Tensor / MMA Builtins for the per-builtin ID reference and Tensor / MMA Codegen for the code generation pipeline.

Memory and Synchronization

Featuresm_70-75sm_80-89sm_90/90asm_100/103sm_110sm_120/121
Full atomic memory orderingsm_70+YesYesYesYesYes
128-bit atomicssm_70+YesYesYesYesYes
L2 cache hint atomics--sm_80+YesYesYesYes
Cluster scope atomics----YesYesYesYes
cp.async--sm_80+YesYesYesYes
TMA (tensor memory access)----YesYesYesYes
TMA 2CTA mode, Im2Col_W------sm_100+sm_100+sm_100+
setmaxnreg----YesYesYesYes
fence.sc.cluster----YesYesYesYes

See Atomics Builtins for atomic PTX generation detail and Barriers & Sync for barrier builtins.

Thread Block Clusters

Featuresm_70-89sm_90/90asm_100+
__cluster_dims__ attributeDiagnostic 3687YesYes
__launch_bounds__ 3rd paramDiagnostic 3704YesYes
__block_size__ 5th argDiagnostic 3790YesYes
Cluster special registers (15)--YesYes
barrier.cluster.arrive/wait--YesYes
Cluster query builtins (9)--YesYes
Distributed shared memory--YesYes
.blocksareclusters directive--YesYes

Numeric Formats

FormatFirst AvailableGate LocationDetail
f16, f32, f64All--Standard types
bf16 (bfloat16)sm_80+Ampere tensor coreTensor core and cvt
tf32 (TensorFloat-32)sm_80+Ampere tensor coreTensor core only
fp8 e4m3, e5m2sm_90+WGMMAcvt_packfloat cases 2-3
fp6 e2m3, e3m2sm_100+cvt_packfloatArch-conditional only
fp4 e2m1sm_100+cvt_packfloatArch-conditional only
ue8m0 (scale factor)sm_100+cvt_packfloatBoth arch and family-conditional
MX formats (mxf4, mxf8f6f4, mxf4nvf4)sm_100+tcgen05.mmatcgen05 a/f sub-variants only

Texture and Surface

Featuresm_70-89sm_90sm_100/103sm_120/121
Standard texture intrinsicsYesYesYesYes
.offset.bindless intrinsics (68 variants)------sm_120+
f16 texture element typesLimited (builtin 3811 only)LimitedLimitedFull support

See Surface & Texture Builtins for the tex_surf_handler dispatch algorithm.

EDG Frontend Feature Flags

FeatureThresholdFlagDetail
C++17 feature gates (EDG)sm_70+unk_4D041DC, unk_4D04858, unk_4D041ECsm70-89 Flag Table
C++20 __VA_OPT__sm_80+unk_4D041B8sm70-89 sub_60DFC0
C++23 extended float suffixessm_70+unk_4D0428Csm70-89 Tertiary Cascade
C++20 feature gatessm_90+unk_4D043D0, unk_4D041B0, unk_4D04814sm90 Feature Flags
Blackwell extended featuressm_100+unk_4D04184, dword_4D041ACsm100 Feature Flags

See EDG 6.6 Frontend for the 737-define configuration system.

tcgen05 Sub-Variant Access Table

The tcgen05 instruction family uses a two-tier gating system unique to Blackwell. Base variants (sm_100, sm_103, sm_110) are excluded; only a and f sub-variants pass the bitmask check.

SmVersionTargettcgen05Detail
1001sm_100aAllowedsm100 Arch-Conditional Gate
1002sm_100fAllowedsm100 Arch-Conditional Gate
1031sm_103aAllowedsm100 Arch-Conditional Gate
1032sm_103fAllowedsm100 Arch-Conditional Gate
1101sm_110aAllowedsm120: Jetson Thor
1102sm_110fAllowedsm120: Jetson Thor
1000, 1030, 1100base variantsBlockedBitmask 0xC0000C03 rejects; see sm100
1200-1212all sm_120/121Blockedv-1101 > 1; see sm120 No tcgen05

Generation-Specific Features

Turing (sm_75)

sm_75 is the default architecture for cicc v13.0, hardcoded as "compute_75" in sub_900130 and sub_125FB30.

Full detail: SM 70-89 (Volta through Ada)

Ampere (sm_80-sm_89)

  • L2::cache_hint on atomic operations (sub_21E6420) -- see Atomics Builtins
  • Extended tensor core shapes (tf32, bf16) -- see Tensor / MMA Builtins
  • Async copy (cp.async) -- see SM 70-89: Ampere
  • C++20 __VA_OPT__ support -- the sole differentiator between sm_75 and sm_80+ in sub_60E7C0/sub_60DFC0

Full detail: SM 70-89 (Volta through Ada)

Hopper (sm_90/90a)

  • Cluster operations: barrier.cluster.arrive/wait, fence.sc.cluster -- see Cluster Barriers
  • Cluster registers: %cluster_ctarank, %clusterid.x/y/z, %is_explicit_cluster -- see Cluster Special Registers
  • Kernel attributes: .blocksareclusters, .maxclusterrank, .reqnctapercluster, .cluster_dim -- see PTX Directives
  • setmaxnreg: Dynamic register allocation limit (sub_21EA5F0) -- see setmaxnreg
  • TMA: Tensor Memory Access with Im2Col, dimension validation, 2CTA mode -- see TMA
  • WGMMA: Warpgroup MMA async (f16, bf16, tf32, f8) -- see WGMMA
  • Distributed shared memory: .shared::cluster qualifier for cross-CTA access -- see DSMEM
  • Mbarrier extensions: DMA fence/arrive/wait for TMA coordination -- see Mbarrier

Full detail: SM 90 -- Hopper

Blackwell Datacenter (sm_100-sm_103)

  • tcgen05: Next-gen tensor core instruction set (scaleD, transA, negA, negB at sub_21E8CD0) -- see tcgen05
  • Arch-conditional vs. family-conditional gating: Two-tier feature system for tcgen05 sub-instructions -- see Gating
  • match instruction: Architecture-gated ("match instruction not supported on this architecture!") -- see sm100
  • Extended MMA shapes: m16n8k256 with MX format support
  • .offset.bindless intrinsics -- gated at sm_120+, NOT sm_100 (see sm120 .offset.bindless)
  • cvt_packfloat extended types: FP4, FP6, MX formats -- see cvt_packfloat

Full detail: SM 100 -- Blackwell Datacenter

Jetson Thor (sm_110)

sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename). It retains full tcgen05/TMEM hardware on a/f sub-variants. The sm_110 section is documented on the sm_120 page because the two are often compared.

Full detail: SM 120 -- Jetson Thor section

Blackwell Consumer (sm_120, sm_121)

  • No tcgen05: The entire tcgen05 ISA is rejected by cicc for all sm_120/121 variants -- see No tcgen05
  • .offset.bindless texture intrinsics (68 variants) -- see .offset.bindless
  • 16-bit texture element types -- see f16 Texture
  • mma.sync.block_scale: Present in upstream LLVM 22 but NOT emitted by cicc v13.0 -- see block_scale
  • Tensor core falls back to HMMA/IMMA inherited from sm_70-sm_90 path

Full detail: SM 120 -- Blackwell Consumer

NVVM Container Architecture Enum -- sub_CD09E0

The NVVM container format uses an architecture enumeration. See NVVM Container for the complete tag inventory.

Enum StringImplied SMDetail
NVVM_ARCH_BLACKWELL_10_0sm_100sm100
NVVM_ARCH_BLACKWELL_10_1sm_101Undocumented
NVVM_ARCH_BLACKWELL_10_3sm_103sm100
NVVM_ARCH_BLACKWELL_11_0sm_110sm120: Jetson Thor
NVVM_ARCH_BLACKWELL_12_0sm_120sm120
NVVM_ARCH_BLACKWELL_12_1sm_121sm120
NVVM_ARCH_HOPPER_9_0sm_90sm90
NVVM_ARCH_ADA_8_9sm_89sm70-89
NVVM_ARCH_AMPERE_8_0 through 8_8sm_80-sm_88sm70-89
NVVM_ARCH_HW_SM_5_0 through 10_4sm_50-sm_104Hardware SM enum

Notable: NVVM_ARCH_HW_SM_10_4 (sm_104) and NVVM_ARCH_BLACKWELL_11_0 are not publicly documented. NVIDIA's internal naming uses "BLACKWELL" for all sm_100-sm_121 variants, even though sm_110 is marketed as Jetson Thor and sm_120/121 are a distinct consumer microarchitecture (RTX 50xx). See SM 120: Architecture Identity for the "SM 10.4" internal designation.

Target Triples

TriplePurposeDetail
nvptx64-nvidia-cudaStandard 64-bit CUDA compilationDefault; see NVPTX Target Infrastructure
nvptx-nvidia-cuda32-bit CUDA compilationLegacy
nvptx64-nvidia-nvclOpenCL target--
nvsass-nvidia-cudaSASS backend (native assembly)--
nvsass-nvidia-directxDirectX SASS backendDiscovered in sub_2C80C90; see NVVM IR Verifier
nvsass-nvidia-spirvSPIR-V SASS backendDiscovered in sub_2C80C90

The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) reveal that NVIDIA's SASS-level backend supports DirectX and SPIR-V targets alongside traditional CUDA and OpenCL.

Data Layout Strings

ModeLayoutNotes
64-bit + sharede-p:64:64:64-p3:32:32:32-i1:8:8-...-n16:32:64p3:32:32:32 = 32-bit shared mem pointers
64-bite-p:64:64:64-i1:8:8-...-n16:32:64No shared memory specialization
32-bite-p:32:32:32-i1:8:8-...-n16:32:6432-bit mode

Address space 3 (shared memory) uses 32-bit pointers even in 64-bit mode, controlled by nvptx-short-ptr and nvptx-32-bit-smem flags. See Address Spaces for the complete address space reference.

SM Version Encoding

Two parallel version tracking systems coexist in the binary:

  • qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 89999 for pre-Hopper). See SM 70-89: SM Version Encoding for full detail.

  • unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic. See SM 70-89: unk_4D045E8 Frontend Gates for the complete gate table.

Cross-References