GPU Target Architecture
45 SM variants across 6 generations. Processor table at qword_502A920 (stride-2 layout: name + PTX version). Architecture gating throughout the binary controls feature availability.
| SM table | qword_502A920 (45 entries, ctor_605 at 0x584510) |
| Arch detection | sub_95EB40 (38KB, CLI -> 3-column mapping) |
| NVVM arch enum | sub_CD09E0 (14.5KB, NVVM_ARCH_* strings) |
| EDG arch gates | sub_60E7C0 (~60 feature flags based on SM version) |
| Backend subtarget | NVPTXSubtarget (feature offsets at +2498, +2584, +2843) |
| Target triples | nvptx64-nvidia-cuda, nvsass-nvidia-directx, nvsass-nvidia-spirv |
Per-SM Deep Dives:
- SM 70-89 (Volta through Ada Lovelace) -- Feature configuration call order, complete
sub_60E7C0flag table, atomic lowering, cumulative flag profiles - SM 90 -- Hopper -- Thread block clusters, TMA descriptor format and lowering, WGMMA, setmaxnreg, distributed shared memory
- SM 100 -- Blackwell Datacenter -- tcgen05 tensor core ISA, arch-conditional vs. family-conditional gating, cvt_packfloat FP4/FP6/MX formats
- SM 120 -- Blackwell Consumer -- No tcgen05, .offset.bindless texture intrinsics, f16 texture support, mma.sync.block_scale (future)
Complete SM Table
| SM | __CUDA_ARCH | PTX Ver | Generation | Suffix | Status | Deep Dive |
|---|---|---|---|---|---|---|
sm_75 | 750 | 5 | Turing | -- | Production | sm70-89 |
sm_80 | 800 | 5 | Ampere | -- | Production | sm70-89 |
sm_82 | 820 | 5 | Ampere | -- | Undocumented | sm70-89 |
sm_86 | 860 | 5 | Ampere | -- | Production | sm70-89 |
sm_87 | 870 | 5 | Ampere | -- | Production | sm70-89 |
sm_88 | 880 | 5 | Ada | -- | Undocumented | sm70-89 |
sm_89 | 890 | 5 | Ada | -- | Production | sm70-89 |
sm_90 | 900 | 5 | Hopper | -- | Production | sm90 |
sm_90a | 900 | 6 | Hopper | a | Production | sm90 |
sm_100 | 1000 | 6 | Blackwell | -- | Production | sm100 |
sm_100a | 1000 | 7 | Blackwell | a | Production | sm100 |
sm_100f | 1000 | 7 | Blackwell | f | Production | sm100 |
sm_101 | 1010 | 6 | Jetson Thor (pre-rename) | -- | Undocumented | sm100 |
sm_101a | 1010 | 7 | Jetson Thor (pre-rename) | a | Undocumented | sm100 |
sm_101f | 1010 | 7 | Jetson Thor (pre-rename) | f | Undocumented | sm100 |
sm_102 | 1020 | 6 | Blackwell | -- | Undocumented | sm100 |
sm_102a | 1020 | 7 | Blackwell | a | Undocumented | sm100 |
sm_102f | 1020 | 7 | Blackwell | f | Undocumented | sm100 |
sm_103 | 1030 | 6 | Blackwell | -- | Production | sm100 |
sm_103a | 1030 | 7 | Blackwell | a | Production | sm100 |
sm_103f | 1030 | 7 | Blackwell | f | Production | sm100 |
sm_110 | 1100 | 6 | Jetson Thor | -- | Production | sm120 |
sm_110a | 1100 | 7 | Jetson Thor | a | Production | sm120 |
sm_110f | 1100 | 7 | Jetson Thor | f | Production | sm120 |
sm_120 | 1200 | 6 | Blackwell (sm120) | -- | Production | sm120 |
sm_120a | 1200 | 7 | Blackwell (sm120) | a | Production | sm120 |
sm_120f | 1200 | 7 | Blackwell (sm120) | f | Production | sm120 |
sm_121 | 1210 | 6 | Blackwell (sm120) | -- | Production | sm120 |
sm_121a | 1210 | 7 | Blackwell (sm120) | a | Production | sm120 |
sm_121f | 1210 | 7 | Blackwell (sm120) | f | Production | sm120 |
Legacy architectures also present in the table but not in the CLI mapping: sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_73.
Suffix Meanings
| Suffix | Meaning | PTX Version | Detail |
|---|---|---|---|
| (none) | Base feature set | 5 (legacy) or 6 (sm_100+) | All architectures; sm70-89 has no suffix-gated logic |
a | Accelerated / advanced features | 6 (sm_90a) or 7 (sm_100a+) | sm_90a enables one EDG gate (sm90); sm_100a+ enables tcgen05 arch-conditional path (sm100) |
f | Forward-compatible feature set | 7 | Implies a; never read by cicc logic (sm120); reserved for ptxas |
PTX Version Mapping
| PTX Version | SM Range | Notes |
|---|---|---|
| 5 | sm_20 through sm_90 (legacy/base) | All pre-Blackwell base variants |
| 6 | sm_90a, sm_100/101/102/103/110/120/121 (base) | sm_90a is the sole pre-Blackwell PTX 6 target (sm90) |
| 7 | sm_100a/f through sm_121a/f (extended features) | Required for tcgen05 arch-conditional intrinsics (sm100) |
Architecture Gating
Four subsystems cooperate to configure feature flags from the SM version. The master configurator sub_60E7C0 runs last and has the highest non-CLI priority. For the complete flag table per tier, see SM 70-89 Complete sub_60E7C0 Flag Table.
Feature Configuration Pipeline
CLI parser (sub_617BD0) Sets byte_4CF8* override flags
|
sub_60DFC0 (secondary) Sets unk_4D041B8 at sm_80+ (__VA_OPT__)
|
sub_60D650 (optimization level) ~109 flags from -O level
|
sub_60E7C0 (master SM configurator) ~60 flags via SM threshold comparisons
|--- sub_60E530 (tertiary) Supplementary progressive unlocks
|
sub_982C80 (NVPTX subtarget) 224-byte bitfield for LLVM backend
Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag. See CLI Flag Inventory for the complete CLI flag-to-pipeline routing and Optimization Levels for per-level flag differences.
EDG-Level Gates -- sub_60E7C0
Sets ~60 unk_4D04* feature flags based on SM version thresholds. Each flag is gated by a byte_4CF8* user-override check.
| Threshold | SM Boundary | Features Enabled | Detail |
|---|---|---|---|
| > 30399 | sm_75 (Turing) | Base CUDA features, dynamic parallelism | sm70-89 Turing |
| > 40000 | sm_80 (Ampere) | C++20 __VA_OPT__, L2 cache hints, extended atomics | sm70-89 Ampere |
| > 89999 | sm_90 (Hopper) | Cluster ops, TMA, setmaxnreg, WGMMA fence | sm90 Feature Flags |
| > 109999 | sm_100 (Blackwell) | tcgen05, match instruction, dword_4D041AC | sm100 Feature Flags |
| > 119999 | sm_120 | unk_4D047BC disabled, unk_4D0428C | sm120 Feature Flags |
Backend Subtarget Feature Offsets (NVPTXSubtarget)
| Offset | Purpose | Stride | Detail |
|---|---|---|---|
| +2498 | Type legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2584 | Float legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2843 | Integer type support flag | 1 byte | -- |
| +2870 | Branch distance flag | 1 byte | See Block Placement |
| +2871 | Jump table eligibility flag | 1 byte | See BranchFolding |
For the complete NVPTXSubtarget analysis, see NVPTX Target Infrastructure.
Intrinsic Verifier Architecture Gates -- sub_2C7B6A0
The NVVMIntrinsicVerifier (143KB) gates intrinsics by SM version. For the complete three-layer verification architecture, see NVVM IR Verifier.
| Gate | SM | Intrinsics | Detail |
|---|---|---|---|
| sm_72 (Volta) | Convergent branch intrinsics, some atomic ops | sm70-89 Volta | |
| sm_75 (Turing) | Conversion type intrinsics | sm70-89 Turing | |
| sm_89 (Ada) | Specific intrinsics | sm70-89 Ada | |
| sm_90 (Hopper) | Cluster dimensions, TMA, WGMMA | sm90 TMA, sm90 WGMMA | |
| sm_100+ (Blackwell) | .offset.bindless intrinsics, tcgen05 | sm100 tcgen05, sm120 .offset.bindless |
Feature Gate Matrix
This matrix shows which major compiler features are available at each SM tier. Each cell links to the detailed discussion in the per-SM deep-dive page.
Tensor Core / MMA Instructions
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| HMMA m16n16k16 (f16) | Yes | Yes | Yes | Yes | Yes | Yes |
| IMMA int8/int4, BMMA | sm_75+ | Yes | Yes | Yes | Yes | Yes |
| DMMA fp64, TF32, BF16 | -- | sm_80+ | Yes | Yes | Yes | Yes |
| WGMMA async (f16/bf16/tf32/f8) | -- | -- | Yes | Yes | Yes | -- |
| tcgen05.mma (MX formats) | -- | -- | -- | a/f only | a/f only | No |
| mma.sync.block_scale | -- | -- | -- | -- | -- | Future |
See Tensor / MMA Builtins for the per-builtin ID reference and Tensor / MMA Codegen for the code generation pipeline.
Memory and Synchronization
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| Full atomic memory ordering | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| 128-bit atomics | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| L2 cache hint atomics | -- | sm_80+ | Yes | Yes | Yes | Yes |
| Cluster scope atomics | -- | -- | Yes | Yes | Yes | Yes |
| cp.async | -- | sm_80+ | Yes | Yes | Yes | Yes |
| TMA (tensor memory access) | -- | -- | Yes | Yes | Yes | Yes |
| TMA 2CTA mode, Im2Col_W | -- | -- | -- | sm_100+ | sm_100+ | sm_100+ |
| setmaxnreg | -- | -- | Yes | Yes | Yes | Yes |
| fence.sc.cluster | -- | -- | Yes | Yes | Yes | Yes |
See Atomics Builtins for atomic PTX generation detail and Barriers & Sync for barrier builtins.
Thread Block Clusters
| Feature | sm_70-89 | sm_90/90a | sm_100+ |
|---|---|---|---|
__cluster_dims__ attribute | Diagnostic 3687 | Yes | Yes |
__launch_bounds__ 3rd param | Diagnostic 3704 | Yes | Yes |
__block_size__ 5th arg | Diagnostic 3790 | Yes | Yes |
| Cluster special registers (15) | -- | Yes | Yes |
| barrier.cluster.arrive/wait | -- | Yes | Yes |
| Cluster query builtins (9) | -- | Yes | Yes |
| Distributed shared memory | -- | Yes | Yes |
.blocksareclusters directive | -- | Yes | Yes |
Numeric Formats
| Format | First Available | Gate Location | Detail |
|---|---|---|---|
| f16, f32, f64 | All | -- | Standard types |
| bf16 (bfloat16) | sm_80+ | Ampere tensor core | Tensor core and cvt |
| tf32 (TensorFloat-32) | sm_80+ | Ampere tensor core | Tensor core only |
| fp8 e4m3, e5m2 | sm_90+ | WGMMA | cvt_packfloat cases 2-3 |
| fp6 e2m3, e3m2 | sm_100+ | cvt_packfloat | Arch-conditional only |
| fp4 e2m1 | sm_100+ | cvt_packfloat | Arch-conditional only |
| ue8m0 (scale factor) | sm_100+ | cvt_packfloat | Both arch and family-conditional |
| MX formats (mxf4, mxf8f6f4, mxf4nvf4) | sm_100+ | tcgen05.mma | tcgen05 a/f sub-variants only |
Texture and Surface
| Feature | sm_70-89 | sm_90 | sm_100/103 | sm_120/121 |
|---|---|---|---|---|
| Standard texture intrinsics | Yes | Yes | Yes | Yes |
.offset.bindless intrinsics (68 variants) | -- | -- | -- | sm_120+ |
| f16 texture element types | Limited (builtin 3811 only) | Limited | Limited | Full support |
See Surface & Texture Builtins for the tex_surf_handler dispatch algorithm.
EDG Frontend Feature Flags
| Feature | Threshold | Flag | Detail |
|---|---|---|---|
| C++17 feature gates (EDG) | sm_70+ | unk_4D041DC, unk_4D04858, unk_4D041EC | sm70-89 Flag Table |
C++20 __VA_OPT__ | sm_80+ | unk_4D041B8 | sm70-89 sub_60DFC0 |
| C++23 extended float suffixes | sm_70+ | unk_4D0428C | sm70-89 Tertiary Cascade |
| C++20 feature gates | sm_90+ | unk_4D043D0, unk_4D041B0, unk_4D04814 | sm90 Feature Flags |
| Blackwell extended features | sm_100+ | unk_4D04184, dword_4D041AC | sm100 Feature Flags |
See EDG 6.6 Frontend for the 737-define configuration system.
tcgen05 Sub-Variant Access Table
The tcgen05 instruction family uses a two-tier gating system unique to Blackwell. Base variants (sm_100, sm_103, sm_110) are excluded; only a and f sub-variants pass the bitmask check.
| SmVersion | Target | tcgen05 | Detail |
|---|---|---|---|
| 1001 | sm_100a | Allowed | sm100 Arch-Conditional Gate |
| 1002 | sm_100f | Allowed | sm100 Arch-Conditional Gate |
| 1031 | sm_103a | Allowed | sm100 Arch-Conditional Gate |
| 1032 | sm_103f | Allowed | sm100 Arch-Conditional Gate |
| 1101 | sm_110a | Allowed | sm120: Jetson Thor |
| 1102 | sm_110f | Allowed | sm120: Jetson Thor |
| 1000, 1030, 1100 | base variants | Blocked | Bitmask 0xC0000C03 rejects; see sm100 |
| 1200-1212 | all sm_120/121 | Blocked | v-1101 > 1; see sm120 No tcgen05 |
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for cicc v13.0, hardcoded as "compute_75" in sub_900130 and sub_125FB30.
- Base tensor core (HMMA m16n16k16) -- see Tensor / MMA Builtins
- Conversion intrinsics
- Baseline for cicc v13.0 (default architecture) -- see CLI Flag Inventory
Full detail: SM 70-89 (Volta through Ada)
Ampere (sm_80-sm_89)
L2::cache_hinton atomic operations (sub_21E6420) -- see Atomics Builtins- Extended tensor core shapes (tf32, bf16) -- see Tensor / MMA Builtins
- Async copy (
cp.async) -- see SM 70-89: Ampere - C++20
__VA_OPT__support -- the sole differentiator between sm_75 and sm_80+ insub_60E7C0/sub_60DFC0
Full detail: SM 70-89 (Volta through Ada)
Hopper (sm_90/90a)
- Cluster operations:
barrier.cluster.arrive/wait,fence.sc.cluster-- see Cluster Barriers - Cluster registers:
%cluster_ctarank,%clusterid.x/y/z,%is_explicit_cluster-- see Cluster Special Registers - Kernel attributes:
.blocksareclusters,.maxclusterrank,.reqnctapercluster,.cluster_dim-- see PTX Directives - setmaxnreg: Dynamic register allocation limit (
sub_21EA5F0) -- see setmaxnreg - TMA: Tensor Memory Access with Im2Col, dimension validation, 2CTA mode -- see TMA
- WGMMA: Warpgroup MMA async (f16, bf16, tf32, f8) -- see WGMMA
- Distributed shared memory:
.shared::clusterqualifier for cross-CTA access -- see DSMEM - Mbarrier extensions: DMA fence/arrive/wait for TMA coordination -- see Mbarrier
Full detail: SM 90 -- Hopper
Blackwell Datacenter (sm_100-sm_103)
- tcgen05: Next-gen tensor core instruction set (
scaleD,transA,negA,negBatsub_21E8CD0) -- see tcgen05 - Arch-conditional vs. family-conditional gating: Two-tier feature system for tcgen05 sub-instructions -- see Gating
- match instruction: Architecture-gated (
"match instruction not supported on this architecture!") -- see sm100 - Extended MMA shapes: m16n8k256 with MX format support
.offset.bindlessintrinsics -- gated at sm_120+, NOT sm_100 (see sm120 .offset.bindless)- cvt_packfloat extended types: FP4, FP6, MX formats -- see cvt_packfloat
Full detail: SM 100 -- Blackwell Datacenter
Jetson Thor (sm_110)
sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename). It retains full tcgen05/TMEM hardware on a/f sub-variants. The sm_110 section is documented on the sm_120 page because the two are often compared.
Full detail: SM 120 -- Jetson Thor section
Blackwell Consumer (sm_120, sm_121)
- No tcgen05: The entire tcgen05 ISA is rejected by cicc for all sm_120/121 variants -- see No tcgen05
.offset.bindlesstexture intrinsics (68 variants) -- see .offset.bindless- 16-bit texture element types -- see f16 Texture
- mma.sync.block_scale: Present in upstream LLVM 22 but NOT emitted by cicc v13.0 -- see block_scale
- Tensor core falls back to HMMA/IMMA inherited from sm_70-sm_90 path
Full detail: SM 120 -- Blackwell Consumer
NVVM Container Architecture Enum -- sub_CD09E0
The NVVM container format uses an architecture enumeration. See NVVM Container for the complete tag inventory.
| Enum String | Implied SM | Detail |
|---|---|---|
NVVM_ARCH_BLACKWELL_10_0 | sm_100 | sm100 |
NVVM_ARCH_BLACKWELL_10_1 | sm_101 | Undocumented |
NVVM_ARCH_BLACKWELL_10_3 | sm_103 | sm100 |
NVVM_ARCH_BLACKWELL_11_0 | sm_110 | sm120: Jetson Thor |
NVVM_ARCH_BLACKWELL_12_0 | sm_120 | sm120 |
NVVM_ARCH_BLACKWELL_12_1 | sm_121 | sm120 |
NVVM_ARCH_HOPPER_9_0 | sm_90 | sm90 |
NVVM_ARCH_ADA_8_9 | sm_89 | sm70-89 |
NVVM_ARCH_AMPERE_8_0 through 8_8 | sm_80-sm_88 | sm70-89 |
NVVM_ARCH_HW_SM_5_0 through 10_4 | sm_50-sm_104 | Hardware SM enum |
Notable: NVVM_ARCH_HW_SM_10_4 (sm_104) and NVVM_ARCH_BLACKWELL_11_0 are not publicly documented. NVIDIA's internal naming uses "BLACKWELL" for all sm_100-sm_121 variants, even though sm_110 is marketed as Jetson Thor and sm_120/121 are a distinct consumer microarchitecture (RTX 50xx). See SM 120: Architecture Identity for the "SM 10.4" internal designation.
Target Triples
| Triple | Purpose | Detail |
|---|---|---|
nvptx64-nvidia-cuda | Standard 64-bit CUDA compilation | Default; see NVPTX Target Infrastructure |
nvptx-nvidia-cuda | 32-bit CUDA compilation | Legacy |
nvptx64-nvidia-nvcl | OpenCL target | -- |
nvsass-nvidia-cuda | SASS backend (native assembly) | -- |
nvsass-nvidia-directx | DirectX SASS backend | Discovered in sub_2C80C90; see NVVM IR Verifier |
nvsass-nvidia-spirv | SPIR-V SASS backend | Discovered in sub_2C80C90 |
The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) reveal that NVIDIA's SASS-level backend supports DirectX and SPIR-V targets alongside traditional CUDA and OpenCL.
Data Layout Strings
| Mode | Layout | Notes |
|---|---|---|
| 64-bit + shared | e-p:64:64:64-p3:32:32:32-i1:8:8-...-n16:32:64 | p3:32:32:32 = 32-bit shared mem pointers |
| 64-bit | e-p:64:64:64-i1:8:8-...-n16:32:64 | No shared memory specialization |
| 32-bit | e-p:32:32:32-i1:8:8-...-n16:32:64 | 32-bit mode |
Address space 3 (shared memory) uses 32-bit pointers even in 64-bit mode, controlled by nvptx-short-ptr and nvptx-32-bit-smem flags. See Address Spaces for the complete address space reference.
SM Version Encoding
Two parallel version tracking systems coexist in the binary:
-
qword_4F077A8-- EncodesSM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use theXX99pattern (e.g., 69999 for pre-Volta, 89999 for pre-Hopper). See SM 70-89: SM Version Encoding for full detail. -
unk_4D045E8-- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic. See SM 70-89: unk_4D045E8 Frontend Gates for the complete gate table.
Cross-References
- NVPTX Target Infrastructure -- NVPTXTargetMachine, NVPTXSubtarget, TTI hooks
- Tensor / MMA Builtins -- Per-builtin-ID reference for all MMA generations
- Tensor / MMA Codegen -- Code generation pipeline for tensor core operations
- Atomics Builtins -- Atomic PTX generation and scope validation
- Surface & Texture Builtins -- Texture intrinsic dispatch algorithm
- NVVM IR Verifier -- SM-gated intrinsic verification
- NVVM Container -- Architecture enum and tag inventory
- CLI Flag Inventory --
-arch=compute_XXparsing and flag routing - Optimization Levels -- Per-level flag differences that interact with SM gates
- EDG 6.6 Frontend -- 737-define configuration, CUDA keyword handling
- Address Spaces -- Address space 3 shared memory and data layout strings
- GPU Execution Model -- CTA, warp, and cluster execution model context