Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Volta through Ada Lovelace (sm_70 – sm_89)

The sm_70 through sm_89 range spans four GPU generations — Volta, Turing, Ampere, and Ada Lovelace — and represents the most mature feature tier in cicc v13.0. Turing (sm_75) serves as the compiler's default architecture. Volta (sm_70/72) is no longer directly targetable: no compute_70 or compute_72 entry exists in the CLI parser, though the sm_70 feature boundary is still checked at 23 locations throughout the binary.

Supported Compute Capabilities

The architecture registration table at sub_95EB40 maps CLI strings to internal flags. Only the following are accepted for this generation range:

Compute CapabilityInternal Target__CUDA_ARCHPTX VersionGeneration
compute_75sm_757505Turing
compute_80sm_808005Ampere
compute_86sm_868605Ampere
compute_87sm_878705Ampere (Jetson Orin)
compute_88sm_888805Ada Lovelace
compute_89sm_898905Ada Lovelace

There is no compute_70, compute_72, compute_73, or compute_82. The sm_73, sm_82, and sm_88 targets exist only as internal processor table entries — they have no publicly documented differentiation and no unique feature gates in the compiler.

SM Version Encoding

Two parallel version tracking systems coexist in the binary:

  • qword_4F077A8 — Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 79999 for pre-Ampere, 89999 for pre-Hopper).

  • unk_4D045E8 — Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic.

Feature Configuration Call Order

The compiler configures feature flags through a strict four-function call sequence. Each subsequent function can override or augment the previous one's settings:

  1. CLI parser — Sets byte_4CF8* override flags from user-specified options. These prevent any subsequent auto-configuration from touching the guarded flag.
  2. sub_60DFC0 — Basic initialization. Sets unk_4D041B8 for sm_80+ (C++20 __VA_OPT__ support).
  3. sub_60D650(opt_level) — Optimization-level-based flag configuration. Sets approximately 109 flags based on the -O level. Many of the same unk_4D04* flags set by SM gates are also set here under C++17/C++20 language-version conditions.
  4. sub_60E7C0 — Master SM architecture feature configurator. Reads qword_4F077A8 and sets approximately 60 backend flags through threshold comparisons. Also calls sub_60E530 (tertiary cascade) for supplementary flags.
  5. sub_982C80 — NVPTX subtarget feature table initialization (224-byte bitfield for the LLVM backend). This is a separate path from the EDG flags above.

Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag.

Feature Gates by Generation

Volta (sm_70+) — Threshold qword_4F077A8 > 69999

Volta introduced the first tensor core generation and independent thread scheduling. Although not directly targetable in this compiler version, the sm_70 boundary enables:

  • HMMA tensor core intrinsics — Builtin IDs 678–707 registered in sub_90AEE0. Three shape variants (m16n16k16, m32n8k16, m8n32k16) with load, store, and MMA operations across f16/f32 accumulator combinations.

  • Convergent branch intrinsicllvm.nvvm.branch.if.all.convergent (builtin 3755/8282) requires sm_70+. Error: "not supported on pre-Volta Architectures" (checked in sub_1C36530 and sub_2C7B6A0).

  • Proper atomic memory ordering — At sm_70+, atomics use acquire/release/relaxed semantics instead of falling back to volatile qualification. The gate is unk_4D045E8 > 69.

  • 128-bit atomic operations — Enabled at sm_70+. Below this threshold, diagnostic 3758 is emitted: "16-byte atomics only supported on sm_70+".

  • Optimizer feature flagsunk_4D041DC, unk_4D04858, unk_4D041EC are set by sub_60E7C0. The tertiary cascade sub_60E530 additionally sets unk_4D0428C (extended float suffix support for C++23 std::float*_t / std::bfloat16_t). Multiple SelectionDAG patterns in sub_706250 activate for sm_70+ codegen.

  • Variant-flag-gated features — When dword_4F077BC (SM variant flag, the a/f suffix) is set and sm_70+ is active, unk_4D043C4 is enabled. When compiling for a virtual architecture with effective SM > 69999, unk_4D04740 is set for multi-arch optimization.

  • WMMA memory space optimization — The wmma-memory-space-opt pass (registered at ctor_267, ctor_531) optimizes memory access patterns for tensor core operations.

Turing (sm_75) — Default Architecture

sm_75 is the baseline for cicc v13.0. The default is hardcoded in sub_900130 and sub_125FB30 via strcpy("compute_75"), and in sub_95EB40 as "-arch=compute_75".

No explicit sm_75-specific feature gates exist beyond the sm_70 tier. All Volta-era features are available. The key behavioral distinction is that sm_75 passes all pre-Volta gates cleanly — no diagnostic 3703 (sub_5C68F0), no volatile atomic fallback, no 128-bit atomic restrictions.

Ampere (sm_80+) — Threshold qword_4F077A8 > 79999

  • C++20 __VA_OPT__ supportunk_4D041B8 set at sub_60DFC0 line 132–133. This is the only flag set exclusively by sub_60DFC0 at the sm_80 threshold. It enables __VA_OPT__ recognition in the EDG macro expander (sub_A03 line 1010), variadic trailing argument elision (line 1584), and diagnostic 2939 for misuse.

  • Additional convergent branchllvm.nvvm.branch.if.convergent (builtin 3754/8283) requires sm_80+. Error: "not supported on pre-Ampere Architectures". Note the distinction: branch.if.all.convergent requires only sm_70+, while branch.if.convergent requires sm_80+.

  • L2 cache hint atomics — The L2::cache_hint suffix on atomic operations, emitted from sub_21E6DD0 when bit 0x400 is set in instruction encoding flags. Supported operations: exch, add, and, or, xor, max, min, cas, and floating-point add. These are PTX 7.3+ features. Emission logic lives in sub_21E6420.

  • cp.async.bulk patterns — String matching for cp.async.bulk.tensor.g2s. and cp.async.bulk. in inline assembly validation at sub_A8E250.

Important correction: The master SM feature configurator sub_60E7C0 does NOT set any new flags at the sm_80 boundary (> 79999). The Ampere-specific unk_4D041B8 is set by the secondary configurator sub_60DFC0. The next threshold in sub_60E7C0 after sm_70+ (> 69999) is sm_90+ (> 89999). This means sm_80 through sm_89 share the same sub_60E7C0 flag profile as sm_75.

Ada Lovelace and Ampere Variants (sm_86 – sm_89)

All of sm_86, sm_87, sm_88, and sm_89 share identical feature gates within cicc. They occupy unk_4D045E8 values 86–89 and qword_4F077A8 range 86000–89999, all below the 89999 Hopper boundary.

The primary gate at this tier is unk_4D045E8 <= 89, which delineates pre-Hopper from Hopper+:

LocationFeatureBehavior at sm_89 and below
sub_5D1A60__block_size__ attributeDiagnostic 3790; only 4 args parsed (5th cluster arg is sm_90+)
sub_5D1FE0__cluster_dims__ attributeDiagnostic 3687 emitted (cluster dimensions are Hopper-only)
sub_5D2430__launch_bounds__ 3rd paramDiagnostic 3704 emitted (cluster launch bounds)
sub_6BBC40Atomic scope "cluster"Falls through to "gpu" scope; diagnostic 3763/3759
sub_6BBC4016-byte extended atomicsDiagnostic 3764 for certain scope+type combinations
sub_9502D0 / sub_12AE930Atomic scope emission"gpu" used instead of "cluster"
sub_214DA90Cluster PTX directivesSkipped entirely (arch_id <= 89)

No code path differentiates sm_89 from sm_86/87/88. Hardware differences between these sub-architectures (e.g., Ada Lovelace RTX 4090 at sm_89 vs. Jetson Orin at sm_87) are resolved at the ptxas assembler level, not in cicc.

Atomic Lowering Detail

The atomic builtin lowering (sub_12AE930 / sub_9502D0) follows two paths split at the sm_70 boundary:

Pre-sm_70 path (unk_4D045E8 <= 69): Atomics are emitted with a volatile qualifier instead of memory ordering. Scope (cta/gpu/sys) is parsed but ordering is forced to volatile. 128-bit atomics emit diagnostic 3758.

sm_70+ path (unk_4D045E8 > 69): Full memory ordering support — relaxed, acquire, release, acq_rel. Scope resolution: cta (scope 0–1), gpu (scope 3), sys (scope 4). Cluster scope (scope 2) is only available at sm_90+; on sm_70–89, scope 2 falls through to "gpu".

Operations: ld, st, atom.add, atom.and, atom.or, atom.xor, atom.max, atom.min, atom.exch, atom.cas. Type suffixes via lookup table: b (bitwise), u (unsigned), s (signed), f (float).

Hopper-Gated Intrinsics Rejected on sm_70–89

Multiple intrinsics emit "this intrinsic is only supported for Hopper+" when the SM version field is non-zero and <= 899:

Builtin IDDescription
0x10B3 (4275)Hopper+ intrinsic requiring i1 or i32 return
0xFD5 (4053)Hopper+ intrinsic
0xEB7 (3767)Memory ordering/fence intrinsic with operation modes
0xEB9–0xEBA (3769–3770)Pointer-size-dependent intrinsics (>= 64-bit)

Complete sub_60E7C0 Flag Table

The master feature configurator sub_60E7C0 (address 0x60E7C0, 12,466 bytes, 56 qword_4F077A8 comparisons) is the primary SM-architecture-to-feature-flag mapper. Every flag assignment follows a guarded pattern: if the corresponding byte_4CF8* override byte is nonzero (set by a CLI flag), the auto-configuration is skipped and the user's explicit value is preserved.

Unconditional Assignments

These flags are set regardless of SM version with no user override check:

FlagValueNotes
unk_4D047C01Always enabled
unk_4D047B41Always enabled
unk_4F075840Always cleared
unk_4D0423C1Always enabled
unk_4D042080Always cleared
unk_4D042181Always enabled
unk_4D042141Always enabled
unk_4F069700Always cleared
unk_4F069640Always cleared
unk_4F069040Always cleared

SM-Dependent Unconditional Flags

These depend on SM version but have no user override check:

FlagConditionValueNotes
unk_4D047BCSM <= 1199991Disabled only for sm_120+
unk_4D04758SM <= 303001sm_32 and below only
unk_4D04764SM <= 303991sm_32 and below only
unk_4D044B8SM <= 402991Pre-Maxwell only

Guarded Flags (byte_4CF8* Override Bypass)

Each flag is set only when its guard byte is zero (user has not overridden via CLI):

GuardFlagDefault Value (guard=0)
byte_4CF807Bdword_4D048B8= 1
byte_4CF810Cdword_4D04824= 1
byte_4CF80F0unk_4D04388= 1
byte_4CF8108unk_4D04338= (dword_4F077BC && !dword_4F077B4 && SM <= 30399) ? 1 : 0
byte_4CF8123unk_4D047C8= 1 (only if SM > 30399, sm_35+)
byte_4CF8125dword_4D047B0= 1 (only if SM > 30399, sm_35+)
byte_4CF8139unk_4D04314= (SM <= 40299), pre-Maxwell
byte_4CF814Dunk_4D047C4= 0
byte_4CF8119unk_4D047D0= (SM <= 40000)
byte_4CF8119unk_4D047CC= (SM <= 40099)
byte_4CF810Funk_4D047EC= 1
byte_4CF8107unk_4D04340= 0
byte_4CF8116unk_4D047E0= 1
byte_4CF815Funk_4F0771C= 1
byte_4CF813Eunk_4D044B0= (SM > 40299), Maxwell+
byte_4CF8149unk_4D04470Complex Maxwell+ gate
byte_4CF8172dword_4D041AC= 0 (when SM <= 109999)
byte_4CF8159dword_4D048B0= (dword_4D048B8 && dword_4D048B4 && SM > 40799)
byte_4CF811Cunk_4D04790= 0 (when virtual arch flag set)
byte_4CF813Cdword_4D047ACsm_35+ feature gate
byte_4CF8156unk_4D04408CUDA C++ feature gate
byte_4CF815Dunk_4D048A0= 1 (when SM > 40699)

Total: 21 override bytes controlling approximately 25 feature flags.

Feature Escalation by SM Version

The cumulative flag-setting cascade. Each tier inherits all flags from lower tiers. Only the tiers relevant to sm_70–89 plus their immediate predecessors and successors are shown.

SM > 59999 (sm_60+, Pascal+):

FlagIdentified Meaning
unk_4D043CCEDG C++17 feature gate (also set by C++17 language block)
unk_4D04404EDG extended feature gate (also set by C++17 language block)
unk_4D043D8EDG C++17 feature gate (also set by C++17 language block)
unk_4D043D4EDG feature gate (also set via virtual arch > 30599)
dword_4F07760PTX generation mode flag
unk_4D04870EDG C++20 feature gate (also set by C++20 language block)

SM > 69999 (sm_70+, Volta+):

FlagIdentified Meaning
unk_4D041DCEDG C++17 feature gate (also set by C++17 language block)
unk_4D04858EDG C++17 feature gate (also set by C++17 language block)
unk_4D041ECEDG C++17/Pascal virtual arch feature gate

SM > 89999 (sm_90+, Hopper+) — NOT active for sm_70–89:

FlagIdentified Meaning
unk_4D043D0EDG C++20 feature gate (also set by C++20 language block)
unk_4D041B0EDG C++20 feature gate (also set by C++20 language block)
unk_4D04814EDG C++20 feature gate (also set by C++20 language block)
unk_4D0486C(with additional C++ version check)

sub_60E530 Tertiary Cascade

This supplementary function provides additional progressive unlocks. For the sm_70–89 range:

ThresholdHexFlags Set
> 405990x9E97unk_4F07764
> 406990x9EFBunk_4D043F0, unk_4D043F4
> 408990x9FC3unk_4D04220, unk_4D044D0
> 599990xEA5Funk_4D043CC (duplicates sub_60E7C0)
> 699990x1116Funk_4D0428C (extended float suffixes: C++23 std::float*_t / std::bfloat16_t)
> 899990x15F8Fdword_4F07760 (duplicates sub_60E7C0)
> 999990x1869Fdword_4D043F8, dword_4D041E8

Note: unk_4D0428C is set at > 69999 (sm_70+) by the cascade but at > 119999 (sm_120+) by sub_60E7C0. The cascade runs as part of sub_60E7C0, so the sm_70+ activation wins for all practical SM versions. This flag gates C++23 extended float suffixes (std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t) in the EDG numeric parser at sub_A02 line 1612.

sub_60DFC0 SM-Gated Flags

The secondary configurator adds one flag at the sm_80 boundary:

ThresholdFlagIdentified Meaning
> 79999 (sm_80+)unk_4D041B8C++20 __VA_OPT__ support in EDG macro expander. Enables __VA_OPT__ recognition, variadic trailing argument elision, and diagnostic 2939.

Virtual Architecture Downgrade Path

When compiling for a virtual architecture (dword_4F077B4 = 1), sub_60E7C0 uses unk_4F077A0 (the effective/real SM) for a secondary tier of feature decisions:

Effective SM >Flags Set
29999unk_4D043E4
30099unk_4D044D0
30199unk_4D043F0
30299unk_4D04220
30599unk_4D043D4
59999unk_4D041EC, unk_4D043D8, unk_4D04404
69999unk_4D04740
79999unk_4D043D0
89999unk_4D043D0 (redundant — already set at > 79999)
129999unk_4D04184

Note: In the virtual arch path, unk_4D043D0 is set at > 79999 (sm_80+), while in the primary path it requires > 89999 (sm_90+). Virtual arch compilation is more conservative, enabling features the real target supports even if the virtual arch normally gates them.

unk_4D045E8 Frontend Gates

These gates use the raw SM number and control frontend semantic checks rather than backend flags:

GateLocationsEffect
<= 69sub_12AE930 ln 241, sub_9502D0 ln 294Atomic volatile fallback
<= 69sub_6BBC40 ln 763128-bit atomic error 3758
<= 69sub_5C68F0Diagnostic 3703
<= 51sub_691790 ln 126Surface builtin warning
<= 59sub_6BBC40 ln 639Atomic scope restriction
60–69sub_6BBC40 ln 814Diagnostic 3762
<= 79sub_5C6950 ln 15Diagnostic 3660
<= 89sub_5D1A60 ln 35__block_size__ 5th arg blocked
<= 89sub_5D1FE0 ln 19__cluster_dims__ diagnostic 3687
<= 89sub_5D2430 ln 33__launch_bounds__ 3rd param diagnostic 3704
<= 89sub_6BBC40 ln 684Atomic scope diagnostic 3763/3759
<= 89sub_6BBC40 ln 805, 82716-byte atomic diagnostic 3764
<= 89sub_9502D0 ln 424, sub_12AE930 ln 255Cluster scope falls through to "gpu"
<= 89sub_214DA90 ln 66Cluster PTX directives skipped

Cumulative Flag Profile per SM Version

This table shows the net flag state for each SM version in the range, combining all three configurators (sub_60E7C0 + sub_60E530 + sub_60DFC0). Only flags that differ across the sm_70–89 range are shown.

Flagsm_75sm_80sm_86–89Set ByIdentified Role
unk_4D041DC111sub_60E7C0 > 69999EDG C++17 feature gate
unk_4D04858111sub_60E7C0 > 69999EDG C++17 feature gate
unk_4D041EC111sub_60E7C0 > 69999EDG C++17 / virtual arch feature gate
unk_4D0428C111sub_60E530 > 69999Extended float suffixes (C++23)
unk_4D041B8011sub_60DFC0 > 79999C++20 __VA_OPT__ support
unk_4D043D0000sub_60E7C0 > 89999(sm_90+ only)
unk_4D041B0000sub_60E7C0 > 89999(sm_90+ only)
unk_4D04814000sub_60E7C0 > 89999(sm_90+ only)
unk_4D0486C000sub_60E7C0 > 89999(sm_90+ only)

The sole differentiator between sm_75 and sm_80+ within sub_60E7C0/sub_60DFC0 is unk_4D041B8. All flags set at > 69999 are shared by all sm_70–89 targets. All flags set at > 89999 are absent from all sm_70–89 targets. There is no per-flag difference between sm_86, sm_87, sm_88, and sm_89.

Identified Flag Semantics

Where flag consumers have been positively identified in the decompiled binary:

FlagSet AtConsumerMeaning
unk_4D041B8sm_80+ (sub_60DFC0)EDG macro expander (sub_A03 ln 1010)C++20 __VA_OPT__ support: recognition, variadic trailing argument elision, diagnostic 2939
unk_4D0428Csm_70+ (sub_60E530), sm_120+ (sub_60E7C0)EDG numeric parser (sub_A02 ln 1612)Extended float suffixes: C++23 std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t
dword_4F07760sm_60+ (sub_60E7C0, sub_60E530)PTX generation pathPTX emission mode flag
unk_4D047C8sm_35+ (sub_60E7C0)BackendDynamic parallelism optimization
dword_4D047B0sm_35+ (sub_60E7C0)BackendDynamic parallelism support
unk_4D04780alwaysEDG macro expanderGNU ##__VA_ARGS__ comma-deletion extension

The remaining approximately 50 flags feed into the EDG frontend and NVVM IR generation pipeline. Based on the pattern that sub_60D650 (optimization level) and sub_60E7C0 (SM version) set the same flags with overlapping conditions, most are language feature gates (C++17/20/23 features that are also SM-gated) or optimization pass enables that depend on target capability.

Key Binary Locations

FunctionAddressSizeRole
sub_60E7C00x60E7C0Master SM feature flag initialization (12,466 bytes, 56 comparisons)Master SM feature flag initialization (12,466 bytes, 56 comparisons)
sub_60DFC00x60DFC0Secondary feature flag initialization (unk_4D041B8 at sm_80+)Secondary feature flag initialization (unk_4D041B8 at sm_80+)
sub_60E5300x60E530Tertiary feature cascade (unk_4D0428C at sm_70+)Tertiary feature cascade (unk_4D0428C at sm_70+)
sub_60D6500x60D650Optimization-level flag configurator (~109 flags)Optimization-level flag configurator (~109 flags)
sub_982C800x982C80NVPTX subtarget 224-byte feature bitfieldNVPTX subtarget 224-byte feature bitfield
sub_617BD00x617BD0CLI parser; sets unk_4D045E8 per compute_XXCLI parser; sets unk_4D045E8 per compute_XX
sub_12AE9300x12AE930Atomic builtin lowering (volatile vs. ordering)Atomic builtin lowering (volatile vs. ordering)
sub_9502D00x9502D0Duplicate atomic lowering (standalone pipeline)Duplicate atomic lowering (standalone pipeline)
sub_6BBC400x6BBC40Builtin semantic checker (atomics, scope validation)Builtin semantic checker (atomics, scope validation)
sub_90AEE00x90AEE0Builtin registration table (HMMA builtins 678–707)Builtin registration table (HMMA builtins 678–707)
sub_95EB400x95EB40Architecture registration (compute_XX to sm_XX)Architecture registration (compute_XX to sm_XX)
sub_1C365300x1C36530NVVM verifier (convergent intrinsic SM gates)NVVM verifier (convergent intrinsic SM gates)
sub_2C7B6A00x2C7B6A0NVVM lowering (convergent intrinsic SM gates)NVVM lowering (convergent intrinsic SM gates)
sub_21E6DD00x21E6DD0PTX emission (volatile / L2::cache_hint / .unified)PTX emission (volatile / L2::cache_hint / .unified)
sub_21E64200x21E6420Atomic L2 cache hint PTX emissionAtomic L2 cache hint PTX emission
sub_214DA900x214DA90Kernel attribute PTX emitter (cluster directives gated at arch_id > 89)Kernel attribute PTX emitter (cluster directives gated at arch_id > 89)
sub_5D1A600x5D1A60__block_size__ attribute (cluster dims at sm_90+)__block_size__ attribute (cluster dims at sm_90+)
sub_5D1FE00x5D1FE0__cluster_dims__ attribute (sm_90+ feature)__cluster_dims__ attribute (sm_90+ feature)
sub_5D24300x5D2430__launch_bounds__ 3rd param (sm_90+ cluster)__launch_bounds__ 3rd param (sm_90+ cluster)
sub_5C68F00x5C68F0Pre-sm_70 diagnostic 3703Pre-sm_70 diagnostic 3703
sub_5C69500x5C6950Pre-sm_80 diagnostic 3660Pre-sm_80 diagnostic 3660