Turing & Ampere (SM 75--88)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

SM 75 through SM 88 span two microarchitecture generations that ptxas treats as a contiguous feature band. sm_75 (Turing) is the default target for ptxas v13.0.88 -- when --gpu-name is omitted, sub_6784B0 returns sm_75. The Ampere targets (sm_80, sm_86, sm_87, sm_88) share generation-7 SASS encoding and add incremental tensor core and async-copy capabilities. sm_89 (Ada Lovelace) is architecturally Ampere-derived internally but is covered in Ada & Hopper because it bridges to sm_90 features.


SM targets	sm_75, sm_80, sm_86, sm_87, sm_88 (+ sm_82 validation-only)
Codegen factory range	24577--28676
ISA generation	6 (Turing), 7 (Ampere)
Encoding format	128-bit per-instruction control word
Scheduler profile	7 warps, 208 dispatch slots
Family strings	`"Turing"` (sm_75), `"Ampere"` (sm_80--88)
Sub-variants	None (no `a` or `f` suffixes)
Profile object size	1,936 bytes (allocated by `sub_917990`)

SM Version Table

SM	Product	Family	`__CUDA_ARCH__`	Codegen Factory	Hex	Variant
`sm_75`	TU10x (RTX 20xx, Quadro RTX)	Turing	750	24577	`0x6001`	1 (gen 6)
`sm_80`	GA100 (A100, A30)	Ampere	800	28673	`0x7001`	1 (gen 7)
`sm_86`	GA10x (A40, A10, RTX 30xx)	Ampere	860	28674	`0x7002`	2
`sm_87`	GA10B (Jetson Orin)	Ampere	870	28675	`0x7003`	3
`sm_88`	-- (undocumented)	Ampere	880	28676	`0x7004`	4

Codegen factory encoding: (isa_generation << 12) | sub_variant. Turing is generation 6; Ampere is generation 7. The sub-variant distinguishes silicon cut within a generation. sm_75 and Pascal sm_60 share generation 6 (sm_60 = 24576 = 0x6000), differentiated by sub-variant 0 vs 1.

sm_88 note: Registered in ptxas with CUDA_ARCH=880 and codegen factory 28676, but no public product ships on this SM. It may represent an unreleased Ampere derivative or internal test target.

SM 82 -- Internal Ampere Target

sm_82 is an undocumented internal Ampere target present in the base validation table (unk_1D16220, entry [20]) but not registered in the profile constructor sub_6765E0. It has no capability dispatch handler, no profile object, and no handler functions in any of the 7 hash maps. It exists in ptxas solely as a validation table entry and as the SASS opcode generation boundary.


Validation table entry	`{82, 6, 2}` -- sm_82, PTX 6.2
PTX ISA requirement	6.2 (anomalously low -- see below)
Profile object	None -- not registered in `sub_6765E0`
Capability handlers	None -- not registered in `sub_607DB0`
SASS opcode role	`SM82_FIRST` (index 172) through `SM82_LAST` (index 193)

PTX 6.2 Anomaly

sm_82 requires PTX ISA version 6.2, which is lower than both its neighbors:

SM	PTX ISA	CUDA Toolkit
sm_75	6.3	CUDA 10.0
sm_80	7.0	CUDA 11.0
sm_82	6.2	--
sm_86	7.1	CUDA 11.1

PTX 6.2 corresponds to CUDA 10.1 (Turing era). This backward version number strongly suggests sm_82 was created as an early Ampere development target -- a PTX-level placeholder added before the Ampere PTX ISA (7.0) was defined. The validation table entry was never removed, but no profile object was ever created for it.

SASS Opcode Boundary Role

sm_82's primary significance in ptxas is as the opcode generation boundary for Ampere SASS instructions. The opcode hierarchy uses SM-number-based range labels:

SM82_FIRST  = index 172   (first Ampere-era SASS opcode)
SM82_LAST   = index 193   (last opcode in the sm_82 range)

These 22 opcode slots (indices 172--193) cover the core Ampere SASS additions:

Opcodes	Category
GATHER, GENMETADATA, SPMETADATA	Sparse MMA infrastructure
BMMA_88128, BMMA_168128, BMMA_168256	Binary tensor core MMA shapes
DMMA	FP64 tensor core MMA (re-introduced at index 215 for Hopper)
HMMA_SP_1688, HFMA2_MMA, HMNMX2	FP16 sparse/packed operations
IMMA_88, IMMA_SP_88, IMMA_16816, IMMA_16832, IMMA_SP_16832	Integer tensor core MMA shapes
ARRIVES, LDGDEPBAR, LDGSTS	Async copy and barrier infrastructure
REDUX	Warp-wide reduction
CLMAD	Carry-less multiply-add (GF(2) arithmetic)

The name SM82_FIRST/SM82_LAST is used as the boundary label even though these instructions are available on sm_80+ (any codegen factory >= 28673). The "82" in the label refers to the internal target used during Ampere development, not to a minimum SM requirement for the opcodes themselves.

Why sm_82 Matters

sm_82 is a ghost target: it occupies a validation table slot and lends its name to an opcode range, but cannot be compiled for. Passing --gpu-name sm_82 to ptxas would pass the initial validation check (bsearch succeeds in the base table) but fail during profile construction because sub_6765E0 has no case for SM 82. The practical consequence is that sm_82 is a naming artifact preserved from Ampere development, not a usable compilation target.

Profile Object Construction

Each SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object, then populates architecture-specific fields.

Source: decompiled sub_60A2E0 (sm_75), sub_60A3E0 (sm_80), sub_60AC30 (sm_86), sub_60AD30 (sm_87), sub_60AB30 (sm_88).

All five initializers are structurally identical. The only field that differs between them is offset +348 (codegen factory). Common initialization:

v6 = sub_917990(a3);              // allocate 1936-byte profile object
*(_DWORD *)(v6 + 344) = 0x100000; // shared memory config base (1 MB)
*(_DWORD *)(v6 + 348) = FACTORY;  // codegen factory (per-SM)
v7[147] = 0;                      // cleared
v7[421] = cli_option_value;       // from a1+108
v7[460] = 1;                      // enable flag
v7[482] = 1;                      // enable flag
v7[470] = 1;                      // enable flag
v7[177] = 1;                      // enable flag

The sub_917990 base constructor sets the default codegen factory to 0x2000 (8192), which every per-SM initializer immediately overwrites. Key base constructor fields:

Offset	Default	Content
+348	0x2000	Codegen factory (overridden)
+588	0	Cleared
+1892	2	Mode/config value
+1832--1848	xmmword	SSE-loaded constant block
+1908--1924	xmmword	SSE-loaded constant block

Handler Dispatch

ptxas registers per-SM handler functions into 7 parallel hash maps via sub_607DB0. For sm_75--88, all handlers are thin wrappers around shared codegen infrastructure.

Map 1 (Handler A) and Map 2 (Handler B)

SM	Handler A	Handler B
sm_75	`sub_609B70`	`sub_609B40`
sm_80	`sub_609CC0`	`sub_609C90`
sm_86	`sub_609D50`	`sub_609D80`
sm_87	`sub_609F00`	`sub_609DE0`
sm_88	`sub_609E70`	`sub_609EA0`

Every Handler A function is identical:

bool handler_A(int64_t a1, int64_t a2) {
    *(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
    return sub_663C30(a2, 0) != 0;  // a2=0: sets *(a1+96) = 1
}

Every Handler B function is identical except the second argument:

bool handler_B(int64_t a1, int64_t a2) {
    *(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
    return sub_663C30(a2, 1) != 0;  // a2=1: skips the flag set
}

The "cpf_optx" option controls OptiX IR compilation mode. sub_663C30 is the core codegen-factory-aware driver that delegates to sub_662920 (ELF section iteration, 26KB) and sub_7FBB70 (actual compilation pass). The only behavioral difference between Handler A and Handler B: when called as Handler A (arg=0), sub_663C30 sets *(a1 + 96) = 1 before processing -- likely a "primary pass" flag.

Map 3 (Intrinsic Table Initializer)

SM	Initializer	Notes
sm_75	`sub_60A2E0`	Factory 24577
sm_80	`sub_60A3E0`	Factory 28673
sm_86	`sub_60AC30`	Factory 28674
sm_87	`sub_60AD30`	Factory 28675
sm_88	`sub_60AB30`	Factory 28676

Maps 6--7 (Performance / Occupancy)

Map	sm_75 Handler	Purpose
6	`sub_608D50`	Perf-stats / occupancy handler E
7	`sub_6096E0`	Perf-stats / occupancy handler F

These handlers return per-SM occupancy parameters used by the driver API's occupancy calculator. Other Ampere SMs have their own handlers registered into the same maps (addresses not fully traced in sweep data).

Scheduler Profile

sub_8E4400 (InitHWProfile_Warp) uses the codegen factory value to select scheduling parameters. The dispatch is a linear threshold cascade:

Factory Range	Warps	Dispatch Slots	Architecture Class
<= 20479	4	96	Kepler (sm_30)
<= 24575	6	176	Pascal (sm_60)
<= 28672	7	192	Volta (sm_70)
<= 32767	7	208	Turing / Ampere (sm_75--88)
<= 36863	8	224	Hopper (sm_90)
> 36863	16	240	Blackwell (sm_100+)

All SM 75--88 targets fall into the 7-warp / 208-slot bucket. After the warp count, a secondary switch maps specific codegen factory values to sub-architecture variants (stored at a1+26):

Codegen Factory	Variant	SM
24576, 32768, 36864	0	sm_60, sm_90, sm_100 (base)
8193, 20481, 28674	2	sm_30, sm_50, sm_86
28675, 36867	3	sm_87, sm_103
28676, 36868	4	sm_88, sm_110
28677, 36869	5	sm_89, sm_121

sm_75 (24577) and sm_80 (28673) are absent from the variant table and fall through to the default variant (0 or 1). This means sm_75 and sm_80 use the baseline latency model, while sm_86--88 get tuned sub-architecture parameters.

HW Latency Tables

Each SM has a dedicated latency table function containing per-opcode pipeline stage assignments, stall cycle costs, and functional unit mappings. These are called from the scheduling infrastructure to drive instruction ordering.

Function	Size	SM	Notes
`sub_8E7720`	3.5 KB	sm_75	Turing baseline
`sub_8E7940`	2.9 KB	sm_80 (base)	Ampere shared layer
`sub_8E7B40`	3.3 KB	sm_80	Ampere full table
`sub_8E7D80`	4.4 KB	sm_86	Largest in Ampere family
`sub_8E8070`	3.5 KB	sm_87	Orin tuning
`sub_8E8280`	3.1 KB	sm_89	Ada Lovelace

sm_80 uniquely has two latency tables: a "base" table (sub_8E7940, 2.9KB) and a full table (sub_8E7B40, 3.3KB), suggesting a layered lookup where the base provides defaults and the full table overrides specific entries. sm_86's table is the largest at 4.4KB, likely because RTX 30xx consumer GPUs have different pipeline characteristics from A100 datacenter parts.

No separate table entry was found for sm_88 in the sweep data. It may share sm_86 or sm_87's latency profile, or be registered through a path not captured in the sweep.

SASS Instruction Encoding

SM 75--88 all use the 128-bit per-instruction encoding format introduced with Turing. This replaced the Volta/Pascal scheme where scheduling control was packed into a separate 64-bit control header shared by 3 instructions.

Control Word Layout

Each 128-bit SASS instruction carries a 23-bit control word encoding scheduling decisions. The control word is generated by sub_A36360 (52KB, the control-word / scoreboard encoder).

bits [0:3]   stall count       (4 bits, max 15 cycles)
bit  [4]     yield flag        (1 bit, warp scheduler hint)
bits [5:7]   write barrier idx (3 bits, selects 1 of 6 barriers)
bits [8:13]  read barrier mask (6 bits, one per barrier)
bits [14:19] wait barrier mask (6 bits, one per barrier)
bits [20:25] reuse flags       (6 bits, per source operand)

Instruction Word Structure

128-bit instruction word:
  bits [0:3]     = 0x2       (format code: 128-bit)
  bits [4:6]     = slot      (scheduling group slot)
  bits [8:16]    = MAJOR     (9-bit major opcode, 0x00-0x171)
  bits [17:24]   = MINOR     (8-bit minor opcode / variant)
  bits [25:31]   = SUBOP     (7-bit sub-opcode / format ID)
  bits [48+]     = MODIFIERS (format-dependent modifier fields)
  bits [132:134] = 0x0       (extended opcode flag, at offset 0x84)

The 3-level opcode hierarchy (major/minor/subop) allows up to 102 major opcodes with 48 sub-operations each. Maximum observed variant value: 0x2F (47 decimal).

Scoreboard / Dependency Barriers

SM 75--88 provide 6 hardware dependency barriers per warp. The scoreboard tracker is managed by sub_8E4920 (BuildScoreboardEntries, 6.9KB) and encoded by sub_A36360.

Resource	Width	Range	Notes
Write barrier index	3 bits	0--5 active, 6--7 reserved	Assigns instruction to a barrier
Read barrier mask	6 bits	1 bit per barrier	Indicates which barriers to check before read
Wait barrier mask	6 bits	1 bit per barrier	Indicates which barriers to wait on

The scoreboard tracker allocates 952 bytes per function when bit 4 of the flag byte at offset +1385 is set. An additional 856-byte bitset is allocated when bit 8 is also set (for barrier register tracking in writeback mode).

The scoreboard infrastructure is shared across all SM 75--88 targets. The barrier count (6) is constant for this entire range. sm_90 (Hopper) potentially increases this, and sm_100+ (Blackwell) changes the barrier model further.

Intrinsic Table

Intrinsic availability is cumulative. Each generation adds to the previous.

sm_75 Baseline (IDs 0x89--0x1FA, 370 intrinsics)

sm_75 inherits the full sm_70 (Volta) intrinsic set labeled __cuda_sm70_*:

Category	Intrinsics	PTX Operations
Named barriers	barrier_arrive/red/sync (0--15)	`bar.arrive`, `bar.red.{and,or,popc}`, `bar.sync`
Warp shuffle	shflsync_bfly/down/idx/up	`shfl.sync.{bfly,down,idx,up}`
Warp vote	votesync_all/any/ballot/uni	`vote.sync.{all,any,ballot,uni}`
Warp match	matchsync_all/any_b32/b64	`match.sync.{all,any}.b{32,64}`
Warp sync	warpsync	`bar.warp.sync`
Redux	reduxsync_* (IDs 0x01--0x11)	`redux.sync.{and,or,xor,min,max,add}`
WMMA	m16n16k16, m32n8k16, m8n32k16	`wmma.{load,store,mma}`

WMMA intrinsics cover all combinations of:

Shapes: m16n16k16, m32n8k16, m8n32k16
Operations: load_a, load_b, load_c, store_d, mma
Layouts: row/col combinations
Types: f16, f32, with/without satfinite
Address spaces: generic, global, shared

sm_80 Additions (IDs 0x1FB--0x22F, 53 intrinsics)

sm_80 adds two intrinsic groups:

14 __cuda_sm80_* intrinsics (IDs 0x1FB--0x208):

Intrinsic	PTX Operation	Notes
createpolicy_range	`createpolicy.range`	L2 cache persistence control
createpolicy_fractional	`createpolicy.fractional`	L2 cache fraction control
createpolicy_cvt	Cache policy conversion
mma_bf16_*	`mma.sync` with `.bf16`	BF16 tensor core MMA
mma_tf32_*	`mma.sync` with `.tf32`	TF32 tensor core MMA
mma_s4_*	`mma.sync` with `.s4`	INT4 tensor core MMA
mma_s8_*	`mma.sync` with `.s8`	INT8 tensor core MMA
mma_b1_*	`mma.sync` with `.b1`	Binary tensor core MMA

39 __cuda_sm_8x_mma_* intrinsics (IDs 0x209--0x22F):

Extended MMA shapes and sparse variants for the 2nd/3rd generation tensor core.

Intrinsic Gate Mechanism

The per-SM intrinsic table initializer (Map 3 handler) controls which intrinsics are available. sm_75 registers only the sm_70 intrinsic block. sm_80+ additionally registers the sm_80 and sm_8x blocks. The gate is not a runtime check -- it is a registration-time decision: if the intrinsic's handler function is not registered for a given SM, the PTX parser emits an error when it encounters a call to that intrinsic.

Peephole Optimizer Gates

The SASS-level peephole optimizer (sub_83EF00, 4,858 lines decompiled) applies pattern-matching transformations to the instruction stream. Several transformations are gated by codegen factory thresholds.

FMA/DFMA Combining

// sub_83EF00, case 0x50 (opcode 80 = FMA/DFMA)
if (*(_QWORD *)(*(_QWORD *)(a1 + 1584) + 372) > 28673) {
    // FMA combining enabled: sm_86+ (codegen factory 28674+)
    // Look for CVT -> FMA patterns and combine
}

Gate: codegen_factory > 28673. This means:

sm_75 (24577): FMA combining disabled
sm_80 (28673): FMA combining disabled (equality fails the > check)
sm_86 (28674): FMA combining enabled
sm_87 (28675): FMA combining enabled
sm_88 (28676): FMA combining enabled

This is a significant compiler optimization difference between A100 (sm_80) and RTX 30xx (sm_86). A100 code does not get FMA/DFMA combining, possibly because A100's pipeline already has hardware FMA fusion or because the combining transformation is not profitable on GA100 silicon.

Master Gate

All peephole passes check capability 487 ("enable peephole") and capability 350 (per-instruction gate) before applying any transformation. These capabilities are controlled by optimization level and user options, not by SM target.

BB Initialization Flags

The basic block initializer sub_6E8EB0 sets architecture-specific flags using a secondary SM version encoding (distinct from the codegen factory):

SM Version (secondary)	Hex	Flags Set	Architecture
20480	0x5000	bits 1, 8	sm_80 encoding space
20484	0x5004	bits 16, 64	sm_84 encoding space

This secondary encoding uses (gen << 12) with generation 5 for Ampere in the BB init context. The specific bit flags control opcode descriptor table population -- each BB gets a set of 40+ (opcode_id, encoding_word) pairs that define which SASS instructions are legal in that basic block.

Feature Comparison

Feature	sm_75	sm_80	sm_86	sm_87	sm_88
ISA generation	6	7	7	7	7
Codegen factory	24577	28673	28674	28675	28676
WMMA (1st gen TC)	Yes	Yes	Yes	Yes	Yes
MMA bf16/tf32 (2nd gen TC)	--	Yes	Yes	Yes	Yes
MMA s4/s8/b1 extended	--	Yes	Yes	Yes	Yes
`createpolicy` (L2 cache)	--	Yes	Yes	Yes	Yes
`cp.async` (async copy)	--	Yes	Yes	Yes	Yes
sm_8x MMA intrinsics (39)	--	Yes	Yes	Yes	Yes
FMA/DFMA peephole combining	--	--	Yes	Yes	Yes
Scheduler: warps / slots	7/208	7/208	7/208	7/208	7/208
Sub-arch variant	default	default	2	3	4
Separate HW latency table	Yes	Yes (2)	Yes	Yes	?
Family string	Turing	Ampere	Ampere	Ampere	Ampere

Hardware Resource Geometry

Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.

SM	Regs/SM	Max Regs/Thread	Max Threads/CTA	Warps/SM	Max CTAs/SM	Sched Partitions	Dispatch Slots	Configurable Shared Memory	Conf
`sm_75`	65,536	255	1,024	32	16	7 / 208	208	32 / 48 / 64 KB	90%
`sm_80`	65,536	255	2,048	64	32	7 / 208	208	48 / 100 / 132 / 164 KB	90%
`sm_86`	65,536	255	1,536	48	16	7 / 208	208	48 / 100 KB	90%
`sm_87`	65,536	255	1,536	48	16	7 / 208	208	48 / 100 / 164 KB	90%
`sm_88`	65,536	255	1,536	48	16	7 / 208	208	(same as sm_86)	85%

Column definitions:

Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
Max Threads/CTA: Maximum threads per cooperative thread array (block).
Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
Max CTAs/SM: Maximum concurrent CTAs per SM.
Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.

All Turing/Ampere targets share the 7-partition / 208-slot scheduling geometry. The major resource difference is sm_80 (A100 datacenter) with 2,048 max threads and 64 warps vs. the consumer/embedded parts (sm_86--88) with 1,536 max threads and 48 warps. sm_75 (Turing) is the most constrained with 1,024 max threads and 32 warps.

Codegen Factory Gating Patterns

Throughout ptxas, the codegen factory value at profile offset +348 (or equivalently *(*(QWORD*)(a1+1584)+372) from the function context) is used to gate features. Common patterns for the Turing/Ampere range:

// Pattern 1: Generation check (factory >> 12)
int gen = codegen_factory >> 12;
if (gen >= 7) { /* Ampere+ feature */ }

// Pattern 2: Exact threshold
if (codegen_factory > 28673) { /* sm_86+ only */ }
if (codegen_factory >= 28673) { /* sm_80+ */ }

// Pattern 3: Range check
if (codegen_factory >= 24577 && codegen_factory < 32768) {
    /* Turing/Ampere only, not Hopper+ */
}

The >> 12 shift extracts the ISA generation, allowing coarse checks (Turing = 6, Ampere = 7). Fine-grained checks compare against specific factory values to distinguish sm_80 from sm_86, etc.

Function Map

Address	Size	Identity	SM	Confidence
`sub_609B70`	~48B	Handler A (cpf_optx + compile)	sm_75	99%
`sub_609B40`	~48B	Handler B (cpf_optx + compile)	sm_75	99%
`sub_60A2E0`	~300B	Intrinsic table initializer	sm_75	95%
`sub_609CC0`	~48B	Handler A	sm_80	99%
`sub_609C90`	~48B	Handler B	sm_80	99%
`sub_60A3E0`	~300B	Intrinsic table initializer	sm_80	95%
`sub_609D50`	~48B	Handler A	sm_86	99%
`sub_609D80`	~48B	Handler B	sm_86	99%
`sub_60AC30`	~300B	Intrinsic table initializer	sm_86	95%
`sub_609F00`	~48B	Handler A	sm_87	99%
`sub_609DE0`	~48B	Handler B	sm_87	99%
`sub_60AD30`	~300B	Intrinsic table initializer	sm_87	95%
`sub_609E70`	~48B	Handler A	sm_88	99%
`sub_609EA0`	~48B	Handler B	sm_88	99%
`sub_60AB30`	~300B	Intrinsic table initializer	sm_88	95%
`sub_608D50`	~200B	Perf/occupancy handler E	sm_75	85%
`sub_6096E0`	~200B	Perf/occupancy handler F	sm_75	85%
`sub_663C30`	~400B	Core codegen driver (shared)	all	90%
`sub_662920`	26KB	ELF section iteration (shared)	all	85%
`sub_917990`	~300B	Profile object constructor	all	90%
`sub_8E7720`	3.5KB	HW latency table	sm_75	90%
`sub_8E7940`	2.9KB	HW latency table (base layer)	sm_80	90%
`sub_8E7B40`	3.3KB	HW latency table (full)	sm_80	90%
`sub_8E7D80`	4.4KB	HW latency table	sm_86	90%
`sub_8E8070`	3.5KB	HW latency table	sm_87	90%
`sub_8E4400`	3.3KB	InitHWProfile_Warp (shared)	all	90%
`sub_83EF00`	~100KB	Primary peephole optimizer	all	85%
`sub_A36360`	52KB	Control word / scoreboard encoder	all	85%
`sub_8E4920`	6.9KB	BuildScoreboardEntries	all	85%

Cross-References

SM Architecture Map -- Overview of all 23 SM targets and the 3-level profile system
Ada & Hopper (SM 89--90a) -- sm_89 (Ada) shares Ampere codegen factory range but bridges to Hopper features
Blackwell (SM 100--121) -- Next-generation targets with codegen factory 36864+
Intrinsic Table (608 Entries) -- Full intrinsic catalog with per-SM generation ranges
SASS Instruction Encoding -- 128-bit encoding format, bitfield packer, opcode hierarchy
Peephole Optimization -- FMA combining and other post-scheduling SASS transforms
3-Phase Scheduler Architecture -- Scheduler infrastructure that consumes HW latency tables
CLI Options -- --gpu-name parsing, sm_75 default

Keyboard shortcuts

PTXAS Reverse Engineering Reference