PTXAS v13.0 — Reverse Engineering Reference
Purpose: reimplementation-grade documentation of NVIDIA's PTX-to-SASS assembler, recovered entirely from static analysis of the stripped x86-64 binary.
PTX (Parallel Thread Execution) is NVIDIA's virtual ISA for GPU compute. SASS (Shader Assembly) is the native machine code executed by GPU hardware. PTXAS is the binary that transforms PTX into SASS. At 37.7 MB stripped, it is a fully proprietary compiler with no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, every data structure, and every encoding table was built in-house by NVIDIA. This wiki documents its internal architecture using IDA Pro 8.x and Hex-Rays decompilation.
Version note: All addresses and binary offsets in this wiki apply to ptxas v13.0.88 (CUDA Toolkit 13.0). Other versions will have different addresses.
| Property | Value |
|---|---|
| Binary | ptxas v13.0.88, 37,741,528 bytes, x86-64, stripped |
| Build | cuda_13.0.r13.0/compiler.36424714_0 (Aug 20 2025) |
| Decompilation | 40,185 functions, IDA Pro 8.x + Hex-Rays |
| Strings | 30,632 extracted |
| Call graph | 548,693 edges |
| Version string | Cuda compilation tools, release 13.0, V13.0.88 (sub_612DE0) |
| LLVM code | None — fully proprietary compiler |
| Default target | sm_75 (Turing) |
| Supported SMs | sm_75 through sm_121f (Turing through DGX Spark) |
| Internal codename | OCG (Optimizing Code Generator), Mercury (SASS encoder) |
Glossary
| Term | Meaning |
|---|---|
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words. Named in NVIDIA source paths and error strings. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization+codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers. The allocator works by computing these sets and mapping them to physical registers. |
| Opex | Operand expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields. Converts virtual register references, immediates, and address modes into the bit patterns Mercury expects. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) that embeds a secondary Mercury-encoded representation of the kernel alongside the primary .text section. Used for debug metadata and binary patching support. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB). Not a standard library — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. |
| EIATTR | Extended Info Attributes — per-kernel metadata encoded in .nv.info sections. Each attribute is a tag-length-value record carrying register counts, barrier usage, shared memory sizes, CRS stack depth, and other kernel properties consumed by the CUDA runtime and driver. |
Three Subsystems
PTXAS is not a monolithic assembler. It decomposes into three largely independent subsystems with distinct coding conventions, data structures, and lineages:
1. PTX Frontend (~3 MB, 0x400000--0x5AA000) — A Flex-generated DFA scanner (sub_720F00, 64 KB, ~552 rules) feeds tokens into a Bison-generated LALR(1) parser (sub_4CE6B0, 48 KB). The parser is driven from sub_446240 (the real main, 11 KB), which orchestrates the full pipeline: parse, DAGgen, OCG, ELF, DebugInfo. The frontend also contains 1,141 instruction descriptors registered via sub_46E000 (93 KB) that define accepted type combinations for every PTX opcode, 608 CUDA runtime intrinsics registered in sub_5D1660 (46 KB), and a suite of per-instruction semantic validators (0x460000--0x4D5000) that check architecture requirements, type compatibility, and operand constraints before lowering. See PTX Parser and Entry Point & CLI.
2. Ori Optimizer (~8 MB, 0x5AA000--0xC52000) — A proprietary 159-phase optimization pipeline managed by the PhaseManager (sub_C62720). The phase factory at sub_C60D30 is a 159-case switch that allocates polymorphic phase objects from a vtable table at off_22BD5C8. Each phase has virtual methods for execute(), isNoOp(), and getName(). Major subsystems include: a fatpoint-based register allocator (sub_957160 core, sub_95DC10 driver, sub_926A30 interference graph builder), a 3-phase instruction scheduler (sub_688DD0 with ReduceReg/DynBatch modes and 9 register pressure counters), copy propagation, strength reduction, predication (if-conversion), rematerialization, and GMMA/WGMMA pipelining. The pipeline reads its default phase ordering from a 159-entry table at 0x22BEEA0. See Optimization Pipeline and Phase Manager.
3. SASS Backend (~14 MB, 0xC52000--0x1CE3000) — The Mercury encoder generates native SASS binary code. Instruction encoding is handled by ~4,000 per-variant handler functions (683 + 678 = 1,361 in the SM100 Blackwell encoding tables alone at 0xED1000--0x107B000, with additional tables for other SM generations). Each handler follows a rigid template: set opcode ID, load a 128-bit encoding format descriptor via SIMD, initialize a 10-slot register class map, register operand descriptors via sub_7BD3C0/sub_7BD650/sub_7BE090, finalize with sub_7BD260, then extract bitfields from the packed instruction word. The backend also contains 3 peephole optimizers (the PeepholeOptimizer class at 0x7A5D10 with Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, and SchedulingAwarePatterns methods), a capsule Mercury ELF embedder for debug metadata (sub_1CB53A0, section .nv.capmerc), and a custom ELF emitter (sub_1C9F280, 97 KB) that builds the final CUBIN output. See SASS Code Generation, Mercury Encoder, and Peephole Optimization.
Additionally, the binary embeds a custom pool allocator (sub_424070, 3,809 callers), MurmurHash3-based hash maps (sub_426150 insert / sub_426D60 lookup), a thread pool with pthread-based parallel compilation support, and a GNU Make jobserver client for integration with build systems.
Compilation Pipeline
Both standalone and library-mode invocations converge on the same pipeline, visible in the timing strings emitted by sub_446240:
```
PTX text (.ptx file or string)
  |
  +-- Flex Scanner (sub_720F00, 64KB)
  |     552-rule DFA, off_203C020 transition table
  |     Tokens: 340+ terminal symbols for Bison grammar
  |
  +-- Bison LALR(1) Parser (sub_4CE6B0, 48KB)
  |     Semantic validators: 0x460000-0x4D5000
  |     1,141 instruction descriptors via sub_46E000
  |
  +-- Ori IR Construction (DAGgen phase)
  |     Internal representation: basic blocks + instruction DAG
  |     608 CUDA runtime intrinsics (sub_5D1660)
  |
  +-- 159-Phase Optimization Pipeline (PhaseManager, sub_C62720)
  |     Phase factory: sub_C60D30 (159-case switch)
  |     Fatpoint register allocator (sub_957160)
  |     3-phase instruction scheduler (sub_688DD0)
  |     Copy propagation, CSE, strength reduction, predication,
  |     rematerialization, GMMA pipelining, late legalization
  |
  +-- Mercury SASS Encoder
  |     Instruction encoding: ~4000 per-variant handlers
  |     3 peephole optimizers (PeepholeOptimizer at 0x7A5D10)
  |     WAR hazard resolution (sub_6FC240)
  |     Operand expansion (Opex pipeline)
  |
  +-- ELF/CUBIN Output (sub_1C9F280, 97KB)
        Sections: .text, .nv.constant0, .nv.info, .symtab
        Capsule Mercury: .nv.capmerc (debug metadata)
        DWARF: .debug_line, .debug_info, .debug_frame
```
The driver at sub_446240 reports per-stage timing: Parse-time, CompileUnitSetup-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time, plus PeakMemoryUsage in KB. For multi-entry PTX files, each compile unit is processed independently with the header "\nCompile-unit with entry %s".
Dual Compilation Modes
PTXAS operates in two modes selected at invocation:
| | Standalone CLI | Library Mode |
|---|---|---|
| Invocation | ptxas [options] file.ptx | Called from nvcc/nvlink as a subprocess |
| Entry | main at 0x409460 | sub_9F63D0 (library/ftrace entry) |
| Real driver | sub_446240 (11 KB) | Same pipeline, alternate setup |
| Input | PTX file on disk | PTX string via --input-as-string |
| Output | .cubin / .o file | Binary blob returned to caller |
| Usage string | "Usage : %s [options] <ptx file>,...\n" | N/A |
The main function (0x409460, 84 bytes) is a thin wrapper: it stores argv[0], sets stdout/stderr to unbuffered via setvbuf, and delegates to sub_446240. The --input-as-string flag enables accepting PTX source directly as a CLI argument rather than reading from a file.
Configuration
PTXAS exposes three layers of configuration:
CLI Options (~100 flags) — Registered in sub_432A00 and parsed by sub_434320. Key options include --gpu-name (target SM), --maxrregcount (register limit), --opt-level (0--4), --verbose, --warn-on-spills, --warn-on-local-memory-usage, --fast-compile, --fdevice-time-trace (Chrome trace JSON output), --compile-as-tools-patch (sanitizer mode), and --extensible-whole-program. Help is printed by sub_403588 which calls sub_1C97640 to enumerate all registered options.
Internal Knobs (1,294 ROT13-encoded entries) — A separate configuration system implemented in generic_knobs_impl.h (source path recovered: /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h). The knob table is populated by two massive static constructors: ctor_005 at 0x40D860 (80 KB, ~2,000 general OCG knobs) and ctor_007 at 0x421290 (8 KB, 98 Mercury scheduler knobs). All knob names are ROT13-obfuscated in the binary. Examples after decoding: MercuryUseActiveThreadCollectiveInsts, MercuryTrackMultiReadsWarLatency, MercuryPresumeXblockWaitBeneficial, ScavInlineExpansion, ScavDisableSpilling. Knobs are read from environment variables and knob files via ReadKnobsFile (sub_79D070) which parses [knobs]-header INI files. Lookup is performed by GetKnobIndex (sub_79B240) with inline ROT13 decoding and case-insensitive comparison. See Knobs System.
SM Profile Tables — Per-architecture capability maps initialized by sub_607DB0 (14 KB) which creates 7 hash maps indexing sm_XX / compute_XX strings to handler functions. Profile objects are constructed by sub_6765E0 (54 KB) with architecture-to-family mappings (sm_75 -> Turing, sm_80/86/87/88 -> Ampere, sm_89 -> Ada Lovelace, sm_90/90a -> Hopper, sm_100/100a/100f -> Blackwell, sm_103/103a/103f -> Blackwell Ultra, sm_110/110a/110f -> Jetson Thor, sm_120/120a/120f -> RTX 50xx, sm_121/121a/121f -> DGX Spark). See SM Architecture Map.
Reading This Wiki
The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with GPU compiler experience.
Section Index
Overview
- Function Map — Address-to-identity lookup for key functions with confidence levels.
- Binary Layout — Subsystem address map at pass granularity.
- Methodology — How this analysis was performed.
- Version Tracking — Cross-version address deltas.
Compilation Pipeline
- Pipeline Overview — End-to-end PTX-to-SASS flow diagram with links to every stage.
- Entry Point & CLI — CLI parsing, `main` at 0x409460, the real driver at `sub_446240`.
- PTX Parser (Flex + Bison) — 552-rule Flex DFA scanner, Bison LALR(1) parser, instruction descriptor table.
- PTX Directive Handling — `.version`, `.target`, `.entry`, `.func`, `.reg`, `.shared`, `.const` processing.
- PTX-to-Ori Lowering — How parsed PTX is lowered into the Ori internal representation.
- Optimization Pipeline (159 Phases) — PhaseManager, phase factory, default phase ordering, per-phase timing.
- SASS Code Generation — Mercury encoder, instruction selection, operand expansion.
- ELF/Cubin Output — Custom ELF emitter, section layout, capsule Mercury, DWARF generation.
Ori IR — Internal Representation
- IR Overview & Design — Instruction DAG, basic blocks, typed virtual registers.
- Instructions & Opcodes — Ori opcode set and instruction encoding.
- Basic Blocks & CFG — Control flow graph construction and manipulation.
- Register Model (R/UR/P/UP) — Four register classes and their constraints.
- Data Structure Layouts — Memory layout of key IR objects.
Optimization Passes
- Pass Inventory & Ordering — All 159 phases with names, addresses, and pipeline positions.
- Phase Manager Infrastructure — Phase factory, vtable dispatch, execute/isNoOp/getName.
- GeneralOptimize Bundles — Mega-pass bundles that group related sub-passes.
- Loop Passes — Unrolling, LICM, induction variable optimization, strength reduction.
- Copy Propagation & CSE — Value forwarding and common subexpression elimination.
- Predication — If-conversion for GPU divergence control.
- Rematerialization — Recomputing values to reduce register pressure.
- Synchronization & Barriers — Barrier insertion and dead barrier elimination.
- Late Expansion & Legalization — Final lowering before codegen.
Register Allocation
- Allocator Architecture — Fatpoint algorithm, interference graph, spilling, ABI constraints.
- Fatpoint Algorithm — Core allocation loop and heuristics.
- Spilling — Spill cost model and spill code generation.
- GPU ABI & Calling Convention — Register assignment rules and caller/callee contracts.
Instruction Scheduling
- Scheduler Architecture — 3-phase scheduler, ReduceReg/DynBatch modes.
- Scheduling Algorithm — Priority list scheduling with register pressure tracking.
- Latency Model & HW Profiles — Per-SM instruction latency tables.
- Scoreboards & Dependency Barriers — WAR hazard resolution, barrier allocation.
SASS Code Generation
- Code Generation Overview — Instruction selection, encoding, peephole, Mercury.
- Instruction Selection — Pattern-based DAG-to-SASS lowering.
- SASS Instruction Encoding — 128-bit instruction word format and bitfield packing.
- Peephole Optimization — Three peephole dispatchers with SM-variant patterns.
- Mercury Encoder — Per-variant handler architecture, encoding tables.
- Capsule Mercury & Finalization — `.nv.capmerc` section, debug metadata embedding.
- SASS Text Generation — Disassembly-format printing for `--verbose` output.
GPU Architecture Targets
- SM Architecture Map — SM feature gates from sm_75 through sm_121f.
- Turing & Ampere (SM 75--88) — Feature delta between generations.
- Ada & Hopper (SM 89--90a) — Async copy, TMA, distributed shared memory.
- Blackwell (SM 100--121) — TCGen05, fifth-gen tensor cores, new SM variants.
- TCGen05 — 5th Gen Tensor Cores — Blackwell tensor core instruction set.
CUDA Intrinsics
- Intrinsic Table (608 Entries) — Math, tensor, sync, warp intrinsics.
- Math Intrinsics — Fast-math, Newton-Raphson, special functions.
- Tensor Core Intrinsics — WMMA, GMMA, WGMMA instruction families.
- Sync & Warp Intrinsics — Barrier, vote, shuffle, match.
ELF/Cubin Output
- Custom ELF Emitter — ELFW internals, section construction, symbol table.
- Section Catalog & EIATTR — `.nv.info` attribute encoding, per-kernel metadata.
- Debug Information — DWARF generation for GPU debugging.
- Relocations & Symbols — CUBIN relocation types and symbol conventions.
Configuration
- CLI Options — ~100 flags registered in `sub_432A00`.
- Knobs System (1,294 Knobs) — ROT13 knob table, environment variables, INI files.
- Optimization Levels — O-level to phase mapping, `--fast-compile` tiers.
- DUMPIR & NamedPhases — Dumping IR at specific pipeline points.
Infrastructure
- Memory Pool Allocator — `sub_424070`, 3,809 callers, arena-style allocation.
- Hash Tables & Bitvectors — MurmurHash3-based maps, bitvector liveness sets.
- Thread Pool & Concurrency — pthread pool, GNU Make jobserver client.
Reference
- SASS Opcode Catalog — Complete SASS opcode enumeration.
- PTX Instruction Table — All PTX instructions with type signatures.
- EIATTR Attribute Catalog — Tag-length-value format for `.nv.info` attributes.
Reading Path 1: End-to-End Pipeline Understanding
Goal: understand how PTX text becomes SASS binary, what each stage does, and how control flows between subsystems.
- Pipeline Overview — The complete flow diagram. Establishes all stages and their address ranges.
- Entry Point & CLI — How ptxas is invoked, the ~100 CLI flags, and the `sub_446240` driver function.
- PTX Parser — The Flex scanner and Bison parser. How PTX text becomes an internal parse tree.
- PTX-to-Ori Lowering — How the parse tree is lowered to Ori IR (basic blocks + instruction DAG).
- Optimization Pipeline — The 159-phase PhaseManager. Phase factory, ordering, timing infrastructure.
- SASS Code Generation — Mercury encoder, instruction selection, operand expansion, peephole.
- ELF/Cubin Output — Custom ELF emitter, section layout, DWARF debug info, capsule Mercury.
Reading Path 2: Reimplementing a Specific Pass
Goal: reproduce the exact behavior of one optimization phase deeply enough to write a compatible replacement.
- Pass Inventory & Ordering — Locate the phase in the 159-entry table. Note its index, vtable address, and pipeline position.
- The phase's dedicated page (e.g., Copy Propagation & CSE, Predication). Every dedicated page contains the function address, decompiled algorithm, data flow, and controlling knobs.
- Knobs System — Find which ROT13 knobs control the phase's behavior (enable/disable toggles, thresholds).
- Ori IR Overview — Understand the IR data structures the phase operates on.
- Register Model — The R/UR/P/UP register classes and their constraints.
- Function Map — Cross-reference internal function addresses with the master function map.
Reading Path 3: Debugging Correctness
Goal: diagnose a miscompilation, crash, or incorrect SASS output by tracing the problem to a specific phase.
- DUMPIR & NamedPhases — How to dump IR at specific pipeline points. Use `DUMPIR` to observe the IR before and after each phase.
- Optimization Levels — Compare phase pipelines at different O-levels. If a bug appears at `-O2` but not `-O1`, the diff identifies suspect phases.
- Pipeline Overview — The pipeline is linear: Parse -> DAGgen -> OCG (159 phases) -> Mercury -> ELF. The stage where output first goes wrong narrows the search.
- Knobs System — Check whether the suspect phase has enable/disable knobs. Toggle them to confirm or rule out the phase.
- Instruction Scheduling and Scoreboards & Dependency Barriers — If the generated SASS hangs or produces wrong results under specific warp configurations, the scheduler or barrier insertion may be at fault.
Reading Path 4: Tuning Performance
Goal: understand what ptxas does at each optimization level and what knobs control aggressiveness.
- Optimization Levels — The O-level to phase mapping, including `--fast-compile` tiers.
- Knobs System — The 1,294 ROT13-encoded internal tuning parameters. The primary mechanism for fine-grained control.
- Register Allocation — The fatpoint allocator directly determines register count, which determines maximum occupancy.
- Instruction Scheduling — The scheduler's ReduceReg and DynBatch modes, WAR hazard resolution, and interaction with register pressure.
- Peephole Optimization — The 3 peephole dispatchers that perform late SASS-level rewrites.
- SM Architecture Map — Per-SM feature gates that influence code generation decisions.
Function Map
Binary: ptxas v13.0.88, 37.7 MB stripped ELF, ~40,000 functions
Documented: 2,063 unique functions across 70 wiki pages
This page: Top ~100 most cross-referenced functions, plus routing tables
Complete listings: Each wiki page has its own Function Map section with full details
This page is the central lookup index for identified functions in ptxas. It lists the functions that appear most frequently across the wiki (cross-cutting infrastructure and major entry points), and provides routing tables to find any function by address range or subsystem.
Confidence levels: CERTAIN = named in symbols or strings. HIGH = strong evidence from strings and call patterns (>90%). MEDIUM = structural analysis with partial string evidence (70-90%).
Core Infrastructure
These functions appear in 10+ wiki pages -- they are the universal building blocks called by nearly every subsystem.
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x424070 | pool_alloc(pool, size) | 19 | 3,809 | Custom slab allocator, 8-byte aligned |
0x4248B0 | pool_free(ptr) | 8 | 1,215 | Coalescing free, boundary tags |
0x4280C0 | get_thread_local_context | 10 | 3,928 | Most-called function in ptxas; 280-byte TLS struct |
0x42BDB0 | fatal_OOM_handler | 8 | 3,825 | Called on every allocation failure |
0x426150 | hashmap_put(map, key, value) | 11 | 2,800 | Open-addressing + chaining, auto-resize |
0x426D60 | hashmap_get(map, key) | 11 | 422 | Returns value or 0 |
0x425CA0 | hashmap_create(hash_fn, cmp_fn, cap) | 7 | 127 | Integer/pointer/custom hash modes |
0x427630 | murmurhash3_x86_32(str) | 5 | 73 | Constants: 0xcc9e2d51, 0x1b873593 |
0x42D850 | hashset_insert(set, key) | 4 | 282 | Hash set variant |
0x42FBA0 | diagnostic_emit(desc, loc, fmt...) | 7 | 2,350 | Central error/warning reporter |
0x42F590 | fatal_internal_error(desc, ...) | 8 | 3,825 | Assertion handler |
0x4279D0 | starts_with(str, prefix) | 4 | 185 | Returns suffix pointer or 0 |
0x42CA60 | list_push_front(node, head_ptr) | 4 | 298 | Pool-allocated linked list |
0xBDBA60 | bitvector_allocate | 8 | many | (bits+31)>>5 word count |
0xBDCDE0 | bitvector_or_assign (SSE2) | 5 | many | _mm_or_si128 on 128-bit chunks |
Details: Memory Pools, Hash & Bitvector, Threading
Compilation Driver & CLI
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x409460 | main | 5 | 1 | Delegates to 0x446240 |
0x446240 | real_main (top-level driver) | 13 | 1 | Orchestrates entire pipeline |
0x4428E0 | ptx_input_setup | 6 | 1 | Version/target validation |
0x43CC70 | per_entry_compile_unit | 5 | 1 | Processes each entry through pipeline |
0x43F400 | function_abi_config | 4 | 1 | Parameter regs, return addr, scratch |
0x43A400 | compilation_target_config | 7 | 1 | SM-specific defaults |
0x43B660 | register_constraint_calculator | 5 | 1 | Balances .maxnreg, occupancy |
0x432A00 | option_registration | 9 | 1 | CLI option definitions |
0x434320 | option_parser | 9 | 1 | Validates combinations, applies state |
Details: Pipeline Entry, Pipeline Overview, CLI Options
PTX Front End
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x46E000 | instruction_table_builder | 9 | 1 | 93 KB, 1168 callees, one per PTX opcode |
0x451730 | parser_setup (special register init) | 9 | 1 | %ntid, %laneid, %clock, etc. |
0x4CE6B0 | bison_parser (directive/decl) | 7 | 1 | .local_maxnreg, .alias, .pragma |
0x720F00 | flex_lexer (ptxlex / yylex) | 8 | 2 | ~550 Flex rules, DFA scanner |
0x4B2F20 | ptx_validator_general | 4 | 1 | Validates texture, surface, cvt, call |
0x4C5FB0 | ptx_validator_mma_wmma_tcgen05 | 4 | 1 | MMA, WMMA, tensor core validation |
0x71F630 | preprocessor_dispatch | 4 | 1 | .MACRO, .ELSE, .INCLUDE |
0x489050 | ptx_to_ori_converter | 5 | 1 | PTX AST to ORI IR translation |
Details: PTX Parser, PTX Directives, PTX to ORI
Static Initialization
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x4094C0 | ctor_001 -- thread infra init | 4 | 0 | pthread_key_create, mutex |
0x4095D0 | ctor_003 -- PTX opcode name table | 6 | 0 | ~900 ROT13-encoded PTX mnemonics |
0x40D860 | ctor_005 -- tuning knob registry | 6 | 0 | 80 KB, 2000+ ROT13 knob names |
0x421290 | ctor_007 -- scheduler knob registry | 4 | 0 | 98 ROT13 scheduler knobs |
Details: Pipeline Entry, Binary Layout
Phase Manager & Optimization Framework
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0xC60D30 | phase_factory (159-case switch) | 12 | 1 | Allocates phase objects |
0xC62720 | PhaseManager_ctor | 10 | 2 | 159-entry phase table |
0xC64F70 | phase_dispatch_loop | 5 | 2 | Executes phases, reports timing |
0xC64310 | per_phase_timing_reporter | 5 | 1 | "[Total N KB] [Freeable N KB]" |
0xC641D0 | phase_name_to_index_lookup | 5 | 3 | Binary search, case-insensitive |
0x7DDB50 | phase_run_dispatch | 14 | many | Vtable-based phase execution |
0x9F4040 | NamedPhases_parse_and_build | 6 | 1 | "shuffle", "OriCopyProp", etc. |
0x798B60 | NamedPhases_parser | 4 | 2 | PTXAS_DISABLE env var parsing |
0x799250 | IsPassDisabled | 5 | 4 | Checks knob index 185 |
0xA36360 | pass_sequence_builder | 6 | 1 | Constructs NvOptRecipe pass list |
Details: Phase Manager, Pass Inventory, Optimizer Pipeline
ORI IR & Instruction Access
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x9253C0 | instruction_operand_get | 11 | many | Operand accessor on ORI instructions |
0x7E6090 | instruction_modifier_set | 10 | many | IR modification helper |
0x781F80 | instruction_iterator | 12 | many | Doubly-linked list traversal |
0x7DF3A0 | instruction_property_query | 5 | many | Instruction flag/attribute checker |
0x91BF30 | register_type_query | 8 | many | Register class/type inspection |
0x9314F0 | register_class_id_query | 7 | 1,547 | Most-called non-trivial regalloc fn |
0x931920 | register_class_compat_checker | 6 | 328 | Pair register class handling |
0x934630 | register_id_packer | 9 | 856 | Packs reg#/class/type into 32-bit |
0xB28E00 | ir_node_type_query | 5 | many | Node kind discrimination |
0xB28E90 | ir_node_field_accessor | 6 | many | Generic field getter |
0xA50650 | CodeObject_EmitRecords | 1 | 8 | 74 KB, ORI record serializer (56 section types) |
0xA53840 | EmitRecords_wrapper | 1 | 1 | Thin wrapper, adds type-44 header |
Details: Instructions, Registers, Data Structures, CFG
Intrinsic Infrastructure
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x5D1660 | intrinsic_table_register (608 entries) | 7 | 1 | Master name-to-ID table |
0x5D4190 | intrinsic_dispatch_builder | 13 | 1 | PTX opcode -> codegen handler mapping |
0x5FF700 | intrinsic_prototype_emitter | 5 | 1 | 354 KB -- largest function in binary |
0x5C7A50 | wmma_mma_codegen | 4 | 1 | 173 KB, all shapes/types/layouts |
0x5C10A0 | mma_codegen (mma.sync) | 4 | 1 | 120 KB, m8n8k4 through m16n8k256 |
0x5BBC30 | tcgen05_mma_codegen (Blackwell) | 5 | 1 | 90 KB, 5th-gen tensor core |
0x70FA00 | ocg_intrinsic_handler | 8 | 1 | OCG-level intrinsic routing |
0x6A97B0 | intrinsic_lowering_main | 4 | 1 | 26 KB, switch-based lowering |
0x6C9EB0 | ocg_builtin_name_lookup | 5 | 1 | Blackwell+ OCG name table |
Details: Intrinsics Index, Math Intrinsics, Tensor Intrinsics, Sync & Warp
Register Allocator
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x9721C0 | regalloc_entry ("REGALLOC GUIDANCE") | 6 | 1 | Top-level allocator entry |
0x957160 | fatpoint_allocator_core | 7 | 1 | Core fatpoint graph coloring |
0x96D940 | spill_guidance_engine | 5 | 1 | Determines spill strategy |
0x971A90 | full_alloc_with_spill_retry | 4 | 1 | "NOSPILL REGALLOC" path |
0x9714E0 | regalloc_failure_reporter | 6 | 1 | "Register allocation failed..." |
0x926A30 | interference_graph_builder | 9 | 7 | 22 KB, SSE bitvectors |
0x92C240 | liveness_bitvector_ops | 5 | 87 | Set/clear/query with aliasing |
0x917A60 | opcode_to_regclass_mapping | 4 | 221 | Massive switch |
0x910840 | ConvertMemoryToRegisterOrUniform | 5 | 1 | Pass driver |
Details: RegAlloc Overview, RegAlloc Algorithm, Spilling, ABI
Instruction Scheduling
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x8D0640 | ScheduleInstructions (top-level) | 7 | 1 | String: "ScheduleInstructions" |
0x688DD0 | scheduler_engine (main BB loop) | 5 | 1 | ReduceReg / DynBatch selection |
0x8C9320 | scheduling_priority_function | 4 | 0 | ~300 locals, core heuristic |
0x68B9C0 | dependency_graph_builder | 4 | 1 | RAW/WAR/WAW hazard analysis |
0x6820B0 | build_ready_list | 5 | 1 | Zero-dependency instructions |
0x8CD6E0 | reverse_scheduling_driver | 4 | 1 | Reverse post-order iteration |
0x8CEE80 | register_budget_with_occupancy | 4 | 1 | Pressure coeff default 0.045 |
0x8E4400 | hw_profile_table_init | 6 | 3 | Encoding/latency property tables |
0xA9CDE0 | scheduling_metadata_builder | 6 | 1 | Per-instruction sched metadata |
0xA9CF90 | scheduling_metadata_accessor | 5 | many | Sched metadata field queries |
0xAED3C0 | scheduling_optimization_mega_pass | 4 | 0 | 137 KB, ~560 locals, largest vtable pass |
Details: Scheduling Overview, Scheduling Algorithm, Latency Model, Scoreboards
Codegen & ISel
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x169B190 | isel_pattern_dispatch (master) | 5 | 1 | 280 KB, 65,999 insns -- largest by insn count |
0x143C440 | sm120_peephole_dispatch | 4 | 1 | SM120 (RTX 50), 373-case switch |
0x198BCD0 | sm100_peephole_dispatch | 4 | 1 | SM100 (Blackwell), 1336 callees |
0x83EF00 | main_peephole_pass | 6 | 0 | 29 KB, 392 callees |
0x6D9690 | master_instruction_encoder | 7 | 1 | 94 KB, opcode switch |
0x6E4110 | sass_codegen_main | 4 | 1 | EmitSASSForFunction, FNV-1a BB hash |
0x6F52F0 | SASS_pipeline_run_stages | 5 | 1 | Mercury SASS compilation pipeline |
0x9ED2D0 | MercConverter_entry | 6 | 1 | ORI to Mercury IR conversion |
0x9F1A90 | MercConverter_builder | 6 | 1 | Mercury instruction construction |
Bitfield Encoding
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x7B9B80 | bitfield_insert(insn, off, wid, val) | 9 | 18,347 | Most-called by caller count |
0x7BC030 | encode_register_operand | 4 | 6,147 | 1-bit + 4-bit type + 10-bit reg |
0x7B9D60 | encode_reuse_flags_predicate | 4 | 2,408 | 1-bit reuse + 5-bit predicate |
0x7BC5C0 | encode_immediate_const_operand | 4 | 1,449 | Const buffer index or immediate |
0x7BCF00 | encode_predicate_register | 4 | 1,657 | PT=14, 2-bit type + 3-bit condition |
0x10B6180 | 1_bit_boolean_encoder | 3 | 8,091 | .S/.U, .STRONG, etc. |
Details: Encoding, SASS Printing
ELF / CUBIN Output
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x612DE0 | section_attr_builder | 11 | 1 | 76 KB, ELF section/attribute config |
0x1C9F280 | master_elf_emitter | 9 | 1 | Complete CUBIN assembly |
0x1CB53A0 | elf_world_init | 7 | 1 | 672-byte ELFW context |
0x1CB68D0 | symbol_table_builder | 5 | 1 | .symtab from internal symbols |
0x1CABD60 | master_section_allocator | 5 | 1 | Shared/const/local memory |
0x1CB3570 | add_function_section | 5 | 44 | Creates .text.FUNCNAME + .rela |
0x1CD48C0 | relocation_processor | 5 | 1 | Relocation section emission |
0x1C9B110 | mercury_capsule_builder | 4 | 1 | Creates embedded .nv.merc ELF |
Details: ELF Emitter, Sections, Relocations, Debug Info, Capsule Mercury
Knobs System
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x79B240 | GetKnobIndex | 6 | 2 | ROT13 name lookup, case-insensitive |
0x79D070 | ReadKnobsFile | 5 | 1 | Parses [knobs] section from file |
0x79F540 | ParseKnobValue | 4 | 1 | 12-type switch: bool/int/float/string/... |
0x79D990 | ProcessKnobs (top-level) | 4 | 1 | File + pragma + numbered config |
0xA0F020 | knob_conditional_evaluator | 5 | many | [WHEN condition] handler |
Details: Knobs, Opt Levels
Target-Specific Code
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x6765E0 | target_profile_selector | 7 | 1 | SM-dependent profile dispatch |
0x607DB0 | target_feature_query | 7 | many | SM feature capability checks |
0x896D50 | sass_mnemonic_table_init (ROT13) | 4 | 1 | ~400+ SASS instruction names |
0x89FBA0 | instruction_latency_init | 4 | 3 | Encoding/latency property tables |
Details: Targets Index, Turing-Ampere, Ada-Hopper, Blackwell, tcgen05
Subsystem Routing Table
To find a specific function, locate it by address range or subsystem topic in this table. Each page contains a detailed Function Map section with complete listings.
By Subsystem Topic
| Subsystem | Primary Pages | Functions |
|---|---|---|
| Memory allocator, pools | memory-pools.md | 30 |
| Hash maps, bitvectors, sets | hash-bitvector.md | 51 |
| Threading, TLS, jobserver | threading.md | 41 |
| CLI parsing, option handling | cli-options.md | 17 |
| Tuning knobs (2000+ knobs) | knobs.md | 56 |
| Optimization levels | opt-levels.md | 14 |
| DumpIR debug output | dumpir.md | 14 |
| Compilation pipeline | overview.md, entry.md | 56+25 |
| PTX lexer & parser | ptx-parser.md | 75 |
| PTX directives | ptx-directives.md | 41 |
| PTX-to-ORI translation | ptx-to-ori.md | 41 |
| Optimizer pipeline | optimizer.md | 28 |
| ORI instruction IR | instructions.md | 80 |
| CFG construction | cfg.md | 18 |
| Register representation | registers.md | 40 |
| IR data structures | data-structures.md | 74 |
| Phase manager (159 phases) | phase-manager.md | 26 |
| Copy propagation, CSE, GVN | copy-prop-cse.md | 65 |
| General optimization passes | general-optimize.md | 71 |
| Loop optimization (unroll, LICM, SWP) | loop-passes.md | 92 |
| Branch/switch optimization | branch-switch.md | 24 |
| Strength reduction | strength-reduction.md | 25 |
| Predication | predication.md | 28 |
| Rematerialization | rematerialization.md | 55 |
| Liveness analysis | liveness.md | 42 |
| Sync barriers | sync-barriers.md | 66 |
| Late legalization | late-legalization.md | 59 |
| Hot/cold splitting | hot-cold.md | 10 |
| GMMA pipelining | gmma-pipeline.md | 47 |
| Uniform registers | uniform-regs.md | 22 |
| Register allocator core | algorithm.md | 50 |
| Spilling | spilling.md | 54 |
| ABI handling | abi.md | 87 |
| Scheduling overview | overview.md | 112 |
| Scheduling algorithm | algorithm.md | 121 |
| Latency model & HW profiles | latency-model.md | 78 |
| Scoreboards & barriers | scoreboards.md | 56 |
| ISel pattern matching | isel.md | 182 |
| SASS encoding | encoding.md | 92 |
| Peephole optimization | peephole.md | 67 |
| Mercury IR conversion | mercury.md | 79 |
| SASS templates | templates.md | 46 |
| SASS printing / renderer | sass-printing.md | 96 |
| Capsule Mercury | capmerc.md | 20 |
| Intrinsic infrastructure | index.md | 159 |
| Math intrinsics | math.md | 42 |
| Tensor core intrinsics | tensor.md | 45 |
| Sync & warp intrinsics | sync-warp.md | 65 |
| SM targets & features | index.md | 70 |
| ELF emitter | elf-emitter.md | 29 |
| ELF sections | sections.md | 33 |
| Debug info (DWARF) | debug-info.md | 33 |
| Relocations | relocations.md | 19 |
By Address Range
Functions in the binary are clustered by subsystem. This table maps address ranges to the pages that document them.
| Address Range | Primary Subsystem | Key Pages |
|---|---|---|
0x400000-0x424000 | Entry, static init, main | entry.md, binary-layout.md |
0x424000-0x42E000 | Memory pools, hash maps, lists | memory-pools.md, hash-bitvector.md |
0x42E000-0x446000 | Diagnostics, CLI parsing | cli-options.md, entry.md |
0x446000-0x452000 | Compilation driver | overview.md, entry.md |
0x452000-0x4D5000 | PTX parser & validator | ptx-parser.md, ptx-directives.md |
0x4D5000-0x5AA000 | PTX-to-ORI, early IR | ptx-to-ori.md, instructions.md |
0x5AA000-0x612000 | Intrinsic infrastructure | index.md, math.md, tensor.md |
0x612000-0x67F000 | Section builder, target config | sections.md, index.md |
0x67F000-0x6E4000 | Scheduling engine, OCG lowering, encoding | overview.md, encoding.md |
0x6E4000-0x754000 | SASS codegen, SASS pipeline | mercury.md, overview.md |
0x754000-0x7C0000 | Liveness, knobs, bitfield encoding | liveness.md, knobs.md, encoding.md |
0x7C0000-0x8FE000 | Peephole, SASS mnemonics, scheduling upper | peephole.md, algorithm.md |
0x8FE000-0x9D3000 | Register allocator | overview.md, algorithm.md, abi.md |
0x9D3000-0xAA8000 | Post-regalloc, named phases, remat | rematerialization.md, phase-manager.md |
0xAA8000-0xC52000 | Mega-passes, sync barriers, dataflow | sync-barriers.md, general-optimize.md |
0xC52000-0xD27000 | Phase manager, phase factory | phase-manager.md, optimizer.md |
0xD27000-0x10B7000 | 592 SASS encoder bodies | encoding.md, isel.md |
0x10B7000-0x1225000 | Field encoders, ISel helpers | encoding.md, isel.md |
0x1225000-0x13CF000 | Bitvector, ISel coordinators | hash-bitvector.md, isel.md |
0x13CF000-0x17F8000 | SM-specific ISel, pattern matchers, templates | isel.md, templates.md |
0x17F8000-0x1C21000 | SASS printing, peephole mega-dispatchers | sass-printing.md, peephole.md |
0x1C21000-0x1CE3000 | ELF emitter, capsule mercury, relocations | elf-emitter.md, capmerc.md |
Statistics
Top 10 Most-Called Functions
| Rank | Address | Identity | Callers |
|---|---|---|---|
| 1 | 0x7B9B80 | bitfield_insert | 18,347 |
| 2 | 0x10B6180 | 1-bit boolean encoder | 8,091 |
| 3 | 0x7BC030 | encode_register_operand | 6,147 |
| 4 | 0x4280C0 | get_thread_local_context | 3,928 |
| 5 | 0x42BDB0 | fatal_OOM_handler | 3,825 |
| 6 | 0x424070 | pool_alloc | 3,809 |
| 7 | 0x426150 | hashmap_put | 2,800 |
| 8 | 0x7B9D30 | clear_const_buffer_slots | 2,408 |
| 9 | 0x7B9D60 | encode_reuse_flags_predicate | 2,408 |
| 10 | 0x42FBA0 | diagnostic_emit | 2,350 |
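With 18,347 call sites, bitfield_insert (0x7B9B80) is the workhorse of the SASS encoder. Its recovered behavior is the classic masked insert into a wide instruction word; a sketch (the argument order and the use of Python's arbitrary-precision ints for the 128-bit word are assumptions):

```python
def bitfield_insert(word: int, value: int, offset: int, width: int) -> int:
    """Insert `value` into bits [offset, offset+width) of `word`.
    Any previous contents of the field are cleared; `value` is
    truncated to `width` bits, matching masked-insert semantics."""
    mask = (1 << width) - 1
    return (word & ~(mask << offset)) | ((value & mask) << offset)
```

Thousands of template-generated encoding handlers reduce to sequences of such calls against a packed 128-bit machine word.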
Top 5 Largest Functions
| Rank | Address | Identity | Size |
|---|---|---|---|
| 1 | 0x5FF700 | intrinsic_prototype_emitter | 354 KB |
| 2 | 0x169B190 | isel_pattern_dispatch | 280 KB |
| 3 | 0x198BCD0 | sm100_peephole_dispatch | 233 KB |
| 4 | 0x143C440 | sm120_peephole_dispatch | 233 KB |
| 5 | 0x5C7A50 | wmma_mma_codegen | 173 KB |
Top 10 Most Cross-Referenced (by wiki page count)
| Rank | Address | Identity | Pages |
|---|---|---|---|
| 1 | 0x424070 | pool_alloc | 19 |
| 2 | 0x7DDB50 | phase_run_dispatch | 14 |
| 3 | 0x446240 | real_main | 13 |
| 3 | 0x5D4190 | intrinsic_dispatch_builder | 13 |
| 5 | 0x781F80 | instruction_iterator | 12 |
| 5 | 0xC60D30 | phase_factory | 12 |
| 7 | 0x9253C0 | instruction_operand_get | 11 |
| 7 | 0x612DE0 | section_attr_builder | 11 |
| 7 | 0x426150 | hashmap_put | 11 |
| 7 | 0x426D60 | hashmap_get | 11 |
Documentation Coverage
| Metric | Count |
|---|---|
| Total unique functions documented | 2,063 |
| Wiki pages with function maps | 70 |
| Functions in 5+ pages (high cross-reference) | 89 |
| Functions in 1 page only (subsystem-internal) | 1,324 |
| Confidence CERTAIN | ~40 |
| Confidence HIGH | ~1,400 |
| Confidence MEDIUM | ~620 |
Binary Layout
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
PTXAS v13.0.88 is a 37,741,528-byte stripped x86-64 ELF executable. Its .text section spans 26.2 MB (0x403520--0x1CE2DE2) containing 40,185 functions. This page maps every byte of the binary to the subsystem that owns it, derived from the 30 sweep reports (p1.01 through p1.30) covering the complete address range.
ELF Section Map
| Section | Address | Size | Notes |
|---|---|---|---|
.plt | 0x402C00 | 2,336 B (146 stubs) | Procedure linkage table for libc/libpthread imports |
.text | 0x403520 | 26,212,546 B (26.2 MB) | All executable code -- 40,185 functions |
.rodata | 0x1CE2E00 | 7,508,368 B (7.5 MB) | Read-only data: encoding tables, strings, DFA tables |
.eh_frame_hdr | 0x240BF90 | 358,460 B (350 KB) | Exception handling frame index |
.eh_frame | 0x2664A60 | 3,751,640 B (3.7 MB) | Unwinding data for 40K functions |
.gcc_except_table | 0x29F8938 | 940 B | C++ exception filter tables |
.ctors | 0x29F8CE8 | 104 B (12 entries) | Static constructor table |
.data.rel.ro | 0x29F8D60 | 4,256 B | Vtable pointers, resolved at load time |
.got.plt | 0x29FA000 | 1,184 B (148 entries) | Global offset table for PLT |
.data | 0x29FA4A0 | 14,032 B (13.7 KB) | Initialized globals: function pointers, defaults |
.bss | 0x29FDB80 | 85,864 B (83.9 KB) | Zero-init globals: knob tables, TLS keys, mutexes |
Total file composition:
| Component | Size | Percentage |
|---|---|---|
.text | 26.2 MB | 69.4% |
.rodata | 7.5 MB | 19.9% |
.eh_frame + .eh_frame_hdr | 4.0 MB | 10.7% |
.data + .bss + other | 0.1 MB | 0.3% |
Program Headers
| Segment | VirtAddr | MemSiz | Flags | Contents |
|---|---|---|---|---|
| LOAD 0 | 0x400000 | 32.4 MB | R E | .text + .rodata + headers + .eh_frame_hdr |
| LOAD 1 | 0x2664A60 | 3.7 MB | RW | .eh_frame + .data + .bss + .got |
| GNU_RELRO | 0x2664A60 | 3.6 MB | R | Read-only after relocation (.eh_frame through .data.rel.ro) |
| GNU_EH_FRAME | 0x240BF90 | 350 KB | R | Exception handling index |
| GNU_STACK | 0x0 | 0 | RW | Non-executable stack |
Entry point: 0x42333C (ELF e_entry), which is inside .text (the CRT startup stub _start). The actual main is at 0x409460.
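The e_entry value can be confirmed directly from the file header without any ELF library. A minimal pure-Python check (the synthetic header below stands in for the real binary):

```python
import struct

def read_elf64_entry(data: bytes) -> int:
    """Return e_entry from an ELF64 header (little-endian).
    e_entry is the 8-byte field at offset 24 of the 64-byte header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return struct.unpack_from("<Q", data, 24)[0]

# Synthetic 64-byte header used for illustration:
hdr = bytearray(64)
hdr[0:4] = b"\x7fELF"
struct.pack_into("<Q", hdr, 24, 0x42333C)
```

Running `read_elf64_entry(open("ptxas", "rb").read(64))` against v13.0.88 should return 0x42333C, matching the value above.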
Three Subsystems
The .text section decomposes into three subsystems with distinct coding styles, data structures, and origins:
.text linear address map (26.2 MB)
```
0x403520                    0x67F000                  0xC52000                      0x1CE2DE2
|--- PTX Frontend 2.9 MB ---|-- Ori Optimizer 5.8 MB --|---- SASS Backend 17.6 MB ----|
|            11%            |           22%            |             67%              |
| parsers, validators,      | passes, regalloc,        | encoding handlers, ISel,     |
| intrinsics, formatters    | scheduling, CFG analysis | peephole, codecs, ABI, ELF   |
```
| Subsystem | Address Range | Size | Functions | Share | Avg Fn Size | Largest Function |
|---|---|---|---|---|---|---|
| PTX Frontend | 0x403520--0x67F000 | 2.9 MB | ~2,592 | 11% | ~1,170 B | sub_46E000 (93 KB, opcode table builder) |
| Ori Optimizer | 0x67F000--0xC52000 | 5.8 MB | ~11,001 | 22% | ~550 B | sub_926A30 (155 KB decomp, interference graph) |
| SASS Backend | 0xC52000--0x1CE2DE2 | 17.6 MB | ~26,592 | 67% | ~690 B | sub_169B190 (280 KB, master ISel dispatch) |
The backend dominates the binary because SASS instruction encoding is template-generated code: each of the ~4,000 encoding handler functions is a standalone vtable entry, never called directly. The optimizer has the highest function density (many small pass helpers), while the frontend has the largest average function size (complex validators and parsers).
Complete .text Address Map
The table below maps every address range in the .text section to its subsystem, function count, and key entry points. Data is aggregated from the 30 sweep partitions (p1.01 through p1.30).
PTX Frontend (0x403520--0x67F000, 2.9 MB)
Note on the 0x400000--0x403520 gap: the LOAD segment begins at 0x400000, but the first 13.6 KB before .text contains the ELF header (64 B at 0x400000), program headers (7 entries, 392 B), .interp (28 B, path to ld-linux-x86-64.so.2), .hash/.gnu.hash (symbol hash tables), .dynsym/.dynstr (dynamic symbol table, 146 entries), .gnu.version/.gnu.version_r (symbol versioning), .rela.plt (PLT relocations, 146 entries), and the .plt stub table (2,336 B, 146 stubs at 0x402C00--0x403520). These are standard ELF infrastructure, not ptxas application code. The first ptxas function begins at 0x403520.
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0x403520--0x430000 | 178 KB | ~300 | Runtime infrastructure: pool allocator, hash maps, TLS, diagnostics, error reporting, string utilities | sub_424070 (pool alloc, 3809 callers), sub_4280C0 (TLS context, 3928 callers), sub_426150 (hash insert, 2800 callers), sub_42FBA0 (diagnostic emitter, 2350 callers), sub_427630 (MurmurHash3) |
0x430000--0x460000 | 200 KB | ~120 | CLI parsing and compilation driver: option registration, argument parser, target configuration, register/resource constraints, Chrome trace JSON parser | sub_446240 (real main, 11 KB), sub_432A00 (option registration, 6 KB), sub_434320 (option parser, 10 KB), sub_43B660 (register constraint calc), sub_439880 (trace JSON parser) |
0x460000--0x4D5000 | 470 KB | ~350 | PTX instruction validators: per-opcode semantic checkers for MMA, WMMA, load/store, cvt, atomics, barriers, tensormap, async copy | sub_4B2F20 (general validator, 52 KB), sub_4CE6B0 (Bison parser, 48 KB), sub_4C5FB0 (operand validator, 28 KB), sub_4C2FD0 (WMMA/MMA validator, 12 KB), sub_4A73C0 (tensormap validator, 11 KB) |
0x4D5000--0x5AA000 | 872 KB | 581 | PTX instruction text generation: 580 per-opcode formatters that convert internal IR to PTX assembly text, plus a built-in function declaration emitter | sub_5D4190 (formatter dispatch, 13 KB), sub_5FF700 (builtin decl emitter, 34 KB), ~580 formatter functions (avg 1.5 KB each) |
0x5AA000--0x67F000 | 874 KB | 628 | Intrinsic infrastructure: 608 CUDA intrinsic handlers, MMA/WMMA/tcgen05 tensor core codegen, SM profile tables (sm_75 through sm_121), special register init, ELF/DWARF finalization, memory space management | sub_5D1660 (608 intrinsics, 46 KB), sub_607DB0 (SM profile hash maps, 14 KB), sub_6765E0 (arch capability constructor, 54 KB), sub_612DE0 (version string) |
Ori Optimizer (0x67F000--0xC52000, 5.8 MB)
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0x67F000--0x754000 | 869 KB | ~500 | Mercury SASS backend core: scheduling engine (ReduceReg/DynBatch, 9 reg pressure counters), WAR hazard management, Opex (operand expansion) pipeline, OCG intrinsic lowering, instruction encoding core, Flex DFA scanner, ELF section helpers | sub_688DD0 (scheduler engine, 20 KB), sub_6D9690 (encoding switch, 94 KB), sub_6FC240 (WAR/scoreboard), sub_720F00 (Flex scanner, 64 KB, 552 rules) |
0x754000--0x829000 | 872 KB | 1,545 | Knobs infrastructure (1,294 entries) and peephole optimizer class: knob lookup/read/file parsing, PeepholeOptimizer with 7 virtual methods (Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, SchedulingAwarePatterns), pipeline orchestrator, Mercury operand registration helpers | sub_79B240 (GetKnobIndex), sub_79D070 (ReadKnobsFile), sub_7A5D10 (PeepholeOptimizer), sub_7BD3C0/sub_7BD650/sub_7BE090 (operand registrars), sub_7BD260 (encoding finalize) |
0x829000--0x8FE000 | 872 KB | 1,069 | Debug line tables, scheduler core, and HW profiles: ScheduleInstructions pipeline (context setup, priority computation, reverse scheduling, register budget with occupancy optimization), ROT13 SASS mnemonic table, architecture-specific latency/throughput profiles, constant bank naming, peephole/legalization passes, cutlass-aware scheduling heuristics | sub_8BF000--0x8D1600 (ScheduleInstructions), sub_896D50 (ROT13 SASS mnemonics), sub_8F0D00 (HW latency profiles), sub_8F4820 (cutlass heuristics) |
0x8FE000--0x9D3000 | 872 KB | 1,090 | Register allocator: fatpoint algorithm core, interference graph builder (155 KB decompiled -- largest non-dispatch function), spill/refill mechanism, live range analysis, retry with reduced register count, memory-to-register promotion, ConvertMemoryToRegisterOrUniform pass | sub_926A30 (interference graph, 155 KB decomp), sub_957160 (fatpoint core), sub_95DC10 (regalloc driver), sub_9714E0 (failure handler + retry), sub_910840 (ConvertMemoryToRegister) |
0x9D3000--0xAA8000 | 860 KB | 1,218 | Post-RA pipeline phases: NamedPhases registry (OriPerformLiveDead, OriCopyProp, shuffle, swap1--swap6), DAG/dependency analysis, IR statistics printer (instruction count, reg count, estimated latency, spill bytes, occupancy, throughput), hot/cold split, mbarrier intrinsics, regalloc verification, uninitialized register detection | sub_9F4040 (NamedPhases registry), sub_A3A7E0 (IR stats printer), sub_A0B5E0 (uninitialized reg detector), sub_A9EDB0 (mbarrier/scheduling, 85 KB decomp) |
0xAA8000--0xB7D000 | 862 KB | 4,493 | GMMA/WGMMA pipeline optimizer, ISel, and instruction emission: GMMA register allocation, warpgroup sync injection, instruction emission helpers (SASS encoder dispatch), post-scheduling IR statistics, operand legalization, 1,269 tiny vtable dispatchers (~160 bytes each), live range analysis, scheduler-integrated mega-pass | sub_AED3C0 (mega scheduling/ISel pass, 137 KB decomp), sub_AF7DF0/sub_AF7200 (register decode helpers), ~1,269 vtable dispatchers |
0xB7D000--0xC52000 | 870 KB | 1,086 | CFG analysis, bitvectors, and IR manipulation: ~390 instruction operand pattern matchers, bitvector dataflow framework (alloc, OR, AND, XOR, clear, iterate), CFG analysis (edge printing, reverse post-order, DOT graph dump), scoreboard and instruction classification, sync analysis | sub_BDC000 (bitvector infra), sub_BDE8B0 (CFG/RPO/DOT), sub_BE2E40 (scoreboard classification), ~390 operand pattern matchers |
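The fatpoint algorithm that anchors the register allocator (see the glossary) can be illustrated on a straight-line block: walk liveness backward, record the live set at each program point, and the maximum set size is the block's register demand. A toy model -- the real allocator works over the full CFG with iterative dataflow:

```python
def fatpoints(instrs):
    """Compute the fatpoint (live-in set) at each instruction of a
    single straight-line block. `instrs` is a list of (defs, uses)
    tuples in program order; liveness is propagated backward."""
    live = set()
    points = []
    for defs, uses in reversed(instrs):
        live -= set(defs)   # a def kills the register above this point
        live |= set(uses)   # a use makes it live above this point
        points.append(frozenset(live))
    return list(reversed(points))

# r1 = ...; r2 = ...; r3 = r1 + r2; use r3
block = [
    (("r1",), ()),
    (("r2",), ()),
    (("r3",), ("r1", "r2")),
    ((), ("r3",)),
]
```

The maximum fatness over all points (here 2, at the add) is what the allocator compares against the physical register budget before mapping virtuals to hardware registers or spilling.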
SASS Backend (0xC52000--0x1CE2DE2, 17.6 MB)
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager (159 phases): phase factory (159-case switch), phase vtable table at off_22BD5C8, default phase ordering table at 0x22BEEA0, 530 encoding table initialization bodies, instruction handler vtable bodies | sub_C60D30 (phase factory), sub_C62720 (PhaseManager constructor), sub_C60D20 (default table pointer), ~530 phase table body functions |
0xD27000--0xDFC000 | 853 KB | 592 | SASS encoder table (SM100 Blackwell, set 1): 592 uniform template-generated encoding handlers, each packing operands into a 1,280-bit instruction word at a1+544. Covers 60 opcode classes across 16 format groups. All vtable-dispatched (zero direct callers). | 592 per-variant handlers (avg 1,473 B), sub_7B9B80 (bitfield insert helper) |
0xDFC000--0xED1000 | 877 KB | 591 | SASS encoder/decoder (SM100 Blackwell, set 2): 494 encoders translating IR to packed SASS bitfields, plus 97 decoders for the reverse direction (disassembly/validation). All vtable-dispatched. | 494 encoders (0xDFC--0xEB2), 97 decoders (0xEB3--0xED0), sub_E0F370 (largest, 11 KB) |
0xED1000--0xFA6000 | 860 KB | 683 | SM100 SASS encoders (set 3): 683 per-variant encoding handlers for 59 SASS opcodes. Each sets opcode ID, loads 128-bit format descriptor via SSE, initializes 10-slot register class map, registers operands, finalizes, extracts bitfields. | 683 template-generated handlers, 128-bit xmmword format descriptors |
0xFA6000--0x107B000 | 851 KB | 678 | SM100 SASS encoders (set 4): 587 primary encoders (opcodes 16--372, predicate/comparison/memory/tensor/control flow), plus 91 alternate-form encoders for dual-width or SM-variant instruction encodings. Combined with sets 1--3: 2,544 SM100 encoding handlers total. Six mega dispatch tables. | 587 primary + 91 alternate-form encoders, 6 dispatch tables |
0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec completion: 641 final encoding handlers, 78 object lifecycle and scheduling support functions (FNV-1a hash, instruction construction), 2,095 bitfield accessor functions (machine-generated read/write primitives for the packed encoding format). Seven core extractors handle 1-bit, 2-bit, and multi-bit fields across 192-bit words. | sub_10AFF80 (instruction constructor, 11 KB, 32 params), 2,095 bitfield accessors, 7 core extractors |
0x1150000--0x1225000 | 852 KB | 733 | SASS codec (decoders + encoders): both directions of the instruction codec for an older SM target (likely sm_89 Ada Lovelace or sm_90 Hopper). Decoders read 128-bit words and extract fields; encoders pack fields back. Three mega-decoders (29--33 KB each) and two mega-dispatchers (78--104 KB, too large for Hex-Rays). | 3 mega decoders (29--33 KB), 2 mega dispatchers (78--104 KB), 728 of 733 vtable-dispatched |
0x1225000--0x12FA000 | 860 KB | 1,552 | Register-pressure scheduling + ISel + encoders: register-pressure-aware instruction scheduling (0x1225--0x1240), instruction selection and emission pipeline (0x1240--0x1254), 982 SASS binary encoders packing operand fields into 128-bit words (0x1254--0x12FA). All encoders vtable-dispatched. | Scheduling at 0x1225--0x1240, ISel at 0x1240--0x1254, 982 encoding handlers |
0x12FA000--0x13CF000 | 845 KB | 1,282 | Operand legalization and peephole: 522 per-instruction bit-field encoders (366 KB), 186 peephole pattern matchers (81 KB), 11 operand legalization/materialization functions (40 KB), 38 operand encoding emitters (31 KB), 8 live-range analysis functions (14 KB). | sub_137B790 (operand legalization, 8.5 KB), 186 peephole matchers, 522 encoders |
0x13CF000--0x14A4000 | 844 KB | 1,219 | SM120 (RTX 50-series) peephole pipeline: 1,087 instruction pattern matchers (429 KB), one 233 KB master opcode dispatch switch (sub_143C440, 373-case primary switch), 123 instruction encoders (180 KB). Pattern matchers validate opcode, modifiers, and operand types; dispatch rewrites opcode byte and operand mapping. | sub_143C440 (233 KB dispatch, 373-case switch), 1,087 pattern matchers, 123 encoders |
0x14A4000--0x1579000 | 852 KB | 606 | Blackwell ISA encode/decode: 332 encoder functions (0x14A4--0x1520) packing SASS bitstreams, 1 dispatcher (vtable router at 0x15209F0), 273 decoder functions (0x1520--0x1578) unpacking bitstreams and validating fields. Encoder state struct is 600+ bytes with 128-bit format descriptor at +8, operand arrays at +24--+143. | 332 encoders, 273 decoders, 1 dispatcher |
0x1579000--0x164E000 | 852 KB | 1,324 | SASS encoding + peephole matchers: Zone A has 367 instruction encoders, Zone B has 78 utility/transition functions, Zone C has 469 peephole pattern matchers. All pattern matchers are called from a single 280 KB mega-dispatcher (sub_169B190). | 367 encoders, 469 peephole matchers, 78 utilities |
0x164E000--0x1723000 | 873 KB | 899 | ISel pattern matching core: 762 PTX opcode pattern matchers (Zone A), the master dispatch function sub_169B190 at 280 KB / 66K instructions (Zone B -- the single largest function in the binary), 100 encoding table entries, and 36 multi-instruction template expanders. The dispatch tries every matcher, selects the highest-scoring match, and records which SASS expansion template to use. | sub_169B190 (280 KB, 66K insns, 15,870 callees), 762 matchers, 36 template expanders |
0x1723000--0x17F8000 | 852 KB | 631 | ISA description database: ~555 SASS instruction format descriptor classes (one per opcode variant), ~316 bitfield layout initializers, ~239 opcode handler vtable entries. Also contains instruction sequence generators (multi-instruction expansions for complex PTX operations), register allocation helpers, and Newton-Raphson approximation templates. 91.8% of functions have zero static callers (vtable-dispatched). | ~555 format descriptor classes, ~316 bitfield initializers, ~239 vtable entries |
0x17F8000--0x18CD000 | 852 KB | 1,460 | SASS instruction printer + peephole: Subsystem A (0x17F8--0x181F) implements SASS disassembly rendering via virtual method overrides on a builder/visitor with a 4,080+ byte vtable. Subsystem B (0x1820--0x18CC) is a 231 KB peephole dispatch function (sub_18A2CA0, 54K instructions, 1,330 unique callees). | sub_18189C0 (SASS printer, 45 KB), sub_181B370 (SASS printer, 28 KB), sub_18A2CA0 (231 KB peephole dispatch) |
0x18CD000--0x19A2000 | 877 KB | 1,598 | Scheduling + peephole dispatchers: Zone A (275 KB) is the instruction scheduling core (list scheduler, dependency graph, ready queue, register pressure tracking). Zone B (130 KB) contains 318 opcode property/classification tables. Zones C+D (460 KB) contain 888 peephole pattern matchers called from sub_198BCD0 (239 KB, 1,336 unique callees). | sub_198BCD0 (239 KB peephole dispatch), 392 scheduling functions, 318 opcode property tables, 888 pattern matchers |
0x19A2000--0x1A77000 | 880 KB | 1,393 | GPU ABI/calling convention + SM89/90 encoders: Zone A (250 KB, 276 functions) implements the NVIDIA GPU calling convention -- parameter register allocation, return address placement, scratch/preserved classification, convergent boundary enforcement, coroutine SUSPEND semantics, uniform register support, per-SM ABI lowering (sm_35 through sm_100+). Zone B (480 KB) has ~1,117 supplementary SASS encoding vtable handlers. | sub_19D1AF0 (master ABI setup, 5.6 KB), 276 ABI functions, ~1,117 encoding handlers |
0x1A77000--0x1B4C000 | 829 KB | 1,518 | SASS emission backend (4 SM families): Zone A has 1,083 bit-field packing encoders spanning sm_50 through sm_100+. Zone B has 339 instruction lowering/expansion functions (two SM families: sm_8x and sm_9x/10x). Zone C has 84 Ampere/Ada/Hopper-era encoders. Zone D has 92 Blackwell-era encoders. | sub_1B6B250 (register-class-to-HW mapping, 254 callers), 1,083 emitters, 339 lowering functions |
0x1B4C000--0x1C21000 | 876 KB | 1,974 | SASS emission + format descriptors: register-class encoding tables (Zone A), per-SM instruction bit-field encoders (Zone B), instruction emission orchestrators (Zone C), multi-operand dispatch emitters (Zone D), mirrored SM-variant emitters (Zone E), instruction format descriptors (Zone F, 0x1C05--0x1C21). | 487 functions exceed 2 KB decompiled |
0x1C21000--0x1CE2DE2 | 776 KB | 1,628 | Library layer: custom ELF emitter (CUBIN output), capsule Mercury ELF (.nv.capmerc debug metadata), section layout and memory allocation (shared/constant/local/global), relocation resolution (branch targets, UFT/UDT, YIELD-to-NOP), call graph analysis (recursion detection, dead function elimination), DWARF debug generation (.debug_info/.debug_line/.debug_frame), option parsing library, thread pool (pthread-based), JSON builder, GNU Make jobserver client, C++ name demangler (Itanium ABI), ELF file writer | sub_1C9F280 (ELF emitter, 97 KB decomp), sub_1CABD60 (section allocator, 67 KB), sub_1CC9800 (EIATTR builder, 90 KB), sub_1CDC780 (demangler, 93 KB), sub_1CB53A0 (ELF world init), sub_1CD48C0 (relocation resolver, 22 KB), sub_1CBB920 (recursion detector), sub_1CB18B0 (thread pool), sub_1CD13A0 (file writer, 11 KB) |
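The try-every-matcher, keep-the-best scheme recovered from the ISel dispatcher (sub_169B190) reduces to a max-scan over scored matches. A schematic version -- the matcher signature and scoring are assumptions; the real dispatcher consults 762 matchers and records a SASS expansion template:

```python
def select_template(insn, matchers):
    """Try every pattern matcher; keep the highest-scoring match.
    Each matcher returns (score, template_name) or None."""
    best = None
    for match in matchers:
        result = match(insn)
        if result is not None and (best is None or result[0] > best[0]):
            best = result
    return best[1] if best else None

# Two toy matchers: a generic add, and a more specific add-with-immediate
# that outscores it when it applies. Names are illustrative only.
def m_generic_add(insn):
    return (1, "IADD3") if insn["op"] == "add" else None

def m_add_imm(insn):
    if insn["op"] == "add" and isinstance(insn.get("src1"), int):
        return (2, "IADD3_IMM")
    return None
```

Scoring lets specific patterns (immediate forms, fused variants) win over generic fallbacks without any ordering constraint on the matcher list.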
.rodata Contents (7.5 MB)
The .rodata section at 0x1CE2E00--0x240BF8F accounts for 20% of the binary by size. Its dominant consumers:
| Content | Estimated Size | Notes |
|---|---|---|
| SASS encoding format descriptors | ~3.5 MB | 128-bit xmmword constants loaded via SSE by ~4,000 encoding handlers |
| Flex DFA transition tables | ~600 KB | off_203C020, the 552-rule PTX scanner's state machine |
| Bison parser tables | ~400 KB | LALR(1) action/goto tables for the PTX grammar |
| Error/diagnostic format strings | ~300 KB | 30,632 strings extracted from the binary |
| Phase ordering + vtable tables | ~100 KB | Default 159-entry phase table at 0x22BEEA0, vtable table at off_22BD5C8 |
| ROT13-encoded string tables | ~200 KB | PTX opcode names (~900 entries), knob names (~2,000 entries) |
| Architecture capability tables | ~150 KB | Per-SM feature maps (sm_75 through sm_121), HW latency profiles |
| DWARF name tables | ~50 KB | DW_FORM_*, DW_AT_*, DW_OP_* string tables |
| Hash constants + misc | ~2.2 MB | MurmurHash3 mixing constants, lookup tables, padding |
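The MurmurHash3 mixing constants listed above are most recognizable in the 64-bit finalizer, which makes a convenient .rodata fingerprint when matching hash code in a stripped binary. The standard finalizer, for reference:

```python
MASK64 = (1 << 64) - 1

def fmix64(h: int) -> int:
    """MurmurHash3's 64-bit finalizer. The two multiplicative
    constants are the values visible in ptxas's .rodata."""
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h
```

Spotting either constant in a decompiled function is strong evidence of a MurmurHash3 variant, as used to identify sub_427630.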
.bss Contents (84 KB)
| Content | Notes |
|---|---|
| ROT13 PTX opcode name table | Populated by ctor_003 (0x4095D0, 17 KB) at startup |
| General OCG knob table | Populated by ctor_005 (0x40D860, 80 KB) -- ~2,000 entries |
| Mercury scheduler knob table | Populated by ctor_007 (0x421290, 8 KB) -- 98 entries |
| Thread-local storage keys | pthread_key_t for per-thread context (280-byte struct) |
| Global pool allocator mutex | pthread_mutex_t at pool struct offset 7128 |
| Diagnostic suppression bitmaps | Per-warning-ID suppression flags |
| SM architecture profile objects | Constructed on demand per sub_6765E0 |
| Global error/warning counters | Incremented by sub_42FBA0 |
| Make jobserver state | Atomic state machine (0=init, 5=no MAKEFLAGS, 6=no auth, 7=failed) |
.data Contents (14 KB)
| Content | Notes |
|---|---|
| Function pointer tables | Exit wrapper (off_29FA4B0), error handler dispatch |
| Default option values | Populated by sub_432A00 (option registration) |
| Static string table pointers | Version strings, format strings |
| Diagnostic output tables | Severity prefix strings: "error ", "warning ", "info ", "fatal " |
Static Constructors
The .ctors section holds 12 entries executed before main. The four largest are:
| Constructor | Address | Binary Size | Purpose |
|---|---|---|---|
ctor_001 | 0x4094C0 | 204 B | Thread infrastructure: pthread_key_create, mutex init, thread priority range |
ctor_003 | 0x4095D0 | 17,007 B | PTX opcode name table: ~900 ROT13-encoded opcode mnemonics |
ctor_005 | 0x40D860 | 80,397 B | General OCG knob table: ~2,000 ROT13-encoded knob names + hex defaults |
ctor_007 | 0x421290 | 7,921 B | Mercury scheduler knob table: 98 ROT13-encoded scheduler knobs |
The remaining 8 constructors handle memory allocator pool initialization, hash map infrastructure setup, diagnostic system initialization, and architecture vtable factory registration (sub_1CCD900).
Mega-Functions (>50 KB binary)
| Address | Binary Size | Decompiled | Function | Callees |
|---|---|---|---|---|
sub_169B190 | 280 KB | N/A | Master ISel pattern dispatch (66K instructions) | 15,870 |
sub_198BCD0 | 239 KB | N/A | Peephole dispatch, SM variant 2 | 1,336 |
sub_143C440 | 233 KB | N/A | SM120 peephole dispatch (373-case switch) | ~1,100 |
sub_18A2CA0 | 231 KB | N/A | Peephole dispatch, SM variant 1 | 1,330 |
sub_6D9690 | 94 KB | N/A | Instruction encoding switch | ~500 |
sub_46E000 | 93 KB | N/A | PTX opcode-to-handler table builder | 1,168 |
sub_40D860 | 80 KB | N/A | ctor_005: general knob registration | ~2,000 |
sub_720F00 | 64 KB | N/A | Flex DFA scanner (552 rules) | ~50 |
These eight functions account for 1.3 MB of code (4.9% of .text) but only 0.02% of the function count.
Most-Called Functions
| Address | Callers | Identity |
|---|---|---|
sub_4280C0 | 3,928 | Thread-local context accessor (pthread_getspecific) |
sub_42BDB0 | 3,825 | Fatal OOM handler (called from every allocation site) |
sub_424070 | 3,809 | Pool memory allocator (alloc) |
sub_426150 | 2,800 | Hash map insert/update |
sub_42FBA0 | 2,350 | Central diagnostic message emitter |
sub_4248B0 | 1,215 | Pool memory deallocator (free) |
sub_42CA60 | 298 | Linked list prepend |
sub_42D850 | 282 | Hash set insert |
sub_1B6B250 | 254 | Register-class-to-hardware-number lookup (SASS emission) |
sub_4279D0 | 185 | String prefix match (starts_with) |
The top five functions are all in the runtime infrastructure region (0x403520--0x42F000). Together they represent the core allocation, error handling, and data structure layer that the rest of the binary depends on.
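The alloc/OOM pairing in the table (pool_alloc at 3,809 callers, fatal_OOM_handler at 3,825) suggests a bump-style pool with a non-returning failure path reachable from every allocation site. A simplified model -- the capacity, alignment rule, and internal layout here are all assumptions:

```python
class Pool:
    """Minimal sketch of the pool_alloc / fatal_OOM_handler pattern.
    Allocations bump an offset within a preallocated buffer; exhaustion
    takes the fatal path (modeled here as MemoryError)."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.off = 0

    def alloc(self, n: int, align: int = 8) -> int:
        """Return the offset of an n-byte, align-aligned allocation."""
        off = (self.off + align - 1) & ~(align - 1)
        if off + n > len(self.buf):
            raise MemoryError("pool exhausted")  # fatal_OOM_handler analogue
        self.off = off + n
        return off
```

Pool allocation explains why the deallocator (sub_4248B0) has far fewer callers than the allocator: most objects die with their pool rather than being freed individually.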
Binary Composition by Purpose
Estimated from function classification across 30 sweep reports (p1.01--p1.30). Each function was assigned to a single purpose category based on its dominant behavior; functions straddling categories (e.g., a scheduling pass that also emits SASS) are attributed to the category consuming the larger share of their code.
| Purpose | Estimated Size | Share of .text |
|---|---|---|
| SASS instruction encoding/decoding | ~12 MB | 46% |
| Optimization passes + scheduling | ~5 MB | 19% |
| Peephole pattern matching + dispatch | ~3 MB | 12% |
| Frontend: parsing + validation | ~2 MB | 8% |
| ISel pattern matching + templates | ~1.5 MB | 6% |
| Infrastructure: allocator, hash, ELF, debug | ~1.5 MB | 6% |
| GPU ABI + calling convention | ~0.7 MB | 3% |
The single largest consumer of code space is SASS instruction encoding. Each SM architecture generation requires its own set of per-opcode encoding/decoding handler functions. With support for sm_75 through sm_121 (six major generations), this yields approximately 4,000 encoding handlers, each a standalone function averaging 1,400 bytes.
Methodology
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents how the reverse engineering of ptxas v13.0.88 was performed. It serves as a transparency record so readers can assess the confidence of any claim in this wiki, and as a practical guide for anyone who wants to reproduce or extend the analysis.
Scope and Scale
PTXAS is a 37.7 MB stripped x86-64 ELF binary with no debug symbols, no DWARF information, and no export table beyond 146 libc/libpthread PLT stubs. Unlike NVIDIA's cicc (which is an LLVM fork), ptxas contains no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, data structure, and encoding table is proprietary NVIDIA code. This makes the analysis harder than LLVM-derived binaries -- there is no upstream source to compare against.
| Metric | Value |
|---|---|
| Binary size | 37,741,528 bytes |
| Build string | cuda_13.0.r13.0/compiler.36424714_0 |
| Total functions detected | 40,185 |
| Functions decompiled | 39,881 (99.2%) |
| Strings extracted | 30,632 |
| Call graph edges | 548,693 |
| Cross-references | 7,427,044 |
| IDA comments recovered | 66,598 |
| IDA auto-names recovered | 16,019 |
| Control flow graphs exported | 80,078 |
| PLT imports | 146 (libc, libpthread, libm, libgcc) |
| Functions with 0 static callers | 15,907 (39.6%) -- vtable-dispatched |
| Functions < 100 bytes | 11,532 (28.7%) |
| Functions > 10 KB | 86 (0.2%) |
| Named functions (not sub_*) | 319 (0.8%) |
| Internal codenames | OCG (Optimizing Code Generator), Mercury (SASS encoder), Ori (IR) |
The 304 functions that Hex-Rays could not decompile are predominantly PLT stubs, computed-jump trampolines in the Flex DFA scanner, and the four mega-dispatch functions exceeding 200 KB (too large for Hex-Rays to handle within default limits). None are in critical analysis paths -- the dispatch functions are understood from their callee lists and the PLT stubs from their import names.
Why PTXAS Is Harder Than LLVM-Based Binaries
Reverse engineering cicc (NVIDIA's LLVM-based CUDA compiler) benefits from extensive prior art: LLVM's open-source codebase provides structural templates, pass names are registered in predictable patterns, and cl::opt strings directly name their global variables. PTXAS offers none of these advantages:
- No upstream source. Every function was identified from first principles -- string evidence, callgraph position, structural fingerprinting, or decompiled algorithm analysis. There is no reference implementation to compare against.
- ROT13 obfuscation. Internal names for tuning knobs and PTX opcode mnemonics are ROT13-encoded in the binary, requiring decoding before they become useful anchors.
- Vtable-heavy architecture. 39.6% of functions have zero static callers because they are dispatched through vtable pointers or function pointer tables. The call graph alone cannot reach them.
- Template-generated code. The SASS backend contains approximately 4,000 encoding handler functions generated from templates, each structurally near-identical. These dominate the function count but carry almost no unique identifying features.
- No pass registration infrastructure. LLVM passes register themselves via `PassInfo` objects with name strings. PTXAS phases are allocated by a factory switch (`sub_C60D30`), and their names are only visible through the `NamedPhases` registry and `AdvancedPhase*` timing strings -- far fewer anchors than LLVM's registration system.
Toolchain
All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. The entire effort is static analysis of the binary at rest -- no dynamic analysis (debugging, tracing, instrumentation) was used for function identification. Runtime tools (ptxas --stat, DUMPIR knob, --keep) were used only for validation and cross-referencing.
| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, vtable reconstruction |
| Hex-Rays decompiler | Pseudocode generation for 39,881 recovered functions |
| IDA Python scripting | Complete database extraction: all 8 JSON artifact exports |
| Custom Python script | analyze_ptxas.py: batch string, function, graph, xref, and decompilation export |
| ptxas CLI | --stat, --verbose, --compiler-stats, --fdevice-time-trace for runtime validation |
| ptxas DUMPIR knob | -knob DUMPIR=<phase> to dump IR at specific pipeline points |
| ROT13 decoder | Standard codecs.decode(s, "rot_13") for 2,000+ obfuscated knob/opcode names |
IDA Pro Setup and Initial Analysis
Loading the Binary
PTXAS is a dynamically-linked ELF with 146 PLT imports but no symbol table beyond those imports. IDA auto-analysis settings:
- Processor: Meta PC (x86-64)
- Analysis options: default. IDA correctly identifies the Flex DFA scanner tables, Bison parser tables, and the `.ctors`/`.dtors` sections.
- Auto-analysis time: approximately 8-10 minutes on a modern machine for the 37.7 MB binary.
- Compiler detection: IDA identifies GCC as the compiler. The binary uses the Itanium C++ ABI (confirmed by the embedded C++ name demangler at `sub_1CDC780`, 93 KB).
Post-Auto-Analysis Steps
After auto-analysis completes:
- Run string extraction. IDA's auto-analysis finds 30,632 strings. All are exported via the `analyze_ptxas.py` IDA Python script.
- Force function creation. Some address ranges, particularly the template-generated encoding handlers, are not automatically recognized as functions. IDA's "Create function" (P key) was applied selectively in the `0xD27000`--`0x1579000` range where encoding handler stubs are tightly packed.
- Batch decompile. The IDA Python script iterates all 40,185 detected functions and calls `ida_hexrays.decompile()` on each, saving per-function `.c` files. 39,881 succeeded; 304 failed (PLT stubs, computed-jump trampolines, and 4 mega-functions exceeding decompiler limits).
- Export control flow graphs. For each function, the script extracts the `FlowChart` (basic blocks, edges, per-instruction disassembly) as JSON. 80,078 graph files were produced.
Type Recovery
PTXAS uses no C++ RTTI (no typeid, no dynamic_cast -- the binary has no .data.rel.ro RTTI structures). Type recovery relies on:
- Vtable layout analysis. Each vtable is a contiguous array of function pointers in `.data.rel.ro` (4,256 bytes total). The phase vtable table at `off_22BD5C8` contains 159 entries, one per optimization phase class, each pointing to that phase's vtable.
- Structure offset patterns. The pool allocator struct has free-list bins at offset +2128 and a mutex at +7128. The thread-local context is a 280-byte struct accessed via `pthread_getspecific`. These offsets were recovered from the decompiled code of `sub_424070` (pool alloc, 3,809 callers) and `sub_4280C0` (TLS accessor, 3,928 callers).
- Parameter/return type propagation. Once a function's signature is established (e.g., `pool_alloc(pool*, size_t) -> void*`), Hex-Rays propagates types to all 3,809 call sites, improving decompilation quality throughout the binary.
String-Driven Analysis
Strings are the single most productive source of function identification in ptxas. Of the 30,632 strings extracted, several categories are particularly valuable.
ROT13-Encoded Knob Names (2,000+ entries)
PTXAS uses ROT13 encoding as a light obfuscation layer on internal configuration names. Two massive static constructors populate these tables at startup:
- `ctor_005` at `0x40D860` (80 KB) registers approximately 2,000 general OCG tuning knobs
- `ctor_007` at `0x421290` (8 KB) registers 98 Mercury scheduler knobs
Each entry pairs a ROT13-encoded name with a hex-encoded default value. Decoding examples:
| ROT13 in binary | Decoded name |
|---|---|
| `ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf` | `MercuryUseActiveThreadCollectiveInsts` |
| `ZrephelGenpxZhygvErnqfJneYngrapl` | `MercuryTrackMultiReadsWarLatency` |
| `ZrephelCerfhzrKoybpxJnvgOrarsvpvny` | `MercuryPresumeXblockWaitBeneficial` |
| `ZrephelZretrCebybthrOybpxf` | `MercuryMergePrologueBlocks` |
| `ZrephelTraFnffHPbqr` | `MercuryGenSassUCode` |
| `FpniVayvarRkcnafvba` | `ScavInlineExpansion` |
| `FpniQvfnoyrFcvyyvat` | `ScavDisableSpilling` |
The knob names directly reveal subsystem organization. Names prefixed with Mercury* belong to the SASS encoder. Names prefixed with Scav* belong to the register allocator's scavenger. Names like XBlockWait* and WarDeploy* belong to the instruction scheduler. The knob lookup function GetKnobIndex at sub_79B240 performs inline ROT13 decoding and case-insensitive comparison, which was itself identified by tracing the xrefs from the ROT13-encoded strings.
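The decoding itself is a one-liner with Python's built-in codec. A minimal sketch, using entries recovered from the constructor tables above (the helper name is ours):

```python
import codecs

def decode_knob_name(rot13_name: str) -> str:
    """Undo ptxas's ROT13 obfuscation on an internal knob/opcode name."""
    return codecs.decode(rot13_name, "rot_13")

# Entries recovered from the ctor_005/ctor_003 tables:
for s in ("ZrephelZretrCebybthrOybpxf", "FpniQvfnoyrFcvyyvat", "SZN"):
    print(decode_knob_name(s))
# MercuryMergePrologueBlocks
# ScavDisableSpilling
# FMA
```

Running this over all strings referenced by the three constructors recovers the full knob and opcode namespaces in one pass.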
ROT13-Encoded PTX Opcode Names (~900 entries)
A third static constructor, ctor_003 at 0x4095D0 (17 KB), populates a table of ~900 ROT13-encoded PTX opcode mnemonics. Decoding examples:
| ROT13 | Decoded |
|---|---|
| `NPDOHYX` | `ACQBULK` |
| `OFLAP` | `BSYNC` |
| `SZN` | `FMA` |
| `FRGC` | `SETP` |
| `ERGHEA` | `RETURN` |
| `RKVG` | `EXIT` |
These strings are used by the PTX parser to match instruction mnemonics. Each xref from one of these strings leads to a parser action or instruction validator function.
Timing and Phase Name Strings
The compilation driver at sub_446240 emits per-stage timing via format strings:
Parse-time : %.3f ms (%.2f%%)
CompileUnitSetup-time : %.3f ms (%.2f%%)
DAGgen-time : %.3f ms (%.2f%%)
OCG-time : %.3f ms (%.2f%%)
ELF-time : %.3f ms (%.2f%%)
DebugInfo-time : %.3f ms (%.2f%%)
PeakMemoryUsage = %.3lf KB
Tracing the xrefs from these format strings identifies the code that brackets each pipeline stage, revealing the stage boundaries within sub_446240.
The NamedPhases registry (string at 0x21B64C8, xrefs to sub_9F4040) and the AdvancedPhase* timing strings provide phase-level anchors within the 159-phase optimization pipeline:
- `AdvancedPhaseBeforeConvUnSup`, `AdvancedPhaseAfterConvUnSup`
- `AdvancedPhaseEarlyEnforceArgs`, `AdvancedPhaseLateConvUnSup`
- `AdvancedPhasePreSched`, `AdvancedPhaseAllocReg`, `AdvancedPhasePostSched`
- `AdvancedPhaseOriPhaseEncoding`, `AdvancedPhasePostFixUp`
- `GeneralOptimizeEarly`, `GeneralOptimize`, `GeneralOptimizeMid`, `GeneralOptimizeMid2`
- `GeneralOptimizeLate`, `GeneralOptimizeLate2`
- `OriPerformLiveDead`, `OriPerformLiveDeadFirst` through `OriPerformLiveDeadFourth`
Each AdvancedPhase* string xrefs to exactly one call site, which is a boundary marker in the phase pipeline. These 15 markers divide the 159-phase pipeline into named segments whose boundaries were used to identify the phases between each pair of markers.
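Extracting these anchors from the exported artifacts is mechanical. A sketch assuming the `ptxas_strings.json` schema documented below (`{value, xrefs:[{func, ...}]}`); the single-xref property is asserted rather than assumed:

```python
def phase_anchor_sites(strings):
    """Map each AdvancedPhase* marker string to its unique xref site.

    `strings` iterates records shaped like ptxas_strings.json entries:
    {"value": ..., "xrefs": [{"func": ...}, ...]}. Each marker string is
    expected to have exactly one call-site xref (a pipeline boundary).
    """
    anchors = {}
    for s in strings:
        if s["value"].startswith("AdvancedPhase"):
            assert len(s["xrefs"]) == 1, "marker with multiple xrefs: " + s["value"]
            anchors[s["value"]] = s["xrefs"][0]["func"]
    return anchors
```

The `GeneralOptimize*` and `OriPerformLiveDead*` names would need their own prefix filters; this sketch covers only the `AdvancedPhase*` family.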
Error and Diagnostic Strings
The central diagnostic emitter sub_42FBA0 (2,350 callers) prints error messages whose text reveals the calling function's purpose. Examples:
"Please use -knob DUMPIR=AllocateRegisters for debugging"-- identifies the register allocator failure path atsub_9714E0"SM does not support LDCU"-- identifies SM capability checking in the instruction legalizer"Invalid knob identifier","Invalid knob specified (%s)"-- identifies the knob parsing infrastructure aroundsub_79D070"fseek() error knobsfile %s","[knobs]"-- identifiesReadKnobsFileatsub_79D070
Source File Path
One recovered source path provides a structural anchor:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h
This string (at 0x202D4D8, 66 xrefs) is referenced from assertion checks throughout the knobs infrastructure, confirming that the knob system is a shared utility component (generic_knobs_impl.h) used across NVIDIA's compiler drivers.
Build and Version Strings
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
The version string at sub_612DE0 identifies both the exact build and the version reporting function. The Usage : string at 0x1CE3666 identifies the usage printer. The "\nCompile-unit with entry %s" string identifies the per-kernel compilation loop within the driver.
Vtable-Driven Discovery
The Phase Vtable Table
The most productive vtable discovery was the phase vtable table at off_22BD5C8 in .rodata. This is an array of 159 pointers, each pointing to a vtable for one optimization phase class. The phase factory function at sub_C60D30 is a 159-case switch statement that allocates a 16-byte phase object and assigns the corresponding vtable from this table:
// Simplified from decompiled sub_C60D30
switch (phase_index) {
case 0: obj->vtable = off_22BD5C8[0]; break;
case 1: obj->vtable = off_22BD5C8[1]; break;
...
case 158: obj->vtable = off_22BD5C8[158]; break;
}
return obj;
Each vtable contains pointers to the phase's virtual methods: slot 0 is `execute()` (the phase body), slot 1 is `isNoOp()` (returns whether the phase should be skipped), and slot 2 is `getName()` (returns the phase name string).
By following each of the 159 vtable entries to their execute() slot, every optimization phase's main function was identified. The getName() slot provided the phase name for phases that implement it. For phases that return a constant empty string, the name was inferred from the NamedPhases registry or from the AdvancedPhase* timing strings that bracket the phase in the pipeline.
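The walk itself is mechanical once the table address is known. A sketch of the logic, with `read_qword` standing in for IDA's `ida_bytes.get_qword` so it can run (and be tested) outside IDA; the slot layout is the one described above:

```python
def map_phase_methods(read_qword, table_addr, n_phases=159, ptr_size=8):
    """Walk a phase vtable table: entry i -> vtable -> method slots.

    `read_qword(addr)` abstracts IDA's ida_bytes.get_qword. Slot layout
    follows the recovered convention: slot 0 = execute(), slot 1 = isNoOp(),
    slot 2 = getName().
    """
    phases = []
    for i in range(n_phases):
        vtbl = read_qword(table_addr + i * ptr_size)
        phases.append({
            "index": i,
            "vtable": vtbl,
            "execute": read_qword(vtbl + 0 * ptr_size),
            "isNoOp": read_qword(vtbl + 1 * ptr_size),
            "getName": read_qword(vtbl + 2 * ptr_size),
        })
    return phases
```

Inside IDA, passing `ida_bytes.get_qword` as `read_qword` and `off_22BD5C8`'s address as `table_addr` yields the 159 `execute()` targets directly.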
Encoding Handler Vtables
The SASS backend uses vtable dispatch for instruction encoding. Each SASS opcode variant has its own encoding handler function, registered in dispatch tables rather than called directly. This explains why 15,907 functions (39.6%) have zero static callers -- they are reached exclusively through indirect calls via function pointer tables.
The encoding handler vtables were identified by their structural uniformity: every handler in the 0xD27000--0x1579000 range follows an identical template:
- Set opcode ID via bitfield insert into the instruction word at `a1+544`
- Load a 128-bit format descriptor from `.rodata` via SSE (`movaps xmm0, xmmword_XXXXXX`)
- Initialize a 10-slot register class map
- Register operand descriptors via `sub_7BD3C0`/`sub_7BD650`/`sub_7BE090`
- Finalize encoding via `sub_7BD260`
- Extract bitfields from the packed instruction word
The uniformity of this template allowed batch identification: once the template was recognized in a few handlers, the remaining ~4,000 were identified by structural matching alone.
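The batch match can be expressed directly over the exported function records. A sketch assuming the `ptxas_functions.json` schema (`{name, addr, callees[], ...}`); the marker sets mirror the template steps above:

```python
OPERAND_REGISTRARS = {"sub_7BD3C0", "sub_7BD650", "sub_7BE090"}

def is_encoding_handler(func):
    """Heuristic structural match for the SASS encoding-handler template.

    `func` is one record from ptxas_functions.json. A handler must call the
    bitfield-insert primitive, at least one operand registrar, and the
    encoding finalizer, and must lie in the encoder address range.
    """
    callees = set(func.get("callees", []))
    in_range = 0xD27000 <= func["addr"] < 0x1579000
    return bool(
        in_range
        and "sub_7B9B80" in callees          # bitfield insert
        and callees & OPERAND_REGISTRARS     # operand descriptors
        and "sub_7BD260" in callees          # encoding finalize
    )
```

Applying this predicate to all 40,185 records is how a few manually confirmed handlers generalize to the ~4,000 template instances.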
Peephole Optimizer Vtable
The PeepholeOptimizer class at 0x7A5D10 has a reconstructed vtable with 7 virtual methods:
| Slot | Method | Purpose |
|---|---|---|
| 0 | Init | Initialize peephole state for a compilation unit |
| 1 | RunOnFunction | Entry point for per-function peephole optimization |
| 2 | RunOnBB | Per-basic-block dispatch |
| 3 | RunPatterns | Standard pattern matching pass |
| 4 | SpecialPatterns | Architecture-specific pattern pass |
| 5 | ComplexPatterns | Multi-instruction pattern pass |
| 6 | SchedulingAwarePatterns | Schedule-preserving pattern pass |
The three peephole dispatch mega-functions (sub_143C440 at 233 KB, sub_18A2CA0 at 231 KB, sub_198BCD0 at 239 KB) each serve a different SM generation family and call 1,100--1,336 pattern matcher functions. These dispatchers were identified by their enormous callee counts and their position in the pipeline after instruction encoding.
Callgraph Analysis
The 548,693-edge call graph, exported from IDA, reveals the binary's module structure and function relationships. Several callgraph properties were systematically exploited.
Hub Function Identification
Functions with extreme callee or caller counts serve as structural anchors:
Top callees (hub functions -- "fan-out" nodes):
| Address | Name | Size | Callees | Role |
|---|---|---|---|---|
| `sub_169B190` | ISel master dispatch | 280 KB | 15,870 | The single largest function in the binary. Dispatches to all ISel pattern matchers. |
| `sub_143C440` | SM120 peephole dispatch | 233 KB | 13,425 | SM120 (RTX 50-series) peephole optimization |
| `sub_198BCD0` | Peephole dispatch (variant 2) | 239 KB | 13,391 | Peephole optimization for another SM family |
| `sub_18A2CA0` | Peephole dispatch (variant 1) | 231 KB | 12,974 | Peephole optimization for another SM family |
| `sub_BA9D00` | Bitvector/CFG analysis | 204 KB | 11,335 | Dataflow framework core |
Top callers (utility functions -- "fan-in" nodes):
| Address | Name | Size | Callers | Role |
|---|---|---|---|---|
| `sub_B28F30` | (unknown leaf) | 12 B | 31,399 | Tiny utility, likely a type tag or opcode check |
| `sub_10AE5C0` | (unknown leaf) | 60 B | 30,768 | Small encoding helper |
| `.sprintf` | libc sprintf | 6 B | 20,398 | String formatting (PLT stub) |
| `sub_7B9B80` | Bitfield insert | 216 B | 18,347 | Inserts bits into the 1280-bit instruction word |
| `sub_424070` | Pool allocator | 2,098 B | 3,809 | Custom memory allocator |
| `sub_4280C0` | TLS context accessor | 597 B | 3,928 | Thread-local storage via pthread_getspecific |
| `sub_42FBA0` | Diagnostic emitter | 2,388 B | 2,350 | Central error/warning reporter |
The fan-out nodes identify the mega-dispatch functions: ISel, peephole, and dataflow. The fan-in nodes identify the shared infrastructure layer: memory allocation, encoding primitives, string formatting, and error reporting.
Module Boundary Detection
The call graph reveals clear module boundaries. Functions in the 0x400000--0x67F000 range (PTX frontend) rarely call functions in 0xC52000--0x1CE3000 (SASS backend) directly, and vice versa. The optimizer region (0x67F000--0xC52000) bridges the two, calling into both the frontend (for IR construction) and the backend (for encoding).
The call graph was used to validate the three-subsystem decomposition:
| Call direction | Edge count | Interpretation |
|---|---|---|
| Frontend -> Frontend | ~8,000 | Internal frontend cohesion |
| Frontend -> Optimizer | ~1,200 | IR construction handoff |
| Optimizer -> Optimizer | ~15,000 | Phase-to-phase internal calls |
| Optimizer -> Backend | ~3,500 | Scheduling, encoding setup |
| Backend -> Backend | ~18,000 | Encoding handler internal calls |
| Backend -> Frontend | ~500 | Shared infrastructure (allocator, hash) |
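The tallies above can be recomputed from the exported call graph. A sketch assuming the `ptxas_callgraph.json` schema (`{from_addr, to_addr}` per edge) and the subsystem address ranges quoted above:

```python
# Subsystem address ranges as recovered on this page.
REGIONS = [
    ("frontend",  0x400000, 0x67F000),
    ("optimizer", 0x67F000, 0xC52000),
    ("backend",   0xC52000, 0x1CE3000),
]

def region_of(addr):
    """Classify an address into one of the three subsystems."""
    for name, lo, hi in REGIONS:
        if lo <= addr < hi:
            return name
    return "other"

def tally_cross_module(edges):
    """Count call edges by (caller region -> callee region).

    `edges` iterates records shaped like ptxas_callgraph.json entries.
    """
    counts = {}
    for e in edges:
        key = (region_of(e["from_addr"]), region_of(e["to_addr"]))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Note that only direct-call edges are counted; vtable-dispatched calls never appear in the export, so the true cross-module traffic is higher than these figures.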
Propagation from Known Functions
Once a high-confidence function is identified, its callees and callers gain contextual identity. The most productive propagation chains:
- `sub_446240` (real main, CERTAIN) -> calls stage entry points for Parse, DAGgen, OCG, ELF, DebugInfo. Each stage's entry point was identified by following the timing format string pattern.
- `sub_C62720` (PhaseManager constructor) -> allocates 159 phase objects via `sub_C60D30` (factory). The factory's 159 case targets are the phase constructors. Each constructor installs a vtable whose slot 0 points to the phase's `execute()` method.
- `sub_79B240` (GetKnobIndex) -> called from every function that reads a tuning knob. The first argument to `GetKnobIndex` is the ROT13-encoded knob name, so every call site reveals which knob a function checks.
- `sub_42FBA0` (diagnostic emitter) -> the format string argument at each of the 2,350 call sites reveals the error context. A call with `"Cannot take address of texture/surface variable (%s)"` identifies a PTX semantic checker.
Pattern Recognition
16-Byte Phase Objects
All 159 optimization phases share a uniform object layout:
Offset 0: vtable pointer (8 bytes) -- points to phase-specific vtable
Offset 8: phase data pointer or inline data (8 bytes)
The phase factory (sub_C60D30) allocates each phase as a 16-byte object from the pool allocator, sets the vtable pointer from the vtable table at off_22BD5C8, and returns the object. The PhaseManager stores these 159 objects in its internal array and iterates them to execute the pipeline.
Pool Allocator Usage Pattern
The custom pool allocator (sub_424070, 3,809 callers) is the dominant allocation mechanism. Its usage pattern is recognizable throughout the binary:
ptr = sub_424070(pool, size); // Allocate
if (!ptr) sub_42BDB0(); // Fatal OOM -- never returns
// ... use ptr ...
sub_4248B0(ptr); // Free (1,215 callers)
The OOM handler sub_42BDB0 (14 bytes, 3,825 callers) is a tiny wrapper that calls sub_42F590 (fatal internal error). Because every allocation site checks for failure and calls the same handler, the allocator usage pattern is a reliable structural marker. Finding sub_42BDB0 in a function's callee list confirms that function performs heap allocation.
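This marker can be applied mechanically to the exported function records. A sketch assuming the `ptxas_functions.json` schema (`{name, callees[], ...}`):

```python
POOL_ALLOC  = "sub_424070"   # pool allocator
OOM_HANDLER = "sub_42BDB0"   # fatal OOM wrapper, present at every allocation site

def allocating_functions(functions):
    """Yield names of functions matching the pool-allocator usage pattern.

    `functions` iterates records from ptxas_functions.json. Calling both
    the pool allocator and the OOM handler is the structural marker
    described above.
    """
    for f in functions:
        callees = set(f.get("callees", []))
        if POOL_ALLOC in callees and OOM_HANDLER in callees:
            yield f["name"]
```

Because `sub_42BDB0` has 3,825 callers against the allocator's 3,809, nearly every allocation site matches this predicate.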
SASS Encoding Handler Template
Every encoding handler in the backend follows a rigid 6-step template (described in the vtable section above). The key identification markers:
- Calls to `sub_7B9B80` (bitfield insert, 18,347 callers)
- SSE `movaps` loading a 128-bit constant from `.rodata`
- Calls to `sub_7BD3C0`, `sub_7BD650`, or `sub_7BE090` (operand registrars)
- Final call to `sub_7BD260` (encoding finalize)
Any function matching this pattern is a SASS encoding handler. This template recognition identified approximately 4,000 handlers spanning 6 SM architecture generations.
Hash Map Infrastructure Pattern
The MurmurHash3-based hash map infrastructure (sub_426150 insert, sub_426D60 lookup, sub_427630 MurmurHash3) appears throughout the binary with a consistent usage pattern:
map = sub_425CA0(hash_fn, cmp_fn, initial_capacity); // Create
sub_426150(map, key, value); // Insert (2,800 callers)
result = sub_426D60(map, key); // Lookup (422 callers)
sub_425D20(map); // Destroy
The MurmurHash3 constants (0xcc9e2d51, 0x1b873593) in sub_427630 confirmed the hash algorithm. The hash map supports three modes (custom function pointers, pointer hash, integer hash) selected by flags at struct offset 84.
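The fingerprint is easy to verify independently: the canonical MurmurHash3 x86 32-bit routine, written out below from the published reference algorithm (not code lifted from `sub_427630`), is built around exactly those two multipliers:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Canonical MurmurHash3 x86 32-bit (public reference algorithm)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593   # the constants recovered in sub_427630
    h = seed & 0xFFFFFFFF
    n = len(data)
    for i in range(0, n - n % 4, 4):                    # 4-byte body chunks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF        # rotl 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF        # rotl 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[n - n % 4:]                             # 1--3 trailing bytes
    if tail:
        k = 0
        for b in reversed(tail):
            k = (k << 8) | b
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= n                                              # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Matching the decompiled `sub_427630` against this reference structure (body rotation/multiply sequence, tail handling, finalization mixers `0x85EBCA6B`/`0xC2B2AE35`) is what confirmed the identification, beyond the two headline constants alone.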
Data Artifacts
The complete IDA database was exported via analyze_ptxas.py into 8 JSON artifacts. These artifacts are the foundation for all subsequent analysis.
| Artifact | File | Size | Entries | Schema |
|---|---|---|---|---|
| Functions | ptxas_functions.json | 92 MB | 40,185 | {addr, end, name, size, insn_count, is_library, is_thunk, callers[], callees[]} |
| Strings | ptxas_strings.json | 4.8 MB | 30,632 | {addr, value, type, xrefs[{from, func, type}]} |
| Call graph | ptxas_callgraph.json | 64 MB | 548,693 | {from, from_addr, to, to_addr} -- one edge per call site |
| Cross-references | ptxas_xrefs.json | 978 MB | 7,427,044 | Complete xref database (code, data, string references) |
| Comments | ptxas_comments.json | 5.9 MB | 66,598 | {addr, type, text} -- IDA auto-comments and analyst annotations |
| Names | ptxas_names.json | 972 KB | 16,019 | {addr, name} -- IDA auto-generated and analyst-assigned names |
| Imports | ptxas_imports.json | 17 KB | 146 | {module, name, addr, ordinal} -- PLT import stubs |
| Segments | ptxas_segments.json | 3 KB | 24 | {name, start, end, size, type, perm} -- ELF segment map |
Total artifact storage: 1.14 GB (dominated by the 978 MB xref database).
What Each Artifact Reveals
Functions (ptxas_functions.json): The master index. Every function's address, size, instruction count, caller list, and callee list. The caller/callee lists are the basis for callgraph analysis. The is_thunk flag identifies PLT stubs (exclude from analysis). The is_library flag identifies functions IDA tagged as library code (CRT startup, jemalloc-like allocator internals).
Strings (ptxas_strings.json): The primary identification tool. Each string's xref list shows which functions reference it. Searching for "AdvancedPhase" returns 15 strings, each xref pointing to a pipeline boundary in the PhaseManager. Searching for strings starting with "Z" (ROT13 "M" for "Mercury") returns the Mercury subsystem's knob names. The 2,035 hex-encoded default value strings ("0k..." / "0x...") are paired 1:1 with knob name strings in the constructors.
Call graph (ptxas_callgraph.json): The structural backbone. Each edge records a direct call from one function to another. Indirect calls (vtable dispatch, function pointer callbacks) are not captured, which is the primary limitation -- the 15,907 zero-caller functions are almost all vtable-dispatched. The call graph is used for module boundary detection, propagation from known functions, and entry/exit point analysis.
Cross-references (ptxas_xrefs.json): The most comprehensive artifact. Contains all code-to-code, code-to-data, and data-to-data references detected by IDA. At 7.4 million entries, it is too large to load into memory on machines with less than 16 GB RAM. Used for deep analysis of specific functions: finding all references to a particular .rodata constant, tracing data flow through global variables, and identifying vtable consumers.
Comments (ptxas_comments.json): IDA's auto-generated comments (e.g., "File format: \\x7FELF") plus analyst-added annotations. The auto-comments on function prologues identify calling conventions and stack frame layouts. Analyst comments record identification rationale for reviewed functions.
Names (ptxas_names.json): IDA's auto-generated names for data and code addresses. Of 16,019 entries, approximately 9,670 are auto-generated string reference names (aLib64LdLinuxX8, aGnu, etc.) and ~6,349 are analyst-assigned or IDA-recovered names (PLT stubs, constructors, etc.). These names appear in the callgraph edges as from/to identifiers.
Imports (ptxas_imports.json): The 146 PLT imports. Key imports include pthread_* (13 functions), malloc/free/realloc, _setjmp/longjmp (used by the error recovery system), select/fcntl (used by the GNU Make jobserver client), and clock (used by the timing infrastructure).
Segments (ptxas_segments.json): The 24 ELF segments/sections. Used to establish the address space layout and map code/data boundaries. The .ctors section (104 bytes, 12 entries) is particularly important -- it lists the static constructors that initialize the ROT13 tables and the knob registry.
The 30-Region Sweep Approach
The primary analysis was conducted as a systematic address-range sweep of the entire .text section, divided into 30 contiguous regions. Each region was analyzed independently in a single session, producing a raw sweep report. The 40 report files (including sub-region splits) total 34,880 lines of working notes.
Region Partitioning
The .text section (0x403520--0x1CE2DE2, 26.2 MB) was divided into approximately 870 KB regions. The partitioning was not arbitrary -- region boundaries were chosen to align with subsystem boundaries where possible, so that each sweep report covers a coherent functional area.
| Report | Address Range | Size | Functions | Subsystem |
|---|---|---|---|---|
| p1.01 | 0x400000--0x4D5000 | 853 KB | 1,383 | Runtime infra + CLI + PTX validators |
| p1.02 | 0x4D5000--0x5AA000 | 853 KB | 581 | PTX text generation (580 formatters) |
| p1.03 | 0x5AA000--0x67F000 | 853 KB | 628 | Intrinsics + SM profiles |
| p1.04 | 0x67F000--0x754000 | 469 KB | ~500 | Mercury core + scheduling engine |
| p1.05 | 0x754000--0x829000 | 853 KB | 1,545 | Knobs + peephole optimizer class |
| p1.06 | 0x829000--0x8FE000 | 853 KB | 1,069 | Debug tables + scheduler + HW profiles |
| p1.07 | 0x8FE000--0x9D3000 | 853 KB | 1,090 | Register allocator (fatpoint) |
| p1.08 | 0x9D3000--0xAA8000 | 853 KB | 1,218 | Post-RA pipeline + NamedPhases |
| p1.09 | 0xAA8000--0xB7D000 | 853 KB | 4,493 | GMMA/WGMMA + ISel + emission |
| p1.10 | 0xB7D000--0xC52000 | 853 KB | 1,086 | CFG analysis + bitvectors |
| p1.11 | 0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager + phase factory |
| p1.12 | 0xD27000--0xDFC000 | 853 KB | 592 | SM100 SASS encoders (set 1) |
| p1.13 | 0xDFC000--0xED1000 | 853 KB | 591 | SM100 SASS encoders (set 2) + decoders |
| p1.14 | 0xED1000--0xFA6000 | 853 KB | 683 | SM100 SASS encoders (set 3) |
| p1.15 | 0xFA6000--0x107B000 | 853 KB | 678 | SM100 SASS encoders (set 4) |
| p1.16 | 0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec + 2,095 bitfield accessors |
| p1.17 | 0x1150000--0x1225000 | 853 KB | 733 | SM89/90 codec (decoders + encoders) |
| p1.18 | 0x1225000--0x12FA000 | 853 KB | 1,552 | Reg-pressure scheduling + ISel + encoders |
| p1.19 | 0x12FA000--0x13CF000 | 853 KB | 1,282 | Operand legalization + peephole |
| p1.20 | 0x13CF000--0x14A4000 | 853 KB | 1,219 | SM120 peephole pipeline |
| p1.21 | 0x14A4000--0x1579000 | 853 KB | 606 | Blackwell ISA encode/decode |
| p1.22 | 0x1579000--0x164E000 | 853 KB | 1,324 | Encoding + peephole matchers |
| p1.23 | 0x164E000--0x1723000 | 853 KB | 899 | ISel pattern matching core |
| p1.24 | 0x1723000--0x17F8000 | 853 KB | 631 | ISA description database |
| p1.25 | 0x17F8000--0x18CD000 | 853 KB | 1,460 | SASS printer + peephole dispatch |
| p1.26 | 0x18CD000--0x19A2000 | 853 KB | 1,598 | Scheduling + peephole dispatchers |
| p1.27 | 0x19A2000--0x1A77000 | 853 KB | 1,393 | GPU ABI + SM89/90 encoders |
| p1.28 | 0x1A77000--0x1B4C000 | 853 KB | 1,518 | SASS emission backend |
| p1.29 | 0x1B4C000--0x1C21000 | 853 KB | 1,974 | SASS emission + format descriptors |
| p1.30 | 0x1C21000--0x1CE3000 | 780 KB | 1,628 | ELF emitter + infra library layer |
Several regions were further split into sub-reports (p1.04a/b, p1.05a/b, p1.06a/b, p1.07a/b, p1.08a/b) when the initial analysis revealed that a region contained multiple distinct subsystems requiring separate treatment.
Sweep Report Structure
Each sweep report follows a consistent format:
================================================================================
P1.XX SWEEP: Functions in address range 0xAAAA000 - 0xBBBB000
================================================================================
Range: 0xAAAA000 - 0xBBBB000
Files found: NNN decompiled .c files (of which ~MMM are > 1KB)
Total decompiled size: X,XXX,XXX bytes
Functions in range (from DB): NNN
Named functions: NNN (or 0 if all are sub_XXXXXX)
Functions with identified callers: NNN
CONTEXT: [1-paragraph summary of the region's purpose]
================================================================================
SECTION 1: [Subsystem name]
================================================================================
### 0xAAAAAA -- sub_AAAAAA (NNNN bytes / NNN lines)
**Identity**: [Function identification]
**Confidence**: [CERTAIN / HIGH / MEDIUM]
**Evidence**:
- [String evidence]
- [Structural evidence]
- [Callgraph evidence]
**Key code**:
[Relevant decompiled excerpts]
**Note**: [Additional observations]
Each function entry records the address, size, decompiled line count, proposed identity, confidence level, evidence citations, and key code excerpts. The reports are raw working notes -- they contain false starts, corrections, and evolving hypotheses that were resolved as more context became available.
Analysis Ordering
The sweep was not performed in address order. The analysis followed an information-maximizing sequence:
- p1.01 (infrastructure + CLI) first -- establishes the allocator, hash map, TLS, and diagnostic patterns that appear throughout the binary.
- p1.11 (PhaseManager) second -- identifies all 159 phases and their vtable entries, providing the skeleton of the optimization pipeline.
- p1.07 (register allocator) and p1.06 (scheduler) third -- these are the highest-complexity subsystems with the richest string evidence.
- p1.12--p1.15 (SASS encoders) in batch -- once the encoding template was recognized, all encoder regions were swept rapidly with template matching.
- p1.30 (library layer) late -- identifies shared infrastructure (ELF emitter, demangler, thread pool) referenced by earlier regions.
- Remaining regions filled in by decreasing information density.
Cross-Referencing with PTXAS CLI
Several ptxas command-line features and internal mechanisms provide runtime validation of static analysis findings.
--stat and --verbose
Running ptxas --stat input.ptx prints per-kernel resource usage (register count, shared memory, stack frame size). This output is generated by sub_A3A7E0 (the IR statistics printer), which was identified from the format strings:
ptxas info : Used %d registers, %d bytes smem, %d bytes cmem[0]
Comparing the --stat output against the decompiled statistics printer confirms the register counting and resource tracking logic.
--compiler-stats
Enables the timing output (Parse-time, DAGgen-time, OCG-time, etc.) from sub_446240. This confirms the pipeline stage ordering and the stage boundary functions identified by string xrefs.
--fdevice-time-trace
Generates Chrome trace JSON output showing per-phase timing. The trace parser at sub_439880 and the ftracePhaseAfter string at 0x1CE383F confirm the per-phase instrumentation infrastructure. The trace output lists phase names that can be cross-referenced against the 159-entry phase table.
DUMPIR Knob
The internal DUMPIR knob (accessed via -knob DUMPIR=<phase_name>) dumps the Ori IR at specified pipeline points. The string "Please use -knob DUMPIR=AllocateRegisters for debugging" at 0x21EFBD0 confirms this mechanism. The NamedPhases registry at sub_9F4040 maps phase names to pipeline positions. Available DUMPIR points include:
- `OriPerformLiveDead`, `OriPerformLiveDeadFirst` through `OriPerformLiveDeadFourth`
- `AllocateRegisters` (the register allocation phase)
- `swap1` through `swap6` (swap elimination phases)
- `shuffle` (instruction scheduling)
The DUMPIR output format reveals the IR structure: basic block headers, instruction opcodes, register names (R0--R255, UR0--UR63, P0--P7, UP0--UP7), and operand encodings. This runtime output was used to validate the IR format reconstructed from static analysis.
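The register namespaces seen in DUMPIR output can be checked mechanically when parsing dumps. A small sketch (the regex and bounds come from the register names listed above; the helper name is ours):

```python
import re

# Register namespaces observed in DUMPIR output.
REG_BOUNDS = {
    "R":  256,   # general registers R0--R255
    "UR": 64,    # uniform registers UR0--UR63
    "P":  8,     # predicates P0--P7
    "UP": 8,     # uniform predicates UP0--UP7
}

def classify_register(tok):
    """Return (class, index) for a valid register token, else None."""
    m = re.fullmatch(r"(UR|UP|R|P)(\d+)", tok)
    if not m:
        return None
    cls, idx = m.group(1), int(m.group(2))
    return (cls, idx) if idx < REG_BOUNDS[cls] else None
```

Validating every operand token this way while parsing a dump is a cheap consistency check on the reconstructed IR format.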
--keep Flag
The --keep flag preserves intermediate files. While ptxas does not emit intermediate text files in the same way as nvcc, the --keep behavior in the overall CUDA compilation pipeline (nvcc -> cicc -> ptxas) allows inspecting the PTX input that reaches ptxas, confirming the PTX grammar and instruction format expectations.
Confidence Levels
Every function identification in this wiki carries one of three confidence levels:
| Level | Meaning | Basis |
|---|---|---|
| CERTAIN | Identity established beyond reasonable doubt | Direct string evidence naming the function, or the function is a PLT import with a known name |
| HIGH | Strong identification (>90%) | Multiple corroborating indicators: string xrefs, callgraph position, structural fingerprint, decompiled algorithm match |
| MEDIUM | Probable identification (70--90%) | Single indicator (vtable position, size fingerprint, callgraph context) or inferred from surrounding identified functions |
The distribution across the ~200 key identified functions in the Function Map:
- CERTAIN: ~30 functions (PLT imports, `main`, functions with unique identifying strings)
- HIGH: ~130 functions (string evidence + structural confirmation)
- MEDIUM: ~40 functions (inferred from callgraph context or structural similarity)
The remaining ~39,985 functions are either unidentified (template-generated encoding handlers, small utility stubs) or identified at subsystem level only (e.g., "this is an SM100 SASS encoding handler" without knowing which specific opcode it encodes).
Reproducing the Analysis
To reproduce this analysis from scratch:

1. **Obtain the binary.** Install CUDA Toolkit 13.0. The binary is at `<cuda>/bin/ptxas`. Verify: `ptxas --version` should report `V13.0.88` and the binary should be 37,741,528 bytes. Build string: `cuda_13.0.r13.0/compiler.36424714_0`.
2. **Run IDA auto-analysis.** Open ptxas in IDA Pro 8.x with default x86-64 settings. Allow auto-analysis to complete (8-10 minutes). Accept GCC as the detected compiler.
3. **Run the extraction script.** Load `analyze_ptxas.py` in IDA's Python console. The script exports all 8 JSON artifacts plus per-function decompiled C files, disassembly files, and control flow graph JSON files. Expected runtime: 4-8 hours for the full export (the xref export dominates).
4. **Decode ROT13 strings.** Apply `codecs.decode(s, "rot_13")` to all strings in the knob constructors (`ctor_003`, `ctor_005`, `ctor_007`). This decodes ~3,000 obfuscated names into readable English identifiers.
5. **Identify anchor functions.** Start with the highest-confidence identifications:
   - `main` at `0x409460` (named in symbol table)
   - `sub_446240` (real main -- called from `main`, contains timing format strings)
   - `sub_C60D30` (phase factory -- 159-case switch)
   - `sub_C62720` (PhaseManager constructor -- references phase vtable table)
   - `sub_79B240` (GetKnobIndex -- inline ROT13 decoding)
   - `sub_42FBA0` (diagnostic emitter -- 2,350 callers, severity dispatch)
6. **Sweep the address space.** Work through the `.text` section in regions of ~870 KB. For each region:
   - Count functions and decompiled file sizes
   - Identify string anchors (search for region-specific strings)
   - Classify functions by structural template (encoding handler, phase body, utility, etc.)
   - Propagate identities from known callers/callees
   - Record findings in the sweep report format
7. **Cross-reference with runtime.** Compile a simple CUDA kernel and run `ptxas --stat --verbose --compiler-stats` to observe runtime behavior. Use `-knob DUMPIR=<phase>` to dump IR at specific pipeline points. Compare the dumped IR format against the IR structure reconstructed from decompiled code.
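The ROT13 decoding step is mechanical; a minimal sketch is below. The encoded example string is constructed here for illustration (by encoding a knob name documented in this wiki), not extracted from the binary:

```python
import codecs

def decode_knob(obfuscated: str) -> str:
    """ROT13-decode an obfuscated knob name (letters shift; digits pass through)."""
    return codecs.decode(obfuscated, "rot_13")

# ROT13 is its own inverse, so encoding a known decoded name and
# decoding it again round-trips exactly.
encoded = codecs.encode("MercuryUseActiveThreadCollectiveInsts", "rot_13")
print(encoded)               # ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf
print(decode_knob(encoded))  # MercuryUseActiveThreadCollectiveInsts
```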
Dependencies
The extraction script (analyze_ptxas.py) requires IDA Pro 8.x with Hex-Rays decompiler and Python 3.x. No external Python packages are needed -- only the IDA Python API (idautils, idc, idaapi, ida_bytes, ida_funcs, ida_segment, ida_nalt, ida_gdl, ida_hexrays).
Post-export analysis requires only the Python 3.8+ standard library (json, codecs, collections).
Debug Infrastructure: bugspec.txt
ptxas contains an internal fault injection framework that deliberately corrupts the Mercury IR to test compiler verification passes. The mechanism is entirely file-driven: if a file named ./bugspec.txt exists in the current working directory when ptxas runs, the function sub_A83AC0 reads it and injects controlled mutations into the post-register-allocation instruction stream. No CLI flag activates this -- file presence alone is sufficient. If the file is absent, a diagnostic is printed to stdout (Cannot open file with bug specification) and compilation proceeds normally.
File Format
The file contains a single line of six integers:
COUNT0,COUNT1,COUNT2,COUNT3 COUNT4 COUNT5
The first four are comma-separated; then a space; then two space-separated values. Each integer specifies the number of faults to inject for that bug category. Zero or negative disables the category.
| Field | Variable | Category | Target |
|---|---|---|---|
| COUNT0 | v78 | Register bugs | General (R) and uniform (UR) register operands |
| COUNT1 | v79 | Predicate bugs | Predicated instruction operands |
| COUNT2 | v80 | Offset/spill bugs | Memory offsets in spill/refill instructions |
| COUNT3 | v81 | Remat bugs | Rematerialized value operands |
| COUNT4 | v82 | R2P/P2R bugs | Register-to-predicate conversion instructions |
| COUNT5 | v83 | Bit-spill bugs | Bit-level spill storage operands |
Example: 3,2,1,0 0 1 injects 3 register bugs, 2 predicate bugs, 1 offset bug, and 1 bit-spill bug.
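The format is simple enough to sketch a parser for. The dictionary keys below are descriptive labels taken from the category table above, not names used in the binary:

```python
def parse_bugspec(line: str) -> dict:
    """Parse the single bugspec.txt line: 'C0,C1,C2,C3 C4 C5'.

    Zero or negative counts disable a category, mirroring the
    behavior described above, so negatives are clamped to zero.
    """
    head, c4, c5 = line.split()
    counts = [int(x) for x in head.split(",")] + [int(c4), int(c5)]
    names = ["register", "predicate", "offset_spill",
             "remat", "r2p_p2r", "bit_spill"]
    return {name: max(0, n) for name, n in zip(names, counts)}

print(parse_bugspec("3,2,1,0 0 1"))
# {'register': 3, 'predicate': 2, 'offset_spill': 1, 'remat': 0, 'r2p_p2r': 0, 'bit_spill': 1}
```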
Bug Kind String Table
Each injected fault record carries a kind code (1--10) mapped to a string table at 0x21F0500:
| Kind | String | Meaning |
|---|---|---|
| 1 | r-ur register | General or uniform register replaced with wrong register |
| 2 | p-up register | Predicate or uniform predicate register corrupted |
| 3 | any reg | Any register class operand corrupted |
| 4 | offset | Memory offset shifted by +16 bytes |
| 5 | regular bug | Generic operand value replacement |
| 6 | predicated bug | Predicate source operand corrupted |
| 7 | remat bug | Rematerialization value corrupted |
| 8 | spill-regill bug | Spill or refill path value corrupted |
| 9 | r2p-p2r bug | Register-predicate conversion operand corrupted |
| 10 | bit-spill bug | Bit-level spill storage operand corrupted |
Injection Algorithm
The injection proceeds in four phases:
1. Candidate collection. The function walks the Mercury IR instruction linked list (from context[0]+272). For each instruction, it checks which bug categories are active and whether the instruction qualifies:
   - Register bugs (field0): Scans operands for type-tag 1 (register) with register class 6 (general) or 3 (predicate), excluding opcodes 41--44. Eligible instructions are collected into a candidate list.
   - Predicate bugs (field1): Checks flag byte at instruction+73 for bit 0x10 (predicated). Eligible instructions are collected separately.
   - Offset/spill bugs (field2): Calls `sub_A56DE0`/`sub_A56CE0` against the register allocator state (context[133]) to identify spill/refill instructions.
   - Remat bugs (field3): Queries the rematerialization hash table (context+21 via `sub_A54200`) for instructions with remat entries.
   - R2P/P2R bugs (field4): Checks instruction opcode (offset +72) for values 268, 155, 267, 173 (the R2P and P2R conversion opcodes, with bit-masked variants).
   - Bit-spill bugs (field5): Checks operand count > 2, flag bit 0x10 at offset +28, and calls `sub_A53DB0`/`sub_A53C40`/`sub_A56880` for bit-spill eligibility.
2. Random selection. Seeds the RNG with time(0) via srand(). For each active category, sub_A83490 randomly selects N instruction indices from the candidate list, where N is the count from bugspec.txt. The selector uses FNV-1a hashing on instruction addresses for collision avoidance, re-rolling duplicates.
3. Mutation application. For register and predicate categories, sub_A5EC40 iterates over selected instructions and calls sub_A5E9E0, which finds the last register operand, allocates a new register of the same class via sub_91BF30, and replaces the operand value. For offset bugs, the mutation adds +16 to the signed 24-bit offset field directly: *operand = (sign_extend_24(*operand) + 16) & 0xFFFFFF | (*operand & 0xFF000000).
4. Reporting. Prints to stdout:
Num forced bugs N
Created a bug at index I : kind K inst # ID [OFF] in operand OP correct val V replaced with W
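The offset mutation in phase 3 is self-contained bit arithmetic; a sketch reproducing the decompiled expression quoted above:

```python
def sign_extend_24(v: int) -> int:
    """Interpret the low 24 bits of v as a signed two's-complement value."""
    v &= 0xFFFFFF
    return v - 0x1000000 if v & 0x800000 else v

def mutate_offset(operand: int) -> int:
    """Offset-bug mutation: add +16 to the signed 24-bit offset field while
    preserving the top byte of the 32-bit operand word, matching
    (sign_extend_24(op) + 16) & 0xFFFFFF | (op & 0xFF000000)."""
    return ((sign_extend_24(operand) + 16) & 0xFFFFFF) | (operand & 0xFF000000)

print(hex(mutate_offset(0xAB000010)))  # 0xab000020 -- top byte untouched
print(hex(mutate_offset(0x00FFFFF0)))  # 0x0 -- offset -16 wraps forward to 0
```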
Fault Record Structure (40 bytes)
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Kind (1--10) |
| +8 | 8 | Pointer to Mercury instruction node |
| +16 | 4 | Operand index within instruction |
| +20 | 4 | Original operand value |
| +24 | 4 | Replacement operand value |
| +28 | 4 | Selection index (position in candidate list) |
| +32 | 4 | Instruction ID (from instruction+16) |
Records are stored in a dynamic array at context[135].
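The 40-byte layout can be expressed as a `struct` format string. Treating the 4 bytes at +4 and the trailing 4 bytes as padding is an assumption (alignment for the pointer at +8, and array stride, respectively); the documented table does not name those bytes:

```python
import struct

# kind(i) pad(4x) inst_ptr(Q) op_idx/orig_val/repl_val/sel_idx/inst_id(5i) pad(4x)
FAULT_RECORD = struct.Struct("<i4xQ5i4x")
assert FAULT_RECORD.size == 40  # matches the documented record size

def parse_fault_record(buf: bytes) -> dict:
    """Decode one 40-byte fault record into named fields."""
    kind, inst_ptr, op_idx, orig, repl, sel_idx, inst_id = FAULT_RECORD.unpack(buf)
    return {"kind": kind, "inst_ptr": inst_ptr, "operand_index": op_idx,
            "original_value": orig, "replacement_value": repl,
            "selection_index": sel_idx, "instruction_id": inst_id}

# Round-trip a synthetic record (all values are illustrative).
raw = struct.pack("<i4xQ5i4x", 4, 0xDEADBEEF, 2, 16, 32, 0, 7)
print(parse_fault_record(raw)["kind"])  # 4
```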
Function Map
| Address | Function | Role | Confidence |
|---|---|---|---|
| 0xA83AC0 | sub_A83AC0 | bugspec.txt reader and injection coordinator | CERTAIN (string: `./bugspec.txt`) |
| 0xA83490 | sub_A83490 | Random index selector with FNV-1a dedup | HIGH |
| 0xA5E9E0 | sub_A5E9E0 | Register operand mutation (allocates new register) | HIGH |
| 0xA5EC40 | sub_A5EC40 | Batch mutation applicator (iterates selected instructions) | HIGH |
| 0xA832D0 | sub_A832D0 | Hash table resize for dedup tracking | MEDIUM |
Significance
This is NVIDIA's internal compiler testing infrastructure for stochastic fault injection. It targets specific vulnerability surfaces in the register allocator and post-allocation pipeline: wrong-register assignments, address calculation errors, predicate propagation failures, rematerialization correctness, spill code integrity, and register-predicate conversion accuracy. The time(0)-seeded RNG produces different fault patterns on each run for the same bugspec.txt, enabling randomized stress testing of verification passes.
Embedded C++ Name Demangler
PTXAS statically embeds an Itanium ABI C++ name demangler rather than linking libc++abi or libstdc++. The demangler is a self-contained 41-function cluster spanning 0x1CD8B00--0x1CE1E60 in .text, with a single external entry point. The core recursive-descent parser at sub_1CDC780 (93 KB decompiled, 3,442 lines) handles the full Itanium mangling grammar: nested names, template arguments, substitutions, function types, and special names.
API and Integration
The public-facing function is sub_1CE23F0, whose signature matches __cxa_demangle exactly: it takes a mangled name string, an optional output buffer with length pointer, and a status pointer; it returns a malloc-allocated demangled string or NULL with a status code (-1 = memory failure, -3 = invalid arguments). The only caller of this function is the embedded terminate handler at sub_1CD7850, which prints the standard "terminate called after throwing an instance of '...'" diagnostic to stderr, demangling the exception type name before display.
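Because the embedded entry point matches `__cxa_demangle`, its contract can be illustrated by calling the same API from the system C++ runtime via ctypes. This is a Linux-specific sketch that assumes `libstdc++.so.6` is loadable; the point is the signature and ownership rules (malloc-allocated result, status out-parameter), not the embedded implementation:

```python
import ctypes

libstdcxx = ctypes.CDLL("libstdc++.so.6")
libc = ctypes.CDLL(None)  # main program namespace; provides free() on Linux

# char* __cxa_demangle(const char* mangled, char* buf, size_t* len, int* status)
demangle_fn = getattr(libstdcxx, "__cxa_demangle")
demangle_fn.restype = ctypes.c_void_p
demangle_fn.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                        ctypes.POINTER(ctypes.c_size_t),
                        ctypes.POINTER(ctypes.c_int)]
libc.free.argtypes = [ctypes.c_void_p]

def demangle(name: str):
    """Return the demangled form, or None on failure (status != 0)."""
    status = ctypes.c_int(0)
    ptr = demangle_fn(name.encode(), None, None, ctypes.byref(status))
    if status.value != 0 or not ptr:
        return None
    try:
        return ctypes.string_at(ptr).decode()
    finally:
        libc.free(ptr)  # result is malloc-allocated, as in ptxas's embedded copy

print(demangle("_Z3foov"))  # foo()
```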
Why Embedded
PTXAS imports only libc, libpthread, libm, and libgcc_s (146 PLT stubs total). It has no dependency on any C++ runtime library. The only C++ ABI symbol in the PLT is __cxa_atexit (at 0x401989), used to register the terminate handler. By embedding the demangler and terminate handler directly, NVIDIA avoids a runtime dependency on libstdc++ or libc++abi, which would otherwise be required solely for exception type name display in fatal error messages. This is consistent with the binary's overall strategy of minimizing external dependencies.
Function Map
| Address | Function | Size | Role | Confidence |
|---|---|---|---|---|
| 0x1CDC780 | sub_1CDC780 | 93 KB | Demangler core (recursive-descent parser): parses Itanium-mangled names via large switch dispatch | HIGH (size, structure, callgraph isolation) |
| 0x1CE0600 | sub_1CE0600 | 580 B | Recursive dispatch wrapper: re-enters the parser for nested name components (76 call sites from core) | HIGH (mutual recursion with sub_1CDC780) |
| 0x1CE23F0 | sub_1CE23F0 | 340 B | __cxa_demangle-compatible API: mangled string in, demangled string out, malloc-allocated | CERTAIN (API shape, status codes, free/memcpy/strlen callees) |
| 0x1CE1E60 | sub_1CE1E60 | ~200 B | Parse entry point: initializes parse state and invokes the core | HIGH (bridge between API and parser) |
| 0x1CD7850 | sub_1CD7850 | 280 B | Terminate handler (__cxa_terminate): prints "terminate called after throwing..." to stderr | CERTAIN (string: "terminate called after throwing an instance of '") |
Version Update Procedure
All addresses, function counts, and structural offsets in this wiki are specific to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0, 37,741,528 bytes). When a new CUDA toolkit ships a different ptxas binary, the wiki must be updated. This section documents the procedure.
Version-Stable vs Version-Fragile Findings
Not everything changes between versions. Understanding what is stable dramatically reduces update effort.
Version-stable (survives across minor and most major releases unchanged):
| Category | Examples | Why stable |
|---|---|---|
| Algorithm logic | Copy propagation worklist walk, fatpoint pressure computation, MurmurHash3 constants | Algorithms are rarely rewritten between releases |
| Data structure layouts | Pool allocator bins at +2128, Mercury instruction node at 112 bytes, 16-byte phase objects | Struct layouts change only when fields are added or reordered |
| Knob names | MercuryUseActiveThreadCollectiveInsts, ScavInlineExpansion, all 2,000+ ROT13 names | Knob names are API-like -- changing them breaks internal test harnesses |
| ROT13 encoding | The ROT13 obfuscation layer itself, decoded by codecs.decode(s, "rot_13") | Obfuscation scheme has been consistent across observed versions |
| Phase count and ordering | 159 phases in the OCG pipeline, ordered by the PhaseManager vtable table | Phase count may grow but existing phases retain their relative order |
| Pipeline stage names | Parse-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time | Stage names are embedded in format strings unlikely to change |
| Subsystem names | OCG, Mercury, Ori, Scav | Internal codenames are stable across releases |
| Encoding handler template | 6-step pattern: opcode ID, movaps format descriptor, register class map, operand registration, finalize, bitfield extract | Template structure is generated from a stable code generator |
| Error message text | "SM does not support LDCU", "Invalid knob identifier" | Diagnostic strings are rarely reworded |
Version-fragile (changes with every recompilation):
| Category | Examples | Why fragile |
|---|---|---|
| Function addresses | Every sub_XXXXXX reference, vtable addresses like off_22BD5C8 | ASLR-style shifts from any code or data size change |
| Address ranges | Sweep boundaries 0x400000--0x4D5000, subsystem regions | Functions move when preceding code grows or shrinks |
| Function sizes | sub_446240 at 12,345 bytes | Inlining decisions change, optimizer improvements add/remove code |
| Caller/callee counts | sub_424070 at 3,809 callers | New call sites added, old ones removed |
| Struct offsets | context[133], context+1584 | New fields inserted into context structs |
| `.rodata` addresses | String locations like 0x202D4D8, encoding table addresses | Data layout shifts with code changes |
| Call graph edge counts | 548,693 edges | New functions and call sites |
| Total function count | 40,185 | New SM targets add encoding handlers |
Identifying Function Address Changes
When loading a new ptxas version into IDA:

1. **Extract the same 8 JSON artifacts** using `analyze_ptxas.py` (or equivalent). The critical artifacts for diffing are `ptxas_functions.json` (address, size, callee list) and `ptxas_strings.json` (string content, xref locations).
2. **Match functions by invariant properties.** Functions cannot be matched by address alone. Use these matching criteria in priority order:
   - String anchors. Functions containing unique string references (e.g., the function referencing `"Please use -knob DUMPIR=AllocateRegisters"`) can be matched across versions by searching for the same string in the new binary. This is the highest-confidence matching method.
   - Size + callee signature. For functions without string anchors, match by (approximate size, sorted callee list). A function of ~2,100 bytes calling the pool allocator, OOM handler, and hash map insert is almost certainly the same function even if its address shifted by megabytes.
   - Callgraph position. Functions identified by their caller/callee topology: the phase factory is the function called from the PhaseManager constructor with 159+ case targets; the diagnostic emitter is the function with 2,000+ callers that calls `vfprintf`.
   - Vtable slot position. Phase `execute()` methods are at vtable slot 0. If the vtable table address changes but still contains 159 entries, the slot positions identify each phase.
   - Template fingerprinting. Encoding handlers matching the 6-step template (bitfield insert via the highest-caller utility, `movaps` from `.rodata`, operand registrars, finalize call) are encoding handlers in any version.
3. **Diff the function lists.** Produce a mapping `{old_addr -> new_addr}` for all matched functions. Functions present in the new binary but absent in the old are new (likely new SM target support). Functions absent in the new binary are removed (dropped legacy SM support) or merged.
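The size + callee signature criterion lends itself to a mechanical first pass. A sketch, assuming a simplified record schema of `{"addr", "size", "callees"}` dicts (the real ptxas_functions.json layout may differ):

```python
from collections import defaultdict

def match_functions(old_funcs, new_funcs):
    """First-pass cross-version matcher: bucket new-binary functions by
    (size rounded to 64 bytes, callee count) and accept only buckets with
    exactly one candidate. Returns {old_addr: new_addr}; string-anchor and
    callgraph matching would refine the unmatched remainder."""
    def key(fn):
        return (fn["size"] // 64, len(fn["callees"]))
    buckets = defaultdict(list)
    for fn in new_funcs:
        buckets[key(fn)].append(fn)
    mapping = {}
    for fn in old_funcs:
        candidates = buckets.get(key(fn), [])
        if len(candidates) == 1:  # reject ambiguous buckets outright
            mapping[fn["addr"]] = candidates[0]["addr"]
    return mapping

# Illustrative records (addresses and sizes are made up for the example).
old = [{"addr": 0x446240, "size": 2112, "callees": [1, 2, 3]}]
new = [{"addr": 0x447AB0, "size": 2120, "callees": [7, 8, 9]}]
print(match_functions(old, new))  # maps 0x446240 -> 0x447AB0
```

Rounding the size to 64-byte buckets tolerates small codegen drift between versions; near-boundary sizes can still mis-bucket, which is why ambiguous buckets are discarded rather than guessed.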
Updating Sweep Reports
The 30-region sweep reports in `ptxas/raw/` are version-locked historical records -- they document the analysis of v13.0.88 and should not be overwritten. For a new version:

1. **Re-run the sweep** with new address ranges derived from the new binary's function list. The region partitioning should follow the same subsystem-aligned strategy: infrastructure first, then PhaseManager, then high-complexity subsystems, then batch encoding handlers.
2. **Name new reports with a version suffix:** `p2.01-sweep-v13.1-0xNNN-0xMMM.txt` (or whatever scheme distinguishes the version).
3. **Cross-reference against old reports.** For each region, note which functions moved, which are new, and which disappeared. The old sweep reports provide the expected function identities; the new sweep validates whether those identities still hold at the new addresses.
Pages Most Sensitive to Version Changes
These wiki pages require immediate updates when the binary changes:
| Page | Sensitivity | What changes |
|---|---|---|
| `function-map.md` | Critical | Every address in every table row. The entire page is address-indexed. |
| `binary-layout.md` | Critical | Section addresses, subsystem boundaries, address-range diagram. |
| `VERSIONS.md` | Critical | Binary size, build string, function count, version number. |
| `pipeline/overview.md` | High | Phase factory address, PhaseManager constructor address, vtable table address. |
| `scheduling/algorithm.md` | High | Scheduler function addresses, priority function addresses. |
| `regalloc/algorithm.md` | High | Allocator function addresses, fatpoint computation address. |
| `codegen/encoding.md` | High | Encoding handler address ranges, format descriptor addresses. |
| `config/knobs.md` | Medium | Knob constructor addresses (content of knob names is stable). |
| `ir/instructions.md` | Medium | Opcode numbers may shift if new instructions are added. |
| `targets/index.md` | Medium | New SM targets may appear, changing validation table sizes. |
| `methodology.md` | Low | The methodology itself is version-stable; only the "Scope and Scale" table needs updating. |
Recommended Update Workflow
The update follows a five-step sequence. Steps 1-2 are mechanical; steps 3-5 require analyst judgment.
Step 1: Extract new IDA artifacts.
Load the new ptxas binary into IDA Pro 8.x. Run analyze_ptxas.py to produce the 8 JSON artifacts and per-function decompiled .c files. Store them in a version-specific directory (e.g., ptxas/ida-v13.1/ or alongside the existing artifacts with clear version labeling).
Step 2: Diff against the old artifacts.
Write or use a diff script that:

- Compares `ptxas_functions.json` (old vs new) by matching on string anchors, size+callee signature, and callgraph position.
- Produces a `{old_addr -> new_addr}` mapping for matched functions.
- Lists unmatched functions in both directions (new functions, removed functions).
- Compares `ptxas_strings.json` to detect new strings, removed strings, and strings whose xref functions changed.
- Reports total function count delta, binary size delta, and new section addresses.
Step 3: Update address-sensitive pages.
Using the address mapping from Step 2:
- Update every `sub_XXXXXX` reference in `function-map.md`, `binary-layout.md`, and all pages listed in the sensitivity table above.
- Update the "Scope and Scale" table in `methodology.md` with new function counts, string counts, binary size, and build string.
- Update `VERSIONS.md` with the new binary metadata.
- For pages with address ranges (sweep boundaries, subsystem regions), recompute the ranges from the new function list.
Step 4: Verify key struct layouts.
Struct offset changes are the most dangerous kind of version drift because they silently invalidate decompiled code analysis. For each documented struct:
- Re-decompile the struct's primary accessor function (e.g., `sub_424070` for the pool allocator, `sub_4280C0` for the TLS context).
- Compare field offsets against the documented layout.
- If offsets shifted, update the struct documentation and propagate the change to all pages that reference those offsets.
Priority structs to verify: pool allocator (free-list bins at +2128, mutex at +7128), TLS context (280 bytes), Mercury instruction node (112 bytes), scheduler context (~1000 bytes), allocator state (1590+ bytes), phase objects (16 bytes).
Step 5: Validate phase pipeline.
- Re-extract the phase vtable table (find the new address of the 159-entry pointer array in `.data.rel.ro`).
- Verify all 159 phases are present and in the expected order.
- Check for new phases (count > 159) or removed phases (count < 159).
- Re-run `ptxas --fdevice-time-trace` on a test kernel and cross-reference the phase names in the trace output against the wiki's phase list.
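The final cross-reference check above can be automated. This sketch assumes only the standard Chrome trace schema (a top-level `traceEvents` array of events with `name` fields), which is the format `--fdevice-time-trace` emits:

```python
import json

def unknown_phases(trace_text: str, wiki_phases: set) -> set:
    """Phase names appearing in the trace output but missing from the
    wiki's 159-entry phase list -- candidates for newly added phases."""
    events = json.loads(trace_text).get("traceEvents", [])
    return {e["name"] for e in events if "name" in e} - wiki_phases

# Minimal synthetic trace; real traces come from ptxas --fdevice-time-trace.
trace = ('{"traceEvents": [{"name": "AllocateRegisters", "ph": "X"},'
         ' {"name": "OriCopyProp", "ph": "X"}]}')
print(unknown_phases(trace, {"AllocateRegisters"}))  # {'OriCopyProp'}
```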
Raw Data Locations
All raw analysis artifacts for the current version (v13.0.88) live in the repository under ptxas/:
| Directory | Contents |
|---|---|
| `ptxas/raw/` | 40 sweep reports (p1.01--p1.30 plus sub-region splits), per-task investigation reports (P0_*, P1_*, P2_*, etc.) |
| `ptxas/decompiled/` | Per-function Hex-Rays decompiled C files (`sub_XXXXXX.c`, named functions like `ctor_003_0x4095d0.c`) |
| `ptxas/disasm/` | Per-function disassembly files |
| `ptxas/graphs/` | Per-function control flow graph JSON files (80,078 files) |
| `ptxas/` (root) | The 8 JSON artifacts (`ptxas_functions.json`, `ptxas_strings.json`, `ptxas_callgraph.json`, `ptxas_xrefs.json`, `ptxas_comments.json`, `ptxas_names.json`, `ptxas_imports.json`, `ptxas_segments.json`), the IDA database (`ptxas.i64`), the extraction script (`analyze_ptxas.py`), and the binary itself (`ptxas`) |
| `ptxas/wiki/src/` | The wiki source pages (this document and all others) |
When updating to a new version, preserve the existing artifacts for v13.0.88 (rename or move to a versioned subdirectory) and store new artifacts alongside them. The sweep reports in ptxas/raw/ are historical records and should never be overwritten.
Limitations and Known Gaps
- **No dynamic validation of optimization correctness.** All findings are from static analysis. The identified phase algorithms have not been tested against runtime inputs to verify they produce correct output for all corner cases.
- **39.6% of functions are vtable-dispatched.** Functions with zero static callers can only be reached by finding the vtable or function pointer table that references them. Some vtables in deep `.rodata` may have been missed, leaving some functions orphaned.
- **No upstream reference for any code.** Unlike cicc (LLVM fork) or nvcc (EDG frontend), ptxas has no open-source analog. Every identification is from first principles. This limits confidence for functions where string evidence is absent and structural analysis is the only basis.
- **Template-generated code is indistinguishable.** The ~4,000 SASS encoding handlers are generated from internal templates. Without the template source, mapping individual handlers to specific opcodes requires tracing the dispatch table entries, which has only been done for select handlers.
- **Mega-functions are partially opaque.** The four functions exceeding 200 KB (`sub_169B190` at 280 KB, `sub_143C440` at 233 KB, `sub_198BCD0` at 239 KB, `sub_18A2CA0` at 231 KB) could not be decompiled by Hex-Rays. Their behavior is understood from their callee lists (13,000--15,870 callees each) and their position in the pipeline, but the internal dispatch logic is known only at the disassembly level.
- **ROT13 decoding is necessary but not sufficient.** Decoding the 2,000+ knob names reveals the existence of tuning parameters but not their semantics. A knob named `MercuryPresumeXblockWaitBeneficial` can be decoded from ROT13, but understanding what "xblock wait beneficial" means requires analyzing the code paths that read the knob.
- **Version-specific addresses.** All addresses in this wiki apply to ptxas v13.0.88 (build `cuda_13.0.r13.0/compiler.36424714_0`). Other CUDA toolkit versions will have different addresses, different function counts, and potentially different phase orderings. However, the analysis methodology (string-driven, vtable-driven, callgraph propagation) applies to any version.
- **Indirect calls are undercounted.** The 548,693-edge call graph captures only direct `call` instructions resolved by IDA. Virtual calls through vtable pointers, function pointer callbacks, and computed jumps are not fully captured. The true call graph is significantly denser than what is recorded.
Corrections Log
This section documents every factual error discovered and corrected during the wiki improvement pass. Each entry records the error, the correction, affected pages, and the agent task that performed the fix. The full detail for each correction is in ptxas/raw/P5_11_corrections_log_report.txt.
Summary
| Metric | Count |
|---|---|
| Distinct factual errors corrected | 22 |
| Wiki pages with at least one fix | 30+ |
| Agent tasks that discovered errors | 15 |
| Agent tasks that propagated fixes | 5 |
Corrections by Severity
Systematic errors (affected 5+ pages each)
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 01 | Opcode numbering: wiki assumed two numbering systems; "Selected Opcode Values" table had wrong SASS mnemonic labels (e.g., 93=CALL, 95=EXIT, 97=MOV, 130=BAR) | One numbering system: ROT13 name table index IS the instruction opcode. Correct labels: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2 | 15 pages (ir/instructions, ir/cfg, passes/predication, passes/sync-barriers, passes/liveness, passes/general-optimize, passes/rematerialization, passes/copy-prop-cse, passes/strength-reduction, regalloc/abi, regalloc/spilling, intrinsics/sync-warp, codegen/isel, scheduling/latency-model, scheduling/algorithm) | P0-01, P4-02, P5-01 |
| 02 | Register class 6 = UB (Uniform Barrier); classes 2-6 all wrong | Class 6 = Tensor/Accumulator (MMA/WGMMA). Correct table: 2=R(alt), 3=UR, 4=UR(ext), 5=P/UP, 6=Tensor/Acc. Barrier regs use reg_type 9, outside the 7-class system | 7 pages (ir/registers, regalloc/overview, regalloc/algorithm, regalloc/spilling, passes/gmma-pipeline, intrinsics/tensor, ir/overview) | P0-02 |
| 03 | context+1584 had 5 conflicting names: code_object, sched_ctx, arch_backend, optimizer_state, function manager | Single object: SM-specific architecture backend ("sm_backend"), constructed per-compilation-unit in sub_662920 via SM version switch | 3 pages corrected (ir/data-structures, ir/overview, passes/copy-prop-cse); 14 pages acceptable as-is | P0-03 |
Identity misattributions
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 06 | sub_83EF00 (29KB) listed as "Top-level unrolling driver" | sub_83EF00 is MainPeepholeOptimizer (opcode switch on 2, 134, 133, 214, 213, 210). Actual unrolling driver: sub_1390B30 via Phase 22 entry sub_1392E30 | passes/loop-passes.md | P1-04, P5-03 |
| 07 | sub_926A30 (22KB) listed as "Main pipelining engine (modulo scheduling)" | sub_926A30 is the operand-level latency annotator and interference weight builder, called by sub_92C0D0 per-instruction | passes/loop-passes.md | P1-06 |
| 08 | sub_7E7380 described as "full structural equivalence" (opcode, type, all operands, register class comparison) | sub_7E7380 is 30 lines / 150 bytes: narrow predicate-operand compatibility check (predicate bit parity + last operand 24-bit ID + penultimate 8-byte encoding). Full structural comparison done by the 21 callers | passes/copy-prop-cse.md, passes/general-optimize.md | P1-07, P5-06 |
Inverted semantics
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 05 | isNoOp()=1 "means it executes unconditionally" | isNoOp()=1 means the dispatch loop SKIPS execute(). Code: if (!phase->isNoOp()) { phase->execute(ctx); } | passes/rematerialization.md | P0-05 |
| 09 | Hot-cold priority: "1 = cold, 0 = hot" | 1 = hot = higher priority, 0 = cold = lower priority. sub_A9CDE0 (hot detector) returns true -> bit 5 set -> higher priority | passes/hot-cold.md | P1-09, P5-06 |
| 10 | "Fatpoint" implied to be maximum-pressure point | Fatpoint scans for MINIMUM-cost slot. The name refers to the exhaustive (fat) scan evaluating all slots, not to picking the maximum | (verified correct across all pages -- 0 fixes needed) | P1-10, P5-06 |
Wrong numeric values
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 04 | context+1552 = "Legalization stage counter" with 3 values (3, 7, 12) | Pipeline progress counter with 22 values (0-21) spanning all pipeline categories | 4 pages (ir/data-structures, passes/late-legalization, passes/rematerialization, passes/copy-prop-cse) | P0-04 |
| 12 | 5 SASS opcode mnemonic typos: PSMTEST, LGDEPBAR, LGSTS, UBLKPC, UTMAREDG | CSMTEST, LDGDEPBAR, LDGSTS, UBLKCP, UTMREDG | reference/sass-opcodes.md | P2-11 |
| 14 | WGMMA case 9 = 0x1D5D (7517), case 10 = 0x1D5E (7518) | Case 9 = 0x1D5E (7518), case 10 = 0x1D60 (7520). Codes 0x1D5D/0x1D5F are advisory (non-serialization) warnings | passes/gmma-pipeline.md | P3-25 |
| 15 | ABI minimum: gen 5 (sm_60-sm_89) = 16 regs, gen 9+ = 24 regs | gen 3-4 (sm_35-sm_53) = 16, gen 5-9 (sm_60-sm_100) = 24. Binary: (generation - 5) < 5 ? 24 : 16 | regalloc/abi.md | P3-26 |
| 17 | Unrolling rejection table at 0x21D1980 with 36-byte structures | Rejection string pointer array at 0x21D1EA0 with simple integer indices 7-24. The 0x21D1980 table is for peephole operand range lookups | passes/loop-passes.md | P1-04 |
Phantom data and scope errors
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 11 | "Approximately 80 additional entries bulk-copied from unk_21C0E00" at SASS opcode indices 322-401, "totaling roughly 402 named opcodes" | Table has exactly 322 entries. The 1288-byte block at unk_21C0E00 is a 322-element identity map {0,1,...,321} copied to a different data structure (encoding category map at obj+0x2478) | reference/sass-opcodes.md | P2-11 |
| 13 | "139 explicitly named phases and 20 architecture-specific unnamed phases" | All 159 phases have names in the static table at off_22BD0C0. The original 139-phase inventory missed 20 phases (e.g., OriCopyProp, Vectorization, MercConverter, AllocateRegisters) | pipeline/overview.md, passes/index.md | P2-14, P4-03 |
| 16 | Warning 7018 (0x1B6A) attributed to SUSPEND/preserved scratch diagnostic | Code 0x1B6A does not exist in the binary. The actual code is 7011 (0x1B63) | regalloc/abi.md | P3-26 |
| 18 | Unrolling rejection codes listed as 0x80000001-0x80000018 | Those hex values appear in diagnostic message STRINGS, not as internal codes. Internal codes are simple integers 7-24 | passes/loop-passes.md | P1-04 |
Minor corrections
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 19 | sub_80B700/sub_80BC80 listed as unrolling functions | Both are peephole optimizer functions (called through sub_83EF00), not unrolling | passes/loop-passes.md | P1-04 |
| 22 | general-optimize.md called sub_7E7380 "instruction_equivalent" / "structural instruction equivalence" in 6 locations | Renamed to "predicate_operand_compatible" / "predicate-operand compatibility check" | passes/general-optimize.md | P5-06 |
Error Categories
| Category | Count | Examples |
|---|---|---|
| Identity misattribution | 5 | Wrong function-to-role mappings, wrong names for context fields |
| Wrong numeric values | 5 | Wrong opcode labels, wrong hex codes, wrong thresholds, wrong addresses |
| Inverted semantics | 3 | isNoOp skip-vs-execute, hot-cold bit polarity, fatpoint min-vs-max |
| Conflicting definitions | 3 | Register class contradictions across pages |
| Phantom data | 2 | Nonexistent SASS entries 322-401, nonexistent warning 7018 |
| Scope mischaracterization | 2 | context+1552 scope too narrow, phase naming scope too narrow |
| Encoding confusion | 2 | Hex-in-message-string vs internal code, wrong address for lookup table |
Lessons Learned
- Behavioral inference is unreliable for opcode identity. Observing that an opcode appears in branch contexts does not make it BRA. Always check the authoritative ROT13 name table.
- Cross-page consistency checks catch conflicting speculations. Five pages independently naming the same field (context+1584) is a strong signal that at least four are wrong.
- Counts from partial analysis are systematically low. The "3 values" for context+1552 and "139 named phases" both resulted from stopping the search too early. Exhaustive binary sweeps consistently reveal more entries.
- Function size is not a reliable identity signal. sub_83EF00 (29KB) was large enough to seem like a major driver, but size alone does not distinguish a peephole optimizer from a loop unroller.
- ROT13 decoding + binary cross-validation is the gold standard. Every correction that replaced speculative labels with ROT13-decoded names has held up under subsequent audits.
Version Tracking
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the exact ptxas binary under analysis and the version-related metadata recovered from the stripped ELF.
Binary Under Analysis
| Field | Value |
|---|---|
| Tool | ptxas (PTX optimizing assembler) |
| Version | 13.0.88 |
| Build tag | cuda_13.0.r13.0/compiler.36424714_0 |
| Build date | Wed Aug 20 01:55:12 PM PDT 2025 |
| Source path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| ELF size | 37,741,528 bytes (37.7 MB) |
| Architecture | x86-64 (AMD64) |
| Linking | Dynamically linked, stripped |
| Functions | ~40,000 (estimated from IDA/Ghidra DB) |
Embedded Version Strings
sub_432A00 (0x432A00, CLI option registration) contains the self-identification
strings that ptxas prints for --version / --list-version:
| String | Role |
|---|---|
| "Ptx optimizing assembler" | Product name |
| "NVIDIA (R)" | Vendor |
| Copyright 2005-2025 | Date range |
| "ptxocg.0.0" | OCG backend version tag |
The "ptxocg.0.0" tag also appears in sub_43A400 (compilation setup) and at
address 0x1CE74AB in the .rodata section, identifying the backend optimizer
component embedded inside ptxas.
Default Target Architecture
sub_6784B0 returns sm_75 (Turing) as the default compilation target when no
--gpu-name flag is supplied. This is consistent with the CUDA 13.0 toolkit
defaulting to a Turing-class GPU.
The full set of architecture strings referenced in the front-end validators (addresses 0x460000-0x4D5000) includes:
sm_20 sm_30 sm_35 sm_50 sm_60 sm_75 sm_80 sm_86 sm_89 sm_90
with sm_%d format-string patterns covering all supported SM codes.
Output ELF Format
Cubins emitted by ptxas use the ELF standard with:
| Field | Value |
|---|---|
| e_machine | EM_CUDA (0xBE = 190) |
| ELF class | ELFCLASS32 or ELFCLASS64 (per target) |
| Custom section type | SHT_CUDA_INFO = 0x70000064 |
| Magic (code object header) | 0x16375564E ("dUWc" + version nibble) |
The SM-version-to-code-object mapping lives in the ELF emitter at
sub_1C9F280. Example encodings recovered from sub_A3D000 range:
| field[93] | Target | Version encoding |
|---|---|---|
| 12288 | sm_30 | 0x70007 |
| 20481 | sm_50 | 0xC000C |
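A downstream tool can use the recovered constants above to recognize ptxas output. The following is a minimal sketch, assuming a standard 64-bit ELF header layout and a little-endian host; the function name and struct are ours, only the EM_CUDA and SHT_CUDA_INFO values come from this page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constants recovered in this page; the names are our own. */
#define EM_CUDA        190          /* 0xBE, e_machine for cubins   */
#define SHT_CUDA_INFO  0x70000064u  /* custom .nv.info section type */

/* Minimal prefix of the 64-bit ELF header -- enough to check
   the magic and machine fields of a candidate cubin. */
typedef struct {
    uint8_t  e_ident[16];  /* 0x7F 'E' 'L' 'F', class, data, ... */
    uint16_t e_type;
    uint16_t e_machine;    /* EM_CUDA for ptxas output */
} ElfHeaderPrefix;

/* Returns 1 if the buffer looks like a cubin (ELF magic + EM_CUDA). */
static int looks_like_cubin(const uint8_t *buf, size_t len)
{
    ElfHeaderPrefix h;
    if (len < sizeof h) return 0;
    memcpy(&h, buf, sizeof h);
    if (memcmp(h.e_ident, "\x7f" "ELF", 4) != 0) return 0;
    return h.e_machine == EM_CUDA;
}
```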
Build System Metadata
The source path leaked through __FILE__ macros in the knobs infrastructure
reveals the NVIDIA internal build tree layout:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/
drivers/common/utils/generic/impl/generic_knobs_impl.h
Key observations:
- /dvs/p4/ -- Perforce depot root on the DVS (Driver Verification System) build farm.
- sw/rel/gpgpu/toolkit/r13.0/ -- Release branch for CUDA toolkit 13.0.
- compiler/drivers/common/ -- Shared compiler driver code (used by both ptxas and cicc).
- generic_knobs_impl.h -- The knob system implementation header; the __FILE__ macro at lines 395-1090 of this file is embedded in ptxas error metadata.
Evidence Index
| Claim | Source |
|---|---|
| Version 13.0.88, 37.7 MB | Headers of all 30 sweep reports (e.g. p1.23, p1.28) |
| sub_432A00 strings | p1.01 lines 514-521 |
| sub_6784B0 default sm_75 | User-provided; corroborated by sm_75 prevalence across all validators |
| Source path | p1.05 lines 14-16, p1.04a line 628 |
| ptxocg.0.0 | p1.01 line 553, p1.05 line 1256 |
| ELF emitter / EM_CUDA | p1.30 lines 46-69 |
| SM version encoding table | p1.08b lines 217-237 |
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_432A00 | -- | CLI option registration; contains --version / --list-version self-identification strings ("Ptx optimizing assembler", "NVIDIA (R)", "ptxocg.0.0") | 0.92 |
| sub_43A400 | -- | Compilation setup; references the "ptxocg.0.0" backend version tag | 0.85 |
| sub_6784B0 | -- | Default target architecture selector; returns sm_75 (Turing) when no --gpu-name flag is supplied | 0.90 |
| sub_1C9F280 | -- | ELF emitter; SM-version-to-code-object mapping for cubin output | 0.85 |
| sub_A3D000 | -- | SM version encoding table; example encodings (12288 = sm_30, 20481 = sm_50) | 0.80 |
Compilation Pipeline Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page maps the complete end-to-end flow of a PTX assembly through ptxas v13.0.88, from the initial CLI invocation to the final ELF/cubin binary output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.
Pipeline Diagram
nvcc / cicc
| (PTX text file or --input-as-string)
v
+================================================================+
| ptxas v13.0.88 (37.7 MB, ~40,000 functions) |
| |
| 1. Entry & CLI Parsing ----------> [entry.md] |
| | main -> sub_446240 -> sub_434320 |
| | target arch, opt level, --maxrregcount, knobs |
| v |
| 2. PTX Lexer + Parser -----------> [ptx-parser.md] |
| | sub_451730: Flex scanner, Bison grammar |
| | ROT13-decoded opcode table (900+ mnemonics) |
| | 30+ per-instruction semantic validators |
| v |
| 3. PTX Directive Handling --------> [ptx-directives.md] |
| | .version, .target, .entry, .func, .reg, .shared |
| | register constraints, ABI configuration |
| v |
| 4. PTX-to-Ori Lowering ----------> [ptx-to-ori.md] |
| | PTX AST -> Ori IR (basic blocks, virtual registers) |
| | address space annotation, special register mapping |
| v |
| 5. 159-Phase Optimization -------> [optimizer.md] |
| | PhaseManager: sub_C62720 (constructor), |
| | sub_C64F70 (executor) |
| | 10 stages, 17 AdvancedPhase hooks, |
| | 8-phase Mercury encoding sub-pipeline |
| | per-kernel via sub_7FBB70 -> sub_7FB6C0 |
| v |
| 6. Register Allocation ----------> [../regalloc/overview.md] |
| | Fatpoint algorithm, phase 101 (AdvancedPhaseAllocReg) |
| | spill/fill insertion, ABI register reservations |
| v |
| 7. Instruction Scheduling -------> [../scheduling/overview.md]|
| | 3-phase: pre-schedule (97), post-schedule (106), |
| | post-fixup (111) |
| | scoreboard generation, dependency barriers |
| v |
| 8. SASS Encoding ----------------> [../codegen/encoding.md] |
| | 530 instruction encoding handlers (vtable dispatch) |
| | Mercury format: phases 113-122 |
| | Capsule Mercury (default on sm_100+) |
| v |
| 9. ELF/Cubin Output -------------> [output.md] |
| | sub_612DE0 (finalizer) -> sub_1C9F280 (ELF emitter) |
| | section layout, symbol table, relocations |
| | DWARF debug info, EIATTR attributes |
| v |
| OUTPUT: .cubin / .o (ELF) |
+================================================================+
Side paths:
* Capsule Mercury (--cap-merc) -----> [../codegen/capmerc.md]
* Debug info (all stages) ----------> [../output/debug-info.md]
* SASS text (--verbose) ------------> [../codegen/sass-printing.md]
Narrative Walk-Through: One Kernel, Start to Finish
A concrete trace of a single-kernel PTX module compiled for sm_100 at -O2:
1. PTX text arrives (~2--200 KB). Either read from a .ptx file or received in-memory via --input-as-string from nvcc. The driver sub_446240 establishes a setjmp recovery point, parses CLI options into the 1,352-byte options block, and allocates the "Top level ptxas memory pool".
2. Lexer + Parser (sub_451730). A Flex-generated scanner tokenizes the PTX text into a token stream. Tokens flow into a Bison-generated LALR parser that builds an AST. The opcode dispatch table (sub_46E000, 93 KB, 1,168 callees) routes each instruction mnemonic through ROT13 decoding, type resolution, and 30+ per-instruction semantic validators. For a 5 KB PTX kernel, the parser typically produces ~200--500 AST nodes with ~50 virtual register declarations. The "PTX parsing state" pool holds all AST memory.
3. Directive processing and CompileUnitSetup. .version/.target directives configure the SM profile via sub_6765E0 (54 KB profile constructor). .entry/.func directives establish the kernel boundary. .reg/.shared/.const directives declare resources. sub_43B660 computes the physical register budget from .maxnreg, --maxrregcount, and .maxntid constraints. The 1,936-byte profile object is now populated with codegen factory value (36864 for sm_100), scheduling parameters, and capability flags.
4. PTX-to-Ori lowering (DAGgen). sub_6273E0 (44 KB) converts each AST instruction into an Ori IR node: a basic block with virtual registers, control flow edges, and memory space annotations. Special registers (%ntid, %laneid, %smid) map to internal IDs. Address computation uses a 6-bit operand type encoding. A 500-instruction PTX kernel typically produces ~600--1,200 Ori instructions (expansion from pseudo-ops, address calculations, and predicate materialization). The "Permanent OCG memory pool" is created here to hold all IR state.
5. 159-phase OCG pipeline (sub_C62720 constructs, sub_C64F70 executes). Each phase is a 16-byte polymorphic object with execute(), isNoOp(), and getName() vtable methods. The PhaseManager iterates the phase table at 0x22BEEA0, skipping any phase whose isNoOp() returns true. At -O2, roughly 80--100 of the 159 phases are active. Typical expansion factors: the initial 1,000 Ori instructions may grow to 1,200--1,500 after unrolling and intrinsic expansion, then shrink to 800--1,000 after CSE/DCE, then re-expand to 1,500--2,500 after register allocation spill/fill insertion. The PhaseManager logs "Before <phase>" / "After <phase>" strings (visible in the sub_C64F70 decompile) for DUMPIR.
6. Register allocation (phase 101, sub_971A90). The Fatpoint algorithm attempts NOSPILL allocation first. If pressure exceeds the register budget, the spill guidance engine (sub_96D940, 84 KB) computes spill candidates across 7 register classes, and the retry loop makes up to N attempts (knob 638/639) with progressively more aggressive spilling. Physical register assignments are committed; spill/fill instructions are inserted into the Ori IR.
7. Instruction scheduling (phases 97, 106, 111). Three scheduling passes assign dependency barriers and reorder instructions for pipeline throughput. The scoreboard generator tracks 6 dependency barriers per warp. For a 1,500-instruction kernel, scheduling typically produces a ~2,000--3,000-entry instruction stream after barrier insertion and NOP padding.
8. SASS encoding (phases 113--122). Each Ori instruction is lowered to a 128-bit SASS binary instruction via the 530-handler vtable dispatch. The 1,280-bit (160-byte) encoding workspace at instruction+544 is filled by sub_7B9B80 (bitfield insert, 18,347 callers). A 2,000-instruction kernel produces ~32 KB of raw SASS binary. On sm_100+, Capsule Mercury (capmerc) is the default format, embedding PTX source alongside the SASS.
9. ELF/cubin emission (sub_612DE0, 47 KB). The finalizer assembles the cubin: .text.FUNCNAME (SASS binary), .nv.info.FUNCNAME (EIATTR attributes), .nv.shared.FUNCNAME (shared memory layout), .nv.constant0.FUNCNAME (constant bank), plus global sections (.shstrtab, .strtab, .symtab). Section layout (sub_1CABD60, 67 KB) assigns addresses; the master ELF emitter (sub_1C9F280, 97 KB) writes headers, section tables, and program headers. A single-kernel cubin for a medium-complexity kernel is typically 40--120 KB.
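The bitfield-insert primitive from step 8 (sub_7B9B80, which fills the 1,280-bit encoding workspace) has not been fully reverse-engineered here; the following is a minimal sketch assuming it behaves like a conventional put-bits into a little-endian array of 64-bit words. The function name `put_bits` is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical analogue of the bitfield-insert helper: write
   `width` bits of `value` at bit offset `pos` into a little-endian
   array of 64-bit words (e.g. the 160-byte encoding workspace). */
static void put_bits(uint64_t *words, unsigned pos, unsigned width,
                     uint64_t value)
{
    /* Insert bit by bit to sidestep cross-word shift edge cases. */
    for (unsigned i = 0; i < width; i++) {
        unsigned bit = pos + i;
        uint64_t mask = 1ULL << (bit % 64);
        if ((value >> i) & 1)
            words[bit / 64] |= mask;
        else
            words[bit / 64] &= ~mask;
    }
}
```

A 128-bit SASS word would then be two such `uint64_t` entries, with opcode and operand fields written at their architecture-specific offsets.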
Approximate data sizes at each stage (medium kernel, sm_100, -O2):
| Stage | Input | Output | Peak Memory |
|---|---|---|---|
| PTX text | -- | 5--50 KB text | 100 KB (file buffer + parser state) |
| AST | Token stream | 200--500 nodes (~40--100 KB) | 200 KB |
| Ori IR (initial) | AST | 600--1,200 instructions (~100--250 KB) | 500 KB |
| Ori IR (post-OCG) | 1,200 instr | 1,500--2,500 instr (~300--600 KB) | 2--8 MB (peak during regalloc) |
| SASS binary | Scheduled IR | 32--128 KB | 1 MB |
| Cubin (ELF) | SASS + metadata | 40--120 KB | 2 MB |
Timed Phases
The compilation driver sub_446240 measures six timed phases per compile unit and reports them when --compiler-stats is enabled. The format strings are embedded directly in the binary:
| Phase | Format String | Subsystem |
|---|---|---|
| Parse-time | "Parse-time : %.3f ms (%.2f%%)\n" | PTX lexer + Bison parser + semantic validation |
| CompileUnitSetup-time | "CompileUnitSetup-time : %.3f ms (%.2f%%)\n" | Target configuration, ABI setup, register constraints |
| DAGgen-time | "DAGgen-time : %.3f ms (%.2f%%)\n" | PTX-to-Ori lowering, CFG construction, initial DAG formation |
| OCG-time | "OCG-time : %.3f ms (%.2f%%)\n" | Optimized Code Generation: all 159 optimization phases, register allocation, instruction scheduling, SASS encoding |
| ELF-time | "ELF-time : %.3f ms (%.2f%%)\n" | ELF construction, section layout, symbol table, relocations, EIATTR, file write |
| DebugInfo-time | "DebugInfo-time : %.3f ms (%.2f%%)\n" | DWARF .debug_info/.debug_line/.debug_frame generation, LEB128 encoding |
Additional aggregate stats:
CompileTime = %f ms (100%)
PeakMemoryUsage = %.3lf KB
The per-unit header prints "\nCompile-unit with entry %s" before each kernel's phase breakdown.
Per-Kernel Parallelism
ptxas supports two compilation modes for multi-kernel PTX modules:
Single-Threaded Mode (Default)
The compilation driver sub_446240 iterates over compile units sequentially. For each kernel entry:
- sub_43CC70 -- per-entry compilation unit processor, skips __cuda_dummy_entry__
- sub_7FBB70 -- per-kernel entry point, prints "\nFunction name: " + kernel name
- sub_7FB6C0 -- pipeline orchestrator: builds phases via sub_C62720, executes via sub_C64F70
- Cleanup: destroys 17 analysis data structures (live ranges, register maps, scheduling state)
Each kernel runs through the entire 159-phase pipeline independently. Cross-kernel state is limited to shared memory layout and the global symbol table.
Thread Pool Mode (--split-compile)
When --allow-expensive-optimizations or --split-compile is active, ptxas uses a pthread-based thread pool for per-kernel parallelism:
- Pool constructor (sub_1CB18B0): allocates a 184-byte pool struct (0xB8), spawns N detached worker threads via pthread_create, initializes a mutex at +24 and two condition variables at +64 and +112
- Task submit (sub_1CB1A50): allocates a 24-byte task node {func_ptr, arg, next}, enqueues it on a linked list, broadcasts on cond_work
- Jobserver integration (sub_1CC7300): reads the MAKEFLAGS environment variable, parses --jobserver-auth= for either fifo: named pipes or pipe-based file descriptors, and throttles the thread count to respect GNU Make's -j slot limit
The thread pool is used throughout the OCG and ELF phases (stages 5-9 in the diagram). Each worker thread receives its own thread-local context (sub_4280C0, 280-byte TLS struct with per-thread error flags, diagnostic suppression state, Werror flag, and synchronization primitives).
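The queue discipline described above (linked-list task nodes, condvar broadcast on submit) can be sketched as follows. This is a simplified reconstruction, not the recovered code: the struct layouts and names are ours, and the workers here are joinable rather than detached so the example can be shut down cleanly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical reconstruction of the 24-byte task node. */
typedef struct Task {
    void (*func)(void *);
    void *arg;
    struct Task *next;
} Task;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond_work;   /* broadcast on submit */
    Task *head, *tail;
    int shutdown;
} Pool;

static void *worker(void *p)
{
    Pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (!pool->head && !pool->shutdown)
            pthread_cond_wait(&pool->cond_work, &pool->lock);
        if (!pool->head && pool->shutdown) {   /* drain, then exit */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        Task *t = pool->head;                  /* pop from list head */
        pool->head = t->next;
        if (!pool->head) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);
        t->func(t->arg);
        free(t);
    }
}

static void pool_submit(Pool *pool, void (*func)(void *), void *arg)
{
    Task *t = malloc(sizeof *t);
    t->func = func; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->lock);
}

/* Example task used below: atomically bump a counter. */
static pthread_mutex_t g_cnt_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_cnt;
static void bump(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&g_cnt_lock);
    g_cnt++;
    pthread_mutex_unlock(&g_cnt_lock);
}
```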
Thread-Local Context Layout
struct ThreadLocalContext { // 280 bytes (0x118), per-thread via pthread_getspecific
uint64_t error_flags; // +0: error/warning state flags
uint64_t has_error; // +8: nonzero if error occurred
// +16..+48: internal fields (jmp_buf pointer, pool pointer, counters)
uint8_t diag_suppress; // +49: diagnostic suppression flag
uint8_t werror_flag; // +50: --Werror promotion flag
// +51..+127: reserved / internal state
pthread_cond_t cond; // +128: condition variable (48 bytes)
pthread_mutex_t mutex; // +176: per-thread mutex (40 bytes)
sem_t sem; // +216: semaphore (32 bytes)
// +248..+279: linked-list pointers (global thread list at +256/+264)
};
Accessed by sub_4280C0 (3,928 callers -- the single most-called function in the binary). On first call in a new thread, allocates and initializes via malloc(0x118) + memset + pthread_cond_init + pthread_mutex_init + sem_init. The decompiled code confirms the 280-byte size: v5 = malloc(0x118u), followed by memset(v5, 0, 0x118u), pthread_cond_init(v5 + 128), pthread_mutex_init(v5 + 176), sem_init(v5 + 216). After initialization, the struct is inserted into a global doubly-linked list (offsets +256 and +264 hold prev/next pointers, protected by a global mutex). The pthread_setspecific(key, v5) call stores the pointer for subsequent pthread_getspecific retrieval.
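The lazy-initialization pattern the decompile describes (pthread_getspecific, allocate-and-init on first call, pthread_setspecific) can be sketched as below. This is a reduced stand-in, not the recovered struct: only a few fields are kept, and the names `TlsCtx` / `get_tls_ctx` are ours.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the 280-byte ThreadLocalContext above. */
typedef struct {
    unsigned long error_flags;
    unsigned long has_error;
    pthread_mutex_t mutex;
} TlsCtx;

static pthread_key_t  g_key;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&g_key, free); }

/* Analogue of sub_4280C0: return this thread's context,
   allocating and zero-initializing it on first use. */
static TlsCtx *get_tls_ctx(void)
{
    pthread_once(&g_once, make_key);
    TlsCtx *ctx = pthread_getspecific(g_key);
    if (!ctx) {
        ctx = malloc(sizeof *ctx);        /* cf. malloc(0x118)      */
        memset(ctx, 0, sizeof *ctx);      /* cf. memset(v5,0,0x118) */
        pthread_mutex_init(&ctx->mutex, NULL);
        pthread_setspecific(g_key, ctx);
    }
    return ctx;
}
```

The real function additionally links each context into a global doubly-linked list under a global mutex, which this sketch omits.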
Key Function Call Chain
The top-level control flow from program entry to ELF output:
main (0x409460, 84 bytes)
| setvbuf(stdout/stderr, unbuffered)
v
sub_446240 (0x446240, 11KB) ---- "Top-level compilation driver"
|
|-- sub_434320 (0x434320, 10KB) -- Parse CLI options, validate flags
| reads: --gpu-name, --maxrregcount, --opt-level, --verbose,
| --compiler-stats, --split-compile, --fast-compile
|
|-- [allocate "Top level ptxas memory pool"]
|-- [allocate "Command option parser" pool]
|
|-- sub_445EB0 (setup) ----------- Target configuration, texturing mode
| sub_43A400 --------------- SM-specific defaults ("ptxocg.0.0")
| sub_43B660 --------------- Register/resource constraint calculation
|
|-- sub_451730 (0x451730, 14KB) -- Parser initialization
| | "PTX parsing state" pool allocation
| | Builtin symbol table: %ntid, %laneid, %smid, %clock64, ...
| | sub_46E000 (93KB) ---- Opcode-to-handler dispatch table (1168 callees)
| v
| [Flex lexer + Bison parser: PTX text -> AST]
|
|-- for each compile unit:
| sub_4428E0 (0x4428E0, 14KB) -- PTX input validation
| | .version/.target checks, ABI mode selection
| | --extensible-whole-program, --compile-only handling
| |
| sub_43CC70 (5.4KB) --------- Per-entry unit processor
| | skip __cuda_dummy_entry__
| | generate .sass and .ucode sections
| |
| sub_7FBB70 (198 bytes) ----- Per-kernel entry point
| |
| sub_7FB6C0 (1.2KB) ------- Pipeline orchestrator
| | check knob 298 (NamedPhases mode)
| | if NamedPhases: delegate to sub_9F63D0
| | else:
| | sub_C62720 -- PhaseManager constructor (159 phases)
| | sub_C60D20 -- get default phase table (at 0x22BEEA0)
| | sub_C64F70 -- execute all phases
| | cleanup: destroy 17 analysis data structures
| v
| [159-phase pipeline: optimization -> regalloc -> scheduling -> encoding]
|
|-- sub_612DE0 (0x612DE0, 47KB) -- Kernel finalizer / ELF builder
| | "Finalizer fastpath optimization"
| | version: "Cuda compilation tools, release 13.0, V13.0.88"
| | build: "Build cuda_13.0.r13.0/compiler.36424714_0"
| |
| sub_1CB53A0 (13KB) ------- ELF world initializer (672-byte object)
| | "elfw memory space", .shstrtab, .strtab, .symtab
| |
| sub_1CB3570 (10KB) ------- Add .text.FUNCNAME sections (44 callers)
| sub_1CB68D0 (49KB) ------- Symbol table builder
| sub_1CABD60 (67KB) ------- Section layout & memory allocation
| sub_1CD48C0 (22KB) ------- Relocation resolver
| sub_1C9B110 (23KB) ------- Mercury capsule builder (capmerc)
| sub_1C9F280 (97KB) ------- Master ELF emitter (largest in range)
| sub_1CD13A0 (11KB) ------- Final file writer
|
v
[report CompileTime, PeakMemoryUsage, per-phase breakdown]
Memory Pools
ptxas uses a custom hierarchical pool allocator (sub_424070 / sub_4248B0, the most-called allocation functions with 3,809 and 1,215 callers respectively) instead of the system malloc/free. Three named pools are created during the top-level driver:
| Pool Name | Created By | Lifetime | Purpose |
|---|---|---|---|
"Top level ptxas memory pool" | sub_446240 | Entire compilation | Global allocations, cross-kernel data structures |
"Command option parser" | sub_446240 | Entire compilation | CLI option storage, flag validation state |
"Permanent OCG memory pool" | OCG initialization | Per-kernel | Optimization phase state, instruction IR, register maps |
Additional per-subsystem pools exist:
"PTX parsing state"-- created bysub_451730, holds the lexer/parser symbol tables and AST nodes"elfw memory space"-- created bysub_1CB53A0, holds the ELF world object (672 bytes) and section data
Pool Allocator Internals
The allocator at sub_424070 implements a dual-path design:
- Small allocations (up to 4,999 bytes / 0x1387): 8-byte-aligned, size-class binned free lists at pool struct offset +2128. Pop from the free list head on alloc, push back on free.
- Large allocations (above 4,999 bytes): boundary-tag allocator with coalescing of adjacent free blocks.
- Thread safety: pthread_mutex_lock/unlock around all pool operations, mutex at pool struct offset +7128.
- OOM handling: calls sub_42BDB0 (3,825 callers), which triggers a longjmp-based fatal abort via sub_42F590.
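The small-allocation path (size-class binned free lists, 8-byte alignment, push/pop at the list head) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the recovered allocator: the bin math, struct names, and malloc fallback are ours; the 4,999-byte limit and head push/pop discipline come from the decompile.

```c
#include <assert.h>
#include <stdlib.h>

#define SMALL_LIMIT 4999                 /* 0x1387, from the decompile */
#define ALIGN       8
#define NUM_BINS    ((SMALL_LIMIT + ALIGN) / ALIGN)

/* Hypothetical small-allocation path: one free list per 8-byte
   size class (cf. the bins at pool struct offset +2128). */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    FreeNode *bins[NUM_BINS];
} SmallPool;

static size_t bin_index(size_t size)
{
    return (size + ALIGN - 1) / ALIGN;   /* round up to 8 bytes */
}

static void *pool_alloc_small(SmallPool *p, size_t size)
{
    size_t i = bin_index(size);
    if (p->bins[i]) {                    /* pop from free-list head */
        FreeNode *n = p->bins[i];
        p->bins[i] = n->next;
        return n;
    }
    return malloc(i * ALIGN);            /* fresh bin-sized block */
}

static void pool_free_small(SmallPool *p, void *ptr, size_t size)
{
    size_t i = bin_index(size);
    FreeNode *n = ptr;                   /* push onto free-list head */
    n->next = p->bins[i];
    p->bins[i] = n;
}
```

The large path (boundary tags with coalescing) and the pool mutex are omitted here; in the real allocator every operation runs under the mutex at pool offset +7128.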
Pipeline Stage Breakdown
Terminology note. The 6 stages below (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) correspond to the 6 timed phases measured by
--compiler-stats. They cover the entire program lifecycle. The OCG stage (Stage 4 here) is itself subdivided into 10 internal stages in the Pass Inventory, numbered OCG-Stage 1--10. To avoid confusion, cross-references use "timed phase" for the 6 whole-program stages and "OCG stage" for the 10 optimizer sub-stages.
Stage 1: Parse (Parse-time)
The Flex-generated scanner and Bison-generated parser consume PTX text and produce an internal AST. The opcode dispatch table at sub_46E000 (93KB, 1,168 callees) registers type-checking rules for every PTX instruction. Thirty separate validator functions (in 0x460000-0x4D5000) enforce SM architecture requirements, PTX version constraints, operand types, and state space compatibility. See PTX Parser.
Stage 2: CompileUnitSetup (CompileUnitSetup-time)
Target configuration via sub_43A400: sets SM-specific defaults (texturing mode, cache policies, def-load-cache, force-load-cache), applies --fast-compile shortcuts, configures ABI (parameter registers, return address register, scratch registers). Register constraints computed by sub_43B660 from .maxnreg, --maxrregcount, .minnctapersm, and .maxntid directives. See Entry Point & CLI.
Stage 3: DAGgen (DAGgen-time)
Lowers the validated PTX AST into the Ori intermediate representation: basic blocks with a control flow graph, virtual registers, and memory space annotations. Special PTX registers (%ntid, %laneid, %smid, %ctaid, etc.) are mapped to internal identifiers. Operand processing at sub_6273E0 (44KB) handles address computation with a 6-bit operand type encoding. See PTX-to-Ori Lowering.
Stage 4: OCG (OCG-time)
The core of ptxas: the 159-phase Optimized Code Generation pipeline. This single timed phase encompasses:
- Early optimization (phases 13-36): general optimization, branch/switch, loop simplification, strength reduction, unrolling, pipelining, barrier removal
- Mid-level optimization (phases 37-58): GVN/CSE, reassociation, commoning, late expansion, speculative hoisting
- Late optimization (phases 59-95): loop fusion, predication, GMMA propagation, legalization
- Register allocation (phase 101): Fatpoint algorithm
- Instruction scheduling (phases 97, 106, 111): pre-schedule, post-schedule, post-fixup
- Mercury encoding (phases 113-122): SASS binary format generation
The PhaseManager (sub_C62720) instantiates phases via a 159-case factory switch (sub_C60D30), each phase a 16-byte polymorphic object with a vtable providing execute(), isNoOp(), and getName() methods. See Optimization Pipeline.
Stage 5: ELF (ELF-time)
The finalizer sub_612DE0 (47KB) assembles the NVIDIA ELF/cubin from the compiled SASS. Section layout (sub_1CABD60, 67KB) assigns addresses to shared memory, constant banks (with OCG deduplication), local memory, and reserved shared memory (.nv.reservedSmem.begin/cap/offset0). The master ELF emitter sub_1C9F280 (97KB) constructs headers, section tables, and program headers. Three binary output modes exist:
- mercury -- traditional SASS binary format
- capmerc -- Capsule Mercury (default on sm_100+), embeds PTX source in .nv.merc.* sections
- sass -- direct SASS output
See ELF/Cubin Output.
Stage 6: DebugInfo (DebugInfo-time)
DWARF debug information generation: .debug_info, .debug_line, .debug_frame sections. The LEB128 encoder at sub_45A870 handles all variable-length integer encoding. Source location tracking uses the location map (hash map at sub_426150/sub_426D60) with file offset caching every 10 lines for fast random access. Labels follow the pattern .L__$locationLabel$__%d. See Debug Information.
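The variable-length integer format handled by the LEB128 encoder (sub_45A870) is the standard DWARF ULEB128 scheme: seven value bits per byte, with the high bit set on every byte except the last. A minimal sketch of the unsigned encoder (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standard unsigned LEB128 encoding as used throughout DWARF.
   Returns the number of bytes written to `out`. */
static size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;   /* low 7 bits */
        value >>= 7;
        if (value) byte |= 0x80;       /* continuation bit */
        out[n++] = byte;
    } while (value);
    return n;
}
```

Values below 128 encode in one byte, which is why DWARF line tables and attribute forms lean heavily on small integers.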
Error Paths and Recovery
ptxas uses setjmp/longjmp as its sole error recovery mechanism -- there are no C++ exceptions (the binary is compiled as C). Three nested recovery points exist, each catching progressively more localized failures.
Recovery Point Hierarchy
sub_446240 (top-level driver)
setjmp(jmp_buf_global) // Level 1: catches any fatal anywhere
|
sub_43A400 (per-kernel worker)
setjmp(jmp_buf_kernel) // Level 2: catches per-kernel fatals
|
sub_432500 (finalization bridge)
setjmp(jmp_buf_local) // Level 3: catches OCG pipeline fatals
|
[159-phase pipeline, regalloc, encoding, ELF]
Level 1 (global). Established by sub_446240 at function entry. If any code path anywhere in ptxas calls sub_42FBA0 with severity >= 6 (fatal), execution longjmps back here. The handler cleans up global resources and returns a non-zero exit code. This is the last-resort handler.
Level 2 (per-kernel). Established by sub_43A400 before the OCG pipeline runs. On longjmp, the handler destroys the partially-compiled kernel's state, clears the error flags in the TLS context, and continues to the next kernel. This allows multi-kernel compilations to survive a single kernel's failure.
Level 3 (finalization). Established by sub_432500, which saves and replaces the TLS jmp_buf pointer for nested recovery. On longjmp: restores the previous jmp_buf, sets error_flags = 1, releases output buffers, and calls report_internal_error(). Execution returns false to the Level 2 handler.
Parse Error Recovery
Parse errors in sub_451730 (the Flex/Bison parser) invoke sub_42FBA0 with the message "syntax error":
- Severity 4--5 (non-fatal error): The error is printed with file:line location, and the parser attempts to continue via Bison's error recovery rules. Multiple non-fatal parse errors can accumulate. After parsing completes, if the error count > 0, the compilation is aborted before entering the OCG pipeline.
- Severity 6 (fatal): Triggers longjmp to the Level 1 handler immediately. The parser state pool is leaked (an accepted trade-off, since the process is about to exit).
Bison error recovery operates through the error token in the grammar. When the parser encounters a token that matches no production, it discards tokens until it finds one that allows the error production to reduce, then resumes parsing. This provides reasonable error recovery for common mistakes (missing semicolons, misspelled opcodes) but can cascade badly for structural errors (mismatched braces, corrupt PTX).
Phase Failure in PhaseManager
The phase executor sub_C64F70 runs each phase by calling its vtable execute() method. There is no explicit per-phase error check -- phases that detect internal errors call the diagnostic emitter sub_42FBA0 directly. The error handling cascade:
- Non-fatal phase error (severity 3--5): The error is printed and the error flag is set in the TLS context. The PhaseManager continues executing subsequent phases. This allows multiple diagnostics to be collected in a single run.
- Fatal phase error (severity 6): Triggers longjmp to Level 2 or Level 3. The current kernel's compilation is aborted. The PhaseManager's loop is unwound non-locally -- no cleanup of intermediate phase state occurs. Resources are reclaimed when the per-kernel memory pool is destroyed.
- OOM during phase execution: Any allocation failure calls sub_42BDB0 (3,825 callers), which forwards to sub_42F590 with a severity-6 descriptor at unk_29FA530. This always triggers longjmp.
The PhaseManager logs phase transitions using "Before <phase>" and "After <phase>" string construction (visible in sub_C64F70). When DUMPIR is set to a phase name, the IR is dumped to a file after that phase completes. This enables bisection of phase failures: --knob DUMPIR=<phase_name> isolates which phase corrupted the IR.
Register Allocation Failure and Retry
The register allocator has its own retry mechanism that operates within the normal pipeline (not via longjmp). The retry driver sub_971A90 (355 lines) wraps the Fatpoint allocator in a two-phase strategy:
Phase 1 -- NOSPILL. Attempt allocation without spilling. If the allocator fits within the register budget, proceed directly to finalization.
Phase 2 -- SPILL retry loop. If NOSPILL fails:
- The spill guidance engine sub_96D940 (84 KB) computes per-register-class spill candidates
- The allocator retries with progressively more aggressive spilling, up to N attempts (controlled by knobs 638/639)
- Each attempt prints "-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d" (note: the typo "attemp" is in the binary)
- The best result across all attempts is tracked by sub_93D070
- The finalization function sub_9714E0 (290 lines) commits the best result or emits a fatal error
On allocation failure (all retry attempts exhausted):
Register allocation failed with register count of '%d'.
Compile the program with a higher register target
This error is emitted by sub_9714E0 through two paths: with source location (via sub_895530, including function name and PTX line number) or without source location (via sub_7EEFA0, generic). After this error, sub_9714E0 returns with HIBYTE(status) set, causing the retry driver to clear all register assignments to -1 and propagate the failure.
A dedicated DUMPIR hook exists: "Please use -knob DUMPIR=AllocateRegisters for debugging" -- this string (found at sub_9714E0's error path) directs users to dump the IR state before the allocator runs.
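The two-phase control structure described above (NOSPILL first, then a bounded retry loop that keeps the best result) can be sketched as follows. This is a shape sketch only: the callback type, struct, and toy allocator are ours, and the real driver's notion of "best" and its interaction with the spill guidance engine are more involved.

```c
#include <assert.h>

/* Hypothetical shape of the retry driver (cf. sub_971A90). */
typedef struct {
    int used_regs;     /* registers the attempt needed */
    int ok;            /* nonzero if within the budget */
} AllocResult;

typedef AllocResult (*AllocFn)(int aggressiveness, void *ir);

static AllocResult regalloc_with_retry(AllocFn alloc, void *ir,
                                       int budget, int max_attempts)
{
    AllocResult best = alloc(0, ir);            /* attempt 0: NOSPILL */
    if (best.used_regs <= budget) { best.ok = 1; return best; }
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        AllocResult r = alloc(attempt, ir);     /* spill harder each time */
        if (r.used_regs < best.used_regs) best = r;   /* track best */
        if (best.used_regs <= budget) break;
    }
    best.ok = best.used_regs <= budget;         /* else: fatal error path */
    return best;
}

/* Toy allocator for the usage example: each extra level of
   spilling frees four registers. */
static AllocResult toy_alloc(int aggressiveness, void *ir)
{
    (void)ir;
    AllocResult r = { 80 - 4 * aggressiveness, 0 };
    return r;
}
```

When `ok` comes back zero after all attempts, the real driver clears all register assignments to -1 and emits the "Register allocation failed" diagnostic shown above.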
Fatal Error Handler Chain
The complete chain from any error site to process termination:
[any function, 2,350 call sites]
sub_42FBA0(descriptor, location, ...) // central diagnostic emitter
| checks descriptor[0] for severity
| severity 0: silently ignored
| severity 1-2: prints "info " message
| severity 3: prints "warning " (or "error " if TLS[50] Werror flag set)
| severity 4-5: prints "error " / "error* " (non-fatal)
| severity 6: prints "fatal " then:
v
longjmp(tls->jmp_buf, 1)
| unwinds to nearest setjmp recovery point
v
[Level 3] sub_432500 -> restore jmp_buf, set error_flags, return false
[Level 2] sub_43A400 -> cleanup kernel state, continue to next kernel
[Level 1] sub_446240 -> cleanup global state, exit(non-zero)
Resource leak note. Because longjmp bypasses normal stack unwinding, all heap allocations made between the setjmp and the fatal error are leaked unless tracked in a pool. This is why ptxas uses pool allocators -- the per-kernel pool can be destroyed wholesale at the Level 2 recovery point, reclaiming all leaked memory without tracking individual allocations.
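The severity dispatch in the chain above -- return to the caller for severities 0-5, longjmp for severity 6 -- can be sketched as follows. This is a hypothetical reconstruction: the function names, the single global jmp_buf (the real one lives in the TLS context), and the exact prefix strings are ours, modeled on the table in the diagram.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/* Stand-ins for the TLS jmp_buf and error flag. */
static jmp_buf g_recover;
static int g_error_flag;

/* Severity-to-prefix mapping per the dispatch table above;
   the Werror flag promotes severity-3 warnings to errors. */
static const char *severity_prefix(int sev, int werror)
{
    if (sev <= 0) return NULL;                 /* silently ignored */
    if (sev <= 2) return "info    ";
    if (sev == 3) return werror ? "error   " : "warning ";
    if (sev <= 5) return "error   ";
    return "fatal   ";
}

/* Hypothetical analogue of the central emitter (cf. sub_42FBA0):
   print, record the error, and unwind non-locally on fatal. */
static void emit_diag(int sev, int werror, const char *msg)
{
    const char *p = severity_prefix(sev, werror);
    if (!p) return;
    fprintf(stderr, "%s: %s\n", p, msg);
    if (sev >= 4) g_error_flag = 1;
    if (sev >= 6) longjmp(g_recover, 1);       /* to nearest setjmp */
}
```

A caller establishes a recovery point with `if (setjmp(g_recover) == 0) { ... }`; the nonzero return path plays the role of the Level 1-3 handlers.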
Architecture Dispatch
An architecture vtable factory at sub_1CCEEE0 (17KB, 244 callees) constructs a 632-byte vtable object (79 function pointers) based on the target SM version. The version dispatch ranges:
| Range | Architecture | Generation | Status in v13.0.88 |
|---|---|---|---|
| sm_30-39 | Kepler | 1st gen | Validation only -- accepted by bsearch in unk_1D16220, but no codegen factory, no capability dispatch handlers, and no SASS encoders ship for these targets. Compilation fails after parsing. |
| sm_50-59 | Maxwell | 2nd gen | Validation only -- same as Kepler. Present in the base validation table for backward-compatible PTX version/target checking, but no backend support. |
| sm_60-69 | Pascal | 3rd gen | Validation only -- same as above. The codegen factory value 24576 (6 << 12) is referenced in comparison thresholds but no Pascal-specific encoder tables exist. |
| sm_70-73 | Volta | 4th gen | Validation only -- sm_70, sm_72, sm_73 are in the base table but have no active capability dispatch handlers in sub_607DB0. |
| sm_75 | Turing | 4th gen | Active -- lowest SM with full codegen support (factory 24577). |
| sm_80-89 | Ampere / Ada | 5th gen | Active -- factory 28673. |
| sm_90 | Hopper | 6th gen | Active -- factory 32768. |
| sm_100-110 | Blackwell | 7th gen | Active -- factory 36864. |
| sm_120-121 | Consumer / DGX Spark | 7th gen (desktop) | Active -- factory 36864 (shared with Blackwell datacenter). |
The distinction between "validation only" and "active" is critical: the base validation table at unk_1D16220 contains 32 entries including all legacy SMs back to sm_20, allowing ptxas to parse PTX files that declare .target sm_30 without immediately rejecting them. However, the capability dispatch initializer sub_607DB0 only registers handler functions for sm_75 through sm_121 (13 base targets). Attempting to compile code for an unregistered SM produces a fatal error during codegen factory lookup -- the architecture vtable factory sub_1CCEEE0 cannot construct a backend object for these targets.
The legacy codegen factory values (12288 for sm_30, 16385/20481 for sm_50, 24576 for sm_60) survive as comparison constants in feature-gating checks throughout the backend (e.g., if (factory_value > 28673) gates sm_90+ features), but the code paths they would activate no longer exist.
Each vtable entry is a function pointer to an SM-specific implementation of a codegen or emission primitive. This is the central dispatch mechanism for all architecture-dependent behavior in the backend.
Obfuscation: ROT13 Encoding
All internal identifiers in ptxas's static initializers are ROT13-encoded:
- Opcode table (`ctor_003` at `0x4095D0`, 17KB): 900+ PTX opcode mnemonics. Example: `NPDOHYX` decodes to `ACQBULK`, `SZN` decodes to `FMA`, `RKVG` decodes to `EXIT`.
- General knob table (`ctor_005` at `0x40D860`, 80KB): 2,000+ Mercury/OCG tuning knob names with hex default values. Example: `ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf` decodes to `MercuryUseActiveThreadCollectiveInsts`.
- Scheduler knob table (`ctor_007` at `0x421290`, 8KB): 98 scheduler-specific knob names. Examples: `XBlockWaitOut`, `ScavInlineExpansion`.
The ROT13 decoding is performed inline during lookup (in sub_79B240, GetKnobIndex) using character-range detection: bytes in A-M get +13, bytes in N-Z get -13, with case-insensitive comparison via tolower().
Cross-References
- Binary Layout -- address ranges for every subsystem
- Function Map -- master index of recovered function addresses
- CLI Options -- complete flag catalog
- Knobs System -- 1,294 internal tuning parameters
- Optimization Levels -- what changes at `-O0` / `-O1` / `-O2` / `-O3`
- Phase Manager -- PhaseManager object layout and dispatch
- Memory Pool Allocator -- pool struct layout and allocation algorithm
- Thread Pool & Concurrency -- thread pool struct, task submit, jobserver
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
0x409460 | 84 B | -- | main (entry point) | CERTAIN |
0x446240 | 11 KB | 1 | Top-level compilation driver | HIGH |
0x434320 | 10 KB | 1 | CLI option parser + validator | HIGH |
0x445EB0 | -- | 1 | Target configuration setup | HIGH |
0x43A400 | 4.7 KB | 1 | SM-specific default configuration | HIGH |
0x43B660 | 3.8 KB | 1 | Register/resource constraint calculator | HIGH |
0x451730 | 14 KB | 1 | Parser init + special register setup | HIGH |
0x46E000 | 93 KB | 1 | Opcode dispatch table builder (1,168 callees) | HIGH |
0x4428E0 | 14 KB | 1 | PTX input validation + preprocessing | HIGH |
0x43CC70 | 5.4 KB | 1 | Per-entry compilation unit processor | HIGH |
0x7FBB70 | 198 B | vtable | Per-kernel entry point | CERTAIN |
0x7FB6C0 | 1.2 KB | 1 | Pipeline orchestrator | CERTAIN |
0xC62720 | 4.7 KB | 1 | PhaseManager constructor | VERY HIGH |
0xC60D30 | 3.6 KB | 1 | Phase factory (159-case switch) | VERY HIGH |
0xC64F70 | -- | 1 | Phase executor | HIGH |
0x9F63D0 | 342 B | 1 | NamedPhases executor | VERY HIGH |
0x612DE0 | 47 KB | 1 | Kernel finalizer / ELF builder | HIGH |
0x1C9F280 | 97 KB | 1 | Master ELF emitter | HIGH |
0x1CB53A0 | 13 KB | 1 | ELF world initializer (672-byte object) | HIGH |
0x1CABD60 | 67 KB | 1 | Section layout & memory allocator | HIGH |
0x1CD13A0 | 11 KB | 2 | Final ELF file writer | HIGH |
0x1CB18B0 | ~200 B | 1 | Thread pool constructor | HIGH |
0x1CB1A50 | ~200 B | N | Thread pool task submit | HIGH |
0x1CC7300 | 8 KB | 1 | GNU Make jobserver client | HIGH |
0x1CCEEE0 | 17 KB | 3 | Architecture vtable factory | HIGH |
0x424070 | 2.1 KB | 3,809 | Pool allocator: alloc(pool, size) | HIGH |
0x4248B0 | 923 B | 1,215 | Pool allocator: free(ptr) | HIGH |
0x4280C0 | 597 B | 3,928 | Thread-local context accessor | HIGH |
0x426150 | 2.5 KB | 2,800 | Hash map: put(map, key, value) | HIGH |
0x42FBA0 | 2.4 KB | 2,350 | Diagnostic message emitter | HIGH |
Entry Point & CLI
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas binary has a deceptively simple entry point. The exported main at 0x409460 is an 84-byte wrapper that sets up unbuffered I/O and immediately tail-calls sub_446240 -- the real compilation driver. This driver is a monolithic 11 KB function that allocates a 1,352-byte master options block on the stack, establishes setjmp-based error recovery, parses all command-line options through a generic framework, reads PTX input, and then loops over compile units running the full Parse -> CompileUnitSetup -> DAGgen -> OCG -> ELF -> DebugInfo pipeline for each. The entire error-handling strategy is non-local: any of the 2,350 call sites to the central diagnostic emitter sub_42FBA0 can trigger a longjmp back to the driver's recovery point on fatal errors.
The same binary doubles as an in-process library. When nvcc loads ptxas as a shared object rather than spawning a subprocess, three extra arguments to the driver carry an output buffer pointer, an extra option count, and an extra options array. Callback function pointers at fixed offsets in the options block allow the host process to receive diagnostics and progress notifications without going through stderr.
| main() | 0x409460 (84 bytes) -- setvbuf + tail-call to sub_446240 |
| Real main | sub_446240 (11,064 bytes, ~900 lines) |
| Options block | 1,352 bytes on stack |
| Error recovery | setjmp / longjmp (no C++ exceptions) |
| Option registration | sub_432A00 (6,427 bytes, ~100 options via sub_1C97210) |
| Option parser | sub_434320 (10,289 bytes, ~800 lines) |
| Diagnostic emitter | sub_42FBA0 (2,388 bytes, 2,350 callers, 7 severity levels) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers, 280-byte per-thread struct) |
| Pipeline phases | Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo |
| Library mode | sub_446240(argc, argv, output_buf, extra_opt_count, extra_opts) |
Architecture
main (0x409460, 84B)
│
├─ nullsub_1(*argv) // store program name (no-op)
├─ setvbuf(stdout, _IONBF)
├─ setvbuf(stderr, _IONBF)
│
└─ sub_446240(argc, argv, 0, 0, 0) // REAL MAIN
│
├─ setjmp(jmp_buf) // fatal error recovery point
│
├─ sub_434320(opts_block, ...) // OPTION PARSER
│ └─ sub_432A00(...) // register ~100 options via sub_1C97210
│
├─ sub_4428E0(...) // PTX INPUT SETUP
│ ├─ validate .version / .target
│ ├─ handle --input-as-string
│ └─ generate __cuda_dummy_entry__ if --compile-only
│
├─ sub_43A400(...) // TARGET CONFIGURATION
│ └─ set cache defaults, texmode, arch-specific flags
│
├─ FOR EACH compile unit:
│ ├─ sub_451730(...) // parser/lexer init + special regs
│ ├─ sub_43B660(...) // register constraint calculator
│ ├─ sub_43F400(...) // function/ABI setup
│ └─ sub_43CC70(...) // per-entry: DAGgen → OCG → ELF → DebugInfo
│
├─ timing / memory stats output (--compiler-stats)
│
└─ cleanup + return exit code
Pre-main Static Constructors
Before main executes, four static constructors run as part of the ELF .init_array. Three of them populate ROT13-obfuscated lookup tables that are foundational to the rest of the binary. This obfuscation is deliberate -- it prevents trivial string searching for internal opcode names and tuning knob identifiers in the stripped binary.
ctor_001 -- Thread Infrastructure (0x4094C0, 204 bytes)
Initializes the POSIX threading foundation used throughout ptxas:
pthread_key_create(&key, destr_function); // TLS key for sub_4280C0
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
__cxa_atexit(cleanup_func, ...); // registered destruction
The TLS key created here is the one used by sub_4280C0 (3,928 callers), making it the single most important piece of global state in the binary.
ctor_003 -- PTX Opcode Name Table (0x4095D0, 17,007 bytes)
Populates a table at 0x29FE300+ with approximately 900 ROT13-encoded PTX opcode mnemonic strings. Each entry is a (string_ptr, length) pair. The ROT13 encoding maps A-Z to N-Z,A-M and a-z to n-z,a-m, leaving digits and punctuation unchanged.
| Encoded | Decoded | Instruction |
|---|---|---|
NPDOHYX | ACQBULK | Bulk acquire |
NPDFUZVAVG | ACQSHMINIT | Shared memory acquire init |
OFLAP | BSYNC | Barrier sync |
PPGY.P | CCTL.C | Cache control |
SZN | FMA | Fused multiply-add |
FRGC | SETP | Set predicate |
ERGHEA | RETURN | Return |
RKVG | EXIT | Thread exit |
These decoded names are the canonical PTX opcode mnemonics used during parsing and validation. The table is consumed by the PTX lexer initialization at sub_451730 and the opcode-to-handler dispatch table at sub_46E000 (93 KB, the largest function in the front-end range).
ctor_005 -- Mercury Tuning Knob Registry (0x40D860, 80,397 bytes)
The single largest function in the front-end address range. Registers 2,000+ ROT13-encoded internal tuning knob names, each paired with a hexadecimal default value string. These are the "Mercury" (OCG) backend tuning parameters that control every aspect of code generation, scheduling, and register allocation.
| Encoded Name | Decoded Name | Default |
|---|---|---|
ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts | 0x3e40 |
ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency | — |
ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial | — |
ZrephelZreteCebybthrOybpxf | MercuryMergePrologueBlocks | — |
ZrephelTraFnffHPbqr | MercuryGenSassUCode | — |
The knob system is documented in detail in the Knobs System page. The ROT13 encoding applies identically to all knob name strings in all four constructors.
ctor_007 -- Scheduler Knob Registry (0x421290, 7,921 bytes)
A smaller companion to ctor_005 that registers 98 scheduler-specific knobs. These control the instruction scheduler (Mercury/OCG) behavior at a finer granularity than the general knobs:
Decoded examples: XBlockWaitOut, XBlockWaitInOut, XBlockWaitInOnTarget, WarDeploySyncsFlush_SW4397903, WaitToForceCTASwitch, VoltageWar_SW4981360PredicateOffDummies, TrackMultiReadsWarLatency, ScavInlineExpansion, ScavDisableSpilling.
Knob names containing _SW followed by a number (e.g., _SW4397903) indicate workarounds for specific hardware bugs identified by NVIDIA's internal bug tracking system.
Real Main -- sub_446240
The exported main() tail-calls sub_446240 with three zero arguments appended. This function is the complete compilation orchestrator: it owns the options block, the error recovery, the compilation loop, and the statistics output.
| Field | Value |
|---|---|
| Address | 0x446240 |
| Size | 11,064 bytes |
| Stack frame | 1,352+ bytes (master options block + locals) |
| Callers | 1 (main) |
| Error recovery | setjmp at function entry |
Signature and Library Mode
int sub_446240(int argc, char **argv,
void *output_buf, // a3: cubin output buffer (NULL for standalone)
int extra_opt_count, // a4: count of extra options from nvcc
char **extra_opts); // a5: array of extra option strings
When main calls this, a3/a4/a5 are all zero -- standalone mode. When nvcc loads ptxas as a shared library and calls the entry point directly, these arguments carry non-null values:
- a3 (output_buf): Pointer to a memory buffer where the compiled cubin is written. Eliminates the need for temporary files and filesystem round-trips, which matters for large CUDA compilations where nvcc may invoke ptxas hundreds of times.
- a4 (extra_opt_count): Number of additional option strings injected by nvcc beyond what appears on the command line.
- a5 (extra_opts): Array of those extra option strings.
Additionally, callback function pointers at offsets 37--39 of the 1,352-byte options block (byte offsets ~296, ~304, ~312) allow the host process to receive progress notifications and diagnostic messages in-process rather than through stderr.
Error Recovery with setjmp/longjmp
The first significant action in sub_446240 is establishing a setjmp recovery point:
if (setjmp(jmp_buf) != 0) {
// Fatal error occurred somewhere in the pipeline.
// Clean up and return non-zero exit code.
goto cleanup;
}
This is the only error recovery mechanism in ptxas -- there are no C++ exceptions (the binary is compiled as C, not C++). Any function anywhere in the call tree that encounters an unrecoverable error calls sub_42FBA0 with severity >= 6, which internally calls longjmp(jmp_buf, 1) to unwind directly back to this point. The approach is simple but has a critical implication: all resources allocated between the setjmp and the fatal error are leaked unless explicitly tracked and cleaned up at the recovery site.
The 1,352-Byte Options Block
The master options block lives on the stack and accumulates all compilation state during option parsing. It is passed by pointer to virtually every subsystem. Key fields (approximate offsets based on access patterns):
| Offset Range | Purpose |
|---|---|
| 0--63 | Input/output file paths, PTX version, target SM |
| 64--127 | Optimization level, debug flags, cache modes |
| 128--255 | Register limits, occupancy constraints |
| 256--295 | Warning/error control flags |
| 296--319 | Library-mode callback function pointers (offsets 37--39) |
| 320--1351 | Per-pass configuration, knob overrides, feature flags |
Compilation Loop
After option parsing and PTX input setup, the driver enters a loop over compile units. Each unit corresponds to one entry function (or device function in --compile-only mode). The per-entry processing is handled by sub_43CC70, which prints a separator:
printf("\n# ============== entry %s ==============\n", entry_name);
and then sequences: DAGgen (PTX-to-Ori lowering), OCG (optimization and code generation), ELF (binary emission), and DebugInfo (DWARF generation). The special entry __cuda_dummy_entry__ is silently skipped.
Timing and Memory Statistics
When --compiler-stats is active, sub_446240 prints per-phase timing and peak memory after all compile units complete:
CompileTime = 42.3 ms (100%)
Parse-time : 12.1 ms (28.61%)
CompileUnitSetup-time : 1.4 ms ( 3.31%)
DAGgen-time : 8.7 ms (20.57%)
OCG-time : 15.2 ms (35.93%)
ELF-time : 3.8 ms ( 8.98%)
DebugInfo-time : 1.1 ms ( 2.60%)
PeakMemoryUsage = 2048.000 KB
When --compiler-stats-file is specified, the same data is written in JSON format using the shared JSON builder (sub_1CBA950). When --fdevice-time-trace is active, sub_439880 parses Chrome DevTools trace format JSON and merges ptxas timing events into the trace.
Option Parser -- sub_434320 and sub_432A00
Option parsing is split into two phases: registration and processing.
Option Registration -- sub_432A00
This 6,427-byte function calls sub_1C97210 approximately 100 times, once per recognized option. Each call provides the option's long name, short name, value type, help text, and default value to the generic option framework (implemented in the 0x1C96xxx--0x1C97xxx range, shared with other NVIDIA tools).
| Option | Short | Type | Help Text |
|---|---|---|---|
--arch | -arch | string | "Specify the 'sm_' name of the target architecture" |
--output-file | -o | string | "Specify name and location of the output file" |
--opt-level | -O | int | "Specify optimization level" |
--maxrregcount | — | int | "Specify the maximum number of registers" |
--register-usage-level | — | enum(0..10) | Register usage reporting level |
--verbose | -v | bool | Verbose output |
--version | -V | — | Print version and exit |
--compile-only | — | bool | Compile without linking |
--compile-functions | — | string | "Entry function name" |
--input-as-string | — | string | "PTX string" (compile from memory) |
--fast-compile | — | bool | Reduce compile time at cost of code quality |
--suppress-stack-size-warning | — | bool | Suppress stack size warnings |
--warn-on-local-memory-usage | — | bool | Warn when local memory is used |
--warn-on-spills | — | bool | Warn on register spills |
--warn-on-double-precision-use | — | bool | Warn on FP64 usage |
--compiler-stats | — | bool | Print compilation timing |
--compiler-stats-file | — | string | "/path/to/file" (JSON output) |
--fdevice-time-trace | — | string | Chrome trace JSON output |
--def-load-cache | — | enum | Default load cache operation |
--force-load-cache | — | enum | Force load cache operation |
--position-independent-code | — | bool | Generate PIC |
--compile-as-tools-patch | — | bool | CUDA sanitizer/tools patch mode |
--extensible-whole-program | — | bool | Whole-program compilation |
--cloning | — | enum(yes/no) | Inline cloning control |
--ptxlen | — | — | PTX length statistics |
--list-version / --version-ls | — | — | List supported PTX versions |
--disable-smem-reservation | — | bool | Disable shared memory reservation |
--generate-relocatable-object | -c | bool | Generate relocatable object |
Option Processing -- sub_434320
The 10,289-byte parser iterates over argv (and any extra options from library mode), matches each argument against registered options via the framework, and populates fields in the 1,352-byte options block. Special handling exists for:
- `--version`: Prints the identification string "Ptx optimizing assembler" followed by the version (e.g., "Cuda compilation tools, release 13.0, V13.0.88") and exits.
- `--help`: Delegates to `sub_403588`, which prints `"Usage : %s [options] <ptx file>,...\n"` followed by all registered options, then exits.
- `--fast-compile`: Validated against conflicting optimization options.
- `-cloning=yes` / `-cloning=no`: Inline cloning control parsed as an equality option.
Generic Option Framework
The option parsing library lives in the 0x1C96000--0x1C97FFF range and is shared with other NVIDIA tools (nvlink, fatbinary, etc.):
| Address | Identity | Role |
|---|---|---|
sub_1C960C0 | Option parser constructor | Creates the option parser state |
sub_1C96680 | Argv processor | Matches argv entries against registered options |
sub_1C97210 | Option registrar | Registers one option with name, type, help |
sub_1C97640 | Help printer | Iterates all registered options, prints help text |
Diagnostic System -- sub_42FBA0
The central diagnostic emitter is the most important error-reporting function in ptxas. With 2,350 call sites, it handles every warning, error, and fatal message in the entire binary.
Signature
void sub_42FBA0(
int *descriptor, // a1: points to severity level at *a1
void *location, // a2: source location context
... // variadic: printf-style format args
);
Severity Levels
| Level | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (none) | — | Suppressed -- message is silently discarded |
| 1 | "info " | @I@ | Informational |
| 2 | "info " | @I@ | Informational (alternate) |
| 3 | "warning " / "error " | @W@ / @E@ | Warning, promoted to error if TLS[50] set |
| 4 | "error* " | @O@ | Non-fatal error with special marker |
| 5 | "error " | @E@ | Non-fatal error |
| 6 | "fatal " | @E@ | Fatal -- triggers longjmp(jmp_buf, 1) |
The machine-readable tags (@E@, @W@, @O@, @I@) allow nvcc and other tools to parse ptxas output programmatically, extracting severity without parsing the human-readable text.
Warning-to-Error Promotion
Severity level 3 has context-dependent behavior controlled by two flags in the thread-local storage:
v5 = *a1; // severity
if (v5 == 3) {
if (sub_4280C0()[49]) // TLS offset 49: suppression flag
return; // silently discard
if (sub_4280C0()[50]) // TLS offset 50: Werror flag
prefix = "error ";
else
prefix = "warning ";
}
This implements the --Werror equivalent: when the Werror flag is active in the TLS context, all warnings become errors.
Output Format
<filename>, line <N>; <severity>: <message>
When source is available, the diagnostic emitter reads the PTX input file, seeks to line N, and prints the source line prefixed with "# ". To avoid O(n) seeking through large files on every diagnostic, it maintains a hash map (sub_426150/sub_426D60) that caches file byte offsets every 10 lines for fast random access to arbitrary line numbers.
Fatal Error Handler -- sub_42BDB0
A 14-byte wrapper called from 3,825 sites (nearly every allocation in ptxas). It fires whenever the pool allocator sub_424070 returns NULL:
void sub_42BDB0(...) {
return sub_42F590(&unk_29FA530, ...); // internal error descriptor
}
The descriptor at unk_29FA530 has severity 6 (fatal), so this always triggers longjmp back to the driver's recovery point.
Thread-Local Storage -- sub_4280C0
The most-called function in the entire binary (3,928 callers). Returns a pointer to a 280-byte per-thread context struct, allocating and initializing it on first access.
void *sub_4280C0(void) {
void *ctx = pthread_getspecific(key);
if (ctx) return ctx;
ctx = malloc(0x118); // 280 bytes
memset(ctx, 0, 0x118);
pthread_cond_init(ctx + 128, NULL);
pthread_mutex_init(ctx + 176, NULL);
sem_init(ctx + 216, 0, 0);
pthread_setspecific(key, ctx);
return ctx;
}
TLS Context Layout (280 bytes)
| Offset | Size | Type | Purpose |
|---|---|---|---|
| 0 | 8 | int/flags | Error/warning state flags |
| 8 | 8 | int | has_error flag |
| 49 | 1 | byte | Diagnostic suppression flag |
| 50 | 1 | byte | Werror promotion flag |
| 128 | 48 | pthread_cond_t | Condition variable |
| 176 | 40 | pthread_mutex_t | Per-thread mutex |
| 216 | 32 | sem_t | Semaphore for synchronization |
The TLS key is created by ctor_001 before main runs, and a destructor function registered via pthread_key_create frees the 280-byte struct when a thread terminates. This per-thread context enables concurrent compilation of multiple compile units (when the thread pool is active), with each thread maintaining independent error state and diagnostic suppression flags.
PTX Input Setup -- sub_4428E0
After options are parsed, this 13,774-byte function reads and preprocesses the PTX input:
- Version and target validation. Checks `.version` and `.target` directives in the input. Emits synthetic headers (`"\t.version %s\n"`, `"\t.target %s\n"`) when needed.
- Compile-only mode. When `--compile-only` is active and no real entries exist, generates a dummy entry, `"\t.entry %s { ret; }\n"`, with the name `__cuda_dummy_entry__`.
- Input-as-string mode. When `--input-as-string` is active, PTX is read from memory (passed as a string argument) rather than from a file. This is used by the library-mode interface.
- Whole-program mode. `--extensible-whole-program` enables inter-function optimization across all entries in the compilation unit.
- Cache and debug configuration. Applies `--def-load-cache`, `--def-store-cache`, `--force-load-cache`, `--force-store-cache`, and `suppress-debug-info` settings.
- Tools-patch mode. `--compile-as-tools-patch` activates the CUDA sanitizer compilation path, checking for `__cuda_sanitizer` symbols.
Key diagnostic strings from this function:
"'--fast-compile'""calls without ABI""compilation without ABI""device-debug or lineinfo""unified Functions"
Target Configuration -- sub_43A400
A 4,696-byte function that configures target-specific defaults after option parsing completes. It reads the SM architecture number from the options block and sets:
- Texturing mode: `texmode_unified` vs raw texture mode.
- Cache defaults: Based on architecture capabilities.
- Feature flags: Hardware-specific workaround flags (e.g., `--sw4575628`).
- Indirect function support: `"Indirect Functions or Extern Functions"` validation.
The function references "NVIDIA" and "ptxocg.0.0" (the internal name for the OCG optimization pass), suggesting it also initializes the pass pipeline configuration for the target architecture.
Register Constraint Calculator -- sub_43B660
A 3,843-byte function that resolves potentially conflicting register limit specifications into a single register budget per function. Register constraints come from four sources with different priorities:
| Source | Directive/Option | Priority |
|---|---|---|
| PTX directive | .maxnreg N | Per-function, highest priority |
| CLI option | --maxrregcount N | Global, overridden by .maxnreg |
| PTX directive | .minnctapersm N | Occupancy target, derived limit |
| PTX directive | .maxntid Nx,Ny,Nz | Thread block size, derived limit |
The occupancy-derived limit is computed from .minnctapersm and .maxntid: given a minimum number of CTAs per SM and a maximum thread count per CTA, the function calculates the maximum register count that allows the requested occupancy level, accounting for per-SM register file size.
Diagnostic strings indicate the resolution process:
- `"computed using thread count"` -- derived from `.maxntid`
- `"of .maxnreg"` -- explicit per-function limit
- `"of maxrregcount option"` -- CLI override
- `"global register limit specified"` -- global cap applied
Per-Entry Compilation -- sub_43CC70
A 5,425-byte function that processes each entry function through the complete backend pipeline. For each entry:
- Skips `__cuda_dummy_entry__` (generated by compile-only mode).
- Prints the entry separator: `"\n# ============== entry %s ==============\n"`.
- Runs DAGgen (PTX-to-Ori lowering).
- Runs OCG (the 159-phase optimization pipeline + SASS code generation).
- Generates `.sass` and `.ucode` ELF sections.
- Generates DWARF debug information if requested.
The function also handles reg-fatpoint configuration (the register allocation algorithm, documented in the Fatpoint Algorithm page).
Function/ABI Setup -- sub_43F400
A 9,078-byte function that configures the calling convention for each function before compilation. This includes:
| Resource | Diagnostic String |
|---|---|
| Parameter passing registers | "number of registers used for parameter passing" |
| First parameter register | "first parameter register" |
| Return address register | "return address register" |
| Scratch data registers | "scratch data registers" |
| Scratch control barriers | "scratch control barriers" |
| Call prototype | "callprotoype" (sic -- misspelled in binary) |
| Call target | "calltarget" |
The function handles both entry functions (kernels launched from the host) and device functions (callable from other device code), with different ABI requirements for each. Entry functions use a simplified ABI where parameters come from constant memory, while device functions use register-based parameter passing.
The --compile-as-tools-patch and --sw200428197 flags activate a special ABI variant for CUDA sanitizer instrumentation, which inserts additional scratch registers for sanitizer state.
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
0x409460 | 84 B | — | main (entry point thunk) |
0x4094C0 | 204 B | — | ctor_001 (thread infrastructure init) |
0x4095D0 | 17 KB | — | ctor_003 (ROT13 opcode table, ~900 entries) |
0x40D860 | 80 KB | — | ctor_005 (ROT13 knob registry, 2000+ entries) |
0x421290 | 8 KB | — | ctor_007 (scheduler knob registry, 98 entries) |
0x403588 | 75 B | 1 | Usage printer (--help) |
0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct) |
0x42BDB0 | 14 B | 3,825 | OOM fatal error handler |
0x42FBA0 | 2.4 KB | 2,350 | Central diagnostic emitter |
0x42F590 | — | 1 | Internal fatal error handler |
0x430570 | — | 2 | Program name getter |
0x432A00 | 6.4 KB | 1 | Option registration (~100 options) |
0x434320 | 10 KB | 1 | Option parser and validator |
0x439880 | 2.9 KB | 1 | Chrome trace JSON parser |
0x43A400 | 4.7 KB | 1 | Target configuration |
0x43B660 | 3.8 KB | 1 | Register constraint calculator |
0x43CC70 | 5.4 KB | 1 | Per-entry compilation processor |
0x43F400 | 9 KB | 1 | Function/ABI setup |
0x4428E0 | 13.8 KB | 1 | PTX input setup and preprocessing |
0x446240 | 11 KB | 1 | Compilation driver (real main) |
0x451730 | 14 KB | 1 | Parser/lexer init + special registers |
0x46E000 | 93 KB | 1 | Opcode-to-handler dispatch table builder |
0x1C960C0 | — | — | Option parser constructor |
0x1C96680 | — | — | Argv processor |
0x1C97210 | — | ~100 | Option registrar (per-option) |
0x1C97640 | — | 1 | Help text printer |
0x1CBA950 | — | — | JSON context constructor |
0x1CBAC20 | 2.9 KB | 3 | JSON recursive descent parser |
Cross-References
- Pipeline Overview -- full PTX-to-SASS compilation flow
- CLI Options -- complete option catalog
- Knobs System -- the 2,000+ Mercury tuning knobs registered in ctor_005
- Memory Pool Allocator -- the allocator (`sub_424070`) that calls `sub_42BDB0` on OOM
- Hash Tables & Bitvectors -- the hash map used by diagnostics for line offset caching
- Thread Pool & Concurrency -- thread pool that creates the TLS contexts
- PTX Parser -- the parser initialized by `sub_451730`
- Optimization Pipeline -- the 159-phase pipeline invoked per compile unit
- Fatpoint Algorithm -- register allocation referenced in per-entry compilation
PTX Parser (Flex + Bison)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas front-end parses PTX assembly text into internal IR using a classic two-stage architecture: a Flex-generated DFA scanner (lexer) and a Bison-generated LALR(1) shift-reduce parser. Unlike most compiler front-ends, the parser does not construct an AST. Instead, Bison reduction actions directly build IR nodes, populate the instruction table, and emit validation calls -- the parse tree is consumed inline and never materialized as a data structure. A separate macro preprocessor handles .MACRO, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives at the character level before tokens reach the Flex DFA. The instruction table builder (sub_46E000, 93 KB) registers all PTX opcodes with their legal type combinations during parser initialization, and an instruction lookup subsystem classifies operands into 12 categories at parse time.
| Flex scanner | sub_720F00 (15.8 KB, 64 KB with inlined helpers) |
| DFA table | off_203C020 (transition/accept array) |
| Scanner rules | ~552 Flex rules, 162 token types (codes 258--422) |
| Scanner prefix | ptx (all Flex symbols: ptxlex, ptxensure_buffer_stack, etc.) |
| Bison parser | sub_4CE6B0 (48 KB, spans 0x4CE6B0--0x4DA337) |
| Grammar size | ~512 productions, 443 reduction cases |
| LALR tables | word_1D146A0 (yydefact), word_1D121A0 (yycheck), word_1D13360 (yypact), word_1D150C0 (yypgoto), byte_1D15960 (yyr2) |
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry), sub_46C6E0 (6.4 KB descriptor matcher) |
| Macro preprocessor | sub_71F630 (14 KB dispatcher), sub_71E2B0 (32 KB conditional handler) |
| Parser state object | 1,128 bytes (+ 2,528-byte lexer state via pointer at +1096) |
| Error handler | sub_42FBA0 (2,350 callers, central diagnostics) |
| Parser init | sub_451730 (14 KB, symbol table + special registers + opcode table) |
Architecture
PTX source text
│
▼
┌─────────────────────────────────────────────────────────┐
│ MACRO PREPROCESSOR (character-level, 0x71B000-0x720000)│
│ sub_71F630 dispatch: .MACRO / .ELSE / .INCLUDE │
│ sub_71E2B0 conditional: .ELSE / .ELIF / .ENDIF (32KB) │
│ sub_71DCA0 macro definition handler │
│ sub_71C310 .INCLUDE file handler │
└────────────────────┬────────────────────────────────────┘
│ preprocessed character stream
▼
┌─────────────────────────────────────────────────────────┐
│ FLEX DFA SCANNER sub_720F00 (15.8KB, 552 rules) │
│ off_203C020 DFA transition table │
│ Token codes: 258-422 (162 types) │
│ Helper: sub_720410 (yy_get_next_buffer) │
│ sub_720630 (yy_get_previous_state) │
│ sub_720BA0 (yy_scan_string) │
└────────────────────┬────────────────────────────────────┘
│ token stream (code + attribute)
▼
┌─────────────────────────────────────────────────────────┐
│ BISON LALR(1) PARSER sub_4CE6B0 (48KB, 512 prods) │
│ 5 LALR tables at 0x1D12xxx-0x1D15xxx │
│ 443 reduction actions → direct IR construction │
│ NO AST: reductions emit IR nodes inline │
└────────────────────┬────────────────────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
INSTRUCTION TABLE SEMANTIC VALIDATORS
sub_46E000 (93KB) sub_4B2F20 (52KB, general)
sub_46BED0 (per-opcode) sub_4C5FB0 (28KB, operands)
sub_46C690 (lookup) sub_4C2FD0 (12KB, WMMA/MMA)
sub_46C6E0 (6.4KB match) sub_4ABFD0 (11KB, async copy)
sub_4A73C0 (10KB, tensormap)
+ 20 more validators
Flex DFA Scanner -- sub_720F00
The scanner is a standard Flex-generated DFA with the ptx prefix (all exported symbols use ptx instead of yy: ptxlex, ptxensure_buffer_stack, ptx_create_buffer, etc.). At 15.8 KB of core logic (64 KB including inlined buffer management), it is the largest single function in the lexer region. The DFA transition table lives at off_203C020 and is indexed by *(DWORD*)(state + 76) (the current start condition). The main loop structure follows the textbook Flex pattern:
// DFA transition core (reconstructed from sub_720F00)
while (1) {
    v10 = (DWORD *)(table_base + 8 * state);          // row for table[state]
    if (current_char != *v10)                         // no transition for this char:
        break;                                        //   fall back to last accept
    state = v10[1];                                   // goto next state
    action = *(DWORD *)(table_base + 8 * state - 4);  // accept action (or 0)
    if (action != 0)
        break;                                        // matched a rule
    current_char = *input_ptr++;                      // advance input (reconstructed)
}
// Giant switch on action number (0..~550)
switch (action) { ... }
The scanner returns integer token codes to the Bison parser. The value 550 is YY_NULL (end-of-input sentinel). Token attributes are communicated through the lexer state object, which the parser state carries as a pointer at offset +1096. The scanner receives this pointer as its a3 argument and dereferences it (e.g., *(_QWORD *)(a3 + 1096)) to reach the 2,528-byte lexer state.
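The table-driven shape of this loop can be illustrated with a toy matcher. This is a didactic sketch, not the recovered code: the per-state transition rows, the accept-action array, and the "remember the last accepting action" behavior mirror the Flex core above, but the table contents and action numbers are invented.

```c
#include <assert.h>

/* Toy table-driven DFA: each row maps a character class to a next
   state, accepting states carry a nonzero action number, and the
   scanner remembers the last accepting action seen. */
enum { N_STATES = 3 };

static const int next_state[N_STATES][2] = { /* [state][is_digit] */
    { 1, 2 },    /* start: letter -> ident, digit -> number */
    { 1, 1 },    /* ident continues on letters and digits   */
    { -1, 2 },   /* number continues on digits only         */
};
static const int accept_action[N_STATES] = { 0, 1, 2 }; /* invented codes */

static int scan_token(const char *s) {
    int state = 0, action = 0;
    for (; *s; s++) {
        int cls = (*s >= '0' && *s <= '9');
        int nxt = next_state[state][cls];
        if (nxt < 0) break;                 /* no transition: stop   */
        state = nxt;
        if (accept_action[state])
            action = accept_action[state];  /* last accepting wins   */
    }
    return action;                          /* 0 = no rule matched   */
}
```

On mixed input such as `12a` the matcher stops at the first character with no transition and returns the action of the last accepting state, which is exactly how Flex implements longest-match.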
Token Categories
The 552 Flex rules map PTX lexemes to 162 distinct token types. Bison terminal codes range from 258 to 422. The scanner switch cases reveal the following category structure:
| Switch cases | Token code | Category | Examples / attributes |
|---|---|---|---|
| 2 | 364 | Semicolons / newlines | Statement terminator |
| 5--7 | 340, 341, 344 | Keywords | PTX keywords |
| 63--65 | 302 | Register names | Attribute: -1, chr-48, chr-38 (register numbering) |
| 74--91 | 320 | Data types | Values 1--18: .b8 through .f64 (18 type qualifiers) |
| 92--94 | 322 | Comparison types | Values 9, 7, 11 |
| 95--99 | 323 | Rounding modes | Values 24--29: .rn, .rz, .rm, .rp, etc. |
| 1 | (internal) | #include | Strips whitespace, copies filename |
| 3 | (dispatch) | Preprocessor directive | Calls sub_71F630 |
| 4 | 339 | #pragma | Strips whitespace |
Line and column tracking uses fields at *(state+48) (line number) and *(state+52) (column), incremented on each newline character.
Buffer Management
The scanner uses the standard Flex buffer stack for nested input sources (includes, macros, inline strings). Key buffer management functions:
| Address | Size | Identity | Purpose |
|---|---|---|---|
| sub_720190 | 2.0 KB | ptxensure_buffer_stack | Grows buffer stack via realloc |
| sub_7202E0 | 1.3 KB | ptx_create_buffer | Creates YY_BUFFER_STATE from FILE* |
| sub_720410 | 3.3 KB | yy_get_next_buffer | Refills character buffer, handles EOF |
| sub_720630 | 9.7 KB | yy_get_previous_state | Restores DFA state, SIMD-optimized memmove |
| sub_720BA0 | 4.3 KB | ptx_scan_string | Scans inline string into buffer |
| sub_724CC0 | 4.9 KB | ptx_scan_bytes | Macro expansion buffer allocation |
| sub_725070 | 2.7 KB | ptx_scan_buffer | Buffer creation with error recovery |
Notable: sub_720630 contains SSE2-optimized memmove using __m128i aligned 16-byte copies for buffer compaction -- a Flex optimization for large input buffers. The ptx_scan_bytes function (sub_724CC0) is called from the Bison parser actions (3 call sites in sub_4CCF30) to handle inline macro expansion during parsing.
Error strings in the buffer system:
"out of dynamic memory in ptxensure_buffer_stack()""out of dynamic memory in ptx_create_buffer()""out of dynamic memory in yy_get_next_buffer()""out of dynamic memory in ptx_scan_bytes()""bad buffer in ptx_scan_bytes()""out of dynamic memory in ptx_scan_buffer()""fatal flex scanner internal error--no action found""fatal flex scanner internal error--end of buffer missed""unexpected EOF while scanning"
Macro Preprocessor
Before tokens reach the Flex DFA, a character-level macro preprocessor handles .MACRO/.ENDM, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives. The preprocessor lives at 0x71B000--0x720000 (~20 KB) and operates on raw character streams, not tokens. This design is identical to C's preprocessor running before the lexer.
Preprocessor Dispatch -- sub_71F630
The top-level dispatcher (14 KB) is called from the Flex scanner's case 3 (directive detection). It examines the directive name and routes to the appropriate handler:
| Directive | Handler | Size | Description |
|---|---|---|---|
| .MACRO | sub_71DCA0 | 8.4 KB | Macro definition: records body text, handles nesting |
| .ELSE / .ELIF | sub_71E2B0 | 32 KB | Conditional code: skips blocks, handles nested conditionals |
| .ENDIF | sub_71E2B0 | (shared) | End of conditional block |
| .INCLUDE | sub_71C310 | 8.3 KB | File inclusion: pushes new input source onto lexer stack |
The dispatcher uses strstr for substring matching on directive names and returns token codes (e.g., 364 for end-of-directive).
Conditional Handler -- sub_71E2B0
At 32 KB, this is the largest preprocessor function. It handles .ELSE, .ELIF, and .ENDIF by scanning ahead through the input character stream, counting nesting levels, and skipping entire blocks of PTX text when conditions are false. It calls sub_4287D0 (the token reader) to evaluate conditional expressions and sub_428C40 (string compare) for keyword matching. Two nearly-duplicate code blocks handle .ELSE and .ELIF paths with identical scanning logic but different branch conditions.
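The nesting-count skip logic can be sketched as follows. This is a conceptual illustration only: tokens stand in for the character-level scan the real handler performs, and `.IF` is a hypothetical stand-in for whatever opens a conditional block in the recovered grammar.

```c
#include <assert.h>
#include <string.h>

/* Sketch: scan forward tracking nesting depth so a false branch
   skips past its own .ENDIF but not an outer one. Returns the index
   of the matching terminator, or -1 if the block is unterminated. */
static int find_matching_endif(const char *toks[], int n) {
    int depth = 0;
    for (int i = 0; i < n; i++) {
        if (!strcmp(toks[i], ".IF")) {
            depth++;                       /* enter nested block    */
        } else if (!strcmp(toks[i], ".ENDIF")) {
            if (depth == 0) return i;      /* our own terminator    */
            depth--;                       /* close nested block    */
        }
    }
    return -1;                             /* unterminated          */
}
```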
Macro Definition -- sub_71DCA0
Handles .MACRO directives by recording the macro body text. The function is recursive to support nested .MACRO definitions. It delegates to sub_71D710 (macro body scanner, 7.5 KB) and sub_71D1B0 (macro argument scanner, 6.8 KB). The argument scanner uses strlen + strncmp for keyword matching against a delimiter string parameter.
Include Handler -- sub_71C310
Processes .INCLUDE by pushing a new file onto the lexer's input stack. The function is recursive (calls itself 4 times) for nested includes. It manages the include-stack pointers at offsets +2128, +2136, +2160, and +2168 of the lexer state object (the 2,528-byte struct pointed to by parser+1096), and uses the "pushback character" register at offset +2441 of the same lexer state. String reference: "ptxset_lineno called with no buffer".
Error Handling
Macro errors are reported through sub_71BF60 (fatal macro abort) which calls sub_71BF30 to print "out of dynamic memory..." messages, and sub_71C140 (format error) which calls sub_42CA60 (error output). Nesting depth is checked by sub_724CC0 which prints "macro nesting too deep!" on overflow.
Bison LALR(1) Parser -- sub_4CE6B0
The parser is a standard Bison-generated LALR(1) shift-reduce parser spanning 48 KB (addresses 0x4CE6B0--0x4DA337). It contains ~512 grammar productions with 443 reduction cases. The function calls ptxlex (sub_720F00) to obtain tokens and uses five LALR tables for state transitions:
| Table | Address | Bison name | Purpose |
|---|---|---|---|
| word_1D146A0 | 0x1D146A0 | yydefact | Default reduction rule for each state |
| word_1D121A0 | 0x1D121A0 | yycheck | Valid lookahead verification |
| word_1D13360 | 0x1D13360 | yypact | Parser action table (shift/reduce) |
| word_1D150C0 | 0x1D150C0 | yypgoto | Goto table for nonterminals |
| byte_1D15960 | 0x1D15960 | yyr2 | Right-hand-side length for each rule |
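How these tables cooperate can be sketched with a simplified version of the Bison skeleton's action lookup: yypact biases the state by the lookahead, yycheck guards against collisions introduced by table compression, and yydefact supplies the default reduction. The two-state toy tables below are invented for illustration, and the real skeleton first translates raw token codes into internal symbol numbers, a step omitted here.

```c
#include <assert.h>

enum { YYPACT_NINF = -99, FIRST_TERMINAL = 258 };

static const int yypact[2]   = { 0, YYPACT_NINF };
static const int yytable[2]  = { 7, 8 };       /* packed actions    */
static const int yycheck[2]  = { 258, 259 };   /* expected tokens   */
static const int yydefact[2] = { 0, 3 };       /* default rules     */

/* Positive result: shift/reduce action from the packed table.
   Negative result: reduce by the state's default rule. */
static int lalr_action(int state, int token) {
    int off = yypact[state];
    if (off != YYPACT_NINF) {
        int idx = off + (token - FIRST_TERMINAL);
        if (idx >= 0 && idx < 2 && yycheck[idx] == token)
            return yytable[idx];           /* valid packed entry    */
    }
    return -yydefact[state];               /* fall back to default  */
}
```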
Direct IR Construction (No AST)
The critical architectural decision: Bison reduction actions directly construct IR nodes rather than building an intermediate AST. When a grammar rule is reduced, the semantic action immediately:
- Allocates IR nodes via the pool allocator (sub_424070)
- Populates instruction fields from token attributes
- Calls instruction validators for semantic checking
- Links nodes into the instruction stream
- Registers symbols in the symbol table (via sub_426150, the hash map)
This means the parser is a single-pass translator from PTX text to IR. The trade-off is clear: no AST means no multi-pass source-level analysis, but it eliminates an entire allocation and traversal phase. For an assembler (as opposed to a high-level language compiler), this is the right choice -- PTX is already a linearized instruction stream with no complex scoping or overload resolution that would benefit from an AST.
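The AST-less pattern can be sketched as a reduction action that allocates an IR node and links it into the instruction stream in one step. The node layout and names below are invented for illustration, with calloc standing in for the sub_424070 pool allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct IRInsn {
    const char *opcode;
    struct IRInsn *next;
} IRInsn;

typedef struct { IRInsn *head, *tail; } InsnStream;

/* What a reduction for `instruction: OPCODE operand_list` might do:
   build the IR node immediately -- no parse-tree node is ever
   materialized. */
static IRInsn *reduce_instruction(InsnStream *s, const char *opcode) {
    IRInsn *n = calloc(1, sizeof *n);   /* stands in for sub_424070 */
    n->opcode = opcode;
    if (s->tail) s->tail->next = n; else s->head = n;
    s->tail = n;                         /* append to live stream    */
    return n;
}
```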
Reduction Actions -- Semantic Processing
The 443 reduction cases in the parser body handle PTX constructs from simple register declarations to complex matrix instruction specifications. Diagnostic strings found in the parser tail (0x4D5000--0x4DA337) reveal the kinds of semantic checks performed during reduction:
Directive validation:
"Defining labels in .section""dwarf data"-- DWARF section processing"reqntid"/".reqntid directive"-- required thread count".minnctapersm directive"-- min CTAs per SM".maxnctapersm"/".maxnctapersm directive"-- max CTAs per SM (deprecated)".maxntid and .reqntid cannot both be specified"".maxnctapersm directive deprecated..."".minnctapersm is ignored..."
Type and operand validation:
"Vector Type not specified properly"".f16x2 packed data-type"-- half-precision packed type"matrix shape"-- matrix instruction dimensions".scale_vectorsize"-- vector scaling modifier"too many layout specifiers"
Resource limits:
"Kernel parameter size larger than 4352 bytes"
Architecture gating:
"sm_50","sm_20","sm_53"-- target architecture checks viasub_485520(ctx, sm_number)- PTX version checks via
sub_485570(ctx, major, minor)
Expression handling:
"%s+%llu"/"%s-%s"-- label arithmetic in address expressions"Negative numbers in dwarf section"-- DWARF data validation
Symbol resolution:
"unrecognized symbol"-- lexer/symbol table failure"syntax error"-- generic parse error".extern"-- external declarations".noreturn directive"-- function attributes"texmode_unified"/"texmode_raw"-- texture mode selection"cache eviction priority"/".level::eviction_priority"-- cache policy
Error Recovery
Parse errors trigger sub_42FBA0 with "syntax error" as the message. The central diagnostic emitter (sub_42FBA0, 2,388 bytes, 2,350 callers) handles all severity levels:
| Severity | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (suppressed) | -- | Silently ignored |
| 1--2 | "info " | @I@ | Informational message |
| 3 | "warning " or "error " | @W@ or @E@ | Context-dependent; promoted to error by --Werror |
| 4 | "error* " | @E@ | Non-fatal error |
| 5 | "error " | @E@ | Error |
| 6+ | "fatal " | (none) | Calls longjmp to abort compilation |
The diagnostic system reads the source file to display context lines (prefixed with "# "), caching file offsets every 10 lines in a hash map for fast random-access seeking.
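The severity mapping in the table above can be sketched as a single dispatch function. This is a model of the recovered behavior, not the decompiled body: the `werror` flag models --Werror promotion of level 3, and NULL models suppression.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the severity-to-prefix mapping recovered from
   sub_42FBA0. Returns the message prefix, or NULL if suppressed. */
static const char *diag_prefix(int severity, int werror) {
    if (severity <= 0) return NULL;             /* suppressed        */
    if (severity <= 2) return "info ";          /* @I@               */
    if (severity == 3) return werror ? "error " : "warning ";
    if (severity == 4) return "error* ";        /* non-fatal, @E@    */
    if (severity == 5) return "error ";         /* @E@               */
    return "fatal ";                            /* 6+: longjmp abort */
}
```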
Parser Initialization -- sub_451730
Parser initialization (14 KB) builds the lexer's symbol table with all built-in PTX names before parsing begins. This function is called from the compilation driver (sub_446240) and performs three major tasks:
1. Special Register Registration
All PTX special registers are pre-registered in the symbol table with their internal identifiers:
| Category | Registers |
|---|---|
| Thread/block ID | %ntid, %laneid, %warpid, %nwarpid, %smid, %nsmid, %ctaid, %nctaid, %gridid |
| Clocks | %clock, %clock_hi, %clock64 |
| Performance counters | %pm0--%pm7, %pm0_64--%pm7_64 |
| Lane masks | %lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt |
| Environment | %envreg0--%envreg31 |
| Timers | %globaltimer_lo, %globaltimer_hi |
| Shared memory | %total_smem_size, %dynamic_smem_size |
| Texture types | .texref, .samplerref, .surfref |
| Predefined macros | GPU_ARCH, PTX_MAJOR_VERSION, PTX_MINOR_VERSION |
2. Opcode Table Construction
Calls sub_46E000 -- the 93 KB instruction table builder -- to register all PTX opcodes with their legal type combinations. See the dedicated section below.
3. Context State Initialization
Allocates and initializes two objects: the parser state (1,128 bytes, sub_424070(pool, 1128)) and the lexer state (2,528 bytes, sub_424070(pool, 2528)). The parser state stores a pointer to the lexer state at offset +1096. The string "PTX parsing state" identifies the parser state allocation in memory dumps. The string "<builtin>" serves as the filename for built-in declarations. Both objects are zeroed via memset before field initialization.
Instruction Table Builder -- sub_46E000
This is the largest single function in the front-end region at 93 KB. It is not a normal function body but a massive initialization sequence that calls sub_46BED0 exactly 1,141 times -- once per legal PTX instruction variant. Each call registers an opcode name together with its accepted type combinations using compact encoding strings.
Operand Encoding Strings
Each instruction variant is registered with a string that encodes its operand signature. The encoding uses single-character codes for operand categories:
| Code | Meaning |
|---|---|
F | Float operand (.f16, .f32, .f64) |
H | Half-precision (.f16, .f16x2) |
I | Integer operand (.s8--.s64, .u8--.u64) |
B | Bitwise operand (.b8--.b128) |
N | Immediate / numeric literal |
P | Predicate operand |
String references found in the function include composite type signatures:
"F32F32"-- binary float32 operation"F16F16F16F16"-- quad half-precision"I32I8I8I32"-- integer MMA (int32 accumulator, int8 operands)"F64F64F64F64"-- quad float64 (double-precision MMA)"_mma.warpgroup"-- warp-group MMA marker
Hash Tables
The instruction table builder populates two hash tables at offsets +2472 and +2480 within the lexer state object (the 2,528-byte struct passed as the first argument to sub_46E000). These hash tables provide O(1) lookup from opcode name to the registered type combination list.
Registration Function -- sub_46BED0
Called 1,141 times from sub_46E000. Each call takes an opcode name string and an operand encoding string, creates a descriptor node, and inserts it into the hash table. The descriptor captures the opcode, its legal operand types, and the semantic validation function to call during parsing.
Instruction Lookup -- sub_46C690 and sub_46C6E0
At parse time, when the parser reduces an instruction production, it calls sub_46C690 to look up the instruction name in the hash table built by sub_46E000. The lookup returns a descriptor list, and sub_46C6E0 (6.4 KB, the descriptor matcher) walks the list to find the variant matching the actual operands present in the source.
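The registration/lookup scheme can be sketched as a chained hash table keyed by opcode name, with the matcher walking the per-name chain comparing operand signatures. Everything below -- bucket count, hash function, struct layout -- is invented for illustration; only the overall shape (register per variant, hash by name, walk chain to match) is attested.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64

typedef struct Desc {
    const char *opcode;     /* e.g. "add"    */
    const char *signature;  /* e.g. "F32F32" */
    struct Desc *next;
} Desc;

static Desc *buckets[NBUCKETS];

static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Per-variant registration, cf. sub_46BED0 (called 1,141 times). */
static void register_variant(Desc *d) {
    unsigned h = hash_name(d->opcode);
    d->next = buckets[h];
    buckets[h] = d;
}

/* Lookup + chain walk, cf. sub_46C690 / sub_46C6E0. */
static const Desc *lookup_variant(const char *opcode, const char *sig) {
    for (const Desc *d = buckets[hash_name(opcode)]; d; d = d->next)
        if (!strcmp(d->opcode, opcode) && !strcmp(d->signature, sig))
            return d;
    return NULL;            /* no variant accepts these operands */
}
```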
Operand Classification -- 12 Categories
The descriptor matcher (sub_46C6E0) classifies each operand into one of 12 categories based on its syntactic form, then matches the category sequence against the registered encoding strings. The 12 categories cover:
- General-purpose register (R)
- Predicate register (P)
- Uniform register (UR)
- Uniform predicate (UP)
- Integer immediate
- Float immediate
- Address expression (register + offset)
- Label / symbol reference
- Special register
- Vector operand
- Texture / surface / sampler reference
- Bitfield / compound modifier
The classification examines token attributes set by the lexer -- register type bits at (field >> 28) & 7, immediate flag (0x1000000), uniform flag (0x6000000), and operand descriptor fields at instruction offset 84+.
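The attribute-word decoding can be written out directly. The shift and mask constants below are the recovered values; the accessor names are invented for illustration.

```c
#include <assert.h>

#define IMM_FLAG     0x1000000u   /* immediate operand flag      */
#define UNIFORM_MASK 0x6000000u   /* uniform register/pred flags */

/* Register type lives in bits 28..30 of the attribute word. */
static unsigned reg_type_bits(unsigned field) { return (field >> 28) & 7u; }
static int is_immediate(unsigned field) { return (field & IMM_FLAG) != 0; }
static int is_uniform(unsigned field)   { return (field & UNIFORM_MASK) != 0; }
```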
Parser State Object (1,128 bytes)
The parser passes a state object through all phases. This 1,128-byte structure (sub_424070(pool, 1128)) carries compilation context and pointers to sub-systems. It is indexed as _QWORD* (8-byte slots), so QWORD index [N] = byte offset N*8. The highest accessed byte is +1120 (index [140]), fitting exactly within the 1,128-byte allocation.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | pool_context | Pool allocator handle (from sub_4258D0) |
| +8 | 8 | compilation_unit | Pointer to compilation unit (parameter a2) |
| +16 | 8 | macro_symbol_table | Hash table for macros (sub_425CA0, 64 buckets) |
| +24 | 8 | module_ptr | Pointer to module object (parameter a3) |
| +32 | 8 | container_a | Sorted set container (8,192 buckets) |
| +56 | 8 | scope_chain[0] | Scope chain entry (sub_44F7C0), used for symbol resolution |
| +64 | 8 | scope_chain[1] | Second scope chain entry |
| +72 | 8 | scope_chain[2] | Third scope chain entry |
| +80 | 8 | type_map | Type descriptor hash map (sub_42D150, 8 buckets) |
| +96 | 8 | symbol_tables[0..5] | Six hash tables for symbol lookup (at +96, +104, +112, +120, +128, +136) |
| +152 | 8 | current_function | Pointer to current function being parsed |
| +160 | 4 | ptx_major_version | PTX ISA major version (set by Bison reduction) |
| +164 | 4 | ptx_minor_version | PTX ISA minor version |
| +168 | 4 | sm_version_check | SM target version for feature gating |
| +177 | 1 | flag_a | Initialization flag |
| +192 | 2 | word_96 | Zero-initialized word at WORD index 96 |
| +196 | 4 | address_size | 32 or 64 (address width) |
| +208 | 8 | hash_ref_a | Hash table reference (64-bucket) |
| +236 | 1 | default_flag | Initialized to 1 |
| +264 | 16 | list_a | Linked list (head at +264, tail ptr at +272 points to head) |
| +280 | 8 | sorted_set_b | Sorted set (8,192 buckets) |
| +288 | 8 | sorted_set_c | Sorted set (1,024 buckets) |
| +296 | 16 | sorted_maps[0..1] | Two sorted maps (sub_42A300) |
| +320 | 8 | hash_e | Hash table (1,024 buckets) |
| +328 | 16 | list_b | Linked list (head/tail pair) |
| +344 | 16 | list_c | Linked list (head/tail pair) |
| +360 | 256 | offset_table[16] | SSE-initialized offset table (16 entries of 16 bytes each, computed from base address + constants at xmmword_1CFDA00--1CFDA70) |
| +616 | 16 | list_d | Linked list (head/tail pair) |
| +632 | 16 | list_e | Linked list (head/tail pair); low bits of first word used as address_space_flags |
| +648 | 8 | local_symbol_table | Per-scope local symbol table pointer |
| +824 | 8 | symbol_lookup_ref | Hash table for symbol name lookup |
| +832 | 1 | dwarf_section_flag | Nonzero when inside .section DWARF data |
| +834 | 1 | directive_flag_a | Checked as pair with +835 |
| +836 | 1 | directive_flag_b | Set to 1 by multiple Bison reductions |
| +840 | 8 | builtin_filename | Interned string "<builtin>" |
| +848 | 8 | empty_string | Interned empty string "" |
| +856 | 4 | sm_arch_number | SM architecture number (parameter a6, e.g. 90 for sm_90) |
| +860 | 1 | feature_a | Feature flags set during parsing |
| +861 | 1 | feature_b | |
| +862 | 1 | feature_c | |
| +864 | 1 | feature_d | |
| +865 | 1 | feature_e | ORed with 1 by Bison reductions |
| +869 | 1 | flag_h | Initialized to 0 |
| +960 | 4 | sm_target_code | SM target code used in sub_454E70 checks |
| +968 | 8 | insn_stream_a | Instruction stream pointer A (set in Bison) |
| +976 | 8 | insn_stream_b | Instruction stream pointer B |
| +984 | 8 | insn_stream_c | Instruction stream pointer C |
| +1000 | 1 | insn_state_flag | Instruction state flag (= 0) |
| +1008 | 8 | string_pool | String pool pointer |
| +1016 | 8 | context_ref | Compilation context reference (parameter a4) |
| +1048 | 4 | dword_262 | Zero-initialized |
| +1053 | 1 | parsing_active | Toggled 1/0 during active parsing |
| +1080 | 16 | list_f | Linked list (head/tail pair) |
| +1096 | 8 | lexer_state_ptr | Pointer to 2,528-byte lexer state object (see below) |
| +1104 | 16 | list_g | Linked list (head/tail pair) |
| +1120 | 1 | param_flag | From parameter a10 |
Lexer State Object (2,528 bytes)
The lexer state is a separate heap-allocated object (sub_424070(pool, 2528)) pointed to by parser_state+1096. It is the primary state carrier for the Flex DFA scanner and the instruction table subsystem. All functions that need scanner state (the Bison parser, the Flex scanner, the include handler, and the instruction table builder) access this object through the pointer at +1096.
| Offset | Size | Field | Description |
|---|---|---|---|
| +48 | 4 | line_number | Current source line (incremented on newline) |
| +52 | 4 | column_number | Current source column |
| +64 | 8 | buffer_limit | Pointer to end of current scan buffer |
| +76 | 4 | start_condition | Flex DFA start condition (*(state+76), indexes off_203C020) |
| +152 | 1 | flag_a | Scanner state flag |
| +156 | 8 | sentinel_a | Initialized to -1 (0xFFFFFFFFFFFFFFFF) |
| +164 | 8 | sentinel_b | Initialized to -1 |
| +172 | 4 | address_size_proxy | Written by Bison via sub_4563E0; -1 on init |
| +180 | 8 | zero_pair | Zero-initialized |
| +188 | 8 | sentinel_c | Initialized to 0xFFFFFFFF00000000 |
| +196 | 8 | sentinel_d | Initialized to -1 |
| +204 | 4 | sentinel_e | DWORD[51], initialized to -1 |
| +208 | 2 | word_104 | WORD[104], zero-initialized |
| +540 | 1 | flag_b | Scanner flag |
| +541 | 1 | include_active | Checked by Flex (lexer+541) and Bison to gate .INCLUDE behavior |
| +784 | 8 | current_filename | Pointer to current filename string (set during include handling) |
| +1984 | 128 | version_array[32] | DWORD array of version fields; written by sub_70FDD0(lexer, index, value) as *(lexer + 4*index + 1984) = value |
| +2104 | 4 | ptx_major_ver | version_array[30] = PTX major version (initialized to 9) |
| +2108 | 4 | ptx_minor_ver | version_array[31] = PTX minor version (initialized to 0) |
| +2128 | 8 | include_stack_a | Include nesting pointer 1 (linked list for file stack) |
| +2136 | 8 | include_stack_b | Include nesting pointer 2 |
| +2160 | 8 | include_stack_head | Head of include stack (walked by sub_71C310) |
| +2168 | 8 | include_stack_file | Include stack filename pointer |
| +2441 | 1 | pushback_char | Character pushed back into input stream by scanner |
| +2464 | 2 | word_1232 | Zero-initialized |
| +2466 | 1 | flag_c | Flag |
| +2472 | 8 | opcode_hash_a | Opcode lookup hash table (populated by sub_46E000) |
| +2480 | 8 | opcode_hash_b | Second opcode lookup hash table (populated by sub_46E000) |
| +2488 | 8 | context_sub_ref | Compilation context sub-reference (parameter a9); accessed by Bison for sub_457CB0/sub_70A5B0 calls |
| +2496 | 1 | flag_d | Flag |
| +2504 | 24 | tail_fields | Three zero-initialized QWORD slots (indices [313],[314],[315]) |
Version checks use sub_485520(ctx, sm_number) (SM architecture >= N) and sub_485570(ctx, major, minor) (PTX version >= major.minor). For example, the address-space attribute setter (sub_4035D3) checks sm_90 and PTX 7.8:
if (!sub_485520(ctx, 90))
sub_42FBA0(&err, loc, "sm_90", ...); // Error: requires sm_90
if (!sub_485570(ctx, 7, 8))
sub_42FBA0(&err, loc, "7.8", ...); // Error: requires PTX 7.8
*(byte*)(v15 + 632) = (old & 0xFC) | (a2 & 3); // Set address space bits
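The semantics of the two gate functions can be sketched directly. The ctx layout and function names below are assumptions; only the comparison logic (SM number at least N; PTX version at least major.minor, compared lexicographically) is attested.

```c
#include <assert.h>

typedef struct { int sm_arch; int ptx_major, ptx_minor; } GateCtx;

/* sub_485520: target SM architecture is at least `sm`. */
static int arch_at_least(const GateCtx *c, int sm) {
    return c->sm_arch >= sm;
}

/* sub_485570: PTX ISA version is at least major.minor.
   Lexicographic: 8.0 passes a 7.8 gate, 7.7 does not. */
static int ptx_at_least(const GateCtx *c, int major, int minor) {
    return c->ptx_major > major ||
           (c->ptx_major == major && c->ptx_minor >= minor);
}
```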
Semantic Validators
The parser's reduction actions dispatch to specialized validator functions for each instruction category. These functions live in 0x460000--0x4D5000 and check SM architecture requirements, type compatibility, operand constraints, and instruction-specific invariants.
| Address | Size | Identity | Coverage |
|---|---|---|---|
| sub_4B2F20 | 52.6 KB | General instruction validator | Textures, surfaces, loads, stores, cvt, calls |
| sub_4CE6B0 tail | 48 KB | Directive/declaration validator | .local_maxnreg, .alias, .unified, .pragma, .noreturn |
| sub_4C5FB0 | 28.5 KB | Operand validator | State spaces, rounding, barriers, cache levels |
| sub_4C2FD0 | 12.2 KB | WMMA/MMA validator | Matrix dimensions, FP8 types, layout specifiers |
| sub_49BBA0 | 11.4 KB | MMA scale/block validator | .scale_vec_size, .block_scale, sparse GMMA |
| sub_4ABFD0 | 11.1 KB | Async copy validator | cp.async, bulk copy, cvt.tf32.f32.rna |
| sub_4A73C0 | 10.9 KB | Tensormap validator | .tile, field ranges, .tensormap::generic |
| sub_4BFED0 | 10.3 KB | WMMA shape/type validator | .m%dn%dk%d shapes, .aligned modifier |
| sub_4AF9F0 | 5.8 KB | CVT validator | cvt.f16x2.f32, type combinations, rounding |
| sub_4AEB60 | 3.7 KB | LDSM validator | _ldsm.s8.s4/_ldsm.u8.u4 format conversion |
| sub_4B1630 | 4.6 KB | Function address validator | cudaDeviceSynchronize, kernel/device addresses |
| sub_498AF0 | 3.9 KB | MMA layout validator | Row/col layout, floating-point type constraints |
| sub_497C00 | 3.0 KB | Prototype validator | .FORCE_INLINE, .noreturn, .unique, register counts |
| sub_496690 | 3.6 KB | Scope/barrier validator | Scope modifiers, barrier constraints |
| sub_494210 | 2.3 KB | Sparse GMMA validator | Sparse GMMA with specific types |
| sub_492C80 | 4.0 KB | Cache eviction validator | L2 eviction priority, .v8.b32/.v4.b64 |
| sub_49A5A0 | 3.5 KB | Special register validator | %laneid, %clock64, %lanemask_*, arch gating |
| sub_4A0CD0 | 4.9 KB | Variable declaration validator | .texref, .managed, .reserved, .common |
| sub_4A02A0 | 2.6 KB | Initializer validator | generic() operator, function addresses |
| sub_4036D9 | 437 B | Parameter list validator | Count, types, alignment, state space |
Validators follow a uniform pattern: they receive the parser context and instruction data, check constraints against the current SM architecture and PTX version, and call sub_42FBA0 with descriptive error messages when violations are found. The general validator (sub_4B2F20, 52.6 KB) is the second-largest function in the front-end and covers the broadest range of PTX instructions.
ROT13 Opcode Name Obfuscation
PTX opcode names stored in the binary are ROT13-encoded as an obfuscation measure. The static constructor ctor_003 at 0x4095D0 (17 KB, ~1,700 lines) decodes and populates the opcode name table at 0x29FE300 during program startup. Each entry is a (string_ptr, length) pair. Decoded examples:
| ROT13 | Decoded | PTX instruction |
|---|---|---|
NPDOHYX | ACQBULK | acqbulk |
OFLAP | BSYNC | bsync |
PPGY.P | CCTL.C | cctl.c |
SZN | FMA | fma |
FRGC | SETP | setp |
ERGHEA | RETURN | return |
RKVG | EXIT | exit |
The table covers the entire PTX ISA vocabulary -- hundreds of opcodes. A separate ROT13 table in ctor_005 (0x40D860, 80 KB) encodes 2,000+ internal Mercury/OCG tuning knob names (see Knobs System).
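The decoding step the constructor performs per entry is plain ROT13 over ASCII letters, with punctuation (such as the `.` in `PPGY.P`) passing through untouched. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <string.h>

/* ROT13 a single character; non-letters pass through unchanged. */
static char rot13_char(char c) {
    if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
    if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
    return c;
}

/* Decode one table entry (what ctor_003 must do per string). */
static void rot13_decode(const char *src, char *dst) {
    size_t i;
    for (i = 0; src[i]; i++) dst[i] = rot13_char(src[i]);
    dst[i] = '\0';
}
```

Because ROT13 is its own inverse, the same routine works for encoding the table in the first place.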
Compilation Pipeline Integration
The parser is invoked from the top-level compilation driver sub_446240 (11 KB), which orchestrates the full pipeline:
Parse → CompileUnitSetup → DAGgen → OCG → ELF → DebugInfo
The driver reports timing for each phase:
"Parse-time : %.3f ms (%.2f%%)""CompileUnitSetup-time : %.3f ms (%.2f%%)""DAGgen-time : %.3f ms (%.2f%%)""OCG-time : %.3f ms (%.2f%%)""ELF-time : %.3f ms (%.2f%%)""DebugInfo-time : %.3f ms (%.2f%%)"
The parse phase encompasses the Flex scanner, macro preprocessor, Bison parser, instruction table lookup, and all semantic validation. Since the parser directly builds IR, the output of the parse phase is a populated instruction stream ready for the DAG generation phase.
PTX Text Generation (Reverse Direction)
The inverse of parsing -- converting IR back to PTX text -- lives in 0x4DA340--0x5A8E40 (580 formatter functions). Each handles one PTX opcode. A dispatcher at sub_5D4190 (12.9 KB) routes by opcode name using 81 direct string comparisons plus a 473-entry hash switch. Every formatter follows an identical allocation pattern:
pool = sub_4280C0(ctx)[3]; // Get allocator pool
buf = sub_424070(pool, 50000); // 50KB temp buffer
// ... sprintf() operands into buf ...
len = strlen(buf);
result = sub_424070(pool, len + 1); // Exact-size allocation
strcpy(result, buf);
sub_4248B0(buf); // Free temp buffer
return result;
A monolithic format string table (~1.8 MB), reached through the formatters' a2 parameter, contains pre-assembled PTX text templates with %s/%llu/%d placeholders. This trades memory for speed: instead of building instruction text dynamically, ptxas simply fills in operand names at runtime.
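The oversized-temp-then-exact-copy pattern above can be made runnable as follows, with malloc/free standing in for the sub_424070/sub_4248B0 pool calls and an invented function name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format into a deliberately oversized scratch buffer, then copy the
   result into an exact-size allocation -- the recovered formatter
   pattern, with libc allocation standing in for the pool. */
static char *format_insn(const char *templ, const char *dst,
                         const char *src) {
    char *buf = malloc(50000);               /* 50KB temp buffer   */
    snprintf(buf, 50000, templ, dst, src);   /* fill placeholders  */
    char *result = malloc(strlen(buf) + 1);  /* exact-size alloc   */
    strcpy(result, buf);
    free(buf);                               /* release scratch    */
    return result;
}
```

The oversized scratch avoids measuring the formatted length up front; the exact-size copy keeps long-lived allocations tight.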
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_720F00 | 15.8 KB | ptxlex -- Flex DFA scanner main | 98% |
| sub_4CE6B0 | 48 KB | ptxparse -- Bison LALR(1) parser | HIGH |
| sub_46E000 | 93 KB | Instruction table builder (1,141 opcode registrations) | HIGH |
| sub_46BED0 | -- | Per-opcode registration function (called 1,141x) | HIGH |
| sub_46C690 | -- | Instruction lookup entry | HIGH |
| sub_46C6E0 | 6.4 KB | Descriptor matcher (12-category operand classifier) | HIGH |
| sub_451730 | 14 KB | Parser initialization (allocs 1,128B parser state + 2,528B lexer state) | HIGH |
| sub_70FDD0 | 14 B | Lexer version array writer: *(a1 + 4*a2 + 1984) = a3 | HIGH |
| sub_71F630 | 14 KB | Preprocessor directive dispatcher | 93% |
| sub_71E2B0 | 32 KB | Conditional handler (.ELSE/.ELIF/.ENDIF) | 92% |
| sub_71DCA0 | 8.4 KB | Macro definition handler (.MACRO) | 90% |
| sub_71C910 | 13 KB | Directive scanner | 91% |
| sub_71C310 | 8.3 KB | Include handler (.INCLUDE) | 90% |
| sub_71D1B0 | 6.8 KB | Macro argument scanner | 89% |
| sub_71D710 | 7.5 KB | Macro body scanner | 89% |
| sub_71BA10 | 2.3 KB | Macro character peek | 88% |
| sub_71BB80 | 2.6 KB | Macro buffer reader | 88% |
| sub_71BE20 | 1.1 KB | Macro expansion entry | 85% |
| sub_71BF60 | 1.8 KB | Macro fatal abort | 90% |
| sub_71C140 | 2.5 KB | Macro format error | 88% |
| sub_720190 | 2.0 KB | ptxensure_buffer_stack | 95% |
| sub_7202E0 | 1.3 KB | ptx_create_buffer | 96% |
| sub_720410 | 3.3 KB | yy_get_next_buffer | 95% |
| sub_720630 | 9.7 KB | yy_get_previous_state (SSE2 optimized) | 94% |
| sub_720BA0 | 4.3 KB | ptx_scan_string | 93% |
| sub_724CC0 | 4.9 KB | ptx_scan_bytes / macro nesting check | 91% |
| sub_725070 | 2.7 KB | ptx_scan_buffer | 93% |
| sub_42FBA0 | 2.4 KB | Central diagnostic emitter (2,350 callers) | HIGH |
| sub_4280C0 | 597 B | Thread-local context accessor (3,928 callers) | HIGH |
| sub_424070 | 2.1 KB | Pool allocator (3,809 callers) | HIGH |
| sub_4248B0 | 923 B | Pool deallocator (1,215 callers) | HIGH |
| sub_42BDB0 | 14 B | Fatal OOM handler (3,825 callers) | HIGH |
| sub_446240 | 11 KB | Top-level compilation driver | HIGH |
| sub_4095D0 | 17 KB | ROT13 opcode name table initializer | HIGH |
| sub_5D4190 | 12.9 KB | PTX text format dispatcher | HIGH |
| sub_4B2F20 | 52.6 KB | General instruction validator | HIGH |
| sub_4C5FB0 | 28.5 KB | Instruction operand validator | HIGH |
| sub_4C2FD0 | 12.2 KB | WMMA/MMA validator | HIGH |
| sub_485520 | -- | SM architecture check (sm >= N) | HIGH |
| sub_485570 | -- | PTX version check (version >= M.N) | HIGH |
Cross-References
- Pipeline Overview -- where the parser fits in the compilation flow
- PTX Directive Handling -- detailed directive processing after parsing
- PTX-to-Ori Lowering -- what happens to the IR the parser builds
- Knobs System -- ROT13-encoded knob names from ctor_005
- Memory Pool Allocator -- sub_424070 / sub_4248B0 pool system
- Hash Tables & Bitvectors -- sub_426150 / sub_426D60 hash map
- PTX Instruction Table -- full opcode catalog
- CLI Options -- sub_432A00 / sub_434320 option handling
PTX Directive Handling
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
PTX directives -- .version, .target, .entry, .func, .global, .shared, .local, .const, .reg, .param, .weak, .common, .extern, .visible, .alias, .pragma -- are parsed and semantically validated by the Bison reduction actions embedded in the 48 KB parser function sub_4CE6B0. Unlike instructions, which pass through opcode table lookup (sub_46E000) and per-instruction semantic validators, directives are handled entirely within the Bison reduction switch: each grammar production's action block reads values from the parser value stack, validates them against the current PTX version and target architecture, and writes the results into the 1,128-byte parser state object or its child compile-unit state (CU_state). No intermediate AST is constructed; directives take effect immediately during parsing.
The state object maintains 18 linked lists (9 head/tail pairs at offsets 368--512) that track symbols per state space, a string-keyed hash map (offset 208) for target feature flags, and a scope chain (offset 984) rooted at offset 968 for nested function declarations. Two version-gating functions -- sub_489050 (PTX ISA version) and sub_489390 (SM architecture) -- guard every directive that was introduced after the baseline ISA.
| Bison parser | sub_4CE6B0 (48,263 bytes, 631 case labels) |
| Version validator | sub_44A100 (bsearch over 44 valid PTX version IDs at xmmword_1CFD940) |
| PTX version gate | sub_489050 -- sub_454E70 + sub_455A80(major, minor, state) |
| SM arch gate | sub_489390 -- checks state+168 >= required_sm |
| Target handler | sub_4B1080 (per-target, texmode logic) |
| Function handler | sub_497C00 (entry/func declarations, ABI) |
| Variable handler | sub_4A0CD0 (state-space declarations, type validation) |
| Parameter allocator | sub_44F6E0 (48-byte parameter nodes) |
| Scope manager | sub_44B9C0 (scope hash map at state+1008) |
| State-space lists | 18 linked lists at state+368--state+512 |
| Target feature map | Hash map at state+208 (string keys, presence values) |
Architecture
PTX source text
|
v
+-------------------------------------------------------------------+
| BISON LALR(1) PARSER sub_4CE6B0 |
| 631 reduction cases, each a direct action |
| |
| DIRECTIVE CASES HANDLER |
| .version 35 sscanf + sub_44A100 |
| .target 5, 38 sub_4B1080 (per-target) |
| .address_size 10 inline validation |
| .entry 82, 86-88 sub_497C00 |
| .func 97, 100-105 sub_497C00 |
| .global/shared 57-68 sub_4A0CD0 |
| /local/const |
| .reg/.param 110-112 inline + sub_48BE80 |
| .weak 55 sub_489050(3,1) |
| .common 56 sub_489050(5,0) |
| .extern 79 sets CU+81, linkage=3 |
| .visible 80 sets CU+81, linkage=2 |
| .alias 41 sub_4036D9 (param match) |
| .pragma 42 prefix-match dispatch chain |
| |
+-------------------+-----------------------------------------------+
|
+---------+---------+
v v
PARSER STATE OBJECT CU_STATE (compile-unit)
~1200 bytes pointed to by state+1096
state+144: version CU+0: linkage code
state+152: target CU+24: state-space ID
state+160: ptx_major CU+48: func metadata buf
state+164: ptx_minor CU+80: return type
state+168: sm_id CU+81: declaration linkage
state+196: addr_size CU+88: current function
state+208: feature map CU+156: noinline pragma
state+368: 18 ll heads CU+172: reg-usage pragma
state+968: scope root CU+784: arch capability
state+984: scope chain CU+2448: target string
state+1008: scope map CU+2456: version string
.version X.Y -- Case 35
The .version directive establishes the PTX ISA version for the compilation unit. The parser extracts the major and minor version integers from the grammar, validates the combined version against a sorted table of 44 known versions, and stores both the numeric and string forms.
// Reconstructed from case 35 of sub_4CE6B0
int major = sub_449950(); // extract major from parser state
int minor = sub_449960(); // extract minor from parser state
sscanf(token, "%d.%d", &major, &minor);
// Allocate formatted version string
char* ver_str = pool_alloc(pool, 5);
sprintf(ver_str, "%d.%d", major, minor);
// Validate: bsearch over 44 valid version IDs
int combined = major * 10 + minor;
if (!sub_44A100(combined))
fatal_error("Unsupported PTX version %s", ver_str);
pool_free(ver_str);
// Store in parser state
state->version_string = token; // state+144
state->ptx_major = major; // state+160
state->ptx_minor = minor; // state+164
CU_state->version_string = token; // CU+2456
Version Validation -- sub_44A100
// sub_44A100: validate PTX version against known versions
bool sub_44A100(int version_id) {
int key = version_id;
return bsearch(&key,
xmmword_1CFD940, // sorted table base
0x2C, // 44 entries
4, // sizeof(int)
compar) != NULL; // simple integer compare
}
The 44-entry table at xmmword_1CFD940 contains the combined version IDs (major*10 + minor) for every PTX ISA version recognized by ptxas v13.0. This covers PTX 1.0 through 8.7+.
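The membership test is a plain `bsearch` over sorted integers and can be reproduced directly. The table below is an illustrative subset only, not the recovered contents of xmmword_1CFD940:

```c
#include <stdlib.h>

/* Illustrative subset of the version table; combined IDs are
 * major*10 + minor, kept sorted so bsearch works. */
static const int known_versions[] = {
    10, 11, 12, 13, 14, 15,          /* PTX 1.x */
    20, 21, 22, 23,                  /* PTX 2.x */
    30, 31, 32,                      /* PTX 3.x */
    80, 81, 82, 83, 84, 85, 86, 87,  /* PTX 8.x */
};

static int compar(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Mirrors sub_44A100: membership test via bsearch. */
int is_valid_ptx_version(int major, int minor) {
    int key = major * 10 + minor;
    return bsearch(&key, known_versions,
                   sizeof known_versions / sizeof known_versions[0],
                   sizeof(int), compar) != NULL;
}
```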
.target sm_XX -- Cases 5 and 38
The .target directive accepts a comma-separated list of targets: SM architecture identifiers (sm_XX, compute_XX) and feature modifiers (texmode_unified, texmode_independent, texmode_raw, map_f64_to_f32, debug).
Case 38 -- Target List Iteration
// Reconstructed from case 38
for (node = list_begin(*v5); !list_end(node); node = list_next(node)) {
char* target_str = list_value(node);
sub_4B1080(target_str, location, state);
}
Per-Target Handler -- sub_4B1080
The function branches on whether the target string contains "sm_" or "compute_".
SM/compute targets:
// SM target path in sub_4B1080
state->target_string = target_str; // state+152
CU->target_string = target_str; // CU+2448
state->arch_variant = sub_1CBEFD0(target_str); // state+177
int sm_id;
sscanf(target_str + prefix_len, "%d", &sm_id);
state->target_id = sm_id; // state+168
if (sm_id > state->max_target)
state->max_target = sm_id; // state+204
// Validate against one of three target tables:
// compute_ targets: unk_1D16160 (6 entries, 12 bytes each)
// sm_ sub-variant: unk_1D161C0 (7 entries, 12 bytes each)
// standard sm_: unk_1D16220 (32 entries, 12 bytes each)
// Each entry: { sm_id, required_ptx_major, required_ptx_minor }
entry = bsearch(&sm_id, table, count, 12, sub_484B70);
if (entry) {
if (!sub_455A80(entry->ptx_major, entry->ptx_minor, state))
state->version_mismatch_flag |= 1; // state+178
}
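The 12-byte-entry lookup can be sketched in C; the entry values below are illustrative placeholders, not the recovered contents of unk_1D16220:

```c
#include <stdlib.h>

/* Each table entry is 12 bytes: { sm_id, required_ptx_major,
 * required_ptx_minor }. Values are placeholders for the sketch. */
typedef struct {
    int sm_id;
    int ptx_major;
    int ptx_minor;
} sm_entry;

static const sm_entry sm_table[] = {
    { 75, 6, 4 },   /* sm_75 (Turing) -- illustrative requirement */
    { 80, 7, 0 },   /* sm_80 (Ampere) -- illustrative requirement */
    { 90, 8, 0 },   /* sm_90 (Hopper) -- illustrative requirement */
};

static int cmp_sm(const void *key, const void *elem) {
    return *(const int *)key - ((const sm_entry *)elem)->sm_id;
}

/* Mirrors the bsearch in sub_4B1080: map an sm_id to the minimum
 * PTX version that may target it; NULL for unknown targets. */
const sm_entry *lookup_sm(int sm_id) {
    return bsearch(&sm_id, sm_table,
                   sizeof sm_table / sizeof sm_table[0],
                   sizeof(sm_entry), cmp_sm);
}
```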
Feature modifiers:
| Modifier | PTX Requirement | Action | CU State |
|---|---|---|---|
map_f64_to_f32 | Deprecated for sm > 12 | Stored in feature map; CU+152 |= 1 | Feature flag |
texmode_unified | -- | Stored in feature map; default if none specified | Default |
texmode_independent | PTX >= 1.5 | Stored in feature map; CU+2464 = 1 | Tex mode |
texmode_raw | Requires state+220 flag | Stored in feature map; CU+2465 = 1 | Tex mode |
debug | PTX >= 3.0 | CU+2466 = 1; state+1033 = 1; state+834 = 1 | Debug on |
Texmode values are mutually exclusive. Each setter checks the feature hash map at state+208 for conflicting entries before inserting:
// texmode_unified path in sub_4B1080
if (map_get(state->feature_map, "texmode_independent"))
error("conflicting texmode: %s", target_str);
if (map_get(state->feature_map, "texmode_raw"))
error("conflicting texmode: %s", target_str);
map_put(state->feature_map, "texmode_unified", 1);
Case 5 -- Automatic Texmode Inference
When the .target directive omits an explicit texmode, case 5 infers one based on CLI flags:
if (arch_supports_texmode(CU->arch_capability)) {
if (!map_has(feature_map, "texmode_independent") &&
!map_has(feature_map, "texmode_raw")) {
if (state->cli_texmode_independent)
sub_4B1080("texmode_independent", loc, state);
else if (state->cli_texmode_raw)
sub_4B1080("texmode_raw", loc, state);
else
sub_4B1080("texmode_unified", loc, state);
}
}
.address_size 32|64 -- Case 10
// Reconstructed from case 10
sub_489050(state, 2, 3, ".address_size directive", location); // PTX >= 2.3
int value = stack_value;
if (((value - 32) & ~0x20) != 0) // allows exactly 32 and 64
error("Invalid address size: %d", value);
state->address_size = value; // state+196
The bit trick (v - 32) & ~0x20 passes for exactly two values:
- v=32: (32 - 32) & 0xFFFFFFDF = 0
- v=64: (64 - 32) & 0xFFFFFFDF = 0
Any other value produces a nonzero result and triggers an error.
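As a runnable check of the trick:

```c
/* The case-10 check: (v - 32) & ~0x20 is zero only for v == 32
 * (0 & ~0x20) and v == 64 (32 & ~0x20, where bit 5 is masked off). */
int is_valid_address_size(int v) {
    return ((v - 32) & ~0x20) == 0;
}
```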
.entry / .func Declarations -- Cases 76+, 82, 88, 97, 103
Function and entry declarations span multiple Bison productions because the grammar decomposes them into prototype, parameter list, linkage qualifier, and body productions. The central handler sub_497C00 processes both entry functions and device functions.
sub_497C00 -- Function Declaration Handler
// Reconstructed signature
int64 sub_497C00(
state, // parser state
int decl_type, // 1=visible, 2=forward, 3=extern, 4=static, 5=definition
name, // function name token
return_params, // return parameter list (NULL for entries)
params, // input parameter list
bool is_entry, // 1 for .entry, 0 for .func
bool is_func, // CU+80 qualifier for .func
scratch_regs, // scratch register list
int retaddr, // return address allocno (-1 if none)
bool noreturn, // .noreturn attribute
bool unique, // .unique attribute
bool force_inline, // .FORCE_INLINE attribute
location // source location token
);
Processing steps:
1. Scope creation: sub_44B9C0(state) creates a new scope context. The scope hash map at state+1008 maps scope IDs (starting at 61) to 40-byte scope descriptors.
2. Parameter node allocation: sub_44F6E0(state, scope, name, 0, 0, location) allocates a 48-byte parameter descriptor: {type_info, name, scope, alignment, init_data, location}.
3. Symbol lookup: sub_4504D0(state+968, name, 1, state) searches the current scope chain for an existing declaration.
4. Forward declaration resolution: if a matching forward declaration exists, the handler validates compatibility:
   - Declaration type consistency (except 2->1 and 4->1 promotions)
   - Parameter list type/alignment/state-space matching via sub_484DA0
   - Return parameter matching via sub_484DA0
   - Scratch register count and types
   - Return address register, first parameter register
   - .noreturn and .unique attribute consistency
   - Unified identifier matching
5. New function creation: if no prior declaration exists:
   - Registers in state+968 (regular scope) or state+976 (extern scope)
   - Calls sub_44FDC0 to record ABI metadata
   - For the Blackwell GB10B architecture (sub_70FA00(CU, 33)): allocates __nv_reservedSMEM_gb10b_war_var in shared memory as a hardware workaround
Case 82 -- Entry Function
// Case 82: .entry declaration
if (CU->output_param_context)
error("Parameter to entry function");
result = sub_497C00(state, decl_type, name,
NULL, // no return params for entries
params,
1, // is_entry = true
0, // is_func = false
NULL, // no scratch regs
-1, // no retaddr
0, 0, 0, // no .noreturn/.unique/.force_inline
location);
Case 88 -- Entry Function Body Completion
After the function body is parsed, case 88 performs the final validation pass:
1. Performance directive validation:
   - .maxntid and .reqntid are mutually exclusive
   - .maxnctapersm/.minnctapersm require either .maxntid or .reqntid
   - .reqntid + .reqnctapercluster require .blocksareclusters
   - .reqnctapercluster and .maxclusterrank are mutually exclusive
2. Kernel parameter size limits (computed via sub_42CBF0 + sub_484ED0):

   | PTX Version | Max Kernel Param Size |
   |---|---|
   | < 1.5 | 256 bytes |
   | >= 1.5, < 8.1 | 4,352 bytes |
   | >= 8.1 | 32,764 bytes |

   Parameters exceeding 4,352 bytes also require SM >= 70 and PTX >= 8.1.
3. Debug labels: generates __$startLabel$__<name> and __$endLabel$__<name> for DWARF debug info.
4. Debug hash: if debug mode is enabled (state+856 != 0), computes CRC32(name) % 0xFFFF + base as a debug identifier stored at func->80+176.
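Assuming the standard reflected CRC-32 polynomial (0xEDB88320) -- the exact variant used by ptxas is not confirmed from the binary -- the case-88 debug hash can be sketched as:

```c
#include <stdint.h>

/* Bitwise CRC-32 (reflected, poly 0xEDB88320) over a C string.
 * Whether ptxas uses this exact polynomial is an assumption. */
static uint32_t crc32_str(const char *s) {
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s; s++) {
        crc ^= (uint8_t)*s;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Debug identifier as described for case 88: CRC32(name) % 0xFFFF + base. */
uint32_t debug_hash(const char *name, uint32_t base) {
    return crc32_str(name) % 0xFFFFu + base;
}
```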
Case 97 -- Device Function
// Case 97: .func declaration
result = sub_497C00(state, decl_type, name,
return_params, params,
0, // is_entry = false
CU->return_qualifier, // CU+80
scratch_regs, retaddr,
noreturn, unique, force_inline,
location);
State-Space Declarations -- .global, .shared, .local, .const
State-space directives set the "current state space" field (CU+24) and then delegate to sub_4A0CD0 for variable declaration processing or sub_4A2020 for declaration-without-initializer processing.
State-Space Code Assignment
| Case | Action | State Space |
|---|---|---|
| 57 | *CU = 1 | (extern/unresolved) |
| 59 | *CU = 3 | .shared |
| 61 | *CU = 2 | .global |
| 63 | *CU = 4 | .local |
| 65 | *CU = 5 | .const |
| 67 | *CU = 0 | .reg |
| 58, 60, 62, 64, 66, 68 | sub_4A2020(...) | Process declaration in current space |
The odd-numbered cases set the state-space code; the immediately following even-numbered cases trigger the actual declaration processing.
Variable Validator -- sub_4A0CD0
This 4,937-byte function validates variable declarations across all state spaces. Key checks:
1. Type validation: resolves .texref via sub_450D00. For types 9 (.surfref) and 10 (.texref), enforces .tex deprecation after PTX 1.5 and .surfref scope restrictions.
2. .b128 type: requires PTX >= 8.3 (sub_455A80(8, 3)) and SM >= 70 (sub_489390(state, 70)).
3. State-space restrictions:
   - .managed valid only with .global (space 5)
   - .reserved valid only with .shared (space 8); reserved shared alignment must be <= 64
   - .common valid only with .const
   - .param at file scope requires .const space
   - .local const disallowed at file scope
4. Texmode interaction:
   - .surfref types require texmode_independent in the feature map
   - .tex/.texref types are incompatible with texmode_raw
5. Initializer handling: if an initializer is present, calls sub_4A02A0 to validate constant expressions (no function pointers, no entry functions as values, no opaque type initializers).
State-Space Linked Lists -- 18 Lists at state+368
The parser maintains 18 linked list heads (9 head/tail pairs) at state offsets 368--512 to track declared symbols per state space:
| Offset Pair | Index | State Space |
|---|---|---|
| 368/376 | 0 | .global |
| 384/392 | 1 | .shared |
| 400/408 | 2 | .local |
| 416/424 | 3 | .const |
| 432/440 | 4 | .param |
| 448/456 | 5 | .tex |
| 464/472 | 6 | .surf |
| 480/488 | 7 | .sampler |
| 496/504 | 8 | reserved / other |
Initialization (case 3 -- section begin): Iterates j from 0 to 144 in steps of 8, allocating an 88-byte sentinel node (type=6) for each list. Each node's +48 field links to per-section tracking data at state+656 + j.
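A minimal C sketch of the case-3 initialization loop, with an oversized byte-array stand-in for the ~1,200-byte state object (struct shape and helper names are assumptions; the offsets match the text):

```c
#include <stdlib.h>

enum { NODE_SIZE = 88, SENTINEL_TYPE = 6 };

/* Stand-in for the parser state object: raw bytes, addressed by offset. */
typedef struct {
    char state[2048];
} parser_state;

void init_state_space_lists(parser_state *st) {
    for (int j = 0; j < 144; j += 8) {           /* 18 list slots * 8 bytes */
        char *node = calloc(1, NODE_SIZE);       /* 88-byte sentinel node   */
        *(int *)node = SENTINEL_TYPE;            /* node type = 6           */
        /* +48 links to per-section tracking data at state+656 + j */
        *(void **)(node + 48) = st->state + 656 + j;
        *(void **)(st->state + 368 + j) = node;  /* list head at state+368+j */
    }
}
```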
Scope teardown (case 76 -- new compilation unit): Destroys old symbol tables via sub_425D20, clears the target feature map, and merges scope-level lists into the parent scope by concatenating linked list chains for offsets 16, 48, 112, 128, 144, and 184 of the scope node.
.reg / .param -- Register and Parameter Declarations
Within function bodies, .reg and .param declarations create typed register/parameter entries. Three grammar productions handle the variants:
Declaration Node Layout (56 bytes)
| Offset | Type | Field |
|---|---|---|
| 0 | ptr | Type list pointer |
| 8 | ptr | Name pointer |
| 16 | int32 | State-space code |
| 20 | byte | Is array |
| 21 | byte | Is vector |
| 24 | int32 | Alignment |
| 28 | byte | Extra flags |
| 40 | int32 | Count / range start |
| 44 | int32 | Range end (0xFFFFFFFF = no upper bound) |
| 48 | ptr | Auxiliary data |
Case 110 -- Single declaration: Reads type info from CU_state (offsets +16, +24, +28, +29, +32, +36), allocates the 56-byte node, sets count from the parsed integer, and calls sub_48BE80(state) to validate.
Case 111 -- Range declaration: Same as 110 but sets both start and end bounds. The sentinel value 0xFFFFFFFF at offset 44 distinguishes range from single declarations.
Case 112 -- Vector declaration: Handles vector type qualifiers (.v2, .v4).
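A hypothetical C view of the 56-byte node, with explicit padding so the field offsets match the table above (field names are inferred, not recovered):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    void    *type_list;    /* +0  */
    char    *name;         /* +8  */
    int32_t  space;        /* +16 state-space code */
    uint8_t  is_array;     /* +20 */
    uint8_t  is_vector;    /* +21 (then 2 bytes of natural padding) */
    int32_t  alignment;    /* +24 */
    uint8_t  flags;        /* +28 */
    uint8_t  _pad[11];     /* +29..+39: unrecovered */
    int32_t  start;        /* +40: count / range start */
    int32_t  end;          /* +44: 0xFFFFFFFF = no upper bound */
    void    *aux;          /* +48 */
} decl_node;

/* The sentinel at +44 distinguishes range from single declarations. */
int is_range_decl(const decl_node *d) {
    return d->end != (int32_t)0xFFFFFFFF;
}
```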
Visibility / Linkage Directives
.weak -- Case 55
sub_489050(state, 3, 1, ".weak directive", location); // PTX >= 3.1
.common -- Case 56
sub_489050(state, 5, 0, ".common directive", location); // PTX >= 5.0
Linkage Qualifiers -- Cases 78--81
These set CU+81 (declaration linkage type) within function prototype production contexts:
| Case | Linkage | PTX Directive |
|---|---|---|
| 78 | 1 | (none -- default/internal) |
| 79 | 3 | .extern |
| 80 | 2 | .visible |
| 81 | 4 | .weak |
.alias -- Case 41
Symbol aliasing requires PTX >= 6.3 and SM >= 30:
// Reconstructed from case 41
sub_489050(state, 6, 3, ".alias", location); // PTX >= 6.3
sub_489390(state, 0x1E, ".alias", location); // SM >= 30
sym1 = sub_4504D0(state->scope_chain, name1, 1, state);
sym2 = sub_4504D0(state->scope_chain, name2, 1, state);
if (!sym1) error("undefined symbol: %s", name1);
if (!sym2) error("undefined symbol: %s", name2);
Validation:
- Both symbols must be function type (node type == 5)
- sym1 must not already have a body defined (sym1->80->88 == 0)
- Neither can be an entry function
- No self-aliasing (names must differ)
- Parameter lists must match (calls sub_4036D9 twice: once for return params, once for input params)
- .noreturn attribute must be consistent across both symbols
- Cannot alias to .extern or declaration-qualified functions

On success: sym1->80->64 = sym2 (sets the alias-target pointer).
.pragma -- Case 42
The .pragma directive requires PTX >= 2.0 and dispatches through a prefix-matching chain. Each pragma string is compared against known prefixes via sub_4279D0 (starts-with test):
// Reconstructed dispatch structure from case 42
for (node = list_begin(pragma_list); !list_end(node); node = list_next(node)) {
char* pragma_str = list_value(node);
sub_489050(state, 2, 0, ".pragma directive", location); // PTX >= 2.0
char* arch_str = sub_457CB0(CU->arch_descriptor, index);
if (starts_with(arch_str, pragma_str)) {
// matched known pragma
dispatch_to_handler(pragma_str, state);
}
}
Pragma Dispatch Chain
| Priority | Prefix Index | Pragma | Handler | Storage |
|---|---|---|---|---|
| 1 | sub_457CB0(arch, 1) | "noinline" | sub_456A50 + sub_48D8F0 | CU+156, CU+192 |
| 2 | sub_457CB0(arch, 3) | inline-related | Sets CU+160 = 1 | CU+160 |
| 3 | sub_457CB0(arch, 16) | register-usage | sub_4563E0 + sub_48C370 | CU+172 |
| 4 | sub_457CB0(arch, 5) | min threads | sub_4563E0 + sub_48C6F0 | CU+164 or CU+168 |
| 5 | sub_457CB0(arch, 9) | max constraint | sub_4567E0 + sub_403D2F | CU+176 |
| 6 | sub_457CB0(arch, 10) | min constraint | sub_4567E0 + sub_403D2F | CU+184 |
| 7 | sub_457CB0(arch, 18) | deprecated | Warning via dword_29FA6C0 | -- |
| 8 | sub_457CC0(arch, 1) | deprecated | Warning via dword_29FA6C0 | -- |
| 9--11 | sub_457C60/CA0/C70 | unsupported | Warning via dword_29FA7F0 | -- |
| 12 | sub_457D30/D50 | unsupported | Warning via dword_29FA7F0 | -- |
| 13 | sub_457CB0(arch, 22) | function-level | Appends to func or module pragma list | func->80->80 or state+272 |
Unmatched pragmas trigger an error via dword_29FA6C0.
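The dispatch shape -- a starts-with test tried against prefixes in priority order -- can be sketched in C. The prefix strings here are illustrative only, since the real ones come from per-architecture sub_457CB0 lookups:

```c
#include <string.h>

/* Mirrors sub_4279D0: does s begin with prefix? */
static int starts_with(const char *prefix, const char *s) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Illustrative pragma classes; the real chain has 13 priority slots. */
typedef enum { P_NOINLINE, P_REGUSAGE, P_UNKNOWN } pragma_kind;

pragma_kind classify_pragma(const char *pragma_str) {
    if (starts_with("noinline", pragma_str)) return P_NOINLINE;
    if (starts_with("regcount", pragma_str)) return P_REGUSAGE; /* placeholder prefix */
    return P_UNKNOWN;   /* unmatched -> error path (dword_29FA6C0) */
}
```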
Feature Version Gating
Two functions guard every directive against minimum PTX ISA version and SM architecture requirements. They are called hundreds of times throughout the Bison reduction actions.
sub_489050 -- PTX ISA Version Check
// sub_489050(state, required_major, required_minor, directive_name, location)
char sub_489050(state, int major, int minor, char* name, location) {
    if (sub_454E70(state->version_check_disabled)) // state+960
        return 1; // checks disabled
    if (state->lenient_mode) // state+832
        return 1; // lenient mode
    char ok = sub_455A80(major, minor, state);
    if (!ok) {
        char buf[152];
        sprintf(buf, "%d.%d", major, minor);
        sub_42FBA0(error_desc, location, name, buf); // emit version error
    }
    return ok;
}
sub_489390 -- SM Architecture Check
// sub_489390(state, required_sm, directive_name, location)
char sub_489390(state, uint required_sm, char* name, location) {
    if (sub_454E70(state->version_check_disabled)) // state+960
        return 1;
    char ok = state->target_string && state->target_id >= required_sm;
    if (!ok) {
        // state+152 == NULL or state+168 < required_sm
        char buf[48];
        sprintf(buf, "sm_%d", required_sm);
        sub_42FBA0(error_desc, location, name, buf); // emit arch error
    }
    return ok;
}
Version Requirements by Directive
| Directive | PTX ISA | SM Architecture |
|---|---|---|
.address_size | >= 2.3 | -- |
.weak | >= 3.1 | -- |
.common | >= 5.0 | -- |
.alias | >= 6.3 | >= 30 |
.branchtargets | >= 6.0 | >= 30 |
.calltargets | >= 2.1 | >= 20 |
.callprototype | >= 2.1 | >= 20 |
.pragma | >= 2.0 | -- |
texmode_independent | >= 1.5 | -- |
debug target | >= 3.0 | -- |
| kernel param list | >= 1.4 | -- |
| opaque types | >= 1.5 | -- |
.b128 type | >= 8.3 | >= 70 |
| kernel params > 4352B | >= 8.1 | >= 70 |
Parser State Object Layout
The parser state object (v1127 / a1 in sub_4CE6B0) is approximately 1,200 bytes. Key offsets for directive handling:
| Offset | Type | Field |
|---|---|---|
| 72 | ptr | Module-level output buffer |
| 88 | ptr | Current function link |
| 144 | char* | .version string (e.g., "8.5") |
| 152 | char* | .target string (e.g., "sm_90") |
| 160 | int32 | PTX major version |
| 164 | int32 | PTX minor version |
| 168 | int32 | SM architecture ID |
| 177 | byte | Architecture sub-variant flag |
| 178 | byte | Version mismatch flag |
| 196 | int32 | .address_size (32 or 64) |
| 204 | int32 | Maximum SM ID encountered |
| 208 | ptr | Target feature hash map |
| 219 | byte | CLI texmode_independent flag |
| 220 | byte | CLI texmode_raw flag |
| 272 | ptr | Module pragma list head |
| 368--512 | ptr[18] | State-space linked list heads |
| 656--800 | bytes | Per-section tracking data (144 bytes) |
| 832 | byte | Lenient mode flag |
| 834 | word | Debug mode flags |
| 856 | int32 | Debug hash base |
| 960 | int32 | Version check disable flag |
| 968 | ptr | Scope root (top-level symbol table) |
| 976 | ptr | Extern function scope |
| 984 | ptr | Current scope chain pointer |
| 1000 | byte | Function body active flag |
| 1008 | ptr | Scope hash map |
| 1033 | byte | Debug info enabled |
| 1096 | ptr | CU_state pointer |
Function Map
| Address | Size | Identity | Callers |
|---|---|---|---|
sub_44A100 | 39 B | PTX version bsearch validator | case 35 |
sub_44B9C0 | 171 B | Scope context creator | case 82, 97 via sub_497C00 |
sub_44F6E0 | 135 B | Parameter node allocator (48 B nodes) | sub_497C00 |
sub_489050 | 115 B | PTX ISA version gate | ~30 directive cases |
sub_489390 | 85 B | SM architecture version gate | ~15 directive cases |
sub_497C00 | 2,992 B | Function/entry declaration handler | cases 82, 97 |
sub_4A0CD0 | 4,937 B | Variable/symbol declaration validator | cases 58--68 |
sub_4A02A0 | 2,607 B | Initializer/constant expression validator | sub_4A0CD0 |
sub_4B1080 | ~700 B | Per-target handler (SM + texmode) | cases 5, 38 |
sub_4036D9 | 437 B | Parameter list compatibility check | case 41 (.alias) |
sub_4CE6B0 | 48,263 B | Bison parser (all directive cases) | compilation driver |
PTX-to-Ori Lowering
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The PTX-to-Ori lowering is the transition from parsed PTX assembly into the Ori internal representation -- the SASS-level, virtual-register IR that all subsequent optimization operates on. Unlike a traditional compiler where the parser builds an AST and a separate lowering pass consumes it, ptxas has no materialized AST: the Bison parser's reduction actions directly construct Ori IR nodes, basic blocks, and CFG edges inline. What the --compiler-stats timer calls "DAGgen-time" measures this inline construction phase. The result is a raw Ori IR that still uses PTX-derived opcodes and has unresolved architecture-dependent constructs. Fourteen "bridge phases" (pipeline indices 0--13) then transform this raw IR into the optimizer-ready form where every instruction carries its final SASS opcode, the CFG is fully annotated, and architecture-incompatible operations have been legalized.
The key architectural consequence of this design: there is no separate "lowering" function that you can point at and say "this converts PTX to Ori." The conversion is distributed across (1) the Bison parser's 443 reduction actions, (2) a 44 KB operand processing function, (3) the MercConverter instruction legalization pass, and (4) six additional bridge phases that handle FP16 promotion, control flow canonicalization, macro fusion, and recipe application.
| DAGgen timer | "DAGgen-time : %.3f ms (%.2f%%)\n" (inline Bison -> Ori construction) |
| Bison parser | sub_4CE6B0 (48 KB, 512 productions, 443 reductions, no AST) |
| Operand processing | sub_6273E0 (44 KB, 6-bit operand type switch) |
| MercConverter | sub_9F1A90 (35 KB, opcode-dispatched visitor) |
| MercConverter orchestrator | sub_9F3340 (7 KB) |
| Opcode dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF) |
| Post-conversion lowering | sub_9EF5E0 (27 KB, string "CONVERTING") |
| Bridge phases | Phases 0--13 (14 phases, first group in the 159-phase pipeline) |
| Diagnostic dump | Phase 9: ReportInitialRepresentation (sub_A3A7E0 stats emitter) |
| Intrinsic descriptors | sub_9EE390 (20 KB, "IntrinsicDescrFile=%s") |
Architecture
PTX source text
|
v
[Flex scanner] sub_720F00 (15.8KB, 552 rules)
| token stream
v
[Bison parser] sub_4CE6B0 (48KB, 512 productions)
| NO AST -- reduction actions build IR directly:
| - allocate instruction nodes from pool
| - set opcode field (instruction +72)
| - build operand array (instruction +84)
| - link into doubly-linked list per basic block
| - create basic block entries (40B each)
| - populate CFG hash maps (Code Object +648, +680)
|
v "DAGgen-time"
[Operand processing] sub_6273E0 (44KB) boundary
| 6-bit type switch (v12 & 0x3F) ----------
| address computation, state space annotation
v
+----------------------------------------------------------+
| RAW ORI IR (PTX-derived opcodes, virtual registers) |
| Instructions: PTX-level names (add.f32, ld.global, etc) |
| Registers: virtual R-file, typed descriptors |
| CFG: basic blocks + edge hash maps (partially formed) |
+----------------------------------------------------------+
|
| Phase 0: OriCheckInitialProgram (validate)
| Phase 1: ApplyNvOptRecipes (configure opt levels)
| Phase 2: PromoteFP16 (FP16 -> FP32 where needed)
| Phase 3: AnalyzeControlFlow (finalize CFG + RPO + backedges)
| Phase 4: AdvancedPhaseBeforeConvUnSup (arch hook, no-op default)
| Phase 5: ConvertUnsupportedOps (MercConverter: PTX ops -> SASS ops)
| Phase 6: SetControlFlowOpLastInBB (CFG structural fixup)
| Phase 7: AdvancedPhaseAfterConvUnSup (arch hook, no-op default)
| Phase 8: OriCreateMacroInsts (fuse instruction sequences)
| Phase 9: ReportInitialRepresentation (diagnostic dump)
| Phase 10: EarlyOriSimpleLiveDead (dead code elimination)
| Phase 11: ReplaceUniformsWithImm (fold known constants)
| Phase 12: OriSanitize (validate post-bridge IR)
| Phase 13: GeneralOptimizeEarly (bundled copy-prop + const-fold)
v "OCG-time"
+----------------------------------------------------------+ begins
| OPTIMIZER-READY ORI IR |
| Instructions: SASS opcodes (FADD, IMAD, LDG, STG, ...) |
| Registers: virtual R/UR/P/UP files |
| CFG: complete with RPO, backedge map, loop headers |
+----------------------------------------------------------+
|
v
[Phase 14+: main optimization pipeline]
Inline IR Construction (Bison -> Ori)
The Bison parser at sub_4CE6B0 has 512 grammar productions with 443 reduction-action cases. Each reduction action constructs IR directly -- no intermediate AST is ever materialized. The instruction table builder (sub_46E000, 93 KB, 1,141 per-opcode registration calls to sub_46BED0) runs during parser initialization and registers the legal type combinations for every PTX instruction. The instruction lookup subsystem (sub_46C690 entry, sub_46C6E0 matcher at 6.4 KB) classifies operands into 12 categories at parse time.
When the parser encounters a PTX instruction like add.f32 %r1, %r2, %r3, it:
1. Looks up add.f32 in the opcode table to get the internal opcode index and validate the type qualifier .f32
2. Allocates an Ori instruction node from the memory pool
3. Writes the opcode into the instruction field at offset +72
4. Processes each operand through sub_6273E0 to build the packed operand array at offset +84
5. Links the instruction into the current basic block's doubly-linked list
6. If the instruction is a branch/jump/return, creates a CFG edge in the successor hash map at Code Object +648
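The steps above can be sketched as a minimal C fragment. Field names and layout are simplified stand-ins for the real pool-allocated nodes (which keep the opcode at +72 and operands at +84):

```c
#include <stdlib.h>

/* Simplified instruction node; the real node is pool-allocated with
 * many more fields. */
typedef struct instr {
    int opcode;
    int operands[4];
    int num_operands;
    struct instr *prev, *next;   /* per-basic-block doubly-linked list */
} instr;

typedef struct {
    instr *head, *tail;
} basic_block;

/* Build one instruction and append it to the block's list, as the
 * reduction actions do inline during parsing. */
instr *emit(basic_block *bb, int opcode, const int *ops, int n) {
    instr *i = calloc(1, sizeof *i);
    i->opcode = opcode;
    i->num_operands = n;
    for (int k = 0; k < n; k++) i->operands[k] = ops[k];
    i->prev = bb->tail;
    if (bb->tail) bb->tail->next = i; else bb->head = i;
    bb->tail = i;
    return i;
}
```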
Special PTX registers (%ntid, %laneid, %smid, %ctaid, %clock64, etc.) are mapped to internal identifiers during parser initialization at sub_451730. The mapping table is built from the ROT13-encoded opcode table populated by ctor_003 at 0x4095D0.
Operand Processing -- sub_6273E0
The 44 KB operand processing function handles all PTX operand forms. It switches on a 6-bit type encoding extracted as v12 & 0x3F:
| Operand kind | Description | PTX syntax | Processing |
|---|---|---|---|
| Register | Direct register reference | %r1, %rd1, %f1 | Look up register descriptor via *(ctx+88) + 8*regId |
| Register pair | 64-bit register pair | %rd1 (on 32-bit ALU) | Allocate paired descriptors, link hi/lo |
| Immediate | Integer constant | 42, 0xFF | Pack into operand field |
| Float immediate | Floating-point constant | 0F3F800000 | Encode IEEE 754 bits |
| Address | Base + offset | [%rd1+16] | Compute effective address, annotate state space |
| Constant bank | Constant memory ref | c[2][0x100] | Bank index + offset encoding |
| Label | Branch target | $L__BB0_1 | Resolve to basic block index |
| Special register | Built-in register | %ntid.x, %laneid | Map to internal ID from sub_451730 table |
String evidence in sub_6273E0:
- ".nv.reservedSmem.offset0" -- reserved shared memory region handling
- "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
- "__$endLabel$__%s" -- label generation for structured control flow expansion
The function bridges PTX's explicitly-typed operand model (where .u32, .f32, .b64 qualifiers are part of the syntax) to Ori's implicitly-typed model where the operand type is determined by the SASS opcode.
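The dispatch shape -- mask the low 6 bits, then switch on the kind -- can be sketched as follows. The numeric kind values are placeholders, not the recovered encoding:

```c
#include <stdint.h>
#include <string.h>

/* Placeholder 6-bit kind values; the real encoding is not recovered. */
enum operand_kind {
    OP_REG = 0, OP_REGPAIR = 1, OP_IMM = 2, OP_FIMM = 3,
    OP_ADDR = 4, OP_CBANK = 5, OP_LABEL = 6, OP_SREG = 7,
};

const char *describe_operand(uint32_t v12) {
    switch (v12 & 0x3F) {            /* low 6 bits select the kind */
    case OP_REG:     return "register";
    case OP_REGPAIR: return "register pair";
    case OP_IMM:     return "immediate";
    case OP_FIMM:    return "float immediate";
    case OP_ADDR:    return "address";
    case OP_CBANK:   return "constant bank";
    case OP_LABEL:   return "label";
    case OP_SREG:    return "special register";
    default:         return "unknown";
    }
}
```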
Bridge Phases (0--13)
Phase 0: OriCheckInitialProgram -- Validation
Validates the raw Ori IR produced by the Bison parser for structural correctness: all basic blocks have valid entry/exit points, instruction operand counts match opcode requirements, register references are within bounds, and CFG edges are consistent. This is a pure validation pass that produces no IR transformations. It catches malformed IR early, before any optimization pass can amplify a structural error into a hard-to-diagnose miscompile.
Phase 1: ApplyNvOptRecipes -- Optimization Level Configuration
Applies NvOptRecipe transformations controlled by option 391. When enabled, the PhaseManager's constructor (sub_C62720) allocates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This sub-manager configures per-phase behavior based on the NvOpt level (0--5), controlling which later phases are active and their aggressiveness:
| NvOpt level | Behavior |
|---|---|
| 0 | Minimal optimization (fast-compile path, many phases isNoOp()) |
| 1--2 | Standard optimization |
| 3--4 | Aggressive optimization (loop unrolling, speculative hoisting enabled) |
| 5 | Maximum optimization (may significantly increase compile time) |
The string "Invalid nvopt level : %d." in sub_C173E0 confirms the valid range. The recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes. The sub-manager maintains its own sorted array (+376) and hash table (+400..+416) for fast recipe lookup by phase index.
NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)
+0 compilation_unit
+8 phase_manager back-reference
+16 ref_counted_list_1
+312 recipe_data
+336 allocator
+344 timing_records (stride = 584 per entry)
+376 sorted_array (for binary search by phase index)
+400 hash_bucket_count
+408 hash_buckets
+432 shared_list_ptr (ref-counted)
Phase 2: PromoteFP16 -- Half-Precision Type Promotion
Promotes half-precision (FP16) operations where hardware support is insufficient or promotion yields better throughput. The promotion strategy is architecture-dependent:
- Pre-sm_53: no native FP16 ALUs. All FP16 arithmetic is expanded to FP32 with narrowing conversions at stores.
- sm_53+: native FP16 support. Only operations that require expensive multi-instruction sequences in FP16 (certain transcendentals, complex comparisons) are promoted.
- sm_89+ (Ada, Blackwell): wide FP16 tensor paths. Promotion is minimal; most FP16 stays native.
The phase walks the instruction linked list, inspects each instruction's type encoding at offset +72, and rewrites FP16 operations to FP32 equivalents by replacing the opcode and inserting conversion instructions (F2F in SASS terminology) at use/def boundaries.
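The pre-sm_53 expansion shape can be sketched as a list rewrite in C. Opcode names are placeholders, not the recovered internal opcodes:

```c
enum { MAX_OUT = 16 };

/* Placeholder opcodes: HADD is an FP16 add; F2F_W/F2F_N are the
 * widening and narrowing conversions inserted at use/def boundaries. */
typedef enum { HADD, FADD, F2F_W, F2F_N, NOP } op;

/* Expand each FP16 add into widen / FP32 add / narrow -- the shape of
 * the pre-sm_53 promotion described above. Returns the output count. */
int promote_fp16(const op *in, int n, op *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (in[i] == HADD) {
            out[m++] = F2F_W;   /* convert FP16 sources up to FP32 */
            out[m++] = FADD;    /* do the arithmetic in FP32 */
            out[m++] = F2F_N;   /* convert the result back to FP16 */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}
```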
Phase 3: AnalyzeControlFlow -- CFG Finalization
Builds and finalizes the control flow graph data structures that the optimizer requires:
- Successor edges: populates the FNV-1a hash table at Code Object +648
- Backedge map: computes backedges and stores them at Code Object +680
- RPO array: builds the reverse post-order traversal at Code Object +720
- Loop identification: marks loop headers and backedge targets for later loop optimization passes (phases 18, 22, 24, 59)
The Bison parser constructs basic blocks and edges incrementally as it processes PTX instructions, but the CFG is not guaranteed to be fully consistent until this phase runs. For example, forward branch targets may reference blocks that were not yet created at parse time. This phase resolves all pending edges and ensures the CFG is complete.
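The RPO construction itself is a standard DFS post-order, reversed. A minimal sketch over a toy adjacency-list CFG (the real implementation walks the hash tables at Code Object +648):

```c
enum { MAX_BB = 32 };

/* Toy CFG: up to two successors per block, block 0 is the entry. */
typedef struct {
    int succ[MAX_BB][2];
    int num_succ[MAX_BB];
} cfg;

static void dfs(const cfg *g, int b, int *seen, int *post, int *n) {
    seen[b] = 1;
    for (int i = 0; i < g->num_succ[b]; i++)
        if (!seen[g->succ[b][i]])
            dfs(g, g->succ[b][i], seen, post, n);
    post[(*n)++] = b;                /* record block in post-order */
}

/* Fills rpo[] with blocks reachable from block 0 in reverse
 * post-order; returns the number of reachable blocks. */
int compute_rpo(const cfg *g, int *rpo) {
    int seen[MAX_BB] = {0}, post[MAX_BB], n = 0;
    dfs(g, 0, seen, post, &n);
    for (int i = 0; i < n; i++)
        rpo[i] = post[n - 1 - i];    /* reverse the post-order */
    return n;
}
```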
Phases 4 and 7: Architecture Hook Points
Phases 4 (AdvancedPhaseBeforeConvUnSup) and 7 (AdvancedPhaseAfterConvUnSup) are no-op-by-default hook points that bracket ConvertUnsupportedOps. Architecture backends override their vtables to inject target-specific processing:
- Phase 4 (before): prepare target-specific state, mark instructions that need special handling on this architecture
- Phase 7 (after): clean up after legalization, fix architecture-specific edge cases introduced by the generic lowering
These hooks are part of the 16 AdvancedPhase injection points distributed throughout the 159-phase pipeline. The architecture vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the sm_version.
Phase 5: ConvertUnsupportedOps -- Instruction Legalization
The most substantial bridge phase. Lowers PTX operations that have no direct SASS equivalent for the target architecture. This phase runs the MercConverter engine (see next section) and handles:
- 64-bit integer arithmetic on architectures with 32-bit ALUs: splits `add.s64`/`mul.lo.s64` into hi/lo 32-bit instruction pairs using carry chains
- Complex addressing modes: decomposes multi-component addresses into separate arithmetic instructions
- PTX-specific operations: converts PTX instructions that have no 1:1 SASS mapping (e.g., `bfe`, `bfi`, `prmt` variants not supported on all targets)
- Architecture availability: gates instructions by SM version (an instruction added in sm_80 is lowered to a multi-instruction sequence on sm_70)
- Texture/surface operations: legalizes texture sampling and surface access patterns (`sub_9E8B20`, 17 KB)
- Memory operations: legalizes load/store patterns, address register handling (`sub_9D76D0`/`sub_9D80E0`, 17--18 KB each)
After ConvertUnsupportedOps completes, every instruction in the IR has a valid SASS opcode for the target architecture.
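The 64-bit split amounts to materializing a 32-bit carry chain. A sketch of the arithmetic being emitted (illustrative only; the actual pass rewrites Ori instruction nodes rather than computing values):

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add lowered to two 32-bit adds linked by a carry, the pattern
   ConvertUnsupportedOps emits on 32-bit-ALU targets. The SASS mnemonics
   in the comments are the carry-chain pair this models. */
static void add64_split(uint32_t alo, uint32_t ahi,
                        uint32_t blo, uint32_t bhi,
                        uint32_t *rlo, uint32_t *rhi) {
    uint32_t lo    = alo + blo;       /* low add, sets carry-out        */
    uint32_t carry = (lo < alo);      /* detect 32-bit overflow         */
    *rlo = lo;
    *rhi = ahi + bhi + carry;         /* high add consumes the carry-in */
}
```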
The late phase 132 (UpdateAfterConvertUnsupportedOps) runs cleanup for edge cases introduced by this phase that are only detectable after optimization.
Phase 6: SetControlFlowOpLastInBB -- CFG Structural Fixup
Enforces a critical structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering when a PTX instruction expands to a sequence ending in a branch), this phase splits the block at the control flow point.
The invariant is required by the scheduler (which assumes only the last instruction in a block can transfer control) and the register allocator (which computes live-out sets at block boundaries). The phase rewrites the instruction linked list and allocates new 40-byte basic block entries as needed.
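A minimal sketch of the splitting invariant over simplified stand-in types (not the recovered 40-byte block layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Enforce "control flow op is last in its block": whenever a CF
   instruction has a successor inside the same block, move the tail
   into a freshly allocated block. Simplified stand-in types. */
typedef struct Inst  { struct Inst *next; bool is_cf; } Inst;
typedef struct Block { Inst *first; struct Block *next_block; } Block;

static int split_after_cf(Block *b) {
    int splits = 0;
    while (b) {
        for (Inst *i = b->first; i; i = i->next) {
            if (i->is_cf && i->next) {          /* CF op not last in block */
                Block *nb = calloc(1, sizeof *nb);
                nb->first = i->next;            /* tail becomes a new block */
                nb->next_block = b->next_block;
                b->next_block = nb;
                i->next = NULL;                 /* truncate original block  */
                splits++;
                break;       /* the tail is scanned as its own block next */
            }
        }
        b = b->next_block;
    }
    return splits;
}
```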
Phase 8: OriCreateMacroInsts -- Macro Fusion
Identifies and fuses instruction sequences into macro instructions for hardware efficiency. The phase scans the instruction linked list for patterns that the GPU hardware can execute as a single macro-op:
- Compare + branch: fused into a conditional branch macro instruction
- Multiply + add: fused into FMA where not already (different from PTX `fma` -- this catches `mul` followed by `add` on the same operands)
- Address computation + memory access: fused sequences for coalesced access patterns
The fused macro instructions carry composite semantics in a single IR node. They are expanded back into individual SASS instructions much later at phase 118 (MercExpandInstructions), after scheduling has determined the optimal placement. This late expansion allows the optimizer to treat the fused sequence as atomic, preventing passes from inserting unrelated instructions between the components.
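The mul-followed-by-add case can be sketched as a pattern scan over stand-in IR types. Note the real pass must also verify that the multiply's result has no other uses before killing it; this sketch simply assumes single use:

```c
#include <assert.h>

/* Fuse an adjacent mul/add pair into an FMA-style macro instruction.
   Opcode values and operand fields are illustrative, not the recovered
   encoding at instruction offset +72. Assumes the mul result is
   single-use; the real pass must prove that. */
enum { OP_MUL, OP_ADD, OP_FMA, OP_DEAD };
typedef struct MInst {
    struct MInst *next;
    int op;
    int dst, src0, src1, src2;
} MInst;

static int fuse_mul_add(MInst *head) {
    int fused = 0;
    for (MInst *i = head; i && i->next; i = i->next) {
        MInst *j = i->next;
        if (i->op == OP_MUL && j->op == OP_ADD &&
            (j->src0 == i->dst || j->src1 == i->dst)) {
            int addend = (j->src0 == i->dst) ? j->src1 : j->src0;
            j->op   = OP_FMA;              /* j := i.src0 * i.src1 + addend */
            j->src0 = i->src0;
            j->src1 = i->src1;
            j->src2 = addend;
            i->op   = OP_DEAD;             /* mul now dead; DCE removes it  */
            fused++;
        }
    }
    return fused;
}
```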
Phase 9: ReportInitialRepresentation -- Diagnostic Dump
Dumps the Ori IR state for debugging, active when DUMPIR or --ftrace diagnostics are enabled. The stats emitter at sub_A3A7E0 prints a per-function profile:
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [instHint=0] [instPairs=0]
This snapshot provides the pre-optimization baseline. Comparing it against ReportBeforeScheduling (phase 96) and ReportFinalMemoryUsage (phase 126) shows the optimizer's impact on instruction count, register pressure, and estimated latency.
Phases 10--13: Early Cleanup
| Phase | Name | Purpose |
|---|---|---|
| 10 | EarlyOriSimpleLiveDead | First dead code elimination pass. Removes instructions whose results are unused. Uses the SIMD-accelerated bitvector library (sub_BDBA60..sub_BDE150) for liveness computation. |
| 11 | ReplaceUniformsWithImm | Folds known-constant uniform register loads into immediate operands. Important for kernel launch parameters passed through constant memory. |
| 12 | OriSanitize | Second structural validation after all bridge transformations. Catches errors introduced by phases 1--11 before the main optimizer begins. |
| 13 | GeneralOptimizeEarly | First compound optimization pass: copy propagation + constant folding + algebraic simplification in a single fixed-point iteration. Cleans up redundancies introduced by the bridge phases. |
The MercConverter Engine
The MercConverter (sub_9F1A90, 35 KB) is the instruction conversion engine at the heart of ConvertUnsupportedOps. Despite its name referencing "Mercury" (NVIDIA's SASS encoding format), it operates purely at the IR level -- converting instruction semantics, not binary encodings.
Call Chain
sub_9F3340 (orchestrator, 7KB)
|
+-- sub_9F1A90 (MercConverter main pass, 35KB)
| |
| +-- sub_9ED2D0 (opcode dispatch, 25KB)
| | |
| | | Large switch on (*(instr+72)) with byte-1 mask:
| | | BYTE1(opcode) &= 0xCF -- strips modifier bits 4-5
| | |
| | +-- case 1: sub_9DA5C0 (2KB) -- opcode class 1
| | +-- case 6: sub_9DA100 (9KB) -- arithmetic operations
| | +-- case 8: sub_9D2440 -- specific class
| | +-- case 10,11,149,151,152,290,291:
| | | sub_9D80E0 (17KB) -- memory load/store
| | +-- default: vfunc[0](a1, a2) -- vtable dispatch
| |
| +-- sub_934630 (instruction creation utility, called N times)
|
+-- sub_9EF5E0 (post-conversion lowering, 27KB)
| string "CONVERTING"
+-- sub_9EC160, sub_7C11F0, sub_7BFC30 (intrinsic expansion)
Per-Category Handlers
| Handler | Size | Category | Key behavior |
|---|---|---|---|
sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
sub_9D80E0 | 17 KB | Memory legalization (variant) | Same opcode set as sub_9D76D0, alternate code path for different operand patterns. |
sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
sub_9E6600 | 25 KB | Instruction expansion | Splits instructions into multiple SASS equivalents (e.g., 64-bit ops on 32-bit ALU). Uses sub_9D4380 ~10 times. |
sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
sub_9DE890 | 17 KB | Control flow legalization | Branch/call instruction patterns. Calls sub_9D4380 (builder) 5 times. |
sub_9DDEE0 | 14 KB | Address computation | Address arithmetic lowering, complex addressing mode decomposition. |
Intrinsic Descriptor Loading
sub_9EE390 (20 KB) loads architecture-specific instruction descriptions from a file ("IntrinsicDescrFile=%s"). This allows the MercConverter to query which intrinsic operations are natively supported on the target SM and which require multi-instruction expansion. The descriptor file is architecture-versioned and loaded once during the first compilation of a kernel targeting that architecture.
The PTX-to-SASS Opcode Transition
The fundamental semantic transformation during lowering: PTX uses high-level, explicitly-typed opcodes; Ori uses SASS-level opcodes where the type is encoded in the mnemonic. All SASS opcode strings in the binary are ROT13-encoded.
PTX source (typed virtual ISA) Ori IR (SASS machine-level)
--------------------------------- ---------------------------------
add.f32 %r1, %r2, %r3 --> FADD R1, R2, R3
add.s32 %r4, %r5, %r6 --> IADD3 R4, R5, R6, RZ
mul.f64 %d1, %d2, %d3 --> DMUL D1, D2, D3
mad.lo.s32 %r7, %r8, %r9, %r10 --> IMAD R7, R8, R9, R10
ld.global.f32 %r11, [%rd1] --> LDG R11, [R1]
st.shared.f32 [%rd2], %r12 --> STS [R2], R12
bra $L__BB0_1 --> BRA bix1
@%p0 bra $L__BB0_2 --> @P0 BRA bix2
exit --> EXIT
bar.sync 0 --> BAR
ROT13 encoding in the binary:
SNQQ = FADD VZNQ = IMAD SSZN = FFMA
VNQQ3 = IADD3 QZHY = DMUL YQT = LDG
FGT = STG OEN = BRA RKVG = EXIT
ERG = RET ONE = BAR FGF = STS
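The encoding is plain ROT13 over letters, with digits passing through unchanged (hence IADD3 <-> VNQQ3). A decoder sufficient to recover the opcode strings:

```c
#include <assert.h>
#include <string.h>

/* ROT13 over A-Z/a-z; digits and other characters pass through,
   matching the IADD3 <-> VNQQ3 pairing in the recovered tables. */
static void rot13(const char *in, char *out) {
    size_t i;
    for (i = 0; in[i]; i++) {
        char c = in[i];
        if (c >= 'A' && c <= 'Z')      c = 'A' + (c - 'A' + 13) % 26;
        else if (c >= 'a' && c <= 'z') c = 'a' + (c - 'a' + 13) % 26;
        out[i] = c;
    }
    out[i] = '\0';
}
```

Applying it to a stored string such as SNQQ yields the SASS mnemonic FADD.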
Key semantic differences at the transition:
- Type moves into the opcode: PTX `add.f32` becomes `FADD` (the "F" encodes float); PTX `add.s32` becomes `IADD3` (the "I" encodes integer). The type qualifier disappears from the instruction syntax.
- Register namespace unification: PTX's typed virtual registers (`%r` for int, `%f` for float, `%rd` for 64-bit, `%p` for predicate) merge into Ori's four register files (R, UR, P, UP) with type tracked in the register descriptor at offset +64.
- Operand count changes: SASS `IADD3` takes 3 source operands where PTX `add` takes 2 -- the third source defaults to `RZ` (the hardware zero register). This is handled by the expansion in `sub_9E6600`.
- Multi-instruction expansion: Complex PTX operations expand to multiple SASS instructions. A PTX `div.f32` may become a Newton-Raphson sequence of `RCP` + `FMUL` + correction iterations.
- Predication mapping: PTX `@%p0 instruction` maps to an Ori predicate operand in the P register file, attached to the instruction node's predicate slot.
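The Newton-Raphson expansion for div.f32 can be sketched numerically. The seed below stands in for the hardware RCP approximation by injecting a 1% error; the two refinement steps model the FMA-based correction iterations (a numeric illustration, not the emitted SASS sequence):

```c
#include <assert.h>
#include <math.h>

/* div.f32 lowered as reciprocal + Newton-Raphson refinement + multiply.
   The seed simulates a coarse RCP result; each refinement step squares
   the relative error (1e-2 -> 1e-4 -> 1e-8). */
static float div_nr(float a, float b) {
    float y = (1.0f / b) * 1.01f;       /* RCP y, b  (coarse seed, +1%) */
    y = y * (2.0f - b * y);             /* correction iteration 1       */
    y = y * (2.0f - b * y);             /* correction iteration 2       */
    return a * y;                       /* FMUL q, a, y                 */
}
```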
Error Detection During Lowering
The bridge phases include two error detection mechanisms:
Internal compiler error assertion (sub_9EB990, 1.4 KB): contains three references to the string "Internal compiler error.". Called when a bridge phase encounters an impossible IR state (e.g., an opcode value outside the known range in the MercConverter dispatch switch). Triggers a longjmp-based fatal abort via sub_42F590 back to the driver's error recovery point in sub_446240.
Uninitialized register detector (sub_A0B5E0, 7 KB): "Found %d potentially uninitialized register(s) in function %s". Walks the instruction list per block, checks register descriptor flags at offset +48 (bit 5 = "defined"). Reports registers that appear as sources without any prior definition. This detector fires after the bridge phases to catch conversion errors that leave registers undefined.
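The detector's core is a def-before-use walk. A sketch, simplified to a 64-register bitmask in place of the per-register flag word at descriptor offset +48:

```c
#include <assert.h>
#include <stdint.h>

/* Count source operands read before any definition. Sources are checked
   against the "defined" set before the destination is marked, so an
   instruction reading its own output still counts as uninitialized.
   Simplified stand-in types; <=64 registers; -1 means "no operand". */
typedef struct UInst { struct UInst *next; int dst, src0, src1; } UInst;

static int count_uninit_uses(UInst *head) {
    uint64_t defined = 0;
    int bad = 0;
    for (UInst *i = head; i; i = i->next) {
        if (i->src0 >= 0 && !(defined & (1ull << i->src0))) bad++;
        if (i->src1 >= 0 && !(defined & (1ull << i->src1))) bad++;
        if (i->dst >= 0) defined |= 1ull << i->dst;  /* def after uses */
    }
    return bad;
}
```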
Key Data Structures
Instruction Node
Instruction (variable size, linked list node)
+0 prev_ptr // doubly-linked list: previous instruction
+8 next_ptr // doubly-linked list: next instruction
+16 child_ptr // child/expanded instruction chain
+32 control_word_ptr // set later during scheduling (initially NULL)
+72 opcode // byte 0: primary opcode
// byte 1 bits 4-5: modifier (masked with 0xCF)
+80 operand_count // number of operands
+84 operand_array // packed operand descriptors
Operand Encoding
Each operand is a packed 32-bit value:
Bits 28-30: operand kind ((value >> 28) & 7)
1 = register operand
5 = predicate register
(other values for immediate, constant bank, label, etc.)
Lower bits: operand-kind-specific payload (register ID, immediate value, etc.)
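Pack/unpack helpers matching this encoding. The kind values 1 (register) and 5 (predicate) are recovered; treating the full low 28 bits as payload is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit operand word: kind in bits 28-30, payload in the low bits.
   The 28-bit payload width is an assumption for illustration. */
#define OPND_KIND_REG  1u
#define OPND_KIND_PRED 5u

static uint32_t opnd_pack(uint32_t kind, uint32_t payload) {
    return ((kind & 7u) << 28) | (payload & 0x0FFFFFFFu);
}
static uint32_t opnd_kind(uint32_t v)    { return (v >> 28) & 7u; }
static uint32_t opnd_payload(uint32_t v) { return v & 0x0FFFFFFFu; }
```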
Register Descriptor
Register descriptor (accessed via *(ctx+88) + 8*regId)
+12 register number (int)
+48 flags (bit 5 = "defined", other bits for liveness state)
+64 type (3=address, 6=GPR, 7=predicate)
Timing Boundary
The lowering spans two --compiler-stats timer phases:
| Timer | Covers |
|---|---|
DAGgen-time | Bison parser reduction actions -> Ori instruction nodes, operand processing (sub_6273E0), basic block / CFG construction |
OCG-time | Phases 0--13 (bridge), then phases 14--158 (optimization + codegen) |
The boundary between "lowering" and "optimization" is therefore between phase 13 (GeneralOptimizeEarly, the last bridge phase) and phase 14 (DoSwitchOptFirst, the first pure optimization). After phase 13, the IR is in its final SASS-opcode form with validated structure, ready for the main optimization pipeline.
Cross-References
- PTX Parser -- Flex scanner + Bison LALR(1) parser (the source of raw Ori IR)
- Ori IR -- IR design: Code Object, basic blocks, instruction format, register files
- Optimization Pipeline -- 159-phase pipeline (phases 0--13 are the bridge)
- Phase Manager -- PhaseManager object, phase factory, dispatch loop
- Optimization Levels -- NvOpt levels 0--5 and their effect on recipes
- SASS Opcodes -- target SASS instruction set after lowering
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
0x451730 | 14 KB | 1 | Parser init, special register setup | HIGH |
0x46E000 | 93 KB | 1 | Opcode table builder (1,141 per-opcode calls) | HIGH |
0x4CE6B0 | 48 KB | 1 | Bison LALR(1) parser (512 productions) | HIGH |
0x6273E0 | 44 KB | N | Operand processing (6-bit type switch) | MEDIUM |
0x9D4380 | 7 KB | ~10 | Instruction builder / inserter into linked list | HIGH |
0x9D76D0 | 18 KB | 1 | Memory instruction legalization (load/store) | HIGH |
0x9D80E0 | 17 KB | 1 | Memory instruction legalization (variant) | HIGH |
0x9DA100 | 9 KB | 1 | Arithmetic operation handler (case 6) | HIGH |
0x9DE890 | 17 KB | 1 | Control flow legalization (branch/call) | MEDIUM |
0x9DDEE0 | 14 KB | 1 | Address computation legalization | MEDIUM |
0x9E6600 | 25 KB | 1 | Instruction expansion (64-bit split, etc.) | HIGH |
0x9E8B20 | 17 KB | 1 | Texture/surface lowering | MEDIUM |
0x9EB990 | 1.4 KB | 3 | Internal compiler error assertion | HIGH |
0x9EC340 | 23 KB | 1 | Multi-operand instruction legalization | MEDIUM |
0x9ED2D0 | 25 KB | 1 | Opcode dispatch (master switch, & 0xCF mask) | HIGH |
0x9EE390 | 20 KB | 1 | Intrinsic descriptor file loader | MEDIUM |
0x9EF5E0 | 27 KB | 1 | Post-MercConverter lowering ("CONVERTING") | HIGH |
0x9F1A90 | 35 KB | 1 | MercConverter main instruction conversion pass | HIGH |
0x9F3340 | 7 KB | 1 | MercConverter orchestrator ("After MercConverter") | HIGH |
0xA0B5E0 | 7 KB | N | Uninitialized register detector | HIGH |
0xA3A7E0 | 6 KB | N | Scheduling statistics printer (phase 9 output) | VERY HIGH |
Optimization Pipeline (159 Phases)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas optimizer is a fixed-order pipeline of 159 compilation phases that transform Ori IR from its initial post-lowering form into scheduled, register-allocated SASS machine code. Unlike LLVM's PassManager -- which uses dependency-driven scheduling and analysis preservation -- ptxas runs every phase unconditionally in a predetermined order, relying on per-phase isNoOp() checks to skip inapplicable transformations. This design trades flexibility for predictability: the phase ordering is identical across all compilations, and architecture-specific behavior is injected through 16 "AdvancedPhase" hook points whose vtables are overridden per target.
Each phase is a polymorphic C++ object exactly 16 bytes in size, allocated from a memory pool by a 159-case factory switch. The PhaseManager constructs all 159 phase objects up front during initialization, stores them in a flat array, and iterates the array in a simple dispatch loop. Per-phase timing and memory consumption are optionally tracked for --stat=phase-wise output.
Key Facts
| Field | Value |
|---|---|
| Total phases | 159 (indices 0--158) |
| Named phases (static table) | 139 (indices 0--138) |
| Dynamic phases (vtable names) | 20 (indices 139--158) |
| AdvancedPhase hook points | 16 |
| Mercury sub-pipeline phases | 8 (phases 113--114, 117--122) |
| Phase object size | 16 bytes: {vtable_ptr, allocator_ptr} |
| Factory switch | sub_C60D30 (3554 bytes, 159 cases) |
| PhaseManager constructor | sub_C62720 (4734 bytes) |
| Dispatch loop | sub_C64F70 (1455 bytes) |
| Phase name table | off_22BD0C0 (159 entries, 1272 bytes) |
| Default ordering table | unk_22BEEA0 (159-entry index array) |
| Vtable range | off_22BD5C8..off_22BEE78 (40-byte stride) |
| NamedPhases option ID | 298 |
| Pipeline orchestrator | sub_7FB6C0 |
Phase Object Layout
Every phase is a 16-byte polymorphic object created by the factory:
struct Phase {
void** vtable; // +0: pointer to phase-specific vtable in .data.rel.ro
void* allocator; // +8: memory pool used for allocation
};
The vtable provides three virtual methods common to all phases:
| Offset | Signature | Purpose |
|---|---|---|
+0 | execute(Phase*, CompilationContext*) | Run the phase on the IR |
+8 | isNoOp(Phase*) -> bool | Return true to skip execution |
+16 | getName(Phase*) -> int | Return index into the phase name table |
Additional vtable slots (+24 pool alloc, +32 pool free) are present but belong to the allocator interface, not the phase protocol.
Dispatch Loop
The dispatch loop at sub_C64F70 drives execution:
// sub_C64F70 -- simplified
void dispatch(PhaseManager* pm, int* phase_indices, int count) {
MemorySnapshot baseline = take_snapshot();
for (int i = 0; i < count; i++) {
int idx = phase_indices[i];
Phase* phase = pm->phase_list[idx];
const char* name = pm->name_table[phase->getName()];
if (!phase->isNoOp()) {
MemorySnapshot before = take_snapshot();
phase->execute(pm->compilation_unit);
if (pm->timing_enabled) {
report_phase_stats(pm, name, &before);
}
}
}
if (pm->timing_enabled) {
report_summary(pm, "All Phases Summary", &baseline);
report_pool_consumption(pm);
}
}
Timing output format (to stderr when --stat=phase-wise):
<phase_name> :: [Total 42 KB ] [Freeable 8 KB ] [Freeable Leaked 0 KB ] (0%)
Complete Phase Table
Group 1 -- Initial Setup (phases 0--13)
Program validation, recipe application, FP16 promotion, control flow analysis, macro instruction creation.
| # | Phase Name | Category |
|---|---|---|
| 0 | OriCheckInitialProgram | Validation |
| 1 | ApplyNvOptRecipes | Recipe application |
| 2 | PromoteFP16 | Type promotion |
| 3 | AnalyzeControlFlow | CFG analysis |
| 4 | AdvancedPhaseBeforeConvUnSup | Hook (no-op default) |
| 5 | ConvertUnsupportedOps | Legalization |
| 6 | SetControlFlowOpLastInBB | CFG fixup |
| 7 | AdvancedPhaseAfterConvUnSup | Hook (no-op default) |
| 8 | OriCreateMacroInsts | Macro expansion |
| 9 | ReportInitialRepresentation | Diagnostics |
| 10 | EarlyOriSimpleLiveDead | Early DCE |
| 11 | ReplaceUniformsWithImm | Immediate folding |
| 12 | OriSanitize | IR validation |
| 13 | GeneralOptimizeEarly | Bundled early opts |
Phase 0 validates the initial Ori IR for structural correctness. Phase 1 applies NvOptRecipe transformations (controlled by option 391, which allocates a 440-byte sub-manager at PhaseManager+56). Phase 2 promotes FP16 operations where profitable. Phases 4 and 7 are architecture hooks that bracket ConvertUnsupportedOps -- backends override them to inject target-specific pre/post-legalization logic.
Group 2 -- Early Optimization (phases 14--32)
Branch optimization, loop canonicalization, strength reduction, software pipelining, SSA formation.
| # | Phase Name | Category |
|---|---|---|
| 14 | DoSwitchOptFirst | Switch optimization |
| 15 | OriBranchOpt | Branch optimization |
| 16 | OriPerformLiveDeadFirst | Liveness / DCE |
| 17 | OptimizeBindlessHeaderLoads | Texture header opt |
| 18 | OriLoopSimplification | Loop canonicalization |
| 19 | OriSplitLiveRanges | Live range splitting |
| 20 | PerformPGO | Profile-guided opt |
| 21 | OriStrengthReduce | Strength reduction |
| 22 | OriLoopUnrolling | Loop unrolling |
| 23 | GenerateMovPhi | SSA phi insertion |
| 24 | OriPipelining | Software pipelining |
| 25 | StageAndFence | Memory fence insertion |
| 26 | OriRemoveRedundantBarriers | Barrier elimination |
| 27 | AnalyzeUniformsForSpeculation | Uniform analysis |
| 28 | SinkRemat | Sink + rematerialization |
| 29 | GeneralOptimize | Bundled mid opts |
| 30 | DoSwitchOptSecond | Switch optimization (2nd) |
| 31 | OriLinearReplacement | Linear scan replacement |
| 32 | CompactLocalMemory | Local memory compaction |
The GeneralOptimize* phases (13, 29, 37, 46, 58, 65) are compound passes that bundle multiple small optimizations (copy propagation, constant folding, algebraic simplification) into a single fixed-point iteration. They appear at multiple pipeline positions to re-clean the IR after major transformations. Liveness/DCE also runs repeatedly (OriPerformLiveDead at phases 16, 33, 61, 84) to remove dead code exposed by intervening passes.
Group 3 -- Mid-Level Optimization (phases 33--52)
GVN-CSE, reassociation, shader constant extraction, CTA expansion, argument enforcement.
| # | Phase Name | Category |
|---|---|---|
| 33 | OriPerformLiveDeadSecond | Liveness / DCE (2nd) |
| 34 | ExtractShaderConstsFirst | Shader constant extraction |
| 35 | OriHoistInvariantsEarly | LICM (early) |
| 36 | EmitPSI | PSI emission |
| 37 | GeneralOptimizeMid | Bundled mid opts |
| 38 | OptimizeNestedCondBranches | Nested branch opt |
| 39 | ConvertVTGReadWrite | VTG read/write conversion |
| 40 | DoVirtualCTAExpansion | Virtual CTA expansion |
| 41 | MarkAdditionalColdBlocks | Cold block marking |
| 42 | ExpandMbarrier | Mbarrier expansion |
| 43 | ForwardProgress | Forward progress guarantee |
| 44 | OptimizeUniformAtomic | Uniform atomic opt |
| 45 | MidExpansion | Mid-level legalization |
| 46 | GeneralOptimizeMid2 | Bundled mid opts (2nd) |
| 47 | AdvancedPhaseEarlyEnforceArgs | Hook (no-op default) |
| 48 | EnforceArgumentRestrictions | ABI enforcement |
| 49 | GvnCse | GVN + CSE |
| 50 | OriReassociateAndCommon | Reassociation + commoning |
| 51 | ExtractShaderConstsFinal | Shader constants (final) |
| 52 | OriReplaceEquivMultiDefMov | Redundant move elimination |
Shader constant extraction (phases 34, 51) identifies uniform values that can be loaded from constant memory rather than recomputed per-thread. GvnCse (phase 49) combines global value numbering with common subexpression elimination in a single pass. The MidExpansion (phase 45) performs target-dependent lowering of operations that must be expanded before register allocation but after high-level optimizations have had their chance.
Group 4 -- Late Optimization (phases 53--77)
Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.
| # | Phase Name | Category |
|---|---|---|
| 53 | OriPropagateVaryingFirst | Varying propagation |
| 54 | OriDoRematEarly | Early rematerialization |
| 55 | LateExpansion | Late legalization |
| 56 | SpeculativeHoistComInsts | Speculative hoisting |
| 57 | RemoveASTToDefaultValues | AST cleanup |
| 58 | GeneralOptimizeLate | Bundled late opts |
| 59 | OriLoopFusion | Loop fusion |
| 60 | DoVTGMultiViewExpansion | Multi-view expansion |
| 61 | OriPerformLiveDeadThird | Liveness / DCE (3rd) |
| 62 | OriRemoveRedundantMultiDefMov | Dead move elimination |
| 63 | OriDoPredication | If-conversion |
| 64 | LateOriCommoning | Late commoning |
| 65 | GeneralOptimizeLate2 | Bundled late opts (2nd) |
| 66 | OriHoistInvariantsLate | LICM (late) |
| 67 | DoKillMovement | Kill movement |
| 68 | DoTexMovement | Texture movement |
| 69 | OriDoRemat | Rematerialization |
| 70 | OriPropagateVaryingSecond | Varying propagation (2nd) |
| 71 | OptimizeSyncInstructions | Sync optimization |
| 72 | LateExpandSyncInstructions | Late sync expansion |
| 73 | ConvertAllMovPhiToMov | Phi destruction |
| 74 | ConvertToUniformReg | Uniform reg conversion |
| 75 | LateArchOptimizeFirst | Arch-specific late opt |
| 76 | UpdateAfterOptimize | IR update pass |
| 77 | AdvancedPhaseLateConvUnSup | Hook (no-op default) |
Predication (phase 63) converts short conditional branches into predicated instruction sequences, eliminating branch divergence. Rematerialization runs twice (phases 54 and 69) -- the early pass targets values that are cheap to recompute, while the late pass handles cases exposed by predication and loop fusion. Phase 73 (ConvertAllMovPhiToMov) destroys SSA form by converting phi nodes into move instructions, preparing the IR for register allocation. Phase 74 converts qualifying values to uniform registers (UR), reducing general register pressure.
Group 5 -- Legalization (phases 78--96)
Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attribute setting, final inspection.
| # | Phase Name | Category |
|---|---|---|
| 78 | LateExpansionUnsupportedOps | Late unsupported ops |
| 79 | OriHoistInvariantsLate2 | LICM (late 2nd) |
| 80 | ExpandJmxComputation | JMX expansion |
| 81 | LateArchOptimizeSecond | Arch-specific late opt (2nd) |
| 82 | AdvancedPhaseBackPropVReg | Hook (no-op default) |
| 83 | OriBackCopyPropagate | Backward copy propagation |
| 84 | OriPerformLiveDeadFourth | Liveness / DCE (4th) |
| 85 | OriPropagateGmma | GMMA propagation |
| 86 | InsertPseudoUseDefForConvUR | UR pseudo use/def |
| 87 | FixupGmmaSequence | GMMA sequence fixup |
| 88 | OriHoistInvariantsLate3 | LICM (late 3rd) |
| 89 | AdvancedPhaseSetRegAttr | Hook (no-op default) |
| 90 | OriSetRegisterAttr | Register attribute setting |
| 91 | OriCalcDependantTex | Texture dependency calc |
| 92 | AdvancedPhaseAfterSetRegAttr | Hook (no-op default) |
| 93 | LateExpansionUnsupportedOps2 | Late unsupported ops (2nd) |
| 94 | FinalInspectionPass | Final IR validation |
| 95 | SetAfterLegalization | Post-legalization marker |
| 96 | ReportBeforeScheduling | Diagnostics |
GMMA (phases 85, 87) handles WGMMA (warp group matrix multiply-accumulate) instruction sequences that require specific register arrangements and ordering constraints. OriSetRegisterAttr (phase 90) annotates registers with scheduling attributes (latency class, bank assignment) consumed by the downstream scheduler. FinalInspectionPass (phase 94) is a validation gate that catches illegal IR patterns before the irreversible scheduling/RA phases.
Group 6 -- Pre-Scheduling and Register Allocation (phases 97--103)
Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.
| # | Phase Name | Category |
|---|---|---|
| 97 | AdvancedPhasePreSched | Hook (no-op default) |
| 98 | BackPropagateVEC2D | Vector back-propagation |
| 99 | OriDoSyncronization | Synchronization insertion |
| 100 | ApplyPostSyncronizationWars | Post-sync WAR fixup |
| 101 | AdvancedPhaseAllocReg | Hook (no-op default) |
| 102 | ReportAfterRegisterAllocation | Diagnostics |
| 103 | Get64bRegComponents | 64-bit register splitting |
Phase 99 inserts the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model. Phase 100 fixes write-after-read hazards exposed by sync insertion. Register allocation is driven through the hook at phase 101 -- the actual allocator is architecture-specific and invoked from the AdvancedPhase override. Phase 103 splits 64-bit register pairs into their 32-bit components for architectures that require it.
Group 7 -- Post-RA and Post-Scheduling (phases 104--116)
Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboards.
| # | Phase Name | Category |
|---|---|---|
| 104 | AdvancedPhasePostExpansion | Hook (no-op default) |
| 105 | ApplyPostRegAllocWars | Post-RA WAR fixup |
| 106 | AdvancedPhasePostSched | Hook (no-op default) |
| 107 | OriRemoveNopCode | NOP removal |
| 108 | OptimizeHotColdInLoop | Hot/cold in loops |
| 109 | OptimizeHotColdFlow | Hot/cold flow opt |
| 110 | PostSchedule | Post-scheduling |
| 111 | AdvancedPhasePostFixUp | Hook (no-op default) |
| 112 | PlaceBlocksInSourceOrder | Block layout |
| 113 | PostFixForMercTargets | Mercury target fixup |
| 114 | FixUpTexDepBarAndSync | Texture barrier fixup |
| 115 | AdvancedScoreboardsAndOpexes | Scoreboard generation |
| 116 | ProcessO0WaitsAndSBs | O0 wait/scoreboard |
Hot/cold partitioning (phases 108--109) separates frequently executed blocks from cold paths, improving instruction cache locality. PlaceBlocksInSourceOrder (phase 112) determines the final layout of basic blocks in the emitted binary. The scoreboard sub-system has two paths: at -O1 and above, AdvancedScoreboardsAndOpexes (phase 115) performs full dependency analysis to compute the 23-bit control word per instruction (4-bit stall count, 1-bit yield, 3-bit write barrier, 6-bit read barrier mask, 6-bit wait barrier mask, plus reuse flags). At -O0, phase 115 is a no-op and ProcessO0WaitsAndSBs (phase 116) inserts conservative waits.
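The control word fields listed above can be packed as follows. Only the field widths come from the recovered analysis; the field order and bit positions in this sketch are assumptions, and the reuse flags are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the per-instruction scheduling control word: 4-bit stall count,
   1-bit yield, 3-bit write barrier index, 6-bit read barrier mask,
   6-bit wait barrier mask. Bit POSITIONS here are assumed for
   illustration; only the widths are from the recovered analysis. */
static uint32_t pack_ctrl(uint32_t stall, uint32_t yield, uint32_t wrbar,
                          uint32_t rdmask, uint32_t waitmask) {
    return  (stall    & 0xFu)
         | ((yield    & 0x1u)  << 4)
         | ((wrbar    & 0x7u)  << 5)
         | ((rdmask   & 0x3Fu) << 8)
         | ((waitmask & 0x3Fu) << 14);
}
```

Under this layout, the conservative -O0 policy corresponds to emitting the maximum stall field on every instruction.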
Group 8 -- Mercury Backend (phases 117--122)
SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.
| # | Phase Name | Category |
|---|---|---|
| 117 | MercEncodeAndDecode | Mercury encode/decode |
| 118 | MercExpandInstructions | Instruction expansion |
| 119 | MercGenerateWARs1 | WAR generation (1st pass) |
| 120 | MercGenerateOpex | Opex generation |
| 121 | MercGenerateWARs2 | WAR generation (2nd pass) |
| 122 | MercGenerateSassUCode | SASS microcode generation |
"Mercury" is NVIDIA's internal name for the SASS encoding framework. Phase 117 converts Ori instructions into Mercury's intermediate encoding, then decodes them back to verify round-trip correctness. Phase 118 expands pseudo-instructions into their final SASS sequences. WAR generation runs in two passes (119, 121) because expansion in phase 118 can introduce new write-after-read hazards. Phase 120 generates "opex" (operation extension) annotations. Phase 122 produces the final SASS microcode bytes. The MercConverter infrastructure (sub_9F1A90, 35KB) drives the instruction-level legalization using a visitor pattern dispatched through a large opcode switch (sub_9ED2D0, 25KB).
Group 9 -- Post-Mercury (phases 123--131)
Register map, diagnostics, debug output.
| # | Phase Name | Category |
|---|---|---|
| 123 | ComputeVCallRegUse | Virtual call reg use |
| 124 | CalcRegisterMap | Register map computation |
| 125 | UpdateAfterPostRegAlloc | Post-RA update |
| 126 | ReportFinalMemoryUsage | Diagnostics |
| 127 | AdvancedPhaseOriPhaseEncoding | Hook (no-op default) |
| 128 | UpdateAfterFormatCodeList | Code list formatting |
| 129 | DumpNVuCodeText | SASS text dump |
| 130 | DumpNVuCodeHex | SASS hex dump |
| 131 | DebuggerBreak | Debugger breakpoint |
CalcRegisterMap (phase 124) computes the final physical-to-logical register mapping emitted as EIATTR metadata in the output ELF. DumpNVuCodeText and DumpNVuCodeHex (phases 129--130) produce the human-readable SASS text and raw hex dumps used by cuobjdump and debugging workflows. DebuggerBreak (phase 131) is a development-only hook that triggers a breakpoint when a specific phase is reached.
Group 10 -- Finalization (phases 132--158)
Late merge operations, late unsupported-op expansion, high-pressure live range splitting, architecture-specific fixups.
| # | Phase Name | Category |
|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Post-conversion update |
| 133 | MergeEquivalentConditionalFlow | Conditional flow merge |
| 134 | AdvancedPhaseAfterMidExpansion | Hook (no-op default) |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Hook (no-op default) |
| 136 | LateMergeEquivalentConditionalFlow | Late conditional merge |
| 137 | LateExpansionUnsupportedOpsMid | Late unsupported mid |
| 138 | OriSplitHighPressureLiveRanges | High-pressure splitting |
| 139--158 | (architecture-specific) | Arch-specific fixups |
Phases 132--138 handle late-breaking transformations that must run after the Mercury backend but before finalization. OriSplitHighPressureLiveRanges (phase 138) is a last-resort live range splitter that fires when register pressure exceeds hardware limits after the main allocation pass.
Phases 139--158 are 20 additional slots whose names are not in the static name table but are returned by their vtable getString() methods. These are architecture-specific phases registered in the factory switch (vtable addresses off_22BEB08..off_22BEE78) that target particular SM generations or compilation modes. They provide extensibility for new architectures without modifying the fixed 139-phase base table.
Optimization Level Gating
AdvancedPhase Hook Points
Sixteen phases serve as conditional extension points. Their isNoOp() method returns true by default, causing the dispatch loop to skip them. Architecture backends and optimization-level configurations override the vtable to activate these hooks:
| Phase | Name | Gate Location |
|---|---|---|
| 4 | AdvancedPhaseBeforeConvUnSup | Before unsupported-op conversion |
| 7 | AdvancedPhaseAfterConvUnSup | After unsupported-op conversion |
| 47 | AdvancedPhaseEarlyEnforceArgs | Before argument enforcement |
| 77 | AdvancedPhaseLateConvUnSup | Late unsupported-op boundary |
| 82 | AdvancedPhaseBackPropVReg | Before backward copy prop |
| 89 | AdvancedPhaseSetRegAttr | Before register attr setting |
| 92 | AdvancedPhaseAfterSetRegAttr | After register attr setting |
| 97 | AdvancedPhasePreSched | Before scheduling |
| 101 | AdvancedPhaseAllocReg | Register allocation driver |
| 104 | AdvancedPhasePostExpansion | After post-RA expansion |
| 106 | AdvancedPhasePostSched | After post-scheduling |
| 111 | AdvancedPhasePostFixUp | After post-fixup |
| 115 | AdvancedScoreboardsAndOpexes | Full scoreboard analysis |
| 127 | AdvancedPhaseOriPhaseEncoding | Phase encoding hook |
| 134 | AdvancedPhaseAfterMidExpansion | After mid-expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Late sync expansion |
The pattern is consistent: AdvancedPhase hooks bracket major pipeline stages, allowing backends to insert target-specific transformations without altering the fixed phase ordering. Phase 101 (AdvancedPhaseAllocReg) is notable because register allocation itself is entirely driven through this hook -- the base pipeline has no hardcoded allocator.
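The skip-or-run decision can be sketched as a vtable-style dispatch. The following mock is illustrative only — the struct layout, field names, and instrumentation are assumptions, not the recovered C++ object layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of a phase object: each phase exposes isNoOp()
 * and run(); the dispatch loop consults isNoOp() before executing.
 * AdvancedPhase hooks return true by default and are thus skipped.   */
typedef struct Phase {
    bool (*isNoOp)(const struct Phase *self);
    void (*run)(struct Phase *self);
    int   runs;   /* instrumentation for this sketch only */
} Phase;

static bool noop_true(const Phase *p)  { (void)p; return true;  }
static bool noop_false(const Phase *p) { (void)p; return false; }
static void run_impl(Phase *p)         { p->runs++; }

/* Dispatch over an ordering array of phase indices, skipping any
 * phase whose isNoOp() says no backend has activated it.            */
static int dispatch(Phase *phases, const int *ordering, int n) {
    int executed = 0;
    for (int i = 0; i < n; i++) {
        Phase *p = &phases[ordering[i]];
        if (p->isNoOp(p))
            continue;          /* inactive AdvancedPhase hook */
        p->run(p);
        executed++;
    }
    return executed;
}
```

An architecture backend "activates" a hook simply by swapping the vtable so that `isNoOp` returns false; the fixed phase ordering never changes.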
O0 vs O1+ Behavior
At -O0, the pipeline skips most optimization phases via their individual isNoOp() checks. The critical difference is in scoreboard generation:
- `-O1` and above: Phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using `sub_A36360` (52 KB control word encoder) and `sub_A23CF0` (54 KB DAG list scheduler heuristic). Phase 116 is a no-op.
- `-O0`: Phase 115 is a no-op. Phase 116 (ProcessO0WaitsAndSBs) inserts conservative stall counts and wait barriers -- every instruction gets the maximum stall, and barriers are placed at every potential hazard point. This produces correct but slow code.
Individual phases also check the optimization level internally via the compilation context. The scheduling infrastructure (sub_8D0640) reads the opt-level via sub_7DDB50 and selects between forward-pass scheduling (opt-level <= 2, register-pressure-reducing) and reverse-pass scheduling (opt-level > 2, latency-hiding).
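The direction gate described above reduces to a single comparison; a minimal sketch with illustrative names:

```c
#include <assert.h>

typedef enum { SCHED_FORWARD, SCHED_REVERSE } SchedDirection;

/* Per the recovered logic in sub_8D0640: opt-level <= 2 selects the
 * register-pressure-reducing forward pass; higher levels select the
 * latency-hiding reverse pass. Enum and function names are ours.    */
static SchedDirection select_sched_direction(int opt_level) {
    return (opt_level <= 2) ? SCHED_FORWARD : SCHED_REVERSE;
}
```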
NamedPhases Override (Option 298)
The NamedPhases mechanism allows complete replacement of the default 159-phase pipeline with a user-specified phase sequence, primarily used for debugging and performance investigation.
Activation
The pipeline orchestrator (sub_7FB6C0) checks option ID 298 via a vtable call at compilation context offset +72. When set, the orchestrator bypasses the default pipeline and delegates to sub_9F63D0 (NamedPhases entry point):
// sub_7FB6C0 -- simplified
void orchestrate(CompilationUnit* cu) {
if (cu->config->getOption(298)) {
// NamedPhases mode -- user-specified phase sequence
NamedPhases_run(cu); // sub_9F63D0
} else {
// Default mode -- fixed 159-phase pipeline
PhaseManager* pm = PhaseManager_new(cu); // sub_C62720
int* ordering = get_default_ordering(); // sub_C60D20
dispatch(pm, ordering, 159); // sub_C64F70
PhaseManager_destroy(pm); // sub_C61B20
}
// ... cleanup 17 data structures, refcounted objects ...
}
Configuration String Format
Option 298 is set via a knob string (environment variable or command-line). The string is stored at compilation context offset 21464 with a type indicator at offset 21456. The parser (sub_798B60, NamedPhases::ParsePhaseList) tokenizes the comma-delimited string:
"phase_name1,phase_name2=param,shuffle,swap1,..."
Maximum 256 entries. The parser populates three parallel arrays:
- Phase name strings
- Parameter value strings (parsed via `strtol`)
- Full `name=value` pairs
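The tokenization can be sketched as follows. This is a simplified stand-in for `sub_798B60`, assuming only what is documented above (comma delimiters, `name=value` pairs, `strtol` for parameters, 256-entry cap); the array layout and buffer sizes are illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 256   /* parser cap noted in the recovered code */

typedef struct {
    char names[MAX_ENTRIES][64];   /* phase name strings             */
    long params[MAX_ENTRIES];      /* parameter values (via strtol)  */
    char pairs[MAX_ENTRIES][96];   /* full name=value tokens         */
    int  count;
} PhaseListConfig;

/* Minimal sketch of the comma-delimited knob-string parse. */
static int parse_phase_list(const char *knob, PhaseListConfig *out) {
    char buf[1024];
    out->count = 0;
    snprintf(buf, sizeof buf, "%s", knob);
    for (char *tok = strtok(buf, ",");
         tok && out->count < MAX_ENTRIES;
         tok = strtok(NULL, ",")) {
        int i = out->count++;
        snprintf(out->pairs[i], sizeof out->pairs[i], "%s", tok);
        char *eq = strchr(tok, '=');
        if (eq) {
            *eq = '\0';                         /* split name=value */
            out->params[i] = strtol(eq + 1, NULL, 10);
        } else {
            out->params[i] = 0;
        }
        snprintf(out->names[i], sizeof out->names[i], "%s", tok);
    }
    return out->count;
}
```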
Phase List Builder
The core builder (sub_9F4040, 49 KB) processes the parsed configuration:
- Allocates a `0x2728`-byte stack frame with 256-entry string tables
- Initializes a 158-entry phase descriptor table (zeroed `0x400` bytes)
- Resolves phase names to indices via `sub_C641D0` (case-insensitive binary search)
- Recognized manipulation keywords:
  - `shuffle` -- randomize the phase ordering
  - `swap1`..`swap6` -- swap specific phase pairs (for A/B testing)
  - `OriPerformLiveDead` -- override liveness pass placement
  - `OriCopyProp` -- override copy propagation placement
- Constructs the final phase index sequence and dispatches via `sub_C64F70`
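The `shuffle` keyword amounts to permuting the phase index sequence before dispatch. A Fisher-Yates sketch (the actual PRNG and any exclusion rules inside `sub_9F4040` are not recovered; `rand()` here is purely illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Randomize a phase ordering in place -- the effect of the
 * "shuffle" manipulation keyword, under assumed PRNG choice. */
static void shuffle_phases(int *order, int n, unsigned seed) {
    srand(seed);
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);          /* pick from [0, i] */
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
}
```

Whatever the seed, the result is a permutation of the original sequence, so every phase still runs exactly once -- only the order varies, which is what makes it useful for flushing out hidden phase-ordering dependencies.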
Pass-Disable Integration
Individual passes can be disabled without reordering the pipeline. The check function sub_799250 (IsPassDisabled, 68 bytes) performs a case-insensitive substring match against the PTXAS_DISABLED_PASSES string at context offset 13328:
// sub_799250 -- simplified
bool is_pass_disabled(Context* ctx, const char* pass_name) {
if (ctx->pass_disable_flag == 0) return false; // offset 13320
if (ctx->pass_disable_flag == 5) {
return strcasestr(ctx->pass_disable_string, pass_name); // offset 13328
}
return false;
}
This check is called from 16+ sites across the codebase, guarding passes like LoopMakeSingleEntry and SinkCodeIntoBlock. A more thorough variant (sub_7992A0, IsPassDisabledFull) uses FNV-1a hashing for function-specific override tables.
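For reference, 32-bit FNV-1a is a short multiply-and-xor loop. The binary's exact hash width and table layout are not fully recovered, so the 32-bit variant below is shown for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (; *s; s++) {
        h ^= (uint8_t)*s;              /* xor first (the "1a" order) */
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```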
PhaseManager Data Structures
PhaseManager Object (~112 bytes)
Offset Type Field
------ ---- -----
+0 int64 compilation_unit pointer
+8 int64* allocator
+16 void* sorted_name_table (for binary search)
+24 int32 sorted_name_count
+28 int32 sorted_name_capacity
+32 int64* allocator_copy
+40 void* phase_list (array of 16-byte Phase entries)
+48 int32 phase_list_count
+52 int32 phase_list_capacity
+56 int64 nvopt_recipe_ptr (NvOptRecipe sub-manager, or NULL)
+64 int64 (reserved)
+72 bool timing_enabled (from options[17928])
+76 int32 (flags)
+80 bool flag_byte
+88 int64* timing_allocator
+96 void* phase_name_raw_table
+104 int32 phase_name_raw_count
+108 int32 phase_name_raw_capacity
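Under natural x86-64 alignment, the recovered offsets fall out of a plain C struct (the bool fields at +72 and +80 are followed by padding). This mirror uses the wiki's reconstructed field names and can be sanity-checked with `offsetof`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* C mirror of the recovered PhaseManager layout; field names are
 * reconstructions, not symbols from the binary.                   */
typedef struct PhaseManager {
    int64_t  compilation_unit;        /* +0   */
    int64_t *allocator;               /* +8   */
    void    *sorted_name_table;       /* +16  */
    int32_t  sorted_name_count;       /* +24  */
    int32_t  sorted_name_capacity;    /* +28  */
    int64_t *allocator_copy;          /* +32  */
    void    *phase_list;              /* +40  */
    int32_t  phase_list_count;        /* +48  */
    int32_t  phase_list_capacity;     /* +52  */
    int64_t  nvopt_recipe_ptr;        /* +56  */
    int64_t  reserved;                /* +64  */
    bool     timing_enabled;          /* +72  (3 bytes padding follow) */
    int32_t  flags;                   /* +76  */
    bool     flag_byte;               /* +80  (7 bytes padding follow) */
    int64_t *timing_allocator;        /* +88  */
    void    *phase_name_raw_table;    /* +96  */
    int32_t  phase_name_raw_count;    /* +104 */
    int32_t  phase_name_raw_capacity; /* +108 */
} PhaseManager;                       /* sizeof == 112 */
```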
Timing Record (32 bytes)
Offset Type Field
------ ---- -----
+0 int32 phase_index (-1 = sentinel)
+8 int64 phase_name_or_magic (0x2030007 = sentinel)
+16 int64 timing_value
+24 int32 memory_flags
NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)
Created when option 391 is set. Contains timing records with 584-byte stride, a hash table for recipe lookup, sorted arrays, and ref-counted shared lists. The sub-manager inherits the phase chain from the previous execution context, enabling recipe-based pipeline modification across compilation units.
Function Map
| Address | Size | Identity |
|---|---|---|
sub_C60D20 | 16 B | Default phase table pointer |
sub_C60D30 | 3554 B | Phase factory (159-case switch) |
sub_C60BD0 | 334 B | Multi-function phase invoker |
sub_C61B20 | 1753 B | PhaseManager destructor |
sub_C62200 | 888 B | Pool consumption reporter |
sub_C62580 | 253 B | Timing record array resizer (1.5x growth) |
sub_C62640 | 223 B | Phase list resizer (1.5x growth) |
sub_C62720 | 4734 B | PhaseManager constructor |
sub_C639A0 | 1535 B | Case-insensitive quicksort (median-of-3) |
sub_C63FA0 | 556 B | Phase name table sort/rebuild |
sub_C641D0 | 305 B | Phase name-to-index binary search |
sub_C64310 | 3168 B | Per-phase timing reporter |
sub_C64F70 | 1455 B | Phase dispatch loop |
sub_7FB6C0 | 1193 B | Pipeline orchestrator (option 298 gate) |
sub_798B60 | 1776 B | NamedPhases::ParsePhaseList |
sub_799250 | 68 B | IsPassDisabled (substring check) |
sub_7992A0 | 894 B | IsPassDisabledFull (FNV-1a hash) |
sub_9F4040 | 9093 B | NamedPhases::parseAndBuild |
sub_9F63D0 | 342 B | NamedPhases::run |
sub_9F1A90 | 6310 B | MercConverter main pass |
sub_9F3340 | ~7 KB | MercConverter orchestrator |
sub_9ED2D0 | ~25 KB | MercConverter opcode dispatch |
Diagnostic Strings
| String | Location | Trigger |
|---|---|---|
"All Phases Summary" | sub_C64F70 | End of dispatch loop (timing enabled) |
"[Pool Consumption = " | sub_C62200 | End of dispatch loop (timing enabled) |
" :: " | sub_C64310 | Per-phase timing line |
"[Total ", "[Freeable ", "[Freeable Leaked " | sub_C64310 | Memory delta columns |
"Before ", "After " | sub_C64F70 | Phase execution markers |
"NamedPhases" | sub_9F4040 | NamedPhases config parsing |
"shuffle", "swap1".."swap6" | sub_9F4040 | NamedPhases manipulation keywords |
"After MercConverter" | near sub_9F3340 | Post-MercConverter diagnostic |
"CONVERTING" | sub_9EF5E0 | During MercConverter lowering |
"Internal compiler error." | sub_9EB990 | ICE assertion (3 sites) |
Cross-References
- Phase Manager Infrastructure -- detailed PhaseManager internals
- Pass Inventory & Ordering -- per-pass documentation index
- GeneralOptimize Bundles -- compound optimization passes
- Scheduler Architecture -- scheduling phases 97--116
- Mercury Encoder -- Mercury backend phases 117--122
- Scoreboards & Dependency Barriers -- control word generation
- Optimization Levels -- O-level gating details
- DUMPIR & NamedPhases -- NamedPhases configuration reference
- Memory Pool Allocator -- pool allocator used by phase objects
SASS Code Generation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page covers the top-level compilation orchestration layer of ptxas: the code that sits between the PTX front-end (parsing, directive handling) and the 159-phase optimization pipeline. It is responsible for validating the parsed PTX, selecting a compilation strategy, computing register constraints, dispatching per-kernel compilation (either sequentially or via a thread pool), and collecting per-kernel outputs for finalization. The orchestrator is the single largest function in the front-end region at 2,141 decompiled lines.
Key Facts
| Core orchestrator | sub_4428E0 (13,774 bytes, 2,141 decompiled lines) |
| Per-kernel worker | sub_43A400 (4,696 bytes, 647 lines) |
| Per-kernel DAGgen+OCG | sub_64BAF0 (~30 KB, 1,006 decompiled lines) |
| Per-entry output | sub_43CC70 (5,425 bytes, 1,077 decompiled lines) |
| Thread pool worker | sub_436DF0 (485 bytes, 59 decompiled lines) |
| Thread pool constructor | sub_1CB18B0 (184-byte pool struct, calls pthread_create) |
| Finalization | sub_432500 (461 bytes, 47 decompiled lines) |
| Regalloc finalize | sub_4370F0 (522 bytes, 64 decompiled lines) |
| Compilation strategies | 4 (normal, compile-only, debug, non-ABI) |
| Error recovery | setjmp/longjmp (non-local, no C++ exceptions) |
Architecture
sub_446240 (top-level driver)
|
v
sub_4428E0 (core orchestrator, 2141 lines)
|
|-- Option validation: .version/.target, --compile-only, --compile-as-tools-patch
|-- Cache config: def-load-cache, force-load-cache, def-store-cache, force-store-cache
|-- Strategy selection: 4 function-pointer pairs (see below)
|-- Register constraints: sub_43B660 per kernel (via strategy function)
|-- Compile-unit table: 48-byte per-CU entry at a1+336
|-- Timing array: 112-byte per-kernel entry at a1+256
|
+-- IF single-threaded (thread_count == 0):
| |
| FOR EACH compile unit:
| |
| +-- sub_43A400 (per-kernel setup, 647 lines)
| | |-- Target-specific defaults ("ptxocg.0.0", cache, texmode)
| | |-- ABI configuration, fast-compile shortcuts
| | +-- Error recovery via setjmp
| |
| +-- sub_432500 (finalization wrapper, 47 lines)
| | |-- setjmp error guard
| | +-- vtable call at a1+96: invokes the actual OCG pipeline
| |
| +-- sub_4370F0 (timing finalization, 64 lines)
| | +-- Accumulates per-kernel timing into 112-byte records
| |
| +-- sub_43CC70 (per-entry output, 1077 lines)
| |-- Skip __cuda_dummy_entry__
| |-- Generate .sass and .ucode sections
| +-- Emit "# ============== entry %s ==============" header
|
+-- IF multi-threaded (thread_count > 0):
|
|-- sub_1CB18B0(thread_count) --> create thread pool
|
FOR EACH compile unit:
|
+-- sub_43A400 (per-kernel setup, same as above)
+-- Snapshot 15 config vectors (v158[3]..v158[17])
+-- Copy hash maps for thread isolation
+-- sub_1CB1A50(pool, sub_436DF0, task_struct) --> enqueue
|
v
sub_436DF0 (thread pool worker, 59 lines)
|-- sub_430590("ptxas", ...) -- set thread-local program name
|-- Jobserver slot check (sub_1CC6EC0)
|-- sub_432500(...) -- finalize via vtable call (DAGgen+OCG+SASS)
|-- Timing: wall-clock and phase timers into per-CU record
|-- sub_693630 (release compiled output to downstream)
+-- sub_4248B0 (free task struct)
|
sub_1CB1AE0(pool) --> wait for all tasks
sub_1CB1970(pool) --> destroy pool
sub_4370F0(a1, -1) --> finalize aggregate timing
The Core Orchestrator: sub_4428E0
This is a 2,141-line monolith that drives the entire compilation after the PTX has been parsed. Its responsibilities, in execution order:
1. Cache and Option Validation
The first 200+ lines read four cache-mode knobs from the options store at a1+904:
def_load_cache = get_knob(a1->options, "def-load-cache");
force_load_cache = get_knob(a1->options, "force-load-cache");
def_store_cache = get_knob(a1->options, "def-store-cache");
force_store_cache = get_knob(a1->options, "force-store-cache");
It then validates a long series of option combinations:
- `--compile-as-tools-patch` (a1+727) incompatibility checks against shared memory, textures, surfaces, samplers, constants
- `--assyscall` (a1+627) resource allocation checks
- `--compile-only` (a1+726) vs unified functions
- Non-ABI mode (a2+218, a2+235): disables `--fast-compile`, `--extensible-whole-program`
- `--position-independent-code` vs `--extensible-whole-program` mutual exclusion
- Architecture version checks: `.target` SM version vs `--gpu-name` SM version
- `noFwdPrg` forward-progress flag against SM version
- `--legacy-bar-warp-wide-behavior` against SM >= 70
2. Strategy Selection
Four compilation strategies are expressed as pairs of function pointers (v314, v293), selected at lines 756-779 of the decompilation. Each strategy pair consists of a register-constraint calculator and a compile-unit builder:
| Strategy | Condition | Register Calculator (v314) | CU Builder (v293) |
|---|---|---|---|
| Compile-only | --compile-only OR --assyscall OR --compile-as-tools-patch | sub_43C6F0 (225 lines) | sub_4383B0 (177 lines) |
| Debug | --compile-as-tools-patch AND NOT debug mode | sub_43CAE0 (91 lines) | sub_4378E0 (250 lines) |
| Non-ABI | --extensible-whole-program | sub_43C570 (77 lines) | sub_438B50 (375 lines) |
| Normal | default | sub_43CA80 (24 lines) | sub_438B50 (375 lines) |
The selection logic:
if (compile_only || assyscall || tools_patch) {
calc_regs = sub_43C6F0;
build_cus = sub_4383B0;
} else if (tools_patch_mode) {
calc_regs = debug_mode ? sub_43C6F0 : sub_43CAE0;
build_cus = debug_mode ? sub_4383B0 : sub_4378E0;
} else {
calc_regs = extensible_whole_program ? sub_43C570 : sub_43CA80;
build_cus = sub_438B50;
}
3. Register Constraint Calculation
Each strategy's register calculator iterates the kernel list and calls sub_43B660 to compute per-kernel register limits. The result is stored in a hash map at a1+1192:
// sub_43CA80 (normal strategy, 24 lines) -- simplest form
for (node = kernel_list; node; node = node->next) {
entry = node->data;
name = entry->func_ptr; // at +16
limit = sub_43B660(a1, name, a1->opt_level, entry->thread_count);
map_put(a1->reg_limit_map, name, limit); // a1+1192
}
sub_43B660 (3,843 bytes) balances register pressure against occupancy by considering .maxnreg, --maxrregcount, .minnctapersm, and .maxntid. String evidence: ".minnctapersm and .maxntid", "threads per SM", "computed using thread count", "of .maxnreg", "of maxrregcount option", "global register limit specified".
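A hedged sketch of the balance `sub_43B660` strikes: divide the register file by the resident thread count implied by `.maxntid` x `.minnctapersm`, then clamp by any explicit `.maxnreg` / `--maxrregcount` limit. The 65536-registers-per-SM figure, the 255-register encoding cap, and the absence of rounding granularity are assumptions of this sketch, not recovered constants:

```c
#include <assert.h>

/* Illustrative per-thread register limit calculation. maxnreg == 0
 * means no explicit per-function/global limit was specified.        */
static int reg_limit(int regs_per_sm, int max_ntid, int min_ncta_per_sm,
                     int maxnreg) {
    int resident_threads = max_ntid * min_ncta_per_sm;
    int limit = regs_per_sm / resident_threads;   /* occupancy-driven */
    if (limit > 255) limit = 255;                 /* assumed hw encoding cap */
    if (maxnreg > 0 && maxnreg < limit) limit = maxnreg;
    return limit;
}
```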
4. Compile-Unit Table Construction
The CU builder (v293) constructs a linked list of 72-byte compile-unit descriptors. Each descriptor contains:
struct CompileUnitDescriptor { // 72 bytes
int32 index; // +0: CU ordinal
void* dep_list; // +8: dependency tracking set
void* entry_ptr; // +16: pointer to entry function symbol
bool is_entry; // +25: true if .entry, false if .func
int32 regalloc[2]; // +28: register allocation mode pair
bool flags[4]; // +36: has_shared_mem, has_surfaces, has_textures, has_samplers
int16 cap_flags; // +40: capability flags from func_attr+240
int32 min_regs; // +44: minimum register count (from profile check)
// +48..55: additional attribute OR bits from func_attr+208..236
// +56..63: reserved
void* smem_info; // +48: 24-byte sub-struct for shared memory
};
The builder allocates this via sub_424070(pool, 72), populates it from the function attribute struct at entry_ptr+80, and enqueues it into the output list via sub_42CA60.
5. Per-Kernel Dispatch
After building the CU list, the orchestrator enters one of two dispatch modes based on the thread count at a1+668:
Single-Threaded Path (thread_count == 0)
The loop at decompilation lines 1607-1686 iterates each CU sequentially:
for (node = cu_list; node; node = node->next) {
cu_desc = node->data;
// Record start time in 112-byte timing record
timing_record[cu_desc->index].start = get_time();
// Allocate 360-byte work buffer
work = pool_alloc(pool, 360);
memset(work, 0, 360);
// Per-kernel setup: target config, cache defaults, ABI
sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
// Finalization: runs the actual DAGgen + OCG pipeline
sub_432500(a1, cu_desc + 16, work[0], work[1]);
// Timing finalization for this kernel
sub_4370F0(a1, cu_desc->index);
// Per-entry output: .sass/.ucode sections
sub_43CC70(a1, parser_state, cu_desc, work);
pool_free(work);
}
Multi-Threaded Path (thread_count > 0)
The thread pool path (decompilation lines 1709-1929) uses the pthread-based thread pool:
// Phase 1: prepare all tasks
pool_obj = sub_1CB18B0(thread_count); // create pool with N threads
for (node = cu_list; node; node = node->next) {
cu_desc = node->data;
// Allocate 360-byte work buffer (same as single-threaded)
work = pool_alloc(pool, 360);
// Extra per-thread state: 3 hash maps for thread isolation
work[288] = hashmap_new(8); // per-thread reg constraints
work[296] = hashmap_new(8); // per-thread symbol copies
work[304] = hashmap_new(8); // per-thread attribute copies
sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
// Snapshot 15 config vectors from global state (a1+1072..a1+1296)
// into work[48]..work[288] for thread-safe access
for (i = 0; i < 15; i++)
work[48 + 16*i] = load_128bit(a1 + 1072 + 16*i);
// Copy hash maps from shared state into per-thread copies
// (reg constraints, symbol tables, attribute maps)
// Enqueue: sub_1CB1A50(pool, sub_436DF0, task_struct)
task = malloc(48);
task->pool_data = work;
task->timing_base = ...;
sub_1CB1A50(pool_obj, sub_436DF0, task);
}
// Phase 2: wait for all tasks
sub_1CB1AE0(pool_obj); // wait-for-all
sub_1CB1970(pool_obj); // destroy pool
// Phase 3: aggregate timing
sub_4370F0(a1, -1); // -1 = aggregate all CUs
Jobserver Integration
When --split-compile is active with GNU Make, the thread pool integrates with Make's jobserver protocol. The worker function sub_436DF0 checks sub_1CC6EC0() (jobserver slot acquire) before starting compilation and calls sub_1CC7040() (jobserver slot release) after completion. This prevents ptxas from exceeding the -j slot limit during parallel builds.
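The jobserver token discipline itself is simple: Make hands child processes a token pipe (via `--jobserver-auth` in `MAKEFLAGS`), a slot is acquired by reading one byte from it, and released by writing a byte back. The sketch below demonstrates that discipline with a local pipe standing in for Make's inherited descriptors; function names are ours, not the recovered `sub_1CC6EC0`/`sub_1CC7040`:

```c
#include <assert.h>
#include <unistd.h>

/* Acquire a job slot: blocking read of one token byte. */
static int jobserver_acquire(int rfd, char *token) {
    return read(rfd, token, 1) == 1;
}

/* Release the slot: write the token byte back to the pipe. */
static int jobserver_release(int wfd, char token) {
    return write(wfd, &token, 1) == 1;
}
```

Because a worker that fails to acquire a token must not start compiling, `sub_436DF0` reports a fatal error on acquisition failure rather than proceeding unthrottled.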
Per-Kernel Worker: sub_43A400
This 647-line function sets up target-specific defaults for each kernel before the OCG pipeline runs. Key responsibilities:
- Timing instrumentation -- records start timestamps, wall-clock time
- Target configuration -- reads `"ptxocg.0.0"` defaults, sets cache mode, texturing mode (`"specified texturing mode"` string evidence)
- Fast-compile shortcuts -- when `--fast-compile` is active, reduces optimization effort
- ABI setup -- configures parameter passing, return address register, scratch registers
- Error recovery -- establishes `setjmp` point for fatal errors during kernel compilation
The function allocates a _jmp_buf on the stack for error recovery. If any phase in the downstream pipeline calls the fatal diagnostic path (sub_42F590 with severity >= 6), execution longjmps back to sub_43A400's recovery handler, which cleans up the partially-compiled kernel and continues to the next.
String evidence: "def-load-cache", "force-load-cache", "--sw4575628", "NVIDIA", "ptxocg.0.0", "specified texturing mode", "Indirect Functions or Extern Functions".
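The skip-and-continue recovery pattern can be sketched in plain C. Names here are illustrative, not the recovered symbols; note the `volatile` qualifiers, which C requires for locals modified between `setjmp` and `longjmp`:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf recovery;

/* Stand-in for the fatal diagnostic path (severity >= 6). */
static void fatal_diagnostic(void) { longjmp(recovery, 1); }

/* Stand-in per-kernel compile that fails for one chosen kernel. */
static void compile_kernel(int idx, int fail_idx) {
    if (idx == fail_idx)
        fatal_diagnostic();
}

/* Driver loop: a failed kernel longjmps back to the guard, and the
 * loop cleans up and continues with the next kernel.               */
static int compile_all(int n, int fail_idx) {
    volatile int completed = 0;
    for (volatile int i = 0; i < n; i++) {
        if (setjmp(recovery))
            continue;              /* recovery handler: skip this kernel */
        compile_kernel(i, fail_idx);
        completed++;
    }
    return completed;
}
```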
Finalization: sub_432500
This 47-line wrapper function is the bridge between the orchestrator and the actual DAGgen+OCG pipeline. It:
- Retrieves the thread-local context via `sub_4280C0`
- Saves and replaces the `jmp_buf` pointer in the TLS struct (for nested error recovery)
- Saves the current error/warning flags
- Clears the error flags to create a clean compilation context
- Calls through a vtable pointer at `a1+96` to invoke the actual compilation:
// sub_432500 -- simplified
bool finalize(Context* ctx, CUDesc* cu, void* sass_out, void* ucode_out) {
char* tls = get_tls();
jmp_buf* old_jmp = tls->jmp_buf;
tls->jmp_buf = &local_jmp;
char saved_err = tls->error_flags;
char saved_warn = tls->warning_flags;
tls->error_flags = 0;
tls->warning_flags = 0;
if (setjmp(local_jmp)) {
// Error path: restore state, cleanup, report ICE
tls->jmp_buf = old_jmp;
tls->error_flags = 1; tls->warning_flags = 1;
release_output(ucode_out);
release_output(cu->output);
report_internal_error();
return false;
}
// Normal path: invoke the pipeline
bool ok = ctx->vtable->compile(ctx->state, sass_out, ctx + 384);
if (!ok) report_internal_error();
// Merge error flags
tls->jmp_buf = old_jmp;
tls->error_flags = saved_err ? true : (tls->error_flags != 0);
tls->warning_flags = saved_warn ? true : (tls->warning_flags != 0);
return ok;
}
The vtable call at a1+96 is the entry point into sub_64BAF0 (the 1,006-line function that runs DAGgen, the 159-phase OCG pipeline, and Mercury SASS encoding for a single kernel).
Regalloc Finalization: sub_4370F0
This 64-line function accumulates per-kernel timing results into the master timing array at a1+256. Each entry in this array is a 112-byte record:
struct KernelTimingRecord { // 112 bytes, at a1->timing_array + 112*index
char* kernel_name; // +0
float ocg_time; // +20
float total_time; // +36
float cumulative; // +40
double wall_clock; // +72
// ... other timing fields
};
When called with index == -1 (aggregate mode after multi-threaded compilation), it sums all per-kernel records into the global timing counters at a1+176 (total parse time), a1+184 (total OCG time), and a1+208 (peak wall-clock).
Per-Entry Output: sub_43CC70
This 1,077-line function produces the final per-kernel output artifacts. Key behaviors:
- Skip dummy entries -- checks for `"__cuda_dummy_entry__"` and returns immediately
- Section generation -- creates `.sass` and `.ucode` ELF sections for each kernel
- Entry banner -- emits `"\n# ============== entry %s ==============\n"` to the SASS text output
- Register map -- calls `"reg-fatpoint"` to annotate the register allocation
- Verbose SASS output -- when `--verbose` is active, formats and writes human-readable SASS text
- Multiple output paths -- supports mercury, capmerc, and direct SASS output modes
Thread Pool Worker: sub_436DF0
The worker function dispatched to each thread pool thread is compact (59 lines) but carefully structured for thread safety:
void thread_worker(Context* a1, TaskStruct* task) {
set_thread_program_name("ptxas", task);
// Acquire jobserver slot if applicable
if (a1->jobserver_enabled && !jobserver_acquire())
report_fatal_error(); // Cannot get build slot
float time_before = get_pool_time(a1->timer);
double wall_before = get_wall_time();
// Run the actual compilation
sub_432500(a1->state, task + 64, task->sass_output, task->ucode_output);
float time_after = get_pool_time(a1->timer);
double wall_after = get_wall_time();
// Record timing in per-CU record
int cu_index = *(int*)task->cu_desc;
TimingRecord* rec = &a1->timing_array[cu_index];
rec->ocg_time = time_after - time_before;
rec->cumulative += (time_after - time_before);
if (wall_after - wall_before > 0)
rec->wall_clock = wall_after - wall_before;
// Peak wall-clock tracking (under lock)
if (a1->compiler_stats && a1->per_kernel_stats) {
lock_timing(6);
double peak = a1->peak_wall_clock;
if (get_wall_time() - a1->start_time > peak)
a1->peak_wall_clock = get_wall_time() - a1->start_time;
unlock_timing(6);
}
// Emit compiled output downstream
if (a1->dump_sass)
dump_sass(task->ucode_output);
release_output(task->ucode_output);
// Transfer kernel name to output
task->output->name = **(task->cu_desc->entry_ptr + 88);
// Release jobserver slot
if (a1->jobserver_enabled && !jobserver_release())
report_fatal_error();
pool_free(task);
}
The timing lock at index 6 (sub_607D70(6) / sub_607D90(6)) serializes access to the peak wall-clock counter across threads. This is the only shared mutable state in the multi-threaded path -- all other per-kernel state is isolated in the 360-byte work buffer and per-thread hash map copies.
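The peak-tracking update under the lock reduces to a compare-and-store inside a critical section. A sketch with a pthreads mutex standing in for lock slot 6 (the binary's `sub_607D70`/`sub_607D90` wrappers; names here are ours):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t timing_lock = PTHREAD_MUTEX_INITIALIZER;
static double peak_wall_clock = 0.0;

/* Update the shared peak under the lock -- the only shared mutable
 * state in the multi-threaded compilation path.                     */
static void record_wall_clock(double elapsed) {
    pthread_mutex_lock(&timing_lock);
    if (elapsed > peak_wall_clock)
        peak_wall_clock = elapsed;
    pthread_mutex_unlock(&timing_lock);
}
```

Without the lock, two threads could both read a stale peak, and the larger of their values could be overwritten by the smaller -- the classic lost-update race this serialization prevents.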
Data Flow Summary
PTX text
|
v (parsed by sub_451730 into AST at parser_state+88)
|
sub_4428E0: strategy_calc(kernel_list) --> reg_limit_map (a1+1192)
sub_4428E0: strategy_build(kernel_list) --> cu_descriptor_list (72-byte nodes)
|
v (for each CU descriptor)
|
sub_43A400: target_config(cu_desc) --> 360-byte work buffer
|
sub_432500: vtable->compile()
| invokes sub_64BAF0 (DAGgen + 159-phase OCG + Mercury)
| |
| +-- Ori IR construction (DAGgen phase)
| +-- 159 phases via PhaseManager (sub_C62720 / sub_C64F70)
| +-- Mercury SASS encoding (phases 113-122)
| |
v v
work[0] = .sass output work[1] = .ucode output
|
sub_4370F0: record timing
|
sub_43CC70: emit .sass/.ucode sections, entry banner, verbose output
|
v
ELF builder (sub_612DE0)
Cross-References
- Pipeline Overview -- end-to-end compilation flow
- Entry Point & CLI -- the top-level driver that calls sub_4428E0
- Optimization Pipeline (159 Phases) -- the OCG pipeline invoked per-kernel
- Code Generation Overview -- detailed codegen subsystem
- SASS Instruction Encoding -- Mercury encoding phases
- Register Allocation -- Fatpoint algorithm invoked at phase 101
- Thread Pool & Concurrency -- thread pool struct and jobserver
- Memory Pool Allocator -- pool allocator used throughout
- Knobs System -- cache-mode knobs read by sub_4428E0
Function Map
| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
sub_4428E0 | 13,774 B | 2,141 | Core compilation orchestrator | HIGH |
sub_43CA80 | 192 B | 24 | Normal strategy: register calculator | HIGH |
sub_438B50 | 2,419 B | 375 | Normal/non-ABI strategy: CU builder | HIGH |
sub_43C6F0 | 1,600 B | 225 | Compile-only strategy: register calculator | HIGH |
sub_4383B0 | 1,320 B | 177 | Compile-only/debug strategy: CU builder | HIGH |
sub_43CAE0 | 648 B | 91 | Debug strategy: register calculator | HIGH |
sub_4378E0 | 2,010 B | 250 | Debug strategy: CU builder | HIGH |
sub_43C570 | 577 B | 77 | Non-ABI strategy: register calculator | HIGH |
sub_43A400 | 4,696 B | 647 | Per-kernel worker (target config + setup) | HIGH |
sub_64BAF0 | ~30 KB | 1,006 | DAGgen + OCG + SASS (per-kernel pipeline) | MEDIUM |
sub_43CC70 | 5,425 B | 1,077 | Per-entry output (.sass/.ucode sections) | HIGH |
sub_436DF0 | 485 B | 59 | Thread pool worker function | HIGH |
sub_432500 | 461 B | 47 | Finalization wrapper (setjmp + vtable call) | HIGH |
sub_4370F0 | 522 B | 64 | Timing finalization (per-kernel + aggregate) | HIGH |
sub_43B660 | 3,843 B | ~300 | Register/resource constraint calculator | HIGH |
sub_1CB18B0 | ~200 B | 33 | Thread pool constructor (184-byte struct) | HIGH |
sub_1CB1A50 | ~200 B | 21 | Thread pool task submit | HIGH |
sub_1CB1AE0 | -- | -- | Thread pool wait-for-all | HIGH |
sub_1CB1970 | -- | -- | Thread pool destructor | HIGH |
sub_1CC7300 | 2,027 B | -- | GNU Make jobserver client | HIGH |
Diagnostic Strings
| String | Location | Purpose |
|---|---|---|
"def-load-cache" | sub_4428E0 | Cache mode knob read |
"force-load-cache" | sub_4428E0 | Cache mode knob read |
"def-store-cache" | sub_4428E0 | Cache mode knob read |
"force-store-cache" | sub_4428E0 | Cache mode knob read |
"--compile-only" | sub_4428E0 | Option validation |
"--compile-as-tools-patch" | sub_4428E0 | Option validation |
"--extensible-whole-program" | sub_4428E0 | Option validation |
"calls without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
"compilation without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
"unified Functions" | sub_4428E0 | Compile-only restriction |
"suppress-debug-info" | sub_4428E0 | Debug info suppression warning |
"position-independent-code" | sub_4428E0 | PIC mode configuration |
"__cuda_dummy_entry__" | sub_43CC70 | Dummy entry skip check |
"reg-fatpoint" | sub_43CC70 | Register map annotation |
".sass", ".ucode" | sub_43CC70 | Output section names |
"# ============== entry %s ==" | sub_43CC70 | Per-entry SASS banner |
"ptxocg.0.0" | sub_43A400 | Target config identifier |
"specified texturing mode" | sub_43A400 | Texturing mode diagnostic |
".local_maxnreg" | sub_438B50 | Per-function register limit |
"device functions" | sub_438B50 | Compile-only device function handling |
"--compile-only option" | sub_438B50 | Compile-only restriction |
"ptxas" | sub_436DF0 | Thread-local program name |
ELF/Cubin Output
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
After all per-kernel SASS encoding completes, ptxas enters the ELF output phase -- the final stage of the compilation pipeline. This phase transforms the accumulated per-kernel SASS bytes, relocation metadata, constant bank data, shared memory layouts, and debug information into a complete NVIDIA CUBIN file. The CUBIN is a standard ELF container with NVIDIA-proprietary extensions: machine type EM_CUDA (0xBE), non-standard ELF class bytes, CUDA-specific section types, and a rich per-entry metadata system called EIATTR. The output pipeline is a custom implementation with no libelf dependency -- ptxas constructs every byte of the ELF from scratch, including headers, section tables, symbol tables, string tables, relocations, and program headers.
The output phase handles three binary kinds: SASS (raw resolved SASS, legacy default), Mercury (SM 75--99 default), and Capsule Mercury (SM 100+ default, supporting deferred finalization). All three produce a valid CUBIN ELF; the difference is whether the .text sections contain final SASS bytes or Mercury-encoded streams that a downstream finalizer resolves at link or load time.
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte central object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 decompiled lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled -- largest function in output range) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes) |
| Master section allocator | sub_1CABD60 (11,856 bytes, 67 KB decompiled -- shared/constant/local addresses) |
| nvinfo/EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled) |
| Master relocation resolver | sub_1CD48C0 (4,184 bytes, 22 KB decompiled) |
| File serializer | sub_1CD13A0 (2,541 bytes, writes final bytes to disk) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF timing | "ELF-time : %.3f ms (%.2f%%)" in --compiler-stats output |
| Peak ELF memory | "PeakELFMemoryUsage : %.3lf KB" |
Pipeline Overview
The ELF output pipeline runs as a single-threaded sequence after all per-kernel OCG passes have completed (multi-kernel compilation may be parallel, but ELF emission is serialized). The flow is orchestrated by sub_612DE0, which reads compilation flags, constructs the ELFW central object, then drives 11 phases to produce the final .cubin or .o file.
sub_446240 (compilation driver -- "real main")
| all per-kernel OCG passes complete
v
sub_612DE0 (cubin entry, 47KB)
| reads: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
| establishes setjmp/longjmp error recovery
| writes "Cuda compilation tools, release 13.0, V13.0.88"
| "Build cuda_13.0.r13.0/compiler.36424714_0"
|
|-- Phase 1: ELFW construction
| sub_1CB53A0 -- create 672-byte ELFW object, 7 standard sections
|
|-- Phase 2: Per-kernel section creation
| sub_1CB42D0 x N -- .text.<func>, .rela.text.<func> (one per kernel)
| sub_1CB3570 x 44 -- .nv.constant0.<func>, .nv.shared.<func>, etc.
|
|-- Phase 3: Call graph analysis
| sub_1CBB920 -- recursion detection (DFS)
| sub_1CBC090 -- dead function elimination
| sub_1CBE1B0 -- .nv.callgraph section builder
|
|-- Phase 4: Symbol fixup
| sub_1CB2CA0 -- renumber symbols after dead code elimination
| sub_1C99BB0 -- remap .symtab_shndx extended indices
|
|-- Phase 5: Memory allocation
| sub_1CABD60 -- assign addresses: shared, constant, local memory
| sub_1CA92F0 -- shared memory interference graph
| sub_1CA6890 -- constant bank deduplication
|
|-- Phase 6: nvinfo/EIATTR generation
| sub_1CC9800 -- build .nv.info.<func> sections (EIATTR attributes)
| sub_1CC8950 -- propagate barrier/register counts across call graph
|
|-- Phase 7: Symbol table construction
| sub_1CB68D0 -- build .symtab, handle SHN_XINDEX overflow
|
|-- Phase 8: Section layout
| sub_1C9DC60 -- compute file offsets with alignment padding
|
|-- Phase 9: Relocation resolution
| sub_1CD48C0 -- resolve all R_CUDA_* relocations
| sub_1CD5920 -- write .nv.resolvedrela sections
|
|-- Phase 10: Capsule Mercury embedding (SM 100+ only)
| sub_1C9B110 -- create .nv.merc.* section namespace
| sub_1CA2E40 -- clone memory-space sections into merc namespace
| sub_1C9C300 -- build 328-byte capsule descriptors per function
|
|-- Phase 11: Final assembly & write
| sub_1C9F280 -- master ELF emitter (assemble complete CUBIN)
| sub_1CD13A0 -- file serializer (write to disk)
|
v
OUTPUT: .cubin / .o file
Custom ELF Emitter
ptxas builds the entire ELF output without libelf. The custom implementation spans approximately 20 functions in the 0x1C99--0x1CD6 address range (~300 KB of binary code). At the center is the ELFW ("ELF world") object -- a 672-byte structure that owns all sections, symbols, and string tables for a single compilation unit.
ELFW Object Layout
The ELFW constructor sub_1CB53A0 allocates 672 bytes from the pool allocator sub_424070, creates a dedicated "elfw memory space" pool (4,096-byte initial allocation), writes the ELF header, and initializes 7 mandatory sections:
| Index | Section | Type | Purpose |
|---|---|---|---|
| 0 | (null) | SHT_NULL | Required ELF null section |
| 1 | .shstrtab | SHT_STRTAB | Section name string table |
| 2 | .strtab | SHT_STRTAB | Symbol name string table |
| 3 | .symtab | SHT_SYMTAB | Symbol table |
| 4 | .symtab_shndx | SHT_SYMTAB_SHNDX | Extended section indices (for >65,280 sections) |
| 5 | .note.nv.tkinfo | SHT_NOTE | NVIDIA toolkit info (version, build ID, CLI args) |
| 6 | .note.nv.cuinfo | SHT_NOTE | NVIDIA CUDA info (SM version, features) |
| 7 | .nv.uft.entry | SHT_PROGBITS | Unified Function Table entries |
ELF Header
The ELF header follows the standard layout with NVIDIA-specific overrides:
| Offset | Size | Field | CUDA Value |
|---|---|---|---|
| 0x00 | 4 | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1 | e_ident[EI_CLASS] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) |
| 0x05 | 1 | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1 | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1 | e_ident[EI_OSABI] | CUDA ABI version |
| 0x12 | 2 | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x24 | 4 | e_flags | SM version bits [7:0] + CUDA control flags |
Standard ELF uses ELFCLASS32 = 1 and ELFCLASS64 = 2. CUDA cubins use non-standard values '3' (0x33) and 'A' (0x41), which the CUDA driver recognizes as the cubin signature. Any tool using standard libelf will reject these as invalid, which is one reason ptxas uses a custom emitter.
The e_flags field packs the SM architecture version in the low byte (e.g., 100 for sm_100) along with flags for relocatable vs. executable mode and address size. The mask 0x7FFFBFFF (clears bits 14 and 31) is applied during finalization to strip internal control flags that must not appear in the output cubin.
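A minimal probe for these overrides, using the standard Elf64 field offsets (illustrative, not part of ptxas; assumes a little-endian host):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EM_CUDA 0x00BE  /* 190, per the header table above */

/* Returns the SM version from e_flags (low byte), or -1 if the
   buffer does not look like a 64-bit cubin. Offsets are standard
   Elf64: e_machine at 0x12, e_flags at 0x24. */
static int cubin_sm_version(const uint8_t *buf, size_t len) {
    if (len < 0x28) return -1;
    if (memcmp(buf, "\x7f""ELF", 4) != 0) return -1;
    if (buf[4] != 0x41) return -1;        /* 'A' = non-standard ELFCLASS64 */
    uint16_t e_machine;
    memcpy(&e_machine, buf + 0x12, 2);    /* little-endian host assumed */
    if (e_machine != EM_CUDA) return -1;
    uint32_t e_flags;
    memcpy(&e_flags, buf + 0x24, 4);
    return (int)(e_flags & 0xFF);         /* SM version in bits [7:0] */
}
```

Standard libelf rejects the buffer at the EI_CLASS check, which is exactly why a custom probe like this is needed.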
Section Generation
Each kernel/function produces a set of ELF sections. For a program with N entry functions and M device functions, the CUBIN contains at least 4*(N+M) sections. The section creator sub_1CB3570 (44 call sites) handles the generic case; sub_1CB42D0 is specialized for .text.<funcname> code sections.
Per-Kernel Sections
For each kernel entry my_kernel, the output pipeline generates:
| Section | Type | Content |
|---|---|---|
| .text.my_kernel | SHT_PROGBITS | SASS instruction bytes (SHF_ALLOC \| SHF_EXECINSTR) |
| .rela.text.my_kernel | SHT_RELA | Relocation entries for the code section |
| .nv.info.my_kernel | SHT_CUDA_INFO (0x70000064) | EIATTR metadata (register count, barriers, stack sizes, etc.) |
| .nv.constant0.my_kernel | SHT_PROGBITS | Constant bank 0 data (kernel parameters + literal constants) |
Additional per-kernel sections are generated as needed:
| Section | Condition |
|---|---|
| .nv.shared.my_kernel | Kernel uses shared memory |
| .nv.local.my_kernel | Kernel uses local (spill) memory |
| .nv.global.init | Program uses initialized global variables |
| .nv.callgraph | Relocatable object mode (-c) |
| .nv.prototype | Prototype information for cross-module linking |
Global Sections
| Section | Purpose |
|---|---|
| .nv.info | Global EIATTR attributes (not per-kernel) |
| .nv.constant0 | Merged constant bank (whole-program mode) |
| .nv.reservedSmem | Reserved shared memory for runtime (tmem allocation, mbarrier parity) |
| .nv.metadata | Module-level metadata |
| .nv.compat | Forward-compatibility attributes |
| .note.nv.cuver | CUDA version note |
| .nv.uft | Unified Function Table (indirect call support) |
| .nv.udt | Unified Data Table |
| .nv.uft.entry | UFT entry point table |
| .nv.udt.entry | UDT entry point table |
| .nv.rel.action | Relocation action table |
| .nv.resolvedrela | Resolved relocations (post-linking) |
| .nv.host | Host-side interop data |
Constant Banks
CUDA supports up to 18 numbered constant banks (0--17) plus 7 named banks:
| Bank | Name | Purpose |
|---|---|---|
| 0 | .nv.constant0 | Kernel parameters + compiler constants (per-entry) |
| 1--17 | .nv.constant1--.nv.constant17 | User-declared __constant__ variables |
| -- | .nv.constant.entry_params | Entry point parameter block |
| -- | .nv.constant.entry_image_header_indices | Texture/surface header index table |
| -- | .nv.constant.driver | Driver-injected constants |
| -- | .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| -- | .nv.constant.user | User-specified constants |
| -- | .nv.constant.pic | Position-independent code constants |
| -- | .nv.constant.tools_data | Tools/debugger-injected data |
Section Ordering
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:
| Bucket | Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.* (EIATTR metadata sections) |
| 7 (lowest) | .debug_*, .nv.merc.* (debug and Mercury metadata) |
Section file offsets are assigned by sub_1C9DC60, walking the sorted list with alignment padding:
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
section_t* sec = sorted_sections[i];
if (sec->sh_addralign > 1)
offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
sec->sh_offset = offset;
offset += sec->sh_size;
}
Two section types receive special treatment during layout: .nv.constant0 (address assigned by the OCG constant bank allocator, not the layout calculator) and .nv.reservedSmem (address assigned by the shared memory master allocator sub_1CABD60).
EIATTR Metadata
Each kernel's .nv.info.<funcname> section contains a sequence of EIATTR (Entry Information Attribute) records. These encode per-kernel metadata that the CUDA driver reads at launch time to configure the hardware correctly. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled, 51 callees) -- one of the largest functions in the output pipeline.
EIATTR Encoding
Each EIATTR record is a TLV (type-length-value) structure:
+--------+--------+------------------+
| type | length | value |
| 2 bytes| 2 bytes| length bytes |
+--------+--------+------------------+
The type field is a 16-bit EIATTR code. The length field specifies the payload size. The value is type-dependent (scalar, array, or structured).
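A minimal walker over this TLV layout as described above (illustrative helper, not recovered code; a real consumer must also know each type's payload semantics):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the TLV records of an .nv.info payload (2-byte LE type,
   2-byte LE length, then `length` value bytes) and return the
   offset of the first matching record's value, or -1. */
static long eiattr_find(const uint8_t *p, size_t len, uint16_t want) {
    size_t off = 0;
    while (off + 4 <= len) {
        uint16_t type = (uint16_t)(p[off]     | (p[off + 1] << 8));
        uint16_t vlen = (uint16_t)(p[off + 2] | (p[off + 3] << 8));
        if (off + 4 + vlen > len) break;   /* truncated record */
        if (type == want) return (long)(off + 4);
        off += 4 + (size_t)vlen;           /* skip header + payload */
    }
    return -1;
}
```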
EIATTR Catalog
ptxas v13.0.88 defines 98 EIATTR codes. The critical ones that every cubin emitter must produce:
| EIATTR | Purpose | Encoding |
|---|---|---|
| EIATTR_REGCOUNT | Register count for this kernel | 4-byte LE integer |
| EIATTR_NUM_BARRIERS | Hardware barrier count (0--16) | 4-byte LE integer |
| EIATTR_FRAME_SIZE | Per-thread stack frame size in bytes | 4-byte LE integer |
| EIATTR_MIN_STACK_SIZE | Minimum call stack size | 4-byte LE integer |
| EIATTR_MAX_STACK_SIZE | Maximum call stack size (recursive) | 4-byte LE integer |
| EIATTR_CRS_STACK_SIZE | Call/Return/Sync stack size | 4-byte LE integer |
| EIATTR_EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions in .text | Array of 4-byte offsets |
| EIATTR_S2RCTAID_INSTR_OFFSETS | Byte offsets of S2R SR_CTAID.* instructions | Array of 4-byte offsets |
| EIATTR_CTAIDZ_USED | Kernel reads CTA ID Z dimension | Flag (0-byte payload) |
| EIATTR_REQNTID | Required thread block dimensions | 3x 4-byte integers (X, Y, Z) |
| EIATTR_MAX_THREADS | Maximum threads per block | 4-byte LE integer |
| EIATTR_PARAM_CBANK | Constant bank for kernel parameters | 4-byte bank index + offset |
| EIATTR_CBANK_PARAM_SIZE | Size of parameter constant bank region | 4-byte LE integer |
| EIATTR_KPARAM_INFO | Per-parameter ordinal/offset/size/alignment | Structured (V1) |
| EIATTR_KPARAM_INFO_V2 | Per-parameter info, extended format | Structured (V2) |
| EIATTR_MAXREG_COUNT | Maximum register count directive | 4-byte LE integer |
| EIATTR_EXTERNS | List of external symbol references | Array of symbol indices |
Additional EIATTR codes for textures/surfaces, barriers, cooperative groups, tensor cores, and hardware workarounds:
| EIATTR | Purpose |
|---|---|
| EIATTR_IMAGE_SLOT | Texture/surface image binding slot |
| EIATTR_SAMPLER_INIT | Sampler initialization data |
| EIATTR_TEXID_SAMPID_MAP | Texture-to-sampler mapping |
| EIATTR_BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offset table |
| EIATTR_SYNC_STACK | Synchronization stack requirements |
| EIATTR_COOP_GROUP_MASK_REGIDS | Cooperative group register IDs |
| EIATTR_NUM_MBARRIERS | Number of mbarrier objects used |
| EIATTR_MBARRIER_INSTR_OFFSETS | mbarrier instruction locations |
| EIATTR_WMMA_USED | Kernel uses WMMA (Tensor Core) instructions |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 5th-gen Tensor Core (1-CTA mode) |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 5th-gen Tensor Core (2-CTA mode) |
| EIATTR_SPARSE_MMA_MASK | Structured sparsity mask for MMA |
| EIATTR_CTA_PER_CLUSTER | CTAs per cluster (SM 90+) |
| EIATTR_EXPLICIT_CLUSTER | Kernel requires explicit cluster launch |
| EIATTR_MAX_CLUSTER_RANK | Maximum cluster rank |
| EIATTR_REG_RECONFIG | Register reconfiguration (setmaxreg) |
| EIATTR_SAM_REGION_STACK_SIZE | SAM (Shared Address Mode) region stack |
| EIATTR_RESERVED_SMEM_USED | Kernel uses reserved shared memory |
| EIATTR_RESERVED_SMEM_0_SIZE | Size of reserved shared memory region 0 |
| EIATTR_SW1850030_WAR | Hardware workaround (bug SW-1850030) |
| EIATTR_SW2393858_WAR | Hardware workaround (bug SW-2393858) |
| EIATTR_SW2861232_WAR | Hardware workaround (bug SW-2861232) |
| EIATTR_CUDA_API_VERSION | Required CUDA API version |
| EIATTR_MERCURY_ISA_VERSION | Mercury ISA version for capmerc binaries |
| EIATTR_MERCURY_FINALIZER_OPTIONS | Finalizer tuning for capmerc |
| EIATTR_SYSCALL_OFFSETS | Byte offsets of syscall instructions |
| EIATTR_INSTR_REG_MAP | Instruction-to-register allocation map (debug) |
| EIATTR_STATISTICS | Per-kernel compilation statistics |
| EIATTR_PERF_STATISTICS | Performance statistics |
Barrier and Register Count Propagation
The EIATTR builder runs a cross-function propagation pass via sub_1CC8950 (2,634 bytes). When a device function uses barriers or a high register count, these requirements must propagate upward to the entry kernel that calls it:
"regcount %d for %s propagated to entry %s"
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
of %s of entry symbol %s"
This ensures that the CUDA driver allocates sufficient barriers and registers for the entry kernel, accounting for all callees.
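The propagation those strings describe amounts to a max-fold over reachable callees. A sketch under stated assumptions: the adjacency-matrix call graph below is illustrative, not ptxas's data structure.

```c
#include <assert.h>

/* Illustrative upward propagation: each entry kernel's barrier
   count must cover every transitively reachable callee.
   The NFUNC x NFUNC adjacency matrix is a made-up layout. */
#define NFUNC 4
static int propagate_barcount(int calls[NFUNC][NFUNC],
                              int own[NFUNC], int entry) {
    int seen[NFUNC] = {0};
    int stack[NFUNC], sp = 0, max = own[entry];
    stack[sp++] = entry;
    seen[entry] = 1;
    while (sp) {
        int f = stack[--sp];
        for (int g = 0; g < NFUNC; g++)
            if (calls[f][g] && !seen[g]) {
                seen[g] = 1;
                if (own[g] > max) max = own[g];  /* "Propagating higher barcount" */
                stack[sp++] = g;
            }
    }
    return max;
}
```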
Relocation Processing
The relocation system handles symbol resolution for branch targets, constant bank references, function descriptors, texture/surface bindings, and address computations. The master relocation resolver is sub_1CD48C0 (4,184 binary bytes, 22 KB decompiled). ptxas defines 117 CUDA-specific relocation types (R_CUDA_NONE through R_CUDA_NONE_LAST).
Relocation Categories
| Category | Types | Purpose |
|---|---|---|
| Absolute address | R_CUDA_32, R_CUDA_64, R_CUDA_ABS* | Global memory addresses |
| Global address | R_CUDA_G32, R_CUDA_G64, R_CUDA_G8_* | Global-space addresses |
| PC-relative | R_CUDA_PCREL_IMM24_26, R_CUDA_PCREL_IMM24_23 | Branch/call targets |
| Constant field | R_CUDA_CONST_FIELD19_*, R_CUDA_CONST_FIELD21_*, R_CUDA_CONST_FIELD22_* | Constant bank references |
| Function descriptor | R_CUDA_FUNC_DESC* | Indirect function call targets |
| Texture/surface | R_CUDA_TEX_HEADER_INDEX, R_CUDA_SAMP_HEADER_INDEX, R_CUDA_SURF_* | Texture/surface binding |
| Bindless | R_CUDA_BINDLESSOFF*, R_CUDA_TEX_BINDLESSOFF* | Bindless texture/surface offsets |
| Sub-byte patching | R_CUDA_8_0 through R_CUDA_8_56 | Individual byte within a 64-bit instruction |
| Unified address | R_CUDA_UNIFIED, R_CUDA_UNIFIED_32, R_CUDA_UNIFIED_8_* | Unified address space |
| Instruction-level | R_CUDA_INSTRUCTION64, R_CUDA_INSTRUCTION128 | Whole-instruction replacement |
| Yield/NOP | R_CUDA_YIELD_OPCODE9_0, R_CUDA_YIELD_CLEAR_PRED4_87 | YIELD-to-NOP patching |
| Cleanup | R_CUDA_UNUSED_CLEAR32, R_CUDA_UNUSED_CLEAR64 | Zero out unused instruction fields |
Resolution Logic
The resolver performs these operations for each relocation entry:
- Alias resolution -- redirect relocations from alias symbols to their targets ("change alias reloc %s to %s")
- Dead function filtering -- skip relocations on eliminated functions ("ignore reloc on dead func %s")
- UFT/UDT pseudo-relocation -- handle __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL synthetic symbols
- PC-relative validation -- ensure branch targets are in the same section ("PC relative branch address should be in the same section")
- YIELD-to-NOP conversion -- convert YIELD instructions to NOP when forward progress requirements prevent yielding
- Unified reloc replacement -- convert type 103 (unified) to type 1 (absolute) for final resolution
- Address computation -- compute final patched value from symbol address + addend
Output relocation sections (.nv.resolvedrela) are written by sub_1CD5920.
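The address-computation step reduces to patching S + A at the relocation's offset. A sketch for a 32-bit absolute relocation in the style of R_CUDA_32 (illustrative; little-endian host assumed, and real resolution also validates ranges and section bounds):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Patch symbol_address + addend into the section bytes at
   r_offset, as the final step of absolute-reloc resolution. */
static void patch_abs32(uint8_t *section, uint64_t r_offset,
                        uint64_t sym_addr, int64_t addend) {
    uint32_t value = (uint32_t)(sym_addr + (uint64_t)addend);
    memcpy(section + r_offset, &value, 4);  /* little-endian host assumed */
}
```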
Debug Information
When --device-debug or --generate-line-info is active, ptxas generates DWARF debug sections. The debug subsystem spans 0x1CBF--0x1CC9 and includes parsers, emitters, and dumpers for the standard DWARF sections plus NVIDIA-specific debug extensions.
Debug Sections
| Section | Content |
|---|---|
| .debug_info | DWARF DIE tree (compilation units, types, variables) |
| .debug_abbrev | DWARF abbreviation table |
| .debug_line | Source-to-address line number mapping |
| .debug_frame | Call frame information for unwinding |
| .debug_loc | Location lists for variables |
| .debug_str | DWARF string table |
| .debug_ranges | Address ranges |
| .debug_aranges | Address range lookup table |
| .debug_pubnames | Public name index |
| .debug_pubtypes | Public type index |
| .nv_debug_ptx_txt | Embedded PTX source text |
| .nv_debug_line_sass | SASS-level line number mapping |
| .nv_debug_info_reg_sass | Register allocation debug info |
| .nv_debug_info_reg_type | Register type information |
Key debug infrastructure functions:
| Function | Purpose |
|---|---|
| sub_1CBF820 | DWARF form name table (DW_FORM_addr, DW_FORM_data4, etc.) |
| sub_1CBF9B0 | DWARF attribute name table (DW_AT_producer, DW_AT_comp_dir, etc.) |
| sub_1CC0850 | .debug_abbrev parser/emitter (18 KB decompiled) |
| sub_1CC4A40 | .debug_info DIE tree walker (28 KB decompiled) |
| sub_1CC34E0 | DWARF location expression decoder (DW_OP_* operations) |
| sub_1CC24C0 | .debug_info emission pass (18 KB decompiled) |
| sub_1CC5EB0 | Compilation unit header parser |
| sub_1C9D1F0 | Debug section classifier/mapper (16 KB decompiled) |
The --suppress-debug-info option (sub_432A00) disables debug section generation even when debug flags are present.
Capsule Mercury Output Path
For SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series), the default output mode is Capsule Mercury. In this mode, the CUBIN ELF contains two layers of content: standard CUBIN sections and a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization.
Mercury-Specific Sections
| Section | Purpose |
|---|---|
| .nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
| .nv.merc.debug_info | Cloned DWARF info |
| .nv.merc.debug_line | Cloned DWARF line table |
| .nv.merc.debug_frame | Cloned DWARF frame info |
| .nv.merc.debug_loc | Cloned DWARF locations |
| .nv.merc.debug_str | Cloned DWARF string table |
| .nv.merc.debug_ranges | Cloned DWARF ranges |
| .nv.merc.debug_aranges | Cloned DWARF address ranges |
| .nv.merc.debug_pubnames | Cloned DWARF public names |
| .nv.merc.debug_pubtypes | Cloned DWARF public types |
| .nv.merc.debug_macinfo | Cloned DWARF macro info |
| .nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
| .nv.merc.nv_debug_line_sass | SASS-level line mapping |
| .nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
| .nv.merc.nv_debug_info_reg_type | Register type debug info |
| .nv.merc.symtab_shndx | Extended section index table (merc copy) |
| .nv.merc.nv.shared.reserved | Shared memory reservation metadata |
| .nv.merc.rela<secname> | Per-section relocation tables |
Capsule Mercury Construction
The capmerc path is integrated into the master ELF emitter. The sequence:
1. sub_1C9B110 (23 KB decompiled) creates the .nv.merc.* section namespace
2. sub_1CA2E40 (18 KB decompiled) clones constant/global/shared/local sections into the merc namespace, creating .nv.merc.rela relocation sections
3. sub_1C9C300 (24 KB decompiled) processes .nv.capmerc<funcname> sections. It constructs a 328-byte capsule descriptor per function containing: the Mercury-encoded instruction stream, relocation metadata, a KNOBS configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage)
4. sub_1CA3A90 (45 KB decompiled) merges sections that have both merc and non-merc copies
5. sub_1C99BB0 (25 KB decompiled) remaps section indices after merc section insertion
Off-Target Finalization
The cubin entry sub_612DE0 implements a "fastpath optimization" for off-target finalization:
"[Finalizer] fastpath optimization applied for off-target %u -> %u finalization"
When a capmerc binary compiled for SM X is finalized for SM Y (within the same family), the fastpath patches the Mercury instruction stream directly without full recompilation. The compatibility checker sub_60F290 determines whether fastpath is safe based on instruction set compatibility, register file layout, and memory model.
Self-Check
The --self-check option performs roundtrip verification: generate capmerc, reconstitute SASS from the capsule, and compare section-by-section. The verifier uses a Flex SASS lexer (sub_720F00, 64 KB) and a comparator (sub_729540, 35 KB). Error string: "Failure of '%s' section in self-check for capsule mercury".
Multi-Kernel Output
A typical CUDA program compiles multiple kernels and device functions into a single CUBIN. The output pipeline handles this through per-function section isolation, combined with cross-function analysis for call graph construction, barrier propagation, and dead code elimination.
Per-Function Section Layout
Each entry function and each device function gets its own .text section (the -ffunction-sections pattern). This enables:
- Function-level dead code elimination -- sub_1CBC090 removes .text, .rela.text, .nv.info, and .nv.constant0 sections for unreachable functions
- Linker granularity -- nvlink can select individual functions from relocatable objects
- Driver loading -- the CUDA runtime can load individual kernels by name
Call Graph Construction
The call graph builder (sub_1CBE1B0) emits a .nv.callgraph section that encodes inter-function call edges. This section is present only in relocatable object mode (-c). The recursion detector (sub_1CBB920) performs a DFS traversal with manual 9-level unrolling, emitting "recursion at function %d" for each cycle found.
Dead functions are eliminated by sub_1CBC090:
"dead function %d(%s)"
"removed un-used section %s (%d)" (x8 -- once per section type)
"function %d(%s) has address taken but no call to it"
Memory Allocation Across Kernels
The master section allocator sub_1CABD60 (11,856 binary bytes, 67 KB decompiled, 69 callees) assigns addresses to all memory-space sections across all kernels. It runs a multi-pass algorithm:
- Global shared allocation -- shared variables visible to multiple kernels
- Per-entry shared memory -- shared variables private to each kernel
- Extern shared handling -- dynamically-sized shared memory (extern __shared__)
- Reserved shared memory -- runtime reservations (.nv.reservedSmem.begin, .nv.reservedSmem.cap, .nv.reservedSmem.offset0, .nv.reservedSmem.offset1)
- Local memory -- per-thread spill storage
- Constant bank merging -- merges constant bank data across kernels, with deduplication (sub_1CA6890: "found duplicate value 0x%x, alias %s to %s")
The shared memory allocator sub_1CA92F0 (2,804 bytes) builds an interference graph for shared objects and performs group allocation for non-overlapping variables.
SHN_XINDEX Overflow
Large CUDA programs can exceed the ELF 65,280-section limit (SHN_LORESERVE = 0xFF00). Each kernel generates at minimum 4 sections (.text, .rela.text, .nv.info, .nv.constant0), so a program with 16,000+ kernels triggers the overflow mechanism:
- e_shnum = 0 in the ELF header (signals overflow)
- Section header [0].sh_size = real section count
- e_shstrndx = SHN_XINDEX (0xFFFF)
- Section header [0].sh_link = real .shstrtab index
- Symbol st_shndx = SHN_XINDEX when real index >= 0xFF00
- .symtab_shndx entries hold the actual section indices
This is standard ELF overflow handling, and it is production-critical -- sub_1CB68D0 checks for it with "overflow number of sections %d".
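The consumer-side lookup is the standard one (a minimal sketch, not recovered code): fall back to the parallel .symtab_shndx array only when st_shndx is the escape value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00
#define SHN_XINDEX    0xFFFF

/* Standard ELF extended-index lookup: when a symbol's st_shndx
   is SHN_XINDEX, the real section index lives in the parallel
   .symtab_shndx array (one u32 per symbol table entry). */
static uint32_t real_shndx(uint16_t st_shndx,
                           const uint32_t *shndx_tab, size_t sym_idx) {
    if (st_shndx == SHN_XINDEX)
        return shndx_tab[sym_idx];
    return st_shndx;
}
```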
Key Functions
| Address | Size | Decompiled | Purpose |
|---|---|---|---|
| sub_612DE0 | ~12 KB | 47 KB | Cubin generation entry point |
| sub_1C9F280 | 15,263 B | 97 KB | Master ELF emitter |
| sub_1CC9800 | 14,764 B | 90 KB | nvinfo/EIATTR section builder |
| sub_1CABD60 | 11,856 B | 67 KB | Master section allocator (shared/const/local) |
| sub_1CB68D0 | 9,578 B | 49 KB | Symbol table builder (.symtab) |
| sub_1CA3A90 | 6,289 B | 45 KB | Section merger (merc + non-merc) |
| sub_1C9DC60 | 5,663 B | 29 KB | Section layout calculator |
| sub_1C99BB0 | 4,900 B | 25 KB | Section index remap (.symtab_shndx) |
| sub_1C9C300 | 3,816 B | 24 KB | Capsule Mercury section processor |
| sub_1C9B110 | 4,585 B | 23 KB | Mercury capsule builder |
| sub_1CD48C0 | 4,184 B | 22 KB | Master relocation resolver |
| sub_1CBC090 | 2,870 B | 20 KB | Dead function eliminator |
| sub_1CA2E40 | 3,152 B | 18 KB | Mercury section cloner |
| sub_1CA92F0 | 2,804 B | 16 KB | Shared memory interference graph |
| sub_1C9D1F0 | 2,667 B | 16 KB | Debug section classifier |
| sub_1CA6890 | 2,286 B | 15 KB | Constant bank deduplication |
| sub_1CC8950 | 2,634 B | 15 KB | Barrier/register count propagator |
| sub_1CBB920 | ~2,000 B | 14 KB | Recursion detector (DFS) |
| sub_1CB91C0 | 2,668 B | 13 KB | ELF structure dumper (debug) |
| sub_1CB53A0 | 3,480 B | 13 KB | ELFW constructor |
| sub_1CAB300 | 2,157 B | 12 KB | Bindless texture/surface handler |
| sub_1CD5920 | 1,985 B | 11 KB | Relocation writer (.nv.resolvedrela) |
| sub_1CD13A0 | 2,541 B | 11 KB | File serializer (final disk write) |
| sub_1CBE1B0 | 1,992 B | 10 KB | .nv.callgraph section builder |
| sub_1CD22E0 | 1,979 B | 10 KB | UFT manager |
| sub_1CB3570 | 1,963 B | 10 KB | Section creator (44 call sites) |
| sub_1C98C60 | 1,755 B | 9 KB | Mercury debug section classifier |
| sub_1CB2CA0 | 2,038 B | 8 KB | Symbol fixup (post-deletion) |
| sub_1CC7FB0 | -- | -- | .nv.info section name formatter |
| sub_1CB9FF0 | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | Get section by index |
Cross-References
- Custom ELF Emitter -- deep dive into ELFW object, header construction, section management, file serialization
- Section Catalog & EIATTR -- complete inventory of section types and EIATTR attribute encoding
- Relocations & Symbols -- relocation resolution, UFT/UDT management, symbol table details
- Debug Information -- DWARF generation and
.debug_*section handling - Capsule Mercury & Finalization -- capmerc packaging format, off-target finalization, self-check
- Mercury Encoder -- Mercury instruction encoding (phases 117--122) that feeds the ELF emitter
- SASS Code Generation -- the upstream per-kernel compilation that produces SASS bytes
- Pipeline Overview -- where the ELF phase fits in the full PTX-to-SASS flow
The Ori Internal Representation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Ori -- short for "Original IR" -- is ptxas's sole intermediate representation. It is a fully proprietary, SASS-level IR with virtual registers, its own CFG infrastructure, and a partial-SSA discipline. Ori has no relationship to LLVM IR: there is no LLVM Value hierarchy, no LLVM-style use-def chains, no SSA dominance-frontier construction. Every IR-level optimization pass in ptxas (prefixed Ori in the NamedPhases table: OriCopyProp, OriSanitize, OriBranchOpt, OriLoopSimplification, OriStrengthReduce, OriDoPredication, etc.) operates on this representation.
The key design decision that distinguishes Ori from PTX: Ori uses SASS opcode names, not PTX mnemonics. After the MercConverter pass (sub_9F1A90, 35KB) runs, every instruction carries the name of the hardware SASS instruction it will become -- IMAD, FFMA, LDG, STG, BAR, BRA, EXIT, etc. -- just with virtual (not physical) register operands. This means the optimizer already knows exactly which hardware functional unit each instruction will execute on, enabling accurate latency modeling and scheduling from the earliest optimization phases.
Key Facts
| Property | Value |
|---|---|
| Name | Ori ("Original IR") |
| Heritage | Fully proprietary (not LLVM-based) |
| Level | SASS machine-level with virtual registers |
| SSA form | Partial -- constructed by phase 23, destroyed by phase 73 |
| Code Object size | ~1136 bytes per function (C++ object) |
| Code Object vtable | 0x21EE238 |
| Register files | 4: R (GPR), UR (uniform), P (predicate), UP (uniform predicate) |
| Operand kinds | 10 distinct types |
| CFG representation | FNV-1a hash maps for successor/backedge edges |
| Opcode encoding | ROT13 of real SASS mnemonic names |
| BB entry size | 40 bytes per basic block, contiguous array |
| Instruction linkage | Doubly-linked list within each basic block |
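Per the Key Facts table, stored opcode names are the ROT13 of the real SASS mnemonics, so a dumper only needs a standard ROT13 pass to recover them (this helper is illustrative, not recovered code):

```c
#include <assert.h>
#include <string.h>

/* In-place ROT13 over A-Z/a-z; digits and punctuation (e.g. the
   '.' in modifier suffixes) pass through unchanged. ROT13 is its
   own inverse, so the same pass encodes and decodes. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + 13) % 26);
    }
}
```

For example, a stored "VZNQ" decodes to the SASS mnemonic IMAD.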
Architecture Overview
PTX source
|
v
[Flex/Bison parser] -- see pipeline/ptx-parser.md
|
v
[PTX-to-Ori lowering] -- see pipeline/ptx-to-ori.md
|
v
+-------------------------------------------+
| Ori IR |
| |
| Code Object (per-function container) |
| +-- Basic Block array (40B entries) |
| | +-- Instruction linked list |
| | +-- Packed operand array |
| +-- CFG (FNV-1a hash map edges) |
| +-- RPO array |
| +-- Register file arrays |
| +-- Backedge map |
+-------------------------------------------+
|
| 159 optimization phases (phases 0-158)
| phase 23: GenerateMovPhi (enter partial SSA)
| phase 73: ConvertAllMovPhiToMov (exit partial SSA)
|
v
[Instruction selection] -- see codegen/isel.md
|
v
[Register allocation] -- see regalloc/overview.md
|
v
[Instruction scheduling] -- see scheduling/overview.md
|
v
[SASS binary encoding] -- see codegen/encoding.md
The Code Object
Every function under compilation is represented by a single Code Object -- a ~1136-byte C++ structure that owns all IR data for that function. The Code Object vtable is at 0x21EE238. Its constructor is at sub_A3B080.
Field Map
| Offset | Type | Field | Description |
|---|---|---|---|
| +24 | u32 | sm_version | SM target (encoded: 12288=sm30, 20481=sm50, 36865=sm90) |
| +72 | ptr | code_buf | Output code object buffer |
| +88 | ptr | reg_file | Register descriptor array. *(ctx+88)+8*regId -> descriptor |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +272 | ptr | instr_head | Instruction linked-list head |
| +296 | ptr | bb_array | Basic block array pointer (40B per entry) |
| +304 | u32 | bb_index | Basic block array count/current index |
| +312 | ptr | options | OptionsManager* for knob queries |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +768 | ptr | const_sections | Constant memory section array |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (see data-structures.md) |
| +1664 | ptr | knob_container | Knob container pointer (for -knob queries) |
| +1928 | ptr | codegen_ctx | Code object / code generation context |
Register and Instruction Counts (SM Backend Object)
The register counts and instruction counts live in the SM backend object at *(code_obj+1584), accessed via DWORD-indexed fields (not Code Object byte offsets). Earlier versions of this page incorrectly listed these as Code Object offsets +99, +102, +159, +335, +341 -- those are DWORD indices, making the actual byte offsets 396, 408, 636, 1340, and 1364 respectively within the SM backend.
| DWORD Index | Byte Offset | Type | Field | Description |
|---|---|---|---|---|
| [99] | +396 | u32 | ur_count | Uniform register (UR) count |
| [102] | +408 | u32 | r_alloc | R-register count (allocated) |
| [159] | +636 | u32 | r_reserved | R-register count (reserved) |
| [335] | +1340 | u32 | instr_hi | Instruction count (upper bound) |
| [341] | +1364 | u32 | instr_lo | Instruction count (lower bound) |
Register count formula (from sub_A4B8F0, where v5 = *(_DWORD **)(ctx + 1584)):
total_R_regs = v5[159] + v5[102] // reserved + allocated
instruction_count = v5[335] - v5[341] // upper - lower
The stats emitter at sub_A3A7E0 prints a detailed per-function profile:
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
Basic Blocks
Basic blocks are stored as 40-byte entries in a contiguous array at Code Object +976. The block count is at +984.
Block Entry Layout (40 bytes)
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Head instruction pointer (first instruction in BB) |
| +8 | ptr | Instruction list link / tail |
| +28 | i32 | bix -- block index (unique ID for CFG operations) |
| +32 | u64 | Flags / padding |
Blocks are additionally accessible via a sub-block array at Code Object +368, indexed as *(ctx+368) + 8*blockIndex.
The debug dumper (sub_BE21D0) emits Graphviz DOT output for the CFG:
digraph f {
node [fontname="Courier" ...]
bix0 -> bix1
bix0 -> bix3
bix1 -> bix2
bix2 -> bix1 // backedge (loop)
bix2 -> bix3
}
Control Flow Graph
The CFG uses FNV-1a hash maps to represent edges. Two separate hash tables exist at Code Object offsets +648 (successor edges) and +680 (backedge info).
FNV-1a Hashing
All CFG hash lookups use the same parameters, confirmed across 50+ call sites:
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
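A C sketch of the hash under these parameters (the generic byte loop is ours; ptxas inlines the equivalent code at each call site):

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a with the recovered parameters. */
static uint32_t fnv1a(const uint8_t *p, size_t n) {
    uint32_t h = 0x811C9DC5u;           /* initial hash (offset basis) */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];                      /* xor the byte first ...      */
        h *= 16777619u;                 /* ... then multiply by prime  */
    }
    return h;
}

/* CFG lookups hash the 4-byte block index byte-by-byte (little-endian). */
static uint32_t cfg_hash_bix(uint32_t bix) {
    uint8_t b[4] = { bix & 0xFF, (bix >> 8) & 0xFF,
                     (bix >> 16) & 0xFF, (bix >> 24) & 0xFF };
    return fnv1a(b, 4);
}
```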
Hash Map Structure
Each hash map uses chained hashing with 24-byte bucket entries:
Bucket (24 bytes):
+0 node* head // first node in chain
+8 node* tail // last node in chain
+16 i32 count // entries in this bucket
Full Node (64 bytes):
+0 node* next // chain link
+8 i32 key // block index
+16 ptr values // successor/predecessor block list
+32 sub-hash data // embedded sub-table for multi-edge blocks
+56 u32 hash // cached FNV-1a hash
Simple Node (16 bytes):
+0 node* next
+8 i32 key
+12 u32 hash
Growth policy: rehash when total_elements > num_unique_keys (load factor exceeds 1.0). Capacity doubles on each rehash.
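Putting the pieces together, a lookup against one of these maps looks like the following sketch (struct and function names are ours; the embedded sub-table and cached-hash fields of the full node are omitted):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct CfgNode {
    struct CfgNode *next;    /* +0  chain link           */
    int32_t         key;     /* +8  block index          */
    void           *values;  /* +16 successor block list */
} CfgNode;

typedef struct {
    CfgNode *head;           /* +0  first node in chain */
    CfgNode *tail;           /* +8  last node in chain  */
    int32_t  count;          /* +16 entries in bucket   */
} CfgBucket;

static void *cfg_lookup(const CfgBucket *b, size_t nbuckets, int32_t bix) {
    uint32_t h = 0x811C9DC5u;                 /* FNV-1a over the bix bytes */
    for (int i = 0; i < 4; i++)
        h = (h ^ (((uint32_t)bix >> (8 * i)) & 0xFF)) * 16777619u;
    for (const CfgNode *n = b[h % nbuckets].head; n; n = n->next)
        if (n->key == bix)
            return n->values;                 /* found: successor list */
    return NULL;                              /* no edge info recorded */
}
```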
Key CFG Functions
| Address | Size | Function | Notes |
|---|---|---|---|
sub_BDE150 | 9KB | CFG::computeRPO | Explicit DFS stack, assigns RPO numbers into +720 array |
sub_BDE8B0 | 2KB | CFG::printEdges | FNV-1a lookup, prints "bix%d -> bix%d\n" |
sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges | RPO + backedge debug dump |
sub_BE0690 | 54KB | CFG::buildAndAnalyze | Main CFG constructor: predecessors, successors, RPO, loop detection |
sub_BE21D0 | 1.4KB | CFG::dumpDOT | Graphviz DOT format output |
sub_BE2330 | 4KB | CFG::computeDominators | Post-build dominator/loop analysis with bitvector ops |
The RPO dump (sub_BDEA50) produces output like:
Showing RPO state for each basic block:
bix0 -> RPONum: 0
bix1 -> RPONum: 1
bix2 -> RPONum: 3
bix3 -> RPONum: 2
RPO traversal order: bix0, bix1, bix3, bix2
Showing backedge info:
bix2 -> backedge's successor BB: 1
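The computeRPO pass (sub_BDE150) uses an explicit DFS stack rather than recursion. A self-contained sketch of the numbering scheme (the adjacency arrays are ours -- the real pass walks the FNV-1a edge maps -- and exact numbers depend on successor visit order, so they can differ from the dump above):

```c
#define MAXBB 16

/* Reverse post-order: number blocks as the DFS *finishes* them,
   counting down, so the entry block (bix0) always gets RPO 0. */
static void compute_rpo(int nblocks, int succ[][MAXBB],
                        const int nsucc[], int rpo_num[]) {
    int  stack[MAXBB], child_ix[MAXBB];
    char visited[MAXBB] = {0};
    int  sp = 0, next_po = nblocks;
    stack[0] = 0; child_ix[0] = 0; visited[0] = 1;  /* start at entry */
    while (sp >= 0) {
        int b = stack[sp];
        if (child_ix[sp] < nsucc[b]) {
            int s = succ[b][child_ix[sp]++];
            if (!visited[s]) {                      /* descend into s */
                visited[s] = 1;
                ++sp; stack[sp] = s; child_ix[sp] = 0;
            }
        } else {
            rpo_num[b] = --next_po;                 /* b is finished */
            --sp;
        }
    }
}
```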
Instructions
Instructions are C++ objects with a large vtable, linked into per-basic-block doubly-linked lists. Each instruction carries a unique integer ID, an opcode, and a packed operand array.
Instruction Layout
| Offset | Type | Field | Description |
|---|---|---|---|
| +8 | varies | reg_class | Register class / encoding fields |
| +16 | i32 | id | Unique instruction ID |
| +28 | u32 | reserved | Reserved / padding |
| +36 | u32 | flags | Flags (bits 19-21 = subtype) |
| +48 | u8 | special_flags | Volatile/special (bit 5 = volatile) |
| +72 | u32 | opcode | SASS opcode (lower 12 bits = base, bits 12-13 = modifier; confirmed 50+ sites) |
| +76 | u32 | opcode_aux | Auxiliary opcode data / per-instruction flags |
| +80 | u32 | operand_count | Number of operands |
| +84 | u32[] | operands | Packed operand array (8 bytes per operand) |
| +160 | ptr | enc_buf | Encoding buffer pointer (post-selection) |
| +184 | u32 | enc_mode | Encoding mode |
| +200 | u64 | imm_value | Immediate value |
Packed Operand Encoding
Each operand occupies 8 bytes in the operand array starting at instruction offset +84:
 31  30 29 28  27            20  19                    0
+---+----------+----------------+----------------------+
| S | type (3) | modifier bits  |    index (20 bits)   |
+---+----------+----------------+----------------------+
  ^       ^                       ^
  |       bits 28-30: type        bits 0-19: reg/sym index
  bit 31: sign/negate flag (S)
type field (bits 28-30):
1 = register operand -> index into *(ctx+88) register file
5 = symbol/const operand -> index into *(ctx+152) symbol table
Operand Word 1 (Upper 4 Bytes)
Each 8-byte operand slot has two DWORDs. Word 0 (documented above) carries type/modifier/index. Word 1 carries extended flags:
Word 1 (at instr + 84 + 8*i + 4):
31 30 29 28 27 26 25 24 23 0
+---+---+---+---+---+---+---+---+-------------------------------+
| reserved / mod flags |CB | auxiliary data |
+---+---+---+---+---+---+---+---+-------------------------------+
^
bit 24: const-bank flag (CB)
Bits 25-31 (mask 0xFE000000): extended modifier flags
When any bit is set, the operand has special semantics.
Peephole matchers bail out early if (word1 & 0xFE000000) != 0.
Bit 25 (0x2000000): operand reuse / negation extension
Bit 26 (0x4000000): absolute-value modifier (|x|)
Bit 24 (mask 0x1000000): const-bank flag
When set, indicates the source references a constant bank (c[N][offset]).
The scheduler uses this to distinguish FADD (standard) from FADD (const-bank)
for latency modeling (see scheduling/latency-model.md).
Bits 0-23: auxiliary data
For symbol/const operands (type 5): constant bank number
For predicate guards (type 6): predicate sense (true/false)
For register operands (type 1): typically zero
Evidence: sub_40848E checks (word1 & 0xFE000000) != 0 across all operands; sub_405769 tests both 0x1000000 and 0x6000000 combinations; sub_404AD0 verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms. Confirmed in 30+ decompiled functions (confidence 0.92).
Extraction Pattern
Extraction pattern (appears in 50+ functions):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7;
int index = operand & 0xFFFFF;
int mods = (operand >> 20) & 0xFF;
uint32_t word1 = *(uint32_t*)(instr + 84 + 8 * i + 4);
bool has_const_bank = (word1 & 0x1000000) != 0;
bool has_ext_mods = (word1 & 0xFE000000) != 0;
Opcode Constants
Selected confirmed opcodes (from multiple independent functions):
| Value | Instruction | Notes |
|---|---|---|
| 47 | NOP / barrier | |
| 72 | RET | Return (CALL is opcode 71) |
| 91 | ATOM | Atomic memory operation |
| 92 | RED | Reduction operation |
| 95 | STS | Store to shared memory (ROT13: FGF). Note: EXIT = opcode 77 (RKVG), RET = opcode 72 (ERG) |
| 155 | LD variant | Load instruction |
| 173 | ST variant | Store instruction |
| 183 | LD.E | Extended load (& 0xFFFFCFFF mask removes modifier bits) |
| 267 | ST variant | Store (& 0xFFFFCFFF) |
| 268 | LD variant | Load (& 0xFFFFCFFF) |
| 288 | ST.E | Extended store |
The 0xFFFFCFFF mask (clear bits 12-13) strips modifier/suboperation bits from the opcode, yielding the base instruction class. This pattern appears in InstructionClassifier, MBarrierDetector, and OperandLowering code.
ROT13 Opcode Names
All SASS opcode mnemonic strings stored in the binary are ROT13-encoded. The master table is initialized in sub_BE7390 (InstructionInfo constructor) at offset 4184 of the InstructionInfo object, with 16-byte {name, length} entries. This is lightweight obfuscation -- not a security measure.
Selected decoded names (~200+ total, covering the full sm_70+ SASS ISA):
| ROT13 | Real | Category |
|---|---|---|
VZNQ | IMAD | Integer multiply-add |
VNQQ3 | IADD3 | 3-input integer add |
SSZN | FFMA | FP fused multiply-add |
SNQQ | FADD | FP add |
SZHY | FMUL | FP multiply |
ZBI | MOV | Move |
FRY | SEL | Select |
YBC3 | LOP3 | 3-input logic |
VFRGC | ISETP | Integer set-predicate |
SFRGC | FSETP | FP set-predicate |
YRN | LEA | Load effective address |
FUS | SHF | Shift / funnel shift |
ZHSH | MUFU | Multi-function unit (SFU) |
YQT | LDG | Load global |
FGT | STG | Store global |
YQP | LDC | Load constant |
YQY | LDL | Load local |
YQF | LDS | Load shared |
NGBZ | ATOM | Atomic |
ONE | BAR | Barrier |
OEN | BRA | Branch |
PNYY | CALL | Call |
ERG | RET | Return |
RKVG | EXIT | Exit |
GRK | TEX | Texture |
ZRZONE | MEMBAR | Memory barrier |
JNECFLAP | WARPSYNC | Warp synchronize |
C2E | P2R | Predicate to register |
E2C | R2P | Register to predicate |
ABC | NOP | No-op |
OFFL | BSSY | Branch sync stack push |
OFLAP | BSYNC | Branch sync |
QRCONE | DEPBAR | Dependency barrier |
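The table above can be verified mechanically with a standard ROT13 transform:

```c
/* Undo the name-table encoding in place: VZNQ -> IMAD, ABC -> NOP, ... */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + 13) % 26;
        else if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + 13) % 26;
        /* digits (YBC3 -> LOP3, C2E -> P2R) pass through unchanged */
    }
}
```

ROT13 is its own inverse, so the same function also re-encodes a mnemonic for searching the binary's string table.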
Register Files
Ori maintains four distinct register files, mirroring the SASS hardware register model.
Register File Summary
| File | Width | Range | Special | ABI type | Code Object offset |
|---|---|---|---|---|---|
| R | 32-bit | R0 -- R255 | RZ (read-zero) | 2 | +102 (alloc), +159 (reserved) |
| UR | 32-bit | UR0 -- UR63 | URZ (read-zero) | 3 | +99 |
| P | 1-bit | P0 -- P6 | PT (always-true) | 5 | (tracked separately) |
| UP | 1-bit | UP0 -- UP6 | UPT (always-true) | -- | (tracked separately) |
R registers are the main 32-bit general-purpose registers. 64-bit values occupy consecutive pairs (e.g., R4:R5). The total R-register count for a function is field[159] + field[102] (reserved + allocated). Maximum is 255 usable registers (R0-R254); R255 is the hardware zero register RZ.
UR registers (sm_75+) are uniform registers shared across the warp. Every thread sees the same value. UR0-UR63 on supported architectures. The count is at Code Object +99.
P registers are 1-bit predicate registers used for conditional execution. P0-P6 are usable; PT is the hardwired always-true predicate (writes are discarded).
UP registers are the uniform variant of predicates, shared across the warp like UR.
Register Descriptor
Each register is described by a descriptor in the register file array, accessed as *(ctx+88) + 8*regId:
| Offset | Type | Field |
|---|---|---|
| +8 | u32 | Size / live range info |
| +12 | u32 | Register number |
| +16 | u32 | Register class (enum) |
| +20 | u32 | Physical register name (assigned after regalloc) |
| +24 | ptr | Definition info (0 = undefined / uninitialized) |
| +36 | u32 | Flags (bits 19-21 = subtype) |
| +48 | u8 | Volatile/special flags (bit 5 = volatile marker) |
| +64 | u32 | Register file type enum |
| +68 | u32 | Physical register number (post-allocation) |
Register file type values at descriptor +64:
| Value | Meaning |
|---|---|
| 2 | General-purpose (R) |
| 3 | Uniform (UR) |
| 5 | Predicate (P) |
| 6 | General register (alternate classification) |
| 7 | Predicate (alternate classification) |
| 10 | Extended register pair (64-bit) |
| 11 | Extended register quad (128-bit) |
The register class name table at off_21D2400 maps reg_type enum values to string names. The stat collector (sub_A60B60, 24KB) enumerates ~25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others. The allocator processes classes 0--6 (matching reg_type values 0--6); barrier registers (reg_type 9) are handled separately.
Partial SSA
Ori does not maintain full SSA form at all times. Instead, it uses a bounded "partial SSA" window managed by two phases in the 159-phase optimization pipeline.
Phase 23: GenerateMovPhi
Constructs phi-like MovPhi pseudo-instructions at CFG merge points. Inserted after loop unrolling (phase 22) and before pipelining (phase 24). This establishes partial SSA form -- not through LLVM-style dominance-frontier phi insertion, but through explicit MovPhi nodes that represent value merging at control-flow join points.
Phase 73: ConvertAllMovPhiToMov
Destroys SSA form by lowering every MovPhi into a plain MOV instruction. Runs after sync instruction expansion (phase 72) and before uniform register conversion (phase 74). This is SSA destruction without the need for interference-graph-based coalescing -- the MovPhi nodes simply become copies.
The SSA Window
The partial-SSA window spans phases 23 through 73, covering the bulk of the optimization pipeline:
Phase 23 GenerateMovPhi <-- SSA construction
Phase 24 OriPipelining
Phase 25 StageAndFence
Phase 26 OriRemoveRedundantBarriers
Phase 29 GeneralOptimize
Phase 37 GeneralOptimizeMid
Phase 46 GeneralOptimizeMid2
Phase 49 GvnCse
Phase 50 OriReassociateAndCommon
Phase 54 OriDoRematEarly
Phase 58 GeneralOptimizeLate
Phase 63 OriDoPredication
Phase 65 GeneralOptimizeLate2
Phase 69 OriDoRemat
Phase 70 OriPropagateVaryingSecond
Phase 71 OptimizeSyncInstructions
Phase 72 LateExpandSyncInstructions
Phase 73 ConvertAllMovPhiToMov <-- SSA destruction
All optimizations between these two phases can rely on the single-definition property of MovPhi nodes for reaching-definition analysis.
MovPhi Instruction Format
A MovPhi is not a distinct opcode -- it reuses the MOV opcode (19) with a distinguishing flag in the instruction's auxiliary fields. Phase 73 (ConvertAllMovPhiToMov) converts MovPhi to plain MOV by clearing this flag, without changing the opcode value.
MovPhi operand layout:
+72 opcode = 19 (MOV)
+76 opcode_aux = flag distinguishing MovPhi from plain MOV
+80 operand_count = 2*N + 1 (variable, one destination + N source-predecessor pairs)
operand[0]: destination register (the merged value)
operand[1], [2]: {source_reg, predecessor_bix} for predecessor 0
operand[3], [4]: {source_reg, predecessor_bix} for predecessor 1
...
operand[2*N-1], [2*N]: {source_reg, predecessor_bix} for predecessor N-1
This is the operational equivalent of an SSA phi node. For a CFG merge with two predecessors:
;; PTX-level CFG: ;; Ori MovPhi:
;; bix1 defines R7 ;;
;; bix2 defines R9 ;; MovPhi R3, R7, bix1, R9, bix2
;; bix3 merges ;;
;; uses R3 ;; "if from bix1, R3 = R7; if from bix2, R3 = R9"
Phase 23 (GenerateMovPhi) inserts these at merge points where a register has different reaching definitions from different predecessors. Phase 73 destructor linearizes them: it inserts a MOV R3, R7 at the end of bix1 and a MOV R3, R9 at the end of bix2, then deletes the MovPhi.
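The phase-73 lowering amounts to copy insertion at the end of each predecessor. A simplified, self-contained model (these tiny structs are ours -- real ptxas rewrites 296-byte instruction objects in place):

```c
/* One MOV to be inserted at the end of predecessor block pred_bix. */
typedef struct { int dst, src, pred_bix; } Copy;

/* MovPhi: operand[0] = dst, then N {source, predecessor-bix} pairs. */
typedef struct { int dst; int n; int src[8]; int pred_bix[8]; } MovPhi;

/* Emit MOV dst, src[i] for each predecessor; caller then deletes the phi. */
static int lower_movphi(const MovPhi *phi, Copy *out) {
    for (int i = 0; i < phi->n; i++) {
        out[i].dst      = phi->dst;
        out[i].src      = phi->src[i];
        out[i].pred_bix = phi->pred_bix[i];
    }
    return phi->n;   /* number of copies inserted */
}
```

For the example above, MovPhi R3, R7, bix1, R9, bix2 yields MOV R3, R7 in bix1 and MOV R3, R9 in bix2.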
Operand Kinds
The IR supports 10 distinct operand kinds, identified through the register allocator verifier (sub_A55D80) and the instruction selection pattern matcher infrastructure.
| # | Kind | Description |
|---|---|---|
| 1 | R/UR register | General-purpose or uniform register operand |
| 2 | P/UP register | Predicate or uniform-predicate register operand |
| 3 | Any register | Wildcard -- matches any register class |
| 4 | Offset | Memory offset for address computation |
| 5 | Regular | Standard immediate or constant value |
| 6 | Predicated | Guard predicate controlling conditional execution |
| 7 | Remat | Rematerialization marker (value can be recomputed instead of spilled) |
| 8 | Spill-refill | Spill/refill pair marker for register allocator |
| 9 | R2P / P2R | Register-to-predicate or predicate-to-register conversion pair |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |
The regalloc verifier (sub_A55D80, confidence 0.95) classifies 10 problem categories that map to these operand kinds:
- Missing spill match for refill
- Refill reads uninitialized memory
- P2R-R2P pattern match failure
- Bit-spill-refill pattern match failure
- Previously defined operand now uninitialized
- Extra post-regalloc definitions (mixed-size check)
- Rematerialization problem
- P2R-R2P base destroyed
- Bit-spill-refill base destroyed
- Definitions disappeared without new ones added
The pattern matcher infrastructure at 0xB7D000--0xBA9D00 (~390 functions) uses a separate classification for instruction selection:
| Function | Predicate |
|---|---|
sub_B28E10 | isRegOperand |
sub_B28E20 | isPredOperand |
sub_B28E40 | isImmOperand |
sub_B28E80 | isConstOperand |
sub_B28E90 | isUReg |
sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
Ori vs. PTX
PTX is a virtual ISA -- a stable interface between the compiler frontend and the architecture-specific backend. Ori is the architecture-specific backend representation that replaces PTX opcodes with actual SASS instructions early in compilation.
| Aspect | PTX | Ori |
|---|---|---|
| Opcode set | Virtual mnemonics (add, mul, ld, st) | SASS hardware opcodes (IMAD, FFMA, LDG, STG) |
| Register model | Unlimited virtual registers, typed | 4 hardware register files (R, UR, P, UP) with virtual numbering |
| SSA form | Not applicable (PTX is a linear ISA) | Partial SSA between phases 23 and 73 |
| CFG representation | Implicit (labels + branches) | Explicit hash-map-based CFG with RPO, backedges, dominators |
| Target dependence | Architecture-independent (forward-compatible) | Architecture-specific (per-SM instruction selection) |
| Conversion point | Input to ptxas | After MercConverter (sub_9F1A90) |
The MercConverter pass is the boundary: it transforms PTX-derived intermediate opcodes into SM-specific SASS opcodes by dispatching through a large opcode switch (sub_9ED2D0, 25KB). After MercConverter, the string "After MercConverter" appears in diagnostic output, and the IR is fully in SASS-opcode form. Each instruction then carries enough information for the scheduler to compute accurate latencies, throughputs, and functional-unit assignments.
Worked Example: add.f32 to FADD
This traces a single PTX instruction through the Ori representation, showing exactly how the opcode, operands, and register references are encoded in memory.
PTX Input
add.f32 %f3, %f1, %f2
After MercConverter (sub_9F1A90), this becomes the Ori instruction:
FADD R3, R1, R2
The type qualifier .f32 disappears -- the "F" in FADD encodes the float type. Register names %f1, %f2, %f3 become virtual register IDs R1, R2, R3 in the R (GPR) register file.
Instruction Object in Memory
FADD is opcode 12 in the ROT13 name table (ROT13: SNQQ, at InstructionInfo+4184+16*12). The 296-byte instruction object:
Offset Value Field
------ ----------------- ---------------------
+0 prev_ptr Linked-list prev
+8 next_ptr Linked-list next
+16 <id> Unique instruction ID
+72 0x0000000C opcode = 12 (FADD)
+80 0x00000003 operand_count = 3
+84 0x10000003 operand[0] word0: dst R3
+88 0x00000000 operand[0] word1: no ext flags
+92 0x10000001 operand[1] word0: src R1
+96 0x00000000 operand[1] word1: no ext flags
+100 0x10000002 operand[2] word0: src R2
+104 0x00000000 operand[2] word1: no ext flags
Operand Decoding
Take operand[0] word0 = 0x10000003:
0x10000003 in binary:
bit 31 = 0 (no sign/negate)
bits 28-30 = 001 (type = 1 = register operand)
bits 20-27 = 00000000 (no modifiers)
bits 0-19 = 00003 (register index = 3)
The register index resolves through the register descriptor array:
reg_desc = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * 3);
// reg_desc + 64: reg_file_type = 2 (R / GPR file)
// reg_desc + 12: register number = 3
If the source operand were a constant-bank reference (e.g., FADD R3, R1, c[0][0x10]), operand[2] would have type=5 (symbol/constant) in word0 and the const-bank flag (0x1000000) set in word1. The scheduler distinguishes these two FADD variants for latency modeling: standard FADD gets throughput class 0x3D, while const-bank FADD gets 0x78.
Memory Space Classification
Memory operands carry a space type enum, resolved by sub_91C840 which maps the PTX-level space identifier to an internal category number. The full input enumeration (from complete decompilation of sub_91C840, confidence 0.98):
| Input | PTX Space | Internal Category | Notes |
|---|---|---|---|
| 0 | (none) | -- | Unmapped, no memory space |
| 1 | Register / generic | 16 | Register file address |
| 2 | Code / function | 12 | Function address |
| 3 | (gap) | -- | Unmapped |
| 4 | .shared | 1 | Shared memory |
| 5 | .const | 3 | Constant memory |
| 6 | .global | 11 | Global memory |
| 7 | .local | 2 | Local memory |
| 8 | (gap) | -- | Unmapped |
| 9 | .local (variant) | 2 | Same as 7, alternate encoding |
| 10--11 | (gap) | -- | Unmapped |
| 12 | .param | 4 | Parameter memory |
| 13 | Generic (unqualified) | 0 | Generic address space |
| 14 | .tex | 8 | Texture memory |
| 15 | .surf | 17 | Surface memory |
| 16 | Spill space | 7 | Register spill/fill scratch |
| 17 | (gap) | -- | Unmapped |
| 18 | (instruction-dependent) | varies | Sub-classifies by opcode at a2[1] |
| 19 | .uniform | 15 | Uniform (sm_75+) |
| 20 | .global (extended) | 6 | Global, extended variant |
| 21 | .const (extended) | 5 | Constant, extended store-to-global path |
| 22 | .const (extended, alt) | 5 | Constant, alternate extended |
| 23 | .surf / tensor (ext) | 18 | Surface/tensor extended (sm_90+) |
Case 18 (0x12) uses a sub-switch on the opcode value at a2[1] to further classify: opcodes 7, 43, 45, 53 map to category 6 (global-like); opcode 111 and opcodes in the 183--199 range map to category 5 (constant-like); opcodes 54 and 189 map to category 9 (special).
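The mapping in the table can be expressed as a plain switch. A sketch of sub_91C840's input-to-category translation (function name is ours; the instruction-dependent case 18 is elided, and -1 marks unmapped inputs):

```c
static int space_category(int input) {
    switch (input) {
    case 1:  return 16;  /* register / generic        */
    case 2:  return 12;  /* code / function address   */
    case 4:  return 1;   /* .shared                   */
    case 5:  return 3;   /* .const                    */
    case 6:  return 11;  /* .global                   */
    case 7:
    case 9:  return 2;   /* .local (both encodings)   */
    case 12: return 4;   /* .param                    */
    case 13: return 0;   /* generic (unqualified)     */
    case 14: return 8;   /* .tex                      */
    case 15: return 17;  /* .surf                     */
    case 16: return 7;   /* spill/fill scratch        */
    case 19: return 15;  /* .uniform (sm_75+)         */
    case 20: return 6;   /* .global (extended)        */
    case 21:
    case 22: return 5;   /* .const (extended paths)   */
    case 23: return 18;  /* surf/tensor ext (sm_90+)  */
    default: return -1;  /* unmapped gap, or case 18  */
    }
}
```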
The hot/cold classifier pair (sub_A9CDE0 / sub_A9CF90) consumes the internal category to partition instructions for scheduling. Hot memory operations (global loads/stores, certain atomics -- category 11) have long latencies and benefit from aggressive scheduling; cold operations (constant loads -- category 3) have shorter latencies and are treated more conservatively.
Related Pages
- Instructions -- detailed instruction format and encoding
- CFG -- basic block and control-flow-graph internals
- Registers -- register model, descriptor layout, allocation interface
- Data Structures -- hash tables, bitvectors, linked lists
- Pipeline Overview -- where Ori sits in the full PTX-to-SASS flow
- PTX-to-Ori Lowering -- how PTX becomes Ori
- Optimizer -- the 159-phase optimization pipeline
- Hash Tables and Bitvectors -- FNV-1a maps and SSE2 bitvectors used by the CFG
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_A3B080 | -- | Code Object constructor; allocates ~1136-byte per-function IR container (vtable at 0x21EE238) | 0.90 |
sub_A4B8F0 | -- | Register count formula: total_R = v5[159] + v5[102], instr_count = v5[335] - v5[341] | 0.90 |
sub_A3A7E0 | -- | Stats emitter; prints per-function profile (instruction count, register count, occupancy, latency) | 0.90 |
sub_BE21D0 | 1.4KB | CFG::dumpDOT; emits Graphviz DOT output for the control flow graph | 0.92 |
sub_BDE150 | 9KB | CFG::computeRPO; explicit DFS stack, assigns reverse post-order numbers into Code Object +720 array | 0.92 |
sub_BDE8B0 | 2KB | CFG::printEdges; FNV-1a lookup, prints "bix%d -> bix%d\n" | 0.92 |
sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges; RPO traversal order + backedge debug dump | 0.92 |
sub_BE0690 | 54KB | CFG::buildAndAnalyze; main CFG constructor -- predecessors, successors, RPO, loop detection | 0.92 |
sub_BE2330 | 4KB | CFG::computeDominators; post-build dominator and loop analysis with bitvector operations | 0.92 |
sub_BE7390 | -- | InstructionInfo constructor; initializes 322-entry ROT13 opcode name table at object offset +4184 | 0.90 |
sub_9F1A90 | 35KB | MercConverter pass; transforms PTX-derived opcodes into SM-specific SASS opcodes | 0.92 |
sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization | 0.90 |
sub_91C840 | -- | Memory space classifier; maps PTX-level space identifiers (0--23) to internal category numbers | 0.98 |
sub_A9CDE0 | -- | Hot/cold memory classifier (hot path); partitions instructions by memory category for scheduling | 0.85 |
sub_A9CF90 | -- | Hot/cold memory classifier (cold path); complement of sub_A9CDE0 | 0.85 |
sub_A60B60 | 24KB | Register stat collector; enumerates ~25 register sub-classes (R, P, B, UR, UP, UB, Tensor/Acc, etc.) | 0.85 |
sub_A55D80 | -- | Register allocator verifier; classifies 10 operand-kind problem categories for regalloc validation | 0.95 |
sub_40848E | -- | Operand extended-flag checker; tests (word1 & 0xFE000000) != 0 across all operands | 0.85 |
sub_405769 | -- | Operand flag tester; tests 0x1000000 and 0x6000000 combinations in operand word 1 | 0.85 |
sub_404AD0 | -- | Peephole guard; verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms | 0.85 |
sub_B28E10 | -- | isRegOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E20 | -- | isPredOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E40 | -- | isImmOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E80 | -- | isConstOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E90 | -- | isUReg; ISel pattern matcher operand predicate | 0.90 |
sub_B28E00 | -- | getRegClass; returns register class (1023 = wildcard, 1 = GPR) | 0.90 |
Instructions & Opcodes
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the Ori IR instruction representation: in-memory layout, opcode encoding, operand model, instruction flags, creation/iteration APIs, the master descriptor table, and opcode categories. All offsets are from ptxas v13.0.88 (37.7 MB stripped x86-64 ELF).
Instruction Object Layout
Every Ori instruction is a 296-byte C++ object allocated from the Code Object's arena. Instructions are linked into per-basic-block doubly-linked lists via pointers at offsets +0 and +8. The allocator at sub_7DD010 allocates exactly 296 bytes per instruction and zeroes the object before populating it.
Memory Layout (296 bytes)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | prev | Previous instruction in BB linked list (nullptr for head) |
| +8 | 8 | ptr | next | Next instruction in BB linked list (nullptr for tail) |
| +16 | 4 | i32 | id | Unique instruction ID (monotonically increasing within function) |
| +20 | 4 | i32 | ref_count | Reference/use count (incremented by sub_7E6090) |
| +24 | 4 | i32 | bb_index | Basic block index (bix) this instruction belongs to |
| +28 | 4 | u32 | reserved_28 | Reserved / padding |
| +32 | 4 | u32 | control_word | Scheduling control word (stall cycles, yield, etc.) |
| +36 | 4 | u32 | flags_36 | Instruction flags (bits 19-21 = subtype, see below) |
| +40 | 8 | ptr | sched_slot | Scheduling state pointer |
| +48 | 8 | u64 | flag_bits | Extended flag bits (bit 5 = volatile, bit 27 = reuse) |
| +56 | 8 | ptr | def_instr | Defining instruction (for SSA def-use chains) |
| +64 | 8 | ptr | reserved_64 | Reserved / register class info |
| +72 | 4 | u32 | opcode | Full opcode word (lower 12 bits = base opcode, bits 12-13 = modifier) |
| +76 | 4 | u32 | opcode_aux | Auxiliary opcode data (sub-operation, comparison predicate) |
| +80 | 4 | u32 | operand_count | Total number of operands (destinations + sources) |
| +84 | var | u32[N*2] | operands[] | Packed operand array (8 bytes per operand slot) |
| +88 | 4 | u32 | operands[0].extra | High word of first operand slot |
| +100 | 1 | u8 | type_flags | Data type / modifier flags (bits 0-2 = data type code) |
| +104 | 4 | u32 | reserved_104 | Reserved |
| +112 | 8 | ptr | use_chain | Use chain linked list head (for CSE) |
| +120 | 8 | ptr | reserved_120 | Reserved |
| +136 | 4 | i32 | reserved_136 | Reserved |
| +160 | 8 | ptr | enc_buf | Encoding buffer pointer (populated during code generation) |
| +168 | 8 | ptr | reserved_168 | Reserved |
| +184 | 4 | u32 | enc_mode | Encoding mode selector |
| +200 | 8 | u64 | imm_value | Immediate value (for instructions with constant operands) |
| +208 | 16 | xmm | sched_params | Scheduling parameters (loaded via _mm_load_si128) |
| +240 | 4 | u32 | reserved_240 | Reserved |
| +244 | 1 | u8 | reserved_244 | Reserved |
| +248 | 8 | i64 | sentinel_248 | Initialized to -1 (0xFFFFFFFFFFFFFFFF) |
| +256 | 8 | i64 | sentinel_256 | Initialized to 0xFFFFFFFF |
| +264 | 8 | i64 | bb_ref | Basic block reference / block index storage |
| +272 | 8 | i64 | reserved_272 | Reserved |
| +280 | 16 | u128 | reserved_280 | Zeroed on creation |
Linked-List Pointers
Instructions form a doubly-linked list within each basic block. The Code Object stores the global list head at offset +272 and tail at offset +280:
Code Object +272 --> head instruction (prev = nullptr)
|
v (+8 = next)
instruction 2
|
v
instruction 3
|
v ...
Code Object +280 --> tail instruction (next = nullptr)
The linked-list traversal pattern appears in hundreds of functions throughout ptxas:
// Forward iteration over all instructions
for (instr = *(ptr*)(code_obj + 272); instr != nullptr; instr = *(ptr*)(instr + 8)) {
uint32_t opcode = *(uint32_t*)(instr + 72);
uint32_t num_ops = *(uint32_t*)(instr + 80);
// process instruction...
}
Opcode Encoding
The opcode field at offset +72 is a 32-bit word with a structured layout.
Opcode Word Format
31 16 15 14 13 12 11 0
+------------------+---+---+---+---+---------------+
| upper flags | | | M | M | base opcode |
+------------------+---+---+---+---+---------------+
^ ^
| bit 12: modifier bit 0
bit 13: modifier bit 1
M = modifier bits (stripped by the 0xFFFFCFFF mask)
base opcode = 12-bit instruction class identifier (0-4095)
The mask 0xFFFFCFFF (clear bits 12-13) is used throughout InstructionClassifier, MBarrierDetector, OperandLowering, and many other subsystems to extract the base instruction class, stripping sub-operation modifier bits:
uint32_t raw_opcode = *(uint32_t*)(instr + 72);
uint32_t base_opcode = raw_opcode & 0xFFFFCFFF;
Additionally, bit 11 is sometimes used in operand count calculations:
// Effective operand count adjustment (appears in 50+ functions)
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2; // 0 or 2
int dst_count = *(uint32_t*)(instr + 80) - adj;
Canonical Opcode Reference
The opcode value stored at instruction+72 is used directly as the index into the ROT13 name table at InstructionInfo+4184. There is a single numbering system -- the ROT13 table index IS the runtime opcode. This was verified by tracing sub_BEBAC0 (getName), which computes InstructionInfo + 4184 + 16 * opcode with no remapping.
The following table lists frequently-referenced opcodes from decompiled code, with their canonical SASS mnemonic names from the ROT13 table. Each opcode appears in 10+ decompiled functions reading *(instr+72).
| Base Opcode | SASS Mnemonic | Category | Reference Count |
|---|---|---|---|
| 0 | ERRBAR | Error barrier (internal) | Sentinel in scheduler |
| 1 | IMAD | Integer multiply-add | 100+ functions |
| 7 | ISETP | Integer set-predicate | sub_7E0030 switch |
| 18 | FSETP | FP set-predicate | sub_7E0030 switch |
| 19 | MOV | Move | 80+ functions |
| 23 | PLOP3 | Predicate 3-input logic | sub_7E0030 case 23 |
| 25 | NOP | No-op | Scheduling, peephole |
| 52 | AL2P_INDEXED | BB boundary pseudo-opcode | sub_6820B0, 100+ |
| 54 | BMOV_B | Barrier move (B) | sub_7E6090 case 54 |
| 61 | BAR | Barrier synchronization | Sync passes |
| 67 | BRA | Branch | sub_74ED70, CFG builders |
| 71 | CALL | Function call | sub_7B81D0, ABI, spill |
| 72 | RET | Return | sub_74ED70 (with 67) |
| 77 | EXIT | Exit thread | sub_7E4150, CFG sinks |
| 93 | OUT_FINAL | Tessellation output (final) | sub_734AD0, 25+ |
| 94 | LDS | Load shared | sub_7E0650 case 94 |
| 95 | STS | Store shared | sub_7E0030, 40+ |
| 96 | LDG | Load global | Memory analysis |
| 97 | STG | Store global | sub_6820B0, 30+ |
| 102 | ATOM | Atomic | Encoding switch |
| 104 | RED | Reduction | Encoding switch |
| 111 | MEMBAR | Memory barrier | Sync passes |
| 119 | SHFL | Warp shuffle | sub_7E0030 case 119 |
| 122 | DFMA | Double FP fused mul-add | sub_7E0030 case 122 |
| 130 | HSET2 | Half-precision set (packed) | 20+ functions |
| 135 | INTRINSIC | Compiler intrinsic (pseudo) | ISel, lowering |
| 137 | SM73_FIRST | SM gen boundary (real instr) | Strength reduction |
| 183 | sm_82+ opcode | Extended mem operation | & 0xFFFFCFFF mask |
Important caveats:
- Opcode 52 (AL2P_INDEXED in the name table) is universally used as a basic block delimiter in 100+ decompiled functions. The SASS mnemonic name may be vestigial; no decompiled code uses it for attribute-to-patch operations.
- SM boundary markers (136=SM70_LAST, 137=SM73_FIRST, etc.) have marker names in the ROT13 table but are valid runtime opcodes. Instructions with these opcode values exist in the IR and are processed by optimization passes (e.g., strength reduction operates on opcode 137).
- Earlier versions of this page had a "Selected Opcode Values" table that assigned incorrect SASS mnemonics based on behavioral inference rather than the ROT13 name table. Those labels (93=BRA/CALL, 95=EXIT, 97=CALL/label, 130=MOV) were wrong. The correct labels are: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2. Branch/call/exit are at 67=BRA, 71=CALL, 77=EXIT.
Opcode Ranges by SM Generation
The ROT13 opcode name table in sub_BE7390 (InstructionInfo constructor) includes explicit SM generation boundary markers:
| Marker Opcode | Decoded Name | Meaning |
|---|---|---|
| 136 | SM70_LAST | Last sm_70 (Volta) opcode |
| 137 | SM73_FIRST | First sm_73 (Volta+) opcode |
| 171 | SM73_LAST | Last sm_73 opcode |
| 172 | SM82_FIRST | First sm_82 (Ampere) opcode |
| 193 | SM82_LAST | Last sm_82 opcode |
| 194 | SM86_FIRST | First sm_86 (Ampere+) opcode |
| 199 | SM86_LAST | Last sm_86 opcode |
| 200 | SM89_FIRST | First sm_89 (Ada) opcode |
| 205 | SM89_LAST | Last sm_89 opcode |
| 206 | SM90_FIRST | First sm_90 (Hopper) opcode |
| 252 | SM90_LAST | Last sm_90 opcode |
| 253 | SM100_FIRST | First sm_100 (Blackwell) opcode |
| 280 | SM100_LAST | Last sm_100 opcode |
| 281 | SM104_FIRST | First sm_104 (Blackwell Ultra) opcode |
| 320 | SM104_LAST | Last sm_104 opcode |
| 321 | LAST | Sentinel (end of table) |
This gives a clear partitioning: opcodes 0-136 are the base sm_70+ ISA, 137-171 extend to sm_73, and so on up through sm_104. Each SM generation only adds opcodes; no base opcodes are removed.
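The boundary markers reduce generation checks to simple range comparisons. A minimal classifier sketch under the table above (the function name and return strings are ours, not from the binary; ptxas compares against these constants inline):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical helper: maps an opcode index to the SM generation that
   introduced it, using the boundary markers recovered from sub_BE7390. */
static const char *opcode_sm_generation(uint32_t opcode)
{
    if (opcode <= 136) return "sm_70";   /* base ISA, up to SM70_LAST */
    if (opcode <= 171) return "sm_73";   /* 137..171                  */
    if (opcode <= 193) return "sm_82";   /* 172..193                  */
    if (opcode <= 199) return "sm_86";   /* 194..199                  */
    if (opcode <= 205) return "sm_89";   /* 200..205                  */
    if (opcode <= 252) return "sm_90";   /* 206..252                  */
    if (opcode <= 280) return "sm_100";  /* 253..280                  */
    if (opcode <= 320) return "sm_104";  /* 281..320                  */
    return "invalid";                    /* 321 is the LAST sentinel  */
}
```

Because each generation only appends opcodes, the first matching upper bound identifies the generation.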
Operand Model
Packed Operand Encoding
Each operand occupies 8 bytes (two 32-bit words) in the operand array starting at instruction offset +84. The first word carries the type, modifier bits, and index. The second word carries additional data (extended flags, immediate bits, etc.).
Word 0 (at instr + 84 + 8*i):
31 30 29 28 27 26 25 24 23 22 21 20 19 0
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
| S | type(3) | modifier (8 bits) | index (20 bits) |
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
  bit 31:     sign/negative flag (S)
  bits 28-30: operand type
  bits 20-27: modifier
  bits 0-19:  register/symbol index
Word 1 (at instr + 88 + 8*i):
31 0
+--------------------------------------------------------------------+
| extended data / immediate bits / flags |
+--------------------------------------------------------------------+
Operand Type Field (bits 28-30)
| Value | Type | Index Meaning |
|---|---|---|
| 0 | Unused / padding | — |
| 1 | Register | Index into *(code_obj+88) + 8*index register descriptor array |
| 2 | Predicate register | Index into predicate register file |
| 3 | Uniform register | UR file index |
| 4 | Address/offset | Memory offset value |
| 5 | Symbol/constant | Index into *(code_obj+152) symbol table |
| 6 | Predicate guard | Guard predicate controlling conditional execution |
| 7 | Immediate | Encoded immediate value |
Operand Extraction Pattern
This exact extraction pattern appears in 50+ functions across scheduling, regalloc, encoding, and optimization passes:
uint32_t operand_word = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand_word >> 28) & 7; // bits 28-30
int index = operand_word & 0xFFFFF; // bits 0-19 (also seen as 0xFFFFFF)
int mods = (operand_word >> 20) & 0xFF; // bits 20-27
bool is_neg = (operand_word >> 31) & 1; // bit 31
// Register operand check (most common pattern)
if (type == 1) {
reg_descriptor = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * index);
reg_file_type = *(uint32_t*)(reg_descriptor + 64);
reg_number = *(uint32_t*)(reg_descriptor + 12);
}
Some functions use a 24-bit index mask (& 0xFFFFFF) instead of 20-bit, packing additional modifier bits into the upper nibble of the index field.
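The word-0 layout round-trips cleanly. A sketch of a packer plus the extraction masks above (the packer is our own construction; only the extraction side appears in the binary):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative packer for the recovered word-0 layout (20-bit index form). */
static uint32_t pack_operand(bool neg, uint32_t type, uint32_t mods, uint32_t index)
{
    return ((uint32_t)neg << 31) | ((type & 7u) << 28)
         | ((mods & 0xFFu) << 20) | (index & 0xFFFFFu);
}

/* Extraction side, matching the masks seen in 50+ functions. */
static uint32_t op_type(uint32_t w)  { return (w >> 28) & 7u; }
static uint32_t op_index(uint32_t w) { return w & 0xFFFFFu; }
static uint32_t op_mods(uint32_t w)  { return (w >> 20) & 0xFFu; }
static bool     op_neg(uint32_t w)   { return (w >> 31) & 1u; }
```

Note that `op_type` masks with 7 after the shift, so bit 31 (the sign flag) never leaks into the type field.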
Operand Classification Predicates
Small predicate functions at 0xB28E00-0xB28E90 provide the instruction selection interface for operand queries:
| Address | Function | Logic |
|---|---|---|
| sub_B28E00 | getRegClass | Returns register class; 1023 = wildcard, 1 = GPR |
| sub_B28E10 | isRegOperand | ((word >> 28) & 7) == 1 |
| sub_B28E20 | isPredOperand | ((word >> 28) & 7) == 2 |
| sub_B28E40 | isImmOperand | ((word >> 28) & 7) == 7 |
| sub_B28E80 | isConstOperand | ((word >> 28) & 7) == 5 |
| sub_B28E90 | isUReg | ((word >> 28) & 7) == 3 |
Destination vs. Source Operand Split
Destinations come first in the operand array, followed by sources. The boundary is computed from the operand_count field and the modifier bits in the opcode:
uint32_t total_ops = *(uint32_t*)(instr + 80);
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2; // 0 or 2
int first_src_index = total_ops - adj; // or total_ops + ~adj + 1
// Destinations: operands[0 .. first_src_index-1]
// Sources: operands[first_src_index .. total_ops-1]
For most instructions, adj = 0 and the split point equals operand_count. Instructions with bit 11 set in the opcode word shift the split by 2, indicating 2 extra destination operands (e.g., predicated compare-and-swap operations that write both a result register and a predicate).
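The decompiler emits the split point in two equivalent forms, `total_ops - adj` and `total_ops + ~adj + 1`; these are identical because two's-complement negation is `~adj + 1`. A quick demonstration (function names are ours):

```c
#include <stdint.h>

/* Both spellings of the destination/source split point, as seen in the
   decompiled output. With unsigned arithmetic they always agree. */
static uint32_t split_sub(uint32_t total_ops, uint32_t adj) { return total_ops - adj; }
static uint32_t split_not(uint32_t total_ops, uint32_t adj) { return total_ops + ~adj + 1; }
```

Recognizing this identity helps when matching the pattern across functions, since Hex-Rays does not normalize the two forms.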
Predicate Guard Operand
The last operand (at index operand_count - 1) can be a predicate guard (type 6) controlling conditional execution. The guard predicate check in sub_7E0E80:
bool has_pred_guard(char *instr) {
    int last_idx = *(uint32_t*)(instr + 80) + ~((*(uint32_t*)(instr + 72) >> 11) & 2);
    uint32_t last_op = *(uint32_t*)(instr + 84 + 8 * last_idx);
    return ((last_op & 0xF) - 2) < 7; // unsigned range check on type bits in low nibble
}
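The `((last_op & 0xF) - 2) < 7` expression is the classic compiled form of a range test: since `last_op` is unsigned, subtracting 2 wraps values below 2 around to huge numbers, so the single comparison is true exactly for values in [2..8]. A sketch isolating the trick (the helper name is ours):

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned-wrap range check, as emitted by the compiler for 2 <= t <= 8.
   Values below 2 wrap to near UINT32_MAX and fail the comparison. */
static bool in_guard_type_range(uint32_t t)
{
    return (t - 2u) < 7u;   /* true exactly for t in [2..8] */
}
```

Type 6 (predicate guard) falls inside this range, which is why the check accepts guarded instructions.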
Instruction Flags and Modifiers
Opcode Modifier Bits (offset +72, bits 12-13)
Bits 12-13 of the opcode word encode sub-operation modifiers. The 0xFFFFCFFF mask strips them to yield the base opcode. Common uses:
| Modifier | Meaning |
|---|---|
| 0 | Default operation |
| 1 | .HI or alternate form |
| 2 | .WIDE or extended form |
| 3 | Reserved / architecture-specific |
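The 0xFFFFCFFF mask and its inverse shift appear throughout the binary. A minimal sketch of the two helpers (names are ours; the binary inlines the masks):

```c
#include <stdint.h>

/* Bits 12-13 of the opcode word carry the sub-operation modifier.
   The 0xFFFFCFFF mask, seen in dozens of functions, strips them
   to recover the base opcode. */
static uint32_t base_opcode(uint32_t opcode_word) { return opcode_word & 0xFFFFCFFFu; }
static uint32_t sub_op(uint32_t opcode_word)      { return (opcode_word >> 12) & 3u; }
```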
Extended Flag Bits (offset +48)
The 64-bit flag word at offset +48 accumulates flags throughout the compilation pipeline:
| Bit | Hex Mask | Flag | Set By |
|---|---|---|---|
| 6 | 0x40 | Live-out | sub_7E6090 (def-use builder) |
| 16 | 0x10000 | Has single def | sub_7E6090 |
| 25 | 0x2000000 | Has prior use | sub_7E6090 |
| 27 | 0x8000000 | Same-block def | sub_7E6090 |
| 33 | 0x200000000 | Source-only ref | sub_7E6090 |
Control Word (offset +32)
The control word encodes scheduling metadata added by the instruction scheduler. It is initialized to zero and populated during scheduling (phases ~150+):
- Stall cycles (how many cycles to wait before issuing the next instruction)
- Yield hint (whether the warp scheduler should yield after this instruction)
- Dependency barrier assignments
- Reuse flags (register reuse hints for the hardware register file cache)
The stall cycle field is checked during scoreboard computation at sub_A08910. The control word format is the same as the SASS encoding control field.
Data Type Flags (offset +100)
The byte at offset +100 encodes the instruction's data type in its low 3 bits:
uint8_t type_code = *(uint8_t*)(instr + 100) & 7;
These correspond to SASS data type suffixes (.F32, .F64, .U32, .S32, .F16, .B32, etc.). The exact encoding is architecture-specific and queried through the InstructionInfo descriptor table.
ROT13 Opcode Name Table
All SASS opcode mnemonic strings in the binary are ROT13-encoded. This is lightweight obfuscation, not a security measure. The InstructionInfo constructor at sub_BE7390 populates a name table at object offset +4184 with 16-byte {char* name, uint64_t length} entries.
Table Structure
InstructionInfo object:
+0 vtable pointer (off_233ADC0)
+8 parent pointer
...
+4184 opcode_names[0].name_ptr -> "REEONE" (ROT13 of ERRBAR)
+4192 opcode_names[0].length -> 6
+4200 opcode_names[1].name_ptr -> "VZNQ" (ROT13 of IMAD)
+4208 opcode_names[1].length -> 4
...
+9320 opcode_names[321].name_ptr -> "YNFG" (ROT13 of LAST)
+9328 opcode_names[321].length -> 4
+9336 encoding_category_map[0..321] (322 x int32, from unk_22B2320)
+10624 (end of encoding category map)
Total: 322 named opcodes (indices 0-321). The 0x508 bytes at +9336 are not additional name entries -- they are a 322-element int32 array mapping each opcode index to an encoding category number (see Encoding Category Map below).
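Since ROT13 is an involution over A-Z, decoding the name table is a single pass; digits and underscores (e.g. `PF2E_32`) pass through unchanged. A decoder sketch (ours; the binary stores only the encoded strings):

```c
#include <string.h>

/* Decode a ROT13-obfuscated mnemonic. Returns a pointer to a static
   buffer for convenience in one-off lookups; not thread-safe. */
static const char *rot13_str(const char *s)
{
    static char buf[64];
    size_t i;
    for (i = 0; s[i] && i < sizeof buf - 1; ++i) {
        char c = s[i];
        if (c >= 'A' && c <= 'Z')      c = 'A' + (c - 'A' + 13) % 26;
        else if (c >= 'a' && c <= 'z') c = 'a' + (c - 'a' + 13) % 26;
        buf[i] = c;   /* digits and '_' pass through */
    }
    buf[i] = '\0';
    return buf;
}
```

Running the decoder over the 322 entries reproduces the table below.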
Full Decoded Opcode Table (Base ISA, sm_70+)
| Idx | ROT13 | SASS | Category |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal) |
| 1 | VZNQ | IMAD | Integer multiply-add |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add wide |
| 3 | VNQQ3 | IADD3 | 3-input integer add |
| 4 | OZFX | BMSK | Bit mask |
| 5 | FTKG | SGXT | Sign extend |
| 6 | YBC3 | LOP3 | 3-input logic |
| 7 | VFRGC | ISETP | Integer set-predicate |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address |
| 10 | FUS | SHF | Funnel shift |
| 11 | SSZN | FFMA | FP fused multiply-add |
| 12 | SNQQ | FADD | FP add |
| 13 | SZHY | FMUL | FP multiply |
| 14 | SZAZK | FMNMX | FP min/max |
| 15 | SFJMNQQ | FSWZADD | FP swizzle add |
| 16 | SFRG | FSET | FP set |
| 17 | SFRY | FSEL | FP select |
| 18 | SFRGC | FSETP | FP set-predicate |
| 19 | ZBI | MOV | Move |
| 20 | FRY | SEL | Select |
| 21 | C2E | P2R | Predicate to register |
| 22 | E2C | R2P | Register to predicate |
| 23 | CYBC3 | PLOP3 | Predicate 3-input logic |
| 24 | CEZG | PRMT | Byte permute |
| 25 | ABC | NOP | No-op |
| 26 | IBGR | VOTE | Warp vote |
| 27 | PF2E_32 | CS2R_32 | Control/status to register (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status to register (64-bit) |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | CFZGRFG | PSMTEST | PSM test |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference (4-way) |
| 33 | VQC | IDP | Integer dot product |
| 34 | VQR | IDE | Integer dot expand |
| 35 | V2V | I2I | Integer to integer conversion |
| 36 | V2VC | I2IP | Integer to integer (packed) |
| 37 | VZAZK | IMNMX | Integer min/max |
| 38 | CBCP | POPC | Population count |
| 39 | SYB | FLO | Find leading one |
| 40 | SPUX | FCHK | FP check (NaN/Inf) |
| 41 | VCN | IPA | Interpolate attribute |
| 42 | ZHSH | MUFU | Multi-function unit (SFU) |
| 43 | S2S | F2F | Float to float conversion |
| 44 | S2S_K | F2F_X | Float to float (extended) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer (extended) |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float (extended) |
| 49 | SEAQ | FRND | FP round |
| 50 | SEAQ_K | FRND_X | FP round (extended) |
| 51 | NY2C | AL2P | Attribute to patch |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch (indexed) |
| 53 | OERI | BREV | Bit reverse |
| 54 | OZBI_O | BMOV_B | Barrier move (B) |
| 55 | OZBI_E | BMOV_R | Barrier move (R) |
| 56 | OZBI | BMOV | Barrier move |
| 57 | F2E | S2R | Special register to register |
| 58 | O2E | B2R | Barrier to register |
| 59 | E2O | R2B | Register to barrier |
| 60 | YRCP | LEPC | Load effective PC |
| 61 | ONE | BAR | Barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier (indexed) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base |
| 66 | QRCONE | DEPBAR | Dependency barrier |
| 67 | OEN | BRA | Branch |
| 68 | OEK | BRX | Branch indirect |
| 69 | WZC | JMP | Jump |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return |
| 73 | OFFL | BSSY | Branch sync stack push |
| 74 | OERNX | BREAK | Break |
| 75 | OCG | BPT | Breakpoint trap |
| 76 | XVYY | KILL | Kill thread |
| 77 | RKVG | EXIT | Exit |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync |
| 80 | ZNGPU | MATCH | Warp match |
| 81 | ANABFYRRC | NANOSLEEP | Nanosleep |
| 82 | ANABGENC | NANOTRAP | Nano trap |
| 83 | GRK | TEX | Texture fetch |
| 84 | GYQ | TLD | Texture load |
| 85 | GYQ4 | TLD4 | Texture load 4 |
| 86 | GZZY | TMML | Texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with derivatives |
| 88 | GKD | TXQ | Texture query |
| 89 | YQP | LDC | Load constant |
| 90 | NYQ | ALD | Attribute load |
| 91 | NFG | AST | Attribute store |
| 92 | BHG | OUT | Tessellation output |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output (final) |
| 94 | YQF | LDS | Load shared |
| 95 | FGF | STS | Store shared |
| 96 | YQT | LDG | Load global |
| 97 | FGT | STG | Store global |
| 98 | YQY | LDL | Load local |
| 99 | FGY | STL | Store local |
| 100 | YQ | LD | Load (generic) |
| 101 | FG | ST | Store (generic) |
| 102 | NGBZ | ATOM | Atomic |
| 103 | NGBZT | ATOMG | Atomic global |
| 104 | ERQ | RED | Reduction |
| 105 | NGBZF | ATOMS | Atomic shared |
| 106 | DFCP | QSPC | Query space |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control (no scoreboard) |
| 108 | PPGY | CCTL | Cache control |
| 109 | PPGYY | CCTLL | Cache control (L2) |
| 110 | PPGYG | CCTLT | Cache control (texture) |
| 111 | ZRZONE | MEMBAR | Memory barrier |
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |
| 116 | CVKYQ | PIXLD | Pixel load |
| 117 | VFOREQ | ISBERD | Indexed set binding for redirect |
| 118 | VFORJE | ISBEWR | Indexed set binding for write |
| 119 | FUSY | SHFL | Warp shuffle |
| 120 | JNECFLAP | WARPSYNC | Warp synchronize |
| 121 | ZVRYQ | MYELD | Yield (internal) |
| 122 | QSZN | DFMA | Double FP fused multiply-add |
| 123 | QNQQ | DADD | Double FP add |
| 124 | QZHY | DMUL | Double FP multiply |
| 125 | QFRGC | DSETP | Double FP set-predicate |
| 126 | UNQQ2 | HADD2 | Half-precision add (packed) |
| 127 | UNQQ2_S32 | HADD2_F32 | Half-precision add (F32 accum) |
| 128 | USZN2 | HFMA2 | Half FP fused multiply-add (packed) |
| 129 | UZHY2 | HMUL2 | Half-precision multiply (packed) |
| 130 | UFRG2 | HSET2 | Half-precision set (packed) |
| 131 | UFRGC2 | HSETP2 | Half-precision set-predicate (packed) |
| 132 | UZZN_16 | HMMA_16 | Half MMA (16-wide) |
| 133 | UZZN_32 | HMMA_32 | Half MMA (32-wide) |
| 134 | VZZN | IMMA | Integer MMA |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo) |
Opcode Categories
The ~400 opcodes group into these functional categories:
Integer ALU (18 opcodes): IMAD, IMAD_WIDE, IADD3, IADD, IMNMX, IABS, BMSK, SGXT, LOP3, ISETP, LEA, SHF, POPC, FLO, BREV, IDP, IDE, PRMT
FP32 ALU (9 opcodes): FFMA, FADD, FMUL, FMNMX, FSWZADD, FSET, FSEL, FSETP, FCHK
FP64 ALU (4 opcodes): DFMA, DADD, DMUL, DSETP
FP16 Packed (6 opcodes): HADD2, HADD2_F32, HFMA2, HMUL2, HSET2, HSETP2
Conversion (12 opcodes): F2F, F2I, I2F, I2I, F2FP, F2IP, I2FP, I2IP, FRND, and their _X extended variants
Data Movement (6 opcodes): MOV, UMOV, MOVM, SEL, USEL, PRMT
Special Function (1 opcode): MUFU (sin, cos, rsqrt, rcp, etc.)
Predicate (4 opcodes): PLOP3, P2R, R2P, VOTE
Memory -- Global (4 opcodes): LDG, STG, LD, ST
Memory -- Shared (4 opcodes): LDS, STS, LDSM, STSM
Memory -- Local (2 opcodes): LDL, STL
Memory -- Constant (2 opcodes): LDC, LDCU
Atomic/Reduction (6 opcodes): ATOM, ATOMG, ATOMS, RED, REDUX, REDAS
Texture (6 opcodes): TEX, TLD, TLD4, TMML, TXD, TXQ
Surface (4 opcodes): SULD, SUST, SUATOM, SURED
Control Flow (12 opcodes): BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, KILL, BPT
Synchronization (6 opcodes): BAR, BAR_INDEXED, DEPBAR, MEMBAR, WARPSYNC, NANOSLEEP
Tensor Core / MMA (25+ opcodes): HMMA_*, IMMA_*, BMMA_*, DMMA, GMMA, QMMA_*, OMMA_*, and their sparse (_SP_) variants
Uniform Register (30+ opcodes): All U-prefixed variants (UIMAD, UIADD3, UMOV, USEL, ULOP3, ULEPC, etc.) that operate on uniform registers shared across the warp
Blackwell sm_100+ (28 opcodes): ACQBLK, CGABAR_*, CREATEPOLICY, ELECT, ENDCOLLECTIVE, FENCE_G/S/T, LDTM, STTM, MEMSET, ACQSHMINIT, UTCBAR_*, UTCMMA_*, UTCSHIFT_*, UTCCP_*, TCATOMSWS, TCLDSWS, TCSTSWS, VIRTCOUNT, UGETNEXTWORKID, FADD2, FFMA2, FMUL2, FMNMX3, CREDUX, QFMA4, QADD4, QMUL4, WARPGROUP
Instruction Descriptor Table
The InstructionInfo class at sub_BE7390 (inheriting from the base class at sub_738E20) provides a per-opcode descriptor table consulted by every pass in the compiler. The derived constructor calls the base class constructor sub_738E20, then populates the ROT13 name table, allocates the per-opcode descriptor block, and queries SM-specific configuration knobs. The resulting object is ~11,240 bytes inline plus a 10,288-byte dynamically allocated descriptor block.
Construction Sequence
sub_BE7390(this, parent_context) executes in this order:
- Base class init (`sub_738E20`): sets vtable, stores parent pointer, allocates the opcode-to-descriptor mapping array (512 bytes, 64 QWORD slots), zeroes all four descriptor data areas (+744..+3624), queries SM version and stores it at +3728, allocates the per-opcode property array (`4 * sm_opcode_count` bytes at +4112), allocates a reference-counted descriptor block (24 bytes at +4136), and queries knobs 812/867/822/493 for configuration. Sets `+4132 = 8` and `+4176 = 0` (init incomplete).
- Override vtable: `+0 = off_233ADC0` (derived vtable).
- Populate ROT13 name table: 322 inline entries (indices 0-321) at offsets +4184..+9328, each 16 bytes (`{char* name_ptr, u64 length}`).
- Bulk-copy encoding category map: `qmemcpy(+9336, unk_22B2320, 0x508)` -- a 322-entry `int32` array (1288 bytes) mapping opcode index to encoding category number. The source table varies by arch constructor (see below).
- Initialize post-table fields: zero offsets +10624..+10680.
- Store sentinels: `+11200 = -2`, `+11224 = 0xFFFFFFFF`.
- Set constants: `+4048 = 2`, `+4056 = 10`, `+3733 = 1`.
- Descriptor defaults (`sub_1370BD0`): populates scheduling templates and operand defaults at +192..+704.
- Override property mode: `+4132 = 7` (overwriting the base class's 8).
- Allocate descriptor block: 10,288 bytes via the MemoryManager, partitioned into 3 sections.
- Query SM-specific config: reads `parent->+1664->+72->+55080` and stores the result at +10648.
InstructionInfo Object Layout
The complete byte-level field map, derived from sub_BE7390 (derived constructor), sub_738E20 (base constructor), and sub_1370BD0 (descriptor defaults init).
Region 1: Vtable, Parent, and Core Identity (+0 to +91)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | off_233ADC0 (derived); base chain: off_21DB6E8 / off_21B4790 |
| +8 | 8 | ptr | parent_ctx | Parent compilation context pointer |
| +44 | 8 | u64 | operand_counts | Packed pair 0x100000001: lo=1 dst, hi=1 src (base default) |
Region 2: Scheduling Defaults and Flags (+92 to +159)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +92 | 16 | xmm | sched_defaults | Scheduling parameter defaults (loaded from xmmword_2029FE0) |
| +108 | 4 | i32 | desc_idx_a | Descriptor index sentinel = 0 |
| +112 | 4 | i32 | desc_idx_b | Descriptor index sentinel = -1 (0xFFFFFFFF) |
| +116 | 1 | u8 | flag_116 | = 0 |
| +117 | 1 | u8 | flag_117 | = 0 |
| +118 | 1 | u8 | flag_118 | = 1 |
| +120 | 3 | u8[3] | flags_120 | All = 0 |
| +136 | 4 | i32 | sentinel_136 | = -1 (0xFFFFFFFF) |
| +148 | 8 | u64 | reserved_148 | = 0 |
Region 3: Opcode-to-Descriptor Mapping (+160 to +191)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +160 | 8 | ptr | mapping_allocator | MemoryManager used for mapping array |
| +168 | 8 | ptr | mapping_array | Dynamically allocated QWORD array (initial: 512 bytes, 64 entries) |
| +176 | 4 | i32 | mapping_count | Current entry count (initially 63) |
| +180 | 4 | i32 | mapping_capacity | Current capacity (initially 64) |
| +184 | 8 | u64 | packed_flags | = 0x4000000000 (bit 38: descriptor config flag) |
Region 4: Descriptor Defaults (+192 to +704, set by sub_1370BD0)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +192 | 8 | u64 | default_operand_cfg | Packed 0x200000002: lo=2, hi=2 |
| +200 | 4 | u32 | default_dst_count | = 4 |
| +208 | 4 | u32 | default_modifier | = 2 |
| +216 | 16 | xmm | sched_template_a | Scheduling template (from xmmword_233B1E0) |
| +240 | 4 | u32 | default_operand_w | = 4 |
| +448 | 8 | u64 | section_marker_448 | = 1 |
| +456 | 4 | u32 | section_id_456 | = 2 |
| +464 | 4 | u32 | section_id_464 | = 3 |
| +472 | 16 | xmm | sched_template_b | Scheduling template (from xmmword_233B1F0) |
| +496 | 4 | u32 | default_value_496 | = 5 |
Gaps within +204..+447 and +500..+695 are zero-initialized by sub_1370BD0.
Region 5: Primary Descriptor Data (+744 to +2155)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +744 | 8 | u64 | desc_data_start | Primary area header = 0 |
| +752..+2155 | 1404 | u8[] | desc_data | Zero-initialized per-opcode descriptor records |
Region 6: Secondary Descriptor Area (+2156 to +2211)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +2156 | 8 | u64 | secondary_header | = 0 |
| +2164..+2211 | 48 | u8[] | secondary_data | Zero-initialized |
Region 7: Tertiary Descriptor Area (+2212 to +3623)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +2212 | 8 | u64 | tertiary_header | = 0 |
| +2220..+3623 | 1404 | u8[] | tertiary_data | Zero-initialized |
| +2372 | 4 | u32 | desc_record_type_a | = 4 (set by derived constructor) |
| +2400 | 4 | u32 | desc_record_type_b | = 4 (set by derived constructor) |
Region 8: Quaternary Descriptor Area and Target Config (+3624 to +3735)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +3624 | 8 | u64 | quaternary_header | = 0 |
| +3640..+3664 | 32 | u64[4] | quat_ptrs | All = 0 |
| +3672 | 1 | u8 | is_sm75_plus | = 1 if SM ID >= 16389, else 0 |
| +3673 | 1 | u8 | target_flag_bit6 | Bit 6 of *(target+1080) |
| +3674 | 1 | u8 | target_flag_bit7 | Bit 7 of *(target+1080) |
| +3675..+3682 | 8 | u8[8] | zero_pad | All = 0 |
| +3684 | 32 | u128[2] | zero_pad_3684 | = 0 |
| +3716..+3717 | 2 | u8[2] | flags_3716 | = 0 |
| +3720 | 4 | u32 | value_3720 | = 0 |
| +3724 | 1 | u8 | flag_3724 | = 1 |
| +3725 | 1 | u8 | flag_3725 | = 0 |
| +3728 | 4 | u32 | sm_opcode_count | SM version / total opcode count from arch query |
| +3732 | 1 | u8 | knob_812_flag | Knob 812 derived flag |
| +3733 | 1 | u8 | derived_flag | = 1 (set by derived constructor; base leaves at 0) |
Region 9: Scheduling Configuration (+4016 to +4111)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4016 | 16 | u128 | sched_config_a | = 0 |
| +4032 | 8 | u64 | sched_config_b | = 0 |
| +4040 | 16 | xmm | sched_constants | Loaded from xmmword_21B4EE0 |
| +4048 | 4 | u32 | constant_2 | = 2 (derived overrides base default 0) |
| +4056 | 4 | u32 | constant_10 | = 10 (derived overrides base default 0x7FFFFFFF) |
| +4060..+4064 | 8 | u32[2] | zero_pad | = 0 |
| +4072 | 8 | u64 | sched_ptr | = 0 |
| +4080 | 8 | u64 | sched_ext | = 0 |
| +4088 | 1 | u8 | flag_4088 | = 0 |
| +4089 | 1 | u8 | knob_867_flag | = 1 if knob absent; = (knob_value == 1) otherwise |
| +4090 | 1 | u8 | flag_4090 | = 0 |
| +4092 | 4 | u32 | knob_822_value | Default 7; overridden by knob 822 |
| +4096 | 4 | u32 | knob_493_value | Default 5; overridden by knob 493 |
Region 10: Per-Opcode Property Array (+4112 to +4183)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4112 | 8 | ptr | property_array | Allocated: 4 * sm_opcode_count bytes; 4 bytes per opcode |
| +4120 | 4 | u32 | property_count | = 4 * !hasExtendedPredicates (0 or 4) |
| +4124 | 4 | u32 | property_aux | = 0 |
| +4128 | 1 | u8 | property_init_flag | = 1 |
| +4132 | 4 | u32 | property_mode | Base sets 8, derived overwrites to 7 |
| +4136 | 8 | ptr | ref_counted_block | 24-byte block: [refcount=2, data=0, allocator_ptr] |
| +4144..+4160 | 24 | u64[3] | rc_aux | All = 0 |
| +4176 | 1 | u8 | init_complete | = 0 initially; set to 1 after full initialization |
Region 11: ROT13 Opcode Name Table (+4184 to +10623)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4184 | 5152 | struct[322] | opcode_names[0..321] | 322 inline entries, each 16 bytes: {char* name, u64 len} |
| +9336 | 1288 | int32[322] | encoding_category_map[0..321] | Per-opcode encoding category; bulk-copied from arch-specific static table (see below) |
Total: 322 named opcodes. Index N name is at offset 4184 + 16*N. The getName accessor at sub_BEBAC0 computes this + 4184 + 16 * opcode directly. Encoding category for opcode N is at +9336 + 4*N.
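The fixed 16-byte stride makes the accessor arithmetic trivial to reproduce. A sketch of the offset computations used by `sub_BEBAC0` and the category lookup (the struct and helper names are our reconstructions):

```c
#include <stdint.h>

/* Recovered 16-byte name-table entry layout at +4184. */
typedef struct { const char *name; uint64_t len; } NameEntry;

/* Byte offsets within the InstructionInfo object, per sub_BEBAC0. */
static uint64_t name_entry_offset(uint32_t opcode) { return 4184u + 16u * (uint64_t)opcode; }
static uint64_t category_offset(uint32_t opcode)   { return 9336u + 4u * (uint64_t)opcode; }
```

Offsets at the table edges line up with the layout above: entry 0 at +4184, entry 321 at +9320, and category 321 ending the map just before +10624.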
Encoding Category Map
The 1288-byte block at +9336 is a 322-element int32 array that maps each opcode index to an encoding category number. The SASS mnemonic lookup function (sub_1377C60) uses this to resolve a (mnemonic, arch) pair to a binary encoding format descriptor.
Arch-specific source tables:
| Constructor | Source Table | Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity map: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped: some entries differ from identity |
| sub_BE7390 | unk_22B2320 | Arch-remapped: some entries differ from identity |
The base constructor uses a pure identity map where opcode N maps to encoding category N. Arch-specific constructors override selected entries so the same mnemonic at different opcode indices can map to different encoding formats. For example, DMMA at opcode index 180 maps to encoding category 434 on one arch, while DMMA at opcode index 215 maps to encoding category 515 on another.
Reader: sub_1377C60 (SASS mnemonic lookup)
// After matching mnemonic string v11 to opcode index v18 via ROT13 comparison:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336); // encoding_category_map[v18]
// v84 is then FNV-1a hashed together with arch discriminator v16,
// and looked up in the hash table at *(a1 + 10672) to find the
// encoding format descriptor for this (category, arch) pair.
The hash table at +10672 stores entries of the form {encoding_category, arch_code, format_value}, keyed by FNV-1a of (encoding_category, arch_discriminator). This is the central mechanism that maps a SASS mnemonic string plus target architecture to the correct binary encoding format.
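The exact byte order and seed that ptxas feeds into the hash for the (category, arch) key has not been fully recovered; as a sketch of the mechanism, a standard 32-bit FNV-1a over an arbitrary byte span:

```c
#include <stdint.h>
#include <stddef.h>

/* Generic 32-bit FNV-1a. Treat this as an illustration of the hashing
   scheme sub_1377C60 uses, not the binary's literal key construction. */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;            /* FNV offset basis */
    while (len--) {
        h ^= *p++;
        h *= 0x01000193u;                /* FNV prime */
    }
    return h;
}
```

A plausible key would be the concatenated `(encoding_category, arch_discriminator)` bytes fed through this routine, with the result used to index the table at +10672.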
Region 12: Descriptor Block Control (+10624 to +10687)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +10624 | 8 | u64 | block_ctrl_a | = 0 |
| +10632 | 8 | u64 | block_ctrl_b | = 0 |
| +10648 | 4 | u32 | arch_config | SM-specific config from target+55080/55088 |
| +10656 | 8 | ptr | descriptor_block | Pointer to allocated 10,288-byte per-opcode descriptor block |
| +10664 | 8 | ptr | block_allocator | MemoryManager that allocated the descriptor block |
| +10672 | 8 | ptr | encoding_lookup_table | Hash table for (encoding_category, arch) -> format descriptor lookup; read by sub_1377C60 |
| +10680 | 8 | u64 | block_aux_b | = 0 |
Region 13: Sentinels and Architecture Handler (+11200 to +11240)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +11200 | 4 | i32 | sentinel | = -2 (0xFFFFFFFE) |
| +11208 | 8 | ptr | arch_handler | = parent_ctx->+16 (MemoryManager) |
| +11216 | 8 | u64 | zero_11216 | = 0 |
| +11224 | 8 | u64 | sentinel_11224 | = 0xFFFFFFFF |
| +11232 | 1 | u8 | flag_11232 | = 0 |
| +11236 | 4 | u32 | zero_11236 | = 0 |
Per-Opcode Descriptor Block (10,288 bytes)
Allocated by the derived constructor and stored at +10656. The block is 10288 / 8 = 1286 QWORD entries, partitioned into three sections:
+--------------------+ block + 0
| Section 0 header | QWORD[0] = 0
+--------------------+ block + 8
| Section 0 payload | QWORD[1..640] = all zero (memset)
| (640 slots) | Per-opcode descriptors for opcodes 0..639
+--------------------+ block + 5128
| Section 1 header | QWORD[641] = 0
+--------------------+ block + 5136
| Section 1 payload | QWORD[642..1283] (NOT explicitly zeroed)
| (642 slots) | Modifier-variant descriptors (opcode | 0x1000, etc.)
+--------------------+ block + 10272
| Section 2 (16B) | QWORD[1284] = parent_ctx (back-pointer)
| | QWORD[1285] = instr_info (self back-pointer)
+--------------------+ block + 10288
Section 0 (5,128 bytes): 641 QWORD slots. Only the payload (slots 1..640, 5,120 bytes) is explicitly zeroed. Each slot corresponds to a base opcode index. With 402 named opcodes, ~240 slots remain spare.
Section 1 (5,144 bytes): 643 QWORD slots. The header is zeroed but the payload is NOT explicitly zeroed -- it relies on the arena allocator's default behavior or lazy initialization during opcode registration. Likely stores modifier-variant descriptors (e.g., entries for opcode | 0x1000 when bits 12-13 carry sub-operation modifiers).
Section 2 (16 bytes): Two back-pointers for navigating from the descriptor block back to its owning objects (parent compilation context and the InstructionInfo instance).
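The section boundaries in the diagram follow directly from the QWORD counts. Reproducing the arithmetic as named constants (the constant names are ours):

```c
#include <stdint.h>

/* Section boundaries of the 10,288-byte per-opcode descriptor block. */
enum {
    SEC0_QWORDS = 641,                      /* header + 640 payload slots */
    SEC1_QWORDS = 643,                      /* header + 642 payload slots */
    SEC2_BYTES  = 16,                       /* two back-pointers          */
    SEC0_BYTES  = SEC0_QWORDS * 8,          /* 5128                       */
    SEC1_BYTES  = SEC1_QWORDS * 8,          /* 5144                       */
    BLOCK_BYTES = SEC0_BYTES + SEC1_BYTES + SEC2_BYTES  /* 10288 total    */
};
```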
Architecture-Specific Sub-Tables (sub_896D50, 26,888 bytes)
The architecture-specific extended property object is NOT stored inside InstructionInfo. It is lazily allocated by sub_7A4650, which gates on target+372 == 0x8000 (sm_80 / Ampere targets). The allocation is 26,888 bytes, constructed by sub_896D50(block, parent_context).
sub_896D50 Object Layout
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | off_21DADF8 |
| +8 | 8 | ptr | parent_ctx | From construction parameter |
| +40 | 8 | ptr | allocator_base | MemoryManager from parent->+16 |
Property Array A (at sub-object +56):
| Sub-offset | Field | Description |
|---|---|---|
| +56 | ptr | Array pointer: 64 bytes per entry, 772 entries (49,408 bytes allocated) |
| +64 | i32 | Count = 771 |
| +68 | i32 | Capacity = 772 |
Each 64-byte entry: bytes [0..11] initialized to 0xFF (pipeline-unassigned sentinel), bytes [12..63] zeroed. Stores latency, throughput, port mask, and register class requirements per opcode.
Property Array B (at sub-object +80):
| Sub-offset | Field | Description |
|---|---|---|
| +80 | ptr | Array pointer: 36 bytes per entry, 772 entries (27,792 bytes allocated) |
| +88 | i32 | Count = 771 |
| +92 | i32 | Capacity = 772 |
Each 36-byte entry: all zeroed. Stores encoding class, format identifiers, operand encoding rules.
Property Array C (at sub-object +176):
| Sub-offset | Field | Description |
|---|---|---|
| +176 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +184 | i32 | Count = 34 |
| +188 | i32 | Capacity = 35 |
Each 16-byte entry: zeroed. Stores functional unit properties for major FU categories.
Property Array D (at sub-object +200):
| Sub-offset | Field | Description |
|---|---|---|
| +200 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +208 | i32 | Count = 34 |
Parallel table for alternate functional unit configurations.
Dimension Table (at sub-object +472):
| Sub-offset | Field | Description |
|---|---|---|
| +472 | ptr | 168-byte block: [count=40, entries[0..39]], 4 bytes per entry, zero-initialized |
Alphabetical SASS Name Table (at sub-object +11360):
Starting at offset +11360, sub_896D50 populates an alphabetically sorted ROT13 name table using the same {char*, u64} format. Unlike the InstructionInfo name table (indexed by opcode), this table is sorted by decoded mnemonic name and includes modifier variants:
- `OZZN.168128` (BMMA.168128)
- `PPGY.P.YQP.VINYY` (CCTL.C.LDC.IVALL)
- `VZNQ.JVQR.ERNQ.NO` (IMAD.WIDE.READ.AB)
- `VZZN.FC.{168128.*|16864.*8.*8}` (IMMA.SP.{...} -- regex patterns for variant matching)
This table is used for SASS assembly parsing and opcode-to-encoding resolution, where a single base opcode may map to multiple encoding variants distinguished by modifier suffixes.
Knob-derived fields:
| Sub-offset | Field | Source |
|---|---|---|
| +108 | i32 | Knob 803 value (instruction scheduling latency override) |
| +468 | u8 | = 0 |
| +469 | u8 | = 1 |
| +470 | u8 | = 1 |
Accessor Stubs
40+ tiny vtable accessor stubs at 0x859F80-0x85A5F0 and 0x868500-0x869700 provide virtual dispatch access to per-opcode properties. Typical pattern:
int getLatency(ArchSpecificInfo* this, int opcode) {
return *(int*)(this->property_array_a + 64 * opcode + latency_offset);
}
PTX Text-Generation Operand Accessor API
The PTX text generation subsystem (instruction pretty-printer, dispatcher at sub_5D4190) converts Ori IR instructions into PTX assembly text. The ~580 formatter functions at 0x4DA340-0x5A9FFF query a PTX instruction context object through a stable API of 48 small accessor helpers concentrated at 0x707000-0x710FFF.
PTX Instruction Context Object
The accessor functions do NOT operate on the 296-byte Ori IR instruction directly. They take a PTX instruction context object (~2500+ bytes) that contains pre-decoded fields for text generation. The raw Ori instruction is accessible at *(context + 1096). Each formatter receives this context as argument a1 and a pool allocator table as argument a2.
Partial field map of the PTX instruction context (offsets used by accessors):
| Offset | Size | Type | Field | Accessed By |
|---|---|---|---|---|
| +544 | 8 | ptr | predicate_ptr | has_predicate, get_opcode_string |
| +564 | 4 | u32 | saturation_code | get_saturation_mode (== 12 means saturate) |
| +596 | 4 | u32 | field_operand_count | get_field_a..get_field_d |
| +600 | 1 | u8 | flag_byte_a | Bit 0: precision, bit 6: addressing, bit 7: addr_mode |
| +604 | 1 | u8 | rounding_mode | Bits 0-2: rounding mode code (3 bits) |
| +605 | 1 | u8 | scale_byte | Bits 4-7: scale code (4 bits, 16 entries) |
| +609 | 1 | u8 | base_addr_byte | Bits 2-3: base address mode (2 bits, 4 entries) |
| +611 | 1 | u8 | param_flags | Bits 4-5: parameter variant selector |
| +615 | 1 | u8 | ftz_byte | Bits 6-7: FTZ flag code (2 bits, 4 entries) |
| +620 | 1 | u8 | variant_index | Variant string lookup index (8 bits, 256 entries) |
| +627 | 1 | u8 | flag_byte_b | Bits 0-1: extended_op, 2-3: flag_b, 4-5: modifier/variant |
| +640 | 4 | i32 | precision_code | Index into precision string table |
| +648 | var | ptr[] | operand_names | Per-operand name string pointer array (8B per slot) |
| +800 | 4 | u32 | operand_count | Number of operands for comparison/count accessors |
| +816 | var | ptr[] | reg_operands | Register operand pointer array (8B per slot) |
| +944 | var | u32[] | operand_types | Per-operand type code array (4B per slot) |
| +1024 | var | ptr[] | src_part0 | Source part 0 pointer array (8B per slot) |
| +1264 | var | ptr[] | src_part1 | Source part 1 pointer array (8B per slot) |
| +1504 | var | ptr[] | data_types_0 | Data type array, part 0 (8B per slot) |
| +1744 | var | ptr[] | data_types_1 | Data type array, part 1 (8B per slot) |
| +1984 | var | u32[] | target_sm | Target SM version array (4B per slot) |
| +2120 | 8 | ptr | opcode_name | Opcode mnemonic string pointer |
| +2488 | 8 | ptr | string_intern | String interning table for modifier deduplication |
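The flag bytes in this map pack several modifier codes into single bytes. A minimal sketch of the bit extraction, assuming only the recovered offsets and bit positions above (the helper names and the standalone-byte calling convention are hypothetical; the real accessors read the bytes out of the context object):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers mirroring the recovered accessor bit math.
 * Each takes the raw flag byte that lives at the named context offset. */
static unsigned rounding_mode(uint8_t byte_604)  { return byte_604 & 7; }          /* bits 0-2 */
static unsigned scale_code(uint8_t byte_605)     { return (byte_605 >> 4) & 0xF; } /* bits 4-7 */
static unsigned base_addr_mode(uint8_t byte_609) { return (byte_609 >> 2) & 3; }   /* bits 2-3 */
static unsigned variant_count(uint8_t byte_627)  { return (byte_627 >> 4) & 3; }   /* bits 4-5 */
```

The same shift-and-mask shapes recur verbatim in the accessor catalog below (e.g. getRoundingMode, getScaleString, getBaseAddress, getVariantCount).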
Accessor Catalog
Tier 1: Core Accessors (>200 callers)
Used by nearly every formatter function. These are the fundamental building blocks of PTX text generation.
| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_710860 | getDataType | 39B | 2953 | (ctx, idx, part) -> u8 | part ? **(ctx+1744+8*idx) & 0x3F : **(ctx+1504+8*idx) & 0x3F |
| sub_70B910 | getSrcPart0 | 12B | 1656 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1024) |
| sub_70B8E0 | getRegOperand | 12B | 1449 | (ctx, idx) -> ptr | *(ctx + 8*idx + 816) |
| sub_70B920 | getSrcPart1 | 12B | 1296 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1264) |
| sub_70B700 | hasPredicate | 14B | 946 | (ctx) -> bool | *(ctx + 544) != 0 |
| sub_70B780 | getPredicateName | 151B | 514 | (ctx, pool) -> str | Allocates "@" + opcode_name; inserts "!" if negated |
| sub_70CA60 | getOperandType | 11B | 480 | (ctx, idx) -> u32 | *(ctx + 4*idx + 944) |
| sub_70B710 | getOpcodeString | 111B | 348 | (ctx, pool) -> str | Allocates "@" + *(ctx+2120) from arena pool |
| sub_70FA00 | getTargetSM | 10B | 286 | (ctx, idx) -> u32 | *(ctx + 4*idx + 1984) |
Tier 2: Modifier and Property Accessors (10-200 callers)
Used by instruction-class families (memory ops, float ops, texture ops, etc.).
| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_70CA70 | getTypeSuffix | 427B | 191 | (ctx, pool) -> str | Iterates *(ctx+796) type codes; looks up in off_2032300[] with interning |
| sub_70CD20 | getOperandOffset | 122B | 158 | (ctx, idx) -> str | off_2032300[*(ctx+4*idx+944)]; resolves via string interning for codes <= 0x39 |
| sub_707CE0 | getAddressOperand | 22B | 93 | (ctx) -> str | off_2033DE0[*(ctx+600) >> 7] |
| sub_70B930 | getOperandCount | 7B | 68 | (ctx) -> u32 | *(ctx + 800) |
| sub_70B4C0 | getBaseAddress | 22B | 46 | (ctx) -> str | off_2032700[(*(ctx+609) >> 2) & 3] |
| sub_709A10 | getVariantString | 73B | 46 | (ctx) -> str | off_2033060[*(ctx+620)] resolved via string interning |
| sub_70B6E0 | hasPredicate_v2 | 14B | 42 | (ctx) -> bool | *(ctx + 544) != 0 (identical body to hasPredicate) |
| sub_709760 | getComparisonOp | 127B | 21 | (ctx, pool) -> str | Iterates *(ctx+800) operand names from +648 array with " , " separator |
| sub_709FE0 | getRoundingMode | 11B | 17 | (ctx) -> u8 | *(ctx + 604) & 7 |
| sub_70A500 | getSaturationMode | 13B | 15 | (ctx) -> bool | *(ctx + 564) == 12 |
| sub_709910 | getVariantCount | 14B | 13 | (ctx) -> u8 | (*(ctx+627) >> 4) & 3 |
| sub_708E40 | getExtendedOperand | 29B | 10 | (ctx, idx) -> str | off_2033720[(*(ctx+627) >> (idx==1 ? 0 : 2)) & 3] |
Tier 3: Instruction-Class-Specific Accessors (<10 callers)
Used by specific instruction families (MMA/tensor, texture, guardrail formatters).
| Address | Name | Size | Callers | Signature | Purpose |
|---|---|---|---|---|---|
| sub_70FA10 | checkTargetSM | 66B | 7 | (ctx, idx, str) -> bool | sscanf(str, "sm_%d") then compare to *(ctx+1984+4*idx) |
| sub_70C890 | getOperandDetail | ~300B | varies | (ctx, pool, maxlen, type) -> str | Complex: hex parse, fallback to sub_707380, type-dispatch |
| sub_70A810 | getScaleString | 22B | varies | (ctx) -> str | off_2032BA0[(*(ctx+605) >> 4) & 0xF] |
| sub_70B3F0 | getFtzFlag | 22B | varies | (ctx) -> str | off_20327C0[(*(ctx+615) >> 6) & 3] |
| sub_707530 | getPrecisionString | 12B | varies | (ctx) -> str | off_2033FA0[*(ctx+640)] |
| sub_707C60 | getAddressingMode | 12B | varies | (ctx) -> bool | (*(ctx+600) & 0x40) != 0 |
| sub_707C80 | getScopeString | 22B | varies | (ctx) -> str | off_2033E00[(*(ctx+600) & 0x40) != 0] |
| sub_7075E0 | getLayoutString | 22B | varies | (ctx) -> str | off_2033EE0[*(ctx+600) & 1] -- WMMA/TCGEN05 |
| sub_707BE0 | getShapeString | 22B | varies | (ctx) -> str | off_2033E30[(*(ctx+600) & 4) != 0] -- WMMA/TCGEN05 |
| sub_7075C0 | getInstrFlagA | 7B | varies | (ctx) -> u8 | *(ctx+600) & 1 -- WMMA/rsqrt |
| sub_707BC0 | getInstrFlagB | varies | varies | (ctx) -> varies | Secondary flag accessor -- WMMA/rsqrt |
| sub_70D3B0 | getFieldA | 91B | 2 | (ctx) -> str | Returns ".transA" if operand count matches MMA shape |
| sub_70D410 | getFieldB | 99B | 2 | (ctx) -> str | Returns ".transB" (symmetric with getFieldA) |
| sub_70D480 | getFieldC | 91B | 2 | (ctx) -> str | MMA field C modifier string |
| sub_70D4E0 | getFieldD | 91B | 2 | (ctx) -> str | MMA field D modifier string |
| sub_70D360 | getModifier | 76B | 1 | (ctx, pool) -> str | Reads operand at index 3 or 5 depending on byte 627 |
| sub_70D2F0 | getImmediate | 107B | 1 | (ctx, pool) -> str | Reads operand at +672, conditionally appends second value |
| sub_70FCB0 | getParamA | varies | varies | (ctx) -> u64 | Dispatch on (*(ctx+611) & 0x30): selects guardrail constant |
| sub_70FCF0 | getParamB | varies | varies | (ctx) -> u64 | Similar dispatch on different bit field |
| sub_70E670 | getParamC | varies | varies | (ctx) -> u64 | Third parameter accessor |
Static String Tables
The accessor functions perform table-driven lookups using static string pointer arrays in .rodata. Each table is indexed by a small bit-field extracted from the context object:
| Table Address | Entries | Indexed By | Content |
|---|---|---|---|
| off_2032300 | >57 | Operand type code | Type suffix strings (.f32, .u16, .b64, etc.) |
| off_2032700 | 4 | (ctx+609 >> 2) & 3 | Base address mode strings |
| off_20327C0 | 4 | (ctx+615 >> 6) & 3 | FTZ flag strings (empty, .ftz, etc.) |
| off_2032BA0 | 16 | (ctx+605 >> 4) & 0xF | Scale modifier strings |
| off_2033060 | 256 | ctx+620 | Variant name strings |
| off_2033720 | 4 | (ctx+627 >> N) & 3 | Extended operand strings |
| off_2033DE0 | 2 | ctx+600 >> 7 | Address operand strings |
| off_2033E00 | 2 | (ctx+600 & 0x40) != 0 | Scope strings (.cta, .gpu, etc.) |
| off_2033E30 | 2 | (ctx+600 & 4) != 0 | Shape strings -- WMMA/TCGEN05 |
| off_2033EE0 | 2 | ctx+600 & 1 | Layout strings -- WMMA/TCGEN05 |
| off_2033FA0 | indexed by int | ctx+640 | Precision strings for texture ops |
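The lookup pattern can be sketched as follows. Only the index math comes from the decompilation; the table contents shown here are assumptions for illustration (the real strings live in .rodata at the off_* addresses above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed table contents -- stand-ins for off_20327C0 and off_2033EE0. */
static const char *ftz_table[4]    = { "", ".ftz", "", "" };
static const char *layout_table[2] = { ".row", ".col" };

/* Same index math as sub_70B3F0 (getFtzFlag): bits 6-7 of the byte at ctx+615. */
static const char *get_ftz_flag(uint8_t byte_615) {
    return ftz_table[(byte_615 >> 6) & 3];
}

/* Same index math as sub_7075E0 (getLayoutString): bit 0 of the byte at ctx+600. */
static const char *get_layout(uint8_t byte_600) {
    return layout_table[byte_600 & 1];
}
```

Because every table is sized to exactly cover its index range (2, 4, 16, or 256 entries for 1-, 2-, 4-, or 8-bit fields), the lookups need no bounds checks.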
Architectural Notes
- String interning: String-returning accessors for type codes <= 0x39 go through a string interning table at *(ctx+2488). The pattern is: look up a candidate string from the static table, then pass it through sub_426D60 (hash lookup) or sub_7072A0 (insert-and-return). This deduplicates PTX modifier strings across the entire text generation pass.
- Pool allocation: Accessors that construct new strings (prefixing "@", joining with separators) receive a pool allocator parameter. They allocate from the formatter's 50KB temp buffer via sub_4280C0 (get pool) -> sub_424070 (alloc from pool) -> sub_42BDB0 (abort on failure).
- Duplicate functions: sub_70B700 (hasPredicate, 946 callers) and sub_70B6E0 (hasPredicate_v2, 42 callers) have bytewise-identical bodies. Both return *(a1+544) != 0. These are likely methods in different classes (base and derived, or two sibling classes) that were not merged by the linker because they have distinct mangled names.
- MMA/tensor accessors: getFieldA through getFieldD, getLayoutString, and getShapeString are used exclusively by WMMA, HMMA, and TCGEN05 instruction formatters. They decode matrix operation modifiers (.transA, .transB, .row, .col) from compressed bit fields.
Instruction Creation
Allocation: sub_7DD010
The primary instruction allocator at sub_7DD010 (called from pass code that needs to create new instructions):
- Allocates 296 bytes from the Code Object's arena allocator (vtable+16, size 296)
- Zeroes the entire 296-byte object
- Initializes sentinel fields: offset +248 = -1, +256 = 0xFFFFFFFF, +264 and +272 = 0xFFFFFFFF00000000
- Loads scheduling parameter defaults from xmmword_2027620 into offset +208
- Appends the new instruction to the Code Object's instruction index array at +368 (resizable, 1.5x growth policy)
- Assigns a unique instruction index: *(instr + 264) = index
- Invalidates cached analysis (RPO at +792)
The instruction is created unlinked -- it is not yet in any basic block's linked list.
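The steps above can be sketched as follows, with malloc standing in for the Code Object's arena allocator and all names hypothetical (only the size, offsets, and sentinel values come from the decompilation):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 296-byte opaque instruction object, as allocated by sub_7DD010. */
typedef struct Instr { uint8_t bytes[296]; } Instr;

static Instr *instr_create(uint32_t next_index) {
    Instr *in = malloc(sizeof *in);            /* arena allocation in the real code */
    memset(in, 0, sizeof *in);                 /* zero the entire object */
    *(int32_t  *)(in->bytes + 248) = -1;       /* sentinel */
    *(uint32_t *)(in->bytes + 256) = 0xFFFFFFFFu;
    *(uint64_t *)(in->bytes + 264) = 0xFFFFFFFF00000000ull;
    *(uint64_t *)(in->bytes + 272) = 0xFFFFFFFF00000000ull;
    *(uint32_t *)(in->bytes + 264) = next_index; /* unique index overwrites the low half */
    return in;                                 /* unlinked: not yet in any block list */
}
```

Note how assigning the index into the low 32 bits of +264 leaves the 0xFFFFFFFF sentinel intact in the upper half.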
Linking: sub_925510 (Insert Before)
sub_925510 inserts instruction a2 before instruction a3 in the doubly-linked list of Code Object a1:
void InsertBefore(CodeObject* ctx, Instr* instr, Instr* before) {
// 1. Check if instruction removal impacts scheduling state
if (IsScheduleRelevant(instr, ctx))
UpdateScheduleState(ctx, instr);
// 2. Notify observers
NotifyObservers(ctx->observer_chain + 1952, instr);
// 3. Unlink from current position
if (instr->prev) {
instr->prev->next = instr->next;
if (instr->next)
instr->next->prev = instr->prev;
else
ctx->tail = instr->prev; // was tail
} else {
ctx->head = instr->next; // was head
if (instr->next)                  // guard: instr may be the only element
instr->next->prev = nullptr;
}
// 4. Insert before target
instr->next = before;
instr->bb_index = before->bb_index;
instr->prev = before->prev;
if (before->prev)
before->prev->next = instr;
if (before == ctx->head)
ctx->head = instr;
before->prev = instr;
// 5. Post-insert bookkeeping
PostInsertUpdate(ctx, instr);
}
Removal: sub_9253C0
sub_9253C0 (634 callers) removes an instruction from its linked list:
- Checks if the instruction affects scheduling state (same check as insert)
- Notifies the observer chain at Code Object +1952
- Unlinks from the doubly-linked list (updating head/tail pointers at +272/+280)
- Optionally updates the instruction map at Code Object +1136 (if the a3 flag is set)
- Handles debug info cleanup if the debug flag at byte +1421 bit 5 is set
Instruction Removal Check: sub_7E0030
Before removing an instruction (sub_7E0030, called from both sub_9253C0 and sub_925510), the compiler checks whether the removal is legal. This function examines:
- Whether the instruction is an STS (store shared, base opcode 95) with specific operand count and data type patterns (operand_count - adj == 5 with data type codes 1, 2, or 4 prevents removal)
- Whether a target-specific scheduler hook (vtable offset 2128 on the SM backend at compilation context +1584) vetoes the removal
- Whether the instruction is a PLOP3 (predicate logic, opcode 23) writing to a special register (register file type 9 at descriptor +64)
- Whether the dead-code check (sub_7DF3A0) clears the instruction, excluding opcodes 93 (OUT_FINAL), 124 (DMUL), and 248 (SM90+ opcode), which have required side effects
- Whether the opcode class has a "must keep" flag in the per-opcode property array at Code Object +776 (byte[4*opcode + 2] & 4)
Instruction Iteration
Forward Walk
The standard forward walk over a basic block's instructions:
// code_obj->head is at +272, tail at +280
instr_ptr instr = *(ptr*)(code_obj + 272);
while (instr) {
// process instruction
instr = *(ptr*)(instr + 8); // next
}
Reverse Walk
instr_ptr instr = *(ptr*)(code_obj + 280); // tail
while (instr) {
// process instruction
instr = *(ptr*)(instr + 0); // prev
}
Block-Scoped Iteration
When iterating within a specific basic block (used by scheduling, regalloc, and peephole passes), the walk starts at the block's head instruction pointer at block_entry +0 and continues until it reaches either the list tail or the next block boundary. The boundary is opcode 52 -- named AL2P_INDEXED in the ROT13 table but used universally as a BB delimiter pseudo-opcode:
// Block info at code_obj+976, 40 bytes per block
ptr block_head = *(ptr*)(*(ptr*)(code_obj + 976) + 40 * block_index);
for (instr = block_head; instr != nullptr; instr = *(ptr*)(instr + 8)) {
uint32_t op = *(uint32_t*)(instr + 72) & 0xFFFFCFFF;
if (op == 52) // BB boundary
break;
// process instruction
}
Def-Use Chain Iterator: sub_7E6090
The complex def-use chain builder sub_7E6090 (650 lines decompiled) is the core instruction analysis function. Called from sub_8E3A80 and numerous optimization passes, it:
- Walks all instructions in program order
- For each register operand (type == 1 via (word >> 28) & 7), updates the register descriptor's def/use counts at offsets +20 and +24
- Builds use chains via linked list nodes allocated from the arena (16-byte nodes with {next, instruction_ptr})
- Sets flag bits in register descriptors (+48) for live-out, same-block-def, has-prior-use, and source-only-ref
- Tracks the single-definition instruction at register descriptor +56
- Handles CSE matching: compares operand arrays of instructions with matching opcode, operand count, and auxiliary data to detect redundant computations
- Takes parameter a5 as a bitmask of register file types to process (one bit per register class)
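The use-chain construction can be sketched with the 16-byte {next, instruction_ptr} nodes described above. All names are hypothetical, and malloc stands in for the arena; the counts model the def/use counters at register-descriptor offsets +20/+24:

```c
#include <stddef.h>
#include <stdlib.h>

/* 16-byte use-chain node: {next, instruction_ptr}. */
typedef struct UseNode { struct UseNode *next; void *instr; } UseNode;

/* Simplified register descriptor: chain head plus def/use counters. */
typedef struct RegDesc { UseNode *uses; int def_count; int use_count; } RegDesc;

static void record_use(RegDesc *rd, void *instr) {
    UseNode *n = malloc(sizeof *n);  /* arena allocation in the real code */
    n->next  = rd->uses;             /* push onto the front of the chain */
    n->instr = instr;
    rd->uses = n;
    rd->use_count++;
}
```

Walking rd->uses then visits uses in reverse discovery order, which is sufficient for the counting and flag-setting queries the passes make.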
Instruction Lowering Handler -- sub_65D640 (48 KB)
The central PTX-to-Ori instruction lowering handler lives at sub_65D640. It is installed at vtable offset +32 in the ISel Phase 1 dispatch table (sub_660CE0) and called through the vtable for every PTX instruction during lowering.
Signature: int64 sub_65D640(context*, bb_ref, ptx_node*, ori_instr*)
The function reads the PTX opcode from *(*(ptx_node+32)+8) and dispatches through a ~60-case switch. An entry gate (sub_44AC80) diverts certain opcode types to an alternate handler (sub_656600). The function calls sub_A2FD90 (operand setter) 59 times to populate Ori operands on the resulting instructions.
Opcode Case Map
| Case(s) | PTX family | Handler | Description |
|---|---|---|---|
| 5 | prmt (byte permute) | inline | Decodes 8-bit per-byte channel mask, sets 2 operands |
| 6 | prmt (extended) | inline | Two-operand permute with address computation via sub_6294E0 |
| 10 | mov (special) | inline | Clears immediate flag for float type 109 |
| 12 | (delegated) | sub_659F90 | -- |
| 13 | multi-operand expansion | inline | Expands via sub_62E840, resolves type 87 (address) and 97 (register) operands |
| 17, 18, 24 | mov/cvt variants | sub_652FA0 | -- |
| 19, 20, 23 | surface ops | inline | ~200 lines: multi-register data, sub_6273E0 operand classification, up to 4 data regs + address |
| 34, 35 | load/store | inline | Optional address resolution gated on (ptx_node+61 & 0xC) |
| 45, 238 | conversion | inline | Rewrites operand type to 20 (integer), binds address via sub_6294E0 |
| 68, 71 | register indirect rewrite | inline | Checks operand size == 8, rewrites descriptor to type 110 |
| 81 | instruction expansion | inline | Creates IADD3 (opcode 38) with constant 0, reg class 12 |
| 82 | instruction expansion | inline | Rewrites to opcode 162 with IADD3 operand |
| 84 | load expansion | inline | Creates IADD3 with offset, flags 0x2000 |
| 85 | operand reorder | inline | 3-operand shuffle |
| 87 | reg class adjustment | inline | Table lookup at dword_2026C60, swaps operands 1/2, sets opcode 150 |
| 88 | matrix config | inline | MMA dimension table at dword_2026C48, sets fields 179/180 |
| 104 | 4-wide load | inline | Creates 4-operand instruction, address binding via sub_6294E0 |
| 110 | (delegated) | sub_652610 | -- |
| 123 | generic addressing | inline | Converts flat-to-specific addresses; SM-version-dependent multi-instruction sequences |
| 124, 125 | cvta / isspacep | inline | Address space conversion; creates CVTA opcode 538/539 on SM > 0x1A |
| 130 | instruction fusion | inline | Fuses instruction if operand count is not 3 or 4 |
| 165 | (delegated) | sub_65BF40 | -- |
| 175--178 | texture addr_mode | inline | Resolves .addr_mode_0/1/2 attributes from texture descriptor |
| 179 | atomic address mode | inline | Classifies atomic op type, creates SEL + ATOM sequence |
| 180 | (delegated) | sub_65CE90 | -- |
| 181, 182 | (delegated) | sub_64FF20 | -- |
| 183 | conditional atomic | inline | State space 0x20: rewrites to opcode 71 with mask 0xFF01010101 |
| 184--190 | surface/texture lowering | inline | Handles SULD/SUST/SURED (opcodes 449-456); SM-dependent operand resolution |
| 197, 198 | call site lowering | inline | Same-module vs cross-module call dispatch |
| 201--204, 208--211 | wide load/store | inline | .v2/.v4 multi-element operations with IADD3 offset computation |
| 206, 207, 212, 213 | 3-op wide load/store | inline | 3-operand variants of wide memory operations |
| 221, 222 | TMA operations | inline | Sets field 197 with value 365/366 |
Addressing Mode Types
ptxas handles four distinct addressing mode categories during instruction lowering, all resolved by sub_65D640:
1. Texture Addressing Modes (per-dimension)
Cases 175--178 resolve .addr_mode_0, .addr_mode_1, .addr_mode_2 attributes from texture descriptors. These are the PTX txq query targets.
The function walks the texture descriptor's attribute linked list at *(descriptor+16)+24, comparing each attribute name string:
// Pseudocode for cases 175-178:
addr_mode_0 = addr_mode_1 = addr_mode_2 = 0;
found = false;
for (node = attr_list_head; node != NULL; node = *node) {
name = *(node[1] + 16); // attribute name string
value = *(*(node[1] + 24) + 16); // integer value
if (strcmp(name, "addr_mode_0") == 0) { addr_mode_0 = value; found = true; }
else if (strcmp(name, "addr_mode_1") == 0) { addr_mode_1 = value; found = true; }
else if (strcmp(name, "addr_mode_2") == 0) { addr_mode_2 = value; found = true; }
}
For 2D textures (state space byte & 0xB0 == 0x20), the function checks addr_mode_0 == addr_mode_1. For 3D textures (0x30), it checks all three equal. If modes are uniform (all equal), the instruction gets a single addressing mode flag (field 91 = 1 for clamp_to_border). If modes differ, it delegates to sub_64FC90 for a multi-instruction lowering that handles per-dimension mode selection.
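The uniformity decision can be sketched as a predicate (hypothetical names; the state-space byte decoding that selects the dimensionality is omitted):

```c
#include <stdbool.h>

/* 2D textures require addr_mode_0 == addr_mode_1; 3D textures require all
 * three modes equal. A uniform result takes the single-flag lowering path;
 * a non-uniform result falls through to the sub_64FC90 multi-instruction path. */
static bool addr_modes_uniform(int dims, int m0, int m1, int m2) {
    if (dims == 2) return m0 == m1;
    if (dims == 3) return m0 == m1 && m1 == m2;
    return true;   /* 1D: a single mode is trivially uniform */
}
```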
2. Generic-to-Specific Address Conversion (case 123)
Converts flat/generic pointers to specific memory space pointers. The address space ID from *(ptx_node+40) selects the conversion strategy:
| Space ID | Memory space | Strategy |
|---|---|---|
| 4 | shared | sub_654A90 (direct conversion) |
| 5 | combined | OR of global + shared + local conversions |
| 6 | local | sub_64F7A0 with register pair 101/102 |
| 7 | generic (flat) | SM-dependent: sub_654FB0 (SM <= 0x1A) or SHR/AND extraction + SEL mux (SM > 0x10) |
| 8 | global | sub_64F7A0 with register pair 98/99 |
For generic space on older architectures (SM <= 0x1A with feature flag via sub_61AF90), a simpler single-instruction path is used. On newer architectures, a multi-instruction sequence extracts the space tag from the upper address bits.
3. Address Space Conversion (cases 124--125, cvta/isspacep)
The cvta (Convert Address) and isspacep (Is Space Predicate) instructions convert between generic and specific address spaces. For global space (type 8) on SM > 0x1A, the handler creates CVTA with opcode 538 (isspacep) or 539 (cvta) and sets register class 7 with width 4 or 16 bytes.
4. Memory Addressing Modes (implicit)
Memory addressing modes for load/store/atomic instructions are not enumerated as named constants. Instead, they emerge from the operand construction patterns in cases 19--23, 34--35, 81--84, 104, 201--213:
| Pattern | PTX syntax | Ori representation |
|---|---|---|
| Register indirect | [%rd1] | Operand type 87 from sub_629E40 |
| Register + offset | [%rd1+16] | Register operand + immediate via sub_6273E0 |
| Constant bank | c[2][0x100] | Constant operand via sub_620320 (type 12) |
| Immediate address | .local space | Constant value via sub_620320 |
| Base + index | [%rd1], %r2 | Two-operand form |
ISel Phase 1 Dispatch Vtable
sub_660CE0 constructs a 17-slot vtable at context offset +3784 for the ISel Phase 1 instruction handlers:
| Offset | Handler | Size | Role |
|---|---|---|---|
| +0 | sub_650840 | -- | Primary handler |
| +8 | sub_64EEB0 | -- | Operand handler |
| +16 | sub_64F270 | -- | Type handler |
| +24 | sub_6575D0 | 49 KB | Register-class-to-opcode dispatch |
| +32 | sub_65D640 | 48 KB | Instruction lowering (this function) |
| +40 | sub_64EDD0 | -- | Auxiliary handler |
| +128 | sub_64EEC0 | -- | Lowering helper |
Key Function Reference
| Address | Size | Function | Description |
|---|---|---|---|
| sub_7DD010 | 1.3KB | Instruction::create | Allocate and initialize 296-byte instruction |
| sub_7E0030 | 3.6KB | Instruction::canRemove | Check if instruction removal is legal |
| sub_7E0650 | 0.7KB | Instruction::hasPredGuard | Check if instruction has predicate guard |
| sub_7E0E80 | 0.1KB | Instruction::lastOpIsPred | Quick predicate-guard check on last operand |
| sub_7E6090 | 10KB | DefUseChain::build | Build def-use chains for all instructions |
| sub_7DDCA0 | 0.2KB | Observer::notify | Walk observer chain and notify |
| sub_9253C0 | 0.5KB | Instruction::remove | Remove instruction from linked list (634 callers) |
| sub_925510 | 0.5KB | Instruction::insertBefore | Insert instruction before another (13 callers) |
| sub_917A60 | 6.8KB | InstrInfo::getRegClass | Opcode-to-register-class mapping (221 callers) |
| sub_91A0F0 | 5.6KB | InstrInfo::resolveRegClass | Resolve operand register class with constraints |
| sub_9314F0 | 0.4KB | RegClass::query | Register class query (1,547 callers) |
| sub_738E20 | 10KB | InstrDescTable::init | Base instruction descriptor table constructor |
| sub_BE7390 | 16KB | InstructionInfo::init | InstructionInfo constructor (ROT13 table + descriptors) |
| sub_896D50 | 21KB | InstrMnemTable::init | Architecture-specific mnemonic table initializer |
| sub_65D640 | 48KB | InstrLowering::handle | PTX-to-Ori instruction lowering handler (60+ opcode cases, addressing mode resolution) |
| sub_660CE0 | 0.3KB | InstrLowering::initVtable | Constructs ISel Phase 1 dispatch vtable (17 slots) |
| sub_6575D0 | 49KB | RegClassOpcodeDispatch::handle | Register-class-to-opcode dispatch (vtable +24 sibling) |
| sub_6D9690 | 94KB | Instruction::encode | Master SASS instruction encoder |
| sub_B28E00 | varies | isReg/isPred/isImm | Operand type predicates (isel infrastructure) |
| sub_5D4190 | 12.9KB | PTXFormatter::dispatch | PTX text generation dispatcher (580 formatters) |
| sub_710860 | 39B | PTXCtx::getDataType | Data type accessor (2,953 callers) |
| sub_70B8E0 | 12B | PTXCtx::getRegOperand | Register operand accessor (1,449 callers) |
| sub_70B910 | 12B | PTXCtx::getSrcPart0 | Source part 0 accessor (1,656 callers) |
| sub_70B700 | 14B | PTXCtx::hasPredicate | Predicate presence check (946 callers) |
| sub_70CA60 | 11B | PTXCtx::getOperandType | Operand type code accessor (480 callers) |
| sub_70B710 | 111B | PTXCtx::getOpcodeString | Opcode string with "@" prefix (348 callers) |
| sub_70FA00 | 10B | PTXCtx::getTargetSM | Target SM version accessor (286 callers) |
Related Pages
- Ori IR Overview -- Code Object, basic blocks, CFG, register files
- Registers -- Register descriptor layout, register file types
- CFG -- Basic block structure, control-flow graph
- Data Structures -- Hash tables, bitvectors, linked lists
- Peephole Optimization -- Instruction rewriting passes
- SASS Encoding -- How Ori instructions become SASS binary
- Instruction Selection -- Pattern matching for instruction selection
- PTX-to-Ori Pipeline -- Full lowering pipeline context for sub_65D640
- Scheduling -- 3-phase instruction scheduler
Basic Blocks & Control Flow Graph
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas maintains a custom CFG infrastructure built entirely from scratch -- no LLVM BasicBlock, no LLVM MachineBasicBlock, no LLVM dominator framework. Basic blocks are stored in contiguous arrays, edges are stored in FNV-1a hash maps, and RPO / backedge / loop information is computed by dedicated functions in the 0xBDE000--0xBE2400 address range.
Key Facts
| Property | Value |
|---|---|
| BasicBlock object size | 136 bytes (allocated by sub_62BB00) |
| Block info entry (scheduling) | 40 bytes per entry, contiguous array |
| Block naming | bix%d (block index, 0-based integer) |
| Edge representation | FNV-1a hash map (key = block index, value = successor list) |
| RPO storage | int[] array, indexed by RPO position |
| Backedge storage | Separate FNV-1a hash map |
| CFG construction phase | Phase 3: AnalyzeControlFlow |
| Block layout phase | Phase 112: PlaceBlocksInSourceOrder |
| BB merge suppression | --dont-merge-basicblocks / -no-bb-merge CLI flag |
Two-Level Block Representation
ptxas uses two distinct but linked representations for basic blocks. The first is owned by the Code Object (used by all optimization passes); the second is owned by the scheduling/CFG analysis context (used by scheduling and post-regalloc passes).
Code Object Block Array
The Code Object stores an array of pointers to full BasicBlock objects:
| Code Object Offset | Type | Field | Description |
|---|---|---|---|
| +296 | ptr | bb_array | Array of BasicBlock* pointers (8 bytes each) |
| +304 | i32 | bb_count | Number of basic blocks |
Access pattern (from sub_78B430):
int bb_count = *(int*)(ctx + 304);
for (int i = 0; i <= bb_count; i++) {
BasicBlock* bb = *(BasicBlock**)(*(ctx + 296) + 8 * i);
int rpo = *(int*)(bb + 144);
// ...
}
Scheduling Block Info Array
The scheduling context maintains a parallel 40-byte-per-entry array:
| Scheduling Context Offset | Type | Field | Description |
|---|---|---|---|
| +976 | ptr | block_info | Contiguous array of 40-byte entries |
| +984 | i32 | num_blocks | Max block index (0-based; actual count = num_blocks + 1) |
Block Info Entry Layout (40 bytes)
| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | bb_ptr | Pointer to the full BasicBlock object |
| +8 | ptr | insn_head | Pointer to the instruction list head (or sentinel) |
| +16 | u64 | reserved | Reserved / padding |
| +24 | u32 | flags | Block flags |
| +28 | i32 | bix | Block index (unique ID used in all CFG operations) |
| +32 | u64 | aux | Auxiliary data (varies by pass) |
The DOT dumper at sub_BE21D0 iterates this array with a 40-byte stride:
for (int i = 0; i <= num_blocks; i++) {
entry = *(sched_ctx + 976) + 40 * i;
int bix = *(int*)(entry + 28);
int label = *(int*)(*(ptr*)(entry + 0) + 152);
printf("bix%d(L%x)", bix, label);
}
BasicBlock Object (136 bytes)
Allocated by sub_62BB00 during the parsing/lowering phase. The parser references the string "bb-controlflow" when constructing these objects. After allocation, the 136-byte block is zeroed via memset, then individual fields are populated.
BasicBlock Field Map
| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | vtable | Virtual function table pointer (or type tag) |
| +8 | ptr | insn_list | Instruction doubly-linked list head/sentinel |
| +16 | ptr | insn_tail | Instruction list tail (for O(1) append) |
| +24 | u32 | insn_count | Number of instructions in the block |
| +28 | u32 | flags_a | Block attribute flags (see below) |
| +104 | ptr | bb_next | Linked-list link to next BasicBlock in function |
| +108 | u8 | opcode_flags | Terminator opcode classification bits |
| +128 | ptr | succ_list | Linked list of successor block references |
| +136 | ptr | pred_list | Linked list of predecessor block references |
| +144 | i32 | rpo_number | Reverse post-order number (set by RPO computation) |
| +152 | i32 | label_id | Label / source line identifier (displayed as L%x in DOT) |
The insn_list at +8 is the head of a doubly-linked list. Each instruction node has a next pointer at offset +8 of the instruction object. The sentinel/end is detected by comparing the current node pointer against the tail stored in the BasicBlock or against a per-block sentinel address.
Successor/Predecessor Lists
Both succ_list (+128) and pred_list (+136) are singly-linked lists of small nodes. Each node contains:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | next pointer (NULL = end of list) |
| +8 | i32 | Block index of the referenced block |
Iteration pattern (from sub_78B430 -- LoopStructurePass):
// Walk predecessor list
PredNode* pred = *(PredNode**)(bb + 136);
while (pred) {
BasicBlock* pred_bb = *(BasicBlock**)(*(ctx + 296) + 8 * pred->bix);
int pred_rpo = *(int*)(pred_bb + 144);
// ...
pred = pred->next;
}
// Walk successor list
SuccNode* succ = *(SuccNode**)(bb + 128);
while (succ) {
BasicBlock* succ_bb = *(BasicBlock**)(*(ctx + 296) + 8 * succ->bix);
// ...
succ = succ->next;
}
CFG Edge Hash Maps
In addition to the per-block predecessor/successor linked lists, the scheduling context maintains two global FNV-1a hash maps for fast edge queries. These are the primary edge representation used by RPO computation, backedge detection, and the scheduling pass.
Successor Edge Map (Code Object +648)
Maps block index to a set of successor block indices. Used by CFG::computeRPO (sub_BDE150), CFG::printEdges (sub_BDE8B0), and CFG::buildAndAnalyze (sub_BE0690).
Backedge Map (Code Object +680)
Maps block index to the set of backedge targets. A backedge exists when block bix_src has a successor bix_dst where RPO(bix_dst) <= RPO(bix_src) -- i.e., the successor was visited before the source in the DFS traversal, indicating a loop.
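The backedge condition can be written as a predicate over the RPO array (names hypothetical):

```c
#include <stdbool.h>

/* An edge bix_src -> bix_dst is a backedge when the destination was
 * visited no later than the source in the DFS traversal. Note that the
 * <= makes a self-loop (bix_src == bix_dst) count as a backedge. */
static bool is_backedge(const int *rpo, int bix_src, int bix_dst) {
    return rpo[bix_dst] <= rpo[bix_src];
}
```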
FNV-1a Hash Parameters
All CFG hash lookups use identical parameters, confirmed across 50+ call sites:
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| FNV prime | 16777619 (0x01000193) |
| Key size | 4 bytes (block index) |
| Hash method | Byte-by-byte XOR-fold |
The hash computation for a 32-bit block index bix:
uint32_t hash = 0x811C9DC5;
hash = 16777619 * (hash ^ (bix & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 8) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 16) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 24) & 0xFF));
uint32_t bucket = hash & (num_buckets - 1);
Hash Map Structure
HashMap:
+0 ptr first_free_node // Free list for node recycling
+8 ptr node_arena // Pool allocator for new nodes
+16 ptr bucket_array // Array of 24-byte bucket headers
+24 u64 num_buckets // Power of two, initial = 8
+32 i32 total_elements // Total entries across all buckets
+36 i32 num_unique_keys // Distinct keys inserted
Bucket (24 bytes):
+0 ptr head // First node in collision chain
+8 ptr tail // Last node in collision chain
+16 i32 count // Number of nodes in this bucket
Full Node (64 bytes, for edge maps):
+0 ptr next // Chain link within bucket
+8 i32 key // Block index (bix)
+12 i32 value_info // Edge count or flags
+16 ptr value_array // Pointer to sub-array of successor indices
+24 i32 value_count // Number of successors in sub-array
+32 ptr sub_hash_data // Embedded sub-hash for multi-edge blocks
+40 u64 sub_hash_size // Sub-hash capacity
+56 u32 cached_hash // Cached FNV-1a hash of key
Simple Node (16 bytes, for backedge set membership):
+0 ptr next // Chain link within bucket
+8 i32 key // Block index
+12 u32 cached_hash // Cached hash
Growth policy: rehash when total_elements > num_unique_keys (load factor > 1.0). New capacity = 4 * old_bucket_count. Hash map insert/find is implemented at sub_BDED20 (full nodes, 64 bytes) and sub_BDF480 (simple nodes, 16 bytes).
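The capacity side of the policy can be sketched directly (names hypothetical). Quadrupling keeps the bucket count a power of two, so the hash & (num_buckets - 1) bucket computation stays valid after every rehash:

```c
#include <stdint.h>

/* Bucket count after a rehash: 4x the old capacity (initial capacity is 8). */
static uint64_t rehash_capacity(uint64_t num_buckets) {
    return 4 * num_buckets;
}
```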
CFG Construction: Phase 3 (AnalyzeControlFlow)
AnalyzeControlFlow is phase 3 in the 159-phase optimizer pipeline. It runs immediately after the parser builds the initial instruction list and before any optimization. This phase:
- Populates the successor edge hash table at Code Object +648 by scanning the last instruction of each basic block. Branch instructions (opcode 67 = BRA, opcode 77 = EXIT; the code also checks opcodes 93 and 95, which are OUT_FINAL and STS respectively in the ROT13 name table but serve as internal control-flow markers in this context) provide the target block indices.
- Computes the backedge map at Code Object +680 by identifying edges where the target has a lower or equal position in the DFS tree.
- Builds the reverse post-order (RPO) array at Code Object +720 via iterative DFS.
- Identifies loop headers and backedges for later loop optimization passes.
The phase is critical because the Bison parser constructs basic blocks and instruction lists incrementally. AnalyzeControlFlow ensures the CFG is fully consistent and annotated before optimization begins.
Phase 6: SetControlFlowOpLastInBB
Phase 6 enforces a structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering passes), this phase splits the block at the control-flow instruction. New basic block entries are allocated and the instruction linked list is rewritten.
This invariant is required by all downstream passes -- the scheduler and register allocator assume that only the final instruction in a block can be a control-flow transfer.
Reverse Post-Order (RPO) Computation
RPO is computed by sub_BDE150 (CFG::computeRPO), a 9KB function that implements iterative DFS using an explicit stack.
RPO Storage
| Code Object Offset | Type | Field |
|---|---|---|
| +720 | ptr | rpo_array -- int*, indexed by RPO position |
| +728 | i32 | rpo_size -- number of entries used |
| +732 | i32 | rpo_capacity -- allocated capacity |
The array is resized with the standard ptxas growth policy: new_capacity = old + (old + 1) / 2, with a minimum of num_blocks + 1. Growth is implemented in sub_BDFB10.
Algorithm
The RPO computation uses a standard iterative DFS with post-order numbering:
function computeRPO(cfg, entry_block):
    stack = [entry_block]                 // Explicit stack at offset +88..+100
    visited = new BitArray(num_blocks)    // At offset +16..+40
    numbered = new BitArray(num_blocks)   // Second bit array at offset +40
    counter = num_blocks - 1              // Decremented as blocks complete
    while stack is not empty:
        bix = stack.top()
        if visited[bix]:
            stack.pop()
            if not numbered[bix]:         // A block pushed by several predecessors
                numbered[bix] = true      // must be numbered only once
                rpo_number[bix] = counter // *(cfg+64)[bix] = counter
                rpo_array[counter] = bix  // *(*(cfg+720))[counter] = bix
                counter--
            continue
        visited[bix] = true
        for each successor s of bix (via hash map lookup):
            if not visited[s]:
                stack.push(s)
    return counter                        // -1 if all blocks reachable
The key assignment line from the decompilation:
*(_DWORD *)(*(_QWORD *)(a1 + 64) + 4 * v16) = *a3; // rpo_number[bix] = counter
*(_DWORD *)(*(_QWORD *)(*(_QWORD *)a1 + 720) + 4 * (*a3)--) = v16; // rpo_array[counter--] = bix
After completion, rpo_array[0] is the entry block, and rpo_array[num_blocks - 1] is the deepest post-dominator (typically the EXIT block).
RPO Debug Dump
sub_BDEA50 (CFG::dumpRPOAndBackedges) prints the RPO state:
Showing RPO state for each basic block:
bix0 -> RPONum: 0
bix1 -> RPONum: 1
bix2 -> RPONum: 3
bix3 -> RPONum: 2
RPO traversal order: [0, 1, 3, 2]
Showing backedge info:
bix2 -> backedge's successor BB: 1
This output is gated by option flag #24 at offset +1728 relative to the options manager.
Backedge Detection and Loop Identification
Backedges are identified during CFG::buildAndAnalyze (sub_BE0690, 54KB). A backedge from block src to block dst exists when dst has already been visited in the DFS traversal (i.e., dst has a smaller or equal RPO number than src). Backedges are stored in the hash map at Code Object +680.
Natural Loop Detection
The LoopStructurePass (sub_78B430) combines RPO numbering with backedge analysis to identify natural loops:
- Calls sub_781F80 (BasicBlockAnalysis) to compute RPO numbers and dominance.
- Iterates the bb_array at Code Object +296.
- For each block, checks if rpo_number (+144) is non-zero and equals the value at +152 (loop exit RPO marker). Combined with a branch opcode check ((opcode & 0xFFFFFFFD) == 0x5D, i.e. BRA or conditional branch -- matching opcodes 93 and 95), this identifies loop header blocks.
- Walks the predecessor list to find the backedge source -- the predecessor with the largest RPO number that is still less than the header's RPO.
- Walks the successor list to find the loop latch -- the successor with the smallest RPO number greater than the loop preheader's RPO.
The RPO range [header_rpo, exit_rpo] defines the set of blocks belonging to the loop body. A block with header_rpo <= block_rpo <= exit_rpo is inside the loop.
LoopMakeSingleEntry Transformation
If a natural loop has multiple entry points, sub_78B430 transforms it into a single-entry loop. This is gated by:
- The LoopMakeSingleEntry pass-disable check (via sub_799250)
- Knob 487 (queried via the knob vtable at +152)
Two code paths handle different branch types:
- Opcode 93 (OUT_FINAL in the ROT13 name table; used here as a control-flow boundary marker): calls sub_9253C0 to rewrite the branch target
- Conditional branches: calls sub_748BF0 to insert a new preheader block and redirect edges
After transformation, sub_931920 is called to split blocks and update the instruction list.
Dominance
Dominance is computed by sub_BE2330 (4KB) and/or within sub_781F80 (12KB, BasicBlockAnalysis). The implementation uses bitvector operations -- each block has a bitvector of dominators, and the fixpoint iteration proceeds in RPO order.
The bitvector layout used by the dominator computation:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | data -- pointer to uint32_t[] words |
| +8 | i32 | word_count |
| +12 | i32 | capacity |
| +16 | i32 | bit_count |
Evidence for an iterative dataflow approach (rather than Lengauer-Tarjan) comes from the function sizes and patterns: sub_781F80 at 12KB and sub_BE2330 at 4KB are both small enough that they likely implement the simple iterative algorithm:
dom[entry] = {entry}
for all other blocks b: dom[b] = all_blocks
repeat until no changes:
for each block b in RPO order (skip entry):
dom[b] = {b} union (intersection of dom[p] for all predecessors p)
This is adequate for the small CFGs typical of GPU kernels (rarely exceeding a few hundred blocks). The O(n^2) worst case is not a concern at GPU kernel scale.
Block Layout: Phase 112 (PlaceBlocksInSourceOrder)
Phase 112 (PlaceBlocksInSourceOrder) runs in the post-scheduling stage of the pipeline, after register allocation and before Mercury encoding. It reorders the basic block array to restore source-order layout.
The implementation at sub_A92C50 (3.5KB binary, ~19KB decompiled) manipulates linked list structures and uses hash table lookups to reorder blocks. The goal is to minimize branch distances in the final SASS output -- placing fall-through successors immediately after their predecessors.
Hot/Cold Block Layout
Two companion phases handle hot/cold partitioning:
| Phase | Name | Purpose |
|---|---|---|
| 108 | OptimizeHotColdInLoop | Moves cold blocks out of loop bodies |
| 109 | OptimizeHotColdFlow | Global hot/cold block separation |
Cold blocks (e.g., error handlers, unlikely branches, assert paths) are moved to the end of the function's block sequence. The MarkAdditionalColdBlocks pass marks blocks as cold based on heuristics. This separation improves instruction cache utilization on the GPU's SM instruction fetch unit.
BB Merge Suppression
The --dont-merge-basicblocks (alias -no-bb-merge) CLI flag prevents the optimizer from merging consecutive basic blocks. This is used for debuggable code -- without it, the debugger cannot set breakpoints at the original source line boundaries. The flag is documented in the binary as:
"Normally, ptxas attempts to merge consecutive basic blocks as part of its optization process. However, for debuggable code this is very confusing. This option prevents basic block merging, at a slight perfomance cost."
(Note: "optization" and "perfomance" are typos in the original binary string.)
Entry and Exit Blocks
Block index 0 (bix0) is always the function entry block. It is the first element in the bb_array and the root of the RPO traversal. The entry block has no predecessors (its predecessor list at +136 is NULL).
The exit block is the block containing the EXIT instruction (opcode 77 = EXIT in the ROT13 name table). For functions with multiple exit points, each EXIT-containing block is a CFG sink. The RPO computation assigns these the highest RPO numbers. The SetControlFlowOpLastInBB phase (phase 6) ensures each EXIT is the final instruction in its block.
The CFG::buildAndAnalyze function (sub_BE0690) checks the terminator opcode at instruction offset +28. Opcodes 4 and 7 (internal control-flow opcodes) receive special treatment during edge construction:
| Opcode | Type | Edge behavior |
|---|---|---|
| 4 | Unconditional branch | Single successor edge to target block |
| 7 | Conditional branch | Two successor edges (taken + fall-through) |
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used as a control-flow boundary marker in CFG construction |
| 95 | STS | ROT13 name is STS; used as a control-flow terminator marker in CFG construction |
CFG Update Protocol
Passes that modify the CFG (block splitting, merging, edge redirection) must maintain consistency across several data structures:
- Block array -- both the Code Object bb_array (+296) and the scheduling block_info (+976) must be updated.
- Predecessor/successor linked lists -- the per-block lists at +128 and +136 must reflect the new edges.
- Edge hash maps -- the successor map (+648) and backedge map (+680) must be invalidated or updated.
- RPO array -- the RPO order at +720 must be recomputed after structural changes.
- Block count -- both bb_count (+304) and num_blocks (+984) must be incremented.
The general pattern observed in sub_931920 (block splitter called from sub_78B430):
function splitBlock(ctx, bb, split_point):
new_bb = allocateBasicBlock()
// Move instructions after split_point to new_bb
new_bb->insn_list = split_point->next
bb->insn_list_tail = split_point
split_point->next = sentinel
// Transfer successors from bb to new_bb
new_bb->succ_list = bb->succ_list
bb->succ_list = new_node(new_bb->bix)
// Update predecessor lists of old successors
for each succ in new_bb->succ_list:
replace bb in succ->pred_list with new_bb
// new_bb's only predecessor is bb
new_bb->pred_list = new_node(bb->bix)
// Invalidate and recompute RPO
ctx->bb_count++
recomputeRPO(ctx)
The AnalyzeControlFlow phase (phase 3) is explicitly re-run or incrementally updated after phases that modify the CFG structure. The phase pipeline contains multiple OriPerformLiveDead and GeneralOptimize passes that may rebuild portions of the CFG.
Key CFG Functions
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_62BB00 | 16.5KB | BasicBlock::allocate -- allocates 136-byte block, initializes fields | HIGH |
| sub_781F80 | 12KB | BasicBlockAnalysis -- RPO, loop detection, dominance | MEDIUM |
| sub_78B430 | 1.2KB | LoopStructurePass -- single-entry loop transformation | HIGH |
| sub_BDE150 | 9KB | CFG::computeRPO -- iterative DFS with explicit stack | HIGH |
| sub_BDE6C0 | 3KB | HashMap::erase -- remove node from edge hash map | MEDIUM |
| sub_BDE8B0 | 2KB | CFG::printEdges -- prints "bix%d -> bix%d\n" | HIGH |
| sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges -- RPO + backedge debug dump | HIGH |
| sub_BDED20 | 12KB | HashMap::insertOrFind -- full 64-byte node insert | HIGH |
| sub_BDF480 | 10KB | HashMap::insertOrFind_simple -- 16-byte node insert | HIGH |
| sub_BDFB10 | 24KB | CFG::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BE0690 | 54KB | CFG::buildAndAnalyze -- master CFG builder | HIGH |
| sub_BE21D0 | 1.4KB | CFG::dumpDOT -- Graphviz DOT format output | HIGH |
| sub_BE2330 | 4KB | CFG::computeDominators -- bitvector-based dominance | MEDIUM |
| sub_A92C50 | 3.5KB | PlaceBlocksInSourceOrder -- block reordering (phase 112) | MEDIUM |
CFG Visualization
The CFG::dumpDOT function (sub_BE21D0) generates Graphviz DOT output when option flag #20 is enabled (offset +1440 from the options manager). The output format:
digraph f {
node [fontname="Courier",fontsize=10,shape=Mrecord];
"bix0"
[label="bix0(L0)"]
bix0 -> bix1
bix0 -> bix3
"bix1"
[label="bix1(L10)"]
bix1 -> bix2
"bix2"
[label="bix2(L20)"]
bix2 -> bix1
bix2 -> bix3
"bix3"
[label="bix3(L30)"]
}
Where L%x is the label identifier at BasicBlock +152. This can be converted to a visual graph with dot -Tpng.
If option flag #24 is also enabled (offset +1728), the RPO and backedge dump from sub_BDEA50 is appended.
Related Pages
- Ori IR Overview -- Code Object layout, instruction format, register files
- Instructions -- instruction format and opcode details
- Data Structures -- FNV-1a hash maps, bitvectors, linked lists
- Optimizer Pipeline -- the 159-phase pipeline including CFG phases
- Branch & Switch Optimization -- OriBranchOpt pass
- Loop Optimization -- OriLoopSimplification, LoopUnrolling
- Hot/Cold Partitioning -- OptimizeHotColdFlow, MarkAdditionalColdBlocks
Register Model (R / UR / P / UP)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas models four hardware register files plus two auxiliary barrier register files. Every Ori instruction references registers from one or more of these files. During the optimization phases (0--158), registers carry virtual numbers; the fat-point register allocator (phase 159+) maps them to physical hardware slots. This page documents the register files, the virtual/physical register descriptor, the 7 allocator register classes, wide register conventions, special registers, the operand encoding format, pressure tracking, and SM-specific limits.
Four Register Files
| File | Mnemonic | Width | Usable range | Zero/True | ABI type | Introduced |
|---|---|---|---|---|---|---|
| R | General-purpose | 32 bits | R0 -- R254 | RZ (R255) | 2 | sm_30 |
| UR | Uniform | 32 bits | UR0 -- UR62 | URZ (UR63) | 3 | sm_75 |
| P | Predicate | 1 bit | P0 -- P6 | PT (P7) | 5 | sm_30 |
| UP | Uniform predicate | 1 bit | UP0 -- UP6 | UPT (UP7) | -- | sm_75 |
R registers are per-thread 32-bit general-purpose registers. They hold integers, floating-point values, and addresses. 64-bit values occupy consecutive even/odd pairs (R4:R5); 128-bit values occupy aligned quads (R0:R1:R2:R3). The total R-register count for a function is field[159] + field[102] (reserved + allocated), stored in the Code Object at offsets +159 and +102. Maximum usable: 254 (R0--R254). R255 is the hardware zero register RZ -- reads return 0, writes are discarded.
UR registers (uniform general-purpose) are warp-uniform: every thread in a warp sees the same value. Available on sm_75 and later. Range: UR0--UR62 usable, UR63 is the uniform zero register URZ. The UR count is at Code Object +99. Attempting to use UR on pre-sm_75 targets triggers the diagnostic "Uniform registers were disallowed, but the compiler required (%d) uniform registers for correct code generation.".
P registers are 1-bit predicates used for conditional execution (@P0 FADD ...) and branch conditions. P0--P6 are usable; P7 is the hardwired always-true predicate PT. Writes to PT are discarded. The assembler uses PT as the default predicate for unconditional instructions. In the allocator, predicate registers support half-width packing: two virtual predicates can be packed into one physical predicate slot, with the hi/lo distinction stored in bit 23 (0x800000) of the virtual register flags.
UP registers are the uniform predicate variant. UP0--UP6 are usable; UP7 is UPT (always-true). Available on sm_75+.
Seven Allocator Register Classes
The fat-point allocator processes 7 register classes, indexed by the reg_type field at vreg+64. Class 0 is the cross-class constraint propagation channel and is skipped in the main allocation loop. Classes 1--6 are allocated independently, in order. The allocator distribution loop in sub_9721C0 (lines 520--549) reads *(int*)(vreg+64) and uses it directly as the class bucket index, guarded by reg_type <= 6:
| Class ID | Name | Width | HW limit | Description |
|---|---|---|---|---|
| 0 | (unified) | -- | -- | Cross-class constraint propagation (skipped) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (used for RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |
The class ID is the reg_type value stored at vreg+64. The allocator class at vreg+12 is a separate field used for instruction-level classification, not for the per-class allocation passes. The allocator's per-class linked lists at alloc[3*reg_type + 138] are populated directly from vreg+64.
Per-class state is initialized via the target descriptor vtable call vtable[896](alloc_state, class_id), which populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class).
Barrier Registers
Barrier registers (B and UB) are a distinct register file used by the BAR, DEPBAR, BSSY, and BSYNC instructions for warp-level and CTA-level synchronization. B0--B15 are the non-uniform barrier registers; UB0--UB15 are the uniform variant. Barrier registers have reg_type = 9, which is above the <= 6 cutoff for the main allocator class buckets. They are handled by a separate allocation mechanism outside the 7-class system.
Tensor/Accumulator Registers (Class 6)
Class 6 registers are created during intrinsic lowering of tensor core operations (MMA, WGMMA, HMMA, DMMA). Over 30 intrinsic lowering functions in the 0x6B--0x6D address range call sub_91BF30(ptr, ctx, 6) to create these registers. The GMMA pipeline pass (sub_ADA740, sub_69E590) identifies accumulator operands by checking *(vreg+64) == 6. The accumulator counting function at sub_78C6B0 uses the pair-mode bits at vreg+48 (bits 20--21) to determine whether a type-6 register consumes 1 or 2 physical R slots.
Virtual Register Descriptor
Every virtual register in a function is represented by a 160-byte descriptor allocated from the per-function arena. The register file array is at Code Object +88, indexed as *(ctx+88) + 8*regId. The descriptor is created by sub_91BF30 (register creation function).
Descriptor Layout
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 8 | ptr | next | Linked list pointer (allocation worklist) |
| +8 | 4 | i32 | id | Unique register ID within function |
| +12 | 4 | i32 | class_index | Allocator register class (0--6) |
| +20 | 1 | u8 | flags_byte | Bit 0x20 = live |
| +24 | 4 | i32 | bb_index | Basic block of definition |
| +28 | 4 | i32 | epoch | Epoch counter for liveness tracking |
| +32 | 8 | ptr | alias_next | Next aliased register (coalescing chain) |
| +36 | 8 | ptr | alias_parent | Coalesced parent pointer |
| +40 | 4 | f32 | spill_cost | Accumulated spill cost |
| +48 | 8 | u64 | flags | Multi-purpose flag word (see below) |
| +56 | 8 | ptr | def_instr | Defining instruction pointer |
| +64 | 4 | i32 | reg_type | Register file type enum |
| +68 | 4 | i32 | physical_reg | Physical register number (-1 = unassigned) |
| +72 | 1 | u8 | size | 0 = scalar, nonzero = encoded width |
| +76 | 4 | f32 | secondary_cost | Secondary spill cost |
| +80 | 4 | i32 | spill_flag | 0 = not spilled, 1 = spilled |
| +97 | 2 | u16 | reserved | |
| +104 | 8 | ptr | use_chain | Use chain head (instruction pointer) |
| +112 | 8 | ptr | def_chain | Definition chain |
| +120 | 8 | ptr | regfile_next | Next in register file linked list |
| +128 | 8 | ptr | linked_next | Next in linked-register chain |
| +136 | 8 | ptr | reserved2 | |
| +144 | 8 | ptr | constraint_list | Constraint list head for allocator |
| +152 | 8 | ptr | reserved3 |
Initial values set by the constructor (sub_91BF30):
vreg->next = NULL; // +0
vreg->id = ctx->reg_count + 1; // +8, auto-incrementing
vreg->class_index = 0; // +12
vreg->flags_byte = 0; // +20
vreg->alias_parent = (ptr)-1; // +20..27 (qword write)
vreg->physical_reg = -1; // +68 (unassigned)
vreg->reg_type = a3; // +64 (passed as argument)
vreg->size = 0; // +72
vreg->spill_flag = 0; // +80
vreg->use_chain = NULL; // +104
vreg->def_chain = NULL; // +112
vreg->constraint_list = NULL; // +144
For predicate types (a3 == 2 or a3 == 3), the flags word at +48 is initialized to 0x1000 (4096). For all other types, it is initialized to 0x1018 (4120). If the type is 7 (alternate predicate classification), the physical register is initialized to 0 instead of -1.
Flag Bits at +48
| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 12 | 0x1000 | Base flag (set for all types) |
| 14 | 0x4000 | Spill marker (already spilled) |
| 18 | 0x40000 | Needs-spill (allocator sets when over budget) |
| 20--21 | (pair mode) | 0 = single, 1 = lo-half of pair, 3 = double-width |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair (predicate half-width packing) |
| 27 | 0x8000000 | Special handling flag |
Register File Type Enum (at +64)
This enum determines the register file a VR belongs to. It is used by the register class name table at off_21D2400 to map type values to printable strings ("R", "UR", "P", etc.) for diagnostic output such as "Referencing undefined register: %s%d".
| Value | File | Alloc class | Description |
|---|---|---|---|
| 1 | R | 1 | General-purpose register (32-bit) |
| 2 | R (alt) | 2 | GPR variant (RZ sentinel in sub_7D82E0, stat collector alternate) |
| 3 | UR | 3 | Uniform register (32-bit) |
| 4 | UR (ext) | 4 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 5 | Predicate register (1-bit); covers both P and UP |
| 6 | Tensor/Acc | 6 | Tensor/accumulator register for MMA/WGMMA operations |
| 7 | P (alt) | -- | Predicate variant (physical = 0 at init); above allocator cutoff |
| 8 | -- | -- | Extended type (created by sub_83EF00); above allocator cutoff |
| 9 | B / UB | -- | Barrier register; above allocator cutoff, separate allocation |
| 10 | R2 | -- | Extended register pair (64-bit, two consecutive R regs) |
| 11 | R4 | -- | Extended register quad (128-bit, four consecutive R regs) |
Values 0--6 are within the allocator's class system (the distribution loop in sub_9721C0 guards with reg_type <= 6). Values 7+ are handled by separate mechanisms. The off_21D2400 name table is indexed by reg_type and provides display strings for diagnostic output.
The stat collector at sub_A60B60 (24 KB) enumerates approximately 25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others by iterating vtable getter functions per register class.
Wide Registers
NVIDIA GPUs have only 32-bit physical registers. Wider values are composed from consecutive registers.
64-Bit Pairs (R2)
A 64-bit value occupies two consecutive registers where the base register has an even index: R0:R1, R2:R3, R4:R5, and so on. The low 32 bits reside in the even register; the high 32 bits in the odd register. In the Ori IR, a 64-bit pair is represented by a single virtual register with:
- vreg+64 (type) = 10 (extended pair)
- vreg+48 bits 20--21 (pair mode) = 3 (double-width)
The allocator selects even-numbered physical slots by scanning with stride 2 instead of 1. The register consumption function (sub_939CE0) computes slot + (1 << (pair_mode == 3)) - 1, consuming two physical slots.
128-Bit Quads (R4)
A 128-bit value occupies four consecutive registers aligned to a 4-register boundary: R0:R1:R2:R3, R4:R5:R6:R7, etc. Used by texture instructions, wide loads/stores, and tensor core operations. In the Ori IR:
- vreg+64 (type) = 11 (extended quad)
- Allocator scans with stride 4
Alignment Constraints
| Width | Base alignment | Stride | Example |
|---|---|---|---|
| 32-bit (scalar) | Any | 1 | R7 |
| 64-bit (pair) | Even | 2 | R4:R5 |
| 128-bit (quad) | 4-aligned | 4 | R8:R9:R10:R11 |
The texture instruction decoder (sub_1170920) validates even-register alignment via a dedicated helper (sub_1170680) that checks if a register index falls within the set {34, 36, 38, ..., 78} and returns 0 if misaligned.
The SASS instruction encoder for register pairs (sub_112CDA0, 8.9 KB) maps 40 register pair combinations (0/1, 2/3, ..., 78/79) to packed 5-bit encoding values at 0x2000000 (33,554,432) intervals.
Special Registers
Zero and True Registers
| Register | File | Index | Internal sentinel | Behavior |
|---|---|---|---|---|
| RZ | R | 255 | 1023 | Reads return 0; writes discarded |
| URZ | UR | 63 | 1023 | Uniform zero; reads return 0 |
| PT | P | 7 | 31 | Always-true predicate; writes discarded |
| UPT | UP | 7 | 31 | Uniform always-true |
The internal sentinel value 1023 (0x3FF) represents "don't care" or "zero register" throughout the Ori IR and allocator. During SASS encoding, hardware register index 255 is mapped to sentinel 1023 for R/UR files, and hardware index 7 is mapped to sentinel 31 for P/UP files. These sentinels are checked in encoders to substitute the default register value:
// Decoder: extract register operand (sub_9B3C20)
if (reg_idx == 255)
internal_idx = 1023; // RZ sentinel
// Decoder: extract predicate operand (sub_9B3D60)
if (pred_idx == 7)
internal_idx = 31; // PT sentinel
// Encoder: emit register field
if (reg == 1023)
use *(a1+8) as default; // encode physical RZ
Architectural Predicate Indices
The allocator skips architectural predicate registers by index number:
| Index | Register | Treatment |
|---|---|---|
| 39 | (special) | Skipped during allocation (skip predicate sub_9446D0) |
| 41 | PT | Skipped -- hardwired true predicate |
| 42 | P0 | Skipped -- architectural predicate |
| 43 | P1 | Skipped -- architectural predicate |
| 44 | P2 | Skipped -- architectural predicate |
The skip check in sub_9446D0 returns true (skip) for register indices 41--44 and 39, regardless of register class. For other registers, it checks whether the instruction is a CSSA phi (opcode 195 with barrier type 9) or whether the register is in the exclusion set hash table at alloc+360.
Special System Registers (S2R / CS2R)
Thread identity and hardware state are accessed through the S2R (Special Register to Register) and CS2R (Control/Status Register to Register) instructions. These read read-only hardware registers into R-file registers.
Common system register values (from PTX parser initialization at sub_451730):
| PTX name | Hardware | Description |
|---|---|---|
| %tid / %ntid | SR_TID_X/Y/Z | Thread ID within CTA |
| %ctaid / %nctaid | SR_CTAID_X/Y/Z | CTA ID within grid |
| %laneid | SR_LANEID | Lane index within warp (0--31) |
| %warpid / %nwarpid | SR_WARPID | Warp index within CTA |
| %smid / %nsmid | SR_SMID | SM index |
| %gridid | SR_GRIDID | Grid identifier |
| %clock / %clock_hi / %clock64 | SR_CLOCK / SR_CLOCK_HI | Cycle counter |
| %lanemask_eq/lt/le/gt/ge | SR_LANEMASK_* | Lane bitmask variants |
The S2R register index must be between 0 and 255 inclusive, enforced by the string "S2R register must be between 0 and 255 inclusive". Special system register ranges are tracked at Code Object offsets +1712 (start) and +1716 (count).
Operand Encoding in Ori Instructions
Each instruction operand is encoded as a 32-bit packed value in the operand array starting at instruction offset +84. The operand at index i is at *(instr + 84 + 8*i).
Packed Operand Format (Ori IR)
  31     30..28        27..20              19..0
+----+-----------+----------------+---------------------+
|sign| type (3)  |  modifier (8)  |     index (20)      |
+----+-----------+----------------+---------------------+
bit 31:     sign/direction flag
bits 28-30: operand type (3 bits)
bits 20-27: modifier byte (bit 24 = pair extension flag)
bits 0-19:  register/symbol index
Extraction pattern (50+ call sites):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7; // bits 28-30
int index = operand & 0xFFFFF; // bits 0-19
int mods = (operand >> 20) & 0xFF; // bits 20-27
bool is_neg = (operand >> 31) & 1; // bit 31
| Type value | Meaning |
|---|---|
| 1 | Register operand (index into register file at *(ctx+88) + 8*index) |
| 5 | Symbol/constant operand (index into symbol table at *(ctx+152)) |
| 6 | Special operand (barrier, system register) |
For register operands (type 1), the index is masked as operand & 0xFFFFFF (24 bits) to extract the full register ID. Indices 41--44 are architectural predicates that are never allocated.
SASS Instruction Register Encoding
During final SASS encoding, the register operand encoder (sub_7BC030, 814 bytes, 6147 callers) packs register operands into the 128-bit instruction word:
Encoded register field (16 bits at variable bit offset):
bit 0: presence flag (1 = register present)
bits 1-4: register file type (4 bits, 12 values)
bits 5-14: register number (10 bits)
The 4-bit register file type field in the SASS encoding maps the internal operand type tag to hardware encoding:
| Operand type tag | Encoded value | Register file |
|---|---|---|
| 1 | 0 | R (32-bit) |
| 2 | 1 | R pair (64-bit) |
| 3 | 2 | UR (uniform 32-bit) |
| 4 | 3 | UR pair (uniform 64-bit) |
| 5 | 4 | P (predicate) |
| 6 | 5 | (reserved) |
| 7 | 6 | (reserved) |
| 8 | 7 | B (barrier) |
| 16 | 8 | (extended) |
| 32 | 9 | (extended) |
| 64 | 10 | (extended pair) |
| 128 | 11 | (extended quad) |
The predicate operand encoder (sub_7BCF00, 856 bytes, 1657 callers) uses a different format: 2-bit predicate type, 3-bit predicate condition, and 8-bit value. It checks for PT (operand byte[0] == 14) and handles the always-true case.
Register-Class-to-Hardware Encoding
The function sub_1B6B250 (2965 bytes, 254 callers) implements the mapping from the compiler's abstract (register_class, sub_index) pair to hardware register numbers:
hardware_reg = register_class * 32 + sub_index
For example: class 0, index 1 returns 1; class 1, index 1 returns 33; class 2, index 1 returns 65. The guard wrapper sub_1B73060 (483 callers) returns 0 for the no-register case (class=0, index=0).
The register field writer (sub_1B72F60, 483 callers) packs the encoded register number into the 128-bit instruction word with the encoding split across two bitfields:
*(v2 + 12) |= (encoded_reg << 9) & 0x3E00; // bits [13:9]
*(v2 + 12) |= (encoded_reg << 21) & 0x1C000000; // bits [28:26]
Register Pressure Tracking
Scheduling Phase Pressure Counters
The scheduler maintains 10 per-block register pressure counters at offsets +4 through +40 of the per-BB scheduling record (72 bytes per basic block). At BB entry, these are copied into the scheduler context at context offsets +48 through +87. The counters track live register counts for each register class:
| BB record offset | Context offset (idx) | Register class |
|---|---|---|
| +4 | +48 (idx 12) | R (general-purpose) |
| +8 | +52 (idx 13) | P (predicate) |
| +12 | +56 (idx 14) | UR (uniform) |
| +16 | +60 (idx 15) | UP (uniform predicate) |
| +20 | +64 (idx 16) | B (barrier) |
| +24 | +68 (idx 17) | (arch-specific class 0) |
| +28 | +72 (idx 18) | (arch-specific class 1) |
| +32 | +76 (idx 19) | (arch-specific class 2) |
| +36 | +80 (idx 20) | (arch-specific class 3) |
| +40 | +84 (idx 21) | (arch-specific class 4 / control total) |
The spill cost analyzer (sub_682490, 14 KB) allocates two stack arrays (v94[511] and v95[538]) as per-register-class pressure delta arrays. For each instruction, it computes pressure increments and decrements based on the instruction's register operand definitions and uses.
The register pressure coefficient is controlled by knob 740 (double, default 0.045). The pressure curve function uses a piecewise linear model with parameters (4, 2, 6) via sub_8CE520.
Liveness Bitvectors
The Code Object maintains register liveness as bitvectors:
| Offset | Bitvector | Description |
|---|---|---|
| +832 | Main register liveness | One bit per virtual register; tracks which registers are live at the current program point |
| +856 | Uniform register liveness | Separate bitvector for UR/UP registers |
These bitvectors are allocated via sub_BDBAD0 (bitvector allocation, with size = register count + 1 bits) and manipulated via the SSE2-optimized bitvector primitives at sub_BDBA60 / sub_BDC180 / sub_BDCDE0 / sub_BDC300.
For each basic block during dependency graph construction (sub_A0D800, 39 KB), the per-block liveness is computed by iterating instructions and checking operand types (((v >> 28) & 7) == 1 for register operands), then updating the bitvector at +832 with set/clear operations.
Allocator Pressure Arrays
The fat-point allocator (sub_957160) uses two 512-DWORD (2048-byte) arrays per allocation round:
| Array | Role |
|---|---|
| Primary (v12[512]) | Per-physical-register interference count |
| Secondary (v225[512]) | Tie-breaking cost metric |
Both are zeroed with SSE2 vectorized _mm_store_si128 loops at the start of each round. For each VR being allocated, the pressure builder (sub_957020) walks the VR's constraint list and increments the corresponding physical register slots. The threshold (knob 684, default 50) filters out congested slots.
ABI Register Reservations
Reserved Registers
Registers R0--R3 are unconditionally reserved by the ABI across all SM generations. The diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s" fires if they are targeted by parameter assignment or user directives.
Minimum Register Counts by SM Generation
| SM generation | Value | SM targets | Minimum registers |
|---|---|---|---|
| 3 | (sm_target+372) >> 12 == 3 | sm_35, sm_37 | (no minimum) |
| 4 | == 4 | sm_50 -- sm_53 | 16 |
| 5 | == 5 | sm_60 -- sm_89 | 16 |
| 9 | == 9 | sm_90, sm_90a | 24 |
| >9 | > 9 | sm_100+ | 24 |
Violating the minimum emits warning 7016: "regcount %d specified below abi_minimum of %d".
Per-Class Hardware Limits
| Class | Limit | Notes |
|---|---|---|
| R | 255 | R0--R254 usable; controlled by --maxrregcount and --register-usage-level (0--10) |
| UR | 63 | UR0--UR62 usable; sm_75+ only |
| P | 7 | P0--P6 usable |
| UP | 7 | UP0--UP6 usable; sm_75+ only |
| B | 16 | B0--B15 |
| UB | 16 | UB0--UB15 |
The --maxrregcount CLI option sets a per-function hard ceiling for R registers. The --register-usage-level option (0--10, default 5) modulates the register allocation target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class budget at alloc + 32*class + 884 reflects the interaction between the CLI limit and the optimization level.
The --device-function-maxrregcount option overrides the kernel-level limit for device functions when compiling with -c.
Dynamic Register Allocation (setmaxnreg)
sm_90+ (Hopper and later) supports dynamic register allocation through the setmaxnreg.inc and setmaxnreg.dec instructions, which dynamically increase or decrease the per-thread register count at runtime. ptxas tracks these as internal states setmaxreg.try_alloc, setmaxreg.alloc, and setmaxreg.dealloc. Multiple diagnostics guard correct usage:
- "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count in the program (%d)"
- "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d) allowed"
- "Potential Performance Loss: 'setmaxnreg' ignored to maintain minimum register requirements."
Pair Modes and Coalescing
The pair mode at vreg+48 bits 20--21 controls how the allocator handles wide registers:
| Pair mode | Value | Behavior |
|---|---|---|
| Single | 0 | Occupies one physical register slot |
| Lo-half | 1 | Low half of a register pair |
| Double-width | 3 | Occupies two consecutive physical slots |
The allocator computes register consumption via sub_939CE0:
consumption = slot + (1 << (pair_mode == 3)) - 1;
// single: slot + 0 = slot (1 slot)
// double: slot + 1 = slot+1 (2 slots)
The coalescing pass (sub_9B1200, 800 lines) eliminates copy instructions by merging the source and destination VRs into the same physical register. The alias chain at vreg+36 (coalesced parent) is followed during assignment (sub_94FDD0) to propagate the physical register through all aliased VRs:
alias = vreg->alias_parent; // vreg+36
while (alias != NULL) {
alias->physical_reg = slot; // alias+68
alias = alias->alias_parent; // alias+36
}
Register Name Table
The register class name table at off_21D2400 is a pointer array indexed by the register file type enum (from vreg+64). Each entry points to a string: "R", "UR", "P", "UP", "B", "UB", etc. This table is used by diagnostic functions:
- sub_A4B9F0 (StatsEmitter::emitUndefinedRegWarning): "Referencing undefined register: %s%d", where %s is off_21D2400[*(vreg+64)] and %d is *(vreg+68) (physical register number).
- sub_A60B60 (RegisterStatCollector::collectStats, 24 KB): Enumerates ~25 register sub-classes by iterating vtable getters, one per register class. The enumerated classes include R, P, B, UR, UP, UB, SRZ, PT, RZ, and others.
- "Fatpoint count for entry %s for regclass %s : %d": Prints per-function per-class allocation statistics.
Key Functions
| Address | Size | Function | Description |
|---|---|---|---|
| sub_91BF30 | 99 lines | createVirtualRegister | Allocates 160-byte VR descriptor, initializes fields, appends to register file array |
| sub_9446D0 | 29 lines | shouldSkipRegister | Returns true for indices 41--44, 39 (architectural specials); checks CSSA phi and exclusion set |
| sub_A4B8F0 | 248B | emitInstrRegStats | Emits "instr/R-regs: %d instructions, %d R-regs" |
| sub_A4B9F0 | 774B | emitUndefinedRegWarning | Walks operands backward, formats "Referencing undefined register: %s%d" |
| sub_A60B60 | 4560B | collectRegisterStats | Enumerates ~25 register sub-classes via vtable getters |
| sub_7BC030 | 814B | encodeRegOperand | Packs register into SASS instruction: 1-bit presence + 4-bit type + 10-bit number |
| sub_7BCF00 | 856B | encodePredOperand | Packs predicate into SASS: 2-bit type + 3-bit condition + 8-bit value |
| sub_9B3C20 | -- | decodeRegOperand | Decoder helper: extracts register, maps 255 to 1023 (RZ) |
| sub_9B3D60 | -- | decodePredOperand | Decoder helper: extracts predicate, maps 7 to 31 (PT) |
| sub_1B6B250 | 2965B | regClassToHardware | Maps (class, sub_index) to hardware number: class * 32 + sub_index |
| sub_1B73060 | 19B | regClassToHardwareGuard | Guard wrapper: returns 0 for no-register case |
| sub_1B72F60 | 32B | writeRegField | Packs encoded register into instruction word bits [13:9] and [28:26] |
| sub_112CDA0 | 8.9KB | encodeRegisterPair | Maps 40 register pair combinations to 5-bit packed encoding values |
| sub_939CE0 | 23 lines | computeConsumption | Pair-aware register slot consumption counter |
| sub_94FDD0 | 155 lines | assignRegister | Commits physical register assignment, propagates through alias chain |
| sub_A0D800 | 39KB | buildDependencyGraph | Per-block dependency graph with register-to-instruction mapping |
| sub_A06A60 | 15KB | scheduleWithPressure | Per-block scheduling loop tracking live register set bitvector |
| sub_682490 | 14KB | computeRegPressureDeltas | Per-instruction register pressure delta computation |
| sub_B28E00 | -- | getRegClass | Returns register class (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | -- | isRegOperand | Predicate: is this a register operand? |
| sub_B28E20 | -- | isPredOperand | Predicate: is this a predicate operand? |
| sub_B28E90 | -- | isUReg | Predicate: is this a uniform register? |
Opcode Register Class Table
Every Ori opcode carries an implicit register class contract: which register files its operands may reference, what data widths are valid, and which addressing modes apply. The function sub_6575D0 (49 KB, buildEncodingDescriptor) is the central dispatch that translates each instruction's opcode into a packed encoding descriptor consumed by the SASS encoder.
Function Signature
// sub_6575D0 -- buildEncodingDescriptor
// a1 = compiler context
// a2 = Ori instruction node pointer
// a3 = output: 4-DWORD packed encoding descriptor
char buildEncodingDescriptor(Context *a1, Instruction *a2, uint32_t *a3);
Architecture
The function is a two-level dispatch:
- Outer switch on the Ori opcode at *(instr->info + 8) -- 168 unique case values spanning opcodes 3 (IADD3) through 0xF5 (PIXLD).
- Inner encoding per opcode (or group): assigns an encoding category ID to a3[0], then calls the bitfield packers to fill a3[1..2] with register class attributes.
Two helper functions pack fields into the descriptor:
| Function | Role | Call count | Field ID range |
|---|---|---|---|
| sub_917A60 (packRegClassField) | Bitfield encoder -- field IDs 91--340 map to specific bit positions in a3[1] and a3[2] | 112 | 91--340 |
| sub_A2FF00 (packOperandField) | Alternate encoder for operand-level slots (data type, memory space) | 28 | 3--71 |
Encoding Category Assignment
The encoding category at a3[0] selects which SASS instruction format template the downstream per-SM encoder uses. Key mappings (opcode index to encoding category):
| Opcode(s) | SASS mnemonic | Category | Register class summary |
|---|---|---|---|
| 3 | IADD3 | 489 | R dest, R/UR sources, P carry |
| 4 | BMSK | 106 | R only |
| 5--6 | SGXT / LOP3 | 490--491 | R dest, R/UR sources |
| 7 | ISETP | 59 | P dest, R/UR sources + memory ordering fields |
| 8 | IABS | 60 | R dest, R source + memory ordering fields |
| 0x0E--0x10 | FSET/FSEL/FSETP | 510 | R/P dest, FP operation variant |
| 0x11/0x12/0x18 | FSETP/MOV/PRMT | 517 | FP comparison, combine, data width (IDs 288--299) |
| 0x15--0x16 | P2R/R2P | 524--525 | P-to-R or R-to-P conversion |
| 0x19 | VOTE | 526 | R dest, optional memory class |
| 0x1A | CS2R variant | 527 | UR source width (494--496), data type from a2+92 |
| 0x1B | CS2R_32 | 497 | Source width (494/495/496), predicate flag (ID 270) |
| 0x1E | IPA | 494 | Interpolation mode (440--442), flat/smooth (443/444) |
| 0x1F | MUFU | 501 | Subfunction (445--447), precision (450--459) |
| 0x20 | SHF | 502 | Direction (461--463), source class (464--466), clamp, data type |
| 0x21 | SHFL | 503 | Mode (470/471), operand classes (472--482) |
| 0x22--0x23 | I2I/I2IP | 55/56 | Integer conversion type (23 entries in dword_2026B20) |
| 0x28--0x2A | IPA/MUFU ext | 512 | Extended encoding variants (428--430) |
| 0x2B--0x2C | F2F/F2F_X | 513 | Conversion direction (432/433), saturation (434/435) |
| 0x2D | FRND | 516 | Rounding variant (526), mode (528/529) |
| 0x51--0x53 | AL2P, AL2P_IDX | 437--438 | Bindless flag (ID 148), predicate (ID 147) |
| 0x54--0x56 | BMOV_B/BMOV_R/BMOV | 423--424 | B-register class |
| 0x64--0x67 | SETLMEMBASE/ATOM | 156/463 | Atom-vs-red (ID 178), data width (ID 181) |
| 0x68 | BRX | 468 | Target (ID 190), call convention (IDs 191--192) |
| 0x6A/0x6C/0x6D | JMP/JMX/CALL | 469 | Control flow target class (ID 176) |
| 0x77--0x79 | BSSY/BREAK/BSYNC | 528--530 | Sync mode (ID 324), variant (ID 325) |
| 0x82 | NANOTRAP | 487 | Trap operation class (ID 257), has-source (ID 256) |
| 0x9E--0x9F | Hopper+ instrs | 535--536 | Hopper class A/B (IDs 337--338) |
| 0xAF--0xB2 | LD/ST variants | 431--446 | Full modifier set: uniform (91), pair (92--102) |
| 0xB8--0xBE | LDG/STG/LDL/STL | 449--456 | Cache policy (131), float mode (134), width (131) |
| 0xC1 | Conditional | 10/13 | Branch type (ID 167), divergent (ID 168) |
| 0xC8 | PRMT | 24 | Permute selector (ID 65/66) |
| 0xC9--0xD3 | Texture/surface | 61/455 | Texture data type (IDs 17/18), surface (IDs 19--22) |
| 0xD6--0xD7 | DMMA/CVTA | 515 | Direction (304), predicate (305), data type (306) |
| 0xDA--0xDB | SUATOM | 521/533 | Data width (326--331), sync mode (328) |
| 0xDC | SURED | 534 | Data width (331), type (335--336), sync (333) |
| 0xE0 | WGMMA | 500 | Data type (198), enable (199), barrier (201) |
| 0xF5 | PIXLD | 532 | Mode from dword_2026AA0 (ID 323) |
Extended Opcode Path (Memory/Atomic Sub-dispatch)
When the opcode falls in the 0xF6--0x10C range (memory/atomic extended instructions), a separate sub-dispatch applies. The function sub_44AC80 gates entry; sub_44AC60 and sub_44AC70 select among three encoding categories:
| Category | Gate function | Meaning |
|---|---|---|
| 441 | default | Base memory operation |
| 442 | sub_44AC60 true | Predicated memory variant |
| 443 | sub_44AC70 true | Extended memory variant |
Within each category, the sub-opcode selects register class fields:
| Sub-opcode | Register class (field 115) | Data width (field 113) |
|---|---|---|
| 0xF6/0xFF/0x106 | 69 (class A) | 60 (standard) |
| 0xF7/0x100/0x107 | 71 (class B) | 60 (standard) |
| 0xF8/0x102/0x109 | 0 (default) | 63 (wide) |
| 0xF9/0x103/0x10A | 0 (default) | 61 (narrow) |
| 0xFA/0x104/0x10B | 0 (default) | 62 (medium) |
| 0xFB | 0 (default) | 65 (type A) |
| 0xFC | 0 (default) | 66 (type B) |
| 0xFD | 0 (default) | 68 (type C) |
| 0xFE/0x105/0x10C | 0 (from table) | 64 (from dword_2026C30) |
| 0x101/0x108 | 72 (class C) | 60 (standard) |
Packed Descriptor Layout
The output descriptor a3 is nominally a 4-DWORD (16-byte) structure; a fifth DWORD, a3[4], is additionally written for a few opcode families:
| DWORD | Content |
|---|---|
| a3[0] | Encoding category ID (0--542) -- selects SASS format template |
| a3[1] | Packed bitfield: memory space (bits 0--3), address type (bits 4--7) |
| a3[2] | Packed bitfield: register class attributes (data width, type, modifiers) |
| a3[3] | Auxiliary flags (bit 1 = texture scope, bit 29 = special) |
| a3[4] | Operand count override (set to 12 for KILL/extended mem ops) |
Register Class Field Groups
The 112 calls to packRegClassField (sub_917A60) use field IDs organized into functional groups. Each field ID maps to a specific bit range in the output descriptor via a mask-and-OR encoding:
// Example: field 113 (data width) -- bits 7-9 of a3[2]
case 113:
val = dword_21DEB20[a3_value - 61]; // 8-entry lookup
a3[2] = (val << 7) | (a3[2] & 0xFFFFF87F);
break;
// Example: field 91 (uniform flag) -- bit 16 of a3[2]
case 91:
a3[2] = ((value == 1) << 16) | (a3[2] & 0xFFFEFFFF);
break;
| Field group | IDs | Bits written | Purpose |
|---|---|---|---|
| Core class | 91--102 | a3[2] bits 5--22 | Uniform, pair, predicate, data type, saturate, negate, abs, complement |
| Data width | 113--117 | a3[2] bits 0--9 | Width code, uniform-mem, source regclass, type specifier, write-back |
| Load/store | 118--134 | a3[1] + a3[2] | Memory space, address type, cache policy, atomic op, scope, float mode |
| Texture/surface | 135--165 | a3[2] bits 1--31 | Texture type, dimension, LOD mode, ordering, acquire, scope hint |
| Control flow | 167--202 | a3[2] bits 1--6 | Branch type, divergent, WGMMA data type/enable/barrier |
| FP/conversion | 230--264 | a3[2] various | FP operation, comparison, combine, interpolation, MUFU, SHF, SHFL |
| Extended | 269--299 | a3[2] various | CS2R, FSETP, rounding, data type wide, destination regclass |
| Hopper/Blackwell | 304--340 | a3[2] various | DMMA, WGMMA, TMA hints, surface sync, Hopper-specific classes |
Sub-handler Functions
Complex opcode families delegate register class encoding to dedicated sub-functions:
| Function | Opcodes handled | Purpose |
|---|---|---|
sub_650390 | TEX, TLD, texture family | Texture register class (sampler, coordinate, LOD) |
sub_650220 | LDG, STG, LD, ST, ATOM, RED | Memory instruction register class |
sub_651330 | FMUL (opcode 0x0D) | FP multiply register class |
sub_650920 | LEA, special (0x09, 0x72, 0x74, 0x7A, 0x80, 0x81) | LEA / special instruction |
sub_650A90 | I2I, F2F, conversions (0x24--0x27, 0xE2--0xEB) | Type conversion register class |
sub_652190 | Branch/call (0x13, 0x14, 0x17) | Branch/call register class |
sub_653B90 | Misc (0x0C) | Miscellaneous instruction |
sub_650C80 | Memory barrier modifiers | Applied when (a2+56) & 0x4F0 is nonzero |
sub_651A90 | Texture modifiers (0x83) | Applied before texture encoding |
sub_62D5D0 | Memory space computation | Computes memory space tag from operand types |
Lookup Tables
The function references 28 static lookup tables that map instruction attribute values to register class encoding values:
| Table | Size | Used by field(s) | Content |
|---|---|---|---|
dword_21DEB80 | 5 | 94 | Data type encoding |
dword_21DEB50 | 3 | 107, 115, 145, 157, 165 | 3-value encoding (reused across 5 fields) |
dword_21DEB20 | 8 | 113 | Data width code |
dword_21DEB00 | 7 | 116, 126, 131, 170 | Type encoding (reused across 4 fields) |
dword_21DEAE0 | 5 | 119/123, 136, 143, 159 | Variant table (reused across 4 fields) |
dword_21DEAA0 | 13 | 120 | Memory space code |
dword_21DEA60 | 10 | 121, 135/151 | Address/texture type |
dword_21DEA20 | 15 | 124/125 | Reduction type |
dword_21DE9F0 | 6 | 129/130, 150 | Scope code |
dword_2026C30 | 6 | 116 (ext path) | Sub-opcode to data type |
dword_2026C80 | 20 | 165 (surface) | Surface operation codes |
dword_2026E20 | 17 | 286 | Data type (wide) |
dword_2026AC0 | 16 | 198 | WGMMA data type |
dword_2026B20 | 23 | I2I conversion | Integer conversion type |
Related Pages
- Ori IR Overview -- register files in the context of the full IR
- Instructions -- packed operand format and opcode encoding
- Allocator Architecture -- the 7-class fat-point allocator
- Fat-Point Algorithm -- pressure arrays, constraint types, selection loop
- GPU ABI -- reserved registers, parameter passing, return address
- Spilling -- spill/reload for each register class
- Scheduler -- 10 per-block pressure counters at record +4..+40
- SASS Encoding -- how the descriptor drives instruction word layout
Data Structure Layouts
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the key internal data structures in ptxas v13.0.88: the compilation context ("god object"), the Ori Code Object, symbol tables, constant/shared memory descriptors, the pool allocator's object model, and the generic container types (hash maps, linked lists, growable arrays) that underpin nearly every subsystem.
All offsets are byte offsets from the structure base unless otherwise noted. Types are inferred from decompiled access patterns. Field names are reverse-engineered -- the binary is stripped.
Compilation Context (the "God Object")
The compilation context is the central state object passed to every phase in the pipeline. It is not the Code Object (which is per-function); it is the per-compilation-unit container that owns the Code Object, the knob system, the output stream, the function list, and all per-pass configuration. The sub_7FBB70 (PerKernelEntry) function receives this as a1, and every phase's execute() receives it as the second argument.
The context is a polymorphic C++ object with a vtable at offset +0. It is allocated by the compilation driver and persists for the lifetime of a single compilation unit. Key observations:
- The vtable at +0 provides 263+ virtual methods (vtable spans to offset 2104+)
- The object is at least 1928 bytes based on the highest confirmed field access (+1928 = codegen_ctx)
- The knob/options system is accessed through an indirection at +1664 (pointer to knob container object)
- The output stream lives at +1440
Compilation Context Field Map
| Offset | Type | Field | Evidence |
|---|---|---|---|
| +0 | vtable* | vtable | *(_QWORD *)a1 in every virtual dispatch |
| +8 | ptr | parent / driver_ctx | Back-pointer; sub_A3A7E0 reads v2 = *(a1+8) then v2[198] for Code Object |
| +80 | u32 | last_exit_code | sub_663C30: *(a1+80) = v2[462] |
| +96 | u32 | compile_unit_index | sub_663C30: *(a1+96) = 1 on first call |
| +139 | u8 | multi_function_flag | sub_663C30: if (!*(a1+139)) |
| +144 | ptr | name_table (via vtable+144) | sub_7FBB70: *(*a1 + 144) -> name lookup vtable |
| +296 | ptr | current_function | sub_7FBB70: *(*(a1+296) + 164) = function index |
| +368 | ptr | function_name_array | sub_7FBB70: *(a1+368 + 8*func_id) -> name object |
| +1144 | ptr | function_list_head | sub_663C30: linked list of function descriptors |
| +1160 | ptr | entry_list_head | sub_663C30: linked list of kernel entry descriptors |
| +1376 | u32 | scheduling_mode_flags | Bit 0x08 = forward, bit 0x10 = bidirectional |
| +1412 | i8 | compilation_flags_byte | sub_A3B080: *(char*)(a2+1412) < 0 |
| +1416 | u8 | output_detail_flags | sub_7FBB70: *(a1+1416) |= 0x80; bits 4-5 control latency reporting mode |
| +1418 | u8 | codegen_mode_flags | sub_A3B080: *(a2+1418) & 4 |
| +1428 | i32 | function_index | sub_7FBB70: *(a1+1428) < 0 means first invocation |
| +1440 | stream* | output_stream | sub_7FBB70: sub_7FE930(a1+1440, "\nFunction name: ") |
| +1560 | ptr | timing_records | Growable array of 32-byte timing entries |
| +1576 | u32 | timing_count | sub_C62720: cu->timing_count++ |
| +1552 | i32 | pipeline_progress | Pipeline progress counter (0--21), monotonically increases; see known values |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (polymorphic, 1712--1992B depending on SM target); provides vtable dispatch for legalization, optimization, scheduling, and codegen; see note below |
| +1664 | ptr | knob_container | sub_7FB6C0, sub_A3B080: options/knob dispatch object |
| +1864 | ptr | bb_structure | sub_7FB6C0: destroyed via sub_77F880 |
| +1872 | ptr | per_func_data | sub_7FB6C0: destroyed via sub_7937D0 |
| +1880 | ptr | function_context | sub_7FB6C0: 17 analysis-result pairs at qword offsets |
| +1928 | ptr | codegen_ctx | Confirmed in overview.md Code Object table |
SM Backend Object at +1584
The pointer at context+0x630 (decimal 1584) is the single most confusing field in the compilation context, because it serves multiple roles through a single polymorphic C++ object. Different wiki pages historically called it different names depending on which role they observed:
- Legalization pages see it dispatching MidExpansion, LateExpansionUnsupportedOps, etc., and call it "SM backend" or "arch_backend"
- Scheduling pages see it providing hardware latency profiles at *(sm_backend+372) and call it "scheduler context" or "hw_profile"
- Optimization pages see it dispatching GvnCse (vtable[23]) and OriReassociateAndCommon (vtable[44]) and call it "optimizer state" or "function manager"
- Codegen/template pages see it holding register file capacity at +372 and hardware capability flags at +1037
It is one object. The canonical name is sm_backend. It is constructed per-compilation-unit in sub_662920 with a switch on SM version bits (v3 >> 12). Each SM generation gets a different-sized allocation and a different vtable:
| SM Case | Size | Base Constructor | Vtable | SM Generations |
|---|---|---|---|---|
| 3 | 1712B | sub_A99A30 | off_2029DD0 | sm_30 (Kepler) |
| 4 | 1712B | sub_A99A30 | off_21B4A50 | sm_50 (Maxwell) |
| 5 | 1888B | sub_A99A30 | off_22B2A58 | sm_60 (Pascal) |
| 6 | 1912B | sub_A99A30 | off_21D82B0 | sm_70 (Volta) |
| 7 | 1928B | sub_ACDE20 | off_21B2D30 | sm_80 (Ampere) |
| 8 | 1992B | sub_662220 | off_21C0C68 | sm_89 (Ada) |
| 9 | 1992B | sub_662220 | off_21D6860 | sm_90+ (Hopper/Blackwell) |
Key sub-fields on the SM backend:
- +372 (i32): codegen factory value / encoded SM architecture version (e.g., 28673 = sm_80)
- +1037 (u8): hardware capability flags (bit 0 = has high-precision FP64 MUFU seeds)
- Vtable slots provide architecture-specific dispatch for 50+ operations
Pipeline Progress Counter at +1552
The field at context+1552 is a monotonically increasing int32 that tracks how far the compilation has progressed through the 159-phase pipeline. It is not a legalization-only counter -- it is incremented by phases across all categories (legalization, optimization, scheduling, regalloc). Each increment is performed by a small thunk function whose sole body is *(ctx + 1552) = N.
Known values and their associated phases:
| Value | Thunk Address | Phase / Context |
|---|---|---|
| 0 | (init) | sub_7F7DC0 -- compilation context constructor |
| 1 | sub_C5F620 | Early pipeline (before ConvertUnsupportedOps) |
| 2 | sub_C5F5A0 | After ConvertUnsupportedOps (phase 5) |
| 3 | sub_C5EF80 | After MidExpansion (phase 45) |
| 4 | sub_C5EF30 | After OriDoRematEarly (phase 54) -- signals remat mode active |
| 5 | sub_1233D70 | Mid-pipeline scheduling/ISel context |
| 7 | sub_6612E0 / sub_C60AA0 | After LateExpansion (phase 55) |
| 8 | sub_849C60 | Post-optimization context |
| 9 | sub_C5EB80 | After OriBackCopyPropagate (phase 83) |
| 10 | sub_88E9D0 | Late optimization |
| 11 | sub_C5EA80 | After SetAfterLegalization (phase 95) region |
| 12 | sub_C5E980 | Post-legalization |
| 13 | sub_13B5C80 | ISel/scheduling |
| 14 | sub_C5E830 | Post-scheduling |
| 15 | sub_C5E7C0 | Register allocation phase |
| 16 | sub_C5E6E0 | Post-regalloc |
| 17 | sub_C5E5A0 | Mercury/codegen |
| 18 | sub_C5E4D0 | Post-Mercury |
| 19 | sub_C5E440 | Late codegen |
| 20 | sub_C5E390 | Post-RA cleanup |
| 21 | sub_C5E0B0 | Final pipeline stage |
Readers of downstream passes use *(ctx+1552) > N to gate behavior that should only run after a certain pipeline point. For example, the rematerialization cross-block pass checks *(ctx+1552) > 4 to enable its second-pass mode.
Knob Container Access Pattern
The knob container at +1664 is accessed through a two-level virtual dispatch pattern that appears at 100+ call sites:
// Fast path: known vtable -> direct array read
_QWORD *v2 = *(_QWORD **)(ctx + 1664);
bool (*query)(__int64, int) = *(bool (**)(...))(*v2 + 72);
if (query == sub_6614A0)
result = *(u8*)(v2[9] + knob_index * 72 + offset) != 0;
else
result = query((int64)v2, knob_index); // slow path
The fast path reads directly from the knob value array at v2[9] (offset +72 of the knob state object), where each knob value occupies 72 bytes. The slow path invokes the virtual method for derived knob containers.
Function Context (at +1880)
When a function is under compilation, +1880 points to a large context object containing 17 pairs of analysis-result data structures. Each pair consists of a sorted container and a hash map, holding results such as live ranges, register maps, and scheduling data. The cleanup code in sub_7FB6C0 destroys pairs at qword offsets [102, 97, 92, 87, 82, 77, 72, 67, 62, 57, 52, 47, 42, 36, 31, 26, 21] from the context base, then handles reference-counted objects at offsets [10] and [2].
Ori Code Object (~1136 bytes)
The Code Object is the per-function container for all IR data. One instance exists for each function under compilation. Constructor is at sub_A3B080, vtable at 0x21EE238.
Constructor Analysis
The constructor (sub_A3B080) takes two arguments: a1 (the Code Object to initialize) and a2 (the compilation context). It:
- Sets +8 = a2 (back-pointer to compilation context)
- Sets +0 = &unk_21EE238 (vtable)
- Zeroes approximately 250 distinct fields across the 1136-byte range
- Loads two SSE constants from xmmword_2027600 and xmmword_21EFAE0 into offsets +96 and +112 (likely default register file descriptors or encoding parameters)
- Reads a2+1412 and a2+1418 to set mode flags at +1101 and +1008
- Accesses the knob container at a2+1664 to query knob 367 for initial configuration
- Sets +1008 = 0x300000050 (default) or 0x400000080 (if a2+1418 & 4)
Code Object Field Map
| Offset | Type | Field | Evidence / Notes |
|---|---|---|---|
| +0 | vtable* | vtable | 0x21EE238, 263+ virtual methods |
| +8 | ptr | compilation_ctx | Back-pointer to owning compilation context |
| +16 | u128 | (zeroed) | SSE zero-store in constructor |
| +24 | u32 | sm_version | Encoded SM target (12288=sm30, 20481=sm50, 36865=sm90) |
| +32 | u128 | (zeroed) | SSE zero-store |
| +48 | u128 | (zeroed) | SSE zero-store |
| +64 | u32 | init_flags | Zeroed in constructor |
| +72 | ptr | code_buf | Output code buffer |
| +80 | u128 | (zeroed) | |
| +88 | ptr | reg_file | Register descriptor array: *(ctx+88) + 8*regId |
| +96 | u128 | reg_defaults_1 | Loaded from xmmword_2027600 |
| +99 | u32 | ur_count | Uniform register (UR) count |
| +102 | u32 | r_alloc | R-register allocated count |
| +112 | u128 | reg_defaults_2 | Loaded from xmmword_21EFAE0 |
| +128--175 | u128[3] | (zeroed) | SSE zero-stores |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +159 | u32 | r_reserved | R-register reserved count |
| +176 | ptr | (zeroed) | |
| +184 | u32 | (zeroed) | |
| +192 | ptr | (zeroed) | |
| +200 | u128 | (zeroed) | |
| +216 | u128 | (zeroed) | |
| +232 | u32 | (zeroed) | |
| +236 | u32 | (zeroed) | |
| +240 | ptr | (zeroed) | |
| +248 | u128 | (zeroed) | |
| +264 | u128 | (zeroed) | |
| +272 | ptr | instr_head | Instruction linked-list head |
| +280 | u32 | (zeroed) | |
| +288 | ptr | (zeroed) | |
| +296 | ptr | bb_array | Basic block array pointer (40 bytes per entry) |
| +304 | u32 | bb_index | Current basic block count |
| +312 | ptr | options | OptionsManager* for knob queries |
| +320--359 | u128[3] | (zeroed) | |
| +335 | u32 | instr_hi | Instruction count upper bound |
| +336 | u32 | tex_inst_count | Texture instruction count (stats emitter) |
| +338 | u32 | fp16_vect_inst | FP16 vectorized instruction count |
| +340 | u32 | inst_pairs | Instruction pair count |
| +341 | u32 | instr_lo | Instruction count lower bound |
| +342 | u32 | tepid_inst | Tepid instruction count |
| +360 | ptr | (zeroed) | |
| +368 | u32 | sub_block_flags | |
| +372 | u32 | instr_total | Total instruction count (triggers chunked scheduling at > 0x3FFF) |
| +376 | u32 | (zeroed) | |
| +384--416 | ptr[5] | (zeroed) | |
| +424 | u32 | (zeroed) | |
| +432 | ptr | (zeroed) | |
| +440 | u32 | (zeroed) | |
| +448 | ptr | (zeroed) | |
| +464 | ptr | (zeroed) | |
| +472 | u8 | (zeroed) | |
| +473 | u8 | (zeroed) | |
| +536 | u32 | (zeroed) | |
| +540 | u32 | (zeroed) | |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +728 | ptr | bitmask_array | Grow-on-demand bitmask array for scheduling |
| +740 | u32 | bitmask_capacity | Capacity of bitmask array |
| +752 | ptr | (zeroed) | |
| +760 | u32 | (zeroed) | |
| +764 | u32 | (zeroed) | |
| +768 | ptr | const_sections | Constant memory section array |
| +772 | u8 | (zeroed) | |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +996 | u32 | annotation_offset | Current offset into annotation buffer (sub_A4B8F0) |
| +1000 | ptr | annotation_buffer | Annotation data buffer (sub_A4B8F0) |
| +1008 | u64 | encoding_params | Default 0x300000050 or 0x400000080 |
| +1016 | ptr | (zeroed) | |
| +1024 | u32 | (zeroed) | |
| +1032 | ptr | (zeroed) | |
| +1040 | ptr | (zeroed) | |
| +1064 | ptr | (zeroed) | |
| +1080 | u128 | (zeroed) | |
| +1096 | u32 | (zeroed) | |
| +1100 | u8 | (zeroed) | |
| +1101 | u8 | optimization_mode | Set from knob 367 and compilation_ctx+1412 |
| +1102 | u8 | (zeroed) | |
| +1104 | ptr | (zeroed) | |
| +1120 | u128 | (zeroed) |
Register Count Formula
From the stats emitter at sub_A3A7E0 and the register count function at sub_A4B8F0 (which both use vtable+2104 dispatch with sub_859FC0 as the fast path):
total_R_regs = code_obj[159] + code_obj[102] // reserved + allocated
instruction_count = code_obj[335] - code_obj[341] // upper - lower
Stats Emitter Field Map
The stats emitter (sub_A3A7E0) accesses a per-function stats record through the SM backend: v3 = *(compilation_ctx+8)[198] (offset +1584 from the outer compilation context points to the SM backend object; the emitter then reads per-function stats fields within it). It uses DWORD indexing (4-byte), and reveals these additional fields:
| DWORD Index | Byte Offset | Field | Stat String |
|---|---|---|---|
| 8 | +32 | est_latency | [est latency = %d] |
| 10 | +40 | worst_case_lat | [worstcaseLat=%f] |
| 11 | +44 | avg_case_lat | [avgcaseLat=%f] |
| 12 | +48 | spill_bytes | [LSpillB=%d] |
| 13 | +52 | refill_bytes | [LRefillB=%d] |
| 14 | +56 | s_refill_bytes | [SRefillB=%d] |
| 15 | +60 | s_spill_bytes | [SSpillB=%d] |
| 16 | +64 | low_lmem_spill | [LowLmemSpillSize=%d] |
| 17 | +68 | frame_lmem_spill | [FrameLmemSpillSize=%d] |
| 18 | +72 | non_spill_bytes | [LNonSpillB=%d] |
| 19 | +76 | non_refill_bytes | [LNonRefillB=%d] |
| 20 | +80 | non_spill_size | [NonSpillSize=%d] |
| 26 | +104 | occupancy (float) | [Occupancy = %f] |
| 27 | +108 | div_branches | [est numDivergentBranches=%d] |
| 28 | +112 | attr_mem_usage | [attributeMemUsage=%d] |
| 29 | +116 | program_size | [programSize=%d] |
| 42 | +168 | precise_inst | [Precise inst=%d] |
| 44 | +176 | udp_inst | [UDP inst=%d] |
| 45 | +180 | vec_to_ur | [numVecToURConverts inst=%d] |
| 49 | +196 | max_live_suspend | [maxNumLiveValuesAtSuspend=%d] |
| 87 | +348 | partial_unroll | [partially unrolled loops=%d] |
| 88 | +352 | non_unrolled | [non-unrolled loops=%d] |
| 89 | +356 | cb_bound_tex | [CB-Bound Tex=%d] |
| 90 | +360 | partial_bound_tex | [Partially Bound Tex=%d] |
| 91 | +364 | bindless_tex | [Bindless Tex=%d] |
| 92 | +368 | ur_bound_tex | [UR-Bound Tex=%d] |
| 93 | +372 | sm_version_check | > 24575 triggers UR reporting |
| 99 | +396 | ur_count_stats | [urregs=%d] |
| 102 | +408 | r_alloc | R-register allocated count |
| 159 | +636 | r_reserved | R-register reserved count |
| 303 | +1212 | est_fp | [est fp=%d] |
| 306 | +1224 | est_half | [est half=%d] |
| 307 | +1228 | est_transcendental | [est trancedental=%d] |
| 308 | +1232 | est_ipa | [est ipa=%d] |
| 310 | +1240 | est_shared | [est shared=%d] |
| 311 | +1244 | est_control_flow | [est controlFlow=%d] |
| 315 | +1260 | est_load_store | [est loadStore=%d] |
| 316 | +1264 | est_tex | [est tex=%d] |
| 334 | +1336 | inst_pairs | [instPairs=%d] |
| 335 | +1340 | instr_hi | Instruction count upper bound |
| 336 | +1344 | tex_inst_count | [texInst=%d] |
| 337 | +1348 | fp16_inst | [FP16 inst=%d] |
| 338 | +1352 | fp16_vect_inst | [FP16 VectInst=%d] |
| 339 | +1356 | inst_hint | [instHint=%d] |
| 340 | +1360 | inst_pairs_2 | checked for non-zero to print instHint line |
| 341 | +1364 | instr_lo | Instruction count lower bound |
| 342 | +1368 | tepid_inst | [tepid=%d] |
Note: The stats emitter accesses the stats record through a float pointer (v3), so DWORD indices map to byte offsets as index * 4 for both integer and float fields. Float fields at indices 9, 26, 50, 54, 57, 58, 59, 61, 62, 65, 84, 85, 86 hold throughput and occupancy metrics. A linked list at qword index 55 (byte offset +440) holds additional string annotations.
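The DWORD indexing can be made concrete with a pair of accessors. This is an illustrative sketch, not recovered code: the accessor names and the enum are inventions of this page, while the indices come from the field map above.

```c
#include <stdint.h>

/* Hypothetical accessors mirroring the emitter's DWORD indexing.
   Integer and float fields share one base pointer; only the
   interpretation of the 4-byte slot differs. */
enum {
    STAT_EST_LATENCY = 8,   /* [est latency = %d]            */
    STAT_OCCUPANCY   = 26,  /* [Occupancy = %f], float       */
    STAT_R_ALLOC     = 102, /* R-register allocated count    */
    STAT_R_RESERVED  = 159  /* R-register reserved count     */
};

static inline int32_t stat_i32(const void *stats, int dword_idx) {
    return ((const int32_t *)stats)[dword_idx];
}

static inline float stat_f32(const void *stats, int dword_idx) {
    return ((const float *)stats)[dword_idx];
}

/* Register count formula recovered from sub_A3A7E0 / sub_A4B8F0. */
static inline int32_t total_r_regs(const void *stats) {
    return stat_i32(stats, STAT_R_RESERVED) + stat_i32(stats, STAT_R_ALLOC);
}
```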
Basic Block Entry (40 bytes)
Basic blocks are stored in a contiguous array at Code Object +976, with count at +984.
BasicBlock (40 bytes)
+0 ptr instr_head // first instruction in this BB
+8 ptr instr_tail // last instruction (or list link)
+16 ptr (reserved)
+24 u32 (reserved)
+28 i32 bix // block index (unique ID for CFG ops)
+32 u64 flags // scheduling/analysis flags
The scheduling pass (sub_8D0640) initializes per-block scheduling state by iterating the block list and zeroing qword offsets [7], [13], [19], and setting [21] = -1 on each block.
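On x86-64 the recovered layout corresponds to a natural C struct. The field names below are this wiki's labels (the binary is stripped, so no real names survive); the offsets are the ones from the table:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the 40-byte BasicBlock entry. Natural alignment on
   x86-64 reproduces the recovered offsets exactly. */
typedef struct BasicBlock {
    void    *instr_head; /* +0  first instruction in this BB        */
    void    *instr_tail; /* +8  last instruction (or list link)     */
    void    *reserved_p; /* +16                                     */
    uint32_t reserved_u; /* +24                                     */
    int32_t  bix;        /* +28 block index (unique ID for CFG ops) */
    uint64_t flags;      /* +32 scheduling/analysis flags           */
} BasicBlock;
```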
Instruction Layout
Instructions are polymorphic C++ objects linked into per-BB doubly-linked lists. The instruction format is detailed in Instructions; this section covers only the structural linkage.
Each instruction carries a unique integer ID at +16, an opcode at +72 (the peephole optimizer masks with & 0xCF on byte 1 to strip modifier bits), and a packed operand array starting at +84. The operand count is at +80. Operands are 8 bytes each.
Packed Operand Format
  31   30    28   27        20   19                    0
+----+--------+---------------+------------------------+
| -  |  type  |   modifiers   |         index          |
+----+--------+---------------+------------------------+
(bit 31 is not read by the confirmed extraction sites, which mask the type with & 7)
Extraction (50+ confirmed sites):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7; // bits 28-30
int index = operand & 0xFFFFF; // bits 0-19
int mods = (operand >> 20) & 0xFF; // bits 20-27
| Type Value | Meaning | Resolution |
|---|---|---|
| 1 | Register operand | Index into *(code_obj+88) register file |
| 5 | Symbol/constant operand | Index into *(code_obj+152) symbol table |
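A round-trip sketch makes the field boundaries concrete. The unpacking side mirrors the 50+ confirmed extraction sites; the packing direction is inferred, and both function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical packer for the 32-bit operand word (inferred from
   the extraction sites; ptxas's own construction code has not been
   confirmed to look like this). */
static inline uint32_t pack_operand(int type, int mods, int index) {
    return ((uint32_t)(type & 7) << 28)
         | ((uint32_t)(mods & 0xFF) << 20)
         | ((uint32_t)index & 0xFFFFF);
}

/* Mirrors the confirmed extraction pattern. */
static inline void unpack_operand(uint32_t operand,
                                  int *type, int *mods, int *index) {
    *type  = (operand >> 28) & 7;     /* bits 28-30 */
    *mods  = (operand >> 20) & 0xFF;  /* bits 20-27 */
    *index = operand & 0xFFFFF;       /* bits 0-19  */
}
```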
The operand classifier functions at 0xB28E00--0xB28E90 provide predicate checks:
| Function | Predicate |
|---|---|
| sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | isRegOperand |
| sub_B28E20 | isPredOperand |
| sub_B28E40 | isImmOperand |
| sub_B28E80 | isConstOperand |
| sub_B28E90 | isUReg |
Symbol Table
The symbol table is accessed through Code Object +152. Based on the symbol table builder at sub_621480 (21KB, references a1+30016 for the symbol table base), symbols are stored in a hash-map-backed structure where each symbol has a name and associated properties (address, type, section binding).
Internal Symbol Names
The following internal symbol names appear in decompiled code, indicating the kinds of entities tracked:
| Symbol | Purpose |
|---|---|
__ocg_const | OCG-generated constant data |
__shared_scratch | Shared memory scratch space |
__funcAddrTab_g | Global indirect function call table |
__funcAddrTab_c | Constant indirect function call table |
_global_ptr_%s | Global pointer for named variable |
$funcID$name | Function-local relocation symbol |
__cuda_dummy_entry__ | Dummy entry generated by --compile-only |
__cuda_sanitizer | CUDA sanitizer instrumentation symbol |
Symbol Resolution Flow
Symbol resolution (sub_625800, 27KB) traverses the symbol table to resolve references during the PTX-to-Ori lowering and subsequent optimization phases. The format %s[%d] (from sub_6200A0) is used for array-subscripted symbol references, and __$endLabel$__%s markers delimit function boundaries.
Constant Buffer Layout
Constant memory is organized into banks (c[0], c[1], ...) corresponding to the CUDA .nv.constant0, .nv.constant2, etc. ELF sections. The constant section array at Code Object +768 tracks all constant banks for the current function.
Constant Bank Handling
The constant bank handler at sub_6BC560 (4.9KB) manages references to constant memory using the c[%d] (integer bank) and c[%s] (named bank, sw-compiler-bank) notation. It enforces:
- A maximum constant register count (error: "Constant register limit exceeded; more than %d constant registers")
- LDC (Load Constant) requires a constant or immediate bank number
ELF Constant Symbols
The ELF symbol emitter (sub_7FD6C0) creates symbols for constant bank metadata:
| Symbol Name | Purpose |
|---|---|
.nv.ptx.const0.size | Size of constant bank 0 (kernel parameters) |
The constant emission function (sub_7D14C0, 5.6KB) iterates the constant section array and copies bank data into the output ELF sections.
Shared Memory Layout
Shared memory (.nv.shared) allocations are tracked through the shared memory section array at Code Object +776. Reserved shared memory regions are managed by sub_6294E0 (12.1KB) and sub_629E40 (6.1KB).
Reserved Shared Memory Symbols
The ELF emitter recognizes these special symbols for shared memory layout:
| Symbol Name | Purpose |
|---|---|
.nv.reservedSmem.begin | Start of reserved shared memory region |
.nv.reservedSmem.cap | Capacity of reserved shared memory |
.nv.reservedSmem.end | End of reserved shared memory region |
.nv.reservedSmem.offset0 | First reserved offset within shared memory |
.nv.reservedSmem.offset1 | Second reserved offset within shared memory |
The --disable-smem-reservation CLI option disables the reservation mechanism. Shared memory intrinsic lowering (sub_6C4DA0, 15KB) validates that shared memory operations use types {b32, b64}.
Descriptor Size Symbols
Additional ELF symbols track texture/surface descriptor sizes in shared memory:
| Symbol Name | Purpose |
|---|---|
.nv.unified.texrefDescSize | Unified texture reference descriptor size |
.nv.independent.texrefDescSize | Independent texture reference descriptor size |
.nv.independent.samplerrefDescSize | Independent sampler reference descriptor size |
.nv.surfrefDescSize | Surface reference descriptor size |
Pool Allocator
The pool allocator (sub_424070, 3,809 callers) is the single most heavily used allocation function. Every dynamic data structure in ptxas is allocated through pools.
Pool Object Layout
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | ptr | large_block_list | Singly-linked list of large (>4999 byte) blocks |
| +32 | u32 | min_slab_size | Minimum slab allocation size |
| +44 | u32 | slab_count | Number of slabs allocated |
| +48 | ptr | large_free_list | Free list for large blocks (boundary-tag managed) |
| +56 | u32 | fragmentation_count | Fragmentation counter (decremented on split) |
| +60 | u32 | max_order | Maximum power-of-2 order for large blocks |
| +64 | ... | (large block free lists) | a1 + 32*(order+2) = per-order free list head |
| +2112 | ptr | tracking_map | Hash map for allocation metadata tracking |
| +2128 | ptr[N] | small_free_lists | Size-binned free lists: *(pool + 8*(size>>3) + 2128) = head |
| +7128 | mutex* | pool_mutex | pthread_mutex_t* for thread safety |
Allocation Paths
Small path (size <= 4999 bytes = 0x1387):
- Round the size up to 8-byte alignment: aligned = (size + 7) & ~7, with a 16-byte minimum
- Compute the bin: bin = pool + 8 * (aligned >> 3) + 2128
- If the bin has a free block: pop it from the free list and decrement the slab's available bytes
- If the bin is empty: allocate a new slab from the parent (size = aligned * ceil(min_slab_size / aligned)) and carve it into free-list nodes
Large path (size > 4999 bytes):
- Add 32 bytes for boundary tags
- Search the power-of-2 order free lists starting from log2(size + 32)
- If found: split the block when the remainder is > 39 bytes and return the payload
- If not found: call sub_423B60 to grow the pool and allocate a new slab from the parent
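The size arithmetic in both paths can be sketched in plain C. Function names here are hypothetical, and whether the order search starts at the floor or the ceiling of log2(size + 32) is not pinned down by the decompilation, so the ceiling is an assumption:

```c
#include <stddef.h>

#define POOL_SMALL_MAX 4999 /* 0x1387 */

/* Small path: 8-byte alignment with a 16-byte floor. */
static size_t pool_small_align(size_t size) {
    size_t aligned = (size + 7) & ~(size_t)7;
    return aligned < 16 ? 16 : aligned;
}

/* Byte offset of the size-binned free-list head within the pool object. */
static size_t pool_small_bin_offset(size_t aligned) {
    return 8 * (aligned >> 3) + 2128;
}

/* Large path: smallest power-of-2 order covering size + 32 tag bytes
   (ceiling-of-log2 is an assumption, see lead-in). */
static int pool_large_order(size_t size) {
    size_t total = size + 32;
    int order = 0;
    while (((size_t)1 << order) < total)
        order++;
    return order;
}
```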
Boundary Tag Format (Large Blocks)
Large blocks use boundary tags for coalescing on free:
Block Header (32 bytes):
+0 i64 sentinel // -1 = allocated, else -> next free
+8 ptr prev_free // previous in free list (or 0)
+16 u64 tag_offset // 32 (header size)
+24 u64 payload_size // user-requested allocation size
Block Footer (32 bytes at end):
+0 i64 sentinel
+8 ptr prev_free
+16 u64 footer_tag // 32
+24 u64 block_size // total size including headers
Slab Descriptor (56 bytes)
Each slab is tracked by a 56-byte descriptor:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | chain_link |
| +8 | u64 | total_size |
| +16 | u64 | available_size |
| +24 | ptr | owning_pool |
| +32 | ptr | memory_base |
| +40 | u8 | is_small_slab |
| +44 | u32 | slab_id |
| +48 | u32 | bin_size |
Hierarchical Pools
Pools are hierarchical. When sub_424070 is called with a1 = NULL, it falls back to a global allocator (sub_427A10) that uses malloc directly. Non-null a1 values are pool objects that allocate from their own slabs, which are themselves allocated from a parent pool (the TLS context at offset +24 holds the per-thread pool pointer). The top-level pool is named "Top level ptxas memory pool" and is created in the compilation driver.
Hash Map
The hash map (sub_426150 insert / sub_426D60 lookup, 2,800+ and 422+ callers respectively) is the primary associative container in ptxas.
Hash Map Object Layout (~112 bytes)
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | fptr | hash_func | Custom hash function pointer |
| +8 | fptr | compare_func | Custom compare function pointer |
| +16 | fptr | hash_func_2 | Secondary hash (or NULL) |
| +24 | fptr | compare_func_2 | Secondary compare (or NULL) |
| +32 | u32 | has_custom_compare | Flag |
| +40 | u64 | bucket_mask | capacity - 1 for power-of-2 masking |
| +48 | u64 | entry_count | Number of stored entries |
| +64 | u64 | load_factor_threshold | Resize when entry_count exceeds this |
| +72 | u32 | first_free_slot | Tracking for bitmap-based slot allocation |
| +76 | u32 | entries_capacity | Capacity of entries array |
| +80 | u32 | bitmap_capacity | Capacity of used-bits bitmap |
| +84 | u32 | flags | Hash mode in bits 4-7 |
| +88 | ptr | entries | Array of 16-byte {key, value} pairs |
| +96 | ptr | used_bitmap | Bitmap tracking occupied slots |
| +104 | ptr | buckets | Array of pointers to chained index lists |
Hash Modes
The hash mode is encoded in bits 4-7 of the flags field at offset +84:
| Mode | Flag Bits | Hash Function | Use Case |
|---|---|---|---|
| 0 | 0x00 | Custom (+0 function pointer) | User-defined hash/compare |
| 1 | 0x10 | Pointer hash: (key>>11) ^ (key>>8) ^ (key>>5) | Pointer-keyed maps |
| 2 | 0x20 | Identity: key used directly | Integer-keyed maps |
Mode selection happens automatically in the constructor (sub_425CA0): if the hash/compare pair matches (sub_427750, sub_427760), mode 2 is set; if (sub_4277F0, sub_427810), mode 1.
Lookup Algorithm
// Mode 1 (pointer hash) example:
uint64_t hash = (key >> 11) ^ (key >> 8) ^ (key >> 5);
uint64_t bucket_idx = hash & map->bucket_mask;
int32_t* chain = map->buckets[bucket_idx];
while (*++chain != -1) {
entry_t* e = map->entries + 16 * (*chain);
if (key == e->key)
return e->value; // found
}
return 0; // not found
Growth policy: the map doubles capacity and rehashes when entry_count > load_factor_threshold.
String-Keyed Maps
String-keyed maps use MurmurHash3 (sub_427630, 73 callers) as the hash function. The implementation uses the standard MurmurHash3_x86_32 constants:
| Constant | Value | Standard Name |
|---|---|---|
| c1 | 0xCC9E2D51 (-862048943) | MurmurHash3 c1 |
| c2 | 0x1B873593 (461845907) | MurmurHash3 c2 |
| fmix1 | 0x85EBCA6B (-2048144789) | MurmurHash3 fmix |
| fmix2 | 0xC2B2AE35 (-1028477387) | MurmurHash3 fmix |
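For reference, the standard MurmurHash3_x86_32 routine built from these constants is reproduced below. Whether sub_427630 matches it byte-for-byte (seed choice, tail handling) has not been verified, so treat this as the reference algorithm rather than recovered code:

```c
#include <stdint.h>
#include <string.h>

/* Reference MurmurHash3_x86_32 (Austin Appleby, public domain),
   using the same c1/c2/fmix constants recovered from sub_427630. */
static uint32_t murmur3_x86_32(const void *key, int len, uint32_t seed) {
    const uint8_t *data = (const uint8_t *)key;
    const int nblocks = len / 4;
    const uint32_t c1 = 0xCC9E2D51u, c2 = 0x1B873593u;
    uint32_t h1 = seed;
    for (int i = 0; i < nblocks; i++) {
        uint32_t k1;
        memcpy(&k1, data + 4 * i, 4);       /* little-endian block load */
        k1 *= c1;
        k1 = (k1 << 15) | (k1 >> 17);
        k1 *= c2;
        h1 ^= k1;
        h1 = (h1 << 13) | (h1 >> 19);
        h1 = h1 * 5 + 0xE6546B64u;
    }
    const uint8_t *tail = data + 4 * nblocks;
    uint32_t k1 = 0;
    switch (len & 3) {
    case 3: k1 ^= (uint32_t)tail[2] << 16; /* fallthrough */
    case 2: k1 ^= (uint32_t)tail[1] << 8;  /* fallthrough */
    case 1: k1 ^= tail[0];
            k1 *= c1; k1 = (k1 << 15) | (k1 >> 17); k1 *= c2; h1 ^= k1;
    }
    h1 ^= (uint32_t)len;
    h1 ^= h1 >> 16;  h1 *= 0x85EBCA6Bu;    /* fmix avalanche */
    h1 ^= h1 >> 13;  h1 *= 0xC2B2AE35u;
    h1 ^= h1 >> 16;
    return h1;
}
```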
CFG Hash Map (FNV-1a)
The control flow graph uses a separate hash map implementation based on FNV-1a hashing, distinct from the general-purpose hash map above. Two instances exist per Code Object at offsets +648 (successor edges) and +680 (backedge info).
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 (-2128831035) |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
Bucket entry: 24 bytes {head, tail, count}. Node: 64 bytes with chain link, key, values, sub-hash data, and cached hash. See CFG for the full CFG hash map specification.
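The byte-by-byte FNV-1a over a block index can be sketched from the recovered parameters. Little-endian byte order is an assumption here (it matches how a 4-byte index sits in memory on x86-64), and the function name is hypothetical:

```c
#include <stdint.h>

/* FNV-1a over a 4-byte block index, using the recovered basis
   0x811C9DC5 and prime 16777619 (0x01000193). */
static uint32_t cfg_fnv1a_bix(uint32_t bix) {
    uint32_t h = 0x811C9DC5u;            /* initial hash */
    for (int i = 0; i < 4; i++) {
        h ^= (bix >> (8 * i)) & 0xFFu;   /* xor byte, then multiply */
        h *= 16777619u;
    }
    return h;
}
```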
Linked List
The linked list (sub_42CA60 prepend, 298 callers; sub_42CC30 length, 48 callers) is a singly-linked list of 16-byte nodes:
ListNode (16 bytes, pool-allocated)
+0 ptr next // pointer to next node (NULL = end)
+8 ptr data // pointer to payload object
Prepend allocates a 16-byte node from the pool, sets node->data = payload, and links it at the list head. This is used for function lists, relocation lists, annotation chains, and many intermediate pass-local collections.
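A minimal analogue of the two list routines, with malloc standing in for the pool allocator (function names are hypothetical):

```c
#include <stdlib.h>

/* 16-byte node layout as recovered. */
typedef struct ListNode {
    struct ListNode *next; /* +0 pointer to next node (NULL = end) */
    void *data;            /* +8 pointer to payload object         */
} ListNode;

/* sub_42CA60 analogue: allocate a node and link it at the head. */
static ListNode *list_prepend(ListNode *head, void *payload) {
    ListNode *n = (ListNode *)malloc(sizeof *n);
    n->next = head;
    n->data = payload;
    return n;
}

/* sub_42CC30 analogue: walk the chain and count nodes. */
static int list_length(const ListNode *head) {
    int n = 0;
    for (; head; head = head->next) n++;
    return n;
}
```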
Growable Array (Pool Vector)
Growable arrays appear throughout the PhaseManager and elsewhere. The layout is a triple of {data_ptr, count, capacity}:
PoolVector (24 bytes inline, or embedded in parent struct)
+0 ptr data // pointer to element array
+8 i32 count // current element count
+12 i32 capacity // allocated capacity
Growth strategy (confirmed in the PhaseManager timing records): new_capacity = max(old + old/2 + 1, requested) (1.5x growth factor). Elements are typically 8 bytes (pointers) or 16 bytes (pointer pairs). Reallocation uses sub_424C50 (pool realloc, 27 callers).
The PhaseManager uses this pattern for the phase list (16-byte {phase_ptr, pool_ptr} pairs), the name table (8-byte string pointers), and the timing records (32-byte entries).
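The growth rule reduces to a one-line capacity function (name hypothetical):

```c
#include <stdint.h>

/* 1.5x growth rule as recovered from the PhaseManager's vectors:
   new_capacity = max(old + old/2 + 1, requested). */
static int32_t pv_grow_capacity(int32_t old_cap, int32_t requested) {
    int32_t grown = old_cap + old_cap / 2 + 1;
    return grown > requested ? grown : requested;
}
```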
Knob Value Array
Knob values are stored in a contiguous array of 72-byte slots, accessed at knob_state[9] + 72 * knob_index (where knob_state[9] is offset +72 of the knob state object).
Knob Value Slot (72 bytes)
| Offset | Type | Field |
|---|---|---|
| +0 | u8 | Type tag (0=unset, 1=bool, 2=int, ..., 12=opcode list) |
| +8 | i64 | Integer value / pointer to string / linked list head |
| +16 | i64 | Secondary value (range max, list count, etc.) |
| +24 | i64 | Tertiary value |
| +64 | ptr | Allocator reference |
Supported types:
| Type | Tag | Storage |
|---|---|---|
| Boolean | 1 | Flag at +0 |
| Integer | 2 | Value at +8 |
| Integer+extra | 3 | Value at +8, extra at +12 |
| Integer range | 4 | Min at +8, max at +16 |
| Integer list | 5 | Growable array of ints |
| Float | 6 | float at +8 |
| Double | 7 | double at +8 |
| String | 8/11 | Pointer at +8 |
| When-string | 9 | Linked list of 24-byte condition+value nodes |
| Value-pair list | 10 | Opcode:integer pairs via vtable |
| Opcode list | 12 | Opcode names resolved through vtable |
Knob Descriptor (64 bytes)
Knob descriptors are stored in a table at knob_state+16, with count at knob_state+24:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Primary name (ROT13-encoded) |
| +8 | u64 | Primary name length |
| +16 | u32 | Type tag |
| +24 | ... | (reserved) |
| +40 | ptr | Alias name (ROT13-encoded) |
| +48 | u64 | Alias name length |
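Since ROT13 is an involution, a single routine both decodes and re-encodes the stored names. A minimal sketch (the binary's own decoder has not been isolated, so this is illustrative):

```c
/* In-place ROT13 over ASCII letters; non-letters pass through,
   matching how knob names round-trip through the descriptor table. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + 13) % 26;
        else if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + 13) % 26;
    }
}
```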
Stream Object
The output stream used for diagnostics and stats reporting (e.g., at compilation context +1440) is a C++ iostream-like object with operator overloads. Field layout (from sub_7FE5D0 and sub_7FECA0):
| Offset | Type | Field |
|---|---|---|
| +0 | vtable* | vtable (dispatch for actual I/O) |
| +8 | u32 | width |
| +12 | u32 | precision |
| +16 | u64 | char_count |
| +24 | ptr | format_buffer |
| +56 | u32 | flags (bit 0=hex, bit 1=oct, bit 2=left-align, bit 3=uppercase, bits 7-8=sign) |
ORI Record Serializer (sub_A50650)
The ORI Record Serializer (sub_A50650, 74 KB, 2,728 decompiled lines) is the central function that takes a Code Object's in-memory state and flattens it into a linear output buffer organized as a table of typed section records. It is the serialization backbone for both the DUMPIR diagnostic subsystem and the compilation output path. Despite the _ORI_ string it contains, it is not an optimization pass -- it is infrastructure.
| Address | 0xA50650 |
| Size | ~74 KB |
| Identity | CodeObject::EmitRecords |
| Confidence | 0.90 |
| Called from | sub_A53840 (wrapper), sub_AACBF0 / sub_AAD2A0 (DUMPIR diagnostic path) |
| Calls | sub_A4BC60 (register serializer, new format), sub_A4D3F0 (legacy format), sub_A4B8F0 (register count annotation), sub_A47330 + sub_A474F0 (multi-section finalization), sub_1730890 / sub_17308C0 / sub_17309A0 (scheduling serializers), sub_1730FE0 (register file map) |
Parameters
a1 is a serialization state object ("OriRecordContext") that carries the section table, compilation context back-pointer, and per-subsection index/size pairs. a2 is the output buffer write cursor, advanced as data is emitted.
Key fields on a1:
| Offset | Type | Field | Evidence |
|---|---|---|---|
| +8 | ptr | compilation_ctx | Dereferenced to reach sm_backend at +1584 |
| +24 | i32 | header_section_idx | v5 + 32 * (*(a1+24) + 1) |
| +72 | ptr | section_table | Array of 32-byte section entries |
| +180 | u32 | instr_counter_1 | Reset to 0 at entry |
| +472 | u8 | has_debug_info | Gates debug section emission |
| +916 | i32 | multi_section_count | > 0 triggers link-record emission and tail call to sub_A47330 |
| +1102 | u8 | multi_section_enabled | Master flag for multi-section mode |
| +1120 | ptr | scheduling_ctx | Scheduling context for barrier/scope serialization |
Section Record Format
Each section occupies a 32-byte entry in the table at *(a1+72) + 32 * section_index:
Offset Type Field
+0 u16 type_tag section type identifier
+4 u32 data_size byte size of data payload
+8 ptr data_ptr pointer to data in output buffer
+16 u32 element_count number of elements (or auxiliary metadata)
+20 u32 aux_field additional per-type context
+24 u32 aux_field_2 secondary per-type context
Data payloads are 16-byte aligned: cursor += (size + 15) & ~0xF.
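The cursor-advance rule just stated, as a helper (align16 is this wiki's shorthand, not a symbol from the binary):

```c
#include <stdint.h>

/* 16-byte alignment of the serializer's write cursor. */
static inline uint64_t align16(uint64_t n) {
    return (n + 15) & ~(uint64_t)0xF;
}
```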
Section Type Tag Catalog
The serializer emits up to 56 unique section types across three tag ranges.
Base types (0x01--0x58):
| Tag | Hex | Content | Evidence |
|---|---|---|---|
| 1 | 0x01 | Instruction stream (register-allocated code body) | Emitted via sub_A4BC60 or sub_A4D3F0 |
| 3 | 0x03 | Virtual-dispatch section (vtable+48 on state obj) | Conditional on *(a1+64) > 0 |
| 16 | 0x10 | Source operand bank (v7[199] entries at v7+97) | *(entry+48) = v7[199] |
| 17 | 0x11 | Destination operand bank (bit-packed from v7+203) | Conditional on !v7[1414] |
| 19 | 0x13 | Annotation stream | *(a1+232) counter |
| 34 | 0x22 | Original-definition name table (_ORI_ prefixed) | strcpy(v50, "_ORI_") at line 1762 |
| 35 | 0x23 | Instruction info snapshot (340 bytes from v7+4) | qmemcpy of 340 bytes |
| 46 | 0x2E | Texture/surface binding table | v7[248] entries, 16 bytes each |
| 50 | 0x32 | Live range interval table (spill map) | From compilation context +984 |
| 51 | 0x33 | Register file occupancy table | *(ctx+1424) & 4 |
| 53 | 0x35 | Source operand type bitmap (4-bit per operand) | v7[131] operands, 20-byte stride |
| 54 | 0x36 | Destination operand type bitmap | v7[134] operands, 20-byte stride |
| 55 | 0x37 | Scheduling barrier data | via sub_1730890 |
| 56 | 0x38 | Register file mapping | via sub_1730FE0 |
| 58 | 0x3A | Scheduling dependency graph | via sub_17309A0 |
| 59 | 0x3B | Multi-section link record | Conditional on *(a1+1102) |
| 64 | 0x40 | External reference (from ctx+2120) | Pointer stored, no data copy |
| 68 | 0x44 | Performance counter section | *(a1+932) counter |
| 70 | 0x46 | Spill/fill metadata | v7[408] |
| 71 | 0x47 | Call graph edge table | From v7+61, linked list traversal |
| 73 | 0x49 | Codegen context snapshot | From ctx+932 register allocation state |
| 80 | 0x50 | Hash table section | v7+207/208, hash bucket traversal |
| 81 | 0x51 | Extended call info | From v7+84 |
| 83 | 0x53 | Convergence scope data | via sub_17308C0 |
| 85 | 0x55 | Register geometry record (banks, warps, lanes) | From ctx+1600, writes bank/warp/lane counts |
| 88 | 0x58 | Extended scheduling annotations | Conditional on *(a1+1088) > 0 |
Extended types (0x1208--0x1221): Emitted only when *(char*)(ctx+1412) < 0, which enables the full post-register-allocation diagnostic mode. These 16 types carry per-register-class live range and operand definition data:
| Tag | Hex | Content |
|---|---|---|
| 4616 | 0x1208 | Extended operand class 0 |
| 4617--4623 | 0x1209--0x120F | Extended operand classes 1--7 |
| 4624 | 0x1210 | Block-level operand summary |
| 4625 | 0x1211 | Live-in vector (12 bytes/element, count at *(a1+668)) |
| 4626 | 0x1212 | Live-out vector (12 bytes/element) |
| 4627 | 0x1213 | Extended operand class 8 |
| 4628--4629 | 0x1214--0x1215 | Extended operand classes 9--10 |
| 4630 | 0x1216 | Memory space descriptor (SM arch > 0x4FFF) |
| 4631 | 0x1217 | Extended scheduling flag (SM arch > 0x4FFF) |
| 4632 | 0x1218 | Instruction hash (ctx+1386 bit 3) |
| 4633 | 0x1219 | Annotation metadata |
| 4640 | 0x1220 | Extended section metadata |
| 4641 | 0x1221 | Optimization level record (from knob system, knob 988) |
The _ORI_ Name Prefix
The _ORI_ string is not a pass name. At line 1762 the serializer iterates the linked list at v7+55 (the original-definition chain maintained for rematerialization debugging) and for each entry creates a string "_ORI_<original_name>":
// Line 1748-1770 (simplified)
for (def = v7->original_defs; def; def = def->next) {
entry = §ion_table[16 * (state->instr_offset + idx)];
entry->type_tag = 34; // original-definition name
entry->data_ptr = cursor;
strcpy(cursor, "_ORI_");
strcpy(cursor + 5, def->name);
cursor += align16(strlen(def->name) + 21);
}
These names are consumed by the register allocation verifier (sub_A55D80) when it compares pre- and post-allocation reaching definitions. A mismatch triggers the "REMATERIALIZATION PROBLEM" diagnostic (string at 0xa55dd8), which lists original definitions under their _ORI_ names alongside the post-allocation state.
Wrapper: sub_A53840
sub_A53840 (48 lines) is a thin wrapper that:
- Emits a type-44 header record if *(ctx+1600)[1193] is set (scheduling metadata header)
- Calls sub_A50650 with the output buffer
- Optionally emits a type-62 trailer record if *(ctx+1600)[48] is set
This wrapper is the typical entry point reached through vtable dispatch.
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_A3B080 | ~700 B | multiple | Code Object constructor |
| sub_A3A7E0 | ~700 B | 1 | Stats emitter (per-function profile) |
| sub_A4B8F0 | ~250 B | 1 | Register count / annotation writer |
| sub_A50650 | ~74 KB | 8 | ORI Record Serializer (CodeObject::EmitRecords) |
| sub_A53840 | ~400 B | 1 | EmitRecords wrapper (adds type-44 header) |
| sub_424070 | 2,098 B | 3,809 | Pool allocator (alloc) |
| sub_4248B0 | 923 B | 1,215 | Pool deallocator (free) |
| sub_424C50 | 488 B | 27 | Pool reallocator (realloc) |
| sub_426150 | ~1.2 KB | 2,800 | Hash map insert |
| sub_426D60 | 345 B | 422 | Hash map lookup |
| sub_426EC0 | 349 B | 29 | Hash map contains |
| sub_425CA0 | 114 B | 127 | Hash map constructor |
| sub_425D20 | 121 B | 63 | Hash map destructor |
| sub_42CA60 | 81 B | 298 | Linked list prepend |
| sub_42CC30 | 34 B | 48 | Linked list length |
| sub_427630 | 273 B | 73 | MurmurHash3 string hash |
| sub_621480 | 21 KB | low | Symbol table builder |
| sub_625800 | 27 KB | low | Symbol resolution |
| sub_6BC560 | 4.9 KB | low | Constant bank handler |
| sub_6294E0 | 12.1 KB | low | Reserved shared memory management |
| sub_6C4DA0 | 15 KB | low | Shared memory intrinsic lowering |
| sub_7FD6C0 | ~800 B | 3 | ELF symbol emitter |
| sub_7FB6C0 | ~800 B | 1 | Pipeline orchestrator (context cleanup) |
| sub_7FBB70 | ~100 B | 1 | Per-kernel entry point |
| sub_663C30 | ~300 B | 1 | Compilation loop body |
| sub_662920 | varies | 1 | Global initialization (calls KnobsInit) |
Related Pages
- Ori IR Overview -- top-level IR design, Code Object field summary
- Instructions -- detailed instruction format and encoding
- CFG -- FNV-1a hash map CFG implementation
- Registers -- register descriptor layout
- Phase Manager -- PhaseManager object layout, phase dispatch
- Memory Pool Allocator -- full allocator internals
- Hash Tables & Bitvectors -- hash map and bitvector details
- Knobs System -- knob descriptors, value types, ROT13 encoding
- Entry Point & CLI -- compilation driver, options block
Pass Inventory & Ordering
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- phase skipping is handled per-phase via isNoOp() virtual method overrides, not by reordering the table. This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.
All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |
Phase Categories
Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.
| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |
Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.
Numbering Discrepancy
Warning: The phase numbers 0--138 on this page use a compressed numbering scheme established before the full 159-entry name table was discovered (P2-14). The true static name table at off_22BD0C0 contains 159 entries indexed 0--158, and 16 of the 20 newly-discovered names occupy indices within the 0--138 range. In the true table, these 16 entries sit at their listed indices and all subsequent phases shift up, so the wiki's compressed numbering diverges from the true binary indices starting around phase 8. Phases 139--158 are correctly numbered (they match the true static table indices). A full renumbering of phases 0--138 to match the true binary indices is deferred as a separate task because it would affect cross-references across 40+ wiki pages.
The 16 omitted name table entries (with their true static table indices) are:
| True Index | Name | Category | Relationship to Wiki |
|---|---|---|---|
| 22 | OriCopyProp | Optimization | Sub-pass within all 6 GeneralOptimize bundles; also injected into Mercury pipeline |
| 32 | OptimizeNaNOrZero | Optimization | Standalone NaN/zero folding pass; not documented under current wiki numbering |
| 37 | ConvertMemoryToRegisterOrUniform | Optimization | Sub-pass of GeneralOptimizeMid; gated by knob 487; sub_910840 |
| 41 | Vectorization | Optimization | Load/store vectorization; gated by DisableReadVectorization/DisableWriteVectorization knobs |
| 57 | OriCommoning | Optimization | Commoning sub-pass; related to LateOriCommoning (wiki phase 64) |
| 69 | OriSimpleLiveDead | Optimization | Liveness/DCE sub-pass; related to EarlyOriSimpleLiveDead (wiki phase 10) |
| 73 | LateVectorization | Optimization | Late vectorization (2nd instance, after optimization exposes new opportunities) |
| 77 | SinkCodeIntoBlock | Optimization | Code sinking; sub_78DB70; DisablePhases=SinkCodeIntoBlock gate |
| 103 | LateEnforceArgumentRestrictions | Lowering | Late counterpart to EnforceArgumentRestrictions (wiki phase 48) |
| 114 | ScheduleInstructions | Scheduling | Worker for AdvancedPhasePreSched; sub_8D0640 (22 KB) |
| 115 | UpdateAfterScheduleInstructions | Cleanup | IR metadata refresh after scheduling completes |
| 118 | UpdateAfterOriDoSyncronization | Cleanup | IR metadata refresh after sync insertion (wiki phase 99) |
| 120 | ReportBeforeRegisterAllocation | Reporting | DUMPIR target; diagnostic dump before register allocation |
| 122 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg; canonical allocator entry |
| 124 | UpdateAfterOriAllocateRegisters | Cleanup | IR metadata refresh after register allocation |
| 127 | PostExpansion | Lowering | Worker for AdvancedPhasePostExpansion; post-RA expansion |
All 16 are valid DUMPIR targets (resolvable through sub_C641D0 binary search over the phase name table). Several are also valid DisablePhases targets.
Gate Passes (AdvancedPhase)
Seventeen phases are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.
Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.
The naming convention is consistent: the AdvancedPhase prefix followed by the pipeline position or action name. The one exception is AdvancedScoreboardsAndOpexes (phase 115), which uses the Advanced prefix without Phase.
Gate Pass Worker Correspondence
Several gate passes dispatch to named worker functions when activated by a backend. The worker names appear in the static name table and are valid DUMPIR/NamedPhases targets:
| Gate Pass (Wiki #) | Worker Function (True Table Index) | Evidence |
|---|---|---|
| AdvancedPhasePreSched (97) | ScheduleInstructions [114] | sub_8D0640, string "ScheduleInstructions" |
| AdvancedPhaseAllocReg (101) | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvancedPhasePostExpansion (104) | PostExpansion [127] | Post-RA expansion dispatch |
| AdvancedPhasePostFixUp (111) | PostFixUp [140] | Target vtable+0x148 dispatch |
See Optimization Levels for per-gate activation rules.
Update Passes
Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):
| Phase | Name | Refreshes |
|---|---|---|
| 76 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after late unsupported-op expansion |
| 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |
These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.
Report Passes
Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):
| Phase | Name | Output |
|---|---|---|
| 9 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |
Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).
GeneralOptimize Bundles
The GeneralOptimize* passes are compound optimization bundles that run several small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until a full sweep produces no changes. They appear at six positions in the pipeline, re-cleaning the IR after each group of major transformations:
| Phase | Name | Position |
|---|---|---|
| 13 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | GeneralOptimize | After early loop/branch optimizations |
| 37 | GeneralOptimizeMid | After mid-level transformations |
| 46 | GeneralOptimizeMid2 | After VTG/CTA/mbarrier expansion |
| 58 | GeneralOptimizeLate | After late expansion |
| 65 | GeneralOptimizeLate2 | After predication and late commoning |
See GeneralOptimize Bundles for the sub-pass decomposition.
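The fixed-point structure of these bundles can be sketched with trivial stand-in sub-passes. Everything below is illustrative: the two sub-passes are toys, and only the driver shape (sweep all sub-passes, repeat while any reports a change) reflects the recovered behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using SubPass = std::function<bool(std::vector<int>&)>;  // returns "changed?"

// Stand-in "algebraic simplify": halve every value greater than 1.
static bool halveLarge(std::vector<int>& ir) {
    bool changed = false;
    for (int& v : ir)
        if (v > 1) { v /= 2; changed = true; }
    return changed;
}

// Stand-in "DCE": remove zero-valued entries.
static bool dropZeros(std::vector<int>& ir) {
    size_t before = ir.size();
    ir.erase(std::remove(ir.begin(), ir.end(), 0), ir.end());
    return ir.size() != before;
}

// Fixed-point driver: sweep all sub-passes until one full sweep makes
// no change. Returns the sweep count (the last sweep proves quiescence).
int runToFixedPoint(std::vector<int>& ir, const std::vector<SubPass>& passes) {
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& p : passes)
            changed |= p(ir);
        ++sweeps;
    }
    return sweeps;
}
```

Running one sub-pass can expose work for another (here, simplification eventually produces values DCE could remove), which is why the bundle iterates rather than running each sub-pass once.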
O-Level Gating
Twenty-two phases have confirmed optimization-level gates. The O-Level column in the stage tables below annotates every phase whose activation threshold has been verified from decompiled isNoOp() methods or execute-function guards. Phases without an O-Level annotation run at all optimization levels (O0--O5). Threshold notation: > N means the phase requires opt_level > N; == 0 means the phase is active only at O0.
See Optimization Levels for the complete per-phase activation table, the O-level accessor (sub_7DDB50), and the NvOpt recipe system.
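A minimal model of the threshold checks: each gated phase reads the compilation context's opt level through one accessor (the role recovered for sub_7DDB50) and compares it against a hardcoded threshold. The `Context` layout and function names here are ours.

```cpp
#include <cassert>

struct Context { int opt_level; };                 // illustrative layout
int getOptLevel(const Context& ctx) { return ctx.opt_level; }

// "> 0": e.g. DoSwitchOptFirst (phase 14)
bool switchOptActive(const Context& c)      { return getOptLevel(c) > 0; }
// "> 1": e.g. GvnCse (phase 49)
bool gvnCseActive(const Context& c)         { return getOptLevel(c) > 1; }
// "> 2": e.g. LateExpandSyncInstructions (phase 72)
bool lateExpandSyncActive(const Context& c) { return getOptLevel(c) > 2; }
// "== 0": e.g. ProcessO0WaitsAndSBs (phase 116)
bool o0ScoreboardsActive(const Context& c)  { return getOptLevel(c) == 0; }
```

At -O1, for example, the first gate opens while the second and third stay closed, matching the per-phase thresholds in the tables below.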
Complete 159-Phase Table
Stage 1 -- Initial Setup (Phases 0--13)
Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | Validation | Validates structural correctness of the initial Ori IR after PTX lowering | ||
| 1 | ApplyNvOptRecipes | Optimization | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | ||
| 2 | PromoteFP16 | Lowering | Promotes FP16 operations to FP32 where hardware lacks native support | ||
| 3 | AnalyzeControlFlow | Analysis | Builds the CFG: identifies loops, dominators, back edges | ||
| 4 | AdvancedPhaseBeforeConvUnSup | Gate | Hook before unsupported-op conversion; no-op by default | ||
| 5 | ConvertUnsupportedOps | Lowering | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization | |
| 6 | SetControlFlowOpLastInBB | Cleanup | Ensures control flow instructions are the final instruction in each basic block | ||
| 7 | AdvancedPhaseAfterConvUnSup | Gate | Hook after unsupported-op conversion; no-op by default | ||
| 8 | OriCreateMacroInsts | Lowering | Expands PTX-level macro instructions into Ori instruction sequences | ||
| 9 | ReportInitialRepresentation | Reporting | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | ||
| 10 | EarlyOriSimpleLiveDead | Optimization | Quick early dead code elimination pass | Liveness | |
| 11 | ReplaceUniformsWithImm | Optimization | Replaces uniform register reads with immediate constants where value is known | Uniform Regs | |
| 12 | OriSanitize | Validation | Validates IR consistency after initial setup transformations | ||
| 13 | GeneralOptimizeEarly | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |
Stage 2 -- Early Optimization (Phases 14--32)
Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 14 | DoSwitchOptFirst | Optimization | > 0 | Optimizes switch statements: jump table generation, case clustering (1st pass) | Branch & Switch |
| 15 | OriBranchOpt | Optimization | > 0 | Branch folding, unreachable block elimination, conditional branch simplification | Branch & Switch |
| 16 | OriPerformLiveDeadFirst | Analysis | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness | |
| 17 | OptimizeBindlessHeaderLoads | Optimization | Hoists and deduplicates bindless texture header loads | ||
| 18 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | OriSplitLiveRanges | Optimization | Splits live ranges at loop boundaries to reduce register pressure | Liveness | |
| 20 | PerformPGO | Optimization | Applies profile-guided optimization data (block weights, branch probabilities) | ||
| 21 | OriStrengthReduce | Optimization | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction | |
| 22 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | GenerateMovPhi | Lowering | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | ||
| 24 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | StageAndFence | Lowering | Inserts memory fence and staging instructions for coherence | Sync & Barriers | |
| 26 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | AnalyzeUniformsForSpeculation | Analysis | Identifies uniform values safe for speculative execution | Uniform Regs | |
| 28 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | GeneralOptimize | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize | |
| 30 | DoSwitchOptSecond | Optimization | > 0 | Second switch optimization pass after loop/branch transformations | Branch & Switch |
| 31 | OriLinearReplacement | Optimization | Replaces branch-heavy patterns with linear (branchless) sequences | ||
| 32 | CompactLocalMemory | Optimization | Compacts local memory allocations by eliminating dead slots and reordering |
Stage 3 -- Mid-Level Optimization (Phases 33--52)
GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 33 | OriPerformLiveDeadSecond | Analysis | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness | |
| 34 | ExtractShaderConstsFirst | Optimization | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | ||
| 35 | OriHoistInvariantsEarly | Optimization | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes | |
| 36 | EmitPSI | Lowering | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | ||
| 37 | GeneralOptimizeMid | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize | |
| 38 | OptimizeNestedCondBranches | Optimization | > 0 | Simplifies nested conditional branches into flatter control flow | Branch & Switch |
| 39 | ConvertVTGReadWrite | Lowering | Converts vertex/tessellation/geometry shader read/write operations | ||
| 40 | DoVirtualCTAExpansion | Lowering | Expands virtual CTA operations into physical CTA primitives | ||
| 41 | MarkAdditionalColdBlocks | Analysis | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold | |
| 42 | ExpandMbarrier | Lowering | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers | |
| 43 | ForwardProgress | Lowering | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | ||
| 44 | OptimizeUniformAtomic | Optimization | Converts thread-uniform atomic operations into warp-level reductions | ||
| 45 | MidExpansion | Lowering | Target-dependent mid-level expansion of operations before register allocation | Late Legalization | |
| 46 | GeneralOptimizeMid2 | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize | |
| 47 | AdvancedPhaseEarlyEnforceArgs | Gate | Hook before argument enforcement; no-op by default | ||
| 48 | EnforceArgumentRestrictions | Lowering | Enforces ABI restrictions on function arguments (register classes, alignment) | ||
| 49 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | OriReassociateAndCommon | Optimization | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE | |
| 51 | ExtractShaderConstsFinal | Optimization | Final shader constant extraction pass (after GVN may expose new constants) | ||
| 52 | OriReplaceEquivMultiDefMov | Optimization | Eliminates redundant multi-definition move instructions with equivalent sources |
Stage 4 -- Late Optimization (Phases 53--77)
Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 53 | OriPropagateVaryingFirst | Optimization | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | ||
| 54 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | LateExpansion | Lowering | Expands operations that must be lowered after high-level optimizations | Late Legalization | |
| 56 | SpeculativeHoistComInsts | Optimization | Speculatively hoists common instructions above branches | ||
| 57 | RemoveASTToDefaultValues | Cleanup | Removes AST (address space type) annotations that have been lowered to defaults | ||
| 58 | GeneralOptimizeLate | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize | |
| 59 | OriLoopFusion | Optimization | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes | |
| 60 | DoVTGMultiViewExpansion | Lowering | Expands multi-view operations for vertex/tessellation/geometry shaders | ||
| 61 | OriPerformLiveDeadThird | Analysis | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness | |
| 62 | OriRemoveRedundantMultiDefMov | Optimization | Removes dead multi-definition move instructions | ||
| 63 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | LateOriCommoning | Optimization | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE | |
| 65 | GeneralOptimizeLate2 | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize | |
| 66 | OriHoistInvariantsLate | Optimization | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes | |
| 67 | DoKillMovement | Optimization | Moves kill annotations closer to last use to improve register pressure | ||
| 68 | DoTexMovement | Optimization | Moves texture fetch instructions to minimize latency exposure | ||
| 69 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | OriPropagateVaryingSecond | Optimization | Propagates varying annotations (2nd pass, after predication changes control flow) | ||
| 71 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | ConvertAllMovPhiToMov | Lowering | Destroys SSA form: converts MOV.PHI instructions into plain MOV | ||
| 74 | ConvertToUniformReg | Optimization | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs | |
| 75 | LateArchOptimizeFirst | Optimization | Architecture-specific late optimizations (1st pass) | ||
| 76 | UpdateAfterOptimize | Cleanup | Rebuilds IR metadata invalidated by the late optimization group | ||
| 77 | AdvancedPhaseLateConvUnSup | Gate | Hook at the late unsupported-op boundary; no-op by default |
Stage 5 -- Legalization (Phases 78--96)
Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 78 | LateExpansionUnsupportedOps | Lowering | Expands remaining unsupported operations after all optimizations | Late Legalization | |
| 79 | OriHoistInvariantsLate2 | Optimization | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes | |
| 80 | ExpandJmxComputation | Lowering | Expands JMX (jump with index computation) pseudo-instructions | ||
| 81 | LateArchOptimizeSecond | Optimization | Architecture-specific late optimizations (2nd pass) | ||
| 82 | AdvancedPhaseBackPropVReg | Gate | Hook before backward copy propagation; no-op by default | ||
| 83 | OriBackCopyPropagate | Optimization | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE | |
| 84 | OriPerformLiveDeadFourth | Analysis | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness | |
| 85 | OriPropagateGmma | Optimization | Propagates WGMMA accumulator values through the IR | GMMA Pipeline | |
| 86 | InsertPseudoUseDefForConvUR | Lowering | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs | |
| 87 | FixupGmmaSequence | Lowering | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline | |
| 88 | OriHoistInvariantsLate3 | Optimization | LICM (late 3rd pass) after GMMA fixup | Loop Passes | |
| 89 | AdvancedPhaseSetRegAttr | Gate | Hook before register attribute setting; no-op by default | ||
| 90 | OriSetRegisterAttr | Analysis | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling | |
| 91 | OriCalcDependantTex | Analysis | Computes texture instruction dependencies for scheduling | ||
| 92 | AdvancedPhaseAfterSetRegAttr | Gate | Hook after register attribute setting; no-op by default | ||
| 93 | LateExpansionUnsupportedOps2 | Lowering | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization | |
| 94 | FinalInspectionPass | Validation | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | ||
| 95 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | ReportBeforeScheduling | Reporting | Dumps IR before scheduling (no-op unless diagnostic options enabled) |
Stage 6 -- Scheduling & Register Allocation (Phases 97--103)
Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 97 | AdvancedPhasePreSched | Gate | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling | |
| 98 | BackPropagateVEC2D | Optimization | Backward-propagates 2D vector register assignments | ||
| 99 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | AdvancedPhaseAllocReg | Gate | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture | |
| 102 | ReportAfterRegisterAllocation | Reporting | Dumps IR after register allocation (no-op unless diagnostic options enabled) | ||
| 103 | Get64bRegComponents | RegAlloc | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |
Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)
Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 104 | AdvancedPhasePostExpansion | Gate | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | ||
| 105 | ApplyPostRegAllocWars | RegAlloc | Fixes write-after-read hazards exposed by register allocation | ||
| 106 | AdvancedPhasePostSched | Gate | Hook after post-scheduling; no-op by default | ||
| 107 | OriRemoveNopCode | Cleanup | Removes NOP instructions and dead code inserted as placeholders | ||
| 108 | OptimizeHotColdInLoop | Optimization | Separates hot and cold paths within loops for cache locality | Hot/Cold | |
| 109 | OptimizeHotColdFlow | Optimization | Separates hot and cold paths at the function level | Hot/Cold | |
| 110 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | AdvancedPhasePostFixUp | Gate | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | ||
| 112 | PlaceBlocksInSourceOrder | Cleanup | Determines final basic block layout in the emitted binary | ||
| 113 | PostFixForMercTargets | Encoding | Fixes up instructions for Mercury encoding requirements | Mercury | |
| 114 | FixUpTexDepBarAndSync | Scheduling | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards | |
| 115 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |
Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
Stage 8 -- Mercury Backend (Phases 117--122)
SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 117 | MercEncodeAndDecode | Encoding | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury | |
| 118 | MercExpandInstructions | Encoding | Expands pseudo-instructions into final SASS instruction sequences | Mercury | |
| 119 | MercGenerateWARs1 | Encoding | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury | |
| 120 | MercGenerateOpex | Encoding | Generates "opex" (operation extension) annotations for each instruction | Mercury | |
| 121 | MercGenerateWARs2 | Encoding | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury | |
| 122 | MercGenerateSassUCode | Encoding | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |
"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121) because instruction expansion in phase 118 can introduce new write-after-read hazards. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).
Stage 9 -- Post-Mercury (Phases 123--131)
Register map computation, diagnostics, debug output.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 123 | ComputeVCallRegUse | RegAlloc | Computes register usage for virtual call sites | ||
| 124 | CalcRegisterMap | RegAlloc | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture | |
| 125 | UpdateAfterPostRegAlloc | Cleanup | Rebuilds IR metadata after post-RA processing | ||
| 126 | ReportFinalMemoryUsage | Reporting | Prints memory pool consumption summary to stderr | ||
| 127 | AdvancedPhaseOriPhaseEncoding | Gate | Phase encoding hook; no-op by default | ||
| 128 | UpdateAfterFormatCodeList | Cleanup | Rebuilds the code list after Mercury encoding reformats instructions | ||
| 129 | DumpNVuCodeText | Reporting | Dumps human-readable SASS text disassembly | ||
| 130 | DumpNVuCodeHex | Reporting | Dumps raw SASS binary as hex | ||
| 131 | DebuggerBreak | Cleanup | Development hook: triggers a debugger breakpoint at this pipeline position |
Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)
Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Cleanup | Rebuilds IR metadata after late unsupported-op conversion | ||
| 133 | MergeEquivalentConditionalFlow | Optimization | Merges basic blocks with equivalent conditional flow (tail merging) | ||
| 134 | AdvancedPhaseAfterMidExpansion | Gate | Hook after mid-level expansion; no-op by default | ||
| 135 | AdvancedPhaseLateExpandSyncInstructions | Gate | Hook for late sync instruction expansion; no-op by default | ||
| 136 | LateMergeEquivalentConditionalFlow | Optimization | Second conditional flow merge pass (catches cases exposed by late transforms) | ||
| 137 | LateExpansionUnsupportedOpsMid | Lowering | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization | |
| 138 | OriSplitHighPressureLiveRanges | RegAlloc | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture | |
| 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0; inserts maximum wait counts at every hazard | Scoreboards |
| 140 | PostFixUp | Cleanup | Target-specific post-fixup dispatch (calls target vtable+0x148) | ||
| 141 | MercConverter | Encoding | Initial Mercury conversion: translates Ori instructions to Mercury format (sub_9F3760) | Mercury | |
| 142 | MercEncodeAndDecode | Encoding | Encode/decode round-trip verification of SASS binary encoding (sub_18F21F0) | Mercury | |
| 143 | MercExpandInstructions | Encoding | Expands Mercury pseudo-instructions into final SASS sequences; gated by ctx+0x570 bit 5 | Mercury | |
| 144 | MercGenerateWARs1 | Encoding | WAR hazard annotation (1st pass, pre-expansion); gated by ctx+0x570 sign bit | Mercury | |
| 145 | MercGenerateOpex | Encoding | Generates operation extension annotations per instruction; gated by ctx+0x570 bit 6 | Mercury | |
| 146 | MercGenerateWARs2 | Encoding | WAR hazard annotation (2nd pass, covers hazards from expansion in phase 143) | Mercury | |
| 147 | MercGenerateSassUCode | Encoding | Final SASS microcode emission: produces the binary bytes for the ELF; gated by ctx+0x571 bit 0 | Mercury | |
| 148 | ComputeVCallRegUse | RegAlloc | Computes register usage for virtual call sites (EIATTR metadata for indirect calls) | ||
| 149 | CalcRegisterMap | RegAlloc | Computes the final physical-to-logical register mapping; gated by ctx+0x590 bit 1 | RegAlloc Architecture | |
| 150 | UpdateAfterPostRegAlloc | Cleanup | Rebuilds IR metadata after post-RA processing (no-op by default, isNoOp=1) | ||
| 151 | ReportFinalMemoryUsage | Reporting | Prints memory pool consumption summary (no-op by default, isNoOp=1) | ||
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | Phase encoding gate; when active, sets ctx+0x610 (pipeline_progress) = 0x15 (21) to mark encoding boundary | ||
| 153 | FormatCodeList | Encoding | Formats the instruction list for ELF output; dispatches through ctx+0x648 vtable+0x10 | Mercury | |
| 154 | UpdateAfterFormatCodeList | Cleanup | Rebuilds IR data structures after FormatCodeList reformats instructions (no-op by default, isNoOp=1) | ||
| 155 | DumpNVuCodeText | Reporting | Dumps human-readable SASS text disassembly; guarded by ctx+0x598 > 0 and ctx+0x740 non-null | ||
| 156 | DumpNVuCodeHex | Reporting | Dumps raw SASS binary as hex; same guard as phase 155 | ||
| 157 | DebuggerBreak | Cleanup | Development hook: convenient breakpoint location for pipeline debugging (empty body in release) | ||
| 158 | NOP | Cleanup | Terminal no-op sentinel; final phase in the 159-phase pipeline |
Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride). All 20 have names in the static table at off_22BD0C0 (159 entries, not 139). The vtable slot at +16 is isNoOp() (returns 0 for active phases, 1 for phases skipped by default); name resolution goes through the static table indexed by getIndex() at +8.
The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing backends to selectively enable/disable encoding passes. WAR generation runs in two passes (144, 146) bracketing instruction expansion (143) because expansion can introduce new write-after-read hazards.
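The flag-bit gating can be modeled as bit tests on two context bytes. Offsets and bit positions follow the table above; the `Ctx` struct is a stand-in for the real (much larger) context, and reading the "sign bit" of ctx+0x570 as bit 7 of a byte-sized flag is our assumption.

```cpp
#include <cassert>
#include <cstdint>

// Model context: one flag byte at +0x570 and one at +0x571.
struct Ctx { uint8_t flags570; uint8_t flags571; };

bool mercExpandActive(const Ctx& c) { return (c.flags570 >> 5) & 1; } // phase 143
bool mercWARs1Active(const Ctx& c)  { return (c.flags570 >> 7) & 1; } // phase 144
bool mercOpexActive(const Ctx& c)   { return (c.flags570 >> 6) & 1; } // phase 145
bool mercUCodeActive(const Ctx& c)  { return (c.flags571 >> 0) & 1; } // phase 147
```

Because each encoding phase tests its own bit, a backend can, for example, emit SASS text without producing final microcode by setting some flags and not others.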
Pipeline Ordering Notes
Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.
Identity ordering. The default ordering table at 0x22BEEA0 is an identity mapping: exec[N] = factory[N] for all 159 phases. The factory index IS the execution order. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below). In the true 159-phase table, phases execute in strict index order 0--158.
Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:
| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |
Cross-References
- Optimization Pipeline -- pipeline infrastructure, PhaseManager data structures, dispatch loop
- Phase Manager Infrastructure -- PhaseManager object layout, constructor, destructor, factory switch
- GeneralOptimize Bundles -- sub-pass decomposition of compound optimization passes
- Branch & Switch Optimization -- phases 14, 15, 30, 38
- Loop Passes -- phases 18, 22, 24, 35, 59, 66, 79, 88
- Strength Reduction -- phase 21
- Copy Propagation & CSE -- phases 49, 50, 64, 83
- Predication -- phase 63
- Rematerialization -- phases 28, 54, 69
- Liveness Analysis -- phases 10, 16, 19, 33, 61, 84
- Synchronization & Barriers -- phases 25, 26, 42, 71, 72, 99, 100, 114
- Hot/Cold Partitioning -- phases 41, 108, 109
- GMMA/WGMMA Pipeline -- phases 85, 87
- Uniform Register Optimization -- phases 11, 27, 74, 86
- Late Expansion & Legalization -- phases 5, 45, 55, 78, 93, 137
- Register Allocator Architecture -- phases 101, 103, 105, 123, 124, 138, 148, 149
- Scheduler Architecture -- phases 90, 97--100, 110
- Scoreboards & Dependency Barriers -- phases 114, 115, 116
- Mercury Encoder -- phases 113, 117--122, 141--147, 153
- Optimization Levels -- O-level gating of passes
- DUMPIR & NamedPhases -- user-specified phase targeting and reordering
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
sub_7DDB50 | 232B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
sub_A36360 | 52KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
sub_A23CF0 | 54KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
sub_9F1A90 | 35KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |
Phase Manager Infrastructure
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The PhaseManager is the central orchestration layer in ptxas. It owns the entire 159-phase optimization and code generation pipeline, constructs each phase as a polymorphic object via an abstract factory, and drives execution through a virtual dispatch loop. Every compilation unit passes through the same PhaseManager sequence: construct all 159 phase objects, iterate the phase index array calling execute() on each, optionally collect per-phase timing and memory statistics, then tear down. The PhaseManager also hosts an optional NvOptRecipe sub-manager (440 bytes) for architecture-specific "advanced phase" hooks that inject additional processing at 16 defined points in the pipeline.
The design is a textbook Strategy + Abstract Factory pattern: a 159-case switch statement maps phase indices to vtable pointers, each vtable provides execute(), isNoOp(), and getName() virtual methods, and the dispatch loop iterates a flat index array that defines execution order. This makes the pipeline fully data-driven -- reordering, disabling, or injecting phases requires only modifying the index array, not the dispatch logic.
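As a sketch of this data-driven structure (all class and function names here are illustrative, not recovered symbols): a factory switch maps indices to polymorphic phase objects, and the index array alone determines execution order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal model of the data-driven pipeline: phases are polymorphic objects
// built by a factory switch; execution order is just an index array.
struct Phase { virtual void execute(std::string& log) = 0; virtual ~Phase() {} };
struct PhaseA : Phase { void execute(std::string& log) override { log += "A"; } };
struct PhaseB : Phase { void execute(std::string& log) override { log += "B"; } };

// Stands in for the 159-case factory switch (sub_C60D30).
Phase* factory(int idx) {
    switch (idx) { case 0: return new PhaseA; case 1: return new PhaseB; default: return nullptr; }
}

// Stands in for the dispatch loop (sub_C64F70): iterate the order array.
std::string run_pipeline(const std::vector<int>& order) {
    std::string log;
    for (int idx : order) { Phase* p = factory(idx); p->execute(log); delete p; }
    return log;
}
```

Reordering or disabling phases means editing only the vector passed to run_pipeline, never the loop itself -- exactly the property the identity table at 0x22BEEA0 exploits.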
| Core range | 0xC60000--0xC66000 (13 functions, ~17.5 KB) |
| Constructor | sub_C62720 (4,734 bytes) |
| Destructor | sub_C61B20 (1,753 bytes) |
| Phase factory | sub_C60D30 (3,554 bytes, 159-case switch) |
| Dispatch loop | sub_C64F70 (1,455 bytes) |
| Name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Timing reporter | sub_C64310 (3,168 bytes) |
| Pool reporter | sub_C62200 (888 bytes) |
| Total phases | 159 (139 explicitly named + 20 arch-specific) |
| AdvancedPhase hooks | 16 no-op-by-default insertion points |
| Default phase table | Static array at 0x22BEEA0 (returned by sub_C60D20) |
| Phase name table | Static array at off_22BD0C0 (159 string pointers) |
| Vtable range | off_22BD5C8 (phase 0) through off_22BEE78 (phase 158) |
| Callers | sub_7FB6C0 (main compilation driver), sub_9F63D0 (library/ftrace entry) |
PhaseManager Object Layout
The PhaseManager is a plain C++ object (no vtable of its own) allocated by the compilation driver. Minimum size is 112 bytes, though the full extent depends on whether timing and NvOptRecipe are enabled.
PhaseManager (112+ bytes)
+0 int64 compilation_unit // back-pointer to owning compilation unit
+8 int64* allocator // pool allocator (from compilation_unit->field_16)
+16 void* sorted_name_table // sorted {name_ptr, index} pairs for binary search
+24 int sorted_name_count
+28 int sorted_name_capacity
+32 int64* allocator2 // copy of allocator (for phase list ops)
+40 void* phase_list // array of 16-byte {phase_ptr, pool_ptr} pairs
+48 int phase_list_count // always 159 after construction
+52 int phase_list_capacity
+56 int64 nvopt_recipe_ptr // NvOptRecipe sub-manager, or NULL
+64 int64 (reserved)
+72 bool timing_enabled // set from compilation_unit->config->options[17928]
+76 int (flags/padding)
+80 bool flag_byte // initialized to 1, reset after first timing report
+88 int64* timing_allocator
+96 void* phase_name_raw_table // 159 name string pointers, copied from off_22BD0C0
+104 int phase_name_raw_count
+108 int phase_name_raw_capacity
The two allocator fields (+8 and +32) both point to the same pool allocator extracted from the compilation unit, but are used in different contexts: +8 for name table operations, +32 for phase list operations.
Phase Object Model
Each phase is a 16-byte polymorphic object:
Phase (16 bytes)
+0 vtable* // points to one of 159 vtable instances
+8 void* // pool pointer (memory pool for phase-local allocations)
The vtable provides the interface contract:
| Vtable offset | Method | Signature |
|---|---|---|
+0 | execute | void execute(phase*, compilation_context*) |
+8 | isNoOp | bool isNoOp(phase*) -- returns true to skip execution |
+16 | getName | int getName(phase*) -- returns index into name table |
+24 | alloc | void* alloc(pool*, size_t) -- pool allocator |
+32 | free | void free(pool*, void*) -- pool deallocator |
The vtable addresses span off_22BD5C8 (phase 0) through off_22BEE78 (phase 158), with a stride of 0x28 (40 bytes) between consecutive entries. All vtables reside in .data.rel.ro.
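A hypothetical C++ view of the recovered layout -- the struct and member names are ours; only the sizes and offsets come from the binary. Note that five 8-byte slots give exactly the observed 0x28 vtable stride:

```cpp
#include <cassert>
#include <cstddef>

struct PhaseObject;
struct Pool;
struct CompilationContext;

// 5-slot vtable, matching the interface contract table above.
struct PhaseVtable {
    void  (*execute)(PhaseObject*, CompilationContext*); // vtable +0
    bool  (*isNoOp)(PhaseObject*);                       // vtable +8
    int   (*getName)(PhaseObject*);                      // vtable +16
    void* (*alloc)(Pool*, std::size_t);                  // vtable +24
    void  (*release)(Pool*, void*);                      // vtable +32
};

// 16-byte phase object: vtable pointer + pool pointer.
struct PhaseObject {
    const PhaseVtable* vtable; // +0: one of 159 vtables in .data.rel.ro
    Pool*              pool;   // +8: pool for phase-local allocations
};
```

On x86-64, sizeof(PhaseVtable) is 40 bytes (0x28), consistent with the stride between off_22BD5C8 and off_22BEE78 entries.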
Phase Factory -- sub_C60D30
The factory is a 159-case switch statement that serves as the sole point of phase instantiation. For each case:
- Extracts the pool allocator from context->field_16
- Allocates 16 bytes via pool_alloc (vtable offset +24)
- Writes the case-specific vtable pointer at offset +0
- Returns a {phase_ptr, pool_ptr} pair
The default case returns {NULL, NULL}, which the caller treats as an invalid phase index.
// Pseudocode for sub_C60D30
pair<phase*, pool*> PhaseFactory(int phase_index, context* ctx) {
pool* p = ctx->allocator;
phase* obj = p->alloc(16);
switch (phase_index) {
case 0: obj->vtable = off_22BD5C8; break; // OriCheckInitialProgram
case 1: obj->vtable = off_22BD5F0; break; // ApplyNvOptRecipes
case 2: obj->vtable = off_22BD618; break; // PromoteFP16
// ... 156 more cases ...
case 158: obj->vtable = off_22BEE78; break; // sentinel/NOP
default: return {NULL, NULL};
}
return {obj, p};
}
Called exclusively by the constructor (sub_C62720).
Construction Sequence -- sub_C62720
The constructor performs 11 steps, building all internal data structures and instantiating every phase:
// Pseudocode for sub_C62720
bool PhaseManager::construct(compilation_unit* cu) {
this->cu = cu;
this->allocator = cu->field_16; // extract pool allocator
this->allocator2 = cu->field_16;
// 1. Check timing flag
this->timing_enabled = cu->config->options[17928];
// 2. Allocate and copy phase name table (1272 = 159 * 8 bytes)
this->phase_name_raw_table = alloc(1272);
memcpy(this->phase_name_raw_table, off_22BD0C0, 1272);
this->phase_name_raw_count = 159;
this->phase_name_raw_capacity = 159;
// 3. Initialize timing records
resize_timing(/*capacity=*/159); // sub_C62580
cu->timing_count++; // at cu+1576
append_timing({index=-1, name=0x2030007, time=0, flags=0}); // sentinel
// 4. Create all 159 phase objects
resize_phase_list(/*capacity=*/159); // sub_C62640
for (int i = 0; i < 159; i++) {
auto [phase, pool] = PhaseFactory(i, cu); // sub_C60D30
phase_list[i] = {phase, pool};
}
// 5. Optionally create NvOptRecipe sub-manager
if (cu->config->getOption(391)) {
auto* recipe = alloc(440);
// initialize hash table, ref-counted lists, timing arrays (8 entries)
// inherit phase chain from previous execution context
this->nvopt_recipe_ptr = recipe;
}
this->flag_byte = 1;
return true;
}
Key constants:
- 159 -- total phase count, used as loop bound and array capacities
- 1272 -- 159 * 8, phase name pointer table size in bytes
- 440 -- NvOptRecipe sub-manager object size
- 0x2030007 (33,751,047) -- timing sentinel magic value
- Option 17928 -- enables per-phase timing/memory reporting
- Option 391 -- enables NvOptRecipe sub-manager
Destruction Sequence -- sub_C61B20
Teardown mirrors construction in reverse order, with careful handling of the NvOptRecipe's reference-counted shared state:
// Pseudocode for sub_C61B20
void PhaseManager::destroy() {
// 1. Free raw name table
timing_allocator->free(phase_name_raw_table);
// 2. Tear down NvOptRecipe if present
if (nvopt_recipe_ptr) {
auto* r = nvopt_recipe_ptr;
// decrement shared_list ref-count at +432
if (--r->shared_list_refcount == 0)
free_list_nodes(r->shared_list);
free(r->hash_buckets); // +408
free(r->sorted_array); // +376
free(r->timing_records); // +344, stride=584 per entry
free(r->node_pool); // +16
free(r);
}
// 3. Destroy each phase via virtual destructor (vtable+32)
for (int i = 0; i < phase_list_count; i++) {
auto [phase, pool] = phase_list[i];
pool->free(phase); // invokes vtable+32
}
// 4. Free base arrays
allocator2->free(phase_list);
allocator->free(sorted_name_table);
}
The ref-count decrement-and-destroy pattern on shared_list at +432 follows C++ shared_ptr semantics: the NvOptRecipe may share state across multiple compilation units in library mode.
Phase Dispatch Loop -- sub_C64F70
The dispatch loop is the runtime engine. It takes a slice of the phase index array and executes each phase in order:
// Pseudocode for sub_C64F70
bool PhaseManager::dispatch(int* phase_indices, int count) {
memory_snapshot_t base_snap;
take_snapshot(&base_snap); // sub_8DADE0
for (int i = 0; i < count; i++) {
int idx = phase_indices[i];
phase* p = this->phase_list[idx].phase;
// Resolve phase name
int name_idx = p->getName(); // vtable+16
const char* name = this->phase_name_raw_table[name_idx];
// Record timing entry
append_timing({idx, name, opt_level, flags, metrics});
// Take pre-execution snapshot
memory_snapshot_t pre_snap;
take_snapshot(&pre_snap);
// Execute (unless no-op)
if (!p->isNoOp()) { // vtable+8
p->execute(this->cu); // vtable+0
// Construct diagnostic: "Before <name>" or "After <name>"
}
// Report per-phase stats
if (this->timing_enabled) {
report_phase_stats(name, &pre_snap, false); // sub_C64310
this->flag_byte = 0;
}
}
// Summary after all phases
if (this->timing_enabled) {
report_phase_stats("All Phases Summary", &base_snap, true);
report_pool_consumption(); // sub_C62200
}
return true;
}
The "Before" / "After" diagnostic strings use an interesting encoding trick: the string "Before " is stored as the 64-bit integer 0x2065726F666542 in little-endian, allowing the compiler to emit a single mov instruction instead of a memcpy.
Phase Name Lookup -- sub_C641D0
External callers (e.g., --ftrace-phase-after option processing in sub_9F4040) resolve phase names to indices through a case-insensitive binary search:
// Pseudocode for sub_C641D0
int PhaseManager::lookup_phase(const char* name) {
ensure_sorted(); // sub_C63FA0
int lo = 0, hi = sorted_name_count - 1;
while (lo <= hi) {
int mid = (lo + hi) / 2;
int cmp = strcasecmp(sorted_name_table[mid].name, name);
if (cmp == 0) return sorted_name_table[mid].index;
if (cmp < 0) lo = mid + 1;
else hi = mid - 1;
}
return 158; // sentinel: last phase (NOP)
}
The sorted name table is rebuilt on demand by sub_C63FA0 whenever the raw count differs from the sorted count. Sorting uses an iterative quicksort (sub_C639A0) with median-of-three pivot selection and three-way partitioning. The sort stack is pre-allocated to 33 entries, comfortably above the worst-case partition depth for a 159-entry table (ceil(log2(159)) = 8).
Per-Phase Timing and Memory Reporting
When timing is enabled (option 17928), the dispatch loop calls sub_C64310 after each phase to print memory statistics:
<indent><phase_name> :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
The reporter computes three memory deltas from snapshot pairs:
| Metric | Helper | Meaning |
|---|---|---|
| Total | sub_8DAE20 | Total memory allocated since snapshot |
| Freeable | sub_8DAE30 | Memory eligible for release |
| Freeable Leaked | sub_8DAE40 | Freeable memory not actually released |
Size formatting thresholds:
- 0--1023: raw bytes (suffix B)
- 1024--10,485,760: kilobytes with 3 decimal places (suffix KB)
- above 10 MB: megabytes with 3 decimal places (suffix MB)
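A sketch of the formatter implied by these thresholds; the exact rounding, padding, and separator behavior inside sub_C64310 are unverified.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Size formatting per the recovered thresholds: bytes below 1 KB,
// kilobytes up to the 10 MB boundary (10,485,760), megabytes above.
std::string format_size(uint64_t bytes) {
    char buf[32];
    if (bytes < 1024)
        std::snprintf(buf, sizeof buf, "%llu B", (unsigned long long)bytes);
    else if (bytes <= 10485760) // 10 MB boundary
        std::snprintf(buf, sizeof buf, "%.3f KB", bytes / 1024.0);
    else
        std::snprintf(buf, sizeof buf, "%.3f MB", bytes / (1024.0 * 1024.0));
    return buf;
}
```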
After all phases complete, the loop prints an "All Phases Summary" line using the same reporter, then calls sub_C62200 to print the pool consumption total:
[Pool Consumption = 45.678 MB]
Timing Record Format
Each timing entry is 32 bytes:
Timing Record (32 bytes)
+0 int phase_index // -1 for sentinel
+8 int64 phase_name // string pointer, or 0x2030007 for sentinel
+16 int64 timing_value // elapsed time
+24 int memory_flags // opt level / additional metrics
Records are stored in a growable array at compilation_unit+1560. Growth uses a 1.5x strategy: new_capacity = max(old + old/2 + 1, requested).
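The growth rule in isolation (helper name is ours):

```cpp
// Recovered 1.5x growth strategy for the timing-record array:
// new_capacity = max(old + old/2 + 1, requested).
int grow_capacity(int old_cap, int requested) {
    int grown = old_cap + old_cap / 2 + 1; // 1.5x plus one, integer math
    return grown > requested ? grown : requested;
}
```

The "+1" term guarantees progress from capacity 0, and the max() with the requested size handles large one-shot reservations.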
NvOptRecipe Sub-Manager (440 bytes)
When option 391 is enabled, the constructor creates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This object provides the runtime for "AdvancedPhase" hooks -- the 16 phases that are no-ops by default but can be activated for architecture-specific or optimization-level-specific processing. The NvOpt level (0--5) controls per-phase aggressiveness independently of the -O CLI level: -O gates which phases run at all, while the NvOpt level controls how aggressively active phases behave.
Object Layout
NvOptRecipe (440 bytes)
+0 int64 compilation_unit // back-pointer to owning CU
+8 int64 phase_manager_backref // back-pointer to PhaseManager
+16 void* node_pool // 24-byte ref-counted list node
+24 int64 secondary_bucket_count // secondary hash (migration buffer)
+32 void* secondary_bucket_array // secondary hash bucket array
+40 int64 secondary_total_entries // secondary hash entry count
+48 (264 B) [opaque internal region] // +48..+311 undecoded
+312 int64 recipe_data // from option 391 value (ext. pointer)
+320 int64 (reserved) // zeroed in constructor
+328 (8 B) [alignment gap]
+336 int64 allocator // cu->field_16->field_16
+344 void* timing_records // stride = 584 bytes per entry
+352 int32 timing_count // init -1 (empty sentinel)
+356 int32 timing_flags // init 0
+360 int32 timing_extra // init 0
+364 (4 B) (padding)
+368 int64* timing_allocator // cu->field_16->field_16 copy
+376 void* sorted_array // 4-byte entries, init capacity = 8
+384 int32 sorted_count // init 7 (pre-filled)
+388 int32 sorted_capacity // init 8
+392 void* ref_counted_list_2 // 24-byte ref-counted list node
+400 int32 hash_bucket_count // primary hash table bucket count
+404 (4 B) (padding)
+408 void* hash_buckets // primary hash, 24-byte stride/bucket
+416 int64 hash_size // total entries across all buckets
+424 (8 B) (padding)
+432 void* shared_list_ptr // ref-counted, shared across CUs
Sub-Structures
Ref-Counted List Node (24 bytes) -- used at +16, +392, +432:
RefCountedListNode (24 bytes)
+0 int64 refcount // manual shared_ptr: decrement-and-destroy
+8 void* next // singly-linked list chain
+16 void* allocator // for self-deallocation when refcount → 0
When the refcount reaches zero, the destructor walks the next chain freeing each node, then frees the head node itself through the allocator at +16.
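A minimal model of the decrement-and-destroy teardown, with the allocator slot modeled as a plain deallocation function pointer; names are illustrative.

```cpp
#include <cstdlib>

// Mirrors the 24-byte node: refcount at +0, next at +8, allocator at +16.
struct RefNode {
    long     refcount;          // +0: manual shared_ptr semantics
    RefNode* next;              // +8: singly-linked chain
    void   (*dealloc)(void*);   // +16: stand-in for the allocator pointer
};

// Decrement the head's refcount; on reaching zero, free the whole chain.
// Returns the number of nodes freed (0 if other references remain).
int release_list(RefNode* head) {
    if (--head->refcount != 0) return 0;
    int freed = 0;
    for (RefNode* n = head; n != nullptr; ) {
        RefNode* next = n->next;
        n->dealloc(n);
        ++freed;
        n = next;
    }
    return freed;
}
```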
Hash Bucket Entry (24 bytes) -- array at +408:
HashBucketEntry (24 bytes)
+0 void* chain_head // first element in bucket chain
+8 void* chain_sentinel // end-of-chain sentinel
+16 int32 chain_count // number of elements in this bucket
Timing Record (584 bytes) -- array at +344:
TimingRecord (584 bytes)
+0 (40 B) header
+40 void* sub_allocator // allocator for sub-data at +48
+48 void* sub_data // freed during cleanup
+56 int32 sub_count // set to -1 when cleaned
+60 int32 cleanup_flag // if >= 0: sub_data exists, free it
+64 (520 B) timing/metric data
Records are iterated backward during cleanup (base + 584 * (count + 1) - 584 down to base). The sentinel value -1 at offset +56 marks an entry as already cleaned up.
Construction Sequence
The constructor (sub_C62720, lines 356--850 in decompilation) performs these steps:
1. Check option 391 -- fast path: *(config_obj[9] + 28152) != 0; slow path: virtual call with argument 391. If disabled, skip entirely.
2. Read option 391 value -- the value is the recipe_data pointer. Fast path checks type tag 5 (int64) at config offset 28152, reads the 64-bit value at offset 28160. This is an externally-provided pointer, not computed locally.
3. Allocate 440 bytes from the pool allocator at compilation_unit->field_16.
4. Initialize core fields -- back-pointers at +0/+8, node_pool at +16 (24-byte ref-counted node, refcount=1), zero +24/+32/+40, store recipe_data at +312.
5. Initialize timing -- zero +344, set +352 to -1 (empty sentinel), zero +360, copy allocator to +336 and +368.
6. Allocate sorted_array -- initial capacity 8 entries (32 bytes), pre-fill 7 entries, set +384 = 7, +388 = 8.
7. Allocate ref_counted_list_2 at +392 (24-byte node, refcount=1), zero +400/+408/+416.
8. Allocate shared_list at +432 (24-byte node, refcount=1).
9. Inherit from previous recipe -- if PhaseManager+56 already holds an NvOptRecipe from a prior compilation unit:
   - Decrement old shared_list refcount; free if zero
   - Migrate hash bucket chains from old recipe to new ref_counted_list_2
   - Walk old timing records backward (stride 584), freeing sub-allocations
   - Drain old secondary hash table, release old node_pool
   - Free old NvOptRecipe object
10. Install -- set PhaseManager+56 = new recipe, PhaseManager+64 = allocator.
Destruction Sequence
The destructor (sub_C61B20) tears down the recipe in reverse:
- Decrement shared_list_ptr (+432) refcount; free linked nodes if zero
- Walk hash buckets (+408, stride 24, count from +416): for each chain element, clean sub-entries (timing at offsets +56/+60/+64/+76), decrement per-entry refcounts at element[9], append to ref_counted_list_2; zero bucket; reset +400 to 0
- Clean up ref_counted_list_2 (+392); free if refcount zero
- Free sorted_array (+376) if sorted_count (+388) >= 0
- Walk timing_records (+344) backward, stride 584, freeing sub-allocations; reset +352 to -1
- Drain secondary hash (+24/+32/+40), move chains to node_pool
- Release node_pool (+16); free if refcount zero
- Free the 440-byte object via PhaseManager+64 allocator
NvOpt Level Validation
The recipe application function sub_C173E0 validates the NvOpt level at each recipe record:
// At sub_C173E0 + 0x2FD9 (line 1431)
int nvopt_level = *(int*)(recipe_record + 344);
if (nvopt_level > 5) {
emit_warning(cu + 1232, 8000, "Invalid nvopt level : %d.", nvopt_level);
// warning 8000 (0x1F40) -- non-fatal, compilation continues
}
Valid levels are 0--5. The level is consumed as a bitmask 1 << nvopt_level, passed to a vtable call that dispatches on a recipe configuration byte at target descriptor offset 35280 (8-case switch: cases 0--5, 7). This byte controls which recipe application mode is used for the target architecture.
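A sketch of the check and bitmask conversion; the warning plumbing is modeled here as a boolean return (an assumption -- the binary routes it through emit_warning with code 8000).

```cpp
// Levels 0--5 are valid; the level is consumed downstream as (1 << level),
// the bitmask handed to the recipe-mode vtable dispatch.
bool validate_nvopt_level(int level, unsigned* mask_out) {
    if (level < 0 || level > 5) {
        // ptxas emits warning 8000: "Invalid nvopt level : %d." (non-fatal)
        return false;
    }
    *mask_out = 1u << level;
    return true;
}
```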
Shared State in Library Mode
The shared_list at +432 enables recipe state persistence across compilation units in library mode (multiple .ptx files compiled by one ptxas invocation):
- Each new NvOptRecipe sets its
shared_listrefcount to 1 - During inheritance (step 9), hash bucket contents are migrated from the old recipe to the new one, accumulating per-kernel recipe decisions
- When a PhaseManager is destroyed, the recipe decrements the shared_list refcount; only the last reference frees the nodes
- This allows the NvOptRecipe to cache per-kernel optimization decisions across compilation passes
Key Constants
| Value | Meaning |
|---|---|
| 440 | NvOptRecipe object size (bytes) |
| 584 | Per-entry timing record stride (bytes) |
| 24 | Hash bucket entry size / ref-counted list node size |
| 8 | Initial sorted_array capacity |
| 7 | Initial sorted_count (pre-filled entries) |
| 391 | Option ID (enables NvOptRecipe; value = recipe data pointer) |
| 28152 | Option 391 type-tag offset in config storage |
| 28160 | Option 391 value offset (8 bytes after type tag) |
| 0x1F40 | Warning code 8000: "Invalid nvopt level" |
| 5 | Maximum valid NvOpt level |
| 35280 | Recipe config byte offset in target descriptor |
Multi-Function Dispatch -- sub_C60BD0
When a compilation unit contains more than one function, sub_C60BD0 redirects to a per-function dispatch path:
// Pseudocode for sub_C60BD0
void PhaseManager::invoke_multi(compilation_unit* cu) {
int func_count = get_function_count(cu); // sub_7DDB50
if (func_count > 1) {
auto list1 = create_refcounted_list();
auto list2 = create_refcounted_list();
this->phase_chain = current_chain; // +88
per_function_dispatch(cu, list1, list2); // sub_790A40
release(list1);
release(list2);
}
}
Complete Phase Table
Stage numbering. The 7 groups below are a coarse summary of the 159-phase OCG pipeline. The authoritative fine-grained grouping is the 10-stage scheme in the Pass Inventory (OCG-Stage 1--10). The 7-group table here collapses several of those stages for brevity; phase boundaries differ slightly. When citing a stage by number, prefer the Pass Inventory's 10-stage numbering.
Group 1: Initial Setup (phases 0--12)
| Index | Phase Name | Purpose |
|---|---|---|
| 0 | OriCheckInitialProgram | Validate initial Ori IR |
| 1 | ApplyNvOptRecipes | Apply NvOptRecipe transformations |
| 2 | PromoteFP16 | Promote FP16 operations where beneficial |
| 3 | AnalyzeControlFlow | Build/analyze control flow graph |
| 4 | AdvancedPhaseBeforeConvUnSup | Hook -- before unsupported op conversion |
| 5 | ConvertUnsupportedOps | Lower unsupported operations to supported sequences |
| 6 | SetControlFlowOpLastInBB | Mark control flow ops as last in basic block |
| 7 | AdvancedPhaseAfterConvUnSup | Hook -- after unsupported op conversion |
| 8 | OriCreateMacroInsts | Create macro instruction patterns |
| 9 | ReportInitialRepresentation | Diagnostic dump of initial IR |
| 10 | EarlyOriSimpleLiveDead | Early dead code elimination |
| 11 | ReplaceUniformsWithImm | Replace uniform register loads with immediates |
| 12 | OriSanitize | IR consistency checks |
Group 2: Early Optimization (phases 13--36)
| Index | Phase Name | Purpose |
|---|---|---|
| 13 | GeneralOptimizeEarly | First GeneralOptimize pass (peephole + simplify) |
| 14 | DoSwitchOptFirst | Switch statement optimization, first pass |
| 15 | OriBranchOpt | Branch simplification and folding |
| 16 | OriPerformLiveDeadFirst | Liveness analysis, first pass |
| 17 | OptimizeBindlessHeaderLoads | Optimize bindless texture header loads |
| 18 | OriLoopSimplification | Canonicalize loop structure |
| 19 | OriSplitLiveRanges | Split long live ranges to reduce pressure |
| 20 | PerformPGO | Apply profile-guided optimizations |
| 21 | OriStrengthReduce | Strength reduction on induction variables |
| 22 | OriLoopUnrolling | Loop unrolling |
| 23 | GenerateMovPhi | Convert phi nodes to MOV-phi representation |
| 24 | OriPipelining | Software pipelining of loops |
| 25 | StageAndFence | Memory staging and fence insertion |
| 26 | OriRemoveRedundantBarriers | Remove unnecessary barrier instructions |
| 27 | AnalyzeUniformsForSpeculation | Identify uniform values for speculative execution |
| 28 | SinkRemat | Sink rematerializable instructions |
| 29 | GeneralOptimize | Main GeneralOptimize pass |
| 30 | DoSwitchOptSecond | Switch optimization, second pass |
| 31 | OriLinearReplacement | Replace complex patterns with linear sequences |
| 32 | CompactLocalMemory | Compact local memory layout |
| 33 | OriPerformLiveDeadSecond | Liveness analysis, second pass |
| 34 | ExtractShaderConstsFirst | Extract shader constants, first pass |
| 35 | OriHoistInvariantsEarly | Early loop-invariant hoisting |
| 36 | EmitPSI | Emit program state information |
Group 3: Mid-Level Optimization (phases 37--58)
| Index | Phase Name | Purpose |
|---|---|---|
| 37 | GeneralOptimizeMid | Mid-pipeline GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Simplify nested conditional branches |
| 39 | ConvertVTGReadWrite | Convert vertex/tessellation/geometry read/write ops |
| 40 | DoVirtualCTAExpansion | Expand virtual CTA operations |
| 41 | MarkAdditionalColdBlocks | Mark additional basic blocks as cold |
| 42 | ExpandMbarrier | Expand mbarrier intrinsics |
| 43 | ForwardProgress | Ensure forward progress guarantees |
| 44 | OptimizeUniformAtomic | Optimize uniform atomic operations |
| 45 | MidExpansion | Mid-pipeline lowering and expansion |
| 46 | GeneralOptimizeMid2 | Second mid-pipeline GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Hook -- before argument restrictions |
| 48 | EnforceArgumentRestrictions | Enforce ABI argument constraints |
| 49 | GvnCse | Global value numbering and common subexpression elimination |
| 50 | OriReassociateAndCommon | Reassociation and commoning |
| 51 | ExtractShaderConstsFinal | Extract shader constants, final pass |
| 52 | OriReplaceEquivMultiDefMov | Replace equivalent multi-def MOVs |
| 53 | OriPropagateVaryingFirst | Varying propagation, first pass |
| 54 | OriDoRematEarly | Early rematerialization |
| 55 | LateExpansion | Late lowering of complex operations |
| 56 | SpeculativeHoistComInsts | Speculatively hoist common instructions |
| 57 | RemoveASTToDefaultValues | Remove AST nodes set to default values |
| 58 | GeneralOptimizeLate | Late GeneralOptimize |
Group 4: Late Optimization (phases 59--95)
| Index | Phase Name | Purpose |
|---|---|---|
| 59 | OriLoopFusion | Fuse compatible loops |
| 60 | DoVTGMultiViewExpansion | Expand multi-view VTG operations |
| 61 | OriPerformLiveDeadThird | Liveness analysis, third pass |
| 62 | OriRemoveRedundantMultiDefMov | Remove redundant multi-def MOVs |
| 63 | OriDoPredication | If-conversion / predication |
| 64 | LateOriCommoning | Late value commoning |
| 65 | GeneralOptimizeLate2 | Second late GeneralOptimize |
| 66 | OriHoistInvariantsLate | Late invariant hoisting |
| 67 | DoKillMovement | Move kill instructions for better scheduling |
| 68 | DoTexMovement | Move texture instructions for latency hiding |
| 69 | OriDoRemat | Main rematerialization pass |
| 70 | OriPropagateVaryingSecond | Varying propagation, second pass |
| 71 | OptimizeSyncInstructions | Optimize synchronization instructions |
| 72 | LateExpandSyncInstructions | Expand sync instructions to HW sequences |
| 73 | ConvertAllMovPhiToMov | Convert all MOV-phi to plain MOV |
| 74 | ConvertToUniformReg | Promote values to uniform registers |
| 75 | LateArchOptimizeFirst | Architecture-specific late optimization, first pass |
| 76 | UpdateAfterOptimize | Post-optimization bookkeeping |
| 77 | AdvancedPhaseLateConvUnSup | Hook -- before late unsupported op expansion |
| 78 | LateExpansionUnsupportedOps | Late lowering of unsupported operations |
| 79 | OriHoistInvariantsLate2 | Second late invariant hoisting |
| 80 | ExpandJmxComputation | Expand JMX (join/merge) computations |
| 81 | LateArchOptimizeSecond | Architecture-specific late optimization, second pass |
| 82 | AdvancedPhaseBackPropVReg | Hook -- before back-copy propagation |
| 83 | OriBackCopyPropagate | Backward copy propagation |
| 84 | OriPerformLiveDeadFourth | Liveness analysis, fourth pass |
| 85 | OriPropagateGmma | GMMA/WGMMA propagation |
| 86 | InsertPseudoUseDefForConvUR | Insert pseudo use/def for uniform reg conversion |
| 87 | FixupGmmaSequence | Fix up GMMA instruction sequences |
| 88 | OriHoistInvariantsLate3 | Third late invariant hoisting |
| 89 | AdvancedPhaseSetRegAttr | Hook -- before register attribute setting |
| 90 | OriSetRegisterAttr | Set register attributes (types, constraints) |
| 91 | OriCalcDependantTex | Calculate dependent texture operations |
| 92 | AdvancedPhaseAfterSetRegAttr | Hook -- after register attribute setting |
| 93 | LateExpansionUnsupportedOps2 | Second late unsupported op expansion |
| 94 | FinalInspectionPass | Final IR validity checks |
| 95 | SetAfterLegalization | Mark legalization complete |
Group 5: Scheduling and Register Allocation (phases 96--105)
| Index | Phase Name | Purpose |
|---|---|---|
| 96 | ReportBeforeScheduling | Diagnostic dump before scheduling |
| 97 | AdvancedPhasePreSched | Hook -- before scheduling |
| 98 | BackPropagateVEC2D | Back-propagate 2D vector instructions |
| 99 | OriDoSyncronization | Insert synchronization instructions |
| 100 | ApplyPostSyncronizationWars | Apply post-synchronization write-after-read fixes |
| 101 | AdvancedPhaseAllocReg | Hook -- register allocation |
| 102 | ReportAfterRegisterAllocation | Diagnostic dump after regalloc |
| 103 | Get64bRegComponents | Extract 64-bit register components |
| 104 | AdvancedPhasePostExpansion | Hook -- after post-expansion |
| 105 | ApplyPostRegAllocWars | Apply post-regalloc write-after-read fixes |
Group 6: Post-Schedule and Code Generation (phases 106--131)
| Index | Phase Name | Purpose |
|---|---|---|
| 106 | AdvancedPhasePostSched | Hook -- after scheduling |
| 107 | OriRemoveNopCode | Remove NOP instructions |
| 108 | OptimizeHotColdInLoop | Hot/cold partitioning within loops |
| 109 | OptimizeHotColdFlow | Hot/cold partitioning across flow |
| 110 | PostSchedule | Post-scheduling fixups |
| 111 | AdvancedPhasePostFixUp | Hook -- after post-schedule fixup |
| 112 | PlaceBlocksInSourceOrder | Reorder blocks to match source order |
| 113 | PostFixForMercTargets | Mercury target-specific fixups |
| 114 | FixUpTexDepBarAndSync | Fix texture dependency barriers and sync |
| 115 | AdvancedScoreboardsAndOpexes | Hook -- before scoreboard generation |
| 116 | ProcessO0WaitsAndSBs | Process O0-level waits and scoreboards |
| 117 | MercEncodeAndDecode | Mercury encode to SASS and decode-verify |
| 118 | MercExpandInstructions | Expand macro instructions to SASS |
| 119 | MercGenerateWARs1 | Generate write-after-read hazard stalls, pass 1 |
| 120 | MercGenerateOpex | Generate operand exchange stalls |
| 121 | MercGenerateWARs2 | Generate write-after-read hazard stalls, pass 2 |
| 122 | MercGenerateSassUCode | Emit final SASS microcode |
| 123 | ComputeVCallRegUse | Compute virtual call register usage |
| 124 | CalcRegisterMap | Calculate final register map |
| 125 | UpdateAfterPostRegAlloc | Post-regalloc bookkeeping |
| 126 | ReportFinalMemoryUsage | Report final memory consumption |
| 127 | AdvancedPhaseOriPhaseEncoding | Hook -- before final encoding |
| 128 | UpdateAfterFormatCodeList | Update after code list formatting |
| 129 | DumpNVuCodeText | Dump NV microcode as text (debug) |
| 130 | DumpNVuCodeHex | Dump NV microcode as hex (debug) |
| 131 | DebuggerBreak | Debugger breakpoint (debug) |
Group 7: Late Cleanup (phases 132--158)
| Index | Phase Name | Purpose |
|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Bookkeeping after late conversion |
| 133 | MergeEquivalentConditionalFlow | Merge equivalent conditional branches |
| 134 | AdvancedPhaseAfterMidExpansion | Hook -- after mid-expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Hook -- after late sync expansion |
| 136 | LateMergeEquivalentConditionalFlow | Late merge of equivalent conditionals |
| 137 | LateExpansionUnsupportedOpsMid | Mid-point late unsupported op expansion |
| 138 | OriSplitHighPressureLiveRanges | Split live ranges under high register pressure |
| 139--158 | (architecture-specific) | 20 additional phases whose names come from their vtable getName() methods |
Phases 139--158 are not in the static name table at off_22BD0C0. Their names are returned by each phase's getName() virtual method. These are conditionally-enabled phases for specific architecture targets (SM variants) or optimization levels.
AdvancedPhase Hook Points
The 16 AdvancedPhase entries are insertion points for architecture-specific or optimization-level-specific processing. All return isNoOp() == true by default. When activated (typically by NvOptRecipe configuration for a specific SM target), they execute additional transformations at precisely defined points in the pipeline:
| Index | Hook Name | Insertion Context |
|---|---|---|
| 4 | AdvancedPhaseBeforeConvUnSup | Before ConvertUnsupportedOps |
| 7 | AdvancedPhaseAfterConvUnSup | After ConvertUnsupportedOps |
| 47 | AdvancedPhaseEarlyEnforceArgs | Before EnforceArgumentRestrictions |
| 77 | AdvancedPhaseLateConvUnSup | Before LateExpansionUnsupportedOps |
| 82 | AdvancedPhaseBackPropVReg | Before OriBackCopyPropagate |
| 89 | AdvancedPhaseSetRegAttr | Before OriSetRegisterAttr |
| 92 | AdvancedPhaseAfterSetRegAttr | After OriSetRegisterAttr |
| 97 | AdvancedPhasePreSched | Before scheduling pipeline |
| 101 | AdvancedPhaseAllocReg | Register allocation entry point |
| 104 | AdvancedPhasePostExpansion | After post-regalloc expansion |
| 106 | AdvancedPhasePostSched | After scheduling |
| 111 | AdvancedPhasePostFixUp | After post-schedule fixup |
| 115 | AdvancedScoreboardsAndOpexes | Before scoreboard/opex generation |
| 127 | AdvancedPhaseOriPhaseEncoding | Before final instruction encoding |
| 134 | AdvancedPhaseAfterMidExpansion | After mid-level expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | After late sync instruction expansion |
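The default-no-op contract above can be sketched as a small model. This is illustrative only: the class shape and method names are assumptions modeled on the description, not recovered symbols; only the isNoOp-by-default behavior and the recipe-driven activation are taken from the binary.

```python
# Hypothetical sketch of an AdvancedPhase hook slot. The real object is a
# C++ phase with a virtual isNoOp(); here a hook only runs a body when a
# recipe has installed one for the current SM target.
class AdvancedPhase:
    def __init__(self, name, body=None):
        self.name = name
        self.body = body            # installed by an NvOptRecipe, else None

    def is_no_op(self):
        return self.body is None    # default: True, per the text above

    def execute(self, ctx):
        if self.is_no_op():
            return ctx              # pipeline position preserved, nothing done
        return self.body(ctx)
```

A dormant hook costs only the isNoOp() check, which is why all 16 slots can stay in the phase list unconditionally regardless of target architecture.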
Mercury Encoding Sub-Pipeline
Phases 113--122 form a self-contained sub-pipeline that transforms the optimized, register-allocated Ori IR into final SASS machine code via the Mercury encoding format:
PostFixForMercTargets (113)
→ FixUpTexDepBarAndSync (114)
→ [AdvancedScoreboardsAndOpexes hook (115)]
→ ProcessO0WaitsAndSBs (116)
→ MercEncodeAndDecode (117) ← encode to SASS + decode for verification
→ MercExpandInstructions (118) ← expand remaining macros
→ MercGenerateWARs1 (119) ← first WAR hazard pass
→ MercGenerateOpex (120) ← operand exchange stalls
→ MercGenerateWARs2 (121) ← second WAR hazard pass
→ MercGenerateSassUCode (122) ← final microcode emission
"Mercury" is NVIDIA's internal name for the SASS encoding format on recent GPU architectures (Blackwell-era SM 100/103/110/120).
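Structurally, the sub-pipeline is an ordered list of transformations applied to the code list in sequence. A sketch of that shape (the phase names mirror the table above; the bodies here are placeholders, not the recovered implementations):

```python
def run_mercury_pipeline(code, phases):
    """Apply each (name, fn) phase to the code list, in table order."""
    for _name, fn in phases:
        code = fn(code)            # each phase consumes and returns the code list
    return code

# Phases 113-122 in dispatch order (hook 115 included as a pipeline slot).
MERCURY_ORDER = [
    "PostFixForMercTargets", "FixUpTexDepBarAndSync",
    "AdvancedScoreboardsAndOpexes", "ProcessO0WaitsAndSBs",
    "MercEncodeAndDecode", "MercExpandInstructions",
    "MercGenerateWARs1", "MercGenerateOpex",
    "MercGenerateWARs2", "MercGenerateSassUCode",
]
```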
Diagnostic Strings
| Address | String | Emitted By | Context |
|---|---|---|---|
| 0x22BC3B3 | "[Pool Consumption = " | sub_C62200 | After all phases summary |
| 0x22BC416 | "All Phases Summary" | sub_C64F70 | End of dispatch loop |
| (inline) | " :: " | sub_C64310 | Phase timing line separator |
| (inline) | "[Total " | sub_C64310 | Total memory delta |
| (inline) | "[Freeable " | sub_C64310 | Freeable memory delta |
| (inline) | "[Freeable Leaked " | sub_C64310 | Leaked memory delta |
| (inline) | "Before " / "After " | sub_C64F70 | Phase execution diagnostic |
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
| sub_C60D20 | 16 | Default phase table pointer | HIGH |
| sub_C60D30 | 3,554 | Phase factory (159-case switch) | VERY HIGH |
| sub_C60BD0 | 334 | Multi-function phase invoker | MEDIUM-HIGH |
| sub_C61B20 | 1,753 | PhaseManager destructor | VERY HIGH |
| sub_C62200 | 888 | Pool consumption reporter | VERY HIGH |
| sub_C62580 | 253 | Timing record array resizer | HIGH |
| sub_C62640 | 223 | Phase list resizer | HIGH |
| sub_C62720 | 4,734 | PhaseManager constructor | VERY HIGH |
| sub_C639A0 | 1,535 | Case-insensitive quicksort | HIGH |
| sub_C63FA0 | 556 | Phase name table sort/rebuild | HIGH |
| sub_C641D0 | 305 | Phase name-to-index lookup | VERY HIGH |
| sub_C64310 | 3,168 | Per-phase timing reporter | VERY HIGH |
| sub_C64F70 | 1,455 | Phase dispatch loop | VERY HIGH |
Cross-References
- Pass Inventory & Ordering -- full phase sequence and stage grouping
- GeneralOptimize Bundles -- phases 13, 29, 37, 46, 58, 65
- Synchronization & Barriers -- phases 26, 71, 72, 99, 100
- Liveness Analysis -- phases 10, 16, 33, 61, 84
- Mercury Encoder -- phases 113--122
- Memory Pool Allocator -- pool allocation infrastructure used by PhaseManager
- Optimization Levels -- how opt level controls phase behavior
- DUMPIR & NamedPhases -- phase name resolution for debug output
GeneralOptimize Bundles
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The GeneralOptimize* passes are compound optimization bundles that run multiple sub-transformations in sequence on each basic block, repeating until no further changes occur (fixed-point iteration). They serve as the primary IR cleanup mechanism throughout the pipeline: after any major transformation introduces new dead code, redundant copies, or foldable constants, a GeneralOptimize pass re-normalizes the IR before the next major phase.
Six instances exist at strategic positions in the 159-phase pipeline. Despite sharing the "GeneralOptimize" name prefix, the six instances decompose into three distinct implementation families -- a lightweight block-iteration variant, a heavyweight bitvector-tracked orchestrator, and an indirect vtable dispatch stub. Each family shares a common architectural pattern (per-block iteration with convergence check) but invokes different sub-pass combinations and has different gate conditions.
| Instances | 6 (phases 13, 29, 37, 46, 58, 65) |
| Pattern | Per-block iteration with convergence check |
| Sub-passes | Copy propagation, constant folding, structural equivalence elimination, dead code elimination, predicate simplification, register promotion (Phase 37) |
| Convergence | Boolean change flag per iteration; stops when no sub-pass reports a change |
| Iteration cap | Knob-controlled (option 464); breaks loop if knob returns false |
| Single-function fast path | Phases 13 and 65 have direct tail-call paths bypassing the multi-function dispatch |
| Multi-function gate | Variants check the function count via sub_7DDB50(ctx) before entering the main loop -- > 1 for most variants, > 2 for phase 58 |
| Code range | Execute functions at 0xC5F940--0xC60870; sub-pass bodies at 0x7917F0--0x910840 |
Instance Map
| Phase | Name | Vtable | execute() | Sub-pass Body | Gate Conditions |
|---|---|---|---|---|---|
| 13 | GeneralOptimizeEarly | off_22BD7D0 | 0xC5F940 | sub_7917F0 (multi-func) / 0x1C64BF0 (single-func) | bit 2 of ctx+1382 must be set |
| 29 | GeneralOptimize | off_22BDA50 | 0xC5FC50 | sub_908EB0 | Option 487 enabled; option 231 not set; option 461 pass |
| 37 | GeneralOptimizeMid | off_22BDB90 | 0xC5FD70 | sub_910840 | sub_8F3EA0 pre-check; option 487; "ConvertMemoryToRegisterOrUniform" name-gate |
| 46 | GeneralOptimizeMid2 | off_22BDCF8 | 0xC60840 | indirect via [*(ctx+1584)]->vtable[0x1C0] | Vtable dispatch; skips if target == sub_7D6DD0 (no-op sentinel) |
| 58 | GeneralOptimizeLate | off_22BDED8 | 0xC5FF20 | sub_8F7080 | Function count > 2; bits 4-5 of ctx+1396 != 0x20; option 31 checked |
| 65 | GeneralOptimizeLate2 | off_22BDFF0 | 0xC60550 | indirect via [*(ctx+1584)]->vtable[392] | Function count > 1; indirect dispatch through compilation unit vtable |
Architecture: Three Structural Variants
Variant A: Block-Iteration with Explicit Fixed-Point Loop (Phases 13, 29)
The Early and standard GeneralOptimize passes iterate over basic blocks with an explicit convergence loop. Phase 13 (GeneralOptimizeEarly) at sub_7917F0 is the simplest and best-documented:
// sub_7917F0 -- GeneralOptimizeEarly (multi-function path)
void GeneralOptimizeEarly(int64_t ctx) {
if (!(*(uint8_t*)(ctx + 1382) & 4)) return; // gate: optimization flag
// Option 214 check -- uses vtable fast-path comparison:
// if vtable[72] == sub_6614A0, reads *(config + 15408) directly
// otherwise calls the virtual getOption(214)
if (getOption(ctx, 214)) return; // gate: skip if set
// Option 487 check -- uses vtable[152] fast-path:
// if vtable[152] == sub_67EB60, calls sub_7468B0(config, 487)
// otherwise calls the virtual isOptionSet(487, 1)
if (!getOption_v2(ctx, 487)) return; // gate: general opt enable
if (*(int64_t*)(*(int64_t*)ctx + 1056)) return; // gate: already processed
sub_785E20(ctx, 0); // reset per-block change tracking
sub_781F80(ctx, 1); // initialize instruction flags
sub_7E6090(ctx, 0, 0, 0, 0); // prepare operand use/def chains
sub_7E6AD0(ctx, 0, ...); // build def-use/use-def links
// Iterate over basic blocks (block_count at ctx+520)
int bb_count = *(int32_t*)(ctx + 520);
for (int i = 1; i <= bb_count; i++) {
// block_order at ctx+512, block_table at ctx+296
int bb_idx = *(int32_t*)(*(int64_t*)(ctx + 512) + 4*i);
BasicBlock* bb = *(BasicBlock**)(*(int64_t*)(ctx + 296) + 8*bb_idx);
// Fixed-point loop on this block
int64_t state[...]; // stack-allocated state at rbp-0x88
while (true) {
bool changed = sub_753600(&state, bb); // run sub-passes
if (!changed) break;
// Iteration cap: knob 464
if (!getOption_v2(ctx, 464)) break;
sub_753B50(&state); // apply instruction rewrites
}
}
if (any_changed)
sub_785E20(ctx, 0); // re-normalize if anything changed
}
The inner function sub_753600 runs on a single basic block and returns a boolean indicating whether any transformation fired. When it returns true, sub_753B50 applies the accumulated changes (instruction replacement, operand rewriting, def-use chain updates), and the loop re-runs sub_753600 on the same block to check if the new IR enables further simplifications.
The convergence check for option 464 acts as an emergency brake: if the knob returns false, the loop breaks even if changes were detected. This prevents pathological cases where mutual transformations oscillate indefinitely.
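The convergence protocol can be modeled compactly. In this sketch, run_subpasses, apply_rewrites, and iteration_knob are stand-ins for sub_753600, sub_753B50, and the option-464 query; the names and the return of an iteration count are assumptions for illustration.

```python
def optimize_block(block, run_subpasses, apply_rewrites, iteration_knob):
    """Re-run sub-passes on one block until nothing changes (fixed point)."""
    iterations = 0
    while True:
        changed = run_subpasses(block)   # True if any transformation fired
        if not changed:
            break                        # converged
        if not iteration_knob():         # emergency brake: knob 464
            break                        # stop even though changes remain
        apply_rewrites(block)            # commit accumulated rewrites
        iterations += 1
    return iterations
```

Note the brake is checked after the change flag, so a disabled knob aborts the loop before any rewrite is applied in that iteration.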
Phase 29 (sub_C5FC50) follows the same pattern but delegates to sub_908EB0, which implements a more complex instruction walk with additional opcode dispatch (opcodes 97 [STG in ROT13; used here as a definition anchor], 18 [FSETP], 124 [conditional select]) and predicate-aware propagation.
Variant B: Full-Program Sub-Pass Orchestration (Phases 37, 58)
The Mid and Late variants operate at a higher level: they construct a multi-field context structure, initialize bitvector tracking infrastructure, and call a heavyweight sub-pass orchestrator.
Phase 37 -- GeneralOptimizeMid (sub_910840)
- Calls `sub_8F3EA0` -- a pre-condition check (returns false to skip the entire pass)
- Checks option 487 (general optimization enable) via the same vtable fast-path pattern
- Calls `sub_799250` with the string `"ConvertMemoryToRegisterOrUniform"` (at `0x21DD228`) -- a named phase gate that allows the pass to be selectively disabled via `--no-phase`
- Constructs a 0x408-byte context object on the stack with vtable pointer `off_21DBEF8` at offset 0. The layout is:
  GeneralOptimizeMid Context (0x408 bytes)
  +0x000  vtable_ptr = off_21DBEF8
  +0x008  allocator = *(ctx + 16)
  +0x010  (zero-init) ...
  +0x018  (zero-init) ...
  +0x020  (zero-init) ...
  +0x030  int count = 0
  +0x040  sub_context -- initialized by sub_905B50 (bitvectors, register tracking)
  ...
- Calls `sub_905B50` -- a 500+ line setup function that creates bitvector arrays for tracking register definitions, use-def chains, and per-block change flags. Allocates three pairs of {bitvector, metadata, capacity} structures for tracking definition reach, register liveness, and fold eligibility
- Calls `sub_90FBA0` -- the main optimization loop that iterates over all blocks, running sub-passes per instruction
After sub_90FBA0 returns, the function destroys three RAII-style bitvector containers at offsets +0x200, +0x228, and +0x1E0 by invoking their vtable destructors via *(vtable + 32).
Phase 58 -- GeneralOptimizeLate (sub_8F7080)
- Checks function count > 2 via `sub_7DDB50` (stricter than other variants that check > 1)
- Checks optimization level bits at `ctx+1396`: the condition `(flags & 0x30) != 0x20` ensures the pass is skipped at certain reduced optimization levels
- Checks option 31 via the vtable fast-path; when option 31 reports as "extended" (value at `config+2232` is 1 with non-zero extra word at `config+2240`), an additional `sub_7DC0E0` check determines a secondary control flag `v7`
- Constructs a 0x168-byte context on the stack with 7 sub-pass tracking groups. Each group occupies 56 bytes (three `__int128` values + a boolean changed-flag + a counter):
  GeneralOptimizeLate Context (0x168 bytes)
  +0x000  ctx_ptr = ctx (the compilation context)
  +0x008  flag_a -- initialized from (ctx+1396 & 4)
  +0x009  flag_b -- initialized from (ctx+1396 & 8)
  +0x00C  counter_0 = 0   |
  +0x010  changed_0 = 0   | Sub-pass group 0 (56 bytes)
  +0x018  ...             |
  +0x048  counter_1 = 0   | Sub-pass group 1
  ...
  +0x12C  counter_6 = 0   | Sub-pass group 6
  +0x130  changed_6 = 0   |
  +0x138  ...             |
- Calls `sub_8F6FA0` -- the block iterator
The block iterator sub_8F6FA0 initializes per-context flags from ctx+1396:
- Bit 2 (`& 4`): stored at `context+9`, controls whether opcode-7 instructions are processed
- Bit 3 (`& 8`): stored at `context+8`, controls whether opcode-6 (MOV variant) instructions are processed
It then calls sub_7E6090 to rebuild use-def chains and walks the block list calling sub_8F6530 per block.
Variant C: Indirect Vtable Dispatch (Phases 46, 65)
The Mid2 and Late2 variants use indirect vtable dispatch to call their sub-pass bodies, making the exact implementation architecture-dependent:
Phase 46 (GeneralOptimizeMid2) at 0xC60840:
mov rdi, [rsi+0x630] ; load sm_backend (compilation_context+1584)
mov rax, [rdi] ; load vtable
mov rax, [rax+0x1C0] ; load vtable slot 56 (offset 0x1C0 = 448)
cmp rax, 0x7D6DD0 ; compare against no-op sentinel
jne call_it ; if not sentinel, call it
ret ; otherwise, return (phase is no-op)
call_it:
jmp rax ; tail-call the vtable method
Phase 65 (GeneralOptimizeLate2) at sub_C60550:
// sub_C60550 -- GeneralOptimizeLate2 execute
int64_t GeneralOptimizeLate2(int64_t phase, int64_t ctx) {
int64_t result = sub_7DDB50(ctx); // get function count
if ((int)result > 1) {
int64_t comp_unit = *(int64_t*)(ctx + 1584);
return (*(int64_t(**)(int64_t, int64_t))(*(int64_t*)comp_unit + 392))(comp_unit, ctx);
}
return result;
}
This indirection means the actual optimization behavior for phases 46 and 65 is determined by the compilation unit's vtable, which varies by target architecture and optimization level. The no-op sentinel sub_7D6DD0 (for phase 46) indicates that some architectures skip this pass entirely.
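A minimal model of the sentinel-dispatch pattern, assuming a dict-based vtable for illustration (noop_sentinel plays the role of sub_7D6DD0; the real check is a single pointer compare, as in the assembly above):

```python
def noop_sentinel(comp_unit, ctx):
    """Stands in for sub_7D6DD0: the shared do-nothing vtable entry."""
    return None

def run_indirect_phase(comp_unit, ctx, slot):
    """Fetch a vtable slot; skip the phase if it still holds the sentinel."""
    fn = comp_unit["vtable"][slot]
    if fn is noop_sentinel:        # identity compare, like cmp rax, imm
        return None                # this architecture did not install the pass
    return fn(comp_unit, ctx)      # tail-call the installed body
```

Comparing against a known sentinel (rather than calling it) lets the dispatcher skip the call entirely, matching the early `ret` in the phase 46 assembly.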
Sub-Pass Decomposition
The sub-passes that run inside a GeneralOptimize iteration are not named individually in the binary -- they are inline code within the per-block processing functions. Based on the decompiled logic, the following sub-transformations are identifiable:
Copy Propagation Algorithm
String evidence: "OriCopyProp" at 0x21E6CE1 appears in the phase name table at index 22, confirming that copy propagation is a recognized sub-pass within the system.
Two distinct copy propagation algorithms exist across the GeneralOptimize variants:
Algorithm A: Chain-Matching Copy Propagation (Phase 13 -- sub_753600)
Phase 13's copy propagation operates by matching structurally equivalent instruction pairs connected through single-use def-use chains. The 253-line function sub_753600 uses a state structure (8 int64_t fields, allocated on the stack at rbp-0x88 in sub_7917F0) that accumulates matched chain endpoints:
sub_753600 State Layout (8 qwords)
state[0] = ctx -- Code Object pointer (set by caller)
state[1] = match_start -- first matched instruction in chain
state[2] = match_end -- last matched instruction in chain
state[3] = def_entry_a -- first definition chain entry (from sub_753520)
state[4] = reg_entry -- register/BB entry for replacement target
state[5] = def_entry_b -- extended chain entry (second level)
state[6] = reg_entry_b -- extended register/BB entry
The algorithm proceeds in eight steps:
// sub_753600 -- Phase 13 copy propagation (decompiled pseudocode)
function copy_prop_early(state, basic_block):
ctx = state[0]
first_instr = *(basic_block[1]) // head of instruction list
// Step 1: Entry gate -- only process blocks starting with control-flow terminator
if first_instr.opcode != 95: return false // opcode 95 = STS in ROT13; used as terminator class
if first_instr.operand_count != 5: return false
format = first_instr[25] & 7
if format != 3 and format != 4: return false // must be imm or reg source
// Step 2: Single-use chain check
use_link = basic_block[17] // use-def chain link
if use_link == NULL: return false
if *use_link == NULL: return false
if **use_link != NULL: return false // must be SINGLE consumer
// Step 3: Follow to defining instruction via opcode-97 anchor
next_instr = *(basic_block[1] + 8) // linked list next
if next_instr.opcode != 97: return false // must be def anchor
reg_entry = *(ctx+296)[ next_instr.bb_index ] // BB/def lookup
// Step 4: Walk def-use chain to find structural match
chain_a = follow_chain_filtered(state, reg_entry) // sub_753520
if chain_a == NULL: return false
state[3] = chain_a
// Step 5: Walk reverse chain from chain_a
chain_b = follow_reverse_chain(state, chain_a) // sub_753570
if chain_b == NULL: return false
state[1] = chain_b
state[2] = chain_b
// Step 6: Predicate-operand compatibility check
endpoint_instr = *(chain_b[1])
if endpoint_instr.opcode != 95: return false
  if !predicate_operand_compatible(first_instr, endpoint_instr): return false // sub_7E7380
// Step 7: Operand-level matching
if operand formats differ (format-4 parity mismatch): return false
if reg_indices match AND metadata matches AND modifiers match:
goto apply // direct match
// Step 7b: Deep sub-DAG equivalence (for non-trivial patterns)
if both sources are register type (bits 28-30 == 1)
and both have use_count <= 1
and both defining instructions have opcode 119
and no aliasing hazards (sub_748570)
and sub_1245740(ctx, def_a, def_b, 2): // depth-2 DAG compare
goto apply
return false
apply:
// Step 8: Record replacement target
state[4] = register_entry_for_replacement
// Optionally follow one more chain level for state[5]/state[6]
return true // caller invokes sub_753B50 to rewrite
The chain walker sub_753480 (43 lines) is the core of this algorithm. It follows single-use, single-def chains within a basic block:
// sub_753480 -- def-use chain walker (at 0x753480)
function follow_chain(ctx, entry, &skip_flag):
skip_flag = false
if entry == NULL: return NULL
current = entry
loop:
if check_multi_condition_skip(current): // sub_7E5120
skip_flag = true // chain crossed a skip point
if current[16] == NULL: break // no next-use link
if *current[16] != NULL: break // MULTI-USE: stop
if current[17] == NULL: break // no def link
if *current[17] != NULL: break // MULTI-DEF: stop
def_bb_idx = *(current[17] + 8)
instr_bb_idx = *(current[1] + 8).bb_index // at +24
if def_bb_idx != instr_bb_idx: break // CROSS-BB: stop
next_instr = *(current[1] + 8)
if next_instr.opcode == 97: // def anchor
current = *(ctx+296)[ def_bb_idx ] // follow chain
continue
else:
return NULL // chain broken
return current // last valid entry
Key properties of this walker:
- Only follows single-use chains (`current[16]` must have exactly one consumer)
- Only follows single-def chains (`current[17]` must have exactly one producer)
- Only follows intra-block chains (definition and use must share the same BB index)
- Only traverses through opcode 97 (definition anchor) instructions
- The `check_multi_condition_skip` helper (`sub_7E5120`, 18 lines) tests four conditions: vtable dispatch at `ctx+1784`, block ordering bounds at `ctx+1776`, instruction flags at `+283` bit 0, and knob 91
The helper sub_753520 wraps sub_753480 with an additional opcode-93 gate: the chain endpoint's instruction must have opcode 93 (OUT_FINAL in ROT13; used as an internal chain-link marker) and the use-chain at entry[16] must be empty. sub_753570 performs the reverse direction check, verifying that following the chain backward from a given entry reaches the expected starting point with matching register indices.
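The walker's stopping conditions can be modeled directly. This sketch uses assumed node records with `bb`, `uses`, `defs`, and `next` fields in place of the offset-based links (`current[16]`, `current[17]`) shown above, and models only the termination rules, not the opcode-97 anchor hop:

```python
def follow_chain(entry):
    """Walk forward while each next link is single-use, single-def, same block."""
    current = entry
    while True:
        nxt = current["next"]
        if nxt is None:
            return current                # end of chain
        if len(nxt["uses"]) != 1:         # MULTI-USE: stop
            return current
        if len(nxt["defs"]) != 1:         # MULTI-DEF: stop
            return current
        if nxt["bb"] != current["bb"]:    # CROSS-BB: stop
            return current
        current = nxt                     # link is safe; keep walking
```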
Algorithm B: Forward Walk with Flag Marking (Phase 29 -- sub_908EB0)
Phase 29's copy propagation walks the instruction linked list sequentially from *(ctx+272) (instruction list head) and marks eligible operands with flag bits for later consumption. The 217-line function sub_908EB0 maintains three key state variables:
| Variable | Type | Purpose |
|---|---|---|
v10 | bool | "previous instruction was a recognized copy" -- gates liveness fallback |
v11 | int64_t | Current definition tracking entry (BB array pointer, from opcode 97) |
v21 | char | Architecture-allows-predicate-marking flag (from vtable at **(ctx+1584)+1312) |
// sub_908EB0 -- Phase 29 forward copy propagation (decompiled pseudocode)
function copy_prop_forward(ctx):
// Gate checks: option 487, option 231, option 461, function count,
// architecture check via sub_7DC0E0, vtable dispatch at +1312
v21 = check_arch_predicate_marking(ctx)
sub_781F80(ctx, 1) // initialize per-instruction flags
v10 = initial_gate_flag // from option 487 check
v11 = 0 // no current definition context
for instr in instruction_list(ctx+272):
opcode = instr.opcode & ~0x3000 // mask bits 12-13
switch opcode:
case 97: // DEFINITION ANCHOR
v10 = initial_gate_flag // reset copy-tracking
v11 = *(ctx+296)[ instr.operand[0] & 0xFFFFFF ]
// Updates definition context -- subsequent opcodes 18/124
// reference v11 for their propagation decisions
continue
case 18: // SET-PREDICATE (FSETP/ISETP)
if sub_8F2E50(ctx, instr): // eligible?
v10 = false // suppress liveness check
if v21: // arch supports pred marking?
dst_idx = count + ~((opcode>>11) & 2)
instr.operand[dst_idx] |= 0x400 // mark: propagated-under-predicate
continue
case 124: // CONDITIONAL SELECT
if !sub_8F2E50(ctx, instr): continue
dst = instr.operand[ count + ~((opcode>>11) & 2) ]
if (ctx+1379) & 7 == 0: // simple mode
dst |= 0x100 // mark: propagated
continue
if (dst & 0xF) == 1: // integer constant type
if !sub_8F29C0(ctx): continue // arch check
// fall through to direct marking
else:
if !sub_8F29C0(ctx) or (ctx+1379 & 0x1B) != 0:
// Two-pass predicate simplifier
sub_908A60(ctx, v11, instr, 1, &hit, &partial) // forward
if hit: goto mark_propagated
if !partial:
sub_908A60(ctx, v11, instr, 0, &hit, &partial) // backward
if hit: goto mark_propagated
if !partial: continue // no match at all
// Direct propagation: convert operand type
dst = (dst & 0xFFFFFDF0) | 0x201 // clear type, set reg+deferred
continue
// Liveness-gated propagation check for extended chains
if !v10 or v21:
mark_propagated:
instr.operand[dst_idx] |= 0x100 // mark: propagated
else:
// Follow definition chain from v11 for additional candidates
follow_and_check_chain(ctx, v11, instr)
continue
default:
if !v10: // no prior copy recognized
status = sub_7DF3A0(instr, ctx) // liveness check
v10 = (*status & 0xC) != 0 // live uses exist?
continue
Target Opcodes in Copy Propagation Context
| Opcode | IR Meaning | Role in Copy Prop | Evidence |
|---|---|---|---|
| 97 | Definition anchor / label marker (STG in the ROT13 name table; used here as a definition anchor, not an actual store-global instruction) | Updates the current definition tracking context (v11). Its operand instr+84 & 0xFFFFFF is an index into the BB array at ctx+296, retrieving the BasicBlock descriptor for the definition point. All subsequent propagation decisions for opcodes 18 and 124 reference this context. | sub_908EB0 lines 74--78: v11 = *(*(a1+296) + 8 * (*(v9+84) & 0xFFFFFF)) |
| 18 | FSETP/ISETP (set predicate) | A predicate-setting comparison instruction. Copy propagation treats it as a "predicated copy" target: when source operands have type 2 or 3 (predicate/uniform register) and pass sub_91D150 register constraint checks, the destination predicate can be folded into consumers. Marked with 0x400 when the architecture supports it. | sub_908EB0 lines 84--96, sub_8F2E50 lines 19--61 |
| 124 | Conditional select (phi-like) | A two-source selection instruction controlled by a predicate. Copy propagation attempts to simplify it to a direct assignment when one source is a constant or when structural analysis shows the predicate is trivially true/false. Marked with 0x100 or type-converted via `(operand & 0xFFFFFDF0) \| 0x201`. | sub_908EB0 case 124; sub_8F2E50 lines 42--51 |
Flag Bit Semantics
The propagation marks operands with three flag bits on the destination operand word at instr + 84 + 8*dst_idx:
| Bit | Mask | Name | Set When | Effect |
|---|---|---|---|---|
| 8 | 0x100 | Propagated | Conditional select (opcode 124) is eligible for propagation AND the architecture/mode checks pass | Downstream apply-changes passes replace all uses of this destination with its source operand. Checked as a guard in sub_8F2E50: if (dst & 0x100) return false prevents double-propagation. |
| 9 | 0x200 | Deferred cleanup | Direct propagation of an integer-constant conditional select rewrites the operand type: `(operand & 0xFFFFFDF0) \| 0x201` | Set together with bit 0 (the 0x201 constant sets register type + deferred); flags the operand for deferred cleanup by the downstream apply-changes passes |
| 10 | 0x400 | Propagated under predicate | Set-predicate instruction (opcode 18) is eligible AND the architecture flag v21 is true (vtable dispatch at **(ctx+1584)+1312 returned non-zero) | Marks a conditional propagation: the destination predicate can be folded into consumers, but only if the guarding predicate is maintained. Distinguished from 0x100 because the propagation is predicate-dependent rather than unconditional. |
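The marks are plain bit operations. A small model using the mask values from the table (operand words are treated as bare integers here, an assumption for illustration):

```python
PROPAGATED       = 0x100  # bit 8
DEFERRED_CLEANUP = 0x200  # bit 9
PRED_PROPAGATED  = 0x400  # bit 10

def mark_propagated(operand):
    return operand | PROPAGATED

def mark_under_predicate(operand):
    return operand | PRED_PROPAGATED

def convert_to_deferred(operand):
    # mirrors the decompiled rewrite: clear the type nibble and bit 9,
    # then set bit 0 (register type) and bit 9 (deferred cleanup)
    return (operand & 0xFFFFFDF0) | 0x201

def already_propagated(operand):
    # the double-propagation guard used by sub_8F2E50
    return bool(operand & PROPAGATED)
```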
Eligibility Checker: sub_8F2E50
The 64-line function sub_8F2E50 is the central gatekeeper for both opcodes 18 and 124. Decompiled logic:
// sub_8F2E50 -- copy/fold eligibility (from decompiled code at 0x8F2E50)
function is_eligible(ctx, instr):
opcode = instr[18] with BYTE1 &= 0xCF // mask bits 12-13
if opcode == 18: // set-predicate
dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
type_nibble = (dst >> 2) & 0xF
if type_nibble == 10: return false // type 10: never foldable
if type_nibble == 0 and !(dst & 0x400): // no type bits, not yet marked
// Architecture-gated source operand check
vtable_fn = **(ctx+1584) + 1320
if vtable_fn == sub_7D7240: // sentinel: direct check
if (instr[23] >> 28) & 7 not in {2, 3}: return false
else:
if vtable_fn() returns true: goto opcode_124_check
// Register constraint check on both source operands
if sub_91D150(ctx, instr[23] & 0xFFFFFF): goto opcode_124_check
if sub_91D150(ctx, instr[25] & 0xFFFFFF): goto opcode_124_check
return true
return false
opcode_124_check:
if opcode == 124: // conditional select
dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
if dst & 0x100: return false // already propagated
if dst & 0x70: return false // has modifier bits
type = dst & 0xF
sm_version = *(*(ctx+1584) + 372)
if (type == 1 or type == 2) // integer or float
and (sm_version <= 20479 or !(dst & 0x1C00)): // SM gate
return true
return false
The SM version threshold 20479 (0x4FFF) divides generation-4-and-below architectures (Kepler/Maxwell, where constant propagation through conditional selects is unconditionally safe) from generation-5+ architectures (Pascal onward) that require the constraint bits at dst & 0x1C00 to be zero. See SM Version Encoding and the 20479 Boundary for the encoding formula.
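The whole opcode-124 gate condenses into one predicate over the destination word and the SM version. Constants are taken from the decompiled check above; the function name and bare-integer operand are illustrative assumptions:

```python
def cond_select_foldable(dest, sm_version):
    """Opcode-124 eligibility: type, modifier, flag, and SM-version gates."""
    if dest & 0x100:                 # already propagated
        return False
    if dest & 0x70:                  # modifier bits (neg/abs/sat)
        return False
    if (dest & 0xF) not in (1, 2):   # integer or float destination only
        return False
    # Kepler/Maxwell-era encodings (<= 20479) are unconditionally safe;
    # later generations require the constraint bits 0x1C00 to be clear.
    return sm_version <= 20479 or (dest & 0x1C00) == 0
```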
Architecture Predicate Query: sub_8F29C0
The 9-line function sub_8F29C0 at 0x8F29C0 determines whether the compilation unit's target architecture supports predicate-aware copy propagation:
// sub_8F29C0 -- architecture predicate query (decompiled verbatim)
bool check_arch_predicate(int64_t ctx) {
int64_t comp_unit = *(int64_t*)(ctx + 1584);
return sub_7DC0E0(comp_unit) // primary arch capability
|| sub_7DC050(comp_unit) // secondary arch capability
|| sub_7DC030(comp_unit); // tertiary arch capability
}
This same query is used inside sub_908A60 (the two-pass predicate simplifier) to initialize the default "safe to transform" flag before instruction-level analysis refines the answer.
Two-Pass Predicate Simplifier: sub_908A60
When simple eligibility checks pass for opcode 124 but additional predicate analysis is needed (specifically: when sub_8F29C0 returns false OR ctx+1379 & 0x1B has bits set), the two-pass predicate simplifier sub_908A60 at 0x908A60 is invoked. It takes a direction argument (1 = forward, 0 = backward) and scans the instruction stream in the specified direction looking for matching definitions:
- Forward pass (`a4 = 1`): Starts from the current definition context `v11`, walks forward through the block's instruction list. For each instruction, dispatches on opcode: 97 updates tracking context, 124/18 checks eligibility via `sub_8F2E50`, others check liveness. Uses a hash-set membership test (`sub_767240`) to avoid visiting the same instruction twice.
- Backward pass (`a4 = 0`): Starts from the definition chain at `v11+136`, walks backward through linked definitions with the same opcode dispatch logic.
The function outputs two flags: out_a (full match found -- propagation is safe) and out_b (partial match found -- further analysis may help). Phase 29 invokes forward first; if forward finds neither a full nor partial match, it invokes backward. This handles PHI-like merge patterns where the definition chain has both forward paths (normal control flow) and backward paths (loop back-edges).
Comparison of Algorithm A vs Algorithm B
| Aspect | Phase 13 (sub_753600) | Phase 29 (sub_908EB0) |
|---|---|---|
| Pattern | Chain matching (pair structural equivalence) | Forward walk with flag marking |
| Opcodes handled | 95 (entry gate), 93 (chain gate), 97 (anchor), 119 (deep eq) | 97 (anchor), 18 (pred copy), 124 (cond select) |
| Chain depth | Multi-level (follows through opcode 97 anchors) | Single-level (immediate operand check) |
| Result mechanism | Direct instruction rewriting via sub_753B50 | Flag marking (0x100/0x200/0x400), consumed later |
| Convergence | Fixed-point loop in sub_7917F0 (option 464 cap) | Single pass, flags consumed by subsequent iterations |
| Complexity | 253 lines + 5 helper functions | 217 lines + 4 helper functions |
| Scope | Intra-block, single-use chains only | Intra-block, all instructions in sequence |
Constant Folding Patterns
Constant folding in GeneralOptimize is a two-level mechanism. At the Ori IR level (phases 29 and 37), the fold-eligibility check sub_8F2E50 at 0x8F2E50 decides which operands can be marked as constant-propagation-eligible. Separately, at the SASS level, the peephole pass sub_1249B50 performs instruction-combining folds on ALU operations whose sources are both MOV-from-immediate. The Ori-level fold does not evaluate arithmetic at compile time -- it marks operands with flag bits that downstream passes consume to replace registers with immediates.
The Eligibility Check: sub_8F2E50
The central gatekeeper, called by sub_908EB0 (phase 29) and sub_908A60 (predicate simplifier). Returns boolean: 1 = foldable, 0 = not foldable. Two dispatch paths based on the masked opcode at instr[18] & ~0x3000:
// sub_8F2E50 -- Fold eligibility check (complete, annotated)
bool is_fold_eligible(int64_t ctx, uint32_t* instr) {
uint32_t raw = instr[18];
uint32_t opcode = raw;
BYTE1(opcode) &= 0xCF; // clear bits 12-13 (predication variant)
// --- Path A: opcode 18 (predicated copy) ---
if (opcode == 18) {
int dest_idx = instr[20] + ~((raw >> 11) & 2); // last-operand index
int dest = instr[2 * dest_idx + 21];
int type_nibble = (dest >> 2) & 0xF;
if (type_nibble == 10) return false; // operand type 10: never foldable
// Require both type nibble == 0 AND no predicate-propagated flag (0x400)
if (type_nibble != 0 || (dest & 0x400))
return false;
    // Vtable dispatch at comp_unit->vtable[1320]:
    //   sentinel sub_7D7240 -> check source operand type bits directly
    //   otherwise -> call virtual method
    comp_unit = *(int64_t*)(ctx + 1584);
    fn = *(comp_unit->vtable + 1320);
    if (fn == sub_7D7240) {
src_type = (instr[23] >> 28) & 7;
if (src_type - 2 > 1) return false; // only types 2,3 eligible
} else {
if (fn() != 0) goto check_opcode_124;
src_type = (instr[23] >> 28) & 7;
if (src_type - 2 > 1) return false;
}
// Verify register constraints via sub_91D150 for both sources
if (sub_91D150(ctx, instr[23] & 0xFFFFFF))
goto check_opcode_124;
src2_type = (instr[25] >> 28) & 7;
if (src2_type - 2 <= 1 && !sub_91D150(ctx, instr[25] & 0xFFFFFF))
return true;
// Fall through to opcode 124 check on constraint failure
}
check_opcode_124:
// --- Path B: opcode 124 (conditional select / phi-like move) ---
if (opcode == 124) {
int dest_idx = instr[20] + ~((raw >> 11) & 2);
int dest = instr[2 * dest_idx + 21];
if (dest & 0x100) return false; // already propagated
if (dest & 0x70) return false; // has modifier bits (neg/abs/sat)
int type = dest & 0xF;
int sm_ver = *(int32_t*)(*(int64_t*)(ctx + 1584) + 372);
if ((type == 1 || type == 2) &&
(sm_ver <= 20479 || (dest & 0x1C00) == 0))
return true;
}
return false;
}
The function sub_91D150 is a trivial lookup into a per-register constraint array: return *(uint32_t*)(*(ctx+440) + 4 * reg_index). A return value of 0 means the register has no fold-blocking constraint.
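Expressed as typed C, the lookup is a single array load (a sketch; the FoldCtx struct and function name are invented for illustration -- only the pointer arithmetic mirrors the decompilation):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical typed view of the context field at ctx+440: a flat
 * per-register array of 32-bit constraint words. */
typedef struct {
    uint32_t *reg_constraints; /* what the decompilation reads via *(ctx+440) */
} FoldCtx;

/* Mirrors sub_91D150: one array load, no computation.
 * 0 => the register has no fold-blocking constraint. */
static uint32_t reg_constraint(const FoldCtx *ctx, uint32_t reg_index) {
    return ctx->reg_constraints[reg_index];
}
```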
Fold Eligibility Table
| ORI Opcode | Operation | Foldable? | Conditions | Evidence |
|---|---|---|---|---|
| 18 | Predicated copy | Yes | Source operand types must be 2 (predicate) or 3 (uniform); operand type nibble must be 0; no 0x400 flag; both source registers pass sub_91D150 constraint check | sub_8F2E50 lines 17--61 |
| 124 | Conditional select | Yes | Dest type 1 (integer) or 2 (float); no modifier bits (& 0x70 == 0); not already propagated (& 0x100 == 0); SM-version-dependent constraint check | sub_8F2E50 lines 42--51 |
| 97 | Register-to-register move | Propagated, not folded | Dest register replaced by source in all uses (copy propagation); no type/SM checks | sub_908EB0 lines 75--79 |
| 98 | Local load (LDL) | Cost-exempt fold target | In phase 37 only; target symbol looked up from constant bank; foldable if symbol is in constant bank | sub_90FBA0 lines 261--270 |
| 130 | HSET2 (packed FP16x2 compare) | Cost-exempt | Phase 37 bitmask: opcode in {130,133,134,137} bypasses fold cost penalty | sub_90FBA0 bitmask 0x99 |
| 133 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 134 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 137 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 272 | Extended instruction | Cost-exempt | Phase 37: excluded from cost penalty alongside 273 | sub_90FBA0 line 226 |
| 273 | Extended instruction | Cost-exempt | Same as 272 | sub_90FBA0 |
| Others | General ALU/memory | Not directly foldable | GeneralOptimize does not attempt to fold; deferred to peephole passes | sub_8F2E50 returns 0 |
"Cost-exempt" means that when phase 37 (GeneralOptimizeMid) encounters these opcodes, it does not apply the standard cost penalty to the fold decision. Other opcodes accumulate a cost of 1.0 per fold, while these accumulate only v104 (default 0.25, adjustable via knob 474). Exception: even cost-exempt opcodes pay full cost if sub_91E860 reports a use count exceeding 4, preventing over-folding of widely-referenced constants.
SM Version 20479 Threshold
The value at *(*(ctx+1584) + 372) encodes the target SM architecture. The threshold 20479 (0x4FFF) is consulted in three places within the fold logic:
- sub_8F2E50 line 47: fold eligibility for opcode 124
- sub_908A60 line 107: the predicate simplifier's opcode-124 handling
- Transitively through sub_8F2E50 calls from sub_908A60
The threshold divides two immediate-encoding regimes:
| SM range | Encoded value | Fold rule | Rationale |
|---|---|---|---|
| <= 20479 | Legacy encoding | Integer (type 1) and float (type 2) constants in conditional selects fold unconditionally | Legacy architectures use fixed-width immediate slots with no sign/width constraints |
| > 20479 | Extended encoding | Same types fold only if (dest & 0x1C00) == 0 -- constraint bits at operand positions 10--12 must all be zero | Extended architectures introduced variable-width immediate encoding with sign-extension rules; bits 10--12 encode width/signedness constraints that prevent certain constants from being represented as immediates |
The encoded value at comp_unit+372 uses the formula (generation << 12) | variant. Known values: 12288 = sm_30 (gen 3), 16385 = sm_50 (gen 4), 20481 = sm_50a (gen 5), 24576 = sm_60 (gen 6), 28672 = sm_70 (gen 7), 32768 = sm_90 (gen 8), 36864 = sm_100 (gen 9). The threshold 20479 = (5 << 12) - 1 = 0x4FFF falls exactly at the generation 4/5 boundary: all generation-4 values (Kepler/Maxwell) are at or below 20479, while the first generation-5 value (20481) exceeds it. This aligns with the introduction of extended immediate encoding formats in Pascal (sm_60, gen 6) and its predecessors in the gen-5 range.
How Fold Results Are Consumed
When sub_8F2E50 returns true in sub_908EB0 (phase 29), the action depends on the opcode:
Opcode 18 (predicated copy): If the v21 flag is set (vtable dispatch at comp_unit+1312 returned non-zero, confirming the target supports this transformation), the destination operand is marked with 0x400 (propagated-under-predicate). No immediate evaluation occurs.
Opcode 124 (conditional select): A tier flag at ctx+1379 controls aggressiveness:
// In sub_908EB0, after sub_8F2E50 returns true for opcode 124:
int tier = *(uint8_t*)(ctx + 1379) & 7;
if (tier == 0) {
// AGGRESSIVE: mark dest byte-1 |= 1 (fold-committed, fast path)
dest_operand[1] |= 1;
} else {
// CONSERVATIVE: type-dispatched analysis required
if ((dest & 0xF) == 1) { // integer immediate
if (sub_8F29C0(ctx)) // predicate analysis passes
dest = (dest & 0xFFFFFDF0) | 0x201; // clear type, set propagated+eligible
} else { // float or other
if (!sub_8F29C0(ctx) || (*(uint8_t*)(ctx + 1379) & 0x1B) != 0) {
// Two-pass predicate simplifier (forward, then backward)
sub_908A60(ctx, reg, instr, 1, &out_a, &out_b); // forward
if (!out_a && !out_b[0])
sub_908A60(ctx, reg, instr, 0, &out_a, &out_b); // backward
}
dest = (dest & 0xFFFFFDF0) | 0x201; // set propagated+eligible
}
}
The tier value at ctx+1379 & 7 distinguishes:
- 0 = aggressive fold (unconditional fast path, no predicate analysis)
- 1--7 = conservative fold (requires sub_8F29C0 predicate analysis and potentially sub_908A60 two-pass simplification)
The actual constant value is not computed during GeneralOptimize. The fold marks operands with flag bits (0x100, 0x200, 0x400, byte-1 |= 1) that downstream passes consume: the apply-changes function sub_753B50 rewrites instruction lists, and the peephole/codegen passes emit the actual immediates.
The limit-fold-fp Knob
| String | "limit-fold-fp" at 0x1CE3D23 |
| Help text | "Enable/disable constant folding of float operations." at 0x1CE63B0 |
| Type | Boolean |
| Default | "false" (FP folding is NOT limited -- folding is enabled) |
| Config offset | config + 340 (registered at sub_434320 line 268) |
| Category | Optimization control (registration category 4) |
| Visibility | Internal (not exposed on public CLI) |
Despite the name, limit-fold-fp follows the convention that limit-X = true means restrict/disable X. When set to true:
- The config+340 byte propagates into per-function context flags at ctx+1379 during compilation context setup
- The ctx+1379 & 7 tier value becomes non-zero, forcing all type-2 (float) operands through the conservative fold path
- The conservative fold requires predicate analysis via sub_8F29C0 and potentially the two-pass sub_908A60 simplifier, which rejects folds where predicate conditions are ambiguous
- This prevents FP constants from being folded when the fold could alter precision semantics -- for example, folding an FMA source operand might lose the fused multiply-add precision guarantee of the original instruction
The predicate analysis helper sub_8F29C0 performs three sequential checks on the compilation unit at ctx+1584: sub_7DC0E0, sub_7DC050, and sub_7DC030. If any returns true, the predicate allows safe propagation. These check architecture capability flags for predicated constant operations.
Phase 37 Fold Cost Model
sub_90FBA0 (the main loop for GeneralOptimizeMid) integrates fold decisions into a cost-weighted convergence model rather than a simple boolean. Key elements:
Opcode classification (lines 226--228 of the decompiled output): a bitmask 0x99 applied to the range 130--137 classifies opcodes as cost-exempt. The expression ~(0x99 >> (uint8_t)(opcode + 126)) & v15 clears v15 (the cost flag) for opcodes where the corresponding bit in 0x99 is set -- the 8-bit truncation of opcode + 126 maps opcode 130 to shift 0, 133 to 3, 134 to 4, and 137 to 7. Combined with the range check for 272--273:
| Cost-exempt opcodes | Bitmask bit | Interpretation |
|---|---|---|
| 130 | bit 0 of 0x99 | FP16x2 compare (HSET2 family) |
| 133 | bit 3 of 0x99 | SM-range-dependent ALU |
| 134 | bit 4 of 0x99 | SM-range-dependent ALU |
| 137 | bit 7 of 0x99 | SM-range-dependent ALU |
| 272, 273 | Direct check | Extended load/store variants |
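Under the recovered wrap-around indexing, the classification can be reproduced in a few lines of C (a sketch; is_cost_exempt is our name, not a symbol from the binary):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the phase-37 cost-exemption test recovered from sub_90FBA0.
 * The mask 0x99 (binary 10011001) is indexed by (opcode + 126) mod 256,
 * so opcodes 130/133/134/137 land on bits 0/3/4/7. Opcodes 272-273 are
 * checked directly rather than through the mask. */
static bool is_cost_exempt(uint32_t opcode) {
    if (opcode == 272 || opcode == 273)
        return true;
    uint8_t shift = (uint8_t)(opcode + 126); /* wraps mod 256 */
    if (shift > 7)                           /* outside the 8-bit mask */
        return false;
    return (0x99u >> shift) & 1u;
}
```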
Cost computation: For cost-exempt opcodes, fold cost = v104 * v35 where v104 defaults to 0.25 (overridable via knob 474) and v35 is 0.0 if the instruction is dead (checked via sub_7DF3A0), 1.0 otherwise. For non-exempt opcodes, fold cost = 1.0 * v35 (full weight).
Use-count gate: Even cost-exempt opcodes pay full cost if sub_91E860 (use-count estimator) reports more than 4 uses, preventing over-folding of widely-referenced constants.
Convergence: Accumulated costs at context+26 (weighted) and context+27 (unweighted) are doubles. The loop continues until the cost delta falls below the threshold (default 0.25 from knob 474; overridable by knob 135 when *(config+9720) is set).
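Putting the pieces together, the per-fold cost contribution reduces to a small function (a hypothetical distillation -- the boolean inputs stand in for the sub_7DF3A0 dead check and the sub_91E860 use count; names are ours):

```c
#include <assert.h>
#include <stdbool.h>

/* Distilled phase-37 fold cost rules recovered from sub_90FBA0:
 *   dead instruction   -> 0.0
 *   >4 uses            -> full cost 1.0 (even when cost-exempt)
 *   cost-exempt opcode -> v104 (knob 474, default 0.25)
 *   everything else    -> 1.0                                   */
static double fold_cost(bool cost_exempt, bool is_dead, int use_count,
                        double v104 /* knob 474 value */) {
    double v35 = is_dead ? 0.0 : 1.0;
    if (cost_exempt && use_count <= 4)
        return v104 * v35;
    return 1.0 * v35;
}
```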
Register Constraint Validation: sub_8F3FE0
Phase 37 uses sub_8F3FE0 to validate that folding an instruction's operands respects register-class constraints. The function:
- Queries comp_unit->vtable[904] for the per-element operand size of the instruction
- Queries comp_unit->vtable[936] (if not sentinel sub_7D7040) for per-instruction fold metadata
- Iterates over all source operands:
  - Requires operand type bits (>> 28) & 7 to be 2 or 3 (predicate or uniform register)
  - Calls sub_91D150 to look up the register constraint for each source operand
  - Compares against a previously cached constraint at context + 7 (8-byte stride per fold group)
  - Returns 0 (fold invalid) on any constraint mismatch
- Loop count is determined by the destination operand format field (& 7)
- Returns 1 only if all source operands have consistent register constraints
Constant Folding and Propagation Marking Architecture
The term "constant folding" in the context of GeneralOptimize is misleading. The pass does not evaluate arithmetic at compile time (e.g., replacing 3 + 5 with 8). Instead, it performs constant propagation eligibility marking -- identifying operands that hold constant or propagatable values and setting flag bits so downstream passes can exploit this information. Actual arithmetic evaluation occurs elsewhere in the pipeline.
Three Levels of Constant Handling in ptxas
Constant handling spans three distinct pipeline stages, each with different scope and mechanism:
| Level | Stage | Functions | What It Does | What It Does NOT Do |
|---|---|---|---|---|
| 1 -- ORI-IR Propagation Marking | GeneralOptimize (phases 13/29/37/46/58/65) | sub_908EB0 (body), sub_8F2E50 (gate), sub_908A60 (deep analysis) | Marks operands with flag bits (0x100/0x200/0x400) indicating they are eligible for constant propagation | Evaluate arithmetic; rewrite instructions; emit immediates |
| 2 -- SASS Peephole Combining | Post-ISel peephole (phases 83+) | sub_83EF00 (156KB mega-peephole), sub_1249B50 (integer ALU fold), sub_1249940 (MOV-pair matcher) | Combines MOV-from-immediate + ALU instruction pairs into single instructions with folded constants | Operate on ORI IR; handle non-MOV sources |
| 3 -- Frontend Expression Evaluation | PTX parser/validator (address range 0x460000--0x4D5000) | Multiple validator functions (string evidence: "Constant expression has division by zero", "Constant overflow") | Evaluates PTX-level constant expressions during parsing; reports errors for invalid expressions | Operate on internal IR; run during optimization |
The limit-fold-fp knob controls Level 1 only -- specifically whether float-typed operands take the fast path or must go through predicate analysis before being marked.
SM Version Encoding and the 20479 Boundary
The SM version at comp_unit->profile[+372] is not a direct sm_XX number. It uses a packed encoding:
encoded_sm = (generation << 12) | variant
Concrete values from the binary:
| Encoded | Hex | Generation | Variant | Architecture |
|---|---|---|---|---|
| 12288 | 0x3000 | 3 | 0 | sm_30 (Kepler) |
| 16385 | 0x4001 | 4 | 1 | sm_50 (Maxwell) |
| 20481 | 0x5001 | 5 | 1 | sm_50a (Maxwell alt / gen-5 base) |
| 24576 | 0x6000 | 6 | 0 | sm_60 (Pascal) |
| 28672 | 0x7000 | 7 | 0 | sm_70 (Volta) |
| 28673 | 0x7001 | 7 | 1 | sm_80 (Ampere) |
| 32768 | 0x8000 | 8 | 0 | sm_90 (Hopper) |
| 36864 | 0x9000 | 9 | 0 | sm_100 (Blackwell) |
The threshold 20479 = (5 << 12) - 1 = 0x4FFF. This is the largest value that fits in generation 4. Every generation-5+ encoded value exceeds it.
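The encoding and threshold check are trivial to reproduce (a sketch; function names are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SM_GEN_THRESHOLD 20479u /* 0x4FFF = (5 << 12) - 1 */

/* Packed SM encoding recovered from comp_unit->profile[+372]. */
static uint32_t encode_sm(uint32_t generation, uint32_t variant) {
    return (generation << 12) | variant;
}

/* Generation-5+ targets use the extended immediate encoding and must
 * additionally pass the (dest & 0x1C00) == 0 constraint-bit check. */
static bool needs_constraint_bits_clear(uint32_t encoded_sm) {
    return encoded_sm > SM_GEN_THRESHOLD;
}
```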
The fold-eligibility impact:
- SM <= 20479 (generation 4 and below -- Kepler, Maxwell): Integer and float immediates in conditional-select instructions (opcode 124) fold unconditionally. The hardware uses fixed-width immediate slots with no sign/width constraints at operand bit positions 10--12.
- SM > 20479 (generation 5+ -- Pascal and all newer): The operand's constraint bits at positions 10--12 (mask 0x1C00) must all be zero for folding to proceed. These bits encode hardware constraints introduced with extended immediate formats:
  - Bit 10: immediate width constraint (narrow vs wide encoding)
  - Bit 11: sign-extension requirement
  - Bit 12: bank-relative vs absolute encoding
The threshold appears in 6 locations across the binary, confirming it is a fundamental architectural boundary rather than an ad-hoc check: sub_8F2E50 (fold eligibility), sub_406C5E (peephole), sub_406018 (peephole operand matcher), sub_751940 (instruction walker), sub_78DB70 (phase pre-check), sub_848790 (register bank coalescer).
Architecture Class Predicate: sub_8F29C0 Internals
The 9-line function sub_8F29C0 queries three architecture capability checks in sequence. If any returns true, the conservative fold path (which requires additional predicate analysis) is the correct approach for the target:
bool arch_needs_conservative_fold(int64_t ctx) {
int64_t cu = *(int64_t*)(ctx + 1584);
if (sub_7DC0E0(cu)) return true; // isDualIssue
if (sub_7DC050(cu)) return true; // isNvlinkArch
return sub_7DC030(cu); // isGraceArch
}
Each sub-function reads the architecture class field at comp_unit->profile[+12]:
| Function | Check | Class ID | Architecture Family |
|---|---|---|---|
| sub_7DC0E0 | profile[+12] == 4 | 4 | Dual-issue (Maxwell sm_50) |
| sub_7DC050 | profile[+12] == 11 OR profile[+1418] & 1 | 11 | NVLink-capable (Volta+) |
| sub_7DC030 | profile[+12] == 10 OR profile[+1417] >> 7 | 10 | Grace (ARM-based) |
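Typed out under the recovered offsets, the three predicates and their disjunction look like this (the Profile struct and function names are invented; only the field offsets and class IDs come from the decompilation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented typed view of the fields the three checks read. */
typedef struct {
    int32_t arch_class; /* profile[+12] */
    uint8_t flags_1417; /* profile[+1417] */
    uint8_t flags_1418; /* profile[+1418] */
} Profile;

static bool is_dual_issue(const Profile *p)  { return p->arch_class == 4; }
static bool is_nvlink_arch(const Profile *p) { return p->arch_class == 11 || (p->flags_1418 & 1); }
static bool is_grace_arch(const Profile *p)  { return p->arch_class == 10 || (p->flags_1417 >> 7); }

/* Mirrors sub_8F29C0's short-circuit sequence. */
static bool arch_needs_conservative_fold(const Profile *p) {
    return is_dual_issue(p) || is_nvlink_arch(p) || is_grace_arch(p);
}
```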
When sub_8F29C0 returns true: folding a constant into a conditional select requires predicate analysis first, because these architectures have immediate encoding differences between conditional and unconditional instruction forms, or because predicate evaluation may have observable side effects.
When sub_8F29C0 returns false (simpler architectures): the fold attempt still proceeds but falls through to the more expensive two-pass predicate simplifier (sub_908A60) as a fallback rather than using the direct marking path.
Two-Pass Predicate Simplifier: sub_908A60 Internals
When the eligibility check passes for opcode 124 but the conservative path is required (either sub_8F29C0 returns false, or the tier flags at ctx+1379 & 0x1B have bits set), sub_908A60 performs a bidirectional scan of the instruction stream to validate that the fold is safe.
Signature: sub_908A60(ctx_array, basic_block_id, instr, direction, &out_hit, &out_partial)
| Parameter | Type | Meaning |
|---|---|---|
a1 | int64_t* | Context as QWORD array (a1[37] = block array, a1[198] = comp_unit) |
a2 | int | Basic block index (from the definition anchor) |
a3 | int64_t | Current instruction pointer |
a4 | int | Direction: 1 = forward scan, 0 = backward scan |
a5 | int* | Output: 1 if a complete safe-fold chain was found |
a6 | int* | Output: 1 if architecture supports aggressive mode |
Algorithm:
- Allocates a 24-byte tracking structure via comp_unit->vtable[24]
- Queries architecture mode via sub_7DC0E0 / sub_7DC050 / sub_7DC030
- Walks instructions in the specified direction within the basic block:
  - Opcode 97 (STG in ROT13; used as definition anchor/label marker): follows the label chain to the next definition
  - Opcode 52 (NOP/delimiter): stops the scan (block boundary)
  - Opcode 124 or 18: recursively calls sub_8F2E50 on the chained instruction to verify fold safety through the chain
- Sets output flags based on whether a complete safe-fold chain was found
Invocation pattern in sub_908EB0:
// Forward pass first
sub_908A60(ctx, bb_id, instr, 1, &hit, &partial);
if (hit) goto mark_propagated;
// If forward found nothing useful, try backward
if (!partial) {
sub_908A60(ctx, bb_id, instr, 0, &hit, &partial);
if (hit) goto mark_propagated;
if (!partial) continue; // neither direction found a match
}
The two-pass strategy (forward then backward) handles PHI-like merge patterns at loop boundaries. Forward catches definitions along normal control flow; backward catches definitions from loop back-edges. The partial flag prevents unnecessary backward scans when the forward pass already determined the chain is definitively unfoldable.
Algebraic Simplification and Structural Equivalence
The algebraic simplifier in GeneralOptimize is not a traditional constant-identity pattern matcher. It does not check operand values against constants (0, 1, -1) to recognize identities like x+0 or x*1. Instead, it is a structural equivalence-based pattern recognizer that detects when two instructions in a def-use chain compute identical values, enabling one to be eliminated. Traditional algebraic identity patterns (x+0->x, x*1->x, x&0->0, x-x->0, etc.) are handled by the separate MainPeepholeOptimizer -- see the comparison table below.
The simplifier lives in sub_753600 (Phase 13, GeneralOptimizeEarly) and is approximately 253 lines of decompiled code. It operates on chains of instructions linked through def-use relationships.
Entry Guard
The function only triggers on instructions matching a narrow pattern:
// sub_753600 entry guard
if (instr[18] == 95 // opcode 95 (STS in ROT13; used as terminator class)
&& instr[20] == 5 // exactly 5 operands
&& (instr[25] & 7) - 3 <= 1) // operand format 3 (register) or 4 (immediate)
{
// proceed to chain walk
}
The restriction to opcode 95 means this simplifier targets conditional exit/return sequences where a guard predicate or condition is computed redundantly. The 5-operand constraint ensures the instruction has the expected layout: result, predicate, and three source operands.
Chain-Walking Algorithm
After the entry guard passes, sub_753600 executes a 9-step algorithm:
Step 1 -- Def-chain traversal. Reads the use-list pointer at instr[17] (offset 136). Checks that the use-list head exists, points to a single definition (head's first element is null), and that the next instruction in the chain has opcode 97 (STG in ROT13; used as definition anchor/label).
Step 2 -- Register resolution. Follows the register index through the register table at ctx+296 to resolve the first chain link to a concrete register entry. Both chain paths (via instr[17]+8 field, "use-list index", and via the register table) must point to the same entry.
Step 3 -- First pair detection via sub_753520. This helper calls sub_753480 to walk the single-def chain forward, looking for an instruction with opcode 93 (OUT_FINAL in ROT13; used as a chain-link marker). At each step, sub_753480 checks:
- sub_7E5120 -- is the current entry eligible for chain-following? (checks constant bank membership, block region flags, and opcode 91 via sub_7A1A90)
- The use-list pointer at entry[16] has a null head (single use)
- The use-list pointer at entry[17] has a null head (single def)
- The register index at entry[17]+8 matches the next instruction's register at entry[1]+8 -> +24
Step 4 -- Second pair detection via sub_753570. Starting from the first pair's result, follows the chain one more step looking for a second opcode-93 instruction that references back to the same register as the first pair's target.
Step 5 -- Predicate-operand compatibility check via sub_7E7380:
// sub_7E7380 -- predicate-operand compatibility check (narrow, not full structural)
bool predicate_operand_compatible(Instr* a, Instr* b) {
bool a_has_pred = (a->opcode & 0x1000) != 0; // bit 12: predicated
bool b_has_pred = (b->opcode & 0x1000) != 0;
if (a_has_pred != b_has_pred)
return false;
if (a_has_pred && b_has_pred) {
// Compare last operand (predicate register): 24-bit register index
int a_idx = a->operands[a->operand_count - 1] & 0xFFFFFF;
int b_idx = b->operands[b->operand_count - 1] & 0xFFFFFF;
if (a_idx != b_idx) return false;
// Compare preceding operand pair (full 64-bit equality)
return a->operands[a->operand_count - 2] == b->operands[b->operand_count - 2];
}
return true; // both unpredicated: predicate-compatible at this level
}
This confirms the two instructions have matching predication structure -- same predicate register, same predicate condition encoding.
Step 6 -- Operand format classification. Computes the effective operand position as operand_count - ((opcode >> 11) & 2) and checks whether it equals 5. When it does, reads the format code at instr[25] & 7. Format 3 means register operand, format 4 means immediate. Both instructions must have the same format classification (both register or both immediate).
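A minimal sketch of the Step 6 computation, assuming the recovered bit layout (both function names are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The effective operand position drops trailing predicate operands when
 * opcode bit 12 is set: (opcode >> 11) & 2 is 2 for predicated forms. */
static int effective_operand_pos(uint32_t opcode, uint32_t operand_count) {
    return (int)(operand_count - ((opcode >> 11) & 2));
}

/* Format code 3 = register operand, 4 = immediate. Both instructions
 * must carry the same classification for the match to proceed. */
static bool formats_compatible(uint32_t fmt_a, uint32_t fmt_b) {
    return fmt_a == fmt_b && (fmt_a == 3 || fmt_a == 4);
}
```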
Step 7 -- Register index equality. Compares the 24-bit register index: (instr_a[v23+21] & 0xFFFFFF) == (instr_b[v24+21] & 0xFFFFFF). When equal and the full operand descriptors at instr[23] and instr[24] also match, the instructions provably compute the same value. The function jumps to the success path.
Step 8 -- Modifier verification via sub_747F40 and sub_747F80:
// sub_747F40 -- negation flag extraction
int get_negation(Instr* instr) {
int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
return (instr->data[25] >> 3) & 1; // bit 3 of format byte
return 0;
}
// sub_747F80 -- absolute-value flag extraction
int get_abs(Instr* instr) {
int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
return (instr->data[25] >> 4) & 1; // bit 4 of format byte
return 0;
}
Both instructions must have identical negation and absolute-value flags. If neg(a) != neg(b) or abs(a) != abs(b), the pattern is rejected. This prevents incorrectly identifying neg(x) as equivalent to x.
Step 9 -- Deep sub-DAG equivalence. When register indices differ but operand type bits (bits 28-30) equal 1 (register type), the simplifier follows the definition chain to the defining instruction and attempts structural matching at depth:
// Deep equivalence path (sub_753600, lines 149-189)
if (((operand_a >> 28) & 7) == 1) { // register-type operand
RegEntry* reg_a = reg_table[operand_a & 0xFFFFFF];
if (reg_a->use_count_field == 5) { // field at +64
Instr* def_a = reg_a->defining_instr; // field at +56
// ...same for operand_b...
if (def_a->opcode == 119 && def_b->opcode == 119) { // both SHFL
int res_a = def_a->operands[2 * def_a->operand_count + 19];
int res_b = def_b->operands[2 * def_b->operand_count + 19];
if ((res_a & 1) == 0 && (res_b & 1) == 0 // bit 0 clear
&& ((res_a | res_b) & 8) == 0 // bit 3 clear
&& !sub_748570(def_a, ctx) // no alias hazard
&& !sub_748570(def_b, ctx) // no alias hazard
&& def_a->data[25] == def_b->data[25] // format match
&& def_a->data[26] == def_b->data[26] // descriptor match
&& sub_1245740(ctx, def_a, def_b, 2)) // depth-2 DAG eq
{
// Match found -- proceed to chain extension
}
}
}
}
The depth limit of 2 (fourth argument to sub_1245740) prevents exponential blowup in the equivalence check while still catching common patterns like f(g(x)) == f(g(y)) when x == y.
Chain Extension and Accumulation
After finding one matching pair, the function extends the search down the chain. It calls sub_753520 and sub_753570 on subsequent entries, accumulating the full matching sequence in the state array at a1[1] through a1[6]. The state layout is:
State array (passed as a1, 7 qword slots):
a1[0] = ctx (compilation context)
a1[1] = first matched instruction (start of sequence)
a1[2] = second matched instruction (end of first pair)
a1[3] = third matched instruction (from sub_753520)
a1[4] = fourth instruction (next link)
a1[5] = fifth instruction (from secondary sub_753520)
a1[6] = sixth instruction (final chain link)
The function returns true (changed) when the full chain is successfully matched. The caller (sub_7917F0) then invokes sub_753B50 to rewrite the matched sequence.
What This Actually Eliminates
The pattern this simplifier catches is: a sequence of conditional exit instructions where the guard predicates, condition codes, and source operands are structurally equivalent. In practice, this arises from lowering transformations that produce redundant conditional exit/return pairs -- for example, when a function has multiple return paths that were not merged during earlier optimization, or when predicated code duplication creates exit sequences with identical conditions.
The rewrite performed by sub_753B50 replaces the redundant chain with a single exit/return sequence, updating the block's instruction list, register-to-instruction mappings, and def-use chains.
Algebraic Pattern Location Map
The following table clarifies which optimization pass handles each category of algebraic simplification:
| Pattern Category | Pass | Location | Evidence |
|---|---|---|---|
| Structural equivalence (identical computation chains) | GeneralOptimize Phase 13 | sub_753600 | CERTAIN -- decompiled |
| Modifier canonicalization (neg/abs flag matching) | GeneralOptimize Phase 13 | sub_747F40, sub_747F80 | CERTAIN -- decompiled |
| Sub-DAG equivalence (depth-limited tree comparison) | GeneralOptimize Phase 13 | sub_1245740 | CERTAIN -- decompiled |
| Copy propagation (reg-reg, predicated, conditional) | GeneralOptimize Phase 29 | sub_908EB0 | CERTAIN -- decompiled |
| Predicate simplification (constant predicates) | GeneralOptimize Phase 29 | sub_908A60 | CERTAIN -- decompiled |
| Register promotion (memory-to-register conversion) | GeneralOptimize Phase 37 | sub_90EF70 | CERTAIN -- decompiled |
| Identity: x+0->x, x*1->x, x&(-1)->x, x\|0->x, x^0->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Annihilator: x*0->0, x&0->0 | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Inverse: x-x->0, x^x->0, !!x->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Strength reduction: x*2->x<<1, x/1->x | StrengthReduction (phase 26) | documented separately | CERTAIN -- separate pass |
| Predicate identity: p&true->p, p\|false->p | MainPeepholeOptimizer + Phase 29 | combined | MEDIUM |
The MainPeepholeOptimizer operates on the full SASS opcode set via three 233-280 KB dispatch functions with 373-case primary switches. Its pattern tables encode the constant-identity rules (IADD3 with zero source becomes MOV, IMAD with unit multiplier becomes shift/add, LOP3 with identity LUT becomes passthrough, etc.) as prioritized rewrite rules. See Peephole Optimization for full details.
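For illustration, the identity rows above reduce to a matcher of this shape (a toy sketch on an invented IR -- the real pass matches SASS opcodes through its dispatch tables, not this enum):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy two-operand integer op, for illustration only. */
typedef enum { OP_ADD, OP_MUL, OP_AND, OP_OR, OP_XOR } Op;

/* Returns true (and sets *is_copy) when "x <op> imm" reduces to a copy
 * of x, mirroring the identity rows: x+0, x*1, x&-1, x|0, x^0. */
static bool fold_identity(Op op, int64_t imm, bool *is_copy) {
    switch (op) {
    case OP_ADD: *is_copy = (imm == 0);  break;
    case OP_MUL: *is_copy = (imm == 1);  break;
    case OP_AND: *is_copy = (imm == -1); break;
    case OP_OR:  *is_copy = (imm == 0);  break;
    case OP_XOR: *is_copy = (imm == 0);  break;
    }
    return *is_copy;
}
```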
Helper Functions: sub_753E30 and sub_753F70
Three additional helpers extend the Phase 13 algebraic simplifier beyond the main sub_753600 path:
sub_753E30 (67 lines) -- secondary chain matcher that handles the case where the first instruction in the chain has a source register index (instr[25] & 0xFFFFFF) that differs from the current block's register at *(a2 + 24). It follows a more complex chain topology involving three register entries (at state slots a1[7], a1[8], a1[9]) and validates that the secondary chain loops back to the primary entry. This catches equivalences across register renaming boundaries.
sub_753F70 (49 lines) -- vtable-dispatched transformation that performs the actual rewrite for chains detected by sub_753E30. It calls through comp_unit->vtable[656] (with sentinel check against sub_744F30). When the vtable method returns true, it constructs opcode-93 replacement instructions via sub_92E1B0 and splices the old chain out via sub_91E310. This is the surgical rewrite counterpart to sub_753B50's rewrite for the main path.
sub_753DB0 (33 lines) -- chain tail finder that walks from a given register entry forward through the def-chain, following opcode-97 links via the register table. Returns the last reachable entry in the chain (the "tail") or the entry one step before a broken link. Used by the extended chain detection logic to determine where the equivalence region ends.
Dead Code Elimination
DCE within GeneralOptimize is lightweight compared to the standalone OriPerformLiveDead passes (phases 16, 33, 61, 84). It operates locally within basic blocks using the sub_7DF3A0 function:
// sub_7DF3A0 -- instruction liveness check
// Returns pointer to status word
// Bits 2-3 (mask 0xC): has live uses
// Bit 0 (mask 0x1): marked dead
int8_t* check_liveness(int64_t instr, int64_t* ctx) {
// ... examines use-def chains ...
return status_ptr; // caller checks (*result & 0xC) != 0
}
In sub_908EB0, the DCE check appears as the fallback for unrecognized opcodes:
if (!v10) { // v10 = "previous instruction was a recognized copy"
int8_t* status = sub_7DF3A0(instr, ctx);
v10 = (*status & 0xC) != 0; // live uses exist?
}
When (*status & 0xC) == 0, the instruction has no live consumers and is effectively dead. In Variant A, dead instructions are not immediately deleted -- they are marked for removal by the convergence loop cleanup phase (sub_753B50), which rewires the instruction list to skip dead nodes and updates the block's def-use chains via sub_931920, sub_932E80, sub_749090, and sub_9253C0.
In Variant B (phase 58), sub_8F6530 uses the same sub_7DF3A0 liveness check but integrates the result into its 7-counter change tracking structure, incrementing the appropriate sub-pass counter when a dead instruction is found.
Predicate Simplification
A distinct sub-pass handles predicate register operations. The code in sub_908EB0 at the opcode-18 and opcode-124 branches processes predicated moves and conditional selects:
- Opcode 18 (predicated move): if the predicate is known-true (from prior constant folding), simplifies to an unconditional move. If the v21 flag is set (indicating the vtable dispatch at comp_unit+1312 returned non-zero, i.e. the target supports this transformation), marks the destination operand with 0x400
- Opcode 124 (conditional select): if both source operands are identical (detected via def-chain comparison), simplifies to an unconditional copy; if the predicate is constant, selects the appropriate source. The two-pass approach via sub_908A60 handles phi-like patterns where direction matters:
  - Pass 1: sub_908A60(ctx, reg_entry, instr, 1, &out_a, &out_b) -- forward direction
  - Pass 2 (if pass 1 found no simplification but detected a partial match): sub_908A60(ctx, reg_entry, instr, 0, &out_a, &out_b) -- backward direction
The helper sub_8F29C0 at 0x8F29C0 performs predicate-specific analysis, determining whether the predicate condition allows safe propagation given the current instruction context.
The Per-Block Sub-Pass Runner: sub_8F6530 (Variant B Detail)
The 550-line function sub_8F6530 is the core of Variant B (phase 58). It processes a single basic block using a 6-slot circular buffer of instruction pairs, tracked at 56-byte intervals:
sub_8F6530 Context (passed as a1)
+0x000 ctx_ptr -- compilation context
+0x008 flag_ctrl_flow_4 -- from ctx+1396 bit 2 (opcode-7 enable)
+0x009 flag_ctrl_flow_8 -- from ctx+1396 bit 3 (opcode-6 enable)
+0x00C slot_index -- current slot (modulo 6)
+0x010 slot_0_changed -- boolean: did this slot's pair fire?
+0x014 slot_0_count -- how many pairs stored in this slot
Slot layout (each 56 bytes = 7 int64_t):
+0x00 count/used flag
+0x04 changed flag
+0x08 instr_ptr_a -- first instruction of the pair
+0x10 instr_ptr_b -- second instruction of the pair
+0x18 (reserved)
...
6 slots at offsets: +0x10, +0x48, +0x80, +0xB8, +0xF0, +0x128
The slot index increments with (*(a1+3) + 1) % 6 after each pair is processed. When a new instruction pair is encountered that doesn't match any existing slot, the oldest slot is evicted (slot index advances). Each slot can hold up to 2 instruction pointers.
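The slot rotation itself is a plain modulo-6 ring (a sketch with invented types; only the (index + 1) % 6 advance and the pair-per-slot layout mirror the decompilation):

```c
#include <assert.h>
#include <stdint.h>

/* Invented flat view of the 6 slots that sit at 56-byte intervals in
 * the real sub_8F6530 context. */
typedef struct {
    uint64_t instr_a, instr_b; /* the tracked instruction pair */
    int      used;
} Slot;

typedef struct {
    Slot slots[6];
    int  index; /* mirrors *(a1+3) */
} PairRing;

/* Insert a pair, evicting whatever occupies the oldest slot;
 * returns the slot that received the pair. */
static int ring_insert(PairRing *r, uint64_t a, uint64_t b) {
    int i = r->index;
    r->slots[i].instr_a = a;
    r->slots[i].instr_b = b;
    r->slots[i].used = 1;
    r->index = (i + 1) % 6; /* matches (*(a1+3) + 1) % 6 */
    return i;
}
```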
The function walks the instruction list looking for specific opcode patterns:
- Opcodes 139 and 110 (MOV variants with different addressing modes): these are the primary targets. The function checks the operand field at instr+76 for value 6 (register operand) or 7 (immediate operand), with the flag_ctrl_flow_4 and flag_ctrl_flow_8 gates controlling which variants are processed
- For register operands (type field bits 28-30 == 1), it verifies:
  - Use count == 1 (*(reginfo+24) == 1)
  - No aliasing flags (*(reginfo+50) & 1 == 0)
  - Register class not in range 2-8 (*(reginfo+20) - 2 > 6)
- For instructions with opcode 139 and no modifier bits (*(instr+88) & 0x603FFFF == 0), the function attempts to find the instruction in the circular buffer and either promotes it (if found) or inserts it as a new entry
- Option 605 (getOption(ctx, 605)) at 0x8F6530+0x1A0: when enabled, restricts matching to instructions already present in the buffer, preventing new insertions. This is an architecture-gated optimization
Fixed-Point Convergence
Per-Block Iteration Model
All GeneralOptimize variants use a per-block convergence model: they iterate over basic blocks in the order given by the block ordering table at ctx+512 (reverse postorder), and for each block, run the sub-passes repeatedly until convergence. This differs from the global worklist model used by other optimizers (GVN-CSE at phase 49 uses a global worklist).
for each block B in reverse postorder:
repeat:
changed = run_sub_passes(B)
until !changed OR !getOption(464)
The block ordering table is an array of int32_t indices at *(ctx+512), with the count at *(ctx+520). Block iteration starts at index 1 (not 0) and proceeds through bb_count inclusive. Each index is used to look up the actual basic block pointer via *(*(ctx+296) + 8 * block_order[i]).
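The lookup can be sketched in C as follows; the struct names and fields are invented stand-ins for the raw offsets quoted above (ctx+296, ctx+512, ctx+520):

```c
/* Sketch of the ordered block walk: indices come from the int32 table
 * (modeling *(ctx+512), count at *(ctx+520)), are used 1-based through
 * bb_count inclusive, and select block pointers from the pointer array
 * (modeling *(ctx+296)). All names here are illustrative assumptions. */
#include <stdint.h>

typedef struct { int id; } basic_block;

typedef struct {
    basic_block  **blocks;       /* models the pointer array at *(ctx+296) */
    const int32_t *block_order;  /* models the ordering table at *(ctx+512) */
    int64_t        bb_count;     /* models the count at *(ctx+520) */
} opt_ctx;

/* Copy out the blocks in optimization order; returns how many were visited. */
static int collect_ordered_blocks(const opt_ctx *ctx, basic_block **out)
{
    int n = 0;
    for (int64_t i = 1; i <= ctx->bb_count; i++)  /* starts at 1, inclusive */
        out[n++] = ctx->blocks[ctx->block_order[i]];
    return n;
}
```

Note the off-by-one convention: slot 0 of the ordering table is never read, matching the decompiled loop bounds.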
Change Detection Mechanism
Changes are detected through different protocols depending on the variant:
- Variant A (sub_753600): returns a boolean. The return value is the logical OR of all sub-pass fire events. The state machine in sub_7917F0 stores the result in v15 (mapped to register bp) and accumulates across iterations via v4 = v15
- Variant B, phase 58 (sub_8F6530): maintains 7 independent counters at 56-byte intervals in the context structure. Counters are at *(a1 + 5), *(a1 + 19), *(a1 + 33), *(a1 + 47), *(a1 + 61), *(a1 + 75). The corresponding boolean changed-flags are at *(a1 + 16), *(a1 + 72), *(a1 + 128), *(a1 + 184), *(a1 + 240), *(a1 + 296). All are zero-initialized at entry. The caller checks whether any counter is non-zero to determine convergence
- Variant B, phase 37 (sub_90FBA0): uses a different approach -- tracks a floating-point "cost" accumulator at context+25/26/27 (three double values representing total cost, weighted cost, and instruction count). Convergence is determined when the cost delta falls below a threshold (initialized to 0.25, adjustable via knob 474 at 0x90FBA0+0x50). Knob 135 at 0x90FBA0+0x20 controls an initial threshold override when enabled (checked via *(config+9720))
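The phase-37 cost-delta criterion can be sketched like this; run_subpasses is a hypothetical stand-in for one sweep of sub_90FBA0's inner loop, and the threshold models the knob-474 default of 0.25:

```c
/* Sketch of cost-based convergence: iterate until the change in the
 * accumulated cost drops below a threshold. Returns the total number of
 * sweeps executed. Names are illustrative, not recovered symbols. */
#include <math.h>

typedef double (*pass_fn)(void *ctx);

static int iterate_to_cost_fixpoint(void *ctx, pass_fn run_subpasses,
                                    double threshold, int max_iters)
{
    double prev = run_subpasses(ctx);
    for (int i = 1; i < max_iters; i++) {
        double cur = run_subpasses(ctx);
        if (fabs(prev - cur) < threshold)
            return i + 1;  /* converged: cost delta below threshold */
        prev = cur;
    }
    return max_iters;      /* safety cap reached */
}

/* Demo pass whose cost halves every sweep (ctx points at the cost). */
static double halving_pass(void *ctx) { double *c = ctx; *c *= 0.5; return *c; }
```

The point of a delta criterion (rather than a boolean changed flag) is that the pass can keep firing marginal rewrites forever; it stops once further rewrites no longer pay for themselves.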
Iteration Limits
The fixed-point loop is guarded by option 464 in Variant A. In sub_7917F0:
while (true) {
bool changed = sub_753600(&state, bb);
if (!changed) break;
// Option 464 check -- same vtable fast-path pattern:
// vtable[152] == sub_67EB60 => sub_7468B0(config, 464)
// otherwise => vtable[152](config, 464, 1)
if (!getOption_v2(ctx, 464)) break;
sub_753B50(&state); // apply rewrites before re-scanning
}
The option 464 check is called after each successful iteration (when changed == true). If the option returns false, the loop terminates even though more changes could be made. The exact semantics of option 464 depend on the knob's implementation -- it could be a simple counter that decrements, a boolean that gets cleared after N iterations, or a cost-based threshold. The default behavior (when option 464 always returns true) allows unbounded iteration until convergence.
Variant B (phases 37 and 58) does not use option 464 for iteration control. Phase 37 uses the cost-based threshold described above. Phase 58 makes a single pass over the block list via sub_8F6FA0, which does not loop -- each block is visited exactly once, with the 6-slot circular buffer providing limited lookback within the walk.
In practice, most basic blocks converge in 1--3 iterations. A block that generates new optimization opportunities typically does so because copy propagation exposes a constant, which enables constant folding, which creates a dead instruction. The second iteration catches any cascading effects, and the third confirms convergence. Blocks requiring more than 3 iterations are rare and typically involve chains of dependent copies or nested predicate simplifications.
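The cascade can be made concrete with a toy model: copy propagation exposes a constant, constant folding then rewrites an ALU op into an immediate move, and a further sweep records the new constant. This is purely illustrative and uses none of ptxas's actual data structures:

```c
/* Toy per-block fixed point: sweep until no sub-pass fires. */
enum op { MOVI, MOVR, ADD };

typedef struct { enum op op; int dst, src1, src2, imm; } instr;

/* One sub-pass sweep over the block; returns 1 if anything changed. */
static int run_subpasses(instr *code, int n, int *konst, int *known)
{
    int changed = 0;
    for (int i = 0; i < n; i++) {
        instr *I = &code[i];
        if (I->op == MOVI && !known[I->dst]) {
            known[I->dst] = 1; konst[I->dst] = I->imm; changed = 1;
        } else if (I->op == MOVR && known[I->src1] && !known[I->dst]) {
            /* copy propagation: the source is a known constant */
            known[I->dst] = 1; konst[I->dst] = konst[I->src1]; changed = 1;
        } else if (I->op == ADD && known[I->src1] && known[I->src2]) {
            /* constant folding: rewrite to an immediate move */
            I->op = MOVI; I->imm = konst[I->src1] + konst[I->src2]; changed = 1;
        }
    }
    return changed;
}

/* Returns the number of sweeps that fired; one extra confirms convergence. */
static int iterate_block(instr *code, int n)
{
    int konst[16] = {0}, known[16] = {0}, iters = 0;
    while (run_subpasses(code, n, konst, known))
        iters++;
    return iters;
}
```

For r0 = 5; r1 = r0; r2 = r1 + r0, two sweeps fire (the second re-scans the folded MOVI) and a third sweep confirms convergence, matching the 1-3 iteration pattern described above.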
The Apply-Changes Function: sub_753B50
After sub_753600 reports changes, sub_753B50 applies the accumulated transformations. This is a compact 70-line function that performs instruction-list surgery:
- Creates a replacement instruction via sub_931920(ctx, state->instr_pair, *(*(state->instr_pair+8)+8), -1) -- the -1 argument (0xFFFFFFFF) signals "allocate new"
- Updates the block's instruction head at *(ctx+232) with the new instruction's head pointer
- Clears the block's instruction count at *(ctx+264) = 0
- Calls sub_932E80 to relink the instruction into the block's doubly-linked list
- Propagates flags: if the original instruction had flag bit 3 of *(instr+280) set (indicating a control-flow-sensitive instruction), the replacement inherits it via new_instr[70] |= 8
- Walks the state's instruction chain (from state[1] through state[2]), creating replacements for each and calling sub_749090 to update register-to-instruction mappings
- Final cleanup: calls sub_9253C0 to remove the dead instructions from their blocks, sub_749290 to update the register numbering, and sub_91E310 to splice the old instruction range out of the linked list
Differences Between Early/Mid/Late Variants
1. Gate Conditions (Who Runs)
| Phase | Gate Logic |
|---|---|
| 13 (Early) | Requires ctx->flags_1382 & 4; skips if option 214 is set; requires option 487; skips if *(*(ctx)+1056) is non-null |
| 29 | Requires option 487; skips if option 231 (dump mode) is set; requires *(config+33192) check or option 461 pass; skips if function count == 1 |
| 37 (Mid) | Requires sub_8F3EA0 pre-check; option 487; can be disabled via --no-phase ConvertMemoryToRegisterOrUniform; skips if function count == 1 |
| 46 (Mid2) | Indirect dispatch; skips if vtable slot [0x1C0] points to no-op sentinel sub_7D6DD0 |
| 58 (Late) | Requires function count > 2 (not just > 1); checks optimization level bits (ctx+1396 & 0x30) != 0x20; checks option 31 with extended-value semantics |
| 65 (Late2) | Requires function count > 1; indirect dispatch through compilation unit vtable slot at offset 392 |
2. Sub-Pass Selection (What Runs)
| Phase | Sub-Passes Included |
|---|---|
| 13 (Early) | Structural equivalence detection via sub_753600 (def-use chain walking, instruction pair matching, modifier verification, depth-2 sub-DAG comparison via sub_1245740), instruction rewrite via sub_753B50. No instruction-level constant folding. Lightweight -- designed for quick cleanup after initial lowering. |
| 29 | Copy prop with full opcode dispatch (97, 18, 124), predicate-aware propagation via sub_8F2E50/sub_8F29C0, two-pass predicate simplification via sub_908A60, liveness-gated DCE via sub_7DF3A0. Flag marking with 0x100/0x200/0x400 bits. |
| 37 (Mid) | Full sub-pass suite plus ConvertMemoryToRegisterOrUniform (memory-to-register promotion). Bitvector-based change tracking. Cost-driven convergence with configurable threshold (default 0.25, knob 474). Most comprehensive instance. |
| 46 (Mid2) | Architecture-dependent (vtable dispatch). May include additional target-specific simplifications. |
| 58 (Late) | 6-slot circular buffer pattern matching over MOV/copy instructions (opcodes 139, 110). Register use-count and aliasing checks. Option-605-gated restriction mode. Per-block single-pass (no iteration). |
| 65 (Late2) | Architecture-dependent (vtable dispatch). Final cleanup before register allocation. |
3. Infrastructure Weight (How It Runs)
| Phase | Context Size | Tracking | Complexity |
|---|---|---|---|
| 13 (Early) | Minimal (0x88 bytes on stack) | Boolean changed flag | Low (78 lines in sub_7917F0) |
| 29 | Stack frame (~0x60 bytes) | Boolean + instruction flag bits | Medium (218 lines in sub_908EB0) |
| 37 (Mid) | 0x408-byte stack context + heap bitvectors | Cost-based convergence (3 doubles) + bitvector arrays | High (500+ lines in setup + 400+ in loop) |
| 46 (Mid2) | Vtable-dependent | Vtable-dependent | Variable |
| 58 (Late) | 0x168-byte stack context | 7 counters at 56-byte stride + 6-slot circular buffer | Medium-high (550 lines in sub_8F6530) |
| 65 (Late2) | Vtable-dependent | Vtable-dependent | Variable |
Initialization Infrastructure
Two large helper functions set up the state required before the sub-passes can run:
sub_785E20 -- Change Tracking Reset
Called at the start of phase 13 and after the convergence loop completes (if any changes were made). Resets per-block change flags and instruction state. Takes (ctx, 0) -- the second argument selects the reset mode.
sub_781F80 -- Instruction Flag Initialization
A large function (~1800 lines) that walks every instruction in every basic block, setting per-instruction optimization flags. Called with argument 1 to enable full initialization. These flags control which instructions are eligible for the sub-passes: instructions marked with certain flag patterns are skipped by copy prop, others are skipped by the algebraic simplifier.
sub_7E6090 -- Use-Def Chain Builder
Builds operand use-def chains for copy propagation. Called with (ctx, 0, 0, 0, 0) at the start of phases 13 and 58. The zero arguments indicate "build from scratch" rather than incremental update.
sub_7E6AD0 -- Def-Use Link Builder
Builds bidirectional def-use/use-def links. Called only by phase 13 (Variant A). Variant B phases use their own bitvector-based tracking instead.
sub_905B50 -- Bitvector Infrastructure (Phase 37 Only)
A 500+ line setup function specific to GeneralOptimizeMid. Allocates and initializes three major bitvector structures for tracking:
- Register definition reach (which definitions reach each block entry)
- Per-register liveness within basic blocks
- Fold eligibility tracking (which operands have known-constant sources)
These bitvectors are destroyed by RAII-style cleanup after sub_90FBA0 returns, using vtable destructors at offsets +32 in the bitvector vtables.
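For reference, a dense bitvector of the kind sub_905B50 allocates can be modeled as below; the layout and API are illustrative assumptions, not recovered structure definitions:

```c
/* Minimal dense bitvector: one bit per register (or per definition site),
 * backed by an array of 64-bit words. Illustrative sketch only. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t *words; size_t nbits; } bitvec;

static bitvec bv_alloc(size_t nbits)
{
    bitvec v = { calloc((nbits + 63) / 64, sizeof(uint64_t)), nbits };
    return v;  /* calloc zero-fills: all bits start clear */
}
static void bv_set(bitvec *v, size_t i)
{
    v->words[i / 64] |= (uint64_t)1 << (i % 64);
}
static int bv_test(const bitvec *v, size_t i)
{
    return (int)((v->words[i / 64] >> (i % 64)) & 1);
}
static void bv_free(bitvec *v) { free(v->words); v->words = NULL; }
```

The three tracked relations (reaching definitions, per-register liveness, fold eligibility) would each be one such vector per block, which is why phase 37's setup and teardown are so much heavier than the other variants'.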
Pipeline Positioning
The six instances are positioned to clean up after specific groups of transformations:
Phase 0-12: Initial setup, FP16 promotion, unsupported op conversion
--> Phase 13: GeneralOptimizeEarly (clean up after lowering artifacts)
Phase 14-28: Branch opt, loop passes, strength reduction, pipelining
--> Phase 29: GeneralOptimize (clean up after loop transformations)
Phase 30-36: Switch opt, linear replacement, LICM
--> Phase 37: GeneralOptimizeMid (heavy cleanup + mem-to-reg promotion)
Phase 38-45: Nested branch opt, CTA expansion, mbarrier, mid expansion
--> Phase 46: GeneralOptimizeMid2 (clean up after mid-level expansion)
Phase 47-57: GVN-CSE, reassociation, remat, late expansion, speculative hoist
--> Phase 58: GeneralOptimizeLate (clean up after late expansion)
Phase 59-64: Loop fusion, predication, late commoning
--> Phase 65: GeneralOptimizeLate2 (final cleanup before register work)
After phase 65, the pipeline transitions to register-attribute setting (phase 90), synchronization (phase 99), and register allocation (phase 101). No GeneralOptimize instance runs after register allocation -- the post-RA pipeline uses different peephole mechanisms.
Knobs and Options
| Option | Decoded Name | Type | Code Default | Used By | Description |
|---|---|---|---|---|---|
| 31 | AllowReassociateCSE | OKT_INT | unset | Phase 58 | Architecture-dependent fold eligibility gate; extended-value semantics via config+2232/+2240 |
| 135 | ConvertMemoryToRegIndexedSizeLimit | OKT_INT | unset (fallback: 0.25 from knob 474) | Phase 37 | Threshold override for cost-based convergence when *(config+9720) is set; controls indexed-access size limit for memory-to-register conversion |
| 214 | DisableMergeEquivalentConditionalFlow | OKT_NONE | false | Phase 13 only | When present, skips GeneralOptimizeEarly entirely (if (getOption(ctx, 214)) return;) |
| 231 | DisableRedundantBarrierRemoval | OKT_NONE | false | Phase 29 only | Dump mode -- when present, skips GeneralOptimize to preserve IR state for debugging |
| 461 | MembarFlowControl | OKT_INT | unset | Phase 29 | Secondary gate; controls whether memory barrier flow analysis runs during standard GeneralOptimize; passed through sub_661470 |
| 464 | MergeEquivalentConditionalFlowBudget | OKT_BDGT | unset (= unbounded) | Phase 13 (Variant A) | Iteration cap -- budget knob that breaks the fixed-point loop when exhausted; prevents oscillating transformations |
| 474 | MovWeightForConvertMemToReg | OKT_DBL | 0.25 | Phase 37 (sub_90FBA0) | Cost convergence threshold and per-fold cost weight for cost-exempt opcodes (v104 in cost computation) |
| 487 | (not yet decoded) | -- | enabled | Phases 13, 29, 37 | General optimization enable -- master switch for all GeneralOptimize passes |
| 499 | OptBudget | OKT_BDGT | enabled (pass-through) | sub_7DDB50 (opt-level accessor) | Master guard for opt-level accessor; when disabled, caps all opt-level-gated behavior at O1 |
| 605 | ReassociateCSEWindow | OKT_NONE | false | Phase 58 (sub_8F6530) | When present, restricts 6-slot circular buffer matching to existing entries only (no new entries added during walk) |
| limit-fold-fp | -- | bool | "false" (config+340) | Phase 37 | When true, forces conservative fold path via ctx+1379 tier flags; prevents FP folds that could alter precision semantics |
The "ConvertMemoryToRegisterOrUniform" named-phase gate at 0x21DD228 allows phase 37 to be disabled via the --no-phase command-line option.
Function Map
| Address | Name | Role | Confidence |
|---|---|---|---|
0xC5F940 | Phase 13 execute | Tail-calls 0x1C64BF0 (single-func) or sub_7917F0 (multi-func) | CERTAIN |
0xC5FC50 | Phase 29 execute | Checks count > 1, calls sub_908EB0 | CERTAIN |
0xC5FD70 | Phase 37 execute | Checks count > 1, calls sub_910840 | CERTAIN |
0xC60840 | Phase 46 execute | Indirect vtable dispatch through comp_unit->vtable[0x1C0] | CERTAIN |
0xC5FF20 | Phase 58 execute | Checks count > 1, calls sub_8F7080 | CERTAIN |
0xC60550 | Phase 65 execute | Checks count > 1, indirect dispatch through comp_unit->vtable[392] | CERTAIN |
0x7917F0 | GeneralOptimizeEarly body | Multi-function path: iterates blocks, fixed-point loop with sub_753600 | HIGH |
0x908EB0 | GeneralOptimize body | Per-block copy prop + predicate simplification with flag marking | HIGH |
0x910840 | GeneralOptimizeMid body | Full suite with mem-to-reg; delegates to sub_905B50 + sub_90FBA0 | HIGH |
0x8F7080 | GeneralOptimizeLate body | Bitvector-tracked 7-counter pass; calls sub_8F6FA0 | HIGH |
0x753600 | Per-block sub-pass runner (Early) | Structural equivalence detection on def-use chains; returns boolean changed | HIGH |
0x753B50 | Per-block apply changes (Early) | Instruction rewriting: sub_931920, sub_932E80, sub_749090, sub_9253C0 | HIGH |
0x753480 | Chain walker (Early) | Walks single-def chain forward, checking sub_7E5120 eligibility | HIGH |
0x753520 | Pair detector (Early) | Finds opcode-93 instruction in chain via sub_753480 | HIGH |
0x753570 | Secondary pair detector (Early) | Finds second opcode-93 link referencing back to primary | HIGH |
0x753DB0 | Chain tail finder (Early) | Walks opcode-97 links to find end of chain | MEDIUM |
0x753E30 | Secondary chain matcher (Early) | Handles register renaming boundaries; stores a1[7..9] | MEDIUM |
0x753F70 | Vtable rewrite dispatcher (Early) | Calls comp_unit->vtable[656]; constructs opcode-93 replacements | HIGH |
0x7E5120 | Chain eligibility predicate | Checks constant bank, block region, opcode 91 | HIGH |
0x8F6530 | Per-block sub-pass runner (Late) | 6-slot circular buffer; 7-counter change tracking; 550-line function | HIGH |
0x8F6FA0 | Block iterator (Late) | Walks block list calling sub_8F6530 per block; single pass, no iteration | HIGH |
0x905B50 | Setup/init (Mid) | ~500 lines; creates bitvector infrastructure; 3 tracked structures | HIGH |
0x90FBA0 | Main loop (Mid) | Cost-based instruction-level iteration with register bank analysis | HIGH |
0x90EF70 | Register promotion (Mid) | Memory-to-register conversion; threshold-based (default 0.93, knob 136) | HIGH |
0x903A10 | Register bank helper (Mid) | Per-instruction register bank assignment for LD/ST materialization | MEDIUM |
0x8F3FE0 | Register constraint fold validator (Mid) | Validates all source operand types are 2/3 and sub_91D150 constraints match cached values; queries vtable[904] for element size and vtable[936] for fold metadata | HIGH |
0x8F2E50 | Fold eligibility check | Two-path dispatch: opcode 18 checks source types 2/3 + sub_91D150 constraints; opcode 124 checks dest type 1/2 + SM <= 20479 threshold + constraint bits & 0x1C00 | HIGH |
0x8F29C0 | Architecture predicate query | 9 lines; returns sub_7DC0E0(cu) || sub_7DC050(cu) || sub_7DC030(cu) on ctx+1584 | HIGH |
0x908A60 | Two-pass predicate simplify | Called with direction flag (1 = forward, 0 = backward) | HIGH |
0x785E20 | Change tracking reset | Resets per-block change flags | MEDIUM |
0x781F80 | Instruction flag init | Initializes per-instruction optimization flags (~1800 lines) | MEDIUM |
0x7E6090 | Use-def chain builder | Builds operand use-def chains; called with (ctx, 0, 0, 0, 0) | HIGH |
0x7E6AD0 | Def-use link builder | Builds def-use/use-def bidirectional links | HIGH |
0x7DF3A0 | Liveness check | Returns status pointer; bits 2-3 (& 0xC) indicate live uses | HIGH |
0x7E7380 | Predicate-operand compatibility | Narrow check: predicate modifier parity + last-operand 24-bit ID + penultimate 8-byte encoding (not full structural comparison) | HIGH |
0x747F40 | Negation flag extractor | Extracts negation modifier from operand encoding | HIGH |
0x747F80 | Absolute-value flag extractor | Extracts abs modifier from operand encoding | HIGH |
0x748570 | Alias hazard check | Returns true if operand has aliasing hazard | MEDIUM |
0x1245740 | Sub-DAG equivalence | Compares two instruction sub-DAGs for structural equivalence (arg 2 = depth) | HIGH |
0x91D150 | Register constraint lookup | Trivial: return *(*(ctx+440) + 4*reg_index); 0 = no fold-blocking constraint | CERTAIN |
0x91E860 | Use-count estimator | Returns estimated use count for cost-based decisions (used by phase 37) | MEDIUM |
0xA9BD30 | Register-class remapper | Maps opcode indices in set {1,2,3,7,11,15,20,24} via vtable[632]; writes value 0x60000000 (constant class marker) | HIGH
0x1249B50 | SASS-level integer ALU fold | Combines IMAD_WIDE/IADD3/SGXT/CCTLT (opcodes 2,3,5,110) with MOV source pairs via sub_1249940 and sub_1245740 | HIGH |
0x1249940 | MOV-pair fold combiner | Matches two MOV-from-immediate (opcode 139) instructions feeding an ALU op; validates structural equivalence at depth 1 and 2 | HIGH |
0x7E19E0 | Operand info extractor | Builds 52-byte operand descriptor for opcodes 2,3,5,6,7; classifies source types and constant bank membership | MEDIUM |
0x7DC0E0 | Architecture capability check A | Checks compilation unit capability flag; used by sub_8F29C0 for predicate fold safety | MEDIUM |
0x7DC050 | Architecture capability check B | Secondary capability check for sub_8F29C0 | MEDIUM |
0x7DC030 | Architecture capability check C | Tertiary capability check for sub_8F29C0 | MEDIUM |
Cross-References
- Pass Inventory -- full 159-phase table with GeneralOptimize instances highlighted
- Phase Manager -- dispatch loop, vtable protocol, factory switch at sub_C60D30
- Optimization Pipeline -- overall pipeline stages
- Copy Propagation & CSE -- standalone copy propagation passes (phases 49, 50, 64, 83)
- Liveness Analysis -- standalone OriPerformLiveDead passes (heavier DCE)
- Peephole Optimization -- MainPeepholeOptimizer; handles constant-identity patterns (x+0, x*1, x&0, etc.)
- Strength Reduction -- standalone strength reduction pass (phase 26)
- Knobs System -- MergeEquivalentConditionalFlowBudget (464, iteration cap), option 487 (general opt enable), OptBudget (499, opt-level guard), AllowReassociateCSE (31), MovWeightForConvertMemToReg (474, cost threshold), limit-fold-fp
- Optimization Levels -- knob 499 (OptBudget) as opt-level accessor guard
Branch & Switch Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Four phases in the ptxas pipeline transform branch and switch-statement control flow in the Ori IR. Two phases optimize switch statements (phases 14 and 30), one performs general branch simplification (phase 15), and one flattens nested conditional branches (phase 38). Together they reduce branch count, eliminate unreachable code, and prepare the CFG for downstream passes like predication (phase 63), liveness analysis (phase 16), and loop canonicalization (phase 18).
These phases operate on the Ori IR before register allocation and scheduling. At this pipeline stage, branch instructions use the Ori OEN opcode (SASS BRA), conditional execution is controlled by predicate registers (P0--P6, PT), and the CFG is a hash-map-based structure with FNV-1a-keyed successor/predecessor edges.
| DoSwitchOptFirst | Phase 14 -- vtable at off_22BD7F8 |
| OriBranchOpt | Phase 15 -- vtable at off_22BD820 |
| DoSwitchOptSecond | Phase 30 -- vtable at off_22BDA78 |
| OptimizeNestedCondBranches | Phase 38 -- vtable at off_22BDBB8 |
| Phase factory | sub_C60D30 cases 14, 15, 30, 38 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers |
| Key opcodes | OEN (BRA), OFFL (BSSY), OFLAP (BSYNC) |
| CFG infrastructure | FNV-1a hash maps at Code Object +648 (successors), +680 (backedges) |
| Related passes | 31 OriLinearReplacement, 63 OriDoPredication, 80 ExpandJmxComputation, 133/136 MergeEquivalentConditionalFlow |
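The edge maps are described as FNV-1a-keyed. For reference, the standard 64-bit FNV-1a over a byte key looks like this; exactly how ptxas forms the key bytes (block IDs, pointers, or something else) is left open here:

```c
/* Standard 64-bit FNV-1a: xor the byte in, then multiply by the FNV prime.
 * The constants are the published FNV-1a parameters, not values recovered
 * from the binary. */
#include <stdint.h>
#include <stddef.h>

static uint64_t fnv1a64(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                         /* xor first ...            */
        h *= 1099511628211ULL;             /* ... then multiply (1a order) */
    }
    return h;
}
```

FNV-1a is a natural choice for CFG maps: it is a few instructions per byte, needs no tables, and distributes small integer keys well enough for open-addressed buckets.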
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG (predecessors, successors, RPO, dominators)
Phase 6 SetControlFlowOpLastInBB ── ensures branches are last in each block
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds branch info)
Phase 14 DoSwitchOptFirst ── SWITCH OPTIMIZATION (1st pass)
Phase 15 OriBranchOpt ── BRANCH SIMPLIFICATION
Phase 16 OriPerformLiveDeadFirst ── DCE cleanup of dead branches
...
Phase 29 GeneralOptimize ── const fold after loop transforms
Phase 30 DoSwitchOptSecond ── SWITCH OPTIMIZATION (2nd pass)
Phase 31 OriLinearReplacement ── branchless replacement
...
Phase 37 GeneralOptimizeMid ── const fold + copy prop (feeds nested cond info)
Phase 38 OptimizeNestedCondBranches ── NESTED CONDITIONAL FLATTENING
...
Phase 63 OriDoPredication ── if-conversion (converts short branches to predicates)
...
Phase 80 ExpandJmxComputation ── expands jump-table index computations
...
Phase 133 MergeEquivalentConditionalFlow ── tail merging
Phase 136 LateMergeEquivalentConditionalFlow
Why Two DoSwitchOpt Passes?
The first pass (phase 14) runs immediately after the initial GeneralOptimizeEarly compound pass. At this point, constant folding and copy propagation have resolved many switch selector values, enabling the optimizer to determine case density and choose a lowering strategy.
The second pass (phase 30) runs after loop unrolling (phase 22), strength reduction (phase 21), SSA phi insertion (phase 23), and software pipelining (phase 24). These transformations can expose new switch patterns -- particularly after loop unrolling duplicates switch bodies, creating opportunities for case clustering that were not visible before.
Despite their names, the two passes use different dispatch paths. Phase 14 dispatches through the SM backend's vtable at offset +136 (*(*(ctx+1584)+136)), making it a polymorphic, architecture-specific switch optimization. Phase 30 calls the generic switch optimization core (sub_77CF40 via sub_791F00). This means phase 14 runs whatever switch optimization the current SM target provides, while phase 30 always runs the generic algorithm. The two passes share pipeline position semantics (first pass vs. second pass) but not necessarily the same code.
DoSwitchOpt -- Switch Statement Optimization (Phases 14, 30)
Overview
DoSwitchOpt transforms high-level switch statements from their initial representation as cascading conditional branches into one of three lowered forms, selected based on case density and count. The input is a chain of ISETP (integer set-predicate) + BRA (conditional branch) instruction pairs that compare the switch selector against successive case constants. The output is one of:
- Jump table -- a contiguous array of branch targets indexed by the selector value
- Binary search tree -- a balanced tree of comparisons that narrows the target in O(log n)
- Cascading if-else chain -- the original form, retained when the switch is small or sparse
Input: Switch Pattern Recognition
The pass scans each basic block for a characteristic pattern:
// Input: cascading if-else for switch(x)
BB0:
ISETP.EQ P0, x, #case_0 // compare selector against constant
@P0 BRA target_0 // conditional branch to case body
ISETP.EQ P0, x, #case_1
@P0 BRA target_1
ISETP.EQ P0, x, #case_2
@P0 BRA target_2
...
BRA default_target // fallthrough to default case
The recognizer collects:
- The selector register (the common first operand of all ISETP instructions)
- The set of case constants (immediate operands of each ISETP)
- The branch targets (one per case, plus the default target)
- The case count N
Decision: Strategy Selection
The strategy is selected by evaluating case density and count:
function select_switch_strategy(cases[], N, min_val, max_val):
range = max_val - min_val + 1
density = N / range // fraction of range covered by cases
if N <= SMALL_SWITCH_THRESHOLD: // observed: ~4 cases
return CASCADING_IF_ELSE // keep original form
if density >= JUMP_TABLE_DENSITY: // observed: ~0.4 (40%)
if range <= MAX_JUMP_TABLE_SIZE: // observed: ~1024 entries
return JUMP_TABLE
return BINARY_SEARCH_TREE
The thresholds are not configurable via the knob system. They are hardcoded constants in the phase execute function.
Jump table is preferred when case values are dense -- the selector maps directly to a table index with a subtraction and a bounds check. This produces the fastest code but consumes table storage proportional to the value range (not the case count).
Binary search tree is the default for large sparse switches. The pass sorts case constants and generates a balanced BST of ISETP + BRA pairs. Each comparison eliminates half the remaining candidates, reaching the target in O(log N) branches.
Cascading if-else is retained for small switches (typically 4 or fewer cases) where the overhead of a jump table or BST setup exceeds the cost of linear comparison.
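The selection heuristic above can be written out directly. The thresholds used here are the observed hardcoded values (~4 cases, ~0.4 density, ~1024 entries); they are estimates recovered from the binary, not exact or configurable constants:

```c
/* Executable version of select_switch_strategy from the pseudocode above.
 * Threshold values are approximations observed in the binary. */
enum strategy { CASCADING_IF_ELSE, JUMP_TABLE, BINARY_SEARCH_TREE };

static enum strategy select_switch_strategy(int n_cases, int min_val, int max_val)
{
    int range = max_val - min_val + 1;
    double density = (double)n_cases / (double)range;

    if (n_cases <= 4)                     /* small: linear compares win */
        return CASCADING_IF_ELSE;
    if (density >= 0.4 && range <= 1024)  /* dense and bounded: table */
        return JUMP_TABLE;
    return BINARY_SEARCH_TREE;            /* large and/or sparse */
}
```

Note that a dense switch whose value range still exceeds the table-size cap falls through to the BST path, matching the nested check in the pseudocode.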
Output: Jump Table Lowering
For jump-table-eligible switches, the pass produces:
// Output: jump table lowering
BB_switch:
IADD3 Rtmp, Rselector, -min_val, RZ // normalize to 0-based index
ISETP.GE.U32 P0, Rtmp, #range // bounds check (unsigned)
@P0 BRA default_target // out-of-range -> default
// The jump table index computation is left as a pseudo-instruction
// that phase 80 (ExpandJmxComputation) expands later into:
// LEA Raddr, Rtmp, #table_base, 2 // Raddr = table_base + index * 4
// BRX Raddr, #table_base // indexed branch
The actual BRX (branch indexed) instruction is a SASS-level indirect branch through a table embedded in the .text section. Each table entry is a 4-byte relative offset. Phase 80 (ExpandJmxComputation) runs much later (after legalization) to expand the index computation pseudo-instruction into the final LEA + BRX sequence.
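The IADD3 + ISETP.GE.U32 pair implements the classic single-comparison range check: after subtracting min_val, one unsigned compare covers both "below min" (which wraps to a huge unsigned value) and "above max". In C terms:

```c
/* Single-comparison bounds check, as emitted by the jump-table prologue.
 * A selector below min_val wraps around to a large unsigned index and
 * fails the same `idx < range` test as an index past the table end. */
#include <stdint.h>

static int in_table_range(int32_t sel, int32_t min_val, uint32_t range)
{
    uint32_t idx = (uint32_t)(sel - min_val);  /* normalize to 0-based index */
    return idx < range;                        /* one unsigned compare */
}
```

This is why the SASS sequence needs only one ISETP before the BRX rather than separate lower- and upper-bound branches.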
Output: Binary Search Tree Lowering
For BST-eligible switches:
function emit_bst(cases[], lo, hi, selector, default_target):
if lo > hi:
emit: BRA default_target
return
mid = (lo + hi) / 2
if lo == hi:
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit: BRA default_target
return
emit: ISETP.LT P0, selector, #cases[mid].value
emit: @P0 BRA left_subtree_label
// Right subtree (selector >= cases[mid].value)
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit_bst(cases, mid+1, hi, selector, default_target)
// Left subtree (selector < cases[mid].value)
left_subtree_label:
emit_bst(cases, lo, mid-1, selector, default_target)
This produces a balanced tree with depth ceil(log2(N+1)). Each internal node performs at most two comparisons (less-than and equality), though the pass may optimize nodes with consecutive case values to use range checks.
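The depth claim can be checked by counting the deepest chain of node visits the recursion above would emit; this mirrors emit_bst's lo/mid/hi splitting without generating any instructions:

```c
/* Depth of the comparison tree emit_bst would generate for cases[lo..hi].
 * Each internal node counts as one level (its ISETP.LT / ISETP.EQ pair). */
static int bst_depth(int lo, int hi)
{
    if (lo > hi)
        return 0;           /* empty subtree: just a BRA to default */
    if (lo == hi)
        return 1;           /* leaf: single equality test */
    int mid = (lo + hi) / 2;
    int left  = bst_depth(lo, mid - 1);
    int right = bst_depth(mid + 1, hi);
    return 1 + (left > right ? left : right);
}
```

For N = 7 and N = 15 cases this yields depths 3 and 4, i.e. ceil(log2(N+1)) as stated.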
GPU-Specific: SIMT Divergence Impact
Switch optimization interacts directly with SIMT execution. On a GPU, when threads in a warp take different switch cases, the warp diverges and each case executes serially. The optimizer considers this:
- Jump tables produce a single divergence point at the
BRXinstruction. All threads that pick the same case reconverge naturally. The hardwareBSSY/BSYNC(branch sync stack push/pop) mechanism ensures reconvergence after the switch. - BST lowering produces O(log N) potential divergence points. Threads that agree on the BST path stay converged; threads that disagree at each BST node split into independently masked sub-warps.
- Cascading if-else produces N potential divergence points. Each comparison can split the warp.
For GPU code, jump tables are strongly preferred when density permits, because they minimize the number of divergence points to exactly one (the BRX), regardless of case count.
OriBranchOpt -- Branch Simplification (Phase 15)
Overview
OriBranchOpt performs four categories of CFG-level simplification on the Ori IR. It runs as a single pass that iterates over all basic blocks and applies the following transformations until no further changes occur:
- Unconditional branch folding -- eliminates BRA instructions that jump to the immediately following block
- Unreachable block elimination -- removes basic blocks with no predecessors (except the entry block)
- Conditional branch simplification -- simplifies conditional branches where the condition is provably constant or the true/false targets are identical
- Branch chain threading -- redirects branches that target blocks consisting of a single unconditional BRA directly to the final destination
Transformation 1: Unconditional Branch Folding
When a basic block ends with an unconditional BRA to the block that immediately follows in layout order, the branch is redundant and is deleted:
// Before: // After:
BB_A: BB_A:
... ...
BRA BB_B // fallthrough
BB_B: BB_B:
... ...
This is the most common transformation. It arises frequently after switch optimization introduces new blocks and after loop unrolling creates copies of loop bodies that end with unconditional jumps back to the next iteration.
Transformation 2: Unreachable Block Elimination
Other branch simplifications may redirect branches away from certain blocks; once those blocks lose all predecessors they become unreachable, and the pass deletes them:
function eliminate_unreachable(func):
for each block B in func (excluding entry):
if predecessor_count(B) == 0:
// Remove B from successor lists of all blocks
// Delete all instructions in B
// Remove B from the block list
// Update CFG hash maps
The CFG hash maps at Code Object offsets +648 (successors) and +680 (backedges) must be updated atomically with block deletion to maintain consistency for downstream passes.
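The cascading nature of the deletion (removing one block can orphan its successors) can be sketched with a simple predecessor-count model; the CFG representation here is illustrative, not the FNV-1a hash maps the real pass updates:

```c
/* Sketch of transformation 2: repeatedly delete non-entry blocks whose
 * predecessor count has dropped to zero. succ[b] lists the successors of
 * block b, terminated by -1. dead[] is the output. Illustrative only. */
#define MAXB 32

static int eliminate_unreachable(int nblocks, int succ[][4], int *dead)
{
    int preds[MAXB] = {0}, removed = 0, changed = 1;
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; succ[b][i] >= 0; i++)
            preds[succ[b][i]]++;
    while (changed) {
        changed = 0;
        for (int b = 1; b < nblocks; b++) {  /* never delete the entry block */
            if (dead[b] || preds[b] != 0)
                continue;
            dead[b] = 1; removed++; changed = 1;
            for (int i = 0; succ[b][i] >= 0; i++)
                preds[succ[b][i]]--;         /* detach b's out-edges */
        }
    }
    return removed;
}
```

Decrementing the successors' counts when a block dies is what lets whole unreachable chains collapse in one call rather than needing repeated passes from the outside.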
Transformation 3: Conditional Branch Simplification
Two sub-cases:
Constant condition. If copy propagation or constant folding (in the preceding GeneralOptimizeEarly, phase 13) has determined that a predicate register always holds a known value at the branch point, the conditional branch is replaced:
// Before: condition always true // After:
BB: BB:
ISETP.EQ PT, R0, R0 // (deleted -- tautology)
@PT BRA target BRA target // unconditional
BRA fallthrough // (deleted)
Equivalent targets. If both the taken and not-taken paths of a conditional branch go to the same block, the condition test is dead and the branch becomes unconditional:
// Before: both targets identical // After:
BB: BB:
@P0 BRA target BRA target // unconditional
BRA target // (deleted)
Transformation 4: Branch Chain Threading
When a branch targets a block whose only content is another unconditional branch, the pass redirects the original branch directly to the final target:
// Before: // After:
BB_A: BB_A:
@P0 BRA BB_B @P0 BRA BB_C // threaded
BB_B: // BB_B may become unreachable
BRA BB_C BB_C:
BB_C: ...
...
The pass applies threading iteratively, following chains of single-branch blocks until a non-trivial block is reached. A depth limit prevents infinite loops on pathological CFGs with cycles of empty blocks (which should not exist in well-formed IR but are guarded against defensively).
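The bounded chain walk can be modeled as follows. The `trivial` map is a hypothetical representation (block -> target for blocks whose only instruction is an unconditional BRA); the real pass inspects each block's instruction list directly:

```python
def ultimate_target(block, trivial, depth_limit=8):
    """Follow chains of single-BRA blocks to the final destination.
    The depth limit bounds the walk even on a (malformed) cycle of
    empty blocks, mirroring the defensive guard described above."""
    for _ in range(depth_limit):
        if block not in trivial:
            break                 # reached a non-trivial block
        block = trivial[block]
    return block
```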
Fixed-Point Iteration
The four transformations are applied in a worklist-driven loop. Each transformation can enable others:
- Threading can make intermediate blocks unreachable (enables transformation 2)
- Unreachable block elimination can make remaining branches target the immediately following block (enables transformation 1)
- Folding can expose equivalent-target conditionals (enables transformation 3)
The pass terminates when a full iteration over all blocks produces no changes.
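The driver loop has this general shape (a generic sketch, not the recovered implementation). Note the use of `|=` rather than a short-circuiting `any()`, so every transformation runs on every sweep:

```python
def run_to_fixpoint(state, transforms, max_sweeps=1000):
    # Each transform mutates `state` and reports whether it changed it.
    for _ in range(max_sweeps):
        changed = False
        for t in transforms:
            changed |= t(state)
        if not changed:
            break                 # full sweep with no changes: done
    return state
```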
OptimizeNestedCondBranches -- Nested Conditional Flattening (Phase 38)
Overview
Phase 38 targets a specific control flow pattern: nested conditional branches that test related predicates. This pattern commonly arises from C/C++ code with compound conditions (if (a && b), if (a || b)) and from switch-case fall-through after DoSwitchOpt lowering.
The pass runs after GeneralOptimizeMid (phase 37), which provides fresh constant folding and copy propagation results. It runs before OriDoPredication (phase 63), feeding it simpler CFG patterns that are easier to convert to predicated code.
Pattern: Nested If-Then
// Before: nested conditional
BB_outer:
@P0 BRA BB_inner
BRA BB_merge
BB_inner:
@P1 BRA BB_body
BRA BB_merge
BB_body:
... body instructions ...
BRA BB_merge
BB_merge:
...
// After: flattened with combined predicate
BB_entry:
LOP3 Ptmp, P0, P1, 0xC0 // Ptmp = P0 AND P1
@Ptmp BRA BB_body
BRA BB_merge
BB_body:
... body instructions ...
BRA BB_merge
BB_merge:
...
The LOP3 (3-input logic) instruction with truth table 0xC0 computes AND. This combines two branch tests into one, eliminating a basic block and a divergence point.
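The truth-table convention can be checked directly, assuming the usual LOP3 encoding where bit `(a<<2)|(b<<1)|c` of the 8-bit immediate gives the result:

```python
def lop3(a, b, c, lut):
    # a, b, c are each 0 or 1; lut is the 8-bit truth-table immediate.
    idx = (a << 2) | (b << 1) | c
    return (lut >> idx) & 1
```

For every input combination, 0xC0 behaves as `a AND b` and 0xFC as `a OR b`, independent of the third input.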
Pattern: Nested If-Or
// Before: short-circuit OR
BB_test1:
@P0 BRA BB_body // first condition true -> body
BRA BB_test2
BB_test2:
@P1 BRA BB_body // second condition true -> body
BRA BB_merge // both false -> merge
BB_body:
...
// After: flattened with OR predicate
BB_entry:
LOP3 Ptmp, P0, P1, 0xFC // Ptmp = P0 OR P1
@Ptmp BRA BB_body
BRA BB_merge
BB_body:
...
Safety Constraints
The pass applies these transformations only when:
- No side effects between the nested branches -- the intermediate block must contain only the branch instruction (and optionally predicate-setting ISETP/FSETP instructions)
- No live-out values from the intermediate block other than the predicate -- if the intermediate block defines registers used after the merge, the transformation would change semantics
- Both branches target the same merge point -- the not-taken path of both the outer and inner branches must reach the same merge block
- The predicates are independent -- P0 and P1 must not be related by a def-use chain within the nested pattern (otherwise folding changes the evaluation order)
Relationship to Predication
Phase 38 is a stepping stone toward phase 63 (OriDoPredication). By reducing nested branches to single-level branches, it creates more opportunities for if-conversion -- the predication pass can then convert the single remaining branch into a fully predicated (branchless) instruction sequence.
The transformation pipeline for an if (a && b) { x = y; } pattern is:
Phase 38: nested {if(a) { if(b) { ... }}} --> if(a AND b) { ... }
Phase 63: if(a AND b) { x = y; } --> @(a AND b) MOV x, y
Without phase 38, the predication pass would see a multi-level branch diamond that exceeds its nesting-depth threshold, and both branches would remain in the output.
GPU-Specific Considerations
SIMT Divergence and Reconvergence
On NVIDIA GPUs, branch optimization has a direct impact on warp execution efficiency. Every conditional branch is a potential divergence point where threads in a 32-thread warp may take different paths. Divergence serializes execution: the warp must execute both paths, masking inactive threads.
The BSSY (branch sync stack push) / BSYNC (branch sync) mechanism on modern NVIDIA architectures (sm_75+) manages reconvergence:
BSSY B0, reconvergence_point // push reconvergence point onto sync stack
@P0 BRA taken_path // diverge
... not-taken path ...
BSYNC B0 // threads arriving here wait
taken_path:
... taken path ...
BSYNC B0 // all threads reconverge here
reconvergence_point:
... // continue with full warp
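A toy cost model makes the serialization point concrete. This is entirely illustrative -- real warp scheduling costs are more subtle -- but it captures why a divergent diamond pays for both sides while a uniform branch pays for one:

```python
def diamond_cost(taken_mask, cost_taken, cost_not_taken, warp_size=32):
    """If any lane in the warp takes a path, the whole warp executes
    that path (inactive lanes masked); a uniform branch pays for
    exactly one side."""
    full = (1 << warp_size) - 1
    cost = 0
    if taken_mask != 0:          # at least one lane takes the branch
        cost += cost_taken
    if taken_mask != full:       # at least one lane falls through
        cost += cost_not_taken
    return cost
```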
Branch optimization directly reduces the number of BSSY/BSYNC pairs needed:
- Branch folding (phase 15) eliminates unconditional branches that do not cause divergence but still consume BSSY/BSYNC bookkeeping
- Nested conditional flattening (phase 38) reduces two nested BSSY/BSYNC regions to one, cutting sync-stack depth by one level
- Jump table lowering (phases 14/30) collapses N divergence points into one BRX instruction
Reconvergence Stack Depth
The hardware branch sync stack has finite depth (varies by architecture, typically 16--32 entries on sm_75+). Deeply nested branches can overflow the stack, causing hardware serialization or requiring the compiler to restructure control flow. Branch optimization reduces sync-stack pressure by flattening nesting.
Uniform Branches
When all threads in a warp evaluate a branch condition identically (uniform branch), no divergence occurs. The optimizer detects uniform branches via the AnalyzeUniformsForSpeculation pass (phase 27) and the OriPropagateVarying passes (phases 53, 70). Uniform branches are cheaper because:
- No BSSY/BSYNC is needed (the warp stays converged)
- On sm_75+, uniform branches can use the UBRA (uniform branch) encoding, which has lower latency
Branch optimization interacts with uniformity analysis: simplifications that eliminate branches also eliminate divergence-point metadata, and conversely, branches proven uniform may not need optimization because their execution cost is already minimal.
Switch Tables and Warp Divergence
A switch with K active cases in a 32-thread warp incurs at most K serialized case executions (one per unique case value across threads). Jump table lowering does not change this thread-level divergence cost, but it does change the instruction-level cost:
| Strategy | Instructions (worst case) | Divergence points | Sync-stack entries |
|---|---|---|---|
| Cascading if-else (N cases) | 2N (ISETP + BRA per case) | N | N |
| BST (N cases) | 2 * ceil(log2(N)) | ceil(log2(N)) | ceil(log2(N)) |
| Jump table | 3 (IADD3 + ISETP + BRX) | 1 | 1 |
The jump table is strongly preferred for GPU execution because it minimizes sync-stack entries to exactly 1, regardless of case count.
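The table above reduces to a small cost function (a sketch; the function and key names are illustrative, not recovered from the binary):

```python
import math

def switch_costs(n_cases):
    """Worst-case instruction count, divergence points, and sync-stack
    entries for the three lowering strategies."""
    log_n = math.ceil(math.log2(n_cases))
    return {
        "cascade":    {"inst": 2 * n_cases, "div": n_cases, "stack": n_cases},
        "bst":        {"inst": 2 * log_n,   "div": log_n,   "stack": log_n},
        "jump_table": {"inst": 3,           "div": 1,       "stack": 1},
    }
```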
Implementation Details
Phase Vtable Structure
All four phases follow the standard 16-byte phase object model. Each vtable has three methods: +0 execute, +8 getPhaseNumber, +16 isNoOp.
| Phase | Factory case | Vtable address | execute body | isNoOp |
|---|---|---|---|---|
| 14 DoSwitchOptFirst | case 14 | off_22BD7F8 | sub_C5F720 (42B) | returns false |
| 15 OriBranchOpt | case 15 | off_22BD820 | sub_C5F950 (34B) | returns false |
| 30 DoSwitchOptSecond | case 30 | off_22BDA78 | sub_C5FC80 (34B) | returns false |
| 38 OptimizeNestedCondBranches | case 38 | off_22BDBB8 | sub_C5FA70 (34B) | returns false |
All four isNoOp methods return false unconditionally -- gating is performed inside the execute body, not via isNoOp. Each execute body calls sub_7DDB50 (156B), which reads the optimization level from compilation_context+2104 and checks knob 499. The guard is opt_level > 1, so these phases execute at -O2 and above. At -O0 and -O1, sub_7DDB50 returns 1 and the execute body returns without action.
Execute Body Details
Phase 14 -- sub_C5F720 (42 bytes). After the sub_7DDB50 gate, dispatches through the SM backend object's vtable: (*(*(ctx+1584) + 136))(*(ctx+1584)). Offset +136 is vtable slot 17 on the SM backend. This is a polymorphic call -- each SM target (sm_50, sm_75, sm_89, sm_100, etc.) provides its own switch optimization implementation. The SM backend object at compilation_context+1584 is documented in data-structures.md.
Phase 15 -- sub_C5F950 (34 bytes). After the gate, calls sub_7917F0 (529B) directly -- no polymorphic dispatch. sub_7917F0 is the branch simplification core:
- Checks context+1382 bit 2 (CFG validity flag) -- returns immediately if clear
- Checks knob 214 via the knob state dispatcher -- if set, skips the pass (OriBranchOpt disable switch)
- Checks knob 487 (general optimization enablement)
- Calls sub_785E20 (266B) to rebuild the CFG
- Calls sub_781F80 (8335B) for block preparation infrastructure
- Calls sub_7E6090 (2614B) to scan branch patterns and sub_7E6AD0 (33B) for chain setup
- Iterates over basic blocks in RPO order (block list at *(ctx+296), RPO indices at *(ctx+512)). For each block, calls sub_753600 (1351B) for the transformation, with a convergence loop gated by knob 464
- After convergence, calls sub_785E20 again to finalize the CFG
Phase 30 -- sub_C5FC80 (34 bytes). After the gate, calls sub_791F00(ctx, 1). The second argument 1 indicates this is the second switch optimization pass. sub_791F00 (587B) performs lazy initialization of a 152-byte SwitchOptContext cached at code_object+1288:
SwitchOptContext (152 bytes, allocated at code_object+1288):
+0 back-pointer to code object
+8 allocator reference (from code_object+16)
+16 case collection array (capacity = block_count + 2)
+56 secondary collection array
+96 code_object reference copy
+104 initialized sentinel (0xFFFFFFFF)
+112 tertiary collection array
After setup, sub_791F00 calls sub_77CF40 (4698B, 987 instructions) -- the main switch optimization algorithm containing pattern matching, strategy selection (jump table vs. BST vs. cascading if-else), and code emission.
Phase 38 -- sub_C5FA70 (34 bytes). After the gate, calls sub_A0F020 (2375B, 563 instructions) directly. sub_A0F020 implements the nested conditional optimizer as a fixed-point loop. It allocates a 16-byte work context at code_object+1648 (lazy init), then iterates: scan blocks for nested branch patterns, combine predicates with LOP3, remove eliminated blocks, repeat until stable. The function also accesses code object fields +832 (block hash map) and +856 (edge data) for CFG manipulation.
Knob Gating Summary
| Knob | Index | Effect | Checked by |
|---|---|---|---|
| ConvertUnsupportedOps | 499 | Master opt-level gate (all 4 phases) | sub_7DDB50 |
| OriBranchOpt disable | 214 | Disables branch simplification | sub_7917F0 (phase 15) |
| General optimization | 487 | Enables/disables optimizer passes | sub_7917F0 (phase 15) |
| Convergence loop | 464 | Gates the fixed-point iteration | sub_7917F0 (phase 15) |
Interaction with ExpandJmxComputation (Phase 80)
Phase 80 is the delayed lowering phase for jump table index computations created by DoSwitchOpt. The separation exists because:
- Jump table index computation requires knowing the final table address, which is not available until after legalization
- Intervening optimization passes (GVN-CSE, strength reduction) may simplify the index computation before it is expanded
- Register allocation needs to see the index computation as a single pseudo-instruction for accurate pressure estimation
The pseudo-instruction left by DoSwitchOpt is expanded by phase 80 into the final LEA + BRX sequence after all high-level optimizations are complete.
Interaction with OriLinearReplacement (Phase 31)
Phase 31 runs immediately after DoSwitchOptSecond (phase 30). It targets branch-heavy patterns that survived switch optimization and attempts to replace them with branchless (linear) computation sequences using SEL (select) and predicated MOV instructions. This is a complement to predication (phase 63) -- it operates earlier in the pipeline on simpler patterns, while predication handles more complex diamond-shaped control flow later.
Interaction with MergeEquivalentConditionalFlow (Phases 133, 136)
Two late-pipeline passes perform tail merging -- identifying basic blocks with identical instruction sequences that branch to the same targets, and merging them into a single block. This catches redundancy left over after branch optimization, particularly in switch case bodies that perform similar operations on different case values.
Algorithmic Summary
| Pass | Algorithm | Complexity | CFG Changes |
|---|---|---|---|
| DoSwitchOpt (14, 30) | Pattern match + decision tree for strategy selection | O(N log N) per switch | Rewrites blocks, adds jump table pseudo-ops |
| OriBranchOpt (15) | Worklist-driven CFG simplification (fixed-point) | O(B + E) per iteration | Deletes blocks, removes edges, threads branches |
| OptimizeNestedCondBranches (38) | Pattern match on nested branch diamonds | O(B) | Merges blocks, replaces branches with LOP3+BRA |
Where N = number of switch cases, B = number of basic blocks, E = number of CFG edges.
Function Map
All addresses from ptxas v13.0.88. Vtable entries resolved by reading the ELF .rodata section at file offset VA - 0x400000. Confidence: HIGH for vtable functions (direct binary read), HIGH for core algorithms (single-caller chains from vtable execute bodies).
Phase Vtable Functions
| Address | Size | Phase | Vtable slot | Role |
|---|---|---|---|---|
| sub_C5F720 | 42B | 14 | +0 | execute -- dispatches to SM backend vtable[17] |
| sub_C5F4A0 | 6B | 14 | +8 | getPhaseNumber -- returns 14 |
| sub_C5F4B0 | 3B | 14 | +16 | isNoOp -- returns false |
| sub_C5F950 | 34B | 15 | +0 | execute -- calls sub_7917F0 |
| sub_C5F480 | 6B | 15 | +8 | getPhaseNumber -- returns 15 |
| sub_C5F490 | 3B | 15 | +16 | isNoOp -- returns false |
| sub_C5FC80 | 34B | 30 | +0 | execute -- calls sub_791F00(ctx, 1) |
| sub_C5F2A0 | 6B | 30 | +8 | getPhaseNumber -- returns 30 |
| sub_C5F2B0 | 3B | 30 | +16 | isNoOp -- returns false |
| sub_C5FA70 | 34B | 38 | +0 | execute -- calls sub_A0F020 |
| sub_C5F1A0 | 6B | 38 | +8 | getPhaseNumber -- returns 38 |
| sub_C5F1B0 | 3B | 38 | +16 | isNoOp -- returns false |
Core Algorithm Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_77CF40 | 4698B | 1 | DoSwitchOpt core -- pattern match, strategy select, code emit |
| sub_7917F0 | 529B | 2 | OriBranchOpt core -- worklist CFG simplification |
| sub_A0F020 | 2375B | 11 | OptimizeNestedCondBranches core -- predicate combining |
| sub_791F00 | 587B | 3 | DoSwitchOpt setup -- SwitchOptContext init, calls sub_77CF40 |
Infrastructure Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_7DDB50 | 156B | 180 | Optimization level gate (knob 499 + opt-level check) |
| sub_781F80 | 8335B | 131 | Block preparation infrastructure (major shared function) |
| sub_785E20 | 266B | 34 | CFG rebuild after transformation |
| sub_7E6090 | 2614B | 80 | Branch pattern scanner |
| sub_7E6AD0 | 33B | 10 | Branch chain setup |
| sub_753600 | 1351B | 1 | Block-level branch transform (phase 15 inner loop) |
| sub_753B50 | 598B | 1 | Block transform continuation |
Factory and Vtable Data
| Symbol | Address | Description |
|---|---|---|
| sub_C60D30 | 0xC60D30 | Phase factory -- 159-case switch, allocates 16-byte phase objects |
| off_22BD5C8 | 0x22BD5C8 | Vtable base -- 40-byte stride, index = phase number |
| off_22BD7F8 | 0x22BD7F8 | Phase 14 vtable (base + 14 * 0x28) |
| off_22BD820 | 0x22BD820 | Phase 15 vtable (base + 15 * 0x28) |
| off_22BDA78 | 0x22BDA78 | Phase 30 vtable (base + 30 * 0x28) |
| off_22BDBB8 | 0x22BDBB8 | Phase 38 vtable (base + 38 * 0x28) |
Cross-References
- Pass Inventory -- complete 159-phase table with phase numbers and categories
- Optimization Pipeline -- pipeline infrastructure, dispatch loop, phase ordering
- Ori IR -- instruction layout, opcode table (OEN = BRA, OFFL = BSSY, OFLAP = BSYNC), CFG hash maps
- GeneralOptimize Bundles -- phases 13, 29, 37 that feed constant/copy information to branch passes
- Liveness Analysis -- phase 16 (DCE cleanup after branch/switch optimization)
- Predication -- phase 63 (if-conversion, consumes simplified CFG from phases 15 and 38)
- Hot/Cold Partitioning -- phases 41, 108, 109 (block placement interacts with branch layout)
- Synchronization & Barriers -- BSSY/BSYNC reconvergence mechanism
- Data Structures -- SM backend object at +1584 (phase 14 polymorphic dispatch target)
- Optimization Levels -- sub_7DDB50 opt-level gate, knob 499 interaction
- Knobs System -- knobs 214, 464, 487, 499 gating branch/switch phases
Loop Passes
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Eight phases in the ptxas pipeline transform loops in the Ori IR: one canonicalizer (phase 18), one unroller (phase 22), one software pipeliner (phase 24), four LICM instances (phases 35, 66, 79, 88), and one fusion pass (phase 59). Together they account for the largest category of repeated-pass instances in the pipeline -- the LICM family alone runs four times because intervening transformations (predication, legalization, GMMA fixup) continuously expose new invariants.
ptxas is not built on LLVM. Its loop infrastructure is a custom, non-SSA representation operating directly on the Ori IR's basic-block graph. Loop detection is performed by AnalyzeControlFlow (phase 3), which identifies back-edges, computes dominators, and annotates each basic block with a loop nesting depth stored at block offset +144. This nesting depth is the primary loop identity used by all eight passes.
| OriLoopSimplification | Phase 18 -- vtable at off_22BD898 |
| OriLoopUnrolling | Phase 22 -- vtable at off_22BD938 |
| OriPipelining | Phase 24 -- vtable at off_22BD988 |
| OriHoistInvariantsEarly | Phase 35 -- vtable at off_22BDB40 |
| OriLoopFusion | Phase 59 -- vtable at off_22BDF00 |
| OriHoistInvariantsLate | Phase 66 -- vtable at off_22BE018 |
| OriHoistInvariantsLate2 | Phase 79 -- vtable at off_22BE220 |
| OriHoistInvariantsLate3 | Phase 88 -- vtable at off_22BE388 |
| Phase factory | sub_C60D30 cases 18, 22, 24, 35, 59, 66, 79, 88 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers, pre-RA |
| Loop detection | AnalyzeControlFlow (phase 3) -- back-edges, dominators, nesting depth |
| Related passes | 3 AnalyzeControlFlow, 19 OriSplitLiveRanges, 21 OriStrengthReduce, 108 OptimizeHotColdInLoop |
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG, identifies loops, computes dominators
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds loop analysis)
Phase 15 OriBranchOpt ── branch simplification (may change loop shape)
Phase 16 OriPerformLiveDeadFirst ── DCE removes dead loop bodies
Phase 18 OriLoopSimplification ── CANONICALIZATION: single entry, preheader insertion
Phase 19 OriSplitLiveRanges ── splits live ranges at loop boundaries
Phase 21 OriStrengthReduce ── induction variable strength reduction
Phase 22 OriLoopUnrolling ── UNROLLING: full/partial based on trip count
Phase 23 GenerateMovPhi ── SSA phi insertion (after unrolling changes CFG)
Phase 24 OriPipelining ── SOFTWARE PIPELINING: overlaps iterations
...
Phase 35 OriHoistInvariantsEarly ── LICM #1: after GVN, before mid-expansion
...
Phase 59 OriLoopFusion ── FUSION: merges adjacent compatible loops
...
Phase 66 OriHoistInvariantsLate ── LICM #2: after predication
...
Phase 79 OriHoistInvariantsLate2 ── LICM #3: after late unsupported-op expansion
...
Phase 88 OriHoistInvariantsLate3 ── LICM #4: after GMMA fixup
...
Phase 108 OptimizeHotColdInLoop ── separates hot/cold paths within loops (post-RA)
Ordering Rationale
The eight loop passes are deliberately spread across the pipeline rather than clustered together. Each occupies a specific position dictated by what has been lowered or optimized upstream:
- Phase 18 (simplification) must run before strength reduction (21) and unrolling (22) because both require canonical loop forms.
- Phase 22 (unrolling) runs after strength reduction so that induction variable simplifications are already applied, avoiding redundant computation in unrolled copies.
- Phase 24 (pipelining) runs after unrolling because pipelining targets loops that were not fully unrolled.
- Phase 35 (early LICM) runs after GeneralOptimize at phase 29, which performs partial CSE, giving it common subexpressions to hoist.
- Phase 59 (fusion) runs after late expansion (phase 55) because expansion can split a single operation into a loop pair that fusion can reunite.
- Phases 66, 79, 88 (late LICM instances) each follow a major transformation that can create new loop-invariant code: predication (63), unsupported-op expansion (78), and GMMA fixup (87), respectively.
Loop Representation in Ori IR
ptxas does not use a dedicated loop descriptor data structure (no LoopInfo object like LLVM's). Instead, loop membership is implicit in the CFG through annotations computed by AnalyzeControlFlow (phase 3):
| BB Field | Offset | Type | Meaning |
|---|---|---|---|
| loop_depth | +144 | int | Loop nesting depth (0 = not in loop) |
| loop_depth_equal | +152 | int | Copy of loop_depth, used for sibling detection |
| predecessor_list | +128 | linked_list* | List of predecessor block indices |
| successor_list | +136 | linked_list* | List of successor block indices |
A loop header is a block whose loop_depth equals its own back-edge source's depth. Back-edge information is stored in the Code Object's back-edge hash map at offset +680. Diagnostic output from sub_BDEA50 prints this information as bix%d -> backedge's successor BB: %d.
The block iteration order is controlled by a reverse-post-order (RPO) array stored at Code Object offset +512. All loop passes iterate over this array, ensuring they visit headers before inner blocks. The array length is at Code Object offset +520.
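The traversal that the +512 array encodes can be sketched as a standard depth-first walk (illustrative; ptxas precomputes and stores the array rather than recomputing it per pass):

```python
def reverse_post_order(entry, succs):
    """Compute reverse post-order over a CFG given as a successor map.
    RPO guarantees that a loop header is visited before the blocks it
    dominates, which is why all loop passes iterate in this order."""
    order, visited = [], set()
    def dfs(block):
        visited.add(block)
        for s in succs.get(block, ()):
            if s not in visited:
                dfs(s)
        order.append(block)    # post-order append
    dfs(entry)
    return order[::-1]         # reverse: headers before bodies
```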
Phase 18 -- OriLoopSimplification
Purpose
Canonicalizes loop structure to simplify downstream analysis. Ensures each natural loop has a single entry edge, inserts dedicated preheader blocks where needed, and normalizes back-edge shapes. This is a prerequisite for strength reduction, unrolling, and pipelining, all of which assume canonical loop form.
Entry Point
sub_C5FB00 (34 bytes) ── vtable execute(), calls sub_7DDB50
└─ sub_78B430 (1,172 bytes) ── LoopMakeSingleEntry core
├─ sub_7753F0 ── pre-pass: loop peeling setup
├─ sub_789BE0 ── canonicalize back-edges
├─ sub_781F80 ── rebuild instruction list
└─ sub_9253C0 ── split edges / insert preheader
Algorithm
function LoopSimplification(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
// Phase 1: optional loop peeling for O4+ or flagged functions
    if opt_level in {4,5} or flags[1382] & 4 is set:
peeled = PeelOuterEdges(code_object, 0) // sub_7753F0
canonicalized = CanonicalizeBackEdges(code_object, peeled) // sub_789BE0
else:
canonicalized = CanonicalizeBackEdges(code_object, 0)
if code_object.flags[1368] & 1 == 0: // re-check after canon
return
// Phase 2: single-entry enforcement
if not QueryKnob("LoopMakeSingleEntry", knob_487): // OCG knob 487
return
RebuildInstructionList(code_object, 1) // sub_781F80
for each block in RPO order:
if block.loop_depth > 0 and block is loop header:
// find the deepest-nesting back-edge target
// if multiple entries exist, split into single-entry form
// insert preheader block between external predecessors and header
InsertPreheaderIfNeeded(code_object, block) // sub_9253C0
GPU-Specific Considerations
The simplification pass checks the optimization level at offset +896 of the code object. Levels 4 and 5 (-O4, -O5) enable aggressive loop peeling via sub_7753F0 before canonicalization. At the default -O2, peeling is suppressed to avoid code size growth that could cause instruction cache thrashing.
The LoopMakeSingleEntry knob (OCG knob 487) is the master enable. When disabled, only back-edge canonicalization runs -- preheader insertion is skipped. This knob is checked via the standard OCG knob query at offset +152 of the allocator vtable.
The pass also inspects the convergence flag at offset +1380 (bit 7). When set, it indicates a convergent execution context (e.g., warp-synchronous code), and certain edge-splitting transformations are suppressed to avoid disrupting convergence guarantees.
Related Knobs
| Knob Name | Default | Description |
|---|---|---|
| LoopInversion | enabled | Enable loop inversion (do-while to while conversion) |
| LoopInversionBudget | unset | Maximum instruction count for loop inversion |
| LoopPeelInversion | disabled | Enable loop peeling combined with inversion |
| EnableSingleThreadPeelingLoops | unset | Enable peeling for single-thread execution paths |
| GenPeelingLoopsForSyncs | unset | Generate peeling loops around sync instructions |
| AssertIfPeelingLoopForTexSurf | unset | Assert (debug) if peeling a loop for texture/surface ops |
Phase 22 -- OriLoopUnrolling
Purpose
Performs full unrolling of loops with known small trip counts and partial unrolling of larger loops to amortize loop overhead and expose instruction-level parallelism. This is one of the most impactful optimization passes for GPU code, where loops over texture coordinates, reduction accumulators, and matrix tiles dominate execution time.
Function Map
Correction (P1-04): The W023 report incorrectly listed sub_83EF00 as the unrolling driver. That function is the MainPeepholeOptimizer (confirmed by p1.06a sweep). The actual unrolling call chain starts at sub_1392E30.
| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_1392E30 | 25 lines | Phase 22 execute entry: guards, calls initializer + driver + cleanup | HIGH |
| sub_1389AF0 | 593 lines | Unrolling context initializer: reads all knobs from OCG profile | HIGH |
| sub_1390B30 | 1,598 lines | Main unrolling driver: per-loop decision, factor selection, dispatch | HIGH |
| sub_138A6E0 | 774 lines | Post-unroll cleanup: frees working structures | HIGH |
| sub_7E5120 | 19 lines | Nounroll/skip check: pragma flag, convergence, knob 91 | HIGH |
| sub_7F5D20 | 99 lines | Rejection recording: indexes string table at 0x21D1EA0 | HIGH |
| sub_138E3E0 | 125 lines | Loop body scanner: three-pass analysis (header, forward, backward) | HIGH |
| sub_13858C0 | 42 lines | Loop back-edge locator | HIGH |
| sub_1385E90 | ~200 lines | Trip count bound extractor (init, limit, stride from IV) | MEDIUM |
| sub_1383620 | 1,157 lines | Full unroll profitability evaluator (foldable constants, addresses) | MEDIUM |
| sub_1387C30 | ~400 lines | Partial unroll body replicator | MEDIUM |
| sub_13880F0 | ~200 lines | Post-unroll CFG fixup | MEDIUM |
| sub_1385950 | ~300 lines | Induction variable analysis | MEDIUM |
| sub_138E9C0 | ~400 lines | IV stride/direction verification | MEDIUM |
| sub_1385CC0 | ~200 lines | IV constant detection | MEDIUM |
| sub_13829F0 | ~200 lines | Profitability: foldable constant load counting | MEDIUM |
| sub_A3A7E0 | 1,236 lines | Post-unroll statistics (DUMPIR output) | HIGH |
Unrolling Decision Algorithm
The unrolling decision is a multi-stage pipeline implemented in sub_1390B30. The function iterates over loops in reverse RPO order (innermost first, matching the RPO array at code_object+512) and applies a series of eligibility checks, trip count analysis, factor selection, and profitability evaluation before committing to the unroll.
Entry Guard (sub_1392E30)
function OriLoopUnrolling_Execute(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
if code_object.flags[1397] & 0xC0 == 0x40: // global nounroll override
return
if DUMPIR_skip("LoopUnrolling"): // sub_799250
return
if CountBlocks(code_object) <= 2: // sub_7DDB50
return
if not QueryKnob(487, true): // master loop pass guard
return
ctx = InitializeContext(code_object) // sub_1389AF0
RunUnrolling(ctx) // sub_1390B30
Cleanup(ctx) // sub_138A6E0
Context Initialization and Knob Defaults (sub_1389AF0)
The initializer reads unrolling parameters from the OCG profile object. Each knob carries a mode flag selecting its source: 0 = use hardcoded default, 1 = use integer override, 2 = use float override, 3 = use double override. The defaults recovered from the binary:
| Context Field | Profile Offset | Default | Knob Name (inferred) |
|---|---|---|---|
| ctx+168 (int32) | +31320 | 140 | UnrollBudget |
| ctx+172 (float) | +31032 | 0.25 | UnrollFlexableFullLimit |
| ctx+176 (int32) | +30960 | 4 | UnrollUnknownCount |
| ctx+180 (int32) | +30816 | 4 | UnrollSmallLoopLimit |
| ctx+184 (dbl) | +64656 | 0.4 | LoopUnrollLargePartOfShaderPct |
| ctx+192 (float) | +31392 | 20.0 | UnrollInstLimit |
| ctx+196 (int32) | +64872 | 50 | UnrollPregThreshold |
| ctx+200 (int32) | +31248 | 2 | UnrollExtraInstPerPercentSaving |
| ctx+204 (int32) | +31176 | 200 | UnrollFullInstLimit |
| ctx+208 (int32) | +64296 | 46 | LoopUnrollNumExtraInstBase |
Boolean and integer knobs read via vtable dispatch:
| Knob ID | Profile Offset | Default | Knob Name |
|---|---|---|---|
| 437 | +31464 | true | LoopUnroll (master enable) |
| 894 | +64368 | true | LoopUnrollNonInnermost |
| 897 | +64584 | true | UnrollMultiBlockLoops |
| 902 | +64944 | true | UnrollVariableBounds |
| 896 | +64512 | 0 | LoopUnrollFactor (INT override; 0 = heuristic) |
| 895 | +64440 | 0 | EpilogueLoopUnrollCount |
| 900 | +64800 | 0 | LoopUnrollNumInstTex |
| 903 | +65016 | false | DisablePartialUnrollOverflowCheck |
String knob: knob 427 (profile+30744) returns the LoopUnrollFactor per-block override string, with the format "-N-" to skip block N, "+N+" to force-unroll block N, "-" to skip all, "+" to force all.
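The override grammar is simple enough to sketch directly (the function name is hypothetical; the real check is a pair of strstr calls inside sub_1390B30):

```python
def unroll_override(override, block_id):
    """Interpret the per-block LoopUnrollFactor override string:
    '-' skips every block, '+' forces every block,
    '-N-' skips block N, '+N+' forces block N."""
    if override == "-" or f"-{block_id}-" in override:
        return "skip"
    if override == "+" or f"+{block_id}+" in override:
        return "force"
    return None                  # no override: use the heuristic
```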
Nounroll Pragma Check (sub_7E5120)
Returns true (suppress unrolling) when any of these conditions hold:
- Convergence constraint: The back-edge analysis context at code_object+1784 is active, and the loop header's entry in the back-edge table (code_object+1776+16) is valid and within the convergence limit. This suppresses unrolling of warp-synchronous loops.
- PTX nounroll pragma: Byte 292 of the block descriptor at (code_object+368 + 8*block_idx) has bit 1 set. This bit is set during PTX-to-Ori lowering when the nounroll pragma string (at 0x1CFE126) is parsed.
- Instruction-level marker: Byte 283 of the loop header instruction has bit 0 set.
- Per-block knob: OCG knob 91 is set for this block (queried via sub_7A1A90).
Main Decision Flowchart (sub_1390B30)
function RunUnrolling(ctx):
code_object = ctx.code_object
// Phase 1: Read master enable and per-block override string
master_enable = QueryKnob(437) // LoopUnroll
override_string = QueryKnobString(427) // "-N-" / "+N+" format
RecomputeRegisterPressure(code_object) // sub_7E6090
RebuildInstructionList(code_object) // sub_781F80
// Phase 2: Pre-scan -- count inlinable calls and non-unrollable instructions
for each instruction in code_object.instruction_list:
if opcode == 97 (BRX):
if callee.entry_block == callee.exit_block:
inlinable_calls++
if trip_count > 1:
multi_exit |= AnalyzeMultiExit(ctx, callee)
// Phase 3: Iterate loops in reverse RPO (innermost first)
rpo_count = code_object.rpo_count // offset +520
for idx = rpo_count-1 downto 0:
block = code_object.blocks[code_object.rpo[idx]]
// ── Step A: nounroll annotation propagation ──
if block.nounroll_annotation: // byte +246
propagate nounroll to all blocks at >= same nesting depth
// ── Step B: eligibility filter ──
if block.loop_depth == 0: continue // not a loop
if block.loop_depth != block.loop_depth_equal: continue
if block.nounroll and not ctx.force_all: continue
// ── Step C: structure analysis ──
latch = LocateBackEdge(ctx, block) // sub_13858C0
if not latch: continue
exit_inst = latch.last_instruction
if exit_inst.opcode != 95: // not conditional branch
Reject(block, 13); continue // indirect jump
// ── Step D: nounroll / convergence check ──
if CheckNounroll(block, code_object): // sub_7E5120
Reject(block, 11); continue
// ── Step E: execution frequency analysis ──
freq_header = code_object.freq_table[header_reg]
freq_latch = code_object.freq_table[latch_reg]
is_hot = (freq_latch > 999) and (freq_header > 0)
and (freq_latch / freq_header > 3)
// ── Step F: body analysis ──
body_info = ScanLoopBody(ctx, block, latch) // sub_138E3E0
// body_info contains: tex_count, body_size, foldable_ldc_count,
// has_cross_edges, mem_count
if body_info.has_cross_edges: continue
// ── Step G: budget computation ──
budget_scale = QueryKnobDouble(898, 0.5) // default 0.5
scaled_body = (int)(budget_scale * body_size)
remaining = total_budget - body_size - scaled_body - ...
// ── Step H: per-block override check ──
if override_string:
needle = "-{block_id}-"
if override_string == "-" or strstr(override_string, needle):
continue // skip this block
needle = "+{block_id}+"
if override_string == "+" or strstr(override_string, needle):
force_unroll = true
// ── Step I: pragma force-unroll ──
if flags[1397] & 0xC0 == 0x80: // PTX pragma force
force_unroll = true
// ── Step J: non-innermost filter ──
if not ctx.allow_non_innermost and not force_unroll:
if 10 * body_info.tex_count < remaining:
Reject(block, 7); continue
// ── Step K: factor selection ──
if force_unroll:
factor = 1 << ctx.force_factor // power-of-2 override
else if known_trip_count:
factor = trip_count
// Budget-constrain: while factor * body_cost > UnrollBudget:
// factor--
if factor > 4 and trip_count == 1:
factor &= ~3 // round to mult-of-4
if factor <= 1:
Reject(block, 12); continue
else:
if body_size <= 49 and body_info.tex_count > 0:
factor = 2 // conservative default
else:
factor = max(1, UnrollBudget / body_cost)
// ── Step L: knob override ──
if QueryKnob(429): // LoopUnrollFactor INT
factor = GetKnobInt(429)
// ── Step M: IV analysis ──
iv_info = AnalyzeIV(ctx, latch) // sub_1385950
if not iv_info: Reject(block, 14); continue
if not ValidateIV(ctx, iv_info): // sub_1387870
Reject(block, 14); continue
bound = ExtractBound(ctx, iv_info) // sub_1385E90
if not bound or bound.opcode != 2:
Reject(block, 16); continue
if bound.def_block.predecessor_count != 1:
Reject(block, 17); continue
if bound.init_reg == bound.limit_reg:
Reject(block, 18); continue
stride_ok = VerifyStride(ctx, block, latch, iv_info, bound)
if stride_ok & 2: Reject(block, 17); continue
if stride_ok & 1: Reject(block, 18); continue
// ── Step N: detect constant trip count ──
const_iv = DetectConstantIV(ctx, iv_info) // sub_1385CC0
// ── Step O: profitability for full unroll ──
if factor == trip_count and single_block_body:
if CheckFoldableProfitability(ctx, block, iv_info, factor):
ReplicateFullUnroll(ctx, block, factor) // sub_1383620
stats.unrolled_count++
continue
// ── Step P: partial unroll execution ──
if factor >= 2:
remainder = trip_count % factor
iterations_per_copy = (trip_count - remainder) / factor
block.iterations_per_copy = iterations_per_copy
if remainder > 0:
for r = 0 to remainder-1:
DuplicateBody(ctx, block) // sub_932E40
ReplicatePartialUnroll(ctx, block, latch,
factor, remainder) // sub_1387C30
stats.unrolled_count++
else:
Reject(block, 24) // budget exceeded
// Phase 4: Post-unroll fixup
stats.non_unrolled = total_loops - stats.unrolled - stats.failed
if any_unrolled:
RebuildBackEdges(code_object) // sub_7846F0
RerunLiveness(code_object) // sub_A0F020
RerunControlFlow(code_object) // sub_752E40
MarkModified(code_object) // sub_7B52B0
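The remainder arithmetic in Step P can be checked numerically with a standalone sketch (illustrative names, not ptxas code): the remainder iterations are peeled off as duplicated bodies, and the unrolled kernel runs the quotient.

```python
def partial_unroll_shape(trip_count: int, factor: int):
    """Split a loop of trip_count iterations for an unroll factor.

    Returns (remainder, iterations_per_copy): `remainder` peeled body
    copies run first, then the unrolled kernel executes
    `iterations_per_copy` times with `factor` body copies per pass.
    """
    remainder = trip_count % factor
    iterations_per_copy = (trip_count - remainder) // factor
    return remainder, iterations_per_copy

# 103 iterations unrolled by 4: 3 peeled copies, then 25 kernel passes
assert partial_unroll_shape(103, 4) == (3, 25)
# total iteration count is preserved: 3 + 25 * 4 == 103
```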
Unroll Rejection Table
When a loop cannot be unrolled, sub_7F5D20 records the reason by indexing a string pointer array at 0x21D1EA0. The diagnostic strings contain hex codes like "0x80000001 - Not unrolled: Irregular loop" -- these hex values are part of the printed message text, not the internal array index. The W023 report originally described a 36-byte structure table at 0x21D1980; that table belongs to the operand range lookup in the peephole optimizer (sub_7E39B0), not the unrolling pass. The actual internal rejection codes are simple integers indexing the string array:
| Code | Category | Reason |
|---|---|---|
| 7 | Performance | Body too large relative to texture savings (10 * tex_count < remaining_budget) |
| 11 | Pragma/knob | PTX nounroll pragma, convergence constraint, or per-block knob 91 |
| 12 | Budget | Partial unroll factor reduced to 1 (no factor >= 2 fits within UnrollBudget) |
| 13 | Ineligible | Loop exit contains BRX (indirect jump, opcode 95 with special flags) |
| 14 | Unsupported IV | Induction variable analysis failed (sub_1385950 or sub_1387870) |
| 15 | Unsupported IV | IV register class is not integer (class 1) or pointer (class 2/3) |
| 16 | Trip count | Trip count bound extraction failed (sub_1385E90) |
| 17 | Irregular | IV definition block has multiple predecessors, or stride/direction verification failed |
| 18 | Trip count | IV initial value register equals IV limit register (degenerate zero-trip loop) |
| 19 | Unsupported IV | IV stride sign inconsistent between loop header and induction increment |
| 24 | Budget | Catch-all: budget exceeded after all factor reduction attempts |
The diagnostic output is gated by flags[1421] & 0x20 (DUMPIR verbose mode). When enabled, the rejection string is recorded in a hash map keyed by the loop header instruction node, using FNV-1a hashing of the node's block index.
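The FNV-1a hash itself is standard; how ptxas serializes the block index into bytes (width, byte order) is an assumption in this sketch:

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: xor each byte in, multiply by the FNV prime."""
    h = 0x811C9DC5                            # 32-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF     # FNV prime, truncate to 32 bits
    return h

# hashing a block index as 4 little-endian bytes (byte order assumed)
key = fnv1a_32((42).to_bytes(4, "little"))
assert 0 <= key < 2**32
```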
Heuristic Thresholds (Knobs)
The unrolling decision is controlled by a rich set of OCG knobs. All knob names are stored ROT13-encoded in the binary:
| Knob Name | Type | Default | Description |
|---|---|---|---|
LoopUnroll | BOOL | true | Master enable for loop unrolling |
LoopUnrollFactor | INT | 0 | Override unroll factor (0 = heuristic) |
UnrollBudget | INT | 140 | Maximum total instruction count after unrolling |
UnrollInstLimit | FLOAT | 20.0 | Maximum instructions in a single unrolled loop body |
UnrollFullInstLimit | INT | 200 | Maximum body size for full unrolling |
UnrollFlexableFullLimit | FLOAT | 0.25 | Flexible full-unroll limit (adjusted by loop characteristics) |
UnrollSmallLoopLimit | INT | 4 | Body size threshold below which loops are always fully unrolled |
UnrollPregThreshold | INT | 50 | Maximum predicate register pressure for unrolling |
UnrollMultiBlockLoops | BOOL | true | Allow unrolling of multi-basic-block loop bodies |
UnrollVariableBounds | BOOL | true | Allow unrolling when trip count is not compile-time constant |
UnrollUnknownCount | INT | 4 | Default trip count assumption when count is unknown |
UnrollUnknownInstLimit | INT | 0 | Maximum body size for unrolling with unknown trip count |
UnrollExtraInstPerPercentSaving | INT | 2 | Instructions allowed per percent of cycle saving |
UnrollTex3DPercentSavedThreshold | INT | 0 | Minimum savings percent for 3D texture loops |
UnrollProfiledColdInstsScale | INT | 0 | Scale factor for instruction count in profiled-cold blocks |
LoopUnrollExtraFoldableLdcWeight | INT | 0 | Extra weight for foldable constant loads in unroll benefit |
LoopUnrollFoldableAddrWeight | INT | 0 | Weight for foldable address computations |
LoopUnrollLargePartOfShaderPct | DOUBLE | 0.4 | Percentage threshold: loop is "large part of shader" |
LoopUnrollNumExtraInstBase | INT | 46 | Base extra instruction allowance per unroll iteration |
LoopUnrollNumInstSmallLoop | INT | 0 | Instruction count defining "small loop" |
LoopUnrollNumInstTex | INT | 0 | Texture instruction count bonus for unrolling |
LoopUnrollSingleLoopSavedPctFactor | INT | 0 | Savings factor for single-loop shaders |
LoopUnrollNonInnermost | BOOL | true | Allow unrolling of non-innermost loops |
LoopUnrollUnknownMultiBlock | BOOL | false | Allow multi-block unroll with unknown bounds |
EpilogueLoopUnrollCount | INT | 0 | Unroll count for epilogue (remainder) loops |
DisablePartialUnrollOverflowCheck | BOOL | false | Skip overflow check on partial unroll count |
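Since ROT13 is an involution, recovering a knob name from its encoded binary string is a one-liner with Python's codecs module:

```python
import codecs

# ROT13 is its own inverse, so the same transform encodes and decodes
assert codecs.encode("LoopUnroll", "rot13") == "YbbcHaebyy"
assert codecs.decode("YbbcHaebyy", "rot13") == "LoopUnroll"
```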
GPU-Specific Unrolling Concerns
Register pressure. GPU threads share a fixed register file per SM. Unrolling increases live ranges, potentially reducing occupancy (the number of concurrent warps). The unroller queries register pressure estimates and compares against UnrollPregThreshold before committing.
Instruction cache. GPU instruction caches are small (typically 128KB L1i per SM). Aggressive unrolling of large loop bodies can cause i-cache thrashing. The UnrollBudget knob caps the total instruction growth.
Texture instruction scheduling. Texture fetches have high latency (hundreds of cycles). Unrolling loops containing texture operations is especially profitable because it exposes independent fetches that the scheduler can overlap. The LoopUnrollNumInstTex and UnrollTex3DPercentSavedThreshold knobs give extra weight to texture-heavy loops.
PTX nounroll pragma. The PTX string nounroll at 0x1CFE126 is parsed during PTX-to-Ori lowering and sets bit 1 of byte 292 in the block descriptor at (code_object+368 + 8*block_idx). The check is performed by sub_7E5120, which also tests three additional suppression conditions: the convergence constraint (back-edge table at code_object+1776), an instruction-level marker (byte 283 bit 0), and per-block knob 91. Any single condition is sufficient to suppress unrolling for that loop (rejection code 11).
Convergence constraint. When the back-edge analysis context at code_object+1784 is active (indicating warp-synchronous code), the unroller checks whether the loop header falls within the convergence region. If it does, unrolling is suppressed to avoid breaking warp-level synchronization guarantees. This is particularly important for cooperative groups and ballot-based algorithms.
DUMPIR Statistics
When diagnostics are enabled, the pass outputs:
# [partially unrolled loops=N] [non-unrolled loops=M]
This line appears in eight SM-variant statistics printers (sub_ABBA50 through sub_ABEB50), each a 1,771-byte clone specializing output format for a specific SM generation.
Phase 24 -- OriPipelining
Purpose
Performs modulo software pipelining on loops that were not fully unrolled. The pass overlaps successive loop iterations by interleaving instructions from different iterations within a single loop body, hiding functional unit and memory latency. This is the single most complex loop transformation in ptxas.
Two-Layer Pipelining Architecture
ptxas implements software pipelining in two cooperating layers:
- Phase 24 (OriPipelining, pre-RA): Annotates instruction operands with pipeline latency classes, computes the minimum initiation interval (MII), and performs the modulo scheduling loop transformation (iteration overlap, prolog/epilog generation). Operates on the Ori IR before register allocation.
- Post-RA SoftwarePipeline (sub_8B9390, 23KB): A scheduling algorithm variant within the post-RA instruction scheduler (address range 0x893000--0x8FE000) that performs instruction-level scheduling of already-pipelined loop bodies using physical registers. One of approximately 12 scheduling variants alongside DualIssueScheduler, TensorScheduler, LoopScheduler, PrefetchScheduler, etc.
The two layers cooperate: Phase 24 transforms the loop structure (instruction replication, prolog/epilog construction) before register allocation. The post-RA SoftwarePipeline variant handles the cycle-accurate instruction placement of already-pipelined loops.
Function Map
| Function | Size | Role | Confidence |
|---|---|---|---|
sub_926A30 | 22,116 bytes | Per-instruction operand latency annotator and encoding rewriter | HIGH |
sub_91A0F0 | 5,550 bytes | Opcode-to-latency-class classifier (~350 opcodes, 13 distinct classes) | HIGH |
sub_9203A0 | 4,881 bytes | Pipeline stage cost calculator (ResMII computation, FP cost accumulation) | MEDIUM |
sub_921820 | 1,592 bytes | Prolog/epilog code generator | MEDIUM |
sub_9202D0 | 207 bytes | Two-operand pipeline feasibility check (returns 60=reject, 130=accept) | HIGH |
sub_91E610 | 399 bytes | Register-class-based latency lookup (class 4→26, class 5/2→20) | HIGH |
sub_91E900 | 470 bytes | Pipe-assignment-based stall cycle calculator (32/64 cycle caps) | HIGH |
sub_92C0D0 | 358 bytes | Per-instruction annotation wrapper (calls sub_926A30, checks opcode changes) | HIGH |
sub_92C240 | 8,033 bytes | Extended GEMM-loop pipeliner (SM90+ TMA pipeline depth management) | MEDIUM |
sub_8B9390 | 22,841 bytes | Post-RA software pipelining scheduling variant (in scheduler subsystem) | MEDIUM |
Correction (P1-06): The original function map listed sub_926A30 as the "main pipelining engine (modulo scheduling)." Decompilation reveals it is the per-instruction operand latency annotator -- it iterates over each operand of an instruction, calls sub_91A0F0 to classify the operand's latency class, and rewrites the operand encoding with the latency annotation. The modulo scheduling loop transformation is distributed across the remaining functions, with sub_9203A0 computing stage costs and sub_921820 generating prolog/epilog code.
Software Pipelining Algorithm
Phase 1: Operand Latency Annotation
For each instruction in the loop body, sub_92C0D0 calls sub_926A30 to annotate operands:
function AnnotateOperandLatencies(code_object, instruction):
opcode = instruction.word & 0xFFFFCFFF // strip modifier bits (bits 12-13)
secondary_opcode = instruction.secondary_opcode
operand_array = instruction.operands // offset +84
operand_count = instruction.operand_count // offset +80
for i in 0..operand_count-1:
operand_type = (operand_array[i].word >> 28) & 7
if operand_type in {2, 3}: // register or register pair
// Adjust count for predicated instructions (bit 12)
adjusted_count = operand_count - 2 * ((opcode >> 11) & 2 != 0)
if i < adjusted_count:
latency_class = ClassifyLatency(opcode, secondary_opcode,
operand_array, adjusted_count, i)
if latency_class != default:
RewriteOperandEncoding(operand_array[i], code_object, latency_class)
// For register operands: call full rewriter sub_922210
// For non-register operands: call sub_9267C0
Phase 2: Pipeline Feasibility Filtering
Each instruction is checked by sub_9202D0:
function CheckPipelineFeasibility(code_object, instruction):
// Reject instructions with special operand flags
if (operand_array[1] & 0x603FFFF) != 0 or (operand_array[3] & 0xF8000000) != 0:
if optimization_level > 1:
return REJECT // return code 60
// Reject if pipe assignment class <= 3 (control/barrier pipe)
pipe_class = PipeAssignment(code_object, primary_opcode) // vtable+904
if pipe_class <= 3:
return REJECT
// Reject if operand 0 and operand 1 have different latency classes
lat0 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 0)
lat1 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 1)
if lat0 != lat1:
return REJECT // asymmetric latencies
// Reject if extended operands have blocking flags
if operand_count > 2 and (operand_array[4] & 0xF) or (operand_array[4] >> 4) & 1:
return REJECT
// Accept: trim to 2-operand form
result_operands = &operand_array[2]
result_count = 2
return ACCEPT // return code 130
Phase 3: MII Computation
The minimum initiation interval is computed as:
MII = max(RecMII, ResMII)
RecMII (recurrence-constrained): The longest data dependence cycle in the DDG divided by the iteration distance it spans. For a cycle of total latency L spanning D iterations: RecMII = ceil(L / D).
ResMII (resource-constrained): Computed by sub_9203A0 using floating-point cost accumulation. The function classifies each instruction's pipe class using a 7-entry pipe class table at code_object+16 and accumulates per-pipe instruction counts:
function ComputeResMII(loop_body, pipe_table):
pipe_counts[0..6] = {0}
for each instruction in loop_body:
lat0 = ClassifyLatency(instruction, operand=0)
lat1 = ClassifyLatency(instruction, operand=1)
pipe = MapLatencyToPipe(lat0, pipe_table) // 7-entry lookup
pipe_counts[pipe] += cost(instruction) // FP cost weights
ResMII = max(pipe_counts[i] / pipe_width[i] for i in 0..6)
The pipe class boundaries stored at code_object+16 define 7 functional unit classes. Each class has a capacity (number of execution slots per cycle). ResMII is the maximum ratio of instruction demand to capacity across all pipe classes.
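Putting the two bounds together (pipe counts and widths below are invented for illustration; the real capacities come from the 7-entry table at code_object+16):

```python
import math

def compute_mii(cycle_latency, cycle_distance, pipe_counts, pipe_width):
    """MII = max(RecMII, ResMII).

    RecMII: longest dependence cycle latency / iteration distance.
    ResMII: worst-case ratio of instruction demand to pipe capacity.
    """
    rec_mii = math.ceil(cycle_latency / cycle_distance)
    res_mii = math.ceil(max(c / w for c, w in zip(pipe_counts, pipe_width)))
    return max(rec_mii, res_mii)

# a dependence cycle of latency 9 spanning 2 iterations (RecMII = 5), and
# 6 instructions on a single-slot pipe (ResMII = 6) -> II of at least 6
assert compute_mii(9, 2, [6, 2], [1, 2]) == 6
```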
Phase 4: Modulo Schedule Construction
function ModuloSchedule(loop_body, MII):
II = MII
while II <= MAX_II:
MRT = new ModuloReservationTable(II) // II rows x pipe_classes columns
success = true
for each instruction in priority order:
earliest = max(data_dependency_constraints)
latest = earliest + II - 1
placed = false
for slot in earliest..latest:
row = slot mod II
pipe = instruction.pipe_class
if MRT[row][pipe] has capacity:
MRT[row][pipe] -= 1
instruction.scheduled_time = slot
instruction.stage = slot / II
placed = true
break
if not placed:
success = false
break
if success:
return (II, schedule)
II += 1
return FAILURE // could not pipeline
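The construction above can be exercised with a tiny runnable model. The instruction tuples, pipe indices, and capacities here are invented; ptxas keys its reservation table by the 7 pipe classes at code_object+16:

```python
def modulo_schedule(insts, mii, pipe_width, max_ii=64):
    """Place (name, pipe, earliest) tuples into an II-row reservation table.

    Mirrors the recovered loop: try each II starting at MII, scan the
    window [earliest, earliest + II - 1], and claim the first slot whose
    row (slot mod II) still has capacity on the instruction's pipe.
    """
    for ii in range(mii, max_ii + 1):
        mrt = [[0] * len(pipe_width) for _ in range(ii)]
        sched, ok = {}, True
        for name, pipe, earliest in insts:
            for slot in range(earliest, earliest + ii):
                if mrt[slot % ii][pipe] < pipe_width[pipe]:
                    mrt[slot % ii][pipe] += 1
                    sched[name] = (slot, slot // ii)   # (time, stage)
                    break
            else:
                ok = False   # no free slot in the window at this II
                break
        if ok:
            return ii, sched
    return None   # could not pipeline within max_ii

# three loads competing for one memory slot per cycle land in distinct rows
ii, sched = modulo_schedule(
    [("ld0", 0, 0), ("ld1", 0, 0), ("ld2", 0, 0)], mii=3, pipe_width=[1])
assert ii == 3 and {t % 3 for t, _ in sched.values()} == {0, 1, 2}
```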
Phase 5: Prolog/Epilog Generation
Once a valid schedule is found at initiation interval II with S pipeline stages, sub_921820 generates:
function GeneratePrologEpilog(loop, II, num_stages):
// Prolog: S-1 partial iterations
for stage in 0..num_stages-2:
emit instructions assigned to stages 0..stage
// Each prolog iteration adds one more stage
// Kernel: steady-state loop body
emit all instructions from all stages
// Trip count adjusted: new_trip = original_trip - (num_stages - 1)
// Epilog: S-1 drain iterations
for stage in num_stages-2..0:
emit instructions assigned to stages stage+1..num_stages-1
// Each epilog iteration removes one stage
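The fill/drain structure is easy to visualize with a sketch that lists which stages each emitted section contains (stage numbering as in the pseudocode above; helper name is illustrative):

```python
def pipeline_shape(num_stages: int):
    """Stages executed per prolog iteration, kernel pass, and epilog iteration."""
    prolog = [list(range(0, s + 1)) for s in range(num_stages - 1)]
    kernel = list(range(num_stages))                       # steady state
    epilog = [list(range(s + 1, num_stages))
              for s in range(num_stages - 2, -1, -1)]      # drain in reverse
    return prolog, kernel, epilog

# S = 3 stages: two fill iterations, the full kernel, two drain iterations
prolog, kernel, epilog = pipeline_shape(3)
assert prolog == [[0], [0, 1]]
assert kernel == [0, 1, 2]
assert epilog == [[2], [1, 2]]
```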
Instruction Latency Classifier (sub_91A0F0)
The classifier is a 5.5KB, 1372-line switch statement mapping approximately 350 Ori opcodes to 13 distinct latency class values. It takes five parameters: (opcode, secondary_opcode, operand_array, operand_count, operand_index) and returns a class ID -- not a cycle count. The scheduler maps class IDs to actual cycle counts via the hardware profile.
Latency Class Table
| Class | Typical opcodes | Meaning |
|---|---|---|
| 1 | Past-end operands, invalid indices | Skip / not used |
| 6 | Simple ALU, bitwise, short integer | Short-pipe latency (~80 opcodes) |
| 7 | Paired register operations | Medium-short (~5 opcodes) |
| 8 | Special cases (via lookup table dword_21E1340) | Medium |
| 9 | Type conversions (via lookup table) | Medium |
| 10 | Integer multiply, shifts, IMAD | Medium-long (~40 opcodes) |
| 11 | Address computations, LEA variants | Medium-long (~15 opcodes) |
| 12 | Memory operations, FP32, barriers | Standard long (~100 opcodes) |
| 14 | Wide memory, atomics, FP64 stores | Extended long (~20 opcodes) |
| 16 | FP64 special variants | Extended long (~3 opcodes) |
| 20 | Texture fetches, uniform loads | Very long (~30 opcodes) |
| 26 | Global memory loads, uncached access | Maximum latency (~25 opcodes) |
| 31 | Scoreboard/barrier-related operands | Special handling (~5 opcodes) |
Opcode Family Handling
| Opcode range | Category | Latency behavior |
|---|---|---|
0x03--0x24 | Integer ALU | Mostly passthrough default; 0x23 always returns 10 |
0x3C, 0x3E, 0x4E, 0x4F | Memory (load/store) | Returns field from operand_array[4] bits for operands 0--1 |
0x46, 0xF3--0x106 | Texture | Returns 6 normally; 10 for MIO-dependent with extended flag check |
0x49, 0x4A, 0x51, 0x143, 0x15E | Atomic/reduce | Always returns 12 |
0x55--0x6F | Floating-point | Complex per-operand logic; 0x55 uses lookup table dword_21E1340 |
0x5B, 0x5C, 0x137 | Barriers/sync | Returns 12 for operand 1, else default |
0xB7, 0x120 | WGMMA setup | Per-operand latency (10--20) based on accumulator flags |
0x135 | HMMA/IMMA | Calls sub_7E39B0/sub_7E3A70/sub_7E3BA0/sub_7E3C30 for matrix latency |
0x13D, 0x13E | Extended FP | Accumulator-flag-dependent returns (10 or 12) |
Stall Cycle Calculator (sub_91E900)
sub_91E900 computes the stall penalty for an instruction by mapping latency classes through the pipe assignment function (vtable+904):
function ComputeStallCycles(code_object, instruction):
lat0 = ClassifyLatency(instruction, operand=0)
pipe0 = PipeAssignment(code_object, lat0) // vtable+904
if pipe0 == 8: // long-latency pipe
stall = StallTable[instruction.index] // code_object+440
return min(stall, 64) // cap at 64 cycles
lat1 = ClassifyLatency(instruction, operand=1)
pipe1 = PipeAssignment(code_object, lat1)
if pipe1 == 8:
stall = StallTable[instruction.index]
return min(stall, 64)
// Neither operand on long pipe
stall = StallTable[instruction.index]
return min(stall, 32) // cap at 32 cycles
The pipe assignment value 8 corresponds to the long-latency functional unit (memory/texture). Instructions on this pipe get a 64-cycle cap; all others are capped at 32 cycles.
GEMM Pipelining (sub_92C240)
The GemmPipeliner* family of knobs controls a specialized pipelining mode for GEMM (matrix multiply) loops:
| Knob Name | Type | Default | Description |
|---|---|---|---|
GemmPipelinerEnabled | BOOL | false | Master enable for GEMM-specific pipelining |
GemmPipelinerPipelineDepthEnforceDeltaFull | INT | 0 | Pipeline depth adjustment for full enforcement |
GemmPipelinerPipelineDepthEnforceDeltaPartial | INT | 0 | Pipeline depth adjustment for partial enforcement |
GemmPipelinerDependenciesPopbl | BOOL | false | Dependency resolution policy between DMA and compute stages |
GemmPipelinerScoreboardHashPopbl | BOOL | false | Scoreboard hash policy for GEMM barrier tracking |
GemmPipelinerUseRegisterCalculation | INT | 0 | Use register-based calculation for pipeline depth vs. fixed |
The extended pipelining in sub_92C240 (8KB) handles GEMM-like patterns where the loop body contains WGMMA/IMMA instructions. From decompilation:
- Activation: The GEMM pipeliner activates when code_object+48 (GEMM mode flag) is set and the pipeline context at code_object+56 has a valid stage range.
- Stage iteration: Iterates from context+84 (start stage) to context+88 (end stage), with 96-byte descriptors per stage at context+136.
- Pipeline depth management: Uses sub_8A4DA0 to validate stage depth and sub_6E6650 for dynamic array resizing when pipeline depth exceeds the current allocation. Writes stage bitmasks (1 << stage_index) into the stage descriptor arrays.
- Hardware model: On SM90+ (Hopper), TMA supports up to 8 outstanding asynchronous copy operations. The GEMM pipeliner matches this hardware depth, staging DMA (memory) and compute (math) operations to fill the pipeline.
The DUMPIR diagnostic output includes For Dma Loop and For Math Loop sections from sub_7A4500, confirming the pipeliner explicitly distinguishes between DMA and compute loop stages.
Other Pipelining Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
OkToPipelineNoUnroll | INT | 0 (disabled) | Allow pipelining even when unrolling was also suppressed |
PipelineHoistCondLimit | INT | unset | Maximum condition complexity for hoisting in pipelined loops |
PipelineHoistRRegPressureLimit | INT | unset | R-register pressure limit for hoisting inside pipelined body |
PipelineHoistPRegPressureLimit | INT | unset | P-register pressure limit for hoisting inside pipelined body |
PipelineMIOVQToInstRatio | DBL | unset | MIOVQ-to-instruction ratio threshold for pipeline profitability |
PipelineMultiOutputTex | INT | 0 (disabled) | Enable pipelining of loops with multi-output texture instructions |
PipelineSpecUsesInHeadOnly | INT | 0 (disabled) | Restrict speculative uses to loop header only |
GPU-Specific Pipeline Concerns
Warp divergence. Pipelined loops assume all threads in a warp execute the same number of iterations. If the trip count is warp-divergent, the prolog/epilog handling must account for early-exit threads. The pass checks the varying analysis (phases 53, 70) to determine divergence.
Barrier placement. Pipelined loops containing BAR.SYNC or MEMBAR instructions are checked by sub_9202D0 -- if the pipe assignment class for a barrier instruction is <= 3, the instruction is rejected from pipelining. The latency classifier (sub_91A0F0) assigns class 12 to barrier operands (opcodes 0x5B, 0x5C, 0x137), but the feasibility check rejects based on pipe class, not latency class.
Memory pipeline depth. The sub_92C240 extended pipeliner for GEMM-like loops manages the hardware memory pipeline on SM90+. It explicitly tracks DMA pipeline depth using 96-byte per-stage descriptors, resizing arrays dynamically when depth exceeds allocation. The stage descriptor at context+136 + 96*stage holds bitmask membership, latency counters, and dependency links.
Pipe class model. The 7-entry pipe class table at code_object+16 partitions the functional units into classes. The post-RA software pipelining variant (sub_8B9390) uses the same table to determine which functional unit class each instruction uses, ensuring resource conflict detection is consistent between the two pipelining layers.
Phases 35, 66, 79, 88 -- OriHoistInvariants (LICM)
Purpose
Hoists computations that produce the same result on every loop iteration out of the loop body and into the preheader. This reduces the dynamic instruction count proportionally to the trip count. The four instances are not redundant -- each targets invariants created by different intervening transformations.
Function Map
All four instances share the same core implementation:
| Function | Size | Role | Confidence |
|---|---|---|---|
sub_C5FE00 | 34 bytes | Phase 35 execute wrapper | CERTAIN |
sub_C5FE30 | 34 bytes | Phase 66 execute wrapper | CERTAIN |
sub_C5FE60 | 34 bytes | Phase 79 execute wrapper | CERTAIN |
sub_C5FE90 | 34 bytes | Phase 88 execute wrapper | CERTAIN |
sub_7DDB50 | 156 bytes | Optimization guard: checks knob 499, block count > 2 | HIGH |
sub_8FFDE0 | 573 bytes | HoistInvariants orchestrator: iterates blocks, queries knob 381, dispatches inner worker | HIGH |
sub_8FF780 | 1,622 bytes | LICM inner worker: identifies and moves invariant instructions | HIGH |
sub_8FEAC0 | 2,053 bytes | Invariance marking: forward/backward operand scan per block | HIGH |
sub_8F76E0 | 90 bytes | Per-instruction invariance test: checks output register def-block | HIGH |
sub_8F7770 | 810 bytes | Hoisting safety check: operand class + latency analysis | HIGH |
sub_8F8CB0 | 658 bytes | Profitability check: budget-weighted score vs latency penalty | HIGH |
sub_8F7DD0 | 374 bytes | Transitive invariance propagation through def-use chains | HIGH |
sub_8F7AE0 | 558 bytes | Instruction mover: unlinks from loop, inserts at preheader | HIGH |
sub_8FF2D0 | 1,186 bytes | Budget computation + invariant marking + hoist dispatch | HIGH |
sub_8F8BC0 | 257 bytes | Instruction counting: header/body weight via isNoOp | HIGH |
sub_74D720 | 353 bytes | Loop boundary analysis: barrier/jump/predecessor checks | HIGH |
sub_74F500 | -- | Preheader location finder | MEDIUM |
sub_7DF3A0 | 88 bytes | Opcode flags table lookup (side-effect classification) | HIGH |
sub_7E0540 | 156 bytes | Observable side-effect checker (memory, call, barrier) | HIGH |
Execute Flow
sub_C5FExxx(phase_obj) // 34-byte vtable dispatch
└─ sub_8FFDE0(code_object, pass_id) // orchestrator
├─ sub_7DDB50(code_object) // guard: returns block count, checks knob 499
├─ sub_799250(allocator, "HoistInvariants", &skip) // DUMPIR check
└─ sub_8FF780(context) // per-loop LICM core
├─ sub_781F80 // rebuild instruction list
├─ sub_7E6090 // recompute register pressure
├─ sub_773140 // recompute loop depths
├─ sub_74D720 // analyze loop boundaries
├─ sub_74F500 // find preheader
├─ sub_7A1A90 / sub_7A1B80 // query knob 381 per block
└─ sub_8F8BC0 // header/body instruction counting
Why Four Instances?
| Phase | Pass ID (a2) | Pipeline Position | What Creates New Invariants |
|---|---|---|---|
35 (Early) | 0 | After GeneralOptimize (29), ExtractShaderConsts (34) | CSE eliminates redundant expressions, exposing loop-invariant results; shader constant extraction hoists uniform loads |
66 (Late) | 1 | After predication (63), GeneralOptimizeLate2 (65) | Predication converts conditional branches to predicated instructions; if the condition is loop-invariant, the entire predicated instruction becomes invariant |
79 (Late2) | 2 | After LateExpansionUnsupportedOps (78) | Late expansion splits compound operations into sequences; address computations and constant sub-expressions in expanded sequences are often invariant |
88 (Late3) | 3 | After FixupGmmaSequence (87) | GMMA fixup reorders/inserts instructions for wgmma hardware constraints; descriptor loads and accumulator setup become visible as invariants |
Pass ID Controls Aggressiveness
The pass_id parameter (parameter a2 of sub_8FFDE0) affects which loops are processed and how aggressively hoisting is performed. From the decompiled logic at sub_8FFDE0:
// sub_8FFDE0 lines 58-89 (simplified)
v7 = sub_7A1B80(allocator, 381, block); // query knob 381 for this block
if (v7 == 1) { // knob says "inner loops only"
if (pass_id == 1) goto hoist_block; // Late pass: proceed
goto skip_block; // Early pass: skip
}
if (v7 == 3) { // knob says "never"
if (pass_id <= 1) goto handle_conservative;
goto skip_block;
}
if (v7 == 0) { // knob says "always"
if (pass_id == 0) goto hoist_aggressively;
goto skip_block;
}
- pass_id = 0 (Early): Hoists aggressively and calls sub_A112C0(code_object, 1) to re-run sub-analyses afterward. This is the most aggressive pass.
- pass_id = 1 (Late): Includes inner-loop-only blocks, but skips the re-analysis call.
- pass_id >= 2 (Late2, Late3): Most conservative -- only hoists from blocks where knob 381 returns 0 (always-hoist).
Per-Block Knob 381 Policy
The LICM pass queries OCG knob 381 (sub_7A1A90 / sub_7A1B80) per basic block to determine the hoisting policy:
| Knob 381 Result | Meaning |
|---|---|
| 0 | Always hoist from this block |
| 1 | Hoist from inner loops only |
| 3 | Never hoist from this block |
This per-block granularity allows the knob system to selectively disable hoisting in specific loop nests (e.g., those known to be register-pressure-critical).
Guard Function (sub_7DDB50)
Before the LICM core runs, sub_7DDB50 (156 bytes) gates execution on two conditions:
- Knob 499 enabled. Queries the allocator vtable at +72 for OCG knob 499 (the master LICM switch). If disabled, returns 1, which causes the orchestrator to bail (since 1 <= 2).
- Rate limiter. When knob 499 is enabled, the guard checks a pair of counters at allocator[9]+35936 (max invocations) and allocator[9]+35940 (current count). If the current count has reached the maximum, returns 1. Otherwise it increments the counter and returns the actual basic block count from code_object+2104. This bounds the number of LICM invocations for compile-time control in functions with many loops.
- Block count > 2. The orchestrator (sub_8FFDE0) checks the return value: if <= 2, no hoisting is attempted. Single-block functions have no loops; two-block functions have at most a trivial loop not worth processing.
LICM Invariant Detection Algorithm
The invariance detection pipeline runs inside sub_8FF2D0 (1,186 bytes), which is called from sub_8FF780 once per loop nest level. It executes five stages in sequence: budget computation, forward invariance marking, backward non-invariance marking, transitive propagation, and profitability gating.
Stage 1: Budget Computation (sub_8FF2D0)
function ComputeHoistBudget(context, block, is_simple, num_preds, hoist_mode, is_inner):
// Base budget from knob 483 (HoistBudget)
if QueryKnob(483):
budget = QueryKnobValue(483) // 0 = unlimited
else:
budget = 10 // default
// CBO budget from knob 482
if QueryKnob(482):
cbo_budget = QueryKnobValue(482)
else:
cbo_budget = (pass_id == 0) ? 22 : 100
// Adjust by loop type and depth
if pass_id > 0 and is_simple:
budget = (hoist_mode < 2) ? cbo_budget : 300
else if pass_id == 0 and is_simple:
budget = (hoist_mode < 2) ? cbo_budget : 200
// Conservative multiplier for Late3
if pass_id == 3:
budget *= 100 // generous once decided to hoist
// Split budget among back-edge blocks
if hoist_mode == 3: // processing back-edge block
budget /= num_preds
// Inner-loop divisor from knob 380
if is_inner:
if QueryKnob(380):
budget /= QueryKnobValue(380)
else:
budget /= 10
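Mirrored as runnable Python, the budget computation behaves as follows (a sketch only; the function and argument names are ours, and None stands for an unset knob):

```python
def compute_hoist_budget(pass_id, is_simple, num_preds, hoist_mode, is_inner,
                         knob483=None, knob482=None, knob380=None):
    """Sketch of the Stage 1 budget computation recovered above.
    knob arguments of None model 'knob not set' (compiled-in defaults)."""
    budget = knob483 if knob483 is not None else 10
    cbo_budget = knob482 if knob482 is not None else (22 if pass_id == 0 else 100)

    # Adjust by loop type
    if pass_id > 0 and is_simple:
        budget = cbo_budget if hoist_mode < 2 else 300
    elif pass_id == 0 and is_simple:
        budget = cbo_budget if hoist_mode < 2 else 200

    if pass_id == 3:
        budget *= 100          # Late3: generous once hoisting was decided

    if hoist_mode == 3:        # back-edge block: split among predecessors
        budget //= num_preds

    if is_inner:               # inner-loop divisor (knob 380, default 10)
        budget //= knob380 if knob380 is not None else 10
    return budget
```

Running the early pass on a simple loop header (pass_id 0, hoist_mode 0) yields the CBO default of 22, while the Late3 pass on an inner back-edge block divides the inflated budget back down.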
Stage 2: Forward Invariance Marking (sub_8FEAC0, a3=1)
The forward pass iterates every instruction in the basic block and marks each register operand's invariance status based on where it was defined.
function MarkInvariants_Forward(context, block_index):
block = blocks[block_index]
header_depth = context.header_depth
loop_depth_range = [context.header_depth, context.max_depth]
// Two code paths based on knob 934 (UseNewLoopInvariantRoutineForHoisting)
if QueryKnob(934):
// Advanced path: set-based computation via sub_768BF0 + sub_8F7280
return MarkInvariants_SetBased(context, block_index)
// Default path: single-pass scan
for each instruction in block (linked list: block+0 .. sentinel at block+8):
has_side_effect = isNoOp(instruction) // vtable+1824
opcode = instruction+72 (masked: BYTE1 &= 0xCF)
num_operands = instruction+80
// Special case: opcode 195 + first dst is reg class 9 (predicate)
is_predicate_def = (opcode == 195 and dst_reg.class == 9)
is_invariant = true
for each operand from LAST to FIRST: // reverse scan
operand = instruction + 84 + 8*i
type = (operand >> 28) & 7
if type != 1: // not a register
continue // immediates are always invariant
if IsFixedRegister(operand, code_object): // sub_7DEB90
continue // e.g., RZ, PT — always available
if pass_id == 3: // Late3 extra check
if IsSpecialRegClass(operand, code_object): // sub_7DA2F0
// Exception: IADD3 (opcode 130, flag 0x1000) carry-out
if not (opcode == 130 and flag_0x1000 and is_penultimate_operand):
continue
reg = RegisterDescriptor(code_object, operand & 0xFFFFFF)
if reg.def_block (reg+76) == block_index:
// Defined in THIS block — not invariant for this loop
is_invariant = false
else if context.is_multi_depth:
def_instr = reg.def_instruction (reg+56)
if def_instr is null or reg has pinned bit:
handle_predicate_invariance()
else:
def_block = blocks[def_instr.block_index]
def_depth = def_block.loop_depth (offset +144)
if def_depth < header_depth or def_depth > max_depth:
reg.use_count (reg+80) = 0 // mark as loop-external
else:
is_invariant = false
reg.def_block (reg+76) = block_index
else:
reg.use_count (reg+80) = 0 // simple loop: mark external
// Side-effect check for the entire instruction
flags = LookupOpcodeFlags(instruction, code_object) // sub_7DF3A0
if (flags & 2) != 0: // has memory/control side effect
is_invariant = false
if MemoryOverlapsLoopLiveSet(instruction): // sub_74F5E0
is_invariant = false
if is_multi_depth and HasObservableSideEffects(instruction): // sub_7E0540
is_invariant = false
// Mark destination operands
for each dst_operand (bit 31 set = definition):
if type == 1 and not pinned:
if is_invariant:
reg.def_block = block_index // mark for hoisting
else:
reg.use_count += 1 // count loop-internal uses
The key insight is that invariance is determined by definition site: if every source register was defined outside the loop (or in a block already processed), the instruction is invariant. Immediates and constants are trivially invariant. The check is not purely structural -- it uses the reg+76 field which gets updated as hoisting proceeds, allowing transitive invariance discovery.
Stage 3: Backward Non-Invariance Marking (sub_8FEAC0, a3=0)
The backward pass uses the same function with a3=0. Instead of marking definitions as external, it marks operands whose definitions are inside the loop as non-invariant by setting reg.def_block = block_index. This clears any false positives from the forward pass where a register appeared invariant but its defining instruction depends on a loop-variant value.
For destination operands, the backward pass increments reg.use_count for all non-pinned register definitions, building the use-count information needed by the profitability check.
Stage 4: Transitive Invariance Propagation (sub_8F7DD0)
After the two marking passes, sub_8F7DD0 propagates invariance transitively through the instruction chain. This handles the case where instruction A is invariant and defines register R, and instruction B uses R and is otherwise invariant -- the forward pass may have marked B as non-invariant because R's definition was in the loop, but A (the definer) is itself invariant.
function PropagateInvariance(context, block_index):
block = blocks[block_index]
side_effect_mask = 0
for each instruction in block:
aliases_memory = CheckMemoryAlias(code_object, instruction) // sub_74F5E0
for each operand (type == 1, register):
reg = RegisterDescriptor(operand)
if operand is definition (bit 31 set):
if isNoOp(instruction):
if IsInvariant(instruction, block_index): // sub_8F76E0
side_effect_mask |= reg.flags & 0x3
else:
reg.flags |= aliases_memory ? 1 : 0
else:
reg.flags |= (has_side_effect ? 1 : 0) | 2
else: // use
if has_side_effect:
reg.def_block = block_index // taint defining register
else:
reg.use_count += 1
return side_effect_mask
Stage 5: Profitability Check (sub_8F8CB0)
The final gate before hoisting. Computes a cost-benefit ratio and rejects hoisting if the ratio is unfavorable.
function IsProfitable(context, block_index, budget, is_hoist_safe):
header_weight = context.header_insn_count // from sub_8F8BC0
body_weight = context.body_insn_count
// Scoring weights depend on pass aggressiveness and safety
if is_hoist_safe:
noOp_weight = (pass_id == 0) ? 60 : 150
real_weight = 5
else:
noOp_weight = (pass_id == 0) ? 12 : 30
real_weight = 1
score = 0
latency_penalty = 0
instruction_count = 0
for each instruction in block:
instruction_count += 1
if IsInvariant(instruction, block_index): // sub_8F76E0
if isNoOp(instruction):
score += noOp_weight
else:
score += 1
for each dst_operand with scoreboard flag:
score += real_weight
latency = GetLatencyClass(instruction) // sub_91E860
latency_penalty += (latency > 4) ? 2 : 1
else:
for each high-latency dst_operand:
latency_penalty += (latency > 4) ? 2 : 1
// Final decision: weighted score vs latency cost
if pass_id == 0: // aggressive
denominator = real_weight * instruction_count
else:
denominator = body_weight / 3 + header_weight
return denominator != 0 and (score * budget) / (real_weight * denominator) >= latency_penalty
The profitability check encodes a fundamental GPU tradeoff: hoisting reduces dynamic instruction count (proportional to trip count) but extends live ranges (increasing register pressure and reducing occupancy). The budget parameter, which varies by 100x between pass_id 0 and 3, controls how aggressively this tradeoff is resolved. Pass_id 0 (Early) uses the smallest denominator, making it easiest to exceed the threshold.
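The scoring loop above can be condensed into a runnable sketch (Python; the instruction field names and the treatment of non-invariant high-latency outputs are our reading of the decompilation, not ptxas's own structures):

```python
def is_profitable(pass_id, budget, is_hoist_safe, instructions,
                  header_weight, body_weight):
    """Sketch of the Stage 5 score-vs-latency gate. Each instruction is a
    dict with keys: invariant (bool), noop (bool), sb_dsts (count of
    scoreboarded destination operands), latency (latency class)."""
    if is_hoist_safe:
        noop_w = 60 if pass_id == 0 else 150
        real_w = 5
    else:
        noop_w = 12 if pass_id == 0 else 30
        real_w = 1

    score, penalty, count = 0, 0, 0
    for insn in instructions:
        count += 1
        lat_cost = 2 if insn["latency"] > 4 else 1
        if insn["invariant"]:
            if insn["noop"]:
                score += noop_w
            else:
                score += 1
                score += real_w * insn["sb_dsts"]
                penalty += lat_cost
        else:
            # non-invariant: only high-latency outputs add cost
            penalty += lat_cost * insn["sb_dsts"]

    denom = real_w * count if pass_id == 0 else body_weight // 3 + header_weight
    return denom != 0 and (score * budget) // (real_w * denom) >= penalty
```

With pass_id 0 the denominator is just the weighted instruction count, which is why the Early pass clears the threshold most easily.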
Per-Instruction Invariance Test (sub_8F76E0)
The leaf-level invariance test used by stages 4 and 5 is a simple definition-site check:
function IsInvariant(instruction, current_block_index):
num_operands = instruction.operand_count // inst+80
if num_operands == 0:
return false
// Find the last "interesting" operand (skip immediates/constants)
// Immediates have type bits in the 0x70000000 range
last_operand = scan backwards from operand[num_operands-1]
while (operand XOR 0x70000000) & 0x70000000 == 0
// Check: is this a register definition outside the current block?
if last_operand is negative (bit 31 = definition)
and type_bits == 1 (register)
and not pinned (byte+7 bit 0 == 0):
reg = RegisterDescriptor(last_operand & 0xFFFFFF)
return reg.def_block (reg+76) != current_block_index
return false
This is the most-called function in the LICM pipeline. It checks whether an instruction's primary output register was defined outside the current block -- if so, the instruction is considered invariant (already hoisted or defined in a dominating block).
Side-Effect Blocking Rules
An instruction is blocked from hoisting if any of the following conditions hold, regardless of operand invariance:
| Check | Function | Condition |
|---|---|---|
| Memory store | sub_7DF3A0 | Flags byte bits 2-3 set and bit 5 clear |
| Memory barrier | sub_74D720 | Opcode 159 (BAR.SYNC), 32 (MEMBAR), or 271 (barrier variant) |
| Indirect jump | sub_74D720 | Opcode 236 (BRX) |
| Volatile/atomic access | sub_7DFA80 | Called from sub_7E0540; detects volatile or atomic memory |
| Function call | vtable+1456 | isBarrier() returns true |
| Texture side effect | sub_7DF3A0 | Flags byte bit 6 set with operand modifier flag |
| Address-space effect | sub_7E0540 | Opcodes 85/109 (memory ops) with (flags+20 & 2) != 0 |
The boundary analysis (sub_74D720) also produces a 5-byte result array that gates the entire loop:
| Byte | Meaning | Effect |
|---|---|---|
| 0 | Has external predecessor (outside loop depth range) | Skip loop (not a natural loop) |
| 1 | Non-header block with different nesting | Marks as complex multi-depth loop |
| 2 | Contains barrier instruction | Skip loop entirely |
| 3 | Contains indirect jump | Skip loop entirely |
| 4 | Multi-depth safety flag | AND-ed with sub_7E5120 per inner block |
Instruction Counting (sub_8F8BC0)
Before the profitability check, sub_8F8BC0 counts instructions in the loop header and body separately. It walks the instruction linked list for each block in the loop and classifies each instruction using isNoOp (vtable+1824):
- No-op instruction (scheduling placeholder, predicate set, etc.): weight 1
- Real instruction (ALU, memory, branch, etc.): weight 30
The header count is stored at context+64 and the body count at context+68. The profitability formula uses these to normalize the hoisting score: a loop with a heavy header relative to the body benefits less from hoisting.
Instruction Movement (sub_8F7AE0)
After all checks pass, sub_8F7AE0 physically moves each invariant instruction from the loop body to the preheader:
- Invariance re-check. Calls sub_8F76E0 one final time per instruction. Instructions whose invariance status changed during the marking passes are skipped.
- Knob 484 gate. Queries the allocator for knob 484; if disabled, no movement occurs. This provides a fine-grained override separate from the loop-level knob 381.
- Preheader creation. On the first hoisted instruction, creates or locates the preheader block:
  - If the loop has an existing preheader block (context+16 non-null): clones it via sub_931920, copies convergence flags from the original preheader's offset +282 bit 3, and links it into the CFG via sub_8F7610.
  - If no preheader exists: creates a new block via sub_92E1F0 and links it.
- Unlink and reinsert. For each invariant instruction:
  - sub_9253C0(code_object, instruction, 1): unlinks the instruction from the current block.
  - sub_91E290(code_object, instruction): inserts at the preheader insertion point.
  - Updates the Ori instruction's control word at instruction+32 (not the SchedNode): sets bit 1 at byte offset +13 to mark the instruction as hoisted (prevents the scheduler from reordering it back into the loop).
- Destination register tracking. For each output operand, if the defining instruction at reg+56 differs from the current instruction, sets context+44 (hoisted_cbo flag). For pass_id == 2, additionally sets reg+48 bit 26 if the register class is in {2, 3, 4} (GPR classes) and the preheader has the convergence flag.
- Special IADD3 handling. For pass_id == 3, instructions with opcode 130 (IADD3), flag 0x1000, and a negative byte at +90 (carry chain) receive special treatment via sub_9232B0, which adjusts the carry-out register linkage before movement.
Multi-Depth Loop Handling
For loops with nesting depth > 1 (inner loops within the hoisting target), sub_8FF780 performs multiple rounds of sub_8FF2D0 calls:
- Header block. First call processes the loop header with hoist_mode = 0.
- Intermediate blocks. For each depth level between header_depth+1 and max_depth, checks if the block's parent depth (block+148) matches the header depth. If the block is a back-edge predecessor of the loop header, uses hoist_mode = 3. Otherwise, checks a dominance bitmap at block[25] + 4*(depth >> 5): if bit (1 << depth) is set, uses hoist_mode = 1 (dominated); otherwise hoist_mode = 2 (non-dominated).
- Back-edge block. Final call with hoist_mode = 3 and the deepest back-edge block index, ensuring the budget is split among back-edge predecessors.
Multi-depth permission is gated by knob 220 (queried at allocator[9]+15840 for the fast path) and the DisableNestedHoist knob. When hoisting from an inner loop to the header of an outer loop, an additional constraint applies:
allow_nested = allow_nested_hoist AND is_simple_loop
AND body_insn_count > 1
AND num_predecessors == 1
AND body_insn_count < header_insn_count * max_iterations
This prevents hoisting from inner loops where the cost (extended live range across multiple loop levels) exceeds the benefit (reduced inner-loop dynamic count).
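Expressed as a predicate (Python sketch; the parameter names are ours):

```python
def allow_nested_hoist(nested_knob_enabled, is_simple_loop,
                       body_insn_count, header_insn_count,
                       num_predecessors, max_iterations):
    """Sketch of the inner-to-outer hoist gate: nested hoisting must be
    enabled, the loop simple with a single predecessor, and the body
    small relative to header weight times the iteration limit."""
    return (nested_knob_enabled
            and is_simple_loop
            and body_insn_count > 1
            and num_predecessors == 1
            and body_insn_count < header_insn_count * max_iterations)
```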
LICM Outer Loop (sub_8FF780)
The complete outer driver that iterates over all loop nests:
function HoistInvariantsCore(context):
code_object = context.code_object
pass_id = context.pass_id
// Read iteration limit from allocator+34632
config_byte = allocator[34632]
max_iterations = (config_byte == 0) ? 2
: (config_byte == 1) ? allocator[34640]
: 0 // unlimited
allow_nested_hoist = (allocator[20016] != 0)
// Prepare IR
RebuildInstructionList(code_object, 1) // sub_781F80
RecomputeRegisterPressure(code_object, 1, 0, 0, 0) // sub_7E6090
RecomputeLoopDepths(code_object, 0) // sub_773140
if code_object.flags[176] & 2 and pass_id > 1:
RecomputeLoopNesting(code_object) // sub_789280
// Clear prior invariance markers
for each block in instruction list:
block.marker (offset +76) = 0xFFFFFFFF
// Iterate from innermost loop outward (last RPO entry first)
current = blocks[rpo[block_count]]
while current is valid:
if current has no predecessors or no first instruction:
advance; continue
// Count predecessors at >= current loop depth
header_depth = current.loop_depth // offset +144
for each predecessor:
if pred.loop_depth >= header_depth:
num_at_depth++; track deepest back-edge index
if num_at_depth == 0: // not a loop header
advance; continue
// Simple vs multi-depth
if max_depth == header_depth:
is_simple = true
else:
info = AnalyzeBoundaries(code_object, header_depth, max_depth)
if has_external_pred or has_barrier or has_indirect_jump:
advance; continue
if !MultiDepthAllowed(knob_220):
advance; continue
context.is_multi_depth = true
// Find preheader and query knob 381
context.insert_pt = FindPreheader(code_object, current, ...)
if !ShouldHoist(QueryKnob381(381, current), pass_id, opt_level):
advance; continue
// Count instruction weights
CountInstructions(context) // sub_8F8BC0
// === CORE HOISTING PIPELINE (per loop) ===
sub_8FF2D0(context, header_block, ...) // header block
if context.is_multi_depth:
for depth in (header_depth+1 .. max_depth-1):
sub_8FF2D0(context, block_at_depth, ..., hoist_mode, ...)
sub_8FF2D0(context, back_edge_block, ..., 3, ...) // back-edge
// Post-hoist cleanup
if context.changed and current.num_back_edge_successors > 1:
RebuildInstructionList(code_object, 0)
advance to next loop
Hoisting Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| HoistBudget | FLOAT | 10 | Maximum number of instructions to hoist per loop (0 = unlimited) |
| HoistLoopInvBudget | FLOAT | 22 (early) / 100 (late) | Budget specifically for loop-invariant hoisting; pass_id 0 uses 22, pass_id > 0 uses 100 |
| HoistConservativeScale | INT | 10 (divisor) | Inner-loop budget divisor; budget /= scale when hoisting from inner loops |
| HoistLate | INT | per-block policy | Per-block hoisting policy (0=always, 1=inner only, 3=never) |
| HoistCBOMode | INT | 0 | Constant-buffer-object hoisting mode |
| HoistCBOLoad | INT | unset | Enable hoisting of CBO load instructions |
| HoistCBOFromLoopWithColdNest | INT | 1 (enabled) | Hoist CBO loads even from loops with cold nesting |
| HoistCBOHighCostSBInstRatioThreshold | INT | unset | Scoreboard cost threshold for CBO hoisting |
| HoistCBOLoadIDOMTravseLimit | INT | 4 | IDOM traversal limit for CBO load hoisting |
| HoistCBORRegPressureLimitApplyRate | INT | 80 | R-register pressure limit application rate (percentage) |
| HoistTexToInstRatioHigh | DBL | 0.045 | High texture-to-instruction ratio threshold for aggressive hoisting |
| HoistTexToInstRatioLow | DBL | 0.03 | Low texture-to-instruction ratio threshold for conservative hoisting |
| DisableNestedHoist | BOOL | false | Disable hoisting from nested loops (false = nested hoisting allowed) |
| NestedHoistInnerThreshold | INT | 22 / 100 | Inner loop instruction threshold for nested hoisting (same value as HoistLoopInvBudget) |
| NestedHoistOuterThreshold | INT | 10 | Outer loop instruction threshold for nested hoisting (same value as HoistBudget) |
| UseNewLoopInvariantRoutineForHoisting | BOOL | false | Use updated set-based invariance check routine (legacy single-pass is default) |
| MaxMidHeaderSizeRateForAggressiveHoist | INT | 2 | Maximum LICM iteration count (limits repeated hoisting passes) |
| EnableHoistLowLatencyInstMidBlock | BOOL | false | Hoist low-latency instructions from mid-block positions |
| MovWeightForSinkingHoisting | DBL | 0.25 | Weight for MOV instructions in sink/hoist decisions |
GPU-Specific LICM Concerns
Constant buffer loads. GPU shaders frequently load from constant buffers (LDC). These loads are loop-invariant by definition (the buffer is read-only during kernel execution). The HoistCBO* knobs control a specialized path that aggressively hoists these loads, trading register pressure for reduced memory traffic.
Register pressure vs. occupancy. Every hoisted instruction extends its live range from the preheader through the entire loop. On GPUs, this directly reduces occupancy. The four LICM passes use increasingly conservative heuristics (controlled by pass_id) to avoid excessive register growth in later pipeline stages where register allocation is imminent.
Texture instruction hoisting. Texture fetches (TEX, TLD, TLD4) are high-latency and loop-invariant when their coordinates are loop-invariant. The HoistTexToInstRatio* knobs provide thresholds for deciding when to hoist texture instructions -- a tradeoff between reducing loop body latency and increasing preheader register pressure.
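The exact consumer of the texture-ratio knobs was not recovered, so the following is only an illustration of the threshold arithmetic they imply (Python; the function name and return values are ours):

```python
def tex_hoist_policy(tex_count, inst_count, high=0.045, low=0.03):
    """Hypothetical gate built from HoistTexToInstRatioHigh/Low:
    a high texture density favors aggressive hoisting, a moderate
    density a conservative policy, and a low density none."""
    if inst_count == 0:
        return "none"
    ratio = tex_count / inst_count
    if ratio >= high:
        return "aggressive"
    if ratio >= low:
        return "conservative"
    return "none"
```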
Phase 59 -- OriLoopFusion
Purpose
Fuses adjacent loops with compatible bounds and no inter-loop data dependencies into a single loop. This reduces loop overhead (branch, induction variable update) and creates opportunities for the scheduler to overlap instructions from the formerly separate loop bodies.
Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| PerformLoopFusion | INT | 0 (disabled) | Master enable for loop fusion; must be explicitly set to a nonzero value |
| PerformLoopFusionBudget | FLOAT | unset | Maximum instruction count in fused body |
Fusion Criteria
Two adjacent loops L1 followed by L2 are candidates for fusion when:
- Same trip count. Both loops iterate the same number of times (same induction variable bounds and stride, or equivalent after normalization).
- No violated inter-loop dependencies. No flow dependence (write in L1, read in L2) that crosses iteration boundaries differently after fusion. Since both loops are sequential pre-fusion, this reduces to: L2 must not read a value written by L1 at a different iteration index.
- Compatible loop structure. Both must be single-basic-block bodies (or the fused body must remain within the PerformLoopFusionBudget instruction limit).
- No intervening barriers. No BAR.SYNC, MEMBAR, or fence instructions between the two loop bodies.
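A minimal illustration of the transformation, using Python as a stand-in for two adjacent same-trip-count loops with no cross-loop dependence:

```python
def before(n, a, b, c, d):
    for i in range(n):        # L1
        c[i] = a[i] + b[i]
    for i in range(n):        # L2: reads only its own iteration's inputs
        d[i] = a[i] * 2
    return c, d

def after(n, a, b, c, d):
    for i in range(n):        # fused body: one branch + IV update per iteration
        c[i] = a[i] + b[i]
        d[i] = a[i] * 2
    return c, d
```

Both versions compute identical results; the fused form halves the loop overhead and lets the scheduler interleave the two bodies.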
Pipeline Position Rationale
Phase 59 runs after GeneralOptimizeLate (phase 58) and before predication (phase 63). This position is chosen because:
- Late expansion (phase 55) may have split a single operation into a pair of loops (e.g., an atomic-reduce pattern becomes a compare loop followed by an exchange loop).
- After fusion, the merged loop body gives predication (phase 63) a larger basic block to work with, improving if-conversion opportunities.
- The subsequent LICM (phase 66) can hoist invariants from the fused loop that were not hoistable from either original loop individually (because they appeared in the "between-loops" region).
Loop Infrastructure Functions
Several utility functions are shared across the loop passes:
| Function | Address | Size | Purpose |
|---|---|---|---|
| sub_781F80 | 0x781F80 | -- | Rebuild instruction linked list after CFG modification |
| sub_789280 | 0x789280 | -- | Recompute loop nesting depths (called when flags[176] & 2 set) |
| sub_773140 | 0x773140 | -- | Recompute register pressure estimates |
| sub_7E6090 | 0x7E6090 | 2,614 | Create complex multi-operand instruction (used in unroll body duplication) |
| sub_7753F0 | 0x7753F0 | -- | Loop peeling setup (splits first/last iterations) |
| sub_789BE0 | 0x789BE0 | -- | Back-edge canonicalization |
| sub_74D720 | 0x74D720 | -- | Loop boundary analysis (determines header, latch, exit) |
| sub_74F500 | 0x74F500 | -- | Find preheader block for a given loop |
| sub_9253C0 | 0x9253C0 | -- | Edge splitting / preheader block insertion |
| sub_7A1A90 | 0x7A1A90 | -- | OCG knob query (boolean) |
| sub_7A1B80 | 0x7A1B80 | -- | OCG knob query (multi-valued) |
| sub_799250 | 0x799250 | -- | Named-phase DUMPIR check (string match against phase name) |
| sub_A112C0 | 0xA112C0 | -- | Trigger sub-analysis re-run (liveness, CFG refresh) |
| sub_BDEA50 | 0xBDEA50 | -- | Back-edge information printer (bix%d -> backedge's successor BB: %d) |
Related Passes
| Phase | Name | Relationship |
|---|---|---|
| 3 | AnalyzeControlFlow | Builds the CFG, identifies loops, computes dominators -- prerequisite for all loop passes |
| 19 | OriSplitLiveRanges | Splits live ranges at loop boundaries to reduce register pressure post-simplification |
| 20 | PerformPGO | Applies profile data that informs unrolling and pipelining heuristics |
| 21 | OriStrengthReduce | Reduces induction variable strength before unrolling |
| 23 | GenerateMovPhi | Inserts SSA phi nodes after unrolling changes the CFG |
| 25 | StageAndFence | Inserts memory fences needed by pipelined loops |
| 56 | SpeculativeHoistComInsts | Speculatively hoists common instructions above branches (related to LICM) |
| 108 | OptimizeHotColdInLoop | Post-RA hot/cold partitioning within loop bodies |
| 138 | OriSplitHighPressureLiveRanges | Last-resort splitter when unrolling or LICM caused excessive register pressure |
Cross-References
- Pass Inventory & Ordering -- complete 159-phase table
- Strength Reduction -- phase 21, IV simplification before unrolling
- Predication -- phase 63, creates new LICM opportunities for phase 66
- GMMA/WGMMA Pipeline -- phases 85, 87, creates LICM opportunities for phase 88
- Late Legalization -- phase 78, creates LICM opportunities for phase 79
- Hot/Cold Partitioning -- phase 108, loop-interior hot/cold splitting
- Liveness Analysis -- phases 16, 33, 61, 84 -- liveness drives unroll register pressure
- Knobs System -- knob infrastructure, ROT13 encoding
- Scheduling Architecture -- pipelined loops interact with the instruction scheduler
Strength Reduction
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Phase 21 (OriStrengthReduce) replaces expensive arithmetic operations with cheaper equivalents in the Ori IR. It runs early in the optimization pipeline -- after loop simplification (phase 18) and live range splitting (phase 19), but before loop unrolling (phase 22) and software pipelining (phase 24). This placement is deliberate: strength reduction benefits from canonicalized loop structure and benefits subsequent loop transformations by simplifying induction variable expressions.
Strength reduction in ptxas is not a single monolithic pass. It is distributed across three layers, each operating at a different abstraction level:
- Phase 21 (OriStrengthReduce) -- Ori-level induction variable strength reduction on the use-def graph
- Peephole patterns -- SASS-level algebraic simplifications in the MainPeepholeOptimizer (sub_83EF00)
- Division lowering templates -- Newton-Raphson integer division sequences emitted during instruction selection
| Phase index | 21 |
| Phase name | OriStrengthReduce |
| Category | Optimization |
| Pipeline position | Stage 2 (Early Optimization), between PGO (phase 20) and loop unrolling (phase 22) |
| Vtable address | off_22BD910 |
| execute() | sub_C5FB30 (wrapper) -> sub_752E40 (core logic, 359 lines decompiled, ~1.2 KB binary) |
| isNoOp() | sub_C5F3D0 -- returns 0 (always runs) |
| getName() | sub_C5F3C0 -- returns 21 |
| Gate knob | 487 (general optimization enablement) |
| Key helpers | sub_745A80 (replacement register creator), sub_91BF30 (virtual register allocator), sub_A13890 (use-def chain iterator), sub_9253C0 (instruction deleter) |
| Peephole SHR+SHL->BFE | sub_81DB30 (matcher: sub_81D7E0) |
| Peephole BFE+ADD | sub_81DDD0 (matcher: sub_81DBC0) |
| Division templates | sub_1724A20 (32-bit, 28 KB), sub_1728930 (64-bit unsigned, 16.5 KB), sub_1727AC0 (64-bit signed, 13.7 KB) |
Phase 21: Induction Variable Strength Reduction
Architecture
The execute wrapper (sub_C5FB30) gates on multi-function compilation (function count > 1 via sub_7DDB50) and delegates to sub_752E40 with parameters (context, 0, 0, 0).
sub_752E40 is the core. It performs a single-pass walk over the instruction list, focusing on a specific intermediate opcode -- opcode 137 (SM73_FIRST), masked with & 0xFFFFCFFF to strip modifier bits in the opcode field at instruction offset +72. The ROT13 name SM73_FIRST is a generation boundary marker name, but the Ori IR reuses this opcode slot at runtime for IMAD-like multiply-accumulate instructions in their pre-lowered form. The actual SASS IMAD is opcode 1.
Algorithm
The pass executes in two phases within a single call:
Phase 1 -- Trivial multiply elimination. The first loop walks the instruction list (*(context+272) is the list head). For each instruction with masked opcode == 137 (SM73_FIRST; IMAD-like):
- Check if the destination register (operand at +84) has no uses (*(def+56) == NULL) AND the source chain is empty (*src_chain == NULL). If both hold, delete the instruction via sub_9253C0 -- it is dead.
- Otherwise, for each source operand (iterating from operand count - 1 down to 0):
  - Check operand type: must be register ((operand >> 28) & 7 == 1)
  - Look up the register definition in the SSA value table (*(context+88) + 8 * (operand & 0xFFFFFF))
  - Check the definition has no special flags (*(def+48) & 0x400000022 == 0)
  - Check the register type is not 9 (predicate register)
  - Check the source operand's use chain is empty (single-use) and the def has no other users
  - If all conditions hold, call sub_91BF30 to allocate a replacement register with the same type, then patch the operand in place
- Check operand type: must be register (
Phase 2 -- Use-def chain traversal. The second loop walks the instruction list again. For each instruction with operands that have been marked (flag 0x100 at instruction[6], set during initialization):
- For each source operand with a use chain:
  - Compute the replacement register via sub_745A80(context, def, a4), which:
    - Allocates a new virtual register via sub_91BF30 with the same type as the original
    - Copies the data type field (+16) and relevant flags (0x40, 0x10, 0x8 bits of the flags word at +48)
    - Returns the new register ID
  - If the operand was not yet marked (flag 0x100 bit not set), initialize it and mark as "needs strength reduction"
  - Traverse the use chain as a worklist: for each user of the replaced register, check if its uses also need updating, growing the worklist dynamically (doubling allocation via pool allocator)
- Track how many source operands were rewritten (v72 counter)
- After processing all operands of an instruction: if the instruction is still opcode 137 (SM73_FIRST; IMAD-like) and certain conditions hold (destination matches source pattern, specific operand bit patterns), either delete it or convert it to opcode 130 / 0x82 (HSET2 in ROT13; used as an internal MOV-like marker -- actual SASS MOV is opcode 19)
The worklist traversal is the key algorithmic insight: when a multiply's result feeds into another multiply, the chain of strength reductions propagates transitively through the def-use graph.
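The worklist propagation can be sketched as follows (Python; users_of is a hypothetical flattened view of the use chains, and the marked set stands in for the 0x100 flag):

```python
from collections import deque

def propagate_replacements(roots, users_of):
    """Sketch of the transitive def-use worklist described above: starting
    from the directly reduced registers, visit every dependent user exactly
    once, growing the worklist as new users are discovered."""
    marked = set(roots)            # flag 0x100 analogue
    work = deque(roots)
    while work:
        reg = work.popleft()
        for user in users_of.get(reg, ()):
            if user not in marked:
                marked.add(user)
                work.append(user)
    return marked
```

A chain r1 -> r2 -> r3 is fully reached from r1, while an unrelated chain r4 -> r5 is untouched.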
Data Flow
sub_C5FB30 (execute wrapper)
|
+-- sub_7DDB50: check function count > 1
|
+-- sub_752E40 (core logic)
|
+-- sub_7468B0 / vtable+152: check knob 487 (optimization enabled)
|
+-- Phase 1: Walk instruction list (*(ctx+272))
| +-- For opcode 137 (`SM73_FIRST`; IMAD-like) instructions:
| | +-- sub_9253C0: delete dead instructions
| | +-- sub_91BF30: allocate replacement registers
| |
| +-- Clear flag 0x100 on all basic blocks (*(ctx+104) chain)
| +-- Set flag 0x40 at ctx+1385
|
+-- sub_A13890: initialize use-def traversal context
| +-- Creates context object with vtable off_21DBEF8
| +-- Sets up iterator with vtable off_21B4FD0
|
+-- Phase 2: Walk instruction list again
| +-- For each source operand with use chain:
| | +-- sub_745A80: create replacement register
| | +-- Worklist propagation through use chain
| |
| +-- Convert trivial IMAD to MOV (opcode 130 / `0x82`, `HSET2`; MOV-like)
| +-- sub_9253C0: delete fully reduced instructions
|
+-- sub_7B52B0: optional post-reduction scheduling pass
| (called if any replacements were made, flag v76)
|
+-- sub_8E3A20: destroy use-def context
Instruction Representation
The pass operates on the Ori IR instruction format. Relevant fields:
| Offset | Field | Usage in this pass |
|---|---|---|
| +8 | next pointer | Instruction list traversal |
| +64 | source operand chain | Array of {use_chain_ptr, ...} per operand |
| +72 | opcode (DWORD) | Bits 0-11 = base opcode, bits 12-13 = modifier (masked with 0xCF) |
| +80 | operand count | Number of source operands |
| +84 | operand[0] | First source operand descriptor (bits 28-30 = type tag, bits 0-23 = register ID) |
| +92 | operand[1] | Second source operand |
| +100 | operand[2] | Third source operand (for IMAD: accumulator) |
Operand type tags (bits 28-30):
- 1 = register operand (index into SSA value table at *(context+88))
- 2, 3 = immediate operand
- 7 = special/predicate
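Decoding one 32-bit operand descriptor under this layout can be sketched as (Python; the dict key names are ours):

```python
def decode_operand(word):
    """Decode an Ori operand descriptor per the field layout above:
    bit 31 = definition flag, bits 28-30 = type tag,
    bits 0-23 = register ID (for type tag 1)."""
    return {
        "is_def": bool(word & 0x80000000),
        "type":   (word >> 28) & 7,
        "reg_id": word & 0xFFFFFF,
    }
```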
Register definition structure (from SSA value table):
| Offset | Field | Usage |
|---|---|---|
| +8 | register ID | Unique identifier |
| +16 | data type | Copied to replacement register |
| +20 | use count | Checked for single-use optimization |
| +28 | replacement ID | Set by sub_745A80 to point to strength-reduced version |
| +48 | flags (QWORD) | Bit 0x100 = "marked for strength reduction", bit 0x40 = volatile, bit 0x10/0x8 = scheduling hints |
| +56 | defining instruction | Pointer to the instruction that defines this register |
| +64 | register class | Type code (2/3 = integer, 4 = predicate, 7 = special, 9 = predicate) |
Peephole Strength Reduction Patterns
The MainPeepholeOptimizer (sub_83EF00, 29 KB, case 2 of the opcode switch) applies algebraic strength reduction patterns at the SASS instruction level. These run later in the pipeline than phase 21 and operate on concrete SASS opcodes rather than the pre-lowered intermediate form.
Pattern: SHR + SHL -> BFE (Bit-Field Extract)
Matcher: sub_81D7E0 (166 lines decompiled)
Emitter: sub_81DB30
Target opcodes: 290, 151, or 2 (various ALU forms) with operand size 11 or 12
Recognition:
- The instruction must have two register source operands (type tag 1), no modifier bits, no special flags
- Source operand 0's definition must be opcode 213 (SHL) or 214 (SHR)
- Source operand 1's definition must be the complementary shift (SHR if 0 was SHL, or vice versa)
- Both shift definitions must have immediate shift amounts (type tag 2 or 3)
- The shift amounts must sum to 32 (i.e.,
SHL(x, n)paired withSHR(x, 32-n)) - Both definitions must dominate the current instruction (
sub_1245740dominance check) - Loop depth heuristic: if the shift definitions are in a shallower loop than the current instruction (checked via block RPO depth at
*(block+156)), the transformation may be suppressed to avoid increasing register pressure
Transformation:
Before: t1 = SHL(x, n) ; opcode 213
t2 = SHR(x, 32-n) ; opcode 214
r = ALU(t1, t2) ; opcode 290/151/2
After: r = BFE(x, ...) ; opcode 210 (bit-field extract)
The emitter calls sub_9314F0 to create a BFE instruction (opcode 210) with the appropriate operands, then sub_9253C0 to delete the original ALU instruction.
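The recognition rules above can be restated as a standalone matcher. The following is a hedged Python sketch over a toy instruction form -- `Instr`, its field names, and the `(type_tag, def_instr)` operand encoding are illustrative, not the binary's actual layout, and the dominance and loop-depth checks are omitted:

```python
# Toy IR sketch of the SHR+SHL -> BFE recognition rules described above.
# Field names and operand encoding are illustrative, not the binary's layout.
SHL, SHR = 213, 214

class Instr:
    def __init__(self, opcode, operands, imm=None):
        self.opcode = opcode
        self.operands = operands   # list of (type_tag, def_instr) pairs
        self.imm = imm             # immediate shift amount, if any

def is_imm(tag):
    return tag in (2, 3)          # type tags 2/3 = immediate operand

def matches_bfe_pattern(alu):
    """True if ALU(t1, t2) combines SHL(x, n) with SHR(x, 32-n)."""
    if len(alu.operands) != 2:
        return False
    (tag0, d0), (tag1, d1) = alu.operands
    if tag0 != 1 or tag1 != 1:    # both sources must be registers
        return False
    # the two definitions must be complementary shifts
    if {d0.opcode, d1.opcode} != {SHL, SHR}:
        return False
    # both shift amounts must be immediates
    for d in (d0, d1):
        if not is_imm(d.operands[1][0]):
            return False
    return d0.imm + d1.imm == 32  # amounts must sum to 32

shl = Instr(SHL, [(1, None), (2, None)], imm=5)
shr = Instr(SHR, [(1, None), (2, None)], imm=27)
alu = Instr(290, [(1, shl), (1, shr)])
print(matches_bfe_pattern(alu))   # True
```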
Pattern: BFE Folding into ADD
Matcher: sub_81DBC0 (83 lines decompiled)
Emitter: sub_81DDD0
Target opcode: 2 (IADD) with operand size 11 or 12
Recognition:
- One source operand is defined by opcode 210 (BFE)
- The BFE has no modifier bits, no special flags on the last operand
- The BFE's immediate operand (shift amount) is 1-31
- The BFE has a single use (use count <= 1)
- Dominance check passes
Transformation:
Before: t = BFE(x, amount) ; opcode 210
r = IADD(t, y) ; opcode 2
After: r = LOP3/SHF(x, y, ...) ; opcode 102, combining shift+add
The emitter creates opcode 102 (a combined shift-and-add operation) with encoded shift amount (8 * amount | 0x60000002).
Integer Division Lowering
Integer division and modulo by non-constant values are lowered to multi-instruction sequences during instruction selection. This is not part of phase 21 but is the most visible strength reduction in ptxas output. The sequences use the classic Barrett reduction / Newton-Raphson reciprocal algorithm.
32-bit Division -- sub_1724A20
Size: 28,138 bytes decompiled (the largest function in the 0x1723000-0x17F8000 ISA description range)
Called from: sub_1727130 (template driver)
Instruction count: ~35 SASS instructions emitted
Algorithm (unsigned 32-bit a / b):
- Convert divisor to float: `I2F(b)` (opcode 0xD5)
- Compute approximate reciprocal via `MUFU.RCP` (opcode 0x3C)
- Convert back to integer: `F2I(1/b)` (opcode 0xD6)
- Refine via multiply-add: `IMAD(q, b, ...)` (opcode 0x6E)
- Correction step with conditional branch: `ISETP` + `BRA` (opcodes 0xC9, 0x5F)
- Final adjustment via `IADD` (opcode 0x02)
Key constants allocated via sub_91D160:
- 23 (float exponent bias for mantissa extraction)
- 255 (exponent mask)
- 127 (IEEE 754 single-precision bias)
- 254 (double-bias for overflow guard)
- 1, -1 (correction increments)
The temporary register pool uses indices 90-126 from a parameter array (a7[]), providing 37 dedicated scratch registers for the sequence.
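The shape of the sequence can be emulated in scalar code. The following is a hedged Python sketch of the same reciprocal-estimate-plus-correction scheme; Python doubles stand in for `MUFU.RCP`, so this mirrors the algorithm's structure, not the exact SASS rounding behavior:

```python
def udiv32_newton(a, b):
    """Unsigned 32-bit a // b via float reciprocal + correction, mirroring
    the I2F / MUFU.RCP / F2I / IMAD / ISETP shape described above.
    Not bit-exact with SASS: Python floats stand in for MUFU.RCP."""
    assert 0 < b < 2**32 and 0 <= a < 2**32
    r = 1.0 / float(b)            # I2F + MUFU.RCP approximation
    q = int(a * r)                # F2I of the scaled estimate
    rem = a - q * b               # IMAD: rem = a - q*b
    while rem >= b:               # ISETP + BRA correction (under-estimate)
        q += 1                    # IADD adjustment
        rem -= b
    while rem < 0:                # correction for float over-estimate
        q -= 1
        rem += b
    return q

print(udiv32_newton(4294967295, 3))   # 1431655765
```

Because the float estimate is off by at most a few units, the correction loops run only a handful of iterations; the SASS sequence achieves the same effect with a fixed number of predicated adjustment instructions instead of a loop.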
64-bit Division
Two variants handle 64-bit operands:
- `sub_1728930` (16,545 bytes): unsigned 64-bit division. Emits longer sequences with double-width multiply and carry propagation.
- `sub_1727AC0` (13,776 bytes): signed 64-bit division. Parallel structure with sign-handling logic.
Both are called from sub_1729B50.
Division by Constant
Division by compile-time constant is handled separately during constant folding (in the GeneralOptimize bundle passes). The classic magic-number multiplication technique (Granlund-Montgomery) converts x / C into MULHI(x, magic) >> shift, avoiding the Newton-Raphson sequence entirely. This produces 2-3 instructions instead of ~35.
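The magic-number search itself is short. The following is a hedged sketch of the classic Granlund-Montgomery derivation (`magic_u32` is an illustrative name, not a ptxas symbol): it finds `(m, p)` such that `x // d == (x * m) >> p` for all 32-bit unsigned `x`, using the exactness bound that the worst-case accumulated error must stay below `2^p`:

```python
def magic_u32(d, N=32):
    """Find (m, p) with x // d == (x * m) >> p for all 0 <= x < 2^N.
    m = ceil(2^p / d); writing x = q*d + r, the shift is exact when
    q_max*e + (d-1)*m < 2^p, where e = m*d - 2^p is the rounding error."""
    assert d > 1
    for p in range(N, 2 * N + 1):
        m = -(-(1 << p) // d)            # ceil(2^p / d)
        e = m * d - (1 << p)             # 0 <= e < d
        q_max = ((1 << N) - 1) // d      # largest possible quotient
        if q_max * e + (d - 1) * m < (1 << p):
            return m, p
    raise ValueError("no magic found")

m, p = magic_u32(7)
for x in (0, 1, 6, 7, 100, 2**32 - 1):
    assert (x * m) >> p == x // 7
```

Note that for some divisors (7 included) the magic constant exceeds 32 bits, which is why real codegen also has an add-and-shift fixup variant; this sketch ignores that refinement and simply allows a wider multiplier.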
SASS Cost Model
The profitability of strength reduction on NVIDIA GPUs differs from CPUs in several important ways:
Integer multiply is cheap. Modern NVIDIA GPUs (sm_70+) have dedicated integer multiply-add (IMAD) functional units. IMAD has the same throughput as IADD on most architectures -- both are single-cycle operations on the integer ALU. This means the classical "replace multiply with shift+add" transformation is often not profitable on GPU. ptxas does not aggressively replace multiplies with shift chains the way CPU compilers do.
Integer division is expensive. There is no hardware integer divider. Division must be lowered to the ~35-instruction Newton-Raphson sequence described above. This is why division-by-constant is a high-priority optimization -- replacing 35 instructions with 2-3 is a massive win.
Shift operations. SHL and SHR are single-cycle on the integer ALU, same throughput as IADD and IMAD. However, they use a different functional unit slot on some architectures, which can matter for scheduling.
BFE (bit-field extract) is a dedicated single-cycle instruction. Recognizing SHR+SHL pairs and folding them to BFE saves an instruction and a register, which is the primary motivation for the peephole patterns.
Register pressure dominates. On GPUs, the primary cost metric is not instruction count but register pressure, because register count directly determines occupancy (the number of concurrent warps). The strength reduction pass checks loop depth before transformations and suppresses replacements that would increase register pressure in inner loops (the RPO depth comparison in sub_81D7E0).
This explains why phase 21's core logic is relatively compact (~1.2 KB binary) compared to CPU compilers' strength reduction passes: the GPU cost model makes fewer algebraic replacements profitable, so the pass focuses narrowly on use-def chain simplification and trivial multiply elimination rather than elaborate pattern tables.
Pipeline Context
Phase 21 runs after:
- Phase 18 (`OriLoopSimplification`) -- loops are canonicalized with single entry, single back-edge, and preheaders
- Phase 19 (`OriSplitLiveRanges`) -- live ranges are split at loop boundaries
- Phase 20 (`PerformPGO`) -- profile data is applied (block weights inform the cost model)
Phase 21 runs before:
- Phase 22 (`OriLoopUnrolling`) -- simplified induction variables enable better unroll decisions
- Phase 24 (`OriPipelining`) -- strength-reduced loops are more amenable to software pipelining
- Phase 29 (`GeneralOptimize`) -- compound pass cleans up any dead code left by strength reduction
The GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) also perform algebraic simplification that overlaps with strength reduction -- specifically constant folding of multiply-by-power-of-2 to shifts. Phase 21 handles the more complex cases that require use-def chain analysis, while GeneralOptimize handles local, single-instruction rewrites.
Function Map
| Address | Size | Function | Role |
|---|---|---|---|
| sub_C5FB30 | 9 bytes | OriStrengthReduce::execute | Vtable entry, gates on function count |
| sub_C5F3C0 | 16 bytes | OriStrengthReduce::getName | Returns phase index 21 |
| sub_C5F3D0 | 16 bytes | OriStrengthReduce::isNoOp | Returns 0 (never skipped) |
| sub_752E40 | ~1.2 KB | Core strength reduction | Use-def chain walk, replacement |
| sub_745A80 | 168 bytes | Replacement register creator | Allocates new register with copied type/flags |
| sub_91BF30 | ~400 bytes | Virtual register allocator | Creates 160-byte register descriptor |
| sub_9253C0 | 325 bytes | Instruction deleter | Unlinks and removes instruction (634 callers) |
| sub_A13890 | ~2 KB | Use-def context initializer | Sets up chain traversal structures |
| sub_81D7E0 | ~660 bytes | SHR+SHL->BFE matcher | Peephole pattern recognizer |
| sub_81DB30 | ~112 bytes | SHR+SHL->BFE emitter | Emits BFE (opcode 210) |
| sub_81DBC0 | ~330 bytes | BFE+ADD matcher | Peephole pattern recognizer |
| sub_81DDD0 | ~100 bytes | BFE+ADD emitter | Emits combined shift-add (opcode 102) |
| sub_1724A20 | 28,138 bytes | 32-bit div/mod template | Newton-Raphson integer division |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div template | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div template | Signed variant |
| sub_1727130 | ~2 KB | Division template driver | Allocates temps, dispatches to templates |
Cross-References
- Pass Inventory & Ordering -- complete 159-phase table showing phase 21's position
- GeneralOptimize Bundles -- algebraic simplification sub-passes that complement strength reduction
- Loop Passes -- loop canonicalization (phase 18) that enables induction variable analysis
- Ori IR Overview -- instruction format, opcode encoding (ROT13), register model
- Peephole Optimization -- `MainPeepholeOptimizer` containing SHR+SHL->BFE patterns
- Newton-Raphson Templates -- detailed analysis of division lowering sequences
- Scheduling -- `sub_7B52B0` scheduling pass called after strength reduction
- Knobs System -- knob 487 controlling optimization enablement
Copy Propagation & CSE
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Copy propagation and common subexpression elimination in ptxas are spread across four dedicated pipeline phases (49, 50, 64, 83) plus a forward copy propagation sub-pass (OriCopyProp) embedded inside every GeneralOptimize bundle. Together these passes form the value-redundancy elimination subsystem: they detect computations that produce values already available elsewhere in the program, then eliminate the redundant instructions or replace them with cheaper copies.
The four dedicated phases run at specific pipeline positions chosen to exploit opportunities created by preceding transformations. GvnCse (phase 49) runs after mid-level expansion and argument enforcement when the IR is maximally normalized. OriReassociateAndCommon (phase 50) immediately follows GvnCse to catch near-misses through algebraic normalization. LateOriCommoning (phase 64) runs after predication (phase 63) converts branches into predicated instructions, exposing new redundancies. OriBackCopyPropagate (phase 83) runs late in the pipeline to shorten MOV chains before register allocation.
| Phases covered | 49 (GvnCse), 50 (OriReassociateAndCommon), 64 (LateOriCommoning), 83 (OriBackCopyPropagate) |
| Forward copy prop | OriCopyProp sub-pass inside each GeneralOptimize bundle (phases 13, 29, 37, 46, 58, 65) |
| Related knobs | 22 knobs controlling budgets, modes, and enable/disable flags |
| Pipeline position | Mid-optimization (49--50), post-predication (64), pre-regalloc legalization (83) |
| Prerequisite passes | AnalyzeControlFlow (3), GeneralOptimizeMid2 (46), EnforceArgumentRestrictions (48) |
| Downstream consumers | ExtractShaderConstsFinal (51), OriDoPredication (63), register allocation (101) |
Phase Summary Table
| Phase | Name | Vtable | execute | getName | isNoOp | Default |
|---|---|---|---|---|---|---|
| 49 | GvnCse | off_22BDD70 | 0xC5F000 (thunk) | 0xC5F010 (ret 49) | 0xC5F020 (ret 0) | Enabled |
| 50 | OriReassociateAndCommon | off_22BDD98 | sub_C604D0 | 0xC5EFE0 (ret 50) | 0xC5EFF0 (ret 0) | Enabled |
| 64 | LateOriCommoning | off_22BDFC8 | sub_C60020 | 0xC5EDF0 (ret 64) | 0xC5EE00 (ret 0) | Enabled |
| 83 | OriBackCopyPropagate | off_22BE2C0 | sub_C5EB80 | 0xC5EB90 (ret 83) | 0xC5EBA0 (ret 1) | Disabled |
Phase name strings (from static name table at off_22BD0C0, verified in ptxas_strings.json):
| Phase | String Address | Name Table Ref |
|---|---|---|
| 49 | 0x22BC80C | 0x22BD280 |
| 50 | 0x22BC813 | 0x22BD290 |
| 64 | 0x22BC949 | 0x22BD310 |
| 83 | 0x22BCAE5 | 0x22BD3C8 |
All four vtables are laid out at uniform 0x28-byte (40-byte) spacing in .data.rel.ro, matching the 5-pointer-per-vtable pattern used by all 159 phases. The factory switch at sub_C60D30 allocates each phase as a 16-byte object and installs the corresponding vtable pointer.
Phase 83 is disabled by default (isNoOp returns 1). It is activated through the AdvancedPhaseBackPropVReg gate (phase 82), which architecture-specific backends override to enable backward copy propagation for their target.
Phase 49 -- GvnCse (Global Value Numbering + CSE)
Overview
GvnCse combines global value numbering (GVN) with common subexpression elimination (CSE) in a single pass. GVN assigns a canonical "value number" to every expression in the program such that two expressions with the same value number are guaranteed to compute the same result. CSE then uses these value numbers to detect and eliminate redundant computations.
The pass is gated by the EnableGvnCse knob (address 0x21BDA50). When disabled, the pass is skipped entirely.
Dispatch Mechanism
The execute function at 0xC5F000 is a 16-byte thunk:
mov rdi, [rsi+0x630] ; rdi = compilation_context->sm_backend
mov rax, [rdi] ; rax = sm_backend->vtable
jmp [rax+0xB8] ; tail-call vtable[23] -- the actual GVN-CSE implementation
The real implementation lives in the compilation context's SM backend object (at context+0x630 / +1584), dispatched through its vtable at offset 0xB8 (slot 23). This indirection means the GVN-CSE algorithm can be overridden by architecture-specific backends that provide a different SM backend vtable. (This object was previously called "optimizer_state" on this page, but it is the same polymorphic SM backend used for legalization, scheduling, and all other architecture-dependent dispatch -- see data-structures.md.)
Algorithm (Reconstructed)
The ptxas GVN-CSE operates on the Ori IR basic block list with dominator-tree-guided traversal:
procedure GvnCse(function F):
build dominator tree DT for F
initialize value_table: hash_map<expression_key, value_number>
vn_counter = 0
for each block B in RPO(DT):
for each instruction I in B:
key = canonicalize(I.opcode, I.type, [lookup_vn(op) for op in I.operands])
if key in value_table:
existing_vn = value_table[key]
replace all uses of I.dest with representative(existing_vn)
mark I as dead
else:
value_table[key] = ++vn_counter
set_representative(vn_counter, I.dest)
run dead code elimination to remove marked instructions
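The reconstruction above can be exercised on a toy straight-line IR. The following is a minimal sketch (single block, so the dominator-tree scoping degenerates; the `(dest, opcode, operands)` tuple form and the replacement map are illustrative, not the binary's representation):

```python
COMMUTATIVE = {"add", "mul", "and", "or", "xor", "min", "max"}

def gvn_cse(instrs):
    """instrs: list of (dest, opcode, operands) in program order.
    Returns surviving instructions plus a map dest -> representative."""
    value_table = {}   # canonical expression key -> value number
    vn_of = {}         # register -> value number
    rep_of = {}        # value number -> representative register
    counter = 0
    replaced = {}
    out = []
    for dest, op, operands in instrs:
        # resolve operands to value numbers (fresh VN for unseen inputs)
        vns = []
        for o in operands:
            if o not in vn_of:
                counter += 1
                vn_of[o] = counter
                rep_of[counter] = o
            vns.append(vn_of[o])
        if op in COMMUTATIVE:
            vns.sort()               # a+b and b+a get the same key
        key = (op, tuple(vns))
        if key in value_table:       # redundant: reuse representative
            vn = value_table[key]
            vn_of[dest] = vn
            replaced[dest] = rep_of[vn]
        else:                        # new value: record representative
            counter += 1
            value_table[key] = counter
            vn_of[dest] = counter
            rep_of[counter] = dest
            out.append((dest, op, operands))
    return out, replaced

prog = [("t1", "add", ("a", "b")),
        ("t2", "add", ("b", "a")),   # commutes to the same key as t1
        ("t3", "mul", ("t1", "c")),
        ("t4", "mul", ("t2", "c"))]  # t2 == t1, so t4 == t3
kept, repl = gvn_cse(prog)
print(repl)   # {'t2': 't1', 't4': 't3'}
```

The second-order elimination of `t4` falls out for free: once `t2` is assigned `t1`'s value number, `t4`'s key collides with `t3`'s.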
Key design decisions visible from the binary:
- Hash-based value table. The value numbering table uses FNV-1a hashing (seed `0x811C9DC5`, prime `16777619`/`0x01000193`), the same hash primitive used throughout ptxas for instruction fingerprinting, code caching, and scheduling table lookups. The hash function incorporates the opcode, type, and recursively resolved value numbers of all operands. Hash table entries are 24 bytes each: `[next_ptr (8B), key (8B), value/metadata (8B)]` with chained collision resolution.
- Dominator-tree scoping. Values defined in block B are only visible to blocks dominated by B. When the walk exits a dominator subtree, value table entries scoped to that subtree are removed. This prevents CSE from moving computations to positions where they would not dominate all uses. Dominance is checked via `sub_1245740`, which performs a single-bit test against a per-block dominator bitvector: the dominator set at block descriptor offset `+176` is indexed by the dominator block's ID from offset `+144`. The check is O(1).
- Commutativity normalization. For commutative operations (ADD, MUL, AND, OR, XOR, MIN, MAX), operands are sorted by value number before hashing. This ensures `a + b` and `b + a` get the same value number without requiring a separate reassociation pass.
- Address space awareness. Memory operations in different address spaces (shared, global, local, constant) are never considered equivalent even if they have identical operands. The address space qualifier is encoded in the instruction opcode or modifier bits (not the operand), so the opcode comparison in the structural equivalence check inherently preserves this distinction.
- Predicate handling. Predicated instructions (`@P0 IADD R1, R2, R3`) hash the predicate register's value number as an additional operand. Two identical computations under different predicates are distinct values.
- Predicate-operand compatibility (`sub_7E7380`). After opcode and type matching in the caller, `sub_7E7380` performs a focused predicate-operand compatibility check (30 lines, 150 bytes). The function tests: (a) predicate modifier parity -- `instr+73` bit 4 versus `instr+72` bit 12 (`0x1000`); if one instruction has a predicate modifier and the other does not, they are incompatible; (b) last operand 24-bit value ID -- `(instr + 84 + 8*(operand_count-1)) & 0xFFFFFF` must match; (c) second-to-last operand 8-byte encoding -- the two dwords immediately before the last operand slot must be identical. The broader structural comparison (opcodes masked with `& 0xFFFFCFFF`, data types at `+76`, operand counts at `+80`, full per-operand encoding, register class at `+64`) is performed by each of the 21 callers of `sub_7E7380`, not by the function itself. Instructions with volatile flags (bit `0x20` at register descriptor offset `+48`) and barrier-type registers (type 9) are excluded from CSE by the callers' pre-checks.
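The FNV-1a primitive with the recovered constants is easy to reproduce. The following is a hedged sketch of how an expression key might be folded into a 32-bit hash; the exact field order and byte layout used by ptxas is not recovered, so the `expr_key_hash` packing below is illustrative and only demonstrates the stated seed/prime and the commutative sort:

```python
FNV_SEED  = 0x811C9DC5
FNV_PRIME = 0x01000193   # 16777619

def fnv1a(data, h=FNV_SEED):
    """32-bit FNV-1a over a byte string."""
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

def expr_key_hash(opcode, dtype, operand_vns, commutative=False):
    """Fold opcode, type, and operand value numbers into one hash.
    Byte layout is illustrative -- the real packing is not recovered."""
    # Sort operand VNs for commutative ops, per the normalization above
    vns = sorted(operand_vns) if commutative else list(operand_vns)
    data = opcode.to_bytes(2, "little") + dtype.to_bytes(1, "little")
    for vn in vns:
        data += vn.to_bytes(4, "little")
    return fnv1a(data)

# a + b and b + a hash identically under the commutative sort
assert expr_key_hash(2, 11, [7, 3], commutative=True) == \
       expr_key_hash(2, 11, [3, 7], commutative=True)
```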
GVN Algorithm Details (Binary Trace)
The GVN-CSE body was located by reading SM backend vtable slot 23 (offset +0xB8) from all seven SM backend vtables in the ptxas binary. The actual function pointer varies by SM generation:
| SM Backend | Vtable | Slot 23 Function | Behavior |
|---|---|---|---|
| SM30 (Kepler) | off_2029DD0 | sub_661250 | Returns 0 -- NO-OP |
| SM50 (Maxwell) | off_21B4A50 | sub_661250 | Returns 0 -- NO-OP |
| SM60 (Pascal) | off_22B2A58 | sub_BEE590 | Real GVN-CSE |
| SM70 (Volta) | off_21D82B0 | sub_BEE590 | Real GVN-CSE |
| SM80 (Ampere) | off_21B2D30 | sub_661250 | Returns 0 -- NO-OP |
| SM89 (Ada) | off_21C0C68 | sub_661250 | Returns 0 -- NO-OP |
| SM90+ (Hopper) | off_21D6860 | sub_BEE590 | Real GVN-CSE |
GVN-CSE (phase 49) is a no-op on Kepler, Maxwell, Ampere, and Ada. It only executes on Pascal, Volta, and Hopper/Blackwell. SM80/SM89 backends rely on LateOriCommoning (phase 64) and the GeneralOptimize sub-passes for CSE coverage instead. This is a deliberate per-generation decision embedded in each SM backend's vtable.
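The per-generation dispatch pattern reduces to a vtable slot whose function pointer differs per backend. The following is a minimal Python analogue of that structure (class and function names are illustrative; the real dispatch is a C++ virtual call through slot 23 of the SM backend vtable):

```python
class SmBackend:
    def gvn_cse(self, func):          # analogue of vtable slot 23
        raise NotImplementedError

class NoOpGvn(SmBackend):             # sub_661250: Kepler/Maxwell/Ampere/Ada
    def gvn_cse(self, func):
        return 0                      # returns 0, does nothing

class RealGvn(SmBackend):             # sub_BEE590: Pascal/Volta/Hopper
    def gvn_cse(self, func):
        return run_gvn(func)          # would then dispatch on knob 402 mode

def run_gvn(func):                    # stand-in for the real GVN body
    return f"gvn({func})"

BACKENDS = {"sm_70": RealGvn(),       # per the slot-23 table above
            "sm_80": NoOpGvn(),       # Ampere opts out of phase 49
            "sm_89": NoOpGvn(),
            "sm_90": RealGvn()}

def phase49_execute(sm, func):
    # GvnCse::execute thunk: load the backend, tail-call its slot
    return BACKENDS[sm].gvn_cse(func)

print(phase49_execute("sm_80", "kernel"))   # 0 -- no-op on Ampere
print(phase49_execute("sm_90", "kernel"))   # gvn(kernel)
```

The design consequence is that disabling the pass for a generation costs nothing at phase-dispatch time: the phase object is identical everywhere, and only the pointer installed in the backend vtable changes.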
Call Chain
GvnCse::execute (0xC5F000)
-> sm_backend->vtable[23] (indirect dispatch)
-> sub_BEE590 (GVN entry, SM60/70/90)
-> sub_781F80(ctx, 0) (rebuild def chains, mode=full)
-> sub_BEE370 (mode dispatcher)
-- queries knob 402 via knob_container->vtable[9] --
mode 0: disabled, return
mode 1: sub_BEA450 (simple GVN)
mode 2: sub_BEAD00 (standard dominator-guided GVN)
mode 3-6: sub_BED7E0 (full GVN with extended block scope)
Mode Dispatcher (sub_BEE370)
The mode is determined by knob 402 (EnableGvnCseMode), queried through two vtable calls on the knob container at context+1664:
- Boolean query -- `knob_container->vtable[9](402)` (offset +72): checks if the knob is set at all. The dispatcher has a fast-path optimization: when `vtable[9]` is `sub_6614A0` (the standard implementation), it reads directly from `knob_container+72+28944` instead of dispatching through the vtable.
- Integer query -- `knob_container->vtable[15](402)` (offset +120): reads the mode value as an integer. Similarly fast-pathed when `vtable[15]` is `sub_661470`.
If both queries return truthy, the integer value selects the GVN variant:
| Mode | Function | Description |
|---|---|---|
| 0 | (none) | Pass disabled, return immediately |
| 1 | sub_BEA450 | Simple single-block GVN (111 lines, ~2KB) |
| 2 | sub_BEAD00 | Standard dominator-guided GVN (157 lines, ~2.5KB) |
| 3 | sub_BED7E0 | Full GVN (when sm_backend+1106 bit 6 AND context+1416 bit 0) |
| 4 | sub_BED7E0 | Full GVN (remapped to mode 2 if bit 6 is clear) |
| 5-6 | sub_BED7E0 | Full GVN with extended block scope |
| >6 | (none) | Return immediately (no operation) |
Additional flags modulate the mode selection:
- SM backend flag at `sm_backend+1106` bit 6 (`0x40`): when set, enables modes 5-6 (enhanced scope). When clear and mode is 4, the dispatcher remaps to mode 2.
- Context flag at `context+1416` bit 0: when set (and bit 6 is set), selects mode 3 over modes 5-6.
- SM version threshold `sm_backend+372 <= 0x7FFF` (32767): gates the EBB pre-pass `sub_BED430` via knob 210.
Before the standard GVN (sub_BEAD00), the mode dispatcher may invoke sub_BED430 -- an extended basic block (EBB) pre-pass that identifies and marks multi-block CSE opportunities within single-entry regions. The EBB pre-pass is called unless: (a) SM version > 0x7FFF, AND (b) knob 210 is set or context+1368 bit 0 is clear.
Simple GVN (sub_BEA450, Mode 1)
Mode 1 provides the lightest GVN variant -- single-scope CSE without cross-dominator lookup. Reconstructed pseudocode:
procedure SimpleGvn(gvn_state S):
context = S.context
first_reg = operand_24bit(first_instr(context+272))
value_record = context.reg_table[first_reg] // context+296
if not value_record: return
for each value_record in linked order:
if knob_query(257, value_record): break // per-instruction gate
first_instr = value_record.head // value_record[0]
sentinel = value_record.sentinel // value_record[1]
eligible = false
for each instr from first_instr to sentinel:
if not eligible:
eligible = check_eligibility(instr) // sub_BEA1E0
if eligible: advance and check sentinel
if instr.opcode_masked == 145: // barrier
if sm_backend->vtable[371](instr): // safe to CSE
mark eligible
else: break scope
if eligible:
// Directly generate MOV replacement -- no dominator check
context+232 = value_record.head
context+264 = value_record.head->field_20
sub_9314F0(context, 0x124, 1, 0, 0) // insert MOV 292
advance to next block via opcode 97 (block header) -> field +24
This variant does not examine the immediate-dominator chain at instruction+148. It only replaces redundancies that are visible within the current value record's instruction list (effectively single-block scope).
Standard GVN (sub_BEAD00, Mode 2)
Mode 2 extends the simple GVN with cross-dominator CSE. After finding an eligible instruction and reaching the end of a block, it follows the immediate-dominator chain:
procedure StandardGvn(gvn_state S, char cross_block_flag):
// ... (same entry and block walk as SimpleGvn) ...
// After eligibility walk reaches sentinel:
idom = instr.field_148 // immediate dominator index
if idom != 0:
dom_record = context.reg_table[context.idom_map[idom]] // context+296[context+512[4*idom]]
if dom_record and (not cross_block_flag or dom_record.opcode != 1):
if not dominance_check(S, value_record): // sub_BEA3B0
leader = dom_record.head
if leader.next.opcode != 292: // not already a MOV
context+232 = leader
context+264 = leader.field_20
sub_9314F0(context, 0x124, 1, 0, 0) // insert MOV
// Fallback: if idom chain is empty, try block-level CSE
block_desc = context.block_table[instr.field_164] // context+368
if block_desc+280 bit 0 is clear:
leader = reg_table[operand_24bit(block_desc.first_instr)]
if leader.next.opcode != 292:
generate MOV replacement
The cross_block_flag parameter (passed from the mode dispatcher) controls whether the standard GVN allows replacement when the dominator has opcode == 1 (a block-header sentinel). When set, it skips such cases to avoid unsafe cross-block hoisting.
Dominance Check with Cache (sub_BEA3B0)
The dominance check is guarded by context+1377 bit 5 (0x20). When this flag is clear, the function returns 0 immediately (no dominance, meaning "safe to CSE" -- the caller inverts the result).
When the flag is set, the function implements a single-entry global cache to accelerate repeated dominator queries:
procedure DominanceCheck(gvn_state S, value_record vr):
if not (context+1377 & 0x20): return 0 // no extended scope
idom = vr.field_148
if idom == 0: return 1 // no dominator -> can't CSE
dom_record = reg_table[idom_map[idom]]
if dom_record == NULL: return 1
// Check global cache (single-entry, TLS-safe through static storage)
if dom_record == cached_key: // qword_2A12A08
return cached_result ^ 1 // byte_2A129FE[0] ^ 1
// Cache miss: compute dominator ordering via sub_74D720
if idom >= 0 and vr.field_152 >= 0:
cached_key = dom_record
sub_74D720(context, idom, vr.field_152, &cached_result)
return cached_result ^ 1
else:
return 1 // negative index -> can't CSE
The cache stores a single (key, result) pair in global statics qword_2A12A08 and byte_2A129FE. This is effective because the GVN walk processes instructions within a block sequentially, and many consecutive instructions share the same dominator. The cache hit rate is high for blocks dominated by a single predecessor.
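The single-entry cache is the simplest possible memoization. The following sketch of the pattern (names are illustrative; `expensive_dom_order` stands in for `sub_74D720`) shows why it pays off when consecutive queries share a key:

```python
class OneEntryCache:
    """Single (key, result) memo, mirroring the qword_2A12A08 /
    byte_2A129FE global pair used by sub_BEA3B0."""
    def __init__(self, compute):
        self.compute = compute
        self.key = None
        self.result = None
        self.hits = 0
        self.misses = 0

    def query(self, key):
        if self.key is not None and key == self.key:
            self.hits += 1                 # cache hit: no recomputation
            return self.result
        self.misses += 1                   # miss: recompute and overwrite
        self.key = key
        self.result = self.compute(key)
        return self.result

def expensive_dom_order(dom):
    return dom % 2                         # stand-in for sub_74D720

cache = OneEntryCache(expensive_dom_order)

# A GVN-style walk: long runs of instructions share one dominator
for q in [5] * 40 + [9] * 60:
    cache.query(q)
print(cache.hits, cache.misses)            # 98 2
```

With only two distinct dominators appearing in runs, one cached entry absorbs 98 of 100 queries -- the access pattern, not the cache size, is what makes this effective.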
EBB Pre-Pass (sub_BED430)
The Extended Basic Block (EBB) pre-pass runs before mode 2 GVN when the SM version and knob conditions are met. It identifies cross-block CSE opportunities within single-entry CFG regions.
procedure EbbPrePass(gvn_state S):
// Phase 1: Clear previous markings
for each block B in linked order:
B.field_264 = 0 // clear EBB marking
// Phase 2: Find first CSE-eligible instruction
for each instr in instruction list:
if check_eligibility(instr) and instr.opcode != 211:
break // found seed
if not found: return
// Phase 3: Build dominator tree and compute block ordering
sub_7846D0(context) // dominator tree + RPO
sub_A12EA0(context, walker_context, visitor) // dominator tree walk
sub_775010(context) // predecessor setup
sub_773140(context, 0) // successor setup
sub_770E60(context, 0) // block ordering
// Phase 4: Mark CSE candidates on every instruction
for each instr in instruction list:
if check_eligibility(instr) and instr.opcode != 211:
instr.field_48 = 1 // mark as CSE candidate
else:
instr.field_48 = 0
// Phase 5: Propagate eligibility through operand chains
sub_BED0A0(walker_state) // fixed-point propagation
// Phase 6: Evaluate cross-block candidates
for each value_record in RPO order:
if knob_query(257, vr): continue // per-instruction gate
idom = vr.field_148
if idom != 0:
dom_record = resolve_idom(context, idom)
if dom_record and dom_record.field_264 == 0:
dom_record.field_264 = sub_BEA000(walker, dom_record, 0) ? 2 : 1
The EBB propagation engine (sub_BED0A0) is a fixed-point iteration that propagates CSE eligibility backward through operand use-chains. For each instruction with field_48 bit 0 set, it follows the operand-to-instruction back-references at context+88 to mark defining instructions as eligible too. The iteration continues until no more changes occur. This ensures that an entire expression tree is marked eligible when any of its consumers is eligible.
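Backward fixed-point propagation over use-def chains can be sketched directly. The following is a hedged Python model in which the operand back-references become an explicit `defs` map and `field_48` becomes membership in an `eligible` set (both illustrative):

```python
def propagate_eligibility(instrs, defs, seed_eligible):
    """instrs: instruction ids in program order; defs: id -> list of
    defining instruction ids (operand back-references); seed_eligible:
    ids whose field_48 bit is initially set. Iterates to a fixed point,
    marking every definition reachable backward from an eligible use."""
    eligible = set(seed_eligible)
    changed = True
    while changed:                      # fixed-point iteration
        changed = False
        for i in instrs:
            if i in eligible:
                for d in defs.get(i, ()):
                    if d not in eligible:
                        eligible.add(d)
                        changed = True
    return eligible

# i4 consumes i2 and i3; i2 consumes i1. Seeding i4 marks the whole tree.
defs = {"i4": ["i2", "i3"], "i2": ["i1"]}
print(sorted(propagate_eligibility(["i1", "i2", "i3", "i4"], defs, {"i4"})))
# ['i1', 'i2', 'i3', 'i4']
```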
Full GVN Body (sub_BED7E0, 689 lines, ~18KB binary)
This is the most complete GVN variant (modes 3-6). Reconstructed pseudocode:
procedure FullGvnCse(gvn_state S):
context = S.context
mode_flags = S.mode & ~0x02 // strip bit 1
extended_scope = (mode_flags - 5) // >1 enables cross-block scope
// Phase 1: Initialization
block_count = context.block_count // at +376
visited[] = allocate_zeroed(block_count + 1) // one byte per block
build_dominator_tree(context) // sub_7846D0
scope_tree = new_scoped_tree() // sub_661750
rpo = context.rpo_ordering // at +792
// Phase 2: RPO block walk
for i = 0 to rpo.count - 1:
block_idx = rpo.indices[i]
block = context.block_table[block_idx] // context+368
if block.head == NULL or block.flags & SKIP:
continue
first_instr = lookup_first_instruction(block)
dominator_candidate = NULL
has_redundant = false
for each instr in block:
// Per-instruction knob gate (knob 257)
if knob_query(257, instr):
break to next block boundary
eligible = check_eligibility(instr) // sub_BEA1E0
if eligible:
visited[block_idx] = not block.visited_flag
elif opcode_masked(instr) in {32, 159}: // branch, return
propagate visited flag from predicate operand
elif opcode_masked(instr) == 145: // barrier/sync
safe = sm_backend->vtable[371](instr)
if safe: mark_as_candidate
// Check dominator for existing equivalent
idom_ref = instr.idom_ref // at +148
if idom_ref != 0:
dom_block = resolve_idom(context, idom_ref)
if dom_block dominates current position:
leader = dom_block.first_instr
if leader.opcode != 292: // not already a MOV
replace_with_mov(context, leader, 0x124)
record_block_in_scope_tree(scope_tree, block_idx)
// Phase 3: Post-processing dominated blocks
for each (node, bit_pos) in scope_tree.bit_iterate():
block_idx = bit_pos | (node.data << 6)
block_record = reg_table[block_idx]
cse_dominated_block(S, block_record) // sub_BEA5F0
// Phase 4: Cleanup
flush_deferred_instructions(scope_tree)
destroy_scoped_tree(scope_tree)
Key observations from the binary:
- Block walk order is RPO. The outer loop reads `context+792` -- a struct containing `{int count; int indices[]}` -- and iterates in that order. The RPO array is pre-computed by `sub_7846D0`, which also builds the dominator tree.
- The value table is a register-indexed array, not a hash map. Values are stored in `context+296` (an array of pointers indexed by the 24-bit register/value identifier from the operand encoding at `instruction+84`). This gives O(1) lookup by register ID. The dominator tree is used for scoping, not a stack-based hash table.
- Dominator scoping uses a balanced binary tree with bitset nodes. Each tree node stores a 64-bit bitset of block indices, traversed with `tzcnt` for efficient iteration. The block index is recovered as `bit_position | (node_data << 6)`, supporting up to 64 * depth blocks.
- Replacement is MOV insertion. When a redundant instruction is found, the pass calls `sub_9314F0(context, 0x124, 1, 0, 0)` to generate a replacement MOV instruction (opcode `0x124` = 292 decimal). The original computation is recorded at `context+232` (source) and `context+264` (metadata) before the MOV is generated.
- Barrier instructions (opcode 145) have a dedicated safety check via `sm_backend->vtable[371]` (offset `+2968`), an architecture-specific predicate that determines whether an instruction can be CSE'd across a barrier boundary.
Instruction Eligibility (sub_BEA1E0)
| Opcode (masked) | Category | Condition |
|---|---|---|
| 16 | Register copy / PHI | Always, unless last operand bit 1 set |
| 183 | Memory load/compute | Bit 5 of last operand, or sub_91BC40 safety check |
| 119 | GPU special | SM flag +1106 bit 6 required; operand bit 1 |
| 186 | GPU special | SM flag +1106 bit 6 required; operand bit 0 |
| 211 | GPU special | SM flag +1106 bit 6 required; operand bit 2 |
| 283 | GPU special | SM flag +1106 bit 6 required; operand bit 3 |
| 122 | Conditional | Type 2-3: always; type 7-8: bit 7 set |
| 310 | Specialized | (flags & 0xF) == 2 and (flags & 0x30) != 0x30 |
| 145 | Barrier/sync | Separate sm_backend->vtable[371] check |
| all others | -- | Not eligible |
Opcodes 119, 186, 211, 283 are only CSE-eligible when sm_backend+1106 bit 6 (0x40) is set. This bit appears to be an architecture-specific capability flag enabling extended CSE for certain GPU-specific instruction classes.
Per-Dominated-Block CSE (sub_BEA5F0)
After the RPO walk populates the scope tree, sub_BEA5F0 processes each dominated block:
- SM version gate: if `sm_backend+372 <= 28671` (SM70 or earlier), enables a special operand canonicalization path for commutative operations
- Instruction walk: iterates via the `block+128` child pointer chain
- Dominator ordering: compares `instruction+144` (dominator number) to test dominance
- Commutative canonicalization (opcode 95): calls `sm_backend->vtable[79]` (offset +632) to sort operands by value number. Rewrites operand encoding with flags `0x60000000` and `0x40000000` to mark canonicalized operands
- Replacement: calls `sub_931920` to insert copy instructions when a dominating equivalent is found
Scope Tree Bit-Iteration Detail
The scope tree post-processing (lines 498-664 of sub_BED7E0) uses a binary tree where each node contains a 4-word (32-byte) bitset region starting at node+32. The iteration:
- Start at the leftmost node: follow `node->left` until NULL
- Scan the 4-word bitset region (`node+32` through `node+64`), finding each set bit via `tzcnt` (x86 trailing-zero count)
- Recover the block index: `bit_position | ((word_offset_in_node | (node.field_24 << 2)) << 6)`
- After processing a bit, mask it out: `word &= ~(0xFFFFFFFFFFFFFFFF >> (64 - (bit_pos + 1)))`
- When the current word is exhausted, advance to the next word in the 4-word region
- When all 4 words are exhausted, follow parent/right-child links to traverse the tree in order
Each block index recovered from the tree triggers a call to sub_BEA5F0 for per-dominated-block CSE. The tree structure allows the scope walk to skip large ranges of blocks that have no CSE candidates, making it efficient for sparse CFGs.
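The tzcnt bitset walk over a single node is straightforward to model. The following sketch recovers block indices from a node's 4-word bitset the same way; Python's `(w & -w).bit_length() - 1` stands in for `tzcnt`, and clearing the lowest bit with `w &= w - 1` is equivalent for lowest-bit-first iteration to the mask shown above:

```python
def iterate_node_bits(words, node_field_24):
    """words: the node's 4-word (32-byte) bitset region at node+32.
    Yields block indices as bit | ((word_off | (field_24 << 2)) << 6)."""
    assert len(words) == 4
    for word_off, w in enumerate(words):
        while w:
            bit = (w & -w).bit_length() - 1   # tzcnt: lowest set bit
            yield bit | ((word_off | (node_field_24 << 2)) << 6)
            w &= w - 1                        # clear the processed bit

# Node covering blocks 256..511 (field_24 = 1): word 0 bit 3 and
# word 1 bit 6 are set, i.e. blocks 259 and 326
words = [0b1000, 1 << 6, 0, 0]
print(list(iterate_node_bits(words, 1)))      # [259, 326]
```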
GVN Function Map
| Address | Name | Size | Role |
|---|---|---|---|
| sub_BEE590 | GvnEntry | ~200B | Entry point (vtable slot 23, SM60/70/90) |
| sub_BEE370 | ModeDispatcher | ~550B | Selects GVN variant via knob 402 |
| sub_BED7E0 | FullGvn | ~18KB | Full GVN body (modes 3-6, RPO + scope tree) |
| sub_BED430 | EbbPrePass | ~2KB | Extended basic block pre-pass |
| sub_BED0A0 | EbbPropagate | ~3KB | EBB eligibility propagation (fixed-point) |
| sub_BEC880 | EbbInit | -- | EBB state initialization |
| sub_BEAD00 | StandardGvn | ~2.5KB | Standard dominator-guided GVN (mode 2) |
| sub_BEA5F0 | PerBlockCse | ~9KB | Per-dominated-block CSE + commutative canon. |
| sub_BEA450 | SimpleGvn | ~2KB | Simple single-block GVN (mode 1) |
| sub_BEA3B0 | DomCheckCached | ~300B | Dominance check with global cache |
| sub_BEA1E0 | EligibilityCheck | ~500B | Instruction eligibility (opcode classes) |
| sub_BEA000 | EbbCandidateCheck | ~700B | EBB candidate dominator-chain walk |
| sub_7E7380 | PredicateCompat | ~150B | Predicate-operand compatibility check |
| sub_661250 | NoOp | ~6B | No-op stub (SM30/50/80/89) |
| sub_7846D0 | BuildDomTree | -- | Dominator tree + RPO ordering builder |
| sub_661750 | ScopeTreeInit | -- | Scoped value tree init/destroy |
| sub_9314F0 | InsertMov | -- | Instruction insertion (generates MOV 292) |
| sub_934630 | InsertMulti | -- | Instruction insertion (multi-operand variant) |
| sub_931920 | InsertNode | -- | Instruction node insertion into linked list |
| sub_9253C0 | DeleteInstr | -- | Instruction deletion |
| sub_6B4520 | RecordBlock | -- | Block recording for dominator scoping |
| sub_74D720 | DomOrdering | -- | Dominator ordering comparison |
| sub_69DD70 | TreeExtract | -- | Tree node extraction (deferred processing) |
| sub_7A1A90 | KnobQuery | -- | Knob query (per-instruction enablement) |
| sub_91BC40 | MemSafetyCheck | -- | Memory operation safety check |
| sub_A12EA0 | DomTreeWalk | -- | Dominator tree walker (EBB discovery) |
GPU-Specific CSE Constraints
GPU CSE must respect constraints that do not arise in CPU compilers:
- **Divergence.** A uniform subexpression (same value across all threads in a warp) can be safely hoisted. A divergent subexpression may have different values per thread and must only be CSE'd within the same control-flow path. The GvnCse pass runs after `AnalyzeUniformsForSpeculation` (phase 27), which provides divergence annotations.
- **Barrier sensitivity.** A computation that reads shared memory before a `BAR.SYNC` cannot be commoned with an identical computation after the barrier, because intervening threads may have written different values. Memory operations with barrier dependencies are assigned unique value numbers. The actual barrier check is performed by `sm_backend->vtable[371]` (offset +2968), an architecture-specific predicate.
- **Register pressure.** Aggressive CSE can increase register pressure by extending the live range of the representative value. The `EnableGvnCse` knob allows the pass to be disabled when register pressure is the binding constraint.
- **Per-SM enablement.** GVN-CSE is only active on SM60, SM70, and SM90+. SM80/SM89 rely on LateOriCommoning (phase 64) and GeneralOptimize sub-passes instead. This per-generation selection is embedded in the SM backend vtable at slot 23.
Phase 50 -- OriReassociateAndCommon
Overview
Reassociation normalizes the algebraic structure of expressions to expose commoning opportunities that GvnCse missed. GvnCse cannot detect that (a + b) + c and (a + c) + b compute the same value unless the expressions are first reassociated into a canonical form. This pass performs that reassociation and then runs a second commoning pass over the normalized IR.
Dispatch Mechanism
// sub_C604D0 -- OriReassociateAndCommon::execute
int64 execute(phase* self, compilation_context* ctx) {
int func_count = get_function_count(ctx); // sub_7DDB50
if (func_count > 1)
return ctx->field_1584->vtable[44](ctx->field_1584, ctx);
return func_count;
}
For multi-function compilation units, the pass dispatches through the compilation context's SM backend (field +1584 / 0x630), calling vtable slot 44 (offset 0x160). This enables per-function reassociation with function-level isolation of value numbering state.
Algorithm (Reconstructed)
Reassociation works on associative and commutative operators:
procedure ReassociateAndCommon(function F):
for each basic block B in RPO:
for each instruction I in B:
if I.opcode is associative+commutative (ADD, MUL, AND, OR, XOR):
flatten expression tree rooted at I into a list of leaves
sort leaves by canonical order (constants last, then by register number)
rebuild balanced binary tree from sorted leaves
if I.opcode is SUB:
rewrite (a - b) as (a + (-b)) for uniformity
// Second pass: hash-based commoning over the reassociated IR
run local CSE over each basic block
Why Reassociation Matters
The reassociation and commoning phases are tightly coupled because reassociation's primary goal is to enable commoning:
BB0: R5 = (R2 + R3) + R4 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R3)), vn(R4))
BB1: R6 = (R2 + R4) + R3 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R4)), vn(R3))
-- These are NOT the same VN because the inner ADDs differ.
After reassociation, both flatten to {R2, R3, R4} sorted canonically, then rebuild as (R2 + R3) + R4. Now they share the same value number and the second is eliminated.
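As a sketch of why canonicalization makes the two value numbers collide, the flatten-sort step can be modeled on plain leaf lists. This is illustrative only: leaves are reduced to virtual register numbers, and ptxas's real ordering (constants last, then by register number) is simplified to a plain numeric sort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the flatten-sort step: once both ADD trees are flattened to
 * leaf multisets and sorted canonically, a later hash-based commoning pass
 * sees identical keys for (R2 + R3) + R4 and (R2 + R4) + R3. */
static int cmp_u32(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Canonicalize a flattened leaf list in place. */
static void canonicalize(unsigned *leaves, size_t n)
{
    qsort(leaves, n, sizeof *leaves, cmp_u32);
}
```

After `canonicalize`, both expressions from the example above yield the leaf order {R2, R3, R4}, so rebuilding a tree from either list produces the same shape and the same value number.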
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
AllowReassociateCSE | 0x21C0180 | Master enable/disable |
ReassociateCSEBudget | 0x21BA810 | Max instructions processed per function |
ReassociateCSEWindow | 0x21BA7D0 | Sliding window size for local CSE after reassociation |
ReassociateCSESkip | 0x21BA7F0 | Skip first N instructions (debugging) |
ReassociateLargeImmInUIADD64 | 0x21BA7A0 | Large immediates in 64-bit unsigned ADD |
DistributeAndReassociateMulBudget | 0x21BDDC0 | Budget for a*b + a*c -> a*(b+c) |
Phase 64 -- LateOriCommoning
Overview
LateOriCommoning is a late CSE pass that runs immediately after predication (phase 63, OriDoPredication). If-conversion transforms conditional branches into predicated instructions, which can expose new redundancies: two computations that were previously in mutually exclusive branches become adjacent predicated instructions that may compute the same value.
Dispatch Mechanism
// sub_C60020 -- LateOriCommoning::execute
char execute(phase* self, compilation_context* ctx) {
int func_count = get_function_count(ctx); // sub_7DDB50
if (func_count > 1)
return sub_9059B0(ctx); // late commoning implementation
return func_count;
}
Implementation -- sub_9059B0
sub_9059B0 is the entry point for late commoning. It:
- Checks knob 487 (`ForceLateCommoning` at `0x21BD2F0`) to determine whether the pass is enabled
- Verifies the function's optimization state has commoning enabled: the byte at `context->field_1664->field_72 + 60696` must be 1, and the dword at offset `+60704` must be nonzero
- Allocates a ref-counted working set via the pool allocator
- Calls `sub_9055F0` -- the core commoning walker
Core Commoning Walker -- sub_9055F0
sub_9055F0 (203 lines decompiled) is the central commoning algorithm for late CSE. Its structure, reconstructed from the decompilation:
procedure LateCommoning(function_state S):
if not knob_enabled(487): return
if S.flags & 0x02: return // already processed
if (S.flags | S.flags2) & 0x08: return // conflicting mode
rebuild_def_chains(S, mode=1) // sub_781F80
rebuild_use_chains(S) // sub_763070
compute_hash_values(S, 0, 0, 0, 0) // sub_7E6090
block_count = S.field_520 + 1
allocate bit_array[block_count]
// Reset hash/VN slots on all instructions
for each instruction I in S.instruction_list:
I.field_88 = 0xFFFFFFFF00000000 // upper 32 bits = -1, lower = 0
// Main commoning loop over code list
for each instruction I in S.code_list:
// Phase 1: Remap operands through equivalence table
for each operand (reverse order):
if operand is register ref (type 0x10000000):
resolve to canonical representative
// Phase 2: Try commoning based on opcode class
if I.opcode == 72 (MOV):
propagate_equivalence(I) // sub_8F2CD0
elif is_pure(I): // sub_7DF3A0
opcode_class = I.opcode & 0xCF00
if opcode_class == 0x0061 (SEL): // conditional select
reset_tracking()
elif opcode_class == 0x0034 (PHI):
record_phi_equivalence(S, I)
else:
if not try_common(S, I): // sub_901A90
hash = compute_hash(S, I) // sub_74ED70
record_hash_for_future_matching(hash)
The three infrastructure functions called at the beginning are shared with the GeneralOptimize sub-passes:
- `sub_781F80` -- rebuilds reaching definition chains (also used by GeneralOptimizeEarly)
- `sub_763070` -- rebuilds use-def chains
- `sub_7E6090` -- pre-computes instruction hash values
Commoning Check -- sub_901A90
sub_901A90 (387 lines) is the instruction-level CSE checker. It:
- Examines the instruction's opcode, type, and operand value numbers
- Looks up the instruction's hash in the per-block equivalence table
- If a match is found, verifies that the matched instruction dominates the current position via `sub_1245740` (an O(1) bitvector bit test: `(1 << def_dom_id) & dom_set[def_dom_id >> 5]`)
- If domination holds, replaces the current instruction's destination with the matched instruction's destination
- Returns true if commoning succeeded, false otherwise
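A portable C sketch of that dominance bit test follows (the helper name and 32-bit word size are assumptions; the explicit `& 31` makes visible the shift-count masking that the x86 `shl` instruction performs implicitly in the decompiled `1 << def_dom_id`):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative rendering of the sub_1245740-style test: dom_set is a
 * bitvector over dominator-tree IDs, one bit per ID, 32 IDs per word. */
static int dominates(const uint32_t *dom_set, uint32_t def_dom_id)
{
    return (dom_set[def_dom_id >> 5] >> (def_dom_id & 31)) & 1;
}
```

Because the test is a single load, shift, and mask, the commoning checker can afford to run it on every candidate match without a measurable cost.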
A related commoning pattern was confirmed from sub_90A340 (1670 bytes, 21 callees), which performs commoning on opcode 130 (HSET2 in the ROT13 name table; used as an internal marker for MOV-like instructions -- actual SASS MOV is opcode 19) instructions. From the decompilation, the operand comparison loop:
// Operand-by-operand equivalence check within commoning body
for (i = operand_count - 1; i >= 0; i--) {
if (candidate.operands[2*i + 21] != existing.operands[2*i + 21])
break; // operand value mismatch
if (candidate.operands[2*i + 22] != existing.operands[2*i + 22])
break; // operand modifier mismatch
}
// If all operands match AND opcodes match AND operand counts match:
// verify dominance, then replace
The reverse iteration order (from last operand to first) is an optimization: destination operands at lower indices are more likely to differ, so checking source operands first (higher indices) allows early exit.
Instruction Hashing -- sub_74ED70
sub_74ED70 (304 lines) computes a hash value for an instruction, incorporating:
- Opcode and type qualifiers
- Value numbers of all source operands (recursively resolved through MOV chains)
- Address space for memory operations
- Predicate register (if predicated)
- Immediate values (folded into the hash)
The hash is stored at instruction field +88 (the upper 32 bits that were reset to 0xFFFFFFFF during initialization). The function calls sub_7DF3A0 (purity check), sub_7E0030 and sub_7E2530 (operand accessors), and sub_748440 (hash combining).
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
ForceLateCommoning | 0x21BD2F0 | Force-enable late commoning |
DisableMoveCommoning | 0x21BE2C0 | Disable MOV-based equivalence propagation within the commoning walker |
Phase 83 -- OriBackCopyPropagate
Overview
Backward copy propagation propagates values backward through MOV chains, eliminating intermediate copies. Unlike forward copy propagation (which replaces uses of a copy's destination with the copy's source), backward copy propagation replaces the definition of a copy's source with the copy's destination, allowing the copy instruction itself to be deleted.
Phase 83 uses a split-phase design with phase 82 (AdvancedPhaseBackPropVReg). The actual backward copy propagation algorithm lives in architecture-specific SM backend overrides of phase 82. Phase 83 is a pipeline progress marker that advances the pipeline counter context+1552 to 9 after backward copy propagation completes, signaling to downstream operand encoding functions that they may apply relaxed register constraints.
This phase is disabled by default (isNoOp returns 1). It is activated only when an architecture backend overrides phase 82 to provide its own backward propagation implementation.
Dispatch Mechanism
The execute function is a 7-byte stub that advances the pipeline progress counter:
// sub_C5EB80 -- OriBackCopyPropagate::execute
void execute(phase* self, compilation_context* ctx) {
ctx->field_1552 = 9; // advance pipeline progress counter to backward-copy-prop stage
}
Phase 83 does not contain the backward copy propagation algorithm. The actual algorithm is provided by the architecture-specific SM backend that overrides phase 82 (AdvancedPhaseBackPropVReg). The split-phase design works as follows:
| Phase | Role | Default behavior | When arch-activated |
|---|---|---|---|
82 (AdvancedPhaseBackPropVReg) | Gate + algorithm provider | No-op (hook, isNoOp = 1) | Arch backend installs backward copy propagation body |
83 (OriBackCopyPropagate) | Pipeline progress marker | No-op (isNoOp = 1) | Sets context+1552 = 9, enabling downstream constraint relaxation |
The factory switch at sub_C60D30 installs vtable off_22BE298 for phase 82 and off_22BE2C0 for phase 83. Both vtables are 40-byte (5-pointer) structures at consecutive addresses in .data.rel.ro.
Gate Mechanism (Phase 82)
Phase 82 (AdvancedPhaseBackPropVReg) is one of 16 AdvancedPhase hook points in the pipeline. By default its isNoOp returns true, meaning the phase is skipped entirely. When an architecture backend needs backward copy propagation, it:
- Overrides phase 82's vtable to install the actual backward propagation algorithm as the execute function
- Overrides phase 82's `isNoOp` to return 0 (enabled)
- Configures phase 83's `isNoOp` to return 0, enabling the pipeline counter advancement
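The gating scheme can be modeled as a small vtable dance (all names and the reduced two-slot vtable are illustrative; the real phase vtables are 40-byte, 5-pointer structures):

```c
#include <assert.h>

/* Toy model of the phase-82/83 gate: a phase is skipped while its isNoOp
 * slot returns true; an arch backend enables it by swapping the slot. */
typedef struct phase_vtbl {
    void (*execute)(int *ctx_counter);
    int  (*is_noop)(void);
} phase_vtbl;

static int  noop_true(void)  { return 1; }
static int  noop_false(void) { return 0; }
static void marker_execute(int *counter) { *counter = 9; } /* phase-83 body */

static void run_phase(const phase_vtbl *p, int *counter)
{
    if (!p->is_noop())          /* default vtable: phase is skipped */
        p->execute(counter);
}
```

With the default `is_noop`, `run_phase` leaves the pipeline counter untouched; after an arch backend installs `noop_false`, the 7-byte marker body runs and advances the counter to 9.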
The BackCopyPropBudget knob (index 808, address 0x21BFDF0) limits the number of backward propagations performed. This knob is read by sub_8C0270 (scheduler initialization) at the point where the scheduler allocates its per-function work structure. When knob 808 is not set by the user, the budget falls back to a default stored in the scheduler state object at offset +92.
Algorithm (Reconstructed)
The backward copy propagation algorithm is reconstructed from the phase name, the infrastructure it shares with forward copy propagation (sub_781F80, sub_763070), the BackCopyPropBudget knob, and the pipeline position constraints. The actual algorithm body resides in architecture-specific SM backend code, not in the generic binary.
procedure BackCopyPropagate(function F):
budget = knob(808) // BackCopyPropBudget
count = 0
// Phase 1: rebuild def-use chains (shared infrastructure)
rebuild_def_chains(F) // sub_781F80
rebuild_use_chains(F) // sub_763070
// Phase 2: walk blocks in RPO, instructions in reverse
for each basic block B in reverse postorder:
for each instruction I in B (last to first):
if count >= budget:
return
if I is not MOV (opcode & 0xCF00 != MOV class):
continue
// I is: Rd = MOV Rs
def_of_Rs = reaching_def(Rs)
// Guard 1: Rs must have exactly one use (this MOV)
if use_count(Rs) != 1:
continue
// Guard 2: def(Rs).dest can be renamed to Rd without conflict
if not can_rename(def_of_Rs.dest, Rd):
continue
// Guard 3: no intervening definition of Rd between def(Rs) and I
if has_intervening_def(Rd, def_of_Rs, I):
continue
// Perform backward propagation: rename definition
rename def_of_Rs.dest from Rs to Rd
delete I // MOV is now redundant
count++
The backward walk direction is essential for cascading chain collapse:
Before: R1 = expr; R2 = R1; R3 = R2
^^^^^^ processed first (backward)
Step 1: R1 = expr; R3 = R1; (deleted R3=R2, renamed R2→R3 in "R2=R1")
^^^^^^ processed next
Step 2: R3 = expr; (deleted R3=R1, renamed R1→R3 in "R1=expr")
Result: entire 3-instruction chain collapses to single "R3 = expr"
If the walk were forward, R2 = R1 would be processed first (renaming R1 = expr to R2 = expr), and R3 = R2 would then need a second pass to collapse further. The backward direction achieves full chain collapse in a single pass.
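The cascading collapse can be demonstrated on a toy single-block IR (illustrative only: guards 2 and 3 from the pseudocode are trivially satisfied in a straight-line block with single definitions and are therefore omitted):

```c
#include <assert.h>

/* Toy model of the backward walk: instructions are either "dest = expr"
 * or "dest = MOV src". Walking last to first, each MOV whose source has
 * exactly one use (the MOV itself) is deleted and the source's defining
 * instruction is renamed to write the MOV's destination. */
typedef struct { int dest; int src; int is_mov; int live; } Instr;

static int use_count(const Instr *code, int n, int reg)
{
    int uses = 0;
    for (int i = 0; i < n; i++)
        if (code[i].live && code[i].is_mov && code[i].src == reg)
            uses++;
    return uses;
}

static void back_copy_prop(Instr *code, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if (!code[i].live || !code[i].is_mov) continue;
        if (use_count(code, n, code[i].src) != 1) continue; /* guard 1 */
        for (int j = i - 1; j >= 0; j--) {
            if (code[j].live && code[j].dest == code[i].src) {
                code[j].dest = code[i].dest;   /* backward rename */
                code[i].live = 0;              /* delete the MOV  */
                break;
            }
        }
    }
}
```

Running this on the three-instruction chain from the example leaves a single live instruction writing R3, matching the single-pass collapse described above.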
Why Phase 83 Runs So Late
Phase 83 is positioned at pipeline slot 83 out of 158, immediately before the register attribute computation sequence (phases 84--95). This late position serves three purposes:
- **Catches late-created copies.** Phases 66--81 include late optimizations (LICM, texture movement, rematerialization, late arch-specific peepholes) that frequently insert new MOV instructions. Backward copy propagation after these passes cleans up the residual chains that forward propagation (which last ran in phase 65) cannot see.
- **Reduces register pressure for allocation.** Every eliminated MOV is one fewer live range the register allocator (phase 101) must handle. By running just before the liveness/DCE pass (phase 84, `OriPerformLiveDeadFourth`), backward copy propagation minimizes the input to register allocation.
- **Safe renaming window.** After phase 83, the pipeline enters the register attribute and legalization sequence. Renaming destinations before this point avoids conflicts with the fixed register assignments that legalization may impose.
Why Disabled by Default
Phase 83 is disabled by default (isNoOp returns 1) for several reasons:
- **Backward renaming is inherently riskier than forward propagation.** Forward copy propagation modifies uses (safe because the original definition still exists). Backward copy propagation modifies definitions -- changing which register an instruction writes to. A bug here can silently corrupt values used by other instructions.
- **Architecture-specific register constraints.** The legality of renaming a destination depends on target-specific constraints: fixed-function registers (thread ID, special purpose), register bank conflicts, paired/grouped register requirements for 64-bit operations, and uniform register constraints on newer architectures (Volta+). Only the architecture backend knows which renames are safe.
- **Diminishing returns.** Forward copy propagation (`OriCopyProp`) runs six times during the GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) and handles the majority of copy elimination. Backward propagation catches only residual chains that forward propagation structurally cannot eliminate.
- **Gate requirement.** Architecture backends that enable backward copy propagation via phase 82 may also need to pre-process the IR (e.g., marking registers that must not be renamed, or inserting constraints that protect fixed-function registers).
Downstream Effects: Pipeline Counter and Encoding Relaxation
When phase 83 sets context+1552 to 9, two operand encoding pattern functions (sub_9BF350 and sub_9BFAF0) change behavior. These functions gate on two conditions:
// Gate check in sub_9BF350 and sub_9BFAF0
if ((context->field_1398 & 0x04) != 0 && context->field_1552 > 9) {
// Apply register constraint relaxation
// Check if operand register class == 3 (address register) or reg_id == 41
// Assign special operand mask 0xFFFFFA (16777210) instead of 0xFFFFFF
}
The flag at context+1398 bit 2 is an architecture capability flag. When both conditions are met (capability flag set AND pipeline has progressed past phase 83), the encoding functions relax operand constraints for address registers (class 3) and special register 41, allowing these to participate in operand patterns that they would otherwise be excluded from.
The pipeline counter value 9 is part of a progression: phase 95 (SetAfterLegalization, sub_C5E440) later advances the counter to 19, enabling a further tier of relaxation in the scheduler initialization (sub_8C0270).
Forward vs. Backward Copy Propagation
The two propagation directions are complementary and handle different structural patterns:
| Property | Forward (OriCopyProp) | Backward (OriBackCopyPropagate) |
|---|---|---|
| Direction | Replaces uses of copy destination with copy source | Replaces definitions to eliminate copies |
| Example | R2=R1; ADD R3,R2,R4 -> ADD R3,R1,R4 | R1=expr; R2=R1 -> R2=expr |
| Runs | 6 times (phases 13,29,37,46,58,65) | Once (phase 83) |
| Default | Always enabled | Disabled (arch-gated) |
| Risk | Low (original def unchanged) | Higher (modifies defs) |
| Catches | Most copies from expansion and lowering | Residual chains from late passes (66--81) |
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
BackCopyPropBudget | 0x21BFDF0 | Maximum backward propagations per function (knob index 808) |
Forward Copy Propagation -- OriCopyProp
Overview
Forward copy propagation is not a standalone pipeline phase but a sub-pass within each of the six GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65). It is identified by the name string OriCopyProp at address 0x21E6CE1 and can be individually targeted via the --named-phases mechanism.
The OriCopyProp name appears in the NamedPhases parser (sub_9F4040 at offset +1648), where it is looked up via sub_C641D0 (case-insensitive binary search over the phase name table). When the user specifies --named-phases OriCopyProp, the system resolves this to the appropriate sub-pass within GeneralOptimize.
Target Opcodes and Flag Bits
Three Ori opcodes are candidates for forward copy propagation:
| Opcode | Meaning | Propagation Rule |
|---|---|---|
| 97 | Definition anchor (STG in ROT13; used internally as a register-to-register MOV/definition marker -- actual SASS MOV is opcode 19) | Replace uses of destination with source |
| 18 | Predicated copy | Propagate only under matching predicate guard |
| 124 | Conditional select (CSEL) | Propagate when select condition is provably constant |
Opcode matching uses a mask: (instr.opcode & 0xCF00) == target, stripping modifier bits in the upper nibble of the opcode field at instruction offset +72.
Three flag bits on instruction field [6] (byte offset 24) track propagation state:
| Bit | Hex | Meaning |
|---|---|---|
| 8 | 0x100 | Copy has been propagated |
| 9 | 0x200 | Deferred cleanup (instruction may still be needed) |
| 10 | 0x400 | Under predicate guard (requires predicate-aware handling) |
Eligibility Check (sub_8F2E50)
The eligibility function checks whether a copy can be safely propagated, with an SM-version-dependent constraint:
function isEligibleForPropagation(instr, ctx):
sm_version = *(ctx + 372)
    if sm_version <= 20479:      // pre-Turing (sm_70 and earlier)
        return true              // unconditionally safe
    else:                        // Turing+ (sm_75+)
        return (instr.operand_type & 0x1C00) == 0  // constraint bits must be clear
The SM version threshold 20479 corresponds to the boundary between Volta (sm_70) and Turing (sm_75). Turing introduced new operand constraint bits that restrict when copies can be folded.
Algorithm
Forward copy propagation replaces uses of a copy's destination with the copy's source:
procedure OriCopyProp(function F):
for each basic block B in RPO:
for each instruction I in B:
if I is MOV Rd, Rs:
for each use U of Rd that I dominates:
if Rs is still live at U:
replace Rd with Rs in U
if Rd has no remaining uses:
mark I as dead
Within the GeneralOptimize loop, copy propagation interacts with constant folding and algebraic simplification: a copy propagation may expose a constant operand, enabling constant folding in the next iteration, which may create a dead instruction for DCE. This is why GeneralOptimize runs as a fixed-point loop. In Variant A (phases 13, 29), the fixed-point iteration is capped by knob 464. In Variant B (phases 37, 58), convergence uses a cost-based threshold of 0.25 (knob 474). Two-pass predicate simplification via sub_908A60 runs within the copy propagation loop to handle predicate-conditional copies.
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
CopyPropBudget | 0x21BECD0 | Maximum instructions processed per invocation |
CopyPropGlobalBudget | 0x21BEC70 | Budget for cross-block (global) copy propagation |
CopyPropForceGlobal | 0x21BEC90 | Force global copy propagation |
CopyPropAddr | 0x21BECE8 | Propagate through address computations |
CopyPropConstantBank | 0x21BECB0 | Propagate constant bank references |
CopyPropUseReachingDefs | 0x21BEBD0 | Use reaching definitions for more aggressive propagation |
CopyPropPreAllocReg | 0x21BEBF0 | Enable for pre-allocated (fixed) registers |
CopyPropNoWriteNonRR | 0x21BEC10 | Disable into non-register-register contexts |
CopyPropNonRegMultiDef | 0x21BEC30 | Handle non-register multi-definition copies |
CopyPropNoMmaCb | 0x21BEC50 | Disable into MMA constant bank operands |
LateCopyPropComplPred | 0x21BC680 | Late copy propagation for complementary predicates |
The CopyPropUseReachingDefs knob is particularly significant: when enabled, the pass uses reaching definitions analysis (built by sub_781F80) instead of simple dominator checks, allowing more aggressive propagation at the cost of additional analysis time.
Complete Knob Reference
All 24 knobs controlling copy propagation and CSE:
| Knob | ROT13 | Address | Controls |
|---|---|---|---|
EnableGvnCse | RanoyrTiaPfr | 0x21BDA50 | Master enable for phase 49 |
EnableGvnCseMode | -- | knob 402 | GVN mode selector (0=off, 1=simple, 2=standard, 3-6=full) |
EnableGvnCsePerInstr | -- | knob 257 | Per-instruction GVN enablement gate |
AllowReassociateCSE | NyybjErnffbpvngrPFR | 0x21C0180 | Master enable for reassociation CSE |
ReassociateCSEBudget | ErnffbpvngrPFROhqtrg | 0x21BA810 | Instruction budget |
ReassociateCSEWindow | ErnffbpvngrPFRJvaqbj | 0x21BA7D0 | Sliding window size |
ReassociateCSESkip | ErnffbpvngrPFRFxvc | 0x21BA7F0 | Skip first N |
ReassociateLargeImmInUIADD64 | ErnffbpvngrYnetrVzzVaHVNQQ64 | 0x21BA7A0 | 64-bit ADD imm |
DistributeAndReassociateMulBudget | QvfgevohgrNaqErnffbpvngrZhyOhqtrg | 0x21BDDC0 | Distributive law |
ForceLateCommoning | SbeprYngrPbzzbavat | 0x21BD2F0 | Force phase 64 |
DisableMoveCommoning | QvfnoyrZbirPbzzbavat | 0x21BE2C0 | Disable MOV commoning |
BackCopyPropBudget | OnpxPbclCebcOhqtrg | 0x21BFDF0 | Phase 83 budget |
CopyPropBudget | PbclCebcOhqtrg | 0x21BECD0 | Per-invocation budget |
CopyPropGlobalBudget | PbclCebcTybonyOhqtrg | 0x21BEC70 | Cross-block budget |
CopyPropForceGlobal | PbclCebcSbeprTybony | 0x21BEC90 | Force global |
CopyPropAddr | PbclCebcNqqe | 0x21BECE8 | Address prop |
CopyPropConstantBank | PbclCebcPbafgnagOnax | 0x21BECB0 | Constant bank |
CopyPropUseReachingDefs | PbclCebcHfrErnpuvatQrsf | 0x21BEBD0 | Reaching defs |
CopyPropPreAllocReg | PbclCebcCerNyybpErt | 0x21BEBF0 | Fixed registers |
CopyPropNoWriteNonRR | PbclCebcAbJevgrAbaEE | 0x21BEC10 | Non-RR disable |
CopyPropNonRegMultiDef | PbclCebcAbaErtZhygvQrs | 0x21BEC30 | Multi-def |
CopyPropNoMmaCb | PbclCebcAbZznPo | 0x21BEC50 | MMA disable |
LateCopyPropComplPred | YngrPbclCebcPbzcyCerq | 0x21BC680 | Compl pred |
SpeculativeHoistCommonInsts | FcrphyngivruBvfgPbzzbaVafgf | 0x21B81B0 | Spec hoist (phase 56) |
Interaction Between Passes
The copy propagation and CSE passes interact with each other and with the rest of the pipeline in a specific sequence designed to maximize redundancy elimination:
Phase 46: GeneralOptimizeMid2
|-- OriCopyProp (forward copy propagation)
|-- constant folding, algebraic simplification, DCE
Phase 48: EnforceArgumentRestrictions
|-- may insert MOVs for ABI compliance -> new copy prop opportunities
Phase 49: GvnCse
|-- global value numbering + CSE
|-- eliminates redundant computations across basic blocks
Phase 50: OriReassociateAndCommon
|-- normalizes expression trees for better commoning
|-- local CSE over reassociated IR
|-- catches cases GvnCse missed due to non-canonical form
Phase 51: ExtractShaderConstsFinal
|-- may replace computations with constant loads -> dead code
Phase 58: GeneralOptimizeLate
|-- OriCopyProp again (cleans up after expansion passes)
Phase 63: OriDoPredication
|-- converts branches to predicated instructions
|-- previously mutually-exclusive code becomes linear
Phase 64: LateOriCommoning
|-- CSE on newly-linearized predicated code
|-- eliminates redundancies exposed by if-conversion
Phase 65: GeneralOptimizeLate2
|-- OriCopyProp + DCE (final cleanup)
Phase 82: AdvancedPhaseBackPropVReg (gate, arch-specific)
Phase 83: OriBackCopyPropagate
|-- backward MOV chain elimination (disabled by default)
|-- reduces copy count before register allocation
Key Function Map
| Address | Size | Name | Purpose |
|---|---|---|---|
0xC5F000 | 16 B | GvnCse::execute | Thunk to sm_backend (context+0x630)->vtable[23] |
0xC5F010 | 6 B | GvnCse::getName | Returns 49 |
0xC5F020 | 6 B | GvnCse::isNoOp | Returns 0 (enabled) |
sub_BEE590 | ~200 B | GvnCse body (SM60/70/90) | Entry point: rebuilds def chains, dispatches to mode |
sub_BEE370 | ~550 B | GvnCse mode dispatcher | Queries knob 402, selects mode 0-6 |
sub_BED7E0 | ~18 KB | FullGvnCse (modes 3-6) | RPO block walk + dominator-scoped CSE, 689 lines |
sub_BEAD00 | ~2.5 KB | StandardGvnCse (mode 2) | Dominator-guided GVN for SM < 32K threshold |
sub_BEA5F0 | ~9 KB | PerDominatedBlockCse | Per-block CSE within dominator subtree, commutative canon |
sub_BEA450 | ~2 KB | SimpleGvn (mode 1) | Basic GVN variant |
sub_BEA1E0 | ~500 B | GvnCse eligibility check | Opcode-based CSE eligibility (16,122,145,183,186,...) |
sub_BED430 | ~2 KB | EBB pre-pass | Extended basic block identification (gated by knob 210) |
sub_661250 | 6 B | GvnCse no-op stub | Returns 0 (SM30/50/80/89 vtable slot 23) |
sub_7846D0 | -- | Build dominator tree | Also computes RPO ordering at context+792 |
sub_661750 | -- | Scoped value tree | Init/destroy balanced BST for dominator scoping |
0xC604D0 | 42 B | OriReassociate::execute | Dispatches to sm_backend (context+1584)->vtable[44] |
0xC5EFE0 | 6 B | OriReassociate::getName | Returns 50 |
0xC5EFF0 | 6 B | OriReassociate::isNoOp | Returns 0 (enabled) |
0xC60020 | 48 B | LateOriCommoning::execute | Calls sub_9059B0 |
0xC5EDF0 | 6 B | LateOriCommoning::getName | Returns 64 |
0xC5EE00 | 6 B | LateOriCommoning::isNoOp | Returns 0 (enabled) |
0xC5EB80 | 7 B | BackCopyProp::execute | Sets context+1552 = 9 (pipeline progress marker) |
0xC5EB90 | 6 B | BackCopyProp::getName | Returns 83 |
0xC5EBA0 | 6 B | BackCopyProp::isNoOp | Returns 1 (disabled) |
0xC5EBB0 | 6 B | AdvancedPhaseBackPropVReg::getName | Returns 82 |
0xC5EBC0 | 6 B | AdvancedPhaseBackPropVReg::isNoOp | Returns 0 (overridden to 1 at runtime by default vtable) |
sub_9BF350 | 8.6 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
sub_9BFAF0 | 9.0 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
sub_8C0270 | 14 KB | Scheduler vtable init | Reads knob 808 (BackCopyPropBudget), checks +1552 == 19 |
sub_9059B0 | ~320 B | LateOriCommoning impl | Knob check + ref-counted working set + core walker |
sub_9055F0 | ~800 B | LateCommoning core | Iterates code list, remaps operands, calls commoning check |
sub_901A90 | ~1.5 KB | Commoning check | Hash lookup + dominance verify + replacement |
sub_74ED70 | ~1.2 KB | Instruction hash | Opcode + type + operand VNs + address space -> hash |
sub_781F80 | -- | Rebuild def chains | Reaching definitions for commoning |
sub_763070 | -- | Rebuild use chains | Use-def chains |
sub_7E6090 | -- | Compute hash values | Pre-computes per-instruction hashes |
sub_7DDB50 | ~140 B | get_function_count | Returns func count from compilation context |
sub_7DF3A0 | ~80 B | is_pure_instruction | Side-effect-free check (bits 2-3 of status word) |
sub_748440 | -- | Hash combine | Mixes operand hashes into instruction hash |
sub_8F2CD0 | -- | Propagate equivalence | MOV-based value equivalence propagation |
sub_8FCE70 | ~150 B | Ref-count release | Releases ref-counted working set objects |
sub_1245740 | -- | Dominance check | O(1) bitvector bit test for CSE safety |
sub_6B9180 | -- | Set membership test | Commoning set contains check |
sub_9253C0 | -- | Instruction deletion | Removes dead/redundant instructions |
sub_90A340 | 1.7 KB | Commoning body | Commoning pass instance (21 callees, confirms operand comparison pattern) |
sub_908A60 | -- | Predicate simplifier | Two-pass (forward+backward) predicate simplification in copy prop |
sub_8F2E50 | -- | Copy/fold eligibility | SM-version-dependent eligibility check (threshold 20479) |
sub_7BA510 | 5.2 KB | HashCompute | Program/instruction sequence hash (FNV/Jenkins variant) |
sub_7BB260 | 3.5 KB | HashAccumulate | Incremental hash accumulation |
sub_8DCF20 | 23 KB | FNV-1a hash table | 8-byte key hash table with chained collision (24-byte entries) |
sub_8DF1C0 | 16 KB | FNV-1a hash table | 32-bit key hash table, two-level structure |
sub_9B1200 | 7.7 KB | Code-caching hash | Jenkins-style instruction fingerprint for RA cache |
Hash Infrastructure
The GVN/CSE passes share hash infrastructure with other subsystems (scheduling, code caching, register allocation). All FNV-1a implementations in ptxas use the same constants:
| Constant | Value | Purpose |
|---|---|---|
| FNV offset basis | 0x811C9DC5 | Initial hash state |
| FNV prime | 16777619 (0x01000193) | Multiplication factor per byte |
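A reference 32-bit FNV-1a using exactly these constants, shown for comparison against the recovered tables (this is the textbook algorithm, not code lifted from the binary):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a: xor the byte in first, then multiply by the
 * prime -- the same order the ptxas hash tables use. */
static uint32_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                       /* xor byte (1a order) */
        h *= 16777619u;                  /* FNV prime 0x01000193 */
    }
    return h;
}
```

The xor-then-multiply order is what distinguishes FNV-1a from FNV-1 and gives it better avalanche behavior on short keys such as 8-byte instruction fingerprints.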
Hash-related functions identified in the binary:
| Address | Size | Function | Used By |
|---|---|---|---|
| sub_7BA510 | 5.2 KB | HashCompute -- program/instruction sequence hash | Shader hash matching (SH= knob) |
| sub_7BB260 | 3.5 KB | HashAccumulate -- incremental hash accumulation | Instruction-at-a-time hashing |
| sub_8DCF20 | 23 KB | FNV-1a hash table (8-byte keys, chained collision) | Instruction deduplication in scheduling |
| sub_8DF1C0 | 16 KB | FNV-1a hash table (32-bit keys, two-level) | Opcode pattern classification |
| sub_9B1200 | 7.7 KB | Jenkins-style instruction hash for code caching | Register allocator cache hit detection |
| sub_74ED70 | ~1.2 KB | Per-instruction hash for commoning | LateOriCommoning (phase 64) |
| sub_748440 | -- | Hash combine helper | Mixes operand hashes into instruction hash |
The code-caching hash at sub_9B1200 uses a different algorithm from FNV-1a:
hash = (1025 * (value + hash)) ^ ((1025 * (value + hash)) >> 6)
It processes instruction opcodes (offset +72), operand counts (+80), operand encodings (+76), register properties (+64), and variable pair mode (bits 20-21 of the descriptor at offset +48).
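The recovered formula can be sketched in a few lines of Python; the fold loop and field list below are illustrative (the real function walks the instruction fields at the offsets listed above), but the per-step arithmetic is the recovered formula verbatim:

```python
def cache_hash_step(h: int, value: int) -> int:
    """One accumulation step of the sub_9B1200-style hash.

    1025 * x == (x << 10) + x, so this is a shift-add multiply followed by
    folding the high bits back down with a right-shift XOR (Jenkins-style).
    """
    t = (1025 * (value + h)) & 0xFFFFFFFF
    return (t ^ (t >> 6)) & 0xFFFFFFFF

def fingerprint(fields) -> int:
    """Fold a sequence of instruction fields (opcode, operand count, operand
    encodings, ...) into a single 32-bit fingerprint."""
    h = 0
    for f in fields:
        h = cache_hash_step(h, f)
    return h
```

Because each step mixes the running hash into the multiply, the fingerprint is order-sensitive, which is what an instruction-sequence cache key needs.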
Cross-References
- Pass Inventory -- complete 159-phase table
- GeneralOptimize Bundles -- forward copy propagation (OriCopyProp) sub-pass
- Predication -- phase 63 creates opportunities for LateOriCommoning
- Liveness Analysis -- liveness data consumed by copy propagation
- Strength Reduction -- produces normalized expressions for GvnCse
- Knobs System -- ROT13-encoded knob infrastructure
- Phase Manager -- vtable dispatch, phase factory
- Ori IR -- instruction representation, operand encoding
Predication (If-Conversion)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
OriDoPredication (phase 63) is the if-conversion pass in ptxas. It transforms short conditional branch regions into predicated straight-line code, eliminating branches by guarding individual instructions with predicate registers. On NVIDIA GPUs, where all threads in a warp execute in lockstep, eliminating divergent branches avoids the performance penalty of serialized path execution under the SIMT model.
| Phase index | 63 |
| Phase name | OriDoPredication |
| Category | Optimization |
| Entry point | sub_1381DA0 (1,517 bytes) |
| Core driver | sub_1381CD0 (206 bytes) |
| Main loop | sub_1381010 (3,249 bytes) |
| Total code | ~17 KB across 19 functions in 0x137D8B0--0x13829F0 |
| SSA window | Yes -- runs at phase 63, within the partial-SSA window (phases 23--73) |
| Pipeline position | After OriRemoveRedundantMultiDefMov (62), before LateOriCommoning (64) |
| Gating | Disabled when bit 5 of context+1376 flags is set; can be disabled via PTXAS_DISABLED_PASSES containing "Predication" |
| Knob controls | Knob 487 (enable/limit gate), knob 577 (per-region enable), knob 579 (texture-bearing region gate), knob 582 (block-level cold-region query), knob 260 (extra-latency penalty check) |
GPU Motivation
The SIMT execution model makes predication qualitatively different from its role on scalar CPUs.
On a scalar CPU, a correctly-predicted branch is essentially free -- the branch predictor eliminates the control flow cost. If-conversion on CPUs is a niche optimization applied only when branches are highly unpredictable.
On a GPU, a divergent conditional branch forces the warp to serialize: the hardware executes the taken path with some threads masked off, then executes the not-taken path with the complementary mask. Both paths execute regardless, and the warp reconverges at the post-dominator. The cost is the sum of both paths, not the maximum.
Predication eliminates this divergence penalty entirely. Both paths still execute, but without the overhead of stack-based reconvergence (BSSY/BSYNC pairs on sm_70+), without the branch instruction itself, and with the ability for the scheduler to interleave the predicated instructions with other independent work. For short regions (a few instructions per side), predication is strictly superior to branching.
Branching (divergent):
ISETP.NE P0, R4, R5
BSSY B0, target
@P0 BRA taken_path
// not-taken:
IADD3 R7, R7, 1, RZ
BRA rejoin
// taken:
IADD3 R6, R6, 1, RZ
// rejoin:
BSYNC B0
Predicated:
ISETP.NE P0, R4, R5
@P0 IADD3 R6, R6, 1, RZ
@!P0 IADD3 R7, R7, 1, RZ
// continues straight-line
The branching version requires 6 instructions (including BSSY/BSYNC convergence bookkeeping) and forces warp serialization. The predicated version requires 3 instructions and executes without divergence.
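The "sum, not maximum" arithmetic can be condensed into a toy cost model. This is a sketch under stated assumptions: the cycle counts and the 3-instruction bookkeeping overhead (BRA + BSSY/BSYNC) are illustrative constants, not measured values.

```python
def divergent_cost(then_cycles: int, else_cycles: int, overhead: int = 3) -> int:
    """Toy SIMT model: a divergent branch executes both paths serially,
    plus branch/reconvergence bookkeeping (illustrative constant)."""
    return then_cycles + else_cycles + overhead

def predicated_cost(then_cycles: int, else_cycles: int) -> int:
    """Predicated code also issues both paths, but carries no branch or
    reconvergence overhead."""
    return then_cycles + else_cycles
```

Under this model predication always wins by exactly the bookkeeping overhead when the warp diverges; the real trade-off appears only for warps that do *not* diverge, where branching would skip one path entirely -- which is why the pass limits region size.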
Algorithm Overview
The pass operates in three layers:
- Entry and gating (sub_1381DA0): checks the "Predication" disable flag and knob 487, initializes working state, calls the driver.
- Iterative driver (sub_1381CD0): initializes via the SM backend's vtable dispatch at sm_backend+1296, then calls the main loop up to 3 times (controlled by a knob at options offset 41768) with different aggressiveness settings.
- Main RPO loop (sub_1381010): walks the RPO block order, identifies candidate branch regions, evaluates profitability, and applies the transformation.
Entry Point -- sub_1381DA0
sub_1381DA0(compilation_unit):
if context+1376 bit 5 set:
return // phase disabled by flag
knob_state = *(context+1664) // OCG knob container
mode = *(*(knob_state+72) + 16416)
if mode == 0:
limit = (context+1419 bit 4) != 0
elif mode == 1:
limit = *(*(knob_state+72) + 16424)
else:
// mode >= 2: skip limit check
IsPassDisabled(knob_state, "Predication", &disabled)
if disabled or limit:
return
// Check knob 487 iteration limit
CheckKnob487(knob_state)
// Set up working state (allocate two pool objects)
context+1385 |= 1 // mark predication active
sub_1381CD0(state) // call driver
context+1385 &= ~1 // clear predication flag
// Cleanup: release pool objects and tree structures
The context+1385 byte has bit 0 set during predication execution, which signals downstream code (such as sub_137EE50) that the pass is active.
Iterative Driver -- sub_1381CD0
sub_1381CD0(state):
// Initialize via SM backend
sm_backend = *(context+1584)
init_fn = vtable(sm_backend)+1296
if init_fn == sub_7D82C0: // fast path: zero-init
clear state fields
else:
init_fn(sm_backend, state) // backend-specific init
bb_count = *(context+520)
if bb_count <= 1: return 0 // nothing to if-convert
// Determine iteration count from knob at options+41760
iterations = 0
if *(options+41760) == 1:
iterations = *(options+41768)
// First pass: always run
state[14].byte[8] = 0 // not second-pass mode
changed = sub_1381010(state)
// Optional second/third pass with relaxed thresholds
while changed and iterations > 0:
state[14].byte[8] = (iterations == 1)
changed = sub_1381010(state)
if iterations <= 2: break
The iteration mechanism allows the pass to make a second (and potentially third) traversal with progressively relaxed profitability thresholds. The flag at state[14].byte[8] signals the final iteration, which changes some size-limit comparisons in the profitability heuristic.
Main Loop -- sub_1381010
The main loop walks basic blocks in RPO order (via the block index array at context+512), identifies candidate branch regions, and decides whether to if-convert each one.
sub_1381010(state):
// Rebuild liveness and CFG
sub_781F80(context, 1) // rebuild liveness
if context+1370 bit 4 set:
sub_A10160(context, 1) // rebuild analysis
sub_7E6090(context, 0,0,0,0) // refresh CFG
// Clear block-76 fields
for each block in chain:
block+76 = 0
sub_791F00(context, 0) // clear RPO numbering
changed = false
for rpo_idx = 2 .. bb_count:
bb = bb_array[rpo_order[rpo_idx]]
if bb is same as previous region tail:
// Continuation of prior diamond -- reuse state
restore saved state
else:
// Fresh candidate: analyze new region
init candidate state
if not isTriangleDiamondCandidate(bb):
skip
if not analyzeRegion(state, candidate):
skip
// Region identified -- extract branch info
header = bb
true_target = successor of header's terminator
branch_pred = extractBranchPredicate(header)
false_target = fallthrough
// Try to if-convert both sides
if evaluateProfitability(true_side, false_side):
applyTransformation(...)
changed = true
if changed:
context+1370 &= ~4 // invalidate CFG
sub_785E20(context, 0) // rebuild
return changed
CFG Pattern Recognition
The pass recognizes three CFG shapes for if-conversion:
Triangle Pattern
One arm of the branch is empty (falls through directly to the merge point).
[header]
/ \
/ \
[then] |
\ /
\ /
[merge]
Requirements:
- header ends with a conditional branch (opcode 93; OUT_FINAL in the ROT13 name table, but checked here as a control-flow terminator marker)
- then block has a single predecessor (the header)
- then block's sole successor is the merge block
- merge has exactly two predecessors: header and then
- No backedges into the region
Diamond Pattern
Both arms contain instructions.
[header]
/ \
/ \
[then] [else]
\ /
\ /
[merge]
Requirements (same as triangle, plus):
- The else block has a single predecessor (the header)
- The else block's sole successor is the same merge block
- merge has exactly two (or three, for extended diamonds) predecessors
Extended Diamond Pattern
The pass can also handle diamonds where one or both arms chain through a successor block before merging. The sub_137FE10 function implements this extended analysis, walking forward through fall-through blocks until it reaches a merge point or encounters a block that fails the candidate check.
[header]
/ \
/ \
[then] [else]
| |
[then2] [else2] (optional chain blocks)
\ /
\ /
[merge]
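The three shape tests reduce to a few adjacency checks. A minimal Python sketch, assuming the CFG is given as successor/predecessor lists (it omits the backedge test and the extended-chain walk, and the function name is ours):

```python
def classify_region(header, succs, preds):
    """Classify a two-way branch at `header` as a triangle or diamond."""
    if len(succs[header]) != 2:
        return None
    a, b = succs[header]

    def side_ok(arm, merge):
        # arm must be single-entry and fall straight into the merge block
        return preds[arm] == [header] and succs[arm] == [merge]

    # Triangle: one arm is empty -- header falls through directly to merge
    if side_ok(a, b):
        return ("triangle", a, b)      # a = then-block, b = merge
    if side_ok(b, a):
        return ("triangle", b, a)
    # Diamond: both arms are single-entry and feed the same merge block
    if succs[a] == succs[b] and len(succs[a]) == 1:
        merge = succs[a][0]
        if preds[a] == [header] and preds[b] == [header]:
            return ("diamond", a, b, merge)
    return None
```

The real pass does the equivalent checks through the predecessor-count arithmetic in sub_137E3A0 rather than explicit set comparisons.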
Region Analysis -- sub_137E3A0
This function (sub_137E3A0, 367 bytes) validates that a basic block is part of a valid if-conversion candidate. It checks:
- Predecessor count: The merge block must have exactly header_predecessor_count + 1 predecessors.
- Terminator type: The header's terminator must match opcode 95 after masking bits 12-13 (STS in the ROT13 name table; used here as a control-flow terminator class marker, not an actual store-shared instruction).
- Branch predicate: The branch guard must be a non-negated register operand (type field (>>28)&7 == 1) from the predicate register file (register file type checked against the state's expected file types 2 or 3, corresponding to R or UR).
- No backedges: The predecessor list must not contain a self-edge.
- Merge block successor check: Validates that the merge block's sole successor leads to the expected continuation block.
// Pseudocode for sub_137E3A0
bool isTriangleDiamondCandidate(state, bb):
pred_count = bb->predecessor_count // at bb+144
if pred_count == 0: return false
preds = bb->predecessor_list // at bb+128
if preds == NULL: return false
if preds->next != NULL: return false // must be single-entry
header = bb_array[preds->block_index]
if header->predecessor_count + 1 != pred_count:
return false
terminator = header->first_instr
opcode = terminator->opcode & 0xFFFFCFFF // mask bits 12-13
if opcode != 95: return false // opcode 95 = STS in ROT13 table; used as control-flow terminator class
// Extract branch predicate from last operand
last_op_idx = terminator->num_operands - ((opcode >> 11) & 2) - 2
pred_operand = terminator->operands[last_op_idx]
if operand_type(pred_operand) != 1: return false // must be register
if pred_operand is negated: return false
reg_file = get_register_file(pred_operand)
if reg_file != state->expected_file: return false
// Check successor list for backedges
for each successor of header:
if successor == bb: continue
if other_successor exists: return false // at most one other
return true
Region Scanning -- sub_137D990
This function (1,270 bytes) walks all instructions in a candidate block, counting them and checking each for predicability. It builds a cost model:
Per-Instruction Checks
For each instruction in the candidate block:
- Already-predicated check (opcode bit 12 = 0x1000): Instructions that already carry a predicate guard are flagged via state+48 for special handling.
- MOV counting (opcode 130): Instructions with opcode 130 (HSET2 in the ROT13 name table; the code treats this value as an internal marker for MOV-like operations) that match specific operand patterns increment a separate MOV counter at state+4, used to adjust profitability thresholds. The actual SASS MOV instruction is opcode 19.
- Predicable instruction check (sub_137D8B0): Each instruction is tested via the SM backend's canPredicate vtable method at sm_backend+1424. Instructions that cannot be predicated (atomics, certain memory operations, barriers) cause the scan to fail.
- Primary memory load classification: For load instructions (opcode 125 after masking), the memory space is queried via sub_91C840. The internal category number is tested against bitmask 0x90E ((1 << category) & 0x90E), which selects the five primary data memory spaces: .shared (1), .local (2), .const (3), .tex (8), .global (11). When a load targets one of these spaces, the has_primary_memory_load flag is set at candidate+12, which affects profitability thresholds in the heuristic. See the Memory Space Classification for Predication section for the full bitmask decode.
- Extra-latency check: Instructions matching opcodes in the set {22, 23, 41, 42, 55, 57, 352, 297} (long-latency operations including texture, surface, and certain memory ops) have their latency contribution tallied at state+16 via the SM backend's getExtraLatency method at sm_backend+1392.
- Predicate-register conflict: If any destination operand writes to the same predicate register that the branch uses as its guard, the region cannot be if-converted (the predicate would be clobbered before all instructions are guarded).
- Instruction count limit: The non-MOV instruction count at state+8 is compared against a threshold from the state object. If it is exceeded and the block is not marked as "must-predicate" (state+20), the scan returns failure.
// Pseudocode for sub_137D990
bool analyzeRegion(state, candidate):
bb = candidate->basic_block
if bb->flags & 2: return false // block excluded
first_instr = bb->first_instruction
// Check if first instruction is speculative-safe
if isSpeculativelyUnsafe(first_instr, context):
candidate->has_unsafe = first_instr
// Extract branch predicate register index
header = bb_array[bb->predecessor->block_index]
terminator = header->first_instruction
branch_pred_idx = extractPredicateIndex(terminator)
// Walk all instructions in the block
for instr = first_instr; instr != bb->tail; instr = instr->next:
// Track already-predicated flag
candidate->has_predicated |= (instr->opcode & 0x1000) != 0
// Count MOVs
if isMOV(instr) and matchesMOVPattern(instr):
candidate->mov_count++
// Check speculation safety for uniform operands
if state->has_uniform_speculation:
check uniform register SSA chain
// Check predicability via backend
if not canPredicateInstruction(state, instr, header):
fail with "too many instructions"
// Primary memory load classification (0x90E bitmask)
if isLoadOp(instr):
space = getMemorySpace(instr)
if space is in {shared, local, const, tex, global}:
candidate->has_primary_memory_load = true
// Extra latency accounting
if isLongLatencyOp(instr):
candidate->extra_latency += getExtraLatency(instr)
// Count non-trivial instructions
if not isMOVPHI(instr): // opcode 263 = MOV.PHI
candidate->instr_count++
if not candidate->must_predicate:
if candidate->instr_count > state->threshold:
return false
// Check for predicate-register clobber
for each destination operand:
if dest is register and dest index == branch_pred_idx:
return false
candidate->complete = true
return true
Profitability Heuristic -- sub_1380BF0
The profitability decision (sub_1380BF0, 1,055 bytes) is the most complex part of the pass. It considers multiple factors to decide whether converting a branch region to predicated code is profitable.
Decision Flow
sub_1380BF0(state, true_side, false_side, is_reverse, result):
result = false
// 1. Texture-bearing region check
if true_side->has_predicated:
if not CheckKnob579(knob_state):
return false
// 2. Must-predicate override
if true_side->must_predicate:
return true
// 3. CONV.ALLOC check
if state->has_conv_alloc:
if not (bb->flags & 8) or not state->flag_byte76:
return false
// 4. Branch-predicate matching
// Check if the branch condition matches a known pattern
// (SEL instruction producing the predicate)
header_terminator = state->header->first_instruction
pred_operand = extractLastPredicate(header_terminator)
if predicateMatchesSELPattern(pred_operand):
return true
// 5. False-side memory load check
if false_side->has_primary_memory_load:
return sub_137F800(...) // speculation safety analysis
// 6. Extra-latency penalty
if CheckKnob260(knob_state):
if true_side->extra_latency > 0 and false_side->extra_latency > 0:
return false // both sides have long-latency ops
// 7. Size-based thresholds (main heuristic)
instr_count = true_side->instr_count
if true_side->has_primary_memory_load:
// Memory loads route to extended diamond analysis
return sub_137FE10(...) // extended diamond analysis
mov_count = true_side->mov_count
if mov_count <= state->mov_threshold:
if state->flag_byte76:
// Uniform-speculation-aware thresholds
if true_side->has_predicated:
return state->uniform_tex_limit >= instr_count
else:
return state->uniform_limit >= instr_count
else:
if true_side->has_predicated:
return state->tex_limit >= instr_count
else:
return state->base_limit >= instr_count
and (true_extra <= 2 or false_extra <= 2)
// 8. Fallback: combined size check
combined = true_side->instr_count + false_side->instr_count
if state->combined_limit < instr_count and combined > state->threshold:
return false
// 9. False-side memory loads boost profitability
if false_side->has_primary_memory_load:
return true // scheduling overlap benefit
return sub_1380810(...) // fall-through block analysis
Threshold Fields
The state object contains multiple instruction-count thresholds, initialized by the scheduler backend during sub_1381CD0:
| State offset (as int32 index) | Field | Typical role |
|---|---|---|
[8] | base_limit | Maximum instructions for simple (non-textured, non-uniform) regions |
[9] | tex_limit | Maximum instructions for textured regions (without uniform speculation) |
[10] | uniform_limit | Maximum instructions with uniform-speculation enabled |
[11] | uniform_tex_limit | Maximum for textured + uniform-speculation regions |
[12] | threshold | Hard ceiling on non-MOV instruction count |
[13] | combined_limit | Maximum for combined (both-sides) instruction count |
[14] | fallthrough_limit | Threshold for fall-through block extension |
[15] | extended_limit | Threshold for extended diamond regions |
[16] | mov_threshold | MOV count below which standard limits apply |
[17] | mov_limit | MOV-specific threshold |
These values are architecture-specific -- the scheduler backend's vtable method at offset 1296 initializes them based on the SM target and optimization level.
Instruction Predication -- sub_9324E0
Once a region passes the profitability check, each instruction in the region is predicated. The predication is performed by sub_9324E0 (280 bytes), which transforms each instruction by adding a predicate guard operand.
Transformation Rules
For a non-branch instruction with opcode op:
- Copy the operand array, appending the guard control word as the penultimate operand and the predicate register as the new last operand (matching the encoding example below and the Ori IR operand layout).
- Set bit 12 of the opcode (op | 0x1000) to mark the instruction as predicated.
- Special case for opcode 188: remapped to 190.
- Special case for opcode 93 (OUT_FINAL in the ROT13 name table; used here as a branch marker): replaced with opcode 95 (STS in the ROT13 name table; used here as a conditional-select construct), not simply predicated.
- Emit the new instruction via sub_92C240, which creates the replacement in the code list.
- Transfer debug info: *new_instr+32 = *old_instr+32 (debug location).
- Delete the original instruction via sub_9253C0.
// Predicate guard encoding in operand word:
// guard_pred = predicate_reg_index | 0x60000000
// (type field 3 = 0x6000_0000 >> 28, register index in low 24 bits)
//
// Example: @P2 IADD3 R0, R1, R2, RZ
// Original IADD3 operands: [R0_def, R1, R2, RZ]
// Predicated operands: [R0_def, R1, R2, RZ, guard_word, P2 | 0x60000000]
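The opcode and operand rewrites condense into a short Python sketch. The helper names are ours, and whether the 188-to-190 remap also sets bit 12 is an assumption; the opcode values and the guard-word packing are from the recovered rules above.

```python
PREDICATED_BIT = 0x1000        # bit 12 marks a predicated instruction

def predicate_opcode(op: int) -> int:
    """Opcode rewrite applied when guarding an instruction (internal
    ptxas opcode values, not SASS encodings)."""
    if op == 188:              # special case: 188 is remapped to 190
        op = 190
    if op == 93:               # branch marker: replaced by the conditional-
        return 95              # select construct (95), not simply predicated
    return op | PREDICATED_BIT

def guard_operand(pred_reg: int) -> int:
    """Pack a guard-predicate operand word: high type bits plus the
    register index in the low 24 bits, per the recovered layout."""
    return 0x60000000 | (pred_reg & 0xFFFFFF)
```

For the @P2 IADD3 example above, guard_operand(2) produces the 0x60000002 word appended to the operand array.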
Already-Predicated Instructions
Instructions that already have a predicate guard (bit 12 set in original opcode) are handled by sub_9321B0, which must compose the existing predicate with the new guard using a predicate-AND or predicate-SEL operation rather than simply replacing the guard.
Post-Transformation -- sub_137DE90
After predicating all instructions in a region, sub_137DE90 (1,286 bytes) performs cleanup:
- Bitvector maintenance: For each register operand in the predicated instructions, checks whether the register is live in the dominator's bitvector (at context+832). If the register is newly defined under the predicate, marks it in the bitvector via sub_BDBB80. This ensures later liveness analysis accounts for the conditionally-defined values.
- Per-instruction predication: Walks the block's instruction list and calls sub_9324E0 on each instruction, passing the predicate register index and the guard operand word.
- Predicate register tracking: If any register was newly exposed to the bitvector, and the guard predicate is a non-negated register operand, marks the predicate register's descriptor at +76 with bit 0 set and increments a counter at state+200.
- Cleanup: Resets the per-block tracking arrays (stored at state[27]/state[56..57]) that track which registers were bitvector-updated during this region.
Speculative Execution Safety -- sub_137EE50
After the main if-conversion, sub_137EE50 (969 bytes) performs a secondary scan to identify instructions that were speculatively moved above their original control-flow guard. This function:
- Checks the global predication flag at context+1412 and the per-function flag at context+1392 bit 0. If the function already has speculated instructions from a prior pass, returns immediately.
- Scans the true-side block for load instructions to global or surface memory (opcodes 183 and 288 after masking). For each such load, queries the memory space via sub_91C840 and checks whether space type 18 (.surf/tensor extended) could be accessed.
- Records speculatively unsafe instructions in a tracking hash set (at state+240), used by later passes to insert appropriate guard instructions or to avoid further speculation.
- Scans the false-side block with the same logic.
The post-predication speculation safety check targets exclusively category 18 (.surf/tensor extended, sm_90+). This is the only memory space that sub_137EE50 treats as requiring speculative-unsafe tracking; global loads and texture loads are considered acceptable for speculative execution in the predication cost model.
Memory Space Classification for Predication
The bitmask 0x90E appears in five functions within the predication pass (sub_137D990, sub_137F560, sub_137F220, sub_137FB60, sub_1380810). All five use the identical test pattern:
category = sub_91C840(operand); // classify memory space
if (category <= 0xB && ((1LL << category) & 0x90E) != 0)
// load targets a primary data memory space
Bitmask Decode
0x90E = binary 1001 0000 1110 -- bits {1, 2, 3, 8, 11} are set.
| Bit | Category | PTX Space | In 0x90E? | Role in predication |
|---|---|---|---|---|
| 0 | 0 | Generic (unqualified) | No | Unresolved address space -- cannot be classified, excluded |
| 1 | 1 | .shared | Yes | CTA-scope scratchpad; always mapped for executing CTA; 20--30 cycle latency |
| 2 | 2 | .local | Yes | Thread-private stack/frame; always mapped; backed by L1/L2 |
| 3 | 3 | .const | Yes | Constant bank (c[bank][offset]); loaded by driver before launch; always mapped |
| 4 | 4 | .param | No | Kernel parameter memory; typically constant-folded or register-promoted by earlier passes |
| 5 | 5 | .const (extended) | No | Extended constant path (PTX inputs 21, 22); different scheduling model |
| 6 | 6 | .global (extended) | No | Extended global variant (PTX input 20); different scheduling model |
| 7 | 7 | Spill space | No | Compiler-generated register spill/fill; handled separately by regalloc |
| 8 | 8 | .tex | Yes | Texture memory; high latency (200+ cycles); texture cache always valid when bound |
| 9 | 9 | Special (opcode-dep.) | No | Ambiguous classification from case-18 sub-switch in sub_91C840 |
| 10 | -- | (unused) | No | No memory space maps to category 10 |
| 11 | 11 | .global | Yes | DRAM-backed global memory; highest latency (300+ cycles) |
Categories 12--18 (code/function, uniform, register file, surface, surface/tensor extended) all exceed the <= 0xB range check and are excluded from the bitmask test automatically.
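The range-check-then-mask pattern is easy to validate end to end. A minimal Python sketch (the function name is ours; the constant and the test shape are the recovered pattern):

```python
PRIMARY_MEMORY_MASK = 0x90E    # bits {1, 2, 3, 8, 11} set

def is_primary_memory_load(category: int) -> bool:
    """The shared test from the five predication functions: range check
    first, then bitmask membership. Categories above 0xB (code, uniform,
    register file, surface variants) fail the range check outright."""
    return category <= 0xB and ((1 << category) & PRIMARY_MEMORY_MASK) != 0
```

Enumerating every category recovers exactly the five primary spaces (.shared, .local, .const, .tex, .global), confirming the bitmask decode above.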
What the Bitmask Selects
The five selected categories -- shared, local, const, texture, global -- are the primary data memory spaces: the ones that involve real data movement through the GPU memory hierarchy and carry meaningful scheduling latency. These are the loads a scheduler can profitably overlap with predicated computation.
The excluded categories are either:
- Unresolvable (generic -- could be anything)
- Non-load in practice (param -- folded away, code -- function pointers)
- Compiler-internal (spill, special -- the compiler already knows how to handle these)
- Out of range (register file, uniform, surface, surface/tensor -- categories > 11)
How the Bitmask Affects Profitability
The bitmask test does NOT directly determine speculation safety. It sets a has_primary_memory_load flag at candidate offset +12, which the profitability heuristic (sub_1380BF0) uses in three ways:
- True-side memory loads (a2+12 set): The profitability check routes to the extended diamond analysis (sub_137FE10) instead of the standard size-threshold path. This allows larger regions to be if-converted when they contain meaningful loads.
- False-side memory loads -- speculation guard (a3+12 set): If the false side has memory loads AND the SM backend's speculation policy (vtable at sm_backend+1200) allows it, the detailed speculation analysis (sub_137F800) is invoked. If that analysis flags the loads as risky, predication is rejected.
- False-side memory loads -- profitability boost (a3+12 set, passes safety): If the false side has memory loads and passes safety checks, the profitability heuristic returns true directly (line 166 of sub_1380BF0). The reasoning: if the false-side code contains real memory loads, converting the branch to predicated straight-line code lets the scheduler overlap those loads with other work.
Speculation Safety (Separate Mechanism)
The actual speculation safety tracking is handled by sub_137EE50 (post-predication scan), which uses a different criterion from the 0x90E bitmask:
- Scans both sides for opcodes 183 (LDG) and 288 (STG) after masking
- For each, queries sub_91C840 and checks if category == 18 (.surf/tensor extended)
- Only category 18 loads are tracked as "speculatively unsafe" in the hash set at state+240
- The context+1392 bit 0 flag persists and is checked by OriHoistInvariantsLate (phase 66)
This means global loads (category 11) that are speculatively predicated are not tracked as unsafe. In the ptxas cost model, global memory loads under a predicate guard are considered acceptable: the hardware will issue the load speculatively, and if the predicate is false, the result is simply discarded. On architectures with memory access traps (e.g., page faults on unmapped addresses), the hardware masks the fault for lanes where the predicate is false. Surface/tensor extended operations (category 18), however, may have side effects that cannot be masked, so they receive the unsafe designation.
Fall-Through Block Analysis -- sub_1380810
When the standard profitability check is inconclusive, sub_1380810 (980 bytes) analyzes the fall-through continuation of the merge block. The idea: even if the region itself is borderline, if the code immediately after the merge point contains long-latency operations (loads, texture fetches), the predicated version may be better because the scheduler can overlap the predicated instructions with those long-latency operations.
The function walks instructions in the merge block's successor(s), using the same 0x90E bitmask test to identify primary-data-memory loads. Non-load instructions are checked via the SM backend's vtable at sm_backend+1824. The function counts:
- Primary-memory-space loads (via the 0x90E mask)
- Other long-latency operations (via the backend vtable check)
- Total instruction count
If the fall-through region contains enough long-latency work (compared to state->fallthrough_limit and state->extended_limit), the function returns true, indicating that predication is profitable despite the region being above the standard size threshold.
Extended Diamond Analysis -- sub_137FE10
For complex diamonds where one side has primary-memory loads that affect profitability thresholds, sub_137FE10 (2,550 bytes) performs a more thorough analysis. It can "look through" the diamond to the merge block and even one block beyond, checking whether the instruction mix in the continuation makes predication worthwhile. It invokes sub_137F560 (which also uses the 0x90E bitmask) to scan continuation blocks for scheduling-relevant loads.
The function also handles the case where the merge block falls through to another conditional branch that itself is a predication candidate -- effectively analyzing a chain of adjacent diamonds.
Interaction with Later Passes
The predication pass is positioned to maximize the benefit of subsequent passes:
| Phase | Name | Interaction |
|---|---|---|
| 64 | LateOriCommoning | Predication may create duplicate computations on both sides of the original branch. Commoning eliminates these by recognizing that @P0 IADD3 R0, R1, R2, RZ and @!P0 IADD3 R0, R1, R2, RZ with the same inputs can be merged into an unconditional instruction. |
| 65 | GeneralOptimizeLate2 | The copy propagation and constant folding sub-passes clean up the predicated code: dead predicate definitions, redundant MOVs introduced by the PHI destruction at merge points, and constant-foldable predicates. |
| 66 | OriHoistInvariantsLate | Predication can convert loop-varying branches into predicated straight-line code. LICM then hoists any newly-exposed loop-invariant computations. |
| 69 | OriDoRemat | Predicated instructions that define values used far from their definition are candidates for rematerialization, reducing register pressure. |
| 70 | OriPropagateVaryingSecond | After predication changes the control flow, varying annotations must be recomputed. The second varying-propagation pass updates which values are uniform vs. divergent. |
The context+1392 bit 0 flag set by sub_137EE50 persists through these passes and is checked by OriHoistInvariantsLate to avoid hoisting speculatively-unsafe instructions out of their guarded context.
Key Functions
| Address | Size | Function | Role |
|---|---|---|---|
| sub_1381DA0 | 1,517 B | OriDoPredication::execute | Phase entry point; gating, setup, cleanup |
| sub_1381CD0 | 206 B | runPredicationDriver | Iterative driver; calls main loop up to 3 times |
| sub_1381010 | 3,249 B | predicationMainLoop | RPO walk, region identification, transformation dispatch |
| sub_137E3A0 | 367 B | isTriangleDiamondCandidate | CFG pattern validation |
| sub_137D990 | 1,270 B | analyzeRegion | Per-block instruction scan, cost modeling |
| sub_137D8B0 | 209 B | canPredicateInstruction | Single-instruction predicability check |
| sub_1380BF0 | 1,055 B | evaluateProfitability | Multi-factor profitability decision |
| sub_137FE10 | 2,550 B | analyzeExtendedDiamond | Extended diamond and chain analysis |
| sub_137F800 | 864 B | analyzeSpeculationSafety | Speculation safety for side-effect loads |
| sub_1380810 | 980 B | analyzeFallThrough | Fall-through block continuation analysis |
| sub_137EE50 | 969 B | markSpeculativeInstructions | Post-transformation speculative-load tracking |
| sub_137DE90 | 1,286 B | applyPredication | Instruction rewriting and bitvector update |
| sub_137FB60 | 687 B | classifyInstruction | Per-instruction classification during walk |
| sub_137F560 | 665 B | scanBlockForUnsafe | Block scan for speculative safety |
| sub_137F220 | 828 B | classifyInstructionExtended | Classification with bitvector tracking |
| sub_137E510 | 2,360 B | moveInstructionsToHash | Instruction movement during transformation |
| sub_9324E0 | 280 B | predicateInstruction | Adds predicate guard to single instruction |
| sub_9321B0 | ~800 B | predicateAlreadyGuarded | Handles already-predicated instructions |
| sub_92C240 | (shared) | createInstruction | Instruction builder (shared utility) |
SASS Predicate Model
NVIDIA SASS provides 7 usable predicate registers (P0--P6) plus the hardwired always-true register PT. Every instruction in the SASS ISA can optionally carry a predicate guard:
@P0 IADD3 R0, R1, R2, RZ // executes only if P0 is true
@!P2 FMUL R3, R4, R5 // executes only if P2 is false
FADD R6, R7, R8 // unconditional (implicit @PT)
Predicate conditions are set by comparison instructions:
ISETP.GT.AND P0, PT, R1, R2, PT // P0 = (R1 > R2) AND PT
FSETP.LT.AND P1, P2, R3, R4, PT // P1 = (R3 < R4), P2 = !(R3 < R4)
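The dual-destination semantics can be made concrete with a toy model. This is an illustrative sketch, not recovered code (`setp_and` is an invented name); it assumes the `.AND` combine mode applies the input predicate to both destinations, which matches the annotation on the FSETP example above.

```python
# Toy model of SETP dual-result semantics (illustrative only; assumes .AND
# applies the input predicate to both the result and its complement).
def setp_and(cmp_result: bool, input_pred: bool):
    p = cmp_result and input_pred          # first destination predicate
    q = (not cmp_result) and input_pred    # second destination: complement
    return p, q

# FSETP.LT.AND P1, P2, R3, R4, PT with R3=1.0, R4=2.0 (PT is always true):
p1, p2 = setp_and(1.0 < 2.0, True)
assert (p1, p2) == (True, False)
```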
Uniform predicates (UP0--UP6, UPT) are the warp-uniform variant available on sm_75+. When all threads in a warp have the same predicate value, using UP instead of P avoids consuming a per-thread predicate register and enables the hardware to skip the entire instruction rather than masking per-thread.
In the Ori IR, predicate operands are encoded with type field 5 (bits 28-30 of the packed operand word). The guard predicate is appended as a pair of extra operands: the guard control word (type 3, 0x60000000 | reg_index) followed by the predicate register operand itself.
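A small sketch of this packed layout, assuming the bit positions stated above (type in bits 28-30, register index in the low 24 bits, as used elsewhere in this wiki). The helper names are invented for illustration; only the constants come from the decompilation.

```python
# Sketch of the Ori packed operand word layout described above.
# Assumptions: type field in bits 28-30, register index in the low 24 bits.
TYPE_SHIFT, TYPE_MASK = 28, 0x7
REG_MASK = 0xFFFFFF

def operand_type(word: int) -> int:
    """Extract the 3-bit operand type field."""
    return (word >> TYPE_SHIFT) & TYPE_MASK

def make_predicate_operand(reg_index: int) -> int:
    # Type field 5 marks a predicate operand per the text above.
    return (5 << TYPE_SHIFT) | (reg_index & REG_MASK)

def make_guard_control(reg_index: int) -> int:
    # Guard control word exactly as quoted above: 0x60000000 | reg_index.
    return 0x60000000 | (reg_index & REG_MASK)

word = make_predicate_operand(3)   # predicate register P3
assert operand_type(word) == 5
assert word & REG_MASK == 3
assert make_guard_control(3) == 0x60000003
```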
Opcode Reference
Key opcodes referenced by the predication pass (after BYTE1 &= 0xCF masking to clear bits 12-13):
| Value | Mnemonic | Role in predication |
|---|---|---|
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used here as a conditional branch marker -- the instruction being eliminated. Actual SASS BRA is opcode 67. |
| 95 | STS | ROT13 name is STS; used here as the branch terminator class marker and conditional-select replacement target. Actual SASS EXIT is opcode 77. |
| 97 | STG | ROT13 name is STG; used here as a block boundary sentinel for scan termination. Actual SASS CALL is opcode 71. |
| 125 | LD (variant) | Load -- checked for speculative safety |
| 130 | HSET2 | ROT13 name is HSET2; used here as an internal marker for MOV-like instructions counted separately for profitability. Actual SASS MOV is opcode 19. |
| 183 | LDG | Global load -- speculative-unsafe |
| 188 | (variant) | Remapped to 190 when predicated |
| 263 | MOV.PHI | SSA phi -- not counted in instruction totals |
| 286 | CONV.ALLOC | Convergence allocation marker -- special handling in profitability check |
| 288 | STG | Global store -- speculative-unsafe |
| 352, 297 | (long-latency) | Texture/surface ops -- extra latency penalty |
Cross-References
- Pass Inventory -- phase 63 in the 159-phase table
- IR Overview -- Ori instruction format, operand encoding, register files
- Copy Propagation & CSE -- phase 64 (LateOriCommoning) runs immediately after
- GeneralOptimize Bundles -- phase 65 cleans up after predication
- Loop Passes -- phase 66 (OriHoistInvariantsLate) hoists newly exposed invariants
- Rematerialization -- phase 69 (OriDoRemat) handles increased register pressure
- Liveness Analysis -- liveness rebuilt at entry, bitvectors maintained during transformation
- Knobs System -- knobs 260, 487, 577, 579, 582 control predication behavior
- Scheduling -- scheduler backend initializes profitability thresholds
Rematerialization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Rematerialization is the compiler technique of recomputing a value near its use instead of keeping the original definition live across a long range. In ptxas, rematerialization is implemented through three cooperating pipeline phases and tightly integrated with the register allocator's spill-vs-remat decision logic. On GPUs, where register pressure directly determines occupancy and therefore throughput, aggressive rematerialization is one of the most performance-critical optimizations in the entire pipeline.
| Phase 28 | SinkRemat -- sinks instructions closer to uses, marks remat candidates |
| Phase 54 | OriDoRematEarly -- sets remat mode flag (ctx+1552 = 4) |
| Phase 69 | OriDoRemat -- late rematerialization after predication and fusion |
| Address range (phase 28) | Execute: sub_C5FC20, core: sub_913A30 -> sub_A0F020 |
| Address range (phase 69) | Execute: sub_C5F910, core: sub_A112C0 -> sub_A11060 -> sub_A107B0 |
| Minimum opt level | Phase 28: requires level > 4 (knob 487); Phase 69: requires level > 1 |
| Operand kind 7 | "Remat" marker in the Ori IR operand classification |
| Vreg flags (offset +80) | 0x80000001 = remat candidate; 0x80000007 = remat with predication; 0x80000008 = remat committed |
| Regalloc integration | sub_93AC90 (remat check), sub_99A9D0/sub_99AA50 (range remat cost) |
| DUMPIR name | SinkRemat, OriDoRematEarly, OriDoRemat |
Why Rematerialization Matters on GPUs
On NVIDIA GPUs, register count per thread inversely determines the number of concurrent warps (occupancy). Each additional register consumed by a kernel reduces the number of warps that can be resident on an SM. Since GPU performance depends on hiding memory latency through massive parallelism, even a single extra register can measurably degrade throughput.
Rematerialization trades instruction count for register pressure reduction. Instead of keeping a computed value alive in a register from its definition to its last use, the compiler recomputes it where needed. This is profitable when:
- The original instruction is cheap (single-cycle ALU: IADD, IMAD, MOV, SEL, LOP3, SHF)
- All source operands are still available at the use point (not overwritten)
- The live range of the result is long enough to actually cause register pressure
- The instruction has no side effects (no memory writes, no barrier interactions)
On GPUs, the cost-benefit tradeoff is skewed much further toward remat than on CPUs. A single spill/refill pair (STL + LDL) costs 20--100 cycles of local memory latency, while a rematerialized IADD costs 1 cycle. More importantly, the spill itself consumes a register for the address computation, potentially cascading into more spills.
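The occupancy arithmetic described above can be sketched directly. The figures used here (a 64K-entry per-SM register file, 32 threads per warp, a 32-warp residency cap) are illustrative assumptions about a Turing-class SM, not values recovered from the binary, and the model ignores the hardware's real register allocation granularity:

```python
# Illustrative occupancy model (assumed figures, not recovered from ptxas):
# 64K-entry register file per SM, 32 threads/warp, 32-warp cap, no rounding
# to allocation granules.
REGFILE = 65536
THREADS_PER_WARP = 32
MAX_WARPS = 32

def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit on one SM given per-thread register usage."""
    return min(MAX_WARPS, REGFILE // (regs_per_thread * THREADS_PER_WARP))

assert resident_warps(64) == 32    # exactly fills the register file
assert resident_warps(65) == 31    # one extra register costs a whole warp
assert resident_warps(128) == 16
```

Even under this simplified model, crossing a register boundary drops a full warp of latency-hiding parallelism, which is why the remat cost model is so aggressive.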
Pipeline Position
Phase 23 GenerateMovPhi SSA phi nodes -> MOV instructions
Phase 24 OriPipelining Software pipelining
Phase 25 StageAndFence Memory fence insertion
Phase 26 OriRemoveRedundantBarriers
Phase 27 AnalyzeUniformsForSpeculation
Phase 28 SinkRemat *** Sink + remat candidate marking ***
Phase 29 GeneralOptimize Bundled mid-level optimizations
...
Phase 53 OriPropagateVaryingFirst
Phase 54 OriDoRematEarly *** Sets remat mode flag ***
Phase 55 LateExpansion
...
Phase 63 OriDoPredication If-conversion (creates new opportunities)
...
Phase 66 OriHoistInvariantsLate
Phase 67 DoKillMovement
Phase 68 DoTexMovement
Phase 69 OriDoRemat *** Late rematerialization ***
Phase 70 OriPropagateVaryingSecond
The three-phase design is deliberate:
- Phase 28 (early): Runs after SSA construction and pipelining but before the main optimization passes. Sinks instructions closer to their uses and identifies candidates. This is the most complex of the three phases.
- Phase 54 (mode setter): A trivial phase that writes 4 to ctx+1552 (the pipeline progress counter), signaling to downstream passes that rematerialization mode is active. Its isNoOp() returns 1 in the default vtable, meaning the dispatch loop skips its execute() by default. The phase is only active when an architecture backend overrides the vtable to return 0, at which point the single-store execute body runs.
- Phase 69 (late): Runs after predication (phase 63) and loop fusion (phase 59), which restructure control flow and create new rematerialization opportunities that did not exist at phase 28 time. Also runs after OriHoistInvariantsLate (phase 66), which may have extended live ranges by hoisting invariants.
Phase 28: SinkRemat
Entry and Guard Logic
The execute function (sub_C5FC20) applies two layers of gating:
function SinkRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_913A30(ctx) // actual implementation
sub_913A30 (131 lines) performs additional checks before invoking the core:
- Optimization level >= 5: Required for the full sink+remat pass
- Knob 487: Must be enabled (queried via vtable+152 dispatch on ctx+1664)
- Cutlass detection (sub_8F47E0): Checks if the function name contains "cutlass" via strstr(). Cutlass kernels receive special treatment
- Flag check (ctx+1368 bit 0): Must be set (compilation is in SSA window)
- Feature flags (ctx+1376): Must have bit 26 set (0x4000000) but NOT bit 53 (0x20000000000000) simultaneously
When the cutlass flag (ctx+1381 bit 6) is set, the pass enters an iterative mode:
function sub_913A30(ctx):
if opt_level <= 4:
return
if not knob_enabled(487):
return
is_cutlass = function_name_contains("cutlass")
if not (flag_byte(ctx+1368) & 1):
return
if not is_cutlass and not (flag_byte(ctx+1381) & 0x40):
return
// Feature flag gating
features = *(ctx+1376) & 0x20000004000000
if features != 0x4000000:
return
// Cutlass iterative mode
if flag_byte(ctx+1381) & 0x40:
max_iters = 5 // default
if hw_config->field_62064: // architecture-specific override
max_iters = getKnob(862) // configurable iteration limit
if max_iters <= 0: goto sinkRemat_core
for iter in 0..max_iters:
sub_8F5220(&state, ctx) // initialize iteration state
changed = sub_911030(&state, iter) // core sink+remat
if not changed or sub_8F59C0(&state): // convergence check
break
sub_8F5AD0(&state) // update state for next iter
sub_909A20(&state) // propagate changes
// clean up 4 bitvectors + 2 hash tables
return
// Non-cutlass path: single invocation
sinkRemat_core:
if is_cutlass:
// Instruction count limit check
if *(ctx+1584)->field_372 > 0x7FFF:
// Warn via vtable dispatch (diagnostic knob 356, severity 2)
sub_A0F020(ctx) // CORE: sink + remat driver
vtable_callback() // post-processing hook
sub_781F80(ctx, 1) // rebuild liveness
sub_8F4820(ctx, &worklist) // build remat worklist
// Process worklist in reverse order
for item in worklist (descending):
sub_8F4F90(ctx, &item) // apply remat decisions
Core Sink+Remat Driver: sub_A0F020
sub_A0F020 (494 lines) is the main workhorse of phase 28. It operates on the entire function body, processing basic blocks in reverse postorder through the dominator tree.
The algorithm has two main stages:
Stage 1: Per-block sinking analysis (via sub_A06A60 calling sub_A08250)
For each basic block in reverse postorder:
- Walk the instruction list backward
- For each instruction, check if it has a single use in a dominated block
- If so, sink the instruction to the use block (moves the instruction node in the linked list)
- Track whether any changes were made for convergence
Stage 2: Cross-block rematerialization (via sub_A06A60 calling sub_A07DA0)
For each basic block in reverse postorder:
- Walk the instruction list
- For each rematerialization-eligible instruction, check if the cost model approves duplication
- If profitable, clone the instruction at the use site and mark the original's result register with the remat flag
The pass alternates between sinking and rematerialization in a fixed-point loop, repeating until no more changes occur. The two worklist callbacks (sub_A08250 for sinking, sub_A07DA0 for remat) operate on a per-block basis through a generic block visitor (sub_A06A60).
The block visitor manages per-block liveness bitvectors:
- block+16: live-in bitvector
- block+40: live-out bitvector
- block+64: kill set
- block+112: live-through set (computed as intersection of live-in and live-out)
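A toy model of these per-block sets, using Python integers as bitvectors. The class is purely illustrative (the offsets above are fields of the real block object; none of these names are recovered code):

```python
# Toy model of the per-block liveness sets maintained by the block visitor.
# Python ints stand in for the real bitvector objects; bit i = vreg i live.
class Block:
    def __init__(self, live_in=0, live_out=0, kill=0):
        self.live_in = live_in        # block+16
        self.live_out = live_out      # block+40
        self.kill = kill              # block+64
        self.live_through = 0         # block+112

    def update_live_through(self):
        # Live-through = values live on entry AND on exit of the block.
        self.live_through = self.live_in & self.live_out

b = Block(live_in=0b1011, live_out=0b0110, kill=0b0001)
b.update_live_through()
assert b.live_through == 0b0010     # only vreg 1 is live across the block
```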
Phase 69: OriDoRemat (Late Rematerialization)
Entry Logic
The execute function (sub_C5F910) checks the optimization level and delegates:
function OriDoRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_A112C0(ctx, -1) // -1 = process all blocks
Remat State Initialization: sub_A0C310
sub_A112C0 first initializes a rematerialization state object (88+ bytes) via sub_A0C310:
struct RematState {
+0 compilation_ctx* ctx
+8 RefCountedObj* block_worklist // refcounted allocator object
+16 int hash_capacity // hash table for seen blocks
+20 int hash_count
+24 HashNode** hash_buckets
+32 int hash_bucket_count
+72 int iteration_count // convergence counter
+80 RefCountedObj* instruction_pool // for cloned instructions
+88 int pass_id // -1 for late remat
+92 byte requires_tex_check // hw_config->field_17064 == 1
};
The pass ID -1 distinguishes OriDoRemat from OriDoRematEarly. When the hardware configuration at hw_config+17064 is 1 and hw_config+17072 != 0, the requires_tex_check flag is set, enabling additional texture-instruction awareness.
Iterative Remat Loop: sub_A112C0 + sub_A11060
The late remat pass runs in a convergence loop:
function sub_A112C0(ctx, pass_id):
init_remat_state(&state, ctx, pass_id)
// Iterative convergence loop
while sub_A11060(&state) and getOptLevel(ctx) != 1
and sub_785E20(ctx, 0): // instruction budget check
continue
// Cleanup: drain worklist, release refcounted objects
...
Per-Iteration Worker: sub_A11060
Each iteration of sub_A11060 (155 lines) processes the entire instruction list:
function sub_A11060(state):
ctx = state->ctx
sub_7E6090(ctx, 0, 1, 0, 0) // rebuild use-def chains
// Reset all basic block depth markers to 0x80000000 (unvisited)
for bb in ctx->block_list:
bb->field_76 = 0x80000000
// Drain hash table back into instruction pool
drain_hash_table(state)
first_pass = !state->requires_tex_check
changed = false
// Walk instructions in program order
instr = ctx->first_instruction // ctx+280
while instr:
if first_pass:
first_pass = false
while instr:
opcode = instr->opcode & 0xFFFFCFFF
if opcode == 97: // STG in ROT13; used as definition anchor/label marker
changed |= sub_A10DF0(state, instr)
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
else:
// Non-first-pass: skip MOV processing
while instr:
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
if not changed_flag:
goto check_second_pass
// Decrement iteration counter, check convergence
if --state->iteration_count == 0:
return sink_flag
check_second_pass:
if remat_flag and *(ctx+1552) > 4:
// Second pass: walk block list for cross-block opportunities
for bb in ctx->block_list:
if (bb->field_20 & 1) == 0 or bb->size <= 0
or (bb->field_20 & 6) == 6:
continue // skip empty/dead/cold blocks
instr = ctx->first_block_instruction
while instr:
instr = sub_A0C540(state, instr, &changed, ...)
if changed:
// Reset depth markers and loop
continue
--state->iteration_count
return sink_flag
Per-Instruction Remat Worker: sub_A107B0
sub_A107B0 (316 lines) is the core per-instruction decision function called from both phases 28 and 69. It determines whether a specific instruction should be sunk, rematerialized, or left alone.
function sub_A107B0(state, instr, sink_flag_out, changed_out, remat_flag_out,
allow_remat):
// Quick rejection: check if instruction is sinkable
result = sub_A105F0(state, instr, sink_flag_out, changed_out)
if result:
return result // already sunk, done
num_operands = instr->operand_count // at instr+80
if num_operands <= 0:
return 0
// Walk destination operands
for i in 0..num_operands:
operand = instr->operands[i] // at instr+84 + 8*i
operand_type = (operand >> 28) & 7
if operand_type == 7: // barrier register
// Track barrier liveness
continue
if operand_type != 1: // not a GPR destination
continue
// GPR destination operand
if operand < 0: // bit 31 set = definition
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
vreg->flags |= 0x80000001 // mark as remat candidate
if has_predication_flag and last_operand_is_0x20:
vreg->flags |= 0x80000007 // enhanced remat with predication
if sub_A0C410(state, vreg, instr, allow_remat):
// Remat is profitable: clear depth flag, update block assignment
vreg->field_76 = ~instr->block_id
else:
// Not profitable: process as regular definition
// Check for multi-use definitions
...
else:
// Source operand: track liveness contribution
...
return result
Sinkability Check: sub_A105F0
sub_A105F0 (77 lines) determines if an instruction can be sunk to a single-use block. It enforces strict criteria:
- Opcode filter: Only opcode 0x5F (95; STS in the ROT13 name table, used here as a constant/immediate load variant marker) with state->byte_92 clear
- Single-use check via sub_A07940: The instruction must have exactly one use
- Dominator check: The use must be in a block dominated by the definition block
- MOV chain check: If the instruction feeds opcode 93 (OUT_FINAL in ROT13; used here as a MOV-like chain link), verifies through an FNV-1a hash table that the definition matches the expected pattern
- Cost check via sub_A0C4A0: Verifies that sinking reduces pressure (returns the pressure delta)
When sinking succeeds, the instruction is physically moved in the linked list via sub_92E1B0 (insert at new position) and sub_9253C0 (remove from old position).
Rematerialization Eligibility Criteria
The eligibility check spans multiple functions. An instruction is rematerializable if it passes ALL of these filters:
Opcode Whitelist
From sub_911030 and sub_A11060, the eligible opcode set (after masking opcode & 0xFFFFCFFF) is:
| Opcode | Identity | Category |
|---|---|---|
| 22 | IADD/IADD3 | Integer add (1 cycle) |
| 50 | SHF | Funnel shift (1 cycle) |
| 77 | IMAD | Integer multiply-add (1 cycle on modern SM) |
| 83 | ISETP | Integer set-predicate (1 cycle) |
| 93 | OUT_FINAL in ROT13; used as MOV-like marker | Register move (0--1 cycles, often eliminated). Actual SASS MOV is opcode 19. |
| 95 | STS in ROT13; used as constant-load marker | Constant materialization |
| 297 | LOP3 | 3-input logic (1 cycle) |
| 352 | SEL | Conditional select (1 cycle) |
The eligibility bitmask is encoded as 0x2080000010000001 >> (opcode - 22) for opcodes in range [22, 83], with explicit checks for opcodes 297 and 352. This is a compile-time-constant bitmask covering single-cycle ALU instructions.
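The mask can be verified mechanically. Note that a shift window based at opcode 22 tops out at bit 61 (opcode 83), so opcodes 93, 95, 297, and 352 from the table above cannot be represented in it and must be handled by separate comparisons, as the text states for 297 and 352. A quick check (helper names invented):

```python
# Verify that the quoted eligibility mask selects exactly opcodes
# 22 (IADD), 50 (SHF), 77 (IMAD), 83 (ISETP) within the [22, 83] window.
MASK = 0x2080000010000001

def mask_eligible(opcode: int) -> bool:
    """True if the opcode is covered by the shift-window bitmask."""
    return 22 <= opcode <= 83 and (MASK >> (opcode - 22)) & 1 == 1

assert [op for op in range(22, 84) if mask_eligible(op)] == [22, 50, 77, 83]
assert not mask_eligible(297)   # outside the window: explicit check in ptxas
assert not mask_eligible(352)   # outside the window: explicit check in ptxas
```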
Operand Source Liveness
sub_90C010 (70 lines) checks that all source operands are still available (live) at the proposed remat point:
function check_sources_available(state, instr, operand_idx, cost_out):
operand = &instr->operands[operand_idx]
// Immediate operand: always available
if sub_7DEB90(operand, state->ctx):
return 1
// Must be a GPR (type 1) and not a definition (bit 31 clear)
type = (operand->value >> 28) & 7
if type != 1 or (operand->value_high & 1):
return 0
// Check if the source vreg has a single reaching definition
vreg = lookup_vreg(ctx, operand->value & 0xFFFFFF)
single_def = vreg->field_56
if single_def:
return sub_90B790(state, single_def, cost_out, false)
// Multiple definitions: walk the def-use chain
min_cost = UINT_MAX
for def in vreg->def_chain: // at instr->field_64 + 8*operand_idx
cost = sub_90B790(state, def->instruction, cost_out, false)
if cost == 0:
return 0 // any unavailable source -> reject
// For rematerializable defs, add depth cost
if def is rematerializable:
cost += (def->block_depth <= instr->block_depth) ? 1 : 0
min_cost = min(min_cost, cost)
return min_cost
Cost Model: sub_90B790
sub_90B790 (large function, ~350 lines) implements the core cost/benefit analysis. It returns a non-negative integer cost where:
- 0 = not profitable, do not rematerialize
- 1+ = profitable, higher values indicate cheaper remat
The function considers:
- Opcode-specific register consumption: Different opcodes produce different register-type results. sub_7E36C0, sub_7E40E0, sub_7E3790, sub_7E3800, and sub_7E3640 extract per-operand register class (R/P/UR/UP) and width
- Live range length: Longer live ranges benefit more from remat
- Use count: Multiple uses may require multiple remat copies -- still profitable if the live range is long enough
- Block depth: Instructions in deeper loop nests get higher remat cost thresholds since the duplicated instruction executes more frequently
- Predication state: Predicated instructions have additional constraints on remat safety
- Pre-existing flags: If vreg+80 already has 0x80000001 set, the register is already a remat candidate
Cross-Block Rematerialization: sub_A0C540
sub_A0C540 (228 lines) handles rematerialization across basic block boundaries, invoked in the second pass of sub_A11060. It processes definitions that are used in different blocks:
function cross_block_remat(state, instr, changed_out):
// Walk operands in reverse order (destinations first)
for i in (instr->operand_count - 1) downto 0:
operand = instr->operands[i]
if (operand >> 28) & 7 != 1: // not a GPR
continue
if (operand_high & 1): // skip source operands
continue
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
if (vreg->flags_48 & 0x22) != 0: // skip special vregs
continue
if vreg->reg_index in [41..44]: // skip architectural predicates
continue
flags80 = vreg->field_80
if not (flags80 & 1): // not a remat candidate
continue
if vreg->use_count <= 0:
continue
if (flags80 & 2) and (flags80 & 4): // already fully processed
continue
// Compute instruction-level remat cost
cost = sub_91E860(ctx, instr, i)
if operand < 0: // definition
if cost <= 3:
vreg->field_80 |= 0x80000008 // commit remat
continue
// Remat profitable: insert remat copy
adjust_pressure(state, instr, -1) // sub_A0C4A0
duplicate_at_use(ctx, instr) // vtable dispatch +1280
adjust_pressure(state, instr, +1)
vreg->field_80 |= 0x80000008
vreg->flags_48 &= ~0x300000 // clear live-through bits
// Rebuild interference for affected ranges
adjust_pressure(state, instr, -1)
sub_92C0D0(ctx, instr, 0, ...) // clone instruction at use
adjust_pressure(state, instr, +1)
*changed_out = 1
Interaction with Register Allocator
The rematerialization flags set during phases 28 and 69 are consumed by the fat-point register allocator in several ways:
Remat Detection During Assignment: sub_93AC90
During per-instruction register assignment (sub_9680F0, 3722 lines), the allocator calls sub_93AC90 (29 lines) to check if a virtual register is a rematerialization candidate:
function check_remat_opportunity(alloc, vreg_index, reg_class):
if alloc->vreg_count == 0:
BUG()
entry = hash_lookup(alloc->remat_table, vreg_index)
cost = entry->field_144[reg_class]
if cost < entry->threshold:
return true
return (cost == entry->threshold) and (reg_class == entry->field_12)
Range Remat Cost: sub_99AA50
The live-range infrastructure at 0x994000--0x9A1000 includes remat-aware cost functions. sub_99AA50 (51 lines) inserts a rematerialization cost node into a range's cost linked list, enabling the allocator to compare spill cost against remat cost when choosing between spilling and rematerializing a value.
Spill-vs-Remat Decision
The allocator's main iteration driver (sub_9AEF60, 1415 lines) uses remat information to guide the spill-vs-remat tradeoff:
- During interference analysis, remat candidates get lower interference weights (they can be killed and recreated)
- When a spill is triggered, the allocator first checks if the value is rematerializable. If so, it inserts a remat copy instead of a spill/refill pair
- Remat linked lists are maintained at alloc+161..+175 in the per-class allocator state
Verification: sub_A55D80
The post-allocation verifier (sub_A55D80, referenced by "REMATERIALIZATION PROBLEM..." string) validates that rematerialization was applied correctly. Error case 7 in the verifier specifically checks that:
- The rematerialized instruction produces the same value as the original
- The reaching definitions before and after allocation match (modulo known-safe remat transformations)
- No rematerialized instruction references a register that was invalidated by the allocation
Operand Kind 7: Remat Markers
The Ori IR operand classification includes a dedicated "Remat" kind (value 7) that marks operands participating in rematerialization. This marker is orthogonal to the vreg+80 flags -- it exists in the instruction's operand descriptors and tells downstream passes that this particular use was created by rematerialization rather than by the original program.
The 10 operand kinds in the Ori IR:
| Kind | Name | Description |
|---|---|---|
| 0 | R register | General-purpose register |
| 1 | Offset | Memory offset |
| 2 | P/UP register | Predicate register |
| 3 | Any register | Wildcard |
| 4 | Regular | Immediate or constant |
| 5 | Predicated | Guard predicate |
| 6 | -- | (reserved) |
| 7 | Remat | Rematerialization marker |
| 8 | Spill-refill | Spill/refill pair |
| 9 | R2P/P2R | Register-to-predicate conversion |
Vreg Flags at Offset +80
The virtual register's field at offset +80 encodes rematerialization state through a bitmask:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Remat candidate -- this value CAN be recomputed |
| 1 | 0x2 | Remat source processed -- cross-block analysis done |
| 2 | 0x4 | Remat committed -- the allocator should prefer remat over spill |
| 31 | 0x80000000 | Depth marker / unvisited sentinel |
Common flag combinations:
- 0x80000001: Candidate identified by sub_A107B0, pending cost analysis
- 0x80000007: Candidate with predication awareness (stronger guarantee for predicated code paths)
- 0x80000008: Remat committed by cross-block analysis (sub_A0C540), allocator should use remat
Knobs and Configuration
| Knob ID | Role | Default | Notes |
|---|---|---|---|
| 487 | Gate for SinkRemat pass | (enabled) | Must be true for phase 28 to execute |
| 862 | Cutlass iteration limit | 5 | Max iterations in cutlass-specific iterative mode |
| 356 | Instruction count diagnostic | -- | Severity-2 warning when instruction count exceeds 32767 |
The optimization level gating:
- Level <= 1 (-O0/-O1): All three remat phases are disabled
- Level <= 4: Phase 28 runs the non-cutlass path only
- Level >= 5 (-O3+): Full sink+remat with cutlass iteration support
Function Map
Phase 28 (SinkRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5FC20 | sub_C5FC20 | 12 | Phase execute dispatcher |
| 0xC5F2E0 | sub_C5F2E0 | 7 | getName() -> returns 28 |
| 0xC5F2F0 | sub_C5F2F0 | 7 | isNoOp() -> returns 0 (always runs) |
| 0x913A30 | sub_913A30 | 131 | SinkRemat entry with knob/feature gating |
| 0xA0F020 | sub_A0F020 | 494 | Core sink+remat driver (block visitor loop) |
| 0x911030 | sub_911030 | 2408 | Per-block promotion/sinking engine |
| 0x90C010 | sub_90C010 | 70 | Source operand liveness check for remat |
| 0x90B790 | sub_90B790 | ~350 | Cost model: remat profitability analysis |
| 0x8F47E0 | sub_8F47E0 | 12 | Cutlass detection (strstr("cutlass")) |
| 0x8F4820 | sub_8F4820 | -- | Build remat worklist |
| 0x8F4F90 | sub_8F4F90 | -- | Apply remat decisions from worklist |
Phase 54 (OriDoRematEarly)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5EF30 | sub_C5EF30 | 7 | Phase execute: writes ctx+1552 = 4 |
| 0xC5EF40 | sub_C5EF40 | 7 | getName() -> returns 54 |
| 0xC5EF50 | sub_C5EF50 | 7 | isNoOp() -> returns 1 |
Phase 54 is a degenerate phase. Its execute body is a single store: *(ctx + 1552) = 4. Its isNoOp() returns 1, so the dispatch loop skips execute() by default -- the phase does nothing unless an architecture backend overrides the vtable to activate it. When active, the value 4 written to ctx+1552 advances the pipeline progress counter, which sub_A11060 checks (if *(ctx+1552) > 4 triggers the cross-block second pass).
Phase 69 (OriDoRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5F910 | sub_C5F910 | 24 | Phase execute dispatcher |
| 0xC5ED50 | sub_C5ED50 | 7 | getName() -> returns 69 |
| 0xC5ED60 | sub_C5ED60 | 7 | isNoOp() -> returns 0 (always runs) |
| 0xA112C0 | sub_A112C0 | 245 | Late remat entry + cleanup |
| 0xA0C310 | sub_A0C310 | 45 | RematState initialization |
| 0xA11060 | sub_A11060 | 155 | Per-iteration remat worker |
| 0xA107B0 | sub_A107B0 | 316 | Per-instruction remat decision |
| 0xA105F0 | sub_A105F0 | 77 | Sinkability check (opcode 0x5F) |
| 0xA10DF0 | sub_A10DF0 | 138 | MOV chain analysis (FNV-1a hash table) |
| 0xA0C540 | sub_A0C540 | 228 | Cross-block rematerialization |
| 0xA0C4A0 | sub_A0C4A0 | -- | Pressure adjustment (+1 or -1) |
| 0xA0C410 | sub_A0C410 | -- | Remat profitability check for a vreg |
Register Allocator Integration
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0x93AC90 | sub_93AC90 | 29 | Remat opportunity check during assignment |
| 0x99A9D0 | sub_99A9D0 | 38 | Range rematerialization cost cleanup |
| 0x99AA50 | sub_99AA50 | 51 | Range rematerialization cost insertion |
| 0x9AEF60 | sub_9AEF60 | 1415 | Main allocation driver with remat support |
| 0xA55D80 | sub_A55D80 | ~800 | Post-allocation remat verification |
Sinking vs. Rematerialization
The SinkRemat pass (phase 28) and the late OriDoRemat pass (phase 69) both move instructions closer to their uses, but through fundamentally different mechanisms:
Sinking moves the original instruction. The definition is physically relocated from its original position to a dominated block closer to the use. This does not increase instruction count but may change schedule. Sinking is legal only when:
- The instruction has exactly one use
- The use is in a block dominated by the current definition block
- Moving the instruction does not cross any barrier or synchronization point
- All source operands remain available at the new position
Rematerialization duplicates the instruction. The original definition remains in place (or is deleted if dead), and a fresh copy is inserted near each use. This increases instruction count but can dramatically reduce register pressure. Remat is legal for any instruction in the opcode whitelist, subject to:
- All source operands available at the use point
- The cost model approves the duplication
- The instruction has no side effects
The sub_A105F0 sinkability check runs first in sub_A107B0. Only if sinking fails does the function proceed to the rematerialization path. This prioritizes the cheaper transformation (sinking = zero instruction overhead) before falling back to the more expensive one (remat = duplicated instructions).
Architectural Notes
The three-phase structure with an interleaved flag-setter (phase 54) suggests the rematerialization infrastructure evolved over multiple ptxas generations. Phase 54's isNoOp() = 1 default means its execute() is skipped unless an architecture backend activates it by overriding the vtable. This indicates the phase was likely once a full pass that was later simplified to a flag write, with its analysis logic migrated into phase 69.
The CUTLASS-specific iterative mode in phase 28 (sub_913A30) reveals that NVIDIA's matrix-multiply library is important enough to warrant dedicated compiler heuristics. The strstr("cutlass") check is a name-based pattern match on the function name, not a property of the IR itself. This coupling between compiler optimization and library naming conventions is a pragmatic choice for a production compiler targeting known workloads.
The FNV-1a hash (constant 0x811C9DC5, prime 16777619) appears in both the rematerialization infrastructure (sub_A10DF0 for MOV chain tracking) and the register allocator (sub_926A30 for interference). This shared hash implementation is one of ptxas's standard infrastructure components (see Hash Tables & Bitvectors).
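For reference, the 32-bit FNV-1a algorithm with the constants quoted above. This is the standard public-domain algorithm, shown here to make the constants concrete, not the recovered ptxas implementation:

```python
# Standard 32-bit FNV-1a with the constants cited above.
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 16777619  # 0x01000193

def fnv1a_32(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                          # xor first: the "1a" variant
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # multiply, truncate to 32 bits
    return h

assert fnv1a_32(b"") == 0x811C9DC5    # empty input yields the offset basis
assert fnv1a_32(b"a") == 0xE40C292C   # standard FNV-1a test vector
```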
Liveness Analysis & Dead Code Elimination
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Liveness analysis is the most frequently repeated computation in the ptxas pipeline. Six dedicated phases perform liveness analysis combined with dead code elimination (DCE), and at least four additional subsystems recompute liveness on demand. The core algorithm is a standard backward dataflow analysis over the CFG, but the implementation is notable for its SSE2-accelerated bitvector library, per-register-file liveness tracking, and the orWithAndNotIfChanged fused transfer function that implements the entire dataflow step in a single SIMD pass.
| Dedicated liveness phases | 6 (phases 10, 16, 19, 33, 61, 84) |
| Core bitvector library | 0xBDBA60--0xBDE150 (15+ functions, SSE2) |
| BitVector object size | 20 bytes header + dynamic word array |
| Word size | 32-bit (uint32_t) -- indexed by >> 5 and & 0x1F |
| Transfer function | `out = gen \| (in & ~kill)` -- fused `orWithAndNotIfChanged` |
| Fixed-point detection | orIfChanged / andIfChanged return bool |
| Liveness storage | Code Object +832 (main), +856 (uniform) |
| NamedPhases override | "OriPerformLiveDead" controls all 4 instances |
| Related phase | 138 OriSplitHighPressureLiveRanges (late cleanup) |
Pipeline Placement
The six liveness-related phases are distributed across the entire optimization pipeline. Each runs after a group of transformations that may have introduced dead code or invalidated previous liveness information:
Phase 10 EarlyOriSimpleLiveDead ── Initial Setup
Phase 16 OriPerformLiveDeadFirst ── Early Optimization
Phase 19 OriSplitLiveRanges ── Early Optimization
Phase 33 OriPerformLiveDeadSecond ── Mid-Level Optimization
Phase 61 OriPerformLiveDeadThird ── Late Optimization
Phase 84 OriPerformLiveDeadFourth ── Legalization
Phase 138 OriSplitHighPressureLiveRanges ── Late Cleanup (related)
The four OriPerformLiveDead instances are identical passes invoked at different pipeline positions. They share the same vtable execute function and differ only in when they run. The NamedPhases system addresses all four through the single name "OriPerformLiveDead".
Why Four Instances?
Each instance cleans up dead code introduced by the preceding optimization group:
| Phase | Runs After | Cleans Up |
|---|---|---|
| 16 (First) | Branch optimization, switch optimization | Dead branches, unreachable code from CFG simplification |
| 33 (Second) | GeneralOptimize, loop unrolling, pipelining, strength reduction | Dead loop induction variables, redundant computations from unrolling |
| 61 (Third) | GeneralOptimizeLate, loop fusion, VTG expansion, late expansion | Dead code from loop fusion, expanded macro instructions |
| 84 (Fourth) | Backward copy propagation, late arch optimization | Dead copies, redundant moves from backward propagation |
Without these intermediate liveness passes, dead code would accumulate through the pipeline, inflating register pressure and increasing compile time for downstream passes.
Dataflow Algorithm
Classical Backward Liveness
ptxas implements textbook backward dataflow analysis. For each basic block B, the analysis computes:
LiveIn(B) = gen(B) | (LiveOut(B) - kill(B))
LiveOut(B) = Union over all successors S of LiveIn(S)
Where:
- gen(B): registers used in B before any definition in B (upward-exposed uses)
- kill(B): registers defined in B (regardless of whether they are also used)
- LiveIn(B): registers live at the entry of B
- LiveOut(B): registers live at the exit of B
Iterative Fixed-Point Solver
The analysis iterates in reverse post-order (RPO) until no LiveIn/LiveOut set changes:
function compute_liveness(func):
compute_RPO(func) // sub_BDE150
// Initialize gen/kill sets per block
for each block B in func:
gen(B) = {}
kill(B) = {}
for each instruction I in B (reverse order):
for each source operand r of I:
if r not in kill(B):
gen(B) |= {r}
for each destination operand d of I:
kill(B) |= {d}
// Initialize LiveOut to empty for all blocks
for each block B:
LiveOut(B) = {}
// Iterate until fixed point
changed = true
while changed:
changed = false
for each block B in reverse RPO:
// LiveOut = union of successors' LiveIn
for each successor S of B:
changed |= LiveOut(B).orIfChanged(LiveIn(S))
// LiveIn = gen | (LiveOut - kill)
// implemented as: orWithAndNotIfChanged
changed |= LiveIn(B).orWithAndNotIfChanged(
gen(B), LiveOut(B), kill(B))
The key optimization is the fused orWithAndNotIfChanged operation (sub_BDD560), which computes dst |= gen | (in & ~kill) and returns whether any bit changed -- all in a single SSE2 pass over the bitvector words. This avoids materializing intermediate bitvectors for (LiveOut - kill).
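A word-level Python model of the fused operation (a sketch over lists of uint32 words; the real routine processes 128 bits per iteration with SSE2 and scans for the first changed word before writing):

```python
def or_with_and_not_if_changed(dst, gen, inp, kill):
    """Model of dst |= gen | (inp & ~kill) over 32-bit words.
    Returns True if any bit of dst changed (fixed-point detection)."""
    changed = False
    for i in range(len(dst)):
        new = dst[i] | gen[i] | (inp[i] & ~kill[i] & 0xFFFFFFFF)
        if new != dst[i]:
            dst[i] = new
            changed = True
    return changed
```

The returned bool is exactly what the iterative solver accumulates into its `changed` flag, so convergence detection costs nothing extra.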
Convergence
The analysis converges because:
- All sets are initialized to empty (bottom of the lattice).
- Each iteration can only add bits (the transfer function is monotone).
- The lattice has finite height (bounded by the total number of virtual registers).
- RPO traversal order minimizes the number of iterations -- typically 2--3 passes for acyclic code, proportional to loop nesting depth for loops.
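The solver above can be modeled directly in Python, with per-block gen/kill/live sets encoded as integer bitmasks (a sketch of the algorithm, not the recovered code):

```python
def compute_liveness(blocks, succ, rpo):
    """Backward liveness to a fixed point.
    blocks: {B: (gen, kill)} with register sets as int bitmasks;
    succ:   {B: [successor blocks]};
    rpo:    blocks in reverse post-order."""
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(rpo):              # visit blocks backward
            out = 0
            for s in succ[b]:                # LiveOut = union of succ LiveIn
                out |= live_in[s]
            gen, kill = blocks[b]
            new_in = gen | (out & ~kill)     # LiveIn = gen | (LiveOut - kill)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out
```

For a two-block chain where bit 0 (say, R0) is defined in A and used in B, the solver reports R0 live out of A and live into B, and dead on entry to A.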
BitVector Implementation
The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas dataflow analysis. All operations are SSE2-accelerated with manual alignment handling.
Layout
struct BitVector { // 20 bytes total
uint32_t* data; // +0: pointer to word array (heap-allocated)
int32_t word_count; // +8: number of 32-bit words in use
int32_t capacity; // +12: allocated words (>= word_count)
int32_t bit_count; // +16: number of valid bits
};
Word count is computed from bit count: word_count = (bit_count + 31) >> 5. Memory is allocated via the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). Reallocation occurs only when the new word count exceeds the current capacity.
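A minimal Python model of the layout's bit indexing (sketch only -- the real object carries a pool-allocator reference and a separate capacity field, omitted here):

```python
class BitVector:
    """Model of the 32-bit-word bitvector: index >> 5 selects the word,
    index & 0x1F selects the bit within it."""
    def __init__(self, num_bits):
        self.bit_count = num_bits
        self.word_count = (num_bits + 31) >> 5   # round up to whole words
        self.data = [0] * self.word_count

    def set_bit(self, i):        # sub_BDBFB0 analogue
        self.data[i >> 5] |= 1 << (i & 0x1F)

    def clear_bit(self, i):      # sub_BDC0E0 analogue
        self.data[i >> 5] &= ~(1 << (i & 0x1F)) & 0xFFFFFFFF

    def test_bit(self, i):       # sub_BDC200 analogue
        return (self.data[i >> 5] >> (i & 0x1F)) & 1
```

For example, 100 bits round up to 4 words, and bit 37 lands in word 1, bit position 5.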
Core Operations
| Address | Operation | Signature | Notes |
|---|---|---|---|
| sub_BDBA60 | allocate | (bv*, alloc*, num_bits) | Grow-only; no shrink |
| sub_BDBFB0 | setBit | (bv*, bit_index) | data[i>>5] \|= 1 << (i&31) |
| sub_BDC0E0 | clearBit | (bv*, bit_index) | data[i>>5] &= ~(1 << (i&31)) |
| sub_BDC200 | testBit | (bv*, bit_index) -> bool | (data[i>>5] >> (i&31)) & 1 |
| sub_BDCDE0 | operator\|= | (dst*, src*) | SSE2 _mm_or_si128 loop |
| sub_BDCF40 | orIfChanged | (dst*, src*) -> bool | Scans (~dst & src) != 0 first |
| sub_BDC5F0 | operator&= | (dst*, src*) | SSE2 _mm_and_si128; zeroes tail |
| sub_BDC790 | andIfChanged | (dst*, src*) -> bool | Scans (~src & dst) != 0 first |
| sub_BDDAA0 | operator^= | (dst*, src*) | SSE2 _mm_xor_si128 |
| sub_BDC3F0 | assignAND | (dst*, a*, b*) | dst = a & b |
| sub_BDD300 | orWithAndNot | (dst*, gen*, in*, kill*) | dst \|= gen \| (in & ~kill) |
| sub_BDD560 | orWithAndNotIfChanged | (dst*, gen*, in*, kill*) -> bool | Core transfer function |
| sub_BDBD60 | extractBits | (out[], start, end) | Cross-word boundary handling |
| sub_BDD8C0 | popcount | (bv*) -> int | Count set bits |
| sub_BDDC00 | clear | (bv*) | memset(data, 0, ...) |
| sub_BDCA60 | operator= | (dst*, src*) | Copy with possible realloc |
| sub_BDCC20 | isSubsetOf | (a*, b*) -> bool | Tests (a & ~b) == 0 |
SSE2 Loop Structure
All bulk operations follow the same pattern:
// Alignment prologue: process scalar words until 16-byte aligned
int align_count = (-(uintptr_t)(dst_ptr) >> 2) & 3;
for (int i = 0; i < min(align_count, word_count); i++)
dst_ptr[i] |= src_ptr[i];
// SSE2 main loop: process 4 words (128 bits) per iteration
int sse_count = (word_count - align_count) >> 2;
for (int i = 0; i < sse_count; i++) {
__m128i d = _mm_load_si128((const __m128i*)&dst_ptr[aligned_offset + 4*i]);
__m128i s = _mm_loadu_si128((const __m128i*)&src_ptr[aligned_offset + 4*i]);
_mm_store_si128((__m128i*)&dst_ptr[aligned_offset + 4*i], _mm_or_si128(d, s));
}
// Scalar epilogue: remaining 0-3 words
for (remaining words)
dst_ptr[j] |= src_ptr[j];
The orWithAndNot transfer function fuses three operations into one SSE2 expression:
__m128i result = _mm_or_si128(
_mm_or_si128(gen_vec, dst_vec),
_mm_andnot_si128(kill_vec, in_vec) // in & ~kill
);
The IfChanged variants first scan for any bit that would change (~dst & new_bits), then apply the operation only from the first differing word forward. This early-exit optimization avoids unnecessary writes when the analysis has already converged for most blocks.
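A Python sketch of that early-exit pattern, again modeling words as uint32 values:

```python
def or_if_changed(dst, src):
    """Model of BitVector::orIfChanged (sub_BDCF40): scan for the first
    word where (~dst & src) != 0, then apply the OR only from there on.
    Returns False -- with zero writes -- if dst already covers src."""
    first = None
    for i in range(len(dst)):
        if ~dst[i] & src[i] & 0xFFFFFFFF:    # src has a bit dst lacks
            first = i
            break
    if first is None:
        return False                         # already converged, no writes
    for i in range(first, len(dst)):
        dst[i] |= src[i]
    return True
```

When most blocks have converged, the scan rejects the update without dirtying any cache lines, which is where the savings come from.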
Per-Register-File Liveness
GPU register allocation manages multiple independent register files. ptxas tracks liveness separately for each:
| Register File | Bit Range | Storage |
|---|---|---|
| R (GPR, 32-bit) | Bits 0..254 | Code Object +832 (main bitvector) |
| UR (uniform GPR) | Bits 0..63 | Code Object +856 (uniform bitvector) |
| P (predicate, 1-bit) | Separate tracking | Operand type (v >> 28) & 7 == 5 |
| UP (uniform predicate) | Separate tracking | Flag at Code Object +1378 bit 4 |
| B (barrier) | Indices 20, 21 | Special-cased in dependency graph |
The main liveness bitvector at Code Object +832 covers R registers. The uniform register bitvector at +856 is conditionally allocated: it exists only when the flag at Code Object +1378 bit 4 is set (indicating the function uses uniform registers). The scheduling pass (sub_A06A60) allocates both bitvectors via sub_BDBAD0 and processes them in parallel.
Predicate registers are handled at the operand level during scheduling: the operand type check ((operand >> 28) & 7) == 5 identifies predicate operands, which are tracked in a separate per-block set rather than the main bitvector.
Barrier registers (IDs 20, 21 for sm >= 4.0) receive special treatment in the dependency graph builder (sub_A0D800): they generate ordering dependencies rather than data dependencies, since barriers enforce execution ordering constraints independent of register values.
Phase 10: EarlyOriSimpleLiveDead
The earliest liveness pass, running immediately after initial IR construction (after ReportInitialRepresentation at phase 9). This is a simplified liveness + DCE pass that removes obviously dead instructions from the freshly-lowered IR.
Pipeline context: At this point, the IR has just been lowered from PTX. Many PTX instructions expand to multiple Ori instructions, some of which produce values that are immediately dead (e.g., condition codes that are never tested, intermediate values from multi-instruction expansions). EarlyOriSimpleLiveDead removes this low-hanging dead code before the main optimization pipeline begins, reducing the working set for all subsequent passes.
Implementation evidence: The sweep at p1.10 (W010) confirms this pass uses the bitvector infrastructure at sub_BDBA60--sub_BDE150 for liveness computation. The "simple" in the name may indicate a local-only (per-BB) analysis that avoids the cost of full global iterative dataflow -- sufficient for removing obviously dead definitions that have no uses within the same block.
Phases 16, 33, 61, 84: OriPerformLiveDead
The four instances of the full liveness + DCE pass. These perform global iterative dataflow analysis followed by dead instruction removal.
Algorithm
function OriPerformLiveDead(func):
// 1. Rebuild basic block metadata
rebuild_basic_blocks(func, mode) // sub_781F80
// 2. Compute global liveness (iterative fixed-point)
compute_global_liveness(func) // iterative solver
// 3. Dead code elimination
for each block B in func:
for each instruction I in B:
if all destinations of I are dead (not in LiveOut):
if I has no side effects:
remove(I)
// 4. Update IR metadata
// (instruction counts, block sizes, etc.)
Side-Effect Preservation
Not all instructions with dead destinations can be removed. The DCE must preserve:
- Memory stores (`STG`, `STS`, `STL`, `ATOM`, etc.) -- observable side effects
- Barrier instructions (`BAR`, `MEMBAR`) -- synchronization semantics
- Control flow (`BRA`, `EXIT`, `RET`, `CALL`) -- program structure
- Texture operations with side effects
- Instructions with volatile flags
The opcode mask & 0xCFFF (seen in sub_A06A60) strips modifier bits to obtain the base opcode for side-effect classification. Opcodes 93 (OUT_FINAL in the ROT13 name table; used as a call-like marker -- actual CALL is opcode 71), 94 (LDS)/95 (STS) (used as block boundary markers), 97 (STG; used as a branch-like marker -- actual BRA is opcode 67), and 52 (AL2P_INDEXED; used as NOP/boundary) receive special handling.
DCE Integration
The OriPerformLiveDead pass combines liveness computation with DCE in a single pass rather than running them as separate analyses. After computing LiveOut sets for each block, the pass walks each block backward: for each instruction, it checks whether every destination register is absent from the current live set. If so and the instruction has no side effects, it is unlinked from the instruction list. Source operands of removed instructions are themselves removed from the live set, potentially enabling cascading removal of further dead instructions within the same backward walk.
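A Python sketch of that backward walk (registers as plain ints, instructions as `(dests, sources)` pairs; `has_side_effects` is a caller-supplied predicate standing in for the opcode classification above):

```python
def dce_block(instrs, live_out, has_side_effects):
    """Backward DCE over one block. instrs is in program order; live_out
    is the set of registers live at block exit. Returns surviving instrs."""
    live = set(live_out)
    kept = []
    for dsts, srcs in reversed(instrs):
        dead = not (set(dsts) & live)
        if dead and not has_side_effects((dsts, srcs)):
            continue            # unlink: its sources never enter the live set,
                                # so earlier producers may become dead too
        live -= set(dsts)       # definitions kill liveness
        live |= set(srcs)       # uses generate liveness
        kept.append((dsts, srcs))
    kept.reverse()
    return kept
```

Because a removed instruction's sources are never added to the live set, an entire dead chain (r1 = f(r0); r2 = f(r1); r3 = f(r2)) cascades away in a single backward pass.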
Phase 19: OriSplitLiveRanges
This phase splits live ranges at loop boundaries and across phi/copy chains to reduce register pressure. It runs after OriPerformLiveDeadFirst (phase 16) and OriLoopSimplification (phase 18), when the loop structure is canonical.
String reference: "OriSplitLiveRanges" at 0x22BC5C0.
Core implementation: sub_BEF110 (108KB, 3,414 decompiled lines). Called via sub_A1D3A0 (vtable execute) -> sub_BF33D0 (knob-gated entry, reads register budget from ctx+1624 and knob 456).
Motivation
On GPUs, register pressure directly determines occupancy (the number of concurrent warps). A value defined before a loop and used only after the loop occupies a register for the entire loop body, even though it is not accessed within the loop. Splitting the live range at the loop boundary -- by inserting a copy before the loop and a copy after -- can free the register for use inside the loop, reducing peak pressure and enabling higher occupancy.
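A toy pressure calculation illustrates the payoff. The split itself only enables the allocator's choice; the model below assumes the loop-spanning middle segment is then kept out of registers between the two copies, which is the allocator's decision rather than recovered behavior:

```python
def max_pressure(ranges):
    """Fatpoint-style pressure: max count of simultaneously live ranges.
    Each range is (start, end) over instruction indices, inclusive;
    pressure can only change at range endpoints, so sampling those suffices."""
    points = sorted({p for r in ranges for p in r})
    return max(sum(s <= p <= e for s, e in ranges) for p in points)

# Toy function: instructions 0..9, loop body spans 2..7.
# v is defined at 0, used only at 9; three loop temps are live over 2..7.
unsplit = [(0, 9), (2, 7), (2, 7), (2, 7)]
# After splitting v at the loop boundary (copy out at 1, copy back at 8),
# only the short end segments of v compete for registers:
split = [(0, 1), (8, 9), (2, 7), (2, 7), (2, 7)]
```

The unsplit version needs 4 registers at its fatpoint; the split version needs 3, which on a real GPU can translate directly into an extra resident warp.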
Algorithm (Decompiled from sub_BEF110)
The function operates in five distinct phases:
Phase 1: Pre-analysis -- Rebuilds basic blocks (sub_781F80), allocates three bitvector fields per virtual register (kill at VR+96, gen at VR+24, live-through at VR+176), then runs the standard iterative liveness solver (sub_775010 + sub_773140). Walks the register table checking interference chains: for each VR with a chain at VR+136, tests whether the chain target's kill set is a subset of the VR's kill set (sub_BDC390 = isSubsetOf). Non-subset cases receive the +264 bit 1 flag, marking them as interference candidates.
Phase 2: Work structure allocation -- Allocates a scratch array s[] (one entry per split candidate), a hash table for interference tracking (power-of-2 buckets sized via _BitScanReverse64), and an array of 64-byte per-block split records:
struct PerBlockSplitRecord { // 64 bytes, indexed by block ID
void* list_head; // +0: interference linked list
void* first_in_block; // +8: first entry pointer
void* sentinel; // +16: self-pointer
void* reserved; // +24
void* last_in_block; // +32: last entry pointer
void* tail; // +40: tail pointer
int32_t count; // +48: entry count
int32_t pad; // +52
void* allocator_ref; // +56: refcounted allocator
};
Phase 3: Main splitting loop -- Iterates the ordered register array at ctx+792 in reverse order (highest VR ID first). For each VR, walks the def-use chain via ctx+296 (register table), classifying instructions by opcode:
| Opcode (masked) | Meaning | Split Action |
|---|---|---|
| 167 (0xA7) | Phi-like | Walk up phi chain, split at each level via sub_931920 |
| 158 (0x9E) | Copy-like | Similar chain walk with copy-specific handling |
| 188 (0xBC) | Multi-operand special | Check operand types, dispatch to sub_BE3720 for multi-source split |
| 27 (0x1B) | Register move | Standard split point; emit via sub_9314F0 with 4 operands |
| 269 (0x10D) | Copy | Lightweight split; emit via sub_9314F0 with 2 operands |
For each split: allocates a new VR via sub_931920, copies the three bitvector fields (sub_BDBA60 allocates, sub_BDC1B0 copies dst |= src), validates the register class via sub_9314F0 (called 11 times total across different split patterns), and updates the interference hash via sub_BEEC80.
The inline interference check in the hot path:
// Fast single-bit test: is vr_class_id live in the kill set?
if (((1 << vr_class_id) & kill_set[vr_class_id >> 5]) != 0)
// VRs interfere -- cannot share a physical register
Phase 4: Interference hash processing -- Builds a global interference hash table using FNV-1a (0x811C9DC5 offset basis, 16777619 prime). Walks per-block split records, for each entry scans the kill bitvector (sub_BDDC00 clears from position, scanning forward) to find concurrently live VRs. Tests interference via sub_BEE7F0 and emits split instructions via sub_934630 (opcode 46). The hash table resizes when load factor exceeds 50%.
Phase 5: Cleanup -- Marks phi/copy chains with the +245 rewrite flag (triggering opcode mutation from 188 to 93 or 95), frees hash tables and per-block records, clears ctx+1370 bit 2 to signal liveness invalidation.
function OriSplitLiveRanges(func):
// Phase 1: Pre-analysis
rebuild_basic_blocks(func, 0) // sub_781F80
alloc_kill_bitvectors(func) // sub_BEAFD0: VR+96
alloc_gen_bitvectors(func) // sub_BEB110: VR+24
compute_liveness(func) // sub_775010
propagate_per_block(func, 0) // sub_773140
mark_interference_candidates(func) // inline: walk chains, test subsets
// Phase 2: Work structure allocation
allocate_work_structures(split_candidate_count)
// Phase 3: Main splitting loop
for each VR in ordered_array[ctx+792] (reverse):
walk def-use chain via ctx+296:
classify instruction by opcode
if splittable:
new_vr = allocate_vr(func, vr, def_instr) // sub_931920
copy_bitvectors(new_vr, vr) // sub_BDBA60 + sub_BDC1B0
validate_reg_class(new_vr, opcode, operands) // sub_9314F0
update_interference_hash(new_vr) // sub_BEEC80
// Phase 4: Interference hash processing
for each entry in interference_hash:
for each concurrent_vr in kill_bitvector:
if interferes(entry, concurrent_vr): // sub_BEE7F0
emit_split_instruction(entry, concurrent_vr) // sub_934630
// Phase 5: Cleanup
mark_rewrite_flags() // byte +245
free_work_structures()
ctx[+1370] &= ~4 // invalidate liveness
Three Bitvector Fields per Virtual Register
The splitting pass maintains three independent bitvectors per VR, all using the standard 32-bit-word BitVector from 0xBDBA60--0xBDE150:
| VR Offset | Name | Content | Allocated by |
|---|---|---|---|
+96 | Kill set | Registers defined by this VR's instructions | sub_BEAFD0 |
+24 | Gen set | Registers used before definition in this VR's range | sub_BEB110 |
+176 | Live-through set | Registers live through the range without kill or gen | Derived |
These per-VR bitvectors differ from the per-block liveness bitvectors used by OriPerformLiveDead. The per-block sets track global liveness; the per-VR sets track interference within a single virtual register's live range, enabling the split decision: if two VRs have overlapping kill sets (tested via the fast inline (1 << id) & word[id >> 5] check), they interfere and splitting one of them at the boundary reduces the overlap.
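A Python rendering of that fast check (note the x86 shift instruction wraps its count mod 32, which is what makes the recovered `1 << vr_class_id` correct without an explicit mask; the model makes the masking explicit):

```python
def vr_in_kill_set(vr_class_id, kill_words):
    """The inline hot-path test: is vr_class_id set in a VR's kill
    bitvector? (word = id >> 5, bit = id & 31)"""
    return bool((kill_words[vr_class_id >> 5] >> (vr_class_id & 31)) & 1)

def kill_sets_overlap(a_words, b_words):
    """Two VRs interfere when their kill sets share any bit."""
    return any(x & y for x, y in zip(a_words, b_words))
```

For example, class ID 40 lives in word 1, bit 8, so a kill set whose word 1 holds 0x100 reports interference for ID 40 but not ID 41.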
Helper Functions
| Address | Identity | Role |
|---|---|---|
sub_BEAFD0 | AllocKillBitvectors | Allocate VR+96 kill sets; propagate via interference chain VR+136 |
sub_BEB110 | AllocGenBitvectors | Allocate VR+24 gen sets; scan phi/copy defs (opcodes 158, 167) |
sub_BE3390 | ComputeSplitCount(interference) | Count split points for interference-chain case |
sub_BE3590 | ComputeSplitCount(clean) | Count split points for non-interfering case |
sub_BE3720 | ComputeSplitCount(multiSrc) | Count split points for multi-source operand case |
sub_BEE7F0 | TestInterference | Test bitvector interference between two VRs |
sub_BEEC80 | UpdateHashWithSplit | Update per-split hash table (192-byte entries, 8 buckets) |
Relationship to Phase 138
Phase 138 (OriSplitHighPressureLiveRanges) performs a similar transformation but much later in the pipeline (late cleanup stage), targeting live ranges that still cause excessive pressure after all optimization and legalization passes have run. Phase 19 is the early, conservative version; phase 138 is the late, aggressive fallback.
Liveness Consumers
The liveness information computed by these phases is consumed throughout the pipeline:
Register Allocator
The fat-point register allocator (sub_9721C0) is the primary consumer. Its entry point explicitly rebuilds liveness before allocation:
sub_781F80(ctx, 1); // rebuild basic blocks
sub_A10160(ctx, 1); // recompute liveness
The allocator uses liveness information for:
- Interference computation: Two virtual registers interfere if their live ranges overlap. The interference graph builder (`sub_926A30`, 155KB decompiled) uses bitvector intersection to detect overlaps.
- Spill cost estimation: `sub_94E620` computes spill costs weighted by liveness range length and instruction properties.
- Spill placement: `sub_9449B0` (liveness range calculator, 1800 bytes) iterates instructions in reverse block order using bitvector operations to determine optimal spill/reload insertion points.
Instruction Scheduler
The scheduling subsystem maintains its own liveness tracking at Code Object +832:
- Pre-scheduling: `sub_8DBAF0` (16KB, LivenessAnalysis) computes register liveness for the scheduling priority function.
- Per-BB liveness: `sub_8DB5F0` (8.4KB, LivenessCompute) computes per-basic-block liveness sets.
- Initialization: `sub_8DB070` (8.2KB, LivenessInit) sets up the liveness data structures.
- Iterative solver: `sub_8DE7A0` (12KB) runs the iterative fixed-point computation for scheduling-specific dataflow.
The scheduler uses liveness to:
- Estimate register pressure at each scheduling point
- Identify last-use operands for dead-register marking (`sub_A08250` checks `(1 << reg_num) & *(live_set + 4*(reg_num >> 5))`)
- Compute instruction priority based on register pressure impact
DAG Construction
The dependency graph builder (sub_A0F970, sub_A0D800) uses liveness to:
- Determine which registers are live at block boundaries
- Identify anti-dependencies (WAR) that constrain scheduling
- Track callee-clobbered registers at call sites (opcode 93; `OUT_FINAL` in ROT13, used as call-like marker -- actual CALL is opcode 71)
Multi-Set Register Manager
sub_A7BC80 (36KB) manages multiple parallel liveness bitvectors for different register classes (R, P, B, UR, UP) during post-allocation scheduling. It allocates and deallocates bitvectors in coordinated groups, updating each set based on instruction defs/uses.
Uninitialized Register Detector
sub_A0B5E0 uses liveness information to detect potentially uninitialized registers. After scheduling, it walks each block's entry live set: for each live register, it checks the 0x20 flag at register descriptor offset 48. If the flag is clear, the register is reported as potentially uninitialized via warning strings "Found %d potentially uninitialized register(s) in function %s" (warning 0x1E14).
Data Flow Infrastructure for Scheduling
The scheduling subsystem has its own dataflow infrastructure (separate from the optimizer's OriPerformLiveDead):
| Address | Size | Identity |
|---|---|---|
sub_8DB070 | 8.2KB | LivenessInit -- allocate and initialize per-BB liveness structures |
sub_8DB5F0 | 8.4KB | LivenessCompute -- compute liveness per basic block |
sub_8DBAF0 | 16KB | LivenessAnalysis -- full liveness analysis (red-black tree interval structure) |
sub_8DC3F0 | 3.0KB | ComputeDataFlowState -- scheduling-specific dataflow |
sub_8DC620 | 3.3KB | UpdateDataFlowOnSchedule -- update flow after scheduling decisions |
sub_8DC880 | 10KB | PropagateDataFlow -- propagate dataflow information |
sub_8DCF20 | 23KB | BuildDFGForScheduling -- build scheduling data flow graph |
sub_8DE7A0 | 12KB | IterativeDataFlow -- iterative fixed-point solver |
sub_8DEF90 | 2.0KB | FinalizeDataFlow -- finalize dataflow results |
The sub_8DBAF0 function implements a red-black tree (evidenced by tree rotations and color flags at node offset +40 in the decompiled code), used to store liveness intervals as an ordered set. This enables efficient range queries: "which registers are live at program point P?" is answered by a tree search in O(log n) rather than a linear scan.
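An illustrative Python stand-in for the interval query (a sorted list plus bisect rather than a red-black tree; the names and structure here are assumptions -- only the query semantics match what the recovered structure supports):

```python
import bisect

class LiveIntervals:
    """Toy model of an ordered liveness-interval set: intervals kept
    sorted by start point; live_at(p) bisects to bound the scan.
    The recovered sub_8DBAF0 structure is a red-black tree keyed the
    same way."""
    def __init__(self):
        self.starts, self.intervals = [], []

    def add(self, start, end, reg):
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.intervals.insert(i, (start, end, reg))

    def live_at(self, p):
        """Registers whose interval covers program point p."""
        hi = bisect.bisect_right(self.starts, p)   # only intervals starting <= p
        return {r for s, e, r in self.intervals[:hi] if e >= p}
```

A balanced tree makes the lookup O(log n) where this toy still scans candidates, but both answer the same query: "which registers are live at program point P?".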
sub_781F80: Basic Block Rebuild
This function appears ubiquitously as a prerequisite to liveness computation. It is called with a mode parameter:
sub_781F80(func, 0): Reset/rebuild basic block metadata for reverse scheduling modesub_781F80(func, 1): Full rebuild for forward analysis (used before register allocation)
Over 50 call sites reference this function across the optimizer, register allocator, and scheduler. It refreshes the basic block linked lists, instruction counts, and block boundary markers that the liveness analysis depends on.
Key Function Table
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90) |
sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90) |
sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90) |
sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90) |
sub_BDCDE0 | ~400B | BitVector::operator|= (OR) | HIGH (0.95) |
sub_BDCF40 | ~564B | BitVector::orIfChanged | HIGH (0.95) |
sub_BDC5F0 | ~484B | BitVector::operator&= (AND) | HIGH (0.95) |
sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95) |
sub_BDDAA0 | ~400B | BitVector::operator^= (XOR) | HIGH (0.95) |
sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90) |
sub_BDD300 | ~488B | BitVector::orWithAndNot | HIGH (0.92) |
sub_BDD560 | ~648B | BitVector::orWithAndNotIfChanged | HIGH (0.92) |
sub_BDBD60 | ~368B | BitVector::extractBits | HIGH (0.88) |
sub_BDD8C0 | ~320B | BitVector::popcount | MEDIUM (0.80) |
sub_BDDC00 | ~140B | BitVector::clear | HIGH (0.90) |
sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85) |
sub_BDCC20 | ~320B | BitVector::isSubsetOf | MEDIUM (0.85) |
sub_BDE150 | 9KB | CFG::computeRPO | HIGH (0.90) |
sub_781F80 | varies | Basic block rebuild | HIGH (0.85) |
sub_A10160 | ~2KB | Liveness computation entry | MEDIUM (0.75) |
sub_A0BA40 | 15KB | Block-level liveness iteration | HIGH (0.85) |
sub_A06A60 | 15KB | Per-block register set tracking | HIGH (0.95) |
sub_A0D800 | 39KB | Dependency graph construction | HIGH (0.95) |
sub_A0F970 | 10KB | DAG construction entry | HIGH (0.95) |
sub_92C240 | 8KB | Liveness bitvector operations (regalloc) | HIGH (87 callers) |
sub_9449B0 | 1.8KB | Liveness range calculator (spill codegen) | HIGH |
sub_8DBAF0 | 16KB | LivenessAnalysis (scheduling) | HIGH (0.85) |
sub_8DB5F0 | 8.4KB | LivenessCompute (per-BB scheduling) | HIGH (0.85) |
sub_8DB070 | 8.2KB | LivenessInit (scheduling) | HIGH (0.85) |
sub_8DE7A0 | 12KB | IterativeDataFlow (scheduling solver) | HIGH (0.80) |
sub_A0B5E0 | varies | Uninitialized register detector | HIGH (0.97) |
sub_A7BC80 | 36KB | RegisterSetManager (multi-file liveness) | MEDIUM (0.65) |
sub_BEF110 | 108KB | OriSplitLiveRanges core (Phase 19) | HIGH (0.90) |
sub_BF33D0 | ~1KB | OriSplitLiveRanges knob-gated entry (reads knob 456) | HIGH (0.90) |
sub_A1D3A0 | ~0.2KB | OriSplitLiveRanges vtable execute | HIGH (0.90) |
sub_BEAFD0 | ~2KB | AllocKillBitvectors (VR+96 per-VR kill sets) | HIGH (0.85) |
sub_BEB110 | ~3KB | AllocGenBitvectors (VR+24 per-VR gen sets) | HIGH (0.85) |
sub_BE3390 | varies | ComputeSplitCount(interference) | MEDIUM (0.80) |
sub_BE3590 | varies | ComputeSplitCount(clean) | MEDIUM (0.80) |
sub_BE3720 | varies | ComputeSplitCount(multiSrc) | MEDIUM (0.80) |
sub_BEE7F0 | varies | TestInterference (BV interference test) | MEDIUM (0.80) |
sub_BEEC80 | ~1KB | UpdateHashWithSplit (per-split hash update) | MEDIUM (0.80) |
sub_BEB9C0 | varies | Hash table init/destroy (secondary) | MEDIUM (0.75) |
sub_BEBA40 | varies | Hash table init/destroy (primary) | MEDIUM (0.75) |
Key Constants
| Value | Meaning |
|---|---|
+832 | Code Object offset: main register liveness bitvector (R registers) |
+856 | Code Object offset: uniform register liveness bitvector (UR registers) |
+840 | Code Object offset: max live register count |
+848 | Code Object offset: liveness info pointer |
+720 | Code Object offset: RPO order array |
+984 | Code Object offset: number of basic blocks |
+1378 bit 4 | Flag: function uses uniform registers (enables +856 bitvector) |
0xCFFF | Opcode mask: strips modifier bits for side-effect classification |
+792 | Context offset: reverse-ordered register array (for live range splitting) |
+1370 bit 2 | Flag: liveness invalid (cleared by sub_BEF110 on exit) |
+1624 | Context offset: register budget (double, read by sub_BF33D0) |
VR+24 | Virtual register offset: gen bitvector (allocated by sub_BEB110) |
VR+96 | Virtual register offset: kill bitvector (allocated by sub_BEAFD0) |
VR+136 | Virtual register offset: interference chain (linked list of aliased VRs) |
VR+144 | Virtual register offset: register class ID (int32) |
VR+176 | Virtual register offset: live-through bitvector |
VR+245 | Virtual register byte flag: needs-opcode-rewrite (set by Phase 19 cleanup) |
VR+264 | Virtual register flags: bit 0 = has-interference-chain, bit 1 = non-subset, bit 2 = was-split |
VR+280 | Virtual register flags: bit 2 = needs-split, bit 4 = propagated, bit 12 = predicate-qualified |
0x811C9DC5 | FNV-1a offset basis (used in Phase 19 interference hash) |
16777619 | FNV-1a prime (0x01000193) |
0x22BC5C0 | String address: "OriSplitLiveRanges" |
0x22BCFE8 | String address: "OriSplitHighPressureLiveRanges" |
Cross-References
- Pass Inventory -- full 159-phase listing with liveness phase positions
- Optimizer Pipeline -- pipeline stage groupings
- Ori IR -- Code Object layout, bitvector infrastructure details
- Allocator Architecture -- liveness as input to fat-point allocator
- Spill Mechanism -- liveness range calculator for spill placement
- Scheduler Architecture -- scheduling-specific liveness
- Scheduling Algorithm -- priority function's pressure estimates
Synchronization & Barriers
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas synchronization pipeline manages the insertion, optimization, and expansion of all GPU synchronization and barrier instructions. Eight phases span the full compilation pipeline, from early memory-ordering fence insertion through post-scheduling dependency barrier fixup. These phases collectively translate the PTX memory model into the hardware synchronization primitives required by each SM architecture: thread block barriers (BAR), memory barriers (MEMBAR), dependency barriers (DEPBAR), warp-level synchronization (WARPSYNC/BSYNC/BSSY), and asynchronous barriers (MBARRIER).
| Phases | 25, 26, 42, 71, 72, 99, 100, 114 |
| Categories | Lowering (25, 42, 72), Optimization (26, 71), Scheduling (99, 100, 114) |
| Pipeline span | Phase 25 (early optimization) through phase 114 (post-scheduling) |
| Key opcodes | BAR (opcode 61), MEMBAR (opcode 111), DEPBAR, BSYNC, BSSY, WARPSYNC, MBARRIER.*. Note: the code uses opcode 130 (HSET2 in the ROT13 name table) as an internal marker for barrier/sync instructions in the Ori IR. |
| Architecture gates | Phases 100, 114 dispatch through architecture vtable; phase 42 dispatches through backend vtable at ctx+1584 offset 0x168 |
| Related EIATTR | EIATTR_SYNC_STACK, EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Knobs | DisableErrbarAfterMembar, knob 487 (iteration gate), knob 358 (sync mode), knob 472 (barrier liveness) |
GPU Synchronization Model
NVIDIA GPUs provide four distinct synchronization mechanisms, each operating at a different scope and addressing different hazards.
Thread Block Barriers (BAR)
Thread block barriers synchronize all threads within a cooperative thread array (CTA). The hardware provides 16 named barriers (indices 0--15), each tracking participation counts. PTX exposes these as:
- `bar.sync N` -- block until all threads in the CTA arrive at barrier N
- `bar.red.{and,or,popc} N` -- barrier with warp-level reduction
- `bar.arrive N` -- signal arrival without blocking
- `barrier.cta.{sync,arrive,red}` -- PTX 8.0+ cluster-aware variants
In SASS, these map to the BAR instruction family (opcode 61 in the ROT13 name table). The Ori IR uses opcode 130 (HSET2 in the ROT13 name table) as an internal barrier/sync marker. The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used, which the hardware uses to partition the convergence barrier file.
PTX: bar.sync 0;
SASS: BAR.SYNC 0x0;
// stalls warp until all CTASize threads arrive at barrier 0
Memory Barriers (MEMBAR)
Memory barriers enforce ordering of memory operations across different visibility scopes:
- `membar.cta` -- visible to threads in the same CTA
- `membar.gpu` -- visible to threads on the same GPU device
- `membar.sys` -- visible to all agents (including host CPU and peer GPUs)
Additionally, fence.proxy instructions enforce ordering between different memory proxy domains (generic, texture, surface, constant).
The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS table records the byte offsets of membar.sys instructions so the driver can inject its software WAR workaround at those sites.
Dependency Barriers (DEPBAR / Scoreboards)
Dependency barriers are the micro-architectural mechanism for tracking instruction-level data hazards. Each SM provides 6 scoreboard entries (barriers 0--5) that track completion of long-latency operations. SASS instructions encode a 23-bit control word containing:
- Stall count (4 bits): cycles to wait before issuing the next instruction
- Yield flag (1 bit): hint to give up the scheduling quantum
- Write barrier (3 bits): scoreboard index to set on result writeback
- Read barrier mask (6 bits): scoreboard entries to wait for before reading
- Wait barrier mask (6 bits): scoreboard entries to clear/release
DEPBAR is the explicit dependency barrier instruction that waits for a specific set of scoreboard entries. Scoreboards are assigned by phase 115 (AdvancedScoreboardsAndOpexes) and phase 116 (ProcessO0WaitsAndSBs); the sync passes described here prepare the IR for scoreboard generation but do not assign scoreboards directly.
Warp-Level Synchronization
Warp-level sync instructions operate within a single warp (32 threads):
- WARPSYNC mask -- synchronizes threads identified by the lane mask (sm70+)
- BSSY B, target -- pushes a synchronization barrier for convergence
- BSYNC B -- pops and waits at the convergence barrier
The BSSY/BSYNC mechanism replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points. EIATTR_SYNC_STACK records metadata about the convergence barrier stack depth.
Asynchronous Barriers (MBARRIER)
Introduced in sm90 (Hopper), MBARRIER provides hardware-accelerated asynchronous barriers in shared memory. These support non-blocking arrival, expected transaction count tracking, and parity-based phase completion -- critical for async copy (cp.async.bulk) and TMA (Tensor Memory Accelerator) operations.
MBARRIER operations in PTX:
| PTX instruction | Purpose |
|---|---|
mbarrier.init | Initialize barrier object in shared memory |
mbarrier.arrive | Signal arrival (non-blocking) |
mbarrier.arrive_drop | Arrive and decrement expected count |
mbarrier.arrive.expect_tx | Arrive with expected transaction byte count |
mbarrier.test_wait | Test if barrier phase is complete |
mbarrier.try_wait | Wait with timeout |
mbarrier.try_wait.parity | Phase-parity-based wait |
mbarrier.pending_count | Query remaining arrivals |
mbarrier.inval | Invalidate barrier |
mbarrier.complete_tx | Mark transaction bytes as complete |
The EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS metadata inform the runtime about barrier allocation and instruction locations for driver patching.
Phase 25 -- StageAndFence
| Phase name | StageAndFence |
| Category | Lowering |
| Execute wrapper | sub_C5FBC0 (34 bytes) |
| Implementation | sub_1392E30 (166 bytes) |
| Core logic | sub_1390B30 (8,956 bytes, 97 callees) |
| Setup | sub_1389AF0 (3,049 bytes) |
| Teardown | sub_138A6E0 (3,408 bytes) |
| Gating | Requires opt_level > 1 AND context+1368 bit 0 AND context+1397 bits[6:7] != 0x40; additionally guarded by "LoopUnrolling" disable check and knob 487 |
| Total code | ~16 KB across 0x1389AF0--0x1393340 |
Purpose
StageAndFence inserts memory fence and staging instructions to enforce coherence ordering after loop unrolling. When loop unrolling replicates memory operations, the replicated loads and stores may violate the memory model if they cross a synchronization boundary that was inside the original loop body. This pass re-establishes correctness by inserting fence operations at the boundaries of unrolled iterations.
Execution Flow
sub_1392E30(compilation_unit):
// Guard: must have loops and bit flags set
if !(context+1368 bit 0) or (context+1397 & 0xC0) == 0x40:
return
// Check if "LoopUnrolling" pass is disabled
IsPassDisabled(knob_state, "LoopUnrolling", &disabled)
if disabled: return
if opt_level <= 2: return
// Check knob 487
if !CheckKnob(knob_state, 487, 1): return
// Core execution
sub_1389AF0(state, compilation_unit) // allocate working structures
sub_1390B30(state) // main fence insertion pass
sub_138A6E0(state) // cleanup
Main Pass -- sub_1390B30
The main pass (8,956 bytes) is the largest function in this phase group. It:
- Iterates over the basic block list via the instruction chain (context+272)
- Identifies memory operations that cross unrolled loop iteration boundaries
- Computes fence requirements based on the memory model and target architecture
- Calls sub_A0F020 (the scheduling entry point) to build dependency information and determine where fences are needed
- Inserts fence.proxy or MEMBAR pseudo-instructions at identified locations
- Updates the instruction list metadata via sub_781F80 (basic block refresh)
The function takes floating-point parameters (double a2, double a3, __m128d a4), suggesting it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay fences to overlap with independent computation.
Phase 26 -- OriRemoveRedundantBarriers
| Phase name | OriRemoveRedundantBarriers |
| Category | Optimization |
| Execute wrapper | sub_C60BD0 (334 bytes) |
| Implementation | sub_790A40 (2,288 bytes, 33 callees) |
| Helper: post-RA sched | sub_790020 (1,200 bytes) |
| Helper: pre-RA opt | sub_7904D0 (1,381 bytes) |
| Helper: barrier opt | sub_7923A0 (2,344 bytes, 30 callees) |
| Helper: barrier pass | sub_792CD0 (1,360 bytes, 25 callees) |
| Gating | Multi-function dispatch: only runs when sub_7DDB50(ctx) > 1 (i.e., the compilation unit contains more than one function) |
| Total code | ~10 KB across 0x790020--0x793220 |
Purpose
OriRemoveRedundantBarriers performs dataflow-driven elimination of provably redundant barrier instructions. When the compiler can prove that all threads in a warp (or CTA) must have already passed through a dominating synchronization point, subsequent barriers to the same scope are redundant and can be removed. This reduces the synchronization overhead without changing program semantics.
Execution Flow
The execute wrapper sub_C60BD0 is a multi-function dispatch pattern: when a compilation unit contains multiple functions, it creates two reference-counted list objects, stores the current phase chain pointer, and calls sub_790A40 for cross-function barrier analysis. For single-function units, it returns directly.
sub_C60BD0(phase, compilation_unit):
func_count = sub_7DDB50(compilation_unit)
if func_count <= 1: return
// Create two ref-counted analysis lists
list1 = pool_alloc(24)
list1->refcount = 1
list2 = pool_alloc(24)
list2->refcount = 1
// Store current phase chain
saved_chain = compilation_unit->field_88
// Run multi-function barrier analysis
sub_790A40(&compilation_unit)
// Release ref-counted lists
release(list1)
release(list2)
Main Analysis -- sub_790A40
The main analysis function (2,288 bytes) operates through several stages:
1. Mode selection: Queries knob 358 (sync mode) through the knob container at ctx+1664. Four modes exist:
   - Mode 0: no barrier removal (return immediately via sub_756F10)
   - Mode 1: conservative removal (calls sub_790020)
   - Mode 2: aggressive removal (calls sub_790020 with flag)
   - Mode >= 3: full multi-function analysis
2. Graph construction (sub_7E6090): Builds an instruction-level dependency graph with 32-bit flags. Called with (ctx, 0, 0, 0, 0).
3. Liveness refresh (sub_781F80): Refreshes the basic block liveness information with mode parameter 1 (compute barrier liveness).
4. Dependency tracking (sub_A10160): Sets up dependency tracking data structures.
5. Block iteration (sub_769300, sub_752AB0): Builds block-level analysis structures for the function.
6. Redundancy analysis: For each barrier instruction (opcode 130; HSET2 in the ROT13 name table, but used as the internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111), checks whether the barrier's destination register is live in any successor block. If the barrier result is dead (no thread could observe it before the next dominating barrier), the barrier is eliminated.
7. Block-level merging (sub_75EAE0, sub_75E2F0): Merges barriers at block boundaries where adjacent blocks have compatible barrier scopes.
The algorithm checks barriers by walking the instruction chain and testing opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/sync instructions -- not the actual HSET2 half-precision set instruction). For each barrier, it extracts the destination operand (field+84), resolves the register through the register table at context+88, and tests whether the register's use-count (reg+24) indicates the barrier result is consumed.
Phase 42 -- ExpandMbarrier
| Phase name | ExpandMbarrier |
| Category | Lowering |
| Execute wrapper | 0xC5F110 (6 bytes) |
| Implementation | Architecture-dispatch via *(*(ctx+0x630))->vtable[0x168/8] |
| isNoOp | Always false (0xC5F130 returns 0) |
| No opt-level check | Runs at all optimization levels |
Purpose
ExpandMbarrier expands MBARRIER pseudo-instructions into native barrier instruction sequences. This is critically important for sm90+ (Hopper and later) architectures that use asynchronous barriers for TMA operations, cp.async.bulk, and warpgroup-level synchronization.
Dispatch Mechanism
Unlike most phases that tail-call a fixed function after an optimization level check, ExpandMbarrier performs a direct vtable dispatch:
mov rdi, [rsi+0x630] ; rdi = ctx->arch_backend (offset 1584)
mov rax, [rdi] ; rax = arch_backend->vtable
jmp [rax+0x168] ; call vtable[45] -- ExpandMbarrier impl
The architecture backend at ctx+1584 provides the actual expansion logic. This design allows each SM generation to define its own mbarrier expansion rules:
- Pre-sm90: MBARRIER pseudo-ops do not exist; the phase is effectively a no-op.
- sm90 (Hopper): Expands MBARRIER pseudo-ops into hardware mbarrier instruction sequences using the mbarrier object in shared memory. Handles mbarrier.init, mbarrier.arrive, mbarrier.arrive.expect_tx, mbarrier.try_wait.parity, and mbarrier.inval.
- sm100+ (Blackwell): Extended mbarrier semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations.
MBARRIER Expansion Patterns
A typical async copy pattern in the Ori IR and its expansion:
Before expansion (pseudo-ops):
MBARRIER_INIT %mbar, count
MBARRIER_ARRIVE_EXPECT_TX %mbar, bytes
CP.ASYNC.BULK.TENSOR dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY %mbar, parity, pred
After expansion (native):
MBARRIER.INIT [smem_addr], count
MBARRIER.ARRIVE.EXPECT_TX [smem_addr], bytes
CP.ASYNC.BULK.TENSOR [dst], [src], [smem_addr]
MBARRIER.TRY_WAIT.PARITY pred, [smem_addr], parity
The expansion resolves shared memory addresses for the mbarrier objects, handles the naming of __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier and __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity reserved shared memory regions, and inserts any required fence.proxy operations for proxy domain coherence.
Phase 71 -- OptimizeSyncInstructions
| Phase name | OptimizeSyncInstructions |
| Category | Optimization |
| Execute wrapper | sub_C60080 (34 bytes) |
| Implementation | sub_90A340 (1,670 bytes, 21 callees) |
| Sync predicate | sub_18F6930 (185 bytes) -- determines if sync optimization should run |
| Gating | Requires opt_level > 2; additionally checks knob 487, architecture flags at context+1368, and sub_18F6930 predicate |
| Pipeline position | After OriPropagateVaryingSecond (70), before LateExpandSyncInstructions (72) |
Purpose
OptimizeSyncInstructions performs redundancy elimination and simplification of synchronization instructions within the partial-SSA window. It identifies and removes sync instructions that are provably unnecessary based on the data flow and the GPU memory model, and simplifies complex sync patterns into cheaper equivalents.
Gating Logic
The pass has elaborate gating controlled by sub_18F6930, which evaluates:
sub_18F6930(ctx, mode):
// Check architecture-specific sync flags
flags = *(ctx+1398)
if (flags & 0x18) != 0:
return (flags & 0x18) == 8 // specific arch config
// Check whether SM requires explicit sync
if !(*(ctx+1412) bit 7) or *(ctx+1584)->field_372 <= 28673:
return true
// Functions with <= 4 registers always need sync
if *(ctx+1704) <= 4:
return true
// Mode-specific knob checks at offsets 51120/51192
...
The value 28673 corresponds to sm70/sm72/sm73/sm75 architecture IDs. The predicate returns true (optimize) for architectures that have explicit synchronization requirements (Volta and later), and false for older architectures where synchronization is implicit.
Main Algorithm -- sub_90A340
sub_90A340(ctx):
if opt_level <= 2: return
if !CheckKnob(ctx+1664, 487, 1): return
// Determine sync optimization mode
has_uniform_regs = (ctx+1412 bit 7) && !(ctx+1368 bit 4)
arch_data = *(*(ctx+1664)+72)
sync_mode = *(arch_data + 15480)
if sync_mode == 1: mode = *(arch_data + 15488)
// Main path: combined sync + barrier optimization
if (ctx+1368 flags 0x20000001 all set) && (ctx+1377 bit 6) && !mode:
need_expand = sub_18F6930(ctx, 0)
sub_781F80(ctx, 1) // refresh liveness
if !need_expand && !has_uniform_regs:
sub_7E6090(ctx, 0, 0, 0, 32) // build dep graph, 32-bit mode
goto optimize
else:
need_expand = sub_18F6930(ctx, 0)
if !has_uniform_regs && !need_expand: return
sub_781F80(ctx, 1)
// Barrier liveness computation
sub_775010(ctx)
sub_7E6090(ctx, 0, 0, 0, 32)
// Walk instruction list, find opcode 130 (HSET2 in ROT13; internal barrier/sync marker)
for instr = ctx->first_instr; instr; instr = instr->next:
if instr->opcode != 130: continue
// Extract operand, check register type
operand = instr->field_84
if operand_type(operand) != 1: continue
reg = register_table[operand & 0xFFFFFF]
if !check_liveness(reg): continue
// For uniform-register-aware path:
if has_uniform_regs:
if (instr->field_91 & 1): continue // skip if flagged
if reg->file != 6: continue // must be barrier reg
if reg->use_count <= 1: continue
// Check all uses via use-def chain...
try_merge_barriers(ctx, instr)
// Standard redundancy elimination
try_eliminate_redundant_sync(ctx, instr)
cleanup_lists()
The pass iterates the flat instruction list (not per-block), checking every instruction with opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/synchronization instructions). For each barrier, it examines the operand to determine:
- Whether the barrier result register is consumed by any subsequent instruction
- Whether the barrier can be merged with an adjacent barrier of the same scope
- Whether the barrier guards a memory region that is provably thread-local
The sub_1245740 call performs the actual redundancy proof by checking dominance relationships between barrier pairs.
Phase 72 -- LateExpandSyncInstructions
| Phase name | LateExpandSyncInstructions |
| Category | Lowering |
| Execute wrapper | sub_C600B0 (34 bytes) |
| Implementation | sub_1381DA0 (1,517 bytes, 3 callees) |
| Core driver | sub_1381CD0 (206 bytes) |
| Gating | Requires opt_level > 1; checks context+1376 bit 5, "Predication" disable flag, and knob 487 with iteration counter |
| Error diagnostic | "ExpandSyncInstLate option is not supported on this architecture." (via sub_7EF030) |
| Pipeline position | After OptimizeSyncInstructions (71), before ConvertAllMovPhiToMov (73) |
| Gate pass | Phase 135 (AdvancedPhaseLateExpandSyncInstructions) provides an additional architecture hook |
Purpose
LateExpandSyncInstructions performs the final expansion of synchronization pseudo-instructions into their target-specific SASS instruction sequences. This runs late in the pipeline (phase 72, within the partial-SSA window) so that earlier optimization passes can work with high-level sync pseudo-ops rather than architecture-specific instruction sequences.
Execution Flow
The entry function sub_1381DA0 is structurally similar to the Predication pass entry: both live in the same address range (0x1381000--0x1382000) and share infrastructure for walking the instruction list within the partial-SSA window.
sub_1381DA0(ctx):
if context+1376 bit 5: return // disabled by phase flag
// Read expansion mode from knob container
knob_state = *(ctx+1664)
mode = *(*(knob_state+72) + 16416)
if mode == 0:
limit = (ctx+1419 bit 4) != 0
elif mode == 1:
limit = *(*(knob_state+72) + 16424)
IsPassDisabled(knob_state, "Predication", &disabled)
if disabled or limit: return
// Knob 487 iteration gating with counter
if !CheckKnob487WithCounter(knob_state): return
// Set up working state
context+1385 |= 1 // mark expansion active
// Call core driver
sub_1381CD0(state)
context+1385 &= ~1 // clear expansion flag
cleanup_pools()
Expansion Rules
The pass transforms sync pseudo-instructions according to the target SM:
| Pseudo-instruction | sm70+ expansion | sm90+ expansion |
|---|---|---|
SYNC.WARP mask | WARPSYNC mask | WARPSYNC mask |
SYNC.BLOCK | BAR.SYNC 0 | BAR.SYNC 0 |
SYNC.CONVERGE target | BSSY B, target ... BSYNC B | BSSY B, target ... BSYNC B |
MBARRIER.WAIT pseudo | (not expanded here) | MBARRIER.TRY_WAIT.PARITY loop |
ERRBAR | BAR.SYNC 15 (error barrier) | Conditional on DisableErrbarAfterMembar |
The ERRBAR (error barrier) is a compiler-inserted synchronization point placed after membar.sys instructions to ensure memory ordering is observable before proceeding. The DisableErrbarAfterMembar knob (accessible via the CLI option string at 0x1D04BC0) controls whether these error barriers are emitted. When set to 1, the compiler omits the error barrier, trading safety for performance.
Phase 99 -- OriDoSyncronization
| Phase name | OriDoSyncronization |
| Category | Scheduling |
| Execute wrapper | sub_C5FAD0 (34 bytes) |
| Implementation | sub_A0F020 (2,375 bytes, 32 callees) -- DAG scheduler entry |
| Dependency builder | sub_A0D800 (dependency DAG construction) |
| Per-block processor | sub_A06A60 (3,045 bytes, 53 callees) |
| Uninit reg check | sub_A0B5E0 |
| Gating | Requires opt_level > 1 |
| Pipeline position | After BackPropagateVEC2D (98), before ApplyPostSyncronizationWars (100) |
| Callers of sub_A0F020 | 11 sites: sub_913A30, sub_9AEF60 (x2), sub_C5FA40/sub_C5FA70/sub_C5FAA0/sub_C5FAD0 (4 arch wrappers), sub_1390B30 (x2), sub_1395850 (x2) |
Purpose
OriDoSyncronization is the post-optimization synchronization insertion pass. It runs after all IR-level optimizations are complete and before register allocation, using the scheduling infrastructure to analyze data dependencies and insert the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model for correctness.
Note the intentional misspelling "Syncronization" (missing 'h') -- this is present in the binary's string table and preserved here for fidelity.
Architecture
OriDoSyncronization reuses the DAG scheduler's infrastructure (sub_A0F020) rather than implementing its own analysis. The same function serves as the scheduling entry point in multiple contexts:
- Phase 99 (OriDoSyncronization): inserts sync instructions based on dependency analysis
- Phase 25 (StageAndFence): inserts fences via sub_1390B30
- Multiple architecture-specific scheduling wrappers: sub_C5FA40, sub_C5FA70, sub_C5FAA0
Execution Flow
sub_A0F020(ctx):
while true:
if *(ctx+1648) == 0: break
// Initialize dependency context
dep_ctx = pool_alloc(16)
dep_ctx->refcount = 2
dep_ctx->parent = ctx->pool
// Build dependency DAG
sub_A0D800(ctx, dep_ctx)
// Process blocks in reverse order
for each basic_block in reverse(block_list):
if block->opcode == 8: continue // skip NOP/exit blocks
sub_A06A60(ctx, callback, block, flags...)
// Check for uninitialized register usage
sub_A0B5E0(ctx, dep_ctx)
// Diagnostic output if enabled
sub_7F44D0(ctx)
// Break or retry based on scheduling result
...
Per-Block Synchronization -- sub_A06A60
The per-block processor (3,045 bytes, 53 callees) is the core of sync insertion. For each basic block:
- Allocates temporary liveness bitsets via sub_BDBA60 (bitvector alloc)
- Copies the block-entry live set from ctx+832 via sub_BDC300
- Walks instructions forward, examining each opcode (masked by 0xCFFF):
  - Opcode 93 (OUT_FINAL in ROT13; used here as a call-like control-flow marker -- actual CALL is opcode 71): copies the callee-save register set, handles arguments
  - Opcode 95 (STS in ROT13; used here as a barrier/terminator marker -- actual BAR is opcode 61): AND-merges successor block live sets
  - Opcode 97 (STG in ROT13; used here as a branch/control marker -- actual BRA is opcode 67): tests whether the live set changed since block entry
- Inserts sync instructions where data dependencies cross synchronization boundaries
- Updates uniform register liveness at ctx+856 when ctx+1378 bit 3 is set
The function uses extensive bitvector operations (13 different bitvector functions from the sub_BDB*/sub_BDC* infrastructure) to track register liveness through synchronization points.
Phase 100 -- ApplyPostSyncronizationWars
| Phase name | ApplyPostSyncronizationWars |
| Category | Scheduling |
| Execute wrapper | sub_C607A0 (51 bytes) |
| Implementation | Architecture-dispatch via *(*(ctx+0x630))->vtable[0x110/8] |
| Nullsub guard | Skips if vtable entry equals nullsub_170 (0x7D6C80) |
| Gating | Requires opt_level > 1 |
| Pipeline position | After OriDoSyncronization (99), before AdvancedPhaseAllocReg (101) |
Purpose
ApplyPostSyncronizationWars fixes write-after-read (WAR) hazards that are introduced or exposed by the synchronization insertion in phase 99. When OriDoSyncronization inserts new barrier or memory fence instructions, these insertions can create new register hazards (the barrier instruction may read a register that a subsequent instruction writes). This pass scans for and resolves those hazards.
Dispatch Mechanism
; sub_C607A0
mov rbx, rsi ; save ctx
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return ; skip if opt_level <= 1
mov rdi, [rbx+0x630] ; rdi = ctx->arch_backend
mov rax, [rdi] ; rax = arch_backend->vtable
mov rax, [rax+0x110] ; vtable[34] = ApplyPostSyncWars impl
cmp rax, 0x7D6C80 ; compare with nullsub_170
jne call_impl ; if not nullsub, call it
return:
ret
call_impl:
jmp rax ; tail-call architecture implementation
The nullsub_170 check (at 0x7D6C80) is the no-op sentinel: if the architecture backend does not override this vtable entry, the phase is silently skipped. This allows architectures that do not have post-sync WAR hazards to avoid unnecessary work.
Phase 114 -- FixUpTexDepBarAndSync
| Phase name | FixUpTexDepBarAndSync |
| Category | Scheduling |
| Execute wrapper | sub_C60600 (51 bytes) |
| Implementation | Architecture-dispatch via *(*(*(ctx+0x630)+0x10))->vtable[0x70/8] |
| Nullsub guard | Skips if vtable entry equals nullsub_43 (0x680170) |
| Gating | Requires opt_level > 1 |
| Pipeline position | After PostFixForMercTargets (113), before AdvancedScoreboardsAndOpexes (115) |
Purpose
FixUpTexDepBarAndSync performs a post-scheduling fixup of texture dependency barriers and synchronization instructions. After the main scheduling passes (phases 97--110) have reordered instructions and the Mercury encoder (phases 117--122) has finalized SASS encoding, texture fetch instructions may have dependency barriers that are incorrect due to instruction movement. This phase corrects those barriers.
Dispatch Mechanism
The dispatch is doubly-indirect, going through two vtable levels:
; sub_C60600
mov rbx, rsi
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return
mov rax, [rbx+0x630] ; arch_backend
mov rdi, [rax+0x10] ; secondary object at arch_backend+16
mov rax, [rdi] ; secondary vtable
mov rax, [rax+0x70] ; vtable[14] = FixUpTexDepBar impl
cmp rax, 0x680170 ; compare with nullsub_43
jne call_impl
return:
ret
call_impl:
jmp rax ; tail-call implementation
The double indirection (arch_backend -> arch_backend+16 -> vtable+0x70) indicates that the texture dependency barrier fixup lives in a secondary object owned by the architecture backend -- likely the scheduling/scoreboard subsystem object.
Texture Dependency Barriers
Texture fetches are long-latency operations (hundreds of cycles). The hardware uses dependency barriers (scoreboards) to track their completion. When the scheduler moves a texture fetch away from its original position, the dependency barrier assignment from AdvancedScoreboardsAndOpexes (phase 115) may become suboptimal or incorrect. This fixup pass:
- Scans for texture fetch instructions (opcode 0x17 / class 0x37/0x38 in the scheduling tables)
- Checks that the assigned write-barrier index correctly covers the instruction's result register
- Verifies that consumer instructions have the corresponding read-barrier bit set in their wait mask
- Adjusts stall counts and yield flags if the texture result is consumed sooner than the original schedule assumed
Memory Order Intrinsic Lowering
Before the eight sync phases operate on the Ori IR, the OCG intrinsic lowering pipeline translates PTX memory-ordering intrinsics into Ori IR instruction sequences. Three sibling functions in the OCG body dispatcher (sub_6D8B20) handle the three families of memory-ordering intrinsics. All three share an identical subop-array parsing protocol and the same scope/memory-order/deprecation validation logic.
Dispatcher and Function Family
The OCG body dispatcher at sub_6D8B20 (432 lines) reads the intrinsic ID from *(state+10688) and dispatches to per-family lowering functions via a 28-case switch statement. The three memory-ordering handlers are:
| Case | Function | Size | Family | PTX instructions |
|---|---|---|---|---|
| 9 | sub_6C0D90 | 19KB (812 lines) | Atomic/reduction | atom.add, atom.cas, atom.exch, red.add |
| 0xA | sub_6C1CF0 | 16KB (633 lines) | Mbarrier | mbarrier.arrive, mbarrier.test_wait, mbarrier.try_wait, counted/bytemask variants |
| 0x16 | sub_6C4DA0 | 15KB (647 lines) | Fence / load-store | fence.sc, ld.acquire, st.release with scope/domain |
Subop Array Protocol
Each intrinsic descriptor carries a subop array at state+10704 (an int[]) with the count at state+10712. The subop values encode orthogonal PTX qualifiers (scope, memory order, type, domain) into a flat integer sequence that the lowering functions parse in positional order.
Reconstructed subop value map (shared by all three functions):
| Subop | Meaning | IR effect |
|---|---|---|
| 0 | Scope qualifier (.sys/.gpu/.cta) | Sets scope_level = 4 |
| 1 | Counted mode (mbarrier arrival count) | Adds extra type-14 parameter |
| 2 | Shared domain (_shared) | scope = 5 |
| 3 | Memory order acquire | Sets order = 5 |
| 4 | Memory order release | Sets order = 6 |
| 5 | MMIO flag (.mmio) | Sets flag bit 8 |
| 6 | Vector width 2x | scope_width = 2 |
| 7 | Vector width 4x | scope_width = 4 |
| 8 | Type u32 | IR type 12 |
| 9 | Type s32 | IR type 11 |
| 0xA | Type u64 | IR type 10 |
| 0xB--0x12 | Reduction ops (add/min/max/inc/dec/and/or/xor) | Op index 0--7 |
Scope and Memory Order Validation
All three functions enforce the PTX 8.0 scoped memory model rules through a three-way decision tree. The logic (taken from sub_6C0D90 and sub_6C4DA0 where the strings appear verbatim; sub_6C1CF0 enforces equivalent constraints via positional subop checks) is:
if scope_qualifier_present:
if memory_order NOT present:
ERROR 7308: "Required scope with memory order semantics"
elif memory_order_present:
WARNING 7308 (via sub_7F7C10): "Deprecated scope without memory order semantics"
// Deprecation warning — may be promoted to error in future PTX versions.
// If location info available (ctx+104), emits follow-up via sub_8955D0.
if mmio_flag AND NOT global_domain:
ERROR 7308: "Domain param \"_global\" required for mmio semantics"
The warning path uses sub_7F7C10 (the deprecation-warning emitter at context+1176), which returns a boolean indicating whether the warning was promoted to an error. This implements NVIDIA's staged deprecation of unscoped memory operations: PTX code using old-style membar.cta without explicit .acquire/.release qualifiers triggers the deprecation path, while new-style fence.sc.cta.acquire requires the full scope + order combination.
Mbarrier Intrinsic Lowering -- sub_6C1CF0
The mbarrier handler (16KB, case 0xA) lowers mbarrier.* PTX intrinsics into Ori IR instruction sequences. It handles:
1. Scope/domain parsing: The first subop must be 2 (shared) or 3 (global). If the first subop is > 1, it is treated as the domain selector directly; otherwise the function enters the two-position scope path where the second subop supplies the domain.
2. Counted mode (subop 1): Enables arrival-count tracking. When active, the parameter list includes an extra type-14 (integer) parameter for the expected arrival count. Bytemask mode (subop 6) is incompatible with counted mode -- error 7300: "byte mask not allowed with counted".
3. Bytemask mode (subop 6): Requires a global destination (subop[1] == 3) and a shared source (subop[2] == 2). Sets flag bit 17 (0x20000). Error messages: "global dst should be specified with bytemask" and "shared src should be specified with bytemask".
4. Sequenced mode (subop 5): Explicitly unsupported. Error 7300: "sequenced : Not yet supported".
5. MMIO flag (subop 4 when value == 4 in the optional-subop loop): Sets bit 3 in the flag word. Only valid with the global domain (scope 2); enforced by the same "_global required for mmio" rule.
Parameter Processing
Parameters are stored at state+10728 as 12-byte records {value[4], flags[4], type[4]}. The function iterates over v100 parameters (2 or 3 depending on counted mode):
- Each parameter type must be 10 (predicate register) or 12 (scope domain). Other types trigger error 7302 using the type name table at off_229E8C0.
- For scope-domain parameters, the top 3 bits of the value word ((value >> 28) & 7) select the resolution mode:
  - Mode 5: Named barrier resolution via sub_91BF30, then sub_934630 (opcode 130) to create a barrier pseudo-op in the Ori IR.
  - Mode 1 (no bit 24): Direct register reference (fast path, no resolution needed).
  - Other modes: Full register resolution via sub_91D150 + sub_7DEFA0.
Output Instruction Sequence
The function generates three Ori IR instructions:
| Step | Builder | Opcode | Purpose |
|---|---|---|---|
| 1 | sub_934630 | 214 | Mbarrier scope-domain setup; template mask 0x90FFFFFF |
| 2 | sub_934630 | 273 | Memory ordering constraint / fence |
| 3 | sub_92C240 | 299 | Mbarrier operation with full flags (arrive/wait/test) |
The flag word passed to opcode 299 encodes: flags | 0x60000000, where flags accumulates mmio (bit 3), bytemask (bit 17), and other qualifiers from the subop parsing.
Error Codes
| Code | Message template | Severity |
|---|---|---|
| 7300 | "Unexpected intrinsic name (%s)" | Semantic restriction (hard error) |
| 7301 | "Unexpected intrinsic param number (%d)" | Parameter count mismatch |
| 7302 | "Unexpected intrinsic type (%s) in param (%d)" | Wrong parameter type |
| 7303 | "Unexpected intrinsic type (%s) instead of (%s) in param (%d)" | Type mismatch with expected |
| 7306 | "Unexpected intrinsic subop in position (%d)" | Positional subop error |
| 7307 | "Unexpected intrinsic subop (%s) in position (%d)" | Named subop error |
| 7308 | "Instrinsic - \"%s\"" (sic) | Scope/order/domain validation |
Two diagnostic functions handle these errors: sub_895530 emits directly when source location is available (ctx+48); sub_7EEFA0 builds a deferred diagnostic record.
Function Map
| Address | Size | Identity | Phase | Confidence |
|---|---|---|---|---|
sub_C5FBC0 | 34 | StageAndFence execute wrapper | 25 | CERTAIN |
sub_1392E30 | 166 | StageAndFence entry | 25 | HIGH |
sub_1389AF0 | 3,049 | StageAndFence setup | 25 | HIGH |
sub_1390B30 | 8,956 | StageAndFence core (fence insertion) | 25 | HIGH |
sub_138A6E0 | 3,408 | StageAndFence teardown | 25 | HIGH |
sub_C60BD0 | 334 | OriRemoveRedundantBarriers execute wrapper | 26 | CERTAIN |
sub_790A40 | 2,288 | OriRemoveRedundantBarriers main | 26 | HIGH |
sub_790020 | 1,200 | Post-RA scheduling helper | 26 | MEDIUM |
sub_7904D0 | 1,381 | Pre-RA optimization helper | 26 | MEDIUM |
sub_7923A0 | 2,344 | Barrier placement optimization | 26 | MEDIUM |
sub_792CD0 | 1,360 | Top-level barrier pass | 26 | MEDIUM |
0xC5F110 | 6 | ExpandMbarrier execute (vtable dispatch) | 42 | CERTAIN |
sub_C60080 | 34 | OptimizeSyncInstructions execute wrapper | 71 | CERTAIN |
sub_90A340 | 1,670 | OptimizeSyncInstructions main | 71 | HIGH |
sub_18F6930 | 185 | Sync optimization predicate | 71 | HIGH |
sub_C600B0 | 34 | LateExpandSyncInstructions execute wrapper | 72 | CERTAIN |
sub_1381DA0 | 1,517 | LateExpandSyncInstructions entry | 72 | HIGH |
sub_1381CD0 | 206 | LateExpandSyncInstructions core driver | 72 | HIGH |
sub_C5FAD0 | 34 | OriDoSyncronization execute wrapper | 99 | CERTAIN |
sub_A0F020 | 2,375 | DAG scheduler entry (sync insertion) | 99 | HIGH |
sub_A0D800 | -- | Dependency DAG builder | 99 | MEDIUM |
sub_A06A60 | 3,045 | Per-block sync processor | 99 | HIGH |
sub_A0B5E0 | -- | Uninitialized register check | 99 | MEDIUM |
sub_C607A0 | 51 | ApplyPostSyncronizationWars execute wrapper | 100 | CERTAIN |
sub_C60600 | 51 | FixUpTexDepBarAndSync execute wrapper | 114 | CERTAIN |
sub_A9C550 | 2,178 | Barrier instruction lowering | -- | HIGH |
sub_80F400 | 1,779 | Sync instruction SASS lowering | -- | HIGH |
sub_AA3BB0 | 2,726 | MBARRIER encoding | -- | HIGH |
sub_AA33C0 | -- | MBARRIER mnemonic builder | -- | MEDIUM |
sub_775010 | 18 | Barrier liveness computation entry | -- | MEDIUM |
sub_6D8B20 | 432 lines | OCG intrinsic body dispatcher (28-case switch) | -- | HIGH |
sub_6C0D90 | 812 lines | Atomic/reduction intrinsic lowering (scope+order) | -- | HIGH |
sub_6C1CF0 | 633 lines | Mbarrier intrinsic lowering (arrive/wait/test) | -- | HIGH |
sub_6C4DA0 | 647 lines | Fence/load-store intrinsic lowering (scope+domain) | -- | HIGH |
Pipeline Position and Data Flow
The eight sync phases are distributed across the pipeline to operate at the appropriate abstraction level:
Phase 25 StageAndFence ─── Early: after loop unrolling (24)
Phase 26 OriRemoveRedundantBarriers ─── Early: before GeneralOptimize (29)
... (mid-level optimization) ...
Phase 42 ExpandMbarrier ─── Mid: after CTA expansion (40)
... (late optimization) ...
Phase 71 OptimizeSyncInstructions ─── Late: after varying propagation (70)
Phase 72 LateExpandSyncInstructions ─── Late: before SSA destruction (73)
... (legalization, scheduling setup) ...
Phase 99 OriDoSyncronization ─── Post-opt: sync insertion pass
Phase 100 ApplyPostSyncronizationWars ─── Post-opt: WAR fixup
... (register allocation, scheduling) ...
Phase 114 FixUpTexDepBarAndSync ─── Post-sched: texture dep fixup
Data dependencies between phases:
- Phase 25 -> 26: StageAndFence inserts fences; OriRemoveRedundantBarriers may then eliminate redundant ones.
- Phase 42 -> 71: ExpandMbarrier materializes mbarrier ops; OptimizeSyncInstructions may simplify the resulting sequences.
- Phase 71 -> 72: OptimizeSyncInstructions reduces sync count; LateExpandSyncInstructions expands remaining pseudo-ops to SASS.
- Phase 99 -> 100: OriDoSyncronization inserts sync instructions; ApplyPostSyncronizationWars fixes hazards introduced by the insertion.
- Phase 114 -> 115: FixUpTexDepBarAndSync prepares texture barriers for AdvancedScoreboardsAndOpexes.
Architecture-Specific Behavior
The sync passes have significant architecture-dependent behavior controlled through the architecture backend vtable at ctx+1584:
| SM generation | Key behavior |
|---|---|
| sm70--sm75 (Volta/Turing) | Explicit BSSY/BSYNC convergence; WARPSYNC required; --no-membermask-overlap warning active; EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS emitted for membar.sys WAR |
| sm80--sm89 (Ampere/Ada) | cp.async commit/wait groups; ERRBAR after membar.sys; barrier number range checked [0..15] |
| sm90--sm90a (Hopper) | Full MBARRIER support; TMA async pipeline barriers; EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS emitted; wgmma.fence / tcgen05.fence sync fences for tensor operations |
| sm100+ (Blackwell) | Extended cluster barriers (barrier.cluster.arrive/wait); fence.proxy with proxy domain annotations; sync_restrict::shared::{cta,cluster} scope qualifiers; async bulk multicast |
The sub_18F6930 predicate (185 bytes) encodes the architecture-specific decision logic. The magic value 28673 at *(ctx+1584)+372 corresponds to an architecture version threshold that enables explicit synchronization optimization for Volta-class and later architectures.
Related CLI Options
| Option | Effect |
|---|---|
--assume-extern-functions-do-not-sync | Tells the compiler that external function calls do not execute synchronization instructions, enabling more aggressive barrier elimination |
--no-membermask-overlap | Asserts that no sync instruction is executed with different but overlapping thread masks (sm70--sm75 only). Enables additional optimizations. |
--print-potentially-overlapping-membermasks | Diagnostic: prints locations of sync instructions where the compiler must assume overlapping masks |
Related Knobs
| Knob | Effect |
|---|---|
DisableErrbarAfterMembar | When set to 1, suppresses error barrier (BAR.SYNC 15) insertion after membar.sys instructions |
| Knob 358 | Sync optimization mode selector (0=disabled, 1=conservative, 2=aggressive, 3+=full analysis) |
| Knob 472 | Barrier liveness tracking enable |
| Knob 487 | Iteration gate (shared with multiple passes); controls maximum number of iterations |
Cross-References
- Pass Inventory -- complete 159-phase table with sync phases at positions 25, 26, 42, 71, 72, 99, 100, 114
- Scheduler Architecture -- the scheduling infrastructure reused by OriDoSyncronization
- Scoreboards & Dependency Barriers -- phases 114, 115, 116; scoreboard generation
- Phase Manager -- vtable dispatch mechanism, factory switch
- Predication -- shares entry infrastructure with LateExpandSyncInstructions
- Intrinsics Index -- OCG body dispatcher (sub_6D8B20) and per-family lowering functions
- OCG Intrinsic Lowering -- dispatch table for sub_6C0D90 / sub_6C1CF0 / sub_6C4DA0
- GMMA/WGMMA Pipeline -- wgmma.fence and tcgen05.fence interactions
- SM Architecture Map -- per-SM sync capabilities
- Knobs System -- knobs 358, 472, 487, DisableErrbarAfterMembar
Hot/Cold Partitioning
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas implements hot/cold partitioning across three dedicated phases that mark cold blocks, reorganize loop internals, and restructure whole-function control flow to improve instruction cache utilization and warp scheduling efficiency. The system operates at two distinct granularities: instruction-level classification (used by the scheduler's priority function) and block-level classification (used by code layout and predication). Both are static heuristics -- no hardware performance counters are read at runtime -- though profile-guided data from phase 20 (PerformPGO) can influence block weights when available.
| Phases | 41 (MarkAdditionalColdBlocks), 108 (OptimizeHotColdInLoop), 109 (OptimizeHotColdFlow) |
| Category | Analysis (41), Optimization (108, 109) |
| Pipeline positions | Phase 41: mid-optimization (after DoVirtualCTAExpansion); Phases 108--109: post-scheduling (after OriRemoveNopCode) |
| Vtable addresses | off_22BDC30 (41), off_22BE6A8 (108), off_22BE6D0 (109) |
| Instruction classifiers | sub_A9CDE0 (isHotMemoryOp, 380B), sub_A9CF90 (isColdMemoryOp, 367B) |
| Block layout consumer | Phase 112: PlaceBlocksInSourceOrder (sub_A92C50) |
| Related knob | Knob 582 (block-level cold-region query, consumed by predication at phase 63) |
| PGO feeder | Phase 20: PerformPGO (block weights, branch probabilities) |
GPU Motivation
Hot/cold partitioning on a GPU serves fundamentally different purposes than on a CPU.
On a CPU, the primary goal is to keep the hot path in L1 icache lines and push cold code to distant addresses that never evict hot cache lines. The branch predictor handles the control flow; the optimization is purely about cache geometry.
On a GPU, three factors make hot/cold partitioning more impactful:
- Instruction cache pressure. GPU SMs have small instruction caches (typically 32--128 KB shared across all warps on the SM). With dozens of warps in flight, each executing the same kernel, icache misses stall the entire SM. Moving cold code (error paths, rare branches) away from hot loops reduces the working set that must remain cached.
- Warp scheduling. The warp scheduler selects ready warps from a pool. If cold-path instructions are interleaved with hot-path instructions in the binary layout, warps executing the cold path occupy instruction fetch bandwidth that could serve warps on the hot path. Physical separation means the fetch unit can service hot warps without cache line conflicts from cold code.
- Convergence overhead. On sm_70+ architectures, divergent branches require BSSY/BSYNC convergence barriers. Cold blocks that are reached by divergent branches incur barrier setup costs even when the cold path is rarely taken. The predication pass (phase 63) uses knob 582 to query whether a block is in a cold region, allowing it to avoid if-converting cold regions where the divergence penalty is acceptable.
Architecture Overview
The three phases form a pipeline with increasing scope:
Phase 41: MarkAdditionalColdBlocks (mid-optimization, Ori IR)
|
| Sets cold-block flags on basic blocks based on static heuristics
| and PGO data. These flags are read by subsequent optimization
| passes (predication, scheduling, code layout).
|
v
Phase 108: OptimizeHotColdInLoop (post-scheduling, SASS-level)
|
| Within each loop body, separates hot and cold paths. Moves cold
| blocks to the end of the loop region so that the hot path forms
| a contiguous instruction sequence.
|
v
Phase 109: OptimizeHotColdFlow (post-scheduling, SASS-level)
|
| At function scope, restructures control flow to place cold blocks
| after all hot blocks. Adjusts branch targets to maintain correctness.
|
v
Phase 112: PlaceBlocksInSourceOrder (final block layout)
|
| Determines the physical ordering of all basic blocks in the
| emitted binary, consuming the hot/cold annotations set above.
The key architectural decision is that phase 41 runs at the Ori IR level (before scheduling and register allocation), while phases 108--109 run post-scheduling on the nearly-final SASS representation. This two-stage design is necessary because:
- Cold-block annotations must be available early for predication decisions (phase 63) and scheduling priority (the 8-bit priority encoder).
- Block reordering can only happen after scheduling has assigned stall counts and dependency barriers, since moving blocks changes instruction fetch distances and potentially invalidates scoreboard computations.
Phase 41: MarkAdditionalColdBlocks
Phase 41 is an analysis pass that annotates basic blocks with cold flags. The name "Additional" implies that some initial cold marking occurs earlier (likely during AnalyzeControlFlow at phase 3 or PerformPGO at phase 20), and this pass extends those annotations using additional heuristics available after mid-level optimization.
Pipeline Context
Phase 41 runs after DoVirtualCTAExpansion (40) and before ExpandMbarrier (42). At this point in the pipeline:
- The CFG is fully built (phase 3) and loop structure is known (phase 18).
- PGO data has been applied (phase 20) if available.
- Branch optimization (phase 15) has simplified the control flow.
- The IR is still in Ori form -- no register allocation or scheduling has occurred.
Cold-Block Heuristics
The cold-block classification uses both static and profile-guided signals. Based on analysis of consumers of the cold-block flag:
Static heuristics (always available):
| Signal | Classification | Rationale |
|---|---|---|
Error handling / trap terminator | Cold | Error paths are rarely executed in correct programs |
EXIT with non-zero error code | Cold | Abnormal termination paths |
| Deeply nested conditional with uniform condition | Cold | Threads rarely diverge on uniform values |
| Block dominated by a back-edge but not in the loop body | Cold | Loop exit paths taken only once |
| Very low instruction count + unconditional branch to return | Cold | Cleanup epilogues |
Profile-guided signals (when PGO data is available via phase 20):
| Signal | Classification | Rationale |
|---|---|---|
| Execution count below threshold (relative to function entry) | Cold | Directly measured low frequency |
| Branch probability < 5% on the edge leading to the block | Cold | Rarely-taken branch target |
Cold Flag Storage
The cold annotation is stored in the BasicBlock flags field at offset +28 of the 136-byte BasicBlock object. The predication pass queries this via knob 582 (block-level cold-region query), and the scheduling priority function reads it when computing the 8-bit packed priority at bit position 5.
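A minimal model of the flag query, assuming the layout above: the flags word sits at BasicBlock+28 in the 136-byte object, but the exact bit position of the cold flag is unconfirmed (see the confidence table below), so COLD_BIT here is hypothetical.

```python
# Illustrative cold-block flag query. Offset +28 and the 136-byte object
# size are from the recovered layout; COLD_BIT is a hypothetical bit
# position, since the exact bit within the flags word is unconfirmed.
import struct

COLD_BIT = 1 << 0  # hypothetical position within the +28 flags word

def is_cold_block(bb_bytes) -> bool:
    """Read the 32-bit flags word at offset +28 of a BasicBlock object."""
    (flags,) = struct.unpack_from("<I", bb_bytes, 28)
    return bool(flags & COLD_BIT)

bb = bytearray(136)                        # zeroed BasicBlock-sized buffer
assert not is_cold_block(bb)
struct.pack_into("<I", bb, 28, COLD_BIT)   # mark the block cold
assert is_cold_block(bb)
```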
Consumers of Cold Annotations
| Consumer | Phase | Usage |
|---|---|---|
OriDoPredication | 63 | Knob 582: skips if-conversion of cold regions (divergence penalty acceptable in cold code) |
| Scheduling priority | 97--101 | Bit 5 of 8-bit priority: hot instructions get higher scheduling priority (1 = hot, 0 = cold) |
OptimizeHotColdInLoop | 108 | Reads cold flags to identify which loop blocks to move |
OptimizeHotColdFlow | 109 | Reads cold flags for whole-function layout |
PlaceBlocksInSourceOrder | 112 | Final block ordering uses cold annotations |
Instruction-Level Hot/Cold Classification
Independent of the block-level cold marking, ptxas classifies individual memory instructions as "hot" or "cold" for scheduling purposes. This classification is performed by two small, dual functions.
sub_A9CDE0 -- isHotMemoryOp (380 bytes)
Classifies an instruction as a hot memory operation. Hot instructions access memory spaces with high latency where early scheduling is beneficial.
isHotMemoryOp(scheduler, context, instruction):
opcode = instruction->opcode & 0xFFFFCFFF // mask modifier bits
if opcode == 183 or opcode == 288: // LD.E / ST.E (global load/store)
operand = resolve_last_source(instruction)
memspace = getMemorySpace(operand)
if memspace == 6: // global memory
return true
if memspace == 4: // shared memory
return ((operand->modifier >> 19) & 7) == 1 // specific variant
return false
if opcode in {91, 92}: // ATOM / RED
modifier = instruction->operand[last]
return ((modifier ^ 6) & 6) == 0 and (modifier & 1) // specific addressing mode
return false
sub_A9CF90 -- isColdMemoryOp (367 bytes)
The exact dual of isHotMemoryOp. Classifies an instruction as a cold memory operation.
isColdMemoryOp(scheduler, context, instruction):
opcode = instruction->opcode & 0xFFFFCFFF
if opcode == 183 or opcode == 288: // LD.E / ST.E
operand = resolve_last_source(instruction)
memspace = getMemorySpace(operand)
if memspace == 5: // constant memory (vs 6 for hot)
return true
if memspace == 4: // shared memory
return ((operand->modifier >> 19) & 7) == 2 // complement variant (vs 1 for hot)
return false
if opcode in {91, 92}: // ATOM / RED
modifier = instruction->operand[last]
return ((modifier ^ 6) & 6) == 0 and (modifier & 1) == 0 // complement of hot check
return false
Memory Space Classification
The memory space type is resolved by sub_91C840 from register file metadata at context+152:
| Space Code | Memory Type | Hot/Cold | Scheduling Implication |
|---|---|---|---|
| 4 | Shared memory | Depends on variant | Low latency (~20 cycles), variant-dependent |
| 5 | Constant memory | Cold | Cached, low latency (~4 cycles via constant cache) |
| 6 | Global memory | Hot | High latency (~200--800 cycles), benefits from early issue |
The shared memory case splits on a 3-bit subfield at operand bits 19--21: variant 1 is hot (bank-conflicted or special access pattern), variant 2 is cold (standard access).
For atomic operations (opcodes 91/92 = ATOM/RED), the hot/cold split is on the addressing mode: specific atomics targeting global memory in reduction mode are hot; others are cold.
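The load/store side of the classification above can be restated as a runnable sketch. The opcodes (183/288) and space codes (4/5/6) come from the recovered tables; the dataclass is a simplified stand-in for the Ori instruction object, and the ATOM/RED addressing-mode branch is omitted for brevity.

```python
# Runnable restatement of the LD.E/ST.E hot/cold split described above.
from dataclasses import dataclass

SHARED, CONSTANT, GLOBAL = 4, 5, 6  # recovered memory space codes

@dataclass
class MemOp:
    opcode: int
    memspace: int
    modifier: int = 0  # operand modifier word; bits 19-21 = shared variant

def classify(op: MemOp) -> str:
    masked = op.opcode & 0xFFFFCFFF       # mask modifier bits
    if masked in (183, 288):              # LD.E / ST.E
        if op.memspace == GLOBAL:
            return "hot"                  # long latency: issue early
        if op.memspace == CONSTANT:
            return "cold"                 # cached, short latency
        if op.memspace == SHARED:
            variant = (op.modifier >> 19) & 7
            return {1: "hot", 2: "cold"}.get(variant, "neither")
    return "neither"

assert classify(MemOp(183, GLOBAL)) == "hot"
assert classify(MemOp(288, CONSTANT)) == "cold"
assert classify(MemOp(183, SHARED, 1 << 19)) == "hot"
assert classify(MemOp(183, SHARED, 2 << 19)) == "cold"
```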
Scheduling Priority Integration
The instruction-level hot/cold classification feeds directly into the scheduler's 8-bit priority encoding (documented in Scheduling Algorithm):
Bit 7: yield-related
Bit 6: yield
Bit 5: hot/cold (1 = hot = higher priority, 0 = cold = lower priority)
Bit 4: register pressure overflow
Bit 3: same-BB preference
Bit 2: stall-free
Bit 1: critical path
Bit 0: tiebreaker
Hot memory instructions (global loads, global atomics) get higher scheduling priority because their long latencies benefit from being issued early -- the scheduler can then fill the latency window with independent instructions. Cold memory instructions (constant loads) have short latencies and do not benefit from early issue, so they receive lower priority.
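The bit layout above can be sketched as a packing function (the field names follow the bit table; the packing helper itself is illustrative, not a recovered function):

```python
# Sketch of the 8-bit scheduling priority encoding documented above.
def pack_priority(yield_rel=0, yld=0, hot=0, pressure=0,
                  same_bb=0, stall_free=0, crit_path=0, tiebreak=0) -> int:
    return ((yield_rel << 7) | (yld << 6) | (hot << 5) | (pressure << 4) |
            (same_bb << 3) | (stall_free << 2) | (crit_path << 1) | tiebreak)

# A hot global load on the critical path outranks a cold constant load
# that is otherwise identical:
hot_load = pack_priority(hot=1, crit_path=1)   # bit 5 set
cold_load = pack_priority(hot=0, crit_path=1)
assert hot_load > cold_load
```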
Phase 108: OptimizeHotColdInLoop
Phase 108 operates at the post-scheduling level, after register allocation and NOP removal have completed. It optimizes the layout of basic blocks within loop bodies.
Pipeline Context
Phase 107: OriRemoveNopCode (NOP removal)
Phase 108: OptimizeHotColdInLoop (loop-internal reordering)
Phase 109: OptimizeHotColdFlow (function-wide reordering)
Phase 110: PostSchedule (post-scheduling fixup)
Phase 112: PlaceBlocksInSourceOrder (final layout)
At this point, instructions have been scheduled and stall counts assigned. The optimization must preserve scheduling correctness while improving spatial locality.
Algorithm
The pass iterates over each loop in the function (loop structure computed at phase 18, maintained through the pipeline):
1. Identify loop blocks. Using the loop header RPO number and exit RPO number, enumerate all blocks in the loop body.
2. Classify blocks. Each block in the loop is classified as hot or cold based on the cold-block flags set by phase 41 (and potentially refined by phases between 41 and 108).
3. Partition. Hot blocks remain at the top of the loop body; cold blocks are moved to the bottom (higher addresses within the loop region).
4. Adjust branches. Branch targets are updated to reflect the new block positions. Cold blocks that were fall-through targets of hot blocks receive explicit branch instructions (since they are no longer adjacent).
The effect is that the hot loop body forms a contiguous instruction sequence that fits in fewer icache lines:
Before: After:
loop_header (hot) loop_header (hot)
hot_block_1 hot_block_1
cold_error_check hot_block_2
hot_block_2 BRA loop_header
BRA loop_header cold_error_check (moved)
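The partition step can be sketched as a stable split that pins the loop header (a simplified model of the layout transformation; block names and the predicate interface are illustrative):

```python
# Sketch of the phase-108 intra-loop layout: the header stays first, hot
# blocks keep their relative order at the top of the loop region, and
# cold blocks sink to the bottom.
def partition_loop(blocks, is_cold):
    header, body = blocks[0], blocks[1:]        # header cannot move
    hot = [b for b in body if not is_cold(b)]
    cold = [b for b in body if is_cold(b)]
    return [header] + hot + cold                # stable within each class

loop = ["loop_header", "hot_block_1", "cold_error_check", "hot_block_2"]
cold_set = {"cold_error_check"}
assert partition_loop(loop, cold_set.__contains__) == [
    "loop_header", "hot_block_1", "hot_block_2", "cold_error_check"]
```

In the real pass, step 4 (branch adjustment) then inserts explicit branches for cold blocks that lost their fall-through adjacency.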
Constraints
- The loop header block cannot be moved (it must be the entry point).
- Blocks with back-edges to the loop header must maintain their branch reachability.
- The transformation must not change the set of scoreboard/dependency barrier states visible at each instruction (since scheduling has already completed).
Phase 109: OptimizeHotColdFlow
Phase 109 extends the hot/cold separation to the entire function, operating on blocks that are not inside loops (or that span multiple loops).
Algorithm
The whole-function pass:
1. Scan all blocks in RPO order. Classify each non-loop block as hot or cold.
2. Partition the function into a hot region (placed first in the binary) and a cold region (placed last). Loop bodies are treated as atomic units -- the internal ordering was already optimized by phase 108.
3. Insert or adjust branches at hot-to-cold and cold-to-hot transitions. Cold-to-hot transitions require a branch back to the hot region.
4. Update block ordering metadata consumed by phase 112 (PlaceBlocksInSourceOrder).
The combined effect of phases 108 and 109 is a two-level layout:
Function layout after phases 108+109:
[hot loop bodies, internally sorted by phase 108]
[hot non-loop blocks]
[cold blocks from all regions]
Tepid Scheduling
Between the extremes of hot and cold, ptxas recognizes a "tepid" scheduling mode that balances math and memory instruction interleaving. The tepid infrastructure lives at 0x7A4350--0x7A5000 and computes ratios:
| Metric | Formula | Purpose |
|---|---|---|
MathToDmaWaitRatio | field[756] / a5 | Ratio of math cycles to memory wait cycles |
MathToDmaTepidRatio | field[752] / a6 | Ratio of math cycles to memory tepid cycles |
MathToEpilogueWaitRatio | field[756] / (a5 / epilogue_count) | Per-epilogue math-to-wait ratio |
MathToEpilogueTepidRatio | a6 / epilogue_count | Per-epilogue tepid ratio |
These ratios are computed by sub_7A4350 (TepidSchedulingCompute) and reported by sub_7A46E0 (TepidSchedulingReport) when verbosity > 0. Epilogue blocks are identified by sub_754510 (IsEpilogueBlock), with the epilogue instruction count controlled by knob 294.
The tepid mode affects how aggressively the scheduler interleaves memory and math instructions -- hot regions use aggressive overlap, cold regions use conservative scheduling, and tepid regions use an intermediate policy.
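The ratio table above can be restated as a small computation. The parameter names here (math_wait_cycles and so on) are hypothetical stand-ins for the recovered field[756], field[752], and the a5/a6 arguments of sub_7A4350; only the formulas themselves come from the table.

```python
# Illustrative computation of the tepid scheduling ratios tabulated above.
def tepid_ratios(math_wait_cycles, math_tepid_cycles,
                 dma_wait, dma_tepid, epilogue_count):
    return {
        # field[756] / a5
        "MathToDmaWaitRatio": math_wait_cycles / dma_wait,
        # field[752] / a6
        "MathToDmaTepidRatio": math_tepid_cycles / dma_tepid,
        # field[756] / (a5 / epilogue_count)
        "MathToEpilogueWaitRatio": math_wait_cycles / (dma_wait / epilogue_count),
        # a6 / epilogue_count
        "MathToEpilogueTepidRatio": dma_tepid / epilogue_count,
    }

r = tepid_ratios(1200, 800, 400, 200, 4)
assert r["MathToDmaWaitRatio"] == 3.0
assert r["MathToEpilogueWaitRatio"] == 12.0
```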
Interaction with Other Passes
Predication (Phase 63)
The predication pass queries knob 582 to determine whether a branch region lies in a cold block. If the region is cold, predication may be skipped because:
- The cold path is rarely executed, so the branch divergence penalty is amortized.
- Predication would execute both paths unconditionally, wasting functional units on cold-path instructions.
- Keeping the branch allows the cold path to be physically separated by phases 108--109.
PlaceBlocksInSourceOrder (Phase 112)
Phase 112 is the final block layout pass. It consumes the hot/cold annotations and the reordering decisions made by phases 108--109 to determine the physical position of every basic block in the emitted binary. The function sub_A92C50 implements this with a complex block-sorting algorithm that uses FNV-1a hash maps and an explicit work stack.
Key fields consumed from the Code Object:
| Offset | Field | Usage |
|---|---|---|
| +232 | Current block pointer | Block being placed |
| +264 | Block type/mode | Controls placement strategy |
| +296 | BB array | Block pointers for lookup |
| +648 | Successor edge map | Determines fall-through targets |
| +720 | RPO array | Provides initial ordering |
PerformPGO (Phase 20)
When profile data is available (from prior compilation runs with --generate-line-info and feedback), phase 20 applies execution counts and branch probabilities to the IR. These weights directly influence cold-block identification at phase 41 -- blocks with execution counts below a threshold relative to the function entry are marked cold regardless of static heuristics.
Key Functions
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_A9CDE0 | 380B | isHotMemoryOp -- classifies instruction as hot memory access | HIGH (0.90) |
sub_A9CF90 | 367B | isColdMemoryOp -- classifies instruction as cold memory access | HIGH (0.90) |
sub_91C840 | ~200B | getMemorySpace -- resolves memory space type from operand metadata | MEDIUM |
sub_A92C50 | ~5KB | PlaceBlocksInSourceOrder -- final block layout algorithm | HIGH |
sub_7A46E0 | ~1.1KB | TepidSchedulingReport -- reports tepid scheduling ratios | HIGH |
sub_7A4350 | ~500B | TepidSchedulingCompute -- computes tepid scheduling metrics | MEDIUM |
sub_754510 | ~200B | IsEpilogueBlock -- identifies epilogue blocks | MEDIUM |
Vtable Layout
| Phase | Index | Vtable Address | Name String Address |
|---|---|---|---|
| MarkAdditionalColdBlocks | 41 | off_22BDC30 | 0x22BC763 |
| OptimizeHotColdInLoop | 108 | off_22BE6A8 | 0x22BCD1D |
| OptimizeHotColdFlow | 109 | off_22BE6D0 | 0x22BCD33 |
All three vtables follow the standard 5-entry layout:
| Vtable Offset | Entry |
|---|---|
| +0 | execute(phase*, compilation_context*) |
| +8 | isNoOp(phase*) -> bool |
| +16 | getName(phase*) -> int |
| +24 | alloc(pool*, size) |
| +32 | free(pool*, ptr) |
Confidence Assessment
| Claim | Confidence | Evidence |
|---|---|---|
| Phase names and indices (41, 108, 109) | VERY HIGH | Static name table at off_22BD0C0, factory switch at sub_C60D30 |
| Vtable addresses | VERY HIGH | Computed from base off_22BD5C8 + index * 40 |
isHotMemoryOp / isColdMemoryOp identity | HIGH | Dual function structure, memory space checks, opcode patterns |
| Memory space codes (4=shared, 5=constant, 6=global) | HIGH | Confirmed across multiple consumers |
| Scheduling priority bit 5 = hot/cold | HIGH | Decompiled priority function at sub_8C9320 |
| Phase 41 runs before scheduling | VERY HIGH | Factory index and pipeline ordering table |
| Phases 108--109 run post-scheduling | VERY HIGH | Pipeline ordering table, position after OriRemoveNopCode |
| Knob 582 cold-region query in predication | HIGH | Decompiled predication pass at sub_1381010 |
| Block layout consumer at phase 112 | HIGH | sub_A92C50 identified via string xref to PlaceBlocksInSourceOrder |
| Cold-block flag in BB+28 | MEDIUM | Inferred from consumer patterns; exact bit position unconfirmed |
| Tepid scheduling ratios | HIGH | String evidence from decompiled sub_7A46E0 |
| PGO influence on cold marking | MEDIUM | Inferred from pipeline ordering (PGO at 20, cold marking at 41) |
Cross-References
- Pass Inventory -- phases 41, 108, 109, 112 in the complete 159-phase table
- Basic Blocks & CFG -- BasicBlock object layout, RPO computation, edge hash maps
- Scheduling Algorithm -- 8-bit priority encoding, hot/cold bit 5
- Scheduler Overview -- hot/cold classification in scheduling context
- Predication -- knob 582 cold-region gate
- Instruction Format -- instruction +72 opcode, +80 operand count, +84 operand array
- Optimization Pipeline -- dispatch loop and phase execution order
GMMA/WGMMA Pipeline
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The GMMA pipeline handles warpgroup matrix multiply-accumulate (WGMMA) instructions introduced with SM 90 (Hopper). Two dedicated compiler phases -- OriPropagateGmma (phase 85) and FixupGmmaSequence (phase 87) -- transform the IR to satisfy the hardware's strict pipelining requirements for asynchronous tensor-core operations. These are the only passes in ptxas whose sole purpose is WGMMA instruction handling.
WGMMA operates at warpgroup granularity (4 warps executing in lockstep). The hardware requires a specific sequencing protocol: wgmma.fence to open a pipeline stage, a sequence of wgmma.mma_async operations that share accumulator registers, wgmma.commit_group to close the stage, and wgmma.wait_group to synchronize on completion. Between the fence and wait, strict constraints govern which registers can be touched by non-WGMMA instructions. Violating these constraints forces the compiler to serialize the WGMMA pipeline, destroying throughput.
| Pipeline phases | 85 (OriPropagateGmma), 87 (FixupGmmaSequence) |
| Target architectures | SM 90+ (Hopper, Blackwell) |
| Phase 85 entry | sub_AE5030 (2,967 bytes) -- outer driver, SM gate check |
| Phase 85 core | sub_ADAD60 (2,170 bytes) -- accumulator propagation per instruction |
| Phase 87 entry | sub_AE4F70 (182 bytes) -- sequencing orchestrator |
| Phase 87 core | sub_ADEB40 (7,077 bytes) -- sequence fixup, warpgroup inject |
| Serialization warnings | sub_ACE480 (1,908 bytes) -- 10 distinct warning codes |
| Pipeline validation | sub_AE3D40 (2,511 bytes) -- sequence structural check |
| Accumulator collect | sub_ADA740 (146 bytes) -- gathers accumulator register set |
| Live range propagation | sub_ADBD30 (3,364 bytes) -- per-basic-block propagation |
| Phase name strings | 0x22BCB13 (OriPropagateGmma), 0x22BCB40 (FixupGmmaSequence) |
Hardware Background
Warpgroup Execution Model
A warpgroup consists of 4 consecutive warps (128 threads). WGMMA instructions execute cooperatively across all 4 warps, with each warp contributing a slice of the matrix operation. The hardware tensor core pipeline is decoupled from the main pipeline: wgmma.mma_async dispatches work to the tensor core and returns immediately, while the accumulator registers remain in-flight until a wgmma.wait_group completes.
The PTX-level instructions that constitute a WGMMA pipeline stage:
| PTX Instruction | Ori Opcode | Role |
|---|---|---|
wgmma.fence | (via handler sub_4DA380) | Opens a pipeline stage; prevents reordering across the fence |
wgmma.mma_async | 309 | Dispatches an asynchronous matrix multiply-accumulate |
wgmma.commit_group | (via handler sub_4DA4B0) | Closes the current pipeline stage |
wgmma.wait_group | (via handler sub_4DA5E0) | Waits for N committed groups to complete |
_warpgroup.arrive | 323 | Compiler-inserted warpgroup synchronization (arrive) |
_warpgroup.wait | 271 (masked & 0xFFFFCFFF) | Compiler-inserted warpgroup synchronization (wait) |
_warpgroup.commit_batch | -- | Compiler-inserted commit batch
The _warpgroup.* instructions (prefixed with underscore) are compiler-internal pseudo-operations inserted by ptxas, not directly written by the programmer. They map to SASS WARPGROUP.ARRIVE, WARPGROUP.WAIT, and WARPGROUP.DEPBAR instructions.
Accumulator Register Constraints
WGMMA accumulator registers are the output (D) operands of wgmma.mma_async. While a pipeline stage is open (between fence and wait), strict rules apply:
- No non-WGMMA definitions of accumulator registers. Another instruction cannot write to a register that a WGMMA in the current stage uses as an accumulator.
- No non-WGMMA reads of accumulator registers. Another instruction cannot read from an accumulator register between the producing WGMMA and the completing wait.
- No non-WGMMA definitions of WGMMA input registers. The A and B matrix input registers (including descriptor registers) must not be redefined by non-WGMMA instructions within the stage.
Violation of any constraint forces serialization -- the compiler collapses the pipeline to issue one WGMMA at a time with individual fence/commit/wait per operation.
Sparse GMMA
The binary contains support for sparse GMMA variants (structured sparsity). The string "Sparse GMMA with " at 0x1D0B430 appears in sub_494210 (2,276 bytes), which handles sparse matrix metadata validation. Sparse WGMMA uses an additional metadata operand encoding the 2:4 or other sparsity pattern.
Phase 85: OriPropagateGmma
Purpose
Phase 85 propagates WGMMA accumulator register liveness information through the IR. For each wgmma.mma_async instruction (Ori opcode 309), it identifies the accumulator register set and builds a compact encoding that downstream passes use to track which registers are "in-flight" at each program point. This information is consumed by phase 87 to determine where warpgroup.arrive and warpgroup.wait instructions must be injected.
SM Gate
The outer driver sub_AE5030 checks the target architecture before proceeding. At offset +1381 of the compilation context, a flag indicates whether the target supports WGMMA. The check at the function entry:
if (*(char*)(context + 1381) >= 0) // bit 7 clear = no WGMMA support
return;
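The `>= 0` comparison tests bit 7 because the byte is read as a signed char: any value with bit 7 set is negative. A quick model of the gate (offset 1381 is from the recovered check; the buffer-based context is a stand-in):

```python
# Why the signed-char comparison above tests bit 7: reading the WGMMA
# capability byte at ctx+1381 as signed makes bit-7 values negative.
import struct

def wgmma_supported(ctx) -> bool:
    (flag,) = struct.unpack_from("b", ctx, 1381)  # signed char
    return flag < 0  # bit 7 set -> negative -> WGMMA supported

ctx = bytearray(2048)
assert not wgmma_supported(ctx)   # 0x00: the pass returns early
ctx[1381] = 0x80                  # set bit 7
assert wgmma_supported(ctx)
```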
An additional mode check reads from the target descriptor at offset 26208 (within a 72-byte sub-structure at the descriptor's offset 72):
- Value 0: no WGMMA support -- skip entirely
- Value 1 with sub-field at 26216 nonzero: use the simple single-function path (sub_ADCA60)
- Otherwise: use the full pipeline analysis path
Accumulator Register Encoding
The core function sub_ADAD60 processes each wgmma.mma_async instruction and encodes its accumulator register set into a packed 32-bit word. The encoding uses the FNV-1a hash (prime 16777619, offset basis 0x811C9DC5) for register-set lookup in a hash table:
hash = 16777619 * (HIBYTE(reg_id) ^
(16777619 * (BYTE2(reg_id) ^
(16777619 * (BYTE1(reg_id) ^
(16777619 * ((uint8_t)reg_id ^ 0x811C9DC5)))))));
Accumulator entries are stored with a type tag in the high nibble:
- `0x90000000 | (encoded_accum & 0xFFFFFF)` -- source accumulator register set
- `0x10000000 | (encoded_accum & 0xFFFFFF)` -- destination accumulator register set
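The unrolled hash expression above is FNV-1a over the four bytes of reg_id, least-significant byte first (the innermost XOR consumes the low byte). A sketch of the hash and the type-tagged entries (the 32-bit truncation is implicit in the binary's unsigned arithmetic; here it is applied explicitly, and the function names are descriptive, not recovered):

```python
# Model of the recovered FNV-1a register-set hash and the high-nibble
# type tags applied to accumulator entries.
FNV_PRIME, FNV_BASIS, M = 16777619, 0x811C9DC5, 0xFFFFFFFF

def regset_hash(reg_id: int) -> int:
    """FNV-1a over the four bytes of reg_id, LSB first, mod 2**32."""
    h = FNV_BASIS
    for b in reg_id.to_bytes(4, "little"):
        h = ((h ^ b) * FNV_PRIME) & M
    return h

def tag_entry(encoded_accum: int, is_dest: bool) -> int:
    """Apply the high-nibble type tag to a 24-bit encoded accumulator set."""
    tag = 0x10000000 if is_dest else 0x90000000
    return tag | (encoded_accum & 0xFFFFFF)

assert tag_entry(0x123456, is_dest=False) == 0x90123456
assert regset_hash(1) != regset_hash(2)   # distinct ids hash apart
```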
Live Range Limit Check
After accumulator propagation, the pass checks whether the number of active GMMA live ranges exceeds the hardware limit. The limit is stored at offset 56 of the pass object (field *(DWORD*)(a1 + 56) = maxActiveGmmaLiveRanges). If exceeded, a diagnostic is emitted:
"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"
This diagnostic uses warning code 0x1CEF (7407). The limit is architecture-dependent and reflects the number of accumulator register banks available to the tensor core pipeline.
Call Chain
sub_AE5030 (2,967B -- SM gate, iteration over basic blocks)
└─ sub_ADCA60 (3,643B -- per-function pipeline analysis)
└─ sub_ADBD30 (3,364B -- per-block accumulator propagation)
└─ sub_ADAD60 (2,170B -- per-instruction accumulator encoding)
├─ sub_AD4500 -- hash table lookup for register set
├─ sub_AD4940 -- hash table insert/update
├─ sub_AD6280 -- register set cache insert
├─ sub_AD8E50 -- instruction iterator setup
├─ sub_AD0C50 -- begin accumulator iteration
├─ sub_AD3EA0 -- advance accumulator iterator
├─ sub_AD1FA0 -- advance to next accumulator slot
├─ sub_75A670 -- grow dynamic array (accumulator list)
└─ sub_895530 -- emit diagnostic warning
Accumulator Collection Helper
sub_ADA740 (146 bytes) collects the set of registers that are accumulators for a given instruction. It iterates over an instruction's operands, checking:
- Operand type tag: `(operand >> 28) & 7 == 1` (register operand)
- Not an immediate-flagged operand: `(byte_flag & 1) == 0`
- `reg_type == 6` at `vreg+64` (tensor/accumulator register class)
Matching registers are added to a bitvector-like set via sub_768AB0.
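The three operand checks can be expressed compactly. The struct layout below is a simplification for illustration (the real operand record lives inside the Ori instruction object); only the tested bit positions are taken from the decompilation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified operand record; field names are illustrative. */
typedef struct {
    uint32_t operand;   /* packed operand word: type tag in bits 28..30 */
    uint8_t  byte_flag; /* bit 0 set = immediate-flagged operand */
    uint8_t  reg_type;  /* value at vreg+64; 6 = tensor/accumulator class */
} operand_t;

/* Mirrors the predicate sub_ADA740 applies to each operand. */
static bool is_accumulator_operand(const operand_t *op) {
    if (((op->operand >> 28) & 7u) != 1u) return false; /* must be a register operand */
    if (op->byte_flag & 1u) return false;               /* immediate-flagged: skip */
    return op->reg_type == 6u;                          /* tensor/accumulator class */
}
```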
Phase 87: FixupGmmaSequence
Purpose
Phase 87 is the critical legalization pass. It analyzes WGMMA instruction sequences, verifies that the hardware pipeline constraints are satisfied, and inserts warpgroup.arrive / warpgroup.wait instructions where registers used by non-WGMMA instructions conflict with in-flight WGMMA accumulators. If the pipeline cannot be formed correctly, it triggers serialization and emits performance warnings.
Orchestrator: sub_AE4F70
The 182-byte wrapper orchestrates the complete fixup sequence:
sub_AE4F70 (FixupGmmaSequence orchestrator)
│
├─ [1] sub_ADEB40 -- primary sequence fixup (inject arrive/wait)
├─ [2] sub_ADA7E0 -- verify pipeline consistency
├─ [3] sub_AE3D40 -- structural validation of sequences
├─ [4] sub_AD8F90 -- secondary validation pass
├─ [5] sub_AE4710 -- finalize sequence metadata
├─ [6] sub_AE17C0 -- late pipeline consistency check
│
└─ On failure at any step:
├─ Set serialization flag: *(BYTE*)(context + 1920) = 1
├─ sub_ACE480 -- emit serialization warning
└─ sub_AE47B0 -- serialize the WGMMA pipeline (fallback)
The return value encodes the failure reason in the low 32 bits and a function identifier in the high 32 bits, which sub_ACE480 uses to select the appropriate warning message.
Primary Fixup: sub_ADEB40
This 7,077-byte function is the heart of the GMMA pipeline. Its logic:
1. Initialization. Allocates two dynamic arrays (v224/v225 for warpgroup.wait insertion points, i/v228 for warpgroup.arrive insertion points) and initializes them with sentinel values (0xFFFFFFFF).
2. First pass -- identify WGMMA sequences. Iterates over all instructions in the function's code list. For each instruction with opcode 309 (wgmma.mma_async):
- Collects the instruction's accumulator register set via the `sub_ACC0A0`/`sub_AD50B0` iterator pattern
- Checks whether each of the instruction's operands (positions 1--4) has already been marked with arrival/wait flags
- For unmarked operands, calls `sub_ADA740` to collect accumulator registers and add them to the tracking set
The pass checks operand flag bits at instruction + 84 + 8*operand_index + 4:
- Bit 0 (`& 1`): operand has been processed for arrive
- Bit 1 (`& 2`): operand has been processed for wait
- Bit 2 (`& 4`): operand requires a warpgroup.arrive/wait boundary
3. Second pass -- walk pipeline stages. For each WGMMA sequence identified in the compilation context's sequence table (context->field_99), the pass walks forward through basic blocks:
- Tracks the current pipeline stage state (`v206`: 0=initial, 1=arrived, 2=committed)
- When encountering a `wgmma.mma_async` (opcode 309), records it as part of the current stage
- When encountering a `_warpgroup.commit_batch` (opcode 323), marks the stage boundary and sets bit 2 on the last accumulator operand
- When encountering an `arrive` (opcode 271 masked) or `wait` (opcode 32 masked), updates the pipeline state
- When encountering a function call (opcode 236), forces a pipeline break
For non-WGMMA instructions within a stage, checks whether their register operands conflict with the active accumulator set by querying the bitvector (the balanced binary tree at v238). If a conflict is found, the instruction needs a warpgroup.arrive or warpgroup.wait to be injected before it.
4. Injection. Creates new instructions:
- `sub_ACBE60` creates `warpgroup.arrive` pseudo-instructions
- `sub_ACBF80` creates `warpgroup.wait` pseudo-instructions
These are added to the arrival/wait lists and later inserted into the code.
5. Commit pass. After analysis, iterates over the collected injection points:
- For each `warpgroup.arrive` insertion, checks whether the injection needs a diagnostic via `sub_ACBCA0` (knob-gated), then emits advisory warning `0x1D5F` (7519): "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
- For each `warpgroup.wait` insertion, emits advisory warning `0x1D5D` (7517): "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
6. Finalization. Calls sub_ADD8A0 (1,349 bytes) to rebuild the WGMMA sequence metadata after injection.
Pipeline Stage State Machine
The fixup pass maintains a state machine as it walks through instructions within a WGMMA sequence:
┌──────────────┐
│ state = 0 │ (initial / outside pipeline)
│ no active │
│ stage │
└──────┬───────┘
│ encounter wgmma.mma_async
▼
┌──────────────┐
│ state = 1 │ (in pipeline stage, arrived)
│ tracking │
│ accumulators│
└──────┬───────┘
│ encounter commit_batch
▼
┌──────────────┐
│ state = 2 │ (committed, waiting)
│ accumulators│
│ in-flight │
└──────┬───────┘
│ encounter wait or stage end
▼
┌──────────────┐
│ state = 0 │ (back to initial)
└──────────────┘
At any state, encountering a function call (opcode 236)
or a conflicting register use forces:
→ inject warpgroup.arrive/wait
→ potentially serialize the pipeline
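The transitions in the diagram reduce to a small switch. This is a behavioral sketch of the state machine only, using the recovered opcode and state constants; the event model and names are illustrative, and the real pass carries much more per-stage bookkeeping:

```c
#include <assert.h>

/* States match the recovered v206 values; opcodes match the Ori constants. */
enum { ST_INITIAL = 0, ST_ARRIVED = 1, ST_COMMITTED = 2 };
enum { OP_WGMMA = 309, OP_COMMIT = 323, OP_WAIT = 32, OP_CALL = 236 };

/* One step of the stage walk: returns the next state and flags a forced
 * pipeline break (serialization candidate) on a function call. */
static int gmma_step(int state, int opcode, int *force_break) {
    *force_break = 0;
    if (opcode == OP_CALL) { *force_break = 1; return ST_INITIAL; }
    switch (state) {
    case ST_INITIAL:   return (opcode == OP_WGMMA)  ? ST_ARRIVED   : ST_INITIAL;
    case ST_ARRIVED:   return (opcode == OP_COMMIT) ? ST_COMMITTED : ST_ARRIVED;
    case ST_COMMITTED: return (opcode == OP_WAIT)   ? ST_INITIAL   : ST_COMMITTED;
    }
    return ST_INITIAL;
}
```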
Register Conflict Detection
Register type 6 (vreg+64 == 6) is the tensor/accumulator register class. The conflict check compares operand register IDs against the active accumulator bitvector using a balanced binary search tree (v238 / v148 in the decompilation). The tree is keyed by register_id >> 8 (register bank) with a 64-bit bitmap per node tracking individual registers within the bank:
bit_index = register_id & 0x3F;
bank_offset = (register_id >> 6) & 3; // 0..3 for 4 64-bit words per node
is_conflict = (node->bitmap[bank_offset + 4] >> bit_index) & 1;
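The per-node lookup can be made concrete. Each tree node covers one bank of 256 registers (keyed by `register_id >> 8`), with four 64-bit bitmap words; in the decompilation those words sit at node word offsets +4..+7, which is why the index is `bank_offset + 4`. The struct below is a sketch of just the bitmap portion, with illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of one tree node's payload: words 4..7 hold the live bits for
 * the node's 256-register bank. The real node also carries tree links. */
typedef struct {
    uint64_t bitmap[8];
} gmma_node_t;

/* Record a register as an in-flight accumulator in its bank node. */
static void mark_live(gmma_node_t *node, uint32_t register_id) {
    node->bitmap[((register_id >> 6) & 3u) + 4] |= 1ull << (register_id & 0x3Fu);
}

/* The recovered conflict test, verbatim in structure. */
static bool is_conflict(const gmma_node_t *node, uint32_t register_id) {
    uint32_t bit_index   = register_id & 0x3Fu;
    uint32_t bank_offset = (register_id >> 6) & 3u;  /* 0..3 */
    return (node->bitmap[bank_offset + 4] >> bit_index) & 1u;
}
```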
Serialization Warnings
When the pipeline cannot be formed correctly, sub_ACE480 (1,908 bytes) emits one of 10 distinct performance warnings. The function receives a packed 64-bit error code: the low 4 bits select the warning case (1--10) and the high 32 bits identify the function that triggered the failure. The function name is resolved via a vtable callback: context->field_0->vtable[18]->method_1(context->field_0->vtable[18], function_id).
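The packed failure value decomposes as follows; the helper names are illustrative, but the bit layout (case selector in the low 4 bits, function identifier in the high 32) is as recovered:

```c
#include <assert.h>
#include <stdint.h>

/* Build the packed 64-bit failure value returned up to sub_AE4F70. */
static uint64_t pack_failure(uint32_t warning_case, uint32_t function_id) {
    return ((uint64_t)function_id << 32) | (warning_case & 0xFu);
}

/* What sub_ACE480 extracts to select the message and resolve the name. */
static uint32_t failure_case(uint64_t packed) { return (uint32_t)(packed & 0xFu); }
static uint32_t failure_func(uint64_t packed) { return (uint32_t)(packed >> 32); }
```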
Warning Emission Mechanism
Each warning is gated by a per-function flag at context->field_208 + 72 + 26280:
- Byte == 1 with DWORD at +26288 nonzero: emit via `sub_895530` (direct diagnostic with source location). Falls back to `sub_7EEFA0` (format-to-buffer, no location) if the source location callback at `context->vtable + 48` is null.
- Byte != 1 (default): emit via `sub_7FA2C0` (warning-once gate, keyed on the hex code at `context + 154`). If the gate passes (first occurrence for this function), emits via `sub_895670` (diagnostic through the `context->vtable + 128` callback). This prevents the same warning from being emitted multiple times for the same function.
All warnings use the prefix "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...".
Serialization Warning Table
| Case | Hex | Decimal | Message suffix | Source function |
|---|---|---|---|---|
| 1 | 0x1D55 | 7509 | ...the presence of Extern calls in the function '%s' | sub_ADEB40 |
| 2 | 0x1D56 | 7510 | ...wgmma pipeline crossing function boundary at a function call in the function '%s' | sub_ADEB40 |
| 3 | 0x1D57 | 7511 | ...insufficient register resources for the wgmma pipeline in the function '%s' | sub_ADA7E0, orchestrator fallback |
| 4 | 0x1D58 | 7512 | ...insufficient register resources for the function '%s' | orchestrator resource check |
| 5 | 0x1D59 | 7513 | ...non wgmma instructions defining input registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_ADEB40, sub_AE17C0 |
| 6 | 0x1D5A | 7514 | ...non wgmma instructions reading accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_AE17C0 |
| 7 | 0x1D5B | 7515 | ...non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_ADEB40, sub_AE17C0 |
| 8 | 0x1D5C | 7516 | ...ill formed pipeline stage in the function '%s' | sub_AE3D40 structural check |
| 9 | 0x1D5E | 7518 | ...program dependence on compiler-inserted WG.DP in divergent path in the function '%s' | sub_ADEB40 finalization |
| 10 | 0x1D60 | 7520 | ...program dependence on compiler-inserted WG.AR in divergent path in the function '%s' | sub_ADEB40 finalization |
Note: The hex codes are not contiguous. Codes 0x1D5D (7517) and 0x1D5F (7519) are advisory injection warnings, not serialization warnings (see below).
Advisory Injection Warnings
During successful (non-serialized) pipeline fixup, sub_ADEB40 emits advisory warnings when it injects warpgroup synchronization instructions. These are gated by a knob check in sub_ACBCA0 and by the per-block flag at bb_info + 282 bit 3:
| Hex | Decimal | Message |
|---|---|---|
| 0x1D5D | 7517 | "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'" |
| 0x1D5F | 7519 | "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'" |
These are informational: they indicate the compiler successfully handled a register conflict by inserting synchronization, without falling back to serialization.
Detailed Trigger Conditions
Case 1 (0x1D55): Extern calls prevent pipelining
Trigger. During the instruction walk in sub_ADEB40, a call instruction (Ori opcode 236) is encountered within a WGMMA pipeline stage, or an operand references a basic block with no instructions (opaque/extern function target). The compiler cannot verify that the callee preserves the accumulator register state.
Detection code. In sub_ADEB40: when opcode == 236 (function call), or when a callee basic block's instruction pointer is null (*(_QWORD*)v114 == 0), v206 is set to 1.
Code pattern that causes it:
wgmma.fence;
extern_function_call(); // <-- triggers case 1
wgmma.mma_async ...;
wgmma.commit_group;
wgmma.wait_group;
Fix. Mark the callee as __forceinline__ so the compiler can see its register usage. Move non-inlineable function calls outside the fence--wait region. Restructure the kernel so that no opaque calls occur between wgmma.fence and wgmma.wait_group.
Case 2 (0x1D56): Pipeline crosses function call boundary
Trigger. The bitvector conflict check finds a non-WGMMA instruction's register operand colliding with the active accumulator bitvector, at a point where the pipeline already has active state from a preceding call-boundary violation. Specifically, the register is looked up in the balanced binary tree (node->bitmap[bank_offset + 4] >> bit_index) and if the conflict bit is set while v206 was already zero, it is promoted to case 2.
Detection code. In sub_ADEB40 lines 418--426: after the accumulator bitvector lookup returns a match, v206 is set to 2 (the first conflict after a call boundary was detected).
Code pattern that causes it:
// Function A:
wgmma.fence;
wgmma.mma_async ...;
call function_B(); // pipeline spans across this call
wgmma.commit_group; // in function_B or after return
wgmma.wait_group;
Fix. Keep the entire fence--mma--commit--wait sequence within a single function. Do not split WGMMA pipeline stages across function boundaries.
Case 3 (0x1D57): Insufficient register resources for pipeline
Trigger. Three distinct paths produce this code:
- `sub_ADA7E0` returns 3 when its internal call to `sub_AD5120()` fails (line 233). This function attempts to propagate accumulator tracking through the FNV-1a hash table, and failure means the pipeline's register sets cannot be simultaneously tracked.
- `sub_AE3D40` (structural validation) returns with low byte 0, meaning `sub_ACE3D0()` rejected the pipeline structure. The orchestrator uses case 3 as the generic fallback (`v20 = 3` at line 66 of `sub_AE4F70`).
- `sub_AD8F90` (secondary validation) returns with low byte 0 similarly.
Code pattern that causes it:
// Too many concurrent accumulators
wgmma.fence;
wgmma.mma_async D0, ...; // accum set 0
wgmma.mma_async D1, ...; // accum set 1
wgmma.mma_async D2, ...; // accum set 2
// ... many more with distinct accumulators
wgmma.commit_group;
wgmma.wait_group;
Fix. Reduce the number of concurrent WGMMA operations with distinct accumulator register sets. Split large tile computations into smaller stages with intervening waits. Reduce accumulator tile dimensions.
Case 4 (0x1D58): Insufficient register resources for function
Trigger. The function's overall register pressure (including non-WGMMA code) is too high. The WGMMA pipeline requires dedicated accumulator register banks, and if the function's total register demand exceeds what is available after reserving the pipeline's needs, serialization is triggered.
Code pattern that causes it:
__global__ void kernel(...) {
float local_array[256]; // high register pressure
complex_computation(local_array);
wgmma.fence;
wgmma.mma_async ...; // needs accumulator regs too
wgmma.commit_group;
wgmma.wait_group;
}
Fix. Reduce register usage in the kernel: use shared memory for large arrays, reduce live variable counts, split the kernel into smaller functions. Compile with -maxrregcount to force spilling of non-critical values.
Case 5 (0x1D59): Non-WGMMA defines input registers
Trigger. Two paths:
- In `sub_ADEB40` (lines 960--990): for each non-WGMMA instruction within a pipeline stage, operand position 4 (WGMMA input operands) is checked. If a non-WGMMA instruction writes to a register that a WGMMA uses as matrix A or B input, and the write is in the same basic block (`v84+24 == v36[6]`) and after the WGMMA (`v84+52 > v36[13]`), the conflict is flagged.
- In `sub_AE17C0` (lines 384--386): `sub_AE0D20()` validates the pipeline's input register sets against arrive/wait annotations. Failure at either the arrive set (offset +69) or the wait set (offset +74) returns code 5.
Code pattern that causes it:
wgmma.fence;
// desc_a = make_descriptor(smem_ptr);
wgmma.mma_async D, desc_a, desc_b;
desc_a = make_descriptor(smem_ptr + offset); // <-- redefines input
wgmma.mma_async D, desc_a, desc_b; // uses redefined input
wgmma.commit_group;
wgmma.wait_group;
Fix. Compute all WGMMA input values (descriptors, pointers) before wgmma.fence. Use separate register variables for distinct input values within a single pipeline stage. If different tiles need different descriptors, pre-compute them all before entering the pipeline.
Case 6 (0x1D5A): Non-WGMMA reads accumulators
Trigger. Detected only by sub_AE17C0 (late consistency check), at two points:
- Lines 707--741: for each WGMMA instruction, operand 0 (accumulator) is examined via `sub_AD4BE0`/`sub_ACBB60`. If the accumulator data set is non-empty (`!sub_ACC3A0`), a non-WGMMA instruction reads from an in-flight accumulator register.
- Lines 870--885: the same check in a per-basic-block iteration context.
Code pattern that causes it:
wgmma.fence;
wgmma.mma_async D, A, B;
float val = D[0]; // <-- reads accumulator before wait
wgmma.commit_group;
wgmma.wait_group;
Fix. Move all reads of accumulator registers after wgmma.wait_group. The accumulator values are undefined until the wait completes. If the compiler cannot automatically insert a warpgroup.wait at the read point (e.g., divergent control flow), serialization occurs.
Case 7 (0x1D5B): Non-WGMMA defines accumulators
Trigger. Three paths:
- In `sub_ADEB40` (lines 994--1028): for each non-WGMMA instruction, operand position 3 is checked. If the operand is a register (not immediate, tag != `0x70000000`), and it belongs to the same basic block and pipeline stage, and the defining instruction's opcode (after masking) is not 309 (`wgmma.mma_async`), the conflict is flagged.
- In `sub_AE17C0` (lines 684--703): `sub_AD4CC0` checks WGMMA accumulator operands against the conflict set. If a match is found and the set is non-empty, code 7 is returned.
- In `sub_AE17C0` (lines 1296--1302): a catch-all at the end of the late validation walk.
Code pattern that causes it:
wgmma.fence;
D[0] = 0.0f; // <-- writes to accumulator
wgmma.mma_async D, A, B; // D is accumulator
wgmma.commit_group;
wgmma.wait_group;
Fix. Initialize accumulators before wgmma.fence, or use the WGMMA .useC mode to let the hardware handle accumulator initialization. Never write to accumulator registers from non-WGMMA instructions inside a pipeline stage.
Case 8 (0x1D5C): Ill-formed pipeline stage
Trigger. sub_AE3D40 (structural validation) detects that the fence/mma/commit/wait structure is malformed. The function walks the WGMMA sequence and checks structural properties via sub_ACE3D0. When the structure check fails (line 447), an error with low byte 0 is returned. The orchestrator maps structural failures to code 3 as fallback, but code 8 is emitted when sub_ADEB40 detects the stage state machine in an inconsistent state.
Code pattern that causes it:
wgmma.fence;
if (condition) {
wgmma.mma_async D, A, B;
wgmma.commit_group; // commit only on one path
}
wgmma.wait_group; // wait on all paths -- mismatch
Fix. Ensure each wgmma.fence is matched by exactly one wgmma.commit_group and one wgmma.wait_group on every control flow path. Keep pipeline stages in straight-line code. Do not use goto, early return, or conditional branches between fence and wait.
Case 9 (0x1D5E): WG.DP in divergent path
Trigger. During the finalization pass in sub_ADEB40 (lines 1308--1370), the compiler iterates over warpgroup.wait injection points. For each injection, it checks the basic block's convergence flag at bb_info + 282 bit 3. If bit 3 is NOT set (block is divergent) and v206 was previously zero, v206 is set to 9 with the function ID from the basic block at offset +200.
WG.DP = WARPGROUP.DEPBAR (dependency barrier), the SASS-level instruction that implements warpgroup.wait.
Code pattern that causes it:
wgmma.fence;
wgmma.mma_async D, A, B;
wgmma.commit_group;
if (threadIdx.x < 64) { // warp-divergent condition
use(D[0]); // compiler needs WG.DP here, but path is divergent
}
wgmma.wait_group;
Fix. Ensure WGMMA pipeline stages execute in uniform (non-divergent) control flow. Move conditional logic outside the fence--wait region. Use predication instead of branching for minor variations within a stage.
Case 10 (0x1D60): WG.AR in divergent path
Trigger. During the finalization pass in sub_ADEB40 (lines 1242--1306), the compiler iterates over warpgroup.arrive injection points. When the compiler needs to inject a warpgroup.arrive (to start a new pipeline stage after a conflict) but the injection point is in a divergent basic block, v206 is set to 10. This occurs at line 1302 when a knob-gated diagnostic check at sub_ACBCA0 indicates the injection is not suppressed but the block divergence prevents safe insertion.
WG.AR = WARPGROUP.ARRIVE (arrival barrier), the SASS-level instruction that synchronizes warpgroup warps before entering a pipeline stage.
Code pattern that causes it:
if (threadIdx.x < 64) { // divergent
wgmma.fence; // <-- compiler needs WG.AR, but divergent
wgmma.mma_async D, A, B;
wgmma.commit_group;
wgmma.wait_group;
}
Fix. Same as case 9. Keep pipeline stage entry points (fences) and exit points (waits) in uniform control flow. All warps in the warpgroup must execute the same WGMMA pipeline structure.
Orchestrator Error Code Flow
The orchestrator sub_AE4F70 calls validation functions in sequence. Each returns a packed 64-bit value with the error code in the low bits and a function identifier in the high 32 bits:
sub_AE4F70
│
├─ sub_ADEB40 (primary fixup)
│ returns: 1, 2, 5, 7, 9, 10 in low 4 bits
│ (0 = success)
│
├─ sub_ADA7E0 (pipeline consistency)
│ returns: 3 if FNV-1a accumulator tracking fails
│ (0 = success)
│
├─ sub_AE3D40 (structural validation)
│ returns: low byte 1 = pass, low byte 0 = fail
│ (orchestrator maps fail to case 3)
│
├─ sub_AD8F90 (secondary validation)
│ returns: low byte 1 = pass, low byte 0 = fail
│ (orchestrator maps fail to case 3)
│
├─ sub_AE4710 (finalize metadata) -- only on success
│
└─ sub_AE17C0 (late consistency)
returns: 5, 6, 7 in low bits
(0 = success)
Any nonzero result triggers the serialization path: *(BYTE*)(context->field_0->field_1584 + 1920) = 1, followed by sub_ACE480 (warning emission) and sub_AE47B0 (pipeline collapse).
The serialization fallback function sub_AE47B0 replaces the pipelined WGMMA sequence with individual fence/mma/commit/wait groups per operation, which is functionally correct but eliminates all overlap between tensor core operations.
Interaction with Register Allocation
The GMMA pipeline runs at phases 85/87, before register allocation (phase 101). This is by design -- the pass operates on virtual registers and needs to:
- Track accumulator live ranges before physical register assignment constrains placement
- Insert warpgroup.arrive/wait with freedom to position them optimally
- Propagate accumulator liveness to inform the register allocator about the extended live ranges that WGMMA creates
The live range limit check (warning code 0x1CEF) directly impacts register allocation: if too many WGMMA accumulators are simultaneously live, the register allocator will not have enough physical registers, and the pipeline must be serialized.
Phase 86 (InsertPseudoUseDefForConvUR) runs between the two GMMA phases. It inserts pseudo use/def instructions for uniform register conversion, which must account for the accumulator regions identified by phase 85.
Phase 88 (OriHoistInvariantsLate3) runs immediately after phase 87, exploiting the now-explicit pipeline boundaries as LICM barriers.
PTX Instruction Handlers
The PTX-to-Ori lowering registers four WGMMA-related handlers in sub_5D4190:
| PTX Mnemonic | Handler | Size |
|---|---|---|
| wgmma.mma_async | sub_50AC70 | 1,282 bytes |
| wgmma.fence | sub_4DA380 | 295 bytes |
| wgmma.commit_group | sub_4DA4B0 | 295 bytes |
| wgmma.wait_group | sub_4DA5E0 | 311 bytes |
The wgmma.mma_async handler is the largest, handling the complex operand encoding (matrix dimensions, data types, layout, scale factors, descriptor format). The fence/commit/wait handlers are thin wrappers producing single Ori instructions.
The internal warpgroup synchronization instructions (_warpgroup.arrive, _warpgroup.wait, _warpgroup.commit_batch) are registered separately as _mma.warpgroup-prefixed handlers at 0x466000--0x467900 (approximately 36 small ~96-byte handler functions covering the various warpgroup synchronization variants).
SASS Output
The Ori WGMMA instructions are encoded to the following SASS opcodes by the Mercury encoder:
| Ori Instruction | SASS Opcode | Description |
|---|---|---|
| wgmma.mma_async | WGMMA.MMA_ASYNC | Asynchronous warpgroup matrix multiply |
| wgmma.fence | WGMMA.FENCE | Pipeline fence |
| wgmma.commit_group | WGMMA.COMMIT_GROUP | Commit current group |
| wgmma.wait_group N | WGMMA.WAIT_GROUP N | Wait for N groups |
| _warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup arrival barrier |
| _warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup wait barrier |
| _warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |
The Mercury encoder at sub_62E890 (118 KB) handles the SASS-level encoding of warpgroup operations, referenced by strings "warpgroup-arrive", "warpgroup-wait", and "warpgroup-commit_batch" used as internal Mercury instruction tags.
Key Constants
| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Call opcode | 236 | Forces pipeline break |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator regs |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator encoding |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator encoding |
| FNV-1a prime | 16777619 | Hash function prime for register set lookup |
| FNV-1a offset | 0x811C9DC5 | Hash function offset basis |
| Live range warning | 0x1CEF | Warning code for excessive live ranges |
| Serialization base | 0x1D55 | First serialization warning code (extern calls) |
| Serialization end | 0x1D60 | Last serialization warning code (WG.AR divergent) |
| Advisory wait inject | 0x1D5D | Advisory: warpgroup.wait injected |
| Advisory arrive inject | 0x1D5F | Advisory: warpgroup.arrive injected |
Key Function Table
| Address | Size | Name / Role |
|---|---|---|
| 0xAE5030 | 2,967 | Phase 85 outer driver (SM gate, BB iteration) |
| 0xADCA60 | 3,643 | Phase 85 per-function pipeline analysis |
| 0xADBD30 | 3,364 | Phase 85 per-block accumulator propagation |
| 0xADAD60 | 2,170 | Phase 85 per-instruction accumulator encoding |
| 0xADA740 | 146 | Accumulator register collector |
| 0xAE4F70 | 182 | Phase 87 orchestrator |
| 0xADEB40 | 7,077 | Phase 87 primary sequence fixup |
| 0xADB5E0 | 1,867 | Phase 87 sequence metadata builder |
| 0xADD8A0 | 1,349 | Phase 87 post-injection metadata rebuild |
| 0xAE3D40 | 2,511 | Sequence structural validation |
| 0xAD8F90 | 2,924 | Secondary validation pass |
| 0xAE17C0 | 7,538 | Late pipeline consistency check |
| 0xAE47B0 | 1,975 | Serialization fallback (collapse pipeline) |
| 0xACE480 | 1,908 | Serialization warning emitter (10 codes) |
| 0xACBE60 | 279 | Create warpgroup.arrive instruction |
| 0xACBF80 | 279 | Create warpgroup.wait instruction |
| 0xACBCA0 | 191 | Knob-gated injection diagnostic check |
| 0x50AC70 | 1,282 | PTX handler: wgmma.mma_async |
| 0x4DA380 | 295 | PTX handler: wgmma.fence |
| 0x4DA4B0 | 295 | PTX handler: wgmma.commit_group |
| 0x4DA5E0 | 311 | PTX handler: wgmma.wait_group |
| 0x494210 | 2,276 | Sparse GMMA validation |
| 0x62E890 | 118,150 | Mercury encoder for warpgroup SASS ops |
Cross-References
- Pass Inventory -- phases 85, 87 in the 159-phase table
- Synchronization & Barriers -- warpgroup barriers, `DEPBAR` generation
- Register Model -- reg_type 6 (tensor/accumulator, allocator class 6)
- Register Allocator -- live range pressure from WGMMA accumulators
- Mercury Encoder -- SASS encoding of WGMMA instructions
- Uniform Register Optimization -- phase 86 between the two GMMA phases
- Loop Passes -- phase 88 LICM after GMMA fixup
- Late Legalization -- phase 93 catches ops exposed by GMMA passes
- SM Architecture Map -- SM 90+ architecture support
- Knobs System -- diagnostic gating for injection warnings
Uniform Register Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Four passes in the ptxas pipeline collectively manage the conversion of general-purpose register (R) values to uniform registers (UR) on sm_75+ targets. The UR register file is a dedicated 63-entry, 32-bit register bank shared across all threads in a warp: every thread reads the same value from a given UR. By routing warp-uniform computations through the UR file, ptxas reduces R-register pressure (the dominant occupancy limiter), enables the UR-specific ALU datapath, and avoids broadcasting the same value 32 times across the register file.
| Phases | 11, 27, 74, 86 |
| Phase names | ReplaceUniformsWithImm, AnalyzeUniformsForSpeculation, ConvertToUniformReg, InsertPseudoUseDefForConvUR |
| Target | sm_75+ (Turing and later) -- no-op on earlier architectures |
| Register file | UR: UR0--UR62 usable, UR63 = URZ (zero register); UP: UP0--UP6, UP7 = UPT |
| Hardware limit | 63 uniform GPRs, 7 uniform predicates per thread |
| Code Object field | +99 = UR count; +856 = UR liveness bitvector |
| Context flags | +1368 bit 1 = has-uniform; +1376 bit 4 = UR tracking enabled; +1378 bit 3 = has-UR-regs |
| Key knobs | 487 (general optimization gate), 628 (pre-allocation UR promotion), 687 (uniform register mode) |
| Related passes | OriPropagateVaryingFirst (53), OriPropagateVaryingSecond (70), OptimizeUniformAtomic (44), ConvertMemoryToRegisterOrUniform (sub_910840) |
Background: Uniform vs. Divergent Values
A value is uniform (warp-uniform) if every active thread in the warp holds the same value for that register at a given program point. A value is divergent if different threads may hold different values.
Sources of uniformity:
- Kernel parameters. All threads receive the same parameter values. Parameters loaded from constant memory (`LDC`) with a uniform address are uniform by construction.
- Constant memory loads. `LDC` with a uniform base address produces a uniform result.
- S2R of warp-uniform special registers. Registers like `SR_CTAID_X/Y/Z`, `SR_GRIDID`, and `SR_SMID` are uniform across the warp. `SR_TID_X/Y/Z` and `SR_LANEID` are divergent.
- Arithmetic on uniform inputs. If all source operands are uniform, the result of any pure ALU operation is uniform.
- Convergent control flow. A value defined before a divergent branch and used after reconvergence is still uniform if the definition was uniform.
Sources of divergence:
- Thread identity registers. `SR_TID_X/Y/Z` and `SR_LANEID` vary per thread.
- Memory loads from thread-dependent addresses. `LDG [R_addr]` where `R_addr` is divergent produces a divergent result.
- Phi merges across divergent branches. A `MOV.PHI` that merges values from two sides of a divergent branch is divergent even if each incoming value was individually uniform.
ptxas tracks uniformity through two complementary mechanisms: forward "varying" propagation (OriPropagateVarying, phases 53 and 70) marks registers as divergent, while the uniform analysis passes (this page) identify which remaining values are safe to move to the UR file.
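The forward "varying" rule described above (a pure ALU result is divergent iff any source is divergent, seeded by thread-identity sources) can be sketched over a straight-line block. This is an illustration of the dataflow rule only, not the recovered OriPropagateVarying implementation, and the toy instruction form is an assumption:

```c
#include <assert.h>
#include <stdbool.h>

#define NREGS 8

/* Toy three-address instruction: dst = op(src0, src1); -1 = unused source. */
typedef struct { int dst, src0, src1; } ins_t;

/* Forward pass: a destination is divergent iff any source is divergent.
 * dvg[] is seeded before the walk (e.g. the S2R SR_TID destination). */
static void propagate_varying(const ins_t *code, int n, bool dvg[NREGS]) {
    for (int i = 0; i < n; i++) {
        bool d = false;
        if (code[i].src0 >= 0) d |= dvg[code[i].src0];
        if (code[i].src1 >= 0) d |= dvg[code[i].src1];
        dvg[code[i].dst] = d;
    }
}
```

Loops and divergent branches require iterating this to a fixed point and poisoning phi merges, which the sketch omits.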
UR Hardware ISA
sm_75+ architectures provide a dedicated set of uniform-only SASS instructions that operate on UR/UP registers. These execute on the uniform datapath, which processes one value per warp instead of 32:
| SASS mnemonic | ROT13 in binary | Operation |
|---|---|---|
| UIADD3 | HVNQQ3 | Uniform 3-input integer add |
| UIMAD | HVZNQ | Uniform integer multiply-add |
| ULOP3 | HYBC3 | Uniform 3-input logic |
| UISETP | HVFRGC | Uniform integer set-predicate |
| USGXT | HFTKG | Uniform sign-extend |
| UPRMT | HCEZG | Uniform byte permute |
| UPOPC | HCBCP | Uniform population count |
| UBREV | HOERI | Uniform bit reverse |
| UP2UR | HC2HE | Uniform predicate to uniform register |
| UPLOP3 | HCYBC3 | Uniform predicate LOP3 |
| VOTEU | IBGRH | Uniform vote |
Blackwell (sm_100+) extends the uniform ISA with:
- `UFADD`, `UFFMA`, `UFSEL`, `UFSETP` -- uniform floating-point operations
- `UVIADDR` -- uniform virtual address computation
- `UCLEA`, `UCVTA`, `ULEPC` -- uniform address operations
- `UTMAPC`, `UTMALDG`, `UTMAPF`, `UTMAREDG` -- uniform TMA (tensor memory accelerator) operations
- `UBLKPC`, `UBLKRED`, `UBLKPF` -- uniform block operations
The R2UR instruction transfers a value from the R file to the UR file; UR2R does the reverse. These are the bridge instructions that ConvertToUniformReg inserts at file boundaries.
The SASS encoder at sub_7BC360 (126 callers) handles UR register encoding using the register-variant-B format, distinct from the main register encoder sub_7BC030. The decoder sub_7BD7D0 (4 callers) extracts UR operands with type=4 (uniform register). In the Mercury encoding layer, Major 0x0E (6 variants, sub_10C0550) encodes the uniform ALU instructions (UIADD3, ULOP3, etc.).
Phase 11: ReplaceUniformsWithImm
| Phase index | 11 |
| Pipeline position | Stage 1 (Initial Setup), after EarlyOriSimpleLiveDead (10), before OriSanitize (12) |
| Category | Optimization |
Purpose
Replaces uniform register reads with immediate constants when the value is known at compile time. This is the earliest uniform-related optimization in the pipeline, running before any loop or branch optimization.
Motivation
Kernel launch parameters are passed through constant memory. After PTX-to-Ori lowering, a kernel parameter access looks like:
LDC R3, c[0x0][0x160] // load parameter from constant bank
IMAD R4, R3, R5, RZ // use the parameter
If the compiler can prove that the constant bank address contains a known immediate (e.g., from .param directives with known offsets), the LDC is dead and the use can be folded:
IMAD R4, 42, R5, RZ // parameter replaced with immediate 42
This eliminates constant memory traffic and reduces register pressure by one register.
When It Fires
The pass is most effective for:
- Kernel parameters with known constant offsets
- Shared memory size constants
- Grid/block dimension constants when known at compile time
- Constant expressions that survive PTX-to-Ori lowering as LDC loads
The pass is gated by knob 487 (general optimization enablement).
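The LDC-to-immediate folding described above can be sketched on a toy tuple-based IR. This is a hedged illustration, not the real Ori data structures; the instruction shape `(dest, opcode, operands)` and the `known_const` map are assumptions:

```python
# Toy IR: each instruction is (dest, opcode, operands).
# known_const maps constant-bank offsets to compile-time-known values.
def fold_uniform_constants(insts, known_const):
    imm = {}   # dest register -> folded immediate value
    out = []
    for dest, op, args in insts:
        if op == "LDC" and args[0] in known_const:
            imm[dest] = known_const[args[0]]   # the LDC becomes dead
            continue
        # rewrite uses of folded registers as immediate operands
        args = [imm.get(a, a) for a in args]
        out.append((dest, op, args))
    return out

prog = [("R3", "LDC", [0x160]),             # load parameter from constant bank
        ("R4", "IMAD", ["R3", "R5", "RZ"])] # use the parameter
print(fold_uniform_constants(prog, {0x160: 42}))
# -> [('R4', 'IMAD', [42, 'R5', 'RZ'])]
```

As in the SASS example above, the constant-memory load disappears and the use reads the immediate directly, freeing one register.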
Phase 27: AnalyzeUniformsForSpeculation
| Phase index | 27 |
| Pipeline position | Stage 2 (Early Optimization), after OriRemoveRedundantBarriers (26), before SinkRemat (28) |
| Category | Analysis |
Purpose
Identifies uniform values that are safe for speculative execution. This analysis feeds subsequent passes that may hoist or speculatively execute instructions -- most immediately SinkRemat (phase 28) and SpeculativeHoistComInsts (phase 56).
Speculative Uniformity
A value is "speculatively uniform" if it would be uniform under all possible execution paths, not just the currently taken path. This is a stronger property than simple uniformity: a value that is uniform within one branch arm might not be speculatively safe to hoist above the branch if the other arm would produce a different value or a side effect.
The analysis must be conservative:
- Memory loads cannot be speculated unless the address is provably valid on all paths (no faults).
- Atomic operations are never speculative candidates.
- Values defined under divergent control flow require careful handling -- the analysis must determine whether the definition dominates all paths that could reach the speculation point.
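The conservativeness rules above amount to a simple rejection predicate. A minimal sketch under an assumed instruction model (the `Inst` fields and the `dominates_all_paths` callback are hypothetical, not recovered structures):

```python
from dataclasses import dataclass

@dataclass
class Inst:
    opcode: str
    is_load: bool = False
    address_provably_valid: bool = False
    under_divergent_flow: bool = False

UNSAFE_OPCODES = {"ATOM", "RED"}   # atomics are never speculative candidates

def is_speculation_candidate(inst, dominates_all_paths=lambda i: True):
    if inst.opcode in UNSAFE_OPCODES:
        return False
    if inst.is_load and not inst.address_provably_valid:
        return False   # load could fault on a path that never executes it
    if inst.under_divergent_flow and not dominates_all_paths(inst):
        return False   # def must dominate every path to the speculation point
    return True

print(is_speculation_candidate(Inst("IADD3")))              # True
print(is_speculation_candidate(Inst("ATOM")))               # False
print(is_speculation_candidate(Inst("LDG", is_load=True)))  # False
```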
Pipeline Position Rationale
Phase 27 runs after:
- Loop unrolling (22), which may duplicate uniform definitions
- SSA phi insertion (23), which provides single-definition reaching information
- Software pipelining (24), which may interleave loop iterations
- Barrier removal (26), which may relax synchronization constraints
And before:
SinkRemat(28), which uses the analysis to decide what can be sunk/recomputedGeneralOptimize(29), which benefits from knowing which values are uniform
Phase 74: ConvertToUniformReg
| Phase index | 74 |
| Pipeline position | Stage 4 (Late Optimization), after ConvertAllMovPhiToMov (73), before LateArchOptimizeFirst (75) |
| Category | Optimization |
| String reference | "ConvertToUniformReg" at 0x22BCA12 |
| Related function | sub_911030 (10,741 bytes, 56 callees) |
Purpose
The main UR promotion pass. Converts qualifying R-register values to UR registers, replacing per-thread general-purpose register storage with warp-uniform storage. This is the highest-impact uniform register optimization in the pipeline.
Pipeline Position Rationale
Phase 74 runs immediately after SSA destruction (ConvertAllMovPhiToMov, phase 73). This is deliberate:
- After SSA destruction: phi nodes have been converted to plain MOVs, giving the pass a clear view of all definitions and uses without phi-node complications.
- After varying propagation (phases 53 and 70): the divergence annotations are complete -- the pass knows which values are proven uniform.
- After predication (phase 63): if-conversion has already eliminated short branches, which may have exposed new uniform values.
- Before register allocation: UR conversion reduces R-register demand before the fat-point allocator runs (phase 101), directly improving occupancy.
- Before scheduling: the scheduler (phases 97+) can exploit UR-specific latency characteristics.
Conversion Criteria
A value qualifies for R-to-UR conversion when all of the following hold:

1. Uniformity: the value is proven warp-uniform -- all threads compute the same result. This is established by the varying propagation passes and the phase 27 analysis.
2. UR-expressible operation: the defining instruction has a uniform-datapath equivalent. Not all SASS instructions have UR variants. Operations like IMAD, IADD3, LOP3, ISETP, MOV, SEL, PRMT, SGXT, POPC, and BREV have UR counterparts. Complex operations like FFMA, LDG, STG, texture instructions, and atomics do not (until sm_100 added some uniform FP).
3. UR pressure budget: the conversion must not exceed the 63-register UR hardware limit. The pass tracks live UR count and aborts conversion for a value if it would push the UR pressure beyond the limit.
4. All uses accept UR sources: every consumer of the value must be able to read from the UR file. Some instructions have encoding restrictions that prohibit UR operands in certain source positions.
5. No cross-warp dependencies: the value must not participate in cross-warp communication patterns (e.g., shuffle instructions that explicitly exchange values between lanes).
Algorithm
The pass operates in two main phases:
Phase A -- Candidate identification. Walks the instruction list and marks each definition as a UR candidate based on the criteria above. For each candidate, it checks:
- The vreg+64 register file type is R (type 1 or 2, not already UR type 3)
- The varying propagation flag on the register indicates uniformity (bit 2 of vreg+49 clear)
- The defining opcode has a UR-equivalent instruction form
- All consumers of this register accept UR sources
Phase B -- Conversion. For each approved candidate:
- Changes the register's file type from R (type 1) to UR (type 3) at vreg+64
- Updates the register's allocator class from class 1 (R) to class 4 (UR) at vreg+12
- Rewrites the defining instruction to use the UR-variant opcode (e.g., IMAD becomes UIMAD)
- Inserts UR2R bridge instructions where a converted UR value flows into an instruction that requires an R-file source
- Inserts R2UR bridge instructions where an R-file value needs to flow into a converted UR instruction
- Updates the UR count at Code Object +99
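The Phase B rewrite can be sketched on a toy tuple IR. This is a hedged illustration only: the real pass mutates Ori IR register descriptors in place, while here promoted values simply get a `U` prefix, and the bridge-insertion policy is simplified (one bridge per crossing, no reuse):

```python
# Subset of the R-form -> UR-form opcode mapping (assumed, not exhaustive).
UR_FORM = {"IMAD": "UIMAD", "IADD3": "UIADD3", "LOP3": "ULOP3"}

def convert_to_ur(insts, promoted):
    out = []
    for dest, op, srcs in insts:
        if dest in promoted:
            # Promoted def: switch to the UR-variant opcode; any plain
            # R-file source is bridged into the UR file with R2UR.
            new = []
            for s in srcs:
                if s in promoted:
                    new.append(f"U{s}")
                elif s.startswith("R") and s != "RZ":
                    out.append((f"U{s}", "R2UR", [s]))   # R -> UR bridge
                    new.append(f"U{s}")
                else:
                    new.append(s)
            out.append((f"U{dest}", UR_FORM[op], new))
        else:
            # Unconverted consumer: a promoted source comes back through
            # a UR -> R bridge before the instruction reads it.
            new = []
            for s in srcs:
                if s in promoted:
                    out.append((s, "UR2R", [f"U{s}"]))
                    new.append(s)
                else:
                    new.append(s)
            out.append((dest, op, new))
    return out

prog = [("R3", "IADD3", ["R1", "R2", "RZ"]),   # uniform def, promoted
        ("R5", "IMAD",  ["R3", "R4", "RZ"])]   # varying consumer
for inst in convert_to_ur(prog, {"R3"}):
    print(inst)
```

The promoted definition becomes a UIADD3 in the UR file, and its varying consumer reads it back through a UR2R bridge.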
UR Pressure Management
The UR file has only 63 usable registers (UR0--UR62), compared to 254 for the R file. The pass must be conservative about how many values it converts:
- Greedy allocation with pressure cap: candidates are evaluated in program order (RPO). Each conversion increments a pressure counter. If the counter reaches the hardware limit, remaining candidates are skipped.
- Priority by benefit: conversions that save the most R-register pressure (long live ranges with many uses) are preferred.
- Retry mechanism: the scheduling infrastructure at sub_A0D800 supports a "retry without uniform regs" fallback (controlled by flag v63). If scheduling with UR-converted code fails to meet latency targets, the scheduler can request a re-run without UR conversion.
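The greedy, benefit-ranked selection under the hardware cap can be sketched as follows. The benefit metric (live-range length times use count) is an assumption for illustration, and the simple one-counter pressure model mirrors the document's description rather than a true max-live computation:

```python
UR_LIMIT = 63   # UR0..UR62 usable

def pick_ur_candidates(candidates):
    """candidates: list of (vreg name, live-range length, number of uses)."""
    # Priority by benefit: long, heavily used live ranges first (assumed metric).
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    chosen, pressure = [], 0
    for name, _, _ in ranked:
        if pressure + 1 > UR_LIMIT:
            break               # hardware limit reached: skip remaining candidates
        chosen.append(name)
        pressure += 1           # each conversion increments the pressure counter
    return chosen

cands = [("v1", 120, 9), ("v2", 4, 1), ("v3", 300, 2)]
print(pick_ur_candidates(cands))   # -> ['v1', 'v3', 'v2']
```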
Interaction with Register Allocation
The UR conversion reduces R-register demand but introduces UR-register demand. The fat-point allocator (phase 101) handles R and UR as separate register classes (class 1 and class 4 respectively), with independent allocation passes. The trade-off:
| R file | UR file | |
|---|---|---|
| Capacity | 254 usable | 63 usable |
| Pressure impact | Reduced by conversion | Increased by conversion |
| Occupancy impact | Positive (fewer R regs = higher occupancy) | Neutral (UR count does not affect warp occupancy on most SMs) |
| Spill cost | Spilled to local memory | Spilled to R file, then to local memory |
The allocator state at alloc+440 tracks the uniform register promotion flag (controlled by knob 628 and context flag +1414). When this flag is set, the pre-allocation pass (sub_94A020) enables UR-aware allocation.
Phase 86: InsertPseudoUseDefForConvUR
| Phase index | 86 |
| Pipeline position | Stage 5 (Legalization), after OriPropagateGmma (85), before FixupGmmaSequence (87) |
| Category | Lowering |
Purpose
Inserts pseudo use/def instructions to maintain correct liveness information for UR-converted registers. After ConvertToUniformReg (phase 74) converts values from R to UR, subsequent optimization and legalization passes may invalidate the liveness information. This pass inserts lightweight pseudo-instructions that prevent later passes from incorrectly eliminating UR definitions or extending UR live ranges beyond their intended scope.
Why Pseudo Instructions Are Needed
The UR conversion in phase 74 changes register file assignments, but does not update all downstream data structures. Between phase 74 and register allocation (phase 101), several passes run:
74 ConvertToUniformReg <-- UR conversion happens here
75 LateArchOptimizeFirst
76 UpdateAfterOptimize
77 AdvancedPhaseLateConvUnSup
78 LateExpansionUnsupportedOps
79 OriHoistInvariantsLate2
80 ExpandJmxComputation
81 LateArchOptimizeSecond
82 AdvancedPhaseBackPropVReg
83 OriBackCopyPropagate
84 OriPerformLiveDeadFourth <-- DCE could kill "unused" UR defs
85 OriPropagateGmma
86 InsertPseudoUseDefForConvUR <-- pseudo use/def insertion
87 FixupGmmaSequence
...
101 AdvancedPhaseAllocReg <-- register allocation
The critical problem: OriPerformLiveDeadFourth (phase 84) runs liveness analysis and dead code elimination. If a UR-converted value appears dead (no R-file use remaining because the uses were also converted), DCE would remove it. The pseudo use/def instructions inserted by phase 86 create artificial uses that keep UR definitions alive through DCE.
Pseudo Instruction Properties
The pseudo use/def instructions:
- Have no hardware encoding -- they are removed before SASS emission
- Carry register operand references that maintain the def-use chain
- Are transparent to instruction scheduling (zero latency, no functional unit)
- Are removed during post-RA cleanup or Mercury encoding
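The DCE hazard that motivates these pseudo-instructions is easy to reproduce on a toy backward-liveness DCE. A hedged sketch (tuple IR and opcode names are hypothetical):

```python
def dce(insts, side_effect_ops={"STG", "PSEUDO_USE"}):
    """Backward DCE: keep an instruction only if its dest is live or it
    has a side effect. PSEUDO_USE is modeled as a side-effecting use."""
    live, kept = set(), []
    for dest, op, srcs in reversed(insts):
        if op in side_effect_ops or dest in live:
            kept.append((dest, op, srcs))
            live.discard(dest)
            live.update(srcs)   # sources become live upward
    return list(reversed(kept))

prog = [("UR4", "UIADD3", ["UR1", "UR2", "URZ"])]
print(dce(prog))   # -> []  the UR def has no R-file use left, so it looks dead

prog.append((None, "PSEUDO_USE", ["UR4"]))
print(dce(prog))   # the artificial use keeps the UR def alive through DCE
```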
Convergent Boundary Interaction
The pass also interacts with the convergent boundary enforcement mechanism. The string "Missing proper convergent boundary around func call annotated with allowConvAlloc" (from sub_19D13F0) indicates that UR-converted values crossing function call boundaries require convergent allocation markers. The allowConvAlloc annotation on function calls triggers convergent boundary checking, and "Multiple functions calls within the allowConvAlloc convergent boundary" (sub_19C6400) warns when a convergent region contains more than one call.
The CONV.ALLOC pseudo-instruction (opcode 286 / 0x11E) is inserted by sub_19D7A70 to mark convergent allocation boundaries. This prevents the register allocator from assigning the same physical UR to values that are live across a convergent boundary where the UR might be redefined.
Varying Propagation (Supporting Analysis)
The OriPropagateVarying passes (phases 53 and 70) propagate divergence information forward through the IR. They are not part of the four-pass uniform register group, but provide the critical input data.
Phase 53 (OriPropagateVaryingFirst) runs before rematerialization (OriDoRematEarly, 54) and late expansion (55). It marks each register as either "uniform" or "varying" (divergent) by propagating divergence from known-divergent sources (thread ID registers, divergent memory loads) through the def-use chain. The propagation is a forward dataflow analysis: if any source operand of an instruction is varying, the destination is varying.
Phase 70 (OriPropagateVaryingSecond) repeats the analysis after predication (phase 63) and rematerialization (phase 69) may have changed the divergence landscape.
The varying flag is stored in the virtual register descriptor (bit 2 of vreg+49). During ConvertToUniformReg, only registers marked as non-varying are candidates for UR promotion.
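The forward propagation rule ("any varying source makes the destination varying") can be sketched as a fixpoint iteration over def-use information. A minimal model, assuming a flat `(dest, srcs)` instruction list rather than real Ori IR:

```python
def propagate_varying(insts, seeds):
    """insts: list of (dest, srcs). seeds: registers divergent by
    construction (thread-ID-derived values, divergent loads)."""
    varying = set(seeds)
    changed = True
    while changed:              # iterate to a fixpoint (handles loops/back-edges)
        changed = False
        for dest, srcs in insts:
            if dest not in varying and any(s in varying for s in srcs):
                varying.add(dest)
                changed = True
    return varying

prog = [("R1", ["TID"]),        # derived from the thread id -> varying
        ("R2", ["R0"]),         # from a uniform source      -> stays uniform
        ("R3", ["R1", "R2"])]   # mixes in a varying input   -> varying
print(propagate_varying(prog, {"TID"}))   # -> {'TID', 'R1', 'R3'}
```

Everything not in the returned set corresponds to a register with the varying bit clear, i.e. a potential UR promotion candidate.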
Uniform Atomic Optimization (Phase 44)
OptimizeUniformAtomic (phase 44) is a mid-pipeline optimization that converts thread-uniform atomic operations into warp-level reductions. When all threads in a warp perform the same atomic operation on the same address with the same value, the hardware can coalesce them into a single atomic. This pass detects such patterns and rewrites them using REDUX (reduction) or ATOM.UNIFORM instruction forms.
Code Object Uniform Register Tracking
The Code Object maintains several fields related to UR state:
| Offset | Field | Description |
|---|---|---|
| +99 | ur_count | Number of uniform registers allocated for this function |
| +832 | Main liveness bitvector | One bit per virtual register (R + UR combined) |
| +856 | UR liveness bitvector | Separate bitvector for UR/UP registers only |
| +1368 bit 1 | has-uniform flag | Set when the function uses any UR registers |
| +1376 bit 4 | UR tracking enabled | Controls whether scheduling tracks UR pressure |
| +1378 bit 3 | has-UR-regs flag | Secondary flag confirming UR register usage |
The scheduling dependency builder at sub_A0D800 (39 KB) tracks UR pressure separately. When +1376 bit 4 is set, the control word computation at sub_A09850 doubles the register count for uniform operands (v15 = type==3 ? 2 : 1) and writes a 9-bit register count to the control word bits [0:8].
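The register-count computation in sub_A09850 can be modeled in a few lines. This is a hedged sketch of the described behavior only: operand file types are taken as integers (3 = uniform register, matching `v15 = type==3 ? 2 : 1`), and the total lands in a 9-bit field at bits [0:8] of the control word:

```python
def pack_reg_count(operand_types):
    """Sum per-operand register counts: uniform operands (type 3) count
    double; pack the total into the 9-bit field at control word bits [0:8]."""
    count = sum(2 if t == 3 else 1 for t in operand_types)   # v15 = type==3 ? 2 : 1
    assert count < (1 << 9), "count must fit the 9-bit field"
    return count & 0x1FF

print(pack_reg_count([1, 1, 3]))   # two R operands + one UR operand -> 4
```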
The scheduling statistics printer (sub_A3A7E0) reports texture binding mode as "UR-bound" when textures are accessed via uniform-register-based descriptors:
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
Disallowed Uniform Register Diagnostic
The function sub_A465F0 (CodeObject::buildCodeObjectHeader, 2.6 KB binary) checks whether UR registers were used despite being disallowed. The diagnostic:
"Uniform registers were disallowed, but the compiler required (%d) uniform
registers for correct code generation."
This fires on pre-sm_75 targets where the UR file does not exist, or when a CLI option explicitly disables UR usage. Knob 687 controls the uniform register mode.
SM Architecture Availability
| SM range | UR support | UR ALU instructions | Uniform FP |
|---|---|---|---|
| sm_30 -- sm_72 | None | None | None |
| sm_75 -- sm_89 | UR0--UR62, UP0--UP6 | UIADD3, UIMAD, ULOP3, UISETP, UMOV, UPRMT, USGXT, UPOPC, UBREV | None |
| sm_90 -- sm_90a | UR0--UR62, UP0--UP6 | Full integer uniform ALU | None (LDCU requires -forcetext -sso) |
| sm_100+ | UR0--UR62, UP0--UP6 | Full integer + FP uniform ALU | UFADD, UFFMA, UFSEL, UFSETP, UVIADDR |
The LDCU (Load Constant Uniform) instruction is gated by architecture capability. The validation at sub_B28400 (345 bytes) checks:
"SM does not support LDCU. On SM90 -knob EmitLDCU is only supported when
options '-forcetext' and '-sso out.sass' are provided."
This check queries vtable+1336 for the LDCU capability.
ConvertMemoryToRegisterOrUniform
The function sub_910840 (ConvertMemoryToRegisterOrUniform, gated by knob 487) is a pre-allocation optimization that promotes stack-resident variables to registers, with the option of promoting to UR when the variable is proven uniform. It is not one of the four numbered phases but works closely with them.
| Entry | sub_910840 (2,100 bytes) |
| Core | sub_911030 (10,741 bytes, 56 callees) |
| Liveness builder | sub_905B50 (5,407 bytes) |
| Promotion transform | sub_90FBA0 (~4,000 bytes) |
| Gate knob | 487 |
| String | "ConvertMemoryToRegisterOrUniform" at 0x910897 |
The entry function checks knob 487 for enablement (via vtable+152 dispatch), builds def-use chains via sub_905B50, then calls sub_90FBA0 for the actual promotion.
The sub_911030 core function (10.7 KB) handles the "OrUniform" part -- it iterates through the variable list, checks variable properties (address space, type), and decides whether to promote to R or UR. The decision process involves:
- Checking the register's vreg+49 flags byte (bit 2 = uniform marker from sub_907870)
- Evaluating whether the variable's address space permits UR promotion
- Confirming that the defining and using instructions have UR-compatible forms
- Verifying UR pressure headroom
The per-register-class property accessors at sub_900C50--sub_9013F0 (6 nearly identical 391-byte functions, 2 callers each) provide the class-indexed lookups for the promotion decision.
Key Functions
| Address | Size | Function | Description |
|---|---|---|---|
| sub_910840 | 2.1 KB | ConvertMemoryToRegisterOrUniform | Promotes stack variables to R or UR registers (knob 487 gated) |
| sub_911030 | 10.7 KB | Core UR promotion logic | Iterates variables, decides R vs UR promotion based on uniformity |
| sub_905B50 | 5.4 KB | Liveness builder for promotion | Builds def-use chains for promotion analysis |
| sub_90FBA0 | ~4 KB | Promotion transform | Applies the actual memory-to-register transformation |
| sub_8FEAC0 | 2.1 KB | Per-BB pressure analyzer | Walks instruction list, decodes operand types, updates pressure via vtable+1824; called from sub_910840 |
| sub_A465F0 | 2.6 KB | CodeObject::buildCodeObjectHeader | Writes UR count into code object, checks disallowed-UR diagnostic |
| sub_B28E90 | small | isUReg | Predicate: is operand a uniform register? |
| sub_19D13F0 | 4.3 KB | Convergent boundary checker | Validates allowConvAlloc boundaries around function calls |
| sub_19C6400 | 330 B | Per-instruction convergent classifier | Callback: warns on opcode 159 within convergent boundary |
| sub_19D7A70 | 3.3 KB | CONV.ALLOC marker insertion | Inserts opcode 0x11E pseudo-instructions at convergent boundaries |
| sub_A0D800 | 39 KB | Scheduling dependency builder | Builds per-block dependency graph; tracks UR pressure via +856 bitvector |
| sub_A09850 | ~2 KB | Control word computation | Doubles count for uniform operands: type==3 ? 2 : 1 |
| sub_B28400 | 345 B | LDCU validator | Checks SM support for Load Constant Uniform |
| sub_7BC360 | ~1 KB | UR register encoder | Encodes UR operands in SASS instruction words (126 callers) |
| sub_7BD7D0 | ~1 KB | UR register decoder | Decodes UR operands from SASS instruction words (type=4) |
| sub_94A020 | ~3.5 KB | Pre-allocation setup | Sets alloc+440 UR promotion flag from knob 628 + context flag +1414 |
| sub_900C50 | 391 B | Register class property accessor | Per-class property lookup (one of 6 identical functions for GP, predicate, UR, etc.) |
Related Pages
- Register Model -- UR file, register descriptor layout, allocator classes
- Ori IR Overview -- instruction format, partial SSA window
- Pass Inventory -- complete 159-phase table
- Liveness Analysis -- bitvector infrastructure used by UR liveness tracking
- Rematerialization -- phases 28, 54, 69 (interact with speculation analysis)
- Predication -- phase 63, changes divergence landscape before UR conversion
- Register Allocator -- 7-class allocator handling R and UR independently
- GMMA Pipeline -- phases 85, 87 (adjacent to InsertPseudoUseDefForConvUR)
- GPU ABI -- convergent allocation, allowConvAlloc enforcement
- Scheduler Architecture -- UR pressure tracking in scheduling
Late Expansion & Legalization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas pipeline contains six legalization passes spread across the 159-phase sequence. Their collective job is to replace Ori IR operations that the target SM cannot execute natively with equivalent sequences of legal instructions. "Unsupported ops" means exactly this: operations that exist in the PTX ISA or internal Ori representation but have no single-instruction mapping on the compilation target. The replacement may be an inline expansion (a sequence of simpler instructions), a call to a libdevice helper function, or an SM-specific intrinsic sequence.
The six passes run at deliberately different pipeline positions because each intervening group of optimization passes can expose new unsupported operations or create new legalization opportunities.
| Passes covered | 6 (phases 5, 45, 55, 78, 93, 137) |
| Category | Lowering |
| Backend dispatch | Architecture-specific via two backend objects at context+0x630 and context+0x640 |
| Libdevice functions | 608 helper functions registered at sub_5D1660 (9,728-byte table from unk_1D4D940) |
| Legalization flag | SetAfterLegalization (phase 95) marks the point past which no unsupported ops should remain |
| Update pass | UpdateAfterConvertUnsupportedOps (phase 132, factory 8) rebuilds IR metadata after late expansion |
| Knob gates | Knob 499 (ConvertUnsupportedOps, LateExpansionUnsupportedOps), knob 487 (LateExpansion, SetAfterLegalization, LateExpansionUnsupportedOps), knob 214 / 464 (LateExpansionUnsupportedOps inner loop) |
Why Six Passes
A monolithic legalize-everything pass early in the pipeline would cripple optimization. Many optimizations (CSE, LICM, strength reduction, predication) work on high-level operation semantics. If div.rn.f64 were expanded into a 30-instruction Newton-Raphson sequence at phase 5, loop-invariant code motion at phase 35 would see 30 independent instructions instead of one hoistable division. Conversely, some unsupported operations only appear after optimization passes transform the IR: predication (phase 63) can create new predicated ops that need legalization, GMMA fixup (phase 87) can introduce new WGMMA-related sequences, and conditional flow merging (phases 133/136) can expose operations that were previously dead.
The six passes form a progressive legalization strategy:
| Phase | Name | Pipeline Position | Purpose |
|---|---|---|---|
| 5 | ConvertUnsupportedOps | Before optimization (stage 1) | Early legalization of obviously unsupported ops; preserves optimization opportunities for everything else |
| 45 | MidExpansion | After early/mid optimization (stage 3) | Target-dependent expansion after loop unrolling, strength reduction, and GVN have run |
| 55 | LateExpansion | After high-level optimizations (stage 4) | Expansion of ops that optimization passes should see in unexpanded form |
| 78 | LateExpansionUnsupportedOps | After all optimization (stage 5) | Catches remaining unsupported ops after predication, rematerialization, and uniform conversion |
| 93 | LateExpansionUnsupportedOps2 | After GMMA/attr passes (stage 5) | Second catch -- handles ops exposed by GMMA propagation, GMMA fixup, and register attribute setting |
| 137 | LateExpansionUnsupportedOpsMid | After late merge (stage 10) | Final catch between the two conditional flow merge passes |
Architecture Backend Dispatch
None of the six passes contain legalization logic directly. Each is a thin dispatcher that forwards to a virtual method on one of two architecture backend objects stored in the compilation context. The backend objects are constructed per-SM-target and provide the actual SM-specific legalization implementations.
Two backend objects:
| Context Offset | Used By | Role |
|---|---|---|
| context+0x640 | ConvertUnsupportedOps, LateExpansion | Outer backend -- wraps an inner object at +0x10, provides two-level dispatch |
| context+0x630 | MidExpansion, LateExpansionUnsupportedOps, LateExpansionUnsupportedOps2, LateExpansionUnsupportedOpsMid, SetAfterLegalization | SM backend -- single-level dispatch through vtable |
The two-level dispatch through context+0x640 allows the outer backend to override the entire legalization strategy (by replacing vtable slot 0), while the inner object provides the SM-specific implementation when the outer backend delegates. This separation exists because ConvertUnsupportedOps and LateExpansion may need to coordinate with higher-level compilation modes (e.g., library compilation, OptiX IR) that wrap the SM backend.
Backend Vtable Slots
The SM backend at context+0x630 dispatches legalization through these vtable offsets:
| Vtable Offset | Decimal | Called By |
|---|---|---|
| +0xB0 | 176 | MidExpansion |
| +0xD8 | 216 | LateExpansionUnsupportedOps2 |
| +0x108 | 264 | SetAfterLegalization |
| +0x178 | 376 | LateExpansionUnsupportedOps |
| +0x180 | 384 | LateExpansionUnsupportedOpsMid |
The outer backend at context+0x640 dispatches:
| Vtable Offset | Decimal | Called By |
|---|---|---|
| +0x00 | 0 | ConvertUnsupportedOps (type check -- compared against sub_661280) |
| +0x78 | 120 | ConvertUnsupportedOps (delegated to inner object) |
| +0x58 | 88 | LateExpansion (type check -- compared against sub_6612E0) |
| inner +0xE0 | 224 | LateExpansion (delegated to inner object) |
Pass Details
Phase 5 -- ConvertUnsupportedOps
Factory index: 5
Vtable: off_22BD690
execute(): sub_C60A20 (thunk -> context+0x640 dispatch)
isNoOp(): sub_C5F610 (returns 0 -- always runs)
Flag side-effect: sets context+1378 bit 0 (isConvertUnsupportedDone)
Knob gate: 499 (checked via sub_7DDB50)
Pipeline: Bracketed by AdvancedPhaseBeforeConvUnSup (4) and AdvancedPhaseAfterConvUnSup (7)
This is the earliest legalization pass, running at phase 5 before any optimization. It converts operations that are clearly illegal on the target SM into equivalent sequences. The pass always runs (isNoOp = false) and is unconditional -- every compilation executes it.
Dispatch mechanism. The execute function (sub_C60A20) reads the backend at context+0x640, checks whether vtable slot 0 is the default implementation (sub_661280), and either calls the overridden method directly or unwraps to the inner object at backend+0x10 and calls vtable offset +0x78 (120). This two-level indirection allows library-mode and OptiX-mode compilation to inject custom legalization logic.
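The two-level indirection can be modeled abstractly. A hedged Python sketch of the described control flow (class and method names are hypothetical; the real mechanism is raw vtable-pointer comparison, not Python dispatch):

```python
class InnerSMBackend:
    """Per-SM legalization implementation (the object at backend+0x10)."""
    def legalize_early(self, fn):          # stands in for vtable offset +0x78
        return f"sm-default legalization of {fn}"

class OuterBackend:
    """The object at context+0x640."""
    def __init__(self, inner, override=None):
        self.inner = inner                 # inner object at +0x10
        self.override = override           # non-None models vtable slot 0 != sub_661280

    def convert_unsupported_ops(self, fn):
        if self.override is not None:
            return self.override(fn)       # overridden strategy runs whole pass
        return self.inner.legalize_early(fn)   # unwrap and delegate to SM backend

plain = OuterBackend(InnerSMBackend())
print(plain.convert_unsupported_ops("kernel"))

# A library/OptiX-style compilation mode injecting custom legalization:
optix = OuterBackend(InnerSMBackend(),
                     override=lambda fn: f"optix legalization of {fn}")
print(optix.convert_unsupported_ops("kernel"))
```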
Flag effect. After execution, the pass sets bit 0 of context+1378, signaling to downstream passes that early legalization has completed. Passes like OriCreateMacroInsts (phase 8) check this flag to know whether certain patterns have already been lowered.
What gets legalized early: Operations that cannot survive optimization in their original form. Examples include operations that reference address spaces not supported on the target, certain modifier combinations that have no encoding, and PTX instructions that are syntactically valid but architecturally illegal (e.g., atom.add.f64 on targets without native FP64 atomics).
Phase 45 -- MidExpansion
Factory index: 51
Vtable: off_22BDDC0
execute(): sub_C5EFB0 (thunk -> context+0x630 vtable+0xB0)
isNoOp(): sub_C5EFD0 (returns 0 -- always runs)
Field side-effect: sets context+1552 = 3
Pipeline: After ExpandMbarrier (42), ForwardProgress (43), OptimizeUniformAtomic (44)
Before GeneralOptimizeMid2 (46)
MidExpansion runs after the CTA/mbarrier/barrier expansion passes and before the second mid-level GeneralOptimize bundle. It handles target-dependent expansions that must occur after barrier-related lowering but before the mid-level optimization cleanup.
Dispatch. Dispatches directly through the SM backend vtable at offset +0xB0 (176). No two-level indirection -- the SM backend provides the implementation directly.
Side effect. Sets context+1552 to 3. This field is the pipeline progress counter (not exclusively a legalization counter -- see Context Fields below) and is read by subsequent passes to determine which pipeline stages have completed. The value 3 indicates "mid-expansion complete."
Phase 55 -- LateExpansion
Factory index: 63
Vtable: off_22BDFA0
execute(): sub_C60AA0 (thunk -> context+0x640 dispatch)
isNoOp(): sub_C5EE20 (returns 0 -- always runs)
Field side-effect: sets context+1552 = 7 (via inner dispatch)
Pipeline: After OriDoRematEarly (54), before SpeculativeHoistComInsts (56)
Followed by GeneralOptimizeLate (58)
LateExpansion is the primary post-optimization legalization pass. It runs after all high-level optimizations (loop unrolling, strength reduction, GVN-CSE, reassociation, predication setup) have completed, expanding operations that were deliberately kept in high-level form for those passes.
Dispatch. Uses the outer backend at context+0x640. Checks vtable slot +0x58 (88) against the default (sub_6612E0). If overridden, calls the override. Otherwise, calls the inner object's vtable at +0xE0 (224) and then sets context+1552 = 7, advancing the pipeline progress counter.
What gets expanded here: This is the pass where most math library calls are introduced. Operations like div.rn.f64, sqrt.rn.f32, rcp.rd.f64 that were kept as single Ori instructions through optimization are now replaced with Newton-Raphson sequences or calls to the 608-function libdevice library. The SM20 library functions (division, square root, reciprocal, bit-field extract/insert) and SM70 functions (WMMA matrix operations, barrier reductions) are the primary candidates.
Optimization interaction. GeneralOptimizeLate (phase 58) runs immediately after, cleaning up the expanded sequences with copy propagation, constant folding, and dead code elimination. This is why expansion happens here rather than later -- the expanded code benefits from one more optimization round.
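To make the Newton-Raphson expansion concrete, here is the numerical core of such a sequence. This is an illustrative sketch of the mathematics, not ptxas's actual emitted code; on hardware the seed would come from an approximation instruction such as MUFU.RCP and each step would be an FMA pair:

```python
def nr_reciprocal(a, x0, steps=3):
    """Refine an initial reciprocal estimate x0 ~ 1/a.
    Each iteration x <- x * (2 - a*x) roughly doubles the correct bits."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - a * x)   # FMA-friendly refinement step
    return x

approx = nr_reciprocal(3.0, 0.3)   # crude seed for 1/3
print(approx)                      # converges toward 0.333333...
```

A division `b / a` then becomes `b * nr_reciprocal(a, seed)` plus rounding fixups, which is why a single div instruction expands into a multi-instruction sequence.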
Phase 78 -- LateExpansionUnsupportedOps
Factory index: 90
Vtable: off_22BE3D8
execute(): sub_C5EA50 (thunk -> context+0x630 vtable+0x178)
isNoOp(): sub_C5EA70 (returns 0 -- always runs)
Knob gate: 499 (via sub_7DDB50), plus flag check: context+1414 bit 2
Pipeline: After AdvancedPhaseLateConvUnSup (77), before OriHoistInvariantsLate2 (79)
The first of three "late unsupported ops" catches. It runs after all optimizations have completed (phases 13-76) and catches operations that optimization passes themselves introduced or exposed.
Gating. This pass has the most complex gating of the six. In addition to the standard knob 499 check (via sub_7DDB50), it also checks bit 2 of context+1414. If the bit is clear, the pass is skipped even though isNoOp returns false. This allows the backend to dynamically disable the pass when no unsupported ops were detected during earlier compilation phases.
Implementation. When active, calls sub_7917F0 which:
- Checks context+1382 bit 2 (another prerequisite flag)
- Checks knob 214 (via the capability dispatch at context+1664)
- If the function table at context+0 + 1056 is not yet initialized, calls the expansion setup functions (sub_785E20, sub_781F80, sub_7E6090, sub_7E6AD0)
- Iterates over basic blocks, applying per-instruction legalization with convergence check (knob 464 gates the inner loop)
This iterative structure -- expand, check if more work needed, repeat -- handles cascading expansions where expanding one operation exposes another unsupported operation.
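The expand-until-fixpoint structure can be sketched in a few lines. The rule table below is hypothetical (invented opcodes for illustration); the point is the convergence loop that re-scans after each round because an expansion can itself produce expandable operations:

```python
# Hypothetical expansion rules: one opcode rewrites into a legal sequence,
# but that sequence may contain another opcode that needs expanding.
EXPANSIONS = {
    "DIV64": ["RCP64", "MUL64"],                    # division via reciprocal-multiply
    "RCP64": ["MUFU_RCP", "NR_STEP", "NR_STEP"],    # reciprocal via NR refinement
}

def legalize(insts, max_iters=10):
    for _ in range(max_iters):
        out, changed = [], False
        for op in insts:
            if op in EXPANSIONS:
                out.extend(EXPANSIONS[op])   # cascading expansion
                changed = True
            else:
                out.append(op)
        insts = out
        if not changed:                      # convergence: no work left
            return insts
    raise RuntimeError("legalization did not converge")

print(legalize(["LDG", "DIV64", "STG"]))
# -> ['LDG', 'MUFU_RCP', 'NR_STEP', 'NR_STEP', 'MUL64', 'STG']
```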
Phase 93 -- LateExpansionUnsupportedOps2
Factory index: 109
Vtable: off_22BE6D0
execute(): sub_C5E790 (thunk -> context+0x630 vtable+0xD8)
isNoOp(): sub_C5E7B0 (returns 0 -- always runs)
Pipeline: After AdvancedPhaseAfterSetRegAttr (92), before FinalInspectionPass (94)
The second late catch, positioned after the GMMA/WGMMA passes (85-87), register attribute setting (90), and texture dependency analysis (91). These intervening passes can introduce new operations that need legalization:
- GMMA propagation (phase 85) may introduce WGMMA accumulator movement operations
- GMMA sequence fixup (phase 87) may insert hardware ordering instructions
- Register attribute setting (phase 90) may expose operations that become illegal once register classes are assigned
Dispatch. Uses the SM backend vtable at offset +0xD8 (216). The dispatch is architecture-dependent: the execute function reads vtable slot 12 (backend[12]), compares against a default implementation (sub_661310), and either calls the override or falls through to a two-step sequence that calls methods at offsets 280 and 3088 on an inner object.
Phase 137 -- LateExpansionUnsupportedOpsMid
Factory index: 93
Vtable: off_22BE450
execute(): sub_C607E0 (thunk -> context+0x630 vtable+0x180)
isNoOp(): sub_C5EA00 (returns 0 -- always runs)
Default check: compares vtable+0x180 against sub_7D6D50 -- if default, entire pass is no-op
Pipeline: After LateMergeEquivalentConditionalFlow (136), before OriSplitHighPressureLiveRanges (138)
The final legalization catch, positioned between the two conditional flow merge passes (133, 136) and the last-resort live range splitter (138). The merge passes can combine basic blocks in ways that create new instruction sequences containing unsupported operations.
Conditional execution. Unlike the other five passes, this one has a soft no-op mechanism: the execute function reads vtable slot +0x180 (384) and compares the function pointer against the default implementation (sub_7D6D50). If the backend has not overridden this slot, the pass returns immediately without doing any work. This means the pass is truly active only on SM targets that define a LateExpansionUnsupportedOpsMid handler -- typically newer architectures (Hopper/Blackwell) that have more complex merge and expansion interactions.
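The soft no-op mechanism is an ordinary compare-against-default-vtable-slot pattern. A minimal sketch (types and function names are hypothetical analogues of the recovered vtable+0x180 / sub_7D6D50 pair):

```c
#include <assert.h>
#include <stddef.h>

typedef struct Backend Backend;
typedef int (*expand_fn)(Backend *);

/* Analogue of sub_7D6D50: the default (un-overridden) implementation. */
static int default_expand(Backend *b) { (void)b; return 0; }
/* Analogue of an arch-specific override (e.g. a Hopper/Blackwell backend). */
static int arch_expand(Backend *b)    { (void)b; return 42; }

struct Backend { expand_fn mid_expand; };   /* analogue of slot +0x180 */

/* The pass body: if the slot still holds the default, return immediately
 * without doing any work; otherwise dispatch to the override. */
static int run_late_expansion_mid(Backend *b) {
    if (b->mid_expand == default_expand)
        return 0;                     /* soft no-op on this SM target */
    return b->mid_expand(b);
}
```

The function-pointer equality test is cheap, which is why this pass can afford to "always run" at the phase-manager level while remaining free on targets without a handler.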
Supporting Passes
Phase 95 -- SetAfterLegalization
Factory index: 111
Vtable: off_22BE720
execute(): sub_C5F8A0
isNoOp(): sub_C5E9C0 (returns 0 -- always runs)
Pipeline: After FinalInspectionPass (94), before ReportBeforeScheduling (96)
Not a legalization pass per se. It marks the compilation context as post-legalization by calling the SM backend's vtable at offset +0x108 (264). This sets the legalization_complete flag that downstream passes (scheduling, register allocation, encoding) check to assert that no unsupported operations remain. The pass is gated by optimization level: sub_7DDB50 returns the current optimization level, and the dispatch only fires at -O2 and above.
Phase 132 -- UpdateAfterConvertUnsupportedOps
Factory index: 8
Vtable: off_22BD708
execute(): sub_C5F570 (rep ret -- NOP)
isNoOp(): sub_C5F590 (returns 1 -- skipped by default)
Pipeline: First pass in Stage 10
A placeholder update pass that rebuilds IR metadata after late unsupported-op conversion. Its execute() is a NOP (rep ret) and isNoOp() returns 1 (true), so it is skipped by default. Architecture backends can override the vtable to activate it when late expansion produces structural changes requiring metadata rebuild.
Libdevice Function Library
The legalization passes replace unsupported operations with calls to a library of 608 predefined helper functions. These are not external libraries -- they are PTX function bodies embedded in the ptxas binary itself, compiled and linked into the output as needed.
The function table is initialized by sub_5D1660, which copies a 9,728-byte pre-built table from unk_1D4D940 and registers 608 function names in a hash map for lookup.
Library Function Categories
| SM Prefix | Count | Operations |
|---|---|---|
| __cuda_sm20_ | 70 | Division (f32/f64, all rounding modes), reciprocal (f32/f64, all rounding modes), square root (f32/f64), double-precision reciprocal sqrt, bit-field extract/insert 64-bit, integer division/remainder (s16/s64/u16/u64) |
| __cuda_sm3x_ | 4 | FP32 division with FTZ variants (Kepler-specific paths) |
| __cuda_sm62_ | 2 | DP2A, DP4A dot-product accumulate (pre-Volta emulation) |
| __cuda_sm70_ | 397 | Barrier operations (arrive/red/wait with 0-15 barrier IDs and count variants), WMMA matrix operations (204 variants for different shapes/types), warp shuffle sync, warp vote sync, match sync |
| __cuda_sm80_ | 3 | Cache policy creation (fractional, range encode) |
| __cuda_sm1xx_ | 18 | Bulk copy (unicast/multicast), async bulk tensor copy (1D-5D tile/im2col, unicast/multicast) |
| __cuda_sm10x_ | 16 | TCGen05 guardrail traps (bounds check, alignment, allocation), tcgen05 MMA operations, mask creation |
| __cuda_scalar_video_emulation_ | 7 | Video instruction emulation (operand extract, sign extend, saturate, merge) |
| __cuda_reduxsync_ | 18 | Redux-sync reductions (and/or/xor for b32, add/max/min for s32/u32/f32 with NaN/abs variants) |
| __cuda_sanitizer_ | 6 | Memory sanitizer checks (malloc/free/generic/global/local/shared/metadata) |
| Other | ~67 | Miscellaneous: dummy entries, user-function stubs, device synchronize |
SM-Dependent Legalization Examples
The core design principle: what is "unsupported" depends entirely on the target SM. An operation legal on one architecture may require library expansion on another.
Integer division/remainder. PTX div.s64 and rem.u64 have no single SASS instruction on any SM. They are always expanded to multi-instruction sequences via __cuda_sm20_div_s64, __cuda_sm20_rem_u64, etc. These are "sm20" functions because the expansion has been the same since Fermi.
FP32 division with rounding. div.rn.f32 on Turing (sm_75) uses a hardware-assisted Newton-Raphson (MUFU.RCP + refinement). On Kepler (sm_3x, no longer shipped but the code path remains), different refinement sequences are needed, using __cuda_sm3x_div_rn_ftz_f32 and its slowpath variant.
Barrier operations. On Volta+ (sm_70), barrier.arrive with a specific barrier ID and thread count is a single SASS instruction (BAR.ARV). On pre-Volta targets, these must be emulated; the __cuda_sm70_barrier_* library functions (part of the 397-function sm70 group) implement the semantic equivalent using older synchronization primitives.
WMMA/Tensor Core. Warp-level matrix multiply-accumulate (wmma.*) on sm_70 has dedicated hardware instructions (HMMA). The 204 __cuda_sm70_wmma_* variants cover the combinatorial explosion of shapes (m16n16k16, m8n32k16, m32n8k16), types (f16, bf16, tf32, s8, u8, s4, u4, b1), layouts (row/col), and accumulator types.
DP2A/DP4A. The integer dot-product-accumulate instructions have native hardware support starting at sm_61. On sm_62 (Xavier), they use __cuda_sm62_dp2a and __cuda_sm62_dp4a emulation routines.
Bulk tensor copy (Blackwell). The cp.async.bulk.tensor family on sm_100+ (Blackwell) supports 1D through 5D tile and im2col access patterns, with unicast and multicast variants. These 18 __cuda_sm1xx_cp_async_bulk_tensor_* functions provide the expansion for targets where hardware support is partial or absent.
TCGen05 guardrails (Blackwell). The 5th-generation tensor core operations (sm_100+) include runtime guardrail traps -- bounds checking, alignment validation, allocation granularity checks -- implemented as __cuda_sm10x_tcgen05_guardrail_trap_* functions inserted during legalization.
Context Fields
The legalization passes interact with several fields on the compilation context:
| Offset | Type | Description |
|---|---|---|
| +0x630 | void* | SM backend object (main legalization dispatch target) |
| +0x640 | void* | Outer backend object (wraps SM backend, used by ConvertUnsupportedOps and LateExpansion) |
| +1378 | byte | Bit 0: ConvertUnsupportedOps has run |
| +1382 | byte | Bit 2: prerequisite flag for LateExpansionUnsupportedOps |
| +1414 | byte | Bit 2: enable flag for LateExpansionUnsupportedOps |
| +1552 | int32 | Pipeline progress counter -- written by multiple passes across legalization, optimization, and post-RA stages (see value table below) |
| +1664 | void* | Capability dispatch object (knob/option queries) |
The pipeline progress counter at context+1552 provides a monotonically increasing value that downstream passes can check to determine which pipeline stages have completed. Despite being documented previously as a "legalization stage counter," it is written by passes outside the legalization family (rematerialization, backward copy propagation, architecture-specific peephole, post-RA finalization):
| Value | Writer | Phase | Function |
|---|---|---|---|
| 0 | Context constructor | -- | sub_7F7DC0 |
| 3 | MidExpansion | 45 | sub_C5EF80 |
| 4 | OriDoRematEarly | 54 | sub_C5EF30 |
| 7 | LateExpansion | 55 | sub_6612E0 |
| 8 | Peephole/ISel refinement (arch-specific) | varies | sub_849C60 |
| 9 | OriBackCopyPropagate | 83 | sub_C5EB80 |
| 10 | PostRAFinalizer (arch-specific) | varies | sub_88E9D0 |
| 12 | SetAfterLegalization | 95 | sub_C5E980 |
Downstream passes compare against these thresholds: sub_A11060 checks > 4 to enable cross-block rematerialization; sub_752CF0 checks <= 3; sub_766520 checks <= 11; sub_781F80 checks <= 12; sub_78B8D0 checks > 18.
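These threshold comparisons are simple integer gates on the context+1552 value. A sketch of two of the recovered checks (the predicate names are descriptive labels, not recovered symbols):

```c
#include <assert.h>

/* Analogue of sub_A11060's gate: cross-block rematerialization is
 * enabled only after the counter passes 4 (i.e. OriDoRematEarly and
 * LateExpansion have written their values). */
static int remat_cross_block_enabled(int progress) { return progress > 4; }

/* Analogue of sub_752CF0's gate: active only in the early pipeline,
 * before LateExpansion writes 7. */
static int early_phase_only(int progress)          { return progress <= 3; }
```

Because the counter is monotonically increasing, a single `>` or `<=` test is enough to establish which pipeline stages have completed.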
Pipeline Position Summary
Phase 0-4: Initial setup, FP16 promotion, CFG analysis
Phase 5: ConvertUnsupportedOps <-- LEGALIZATION #1
Phase 6-44: Optimization passes (branch, loop, strength reduction, GVN, barrier expansion)
Phase 45: MidExpansion <-- LEGALIZATION #2
Phase 46-54: Mid/late optimization (GVN-CSE, reassociation, predication setup, remat)
Phase 55: LateExpansion <-- LEGALIZATION #3
Phase 56-77: Late optimization (predication, commoning, LICM, remat, sync, phi destruction, uniform)
Phase 78: LateExpansionUnsupportedOps <-- LEGALIZATION #4
Phase 79-92: Post-opt (LICM, arch opt, back copy prop, GMMA, reg attrs)
Phase 93: LateExpansionUnsupportedOps2 <-- LEGALIZATION #5
Phase 94: FinalInspectionPass
Phase 95: SetAfterLegalization (marks legalization complete)
Phase 96-136: Scheduling, RA, Mercury, post-RA, late merge
Phase 137: LateExpansionUnsupportedOpsMid <-- LEGALIZATION #6
Phase 138: OriSplitHighPressureLiveRanges
Key Functions
| Address | Size | Role |
|---|---|---|
| sub_C60A20 | ~40B | ConvertUnsupportedOps execute dispatcher |
| sub_C5EFB0 | ~16B | MidExpansion execute dispatcher |
| sub_C60AA0 | ~50B | LateExpansion execute dispatcher |
| sub_C5EA50 | ~16B | LateExpansionUnsupportedOps execute dispatcher |
| sub_C607E0 | ~30B | LateExpansionUnsupportedOpsMid execute dispatcher |
| sub_C5E790 | ~16B | LateExpansionUnsupportedOps2 execute dispatcher |
| sub_C5F8A0 | ~30B | SetAfterLegalization execute |
| sub_7DDB50 | 232B | Optimization level gate (knob 499 check) |
| sub_7917F0 | ~400B | LateExpansionUnsupportedOps core implementation |
| sub_9059B0 | ~500B | LateExpansion core implementation (with expansion loop) |
| sub_5D1660 | ~8KB | Libdevice function table initializer (608 entries) |
| sub_785E20 | -- | Expansion setup (function table initialization) |
| sub_781F80 | -- | Expansion setup (mode configuration) |
| sub_7E6090 | -- | Instruction expansion driver |
| sub_7E6AD0 | -- | Instruction expansion driver (secondary) |
| sub_753600 | -- | Per-instruction legalization check |
| sub_753B50 | -- | Retry/convergence loop for iterative expansion |
| sub_13AF3D0 | 26,795B | Operand legalization dispatcher -- 164-case switch on opcode, called from sub_A29220 |
| sub_13A6280 | 1,289B | General operand materializer -- ensures operand is in legal register (called 83x) |
| sub_13A6AE0 | ~250B | Special-class operand materializer -- handles condition code and predicate classes |
| sub_13A7410 | ~50B | Try-inline-then-materialize wrapper -- checks sub_822750 before falling back |
| sub_13A6F90 | ~40B | Arch-immediate materializer -- like sub_13A7410 without pre-check |
| sub_13A45E0 | -- | Predicate operand materializer |
| sub_13A75D0 | -- | Uniform register conversion (class 6 to class 3) |
| sub_A29220 | -- | Pass driver that calls sub_13AF3D0 per instruction |
| sub_13ADB90 | 3,353B | Extended operand legalization variant (arch-specific override, vtable-dispatched) |
Operand Legalization Dispatcher
The SASS encoding backend cannot encode arbitrary operand forms. Before an instruction reaches the per-instruction encoder, every operand must be in a form the hardware encoding supports: a register in the correct class, an immediate that fits the bit-field width, or an absent-operand sentinel. The operand legalization dispatcher (sub_13AF3D0, 26,795 bytes) enforces these constraints. It is called once per instruction from the pass driver sub_A29220 and runs after ISel but before the SASS encoders.
Dispatcher Structure
The function reads the instruction opcode from field +72, masks off the predication flags (bits 12-13, mask & 0xCFFF), and enters a switch with 164 case labels covering Ori IR opcodes 0 through 352. Each case implements the legalization recipe for one opcode or a group of opcodes with identical operand layouts.
Before the switch, a pre-pass handles predicated instructions. If bit 12 of the opcode is set (indicating a predicate guard is present), the function first checks backend vtable slot +3232 for a custom handler. If none exists or it declines, sub_13A6AE0 is called on the predicate guard operand (at position operand_count - 2) to ensure it is in a legal register.
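The opcode normalization is a plain bit mask: 0xCFFF clears bits 12-13 (the predication flags) so the switch sees the base opcode. A minimal sketch (helper names are descriptive, not recovered symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Mask 0xCFFF = ~0x3000: clears bits 12-13 (predication flags) while
 * preserving the rest of the opcode field read from instruction +72. */
static uint16_t base_opcode(uint16_t raw)   { return raw & 0xCFFF; }

/* Bit 12 set indicates a predicate guard is present. */
static int      is_predicated(uint16_t raw) { return (raw & 0x1000) != 0; }
```

So a predicated form of opcode 0x89 arrives as 0x1089, and the dispatcher's switch still routes it to case 0x89 after masking.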
The switch routes to five categories of legalization logic:
Direct operand materialization. The majority of cases call sub_13A6280 on each operand that might need conversion. Example for a 3-source FMA (case 6):
sub_13A6280(context, instruction, 3, insert_point, ...) // src0
sub_13A7410(backend, instruction, 4, 1, insert_point, ...) // src1 (try inline first)
sub_13A6280(context, instruction, 5, insert_point, ...) // src2
// then check optional predicate operands 6,7 via sentinel test
Variable-length operand scanning. Case 16 (store) scans up to 15 operand slots, testing each against the 0x70000000 sentinel to find where active operands end before legalizing each one.
Architecture-specific delegation. Cases 70, 243, 245-247, 254-255, 257-259, 262 delegate entirely to vtable+2816. Cases 280-281 delegate to vtable+2328 with adjusted operand counts. These are SM-specific instructions (tensor core, WGMMA, bulk copy) where operand constraints vary by architecture.
Opcode rewriting. Case 137 (MOV) rewrites the opcode field itself: to 0x82 (130) for conditional MOV, or to 0x109 (265) for MOV-from-special-register when the source is in register class 4.
Passthrough. Cases 22, 24, 34, 38, 44, 45, 59, 73, 74, 77, 83, 106, 135, 161, 180, 182, 192, 194, 198, 209, 213-215, 221, 297, 352 and the default case require no operand legalization and exit immediately.
The 0x70000000 Null-Operand Sentinel
Each operand occupies an 8-byte slot in the instruction. The lower 4 bytes encode the operand value and type:
| Bits | Field | Values |
|---|---|---|
| [30:28] | Type | 1=register, 2=signed immediate, 3=unsigned immediate, 5=predicate, 7=null |
| [23:0] | Payload | Register index or immediate value |
| [31] | Negate | 1=operand is negated |
| +7 (byte) | Flags | Bit 0: uniform/constant bank reference |
The sentinel value 0x70000000 encodes type 7 ("null") with zero payload and no negation. It marks operand slots that are architecturally absent -- optional predicate guards not specified, trailing source operands of variable-width instructions, or unused operand positions in instructions with fewer sources than the maximum slot count.
The dispatcher tests for the sentinel with:
if ( ((*((_DWORD *)instr + offset) ^ 0x70000000) & 0x70000000) != 0 )
// operand is PRESENT -- legalize it
The XOR produces zero in bits [30:28] only when they are exactly 0b111 (type 7). The AND isolates those bits. If the result is zero, the operand is null and legalization is skipped. If non-zero, the operand is present and must be processed.
The function contains 59 references to 0x70000000. The heaviest user is case 16 (store), which chains 14 successive sentinel tests (at instruction offsets +84 through +196) to determine the store's vector width -- effectively implementing for each slot: if sentinel, stop; else legalize.
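The sentinel test and the store-width scan combine into a simple loop. This sketch reuses the dispatcher's exact XOR/AND test; the slot values and the function names are illustrative (the binary expresses the scan as 14 chained tests rather than a loop):

```c
#include <assert.h>
#include <stdint.h>

#define NULL_SENTINEL 0x70000000u

/* The dispatcher's test: present iff bits [30:28] are not exactly 0b111. */
static int operand_present(uint32_t slot_lo) {
    return ((slot_lo ^ NULL_SENTINEL) & NULL_SENTINEL) != 0;
}

/* Case-16-style scan: walk successive operand slots (lower DWORD of each
 * 8-byte slot) and stop at the first null sentinel. The count of active
 * slots is the store's vector width. */
static int store_data_width(const uint32_t *slots, int max_slots) {
    int n = 0;
    while (n < max_slots && operand_present(slots[n]))
        n++;                          /* present operand: legalize, continue */
    return n;
}
```

Note that any type-7 slot tests as null regardless of payload, because only bits [30:28] participate in the comparison.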
Operand Materialization Helpers
The dispatcher calls six helper functions depending on the operand class:
| Function | Calls | Role |
|---|---|---|
sub_13A6280 | 83 | General materializer. The core function. Checks if the operand can remain as-is (register in a legal class, or inline immediate that fits). If not, creates a MOV instruction via sub_92E800 to load the value into a fresh register, inserts it before the current instruction, and replaces the operand slot with a register reference (0x10000000 | reg_index). Short-circuits immediately for uniform registers (class 6). Uses sub_7DBC80 to test inline-immediate feasibility and sub_91D150/sub_91D160 for constant pool operations. |
sub_13A7410 | 15 | Try-inline-then-materialize. Checks sub_822750 first ("can this immediate be encoded inline for this arch?"). If yes, keeps the immediate. If no, tries sub_822990/sub_8229D0 for extended encoding paths. Falls back to sub_13A6280 only if all inline attempts fail. |
sub_13A6AE0 | 15 | Special-class materializer. Handles operands in non-standard register classes. For class 5 (predicate): returns immediately. For class 2 (condition code): creates a MOV with opcode 0x108. For immediates: calls sub_91D150 for constant pool lookup and replaces the operand. Used on predicate guard operands and instructions with condition-code sources. |
sub_13A6F90 | 7 | Arch-immediate materializer. Like sub_13A7410 but skips the sub_822750 pre-check. Used for operands where inline encoding is known to be architecture-dependent (texture coordinates, barrier IDs). |
sub_13A45E0 | 5 | Predicate materializer. Handles materialization of optional predicate operand slots, called exclusively after a sentinel test confirms the operand is present. |
sub_13A75D0 | 1 | Uniform register conversion. Called once (case 6, FMA) to handle uniform register class 6 operands that need conversion to general-purpose class 3. |
Materialization Flow (sub_13A6280 Detail)
The general materializer at sub_13A6280 (1,289 bytes) implements this decision tree for a single operand:
- Uniform register early exit. If the operand is a register (type 1) in class 6 (uniform), return immediately -- uniform registers are always legal in the encoding.
- Inline immediate check. If the operand is an immediate (type 2/3), call sub_7DBC80 to test whether the value fits in the instruction's immediate field. If it fits and passes the floating-point validity check (vtable+1504) and architecture encoding check (vtable+3248), keep the immediate as-is.
- Register reclassification. If the operand is a register in class 3 (general-purpose), query the architecture via vtable+1240 and vtable+904 to determine if the register should be reclassified to uniform class 6 (for data types with width <= 3 register slots).
- Data-type conversion. For boolean (sub_7D66E0) or floating-point (sub_7D6780) operand types, call vtable+904 to map the data type to the appropriate register class.
- Materialization. Call sub_92E800 to create a MOV instruction (opcode 0x82 = 130) that loads the constant/immediate into a new register. Insert it at the insertion point. Replace the operand slot: lower word becomes 0x10000000 | new_reg_index (type 1 = register), upper word is cleared with & 0xFEC00000.
- Insertion point update. If the insertion point a4 currently points to the instruction being legalized, advance it to the newly inserted MOV so subsequent materializations are ordered correctly.
Opcode Groups and Legalization Recipes
| Opcodes | Instruction Class | Operands Legalized | Notes |
|---|---|---|---|
| 2-7 | Arithmetic (ADD/MUL/FMA) | dst, src0, src1 [, src2] | FMA (6) has optional predicate slots checked via sentinel |
| 8 | LD (load) | Variable based on addressing mode | Operand count read from +80 |
| 10-11, 151-152, 290-291 | Compare/select | src0, src1 | Standard 2-source legalization |
| 16 | ST (store) | 1-15 data operands | Sentinel-scanned variable width |
| 32 | ATOM (atomic) | dst, addr, data | Specialized register conversion |
| 36 | TEX (texture) | coords + handle | Texture handle materialization |
| 42, 53, 55 | Shift/logic | src0 + try-inline src1 | sub_13A6280 + sub_13A7410 |
| 51 | PRMT (permute) | src0, control, src1 | sub_13A6F90 for arch-dependent control operand |
| 61 | Branch-conditional | Nested switch on modifier bits | 6 sub-cases for different branch forms |
| 70, 243-262 | Tensor/WGMMA/bulk | Delegated to vtable+2816 | Architecture-specific |
| 82, 166, 196 | FP convert | src + try-inline | sub_13A6280 + sub_13A7410 + optional sub_13A6F90 |
| 88-89 | ATOMS/ATOMG | Loop over sources | Per-source legalization with count |
| 110-121 | Wide arithmetic | src0, src1, src2 | 3 consecutive sub_13A6280 calls |
| 137 | MOV | Opcode rewrite | Rewrites to 0x82 or 0x109 based on register class |
| 230-232 | LD/ST extended | src + inline + arch | sub_13A6280 + sub_13A7410 + sub_13A6F90 |
| 270-289 | Control flow / misc | Variable | Several sub-groups with different patterns |
| 280-281 | Multi-source | Delegated to vtable+2328 | Operand count adjusted by -4 |
Architecture Override Points
The dispatcher provides three escape hatches for architecture-specific behavior:
| Vtable Offset | Decimal | Opcodes | Purpose |
|---|---|---|---|
+2816 | 0xB00 | 70, 243, 245-247, 254-255, 257-259, 262 | Full delegation for SM-specific instructions |
+2328 | 0x918 | 280-281 (+ other cases) | Multi-source instructions with adjusted operand counts |
+3232 | 0xCA0 | Pre-switch (predicated instructions) | Custom predicate guard handling |
The vtable+2816 handler receives (backend, instruction, insert_point, pass_context, mode_flag) and is expected to perform complete operand legalization for the instruction. The vtable+2328 handler receives an adjusted operand count (total - 4), suggesting these instructions have 4 fixed operands plus a variable source list.
Relationship to Legalization Passes
The operand legalization dispatcher operates at a different abstraction level than the six legalization passes described above. The legalization passes (phases 5-137) operate on the Ori IR, replacing unsupported operations with sequences of supported ones. The operand legalization dispatcher operates on individual operands within already-legal instructions, ensuring each operand is in a form the SASS encoder can bit-pack into machine code.
The dispatcher runs as part of the SASS encoding pipeline (called from sub_A29220), well after all six Ori-level legalization passes have completed. It is invoked per-instruction during the encoding walk, not as a standalone pass.
Ori legalization passes (phases 5-137)
Replace unsupported OPERATIONS with legal sequences
|
v
SASS operand legalization (sub_13AF3D0, during encoding)
Ensure each OPERAND of a legal instruction is encodable
|
v
SASS per-instruction encoders (522 functions)
Pack operands into binary instruction word
Cross-References
- Pass Inventory & Ordering -- Complete 159-phase table with legalization passes highlighted
- Phase Manager Infrastructure -- Phase factory, vtable layout, dispatch loop
- SM Architecture Map -- Per-SM capability tables driving legalization decisions
- GeneralOptimize Bundles -- Cleanup passes that run after expansion (phases 46, 58)
- GMMA/WGMMA Pipeline -- Phases 85, 87 that create work for LateExpansionUnsupportedOps2
- Synchronization & Barriers -- Barrier expansion (phase 42) that feeds MidExpansion
- Mercury Encoder -- Post-legalization encoding (must see only legal ops)
- Optimization Levels -- SetAfterLegalization gating by -O level
- Knobs System -- Knobs 214, 464, 487, 499 controlling legalization
- SASS Encoding Format -- Per-instruction SASS encoders that consume legalized operands
- Instruction Representation -- Ori IR operand layout (8-byte slots, type/payload encoding)
Allocator Architecture
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas register allocator is a fat-point greedy allocator, not a graph-coloring allocator. There is no interference graph, no Chaitin-Briggs simplify-select-spill loop, and no graph coloring in the main allocation path. Instead, the allocator maintains per-physical-register pressure histograms (512-DWORD arrays) and greedily assigns each virtual register to the physical slot with the lowest interference count. This design trades theoretical optimality for speed on the very large register files of NVIDIA GPUs (up to 255 GPRs per thread).
A secondary live-range-based infrastructure (~80 functions at 0x994000--0x9A1000) supports coalescing, splitting, and pre-coloring but feeds results into the fat-point allocator rather than replacing it.
| Entry point | sub_9721C0 (1086 lines) |
| Per-class driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry |
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Spill guidance | sub_96D940 (2983 lines) -- per-class priority queues |
| Spill codegen | sub_94F150 (561 lines) -- emit spill/reload instructions |
| Pre-coloring | sub_991790 (2677 lines) -- full-function pre-assignment |
| Address range | 0x8FE000 -- 0x9D3000 (~860 KB, ~950 functions) |
| Knobs | 87 OCG knobs (RegAlloc* / RegTgt* / RegUsageLevel, indices 613--699) |
Pipeline Position
The register allocator runs in the late pipeline, after all optimization passes and instruction scheduling preparation, but before final SASS encoding:
... optimization passes ...
Late Legalization / Expansion
AdvancedPhaseAllocReg gate <-- pipeline entry guard
HoistInvariants <-- sub_8FFDE0 (optional)
ConvertMemoryToRegisterOrUniform <-- sub_910840
Pre-coloring <-- sub_991790
Instruction lowering <-- sub_98F430 / sub_98B160
Register allocation entry <-- sub_9721C0
Per-class allocation x 6 <-- sub_971A90 for classes 1..6
Core fat-point allocator <-- sub_957160
Post-allocation fixup
Instruction scheduling
SASS encoding
Register Classes
The allocator processes 7 register classes. Class 0 (unified) is skipped in the normal per-class loop; it is used for cross-class constraint propagation. Classes 1--6 are allocated independently in order:
| ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 0 | -- | -- | -- | Unified / cross-class (skipped in main loop) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (extended uniform) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers (MMA/WGMMA) |
Barrier registers (B, UB) have reg_type = 9, above the <= 6 allocator cutoff, so they are handled by a separate mechanism.
Special registers that are always skipped during allocation:
- Indices 41--44: PT, P0--P3 (architectural predicates)
- Index 39: special register
The class ID is the reg_type value at vreg+64. The allocator distribution loop in sub_9721C0 reads this field directly and uses it as the bucket index.
Pair modes (vreg+48, bits 20--21): 0 = single, 1 = lo-half of pair, 3 = double-width (consumes two physical slots).
Entry Point: sub_9721C0
The top-level register allocation driver (1086 lines). Called once per function after the AdvancedPhaseAllocReg pipeline gate.
function regalloc_entry(alloc_state, compilation_ctx):
// 1. Rebuild liveness
rebuild_basic_blocks(compilation_ctx, 1) // sub_781F80
compute_liveness(compilation_ctx, 1) // sub_A10160
// 2. Initialize 7 register classes
for class_id in 1..6:
vtable[896](alloc_state, class_id) // init register file state
// 3. Sort instructions by priority
sort_instructions_by_priority(alloc_state) // sub_9375C0
// 4. Distribute vregs into per-class linked lists
for each vreg in function:
class = vreg.register_class
append(class_lists[class], vreg)
debug("\nREGALLOC GUIDANCE:\n")
// 5. Allocate each class independently
for class_id in 1..6:
alloc_with_spill_retry( // sub_971A90
alloc_state, compilation_ctx, class_id)
// 6. Post-allocation fixup
fix_load_opcode_187(alloc_state)
fix_call_saved_registers(alloc_state)
// 7. Handle OptixIR mode (ctx+896 == 4 or 5)
if is_optix_ir(compilation_ctx):
record_register_counts(compilation_ctx)
The entry point calls sub_789280 when a pre-allocation fixup bit (flag bit 2) is set, handles live-through-call register counting at lines 343--352, and sets up rematerialization lists at alloc_state[161..175].
Per-Class Driver: sub_971A90
The outer retry loop (355 lines) that wraps the core allocator with a two-phase strategy:
Phase 1 -- NOSPILL: Attempt allocation without allowing spills. Debug string: "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary).
Phase 2 -- SPILL: If NOSPILL fails, invoke spill guidance (sub_96D940) and retry with spilling enabled.
function alloc_with_spill_retry(alloc_state, ctx, class_id):
no_retarget = query_knob(638) // RegAllocNoRetargetPrefs (bool)
num_trials = query_knob(639) // RegAllocNumNonSpillTrials (int)
// Phase 1: NOSPILL
pre_allocation_pass(alloc_state) // sub_94A020
secondary_driver(alloc_state, ctx) // sub_95DC10
result = fatpoint_allocate(alloc_state, ctx, NOSPILL) // sub_957160
record_best_result(alloc_state, result) // sub_93D070
if result == SUCCESS:
return
// Phase 2: SPILL retry loop
for attempt in 1..num_trials:
guidance = compute_spill_guidance(ctx, attempt) // sub_96D940
result = fatpoint_allocate(alloc_state, ctx, SPILL)
record_best_result(alloc_state, result)
if result == SUCCESS:
break
if result == FAILURE:
final_fallback(alloc_state) // sub_936FD0
post_allocation_finalize(alloc_state) // sub_9714E0
For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates sub_939BD0 (spill setup) followed by sub_94F150 (spill codegen) before entering the retry loop.
Core Fat-Point Allocator: sub_957160
The central allocation function (1658 lines). This is where physical registers are actually chosen.
Data Structures
Two 2056-byte arrays (512 DWORDs + 2-DWORD sentinel each):
| Array | Role |
|---|---|
Primary (v12) | Per-physical-register interference count |
Secondary (v225) | Per-physical-register secondary cost (tie-breaking) |
Both arrays are zeroed with SSE2 vectorized loops at the start of each allocation round.
Algorithm
function fatpoint_allocate(alloc_state, ctx, mode):
maxRegs = alloc_state.hw_limit + 7 // from alloc+756
if mode == CSSA_PAIRED (6): maxRegs *= 2
if mode == CSSA (3): maxRegs *= 4
primary[512] = {0} // SSE2 memset
secondary[512] = {0}
threshold = query_knob(684) // RegAllocThresholdForDiscardConflicts, default 50
for each vreg in alloc_state.register_list: // linked list at +744
// Populate interference bitmaps for this vreg
build_interference_bitmaps(vreg, primary, secondary) // sub_957020
// Scan for minimum-pressure physical register
best_slot = -1
best_cost = MAX_INT
for slot in 0..maxRegs:
if primary[slot] > threshold:
continue // too congested
cost = primary[slot]
if cost < best_cost:
best_cost = cost
best_slot = slot
elif cost == best_cost:
// tie-break on secondary bitmap
if secondary[slot] < secondary[best_slot]:
best_slot = slot
if best_slot == -1:
emit_error("Register allocation failed with register count of '%d'")
return FAILURE
// Assign physical register
assign_register(alloc_state, ctx, mode, // sub_94FDD0
vreg, best_slot)
return alloc_state.register_count + 1
The interference threshold (RegAllocThresholdForDiscardConflicts, knob 684, default 50) is the key heuristic parameter. Slots with interference above this value are discarded (skipped entirely), forcing the allocator toward less-contested register slots even if they are not globally minimal.
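The scan-with-discard-threshold step can be isolated as a small function. This sketch mirrors the pseudocode above (minimum primary cost, secondary array as tie-breaker, slots above the knob-684 threshold discarded); the function name is descriptive, not a recovered symbol:

```c
#include <assert.h>
#include <limits.h>

/* Pick the least-contested physical slot. Slots whose interference count
 * exceeds `threshold` (RegAllocThresholdForDiscardConflicts, knob 684,
 * default 50) are skipped entirely. Returns -1 if no slot qualifies,
 * which corresponds to the allocation-failure path. */
static int pick_slot(const int *primary, const int *secondary,
                     int max_regs, int threshold) {
    int best_slot = -1, best_cost = INT_MAX;
    for (int slot = 0; slot < max_regs; slot++) {
        if (primary[slot] > threshold)
            continue;                            /* too congested: discard */
        if (primary[slot] < best_cost ||
            (primary[slot] == best_cost && best_slot >= 0 &&
             secondary[slot] < secondary[best_slot])) {
            best_cost = primary[slot];
            best_slot = slot;
        }
    }
    return best_slot;
}
```

The discard rule is what makes the allocator "greedy but spread out": a slot that is globally minimal can still lose to a slightly worse slot if its interference count crosses the threshold.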
Register Assignment: sub_94FDD0
The assignment function (155 lines) writes the physical register and propagates through alias chains:
function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
max_regs = regclass_info.max_regs // at +16
if slot >= max_regs and not vreg.is_spilled(): // flag 0x4000
vreg.set_needs_spill() // flag 0x40000
return
if vreg.needs_spill(): // flag 0x40000
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, vreg) // sub_94F150
return
// Non-spill path: commit assignment
consumption = compute_consumption(vreg) // sub_939CE0
update_peak_usage(alloc, consumption)
vreg.physical_register = slot
// Check for pre-allocated candidate
apply_preallocated_candidate(alloc, vreg) // sub_950100
// Propagate through alias chain
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot
alias = alias.alias_parent
Register consumption computation (sub_939CE0, 23 lines) accounts for paired registers: it returns assignment + (1 << (pair_mode == 3)) - 1, effectively consuming two slots for double-width registers.
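The consumption formula is compact enough to restate directly (function name is descriptive, not a recovered symbol):

```c
#include <assert.h>

/* sub_939CE0's formula: pair_mode == 3 (double-width) consumes two
 * physical slots, anything else consumes one. The return value is the
 * highest slot index touched by the assignment. */
static int register_consumption(int assignment, int pair_mode) {
    return assignment + (1 << (pair_mode == 3)) - 1;
}
```

For a single register assigned to slot 10 this yields 10; for a double-width pair it yields 11, reflecting the second consumed slot.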
Constraint System
The fat-point interference builder (sub_926A30, 4005 lines) processes 15+ constraint types extracted from instruction operand descriptors. Each operand encodes: bits 28--30 = operand type, bits 0--23 = register index.
| Type | Name | Description |
|---|---|---|
| 0 | Point interference | Single-instruction conflict at a specific program point |
| 1 | Register operand | Standard read/write interference |
| 2 | Immediate operand | No register interference generated |
| 3 | Paired register | Double-width; bit 23 distinguishes hi/lo half |
| 4 | Exclude-one | Specific physical register excluded from assignment |
| 5 | Exclude-all-but | Only one physical register permitted |
| 6 | Below-point | Interference active below the current program point |
| 7 | Range | Interference over an interval of program points |
| 8 | Phi-related | CSSA phi instruction (opcode 195) constraint |
| 9 | Barrier | Barrier register class constraint |
| 10--15 | Extended | Additional constraint variants |
The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops for bulk interference weight accumulation and dispatches through 7+ vtable entries for OCG knob queries.
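The quoted parameters match the standard 32-bit FNV-1a variant, which can be reproduced directly (the exact key type ptxas feeds into the hash is not recovered; bytes are used here for illustration):

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = 0x811C9DC5                            # offset basis (the "seed" above)
    for b in data:
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF       # FNV prime, wrapped to 32 bits
    return h

print(hex(fnv1a_32(b"")))   # 0x811c9dc5
print(hex(fnv1a_32(b"a")))  # 0xe40c292c
```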
Spilling Overview
Spilling triggers when the fat-point allocator cannot find a physical register within the budget. The subsystem has three components:
Spill guidance (sub_96D940, 2983 lines): Computes which registers to spill and in what order. Builds a 7-element guidance array (one per register class), each backed by an 11112-byte working structure containing 128-element bitmask arrays. Constructs priority queues of spill candidates using bitvector-based live range analysis. The function contains 7 near-identical code blocks (one per class), likely unrolled from a template.
Spill codegen (sub_94F150, 561 lines): Emits actual spill/reload instructions. Allocates a per-register spill info array (12 bytes per entry, initialized to {0, -1, -1}). Default spill cost is 15.0, reduced to 3.0 for certain architecture modes. Handles loop nesting via block frequency callbacks (vtable offset +8) and provides special handling for uniform registers (bit 0x200 in flags).
Spill memory targets:
| Target | Description |
|---|---|
| LMEM (local memory) | Default spill destination. Per-thread private memory. |
| SMEM (shared memory) | Alternative spill destination. Faster but shared across CTA. Assertion: "Smem spilling should not be enabled when functions use abi." |
Spill setup (sub_939BD0, 65 lines) selects configuration based on RegAllocEstimatedLoopIterations (knob 623) and the cost threshold at alloc+776:
| Condition | Bucket size | Alignment | Max size |
|---|---|---|---|
| Cost threshold == 0 | 8 | 4 | 1 MB |
| Cost threshold != 0 | 16 | 16 | 1 MB |
See Spilling for the full spill subsystem analysis.
Pre-Allocation and Mem-to-Reg
Two important pre-passes run before the main allocator:
ConvertMemoryToRegisterOrUniform
Entry: sub_910840 (327 lines). Promotes stack variables to registers or uniform registers. Gated by sub_8F3EA0 (eligibility check) and NumOptPhasesBudget (knob 487, budget type).
- sub_910840 -- entry (string: "ConvertMemoryToRegisterOrUniform")
- sub_905B50 (1046 lines) -- build promotion candidates
- sub_911030 (2408 lines) -- detailed analysis engine (def-use chains, dominance)
- sub_90FBA0 (653 lines) -- execute promotion, insert phi nodes
- sub_914B40 (1737 lines) -- post-promotion rewrite / phi-resolution
Pre-Allocation Pass
Entry: sub_94A020 (331 lines). Assigns physical registers to high-priority operands before the main allocator runs. Gated by RegAllocMacForce (knob 628, bool), RegAllocMacVregAllocOrder (knob 629, int), and RegAllocCoalescing (knob 618, bool).
For allocation modes 3, 5, or 6: iterates basic blocks calling sub_9499E0 (per-block scanner) and sub_93ECB0 (per-operand pre-assigner). Priority levels from RegAllocPrefMacOperands (knob 646): 1 = read operands, 2 = write operands, 3 = both.
Uses an opcode eligibility bitmask table (shift-based membership test on opcode - 22) to filter which instructions are candidates for pre-assignment.
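A plausible reconstruction of the shift-based membership test (the mask contents below are invented for illustration; only the opcode - 22 biasing and the shift-and-test idiom are recovered):

```python
# Hypothetical 64-bit eligibility mask: bit i set => opcode (22 + i) is a
# pre-assignment candidate. The real table contents are not recovered.
ELIGIBLE_MASK = (1 << 0) | (1 << 5) | (1 << 17)

def is_preassign_candidate(opcode: int) -> bool:
    """Bias the opcode by 22, then test the corresponding mask bit."""
    idx = opcode - 22
    if idx < 0 or idx > 63:
        return False
    return (ELIGIBLE_MASK >> idx) & 1 == 1

print(is_preassign_candidate(22))  # True  (bit 0)
print(is_preassign_candidate(27))  # True  (bit 5)
print(is_preassign_candidate(23))  # False
```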
Live Range Infrastructure
An interval-based live range system at 0x994000--0x9A1000 (~80 functions) supports auxiliary operations. This is not the main allocator but feeds results into it:
| Subsystem | Range | Count | Key Functions |
|---|---|---|---|
| Live range primitives | 0x994000--0x996000 | ~25 | Constructor, interval queries, weight, color get/set |
| Interference graph | 0x996000--0x99A000 | ~18 | Node/edge construction, adjacency, degree, coloring |
| Range operations | 0x99C000--0x9A1000 | ~35 | Merge, split, interference add/remove, copy detection |
| Register coalescing | sub_9B1200 | 1 | Copy elimination pass (800 lines) |
| Live range splitting | sub_9AEF60 | 1 | Interference graph update (900 lines, self-recursive) |
| Range merge engine | sub_9AD220 | 1 | Coalescing with cost heuristics (700 lines) |
| Range construction | sub_9A5170 | 1 | Build ranges from def-use chains (750 lines) |
Allocator State Object Layout
Full reconstruction from the constructor sub_947150 (1088 lines), cross-referenced with the core allocator, per-class driver, entry point, and spill subsystem. The object is at least 1752 bytes (the last initialized field is the 8-byte pointer at +1744). The constructor is called once per function before the allocation pipeline runs.
Header and Compilation Context (+0 -- +24)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 8 | ptr | &off_21E1648 | Vtable pointer (strategy dispatch, 40+ virtual methods) |
| +8 | 8 | ptr | arg | Compilation context (parent object) |
| +16 | 8 | ptr | off_21DBEF8 | Secondary vtable (allocation sub-strategy) |
| +24 | 8 | ptr | ctx->func | Function object pointer (from ctx+16) |
Pre-Allocation Candidate Tables (+32 -- +443)
Arena-allocated hash tables for pre-assigned registers. Each table is a 3-QWORD header {base, size, capacity} plus an arena node (24 bytes, allocated from the function memory pool with an incrementing class tag).
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +32 | 8 | ptr | 0 | Pre-alloc candidate list A head |
| +40 | 8 | ptr | 0 | Pre-alloc candidate list B head |
| +48 | 4 | DWORD | 0 | Pre-alloc candidate count A |
| +56 -- +208 | 160 | -- | 0 | Per-class registration slots (6 x {ptr, ptr, DWORD} = 24B each) |
| +216 | 8 | ptr | 0 | Registration slots tail |
| +224 | 8 | ptr | alloc(24) | Exclusion set arena node (class tag = 1) |
| +232 | 8 | ptr | alloc(24) | Pre-alloc hash table A arena node (class tag = 2) |
| +240 | 8 | ptr | 0 | Pre-alloc hash table A: base pointer |
| +248 | 8 | ptr | 0 | Pre-alloc hash table A: count |
| +256 | 8 | ptr | 0 | Pre-alloc hash table A: capacity |
| +272 | 8 | ptr | alloc(24) | Pre-alloc hash table B arena node |
| +280 | 24 | -- | 0 | Pre-alloc hash table B: {base, count, capacity} |
| +312 | 8 | ptr | alloc(24) | Pre-alloc hash table C arena node |
| +320 | 24 | -- | 0 | Pre-alloc hash table C: {base, count, capacity} |
| +352 | 8 | ptr | alloc(24) | Exclusion set hash table arena node (class tag = 3) |
| +360 | 8 | ptr | 0 | Exclusion set: base pointer |
| +368 | 8 | ptr | 0 | Exclusion set: count |
| +376 | 8 | ptr | 0 | Exclusion set: capacity |
| +384 | 4 | DWORD | 0 | Exclusion set: element count |
| +392 | 8 | ptr | =+352 | Exclusion alias A (points to same node) |
| +400 | 24 | -- | 0 | Exclusion secondary: {base, count, capacity} |
| +424 | 4 | DWORD | 0 | Exclusion secondary: element count |
| +432 | 8 | ptr | =+352 | Exclusion alias B |
| +440 | 1 | BYTE | 0 | MAC force pre-alloc flag (RegAllocMacForce, knob 628) |
| +441 | 1 | BYTE | 0 | Coalescing enable flag (RegAllocCoalescing, knob 618) |
| +442 | 1 | BYTE | 0 | MAC vreg alloc order (RegAllocMacVregAllocOrder, knob 629) |
| +443 | 1 | BYTE | 0 | Per-class mode flag (set by vtable+296 callback) |
Per-Class Bitvector Sets (+448 -- +695)
An array of 6 bitvector set entries (one per allocatable register class, classes 1--6). Each entry is 40 bytes: a linked-list header {head, data, tail, count} (32 bytes) plus an arena node pointer (8 bytes). The arena nodes carry incrementing class tags (4, 6, 8, 10, 12, 14). The constructor loop starts at +456 and increments by 40 until +656.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +448 | 8 | QWORD | 0 -> 6 | Bitvector set count (incremented in init loop) |
| +456 | 240 | array | -- | 6 x BitvectorSet (40B each): classes 1--6 |
| +696 | 24 | -- | 0 | Remat candidate list: {base, data, tail} |
| +720 | 4 | DWORD | 0 | Remat candidate list: count |
| +728 | 8 | ptr | alloc(24) | Remat candidate arena node (class tag = 2) |
Core Allocation State (+736 -- +872)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +736 | 8 | ptr | 0 | Register linked list: secondary head |
| +744 | 8 | ptr | 0 | Register linked list head (main walk list for sub_957160) |
| +752 | 1 | BYTE | 0 | Register list initialized flag |
| +756 | 4 | DWORD | -1 | Hardware register limit (max physical regs, per-class) |
| +760 | 4 | DWORD | -1 | Secondary HW limit |
| +764 | 4 | DWORD | -1 | Pre-alloc constraint count |
| +776 | 8 | double | -1.0 | Spill cost threshold |
| +788 | 4 | DWORD | -1 | Best allocation result (reset to 0 per allocation round) |
| +792 | 1 | BYTE | 0 | Allocation-in-progress flag |
| +800 | 1 | BYTE | 0 | Retry-active flag |
| +808 | 4 | DWORD | (dynamic) | Live range interference state |
| +816 | 8 | ptr | (dynamic) | Live range secondary structure (4-byte DWORD array at +816) |
| +824 | 1 | BYTE | 0 | Pre-coloring done flag |
| +832 | 8 | ptr | 0 -> dyn | Per-function spill info array pointer |
| +840 | 8 | ptr | 0 -> dyn | Per-function spill info arena node |
| +848 | 8 | ptr | 0 | Spill info secondary |
| +856 | 8 | ptr | 0 | Spill info tertiary |
| +864 | 1 | BYTE | 0 | Bank conflict awareness flag |
| +865 | 1 | BYTE | 0 | Spill-already-triggered flag |
| +872 | 8 | ptr | 0 | Debug / trace output state |
Per-Class Register File Descriptors (+880 -- +1103)
An array of 7 register class descriptors (one per class 0--6), each 32 bytes. Indexed as alloc + 880 + 32 * class_id. The per-class driver (sub_971A90) accesses max_regs as a1[32 * class_id + 884] and base_offset as a1[32 * class_id + 880].
RegClassDesc (32 bytes):
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 4 | DWORD | 0 | Base register offset (first physical reg in class) |
| +4 | 4 | DWORD | -1 | Max regs / HW limit (set by vtable[896] init callback) |
| +8 | 4 | DWORD | 0 | Current allocation count |
| +12 | 1 | BYTE | 0 | Class active flag |
| +13 | 1 | BYTE | 0 | Class overflow flag |
| +14 | 1 | BYTE | 0 | Class spill flag |
| +15 | 1 | -- | -- | Padding |
| +16 | 4 | DWORD | 148 | Phase ID begin (148 = unset sentinel) |
| +20 | 4 | DWORD | 148 | Phase ID end (148 = unset sentinel) |
| +24 | 8 | QWORD | -1 | Class auxiliary link |
Concrete addresses:
| Class | Offset Range | Description |
|---|---|---|
| 0 (unified) | +880 -- +911 | Cross-class (skipped in main loop) |
| 1 (R) | +912 -- +943 | GPR 32-bit |
| 2 (R alt) | +944 -- +975 | GPR variant |
| 3 (UR) | +976 -- +1007 | Uniform GPR |
| 4 (UR ext) | +1008 -- +1039 | Uniform GPR variant |
| 5 (P/UP) | +1040 -- +1071 | Predicate registers |
| 6 (Tensor) | +1072 -- +1103 | Tensor / accumulator |
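The offset ranges in the table follow mechanically from the 32-byte stride; a quick check of the recovered indexing formula alloc + 880 + 32 * class_id:

```python
def regclass_desc_range(class_id: int) -> tuple:
    """Byte-offset range of the 32-byte RegClassDesc for a class (0--6)."""
    base = 880 + 32 * class_id
    return (base, base + 31)

for cid, name in enumerate(["unified", "R", "R alt", "UR", "UR ext", "P/UP", "Tensor"]):
    lo, hi = regclass_desc_range(cid)
    print(f"class {cid} ({name}): +{lo} -- +{hi}")
```

The output reproduces the concrete-address table above, e.g. class 3 (UR) at +976 -- +1007.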
Extended Class Metadata (+1096 -- +1127)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1096 | 8 | QWORD | -1 | Class 6 extended auxiliary link |
| +1104 | 8 | ptr | 0 | Extended class info: pointer A |
| +1112 | 8 | ptr | 0 | Extended class info: pointer B |
| +1120 | 4 | DWORD | 0 | Extended class info: count |
Per-Class Rematerialization Lists (+1128 -- +1271)
Six rematerialization candidate lists (one per allocatable class), each 24 bytes {ptr base, ptr data, DWORD count}. Initialized to zero. Populated before the allocation loop in sub_9721C0 for classes that support rematerialization.
| Class | Offset Range |
|---|---|
| 1 | +1128 -- +1151 |
| 2 | +1152 -- +1175 |
| 3 | +1176 -- +1199 |
| 4 | +1200 -- +1223 |
| 5 | +1224 -- +1247 |
| 6 | +1248 -- +1271 |
Coalescing / Live Range Lists (+1272 -- +1432)
Self-referential circular linked lists used for register coalescing and live range splitting. Each list has a sentinel structure where prev and next point into the list body.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1272 | 8 | ptr | arg2 | Back-pointer to compilation context |
| +1280 | 8 | ptr | 0 | Coalesce list A: sentinel head |
| +1288 | 8 | ptr | self+1296 | Coalesce list A: prev (self-referential) |
| +1296 | 8 | ptr | self+1280 | Coalesce list A: next (circular) |
| +1304 | 8 | ptr | 0 | Coalesce list A: data |
| +1312 | 4 | DWORD | (checked) | Coalesce list A: count (bit 0 = non-empty flag) |
| +1320 | 8 | ptr | self+1296 | Coalesce list A: end marker |
| +1328 | 4 | DWORD | 2 | Coalesce list A: type tag |
| +1336 | 8 | ptr | alloc(24) | Coalesce list A: arena node |
| +1344 | 8 | ptr | 0 | Coalesce list B: sentinel head |
| +1352 | 8 | ptr | self+1360 | Coalesce list B: prev |
| +1360 | 8 | ptr | self+1344 | Coalesce list B: next |
| +1368 | 8 | ptr | 0 | Coalesce list B: data (bit 2 checked as ABI flag) |
| +1376 | 8 | ptr | self+1344 | Coalesce list B: tail |
| +1384 | 8 | ptr | self+1360 | Coalesce list B: end marker |
| +1392 | 4 | DWORD | 2 | Coalesce list B: type tag |
| +1400 | 8 | ptr | alloc(24) | Coalesce list B: arena node |
| +1408 | 8 | ptr | alloc(24) | Interference graph arena node (bit 1 = call-saved mode) |
| +1416 | 8 | ptr | 0 | Interference graph: base |
| +1424 | 8 | ptr | 0 | Interference graph: data (bit 7 checked in sub_97EC60) |
| +1432 | 8 | ptr | 0 | Interference graph: capacity |
Debug / Rematerialization Infrastructure (+1440 -- +1496)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1440 | 8 | -- | (tree) | Remat exclusion set (tree root, queried via sub_99C5B0) |
| +1448 | 1 | BYTE | 0 | Remat exclusion: active flag (checked in sub_962840, sub_94E620) |
| +1452 | 4 | DWORD | 0 | Remat exclusion: instruction threshold |
| +1464 | 16 | OWORD | 0 | Remat exclusion: data block B |
| +1472 | 8 | ptr | 0 | Remat candidate: linked list (freed in sub_99D190) |
| +1480 | 16 | -- | 0 | Remat candidate list (iterated by sub_94BDF0) |
| +1488 | 4 | DWORD | 0 | Remat candidate: count (checked in sub_99C690) |
| +1496 | 8 | ptr | 0 | Remat candidate: root pointer |
Spill / Retry Control Block (+1504 -- +1594)
The core state for the NOSPILL / SPILL retry loop. Zeroed at allocation start, populated by the per-class driver (sub_971A90), read/written by the fat-point allocator (sub_957160).
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1504 | 4 | DWORD | 0 | Allocation mode (0=normal, 3=CSSA, 5=SMEM, 6=paired) |
| +1508 | 4 | DWORD | 0 | Spill attempt counter |
| +1512 | 4 | DWORD | 0 -> 44 | Spill instruction count (knob 635, default 44) |
| +1516 | 4 | DWORD | -1 | Budget lower bound |
| +1520 | 4 | DWORD | -1 | Budget lower bound secondary (part of 128-bit at +1516) |
| +1524 | 4 | DWORD | -1 | Register budget (from per-class desc max_regs) |
| +1528 | 4 | DWORD | (dynamic) | Peak register usage (copied from +1532 per round) |
| +1532 | 16 | __m128i | (global) | Strategy parameters (loaded from xmmword_21E17F0) |
| +1540 | 4 | DWORD | 0 | Secondary budget limit (knob 633) |
| +1544 | 4 | DWORD | 0 | Tertiary budget limit (knob 632) |
| +1548 | 4 | float | 4.0 | Spill cost multiplier (knob 680) |
| +1552 | 4 | DWORD | -1 | Rollback sentinel |
| +1556 | 4 | DWORD | -1 | Max regs aligned: (budget + 4) & ~3 |
| +1560 | 4 | DWORD | -1 | Best result sentinel |
| +1564 | 4 | DWORD | 0 | Current max assignment (zeroed per allocation round) |
| +1568 | 8 | double | 0.0 | Total spill cost accumulator (zeroed per round) |
| +1576 | 4 | DWORD | 0 | Spill event counter (zeroed per round) |
| +1580 | 4 | DWORD | (dynamic) | Effective budget: max(budget, SMEM_min) |
| +1584 | 4 | DWORD | (dynamic) | Adjusted budget (from vtable+256 callback) |
Mode Flags (+1588 -- +1594)
Knob-derived boolean flags controlling allocation strategy. When the function has more than one basic block (sub_7DDB50 > 1), flags +1588, +1589, +1590 are all forced to 1.
| Offset | Size | Type | Init | Knob | Field |
|---|---|---|---|---|---|
| +1588 | 1 | BYTE | 0 | 682 | Epoch-aware allocation mode |
| +1589 | 1 | BYTE | 0 | 683 | Paired-register allocation mode |
| +1590 | 1 | BYTE | 0 | 619 | SMEM spill enable |
| +1591 | 1 | BYTE | 0 | 627 | Bank-aware allocation |
| +1592 | 1 | BYTE | 0 | -- | Spill status / has-spilled flag |
| +1593 | 1 | BYTE | 1 | 636 | Precolor reuse (default enabled) |
| +1594 | 1 | BYTE | 1 | 649 | ABI compatibility (default enabled; cleared for small kernels) |
Budget Pressure Model (+1600 -- +1744)
Occupancy-aware register budget interpolation. Computes a dynamic register budget based on thread occupancy, using knob-derived coefficients and a linear interpolation model. The slope at +1736 is (coeffB - coeffC) / (maxOccupancy - minOccupancy), enabling the allocator to trade register count for occupancy.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Budget model: auxiliary pointer |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Budget model: occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob | Max threads per block (default 119) |
| +1628 | 4 | DWORD | 160 / knob | Pressure threshold (default 160) |
| +1632 | 8 | double | 0.2 | Interpolation coefficient A (knob-overridable) |
| +1640 | 8 | double | 1.0 | Interpolation coefficient B (knob-overridable) |
| +1648 | 8 | double | 0.3 | Interpolation coefficient C (knob-overridable) |
| +1656 | 8 | double | (computed) | Total threads as double |
| +1664 | 8 | double | = coeff A | Interpolation point [0] |
| +1672 | 8 | double | (computed) | Interpolation point [1]: max_threads as double |
| +1680 | 8 | double | = coeff A | Interpolation point [2] |
| +1688 | 8 | double | (computed) | Interpolation point [3]: threshold as double |
| +1696 | 8 | double | = coeff A | Interpolation point [4] |
| +1704 | 8 | double | (computed) | Interpolation point [5]: 255 minus vtable result |
| +1712 | 8 | double | = coeff B | Interpolation point [6] |
| +1720 | 8 | double | (computed) | Linear model: x_min (thread count) |
| +1728 | 8 | double | = coeff C | Linear model: y_min |
| +1736 | 8 | double | (computed) | Linear model: slope |
| +1744 | 8 | ptr | 0 | Budget model: tail sentinel |
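The slope formula implies a simple linear model; a sketch under the assumption that the interpolation is evaluated as y_min + slope * (occupancy - x_min) (the evaluation site itself is not recovered, only the stored coefficients and slope):

```python
def budget_slope(coeff_b, coeff_c, max_occ, min_occ):
    """(coeffB - coeffC) / (maxOccupancy - minOccupancy), as stored at +1736."""
    return (coeff_b - coeff_c) / (max_occ - min_occ)

def interpolate_budget(occ, min_occ, max_occ, coeff_b, coeff_c):
    """Linear model from the recovered fields: y = coeff_c at min_occ,
    rising toward coeff_b at max_occ."""
    slope = budget_slope(coeff_b, coeff_c, max_occ, min_occ)
    return coeff_c + slope * (occ - min_occ)

# With the default coefficients B = 1.0 and C = 0.3 over a normalized
# occupancy range, the model interpolates from 0.3 up to 1.0.
print(interpolate_budget(0.0, 0.0, 1.0, 1.0, 0.3))  # 0.3
print(interpolate_budget(1.0, 0.0, 1.0, 1.0, 0.3))
```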
Virtual Register Object Layout
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Next pointer (linked list) |
| +12 | 4 | Register class index |
| +20 | 1 | Flags byte (bit 0x20 = live) |
| +36 | 8 | Alias chain (coalesced parent) |
| +40 | 4 | Spill cost (float, accumulated) |
| +48 | 8 | Flags qword (see below) |
| +64 | 4 | Register type (1=GPR, 3=pred, 9=barrier) |
| +68 | 4 | Physical assignment (-1 = unassigned) |
| +72 | 1 | Size byte (0 = scalar) |
| +76 | 4 | Secondary spill cost (float) |
| +80 | 4 | Spill flag (0 = not spilled, 1 = spilled) |
| +104 | 8 | Use chain head |
| +112 | 8 | Def chain |
| +128 | 8 | Next in linked-register chain |
| +144 | 8 | Constraint list |
Flag bits at +48:
| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 14 | 0x4000 | Spill marker |
| 18 | 0x40000 | Needs-spill flag |
| 20--21 | -- | Pair mode (0=single, 1=lo-half, 3=double-width) |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair |
| 27 | 0x8000000 | Special handling flag |
Key Knobs
87 OCG knobs (indices 613--699) control register allocation heuristics. The complete catalog with sub-category grouping is in Knobs System -- Register Allocation Knobs. The most important ones:
| Knob | Name | Type | Role |
|---|---|---|---|
| 381 | (not yet decoded) | -- | HoistInvariants policy: 0=always, 1=inner loops, 3=never |
| 487 | NumOptPhasesBudget | BDGT | Budget counter that gates ConvertMemoryToRegisterOrUniform |
| 618 | RegAllocCoalescing | bool | Enables register coalescing in the allocator |
| 623 | RegAllocEstimatedLoopIterations | STR | Loop iteration estimate hint for spill cost weighting |
| 628 | RegAllocMacForce | bool | Forces MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 638 | RegAllocNoRetargetPrefs | bool | Disables retarget-preference optimization |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference discard threshold. Default 50 |
| 934 | UseNewLoopInvariantRoutineForHoisting | bool | Selects new LICM routine for HoistInvariants pre-pass |
Function Map
| Address | Lines | Role |
|---|---|---|
sub_8FFDE0 | 119 | HoistInvariants entry |
sub_905B50 | 1046 | Mem-to-reg candidate builder |
sub_910840 | 327 | ConvertMemoryToRegisterOrUniform entry |
sub_911030 | 2408 | Mem-to-reg analysis engine |
sub_914B40 | 1737 | Post-promotion rewrite |
sub_926A30 | 4005 | Fat-point interference builder |
sub_947150 | 1088 | Allocator state constructor (initializes 1752-byte object)
sub_939BD0 | 65 | Spill allocator setup |
sub_939CE0 | 23 | Register consumption counter |
sub_93D070 | 155 | Best result recorder |
sub_93ECB0 | 194 | Pre-assign registers |
sub_93FBE0 | 940 | Spill slot assignment |
sub_94A020 | 331 | Pre-allocation pass |
sub_94E620 | 617 | Spill cost accumulator |
sub_94F150 | 561 | Spill code generation |
sub_94FDD0 | 155 | Register assignment + alias propagation |
sub_950100 | 205 | Pre-allocated candidate applier |
sub_957160 | 1658 | Core fat-point allocator |
sub_9539C0 | 1873 | Shared-memory spill allocator |
sub_95A350 | 1390 | Cost / benefit evaluator |
sub_95BC90 | 1250 | Allocation retry / refinement |
sub_95DC10 | 2738 | Multi-class ABI-aware driver |
sub_9680F0 | 3722 | Per-instruction assignment core loop |
sub_96D940 | 2983 | Spill guidance (7-class priority queues) |
sub_971A90 | 355 | NOSPILL / SPILL retry driver |
sub_9721C0 | 1086 | Register allocation entry point |
sub_991790 | 2677 | Pre-coloring pass |
sub_9A5170 | 750 | Live range construction |
sub_9AD220 | 700 | Live range merge / coalescing engine |
sub_9AEF60 | 900 | Live range splitting |
sub_9B1200 | 800 | Register coalescing / copy elimination |
Fat-Point Allocation Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas register allocator uses a fat-point greedy algorithm. For each virtual register, it scans a per-physical-register pressure array, picks the slot with the lowest interference count, and commits the assignment. There is no graph coloring, no simplify-select-spill loop, and no worklist -- just two 512-DWORD pressure histograms and a linear scan. This page documents the algorithm in full detail: pressure array construction, constraint evaluation, register selection, assignment propagation, the retry loop, and the supporting knobs.
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Occupancy bitvector | sub_957020 -- resizes bitvector; sub_94C9E0 -- marks slot ranges |
| Interference builder | sub_926A30 (4005 lines) -- constraint solver |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Pre-allocation | sub_94A020 (331 lines) -- pre-assign high-priority operands |
| Retry driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry loop |
| Best result recorder | sub_93D070 (155 lines) -- compare and keep best attempt |
| Entry point | sub_9721C0 (1086 lines) -- per-function allocation driver |
Pressure Array Construction
The core allocator (sub_957160) allocates two stack-local arrays at the start of each allocation round. Each array is 2056 bytes: 512 DWORDs (2048 bytes) of pressure data plus a 2-DWORD sentinel.
| Array | Variable | Role |
|---|---|---|
| Primary | v12 | Per-physical-register interference count. Lower is better. |
| Secondary | v225 | Per-physical-register secondary cost. Breaks ties when primary values are equal. |
Both arrays are zeroed using SSE2 vectorized _mm_store_si128 loops aligned to 16-byte boundaries. Each iteration stores 128 bits (4 DWORDs), so the 512-DWORD array is cleared in 128 iterations.
For each virtual register in the allocation worklist (linked list at alloc+744), the allocator zeroes the pressure arrays and then walks the VR's constraint list (vreg+144). For each constraint, it increments the appropriate pressure array entries at the physical register slots that conflict with the current virtual register. The result is a histogram: primary[slot] holds the total interference weight for physical register slot, accumulated over all constraints of all previously-assigned virtual registers that conflict with the current one. The full per-VR algorithm is documented in the Pressure Computation Algorithm section below.
The secondary array accumulates a separate cost metric used for tie-breaking. It captures weaker interference signals -- preferences and soft constraints that do not represent hard conflicts but indicate suboptimal placement.
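The two-array selection can be sketched as a lexicographic minimum (array and function names here are assumptions; what is recovered is that primary is compared first and secondary consulted only on ties):

```python
def select_slot(primary, secondary):
    """Pick the slot with the lowest primary pressure; on a primary tie,
    the lower secondary cost wins (secondary is a pure tie-breaker)."""
    return min(range(len(primary)), key=lambda s: (primary[s], secondary[s]))

# Slots 1 and 3 tie on primary pressure (2); the secondary array decides.
print(select_slot([9, 2, 7, 2], [0, 6, 0, 1]))  # 3
```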
Budget Computation
Before the pressure scan begins, the allocator computes the maximum physical register count for the current class:
v231 = hardware_limit + 7              // alloc+756, with headroom
if allocation_mode == 6 (CSSA paired):
    v231 *= 4                          // quad range for paired allocation
elif allocation_mode == 3 (CSSA):
    v231 *= 2                          // doubled range
alloc.budget = v231                    // stored at alloc+60
The hardware limit comes from the target descriptor and reflects the physical register file size for the current class (e.g. 255 for GPRs, 7 for predicates). The +7 headroom allows the allocator to explore slightly beyond the architectural limit before triggering a hard failure -- this is clamped during assignment by the register budget check in sub_94FDD0.
The register budget at alloc+1524 interacts with --maxrregcount and --register-usage-level (values 0--10). The CLI-specified maximum register count is stored in the compilation context and propagated to the allocator as the hard ceiling. The register-usage-level option modulates the target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class register budget stored at alloc+32*class+884 reflects this interaction.
Occupancy Bitvector
After computing the budget, the allocator initializes an occupancy bitvector (sub_957020 + sub_94C9E0) that tracks which physical register slots are already assigned. The bitvector is sized to ceil(budget / 64) 64-bit words. For each VR being allocated, sub_94C9E0 sets bits covering the VR's footprint in the bitvector using a word-level OR with computed masks:
function mark_occupancy(bitvec, start_bit, end_bit):
    // Set every bit in [start_bit, end_bit) using whole-word ORs.
    lo_word = start_bit >> 6
    hi_word = (end_bit - 1) >> 6
    lo_mask = 0xFFFFFFFFFFFFFFFF << (start_bit & 63)            // clears bits below start
    hi_mask = 0xFFFFFFFFFFFFFFFF >> (63 - ((end_bit - 1) & 63)) // clears bits above end
    for word_idx in lo_word .. hi_word:
        mask = 0xFFFFFFFFFFFFFFFF
        if word_idx == lo_word: mask &= lo_mask
        if word_idx == hi_word: mask &= hi_mask
        bitvec[word_idx] |= mask
During the fatpoint scan, a set bit means "slot occupied -- skip it." This prevents the allocator from considering slots already committed to other VRs in the current round.
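An executable check of this word-level marking (a clean Python stand-in for the recovered behavior, with the bitvector as a list of 64-bit words, rather than a literal transcription of the decompiled masks):

```python
def mark_range(bitvec, start, end):
    """Set bits [start, end) using whole-word ORs with edge masks."""
    full = (1 << 64) - 1
    lo_word, hi_word = start >> 6, (end - 1) >> 6
    for w in range(lo_word, hi_word + 1):
        mask = full
        if w == lo_word:
            mask &= (full << (start & 63)) & full    # clear bits below start
        if w == hi_word:
            mask &= full >> (63 - ((end - 1) & 63))  # clear bits above end-1
        bitvec[w] |= mask

bv = [0, 0, 0]
mark_range(bv, 60, 70)  # footprint spans the word boundary at bit 64
print(hex(bv[0]), hex(bv[1]))
```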
Pressure Computation Algorithm
The per-VR pressure computation is the core of the fat-point allocator. For each unassigned virtual register, the allocator builds a fresh pressure histogram, selects the minimum-cost physical register slot, and commits the assignment. The algorithm has seven steps, all executed inside the main loop of sub_957160 (lines 493--1590 of the decompiled output).
Step 1: VR Geometry
For each VR, the allocator computes the physical register footprint via sub_7DAFD0:
function aligned_width(vreg):
stride = 1 << vreg.alignment // vreg+72, uint8
size = vreg.width // vreg+74, uint16
return (-stride) & (stride + size - 1) // = ceil(size / stride) * stride
| stride | size | result | Meaning |
|---|---|---|---|
| 1 | 1 | 1 | Single register |
| 1 | 2 | 2 | Unaligned pair |
| 2 | 2 | 2 | Aligned pair |
| 2 | 3 | 4 | Aligned quad (rounded up) |
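The expression (-stride) & (stride + size - 1) is the classic round-up-to-a-multiple idiom; it reproduces the table directly (Python's arbitrary-precision integers handle the negative mask without explicit truncation):

```python
def aligned_width(stride, size):
    """ceil(size / stride) * stride, via the recovered mask expression."""
    return (-stride) & (stride + size - 1)

for stride, size in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    print(stride, size, aligned_width(stride, size))  # results: 1, 2, 2, 4
```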
The pair mode is extracted from (vreg+48 >> 20) & 3:
| pair_mode | step_size | Behavior |
|---|---|---|
| 0 | stride | Normal single-width scan |
| 1 | stride | Paired mode -- is_paired = 1, scan ceiling doubled |
| 3 | 2 * stride | Double-width -- aligned width doubled, step by 2x stride |
Step 2: Scan Range
The scan range defines which physical register slots are candidates:
function compute_scan_range(alloc, vreg):
    max_slot = alloc.budget                        // alloc+1524
    if vreg.flags & 0x400000:                      // per-class ceiling override
        class_limit = alloc.class_limits[alloc.current_class]
        if max_slot > class_limit:
            max_slot = class_limit
    ceiling = ((max_slot + 1) << is_paired) - 1
    start = vtable[320](alloc, vreg, bitvec)       // class-specific start offset
    alignment = alloc.scan_alignment << is_paired  // alloc+1556
    scan_width = alloc.slot_count + 4              // alloc+1584 + 4
    return (start, ceiling, scan_width, alignment)
The +4 on scan_width provides padding beyond the register file limit. For pair modes, the ceiling is shifted left: double for is_paired, quad for pair_mode 3 with the 0x40 flag at ctx+1369.
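The ceiling arithmetic from Step 2, isolated for checking (assumption: is_paired behaves as a 0/1 shift amount, as the pseudocode suggests):

```python
def scan_ceiling(max_slot, is_paired):
    """((max_slot + 1) << is_paired) - 1: doubles the slot range for paired
    allocation so hi/lo halves index distinct pressure-array entries."""
    return ((max_slot + 1) << is_paired) - 1

print(scan_ceiling(254, 0))  # 254  (plain GPR scan)
print(scan_ceiling(254, 1))  # 509  (paired: doubled range)
```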
Step 3: Zero Pressure Arrays
Before accumulating interference for this VR, both arrays are zeroed over scan_width DWORDs:
function zero_pressure(primary[], secondary[], scan_width):
    if scan_width > 14 and arrays_dont_overlap:
        // SSE2 vectorized path: zero 4 DWORDs per iteration
        for i in 0 .. scan_width/4:
            _mm_store_si128(&primary[4*i], zero_128)
            _mm_store_si128(&secondary[4*i], zero_128)
        // scalar cleanup for remainder
    else:
        // scalar path
        for i in 0 .. scan_width:
            primary[i] = 0
            secondary[i] = 0
The SSE2 path has a non-overlap guard (secondary >= primary + 16 || primary >= secondary + 16) to ensure the vectorized stores do not alias. The scalar path is used for narrow scan ranges (width <= 14).
Step 4: Constraint Walk
The allocator iterates the constraint list at vreg+144. For VRs with alias chains (coalesced registers via vreg+32), the walk processes constraints for the entire chain, accumulating pressure from all aliases into the same arrays. Each constraint node is a 24-byte structure:
| Offset | Type | Field |
|---|---|---|
| +0 | pointer | Next constraint (linked list) |
| +8 | int32 | Constraint type (0--15) |
| +12 | int32 | Target VR index or physical register |
| +16 | int32 | Weight (interference cost) |
| +20 | uint8 | Soft flag (skip in later iterations) |
The constraint type dispatches to different accumulation patterns:
function accumulate_pressure(primary[], secondary[], constraint_list, scan_width,
                             base_offset, pair_mode, half_width_mode, iteration):
    soft_count = 0
    for node in constraint_list:
        // --- Soft constraint relaxation (iteration > 0) ---
        if node.soft_flag and iteration > 0:
            soft_count++
            skip_threshold = iteration * knob_weight   // OCG knob at +46232
            if soft_count <= skip_threshold:
                continue                               // relax this constraint
            if bank_aware and soft_count > relaxation_ceiling:
                continue
        type = node.type
        target = node.target
        weight = node.weight
        switch type:
            case 0:   // Point interference
                phys = lookup_vreg(target).physical_reg
                if phys < 0: continue                  // target unassigned
                offset = phys - base_offset
                if offset < 0 or offset >= scan_width: continue
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                primary[offset] += weight
            case 1:   // Exclude-one
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight
            case 2:   // Exclude-all-but (target is the only allowed slot)
                for slot in 0 .. scan_width:
                    if slot != target:
                        primary[slot] += weight
            case 3:   // Below-point: penalize every slot below target
                for slot in 0 .. scan_width:
                    if target > slot:
                        primary[slot] += weight
            case 5:   // Paired-low (even slots only)
                primary[offset] += weight
                // For pair_mode 3: also primary[offset+1] += weight
            case 6:   // Paired-high (odd slots only)
                primary[offset + 1] += weight
            case 7:   // Aligned-pair (both halves)
                primary[offset] += weight
                primary[offset + 1] += weight
            case 8:   // Phi-related (parity-strided accumulation)
                parity = compute_phi_parity(target, vreg)
                for slot in parity .. scan_width step 2:
                    primary[slot] += weight
            case 11:  // Paired-even-parity
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight
            case 12:  // Paired-odd-parity
                inverse = compute_odd_inverse(offset, pair_mode)
                for slot in 0 .. scan_width:
                    if slot != inverse:
                        primary[slot] += weight
            case 13:  // Paired-parity-group (even-only exclusion)
                if offset & 1: continue
                for slot in 0 .. scan_width:
                    if slot != offset + 1:
                        primary[slot] += weight
            case 14:  // Paired-parity-extended (odd-only exclusion)
                if !(offset & 1): continue
                for slot in 0 .. scan_width:
                    if slot != offset - 1:
                        primary[slot] += weight
            case 15:  // Range (SECONDARY array)
                range_end = min(offset, scan_width)
                // SSE2 vectorized: broadcast weight, add to secondary[0..range_end]
                for slot in 0 .. range_end:            // vectorized
                    secondary[slot] += weight
                // Tail: slots beyond (range_end + pair_width)
                for slot in range_end .. scan_width:
                    if slot >= offset + pair_width:
                        secondary[slot] += weight
            default:  // Types 4, 9, 10 and custom extensions
                vtable[240](alloc, primary, 514, node, scan_width, offset, pair_flag)
Type 15 (range) is the only constraint type that writes to the secondary array. All others write to primary. This is the architectural decision that makes secondary a pure tie-breaker: it captures long-range preference signals while primary captures hard interference.
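The accumulation walk above can be modeled compactly. The following is a minimal C sketch covering three representative constraint types (0, 2, and 15); the node layout and the reduction to three cases are simplified for illustration, and for the point case `target` is treated directly as a physical slot rather than going through a VR lookup.

```c
#include <assert.h>

/* Simplified model of the pressure-accumulation walk for constraint
 * types 0 (point), 2 (exclude-all-but), and 15 (range). The binary's
 * 24-byte node and full 16-type switch are described in the text. */
enum { C_POINT = 0, C_EXCLUDE_ALL_BUT = 2, C_RANGE = 15 };

typedef struct constraint {
    struct constraint *next;
    int type;
    int target;
    int weight;
} constraint;

static void accumulate(int *primary, int *secondary, int scan_width,
                       int base_offset, const constraint *list)
{
    for (const constraint *n = list; n != 0; n = n->next) {
        int offset = n->target - base_offset;
        switch (n->type) {
        case C_POINT:            /* cost only on the conflicting slot */
            if (offset >= 0 && offset < scan_width)
                primary[offset] += n->weight;
            break;
        case C_EXCLUDE_ALL_BUT:  /* penalize every slot except target */
            for (int s = 0; s < scan_width; s++)
                if (s != n->target)
                    primary[s] += n->weight;
            break;
        case C_RANGE:            /* long-range preference: secondary only */
            for (int s = 0; s < offset && s < scan_width; s++)
                secondary[s] += n->weight;
            break;
        }
    }
}
```

Note how the range case is the only one touching `secondary`, mirroring the primary/secondary split described above.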
SSE2 Vectorization in Constraint Walk
Three inner loops use SSE2 intrinsics:
- Type 0 (point) with large width: `_mm_add_epi32` adds the broadcast weight to 4 primary slots per iteration. An alignment pre-loop handles the first 1--3 slots to reach 16-byte alignment.
- Type 15 (range) secondary accumulation: `_mm_shuffle_epi32(_mm_cvtsi32_si128(weight), 0)` broadcasts the weight to all 4 lanes. The vectorized loop processes 4 secondary slots per iteration with `_mm_add_epi32(_mm_load_si128(...), broadcast)`.
- Type 8 (phi) stride-2 accumulation: uses `_mm_shuffle_ps` with mask 136 (0b10001000) to extract every other element, then `_mm_add_epi32` to accumulate. This implements stride-2 addition across the primary array.
Step 5: Iteration-Dependent Constraint Relaxation
On retry iterations (iteration > 0), the allocator progressively relaxes soft constraints to reduce pressure:
function should_skip_soft_constraint(soft_count, iteration, knob_weight,
knob_ceiling, bank_aware):
threshold = iteration * knob_weight // more skipped each retry
if soft_count <= threshold:
return true // skip (relax)
    if !bank_aware:
        return false // ceiling applies only in bank-aware mode; keep constraint
if soft_count > (total_soft - iteration * knob_ceiling):
return true // beyond ceiling
return false
The relaxation formula means: on iteration N, the first N * knob_weight soft constraints are ignored. The knob_ceiling parameter (OCG knob at offset +46304) controls how aggressively the tail is also relaxed. This trades bank-conflict quality for register pressure reduction, allowing the allocator to find assignments that fit within the budget on later retries.
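The predicate can be written as a small standalone function. This is a reconstruction of the relaxation formula as stated above (first `N * knob_weight` soft constraints skipped on iteration N, tail relaxation only in bank-aware mode); parameter names are ours.

```c
#include <assert.h>

/* Should the soft constraint numbered `soft_count` be relaxed (skipped)
 * on this retry iteration? */
static int should_skip_soft(int soft_count, int iteration, int knob_weight,
                            int knob_ceiling, int total_soft, int bank_aware)
{
    if (iteration == 0)
        return 0;                                /* first pass keeps all */
    if (soft_count <= iteration * knob_weight)
        return 1;                                /* head of list relaxed */
    if (bank_aware && soft_count > total_soft - iteration * knob_ceiling)
        return 1;                                /* tail beyond ceiling */
    return 0;
}
```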
Step 6: Fatpoint Selection (Minimum Scan)
After pressure accumulation, the allocator scans for the physical register slot with the lowest cost:
function select_fatpoint(primary[], secondary[], start, ceiling, budget,
step_size, shift, occupancy_bv, threshold,
first_pass, bank_mode, bank_mask, prev_assignment):
best_slot = start
best_primary = 0
best_secondary = 0
// --- Pre-scan threshold check (first pass only) ---
if first_pass:
for slot in start .. ceiling step step_size:
if occupancy_bv[slot]: continue // already occupied
if primary[slot >> shift] > threshold: // knob 684, default 50
first_pass = false // congestion detected
break
// --- Main scan ---
slot = start
while slot < budget and slot < alloc.slot_count << is_paired:
// Occupancy filter
if slot < bv_size and occupancy_bv[slot]:
slot += step_size; continue
p = primary[slot >> shift]
s = secondary[slot >> shift]
if prev_assignment >= 0: // not first VR
if first_pass:
if s >= best_secondary:
slot += step_size; continue // secondary-only comparison
else:
if p > best_primary:
slot += step_size; continue
if p == best_primary and s >= best_secondary:
slot += step_size; continue
// Bank conflict filter
if bank_mode and ((slot ^ prev_assignment) & bank_mask) == 0:
slot += step_size; continue // same bank → skip
// Ceiling check
if slot > ceiling:
best_slot = slot; break // accept (over ceiling)
// Accept this slot
if slot < scan_width_shifted:
best_secondary = secondary[slot >> shift]
best_primary = primary[slot >> shift]
if best_secondary == 0 and (best_primary == 0 or first_pass):
best_slot = slot; break // zero-cost → immediate accept
best_slot = slot
slot += step_size; continue
best_slot = slot
best_primary = 0
break
return best_slot
Key design decisions in the fatpoint scan:
Two-mode comparison. On the first pass (first_pass = true, iteration 0), the scan uses secondary cost as the sole criterion, ignoring primary. This makes the first attempt pure-affinity-driven: it places VRs at their preferred locations based on copy/phi hints in the secondary array. On subsequent passes, primary cost dominates and secondary breaks ties.
Immediate zero-cost accept. When a slot has both primary == 0 and secondary == 0 (or just primary == 0 on first pass), the scan terminates immediately. This means the first zero-interference slot wins -- no further searching. Combined with the priority ordering of VRs, this produces a fast, greedy assignment.
Bank-conflict avoidance. The bank mask (-8 for pair mode 1, -4 otherwise) partitions the register file into banks. The filter ((slot ^ prev_assignment) & mask) == 0 ensures consecutive assignments land in different banks, reducing bank conflicts in the SASS execution units.
Occupancy bitvector filtering. The bitvector provides O(1) per-slot filtering of already-assigned registers. Bits are set by sub_94C9E0 for each committed assignment, preventing the scan from considering occupied slots.
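The bank-conflict filter reduces to a single XOR-and-mask test. A tiny self-contained check (the helper name is ours): with mask -8 (pair mode 1) or -4 (otherwise), two slots count as the same bank exactly when they fall in the same 8- or 4-aligned group.

```c
#include <assert.h>

/* Nonzero when `slot` lands in the same bank as the previous assignment. */
static int same_bank(int slot, int prev_assignment, int mask)
{
    return ((slot ^ prev_assignment) & mask) == 0;
}
```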
Step 7: Commit and Advance
The selected slot is committed via sub_94FDD0:
alloc.cumulative_pressure += best_primary // alloc+788
sub_94FDD0(alloc, ctx, iteration, vreg, &local_state, best_slot, best_primary)
vreg = vreg.next // vreg+128, advance worklist
The cumulative pressure counter at alloc+788 tracks the total interference weight across all VR assignments in this attempt. The retry driver uses this to compare attempts.
End-of-Round Result
After all VRs are processed, the allocator computes the result (lines 1594--1641):
peak_usage = alloc+1580 // max physical register used
class_slot = ctx+1584 + 4 * mode + 384
*class_slot = peak_usage
if peak_usage > 0x989677 + 6: // sanity threshold (~10M)
emit_error("Register allocation failed with register count of '%d'."
" Compile the program with a higher register target",
alloc+1524 + 1)
return peak_usage + 1 // number of registers used
The return value feeds into the retry driver's comparison: target >= result means success (the allocation fits within the register budget).
Constraint Types
The fat-point interference builder (sub_926A30, 4005 lines) processes constraints attached to each virtual register. Constraints are extracted from instruction operand descriptors encoded as 32-bit values: bits 28--30 encode the operand type, bits 0--23 encode the register index, bit 24 is the pair extension bit, and bit 31 is a sign/direction flag.
The builder recognizes 15 constraint types. Each constraint type adds interference weight to specific physical register slots in the pressure arrays:
| Type | Name | Pressure effect |
|---|---|---|
| 0 | Point interference | Adds weight to specific physical register slots that are live at the same program point as this VR. The most common constraint -- represents a simple "these two VRs cannot share a physical register because both are live at instruction I." |
| 1 | Exclude-one | Adds weight to exactly one physical register slot, excluding it from consideration. Used when a specific physical register is reserved (e.g. for ABI constraints or hardware requirements). |
| 2 | Exclude-all-but | Adds weight to all slots except one. Forces the VR into a single permitted physical register. Used for fixed-register operands (e.g. R0 for return values). |
| 3 | Below-point | Adds interference weight for registers live below (after) the current program point. Captures downward-exposed liveness -- the VR must avoid physical registers that are used by later instructions. |
| 4 | (reserved) | Not observed in common paths. |
| 5 | Paired-low | Constrains the VR to an even-numbered physical register. Used for the low half of a 64-bit register pair. The pressure builder increments only even-indexed slots. |
| 6 | Paired-high | Constrains the VR to an odd-numbered physical register (the slot immediately after its paired-low partner). Increments only odd-indexed slots. |
| 7 | Aligned-pair | Constrains a pair of VRs to consecutive even/odd physical registers simultaneously. Combines the effects of types 5 and 6. |
| 8 | Phi-related | Marks interference from CSSA phi instructions (opcode 195). Phi constraints are softer -- they add lower weight because the phi can potentially be eliminated by the coalescing pass. |
| 9 | (reserved) | Not observed in common paths. |
| 10 | (reserved) | Not observed in common paths. |
| 11 | Paired-even-parity | Constrains the VR to a physical register whose index has even parity with respect to a bank partition. Used for bank-conflict avoidance on architectures where register bank is determined by reg_index % N. |
| 12 | Paired-odd-parity | Constrains to odd parity within the bank partition. |
| 13 | Paired-parity-group | Constrains a group of VRs to compatible parity assignments across a bank. |
| 14 | Paired-parity-extended | Extended variant of parity constraints for wider register groups (quads). |
| 15 | Range | Adds interference over an interval of program points rather than a single point. Represents a VR whose live range spans multiple instructions and conflicts with another VR whose live range overlaps. The weight is proportional to the overlap length. |
The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops (_mm_add_epi64) for bulk interference weight accumulation when processing large constraint lists. The builder dispatches through 7+ vtable entries for OCG knob queries that modulate constraint weights.
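The constants recovered from the binary are the standard 32-bit FNV-1a parameters. For reference, the hash as it would be computed:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standard 32-bit FNV-1a: offset basis 0x811C9DC5, prime 16777619. */
static uint32_t fnv1a32(const unsigned char *data, size_t len)
{
    uint32_t h = 0x811C9DC5u;      /* offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];              /* xor first ... */
        h *= 16777619u;            /* ... then multiply (the "1a" order) */
    }
    return h;
}
```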
Constraint List Structure
Each virtual register carries a constraint list at vreg+144. The list is a linked chain of constraint nodes, each containing:
- Constraint type (one of the 15 types above)
- Target VR or physical register index
- Weight (integer, typically 1 for hard constraints, lower for soft)
- Program point or interval (for types 0, 3, 15)
- Pair/alignment specification (for types 5--7, 11--14)
The interference builder iterates this list for every VR being assigned, accumulating weights into the pressure arrays. The total cost of assignment to slot S is the sum of all constraint weights that map to S.
Register Selection
After the pressure arrays are populated for a given VR, the allocator scans physical register candidates and selects the one with minimum cost:
function select_register(primary[], secondary[], maxRegs, threshold, pair_mode):
best_slot = -1
best_primary_cost = MAX_INT
best_secondary_cost = MAX_INT
stride = 1
if pair_mode != 0:
stride = 2 << shift // 2 for pairs, 4 for quads
for slot in range(0, maxRegs, stride):
if primary[slot] > threshold: // knob 684, default 50
continue // skip congested slots
p = primary[slot]
s = secondary[slot]
if p < best_primary_cost:
best_slot = slot
best_primary_cost = p
best_secondary_cost = s
elif p == best_primary_cost and s < best_secondary_cost:
best_slot = slot
best_secondary_cost = s
return best_slot // -1 if nothing found
Key design decisions in the selection loop:
Threshold filtering. The interference threshold (OCG knob 684, default 50) acts as a congestion cutoff. Any physical register slot with total interference weight above this value is immediately skipped. This prevents the allocator from assigning a VR to a slot that would cause excessive register pressure, even if that slot happens to be the global minimum. The threshold trades a small increase in the number of spills for a significant improvement in allocation quality -- high-interference slots tend to require cascading reassignments.
Alignment stride. For paired registers (pair mode 1 or 3 in vreg+48 bits 20--21), the scan steps by 2 instead of 1, ensuring the VR lands on an even-numbered slot. For quad-width registers, the stride is 4. The shift amount comes from the register class descriptor and varies by allocation mode.
Two-level tie-breaking. When two candidates have equal primary cost, the secondary array breaks the tie. This provides a smooth gradient for the allocator to follow when the primary interference picture is flat. The secondary array typically captures weaker signals like register preference hints, pre-allocation suggestions, and copy-related affinities.
No backtracking. The selection is final once made. There is no local search, no Kempe-chain swapping, and no reassignment of previously-colored VRs. If the selection leads to a spill later, the retry loop (see below) handles it by rerunning the entire allocation with updated spill guidance.
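The threshold filter and the two-level tie-break can be rendered directly as executable C. This follows the `select_register` pseudocode above, with the stride passed in directly rather than derived from the pair-mode shift.

```c
#include <assert.h>
#include <limits.h>

/* Scan physical register slots, skip congested ones, and pick the
 * minimum (primary, secondary) pair lexicographically. */
static int select_register(const int *primary, const int *secondary,
                           int max_regs, int threshold, int stride)
{
    int best_slot = -1, best_p = INT_MAX, best_s = INT_MAX;
    for (int slot = 0; slot < max_regs; slot += stride) {
        int p = primary[slot];
        if (p > threshold)
            continue;                    /* congestion cutoff (knob 684) */
        int s = secondary[slot];
        if (p < best_p || (p == best_p && s < best_s)) {
            best_slot = slot;
            best_p = p;
            best_s = s;
        }
    }
    return best_slot;                    /* -1 if nothing found */
}
```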
Assignment: sub_94FDD0
Once a physical register slot is selected, sub_94FDD0 (155 lines) commits the assignment. This function handles four cases:
Case 1: Normal Assignment
The physical register number is written to vreg+68. The register consumption counter (sub_939CE0, 23 lines) computes how many physical slots this VR occupies:
consumption = slot + (1 << (pair_mode == 3)) - 1
For single registers, this is just slot. For double-width pairs (pair_mode 3), it is slot + 1, consuming two consecutive physical registers. The peak usage trackers at alloc+1528 and alloc+1564 are updated if consumption exceeds the current maximum.
Case 2: Predicate Half-Width
For predicate registers (class 2, type 3), the allocator performs a half-width division. The physical slot is divided by 2, and the odd/even bit is stored at vreg+48 bit 23 (the 0x800000 flag):
physical_reg = slot / 2
if slot is odd:
vreg.flags |= 0x800000 // hi-half of pair
else:
vreg.flags &= ~0x800000 // lo-half of pair
This maps two virtual predicate registers to one physical predicate register, since NVIDIA's predicate register file supports sub-register addressing (each physical predicate holds two 1-bit values).
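The half-width division is a divide-by-two plus a parity bit. A minimal sketch (the function name is ours; the flag bit matches the 0x800000 value given above):

```c
#include <assert.h>

/* Map a half-width predicate slot to a physical predicate register,
 * recording which half of the pair it occupies in bit 23 of `flags`. */
static int assign_predicate(int slot, unsigned *flags)
{
    if (slot & 1)
        *flags |= 0x800000u;     /* hi half of the pair */
    else
        *flags &= ~0x800000u;    /* lo half of the pair */
    return slot / 2;             /* physical predicate register */
}
```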
Case 3: Over-Budget / Spill Trigger
If slot >= regclass_info.max_regs and the VR is not already marked as spilled (flag 0x4000 at vreg+48), the allocator sets the needs-spill flag:
vreg.flags |= 0x40000 // needs-spill flag (bit 18)
When the needs-spill flag is later detected, the allocator calls:
- `sub_939BD0` -- spill allocator setup (selects bucket size, alignment, and maximum based on knob 623 and the cost threshold at `alloc+776`)
- `sub_94F150` -- spill code generation (561 lines, emits spill/reload instructions)
The spill cost is accumulated:
alloc.total_spill_cost += vreg.spill_cost // double at alloc+1568
alloc.secondary_cost += vreg.secondary_cost // float at alloc+1576
Case 4: Alias Chain Propagation
After writing the physical register, the function follows the alias chain at vreg+36 (coalesced parent pointer). Every VR in the chain receives the same physical assignment:
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot // alias+68
alias = alias.alias_parent // alias+36
This propagation ensures that coalesced registers (merged by the coalescing pass at sub_9B1200) share a single physical register without requiring the allocator to re-derive the relationship.
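The propagation loop above is a plain linked-list walk. Sketched in C, with struct fields modeling `vreg+68` (physical register) and `vreg+36` (alias parent); the names are ours:

```c
#include <assert.h>
#include <stddef.h>

typedef struct vreg {
    int physical_register;       /* models vreg+68 */
    struct vreg *alias_parent;   /* models vreg+36 */
} vreg;

/* Assign `slot` to a VR and every coalesced VR up its alias chain. */
static void propagate_alias_chain(vreg *v, int slot)
{
    v->physical_register = slot;
    for (vreg *a = v->alias_parent; a != NULL; a = a->alias_parent)
        a->physical_register = slot;
}
```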
Pre-Allocated Candidate Check
Before committing a normal assignment, sub_94FDD0 calls sub_950100 (205 lines) to check if the VR has a pre-allocated candidate in the hash table at alloc+248. If a candidate exists (FNV-1a keyed lookup), the pre-assigned physical register is used instead of the one selected by the pressure scan. For paired registers, the pre-assigned slot is doubled (type 1 -> slot * 2) to account for pair stride.
Pre-Allocation Pass: sub_94A020
Before the core allocator runs, the pre-allocation pass (331 lines) optionally assigns physical registers to high-priority operands. This pass is gated by three knobs:
| Knob | Role |
|---|---|
| 628 | Enable pre-allocation pass |
| 629 | Enable coalescing-aware pre-allocation |
| 618 | Enable uniform register pre-allocation |
When enabled and the allocation mode is 3, 5, or 6, the pass:
- Clears the pre-allocation candidate hash tables at `alloc+240..336` (six tables covering candidates, results, and overflow).
- Iterates basic blocks, calling `sub_9499E0` (per-block scanner, 304 lines) to identify pre-assignment opportunities.
- For each eligible instruction, calls `sub_93ECB0` (194 lines) to pre-assign operands.
sub_93ECB0 iterates instruction operands in reverse order (last to first). Operands must be type 1 (register) with an index that is not 41--44 (architectural predicates) or 39 (special). A switch on the masked opcode determines how many operands qualify: opcode 22 dispatches to sub_7E40E0, opcode 50 uses a lookup table, and opcodes 77/83/110--112/279/289/297/352 each have dedicated handlers. The function calls sub_93E9D0 with a priority level determined by OCG knob 646:
| Priority | Meaning |
|---|---|
| 1 | Pre-assign read operands only |
| 2 | Pre-assign write operands only |
| 3 | Pre-assign both read and write operands |
sub_93E9D0 (125 lines) creates a spill candidate node via sub_93E290 (allocates 192-byte structures from the arena freelist at alloc+232), marks the live range via sub_93DBD0 (356 lines), and recursively processes dependent operands via sub_93EC50.
Retry Loop: sub_971A90
The per-class allocation driver (355 lines) wraps the core allocator in a two-phase retry loop.
Phase 1: NOSPILL
The first attempt runs the core allocator without spill permission. The debug log emits:
"-CLASS NOSPILL REGALLOC: attemp N, used M, target T"
(Note: "attemp" is a typo present in the binary.)
The call sequence for each NOSPILL attempt:
sub_93FBE0(alloc, ctx, iteration) // reset state for attempt
if iteration == 0:
sub_956130(alloc, class) // build interference masks (first attempt only)
result = sub_957160(alloc, ctx, iteration) // core fat-point allocator
sub_93D070(&best, class, iteration, // record best result
result, pressure, alloc, cost)
The NOSPILL loop runs up to v102 attempts. Retry mode selection (from sub_971A90 lines 199--240):
| Condition | v102 (max attempts) | Behavior |
|---|---|---|
| Knob 638 enabled + special mode | 0 | No allocation at all |
| Knob 638 enabled, knob 639 set | knob 639 value | Custom iteration count |
| Knob 638 enabled, knob 639 unset | 1 | Single attempt |
| Knob 638 disabled, pressure low | 2 | Standard 2-attempt retry |
| Knob 638 disabled, pressure high | 0 | Skip to spill |
Exit conditions within the NOSPILL loop:
- `target >= adjusted_result`: allocation fits within budget (success)
- `target >= result`: no improvement possible between iterations (give up)
- The best-result recorder (`sub_93D070`) compares the current attempt against the best seen so far using a multi-criterion ranking: register count first, then cost (double at `best+56`), then spill count, then class width. It uses `128 / register_count` as an inverse density metric.
Phase 2: SPILL
If all NOSPILL attempts fail, the driver invokes spill guidance:
guidance = sub_96D940(ctx, guidance_array, attempt_no) // 2983 lines
The spill guidance function builds priority queues of spill candidates for each of the 7 register classes. Each guidance entry is an 11112-byte working structure containing 128-element bitmask arrays. The function contains 7 near-identical code blocks (one per class), likely unrolled from a C++ template.
After spill guidance, a final allocation attempt runs via sub_9714E0 (finalize/spill). If this also fails, sub_936FD0 (fallback allocation) makes a last-ditch effort. If that fails too, register assignments are cleared to -1 and the allocator reports:
"Register allocation failed with register count of '%d'.
Compile the program with a higher register target"
SMEM Spill Activation
For allocation modes 3 or 6 when the compilation target is device type 5, shared-memory spilling is activated before the retry loop:
if (class == 3 || class == 6) and device_type == 5:
if num_variables > 0:
sub_939BD0(alloc) // spill allocator setup
sub_94F150(alloc, ctx, 1) // spill codegen to SMEM
alloc.spill_triggered = 1 // flag at alloc+865
This path generates spill/reload instructions targeting shared memory instead of local memory, which is faster but limited in size and shared across the CTA.
Per-Class Iteration
The top-level entry point (sub_9721C0, 1086 lines) drives allocation for all register classes sequentially:
for class_id in 1..6:
if class_list[class_id] is empty:
continue
alloc.current_class = class_id // alloc+376
while sub_971A90(alloc, ctx, class_id) != 0:
sub_8E3A80(alloc+2) // arena cleanup between attempts
Classes 1--6 are initialized via the target descriptor vtable at offset +896. The vtable call vtable[896](alloc_state, class_id) populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class). The class IDs correspond to reg_type values (1 = R, 2 = R alt, 3 = UR, 4 = UR ext, 5 = P/UP, 6 = Tensor/Acc).
| Class ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |
Class 0 (unified/cross-class) is skipped in the main loop. It is used for cross-class constraint propagation during the interference building phase. Classes 3 (UR) and 6 (Tensor/Acc) have early-out conditions: if alloc+348 == 2 (class 3) or alloc+332 == 2 (class 6), allocation is skipped because no VRs of that class exist. Barrier registers (B, UB) have reg_type = 9, which is above the <= 6 allocator cutoff and are handled by a separate mechanism outside the 7-class system.
Before the per-class loop, virtual registers are distributed into class-specific linked lists (lines 520--549 of sub_9721C0):
for each vreg in function_vreg_list: // from ctx+104
if vreg.id in {41, 42, 43, 44}: // skip architectural predicates
continue
class = vreg.register_class // vreg+12
if class >= 1 and class <= 6 and vreg.type != 0:
insert(class_lists[class], vreg)
The VR list is sorted by priority (sub_9375C0) before distribution. Priority ordering ensures that VRs with more constraints and higher spill costs are allocated first, giving them first pick of the register file.
Fast Register Allocation: Knob 638
Knob 638 (register pressure analysis enable / fast allocation mode) controls a single-pass no-retry allocation path. When enabled with the special mode flag set, the allocator sets v102 = 0, meaning the NOSPILL retry loop body never executes. Allocation proceeds directly to spill handling without iterating.
When knob 638 is enabled without the special mode flag:
- The iteration count is set to 1 (or the value of knob 639 if set)
- This creates a limited-retry mode where the allocator makes at most `knob_639` attempts
- Each attempt still uses the full fat-point algorithm, but with no fallback to the multi-attempt guidance-driven loop
This mode is intended for fast compilation (--fast-compile) where compilation time matters more than register allocation quality. The allocator accepts the first viable assignment rather than searching for an optimal one.
Interference Builder: sub_926A30
The interference builder (4005 lines) is the largest single function in the allocator. It constructs the constraint lists that feed the pressure arrays. For each basic block and each instruction within it, the builder:
- Iterates instruction operands. Each operand is a 32-bit descriptor:
- Bits 27--25: operand type (1 = register, 6 = special, 7 = immediate)
- Bits 23--0: register/variable ID
- Bit 31: sign/direction flag
- Bit 24: pair extension bit
- For register operands (type 1), extracts the VR ID and looks up the VR object.
- Determines the constraint type based on the operand's role (def, use, or both), the instruction's properties, and the VR's pair mode.
- Creates a constraint node and appends it to the VR's constraint list.
- For paired registers (type 3 in the operand descriptor), generates two constraints: one for the low half and one for the high half (distinguished by bit 23).
- Uses SSE2 vectorized loops for bulk weight accumulation when processing large basic blocks with many live registers.
The builder queries multiple OCG knobs via vtable dispatches at offsets +72, +120, +152, +224, +256, +272, and +320. These knobs modulate constraint weights and enable/disable specific constraint categories (e.g. bank-conflict-aware constraints are gated by knob 641).
Special register IDs 41--44 (PT, P0--P3) and 39 are always skipped. The skip predicate (sub_9446D0, 29 lines) additionally checks for CSSA phi instructions (opcode 195 with type 9 = barrier) and performs hash-table lookups in the exclusion set at alloc+360.
Best Result Recorder: sub_93D070
The best-result recorder (155 lines) compares the current allocation result against the best seen across all retry attempts. It maintains state at offsets best[10..20]:
best[10] = register_count // best count so far
best[13] = 128 / register_count // inverse density metric
best[16] = max_pressure // peak live registers
best[17] = spill_score
*(double*)(best + 56) = cost // floating-point cost metric
best[18] = arch_peak_1 // from architecture state +408
best[20] = arch_peak_2 // from architecture state +400
Comparison uses lexicographic ordering:
- Lower register count wins
- On tie: lower cost (double) wins
- On tie: lower spill count wins
- On tie: lower class width wins
When the current attempt improves over the best, the recorder allocates a per-register assignment array and copies the full VR-to-physical-register mapping for later restoration.
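The four-level lexicographic ordering is easy to state as a standalone comparator. Field names here are ours; the offsets they model are given in the listing above.

```c
#include <assert.h>

typedef struct {
    int regs;       /* register count (best[10])       */
    double cost;    /* floating-point cost (best + 56) */
    int spills;     /* spill score (best[17])          */
    int width;      /* class width                     */
} attempt;

/* Nonzero when attempt a ranks strictly better than attempt b. */
static int better(const attempt *a, const attempt *b)
{
    if (a->regs != b->regs)     return a->regs < b->regs;
    if (a->cost != b->cost)     return a->cost < b->cost;
    if (a->spills != b->spills) return a->spills < b->spills;
    return a->width < b->width;
}
```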
Per-Instruction Assignment: sub_9680F0
The per-instruction assignment core loop (3722 lines, the largest function in part 2 of the allocator) handles the actual instruction-by-instruction walk during allocation:
- Iterates instructions via linked list (`v87 = *(_QWORD *)v87`)
- For each instruction, calls `sub_961A60` to attempt register assignment
- Tracks register pressure via the `v86` counter and 256-bit bitvectors at `alloc+1342..1350`
- Manages three bitvector masks per instruction: assigned, must-not-spill, and used
- Detects rematerialization opportunities (flag `v570`) and calls `sub_93AC90`
- Detects bank conflicts via `sub_9364B0` and resolves them
- Handles special opcodes: 187 / IMMA_16832 (`VZZN_16832`, behavioral: LOAD), 97 / STG (`FGT`, behavioral: STORE), 52 / BB boundary (behavioral: BRANCH), 236 / UBLKPF (`HOYXCS`, behavioral: CALL)
- Tracks first-spill-candidate (`alloc+1354`) and fallback-spill-candidate (`alloc+1355`)
- On allocation failure for an instruction, calls `sub_96CE90`, which recursively invokes `sub_9680F0` with different flags for the spill fallback path
Post-Allocation Verification
After register allocation completes, a verification pass called "memcheck" (NVIDIA's internal name, unrelated to Valgrind) compares reaching definitions before and after allocation. Every instruction's operands are checked: the set of definitions that could reach each use must be preserved, or any change must be explained by a known-safe transformation (spill/refill, rematerialization, predicate packing). Unexplained changes indicate an allocator bug.
The verification runs inside the post-regalloc scheduling pass (sub_A8B680), after all register assignments are finalized and spill/reload instructions have been inserted.
Verification Call Flow
sub_A8B680 (PostAllocPass::run)
+-- sub_A5B1C0 build pre/post def-use chains (48KB, all instructions)
+-- sub_A76030 MemcheckPass::run -- entry point
|
for each instruction in function:
| +-- sub_A56790 fast per-instruction check (returns bool)
| | true -> skip (defs match)
| | false -> deep verify
| |
| +-- sub_A54140 look up pre-regalloc def set
| +-- sub_A54140 look up post-regalloc def set
| +-- sub_A75CC0 deep single-instruction verification
| +-- sub_A56400 build Additions list (new defs)
| +-- sub_A56400 build Removals list (lost defs)
| +-- sub_A55D80 diagnostic reporter (10 error codes)
| +-- sub_A55D20 print uninitialized-def detail
|
printf("TOTAL MISMATCH %d MISMATCH ON OLD %d\n", ...)
Fast Check vs Deep Verify
The fast check (sub_A56790) performs a lightweight comparison per instruction. It returns true when pre-regalloc and post-regalloc reaching definitions match exactly. Only on failure does the verifier invoke the expensive deep path (sub_A75CC0), which:
- Builds two difference lists -- "Additions" (defs present after but not before) and "Removals" (defs present before but not after).
- Classifies each difference as either `BENIGN (explainable)` or `POTENTIAL PROBLEM` by pattern-matching against known-safe transformations: spill-store/refill-load pairs, P2R/R2P predicate packing, bit-spill patterns, and rematerialized instructions.
- For each unexplained difference, creates an error record with a category code (1--10), pointers to the offending instructions, and operand type flags.
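At its core, building the Additions and Removals lists is a two-way set difference over small definition-ID sets. A linear-scan sketch (helper names are ours; in the binary this is the role of sub_A56400):

```c
#include <assert.h>

static int contains(const int *set, int n, int v)
{
    for (int i = 0; i < n; i++)
        if (set[i] == v)
            return 1;
    return 0;
}

/* Write elements of a not present in b into out; return the count.
 * Additions = post \ pre; Removals = pre \ post. */
static int set_minus(const int *a, int na, const int *b, int nb, int *out)
{
    int n = 0;
    for (int i = 0; i < na; i++)
        if (!contains(b, nb, a[i]))
            out[n++] = a[i];
    return n;
}
```

Using the defs from the reporter example below ([42]/[38] before, [42]/[55] after), the additions come out as {55} and the removals as {38}.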
Error Categories
The reporter (sub_A55D80) dispatches on the error code at record+24:
| Code | Name | Message | Trigger |
|---|---|---|---|
| 1 | Spill-refill mismatch | Failed to find matching spill for refilling load that is involved in this operand computation | A post-regalloc refill load has no corresponding spill store. The verifier walks the spill-refill chain and cannot find the matching pair. |
| 2 | Refill reads uninitialized | This operand was fully defined before register allocation, however refill that is involved in this operand computation reads potentially uninitialized memory | The refill reads from a stack slot that was never written -- the spill store was optimized away or placed on a non-executing path. |
| 3 | P2R-R2P pattern failure | Failed to establish match for P2R-R2P pattern involved in this operand computation | Predicate-to-register / register-to-predicate instruction pairs used to spill predicate registers through GPRs have a broken chain -- the matching counterpart is missing. |
| 4 | Bit-spill-refill failure | Failed to establish match for bit-spill-refill pattern involved in this operand computation | The bit-packing variant of predicate spilling (multiple predicates packed into GPR bits) failed pattern matching. Same root cause as code 3 but for the packed representation. |
| 5 | Uninitialized value introduced | Before register allocation this operand was fully defined, now an uninitialized value can reach it | Pre-regalloc: all paths to this use had a definition. Post-regalloc: at least one path has no definition. The register holds a stale value from a prior computation. Typically caused by a spill placed on the wrong path or a definition eliminated during allocation. |
| 6 | Extra defs introduced | After reg-alloc there are some extra def(s) that participate in this operand computation. They were not used this way before the allocation. | The post-regalloc definition set is a strict superset of the pre-regalloc set. New definitions were introduced through register reuse or aliasing. When the extra def involves a short/byte type in a wider register, the reporter prints: Does this def potentially clobber upper bits of a register that is supposed to hold unsigned short or unsigned byte? and suggests -knob IgnorePotentialMixedSizeProblems. |
| 7 | Rematerialization mismatch | Rematerialization problem: Old instruction: [%d] New instruction: [%d] | A rematerialized instruction does not match the original. The new instruction created to recompute a value differs from the original in an unexpected way. |
| 8 | P2R-R2P base destroyed | Some instruction(s) are destroying the base of P2R-R2P pattern involved in this operand computation | The GPR holding packed predicate bits is overwritten between the P2R store and R2P restore by another instruction that defs the same physical register. |
| 9 | Bit-spill base destroyed | Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation | Same as code 8 but for the bit-packing spill variant. The base register holding packed predicate bits is overwritten between store and restore. |
| 10 | Definitions disappeared | During register allocation we did not add any new definitions for this operand and yet some original old ones disappeared | The post-regalloc definition set is a strict subset of the pre-regalloc set. Definitions were removed without replacement by spill/refill or rematerialization. |
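The subset/superset relations behind codes 5, 6, and 10 can be summarized as a set comparison. This is an illustrative Python sketch, not the binary's implementation; it assumes each operand carries its pre- and post-regalloc reaching-definition sets as plain sets of instruction ids.

```python
def classify_def_sets(pre_defs: set, post_defs: set) -> str:
    """Classify a pre/post reaching-definition mismatch along the lines
    of verifier categories 5/6 (superset) and 10 (subset). Sketch only."""
    if post_defs == pre_defs:
        return "ok"                       # no mismatch
    if post_defs > pre_defs:
        return "extra defs introduced"    # code 6: strict superset
    if post_defs < pre_defs:
        return "definitions disappeared"  # code 10: strict subset
    return "mixed additions and removals"

# A use with defs {42, 38} before allocation but {42, 55} after falls
# into the mixed case: one addition (55) and one removal (38), exactly
# the Additions/Removals lists the reporter prints.
```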
Reporter Output Format
When DUMPIR=AllocateRegisters is enabled (knob ID 266), the reporter (sub_A55D80) prints a structured diagnostic per mismatch:
=== ... (110 '=' banner) ===
REMATERIALIZATION PROBLEM. New Instruction [N] Old Instruction [M] // only if instruction changed
INSTRUCTION: [N]
=== ... ===
Operand # K
Producers for operand K of instruction [N] before register allocation:
[42] def opr # 0 for src opr # 2
[38] def opr # 1 for src opr # 2
Producers for operand # K of instruction [N] after register allocation:
[42] def opr # 0 for src opr # 2
[55] def opr # 0 for src opr # 2 // <-- new, from refill
Additions
[55] def opr # 0 src opr # 2 BENIGN (explainable)
Removals
[38] def opr # 1 src opr # 2 POTENTIAL PROBLEM
<error-category-specific message from the table above>
If DUMPIR=AllocateRegisters is not enabled and mismatches exist, the verifier prints a one-shot suggestion:
Please use -knob DUMPIR=AllocateRegisters for debugging
The one-shot flag at verifier+1234 ensures this message appears at most once per allocation attempt.
Mismatch Counters
The verifier tracks two counters reported at the end of the pass:
| Offset | Counter | Meaning |
|---|---|---|
verifier+1236 | Total mismatches | Instructions where post-regalloc defs differ from pre-regalloc defs in an unexplained way. |
verifier+1240 | Old mismatches | Subset of total mismatches that represent pre-existing issues -- the pre-regalloc def chain was already empty (no reaching definitions before allocation either). These are not regressions from the current allocation attempt. |
Knob Interactions
| Knob | Effect |
|---|---|
DUMPIR=AllocateRegisters (ID 266) | Enables verbose per-mismatch diagnostic output. Without this, only the summary line and suggestion are printed. |
IgnorePotentialMixedSizeProblems | Suppresses the mixed-size aliasing warning in error code 6 (extra defs involving short/byte types in wider registers). |
memcheck flag at function+1384 | Gates whether verification runs at all. When zero, sub_A76030 is not called. |
Verification Function Map
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_A54140 | -- | Def-use chain lookup (hash table query into pre/post maps) | HIGH |
sub_A55D20 | ~100B | Print uninitialized-def warning helper | HIGH |
sub_A55D80 | 1454B | Diagnostic reporter -- 10 error categories, structured output | HIGH |
sub_A56400 | -- | Build additions/removals lists for deep comparison | MEDIUM |
sub_A56560 | 698B | Verify single operand's reaching definitions | HIGH |
sub_A56790 | ~250B | Per-instruction fast check (returns bool pass/fail) | HIGH |
sub_A5B1C0 | 8802B | Full-function def-use chain builder (pre and post regalloc) | HIGH |
sub_A60B60 | 4560B | Pre/post chain comparison engine | HIGH |
sub_A62480 | -- | Reset scratch arrays between operand checks | MEDIUM |
sub_A75220 | 2640B | Compare reaching definitions (builds diff lists) | HIGH |
sub_A75CC0 | 866B | Deep single-instruction verifier (classifies diffs) | HIGH |
sub_A76030 | 770B | MemcheckPass::run -- verification entry point | HIGH |
Occupancy-Aware Budget Model
The allocator maintains a 144-byte budget pressure model at alloc+1600--alloc+1744 that adjusts the effective register budget based on thread occupancy. The model is initialized by sub_947150 (the allocator constructor) and consumed by the spill guidance function sub_96D940. The goal: kernels that need high occupancy get tighter register budgets (more aggressive spilling), while kernels that can tolerate low occupancy get relaxed budgets (more registers, fewer spills).
Coefficient Initialization
Three knob-overridable coefficients control the interpolation:
| Field | Offset | Knob | Knob Name | Type | Default | Meaning |
|---|---|---|---|---|---|---|
| coeffA | +1632 | 664 | RegAllocSpillBitLowRegScale | DBL | 0.2 | Scale at low register counts (piecewise default value) |
| coeffB | +1640 | 661 | RegAllocSpillBitHighRegScale | DBL | 1.0 | Scale at high register counts (linear model y_max) |
| coeffC | +1648 | 665 | RegAllocSpillBitMediumRegScale | DBL | 0.3 | Scale at medium register counts (linear model y_min) |
Two integer knobs set the tier boundaries:
| Field | Offset | Knob | Knob Name | Type | Default | Meaning |
|---|---|---|---|---|---|---|
| maxThreads | +1624 | 663 | RegAllocSpillBitLowRegCountHeur | INT | 119 | Low register count tier boundary |
| pressureThresh | +1628 | 660 | RegAllocSpillBitHighRegCountHeur | INT | 160 | High register count tier boundary |
All five knobs use the standard OCG type-check pattern: byte at knobArray + 72*index encodes 0 (use default), 1 (use INT value at +8), or 3 (use DBL value at +8).
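The type-check pattern can be sketched as follows. The 72-byte stride, tag values, and +8 payload offset come from the text above; the function name and buffer representation are hypothetical.

```python
import struct

KNOB_STRIDE = 72  # bytes per knob record in knobArray

def read_knob(knob_bytes: bytes, index: int, default):
    """Decode one OCG knob record: the tag byte selects the compiled-in
    default vs. an INT or DBL payload 8 bytes into the record (sketch)."""
    base = KNOB_STRIDE * index
    tag = knob_bytes[base]
    if tag == 0:
        return default                                            # knob not set
    if tag == 1:
        return struct.unpack_from("<i", knob_bytes, base + 8)[0]  # INT override
    if tag == 3:
        return struct.unpack_from("<d", knob_bytes, base + 8)[0]  # DBL override
    raise ValueError(f"unknown knob tag {tag}")
```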
Piecewise Interpolation Table
After reading the knobs, sub_947150 queries the target descriptor for the hardware maximum thread count (vtable slot 90, offset +720). This value is clamped to maxThreads - 1 if a knob override is active. The result becomes totalThreads -- the kernel's maximum achievable occupancy.
An optional override through the function-object vtable at context+1584 (vtable slot 118, offset +944) can adjust the architectural register limit. When the override is active, it calls override_fn(totalThreads, param, 255.0) and sets the adjusted limit to 255 - result. When inactive, the limit stays at 255.
The piecewise array stores 7 doubles -- alternating (value, x-boundary) entries ending with a final value -- that define a step function mapping register counts to scale factors:
interpTable[0] = coeffA interpTable[1] = maxThreads
interpTable[2] = coeffA interpTable[3] = pressureThresh
interpTable[4] = coeffA interpTable[5] = adjustedLimit // 255 or (255 - override)
interpTable[6] = coeffB
This means the budget scale is coeffA (0.2) for every register count up to the adjusted limit (255) -- the maxThreads (119) and pressureThresh (160) boundaries all carry the same coeffA value at default knob settings -- and jumps to coeffB (1.0) only beyond that limit. In practice the piecewise table establishes a two-tier system: a low scale for most of the register range, jumping to the high scale only at the top.
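A minimal sketch of the resulting step function, assuming "at or below a tier's x-boundary" selects that tier's value (the exact boundary semantics are not recovered):

```python
def budget_scale(reg_count: int, coeff_a: float = 0.2, coeff_b: float = 1.0,
                 max_threads: int = 119, pressure_thresh: int = 160,
                 adjusted_limit: int = 255) -> float:
    """Walk the 7-entry interpolation table: three tiers that all hold
    coeff_a at default knobs, then coeff_b past the adjusted limit."""
    table = [(coeff_a, max_threads),
             (coeff_a, pressure_thresh),
             (coeff_a, adjusted_limit)]
    for value, boundary in table:
        if reg_count <= boundary:
            return value
    return coeff_b  # interpTable[6]: final tier, no boundary
```

With non-default knobs the three tiers can diverge, which is presumably why the table keeps all three boundaries rather than collapsing to a single threshold.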
Linear Interpolation Model
A separate linear model provides continuous interpolation for the spill guidance decision. Two more vtable queries establish the domain:
x_min = target->getMaxOccupancy() // vtable slot 90 on target descriptor via context+1584
x_max = target->getMinOccupancy() // vtable slot 96 (offset +768) on function-object via context+1584
y_min = coeffC // 0.3
y_max = coeffB // 1.0
slope = (coeffB - coeffC) / (x_max - x_min)
The slope is stored at alloc+1736. Since x_max (minimum occupancy, meaning fewest concurrent threads = most registers allowed) is typically greater than x_min (maximum occupancy, meaning most concurrent threads = fewest registers), the slope is positive: as the function moves toward allowing more registers (fewer threads), the budget fraction increases.
Spill Guidance Consumption
The spill guidance function sub_96D940 (line 1520 in the decompiled output) uses the linear model to compute a dynamic spill threshold:
budget_fraction = (current_reg_count - x_min) * slope + y_min
spill_threshold = budget_fraction * (class_budget - class_floor + 1)
Where:
- `current_reg_count`: the current register allocation count for this class (from `alloc+884`, indexed by class)
- `class_budget` and `class_floor`: per-class allocation bounds at `alloc + 32*class + 884` and `alloc + 32*class + 880`
- For paired registers, `current_reg_count` is halved: `(count + 1) >> 1`
The comparison at line 1527:
if (spill_count + unspilled_need_spill + current_reg_count) > spill_threshold:
trigger_spill(sub_948B80)
If the total pressure (live registers needing spill + those already marked for spill + current allocation count) exceeds the occupancy-adjusted threshold, the allocator triggers a spill. The sub_948B80 call adds the current VR to the spill candidate queue.
Worked Example
For a Blackwell SM100 kernel with default knobs:
| Parameter | Value | Source |
|---|---|---|
| coeffA | 0.2 | Knob 664 default |
| coeffB | 1.0 | Knob 661 default |
| coeffC | 0.3 | Knob 665 default |
| maxOccupancy (x_min) | 240 | SM100 target vtable slot 90 |
| minOccupancy (x_max) | 480 | SM100 target vtable slot 96 |
| slope | (1.0 - 0.3) / (480 - 240) = 0.00292 | Computed |
If the current GPR class has a budget of 128 and floor of 0 (range = 129), and the function currently uses 300 registers:
budget_fraction = (300 - 240) * 0.00292 + 0.3 = 0.475
spill_threshold = 0.475 * 129 = 61.3
If more than 61 VRs are pending spill or already allocated, the allocator triggers a spill rather than attempting to fit another register. With fewer registers in play (say 250), the fraction drops to 0.329 and the threshold tightens to 42 -- more aggressive spilling at higher occupancy targets.
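The worked numbers above can be reproduced directly. This is a plain restatement of the model in Python; the helper name and default arguments are illustrative.

```python
coeff_b, coeff_c = 1.0, 0.3        # knob 661 / 665 defaults (y_max, y_min)
x_min, x_max = 240.0, 480.0        # SM100 occupancy bounds from the target vtable
slope = (coeff_b - coeff_c) / (x_max - x_min)   # 0.7 / 240 ~= 0.00292

def spill_threshold(current_reg_count: float,
                    class_budget: int = 128, class_floor: int = 0) -> float:
    """Occupancy-adjusted spill threshold for one register class."""
    budget_fraction = (current_reg_count - x_min) * slope + coeff_c
    return budget_fraction * (class_budget - class_floor + 1)

# 300 registers in play -> threshold ~61.3; 250 -> ~42.5 (tighter budget)
```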
Budget Model Field Summary
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Auxiliary pointer (unused at init) |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob 663 | maxThreads (low reg count tier boundary) |
| +1628 | 4 | DWORD | 160 / knob 660 | pressureThresh (high reg count tier boundary) |
| +1632 | 8 | double | 0.2 / knob 664 | coeffA (low-register scale) |
| +1640 | 8 | double | 1.0 / knob 661 | coeffB (high-register scale / y_max) |
| +1648 | 8 | double | 0.3 / knob 665 | coeffC (medium-register scale / y_min) |
| +1656 | 8 | double | (computed) | totalThreads as double |
| +1664 | 8 | double | = coeffA | interpTable[0]: value for tier 0 |
| +1672 | 8 | double | = maxThreads | interpTable[1]: x-boundary for tier 0 |
| +1680 | 8 | double | = coeffA | interpTable[2]: value for tier 1 |
| +1688 | 8 | double | = pressureThresh | interpTable[3]: x-boundary for tier 1 |
| +1696 | 8 | double | = coeffA | interpTable[4]: value for tier 2 |
| +1704 | 8 | double | = adjustedLimit | interpTable[5]: x-boundary for tier 2 |
| +1712 | 8 | double | = coeffB | interpTable[6]: value for tier 3 (final) |
| +1720 | 8 | double | (computed) | x_min: max occupancy thread count |
| +1728 | 8 | double | = coeffC | y_min (linear model intercept at x_min) |
| +1736 | 8 | double | (computed) | slope: (coeffB - coeffC) / (minOcc - maxOcc) |
| +1744 | 8 | ptr | 0 | Tail sentinel |
Post-Init: sub_939BD0
Immediately after building the interpolation tables, sub_947150 calls sub_939BD0 which configures the spill guidance lookup strategy at alloc+1784. This function queries knob 623 (RegAllocEstimatedLoopIterations) through the function-object vtable:
- If knob 623 is set: the spill guidance uses the knob's value to estimate loop iteration counts, passed to the lookup strategy via vtable slot 3 (offset +24).
- If knob 623 is unset: the lookup strategy is initialized with default parameters. When the budget model's auxiliary weight at `alloc+776` is zero, the strategy uses `(8, 4, 0x100000)`; otherwise `(16, 16, 0x100000)`.
Function Map
| Address | Lines | Role | Confidence |
|---|---|---|---|
sub_926A30 | 4005 | Fat-point interference builder / constraint solver | HIGH |
sub_93D070 | 155 | Best result recorder (multi-criterion comparison) | HIGH |
sub_93E290 | 397 | Spill candidate node creator (192-byte arena alloc) | HIGH |
sub_93E9D0 | 125 | Pre-assign individual operand | HIGH |
sub_93ECB0 | 194 | Pre-assign registers (per-instruction dispatcher) | HIGH |
sub_93FBE0 | 940 | Per-iteration allocation state reset | HIGH |
sub_939BD0 | 63 | Spill guidance strategy initializer (knob 623 query) | HIGH |
sub_939CE0 | 23 | Register consumption counter (pair-aware) | HIGH |
sub_9446D0 | 29 | Register skip predicate (special regs, exclusion set) | HIGH |
sub_947150 | ~700 | Allocator constructor (budget model + interpolation init) | HIGH |
sub_94A020 | 331 | Pre-allocation pass (knobs 628/629/618) | HIGH |
sub_94FDD0 | 155 | Register assignment + alias propagation | HIGH |
sub_950100 | 205 | Pre-allocated candidate applier (FNV-1a lookup) | HIGH |
sub_956130 | 873 | Register class interference mask builder (SSE2) | HIGH |
sub_957020 | 24 | Occupancy bitvector resizer (arena-backed realloc + memset) | HIGH |
sub_94C9E0 | 59 | Occupancy bitmask range setter (word-level OR with masks) | HIGH |
sub_7DAFD0 | 7 | VR aligned width computation (ceil(size/stride)*stride) | CERTAIN |
sub_957160 | 1658 | Core fat-point allocator (coloring engine) | HIGH |
sub_9680F0 | 3722 | Per-instruction assignment core loop | HIGH |
sub_96D940 | 2983 | Spill guidance (7-class priority queues) | HIGH |
sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
sub_9714E0 | -- | Post-allocation finalization | MEDIUM |
sub_9721C0 | 1086 | Register allocation entry point | HIGH |
sub_936FD0 | -- | Final fallback allocation | MEDIUM |
sub_9375C0 | -- | VR priority sort | MEDIUM |
Spill Mechanism
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
When the fat-point register allocator cannot fit all simultaneously-live virtual registers into the hardware register budget, it spills excess values to memory and reloads them on demand. The spill subsystem is the second-largest component of the register allocator by code volume, spanning roughly 25 functions and 12,000+ lines of decompiled code. It implements a cost-driven, retry-based spill strategy with two memory targets (local memory and shared memory), a per-class priority queue guidance engine, and a multi-attempt allocation loop that progressively relaxes constraints until allocation succeeds or fails fatally.
| Spill trigger | sub_94FDD0 (155 lines) -- sets flag 0x40000 when assignment exceeds budget |
| Spill guidance | sub_96D940 (2983 lines) -- builds 7 priority queues of spill candidates |
| Spill codegen | sub_94F150 (561 lines) -- inserts spill stores and refill loads |
| LMEM setup | sub_939BD0 (65 lines) -- local memory slot allocator configuration |
| SMEM allocator | sub_9539C0 (1873 lines) -- shared memory spill alternative |
| Retry driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry loop |
| Finalization | sub_9714E0 (290 lines) -- commit spills, emit errors on failure |
| SASS codegen | sub_9850F0 (520 lines) -- generate LDL/STL instruction sequences |
| Key knobs | 623 (spill mode), 638/639 (retry limits), 684 (interference threshold) |
Spill Trigger
The spill trigger fires inside the per-virtual-register assignment function sub_94FDD0 (155 lines). When the fat-point allocator (sub_957160) selects a physical slot for a virtual register, it calls sub_94FDD0 to commit the assignment. If the chosen slot index equals or exceeds the per-class register budget, the function does not commit -- instead it marks the virtual register for spilling.
function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
max_regs = regclass_info.max_regs // at regclass_info+16
// Budget check
if slot >= max_regs and not vreg.has_flag(0x4000): // not already spilled
vreg.flags |= 0x40000 // set "needs-spill" bit
return // do NOT commit assignment
// Spill path: flag was set on a previous call
if vreg.has_flag(0x40000): // needs-spill
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, ctx, 1) // sub_94F150
return
// Non-spill path: commit the assignment
consumption = compute_consumption(vreg) // sub_939CE0
update_peak(alloc+1564, consumption)
update_peak(alloc+1528, consumption)
vreg.physical_register = slot // vreg+68
// Accumulate spill cost even for successful assignments
*(double*)(alloc+1568) += *(float*)(vreg+40) // store weight
*(float*)(alloc+1576) += load_weight // load weight
apply_preallocated_candidate(alloc, vreg) // sub_950100
// Propagate through alias chain
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot
alias = alias.alias_parent
The two flag bits at vreg+48 encode spill state:
| Bit | Mask | Meaning |
|---|---|---|
| 14 | 0x4000 | Already spilled -- prevents the same vreg from being spilled again |
| 18 | 0x40000 | Needs spill -- triggers spill codegen on the next sub_94FDD0 call |
Register consumption (sub_939CE0, 23 lines) accounts for paired registers. For double-width registers (pair mode 3 at vreg+48 bits 20--21), it returns assignment + 1, consuming two physical slots.
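The two flag bits and the pair-aware consumption rule can be sketched together. Masks and bit positions are taken from the tables above; the function names and the exact placement of the pair-mode field are assumptions.

```python
FLAG_SPILLED     = 0x4000    # bit 14 of vreg+48: already spilled
FLAG_NEEDS_SPILL = 0x40000   # bit 18 of vreg+48: spill codegen pending
PAIR_MODE_SHIFT  = 20        # pair mode occupies bits 20-21 (assumed layout)
PAIR_MODE_MASK   = 0x300000

def register_consumption(assignment: int, flags: int) -> int:
    """Pair mode 3 consumes two physical slots (sub_939CE0 sketch)."""
    pair_mode = (flags & PAIR_MODE_MASK) >> PAIR_MODE_SHIFT
    return assignment + 1 if pair_mode == 3 else assignment

def mark_for_spill(flags: int) -> int:
    """Set the needs-spill bit unless this vreg was already spilled."""
    if flags & FLAG_SPILLED:
        return flags                  # never spill the same vreg twice
    return flags | FLAG_NEEDS_SPILL
```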
Spill Retry Loop
The outer allocation driver sub_971A90 (355 lines) wraps the core allocator in a two-phase strategy: first attempt allocation without spilling, then retry with progressively more aggressive spill guidance.
function alloc_with_spill_retry(alloc, ctx, class_id):
// Phase 1: NOSPILL
pre_allocation_pass(alloc) // sub_94A020
per_class_driver(alloc, ctx, class_id) // sub_95DC10
result = fatpoint_allocate(alloc, ctx, attempt=0) // sub_957160
record_best_result(&best, class_id, 0, result) // sub_93D070
if alloc.target >= adjusted_result:
goto finalize // success
// Phase 2: SPILL retry
max_attempts = query_knob(638) // default varies
if knob_639_set:
max_attempts = query_knob(639) // override
for attempt in 0..max_attempts:
reset_alloc_state(alloc, ctx, attempt) // sub_93FBE0
if attempt == 0:
build_interference_masks(alloc, class_id) // sub_956130
result = fatpoint_allocate(alloc, ctx, attempt)
debug("-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d",
attempt, result, alloc.target)
record_best_result(&best, class_id, attempt, result)
if alloc.target >= adjusted_result:
break
if all_attempts_failed and no_spill_recorded:
result = final_fallback(&best, result) // sub_936FD0
// Finalize
status = finalize_allocation(alloc, result, class_id, &best) // sub_9714E0
if HIBYTE(status):
clear_all_assignments_to_minus_one() // allocation failed
else:
commit_results()
The debug string "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary) is printed for every attempt.
For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates spill setup before entering the retry loop:
if (class_id == 3 or class_id == 6) and device_type == 5:
if vreg_count > 0:
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, ctx, 1) // sub_94F150
alloc.spill_done_flag = 1 // alloc+865
Spill Guidance Engine
sub_96D940 (2983 lines, 84 KB decompiled) computes which registers should be spilled and in what order. It is the largest spill-related function and one of the largest in the entire allocator.
Structure
The function contains 7 near-identical code blocks, one per register class (R, P, B, UR, UP, UB, tensor/accumulator). Each block is approximately 400 lines of bitvector iteration and set intersection. This repetition strongly suggests C++ template instantiation or macro expansion in the original source.
Spill Guidance Structure (11,112 bytes)
The guidance engine allocates a single 11,112-byte working structure from the arena (vtable +24). The structure is organized into five regions.
Region 0 -- Header and core pointers (bytes 0--271)
| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 0 | [0] | ptr | ctx | Back-pointer to allocation context |
| 24 | [3] | ptr | alloc+16 | Pointer into allocator state object |
| 32 | [4] | QWORD | 0 | Run counter / iteration state |
| 40 | [5] | QWORD | 0 | Class processing counter |
| 48 | [6] | QWORD | 0 | Spill mode flags |
| 96 | [12] | ptr | arena | Arena allocator pointer (from ctx+16) |
| 104 | [13] | ptr | arena | Queue header base / candidate list base |
| 112 | [14] | DWORD+DWORD | -1, 0 | Max element index sentinel, entry count |
| 136 | [17] | ptr | arena | Third arena reference |
| 144 | [18] | QWORD | 0 | Queue 0 entry count |
| 152 | [19] | QWORD | -1 | Queue 0 sentinel |
| 160 | [20] | ptr | arena | Fourth arena reference |
| 168 | [21] | QWORD | 0 | Queue 0b entry count |
| 176 | [22] | QWORD | -1 | Queue 0b sentinel |
| 184 | [23] | ptr | ctx | Back-pointer to context |
| 192 | [24] | ptr | node | Candidate node list head (24-byte arena node) |
| 200 | [25] | ptr | node | Candidate node list tail |
| 208 | [26] | QWORD | 0 | Node count |
| 216 | [27] | QWORD | 0 | Node capacity |
| 240 | [30] | ptr | node | Sentinel node (same as initial node at [24]) |
| 248 | [31] | QWORD | 0 | Free list head |
| 256 | [32] | QWORD | 0 | Free list count |
Region 1 -- Bitmask arrays (bytes 272--1327)
Two 508-byte bitmask arrays (127 DWORDs each), separated by single-byte sentinels:
| Byte range | Content |
|---|---|
| 284 | Sentinel byte (set to 0x80 after zeroing) |
| 288--795 | Bitmask array 0: 127 DWORDs for live range set intersection |
| 808 | Sentinel byte (set to 0x80 after zeroing) |
| 812--1319 | Bitmask array 1: 127 DWORDs for second class pair |
Each bitmask array is zeroed via an SSE2 vectorized loop (16 bytes per iteration, 0x1F iterations). The 0x80 sentinel byte at the start of each array marks initialization completion.
Region 2 -- Priority queue table blocks (bytes 1328--2063)
Five embedded priority queue tables, each containing an entry count (QWORD) followed by an array of 6 queue entries (24 bytes each):
| QWORD idx | Byte offset | Content |
|---|---|---|
| [166] | 1328 | Queue block 1 entry count (incremented by 7 per pass) |
| [167]--[184] | 1336--1479 | Queue block 1: 6 entries x 24 bytes |
| [188] | 1504 | Queue block 2 entry count |
| [189]--[206] | 1512--1655 | Queue block 2: 6 entries x 24 bytes |
| [210] | 1680 | Queue block 3 entry count |
| [211]--[228] | 1688--1831 | Queue block 3: 6 entries x 24 bytes |
| [232] | 1856 | Queue block 4 entry count |
| [233]--[250] | 1864--2007 | Queue block 4: 6 entries x 24 bytes |
| [256] | 2048 | Queue block 5 (overflow) count |
Each 24-byte queue entry has this layout:
| Entry offset | Type | Init | Field |
|---|---|---|---|
| +0 | ptr | arena | Bitvector storage pointer |
| +8 | QWORD | 0 | Bitvector data pointer |
| +16 | DWORD | -1 | Max element index |
| +20 | DWORD | 0 | Current element count |
Queue entries are built by sub_8BE190 and sorted by sub_7553C0. Candidates are inserted via sub_9370A0 (with tie-breaking) and removed via sub_9365A0 (bit-clear in bitvector).
Region 3 -- Candidate node management (bytes ~2064--10591)
The largest region (~8,528 bytes). Contains working storage for spill candidate evaluation across all 7 register classes. This region is zeroed during initialization and populated during the instruction walk phase by sub_93BF50 (candidate evaluation), sub_936610 (candidate insertion with cost), sub_9680F0 (cost propagation), and sub_93A1F0 (interference counting). The exact internal sub-layout varies by register class and virtual register count.
Region 4 -- Linked list, accumulators, and tail (bytes 10592--11111)
| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 10592 | [1324] | QWORD | 0 | Linked list head (spill candidate chain) |
| 10600 | [1325] | ptr | &self[1326] | Forward pointer (circular doubly-linked) |
| 10608 | [1326] | ptr | &self[1324] | Backward pointer |
| 10616 | [1327] | QWORD | 0 | List count |
| 10624 | [1328] | ptr | &self[1324] | Secondary forward pointer |
| 10632 | [1329] | ptr | &self[1326] | Secondary backward pointer |
| 10640 | -- | DWORD | 2 | Node type tag |
| 10648 | [1331] | ptr | node | Primary candidate node (24B, type=2) |
| 10656 | [1332] | ptr | node | Secondary candidate node (24B, type=2) |
| 10696 | [1337] | ptr | node | Secondary tail pointer |
| 10704 | [1338] | ptr | 0 | Instruction walk context (knob 622 gate) |
| 10712 | [1339] | QWORD | 0 | Walk state |
| 10720 | [1340] | QWORD | 0 | Walk counter |
| 10728 | -- | BYTE | 0 | Walk active flag |
| 10736 | [1342] | QWORD | 0 | Spill cost accumulator 0 |
| 10744 | [1343] | QWORD | 0 | Spill cost accumulator 1 |
| 10752--10824 | [1344]--[1353] | QWORD | 0 | Additional cost/range counters (10 slots) |
| 10840 | [1355] | QWORD | 0 | Interference counter |
| 10872 | -- | DWORD | 0 | Class mask |
| 10888 | [1361] | QWORD | 0 | Result register count |
| 10896 | [1362] | QWORD | 0 | Result cost metric |
| 10904 | [1363] | QWORD | 0 | Result spill count |
| 10912 | [1364] | QWORD | 0 | Result class width |
| 10920 | [1365] | QWORD | 0 | Best-attempt index |
| 10960 | [1370] | QWORD | 0 | Phase indicator |
| 10968 | [1371] | QWORD | 0 | Phase state |
| 10976 | [1372] | QWORD | 0 | Output flag |
| 11008 | [1376] | QWORD | 0 | SMEM spill tracking |
| 11016 | [1377] | QWORD | 0 | SMEM spill state |
| 11048 | [1381] | QWORD | 0 | Output queue pointer |
| 11056 | [1382] | QWORD | 0 | Output queue size |
| 11072 | [1384] | ptr | 0 | Callee-save tracking (freed by sub_96CFA0) |
| 11080 | [1385] | ptr | 0 | Callee-save arena ref (freed by sub_96CFA0) |
| 11089 | -- | BYTE | 1 | Guidance enabled flag |
| 11096 | [1387] | ptr | 0 | Final candidate object (freed by sub_96CFA0) |
| 11104 | [1388] | ptr | 0 | Final candidate arena ref (freed by sub_96CFA0) |
The linked list at [1324]--[1329] is initialized as a circular doubly-linked list with self-referential pointers, following the standard intrusive list pattern. The cleanup function sub_96CFA0 (694 lines) deallocates the candidate node objects at offsets 11072, 11080, 11096, and 11104.
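The self-referential initialization described above is the standard empty state of an intrusive circular list: the head's forward and backward pointers both name the head itself. A minimal Python sketch (class and method names are illustrative):

```python
class Node:
    """Minimal intrusive node: the link fields live in the object itself."""
    def __init__(self):
        self.prev = None
        self.next = None

class IntrusiveListHead:
    """Circular doubly-linked head; an empty list points at itself,
    matching the self-referential init at [1324]--[1329]."""
    def __init__(self):
        self.next = self
        self.prev = self
        self.count = 0

    def push_back(self, node: Node) -> None:
        node.prev, node.next = self.prev, self
        self.prev.next = node
        self.prev = node
        self.count += 1
```

Emptiness is then a single pointer comparison (`head.next is head`), with no null checks anywhere on the traversal path.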
Candidate node object (24 bytes)
Each candidate node is a 24-byte arena-allocated object used in the doubly-linked list and as priority queue sentinels:
| Offset | Type | Field |
|---|---|---|
| +0 | QWORD | Type tag: 2 = priority queue node, 3 = initial sentinel |
| +8 | QWORD | Count / payload |
| +16 | ptr | Arena back-reference |
Stack-local queue header array
In addition to the 11,112-byte structure, the function maintains a stack-local 7-element queue header array (v514, 168 bytes on stack). Each entry is 24 bytes (3 QWORDs) with the same layout as the embedded queue entries above. The 7 entries map to the 7 register classes:
| Index | Class |
|---|---|
| 0 | R (general-purpose registers) |
| 1 | P (predicate registers) |
| 2 | B (barrier registers) |
| 3 | UR (uniform registers) |
| 4 | UP (uniform predicates) |
| 5 | UB (uniform barriers) |
| 6 | Acc (tensor/accumulator registers) |
After bitvector iteration, each stack-local queue header is built by sub_8BE190 and sorted by sub_7553C0.
Algorithm
function compute_spill_guidance(ctx, guidance_array, attempt):
for class_id in 0..6:
entry = &guidance_array[class_id]
// 1. Initialize working bitmask arrays
zero_fill(entry.bitmasks, 128 elements)
// 2. Iterate live range bitvectors
for each live_range in class[class_id]:
// 3. Compute set intersection with other live ranges
intersect(entry.bitmasks, live_range.bitvector)
// 4. Build priority queue of spill candidates
build_priority_queue(entry) // sub_8BE190
// 5. Sort by spill cost (ascending -- cheapest to spill first)
sort_priority_queue(entry) // sub_7553C0
The guidance output is consumed by the retry loop: after each failed allocation attempt, the allocator consults the guidance to decide which virtual registers to allow to spill on the next attempt.
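Step 5's ascending sort (cheapest candidates first) can be sketched with a heap. This is illustrative Python; the candidate representation is hypothetical, and the binary uses sub_7553C0's sort rather than a heap.

```python
import heapq

def order_spill_candidates(costs: dict) -> list:
    """Return vreg ids cheapest-first -- the order in which the retry
    loop would permit additional registers to spill (sketch)."""
    heap = [(cost, vreg_id) for vreg_id, cost in costs.items()]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, vreg_id = heapq.heappop(heap)
        ordered.append(vreg_id)
    return ordered
```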
Spill Cost Model
The allocator uses a multi-level cost model to evaluate which registers are cheapest to spill.
Per-virtual-register weights
| Field | Type | Meaning |
|---|---|---|
vreg+40 | float | Primary spill cost (accumulated from usage frequency) |
vreg+76 | float | Secondary spill cost (alternate weighting) |
vreg+80 | int | Spill flag: 0 = not spilled, 1 = spilled |
Allocator-level accumulators
| Field | Type | Meaning |
|---|---|---|
alloc+1568 | double | Total spill-store cost |
alloc+1576 | float | Total spill-load cost |
Default cost weight
The base spill cost weight is 15.0 for normal register classes, reduced to 3.0 for register classes under high pressure. The selection is made by a per-class flag at alloc + 32 * class_id + 893:
float spill_weight = 15.0f;
if (*(uint8_t*)(alloc + 32 * regclass + 893))
spill_weight = 3.0f; // high-pressure class: lower cost to encourage spilling
Block frequency weighting
Spill cost is multiplied by the enclosing loop's nesting depth, obtained via a block frequency callback at vtable offset +8. Inner-loop spills receive higher penalties, discouraging the allocator from spilling values that are live across loop back-edges.
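Taken literally, the weighting multiplies the per-class base weight by nesting depth. A sketch under that reading (treating depth 0 as a factor of 1 is an assumption; the binary obtains the factor through the block frequency callback):

```python
def weighted_spill_cost(base_weight: float, nesting_depth: int) -> float:
    """Scale the per-class base weight (15.0 normal, 3.0 high-pressure)
    by loop nesting depth so inner-loop spills cost more (sketch)."""
    return base_weight * max(nesting_depth, 1)
```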
Best-result comparison
sub_93D070 (155 lines) records the best allocation result across retry attempts. Comparison uses tie-breaking priority:
- Register count (lower is better)
- Cost metric (double at `result+56`)
- Spill count
- Register class width
An inverse density metric 128 / register_count is used for secondary comparison. The per-variable assignment array is saved to a backup when a new best is found.
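The tie-breaking priority maps naturally onto lexicographic tuple comparison. A Python sketch with hypothetical field names; the inverse density metric is omitted because its exact position in the ordering is not recovered.

```python
def result_key(reg_count: int, cost: float,
               spill_count: int, class_width: int) -> tuple:
    """Lexicographic priority mirroring sub_93D070: fewer registers,
    then lower cost, then fewer spills, then narrower class."""
    return (reg_count, cost, spill_count, class_width)

def better(a: tuple, b: tuple) -> tuple:
    """Return whichever allocation result wins the comparison."""
    return a if result_key(*a) <= result_key(*b) else b
```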
Spill cost infrastructure
A suite of functions at 0x998000--0x99E000 implements the cost computation:
| Address | Role |
|---|---|
sub_9997D0 | Spill cost initialization |
sub_9998A0 | Spill cost computation |
sub_999950 | Spill cost comparison |
sub_999AA0 | Spill benefit estimation |
sub_999D10 | Spill cost aggregation |
sub_999F00 | Spill cost finalization |
sub_99A0B0 | Range spill cost query |
sub_9A8270 | Live range spill cost (14 KB) |
Spill Code Generation
sub_94F150 (561 lines) inserts actual spill-store and refill-load instructions into the IR instruction stream.
Per-register spill info
The function allocates a tracking array: 12 bytes per virtual register, initialized to {0, -1, -1}:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Spill state (0 = none) |
| +4 | 4 | Spill slot index (-1 = unassigned) |
| +8 | 4 | Refill slot index (-1 = unassigned) |
Execution flow
function generate_spill_code(alloc, ctx, mode):
// 1. Allocate per-block tracking
tracking = arena_alloc(12 * (numBlocks + 1))
// 2. Set up instruction iteration
setup_instruction_walk(ctx, walk=1, dir=0) // sub_7E6090
// 3. Multi-block liveness (if numBlocks > 1 and mode == 1)
compute_cross_block_liveness(alloc, ctx) // sub_9449B0
// Uses bitvectors: sub_BDBA60, sub_BDC180, sub_BDCDE0
// 4. Clear per-instruction spill flags
for each instr:
instr.flags &= ~0xE00
// 5. Walk instruction list
spill_weight = 15.0
if high_pressure_class:
spill_weight = 3.0
for each instr in instruction_list:
for each operand:
vreg = lookup_vreg(operand)
if vreg.was_previously_spilled:
// Track for potential refill
update_refill_tracking(vreg, instr)
// Accumulate spill cost weighted by block frequency
vreg.spill_cost += spill_weight * block_frequency(instr.block)
// Handle call instructions (opcode 97; STG in ROT13, used as CALL marker -- actual CALL is opcode 71)
if instr.opcode == 97:
handle_callee_save(alloc, instr)
// Track "use after last def" for enhanced cost
update_use_after_def(vreg, instr)
// 6. Uniform register special handling (flag 0x200)
if vreg.flags & 0x200:
apply_uniform_spill_rules(vreg)
Epoch tracking
sub_936CF0 (81 lines) checks basic block boundaries for epoch increments. It returns true if the block's successor is a CALL instruction (opcode 52) with a target that has opcode 236 (special call), or if allocator flags at alloc+1588/alloc+1589 are set. This mechanism tracks liveness across call boundaries, ensuring that spilled values are correctly reloaded after calls that may clobber registers.
LMEM Spilling
Local memory (LMEM) is the default spill destination. Each GPU thread has private local memory backed by DRAM and cached in L2.
Slot allocation
sub_939BD0 (65 lines) configures the spill slot allocator. It queries OCG knob 623 for a custom spill mode; if the knob is disabled, it selects between two default configurations based on the cost threshold at alloc+776:
| Condition | Bucket size | Alignment | Max pool |
|---|---|---|---|
| Cost threshold == 0.0 | 8 bytes | 4 bytes | 1 MB |
| Cost threshold != 0.0 | 16 bytes | 16 bytes | 1 MB |
The 8-byte/4-byte configuration handles standard 32-bit register spills. The 16-byte/16-byte configuration handles double-precision or 64-bit values that require stricter alignment.
if (*(double*)(alloc + 776) == 0.0)
return init_spill_pool(mem_alloc, 8, 4, 0x100000); // 8B buckets, 4B aligned
else
return init_spill_pool(mem_alloc, 16, 16, 0x100000); // 16B buckets, 16B aligned
When knob 623 is enabled, the knob value at offset +224 supplies a custom spill limit, passed to the spill allocator init function via vtable +24.
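The dispatch in sub_939BD0 can be summarized as follows. This is a sketch: the bucket/alignment/pool constants come from the table above, but the knob-override path is simplified (the real code passes the custom limit through a vtable call at +24 rather than returning a dict).

```python
MAX_POOL = 0x100000  # 1 MB pool cap in both default configurations

def spill_pool_config(cost_threshold, knob_623_value=None):
    """Select the spill slot allocator configuration (sketch of sub_939BD0)."""
    if knob_623_value is not None:
        # Knob 623 enabled: value at offset +224 supplies a custom limit
        return {"custom_limit": knob_623_value, "max_pool": MAX_POOL}
    if cost_threshold == 0.0:
        # Standard 32-bit register spills
        return {"bucket": 8, "align": 4, "max_pool": MAX_POOL}
    # Double-precision / 64-bit values needing stricter alignment
    return {"bucket": 16, "align": 16, "max_pool": MAX_POOL}
```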
SASS instruction sequences
sub_9850F0 (520 lines) generates the actual SASS load/store instruction sequences for spill traffic. Architecture-specific registers drive the address computation:
| Field | Source | Role |
|---|---|---|
| target_info[400] | Architecture vtable | Spill base register |
| target_info[401] | Architecture vtable | Spill slot stride |
| target_info[402] | Architecture vtable | Spill offset register |
Spill store sequence:
IADD addr, spill_base, offset // compute slot address
IMAD addr, addr, stride, 0 // scale by slot stride
ST [addr], value // store to local memory
Refill load sequence:
IADD addr, spill_base, offset // compute slot address
IMAD addr, addr, stride, 0 // scale by slot stride
LD value, [addr] // load from local memory
The generated instructions use these SASS opcodes:
| Opcode | SASS mnemonic | Role in spill sequence |
|---|---|---|
0xC9 | IADD | Add offset to base register |
0x11B | IMAD | Multiply-add for address |
0xC3 | MOV | Move value |
0x82 (130) | HSET2 in ROT13; used as LD/LDL-like | Load from local memory (refill) |
0xB7 | ST / STL | Store to local memory (spill) |
0x14 | ISETP | Set predicate (conditional spill) |
0x8B | SHL | Shift for alignment |
0x110 | LOP3 | Logical operation for masking |
0x5F | BRA | Branch (conditional spill) |
0x120 | STS | Store to shared memory (SMEM path) |
At the hardware level, local memory spills become LDL/STL instructions. The SETLMEMBASE (opcode 147) and GETLMEMBASE (opcode 148) instructions manage the local memory base pointer for the thread.
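Reading the IADD/IMAD pair literally, the slot address is the sum of the spill base and the slot offset, scaled by the per-slot stride. A sketch of that arithmetic only (not the encoder), with parameter names taken from the target_info fields above:

```python
def spill_slot_address(spill_base, offset, stride):
    """Model the two-instruction address computation in the spill sequences."""
    addr = spill_base + offset   # IADD addr, spill_base, offset
    addr = addr * stride + 0     # IMAD addr, addr, stride, 0
    return addr
```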
SMEM Spilling
Shared memory (SMEM) spilling is an alternative to local memory spilling. Shared memory is on-chip SRAM, offering significantly lower latency than LMEM (which goes through L2/DRAM). However, shared memory is a finite resource shared across all threads in a CTA, making this optimization applicable only in specific circumstances.
Entry point
sub_9539C0 (1873 lines, 54 KB decompiled) implements the SMEM spill allocator.
Hard restriction
"Smem spilling should not be enabled when functions use abi."
This assertion (checked at two call sites within sub_9539C0) prevents SMEM spilling for ABI-conformant functions. Functions using the GPU calling convention require a stable stack frame in local memory; shared memory spill slots would conflict with the calling convention's stack layout requirements.
Activation conditions
SMEM spilling activates when all of the following hold:
- Register class is 3 (UR) or 6 (Tensor/Acc)
- Device type is 5 (ctx+896 == 5)
- The class has virtual registers to allocate (vreg_count > 0)
- The function does not use ABI calling conventions
Algorithm
function smem_spill_allocate(alloc, ctx):
    // 1. Assert no ABI conflict
    assert(not function_uses_abi())
    // 2. Allocate per-block tracking (24-byte slots)
    tracking = arena_alloc(24 * numBlocks)
    // 3. Set up SSE-width bitmaps for shared memory tracking
    init_smem_bitmaps(alloc)
    // 4. Walk instruction list, identify spill candidates
    for each vreg marked for spill:
        // 5. Allocate shared memory slot
        slot = allocate_smem_slot(alloc)
        vreg.smem_slot = slot
        // 6. Generate shared memory load/store
        insert_sts_instruction(slot, vreg)   // STS (store to smem)
        insert_lds_instruction(slot, vreg)   // LDS (load from smem)
    // 7. Update shared memory allocation bitmap
    update_smem_allocation(alloc)
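Steps 5 and 7 amount to first-fit allocation against a shared-memory occupancy bitmap. A hedged model: the real allocator works on SSE-width bitmaps (step 3) over a reserved SMEM region, not a Python list, and the slot granularity here is an assumption.

```python
def allocate_smem_slot(bitmap):
    """First-fit slot allocation. bitmap: list of bools, True = slot in use.

    Returns the slot index, or -1 when the shared-memory budget is exhausted
    (shared memory is a finite per-CTA resource).
    """
    for i, used in enumerate(bitmap):
        if not used:
            bitmap[i] = True  # step 7: update the allocation bitmap
            return i
    return -1
```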
SMEM symbols
The SMEM spill infrastructure uses two internal symbols for allocation tracking:
- __nv_reservedSMEM_allocation_phase (address 0x1CFCE80)
- __nv_reservedSMEM_allocation_mask (address 0x1CFCEA8)
The CLI flag --disable-smem-reservation can disable shared memory reservation entirely.
Spill Statistics
The allocator collects detailed spill metrics into a per-function statistics object. These are used for compilation reporting, performance analysis, and the --warn-on-spills diagnostic.
Statistics fields
The statistics object stores spill counters at fixed DWORD offsets:
| Offset (DWORDs) | Name | Description |
|---|---|---|
| [12] | LSpillB | Local memory spill bytes |
| [13] | LRefillB | Local memory refill bytes |
| [14] | SRefillB | Shared memory refill bytes |
| [15] | SSpillB | Shared memory spill bytes |
| [16] | LowLmemSpillSize | Low-bound local memory spill size |
| [17] | FrameLmemSpillSize | Frame-level local memory spill size |
| [18] | LNonSpillB | Non-spill local memory bytes |
| [19] | LNonRefillB | Non-refill local memory bytes |
| [20] | NonSpillSize | Total non-spill memory size |
Format strings
The statistics printing subsystem (sub_A3A7E0) emits two lines for spill metrics:
# [est latency = %d] [LSpillB=%d] [LRefillB=%d], [SSpillB=%d],
[SRefillB=%d], [LowLmemSpillSize=%d] [FrameLmemSpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d], [NonSpillSize=%d]
The function properties string (used in verbose output):
Function properties for %s
%d bytes stack frame, %d bytes spill stores, %d bytes spill loads
Spill warning
When --warn-on-spills is active, the following warning is emitted for any function with spills:
Registers are spilled to local memory in function '%s',
%d bytes spill stores, %d bytes spill loads
The flag is registered in sub_432A00 / sub_434320 and stored at compilation_ctx + 473.
Allocation Failure
When all spill retry attempts are exhausted and the allocator still cannot fit within the register budget, a fatal error is emitted:
Register allocation failed with register count of '%d'.
Compile the program with a higher register target
This error originates from sub_9714E0 (allocation finalization, 290 lines), which is the last function called in the retry pipeline. Two emission paths exist:
| Path | Function | Context |
|---|---|---|
| With source location | sub_895530 | Includes function name and PTX line number |
| Without source location | sub_7EEFA0 | Generic error when location unavailable |
The alternate allocator path sub_964130 (1794 lines) also references this error at six separate points, covering different failure/retry scenarios. A dedicated failure reporter sub_966490 (474 lines) handles error diagnostic formatting.
Spill-Refill Pattern Optimization
The Ori IR includes dedicated instruction type markers for spill/refill patterns, enabling post-allocation optimization of spill traffic:
| Type ID | Name | Description |
|---|---|---|
| 8 | Spill-refill | Spill/refill pair marker |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |
The SpillRefill pass attempts to match and optimize these patterns. Error strings reveal three failure modes:
- "Failed to find matching spill for refilling load that is involved in this operand computation" -- the refill load has no corresponding spill store
- "Failed to establish match for bit-spill-refill pattern involved in this operand computation" -- the bit-spill pattern does not match expected form
- "Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation" -- instructions between spill and refill clobber the base address register
Debug strings include " spill-regill bug " and " bit-spill bug " (both with typos present in the binary).
Function Map
| Address | Lines | Role | Confidence |
|---|---|---|---|
| sub_939BD0 | 65 | Spill allocator setup (knob 623 dispatch) | HIGH |
| sub_93C0B0 | 582 | Spill range optimizer | HIGH |
| sub_93D070 | 155 | Best allocation result recorder | HIGH |
| sub_93E290 | 397 | Spill candidate node creator (192-byte arena nodes) | HIGH |
| sub_93F130 | 544 | Spill code inserter | HIGH |
| sub_93FBE0 | 940 | Per-iteration state reset / slot assignment | HIGH |
| sub_940EF0 | 764 | Alternate spill slot assignment path | MEDIUM |
| sub_944740 | 138 | Interference count at program point | HIGH |
| sub_9449B0 | 418 | Cross-block liveness range calculator | HIGH |
| sub_94B200 | 642 | Spill weight accumulator | HIGH |
| sub_94E620 | 617 | Spill cost accumulator / liveness annotator | HIGH |
| sub_94F150 | 561 | Spill code generation (main entry) | HIGH |
| sub_94FDD0 | 155 | Register assignment + spill trigger | HIGH |
| sub_9539C0 | 1,873 | SMEM (shared memory) spill allocator | HIGH |
| sub_9714E0 | 290 | Allocation finalization / error emission | MEDIUM |
| sub_96D940 | 2,983 | Spill guidance engine (7 class-parallel queues) | HIGH |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
| sub_9850F0 | 520 | SASS-level spill instruction generator | HIGH |
| sub_9997D0 | -- | Spill cost initialization | MEDIUM |
| sub_9998A0 | -- | Spill cost computation | MEDIUM |
| sub_999950 | -- | Spill cost comparison | MEDIUM |
| sub_999AA0 | -- | Spill benefit estimation | MEDIUM |
| sub_9A8270 | -- | Live range spill cost computation (14 KB) | MEDIUM |
GPU ABI & Calling Convention
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas ABI engine implements the NVIDIA GPU calling convention for device-side function calls. It manages parameter register allocation, return address placement, scratch/preserved register classification, and per-function ABI lowering across the full range of SM architectures (sm_35 through sm_100+). The engine runs as a multi-pass pipeline invoked per-function from the compilation driver (sub_98F430), positioned between the optimization passes and the register allocator. It spans approximately 250 KB (276 functions) at address range 0x19C6230--0x1A00FFF.
| Master ABI setup | sub_19D1AF0 (5608 bytes) -- orchestrates full per-function ABI pipeline |
| Per-pass lowering | sub_19DC4B0 (6459 bytes) -- 3-pass instruction transform driver |
| Opcode-level dispatch | sub_19CFC30 -- routes 11 opcodes to ABI handlers |
| Parameter allocator | sub_19CA730 (2277 bytes) -- 2048-bit free-list bitmap allocator |
| Return address validator | sub_19CDFF0 (7.5 KB) -- 12 diagnostic strings, warnings 7001--7009 |
| Return address setup | sub_19D1720 (4.8 KB) -- validates and assigns return address registers |
| Register transfer lowering | sub_19CC1A0 (3873 bytes) -- generates MOV/STS/LDS/PRMT sequences |
| gb10b WAR | sub_19D9E00 + sub_19DA2A0 -- __nv_reservedSMEM_gb10b_war_var |
| Convergent checker | sub_19D13F0 (4.3 KB) -- allowConvAlloc boundary validation |
| Address range | 0x19C6230--0x1A00FFF (~250 KB, 276 functions) |
Reserved Registers
Registers R0--R3 are reserved by the ABI and cannot be used for general allocation. The allocator enforces this with the diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s". These four registers serve fixed ABI roles (stack pointer, thread parameters, etc.) and are excluded from both parameter passing and general register allocation.
The reservation is unconditional across all SM generations. Any .maxreg directive or ABI specification that attempts to assign these registers to parameter or return roles triggers a diagnostic.
SM Generation Dispatch
The ABI engine determines the target SM generation by reading a field from the SM target descriptor:
generation = *(int*)(sm_target + 372) >> 12
| Generation value | SM targets | Key ABI differences |
|---|---|---|
| 3 | sm_35, sm_37 | Kepler ABI: no uniform registers, no convergent boundaries |
| 4 | sm_50, sm_52, sm_53 | Maxwell ABI: 16-register minimum, label fixups, coroutine insertion |
| 5 | sm_60--sm_89 | Pascal through Ada ABI: 24-register minimum, cooperative launch support |
| 9 | sm_90, sm_90a | Hopper ABI: 24-register minimum, uniform return address support |
| >9 | sm_100+ | Blackwell ABI: no minimum enforced (skips check), extended register reservation |
The minimum register count varies by generation. For generations 3--4 (sm_35 through sm_53), the ABI requires at least 16 registers per function. For generations 5--9 (sm_60 through sm_90a), the minimum is 24. Generations below 3 and above 9 skip the minimum check entirely. Violating these minimums emits warning 7016: "regcount %d specified below abi_minimum of %d". The abi_minimum value is computed as (generation - 5) < 5 ? 24 : 16.
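The decode and minimum check can be sketched as follows. Note a hedge: the recovered expression (generation - 5) < 5 ? 24 : 16 only reproduces the observed 16/24 split if the subtraction is treated as an unsigned comparison (generation 3 wraps to a huge value and fails the < 5 test), so the model below assumes 32-bit unsigned wrap.

```python
def sm_generation(descriptor_field_372):
    """Decode the SM generation from the target descriptor field at +372."""
    return descriptor_field_372 >> 12

def abi_minimum(generation):
    """Return the abi_minimum regcount, or None when the check is skipped
    (generations below 3 and above 9 enforce no minimum)."""
    if generation < 3 or generation > 9:
        return None
    u = (generation - 5) & 0xFFFFFFFF  # assumed unsigned 32-bit wrap
    return 24 if u < 5 else 16
```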
Master ABI Setup: sub_19D1AF0
The top-level ABI entry point (5608 bytes), called once per function by the per-function compilation driver sub_98F430. It orchestrates the full ABI pipeline in 10 steps:
function abi_master_setup(func, sm_target, abi_spec):
    // 1. Validate register count vs. ABI minimums
    generation = *(sm_target + 372) >> 12
    if generation in 3..4: min_regs = 16   // sm_35-sm_53
    if generation in 5..9: min_regs = 24   // sm_60-sm_90a
    if func.maxreg < min_regs:
        warn(7016, "regcount %d specified below abi_minimum of %d",
             func.maxreg, min_regs)
    // 2. Validate register reservation range
    if available_regs < requested_reservation:
        warn(7017, "register available %d for reservation is less "
                   "than the requested number of registers %d",
             available_regs, requested_reservation)
    // 3. Validate coroutine SUSPEND semantics
    for each register in func.preserved_set:
        if register.is_scratch_at_suspend:
            warn(7011, "Register (%s%d)is defined as scratch on "
                       "SUSPEND but preserved for coroutine function",
                 register.class_name, register.index)
    // 4. Iterate callee list, mark ABI-callable entries
    for each callee in func.callees:
        callee.abi_flags |= ABI_CALLABLE
        propagate_abi_attributes(func, callee)
    // 5. Propagate register limits to callees
    abi_propagate_limits(func)        // sub_19CE590
    // 6. Check return-address / parameter overlap
    abi_overlap_precheck(func)        // sub_19CA3C0
    // 7. Allocate parameter registers
    abi_alloc_params(func)            // sub_19CA730
    // 8. Validate return address assignment
    abi_return_addr_setup(func)       // sub_19D1720
    // 9. Detailed return address validation
    abi_return_addr_validate(func)    // sub_19CDFF0
    // 10. Adjust register file limits via vtable
    vtable[736](func, sm_target)
Parameter Passing
Parameters are passed in consecutive R registers starting from a configurable base register. The ABI tracks "number of registers used for parameter passing" and "first parameter register" as per-function properties. The parameter register range begins after the reserved registers (R0--R3) and the return address register.
Parameter Register Allocator: sub_19CA730
The core parameter allocator (2277 bytes, 98% confidence). It uses a 2048-bit free-list bitmap (v103[], 256 bytes) to track available register slots.
function abi_alloc_params(func):
    // Initialize 2048-bit free-list (256 bytes)
    bitmap[256] = {0xFF...}                   // all slots free
    // Mark reserved registers as occupied
    clear_bits(bitmap, 0, 3)                  // R0-R3 always reserved
    // Mark already-allocated registers
    popcount = register_usage_popcount(func)  // sub_19C99B0
    // Allocate PARAMETER registers
    for each param in func.parameters:
        align = param_alignment(param.type_width)   // 4/8/16 bytes
        slot = find_contiguous_free(bitmap, param.reg_count, align)
        if slot == -1:
            error("Function %s size requires more registers(%d) "
                  "than available(%d)", func.name, needed, available)
            return FAILURE
        assign_register(slot, param)                  // sub_7FA420
        mark_allocated(bitmap, slot, param.reg_count) // sub_BDBB80
    // Allocate RETURN registers (same algorithm, separate class)
    for each ret in func.return_values:
        slot = find_contiguous_free(bitmap, ret.reg_count, align)
        assign_register(slot, ret)
        mark_allocated(bitmap, slot, ret.reg_count)
The allocator processes parameters and return values as separate classes, each requiring contiguous register ranges with natural alignment. For 8-byte parameters, the base register must be even-aligned. For 16-byte parameters, the base must be 4-register-aligned.
The population count helper (sub_19C99B0, 2568 bytes) uses the __popcountdi2 intrinsic to count live registers in the function's usage bitmap, determining how many slots remain available.
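The aligned contiguous-free search is the core of the allocator. A self-contained sketch on a Python big integer standing in for the real 2048-bit v103[] array (bit i set = register Ri free); the helper names are ours, not the binary's:

```python
NUM_SLOTS = 2048  # 2048-bit free-list bitmap (256 bytes in the binary)

def find_contiguous_free(bitmap, count, align):
    """Find `count` consecutive free slots at an `align`-register boundary.

    Returns the base register index, or -1 on exhaustion (the
    "requires more registers than available" error path).
    """
    for base in range(0, NUM_SLOTS - count + 1, align):
        run = ((1 << count) - 1) << base
        if (bitmap & run) == run:  # all `count` slots free
            return base
    return -1

def mark_allocated(bitmap, base, count):
    """Clear the free bits for an allocated run."""
    return bitmap & ~(((1 << count) - 1) << base)

# Start fully free, then reserve R0-R3 per the ABI rule above.
free = (1 << NUM_SLOTS) - 1
free = mark_allocated(free, 0, 4)
```

With R0-R3 reserved, an even-aligned 64-bit parameter (count=2, align=2) lands at R4-R5, matching the alignment rules described above.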
Return Address Register
The return address occupies a dedicated register (or register pair) whose location is validated against parameter ranges. The diagnostic "Parameter registers from R%d to R%d overlap with return address register R%d to R%d" fires when parameter and return address ranges collide.
Return Address Modes
The return address validator (sub_19CDFF0, 7.5 KB, 99% confidence) handles four modes, selected by the v7 field in the ABI specification:
| Mode | v7 | Behavior |
|---|---|---|
| Fixed | 1 | Return address at register 4 + 2 = R6. Fixed by architecture. |
| Regular | 2 | General-purpose register, validated < max_reg. |
| Uniform | 3 | Uniform register (UR) for return address. Requires SM support (sm_75+). |
| Computed | 5 | Derived from parameter layout. Auto-aligned to even register number. |
Return Address Validator: sub_19CDFF0
The most thoroughly instrumented function in the ABI engine (7 distinct warning codes across two mode-specific paths). It performs these validations in sequence:
| Code | Condition | Message |
|---|---|---|
| 7001 | return_addr & 1 != 0 | "ABI return address %d is unaligned" |
| 7002 | return_addr >= max_reg | "Return Address (%d) should be less than %d" |
| 7003 | stack_ptr in [return_addr, return_addr+1] | "Return address (%d) should not overlap with the stack pointer (%d)" |
| 7004 | Return addr bit set in parameter bitmap | "Return Address %d overlaps with parameters in range %d - %d" |
| 7005 | param_end + align > max_reg (auto-placement) | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" |
| 7008 | return_addr < lower_bound or return_addr > upper_bound | "Return address (%d) should be between %d and %d" |
| 7009 | Mode 3 and !(func+1408 byte & 0x02) | "SM does not support uniform registers for return address" |
The checks are mode-dependent. Mode 2 (regular GPR) enters the 7002/7001/7003/7004 path. Modes 3 and 5 (uniform/computed) enter the 7009/7008/7001 path. Mode 1 and mode 5 share the auto-placement path where 7005 fires. Warning 7001 (unaligned) appears in both paths because 64-bit return address pairs always require even alignment.
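The mode-2 (regular GPR) checks can be modeled as a small predicate chain that collects the warning codes that would fire. A hedged sketch: parameter names are ours (the real routine reads these fields from the ABI spec object), and the check ordering here follows the table rather than the exact emission order in the binary.

```python
def validate_return_addr_gpr(ret, max_reg, stack_ptr, param_bitmap):
    """Collect mode-2 warning codes for a return address register `ret`.

    param_bitmap: int, bit i set = Ri allocated to a parameter.
    """
    warnings = []
    if ret & 1:
        warnings.append(7001)  # unaligned: 64-bit pairs need even base
    if ret >= max_reg:
        warnings.append(7002)  # exceeds the register budget
    if stack_ptr in (ret, ret + 1):
        warnings.append(7003)  # overlaps the stack pointer
    if param_bitmap & (1 << ret):
        warnings.append(7004)  # overlaps the parameter range
    return warnings
```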
Return Address Setup: sub_19D1720
The setup function (4.8 KB, 95% confidence) runs before the validator. It propagates ABI flag 0x04 to the function state (byte 1389), validates that the return address register (register 1) is not classified as scratch when it must be preserved (warning 7012: "%d register should not be classified as scratch"), sizes the preserved register set to 255 entries via sub_BDBAD0, and computes the effective register range as return_size + param_size for comparison against the maximum available. The 7012 check fires when *(abi_spec+88) & 0x01 and *(abi_spec+48) & 0x02 are both set, always with argument 1 (the return address register).
The function also enforces the mutual exclusion rule (warning 7006): "ABI allows either specifying return address or return address before params". This fires when mode is 1 (fixed, "return address before params") but an explicit return address register is also assigned (return_addr != -1). You pick one strategy, not both.
Scratch Data Registers
Registers not reserved by the ABI and not used for parameters or return values may be classified as scratch (callee-clobbered). The ABI engine tracks scratch classification per register and validates it against coroutine semantics. At SUSPEND points in coroutine functions, a register marked as scratch must not also appear in the preserved set. Violation triggers warning 7011.
The scratch/preserved classification feeds into the register allocator's spill decisions. Registers marked as scratch across a call boundary must be saved by the caller; preserved registers must be saved by the callee.
Per-Pass Instruction Lowering: sub_19DC4B0
The instruction-level ABI transform driver (6459 bytes, 95% confidence). Called from both sub_98F430 and sub_A9DDD0. It makes three passes over the instruction stream, each performing different transformations:
Pass 1 -- Convergent Boundary Fixup
- Fixes convergent boundary annotations (allowConvAlloc).
- Handles SHFL.NI (shuffle, no-index) fixups for intra-warp communication.
- Propagates the .uniform bit on CAL (call) instructions.
Pass 2 -- Instruction Lowering
Lowers high-level Ori opcodes into ABI-conforming SASS sequences:
| Ori opcode | Mnemonic | Transform |
|---|---|---|
| 109 | CALL | Parameter register setup, save/restore insertion |
| 16 | ST | Shared memory store lowering |
| 77 | LD | Shared memory load lowering |
| 185 | ATOMG | Atomic operation lowering |
| 183 | (special) | Mode 2/3 reclassification |
Pass 3 -- Architecture-Specific Fixups
Conditioned on SM generation:
sm_50 (generation == 4): Label fixups, coroutine code insertion, shared memory WAR insertion, convergent boundary checks.
sm_60+ (generation == 5): Additional register reservation for ABI conformance, cooperative launch handling, extended register file support.
All architectures: Per-block instruction scanning for opcode 195 (MOV) and opcode 205 reclassification. Register reservation range setup via sub_7358F0 / sub_7AC150.
Opcode-Level ABI Dispatch: sub_19CFC30
A dispatcher called twice from sub_98F430 that routes individual opcodes to specialized ABI handlers:
| Ori opcode | Handler | Transform |
|---|---|---|
| 9 | sub_19CF9A0 | PRMT (permute) lowering |
| 54 | (inline) | Function parameter preallocation |
| 72 | sub_19CDED0 + sub_19CB590 + sub_19CB7E0 | SMEM reservation + pre/post call register save/restore |
| 98 | sub_19CBAC0 | Shared load (LD.S) ABI lowering |
| 159 | sub_19CD0D0 | Barrier instruction lowering |
| 164 | sub_19CC1A0 | Register load (transfer lowering) |
| 168 | sub_19CC1A0 | Register store (transfer lowering) |
| 183 | sub_19CBE00 | Special instruction fixup |
| 226 | sub_19CD950 | Predicate lowering |
| 236 | sub_19CD510 | Conversion instruction lowering |
| 335 | sub_19CDED0 | SMEM reservation instruction handler |
Register Transfer Lowering: sub_19CC1A0
The register-to-register transfer lowering function (3873 bytes, 95% confidence). It converts abstract register load/store operations (opcodes 164 and 168) into concrete SASS instruction sequences. The lowering path depends on the ABI function properties:
Direct copy path (byte 12 == 0): Register-to-register MOV instructions.
| Data width | Generated sequence |
|---|---|
| 4 bytes (32-bit) | Single MOV-like (opcode 130 / 0x82, HSET2 in ROT13; actual SASS MOV is opcode 19) |
| 8 bytes (64-bit) | STS + LDS pair (opcodes 0x86/0x85) through shared memory |
| Permute | PRMT (opcode 0x120) for byte-lane rearrangement |
Shared memory indirect path (byte 13 == 1): All transfers go through shared memory via STS/LDS pairs, using a reserved shared memory region as a scratch buffer. This path is used when direct register-to-register transfer is not possible (e.g., cross-warp parameter passing on older architectures or when the register file is partitioned).
The function also generates opcode 0xB7 (special) for shared-memory-based transfers that require additional synchronization. It calls sub_92E800 (instruction builder) for each generated SASS instruction.
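The width dispatch in the direct-copy path reduces to a simple table. A sketch of the selection logic only; the mnemonics come from the table above, while the sequence construction (and the indirect-path trigger as a boolean) is our simplification of the byte-12/byte-13 property checks.

```python
def lower_transfer(width_bytes, indirect=False):
    """Pick the SASS sequence for an abstract register transfer (sketch)."""
    if indirect:
        # byte 13 == 1: all transfers bounce through reserved shared memory
        return ["STS", "LDS"]
    if width_bytes == 4:
        return ["MOV"]          # single 32-bit register-to-register move
    if width_bytes == 8:
        return ["STS", "LDS"]   # 64-bit value staged through shared memory
    raise ValueError("unsupported transfer width")
```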
Convergent Boundary Enforcement
Two functions enforce convergent allocation boundaries for function calls annotated with allowConvAlloc:
Convergent boundary checker (sub_19D13F0, 4.3 KB): Walks the basic block list, builds a bitmask of convergent register definitions, and validates that every allowConvAlloc-annotated call has a proper convergent boundary. Emits "Missing proper convergent boundary around func call annotated with allowConvAlloc" when the boundary is absent.
CONV.ALLOC insertion (sub_19D7A70, 3313 bytes): Scans the instruction list for convergent boundary violations. When a register def flows to a convergent use through a non-convergent path, inserts a CONV.ALLOC placeholder instruction (opcode 0x11E = 286). Uses a 64-bit-word bitmask array to track which register slots are live across convergent boundaries.
The single-call checker (sub_19C6400) warns when a convergent region contains more than one call: "Multiple functions calls within the allowConvAlloc convergent boundary".
Coroutine Support
Functions with coroutine support (flag 0x01 at function byte +1369) receive special ABI handling. Registers that are live across SUSPEND points must be saved to and restored from the coroutine frame.
Coroutine SUSPEND handler (sub_19D5F10, 1568 bytes): Scans the instruction stream for suspend points. For each register defined before and used after a SUSPEND, inserts save/restore pairs to/from the coroutine frame.
Coroutine frame builder (sub_19D4B80, 1925 bytes): Constructs the frame layout for coroutine-style functions, allocating slots for each register that must survive a SUSPEND.
The ABI engine validates that the scratch/preserved classification is consistent with coroutine semantics. Warning 7011 fires when a register marked as scratch at a SUSPEND point is also required to be preserved for the coroutine function. Warning 7012 fires when the return address register itself is misclassified as scratch.
gb10b Hardware WAR
Two functions implement a shared-memory-based workaround for a hardware bug on the gb10b variant (SM 75, Turing). Both reference the reserved symbol __nv_reservedSMEM_gb10b_war_var.
Entry block variant (sub_19D9E00): Generates a complex instruction sequence using additional temp registers (opcodes ADD, MOV, BAR) for the function entry block.
Body variant (sub_19DA2A0, 95% confidence): Generates a 7-instruction SASS sequence:
1. MOV.C temp_reg, <constant> // opcode 195, class 3
2. LD.S temp_reg, [__nv_reservedSMEM_gb10b_war_var] // opcode 98
3. AND temp_reg, temp_reg, 4 // opcode 214
4. SETP P, temp_reg, 0x4000 // opcode 272
5. STS [__nv_reservedSMEM_gb10b_war_var], temp_reg // opcode 277
6. @P BRA target // opcode 18, predicated
7. MOV result, 0 // zero-initialization
The reserved SMEM checker (sub_19DDEF0, 1687 bytes) iterates instructions looking for opcode 335 (SMEM reservation). When found and the function is not allowed to use reserved shared memory, it emits warning 7801: "Function '%s' uses reserved shared memory when not allowed.".
ABI Register Limit Propagation
The limit propagator (sub_19CE590) handles inter-procedural ABI attribute forwarding. For SM generations 4 and 5 (sm_50, sm_60 families), it iterates the call graph and copies the max-register limit from caller to callee (field +264 to +268) unless the callee has an explicit ABI specification. This ensures that callees do not exceed the register budget established by their callers.
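The propagation rule can be sketched over a toy call graph. A hedged model: the dict representation is an illustration, and the explicit-spec check stands in for the field +264 to +268 copy guard in sub_19CE590.

```python
def propagate_limits(call_graph, max_reg, has_abi_spec):
    """Copy each caller's max-register limit to callees lacking an
    explicit ABI specification (generations 4-5 behavior, sketched).

    call_graph: {caller: [callees]}; max_reg is mutated in place.
    """
    for caller, callees in call_graph.items():
        for callee in callees:
            if not has_abi_spec.get(callee, False):
                max_reg[callee] = max_reg[caller]  # field +264 -> +268
```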
Call Instruction ABI Lowering: sub_19D41E0
The call lowering function (2247 bytes, 85% confidence) processes each call instruction (opcode 97; STG in the ROT13 name table, but used here as an internal CALL-like marker -- actual SASS CALL is opcode 71) in the function. For each call site it:
- Sets up parameter passing registers according to the callee's ABI specification.
- Inserts pre-call register save sequences for caller-saved registers.
- Modifies the call target to use ABI-conforming register assignments.
- Inserts post-call register restore sequences.
Register File Types
The ABI handles three register file types, each with distinct allocation rules:
| Type | v7 value | File | Range | SM requirement |
|---|---|---|---|---|
| GPR | 2 | General-purpose | R0--R255 | All architectures |
| Uniform | 3 | Uniform GPR | UR0--UR63 | sm_75+ |
| Predicate | 5 | Predicate | P0--P7 | All architectures |
Uniform registers (type 3) are only available on sm_75 and later. Attempting to use a uniform register for the return address on an older SM triggers warning 7009.
Pipeline Integration
The ABI engine sits between the optimization passes and the register allocator in the ptxas pipeline:
... optimization passes ...
Late Legalization / Expansion
ABI Master Setup <-- sub_19D1AF0 (per-function)
ABI Pass 1 (convergent) <-- sub_19DC4B0 (a2=1)
ABI Pass 2 (lowering) <-- sub_19DC4B0 (a2=2)
ABI Opcode Dispatch <-- sub_19CFC30 (2x)
ABI Pass 3 (arch-specific) <-- sub_19DC4B0 (a2=3)
Register Allocation <-- sub_9721C0
Instruction Scheduling
SASS Encoding
The ABI engine produces new SASS instructions via sub_934630 / sub_9314F0 (instruction builder/inserter) and uses sub_91BF30 (temp register allocation) for scratch registers needed during lowering. During final emission, the encoding functions in Zone B (0x1A01000--0x1A76F30) convert the ABI-lowered instructions into binary SASS words.
ABI State Object Layout
The ABI engine operates on three nested data structures: the ABI engine context (the this pointer passed as a1 to all ABI functions), the per-callee ABI specification (one per callee in the call graph), and parameter/return descriptor entries (one per parameter or return value). All offsets are byte offsets from the structure base.
ABI Engine Context
The top-level per-function ABI state, passed as a1 to sub_19D1AF0, sub_19CA730, sub_19CDFF0, and sub_19D1720. Total size is at least 4672 bytes.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 8 | ptr | vtable | Dispatch table; method at +144 dispatches per-callee validation, +152 selects register reservation strategy |
+8 | 8 | ptr | func_ctx | Pointer to per-function compilation context (1716+ bytes); accessed everywhere as *(_QWORD *)(a1+8) |
+16 | 1 | byte | abi_mode_flags | Master ABI mode selector; 0 = no ABI lowering, nonzero = full pipeline |
+64 | 4 | int | max_param_offset | Highest parameter register offset seen during callee iteration |
+76 | 4 | int | preserved_param_start | Start register for preserved parameter range |
+80 | 4 | int | preserved_param_align | Alignment requirement for preserved parameter range |
+88 | 8 | ptr | current_callee_entry | Pointer to the callee entry node being processed in the current iteration |
+97 | 1 | byte | skip_popcount | When set, skips the register usage population count (sub_19C99B0) |
+98 | 1 | byte | has_return_addr_spec | Set to 1 when any callee has a return address ABI specification |
+4428 | 4 | int | cached_reg_R3 | Cached physical register ID for R3 (from sub_7FA420(regfile, 6, 3)) |
+4432 | 4 | int | cached_reg_R2 | Cached physical register ID for R2 (from sub_7FA420(regfile, 6, 2)) |
+4449 | 1 | byte | first_callee_seen | Set after the first callee with an ABI spec is processed; controls whether per-class reservation bitmaps are populated or inherited |
+4456 | 16+ | bitvec | param_alloc_bitmap | Bitvector tracking which physical registers have been assigned to parameters; manipulated via sub_BDBB80 (set bit), sub_BDDCB0 (find highest), sub_BDDD40 (popcount) |
+4472 | 4 | int | param_alloc_count | Number of registers allocated for parameter passing |
+4480 | 16+ | bitvec | retval_alloc_bitmap | Bitvector tracking which physical registers have been assigned to return values |
+4496 | 4 | int | retval_alloc_count | Number of registers allocated for return values |
+4528 | 144 | bitvec[6] | per_class_reservation | Per-register-class ABI reservation bitmaps; 6 entries (classes 1--6), 24 bytes each; the loop in sub_19D1AF0 iterates v148 from 1 to 6, incrementing the pointer by 3 qwords per iteration |
The param_alloc_bitmap and retval_alloc_bitmap are used after parameter/return allocation to compute the effective register file occupancy. The master setup reads the highest set bit in each (sub_BDDCB0) to determine func_ctx+361 (total register demand) and compares against func_ctx+367 (register file limit).
Per-Callee ABI Specification
Pointed to by *(callee_entry + 64). One instance per callee in the call graph. Describes how parameters are passed, return values are received, and the return address is placed. Accessed as v3/v12/v14 (cast to _DWORD *) in the decompiled code, so integer-indexed fields are at 4-byte stride.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 4 | int | param_count | Number of parameter descriptor entries |
+4 | 4 | int | return_count | Number of return value descriptor entries |
+8 | 8 | ptr | param_descriptors | Pointer to array of 32-byte parameter descriptor entries |
+16 | 8 | ptr | return_descriptors | Pointer to array of 32-byte return value descriptor entries |
+24 | 4 | int | return_addr_register | Explicit return address register number; -1 = unassigned |
+28 | 4 | int | return_addr_mode | Return address placement strategy (see table below) |
+32 | 4 | int | first_param_register | First register available for parameter passing; -1 = use default |
+36 | 4 | int | available_reg_count | Number of registers available; -1 = target default, -2 = computed from target descriptor |
+40 | 1 | byte | ret_addr_before_params | If set, return address is placed before the parameter range |
+44 | 4 | int | preserved_reg_type | Preserved register specification type; 1 triggers per-register scratch bitmap construction |
+48 | 8 | uint64 | scratch_gpr_bitmask | Bit 1 (& 2) = scratch classification active for GPR return address register |
+57 | 1 | byte | has_abi_spec | Master enable: 0 = callee has no ABI specification, 1 = specification is active |
+58 | 1 | byte | allocation_complete | Set to 1 after parameter/return allocation finishes successfully |
+64 | 8 | ptr | abi_detail_ptr | Pointer to extended ABI detail sub-object (preserved bitmasks, scratch classification) |
+80 | 8 | uint64 | preserved_pred_bitmask | Per-predicate-register preserved bitmask; bit N = predicate register N is preserved |
+88 | 4 | uint32 | preserved_class_flags | Bit 0 (& 1) = GPR preserved set active; bit 1 (& 2) = scratch classification active |
+96 | 1 | byte | return_addr_validated | Set to 1 after sub_19CDFF0 completes validation for this callee |
Return address mode values (field +28):
| Value | Mode | Behavior |
|---|---|---|
| 1 | Fixed | Return address at first_param_register + 2 (e.g., R6 when base is R4) |
| 2 | Regular | General-purpose register, validated < max_reg |
| 3 | Uniform | Uniform register (UR), requires SM75+ (func_ctx+1408 & 0x02) |
| 5 | Computed | Derived from parameter layout, auto-aligned to even register boundary |
Parameter/Return Descriptor Entry (32 bytes)
Each parameter or return value is described by a 32-byte entry. The allocator iterates the parameter array with stride 32 (v34 += 32 per parameter) and the return array identically (v43 += 32 per return value).
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 4 | int | element_count | Number of elements (e.g., 4 for a float4) |
+4 | 4 | int | element_size | Size per element in bytes (e.g., 4 for float) |
+8 | 4 | int | alignment_hint | Alignment in bytes, clamped to [4, 16]; 8 = even-aligned, 16 = quad-aligned |
+12 | 1 | byte | is_register_allocated | 0 = stack-passed (fallback), 1 = register-allocated |
+16 | 4 | int | assigned_register_id | Physical register ID assigned by the allocator (from sub_7FA420) |
The total byte size is element_count * element_size. The register count is ceil(total_bytes / 4), computed as (total + 3) >> 2. The alignment mask applied to register slot selection is -(alignment_hint >> 2), producing a bitmask that enforces natural alignment: 8-byte parameters require even-aligned base registers, 16-byte parameters require 4-register-aligned bases.
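The size and alignment arithmetic above can be sketched as follows (helper names are ours; the constants mirror the decompiled code):

```python
# Illustrative rendering of the descriptor arithmetic: register count is
# ceil(total_bytes / 4), and the alignment mask -(alignment_hint >> 2)
# rounds a candidate register slot down to its natural alignment.

def reg_count(element_count: int, element_size: int) -> int:
    """Registers needed for a parameter: (total_bytes + 3) >> 2."""
    total = element_count * element_size
    return (total + 3) >> 2

def base_mask(alignment_hint: int) -> int:
    """Alignment mask applied to register slot selection."""
    return -(alignment_hint >> 2)

# A float4 parameter: 4 elements x 4 bytes = 16 bytes -> 4 registers,
# base register constrained to a multiple of 4.
assert reg_count(4, 4) == 4
assert (5 & base_mask(16)) == 4   # slot 5 rounds down to a 4-aligned base
assert (7 & base_mask(8)) == 6    # 8-byte params need an even base register
```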
2048-Bit Free-List Bitmap (Stack Local)
The parameter allocator (sub_19CA730) constructs the free-list bitmap as a stack-local variable (not stored in the engine context). It is declared as v103[31] (a 248-byte QWORD array) plus v104 (4 bytes), v105 (2 bytes), and v106 (1 byte), totaling 255 bytes -- 2040 bits, one byte short of the nominal 2048.
Initialization:
memset(v103, 0xFF, 248) // 248 bytes all-ones
v104 = 0xFFFFFFFF // 4 bytes
v105 = 0xFFFF // 2 bytes
v106 = 0xFF // 1 byte
Result: 2040 bits all-ones (255 bytes)
A bit value of 1 means the register slot is free; 0 means occupied. The bitmap is indexed relative to first_param_register, not absolute R0. When a contiguous run of free slots is found for a parameter, the allocator zeroes the corresponding bytes using a size-optimized zeroing sequence (special-cased for lengths < 4, == 4, and >= 8 bytes). After allocation, the assigned registers are also recorded in the persistent bitvectors at +4456 (parameters) and +4480 (return values) via sub_BDBB80.
The bitmap supports up to 2040 register slots, far exceeding the 255-register GPR limit. This over-provisioning accommodates the allocator's use for both parameter and return value allocation in a single bitmap, and provides headroom for potential multi-class allocation in future architectures.
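The contiguous-run search can be modeled as below. This is purely illustrative: the binary works on a 255-byte stack buffer with byte-granular, size-special-cased zeroing, not a Python integer, and the `step` parameter stands in for the alignment mask described above.

```python
# Toy model of the free-list search: find the lowest step-aligned run of
# `count` free (1) bits, then clear the run to mark it occupied (0).

def allocate_run(bitmap: int, nbits: int, count: int, step: int):
    """Return (new_bitmap, start) for the lowest aligned free run,
    or (bitmap, -1) if no run fits."""
    run = (1 << count) - 1
    for start in range(0, nbits - count + 1, step):
        if (bitmap >> start) & run == run:           # all `count` bits free
            return bitmap & ~(run << start), start   # zero = occupied
    return bitmap, -1

free = (1 << 16) - 1                       # 16 free slots
free, a = allocate_run(free, 16, 3, 1)     # 3-register param, no alignment
free, b = allocate_run(free, 16, 4, 4)     # 16-byte param, 4-aligned base
assert (a, b) == (0, 4)
```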
Target Descriptor Fields Referenced by ABI Engine
The ABI engine accesses the target descriptor (at func_ctx+1584) through these offsets during ABI setup:
| Offset | Type | Purpose |
|---|---|---|
+372 | int | SM generation index (value >> 12; 3=Kepler, 4=Maxwell, 5=Pascal+, 9=Hopper, >9=Blackwell) |
+452 | int | SM version number; > 4 gates 64-bit return address pair semantics |
+616 | int | Available register count ceiling for the target |
+636 | int | Register count subtraction base (for computed available_reg_count) |
+896 | vfunc | Register range query; called with (target, func_ctx, &query, 6), returns low/high range pair at query+24 |
+2096 | vfunc | Register class capacity query; called with (target, reg_class) |
+3000 | vfunc | Validator callback; nullsub_464 = no-op (validation skipped) |
The vtable call at +896 takes a 32-byte query structure initialized to {hi=-1 lo=0, 0, 0, 0, 0, 148, 148, -1}. The result at query +24 (as two 32-bit halves) returns the reserved register range boundaries. This is used by warnings 7014 (reserved range overlaps parameters) and 7017 (insufficient registers for reservation).
ABI Validation Diagnostics
The ABI engine emits 16 distinct warning codes in the range 7001--7017 from six functions; codes 7007 and 7018 are unused in this binary version. All emitted codes share the contiguous hex ID range 0x1B59--0x1B69 and flow through two parallel paths: sub_7EEFA0 (standalone diagnostic buffer) and sub_895530 (context-attached diagnostic using the compilation context at *(func+48)).
Complete Warning Catalog
| Code | Hex | Emitter | Message | Trigger |
|---|---|---|---|---|
| 7001 | 0x1B59 | sub_19CDFF0 | "ABI return address %d is unaligned" | return_addr & 1 != 0 (odd register for 64-bit pair) |
| 7002 | 0x1B5A | sub_19CDFF0 | "Return Address (%d) should be less than %d" | return_addr >= max_reg (exceeds register file) |
| 7003 | 0x1B5B | sub_19CDFF0 | "Return address (%d) should not overlap with the stack pointer (%d)" | Stack pointer falls within [return_addr, return_addr+1] |
| 7004 | 0x1B5C | sub_19CDFF0 | "Return Address %d overlaps with parameters in range %d - %d" | Return addr bit set in parameter allocation bitmap |
| 7005 | 0x1B5D | sub_19CDFF0 | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" | Auto-placed return addr pushed beyond register file limit |
| 7006 | 0x1B5E | sub_19D1720 | "ABI allows either specifying return address or return address before params" | Mode 1 (fixed) with explicit return_addr != -1 |
| 7007 | 0x1B5F | -- | -- | Unused/reserved in this binary version |
| 7008 | 0x1B60 | sub_19CDFF0 | "Return address (%d) should be between %d and %d" | Return addr outside valid range from target vtable query |
| 7009 | 0x1B61 | sub_19CDFF0 | "SM does not support uniform registers for return address" | Mode 3 (uniform) on target without UR support (!(func+1408 & 0x02)) |
| 7010 | 0x1B62 | sub_13B6DF0 | "Relative 32-bit return address requires a caller-save 64-bit scratch register pair" | 32-bit relative call without available scratch pair |
| 7011 | 0x1B63 | sub_19D1AF0 | "Register (%s%d)is defined as scratch on SUSPEND but preserved for coroutine function" | Register in preserved set is scratch in SUSPEND bitmap |
| 7012 | 0x1B64 | sub_19D1720, sub_19D1AF0 | "%d register should not be classified as scratch" | Preserved ABI register (return addr) misclassified as scratch |
| 7013 | 0x1B65 | sub_19CA730 | "%d register used to return value cannot be classified as preserved" | Return-value register appears in preserved bitmap |
| 7014 | 0x1B66 | sub_19CA730 | "Reserved register range %d - %d overlaps with parameters in range %d - %d" | Explicit reserved range collides with parameter range |
| 7015 | 0x1B67 | sub_19C69D0 | "Reserved register range %d - %d overlaps with retAddr %d" | Reserved range collides with return address register |
| 7016 | 0x1B68 | sub_19D1AF0 | "regcount %d specified below abi_minimum of %d" | func.maxreg below generation minimum (16 or 24) |
| 7017 | 0x1B69 | sub_19D1AF0 | "register available %d for reservation is less than the requested number of registers %d " | Available regs after reservation base < requested count |
Diagnostic Emission Architecture
The ABI engine uses three diagnostic emitters:
sub_7EEFA0 (standalone path): Takes a stack buffer, the decimal warning code, and a printf-format string. Used as the fallback when no compilation context is available (when *(*(func)+48) == NULL). This is the path that produces warnings visible in non-context mode (e.g., standalone ptxas invocations).
sub_895530 (context-attached path): Takes the function object, the output context, flags (always 0), the hex warning code, and the format string. Used when the compilation context exists. This is the primary path during normal nvcc-driven compilation.
sub_7F7C10 (conditional emitter): Returns a bool indicating whether the diagnostic was accepted (not suppressed by the diagnostic context at func+1176). Used exclusively for warning 7011 (SUSPEND). When it returns true, the caller additionally invokes sub_8955D0 to attach the diagnostic to the compilation context.
Validation Order
The ABI master setup (sub_19D1AF0) invokes validators in this order:
1. regcount vs. abi_minimum -> 7016
2. register reservation overflow -> 7017
3. return address setup -> 7006, 7012 (sub_19D1720)
4. parameter allocation -> 7013, 7014 (sub_19CA730)
5. reserved range vs. retAddr -> 7015 (sub_19C69D0)
6. return address validation -> 7001-7005, 7008, 7009 (sub_19CDFF0)
7. coroutine SUSPEND validation -> 7011, 7012
Unreferenced ABI Strings
Three ABI-related strings exist in ptxas_strings.json with no cross-references in the decompiled binary. They may be dead code, referenced via indirect dispatch, or used only in debug builds:
- "Caller and callee expected to have different return address register but '%s' and '%s' both use R%d as return address register"
- "Function '%s' specifies register R%d as scratch register which is used as return address register"
- "Mismatch in return address abi when '%s' calls '%s'"
Function Map
| Address | Size | Confidence | Role |
|---|---|---|---|
sub_19C6400 | ~200 | 90% | Convergent boundary single-call checker |
sub_19C69D0 | ~600 | 90% | Reserved register overlap checker |
sub_19C7350 | ~900 | 80% | Register bitmap manipulation helper |
sub_19C7890 | ~600 | 80% | Register range validator |
sub_19C7B20 | ~600 | 80% | Register alignment checker |
sub_19C7D60 | ~700 | 80% | Register pair allocator helper |
sub_19C8040 | ~700 | 80% | Register contiguous-range finder |
sub_19C84A0 | 1927 | 85% | Multi-function register dispatcher |
sub_19C8D30 | ~600 | 80% | Register usage merger |
sub_19C9010 | ~700 | 85% | Per-function register limit setter |
sub_19C92F0 | ~1050 | 85% | Register bitmap AND/OR combiner |
sub_19C99B0 | 2568 | 90% | Register usage population counter |
sub_19CA3C0 | ~300 | 95% | Return address overlap pre-check |
sub_19CA730 | 2277 | 98% | Parameter register allocator |
sub_19CB020 | ~200 | 85% | Shared-mem base address calculator |
sub_19CB230 | ~200 | 85% | Shared-mem offset calculator |
sub_19CB590 | ~350 | 80% | Post-call register restore |
sub_19CB7E0 | ~350 | 80% | Pre-call register save |
sub_19CBAC0 | ~600 | 85% | Shared load (LD.S) ABI lowering |
sub_19CBE00 | ~600 | 85% | Special instruction ABI fixup |
sub_19CC1A0 | 3873 | 95% | Register transfer lowering (STS/LDS) |
sub_19CD0D0 | ~1050 | 85% | Barrier instruction ABI lowering |
sub_19CD510 | ~900 | 85% | Conversion instruction ABI lowering |
sub_19CD950 | ~700 | 85% | Predicate lowering |
sub_19CDDB0 | ~200 | 80% | Reserved SMEM helper |
sub_19CDED0 | ~200 | 85% | SMEM reservation instruction handler |
sub_19CDFF0 | ~7500 | 99% | Return address validator |
sub_19CE590 | ~300 | 90% | Register limit propagator |
sub_19CE6D0 | ~300 | 85% | ABI flag propagator |
sub_19CEEF0 | ~200 | 80% | ABI attribute copier |
sub_19CF030 | ~200 | 80% | Function entry ABI setup |
sub_19CF140 | ~700 | 85% | Register-save sequence builder |
sub_19CF530 | ~350 | 80% | Parameter setup helper |
sub_19CF9A0 | ~600 | 85% | PRMT instruction ABI lowering |
sub_19CFC30 | ~500 | 95% | Opcode-based ABI dispatch |
sub_19D01E0 | ~1200 | 85% | Multi-callee ABI propagation |
sub_19D0680 | ~300 | 80% | Iterator initialization |
sub_19D0A80 | ~200 | 80% | Iterator filter setup |
sub_19D0AF0 | ~100 | 95% | Iterator filter check |
sub_19D0BC0 | ~40 | 95% | Iterator advance (next instruction) |
sub_19D0C10 | ~40 | 95% | Iterator advance (next matching) |
sub_19D0C70 | ~40 | 95% | Iterator advance (skip non-matching) |
sub_19D0CE0 | ~40 | 95% | Iterator advance (reverse) |
sub_19D0EE0 | ~40 | 95% | Iterator reset |
sub_19D1030 | ~200 | 80% | Iterator state query |
sub_19D13F0 | ~4300 | 90% | Convergent boundary checker |
sub_19D1720 | ~4800 | 95% | ABI return address setup |
sub_19D1AF0 | 5608 | 98% | Master ABI setup |
sub_19D32C0 | 1902 | 85% | Per-block register reservation builder |
sub_19D41E0 | 2247 | 85% | CALL instruction ABI lowering |
sub_19D4B80 | 1925 | 85% | Coroutine frame builder |
sub_19D5850 | ~900 | 80% | Shared-mem instruction lowering |
sub_19D5F10 | 1568 | 85% | Coroutine SUSPEND handler |
sub_19D67B0 | ~800 | 80% | Function exit ABI lowering |
sub_19D7160 | ~600 | 85% | Sub-pass: scan for ABI-relevant ops |
sub_19D7470 | 1526 | 80% | Register classification propagator |
sub_19D7A70 | 3313 | 85% | CONV.ALLOC insertion (dead instruction insertion) |
sub_19D8CE0 | ~1100 | 80% | Register save/restore pair generator |
sub_19D9290 | ~1000 | 80% | Register live range computation |
sub_19D9710 | ~1000 | 80% | Register conflict detector |
sub_19D9E00 | ~700 | 95% | gb10b WAR code generator (entry) |
sub_19DA2A0 | ~500 | 95% | gb10b WAR code generator (body) |
sub_19DA8F0 | 1580 | 80% | SSA-form instruction rebuilder |
sub_19DAF20 | ~1300 | 80% | Multi-dest instruction splitter |
sub_19DB440 | ~700 | 80% | Additional register reservation pass |
sub_19DC070 | ~900 | 85% | Sub-pass dispatcher |
sub_19DC4B0 | 6459 | 95% | Per-pass instruction lowering |
sub_19DDEF0 | 1687 | 95% | Reserved SMEM checker |
sub_19DE8F0 | 1842 | 80% | Register renaming for ABI conformance |
sub_19DF170 | 1928 | 80% | Instruction list rewriter |
Instruction Scheduler Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas instruction scheduler is a priority list scheduler with a 3-phase architecture. A single top-level orchestrator (sub_8D0640, ScheduleInstructions) drives three passes through one unified scheduling engine (sub_688DD0), each configured by a mode parameter that selects a different optimization objective: register pressure reduction, ILP/latency hiding, or dynamic batch optimization for tensor warpgroup operations. The scheduler runs twice in the ptxas pipeline -- once before register allocation on virtual registers (pre-scheduling) and once after physical register assignment (post-scheduling).
The scheduler consumes a dependency DAG built over the instruction list and produces a final instruction ordering together with SASS control words encoding stall counts, yield hints, barrier assignments, and scoreboard dependencies. The entire subsystem spans roughly 436 KB of code (0x893000--0x8FE000) with an additional 250 KB of supporting infrastructure in the 0x67F000--0x6A0000 range.
| Orchestrator | sub_8D0640 (22 KB) -- ScheduleInstructions |
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized scheduling loop |
| Priority function | sub_8C9320 (47 KB) -- multi-criteria heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency graph | sub_8CF880 (28 KB) + sub_8D9930 (19 KB) |
| Register budget | sub_8CEE80 (8.7 KB) -- occupancy-aware computation |
| HW latency profiles | sub_8E7300--sub_8E9DC0 -- per-SM tables |
| Opcode table | sub_896D50 (90 KB) -- ROT13-encoded SASS mnemonics |
| Scheduling arena | sub_8E3970 / sub_8E3A80 -- bump allocator |
| Key knobs | 76 Sched* knobs; see Configuration |
| Enable gate | "ScheduleInstructions" named option at (a1+8)+1664 |
3-Phase Pipeline
The orchestrator sub_8D0640 executes the following sequence. All three scheduling phases invoke the same unified engine sub_688DD0 -- the only difference is the mode byte passed as the second argument.
function ScheduleInstructions(sched):
// 1. Build dependency graph
BuildDependencyGraph(sched, func) // sub_8CF880
vtable[29](sched) // InitScheduleData
PreScheduleSetup(sched, opt_level > 2) // sub_8CBAD0
// 2. Gate check
if not KnobGetBool("ScheduleInstructions"):
return
// 3. Set mode flags from knobs 419 (LivenessCountRegComp), 420 (LivenessUseHiLo)
sched.flags |= (knob_419 << 3) | (knob_420 << 4)
// 4. Optionally create register pressure tracker
if sched.flags & 0x10:
sched.scoreboard = alloc(952) // sub_69A1A0
sched.tracker = alloc(208) // sub_6B8F70
if sched.flags & 0x100:
sched.warp_analysis = alloc(856) // sub_6BB7C0
// 5. Reset per-instruction SchedNode fields between passes
// (iterates func+104 metadata chain, NOT instruction list)
for sched_node in func.sched_node_list: // linked via func+104
sched_node.depChainHead = 0 // QWORD +56
sched_node.extendedState = 0 // QWORD +104
sched_node.schedulingCost = 0 // DWORD +76
sched_node.schedulingClass = -1 // DWORD +84, sentinel
// 6. Phase 1 — ReduceReg
if KnobGetBool("ScheduleInstructionsReduceReg"):
ScheduleEngine(sched, mode=0x39, ...) // sub_688DD0
// 7. Phase 2 — Reverse scheduling (ILP / latency)
ReverseSchedule(sched) // sub_8CD6E0
ComputeRegisterBudget(sched) // sub_8CEE80
// 8. Phase 3 — DynBatch
if KnobGetBool("ScheduleInstructionsDynBatch"):
AllocDynBatchData(sched) // sub_8BF890
ScheduleEngine(sched, mode=0x41, ...) // sub_688DD0
// 9. Cleanup
FreeBitvector(sched.bv) // sub_BDC050
ArenaFreeAll(sched.arena) // sub_8E3A80
Phase 1 -- ReduceReg (mode 1, callback 0x39)
Goal: minimize register pressure so the register allocator has headroom. This phase reorders instructions to reduce the maximum number of simultaneously-live virtual registers.
- Enabled by the named option "ScheduleInstructionsReduceReg" (default: on at -O3).
- Register targets set from knobs 776 (SchedReduceIncLimit) and 778 (SchedReduceIncLimitHigh) (defaults approximately 250 and 300).
- The mode byte 0x39 selects the register-pressure-minimizing priority weights inside the unified engine.
- The engine's inner dispatch reads *(DWORD*)(scheduler+60) == 1 to enter the ReduceReg path.
Phase 2 -- ILP / Latency Hiding (mode 0, callback 0x49)
Goal: maximize instruction-level parallelism and hide memory latencies by interleaving independent operations.
- Always runs (no separate enable gate).
- Uses reverse post-order BB iteration via sub_8CD6E0: iterates basic blocks from last to first after resetting liveness with sub_781F80(func, 0).
- Computes a register budget capped at min(archLimit, 0.95 * maxRegs) via sub_8CEE80.
- The mode byte 0x49 selects latency-oriented priority weights.
- After this phase, sub_8CF5D0 evaluates dual-issue eligibility and produces a dual-issue benefit score stored at scheduler+328.
Phase 3 -- DynBatch (mode 2, callback 0x41)
Goal: batch-aware scheduling for GMMA/WGMMA warpgroup tensor operations. Groups tensor instructions into batches that can execute as warpgroup-cooperative operations with minimal pipeline stalls.
- Enabled by the named option "ScheduleInstructionsDynBatch"; only activates when the function has varying instruction counts across BBs.
- Controlled by knob 742 (SchedCrossBlock, cross-block scheduling mode).
- Reads stall/batch depth limits from knobs 805 (SchedTexBatchTargetSelectRegisterTarget), 741 (SchedCountLoadsPerTex), 761 (SchedMaxRLiveOKslack), and 762 (SchedMaxRLiveOKslackColdBlocks).
- Allocates a 184-byte DynBatch context (sub_8BF890) with an 8 * numBBs sub-array for per-BB batch tracking.
- Context initialization (sub_8C1BA0) sets the batch window to 0xFFFFFFFF (sentinel) and copies register liveness from func+832.
- The mode byte 0x41 selects batch-aware priority weights.
DynBatch Context Object (184 bytes)
sub_8BF890 allocates a 184-byte DynBatch context from the scheduling arena at sched+840 and stores the pointer at sched+272. The object is a flat structure containing a function context reference, a 20-slot working array, and a pointer to a variable-length per-BB sub-array.
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 8 | ptr | funcCtx | funcContext | Pointer to CompilationContext (copied from sched+8) |
| +8 | 160 | QWORD[20] | 0 | batchWorkArray | Fixed-size working array for batch state tracking; likely holds instruction pointers or batch boundary markers during scheduling |
| +168 | 8 | ptr | alloc'd | perBBArray | Per-BB batch tracking sub-array; 8 * numBBs bytes, zero-initialized. Each 8-byte entry holds a batch start/end instruction pointer for one basic block |
| +176 | 4 | DWORD | 0 | flags | Status/control flags |
| +180 | 4 | -- | -- | (padding) | Pad to 184-byte allocation |
The per-BB sub-array size is derived from *(sched+392) (maxBBSizeForAlloc), with an overflow check capping the multiplication at 0xFFFFFFFFFFFFFFF (2^60 - 1) entries.
DynBatch Working State (in scheduler context)
The bulk of the DynBatch working state lives directly in the scheduler context, initialized by sub_8C1BA0 (InitDynBatchState). These fields are used by the priority function during Phase 3 scheduling.
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +464 | 4 | int32 | 0 | batchSlotCount | Number of instructions accumulated in the current batch |
| +468 | 4 | int32 | -- | prevBatchSize | Size of previously-completed batch |
| +476 | 4 | int32 | adj | adjustedBatchTarget | Adjusted batch depth target; capped to min(maxStallCycles, batchTargetCount), halved when 2 * maxStall > target |
| +480 | 4 | int32 | -- | lastBatchEndPos | Scheduling position of the last instruction in the current batch |
| +488 | 8 | QWORD | 0xFFFFFFFF | batchWindow | Batch window start BB offset; sentinel 0xFFFFFFFF means "no batch active" |
| +492 | 4 | int32 | 0 | regDelta | Register pressure delta accumulator across batch boundaries |
| +496 | 4 | int32 | 0 | maxRegInBatch | Maximum register pressure observed within current batch |
| +500 | 4 | int32 | from +72 | regBaseCount | Base register count; copied from sched+72, reset on batch boundary |
| +504 | 8 | QWORD | 0 | maxRegSpan | Maximum register span (pressure peak minus baseline) across all batches |
| +508 | 4 | int32 | 0 | regBaseline | Register count baseline for delta computation |
| +512 | 4 | int32 | 0 | minOverflowCost | Minimum overflow cost; updated when batch exceeds register budget |
| +516 | 4 | int32 | -1 | batchDepthLimit | Per-batch maximum depth; -1 = unlimited (overwritten from BB analysis) |
| +520 | 1 | byte | 0 | batchOverflow | Set to 1 when batch exceeds register budget + base count |
| +521 | 1 | byte | 0 | batchAbort | Set to 1 when opcode 96 (WGMMA commit) detected with sched+524 flag |
| +536+ | var | ptr[] | -- | batchSlots | Array of instruction pointers in the current batch; sched+536 + 8*i for slot i |
The batch target adjustment algorithm in sub_8C1BA0:
adjustedTarget = maxStallCycles // from sched+404
if maxStallCycles > batchTargetCount:
adjustedTarget = batchTargetCount // cap to target
if batchTargetCount > maxStallCycles:
if batchMode == 0 and maxStallCycles < batchTargetCount:
if batchTargetCount >= 2 * maxStallCycles:
adjustedTarget = batchTargetCount / ceil(batchTargetCount / maxStallCycles)
else:
adjustedTarget = batchTargetCount / 2
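The adjustment pseudocode above, rendered as runnable Python (integer division assumed, matching the decompiled arithmetic; parameter names are ours):

```python
from math import ceil

def adjusted_batch_target(max_stall: int, target: int, batch_mode: int) -> int:
    """Sketch of sub_8C1BA0's batch target adjustment."""
    adjusted = min(max_stall, target)        # cap to the smaller of the two
    if target > max_stall and batch_mode == 0:
        if target >= 2 * max_stall:
            # split the target into roughly stall-sized chunks
            adjusted = target // ceil(target / max_stall)
        else:
            adjusted = target // 2
    return adjusted

assert adjusted_batch_target(8, 4, 0) == 4    # target below stall limit: capped
assert adjusted_batch_target(4, 16, 0) == 4   # 16 // ceil(16/4)
assert adjusted_batch_target(4, 6, 0) == 3    # under 2x the stall limit: halved
```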
When a batch boundary is detected (instruction's BB start offset exceeds the batch window), sub_8C1BA0 evaluates the batch: it computes the register pressure delta, checks whether the batch overflows the combined register budget (regBaseCount + regDelta + maxRegSpan), and either accepts the batch or trims it by walking backward through the batchSlots array to find a smaller valid batch.
Unified Scheduling Engine
sub_688DD0 (20 KB) is the single engine that all three phases invoke. Its behavior is parameterized by:
- Mode byte (argument a2): 0x39 = ReduceReg, 0x49 = ILP/Latency, 0x41 = DynBatch.
- Rebuild flag (argument a4): when true, reconstructs the dependency DAG via sub_6833F0.
- Vtable dispatch: uses *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks.
function ScheduleEngine(sched, mode, arg3, rebuild):
if rebuild:
InitScheduleRegion(sched) // sub_6833F0
// allocates 72-byte per-BB records, queries knobs 595 (PreserveSchedOrderSame), 743 (SchedCrossBlockInstsToSpeculate), 747 (SchedCrossBlockTexToSpeculate)
for each bb in sched.basic_blocks:
// 10 register pressure counters from per-BB record +4..+40 into context +48..+87
InitResourceTracking(sched, bb) // sub_A091C0
ReadyList = BuildReadyList(sched) // sub_6820B0
while ReadyList is not empty:
best = SelectBestInstruction(sched) // via priority vtable
ScheduleInstruction(sched, best) // sub_682200
UpdateResourceState(sched, best) // sub_A09530
UpdateWARTracking(sched, best) // sub_A09D40
RelinkInstruction(best) // sub_925510
// Update dependency counts, add newly-ready instructions
for each successor of best:
successor.dep_count -= 1
if successor.dep_count == 0:
ReadyList.insert(successor)
The engine manages 10 register pressure counters at scheduler context offsets 48--87 (copied from the per-block record offsets +4--+40 at BB entry). These correspond to the GPU register classes: R (general), P (predicate), UR (uniform), UP (uniform predicate), B (barrier), and 5 architecture-specific classes. Counter [0] (R class) uses a separate update path; counters [1]--[9] are decremented from a per-opcode resource cost table during the scheduling loop.
Ready List Construction
sub_6820B0 (1.5 KB) builds the initial ready list by scanning the instruction linked list for nodes with zero unsatisfied dependencies.
function BuildReadyList(sched):
for instr in sched.instruction_list:
if instr.opcode == 52: // NOP/BB boundary
continue // follow through to real instruction
if instr.dep_count == 0:
instr.next_ready = sched.ready_head
sched.ready_head = instr
vtable_call(sched, 104, instr) // ready-list insertion callback
instr.latency_counter = 0
The ready list is maintained as a sorted linked list (via pointer at instruction offset +16). The priority function determines sort order.
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority score as a single integer packed from 8-bit fields, each field encoding one heuristic factor. The function uses approximately 200 local variables and a 0x330-byte stack frame.
Priority Factors
| Factor | Source | Weight adjustment |
|---|---|---|
| Register pressure | Current live count vs budget at sched+432 | Primary factor in ReduceReg mode |
| Instruction latency | sub_693BC0 latency query | Primary factor in ILP mode |
| Critical path position | DAG depth from sched+464, sched+380 | Favors critical-path instructions |
| FU contention | 10-element resource vector via sub_8C7290 | Avoids saturating a single pipe |
| Hot/cold memory | sub_A9CDE0 (hot=global) / sub_A9CF90 (cold=const) | Prioritizes latency-sensitive ops |
| Anti-dependency | WAR hazard cost | Breaks ties with anti-dep distance |
| Barrier dependencies | Barrier flag at instr+376 | Defers barrier-blocked instructions |
| Priority queue depth | Knob 770 (default 4) | Limits lookahead window |
Priority Encoding
The priority value is packed into an integer with 8-bit fields. Each field is computed from the corresponding factor and shifted into position. The packed encoding allows the ready list to maintain sort order with a single integer comparison, avoiding multi-key sorting overhead.
Key subroutines called during priority computation:
| Address | Purpose |
|---|---|
sub_8C67A0 | Compute resource cost for instruction and update BB resource table |
sub_8C7120 | Barrier tracking update |
sub_8C7290 | Copy 10-element resource vector from per-BB slot (SSE-optimized) |
sub_8C7720 | Instruction reordering within BB (red-black tree operations) |
sub_693BC0 | Memory space classification / latency query |
sub_6818D0 | Register count to hardware-aligned unit conversion |
Resource Tracking
The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset (within slot) | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per FU |
| 40--76 | 10 x int32 | Resource pressure delta |
| 80 | int32 | BB-entered flag and auxiliary bits |
The 10 functional unit pipes (inferred from resource model queries):
| Index | Pipe | Typical instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS, LDS, STS |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
sub_8C67A0 computes per-instruction resource costs by calling the resource model (sub_A08A00) three times:
- Mode 1: the instruction's own execution cost
- Mode 2: operand release costs for last-use operands
- Mode 3: combined instruction + BB-level impact
SSE intrinsics (_mm_add_epi32) are used for vector accumulation.
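An illustrative model of this three-mode accumulation (the cost vectors below are made up; the binary sums 10-element per-FU vectors with _mm_add_epi32):

```python
# Each resource-model query returns a 10-element cost vector, one slot
# per functional-unit pipe; the three modes' results are summed elementwise.

def accumulate_costs(*vectors):
    total = [0] * 10
    for vec in vectors:
        total = [a + b for a, b in zip(total, vec)]
    return total

exec_cost    = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # mode 1: own execution cost
release_cost = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # mode 2: last-use operand release
bb_impact    = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # mode 3: combined BB-level impact
assert accumulate_costs(exec_cost, release_cost, bb_impact) == \
       [2, 0, 0, 0, 2, 0, 0, 0, 0, 0]
```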
Register Budget
sub_8CEE80 (8.7 KB) computes the occupancy-aware register budget that the scheduler respects during instruction ordering.
function ComputeRegisterBudget(sched):
hw = sched.func.sm_backend // at func+1584 (provides hw latency profiles)
maxRegs = hw[154] // architecture register limit
coeff = KnobGetDouble(740) // default 0.045
if KnobGetBool(763): // budget disabled
budget = hw[157] // use fixed count from profile
else:
physRegs = VirtToPhys(sched, maxRegs) // sub_A99FE0
budget = physRegs - (physRegs >> 6) // 98.4% utilization
// For sm_50: apply special dual-issue budget
if arch_id == 5:
budget = DualIssueBudget(budget)
pressureCurve = ComputePressureCurve(sched, budget - 2) // sub_8CE520
// Piecewise linear model with parameters (4, 2, 6)
sched.regBudget = budget // offset +432
sched.committedTarget = ... // offset +324
sched.minRegs = ... // offset +316
sched.pressureSlack = ... // offset +320
The register pressure curve (sub_8CE520) uses a piecewise linear model parameterized by (4, 2, 6) or a custom string-encoded function from knob 750 (SchedEstimatedLoopIterations).
Dependency Graph
The dependency DAG is built in two stages:
Stage 1: Pre-scheduling scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order. For each BB:
- Checks knobs 314 (FenceInterference) / 313 (FenceCode) for per-instruction scheduling fence conditions
- Walks the instruction linked list, identifying NOP/control instructions
- Builds dependency edges via sub_8D9930
- Manages memory arenas with SSE-optimized copies for instruction metadata arrays
Stage 2: Edge construction (sub_8D9930, 19 KB)
For each pair of instructions in a BB, checks for:
- RAW (true) dependencies: read-after-write on the same register
- WAR (anti) dependencies: write-after-read
- WAW (output) dependencies: write-after-write
- Memory dependencies: through shared/global memory (conservative ordering)
- Barrier dependencies: through barrier/sync instructions
Uses operand analysis from sub_894290 (27 KB) which processes 16-bit operand descriptors encoding register class, bank, and dependency type.
Supplementary dependency builders
| Address | Size | Purpose |
|---|---|---|
sub_68A690 | 31 KB | BuildDependencies -- def-use chain construction |
sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level edges |
sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints |
sub_6A78F0 | 23 KB | ProcessOperands -- operand dependency extraction |
Pre-Scheduling Setup
sub_8CBAD0 (2.9 KB) performs BB scanning and resource allocation before the scheduling passes begin.
Key behaviors:
- Counts instructions per basic block. If any BB exceeds 4095 instructions, it inserts a scheduling barrier (sub_931920) to split the block.
- Tracks maximum BB size at scheduler+388.
- Detects opcode 246 (texture operations) and sets scheduler+384 = 1.
- Allocates per-slot arrays:
  - scheduler+672: 84-byte scheduling slots (resource tracking)
  - scheduler+280: 48-byte analysis slots (if opt_level > 2)
  - scheduler+248, scheduler+256: register pressure bitvectors sized to (numRegs+1) or (2*numRegs+2) if knob 420 (LivenessUseHiLo, dual-register tracking) is active
Pre-Scheduling vs Post-Scheduling
The scheduler runs at two distinct points in the ptxas pipeline:
| Aspect | Pre-scheduling | Post-scheduling |
|---|---|---|
| Timing | Before physical register allocation | After physical register allocation |
| Register model | Virtual registers | Physical registers |
| Primary goal | Reduce register pressure, order for regalloc | Hide latencies, minimize stalls |
| Phases active | All 3 (ReduceReg, ILP, DynBatch) | Refinement pass |
| Budget source | Occupancy model estimate | Actual allocation result |
| Entry | sub_8D0640 | sub_7F5D50 / sub_A97600 (42 KB) |
Post-scheduling uses the actual physical register assignments for precise dependency distances and can make final decisions about stall insertion and scoreboard barrier placement.
Scheduling Variants
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies, each implementing a different approach or targeting a different code pattern:
| Address | Size | Strategy | Notes |
|---|---|---|---|
sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping |
sub_8B77C0 | 15 KB | Dual-issue scheduling | Pair co-issuable instructions |
sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection |
sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA grouping |
sub_8BAAE0 | 15 KB | Loop-aware scheduling | Trip count + register awareness |
sub_8B6D60 | 12 KB | Pressure-optimized | Minimize live range overlap |
sub_8B5400 | 14 KB | Latency-optimized | Maximize memory latency hiding |
sub_8B1190 | 16 KB | Backtracking scheduler | Undo and retry on conflict |
sub_8B2D90 | 18 KB | Global schedule optimization | Cross-BB considerations |
sub_8B4590 | 13 KB | Permutation search | Try schedule permutations |
sub_8A9D80 | 21 KB | Depth-first scheduling | DFS-based instruction ordering |
sub_8AB750 | 9.8 KB | Critical path computation | DAG analysis for priorities |
sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Memory prefetch insertion |
sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merge adjacent barriers |
sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimize scoreboard usage |
sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning |
sub_8BE320 | 25 KB | Complex scheduling pass | Multi-strategy combined pass |
These variants are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level.
Hardware Latency Profiles
Per-architecture latency and throughput tables are constructed by a family of functions at 0x8E7300--0x8E9DC0. Each table specifies pipeline latencies (integer, FP32, FP64, tensor, memory), scoreboard wait counts, barrier stall cycles, and dual-issue pair compatibility for the target GPU.
| Address | Architecture | Size |
|---|---|---|
sub_8E7300 | sm_70 (Volta) | 3.3 KB |
sub_8E7540 | sm_72 | 2.9 KB |
sub_8E7720 | sm_75 (Turing) | 3.5 KB |
sub_8E7940 | sm_80 base | 2.9 KB |
sub_8E7B40 | sm_80 (Ampere) | 3.3 KB |
sub_8E7D80 | sm_86 | 4.4 KB |
sub_8E8070 | sm_87 | 3.5 KB |
sub_8E8280 | sm_89 (Ada Lovelace) | 3.1 KB |
sub_8E8480 | sm_90 (Hopper) | 5.2 KB |
sub_8E8780 | sm_90a | 4.6 KB |
sub_8E8A90 | sm_100 (Blackwell DC) | 3.0 KB |
sub_8E8DB0 | sm_103 (Blackwell Ultra) | 1.7 KB |
sub_8E9000 | sm_120 (RTX 50xx) | 2.9 KB |
sub_8E92E0 | sm_120 extended | 5.5 KB |
sub_8E97B0 | Universal fallback | 8.8 KB |
The warp-level hardware profile (sub_8E4400) maps architecture IDs to dispatch parameters:
| Architecture range | Warps | Dispatch slots | Era |
|---|---|---|---|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| <= 24575 | 6 | 176 | sm_60 (Pascal) |
| <= 28672 | 7 | 192 | sm_70 (Volta) |
| <= 32767 | 7 | 208 | sm_75 (Turing) |
| <= 36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |
Sub-architecture variants (stored at profile offset +26) are assigned by specific SM version codes: 8193, 20481, 24576, 28674--28677, 32768, 36864--36869.
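The threshold mapping in sub_8E4400 reduces to a simple range lookup; the thresholds and values below are taken directly from the table above:

```python
# Sketch of the architecture-ID -> dispatch parameter mapping recovered from
# sub_8E4400. Returns (warps, dispatch_slots).

def warp_profile(arch_id: int):
    table = [
        (20479, (4, 96)),    # sm_50 (Maxwell)
        (24575, (6, 176)),   # sm_60 (Pascal)
        (28672, (7, 192)),   # sm_70 (Volta)
        (32767, (7, 208)),   # sm_75 (Turing)
        (36863, (8, 224)),   # sm_80 (Ampere)
    ]
    for limit, params in table:
        if arch_id <= limit:
            return params
    return (16, 240)         # sm_90+ (Hopper, Blackwell)
```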
See Latency Model for per-opcode latency tables and functional unit mapping.
Scheduling Knobs
The scheduler reads approximately 76 knobs. The most significant ones (names decoded from ROT13 in the binary):
| Knob ID | Name | Type | Default | Purpose |
|---|---|---|---|---|
| 313 | FenceCode | when-list | -- | Skip scheduling for specific opcodes (per-instruction WHEN condition) |
| 314 | FenceInterference | when-list | -- | Mark interference fences for specific opcodes |
| 419 | LivenessCountRegComp | int32 | -- | Forward scheduling mode flag (bit 3 in sched+1376) |
| 420 | LivenessUseHiLo | int32 | -- | Dual-register hi/lo tracking (bit 4 in sched+1376) |
| 487 | -- | bool | true | Master scheduling/peephole enable |
| 510 | OptimizeUniformAtomicMode | int32 | -- | BB pre-optimization mode for uniform atomics |
| 595 | PreserveSchedOrderSame | when-list | -- | Preserve scheduling order (per-instruction WHEN condition) |
| 740 | SchedBumpScaleAugmentFactor | double | 0.045 | Register pressure bump scale augmentation coefficient |
| 741 | SchedCountLoadsPerTex | int32 | 3 | Load count per texture operation (stall threshold) |
| 742 | SchedCrossBlock | int32 | -- | Cross-block scheduling mode |
| 743 | SchedCrossBlockInstsToSpeculate | int32 | -- | Cross-block instruction speculation count |
| 747 | SchedCrossBlockTexToSpeculate | int32 | -- | Cross-block texture speculation count |
| 750 | SchedEstimatedLoopIterations | string | -- | Estimated loop iteration count override |
| 760 | SchedMaxRLiveCarefulSlack | int32 | -- | Reserved register headroom (careful slack for live registers) |
| 761 | SchedMaxRLiveOKslack | int32 | -- | Acceptable live-register slack (batch depth on non-sm_50) |
| 762 | SchedMaxRLiveOKslackColdBlocks | int32 | -- | Extra register slack for cold basic blocks |
| 763 | SchedMaxRTarget | int32 | -- | Maximum register target; 0 disables register budget |
| 769 | SchedPrefFurthestDep | when-list | -- | Per-BB scheduling query: prefer furthest dependency |
| 770 | SchedReadAvailTarget | int32 | 4 | Priority queue depth (read-availability lookahead window) |
| 776 | SchedReduceIncLimit | int32 | ~250 | Forward pass primary register increment limit |
| 778 | SchedReduceIncLimitHigh | int32 | ~300 | Forward pass secondary (high) register increment limit |
| 805 | SchedTexBatchTargetSelectRegisterTarget | int32 | -- | Texture batch register target stall limit (capped at 16) |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | int32 | -- | Texture batch scheduler target stall limit (capped at 16) |
Knob names are stored ROT13-encoded in the binary (see Knobs System for the obfuscation scheme). The when-list type marks knobs that support per-instruction or per-BB conditional overrides via WHEN= syntax.
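Since the obfuscation is plain ROT13 over letters, knob names can be recovered with a one-liner. The encoded string below is constructed for illustration, not extracted from the binary:

```python
# ROT13 decode of obfuscated knob names, as described above.
# codecs.decode(..., "rot13") applies ROT13 to ASCII letters only.
import codecs

def decode_knob_name(obfuscated: str) -> str:
    return codecs.decode(obfuscated, "rot13")
```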
The full scheduling context configuration is performed by sub_A95DC0 (35 KB), which reads dozens of knob values and populates the scheduling context structure.
Data Flow Analysis
The scheduler includes a dedicated data flow analysis subsystem (0x8DBAF0--0x8DF1C0) that computes register liveness and propagates def-use information across BB boundaries:
| Address | Size | Purpose |
|---|---|---|
sub_8DB070 | 8.2 KB | Initialize liveness data structures |
sub_8DB5F0 | 8.4 KB | Compute per-BB liveness |
sub_8DBAF0 | 16 KB | Full liveness analysis |
sub_8DC3F0 | 3.0 KB | Compute data flow state |
sub_8DC620 | 3.3 KB | Update data flow on schedule |
sub_8DC880 | 10 KB | Propagate data flow information |
sub_8DCF20 | 23 KB | Build data flow graph for scheduling |
sub_8DE7A0 | 12 KB | Iterative data flow solver (fixed-point) |
sub_8DEF90 | 2.0 KB | Finalize data flow |
The iterative solver runs until convergence, updating per-BB liveness sets. This information feeds into the priority function's register pressure estimates and into the register budget computation.
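The fixed-point iteration is the textbook backward liveness dataflow. A minimal sketch, with an illustrative CFG representation (use/def sets plus successor lists) standing in for the real per-BB structures:

```python
# Minimal round-robin liveness solver of the kind sub_8DE7A0 implements:
# iterate live-in/live-out over the CFG until nothing changes.

def solve_liveness(bbs, succs):
    """bbs: {name: (use_set, def_set)}; succs: {name: [successor names]}."""
    live_in = {b: set() for b in bbs}
    live_out = {b: set() for b in bbs}
    changed = True
    while changed:                           # run until convergence (fixed point)
        changed = False
        for b, (use, defs) in bbs.items():
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            new_in = use | (out - defs)      # live-in = use U (live-out - def)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in, live_out
```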
Scheduling Output
After instruction ordering is determined, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) converts the abstract schedule into SASS control words:
function EmitScheduleForBB(sched, bb):
for each instruction in scheduled order:
stall = ComputeStallCycles(sched, instr) // distance to consumer
yield = ComputeYieldHint(sched, instr) // warp scheduling hint
barrier = AssignBarrier(sched, instr) // 6 barriers available
sb_deps = ComputeScoreboardDeps(sched, instr) // read/write dependencies
control_word = EncodeControlWord(stall, yield, barrier, sb_deps)
EmitControlWord(instr, control_word)
Key encoding functions:
| Address | Purpose |
|---|---|
sub_8F1EB0 | Main schedule encoding entry |
sub_8F3130 | Encode stall count field |
sub_8F31F0 | Encode barrier field |
sub_8F3650 | Encode yield hint field |
sub_8F3860 | Encode scoreboard dependency field |
sub_8F4140 | Encode complete control word |
sub_8F6530 | Output complete schedule for function |
Seven verification functions at 0x8F7610--0x8F8CB0 validate the generated schedule: stall counts, barrier assignments, dependency chains, scoreboard correctness, control word format, yield hints, and overall schedule integrity.
See Scoreboards for the scoreboard and dependency barrier encoding format.
Memory Management
The scheduler uses two allocator strategies:
- Arena allocator (sub_8E3970): bump allocator with 10 KB block granularity, 8-byte alignment. Allocations within a scheduling pass use the arena for fast allocation. sub_8E3A80 frees all blocks at once at pass completion.
- Free-list allocator (sub_8DA6D0): free-list with block coalescing for persistent scheduling data. Maintains multiple free lists for different size classes. Blocks larger than 0x1FF bytes go to a separate large-block list. Adjacent free blocks are merged on deallocation.
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 (sched_slot) to a separate heap-allocated scheduling metadata block called a SchedNode. The metadata offsets documented throughout the scheduling pages (e.g., metadata+24, metadata+32, metadata+108) are relative to this SchedNode, not to the 296-byte Ori instruction object itself. The SchedNode block is at least 112 bytes; all nodes are linked into a singly-linked list at func+104 (Code Object offset +104), separate from the instruction linked list at func+272.
SchedNode Layout
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 8 | ptr | -- | nextInList | Singly-linked next pointer for the func+104 metadata chain |
| +8 | 4 | i32 | 0 | depCount | Unsatisfied dependency count; decremented as predecessors are scheduled; instruction is ready when this reaches 0 |
| +12 | 4 | -- | -- | (pad) | Alignment padding |
| +16 | 8 | ptr | -- | nextReady | Ready list singly-linked next pointer; threaded by sub_6820B0 (BuildReadyList) |
| +24 | 4 | i32 | seq | bbSlot | 1-based position within the BB (assigned sequentially by sub_8D9930); used for program-order tiebreaking in priority decisions |
| +28 | 4 | i32 | 0 | latencyCounter | Remaining latency cycles until the instruction's result is available; reset to 0 when placed on the ready list; updated by sub_A09530 (UpdateStallCycles) |
| +32 | 4 | i32 | -- | earliestCycle | Earliest available cycle -- the latest completion time among all producer instructions; stall-free when earliestCycle >= scheduler+480 (current cycle) |
| +36 | 4 | -- | -- | (reserved) | Alignment padding or internal use |
| +40 | 4 | i32 | 0 | latestDeadline | Latest deadline cycle for scheduling; secondary tiebreaker in the candidate comparison cascade |
| +44 | 4 | i32 | -- | barrierGroupIndex | Barrier group assignment; identifies which of the 6 hardware barriers this instruction participates in |
| +48 | 4 | i32 | -- | schedulingFenceCode | Scheduling fence code from knob 313 (FenceCode) / 314 (FenceInterference) checks; controls per-instruction scheduling boundaries |
| +56 | 8 | i64 | 0 | depChainHead | Dependency chain data; reset to 0 between scheduling passes |
| +76 | 4 | i32 | 0 | schedulingCost | Per-instruction scheduling cost; accumulated during priority evaluation; reset between passes |
| +84 | 4 | i32 | -1 | schedulingClass | Scheduling class index assigned by the latency model (sub_89FBA0); indexes into per-architecture latency tables; -1 = unclassified (sentinel) |
| +88 | 4 | i32 | -- | maxPredecessorCycle | Highest cycle value among predecessor instructions; used in the priority pre-scan to compute max_pred_cycle |
| +92 | 4 | i32 | -- | maxDependencyCycle | Highest cycle value along the dependency chain; used to compute max_dep_cycle for critical-path analysis |
| +104 | 8 | i64 | 0 | extendedState | Extended scheduling state; reset to 0 between scheduling passes |
| +108 | 1 | byte | -- | flags | Primary flag byte: bit 0 = barrier-target, bit 1 = has-dependency-set, bit 2 = fence-early (knob 314), bit 3 = fence-late (knob 313), bit 4 = has-register-operand |
| +111 | 1 | byte | -- | extendedFlags | Extended flags: bit 7 = uses expensive register file (triggers barrier tracking update in sub_8C7120) |
Relationship to the Instruction Object
Ori Instruction (296 bytes) SchedNode (>= 112 bytes)
+--------------------------+ +---------------------------+
| +0: prev (BB list) | instr+40 | +0: nextInList |
| +8: next (BB list) |---sched_slot--> |
| +16: id | | +8: depCount |
| +72: opcode | | +16: nextReady |
| +80: operand_count | | +24: bbSlot |
| +84: operands[] | | +28: latencyCounter |
| | | +32: earliestCycle |
| | | +40: latestDeadline |
| | | +88: maxPredecessorCycle |
| | | +92: maxDependencyCycle |
| | | +108: flags |
+--------------------------+ +---------------------------+
Lifecycle
- Allocation: InitScheduleData (vtable[29], called from sub_8D0640) allocates one SchedNode per instruction from the scheduling arena and stores the pointer at instr+40. Nodes are linked into the func+104 chain.
- Initialization: sub_8D9930 (EdgeBuilder) initializes depCount, bbSlot, latencyCounter, latestDeadline, and flags while building dependency edges. Between scheduling phases, the orchestrator resets pass-specific fields: +56 = 0, +104 = 0, DWORD +76 = 0, DWORD +84 = -1.
- Population: The dependency graph builder populates depCount from edge analysis. Critical-path computation fills earliestCycle, maxPredecessorCycle, and maxDependencyCycle.
- Use: sub_6820B0 (BuildReadyList) checks depCount == 0 and threads ready instructions via nextReady. sub_8C9320 (PriorityFunction) reads all fields to compute the 8-bit scheduling priority.
- Cleanup: sub_8E3A80 (ArenaFreeAll) reclaims all SchedNode blocks when the scheduling pass completes.
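The ready-list threading in the Use step can be sketched directly from the SchedNode fields. The node class below mirrors only the two fields involved (depCount, nextReady); everything else is omitted for illustration:

```python
# Sketch of ready-list construction in the style of sub_6820B0: nodes whose
# depCount has reached zero are threaded through nextReady into a singly-
# linked list. The SchedNode class here is a minimal stand-in.

class SchedNode:
    def __init__(self, name, dep_count):
        self.name = name
        self.depCount = dep_count
        self.nextReady = None

def build_ready_list(nodes):
    head = None
    for n in reversed(nodes):     # push in reverse so the list keeps program order
        if n.depCount == 0:
            n.nextReady = head
            head = n
    return head
```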
Sentinel Values
- bbSlot = -1: unscheduled (set during inter-pass reset at DWORD +84)
- latencyCounter = 99999 (0x1869F): infinity (used as min_barrier_latency initial value in the priority pre-scan)
- earliestCycle bit 31 set (>= 0x80000000): not-yet-available (tested in sub_8C9320 pre-scan via < 0x80000000 comparison)
Large Function Handling
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is split into chunks that are scheduled independently and then merged. This avoids quadratic blowup in the dependency DAG construction for very large kernels.
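A naive version of the split looks like this. The 0x3FFF threshold comes from the text; the split points chosen here are purely positional, whereas sub_A9DDD0 presumably respects basic-block boundaries when forming chunks:

```python
# Sketch of chunk-based scheduling for oversized functions. Splitting caps
# the dependency DAG size per chunk, avoiding quadratic edge construction.

CHUNK_LIMIT = 0x3FFF  # 16383 instructions

def split_into_chunks(instrs):
    if len(instrs) <= CHUNK_LIMIT:
        return [instrs]
    return [instrs[i:i + CHUNK_LIMIT] for i in range(0, len(instrs), CHUNK_LIMIT)]
```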
Per-Block Scheduling Record (72 bytes)
sub_6833F0 (InitScheduleRegion, 10 KB) allocates an array of (numBBs + 1) records at 72 bytes each, stored at scheduler+184. Each record tracks the register pressure snapshot, region context pointers, and scheduling characteristic flags for a single basic block. The scheduling engine loads a BB's pressure state from this record at region entry and saves it back when moving to the next BB.
Field Map
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 4 | i32 | 0 | crossBlockId | Non-zero when the BB is active/scheduled; set to the predecessor BB index during cross-block merging. Tested as a boolean gate by 8+ functions before processing a BB. |
| +4 | 4 | i32 | 0 | pressure[0] | Register pressure snapshot -- R (general-purpose 32-bit registers) |
| +8 | 4 | i32 | 0 | pressure[1] | Register pressure snapshot -- P (predicate registers) |
| +12 | 4 | i32 | 0 | pressure[2] | Register pressure snapshot -- UR (uniform registers) |
| +16 | 4 | i32 | 0 | pressure[3] | Register pressure snapshot -- UP (uniform predicate registers) |
| +20 | 4 | i32 | 0 | pressure[4] | Register pressure snapshot -- B (barrier registers) |
| +24 | 4 | i32 | 0 | pressure[5] | Register pressure snapshot -- arch-specific class 0 |
| +28 | 4 | i32 | 0 | pressure[6] | Register pressure snapshot -- arch-specific class 1 |
| +32 | 4 | i32 | 0 | pressure[7] | Register pressure snapshot -- arch-specific class 2 |
| +36 | 4 | i32 | 0 | pressure[8] | Register pressure snapshot -- arch-specific class 3 |
| +40 | 4 | i32 | 0 | pressure[9] | Register pressure snapshot -- arch-specific class 4 / control total |
| +44 | 4 | -- | -- | (padding) | Not initialized, not accessed |
| +48 | 8 | ptr | 0 | regionContext | Pointer to 136-byte per-region scheduling state allocated by sub_682F10. Contains region boundaries, mode flags, and instruction range metadata. |
| +56 | 8 | ptr | 0 | regionContext2 | Second region context pointer, written via successor-BB index mapping. Dereferenced by sub_681C00 to check barrier presence (bit 4 of pointed-to byte). |
| +64 | 1 | byte | & 0x80 | flags | Per-BB characteristic flags (see below). Low 7 bits cleared on init; bit 7 preserved. |
| +65 | 7 | -- | -- | (padding) | Padding to 72-byte stride |
Pressure Counter Transfer
At the start of each BB's scheduling pass, sub_A091C0 (InitResourceTracking) copies the 10 DWORDs at record offsets +4 through +40 into the scheduler context at context offsets +48 through +87. The scheduling engine then updates the context counters as instructions are scheduled. When cross-block scheduling produces a new pressure snapshot, the engine writes it back with SSE bulk stores:
*(OWORD*)(record + 4) = pressure[0..3] // 16 bytes via _mm_store_si128
*(OWORD*)(record + 20) = pressure[4..7] // 16 bytes via _mm_store_si128
*(QWORD*)(record + 36) = pressure[8..9] // 8 bytes
During the main scheduling loop, the engine decrements pressure[1] through pressure[9] (9 counters) from a 40-byte per-opcode resource cost table. pressure[0] (R class) is handled via a separate path.
Flags Byte (+64)
| Bit | Name | Set by | Meaning |
|---|---|---|---|
| 0 | crossBlockBoundary | sub_688DD0 (ScheduleEngine) | BB is a cross-block scheduling boundary |
| 1 | regionActive | sub_688DD0 (ScheduleEngine) | BB belongs to an active scheduling region |
| 2 | hasCall | sub_6833F0 for opcode 96 | BB contains a CALL instruction |
| 3 | hasBranch | sub_6833F0 for opcodes 188, 190 | BB contains a branch instruction |
| 4 | hasBarrierInstr | sub_6833F0 via sub_7DF3A0 test (bit 6) | BB contains a barrier-flagged instruction |
| 5 | hasLongLatencyOp | sub_6833F0 for memory/texture/tensor opcodes; also vtable[183] arch check | BB contains a long-latency operation (memory, texture, or tensor) |
| 6 | crossBlockTarget | sub_6833F0 cross-block merge | BB is the target of a cross-block scheduling region |
| 7 | (preserved) | Not cleared during init | Carries data from a prior pipeline stage; purpose unknown |
The opcodes that set bit 5 (hasLongLatencyOp): 18 (with knob 62 gate), 23, 26, 32, 57, 81, 101, 124 (with knob 461 gate), 178, 188, 190, 197, 236, 248, 271, 315. Additionally, any instruction where vtable[183] returns true (architecture-specific long-latency classification) sets bit 5.
Cross-Block Scheduling Setup
After per-BB initialization, sub_6833F0 walks the CFG to identify cross-block scheduling opportunities, gated by knob 744 (SchedCrossBlockReorder). For each predecessor-successor pair within the speculative distance threshold (knobs 743 SchedCrossBlockInstsToSpeculate and 747 SchedCrossBlockTexToSpeculate):
- Sets
record[pred].crossBlockId = succ_bb_index(marks predecessor active). - Clears bit 6 of
record[pred].flags(predecessor is not a cross-block target). - Sets bit 6 of
record[succ].flags(successor is a cross-block target). - Calls
sub_682F10to allocate the 136-byte region scheduling context and store pointers atrecord[pred]+48andrecord[succ]+56.
+0 +4 +44 +48 +56 +64 +72
| crossBlockId (4B) | pressure[0..9] (40B = 10 x i32) |pad | regionCtx (8B) | regionCtx2 (8B)| fl | pad |
+-------------------+----+----+----+----+----+----+----+----+----+----+----------------+----------------+----+------+
Scheduler Context Object Layout
The scheduling context object (sched / a1) is the central state structure passed as the first argument to every function in the scheduling subsystem. It is populated by sub_A95DC0 (SchedulingContext::configure, 35 KB) which reads dozens of knob values and architecture parameters. The object spans approximately 1600 bytes, from a vtable pointer at offset 0 through architecture-specific SSE vectors at offset +1584.
Core Fields (offsets 0--176)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +0 | 8 | void* | vtable | Polymorphic dispatch; pre/post scheduling hooks at *(a1+40), *(a1+48) |
| +8 | 8 | ptr | funcContext | Pointer to CompilationContext; all func/arch queries go through this |
| +16 | 8 | ptr | allocator | Memory allocator interface (vtable-dispatched alloc/free) |
| +40 | 8 | ptr | preHookVtable | Pre-scheduling callback (mode-specific polymorphic hook) |
| +48 | 40 | int32[10] | regPressureCounters | Per-register-class live counts (copied from per-BB record +4..+40): R, P, UR, UP, B, and 5 arch-specific. The engine decrements counters [1]..[9] in the scheduling loop; counter [0] (R class) uses a separate path. |
| +60 | 4 | int32 | mode | Scheduling mode: 0 = ILP/Latency, 1 = ReduceReg, 2 = DynBatch |
| +88 | 4 | int32 | maxBBDepth | Maximum dependency depth across all basic blocks |
| +92 | 4 | int32 | maxBBDepthNonTensor | Maximum depth excluding tensor instructions |
| +176 | 1 | byte | scheduleActive | 1 during ReduceReg and DynBatch phases, 0 during ILP/Latency |
| +178 | 1 | byte | reduceRegMode | When set, tightens register budget by ~12.5% + 3 |
Phase Control (offsets 240--312)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +240 | 4 | int32 | currentPhase | Phase ID: 0 = budget computation, 1 = ReduceReg, 2 = ILP |
| +248 | 8 | ptr | regBitvector1 | Register pressure bitvector (numRegs + 1 words) |
| +256 | 8 | ptr | regBitvector2 | Second bitvector for dual-register tracking (knob 420, LivenessUseHiLo) |
| +280 | 8 | ptr | analysisSlots | 48-byte per-BB analysis slots (allocated when opt_level > 2) |
| +292 | 1 | byte | regTargetValid | Whether register targets from knobs 776/778 (SchedReduceIncLimit/SchedReduceIncLimitHigh) are valid |
| +296 | 4 | int32 | regTargetPrimary | Forward-pass primary register target (knob 776 SchedReduceIncLimit, in HW register units) |
| +300 | 4 | int32 | regTargetSecondary | Forward-pass secondary register target (knob 778 SchedReduceIncLimitHigh, in HW register units) |
| +311 | 1 | byte | cfgFlag1 | Priority queue depth configuration flag |
| +312 | 4 | int32 | cfgParam1 | Configuration parameter (default 10) |
Register Budget (offsets 316--432)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +316 | 4 | int32 | minRegs | Minimum register count from architecture register limit |
| +320 | 4 | int32 | pressureSlack | Register pressure headroom (initialized to 0) |
| +324 | 4 | int32 | committedTarget | Committed register target (set to regBudget after budget computation) |
| +328 | 4 | int32 | dualIssueBenefit | Dual-issue benefit score from sub_8CF5D0 (sm_50 only) |
| +380 | 4 | int32 | latencyCutoff | Barrier-target latency cutoff; controls critical-path bit activation |
| +384 | 1 | byte | hasTextureOps | Set to 1 when opcode 246 (texture operation) found in any BB |
| +388 | 4 | int32 | maxBBSize | Maximum basic block size in instructions (capped at 4095) |
| +392 | 4 | int32 | maxBBSizeForAlloc | Copy of maxBBSize used for resource slot allocation sizing |
Stall / Batch Parameters (offsets 404--420)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +404 | 4 | int32 | maxStallCycles | Max stall cycles; from knob 805/806 (SchedTexBatchTargetSelect{Register,Scheduler}Target), capped at 16 |
| +408 | 4 | int32 | stallThreshold | Stall threshold; knob 741 (SchedCountLoadsPerTex), default 3 |
| +412 | 4 | int32 | batchDepth | Batch depth; knob 761 (SchedMaxRLiveOKslack), default 3 (6 or 12 for sm_50 with dual-issue) |
| +416 | 4 | int32 | extraRegReserve | Extra register reservation; knob 762 (SchedMaxRLiveOKslackColdBlocks), default -1 (disabled) |
| +420 | 4 | int32 | spillModeCountdown | Spill-mode countdown; when > 0, forces aggressive scheduling with critical-path bit always set |
Register Budget and Pressure Tracking (offsets 432--485)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +432 | 4 | int32 | regBudget | Target register count (occupancy-aware, from sub_8CEE80) |
| +440 | 8 | ptr | livenessBV.data | Register liveness bitvector data (via sub_BDBA60); sized to numRegs+1 or 2*numRegs+2 if dual-reg |
| +448 | 8 | ptr | livenessBV.alloc | Bitvector allocator reference |
| +456 | 4 | int32 | livenessBV.size | Bitvector size in 64-bit words |
| +464 | 4 | int32 | depthThreshold | Number of barrier-target instructions required to activate critical-path bit |
| +480 | 4 | int32 | currentCycle | Current scheduling cycle; used for stall-free evaluation |
| +484 | 1 | byte | phaseActive | Phase activity flag: 1 = ReduceReg active, 0 = ILP/budget |
| +485 | 1 | byte | schedDirty | Reset to 0 at orchestrator start |
Hot-Cold and Yield State (offsets 523--532)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +523 | 1 | byte | hotColdEnable | Hot-cold memory tracking enable; result of sub_8CF5D0 (dual-issue check) |
| +524 | 1 | byte | yieldState | Current yield state; propagated to CONTROL instructions via priority bit 6 |
| +532 | 4 | int32 | hotColdBudget | Hot-cold budget counter; decremented per cold instruction; tracking deactivates at zero |
Architecture Parameters (offsets 604--616)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +604 | 4 | int32 | archParam1 | Architecture-dependent parameter (6 for sm_60 era) |
| +616 | 4 | int32 | archParam2 | Architecture-dependent limit (63 for sm_50 era, 255 for sm_60+) |
Resource Tracking and Dependency Data (offsets 672--744)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +672 | 8 | ptr | resourceSlots | Per-BB resource cost table; 84 bytes per slot (21 DWORDs: 10 FU usage + 10 FU delta + 1 flag) |
| +680 | 8 | ptr | depData | Dependency tracking data (zeroed at orchestrator start) |
| +720 | 8 | ptr | arenaAllocRef | Arena allocator reference for bitvector buffer resizing |
| +728 | 8 | ptr | bvBuffer | Growable bitvector buffer pointer (1.5x growth factor on realloc) |
| +736 | 4 | int32 | bvCapacity | Bitvector capacity in words (-1 = uninitialized sentinel) |
| +740 | 4 | int32 | bvAllocated | Bitvector allocated word count |
| +744 | 8 | ptr | funcContextRef2 | Second reference to function context for bitvector sizing |
Liveness Bitvector (offset 832)
The scheduler tracks register liveness via a bitvector at offset +832 (referenced only in the scheduling algorithm). Each bit represents one register; pressure is computed as popcount(live_bv). This field is part of the larger scheduling state managed by the engine and priority function.
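The pressure computation stated above is a one-line popcount over the liveness bitvector:

```python
# pressure = popcount(live_bv): each set bit is one live register.
# bin().count() is used for portability; int.bit_count() needs Python 3.10+.

def pressure(live_bv: int) -> int:
    return bin(live_bv).count("1")

# e.g. live registers {R0, R3, R5} -> bitvector 0b101001 -> pressure 3
```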
Arena Allocator (offset 840+)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +840 | ~120 | ArenaAllocator | arena | Embedded bump allocator; freed via sub_8E3A80(sched+840) at each pass end; 10 KB block granularity, 8-byte alignment |
Configuration Bitfields (offsets 1032--1098)
The region from +1032 through +1098 (~67 bytes) is a dense bitfield array set by sub_A95DC0 (SchedulingContext::configure). Individual bits control fine-grained scheduling features, gated by architecture version, optimization level, and knob queries. Key fields:
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1032 | 1 | byte | featureFlags0 | Pipeline feature enables (OR'd with 0x4F) |
| +1052 | 4 | int32 | cfgMaxDepth | Knob 449 value (default 5); scheduling depth limit |
| +1064 | 1 | byte | cfgSmFlags | Bit 0: SM-specific flag (knob 931 or arch > 16386) |
| +1072 | 8 | double | pressureCoeff | Knob 366 value (default 0.25); register pressure coefficient |
| +1080 | 1 | byte | cfgBitmask | Bits: [7] always set, [6] knob 868, [5] hot-cold, [4] knob 410, [3] knob 868 alt |
| +1084 | 4 | int32 | cfgThreshold | Knob 876 value (default 50) |
| +1088 | 1 | byte | cfgBitmask2 | Bit 3: knob 752 related |
| +1089 | 1 | byte | cfgBitmask3 | Bit 7: arch == 16387 or arch == 0x4000 |
| +1096 | 1 | byte | cfgBitmask4 | Bit 7: external flag from target descriptor +788 |
| +1097 | 1 | byte | cfgBitmask5 | Bits: [7] target+1844, [4] arch <= 16386, [3] sm_50 dual-issue, [1,0] target+788 |
| +1098 | 1 | byte | cfgBitmask6 | Bit 0: knob 462 (scheduling heuristic), Bit 5: arch == 16386 |
Architecture-Specific Defaults (offsets 1408--1584)
Set early in sub_A95DC0 based on *(a1+372) >> 12 (architecture class). Three code paths populate these fields for sm_50 era (class < 3), sm_60--sm_89 era (class == 4), and sm_90+ era (class >= 5):
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1408 | 1 | byte | archMode0 | Architecture scheduling mode flag |
| +1411 | 1 | byte | archMode1 | Scheduling sub-mode |
| +1412 | 1 | byte | archMode2 | Scheduling sub-mode |
| +1413 | 1 | byte | archMode3 | Scheduling sub-mode |
| +1414 | 1 | byte | archMode4 | Architecture mode flag |
| +1415 | 1 | byte | archMode5 | Architecture mode flag; bit 2 checked during batch depth selection |
| +1416 | 1 | byte | archMode6 | Architecture mode flag |
| +1440 | 16 | __m128i | archVector | SSE-loaded scheduling parameters (4 x int32) |
| +1452 | 4 | int32 | archWarpSize | Warp/thread configuration: 64 or 128 |
| +1456 | 4 | int32 | archDispatchSize | Dispatch slot parameter: 16, 32, or 64 |
| +1460 | 4 | int32 | archMaxThreads | Max threads per SM: 512 or 1024 |
| +1464 | 4 | int32 | archParam5 | Architecture parameter: 4 (sm_60+ only) |
| +1472 | 4 | int32 | archBlockSize | Block size parameter: 32 |
| +1480 | 8 | int64 | archSpecData | Architecture-specific encoded scheduling data |
| +1584 | 16 | __m128i | archProfile | SSE-loaded architecture profile vector |
Memory Layout Diagram
SchedulerContext (~1600 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0 vtable |+8 funcContext |+16 allocator |+24 (padding) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 (padding) |+40 preHookVtable |+48 regPressureCounters[0..9] |
+--------+--------+--------+--------+--------+--------+--------+--------+
| ...counters... |+60 mode |+64..84 |+88 maxBBDepth |+92 maxBBDpthNT |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+96..175 (internal state) |+176 active|+178 rrMode| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+240 phase |+248 regBV1 |+256 regBV2 | |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+280 analysisSlots |+292 valid|+296 tgtPri|+300 tgtSec| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+316 minR|+320 slack|+324 commit|+328 dualIss| ... |+380 latCut| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+384 tex |+388 maxBB|+392 alloc | |+404 stall|+408 thresh|+412 batch|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+416 xtraReg|+420 spillCnt| |+432 budget|+440..456 livenessBV |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+464 depth | |+480 cycle|+484 act|+485 dirty| |+523 hcE|+524 yld|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+532 hcBudget| |+604 archP1| |+616 archP2| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+672 resourceSlots |+680 depData | ...bitvector mgr... |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+720 arenaRef |+728 bvBuf |+736 cap |+740 alloc|+744 funcRef|
+---------+-------+---------+--------+---------+-------+--------+--------+
| ...gap / internal state... |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+832 liveness bitvector ref | |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+840 ArenaAllocator (embedded sub-object, ~120 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+960..1031 (internal/padding) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1032..1098 configuration bitfield array (~67 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1099..1407 (internal state, ~308 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1408..1416 architecture mode flags (9 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1440 archVector (16B) |+1452..1484 arch params |+1584 archProfile (16B)|
+---------+-------+---------+--------+---------+-------+--------+--------+
Function Map
| Address | Size | Identity |
|---|---|---|
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup and knob query |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode engine |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG |
| sub_68B9C0 | 46 KB | DependencyGraphBuilder -- full DAG construction |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges |
| sub_896D50 | 90 KB | InitOpcodeTable -- ROT13 SASS mnemonic table |
| sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode latency table |
| sub_8BF890 | 929 B | AllocDynBatchData -- DynBatch context allocation |
| sub_8C1BA0 | 6.3 KB | InitDynBatchState -- batch initialization |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree reordering |
| sub_8C9320 | 47 KB | ComputePriority -- multi-criteria heuristic |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop |
| sub_8CD6E0 | 1.3 KB | ReverseSchedule -- reverse post-order BBs |
| sub_8CE520 | 12 KB | RegisterBudgetCurve -- piecewise linear model |
| sub_8CEE80 | 8.7 KB | ComputeRegisterBudget -- occupancy-aware |
| sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scheduling DAG |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW edges |
| sub_8E3970 | ~53 B | ArenaAlloc -- bump allocator |
| sub_8E3A80 | ~22 lines | ArenaFreeAll -- release all blocks |
| sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params |
| sub_8E5CA0 | 20 KB | MasterHWProfileBuilder -- latency/throughput |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output |
| sub_8F6530 | 13 KB | OutputCompleteSchedule -- final output assembly |
| sub_A95DC0 | 35 KB | SchedulingContext::configure -- knob loading |
| sub_A97600 | 42 KB | PostSchedulePass::runOnFunction |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling |
Cross-References
- Scheduling Algorithm -- priority list scheduling internals, ready list management, backtracking
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- register allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
Priority List Scheduling Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The scheduling engine implements a classical priority list scheduling algorithm extended with GPU-specific heuristics for register pressure management, functional unit contention avoidance, yield hint generation, and barrier-aware instruction ordering. A single unified engine (sub_688DD0, 20 KB) serves all three scheduling phases -- ReduceReg, ILP/Latency, and DynBatch -- differentiated only by a mode byte that selects different priority weight configurations. The algorithm iterates basic blocks, builds a ready list of zero-dependency instructions, selects the highest-priority candidate via an 8-bit packed heuristic, emits it into the final schedule, updates the dependency DAG, and repeats until all instructions in the block are placed.
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized core loop |
| Priority function | sub_8C9320 (47 KB, ~1300 lines) -- 8-bit packed heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency pre-scan | sub_8CF880 (28 KB) -- reverse BB iteration |
| Edge builder | sub_8D9930 (19 KB) -- RAW/WAR/WAW/memory/barrier edges |
| Instruction mover | sub_925510 (341 bytes) -- doubly-linked list relink |
| Resource tracker | sub_A09530 (365 bytes) -- per-instruction stall update |
| Stall/barrier encoder | sub_8D7760 (41 KB) -- control word generation |
| Alternative loop | sub_68B9C0 (46 KB) -- combined DAG + scheduling |
| BB size limit | 4095 instructions (split via sub_931920) |
| Large function limit | 16383 instructions (chunk-based via sub_A9DDD0) |
Core Algorithm
The unified scheduling engine executes the following sequence for each basic block. All three phases (ReduceReg mode 0x39, ILP mode 0x49, DynBatch mode 0x41) follow this identical structure; only the priority weight selection differs.
function ScheduleEngine(sched, mode, arg3, rebuild):
if rebuild:
InitScheduleRegion(sched) // sub_6833F0
// Allocates 72-byte per-BB records
// Queries knobs 595, 743, 747
// Calls sub_7E5120 for instruction characterization
for each bb in sched.basic_blocks: // 72-byte stride
InitResourceTracking(sched, bb) // sub_A091C0
// Zeroes 40-byte resource records (one per register class + 1)
BuildReadyList(sched) // sub_6820B0
while ready_list is not empty:
best = SelectHighestPriority(sched) // via priority vtable
UnlinkFromReadyList(sched, best) // sub_682200
MoveInstruction(sched, best, ref) // sub_925510
UpdateStallCycles(sched, best) // sub_A09530
UpdateWARTracking(sched, best) // sub_A09D40
for each successor of best:
successor.dep_count -= 1
if successor.dep_count == 0:
ready_list.insert(successor)
// Sorted insertion using priority value
PostBBCleanup(sched, bb) // sub_BDC200 / sub_BDCDE0
The outer loop iterates basic blocks via an array of 72-byte records (v112 = 72 * bb_index). The inner loop is a standard worklist algorithm: remove the highest-priority ready instruction, schedule it, and propagate readiness to its successors in the dependency DAG by decrementing their predecessor counts.
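The worklist structure above can be condensed into a short sketch. This is an illustration only -- `schedule_block` and its dict-based DAG are stand-ins for the recovered 72-byte BB records and SchedNode fields, not ptxas identifiers:

```python
# Illustrative priority-list scheduler: the ready list is kept sorted so
# that its head is always the highest-priority candidate, mirroring the
# sorted-insertion ready list described above.
def schedule_block(nodes, succs, priority):
    """nodes: ids in program order; succs: id -> list of dependent ids;
    priority: id -> int (stands in for the packed heuristic byte)."""
    dep_count = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs.get(n, []):
            dep_count[s] += 1
    # Seed with zero-dependency instructions, highest priority first.
    ready = sorted((n for n in nodes if dep_count[n] == 0),
                   key=priority.get, reverse=True)
    order = []
    while ready:
        best = ready.pop(0)                 # head = highest priority
        order.append(best)
        for s in succs.get(best, []):
            dep_count[s] -= 1
            if dep_count[s] == 0:           # became ready: sorted insert
                i = 0
                while i < len(ready) and priority[ready[i]] >= priority[s]:
                    i += 1
                ready.insert(i, s)
    return order
```

Because sort order is maintained at insertion time, selection is O(1) at the head of the list, which is the same trade-off the recovered engine makes.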
Mode Selection
The mode byte stored at *(DWORD*)(scheduler+60) controls which priority weight set the engine uses:
| Mode | Value | Callback | Objective |
|---|---|---|---|
| ReduceReg | 1 | 0x39 | Minimize register pressure. Prioritizes instructions that release registers (last-use operands). |
| ILP/Latency | 0 | 0x49 | Maximize instruction-level parallelism. Prioritizes critical-path and long-latency instructions. |
| DynBatch | 2 | 0x41 | Batch-aware tensor scheduling. Groups GMMA/WGMMA operations for warpgroup cooperation. |
The engine uses vtable dispatch at *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks. This allows each mode to inject custom behavior at scheduling boundaries without modifying the core loop.
Ready List Construction
sub_6820B0 (1.5 KB) scans the instruction linked list and collects every instruction with zero unsatisfied dependencies into a sorted ready list.
function BuildReadyList(sched):
for instr in sched.instruction_list: // linked list at sched[20]
if instr.opcode == 52: // NOP / BB boundary marker
continue // follow to real instruction
metadata = *(QWORD*)(instr + 40) // SchedNode pointer
if *(DWORD*)(metadata + 8) == 0: // depCount == 0
*(QWORD*)(metadata + 16) = sched[5] // link to current head
sched[5] = instr // new head
vtable_callback(sched, 104, instr) // insertion hook
*(DWORD*)(metadata + 28) = 0 // reset latency counter
The ready list is a singly-linked list threaded through SchedNode offset +16 (the nextReady field). Sort order is maintained at insertion time by the priority function -- each new instruction is inserted at its correct position so that the head of the list is always the highest-priority candidate. All metadata+N offsets throughout the scheduling pages refer to fields within the SchedNode block pointed to by instr+40 (sched_slot), not offsets from the instruction object itself. See the SchedNode layout for the complete field map.
Opcode 52 instructions are phantom BB boundary markers. The builder skips them but follows their linked-list successors to reach real instructions beyond the boundary.
The vtable+104 callback provides a polymorphic insertion hook. Different scheduling strategies can override this to implement custom ready-list policies (e.g., the dual-issue scheduler uses it to pair co-issuable instructions).
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority as an integer with 8-bit packed fields. Each bit encodes a different heuristic criterion. Because priority comparison reduces to a single integer comparison, the ready list maintains sort order without multi-key sorting overhead.
8-Bit Priority Encoding
| Bit | Name | Meaning | Notes |
|---|---|---|---|
| 7 (MSB) | yield-related | Instruction is near a yield boundary | Higher priority ensures yield hints align with scheduling boundaries |
| 6 | yield flag | Instruction triggers or participates in a yield sequence | Controls warp scheduler round-robin interaction |
| 5 | hot-cold | Memory access temperature (1 = hot, 0 = cold) | Hot = global/texture/surface loads with long latencies; cold = constant/shared |
| 4 | hot-cold / pressure | Packed byte holds hot-cold flag; pressure overflow acts as comparison override | See Priority Function Internals for the dual mechanism |
| 3 | same-BB preference | Instruction belongs to the currently-scheduled BB | Discourages cross-BB instruction motion |
| 2 | stall-free | Scheduling this instruction introduces zero stall cycles | All producer latencies have completed |
| 1 | latency-bound | Instruction is on the DAG critical path | Prioritizes latency-sensitive dependency chains |
| 0 (LSB) | tiebreaker | Additional ordering factor | Encodes instruction position, operand count, or FU preference |
Higher numeric value = higher priority. Bit 7 is the most significant criterion: yield-boundary instructions always schedule before non-yield instructions when both are ready.
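The dominance of the high bits falls directly out of integer comparison, which a small sketch makes concrete (`pack_priority` is a hypothetical helper; the field order follows the table above):

```python
# Pack the eight heuristic bits into one byte, MSB (bit 7) first.
# A sketch of the encoding in the table above, not the recovered code.
def pack_priority(yield_related, yield_flag, hot_cold, hot_pressure,
                  same_bb, stall_free, latency_bound, tiebreaker):
    bits = [yield_related, yield_flag, hot_cold, hot_pressure,
            same_bb, stall_free, latency_bound, tiebreaker]
    p = 0
    for b in bits:                  # shift in bit 7 down to bit 0
        p = (p << 1) | (1 if b else 0)
    return p

# A yield-boundary instruction outranks any non-yield candidate,
# because bit 7 dominates the single integer comparison.
a = pack_priority(1, 0, 0, 0, 0, 0, 0, 0)   # only bit 7 set -> 0x80
b = pack_priority(0, 1, 1, 1, 1, 1, 1, 1)   # all lower bits  -> 0x7F
```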
Hot-Cold Classification
The hot-cold flag (bit 5) classifies memory operations by expected latency:
- Hot (bit 5 = 1, higher priority): global memory loads (LDG), texture fetches (TEX, TLD), surface operations. These have high latency (hundreds of cycles) and benefit most from early scheduling to overlap with computation. Detected by sub_A9CDE0.
- Cold (bit 5 = 0, lower priority): constant memory loads (LDC), shared memory operations (LDS, STS). These have low latency and do not need early scheduling. Detected by sub_A9CF90, which also suppresses the pressure-overflow and critical-path extension signals.
Classification uses sub_A9CDE0 (hot detection) and sub_A9CF90 (cold detection). Memory space type is determined by sub_693BC0, which returns space codes: 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant, 1 = generic. Hot-cold tracking is gated by scheduler+523 and scheduler+532; when the hot-cold budget (scheduler+532) reaches zero, the feature deactivates for the remainder of the BB.
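The space-code split can be sketched as a lookup over the codes recovered from sub_693BC0. Note this is an approximation: `classify`, `HOT_SPACES`, and `COLD_SPACES` are hypothetical names, and texture fetches (which are hot) are identified by opcode rather than by a space code, so they are not captured here:

```python
# Memory-space codes recovered from sub_693BC0.
SPACE = {"generic": 1, "local": 2, "shared": 3, "constant": 7,
         "surface": 11, "global": 16}

HOT_SPACES = {SPACE["global"], SPACE["surface"]}    # long-latency ops
COLD_SPACES = {SPACE["constant"], SPACE["shared"]}  # short-latency ops

def classify(space_code):
    """Return 'hot', 'cold', or 'neutral' -- a sketch of the
    sub_A9CDE0 / sub_A9CF90 split, not their exact logic."""
    if space_code in HOT_SPACES:
        return "hot"
    if space_code in COLD_SPACES:
        return "cold"
    return "neutral"
```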
Pressure Overflow
Bit 4 in the packed byte holds the hot-cold flag (see above). The pressure overflow signal is a separate Boolean computed by checking all four register classes (GPR, predicate, address, UGP) against their respective limits. When any class exceeds its budget, the pressure overflow flag activates and acts as a comparison override: the candidate wins regardless of the packed priority byte, forcing the scheduler to select the instruction that relieves register pressure. This is the primary mechanism by which the ReduceReg phase achieves its objective: the mode sets a tight register budget via scheduler+178, causing pressure overflow to activate frequently and driving the scheduler toward pressure-reducing orderings. See the Priority Function Internals section for the exact per-class threshold checks.
Priority Evaluation Sequence
The priority function evaluates criteria in this order for each candidate instruction:
1. sub_8C7290: extract 4-class register deltas, same-BB flag, and per-BB resource vector (SSE-optimized)
2. Compute yield saturation: check write-barrier counters for predicate, GPR, and UGP register classes against their ceilings (7, 7, and target_desc+624 respectively)
3. sub_8C67A0: compute per-instruction resource cost if BB slot not yet committed
4. sub_8C7120: update barrier tracking state (if metadata+111 bit 7 set)
5. Evaluate register pressure: compute per-class overflow against budget (scheduler+432) and per-class limits; derive pressure-overflow Boolean
6. Evaluate stall-free: compare earliest cycle (metadata+32) vs current cycle (scheduler+480)
7. Evaluate critical path: compare barrier-target count vs depth threshold (scheduler+464)
8. Evaluate yield bits: opcode 39 (yield-related) and opcode 96 (yield flag from scheduler+524)
9. Pack 8 bits into priority byte
10. Evaluate hot/cold: sub_A9CDE0 / sub_A9CF90 (only when scheduler+523 active)
11. Multi-stage comparison against running best: resource vectors, then XOR-based bit scan, then secondary tiebreakers
The function scans the full ready list in a single pass (not limited by knob 770 for the scan itself). Knob 770 (priority queue depth, default 4) controls the depth threshold mechanism for critical-path activation, not the number of candidates evaluated.
Key Internal Variables
| Variable | Source | Content |
|---|---|---|
| budget_hw | sub_6818D0(scheduler, scheduler[432] - scheduler[412]) | Register budget in HW register units |
| reduced_hw | sub_6818D0(scheduler, budget - budget/16) | Tighter budget for critical-path threshold (or knob 760 override) |
| queue_depth | knob 770 | Depth threshold parameter (default 4); controls critical-path activation |
| per_bb_flag | knob 769 | Per-BB scheduling flag; when set, resets yield state between BBs |
| scheduler+420 | state | Spill-mode countdown; when > 0, forces aggressive scheduling with bit 1 = 1 |
| scheduler+464 | state | Depth threshold -- number of barrier targets that must be ready before critical-path activates |
| scheduler+480 | state | Current scheduling cycle; used for stall-free evaluation |
| scheduler+523 | state | Hot-cold tracking enable flag; gated by knob |
| scheduler+524 | state | Current yield state; propagated to CONTROL instructions via bit 6 |
| scheduler+532 | state | Hot-cold budget counter; decremented per cold instruction, disables tracking at zero |
| scheduler+672 | allocation | Per-BB resource cost table (84 bytes per slot) |
Support Subroutines
| Address | Size | Purpose |
|---|---|---|
| sub_8C67A0 | 3.7 KB | Compute per-instruction resource cost. Calls sub_A08A00 (resource model) three times: mode 1 = instruction's own cost, mode 2 = operand release cost, mode 3 = combined BB-level impact. Uses SSE _mm_add_epi32 for vector accumulation. |
| sub_8C7290 | 5.1 KB | Copy 10-element int32 resource vector from per-BB table at scheduler+672. SSE _mm_loadu_si128 bulk copy. Special case: opcode 97 (STG in ROT13; used as control/boundary marker) returns base scheduler state with zeroed deltas. |
| sub_8C7720 | 20 KB | Red-black tree operations for instruction reordering within BB. Maintains a balanced BST of scheduling candidates for O(log N) insertion, removal, and priority update. |
| sub_8C7120 | -- | Barrier tracking state update. |
| sub_693BC0 | -- | Memory space classification and latency query. |
| sub_6818D0 | -- | Register count to hardware-aligned unit conversion. |
Priority Function Internals
The full logic of sub_8C9320 divides into three phases: (1) pre-scan the ready list to collect aggregate BB statistics, (2) iterate the ready list a second time evaluating each candidate and maintaining a running best, and (3) update scheduler state and return the winner. The function signature is (scheduler, &second_best) -> best_instruction.
Phase 1: Pre-Scan Statistics
Before priority evaluation begins, the function iterates the entire ready list (linked via metadata+16) and accumulates per-BB statistics that feed into the per-instruction priority decisions:
| Variable | Init | Accumulation | Meaning |
|---|---|---|---|
| shared_mem_count | 0 | ++ when opcode 183 and sub_693BC0 returns space 3 | Count of shared-memory operations in ready list |
| neg_reg_deficit | 0 | += delta when register delta < 0 | Total register pressure reduction from ready instructions |
| max_dep_cycle | -1 | max(current, metadata+92) | Highest dependency cycle among all ready instructions |
| max_pred_cycle | 0 | max(current, metadata+88) | Highest predecessor cycle among all ready instructions |
| barrier_count | 0 | ++ when metadata+108 & 1 | Count of barrier-target instructions in ready list |
| dep_flag_count | 0 | ++ when metadata+108 & 2 | Count of instructions with dependency-set flag |
| pos_pressure_sum | 0 | += delta when register delta > 0 | Total register pressure increase from ready instructions |
| filtered_pressure | 0 | += delta when within depth threshold | Pressure increase from depth-eligible instructions |
| max_barrier_slot | -1 | max(current, metadata+24) for barrier targets | Latest BB slot among barrier-target instructions |
| min_barrier_latency | 99999 | min(current, metadata+28) for barrier targets | Shortest latency counter among barrier-target instructions |
| max_nonbarrier_cycle | -1 | max(current, metadata+32) for non-barrier | Latest earliest-available-cycle for non-barrier instructions |
| any_stall_free | 0 | \|= (metadata+32 >= 0) | Whether any instruction can issue without stalling |
| total_ready | 0 | ++ for every instruction | Total instructions in ready list |
| preferred_instr | NULL | non-barrier instr with max metadata+24 | The program-order-latest non-barrier instruction |
The pre-scan also maintains a depth-threshold table: an array of up to 32 barrier-target instruction pointers sorted by their latency counter (metadata+28). This table is scanned to compute scheduler+464 (depth threshold) and scheduler+380 (latency cutoff), which control when the critical-path bit activates.
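A subset of the accumulators can be sketched as a single pass over the ready list. Here SchedNode fields are modeled as dict keys (`reg_delta`, `dep_cycle`, `flags`, `latency` are stand-ins for metadata offsets, and `prescan` is a hypothetical name):

```python
# Pre-scan sketch: one pass over the ready list accumulating BB-level
# statistics that mirror a subset of the table above.
def prescan(ready):
    s = {"neg_reg_deficit": 0, "pos_pressure_sum": 0,
         "max_dep_cycle": -1, "barrier_count": 0, "total_ready": 0,
         "min_barrier_latency": 99999}
    for n in ready:
        s["total_ready"] += 1
        d = n["reg_delta"]
        if d < 0:
            s["neg_reg_deficit"] += d       # pressure relief available
        elif d > 0:
            s["pos_pressure_sum"] += d      # pressure growth pending
        s["max_dep_cycle"] = max(s["max_dep_cycle"], n["dep_cycle"])
        if n["flags"] & 1:                  # barrier-target flag
            s["barrier_count"] += 1
            s["min_barrier_latency"] = min(s["min_barrier_latency"],
                                           n["latency"])
    return s
```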
Phase 2: Register Budget Prologue
Before the main loop, the function computes two register budgets from scheduler+432 (target register count):
budget_base = scheduler[432] - scheduler[412] // target minus committed
if ReduceReg_mode (scheduler+178): // ReduceReg tightens budget
if scheduler[416] < 0:
budget_base -= (scheduler[432] / 8) + 3 // reduce by ~12.5% + 3
else:
budget_base -= scheduler[416] // explicit reduction
budget_hw = RegToHWUnits(scheduler, budget_base) // sub_6818D0
reduced_hw = RegToHWUnits(scheduler, budget_base - budget_base/16)
// ~6.25% tighter
if knob_760_active:
reduced_hw = RegToHWUnits(scheduler, budget_base - knob_760_value)
queue_depth = 4 // default
if knob_770_active:
queue_depth = knob_770_value // override
budget_hw sets the threshold for bit 4 (pressure overflow). reduced_hw provides a tighter threshold used in the critical-path assessment. queue_depth (knob 770) parameterizes the depth-threshold mechanism that gates critical-path activation; it does not cap the number of candidates evaluated -- the full ready list is still scanned on every pass.
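A worked instance of the budget prologue, with `compute_budgets` as a hypothetical helper and `to_hw` standing in for sub_6818D0 (modeled here as the identity, since the real register-to-HW-unit rounding is target-specific):

```python
# Sketch of the register budget prologue above.
def compute_budgets(target, committed, reduce_reg, explicit_reduction,
                    knob_760=None, to_hw=lambda r: r):
    """target ~ scheduler[432], committed ~ scheduler[412],
    explicit_reduction ~ scheduler[416]."""
    base = target - committed
    if reduce_reg:                          # ReduceReg tightens budget
        if explicit_reduction < 0:
            base -= target // 8 + 3         # ~12.5% of target, plus 3
        else:
            base -= explicit_reduction
    budget_hw = to_hw(base)
    if knob_760 is not None:                # knob 760 override
        reduced_hw = to_hw(base - knob_760)
    else:
        reduced_hw = to_hw(base - base // 16)   # ~6.25% tighter
    return budget_hw, reduced_hw
```

For example, a 64-register target with 8 registers committed under ReduceReg (no explicit reduction) yields base 56 - 11 = 45, and a critical-path threshold of 43.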
Phase 3: Per-Bit Computation
For each instruction in the ready list, sub_8C7290 extracts its per-register-class deltas (4 classes: GPR, predicate, address, UGP) and the same-BB flag. Then each priority bit is computed:
Bit 7 -- Yield-related. Determined by opcode. Only opcode 39 (YIELD instruction variant) can set this bit. The condition checks the last operand's low 2 bits:
if opcode_masked == 39:
operand_index = operand_count - 1 - ((opcode >> 11) & 2)
yield_related = (instr[84 + 8*operand_index] & 3) == 0
else:
yield_related = 0
When set, the instruction is a yield boundary marker and receives absolute highest priority regardless of all other heuristics.
Bit 6 -- Yield flag. Set only for opcode 96 (CONTROL instruction):
if opcode_masked == 96:
yield_flag = scheduler[524] // current yield state
else:
yield_flag = 0
// Post-adjustment: suppress when hot/pressure bits dominate
if (bit5_set || bit4_set):
yield_flag = 0
if metadata[32] < scheduler[480]: // behind schedule
yield_flag = scheduler[396] ? original_yield : 0
The yield flag propagates the scheduler's warp yield state only through CONTROL instructions, ensuring yield hints align with scheduling barriers.
Bit 5 -- Hot-cold classification. Requires hot-cold tracking to be active (scheduler+523 set, gated by scheduler+532 > 0):
if hot_cold_active:
is_hot = sub_A9CDE0(target_desc, context, instruction)
else:
is_hot = 0
// Cold detection suppresses priority
if sub_A9CF90(target_desc, context, instruction): // is_cold?
pressure_overflow = 0 // suppress bit 4
critical_extension = 0 // suppress lookahead
sub_A9CDE0 returns true for global memory loads (LDG), texture fetches (TEX, TLD), and surface operations -- instructions with latencies in the hundreds of cycles. sub_A9CF90 returns true for constant loads (LDC), shared memory operations (LDS/STS) -- low-latency operations. Hot instructions (bit 5 = 1) get higher priority to schedule early and overlap their long latencies with computation. Cold instructions (bit 5 = 0) are deprioritized.
Bit 4 -- Pressure overflow. This bit does NOT appear directly in the initial packing as a single variable. Instead, the pressure overflow signal (v81 in decompiled source) feeds into the candidate comparison logic as an override. The mechanism:
// For barrier-target instructions:
budget_in_units = RegToHWUnits(scheduler, scheduler[432])
headroom = RegToHWUnits(scheduler, 8)
if budget_in_units > headroom + scheduler[72]: // plenty of headroom
pressure_overflow = 0
elif latency_counter > min_barrier_latency + 9: // far from ready
pressure_overflow = 0
else:
// Check all 4 register classes against their limits:
overflow = false
overflow |= (scheduler[72] + gpr_delta > budget_hw)
overflow |= (scheduler[68] + pred_delta > 7)
overflow |= (scheduler[56] + addr_delta > 7)
overflow |= (scheduler[60] + ugp_delta >= target_desc[624])
pressure_overflow = overflow
When pressure_overflow = 1, the candidate wins the comparison regardless of other bits -- it is the scheduler's mechanism for emergency register pressure relief. In the packed byte's bit 4 position, the hot-cold flag occupies the slot. The pressure overflow signal operates at a higher level: it can force the candidate to win even when its packed priority byte is lower.
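The four-class check reduces to a short predicate. This sketch uses hypothetical names; `ugp_limit` defaults to a placeholder for the per-target value read from target_desc+624, and the headroom/latency early-outs above are omitted:

```python
# Sketch of the 4-class pressure-overflow test above.
def pressure_overflow(live, delta, budget_hw,
                      pred_limit=7, addr_limit=7, ugp_limit=64):
    """live, delta: (gpr, pred, addr, ugp) current counts and
    per-candidate deltas; budget_hw ~ the GPR budget in HW units."""
    gpr, pred, addr, ugp = (l + d for l, d in zip(live, delta))
    return (gpr > budget_hw or        # scheduler[72] + gpr_delta
            pred > pred_limit or      # scheduler[68] + pred_delta
            addr > addr_limit or      # scheduler[56] + addr_delta
            ugp >= ugp_limit)         # scheduler[60] + ugp_delta
```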
Bit 3 -- Same-BB preference. Output parameter from sub_8C7290:
same_bb = sub_8C7290.output_param_5 // boolean from resource copy
Set when the instruction belongs to the currently-scheduled basic block. Instructions imported from other BBs by global scheduling get same_bb = 0, reducing their priority relative to local instructions.
Bit 2 -- Stall-free. Computed from the earliest-available-cycle field:
if countdown_active (scheduler[420] != 0):
if metadata[32] < scheduler[480] AND instr != preferred_instr:
stall_free = 0
if pressure_plus_reg_sum > 0:
goto full_evaluation // positive pressure = needs analysis
else:
stall_free = 1
else:
// Normal mode: stall-free when producers have completed
if metadata[32] >= scheduler[480]:
stall_free = 1
elif instr == preferred_instr:
stall_free = 1
else:
stall_free = 0
metadata+32 is the instruction's earliest available cycle -- the latest completion time among all its producer instructions. scheduler+480 is the current scheduling cycle. When earliest >= current, all producers have retired and the instruction can issue with zero pipeline stalls.
Bit 1 -- Critical-path / latency-bound. Complex multi-path computation:
if countdown_active (scheduler[420] != 0):
// Spill mode: almost always critical
if !(barrier_bits_set_in_priority):
if slot_limit_exceeded:
critical = 1
else:
critical = !(pressure_sum <= 0 && max_reg_class == 0)
else:
critical = 0
else:
// Normal mode: depth threshold comparison
if barrier_count >= scheduler[464]:
critical = 1 // enough barriers ready -> critical path active
else:
critical = 0
In spill mode (active when scheduler+420 > 0), the critical-path bit is set for nearly all instructions to maximize scheduling throughput. In normal mode, it activates when the number of barrier-target instructions in the ready list meets or exceeds the depth threshold computed during the pre-scan, indicating that the scheduler is processing a latency-critical dependency chain.
Bit 0 -- Tiebreaker (barrier-target). Read directly from instruction metadata:
tiebreaker = metadata[108] & 1 // barrier-target flag
Barrier-target instructions (those waiting on a hardware barrier) get bit 0 = 1. Since this is the lowest-priority bit, it only affects ordering when all higher bits are identical. Scheduling barrier targets promptly allows the barrier resource to be retired sooner, freeing scoreboard entries for other instructions.
Packed Byte Assembly
The 8 bits are packed into a single byte using shift-and-mask arithmetic:
priority = (yield_related << 7) // bit 7
| (yield_flag << 6) & 0x7F // bit 6
| (hot_cold << 5) & 0x3F // bit 5 (initially yield copy)
| (hot_flag << 4) & 0x3F // bit 4
| (same_bb << 3) & 0x0F // bit 3
| (stall_free << 2) & 0x0F // bit 2
| (critical_path << 1) & 0x03 // bit 1
| (tiebreaker << 0) & 0x03 // bit 0
Because every input is a 0/1 Boolean, each shifted term contributes a single bit; the & 0xNN masks are byte-truncation residue of the decompiled arithmetic and do not alter the result. In the initial packing, bit 5 and bit 6 both derive from the yield variable; the hot-cold flag (sub_A9CDE0 result) overwrites bit 5 in subsequent repackings that occur during the spill-mode and comparison paths.
Candidate Comparison
The comparison between the current candidate and the running best is NOT a simple integer comparison of the packed bytes. The function performs a multi-stage refinement:
1. Resource vector comparison: If knob-gated architecture checks pass (SM index > 5 at context+1704), a 4-tuple lexicographic comparison of per-register-class resource vectors occurs first. The four classes are compared in order: GPR delta, predicate delta, address delta, UGP delta. The first class that differs determines the winner.
2. Priority byte XOR scan: When resource vectors are equal, the function XORs the current and best packed bytes and checks differing bits in this order:
   - Bit 4 (0x10) -- pressure: winner has bit 4 set (higher pressure need)
   - Bit 6 (0x40) -- yield: winner has bit 6 set (yield participation)
   - Bit 1 (0x02) -- critical: winner has bit 1 set
   - Bit 2 (0x04) -- stall-free: winner has bit 2 set
   - Bit 5 (0x20) -- hot-cold: winner has bit 5 set (hot memory op)
3. Secondary tiebreakers (when all checked bits match):
   - Barrier group index (v213 vs v253)
   - Latency counter comparison (v223 vs v248)
   - Bit 7 yield-related (only when shared-memory count > 0)
   - Contention score (a derived value incorporating register overflow penalty: contention + 2 * RegToHWUnits(pressure_delta) - pressure_sum_sign)
   - Slot manager cycles (scheduling cost estimate from sub_682490)
   - Earliest available cycle (metadata+32)
   - Dependency cycle (metadata+92)
   - Latest deadline (metadata+40)
   - Register delta magnitude
4. Positional fallback: When all heuristic comparisons are tied, the instruction with the higher BB slot (metadata+24) wins, preserving original program order.
The multi-stage comparison explains why the packed byte uses non-obvious bit ordering. Bits 4, 6, 1, 2, 5 are checked before bit 7 in the refinement path, even though bit 7 is the MSB. The packed byte enables fast ready-list insertion sort (integer comparison), while the full comparison function provides nuanced selection for the actual scheduling decision.
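The XOR-scan stage can be sketched directly from the recovered check order (`refine_winner` and `REFINE_ORDER` are hypothetical names; the resource-vector stage and secondary tiebreakers are omitted):

```python
# Refinement-order comparison: when two packed priority bytes differ,
# the differing bits are examined in the recovered order
# 0x10, 0x40, 0x02, 0x04, 0x20 -- not in numeric bit order.
REFINE_ORDER = (0x10, 0x40, 0x02, 0x04, 0x20)

def refine_winner(cand, best):
    """True if cand beats best under the bit-scan rules; None if the
    scanned bits do not decide (fall through to secondary tiebreakers)."""
    diff = cand ^ best
    for mask in REFINE_ORDER:
        if diff & mask:
            return bool(cand & mask)   # winner is whoever has the bit set
    return None
```

For instance, a candidate with only the hot-cold bit (0x20) loses to a running best with the pressure bit (0x10), even though 0x20 > 0x10 numerically.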
Scheduler State Updates
After selecting the best candidate, the function updates scheduler state:
// Spill mode countdown
if winner is barrier-target:
scheduler[420] = computed_countdown - 1
scheduler[396] -= 1 // spill sequence counter
if metadata[32] >= 0:
scheduler[400] -= 1 // stall-free counter
if stall_free_count==0 AND remaining>0 AND countdown>1:
scheduler[420] = 0 // force exit spill mode
scheduler[464] = -1 // reset depth threshold
else:
// Non-barrier winner in countdown mode
if !(barrier_bits in priority) AND slot_cost within budget:
// do nothing, continue countdown
else:
scheduler[420] = 0 // exit spill mode
scheduler[464] = -1 // reset depth threshold
// Slot manager update (when winner has positive scheduling cost)
if best_cost > 0 AND slotManager[76] > 0:
if slotManager[140]:
slotManager[28] += slotManager[44] // advance base
slotManager[76] = 0 // reset count
slotManager[80] = NULL // reset anchor
best.metadata[28] = sub_682490(...) // recompute latency
// Hot-cold counter update
if hot_cold_active AND winner is cold (sub_A9CF90 returns true):
scheduler[532] -= 1 // decrement hot-cold budget
elif hot_flag was set for winner:
scheduler[523] = 0 // disable hot-cold tracking
The function returns the best instruction pointer and writes the second-best to *a2 for lookahead scheduling.
Dependency DAG Construction
The dependency graph is built in two stages before the scheduling loop begins. The DAG is a directed acyclic graph where nodes are instructions and edges represent ordering constraints with associated latency values.
Stage 1: Pre-Scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order (bb[N-1] to bb[0]) using the BB ordering array at func+512.
For each BB:
- Check knobs 314/313 for per-BB scheduling skip flags
- Walk the instruction linked list, identifying NOP/control instructions
- Set bb->next pointers and configure BB scheduling state
- Delegate to Stage 2 (sub_8D9930) for edge construction
- Manage memory arenas with SSE-optimized copies for metadata arrays
Contains approximately 14 nested loops for edge construction. The reverse iteration order ensures that when the scheduler processes a BB, all of its successors have already been characterized.
Stage 2: Edge Construction (sub_8D9930, 19 KB)
For each pair of instructions within a BB, checks for five dependency types:
| Type | Abbreviation | Condition | Edge Latency |
|---|---|---|---|
| True | RAW | Read-after-write on same register | Producer's pipeline latency |
| Anti | WAR | Write-after-read on same register | 0 (ordering constraint only) |
| Output | WAW | Write-after-write on same register | 1 (minimum separation) |
| Memory | -- | Store before load to same memory space | Conservative; full ordering |
| Barrier | -- | Instruction depends on barrier/sync result | Barrier completion latency |
Operand analysis is performed by sub_894290 (27 KB), which processes 16-bit operand descriptors encoding:
| Bits | Field |
|---|---|
| 12--15 | Register class |
| 8--11 | Bank number |
| 0--7 | Dependency type |
Memory dependencies are conservative: all stores are ordered before subsequent loads to the same memory space. The scheduler does not perform alias analysis -- it relies on the memory space classification from sub_693BC0 to determine whether two operations might conflict.
Supplementary Dependency Builders
These functions handle specific aspects of dependency construction in the 0x680000--0x6B0000 range:
| Address | Size | Purpose |
|---|---|---|
sub_68A690 | 31 KB | BuildDependencies -- walks instruction lists and creates producer-consumer dependency edges from def-use chains |
sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level data dependency edges |
sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints (ordering edges between memory operations even without explicit data deps) |
sub_6A78F0 | 23 KB | ProcessOperands -- iterates operand arrays at instruction +84, extracts register file pressure and dependency distance information |
Instruction Emission
sub_925510 (341 bytes, 57 lines) is the universal instruction relocation primitive. It moves an instruction to a new position in the doubly-linked instruction list.
function MoveInstruction(block, instr, insert_before):
// 1. Unlink from current position, remembering the old neighbors
old_prev = instr.prev
old_next = instr.next
old_prev.next = old_next
old_next.prev = old_prev
// 2. Update block boundaries using the pre-move neighbors
if instr was block.head (block+272):
block.head = old_next
if instr was block.tail (block+280):
block.tail = old_prev
// 3. Insert before the reference instruction
instr.next = insert_before
instr.prev = insert_before.prev
insert_before.prev.next = instr
insert_before.prev = instr
// 4. Notify subsystems
UpdateDependencyGraph(block, instr) // sub_7EEC10
UpdateBlockTimestamp(block) // sub_7DDCA0
This function has 13 callers across the codebase. It serves as the shared instruction movement primitive for the scheduler, register allocator, and peephole optimizer.
Resource Tracking
The scheduler maintains 10 functional unit resource counters per basic block, tracking pipeline utilization to avoid saturating any single execution unit.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per functional unit |
| 40--76 | 10 x int32 | Resource pressure delta (change since last step) |
| 80 | int32 | BB-entered flag and auxiliary bits |
Functional Unit Pipes
| Index | Pipe | Typical Instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS (overlaps with pipe 4 for LDS/STS) |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
Resource Tracking Helpers
| Address | Size | Purpose |
|---|---|---|
sub_A091C0 | -- | Initialize per-BB resource arrays to zero |
sub_A09530 | 365 bytes | Update stall cycle counters after scheduling an instruction. Decrements pending latency counters for all tracked resources. |
sub_A09D40 | -- | Update WAR (anti-dependency) resource tracking for register operands |
sub_A08A00 | -- | Resource model query (called in 3 modes by sub_8C67A0) |
The resource model sub_A08A00 is called three times per instruction by sub_8C67A0:
- Mode 1: instruction's own execution cost (FU assignment + pipeline latency)
- Mode 2: operand release costs (freed resources when an operand reaches last-use)
- Mode 3: combined instruction + BB-level impact (aggregate pressure)
SSE intrinsics (_mm_add_epi32, _mm_loadu_si128) are used throughout for vectorized resource accumulation and copying.
Register Pressure Tracking
The scheduler tracks register liveness via a bitvector at scheduler+832. Each bit represents one register; the pressure is the popcount of the live set.
function UpdateRegisterPressure(sched, instr):
for each operand in instr.operands:
if operand.is_def:
set_bit(sched.live_bv, operand.reg) // DEF: mark live
if operand.is_last_use:
clear_bit(sched.live_bv, operand.reg) // LAST-USE: mark dead
sched.current_pressure = popcount(sched.live_bv)
The bitvector is sized to (numRegs + 1) words, or (2 * numRegs + 2) when knob 420 (dual-register tracking) is active. Dual-register tracking separately tracks register pairs for instructions that consume or produce 64-bit values.
Pressure state fields:
| Offset | Content |
|---|---|
scheduler+432 | Target register count (from budget computation) |
scheduler+324 | Committed register target |
scheduler+316 | Minimum register count |
scheduler+320 | Register pressure slack (headroom) |
When current_pressure > scheduler+432, the priority function sets bit 4 (pressure overflow) in the encoding, biasing the scheduler toward instructions that release registers.
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 to a heap-allocated SchedNode block. The offsets below are relative to the SchedNode base, not the 296-byte Ori instruction. See the SchedNode layout for the authoritative field map.
| Offset | Type | Content |
|---|---|---|
| +8 | int32 | dep_count -- unsatisfied predecessor count (0 = ready) |
| +16 | QWORD | next_ready -- linked-list pointer in ready list |
| +24 | int32 | bbSlot -- 1-based BB position (-1 = unscheduled) |
| +28 | int32 | latency_counter -- current stall counter |
| +32 | int32 | earliestCycle -- earliest available cycle |
| +40 | int32 | latestDeadline -- latest deadline cycle |
| +44 | int32 | Barrier group index |
| +88 | int32 | maxPredecessorCycle |
| +92 | int32 | maxDependencyCycle |
| +108 | byte | Flags: bit 0 = barrier target, bit 1 = has dependency, bit 2 = early schedulable, bit 3 = late schedulable, bit 4 = has register operand |
| +111 | byte | Flags: bit 7 = uses expensive register file |
The scheduling loop also reads Ori instruction fields directly (not via the SchedNode): instr+72 (opcode), instr+80 (operand count), instr+84 (operand descriptors).
Sentinel values: bbSlot -1 (unscheduled), latency 0x1869F (99999 = infinity).
The dep_count field at +8 is the key scheduling control: it counts unsatisfied predecessors in the dependency DAG. When a predecessor is scheduled, the engine decrements every successor's dep_count. When dep_count reaches zero, the instruction becomes ready and is inserted into the ready list.
Stall and Barrier Insertion
After the scheduling loop determines instruction order, sub_8D7760 (41 KB) converts the abstract schedule into SASS control words.
For each instruction in the scheduled order:
| Field | Computation | Range |
|---|---|---|
| Stall count | Distance in cycles to the nearest dependent consumer | 0--15 (capped by knob 805, max 16) |
| Yield hint | Warp scheduling hint -- should the HW scheduler switch to another warp? | 0 or 1 |
| Barrier assignment | Which of the 6 available barriers this instruction writes/waits on | 0--5, or none |
| Scoreboard deps | Read/write dependency tracking for the hardware scoreboard | Bitmask |
The function contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+). It manages a 32-entry barrier table for tracking active barrier assignments.
See Scoreboards & Barriers for the control word encoding format.
Alternative Scheduling Loop
sub_68B9C0 (46 KB) is a monolithic function that combines dependency graph construction with the scheduling loop. It serves as an alternative entry point for scheduling passes that need to build the DAG inline rather than using the pre-built graph from Stage 1.
Internal structure:
- Initialize scheduling state (sub_685700)
- Initialize ready-list management (sub_687080)
- Check resource conflicts (sub_687410)
- Inner loop (a while(2) infinite loop with break conditions):
  - Check if the ready list is empty -- break if so
  - Check opcode 97 (STG in ROT13; used as a scheduling barrier/control marker) -- special handling
  - Select the best instruction from the ready list
  - Schedule it: assign cycle, update resources, process edges
  - For each successor: decrement dep_count, add to the ready list if zero
  - Check the boundary condition (v236) -- break if done
- Track first-pass initialization via v215
This function accesses the Ori instruction's opcode at instr+72, plus SchedNode fields (via instr+40 pointer): +24 (bbSlot), +144 (scheduling slot), +164 (resource class), and +236 (latency).
Specialized Scheduling Strategies
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies. These are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level. Each strategy implements a variation of the core list scheduling algorithm with different heuristics or search strategies.
| Address | Size | Strategy | Description |
|---|---|---|---|
sub_8B1190 | 16 KB | Backtracking | Undo and retry on scheduling conflicts. Rolls back the last N steps and tries alternative orderings. Bounded depth prevents exponential blowup. |
sub_8B2D90 | 18 KB | Global optimization | Cross-BB scheduling. Moves instructions across BB boundaries when safe (no side effects, dominance preserved). |
sub_8B4590 | 13 KB | Permutation search | Exhaustive permutation of instruction orderings for small BBs. Falls back to heuristic for larger blocks. |
sub_8B5400 | 14 KB | Latency-optimized | Maximizes memory latency hiding by aggressive interleaving of independent operations. |
sub_8B6D60 | 12 KB | Pressure-optimized | Minimizes live range overlap by scheduling defs as close to their uses as possible. |
sub_8B77C0 | 15 KB | Dual-issue | Pairs co-issuable instructions for dual-issue architectures (sm_50/Maxwell). Uses sub_A9CDE0 and sub_A9CF90 for compatibility checks. |
sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA/BGMMA grouping for warpgroup tensor operations. |
sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping -- interleaves iterations to fill pipeline bubbles. |
sub_8BAAE0 | 15 KB | Loop-aware | Trip count + register awareness for loop bodies. |
sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Inserts and schedules memory prefetch instructions. |
sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merges adjacent barrier instructions to reduce overhead. |
sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimizes scoreboard entries by reusing barrier registers. |
sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning for multi-warp scheduling. |
sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection for dual-issue slots. |
sub_8BE320 | 25 KB | Complex combined pass | Multi-strategy combined pass for complex code patterns. |
sub_8A9D80 | 21 KB | Depth-first | DFS-based instruction ordering for deep dependency chains. |
sub_8AB750 | 9.8 KB | Critical path | DAG analysis for priority weight computation. |
Backtracking Scheduler
The backtracking strategy (sub_8B1190) is notable because it breaks from the greedy nature of standard list scheduling. When a scheduling decision leads to excessive stalls or resource conflicts, it can undo the last N steps (where N is bounded by a configurable depth), re-insert the affected instructions into the ready list, and try a different selection. This provides limited but effective lookahead without the full cost of optimal scheduling.
Dual-Issue Scheduling
For sm_50 (Maxwell), sub_8B77C0 pairs instructions that can execute simultaneously on different functional units. Eligibility is checked by sub_8CF5D0 (3.5 KB), which verifies architecture support and computes a dual-issue benefit score at scheduler+328. Pairing compatibility uses sub_A9CDE0 (is instruction dual-issuable?) and sub_A9CF90 (can this pair with the next instruction?).
Size Limits and Chunking
Two mechanisms prevent the scheduling algorithm from hitting quadratic complexity on large inputs:
BB Size Limit
sub_8CBAD0 scans all basic blocks during pre-scheduling setup. Any BB exceeding 4095 instructions is split by inserting scheduling barriers (sub_931920). This caps the per-BB scheduling problem size, ensuring the O(n^2) dependency graph construction remains tractable. The maximum BB size is tracked at scheduler+388.
Large Function Chunking
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is partitioned into chunks that are scheduled independently, then the results are merged. This avoids the full O(n^2) DAG construction for very large kernels. The chunk boundary selection respects BB boundaries and dependency chains to minimize cross-chunk constraint violations.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan | HIGH |
sub_682200 | -- | UnlinkFromReadyList -- remove and update deps | HIGH |
sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas | HIGH |
sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup, knob query | HIGH |
sub_685700 | -- | InitSchedulingState -- loop initialization | MEDIUM |
sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis | HIGH |
sub_687080 | -- | ReadyListManagementHelper | MEDIUM |
sub_687410 | -- | ResourceConflictCheck | MEDIUM |
sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry | HIGH |
sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode core loop | HIGH |
sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG | HIGH |
sub_68B9C0 | 46 KB | MainSchedulingLoop -- combined DAG + scheduling | HIGH |
sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring | HIGH |
sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation | HIGH |
sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering | HIGH |
sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path | HIGH |
sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis | HIGH |
sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges | HIGH |
sub_89C550 | 14 KB | InnerScheduleLoop -- inner scheduling iteration | HIGH |
sub_89EFC0 | 16 KB | ReadyListManager -- BST management | HIGH |
sub_8A9D80 | 21 KB | DepthFirstSchedule | MEDIUM |
sub_8AB750 | 9.8 KB | CriticalPathCompute | MEDIUM |
sub_8B1190 | 16 KB | ScheduleWithBacktrack | MEDIUM |
sub_8B2D90 | 18 KB | GlobalScheduleOpt -- cross-BB scheduling | MEDIUM |
sub_8B4590 | 13 KB | PermuteSchedule -- permutation search | MEDIUM |
sub_8B5400 | 14 KB | ScheduleForLatency | MEDIUM |
sub_8B6D60 | 12 KB | ScheduleForPressure | MEDIUM |
sub_8B77C0 | 15 KB | DualIssueScheduler | MEDIUM |
sub_8B8900 | 12 KB | TensorScheduler | MEDIUM |
sub_8B9390 | 23 KB | SoftwarePipeline | MEDIUM |
sub_8BAAE0 | 15 KB | LoopScheduler | MEDIUM |
sub_8BB9C0 | 8.2 KB | PrefetchScheduler | MEDIUM |
sub_8BC0B0 | 6.1 KB | BarrierCoalescence | MEDIUM |
sub_8BC990 | 7.6 KB | ScoreboardOpt | MEDIUM |
sub_8BCFA0 | 6.8 KB | WarpScheduleOpt | MEDIUM |
sub_8BDC40 | 7.9 KB | DualIssuePairing | MEDIUM |
sub_8BE320 | 25 KB | ComplexSchedulePass | MEDIUM |
sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
sub_8C7120 | -- | BarrierTrackingUpdate | MEDIUM |
sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree | HIGH |
sub_8C9320 | 47 KB | ComputePriority -- 8-bit packed heuristic | HIGH |
sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit | HIGH |
sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop | HIGH |
sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scan stage 1 | HIGH |
sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator | HIGH |
sub_8D1730 | 19 KB | ExecuteSchedulePass | HIGH |
sub_8D2510 | 3.6 KB | UpdateDependencies -- post-schedule dep update | HIGH |
sub_8D3150 | 2.0 KB | CheckResourceConflict | MEDIUM |
sub_8D32D0 | 14 KB | ScheduleInstruction -- schedule single instruction | HIGH |
sub_8D3D60 | 1.4 KB | InsertStall | HIGH |
sub_8D3E20 | 2.1 KB | ComputeStallCycles | HIGH |
sub_8D4000 | 3.0 KB | InsertBarrier | HIGH |
sub_8D5E00 | 38 KB | MainSchedulingLoop -- workhorse | HIGH |
sub_8D7760 | 41 KB | StallAndBarrierInsertion -- control word generation | HIGH |
sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW/memory/barrier | HIGH |
sub_925510 | 341 bytes | MoveInstruction -- doubly-linked list relink | HIGH |
sub_A08A00 | -- | ResourceModel -- FU cost query (3 modes) | HIGH |
sub_A091C0 | -- | InitResourceTracking | MEDIUM |
sub_A09530 | 365 bytes | UpdateStallCycles -- per-instruction latency update | HIGH |
sub_A09D40 | -- | UpdateWARTracking -- anti-dependency tracking | MEDIUM |
sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling | MEDIUM |
Per-SM Scheduling Backends
Everything documented above describes the main scheduler (Backend A), which covers approximately 436 KB at 0x680000--0x8FE000. ptxas contains two additional complete scheduling implementations activated for newer SM architectures. The three backends coexist in the binary; SM-version-gated dispatch selects which combination runs.
Architecture Dispatch
The function sub_7DDB50 reads an SM architecture index from context+2104 and returns it as an integer. Four dispatch stubs in the 0xC5FE00--0xC61000 range use this value to select the scheduling backend:
| Dispatch Stub | Condition | Backend Selected | Pipeline Stage |
|---|---|---|---|
sub_C5FEF0 | SmVersion > 1 | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
sub_C60910 | SmVersion > 1 && (context+1392 & 1) | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
sub_C5FFC0 | SmVersion > 1 | Backend C (RBT List), mode 1 | Pre-scheduling |
sub_C5FFF0 | SmVersion > 1 | Backend C (RBT List), mode 0 | Post-scheduling |
When SmVersion <= 1 (sm_50 through sm_75 -- Maxwell through Turing), control falls through to the main Backend A documented in the preceding sections. When SmVersion >= 2 (sm_80+ -- Ampere, Ada Lovelace, Hopper, Blackwell), Backends B and C replace Backend A entirely.
sub_C60910 has a secondary activation path: if *(options + 23544) == 1 && *(options + 23552) != 0, Backend B activates regardless of SM version, providing a knob override for testing the codec scheduler on older architectures.
Backends B and C are complementary, not competing. Backend C handles pre-scheduling and post-scheduling (the same pipeline stages as Backend A's 3-phase ReduceReg/ILP/DynBatch), while Backend B handles a separate codec/ISel scheduling step that has no equivalent in the legacy path.
Backend B -- SM89/90 Codec Scheduler (0x1225000)
Backend B is a forward-then-backward scheduling pass with continuous floating-point priority weighting. It replaces Backend A's discrete 8-bit packed heuristic with a configurable pressure/ILP tradeoff expressed as doubles.
| Entry | sub_1233D70 (1,527 B, 321 lines) -- pass phase 5 |
| Forward scheduler | sub_122AD60 (17.5 KB, 4,118 lines) -- largest function in range |
| Backward scheduler | sub_122F650 (18.2 KB, 3,917 lines) |
| Preparation | sub_123E0D0 -- instruction characterization |
| Post-fixup | sub_A112C0 -- scheduling result finalization |
| Priority structure | BST with FNV-1a hash tracking |
| Code region | 0x1225000--0x1240000 (132 functions, 111 KB) |
Float Weighting System
The entry point sub_1233D70 initializes two pairs of floating-point weights from the options object at *(context+1664) + 72:
Pair 1 -- Pressure/ILP tradeoff (options offsets 7200/7208):
| Weight | Default | Meaning |
|---|---|---|
pressure_weight | 1.8 | Contribution of register pressure to scheduling priority. Positive = favors orderings that reduce live register count. |
ilp_weight | -0.8 | Contribution of instruction-level parallelism. Negative = penalizes moves that reduce available parallelism. |
The two weights sum to 1.0 and form a weighted combination on a unit scale. The default 1.8/-0.8 split heavily favors register pressure reduction, accepting moderate ILP degradation -- appropriate for register-hungry Ada Lovelace and Hopper kernels.
Pair 2 -- Secondary scoring axis (options offsets 7560/7568):
| Weight | Default | Meaning |
|---|---|---|
forward_weight | 3.2 | Forward-looking scheduling contribution |
backward_penalty | -2.2 | Backward-looking penalty factor |
Both pairs are overridable. When the configuration byte at the respective offset equals 3, the weight is read from the adjacent double field and the complement is computed as 1.0 - weight:
if (*(BYTE*)(options + 7200) == 3):
pressure_weight = *(double*)(options + 7208)
ilp_weight = 1.0 - pressure_weight
After loading, both weight pairs are normalized by dividing by the register range (float)(max_regs - min_regs), producing per-register slopes:
range = (float)(max_regs) - (float)(min_regs)
pressure_slope = ilp_weight / range
secondary_slope = backward_penalty / range
This normalization ensures the scheduling heuristic scales consistently regardless of the target architecture's register file size.
Forward Pass (sub_122AD60)
The forward scheduler implements list scheduling with a BST priority queue, iterating basic blocks front-to-back. It uses FNV-1a hash tables (seed 0x811C9DC5, multiplier 16777619) for tracking scheduled instruction mappings. Instruction properties are queried via sub_7DF3A0. The function manages a ref-counted working set with proper cleanup at function exit. At 4,118 decompiled lines, it is the largest function in the 0x1225000 scheduling range.
Backward Pass (sub_122F650)
The backward scheduler receives the floating-point weights as direct parameters and processes basic blocks in reverse order. It calls into the barrier/scoreboard system (sub_BDC080, sub_BDBA60, sub_BDC0A0) and performs register liveness analysis via sub_A0EDE0. The function uses BST operations with left/right/parent pointer traversal and explicit rebalancing, then performs iterative tree cleanup at exit.
Backend C -- RBT List Scheduler (0x18CD000)
Backend C is a complete reimplementation of the list scheduling algorithm using a red-black tree priority queue, double-precision scoring, and an evaluate-then-commit model with hash-table solution caching. It replaces Backend A for all sm_80+ targets.
| Orchestrator | sub_1908D90 -- pre/post mode dispatch |
| Driver | sub_1906090 -- per-block scheduling loop |
| Core scheduler | sub_1902B70 (19 KB) -- RBT-based list scheduling |
| Solution evaluator | sub_1904B70 (26 KB) -- constraint check + commit |
| Constraint validator | sub_19043F0 (10 KB) -- feasibility testing |
| Pressure cost model | sub_18F3CB0 (16 KB) -- SIMD register pressure |
| Recursive cost propagation | sub_18FFD70 (23 KB) -- call-graph-aware scoring |
| Dependency update | sub_1902100 (15 KB) -- post-scheduling DAG update |
| RBT insert | sub_18FD370 -- balanced insertion with 3-key comparison |
| RBT extract | sub_18FCDA0 -- pop highest-priority node |
| RBT reset | sub_18F7EC0 -- tree cleanup |
| Score computation | sub_18FDAF0 -- double-precision weighted score |
| Hash table | sub_1906510 (14 KB) -- FNV-1a instruction ID lookup |
| Code region | 0x18CD000--0x190FFFF (392 functions, 275 KB) |
Red-Black Tree Priority Queue
The critical difference from Backend A is the priority queue data structure. Backend A uses a sorted singly-linked list (O(N) insertion per instruction). Backend C uses a red-black tree that maintains balance through rotation operations in sub_18FD170 (called at the end of every insertion).
Each RBT node is 40 bytes allocated from a pool, with node+24 pointing to the instruction's scheduling entry. The tree ordering uses a three-key comparison in sub_18FD370:
- Priority integer at scheduling_entry + 384 (descending -- higher-priority nodes are left children)
- Latency double at scheduling_entry + 368 (descending -- higher latency is scheduled first among equal-priority instructions)
- Instruction ID at *(scheduling_entry + 16) + 12 (ascending -- deterministic tiebreaker)
This three-key comparison provides O(log N) insertion and extraction, a significant improvement for basic blocks with hundreds of instructions where Backend A's O(N) sorted insertion becomes a bottleneck.
Core Scheduling Loop (sub_1902B70)
function RBTListSchedule(context, block, dep_info, bound, constraint):
InitRegisterPressure(context, block) // sub_18F8580
InitRBTree(tree) // sub_18F7EC0
for each instruction in block.instruction_list:
node = AllocPoolNode(40 bytes)
node.scheduling_entry = instruction
RBTreeInsert(tree, node) // sub_18FD370
ResizeScheduleArray(block, tree.count) // sub_18F9CC0
while tree is not empty:
best_node = RBTreeExtractMax(tree) // sub_18FCDA0
instruction = best_node.scheduling_entry
ReturnNodeToPool(best_node) // node read before being recycled
valid = vtable_check(context, block, instruction)
*(instruction + 365) = valid
UpdateDependencies(context, instruction, tree) // sub_1902100
if not valid:
InsertRejection(block + 112, instruction_id)
continue
// Record scheduling decision
position = ++(block + 360)
entry = *(block + 352) + 24 * position
entry[0] = instruction_id
entry[1] = instruction + 2 // scheduling state
entry[2] = instruction // back-pointer
// Compute and accumulate scores
latency = LookupLatency(context, instruction) // sub_18F5460
*(block + 96) += (priority - 2) * latency
*(block + 88) += latency * *(instruction + 376)
// Process successors, check conflicts (binary search on
// sorted 12-byte conflict array with 0xAAAAAAAAAAAAAAAB
// division-by-3 trick for index computation)
Evaluate-Then-Commit Model
Backend A uses a greedy approach: each scheduling decision is final. Backend C introduces a two-phase model where sub_1904B70 evaluates a proposed schedule against constraints before committing it:
- Build a candidate schedule (408-byte evaluation nodes with def/use/pred-deps/succ-deps lists)
- Validate via sub_19043F0 (the scheduling mode at +64 must be 5 or 6)
- Run the architecture-specific check via the vtable at *context + 16
- Verify register pressure via sub_19016E0
- Compute the score via sub_18FDAF0 (returns a double)
- If the score exceeds the threshold at context + 360, insert into the solution hash table at block + 304
This allows Backend C to explore multiple scheduling alternatives and commit only the best-scoring solution, a capability Backend A's greedy model lacks.
Recursive Cost Propagation (sub_18FFD70)
Backend C uniquely supports cross-function scheduling awareness through recursive cost propagation. sub_18FFD70 walks the call graph:
- For a given schedule entry, iterate predecessor blocks (linked list at +12/+13)
- Look up each predecessor in the scheduling map via sub_18F4E70
- If the predecessor is live (byte at +365), recursively process it
- After recursion, scan the instruction operand lists (offsets 80, 84), identifying register operands by the type tag (operand >> 28) & 7 == 1
- Clear register usage bits at reg_entry + 28
- Update the double-precision scores at offsets 88 and 96
This propagation allows scheduling decisions in callee functions to influence caller scheduling priorities -- a form of interprocedural scheduling absent from both Backend A and Backend B.
Backend Comparison
| Feature | Backend A (Main) | Backend B (SM89/90 Codec) | Backend C (RBT List) |
|---|---|---|---|
| Address range | 0x680000--0x8FE000 | 0x1225000--0x1240000 | 0x18CD000--0x190FFFF |
| Code size | ~436 KB | ~111 KB | ~275 KB |
| SM gate | SmVersion <= 1 | SmVersion >= 2 | SmVersion >= 2 |
| Pipeline stage | Pre + post scheduling | Codec/ISel scheduling | Pre + post scheduling |
| Priority encoding | 8-bit packed integer | Float-weighted BST | RB-tree (int + double + ID) |
| Priority function size | 47 KB monolithic | Distributed across weights | 3-key comparison |
| Ready list structure | Sorted singly-linked list | Binary search tree | Red-black tree |
| Insertion complexity | O(N) per instruction | O(log N) | O(log N) |
| Scheduling passes | 3 (ReduceReg / ILP / DynBatch) | 2 (Forward / Backward) | 2 (Pre / Post) |
| Pressure tracking | Bitvector + popcount | Float slope per register | SIMD bitmap + cost model |
| Weight configuration | Knobs 769--805 (integer) | Options 7200/7560 (double) | Vtable dispatch |
| Score type | Integer (packed bits) | Double (weighted sum) | Double (accumulated) |
| Solution search | Greedy (single pass) | Forward + backward | Evaluate + commit |
| Cross-function awareness | None | None | Recursive cost propagation |
| Hash infrastructure | None | FNV-1a | FNV-1a |
| Backtracking | Optional (sub_8B1190) | None | Rejection set + retry |
Backend B + C Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_1233D70 | 1.5 KB | SM89/90 CodecScheduleEntry -- pass phase 5, float weight init | HIGH |
sub_122AD60 | 17.5 KB | ForwardCodecScheduler -- BST list scheduling, FNV-1a hash tracking | HIGH |
sub_122F650 | 18.2 KB | BackwardCodecScheduler -- reverse pass, barrier/scoreboard integration | HIGH |
sub_123ADD0 | 5.8 KB | CodecDependencyGraphBuilder -- dispatched via vtable | MEDIUM |
sub_12371D0 | 3.8 KB | CodecInstructionClassifier -- convergence-based property testing | MEDIUM |
sub_123E0D0 | -- | CodecSchedulePreparation -- instruction characterization | MEDIUM |
sub_A112C0 | -- | CodecSchedulePostFixup -- result finalization | MEDIUM |
sub_1908D90 | -- | RBTScheduleOrchestrator -- pre/post mode dispatch | HIGH |
sub_1906090 | -- | RBTScheduleDriver -- per-block loop, 368-byte block stride | HIGH |
sub_1902B70 | 19 KB | RBTCoreListScheduler -- RB-tree priority queue loop | HIGH |
sub_1904B70 | 26 KB | RBTSolutionEvaluator -- constraint check, score threshold, hash commit | HIGH |
sub_19043F0 | 10 KB | RBTConstraintValidator -- mode 5/6 feasibility | HIGH |
sub_19038E0 | 15 KB | RBTInitialEvaluation -- per-block constraint bootstrapping | MEDIUM |
sub_18F3CB0 | 16 KB | RBTPressureCostModel -- SIMD register pressure computation | HIGH |
sub_18FFD70 | 23 KB | RBTRecursiveCostPropagation -- call-graph-aware scoring | HIGH |
sub_1902100 | 15 KB | RBTDependencyUpdate -- post-scheduling DAG maintenance | HIGH |
sub_18FD370 | -- | RBTreeInsert -- 3-key balanced insertion + fix-up | HIGH |
sub_18FCDA0 | -- | RBTreeExtractMax -- pop highest-priority node | HIGH |
sub_18F7EC0 | -- | RBTreeReset -- tree cleanup | HIGH |
sub_18F8580 | -- | RBTRegisterPressureInit -- pressure state initialization | MEDIUM |
sub_18F5460 | -- | RBTLatencyLookup -- vtable-dispatched latency query | MEDIUM |
sub_18FDAF0 | -- | RBTScoreComputation -- double-precision weighted score | HIGH |
sub_1906510 | 14 KB | RBTHashLookup -- FNV-1a instruction ID hash table | HIGH |
sub_18FB850 | -- | RBTHashResize -- power-of-2 growth, 0.5 load factor | HIGH |
sub_1901200 | -- | RBTScorePropagationDriver -- calls sub_18FFD70 | MEDIUM |
sub_19081F0 | 17 KB | RBTBlockDependencyGraphBuild -- per-block DAG construction | HIGH |
sub_19072F0 | 14 KB | RBTInterBlockScheduling -- cross-BB register dependency | MEDIUM |
sub_18FEE60 | -- | RBTScheduleStateCreate -- 528-byte state construction | MEDIUM |
sub_18FE320 | -- | RBTScheduleDataPrepare -- pre-scheduling data setup | MEDIUM |
sub_18F94C0 | -- | RBTCleanup -- state teardown | MEDIUM |
sub_C5FFC0 | -- | DispatchPreSchedule -- SM gate -> Backend C (mode 1) | CERTAIN |
sub_C5FFF0 | -- | DispatchPostSchedule -- SM gate -> Backend C (mode 0) | CERTAIN |
sub_C5FEF0 | -- | DispatchCodecSchedule -- SM gate -> Backend B | CERTAIN |
sub_C60910 | -- | DispatchConditionalCodecSchedule -- SM gate + knob override | CERTAIN |
sub_7DDB50 | -- | GetSmVersionIndex -- reads context+2104 | HIGH |
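Both Backend B and Backend C track instruction identity with FNV-1a hashing (sub_122AD60, sub_1906510), with power-of-2 table growth at a 0.5 load factor (sub_18FB850). A minimal sketch of 64-bit FNV-1a with a masked power-of-2 index, assuming the standard FNV constants -- the recovered code's exact hash width and seed have not been confirmed:

```python
def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a; offset basis and prime per the FNV spec."""
    h = 0xCBF29CE484222325          # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV 64-bit prime
    return h

def table_index(h: int, table_size: int) -> int:
    # A power-of-2 table size lets the modulo reduce to a mask,
    # matching the growth policy recovered from sub_18FB850.
    return h & (table_size - 1)
```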
Scheduling Guidance Output
After scheduling completes, ptxas can emit statistics comments into the SASS output and DUMPIR stream. Three emitter functions produce scheduling guidance in different contexts, all reading from a shared ~1400-byte statistics object. sub_A46CE0 controls the "SCHEDULING GUIDANCE:" header that wraps per-block scheduling output. sub_A3A7E0 emits per-function statistics as # [field=value] comment lines during DUMPIR. Eight post-regalloc clones at sub_ABBA50--sub_ABEB50 emit a variant with hardware pipe names.
Verbosity Controls
Two independent verbosity mechanisms gate the output:
Scheduling guidance level at *(DWORD*)(vtable + 992):
| Level | Behavior |
|---|---|
| 0 | No scheduling guidance output |
| 1+ | "SCHEDULING GUIDANCE:" header emitted; per-block scheduling dispatched |
| 2+ | Pre-formatting hook called via vtable+816 before header emission |
| 4+ | "LOOP STATIC METRICS : " sub-header appended |
DUMPIR detail bits at context+1416:
| Bit | Mask | Behavior |
|---|---|---|
| 3 | 0x08 | Enable detailed statistics (FP16 vectorization, functional unit breakdown, throughput estimates) |
| 4 | 0x10 | Show worst-case latency: # [worstcaseLat=%f] |
| 5 | 0x20 | Show average-case latency: # [avgcaseLat=%f] |
Bits 4 and 5 are mutually exclusive -- only one latency variant is emitted.
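The two gates can be sketched as a pair of predicates. The bit masks come from the tables above; the function names are hypothetical, and the behavior when both bits 4 and 5 are set is an assumption (neither COND matches, so no latency line is emitted):

```python
def latency_variant(detail_bits):
    """Decode the DUMPIR detail bits at context+1416 (masks per the table)."""
    sel = detail_bits & 0x30          # isolate bits 4-5
    if sel == 0x10:
        return "worstcaseLat"
    if sel == 0x20:
        return "avgcaseLat"
    return None                       # 0x00 or 0x30: no latency line

def detailed_stats_enabled(detail_bits):
    return bool(detail_bits & 0x08)   # bit 3: FP16/FU/throughput breakdown
```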
Emitter Functions
| Address | Size | Identity | Confidence | Context |
|---|---|---|---|---|
sub_A3A7E0 | 1,236 B | Statistics::emitFunctionStats | CERTAIN | Pre-regalloc DUMPIR statistics. 20+ format strings at 0x21EBF76--0x21EC3B0. Uses abstract FU names (fp, half, shared, controlFlow, loadStore). |
sub_A46CE0 | 1,793 B | SchedulingGuidance::buildAndEmit | HIGH | Scheduling guidance header + BB classification. Walks BB array at context+296, dispatches schedulable blocks to vtable+336. |
sub_A4B8F0 | 248 B | StatsEmitter::emitInstrRegStats | HIGH | Binary-embedded metadata. Writes record type 3 (string) into SASS code object at *(a1+1000) + *(a1+996). |
sub_ABBA50--sub_ABEB50 | 8 x 1,771 B | PostSchedStats::emit (SM-variant) | CERTAIN | Post-regalloc statistics. 8 clones at 0x700 spacing. Format strings at 0x21FA008--0x21FA400. Uses hardware pipe names (adu, alu, cbu, fma, lsu). |
Pre-Regalloc Output Format (sub_A3A7E0)
Emitted during DUMPIR. All lines prefixed with # . Lines marked [COND] are gated by the stated condition.
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [urregs=8] [COND: SM > 0x5FFF]
# [_lat2inst=0.0]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00] [COND: +1416 bit 3]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [SharedMem Alloc thru=0.125000] [COND: value != 0.0]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [Precise inst=0]
# [worstcaseLat=87.000000] [COND: +1416 bits 4-5 == 0x10]
# [avgcaseLat=52.500000] [COND: +1416 bits 4-5 == 0x20]
# [instHint=142] [instPairs=4] [COND: instPairs != 0]
# <custom annotation> [COND: linked list at stats[55] != NULL]
Key format details: pre-regalloc uses commas between some bracket groups ([SSpillB=%d], [SRefillB=%d],) and abstract functional unit names (fp, half, trancedental, shared, controlFlow, loadStore, texLoadStore). The misspelling "trancedental" (for "transcendental") exists in the binary's format strings and is reproduced verbatim here.
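The # [field=value] lines are regular enough to parse mechanically. A sketch of a tolerant parser -- the regex and type inference are conveniences, not recovered logic -- which also skips over the stray commas of the pre-regalloc format:

```python
import re

FIELD_RE = re.compile(r"\[([^=\]]+?)\s*=\s*([^\]]+)\]")

def parse_guidance_line(line):
    """Extract field=value pairs from a DUMPIR statistics comment line."""
    out = {}
    for key, val in FIELD_RE.findall(line.lstrip("# ")):
        val = val.strip()
        out[key.strip()] = float(val) if "." in val else int(val)
    return out
```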
Post-Regalloc Output Format (sub_ABBA50 clones)
Emitted after scheduling by SM-variant clones dispatched via vtable. Same # prefix. Differs from the pre-regalloc format in three ways:
- No commas between bracket groups
- SpillSize replaces LowLmemSpillSize + FrameLmemSpillSize
- Hardware pipe names replace abstract unit names; an MMA variant breakdown is added

The unique lines are shown below; lines shared with the pre-regalloc format keep the same structure, minus the commas:
# [est latency = %d] [LSpillB=%d] [LRefillB=%d] [SSpillB=%d] [SRefillB=%d] [SpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d] [NonSpillSize=%d]
# [Occupancy = %f] [est numDivergentBranches=%d] [attributeMemUsage=%d] [programSize=%d]
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] [est fma=%d] [est half=%d]
# [est trancedental=%d] [est ipa=%d] [est lsu=%d] [est redux=%d]
# [est schedDisp=%d] [est tex=%d] [est ttu=%d] [est udp=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d] [est hmma16816=%d] [est hmma16816f16=%d]
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [issue thru=%f] [adu thru=%f] [alu thru=%f] [cbu thru=%f] [fma2x thru=%f] [fma thru=%f]
# [trancedental thru=%f] [ipa thru=%f] [lsu thru=%f] [redux thru=%f]
# [schedDisp thru=%f] [tex thru=%f] [ttu thru=%f] [udp thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f] [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [reg thru=%f] [warp thru=%f]
Hardware Pipe Name Mapping
The post-regalloc format maps abstract functional unit names to hardware execution pipe identifiers:
| Post-Regalloc Pipe | Pre-Regalloc Equivalent | Description |
|---|---|---|
adu | -- | Address Divergence Unit (address computation) |
alu | fp | Arithmetic Logic Unit (integer + FP32 combined) |
cbu | controlFlow | Control/Branch Unit (branch, exit, barrier) |
fma2x | -- | Double-precision FMA (separate pipe on sm_80+) |
fma | fp | Fused Multiply-Add (FP32) |
half | half | FP16 operations |
lsu | loadStore + shared | Load/Store Unit (unified) |
redux | -- | Reduction Unit (warp-level reductions) |
schedDisp | -- | Scheduler Dispatch (internal overhead) |
tex | tex | Texture Unit |
ttu | -- | Tensor Texture Unit (Ada Lovelace+) |
udp | -- | Uniform Data Path operations |
Binary-Embedded Statistics Record (sub_A4B8F0)
Separate from the DUMPIR comment output, sub_A4B8F0 writes a compact binary record into the SASS code object during emission:
Format string: "instr/R-regs: %d instructions, %d R-regs"
instructions = stats[335] - stats[341] (total minus removed)
R-regs = stats[159] + stats[102] (extra + base allocation)
Record layout in output buffer:
+0 DWORD type = 3 (string record type)
+4 DWORD string_length
+8 char[] string_content (formatted text)
The companion function sub_A4B9F0 writes record type 2 for undefined register warnings: "Referencing undefined register: %s%d".
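The record layout maps onto a simple struct pack. A sketch -- the surrounding buffer management of sub_A4B8F0 is not modeled, and whether string_length counts a trailing NUL is unconfirmed:

```python
import struct

def pack_string_record(text):
    """Record layout recovered from sub_A4B8F0:
       +0 DWORD type=3, +4 DWORD string_length, +8 char[] content.
       (Trailing-NUL handling is an assumption: none is appended here.)"""
    payload = text.encode("ascii")
    return struct.pack("<II", 3, len(payload)) + payload
```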
Scheduling Guidance Header (sub_A46CE0)
sub_A46CE0 emits the scheduling guidance wrapper, then walks the BB array to classify and dispatch blocks for scheduling. The header is emitted into the output stream via sub_7FE930 (string builder) at context + 1440.
BB classification algorithm:
For each BB in context+296 (index 0 through context+304):
- Schedulable: sub_7544D0(context, bb) returns true AND sub_754510(context, bb) returns false. Dispatched immediately to scheduling via vtable+336.
- Type-8 (deferred): *(bb+16) == 8. Added to a dynamically grown src array for second-pass processing.
- Loop back-edge: When *(bb+148) != 0 and *(bb+128) != NULL, the function walks the predecessor linked list at bb+128. For each predecessor, it checks whether the predecessor's iteration order (bb+144) exceeds the current block's, and whether the predecessor's terminal instruction is a branch (opcode 0x5D after masking with 0xFFFFCFFD) with a matching program counter at (instr+84) & 0xFFFFFF. If a back-edge is detected, the scheduling dispatch includes the back-edge source instruction as a hint parameter.
After the first pass, deferred type-8 blocks are processed in a second loop with the same back-edge detection logic.
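The two-pass dispatch structure can be sketched with hypothetical predicates standing in for sub_7544D0 and sub_754510; the precedence between the schedulable test and the type-8 test is a guess, and the back-edge hint parameter is omitted:

```python
def dispatch_blocks(blocks, is_schedulable, is_excluded, schedule):
    """Two-pass classification per sub_A46CE0: pass 1 dispatches ordinary
       schedulable blocks and defers type-8 blocks; pass 2 processes the
       deferred list with the same logic."""
    deferred = []
    for bb in blocks:
        if bb["type"] == 8:               # *(bb+16) == 8
            deferred.append(bb)           # the grown 'src' array in the binary
        elif is_schedulable(bb) and not is_excluded(bb):
            schedule(bb)                  # vtable+336 dispatch
    for bb in deferred:                   # second pass
        if is_schedulable(bb) and not is_excluded(bb):
            schedule(bb)
    return deferred
```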
Statistics Object Field Map
Both emitter families read from the same ~1400-byte statistics object. The object is accessed as a float* array; integer fields use the same DWORD index but read as int32.
| Index | Type | Field | Description |
|---|---|---|---|
| 8 | int32 | est_latency | Estimated schedule length in cycles |
| 9 | float | FP16_vectorization_pct | Fraction of FP16 instructions vectorized |
| 10 | int32 | worstcase_latency | Worst-case latency (cast to float for output) |
| 11 | int32 | avgcase_latency | Average-case latency (cast to float for output) |
| 12 | int32 | LSpillB | Long spill byte count |
| 13 | int32 | LRefillB | Long refill byte count |
| 14 | int32 | SRefillB | Short refill byte count |
| 15 | int32 | SSpillB | Short spill byte count |
| 16 | int32 | LowLmemSpillSize | Local-memory low spill allocation |
| 17 | int32 | FrameLmemSpillSize | Frame local-memory spill allocation |
| 18 | int32 | LNonSpillB | Long non-spill byte count |
| 19 | int32 | LNonRefillB | Long non-refill byte count |
| 20 | int32 | NonSpillSize | Non-spill allocation size |
| 26 | float | Occupancy | Occupancy ratio (0.0--1.0) |
| 27 | int32 | numDivergentBranches | Estimated divergent branch count |
| 28 | int32 | attributeMemUsage | Attribute memory usage in bytes |
| 29 | int32 | programSize | Program binary size in bytes |
| 42 | int32 | preciseInst | Count of precise (non-approximate) instructions |
| 44 | int32 | UDPinst | Uniform data-path instruction count |
| 45 | int32 | vecToURConverts | Vector-to-uniform-register conversion count |
| 49 | int32 | maxLiveAtSuspend | Max live register values at suspend points |
| 50 | float | issue_thru | Overall issue throughput (fraction of peak) |
| 54 | float | fp_thru | FP32 pipe throughput |
| 57 | float | half_thru | FP16 pipe throughput |
| 58 | float | transcendental_thru | Transcendental function pipe throughput |
| 59 | float | ipa_thru | Interpolation pipe throughput |
| 61 | float | shared_thru | Shared memory pipe throughput |
| 62 | float | controlFlow_thru | Control flow pipe throughput |
| 65 | float | texLoadStore_thru | Texture and load/store pipe throughput |
| 84 | float | reg_thru | Register throughput |
| 85 | float | warp_thru | Warp throughput |
| 86 | float | sharedMemAlloc_thru | Shared memory allocation throughput |
| 87 | int32 | partiallyUnrolledLoops | Partially unrolled loop count |
| 88 | int32 | nonUnrolledLoops | Non-unrolled loop count |
| 89 | int32 | CBBoundTex | Constant-bank-bound texture count |
| 90 | int32 | PartiallyBoundTex | Partially bound texture count |
| 91 | int32 | BindlessTex | Bindless texture count |
| 92 | int32 | URBoundTex | Uniform-register-bound texture count |
| 93 | int32 | SM_architecture_enum | SM version discriminator (>0x5FFF enables UR stats) |
| 99 | int32 | uniform_reg_count | Uniform register count |
| 102 | int32 | R_reg_base | Base R-register allocation |
| 159 | int32 | R_reg_extra | Extra R-register allocation |
| 303 | int32 | est_fp | Estimated FP32 instruction count |
| 306 | int32 | est_half | Estimated FP16 instruction count |
| 307 | int32 | est_transcendental | Estimated transcendental instruction count |
| 308 | int32 | est_ipa | Estimated IPA instruction count |
| 310 | int32 | est_shared | Estimated shared memory operation count |
| 311 | int32 | est_controlFlow | Estimated control flow operation count |
| 315 | int32 | est_loadStore | Estimated load/store instruction count |
| 316 | int32 | est_tex | Estimated texture instruction count |
| 334 | int32 | est_pairs | Estimated co-issuable instruction pairs |
| 335 | int32 | total_inst | Total instruction count (before removal) |
| 336 | int32 | texInst | Texture instruction count |
| 337 | int32 | FP16inst | FP16 instruction count |
| 338 | int32 | FP16VectInst | FP16 vectorized instruction count |
| 339 | int32 | instHint | Instruction hint value |
| 340 | int32 | instPairs | Instruction pair count (also output gate) |
| 341 | int32 | removed_inst | Removed instruction count |
| 342 | int32 | tepid_inst | TEPID (texture-pending) instruction count |
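Because both emitter families index the same buffer as a float* and reinterpret selected slots as int32, the dual-view reads can be mimicked with struct. The buffer below is synthetic; the indices come from the table above:

```python
import struct

def read_f32(buf, index):
    """Read a statistics slot as float (the object's native view)."""
    return struct.unpack_from("<f", buf, 4 * index)[0]

def read_i32(buf, index):
    """Read the same DWORD slot reinterpreted as int32."""
    return struct.unpack_from("<i", buf, 4 * index)[0]

# Synthetic ~1400-byte stats object: est_latency (index 8, int32) and
# Occupancy (index 26, float) populated as in the sample output above.
stats = bytearray(1400)
struct.pack_into("<i", stats, 4 * 8, 87)
struct.pack_into("<f", stats, 4 * 26, 0.75)
```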
Cross-References
- Scheduler Overview -- 3-phase architecture, register budget, scheduling knobs
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- the allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
- DUMPIR Configuration -- DUMPIR levels that trigger statistics output
- Spilling -- spill metrics (LSpillB, SSpillB) referenced in guidance output
Latency Model & Functional Units
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas instruction scheduler uses a static hardware performance model to estimate instruction latencies, functional unit occupancy, and pipeline conflicts. The model is architecture-parameterized: a family of 15+ profile-builder functions at 0x8E7300--0x8E9DC0 construct per-SM latency/throughput tables consumed by the scheduling engine. A separate 85 KB function (sub_89FBA0, SetOpcodeLatencies) assigns per-opcode scheduling classes that index into these tables. The combination produces a cost model that drives stall-count computation, priority scoring, and dual-issue pairing decisions.
| Per-opcode classifier | sub_89FBA0 (85 KB) -- assigns scheduling class per Ori opcode |
| HW profile builder | sub_8E5CA0 (20 KB) -- assembles scheduling control word tables |
| Warp profile | sub_8E4400 (3.3 KB) -- maps SM ID to warp/dispatch parameters |
| SM-specific tables | sub_8E7300--sub_8E97B0 -- 15 architecture-specific builders |
| Latency query | sub_693BC0 (22 lines) -- memory space classification |
| Long-latency check | sub_8CCF80 (2.3 KB) -- returns true if latency > 19 |
| Resource model | sub_A08A00 (345 lines) -- per-instruction FU cost computation |
| Register query | sub_A08910 (39 lines) -- operand register count/cost |
| Stall update | sub_A09530 (91 lines) -- per-instruction stall cycle accumulation |
| FU class mapper | sub_8F0CD0 -- maps (opcode, unit_name) to scheduling class |
| FU unit query | sub_704D30 (14 KB) -- maps SASS opcodes to functional unit IDs |
| Cutlass detector | sub_8F47E0 -- detects cutlass kernels for tuned scheduling |
| Pipe class assigner | sub_13710B0 (7.1 KB) -- SASS-level execution pipe assignment |
Architecture of the Latency Model
The model has three layers:
Layer 1: Per-Opcode Classification
sub_89FBA0 reads each instruction's Ori opcode (field at instr+72,
masked with 0xFFFFCFFF) and assigns:
- Scheduling class ID (stored at descriptor+4, range 1..772+)
- 9-bit latency index (low 9 bits of descriptor+196)
- Execution pipe mask (bits 15..19 of descriptor+196..200)
- Throughput class (bits in descriptor+198..199)
Layer 2: Architecture-Specific HW Tables
sub_8E7300..sub_8E97B0 build per-SM latency/throughput tables as
96-byte records in a growable array. Each record maps a scheduling
class to its pipeline latency, scoreboard wait count, barrier stall
cycles, and dual-issue compatibility flags.
Layer 3: Runtime Query
The scheduling engine queries the model via:
- sub_A08A00 for per-instruction resource costs (3 modes)
- sub_A08910 for register operand latency
- sub_693BC0 for memory space classification
- sub_8CCF80 for long-latency detection (threshold: 19 cycles)
Scheduling Class Assignment (sub_89FBA0)
sub_89FBA0 (85 KB, 2938 lines decompiled) is the largest function in the scheduling subsystem. It assigns each instruction a scheduling class -- an integer that indexes into the per-architecture latency tables. The function operates as a massive switch on *(instr+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked out).
Scheduling Descriptor Layout
Each instruction carries a scheduling descriptor at offsets 196--200 within the 296-byte Ori instruction object (not the SchedNode). The descriptor is a packed bit-field:
Descriptor at a3+196 (DWORD, 32 bits):
[8:0] 9-bit latency index -- indexes into HW latency table
[14:9] reserved
[19:15] 5-bit execution pipe mask -- identifies functional unit
0x08000 = pipe A (ALU)
0x10000 = pipe B (FP/tensor)
0x18000 = pipe C (memory/texture)
0xF8000 = all pipes (default sentinel)
Descriptor at a3+198 (WORD, 16 bits):
[3:0] pipe sub-class within the execution pipe
0x10 = sub-class 1 (control flow)
0x20 = sub-class 2 (integer ALU)
0x30 = sub-class 3 (FP32)
0x40 = sub-class 4 (FP64 / wide ops)
[8:4] throughput class (5 bits)
0x1F0 = maximum throughput (sentinel)
Descriptor at a3+199 (BYTE, high bits):
[5:1] additional pipe flags
0x3E = all flags set (default)
Specific values: 0x04 (ALU), 0x08 (SFU), 0x0A (FP64), 0x0C (tensor)
Descriptor at a3+200 (WORD, 16 bits):
[4:0] read barrier mask (5 bits, 0x1F)
[9:5] write barrier mask (5 bits, 0x3E0)
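The packed fields of the DWORD at +196 reduce to shifts and masks. A sketch, with field names taken from the layout above:

```python
def decode_descriptor_196(d):
    """Unpack the scheduling descriptor DWORD at instr+196."""
    return {
        "latency_index": d & 0x1FF,                 # bits [8:0]
        "pipe_mask":     (d >> 15) & 0x1F,          # bits [19:15]
        "all_pipes":     (d & 0xF8000) == 0xF8000,  # default sentinel
    }

# Example: WGMMA primary -- latency index 0xF1 with pipe bits left at default.
wgmma_desc = 0xF8000 | 0xF1
```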
Opcode-to-Class Mapping
The switch statement maps Ori opcodes to scheduling class IDs. These IDs are stored at *(v8+4) where v8 is a pointer to the instruction's extended scheduling record. Representative mappings:
| Ori opcode | Scheduling class | Execution pipe | Description |
|---|---|---|---|
| 1 | 130 | sub-class 1 (0x10) | Control flow (BRA, JMP) |
| 2--7 (wide) | 683 | sub-class 4 (0x40), pipe 0xA | Wide FP64 operations |
| 2--7 (narrow, type 19) | 52 | sub-class 2 (0x20) | Integer ALU (narrow) |
| 2--7 (narrow, other) | 72 | sub-class 2 (0x20) | Integer ALU (standard) |
| 3, 5 (medium) | 140 | sub-class 3 (0x30) | FP32 operations |
| 4 (medium) | 131 | sub-class 2 (0x20) | Integer MAD |
| 6 (wide) | 140 | sub-class 4 (0x40), pipe 0xA | FP64 pair operations |
| 8 (flag bit set) | 3 | default | Predicate operations (true) |
| 8 (flag clear) | 2 | default | Predicate operations (false) |
| 0xA, 0xB, 0x6C, 0x95 | 200 | sub-class 2 (0x20) | Integer compare/logic |
| 0xA (extended) | 551 | default | Extended integer (wide encoding) |
| 0xA (extended, Mercury) | 694/700 | default | Mercury-era extended integer |
| 0xE | 5 | default | Conversion operations |
| 0x10 (atomic) | 575 | default | Atomic with flag |
| 0x10 (global) | varies | sub-class 4 (0x40) | Global memory load/store |
| 0x141 | 745 | latency 0xF1 | WGMMA (warpgroup MMA) |
| 0x142 (variant 3) | 744 | latency 0xF0 | WGMMA variant |
| 0x143 | 765--767 | latency 0xFB | BGMMA/QMMA tensor variants |
| 0x144 | 600 | latency 0xE6 | Tensor fence |
| 0x145, 0x146 | 759 | sub-class 4, pipe 0xC | Tensor core (HMMA/BMMA) |
| 0x147, 0x148 (wide) | 761 | latency 0xFA | Double-precision tensor (wide) |
| 0x147, 0x148 (narrow) | 757 | latency 0xF6 | Double-precision tensor (narrow) |
| 0x149 | 604 | latency 0xE7 | Tensor synchronization |
| 0x13E | 749 | latency 0xF4 | Bulk copy (ACQBULK) |
| 0x13F | 748 | latency 0xF3 | Bulk release (RELBULK) |
| 0x13D (variant) | 747/750 | latency 0xF2/0xF5 | Collective operations |
The scheduling class IDs span a wide range (2--772+). Classes below 256 correspond to legacy instruction categories; higher classes (551, 575, 600, 683, 694, 700, 744--767) represent newer instruction types added for Hopper and Blackwell architectures.
Latency Index Encoding
The low 9 bits of the descriptor at a3+196 encode a latency index that maps directly into the per-architecture HW table. The index is formed by combining the descriptor's low byte with a pipe mask:
latency_index = *(WORD*)(a3+196) & 0x1FF
Observed latency index values and their instruction classes:
| Index (hex) | Index (dec) | Instruction class |
|---|---|---|
| 0xE6 | 230 | Tensor fence / sync |
| 0xE7 | 231 | Tensor synchronization |
| 0xF0 | 240 | WGMMA variant |
| 0xF1 | 241 | WGMMA primary |
| 0xF2 | 242 | Collective op (variant A) |
| 0xF3 | 243 | Bulk release |
| 0xF4 | 244 | Bulk copy |
| 0xF5 | 245 | Collective op (variant B) |
| 0xF6 | 246 | DP tensor (narrow) |
| 0xF8 | 248 | Tensor core (HMMA/BMMA) |
| 0xFA | 250 | DP tensor (wide) |
| 0xFB | 251 | BGMMA/QMMA |
The highest index values (0xE6--0xFB) correspond to tensor and collective operations -- the most complex instructions with the longest and most architecture-variable latencies.
Functional Unit Categories
The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe on the SM.
10-Element Resource Vector
Resource tracking uses an 84-byte per-BB slot at *(scheduler+672) + 84 * slot_index:
| Index | Pipe name | Typical SASS instructions | Throughput (IPC) |
|---|---|---|---|
| 0 | Integer ALU (ALU) | IADD3, IMAD, ISETP, LOP3, SHF, IABS, POPC | 1 (full rate) |
| 1 | FP32 (FMA) | FADD, FFMA, FMUL, FSETP, FMNMX, FCHK | 1 (full rate) |
| 2 | FP64 (DFMA) | DADD, DFMA, DMUL, DSETP, DMNMX | 1/2 to 1/64 (SM-dependent) |
| 3 | Tensor core (MMA) | HMMA, IMMA, BMMA, BGMMA, WGMMA, QMMA | varies |
| 4 | Load/store (LSU) | LDG, STG, LDL, STL, LDS, STS, LDGSTS | 1 (full rate) |
| 5 | Texture (TEX) | TEX, TLD, TXQ, TLD4, TEXS | 1/2 to 1/4 |
| 6 | Control flow (BRA) | BRA, JMP, EXIT, RET, CALL, BRK, CONT | 1 |
| 7 | Shared memory (SMEM) | ATOMS, REDS, LDS, STS (atomic/reduce variants) | 1 |
| 8 | Special function (SFU) | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) | 1/4 |
| 9 | Uniform datapath (UDP) | UPLOP3, UISETP, UIMAD, uniform operations | 1 |
The resource vector layout within each 84-byte slot:
Offset Size Content
0..39 10 x int32 Current resource usage per FU (pipe 0..9)
40..79 10 x int32 Resource pressure delta (change from scheduling)
80..83 1 x int32 BB-entered flag and auxiliary state bits
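The 84-byte slot layout maps directly onto a struct unpack. A sketch with synthetic slot contents:

```python
import struct

SLOT_FMT = "<10i10ii"        # usage[10], delta[10], flags -- 84 bytes total

def decode_fu_slot(buf, slot_index):
    """Decode one 84-byte per-BB resource slot at *(scheduler+672)."""
    fields = struct.unpack_from(SLOT_FMT, buf, 84 * slot_index)
    return {
        "usage": list(fields[0:10]),    # current use per pipe 0..9
        "delta": list(fields[10:20]),   # pressure change from scheduling
        "flags": fields[20],            # BB-entered flag + aux state bits
    }
```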
Functional Unit Class Mapping (sub_8F0CD0)
A secondary mapper at sub_8F0CD0 translates (opcode, unit-name-string) pairs to numeric scheduling class IDs for the stall/barrier encoding stage:
| Opcode | Unit string | Class ID | Meaning |
|---|---|---|---|
| 40 | "LSU_T" | 15 | Texture load/store unit |
| 40 | "XU64" | 35 | Extended unit (64-bit ops) |
| 39 | "DMMA" | 118 | Double-precision matrix multiply |
| 53 | "DMMA" | 118 | DMMA (alternate opcode) |
| default | -- | 35 | Fallback to extended unit |
The "LSU_T" and "XU64" string tags appear in the Mercury-era post-scheduling pipeline where the SASS encoder needs to distinguish sub-pipes within the load/store and extended-precision units.
Functional Unit Query (sub_704D30)
sub_704D30 (14 KB) maps SASS opcode character codes to functional unit IDs for the Mercury encoder's latency model. The mapping uses single-character opcode identifiers:
| Char code | Decimal | FU ID | Unit |
|---|---|---|---|
'D' (68) | 68 | 40 | FP64 unit |
'E' (69) | 69 | 44 | Extended unit |
'F' (70) | 70 | 48 | FP32 unit |
'J' (74) | 74 | 52 | Integer unit |
'K' (75) | 75 | 56 | Conversion unit |
'L' (76) | 76 | 60 | Load/store unit |
'N' (78) | 78 | 32 | Tensor unit |
'S' (83) | 83 | 36 | Special function unit |
The function dispatches on *(config+372) >> 12 (the SM architecture selector) to handle architecture-specific unit mapping variations (e.g., Kepler vs Volta).
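The character-code table reduces to a dictionary lookup. A sketch covering the one architecture variant tabulated above; fallback behavior for unlisted codes is not confirmed:

```python
# Opcode character -> functional unit ID, per the table recovered
# from sub_704D30 (one architecture variant; others may differ).
CHAR_TO_FU = {
    "D": 40,  # FP64 unit
    "E": 44,  # extended unit
    "F": 48,  # FP32 unit
    "J": 52,  # integer unit
    "K": 56,  # conversion unit
    "L": 60,  # load/store unit
    "N": 32,  # tensor unit
    "S": 36,  # special function unit
}

def fu_id(opcode_char):
    return CHAR_TO_FU.get(opcode_char)   # None for codes not in the table
```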
Per-Architecture HW Latency Tables
Table Construction Pipeline
The HW latency tables are built during scheduler initialization by a chain of constructors:
sub_8E4400(profile, sm_id, sched_mode) // Warp-level parameters
|
v
sub_8E5CA0(profile, table_ptr, table_size) // Assemble output array
|
+-- sub_8E6760() // Group boundary markers
+-- sub_8E6950() // Barrier entries
+-- sub_8E6B40() // Standard scheduling entries
+-- sub_8E6F20() // Wait dependency entries
+-- sub_8E7110() // Scoreboard entries
|
v
sub_8E7300..sub_8E97B0(profile, ...) // SM-specific table population
|
v
sub_8E3AD0(output, count, entries, ...) // Copy into final profile
Each SM-specific function populates entries in the 96-byte-per-record output array. Records encode latency, throughput, pipe assignment, and barrier compatibility for each scheduling class.
96-Byte Schedule Record Format
Each record in the HW table occupies 96 bytes (6 x 16-byte XMM slots). Records are stored in a growable array at *(context+56) with count at *(context+64) and capacity at *(context+68). The array grows by 1.5x when full. Records are copied using three _mm_loadu_si128 operations (offsets 0, 16, 32) plus manual field-by-field copy for offsets 48--95; the string at +48 is reference-cloned via sub_714160 when the string-backed flag is set.
Offset Size Field Content
------ ---- ----- -------
0..1 WORD type_code Record type (see type table below)
2..3 WORD (padding) Zero
4..7 DWORD aux_size Type-dependent:
root (type 1): table_size
barrier ('M'): 128 (fixed)
wait/scoreboard ('5'/'6'): 36
sched entry (23): 0
8..15 8B (reserved) Zero
16..19 DWORD cost_product Scheduling cost (latency x throughput product)
- Standard entry (23): a2 * a3
- Category header ('!'): entry_count from config+528
- Wait/scoreboard: 280 (fixed sentinel)
- SM-specific (','): 4 * class_count
20..21 WORD base_latency Base latency in cycles (standard entries only)
22..23 WORD dual_issue_flags Dual-issue compatibility mask (standard entries only)
24..31 8B (reserved) Zero
32..39 QWORD data_ptr Pointer to type-specific data block:
- Root: parent profile object
- Wait/scoreboard: dependency tracking table
- Barrier: barrier data array
- Category headers: 0
40..47 QWORD data_size Byte count of data block at data_ptr:
- Root: table_size; barrier: 128
- Wait/scoreboard: 36; headers: 0
48 BYTE inline_flag 0 = data_ptr/data_size carry raw data
1 = this record uses the inline string buffer
49..63 15B inline_str_buf Inline NUL-terminated string (max 15 chars)
64..71 QWORD parent_ptr Back-pointer: SM-specific entries point to table
root; category headers point to profile object
72..79 8B (reserved) Zero
80..87 QWORD string_buf_ptr Pointer to growable string buffer (32-byte header:
data_ptr, size, capacity, allocator) for variable-
length sub-records; self-references +48 when inline
88 BYTE string_backed_flag 1 = record owns allocated string data at +80
0 = no allocated string (uses inline or none)
89..95 7B (padding) Zero
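The scalar header fields of a 96-byte record can be unpacked as follows. A sketch covering only the fixed fields; pointer fields are read as raw integers:

```python
import struct

def parse_schedule_record(rec):
    """Scalar fields of one 96-byte HW table record (layout above)."""
    assert len(rec) == 96
    type_code, _pad, aux_size = struct.unpack_from("<HHI", rec, 0)
    cost_product, base_latency, dual_issue = struct.unpack_from("<IHH", rec, 16)
    data_ptr, data_size = struct.unpack_from("<QQ", rec, 32)
    inline_str = rec[49:64].split(b"\x00", 1)[0].decode("ascii", "replace")
    return {
        "type": type_code, "aux_size": aux_size,
        "cost_product": cost_product, "base_latency": base_latency,
        "dual_issue_flags": dual_issue,
        "data_ptr": data_ptr, "data_size": data_size,
        "inline_flag": rec[48], "inline_str": inline_str,
    }
```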
Record Type Codes
Records are polymorphic -- the type code at offset +0 selects the interpretation of fields +16..+31, +32..+47, and the sub-record format stored in the growable buffer at +80.
| Type | ASCII | Creator | Role |
|---|---|---|---|
| 1 | -- | sub_8E5CA0 | Root container (wraps entire HW table) |
| 23 | -- | sub_8E6B40 | Standard scheduling entry (latency + throughput + dual-issue) |
| 33 | '!' | sub_8E5740 | Category header (begins a named section with string list) |
| 44 | ',' | sub_8E8480 et al. | SM-specific table entry (per-architecture class data) |
| 45 | '-' | sub_8E5CA0 | Barrier section header (links 128-byte barrier table) |
| 49 | '1' | sub_8E5530 | Dimension entries (contains 12-byte sub-records) |
| 53 | '5' | sub_8E7110 | Scoreboard entry (dependency tracking, data_size=36) |
| 54 | '6' | sub_8E6F20 | Wait dependency entry (dependency table, data_size=36) |
| 57 | '9' | sub_8E5740 | Category footer (closes the section opened by type 33) |
| 59 | ';' | sub_8E5310 | Variant section (contains 20-byte sub-records) |
| 60 | '<' | sub_8E6760 | Group boundary marker (separates scheduling groups) |
| 69 | 'E' | sub_8E6950 | Barrier entry (a2 = stall count in cost_product field) |
| 77 | 'M' | sub_8E6D40 | Barrier/sync data entry (data_ptr = barrier array, 128B) |
| 87 | 'W' | sub_8E4F20 | Supplementary weight entry (variable-length string data) |
Sub-Record Formats in the Growable Buffer (+80)
Records with string_backed_flag=1 carry variable-length sub-records in the growable buffer. The buffer header at *(record+80) is a 32-byte object: {data_ptr, size (DWORD), capacity (DWORD), allocator_ptr}.
Type 59 (';') -- Variant sub-records (20 bytes each):
Created by sub_8E5310 iterating the variant list at config+536:
Sub-record layout (20 bytes):
+0 DWORD source_data Variant source identifier
+4 WORD flags Variant flags
+6 WORD zero Reserved
+8 DWORD throughput_value Throughput for this variant
+12 DWORD aux_value Auxiliary parameter
+16 DWORD zero Reserved
The main record additionally stores: +16 = start_index (from config+544), +20 = record_index, +24 = back_ref to previous category.
Type 49 ('1') -- Dimension sub-records (12 bytes each):
Created by sub_8E5530 traversing the BST at config+592:
Sub-record layout (12 bytes):
+0 WORD node_flags BST node flags (from node+38)
+2 WORD zero Reserved
+4 DWORD node_value BST node value (from node+32)
+8 DWORD node_child BST node child pointer (low 32 bits of node+24)
Type 44 (',') -- SM-specific class descriptor (16 bytes + packed bitmasks):
Created by sub_8E8480 and other SM-specific builders, followed by a call to sub_8E3AD0 which appends packed bitmask DWORDs:
Initial 16-byte descriptor:
+0 DWORD class_flags = 2 Fixed flag value
+4 WORD zero Reserved
+8 QWORD mask Latency mask (0xFFFFFFFF00000000)
Followed by bitmask DWORDs (4 bytes each, one per 8 scheduling classes):
Each DWORD encodes 4 bits per entry (4 entries x 4 properties):
bit 4*i+0: entry[i].field_0 != 1
bit 4*i+1: entry[i].field_4 != 1
bit 4*i+2: entry[i].field_8 != 1
bit 4*i+3: entry[i].field_12 != 1
Source entries are 20 bytes apart in the input array.
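The bitmask packing appended by sub_8E3AD0 can be sketched from the description above. The != 1 polarity is as recovered; the 4-entries-per-DWORD grouping follows the bit formula given here:

```python
import struct

def pack_property_dword(entries_blob, base):
    """Pack 4 entries x 4 properties into one DWORD.
       Entries are 20 bytes apart; properties at +0, +4, +8, +12.
       Bit 4*i+p is set when entry i's property p != 1."""
    word = 0
    for i in range(4):
        off = base + 20 * i
        for p in range(4):
            (field,) = struct.unpack_from("<I", entries_blob, off + 4 * p)
            if field != 1:
                word |= 1 << (4 * i + p)
    return word
```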
Assembly Sequence
sub_8E5CA0 orchestrates the complete table by emitting records in this order:
- Barrier header (type '-', conditional on config+336): links the 128-byte barrier data table at config+272.
- Root container (type 1): data_ptr = profile_object, data_size = table_size.
- Category header + footer (types '!'/'9'): emitted by sub_8E5740, which enumerates named sections from config+520..528.
- Variant section (type ';'): emitted by sub_8E5310 if config+544 != 0.
- Supplementary weights (type 'W'): emitted by sub_8E4F20 if config+640 != -1.
- Dimension entries (type '1'): emitted by sub_8E5530 if config+608 > 0.
After all records are appended, the function computes the total serialized size (with 16-byte alignment padding per data block), allocates the output buffer, and writes a 32-byte header per record into the linear output at context+104.
Architecture Dispatch Table
| Address | SM | Architecture | Table size | Notes |
|---|---|---|---|---|
sub_8E7300 | sm_70 | Volta | 3.3 KB | First Turing-era table format |
sub_8E7540 | sm_72 | Xavier | 2.9 KB | Automotive Volta variant |
sub_8E7720 | sm_75 | Turing | 3.5 KB | Added INT8/INT4 tensor ops |
sub_8E7940 | sm_80 (base) | Ampere base | 2.9 KB | Shared base for sm_80/86/87 |
sub_8E7B40 | sm_80 | Ampere | 3.3 KB | Full Ampere with async copy |
sub_8E7D80 | sm_86 | GA10x | 4.4 KB | Consumer Ampere |
sub_8E8070 | sm_87 | Orin | 3.5 KB | Automotive Ampere |
sub_8E8280 | sm_89 | Ada Lovelace | 3.1 KB | Added FP8 tensor ops |
sub_8E8480 | sm_90 | Hopper | 5.2 KB | DPX, WGMMA, TMA |
sub_8E8780 | sm_90a | Hopper accel. | 4.6 KB | WGMMA async extensions |
sub_8E8A90 | sm_100 | Blackwell DC | 3.0 KB | 5th-gen tensor, TCGEN05 |
sub_8E8CB0 | sm_100 (short) | Blackwell DC | 949 B | Supplementary table |
sub_8E8DB0 | sm_103 | Blackwell Ultra | 1.7 KB | GB300 extensions |
sub_8E8F60 | sm_103 (short) | Blackwell Ultra | 618 B | Supplementary table |
sub_8E9000 | sm_120 | RTX 50xx | 2.9 KB | Consumer Blackwell |
sub_8E92E0 | sm_120 (ext) | RTX 50xx | 5.5 KB | Extended consumer table |
sub_8E97B0 | universal | Fallback | 8.8 KB | Default for unknown SM |
sm_90 (Hopper) has the largest combined table (5.2 + 4.6 KB including sm_90a), reflecting the complexity of WGMMA, DPX, and TMA scheduling. sm_120 extended (5.5 KB) is the largest single SM-specific table (only the 8.8 KB universal fallback is bigger), accommodating the consumer Blackwell feature set.
The "short" supplementary tables (sub_8E8CB0 for sm_100, sub_8E8F60 for sm_103) add entries for architecture-specific instructions not covered by the base table -- typically new tensor core variants and collective operations.
Warp-Level Hardware Profile (sub_8E4400)
sub_8E4400 maps the SM architecture ID (a2) to warp-level dispatch parameters stored in a 36-byte structure:
Architecture-to-Warp Mapping
| SM ID range | Warps per SM | Dispatch slots | Architecture era |
|---|---|---|---|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| 20480--24575 | 6 | 176 | sm_60 (Pascal) |
| 24576--28672 | 7 | 192 | sm_70 (Volta) |
| 28673--32767 | 7 | 208 | sm_75 (Turing) |
| 32768--36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |
The packed DWORD at offset +18 encodes (warps, sub-warp-count) as a 32-bit value. For example, 983055 (0x000F000F) = 15 warps in the low half and 15 in the high half, while 1048592 (0x00100010) = 16 warps for sm_90+.
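Decoding the packed DWORD is a straightforward halfword split; a small helper for inspecting recovered values:

```python
def split_warp_dword(value: int):
    """Split the packed (warps, sub-warp-count) DWORD at offset +18
    into its low and high 16-bit halves."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF
```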
Sub-Architecture Variants
Specific SM version IDs map to sub-architecture variant codes stored at offset +26:
| SM ID | Hex | Variant | Architecture |
|---|---|---|---|
| 8193 | 0x2001 | 2 | sm_50 (Maxwell Titan X) |
| 20481 | 0x5001 | 2 | sm_60 variant |
| 24576 | 0x6000 | 0 | sm_70 (Volta base) |
| 28674 | 0x7002 | 2 | sm_75 variant A |
| 28675 | 0x7003 | 3 | sm_75 variant B |
| 28676 | 0x7004 | 4 | sm_75 variant C |
| 28677 | 0x7005 | 5 | sm_75 variant D |
| 32768 | 0x8000 | 0 | sm_80 (Ampere base) |
| 36864 | 0x9000 | 0 | sm_90 (Hopper base) |
| 36867 | 0x9003 | 3 | sm_90 variant A |
| 36868 | 0x9004 | 4 | sm_90 variant B (sm_90a) |
| 36869 | 0x9005 | 5 | sm_90 variant C |
Pipeline Width (offset +24)
The scheduling mode parameter (a3) selects the pipeline width stored at offset +24. This value controls how many instructions the scheduler models as issuing per cycle:
| Mode | Value at +24 | Meaning |
|---|---|---|
| 1, 8, 9 | 1 | Single-issue |
| 3 | 4 | Quad-issue (tensor) |
| 4 | 5 | Penta-issue |
| 5 | 6 | Hexa-issue |
| 6 | 7 | Hepta-issue |
| 7 | 8 | Octa-issue |
| 10 | 9 | Nona-issue |
| 11 | 10 | Deca-issue |
| default | 2 | Dual-issue |
These values model the effective issue width for different scheduling contexts. The tensor core modes (4--11) reflect warpgroup-level cooperative execution where multiple warp slots issue tensor instructions simultaneously.
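The mode-to-width mapping transcribes directly into a lookup, with unknown modes falling back to dual-issue (2) per the default row:

```python
# Mode -> value stored at offset +24, from the table above.
_PIPE_WIDTH = {1: 1, 8: 1, 9: 1, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 10: 9, 11: 10}

def pipeline_width(mode: int) -> int:
    """Issue width the scheduler models for the given scheduling mode."""
    return _PIPE_WIDTH.get(mode, 2)  # default row: dual-issue
```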
Memory Space Classification (sub_693BC0)
sub_693BC0 (22 lines) classifies the memory space of load/store instructions. It extracts the last source operand from the instruction, looks up the register descriptor, and calls sub_91C840 to determine the memory space type. The function returns an integer code:
| Return value | Memory space | Typical latency range |
|---|---|---|
| 1 | Generic (resolved at runtime) | 20--200+ cycles |
| 2 | Local memory (per-thread stack) | 20--200 cycles |
| 3 | Shared memory | 20--30 cycles |
| 4 | Constant memory (cached) | 4--8 cycles |
| 7 | Constant bank (indexed) | 4--8 cycles |
| 11 | Surface memory | 200--500 cycles |
| 16 | Global memory (DRAM) | 200--500 cycles |
The scheduler uses these values in the priority function (sub_8C9320) to distinguish "hot" (long-latency) memory operations from "cold" (short-latency) ones: sub_A9CDE0 classifies hot (global/texture) memory, and sub_A9CF90 classifies cold (constant/shared) memory.
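One plausible hot/cold grouping follows from the latency ranges in the table; the exact code sets tested by sub_A9CDE0/sub_A9CF90 are an assumption, and the function name is descriptive:

```python
# Space codes from the table above, grouped by typical latency range.
HOT_SPACES = {1, 2, 11, 16}   # generic, local, surface, global: long latency
COLD_SPACES = {3, 4, 7}       # shared, constant, constant bank: short latency

def is_hot_memory(space_code: int) -> bool:
    """True for long-latency memory spaces (assumed grouping)."""
    return space_code in HOT_SPACES
```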
Long-Latency Detection (sub_8CCF80)
sub_8CCF80 checks if an instruction qualifies as "long-latency" for scheduling priority purposes. The function:
- Verifies the target architecture supports dual-issue via `sub_7DC0E0`.
- For opcode 183 (LD/ST variant): checks memory space via `sub_693BC0`. Memory spaces 4, 16, 2, 11, 3, 1, and 7 all qualify for long-latency classification.
- For opcode 130 (`HSET2` in the ROT13 name table; used as a generic internal marker): queries via vtable+640 whether the instruction is recognized as long-latency.
- Queries the scheduling oracle (`sub_8BF3A0`) for the instruction's estimated latency.
- Returns `true` if the estimated latency exceeds 19 cycles.
The threshold of 19 cycles is the boundary between "short-latency" instructions (ALU, FP32, shared memory) and "long-latency" instructions (global memory, texture, tensor core) that benefit from latency hiding through instruction reordering.
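The decision above reduces to a short predicate; this is a sketch, with the opcode-183 memory-space set and the 19-cycle threshold taken from the text and the latency estimate standing in for the sub_8BF3A0 oracle:

```python
LONG_LATENCY_THRESHOLD = 19  # cycles, per the text
LONG_LATENCY_SPACES = {1, 2, 3, 4, 7, 11, 16}  # spaces listed above

def is_long_latency(opcode: int, mem_space: int, est_latency: int) -> bool:
    """Approximation of sub_8CCF80's long-latency classification."""
    if opcode == 183 and mem_space in LONG_LATENCY_SPACES:
        return True
    return est_latency > LONG_LATENCY_THRESHOLD
```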
Resource Cost Model (sub_A08A00)
sub_A08A00 (345 lines) computes per-instruction resource costs for the 10-element functional unit vector. It operates in three modes selected by parameter a6:
Mode 0/1: Instruction Cost Initialization
Resets the instruction's resource tracking state:
- `a1[0]` = 0 (accumulated cost)
- `a1[1045]` = 0 (accumulated delta)
- `a1[2071]` = 0 (accumulated pressure)
- Byte at offset 8280 = 0 (flags)
Then computes per-operand resource contributions by iterating source operands (count at a3+80, starting at a3+84):
Mode 2: Differential Cost
Computes the differential cost (new minus old):
v55 = a1[0] // previous instruction cost
v56 = a1[1045] // previous delta cost
Then runs the same operand iteration as mode 1 and subtracts the previous values.
Mode 3: Pressure Accumulation
Adds the instruction's previously computed pressure a1[2071] into the running total at *(a5+24).
Per-Operand Cost Computation
For each source operand, the function:
- Checks operand type: `((operand >> 28) & 7) == 1` means register operand.
- Skips operands with values 41--44 (special sentinel registers).
- Looks up the register descriptor via `*(a1+88) + 8 * (operand & 0xFFFFFF)`.
- Checks if the register class `*(descriptor+64)` is <= 6 (physical register file).
- Calls `sub_A08910` to get the register's latency and count:
  - Returns the starting register index
  - Outputs count (`*a4`) and cost-per-register (`*a5`)
- Iterates over the register range, accumulating costs for registers not in the "already-consumed" bitmask at `*(a1+832)`.
The cost accumulation uses a 9-bit field in the instruction's scheduling word at offset +12, masked as & 0x1FF.
Register Latency Query (sub_A08910)
sub_A08910 (39 lines) returns the register index and cost for a single operand:
function GetRegisterLatency(context, reg_desc, operand, out_count, out_cost):
pipeline_bits = (reg_desc.field_48 >> 20) & 3
count = 1
cost = (pipeline_bits == 3) ? 2 : 1
*out_count = count
*out_cost = cost
if context.flags & 0x10: // dual-register tracking mode
return 2 * reg_desc.field_12 // doubled register index
else:
if context.flags & 0x08 and pipeline_bits != 1 and reg_desc.class == 6:
*out_cost = 2 * cost // double cost for wide registers
return reg_desc.field_12 // register index
The pipeline bits extracted from (reg_desc+48) >> 20 encode the register's pipeline affinity:
- Bits == 1: standard pipeline register
- Bits == 3: double-width register (costs 2 instead of 1)
- Other values: architecture-specific pipeline assignment
When dual-register tracking is active (context flag 0x10, controlled by knob 420), register indices are doubled to provide separate tracking for even/odd register halves.
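As a sanity check, the pseudocode above transliterates directly to Python; field names follow the pseudocode, and this is a model for experimentation, not the decompiled routine:

```python
def get_register_latency(flags: int, field_48: int, field_12: int,
                         reg_class: int):
    """Return (reg_index, count, cost) per the recovered logic above."""
    pipeline_bits = (field_48 >> 20) & 3
    count = 1
    cost = 2 if pipeline_bits == 3 else 1    # double-width register costs 2
    if flags & 0x10:                         # dual-register tracking mode
        return 2 * field_12, count, cost     # doubled register index
    if flags & 0x08 and pipeline_bits != 1 and reg_class == 6:
        cost *= 2                            # double cost for wide registers
    return field_12, count, cost
```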
Latency Hiding Statistics
The post-scheduling analysis pass (sub_73B360, MacLoopSchedulingAnalytics, 28.7 KB) computes and reports latency hiding effectiveness for four categories of long-latency operations:
| Category | String identifier | Stat function | Typical latency |
|---|---|---|---|
| Shared memory loads | "LDS latency hiding" | sub_73A1D0 | 20--30 cycles |
| Global memory loads | "LDG latency hiding" | sub_73A7F0 | 200--500 cycles |
| Extended 64-bit ops | "Xu64 latency hiding" | sub_73ADF0 | 15--30 cycles |
| Anti-dependencies | "Antidep latency hiding" | (inline) | varies |
Each category reports: Num (count of operations), Min (minimum hidden cycles), Max (maximum hidden cycles), Avg (average hidden cycles). The pass also tracks MAC instruction utilization ("MacInsts", "MacReuses", "TepidMacUtil") and resource busy time ("LsuResBusy", "Time", "TepidTime").
This analysis runs after scheduling is complete and drives feedback for the Mac Loop scheduler, which handles fused multiply-accumulate loop bodies. Knob 443 gates the MAC instruction classification.
Dual-Issue Rules
Dual-issue scheduling is controlled by sub_8CF5D0 (CheckDualIssueEligibility, 3.5 KB) and implemented by sub_8B77C0 (DualIssueScheduler, 15 KB) with pairing logic in sub_8BDC40 (7.9 KB).
Eligibility Check
sub_8CF5D0 returns 0 (no dual-issue) if:
- The target architecture does not support dual-issue (`sub_7DC0E0` returns false).
- Function flag bit 2 at `func+1368` is set (incompatible function).
When eligible, the function iterates basic blocks checking instruction pairs:
- `sub_A9CDE0(instr)`: returns true if the instruction is dual-issuable (hot = global/texture).
- `sub_A9CF90(instr)`: returns true if the instruction can pair with the next (cold = constant/shared).
The dual-issue benefit score is stored at scheduler+328 and used by the priority function to bias toward instruction pairs that can co-issue.
Dual-Issue Constraints
Dual-issue pairs must satisfy:
- Pipe compatibility: the two instructions must target different functional units (e.g., ALU + FP32, or ALU + load/store). Same-pipe pairs cannot dual-issue.
- Register conflict: the pair must not have RAW dependencies on the same register within the same cycle.
- Barrier compatibility: neither instruction may be waiting on a scoreboard barrier.
- Architecture support: dual-issue is primarily an sm_50 (Maxwell) feature. Newer architectures (sm_70+) use wider warp schedulers instead.
For sm_50, a special register budget function adjusts the register allocation target to account for the reduced register pressure from dual-issue execution.
Stall Count Computation
The stall count determines how many cycles an instruction must wait before it can issue. Stalls are computed by sub_8D3E20 (2.1 KB) and encoded by sub_8F3130 (1.0 KB).
Stall Encoding in Control Words
Each SASS instruction carries a stall count in its control word:
- Maximum stall: 16 cycles (capped by knobs 805 and 806).
- Minimum stall: 1 cycle (no zero-stall encoding exists).
- Default stall when no dependency: determined by the HW profile's pipeline depth.
The stall/barrier encoding pipeline (sub_8D7760, 41 KB) computes stalls by walking the dependency DAG backward from each instruction:
function ComputeStallCycles(sched, instr):
max_wait = 0
for each predecessor of instr:
distance = instr.cycle - pred.cycle
latency = LookupLatency(sched, pred, instr)
wait = latency - distance
max_wait = max(max_wait, wait)
return min(max_wait, MaxStallFromKnob(sched))
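An executable version of the walk above; predecessors are modeled as (pred_cycle, latency) pairs, and `max_stall` stands in for the knob-derived cap:

```python
def compute_stall_cycles(instr_cycle: int, predecessors, max_stall: int = 16):
    """Backward walk over dependency edges, per the pseudocode above."""
    max_wait = 0
    for pred_cycle, latency in predecessors:
        distance = instr_cycle - pred_cycle   # cycles already elapsed
        max_wait = max(max_wait, latency - distance)
    return min(max_wait, max_stall)           # clamp to the knob cap
```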
The encoding function sub_8F4140 packs the complete control word:
| Field | Encoder | Bits | Range |
|---|---|---|---|
| Stall count | sub_8F3130 | 4 | 1--16 cycles |
| Yield hint | sub_8F3650 | 1 | 0/1 |
| Read barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Write barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Scoreboard wait | sub_8F3860 | 6 | barrier wait mask |
| Reuse flags | (separate) | 4 | register reuse hints |
Sentinel Values
The scheduling system uses several sentinel values:
| Value | Meaning |
|---|---|
| -1 (0xFFFFFFFF) | Unscheduled instruction position |
| 0x1869F (99999) | Infinite latency sentinel |
| 0xFFFFFFFF | Batch window sentinel (DynBatch) |
Resource Cost Accumulation
sub_8C67A0 (ComputeResourceCost, 3.7 KB) drives the per-instruction resource accounting. It calls the resource model sub_A08A00 three times per instruction:
function ComputeResourceCost(sched, instr):
slot = GetResourceSlot(sched, instr)
slot.bb_entered |= 1
// Phase 1: Instruction's own execution cost
sub_A08A00(sched, instr, instr_data, output, slot, mode=1)
// Accumulate: slot[0..9] += output[0..9] (SSE _mm_add_epi32)
// Phase 2: Operand release costs (for last-use operands)
sub_A08A00(sched, instr, instr_data, output, slot, mode=2)
// Accumulate delta: slot[10..19] += output[0..9]
// Phase 3: Combined instruction + BB-level impact
sub_A08A00(sched, instr, instr_data, output, slot, mode=3)
// Accumulate pressure into slot[20]
The SSE-optimized accumulation uses _mm_add_epi32 to add 4 resource counters at a time, processing the full 10-element vector in 3 SSE iterations (4 + 4 + 2).
Cutlass-Specific Scheduling
sub_8F47E0 detects NVIDIA cutlass GEMM kernels by calling strstr(function_name, "cutlass"). When detected, the scheduler activates hand-tuned scheduling parameters for matrix multiplication inner loops. This includes:
- Modified stall counts for the HMMA/WGMMA instruction sequences.
- Adjusted register pressure targets.
- Specific barrier placement patterns for double-buffered shared memory.
This reflects NVIDIA's investment in hand-tuning their cutlass library's scheduling behavior within ptxas itself.
Execution Pipe Assignment (sub_13710B0)
sub_13710B0 (7.1 KB, 1,088 lines decompiled) is the SASS-backend execution pipe class assigner. It runs in the SASS encoding pipeline (address range 0x1370--0x139F) after instruction selection, register allocation, and the main scheduling pass are complete. Where sub_89FBA0 assigns IR-level scheduling class IDs (2--772+) consumed by the priority and stall-computation passes, sub_13710B0 writes SASS-level pipe class IDs (0x00--0x141) that control control-word encoding: stall counts, barrier assignments, and dual-issue pairing in the final binary.
Descriptor Initialization
Before dispatching on the opcode, the function initializes the scheduling descriptor at a3+196..202 to the "all-pipes" default:
*(DWORD*)(a3+196) |= 0xF8000 // pipe mask = all (bits 15..19)
*(BYTE*)(a3+200) |= 0x1F // read barrier mask = all
*(WORD*)(a3+198) = HIWORD | 0x1F0 // throughput class = max
*(WORD*)(a3+200) |= 0x3E0 // write barrier mask = all
*(BYTE*)(a3+199) = ... | 0x3E // pipe flags = all set
Then it switches on *(a2+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked), writing a 9-bit pipe class into the low bits of *(WORD*)(a3+196) and optionally overriding the pipe mask, sub-class, and pipe flags.
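The opcode masking and pipe-class splice can be sketched with two bit helpers; the function names are descriptive, and the 9-bit field width and mask come from the text:

```python
def normalized_opcode(raw_opcode: int) -> int:
    """Strip the modifier bits (12--13) before the opcode switch."""
    return raw_opcode & 0xFFFFCFFF

def write_pipe_class(word_196: int, pipe_class: int) -> int:
    """Splice a 9-bit pipe class into the low bits of the WORD at a3+196,
    preserving the upper bits (pipe mask, etc.)."""
    return (word_196 & ~0x1FF) | (pipe_class & 0x1FF)
```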
Pipe Mask Encoding
Bits 15--19 of *(DWORD*)(a3+196) select the execution pipe:
| Value | Pipe | Functional units | Resource vector indices |
|---|---|---|---|
0x08000 | Pipe A | ALU, integer, FP64, conversion | 0 (ALU), 2 (DFMA) |
0x10000 | Pipe B | FP32, tensor, SFU, half-precision | 1 (FMA), 3 (MMA), 8 (SFU) |
0x18000 | Pipe C | Memory, texture, wide FP64 | 4 (LSU), 5 (TEX) |
0xF8000 | All | Default sentinel (no constraint) | -- |
Sub-Class Encoding
Bits 4--7 of *(WORD*)(a3+198) encode the sub-class within the pipe:
| Value | Sub-class | Instruction category |
|---|---|---|
0x10 | Control flow | Branch, predicate, miscellaneous |
0x20 | Integer ALU | Conversion, barrier, integer ops |
0x30 | FP32 / SFU | Single-precision, half-precision |
0x40 | FP64 / Tensor | Double-precision wide, tensor core |
Pipe Flags Encoding
Bits 1--5 of *(BYTE*)(a3+199) encode sub-unit affinity:
| Value | Meaning |
|---|---|
0x02 | Narrow ALU sub-unit |
0x04 | ALU (integer / conversion) |
0x06 | Load/store or wide ALU |
0x08 | SFU / half-precision pipe |
0x0A | FP64 wide (double-precision) |
0x0C | Tensor core pipe |
0x3E | All flags set (default) |
Opcode-to-Pipe-Class Mapping
The complete switch covers 80+ Ori opcodes. Representative mappings:
| Ori opcode | Pipe class | Pipe | Sub-class | SASS instruction | Decision logic |
|---|---|---|---|---|---|
| 1 | 0x08 | -- | 0x10 | IMAD | Always |
| 2--7 (wide) | 0x03 | B (0x10000) | 0x30 | IMAD_WIDE, IADD3, etc. | sub_7D6780 = true |
| 2--7 (wide, v6=6) | 0x03 | C (0x18000) | 0x40 | LOP3 (wide, FP64) | Opcode 6, wide |
| 2--7 (narrow) | 0x0C | A (0x08000) | -- | IMAD, IADD3, etc. | Narrow, type != 19 |
| 2--7 (narrow, t=19) | 0x7B | -- | -- | IMAD (BF16/FP8 type) | Type 19 path |
| 8 (flag clear) | 0x33 | -- | -- | IABS (no guard) | Operand flag bit 0 |
| 8 (flag set) | 0x34 | -- | -- | IABS (guarded) | Operand flag bit 0 |
| 0x10 (flagged) | 0x68 | -- | -- | ATOM (flagged) | Operand bit 2 |
| 0x10 (mem=3) | 0x67 | -- | -- | ATOM (shared) | sub_7DFFC0 = 3 |
| 0x10 (mem=4) | 0x69 | -- | -- | ATOM (constant) | sub_7DFFC0 = 4 |
| 0x10 (other) | 0x66 | -- | -- | ATOM (global) | Default |
| 0x12 (no 0x400) | 0x3D | -- | -- | FADD (standard) | Operand bit 10 clear |
| 0x12 (0x400 set) | 0x78 | -- | -- | FADD (const-bank) | Operand bit 10 set |
| 0x17 (op1 reg6) | 0x37 | -- | -- | S2R (tensor reg, op1) | *(desc+64) = 6 |
| 0x17 (op2 reg6) | 0x36 | -- | -- | S2R (tensor reg, op2) | *(desc+64) = 6 |
| 0x17 (other) | 0x38 | -- | -- | S2R (standard) | Neither operand reg6 |
| 0x18 | 0x04 | A (0x08000) | 0x20 | FSETP | Always |
| 0x24 (wide) | 0x14 | B (0x10000) | 0x30 | PRMT (FP width) | sub_7D6780 = true |
| 0x24 (narrow) | 0x11 | B (0x10000) | 0x30 | PRMT (integer) | sub_7D6780 = false |
| 0x33 | 0x21 | A (0x08000) | 0x20 | IDP | Always; flags 0x06 |
| 0x3C (mem ops) | 0x2B--0x32 | -- | -- | STG variants | 6-way split on flags |
| 0x3E (mem ops) | 0x2D--0x2E | -- | -- | LDL variants | Flag / no-flag split |
| 0x42 | 0x5D | -- | -- | MUFU (SFU) | Always |
| 0x4D | 0x84--0x85 | B (0x10000) | 0x40 | WGMMA-class | Extended tensor fields |
| 0x4E (mem ops) | 0x2F--0x30 | -- | -- | LD (generic) | Flag / no-flag split |
| 0x66 | 0x09 | B (0x10000) | 0x30 | DEPBAR | Always; flags 0x08 |
| 0x82 / 130 (ext) | 0x17 | -- | -- | NANOTRAP (extended); HSET2 in ROT13 | sub_A9AB10 = true |
| 0x82 / 130 (ctrl) | 0x13 | all (0xF8000) | 0x10 | NANOTRAP (control); HSET2 in ROT13 | vtable+640 |
| 0xC9--0xCA (wide) | 0x07 | A (0x08000) | -- | DFMA, DADD (wide) | sub_7D6780 = true |
| 0xD1 | 0x05 | A (0x08000) | 0x20 | DFMA | Always |
| 0xD2 | 0x0A | A (0x08000) | 0x30 | DFMA variant | Sub-class 0x30, flag 0x04 |
| 0xF0 | 0x0F | A (0x08000) | -- | F2F | Flags 0x04 |
| 0x10E | 0x7E | B (0x10000) | -- | HMMA_16 | Flags 0x08 |
| 0x117 | 0x80 | B (0x10000) | 0x40 | HMMA_32 | Tensor pipe; flags 0x0C |
| 0x11A | 0x81 | B (0x10000) | 0x40 | IMMA | Tensor pipe |
| default | 0x88 | -- | -- | (unrecognized) | Sentinel |
Decision Axes
The function dispatches on three axes beyond the opcode:
- Data type width: `sub_7D6780(*(a2+76))` returns true for wide types (FP64). Wide types route to pipe A or C with sub-class 0x30 or 0x40; narrow types route to pipe A with sub-class 0x20.
- Memory access classification: `sub_7DFFC0(a2, code_obj)` returns a memory space code (3 = shared, 4 = constant). Used for ATOM (case 0x10) to split into 4 pipe classes by memory space.
- Operand register class: `*(descriptor+64)` from the register descriptor. Class 6 (tensor/accumulator register file) triggers distinct pipe classes for S2R (case 0x17) and DFMA/DADD variants.
Additionally, two architectural gates control tensor instruction classes:
- The `*(a1+25)` flag and `sub_1370F40` gate tensor-extended pipe classes. When disabled, tensor instructions fall through to class 0x141 (a sentinel).
- `vtable+3184` on the code object checks a feature gate for CALL instruction classification.
Memory Instruction Pipe Variants
Load/store instructions (cases 0x3C, 0x3E, 0x4E) receive a 6-way pipe class split based on two properties:
| Property | Test method |
|---|---|
| Same-source vs different-source | sub_91E7A0(a2, 0) vs sub_91E7A0(a2, 1) |
| Has flag operand | sub_91E860(code_obj, a2, i) returns 8 |
| Variant | STG (0x3C) | LDL (0x3E) | LD (0x4E) |
|---|---|---|---|
| Same-src, no flag | 0x31 | (n/a) | (n/a) |
| Same-src, flagged | 0x32 | (n/a) | (n/a) |
| Diff-src, no flag | 0x2B | 0x2D | 0x2F |
| Diff-src, flagged | 0x2C | 0x2E | 0x30 |
This fine-grained split allows the SASS encoder to select different stall counts and barrier patterns depending on whether the load/store has a predicate guard and whether the source address register is shared with another operand.
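The 6-way split is a pure table lookup; a direct transcription of the variant table, keyed on (opcode, same-source, flagged):

```python
# (opcode, same_source, flagged) -> pipe class, from the variant table.
_MEM_PIPE_CLASS = {
    (0x3C, True,  False): 0x31, (0x3C, True,  True):  0x32,
    (0x3C, False, False): 0x2B, (0x3C, False, True):  0x2C,
    (0x3E, False, False): 0x2D, (0x3E, False, True):  0x2E,
    (0x4E, False, False): 0x2F, (0x4E, False, True):  0x30,
}

def mem_pipe_class(opcode: int, same_source: bool, flagged: bool) -> int:
    """Pipe class for a load/store variant (STG/LDL/LD cases)."""
    return _MEM_PIPE_CLASS[(opcode, same_source, flagged)]
```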
Type-19 Special Path
When sub_7D6780 returns false (not wide) and *(a2+76) == 19, several instruction groups receive distinct pipe classes in the 0x7A--0x7D range:
| Ori opcode group | Standard class | Type-19 class | Likely type |
|---|---|---|---|
| 2--7 (narrow) | 0x0C | 0x7B | BF16 / FP8 |
| 0x6E--0x72 (narrow) | 0x0B | 0x7A | BF16 / FP8 |
| 0x8B--0x8C (narrow) | 0x0D | 0x7C | BF16 / FP8 |
| 0xC9--0xCA | 0x10/0x12 | 0x7D | BF16 / FP8 |
Type 19 likely corresponds to BF16 or FP8, which require different pipeline routing than standard FP16/FP32/FP64 types on Hopper and Blackwell architectures.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_693BC0 | 22 lines | MemorySpaceClassify -- return memory space code | HIGH |
sub_695530 | 606 lines | ComputeLatencies -- per-BB latency computation | HIGH |
sub_704D30 | 14 KB | GetFunctionalUnit -- SASS opcode to FU mapping | HIGH |
sub_73A1D0 | ~6 KB | LDSLatencyStats -- shared memory latency stats | HIGH |
sub_73A7F0 | ~6 KB | LDGLatencyStats -- global memory latency stats | HIGH |
sub_73ADF0 | 6.5 KB | XU64LatencyStats -- extended unit latency stats | HIGH |
sub_73B360 | 28.7 KB | MacLoopSchedulingAnalytics -- latency hiding report | HIGH |
sub_799860 | 2.9 KB | ClassifyInstructionLatency | HIGH |
sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode scheduling class | HIGH |
sub_8B5400 | 14 KB | ScheduleForLatency -- latency-optimized scheduling | MEDIUM |
sub_8B77C0 | 15 KB | DualIssueScheduler -- dual-issue scheduling engine | MEDIUM |
sub_8BDC40 | 7.9 KB | DualIssuePairing -- instruction pair selection | MEDIUM |
sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility | HIGH |
sub_8D3E20 | 2.1 KB | ComputeStallCycles -- required stall count | HIGH |
sub_8D7760 | 41 KB | StallAndBarrierInsertion -- encode stalls/barriers | HIGH |
sub_8E3AD0 | -- | CopyProfileEntries -- finalize HW table | MEDIUM |
sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params | HIGH |
sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST | HIGH |
sub_8E4D80 | 15 lines | StringRefCleanup -- decref string in record copy | HIGH |
sub_8E4F20 | ~1.5 KB | EmitWeightEntry -- supplementary weight record (type 'W') | HIGH |
sub_8E5310 | ~1.5 KB | EmitVariantSection -- variant sub-records (type ';') | HIGH |
sub_8E5530 | ~1.5 KB | EmitDimensionEntries -- dimension sub-records (type '1') | HIGH |
sub_8E5CA0 | 20 KB | EmitScheduleOutput -- scheduling control words | HIGH |
sub_8E6760 | 2.9 KB | EmitGroupBoundary -- group boundary marker | HIGH |
sub_8E6B40 | 2.9 KB | EmitSchedEntry -- standard scheduling entry | HIGH |
sub_8E6D40 | 2.9 KB | EmitBarrierEntry -- barrier/sync entry | HIGH |
sub_8E6F20 | 2.9 KB | EmitWaitEntry -- wait dependency entry | HIGH |
sub_8E7110 | 2.9 KB | EmitScoreboardEntry -- scoreboard entry | HIGH |
sub_8E7300 | 3.3 KB | HWTable_sm70 -- Volta latency table | CERTAIN |
sub_8E7540 | 2.9 KB | HWTable_sm72 -- Xavier latency table | CERTAIN |
sub_8E7720 | 3.5 KB | HWTable_sm75 -- Turing latency table | CERTAIN |
sub_8E7940 | 2.9 KB | HWTable_sm80_base -- Ampere base table | CERTAIN |
sub_8E7B40 | 3.3 KB | HWTable_sm80 -- Ampere full table | CERTAIN |
sub_8E7D80 | 4.4 KB | HWTable_sm86 -- GA10x table | CERTAIN |
sub_8E8070 | 3.5 KB | HWTable_sm87 -- Orin table | CERTAIN |
sub_8E8280 | 3.1 KB | HWTable_sm89 -- Ada Lovelace table | CERTAIN |
sub_8E8480 | 5.2 KB | HWTable_sm90 -- Hopper table | CERTAIN |
sub_8E8780 | 4.6 KB | HWTable_sm90a -- Hopper accelerated table | CERTAIN |
sub_8E8A90 | 3.0 KB | HWTable_sm100 -- Blackwell DC table | CERTAIN |
sub_8E8CB0 | 949 B | HWTable_sm100_short -- Blackwell supplementary | CERTAIN |
sub_8E8DB0 | 1.7 KB | HWTable_sm103 -- Blackwell Ultra table | CERTAIN |
sub_8E8F60 | 618 B | HWTable_sm103_short -- BU supplementary | CERTAIN |
sub_8E9000 | 2.9 KB | HWTable_sm120 -- RTX 50xx table | CERTAIN |
sub_8E92E0 | 5.5 KB | HWTable_sm120_ext -- RTX 50xx extended | CERTAIN |
sub_8E97B0 | 8.8 KB | HWTable_universal -- fallback table | CERTAIN |
sub_8E9DC0 | 4.8 KB | EmitLatencyEntry -- HW table entry helper | HIGH |
sub_8EFA10 | 18 KB | EmitScheduleReport -- statistics output | HIGH |
sub_8F0CD0 | 24 B | MapFUClassID -- (opcode, name) to class | HIGH |
sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output | HIGH |
sub_8F3130 | 1.0 KB | EncodeStallField | HIGH |
sub_8F31F0 | 6.1 KB | EncodeBarrierField | HIGH |
sub_8F3650 | 2.7 KB | EncodeYieldField | HIGH |
sub_8F3860 | 3.0 KB | EncodeScoreboardField | HIGH |
sub_8F4140 | 5.6 KB | EncodeFullControlWord | HIGH |
sub_8F47E0 | ~50 B | DetectCutlass -- strstr for "cutlass" | CERTAIN |
sub_A08910 | 39 lines | GetRegisterLatency -- operand cost query | HIGH |
sub_A08A00 | 345 lines | ResourceModel -- 3-mode FU cost computation | HIGH |
sub_A09530 | 91 lines | UpdateStallCycles -- per-instruction stall update | HIGH |
sub_A9CDE0 | -- | IsHotMemory -- global/texture classification | HIGH |
sub_A9CF90 | -- | IsColdMemory -- constant/shared classification | HIGH |
sub_13710B0 | 7.1 KB | AssignPipeClass -- SASS-level pipe assignment | HIGH |
sub_1370F40 | ~500 B | CheckTensorFeature -- gates tensor pipe classes | HIGH |
sub_7D6780 | ~100 B | IsWideType -- true for FP64/wide types | HIGH |
sub_7DFFC0 | ~200 B | ClassifyMemAccess -- 3=shared, 4=constant | HIGH |
sub_7E3640 | ~100 B | GetCustomPipe -- 5-bit pipe sub-class | MEDIUM |
sub_91E7A0 | ~100 B | GetSrcEncoding -- source operand encoding query | MEDIUM |
sub_91E860 | ~100 B | GetOperandType -- operand type code | MEDIUM |
sub_A9AB10 | ~100 B | NeedsExtEncoding -- extended encoding check | MEDIUM |
Cross-References
- Scheduler Overview -- 3-phase architecture, HW profile table summary
- Scheduling Algorithm -- priority list scheduling, resource vector usage
- Scoreboards & Barriers -- scoreboard encoding, dependency barriers
- SASS Encoding -- control word format in SASS binary
- Targets Index -- SM architecture map and version codes
- Knobs -- scheduling knobs (740, 741, 805, 806, etc.)
Scoreboards & Dependency Barriers
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The NVIDIA GPU hardware uses a software-managed scoreboard system for instruction-level hazard tracking. Unlike CPUs that detect dependencies in hardware, NVIDIA's warp schedulers rely on per-instruction metadata -- encoded in a control word -- to determine when an instruction's operands are available, when a warp should yield, and which dependency barriers to set or wait on. ptxas generates this metadata in three pipeline phases (114--116) that together produce the final scoreboard annotations embedded in the SASS binary.
| Phase 114 | FixUpTexDepBarAndSync -- texture dependency barrier fixup |
| Phase 115 | AdvancedScoreboardsAndOpexes -- full scoreboard generation (-O1+) |
| Phase 116 | ProcessO0WaitsAndSBs -- conservative scoreboard insertion (-O0) |
| Control word generator | sub_A36360 (52 KB) -- per-instruction control word encoder |
| Scheduling heuristic | sub_A23CF0 (54 KB) -- DAG list scheduler with dependency analysis |
| Instruction dispatcher | sub_85C890 -- opcode-based fast-path / slow-path router |
| Mercury opex pass | sub_6FFDC0 (66 KB) -- MercGenerateOpex, phase 120 |
| HW barrier limit | 6 dependency barriers per warp (hardware constraint) |
Control Word Format
Every SASS instruction carries scheduling metadata in a control word. On pre-Volta architectures (sm_50--sm_62), the control words for a group of 3 real instructions are packed into a dedicated scheduling control word that precedes the group; from sm_70 onward, the control bits are instead embedded in the upper bits of each 128-bit instruction word. The control word encodes stall counts, yield hints, dependency barrier set/wait operations, and source operand reuse flags.
Ori IR Control Word (Internal Representation)
Within ptxas, the control word is stored in the Ori IR instruction node at offsets +196 through +200. sub_A36360 generates the fields, and per-field encoder functions write individual bit ranges.
The internal representation uses wider fields than the final SASS encoding to allow the encoder to track additional state during scoreboard computation:
| Field | Bits | Range | Description |
|---|---|---|---|
| Stall count | 4 | 0--15 | Minimum cycles to wait before issuing this instruction |
| Yield flag | 1 | 0--1 | Hint to warp scheduler: yield execution to another warp |
| Write barrier index | 3 | 0--5 | Which barrier register this instruction's result writes to |
| Read barrier mask | 6 | 0--63 | Bitmask of barriers this instruction must wait for (reads) |
| Wait barrier mask | 6 | 0--63 | Bitmask of barriers this instruction clears upon completion |
| Reuse flags | 6 | 0--63 | Per-source-operand register reuse cache hints |
Total: 26 bits of scheduling metadata per instruction in the internal representation.
SASS Control Word (Binary Encoding)
In the final SASS binary, the control word occupies 23 bits per instruction. On sm_70+ these bits are embedded in the upper portion of the instruction's own 128-bit word; on pre-Volta targets, the per-instruction slots are instead packed into a shared scheduling control word emitted once per group of 3 instructions, yielding a 4:3 encoding-to-instruction ratio:
Shared control word (one per 3 instructions, pre-Volta packing):
┌─────────┬─────────┬─────────┬──────────────────┐
│ Slot 0 │ Slot 1 │ Slot 2 │ Reserved / flags │
└─────────┴─────────┴─────────┴──────────────────┘
Per-instruction 23-bit control layout (sm_70+):
bits [3:0] Stall count (4 bits, values 0--15)
bit [4] Yield flag (1 bit)
bits [7:5] Write barrier index (3 bits, values 0--5; 7 = none)
bits [13:8] Read barrier mask (6 bits, one-hot per barrier)
bits [19:14] Wait barrier mask (6 bits, one-hot per barrier)
bits [22:20] Reserved / extended flags (3 bits)
The reuse flags (6 bits per instruction) are encoded separately in the instruction word itself at architecture-defined bit positions, not in the scheduling control instruction.
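The bit positions in the listing above can be captured in pack/unpack helpers for inspecting recovered control words; field names are descriptive, not recovered symbols:

```python
def pack_ctrl(stall, yield_flag, wr_bar, read_mask, wait_mask, rsvd=0):
    """Pack the 23-bit per-instruction control fields (layout above)."""
    assert stall < 16 and wr_bar < 8 and read_mask < 64 and wait_mask < 64
    return (stall | (yield_flag & 1) << 4 | wr_bar << 5
            | read_mask << 8 | wait_mask << 14 | rsvd << 20)

def unpack_ctrl(word):
    """Inverse of pack_ctrl: split a 23-bit control word into fields."""
    return {"stall": word & 0xF, "yield": (word >> 4) & 1,
            "wr_bar": (word >> 5) & 7, "read_mask": (word >> 8) & 0x3F,
            "wait_mask": (word >> 14) & 0x3F, "rsvd": (word >> 20) & 7}
```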
Bit-Field Diagram
22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ rsvd │ wait mask │ read mask │ wr_bar │ Y │ stall │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Hardware Dependency Barrier Model
NVIDIA GPUs (sm_70+) provide 6 dependency barrier registers per warp, numbered 0 through 5. These are finite hardware resources managed entirely by software (ptxas). The barrier mechanism works as follows:
-
Set barrier (write barrier): When an instruction with long latency (e.g., global memory load, texture fetch) is issued, the compiler assigns it a barrier index from the pool of 6. The hardware marks that barrier as "pending" and associates it with the instruction's completion.
-
Wait on barrier (read barrier / wait mask): When a subsequent instruction needs the result of the long-latency operation, the compiler sets the corresponding bit in the wait mask. The warp scheduler stalls the instruction until the barrier clears.
-
Barrier release: When the long-latency operation completes, the hardware automatically clears the associated barrier register, allowing waiting instructions to proceed.
The key constraint is the hardware limit of 6 simultaneous barriers. If a basic block has more than 6 outstanding long-latency operations, the compiler must either:
- Reuse a barrier (wait for an earlier operation to complete before reassigning its barrier)
- Insert explicit stall cycles to serialize operations
- Use the DEPBAR instruction to manage barrier state programmatically
Stall Count vs. Dependency Barriers
The stall count and dependency barriers serve complementary purposes:
| Mechanism | Latency Range | Use Case |
|---|---|---|
| Stall count (4 bits) | 0--15 cycles | Short-latency operations: ALU (4--6 cycles), shared memory (20--30 cycles when stall is sufficient) |
| Dependency barriers | Arbitrary | Long-latency operations: global memory (200--800 cycles), texture (200--400 cycles), where stall count is insufficient |
For operations with latency <= 15 cycles, the stall count alone suffices. For longer latencies, a dependency barrier must be used because 4 bits cannot encode delays beyond 15 cycles. The yield flag provides an additional hint: when set, it tells the warp scheduler that this warp is about to stall and should be descheduled in favor of another ready warp.
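The decision rule reduces to a comparison against the 4-bit ceiling. A minimal sketch (the yield threshold of 4 is the typical sm_80+ value quoted later on this page, not a recovered constant):

```python
YIELD_THRESHOLD = 4   # architecture-dependent; typical sm_80+ value

def schedule_hint(latency_cycles):
    """Return (mechanism, stall, yield) for a producer of the given latency.

    A sketch of the decision rule described above, not recovered code.
    """
    if latency_cycles <= 15:
        # Short latency: the 4-bit stall field covers it directly.
        stall = latency_cycles
        return ("stall", stall, int(stall >= YIELD_THRESHOLD))
    # Too long for the 4-bit field: allocate a dependency barrier and
    # hint the scheduler to deschedule this warp in favor of another.
    return ("barrier", 0, 1)

assert schedule_hint(6) == ("stall", 6, 1)      # ALU-ish latency
assert schedule_hint(300) == ("barrier", 0, 1)  # global-load-ish latency
```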
Phase 114: FixUpTexDepBarAndSync
Phase 114 runs a texture-specific fixup pass on dependency barriers and synchronization instructions. It operates through a double-indirect vtable dispatch: the architecture backend object contains a secondary scheduling/scoreboard subsystem object (at offset +16), which provides the actual implementation through its vtable at offset +0x70.
Purpose
Texture operations have complex dependency patterns that the general scoreboard pass may not handle optimally:
- Texture fetches have variable latency depending on cache hit rates
- Texture coordinates and sampler state create additional dependencies
- The texture pipeline has its own internal buffering that interacts with scoreboard barriers
Phase 114 patches up the dependency barrier assignments made by the scheduler to account for these texture-specific requirements. It may:
- Reassign barrier indices for texture operations to avoid conflicts with non-texture barriers
- Insert additional DEPBAR instructions where texture dependencies require explicit barrier management
- Adjust synchronization instructions that interact with texture pipeline state
Implementation
The phase is architecture-specific. In the default vtable, it maps to a nullsub (nullsub_43 at 0x680170), indicating a no-op for architectures that handle texture dependencies entirely in the general scoreboard pass. Architecture backends that need texture-specific fixup override this vtable entry with their implementation.
Dispatch path:
PhaseManager::execute(phase=114)
→ arch_backend->vtable[phase_114]()
→ secondary_object = *(arch_backend + 16)
→ (*(secondary_object->vtable + 0x70))(secondary_object, func)
Phase 115: AdvancedScoreboardsAndOpexes
Phase 115 is the main scoreboard generation pass. It is an AdvancedPhase hook -- a no-op in the default vtable, activated only when the architecture backend overrides it. At -O1 and above, this phase runs the full dependency analysis and scoreboard assignment. At -O0, it is skipped entirely (phase 116 handles the conservative path instead).
Architecture Dispatch
The phase entry point dispatches through the architecture backend vtable to sub_85C890, which acts as an opcode-aware router: depending on instruction type, it either handles the instruction via a fast path (direct barrier assignment for known patterns) or falls through to the full DAG list scheduler sub_A23CF0.
Fast Path (sub_85C890)
sub_85C890 classifies instructions by their masked opcode (opcode & 0xFFFFCFFF) and routes them:
Handled by fast path (direct barrier assignment without full DAG analysis):
- Opcodes 60, 62, 78, 79: Texture/surface operations -- processed via sub_A22B40 (write barrier assignment) after checking architecture capability at vtable+1928
- Opcode 4 with operand types (7, 6): Specific ALU patterns with predicate operands -- dual operand processing via sub_A220A0
- Opcode 111 with operand types (7, 7, 6): Triple-source patterns -- processed via triple sub_A220A0 calls
- Opcodes 120, 121: GMMA/tensor operations -- processed via sub_A220A0 + sub_A22B40 with variable operand counts
- Opcodes 126--128: Complex operations with architecture-specific operand counts (2--4 source operands)
- Opcodes 195, 270, 280, 281: Memory operations with specific addressing modes
- Opcodes 350, 351: Extended operations with operand subtype 11--12
Fall-through to slow path (full DAG scheduler):
- All other opcodes
- Fast-path opcodes that fail capability checks (vtable+1928 returns false)
- Instructions with the 0x1000 flag set (bit 12 of opcode word) -- handled via sub_A227F0 first, then fall through
The fast-path check at vtable+1928 tests (*(_BYTE *)(a1 + 1090) & 4) != 0, which corresponds to an architecture feature flag controlling whether the backend supports direct scoreboard assignment for specific instruction classes.
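The routing decision can be sketched as follows. This is a simplification: the real router also checks operand types for opcodes 4 and 111, and the bit-12 path reaches the slow path only after sub_A227F0 pre-processing; `arch_flags_byte` is a stand-in for the byte at backend offset +1090:

```python
# Masked opcodes recovered from sub_85C890's switch (see list above).
FAST_PATH_OPCODES = {60, 62, 78, 79, 4, 111, 120, 121, 126, 127, 128,
                     195, 270, 280, 281, 350, 351}

def route_instruction(opcode_word, arch_flags_byte):
    """Route one instruction to the fast or slow scoreboard path.

    A sketch of the recovered dispatch, not decompiled code.
    """
    masked = opcode_word & 0xFFFFCFFF          # strip modifier bits 12-13
    flagged = bool(opcode_word & 0x1000)       # bit 12: needs pre-processing
    fast_capable = bool(arch_flags_byte & 4)   # the (*(a1+1090) & 4) check
    if masked in FAST_PATH_OPCODES and fast_capable and not flagged:
        return "fast"    # direct barrier assignment
    return "slow"        # full DAG scheduler (sub_A23CF0)

assert route_instruction(60, 0x04) == "fast"
assert route_instruction(60, 0x00) == "slow"            # capability fails
assert route_instruction(60 | 0x1000, 0x04) == "slow"   # bit-12 flagged
```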
Slow Path: DAG List Scheduler (sub_A23CF0, 54 KB)
When the fast path cannot handle an instruction, sub_A23CF0 performs full dependency-driven scoreboard assignment. This function takes 14 parameters including floating-point latency weights and throughput factors.
The scheduler:
1. Classifies instruction dependencies: Iterates operands, extracting register IDs from the operand descriptors at instr+84. For each operand, looks up the register object via *(*(func+88) + 8 * (operand_desc & 0xFFFFFF)) and checks the register type at offset +64.
2. Walks the def-use chain: For each source operand, traces back to the defining instruction. Determines the dependency distance (in instructions and cycles) from each producer to the current consumer.
3. Assigns barrier or stall: Based on the dependency distance and the producer's latency:
   - If the producer's latency fits within the stall count range (0--15), assigns a stall count
   - If the latency exceeds 15 cycles, allocates a dependency barrier from the pool of 6
   - If all 6 barriers are in use, finds the oldest barrier, inserts a wait for it, and recycles it
4. Handles instruction-specific patterns: Contains a large switch on opcode for architecture-specific scheduling decisions. Opcodes handled specially include:
   - Opcodes 2, 3, 20, 21, 24, 28, 60, 61, 62, 67, 78, 79, 98, 100, 110, 120, 126, 139, 141, 162, 164, 201, 209, 210, 213, 214, 272, 273, 311: Direct operand processing with known dependency patterns
   - Opcodes 5, 6, 7, 10, 11, 36, 63, 80, 106, 108, 112, 114: Memory/load operations with variable-latency handling
5. Produces control word fields: After analysis, the function sets the barrier assignment, stall count, and wait mask for the instruction.
Key Support Functions
| Address | Size | Purpose |
|---|---|---|
sub_A220A0 | 9 KB | Instruction attribute / property query -- fills a scheduling descriptor for a specific operand |
sub_A22B40 | -- | Write barrier assignment for a specific operand -- determines which barrier index to assign |
sub_A22BC0 | -- | Read barrier dependency -- sets up wait mask for operand |
sub_A22CE0 | -- | Instruction classification -- determines if instruction needs scoreboard processing |
sub_A231E0 | -- | Scheduling score computation -- determines if full DAG analysis is needed |
sub_A227F0 | -- | Pre-processing for flagged instructions (bit 12 set in opcode) |
sub_A22D00 | -- | Dependency distance computation |
Phase 116: ProcessO0WaitsAndSBs
Phase 116 implements the conservative scoreboard insertion path for -O0 (no optimization) builds. At -O0, phase 115 is a no-op, and phase 116 takes over with a simple, safe strategy.
Conservative Strategy
The -O0 path does not perform dependency analysis. Instead, it applies maximum-safety defaults:
function ProcessO0WaitsAndSBs(func):
for each bb in func.basic_blocks:
for each instr in bb.instructions:
// Set maximum stall count (15 cycles)
instr.stall_count = 15
// Wait on all active barriers before every instruction
instr.wait_mask = 0x3F // all 6 barriers
// No barrier assignment (no long-latency tracking)
instr.write_barrier = 7 // 7 = none
// No read barriers
instr.read_mask = 0
// Yield after every instruction
instr.yield = 1
This produces correct but extremely slow code: every instruction waits the maximum time and clears all barriers, eliminating any possibility of instruction-level parallelism. The primary use case is debugging, where correctness matters more than performance.
At -O1 and above, phase 115 runs the full analysis, and phase 116's isNoOp() returns true, skipping execution entirely.
Control Word Generation Pipeline (sub_A36360)
sub_A36360 (52 KB) is the master control word generator, called via vtable for each instruction in the scheduled order. It orchestrates six per-field encoder functions to produce the complete control word.
Dispatch Architecture
The function takes the scheduling context (a1), the instruction node (a2), and several SIMD/float parameters encoding latency weights and architecture-specific constants. It begins by:
- Loading the function context from *(a1+8) and the SM backend from *(func+1584) (the sm_backend field; provides hardware latency profiles)
- Calling sub_7E1750 to classify the instruction
- Extracting the opcode from *(a2+72) with the standard mask (BYTE1 &= 0xCF)
- Switching on the masked opcode to determine the encoding strategy
Per-Opcode Dispatch
The master switch at the entry of sub_A36360 routes instructions by opcode class:
| Opcode Class | Handler | Description |
|---|---|---|
| 2, 3, 5, 7 | Inline (LABEL_18 path) | Standard ALU/memory with full barrier analysis. Checks operand subtype 9--10 and architecture feature at *(sm_backend+1037) & 0x20. Calls sub_A32C70 for operand analysis, then sub_A31040 for field encoding. |
| 6 | sub_A34B70 | Wait barrier mask encoding for specific memory operations |
| 10, 149, 151, 290 | Inline (large block) | Extended operations with special barrier handling. Calls sub_A32A20 for multi-operand setup, then processes register-type checks at offset +64 (type==5 triggers additional barrier logic). |
| All others | Per-field encoder chain | Default path through the six encoder functions |
Per-Field Encoder Chain
For the default path, sub_A36360 calls these encoders in sequence:
function GenerateControlWord(ctx, instr):
// 1. Initialize operand analysis
sub_7E19E0(&operand_info, ctx.func, instr)
barrier_type = sub_7E53D0(instr.operand_subtype)
// 2. Analyze source/dest operand dependencies
sub_A32C70(&ctx, instr, src_idx, dst_idx,
&dep_info, &barrier_info)
// 3. Encode all control word fields
sub_A31040(&ctx, &dep_info, &barrier_info,
&src_desc, &dst_desc, &flags,
barrier_type, ...)
// 4. Finalize: set register space = 7 (done)
*(ctx.func + 240) = 7
// 5. Emit the control word
sub_9253C0(ctx.func, instr, 1)
Encoder Function Details
| Address | Size | Function | Field Encoded |
|---|---|---|---|
sub_A333A0 | 3 KB | EncodeStallAndYield | 4-bit stall count + 1-bit yield flag. Called twice from sub_A36360. Computes the minimum stall cycles from the dependency distance to the nearest consumer. Sets yield=1 when stall > threshold (architecture-dependent, typically 4+ cycles). |
sub_A33660 | 7 KB | EncodeReadBarrierMask | 6-bit read barrier mask. Determines which barrier registers this instruction must wait for before reading its source operands. Calls sub_935720 to query register-barrier associations. |
sub_A342E0 | 9 KB | EncodeWriteBarrierIndex | 3-bit write barrier index. Allocates a barrier from the pool of 6 for this instruction's result. Calls sub_934630 to find a free barrier; if none available, forces a wait on the oldest active barrier via sub_9253C0. |
sub_A34B70 | 10 KB | EncodeWaitBarrierMask | 6-bit wait barrier mask. Determines which barriers are cleared when this instruction completes. |
sub_A356A0 | 12 KB | EncodeScoreboardFields | Combined scoreboard field encoder. Orchestrates read/write barrier assignment with dependency distance tracking via sub_A318F0 and conflict detection via sub_A31390. |
sub_A31F80 | 7 KB | ComputeReuseFlags | 6-bit reuse flags. Determines which source register values should be cached in the operand reuse buffer. Calls sub_7DB310 for register bank analysis and sub_91BF30 for reuse eligibility checking. |
Supporting Analysis Functions
| Address | Size | Purpose |
|---|---|---|
sub_A318F0 | 4 KB | Barrier dependency distance computation -- measures the instruction distance between a barrier set and its corresponding wait |
sub_A31390 | 4 KB | Barrier set intersection / conflict detection -- checks whether two instructions' barrier usage conflicts |
sub_A32C70 | -- | Source/destination operand dependency analysis -- identifies which operands create dependencies |
sub_A31040 | -- | Master field encoding dispatcher -- coordinates all six per-field encoders |
Dependency Barrier Allocation Algorithm
The barrier allocator manages the 6 hardware barrier registers as a resource pool. The algorithm must satisfy three constraints:
- No two simultaneously-live long-latency operations share a barrier index
- Every consumer instruction waits on the correct barrier before reading its operand
- Barrier reuse is maximized to avoid unnecessary stalls
Allocation State Machine
State per barrier register (6 entries):
barrier[i].status ∈ {FREE, PENDING, COMPLETED}
barrier[i].producer = instruction pointer (or NULL)
barrier[i].set_cycle = cycle when barrier was assigned
barrier[i].consumers = list of waiting instructions
State transitions:
FREE → PENDING: Barrier allocated to a long-latency producer
PENDING → COMPLETED: Hardware signals completion (implicit)
COMPLETED → FREE: All consumers have executed their wait
PENDING → FREE: Forced recycle (all barriers in use, oldest evicted)
Allocation Pseudocode
function AllocateBarrier(ctx, producer_instr):
// 1. Try to find a free barrier
for i in 0..5:
if barrier[i].status == FREE:
barrier[i].status = PENDING
barrier[i].producer = producer_instr
barrier[i].set_cycle = current_cycle
return i
// 2. No free barrier: recycle the oldest
oldest = argmin(barrier[i].set_cycle for i in 0..5)
// 3. Force all consumers of oldest to wait NOW
InsertWaitForBarrier(ctx, oldest)
// 4. Recycle
barrier[oldest].status = PENDING
barrier[oldest].producer = producer_instr
barrier[oldest].set_cycle = current_cycle
return oldest
function AssignWaitMask(ctx, consumer_instr):
wait_mask = 0
for each source_operand in consumer_instr:
producer = FindProducer(source_operand)
if producer.barrier_index != NONE:
if producer.latency > stall_count_range:
wait_mask |= (1 << producer.barrier_index)
consumer_instr.wait_mask = wait_mask
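The allocation pseudocode above can be exercised as a small runnable model. It tracks only FREE/PENDING status and set cycles; hardware completion and per-barrier consumer lists are not simulated:

```python
FREE, PENDING = 0, 1
NUM_BARRIERS = 6

class BarrierPool:
    """Minimal model of the 6-entry allocator described above."""

    def __init__(self):
        self.status = [FREE] * NUM_BARRIERS
        self.set_cycle = [0] * NUM_BARRIERS
        self.forced_waits = []   # barriers drained early by forced recycling

    def allocate(self, cycle):
        # 1. Prefer a free barrier.
        for i in range(NUM_BARRIERS):
            if self.status[i] == FREE:
                self.status[i], self.set_cycle[i] = PENDING, cycle
                return i
        # 2. All busy: pick the oldest (smallest set_cycle) ...
        oldest = min(range(NUM_BARRIERS), key=lambda i: self.set_cycle[i])
        # 3. ... force its consumers to wait now, then recycle it.
        self.forced_waits.append(oldest)
        self.set_cycle[oldest] = cycle
        return oldest

pool = BarrierPool()
assert [pool.allocate(c) for c in range(6)] == [0, 1, 2, 3, 4, 5]
assert pool.allocate(6) == 0          # oldest-first eviction recycles 0
assert pool.forced_waits == [0]       # one forced wait was inserted
```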
Barrier Reuse Heuristics
The allocator uses several heuristics to maximize barrier reuse:
1. Oldest-first eviction: When all 6 barriers are occupied, the oldest (earliest set_cycle) is evicted. This maximizes the chance that the evicted operation has already completed.
2. Type affinity: Texture operations preferentially reuse barriers previously assigned to other texture operations, because texture latencies tend to be similar and the texture pipeline may batch completions.
3. Distance-based freeing: A barrier is marked free without an explicit wait if the instruction distance from the producer exceeds the architecture's maximum latency for that operation class. The sub_A318F0 function computes this distance.
4. Conflict avoidance: sub_A31390 checks whether a proposed barrier assignment would conflict with an existing barrier that has not yet been waited on. If a conflict is detected, the allocator tries a different barrier index.
Scoreboard Tracking State
The scoreboard tracking state is maintained in the scheduling context object. Key fields:
| Offset | Type | Content |
|---|---|---|
ctx+232 | QWORD | Current instruction pointer |
ctx+240 | DWORD | Current register space ID (7 = done) |
ctx+244 | QWORD | Current operand descriptor pair |
ctx+248 | DWORD | Current write barrier index |
ctx+252 | DWORD | Current barrier assignment type (1 or 2) |
ctx+264 | DWORD | Current instruction sequence number |
ctx+1040 | BYTE | Architecture feature flags (bit 5 = texture scoreboard, bit 4 = extended barriers) |
ctx+1090 | BYTE | Capability flags (bit 2 = fast-path scoreboard, bit 4 = extended operand tracking) |
The *(ctx+1040) & 0x20 flag controls whether the architecture supports texture-specific scoreboard handling. The *(ctx+1090) & 4 flag enables the fast-path scoreboard assignment for known instruction patterns.
Scoreboard Object Layout (952 bytes)
The scoreboard object is allocated by sub_8D0640 (ScheduleInstructions) when the architecture feature flag *(func+1385) & 4 is set. The 952-byte allocation goes through the function context's vtable-dispatched allocator at *(func+16), and the constructor sub_69A1A0 initializes it. The pointer is stored at func+1864.
The object has three regions: 35 reference-counted counter slots, a linked-list/tree node for active barrier tracking, and 14 barrier tracking records.
Region 1: Counter Slots (offsets +0 to +272)
35 QWORD pointer slots, each pointing to an externally-allocated 24-byte counter node. Each counter node has the layout:
Counter node (24 bytes):
+0 QWORD refcount (initialized to 1)
+8 QWORD value (initialized to 0)
+16 QWORD allocator back-reference
The 35 slots are organized as barrier state / stall counter pairs for each register class, plus additional scoreboard tracking counters:
| Offset | Slot | Purpose |
|---|---|---|
| +0 | 0 | R (general-purpose register) barrier state |
| +8 | 1 | R stall counter |
| +16 | 2 | P (predicate register) barrier state |
| +24 | 3 | P stall counter |
| +32 | 4 | UR (uniform register) barrier state |
| +40 | 5 | UR stall counter |
| +48 | 6 | UP (uniform predicate) barrier state |
| +56 | 7 | UP stall counter |
| +64 | 8 | B (barrier register) barrier state |
| +72 | 9 | B stall counter |
| +80 | 10 | Arch-specific class 5 barrier state |
| +88 | 11 | Arch-specific class 5 stall counter |
| +96 | 12 | Arch-specific class 6 barrier state |
| +104 | 13 | Arch-specific class 6 stall counter |
| +112 | 14 | Arch-specific class 7 barrier state |
| +120 | 15 | Arch-specific class 7 stall counter |
| +128 | 16 | Arch-specific class 8 barrier state |
| +136 | 17 | Arch-specific class 8 stall counter |
| +144--+272 | 18--34 | Additional scoreboard tracking counters (17 slots) |
Total: 35 slots x 8 bytes = 280 bytes.
Region 2: Linked List / Tree Node (offsets +280 to +391)
This region contains an intrusive data structure (linked list or red-black tree node) used for tracking active barrier assignments. It cross-references counter slots from Region 1.
| Offset | Size | Type | Init | Purpose |
|---|---|---|---|---|
| +280 | 8 | ptr | from a2+16 | Allocator reference (arena/memory pool) |
| +288 | 8 | QWORD | 0 | List sentinel / null node |
| +296 | 8 | ptr | &self+304 | List head pointer |
| +304 | 8 | ptr | &self+288 | Forward link (points to sentinel) |
| +312 | 8 | QWORD | 0 | Node data |
| +320 | 8 | ptr | &self+288 | Backward link (points to sentinel) |
| +328 | 8 | ptr | &self+304 | Secondary forward link |
| +336 | 4 | DWORD | 2 | Node type / RB-tree color (2 = initial) |
| +344 | 8 | ptr | slot 1 ref | Cross-reference to counter slot 1 (R stall counter); refcount incremented |
| +352 | 8 | QWORD | 0 | Pending producer instruction pointer |
| +360 | 8 | QWORD | 0 | Set cycle timestamp |
| +368 | 8 | QWORD | 0 | Consumer list head |
| +376 | 4 | DWORD | 0 | Active flag / barrier index |
| +384 | 8 | ptr | slot 19 ref | Cross-reference to counter slot 19 |
Total: 112 bytes.
Region 3: Barrier Tracking Records (offsets +392 to +951)
14 identical 40-byte records, each tracking one dependency barrier register. The first 6 records correspond to the 6 hardware dependency barriers per warp. Records 6--12 are extended/spare slots for overflow or future barrier model expansion (sm_100+). Record 13 uses a different initialization path (sub_6996C0 instead of sub_69A120), suggesting it serves as a sentinel or special-purpose record.
Per-record layout (40 bytes):
| Offset (within record) | Size | Type | Init | Purpose |
|---|---|---|---|---|
| +0 | 8 | QWORD | 0 | Barrier status: FREE (0), PENDING, COMPLETED |
| +8 | 8 | QWORD | 0 | Producer instruction pointer (or NULL when free) |
| +16 | 8 | QWORD | 0 | Set cycle / consumer tracking state |
| +24 | 4 | DWORD | 0 | Barrier flags / consumer count |
| +28 | 4 | -- | -- | (padding) |
| +32 | 8 | ptr | slot 19 ref | Cross-reference to counter slot 19 (allocator back-pointer) |
Record index to offset mapping:
| Record | Offset | Hardware Barrier |
|---|---|---|
| 0 | +392 | Dependency barrier 0 |
| 1 | +432 | Dependency barrier 1 |
| 2 | +472 | Dependency barrier 2 |
| 3 | +512 | Dependency barrier 3 |
| 4 | +552 | Dependency barrier 4 |
| 5 | +592 | Dependency barrier 5 |
| 6 | +632 | Extended / spare 0 |
| 7 | +672 | Extended / spare 1 |
| 8 | +712 | Extended / spare 2 |
| 9 | +752 | Extended / spare 3 |
| 10 | +792 | Extended / spare 4 |
| 11 | +832 | Extended / spare 5 |
| 12 | +872 | Extended / spare 6 |
| 13 | +912 | Sentinel record (different init via sub_6996C0) |
Tail pointer:
| Offset | Size | Type | Purpose |
|---|---|---|---|
| +944 | 8 | ptr | Counter reference for sentinel record (from slot 25) |
Total: 14 records x 40 bytes = 560 bytes (offsets +392 through +951); the tail pointer at +944 occupies the counter-reference field of the sentinel record, completing the 952-byte object (280 + 112 + 560).
Memory Layout Diagram
ScoreboardObject (952 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0 slot0 (R bar) |+8 slot1 (R stl) |+16 slot2 (P bar)|+24 slot3 (P stl)|
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 slot4 (UR) |+40 slot5 (UR) |+48 slot6 (UP) |+56 slot7 (UP) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+64 slot8 (B) |+72 slot9 (B) |+80..+272 slots 10--34 |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+280 allocRef |+288 sentinel |+296 listHead |+304 fwdLink |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+312 nodeData |+320 bwdLink |+328 secFwd |+336 type | |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+344 slotRef |+352 producer |+360 setCycle |+368 consumers |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+376 flags| |+384 slotRef19 | |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+392 barrierRecord[0] (40B) |+432 barrierRecord[1] (40B) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+472 barrierRecord[2] |+512 barrierRecord[3] |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+552 barrierRecord[4] |+592 barrierRecord[5] |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+632..+912 barrierRecords[6..13] (8 extended / spare records) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+944 tailPtr |+952 END |
+--------+--------+--------+--------+--------+--------+--------+--------+
Design Notes
The counter nodes use reference counting (initial refcount = 1, incremented when cross-referenced from Region 2 or Region 3). This enables sharing counter state across multiple tracking contexts -- for example, when the scheduling passes for pre-scheduling and post-scheduling need to track the same barrier state.
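A minimal model of that sharing (field names follow the 24-byte counter node layout in Region 1; the acquire/release protocol is inferred from the refcount initialization, not recovered):

```python
class CounterNode:
    """Model of the 24-byte refcounted counter node (+0 refcount, +8 value)."""

    def __init__(self):
        self.refcount = 1     # constructor initializes refcount to 1
        self.value = 0

    def acquire(self):
        # Cross-referencing from Region 2 or Region 3 bumps the refcount ...
        self.refcount += 1
        return self

    def release(self):
        # ... and the node is freed only when the last reference drops.
        self.refcount -= 1
        return self.refcount == 0

slots = [CounterNode() for _ in range(35)]   # Region 1 slot table
shared = slots[1].acquire()                  # Region 2 references slot 1
shared.value += 5                            # both views see the same counter
assert slots[1].value == 5 and slots[1].refcount == 2
assert shared.release() is False             # slot table still holds a ref
```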
The 14 barrier records provide 6 slots for the hardware barrier registers plus 7 extended/spare slots and a sentinel record. Current architectures use exactly 6 dependency barriers per warp, but the extended slots provide headroom for the expanded barrier model hinted at in sm_100+ configurations (see the *(ctx+1040) & 0x10 extended barriers flag).
Stall Count Computation
The stall count is the minimum number of cycles the warp scheduler must wait before issuing the instruction. It is computed from the dependency distance to the instruction's producers.
Algorithm
function ComputeStallCount(ctx, instr):
max_stall = 0
for each source_operand in instr:
producer = FindProducer(source_operand)
if producer == NULL:
continue
distance = instr.cycle - producer.cycle
latency = GetLatency(producer.opcode, ctx.sm_backend)
required_wait = latency - distance
if required_wait > 0:
max_stall = max(max_stall, required_wait)
// Clamp to 4-bit range
stall = min(max_stall, 15)
// Apply architecture minimum (knob 741, default 3)
stall = max(stall, ctx.min_stall)
// Cap at architecture maximum (knob 805/806, max 16)
stall = min(stall, ctx.max_stall)
return stall
The stall count computation uses the SM backend at *(func+1584) (sm_backend) to look up per-opcode latencies from the architecture's hardware latency tables; see Latency Model.
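The clamping order at the end of the algorithm matters (4-bit ceiling first, then the architecture floor). A runnable distillation using the knob defaults named in the pseudocode:

```python
MIN_STALL = 3    # knob 741 default, per the pseudocode above
MAX_STALL = 15   # 4-bit field ceiling

def clamp_stall(required_wait):
    """Clamp a raw required-wait value into the encodable stall range.

    A sketch of the clamping step only; latency lookup is elided.
    """
    stall = min(max(required_wait, 0), MAX_STALL)  # cap to the 4-bit field
    return max(stall, MIN_STALL)                   # apply architecture floor

assert clamp_stall(0) == 3     # floor applies even with no pending producer
assert clamp_stall(7) == 7
assert clamp_stall(40) == 15   # longer waits need a dependency barrier
```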
Yield Flag
The yield flag is computed by sub_A333A0 alongside the stall count. The decision:
function ComputeYield(ctx, instr, stall):
if stall >= yield_threshold: // arch-dependent, typically 4
return 1 // long stall: yield to another warp
if instr is branch/exit:
return 1 // control flow: always yield
if instr.is_last_in_bb and next_bb.has_barrier:
return 1 // barrier boundary: yield
return 0
The yield threshold is read from the SM backend's latency table and varies by architecture. On sm_80+ it is typically 4 cycles.
Mercury Opex Path (Phase 120)
In addition to the Ori IR scoreboard passes (phases 114--116), the Mercury backend has its own scoreboard generation pass: MercGenerateOpex (phase 120, sub_703480 / sub_6FFDC0).
This pass runs after Mercury encode/decode (phase 117) and instruction expansion (phase 118), operating on the Mercury intermediate representation rather than Ori IR. It generates:
- DEPBAR instructions for explicit barrier management
- Scoreboard wait annotations
- Stall count adjustments for expanded instruction sequences
- Synchronization barriers for cross-warp dependencies
The Mercury opex pass and the Ori scoreboard passes serve different purposes:
- Phases 114--116 generate scoreboard metadata at the Ori IR level, before Mercury encoding
- Phase 120 generates additional scoreboard metadata for instructions introduced during Mercury expansion (pseudo-instruction expansion, WAR hazard insertion)
- The WAR pass must run twice (phases 119 and 121) because opex introduces new instructions that create additional write-after-read hazards
Scheduling Output Encoding
After the control word fields are computed, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) encodes them into the final SASS binary format.
Encoding Pipeline
| Address | Size | Function | Purpose |
|---|---|---|---|
sub_8F1EB0 | 15 KB | EncodeScheduleWords | Main scheduling output encoder -- iterates all instructions and produces control words |
sub_8F3130 | 1.0 KB | EncodeStallField | Packs 4-bit stall count into control word |
sub_8F31F0 | 6.1 KB | EncodeBarrierField | Packs barrier set/wait fields with architecture-specific layout |
sub_8F3650 | 2.7 KB | EncodeYieldField | Packs yield flag |
sub_8F3860 | 3.0 KB | EncodeScoreboardField | Packs scoreboard dependencies |
sub_8F3AB0 | 5.0 KB | EncodeDependencyField | Packs inter-instruction dependency metadata |
sub_8F3DE0 | 1.3 KB | EncodeControlField | Packs control flags |
sub_8F3EA0 | 2.1 KB | ValidateEncoding | Checks encoded control word for consistency |
sub_8F3FE0 | 1.7 KB | EncodeWaitField | Packs wait mask |
sub_8F4140 | 5.6 KB | EncodeFullControlWord | Combines all fields into the final 23-bit encoding |
Emission
sub_8F4510 (EmitControlWordForInstr) writes the packed control word into the output buffer. sub_8F4820 (EmitControlBlock) constructs the complete 128-bit scheduling control instruction from three consecutive instruction slots.
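Assuming the three 23-bit slots occupy the low 69 bits of the 128-bit word contiguously (the exact slot-to-bit placement emitted by sub_8F4820, and the contents of the 59 reserved bits, are not confirmed), block emission reduces to:

```python
def emit_control_block(slot0, slot1, slot2):
    """Pack three 23-bit per-instruction slots into one 128-bit control
    instruction; the 59 reserved/flag bits are left zero in this sketch."""
    for s in (slot0, slot1, slot2):
        assert 0 <= s < (1 << 23)
    word = slot0 | (slot1 << 23) | (slot2 << 46)
    return word.to_bytes(16, "little")   # 128-bit little-endian word

blk = emit_control_block(0x00000F, 0x000010, 0x0000E0)
assert len(blk) == 16
# Slot 1 recovers from bits [45:23]:
assert (int.from_bytes(blk, "little") >> 23) & 0x7FFFFF == 0x000010
```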
Scoreboard Entry Construction
sub_8E4920 (6.9 KB) constructs a balanced BST (red-black tree) of scoreboard entries during schedule output. Each entry contains:
- Instruction pointer
- 16-bit scoreboard register ID
- 16-bit dependency type
The tree is used by the verification pass to check that barrier assignments are consistent across the instruction stream.
Verification
Seven verification functions (0x8F7610--0x8F8CB0) validate the generated schedule:
- Stall count bounds (0--15)
- Barrier index validity (0--5 or 7=none)
- Wait mask consistency (only wait on barriers that have been set)
- Scoreboard dependency completeness (every long-latency producer has a barrier)
- Control word format correctness
- Yield hint plausibility
- Overall schedule integrity (no live-range violations)
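The first three checks follow directly from the field semantics documented above. In this sketch, `active_barriers` (the set of currently pending barrier indices) is a hypothetical input; the real verifiers walk the scoreboard BST to derive it:

```python
def verify_control_word(stall, write_barrier, wait_mask, active_barriers):
    """Check one decoded control word against the first three invariants."""
    errors = []
    if not 0 <= stall <= 15:
        errors.append("stall out of range")            # check 1: bounds
    if write_barrier != 7 and not 0 <= write_barrier <= 5:
        errors.append("invalid barrier index")         # check 2: 0-5 or 7
    for b in range(6):                                 # check 3: wait only
        if wait_mask & (1 << b) and b not in active_barriers:  # on set bars
            errors.append(f"wait on unset barrier {b}")
    return errors

assert verify_control_word(4, 7, 0b000001, {0}) == []
assert verify_control_word(4, 6, 0b000010, {0}) == [
    "invalid barrier index", "wait on unset barrier 1"]
```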
Alternative Control Word Path (sub_8D7760)
sub_8D7760 (41 KB) is the stall/barrier insertion function used by the pre-scheduling passes. Unlike sub_A36360 which generates control words for the final Ori IR, sub_8D7760 operates during the scheduling algorithm itself, computing stall and barrier assignments as instructions are placed.
This function:
- Manages a 32-entry barrier tracking table
- Contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+)
- Computes stall cycles as the distance in cycles to the nearest dependent consumer
- Assigns barriers from the pool of 6 using an oldest-first eviction policy
- Handles architecture-specific barrier count variations
The two control word generators (sub_A36360 for final emission, sub_8D7760 for scheduling) share the same barrier allocation algorithm but operate at different pipeline stages. sub_8D7760 produces preliminary assignments that sub_A36360 may refine during the final scoreboard pass.
Architecture-Specific Control Word Configuration
sub_A2BD90 (23 KB) configures architecture-dependent scheduling parameters by querying feature flags through the architecture vtable at *(ctx+72). Configuration includes:
- Stall count thresholds and caps
- Barrier count (6 for all current architectures, but the infrastructure supports variation)
- Reuse buffer policy
- Yield threshold
- Texture scoreboard behavior
- Extended barrier modes for sm_100+
The function queries multiple feature IDs through vtable dispatch, building an architecture profile that sub_A36360 and its encoder chain use for all per-instruction decisions.
Per-Instruction Control Word (Internal Structure)
Within the scheduling context, the control word state is spread across several per-instruction fields, centered on the stall/barrier word at instruction offsets +196 through +200:
| Field | Offset | Bits | Description |
|---|---|---|---|
| Stall count | *(instr+200) bits [4:0] | 5 | Internal stall count (wider than the SASS encoding to allow values up to 31 during optimization) |
| Extended stall | *(instr+200) bits [9:5] | 5 | Second stall field for dual-issue scheduling |
| Barrier flags | *(instr+200) bits [14:10] | 5 | Barrier control flags |
| Control bits | *(instr+48) bits [17:13] | 5 | Barrier format in Mercury encoding |
| Scoreboard flag | *(instr+32) byte 13 bit 2 | 1 | Instruction has scoreboard information |
| Encoding format | *(instr+56) | DWORD | 4 = barrier format in Mercury |
| Stall bits | *(instr+168) | BYTE | Final stall value for encoding |
The sub_A2D340 (32 KB) function writes these fields through a large opcode switch, handling opcodes 50 (atomics), 73 (BAR), 74 (ST), 77 (LDS/STS), 78 (HMMA), and others with instruction-specific field layouts.
Function Map
| Address | Size | Identity |
|---|---|---|
| sub_85C890 | 1.5 KB | ScoreboardDispatcher -- opcode-based fast/slow path router |
| sub_A220A0 | 9 KB | InstructionPropertyQuery -- scheduling descriptor filler |
| sub_A22B40 | -- | WriteBarrierAssign -- barrier index assignment for operand |
| sub_A22BC0 | -- | ReadBarrierAssign -- wait mask assignment for operand |
| sub_A22CE0 | -- | InstructionClassify -- scoreboard processing classification |
| sub_A22D00 | -- | DependencyDistance -- compute instruction distance |
| sub_A227F0 | -- | FlaggedInstrPreprocess -- bit-12-set instruction handling |
| sub_A231E0 | -- | SchedulingScore -- full-DAG-analysis necessity check |
| sub_A23CF0 | 54 KB | DAGListScheduler -- full dependency-driven scoreboard |
| sub_A265B0 | 10 KB | BarrierDependencyTracker -- barrier assignment tracking |
| sub_A29220 | 12 KB | InstructionEmissionFilter -- instruction emission gating |
| sub_A2BD90 | 23 KB | ArchControlWordConfig -- architecture-specific parameter loader |
| sub_A2D340 | 32 KB | InstructionControlWordEncoder -- per-opcode field writer |
| sub_A31040 | -- | MasterFieldEncoder -- coordinates per-field encoders |
| sub_A31390 | 4 KB | BarrierConflictDetect -- barrier set intersection check |
| sub_A318F0 | 4 KB | BarrierDistanceCompute -- dependency distance to barrier |
| sub_A31F80 | 7 KB | ComputeReuseFlags -- operand reuse buffer hints |
| sub_A32C70 | -- | OperandDependencyAnalysis -- source/dest dep extraction |
| sub_A333A0 | 3 KB | EncodeStallAndYield -- 4-bit stall + 1-bit yield |
| sub_A33660 | 7 KB | EncodeReadBarrierMask -- 6-bit read barrier mask |
| sub_A342E0 | 9 KB | EncodeWriteBarrierIndex -- 3-bit write barrier index |
| sub_A34B70 | 10 KB | EncodeWaitBarrierMask -- 6-bit wait barrier mask |
| sub_A356A0 | 12 KB | EncodeScoreboardFields -- combined scoreboard encoder |
| sub_A36360 | 52 KB | GenerateControlWord -- master control word generator |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- pre-scheduling control words |
| sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST construction |
| sub_8E5CA0 | 20 KB | EmitScheduleOutput -- control word output encoder |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output |
| sub_8F4140 | 5.6 KB | EncodeFullControlWord -- 23-bit packing |
| sub_A95DC0 | 36 KB | SASSControlWordEncoder -- architecture-dispatched encoder |
| sub_6FFDC0 | 66 KB | MercOpexBody -- Mercury opex scoreboard generation |
| sub_703480 | 1.4 KB | RunOpexPass -- MercGenerateOpex entry |
Cross-References
- Scheduler Overview -- 3-phase scheduler architecture, scheduling output pipeline
- Scheduling Algorithm -- priority list scheduling, dependency DAG construction
- Latency Model -- per-opcode latency tables used by stall count computation
- Mercury Encoder -- Mercury pipeline including MercGenerateOpex (phase 120)
- SASS Encoding -- instruction encoding format including control word bit layout
- Phase Manager -- how phases 114--116 fit in the 159-phase pipeline
- Sync & Barriers -- software synchronization barriers (distinct from dependency barriers)
- Knobs -- scheduling knobs 741 (stall threshold), 805/806 (stall caps)
Code Generation Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS code generation subsystem converts optimized Ori IR into executable GPU machine code. It is the largest subsystem in ptxas by every metric: approximately 12,000 functions, 9 MB of binary code, and nine functions so large that Hex-Rays cannot decompile them. The pipeline spans phases 112--158 of the 159-phase PhaseManager and comprises seven interlinked subsystems -- instruction selection, SASS binary encoding, peephole optimization, the Mercury encoding pipeline, Newton-Raphson math templates, SASS text generation, and ELF output packaging. Every subsystem dispatches through per-SM-family tables, so the same high-level flow produces correct output for targets from Kepler (sm_30) through Blackwell Ultra (sm_121).
| Pipeline phases | 112--158 (code generation spans the final third of the pipeline) |
| Total functions | ~12,000 (ISel, encoding, peephole, Mercury, formatters, ELF) |
| Total binary size | ~9 MB of machine code |
| Non-decompilable functions | 9 (3 peephole + 6 encoding megadispatchers) |
| Core primitive | sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers) |
| Architecture selector | *(int*)(config+372) >> 12 -- SM generation ID |
| Largest function | sub_169B190 -- generic peephole dispatcher (280 KB) |
| Output modes | mercury (SM 75--99), capmerc (SM 100+), sass (explicit) |
| CLI option | --binary-kind mercury,capmerc,sass |
Pipeline
Optimized Ori IR (register-allocated, scheduled)
|
v
┌─────────────────────────────────────────────────────────────┐
│ SASS CODE GENERATION │
│ │
│ 1. Instruction Selection (ISel) ───────> [isel.md] │
│ │ DAG pattern matching: ~750 matchers │
│ │ Mega-selector: sub_C0EB10 (185 KB) │
│ │ 4 arch-variant dispatch tables │
│ v │
│ 2. SASS Binary Encoding ───────────────> [encoding.md] │
│ │ ~4,000 template-generated handlers │
│ │ 6 megadispatchers (750 KB total) │
│ │ sub_7B9B80 bitfield packer (18,347 callers) │
│ v │
│ 3. Peephole Optimization ──────────────> [peephole.md] │
│ │ 3 mega-dispatchers: 280+233+233 KB = 746 KB │
│ │ ~3,185 pattern matchers │
│ v │
│ 4. Mercury Pipeline (phases 117-122) ──> [mercury.md] │
│ │ Encode/Decode → Expand → WAR → Opex → WAR → SASS │
│ │ sub_6D9690 master encoder (94 KB) │
│ v │
│ 5. Newton-Raphson Templates ───────────> [templates.md] │
│ │ DDIV/DRCP/DSQRT/DRSQRT software sequences │
│ │ 36 functions, up to 298 virtual registers each │
│ v │
│ 6. SASS Text Generation (phase 129) ──> [sass-printing.md] │
│ │ 580 formatter functions + 12.9 KB dispatcher │
│ v │
│ 7. ELF/Cubin Output ──────────────────> [../output/…] │
│ sub_612DE0 finalizer → sub_1C9F280 ELF emitter │
└─────────────────────────────────────────────────────────────┘
|
v
.cubin / .o (NVIDIA custom ELF)
Phase-to-Subsystem Map
The code generation pipeline occupies phases 112--158. This table maps each phase to its subsystem and documents the six-stage Mercury core that is the dominant path for SM 75+ targets.
| Phase | Name | Subsystem | Detail page |
|---|---|---|---|
| 112 | PlaceBlocksInSourceOrder | Block layout | cfg.md |
| 113 | PostFixForMercTargets | Mercury pre-fixup | mercury.md |
| 114 | FixUpTexDepBarAndSync | Scoreboard / sync | scoreboards.md |
| 115 | AdvancedScoreboardsAndOpexes | Scoreboard hook | scoreboards.md |
| 116 | ProcessO0WaitsAndSBs | Scoreboard (-O0) | scoreboards.md |
| 117 | MercEncodeAndDecode | Mercury core | mercury.md |
| 118 | MercExpandInstructions | Mercury core | mercury.md |
| 119 | MercGenerateWARs1 | Mercury core | mercury.md |
| 120 | MercGenerateOpex | Mercury core | mercury.md |
| 121 | MercGenerateWARs2 | Mercury core | mercury.md |
| 122 | MercGenerateSassUCode | Mercury core | mercury.md |
| 123 | ComputeVCallRegUse | Post-Mercury bookkeeping | -- |
| 124 | CalcRegisterMap | Post-Mercury bookkeeping | -- |
| 125 | UpdateAfterPostRegAlloc | Post-Mercury bookkeeping | -- |
| 126 | ReportFinalMemoryUsage | Reporting | dumpir.md |
| 127 | AdvancedPhaseOriPhaseEncoding | Encoding hook | -- |
| 128 | UpdateAfterFormatCodeList | Post-Mercury bookkeeping | -- |
| 129 | DumpNVuCodeText | SASS text output | sass-printing.md |
| 130 | DumpNVuCodeHex | SASS hex output | dumpir.md |
| 131 | DebuggerBreak | Debug | -- |
| 132 | UpdateAfterConvertUnsupportedOps | Late cleanup | -- |
| 133 | MergeEquivalentConditionalFlow | Late cleanup | -- |
| 134 | AdvancedPhaseAfterMidExpansion | Late cleanup hook | -- |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Late cleanup hook | -- |
| 136 | LateMergeEquivalentConditionalFlow | Late cleanup | -- |
| 137 | LateExpansionUnsupportedOpsMid | Late lowering | -- |
| 138 | OriSplitHighPressureLiveRanges | Late regalloc fixup | -- |
| 139--158 | (architecture-specific) | Arch backends | phase-manager.md |
Subsystem grouping summary:
| Subsystem | Phases | Key property |
|---|---|---|
| Block layout | 112 | Restores source-order block placement |
| Scoreboard / sync | 113--116 | Pre-Mercury texture and dependency bar fixups |
| Mercury core | 117--122 | Six-stage encode-expand-WAR-opex-WAR-emit pipeline |
| Post-Mercury bookkeeping | 123--128 | Register maps, data structure refresh |
| SASS output + debug | 129--131 | Text/hex dumps and debugger hook |
| Late cleanup | 132--138 | Conditional merging, late lowering, live-range splits |
| Arch-specific | 139--158 | 20 backend-overridable phases (no-op by default) |
Scale
| Subsystem | Functions | Binary size | Key entry point |
|---|---|---|---|
| ISel pattern matchers | ~750 | ~1.3 MB | sub_B285D0 (ISel driver, 9 KB) |
| ISel mega-selector | 1 | 185 KB | sub_C0EB10 |
| SASS encoding handlers | ~4,000 | ~2.5 MB | sub_7B9B80 (bitfield packer) |
| Encoding megadispatchers | 6 | ~750 KB | sub_10C0B20 (setField, 180 KB) |
| Peephole mega-dispatchers | 3 | ~746 KB | sub_169B190 (generic, 280 KB) |
| Peephole pattern matchers | ~3,185 | ~1.5 MB | (individual matchers) |
| Mercury pipeline | ~50 | ~400 KB | sub_6F52F0 (orchestrator, 23 KB) |
| Mercury encode tables | 530 | ~500 KB | format initializers at 0xC66000 |
| Encoding vtable methods | ~2,735 | ~450 KB | tiny dispatchers at 0xAF0000 |
| Newton-Raphson templates | 36 | ~180 KB | sub_170E260 (DDIV coordinator) |
| SASS text formatters | 580 | ~850 KB | sub_5D4190 (dispatcher, 12.9 KB) |
| ELF emitter | ~60 | ~300 KB | sub_1C9F280 (master, 97 KB) |
| Total | ~12,000 | ~9 MB | -- |
Nine functions exceed the decompilation threshold: the three peephole mega-dispatchers (280 + 233 + 233 KB) and the six encoding megadispatchers (180 + 197 + 187 + 142 + 68 + 65 KB). All analysis of these functions derives from disassembly, call graphs, and the smaller functions they invoke.
Instruction Selection
ISel converts abstract Ori IR operations into concrete SASS instruction forms using SelectionDAG-style pattern matching. Unlike upstream LLVM's TableGen-driven ISel, ptxas uses handwritten C++ matchers compiled into ~750 functions invoked from the ISel driver via per-opcode dispatch tables. The ISel driver (sub_B285D0, 9 KB, 66 callees) selects architecture-variant builders based on the SM version. The mega-selector (sub_C0EB10, 185 KB) handles the full IR-to-SASS mapping through a giant switch over instruction opcodes. Four nearly identical dispatch functions (15,049 bytes each) at sub_B128E0--sub_B12920 provide architecture-variant opcode routing, all jumping to shared handler code at 0x1C39xxx.
See Instruction Selection for the full DAG matcher protocol, helper function table, architecture dispatch tables, and operand variant selectors.
SASS Binary Encoding
The encoding subsystem translates ISel output into packed binary SASS machine code. Each instruction is encoded into a 1280-bit (160-byte, 20-QWORD) buffer via the universal bitfield packer sub_7B9B80. The full architecture is documented in SASS Instruction Encoding; the key facts for the overview:
- ~4,000 encoding handler functions -- each follows an identical 10-phase template, differing only in constants and modifier helpers
- 6 megadispatchers (750 KB total) route field-level queries by instruction category: setField (180 KB), getFieldOffset (197 KB), hasField (187 KB), setFieldDefault (142 KB), getOperandFieldOffset (68 KB), setOperandField (65 KB)
- 2,095 bitfield accessor functions at 0x10B0000--0x10BF2C0 (1,661 under 200 bytes)
- 530 encoding table initializers at 0xC66000--0xD27000, each populating one instruction format row
- 3-level opcode hierarchy: major (9 bits), minor (8 bits), sub-opcode (7 bits)
- Instruction widths: 64-bit (format code 1), 128-bit (format code 2), 256-bit (format code 8)
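A plausible shape for the universal packer, sketched under the assumption that sub_7B9B80 inserts an arbitrary-width field at an arbitrary bit offset into the 20-QWORD (1280-bit) buffer and handles fields straddling a QWORD boundary (the real signature is not recovered):

```c
#include <stdint.h>

/* Sketch of a universal bitfield insert into the 1280-bit encoding
   buffer (20 QWORDs). Offsets and widths are caller-supplied; a field
   may straddle two adjacent QWORDs. This mirrors what a primitive like
   sub_7B9B80 must do; it is not a recovered implementation. */
static void bitfield_insert(uint64_t buf[20], unsigned bit_off,
                            unsigned width, uint64_t value) {
    unsigned q = bit_off / 64, b = bit_off % 64;
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    value &= mask;
    buf[q] = (buf[q] & ~(mask << b)) | (value << b);
    if (b + width > 64) {                 /* straddles into next QWORD */
        unsigned spill = b + width - 64;
        uint64_t hi_mask = (1ULL << spill) - 1;
        buf[q + 1] = (buf[q + 1] & ~hi_mask) | (value >> (64 - b));
    }
}

/* Demo: write 0xAB across the QWORD boundary at bit 60, then a 9-bit
   field at bit 0; returns 1 if both land where expected. */
static int demo_straddle(void) {
    uint64_t buf[20] = { 0 };
    bitfield_insert(buf, 60, 8, 0xAB);   /* 0xB stays, 0xA spills over */
    bitfield_insert(buf, 0, 9, 0x1FF);
    return (buf[0] >> 60) == 0xB
        && (buf[1] & 0xF) == 0xA
        && (buf[0] & 0x1FF) == 0x1FF;
}
```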
Peephole Optimization
Three monolithic dispatch functions implement brute-force pattern-match-and-rewrite. The full architecture is documented in Peephole Optimization. The key positioning facts:
| Dispatcher | Size | Matchers | Entry trampoline | Runs when |
|---|---|---|---|---|
| sub_169B190 | 280 KB | 762 | sub_B12930 | Pre-scheduling (all SM) |
| sub_143C440 | 233 KB | 1,087 | sub_B12940 | Pre-scheduling (SM 120 only) |
| sub_198BCD0 | 233 KB | 1,336 | sub_B12960 | Post-scheduling (all SM) |
All three use identical architecture: a 373-case primary switch on the 16-bit opcode at instruction+0x0C, per-case pattern matcher invocations with priority tracking, and a secondary switch for rewrite actions. The SM 120 dispatcher (sub_143C440) is architecture-gated and runs only when compiling for consumer RTX 50-series or enterprise Pro GPUs.
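The dispatch shape can be sketched as follows; the Instr layout beyond the opcode offset, the matcher tables, and the priority values are hypothetical stand-ins:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative skeleton of a peephole mega-dispatcher: read the 16-bit
   opcode stored at instr+0x0C, try every matcher registered for that
   opcode, track the highest-priority hit, then run its rewrite. */
typedef struct Instr {
    unsigned char pad[0x0C];  /* operand fields, flags, ...        */
    uint16_t opcode;          /* 16-bit opcode at offset +0x0C     */
    int rewritten;            /* demo flag set by the rewrite step */
} Instr;

typedef int  (*match_fn)(Instr *);   /* returns priority, 0 = no match */
typedef void (*rewrite_fn)(Instr *);
typedef struct { match_fn match; rewrite_fn rewrite; } Pattern;

static void peephole_dispatch(Instr *i, const Pattern *pats[],
                              const size_t counts[]) {
    const Pattern *best = NULL;
    int best_prio = 0;
    for (size_t k = 0; k < counts[i->opcode]; k++) {
        int p = pats[i->opcode][k].match(i);
        if (p > best_prio) { best_prio = p; best = &pats[i->opcode][k]; }
    }
    if (best)
        best->rewrite(i);
}

/* Tiny demo registration: one pattern on opcode 7. */
static int  demo_match(Instr *i)   { return i->opcode == 7 ? 10 : 0; }
static void demo_rewrite(Instr *i) { i->rewritten = 1; }
static const Pattern demo_pats[] = { { demo_match, demo_rewrite } };

static int demo_run(uint16_t op) {
    Instr ins = { {0}, op, 0 };
    const Pattern *tbl[16] = { 0 };
    size_t cnt[16] = { 0 };
    tbl[7] = demo_pats;
    cnt[7] = 1;
    peephole_dispatch(&ins, tbl, cnt);
    return ins.rewritten;
}
```

In the real dispatchers the per-opcode matcher lists are inlined into the 373-case switch rather than table-driven, but the priority-tracked match-then-rewrite flow is the same.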
Mercury Pipeline
Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It occupies phases 113--122 and forms a six-stage sub-pipeline. Three output modes are controlled by --binary-kind: mercury (SM 75--99), capmerc (SM 100+, with embedded PTX source and relocation metadata), and sass (explicit direct SASS output). The master encoder sub_6D9690 (94 KB) is the largest backend function, with the orchestrator sub_6F52F0 (23 KB, 18 parameters) driving the full stage sequence.
See Mercury Encoder Pipeline for the six-stage architecture, key function table, and output mode details. See Capsule Mercury & Finalization for the SM 100+ variant.
Newton-Raphson Templates
Double-precision operations lacking dedicated hardware (DDIV, DRCP, DSQRT, DRSQRT) are lowered into multi-instruction SASS sequences implementing Newton-Raphson iterative refinement. The template system at 0x1700000--0x1722D60 comprises 36 functions organized in a two-level hierarchy: a top-level handler per operation delegates to a coordinator that allocates up to 298 virtual registers and chains 5--7 sub-expander functions. The register-count dispatcher sub_1704070 selects between full inline, partial inline, and template-based expansion paths based on register file pressure (thresholds: 20,479 / 16,383).
See Newton-Raphson Templates for the complete template hierarchy, register-count dispatch logic, and sub-expander details.
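The refinement these templates emit can be illustrated directly: DRCP-style reciprocal computes x_{n+1} = x_n * (2 - d * x_n), squaring the error each step. A pure-C sketch, with frexp()/ldexp() standing in for the hardware's low-precision seed (MUFU.RCP on real silicon); the seed formula and iteration count are illustrative, and d is assumed positive:

```c
#include <math.h>

/* Newton-Raphson reciprocal: x = x * (2 - d * x) converges
   quadratically to 1/d whenever the seed satisfies |1 - d*x0| < 1.
   The frexp/ldexp seed below keeps that residual <= 0.6 for any
   positive d; ptxas's templates instead start from the hardware
   approximation instruction. */
static double nr_recip(double d) {
    int e;
    double m = frexp(d, &e);        /* d = m * 2^e with m in [0.5, 1) */
    double x = ldexp(1.4 - m, -e);  /* crude linear seed for 1/d      */
    for (int i = 0; i < 6; i++)
        x = x * (2.0 - d * x);      /* each step squares the error    */
    return x;
}
```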
SASS Text Generation
Phase 129 (DumpNVuCodeText) converts the internal instruction stream into human-readable SASS assembly text for --verbose output and --out-sass dumps. The dispatcher sub_5D4190 (12.9 KB) routes 81 named opcodes via direct string comparison and 473 via hash-based switch to 580 template-generated formatter functions at 0x4DA340--0x5A8E40 (~850 KB). All formatters use a monolithic 1.8 MB format string table -- an unusual design that trades memory for formatting speed.
See SASS Text Generation for the full formatter architecture and opcode routing details.
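The two-tier routing (string comparison for a short hot list, hash-based switch for the rest) can be sketched as follows; the table contents and hash function are illustrative stand-ins, not recovered data:

```c
#include <string.h>

/* Illustrative two-tier opcode-name router: a small strcmp list for
   hot opcodes, falling back to a hash-checked lookup for the rest. */
typedef struct { const char *name; int formatter_id; } Entry;

static const Entry hot[]  = { { "MOV", 1 }, { "IMAD", 2 }, { "LDG", 3 } };
static const Entry cold[] = { { "HMMA", 10 }, { "BAR", 11 } };

static unsigned djb2(const char *s) {      /* stand-in hash */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static int route_opcode(const char *name) {
    for (size_t i = 0; i < sizeof hot / sizeof hot[0]; i++)
        if (strcmp(name, hot[i].name) == 0)
            return hot[i].formatter_id;
    unsigned h = djb2(name);               /* switch selector in the real code */
    for (size_t i = 0; i < sizeof cold / sizeof cold[0]; i++)
        if (djb2(cold[i].name) == h && strcmp(name, cold[i].name) == 0)
            return cold[i].formatter_id;
    return -1;                             /* unknown opcode */
}
```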
ELF/Cubin Output
The final stage packages the encoded SASS binary into NVIDIA's custom ELF format (.cubin/.o). The kernel finalizer sub_612DE0 (47 KB) feeds the master ELF emitter sub_1C9F280 (97 KB), which delegates to symbol table emission (sub_713710), relocation generation (sub_7163C0), string table construction (sub_7122C0), and section layout finalization (sub_716DC0).
See ELF/Cubin Output for section catalog, relocation format, and EIATTR attribute encoding.
Intrinsic Lowering
The OCG (Optimizing Code Generator; see glossary) intrinsic system at 0x6C0000--0x6D0000 handles PTX builtin operations for SM 100+ targets. The master intrinsic table at sub_6C9EB0 (13 KB) initializes a 10,664-byte dispatch table with prefix "__nv_ptx_builtin_ocg_", covering operations from basic add/load/store through SM 100 tensor core (tcgen05) and bulk async copy:
| Handler | Size | Operations |
|---|---|---|
| sub_6C0D90 | 19 KB | Atomic reduce (atom.add/min/max/cas -- 54 validation strings) |
| sub_6C3470 | 20 KB | cp.async.bulk (bulk async copy) |
| sub_6C1CF0 | 16 KB | mbarrier (arrive, wait, test, counted variants) |
| sub_6C4DA0 | 15 KB | Load/store with scope, memory order, domain validation |
| sub_6D4350 | 30 KB | MMA intrinsics (HMMA, IMMA, DMMA variants) |
| sub_6D7AF0 | 19 KB | TCGen05 MMA (SM 100, 5th generation tensor core) |
Intrinsic parameter validators at sub_6BDB60--sub_6BF910 enforce type, sub-operation, and memory domain constraints. NVIDIA consistently misspells "intrinsic" as "instrinsic" in all validation error strings.
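Prefix-gated routing of this kind can be sketched as follows; only the "__nv_ptx_builtin_ocg_" prefix is recovered, while the suffixes and handler IDs below are illustrative stand-ins:

```c
#include <string.h>

/* Sketch of prefix-gated intrinsic routing: a name is only considered
   if it starts with the recovered prefix; the suffix then selects a
   handler. Suffixes and handler IDs are hypothetical. */
#define OCG_PREFIX "__nv_ptx_builtin_ocg_"

typedef struct { const char *suffix; int handler; } IntrinsicRow;

static const IntrinsicRow rows[] = {
    { "atom_add", 1 }, { "mbarrier_arrive", 2 }, { "cp_async_bulk", 3 },
};

static int route_intrinsic(const char *name) {
    size_t plen = strlen(OCG_PREFIX);
    if (strncmp(name, OCG_PREFIX, plen) != 0)
        return -1;                       /* not an OCG intrinsic      */
    const char *suffix = name + plen;
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        if (strcmp(suffix, rows[i].suffix) == 0)
            return rows[i].handler;
    return 0;                            /* prefix matched, unknown op */
}
```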
Post-Scheduling Statistics
Eight SM-variant statistics printers at sub_ABBA50--sub_ABEB50 (7,603 bytes each, spaced 0x700 apart) generate "# [...] " comments with comprehensive post-codegen metrics: instruction counts, register usage, spill/refill bytes, estimated latency and occupancy, per-functional-unit instruction estimates, MMA counts, and throughput figures. The per-unit instruction counter sub_ABF590 (17 KB) uses SSE2 operations for batch updates.
Operand Legalization
Post-register-allocation operand legalization rewrites instructions that cannot be directly encoded in SASS:
| Address | Size | Purpose |
|---|---|---|
| sub_AB3C30 | 32 KB | Post-RA instruction legalization (opcodes 288, 167, 185, 241, 299, 300, 317) |
| sub_AB2D50 | 18 KB | Per-class operand legalization (opcode 307 = ternary/FMA-like) |
| sub_ACF4D0 | 14 KB | Constraint solver -- splits instructions when direct encoding fails |
| sub_AB8940 | 19 KB | Register move coalescing / copy elimination |
| sub_AC2750 | 36 KB | Operand-to-encoding converter (36-byte operand records) |
When legalization requires instruction splitting, sub_ACF4D0 creates new instructions via sub_934630 (instruction constructor). The constraint solver tries alternative encodings before resorting to splits.
WGMMA Pipeline (SM 90+)
The WGMMA (Warp Group Matrix Multiply-Accumulate) pipeline optimizer at 0xACE000--0xAE6000 manages asynchronous tensor core execution for Hopper and later. It automatically inserts warpgroup.arrive and warpgroup.wait fences to ensure correct register handoff. The warning emitter (sub_ACE480) issues "Potential Performance Loss" advisories (codes 7509--7511) when pipelining fails due to extern calls, insufficient registers, or ill-formed pipeline stages.
See WGMMA Pipeline Optimizer for the full call tree, register pressure estimator, and serialization warning details.
Per-SM Architecture Dispatch
Every code generation subsystem dispatches through architecture-specific tables. The SM generation is determined by *(int*)(config+372) >> 12:
| config+372 >> 12 | Generation | SM versions |
|---|---|---|
| 3 | Kepler | sm_30--sm_37 |
| 5 | Maxwell | sm_50--sm_53 |
| 6 | Pascal | sm_60--sm_62 |
| 7 | Volta / Turing | sm_70--sm_75 |
| 8 | Ampere | sm_80--sm_89 |
| 9 | Hopper | sm_90--sm_90a |
| 10+ | Blackwell | sm_100--sm_121 |
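One packing consistent with this table is (major << 12) | minor, which reproduces every generation ID above. Only the >> 12 extraction is recovered from the binary; the packing itself is inferred and should be treated as an assumption:

```c
/* Hypothetical decode of the architecture selector at config+372,
   assuming the field packs the SM version as (major << 12) | minor
   (e.g. sm_75 -> 0x7005, sm_100 -> 0xA000). Only the >>12 generation
   extraction is observed in the binary. */
static int sm_generation(int arch_field) {
    return arch_field >> 12;   /* 7 = Volta/Turing, 8 = Ampere, ... */
}

static int make_arch_field(int major, int minor) {
    return (major << 12) | minor;
}
```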
Architecture-specific dispatch points across the codegen pipeline:
| Subsystem | Dispatch mechanism | Evidence |
|---|---|---|
| ISel | 4 arch-variant dispatch tables at sub_B128E0--sub_B12920 | All JUMPOUT to shared code at 0x1C39xxx |
| Encoding | vtable at *(context+416) with ~200 virtual methods | Per-opcode encoding, latency, hazard rules |
| Peephole | 3 mega-dispatchers with per-SM case logic | SM 120 dispatcher (sub_143C440) is arch-gated |
| Mercury | sub_6E8EB0 sets arch-specific flags in opcode descriptor table | SM 80: bits 1, 8; SM 84: bits 16, 64 |
| Statistics | 8 SM-variant printer clones at sub_ABBA50--sub_ABEB50 | 7,603 bytes each, 0x700 spacing |
| NR templates | Register-count-based dispatch at sub_1704070 | Thresholds: 20479 / 16383 |
Function Map (Top 10)
| Address | Size | Identity |
|---|---|---|
| sub_169B190 | 280 KB | Generic peephole dispatcher (all SM, 762 matchers) |
| sub_10D5E60 | 197 KB | Encoding getFieldOffset megadispatcher (961 callers) |
| sub_10E32E0 | 187 KB | Encoding hasField megadispatcher (72 callers) |
| sub_C0EB10 | 185 KB | Main instruction selector (500+ locals, giant switch) |
| sub_10C0B20 | 180 KB | Encoding setField megadispatcher (3,109 callers) |
| sub_10CCD80 | 142 KB | Encoding setFieldDefault megadispatcher (4 callers) |
| sub_1C9F280 | 97 KB | Master ELF emitter |
| sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) |
| sub_6FFDC0 | 66 KB | Mercury opex body (scoreboard generation) |
| sub_6E8EB0 | 64 KB | BasicBlock::Initialize (encoder state, opcode descriptors) |
See function-map.md for the complete table (~30 entries with all codegen functions).
Cross-References
- Instruction Selection -- DAG pattern matching, builder variants, operand validation
- SASS Instruction Encoding -- bit-level encoding format, 10-phase template, opcode hierarchy
- Peephole Optimization -- 3 mega-dispatchers, 3,185 matchers, priority-based rewrite
- Mercury Encoder Pipeline -- 6-stage sub-pipeline, WAR resolution, opex
- Capsule Mercury & Finalization -- SM 100+ variant with embedded PTX + relocations
- Newton-Raphson Templates -- DDIV/DRCP/DSQRT/DRSQRT software sequences
- SASS Text Generation -- 580 formatters, format string table
- Pipeline Overview -- full PTX-to-SASS compilation flow
- Phase Manager -- 159-phase pipeline infrastructure
- Scheduling Architecture -- 3-phase scheduler (pre-codegen)
- Register Allocation -- Fatpoint algorithm (pre-codegen)
- ELF/Cubin Output -- custom ELF emitter, section catalog
- Knobs System -- knobs controlling codegen behavior
Instruction Selection
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which uses a single SelectionDAG or GlobalISel framework, ptxas distributes instruction selection across two distinct pipeline stages separated by the entire optimization pipeline: Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (phase 5, ConvertUnsupportedOps), and Phase 2 converts Ori IR to final SASS binary forms during code generation (phases 112--122, ISel driver + Mercury encoder). The two phases serve fundamentally different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the optimal machine encoding for the target architecture after register allocation and scheduling are complete.
| Phase 1 location | Phase 5: ConvertUnsupportedOps (PTX opcode to Ori opcode) |
| Phase 2 location | Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary) |
| MercConverter dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF mask) |
| ISel driver | sub_B285D0 (9 KB, 66 callees, vtable entry) |
| ISel mega-selector | sub_C0EB10 (185 KB, 500+ locals, giant switch) |
| DAG pattern matchers | ~801 functions at 0xB28F60--0xB7D000 (~1.3 MB) |
| Arch dispatch tables | 4 copies at sub_B128E0--sub_B12920 (15,049 bytes each) |
| Mercury master encoder | sub_6D9690 (94 KB, instruction type switch) |
| MercExpand | sub_C3CC60 (26 KB, pseudo-instruction expansion) |
| SM120 pattern coordinator | sub_13AF3D0 (137 KB, 130-case switch, opcodes 2--352) |
| Opcode variant selectors | sub_B0BE00 (19 KB, class 194), sub_B0AA70 (5 KB, class 306) |
Architecture
PTX source text
|
v
[Bison parser] sub_4CE6B0 (48KB)
| Reduction actions build raw Ori nodes with PTX-derived opcodes
v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...) |
+------------------------------------------------------------------+
|
| PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
|
| sub_9F3340 (orchestrator, 7KB)
| -> sub_9F1A90 (MercConverter main, 35KB)
| -> sub_9ED2D0 (opcode dispatch, 25KB)
| Switch on (*(instr+72)) with BYTE1 & 0xCF mask
| ~120 case values -> ~60 handler functions
| + vtable dispatch for architecture-extensible ops
| -> sub_934630 (instruction creation, called N times)
| -> sub_9EF5E0 (post-conversion lowering, 27KB)
|
v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...).|
| Every instruction has a valid SASS opcode for the target SM. |
+------------------------------------------------------------------+
|
| [Phases 14-111: Full optimization pipeline]
| Register allocation, scheduling, peephole, etc.
|
v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled) |
+------------------------------------------------------------------+
|
| PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
|
| sub_B285D0 (ISel driver, 9KB)
| -> sub_C0EB10 (mega-selector, 185KB, default backend)
| -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
| -> sub_B1FA20 / sub_B20E00 (builder variants)
| -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
| -> sub_B128E0..sub_B12920 (4 arch dispatch tables)
|
| sub_6D9690 (Mercury master encoder, 94KB)
| -> Switch on instruction type (*(instr+8))
| -> sub_C00BF0 (opcode lookup)
| -> sub_91D160 (register encoding)
| -> sub_7B9B80 (bitfield insert, 18,347 callers)
|
| sub_C3CC60 (MercExpand, 26KB)
| -> sub_C37A10 (expand instruction, 16KB)
| -> sub_C39B40 (expand memory, 10KB)
| -> sub_C3BCD0 (expand control flow, 19KB)
|
v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words) |
+------------------------------------------------------------------+
Phase 1: PTX-to-Ori Opcode Conversion
Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.
The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.
MercConverter Dispatch -- sub_9ED2D0 (25 KB)
The central dispatch function of Phase 1. Despite the sweep's initial identification as PhaseRunner::executePhaseSequence, the decompiled code reveals a classic opcode switch: it reads *(instr+72), masks byte 1 with 0xCF (stripping modifier bits 4--5), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352) routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.
// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
// Pre-dispatch: check predication eligibility
bool can_predicate = sub_7E18A0(instruction, *(context+8));
if (can_predicate)
can_predicate = vtable[205](*(*(context+8)+1584), instruction);
*(context+40) = can_predicate;
// Read opcode, mask out modifier bits
int opcode = *(DWORD*)(instruction + 72);
BYTE1(opcode) &= 0xCF;
  // Special case: opcode 130 (MOV-like internal marker) with GPR operand -> clear predication
if (opcode == 130) {
int operand = *(DWORD*)(instruction + 84);
if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
*(context+40) = 0;
}
// Main dispatch
switch (opcode) {
case 1: sub_9DA5C0(context, instruction); break; // opcode class 1
case 6: sub_9DA100(context, instruction); break; // arithmetic
case 8: sub_9D2440(context, instruction); break; // specific class
case 10: case 11: case 149: case 151: case 152: case 290: case 291:
sub_9D80E0(context, instruction); break; // memory load/store
case 16: sub_9E8B20(context, instruction); break; // texture/surface
case 61: case 63: case 80:
sub_9E6600(context, instruction); break; // instruction expansion
case 108: sub_9D76D0(context, instruction); break; // memory legalization
// ... ~100 more cases ...
default: emit_noop(context, 0xFFFF); break; // unknown -> passthrough
}
// Post-dispatch: apply predication and operand adjustments
vtable[107](context, instruction);
}
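Expressed on the whole 32-bit opcode word, the BYTE1(opcode) &= 0xCF step clears dword bits 12--13 and nothing else:

```c
#include <stdint.h>

/* The BYTE1 & 0xCF mask from sub_9ED2D0, restated on the full DWORD:
   byte 1 is bits [8:15], so clearing its bits 4-5 clears dword bits
   12-13 (the modifier bits) while leaving every other bit intact. */
static uint32_t strip_modifier_bits(uint32_t raw_opcode) {
    return raw_opcode & ~0x3000u;   /* same effect as BYTE1 &= 0xCF */
}
```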
MercConverter Opcode Dispatch Table
The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.
Direct handler dispatch:
| Opcode(s) | Handler | Size | Category |
|---|---|---|---|
| 1 | sub_9DA5C0 | 2 KB | Opcode class 1 (basic ALU) |
| 6 | sub_9DA100 | 9 KB | Arithmetic operations |
| 8 | sub_9D2440 | -- | Specific class |
| 10, 11, 149, 151, 152, 290, 291 | sub_9D80E0 | 17 KB | Memory load/store |
| 15, 85 | sub_9EC340 | 23 KB | Multi-operand legalization |
| 16 | sub_9E8B20 | 17 KB | Texture/surface lowering |
| 17 | sub_9E7FB0 | -- | Surface operations |
| 22 | sub_9D6DB0 | -- | Specific lowering |
| 23 | sub_9E58F0 | -- | Specific lowering |
| 24 | sub_9D9F60 | -- | Specific lowering |
| 26 | sub_9E54C0 | -- | Specific lowering |
| 27 | sub_9E4BB0 | -- | Specific lowering |
| 28 | sub_9D9E70 | -- | Specific lowering |
| 32, 271 | sub_9E2440 | -- | Bitfield operations |
| 34 | sub_9E55E0 | -- | Specific lowering |
| 38, 59, 106, 180, 182, 192, 194, 215, 221, 242 | sub_9DA6B0 | -- | Generic ALU group |
| 41, 284 | sub_9D1DA0 | -- | Specific lowering |
| 42, 53, 55, 66 | sub_9D54B0 | -- | Grouped operations |
| 47 | sub_9E74E0 | -- | Conditional (arch flag check) |
| 51 | sub_9E2F60 | -- | Specific lowering |
| 52, 54, 72, 97 | sub_9D09C0 | -- | Group with v8=1 (deletion flag) |
| 57, 101 | sub_9D6170 | -- | Paired operations |
| 60, 62, 78, 79 | sub_9E5EE0 | -- | Comparison group |
| 61, 63, 80 | sub_9E6600 | 25 KB | Instruction expansion (64-bit split) |
| 67 | sub_9D9C30 | -- | Specific lowering |
| 70 | sub_9E3490 | -- | Specific lowering |
| 75 | sub_9E0C10 | -- | Specific lowering |
| 77 | sub_9E4DF0 | -- | Specific lowering |
| 83 | sub_9D6AB0 | -- | Specific lowering |
| 88, 89 | sub_9D5990 | -- | Paired operations |
| 90 | sub_9D2820 | -- | Specific lowering |
| 91 | sub_9E7600 | -- | Specific lowering |
| 92 | sub_9E7890 | -- | Specific lowering |
| 93, 95 | sub_9E1D40 | -- | Comparison variants |
| 94 | sub_9E1DF0 | -- | Specific lowering |
| 96 | sub_9D41C0 | -- | Specific lowering |
| 98 | sub_9D3230 | -- | Specific lowering |
| 100 | sub_9D70E0 | -- | Specific lowering |
| 102 | sub_9D9750 | -- | Specific lowering |
| 103, 104 | sub_9E31D0 | -- | Paired operations |
| 108 | sub_9D76D0 | 18 KB | Memory instruction legalization |
| 124 | sub_9E18B0 | -- | Specific lowering |
| 135 | sub_9D6560 | -- | Specific lowering |
| 139, 140, 141, 143 | sub_9D4C10 | -- | Related operations group |
| 145 | sub_9D3020 | -- | Specific lowering |
| 155, 268 | sub_9E5260 | -- | Paired operations |
| 156 | sub_9D94B0 | -- | Specific lowering |
| 158, 167 | sub_9E4A00 | -- | Paired operations |
| 161 | sub_9D21D0 | -- | Specific lowering |
| 162 | sub_9D9660 | -- | Specific lowering |
| 166 | sub_9E2100 | -- | Specific lowering |
| 170 | sub_9E2DF0 | -- | Specific lowering |
| 173, 267 | sub_9EB5C0 | -- | Paired operations |
| 174 | sub_9D9300 | -- | Specific lowering |
| 184 | sub_9D2E70 | -- | Specific lowering |
| 185 | sub_9E32F0 | -- | Specific lowering |
| 188, 190 | sub_9E2970 | -- | Paired operations |
| 195 | sub_9D2AB0 | -- | Specific lowering |
| 196 | sub_9D9080 | -- | Specific lowering |
| 198 | sub_9D66F0 | -- | Specific lowering |
| 201, 202, 204, 285 | sub_9EAC30 | -- | Async/bulk group |
| 203 | sub_9D8E90 | -- | Specific lowering |
| 205 | sub_9E1260 | -- | Specific lowering |
| 209 | sub_9E5740 | -- | Specific lowering |
| 210, 213, 214 | sub_9D8B30 | -- | Grouped operations |
| 240 | sub_9D6280 | -- | Specific lowering |
| 241 | sub_9E2CC0 | -- | Specific lowering |
| 247 | sub_9D0F70 | -- | Specific lowering |
| 248 | sub_9D0DF0 | -- | Specific lowering |
| 262 | sub_9E7440 | -- | Specific lowering |
| 264 | sub_9D73F0 | -- | Specific lowering |
| 276 | sub_9D5EC0 | -- | Specific lowering |
| 292 | sub_9D0E90 | -- | Specific lowering |
Vtable-indirect dispatch (for architecture-extensible operations):
| Opcode(s) | Vtable offset | Category (inferred) |
|---|---|---|
| 2, 3, 4, 5, 7 | vtable[0] (+0) | Generic fallback |
| 14, 39, 40, 105, 125, 299, 300, 321 | vtable[7] (+56) | Group A operations |
| 18 | vtable[3] (+24) | Specific class |
| 31 | vtable[4] (+32) | Specific class |
| 35 | vtable[6] (+48) | Specific class |
| 36 | vtable[21] (+168) | Specific class |
| 43 | vtable[9] (+72) | Specific class |
| 50 | vtable[12] (+96) | Specific class |
| 65 | vtable[22] (+176) | Specific class |
| 73 | vtable[15] (+120) | Specific class |
| 74 | vtable[16] (+128) | Specific class |
| 81 | vtable[24] (+192) | Specific class |
| 110, 111, 112, 114 | vtable[25] (+200) | Warp shuffle group |
| 118 | vtable[10] (+80) | Specific class |
| 119 | vtable[28] (+224) | Specific class |
| 120, 121, 126, 127, 128, 280, 281 | vtable[27] (+216) | Barrier/sync group |
| 122, 123, 310, 311, 312 | vtable[26] (+208) | Related group |
| 130 (HSET2), 169 | vtable[29] (+232) | Move/convert group (130 is MOV-like internally; actual SASS MOV = 19) |
| 157 | vtable[84] (+672) | Specific class |
| 176, 177 | vtable[34] (+272) | Paired operations |
| 183, 288 | vtable[36] (+288) | Paired operations |
| 186 | vtable[35] (+280) | Specific class |
| 211 | vtable[39] (+312) | Specific class |
| 220 | vtable[40] (+320) | Specific class |
| 223, 238 | vtable[41] (+328) | Paired operations |
| 228 | vtable[42] (+336) | Specific class |
| 243 | vtable[43] (+344) | Specific class |
| 245--253, 257 | vtable[67--77] (+536--+624) | SM 100+ operations |
| 265, 266 | vtable[93] (+744) | Paired operations |
| 270 | vtable[77] (+616) | Specific class |
| 277 | vtable[65] or vtable[11] (+520/+88) | Operand-type dependent |
| 279--351 | various high vtable offsets | SM 100+ / Blackwell operations |
The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
Per-Category Handlers
The larger handlers implement non-trivial conversion logic:
| Handler | Size | Category | Key behavior |
|---|---|---|---|
| sub_9E6600 | 25 KB | Instruction expansion | Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls sub_9D4380 (instruction builder) ~10 times per expansion. |
| sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
| sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
| sub_9D80E0 | 17 KB | Memory legalization (variant) | Same opcode set as sub_9D76D0, alternate code path for different operand patterns. |
| sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
| sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
| sub_9DA6B0 | -- | Generic ALU group | Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242). |
1:1 vs 1:N Expansion
Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:
PTX Ori IR (after Phase 1)
----------------------------------- -----------------------------------
add.f32 %r1, %r2, %r3 --> FADD R1, R2, R3 [1:1]
add.s32 %r4, %r5, %r6 --> IADD3 R4, R5, R6, RZ [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3 --> IMAD.LO R1, R2, R6, RZ [1:N split]
IMAD.HI R0, R2, R6, RZ
IMAD R0, R3, R6, R0
IMAD R0, R2, R7, R0
div.f32 %r7, %r8, %r9 --> MUFU.RCP R10, R9 [1:N, Newton-Raphson]
FMUL R7, R8, R10
(+ correction iterations)
bar.sync 0 --> BAR [1:1]
The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
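The 64-bit split performed for add.s64 can be sketched in plain C. This is an illustrative model of the hi/lo decomposition only; the struct layout, function name, and carry computation are assumptions, not recovered code (in SASS the carry flows through the IADD3 / IADD3.X pair):

```c
#include <stdint.h>

/* Sketch: 64-bit add lowered onto a 32-bit ALU, as sub_9E6600 does.
 * Low halves add with carry-out; high halves add with carry-in. */
typedef struct { uint32_t lo, hi; } u64_pair;

static u64_pair add64_split(u64_pair a, u64_pair b) {
    u64_pair r;
    r.lo = a.lo + b.lo;               /* IADD3   Rlo, Alo, Blo, RZ */
    uint32_t carry = r.lo < a.lo;     /* carry-out of the low add  */
    r.hi = a.hi + b.hi + carry;       /* IADD3.X Rhi, Ahi, Bhi, RZ */
    return r;
}
```

The same shape generalizes to the mul.lo.s64 expansion above, where partial products replace the two adds.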
Type-Dependent Opcode Selection
PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:
| PTX type | SASS prefix | Example PTX | Example SASS |
|---|---|---|---|
| .f16 / .f16x2 | H | add.f16 | HADD2 |
| .f32 | F | add.f32 | FADD |
| .f64 | D | add.f64 | DADD |
| .s32 / .u32 | I | add.s32 | IADD3 |
| .s64 / .u64 | I (split) | add.s64 | IADD3 + IADD3.X (carry chain) |
| .pred | P | setp.eq.f32 | FSETP |
The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
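The mapping in the table can be summarized as a lookup. ptxas implements this through its opcode tables rather than string matching, so the following is a readable approximation only; the function name is hypothetical:

```c
#include <string.h>

/* Sketch of the type-qualifier-to-mnemonic-prefix mapping. */
static char sass_prefix(const char *ptx_type) {
    if (!strcmp(ptx_type, ".f16") || !strcmp(ptx_type, ".f16x2")) return 'H';
    if (!strcmp(ptx_type, ".f32")) return 'F';
    if (!strcmp(ptx_type, ".f64")) return 'D';
    if (!strcmp(ptx_type, ".s32") || !strcmp(ptx_type, ".u32") ||
        !strcmp(ptx_type, ".s64") || !strcmp(ptx_type, ".u64")) return 'I';
    if (!strcmp(ptx_type, ".pred")) return 'P';
    return '?';   /* type not covered by the table above */
}
```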
SM-Dependent Legalization
The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:
- 64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
- FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by Phase 2 PromoteFP16)
- bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
- Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05
The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has approximately 90 method slots (up to offset +720), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.
Phase 2: Ori-to-SASS Selection & Encoding
Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.
ISel Driver -- sub_B285D0 (9 KB)
The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:
// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
int sm = *(context + 184); // SM version
int opcode = instruction[18] & 0xFFFFCFFF;
// Select architecture-variant builder
if (sm == 14)
Builder_VariantA(context, instruction); // sub_B1FA20 (13 KB)
else
Builder_VariantB(context, instruction); // sub_B20E00 (11 KB)
// Apply post-ISel modifiers
ApplyModifiers(context, instruction); // sub_B1D670 (13 KB)
SetProperties(context, instruction); // sub_B241A0 (7 KB)
}
The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.
ISel Mega-Selector -- sub_C0EB10 (185 KB)
The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).
Dual-Switch SM-Generation Dispatch
The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:
// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {
int64_t *vtable = *(a3->backend);
int opcode = *(int *)(a2 + 8); // SASS opcode type
// Pre-dispatch: capability check via vtable[12]
auto cap_check = vtable[12]; // offset +96
if (cap_check != sub_BFEAA0) // default stub?
if (cap_check(a3, a2))
ctx->flags[256] = 1; // set encoding flag
// Read opcode translator from vtable[2]
auto translator = vtable[2]; // offset +16
if (translator != sub_BFEBF0) {
// PATH A: SM-specific translation
int encoding_index = translator(a3, opcode);
int isel_opcode = *(ctx + 8); // post-translation opcode
switch (isel_opcode) { // PRIMARY SWITCH (169 cases)
case 7: case 34: case 35: case 36:
emit_simple(encoding_index, ...);
break;
case 8: case 38: case 46: ...
/* already encoded */ break;
// ... 169 cases total ...
default: goto high_opcode_path;
}
} else {
// PATH B: static table lookup (default backend)
int encoding_index = 355; // sentinel for extended opcodes
if (opcode <= 0xDD)
encoding_index = word_22B4B60[opcode];
switch (opcode) { // FALLBACK SWITCH (same 169 cases)
case 7: ...: goto handler_7; // jumps into Path A handlers
// ... identical case set ...
default: return;
}
}
high_opcode_path:
if (opcode > 0x199) return;
// Try vtable[3] extension dispatch for SM 100+ / Blackwell
auto extension = vtable[3]; // offset +24
if (extension != sub_BFEA30)
extension(a3, a2); // arch-extension handler
}
The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This doubles the size of the dispatch code but avoids a conditional merge point at every case entry.
Three Vtable Dispatch Points
| Vtable slot | Offset | Default stub | Purpose |
|---|---|---|---|
| vtable[2] | +16 | sub_BFEBF0 | Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: word_22B4B60[] static table. |
| vtable[12] | +96 | sub_BFEAA0 | Pre-dispatch capability check. Returns boolean that sets ctx[256] encoding flag. |
| vtable[3] | +24 | sub_BFEA30 | Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, opcodes > 0x199, SM 100+ extensions). |
The word_22B4B60 static table is a uint16[] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index. Opcodes > 221 receive the sentinel value 355. This provides the default encoding mapping; SM-specific vtable overrides can remap any opcode to a different encoding index, enabling per-architecture instruction variants without modifying the mega-selector logic.
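The fallback lookup reduces to a bounds-checked table read with a sentinel. The sketch below uses placeholder table contents (the real word_22B4B60 values are not reproduced here):

```c
#include <stdint.h>

enum { MAX_TABLED_OPCODE = 0xDD,       /* 221: last entry in word_22B4B60 */
       EXTENDED_SENTINEL = 355 };      /* extended / SM 100+ opcodes      */

/* Placeholder contents standing in for word_22B4B60[]. */
static const uint16_t demo_table[MAX_TABLED_OPCODE + 1] =
    { [7] = 12, [34] = 12, [221] = 90 };

static uint16_t encoding_index(const uint16_t *table, uint32_t opcode) {
    if (opcode <= MAX_TABLED_OPCODE)
        return table[opcode];          /* word_22B4B60[opcode]  */
    return EXTENDED_SENTINEL;          /* routed to vtable[3]   */
}
```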
Opcode Case Routing
The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:
| Group | Opcodes | Handler pattern | Instruction family |
|---|---|---|---|
| No-op passthrough | 8, 38, 46, 87, 89, 90, 93, 97, 98, 208 | goto LABEL_33 (already encoded) | Pre-encoded by upstream ISel |
| Simple emission | 7, 34, 35, 36 | sub_9314F0(encoding_index, 1 operand) | Basic ALU / simple 1-op |
| Branch/call | 9, 10, 11, 12, 13, 22 | sub_926370 / vtable[17] / linked-list walk | Control flow, call frames |
| Memory load/store | 15, 16, 18, 19, 20, 23, 24, 25, 26, 30 | sub_C01840 + address helpers | LDG, STG, LDS, etc. |
| Control flow | 31, 32, 33 | SSA phi nodes, branch tables | Phi, switch, call return |
| Generic ALU | 39, 41, 42, 50, 51, 52, 53 | sub_9314F0 passthrough | Standard arithmetic |
| Special register | 43, 44, 45 | sub_C06E90 symbol lookup | SR access, shared memory alias |
| Constant/predicate | 47, 54, 55, 56 | Direct operand copy / sub_BFFD60 | Constant bank, predicate ops |
| Address compute | 57 | 200-line handler, "__nv_reservedSMEM_offset_0_alias" | Complex addressing with SMEM |
| Immediate ops | 59, 60 | sub_C05CC0 / sub_C07690 | Immediate-operand variants |
| Barrier/sync | 61, 62, 63, 221 | Forward to vtable[3] extension | BAR, MEMBAR, SYNC |
| Conversion/move | 65 | Operand loop with per-element sub_9314F0 | MOV, CVT |
| Texture/surface | 67, 68, 69, 70 | Multi-operand type-qualified encoding | TEX, TLD, TXQ |
| Intrinsics | 71, 74, 75 | Loop-based operand emission | Hardware intrinsics |
| Tensor core | 84, 88, 91, 92 | Wide-operand encoding (case 92 = 354 lines) | HMMA, DMMA, IMMA, TCGen05 |
| Predication ext | 94, 95 | Predicate-dependent path selection | Extended predication |
| Memory extended | 99--130 (19 opcodes) | sub_C0B2C0 or sub_BFFD60 + encoding lookup | Extended memory ops |
| Warp intrinsics | 131--189 (50+ opcodes) | Mixed handlers, vtable[198]+632 dispatch | SHFL, VOTE, MATCH, REDUX |
| Async/bulk | 192--218 (15 opcodes) | sub_C0B2C0 / individual handlers | TMA, async copy, bulk ops |
The largest case handlers:
- Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
- Case 92: ~354 lines (tensor core instructions -- widest operand format)
- Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)
Operand Encoding Protocol
The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):
| Bits | Field | Description |
|---|---|---|
| [31:28] of word 0 | Type tag | 0x1=register, 0x4=constant bank, 0x5=immediate, 0x6=control/modifier, 0x9=special register |
| [23:0] of word 0 | Value | Register index, immediate value, or bank offset |
| word 1 | Flags | Modifier bits, encoding-format flags |
The marshalling pipeline for a typical case:
1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
-> Iterates source operands, writes tagged words to output_buf
-> Returns: number of operand words written
2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
-> Same for destination operands
3. Encoding-index lookup:
if (vtable[2] != default)
index = vtable[2](ctx, opcode);
else
index = word_22B4B60[opcode];
4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
-> Emits the instruction record to the output stream
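The tagged-word format itself is straightforward to model. A minimal sketch, with bits [27:24] of word 0 left unmodeled since their use is not documented above (names are illustrative):

```c
#include <stdint.h>

/* Tagged operand word: type tag in bits [31:28], value in [23:0]
 * of word 0; word 1 carries modifier/encoding flags. */
typedef struct { uint32_t word0, word1; } operand_word;

static operand_word pack_operand(uint32_t tag, uint32_t value, uint32_t flags) {
    operand_word w;
    w.word0 = (tag << 28) | (value & 0x00FFFFFF);
    w.word1 = flags;
    return w;
}

static uint32_t operand_tag(operand_word w)   { return w.word0 >> 28; }
static uint32_t operand_value(operand_word w) { return w.word0 & 0x00FFFFFF; }
```

A register operand R13 would thus be packed as tag 0x1 with value 13; an immediate as tag 0x5 with the literal in the low 24 bits.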
| Helper | Calls | Purpose |
|---|---|---|
| sub_C01840 | 52 | Marshal source operands into tagged-word buffer |
| sub_9314F0 | 31 | Emit instruction with encoding index + operand buffer |
| sub_C00EA0 | 8 | Extract single operand as tagged word |
| sub_91D160 | 8 | Encode register index to encoding bits |
| sub_934630 | 6 | Build new instruction node in IR (for multi-instruction expansion) |
| sub_91D150 | 5 | Decode register index from operand word |
| sub_926370 | 4 | Emit simple instruction (branch/jump) |
| sub_C01F50 | 3 | Marshal destination operands |
| sub_7D6860 | 3 | Encode data type qualifier (FP32/FP64/INT) |
| sub_BFEF10 | 3 | Register bank capacity check / grow |
| sub_92E1B0 | 2 | Emit instruction with constant-bank operand |
Cross-Reference: Arch Dispatch Tables
The 4 arch dispatch tables (sub_B128E0--sub_B12920) are not called from the mega-selector. They operate at the Mercury encoder level:
Mega-selector (sub_C0EB10)
-> Produces (encoding_index, operand_buffer) pairs
-> Calls sub_9314F0 to package into instruction nodes
Mercury encoder (sub_6D9690)
-> Reads instruction type field from instruction node
-> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
-> Encoder emits binary SASS using format + operand data
The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.
Post-ISel Modifiers -- sub_B1D670 (13 KB)
After the main ISel selection, this pass applies architecture-specific instruction modifications:
- Opcode 13: sets instruction field [79] = 3
- Opcode 14: sets instruction field [79] = 2
- Opcode 11: separate modifier path
The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.
Instruction Properties -- sub_B241A0 (7 KB)
Sets scheduling-relevant properties on the selected instruction:
- inst[74] = 7 -- scheduling class
- inst[75] = (opcode == 325) -- special flag for specific opcode
- inst[77] = sub_A3B930(...) -- operand class from register resolver
- inst[79] -- derived from a2[19], architecture-dependent
Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.
DAG Pattern Matchers -- ~800 Functions at 0xB28F60--0xB7D000
Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.
Pattern Matcher Architecture
The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.
Function signature (all 801+ matchers):
char __fastcall match(
int64_t ctx, // a1: ISel context (passed through to field reader)
int64_t dag_node, // a2: pointer to the Ori IR instruction node
int32_t *template_id, // a3: OUT: encoding template index [1..152]
int32_t *priority // a4: IN/OUT: current best priority; written only if better
);
The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher only upgrades the result if its specificity exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.
Matching pipeline (invariant across all 801 matchers):
1. OPCODE PROPERTY CHECKS sub_10AE5C0(ctx, node, field_id)
Check 1-12 instruction properties against expected values.
Any mismatch -> return immediately (early exit).
2. SOURCE OPERAND COUNT sub_B28F50(node) -> source_count
Verify the instruction has the expected number of source operands.
3. SOURCE OPERAND VALIDATION sub_B28F30(node, i) -> operand_record
For each source operand:
a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
b. Register class: class == 1023 (wildcard) OR class == specific_value
4. RESULT OPERAND COUNT sub_B28F40(node) -> result_count
Verify the expected number of result (destination) operands.
5. RESULT OPERAND VALIDATION sub_B28F30(node, first_result + j)
Same type + register-class checks as for source operands.
First-result index = sub_B28E00(*(node + 92)).
6. PRIORITY WRITE if (*a4 <= N) { *a4 = N+1; *a3 = template; }
Conditional update: only overwrite if this pattern is more specific
than whatever was already matched.
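The contest protocol around step 6 can be demonstrated end-to-end. The sketch below uses two toy matchers standing in for the 801 recovered functions; the dispatcher loop and matcher bodies are illustrative, but the conditional-write idiom mirrors the decompiled code:

```c
#include <stdint.h>

typedef char (*matcher_fn)(int64_t ctx, int64_t node,
                           int32_t *template_id, int32_t *priority);

/* Dispatcher: initialize *a4 = 0, try every candidate, keep the best. */
static int32_t run_contest(int64_t ctx, int64_t node,
                           matcher_fn *matchers, int n) {
    int32_t template_id = 0, priority = 0;
    for (int i = 0; i < n; i++)
        matchers[i](ctx, node, &template_id, &priority);
    return template_id;                 /* winning pattern's template */
}

/* Toy specificity ladder: both match, the more specific one wins. */
static char generic_match(int64_t c, int64_t n, int32_t *t, int32_t *p) {
    (void)c; (void)n;
    if (*p <= 8)  { *p = 9;  *t = 12; return 1; }   /* priority 9  */
    return 0;
}
static char specific_match(int64_t c, int64_t n, int32_t *t, int32_t *p) {
    (void)c; (void)n;
    if (*p <= 16) { *p = 17; *t = 13; return 1; }   /* priority 17 */
    return 0;
}
```

Because each matcher only upgrades the pair, the result is order-independent: the priority-17 matcher wins whether it runs first or last.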
Match-Score Priority System
The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.
| Priority range | Count | Interpretation |
|---|---|---|
| 2--5 | 31 | Fallback / generic patterns (few constraints) |
| 6--10 | 253 | Common patterns (3--6 constraints) |
| 11--15 | 293 | Standard patterns (5--8 constraints) |
| 16--20 | 168 | Specific patterns (6--10 constraints) |
| 21--34 | 56 | Highly specific patterns (8--12+ constraints) |
Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.
Dispatcher Mechanism
The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:
sub_B285D0 (ISel driver, 9 KB)
-> opcode switch on (instruction[18] & 0xFFFFCFFF)
-> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
-> builder invokes vtable method on instruction descriptor
-> vtable slot contains pointer to one of the 801 pattern matchers
-> matcher writes (template_id, priority) if pattern matches
The vtable dispatch occurs at various offsets including +2600, +2616, +2656, and +2896 (observed in sub_13AF3D0, the 137 KB ISel pattern coordinator). The matchers have no static callers -- they appear exclusively through indirect function pointer invocation, which is why the sweep reports them as "no callers in function DB."
For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.
DAG Node Property Accessor -- sub_10AE5C0
The field reader is the most-called function in the matcher range (typically 2--12 calls per matcher, so 3,000--8,000 total invocations across all 801 matchers):
// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
if (sub_10E32E0(node, field_id)) // field exists in descriptor?
return sub_10D5E60(node, field_id); // read value from property table
else
return 0xFFFFFFFF; // sentinel: field not present
}
The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
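A sparse property bag with an absent-field sentinel behaves as follows. The linear-scan store here is a stand-in; the actual structure behind sub_10E32E0 / sub_10D5E60 has not been characterized:

```c
#include <stdint.h>

#define FIELD_ABSENT 0xFFFFFFFFu        /* sentinel returned by sub_10AE5C0 */

typedef struct { uint32_t id, value; } prop_entry;
typedef struct { prop_entry *entries; int count; } prop_bag;

/* Sketch of DAGNode_ReadField semantics: look up a logical field ID,
 * return the sentinel if the descriptor does not carry that field. */
static uint32_t read_field(const prop_bag *bag, uint32_t field_id) {
    for (int i = 0; i < bag->count; i++)
        if (bag->entries[i].id == field_id)
            return bag->entries[i].value;
    return FIELD_ABSENT;
}
```

Matchers can therefore test fields unconditionally: an instruction class that lacks field 397 simply fails the field 397 == 2115 gate via the sentinel.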
The companion write functions follow the same field-ID namespace:
// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);
// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);
Inferred semantic groupings for field IDs (from cross-referencing matcher patterns):
| Field range | Likely semantics |
|---|---|
| 5--7 | Opcode class / major instruction group |
| 88 | Sub-operation modifier |
| 105 | Operation variant selector |
| 126 | Data type qualifier (e.g., field 126 in {547,548}) |
| 163 | Addressing mode / operand encoding class |
| 190--211 | Encoding format selectors |
| 220 | Specific encoding property |
| 242 | Width/size qualifier |
| 294 | Generic constraint field |
| 327 | Register format descriptor |
| 345 | Rounding / saturation mode |
| 348 | Precision qualifier |
| 355--429 | Extended instruction properties |
| 397 | Instruction validity flag (value 2115 appears as a near-universal gate) |
| 480 | High opcode range (Blackwell/SM 100+ instructions) |
| 595 | Highest observed field ID |
Field 397 with value 2115 appears in the majority of matchers as a mandatory check, suggesting it encodes a "this instruction is encoding-compatible" or "instruction is valid for ISel" flag.
Operand Record Layout
Each operand is a 32-byte record accessed by index via sub_B28F30:
// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
return *(int64_t*)(node + 32) + 32LL * index;
}
The 32-byte operand record:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 1 | type_tag | Operand kind (see predicate table below) |
| +4 | 4 | primary_class | Register class ID; 1023 = wildcard (any class) |
| +14 | 1 | modifier_a | Written by sub_B28F10 |
| +15 | 1 | modifier_b | Written by sub_B28F20 |
| +20 | 4 | secondary_class | Fallback register class constraint |
Source operand count is stored at node + 92 and doubles as the first-result-operand index:
uint32_t source_count = *(uint32_t*)(node + 92); // sub_B28F50
uint32_t result_count = *(uint32_t*)(node + 40) + 1 - source_count; // sub_B28F40
Operand Type Predicates
Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:
| Address | Name | Test | Semantics |
|---|---|---|---|
| sub_B28E20 | isImmediate | tag == 1 | Constant / immediate literal |
| sub_B28E10 | isGPR | tag == 2 | General-purpose register |
| sub_B28E80 | isPredicate | tag == 3 | Predicate register |
| sub_B28E70 | isType4 | tag == 4 | (specific operand class) |
| sub_B28E60 | isType5 | tag == 5 | (specific operand class) |
| sub_B28E30 | isSpecialReg | tag == 6 | Special register |
| sub_B28ED0 | isType7 | tag == 7 | (specific operand class) |
| sub_B28EF0 | isType8 | tag == 8 | (specific operand class) |
| sub_B28E50 | isType9 | tag == 9 | (specific operand class) |
| sub_B28E40 | isValidReg | tag == 10 | Generic valid register |
| sub_B28EE0 | isType11 | tag == 11 | (specific operand class) |
| sub_B28EA0 | isType13 | tag == 13 | (specific operand class) |
| sub_B28EB0 | isType14 | tag == 14 | (specific operand class) |
| sub_B28E90 | isUniformReg | tag == 15 | Uniform register (SM 75+) |
| sub_B28EC0 | isType16 | tag == 16 | (specific operand class) |
Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.
Register Class Constraint Protocol
Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:
// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);
if (sub_B28E00(primary) == 1023) {
// Wildcard -- operand is unconstrained, accept it
} else {
uint32_t cls = sub_B28E00(secondary);
if (cls != expected_class) return; // mismatch
}
sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.
Observed register class values in matchers:
| Class | Frequency | Likely meaning |
|---|---|---|
| 1023 | ubiquitous | Wildcard (any register class) |
| 1 | very common | 32-bit GPR (R0..R255) |
| 2 | common | 64-bit GPR pair |
| 3 | occasional | 128-bit GPR quad |
| 4 | occasional | Predicate or special register file |
| 5 | rare | Extended register class |
Representative Matcher Walkthroughs
sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):
1. field 480 == 2481 -> opcode/subclass check
2. source_count == 2 -> expects 2 source operands
3. operand[0].type == 1 (immediate) -> first source is a constant
4. operand[1].type == 2 (GPR) -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4 -> expects 4 result operands
7. result[0].type == 2 (GPR) -> first result is GPR
result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15 -> predicate or uniform register
9. result[2].type == 2 (GPR) -> third result is GPR
result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12
sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):
1. field 7 == 21 -> major opcode class
2. field 163 in {705, 706} -> addressing mode variant
3. field 203 in {1113..1117} -> encoding format (5 values)
4. field 105 == 477 -> operation variant
5. field 88 == 408 -> sub-operation modifier
6. field 345 == 1903 -> rounding/saturation mode
7. source_count == 2 -> 2 sources
8. operand[0].type == 1 (immediate) -> constant source
9. operand[1].type == 2 (GPR) -> register source
operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5 -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22
sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):
1. field 5 == 12 -> opcode class 12
2. field 220 == 1206 -> encoding property
3. field 595 in {2937, 2938} -> extended field (high range)
4. field 294 == 1493 -> constraint
5. field 242 in {1281, 1282} -> width qualifier
6. field 355 == 1943 -> extended property
7. field 376 == 2035 -> extended property
8. field 377 in {2037..2041} -> extended property (5 values)
9. field 429 in {2252, 2253} -> extended qualifier
10. field 126 in {547, 548} -> data type
11. field 397 == 2115 -> validity gate
12. source_count == 0 -> no source operands
13. result_count == 7 -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)
This pattern has the most field checks (12) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction like a tensor-core operation.
sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):
1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915} -> precision qualifier
4. field 397 == 2115 -> validity gate
5. source_count == 0 -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42
The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.
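Putting the protocol together, a matcher of this minimal shape (modeled on the sub_B28FE0 walkthrough) looks roughly like the following. The node struct and field names are stand-ins for the sparse property reads in the real code:

```c
#include <stdint.h>

/* Toy node carrying the four properties that sub_B28FE0 checks. */
typedef struct {
    uint32_t f211, f201, f348, f397;
    uint32_t source_count;
} toy_node;

/* Sketch of the minimal matcher: property checks, source-count check,
 * then the conditional (template, priority) write. */
static char match_minimal(const toy_node *n, int32_t *template_id,
                          int32_t *priority) {
    if (n->f211 != 1182) return 0;
    if (n->f201 != 1109) return 0;
    if (n->f348 != 1912 && n->f348 != 1915) return 0;  /* precision qualifier */
    if (n->f397 != 2115) return 0;                     /* validity gate       */
    if (n->source_count != 0) return 0;                /* no source operands  */
    if (*priority <= 7) { *priority = 8; *template_id = 42; return 1; }
    return 0;                                          /* lost the contest    */
}
```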
Helper Function Summary
| Address | Name | Signature | Purpose |
|---|---|---|---|
| sub_10AE5C0 | DAGNode_ReadField | (ctx, node, field_id) -> value | Read instruction property by ID; returns 0xFFFFFFFF if absent |
| sub_10AE590 | DAGNode_WriteField | (ctx, node, field_id, value) | Write single instruction property |
| sub_10AE640 | DAGNode_WriteFields | (ctx, node, f1, v1, v2) | Multi-field atomic update |
| sub_B28F30 | GetOperand | (node, index) -> operand_ptr | Index into operand array (32-byte records at *(node+32)) |
| sub_B28F40 | GetResultCount | (node) -> count | Number of result operands: node[40] + 1 - node[92] |
| sub_B28F50 | GetSourceCount | (node) -> count | Number of source operands: *(node+92) |
| sub_B28E00 | DecodeRegClass | (packed) -> class_id | Identity function (class stored as plain int) |
| sub_B28F00 | DecodeRegClass2 | (packed) -> class_id | Second identity accessor (same semantics) |
| sub_B28F10 | SetModifierA | (operand, value) | Write operand modifier at offset +14 |
| sub_B28F20 | SetModifierB | (operand, value) | Write operand modifier at offset +15 |
| sub_B28E10 | isGPR | (tag) -> bool | tag == 2 |
| sub_B28E20 | isImmediate | (tag) -> bool | tag == 1 |
| sub_B28E30 | isSpecialReg | (tag) -> bool | tag == 6 |
| sub_B28E40 | isValidReg | (tag) -> bool | tag == 10 |
| sub_B28E50 | isType9 | (tag) -> bool | tag == 9 |
| sub_B28E60 | isType5 | (tag) -> bool | tag == 5 |
| sub_B28E70 | isType4 | (tag) -> bool | tag == 4 |
| sub_B28E80 | isPredicate | (tag) -> bool | tag == 3 |
| sub_B28E90 | isUniformReg | (tag) -> bool | tag == 15 |
| sub_B28EA0 | isType13 | (tag) -> bool | tag == 13 |
| sub_B28EB0 | isType14 | (tag) -> bool | tag == 14 |
| sub_B28EC0 | isType16 | (tag) -> bool | tag == 16 |
| sub_B28ED0 | isType7 | (tag) -> bool | tag == 7 |
| sub_B28EE0 | isType11 | (tag) -> bool | tag == 11 |
| sub_B28EF0 | isType8 | (tag) -> bool | tag == 8 |
SM120 Pattern Coordinator -- sub_13AF3D0 (137 KB)
The largest ISel function in the binary (137 KB, 4,225 lines, 570+ locals). It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector sub_C0EB10. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.
Position in the ISel Pipeline
sub_B285D0 (ISel driver, 9 KB)
-> selects builder variant by SM version
-> Builder variant vtable dispatch
|
+-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
| opcodes 7..221, dual-switch, word_22B4B60 encoding table
|
+-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
-> sub_13AF3D0 (pattern coordinator, 137 KB)
opcodes 2..352, single switch, inline operand emission
The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.
Dispatch Structure
The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.
// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
pattern_table *a4, flags a5, int a6) {
int opcode = *(DWORD*)(a2 + 72);
BYTE1(opcode) &= 0xCF;
// Pre-dispatch: predication check when bit 0x1000 is set
if ((*(a2+72) & 0x1000) && opcode != 169) {
if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
}
// Setup output context
vtable[104/8](output_ctx, a1, &context_ref);
switch (opcode) {
case 2: case 4: case 7: // FMA/MAD 2-source
operand_span = 16; src_count = 2;
goto SHARED_FMA_HANDLER;
case 3: case 5: // FMA/MAD 3-source
operand_span = 24; src_count = 3;
goto SHARED_FMA_HANDLER;
case 6: // IMAD/IADD3 with 3+ sources
EmitOperand(ctx, instr, 3, out);
EmitOperand(ctx, instr, 4, out);
EmitOperand(ctx, instr, 5, out);
break;
case 8: // Pure vtable dispatch (vtable+2328)
vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
break;
case 10: case 11: case 151: case 152: case 290: case 291:
vtable[2768/8](a1, a2, a3, a4, a5); // Memory load/store
break;
case 16: // Texture/surface (163-line handler)
for (i = first_src; i < last_src; i++)
EmitOperand(ctx, instr, i, out); // loop up to 15 operands
break;
// ... 120 more cases ...
case 329: // Variable-count loop + vtable+2328
for (i = 0; i < src_count; i++)
EmitOperand(ctx, instr, i, out);
vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
break;
default:
break; // no-op passthrough
}
}
Opcode Case Routing
The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:
| Opcodes | Handler pattern | Instruction family |
|---|---|---|
| 2, 3, 4, 5, 7 | Shared FMA handler with operand-span parametrization | FMA/MAD variants (32/64-bit) |
| 6 | Inline 3-source emission + optional operands 6/7 | IMAD/IADD3 wide |
| 8 | Pure vtable+2328 delegation | Builder-only instructions |
| 10, 11, 151, 152, 290, 291 | vtable+2768 delegation | Memory load/store |
| 16 | 163-line operand loop (up to 15 sources) | Texture/surface |
| 20, 21 | vtable+2680/2688 with stub check | Memory/store alternates |
| 22, 77, 83, 297, 352 | vtable+2744 with nullsub_463 check | Control flow |
| 24, 34, 209, 213, 214 | Passthrough: emit src 1 + dst 2 | Simple 2-operand ALU |
| 29, 95, 96, 190 | Conditional operand-6 check | Predicate-source instructions |
| 38, 59, 106, 180, 182, 192, 194, 215, 221 | Single EmitOperand(1) at high SM | Generic ALU |
| 42, 53, 55 | EmitOperand(1) | Paired ALU |
| 60, 61, 62, 63, 64 | Comparison / inner sub-opcode switch (case 61: 5 sub-cases) | Compare / set-predicate |
| 88, 89 | Variable source count (2 or 3) with sign-dependent offsets | Extended FMA |
| 110, 111, 114, 115, 117 | Warp operand emission | Warp shuffle / vote |
| 120, 121, 126, 127 | Barrier handler with operand loop at LABEL_53 | Barrier / sync |
| 139, 140, 141, 143 | sub_13A4DA0 for commutative operand selection | Commutative ALU |
| 183 | Extended memory with register-class-6 check | Wide memory |
| 201, 202, 204 | vtable+2328 delegation | Async / bulk operations |
| 270, 279, 282, 285, 325--328 | Goto LABEL_53 (barrier/sync shared handler) | Extended memory / warp |
| 280, 281 | vtable+2896 with nullsub_239 check, then LABEL_53 | Sync instructions |
| 329 | Variable-count operand loop + vtable+2328 | Variable-width encoding |
Three Competing-Match Selection Mechanisms
The coordinator selects among competing pattern matchers through three mechanisms:
1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:
LABEL_750:
replacement = vtable[16/8](output_ctx, instruction);
*output = replacement;
return;
This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.
2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.
// sub_13A4DA0 -- simplified
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
if (!pattern_table->active) return default_slot;
uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
for (i = 0; i < pattern_table->count; i++) { // up to 4 entries
if (operand_desc matches pattern_table->entry[i])
{ *out_match = entry[i].preferred; return default_slot; }
}
// Repeat with alt_slot if no match on default_slot
operand_desc = GetOperandDescriptor(instruction, alt_slot);
for (i = 0; i < pattern_table->count; i++) {
if (operand_desc matches pattern_table->entry[i])
{ *out_match = entry[i].preferred; return alt_slot; }
}
return default_slot;
}
3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:
| Vtable offset | Default stub | Purpose |
|---|---|---|
| +2680 | sub_A8CBE0 | Memory operation alternate matcher |
| +2688 | sub_A8CBF0 | Store operation alternate matcher |
| +2744 | nullsub_463 | Control flow alternate |
| +2632 | nullsub_233 | Move/convert alternate |
| +2760 | nullsub_235 | Atomic/barrier alternate |
| +2896 | nullsub_239 | Sync instruction alternate |
| +3232 | sub_868720 | Pre-dispatch predication alternate |
| +3112 | sub_A8CCA0 | MADC alternate (case 36) |
When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
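The check itself is just a pointer comparison against the stub address. A minimal model of the pattern (function names hypothetical; the real code compares raw vtable slot values against the stub addresses in the table above):

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct instr instr;                     /* opaque Ori IR instruction */
typedef bool (*alt_matcher_fn)(void *ctx, instr *in);

/* Stands in for nullsub_463 / sub_868720 etc.: the default do-nothing slot. */
static bool null_stub(void *ctx, instr *in)    { (void)ctx; (void)in; return false; }
static bool always_match(void *ctx, instr *in) { (void)ctx; (void)in; return true;  }

/* Coordinator pattern: if the slot still holds the default stub, skip the
   call and fall through to inline emission; otherwise delegate. */
static bool try_override(alt_matcher_fn slot, void *ctx, instr *in) {
    if (slot == null_stub)
        return false;
    return slot(ctx, in);
}
```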
Primary Callee: sub_13A6280 (239 lines)
The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:
- Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index] and checks the register class at descriptor offset +64.
- Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
- Other types: pass through to the vtable dispatch chain.
The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).
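The tag extraction the emitter performs on each tagged operand word can be sketched as follows (bit placement [31:28] is from the recovered operand word format; the class-6 fast path mirrors the description above):

```c
#include <stdint.h>

/* Operand type tag lives in bits [31:28] of a tagged operand word. */
static unsigned operand_tag(uint32_t word) {
    return (word >> 28) & 0xF;
}

/* sub_13A6280's fast path: register class 6 (UB/dead) is skipped
   without emitting anything. */
static int is_dead_register_class(unsigned reg_class) {
    return reg_class == 6;
}
```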
Function Map Additions
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x operand emission) | HIGH |
sub_A29220 | 435 lines | Instruction iterator / coordinator caller (per-instruction walk) | HIGH |
sub_13A6280 | 239 lines | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
sub_13A4DA0 | 180 lines | Commutative operand position selector (4-entry pattern table) | HIGH |
sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
sub_13A45E0 | -- | Extra operand emitter (operands 6/7 for wide instructions) | MEDIUM |
sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
sub_13A75D0 | -- | Register class 6 (UB) operand substitution handler | MEDIUM |
sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
Architecture Dispatch Tables -- 4 Copies at sub_B128E0--sub_B12920
Four nearly identical functions provide architecture-variant opcode dispatch. Each is only 13 binary bytes (3 instructions -- a thunk into shared code at 0x1C39xxx), yet each decompiles to 15,049 bytes because Hex-Rays pulls in the massive shared switch statement the thunks jump into; the full decompiled output runs to 79,562 bytes.
Each table contains a switch on *(a3+12) (the opcode word field) with 50+ cases, and secondary switches on *(a3+14) (opcode sub-field) within certain cases. The return values are SASS encoding slot indices (e.g., 197, 691, 526, 697, 772, 21). The four copies serve different SM architecture families, mapping the same logical opcode to different encoding slots depending on the target.
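A minimal model of the thunk-plus-shared-switch layout (slot values 197/691/526 are among the observed return values; which family returns which slot is an illustrative assumption, not recovered data):

```c
/* Shared switch body (the code at 0x1C39xxx), parametrized by SM family.
   Return values model SASS encoding slot indices. */
static int shared_opcode_switch(int family, int opcode_field) {
    switch (opcode_field) {
    case 1:  return family == 0 ? 197 : 691;   /* hypothetical rows */
    case 2:  return 526;
    default: return -1;
    }
}

/* The four 13-byte thunks differ only in which family they select. */
static int dispatch_family0(int op) { return shared_opcode_switch(0, op); }
static int dispatch_family3(int op) { return shared_opcode_switch(3, op); }
```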
Opcode Variant Selectors
Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:
sub_B0BE00 (19 KB) -- opcode class 194:
- Massive switch on a2 (100+ cases)
- All cases call sub_10AE590(ctx, inst, 194, N) with sequential N values starting from 827
- Pattern: case K -> sub_10AE590(ctx, inst, 194, 826 + K)
- Maps sub-variant indices to SASS encoding slots for one PTX opcode family
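The whole 100+-case switch therefore collapses to one linear map; a trivial sketch of the recovered relationship:

```c
/* sub_B0BE00: case K calls sub_10AE590(ctx, inst, 194, 826 + K),
   so sub-variant K of opcode class 194 lands on encoding slot 826 + K. */
static int class194_encoding_slot(int k) {
    return 826 + k;
}
```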
sub_B0AA70 (5 KB) -- opcode class 306:
- Same pattern but with opcode class 306
- Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
- The alternating-pair pattern at stride 6 suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)
Instruction Modifier Dispatchers
Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:
sub_B13E10 (5,792 B) -- basic modifier dispatcher:
- All 21 callees are sub_10AE640 (DAG node modifier)
- Switch on BYTE1(a7) & 0x1F for modifier type
- Maps modifier values 1--6 to internal codes 31--35
- Secondary dispatch on (a7 >> 3) for register width encoding
sub_B157E0 (11,815 B) -- extended modifier dispatcher:
- All 37 callees are sub_10AE640
- Handles texture/surface operations specially (opcode type 18)
- Maps sub-opcodes (BYTE5(a7) & 0x3F) to encoding values 54--60
Mercury Master Encoder -- sub_6D9690 (94 KB)
The Mercury master encoder is the single largest backend function and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:
// Simplified encoding flow
void EncodeInstruction(context, instruction) {
int type = *(int*)(instruction + 8);
uint64_t base = 0x2000000000LL; // encoding base constant
switch (type) {
case 61: // FFMA with literal operand
sub_6D9580(ctx, operand); // encode literal
break;
case 455: // complex multi-operand format
// bit-field extraction and assembly
break;
// ... hundreds of cases ...
}
// Common tail: append operand words, commit
sub_6D2750(ctx, word); // append 8-byte operand word
sub_6D28C0(ctx); // commit instruction record
}
Key encoding dispatch details:
- Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
- sub_7D6860 handles data type encoding (FP32/FP64/INT)
- sub_C00BF0 provides opcode lookup from the encoding tables
- Architecture-specific bits accumulated via SM100+ extensions controlled by knob 4176
MercExpand -- Pseudo-Instruction Expansion
sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.
| Handler | Size | Instruction class |
|---|---|---|
sub_C37A10 | 16 KB | General instruction expansion (jump table, 4+ cases) |
def_C37B2E | 13 KB | Complex expansion cases (default handler, string "EXPANDING") |
sub_C39B40 | 10 KB | Memory operations (LDG, STG, LDS, etc.) |
sub_C3A460 | 6 KB | Atomic operations |
sub_C3B560 | 8 KB | Texture operations |
sub_C3BCD0 | 19 KB | Control flow (branches, jumps, calls) |
sub_C3E030 | 18 KB | Finalization and cleanup |
The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.
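The replace step is an ordinary doubly-linked-list splice; a generic sketch (this node layout is hypothetical -- the real Ori IR node offsets are documented in Ori IR):

```c
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
    int opcode;
} node;

/* Splice the expanded chain first..last into the list in place of the
   pseudo-instruction `old`, mirroring MercExpand's replace step. */
static void splice_replace(node *old, node *first, node *last) {
    first->prev = old->prev;
    last->next  = old->next;
    if (old->prev) old->prev->next = first;
    if (old->next) old->next->prev = last;
    old->prev = old->next = NULL;    /* detached; caller deletes it */
}
```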
OCG Encoding Template Lookup -- sub_C3F490
The OCG (Optimizing Code Generator) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.
Routing values assigned by the OCG router
| Value | Hex | Instruction class | Assigned when |
|---|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) | Barrier register present (v108 != 0 in conditional paths) |
| 243 | 0xF3 | Default memory operation | Fallback for general memory ops without barrier or special fence |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) | Load-type operations (from OCG load/store handler) |
| 246 | 0xF6 | Reduction/atomic default | Atomic operations and reductions |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) | Operations requiring memory fence semantics |
| 257 | 0x101 | Async copy without memory order | Bulk copy ops when no barrier: v108 == 0 selects 257, else 70 |
| 261 | 0x105 | Atomic with pre-existing value read | Atomic exchange / compare-and-swap returning old value |
How sub_C3F490 maps routing values to encoding templates
sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested if-else tree: the first level branches on the routing value, then inner branches refine the template based on the modifier flags.
sub_C3F490(routing_value, a2..a8) -> template_ptr
a2: has pre-existing-value operand (used only by value 257)
a3: SM generation > sm_7x (SM80+)
a4: has predicate attachment
a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
a6: (always 0 from OCG emitter, used by MercExpand callers)
a7: (always 0 from OCG emitter, used by MercExpand callers)
a8: (always 0 from OCG emitter, used by MercExpand callers)
The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
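The template walk reduces to a few lines; a sketch under the recovered layout (byte 0 = operand count, bytes 1..count = slot indices into the 39-QWORD operand buffer; control-word building and the sub_9314F0 commit are elided):

```c
#include <stdint.h>

/* Gather the operand words that a gathering template selects out of the
   39-QWORD operand buffer (v134[] in the decompilation). */
static int gather_operands(const uint8_t *tmpl,
                           const uint64_t operand_buf[39],
                           uint64_t out[39]) {
    int count = tmpl[0];                  /* template[0] = operand count */
    for (int i = 0; i < count; i++)
        out[i] = operand_buf[tmpl[1 + i]];
    return count;
}
```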
Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] masked with & 0xCFFF).
| Value | Hex | Origin | Instruction class |
|---|---|---|---|
| 254 | 0xFE | MercExpand only | Extended memory format (operand gather mode 3) |
| 262 | 0x106 | MercExpand only | Wide memory format (operand gather mode 0, with scope/fence branches) |
Template address space
The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:
| Address range | Routing values served |
|---|---|
0x22B8960--0x22B8E60 | 257 (async copy variants) |
0x22B8E60--0x22B9360 | 70 (barrier memory variants) |
0x22B9360--0x22B9860 | 262 (MercExpand wide memory) |
0x22B9860--0x22B9E60 | 247, 245 (fenced / load variants) |
0x22B9E60--0x22BA960 | 243, 246, 70 (default / reduction / barrier sub-variants) |
0x22BA960--0x22BB460 | Leaf templates for bare operand forms (no modifiers) |
Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.
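That fallback structure can be modeled for a single routing value as follows (the 0x100 stride is the template size above; which flag selects which step is an illustrative assumption):

```c
#include <stdint.h>

/* Modeled on sub_C3F490's inner branches: each cleared modifier flag
   falls through to a simpler template, 0x100 bytes further along. */
static uintptr_t template_for_flags(uintptr_t base, int has_pred, int has_scope) {
    if (has_pred && has_scope) return base;           /* most complex form */
    if (has_pred)              return base + 0x100;
    if (has_scope)             return base + 0x200;
    return base + 0x300;                              /* bare fallback     */
}
```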
Addressing Mode Selection
Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.
PTX addressing modes and their SASS encodings:
| PTX syntax | Addressing mode | SASS instruction | Encoding |
|---|---|---|---|
[%rd1] | Register indirect | LDG.E R0, [R2] | Register + zero offset |
[%rd1+16] | Register + offset | LDG.E R0, [R2+0x10] | Register + immediate offset |
c[2][0x100] | Constant bank | LDC R0, c[0x2][0x100] | Bank index + offset |
[%rd1], %r2 | Base + index | STG.E [R2], R4 | Separate base/data registers |
Special string references in sub_6273E0 confirm complex addressing:
- ".nv.reservedSmem.offset0" -- reserved shared memory region
- "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
- "__$endLabel$__%s" -- label generation for structured control flow
The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.
Vtable Dispatcher Zone -- 0xAF0000--0xB10000
The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:
// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
if (a2 <= N)
return (unsigned int)dword_XXXXXXX[a2]; // table lookup
return default_value;
}
Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).
Comparison with LLVM ISel
| Aspect | LLVM | ptxas |
|---|---|---|
| ISel framework | SelectionDAG or GlobalISel (single pass) | Two-phase: MercConverter (phase 5) + ISel driver (phase 112+) |
| Pattern specification | TableGen .td files, machine-generated | Handwritten C++ (~750 functions) |
| Pattern count | Target-dependent (thousands for x86) | ~801 DAG matchers + 185 KB mega-selector |
| Architecture dispatch | Subtarget feature bits | 4 architecture dispatch tables + vtable overrides |
| Intermediate form | MachineInstr (already selected) | Ori IR (SASS opcodes after phase 5, not yet encoded) |
| Encoding | MCInst emission (separate pass) | Integrated: ISel + Mercury encode in same pipeline |
| Expansion | Pseudo-instruction expansion in AsmPrinter | MercExpand (phase 118, post-ISel) |
| Optimization post-ISel | MachineFunction passes | Phases 14--111 (full optimizer runs between Phase 1 and Phase 2) |
The key architectural difference: LLVM performs instruction selection once, then optimization happens on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so the optimizer can reason about SASS-level semantics, then performs a second selection/encoding pass after optimization is complete. This two-phase design gives the optimizer accurate cost models (it sees real SASS opcodes, not abstract PTX operations) at the cost of architectural complexity.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_C0EB10 | 185 KB | ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch) | HIGH |
sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) | VERY HIGH |
sub_9F1A90 | 35 KB | MercConverter main instruction conversion pass | HIGH |
sub_9EF5E0 | 27 KB | Post-MercConverter lowering ("CONVERTING") | HIGH |
sub_C3CC60 | 26 KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
sub_9ED2D0 | 25 KB | MercConverter opcode dispatch (master switch, & 0xCF mask) | HIGH |
sub_9E6600 | 25 KB | Instruction expansion (64-bit split) | HIGH |
sub_9EC340 | 23 KB | Multi-operand instruction legalization | MEDIUM |
sub_B0BE00 | 19 KB | Opcode variant selector (class 194, 100+ cases) | HIGH |
sub_C3BCD0 | 19 KB | MercExpand::expandControlFlow | MEDIUM |
sub_9D76D0 | 18 KB | Memory instruction legalization (load/store) | HIGH |
sub_C3E030 | 18 KB | MercExpand::finalizeExpansion | MEDIUM |
sub_9D80E0 | 17 KB | Memory instruction legalization (variant) | HIGH |
sub_9E8B20 | 17 KB | Texture/surface lowering | MEDIUM |
sub_C37A10 | 16 KB | MercExpand::expandInstruction (jump table) | HIGH |
sub_B128E0--sub_B12920 | 15 KB x4 | Architecture dispatch tables (4 SM families) | HIGH |
sub_B1FA20 | 13 KB | SASS 3-operand builder (variant A) | HIGH |
sub_B1D670 | 13 KB | Post-ISel instruction modifier | HIGH |
def_C37B2E | 13 KB | MercExpand complex cases ("EXPANDING") | HIGH |
sub_B157E0 | 12 KB | Extended modifier dispatcher (37 callees) | HIGH |
sub_B20E00 | 11 KB | SASS 3-operand builder (variant B) | HIGH |
sub_C39B40 | 10 KB | MercExpand::expandMemoryOp | MEDIUM |
sub_9DA100 | 9 KB | Arithmetic operation handler (case 6) | HIGH |
sub_B285D0 | 9 KB | ISel lowering driver (66 callees) | HIGH |
sub_B241A0 | 7 KB | SASS instruction property setter | HIGH |
sub_9F3340 | 7 KB | MercConverter orchestrator ("After MercConverter") | HIGH |
sub_C3A460 | 6 KB | MercExpand::expandAtomicOp | MEDIUM |
sub_B13E10 | 6 KB | Basic modifier dispatcher (21 callees) | HIGH |
sub_B0AA70 | 5 KB | Opcode variant selector (class 306) | HIGH |
sub_9DA5C0 | 2 KB | Opcode class 1 handler | MEDIUM |
sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x sub_13A6280, opcodes 2--352) | HIGH |
sub_A29220 | ~17 KB | SM120 instruction iterator (calls sub_13AF3D0 per instruction) | HIGH |
sub_13A6280 | ~10 KB | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
sub_13A4DA0 | ~7 KB | Commutative operand position selector (4-entry pattern table) | HIGH |
sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
sub_13A45E0 | -- | Extra operand emitter (wide instruction operands 6/7) | MEDIUM |
sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
sub_10AE5C0 | tiny | DAGNode_ReadField (field_id to value, delegates to sub_10D5E60) | VERY HIGH |
sub_10AE590 | tiny | DAGNode_WriteField (single field write) | VERY HIGH |
sub_10AE640 | tiny | DAGNode_WriteFields (multi-field update) | VERY HIGH |
sub_B28F30 | tiny | GetOperand (index into 32-byte operand array at *(node+32)) | VERY HIGH |
sub_B28F40 | tiny | GetResultCount (node[40] + 1 - node[92]) | VERY HIGH |
sub_B28F50 | tiny | GetSourceCount (*(node+92)) | VERY HIGH |
sub_B28E00 | tiny | DecodeRegClass (identity function, class is plain int) | VERY HIGH |
sub_B28E10 | tiny | isGPR operand predicate (tag == 2) | VERY HIGH |
sub_B28E20 | tiny | isImmediate operand predicate (tag == 1) | VERY HIGH |
sub_B28E40 | tiny | isValidReg operand predicate (tag == 10) | VERY HIGH |
sub_B28E80 | tiny | isPredicate operand predicate (tag == 3) | VERY HIGH |
sub_B28E90 | tiny | isUniformReg operand predicate (tag == 15) | VERY HIGH |
sub_B28F60--sub_B74C60 | ~1.3 MB | ~801 DAG pattern matchers (priority 2--34, template 1--152) | HIGH |
sub_C01840 | -- | Mega-selector source operand marshaller (52 calls from mega-selector) | HIGH |
sub_C01F50 | -- | Mega-selector destination operand marshaller | HIGH |
sub_C00EA0 | -- | Single operand extractor (returns tagged operand word) | HIGH |
sub_BFFD60 | -- | Operand reference resolver (register ref to encoding word) | HIGH |
sub_C06E90 | -- | Symbol/special-register lookup for shared memory | HIGH |
sub_C07690 | -- | Immediate-operand encoding helper | MEDIUM |
sub_C0B2C0 | -- | Extended memory/warp operation encoder | HIGH |
sub_C05CC0 | -- | Immediate operation encoder (flag-dependent path) | MEDIUM |
sub_BFEBF0 | tiny | Default vtable[2] stub (opcode translator, no-op identity) | VERY HIGH |
sub_BFEAA0 | tiny | Default vtable[12] stub (capability check, always false) | VERY HIGH |
sub_BFEA30 | tiny | Default vtable[3] stub (extension handler, no-op) | VERY HIGH |
sub_BFEF10 | -- | Register bank capacity check / grow | MEDIUM |
word_22B4B60 | -- | Static opcode-to-encoding-index table (uint16[222], default backend) | VERY HIGH |
sub_C3F490 | 184 B | OCG encoding template lookup (routing value + 7 flags -> template ptr) | VERY HIGH |
sub_6CB8A0 | -- | OCG SASS instruction emitter (calls sub_C3F490 then sub_9314F0) | HIGH |
sub_C41100 | -- | MercExpand memory encoder (calls sub_C3F490 with full flag set) | HIGH |
sub_C40420 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
sub_C40B90 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
sub_C42330 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
unk_22B8960--unk_22BB460 | ~11 KB | Operand gathering templates (40+ entries, 256 B each) | HIGH |
Cross-References
- PTX-to-Ori Lowering -- Phase 1 context: bridge phases, MercConverter call chain
- Code Generation Overview -- ISel within the codegen pipeline
- SASS Instruction Encoding -- bit-level encoding format, operand encoders
- Mercury Encoder Pipeline -- Mercury master encoder, MercExpand
- Peephole Optimization -- post-ISel pattern rewrites (3 mega-dispatchers)
- Newton-Raphson Templates -- DDIV/DRCP/DSQRT expansion sequences
- Intrinsics: OCG Lowering Pipeline -- OCG router that assigns routing values, operand buffer layout
- Ori IR -- instruction format, opcode field layout
- SASS Opcodes -- target instruction set
SASS Instruction Encoding
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS instruction encoder is the single largest subsystem in ptxas by function count. It translates the internal Ori IR instruction representation into packed binary SASS machine code for a specific SM target. The encoder comprises approximately 4,000 template-generated handler functions dispatched through function-pointer tables indexed by opcode, plus six massive switch-dispatch megafunctions that route field-level queries by instruction category. The core encoding primitive is a single 216-byte bitfield-insert function (sub_7B9B80) called from 18,347 sites throughout the binary. NVIDIA internally names this pipeline phase "Ori Phase Encoding" within the Mercury assembler backend.
| Pipeline phase | OriPhaseEncoding (within Mercury) |
| Core bitfield packer | sub_7B9B80 (216 bytes, 18,347 callers) |
| Encoding buffer | 1280 bits = 20 QWORDs at a1+544 |
| Instruction widths | 64-bit (format 1), 128-bit (format 2), 256-bit (format 8) |
| Opcode hierarchy | 3-level: major (9 bits) / minor (8 bits) / sub-opcode (7 bits) |
| SM100 encoder count | ~1,086 encode functions + ~97 decode functions |
| SM100 opcode categories | 370 (case values 0x0 through 0x171) |
| SM100 major opcodes | 102 unique values |
| Bitfield accessor primitives | 2,095 functions (mostly under 200 bytes) |
| Confirmed strings | "AdvancedPhaseOriPhaseEncoding", "MercEncodeAndDecode", "After EncodeAndDecode", "ENCODING" |
Encoding Buffer Layout
Every encoder operates on an instruction encoding context object passed as a1. The primary encoding target is a 1280-bit (160-byte, 20 QWORD) buffer at offset a1+544. The bitfield packer sub_7B9B80 writes individual fields into this buffer by iterating in 64-bit chunks:
// sub_7B9B80(a1, bit_offset, bit_width, value)
// Insert `value` into bits [bit_offset .. bit_offset+bit_width) of the encoding buffer
void bitfield_insert(int64_t a1, int bit_offset, int bit_width, uint64_t value) {
uint64_t mask = (1ULL << bit_width) - 1;
value &= mask;
int pos = bit_offset;
  if (pos < 1280) {                  // decompiled as a loop, but only one iteration runs
    int qword_idx = pos >> 6;        // which QWORD
    int bit_in_qword = pos & 63;     // bit position within QWORD
    *(uint64_t*)(a1 + 8 * qword_idx + 544) |= value << bit_in_qword;
    // Handle fields that cross a QWORD boundary
    if (bit_in_qword + bit_width > 64)
      *(uint64_t*)(a1 + 8 * (qword_idx + 1) + 544) |= value >> (64 - bit_in_qword);
  }
}
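As a worked check of the field placement, a standalone re-implementation of the packer over a zeroed buffer (our code, not the binary's) reproduces the header layout described under Instruction Word Format:

```c
#include <stdint.h>

/* Standalone model of sub_7B9B80 over a 20-QWORD (1280-bit) buffer. */
static void insert_field(uint64_t buf[20], int off, int width, uint64_t val) {
    val &= (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    buf[off >> 6] |= val << (off & 63);
    if ((off & 63) + width > 64)                       /* QWORD-crossing part */
        buf[(off >> 6) + 1] |= val >> (64 - (off & 63));
}
```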
The encoding context object has this layout:
| Offset | Size | Content |
|---|---|---|
+0x000 | 8B | vtable / allocator pointer |
+0x008 | 16B | Format descriptor (xmmword constant from rodata) |
+0x010 | 4B | Bitfield position base index |
+0x018 | 120B | Register class maps (3 arrays of 10 DWORDs: source classes, dest classes, widths) |
+0x090 | 4B | Operand count (a1+144) |
+0x094--+0x110 | 124B | Explicit operand mapping table (pairs of index + bit position)
+0x194 | 32B | Extended operand attributes (from xmmword tables) |
+0x1D4--+0x214 | 64B | Constant buffer slot table (16 DWORD slots, cleared to 0xFF by sub_7B9D30) |
+0x214 | 4B | Constant buffer slot counter (a1+532) |
+0x218 | 8B | Encoding validation context pointer (a1+536) |
+0x220 | 8B | Instruction bits [63:0] (a1+544) |
+0x228 | 8B | Instruction bits [127:64] (a1+552) |
+0x230+ | -- | Additional encoding space (up to 1280 bits total)
Instruction Word Format
SASS instructions use a 3-level opcode hierarchy packed into the first 32 bits of the encoding buffer. The format code in bits [0:3] determines instruction width:
128-bit instruction word:
bits [0:3] = 0x2 (format code: 128-bit)
bits [4:6] = 0x0 (scheduling group slot 0)
bits [8:16] = MAJOR (9-bit major opcode, 0x00-0x171)
bits [17:24] = MINOR (8-bit minor opcode / variant)
bits [25:31] = SUBOP (7-bit sub-opcode / format ID)
bits [48+] = MODIFIERS (format-dependent modifier fields)
bits [132:134] = 0x0 (extended opcode flag, at offset 0x84)
64-bit instruction word:
bits [0:3] = 0x1 (format code: 64-bit)
bits [4:6] = 0x0 (scheduling group slot 0)
bits [8:16] = MAJOR (9-bit major opcode)
bits [17:24] = MINOR (8-bit minor opcode)
bits [25:31] = SUBOP (7-bit sub-opcode)
(no bit 132 extended flag -- only 5 initial sub_7B9B80 calls)
256-bit instruction word:
bits [0:3] = 0x8 (format code: 256-bit)
(used for IMAD.WIDE variants with 16 constant-bank operand slots)
The 128-bit format uses 6 initial sub_7B9B80 calls (including one at offset 0x84 for the extended flag). The 64-bit format uses only 5 (no 0x84 call). This is the reliable way to distinguish the two during analysis.
The maximum variant value observed is 0x2F (47 decimal), so each major opcode can have up to 48 sub-operations, though most have far fewer.
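The header layout can be sanity-checked with a small round-trip sketch (bit positions taken from the tables above; the helper names are illustrative, not recovered symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Pack the 3-level opcode hierarchy into the low 32 bits of an instruction
   word: format [0:3], MAJOR [8:16], MINOR [17:24], SUBOP [25:31].
   (Sketch for analysis tooling; not code from the binary.) */
static uint32_t pack_header(uint32_t fmt, uint32_t major, uint32_t minor, uint32_t subop) {
    return  (fmt   & 0xFu)
         | ((major & 0x1FFu) << 8)
         | ((minor & 0xFFu)  << 17)
         | ((subop & 0x7Fu)  << 25);
}

static uint32_t header_major(uint32_t w) { return (w >> 8)  & 0x1FF; }
static uint32_t header_minor(uint32_t w) { return (w >> 17) & 0xFF; }
static uint32_t header_subop(uint32_t w) { return (w >> 25) & 0x7F; }
```

Feeding the extractors the first DWORD of a dumped SASS word recovers the (MAJOR, MINOR, SUBOP) triple before consulting the dispatch tables.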
Encoder Template
Every encoding handler function follows an identical 10-phase template. The only differences between the approximately 1,086 encoder functions for SM100 are the specific constant values and which modifier-encoding helpers are called. This is textbook C++ template/macro expansion:
int64_t encode_OPCODE_VARIANT(int64_t a1, int64_t a2) {
// a1 = instruction encoding context (output)
// a2 = Ori IR instruction node (input)
// Phase 1: Set instruction format header
sub_7B9B80(a1, 0, 4, FORMAT_CODE); // bits[0:3] = 1 (64b) / 2 (128b) / 8 (256b)
sub_7B9B80(a1, 4, 3, 0); // bits[4:6] = sched group slot 0
sub_7B9B80(a1, 0x84, 3, 0); // bits[132:134] = extended flag (128-bit only)
sub_7B9B80(a1, 8, 9, OPCODE_CLASS); // bits[8:16] = major opcode
sub_7B9B80(a1, 0x11, 8, OPCODE_MINOR); // bits[17:24] = minor opcode / variant
sub_7B9B80(a1, 0x19, 7, FORMAT_ID); // bits[25:31] = sub-opcode / format ID
// Phase 2: Load operand format descriptor
*(xmmword*)(a1 + 8) = xmmword_23FXXXX; // 128-bit format field layout from rodata
// Copy 3 arrays of 10 DWORDs into a1+24..a1+140 (slot sizes, types, flags)
// Phase 3: Set operand count and modifier table
*(int*)(a1 + 144) = NUM_SOURCE_OPERANDS;
*(xmmword*)(a1 + 404) = xmmword_YYYYYYY; // modifier descriptor table
// Phase 4: Initialize encoding context
sub_7B9D30(a1); // clear constant buffer slot table (memset +468, 0xFF, 64)
sub_7B9D60(a1, a2, 0); // encode reuse flags + guard predicate
// Phase 5: Encode primary opcode ID
void* ctx = *(void**)(a1 + 536);
int opcode = sub_10BFxxx(*(a2+32) + 32 * *(a2+40)); // extract from IR operand
int encoded = sub_10B6180(ctx, opcode); // map through lookup table
sub_7B9B80(a1, 8 * *(a1+16), 1, encoded); // insert at computed position
// Phase 6: Encode source operands (variable number and types)
sub_7BC030(a1, a2, 0, 0x60); // register operand 0 at bit offset 0x60
sub_7BC030(a1, a2, 1, 0x70); // register operand 1 at bit offset 0x70
sub_7BCF00(a1, a2, 2, 0x88); // immediate operand 2 at bit offset 0x88
sub_7BC5C0(a1, a2, 3, 0x98); // predicate operand 3 at bit offset 0x98
// Phase 7: Encode instruction-specific modifiers
int mod_val = sub_10B96A0(a2); // read modifier from IR node
int enc_mod = sub_10B3680(ctx, mod_val); // validate and map
*(int64_t*)(a1+544) |= ((int64_t)enc_mod << 55) & 0x180000000000000LL;
  // Phase 8: Encode explicit operand mapping (source operands with data offsets)
  *(int*)(a1 + 148) = operand_index;
  *(int*)(a1 + 152) = 8 * bit_position;
  // (Phases 9--10 vary per encoder and are omitted from this skeleton)
}
Operand Type Encoders
Four type-specific helper functions encode operands into the instruction word. Each reads the operand descriptor from the IR instruction's operand table at *(a2+32) + 32*operand_index.
Register Operand Encoder -- sub_7BC030
814 bytes, 6,147 callers. Encodes a general-purpose register (R0-R255, UR0-UR63):
// sub_7BC030(insn, ir_insn, operand_index, bit_offset)
void encode_register(int64_t a1, int64_t a2, int op_idx, int bit_off) {
if (op_idx >= *(int*)(a2 + 92)) // check operand count
return;
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
int reg_type_raw = *(int*)(operand + 20);
// Map register file type to 4-bit encoding:
// 1->0, 2->1, 3->2, 4->3, 5->4, 6->5, 7->6, 8->7,
// 16->8, 32->9, 64->10, 128->11
int reg_type = map_regfile(reg_type_raw);
int reg_num = *(int*)(operand + 4); // signed register number
sub_7B9B80(a1, bit_off, 1, 1); // 1-bit presence flag
sub_7B9B80(a1, bit_off + 1, 4, reg_type); // 4-bit register type
sub_7B9B80(a1, bit_off + 6, 10, reg_num); // 10-bit register number
}
The register file type encoding maps raw operand type codes to a 4-bit hardware register file selector. The 12 supported raw values (1 through 8, then the powers of two 16, 32, 64, 128) cover GPR, uniform registers, predicate registers, special registers, and extended register files.
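The mapping is compact enough to restate as a helper (a sketch of the table in the code comment above; `map_regfile` is the placeholder name used in the reconstructed encoder, not a recovered symbol):

```c
#include <assert.h>

/* Raw register-file type code -> 4-bit hardware selector:
   1..8 -> 0..7, then 16 -> 8, 32 -> 9, 64 -> 10, 128 -> 11.
   Returns -1 for codes outside the 12 documented values. */
static int map_regfile(int raw) {
    if (raw >= 1 && raw <= 8)
        return raw - 1;
    switch (raw) {
        case 16:  return 8;
        case 32:  return 9;
        case 64:  return 10;
        case 128: return 11;
        default:  return -1;   /* not observed in the recovered tables */
    }
}
```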
Immediate / Constant-Buffer Encoder -- sub_7BCF00
856 bytes, 1,657 callers. Encodes immediate values and constant memory references (c[bank][offset]):
// sub_7BCF00(insn, ir_insn, operand_index, bit_offset)
void encode_immediate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
int type = *(uint8_t*)operand;
if (type == 14 || type == 15 || type == 16) {
// Predicate-typed immediate: store to constant buffer slot table
*(void**)(a1 + 468 + 8 * *(int*)(a1 + 532)) = operand + 8;
*(int*)(a1 + 532) += 1;
}
sub_7B9B80(a1, bit_off, 1, 1); // presence flag
sub_7B9B80(a1, bit_off + 11, 5, *(int*)(operand + 4)); // 5-bit value
}
Predicate Encoder -- sub_7BC5C0
416 bytes, 1,449 callers. Encodes predicate register operands (PT, P0-P6):
// sub_7BC5C0(insn, ir_insn, operand_index, bit_offset)
void encode_predicate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
// pred_type, pred_cond, and pred_value are read from fields of `operand`
// (exact offsets not recovered); only the bitfield layout is certain:
sub_7B9B80(a1, bit_off, 2, pred_type); // 2-bit predicate type
sub_7B9B80(a1, bit_off + 3, 3, pred_cond); // 3-bit condition code
sub_7B9B80(a1, bit_off + 8, 8, pred_value); // 8-bit predicate value
}
Uniform Register Encoder -- sub_7BC360
Used for uniform registers (UR0-UR63) and source operands with alternative bitfield layouts. 126 calls in the SM100 encoding range. Likely handles the UR register file, which has an encoding namespace separate from the main GPR file.
Instruction Format Groups
The encoder functions are organized into 16 format groups, identified by the xmmword constant loaded at a1+8. Each xmmword holds the field layout descriptor for that instruction format. The groups divide into two categories:
128-bit Formats (11 groups)
| Format Descriptor | Format ID | Encoder Count | Description |
|---|---|---|---|
xmmword_23F1DF8 | 0x03 | 145 | General ALU/memory -- most common format |
xmmword_23F29A8 | 0x19 | 117 | Extended format for complex instructions |
xmmword_23F21B0 | 0x0A | 99 | Multi-source ALU operations |
xmmword_23F2678 | 0x13 | 27 | Tensor/extended ALU |
xmmword_23F2018 | 0x07 | 9 | Miscellaneous ALU |
xmmword_23F2348 | 0x0D | 6 | Specialized ALU |
xmmword_23F2EF8 | 0x23 | 5 | Extended variant |
xmmword_23F2810 | 0x16 | 4 | Bulk data / DMA |
xmmword_23F2128 | 0x09 | 2 | Rare format |
xmmword_23F2DE8 | 0x21 | 2 | Rare extended |
xmmword_23F25F0 | 0x12 | 2 | Rare format |
64-bit Formats (5 groups)
| Format Descriptor | Encoder Count | Description |
|---|---|---|
xmmword_23F1F08 | 113 | Short-form general -- widest opcode coverage (27 classes) |
xmmword_23F1D70 | 41 | Short-form 4-operand |
xmmword_23F1F90 | 11 | Short-form variant |
xmmword_23F2238 | 8 | Short-form variant |
xmmword_23F2C50 | 1 | Minimal format |
The 128-bit group (format code 2) encodes long-form SASS instructions (ALU, load/store, texture, tensor core). The 64-bit group (format code 1) encodes short-form instructions (simple moves, branches, barriers, NOP-like control). Two additional functions use format code 8 (256-bit) for IMAD.WIDE variants with 16 constant-bank operand slots.
Instruction Format Group Catalog
Format Descriptor Architecture
Each format group is defined by a 128-bit xmmword constant stored in rodata at addresses 0x23F1xxx--0x23F2xxx. This descriptor is loaded via SSE into the encoding context at a1+8:
*(__m128i *)(a1 + 8) = _mm_loadu_si128(&xmmword_23F29A8);
Immediately following each xmmword in rodata are three arrays of 10 DWORDs that define the operand slot layout. The encoder copies these into the context object at a1+24 through a1+140:
| Rodata Array | Context Offset | Content |
|---|---|---|
dword_XXXXX8[10] | a1+24 .. a1+60 | Operand slot sizes (bits per slot) |
dword_XXXXE0[10] | a1+64 .. a1+100 | Operand slot types (register class selector) |
dword_XXXXX8[10] | a1+104 .. a1+140 | Operand slot flags (encoding mode flags) |
Observed slot-size values: 10 = register (10-bit number + overhead), 12 = register with type, 17 = immediate/cbuf, -1 = unused. Slot-type values: 28 = register-type, 0 = basic, -1 = unused. Slot-flag values: 0 = default, 2 = secondary (uniform/extended), -1 = unused.
The copy uses SSE aligned loads for 16-byte chunks and scalar DWORD stores for remainders. The alignment check visible in every decompiled encoder (if (a1 + 120 <= dword_XXXXX8 || a1 + 24 >= &dword_XXXXX8)) is a compiler-generated overlap guard for the memcpy-like bulk copy.
Bitfield Packer Detail -- sub_7B9B80
The core encoding primitive. 216 bytes compiled, 18,347 callers total. Inserts an arbitrary-width bitfield into the 1280-bit buffer at a1+544:
// sub_7B9B80(a1, bit_offset, bit_width, value)
// Reconstructed algorithm (control flow simplified; the decompiled original
// expresses this loop with nested gotos and an always-false unsigned
// comparison that is really a signed test):
__int64 bitfield_insert(__int64 a1, uint32_t bit_offset, int bit_width, uint64_t value) {
    uint32_t end = bit_offset + bit_width;
    for (uint32_t pos = 0; pos != 1280; pos += 64) {   // walk all 20 QWORD chunks
        uint32_t chunk_end = pos + 64;
        if (bit_offset > pos + 63 || end <= pos)
            continue;                                  // field does not touch this chunk
        uint32_t start = (bit_offset >= pos) ? bit_offset : pos;
        uint32_t stop  = (end <= chunk_end) ? end : chunk_end;
        int width = stop - start;
        // Bits of `value` already emitted into earlier chunks (0 in the first
        // chunk; the decompiled code derives this via neg_base = -64 - bit_offset):
        int shift_right = (pos > bit_offset) ? (int)(pos - bit_offset) : 0;
        int bit_in_qword = start & 0x3F;
        __int64 *qword = (__int64*)(a1 + 8 * (start >> 6) + 544);
        uint64_t shifted = value >> shift_right;
        if (width == 64)
            *qword |= shifted << bit_in_qword;
        else
            *qword |= (shifted & ~(-1ULL << width)) << bit_in_qword;
    }
    return 1280;
}
Key properties:
- Handles cross-QWORD-boundary fields: a 9-bit opcode starting at bit 59 writes 5 bits to QWORD 0 and 4 bits to QWORD 1
- Loop terminates at bit position 1280 (20 QWORDs), hard ceiling
- For typical field widths (1--9 bits), only 1--2 of the 20 chunk iterations produce a write
- Called 8--12 times per encoder function (average ~10)
- The 256-bit format encoders call it with wider fields (up to 32 bits for data values)
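The cross-boundary case in the first bullet is easy to reproduce with a standalone reimplementation over a bare QWORD array (same masking logic as the reconstruction above, minus the chunk loop):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for sub_7B9B80: OR a bit_width-bit value into a QWORD
   array at bit_offset, splitting writes that straddle a QWORD boundary. */
static void insert_bits(uint64_t *buf, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask = (bit_width == 64) ? ~0ULL : (1ULL << bit_width) - 1;
    value &= mask;
    int q = bit_offset >> 6, b = bit_offset & 63;
    buf[q] |= value << b;
    if (b + bit_width > 64)              /* spillover into the next QWORD */
        buf[q + 1] |= value >> (64 - b);
}
```

A 9-bit all-ones field at bit 59 lands as 5 bits in QWORD 0 ([63:59]) and 4 bits in QWORD 1 ([3:0]), exactly the split described above.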
128-bit Format 0x03 -- General ALU/Memory (145 encoders)
The most populous format group. Handles the bread-and-butter ALU and memory instructions.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F1DF8 |
| Format ID | 0x03 (bits[25:31]) |
| Slot arrays | dword_23F1E08, dword_23F1E30, dword_23F1E40 |
| Operand slots | 2--7 per instruction |
| Typical pattern | 3 reg + 1 imm + 1 pred (5 slots) |
| Modifier fields | 4--8 per instruction |
Opcode classes (29): 0x08, 0x0B, 0x0F, 0x10, 0x16, 0x17, 0x19, 0x1A, 0x1B, 0x20, 0x22, 0x25, 0x28, 0x2A, 0x2B, 0x30, 0x32, 0x34, 0x35, 0x36, 0x37, 0x38, 0x3B, 0x41, 0x45, 0x4A, 0x4B, 0x5B, 0x67.
128-bit Format 0x19 -- Extended Complex (117 encoders)
Second most common. Used for instructions with rich modifier fields or unusual operand configurations.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F29A8 |
| Format ID | 0x19 (bits[25:31]) |
| Slot arrays | dword_23F29B8, dword_23F29E0, dword_23F2A08 |
| Operand slots | 3--6 per instruction |
| Modifier fields | 5--8 per instruction |
Opcode classes (8): 0x0F, 0x10, 0x1A, 0x1B, 0x22, 0x38, 0x4D, 0x5E. Notable concentration: opcode 0x1B has 41 variants in this format alone (tensor/MMA family); opcode 0x5E has 26 variants. The load/store family (0x38) uses this format for 7 of its 16 variants -- the ones with extended addressing modes.
128-bit Format 0x0A -- Multi-Source ALU (99 encoders)
Designed for instructions with 4--7 source operands. Heavily weighted toward rich ALU operations.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F21B0 |
| Format ID | 0x0A (bits[25:31]) |
| Operand slots | 4--7 per instruction |
| Typical pattern | 4 reg + 1 imm + 1 pred |
Opcode classes (10): 0x10, 0x16, 0x17, 0x20, 0x25, 0x28, 0x2A, 0x45, 0x4B, 0x67. Opcode 0x2A dominates with 30 variants; opcode 0x25 has 18.
128-bit Format 0x13 -- Tensor/Extended ALU (27 encoders)
Contains the most complex encoders in the binary. Opcode 0x5A variant 0x02 (sub_D89C90, 2015 bytes) has 18 modifier fields -- the maximum observed.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F2678 |
| Format ID | 0x13 (bits[25:31]) |
| Slot arrays | dword_23F2688, dword_23F26B0, dword_23F26D8 |
| Operand slots | 4--7 per instruction |
| Modifier fields | 8--18 per instruction |
Opcode classes (7): 0x10, 0x16, 0x17, 0x1A, 0x41, 0x5A, 0x67.
128-bit Formats 0x07, 0x0D, 0x23, 0x16, 0x09, 0x21, 0x12 -- Rare Formats (35 encoders combined)
| Descriptor | Format ID | Encoders | Opcode Classes |
|---|---|---|---|
xmmword_23F2018 | 0x07 | 9 | 0x0F, 0x10 |
xmmword_23F2348 | 0x0D | 6 | 0x0F, 0x16, 0x67 |
xmmword_23F2EF8 | 0x23 | 5 | 0x10 |
xmmword_23F2810 | 0x16 | 4 | 0x4B (bulk/DMA) |
xmmword_23F2128 | 0x09 | 2 | -- |
xmmword_23F2DE8 | 0x21 | 2 | -- |
xmmword_23F25F0 | 0x12 | 2 | 0x4B |
Formats 0x16 and 0x12 share opcode class 0x4B, suggesting they encode different addressing-mode variants of the same bulk-data instruction.
64-bit Format A (xmmword_23F1F08) -- Short-Form General (113 encoders)
Widest opcode coverage of any single format. Covers 27 distinct opcode classes with few variants each -- the simple, common instructions.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F1F08 |
| Operand slots | 0--3 per instruction |
| Register offsets | 0x40, 0x50, 0x60, 0x70 |
Opcode classes (27): 0x00--0x09, 0x0A--0x0F, 0x10, 0x11, 0x12, 0x14, 0x16, 0x1B, 0x1C, 0x20, 0x21, 0x23, 0x25. Many of these are NOP/control, simple moves, and compact branches.
64-bit Format B (xmmword_23F1D70) -- Short-Form 4-Operand (41 encoders)
Bimodal operand count: either 0 operands (control instructions) or 4 operands (compact arithmetic with all-register sources).
Opcode classes: 0x00--0x09, 0x10, 0x12, 0x14--0x1E, 0x26, 0x28, 0x2A.
64-bit Formats C, D, E -- Specialized Short Forms (20 encoders combined)
| Descriptor | Encoders | Notes |
|---|---|---|
xmmword_23F1F90 | 11 | Short-form variant C |
xmmword_23F2238 | 8 | Short-form variant D |
xmmword_23F2C50 | 1 | Minimal format, single encoder; also appears in 128-bit category with 0 uses |
Distinguishing 64-bit vs 128-bit Encoders
The 128-bit format sets the extended opcode flag at bit offset 0x84, which the 64-bit format does not:
128-bit (6 initial sub_7B9B80 calls):
sub_7B9B80(a1, 0, 4, 2) // format code = 2
sub_7B9B80(a1, 4, 3, 0) // sched group slot
sub_7B9B80(a1, 0x84, 3, 0) // extended flag at bit 132 <-- PRESENT
sub_7B9B80(a1, 8, 9, MAJ) // major opcode
sub_7B9B80(a1, 0x11, 8, MIN) // minor opcode
sub_7B9B80(a1, 0x19, 7, FID) // format ID
64-bit (5 initial sub_7B9B80 calls):
sub_7B9B80(a1, 0, 4, 1) // format code = 1
sub_7B9B80(a1, 4, 3, 0) // sched group slot
// NO 0x84 call <-- ABSENT
sub_7B9B80(a1, 8, 9, MAJ) // major opcode
sub_7B9B80(a1, 0x11, 8, MIN) // minor opcode
sub_7B9B80(a1, 0x19, 7, FID) // format ID
The 256-bit format (format code 8) is used by exactly 2 encoders for IMAD.WIDE (major 0x59, minor 0x02 and 0x03), each with 16 constant-buffer operand slots encoded via sub_7BCF00.
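For batch analysis, the 5-call/6-call signature reduces to a one-line classifier over the recovered bit-offset sequence of each encoder's initial sub_7B9B80 calls (a scripting sketch, not code from the binary):

```c
#include <assert.h>

/* 128-bit encoders emit an initial sub_7B9B80 call at bit offset 0x84
   (the extended flag); 64-bit encoders never do. */
static int is_128bit_encoder(const int *call_offsets, int n) {
    for (int i = 0; i < n; i++)
        if (call_offsets[i] == 0x84)
            return 1;
    return 0;
}
```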
Dispatch Tables -- The Six Megafunctions
Six switch-dispatch megafunctions in the 0x10C0B20--0x10E32E0 range form the central routing logic of the instruction codec. All six switch on the opcode category at *(WORD*)(a1+12) with up to 370 cases (0x0 through 0x171), each containing sub-switches on field ID:
| Function | Size | Decompiled Lines | Callers | Purpose |
|---|---|---|---|---|
sub_10C0B20 | 180 KB | 9,231 | 3,109 | setField -- write a value into a named field |
sub_10D5E60 | 197 KB | 6,491 | 961 | getFieldOffset -- return bit-offset of a named field |
sub_10E32E0 | 187 KB | 6,240 | 72 | hasField -- boolean: does this instruction have field X? |
sub_10CCD80 | 142 KB | 7,581 | 4 | setFieldDefault -- write hardcoded default for a field |
sub_10CAD70 | 68 KB | 1,864 | 74 | getOperandFieldOffset -- bit-offset of a per-operand field |
sub_10C7690 | 65 KB | 2,313 | 288 | setOperandField -- write a per-operand field value |
sub_10C0B20 (setField) is one of the most-called functions in the entire ptxas binary at 3,109 call sites. It writes field values using sub_AF80xx writer stubs and contains jump-out targets (0xAF43F0, 0xAF44C0, 0xAF4550) for bit-manipulation operations that cross word boundaries.
sub_10D5E60 (getFieldOffset) returns extractor(a1+48, bit_position) + base_offset for each field, where base_offset is a field-specific constant (observed values: +125, +790, +1278, etc.). Returns 0xFFFFFFFF when the queried field does not exist in the given instruction category.
sub_10CAD70 (getOperandFieldOffset) operates on per-operand packed records at *(QWORD*)(a1+32) + 32*operand_index + 24. The field IDs it handles include: 1 (register class), 2 (operand type), 7, 8, 12 (operand size), 13 (address mode), 14, 15, 19, 24, 25, 26, 27, 29.
Dead cases (opcode categories without the queried field) share a common pattern: cases 3, 0x11, 0x24, 0x26, 0x2D, 0x75, 0x78, 0x84, 0x8C-0x9F, 0xA1-0xBA, 0xBE-0x16F return 0xFFFFFFFF or false.
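Structurally, each megafunction is a two-level switch with a shared dead-case sentinel. A toy sketch of the getFieldOffset shape (category 0x08 and its offsets are invented for illustration; only the 0xFFFFFFFF sentinel and the nesting are from the binary):

```c
#include <assert.h>
#include <stdint.h>

#define NO_FIELD 0xFFFFFFFFu

/* Toy getFieldOffset: outer switch on opcode category (up to 370 cases in
   the real function), inner switch on field ID. Values are illustrative. */
static uint32_t get_field_offset(int category, int field_id) {
    switch (category) {
    case 0x08:                       /* hypothetical category */
        switch (field_id) {
        case 1:  return 125;         /* hypothetical base offsets */
        case 2:  return 790;
        default: return NO_FIELD;
        }
    default:
        return NO_FIELD;             /* dead case: field absent */
    }
}
```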
Bitfield Accessor Library
The 0x10B0000--0x10BF2C0 range contains 2,095 machine-generated bitfield read/write primitives for the 192-bit packed instruction format. These are the building blocks that the six megafunctions call:
- 1,661 functions under 200 bytes: pure getters/setters for individual fields
- 412 functions between 200-500 bytes: multi-field accessors
- 22 functions above 500 bytes: complex accessors with validation
Seven core extractors handle all bitfield reads:
| Function | Width | Storage Format |
|---|---|---|
sub_10B28E0 | 1-bit | 192-bit (3x QWORD) |
sub_10B2860 | 2-bit | 192-bit |
sub_10B27E0 | 3-bit | 192-bit |
sub_10B2760 | 4-bit | 192-bit |
sub_10B26E0 | 5-bit | 192-bit |
sub_10B2650 | 2-bit | 32-bit array |
sub_10B25C0 | 3-bit | 32-bit array |
The 192-bit format (3 QWORDs = 24 bytes, stored at a1+48) handles boundary crossing: if a bitfield spans a QWORD boundary, the extractor combines partial reads from adjacent words. The 32-bit-array format is used for sub-fields that are naturally DWORD-aligned.
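The boundary-crossing read generalizes to a single width-parameterized extractor (a sketch; the binary instead specializes one function per width):

```c
#include <assert.h>
#include <stdint.h>

/* Extract a width-bit field from a 192-bit (3 x QWORD) record, combining
   partial reads when the field straddles a QWORD boundary. */
static uint64_t extract_bits(const uint64_t rec[3], int bit_offset, int width) {
    uint64_t mask = (width == 64) ? ~0ULL : (1ULL << width) - 1;
    int q = bit_offset >> 6, b = bit_offset & 63;
    uint64_t v = rec[q] >> b;
    if (b + width > 64 && q < 2)
        v |= rec[q + 1] << (64 - b);   /* pull in the spillover bits */
    return v & mask;
}
```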
A typical accessor is trivially simple:
// sub_10BEF80 (140 bytes)
int get_field_X(int64_t a1) {
return (*(uint32_t*)(a1 + 24) & 3) + 51; // extract 2-bit field, add base
}
Modifier Encoding
After operands are encoded, each handler packs instruction-specific modifier fields into the bits at a1+544 (primary word) and a1+552 (extended word). The pattern is:
- Read the modifier value from the IR node via a property extractor (sub_10B9xxx family)
- Validate and map it through an encoding lookup table (sub_10B3xxx / sub_10B4xxx family)
- OR the result into the packed word at a shifted/masked position
The most commonly used modifier-encoding functions:
| Function | Callers | Bits | Likely Meaning |
|---|---|---|---|
sub_10B6180 | 8,091 | 1 | Boolean flag (.S, .STRONG, .SAT, etc.) |
sub_10B6160 | 2,205 | 1 | Boolean flag (.NEG, .ABS, etc.) |
sub_10B6140 | 1,645 | 1 | Boolean flag variant |
sub_10B2D90 | 538 | 2 | Data type, rounding mode |
sub_10B5580 | 475 | 5 | Shift amount, cache policy |
sub_10B44E0 | 416 | 2 | Addressing mode |
sub_10B6220 | 363 | 3 | Register bank, cache level |
sub_10B4650 | 330 | 4 | Type qualifier, address mode |
sub_10B47F0 | 243 | 4 | Type qualifier variant |
sub_10B2F00 | 151 | 3 | 3-bit modifier field |
sub_10B2F20 | 101 | 4 | 4-bit modifier field |
Modifier fields per instruction range from 0 (simple control instructions) to 18 (the most complex encoder, sub_D89C90 for opcode class 0x5A). The average is approximately 6 modifier fields per encoder. Bit positions in a1+544 concentrate in bits 48-63; bit positions in a1+552 concentrate in bits 0-11.
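The phase-7 OR from the encoder template can be checked arithmetically: shifting a 2-bit modifier left by 55 and masking with 0x180000000000000 places it at bits [56:55] of the primary word (a sketch of the mask math only):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the phase-7 modifier OR: place a 2-bit value at bits [56:55]. */
static uint64_t set_modifier(uint64_t word, int enc_mod) {
    return word | (((uint64_t)enc_mod << 55) & 0x180000000000000ULL);
}
```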
Physical Register Encoding
The SASS instruction encoder uses a two-stage pipeline to convert abstract virtual registers into hardware register fields in the final instruction word. The first stage (Ori encoding, described above in "Register Operand Encoder") packs register type and number into operand slots within the 1280-bit encoding buffer. The second stage (SASS emission) maps the compiler's abstract (register_class, sub_index) pair into an 8-bit hardware register number and writes it into the final 128-bit instruction word. This second stage is implemented by the register-class encoding tables at address range 0x1B4C000--0x1B76000 (Zone A of the emission backend).
Class-to-Hardware Formula
sub_1B6B250 (2965 bytes, 254 callers, 0 callees) is a fully unrolled lookup table that implements the mapping:
hardware_reg = register_class * 32 + sub_index
The function takes two integer arguments (a1, a2) where a1 is the register class (0--5) and a2 is the sub-register index within that class. It is compiled as a deeply nested if-chain covering all 156 valid (class, index) combinations. The decompiler output is 495 lines of cascading conditionals, but every return value satisfies the formula a1 * 32 + a2 exactly:
// sub_1B6B250 -- reconstructed from decompiled lookup table
__int64 register_class_to_hardware(int reg_class, int sub_index) {
    // Valid classes: 0, 1, 2, 3, 4, 5
    // Valid sub-indices: 1..15, 17..27 (index 0 and 16 excluded)
    if (reg_class < 0 || reg_class > 5)
        return 0;                                   // fallthrough: unmatched input
    if (sub_index < 1 || sub_index > 27 || sub_index == 16)
        return 0;
    return 32 * reg_class + sub_index;              // every table entry satisfies this formula
}
The guard wrapper sub_1B73060 (19 bytes, 483 callers) short-circuits the no-register case:
// sub_1B73060 -- guard wrapper
__int64 encode_register_guarded(__int64 ctx, int reg_class, int sub_index) {
if (reg_class | sub_index)
return register_class_to_hardware(reg_class, sub_index);
else
return 0; // no register
}
Per-Class Hardware Number Ranges
Each class occupies a 32-number stride in the hardware register namespace. Within each stride, indices 1--15 and 17--27 are populated (26 registers per class). Index 0 maps to the no-register sentinel via the guard wrapper. Index 16 is absent from the lookup table -- a gap in every class.
| Class | a1 | Hardware Range | Populated Indices | Gap | Likely Register File |
|---|---|---|---|---|---|
| 0 | 0 | 0--27 | 1--15, 17--27 | 16 | R (GPR primary) |
| 1 | 1 | 32--59 | 1--15, 17--27 | 48 | R (GPR secondary) |
| 2 | 2 | 64--91 | 1--15, 17--27 | 80 | P (predicate) |
| 3 | 3 | 96--123 | 1--15, 17--27 | 112 | UR (uniform GPR) |
| 4 | 4 | 128--155 | 1--15, 17--27 | 144 | UR (uniform ext) |
| 5 | 5 | 160--187 | 1--15, 17--27 | 176 | P/UP (uniform pred) |
Hardware numbers 28--31 (and the corresponding padding in each class) are unused, providing alignment to 32-register boundaries. The maximum hardware register number produced by the table is 187 (class 5, index 27). The 8-bit encoding field can represent 0--255, so values 188--255 are reserved.
The index-16 gap in every class is consistent across all 6 classes. This likely corresponds to a hardware-reserved slot or a register numbering convention where physical register class*32+16 has special semantics (potentially a sentinel or a register-file-boundary marker).
Split Bitfield Writer
sub_1B72F60 (32 bytes, 483 callers) writes the 8-bit hardware register number into the SASS instruction word. The encoding is split across two non-contiguous bitfields within a single DWORD:
// sub_1B72F60 -- register field writer (decompiled verbatim)
__int64 write_register_field(__int64 a1, int encoded_reg) {
__int64 buf = *(_QWORD *)(a1 + 112); // instruction encoding buffer
__int64 result = *(_DWORD *)(buf + 12) // DWORD at byte offset 12
| ((_WORD)encoded_reg << 9) & 0x3E00u; // low 5 bits -> [13:9]
*(_DWORD *)(buf + 12) = result
| (encoded_reg << 21) & 0x1C000000; // high 3 bits -> [28:26]
return result;
}
Bit-level layout within the DWORD at *(instruction_buffer + 12):
DWORD bits: 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 .. 0
[ h2:h0 ] [ l4:l3:l2:l1:l0 ]
hw[7:5] hw[4:0]
The DWORD at byte offset 12 covers bits [127:96] of the 128-bit instruction word. In full instruction coordinates:
| Field | DWORD Bits | Instruction Bits | Width | Content |
|---|---|---|---|---|
| Low | [13:9] | [109:105] | 5 bits | hardware_reg[4:0] |
| High | [28:26] | [124:122] | 3 bits | hardware_reg[7:5] |
The 12-bit gap between instruction bits [121] and [110] is occupied by other instruction fields (modifiers, flags, secondary operand encodings). This split-field design is common in GPU ISAs where instruction bits are at a premium and different fields must be routed to different functional unit inputs.
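The split encoding is straightforward to invert; a matching reader for the same DWORD layout (a sketch mirroring sub_1B72F60 -- the reader is mine, not a recovered function):

```c
#include <assert.h>
#include <stdint.h>

/* Write / read the 8-bit hardware register number split across
   DWORD bits [13:9] (hw[4:0]) and [28:26] (hw[7:5]). */
static uint32_t write_reg_field(uint32_t dword, int hw_reg) {
    dword |= ((uint32_t)hw_reg << 9)  & 0x3E00u;      /* low 5 bits  */
    dword |= ((uint32_t)hw_reg << 21) & 0x1C000000u;  /* high 3 bits */
    return dword;
}

static int read_reg_field(uint32_t dword) {
    return (int)(((dword >> 9) & 0x1Fu) | (((dword >> 26) & 0x7u) << 5));
}
```

Round-tripping the maximum table output (187, class 5 index 27) confirms the two fields together carry the full class*32+index range.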
sub_1B72FE0 (32 bytes, 104 callers) is byte-identical to sub_1B72F60 but occupies a different vtable slot, used by a secondary operand encoding path.
Extended Register Encoder
sub_1B6EA20 (7194 bytes, 25 callers) extends the base encoding with operand modifier support. It takes 5 parameters:
// sub_1B6EA20 -- register encoding with modifiers
__int64 encode_register_with_modifiers(
int reg_class, // a1: register class (0-5)
int sub_index, // a2: sub-register index
int negation, // a3: .NEG modifier flag
int abs_value, // a4: |.ABS| modifier flag
int type_modifier // a5: type cast modifier
);
When all modifier flags are zero (a3 | a4 | a5 == 0), the function returns the same value as sub_1B6B250 -- the base class * 32 + index result. When modifiers are present, the function continues into extended encoding logic that packs modifier bits alongside the register number. The guard wrapper sub_1B748C0 (35 bytes, 104 callers) provides the same no-register short-circuit for the extended variant.
Additional encoding variants for different operand positions include sub_1B6D590, sub_1B70640, sub_1B71AD0, sub_1B748F0, and sub_1B76100 (5264--6106 bytes each, 2--49 callers each). All share the same nested-if structural pattern and operate on the same class/index domain.
Encoding Pipeline Summary
The complete register encoding pipeline from virtual register to instruction bits:
Virtual Register (vreg+64 = reg_type, vreg+68 = physical_reg)
|
v
[Ori Encoder -- sub_7BC030, 6147 callers]
Reads: operand+20 (reg_type_raw), operand+4 (reg_num)
Writes: 1-bit presence + 4-bit type + 10-bit number into 1280-bit buffer
|
v
[SASS Emission -- sub_1B6B250 via sub_1B73060, 483 callers]
Input: (register_class, sub_index)
Formula: hardware_reg = class * 32 + sub_index
Output: 8-bit hardware register number (0-187)
|
v
[Bitfield Writer -- sub_1B72F60, 483 callers]
Input: 8-bit hardware register number
Output: split across instruction bits [109:105] and [124:122]
Zone A Function Map
| Function | Size | Callers | Role | Confidence |
|---|---|---|---|---|
sub_1B6B250 | 2,965 B | 254 | Core class*32+index lookup table | HIGH |
sub_1B6EA20 | 7,194 B | 25 | Extended encoding with modifier bits | HIGH |
sub_1B73060 | 19 B | 483 | Guard wrapper for sub_1B6B250 | CERTAIN |
sub_1B748C0 | 35 B | 104 | Guard wrapper for sub_1B70640 | CERTAIN |
sub_1B72F60 | 32 B | 483 | Split bitfield register writer | HIGH |
sub_1B72FE0 | 32 B | 104 | Identical writer (different vtable slot) | HIGH |
sub_1B73080 | 6,106 B | 88 | 3-operand register encoding (class, index, modifier) | HIGH |
sub_1B6D590 | 5,264 B | varies | Register encoding variant (operand position A) | HIGH |
sub_1B70640 | varies | varies | Register encoding variant (operand position B) | HIGH |
sub_1B71AD0 | varies | varies | Register encoding variant (operand position C) | HIGH |
sub_1B748F0 | varies | varies | Register encoding variant (operand position D) | HIGH |
sub_1B76100 | varies | varies | Register encoding variant (operand position E) | HIGH |
Decoder Functions
97 decoder functions in the 0xEB3040--0xED0FE0 range reverse the encoding: they extract operand information from packed SASS bitfields back into Ori IR representation. The decoder entry point is sub_EB3040, a dispatcher that performs binary search on the instruction type word (*(a2+12), *(a2+14), *(a2+15)) against a table at off_22E6380. For instruction types 120/121, it falls through to the generic decoder sub_7BFAE0.
The decoder template mirrors the encoder but in reverse:
void decode_OPCODE(int64_t a1, int64_t a2) {
// 1. Set output instruction type
*(uint16_t*)(a2 + 12) = INSTR_TYPE_ID;
// 2. Load operand format table (same xmmword constants as encoder)
// 3. Set operand count
*(int*)(a1 + 144) = NUM_OPERANDS;
// 4. Decode operands using type-specific decoders
sub_7BD3C0(a1, a2, 0, 0x50, 2); // GPR register (type=2)
sub_7BE090(a1, a2, 1, 0x60, 3); // predicate register (type=3)
sub_7BD650(a1, a2, 2, 0x70, 10); // extended register (type=10)
// 5. Extract control bits (reuse flags, stall counts, yield hints)
sub_7BD260(a1, a2);
// 6. Translate encoded values back to IR references
int reg = sub_AF7DF0(*(void**)(a1+536), extracted_bits);
sub_B056B0(dest_ptr, reg);
int pred = sub_AF7200(*(void**)(a1+536), pred_bits);
sub_AFA380(a2, pred);
// 7. Extract modifier bitfields (reverse of encoder phase 7)
sub_AF53B0(*(void**)(a1+536), *(a1+550) & mask);
sub_AFCEB0(); // commit extracted value
}
Decoder operand count distribution: 6 two-operand, 18 three-operand, 22 four-operand, 16 five-operand, 22 six-operand, 12 eight-operand decoders.
Opcode ID Extractors
Over 100 small functions in the 0x10BF000--0x10C0C00 range serve as opcode discriminators. Each maps an IR instruction node to an opcode ID by reading fields from the operand table. The most-used extractors:
| Function | Encoder Users | Major Opcode Family |
|---|---|---|
sub_10BF440 | 48 | Generic (most common) |
sub_10BF230 | 45 | Generic |
sub_10BF590 | 43 | Generic |
sub_10BFA90 | 30 | 0x59 (IMAD variants) |
sub_10BFD30 | 26 | 0xFD family |
sub_10BFFA0 | 25 | 0x4F family |
sub_10BF580 | 23 | 0x29 (IADD/IADD3) |
sub_10BF680 | 16 | 0x38 (load/store) |
sub_10C0AF0 | 14 | 0xDF (WGMMA) |
89 distinct opcode reader functions cover all instruction families.
Per-SM Architecture Encoding
The encoding system is replicated per SM target. Each SM architecture has its own set of encoder/decoder functions with different xmmword opcode constants. The SM100 (Blackwell datacenter) implementation spans these address ranges:
| Range | Functions | Layer |
|---|---|---|
| 0xD27000--0xDFC000 | 592 | Encoder stubs (p1.12) |
| 0xDFC000--0xEB2AE0 | 494 | Encoder stubs continuation (p1.13) |
| 0xEB3040--0xED0FE0 | 97 | Decoder functions (p1.13) |
| 0x107B1E0--0x10AD700 | 641 | Encoder stubs continuation (p1.16) |
| 0x10ADD30--0x10AFF80 | 78 | Instruction lifecycle & scheduling |
| 0x10B0000--0x10BF2C0 | 2,095 | Bitfield accessor library (p1.16) |
| 0x10C0B20--0x10E32E0 | 184 | Dispatch table megafunctions (p1.16) |
| 0x10EE900--0x1134160 | ~400 | Binary encoders: IR fields to bits (p1.16) |
| 0x1134160--0x114F380 | ~132 | High-level encode path (p1.16) |
The total SM100 codec spans roughly 2.5 MB of binary code across approximately 4,700 functions (including the shared bitfield accessor library).
Other SM targets (SM75 Turing, SM80 Ampere, SM86 Ada, SM89 Lovelace, SM90a Hopper, SM103 Blackwell Ultra, SM120 consumer Blackwell) have parallel encoder populations in the p1.14, p1.15, p1.17--p1.22 address ranges, each with matched xmmword constants for their architecture-specific instruction set.
Per-SM Instruction Format Descriptors
316 instruction format descriptor functions at 0x1732170--0x17A9B70 form the shared, architecture-neutral instruction pattern database. Unlike the per-SM encoder stubs (replicated per architecture at separate address ranges), these descriptors are a single set of functions that describe every SASS opcode variant's encoding geometry: bitfield layout, operand slot configuration, and modifier schema. They are invoked exclusively through virtual dispatch (zero static callers) from the ISel passes (sub_A4BC60, sub_A4D3F0) via the FNV-1a hash-based instruction matcher at sub_1731440.
Descriptor Template
Every descriptor function initializes an Encoding Context object through a fixed 4-phase sequence:
// Phase 1: Opcode header (5 calls for 64-bit, 6 for 128-bit)
sub_7B9B80(a1, 0, 4, FORMAT_CODE); // bits[3:0] format: 1=64b, 2=128b
sub_7B9B80(a1, 4, 3, 0); // bits[6:4] sched group slot
sub_7B9B80(a1, 0x84, 3, 0); // bits[134:132] ext flag (128-bit ONLY)
sub_7B9B80(a1, 8, 9, MAJOR_OP); // bits[16:8] 9-bit major opcode
sub_7B9B80(a1, 0x11, 8, MINOR_OP); // bits[24:17] 8-bit minor opcode
sub_7B9B80(a1, 0x19, 7, FORMAT_ID); // bits[31:25] 7-bit format ID
// Phase 2: Format layout descriptor (Tier 1) -- selects operand geometry
*(__m128i*)(a1 + 8) = xmmword_23FXXXX; // 128-bit format template from rodata
// + bulk copy of 3 x 10 DWORD arrays into a1+24..a1+140
// Phase 3: Architecture modifier table (Tier 2) -- selects per-SM encoding
*(__m128i*)(a1 + 404) = xmmword_YYYYYYY; // per-SM modifier constants
*(DWORD*)(a1 + 420) = VAL1; // explicit modifier overrides
*(DWORD*)(a1 + 424) = VAL2;
// Phase 4: Operand count + standard encoding tail
*(DWORD*)(a1 + 144) = NUM_OPERANDS; // 0--7
sub_7B9D30(a1); // clear constant buffer table
sub_7B9D60(a1, a2, 0); // encode reuse + guard predicate
// Then: opcode extraction, register encoding, modifier field packing
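The sub_7B9B80 calls above take (context, bit offset, width, value) and write into the wide encoding buffer. A minimal model of that packer, assuming LSB-first bit numbering within bytes (the function map below identifies sub_7B9B80 as bitfield_insert; the read-modify-write behavior shown here is an assumption):

```c
#include <stdint.h>

/* Hypothetical reconstruction of sub_7B9B80 (bitfield_insert): writes
 * `width` bits of `value` into `buf` starting at absolute bit `bitpos`.
 * LSB-first bit numbering within bytes is an assumption. */
static void bitfield_insert(uint8_t *buf, unsigned bitpos, unsigned width,
                            uint64_t value) {
    for (unsigned i = 0; i < width; i++, bitpos++) {
        uint8_t mask = (uint8_t)(1u << (bitpos & 7));
        if ((value >> i) & 1)
            buf[bitpos >> 3] |= mask;
        else
            buf[bitpos >> 3] &= (uint8_t)~mask;
    }
}
```

Under this model the Phase 1 header becomes bitfield_insert(buf, 0, 4, FORMAT_CODE), bitfield_insert(buf, 8, 9, MAJOR_OP), and so on, with buf sized to the 1280-bit encoding buffer.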
Two-Tier xmmword Architecture
Each descriptor loads two classes of xmmword constants that together fully specify the instruction encoding:
Tier 1 (at a1+8): Format Layout Descriptor. Selects the instruction format -- operand slot sizes, types, and field layout. These are the 16 format groups documented in the "Instruction Format Group Catalog" section above. Addresses in the 0x23F1xxx--0x23F2xxx rodata range.
Tier 2 (at a1+404): Architecture Modifier Table. Selects per-SM encoding variations for the same format layout. Two instructions with the same Tier 1 descriptor but targeting different architectures use different Tier 2 constants. Addresses span three rodata ranges:
| Rodata Range | Group | Functions | Paired With |
|---|---|---|---|
| 0x202A280--0x202A2B0 | A | ~40 | 202A290 or 202A2A0+202A2B0 at a1+420 |
| 0x22F1B30--0x22F1B50 | B/C | ~8 | None (single 16B block) |
| 0x22F1BA0--0x22F1BB0 | D | ~3 | None |
| 0x22F1AA0--0x22F1AE0 | E | ~3 | (observed in SM100 encoder range) |
| 0x22F1C20--0x22F1C30 | F | ~2 | Paired at a1+404/a1+420 |
| 0x23B2DE0 | G | 4 | None (rare/specialized) |
SM Generation Mapping
The Tier 2 modifier groups correspond to GPU architecture generations. The mapping is inferred from operand table sizes (larger = newer), function counts per group (fewer = newer/specialized), and cross-reference with the per-SM encoder stubs at known address ranges:
| Modifier Address | Probable SM Range | ISA Family | Confidence |
|---|---|---|---|
| 0x202A280--0x202A2B0 | sm_50--sm_75 | Maxwell / Pascal / Volta / Turing | MEDIUM |
| 0x22F1B30--0x22F1B50 | sm_80--sm_86 | Ampere / Ada | MEDIUM |
| 0x22F1BA0--0x22F1BB0 | sm_89--sm_90a | Lovelace / Hopper | MEDIUM |
| 0x22F1AA0--0x22F1AE0 | sm_100+ | Blackwell datacenter | MEDIUM |
| 0x22F1C20--0x22F1C30 | sm_103 / sm_120 | Blackwell Ultra / consumer | LOW |
| 0x23B2DE0 | Cross-arch | Specialized / rare instructions | LOW |
The progression from 0x202A to 0x22F1 to 0x23B2 in rodata address space mirrors the SM generation ordering. Group A (Maxwell--Turing) is the most populous, consistent with the longest-supported ISA family. Groups E and F have the fewest functions, consistent with the newest architectures that introduce fewer format changes.
Format Code Distribution
| Format Code | Instruction Width | Descriptor Count | sub_7B9B80 Header Calls | Notes |
|---|---|---|---|---|
| 1 | 64-bit | ~120 | 5 (no 0x84 call) | Simple moves, branches, barriers, NOP-like control |
| 2 | 128-bit | ~194 | 6 (includes 0x84) | ALU, load/store, texture, tensor core |
| 8 | 256-bit | 2 | Extended | IMAD.WIDE with 16 constant-bank slots |
Descriptor-Initialized Context Fields
The format descriptor writes these fields into the Encoding Context object. All offsets are decimal:
| Offset | Size | Initialized By | Content |
|---|---|---|---|
+8 | 16B | Phase 2 (Tier 1 xmmword) | Format layout descriptor |
+24--+60 | 40B | Phase 2 (bulk copy) | Operand slot sizes (10 DWORDs) |
+64--+100 | 40B | Phase 2 (bulk copy) | Operand slot types (10 DWORDs) |
+104--+140 | 40B | Phase 2 (bulk copy) | Operand slot flags (10 DWORDs) |
+144 | 4B | Phase 4 | Operand count (0--7) |
+404 | 16B | Phase 3 (Tier 2 xmmword) | Architecture modifier table |
+420 | 4B | Phase 3 (scalar) | Architecture modifier field 1 |
+424 | 4B | Phase 3 (scalar) | Architecture modifier field 2 |
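The tabulated offsets can be expressed as a flat C layout. This struct is a hypothetical view: the field names are invented, the leading 8-byte header and the +148..+404 gap are assumptions, and only the byte offsets from the table are recovered:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical view of the Encoding Context; field names are invented,
 * only the byte offsets come from the recovered descriptor writes. */
struct enc_ctx {
    uint64_t header;          /* +0   assumed (vtable or flags) */
    uint8_t  tier1[16];       /* +8   format layout descriptor */
    uint32_t slot_size[10];   /* +24  operand slot sizes */
    uint32_t slot_type[10];   /* +64  operand slot types */
    uint32_t slot_flags[10];  /* +104 operand slot flags */
    uint32_t num_operands;    /* +144 operand count (0..7) */
    uint8_t  gap[256];        /* +148 unrecovered region */
    uint8_t  tier2[16];       /* +404 architecture modifier table */
    uint32_t arch_mod1;       /* +420 modifier override 1 */
    uint32_t arch_mod2;       /* +424 modifier override 2 */
};
```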
Pipeline Position
The format descriptors bridge ISel pattern matching and per-SM encoding:
ISel Pattern Matcher (sub_1731440, FNV-1a hash on *(a2+12))
|
v (virtual dispatch via vtable)
Format Descriptor (one of 316 at 0x1732170--0x17A9B70)
Writes: a1+0..a1+144 (format layout + operand geometry)
Writes: a1+404..a1+424 (architecture modifier table)
|
v (encoding context passed down)
Per-SM Encoder Stub (e.g. 0xD27xxx for SM100)
Reads: format context from descriptor
Writes: a1+544..a1+703 (1280-bit encoding buffer)
Representative Examples
sub_1732170 -- 64-bit float conversion (single-dest):
| Field | Value | Meaning |
|---|---|---|
| Format code | 1 | 64-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0D | Variant D |
| Format ID | 5 | Short-form general (23F1F08) |
| Tier 1 | xmmword_23F1F08 | Short-form general, 27 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |
| Operand count | 3 | Register operands at 0x50, 0x60, 0x70 |
| Modifier fields | 12 | Spanning a1+544 and a1+552 |
sub_1740200 -- 128-bit IMAD.WIDE (dual-dest):
| Field | Value | Meaning |
|---|---|---|
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x23 | IMAD.WIDE family |
| Minor opcode | 0x12 | Variant with modifier 0x13 |
| Format ID | 0x13 | Tensor/extended ALU (23F2678) |
| Tier 1 | xmmword_23F2678 | Extended ALU, 7 opcode classes |
| Tier 2 | xmmword_202A280 | Group A (Maxwell--Turing) |
| Dual-dest | Yes | 0x84 field present, set to 0 |
sub_1732E90 -- 128-bit extended complex:
| Field | Value | Meaning |
|---|---|---|
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0C | Same as major (self-referencing variant) |
| Format ID | 0x19 | Extended complex (23F29A8) |
| Tier 1 | xmmword_23F29A8 | Extended complex, 8 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |
Operand Encoding Patterns
The 576 encoder functions in the p1.12 range use 52 distinct operand encoding patterns. The most common:
| Pattern (reg, imm, pred) | Count | Description |
|---|---|---|
| 3 reg + 1 pred | 88 | Standard 3-source with predicate |
| 2 reg + 1 pred | 57 | Binary op with predicate |
| 3 reg only | 43 | Ternary ALU, no predicate/immediate |
| 3 reg + 1 imm + 1 pred | 42 | MAD-class with immediate + predicate |
| 2 reg only | 40 | Simple binary |
| 3 reg + 1 imm | 25 | Ternary with immediate |
| 1 reg + 1 pred | 22 | Unary with predicate |
| 4 reg + 1 imm | 21 | Quaternary with immediate |
| 4 reg only | 20 | Quaternary register-only |
Register operand bit offsets are format-dependent:
- 64-bit format: 0x40, 0x50, 0x60, 0x70
- 128-bit format: 0x60, 0x70, 0x88, 0x98, 0xA8
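All of the recovered register bit offsets happen to be byte-aligned (0x40 = byte 8, 0x50 = byte 10, and so on), so register packing for the 64-bit form reduces to byte stores. A minimal sketch; the 8-bit register field width is an assumption, the offsets are the recovered ones:

```c
#include <stdint.h>

/* Hypothetical: pack up to four 8-bit register IDs into a 64-bit-form
 * instruction word at the recovered bit offsets 0x40/0x50/0x60/0x70.
 * All four offsets are byte-aligned, so whole-byte stores suffice. */
static void pack_regs_64bit_form(uint8_t *word /* >= 16 bytes */,
                                 const uint8_t *regs, unsigned nregs) {
    static const unsigned offs[4] = { 0x40, 0x50, 0x60, 0x70 };
    for (unsigned i = 0; i < nregs && i < 4; i++)
        word[offs[i] >> 3] = regs[i];
}
```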
Major Opcode Summary (SM100)
102 unique major opcodes were identified across 494 encoding variants (p1.13 range alone). Opcode-to-mnemonic mapping is inferred from operand patterns and opcode density; exact mnemonic assignment requires correlation with ROT13-obfuscated instruction names found elsewhere in the binary.
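Correlating an inferred opcode with one of the ROT13-obfuscated instruction names is mechanical once the strings are extracted; the rotation itself is the standard one (the workflow is the author's analysis step, not part of ptxas):

```c
/* ROT13 decoder for the obfuscated mnemonic strings found in the binary. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + 13) % 26);
    }
}
```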
Memory / Load-Store
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x38 | 16 | LDG, STG, LDS, STS |
| 0x60 | 2 | Extended load |
| 0x70--0x72 | 9 | Load groups A/B/C |
| 0xA4--0xA6 | 12 | Load/store with addressing modes |
| 0xAD | 9 | Memory extended |
| 0x1E | 4 | ATOM, ATOMS |
| 0x99, 0xA2 | 2 | Extended atomics |
| 0x39 | 2 | REDUX (reduction) |
Integer Arithmetic
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x59 | 30 | IMAD, IMAD.HI, IMAD.WIDE, ISCADD |
| 0x29 | 24 | IADD3, IADD3.64, IADD32I |
| 0x4F | 25 | Extended integer operations |
| 0x3B | 10 | Integer MUL/MAD extended |
Floating Point
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x3A | 1 | Float operation |
| 0x3E--0x40 | 4 | FFMA, FFMA variants |
| 0x43--0x44 | 2 | Float MUL/MAD |
| 0x4A | 4 | FADD, FMUL, FFMA forms |
| 0x49 | 6 | HFMA2, HADD2, HMUL2 |
| 0x5C | 6 | HFMA2 variants |
| 0x5F | 2 | Half-float extended |
Tensor Core / WGMMA
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0xA8--0xA9 | 16 | Tensor core A/B (WGMMA, HMMA) |
| 0xAB--0xAC | 12 | Tensor core C/D |
| 0xAE--0xB0 | 30 | Tensor core E/F/G |
| 0xB1--0xB3 | 15 | Tensor core H/I/J |
| 0xDF | 14 | WGMMA dispatch (main family) |
| 0x12 | 4 | Matrix operations |
| 0x54 | 6 | Extended matrix |
Control Flow
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x18 | 10 | BRA, SSY, CAL, EXIT, RET, BREAK, CONT |
| 0x19 | 5 | Control flow group B |
| 0x7D | 2 | YIELD, control |
| 0x24 | 2 | BAR, barrier/sync |
| 0xCF | 3 | BARRIER |
| 0xD4 | 2 | BARRIER B |
| 0x33 | 2 | DEPBAR |
Comparison / Predicate
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x0D | 10 | ISETP, FSETP, DSETP |
| 0x17 | 8 | PSETP, PLOP3 |
| 0x95 | 6 | Comparison variants |
Data Movement / Conversion
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x61 | 5 | MOV, MOV.64, MOV32I |
| 0x46, 0x66, 0x45 | 3 | MOV variants |
| 0x56 | 6 | F2I, I2F, F2F type conversions |
| 0x62 | 6 | Type conversion group 2 |
| 0x10 | 4 | SEL (conditional select) |
| 0x1B | 3 | PRMT (permute) |
Instruction Object Lifecycle
The instruction object constructor sub_10AFF80 (11 KB, 3 callers: sub_6F0A30, sub_6F52F0, sub_9EE390) takes 32 parameters and builds a ~900-byte instruction-level object:
- 13 sub-object allocations via the vtable allocator (vtable+24)
- 4 linked-list structures for instruction chaining
- 2 string buffers for instruction name and alternate name (via strlen+memcpy)
- Architecture descriptor via sub_B19110(arch_id) at offset +408
- Hash table using FNV-1a (seed 0x811C9DC5, prime 16777619) for instruction record lookup
The instruction unlink-and-recycle functions (sub_10ADF90, sub_10AE190) remove an instruction node from a doubly-linked list (head/tail at a1+48/56), update the count at a1+64, free operand attachments via vtable call, and return the node to a free-list at a1+72. The maximum instruction count per list is 16,383 (checked by sub_10AE7C0).
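The FNV-1a parameters recovered from the hash-table code (seed 0x811C9DC5, prime 16777619) are the standard 32-bit constants, so the hash used for instruction record lookup can be reproduced directly:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a with the constants recovered from the hash-table code
 * (seed 0x811C9DC5, prime 16777619 = 0x01000193). */
static uint32_t fnv1a_32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}
```

These match the published FNV-1a test vectors, which is a quick way to confirm a recovered hash loop is the stock algorithm rather than a tweaked variant.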
Encoding Pipeline Layers
The full encoding pipeline operates in three layers, from high-level IR to binary output:
Layer 1: High-level encode (0x1134160--0x114F380, ~132 functions)
Populates full IR records before low-level packing. Uses sub_9B3C20(a1, a2, slot, type, mode, width, reg_id) for register operands and sub_9B3D60 for immediates. Handles 255->1023 sentinel translation for "don't care" register values. Sets opcode/modifier fields via sub_AFA910/sub_AFA930. Applies conditional fixups: e.g., if opcode==2038 && subopcode==2257, sets operand_slot+84 = 5.
Layer 2: Binary encoders (0x10EE900--0x1134160, ~400 functions)
Reads operand fields from IR via sub_10BDxxx extractors, transforms through sub_10Bxxx lookup tables, and packs results into the 128-bit output word at *(QWORD*)(a1+40):
// Typical pattern (sub_10F91D0):
int v6 = sub_10BF170(operand_addr); // extract register class
int v7 = sub_10B6180(lookup_table, v6); // translate to encoding value
*(uint64_t*)(a1 + 40) |= ((uint64_t)v7 << 15); // pack at bit position 15
Includes a register pair encoder (sub_112CDA0, 8.9 KB) that maps 40 register pair combinations (R0/R1, R2/R3, ... R78/R79) to packed output values at 0x2000000 intervals.
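If the 40 pair combinations map linearly onto the 0x2000000-interval output values, the 8.9 KB if-chain in sub_112CDA0 collapses to a single expression. This closed form is an inference from the stated interval, not verified against the decompilation:

```c
#include <stdint.h>

/* Hypothetical closed form of sub_112CDA0: even-aligned pair Rn/Rn+1
 * (n = 0, 2, ..., 78) maps to pair_index * 0x2000000. */
static uint64_t encode_reg_pair(unsigned base_reg /* even, 0..78 */) {
    return (uint64_t)(base_reg / 2) * 0x2000000u;
}
```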
Layer 3: Template encoder stubs (0xD27000--0xEB2AE0, ~1,086 functions)
The lowest-level stubs that directly write the encoding buffer via sub_7B9B80. These are the functions described by the encoder template above.
Variant/Sub-opcode Distribution
The variant field (bits[24:17], 8 bits) has a distribution that peaks at variant 0x05 with 128 functions, suggesting this is the default or most common variant (possibly the .F32 type or the unmodified form):
| Variant | Count | Variant | Count |
|---|---|---|---|
| 0x00 | 21 | 0x08 | 13 |
| 0x01 | 25 | 0x09 | 14 |
| 0x02 | 62 | 0x0A | 10 |
| 0x03 | 24 | 0x0B | 19 |
| 0x04 | 20 | 0x0C | 14 |
| 0x05 | 128 | 0x0D | 9 |
| 0x06 | 30 | 0x0E | 11 |
| 0x07 | 10 | 0x0F--0x2F | decreasing |
Maximum observed variant value is 0x2F (47), giving up to 48 sub-operations per major opcode.
SASS Emission Backend
The final stage of the encoding pipeline operates at the instruction-word level: 11 per-instruction-form bitfield packers at addresses 0x1B79940--0x1B9C220 take a pre-decoded instruction descriptor and pack all fields into a 128-bit SASS instruction word. These functions sit at Level 2 of a 4-level emission hierarchy:
Level 0: SM-target dispatch (0xC4DF70, 0xC53330, 0xC54090, 0xC59610, 0xC5ABE0, 0xC5B5C0)
Level 1: Emission orchestrators (Zone C: 0x1BA0000-0x1BE5000, ~150 functions)
Level 2: Per-form bit packers (Zone B: 0x1B79940-0x1B9C220, 11 functions, THIS SECTION)
Level 3: Register class encoders (Zone A: 0x1B4C000-0x1B76000, ~40 functions)
Each function has exactly 1 caller and 0 callees (pure bitfield packing, no external calls). Sizes range from 6836 to 6980 bytes of compiled code. All 11 share an identical combinator body (verified: same 453 LABEL_xxx targets, same 75 unique OR-mask constants, same max comparison value of 27). They differ only in two things: the opcode base constant, and the prologue field-packing sequence.
Input / Output Interface
int *__fastcall emit_instruction_form_X(int *a1) {
// a1 = pre-decoded instruction descriptor (array of 32-bit ints)
// Returns: pointer to the output buffer (also stored at *((_QWORD*)a1 + 14))
int *result = *((_QWORD *)a1 + 14); // output = 128-bit instruction word
// result[0] = instruction bits [31:0] (opcode base, guard pred, sched group)
// result[1] = instruction bits [63:32] (register operand fields, modifiers)
// result[2] = instruction bits [95:64] (immediate/offset, auxiliary fields)
// result[3] = instruction bits [127:96] (predicate control, combinator encoding)
// ... prologue field packing and combinator (Phases 1--2) ...
return result;
}
The input struct a1 is a flat array of pre-extracted instruction fields. Fields a1[0] through a1[3] carry common header values; a1[4] through a1[15] carry instruction-specific operand data (which indices are used depends on the instruction form).
Phase 1: Prologue -- Opcode Base and Field Packing
Every function begins with the same template, parameterized by different constants:
// 1. Load output buffer pointer
result = *((_QWORD *)a1 + 14);
// 2. OR opcode base into result[0] -- unique 12-bit constant per function
*result |= OPCODE_BASE; // e.g., 0xA1E, 0x81B, 0x803
// 3. Pack guard predicate: bits [14:12] of result[0]
*result |= ((unsigned short)a1[1] << 12) & 0x7000;
// 4. Pack scheduling group: bits [16:15] of result[0]
*result |= (unsigned short)((unsigned short)a1[2] << 15);
// 5. Pack predicate encoding: bits [25:20] of result[3]
result[3] |= (a1[3] << 20) & 0x3F00000;
// 6. Pack instruction-specific operand fields (VARIES PER FUNCTION)
// Each function packs a different set of a1[6..15] fields into
// result[0], result[1], result[2] using different shifts and masks.
// 7. Set base combinator mask: bits [19:14] of result[3] = 0x3F
result[3] |= 0xFC000;
The prologue is the sole source of variation between functions. The field-packing differs in which a1[] indices are used, which shift amounts are applied, and which result[] DWORDs are targeted.
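The shared prologue compiles down to a handful of OR operations. A runnable rendering of steps 1-5 and 7 (step 6 is omitted because it varies per function; the function name and output-as-array convention are invented):

```c
#include <stdint.h>

/* Common prologue shared by all 11 packers: OR the opcode base and the
 * three header fields into the 128-bit output word (four DWORDs). */
static void pack_prologue(uint32_t result[4], const int32_t *a1,
                          uint32_t opcode_base) {
    result[0] |= opcode_base;                          /* 12-bit opcode base */
    result[0] |= ((uint32_t)a1[1] << 12) & 0x7000;     /* guard predicate    */
    result[0] |= (uint16_t)((uint16_t)a1[2] << 15);    /* scheduling group   */
    result[3] |= ((uint32_t)a1[3] << 20) & 0x3F00000;  /* predicate encoding */
    result[3] |= 0xFC000;                              /* base combinator mask */
}
```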
The 11 Functions and Their Opcode Bases
| Function | Size | Opcode Base | Family | Caller Chain |
|---|---|---|---|---|
sub_1B79940 | 6,900 B | 0xA1B | 0xAxx | sub_1BA5340 via sub_C4DF70 |
sub_1B7B440 | 6,868 B | 0x81B | 0x8xx | sub_1BA5340 via sub_C4DF70 |
sub_1B87740 | 6,852 B | 0x238 | 0x2xx | sub_1BA8D80 via sub_C53330 |
sub_1B89350 | 6,836 B | 0x213 | 0x2xx | sub_1BA8E80 via sub_C54090 |
sub_1B8FFE0 | 6,852 B | 0x202 | 0x2xx | sub_1BA8D80 via sub_C53330 |
sub_1B92590 | 6,868 B | 0x803 | 0x8xx | sub_1BACB10 (direct) |
sub_1B94390 | 6,964 B | 0x21D | 0x2xx | sub_1BACDC0 via sub_C59610 |
sub_1B95ED0 | 6,980 B | 0xA1E | 0xAxx | sub_1BACDC0 via sub_C59610 |
sub_1B985B0 | 6,852 B | 0x804 | 0x8xx | sub_1BAD6D0 (direct) |
sub_1B9A430 | 6,884 B | 0x807 | 0x8xx | sub_1BB1110 via sub_C5ABE0 |
sub_1B9C220 | 6,884 B | 0x81A | 0x8xx | sub_1BAD920 via sub_C5B5C0 |
The opcode bases cluster into three families by high nibble:
- 0x2xx (4 functions): bases 0x202, 0x213, 0x21D, 0x238
- 0x8xx (5 functions): bases 0x803, 0x804, 0x807, 0x81A, 0x81B
- 0xAxx (2 functions): bases 0xA1B, 0xA1E
Phase 2: Combinator -- 3-Axis Predicate Encoding into result[3]
After the prologue, all 11 functions execute an identical ~1900-line decision tree. This combinator reads three integer values from the input struct and produces a single 32-bit mask that is ORed into result[3].
The three axes are:
- axis0 = a1[0]: instruction class selector, values 0..5 (6 values)
- axis1 = a1[4]: slot/form index, values 1..27 (26 populated, gap at 16)
- axis2 = a1[N]: sub-mode flag, values 0 or 1 (N varies per function: a1[8], a1[9], a1[10], a1[11], or a1[15])
The combinator exits immediately if all three axes are zero (!(axis0 | axis1 | axis2)). Otherwise it walks a nested decision tree that tests axis0 values (0 through 5), axis1 values (1 through 27), and axis2 values (0 or 1), and ORs the appropriate mask into result[3]:
// Reconstructed combinator logic (pseudocode):
if (axis0 == 0 && axis1 == 0 && axis2 == 0) return;
// For axis0 values 1-5 combined with axis1 values 1-15:
// result[3] |= prefix_for_axis0 | 0xFC000 | (axis1 << 9)
//
// For axis1 values 17-27 combined with axis2:
// result[3] |= base_mask_for_axis1 (if axis2 == 0)
// result[3] |= extended_mask_for_axis1 (if axis2 == 1)
Combinator Mask Encoding
The 75 unique masks in the FC/FD series decompose as:
result[3] bit layout for combinator-generated fields:
bits [19:14] = 0x3F (always set by prologue base 0xFC000)
bits [13:9] = slot_index (5-bit, derived from axis1, values 1-27)
bits [28:26] = axis0 prefix encoding (3-bit, for axis0 values 1-5)
The 5 prefix values correspond to axis0 encodings 1-5:
| axis0 Value | Prefix OR'd | Prefix Bits [28:26] |
|---|---|---|
| 0 | 0x00000 | 000 (no prefix) |
| 1 | 0x40xxxxx | 001 |
| 2 | 0x80xxxxx | 010 |
| 3 | 0xC0xxxxx | 011 |
| 4 | 0x100xxxxx | 100 |
| 5 | 0x140xxxxx | 101 |
Combined with 15 slot values (axis1 = 1..15), this produces 5 x 15 = 75 masks in the 0xFC200--0x140FCE00 range.
For axis1 values 17--27, the masks shift into the 0xFE200--0xFF600 range. These slots use only the "no prefix" and "prefix 0x100" variants (axis0 values 0 and 4), and the axis2 flag selects between the two. This yields 12 additional unique masks for the high-slot range (11 base plus 11 extended, with overlapping values counted once).
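For the low-slot range the reconstructed mask formula is simple enough to state as code. This sketch models only that branch, under the bit layout inferred above (the high-slot axis1 = 17..27 masks are irregular and not modeled):

```c
#include <stdint.h>

/* Hypothetical closed form for the low-slot combinator masks:
 * base mask 0xFC000, slot index in bits [13:9], axis0 prefix in
 * bits [28:26], per the reconstructed pseudocode. */
static uint32_t combinator_mask_low(unsigned axis0 /* 0..5 */,
                                    unsigned axis1 /* 1..15 */) {
    return ((uint32_t)axis0 << 26) | 0xFC000u | ((uint32_t)axis1 << 9);
}
```

Note that the formula reproduces both endpoints of the observed mask range: axis0 = 0, axis1 = 1 gives 0xFC200, and axis0 = 5, axis1 = 7 gives 0x140FCE00.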
Why the Combinator Exists
The combinator encodes an architecture-independent mapping from a 3-dimensional instruction property coordinate to a hardware-specific bitfield pattern in the predicate/control section of the 128-bit instruction word. This section (bits [127:96]) controls:
- Guard predicate assignment (bits [25:20] from prologue)
- Scheduling mode (bits [19:14] base + combinator overlay)
- Instruction form variant (bits [13:9] from combinator)
- Predicate class / condition code routing (bits [28:26] from combinator)
The identical combinator across all 11 functions confirms that this is not an opcode-specific encoding but rather a cross-cutting encoding for predicate/scheduling state that applies uniformly to all instruction forms.
Equivalent Lookup Table
The entire ~1,900-line decision tree can be replaced by a flat table of 6 x 28 x 2 = 336 entries (axis2 takes only the values 0 and 1):
// Equivalent reconstruction:
static const uint32_t combinator_table[6][28][2] = { ... };
// Access: result[3] |= combinator_table[axis0][axis1][axis2];
// Table size: 336 * 4 = 1,344 bytes (vs ~6,800 bytes of code per function)
The compiler chose a decision tree over a table lookup, likely because the C++ source used nested switch/case statements (or if/else chains with early return), and the optimizer did not convert this to a table at -O2.
Zone B Function Map (Emission Cluster)
| Address | Size | Opcode Base | Caller | Confidence |
|---|---|---|---|---|
sub_1B79940 | 6,900 B | 0xA1B | sub_1BA5340 | HIGH |
sub_1B7B440 | 6,868 B | 0x81B | sub_1BA5340 | HIGH |
sub_1B87740 | 6,852 B | 0x238 | sub_1BA8D80 | HIGH |
sub_1B89350 | 6,836 B | 0x213 | sub_1BA8E80 | HIGH |
sub_1B8FFE0 | 6,852 B | 0x202 | sub_1BA8D80 | HIGH |
sub_1B92590 | 6,868 B | 0x803 | sub_1BACB10 | HIGH |
sub_1B94390 | 6,964 B | 0x21D | sub_1BACDC0 | HIGH |
sub_1B95ED0 | 6,980 B | 0xA1E | sub_1BACDC0 | HIGH |
sub_1B985B0 | 6,852 B | 0x804 | sub_1BAD6D0 | HIGH |
sub_1B9A430 | 6,884 B | 0x807 | sub_1BB1110 | HIGH |
sub_1B9C220 | 6,884 B | 0x81A | sub_1BAD920 | HIGH |
SM89/90 Codec Layer
SM89 (Ada Lovelace) and SM90 (Hopper) share a pre-encoding instruction reordering layer absent from SM100 (Blackwell). This layer sits above the three-layer encoding pipeline: it manipulates Mercury IR instruction lists to optimize instruction interleaving before the encoding stubs pack bitfields. The entire cluster spans addresses 0x1226E80--0x1233D70, roughly 261 KB of compiled code across 18 functions.
Call Chain
sub_C60910 / sub_C5FEF0 SM-target dispatch (Level 0, 0xC5xxxx range)
|
v
sub_1233D70 (6 KB) Orchestrator: guards on knob 487 and O-level > 1,
| sets up cost-function parameters, calls A then B
|
+-> sub_122AD60 (112 KB) Pass A: classify instructions + reorder within blocks
+-> sub_122F650 (105 KB) Pass B: scheduling-aware emission ordering across blocks
+-> sub_A112C0 Post-pass finalization
The orchestrator sub_1233D70 is called only when the optimization level exceeds 1 (sub_7DDB50(ctx) > 1). It reads floating-point cost weights from the target descriptor via knob offsets +7200, +7560, +7128, +7272 and passes them through to both passes. Default base weights are 1.8, -0.8, 3.2, -2.2.
Pass A: Instruction Classification and Reordering (sub_122AD60)
4,118 decompiled lines. Traverses every instruction in each basic block and sorts them into 4 linked-list queues by instruction category:
| Category | Return Code | Instruction Type | Queue Role |
|---|---|---|---|
| Branch / control-flow | 0 | type 9 (BRA, EXIT, RET, ...) | Held at block boundaries |
| Load | 1 | type 12 (LDG, LDS, ...) | Scheduled early for latency hiding |
| Store | 2 | type 5 (STG, STS, ...) | Deferred to maximize distance from load |
| General ALU | 4 | type 4 (IADD, FFMA, ...) | Interleaved between memory ops |
| Uncategorized | 3 | other / missing info | Treated as general |
The classifier is sub_1228670 (30 lines), which reads the instruction scheduling class via sub_7E2FE0 and returns 0--4. A companion predicate sub_1228EF0 (38 lines) returns 0 for types 9, 5, and 12 (the "special" categories), 1 for everything else.
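Both helpers are small enough to restate directly. This rendering follows the category table and the stated predicate behavior; the scheduling-class numbers are the recovered type codes, and the ALU mapping for type 4 is taken from the table:

```c
/* Reconstruction of sub_1228670: map an instruction's scheduling class
 * to the 5-category queue code used by Pass A. */
static int classify(int sched_class) {
    switch (sched_class) {
    case 9:  return 0;   /* branch / control flow */
    case 12: return 1;   /* load */
    case 5:  return 2;   /* store */
    case 4:  return 4;   /* general ALU */
    default: return 3;   /* uncategorized -> treated as general */
    }
}

/* Reconstruction of sub_1228EF0: returns 0 for the "special" classes
 * 9/5/12, 1 for everything else. */
static int is_special(int sched_class) {
    return !(sched_class == 9 || sched_class == 5 || sched_class == 12);
}
```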
After classification, Pass A performs register-class-aware instruction motion: it uses sub_91BF30 (register class builder), sub_91E390 (class query), and sub_91E610 (class intersection) to verify that moving an instruction does not violate register-class constraints. Instructions that pass the check have their operand flags updated at +48 (bit 0x40 = "moved" marker) and +96 (copy-chain tracking).
The reordering step sub_122AA30 (186 lines) performs the final within-block interleaving. sub_1227D90 (522 lines) handles the actual linked-list surgery: unlink an instruction from its current position and reinsert it at a new location.
Pass B: Scheduling-Aware Emission Ordering (sub_122F650)
3,917 decompiled lines. Takes the classified instruction lists from Pass A and determines the emission order that optimizes scheduling. Operates on 8 bitvector arrays allocated via the sub_BDxxxx bitvector library:
| Bitvector | Purpose |
|---|---|
| v521 | Main liveness set (all instructions) |
| v523 | Load-group register liveness |
| v525 | Store-group register liveness |
| v527 | ALU-group register liveness |
| v529 | Control-flow register liveness |
| v531 | Cross-block interference set |
| v533 | Scheduling priority set |
| v535 | Secondary interference set |
Each bitvector is sized to the function's total register count (*(ctx+224)). Pass B iterates through instructions, populates the bitvectors with defined-register information via sub_BDBC70 (set bit), then merges category-specific vectors into the main set via sub_BDC5F0 (union) in an order determined by the dependency analysis.
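The two bitvector operations Pass B leans on (sub_BDBC70 set-bit, sub_BDC5F0 union) have conventional semantics; a minimal model, sized to the function's register count as described above (names and layout are invented):

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal model of the sub_BDxxxx bitvector library as used by Pass B:
 * one bit per virtual register, word-parallel union. */
typedef struct { uint64_t *words; size_t nwords; } bitvec;

static bitvec bv_alloc(size_t nbits) {
    size_t nw = (nbits + 63) / 64;
    bitvec v = { calloc(nw, sizeof(uint64_t)), nw };
    return v;
}
static void bv_set(bitvec *v, unsigned bit) {          /* cf. sub_BDBC70 */
    v->words[bit >> 6] |= 1ull << (bit & 63);
}
static int bv_test(const bitvec *v, unsigned bit) {
    return (int)((v->words[bit >> 6] >> (bit & 63)) & 1);
}
static void bv_union(bitvec *dst, const bitvec *src) { /* cf. sub_BDC5F0 */
    for (size_t i = 0; i < dst->nwords; i++)
        dst->words[i] |= src->words[i];
}
```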
The single switch at line 2578 dispatches on the instruction category:
case 4 (ALU): merge load + store + ALU vectors into main
case 0 (branch): merge load vector only
case 3 (uncategorized): merge store vector only
case 1 (load): merge ALU vector only
case 2 (store): no merge (kept separate from the main set)
Knob-derived flags control reordering aggressiveness:
- Knob at target offset +7416 (index ~103): enable load reordering
- Knob at target offset +7488: enable general reordering
- All reordering disabled when *(ctx+1584)+372 == 12288 (a specific regalloc configuration)
Pass B also maintains a red-black tree structure for the emission schedule, with standard left/right/parent pointers at node offsets 0, 8, 16.
Differences from SM100
| Aspect | SM89/90 | SM100 (Blackwell) |
|---|---|---|
| Pre-encode reordering | Present (sub_122AD60 + sub_122F650) | Absent -- scheduling integrated into own pass |
| Instruction classification | 5-category scheme (branch/load/store/ALU/other) | 370-category opcode dispatch via megafunctions |
| Cost model | Floating-point heuristic (4 tunable weights) | Table-driven via hardware profile records |
| Liveness tracking | 8 bitvectors per block | Handled in scheduling pass, not in encoding |
| Knob control | Knobs 103, 106, 218, 230, 487, 501 | Different knob set for Blackwell scheduler |
| Register class validation | sub_91BF30/sub_91E390 per-move check | Per-instruction class check at encoding time |
| Binary encoder calls | None -- IR-level manipulation only | sub_7B9B80 (18,347 callers) |
The SM89/90 pair operates entirely at the Mercury IR level and produces no packed instruction bits. It rewrites the instruction linked lists in each basic block to optimize scheduling, after which the standard encoding pipeline (Layers 1--3) runs on the reordered sequence. SM100 Blackwell does not need this layer because its scheduling infrastructure (documented in scheduling/algorithm.md) already integrates instruction ordering into the scheduling pass itself.
SM89/90 Codec Function Map
| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
sub_1233D70 | 6 KB | 321 | sm89_orchestrator -- guards, cost params, calls A+B | HIGH |
sub_122AD60 | 112 KB | 4,118 | sm89_classify_reorder -- instruction classification + block reordering | HIGH |
sub_122F650 | 105 KB | 3,917 | sm89_emission_order -- scheduling-aware emission ordering | HIGH |
sub_122AA30 | ~3 KB | 186 | local_reorder -- within-block instruction interleaving | HIGH |
sub_1227D90 | ~9 KB | 522 | instruction_reinsert -- unlink + reinsert at new position | HIGH |
sub_122F1E0 | ~6 KB | 330 | scheduling_heuristic -- cost-function comparison for emission order | MEDIUM |
sub_1228670 | ~0.5 KB | 30 | instruction_classify -- 5-category classifier (returns 0--4) | CERTAIN |
sub_1228EF0 | ~0.5 KB | 38 | is_special -- predicate: types 9/5/12 return false | CERTAIN |
sub_1226E80 | ~0.3 KB | 22 | list_prepend -- insert instruction at list head | CERTAIN |
sub_1226EB0 | ~5 KB | 274 | instruction_finalize -- post-reorder operand fixup | HIGH |
sub_1227820 | ~1 KB | 77 | operand_offset_update -- adjust operand offsets after move | HIGH |
sub_1227B60 | ~0.5 KB | 31 | motion_check -- can instruction move to new position? | HIGH |
sub_1228FA0 | ~2 KB | 100 | regclass_propagate -- propagate register class after move | HIGH |
sub_12292B0 | ~0.5 KB | 38 | queue_init_A -- initialize classification queue | HIGH |
sub_1229330 | ~0.5 KB | 38 | queue_init_B -- initialize classification queue | HIGH |
sub_1229BD0 | ~2 KB | 107 | tree_rebalance -- red-black tree rebalance | MEDIUM |
sub_122A050 | ~1 KB | 77 | pre_pass_init -- initialize pass A state object | HIGH |
sub_122A1A0 | ~2 KB | 139 | block_resize -- resize bitvector for new block count | HIGH |
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
sub_7B9B80 | 216 B | 18,347 | bitfield_insert -- core packer into 1280-bit buffer | CERTAIN |
sub_7B9D30 | 38 B | 2,408 | clear_cbuf_slots -- memset(a1+468, 0xFF, 64) | HIGH |
sub_7B9D60 | 408 B | 2,408 | encode_reuse_predicate -- reuse flags + guard predicate | HIGH |
sub_7BC030 | 814 B | 6,147 | encode_register -- GPR operand encoder | HIGH |
sub_7BC360 | ~500 B | 126 | encode_uniform_register -- UR operand encoder | HIGH |
sub_7BC5C0 | 416 B | 1,449 | encode_predicate -- predicate operand encoder | HIGH |
sub_7BCF00 | 856 B | 1,657 | encode_immediate -- immediate/cbuf operand encoder | HIGH |
sub_7BD260 | ~300 B | 96 | decode_finalize -- extract control bits | HIGH |
sub_7BD3C0 | ~500 B | 286 | decode_register -- GPR operand decoder | HIGH |
sub_7BD650 | ~400 B | 115 | decode_register_alt -- destination register decoder | HIGH |
sub_7BE090 | ~400 B | 50 | decode_predicate -- predicate operand decoder | HIGH |
sub_10B6180 | 21 B | 8,091 | encode_bool_field -- 1-bit opcode-to-control mapping | HIGH |
sub_10B6160 | 21 B | 2,205 | encode_bool_field_B -- 1-bit flag variant | HIGH |
sub_10B6140 | 21 B | 1,645 | encode_bool_field_C -- 1-bit flag variant | HIGH |
sub_10AFF80 | 11 KB | 3 | instruction_constructor -- 32-param object builder | HIGH |
sub_10ADF90 | 2.2 KB | 357 | instruction_unlink -- linked-list remove + recycle | HIGH |
sub_10B0BE0 | 6.5 KB | -- | hash_table_insert_64 -- FNV-1a, 8-byte key, 4x resize | HIGH |
sub_10B1C30 | 3.9 KB | -- | hash_table_insert_32 -- FNV-1a, 4-byte key | HIGH |
sub_10C0B20 | 180 KB | 3,109 | setField -- field value writer dispatch | HIGH |
sub_10D5E60 | 197 KB | 961 | getFieldOffset -- field bit-position lookup dispatch | HIGH |
sub_10E32E0 | 187 KB | 72 | hasField -- field existence query dispatch | HIGH |
sub_10CCD80 | 142 KB | 4 | setFieldDefault -- default value writer dispatch | MEDIUM |
sub_10CAD70 | 68 KB | 74 | getOperandFieldOffset -- per-operand field offset dispatch | HIGH |
sub_10C7690 | 65 KB | 288 | setOperandField -- per-operand field writer dispatch | HIGH |
sub_AF7DF0 | -- | 7,355 | encoded_to_ir_register -- hardware reg to IR translation | HIGH |
sub_AF7200 | -- | 552 | encoded_to_ir_predicate -- hardware pred to IR translation | HIGH |
sub_EB3040 | 1.9 KB | -- | decode_dispatcher -- binary search on instruction type | HIGH |
sub_112CDA0 | 8.9 KB | -- | register_pair_encoder -- 40-pair mapping via if-chain | HIGH |
Cross-References
- Mercury Encoder -- the assembler backend that invokes the encoding phase
- Capsule Mercury & Finalization -- post-encoding finalization
- SASS Opcode Catalog -- full mnemonic table
- Instruction Selection -- the preceding pipeline phase
- Blackwell (SM 100-121) -- SM100 architecture details
- IR Instructions & Opcodes -- the Ori IR instruction format consumed by the encoder
Peephole Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The peephole optimization pass in ptxas is the single largest subsystem by code volume in the entire binary. Three monolithic dispatch functions -- totaling approximately 750 KB of machine code -- implement a brute-force pattern-match-and-rewrite engine that recognizes instruction idioms in the internal IR and replaces them with more efficient SASS instruction forms. Each dispatch function serves a different compilation context (generic, SM120-specific, and post-scheduling), but all three share the same architecture: a giant opcode-based switch dispatches to hundreds of pattern matchers; the highest-priority match wins; the winning rewrite modifies the instruction in-place.
None of the three mega-dispatchers can be decompiled by Hex-Rays due to their extreme size (233--280 KB each). All analysis in this page derives from disassembly, call graphs, and the 3,185 pattern-matcher functions that they invoke.
Scale Summary
| Dispatch function | Binary size | Instructions | Pattern matchers | Total call sites | Entry trampoline | Context |
|---|---|---|---|---|---|---|
| sub_169B190 | 280 KB | 65,999 | 762 | 15,870 | sub_B12930 | Generic (all SM) |
| sub_143C440 | 233 KB | ~56,241 | 1,087 | 1,971 | sub_B12940 | SM120-specific |
| sub_198BCD0 | 233 KB | 54,043 | 1,336 | 13,391 | sub_B12960 | Post-scheduling |
All three entry trampolines (sub_B12930, sub_B12940, sub_B12960) are 11-byte thunks that strip or forward one argument and tail-call the corresponding mega-dispatcher.
Pipeline Position
IR instruction stream
|
v
sub_B12930 -----> sub_169B190 (generic peephole)
|
v
sub_B12940 -----> sub_143C440 (SM120 peephole, RTX 50-series / Pro)
|
v
[instruction scheduling]
|
v
sub_B12960 -----> sub_198BCD0 (post-schedule peephole)
|
v
[instruction encoding via vtable]
The generic and SM120 dispatchers run before scheduling; the post-scheduling
dispatcher runs after. The SM120 dispatcher (sub_143C440) appears to be
architecture-gated -- it is called only when compiling for SM 120 targets
(consumer RTX 50-series, enterprise Pro GPUs).
Dispatch Architecture
All three mega-dispatchers follow the same algorithm.
Entry and primary switch
push callee-saves
sub rsp, 10h
mov rbp, rdi ; ctx
mov rbx, rsi ; instruction node
mov [rsp+var_2C], -1 ; best_template_id = NONE
mov [rsp+var_30], -1 ; best_priority = NONE
movzx edi, word [rsi+0Ch] ; read opcode field
call sub_13B9DC0 ; identity / normalization (returns opcode)
cmp ax, 174h ; 373 cases (opcodes 0..372)
ja default
jmp [jump_table + rax*8] ; PRIMARY SWITCH on opcode
The 16-bit opcode at instruction node offset +0x0C selects a primary case.
All three dispatchers use 373-case primary switches.
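The recovered protocol can be sketched in C. Everything below is an illustrative reconstruction: the function and type names, and the table-of-matchers representation, are invented; only the opcode field at +0x0C, the 373-case bound, and the four-argument matcher prototype come from the disassembly.

```c
/* Hypothetical reconstruction of the mega-dispatcher protocol.
 * Names are invented; offsets follow the recovered node layout. */
#include <stdint.h>
#include <stddef.h>

#define TEMPLATE_NONE (-1)
#define NUM_OPCODES   373           /* primary switch: cases 0..372 */

typedef struct PeepCtx PeepCtx;     /* opaque peephole context */
typedef struct Instr   Instr;

/* Matcher prototype shared by all 3,185 pattern matchers. */
typedef char (*MatcherFn)(PeepCtx *ctx, Instr *instr,
                          int32_t *template_id, int32_t *priority);

struct Instr { uint8_t raw[0x90]; };

static uint16_t opcode_of(const Instr *i) {
    return *(const uint16_t *)(i->raw + 0x0C);   /* opcode at +0x0C */
}

/* Per-opcode dispatch: run every matcher registered for the opcode,
 * keep the best (template_id, priority) pair, return the winner. */
int dispatch(PeepCtx *ctx, Instr *instr,
             MatcherFn *const *matchers_by_opcode,
             const size_t *matcher_counts) {
    uint16_t op = opcode_of(instr);
    if (op >= NUM_OPCODES)
        return TEMPLATE_NONE;                    /* the 'ja default' path */
    int32_t best_template = TEMPLATE_NONE;       /* var_2C = -1 */
    int32_t best_priority = TEMPLATE_NONE;       /* var_30 = -1 */
    for (size_t k = 0; k < matcher_counts[op]; k++)
        matchers_by_opcode[op][k](ctx, instr, &best_template, &best_priority);
    return best_template;   /* the secondary switch would act on this */
}

/* Demo matcher: claims template 42 at priority 3 for opcode 5. */
static char demo_matcher(PeepCtx *ctx, Instr *instr,
                         int32_t *template_id, int32_t *priority) {
    (void)ctx;
    if (opcode_of(instr) != 5) return 0;
    if (*priority < 3) { *priority = 3; *template_id = 42; }
    return 1;
}

int demo_dispatch(void) {
    Instr i = {{0}};
    *(uint16_t *)(i.raw + 0x0C) = 5;
    MatcherFn row[] = { demo_matcher };
    MatcherFn *const table[NUM_OPCODES] = { [5] = row };
    size_t counts[NUM_OPCODES] = { [5] = 1 };
    return dispatch(NULL, &i, table, counts);
}
```

In the real binary the matcher lists are not data tables but straight-line call sequences inside each primary case; the table here only makes the per-opcode grouping explicit.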
Per-case pattern matching
Within each primary case, the dispatcher:
- Calls a sequence of pattern-matcher functions, passing pointers to best_template_id and best_priority as out-parameters.
- Each matcher may update these if it finds a match with higher priority than the current best.
- After all matchers for the opcode have run, the dispatcher checks best_template_id. If it is no longer -1, a secondary switch on the template ID selects the rewrite action.
The secondary switches are embedded inside the giant function.
sub_143C440 alone contains 85 secondary jump tables (sizes 7--190 cases),
totaling 1,971 switch cases.
Rewrite action
When a rewrite is selected, the action block performs four operations:
setRewrittenOpcode(instr, new_opcode); // sub_B28F10: writes byte at instr+14
setRewrittenModifier(instr, new_modifier); // sub_B28F20: writes byte at instr+15
setOperandMapping(instr, slot, value); // sub_BA9CF0: writes instr+72+4*slot
markRewritten(instr); // sub_BA9C30 or sub_BA9CB0
sub_BA9C30 (markRewrittenSimple) sets bit 0 of the flags word at instr+140:
*(uint32_t*)(instr + 140) |= 1;
sub_BA9CB0 (markRewrittenComplex) applies priority-aware flag logic that
respects existing rewrites from earlier passes -- it sets bits to 0x8
("superseded") when a higher-priority rewrite exists.
The symmetry of call frequencies in sub_143C440 confirms this: setRewrittenOpcode
and setRewrittenModifier are each called exactly 1,759 times -- every rewrite
always sets both the opcode and modifier bytes.
Pattern Matcher Signature
Every one of the 3,185 pattern matchers shares the same prototype:
char __fastcall match(
int64_t ctx, // a1: peephole optimization context
int64_t instr, // a2: instruction node being examined
int32_t *template_id, // a3: output -- combined opcode / template ID
int32_t *priority // a4: input/output -- current best priority
);
The function returns a char (the last comparison result, used for early-exit
optimization in the caller), but the meaningful outputs are *template_id and
*priority.
Matching algorithm
Every matcher performs a deeply-nested chain of checks:
Step 1 -- Modifier/property checks.
Call queryModifier(ctx, instr, slot) (sub_10AE5C0) repeatedly. Each call
returns an enumerated value for a specific instruction property:
if (queryModifier(ctx, instr, 0xDC) != 1206) return 0; // data type != .f32
if (queryModifier(ctx, instr, 0x163) != 1943) return 0; // rounding != .rn
if (queryModifier(ctx, instr, 0x7E) - 547 > 1) return 0; // saturation out of range
The slot indices (0x05, 0x7B, 0x7E, 0x88, 0x90, 0xA1, 0xBE, 0xD2, 0xD3, 0xDC, 0xF2, 0x101, 0x119, 0x126, 0x127, 0x142, 0x152, 0x155, 0x159, 0x15C, 0x163, 0x167, 0x178, 0x179, 0x18A, 0x18D, 0x196, 0x197, 0x199, 0x19D, 0x1A8, 0x1AD, 0x1AE, 0x1AF, 0x1B2, 0x1D1, 0x1D2, 0x1E0, 0x1E4, 0x1EC, 0x216, 0x253, etc.) index into a per-instruction property table covering type, rounding mode, saturation, negate, comparison type, and architecture-specific modifiers.
Step 2 -- Operand count. Check the number of explicit/fixed operands and the total operand slot count:
int fixed = getExplicitOperandCount(instr); // sub_B28F50: returns *(instr+92)
int total = getTotalOperandSlots(instr); // sub_B28F40: returns *(instr+40)+1 - *(instr+92)
Step 3 -- Operand type and register class validation. For each operand slot, retrieve the operand pointer and check its kind:
void *op = getOperand(instr, idx); // sub_B28F30: returns *(instr+32) + 32*idx
byte kind = *(byte*)op;
if (!isRegister(kind)) return 0; // sub_13B9CD0: kind == 2
if (!isImmediate(kind)) return 0; // sub_13B9CE0: kind == 1 (used by immediate-operand matchers instead)
Register class is checked against expected values:
int regclass = getRegisterClass(*(uint32_t*)(op + 4)); // sub_13B9CC0
if (regclass != 1023 && regclass != 1) return 0; // 1023 = wildcard
Step 4 -- Priority gate. If all checks pass and the current priority allows it:
if (*priority <= threshold) {
*priority = threshold + 1;
*template_id = combined_opcode_id;
}
Since matchers are called sequentially and each checks the running maximum, the highest-priority match always wins.
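The gate can be isolated as a tiny helper to show why the protocol is order-independent. This is a sketch; Best and claim are invented names for the (best_template_id, best_priority) pair and the gate itself.

```c
#include <stdint.h>

typedef struct { int32_t template_id, priority; } Best;

/* The gate at the tail of every matcher; threshold and combined_id are
 * the per-pattern constants baked into each matcher clone. */
static void claim(Best *b, int32_t threshold, int32_t combined_id) {
    if (b->priority <= threshold) {
        b->priority    = threshold + 1;
        b->template_id = combined_id;
    }
}

/* Two patterns claiming the same instruction, run in either order: */
int32_t winner_forward(void) {
    Best b = { -1, -1 };          /* best_template_id/priority = NONE */
    claim(&b, 5, 100);            /* common pattern, priority 5       */
    claim(&b, 20, 200);           /* specific pattern, priority 20    */
    return b.template_id;
}

int32_t winner_reversed(void) {
    Best b = { -1, -1 };
    claim(&b, 20, 200);
    claim(&b, 5, 100);
    return b.template_id;
}
```

In both orders the priority-20 pattern wins (template 200), which is why matcher call order within a primary case does not affect the outcome.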
Operand Type Discriminators
Three families of trivial single-instruction functions serve as operand type predicates, one family per dispatch context:
SM120 matchers (Zone A of sub_143C440)
| Function | Test | Semantic |
|---|---|---|
| sub_13B9CD0 | kind == 2 | isRegister |
| sub_13B9CE0 | kind == 1 | isImmediate |
| sub_13B9D00 | kind == 2 || kind == 1 | isRegOrImm |
| sub_13B9D10 | kind == ? | isConstantBuffer |
| sub_13B9D40 | kind == ? | isPredicate |
| sub_13B9D50 | kind == ? | isUniformRegister |
| sub_13B9CC0 | extracts class | getRegisterClass (1023 = wildcard) |
Generic matchers (Zone A of sub_169B190)
| Function | Test | Semantic |
|---|---|---|
| sub_15F59C0 | a1 == 2 | isRegister |
| sub_15F59D0 | a1 == 1 | isImmediate |
| sub_15F59E0 | a1 == 0 | isNone |
| sub_15F59F0 | a1 == 10 | isConstantMemory |
| sub_15F5A00 | a1 == 9 | isTexRef |
| sub_15F5A30 | a1 == 3 | isPredicate / isConstImm |
| sub_15F5A40 | a1 == 15 | isUniformRegister / isTrueConst |
| sub_15F5A80 | a1 == 6 | isLabel |
| sub_15F5A90 | a1 == 11 | isTexture |
| sub_15F5AB0 | identity | getOperandValue |
Post-schedule matchers (Zone A of sub_198BCD0)
| Function | Test | Semantic | Call count |
|---|---|---|---|
| sub_1820170 | identity | getOpcodeRaw | 9,278 |
| sub_1820180 | a1 == 2 | isRegOperand | 2,743 |
| sub_1820190 | a1 == 1 | isImmOperand | 677 |
| sub_18201A0 | a1 == 8 | isUniform | 7 |
| sub_18201B0 | a1 == 10 | isPredicateReg | 1,228 |
| sub_18201C0 | a1 == 9 | isTexRef | 211 |
| sub_18201D0 | a1 == 5 | isConstBuf | 14 |
| sub_18201E0 | a1 == 4 | isAddress | 9 |
| sub_18201F0 | a1 == 3 | isConstImm | 1,044 |
| sub_1820200 | a1 == 15 | isTrueConst | 1,044 |
| sub_1820210 | a1 == 7 | isBarrier | 9 |
| sub_1820220 | a1 == 12 | isSurface | 12 |
| sub_1820230 | a1 == 11 | isTexture | 12 |
| sub_1820240 | a1 == 6 | isLabel | 2 |
| sub_1820250 | a1 == 14 | isSpecialReg | 2 |
| sub_1820260 | a1 == 13 | isUnknown | 6 |
Priority System
Matchers use a strict numeric priority to resolve conflicts when multiple patterns match the same instruction. Higher priority means more specific and/or more profitable transformation.
| Priority range | Description | Example |
|---|---|---|
| 1--2 | Trivial matches (simple mov, basic arithmetic) | Single-operand passthrough |
| 5--11 | Common 2--3 operand combining patterns | Standard FMA combines |
| 14--20 | Complex 4-operand patterns with constraints | Multi-source ALU combines |
| 22--31 | Highly specific multi-operand patterns | Wide register + predicated ops |
| 33--36 | Maximum specificity (8--9 operands + all modifiers) | Full tensor instruction forms |
Pattern IDs range from 1 to approximately 244 in the generic and SM120 dispatchers. Multiple matchers can target the same pattern ID with different priorities, creating a priority cascade.
Instruction Node Layout
The peephole subsystem reveals the following fields of the instruction IR node:
| Offset | Size | Field | Accessor |
|---|---|---|---|
| +0x00 | 1 B | Operand type tag | isRegister, isImmediate, etc. |
| +0x04 | 4 B | Primary value (register number / immediate) | getRegisterClass / getOperandValue |
| +0x0C | 2 B | Opcode number (16-bit) | Direct read in dispatch entry |
| +0x0E | 1 B | Rewritten opcode | sub_B28F10 (setRewrittenOpcode) |
| +0x0F | 1 B | Rewritten modifier | sub_B28F20 (setRewrittenModifier) |
| +0x14 | 4 B | Secondary register field | Direct read |
| +0x20 | 8 B | Operand array base pointer | sub_B28F30 base address |
| +0x28 | 4 B | Total operand count | Part of sub_B28F40 computation |
| +0x48 | var | Operand mapping table (4 B per slot) | sub_BA9CF0 writes here |
| +0x5C | 4 B | Explicit operand count | sub_B28F50 returns this |
| +0x8C | 4 B | Flags word | Bit 0 = rewritten (set by sub_BA9C30) |
Each operand is a 32-byte record at base + 32 * index:
| Operand offset | Size | Content |
|---|---|---|
| +0 | 1 B | Type tag (1=imm, 2=reg, 3=constImm, 10=pred, 15=trueConst, ...) |
| +4 | 4 B | Primary value (register ID; 1023 = wildcard / any-reg) |
| +20 | 4 B | Secondary value (modifier / sub-register) |
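The recovered offsets can be captured in a speculative C layout whose shape is checkable with offsetof. All field names are invented, the padding regions are unknowns, and the operand-map slot count of 5 is an assumption chosen only to fill the gap between +0x48 and +0x5C.

```c
/* Speculative layouts matching the recovered offsets; names invented. */
#include <stdint.h>
#include <stddef.h>

typedef struct Operand {
    uint8_t  kind;               /* +0:  1=imm, 2=reg, 3=constImm, ...  */
    uint8_t  _pad0[3];
    uint32_t value;              /* +4:  register ID; 1023 = wildcard   */
    uint8_t  _pad1[12];
    uint32_t secondary;          /* +20: modifier / sub-register        */
    uint8_t  _pad2[8];           /* record stride is 32 bytes           */
} Operand;

typedef struct InstrNode {
    uint8_t  _pad0[0x0C];        /* unknown; primary value sits at +0x04 */
    uint16_t opcode;             /* +0x0C: 16-bit opcode number          */
    uint8_t  rewritten_opcode;   /* +0x0E: written by sub_B28F10         */
    uint8_t  rewritten_modifier; /* +0x0F: written by sub_B28F20         */
    uint8_t  _pad1[0x10];        /* covers secondary reg field at +0x14  */
    Operand *operands;           /* +0x20: operand array base pointer    */
    uint32_t total_operands;     /* +0x28                                */
    uint8_t  _pad2[0x1C];
    uint32_t operand_map[5];     /* +0x48: 4 B/slot; count is ASSUMED    */
    uint32_t explicit_operands;  /* +0x5C: returned by sub_B28F50        */
    uint8_t  _pad3[0x2C];
    uint32_t flags;              /* +0x8C: bit 0 = rewritten             */
} InstrNode;
```

With this layout, sub_B28F30's `*(instr+32) + 32*idx` is simply `&node->operands[idx]`.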
Code Duplication
The pattern matchers exhibit extreme structural duplication. Groups of 2--10 functions are near-identical clones differing only in numeric constants (the specific opcode/modifier values they check, the template ID they assign, and the priority level).
Observed clone clusters in sub_169B190's matchers:
| Functions | Byte size each | Address range example |
|---|---|---|
| 5 | ~5,560 B | 0x167CBB0--0x16E7D20 |
| 10 | ~5,282 B | 0x167E3A0--0x16807E0 |
| 4 | ~5,298 B | 0x16EA5F0--0x16ECA30 |
| 3 | ~5,846 B | 0x16EDC00--0x16EE8B0 |
| 7 | ~2,718 B | 0x166F260--0x1692B60 |
| 6 | ~2,604 B | 0x166AC30--0x166E170 |
Similarly, in sub_198BCD0's matchers, eight functions of exactly 5,282 bytes
each (sub_1982810, sub_1982AE0, sub_1982DB0, sub_1983080,
sub_1984B40, sub_1984E10, sub_19850E0, sub_19853B0) share identical
structure, varying only in the opcode/modifier constants passed to
sub_10AE5C0.
This strongly suggests compiler-generated code from C++ templates or macros that instantiate one matcher function per instruction variant from ISA specification tables -- a pattern consistent with NVIDIA's internal build tooling.
Size Distribution of Matchers
SM120 matchers (1,087 functions, 429 KB)
| Size range | Count | Description |
|---|---|---|
| < 200 B | 37 | Simple 1--2 modifier checks |
| 200--400 B | 520 | Typical 4--8 modifier checks |
| 400--600 B | 455 | 6--12 modifier checks + operand validation |
| 600--800 B | 66 | Complex multi-operand patterns |
| > 800 B | 9 | Deepest nesting, most constrained patterns |
Generic matchers (762 functions, ~310 KB)
| Size range | Frequency | Description |
|---|---|---|
| ~2,200 B | most common | 2--4 instruction field checks |
| ~2,800 B | moderate | Patterns with operand constraints |
| ~3,500--4,000 B | fewer | Complex multi-operand patterns |
| ~5,500--8,500 B | rare | 12+ modifier checks, 8--9 operands |
Post-schedule matchers (~1,336 functions)
| Size range | Frequency | Description |
|---|---|---|
| ~2,200 B | most common | Simple 2-instruction patterns |
| ~2,500 B | common | 3-instruction patterns |
| ~3,100 B | moderate | Patterns with predicate checks |
| ~5,300 B | few | Multi-instruction sequences (8+ operands) |
| ~6,800 B | 1 | Largest matcher (sub_1980D10) |
Representative Matcher Examples
Simplest: sub_143C3B0 (132 bytes, priority 2, template 1)
Checks: no explicit operands, 2 total slots, first operand is register-or-immediate
with register class 1023 or 1. Matches a trivial mov-type instruction for
passthrough combining.
Moderate: sub_13CF0C0 (426 bytes, priority 15, template 28)
Checks 5 modifiers: slot 0xD3 == 1181, slot 0xD2 == 1177, slot 0x0C == 59, slot 0xB3 == 772, slot 0xC8 == 1107. Then validates 1 explicit register operand plus 4 additional operands (register, register, immediate, predicate).
Complex: sub_1615980 (priority 36, template 25 -- highest observed priority)
Checks 12 modifier slots: 0x05 == 12, 0xDC == 1206, 0x253 in {2937,2938}, 0x126 == 1493, 0xF2 in {1281,1282}, 0x163 == 1943, 0x178 == 2035, 0x179 in {2037..2041}, 0x1AD in {2253..2257}, 0x7E in {547,548}, 0x19D in {2167,2168}, 0x18D == 2115. No fixed operands, 7 variable operands, each of type 10 (constant memory) with register class 1023 or specific flag constraints. This is the most constrained pattern observed -- likely a fully specified tensor instruction variant.
Post-schedule: sub_1834600 (pattern 17, priority 16)
Checks modifier slots 0xD3 == 1181, 0xD2 == 1177, 0x0C in {60,61}, 0xB3 == 772, 0xC8 == 1107. Then: first operand offset == 1, that operand is immediate, total operand count == 5, followed by register pattern checks.
Infrastructure Helper Functions
Core accessor (sub_10AE5C0, 60 bytes)
The single most-called function in the peephole subsystem (30,768 callers across the full binary). Queries a property of an instruction node by slot ID:
int queryModifier(int64_t ctx, int64_t instr, int slot) {
if (hasProperty(instr, slot)) // sub_10E32E0
return getPropertyValue(instr, slot); // sub_10D5E60
return 0xFFFFFFFF; // property not present
}
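The 0xFFFFFFFF sentinel interacts neatly with the unsigned range-check idiom seen in matcher step 1 (e.g. `queryModifier(ctx, instr, 0x7E) - 547 > 1`). A minimal sketch of the idiom, with in_range as an invented helper name:

```c
#include <stdint.h>

/* x is in [lo, hi] iff (x - lo) <= (hi - lo) under unsigned wraparound;
 * one compare replaces two.  The 0xFFFFFFFF "property absent" sentinel
 * from queryModifier always wraps far outside the window, so absent
 * properties reject the pattern with no extra test. */
int in_range(uint32_t x, uint32_t lo, uint32_t hi) {
    return (uint32_t)(x - lo) <= (hi - lo);
}
```

This is why matchers can write `queryModifier(...) - 547 > 1` for "saturation not in {547, 548}" without separately handling missing properties.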
Node accessors
| Function | Size | Semantics | Call frequency |
|---|---|---|---|
| sub_B28F30 | 12 B | getOperand(instr, idx) -- returns *(instr+32) + 32*idx | 31,399 |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) -- returns *(instr+40)+1 - *(instr+92) | ~2,500 |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) -- returns *(instr+92) | ~2,100 |
Rewrite helpers
| Function | Semantics | Call frequency in sub_143C440 |
|---|---|---|
| sub_B28F10 | setRewrittenOpcode(instr, byte) -- writes instr[14] | 1,759 |
| sub_B28F20 | setRewrittenModifier(instr, byte) -- writes instr[15] | 1,759 |
| sub_BA9CF0 | setOperandMapping(instr, slot, val) -- writes instr[72+4*slot] | 993 |
| sub_BA9C30 | markRewrittenSimple(instr) -- instr[140] |= 1 | 1,222 |
| sub_BA9CB0 | markRewrittenComplex(instr) -- priority-aware flag update | 361 |
The ratio of markRewrittenSimple (1,222) to markRewrittenComplex (361)
shows that approximately 77% of rewrites are straightforward replacements,
while 23% involve priority negotiation with competing rewrites.
Call Frequency in sub_169B190 (Generic Dispatcher)
| Callee | Count | Role |
|---|---|---|
| sub_B28F10 (setRewrittenOpcode) | 2,142 | Write new opcode byte |
| sub_B28F20 (setRewrittenModifier) | 2,142 | Write new modifier byte |
| sub_15F59B0 (getOperandValue) | 1,736 | Extract register number |
| sub_10AE5C0 (queryModifier) | 1,303 | Read instruction property |
| sub_B28F30 (getOperand) | 1,281 | Get operand pointer |
| sub_BA9C30 (markRewrittenSimple) | 1,261 | Simple rewrite commit |
| sub_BA9CF0 (setOperandMapping) | 855 | Map operand slots |
| sub_BA9CB0 (markRewrittenComplex) | 589 | Priority-aware commit |
Relationship to Instruction Encoding
Each dispatch function's address range is adjacent to a zone of SASS instruction encoders that consume the rewritten instructions:
- sub_143C440 (SM120) sits before 123 SM120 encoders at 0x14771E0--0x14A3C80 (180 KB), covering 82 unique SASS opcodes with up to 42 encoding variants per opcode.
- sub_169B190 (generic) sits before 100 encoding table entries at 0x16DF750--0x16FFFF0 and 36 template expanders at 0x1700000--0x1722D60.
- sub_198BCD0 (post-schedule) operates on already-scheduled instructions, performing strength reduction and idiom recognition on the final instruction stream.
The encoders are called via vtable dispatch, not directly from the peephole
functions. Each encoder packs a 128-bit SASS instruction word using
sub_7B9B80(state, bit_offset, bit_width, value) for bit-field insertion.
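The real state layout behind sub_7B9B80 is unrecovered; below is a minimal sketch of a 128-bit field inserter with the same argument shape, assuming the instruction word is held as two 64-bit halves and that fields are OR-inserted into a zeroed word. Sass128 and insert_field are invented names.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } Sass128;

/* Insert `value` into bits [bit_offset, bit_offset+bit_width) of a
 * 128-bit word.  Assumes bit_width in 1..63 and offset+width <= 128. */
void insert_field(Sass128 *w, unsigned bit_offset,
                  unsigned bit_width, uint64_t value) {
    uint64_t mask = (1ULL << bit_width) - 1;
    value &= mask;
    if (bit_offset < 64) {
        w->lo |= value << bit_offset;
        if (bit_offset + bit_width > 64)        /* field straddles halves */
            w->hi |= value >> (64 - bit_offset);
    } else {
        w->hi |= value << (bit_offset - 64);
    }
}

/* Demo: an 8-bit field at bit 60 straddles both halves. */
Sass128 demo_word(void) {
    Sass128 w = {0, 0};
    insert_field(&w, 60, 8, 0xAB);
    return w;
}
```

The straddling case is what makes a dedicated helper worthwhile: with 18,347 callers, every encoder delegates the split-across-halves arithmetic to one routine.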
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_B12930 | 11 B | Entry trampoline for generic peephole | CERTAIN |
| sub_B12940 | 11 B | Entry trampoline for SM120 peephole | CERTAIN |
| sub_B12960 | 11 B | Entry trampoline for post-schedule peephole | CERTAIN |
| sub_169B190 | 280 KB | Generic peephole mega-dispatcher | HIGH |
| sub_143C440 | 233 KB | SM120 peephole mega-dispatcher | HIGH |
| sub_198BCD0 | 233 KB | Post-schedule peephole mega-dispatcher | HIGH |
| sub_10AE5C0 | 60 B | queryModifier(ctx, instr, slot) | HIGH |
| sub_B28F10 | small | setRewrittenOpcode(instr, byte) | HIGH |
| sub_B28F20 | small | setRewrittenModifier(instr, byte) | HIGH |
| sub_B28F30 | 12 B | getOperand(instr, idx) | CERTAIN |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) | CERTAIN |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) | CERTAIN |
| sub_BA9C30 | small | markRewrittenSimple(instr) | HIGH |
| sub_BA9CB0 | small | markRewrittenComplex(instr) | HIGH |
| sub_BA9CF0 | small | setOperandMapping(instr, slot, value) | HIGH |
| sub_13B9CC0 | small | getRegisterClass(field) | HIGH |
| sub_13B9CD0 | small | isRegister(byte) | HIGH |
| sub_13B9CE0 | small | isImmediate(byte) | HIGH |
| sub_13B9D00 | small | isRegisterOrImmediate(byte) | HIGH |
| sub_13B9D10 | small | isConstantBuffer(byte) | HIGH |
| sub_13B9D40 | small | isPredicate(byte) | HIGH |
| sub_13B9D50 | small | isUniformRegister(byte) | HIGH |
| sub_13B9DC0 | small | opcodeIdentity(uint) -- passthrough | CERTAIN |
| sub_1909030 | small | opcodePassthrough (post-schedule context) | HIGH |
Macro Instruction Expansion (sub_8127C0)
Separate from the three pattern-match-and-rewrite mega-dispatchers, ptxas contains
a dedicated macro instruction expansion pass at sub_8127C0 (10,720 bytes). This
pass resolves register-file constraints for composite instructions -- cases where
source or destination operands span register files or where multi-word results need
splitting into narrower instruction sequences.
It is called from the master lowering dispatcher sub_8380A0 and runs before
instruction scheduling.
Two-phase algorithm
Phase 1 -- Operand scanning and constraint annotation.
The pass iterates every instruction in the function's linked list (traversing via
instr+8). For each instruction, it reads the opcode at instr+72 and dispatches
through a 15-family if-else cascade. For each opcode, it calls
sub_812550 (getOperandConstraint) on each source operand to determine
register-file affinity:
| Return value | Meaning |
|---|---|
| 0 | Unconstrained |
| -2 | Constrained to register file B (e.g., even-aligned pair) |
| -3 | Constrained to register file A (e.g., odd-aligned pair) |
| -1 | Conflict / unresolvable |
The pass annotates register descriptor entries (indexed via ctx+88) at reg+76
(constraint code) and reg+80 (target width code), and builds a linked list of
instructions requiring expansion (linked via instr+56). Registers consumed by
expansion are marked dead (reg+64 = 5).
Phase 2 -- Instruction rewriting.
If any instruction requires expansion, the pass iterates the worklist and performs
actual rewrites: replacing composite instructions with equivalent sequences,
inserting new instructions via the sub_930040 / sub_92FF10 / sub_92E720
emitters, and deleting originals via sub_9253C0. Register-file mapping uses two
lookup tables: dword_21D5EE0[26] (for constraint -2) and dword_21D5F60[16]
(for constraint -3).
Between phases, a cleanup loop removes worklist entries with conflicting constraints
(both operands invalid), resetting reg+76 = -1.
Opcodes handled
| Opcode | Mnemonic | Expansion pattern |
|---|---|---|
| 10 | SHF | Three-source constraint check; emits I2IP (36) + new SHF when sources span register files |
| 18 | FSETP | Predicate operand finalization when operand count == 6 and modifier bits match |
| 29 | PMTRIG | Last-operand extraction and finalization |
| 36 | I2IP | Destination register marking and two-source constraint checking |
| 60 | LEPC | Store/load legalization: validates flags, checks register file == 6, recursive chain validation via sub_812480 |
| 62, 78, 79 | BAR_INDEXED, RTT, BSYNC | Same legalization path as LEPC |
| 95, 96 | STS, LDG | Last-operand extraction for stores; two-source vector-width constraint checking for loads |
| 97 | STG | Source registration for expansion tracking |
| 130 | HSET2 | Validates single-def destination, recursive source constraint chains; inserts HSET2 rewrites or converts to opcode-201 stores |
| 137 | SM73_FIRST | Same path as HSET2 |
| 149 | UFLO | Two-source validation; marks destination with width code 20; vectorization combining |
| 151 | UIMAD | Shared three-source path with SHF |
| 190 | LDGDEPBAR | Shared last-operand path with PMTRIG |
| 201, 202 | QMMA_16816, QMMA_16832 | Full multi-operand legalization; inserts barrier instructions for QMMA |
| 283 | UVIADD | Penultimate operand extraction and type resolution |
| 290 | MOV (sm_104) | Same constraint path as SHF/UIMAD |
| bit 12 set | (arch-specific) | Last-operand extraction for architecture-extended instructions |
sub_812550 -- getOperandConstraint
The single most-called helper (32 call sites), this 40-byte function reads the constraint code from the register descriptor for a given operand reference:
int getOperandConstraint(int64_t ctx, uint32_t *operand_ref) {
int modifier_bits = operand_ref[1];
int constraint = reg_array[*operand_ref & 0xFFFFFF].constraint; // reg+76
if ((modifier_bits & 0xFE000000) == 0)
return constraint; // no sub-register modifier => raw value
// Apply modifier-aware transformations:
// constraint -2 + certain modifier combos => -3 or -1
// constraint -3 + modifier bit 0x3C000000 => -1; + sign bit => -2
...
}
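The elided modifier transformations can be modeled speculatively. Only the transitions named in the recovered comments are encoded below; the "-2 + certain modifier combos" rules remain unknown and are deliberately left as pass-through. apply_modifier and the enum names are invented.

```c
/* Speculative model of the modifier-transform rules only; the -2
 * transitions are unrecovered and intentionally unmodeled. */
#include <stdint.h>

enum { UNCONSTRAINED = 0, CONFLICT = -1, FILE_B = -2, FILE_A = -3 };

int apply_modifier(int constraint, uint32_t modifier_bits) {
    if ((modifier_bits & 0xFE000000u) == 0)
        return constraint;                 /* no sub-register modifier */
    if (constraint == FILE_A) {            /* -3 */
        if (modifier_bits & 0x3C000000u)   /* documented: -3 -> -1 */
            return CONFLICT;
        if (modifier_bits & 0x80000000u)   /* documented: sign bit -> -2 */
            return FILE_B;
    }
    return constraint;   /* -2 combos unknown: raw value passed through */
}
```

Under this reading, a sub-register modifier can only tighten or flip a file-A constraint, never relax a conflict.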
sub_812480 -- validateOperandChain
Recursively walks use-def chains through HSET2 (130) and SM73_FIRST (137)
instructions to verify that an entire operand chain is compatible with a target
register file. Uses sub_A9BD00 to resolve the register file for a width code,
then checks reg+76 and reg+80 agreement.
Knob gate
Option 183 (target profile offset 13176) controls the expansion distance threshold. When enabled, a secondary value at profile+13184 sets the maximum distance between a register definition and its use before the constraint is considered violated. Default threshold: 7.
Function map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_8127C0 | 10,720 B | ExpandMacroInstructions (main pass) | HIGH |
| sub_812550 | 40 B | getOperandConstraint | HIGH |
| sub_812480 | ~170 B | validateOperandChain | HIGH |
| sub_8125E0 | ~450 B | canExpandStoreChain | MEDIUM |
| sub_800470 | small | isLegalizable | MEDIUM |
| sub_800360 | small | resolveOperandType | MEDIUM |
| sub_800400 | small | finalizeOperand | MEDIUM |
Cross-References
- Instruction Selection -- the isel pass that precedes peephole optimization
- SASS Instruction Encoding -- the encoder vtable entries that consume peephole output
- Newton-Raphson Templates -- multi-instruction template expansion (DDIV, DRCP, DSQRT) in the same address neighborhood as sub_169B190
- Scheduling Algorithm -- the scheduler that runs between pre- and post-schedule peephole
- Blackwell (SM 100--121) -- SM120-specific context for sub_143C440
Evidence Index
| Claim | Source |
|---|---|
| sub_143C440 structure, 1,087 matchers, 373-case switch | p1.20-sweep-0x13CF000-0x14A4000.txt lines 1--486 |
| SM120 encoder zone (123 functions, 180 KB) | p1.20 lines 269--329 |
| sub_169B190 structure, 762 matchers, 280 KB | p1.22 lines 1--460, p1.23 lines 1--588 |
| Generic operand discriminators (sub_15F59C0 family) | p1.22 lines 181--201 |
| Clone clusters in generic matchers | p1.23 lines 156--174 |
| Post-schedule discriminators (sub_1820170 family) | p1.25 lines 271--289 |
| sub_198BCD0 structure, 1,336 callees, 373-case switch | p1.26 lines 355--398 |
| Post-schedule 5,282-byte clone group | p1.26 lines 401--424 |
| Rewrite helper call frequencies | p1.20 lines 216--227, p1.23 lines 228--237 |
| Priority 36 as highest observed | p1.22 lines 316--327 |
| Instruction node layout | p1.20 lines 406--420, p1.22 lines 367--409 |
Mercury Encoder Pipeline
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It is not a direct binary encoding of SASS -- it is a separate representation that contains pseudo-instructions, lacks dependency barriers, and requires multiple transformation passes before it becomes executable GPU code. The Mercury pipeline occupies phases 113--122 of the 159-phase PhaseManager, forming a six-stage sub-pipeline: encode/decode verification, pseudo-instruction expansion, two WAR-hazard passes (one before and one after operation expansion), scoreboard/latency generation ("opex"), and final SASS microcode emission. All recent GPU architectures (SM 75+) use Mercury as the encoding backend; SM 100+ (Blackwell) defaults to "Capsule Mercury" (capmerc), a variant that embeds additional metadata for relocatable patching.
| Pipeline phases | 113--122 (8 active phases within Mercury sub-pipeline) |
| Core orchestrator | sub_6F52F0 (23KB, RunStages -- 18 parameters) |
| Master encoder | sub_6D9690 (94KB, EncodeInstruction -- largest backend function) |
| Opex body | sub_6FFDC0 (66KB, EmitInstructions -- scoreboard generation) |
| Expansion pass | sub_C3CC60 (26KB, MercExpand::run) |
| WAR generator | sub_6FBC20 (7.4KB, GenerateWARHazards) |
| SASS emitter | sub_6E4110 (24KB, MercGenerateSassUCode) |
| Bitfield insert | sub_7B9B80 (216 bytes, 18,347 callers across binary) |
| Encoding table funcs | 530 functions at 0xC66000--0xD27000 |
| Mercury mode flag | *(DWORD*)(context+385) == 2 |
| Mode check | sub_10ADF10 returns bool from target descriptor |
| MercConverter | sub_9F3340 (7KB orchestrator), sub_9EF5E0 (27KB operand reorganization) |
| CLI option | --binary-kind mercury,capmerc,sass |
Architecture
Phase 113 PostFixForMercTargets Late Ori fixups for Mercury targets
Phase 114 FixUpTexDepBarAndSync Texture dependency bars + sync fixups
Phase 115 AdvancedScoreboardsAndOpexes Arch hook point (noop by default)
Phase 116 ProcessO0WaitsAndSBs -O0 scoreboard insertion
──────────────────────────────
Phase 117 MercEncodeAndDecode ┐
Phase 118 MercExpandInstructions │ Six-stage Mercury core
Phase 119 MercGenerateWARs1 │
Phase 120 MercGenerateOpex │
Phase 121 MercGenerateWARs2 │
Phase 122 MercGenerateSassUCode ┘
sub_6F52F0 (23KB orchestrator, 18 params)
│
├─ [1] Decode: sub_6F2BF0 (59KB) — Encode Ori→Mercury binary, decode back
│ └─ sub_6D9690 (94KB master encoder switch)
│ ├─ sub_6D2750 — append operand word
│ ├─ sub_6D28C0 — commit encoded instruction
│ ├─ sub_6D9580 — encode literal values
│ └─ sub_931690 — create instruction record
│
├─ [2] Expansion: sub_C3CC60 (26KB) — Expand pseudo-instructions to SASS
│ ├─ sub_C37A10 (16KB) — expandInstruction (jump table dispatch)
│ ├─ sub_C39B40 (10KB) — expandMemoryOp
│ ├─ sub_C3A460 (6KB) — expandAtomicOp
│ ├─ sub_C3B560 (8KB) — expandTexture
│ ├─ sub_C3BCD0 (19KB) — expandControlFlow
│ └─ sub_C3E030 (18KB) — finalizeExpansion
│
├─ [3] WAR pass 1: sub_6FBC20 (7.4KB) — DEPBAR/scoreboard for pre-opex hazards
│ ├─ sub_6FA5B0 — detect WAR hazard per instruction
│ ├─ sub_6FA930 — insert scoreboard barrier (opcode 54)
│ ├─ sub_6FA7B0 — insert WAITDP (opcode 246)
│ └─ sub_6FAA90 — insert stall cycles
│
├─ [4] Opex: sub_6FFDC0 (66KB) — Generate scoreboards + latency waits
│ └─ sub_703480 (1.4KB entry) or sub_7032A0 (2.3KB MercOpex entry)
│
├─ [5] WAR pass 2: sub_6FBC20 — Same pass, re-run for opex-introduced hazards
│
└─ [6] SASS emit: sub_6E4110 (24KB) — Final SASS microcode generation
└─ sub_735290 — per-instruction encoding pipeline
├─ sub_733FA0 — encode instruction operands
├─ sub_734370 — encode immediates
├─ sub_734820 — encode predicates
├─ sub_734AD0 — encode memory operands
└─ sub_734D20 — encode complex operands (texture/surface/barrier)
Each stage logs its completion via trace infrastructure: "After Decode", "After Expansion", "After WAR post-expansion", "After Opex", "After WAR post-opexing".
Mercury vs SASS vs Capsule Mercury
The ptxas CLI (sub_703AB0) accepts --binary-kind with three values:
| Mode | CLI value | Default for | Description |
|---|---|---|---|
| Mercury | mercury | SM 75--99 | Traditional Mercury intermediate encoding |
| Capsule Mercury | capmerc | SM 100+ (Blackwell) | Mercury + embedded PTX source + relocation metadata |
| Raw SASS | sass | (explicit only) | Direct SASS binary output |
Additional CLI flags:
- --cap-merc -- force Capsule Mercury generation
- --self-check -- roundtrip verification: reconstitute SASS from capmerc, compare with original
- --out-sass -- dump reconstituted SASS from capmerc
Mercury mode is flagged at *(DWORD*)(context+385) == 2. The function sub_10ADF10 queries the target descriptor to determine whether Mercury encoding is active for the current architecture.
MercConverter -- Operand Reorganization for Encoding
| Phase | 141 (MercConverter) |
| Orchestrator | sub_9F3340 (7KB) |
| Post-conversion lowering | sub_9EF5E0 (27KB) |
| Opcode dispatch | sub_9ED2D0 (25KB, shared with phase 5) |
| Strings | "CONVERTING", "After MercConverter" |
Phase 141 runs the MercConverter infrastructure a second time, after the full optimization pipeline has completed. While phase 5 (ConvertUnsupportedOps) performs the initial PTX-to-SASS opcode conversion early in the pipeline, phase 141 re-invokes the same machinery to handle instructions that were introduced or modified by optimization passes (rematerialization, peephole, loop transformations) and may contain PTX-derived opcodes that were never legalized. After phase 141 completes, the "After MercConverter" diagnostic string appears, and every instruction in the IR carries a valid SASS opcode ready for Mercury encoding.
The orchestrator sub_9F3340 runs two steps sequentially:
1. Opcode conversion (sub_9F1A90, 35KB): the main MercConverter dispatch documented in ISel. Converts any remaining PTX-derived opcodes to SASS equivalents via the master switch in sub_9ED2D0. Gated by *(BYTE*)(*(context+8) + 1398) & 0x20.
2. Operand reorganization (sub_9EF5E0, 27KB): post-conversion lowering that restructures operand lists into a form the Mercury encoder can consume directly. Gated by *(BYTE*)(*(context+16) + 1048) != 0 AND *(context+104) != 0 (non-empty instruction BST).
Post-Conversion Lowering -- sub_9EF5E0 (27KB)
This function transforms the BST (binary search tree) of converted instructions produced by step 1 into encoding-ready conversion nodes. For each instruction record in the BST, it performs three operations:
1. Operand sort. Calls sub_9EC160, a linked-list merge sort (Floyd's slow/fast pointer midpoint, recursive split-and-merge) that sorts the operand chain by the operand index at entry+16. This establishes a canonical ordering required by the encoder.
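A self-contained sketch of that sort follows. The node type and names are invented; the real entries key on the operand index stored at entry+16, modeled here as a named field.

```c
/* Linked-list merge sort in the style of sub_9EC160: Floyd's slow/fast
 * pointer finds the midpoint, then recursive split-and-merge. */
#include <stdint.h>
#include <stddef.h>

typedef struct OpEntry {
    struct OpEntry *next;
    uint32_t index;        /* operand index (at entry+16 in the real node) */
} OpEntry;

static OpEntry *merge(OpEntry *a, OpEntry *b) {
    OpEntry head = {0}, *tail = &head;
    while (a && b) {
        OpEntry **min = (a->index <= b->index) ? &a : &b;
        tail->next = *min; tail = *min; *min = (*min)->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

OpEntry *sort_operands(OpEntry *list) {
    if (!list || !list->next) return list;
    OpEntry *slow = list, *fast = list->next;    /* Floyd midpoint */
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    OpEntry *right = slow->next;
    slow->next = NULL;                           /* split in half */
    return merge(sort_operands(list), sort_operands(right));
}

/* Demo: sort indices {96, 0, 64, 32}; returns 1 iff ascending. */
uint32_t demo_sorted(void) {
    OpEntry n[4] = {{&n[1], 96}, {&n[2], 0}, {&n[3], 64}, {NULL, 32}};
    uint32_t ok = 1;
    for (OpEntry *p = sort_operands(&n[0]); p && p->next; p = p->next)
        ok &= (p->index <= p->next->index);
    return ok;
}
```

Merge sort is the natural choice here: it is stable and needs no random access, matching a singly-linked operand chain.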
2. Contiguous/gap partitioning. Walks the sorted operand list and classifies each operand into one of two sublists:
// Simplified partitioning logic (lines 215-348 of decompilation)
for (op = first; op != sentinel && op->next != sentinel; op = op->next) {
    int cur_idx  = *(DWORD*)(op + 16);
    int next_idx = *(DWORD*)(op->next + 16);
    if (next_idx - cur_idx == 32) {
        // Consecutive register indices -> contiguous sublist
        append_to_contiguous_list(node, cur_idx);
    } else {
        // Non-consecutive -> gap sublist (stores both cur and next index)
        append_to_gap_list(node, cur_idx, next_idx);
    }
}
The stride of 32 reflects the operand index encoding: index = register_number * 32 + modifier_bits. Contiguous operands (stride-32 sequences like R0, R1, R2, R3) represent packed register groups -- common in wide loads (LDG.128), GMMA matrix operands, and multi-register moves. The encoder can represent these as a single register-range specifier. Gap operands break the stride and require individual encoding slots.
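The stride test is easy to sketch as standalone helpers. This is an illustration of the recovered index formula, not the decompiled code; `operand_index`, `is_contiguous`, and `longest_run` are names invented here:

```c
#include <assert.h>

// Sketch of the recovered operand-index arithmetic: index = reg * 32 + modifier.
static int operand_index(int reg_number, int modifier_bits) {
    return reg_number * 32 + modifier_bits;
}

// Two sorted operands belong to the same contiguous (register-range) run when
// their indices differ by exactly one register stride.
static int is_contiguous(int cur_idx, int next_idx) {
    return next_idx - cur_idx == 32;
}

// Length of the longest stride-32 run in a sorted index array -- the group a
// single register-range specifier could cover (e.g. R0..R3 for LDG.128).
static int longest_run(const int *idx, int n) {
    int best = (n > 0), cur = (n > 0);
    for (int i = 1; i < n; i++) {
        cur = is_contiguous(idx[i - 1], idx[i]) ? cur + 1 : 1;
        if (cur > best) best = cur;
    }
    return best;
}
```

A run of four (R0..R3 with identical modifier bits) is exactly what the encoder collapses into one range specifier; any break in the stride forces individual slots.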
3. Conversion node construction. Allocates a 168-byte conversion node per instruction, inserts it into a per-record BST sorted by (block_id, sub_block_id), and links the two operand sublists:
Conversion Node (168 bytes):
+0 8B BST left child
+8 8B BST right child
+16 8B BST parent
+24 4B block_id
+28 4B sub_block_id
+32 48B Contiguous operand doubly-linked list (6 pointers)
+80 4B Contiguous operand count
+88 8B Contiguous list ref-counted handle
+96 48B Gap operand doubly-linked list (6 pointers)
+144 4B Gap operand count
+152 8B Gap list ref-counted handle
+160 1B Flags
BST insertion calls sub_7C11F0 for red-black tree rebalancing. The record tracks min/max block IDs at record+32 and record+40 for range queries.
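The slow/fast split-and-merge shape attributed to sub_9EC160 can be sketched as follows. This is a reconstruction with illustrative names (`Op`, `msort`), keyed on the operand-index field that sits at entry+16 in the real structure:

```c
#include <assert.h>
#include <stddef.h>

// Operand chain node, keyed by the operand index.
typedef struct Op { int idx; struct Op *next; } Op;

static Op *merge(Op *a, Op *b) {
    Op head = { 0, NULL }, *tail = &head;
    while (a && b) {
        if (a->idx <= b->idx) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

// Recursive merge sort: find the midpoint with Floyd's slow/fast pointers,
// split the list there, sort both halves, merge.
static Op *msort(Op *list) {
    if (!list || !list->next) return list;
    Op *slow = list, *fast = list->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    Op *right = slow->next;
    slow->next = NULL;
    return merge(msort(list), msort(right));
}
```

Linked-list merge sort is the natural choice here: it is stable, needs no random access, and never reallocates the operand nodes it reorders.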
Encoding Validation and Fallback
After building the conversion node, the function attempts encoding:
// Lines 949-982 of decompilation
nullsub_644(*(a1+16), node, "CONVERTING"); // diagnostic trace
int result = sub_7BFC30(node); // encoding validation
if (result == -1) {
// Encoding failed: recursive fallback
sub_9CE210(a1, node);
// Continue with next instruction in BST
} else {
// Encoding succeeded: emit to output
*(node + 4) = result; // store encoding index
output_slot = vtable_alloc(*(a1+24), 120); // allocate output record
*(output_slot + 96) = node; // link conversion node
sub_9314F0(&scratch, *(a1+8), 0xF, 1, 1, // emit SASS instruction
&control_word); // control = 0x60000000
}
sub_7BFC30 validates the conversion node by traversing its operand tree and checking that the contiguous/gap partition can be represented in the target encoding format. It returns the encoding index on success, or -1 if the instruction's operand pattern cannot be encoded in the available formats.
On failure, sub_9CE210 (a recursive fallback) re-processes the instruction using a different encoding strategy -- typically splitting the operand group into smaller sub-groups that each fit the available encoding width. This handles edge cases like wide operations with mixed register classes.
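One plausible shape for that splitting strategy is greedy decomposition into the widths an encoding slot can hold. This is purely illustrative -- the actual grouping rules inside sub_9CE210 are not fully recovered, and the 4/2/1 widths are an assumption:

```c
#include <assert.h>

// Hypothetical greedy split of an n-register run into encodable group widths
// (4-, 2-, then 1-register groups). Returns the number of groups produced and
// writes each group size into out[].
static int split_groups(int n, int *out) {
    static const int widths[] = { 4, 2, 1 };
    int count = 0;
    for (int w = 0; w < 3; w++)
        while (n >= widths[w]) {
            out[count++] = widths[w];
            n -= widths[w];
        }
    return count;
}
```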
Relationship to Phase 5
Phase 5 and phase 141 share the same code (sub_9F3340 orchestrator, sub_9ED2D0 dispatch, sub_9EF5E0 post-conversion). The difference is context:
| Property | Phase 5 | Phase 141 |
|---|---|---|
| Pipeline position | Before optimization | After optimization, before Mercury encoding |
| Purpose | Convert PTX opcodes to SASS | Re-legalize instructions introduced by optimizer |
| Input | Raw Ori IR with PTX opcodes | Optimized Ori IR with possibly-illegal opcodes |
| Output | Optimizer-ready SASS-opcode IR | Encoding-ready IR for Mercury phase 142+ |
| Gate flag | *(BYTE*)(*(context+8) + 1398) & 0x20 | Same flag, re-checked |
Stage 1: MercEncodeAndDecode -- Roundtrip Verification
| Phase | 117 |
| Orchestrator | sub_6F52F0 (23KB, 18 parameters) |
| Decode worker | sub_6F2BF0 (59KB) |
| String | "After EncodeAndDecode" |
This phase encodes the Ori IR instruction stream into Mercury binary form, then immediately decodes it back and verifies that the decoded result matches the original. This is a self-consistency check that catches encoding bugs early -- if the roundtrip fails, the instruction cannot be correctly represented in Mercury format.
The orchestrator sub_6F52F0 passes the entire pipeline state (18 parameters) to sub_6F2BF0, which performs the actual encode-decode cycle using the master encoder sub_6D9690.
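The verification pattern itself is simple to illustrate. The following uses a toy 32-bit field layout standing in for the real 1280-bit Mercury word (field names are borrowed from the node layout for flavor; the packing is invented):

```c
#include <assert.h>
#include <stdint.h>

// Toy "instruction word" with three fields.
typedef struct { uint16_t opcode; uint8_t sub_key_1, sub_key_2; } ToyInstr;

static uint32_t toy_encode(const ToyInstr *i) {
    return (uint32_t)i->opcode | ((uint32_t)i->sub_key_1 << 16)
                               | ((uint32_t)i->sub_key_2 << 24);
}

static ToyInstr toy_decode(uint32_t w) {
    ToyInstr r = { (uint16_t)(w & 0xFFFF), (uint8_t)(w >> 16), (uint8_t)(w >> 24) };
    return r;
}

// Self-consistency check: fails iff some field cannot survive encode+decode,
// i.e. the instruction is not representable in the binary format.
static int roundtrip_ok(const ToyInstr *i) {
    ToyInstr d = toy_decode(toy_encode(i));
    return d.opcode == i->opcode && d.sub_key_1 == i->sub_key_1
                                 && d.sub_key_2 == i->sub_key_2;
}
```

A roundtrip failure in the real pipeline means either an encoder bug or an instruction whose operands do not fit any Mercury format -- exactly what this phase is designed to surface before final emission.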
Master Encoder -- sub_6D9690 (94KB)
The central SASS instruction encoding function and the single largest function in the ptxas backend. It contains a massive switch statement on the instruction type field (read from instruction+8) with cases covering every SASS instruction format.
// Simplified encoding flow
void EncodeInstruction(context, instruction) {
int type = *(int*)(instruction + 8);
uint64_t base = 0x2000000000LL; // encoding base constant
switch (type) {
case 61: // FFMA with literal operand
sub_6D9580(ctx, operand); // encode literal
break;
case 455: // complex multi-operand format
// ... bit-field extraction and assembly ...
break;
// ... hundreds of cases ...
}
// Common: append operand words, commit
sub_6D2750(ctx, word); // append 8-byte operand word
sub_6D28C0(ctx); // commit instruction record
}
Encoding details:
- Instructions are encoded as sequences of 8-byte words
- Operand word type prefix in bits [31:28]: `0x1` = register, `0x5` = immediate/constant, `0x6` = control/modifier, `0x7` = literal, `0x9` = special
- Control words carry the `0x60000000` prefix
- Architecture-specific bits are accumulated in a flags variable, with SM 100+ extensions via knob 4176
- `sub_7D6860` handles data type encoding (FP32/FP64/INT, etc.)
- `sub_C00BF0` provides opcode lookup from the encoding tables
- `sub_91D160` handles register operand encoding
Instruction Word Format
The Mercury instruction word is a 1280-bit (160-byte, 20-QWORD) structure located at offset +544 in the encoder object. All bit-field insertions use sub_7B9B80:
// sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers)
// Signature: (encoder_obj, bit_offset, bit_width, value)
void bitfield_insert(char *a1, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask = (bit_width >= 64) ? ~0ULL : (1ULL << bit_width) - 1;
    value &= mask;
    int qword_idx = bit_offset >> 6;
    int bit_pos = bit_offset & 63;
    uint64_t *words = (uint64_t *)(a1 + 544);   // 20-QWORD instruction word
    words[qword_idx] |= value << bit_pos;
    // Cross-QWORD boundary: spill the high bits into the next word (up to bit 1280)
    if (bit_pos + bit_width > 64 && qword_idx + 1 < 20)
        words[qword_idx + 1] |= value >> (64 - bit_pos);
}
Two companion helpers run before operand encoding:
- `sub_7B9D30` (38 bytes) -- clears the 16-entry constant buffer slot table at `a1+468` to `0xFF`
- `sub_7B9D60` (408 bytes) -- encodes reuse flags (1 bit) and predicate register index (5 bits) into the instruction word
Encoding Table Functions (530 functions)
The range 0xC66000--0xD27000 contains 530 functions that each initialize one row of the instruction format table. Every function calls sub_7B9B80 multiple times to describe the SASS bit layout for one instruction format variant:
// Example: sub_C6CF40 — one instruction format initializer
void init_format_XYZ(void *a1) {
sub_7B9B80(a1, 0, 4, 1); // bits[0:3] = opcode field = 1
sub_7B9B80(a1, 4, 3, 0); // bits[4:6] = format = 0
sub_7B9B80(a1, 8, 9, 0xC); // bits[8:16] = subopcode = 12
sub_7B9B80(a1, 0x11, 8, 0x13); // bits[17:24] = modifier = 19
sub_7B9B80(a1, 0x19, 7, 5); // bits[25:31] = unit = 5
}
Function sizes are remarkably uniform (1000--1600 bytes), reflecting mechanical code generation -- roughly 10 functions per ISA opcode group, covering all SASS formats for SM 100+.
Stage 2: MercExpandInstructions -- Pseudo-Instruction Expansion
| Phase | 118 |
| Entry | sub_C3CC60 (26KB, MercExpand::run) |
| Strings | "After MercExpand", "EXPANDING" |
Mercury uses abstract instruction forms that may map to multiple real SASS instructions. This phase expands every pseudo-instruction into its concrete SASS equivalent sequence. The expansion is type-dispatched:
| Handler | Size | Instruction class |
|---|---|---|
sub_C37A10 | 16KB | General instruction expansion (jump table with 4+ cases) |
def_C37B2E | 13KB | Complex expansion cases (default handler, creates new nodes) |
sub_C39B40 | 10KB | Memory operations (LDG, STG, LDS, etc.) |
sub_C3A460 | 6KB | Atomic operations |
sub_C3B560 | 8KB | Texture operations |
sub_C3BCD0 | 19KB | Control flow (branches, jumps, calls) |
sub_C3CC60 iterates over every instruction in the function, dispatching to the appropriate handler. Handlers create new instruction nodes, link them into the list, and delete the original pseudo-instruction. After all expansions, sub_C3E030 (18KB) performs finalization and cleanup.
The expansion engine also uses sub_719D00 (50KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate). The four nearly identical code blocks within that function correspond to template instantiations over operand width types.
Stage 3: WAR Hazard Resolution (Phases 119, 121)
| Phases | 119 (MercGenerateWARs1), 121 (MercGenerateWARs2) |
| Entry | sub_6FC220 / sub_6FC240 |
| Main pass | sub_6FBC20 (7.4KB) |
| String | "After MercWARs" |
| Knob | #16 (WAR generation control) |
Write-After-Read hazards occur when an instruction reads a register that a later instruction will overwrite -- the hardware pipeline can execute them out of order, causing the read to see the wrong value. The WAR pass inserts explicit DEPBAR (dependency barrier) instructions and scoreboard annotations to force correct ordering.
Two passes are needed: WAR1 runs after expansion but before opex, and WAR2 runs after opex. The second pass exists because opex itself introduces new instructions (scoreboard waits, synchronization barriers) that create additional WAR hazards not present in the pre-opex stream.
WAR Pass Algorithm -- sub_6FBC20
// Simplified WAR generation pass
void GenerateWARs(state) {
    // Guard conditions
    if (!(state->instr_flags & 1)) return;  // no WAR-sensitive instrs
    if (state->mode != 2) return;           // not Mercury mode
    // Per-instruction walk
    for (instr = state->first; instr != state->end; instr = instr->next) {
        // Detect hazard
        int severity = DetectWARHazard(state, instr);      // sub_6FA5B0
        if (severity >= 3) {
            InsertScoreboardBarrier(state, instr);         // sub_6FA930, opcode 54
            InsertWAITDP(state, instr);                    // sub_6FA7B0, opcode 246
            InsertWARStalls(state, instr, severity);       // sub_6FAA90
        }
    }
    PostWARAdjustment(state);   // sub_6FB850
    FinalizeWARPass(state);     // sub_6FB350
}
WAR Hazard Detection -- sub_6FA5B0 (2.5KB)
The detector classifies instructions by opcode:
- Always hazardous (opcodes 49, 248, 92): unconditionally increment the WAR counter
- Conditionally hazardous (opcode 75): partial hazard depending on operand configuration
- Special handling (opcodes 35, 246): store/scoreboard instructions with custom WAR rules
- Filtered out: `(opcode - 34) > 0x2C`, plus bitmask `0x100000400001` for irrelevant types
Architecture-specific hazard rules are dispatched through vtable methods at offsets +968, +1008, +528, and +504.
The detector maintains per-instruction state:
- `*(DWORD*)(state+2)` -- WAR counter (incremented per detected hazard)
- `*(DWORD*)(state+3)` -- severity level (3 = medium, 4 = high)
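The range-plus-bitmap filter is a classic decompiler idiom worth unpacking. Below is a sketch of the mechanics; the mask polarity -- whether a set bit marks a relevant or an irrelevant opcode -- is an assumption, as the decompilation does not settle it:

```c
#include <assert.h>
#include <stdint.h>

// Opcodes in the window [34, 34 + 0x2C] pass a range check (the unsigned
// subtraction rejects anything below 34 via wraparound), then a 64-bit
// literal mask selects per-opcode behavior.
static int in_opcode_window(unsigned opcode) {
    return (opcode - 34u) <= 0x2Cu;
}

static int mask_bit(unsigned opcode) {
    if (!in_opcode_window(opcode))
        return 0;
    return (int)((0x100000400001ULL >> (opcode - 34u)) & 1u);
}
```

The compiler emits this pattern whenever a switch over a dense opcode range collapses to "is this one of a few cases": one subtract, one compare, one shift.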
Inserted Instructions
DEPBAR / Scoreboard barrier (opcode 54) -- sub_6FA930:
- Created when `*(BYTE*)(instr+48) & 0x10` is set (barrier-needed flag)
- Barrier type extracted from bits 7:5 of the flag byte
- Encoding: `*(DWORD*)(new_instr+56) = 4` (barrier format)
- Control bits: `*(DWORD*)(new_instr+48) = (*(DWORD*)(new_instr+48) & 0xFFF83FFF) | 0x50000`
WAITDP (opcode 246) -- sub_6FA7B0:
- Skipped if a WAITDP already exists at the insertion point
- Operands configured with codes 102/467 and 301/1520
- Uses FNV-1a hash lookup for instruction deduplication
Stall cycles -- sub_6FAA90 (7.9KB):
- Computes required stall cycles from architecture-specific latency tables
- Vtable methods at +888, +896, +904 for stall calculation
- GPU family dispatch: `v8[14] == 9` triggers specific handling
- Adjusts stall count fields in the instruction control word
Stage 4: Opex -- Operation Expansion
| Phase | 120 (MercGenerateOpex) |
| Entry | sub_703480 (1.4KB, RunOpexPass) |
| MercOpex entry | sub_7032A0 (2.3KB, RunMercOpexPass) |
| Body | sub_6FFDC0 (66KB) |
| String | "After MercOpex" |
| Knobs | #17 (expansion options), #743 (reduce-reg), #747 (dynamic batch) |
"Operation expansion" is the most complex stage. It generates the dependency scoreboards, computes latency waits, and inserts synchronization barriers that the hardware needs to manage instruction-level parallelism. After opex, the instruction stream contains all the scheduling metadata required for correct execution.
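The scoreboard idea can be illustrated with a toy allocator. Six slots matches the dependence-barrier count on Volta-and-later SMs; everything else here (`SbState`, round-robin policy, the helper names) is illustrative, not recovered from the binary:

```c
#include <assert.h>

#define NUM_SB 6  // dependence-barrier count on Volta+ hardware

// Toy round-robin scoreboard allocator: each long-latency producer claims a
// slot, and a consumer waits on the OR of its producers' slot bits (the same
// shape as a DEPBAR wait mask).
typedef struct { int next; } SbState;

static int sb_alloc(SbState *s) {
    int slot = s->next;
    s->next = (s->next + 1) % NUM_SB;
    return slot;
}

static unsigned sb_wait_mask(const int *slots, int n) {
    unsigned mask = 0;
    for (int i = 0; i < n; i++)
        mask |= 1u << slots[i];
    return mask;
}
```

The real pass is far more involved -- it weighs latencies per functional unit and reuses slots as soon as their producers retire -- but every wait it emits reduces to a mask of this form.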
Entry Points
Two entry paths exist, both calling the same sub_6FFDC0 body:
sub_703480 (RunOpexPass, 1.4KB):
- Creates pipeline context via `sub_6FC280`
- Queries knob #17 to disable WAR penalty flags: `*(context->flags+52) &= ~0x10`
- Architecture check: `*(DWORD*)(context+52) == 20481` (SM 100a)
- For SM 100a: queries knob at offset 1296/1304 for loop unroll factor
- Sets Mercury mode: `*(DWORD*)(context+385) = 2`
- Calls `sub_6FFDC0` for actual expansion
sub_7032A0 (RunMercOpexPass, 2.3KB):
- Nearly identical to `sub_703480`
- Additionally calls `sub_10ADF10` to verify Mercury mode is active
- Allocates 40-byte + 24-byte records for Mercury-specific context
- Calls `sub_6FED20` to destroy the previous Mercury context before creating a new one
Opex Body -- sub_6FFDC0 (66KB)
This 66KB function with 200+ local variables performs:
- Instruction iteration -- walks the instruction list per basic block
- Latency computation -- determines execution latency for each instruction based on opcode, functional unit, and architecture
- Scoreboard allocation -- assigns dependency scoreboard entries to track producer-consumer relationships
- Wait insertion -- inserts `DEPBAR.WAIT` instructions where a consumer must wait for a producer to complete
- Stall count computation -- sets per-instruction stall counts in the scheduling control word
- Barrier generation -- inserts memory barriers and synchronization points
The function queries three knobs that control scheduling behavior:
- Knob #17: expansion options, WAR penalty flag control
- Knob #743: reduce-reg scheduling mode (minimize register pressure)
- Knob #747: dynamic batch scheduling mode
New instructions created by opex use sub_10B1F90 (instruction allocator) and sub_10AE590 (operand configuration).
Stage 5: SASS Microcode Emission
| Phase | 122 (MercGenerateSassUCode) |
| Entry | sub_6E4110 (24KB) |
The final stage converts the fully expanded, WAR-resolved, scoreboard-annotated Mercury stream into native SASS binary. This is the point of no return -- after this phase, the output is executable GPU machine code.
sub_6E4110 takes 8 parameters (context, instruction list, descriptors, format info, ...) and dispatches to the per-instruction encoding pipeline:
sub_6E4110 (24KB, final SASS emission)
├─ sub_735290 — per-instruction encoding pipeline
│ ├─ sub_733FA0 (5.1KB) — encode instruction operands
│ │ └─ sub_733870 (10KB) — source operand encoder
│ ├─ sub_734370 (6.1KB) — encode immediates
│ ├─ sub_734820 (4.1KB) — encode predicates
│ ├─ sub_734AD0 (3.3KB) — encode memory operands
│ └─ sub_734D20 (8.1KB) — encode complex operands (texture/surface/barrier)
├─ sub_726E00 (30.6KB) — instruction encoding with FNV-1a dedup cache
│ └─ sub_7266A0 (11.7KB) — hash table lookup (24-byte entries, separate chaining)
├─ sub_6E3F80 (2.2KB) — encode branch offsets
├─ sub_6E3560 (2.6KB) — finalize scheduling control words
└─ sub_712E70 (9.6KB) — handle relocations (cross-BB branch targets)
The encoding pipeline uses FNV-1a hashing (seed 0x811C9DC5, multiplier 16777619) to cache instruction encodings and avoid re-encoding identical instructions.
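The hash itself is the standard 32-bit FNV-1a, matching the recovered constants exactly. A minimal reconstruction follows; `bucket_for` is an illustrative helper, not a recovered function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Standard 32-bit FNV-1a (offset basis 0x811C9DC5, prime 16777619 = 0x01000193).
static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];       // xor-then-multiply is the "1a" ordering
        h *= 16777619u;
    }
    return h;
}

// Illustrative bucket selection for the dedup cache, which hashes the 4-byte
// sequence_id of each instruction node.
static uint32_t bucket_for(uint32_t sequence_id, uint32_t capacity) {
    return fnv1a((const uint8_t *)&sequence_id, 4) % capacity;
}
```

FNV-1a is a sensible choice for this cache: it is two instructions per byte, has good dispersion on short keys like a 4-byte ID, and needs no tables.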
Architecture-Specific Dispatch
Architecture selection reads *(int*)(config + 372) >> 12 to determine the SM generation. A vtable at *(context+416) with approximately 200 methods provides per-architecture behavior for encoding, latency tables, and hazard rules.
| SM generation | config+372 >> 12 | SM versions |
|---|---|---|
| Kepler | 3 | sm_30--sm_37 |
| Maxwell | 5 | sm_50--sm_53 |
| Pascal | 6 | sm_60--sm_62 |
| Volta/Turing | 7 | sm_70--sm_75 |
| Ampere | 8 | sm_80--sm_89 |
| Hopper | 9 | sm_90--sm_90a |
| Blackwell | (10+) | sm_100--sm_121 |
The encoder state initializer sub_6E8EB0 (64KB) sets architecture-specific flags and populates the opcode descriptor table (40+ entries mapping internal opcode IDs to encoding words). For SM 80 (0x5000) it sets bits 1 and 8; for SM 84 (0x5004) it sets bits 16 and 64.
Vtable dispatch helpers at 0xC65530--0xC656E0:
- `sub_C65530` -- 3-key dispatch (opcode, subop1, subop2), binary search through 24-byte table entries
- `sub_C65600` -- instruction-keyed dispatch, reads keys from `instr+12/14/15`
- `sub_C656E0` -- instruction-keyed dispatch with fallback to default handler `sub_9B3020`
Data Structures
Mercury Instruction Word
Offset Size Field
------ ------ --------------------------------------------------
+0 8B vtable pointer (encoder object)
+468 64B Constant buffer slot table (16 x DWORD, cleared to 0xFF)
+532 4B Constant buffer slot count
+544 160B Instruction word (1280 bits = 20 QWORDs)
— populated by sub_7B9B80 bitfield inserts
— max addressable bit: 1280
SASS Encoding Record (~264 bytes)
Output of sub_6D9690. Contains the encoded instruction words, operand data, and metadata. The encoding base constant is 0x2000000000LL.
Pipeline Context
Offset Size Field
------ ------ --------------------------------------------------
+52 4B Architecture ID (20481 = sm100a)
+236 1B Uses shared memory flag
+284 4B Function flags (bits 0, 3, 7 checked by WAR pass)
+385 4B Mercury mode flag (2 = Mercury/Capsule mode)
+416 8B Architecture vtable pointer (~200 virtual methods)
Scheduling Control Word (per SASS instruction)
Offset Size Field
------ ------ --------------------------------------------------
+48 4B Control bits (barrier flags at bits 18:14)
+56 4B Encoding format (4 = barrier format)
+144 4B Scheduling slot
+164 4B Resource class
+168 1B Stall bits
+236 4B Latency value
Mercury Instruction Node Layout
The Mercury pipeline (phases 117--122) operates on its own instruction representation, distinct from the 296-byte Ori IR instruction node documented in Instructions & Opcodes. The master encoder sub_6D9690 (phase 117) reads Ori IR nodes and produces Mercury instruction nodes; all subsequent phases -- expansion, WAR resolution, opex, and SASS emission -- operate exclusively on Mercury nodes.
Allocation
Mercury instruction nodes are allocated by sub_10AF8C0 (92 lines), which either recycles a node from a per-block free list or allocates exactly 160 bytes from the arena. The primary API wrappers are sub_10B1F90 and sub_10B1EE0, which call sub_10AF8C0 and perform additional bookkeeping (FNV-1a deduplication cache registration, scheduling state propagation).
Node Layout (160 bytes)
| Offset | Size | Type | Init value | Field | Description |
|---|---|---|---|---|---|
| +0 | 8 | ptr | 0 | next | Forward pointer in per-block doubly-linked list |
| +8 | 8 | ptr | 0 | prev | Backward pointer in per-block doubly-linked list |
| +16 | 8 | ptr | source loc | source_loc | Source location copied from context (slot 124) |
| +24 | 4 | u32 | 772 (0x304) | node_type | Constant type marker -- never modified after init |
| +28 | 2 | u16 | 0xFFFF | opcode | SASS opcode number (0xFFFF = sentinel / BB boundary) |
| +30 | 1 | u8 | 0xFF | sub_key_1 | Encoding sub-key 1 (format variant selector) |
| +31 | 1 | u8 | 0xFF | sub_key_2 | Encoding sub-key 2 (modifier selector) |
| +32 | 4 | u32 | counter | sequence_id | Monotonically increasing ID; FNV-1a dedup key |
| +36 | 4 | --- | --- | (padding) | Alignment to 8-byte boundary |
| +40 | 8 | ptr | ctx | context_ptr | Back-pointer to allocator / code-object base |
| +48 | 8 | u64 | 0 | encoded_data_0 | Encoded operand / property data |
| +56 | 8 | u64 | 0xFFFFFFFF | sentinel_56 | Sentinel / uninitialized marker |
| +64 | 8 | u64 | 0 | encoded_data_1 | Encoded operand / property data |
| +72 | 8 | u64 | 0 | encoded_data_2 | Encoded operand / property data |
| +80 | 8 | u64 | 0 | encoded_data_3 | Encoded operand / property data |
| +88 | 8 | i64 | -1 | sentinel_88 | Sentinel (end-of-data marker) |
| +96 | 8 | i64 | -1 | sentinel_96 | Sentinel |
| +104 | 8 | u64 | 0xFFFFFFFF | sentinel_104 | Sentinel |
| +112 | 8 | u64 | 0 | reserved_112 | Reserved (zeroed) |
| +120 | 8 | u64 | 0 | reserved_120 | Reserved (zeroed) |
| +128 | 8 | ptr | alloc'd | sched_ctrl_ptr | Pointer to 60-byte scheduling control record |
| +136 | 8 | ptr | ctx sched | sched_context | Context scheduling state (context slot 52) |
| +144 | 4 | u32 | 0xFFFFFFFF | sched_slot | Scheduling slot (sentinel = unscheduled) |
| +148 | 4 | u32 | 0 | node_flags | Node flags (bit 1 = BB boundary, bit 10 = 0x400) |
| +152 | 4 | u32 | 0xFFFFFFFF | block_seq | Basic-block sequence number |
The opcode field at +28 carries the Mercury/SASS opcode number. Known values include: 0xFFFF (sentinel, BB boundary marker), 54 (DEPBAR -- dependency barrier), 246 (WAITDP -- wait for dependency pipeline). All other values are SASS instruction opcodes.
Scheduling Control Record (60 bytes)
Each Mercury instruction node points (via +128) to a separately allocated 60-byte scheduling control record. This record carries barrier state, stall counts, and encoding format metadata that the WAR and opex passes read and modify.
| Offset | Size | Type | Init | Field | Description |
|---|---|---|---|---|---|
| +0 | 16 | xmm | SSE const | header | SSE-initialized from xmmword_2027620 |
| +16 | 16 | xmm | SSE const | latency | SSE-initialized from xmmword_202DC90 |
| +32 | 1 | u8 | 0 | flag_32 | General-purpose flag byte |
| +36 | 8 | i64 | -1 | barrier_state | Barrier tracking sentinel |
| +44 | 4 | u32 | 0 | stall_count | Stall cycle count |
| +48 | 4 | u32 | 0xEE (low byte) | control_bits | Scheduling control word; bits 17:13 = barrier type |
| +56 | 4 | u32 | 0 | encoding_format | Format discriminator (1 = basic, 4 = barrier, 15 = NOP stall) |
The control_bits field at sched+48 is the primary target of WAR pass modifications:
Bits 18:14 — barrier type (masked via 0xFFF83FFF then OR'd with type << 14)
Bit 4 — barrier-needed flag (in byte at sched+50)
Bits 7:5 — barrier sub-type (in byte at sched+50)
WAR insertion functions modify this field with specific patterns:
- `sub_6FA930` (InsertScoreboardBarrier): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x50000`; clears bit 4 of sched[50]; sets `sched[56] = 4`
- `sub_6FA430` (InsertNOP): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x44000`; clears bit 4 of sched[50]; sets `sched[56] = 1`
- `sub_6FAFD0` (InsertStall): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x3C000`; sets bit 4 of sched[50]; sets `sched[56] = 15`
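All three rewrites share one mask-and-OR shape over the barrier field (mask 0xFFF83FFF as recovered; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

// Clear the barrier field of the scheduling control word, then OR in the
// insertion pattern (0x50000 = barrier, 0x44000 = NOP, 0x3C000 = stall).
static uint32_t set_barrier_pattern(uint32_t ctrl, uint32_t pattern) {
    return (ctrl & 0xFFF83FFFu) | pattern;
}
```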
Linked-List Structure
Mercury nodes form a doubly-linked list per basic block, managed through the next (+0) and prev (+8) pointers:
head (ctx+40) tail (ctx+32)
| |
v v
[node_0] <--> [node_1] <--> ... <--> [node_N]
next=node_1 next=node_2 next=0
prev=0 prev=node_0 prev=node_{N-1}
New nodes are inserted before the reference node by sub_10AF8C0. The WAR pass (sub_6FBC20) iterates forward through the list; sub_6FB850 (PostWARAdjustment) iterates backward, skipping sentinel nodes (opcode == 0xFFFF).
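Insert-before on a doubly-linked list is mechanical but easy to get wrong in decompilation; a sketch (next at +0, prev at +8 in the real node; the struct and function names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Minimal Mercury-style list node.
typedef struct MercNode {
    struct MercNode *next, *prev;
    uint16_t opcode;                  // 0xFFFF = sentinel / BB boundary
} MercNode;

// Splice n into the list immediately before ref.
static void insert_before(MercNode *ref, MercNode *n) {
    n->next = ref;
    n->prev = ref->prev;
    if (ref->prev)
        ref->prev->next = n;
    ref->prev = n;
}
```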
FNV-1a Deduplication
The sequence_id at +32 serves as the FNV-1a hash key for the instruction deduplication cache. The hash is computed over the 4-byte ID using the standard FNV-1a parameters (seed 0x811C9DC5, multiplier 16777619). The cache resides at context+488 (hash table pointer) with capacity at context+496 and entry count at context+480. Each hash table entry is 24 bytes with separate chaining via pointer at entry+0, key at entry+8, and value (Mercury encoding record pointer) at entry+16.
Relationship to Ori IR Instruction Node
The Mercury node is distinct from the Ori IR instruction node:
| Property | Ori IR node | Mercury node |
|---|---|---|
| Size | 296 bytes | 160 bytes |
| Allocator | sub_7DD010 | sub_10AF8C0 |
| Opcode location | +72 (32-bit word) | +28 (16-bit) |
| Operand model | Packed array at +84 | Encoded data at +48..+120 |
| Scheduling | Pointer at +40 | Pointer at +128 (60-byte record) |
| List linkage | +0 / +8 (prev/next) | +0 / +8 (next/prev) |
| Pipeline phases | 1--116 | 117--122 |
Phase 117 (MercEncodeAndDecode) reads Ori IR nodes via the master encoder sub_6D9690 and produces Mercury nodes. All subsequent Mercury pipeline phases operate on Mercury nodes exclusively.
Configuration
| Knob | Purpose | Context |
|---|---|---|
| 16 | WAR generation control | Checked in sub_6FBC20 |
| 17 | Expansion/opex options; disables WAR penalty flags | sub_703480 entry |
| 595 | Scheduling enable check | Scheduling pre-check |
| 743 | Scheduling reduce-reg mode | sub_6FFDC0 opex body |
| 747 | Scheduling dynamic batch mode | sub_6FFDC0 opex body |
| 4176 | SM 100+ extension bits for encoding | sub_6D9690 encoder |
Diagnostic Strings
| String | Source | Trigger |
|---|---|---|
"After Decode" | sub_6F2BF0 | Decode stage completion |
"After Expansion" | sub_6F2BF0 | Expansion stage completion |
"After WAR post-expansion" | sub_6F2BF0 | WAR pass 1 completion |
"After Opex" | sub_6F2BF0 | Opex stage completion |
"After WAR post-opexing" | sub_6F2BF0 | WAR pass 2 completion |
"After MercWARs" | sub_6FC240 | WAR pass trace |
"After MercOpex" | sub_7032A0 | Opex pass trace |
"After MercExpand" | sub_C3DFC0 | Expansion pass trace |
"After MercConverter" | 0x9F3818 | MercConverter phase completion |
"CONVERTING" | sub_9EF5E0 | Active operand reorganization (per instruction) |
"After EncodeAndDecode" | 0x23D1A60 | Roundtrip verification |
"EXPANDING" | 0xC381B3 | Active instruction expansion |
"ENCODING" | 0x21C2880 | Active instruction encoding |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_9F1A90 | 35KB | MercConverter::ConvertInstruction (opcode dispatch, phase 5/141) | HIGH |
sub_9EF5E0 | 27KB | MercConverter::ReorganizeOperands (post-conversion lowering) | HIGH |
sub_9ED2D0 | 25KB | MercConverter::Dispatch (master opcode switch, & 0xCF mask) | HIGH |
sub_9F3340 | 7KB | MercConverter::Run (orchestrator, calls 9F1A90 then 9EF5E0) | HIGH |
sub_9EC160 | ~2KB | MergeSort (linked-list merge sort for operand chains) | HIGH |
sub_7BFC30 | ~4KB | MercConverter::ValidateEncoding (returns -1 on failure) | HIGH |
sub_9CE210 | ~6KB | MercConverter::FallbackConvert (recursive re-encoding) | MEDIUM |
sub_6D9690 | 94KB | MercuryEncode::EncodeInstruction (master switch) | HIGH |
sub_6FFDC0 | 66KB | MercuryPipeline::EmitInstructions (opex body) | HIGH |
sub_6E8EB0 | 64KB | BasicBlock::Initialize (encoder state init) | MEDIUM |
sub_6F2BF0 | 59KB | DecodePipeline::DecodeAndExpand | MEDIUM |
sub_719D00 | 50KB | ExpansionEngine::buildOutput | MEDIUM |
sub_726E00 | 30.6KB | Instruction encoding + FNV-1a dedup cache | HIGH |
sub_C3CC60 | 26KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
sub_6FC810 | 24KB | MercuryPipeline::Configure | MEDIUM |
sub_6E4110 | 24KB | MercGenerateSassUCode (final SASS emission) | HIGH |
sub_6F52F0 | 23KB | DecodePipeline::RunStages (orchestrator) | MEDIUM |
sub_C3BCD0 | 19KB | MercExpand::expandControlFlow | HIGH |
sub_6FF070 | 18KB | Predicate handling in expansion | MEDIUM |
sub_C3E030 | 18KB | MercExpand::finalizeExpansion | HIGH |
sub_C37A10 | 16KB | MercExpand::expandInstruction | HIGH |
sub_C38180 | 13KB | MercExpand::expandInstruction (complex cases) | HIGH |
sub_7266A0 | 11.7KB | FNV-1a hash table (instruction cache) | HIGH |
sub_733870 | 10KB | Source operand encoder | MEDIUM |
sub_C39B40 | 10KB | MercExpand::expandMemoryOp | HIGH |
sub_6FAA90 | 7.9KB | WAR stall insertion | HIGH |
sub_735290 | 7.6KB | Per-instruction SASS encoding pipeline | MEDIUM |
sub_6FBC20 | 7.4KB | WAR generation main pass | HIGH |
sub_C3B560 | 8KB | MercExpand::expandTexture | HIGH |
sub_734D20 | 8.1KB | Complex operand encoder (texture/surface/barrier) | MEDIUM |
sub_C3A460 | 6KB | MercExpand::expandAtomicOp | HIGH |
sub_734370 | 6.1KB | Immediate operand encoder | MEDIUM |
sub_733FA0 | 5.1KB | Instruction operand encoder | MEDIUM |
sub_734820 | 4.1KB | Predicate operand encoder | MEDIUM |
sub_734AD0 | 3.3KB | Memory operand encoder | MEDIUM |
sub_6FA5B0 | 2.5KB | WAR hazard detector | HIGH |
sub_7032A0 | 2.3KB | RunMercOpexPass (entry) | HIGH |
sub_6FC280 | 1.8KB | Create pipeline context | MEDIUM |
sub_6FA7B0 | 1.7KB | InsertWAITDP (opcode 246) | HIGH |
sub_703480 | 1.4KB | RunOpexPass (entry) | HIGH |
sub_6FA930 | 1.4KB | InsertScoreboardBarrier (opcode 54) | HIGH |
sub_10AF8C0 | ~0.5KB | MercNode::Allocate (160-byte node allocator, core initializer) | HIGH |
sub_10B1F90 | ~0.2KB | MercNode::Create (wrapper: allocate + dedup cache + sched state) | HIGH |
sub_10B1EE0 | ~0.2KB | MercNode::Clone (wrapper: allocate from clone source) | HIGH |
sub_10B14B0 | ~0.2KB | MercNode::CreateBBBoundary (creates sentinel pair, opcode 0xFFFF) | HIGH |
sub_6FAFD0 | ~1KB | InsertScoreboardStalls (allocate NOP stall nodes) | HIGH |
sub_6FA430 | ~0.5KB | InsertNOP (allocate NOP barrier nodes) | HIGH |
sub_7B9B80 | 216B | Bitfield insert primitive (18,347 callers) | CERTAIN |
sub_7B9D30 | 38B | Clear constant buffer slot table | HIGH |
sub_7B9D60 | 408B | Encode reuse flags + predicate | HIGH |
Cross-References
- ISel & Opcode Selection -- MercConverter opcode dispatch table (`sub_9ED2D0`), handler details
- Instructions & Opcodes -- Ori IR instruction node layout (296 bytes, input to Mercury encoder)
- Code Generation Overview -- high-level codegen pipeline context
- SASS Instruction Encoding -- detailed bit-level encoding format
- Capsule Mercury & Finalization -- capmerc variant and `--self-check`
- Scoreboards & Dependency Barriers -- DEPBAR/WAITDP semantics
- Phase Manager -- 159-phase pipeline infrastructure
- Optimization Levels -- `-O0` vs higher-level scoreboard behavior
- Knobs System -- knob #16, #17, #743, #747 details
Capsule Mercury & Finalization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Capsule Mercury ("capmerc") is a packaging format that wraps Mercury-encoded instruction streams with relocation metadata, debug information, and a snapshot of compilation knobs, enabling deferred finalization for a target SM that may differ from the original compilation target. Where standard Mercury produces a fully-resolved SASS binary bound to a single SM, capmerc produces an intermediate ELF object that a downstream tool (the driver or linker) can finalize into native SASS at load time. This is the default output format for all SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series). The capmerc data lives in .nv.capmerc<funcname> per-function ELF sections alongside 21 types of .nv.merc.* auxiliary sections carrying cloned debug data, memory-space metadata, and Mercury-specific relocations. Finalization can be "opportunistic" -- the same capmerc object may be finalized for different SMs within or across architectural families, controlled by --opportunistic-finalization-lvl.
| Output modes | mercury (SM 75--99 default), capmerc (SM 100+ default), sass (explicit only) |
| CLI parser | sub_703AB0 (10KB, ParsePtxasOptions) |
| Auto-enable | SM arch > 99 sets *(context + offset+81) = 1 |
| Mercury mode flag | *(DWORD*)(context+385) == 2 (shared with Mercury) |
| Capsule descriptor | 328-byte object, one per function (sub_1C9C300) |
| Merc section classifier | sub_1C98C60 (9KB, 15 .nv.merc.* names) |
| Master ELF emitter | sub_1C9F280 (97KB, orchestrates full CUBIN output) |
| Self-check verifier | sub_720F00 (64KB Flex lexer) + sub_729540 (35KB comparator) |
| Off-target checker | sub_60F290 (compatibility validation) |
| Kernel finalizer | sub_612DE0 (47KB, fastpath optimization) |
Output Mode Selection
The ptxas CLI (sub_703AB0) registers three binary-kind options plus related flags:
| Option | String literal | Purpose |
|---|---|---|
--binary-kind | "mercury,capmerc,sass" | Select output format |
--cap-merc | "Generate Capsule Mercury" | Force capmerc regardless of SM |
--self-check | "Self check for capsule mercury (capmerc)" | Roundtrip verification |
--out-sass | "Generate output of capmerc based reconstituted sass" | Dump reconstituted SASS |
--opportunistic-finalization-lvl | (in finalization logic) | Finalization aggressiveness |
When --binary-kind is not specified, the default is determined by SM version:
// Pseudocode from sub_703AB0 + auto-enable logic
if (sm_version > 99) {
*(context + offset + 81) = 1; // capmerc auto-enabled
binary_kind = CAPMERC;
} else if (sm_version >= 75) {
binary_kind = MERCURY;
} else {
binary_kind = SASS; // legacy direct encoding
}
The Mercury mode flag *(DWORD*)(context+385) == 2 is shared between Mercury and capmerc -- both use the identical Mercury encoder pipeline (phases 117--122). The capmerc distinction is purely at the ELF emission level: capmerc wraps the phase-122 SASS output in a capsule descriptor with relocation metadata instead of emitting it directly as a .text section.
Capsule Mercury ELF Structure
A capmerc-mode compilation produces a CUBIN ELF with two layers of content: standard CUBIN sections (.text.<func>, .nv.constant0, .nv.info.<func>, etc.) and a parallel set of .nv.merc.* sections carrying the metadata needed for deferred finalization.
CUBIN ELF (capmerc mode)
├── Standard sections
│ ├── .shstrtab, .strtab, .symtab, .symtab_shndx
│ ├── .text.<funcname> (SASS binary, possibly partial)
│ ├── .nv.constant0.<funcname> (constant bank data)
│ ├── .nv.shared.<funcname> (shared memory layout)
│ ├── .nv.info.<funcname> (EIATTR attributes)
│ ├── .note.nv.tkinfo, .note.nv.cuinfo
│ └── .nv.uft.entry (unified function table)
│
├── Per-function capsule descriptor
│ └── .nv.capmerc<funcname> (328-byte descriptor + payload)
│
└── Mercury auxiliary sections (21 types)
├── .nv.merc.debug_abbrev (DWARF abbreviation table)
├── .nv.merc.debug_aranges (DWARF address ranges)
├── .nv.merc.debug_frame (DWARF frame info)
├── .nv.merc.debug_info (DWARF info)
├── .nv.merc.debug_line (DWARF line table)
├── .nv.merc.debug_loc (DWARF locations)
├── .nv.merc.debug_macinfo (DWARF macro info)
├── .nv.merc.debug_pubnames (DWARF public names)
├── .nv.merc.debug_pubtypes (DWARF public types)
├── .nv.merc.debug_ranges (DWARF ranges)
├── .nv.merc.debug_str (DWARF string table)
├── .nv.merc.nv_debug_ptx_txt (embedded PTX source text)
├── .nv.merc.nv_debug_line_sass (SASS-level line table)
├── .nv.merc.nv_debug_info_reg_sass (register allocation info)
├── .nv.merc.nv_debug_info_reg_type (register type info)
├── .nv.merc.symtab_shndx (extended section index table)
├── .nv.merc.nv.shared.reserved (shared memory reservation)
├── .nv.merc.rela (Mercury relocations)
├── .nv.merc.rela<secname> (per-section relocation tables)
└── .nv.merc.<memory-space> (cloned constant/global/local/shared)
Capsule Descriptor -- sub_1C9C300
Each function produces a .nv.capmerc<funcname> section constructed by sub_1C9C300 (3,816 bytes in the binary; ~24KB decompiled). This function processes .nv.capmerc and .merc markers, embeds KNOBS data (compilation configuration snapshot), manages constant bank replication, and creates the per-function descriptor.
The descriptor is a 328-byte object containing:
- Mercury-encoded instruction stream for the function
- R_MERCURY_* relocation entries that must be patched during finalization
- KNOBS block -- a serialized snapshot of all knob values affecting code generation, optimization level, target parameters, and feature flags
- References to the .nv.merc.* auxiliary sections
- Function-level metadata: register counts, barrier counts, shared memory usage
The KNOBS embedding allows the finalizer to reproduce exact compilation settings without the original command-line arguments. This is critical for off-target finalization where the finalizer runs in a different context (e.g., the CUDA driver at application load time).
Capsule Descriptor Layout (328 bytes)
The descriptor is heap-allocated via sub_424070(allocator, 328) and zero-filled before field initialization. The constructor also creates a companion .merc<funcname> descriptor (same 328-byte layout) when merc section mirroring is active.
Capsule Descriptor (328 bytes = 0x148)
======================================
Group 1: Identity
┌─────────┬──────┬────────────────────────────────────────┐
0x000 │ WORD │ 2B │ desc_version │
0x002 │ WORD │ 2B │ instr_format_version │
0x004 │ DWORD │ 4B │ section_index │
0x008 │ DWORD │ 4B │ weak_symbol_index │
0x00C │ -- │ 4B │ (padding) │
└─────────┴──────┴────────────────────────────────────────┘
Group 2: SASS Data
┌─────────┬──────┬────────────────────────────────────────┐
0x010 │ QWORD │ 8B │ weak_symbol_desc │
0x018 │ QWORD │ 8B │ sass_data_offset │
0x020 │ DWORD │ 4B │ sass_data_size │
0x024 │ -- │ 4B │ (padding) │
0x028 │ QWORD │ 8B │ func_name_ptr │
└─────────┴──────┴────────────────────────────────────────┘
Group 3: Relocation Infrastructure
┌─────────┬──────┬────────────────────────────────────────┐
0x030 │ QWORD │ 8B │ rela_list_a (vector) │
0x038 │ QWORD │ 8B │ rela_list_b (vector) │
0x040 │ QWORD │ 8B │ reloc_symbol_list (vector) │
0x048 │ QWORD │ 8B │ aux_rela_list (vector) │
0x050 │ QWORD │ 8B │ debug_rela_list (vector) │
0x058 │ QWORD │ 8B │ text_section_offset │
0x060 │ QWORD │ 8B │ reloc_index_set (sorted container) │
0x068 │ QWORD │ 8B │ per_reloc_data_set (sorted container) │
0x070 │ BYTE │ 1B │ sampling_mode │
0x071 │ -- │ 7B │ (padding) │
0x078 │ QWORD │ 8B │ reloc_payload_map (sorted container) │
0x080 │ --    │ 32B  │ (reserved, not written by constructor) │
└─────────┴──────┴────────────────────────────────────────┘
Group 4: Function Metadata
┌─────────┬──────┬────────────────────────────────────────┐
0x0A0 │ QWORD │ 8B │ section_flags │
0x0A8 │ DWORD │ 4B │ max_register_count │
0x0AC │ DWORD │ 4B │ extra_section_index │
0x0B0 │ BYTE │ 1B │ has_global_refs │
0x0B1 │ BYTE │ 1B │ has_shared_refs │
0x0B2 │ BYTE │ 1B │ has_exit │
0x0B3 │ BYTE │ 1B │ has_crs │
0x0B4 │ BYTE │ 1B │ uses_atomics │
0x0B5 │ BYTE │ 1B │ uses_shared_atomics │
0x0B6 │ BYTE │ 1B │ uses_global_atomics │
0x0B7 │ BYTE │ 1B │ has_texture_refs │
0x0B8 │ -- │ 24B │ (padding) │
└─────────┴──────┴────────────────────────────────────────┘
Group 5: Code Generation Parameters
┌─────────┬──────┬────────────────────────────────────────┐
0x0D0 │ QWORD │ 8B │ knobs_section_desc_ptr → 64B sub-obj │
│ │ │ +0x00 DWORD: knobs_section_index │
│ │ │ +0x08 QWORD: knobs_section_offset │
│ │ │ +0x10 DWORD: knobs_section_size │
│ │ │ +0x18 QWORD: knobs_section_name_ptr │
0x0D8 │ DWORD │ 4B │ stack_frame_size │
0x0DC │ -- │ 4B │ (padding) │
0x0E0 │ DWORD │ 4B │ register_count │
0x0E4 │ -- │ 4B │ (padding) │
0x0E8 │ DWORD │ 4B │ barrier_info_size │
0x0EC │ -- │ 4B │ (padding) │
0x0F0 │ QWORD │ 8B │ barrier_info_data_ptr │
0x0F8 │ -- │ 8B │ (reserved) │
└─────────┴──────┴────────────────────────────────────────┘
Group 6: Constant Bank & Section Info
┌─────────┬──────┬────────────────────────────────────────┐
0x100 │ QWORD │ 8B │ const_bank_offset │
0x108 │ DWORD │ 4B │ const_bank_size │
0x10C │ -- │ 4B │ (padding) │
0x110 │ QWORD │ 8B │ section_name_ptr (".nv.capmerc<func>") │
0x118 │ QWORD │ 8B │ section_alignment (default 16) │
0x120 │ DWORD │ 4B │ const_bank_section_index │
0x124 │ -- │ 4B │ (padding) │
0x128 │ DWORD │ 4B │ text_section_index │
0x12C │ DWORD │ 4B │ text_rela_section_index │
└─────────┴──────┴────────────────────────────────────────┘
Group 7: KNOBS Embedding
┌─────────┬──────┬────────────────────────────────────────┐
0x130 │ QWORD │ 8B │ kv_pair_list (vector) │
0x138 │ QWORD │ 8B │ knobs_pair_list (vector) │
0x140 │ WORD │ 2B │ min_sm_version (default 256 = sentinel) │
0x142 │ BYTE │ 1B │ has_crs_depth │
0x143 │ -- │ 5B │ (padding to 0x148) │
└─────────┴──────┴────────────────────────────────────────┘
Key design observations:
Flag byte block (+0x0B0 to +0x0B7). Eight single-byte flags capture function characteristics that determine which R_MERCURY_* relocation patches the finalizer must apply. The flags are set by type-2, type-3, and type-4 markers in the capmerc stream. Each flag is a boolean (0 or 1), never a bitfield.
KNOBS indirection (+0x0D0). The KNOBS data does not live inline in the descriptor. Instead, +0x0D0 points to a separately allocated 64-byte sub-object carrying the ELF coordinates (section index, file offset, size, and name pointer) of the KNOBS section. This allows the KNOBS data to reside in a dedicated ELF section while the descriptor references it by position. The KNOBS pair list at +0x138 and the generic key-value list at +0x130 store the parsed key-value pairs from marker type 90 data blocks; the "KNOBS" string literal serves as the discriminator between the two lists.
Dual-descriptor pattern. When the merc section mirror is active, the constructor allocates a second 328-byte object for the .merc<funcname> companion section. This companion receives a copy of the SASS data (not a pointer -- an actual memcpy of sass_data_size bytes), the function name with a .merc prefix, and the section flags from the original ELF section header at +0x0A0. The companion's weak_symbol_index (+0x008) is always zero.
Relocation containers. The three sorted containers at +0x060, +0x068, and +0x078 (created via sub_425CA0 with comparator pair sub_427750/sub_427760 and element size 0x20 = 32 bytes) form a three-level relocation index. The reloc_index_set stores symbol indices that appear in relocations. The per_reloc_data_set stores per-symbol relocation metadata. The reloc_payload_map associates symbol indices with the actual payload data that the finalizer patches into instruction bytes. These are populated by marker sub-types 10, 23, 25, 28, 40, 46, 49, 52, 57, 64, 68, 70, 71, 85, and 87.
min_sm_version sentinel. The default value 256 (0x100) at +0x140 acts as a sentinel meaning "no minimum SM constraint." When a target profile is available at construction time, the profile's SM version overwrites this field. Marker sub-type 95 can further override it when CRS depth information constrains the minimum SM.
Capmerc Marker Stream Format
The constructor parses a compact binary marker stream embedded in the capmerc section data. Each marker begins with a type byte followed by a sub-type byte:
| Type | Size | Format | Description |
|---|---|---|---|
| 2 | 4 bytes fixed | [02] [sub] [00] [00] | Boolean flag markers |
| 3 | 4 bytes fixed | [03] [sub] [WORD payload] | Short value markers |
| 4 | variable | [04] [sub] [WORD size] [payload...] | Variable-length data markers |
Selected marker sub-types and the descriptor fields they populate:
| Sub | Type | Descriptor Field | Purpose |
|---|---|---|---|
| 10 | 4 | +0x0B0 has_global_refs | Function accesses global memory |
| 21 | 2 | +0x0B2 has_exit | Function contains EXIT instruction |
| 22 | 2 | +0x0B3 has_crs | Function uses call return stack |
| 23 | 4 | +0x0A8 max_register_count | Register pressure (with max tracking) |
| 25 | 4 | +0x0B1 has_shared_refs | Function accesses shared memory |
| 27 | 3 | +0x000 desc_version | Descriptor format version stamp |
| 47 | 4 | +0x002 instr_format_version | Instruction encoding format version |
| 50 | 4 | +0x0E0 register_count | Allocated register count |
| 54 | 4 | +0x0B7 has_texture_refs | Function uses texture/sampler units |
| 67 | 4 | +0x0E8, +0x0F0 barrier_info | Barrier count and data |
| 72 | 4 | +0x0D0 knobs_section_desc_ptr | KNOBS section ELF binding |
| 73 | 3 | +0x0D8 stack_frame_size | Per-thread stack frame bytes |
| 74 | 2 | +0x070 sampling_mode | Interpolation/sampling mode |
| 88 | 3 | +0x0B4/B5/B6 | Atomic usage (plain/shared/global) |
| 90 | 4 | +0x138 knobs_pair_list | KNOBS key-value data block |
| 95 | 3 | +0x140, +0x142 | Min SM version + CRS depth flag |
.nv.merc.* Section Builder Pipeline
Four functions cooperate to construct the .nv.merc.* section namespace:
sub_1C9F280 (97KB, Master ELF emitter)
│
├─ sub_1C9B110 (23KB) ── Mercury capsule builder
│ Creates .nv.merc namespace, reads symtab entry count,
│ allocates mapping arrays, duplicates sections into merc space
│
├─ sub_1CA2E40 (18KB) ── Mercury section cloner
│ Iterates all sections, clones constant/global/shared/local
│ into .nv.merc.* namespace, creates .nv.merc.rela sections,
│ handles .nv.global.init and .nv.shared.reserved
│
├─ sub_1C9C300 (24KB) ── Capsule descriptor processor
│ Processes .nv.capmerc and .merc markers, embeds KNOBS,
│ handles constant bank replication and rela duplication
│
├─ sub_1CA3A90 (45KB) ── Section merger
│ Merge/combine pass for sections with both merc and non-merc
│ copies; processes .nv.constant bank sections, handles section
│ linking and rela association
│
└─ sub_1C99BB0 (25KB) ── Section index remap
Reindexes sections after dead elimination, handles
.symtab_shndx / .nv.merc.symtab_shndx mapping
The section classifiers sub_1C9D1F0 (16KB) and sub_1C98C60 (9KB) map section names to internal type IDs. The former handles both SASS and merc debug section variants; the latter is merc-specific and recognizes all 15 .nv.merc.debug_* names.
R_MERCURY_* Relocation Types
Capsule Mercury defines its own relocation type namespace for references within .nv.merc.rela sections and the capsule descriptor. These are distinct from standard CUDA ELF relocations (R_NV_32, etc.) and are processed during finalization rather than at link time.
| Type | Description |
|---|---|
R_MERCURY_ABS64 | 64-bit absolute address |
R_MERCURY_ABS32 | 32-bit absolute address |
R_MERCURY_ABS16 | 16-bit absolute address |
R_MERCURY_PROG_REL | PC-relative reference |
R_MERCURY_8_0 | Sub-byte patch: bits [7:0] of target word |
R_MERCURY_8_8 | Sub-byte patch: bits [15:8] |
R_MERCURY_8_16 | Sub-byte patch: bits [23:16] |
R_MERCURY_8_24 | Sub-byte patch: bits [31:24] |
R_MERCURY_8_32 | Sub-byte patch: bits [39:32] |
R_MERCURY_8_40 | Sub-byte patch: bits [47:40] |
R_MERCURY_8_48 | Sub-byte patch: bits [55:48] |
R_MERCURY_8_56 | Sub-byte patch: bits [63:56] |
R_MERCURY_FUNC_DESC | Function descriptor reference |
R_MERCURY_UNIFIED | Unified address space reference |
R_MERCURY_TEX_HEADER_INDEX | Texture header table index |
R_MERCURY_SAMP_HEADER_INDEX | Sampler header table index |
R_MERCURY_SURF_HEADER_INDEX | Surface header table index |
Sub-Byte Relocation Design
The eight R_MERCURY_8_* types enable patching individual bytes within a 64-bit instruction word. Mercury instruction encodings pack multiple fields into single 8-byte QWORDs (the 1280-bit instruction buffer at a1+544 is organized as 20 QWORDs). During finalization for a different SM, only certain bit-fields within an instruction word may need updating -- for example, the opcode variant bits or register class encoding -- while neighboring fields remain unchanged. The sub-byte types let the finalizer patch exactly one byte at a specific position within the word without a read-modify-write cycle on the entire QWORD.
Relocation Resolution
The master relocation resolver sub_1CD48C0 (22KB) handles both standard and capmerc relocations. For R_MERCURY_UNIFIED, it converts internal relocation type 103 to type 1 (standard absolute). The resolver iterates relocation entries and handles: alias redirections, dead-function relocation skipping, __UFT_OFFSET / __UDT_OFFSET pseudo-relocations, PC-relative branch validation, NVRS (register spill) relocations, and YIELD-to-NOP conversion for forward progress guarantees.
Mercury Section Binary Layouts
Section Classifier Algorithm -- sub_1C98C60
The 9KB classifier uses a two-stage guard-then-waterfall pattern to identify .nv.merc.* sections from their ELF section headers.
Stage 1: sh_type range check (fast rejection). The section's sh_type is tested against two NVIDIA processor-specific ranges:
| Range | sh_type span | Decimal | Qualifying types |
|---|---|---|---|
| A | 0x70000006..0x70000014 | SHT_LOPROC+6..+20 | Filtered by bitmask 0x5D05 |
| B | 0x70000064..0x7000007E | SHT_LOPROC+100..+126 | All accepted (memory-space data) |
| Special | 1 | SHT_PROGBITS | Accepted (generic debug data) |
Within Range A, the bitmask 0x5D05 (binary 0101_1101_0000_0101) selects seven specific types:
| Bit | sh_type | Hex | Section types |
|---|---|---|---|
| 0 | SHT_LOPROC+6 | 0x70000006 | Memory-space clones |
| 2 | SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| 8 | SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| 10 | SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| 11 | SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| 12 | SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| 14 | SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_* |
Stage 2: Name-based disambiguation (expensive path). When sh_flags bit 28 (0x10000000, SHF_NV_MERC) is set, the classifier calls sub_1CB9E50() to retrieve the section name and performs sequential strcmp() against 15 names, returning 1 on the first match. The check order matches the declaration order in the ELF structure table above. For .nv.merc.nv_debug_ptx_txt, sub_4279D0 performs a prefix match rather than an exact match.
SHF_NV_MERC Flag (0x10000000)
Bit 28 of sh_flags is an NVIDIA extension: SHF_NV_MERC. All .nv.merc.* sections carry this flag. It serves two purposes:
- Fast filtering -- the classifier checks this bit before string comparisons, giving O(1) rejection for the common case of non-merc sections.
- Namespace separation -- during section index remapping (sub_1C99BB0), sections with SHF_NV_MERC are remapped into a separate merc section index space. The finalizer uses this flag to identify which sections require relocation patching during off-target finalization.
.nv.capmerc<funcname> -- Capsule Data Layout
The per-function capsule section contains the full marker stream, SASS data, KNOBS block, and optionally replicated constant bank data. The section is created by sub_1C9C300.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 1 (SHT_PROGBITS) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 16 |
Section data is organized as four consecutive regions:
.nv.capmerc<funcname> Section Data
====================================
┌──────────────────────────────────────────────────────┐
│ Marker Stream (variable length) │
│ Repeating TLV records: │
│ [type:1B] [sub:1B] [payload:varies] │
│ │
│ Type 2: 4 bytes total [02] [sub] [00 00] │
│ Boolean flags (has_exit, has_crs, sampling_mode) │
│ │
│ Type 3: 4 bytes total [03] [sub] [WORD:value] │
│ Short values (desc_version, stack_frame_size, │
│ atomic flags, min_sm_version) │
│ │
│ Type 4: variable [04] [sub] [WORD:size] .. │
│ Variable-length blocks (register counts, KNOBS │
│ data, barrier info, relocation payloads) │
│ │
│ Terminal marker: sub-type 95 (min_sm + CRS depth) │
├──────────────────────────────────────────────────────┤
│ SASS Data Block (sass_data_size bytes) │
│ Mercury-encoded instruction bytes identical to │
│ what .text.<func> would contain for the compile │
│ target; byte-for-byte match with phase 122 output │
├──────────────────────────────────────────────────────┤
│ KNOBS Block (knobs_section_size bytes) │
│ Serialized key-value pairs from marker sub-type 90 │
│ "KNOBS" tag separates knob pairs from generic KV │
│ Contains: optimization level, target parameters, │
│ feature flags, all codegen-affecting knob values │
├──────────────────────────────────────────────────────┤
│ Constant Bank Data (const_bank_size bytes, optional) │
│ Replicated .nv.constant0 data for deferred binding │
│ Only present when the function references constant │
│ bank data that the finalizer may need to patch │
└──────────────────────────────────────────────────────┘
.nv.merc.debug_info -- Cloned DWARF Debug Info
The cloner (sub_1CA2E40) produces a byte-for-byte copy of the source .debug_info section, placed into the merc namespace with modified ELF section header properties.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 0x70000011 (SHT_LOPROC + 17) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 1 |
Section data is standard DWARF .debug_info format:
.nv.merc.debug_info Section Data
==================================
┌──────────────────────────────────────────────────────┐
│ Compilation Unit Header │
│ +0x00 unit_length : 4B (DWARF-32) or 12B (-64)│
│ +0x04 version : 2B (typically DWARF 4) │
│ +0x06 debug_abbrev_offset : 4B → .nv.merc.debug_abbrev │
│ +0x0A address_size : 1B (8 for 64-bit GPU) │
├──────────────────────────────────────────────────────┤
│ DIE Tree (Debug Information Entries) │
│ Sequence of entries, each: │
│ abbrev_code : ULEB128 │
│ attributes : per abbreviation definition │
│ │
│ Cross-section references (via relocations): │
│ DW_FORM_strp → .nv.merc.debug_str │
│ DW_FORM_ref_addr → .nv.merc.debug_info │
│ DW_FORM_sec_offset → .nv.merc.debug_line etc. │
└──────────────────────────────────────────────────────┘
The critical difference from standard .debug_info: all cross-section offset references point to other .nv.merc.* sections, not the original .debug_* sections. The .nv.merc.rela.debug_info relocation table handles rebinding these offsets during finalization.
.nv.merc.rela / .nv.merc.rela<secname> -- Mercury Relocations
Mercury relocation sections use standard Elf64_Rela on-disk format (24 bytes per entry) but encode Mercury-specific relocation types with a 0x10000 offset in the type field.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 4 (SHT_RELA) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 8 |
| sh_entsize | 24 |
| sh_link | symtab section index |
| sh_info | target section index |
Section names are constructed by sub_1C980F0 as ".nv.merc.rela" + suffix (e.g., ".nv.merc.rela.debug_info").
On-disk entry layout (standard Elf64_Rela, 24 bytes):
.nv.merc.rela Entry (24 bytes on disk)
========================================
┌─────────┬──────┬────────────────────────────────────────────┐
0x00 │ QWORD │ 8B │ r_offset — byte position in target section │
0x08 │ DWORD │ 4B │ r_type — relocation type │
│ │ │ Standard: 1=R_NV_ABS64, etc. │
│ │ │ Mercury: r_type > 0x10000 │
│ │ │ Decoded: r_type - 0x10000 → R_MERCURY_* │
0x0C │ DWORD │ 4B │ r_sym — symbol table index │
0x10 │ QWORD │ 8B │ r_addend — signed addend value │
└─────────┴──────┴────────────────────────────────────────────┘
During resolution (sub_1CD48C0), the 24-byte on-disk entries are loaded into a 32-byte in-memory representation that adds two section index fields:
In-Memory Relocation Entry (32 bytes)
=======================================
┌─────────┬──────┬────────────────────────────────────────────┐
0x00 │ QWORD │ 8B │ r_offset — byte position in target section │
0x08 │ DWORD │ 4B │ r_type — relocation type │
0x0C │ DWORD │ 4B │ r_sym — symbol table index │
0x10 │ QWORD │ 8B │ r_addend — signed addend value │
0x18 │ DWORD │ 4B │ r_sec_idx — target section index │
0x1C │ DWORD │ 4B │ r_addend_sec — addend section index │
└─────────┴──────┴────────────────────────────────────────────┘
The extra 8 bytes enable cross-section targeting: r_sec_idx identifies which section r_offset is relative to, and r_addend_sec identifies the section contributing the addend base address. When r_addend_sec != 0, the resolver adds that section's load address to r_offset before patching.
The resolver detects Mercury relocation types via r_type > 0x10000, subtracts 0x10000, then dispatches through a Mercury-specific handler table (off_2407B60) rather than the standard CUDA relocation table (off_2408B60).
Complete sh_type Map
| sh_type | Hex | Section types |
|---|---|---|
| 1 | 0x00000001 | .nv.capmerc<func>, .nv.merc.debug_abbrev (PROGBITS variant), .nv.merc.debug_str, .nv.merc.nv_debug_ptx_txt |
| 4 | 0x00000004 | .nv.merc.rela* (SHT_RELA) |
| SHT_LOPROC+6 | 0x70000006 | .nv.merc.<memory-space> clones |
| SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_sass, .nv.merc.nv_debug_info_reg_type |
| SHT_LOPROC+100..+126 | 0x70000064..0x7000007E | Memory-space variant sections (constant banks, shared, local, global) |
The .nv.merc.* debug sections reuse the same sh_type values as their non-merc counterparts (.debug_info uses 0x70000011 in both namespaces). The SHF_NV_MERC flag (0x10000000) in sh_flags is the distinguishing marker.
Self-Check Mechanism
The --self-check flag activates a roundtrip verification that validates the capmerc encoding by reconstituting SASS from the capsule data and comparing it against the original:
Phase 122 output (SASS) ──────────────────────────> reference SASS
│
└─ capmerc packaging ─> .nv.capmerc<func>
│
└─ reconstitute ─> reconstituted SASS
│
section-by-section compare
│
pass / fail (error 17/18/19)
The reconstitution pipeline uses sub_720F00 (64KB), a Flex-generated SASS text lexer with thread-safety support (pthread_mutexattr_t), to parse the reconstituted instruction stream. sub_729540 (35KB) performs the actual section-by-section comparison.
Three error codes signal specific self-check failures:
| Error code | Meaning |
|---|---|
| 17 | Section content mismatch (instruction bytes differ) |
| 18 | Section count mismatch (missing or extra sections) |
| 19 | Section metadata mismatch (size, alignment, or flags differ) |
These error codes trigger longjmp-based error recovery in the master ELF emitter (sub_1C9F280), which uses _setjmp at its entry point for non-local error handling.
The --out-sass flag causes ptxas to dump the reconstituted SASS to a file, useful for debugging self-check failures by manual comparison with the original SASS output.
Opportunistic Finalization
The --opportunistic-finalization-lvl flag controls how aggressively capmerc binaries may be finalized for a target SM different from the compilation target:
| Level | Name | Behavior |
|---|---|---|
| 0 | default | Standard finalization for the compile target only |
| 1 | none | No finalization; output stays as capmerc (deferred to driver) |
| 2 | intra-family | Finalize for any SM within the same architectural family |
| 3 | intra+inter | Finalize across SM families |
Level 2 allows a capmerc binary compiled for sm_100 (datacenter Blackwell) to be finalized for sm_103 (Blackwell Ultra / GB300) without recompilation. Level 3 extends this across families -- for example, sm_100 capmerc finalized for sm_120 (consumer RTX 50-series).
The key constraint is instruction encoding compatibility: the sub-byte R_MERCURY_8_* relocations can patch SM-specific encoding bits, but the overall instruction format and register file layout must be compatible between source and target.
Off-Target Finalization
Off-target finalization is the process of converting a capmerc binary compiled for SM X into native SASS for SM Y. The compatibility checker sub_60F290 determines whether the source/target pair is compatible, examining:
- SM version pair and generation compatibility
- Feature flag differences between source and target
- Instruction set compatibility (no target-only instructions used)
- Constant bank layout compatibility
- Register file layout match
When the check passes, the kernel finalizer sub_612DE0 (47KB) applies the "fastpath optimization" -- it directly patches the Mercury-encoded instruction stream using R_MERCURY_* relocations rather than running the full compilation pipeline. On success, ptxas emits the diagnostic:
"applied for off-target %u -> %u finalization"
where the two %u values are the source and target SM numbers.
The fastpath avoids re-running phases 117--122 of the Mercury pipeline. Instead, it:
- Reads the capsule descriptor from .nv.capmerc<func>
- Validates compatibility via sub_60F290
- Applies R_MERCURY_* relocation patches for the target SM
- Regenerates the ELF .text section with patched instruction bytes
- Updates .nv.info EIATTR attributes for the target (register counts, barrier counts)
This is substantially faster than full recompilation, which is why ptxas logs it as a "fastpath."
Pipeline Integration
Capmerc does not modify the Mercury encoder pipeline (phases 113--122). The instruction encoding, pseudo-instruction expansion, WAR hazard resolution, operation expansion (opex), and SASS microcode emission all execute identically regardless of output mode. The divergence happens after phase 122 completes:
| Mode | Post-Pipeline Behavior |
|---|---|
| Mercury | Phase 122 SASS output written directly to .text.<func> ELF section |
| Capmerc | Phase 122 output wrapped in 328-byte capsule descriptor; .nv.merc.* sections cloned; R_MERCURY_* relocations emitted; KNOBS data embedded |
| SASS | Phase 122 output written as raw SASS binary (no ELF wrapper) |
The master ELF emitter sub_1C9F280 (97KB) orchestrates the post-pipeline divergence:
// Simplified from sub_1C9F280
void EmitELF(context) {
// Common: copy ELF header (64 bytes via SSE loadu)
memcpy(output, &elf_header, 64);
// Common: iterate sections, build section headers
for (int i = 0; i < section_count; i++) {
if (section[i].flags & 4) continue; // skip virtual sections
// ... copy section data, patch headers ...
}
if (is_capmerc_mode) {
sub_1C9B110(ctx); // create .nv.merc namespace
sub_1CA2E40(ctx); // clone sections into merc space
sub_1C9C300(ctx); // build capsule descriptors + KNOBS
sub_1CA3A90(ctx); // merge merc/non-merc section copies
}
// Common: remap section indices, build symbol table
sub_1C99BB0(ctx); // section index remap
sub_1CB68D0(ctx); // build .symtab
// Common: resolve relocations
sub_1CD48C0(ctx); // relocation resolver (handles R_MERCURY_*)
// Common: finalize and write
sub_1CD13A0(ctx); // serialize to file
}
Function Map
| Address | Size | Identity |
|---|---|---|
sub_1C9F280 | 97KB | Master ELF emitter (orchestrates full CUBIN output) |
sub_1CA3A90 | 45KB | Section merger / combined section emitter |
sub_1CB68D0 | 49KB | Symbol table builder (handles merc section references) |
sub_1C99BB0 | 25KB | Section index remap (.symtab_shndx / .nv.merc.symtab_shndx) |
sub_1C9C300 | 24KB | Capsule descriptor processor (328-byte object, KNOBS embed) |
sub_1C9B110 | 23KB | Mercury capsule builder (creates .nv.merc namespace) |
sub_1CD48C0 | 22KB | Master relocation resolver (R_MERCURY_* + standard) |
sub_1CA2E40 | 18KB | Mercury section cloner |
sub_1C9D1F0 | 16KB | Debug section classifier (SASS + merc variants) |
sub_1C98C60 | 9KB | Mercury debug section classifier (15 section names) |
sub_720F00 | 64KB | Flex SASS text lexer (self-check reconstitution) |
sub_729540 | 35KB | SASS assembly verification (self-check comparator) |
sub_703AB0 | 10KB | Binary-kind CLI parser |
sub_612DE0 | 47KB | Kernel finalizer / ELF builder (fastpath optimization) |
sub_60F290 | -- | Off-target compatibility checker |
sub_1CD13A0 | 11KB | ELF serialization (final file writer) |
Cross-References
- Mercury Encoder Pipeline -- phases 113--122, the upstream encoding that capmerc wraps
- SASS Instruction Encoding -- bit-level encoding format and 1280-bit instruction buffer
- Code Generation Overview -- high-level codegen pipeline context
- Knobs System -- knob infrastructure that KNOBS embedding serializes
- Phase Manager -- 159-phase pipeline infrastructure
- SM Architecture Map -- SM version numbers and family groupings
Newton-Raphson & Math Templates
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
NVIDIA GPUs lack hardware integer dividers and native FP64 arithmetic units on the SFU. When ptxas encounters PTX operations such as div.s32, div.u64, rcp.f64, sqrt.f64, or rsqrt.f64, it expands them into multi-instruction SASS sequences that synthesize the result from simpler hardware primitives. These expansions are the math templates -- pre-built instruction sequence generators that emit 20--100+ SASS instructions per PTX operation, using the MUFU (Multi-Function Unit) for initial approximations and Newton-Raphson iterations for refinement.
The template subsystem lives at 0x1700000--0x172A090 in the ptxas binary: 36 functions occupying ~180 KB. It is invoked during instruction selection by the master lowering dispatcher sub_AED3C0 whenever the selected instruction requires multi-instruction expansion.
| Address range | 0x1700000--0x172A090 |
| Function count | 36 (4 top-level handlers + 4 coordinators + ~24 sub-expanders + 4 helpers) |
| Binary size | ~180 KB |
| Master lowering dispatcher | sub_AED3C0 (28 KB, vtable-dispatched) |
| Emission primitives | sub_9314F0 (standard), sub_934630 (extended), sub_935130 (branch), sub_9352C0 (wide) |
| Virtual register allocator | sub_91BF30 (535 bytes, allocates 160-byte register descriptors) |
| Immediate encoder | sub_91D160 (318 bytes, encodes constant values into operand descriptors) |
| Operand legalizer | sub_13A6A10 (called before each expansion to widen immediates / fix register classes) |
| Template name strings | __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 |
Architecture
Two-Level Hierarchy
Every math template follows the same structural pattern: a top-level handler performs lazy initialization and operand legalization, then delegates to a coordinator that allocates virtual registers and calls a sequence of sub-expanders, each of which emits a portion of the final SASS instruction sequence.
sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
|
+-- sub_170E8B0 (DDIV handler) -- FP64 division
| +-- sub_170E260 (coordinator) -- 298 vregs, 6 sub-expanders
|
+-- sub_1718D60 (DRCP/DSQRT handler) -- FP64 reciprocal / square root
| +-- sub_1718790 (coordinator) -- 289 vregs, 7 sub-expanders (inc. MUFU.RCP)
|
+-- sub_17276C0 (DRSQRT handler) -- FP64 reciprocal square root
| +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders (MUFU.RSQ path)
| +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
|
+-- sub_1704070 (Inline DDIV handler) -- FP64 division, register-pressure variants
+-- sub_1702990 (>20K regs) -- full unrolled, ~50 instructions
+-- sub_1701F10 (>16K regs) -- partially spilled
+-- sub_1701860 (<=16K regs) -- minimal-register variant
Lazy Initialization
Each top-level handler uses a lazy-init pattern to avoid rebuilding the template for every invocation within a compilation unit:
// sub_170E8B0 -- DDIV handler (simplified from decompilation)
void DDIV_Handler(template_state *a1, instruction *a2) {
if (a1->template_id == -1) { // first invocation
a1->template_id = ctx->next_id++; // allocate unique ID
DDIV_Coordinator(a1, ...); // build template once
}
ctx->insert_point = a2->position;
LegalizeOperand(ctx, a2, 1, ...); // sub_13A6A10
if (a1->use_template_call) {
// Template path: emit BRA-to-template (opcode 168)
EmitExtended(ctx, 168, 0x13, ...); // sub_934630
} else {
// Inline path: emit individual FP ops directly
EmitFP(ctx, 0x86, 0xC, a1->reg[0], ...); // sub_92E800
EmitFP(ctx, 0x85, 0xC, a1->reg[1], ...);
}
}
The a1->use_template_call flag (at offset +8) controls whether the expansion is emitted as a callable template (with BRA to a named code section) or inlined directly at the call site. The template-call path produces three named code objects -- __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 -- that are shared across all DDIV call sites in the same function.
Coordinator Pattern
All four coordinators share identical structure. They allocate virtual registers from a static descriptor table, call the shared helper sub_1701140 to build the code object scaffolding, then invoke their sub-expanders in sequence:
// sub_170E260 -- DDIV coordinator (simplified)
void DDIV_Coordinator(template_state *a1, ..., int template_id) {
int *vreg_array = NULL;
int count = 0;
// Allocate 298 virtual registers from static table dword_23993E0
for (int i = 0; i < 298; i++) {
int reg_id = AllocVReg(ctx, dword_23993E0[2*i]); // sub_91BF30
int category = dword_23993E4[2*i]; // 0=output, 1=temp
if (category == 0)
output_regs[out_count++] = reg_id;
else if (category == 1)
temp_regs[temp_count++] = reg_id;
// Mark register as template-owned
*(vreg_table[reg_id] + 48) |= 0x40;
}
// Build code object scaffolding
BuildTemplateScaffold(ctx, template_id, &static_table, 3, ...);
// Name the three code sections
if (a1->use_template_call) {
section[0]->name = intern("__ori_template_DDIV1");
section[1]->name = intern("__ori_template_DDIV2");
section[2]->name = intern("__ori_template_DDIV3");
}
// Allocate 240-byte scratch buffer (zeroed)
void *scratch = arena_alloc(240);
memset(scratch, 0, 232);
// Call 6 sub-expanders in sequence
DDIV_Part1(a1, template_id, scratch, vreg_array, ...); // sub_1704180
DDIV_Part2(a1, template_id, scratch, vreg_array, ...); // sub_1705820
DDIV_Part3(a1, template_id, scratch, vreg_array, ...); // sub_17075A0
DDIV_Part4(a1, template_id, scratch, vreg_array, ...); // sub_1709130
DDIV_Part5(a1, template_id, scratch, vreg_array, ...); // sub_170AE80
DDIV_Part6(a1, template_id, scratch, vreg_array, ...); // sub_170CBD0
// Emit convergence barriers (opcode 0x5D) between code sections
for (each section boundary in static_table) {
EmitBarrier(ctx, 0x5D, pred_reg, ...); // sub_92E1B0
}
// Mark scheduling barriers at section endpoints
*(section[23]->flags + 280) |= 8;
*(section[42]->flags + 280) |= 8;
}
The static descriptor tables (dword_23993E0 for DDIV, dword_2398940 for DRCP/DSQRT, dword_2398000 for DRSQRT, dword_23976E0 for integer div) encode the register type and category for each virtual register used by the template. The category field (second element of each pair) classifies registers as output (0) or temporary (1).
FP64 Division (DDIV)
Double-precision division a / b has no single-instruction implementation on any NVIDIA GPU. ptxas synthesizes it using Newton-Raphson refinement of a single-precision reciprocal seed.
Algorithm
The DDIV template produces three code sections containing the following mathematical steps:
DDIV1 -- Initial reciprocal approximation:
- Extract the high 32 bits of the FP64 divisor b
- Convert to FP32 and compute MUFU.RCP (single-precision reciprocal, ~23 bits of mantissa)
- Convert the FP32 result back to a form suitable for FP64 refinement
- Handle special cases: divisor is zero, infinity, NaN, or denormal
DDIV2 -- Newton-Raphson refinement:
The classical Newton-Raphson iteration for reciprocal is:
x_{n+1} = x_n * (2 - b * x_n)
Each iteration approximately doubles the number of correct bits. Starting from the ~23-bit MUFU.RCP seed:
- Iteration 1: ~46 correct bits (still short of the 52-bit FP64 mantissa)
- A second partial iteration provides the guard bits needed for correct rounding
The SASS instruction sequence uses DFMA (FP64 fused multiply-add) to implement each iteration step. The FSETP/BRA branches handle edge cases where the intermediate result would overflow or underflow the FP64 range.
DDIV3 -- Final multiply and exception handling:
- Compute a * (1/b) using the refined reciprocal
- Apply IEEE 754 rounding (round-to-nearest-even by default)
- Emit the quotient to the destination register pair
- Handle overflow to infinity, underflow to zero, and NaN propagation
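The three code sections map onto a short numeric model. The sketch below is illustrative Python, not ptxas code: f32 stands in for the single-precision MUFU.RCP seed, the refinement runs in Python's native binary64 arithmetic, and the special-case handling (zero, infinity, NaN, denormals) is omitted.

```python
import struct

def f32(x):
    """Round a binary64 value to binary32, standing in for an FP32 seed."""
    return struct.unpack('f', struct.pack('f', x))[0]

def ddiv(a, b):
    # DDIV1: ~23-bit reciprocal seed (models MUFU.RCP on float32(b))
    x = f32(1.0 / f32(b))
    # DDIV2: two Newton-Raphson steps x_{n+1} = x_n * (2 - b*x_n),
    # each roughly doubling the number of correct bits (23 -> 46 -> 52+)
    for _ in range(2):
        x = x * (2.0 - b * x)
    # DDIV3: final multiply a * (1/b)
    return a * x
```

Running ddiv(1.0, 3.0) agrees with 1.0/3.0 to within a few ULPs of binary64, which is why two iterations (plus the guard-bit correction in the real template) suffice.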
SASS Instruction Sequence (sub_1705820)
The DDIV Part 2 sub-expander (sub_1705820, 7,545 bytes, 1,057 lines decompiled) is the largest single sub-expander and emits the core Newton-Raphson loop. The instruction mix from decompilation:
| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| IMAD | 0xC9 | 10 | Integer multiply-add for mantissa manipulation |
| FSETP | 0x97 | 6 | Floating-point set-predicate for branch conditions |
| MOV | 0x82 | 13 | Register-to-register moves |
| MOV (FP) | 0x0A | 10 | FP register moves with type annotation |
| IADD3 | 0x110 | 5 | Three-operand integer add for exponent arithmetic |
| SHR | 0x19 | 1 | Shift right for exponent extraction |
| BRA | 0x5F | 5 | Conditional branches for special-case handling |
| MUFU | 0x3C | 1 | MUFU.RCP -- the initial reciprocal seed |
| DFMA | 0x122 | 2 | FP64 fused multiply-add (Newton-Raphson iteration) |
| FP64 op | 0x8B | 2 | FP64 arithmetic (multiply or add) |
| FP32 hi/lo | 0x86/0x85 | 4+4 | Move FP32 halves of FP64 register pair |
| Total | | ~63 | Per sub-expander (Part 2 of 6) |
The complete DDIV template across all 6 sub-expanders emits approximately 100--120 SASS instructions, using 298 virtual registers.
Register Pressure Variants
The inline DDIV handler (sub_1704070) selects between three implementations based on the target architecture's register file size at *(*(context+1584) + 372):
| Register limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 (5,846 bytes) | Full unrolled -- maximum ILP, 14+ dedicated scratch registers |
| > 16,383 | sub_1701F10 | Partially spilled -- trades some registers for spill/fill |
| <= 16,383 | sub_1701860 | Minimal-register -- reuses registers aggressively, more instructions |
This three-tier approach is a register-pressure/throughput tradeoff: kernels with high register demand (and thus low occupancy) use the minimal variant, while kernels with register headroom use the fully unrolled variant for better instruction-level parallelism.
FP64 Reciprocal and Square Root (DRCP/DSQRT)
The DRCP/DSQRT handler (sub_1718D60) shares the same lazy-init and template-call architecture as DDIV. Its coordinator (sub_1718790) allocates 289 virtual registers from dword_2398940 and calls 7 sub-expanders:
| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_170ED40 | FP64 reciprocal: extract exponent, compute MUFU.RCP seed |
| Part 2 | sub_1710280 | Newton-Raphson iteration 1 for reciprocal refinement |
| Part 3 | sub_17120F0 | Newton-Raphson iteration 2 (second doubling of precision) |
| Part 4 | sub_17139D0 | Rounding and normalization |
| Part 5 | sub_1715910 | Square root path: compute MUFU.RSQ seed, refine |
| Part 6 | sub_1717470 | Final multiply x * rsqrt(x) to get sqrt(x), exception handling |
| (shared) | sub_1701140 | Template scaffolding helper (called by all coordinators) |
The algorithm for DRCP(b) = 1/b:
- MUFU.RCP(float32(b)) provides a ~23-bit seed
- Two Newton-Raphson iterations: x_{n+1} = x_n * (2 - b * x_n), each using DFMA
- Final rounding to FP64 precision
The algorithm for DSQRT(a) = sqrt(a):
- MUFU.RSQ(float32(a)) provides a ~23-bit 1/sqrt(a) seed
- Refine 1/sqrt(a) via Newton-Raphson: y_{n+1} = y_n * (3 - a * y_n^2) / 2
- Compute sqrt(a) = a * (1/sqrt(a)) using the refined reciprocal square root
- Apply IEEE 754 rounding
Both paths share the same coordinator and register pool. The coordinator selects the DRCP path or DSQRT path based on the original PTX operation being lowered.
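Both recipes can be checked numerically. A minimal Python model (ours, not ptxas code; f32 mimics the FP32 MUFU seed, and exception handling is omitted):

```python
import struct

def f32(x):
    """Round to binary32, standing in for a single-precision MUFU seed."""
    return struct.unpack('f', struct.pack('f', x))[0]

def drcp(b):
    x = f32(1.0 / f32(b))                # MUFU.RCP stand-in, ~23-bit seed
    for _ in range(2):                   # two DFMA-based refinement steps
        x = x * (2.0 - b * x)            # x_{n+1} = x_n * (2 - b*x_n)
    return x

def dsqrt(a):
    y = f32(1.0 / f32(a) ** 0.5)         # MUFU.RSQ stand-in, ~23-bit seed
    for _ in range(2):
        y = y * (3.0 - a * y * y) * 0.5  # y_{n+1} = y_n * (3 - a*y_n^2) / 2
    return a * y                         # sqrt(a) = a * (1/sqrt(a))
```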
FP64 Reciprocal Square Root (DRSQRT)
The DRSQRT handler (sub_17276C0) is the most complex top-level handler. It dispatches to one of two coordinators based on a hardware capability flag:
// sub_17276C0 -- DRSQRT handler (simplified)
void DRSQRT_Handler(template_state *a1, instruction *a2) {
int hw_flag = *(*(ctx + 1584) + 1037) & 1;
if (a1->template_id == -1) {
a1->template_id = ctx->next_id++;
if (hw_flag)
Coordinator_IntDiv(a1, ...); // sub_1727130: 59 vregs
else
Coordinator_DRSQRT(a1, ...); // sub_1720D60: 247 vregs
}
// ... operand legalization, template call or inline emission
}
The hardware flag at *(config + 1037) & 1 likely distinguishes architectures with enhanced SFU precision (where fewer refinement iterations are needed) from older architectures requiring the full Newton-Raphson sequence.
Coordinator A (sub_1720D60): allocates 247 virtual registers from dword_2398000 and calls 5 sub-expanders:
| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_1719080 | Initial MUFU.RSQ seed, exponent extraction |
| Part 2 | sub_171A260 | Newton-Raphson iteration 1 |
| Part 3 | sub_171BB80 | Newton-Raphson iteration 2 |
| Part 4 | sub_171D3A0 | Normalization and rounding |
| Part 5 | sub_171EFD0 | Exception handling (NaN, infinity, negative, zero) |
Coordinator B (sub_1727130): allocates only 59 virtual registers from dword_23976E0 and dispatches to the integer division sub-expanders (sub_1724A20 for 32-bit, sub_1728930 for 64-bit unsigned, sub_1727AC0 for 64-bit signed). This path handles the integer division/modulo lowering via sub_1729B50.
Integer Division Lowering
Integer division and modulo by variable (non-constant) values are expanded into multi-instruction SASS sequences during instruction selection. These sequences use the MUFU.RCP hardware approximation as a starting point, then correct the result with integer arithmetic.
32-bit Division -- sub_1724A20
Size: 28,138 bytes decompiled (957 lines), the largest function in the 0x1723000--0x17F8000 range.
Called from: sub_1727130 (coordinator B).
Virtual registers: 59 (allocated by coordinator B from dword_23976E0).
Temporary pool: indices 90--126 of the parameter array, providing 37 dedicated scratch registers.
Algorithm for unsigned 32-bit a / b:
Step 1: float_b = I2F(b) ; convert divisor to FP32
Step 2: rcp = MUFU.RCP(float_b) ; ~23-bit reciprocal approximation
Step 3: int_rcp = F2I(rcp) ; convert back to integer
Step 4: q_est = IMAD.HI(a, int_rcp, 0) ; estimated quotient (high 32 bits of a*rcp)
Step 5: r_est = IMAD(q_est, -b, a) ; estimated remainder = a - q*b
Step 6: if (r_est >= b) { q_est++; r_est -= b } ; correction iteration 1
Step 7: if (r_est >= b) { q_est++; r_est -= b } ; correction iteration 2
Step 8: result = q_est ; (or r_est for modulo)
The correction steps (6--7) are implemented with ISETP (opcode 0xC9) for comparison and BRA (opcode 0x5F) for conditional execution. In the worst case, two correction iterations suffice because the MUFU.RCP approximation is accurate to within 2 ULP of the true reciprocal.
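The eight steps translate into a compact Python model. This is an illustrative sketch, not the emitted SASS: Python's double-precision 1.0/b stands in for the ~23-bit MUFU.RCP seed, and the 0.32 fixed-point scaling is our assumption about what the F2I/IMAD.HI pair effectively computes.

```python
def udiv32(a, b):
    """Unsigned 32-bit divide via reciprocal estimate plus correction steps."""
    assert 0 <= a <= 0xFFFFFFFF and 0 < b <= 0xFFFFFFFF
    rcp = 1.0 / float(b)              # Steps 1-2: reciprocal approximation
    int_rcp = int(rcp * 2.0**32)      # Step 3: F2I as 0.32 fixed-point (assumed)
    q = (a * int_rcp) >> 32           # Step 4: IMAD.HI -- estimated quotient
    r = a - q * b                     # Step 5: estimated remainder
    for _ in range(2):                # Steps 6-7: at most two corrections
        if r >= b:
            q += 1
            r -= b
    return q, r                       # Step 8: quotient (or remainder for mod)
```

Truncating the scaled reciprocal guarantees the estimate never overshoots, so the corrections only ever increment the quotient.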
Key constants allocated via sub_91D160:
| Constant | Purpose |
|---|---|
| 23 | Float exponent bias for mantissa extraction |
| 255 | Exponent mask (8-bit IEEE 754 exponent field) |
| 127 | IEEE 754 single-precision exponent bias |
| 254 | Double-bias for overflow guard (2 * 127) |
| 1, -1 | Correction increments for quotient adjustment |
The complete SASS instruction mix for the 32-bit division template:
| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| I2F | 0xD5 | 2 | Integer-to-float conversion |
| F2I | 0xD6 | 3 | Float-to-integer conversion |
| IMAD | 0x6E | 5 | Integer multiply-add (quotient estimation) |
| IMAD.WIDE | 0x6F | 3 | Wide multiply-add (64-bit intermediate) |
| IADD | 0x02 | ~3 | Integer add (correction) |
| MOV | 0x82 | 10 | Register moves |
| MOV (typed) | 0x0A | 6 | Typed register moves |
| ISETP | 0xC9 | 8 | Integer set-predicate (comparison) |
| FSETP | 0x97 | 3 | Float set-predicate |
| SHL/LEA | 0x24 | 2 | Shift-left / load effective address |
| BRA | 0x5F | 4 | Conditional branch (correction paths) |
| POPC/LOP | 0x93 | 1 | Population count / logic op |
| Total | | ~50 | |
64-bit Division
Two variants handle 64-bit operands, both called from sub_1729B50:
- sub_1728930 (16,545 bytes): unsigned 64-bit division. The algorithm is analogous to 32-bit but requires double-width multiply (IMAD.WIDE), carry propagation, and additional correction iterations. Emits ~80 SASS instructions.
- sub_1727AC0 (13,776 bytes): signed 64-bit division. Wraps the unsigned algorithm with sign extraction, absolute value computation, and sign fixup of the quotient and remainder.
Both allocate from the same 59-register pool managed by coordinator B.
Division by Constant
Division by compile-time constant is handled separately during the GeneralOptimize bundle passes (not by these templates). The classic Granlund-Montgomery magic-number technique converts x / C to MULHI(x, magic) >> shift, producing 2--3 instructions instead of ~50. See Strength Reduction for details.
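For illustration, the magic-number transform can be reproduced in a few lines. This is a textbook Granlund-Montgomery sketch (ours, not recovered from ptxas); when the computed magic exceeds 32 bits, real codegen needs an extra add/shift fixup, but the arithmetic identity below holds either way.

```python
def magic_u32(d):
    """Choose (m, s) with 2^(32+s) <= m*d <= 2^(32+s) + 2^s, so that
    x // d == (x * m) >> (32 + s) for every 32-bit unsigned x."""
    assert 1 < d < 2**32
    s = (d - 1).bit_length()          # s = ceil(log2(d)), hence d <= 2^s
    m = -(-(1 << (32 + s)) // d)      # m = ceil(2^(32+s) / d)
    return m, s

def div_by_const(x, d):
    m, s = magic_u32(d)
    return (x * m) >> (32 + s)        # MULHI(x, m) >> s in hardware terms
```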
MUFU: The Hardware Approximation Engine
All math templates depend on the MUFU (Multi-Function Unit) instruction, which provides low-precision hardware approximations for transcendental and special functions. MUFU is a single SASS instruction (internal opcode 0x3C) with a sub-function selector:
| MUFU Sub-function | Operation | Precision | Latency (typical) |
|---|---|---|---|
| MUFU.RCP | 1/x (reciprocal) | ~23 bits (FP32 mantissa) | ~8 cycles |
| MUFU.RSQ | 1/sqrt(x) (reciprocal square root) | ~23 bits | ~8 cycles |
| MUFU.RCP64H | High-precision 1/x seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.RSQ64H | High-precision 1/sqrt(x) seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.SIN | sin(x) | ~23 bits | ~8 cycles |
| MUFU.COS | cos(x) | ~23 bits | ~8 cycles |
| MUFU.EX2 | 2^x (base-2 exponential) | ~23 bits | ~8 cycles |
| MUFU.LG2 | log2(x) (base-2 logarithm) | ~23 bits | ~8 cycles |
| MUFU.SQRT | sqrt(x) (sm_89+) | ~23 bits | ~8 cycles |
MUFU executes on the SFU (Special Function Unit), which is separate from the integer and floating-point ALU pipelines. On sm_80 (Ampere) and later, the SFU can execute one MUFU per cycle per SM partition. The key insight is that MUFU provides only FP32-precision seeds; achieving FP64 precision requires the Newton-Raphson refinement implemented by the math templates.
Fast-Math vs. IEEE Precision
For FP32 operations, the PTX modifiers .approx and .ftz control whether ptxas uses MUFU directly or applies refinement:
- div.approx.f32: Emits a single MUFU.RCP followed by FMUL. No Newton-Raphson. Result has ~23-bit precision (not IEEE-correct rounding).
- div.full.f32: Emits MUFU.RCP + one Newton-Raphson iteration via FFMA. Result is IEEE-correct for all normal inputs.
- div.rn.f64: Emits the full DDIV template (~100+ instructions) with two Newton-Raphson iterations. Result is IEEE 754 round-to-nearest-even.
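The FP32 precision ladder can be demonstrated with Python floats rounded through binary32 (a numeric model with our own helper names, not ptxas output; each FFMA is modeled by rounding only the final result of the multiply-add):

```python
import struct

def f32(x):
    """Round to IEEE binary32, standing in for FP32 hardware results."""
    return struct.unpack('f', struct.pack('f', x))[0]

def div_approx_f32(a, b):
    return f32(a * f32(1.0 / b))      # MUFU.RCP + FMUL, no refinement

def div_full_f32(a, b):
    r = f32(1.0 / b)                  # MUFU.RCP-style seed
    e = f32(1.0 - b * r)              # FFMA: residual 1 - b*r
    r = f32(r + r * e)                # FFMA: one Newton-Raphson step
    return f32(a * r)
```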
For transcendental functions (sin, cos, exp, log):
- sin.approx.f32 / cos.approx.f32: Single MUFU.SIN / MUFU.COS. ~23-bit precision over a reduced range.
- sin.f32 (full range): Range reduction to [-pi, pi] via polynomial argument reduction, then MUFU.SIN + polynomial correction. Emitted as a libdevice call or inline sequence depending on optimization level.
- ex2.approx.f32: Single MUFU.EX2.
- lg2.approx.f32: Single MUFU.LG2.
There are no FP64 versions of MUFU.SIN/COS/EX2/LG2. FP64 transcendentals are always implemented by linking against libdevice (the CUDA math library), which provides polynomial approximation sequences compiled from C source code. These are not handled by ptxas's internal templates but by the libdevice bitcode linked during cicc compilation, upstream of ptxas.
Template Instantiation Infrastructure
Emission Primitives
The sub-expanders construct SASS instructions using a family of emission functions:
| Function | Size | Signature | Role |
|---|---|---|---|
| sub_9314F0 | 403 bytes | (scratch, ctx, opcode, type, operand_count, operands, xmm, fp) | Standard SASS instruction emission (2--5 operands) |
| sub_934630 | 1,213 bytes | (scratch, ctx, opcode, type, ?, ?, xmm, fp, operand_buf, count) | Extended emission for control flow and >4 operands |
| sub_935130 | 390 bytes | (scratch, ctx, opcode, count, label_buf, label_count, ...) | Branch emission with label resolution |
| sub_9352C0 | (variant) | (scratch, ctx, opcode, type, operands, count, ..., extra_buf, ...) | Wide emission with extra operand buffer (used for MUFU) |
| sub_92E800 | 70 bytes | (scratch, ctx, opcode, type, reg_id, src_operand, xmm, fp) | Simplified emission for single-source FP ops |
| sub_92E720 | 51 bytes | (scratch, ctx, opcode, type, dest_pair, src_operand, xmm, fp) | Simplified emission wrapper for register pairs |
| sub_92E1B0 | (variant) | (scratch, ctx, opcode, pred_reg, xmm, fp) | Predicated barrier/convergence emission |
Operand Encoding
Each operand in the emission buffer is a 32-bit tagged value:
| Tag (bits 31:24) | Meaning |
|---|---|
| 0x90 | Destination register (bits 23:0 = register ID) |
| 0x10 | Source register |
| 0x20 | Immediate constant (from constant pool via sub_91D160) |
| 0x40 | External constant reference |
| 0x60 | Template call target / sentinel (used for BRA-to-template) |
| 0x80 | Negate modifier (OR'd onto source tag: 0x90 = negated source) |
The 64-bit modifier word 0x6000000500000000 appearing in many emission calls encodes instruction-level flags such as .reuse hints and type specifiers.
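A sketch of the tagged-operand packing (the helper names are ours; the tag values and 24-bit payload layout come from the table above):

```python
# Tag values from the recovered operand-encoding table
TAG_DST, TAG_SRC, TAG_IMM, TAG_NEG = 0x90, 0x10, 0x20, 0x80

def make_operand(tag, value):
    """Pack a tag (bits 31:24) and payload (bits 23:0) into one 32-bit word."""
    assert 0 <= value < (1 << 24)
    return (tag << 24) | value

def split_operand(word):
    """Recover (tag, payload) from a packed operand word."""
    return word >> 24, word & 0xFFFFFF
```

Note a quirk implied by the table: a negated source (TAG_SRC | TAG_NEG = 0x90) is numerically identical to the destination tag, so operand position must disambiguate the two.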
Virtual Register Allocation
Each coordinator allocates its full set of virtual registers in a single loop before any instructions are emitted. The sub_91BF30 allocator creates 160-byte register descriptors and returns a 24-bit register ID. Each register is marked with flags |= 0x40 (bit 6) to indicate it is owned by a template rather than the main register allocation pass. This prevents the register allocator from coalescing or splitting template-internal registers.
The static descriptor tables encode register types as the first element of each (type, category) pair:
| Template | Table address | Register count | Data types |
|---|---|---|---|
| DDIV | dword_23993E0 | 298 | FP64 pairs, FP32, integer, predicate |
| DRCP/DSQRT | dword_2398940 | 289 | FP64 pairs, FP32, integer, predicate |
| DRSQRT | dword_2398000 | 247 | FP64 pairs, FP32, integer, predicate |
| Integer div | dword_23976E0 | 59 | Integer (32/64-bit), predicate |
The register counts explain why these templates dominate register pressure in FP64-heavy kernels: 298 virtual registers for a single DDIV expansion is enormous by GPU standards, where the entire physical register file is 65,536 32-bit registers shared across all active warps.
Template Call vs. Inline
The use_template_call flag at template_state + 8 selects between two emission strategies:
Template-call path (flag set):
- The coordinator builds three named code sections (__ori_template_DDIV1/2/3)
- Each call site emits a BRA (opcode 168 via sub_934630) to the template code
- The template code is shared across all call sites in the same function
- Convergence barriers (opcode 0x5D via sub_92E1B0) ensure correct re-convergence
- A CALL-like instruction (opcode 164) handles the return path
Inline path (flag clear):
- The sub-expander instructions are emitted directly at the call site
- Each call site gets its own copy of the full instruction sequence
- Uses direct IADD3 (opcode 0x110) for control flow instead of BRA
- No named code sections, no convergence barriers
- A JMP/BRA (opcode 0x20 or 0x5F) replaces the template return
The template-call path is preferred for functions with multiple DDIV/DRCP/DSQRT operations because it avoids duplicating the large instruction sequence. The inline path is used when the function has only one such operation, or when the register allocator determines that the overhead of the template call mechanism (saving/restoring registers across the call boundary) exceeds the code-size benefit.
SM-Specific Variants
Register File Size Dispatch
The inline DDIV handler (sub_1704070) reads the register file capacity from *(*(context + 1584) + 372) and selects between three implementation tiers:
int reg_file_size = *(*(ctx + 1584) + 372); // physical register count
if (reg_file_size > 20479)
DDIV_FullUnroll(a1, a2, ...); // sub_1702990: max ILP
else if (reg_file_size > 16383)
DDIV_PartialSpill(a1, a2, ...); // sub_1701F10: balanced
else
DDIV_MinimalRegs(a1, a2, ...); // sub_1701860: min pressure
The thresholds (20,479 and 16,383) correspond to register file sizes across GPU generations:
- sm_50--sm_61 (Maxwell/Pascal): 65,536 registers per SM -> 20,479 threshold met at occupancy < 3 blocks
- sm_70--sm_89 (Volta through Ada): 65,536 registers -> same thresholds
- sm_100+ (Blackwell): 65,536 registers -> same, but wider warp execution changes the pressure calculus
Hardware Capability Flag
The DRSQRT handler checks *(*(context + 1584) + 1037) & 1 to select between coordinator A (full Newton-Raphson, 247 registers) and coordinator B (reduced sequence, 59 registers). This flag likely indicates the presence of MUFU.RCP64H / MUFU.RSQ64H on sm_80+ architectures, which provide higher-precision seeds (~28 bits vs. ~23 bits) and thus require fewer refinement iterations.
SASS Opcode Reference
Internal opcode IDs used by the math templates, mapped to SASS mnemonics:
| Internal ID | SASS Mnemonic | Description |
|---|---|---|
| 0x02 | IADD | Integer add (3-operand) |
| 0x0A | MOV | FP register move (typed) |
| 0x19 | SHR | Shift right (exponent extraction) |
| 0x20 | BRA/JMP | Unconditional branch (inline return path) |
| 0x24 | SHL/LEA | Shift left / load effective address |
| 0x3C | MUFU | Multi-function unit (RCP, RSQ, SIN, COS, EX2, LG2) |
| 0x5D | BSYNC | Barrier synchronization / convergence barrier |
| 0x5F | BRA | Conditional branch |
| 0x6E | IMAD | Integer multiply-add |
| 0x6F | IMAD.WIDE | Wide integer multiply-add (64-bit result) |
| 0x82 | MOV | General register move |
| 0x85 | MOV.LO | Move low 32 bits of FP64 pair |
| 0x86 | MOV.HI | Move high 32 bits of FP64 pair |
| 0x8B | DFMA/DMUL | FP64 fused multiply-add or multiply |
| 0x93 | POPC | Population count / bitwise logic |
| 0x97 | FSETP | FP32 set-predicate (comparison) |
| 0xA8 | SEL | Conditional select |
| 0xC9 | ISETP | Integer set-predicate (comparison) |
| 0xD5 | I2F | Integer-to-float conversion |
| 0xD6 | F2I | Float-to-integer conversion |
| 0x110 | IADD3 | Three-operand integer add |
| 0x122 | DFMA | FP64 fused multiply-add (Newton-Raphson core) |
Function Map
| Address | Size | Function | Role |
|---|---|---|---|
| sub_AED3C0 | 28 KB | Master lowering dispatcher | Vtable-dispatched, calls all template handlers |
| sub_170E8B0 | 1,166 bytes | DDIV top-level handler | Lazy init, operand legalization, template-call/inline dispatch |
| sub_170E260 | 1,615 bytes | DDIV coordinator | 298 vregs, 6 sub-expanders, names __ori_template_DDIV1/2/3 |
| sub_1704180 | ~5 KB | DDIV sub-expander 1 | Initial reciprocal approximation |
| sub_1705820 | 7,545 bytes | DDIV sub-expander 2 | Newton-Raphson core (MUFU.RCP + refinement) |
| sub_17075A0 | ~6 KB | DDIV sub-expander 3 | Second refinement iteration |
| sub_1709130 | ~6 KB | DDIV sub-expander 4 | Final multiply (a * 1/b) |
| sub_170AE80 | ~6 KB | DDIV sub-expander 5 | Rounding and normalization |
| sub_170CBD0 | ~5 KB | DDIV sub-expander 6 | Exception handling |
| sub_1718D60 | 790 bytes | DRCP/DSQRT top-level handler | Lazy init, shares structure with DDIV |
| sub_1718790 | 1,487 bytes | DRCP/DSQRT coordinator | 289 vregs, 7 sub-expanders |
| sub_170ED40 | ~5 KB | DRCP/DSQRT sub-expander 1 | MUFU.RCP seed extraction |
| sub_1710280 | ~5 KB | DRCP/DSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_17120F0 | ~6 KB | DRCP/DSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_17139D0 | ~6 KB | DRCP/DSQRT sub-expander 4 | Rounding/normalization |
| sub_1715910 | ~6 KB | DRCP/DSQRT sub-expander 5 | DSQRT path (MUFU.RSQ) |
| sub_1717470 | ~5 KB | DRCP/DSQRT sub-expander 6 | Final multiply, exceptions |
| sub_17276C0 | 1,011 bytes | DRSQRT top-level handler | HW capability dispatch |
| sub_1720D60 | 1,423 bytes | DRSQRT coordinator A | 247 vregs, 5 sub-expanders (full N-R) |
| sub_1719080 | ~5 KB | DRSQRT sub-expander 1 | MUFU.RSQ seed |
| sub_171A260 | ~6 KB | DRSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_171BB80 | ~6 KB | DRSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_171D3A0 | ~6 KB | DRSQRT sub-expander 4 | Normalization |
| sub_171EFD0 | ~5 KB | DRSQRT sub-expander 5 | Exception handling |
| sub_1727130 | 1,423 bytes | Integer div coordinator (B) | 59 vregs, dispatches to div templates |
| sub_1724A20 | 28,138 bytes | 32-bit integer div/mod | Newton-Raphson via MUFU.RCP + IMAD correction |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div/mod | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div/mod | Signed wrapper around unsigned |
| sub_1729B50 | ~2 KB | 64-bit div dispatcher | Selects signed vs. unsigned handler |
| sub_1704070 | 263 bytes | Inline DDIV dispatcher | Register-pressure based 3-tier selection |
| sub_1702990 | 5,846 bytes | Inline DDIV (full unroll) | >20K register variant |
| sub_1701F10 | ~4 KB | Inline DDIV (partial spill) | >16K register variant |
| sub_1701860 | ~3 KB | Inline DDIV (minimal regs) | <=16K register variant |
| sub_1701140 | 8,690 bytes | Template scaffolding helper | Code object construction, called by all coordinators |
| sub_172A090 | 3,095 bytes | Conditional move emission | Scheduling barrier fixup |
Cross-References
- Code Generation Overview -- pipeline context showing templates as step 5 of 7
- Strength Reduction -- division-by-constant optimization (Granlund-Montgomery), peephole patterns
- Instruction Selection -- ISel mega-selector and pattern matchers that feed into template expansion
- SASS Instruction Encoding -- sub_7B9B80 bitfield packer used by encoding tables
- Peephole Optimization -- post-template simplification of emitted sequences
- Register Model -- virtual register allocation and the 0x40 template-ownership flag
- Scheduling -- scheduling of template-emitted instruction sequences
SASS Text Generation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Phases 129 (DumpNVuCodeText) and 130 (DumpNVuCodeHex) convert the internal instruction stream into human-readable SASS assembly text and raw hex dumps respectively. The text output is the same format produced by cuobjdump --dump-sass and is used for --verbose output, DUMPIR diagnostics, --forcetext mode, --out-sass dumps, and the --self-check roundtrip verification pipeline. The subsystem spans two distinct address ranges: a PTX-level text generation system (580 formatter functions at 0x4DA340--0x5A8E40) and a SASS-level disassembly renderer (~123 virtual printer methods at 0x17F8000--0x181FFFF).
| Pipeline phases | 129 (DumpNVuCodeText), 130 (DumpNVuCodeHex) |
| Phase category | Debug (conditionally executed) |
| PTX formatter count | 580 functions at 0x4DA340--0x5A8E40 (~850 KB) |
| PTX dispatcher | sub_5D4190 (12.9 KB, two-level opcode dispatch) |
| SASS printer count | ~123 vtable methods at 0x17F8000--0x181FFFF |
| Builder/visitor vtable | ~520 method slots (4,160+ byte vtable) |
| Format string table | ~1.8 MB monolithic NUL-terminated string block |
| Temp buffer size | 50,000 bytes per formatter invocation |
| Largest formatter | sub_5A8E40 (wmma.load.b, 9,757 bytes) |
| Key helpers | sub_9D12F0 (operand encoder), sub_9DB7E0 (predicate printer) |
Output Format
SASS text generation produces output compatible with cuobjdump --dump-sass. The format includes control information (scheduling metadata), predicate guards, opcode mnemonics, operands with modifiers, and optional annotations.
Instruction Line Format
/*ADDR*/ {CTRL} OPCODE{.MODIFIERS} DST, SRC0{, SRC1{, SRC2}} ; /* LINE */
Concrete examples of the format ptxas produces:
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0004017802 */
/*0010*/ S2R R0, SR_CTAID.X ; /* 0x0000000000007919 */
/*0020*/ @P0 IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x168] ;
/*0030*/ IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x16c] ;
/*0040*/ ISETP.GE.AND P0, PT, R0, R2, PT ;
/*0050*/ @P0 EXIT ;
/*0060*/ STG.E [R4.64], R0 ;
/*0070*/ EXIT ;
/*0080*/ BRA 0x80 ;
Control Word Format
For architectures with explicit scheduling control (SM 50--SM 70), the control word is printed in a dedicated line before each group of three instructions:
/* 0x001c4400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ;
/*0010*/ S2R R0, SR_TID.X ;
/*0018*/ S2R R2, SR_CTAID.X ;
The 64-bit control word encodes scheduling data for three instructions:
| Field | Bits | Description |
|---|---|---|
| Stall count | 4 bits per instruction | Minimum cycles to wait before issue (0--15) |
| Yield hint | 1 bit per instruction | Suggest warp scheduler switch |
| Write barrier | 3 bits per instruction | Dependency barrier index (0--5, 7 = none) |
| Read barrier | 3 bits per instruction | Read dependency barrier mask |
| Wait barrier mask | 6 bits per instruction | Which barriers to wait on before issue |
For SM 75+ architectures (Turing and later), scheduling information is embedded per-instruction rather than in grouped control words, so the text output places it differently or omits the separate control line.
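For illustration, a hypothetical decoder for the grouped control word. The per-instruction field widths follow the table above plus a 4-bit reuse field, packed 21 bits per instruction as documented by public Maxwell-era reverse-engineering efforts; the exact bit order here is an assumption, not something recovered from ptxas.

```python
def decode_ctrl(word):
    """Split a 64-bit control word into three per-instruction slots
    (assumed layout: stall:4, yield:1, wrtdb:3, readb:3, watdb:6, reuse:4)."""
    slots = []
    for i in range(3):
        c = (word >> (21 * i)) & 0x1FFFFF
        slots.append({
            'stall': c & 0xF,            # cycles to wait before issue
            'yield': (c >> 4) & 1,       # warp-switch hint
            'wrtdb': (c >> 5) & 0x7,     # write dependency barrier (7 = none)
            'readb': (c >> 8) & 0x7,     # read dependency barrier (7 = none)
            'watdb': (c >> 11) & 0x3F,   # wait-on-barrier mask
            'reuse': (c >> 17) & 0xF,    # operand reuse cache flags
        })
    return slots
```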
Hex Dump Format (Phase 130)
Phase 130 (DumpNVuCodeHex) emits the raw encoding bytes as hex values:
/*0000*/ 0x00000a0004017802
/*0008*/ 0x0000000000007919
/*0010*/ 0x000000ff0aff7824
Each line contains the instruction address and its encoded QWORD(s). For 128-bit instructions, two QWORDs are printed.
Architecture
The text generation subsystem has two levels: a PTX-level pretty-printer that formats instructions from the Ori IR representation, and a SASS-level disassembly renderer that decodes binary-encoded SASS instructions back to text.
Level 1: PTX Instruction Text Formatters
This is the primary text generation system. The 580 formatter functions convert internal instruction representations (accessed via the instruction object at *(a1+1096)) into PTX assembly text strings.
sub_5D4190 (12.9 KB, dispatcher)
├─ First: calls sub_5D1660 to initialize intrinsic ID table (608 entries)
├─ Registers 121 named opcodes at a1+808 via sub_426150()
├─ Registers ~400 hash-keyed opcodes at a1+816 via sub_426150()
└─ Dispatches to one of 580 formatters at 0x4DA340-0x5A8E40
└─ Each: alloc 50 KB → sprintf via format table → shrink-copy → free
The dispatcher uses a two-level dispatch strategy:
- Named dispatch (121 opcodes): Direct string-to-function registration for recent or complex instructions. The opcode name string (e.g., "wmma.load.a", "tcgen05.mma", "barrier.cta") is looked up in a hash map at a1+808.
- Hash dispatch (~400 opcodes): Numeric hash values of opcode names are used as keys in a second hash map at a1+816. The hash values are stored as decimal string representations (e.g., "2644314910", "605425506"). This covers the stable ISA core -- arithmetic, logic, loads, stores, branches, conversions.
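In sketch form, the lookup order is: named table first, numeric-hash table second. The entry shapes and linear scans below are stand-ins for the real opaque hash containers registered through sub_426150; only the two-level ordering is taken from the binary.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef char *(*formatter_fn)(int64_t ctx, int64_t fmt_table);

/* Hypothetical entry shapes; the real containers at a1+808 / a1+816
 * are opaque hash maps, modeled here as flat arrays. */
struct named_entry  { const char *name; formatter_fn fn; };
struct hashed_entry { uint64_t    hash; formatter_fn fn; };

/* Placeholder formatter so the sketch is self-contained. */
static char *fmt_stub(int64_t ctx, int64_t fmt_table) {
    (void)ctx; (void)fmt_table;
    return 0;
}

/* Named dispatch first (121 opcodes), then hash dispatch (~400). */
static formatter_fn dispatch(const char *opcode, uint64_t opcode_hash,
                             const struct named_entry *named, int n_named,
                             const struct hashed_entry *hashed, int n_hashed) {
    for (int i = 0; i < n_named; i++)
        if (strcmp(named[i].name, opcode) == 0)
            return named[i].fn;
    for (int i = 0; i < n_hashed; i++)
        if (hashed[i].hash == opcode_hash)
            return hashed[i].fn;
    return 0; /* unrecognized opcode */
}
```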
Level 2: SASS Disassembly Renderer
The SASS printer at 0x17F8000--0x181FFFF operates on binary-encoded SASS instructions and produces text through a builder/visitor pattern. This is used for the --self-check roundtrip verification and --out-sass output.
SASS instruction (binary-encoded)
│
├─ Read opcode at instruction+72, mask BYTE1 &= 0xCF
├─ Switch on canonical opcode ID
│
├─ For each operand:
│ └─ sub_9D12F0(output_128, ctx, instr, operand_idx, stride, mode, flag)
│ → 64-byte operand encoding structure
│
├─ Emit via builder/visitor vtable at *(a1 + 24):
│ ├─ vtable+936: begin_predicate_guard()
│ ├─ vtable+3768: begin_operands()
│ ├─ vtable+16: emit_operand(kind_id, ...)
│ ├─ vtable+272: emit_integer(value)
│ ├─ vtable+1760: set_rounding_mode(mode)
│ ├─ vtable+3952: emit_saturation_flag()
│ ├─ vtable+3960: emit_ftz_flag()
│ ├─ vtable+3968: emit_negate_flag()
│ ├─ vtable+4072: emit_cache_operation()
│ ├─ vtable+4080: emit_eviction_hint()
│ ├─ vtable+944: end_predicate_guard()
│ └─ vtable+4160: end_statement()
│
└─ Predicate guard: sub_9DB7E0 (662 bytes, 19 callers)
The builder/visitor vtable has approximately 520 method slots (vtable spans 4,160+ bytes), making it one of the largest virtual dispatch interfaces in the binary. Different concrete visitor implementations produce different output formats (text, hex, self-check comparison).
Formatter Template
Every PTX formatter function is mechanically generated from instruction definition tables. All 580 follow an identical structure:
char* format_OPCODE(int64_t a1, int64_t a2) {
// a1 = instruction context (instruction data at a1+1096)
// a2 = format string table base pointer (~1.8 MB)
// Phase 1: Allocate temp buffer
int64_t pool = ((int64_t*)sub_4280C0(a1, a2))[3]; // arena_get_pool
char* buf = (char*)sub_424070(pool, 50000); // pool_alloc(50KB)
if (!buf) sub_42BDB0(pool, 50000, ...); // alloc_fail_abort
// Phase 2: Build instruction text via sprintf chain
int pos = sprintf(buf, "%s", (char*)(a2 + OFFSET_A)); // opcode prefix
if (sub_70B6E0(*(a1+1096))) // has_predicate?
pos += sprintf(buf+pos, fmt, sub_70B780(*(a1+1096))); // predicate name
pos += sprintf(buf+pos, "%s", (char*)(a2 + OFFSET_B)); // operand template
// ... more operands via sub_70B8E0, sub_70B910, sub_70B920 ...
strcpy(buf+pos, (char*)(a2 + OFFSET_N)); // trailing text
// Phase 3: Shrink-copy to exact size
size_t len = strlen(buf);
int64_t pool2 = ((int64_t*)sub_4280C0(buf, ...))[3];
char* result = (char*)sub_424070(pool2, len + 1);
strcpy(result, buf);
// Phase 4: Free temp buffer
sub_4248B0(buf); // pool_free
return result;
}
The format string table (a2) is a single monolithic ~1.8 MB block of NUL-terminated strings containing pre-assembled text templates with %s, %llu, %d placeholders. Different formatters access it at different offsets:
| Formatter | Offset into a2 | Approximate position |
|---|---|---|
| wgmma.mma_async | 1,731,609 | ~1.7 MB |
| wmma.mma | 1,731,130 | ~1.7 MB |
| rsqrt | 67,573 | ~67 KB |
| copysign | 110,152 | ~110 KB |
| vavrg4 | 286,309 | ~286 KB |
| guardrails.alloc | ~1,843,620 | ~1.8 MB |
This design trades memory for speed: instead of building instruction text dynamically, ptxas stores the complete format template and fills in operand names at runtime.
Instruction Operand Accessors
All formatters query the instruction object through a uniform set of tiny accessor functions:
| Address | Size | Callers | Identity |
|---|---|---|---|
sub_70B700 | 14 B | 946 | has_predicate() |
sub_70B6E0 | 14 B | 42 | has_predicate_v2() |
sub_70B710 | 111 B | 348 | get_opcode_string() |
sub_70B780 | 151 B | 514 | get_predicate_name() |
sub_70B8E0 | 12 B | 1,449 | get_reg_operand(idx) |
sub_70B910 | 12 B | 1,656 | get_src_part0(idx) |
sub_70B920 | 12 B | 1,296 | get_src_part1(idx) |
sub_70B930 | 7 B | 68 | get_operand_count() |
sub_70B4C0 | 22 B | 46 | get_base_address() |
sub_70CA60 | 11 B | 480 | get_operand_type(idx) |
sub_70CA70 | 427 B | 191 | get_type_suffix() |
sub_70CD20 | 122 B | 158 | get_operand_offset(idx) |
sub_710860 | 39 B | 2,953 | get_data_type(idx, part) |
sub_70FA00 | 10 B | 286 | get_target_sm(idx) |
sub_70FA10 | 66 B | 7 | check_target_sm(idx, str) |
sub_709910 | 14 B | 13 | get_variant_count() |
sub_709A10 | 73 B | 46 | get_variant_string() |
sub_707CE0 | 22 B | 93 | get_address_operand(idx) |
sub_709760 | 127 B | 21 | get_comparison_op() |
sub_709FE0 | 11 B | 17 | get_rounding_mode() |
sub_70A500 | 13 B | 15 | get_saturation_mode() |
sub_70B3F0 | -- | -- | get_ftz_flag() |
sub_707530 | -- | -- | get_precision_string() |
sub_707C80 | -- | -- | get_scope_string() |
sub_7075E0 | -- | -- | get_layout_string() |
sub_707BE0 | -- | -- | get_shape_string() |
sub_70A810 | -- | -- | get_scale_string() |
All accessors read from the instruction object at *(a1+1096). The tiny sizes (7--151 bytes for most) indicate these are simple field extractions from the instruction record.
Memory Allocation
The formatter memory lifecycle uses a pool allocator:
| Function | Size | Callers | Identity |
|---|---|---|---|
sub_4280C0 | 597 B | 3,928 | arena_get_pool(ctx, table) |
sub_424070 | 2,098 B | 3,809 | pool_alloc(pool, size) |
sub_42BDB0 | 14 B | 3,825 | alloc_fail_abort() |
sub_4248B0 | 923 B | 1,215 | pool_free(ptr) |
Every formatter allocates a 50,000-byte temporary buffer, builds the instruction string via sprintf chains, measures the result with strlen, allocates an exact-size copy, and frees the temporary. The 50 KB buffer provides headroom for the largest instructions (WMMA loads produce multi-KB strings) but is wasteful for simple 2-operand instructions that generate ~50-byte strings.
Predicate Guard Printing
Predicate guards (@P0, @!P1, etc.) are printed by checking has_predicate() on the instruction, then formatting the guard via get_predicate_name():
// PTX-level predicate printing (in every formatter)
int pos = sprintf(buf, "%s", opcode_prefix);
if (sub_70B6E0(*(a1+1096))) { // has_predicate?
int64_t pred = sub_70B780(*(a1+1096)); // get_predicate_name
pos += sprintf(buf+pos, guard_fmt, pred); // e.g., "@P0 " or "@!P1 "
}
// SASS-level predicate printing (in disassembly renderer)
// sub_9DB7E0 (662 bytes, 19 callers) — emits guard through builder vtable
// calls builder->begin_predicate_guard() at vtable+936
// emits predicate register name
// calls builder->end_predicate_guard() at vtable+944
Register and Operand Formatting
Register operands are resolved from the instruction's operand array. The formatter accesses operands by index through get_reg_operand(idx), get_src_part0(idx), and get_src_part1(idx). The standard register naming follows NVIDIA conventions:
| Register class | Naming | Examples |
|---|---|---|
| General-purpose | R0--R255 | R0, R4, R255 |
| Zero register | RZ | RZ |
| Predicate | P0--P6, PT | @P0, PT |
| Uniform | UR0--UR63 | UR4, UR16 |
| Uniform predicate | UP0--UP6, UPT | UP0 |
| Constant buffer | c[bank][offset] | c[0x0][0x168] |
| Special | SR_* | SR_CTAID.X, SR_TID.X |
For the SASS disassembly renderer, the register class discriminator sub_91C840 (347 bytes, 232 callers) maps internal type codes 1--0x17 to output class IDs 0--18, covering integer registers, float registers, double registers, predicate registers, condition registers, texture/surface references, and uniform registers.
The operand encoder sub_9D12F0 (1,423 bytes, 289 callers) is the core serializer for SASS-level printing. It takes an instruction and operand index, resolves whether the operand is a register, immediate, or memory reference, handles constant buffer lookups, and fills a 64-byte (4x __m128i) encoding structure that the builder/visitor consumes.
Address and Offset Formatting
Memory operands are formatted with address space qualifiers and offset expressions:
[R4.64] — register indirect, 64-bit
[R4+0x10] — register + offset
c[0x0][0x168] — constant buffer bank 0, offset 0x168
[UR4] — uniform register indirect
The address space qualifier resolver sub_9CEB50 (185 bytes, 57 callers) combines address space information from the operand descriptor with the instruction context. For SASS-level output, the address space emitter sub_9E7B00 and related functions (sub_9E9910, sub_9E9A70) handle data type and memory space qualifiers.
Architecture-Conditional Formatting
86 of the 580 formatters contain architecture-conditional paths that check the target SM version via sub_70FA00 (numeric comparison) or sub_70FA10 (string comparison). Architecture-specific formatting reflects hardware evolution:
| SM | Era | Formatting impact |
|---|---|---|
| sm_20, sm_21 | Fermi (2010) | copysign has different operand layout (7 vs 5 fields) |
| sm_62 | Pascal mobile (2016) | vavrg4 gets per-component register formatting |
| sm_103 | Blackwell Ultra (2025) | rsqrt gains new operand layout for extended precision |
Five formatters additionally use string-based SM comparison via sub_70FA10:
- sub_4DD860 (copysign): checks "sm_20", "sm_21"
- sub_56BA60 (vavrg4): checks "sm_62"
- sub_56C8D0 (dp2a.lo): SM string comparison
- sub_577BA0 (dp2a.hi): SM string comparison
- sub_583190 (rsqrt): checks "sm_103"
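A minimal sketch of one such string-based conditional, using the copysign case from the table above (7 operand fields on Fermi, 5 on later targets). The strcmp stands in for sub_70FA10's string-based SM check; the function name is ours.

```c
#include <string.h>
#include <assert.h>

/* copysign-style arch-conditional path: Fermi targets (sm_20/sm_21)
 * used a 7-field operand layout, later targets use 5. */
static int copysign_field_count(const char *sm_name) {
    if (strcmp(sm_name, "sm_20") == 0 || strcmp(sm_name, "sm_21") == 0)
        return 7;   /* legacy Fermi operand layout */
    return 5;       /* modern layout */
}
```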
SASS Disassembly Renderer
The SASS-level renderer at 0x17F8000--0x181FFFF (~160 KB, ~123 virtual entry points) converts binary-encoded SASS instructions into textual SASS assembly. Unlike the PTX formatters (Level 1) which work from the high-level Ori IR via sprintf chains, the SASS renderer decodes the binary instruction encoding and drives a builder/visitor object through a structured sequence of emit_* calls. The builder's concrete implementation determines the output format -- text for --out-sass, comparison data for --self-check, or binary encoding verification.
Internal Layers
The subsystem splits into five layers by address range and function role:
| Layer | Range | Count | Role |
|---|---|---|---|
| A: Encoding templates | 0x17F8000--0x180FFFF | ~75 | Build per-opcode operand layout descriptors |
| B: Accessor vtable methods | 0x1810700--0x1810BFF | ~15 | ISA version/class discriminator predicates |
| C: Format-class printers | 0x1810D20--0x18167FF | ~50 | Workhorses: decode operands + emit through builder |
| D: Complex multi-format printers | 0x1817000--0x181CFFF | ~15 | Texture, multi-operand, predicated printers |
| E: Post-processing hooks | 0x181E000--0x181FFFF | ~8 | ISA-override detection, fixup dispatch |
All ~123 entry points have zero static callers, confirming they are virtual method overrides dispatched through vtables. The printer dispatch layer at 0xAA8000--0xACA000 (sub_AA9330, sub_AA9860, sub_AAB9C0, sub_AC99D0) invokes them.
Rendering Protocol
Every SASS printer receives (a1, a2) where a1 is the printer context (builder pointer at a1+24) and a2 is the binary-encoded instruction. The rendering follows a fixed protocol:
1. vtable[0](builder, instruction_kind_id) // begin_instruction
2. vtable[3760](builder, sync_mode) // set_sync_type (if applicable)
3. vtable[3768](builder) // begin_operand_list
4. sub_9DB7E0(a1, a2, 1) // emit predicate guard (@Px)
5. For each operand:
a. sub_9D12F0(&buf, a1, a2, idx, stride, mode, flag) // encode -> 64B struct
b. vtable[16](builder, kind_id, buf...) // emit_operand
6. vtable[3952](builder) // emit_saturation (.SAT)
7. vtable[3960](builder) // emit_ftz (.FTZ)
8. vtable[3968](builder) // emit_negate (.NEG)
9. vtable[4072](builder) // emit_cache_operation
10. vtable[4080](builder) // emit_eviction_hint
11. vtable[4160](builder) // end_instruction
The protocol is directly visible in decompiled code. In sub_1812F60 (16-DWORD immediate printer), the function begins with vtable[0](builder, 89) (begin instruction kind 89), calls vtable[3760] for sync type, vtable[3768] for begin operand list, sub_9DB7E0 for predicate guard, then loops 16 times calling vtable[272] (create integer operand) followed by vtable[16] (emit operand) with kind IDs 55 through 70 -- one per DWORD.
In sub_1810D20 (comparison-mode printer), the function first reads the modifier word from the operand array at instruction+84, switches on (modifier >> 4) & 0xF, calls vtable[3528]/vtable[3536] to configure comparison mode and variant, then emits 2--3 operands via the standard sub_9D12F0 + vtable[16] sequence.
Builder/Visitor Vtable
The builder object at *(a1 + 24) exposes a vtable spanning 4,160+ bytes (~520 method slots at 8 bytes each). The complete set of identified methods:
| Offset | Method | Category |
|---|---|---|
| +0 | begin_instruction(kind_id) | Framing |
| +8 | get_current_instruction() | Accessor |
| +16 | emit_operand(kind_id, operand_buf...) | Core emission |
| +24 | post_process_operand() | After-emit hook |
| +112 | get_register_size_32() | Register geometry |
| +120 | get_register_size_64() | Register geometry |
| +128 | create_register_operand() | Operand factory |
| +152 | create_memory_operand() | Operand factory |
| +192 | create_special_operand() | Operand factory |
| +208 | create_literal_operand() | Operand factory |
| +272 | create_integer_operand(value) | Operand factory |
| +304 | create_register_ref_operand() | Operand factory |
| +368 | set_address_space() | Memory qualifier |
| +936 | begin_predicate_guard() | Predicate block |
| +944 | end_predicate_guard() | Predicate block |
| +984 | set_predicate_mode() | Predicate negate/true |
| +1000 | emit_modifier() | Generic modifier |
| +1056 | set_offset_mode() | Address offset |
| +1128 | emit_width_qualifier() | .B32, .B64 |
| +1392 | set_comparison_flag() | Comparison type |
| +1760 | set_rounding_mode() | .RN, .RZ, .RM, .RP |
| +1936 | begin_sync_block() | Sync scope |
| +1944 | end_sync_block() | Sync scope |
| +2016 | set_sync_width() | Sync width |
| +2024 | set_sync_depth() | Sync depth |
| +2584 | set_uniform_flag() | .U modifier |
| +2960 | set_address_mode() | Address mode |
| +2992 | set_cache_level_a() | Cache hint (L1) |
| +3000 | set_cache_level_b() | Cache hint (L2) |
| +3096 | set_comparison_type() | Second comparison slot |
| +3128 | set_source_type_a() | Source type |
| +3136 | set_source_type_b() | Source type |
| +3144 | set_interlock_mode() | Memory ordering |
| +3152 | begin_comparison_block() | Comparison section |
| +3160 | set_comparison_width() | Comparison width |
| +3520 | set_data_width() | Operand width |
| +3528 | set_comparison_mode() | Comparison config |
| +3536 | set_comparison_variant() | Comparison variant |
| +3560 | set_conversion_type() | Conversion modifier |
| +3576 | begin_conversion() | Conversion block |
| +3760 | set_sync_type() | Synchronization type |
| +3768 | begin_operand_list() | Operand section |
| +3776 | emit_rounding_decoration() | Rounding modifier |
| +3824 | emit_texture_header() | Texture header index |
| +3952 | emit_saturation_flag() | .SAT |
| +3960 | emit_ftz_flag() | .FTZ |
| +3968 | emit_negate_flag() | .NEG |
| +4072 | emit_cache_operation() | Cache operation hint |
| +4080 | emit_eviction_hint() | Eviction priority |
| +4160 | end_instruction() | Framing |
Different concrete visitor implementations produce different output formats. The vtable design means adding a new output format (e.g., JSON, binary verification) requires only a new visitor class with no changes to any of the ~123 printer functions.
Encoding Template Builders (Layer A)
~75 functions at 0x17F8000--0x180FFFF build per-opcode instruction format descriptors that define the expected operand signature. Each function:
- Sets the SASS opcode ID: *(a2+12) = opcode_number
- Loads a 128-bit format descriptor: *(a1+8) = xmmword_23Fxxxx (from rodata)
- Fills up to 10 operand slots at a1+24..a1+120 with type codes, register class IDs, and modifier flags
- Writes expected-value constraints at a1+64..a1+160 (-1 = any)
- Writes type constraint modifiers at a1+104..a1+200
From sub_17F8210 (opcode 274):
*(a2+12) = 274; // SASS opcode ID
*(a1+8) = _mm_loadu_si128(&xmmword_23F21B0); // 128-bit descriptor
*(a1+24) = 10; // operand 0: predicate register type
*(a1+64) = -1; // operand 0: any value accepted
*(a1+104) = 0; // operand 0: no modifier constraint
*(a1+28) = 17; // operand 1: specific register class
*(a1+68) = -1; // operand 1: any value
*(a1+108) = 3; // operand 1: modifier constraint 3
// ... remaining operands bulk-copied via SSE from xmmword_23F1C60 table
The 128-bit descriptors at xmmword_23F1xxx--23F2xxx encode canonical operand layouts. The bulk SSE copies (_mm_load_si128/_mm_loadu_si128) fill 4 operand slots per iteration, making the template builders compact despite handling up to 10 operand positions.
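The builder pattern can be restated as a plain-C sketch. The struct layout and names below are ours: field groups mirror the decompiled access pattern (type codes near a1+24, expected values near a1+64, modifier constraints near a1+104), but exact offsets and the rodata contents are not reproduced; memcpy stands in for the SSE copies.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical template record modeling a Layer A descriptor. */
typedef struct {
    uint8_t descriptor[16];  /* 128-bit layout template (*(a1+8)) */
    int32_t type_code[10];   /* operand type codes */
    int32_t expected[10];    /* expected values; -1 = any accepted */
    int32_t modifier[10];    /* type constraint modifiers */
} op_template;

/* Sketch of sub_17F8210 (opcode 274), per the decompiled excerpt. */
static void build_template_274(op_template *t, int32_t *opcode_out,
                               const uint8_t rodata_desc[16]) {
    *opcode_out = 274;                       /* *(a2+12) = opcode ID */
    memcpy(t->descriptor, rodata_desc, 16);  /* the _mm_loadu_si128 copy */
    t->type_code[0] = 10; t->expected[0] = -1; t->modifier[0] = 0;
    t->type_code[1] = 17; t->expected[1] = -1; t->modifier[1] = 3;
    for (int i = 2; i < 10; i++) {           /* bulk-copied from a rodata
                                              * table in the real builder */
        t->type_code[i] = 0; t->expected[i] = -1; t->modifier[i] = 0;
    }
}
```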
Format-Class Printers (Layer C)
The instruction's format class at instruction+76 determines which printer handles it. The dispatch computes index = format_class - 11, then looks up dword_23B39E0[index] for the encoding strategy:
| Strategy | Value | Description |
|---|---|---|
| Default | 0 | Standard register fields |
| Wide | 1 | 9-bit register fields, 8 sequential operands |
| Pair | 2 | 2x register fields per operand |
| Extended | 3 | Extra modifier bits |
| Special | 4+ | Texture header, 16-DWORD immediate |
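The index arithmetic can be sketched as follows. Only the format_class - 11 indexing and the 0--4 strategy numbering are recovered facts; the per-index contents of the real 10-entry dword_23B39E0 table are placeholders here.

```c
#include <assert.h>

/* Sketch of the dword_23B39E0 lookup: format class -> strategy. */
static int encoding_strategy(int format_class) {
    static const int strategy_table[10] =
        { 0, 1, 2, 3, 4, 4, 4, 4, 4, 4 };    /* placeholder contents */
    int index = format_class - 11;
    if (index < 0 || index >= 10)
        return -1;                            /* not a printable class */
    return strategy_table[index];
}
```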
Printer functions for each format class:
| Function | Size | Format class | Evidence |
|---|---|---|---|
sub_1810D20 | 8.8 KB | Comparison-mode | Switches on (modifier >> 4) & 0xF: case 4 emits comparison with two variants, case 6 emits single-variant. Calls vtable[3528]/vtable[3536] for comparison config |
sub_18111F0 | 11.6 KB | Wide-operand | 8 sequential sub_9D12F0 calls with indices 0--7 |
sub_1811E20 | 11.6 KB | Wide + special | Both sub_9D12F0 and sub_9CF740 calls |
sub_1812890 | 10.5 KB | Register + constant | sub_9CF8A0 for constant folding |
sub_1812F60 | 15.3 KB | 16-DWORD immediate | sub_7E4CF0 iterator, 16x vtable[272] + vtable[16] with kind IDs 55--70 |
sub_18141C0 | 6.5 KB | Per-operand comparison | Dispatch entry from sub_1820000 |
sub_1814660 | 7.1 KB | Load/store | sub_C49400 + sub_9CEB50 for address space |
sub_1814B10 | 17.6 KB | Load/store + predicated | sub_C49400, sub_91E860, sub_91C840, sub_9CEB50 |
sub_1815810 | 12.7 KB | Wide variant | Similar to sub_1811E20 |
sub_1816000 | 13.1 KB | Data-type qualified | sub_9E9910 for data type emission |
sub_18167F0 | 11.8 KB | Memory-access | sub_9E7B00 for address space qualifier, sub_A3B930 for operand modifier |
sub_1816FC0 | 6.4 KB | Modifier-heavy | Checks bits 6, 7, 14 of operand word for negate/absolute modifiers |
Texture/Surface Printer (Layer D)
The texture/surface printer sub_18189C0 is the largest at 45.2 KB. It handles the complete texture and surface instruction families:
sub_18189C0 (45.2 KB, 1361 lines)
├─ Read opcode at +72, mask to canonical form
├─ Giant switch on opcodes:
│ 18 (FADD/FMUL?), 119 (MUFU?), 186 (TEX),
│ 211 (TLD), 283 (SULD), 315 (SUST)
├─ Check operand modifier bits for predication/negation
├─ dword_23B39E0[format_class-11] → subtype (values 0-4)
├─ word_23B3A58[subtype] → builder kind ID
├─ Emit predicate: vtable[89], begin_operand_list
├─ sub_9DB7E0 → predicate guard
├─ For each operand: sub_9D12F0 → builder->emit_operand
├─ If format > 9: sub_1817C50 (12.8KB) → texture header index
│ └─ Linearizes from 2D bit fields: bits[0:8] x bits[9:17] → index 0-52
├─ Emit cache/eviction: vtable[4080], vtable[4072]
├─ Emit saturation/ftz: vtable[3952], vtable[3960], vtable[3968]
└─ Return 1 on success
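The linearization step performed by sub_1817C50 can be sketched as below. The bit ranges and the 0--52 output range are recovered; the row-major combination and the row_width parameter are assumptions for illustration, not the verified formula.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: combine two 9-bit fields into one texture header index.
 * ASSUMED row-major formula; only the bit ranges are recovered. */
static int texture_header_index(uint32_t v, unsigned row_width) {
    uint32_t x = v & 0x1FF;          /* bits [0:8]  */
    uint32_t y = (v >> 9) & 0x1FF;   /* bits [9:17] */
    return (int)(y * row_width + x);
}
```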
The multi-operand printer sub_181B370 (27.8 KB) handles instructions with many operand variants (VOTE at opcode 0x7A, multi-op at 0x138), emitting up to 12 sequential operands through sub_9CEF90 (extended operand encoder) and sub_9CF740 (immediate encoder).
ISA-Override Detection (Layer E)
sub_181E1D0 (7.3 KB) is a post-processing fixup dispatcher called from sub_AA9330. It performs ISA-target-aware fixups by comparing vtable method pointers against known discriminator functions:
// If the current ISA target has the DEFAULT implementation:
if (vtable[111] == sub_1810B90) // default comparison handler
apply_default_fixup();
// Otherwise the target has OVERRIDDEN the method:
else
apply_specialized_fixup(); // e.g., sub_1BCBB90 for arch-specific
This mechanism supports 45 opcodes (0x12, 0x16, 0x24, ..., 0x141) and dispatches to architecture-specific post-processors (sub_1BCBB90, sub_1BCC2D0, sub_BCCF80, sub_1BCF120) or re-emits modifiers via sub_9E9910/sub_9E9A70.
The discriminator functions at 0x1810700--0x1810BFF (~15 tiny functions) serve as sentinel values: sub_1810720, sub_1810750, sub_18108A0, sub_18108D0, sub_1810B90. Their identity (which function pointer is stored) determines which specialization path the fixup dispatcher takes.
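The sentinel mechanism reduces to a pointer identity check, sketched below. The handler bodies are empty stand-ins for sub_1810B90 and an architecture-specific override such as sub_1BCBB90; the point is that the stored pointer itself is the discriminator, not anything the function computes.

```c
#include <assert.h>

typedef void (*fixup_fn)(void);

/* Stand-ins for the default handler (sub_1810B90) and an
 * arch-specific override (e.g., sub_1BCBB90). */
static void default_comparison_handler(void) {}
static void arch_specific_handler(void)      {}

/* Identity comparison against the sentinel, not a call: if the vtable
 * slot still holds the default, the default fixup path applies. */
static int needs_default_fixup(fixup_fn vtable_slot) {
    return vtable_slot == default_comparison_handler;
}
```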
Instruction Object Layout
The binary instruction object (a2) used by all SASS printers:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Context/vtable pointer |
| +8 | 8 | ISA context pointer (register file, instruction info table) |
| +24 | 8 | Builder/visitor object pointer |
| +32 | 8 | Operand metadata pointer |
| +40 | 1 | Half-precision flag |
| +48 | 8 | Operand modifier context |
| +72 | 4 | Opcode (bits 12--13 are variant flags, masked via &0xCFFF) |
| +76 | 4 | Format class (subtract 11 for dword_23B39E0[] indexing) |
| +80 | 4 | Operand count |
| +84+ | 8*N | Operand array (N operands, 8 bytes each) |
Each 8-byte operand slot encodes:
| Bits | Word | Field |
|---|---|---|
| 28--30 | 0 | Operand type tag: 1=register, 4=address, 5=constant buffer, 7=special |
| 0--23 | 0 | Register/constant index |
| 24--27 | 0 | Modifier flags |
| 0 | 1 | Negate |
| 1 | 1 | Absolute value |
| 20 | 1 | Constant pool flag (0x100000) |
| 29 | 1 | Sign extension (0x20000000) |
| 30 | 1 | Uniform flag (0x40000000) |
| 31 | 1 | Negation modifier (0x80000000) |
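The slot layout above translates directly into a decoder. The bit positions come from the table; the struct and field names are ours, and only a subset of the word-1 flags is shown.

```c
#include <stdint.h>
#include <assert.h>

/* Decoded view of one 8-byte operand slot (two 32-bit words). */
typedef struct {
    unsigned type_tag;   /* w0 bits 28--30: 1=reg, 4=addr, 5=cbuf, 7=special */
    unsigned index;      /* w0 bits 0--23 : register/constant index */
    unsigned mod_flags;  /* w0 bits 24--27: modifier flags */
    int negate;          /* w1 bit 0 */
    int absolute;        /* w1 bit 1 */
    int uniform;         /* w1 bit 30 (0x40000000) */
} operand_slot;

static operand_slot decode_operand(uint32_t w0, uint32_t w1) {
    operand_slot s;
    s.type_tag  = (w0 >> 28) & 0x7;
    s.index     =  w0        & 0xFFFFFF;
    s.mod_flags = (w0 >> 24) & 0xF;
    s.negate    = (w1 >> 0)  & 1;
    s.absolute  = (w1 >> 1)  & 1;
    s.uniform   = (w1 >> 30) & 1;
    return s;
}
```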
Global Lookup Tables
| Table | Size | Index | Purpose |
|---|---|---|---|
dword_23B39E0[10] | 40 B | format_class - 11 | Format class to encoding strategy (0--4) |
word_23B3A58[4] | 8 B | Subtype from above | Subtype to builder kind_id mapping |
dword_23B3A20[14] | 56 B | register_class - 3 | Register class to comparison type ID |
dword_23B3980[7] | 28 B | width_field - 1 | Encoded width to builder width value |
xmmword_23F1xxx--23F2xxx | ~16 B each | Per-opcode | 128-bit operand layout descriptor templates |
SASS Renderer Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
sub_17F8210 | ~1.3 KB | 0 (vtable) | Encoding template builder (opcode 274) | 95% |
sub_1810D20 | 8.8 KB | 0 (vtable) | Comparison-mode format-class printer | 90% |
sub_18111F0 | 11.6 KB | 0 (vtable) | Wide-operand format-class printer | 85% |
sub_1811E20 | 11.6 KB | 0 (vtable) | Wide-operand + special encoding printer | 85% |
sub_1812890 | 10.5 KB | 0 (vtable) | Register + constant operand printer | 85% |
sub_1812F60 | 15.3 KB | 0 (vtable) | 16-DWORD immediate printer | 90% |
sub_18141C0 | 6.5 KB | 0 (vtable) | Per-operand comparison printer | 85% |
sub_1814660 | 7.1 KB | 0 (vtable) | Load/store with address space printer | 85% |
sub_1814B10 | 17.6 KB | 0 (vtable) | Load/store + predication printer | 85% |
sub_1815810 | 12.7 KB | 0 (vtable) | Wide-operand variant printer | 80% |
sub_1816000 | 13.1 KB | 0 (vtable) | Data-type qualified printer | 85% |
sub_18167F0 | 11.8 KB | 0 (vtable) | Memory-access instruction printer | 85% |
sub_1816FC0 | 6.4 KB | 0 (vtable) | Modifier-heavy instruction printer | 85% |
sub_1817C50 | 12.8 KB | ~1 | Texture header index encoder | 90% |
sub_18189C0 | 45.2 KB | 0 (vtable) | Texture/surface instruction printer | 92% |
sub_181B370 | 27.8 KB | 0 (vtable) | Multi-operand instruction printer | 88% |
sub_181CF60 | 14.0 KB | 0 (vtable) | Predicated instruction printer | 85% |
sub_181D9B0 | 12.6 KB | 0 (vtable) | Load/store variant printer | 80% |
sub_181E1D0 | 7.3 KB | ~1 | ISA-override fixup dispatcher | 90% |
sub_181E630 | 14.7 KB | ~1 | Comparison instruction post-processor | 88% |
sub_181F000 | 7.6 KB | ~1 | Data-type specialized printer | 75% |
sub_181F4F0 | 17.3 KB | ~1 | Multi-variant data-type printer | 80% |
CLI Integration
--verbose / -v
Enables printing of code generation statistics after compilation. The statistics printers at sub_ABBA50--sub_ABEB50 (8 SM-variant clones, 7,603 bytes each) emit post-scheduling metrics in "# [...] " comment format.
--forcetext
Forces text-mode SASS output regardless of the default binary output mode. Internal flag: "forcetext=%d".
--out-sass
Generates reconstituted SASS text from the Capsule Mercury representation. Used for debugging the capmerc encode/decode roundtrip. Triggers the SASS text Flex lexer sub_720F00 (64 KB) for parsing in --self-check mode.
--self-check
Roundtrip verification for Capsule Mercury: encodes the instruction stream to capmerc format, decodes it back, renders both original and reconstituted as SASS text, and compares. The Flex lexer at sub_720F00 parses the text output for comparison. The SASS text formatter sub_719D00 (50 KB) builds the output for self-check.
DUMPIR
The DUMPIR environment variable (and related knobs) triggers intermediate representation dumps at named phases. Phase 129 (DumpNVuCodeText) is one of the dump targets, emitting the full instruction stream as formatted text when DUMPIR includes that phase name.
Formatter Size Distribution
Function size directly correlates with PTX instruction complexity:
| Tier | Size range | Count | Description |
|---|---|---|---|
| Tiny | < 500 B | 13 | Simple 2-operand (wgmma.fence: 295 B) |
| Small | 500--1,000 B | 191 | Standard 3--4 operand (copysign: 794 B) |
| Medium | 1,000--2,000 B | 319 | Instructions with modifiers (bfind: 1,130 B) |
| Large | 2,000--4,000 B | 36 | Arch-conditional paths (membar: 2,788 B) |
| Very large | 4,000--6,000 B | 20 | Complex multi-form (tex.grad: 5,636 B) |
| Monster | 6,000--10,000 B | 2 | WMMA matrix loads (wmma.load.b: 9,757 B) |
The WMMA load/store formatters account for 34,423 bytes (roughly 4% of the total formatter code range), reflecting the combinatorial explosion of matrix shapes, data types, layouts, and architectures.
Named Opcode Dispatch Table
The 121 named opcodes registered at a1+808 by sub_5D4190:
| Category | Opcodes |
|---|---|
| Memory fence | membar |
| Conversion | cvt, tensormap.replace |
| Math | div, div.full, rem, rcp, rsqrt, ex2, lg2, sqrt, tanh, copysign |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp |
| Load/store | _ldldu, ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk, prefetch |
| Texture | tex, tex.base, tex.level, tex.grad, tld4, sured.b |
| Video SIMD | vadd--vmad, vadd2--vavrg2, vadd4--vavrg4 |
| Dot product | dp2a.lo, dp2a.hi, dp4a |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta, barrier.cta, + .arrive/.red variants, bar.warp |
| Warp ops | vote, shfl, match, redux |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor |
| Cache policy | createpolicy.range, createpolicy.fractional, createpolicy.cvt |
| Multi-memory | multimem.ld_reduce, multimem.st, multimem.red |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma, mma |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws |
| Guardrails | _tcgen05.guardrails.is_phase_valid, .are_columns_allocated, .is_current_warp_valid_owner, .in_physical_bounds, .allocation_granularity, .datapath_alignment, .sp_consistency_across_idesc_mod, .check_sparse_usage |
The remaining ~400 opcodes (arithmetic, logic, load/store, control flow, etc.) are dispatched through hash values at a1+816.
SASS Printer Key Functions
| Address | Size | Callers | Identity |
|---|---|---|---|
sub_5D4190 | 12.9 KB | 1 | PTX instruction text dispatch + intrinsic registration |
sub_5D1660 | 46 KB | 1 | Intrinsic library registration (608 entries) |
sub_5FF700 | 354 KB | -- | Builtin function declaration emitter (prototype generator) |
sub_4DA340 | 61 B | 1,080 | Builtin declaration lookup helper |
sub_719D00 | 50 KB | -- | SASS text formatter (self-check output builder) |
sub_720F00 | 64 KB | -- | Flex lexer for SASS text parsing (self-check input) |
sub_9D12F0 | 1.4 KB | 289 | Operand encoder (64-byte struct per operand) |
sub_9DB7E0 | 662 B | 19 | Predicate guard printer |
sub_91C840 | 347 B | 232 | Register class discriminator |
sub_9CEB50 | 185 B | 57 | Address space qualifier resolver |
sub_91E860 | 31 B | 214 | Data size accessor |
sub_18189C0 | 45.2 KB | -- | Texture/surface instruction printer (largest SASS printer) |
sub_181B370 | 27.8 KB | -- | Multi-operand instruction printer |
sub_1817C50 | 12.8 KB | -- | Texture header index encoder |
Instruction Data Flow
┌──────────────────────────────────┐
│ Ori IR Instruction Object │
│ (instruction data at *(a1+1096)) │
└────────────────┬─────────────────┘
│
┌─────────────────────┼──────────────────────┐
│ │ │
v v v
sub_70B6E0/B700 sub_70B8E0/B910/B920 sub_70CA60/CA70
has_predicate() get_reg_operand(idx) get_operand_type()
get_predicate_name() get_src_part0/1(idx) get_type_suffix()
│ │ │
└─────────────────────┼──────────────────────┘
│
v
┌─────────────────────┐
│ sprintf() chain │
│ into 50 KB buffer │
│ using format table │
│ at a2+offset │
└──────────┬──────────┘
│
v
┌─────────────────────┐
│ strlen → alloc → │
│ strcpy → free temp │
└──────────┬──────────┘
│
v
┌─────────────────────┐
│ Formatted PTX text │
│ string (exact size) │
└─────────────────────┘
Cross-References
- Code Generation Overview -- pipeline context and subsystem map
- SASS Instruction Encoding -- binary encoding format that this subsystem renders
- Mercury Encoder Pipeline -- source of instructions for text generation
- Capsule Mercury & Finalization -- `--self-check` and `--out-sass` integration
- CLI Options -- `--verbose`, `--forcetext`, `--out-sass` flags
- Knobs System -- DUMPIR knob triggering phase 129/130
- Phase Manager -- phase 129/130 registration and execution
SM Architecture Map
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas validates the --gpu-name target against three sorted lookup tables, constructs a profile object with family metadata and a CUDA_ARCH macro value, then populates seven parallel dispatch tables that drive capability checks, code generation factory selection, performance modeling, and occupancy calculation throughout the compiler. The default target is sm_75 (Turing). Every downstream decision -- instruction legality, encoder selection, register file geometry, scheduling latencies -- routes through the profile object built here.
| SM validation | sub_6765E0 (54KB, profile object construction) |
| Capability dispatch | sub_607DB0 (14KB, 7 parallel hash maps) |
| Default target | sub_6784B0 -- returns sm_75 when --gpu-name is omitted |
| Validation tables | 3 bsearch arrays: base (32 entries at unk_1D16220), f (6 entries at unk_1D16160), a (7 entries at unk_1D161C0) |
| Per-SM accessors | sub_609XXX cluster (24 functions, ~1.2KB each) |
| Per-SM intrinsic init | sub_60AXXX cluster (12 functions, ~1KB each) |
| Profile lookup | sub_608D70 (384 bytes, dispatcher registered via sub_42BEC0) |
Per-SM Deep Dives:
- Turing & Ampere (SM 75--88) -- Baseline feature set, codegen factory 24577/28673
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, codegen factory 32768
- Blackwell (SM 100--121) -- tcgen05, arch/family gating, codegen factory 36864
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
Complete SM Table
23 active SM base targets ship in ptxas v13.0.88 (plus 9 legacy and 2 internal/alias entries retained in the validation table for backward compatibility). Each base target optionally has an a (accelerated) and/or an f (feature-reduced) sub-variant. The __CUDA_ARCH column shows the value the __CUDA_ARCH__ predefined macro expands to.
| SM | __CUDA_ARCH | Family | Product | Codegen Factory | Status | Deep Dive |
|---|---|---|---|---|---|---|
| sm_75 | 750 | Turing | TU10x (RTX 20xx) | 24577 | Production | turing-ampere |
| sm_80 | 800 | Ampere | A100 | 28673 | Production | turing-ampere |
| sm_86 | 860 | Ampere | A40/A10/RTX 30xx | 28673 | Production | turing-ampere |
| sm_87 | 870 | Ampere | Orin (Jetson) | 28673 | Production | turing-ampere |
| sm_88 | 880 | Ampere | -- | 28673 | Production | turing-ampere |
| sm_89 | 890 | Ada Lovelace | AD10x (RTX 40xx) / L40S | 28673 | Production | ada-hopper |
| sm_90 / sm_90a | 900 | Hopper | H100 / H200 | 32768 | Production | ada-hopper |
| sm_100 / sm_100a / sm_100f | 1000 | Blackwell | B200 (datacenter) | 36864 | Production | blackwell |
| sm_103 / sm_103a / sm_103f | 1030 | Blackwell Ultra | GB300 (datacenter) | 36864 | Production | blackwell |
| sm_110 / sm_110a / sm_110f | 1100 | Jetson Thor | Thor SoC (auto/robotics) | 36864 | Production | blackwell |
| sm_120 / sm_120a / sm_120f | 1200 | Blackwell (sm120) | RTX 50xx / RTX Pro | 36864 | Production | blackwell |
| sm_121 / sm_121a / sm_121f | 1210 | Blackwell (sm120) | DGX Spark | 36864 | Production | blackwell |
The family name stored in the profile object (from sub_6765E0) uses NVIDIA's internal naming: "Turing", "Ampere", "Hopper", "Blackwell". Ada Lovelace (sm_89) is stored as Ampere-derived internally despite being a distinct microarchitecture. sm_120/121 use "Blackwell" internally despite being a different consumer microarchitecture from sm_100 datacenter Blackwell.
Suffix Semantics
ptxas uses three suffix modes to control forward compatibility. The distinction is critical: it determines which SASS binary a cubin can execute on.
| Suffix | Meaning | Forward Compatibility | Validation Table |
|---|---|---|---|
| (none) | Base feature set | Full forward-compat across generations | unk_1D16220 (32 entries) |
| a (accelerated) | Architecture-locked, advanced features | No forward compat -- locked to specific silicon | unk_1D161C0 (7 entries) |
| f (feature-reduced) | Same-family forward compat only | Forward-compat within family, not across | unk_1D16160 (6 entries) |
The base variant (no suffix) produces SASS that runs on the named architecture and all later ones: sm_80 code runs on sm_86, sm_89, sm_90, sm_100, etc. The a suffix locks the binary to exact silicon: sm_90a code runs only on H100/H200 hardware and will not execute on Blackwell. The f suffix allows forward compatibility within the same family: sm_100f code runs on sm_100 and sm_103 (both Blackwell datacenter) but not on sm_120 (different family).
Compilation rules from help text:
- sm_90a PTX must be compiled to sm_90a SASS (no cross-arch compilation)
- sm_100f PTX can compile to sm_100f or sm_103f SASS (same family)
- sm_100a PTX must compile to sm_100a SASS only
- Base sm_100 PTX compiles to any sm_100+ SASS
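The three compatibility rules above can be modeled as a small predicate. This is an illustrative sketch, not recovered code: the function name, the family grouping table, and the string parsing are our assumptions, built only from the suffix semantics described in this section.

```python
# Hypothetical model of ptxas suffix compatibility rules (illustrative only).
FAMILY = {  # base SM id -> family group, per the SM table above
    100: "blackwell-dc", 103: "blackwell-dc",
    110: "thor", 120: "blackwell-consumer", 121: "blackwell-consumer",
}

def sass_runs_on(built_for: str, hardware: str) -> bool:
    """Can SASS compiled for `built_for` execute on `hardware`?"""
    def parse(target):
        suffix = target[-1] if target[-1] in "af" else ""
        num = int(target.removeprefix("sm_").rstrip("af"))
        return num, suffix
    b_num, b_suf = parse(built_for)
    h_num, _ = parse(hardware)
    if b_suf == "a":   # architecture-locked: exact silicon only
        return b_num == h_num
    if b_suf == "f":   # forward-compat within the same family only
        return FAMILY.get(b_num) == FAMILY.get(h_num) and h_num >= b_num
    return h_num >= b_num   # base: full forward compatibility
```

Under this model, `sass_runs_on("sm_90a", "sm_100")` is false while `sass_runs_on("sm_100f", "sm_103")` is true, matching the help-text rules.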
Sub-Variant Expansion
| Base | a Variant | f Variant | CUDA_ARCH (a) | CUDA_ARCH (f) |
|---|---|---|---|---|
| sm_90 | sm_90a | -- | 90a0 | -- |
| sm_100 | sm_100a | sm_100f | 100a0 | 100f0 |
| sm_101 | sm_101a | sm_101f | -- | -- |
| sm_103 | sm_103a | sm_103f | 103a0 | 103f0 |
| sm_110 | sm_110a | sm_110f | 110a0 | 110f0 |
| sm_120 | sm_120a | sm_120f | 120a0 | 120f0 |
| sm_121 | sm_121a | sm_121f | 121a0 | 121f0 |
sm_75 through sm_89 have no a or f variants. sm_90 has only the a variant (no f). All Blackwell-era targets (sm_100+) have both a and f. sm_101 is a legacy alias for sm_110 (Jetson Thor, original internal designation); it passes validation but is not registered as a profile object, so its CUDA_ARCH values are not populated.
SM Validation Tables
Target name validation uses three sorted arrays searched via bsearch(). The CLI parser extracts the SM string from --gpu-name, strips any suffix, and searches the appropriate table.
Base Table -- unk_1D16220 (32 entries)
Contains all valid base SM names without suffix, sorted by numeric SM ID. Includes legacy architectures no longer supported for active compilation but retained for validation, plus two internal/alias entries. Each entry is 12 bytes: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}. The bsearch comparison (sub_484B70) compares the numeric sm_id extracted from the --gpu-name string via sscanf.
sm_10, sm_11, sm_12, sm_13, // Tesla (legacy, PTX 1.0--1.2)
sm_20, sm_21, // Fermi (legacy, PTX 2.0)
sm_30, sm_32, sm_35, sm_37, // Kepler (legacy, PTX 3.0--4.1)
sm_50, sm_52, sm_53, // Maxwell (legacy, PTX 4.0--4.2)
sm_60, sm_61, sm_62, // Pascal (legacy, PTX 5.0)
sm_70, sm_72, // Volta (legacy, PTX 6.0--6.1)
sm_75, // Turing (active, PTX 6.3)
sm_80, sm_82, sm_86, sm_87, sm_88, sm_89, // Ampere/Ada (active, PTX 6.2--7.8)
sm_90, // Hopper (active, PTX 7.8)
sm_100, sm_101, sm_103, sm_110, sm_120, sm_121 // Blackwell (active, PTX 8.6--9.0)
sm_82 (PTX 6.2): Undocumented internal Ampere target. Not registered in sub_6765E0 (no profile object). Serves as the SASS opcode generation boundary (SM82_FIRST/SM82_LAST, opcode indices 172--193). The anomalously low PTX version requirement (6.2 vs sm_80's 7.0) suggests it was an early development target added before PTX ISA versioning was finalized.
sm_101 (PTX 8.6): Original internal designation for Jetson Thor, renamed to sm_110 in a later CUDA release. Both entries coexist in the validation table for backward compatibility with PTX files referencing the old name. sub_6765E0 registers only sm_110; sm_101 is validation-only.
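The lookup described above can be sketched in a few lines, assuming the 12-byte `{uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}` record layout recovered from unk_1D16220. Only three sample entries are shown, not the full 32-entry table; the function name is ours.

```python
import struct
from bisect import bisect_left

# Three sample records packed the way the binary stores them (little-endian).
raw = b"".join(struct.pack("<III", *e) for e in [
    (75, 6, 3),   # sm_75 -> PTX 6.3
    (80, 7, 0),   # sm_80 -> PTX 7.0
    (90, 7, 8),   # sm_90 -> PTX 7.8
])

def lookup(sm_name: str):
    """Extract the numeric id (as the sscanf in sub_484B70 does), then
    binary-search the sorted 12-byte records, bsearch-style."""
    sm_id = int(sm_name.removeprefix("sm_"))
    ids = [struct.unpack_from("<I", raw, i * 12)[0] for i in range(len(raw) // 12)]
    pos = bisect_left(ids, sm_id)
    if pos < len(ids) and ids[pos] == sm_id:
        return struct.unpack_from("<III", raw, pos * 12)  # (sm_id, major, minor)
    return None
```

`lookup("sm_80")` yields `(80, 7, 0)`; an unknown target such as `sm_42` returns `None`, which in ptxas surfaces as a fatal unknown-gpu-name diagnostic.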
Accelerated Table -- unk_1D161C0 (7 entries)
sm_90a, sm_100a, sm_101a, sm_103a, sm_110a, sm_120a, sm_121a
One Hopper entry, six Blackwell entries. sm_101a is the legacy alias for sm_110a (Jetson Thor, original internal designation).
Feature-Reduced Table -- unk_1D16160 (6 entries)
sm_100f, sm_101f, sm_103f, sm_110f, sm_120f, sm_121f
No Hopper entry (sm_90 has no f variant). All Blackwell-era. sm_101f is the legacy alias for sm_110f.
Architecture Registration -- sub_6765E0
This 54KB function constructs profile objects for every SM version. Each profile contains:
| Field | Content | Example (sm_90) |
|---|---|---|
| SM name | "sm_90" | "sm_90" |
| Compute name | "compute_90" | "compute_90" |
| Family name | "Hopper" | "Hopper" |
| CUDA_ARCH macro | Decimal integer | 900 |
| LTO name | "lto_90" | "lto_90" |
| isaClass | Architecture class ID | -- |
The function registers each profile into three hash maps indexed by sm_XX, compute_XX, and lto_XX strings. This allows lookup by any of the three naming conventions used in different contexts (CLI, PTX .target directive, LTO linking).
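A minimal model of that triple-keyed registration (dict-based; the field names and `register` helper are ours, not recovered identifiers):

```python
# One profile object, three lookup keys -- so the CLI, the PTX .target
# directive, and LTO linking can each resolve the same profile under
# their own naming convention.
profiles = {}

def register(sm_id: int, family: str, cuda_arch: int):
    profile = {"sm": f"sm_{sm_id}", "compute": f"compute_{sm_id}",
               "lto": f"lto_{sm_id}", "family": family, "cuda_arch": cuda_arch}
    for key in (profile["sm"], profile["compute"], profile["lto"]):
        profiles[key] = profile
    return profile

register(90, "Hopper", 900)
assert profiles["sm_90"] is profiles["lto_90"]   # same object, different keys
```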
Family assignment from sub_6765E0:
| SM Range | Family String | Notes |
|---|---|---|
| sm_75 | "Turing" | Single entry |
| sm_80, sm_86, sm_87, sm_88 | "Ampere" | Includes sm_88 (undocumented Ada variant) |
| sm_89 | "Ampere" | Ada Lovelace stored as Ampere-derived internally |
| sm_90/90a | "Hopper" | Single silicon, two feature levels |
| sm_100/100a/100f | "Blackwell" | Datacenter B200 |
| sm_103/103a/103f | "Blackwell" | Blackwell Ultra (GB300) |
| sm_110/110a/110f | "Blackwell" | Jetson Thor -- same family string despite different product |
| sm_120/120a/120f | "Blackwell" | Consumer/enterprise (RTX 50xx) -- different uarch, same string |
| sm_121/121a/121f | "Blackwell" | DGX Spark |
All sm_100 through sm_121 share the "Blackwell" family string internally, even though sm_110 (Jetson Thor) and sm_120 (consumer RTX 50xx) are distinct microarchitectures. The compiler distinguishes them through the capability dispatch tables, not through family name.
Capability Dispatch -- sub_607DB0
The capability dispatch initializer builds 7 parallel hash maps at initialization time, protected by a once-guard (byte_29FE1D8). Each map indexes sm_XX / compute_XX strings to per-architecture values or handler functions. Error recovery uses setjmp/longjmp.
| Map | Global | Purpose | Value Type |
|---|---|---|---|
| 1 | qword_29FE1D0 | Handler A (primary codegen) | Function pointer |
| 2 | qword_29FE1C8 | Handler B (secondary codegen) | Function pointer |
| 3 | qword_29FE1C0 | Intrinsic table initializer | Function pointer |
| 4 | qword_29FE1B8 | Capability flags | Byte value |
| 5 | qword_29FE1B0 | Profile registration | Registered via sub_42BEC0 |
| 6 | qword_29FE1A8 | Perf-stats / occupancy handler E | Function pointer |
| 7 | qword_29FE1A0 | Perf-stats / occupancy handler F | Function pointer |
Handler Function Assignments
Each SM version registers its own handler functions into these maps. Functions within the same suffix group (e.g., sm_100/100a/100f) share all handlers -- they are the same silicon with different feature exposure.
Map 1 -- Handler A (per SM):
| SM | Handler A | SM | Handler A |
|---|---|---|---|
| sm_75 | sub_609B70 | sm_100 | sub_609C30 |
| sm_80 | sub_609CC0 | sm_110 | sub_609F30 |
| sm_86 | sub_609D50 | sm_103 | sub_608F20 |
| sm_87 | sub_609F00 | sm_120 | sub_609E40 |
| sm_88 | sub_609E70 | sm_121 | sub_609ED0 |
| sm_89 | sub_609E10 | | |
| sm_90 | sub_609DB0 | | |
Map 2 -- Handler B (per SM):
| SM | Handler B | SM | Handler B |
|---|---|---|---|
| sm_75 | sub_609B40 | sm_100 | sub_609BD0 |
| sm_80 | sub_609C90 | sm_110 | sub_608F50 |
| sm_86 | sub_609D80 | sm_103 | sub_609D20 |
| sm_87 | sub_609DE0 | sm_120 | sub_609C60 |
| sm_88 | sub_609EA0 | sm_121 | sub_609BA0 |
| sm_89 | sub_609CF0 | | |
| sm_90 | sub_609C00 | | |
Map 3 -- Intrinsic table initializer (per SM):
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | | |
| sm_90 | sub_60A5F0 | | |
Shared Handler Groups
Sub-variants within a base SM share all handler functions, confirming they are identical silicon:
| Group | Members | Shared Handlers |
|---|---|---|
| Hopper | sm_90, sm_90a | All 7 maps |
| Blackwell DC | sm_100, sm_100a, sm_100f | All 7 maps |
| Blackwell Ultra | sm_103, sm_103a, sm_103f | All 7 maps |
| Jetson Thor | sm_110, sm_110a, sm_110f | All 7 maps |
| Consumer | sm_120, sm_120a, sm_120f | All 7 maps |
| DGX Spark | sm_121, sm_121a, sm_121f | All 7 maps |
Codegen Factory Values
The profile object stores an encoded architecture identifier at a known offset (visible as field +348 on the profile pointer chain, e.g., *(_QWORD *)(a1+1584)+348). This value is compared throughout the compiler to gate features:
| Codegen Factory | SM Range | SASS ISA Generation |
|---|---|---|
| 24577 | sm_75 | Turing (SM 7.5) |
| 28673 | sm_80 -- sm_89 | Ampere / Ada (SM 8.x) |
| 32768 | sm_90 | Hopper (SM 9.0) |
| 36864 | sm_100 -- sm_121 | Blackwell (SM 10.x -- 12.x) |
These values appear in feature-gating checks. For example, FMA/DFMA combining in the peephole optimizer checks profile[+372] > 28673 to require sm_70+ capability. The exact encoding formula is (isa_generation << 12) | variant, where the high bits identify the SASS instruction set generation.
Related encoded values seen in the binary:
12288 = sm_30 (Kepler) // 3 << 12
16385 = sm_50 (Maxwell) // 4 << 12 | 1
20481 = sm_50 alt (Maxwell) // 5 << 12 | 1
24576 = sm_60 (Pascal) // 6 << 12
24577 = sm_75 (Turing) // 6 << 12 | 1
28673 = sm_80 (Ampere) // 7 << 12 | 1
28674-28677 = sm_86/87/88/89 // 7 << 12 | 2..5
32768 = sm_90 (Hopper) // 8 << 12
36864 = sm_100 (Blackwell) // 9 << 12
36865-36869 = sm_103..121 // 9 << 12 | 1..5
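The `(isa_generation << 12) | variant` encoding above inverts trivially; a sketch (the helper name is ours):

```python
# Decode a codegen factory value into (SASS ISA generation, variant),
# per the (isa_generation << 12) | variant formula described above.
def decode_factory(value: int) -> tuple[int, int]:
    return value >> 12, value & 0xFFF

assert decode_factory(24577) == (6, 1)   # sm_75 (Turing)
assert decode_factory(28677) == (7, 5)   # sm_89 (Ada)
assert decode_factory(32768) == (8, 0)   # sm_90 (Hopper)
assert decode_factory(36869) == (9, 5)   # top of the sm_103..121 range
```

This also makes the range checks in the scheduler geometry table below cheap: comparing raw factory values against generation boundaries (e.g. `> 36863`) is equivalent to comparing the high nibble.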
Hardware Resource Geometry
ptxas assembles per-SM hardware parameters from three data sources: sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator properties). These parameters control register allocation limits, shared memory partitioning, occupancy calculations, and scheduling decisions throughout the compiler.
Universal Constants (sub_8688F0)
sub_8688F0 sets the baseline hardware profile shared by all SM 75+ targets. These values are architecture-invariant within the ptxas v13.0.88 binary:
| Parameter | Value | Binary Evidence | Profile Offset |
|---|---|---|---|
| Warp size | 32 threads | *(a1+1472) = 32 | +1472 |
| Max registers per thread | 255 | *(a1+612) = 0xFF0000003F | +612 |
| Register file per SM | 65,536 x 32-bit | Derived: max_warps = 65536 / (regcount * 32) | -- |
| Dependency barriers per warp | 6 | *(a1+604) = 6 | +604 |
| Named barriers per CTA | 16 | barrier_arrive_0 through barrier_arrive_15 intrinsics | -- |
| Static shared memory base | 48 KB (49,152 B) | *(a1+1484) = 49152 | +1484 |
| Shared memory config base | 1 MB (1,048,576 B) | *(v6+344) = 0x100000 in all per-SM inits | profile +344 |
The register file size of 65,536 registers is confirmed by the EIATTR_REGCOUNT formula (code 0x2F): max_warps_per_SM = total_registers / (regcount * warp_size), and by explicit reference in codegen/templates.md ("the entire physical register file is 65,536 32-bit registers shared across all active warps").
Per-SM Resource Geometry Table
Combines binary evidence (sub_8E4400 scheduling profile, sub_8688F0 baseline, sub_ABF250 occupancy properties, sub_60AXXX per-SM initializers) with NVIDIA public specifications for parameters not stored as scalar constants in the binary. Confidence column rates how directly the value was extracted from the binary vs. inferred from public documentation.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
| sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
| sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
| sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (`sub_8688F0` offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block). Not stored as a ptxas constant; derived from `warps_per_SM * warp_size / max_CTAs`.
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From `sub_8E4400` offset +18 (packed DWORD) and offset +22 (WORD). The scheduler partition count is the number of warp scheduler units; dispatch slots is the total scheduling capacity.
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by `cudaFuncSetAttribute`. Stored as pointer-to-table at profile offsets +1488/+1496; sm_75 has 3 entries, later architectures have more.
sm_88 note: No known product ships on sm_88. It shares all handler functions with sm_86. Listed parameters are inherited; actual hardware behavior is unverifiable.
Scheduler Partition Geometry (sub_8E4400 Detail)
The packed DWORD at offset +18 of the warp-level profile encodes scheduler partition counts. The WORD at offset +22 is the dispatch slot count -- a scheduling capacity value distinct from the raw warp count.
| Codegen Factory Range | Packed DWORD | Hex | Partitions | Dispatch Slots | SM Era |
|---|---|---|---|---|---|
| <= 20479 | 458,759 | 0x00070007 | 7 | 96 | sm_50 (Maxwell) |
| 20480 -- 24575 | 786,444 | 0x000C000C | 12 | 176 | sm_60 (Pascal) |
| 24576 -- 28672 | 851,981 | 0x000D000D | 13 | 192 | sm_70 (Volta) |
| 28673 -- 32767 | 917,518 | 0x000E000E | 14 | 208 | sm_75 -- sm_89 |
| 32768 -- 36863 | 983,055 | 0x000F000F | 15 | 224 | sm_90 (Hopper) |
| > 36863 | 1,048,592 | 0x00100010 | 16 | 240 | sm_100+ (Blackwell) |
The dispatch slot count increases monotonically across generations, reflecting wider scheduling capacity. All sm_75 through sm_89 targets (Turing, Ampere, Ada Lovelace) share identical scheduling partition geometry despite their hardware differences -- the differentiation occurs in the per-SM latency tables, not in the partition structure.
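Assuming the field layout described above (two mirrored 16-bit fields in the DWORD at +18, and the dispatch slot count as a separate WORD at +22), the unpacking looks like this. Field and function names are ours:

```python
# Unpack the sub_8E4400 scheduler geometry fields (model, not recovered code).
def unpack_sched(packed_dword: int, slot_word: int):
    lo = packed_dword & 0xFFFF           # partition count (low half)
    hi = (packed_dword >> 16) & 0xFFFF   # mirrors lo in every observed value
    return {"partitions": lo, "partitions_hi": hi, "dispatch_slots": slot_word}

g = unpack_sched(0x000E000E, 208)        # sm_75 -- sm_89 row from the table
assert g["partitions"] == 14 and g["dispatch_slots"] == 208
```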
Shared Memory Configuration Tables
ptxas stores configurable shared memory sizes as a pointer + count pair at profile offsets +1488 and +1496. The driver uses this table to validate cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, ...) calls.
sub_8688F0 sets the sm_75 configuration:
*(a1+1488) = &unk_21D9168 // pointer to shared memory size table
*(a1+1496) = 3 // 3 valid configurations
For sm_75 (Turing), the 3 entries correspond to 32 KB, 48 KB, and 64 KB configurable shared memory. The L1/shared partitioning on Turing splits the 96 KB unified data cache between L1 and shared memory.
For sm_80 (Ampere), the configurable shared memory extends to 164 KB, reflecting the larger combined shared memory/L1 capacity. sub_ABF250 records the maximum as 167,936 bytes (164 KB) for the base sm_60 path and 233,472 bytes (228 KB) for sm_70+ paths, though these values are encoded as xmmword constants that depend on the specific SM variant.
For sm_90+ (Hopper, Blackwell), sub_ABF250 populates a maximum configurable value of 233,472 bytes (228 KB), supporting the opt-in extended shared memory mode added in Hopper.
Register Allocation Mechanics
ptxas allocates registers in units determined by the register allocation granularity stored in sub_ABF250:
| SM Generation | Alloc Granularity | a2[6][1] | a2[6][2] | Notes |
|---|---|---|---|---|
| sm_30 -- sm_60 | 64 registers / warp | 63 | 1 | Allocates in blocks of 2 regs/thread |
| sm_70+ | 256 registers / warp | 255 | 2 | Allocates in blocks of 8 regs/thread |
The register allocation unit directly affects occupancy. With 256-register granularity on sm_75+, a kernel using 33 registers effectively consumes 40 (rounded up to the next multiple of 8), which means each warp uses 40 * 32 = 1280 of the 65,536 available registers, allowing up to 51 warps -- but capped by the hardware limit of 32 warps on sm_75.
The formula the GPU driver uses (from EIATTR_REGCOUNT documentation):
effective_regs = ceil(regcount / alloc_granularity) * alloc_granularity
regs_per_warp = effective_regs * warp_size
max_warps = min(registers_per_SM / regs_per_warp, hw_max_warps)
max_CTAs = min(max_warps / warps_per_CTA, hw_max_CTAs)
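The register-pressure half of that formula transcribes directly; a sketch using the sm_75 parameters from the resource geometry table (the function name and defaults are ours):

```python
import math

def max_warps(regcount, alloc_granularity=8, warp_size=32,
              registers_per_SM=65536, hw_max_warps=32):
    # Round up to the allocation granularity, then divide the register file.
    effective_regs = math.ceil(regcount / alloc_granularity) * alloc_granularity
    regs_per_warp = effective_regs * warp_size
    return min(registers_per_SM // regs_per_warp, hw_max_warps)

# The 33-register example: rounds up to 40 regs/thread -> 1,280 regs/warp
# -> 51 warps by register pressure, capped at sm_75's 32-warp hardware limit.
assert max_warps(33) == 32
assert max_warps(33, hw_max_warps=64) == 51
```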
SM Version Encoding
The raw SM version number stored in profile objects and code object headers uses a packed integer format. This is the value at v4[93] in the code object builder (sub_A465F0):
| Encoded Value | SM Target | Code Object Version | Max Threads/CTA |
|---|---|---|---|
| 12288 | sm_30 | 0x70007 | 96 |
| 20481 | sm_50 | 0xC000C | 176 |
| 24576 | sm_60 | -- | -- |
| 28673 | sm_80 | -- | -- |
| 36864 | sm_90 | 0x100010 | 240 |
The code object builder (sub_A465F0 at 0xA465F0) maps these encoded SM versions to ELF code object version fields and thread-per-CTA limits. The magic number 0x16375564E is written at offset 0 of every code object header, with the SM version at offset +8.
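A sketch of that header prologue as bytes. The magic and the +0/+8 offsets are from the analysis above; emitting both fields as 64-bit little-endian values is an assumption (the magic needs more than 32 bits, but its exact field width was not recovered):

```python
import struct

def build_header(encoded_sm: int) -> bytes:
    # Magic at offset 0; packed SM version at offset +8.
    # 64-bit little-endian widths are assumed, not confirmed.
    return struct.pack("<QQ", 0x16375564E, encoded_sm)

hdr = build_header(36864)   # encoded sm_100 value from the table above
assert hdr[8:16] == (36864).to_bytes(8, "little")
```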
Per-SM Capability Accessors -- sub_609XXX
The 24 functions in the sub_609XXX cluster (range 0x609280--0x609F60, ~1.2KB each) are the per-SM-version capability accessor functions. They are registered into Maps 1 and 2 of the dispatch tables and return architecture-specific values: register file sizes, feature flags, warp geometry, shared memory limits, and similar hardware parameters.
These are the functions that downstream code calls (through the dispatch table) to answer questions like "how many registers does this SM have?" or "is feature X available on this target?"
Profile Layering
Three levels of SM profile information cooperate:
Level 1: sub_607DB0 // Capability dispatch (7 hash maps)
| // -> feature flags, handler functions
v
Level 2: sub_6765E0 // Profile objects (name, family, CUDA_ARCH, lto)
| // -> identity metadata, isaClass
v
Level 3: sub_609XXX / sub_60AXXX // Per-SM accessor functions
// -> concrete hardware parameter values
Level 1 provides the dispatch infrastructure. Level 2 provides identity metadata for diagnostics and linking. Level 3 provides the actual numeric values that drive register allocation, scheduling, and instruction legality.
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for ptxas v13.0.88, returned by sub_6784B0 when no --gpu-name is specified. Codegen factory value: 24577.
- Base tensor core (WMMA m16n16k16 f16/f32, m32n8k16, m8n32k16)
- Integer MMA (IMMA int8/int4), binary MMA (BMMA b1)
- Base warp-level operations (shfl.sync, vote.sync, match.sync, redux.sync)
- Named barrier support (bar 0-15 with arrive/red/sync variants)
Ampere (sm_80 -- sm_88)
Codegen factory value: 28673. Shared with Ada Lovelace (sm_89).
- Extended tensor core: TF32, BF16, FP64 MMA shapes
- `cp.async` for asynchronous shared memory copies
- L2 cache hints on atomic operations
- `createpolicy` instructions for cache management
- 14 additional intrinsics (`__cuda_sm80_*`: bf16/tf32/s4/s8/b1 MMA, createpolicy)
Ada Lovelace (sm_89)
Codegen factory value: 28673 (same as Ampere). Stored as "Ampere" internally despite being a distinct Ada Lovelace microarchitecture.
- Same codegen path as Ampere; differentiated through capability flags, not codegen factory
- 39 additional MMA intrinsics (`__cuda_sm_8x_mma_*`)
Hopper (sm_90 / sm_90a)
Codegen factory value: 32768. sm_90a is architecture-locked (H100/H200 only).
- WGMMA (warpgroup MMA async): `wgmma.mma_async`, `wgmma.fence`, `wgmma.commit_group`, `wgmma.wait_group`
- Cluster operations: `barrier.cluster.arrive/wait`, distributed shared memory
- `setmaxnreg`: Dynamic register allocation limit
- Cluster special registers: `%clusterid`, `%cluster_ctaid`, `%cluster_ctarank`, etc.
- 38 sub-byte MMA intrinsics (`__cuda_sm_9x_mma_sub_byte_internal_*`: s4/u4 sparse)
Blackwell Datacenter (sm_100, sm_103)
Codegen factory value: 36864. Both a and f sub-variants available.
- tcgen05: 5th-generation tensor core ISA (alloc, dealloc, ld, st, commit, cp, shift, mma) -- `a`/`f` sub-variants only
- tcgen05 guardrails: 8 debug validation functions (phase validity, column allocation, bounds checking)
- Extended MMA: 10 Blackwell-specific hmma/imma + bit MMA intrinsics (`__cuda_sm_10x_*`)
- 11 tcgen05 guardrail trap intrinsics (`__cuda_sm10x_tcgen05_guardrail_trap_*`)
- 18 sm_1xx bulk copy intrinsics (`__cuda_sm1xx_*`: cp.async.bulk.tensor 1D-5D tile/im2col)
Jetson Thor (sm_110)
Codegen factory value: 36864 (same as sm_100). Originally sm_101 before rename. Automotive/robotics SoC.
- Retains full tcgen05/TMEM hardware on `a`/`f` sub-variants
- Same Blackwell datacenter feature set for tensor operations
- Differentiated through capability flags for SoC-specific constraints
Blackwell Consumer (sm_120, sm_121)
Codegen factory value: 36864 (same as sm_100). Architecturally a distinct consumer microarchitecture despite sharing the "Blackwell" family string.
- No tcgen05: The entire tcgen05 ISA is absent on sm_120/121 -- gated by SM version checks
- Tensor core falls back to HMMA/IMMA/WGMMA inherited from sm_70--sm_90 path
- sm_120 = RTX 50xx consumer / RTX Blackwell Pro (enterprise)
- sm_121 = DGX Spark
Diagnostic Strings
| String | Context | Function |
|---|---|---|
"Turing" | Family name in profile object | sub_6765E0 |
"Ampere" | Family name in profile object | sub_6765E0 |
"Hopper" | Family name in profile object | sub_6765E0 |
"Blackwell" | Family name in profile object | sub_6765E0 |
"isaClass" | Architecture class reference on profile | sub_6765E0 |
"sm_%d" | SM name formatting | Multiple |
"compute_%d" | Compute name formatting | sub_6765E0 |
"lto_%d" | LTO name formatting | sub_6765E0 |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_607DB0 | 14KB | SM capability dispatch -- builds 7 hash maps | 99% |
| sub_608D70 | 384B | Profile lookup dispatcher | 80% |
| sub_608DF0 | ~1KB | sm_120 intrinsic table initializer | 85% |
| sub_608F20 | ~1.2KB | sm_103 handler A (capability accessor) | 90% |
| sub_608F50 | ~1.2KB | sm_110 handler B (capability accessor) | 90% |
| sub_609280--sub_609F60 | ~1.2KB each | 24 per-SM capability accessor functions (Maps 1+2) | 90% |
| sub_609F60 | 2.8KB | `lds128convert` option handler ("always", "nonconst", "never") | 90% |
| sub_60A2E0--sub_60AD30 | ~1KB each | 12 per-SM intrinsic table initializers (Map 3) | 85% |
| sub_60B040 | 4.5KB | Stress test options ("stress-maxrregcount", etc.) | 85% |
| sub_6765E0 | 54KB | SM profile object construction (family, CUDA_ARCH, lto) | 95% |
| sub_6784B0 | -- | Default architecture -- returns sm_75 | 99% |
| sub_8688F0 | 31 lines | Universal HW profile baseline (warp size, regs, barriers, shmem) | 95% |
| sub_8E4400 | 3.3KB | Warp-level HW profile: scheduler partitions, dispatch slots | 95% |
| sub_ABF250 | ~600B | Occupancy property table: configurable shmem, reg alloc granularity | 90% |
| sub_A95DC0 | ~1.8KB | Extended HW profile: architecture-specific shmem config | 85% |
| sub_A465F0 | 14KB | Code object header builder (SM version -> ELF fields) | 88% |
Profile Object Layout (1936 bytes)
Every SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object that carries target-specific parameters throughout the compiler. This is the compilation unit's target descriptor -- the single structure that downstream code reads to answer "what hardware am I compiling for?"
Construction Sequence
1. sub_71BDE0(1936, a1) heap allocate 1936 bytes
2. sub_C1B7A0(profile) zero-fill + structural defaults (8 SSE blocks, 5 scalars)
3. sub_917990(a3) overlay: codegen factory default, tail constants
4. sub_60AXXX(a1,a2,a3,a4) per-SM: codegen factory, shmem base, capability flags
Key Fields -- Explicitly Initialized
These fields receive non-zero values during construction. Offsets are byte offsets from the profile object base pointer. Type column: D=DWORD(4B), Q=QWORD(8B), O=OWORD(16B), B=BYTE.
| Offset | Type | Default | Set By | Semantic Name | Confidence |
|---|---|---|---|---|---|
| +0 | Q | 0 | sub_C1B7A0 | object_base -- zeroed, likely vtable/class pointer | 75% |
| +112 | Q | 0x500000000 | sub_C1B7A0 | packed_config -- stores DWORD 5 at +112, DWORD 0 at +116 | 85% |
| +120 | D | 5 | sub_C1B7A0 | opt_level_default -- initial optimization level or block dimension | 85% |
| +132 | Q | 0xFFFFFFFF | sub_C1B7A0 | max_register_limit -- -1 sentinel = "no limit" | 85% |
| +340 | D | 1 | sub_C1B7A0 | enable_flag_A -- default-enabled capability | 85% |
| +344 | D | 0x100000 | per-SM init | shared_memory_config_base -- 1 MB for all SM 75+ targets | 95% |
| +348 | D | per-SM | per-SM init | codegen_factory -- ISA generation encoding ((gen << 12) \| variant) | 99% |
| +424 | Q | 0x100000000 | sub_C1B7A0 | packed_enable -- stores DWORD 1 at +424, DWORD 0 at +428 | 90% |
| +428 | D | 0 (cond.) | per-SM init | conditional_feature_flag -- sm_90+ only: set to 0 when *(a2+355) is true | 85% |
| +432 | D | computed | per-SM init | module_base_address -- callback() - 0x6FFFFE64 or -1 if disabled | 95% |
| +588 | D | 0 | sub_917990 | cleared_field -- explicitly re-zeroed; used in 10+ consumer functions | 90% |
| +708 | D | 1 | per-SM init | enable_flag_D -- universally set to 1 by all per-SM initializers | 85% |
| +944 | D | 4 | sub_C1B7A0 | pipeline_depth -- possibly barrier count or pipeline stage limit | 85% |
| +1200 | Q | "NVIDIA" | sub_43A400 | vendor_string_ptr -- pointer to vendor identification string | 95% |
| +1208 | Q | (pointer) | sub_43A400 | associated_data_ptr -- assigned from callback result | 90% |
| +1216 | D | 1 | sub_43A400 | vendor_flag -- set to 1 during ELF builder initialization | 85% |
| +1385 | B | 0 (bits) | runtime | scheduling_feature_flags -- bitfield, 21+ consumer sites | 99% |
| +1536 | Q | 1832 | sub_C1B7A0 | dynamic_region_offset -- points to tail SSE constant region start | 90% |
| +1552 | Q | 0 | runtime | pipeline_progress -- monotonically increasing counter (values 0--21); scoreboard guards check 16--19 | 95% |
| +1584 | Q | nullsub_856 | sub_C1B7A0 | sm_backend_vtable_ptr -- THE central polymorphic pointer; initialized to null stub | 99% |
| +1684 | D | CLI value | per-SM init | cli_option_value -- *(a1+108) passthrough from compiler driver | 90% |
| +1840 | D | 1 | per-SM init | elf_section_data -- initially 1 (enable), later overwritten with ptr | 85% |
| +1880 | Q | 1 | per-SM init | barrier_tracking_ptr -- initially 1, later pointer to scoreboard data | 95% |
| +1892 | D | 2 | sub_917990 | tail_mode_value -- possibly versioning or encoding mode indicator | 85% |
| +1912 | D | 0 (cond.) | per-SM init | conditional_clear -- cleared when *(a2+233) is true (debug mode) | 85% |
| +1928 | D | 1 | per-SM init | output_config_value -- compilation output configuration | 85% |
SSE Constant Blocks
10 blocks of 16 bytes each are loaded from .rodata segments via _mm_load_si128. These likely contain per-register-class sizing parameters, pipeline configuration constants, or default opcode table pointers. Exact values require .rodata dump.
| Offset | Source | Set By |
|---|---|---|
| +184 | xmmword_20206F0 | sub_C1B7A0 |
| +280 | xmmword_2027950 | sub_C1B7A0 |
| +312 | xmmword_2027600 | sub_C1B7A0 |
| +680 | xmmword_2027620 | sub_C1B7A0 |
| +696 | xmmword_22B4ED0 | sub_C1B7A0 |
| +740 | xmmword_22B4EE0 | sub_C1B7A0 |
| +788 | xmmword_22B4EF0 | sub_C1B7A0 |
| +816 | xmmword_22B4F00 | sub_C1B7A0 |
| +1832 | xmmword_21DEBA0 | sub_917990 |
| +1908 | xmmword_21DEBB0 | sub_917990 |
Scheduling Feature Flags (+1385 bitfield)
The byte at offset +1385 is the most heavily accessed bitfield on the profile object (21+ consumer sites in 15+ decompiled functions). Each bit gates a scheduling or codegen behavior.
| Bit | Mask | Meaning | Evidence |
|---|---|---|---|
| 0 | 0x01 | Function has sync barriers | sub_792CD0 sets/clears; sub_75F680, sub_75F580 check |
| 1 | 0x02 | (unknown) | -- |
| 2 | 0x04 | Extended barrier model | sub_796D60 checks jointly with +1382 & 0x20 |
| 3 | 0x08 | Scoreboard tracking enabled | sub_925670, sub_925510, sub_9253C0 check jointly with +1880 and +1552 |
| 4 | 0x10 | (unknown) | -- |
| 5 | 0x20 | Scheduling feature flag | sub_793220, sub_A36360 (scoreboard encoder) check |
| 6 | 0x40 | Temporary analysis flag | sub_752E40 sets; sub_77F0D0 clears |
| 7 | 0x80 | Preserved across resets | sub_7F7DC0: *(a1+1385) &= 0x80 (all others cleared) |
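The set/clear/reset patterns from those consumer sites, written out as plain bit operations (mask names follow the table; this is a model of the decompiled accesses, not recovered code):

```python
# Masks for the +1385 scheduling feature bitfield (names are ours).
SYNC_BARRIERS, EXT_BARRIER, SCOREBOARD = 0x01, 0x04, 0x08
SCHED_FEATURE, TEMP_ANALYSIS, PRESERVED = 0x20, 0x40, 0x80

flags = 0
flags |= SYNC_BARRIERS | TEMP_ANALYSIS | PRESERVED  # e.g. sub_792CD0 / sub_752E40 set bits
flags &= ~TEMP_ANALYSIS                             # sub_77F0D0 clears bit 6
flags &= PRESERVED                                  # sub_7F7DC0 reset: only bit 7 survives
assert flags == 0x80
```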
Per-SM Initializer Differences
All 12 per-SM initializers (one per SM family) are structurally identical. Only two fields differ between families.
| Field | sm_75--sm_88 | sm_89 | sm_90 | sm_100--sm_121 |
|---|---|---|---|---|
| +348 codegen_factory | 24577--28676 | 28677 | 0x8000 (32768) | 36864--36869 |
| +428 conditional_feature_flag | not written | not written | written (if *(a2+355)) | written (if *(a2+355)) |
All other fields (+344, +432, +588, +708, +1684, +1840, +1880, +1912, +1928) are set identically across all SM families.
Critical Distinction: Profile Object vs SM Backend
The profile object (1936 bytes, this layout) is the compilation unit's target descriptor, stored somewhere in the compilation context. The pointer at context+1584 (sm_backend_vtable_ptr) points to a separate polymorphic SM backend object -- not to another profile object. Fields accessed as *(*(ctx+1584) + N) are on the SM backend, not on this profile object.
Commonly accessed SM backend fields (NOT on the 1936-byte profile):
| SM Backend Offset | Semantic | Consumer Count |
|---|---|---|
| +12 | arch_class_id (4=Maxwell, 10=Grace, 11=NVLink-Volta+) | 3+ |
| +372 | codegen_factory (THE feature-gating value, 37+ sites) | 37+ |
| +1037 | hw_capability_flags bitfield (SFU precision, barriers) | 20+ |
| +1312 | vtable dispatch for predicate marking capability | 2+ |
| +1320 | vtable function slot for optimization dispatch | 2+ |
| +1417 | Grace/ARM architecture flag (bit 7) | 1+ |
| +1418 | NVLink-capable flag (bit 0) | 1+ |
The profile's own +348 stores the codegen factory value at construction time. The SM backend's +372 is where all downstream code reads it. These are the same numeric value stored in two different objects.
Object Region Map
| Offset Range | Size | Content |
|---|---|---|
| [0..111] | 112B | Object header, vtable chain, zeroed bulk |
| [112..159] | 48B | Config scalars (opt level, register limit) |
| [160..343] | 184B | Zeroed + 3 SSE constant blocks (184, 280, 312) |
| [344..587] | 244B | Target identity (shmem base, codegen factory, enable flags) |
| [588..879] | 292B | Capability fields, 4 SSE constant blocks (680, 696, 740, 788) |
| [880..1063] | 184B | Scheduling config, barrier config (+944=4 default) |
| [1064..1279] | 216B | Extended config, vendor metadata (+1200="NVIDIA") |
| [1280..1535] | 256B | Architecture feature flags (+1385 bitfield, +1417/+1418) |
| [1536..1663] | 128B | Dynamic region pointer, phase counter, sm_backend ptr (+1584) |
| [1664..1831] | 168B | CLI passthrough, ELF section data, barrier tracking ptr |
| [1832..1935] | 104B | Tail region: 2 SSE constant blocks, mode value (+1892=2) |
Cross-References
- Turing & Ampere (SM 75--88) -- Detailed feature flags for sm_75 through sm_89
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, sm_90a arch-lock
- Blackwell (SM 100--121) -- tcgen05, arch-conditional vs family-conditional gating
- TCGen05 -- 5th Gen Tensor Cores -- tcgen05 instruction set detail
- Intrinsic Table (608 Entries) -- Master intrinsic catalog with per-SM generation ranges
- Tensor Core Intrinsics -- WMMA/MMA/tcgen05 intrinsic lowering
- Latency Model & HW Profiles -- Per-SM scheduling parameters
- SASS Instruction Encoding -- Codegen factory -> encoder selection
- Mercury Encoder -- SM-dependent SASS encoding
- CLI Options -- `--gpu-name` parsing and default target
- Data Structures -- context+1584 polymorphic pointer documentation
Turing & Ampere (SM 75--88)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
SM 75 through SM 88 span two microarchitecture generations that ptxas treats as a contiguous feature band. sm_75 (Turing) is the default target for ptxas v13.0.88 -- when --gpu-name is omitted, sub_6784B0 returns sm_75. The Ampere targets (sm_80, sm_86, sm_87, sm_88) share generation-7 SASS encoding and add incremental tensor core and async-copy capabilities. sm_89 (Ada Lovelace) is architecturally Ampere-derived internally but is covered in Ada & Hopper because it bridges to sm_90 features.
| SM targets | sm_75, sm_80, sm_86, sm_87, sm_88 (+ sm_82 validation-only) |
| Codegen factory range | 24577--28676 |
| ISA generation | 6 (Turing), 7 (Ampere) |
| Encoding format | 128-bit per-instruction control word |
| Scheduler profile | 7 warps, 208 dispatch slots |
| Family strings | "Turing" (sm_75), "Ampere" (sm_80--88) |
| Sub-variants | None (no a or f suffixes) |
| Profile object size | 1,936 bytes (allocated by sub_917990) |
SM Version Table
| SM | Product | Family | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|---|
| sm_75 | TU10x (RTX 20xx, Quadro RTX) | Turing | 750 | 24577 | 0x6001 | 1 (gen 6) |
| sm_80 | GA100 (A100, A30) | Ampere | 800 | 28673 | 0x7001 | 1 (gen 7) |
| sm_86 | GA10x (A40, A10, RTX 30xx) | Ampere | 860 | 28674 | 0x7002 | 2 |
| sm_87 | GA10B (Jetson Orin) | Ampere | 870 | 28675 | 0x7003 | 3 |
| sm_88 | -- (undocumented) | Ampere | 880 | 28676 | 0x7004 | 4 |
Codegen factory encoding: (isa_generation << 12) | sub_variant. Turing is generation 6; Ampere is generation 7. The sub-variant distinguishes silicon cut within a generation. sm_75 and Pascal sm_60 share generation 6 (sm_60 = 24576 = 0x6000), differentiated by sub-variant 0 vs 1.
sm_88 note: Registered in ptxas with CUDA_ARCH=880 and codegen factory 28676, but no public product ships on this SM. It may represent an unreleased Ampere derivative or internal test target.
SM 82 -- Internal Ampere Target
sm_82 is an undocumented internal Ampere target present in the base validation table (unk_1D16220, entry [20]) but not registered in the profile constructor sub_6765E0. It has no capability dispatch handler, no profile object, and no handler functions in any of the 7 hash maps. It exists in ptxas solely as a validation table entry and as the SASS opcode generation boundary.
| Validation table entry | {82, 6, 2} -- sm_82, PTX 6.2 |
| PTX ISA requirement | 6.2 (anomalously low -- see below) |
| Profile object | None -- not registered in sub_6765E0 |
| Capability handlers | None -- not registered in sub_607DB0 |
| SASS opcode role | SM82_FIRST (index 172) through SM82_LAST (index 193) |
PTX 6.2 Anomaly
sm_82 requires PTX ISA version 6.2, which is lower than both its neighbors:
| SM | PTX ISA | CUDA Toolkit |
|---|---|---|
| sm_75 | 6.3 | CUDA 10.0 |
| sm_80 | 7.0 | CUDA 11.0 |
| sm_82 | 6.2 | -- |
| sm_86 | 7.1 | CUDA 11.1 |
PTX 6.2 corresponds to CUDA 10.1 (Turing era). This backward version number strongly suggests sm_82 was created as an early Ampere development target -- a PTX-level placeholder added before the Ampere PTX ISA (7.0) was defined. The validation table entry was never removed, but no profile object was ever created for it.
SASS Opcode Boundary Role
sm_82's primary significance in ptxas is as the opcode generation boundary for Ampere SASS instructions. The opcode hierarchy uses SM-number-based range labels:
SM82_FIRST = index 172 (first Ampere-era SASS opcode)
SM82_LAST = index 193 (last opcode in the sm_82 range)
These 22 opcode slots (indices 172--193) cover the core Ampere SASS additions:
| Opcodes | Category |
|---|---|
| GATHER, GENMETADATA, SPMETADATA | Sparse MMA infrastructure |
| BMMA_88128, BMMA_168128, BMMA_168256 | Binary tensor core MMA shapes |
| DMMA | FP64 tensor core MMA (re-introduced at index 215 for Hopper) |
| HMMA_SP_1688, HFMA2_MMA, HMNMX2 | FP16 sparse/packed operations |
| IMMA_88, IMMA_SP_88, IMMA_16816, IMMA_16832, IMMA_SP_16832 | Integer tensor core MMA shapes |
| ARRIVES, LDGDEPBAR, LDGSTS | Async copy and barrier infrastructure |
| REDUX | Warp-wide reduction |
| CLMAD | Carry-less multiply-add (GF(2) arithmetic) |
The name SM82_FIRST/SM82_LAST is used as the boundary label even though these instructions are available on sm_80+ (any codegen factory >= 28673). The "82" in the label refers to the internal target used during Ampere development, not to a minimum SM requirement for the opcodes themselves.
Why sm_82 Matters
sm_82 is a ghost target: it occupies a validation table slot and lends its name to an opcode range, but cannot be compiled for. Passing --gpu-name sm_82 to ptxas would pass the initial validation check (bsearch succeeds in the base table) but fail during profile construction because sub_6765E0 has no case for SM 82. The practical consequence is that sm_82 is a naming artifact preserved from Ampere development, not a usable compilation target.
Profile Object Construction
Each SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object, then populates architecture-specific fields.
Source: decompiled sub_60A2E0 (sm_75), sub_60A3E0 (sm_80), sub_60AC30 (sm_86), sub_60AD30 (sm_87), sub_60AB30 (sm_88).
All five initializers are structurally identical. The only field that differs between them is offset +348 (codegen factory). Common initialization:
v6 = sub_917990(a3);               // allocate 1936-byte profile object
v7 = (_DWORD *)v6;                 // same object viewed as a DWORD array
*(_DWORD *)(v6 + 344) = 0x100000;  // shared memory config base (1 MB)
*(_DWORD *)(v6 + 348) = FACTORY;   // codegen factory (per-SM)
v7[147] = 0;                       // +588 cleared
v7[421] = cli_option_value;        // +1684, from a1+108
v7[460] = 1;                       // +1840 enable flag
v7[482] = 1;                       // +1928 enable flag
v7[470] = 1;                       // +1880 enable flag
v7[177] = 1;                       // +708 enable flag
The sub_917990 base constructor sets the default codegen factory to 0x2000 (8192), which every per-SM initializer immediately overwrites. Key base constructor fields:
| Offset | Default | Content |
|---|---|---|
| +348 | 0x2000 | Codegen factory (overridden) |
| +588 | 0 | Cleared |
| +1892 | 2 | Mode/config value |
| +1832--1848 | xmmword | SSE-loaded constant block |
| +1908--1924 | xmmword | SSE-loaded constant block |
Handler Dispatch
ptxas registers per-SM handler functions into 7 parallel hash maps via sub_607DB0. For sm_75--88, all handlers are thin wrappers around shared codegen infrastructure.
Map 1 (Handler A) and Map 2 (Handler B)
| SM | Handler A | Handler B |
|---|---|---|
| sm_75 | sub_609B70 | sub_609B40 |
| sm_80 | sub_609CC0 | sub_609C90 |
| sm_86 | sub_609D50 | sub_609D80 |
| sm_87 | sub_609F00 | sub_609DE0 |
| sm_88 | sub_609E70 | sub_609EA0 |
Every Handler A function is identical:
bool handler_A(int64_t a1, int64_t a2) {
*(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
return sub_663C30(a2, 0) != 0; // second arg 0: sets *(a1+96) = 1
}
Every Handler B function is identical except for the constant passed as sub_663C30's second argument:
bool handler_B(int64_t a1, int64_t a2) {
*(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
return sub_663C30(a2, 1) != 0; // second arg 1: skips the flag set
}
The "cpf_optx" option controls OptiX IR compilation mode. sub_663C30 is the core codegen-factory-aware driver that delegates to sub_662920 (ELF section iteration, 26KB) and sub_7FBB70 (actual compilation pass). The only behavioral difference between Handler A and Handler B: when called as Handler A (arg=0), sub_663C30 sets *(a1 + 96) = 1 before processing -- likely a "primary pass" flag.
Map 3 (Intrinsic Table Initializer)
| SM | Initializer | Notes |
|---|---|---|
| sm_75 | sub_60A2E0 | Factory 24577 |
| sm_80 | sub_60A3E0 | Factory 28673 |
| sm_86 | sub_60AC30 | Factory 28674 |
| sm_87 | sub_60AD30 | Factory 28675 |
| sm_88 | sub_60AB30 | Factory 28676 |
Maps 6--7 (Performance / Occupancy)
| Map | sm_75 Handler | Purpose |
|---|---|---|
| 6 | sub_608D50 | Perf-stats / occupancy handler E |
| 7 | sub_6096E0 | Perf-stats / occupancy handler F |
These handlers return per-SM occupancy parameters used by the driver API's occupancy calculator. Other Ampere SMs have their own handlers registered into the same maps (addresses not fully traced in sweep data).
Scheduler Profile
sub_8E4400 (InitHWProfile_Warp) uses the codegen factory value to select scheduling parameters. The dispatch is a linear threshold cascade:
| Factory Range | Warps | Dispatch Slots | Architecture Class |
|---|---|---|---|
| <= 20479 | 4 | 96 | Kepler (sm_30) |
| <= 24575 | 6 | 176 | Pascal (sm_60) |
| <= 28672 | 7 | 192 | Volta (sm_70) |
| <= 32767 | 7 | 208 | Turing / Ampere (sm_75--88) |
| <= 36863 | 8 | 224 | Hopper (sm_90) |
| > 36863 | 16 | 240 | Blackwell (sm_100+) |
All SM 75--88 targets fall into the 7-warp / 208-slot bucket. After the warp count, a secondary switch maps specific codegen factory values to sub-architecture variants (stored at a1+26):
| Codegen Factory | Variant | SM |
|---|---|---|
| 24576, 32768, 36864 | 0 | sm_60, sm_90, sm_100 (base) |
| 8193, 20481, 28674 | 2 | sm_30, sm_50, sm_86 |
| 28675, 36867 | 3 | sm_87, sm_103 |
| 28676, 36868 | 4 | sm_88, sm_110 |
| 28677, 36869 | 5 | sm_89, sm_121 |
sm_75 (24577) and sm_80 (28673) are absent from the variant table and fall through to the default variant (0 or 1). This means sm_75 and sm_80 use the baseline latency model, while sm_86--88 get tuned sub-architecture parameters.
HW Latency Tables
Each SM has a dedicated latency table function containing per-opcode pipeline stage assignments, stall cycle costs, and functional unit mappings. These are called from the scheduling infrastructure to drive instruction ordering.
| Function | Size | SM | Notes |
|---|---|---|---|
| sub_8E7720 | 3.5 KB | sm_75 | Turing baseline |
| sub_8E7940 | 2.9 KB | sm_80 (base) | Ampere shared layer |
| sub_8E7B40 | 3.3 KB | sm_80 | Ampere full table |
| sub_8E7D80 | 4.4 KB | sm_86 | Largest in Ampere family |
| sub_8E8070 | 3.5 KB | sm_87 | Orin tuning |
| sub_8E8280 | 3.1 KB | sm_89 | Ada Lovelace |
sm_80 uniquely has two latency tables: a "base" table (sub_8E7940, 2.9KB) and a full table (sub_8E7B40, 3.3KB), suggesting a layered lookup where the base provides defaults and the full table overrides specific entries. sm_86's table is the largest at 4.4KB, likely because RTX 30xx consumer GPUs have different pipeline characteristics from A100 datacenter parts.
No separate table entry was found for sm_88 in the sweep data. It may share sm_86 or sm_87's latency profile, or be registered through a path not captured in the sweep.
SASS Instruction Encoding
SM 75--88 all use the 128-bit per-instruction encoding format introduced with Turing. This replaced the Volta/Pascal scheme where scheduling control was packed into a separate 64-bit control header shared by 3 instructions.
Control Word Layout
Each 128-bit SASS instruction carries a 26-bit control word encoding scheduling decisions (the six fields below sum to 26 bits). The control word is generated by sub_A36360 (52KB, the control-word / scoreboard encoder).
bits [0:3] stall count (4 bits, max 15 cycles)
bit [4] yield flag (1 bit, warp scheduler hint)
bits [5:7] write barrier idx (3 bits, selects 1 of 6 barriers)
bits [8:13] read barrier mask (6 bits, one per barrier)
bits [14:19] wait barrier mask (6 bits, one per barrier)
bits [20:25] reuse flags (6 bits, per source operand)
Instruction Word Structure
128-bit instruction word:
bits [0:3] = 0x2 (format code: 128-bit)
bits [4:6] = slot (scheduling group slot)
bits [8:16] = MAJOR (9-bit major opcode, 0x00-0x171)
bits [17:24] = MINOR (8-bit minor opcode / variant)
bits [25:31] = SUBOP (7-bit sub-opcode / format ID)
bits [48+] = MODIFIERS (format-dependent modifier fields)
bits [132:134] = 0x0 (extended opcode flag, at offset 0x84)
The 3-level opcode hierarchy (major/minor/subop) allows up to 102 major opcodes with 48 sub-operations each. Maximum observed variant value: 0x2F (47 decimal).
Scoreboard / Dependency Barriers
SM 75--88 provide 6 hardware dependency barriers per warp. The scoreboard tracker is managed by sub_8E4920 (BuildScoreboardEntries, 6.9KB) and encoded by sub_A36360.
| Resource | Width | Range | Notes |
|---|---|---|---|
| Write barrier index | 3 bits | 0--5 active, 6--7 reserved | Assigns instruction to a barrier |
| Read barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to check before read |
| Wait barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to wait on |
The scoreboard tracker allocates 952 bytes per function when bit 4 of the flag byte at offset +1385 is set. An additional 856-byte bitset is allocated when bit 8 is also set -- an index past the top of a single byte, so the check most likely reads +1385 as a 16-bit word, placing the second flag in the adjacent byte at +1386 (used for barrier register tracking in writeback mode).
The scoreboard infrastructure is shared across all SM 75--88 targets. The barrier count (6) is constant for this entire range. sm_90 (Hopper) potentially increases this, and sm_100+ (Blackwell) changes the barrier model further.
Intrinsic Table
Intrinsic availability is cumulative. Each generation adds to the previous.
sm_75 Baseline (IDs 0x89--0x1FA, 370 intrinsics)
sm_75 inherits the full sm_70 (Volta) intrinsic set labeled __cuda_sm70_*:
| Category | Intrinsics | PTX Operations |
|---|---|---|
| Named barriers | barrier_arrive/red/sync (0--15) | bar.arrive, bar.red.{and,or,popc}, bar.sync |
| Warp shuffle | shflsync_bfly/down/idx/up | shfl.sync.{bfly,down,idx,up} |
| Warp vote | votesync_all/any/ballot/uni | vote.sync.{all,any,ballot,uni} |
| Warp match | matchsync_all/any_b32/b64 | match.sync.{all,any}.b{32,64} |
| Warp sync | warpsync | bar.warp.sync |
| Redux | reduxsync_* (IDs 0x01--0x11) | redux.sync.{and,or,xor,min,max,add} |
| WMMA | m16n16k16, m32n8k16, m8n32k16 | wmma.{load,store,mma} |
WMMA intrinsics cover all combinations of:
- Shapes: m16n16k16, m32n8k16, m8n32k16
- Operations: load_a, load_b, load_c, store_d, mma
- Layouts: row/col combinations
- Types: f16, f32, with/without satfinite
- Address spaces: generic, global, shared
sm_80 Additions (IDs 0x1FB--0x22F, 53 intrinsics)
sm_80 adds two intrinsic groups:
14 __cuda_sm80_* intrinsics (IDs 0x1FB--0x208):
| Intrinsic | PTX Operation | Notes |
|---|---|---|
| createpolicy_range | createpolicy.range | L2 cache persistence control |
| createpolicy_fractional | createpolicy.fractional | L2 cache fraction control |
| createpolicy_cvt | Cache policy conversion | |
| mma_bf16_* | mma.sync with .bf16 | BF16 tensor core MMA |
| mma_tf32_* | mma.sync with .tf32 | TF32 tensor core MMA |
| mma_s4_* | mma.sync with .s4 | INT4 tensor core MMA |
| mma_s8_* | mma.sync with .s8 | INT8 tensor core MMA |
| mma_b1_* | mma.sync with .b1 | Binary tensor core MMA |
39 __cuda_sm_8x_mma_* intrinsics (IDs 0x209--0x22F):
Extended MMA shapes and sparse variants for the 2nd/3rd generation tensor core.
Intrinsic Gate Mechanism
The per-SM intrinsic table initializer (Map 3 handler) controls which intrinsics are available. sm_75 registers only the sm_70 intrinsic block. sm_80+ additionally registers the sm_80 and sm_8x blocks. The gate is not a runtime check -- it is a registration-time decision: if the intrinsic's handler function is not registered for a given SM, the PTX parser emits an error when it encounters a call to that intrinsic.
Peephole Optimizer Gates
The SASS-level peephole optimizer (sub_83EF00, 4,858 lines decompiled) applies pattern-matching transformations to the instruction stream. Several transformations are gated by codegen factory thresholds.
FMA/DFMA Combining
// sub_83EF00, case 0x50 (opcode 80 = FMA/DFMA)
if (*(_QWORD *)(*(_QWORD *)(a1 + 1584) + 372) > 28673) {
// FMA combining enabled: sm_86+ (codegen factory 28674+)
// Look for CVT -> FMA patterns and combine
}
Gate: codegen_factory > 28673. This means:
- sm_75 (24577): FMA combining disabled
- sm_80 (28673): FMA combining disabled (equality fails the > check)
- sm_86 (28674): FMA combining enabled
- sm_87 (28675): FMA combining enabled
- sm_88 (28676): FMA combining enabled
This is a significant compiler optimization difference between A100 (sm_80) and RTX 30xx (sm_86). A100 code does not get FMA/DFMA combining, possibly because A100's pipeline already has hardware FMA fusion or because the combining transformation is not profitable on GA100 silicon.
Master Gate
All peephole passes check capability 487 ("enable peephole") and capability 350 (per-instruction gate) before applying any transformation. These capabilities are controlled by optimization level and user options, not by SM target.
BB Initialization Flags
The basic block initializer sub_6E8EB0 sets architecture-specific flags using a secondary SM version encoding (distinct from the codegen factory):
| SM Version (secondary) | Hex | Flags Set | Architecture |
|---|---|---|---|
| 20480 | 0x5000 | bits 1, 8 | sm_80 encoding space |
| 20484 | 0x5004 | bits 16, 64 | sm_84 encoding space |
This secondary encoding uses (gen << 12) with generation 5 for Ampere in the BB init context. The specific bit flags control opcode descriptor table population -- each BB gets a set of 40+ (opcode_id, encoding_word) pairs that define which SASS instructions are legal in that basic block.
Feature Comparison
| Feature | sm_75 | sm_80 | sm_86 | sm_87 | sm_88 |
|---|---|---|---|---|---|
| ISA generation | 6 | 7 | 7 | 7 | 7 |
| Codegen factory | 24577 | 28673 | 28674 | 28675 | 28676 |
| WMMA (1st gen TC) | Yes | Yes | Yes | Yes | Yes |
| MMA bf16/tf32 (2nd gen TC) | -- | Yes | Yes | Yes | Yes |
| MMA s4/s8/b1 extended | -- | Yes | Yes | Yes | Yes |
| createpolicy (L2 cache) | -- | Yes | Yes | Yes | Yes |
| cp.async (async copy) | -- | Yes | Yes | Yes | Yes |
| sm_8x MMA intrinsics (39) | -- | Yes | Yes | Yes | Yes |
| FMA/DFMA peephole combining | -- | -- | Yes | Yes | Yes |
| Scheduler: warps / slots | 7/208 | 7/208 | 7/208 | 7/208 | 7/208 |
| Sub-arch variant | default | default | 2 | 3 | 4 |
| Separate HW latency table | Yes | Yes (2) | Yes | Yes | ? |
| Family string | Turing | Ampere | Ampere | Ampere | Ampere |
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (`sub_8688F0` offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From `sub_8E4400` offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by `cudaFuncSetAttribute`.
All Turing/Ampere targets share the 7-partition / 208-slot scheduling geometry. The major resource difference is sm_80 (A100 datacenter) with 2,048 max threads and 64 warps vs. the consumer/embedded parts (sm_86--88) with 1,536 max threads and 48 warps. sm_75 (Turing) is the most constrained with 1,024 max threads and 32 warps.
Codegen Factory Gating Patterns
Throughout ptxas, the codegen factory value at profile offset +348 (or equivalently *(_DWORD *)(*(_QWORD *)(a1 + 1584) + 372) from the function context) is used to gate features. Common patterns for the Turing/Ampere range:
// Pattern 1: Generation check (factory >> 12)
int gen = codegen_factory >> 12;
if (gen >= 7) { /* Ampere+ feature */ }
// Pattern 2: Exact threshold
if (codegen_factory > 28673) { /* sm_86+ only */ }
if (codegen_factory >= 28673) { /* sm_80+ */ }
// Pattern 3: Range check
if (codegen_factory >= 24577 && codegen_factory < 32768) {
/* Turing/Ampere only, not Hopper+ */
}
The >> 12 shift extracts the ISA generation, allowing coarse checks (Turing = 6, Ampere = 7). Fine-grained checks compare against specific factory values to distinguish sm_80 from sm_86, etc.
Function Map
| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_609B70 | ~48B | Handler A (cpf_optx + compile) | sm_75 | 99% |
| sub_609B40 | ~48B | Handler B (cpf_optx + compile) | sm_75 | 99% |
| sub_60A2E0 | ~300B | Intrinsic table initializer | sm_75 | 95% |
| sub_609CC0 | ~48B | Handler A | sm_80 | 99% |
| sub_609C90 | ~48B | Handler B | sm_80 | 99% |
| sub_60A3E0 | ~300B | Intrinsic table initializer | sm_80 | 95% |
| sub_609D50 | ~48B | Handler A | sm_86 | 99% |
| sub_609D80 | ~48B | Handler B | sm_86 | 99% |
| sub_60AC30 | ~300B | Intrinsic table initializer | sm_86 | 95% |
| sub_609F00 | ~48B | Handler A | sm_87 | 99% |
| sub_609DE0 | ~48B | Handler B | sm_87 | 99% |
| sub_60AD30 | ~300B | Intrinsic table initializer | sm_87 | 95% |
| sub_609E70 | ~48B | Handler A | sm_88 | 99% |
| sub_609EA0 | ~48B | Handler B | sm_88 | 99% |
| sub_60AB30 | ~300B | Intrinsic table initializer | sm_88 | 95% |
| sub_608D50 | ~200B | Perf/occupancy handler E | sm_75 | 85% |
| sub_6096E0 | ~200B | Perf/occupancy handler F | sm_75 | 85% |
| sub_663C30 | ~400B | Core codegen driver (shared) | all | 90% |
| sub_662920 | 26KB | ELF section iteration (shared) | all | 85% |
| sub_917990 | ~300B | Profile object constructor | all | 90% |
| sub_8E7720 | 3.5KB | HW latency table | sm_75 | 90% |
| sub_8E7940 | 2.9KB | HW latency table (base layer) | sm_80 | 90% |
| sub_8E7B40 | 3.3KB | HW latency table (full) | sm_80 | 90% |
| sub_8E7D80 | 4.4KB | HW latency table | sm_86 | 90% |
| sub_8E8070 | 3.5KB | HW latency table | sm_87 | 90% |
| sub_8E4400 | 3.3KB | InitHWProfile_Warp (shared) | all | 90% |
| sub_83EF00 | ~100KB | Primary peephole optimizer | all | 85% |
| sub_A36360 | 52KB | Control word / scoreboard encoder | all | 85% |
| sub_8E4920 | 6.9KB | BuildScoreboardEntries | all | 85% |
Cross-References
- SM Architecture Map -- Overview of all 23 SM targets and the 3-level profile system
- Ada & Hopper (SM 89--90a) -- sm_89 (Ada) shares Ampere codegen factory range but bridges to Hopper features
- Blackwell (SM 100--121) -- Next-generation targets with codegen factory 36864+
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with per-SM generation ranges
- SASS Instruction Encoding -- 128-bit encoding format, bitfield packer, opcode hierarchy
- Peephole Optimization -- FMA combining and other post-scheduling SASS transforms
- 3-Phase Scheduler Architecture -- Scheduler infrastructure that consumes HW latency tables
- CLI Options -- `--gpu-name` parsing, `sm_75` default
Ada Lovelace & Hopper (SM 89--90a)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas handles SM 89 (Ada Lovelace -- RTX 4090, L40S) and SM 90/90a (Hopper -- H100, H200) as adjacent but architecturally distinct targets. Ada shares the Ampere codegen factory generation (7 << 12) and is stored internally as an "Ampere"-family target despite being a different microarchitecture. Hopper gets its own codegen factory (32768), its own family string "Hopper", and introduces the largest single-generation feature addition in ptxas: WGMMA, thread-block clusters, TMA, setmaxnreg, and distributed shared memory.
| | SM 89 (Ada) | SM 90 (Hopper) | SM 90a (Hopper accel) |
|---|---|---|---|
| Products | RTX 4090, RTX 4080, L40S, L4 | H100, H200 | H100, H200 (arch-locked) |
| Family string | "Ampere" | "Hopper" | "Hopper" |
| __CUDA_ARCH__ | 890 | 900 | 90a0 |
| Codegen factory | 28677 (7 << 12 \| 5) | 32768 (8 << 12) | 32768 |
| Handler A | sub_609E10 | sub_609DB0 | sub_609DB0 (shared) |
| Handler B | sub_609CF0 | sub_609C00 | sub_609C00 (shared) |
| Intrinsic init | sub_60A810 | sub_60A5F0 | sub_60A5F0 (shared) |
| HW latency table | sub_8E8280 (3.1KB) | sub_8E8480 (5.2KB) | sub_8E8780 (4.6KB) |
| Suffix variants | None | a only (no f) | -- |
| Forward compat | Full (runs on sm_90+) | Full (base variant) | None (locked to H100/H200) |
The "a" Suffix -- Architecture-Accelerated
SM 90a is the first target to use the a suffix. It appears in the accelerated validation table (unk_1D161C0, 7 entries). The meaning is precise: sm_90a SASS executes only on the specific silicon it was compiled for (H100/H200) and will not run on any future architecture. This trades forward compatibility for access to features that may not survive to the next generation.
In ptxas, sm_90 and sm_90a share all 7 dispatch-table handler functions. The a suffix does not produce different handler code paths -- it produces different compatibility metadata in the output cubin. The ELF header records whether the binary is forward-compatible (base) or arch-locked (accelerated), and the CUDA driver enforces this at load time.
Compilation rules from ptxas help text:
- `sm_90a` PTX must be compiled to `sm_90a` SASS (no cross-arch)
- `sm_90` PTX can compile to `sm_90` or any later SASS target
- No `sm_90f` variant exists; the `f` suffix starts with Blackwell
SM 89 -- Ada Lovelace
Internal Classification
Ada is classified as Ampere-derived in the binary. The profile object constructed by sub_6765E0 stores "Ampere" as the family name and uses a generation-7 codegen factory (28677 = 7 << 12 | 5), the same generation as sm_80 through sm_88. The compiler distinguishes Ada from the earlier Ampere parts through per-SM capability accessor functions and the sub-variant field, not through the generation ID.
The encoded SM version for sm_89 is 28677 (7 << 12 | 5), placing it as the 5th variant in the Ampere generation:
| Encoded Value | SM | Variant |
|---|---|---|
| 28673 | sm_80 | 7 << 12 \| 1 (base Ampere) |
| 28674 | sm_86 | 7 << 12 \| 2 |
| 28675 | sm_87 | 7 << 12 \| 3 |
| 28676 | sm_88 | 7 << 12 \| 4 |
| 28677 | sm_89 | 7 << 12 \| 5 (Ada) |
Ada-Specific Features
Ada introduces 4th-generation tensor cores with FP8 (E4M3, E5M2) support. In the PTX validator, sub_4A6050 explicitly references "%s on sm_89" in its CVT instruction validation, confirming sm_89-specific conversion rules for new data types.
The intrinsic table initializer for sm_89 (sub_60A810) enables 39 additional MMA intrinsics:
| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x209--0x22F | 39 | __cuda_sm_8x_mma_* -- extended MMA shapes and types |
These intrinsics cover FP8 MMA operations, block-scale MMA, and additional type combinations beyond what sm_80--88 support. The MMA validator at sub_49BBA0 checks for "mma with FP8 floating point type" and validates against the target SM version.
Ada Scheduling Profile
The HW latency table for sm_89 (sub_8E8280, 3.1KB) is smaller than Hopper's (5.2KB), reflecting Ada's simpler pipeline structure -- no WGMMA async pipeline, no cluster operations. The register file geometry from sub_8E4400:
| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 8 | Selected for encoded SM 28677 |
| Dispatch slots | 224 | One bucket above the 208 slots of sm_75--88 |
| Sub-architecture variant | 5 | From encoded value 28677 |
SM 90 / SM 90a -- Hopper
Profile Construction
Hopper is the first SM to get its own codegen factory value (32768 = 8 << 12) since the Turing/Ampere split. The profile object stores "Hopper" as the family name. Key profile fields:
| Field | sm_90 | sm_90a |
|---|---|---|
| SM name | "sm_90" | "sm_90a" |
| Compute name | "compute_90" | "compute_90a" |
| LTO name | "lto_90" | "lto_90a" |
| __CUDA_ARCH__ | 900 | 90a0 |
| Family | "Hopper" | "Hopper" |
| Codegen factory | 32768 | 32768 |
Hopper Scheduling Profile
The HW latency tables for Hopper are substantially larger than any previous architecture, reflecting the async pipeline and tensor core scheduling complexity:
| Function | Size | Target | Notes |
|---|---|---|---|
| sub_8E8480 | 5.2KB | sm_90 | Base Hopper latency model |
| sub_8E8780 | 4.6KB | sm_90a | Arch-accelerated variant |
sm_90a gets its own latency table (4.6KB) distinct from sm_90 (5.2KB), even though they share all dispatch handler functions. This is the only architecture where base and a variants have separate scheduling profiles -- all Blackwell variants share their tables within each base SM.
Register file geometry from sub_8E4400:
| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 16 | Selected for codegen factory 32768 (Hopper bucket) |
| Dispatch slots | 240 | Maximum of any family; sm_75--88 get 208 |
| Sub-architecture variant | 0 | From encoded value 32768 (base variant) |
| Max threads/CTA | 240 | From code object builder sub_A465F0 |
The jump from 8 warps / 224 slots (sm_89) to 16 warps / 240 slots (sm_90) is the largest warp geometry change in the binary. This doubling of warp capacity directly corresponds to Hopper's 4-warp warpgroup execution model.
Hopper Intrinsics
The intrinsic initializer for sm_90 (sub_60A5F0) enables 38 sub-byte MMA intrinsics:
| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x23A--0x25F | 38 | __cuda_sm_9x_mma_sub_byte_internal_* |
These cover sparse sub-byte MMA operations: s4/u4 sparse variants for m16n8k32, m16n8k64, and m16n8k128 shapes. These are Hopper-specific and do not appear in the Ada (sm_89) intrinsic table.
WGMMA -- Warpgroup Matrix Multiply-Accumulate
WGMMA is Hopper's signature instruction. It operates on warpgroups (4 consecutive warps) rather than single warps, and executes asynchronously -- the tensor core operates in parallel with the warp's regular instruction stream. ptxas handles WGMMA through four PTX instructions and a dedicated compiler pass infrastructure.
PTX Instructions
Registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Formatter | Formatter Size |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | sub_4DA380 | 295B |
| wgmma.fence | -- | sub_4DA4B0 | 295B |
| wgmma.commit_group | -- | sub_4DA5E0 | 311B |
| wgmma.wait_group | -- | sub_505B00 | 1066B |
At 295 bytes, the wgmma.mma_async and wgmma.fence formatters are the smallest named formatters in ptxas, reflecting WGMMA's compact text representation. The wgmma.wait_group formatter is significantly larger (1066B) because it must also encode the pipeline depth parameter.
GMMA Pipeline Pass Infrastructure
The WGMMA pipeline optimizer is the largest single-architecture compiler subsystem in ptxas, spanning approximately 100KB of code across 15+ functions in the range 0xACE000--0xAE6000. It is active only for SM 90+ targets.
Call chain:
sub_AE4F70 (coordinator -- outside primary range)
+-- sub_ACE480 (22.7KB) WGMMA serialization warning emitter
+-- sub_ADEB40 (43.1KB) warpgroup.arrive/wait fence insertion
+-- sub_AE17C0 (37.9KB) pipeline stage builder
| +-- sub_AE0D20 (16.8KB) live range builder
| +-- sub_AE06F0 GMMA operand classifier
+-- sub_ADDDF0 (20.6KB) pass entry (vtable dispatch)
+-- sub_ADCA60 (21.7KB) scheduling coordinator
+-- sub_ADBD30 (23.9KB) register pressure estimator
| +-- sub_ADAD60 (8.4KB) live range limiter
| +-- sub_AD9C20 (14.4KB) register class allocator
+-- sub_AD70B0 (22.6KB) operand register assignment
Warpgroup Synchronization Injection
The fence insertion pass (sub_ADEB40, 43.1KB) scans for wgmma.mma_async operations and automatically injects warpgroup.arrive and warpgroup.wait instructions. These fences manage register ownership when asynchronous tensor core operations are in flight -- the hardware requires explicit handoff between the warpgroup's register file and the tensor core's accumulator registers.
Diagnostic messages emitted by the compiler:
- "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
- "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
WGMMA Serialization Warnings
When the compiler cannot pipeline WGMMA operations, sub_ACE480 (22.7KB, 98% confidence) emits detailed performance advisories using codes 0x1D55--0x1D57 (7509--7511 decimal). Nine distinct serialization reasons are enumerated:
| Reason Code | Diagnostic Message |
|---|---|
| 1 | "presence of Extern calls in the function" |
| 2 | "wgmma pipeline crossing function boundary" |
| 3 | "insufficient register resources for the wgmma pipeline" |
| 4 | "program dependence on compiler-inserted warpgroup" |
| 5 | "ill formed pipeline stage in the function" |
| 6 | "non wgmma instructions defining accumulator registers" |
| 7 | "non wgmma instructions reading accumulator registers" |
| 8 | "non wgmma instructions defining input registers" |
| 9 | "insufficient register resources for the function" |
All messages are prefixed with "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...". The pass reads its configuration from offsets +26280 (1-byte enable) and +26288 (dword threshold) on the compilation context.
GMMA Live Range Management
The live range limiter (sub_ADAD60) enforces a maximum on simultaneously active live ranges within GMMA sequences:
"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"
When the threshold is exceeded, the system triggers register spilling or sequence splitting through sub_ADBD30 (register pressure estimator). The live range builder uses FNV-1a hashing (constants 16777619 and 0x811C9DC5) for instruction deduplication.
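The recovered constants are the standard 32-bit FNV-1a prime and offset basis, so the deduplication hash itself can be reproduced exactly (a sketch; the byte serialization of instructions fed into the hash was not recovered):

```python
FNV_PRIME_32 = 16777619        # 0x01000193, observed in the live range builder
FNV_OFFSET_32 = 0x811C9DC5     # offset basis, observed alongside it

def fnv1a32(data: bytes) -> int:
    """Standard 32-bit FNV-1a over a byte string."""
    h = FNV_OFFSET_32
    for b in data:
        h = ((h ^ b) * FNV_PRIME_32) & 0xFFFFFFFF
    return h
```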
Thread-Block Clusters
Hopper introduces the concept of a thread-block cluster -- a group of cooperating CTAs that can access each other's shared memory (distributed shared memory). ptxas adds several PTX directives and special registers to support this.
Cluster Directives
The directive validator (sub_4CE6B0, 48KB) enforces mutual exclusivity of cluster configuration:
".reqnctapercluster and .maxclusterrank cannot both be specified"
Two shared-memory state spaces are distinguished:
- `.shared::cta` -- CTA-local shared memory (pre-Hopper behavior)
- `.shared::cluster` -- distributed shared memory accessible across CTAs in a cluster
Cluster Special Registers
Registered in sub_61B850 (special register table initializer):
| Special Register | Purpose |
|---|---|
| %clusterid | Cluster ID within the grid |
| %nclusterid | Number of clusters in the grid |
| %cluster_ctaid | CTA position within the cluster |
| %cluster_nctaid | Number of CTAs in the cluster |
| %cluster_ctarank | Linear rank of CTA within the cluster |
| %cluster_nctarank | Total CTAs in the cluster (linear) |
| %is_explicit_cluster | Whether this launch uses explicit clustering |
| %aggr_smem_size | Aggregate shared memory across cluster |
Distributed Shared Memory Intrinsics
The intrinsic handler OCG_DshmemHandler at sub_6C60B0 validates distributed shared memory operations:
- "Cannot use both the selfcast and the broadcast modifier."
- "Either the selfcast or the broadcast modifier must be used."
TMA -- Tensor Memory Accelerator
The Tensor Memory Accelerator (TMA) provides hardware-accelerated bulk data movement between global and shared memory, using tensor descriptors to specify multi-dimensional copy patterns. In ptxas, TMA is exposed through cp.async.bulk.tensor.
cp.async.bulk.tensor Codegen
The codegen handler (sub_5AB460, 45KB) is one of the largest single-instruction handlers in ptxas:
| Property | Value |
|---|---|
| Handler function | sub_5AB460 |
| Size | 45KB |
| Buffer allocation | 50,000 bytes |
| Registered name | "cp.async.bulk.tensor" |
| Dimensionality | 1D through 5D |
| Modes | tile, im2col |
| Cast variants | unicast, multicast |
The TMA intrinsic handler (OCG_CpAsyncTensorHandler at sub_6C8100) validates operand counts per mode:
- "Must have 1 input with a1t0 and no multicast"
- "Must have 2 inputs with a1t0 and multicast"
- "Must have 2 input with a0tx and no multicast"
- "Must have 3 inputs with a0tx and multicast"
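The four diagnostics imply a simple operand-count rule, sketched here (the `a1t0`/`a0tx` mode names are taken verbatim from the binary strings; the helper name is ours):

```python
def expected_tma_inputs(addr_mode: str, multicast: bool) -> int:
    """Operand counts implied by the OCG_CpAsyncTensorHandler diagnostics:
    a1t0 -> 1 input, a0tx -> 2 inputs, +1 when multicast is present."""
    base = 1 if addr_mode == "a1t0" else 2
    return base + (1 if multicast else 0)
```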
Tensormap Instructions
The tensormap validator (sub_4A73C0, 10.8KB) handles tensor descriptor manipulation:
- ".tile" mode validation
- "Tensormap field with input value >= 13" / "with input value == 4" bounds checking
- ".tensormap::generic" addressing mode
- "Interger Immediate for ordinal" (sic -- typo preserved from binary)
Related Bulk Copy Infrastructure
| Handler | Function | Size |
|---|---|---|
| cp.async.bulk | sub_593210 | 5.1KB (formatter) |
| cp.async.mbarrier.arrive | sub_4DC180 | -- |
| OCG_CpAsyncBulkHandler | sub_6C3470 | 20KB |
| OCG_CpAsyncHandler | sub_6C2AE0 | 10KB |
setmaxnreg -- Dynamic Register Allocation
Hopper introduces setmaxnreg (PTX opcode 315) for dynamic register count adjustment. This allows kernels to change their register footprint at runtime, enabling techniques like CTA reconfiguration and warpgroup-level resource management.
setmaxnreg Pass
The handler (sub_97EC60, ~3.5KB, 90% confidence) walks the instruction list looking for opcode 315 and processes or removes them. Five reasons for ignoring setmaxnreg are enumerated as "Potential Performance Loss" warnings:
| Reason | Message |
|---|---|
| 1 | "unable to determine register count at entry" |
| 2 | "to maintain minimum register requirements" |
| 3 | "to allow debugging" |
| 4 | "to maintain compatibility across compilation units" |
| 5 | "to maintain compatibility into 'extern' call" |
The setmaxnreg handling mode is controlled by knob 653.
CTA Reconfig Pragmas
The pragma validator (sub_97F540, ~4KB) enforces ordering constraints on setmaxreg.alloc and setmaxreg.dealloc:
- "Found an 'alloc' pragma after 'dealloc'"
- "Found incompatible thread count re-specification"
- "Found a 'dealloc' pragma after 'alloc'"
The dealloc validator (sub_98D100, ~4.8KB) enforces register count bounds:
- "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d)"
- "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count"
Async Pipeline Features
Hopper extends the async copy pipeline introduced in Ampere with mbarrier (memory barrier) objects. ptxas implements mbarrier handling through a cluster of detector, classifier, and emitter functions:
| Function | Address | Identity |
|---|---|---|
| MBarrierDetector::isNonTrivialMBarrier | sub_A94440 | Checks "%mbarrier_" prefix |
| MBarrierDetector::classifyMBarrier | sub_A9A5F0 | Returns packed `(type << 32) \| is_mbarrier` |
| MBarrierDetector::resolveMBarrierBaseName | sub_A9A920 | Extracts base name from symbol |
| MBarrierEmitter::rewriteMBarrierOperand | sub_AA33C0 | Constructs "%%mbarrier_%s_%s" |
The mbarrier type classifier returns an enumeration of 0--12 for different operation kinds, covering arrive, arrive_drop, expect_tx, and their counted variants.
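The detector/classifier pair can be sketched as follows (the prefix test and packed-return convention come from the recovered functions; the `kind` parameter stands in for the 0--12 enumeration, whose exact ordering is not documented here):

```python
MBARRIER_PREFIX = "%mbarrier_"

def is_non_trivial_mbarrier(sym: str) -> bool:
    # sub_A94440: prefix test on the symbol name
    return sym.startswith(MBARRIER_PREFIX)

def classify_mbarrier(sym: str, kind: int) -> int:
    # sub_A9A5F0 convention: packed (type << 32) | is_mbarrier
    flag = 1 if is_non_trivial_mbarrier(sym) else 0
    return (kind << 32) | flag
```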
Warpgroup Attribute Management
The warpgroup attribute processor (sub_64BAF0, 30KB) handles kernel-level attributes introduced with Hopper:
| Attribute String | Purpose |
|---|---|
"warpgroup-commit_batch" | WGMMA commit batch configuration |
"warpgroup-wait" | WGMMA wait configuration |
"warpgroup-arrive" | WGMMA arrive configuration |
"setsmemsize-state" | Shared memory size pragma state |
"setmaxreg-state" | Register limit pragma state |
"func_begin" | Function entry marker |
"CC-temp" | Calling convention temporary |
Architecture Version Threshold Checks
The binary uses the encoded SM version (codegen factory value) for feature gating throughout the compiler. Key thresholds observed:
| Check Pattern | Threshold | Meaning |
|---|---|---|
| profile[+372] > 28673 | > sm_80 base | Post-Ampere features |
| a2 <= 32767 | <= sm_89 class | Pre-Hopper geometry (7 warps, 208 slots) |
| a2 <= 36863 | <= sm_89 extended | Ampere/Ada geometry (8 warps, 224 slots) |
| a2 > 36863 | >= sm_90 | Hopper+ geometry (16 warps, 240 slots) |
| Codegen factory == 32768 | sm_90 exactly | Hopper-specific code paths |
| Codegen factory >= 32768 | sm_90+ | WGMMA, cluster, TMA enabled |
The register file descriptor at sub_8E4400 uses the encoded value to select warp geometry. The full cascade:
encoded <= 20479 -> 4 warps, 96 slots (pre-Maxwell)
encoded <= 24575 -> 6 warps, 176 slots (Pascal)
encoded <= 28672 -> 7 warps, 192 slots (Volta)
encoded <= 32767 -> 7 warps, 208 slots (Turing/Ampere/Ada)
encoded <= 36863 -> 8 warps, 224 slots (Ampere extended)
encoded > 36863 -> 16 warps, 240 slots (Hopper+)
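The cascade translates directly into code (a sketch of the recovered comparisons in sub_8E4400; returns a `(warps, dispatch_slots)` pair):

```python
def warp_geometry(encoded_sm: int) -> tuple:
    """Warp-geometry cascade recovered from sub_8E4400."""
    if encoded_sm <= 20479:
        return (4, 96)     # pre-Maxwell
    if encoded_sm <= 24575:
        return (6, 176)    # Pascal
    if encoded_sm <= 28672:
        return (7, 192)    # Volta
    if encoded_sm <= 32767:
        return (7, 208)    # Turing/Ampere/Ada
    if encoded_sm <= 36863:
        return (8, 224)    # Ampere extended
    return (16, 240)       # beyond the 36863 threshold
```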
Note: sm_89 (encoded 28677) falls in the <= 32767 range, giving it 7 warps / 208 slots. But the separate warp geometry lookup uses a different cascade where sm_89 (as Ampere class) gets 8 warps / 224 slots. The dual-cascade structure reflects the fact that different subsystems query different profile fields.
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.
sm_90a shares sm_90's geometry -- the a suffix affects only compatibility metadata, not hardware resource limits. The jump from sm_89 (7 partitions, 208 slots, 48 warps) to sm_90 (8 partitions, 224 slots, 64 warps) is the largest single-generation scheduling capacity increase in the binary, directly supporting Hopper's 4-warp warpgroup execution model.
MMA Instruction Validators
Several validator functions gate MMA features by SM version:
| Validator | Address | Size | SM Strings Referenced |
|---|---|---|---|
| WMMA/MMA validator | sub_4C2FD0 | 12.2KB | "sm_90", "sm_80", "sm_75" |
| MMA scale/block validator | sub_49BBA0 | 11.4KB | "sm_%d", "mma with FP8 floating point type" |
| WMMA shape validator | sub_4BFED0 | 10.3KB | "sm_80", "sm_75" |
| CVT arch-specific | sub_4A6050 | 5.0KB | "%s on sm_89" |
| Special register validator | sub_49A5A0 | 3.5KB | "sm_90", "%laneid", "%warpid" |
| Instruction fusion validator | sub_4AA3E0 | 7.1KB | "sm_90" |
| Float instruction validator | sub_4A2CA0 | 3.7KB | "sm_90" |
| Function address validator | sub_4B1630 | 4.6KB | "sm_90", "sm_30", "sm_20" |
The WMMA/MMA validator at sub_4C2FD0 performs three-way version checks: sm_75 for base WMMA, sm_80 for extended types (BF16/TF32), sm_90 for WGMMA features. FP8 MMA is gated by both sm_89 (Ada) for the data types and sm_90 (Hopper) for the warpgroup shapes.
Post-Scheduling Statistics
Eight architecture-variant statistics printers (clones at 0x700-byte intervals from sub_ABBA50) emit DUMPIR statistics. The metrics cover Hopper-specific counters:
| Metric | Format String |
|---|---|
| MMA counts | "# [hmma1688=%d]" (and variants for imma, sparse, dmma) |
| Occupancy | "# [Occupancy = %f]" |
| Per-unit throughput | "# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] ..." |
| Issue throughput | "# [issue thru=%f] [adu thru=%f] [alu thru=%f] ..." |
| WGMMA serialization | "Potential Performance Loss: wgmma.mma_async ..." (9 variants) |
| Shared memory | "# [SharedMem Alloc thru=%f]" |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_4A6050 | 5.0KB | CVT validator (sm_89 special cases) | 85% |
sub_4C2FD0 | 12.2KB | WMMA/MMA validator (sm_75/80/90 version checks) | 90% |
| sub_4DA380 | 295B | wgmma.mma_async formatter | 99% |
| sub_4DA4B0 | 295B | wgmma.fence formatter | 99% |
| sub_4DA5E0 | 311B | wgmma.commit_group formatter | 99% |
| sub_505B00 | 1066B | wgmma.wait_group formatter | 99% |
| sub_50AC70 | -- | wgmma.mma_async codegen handler | 99% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (TMA) | 95% |
| sub_609CF0 | ~1.2KB | sm_89 handler B (capability accessor) | 90% |
| sub_609DB0 | ~1.2KB | sm_90 handler A (capability accessor) | 90% |
| sub_609E10 | ~1.2KB | sm_89 handler A (capability accessor) | 90% |
| sub_60A5F0 | ~1KB | sm_90 intrinsic table initializer | 85% |
| sub_60A810 | ~1KB | sm_89 intrinsic table initializer | 85% |
| sub_61B850 | 10KB | Special register table (cluster regs) | 99% |
| sub_64BAF0 | 30KB | Warpgroup/kernel attribute processor | 80% |
| sub_6C60B0 | -- | Distributed shared memory intrinsic handler | 65% |
| sub_6C8100 | -- | TMA (cp.async.tensor) intrinsic handler | 85% |
| sub_8E4400 | 3.3KB | Register file geometry initializer | 90% |
| sub_8E8280 | 3.1KB | sm_89 (Ada) HW latency table | 85% |
| sub_8E8480 | 5.2KB | sm_90 (Hopper) HW latency table | 85% |
| sub_8E8780 | 4.6KB | sm_90a HW latency table | 85% |
| sub_97EC60 | ~3.5KB | setmaxnreg handler (opcode 315) | 90% |
| sub_97F540 | ~4KB | CTA reconfig pragma validator | 90% |
| sub_98D100 | ~4.8KB | setmaxreg.dealloc validator | 90% |
| sub_A94440 | -- | MBarrierDetector::isNonTrivialMBarrier | 85% |
| sub_A9A5F0 | -- | MBarrierDetector::classifyMBarrier | 85% |
| sub_AA33C0 | -- | MBarrierEmitter::rewriteMBarrierOperand | 85% |
| sub_ACE480 | 22.7KB | WGMMA serialization warning emitter | 98% |
| sub_AD70B0 | 22.6KB | GMMA operand register assignment | 75% |
| sub_AD9C20 | 14.4KB | GMMA register class allocator | 75% |
| sub_ADAD60 | 8.4KB | GMMA live range limiter | 90% |
| sub_ADBD30 | 23.9KB | GMMA register pressure estimator | 80% |
| sub_ADCA60 | 21.7KB | GMMA scheduling coordinator | 85% |
| sub_ADDDF0 | 20.6KB | GMMA pass entry (vtable) | 80% |
| sub_ADEB40 | 43.1KB | WGMMA sync fence insertion | 95% |
| sub_AE0D20 | 16.8KB | GMMA live range builder | 80% |
| sub_AE17C0 | 37.9KB | GMMA pipeline stage builder | 85% |
| sub_AE4F70 | -- | GMMA pass coordinator (outside range) | 90% |
| sub_AE5030 | 15.5KB | GMMA scheduling wrapper (alt entry) | 75% |
Cross-References
- SM Architecture Map -- Validation tables, codegen factory values, suffix semantics
- Turing & Ampere (SM 75--88) -- Ampere baseline that Ada inherits
- Blackwell (SM 100--121) -- Next-generation features (tcgen05, expanded `a`/`f` variants)
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with sm_8x and sm_9x ranges
- Pass Inventory -- GMMA/WGMMA pipeline pass placement in 159-phase schedule
- Scheduling Overview -- HW latency table architecture
- CLI Options -- `--gpu-name sm_90a` parsing
- Knobs System -- Knob 653 (setmaxnreg mode)
Blackwell (SM 100--121)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas v13.0.88 handles five Blackwell-era base targets -- sm_100, sm_103, sm_110, sm_120, sm_121 -- spanning datacenter, automotive, consumer, and DGX product lines. All share the codegen factory value 36864 (generation 9, 9 << 12) and the "Blackwell" family string internally, despite being distinct microarchitectures. The defining Blackwell feature is Capsule Mercury (capmerc) as the default binary output format, automatically enabled for SM numbers exceeding 99. The datacenter variants (sm_100, sm_103, sm_110) support tcgen05 (5th-generation tensor cores with dedicated tensor memory); the consumer variants (sm_120, sm_121) do not.
| SM targets | sm_100, sm_103, sm_110, sm_120, sm_121 (+ a and f sub-variants each) |
| Codegen factory | 36864 (0x9000, generation 9) |
| Family string | "Blackwell" (all five targets) |
| Default binary format | Capsule Mercury (capmerc) -- auto-enabled for SM > 99 |
| SASS encoding | 128-bit per instruction (Mercury-encoded) |
| Warp geometry | 16 warps, 240 dispatch slots (shared with Hopper sm_90) |
| Sub-variants per SM | 3: base, a (accelerated), f (feature-reduced) |
| Profile constructor | sub_6765E0 (54KB) |
| Capability dispatch | sub_607DB0 (7 hash maps, once-guarded) |
SM Version Table
| SM | Product | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|
| sm_100 | B100 / B200 (datacenter) | 1000 | 36864 | 0x9000 | 0 (gen 9 base) |
| sm_103 | GB300 (Blackwell Ultra) | 1030 | 36867 | 0x9003 | 3 |
| sm_110 | Jetson Thor SoC | 1100 | 36868 | 0x9004 | 4 |
| sm_120 | RTX 50xx / RTX Pro | 1200 | 36869 | 0x9005 | 5 |
| sm_121 | DGX Spark | 1210 | 36869 | 0x9005 | 5 |
Codegen factory encoding: (9 << 12) | sub_variant. sm_100 is variant 0 (generation base). sm_103 is variant 3. sm_110 is variant 4. sm_120 and sm_121 appear to share variant 5 in the scheduling sub-architecture table at sub_8E4400.
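The encoding is a straightforward pack/unpack, sketched here (function names are ours):

```python
def encode_factory(generation: int, variant: int) -> int:
    # (gen << 12) | sub_variant, per the recovered codegen factory encoding
    return (generation << 12) | variant

def decode_factory(value: int) -> tuple:
    # returns (generation, sub_variant)
    return (value >> 12, value & 0xFFF)
```

Decoding sm_90's factory value 32768 gives generation 8, variant 0, consistent with the Hopper section's "base variant" note.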
Unreleased SM numbers referenced in the binary: The SASS formatter sub_583190 (rsqrt) checks for SM codes 102, 103, 107, 124, 130 in architecture-specific dispatch paths, suggesting internal/future variants beyond the five publicly exposed targets.
Sub-Variant System
Every Blackwell SM has three sub-variants. The base and a/f variants within an SM share all 7 dispatch table handler functions -- they are identical silicon with different compatibility metadata and feature exposure.
Profile Object Fields
The profile constructor sub_6765E0 builds profile objects for each sub-variant with these fields:
| SM | Base | a Variant | f Variant |
|---|---|---|---|
| sm_100 | sm_100 / compute_100 / lto_100 | sm_100a / compute_100a / lto_100a | sm_100f / compute_100f / lto_100f |
| sm_103 | sm_103 / compute_103 / lto_103 | sm_103a / compute_103a / lto_103a | sm_103f / compute_103f / lto_103f |
| sm_110 | sm_110 / compute_110 / lto_110 | sm_110a / compute_110a / lto_110a | sm_110f / compute_110f / lto_110f |
| sm_120 | sm_120 / compute_120 / lto_120 | sm_120a / compute_120a / lto_120a | sm_120f / compute_120f / lto_120f |
| sm_121 | sm_121 / compute_121 / lto_121 | sm_121a / compute_121a / lto_121a | sm_121f / compute_121f / lto_121f |
CUDA_ARCH Macro Values
| Sub-Variant | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Base | -D__CUDA_ARCH__=1000 | =1030 | =1100 | =1200 | =1210 |
| Accelerated | =100a0 | =103a0 | =110a0 | =120a0 | =121a0 |
| Feature-reduced | =100f0 | =103f0 | =110f0 | =120f0 | =121f0 |
Suffix Bit Flags in Profile Objects
From the decompiled profile constructor, suffixed variants set specific byte flags:
| Suffix | Flag Position | Evidence |
|---|---|---|
| a (accelerated) | profile[4] = 1 | `v79->m128i_i8[4] = 1; *(_BYTE *)(v82 + 4) = 1;` |
| f (feature-reduced) | profile[5] = 1 (on all 3 objects: sm, compute, lto) | `v88->m128i_i8[5] = 1; *(_BYTE *)(v91 + 5) = 1; v94[5] = 1;` |
The a flag is set on the SM and compute profile objects only. The f flag is set on all three (sm, compute, lto), reflecting the fact that f-compiled code must retain its feature-reduced metadata through linking.
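A minimal model of the flag stores (a sketch: the three-object tuple and byte offsets mirror the decompiled code, while the bytearray representation is illustrative):

```python
def apply_suffix_flags(sm, compute, lto, suffix):
    """sm/compute/lto are byte-addressable profile objects (bytearrays here).
    'a' sets byte +4 on the sm and compute objects only; 'f' sets byte +5
    on all three, matching the decompiled stores."""
    if suffix == "a":
        sm[4] = 1
        compute[4] = 1
    elif suffix == "f":
        for obj in (sm, compute, lto):
            obj[5] = 1
```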
isaClass Inheritance
Sub-variants reference their base SM's isaClass rather than defining a new one:
- sm_100a and sm_100f reference "(profile_sm_100)->isaClass"
- sm_103a and sm_103f reference "(profile_sm_103)->isaClass"
- sm_120a and sm_120f reference "(profile_sm_120)->isaClass"
- sm_121a and sm_121f reference "(profile_sm_121)->isaClass"
This confirms that sub-variants share the instruction set architecture class with their base. The a/f distinction is purely in compatibility metadata, not in the ISA or codegen.
Capability Dispatch
sub_607DB0 registers handler functions into 7 parallel hash maps. All sub-variants of a given SM register the same function pointers.
Handler Assignments (Maps 1--3)
| SM | Handler A (Map 1) | Handler B (Map 2) | Intrinsic Init (Map 3) |
|---|---|---|---|
| sm_100 / 100a / 100f | sub_609C30 | sub_609BD0 | sub_60A910 |
| sm_103 / 103a / 103f | sub_608F20 | sub_609D20 | sub_60A700 |
| sm_110 / 110a / 110f | sub_609F30 | sub_608F50 | sub_60AA20 |
| sm_120 / 120a / 120f | sub_609E40 | sub_609C60 | sub_608DF0 |
| sm_121 / 121a / 121f | sub_609ED0 | sub_609BA0 | sub_60A4E0 |
Performance / Occupancy Handlers (Maps 6--7)
| SM | Handler E (Map 6) | Handler F (Map 7) |
|---|---|---|
| sm_100 / 100a / 100f | sub_609080 | sub_6098A0 |
| sm_103 / 103a / 103f | sub_609020 | sub_6091A0 |
| sm_110 / 110a / 110f | sub_609000 | sub_609280 |
| sm_120 / 120a / 120f | sub_608FE0 | sub_609520 |
| sm_121 / 121a / 121f | sub_609040 | sub_6097C0 |
Every Blackwell SM has unique handler functions in all 7 maps. This contrasts with Hopper where sm_90 and sm_90a share all handlers. Each Blackwell SM variant is architecturally distinct enough to warrant separate capability accessors, performance models, and intrinsic tables.
Warp Geometry
The warp geometry initializer at sub_8E4400 uses the codegen factory value to select dispatch parameters. All Blackwell targets (codegen factory > 36863) fall into the maximum bucket:
encoded > 36863 -> 16 warps, 240 dispatch slots
This is identical to Hopper (sm_90). The 16-warp / 240-slot geometry supports Blackwell's warpgroup execution model (4 warps per warpgroup, 4 warpgroups per SM partition).
Sub-Architecture Variant Table
The secondary variant assignment at sub_8E4400 maps codegen factory values to sub-architecture indices:
| Codegen Factory | Variant | SM |
|---|---|---|
| 36864 | 0 | sm_100 (base) |
| 36867 | 3 | sm_103 |
| 36868 | 4 | sm_110 |
| 36869 | 5 | sm_120, sm_121 |
These variant indices select different entries within the per-SM latency tables, allowing the scheduler to use silicon-specific pipeline timing.
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
| sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
| sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
| sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.
All Blackwell targets share the 16-partition / 240-slot geometry (identical to Hopper sm_90). The a and f sub-variants within each SM share the same geometry -- differentiation is in compatibility metadata and feature exposure, not in resource limits. The primary distinction across Blackwell SMs is in the latency tables and tcgen05 availability, not in the scheduling partition structure.
Capsule Mercury (capmerc) -- Default Output Format
Capsule Mercury is automatically enabled for all Blackwell targets. When the SM architecture number exceeds 99, ptxas sets the capmerc flag at offset+81 in the compilation context. This applies to sm_100, sm_103, sm_110, sm_120, and sm_121 uniformly.
Three Output Modes
| Mode | String | Default For | SM Range |
|---|---|---|---|
| mercury | "mercury" | sm_75 -- sm_90 | Turing through Hopper |
| capmerc | "capmerc" | sm_100 -- sm_121 | All Blackwell |
| sass | "sass" | None (explicit only) | Any |
Capsule Mercury vs Mercury
Both modes use the same Mercury encoder pipeline (phases 117--122). The capmerc distinction is at the ELF emission level:
- Mercury produces a fully-resolved SASS binary in `.text.<funcname>` sections
- Capsule Mercury wraps Mercury-encoded instructions in `.nv.capmerc<funcname>` sections with a 328-byte capsule descriptor, plus `.nv.merc.*` debug/metadata sections
The capsule descriptor (constructed by sub_1C9C300, 24KB) contains the Mercury instruction stream, relocation metadata (R_MERCURY_* types), KNOBS compilation configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage).
Opportunistic Finalization
Capsule Mercury enables deferred finalization -- compiling once for one SM and reconstituting SASS for a different SM at link or load time.
| Level | Name | Behavior | Example |
|---|---|---|---|
| 0 | default | Standard finalization (compile-target only) | -- |
| 1 | none | No finalization; output stays as capmerc | -- |
| 2 | intra-family | Finalize within same SM family | sm_100 -> sm_103 |
| 3 | intra+inter | Finalize across SM families | sm_100 -> sm_120 |
The compatibility checker sub_60F290 determines whether a capmerc binary compiled for SM X can be finalized for SM Y. On success, ptxas emits: "applied for off-target %u -> %u finalization".
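The level semantics from the table can be sketched as a predicate (illustrative only -- the real compatibility logic lives in sub_60F290 and is considerably more involved than a family check):

```python
FINALIZE_LEVELS = {0: "default", 1: "none", 2: "intra-family", 3: "intra+inter"}

def may_finalize(level: int, src_sm: int, dst_sm: int, same_family: bool) -> bool:
    if level == 0:
        return src_sm == dst_sm      # standard: compile-target only
    if level == 1:
        return False                 # no finalization; output stays capmerc
    if level == 2:
        return same_family           # e.g. sm_100 -> sm_103
    return True                      # level 3: across families, e.g. sm_100 -> sm_120
```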
Self-Check Mechanism
The --self-check CLI option performs roundtrip verification:
- Generate capmerc output (Mercury encoding + metadata)
- Reconstitute SASS from the capmerc data
- Compare section-by-section; report error codes 17 (content mismatch), 18 (count mismatch), 19 (metadata mismatch)
The reconstituted SASS can be dumped with --out-sass for debugging self-check failures.
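The comparison step can be modeled with the three error codes (a sketch; the section model here -- name mapped to a (payload, metadata) pair -- is an assumption, only the codes come from the binary):

```python
ERR_CONTENT, ERR_COUNT, ERR_META = 17, 18, 19  # self-check error codes

def self_check(original: dict, reconstituted: dict) -> int:
    """Section-by-section roundtrip comparison; 0 means the roundtrip matched."""
    if len(original) != len(reconstituted):
        return ERR_COUNT
    for name, (payload, meta) in original.items():
        if name not in reconstituted:
            return ERR_COUNT
        r_payload, r_meta = reconstituted[name]
        if r_payload != payload:
            return ERR_CONTENT
        if r_meta != meta:
            return ERR_META
    return 0
```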
TCGen05 -- 5th Generation Tensor Cores
TCGen05 is the defining hardware feature of Blackwell datacenter parts. It introduces tensor memory (TMEM) as a dedicated register-like storage directly connected to the tensor core, eliminating the shared-memory bottleneck of previous WGMMA designs.
SM Availability
| SM | tcgen05 Available | Notes |
|---|---|---|
| sm_100 / 100a / 100f | Yes | Full datacenter tcgen05 |
| sm_103 / 103a / 103f | Yes | Blackwell Ultra -- same tcgen05 ISA |
| sm_110 / 110a / 110f | Yes | Jetson Thor -- full tcgen05 hardware |
| sm_120 / 120a / 120f | No | Consumer -- no TMEM, no tcgen05 |
| sm_121 / 121a / 121f | No | DGX Spark -- no TMEM, no tcgen05 |
The tcgen05 ISA is gated by SM version checks (visible as sub_70FA00(*, 29) capability queries). sm_120 and sm_121 fall back to inherited HMMA/IMMA/WGMMA tensor core paths.
PTX Instructions
Registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Formatter | Size |
|---|---|---|---|
| tcgen05.alloc | sub_569180 | sub_526370 | 1287B |
| tcgen05.relinquish_alloc_permit | sub_526370 | -- | -- |
| tcgen05.dealloc | sub_58C7F0 | sub_574050 | 2130B |
| tcgen05.ld | sub_574050 | sub_578DB0 | 2466B |
| tcgen05.ld.red | sub_578DB0 | -- | -- |
| tcgen05.st | sub_571FE0 | sub_56C190 | 1842B |
| tcgen05.commit | sub_56C190 | sub_5427F0 | 1575B |
| tcgen05.cp | sub_5427F0 | sub_4F1A90 | 903B |
| tcgen05.shift | sub_4F1A90 | sub_58FA20 | 4604B |
| tcgen05.mma | sub_5BBC30 (90KB) | -- | -- |
| tcgen05.mma.ws | sub_58FA20 | sub_4DA720 | 343B |
The tcgen05.mma codegen handler at sub_5BBC30 is 90KB -- the largest single-instruction handler in ptxas -- reflecting the complexity of 5th-gen tensor core MMA with tensor memory operands, scale factors, sparsity, and accumulator management.
TCGen05 Guardrail Functions
Eight debug/validation functions provide runtime instrumentation for tensor memory operations when compiled with --g-tensor-memory-access-check (or -g-tmem-access-check):
| Guardrail | Formatter | Size | Validation |
|---|---|---|---|
| _tcgen05.guardrails.is_phase_valid | sub_4DA720 | 775B | Phase lifecycle |
| _tcgen05.guardrails.are_columns_allocated | sub_4DDE70 | 599B | Column allocation |
| _tcgen05.guardrails.is_current_warp_valid_owner | sub_4DBF20 | 791B | Warp ownership |
| _tcgen05.guardrails.in_physical_bounds | sub_4DB050 | 439B | Memory bounds |
| _tcgen05.guardrails.allocation_granularity | sub_4F0960 | 839B | Allocation alignment |
| _tcgen05.guardrails.datapath_alignment | sub_4DD580 | 735B | Data path checks |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | sub_500FA0 | 970B | Sparse descriptor |
| _tcgen05.guardrails.check_sparse_usage | sub_4DDB80 | 743B | Sparsity validation |
The bounds checker (sub_70E0E0, 296 lines decompiled) generates inline PTX code to validate tensor memory column counts, extracting bitfield positions from tcgen05 descriptors (e.g., and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8).
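That PTX sequence is a plain bitfield extraction, sketched here (the field name is ours; the mask, shift, and scale come from the emitted instructions):

```python
def tmem_column_count(descriptor: int) -> int:
    """Mirrors the inline PTX: and.b32 0x7E0000 -> shr.u32 17 -> mul.lo.u32 8.
    Extracts the 6-bit column field at bit 17 and scales it by 8."""
    return ((descriptor & 0x7E0000) >> 17) * 8
```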
TCGen05 Intrinsics
| ID Range | Count | Category |
|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* (trap on validation failure) |
| 0x230--0x239 | 10 | __cuda_sm_10x_* (hmma/imma mdata + bit MMA) |
Additional tcgen05 helper intrinsics observed in decompiled code:
- `__cuda_sm_100_tcgen05_ld_red_immhalfSplitOff` -- load-reduce with immediate half-split offset
- `__cuda_sm_100_tcgen05_ld_immhalfSplitOff` -- load with immediate half-split offset
- `__cuda_sm_100_tcgen05_st_immhalfSplitOff` -- store with immediate half-split offset
- `__cuda_sm_100_tcgen05_ld_red_funcRetArr` -- load-reduce returning array
- `__cuda_sm_100_tcgen05_ld_funcRetArr` -- load returning array
These helpers (decompiled in sub_70D910 and sub_70DDB0) generate inline PTX for complex tensor memory access patterns including array returns via ld.param.b32 sequences.
Bulk Copy Intrinsics (sm_1xx)
18 intrinsics in the __cuda_sm1xx_* namespace cover cp.async.bulk.tensor 1D--5D in tile and im2col modes. These extend the Hopper TMA infrastructure with Blackwell-specific enhancements.
SM 100 / SM 100a / SM 100f -- Blackwell Datacenter
sm_100 is the reference Blackwell architecture. Codegen factory 36864 (0x9000), CUDA_ARCH 1000.
Products
B100, B200 (datacenter GPU), paired as GB200 NVL72 superchips.
Key Features
- TCGen05: Full 5th-gen tensor core with TMEM (alloc, dealloc, ld, st, commit, cp, shift, mma, mma.ws)
- Capsule Mercury: Default output format (auto-enabled for SM > 99)
- WGMMA inherited: Warpgroup MMA from Hopper carries forward
- Cluster operations: Thread-block clusters, distributed shared memory (from Hopper)
- setmaxnreg: Dynamic register allocation (from Hopper)
- Uniform register ALU: UFADD, UFFMA, UFSEL, UFSETP, UVIADDR (Blackwell uniform register ISA additions)
Handler Functions
| Map | Function | Role |
|---|---|---|
| Handler A | sub_609C30 | Primary codegen capability accessor |
| Handler B | sub_609BD0 | Secondary codegen capability accessor |
| Intrinsic init | sub_60A910 | Intrinsic table population |
| Perf/occupancy E | sub_609080 | Performance statistics |
| Perf/occupancy F | sub_6098A0 | Occupancy calculator |
HW Latency Table
sub_8E8A90 (3.0KB) -- the base Blackwell latency table. Two-part structure: a 3.0KB base table for standard instructions plus a ~949-byte TCGEN05 supplement covering tensor core scheduling classes 745--772+.
Profile Object
From sub_6765E0:
SM name: "sm_100"
Compute name: "compute_100"
LTO name: "lto_100"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1000"
The profile constructor stores dword_29FE2C4 = 100 after constructing all sm_100 sub-variants, likely recording the current highest-registered base SM number.
SM 103 / SM 103a / SM 103f -- Blackwell Ultra
sm_103 is Blackwell Ultra, targeting the GB300 NVL72 platform. Codegen factory 36867 (0x9003), CUDA_ARCH 1030.
Products
GB300 (datacenter, Blackwell Ultra). Incremental silicon revision over sm_100.
Differentiation from sm_100
The SASS formatter sub_583190 (rsqrt instruction) explicitly checks for "sm_103" and applies a Blackwell Ultra-specific operand layout. The formatter dispatch first tests raw SM codes (102, 103, 107, 130, 124), then uses check_target_sm(instr, 0, "sm_103") for string-based validation.
From sweep data on the SM103-specific encoding path:
- Sets encoding flags XOR 0x10, XOR 0x40 for sm_103-specific instruction variants
- New operand layout for transcendental instructions (rsqrt, likely also rcp/sqrt)
sm_103 has a separate 618-byte supplementary latency table -- the smallest in the binary -- suggesting minimal scheduling parameter changes from sm_100.
Handler Functions
All unique from sm_100:
| Map | Function |
|---|---|
| Handler A | sub_608F20 |
| Handler B | sub_609D20 |
| Intrinsic init | sub_60A700 |
| Perf/occupancy E | sub_609020 |
| Perf/occupancy F | sub_6091A0 |
Profile Object
SM name: "sm_103"
Compute name: "compute_103"
LTO name: "lto_103"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1030"
SM 110 / SM 110a / SM 110f -- Jetson Thor
sm_110 targets the Jetson Thor SoC for automotive and robotics applications. Codegen factory 36868 (0x9004), CUDA_ARCH 1100.
Products
Jetson Thor (automotive-grade SoC with integrated GPU). Originally internally designated sm_101 before rename.
sm_101 Legacy Alias
sm_101 (with variants sm_101a and sm_101f) was the original internal name for Jetson Thor. It was renamed to sm_110 in a later CUDA release, but all three validation table entries are retained for backward compatibility:
| Table | Entry | PTX ISA | Purpose |
|---|---|---|---|
| Base (unk_1D16220) | {101, 8, 6} | 8.6 | Accepts --gpu-name sm_101 in existing PTX files |
| Accelerated (unk_1D161C0) | {101, 8, 6} | 8.6 | Accepts sm_101a |
| Feature-reduced (unk_1D16160) | {101, 8, 8} | 8.8 | Accepts sm_101f |
The validation tables use bsearch() (sub_484B70 comparator), so both sm_101 and sm_110 are independently findable. However, sub_6765E0 (the profile constructor) registers only sm_110 / sm_110a / sm_110f -- there is no profile object for sm_101. After passing validation, sm_101 must resolve to the sm_110 profile through an internal aliasing path (likely in sub_4B1080, the target directive parser).
The PTX ISA version difference is notable: sm_101 requires PTX 8.6 (same as sm_100), while sm_110 requires PTX 9.0. This reflects the timeline -- sm_101 was named when the Jetson Thor target was first added alongside sm_100, before the sm_110 numbering and PTX 9.0 specification existed.
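The lookup-then-alias behavior can be sketched as follows. The table contents are an illustrative excerpt (only the sm_101/sm_110 entries come from the recovered tables above), and the aliasing function is a guess at what the suspected path in sub_4B1080 must do, not a decompiled transcript:

```python
import bisect

# Illustrative excerpt of a validation table: (sm_number, ptx_major, ptx_minor),
# sorted by sm_number as the binary's bsearch() comparator (sub_484B70) requires.
BASE_TABLE = [(75, 6, 3), (100, 8, 6), (101, 8, 6), (110, 9, 0)]

def lookup_sm(table, sm):
    """Binary search for an SM entry, mirroring the bsearch() over sorted triples."""
    i = bisect.bisect_left(table, (sm,))
    if i < len(table) and table[i][0] == sm:
        return table[i]
    return None

def resolve_profile(sm: int) -> int:
    """sm_101 passes validation via its retained entry but has no profile
    object, so it must alias to sm_110 before profile selection."""
    return 110 if sm == 101 else sm
```

Both sm_101 and sm_110 are independently findable by the search, but only sm_110 reaches the profile constructor.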
Key Characteristics
- Full tcgen05 hardware: Retains datacenter-class tensor memory and tensor core features
- Same WGMMA/cluster support: Inherits Hopper-era warpgroup and cluster operations
- SoC-specific constraints: Differentiated from sm_100 through capability flags, not through missing features -- the capability accessor functions (sub_609F30, sub_608F50) return SoC-appropriate resource limits
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609F30 |
| Handler B | sub_608F50 |
| Intrinsic init | sub_60AA20 |
| Perf/occupancy E | sub_609000 |
| Perf/occupancy F | sub_609280 |
Profile Object
SM name: "sm_110"
Compute name: "compute_110"
LTO name: "lto_110"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1100"
Note: sm_110 uses different xmmword constants for profile fields [5]--[7] compared to sm_100, visible in the profile constructor where v98[5] = xmmword_2027D70; v98[6] = v103 (from v209); v98[7] = v104 (from v207). This encodes different hardware resource parameters.
SM 120 / SM 120a / SM 120f -- Blackwell Consumer
sm_120 targets consumer and enterprise workstation GPUs. Codegen factory 36869 (0x9005), CUDA_ARCH 1200.
Products
RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 (consumer). RTX Blackwell Pro (enterprise workstation).
The tcgen05 Gap
sm_120 is architecturally distinct from sm_100 despite sharing the "Blackwell" family string. The critical difference: no tcgen05 support. The entire tensor memory subsystem (alloc, dealloc, ld, st, commit, cp, shift, mma) is absent. Tensor operations fall back to:
- HMMA/IMMA inherited from sm_70--sm_89 (direct MMA path)
- WGMMA inherited from sm_90 (warpgroup async MMA)
This is gated by SM version checks in the capability accessor functions. The intrinsic table initializer (sub_608DF0) does not register tcgen05 intrinsic handlers for sm_120.
HW Latency Table
sm_120 has a distinct latency model, split into two parts:
| Function | Size | Content |
|---|---|---|
| sub_8E9000 | 2.9KB | Base consumer Blackwell table |
| sub_8E92E0 | 5.5KB | Extended table (largest individual table in binary) |
The 5.5KB extended table is larger than any other individual latency table, suggesting that sm_120's consumer pipeline has significantly different scheduling characteristics from datacenter sm_100. The consumer pipeline likely has different functional unit counts, memory latencies, and tensor core throughput profiles.
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609E40 |
| Handler B | sub_609C60 |
| Intrinsic init | sub_608DF0 |
| Perf/occupancy E | sub_608FE0 |
| Perf/occupancy F | sub_609520 |
Profile Object
SM name: "sm_120"
Compute name: "compute_120"
LTO name: "lto_120"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1200"
sm_120 uses a third distinct set of xmmword constants for profile fields, including xmmword_2027DC0 at field [6], confirming different hardware resource parameters from both sm_100 and sm_110.
SM 121 / SM 121a / SM 121f -- DGX Spark
sm_121 targets the DGX Spark desktop AI workstation. Codegen factory 36869 (0x9005), CUDA_ARCH 1210.
Products
NVIDIA DGX Spark (desktop AI workstation with Grace CPU + Blackwell GPU).
Relationship to sm_120
sm_121 shares the same codegen factory sub-variant (5) as sm_120 in the scheduling sub-architecture table, and inherits sm_120's xmmword profile constants. This suggests sm_121 is a binned or slightly modified sm_120 die, similar to how sm_86 relates to sm_80 in the Ampere generation.
Like sm_120, sm_121 has no tcgen05 support -- tensor operations use the HMMA/IMMA/WGMMA path.
Handler Functions
All unique from sm_120:
| Map | Function |
|---|---|
| Handler A | sub_609ED0 |
| Handler B | sub_609BA0 |
| Intrinsic init | sub_60A4E0 |
| Perf/occupancy E | sub_609040 |
| Perf/occupancy F | sub_6097C0 |
Profile Object
SM name: "sm_121"
Compute name: "compute_121"
LTO name: "lto_121"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1210"
Blackwell Uniform Register ISA
Blackwell extends the uniform register (UR) ISA introduced in Turing/Ampere with dedicated uniform ALU instructions:
| SASS Instruction | Operation | Notes |
|---|---|---|
| UFADD | Uniform floating-point add | New in Blackwell |
| UFFMA | Uniform fused multiply-add | New in Blackwell |
| UFSEL | Uniform floating-point select | New in Blackwell |
| UFSETP | Uniform FP set-predicate | New in Blackwell |
| UVIADDR | Uniform integer-to-address | New in Blackwell |
These instructions execute on the uniform datapath (UDP, functional unit index 9), allowing floating-point uniform computations to stay in the UR file without round-tripping through the R file. Mercury encoding assigns major opcode 0x0E with 6 variants (sub_10C0550) for uniform ALU.
Architecture Version Threshold Checks
All Blackwell targets share codegen factory 36864+ (0x9000+). The binary uses these thresholds to gate Blackwell-specific features:
| Check Pattern | Threshold | Meaning |
|---|---|---|
| encoded > 36863 | > sm_90 extended | Blackwell warp geometry (16 warps, 240 slots) |
| codegen_factory >= 36864 | >= sm_100 | Blackwell generation features |
| codegen_factory == 36864 | sm_100 exactly | sm_100-specific paths |
| SM_number > 99 | sm_100+ | Capsule Mercury auto-enable |
| sub_70FA00(*, 29) | -- | tcgen05 capability query (SM-specific) |
The tcgen05 gating is not a simple threshold -- it uses a per-SM capability query (sub_70FA00 with argument 29) that returns true for sm_100/103/110 and false for sm_120/121.
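The threshold checks and the capability query can be modeled together. This is a sketch of the recovered comparisons; the function names and the dict keys are ours, only the constants and the true/false split come from the binary:

```python
def has_tcgen05(sm: int) -> bool:
    """Models sub_70FA00(*, 29): true for datacenter/SoC Blackwell,
    false for consumer Blackwell (sm_120/sm_121)."""
    return sm in (100, 103, 110)

def blackwell_gates(codegen_factory: int, sm: int) -> dict:
    """The simple threshold checks recovered from the binary."""
    return {
        "blackwell_warp_geometry": codegen_factory > 36863,   # 16 warps, 240 slots
        "blackwell_generation":    codegen_factory >= 36864,
        "sm100_specific":          codegen_factory == 36864,
        "capsule_mercury_default": sm > 99,
    }
```

Note that the tcgen05 split cuts across the thresholds: sm_120 (factory 36869) passes every numeric gate yet fails the capability query.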
BB Initialization (Secondary Encoding)
The basic block initializer sub_6E8EB0 uses a secondary encoding space for Blackwell:
| Secondary Encoding | SM | Flags |
|---|---|---|
| 20480 (0x5000) | sm_100 | Instruction set flags for datacenter |
| 20484 (0x5004) | sm_103 | XOR 0x10, XOR 0x40 for Ultra variants |
This secondary encoding uses generation 5 in the BB init context (5 << 12), separate from the primary codegen factory's generation 9.
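The two encoding spaces share a `generation << 12 | variant` layout. A small sketch makes the arithmetic explicit; note (as an observation from the recovered constants, not a documented rule) that the sub-variant index differs between spaces -- sm_103 is variant 3 in the primary factory but 4 in the BB-init space:

```python
def primary_factory(variant: int) -> int:
    """Primary codegen factory: generation 9 in bits [15:12]."""
    return (9 << 12) | variant      # sm_100 -> 0x9000, sm_103 -> 0x9003

def secondary_encoding(variant: int) -> int:
    """BB-init secondary space: generation 5 in bits [15:12]."""
    return (5 << 12) | variant      # sm_100 -> 0x5000, sm_103 -> 0x5004
```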
Scheduling Profile Differences
Blackwell targets share the 16-warp / 240-slot geometry with Hopper but have distinct latency tables:
| SM | Latency Table | Size | Structure |
|---|---|---|---|
| sm_100 | sub_8E8A90 | 3.0KB | Base + 949B TCGEN05 supplement |
| sm_103 | (supplementary) | 618B | Smallest table -- minimal delta from sm_100 |
| sm_110 | (shared with sm_100 or dedicated) | -- | Not separately identified in sweep |
| sm_120 | sub_8E9000 + sub_8E92E0 | 2.9KB + 5.5KB | Two-part consumer table |
| sm_121 | (likely shares sm_120's table) | -- | Same sub-variant index as sm_120 |
| Universal | sub_8E97B0 | 8.8KB | Fallback / universal |
The 5.5KB extended table for sm_120 is the largest individual latency table in the binary, reflecting the consumer microarchitecture's distinct pipeline design. The sm_100 table uses a supplement mechanism for TCGEN05-specific scheduling classes that the consumer sm_120 table does not need (since sm_120 lacks tcgen05).
Scheduling Class Assignments
From the opcode-to-scheduling-class mapper sub_89FBA0 (85KB), Blackwell-era opcodes use high-numbered scheduling classes:
| Class Range | Category | Architecture |
|---|---|---|
| 700--772+ | Mercury/Blackwell tensor ops | sm_100+ with tcgen05 |
| 745 | WGMMA primary | sm_90+ (Hopper carry-forward) |
| 744 | WGMMA variant | sm_90+ |
| 765--767 | BGMMA/QMMA (Blackwell-specific MMA types) | sm_100+ |
| 759 | HMMA/BMMA tensor core | sm_100+ |
| 757, 761 | Narrow/wide DP tensor | sm_100+ |
| 600, 604 | Tensor fence / tensor sync | sm_90+ |
Intrinsic Table
Blackwell intrinsic availability is cumulative -- all sm_70, sm_80, sm_8x, and sm_9x intrinsics carry forward. Blackwell adds two new intrinsic groups:
sm_10x Intrinsics (21 entries)
| ID Range | Count | Namespace | Category |
|---|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* | Trap on guardrail validation failure |
| 0x230--0x239 | 10 | __cuda_sm_10x_* | hmma/imma mdata + bit MMA (Blackwell-specific shapes) |
sm_1xx Intrinsics (18 entries)
| Namespace | Count | Category |
|---|---|---|
| __cuda_sm1xx_* | 18 | cp.async.bulk.tensor 1D--5D tile/im2col (extended bulk copy) |
OCG (Optimizing Code Generator) Intrinsics
The OCG builtin name table at sub_6C9EB0 (13KB) contains the master list of Blackwell+ runtime-generated intrinsic names:
cp_async_bulk, cp_red_async_bulk, cp_async_tensor,
cp_async_prefetch_tensor, fence_view_async,
viaddmax, viaddmin, viadd, vimax, vimin, vimax3, vimin3,
write_async, tcbar, mmareadshma, tcmma,
tcshift, tcatomsws, tcldsws, tcstsws,
gdesc, breuse, bkeep, virtcount,
memclear, acqshminit, sparsify, spfactor2to4,
2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit,
16dp32bitt0t15, 16dp32bitt16t31,
selfcast, broadcast, findandset, align
These names represent the Blackwell SASS operations exposed through the OCG intrinsic interface, covering tensor core scheduling (tcbar, tcmma, tcshift), sparse operations (sparsify, spfactor2to4), integer min/max variants (viaddmax, viaddmin), and async memory operations.
New PTX Instructions (Blackwell-Specific)
Beyond tcgen05, Blackwell introduces or extends several instruction families visible in the opcode dispatch:
| Instruction | Category | Evidence |
|---|---|---|
| tcgen05.* (11 instructions) | Tensor memory ops | sub_5D4190 registration |
| fence_view_async | Memory ordering | OCG builtin table |
| write_async | Async writes | OCG builtin table |
| viaddmax / viaddmin | Integer add-with-max/min | OCG builtin table |
| BGMMA / QMMA | Block/quantized MMA | Scheduling classes 765--767 |
CLI Options -- Tensor Memory Checks
Two CLI options control tcgen05 guardrail instrumentation:
| Option | Short Form | Default | Description |
|---|---|---|---|
| --g-tensor-memory-access-check | -g-tmem-access-check | Enabled with -g | Enable tensor memory access checks for tcgen05 operations |
| --gno-tensor-memory-access-check | -gno-tmem-access-check | false | Disable checks (overrides the above) |
When enabled, the compiler inserts inline validation code (the 8 guardrail functions) around tcgen05 operations. These emit trap instructions if tensor memory invariants are violated at runtime -- useful for debugging TMEM allocation errors, bounds violations, and ownership conflicts.
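The interaction of the two flags can be sketched as a small resolver. The precedence (the negative form wins regardless of position, and `-g` alone enables the checks) follows the option descriptions above; the function name and argument handling are our assumptions, not recovered code:

```python
def tmem_checks_enabled(args: list) -> bool:
    """Resolve the tcgen05 guardrail-instrumentation switch from a ptxas
    argument list (sketch of the documented flag semantics)."""
    negative = ("--gno-tensor-memory-access-check", "-gno-tmem-access-check")
    positive = ("--g-tensor-memory-access-check", "-g-tmem-access-check")
    if any(a in negative for a in args):
        return False            # the disable form overrides everything
    if any(a in positive for a in args):
        return True
    return "-g" in args         # default: enabled when debug info is requested
```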
Feature Comparison
| Feature | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Codegen factory | 36864 | 36867 | 36868 | 36869 | 36869 |
| Sub-arch variant | 0 | 3 | 4 | 5 | 5 |
| Family string | Blackwell | Blackwell | Blackwell | Blackwell | Blackwell |
| Capsule Mercury default | Yes | Yes | Yes | Yes | Yes |
| tcgen05 (tensor memory) | Yes | Yes | Yes | No | No |
| WGMMA (from Hopper) | Yes | Yes | Yes | Yes | Yes |
| Cluster operations | Yes | Yes | Yes | Yes | Yes |
| setmaxnreg | Yes | Yes | Yes | Yes | Yes |
| Uniform FP ALU (UFADD etc.) | Yes | Yes | Yes | Yes | Yes |
| BGMMA/QMMA | Yes | Yes | Yes | ? | ? |
| Guardrail instrumentation | Yes | Yes | Yes | N/A | N/A |
| HW latency table | 3.0KB + 949B | 618B (supp.) | -- | 2.9KB + 5.5KB | (shared w/ sm_120) |
| a sub-variant | Yes | Yes | Yes | Yes | Yes |
| f sub-variant | Yes | Yes | Yes | Yes | Yes |
| Products | B100/B200 | GB300 | Jetson Thor | RTX 50xx | DGX Spark |
SASS Instruction Encoding
Blackwell continues the 128-bit per-instruction format introduced in Turing. The Mercury encoder handles the SM100+ instruction set through a dedicated encoding subsystem spanning approximately 851KB in the address range 0xDFC000--0x107B000.
The encoding subsystem covers 16 instruction format groups with full Blackwell ISA support including:
- Standard ALU/FPU/memory operations (inherited from earlier architectures)
- TCGEN05 tensor memory operations (new encoding classes)
- BGMMA/QMMA block-scale and quantized MMA variants
- Extended bulk copy operations (UBLKCP variants)
- Sparse tensor operations
Function Map
| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_607DB0 | 14KB | Capability dispatch (all Blackwell registrations) | all | 99% |
| sub_608DF0 | ~1KB | Intrinsic table initializer | sm_120 | 85% |
| sub_608F20 | ~1.2KB | Handler A (capability accessor) | sm_103 | 90% |
| sub_608F50 | ~1.2KB | Handler B (capability accessor) | sm_110 | 90% |
| sub_609000 | ~200B | Perf/occupancy E | sm_110 | 85% |
| sub_609020 | ~200B | Perf/occupancy E | sm_103 | 85% |
| sub_609040 | ~200B | Perf/occupancy E | sm_121 | 85% |
| sub_609080 | ~200B | Perf/occupancy E | sm_100 | 85% |
| sub_6091A0 | ~200B | Perf/occupancy F | sm_103 | 85% |
| sub_609280 | ~200B | Perf/occupancy F | sm_110 | 85% |
| sub_609520 | ~200B | Perf/occupancy F | sm_120 | 85% |
| sub_6097C0 | ~200B | Perf/occupancy F | sm_121 | 85% |
| sub_6098A0 | ~200B | Perf/occupancy F | sm_100 | 85% |
| sub_609BA0 | ~48B | Handler B | sm_121 | 99% |
| sub_609BD0 | ~48B | Handler B | sm_100 | 99% |
| sub_609C30 | ~48B | Handler A | sm_100 | 99% |
| sub_609C60 | ~48B | Handler B | sm_120 | 99% |
| sub_609D20 | ~48B | Handler B | sm_103 | 99% |
| sub_609E40 | ~48B | Handler A | sm_120 | 99% |
| sub_609ED0 | ~48B | Handler A | sm_121 | 99% |
| sub_609F30 | ~48B | Handler A | sm_110 | 99% |
| sub_60A4E0 | ~1KB | Intrinsic table initializer | sm_121 | 85% |
| sub_60A700 | ~1KB | Intrinsic table initializer | sm_103 | 85% |
| sub_60A910 | ~1KB | Intrinsic table initializer | sm_100 | 85% |
| sub_60AA20 | ~1KB | Intrinsic table initializer | sm_110 | 85% |
| sub_60F290 | est. | Off-target capmerc compatibility checker | all | 75% |
| sub_612DE0 | 47KB | Kernel finalizer / ELF builder | all | 80% |
| sub_6765E0 | 54KB | Profile constructor (Blackwell entries at lines 600--1330) | all | 95% |
| sub_703AB0 | 10KB | CLI option parser (capmerc/mercury/sass) | all | 90% |
| sub_70D910 | 24 lines | tcgen05 immhalfSplitOff helper | sm_100 | 90% |
| sub_70DDB0 | 47 lines | tcgen05 funcRetArr helper | sm_100 | 90% |
| sub_70E0E0 | 296 lines | tcgen05 guardrail bounds checker | sm_100 | 90% |
| sub_8E4400 | 3.3KB | Warp geometry initializer (Blackwell = 16 warps, 240 slots) | all | 90% |
| sub_8E8A90 | 3.0KB | HW latency table (Blackwell datacenter) | sm_100 | 85% |
| sub_8E9000 | 2.9KB | HW latency table (consumer base) | sm_120 | 85% |
| sub_8E92E0 | 5.5KB | HW latency table (consumer extended) | sm_120 | 85% |
| sub_8E97B0 | 8.8KB | Universal fallback latency table | all | 85% |
| sub_89FBA0 | 85KB | Opcode-to-scheduling-class mapper | all | 90% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen handler | sm_100 | 98% |
| sub_5D4190 | -- | Opcode dispatch table (tcgen05 registrations) | all | 99% |
| sub_1C9C300 | 24KB | Capsule Mercury section processor | all | 85% |
| sub_1C9B110 | 23KB | Mercury capsule builder | all | 85% |
Cross-References
- SM Architecture Map -- Validation tables, codegen factory encoding, suffix semantics
- Turing & Ampere (SM 75--88) -- Baseline features that Blackwell inherits
- Ada & Hopper (SM 89--90a) -- WGMMA, clusters, TMA -- carried forward to Blackwell
- TCGen05 -- 5th Gen Tensor Cores -- Detailed tcgen05 ISA, TMEM model, instruction encoding
- Capsule Mercury & Finalization -- Capmerc format, ELF structure, finalization levels
- Mercury Encoder -- Shared encoding pipeline for all Mercury/capmerc output
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with sm_10x and sm_1xx ranges
- Latency Model & HW Profiles -- Per-SM scheduling parameters and functional units
- Uniform Register Optimization -- UR-file passes, Blackwell uniform ALU additions
- CLI Options -- --gpu-name, --binary-kind, tensor memory check flags
- SASS Instruction Encoding -- 128-bit format, Mercury encoder selection
TCGen05 -- 5th Generation Tensor Cores
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
TCGen05 is the Blackwell-generation tensor core instruction family introduced with SM 100. It replaces Hopper's WGMMA with a descriptor-based programming model that operates on Tensor Memory (TMEM) -- a dedicated register-file-like storage visible only to the tensor core. ptxas implements TCGen05 as 13 PTX instruction mnemonics (plus 8 debug guardrails), backed by a 90KB MMA codegen function, 11 SASS opcode groups (28 encoding variants), and a set of compiler-inserted validation hooks. TCGen05 is absent on sm_120/sm_121 (consumer Blackwell).
| Target architectures | sm_100, sm_100a, sm_100f, sm_103, sm_103a, sm_103f, sm_110, sm_110a, sm_110f |
| NOT available | sm_120, sm_121 (consumer/DGX Spark) -- gated by SM version checks |
| Capability check | sub_70FA00(*, 29) -- returns true for tcgen05-capable targets |
| PTX instructions | 13: alloc, dealloc, relinquish_alloc_permit, ld, ld.red, st, commit, cp, shift, fence, wait, mma, mma.ws |
| Guardrail instructions | 8: is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage |
| SASS opcode range | Opcodes 122--139 (TMEM operations), 213--221 (TCGEN05_MMA/FENCE, TMEM extended), 342--372 (TCGEN05 control) |
| Codegen factory | 36864 (9 << 12) -- shared across all Blackwell targets |
| MMA codegen | sub_5BBC30 (90KB) |
| PTX validator | sub_4C5FB0 (28KB -- shared MMA/WMMA/tcgen05 validator) |
| Intrinsic handler | sub_6D7AF0 (19KB -- TCGen05 MMA handler) |
| Intrinsic validator | sub_6D69B0 (12KB -- TCGen05 MMA validator) |
| EIATTR markers | EIATTR_TCGEN05_1CTA_USED, EIATTR_TCGEN05_2CTA_USED |
| Version constraint | Objects using tcgen05 from CUDA 12.x cannot link with 13.0+; must rebuild |
Architecture Overview
Descriptor-Based Model
TCGen05 abandons the register-operand model of previous tensor core generations (WMMA, HMMA, WGMMA) in favor of descriptors. The instruction descriptor (idesc) encodes the matrix operation configuration -- dimensions, data types, data path width, sparsity, and layout. The descriptor is passed as an operand to tcgen05.mma rather than encoded in the instruction mnemonic.
This design decouples the instruction encoding from the operation specification. Where WGMMA required hundreds of distinct intrinsic hash entries to cover every shape/type/layout combination, tcgen05 uses a single instruction with different descriptor values. The ~400 numeric MMA hash entries in the intrinsic dispatch table (at a1+816 in sub_5D4190) map WGMMA variants; tcgen05 replaces that complexity with descriptor-driven dispatch.
Tensor Memory (TMEM)
TMEM is a dedicated storage region private to the tensor core unit. It is not part of the general register file and is not directly addressable by non-tensor-core instructions. TMEM is organized into columns that are allocated, used, and deallocated explicitly by the programmer.
Key properties from binary analysis:
- Column-based allocation: tcgen05.alloc reserves columns; tcgen05.dealloc releases them
- Two CTA granularities: Operations execute at .cta_group::1 (single CTA) or .cta_group::2 (CTA pair) granularity. A function cannot mix both -- ptxas enforces: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
- Allocation tracking: The compiler inserts reserved shared memory variables to track allocation state:
  - __nv_reservedSMEM_tcgen05_partition -- partition identifier
  - __nv_reservedSMEM_allocation_phase -- current allocation phase
  - __nv_reservedSMEM_allocation_mask -- bitmask of allocated columns
  - __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier -- mbarrier for allocation pipeline
  - __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity -- parity tracking
TMEM Address Computation
Tensor memory addresses are computed through a standardized pattern visible in the TMEM address generator functions (sub_70E740, sub_70E940, sub_70EB00):
cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s;
add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s;
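Arithmetically, the two-instruction pattern truncates a 64-bit base to 32 bits and adds a 32-bit offset, both with u32 wraparound. A minimal sketch of that semantics (the function name is ours; the wrap behavior follows PTX u32 arithmetic, not any recovered code path):

```python
def tmem_address(base_u64: int, offset_u32: int) -> int:
    """Models the recovered PTX pattern:
        cvt.u32.u64  base32, base64     (truncating conversion)
        add.u32      addr,  base32, off (32-bit wrapping add)
    """
    base32 = base_u64 & 0xFFFFFFFF
    return (base32 + offset_u32) & 0xFFFFFFFF
```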
Five named TMEM address roles exist for MMA operations:
| Address Role | Intrinsic Name | Purpose |
|---|---|---|
| D (destination) | __cuda_sm10x_tcgen05_mma_tmemD | Accumulator/output matrix |
| A (input) | __cuda_sm10x_tcgen05_mma_tmemA | Left input matrix |
| Scale A | __cuda_sm10x_tcgen05_mma_scaleTmemA | Scale factors for A |
| Scale B | __cuda_sm10x_tcgen05_mma_scaleTmemB | Scale factors for B |
| Sparse Meta | __cuda_sm10x_tcgen05_mma_spMetaTmem | Sparsity metadata |
Constraint from the binary: "URa must be uint32 when URa is TMEM" -- uniform registers addressing TMEM must use 32-bit unsigned integers. When addressing a global descriptor: "URa must be uint64 when URa is GDESC".
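The two operand-width constraints can be expressed as one check. The error texts are verbatim binary strings; the function name and calling convention are our sketch, not the decompiled validator:

```python
def check_ura_width(kind: str, width_bits: int) -> None:
    """Enforce the recovered URa operand-width rules for TMEM and GDESC."""
    if kind == "TMEM" and width_bits != 32:
        raise ValueError("URa must be uint32 when URa is TMEM")
    if kind == "GDESC" and width_bits != 64:
        raise ValueError("URa must be uint64 when URa is GDESC")
```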
PTX Instruction Set
Lifecycle Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.alloc | 0x526370 | 1,287 B | Allocate TMEM columns for tensor core use |
| tcgen05.dealloc | 0x574050 | 2,130 B | Release allocated TMEM columns |
| tcgen05.relinquish_alloc_permit | 0x58C7F0 | 4,282 B | Relinquish allocation permit (multi-CTA coordination) |
The alloc instruction has two CTA-granularity variants visible in the prototype strings:
- __cuda_sm10x_tcgen05_alloc_one_sm -- single-SM allocation (.cta_group::1)
- __cuda_sm10x_tcgen05_alloc_two_sm -- two-SM allocation (.cta_group::2)
Both take a destination pointer argument (__cuda_sm10x_tc_alloc_dst_ptr_arg) and a column count (__cuda_sm10x_tc_alloc_num_cols_arg).
Data Movement Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.ld | 0x578DB0 | 2,466 B | Load data into TMEM from shared/global memory |
| tcgen05.ld.red | 0x571FE0 | 2,066 B | Load with reduction (accumulate into TMEM) |
| tcgen05.st | 0x56C190 | 1,842 B | Store data from TMEM to shared/global memory |
| tcgen05.cp | 0x5427F0 | 903 B | Copy between TMEM regions (intra-tensor-core) |
Three intrinsic helper arrays support the ld/st/ld.red operations:
| Helper | Purpose |
|---|---|
| __cuda_sm_100_tcgen05_ld_funcRetArr | Return array descriptor for loads |
| __cuda_sm_100_tcgen05_ld_red_funcRetArr | Return array descriptor for load-reduce |
| __cuda_sm_100_tcgen05_st_funcInputArr | Input array descriptor for stores |
Each has a corresponding immhalfSplitOff parameter controlling split behavior:
- __cuda_sm_100_tcgen05_ld_immhalfSplitOff
- __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff
- __cuda_sm_100_tcgen05_st_immhalfSplitOff
Synchronization Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.commit | 0x5427F0 | 1,575 B | Commit pending tensor core operations |
| tcgen05.fence | (inline) | -- | Fence preventing reordering of tcgen05 operations |
| tcgen05.wait | (inline) | -- | Wait for committed tcgen05 operations to complete |
| tcgen05.shift | 0x58FA20 | 4,604 B | Shift accumulator data within TMEM (shared formatter with mma) |
Compute Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | 0x5BBC30 (codegen) / 0x58FA20 (formatter) | 90KB / 4,604 B | Matrix multiply-accumulate |
| tcgen05.mma.ws | 0x4DA720 (formatter) | 343 B | Warp-specialized MMA variant |
TCGen05.MMA -- Matrix Multiply-Accumulate
Codegen Function: sub_5BBC30 (90KB)
The largest per-instruction codegen function for TCGen05. Registered as the "tcgen05.mma" handler in sub_5D4190 (the intrinsic dispatch table builder). The function:
- Allocates a 50,000-byte working buffer
- Queries sub_70FA00(*, 29) to validate tcgen05 capability on the current target
- Processes the instruction descriptor to determine operation parameters
- Generates tensor memory addressing code for all operands (D, A, scaleA, scaleB, sparsity meta)
- Emits the final MMA instruction encoding
MMA Modifiers
The binary reveals a rich set of MMA modifiers extracted by functions in the sub_70D1F0--sub_70D410 cluster:
| Modifier | String | Purpose |
|---|---|---|
| .o128 | ".o128" | 128-bit output size |
| .transA | ".transA" | Transpose A matrix |
| .transB | ".transB" | Transpose B matrix |
| .negA | "_negA" | Negate A matrix |
| .negB | "_negB" | Negate B matrix |
| _expand16bit | "_expand16bit" | 16-bit expansion mode |
| _pack16bit | "_pack16bit" | 16-bit packing mode |
| _maxabs | "_maxabs" | Maximum absolute value reduction |
| _minabs | "_minabs" | Minimum absolute value reduction |
| _fused | "_fused" | Fused operation mode |
| _blockscale | "_blockscale" | Block scaling (MX format support) |
| _ashift | "_ashift" | A-matrix shift |
| _areuse | "_areuse" | A-matrix register reuse |
| _akeep | "_akeep" | A-matrix keep (preserve for reuse) |
Data Path Configurations
The MMA data path width determines the number of elements processed per cycle and the accumulator layout. Six configurations exist:
| Data Path | Interpretation |
|---|---|
| _4dp256bit | 4 data paths, 256 bits each |
| _16dp32bit | 16 data paths, 32 bits each (two sub-variants: t0t15, t16t31) |
| _32dp32bit | 32 data paths, 32 bits each |
| _16dp256bit | 16 data paths, 256 bits each |
| _128dp256bit | 128 data paths, 256 bits each |
Constraint: "fused and l16dp32bit must be specified together" -- the fused mode requires the 16dp32bit data path.
Block Scaling (MX Format)
TCGen05 adds native block scaling support for microscaling (MX) format operations, visible through the tcmma prefix strings:
"tcmma_*_o must be specified with blockscale"-- output modifier requires blockscale"uri width for tcmma_*_o must be 2"-- output uniform register index width must be 2"tcmma_*_q with blockscale must have uri width of 2"-- quantization with blockscale"tcmma_*_mxq must be specified with blockscale"-- MX quantization requires blockscale
Warp-Specialized MMA (.ws)
The .ws modifier enables warp-specialized execution where different warps in a warpgroup contribute to different phases of the MMA pipeline. Constraints from the binary:
"When using buffer1-3, WS modifier must be specified"-- triple buffering requires.ws"ws opcode modifier not allowed with .2CTA"-- warp specialization is single-CTA only"ws opcode modifier not allowed with areuse or akeep"--.wsincompatible with A-matrix reuse"ws opcode modifier not allowed with ashift"--.wsincompatible with A-matrix shift
Triple-buffer register reuse strings for .ws mode:
| Buffer | Variant |
|---|---|
| _breuse_bkeep_buffer1 | B-reuse + B-keep, buffer 1 |
| _breuse_buffer1 | B-reuse, buffer 1 |
| _breuse_bkeep_buffer2 | B-reuse + B-keep, buffer 2 |
| _breuse_buffer2 | B-reuse, buffer 2 |
| _breuse_bkeep_buffer3 | B-reuse + B-keep, buffer 3 |
| _breuse_buffer3 | B-reuse, buffer 3 |
Sparsity Support
TCGen05 supports structured sparsity through the sparsity metadata TMEM address (spMetaTmem). The _ashift modifier is constrained: "Ashift can only be specific when URa is in TMEM".
SASS Encoding
Opcode Map
TCGen05 SASS instructions span three opcode regions in the SM 100 SASS ISA. The encoding information comes from the latency model tables (sub_8E8A90 for sm_100) and the master instruction encoder (sub_6D9690, 94KB).
TMEM Operations (Opcodes 122--139)
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 122 | 2 | TMEM_OP / new ISA | F1F08, F1C60 | 3-op, reg10 |
| 123 | 6 | TMEM_LD (tensor mem load) | F1F08, F1DF8 | 2--3 op |
| 125 | 6 | TMEM_ST (tensor mem store) | F1F08, F1DF8 | 2--3 op |
| 127 | 9 | TMEM_ALLOC / FENCE | F1F08..F29A8 | 3--6 op |
| 129 | 3 | TMEM extended | F1F08 | 2 op |
| 130 | 26 | EXTENDED_MOV / TMEM_MVA | F1F08..F2678 | 2--9 op |
| 131 | 3 | EXTENDED_ALU / UTMA | F21B0 | 4--5 op |
| 133 | 1 | UTMA variant | F21B0 | 4 op |
| 139 | 4 | TCGEN05 operations | F21B0, F2568 | 4--8 op |
TCGEN05 MMA/FENCE (Opcodes 213--221)
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 213 | 6 | TCGEN05_MMA | F2678 | 5--7 op |
| 216 | 2 | TCGEN05_FENCE | F2678 | 3--4 op |
| 219 | 6 | TMEM_LD extended | F1C60..F2810 | 3--7 op |
| 220 | 1 | TMEM_ST extended | F1C60 | 3 op |
| 221 | 1 | TMEM_PREFETCH | F1C60 | 3 op |
| 255 | 1 | SETSTMEMADDR | F1F08 | 1 op |
| 269 | 4 | TMEM_ALLOC_FENCE ext | F2018, F1DF8 | 2--3 op |
TCGEN05 Control (Opcodes 342--372)
28 encoding variants across 10 opcodes. These are the primary tensor core pipeline control instructions:
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 342 | 1 | TCGEN05 ctrl A | F1F08 | 0 op (scheduling marker) |
| 343 | 1 | TCGEN05 ctrl B | F1F08 | 0 op (scheduling marker) |
| 344 | 14 | TCGEN05 execute | F1F08..F3008 | 2--7 op |
| 346 | 4 | TCGEN05 commit | F1F08, F2018 | 2--3 op |
| 349 | 1 | TCGEN05 sync | F1D70 | 0 op |
| 359 | 3 | TCGEN05 alloc | F1D70, F1F08 | 0--2 op |
| 369 | 1 | TCGEN05 dealloc | F1F08 | 0 op |
| 370 | 1 | TCGEN05 release A | F1D70 | 0 op |
| 371 | 1 | TCGEN05 release B | F1D70 | 0 op |
| 372 | 1 | TCGEN05 release C | F1D70 | 0 op |
Opcode 344 (TCGEN05 execute) has the most variants (14), spanning encoding classes from F1F08 to F3008 with 2 to 7 operands. This is the actual MMA dispatch instruction -- the wide encoding range reflects the variety of descriptor configurations, operand modes, and data path widths.
Encoding Class Distribution
The encoding classes used by TCGen05 SASS instructions:
| Class (hex) | Role | Used By |
|---|---|---|
| F1D70 | Control/sync | alloc (0-op), sync, release A/B/C |
| F1F08 | General | ctrl markers, execute, commit, alloc, dealloc, TMEM ops |
| F1C60 | Extended | TMEM_LD/ST extended, TMEM_PREFETCH |
| F1DF8 | Standard | TMEM_LD/ST, TMEM_ALLOC_FENCE ext |
| F2018 | Commit ext | TCGEN05 commit, TMEM_ALLOC_FENCE ext |
| F21B0 | ALU | TCGEN05 operations, UTMA |
| F2568 | TCGEN05 ops | TCGEN05 operations |
| F2678 | MMA/FENCE | TCGEN05_MMA, TCGEN05_FENCE |
| F29A8 | TMEM_ALLOC | TMEM_ALLOC/FENCE |
| F2810 | Extended | TMEM_LD extended |
| F3008 | Execute max | TCGEN05 execute (high-operand-count) |
Latency Model
The sm_100 latency table (sub_8E8A90) uses a two-part structure: a 3.0KB base table covering standard instructions and a 949-byte supplement dedicated to TCGEN05 operations. The sm_120 consumer Blackwell table (sub_8E9000 + sub_8E92E0, 5.5KB) is the largest individual table and does not include TCGEN05 entries (confirming the feature's absence on consumer silicon).
CTA Granularity
TCGen05 operations specify whether they execute at single-CTA or CTA-pair granularity through the .cta_group modifier:
| Granularity | Modifier | EIATTR | ELF Marker |
|---|---|---|---|
| Single CTA | .cta_group::1 | EIATTR_TCGEN05_1CTA_USED | TC_1CTA |
| CTA Pair | .cta_group::2 | EIATTR_TCGEN05_2CTA_USED | TC_2CTA |
The compiler emits the appropriate EIATTR marker into the output cubin based on which granularity the kernel uses. The CUDA runtime uses this to configure the CTA launch parameters.
The binary enforces exclusivity: a single function cannot mix .cta_group::1 and .cta_group::2 operations. The error message is explicit: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
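The exclusivity rule and marker selection can be sketched as a small checker. This is an inferred model of the behavior described above, not decompiled output; the function name and the op representation are hypothetical, while the error text and EIATTR names are the recovered strings.

```python
# Hypothetical model of the .cta_group exclusivity check and EIATTR
# marker selection. The EIATTR_* names and the error message are the
# recovered strings; the scanning logic is an inference.
def check_cta_groups(func_name, ops):
    groups = {op["cta_group"] for op in ops if "cta_group" in op}
    if groups == {1, 2}:
        raise ValueError(
            f"Function '{func_name}' uses single CTA(.cta_group::1) and "
            "CTA pair granularity(.cta_group::2) and that is not allowed."
        )
    if groups == {1}:
        return "EIATTR_TCGEN05_1CTA_USED"
    if groups == {2}:
        return "EIATTR_TCGEN05_2CTA_USED"
    return None  # kernel uses no tcgen05 operations
```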
ELF/Cubin Markers
EIATTR Entries
| EIATTR Name | Purpose |
|---|---|
EIATTR_TCGEN05_1CTA_USED | Kernel uses tcgen05 at single-CTA granularity |
EIATTR_TCGEN05_2CTA_USED | Kernel uses tcgen05 at CTA-pair granularity |
EICOMPAT Attributes
| EICOMPAT Name | Purpose |
|---|---|
EICOMPAT_ATTR_INST_TCGEN05_MMA | Kernel uses tcgen05.mma instructions |
EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED | Kernel uses deprecated (12.x-era) tcgen05.mma encoding |
Entry Fragment Markers
TMEM usage per-CTA is recorded in entry fragment markers:
| Marker | Version | Purpose |
|---|---|---|
AT_ENTRY_FRAGMENT_TMEM_CTA1 | V1 | TMEM usage for single-CTA kernels |
AT_ENTRY_FRAGMENT_TMEM_CTA2 | V1 | TMEM usage for CTA-pair kernels |
AT_ENTRY_FRAGMENT_TMEM_CTA1_V2 | V2 | TMEM usage V2 format, single-CTA |
AT_ENTRY_FRAGMENT_TMEM_CTA2_V2 | V2 | TMEM usage V2 format, CTA-pair |
Guardrail Debug Instrumentation
When compiling with -g (debug mode), ptxas inserts runtime validation checks around tcgen05 operations. These are controlled by the --g-tensor-memory-access-check / --gno-tensor-memory-access-check CLI options.
Guardrail Check Functions
Eight _tcgen05.guardrails.* pseudo-instructions insert inline validation code:
| Guardrail | Formatter Address | Size | Validates |
|---|---|---|---|
is_phase_valid | 0x4DDE70 | 775 B | Allocation phase is correct for the operation |
are_columns_allocated | 0x4DBF20 | 599 B | Accessed columns are currently allocated |
is_current_warp_valid_owner | 0x4DE180 | 791 B | Current warp owns the accessed TMEM region |
in_physical_bounds | 0x4DB050 | 439 B | Column access is within physical TMEM bounds |
allocation_granularity | 0x4F0960 | 839 B | Column count meets granularity requirements |
datapath_alignment | 0x4DD580 | 735 B | TMEM address is aligned for the data path width |
sp_consistency_across_idesc_mod | 0x500FA0 | 970 B | Sparsity settings in descriptor match modifier |
check_sparse_usage | 0x4DDB80 | 743 B | Sparse mode usage is valid for the environment |
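The simplest of these checks, in_physical_bounds, reduces to a range test. A minimal model follows; the 512-column TMEM extent is an assumption based on public Blackwell tensor-memory documentation, not a value recovered from the formatter itself, which emits inline PTX performing an equivalent check.

```python
# Minimal model of the in_physical_bounds guardrail (sub_4DB050).
# TMEM_PHYSICAL_COLUMNS = 512 is an assumed per-CTA column count.
TMEM_PHYSICAL_COLUMNS = 512

def in_physical_bounds(start_col, num_cols):
    # A column access is valid only if the whole [start, start+num) range
    # lies inside the physical tensor memory.
    return 0 <= start_col and start_col + num_cols <= TMEM_PHYSICAL_COLUMNS
```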
Guardrail Trap Functions
When a guardrail check fails, it calls a trap function that reports the violation and terminates:
| Trap Intrinsic | Parameters |
|---|---|
__cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc | (.reg .b32 phase) |
__cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid | (.reg .b32 tmem_start_lane_accessed, .reg .b32 cur_warp_id, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access | (.reg .b32 col_no_accessed, .reg .b32 alloced_mask, .reg .b32 instr_kind) |
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced | (.reg .b32 col_no_being_dealloced, .reg .b32 alloced_mask) |
__cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | (.reg .b32 col_no_being_dealloced_not_returned_by_alloc, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid | (.reg .b32 nCols) |
__cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds | (.reg .b32 oob_access_col_no, .reg .b32 instr_kind) |
__cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment | (.reg .b32 dp_lane, .reg .b32 matrix_kind, .reg .b32 valid_alignment_kind, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | (.reg .b32 idesc_sp_enabled, .reg .b32 mod_sp_enabled) |
__cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env | (.reg .b32 idesc_sp_enabled, .reg .b32 idesc, .reg .b32 mma_kind, .reg .b32 ptx_target, .reg .b32 is_family_portable) |
These are intrinsic IDs 0x20--0x2A (11 entries total including a mask creation helper) in the intrinsic table.
Guardrail Check Wrappers
The compiler also generates .FORCE_INLINE wrapper functions that combine multiple checks:
| Wrapper | Parameters |
|---|---|
__cuda_sm10x_tcgen05_guardrails_check_phase_validity | (.reg .u32 dummyInp) |
__cuda_sm10x_tcgen05_guardrails_check_column_allocation | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
__cuda_sm10x_tcgen05_guardrails_check_datapath_validity | (.reg .u32 tmem_addr, .reg .u32 ld_or_st) |
__cuda_sm10x_tcgen05_guardrails_check_physical_bounds | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
__cuda_sm10x_tcgen05_guardrails_check_allocation_granularity | (.reg .u32 num_of_cols) |
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment | (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group, ...) |
Bulk Copy Operations (cp.async.bulk.tensor)
TCGen05 is complemented by asynchronous bulk copy operations for loading data into tensor memory. These are registered as separate intrinsic IDs (0x2B--0x3C, 18 entries) under the __cuda_sm1xx_* naming convention:
| Operation | Codegen Handler | Size |
|---|---|---|
cp.async.bulk.tensor (1D--5D, tile/im2col, unicast/multicast) | sub_5AB460 | 45KB |
cp.async.bulk | sub_593210 | -- |
cp.async.mbarrier.arrive | sub_4DC180 | -- |
The cp.async.bulk.tensor handler is 45KB and covers all dimensionality variants (1D through 5D), both tile and im2col access patterns, and unicast/multicast delivery modes.
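One plausible decomposition of the 18 intrinsic IDs is tile mode for 1D through 5D and im2col mode for 3D through 5D, each in unicast and multicast form, plus the two scalar operations. This split reproduces the count but the per-variant list was not individually recovered, so treat it as a hypothesis.

```python
# Hypothetical breakdown of the 18 sm1xx bulk-copy intrinsics (IDs
# 0x2B--0x3C). Variant names here are illustrative, not recovered.
tile = [f"tile_{d}d_{m}" for d in range(1, 6) for m in ("uni", "multi")]      # 10
im2col = [f"im2col_{d}d_{m}" for d in range(3, 6) for m in ("uni", "multi")]  # 6
scalar = ["cp_async_bulk", "cp_async_mbarrier_arrive"]                        # 2
assert len(tile) + len(im2col) + len(scalar) == 18
```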
SM Availability Gating
Capability Check
TCGen05 availability is gated by sub_70FA00(*, 29), which checks the target SM version. The check returns true for sm_100, sm_103, and sm_110 (and their a/f sub-variants) and false for sm_120/sm_121.
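The gate reduces to a set-membership test on the base SM number. The sketch below mirrors the stated behavior of sub_70FA00 for capability 29; the table and function name are illustrative.

```python
# Model of the sub_70FA00(*, 29) tcgen05 capability gate. The a/f
# sub-variants share their base SM number, so only bases are listed.
TCGEN05_CAPABLE = {100, 103, 110}

def has_tcgen05(sm_base):
    return sm_base in TCGEN05_CAPABLE
```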
OCG Builtin Names
The OCG (Optimizing Code Generator) layer uses short mnemonic names for tcgen05 operations, visible in the builtin name lookup (sub_6C9EB0):
| OCG Name | Full Operation |
|---|---|
tcmma | tcgen05.mma core multiply-accumulate |
tcshift | tcgen05.shift accumulator data shift |
gdesc | Global descriptor operations |
memclear | Tensor memory clear |
sparsify | Sparsity pattern application |
The .tcgen05op string identifies an Ori IR instruction as belonging to the tcgen05 family during the optimizer pipeline.
Version Compatibility
CUDA 12.x to 13.0 Breaking Change
ptxas v13.0.88 includes a linker-level version check for tcgen05 objects:
"Object '%s' cannot be linked due to version mismatch. Objects using tcgen05 in 12.x cannot be linked with 13.0 or later, they must be rebuilt with latest compiler"
The EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED attribute tags objects compiled with the 12.x-era tcgen05 encoding, which is binary-incompatible with the 13.0 encoding. The SASS instruction encoding for tcgen05.mma changed between CUDA 12.x and 13.0 -- objects must be recompiled.
SM 100 vs SM 103 Differences
Both sm_100 and sm_103 share the same tcgen05 instruction set and codegen factory (36864). They share all 7 dispatch-table handler functions. The differences between sm_100 and sm_103 are:
- Different Handler A and Handler B capability accessor functions (sm_100: sub_609C30/sub_609BD0; sm_103: sub_608F20/sub_609D20)
- Different intrinsic table initializers (sm_100: sub_60A910; sm_103: sub_60A700)
- sm_103 may expose additional capability flags for GB300-specific features
Both targets produce identical SASS for tcgen05 instructions. The f sub-variants (sm_100f, sm_103f) allow cross-compilation within the family: sm_100f code can run on sm_103 hardware.
Compiler Pipeline
PTX Parsing and Validation
- Lexer (sub_720F00, 64KB): Recognizes tcgen05.* tokens during lexical analysis
- Validator (sub_4C5FB0, 28KB): Shared MMA/WMMA/tcgen05 validation function. Checks instruction legality for the current SM target, validates operand types, descriptor fields, and modifier combinations
- Instruction table (sub_46E000, 93KB): Registers tcgen05 instruction variants with their type combinations (e.g., .tcgen05op)
Intrinsic Dispatch
The intrinsic dispatch table builder (sub_5D4190, 41KB) registers tcgen05 handlers:
| Registration | PTX Instruction | Handler | Size |
|---|---|---|---|
| Line 112 | tcgen05.mma | sub_5BBC30 | 90KB |
| Lifecycle | tcgen05.alloc | sub_569180 | -- |
| Lifecycle | tcgen05.relinquish_alloc_permit | sub_526370 | -- |
| Lifecycle | tcgen05.dealloc | sub_58C7F0 | -- |
| Data | tcgen05.ld | sub_574050 | -- |
| Data | tcgen05.ld.red | sub_578DB0 | -- |
| Data | tcgen05.st | sub_571FE0 | -- |
| Sync | tcgen05.commit | sub_56C190 | -- |
| Copy | tcgen05.cp | sub_5427F0 | -- |
| Compute | tcgen05.shift | sub_4F1A90 | -- |
| Compute | tcgen05.mma.ws | sub_58FA20 | -- |
Intrinsic Lowering
The TCGen05 MMA handler (sub_6D7AF0, 19KB) and validator (sub_6D69B0, 12KB) in the encoding zone handle the lowering from abstract intrinsic operations to concrete SASS encoding. The handler checks modifier consistency:
- "fused and l16dp32bit must be specified together"
- "Inputs vector length is inconsistent with layout and num modifiers"
TMEM Address Generation
The TMEM address generator cluster (sub_70E740, sub_70E940, sub_70EB00) generates PTX parameter passing code for tensor memory addresses:
st.param.b32 [%s + %d], %s;
ld.param.b32 %s, [%s + %d];
SASS Encoding
The master instruction encoder (sub_6D9690, 94KB) handles the final binary encoding. TCGen05 instructions use the Mercury encoding pipeline (encoder factory 36864) with Blackwell-specific opcode tables.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_4C5FB0 | 28KB | PTX validator (MMA/WMMA/tcgen05 shared) | HIGH |
sub_4DA720 | 343 B | tcgen05.mma.ws formatter | HIGH |
sub_4DB050 | 439 B | guardrails.in_physical_bounds formatter | HIGH |
sub_4DBF20 | 599 B | guardrails.are_columns_allocated formatter | HIGH |
sub_4DD580 | 735 B | guardrails.datapath_alignment formatter | HIGH |
sub_4DDB80 | 743 B | guardrails.check_sparse_usage formatter | HIGH |
sub_4DDE70 | 775 B | guardrails.is_phase_valid formatter | HIGH |
sub_4DE180 | 791 B | guardrails.is_current_warp_valid_owner formatter | HIGH |
sub_4F0960 | 839 B | guardrails.allocation_granularity formatter | HIGH |
sub_4F1A90 | 903 B | tcgen05.shift / tcgen05.cp formatter | HIGH |
sub_500FA0 | 970 B | guardrails.sp_consistency_across_idesc_mod formatter | HIGH |
sub_526370 | 1,287 B | tcgen05.alloc / tcgen05.relinquish_alloc_permit formatter | HIGH |
sub_5427F0 | 1,575 B | tcgen05.commit formatter | HIGH |
sub_569180 | -- | tcgen05.alloc codegen handler | HIGH |
sub_56C190 | 1,842 B | tcgen05.st formatter | HIGH |
sub_571FE0 | 2,066 B | tcgen05.ld.red formatter | HIGH |
sub_574050 | 2,130 B | tcgen05.dealloc formatter | HIGH |
sub_578DB0 | 2,466 B | tcgen05.ld formatter | HIGH |
sub_58C7F0 | 4,282 B | tcgen05.relinquish_alloc_permit / tcgen05.dealloc formatter | HIGH |
sub_58FA20 | 4,604 B | tcgen05.shift + tcgen05.mma formatter | HIGH |
sub_593210 | -- | cp.async.bulk codegen | HIGH |
sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D) | HIGH |
sub_5BBC30 | 90KB | tcgen05.mma codegen (main) | HIGH |
sub_6D69B0 | 12KB | TCGen05 MMA validator (encoding zone) | MED |
sub_6D7AF0 | 19KB | TCGen05 MMA handler (encoding zone) | HIGH |
sub_70BC30 | -- | TCGen05 parameter helper | MED |
sub_70BCC0 | -- | TCGen05 parameter helper | MED |
sub_70DEF0 | -- | TCGen05 parameter helper | MED |
sub_70E0E0 | -- | SM100 guardrail bounds-check code generator | MED |
sub_70E740 | -- | TMEM address generator (tmemD) | MED |
sub_70E940 | -- | TMEM address generator (tmemA) | MED |
sub_70EB00 | -- | TMEM address generator (scaleTmemA/B, spMetaTmem) | MED |
sub_70FA00 | -- | Instruction capability checker (29 = tcgen05) | HIGH |
sub_8E8A90 | 3.0KB + 949 B | SM 100 latency table (base + TCGEN05 supplement) | HIGH |
Cross-References
- Blackwell (SM 100--121) -- Target-level architecture gating, codegen factory 36864
- SM Architecture Map -- Complete SM table, capability dispatch infrastructure
- GMMA/WGMMA Pipeline -- Predecessor tensor core pipeline (sm_90), same warpgroup execution model
- Intrinsic Table (608 Entries) -- IDs 0x20--0x3C (tcgen05 guardrails + bulk copy)
- Tensor Core Intrinsics -- WMMA/MMA/tcgen05 intrinsic lowering detail
- Late Expansion & Legalization -- tcgen05 guardrail insertion during late expansion
- SASS Instruction Encoding -- Mercury encoder, opcode tables
- Latency Model & HW Profiles -- SM 100 TCGEN05 supplement table
- SASS Text Generation -- TCGen05 instruction formatters
- CLI Options -- --g-tensor-memory-access-check option
Intrinsic Table Architecture (607 Registered Entries)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas maintains two separate intrinsic subsystems that together cover every CUDA runtime helper function, every PTX opcode requiring inline code generation, and every Blackwell+ OCG builtin operation. The first subsystem (sub_5D1660 + sub_5D4190 + sub_5D7430 + sub_5FF700) handles 607 classical CUDA intrinsics and PTX opcode dispatch through a name-to-ID hash map, a body template name table, and a giant prototype generator. The second subsystem (sub_6C9EB0 and its handler cluster at 0x6C0000--0x6CC000) handles OCG (Optimizing Code Generator) builtins for SM100+ targets. Both subsystems use the same hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60) documented in Hash Tables & Bitvectors.
| Master registration | sub_5D1660 (46KB) -- 607 CUDA intrinsics, name-to-integer-ID hash map (608 table slots, ID 0 = null) |
| Opcode dispatch | sub_5D4190 (41KB) -- ~120 PTX opcodes to codegen handlers + ~400 MMA hash entries |
| Body template names | sub_5D7430 (161KB) -- 1,079 intrinsic names constructed from .rodata prefixes + type suffixes, stored in hash map at +824 |
| Prototype generator | sub_5FF700 (354KB) -- switch generating .weak .func PTX declarations |
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
| Hash map create | sub_425CA0 (initial capacity 0x80) |
| Hash map insert | sub_426150(map, name, value) |
| Hash map lookup | sub_426D60 |
Per-Family Deep Dives:
- OCG Intrinsic System -- SM100+ OCG builtins (44 operations), lowering pipeline, SASS handler map
- Math Intrinsics -- IEEE math software emulation (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 lowering
- Sync & Warp Intrinsics -- Barriers, vote, shuffle, match, redux
System Overview
sub_451730 (intrinsic lowering context constructor)
│
├── sub_5D4190(ctx) ── register PTX opcode & MMA handlers ──────────────┐
│ │ (1) Calls sub_5D1660 to populate intrinsic ID table (607 entries) │
│ │ (2) Registers ~120 PTX opcode -> codegen handler mappings │
│ │ (3) Registers ~400 MMA hash -> codegen handler mappings │
│ │ │
│ ├─ Hash map at +808 ── PTX opcode name -> codegen function ptr │
│ │ "div" -> sub_5B76D0 (64KB) │
│ │ "sqrt" -> sub_5B4040 (49KB) │
│ │ "wmma.mma"-> sub_5C7A50 (173KB) │
│ │ "mma" -> sub_5C10A0 (120KB) │
│ │ ... ~116 more │
│ │ │
│ ├─ Hash map at +816 ── numeric MMA hash -> codegen handler ptr │
│ │ "2644314910" -> sub_4DDB80 │
│ │ ... ~399 more (shape/type/layout combinations) │
│ │ │
│ └─ ID table at +1056 ── 9728-byte array (memcpy from unk_1D4D940) │
│ Hash map at +1064 ── name -> integer ID (sub_5D1660, 607) │
│ Count at +1072 = 608 (includes null ID 0 slot) │
│ │
├── sub_4CE230(ctx) ── register modifier keywords (GUARD, PRED, ...) │
│ │
├── sub_5D7430(ctx, sregs) ── body template name table (161KB) ──────────┤
│ │ 1,079 entries, each constructed from: │
│ │ 16-byte .rodata prefix (e.g. "__cuda_sm20_div_") │
│ │ + 4-byte type suffix (e.g. "s16\0", "u64\0", "rn_f") │
│ │ → registered into hash map at +824 with sequential integer IDs │
│ │ │
│ └─ Hash map at +824 ── intrinsic name -> body template ID │
│ "__cuda_sm20_div_s16" -> 0 │
│ "__cuda_sm20_div_u16" -> 1 │
│ ... 1,079 total entries │
│ │
└── sub_451330("<fermi macros>", ...) ── load Fermi macro library │
│
sub_5FF700 (354KB) ─────────────────────────────────────────────────────────┘
│ switch(body_template_id) with hundreds of cases
│ Each case: allocate buffer via sub_4DA340, strcpy() PTX prototype
│
│ case 0: ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16
│ (.reg .s32 %a0, .reg .s32 %a1)"
│ case 4: ".weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
│ (.reg .u64 %rda1, .reg .u64 %rda2)"
│ case 9: ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
│ (.reg .f32 %fa1, .reg .f32 %fa2)"
│ case 25: ".weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
│ (...)"
│ ... hundreds more for rcp, sqrt, dsqrt, barrier, wmma, mma, etc.
v
Emitted into PTX output as .weak .func declarations
(linker resolves calls to runtime helper functions)
Master Registration -- sub_5D1660
This 46KB function is the master catalog. It allocates a 9728-byte table (memcpy from unk_1D4D940, 0x2600 bytes = 608 x 16B slots), creates a hash map with initial capacity 0x80 via sub_425CA0, then calls sub_426150(hashmap, "name", (char*)ID) exactly 607 times to register every CUDA runtime helper function with an integer ID (IDs 1--607, contiguous). The hash map is stored at a1+1064, the table at a1+1056, and the count 608 at a1+1072 (includes the unused null ID 0 slot).
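The registration pattern reduces to a name-to-ID map with contiguous IDs plus a fixed-size side table. A toy model, with structure sizes taken from the description above and everything else illustrative:

```python
# Toy model of the sub_5D1660 registration pattern: name -> integer ID
# with contiguous IDs starting at 1, plus a 608-slot x 16-byte table
# (0x2600 bytes) whose slot 0 is the unused null sentinel.
class IntrinsicCatalog:
    def __init__(self):
        self.ids = {}                      # mirrors the hash map at a1+1064
        self.table = bytearray(608 * 16)   # mirrors the table at a1+1056

    def register(self, name):
        self.ids[name] = len(self.ids) + 1  # IDs 1..607; ID 0 = null
        return self.ids[name]
```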
Complete ID Allocation
607 intrinsics are registered with contiguous IDs from 0x01 through 0x25F. The binary stores count=608 at a1+1072 because the pre-built 9,728-byte table (608 x 16B slots) includes a null ID 0 sentinel. The ID ranges partition cleanly by SM generation and functional category.
| ID Range | Count | Prefix | Category | SM Floor |
|---|---|---|---|---|
0x001--0x011 | 17 | __cuda_reduxsync_* | Redux sync (b32 and/or/xor, f32 max/min/abs/NaN, s32/u32 add/max/min) | sm_70 |
0x012--0x018 | 7 | __cuda_sanitizer_memcheck_* | Compute-sanitizer hooks (free, generic, global, local, malloc, readmetadata, shared) | -- |
0x019--0x01F | 7 | __cuda_scalar_video_emulation_* | Video instruction emulation helpers | sm_20 |
0x020--0x02A | 11 | __cuda_sm10x_* | Blackwell tcgen05 guardrail traps + create_mask helper | sm_100 |
0x02B--0x03C | 18 | __cuda_sm1xx_* | Bulk copy + cp.async.bulk.tensor 1D--5D tile/im2col uni/multicast | sm_100+ |
0x03D--0x082 | 70 | __cuda_sm20_* | IEEE math: bfe, bfi, div, rcp, sqrt, dsqrt, drsqrt, rem (all rounding modes + slowpaths) | sm_20 |
0x083--0x086 | 4 | __cuda_sm3x_div_* | Optimized division variants (rn_ftz_f32, rn_noftz_f32 + slowpaths) | sm_30 |
0x087--0x088 | 2 | __cuda_sm62_dp2a/dp4a | Integer dot product emulation | sm_62 |
0x089--0x211 | 393 | __cuda_sm70_* | Volta+ intrinsics (barriers, shuffle, vote, match, WMMA -- all shapes, layouts, address spaces) | sm_70 |
0x212--0x214 | 3 | __cuda_sm80_* | Ampere: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode | sm_80 |
0x215--0x21E | 10 | __cuda_sm_10x_* | Blackwell hmma/imma mdata + bit MMA (and/xor m8n8k128/m16n8k128/m16n8k256) | sm_100 |
0x21F--0x22C | 14 | __cuda_sm_8x_* | Direct MMA operations (f16/f32 accum, 4 layout combos) + mma_shfl_f16/f32 | sm_80+ |
0x22D--0x25F | 51 | __cuda_sm_9x_* | Hopper sub-byte + bit MMA: s4/u4 dense m16n8k32/k64 + sparse m16n8k64/k128, bit xor (m8n8k128/m16n8k128/m16n8k256) | sm_90 |
Total: 607 registered intrinsics across 13 prefix groups. Table has 608 slots (ID 0 unused).
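The partition above can be checked arithmetically: the thirteen contiguous ranges must tile 0x001 through 0x25F with the listed counts.

```python
# Sanity check of the ID-range table: each (lo, hi) pair is inclusive,
# and the counts must sum to the 607 registered intrinsics.
ranges = [(0x001, 0x011), (0x012, 0x018), (0x019, 0x01F), (0x020, 0x02A),
          (0x02B, 0x03C), (0x03D, 0x082), (0x083, 0x086), (0x087, 0x088),
          (0x089, 0x211), (0x212, 0x214), (0x215, 0x21E), (0x21F, 0x22C),
          (0x22D, 0x25F)]
counts = [hi - lo + 1 for lo, hi in ranges]
assert counts == [17, 7, 7, 11, 18, 70, 4, 2, 393, 3, 10, 14, 51]
assert sum(counts) == 607 and ranges[-1][1] == 0x25F
```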
sm_70 Intrinsic Breakdown (IDs 0x89--0x211)
The sm_70 block is by far the largest at 393 entries. It covers every Volta-era warp synchronous intrinsic plus the complete WMMA API. The explosion in count comes from the combinatorial product of shapes, layouts, data types, address spaces, and predicate/satfinite variants.
| Sub-Category | Examples | Combinatorial Source |
|---|---|---|
barrier_arrive | 0--15, with/without count | 16 barrier IDs x 2 count variants |
barrier_red_and/or/popc | 0--15, with/without count | 3 reduction ops x 16 IDs x 2 count |
barrier_sync | 0--15, with/without count | 16 IDs x 2 count variants |
matchsync_all/any_b32/b64 | with predicate variants | 2 match modes x 2 types x pred |
shflsync_bfly/down/idx/up | with predicate variants | 4 shuffle modes x pred |
votesync_all/any/ballot/uni | -- | 4 vote modes |
warpsync | -- | 1 entry |
wmma_* | m16n16k16, m32n8k16, m8n32k16 | 3 shapes x {load_a, load_b, load_c, store_d, mma} x {row, col} x {f16, f32} x {generic, global, shared} x {satfinite} |
The WMMA entries dominate the count. Each combination of shape (m16n16k16/m32n8k16/m8n32k16), operation (load_a/load_b/load_c/store_d/mma), layout (row/col for each matrix), data type (f16/f32), address space (generic/global/shared), and optional satfinite flag produces a separate intrinsic registration.
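The combinatorial explosion is easy to reproduce. The cross product below uses only the axes named in the text; the per-axis legality rules that ptxas actually applies were not enumerated, so the resulting count is illustrative rather than the recovered 204.

```python
# Illustration of the combinatorial source of the sm_70 WMMA entries.
# A naive cross product of the named axes already yields 90 distinct
# names; the type (f16/f32) and satfinite axes, with legality pruning,
# account for the real count of 204.
from itertools import product

shapes = ["m16n16k16", "m32n8k16", "m8n32k16"]
ops = ["load_a", "load_b", "load_c", "store_d", "mma"]
layouts = ["row", "col"]
spaces = ["generic", "global", "shared"]

names = {f"__cuda_sm70_wmma_{s}_{op}_{l}_{sp}"
         for s, op, l, sp in product(shapes, ops, layouts, spaces)}
assert len(names) == 3 * 5 * 2 * 3  # 90
```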
Opcode Dispatch -- sub_5D4190
This 41KB function first calls sub_5D1660(a1) to populate the intrinsic ID table, then builds two more hash maps for PTX opcode dispatch.
Named Opcode Table (at a1+808)
~120 PTX instruction names mapped to codegen handler function pointers. Each handler allocates a 50,000-byte buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.
| Category | Opcodes | Codegen Handlers |
|---|---|---|
| Math | div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh | sub_573860, sub_5B76D0 (64KB), sub_589810, sub_5B0CD0 (44KB), sub_57BFC0, sub_5B4040 (49KB), sub_583190, sub_52A5C0, sub_505B00 |
| Memory | membar, _ldldu, prefetch | sub_4DB410, sub_4DD860, sub_507FB0 |
| Conversion | cvt | sub_59F630 |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp, copysign | sub_590C20, sub_50B5A0, sub_578470, sub_52E100, sub_4DBCC0, sub_4DB210, sub_581A10, sub_50B180 |
| Texture | tex, tex.base, tex.level, tld4, tex.grad | sub_584D10, sub_5879B0, sub_58B6A0, sub_56D700, sub_5ADDC0 (50KB) |
| Video (SIMD) | vadd/vsub/vmin/vmax/vabsdiff/vshl/vshr/vset/vmad (scalar), vadd2/vmax2/vmin2/vabsdiff2/vset2/vsub2/vavrg2 (packed 2x16), vadd4/vmin4/vmax4/vabsdiff4/vset4/vsub4/vavrg4 (packed 4x8) | per-instruction handlers |
| Dot product | dp2a.lo, dp2a.hi, dp4a | sub_56BA60, sub_56C8D0, sub_577BA0 |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta/barrier.cta (.arrive/.red variants), bar.warp | sub_524FB0, sub_570290, sub_500BF0, sub_570940, sub_52D590, sub_5889B0, sub_56A5A0 |
| Warp | vote, shfl, match, redux | sub_580E50, sub_5801D0, sub_58A730, sub_567680 |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor | sub_4DC180, sub_593210, sub_5AB460 (45KB) |
| Matrix | ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk | sub_50D4B0, sub_4DAEA0, sub_4F05D0, sub_58E9B0, sub_5825A0, sub_549430 |
| Cache | createpolicy.range, createpolicy.fractional, createpolicy.cvt | per-instruction handlers |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma | sub_5A2D10, sub_5A0EA0, sub_5A8E40, sub_5A6BD0, sub_5C7A50 (173KB) |
| MMA | mma | sub_5C10A0 (120KB) |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group | sub_50AC70, sub_4DA380, sub_4DA4B0, sub_4DA5E0 |
| Multimem | multimem.ld_reduce, multimem.st, multimem.red | sub_58D8B0, sub_57B4C0, sub_50A850 |
| Tensormap | tensormap.replace | sub_57F6E0 |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws | sub_569180, sub_526370, sub_58C7F0, sub_574050, sub_578DB0, sub_571FE0, sub_56C190, sub_5427F0, sub_4F1A90, sub_5BBC30 (90KB), sub_58FA20 |
| TCGen05 guardrails | _tcgen05.guardrails.is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage | per-instruction handlers |
Numeric MMA Hash Table (at a1+816)
~400 entries where the key is a numeric string representation of a hash value (e.g., "2644314910") that encodes a specific MMA shape/type/layout combination. The hash encodes the instruction variant completely: matrix dimensions (m16n8k16, m16n8k32, etc.), data type (f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1), and layout (row/col combinations). Each entry maps to a codegen handler function pointer. This avoids a multi-dimensional lookup by collapsing the full variant space into a single hash probe.
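The single-probe dispatch idea can be sketched as follows. The binary's actual mixing function was not recovered, so CRC32 stands in for it here, and the handler value is illustrative.

```python
# Sketch of the numeric MMA hash dispatch: collapse the full variant
# tuple into one deterministic integer, stringify it, and use that as
# the hash-map key. zlib.crc32 is a stand-in for the real hash.
import zlib

def mma_variant_key(shape, dtype, layout):
    return str(zlib.crc32(f"{shape}.{dtype}.{layout}".encode()))

# Illustrative table with a single registered variant.
handlers = {mma_variant_key("m16n8k16", "f16", "row.col"): "sub_5C10A0"}

def dispatch(shape, dtype, layout):
    # One hash probe replaces a multi-dimensional lookup.
    return handlers.get(mma_variant_key(shape, dtype, layout))
</```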
Body Template Name Table -- sub_5D7430
At 161KB of machine code (0x5D7430--0x5FF700), this is the largest function in the intrinsic infrastructure by code size and the 6th largest function in the entire ptxas binary. IDA failed to decompile it; all analysis comes from raw x86-64 disassembly. The function constructs a third hash map (at context offset +824 / 0x338) containing 1,079 entries that map dynamically constructed __cuda_* intrinsic names to sequential body template IDs (0--1078).
Why 1,079 Body Templates for 607 Logical Intrinsics
The master registration table (sub_5D1660) maps 607 intrinsic names to logical IDs. The body template table (sub_5D7430) maps 1,079 variant-specific names to prototype generator case numbers. The 1.78x expansion has one dominant cause: WMMA template proliferation across GPU generations.
The 204 logical WMMA entries in sub_5D1660 cover only the original sm_70 Volta shapes (m16n16k16/m32n8k16/m8n32k16 with f16/f32 types). But the body template table includes all later-generation WMMA variants -- sm7x sub-byte/bit, sm72 integer, sm8x tf32/bf16/f64 -- that were added as hardware evolved. These ~416 extra WMMA templates have no matching entry in the 607 logical ID table; they exist only in the body template hash map and the prototype generator switch.
Non-WMMA intrinsics map approximately 1:1 between logical IDs and body templates. The math operations (div, rcp, sqrt) are already fully type-specialized at the logical level -- each rounding-mode/type combination is a separate logical intrinsic.
Three sources of expansion beyond the 607 logical entries:
- Later-generation WMMA variants (~416 template-only entries):
  - sm7x sub-byte WMMA (s4/u4 m8n8k32) + bit WMMA (m8n8k128): ~231 templates
  - sm72 integer WMMA (m16n16k16/m32n8k16/m8n32k16 integer types): ~105 templates
  - sm8x tf32 WMMA (m16n16k8) + bf16/f64 WMMA: ~80 templates
- Aligned warp sync variants (~13 extra templates): matchsync_aligned, votesync_aligned, votesync_ballot_groupwise, query_activemask/query_activemask_groupwise for cooperative group support
- Additional SM100 specializations (~8 extra templates): tcgen05_alloc_two_sm, extra guardrails check variants, get_warp_rank
Conversely, 18 sm1xx bulk copy intrinsics have logical IDs but zero body templates -- they bypass the template/prototype mechanism entirely and are lowered directly to inline PTX by the opcode dispatch handlers (sub_593210, sub_5AB460).
Template Distribution Table
| Logical Group | Logical | Template | Factor |
|---|---|---|---|
| SM20 IEEE math (div, rem, rcp, sqrt, bfe/bfi) | 70 | 70 | 1.0x |
| SM3x optimized division | 4 | 4 | 1.0x |
| SM62 integer dot product | 2 | 2 | 1.0x |
| SM70 barriers | 170 | 170 | 1.0x |
| SM70 warp sync (match, vote, shfl, query) | 19 | 32 | 1.7x |
| SM70 WMMA (f16/f32 original Volta) | 204 | 249 | 1.2x |
| SM7x WMMA extended (sub-byte, bit) | 0 | 231 | tmpl-only |
| SM72 WMMA (integer) | 0 | 105 | tmpl-only |
| SM8x WMMA (tf32, bf16, f64) | 0 | 80 | tmpl-only |
| SM80 cache policy | 3 | 4 | 1.3x |
| SM8x direct MMA | 14 | 14 | 1.0x |
| SM9x Hopper sub-byte/bit MMA | 51 | 52 | 1.0x |
| SM10x Blackwell MMA metadata | 10 | 10 | 1.0x |
| SM100 tcgen05 + guardrails | 11 | 19 | 1.7x |
| SM100+ bulk copy / TMA | 18 | 0 | (no templates) |
| Redux sync primitives | 17 | 17 | 1.0x |
| Compute-sanitizer hooks | 7 | 7 | 1.0x |
| Video instruction emulation | 7 | 7 | 1.0x |
| Total | 607 | 1,073 | 1.78x |
WMMA subtotal: 204 logical entries expand to 665 body templates (3.3x). Non-WMMA: 403 logical entries map to 408 templates (~1.0x). The remaining 7 templates (1,080 prototype switch cases minus 1,073 classified) are sanitizer/cache variants where IDA produced qmemcpy instead of strcpy, preventing exact name extraction.
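The accounting in the distribution table is internally consistent, which can be checked directly from its numbers:

```python
# Cross-check of the template distribution table's totals.
wmma_templates = 249 + 231 + 105 + 80   # sm70 + sm7x + sm72 + sm8x rows
non_wmma_logical = 607 - 204
non_wmma_templates = 1073 - wmma_templates
assert wmma_templates == 665            # 204 logical -> 665 templates (~3.3x)
assert (non_wmma_logical, non_wmma_templates) == (403, 408)
assert 1080 - 1073 == 7                 # unclassified prototype switch cases
```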
The sm70 WMMA group itself expands from 204 to 249 templates because the prototype generator includes update_ptr and desc (descriptor-based addressing) variants of certain load/store operations that the logical table does not separate.
The three "tmpl-only" WMMA rows (sm7x/sm72/sm8x) are the single largest contributor to the expansion. They represent ~416 templates with zero logical ID counterparts. These families use .FORCE_INLINE .func linkage in their prototypes instead of the .weak .func used by the original sm70 WMMA entries:
sm70 (original): .weak .func (...) __cuda_sm70_wmma_m16n16k16_load_a_col (...)
sm72 (integer): .FORCE_INLINE .func (...) __cuda_sm72_Integer_wmma_m16n16k16_load_a_row (...)
sm7x (sub-byte): .FORCE_INLINE .func (...) __cuda_sm7x_sub_byte_wmma_m8n8k32_load_a_row (...)
sm8x (tf32): .FORCE_INLINE .func (...) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (...)
The .FORCE_INLINE directive forces inlining at every call site rather than emitting a separate callable function. The later-gen WMMA implementations are more complex and performance-sensitive, making per-call-site specialization profitable.
Name Construction Algorithm
The function contains zero string references because it constructs all 1,079 names at runtime. For each entry:
- Allocate a 20-byte buffer via sub_424070(allocator, 20)
- Copy prefix (16 bytes) from .rodata via SSE movdqa + movups (e.g., "__cuda_sm20_div_")
- Append suffix (4 bytes) via movl immediate at offset +16 (e.g., "s16\0", "u64\0", "rn_f")
- Register via sub_426150(context+824, buffer, template_id) with sequential integer IDs
The 533 unique .rodata prefix addresses fan out through multiple suffixes per prefix:
.rodata prefix (16B) suffix (4B) result (20B buffer)
─────────────────────── ─────────── ──────────────────────
"__cuda_sm20_div_" + "s16\0" = "__cuda_sm20_div_s16"
"__cuda_sm20_div_" + "u16\0" = "__cuda_sm20_div_u16"
"__cuda_sm20_div_" + "u64\0" = "__cuda_sm20_div_u64"
"__cuda_sm20_div_" + "s64\0" = "__cuda_sm20_div_s64"
"__cuda_sm20_div_" + "rn_f" = "__cuda_sm20_div_rn_f" (truncated)
"__cuda_sm20_rem_" + "s16\0" = "__cuda_sm20_rem_s16"
"__cuda_sm20_rcp_" + "rn_f" = "__cuda_sm20_rcp_rn_f" (truncated)
"__cuda_sm70_barr" + "ier_" = "__cuda_sm70_barrier_" (prefix chain)
Names truncated at the 20-byte buffer limit are still sufficient for hash map lookup -- the full untruncated name appears only inside the prototype string in sub_5FF700.
Worked Example: Division (Cases 0--26)
The __cuda_sm20_div operation illustrates the template-to-prototype mapping. Division has 19 logical IDs and 19 body templates (1:1 ratio) because each type/rounding/precision variant is already a separate logical intrinsic. The suffix encodes the type specialization:
| Case | Body Template Name | Type Suffix | PTX Signature |
|---|---|---|---|
| 0 | __cuda_sm20_div_s16 | s16 | (.reg .s32 %d) ... (.reg .s32 %a0, .reg .s32 %a1) |
| 1 | __cuda_sm20_div_u16 | u16 | (.reg .u32 %d) ... (.reg .u32 %a0, .reg .u32 %a1) |
| 4 | __cuda_sm20_div_u64 | u64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 5 | __cuda_sm20_div_s64 | s64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | __cuda_sm20_div_rn_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 10 | __cuda_sm20_div_rd_f32 | rd_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 14 | __cuda_sm20_div_rn_ftz_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 22 | __cuda_sm20_div_ru_f64_v2 | ru_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
| 25 | __cuda_sm20_div_rn_f64_full | rn_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
Cases 2--3 (rem s16/u16), 6--7 (rem s64/u64) are interleaved between the division entries. Cases 8, 13 are _slowpath variants that implement Newton-Raphson refinement fallbacks. Cases 18--21 are the sm3x-optimized division variants with the same suffix scheme. Note: s16/u16 division uses .s32/.u32 register types because PTX has no 16-bit register class; the 16-bit operation is performed by 32-bit hardware with appropriate sign/zero extension.
Statistics
| Metric | Value |
|---|---|
| Machine code size | 164,560 bytes (0x5D7430--0x5FF700) |
| `sub_426150` calls | 1,079 |
| Unique .rodata prefix addresses | 533 |
| Hash map destination | context+824 (0x338) |
| Buffer size per entry | 20 bytes |
| IDA decompilation | Failed (function too large/repetitive) |
Context Hash Map Summary
The intrinsic lowering context object holds five hash maps and one flat table:
| Offset | Field | Builder | Contents | Entries |
|---|---|---|---|---|
| +808 | opcode handlers | sub_5D4190 | PTX opcode name -> codegen fn ptr | ~120 |
| +816 | MMA hash handlers | sub_5D4190 | numeric hash -> codegen fn ptr | ~400 |
| +824 | body templates | sub_5D7430 | intrinsic name -> template ID | 1,079 |
| +1056 | descriptor table | sub_5D1660 | 608 x 16B intrinsic descriptor slots | 608 |
| +1064 | ID map | sub_5D1660 | intrinsic name -> logical ID (1-607) | 607 |
| +1072 | count | sub_5D1660 | 608 (includes null slot 0) | -- |
Instruction Property Accessors
All codegen handlers query instruction properties through accessor functions on the instruction object at a1+1096. These are the same accessors used by WMMA, MMA, and tcgen05 codegen.
| Accessor | Purpose | Usage Example |
|---|---|---|
| `sub_70B6E0` | Check if feature enabled | `sub_70B6E0(obj)` -- boolean feature gate |
| `sub_70B780` | Get feature parameter | Numeric feature parameter |
| `sub_70FA00` | Check instruction capability for SM | `sub_70FA00(*, 23)` = texture, `sub_70FA00(*, 29)` = tcgen05 |
| `sub_70E940` | Get operand count | Number of operands |
| `sub_70E6E0` | Get data type | Operand data type enumeration |
| `sub_70ACC0` | Get accumulator type | MMA accumulator data type |
| `sub_709860` | Get register type/size | Register class and width |
| `sub_70F460` | Get layout variant | row/col matrix layout |
| `sub_707D60` | Check MMA shape variant | m16n16k16 vs m32n8k16, etc. |
| `sub_709910` | Check sparse mode | Sparse MMA variant flag |
| `sub_70F650` | Get matrix dimension (M/N) | Matrix size parameter |
| `sub_70F600` | Get matrix dimension (K) | Alternate dimension parameter |
| `sub_70CA60` | Get operand type by index | `sub_70CA60(*, 0)` -- type of first operand (21 = specific type, 58 = f32, 59 = f64) |
| `sub_70BA40` | Texture mode query | Texture sampling mode |
| `sub_70BD50` | Sampler mode query | Texture sampler configuration |
| `sub_70BB20` | Bulk tensor mode | cp.async.bulk.tensor transfer mode |
| `sub_70F0A0` | Get sparse metadata | Sparse matrix metadata parameter |
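As an illustration of how a handler gates on these accessors, here is a small Python stub. The `Instr` class and the wrapper function are invented for the example; only the accessor name and the capability indices (23 = texture, 29 = tcgen05) come from the table above:

```python
TEXTURE, TCGEN05 = 23, 29          # capability indices passed to sub_70FA00

class Instr:
    """Stand-in for the instruction object at a1+1096 (modeled, not real)."""
    def __init__(self, caps):
        self.caps = set(caps)

    def sub_70FA00(self, idx):
        # Capability query accessor: true if the target SM supports idx.
        return idx in self.caps

def can_emit_tcgen05(instr) -> bool:
    # Mirrors the guard the tcgen05 codegen performs before emitting.
    return instr.sub_70FA00(TCGEN05)

assert can_emit_tcgen05(Instr([TCGEN05]))
assert not can_emit_tcgen05(Instr([TEXTURE]))
```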
Prototype Generator -- sub_5FF700
At 354KB, this is the single largest function in the intrinsic infrastructure and the 2nd largest function in the entire ptxas binary. It takes a body template ID (a1, range 0--1079) and an allocator context (a2), allocates a buffer via sub_4DA340(size, a2), fills it with a PTX prototype string via strcpy(), and returns the result. The output is a complete .weak .func or .FORCE_INLINE .func PTX declaration that gets emitted into the PTX output stream so the linker can resolve calls to CUDA runtime helper functions.
The function is a single switch(a1) with 1,080 case labels (0--1079) plus a default case that returns an empty string "". Each case allocates an exact-sized buffer (72--1,200 bytes), copies a hardcoded PTX prototype string into it, and returns the pointer.
Prototype Generator Architecture
sub_5FF700(template_id, allocator)
│
│ switch(template_id) ← 1,080 cases, 0--1079
│
├── case N:
│ buf = sub_4DA340(byte_count, allocator) ← allocate exact-fit buffer
│ strcpy(buf, ".weak .func (...) name (...)") ← copy PTX prototype
│ return buf
│
├── case M: (45 large WMMA mma cases)
│ buf = sub_4DA340(byte_count, allocator) ← up to 1,200 bytes
│ *(u64*)buf = 0x662E206B6165772E ← ".weak .f" (inline store)
│ *(u64*)(buf+N-8) = <trailer> ← last 8 bytes inline
│ qmemcpy(buf+8, .rodata_addr, size-16) ← bulk copy middle
│ return buf
│
└── default:
return ""
Three copy strategies appear in the decompilation, all producing the same result:
| Strategy | Cases | Trigger | Max Size |
|---|---|---|---|
| `strcpy()` with inline string literal | 1,035 | Prototype fits in decompiler string threshold | ~520 bytes |
| `qmemcpy()` with QWORD bookend stores | 45 | Prototype too long for IDA to reproduce as literal | 1,200 bytes |
| Indirect variable assignment + copy | ~130 | IDA SSA split (subset of strcpy) | ~120 bytes |
The qmemcpy cases are the 45 WMMA mma operations with the largest parameter lists (3--4 fragment matrices of 8 elements each). IDA stores the first and last 8 bytes as inline immediates (0x662E206B6165772E = ".weak .f", trailer varies per case) and bulk-copies the middle from .rodata. The prototype content is structurally identical to the strcpy cases.
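The bookend pattern is easy to verify: decoding the inline immediate little-endian yields the prototype's first eight characters. A Python sketch; the reassembly logic is a model of what the compiled code does, and the sample prototype string is made up for the round-trip:

```python
import struct

# The inline first-QWORD immediate from the 45 large WMMA cases:
assert struct.pack("<Q", 0x662E206B6165772E) == b".weak .f"

def assemble(first_q: int, middle: bytes, last_q: int) -> bytes:
    """Reassemble a prototype: first/last 8 bytes as inline QWORD stores,
    middle bulk-copied from .rodata (the qmemcpy(buf+8, ..., size-16) part)."""
    return struct.pack("<Q", first_q) + middle + struct.pack("<Q", last_q)

# Round-trip on an invented prototype string of the right shape:
proto = b".weak .func (.param .align 16 .b32 d[8]) __example (...) ;"
first, = struct.unpack("<Q", proto[:8])
last,  = struct.unpack("<Q", proto[-8:])
assert assemble(first, proto[8:-8], last) == proto
```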
Linkage Directives
Two PTX linkage types are emitted, controlling how the linker handles the declared function:
| Directive | Count | Meaning | Used By |
|---|---|---|---|
.weak | 616 | Overridable by user code; linker uses user version if present | SM20 math, SM70 barriers/sync/WMMA (original Volta), SM80 cache policy, SM8x/9x/10x MMA, redux sync, sanitizer hooks, video emulation, dp2a/dp4a |
.FORCE_INLINE | 464 | Inlined at every call site; no separate callable function | SM70 aligned vote/match/query_activemask, SM7x sub-byte/bit WMMA, SM72 integer WMMA, SM8x tf32/bf16/f64 WMMA, SM10x tcgen05 alloc/guardrails, SM80 createpolicy_fractional |
The .weak linkage supports user-supplied replacements: if the user provides their own implementation of __cuda_sm20_div_s16, the linker will use that instead of the built-in runtime version. The .FORCE_INLINE directive forces per-call-site specialization -- the later-generation WMMA implementations are more complex and performance-sensitive, making inlining profitable.
A subset of .weak prototypes (~410) carry the .unique qualifier:
.weak .func (.reg .b32 dst) __cuda_sm70_barrier_sync (.reg .b32 arg0) .unique ;
.unique instructs the PTX linker to keep exactly one copy of the function body even if multiple compilation units reference it. All barriers, redux sync, warpsync, non-aligned vote/match/shuffle use .unique.
Prototype Format
Every emitted prototype follows one of these structural patterns:
<linkage> .func (<return_params>) <name> (<input_params>) [.unique] ;
| Case | Prototype |
|---|---|
| 0 | .weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1) |
| 4 | .weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64 (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | .weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) |
| 25 | .weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full (.reg .f64 %fda1, .reg .f64 %fda2) |
| 76 | .weak .func (.reg .b32 dst) __cuda_reduxsync_b32_xor (.reg .b32 src, .reg .b32 mask) .unique |
| 303 | .weak .func (.param .align 16 .b32 dst[8]) __cuda_sm70_wmma_m16n16k16_load_a_row (.reg .u64 ptr, .reg .u32 ldm) .unique |
| 666 | .FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) |
| 890 | .weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32 (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, ...) |
| 1055 | .FORCE_INLINE .func (.reg .b32 warp_rank) __cuda_sm10x_get_warp_rank () |
| 1073 | .weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_readmetadata (.param .b64 ..._param_0, .param .b64 ..._param_1) |
Parameter Passing Conventions
Five distinct parameter-passing ABIs appear across the 1,080 prototypes:
Convention A -- Register-only (.reg): Used by math operations, barriers, warp sync, redux sync, video emulation. Return and input parameters are individual .reg declarations with typed names. This is the simplest and most common convention.
.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) ;
Convention B -- Param-array with alignment (.param .align N .b32 name[K]): Used by WMMA load/mma, MMA, Hopper sub-byte MMA, Blackwell MMA. Returns an aligned array of .b32 elements. Array sizes: dst[2], dst[3], dst[4], dst[5], dst[8], mma_dst[2], mma_dst[4], mma_dst[8], ret_dst[3], ret_dst[5]. 326 prototypes use .align 16; 1 prototype (mma_shfl_f16) uses .align 8.
.weak .func (.param .align 16 .b32 d[8]) __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
(.param .align 16 .b32 a[8], .param .align 16 .b32 b[8], .param .align 16 .b32 c[8]) ;
Convention C -- Param-scalar (.param .b64): Used exclusively by the 7 compute-sanitizer hooks. Parameters use fully-qualified names (__cuda_sanitizer_memcheck_malloc_param_0).
.weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_malloc
(.param .b64 __cuda_sanitizer_memcheck_malloc_param_0,
.param .b64 __cuda_sanitizer_memcheck_malloc_param_1) ;
Convention D -- Void return (): Used by WMMA store_d, tcgen05 guardrail traps, sanitizer_free. ~140 prototypes (45 .weak + 95 .FORCE_INLINE).
.weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32
(.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, .reg .b32 sreg1, ...) ;
Convention E -- Multi-register return (.FORCE_INLINE only): Used by extended WMMA load operations (SM7x/SM72/SM8x). Returns 1--4 registers in the return position (never 8 -- 8-element returns use Convention B's .param arrays instead).
.FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3)
__cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) ;
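For concreteness, a small Python formatter reproduces Convention A's shape from a parameter list. The helper is illustrative (not part of ptxas); the spacing matches the examples quoted above:

```python
def proto_reg(linkage, name, rets, args, unique=False):
    """Format a register-only (Convention A) PTX prototype string."""
    fmt = lambda ps: ", ".join(f".reg {ty} {reg}" for ty, reg in ps)
    tail = " .unique" if unique else ""
    return f"{linkage} .func ({fmt(rets)}) {name} ({fmt(args)}){tail} ;"

assert proto_reg(".weak", "__cuda_sm20_div_rn_f32",
                 [(".f32", "%fv1")],
                 [(".f32", "%fa1"), (".f32", "%fa2")]) == (
    ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 "
    "(.reg .f32 %fa1, .reg .f32 %fa2) ;")
```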
PTX Register Types
Eight PTX register types appear across the prototypes:
| Type | Approx. Count | Usage |
|---|---|---|
| `.reg .b32` | ~2,925 | Dominant: barrier args, WMMA/MMA fragments, guardrail params |
| `.reg .u64` | ~520 | Pointers (WMMA/MMA base addresses) |
| `.reg .u32` | ~341 | Integer params (leading dimension, counts, offsets) |
| `.reg .b64` | ~246 | 64-bit bitwise (match bitmask, shuffle predicates, retval) |
| `.reg .f32` | ~106 | Float math (div, rcp, sqrt) |
| `.reg .f64` | ~70 | Double math (div, rcp, sqrt, dsqrt) |
| `.reg .pred` | ~10 | Predicate (vote output, matchsync predicate out) |
| `.reg .s32` | ~6 | Signed 32-bit (SM20 div/rem s16 return values only) |
Note: .b32 is used instead of .s32/.u32/.f32 for operations where the type interpretation is determined by the instruction rather than the register declaration (WMMA fragments, MMA accumulators, barrier IDs). The .s32 type appears only in the 4 oldest SM20 div/rem_s16/u16 prototypes (cases 0--3).
Register Naming Convention
The prototype register names encode the data type and role:
| Prefix | Meaning |
|---|---|
| `%d` | 32-bit integer return value (SM20 div/rem s16/u16 only) |
| `%a0`, `%a1` | 32-bit integer input parameters |
| `%rdv1` | 64-bit integer return value |
| `%rda1`, `%rda2` | 64-bit integer input parameters |
| `%fv1` | f32 return value |
| `%fa1`, `%fa2` | f32 input parameters |
| `%fdv1` | f64 return value |
| `%fda1`, `%fda2` | f64 input parameters |
| `%fdnum`, `%fdden` | f64 numerator/denominator (div_f64_v2 variants) |
| `dst`, `dst0..dst7` | Generic output registers (WMMA load, barriers) |
| `src`, `sreg0..sreg7` | Generic input registers (WMMA store) |
| `ptr`, `base` | 64-bit pointer registers |
| `ldm` | Leading dimension parameter (WMMA) |
| `mask` | Warp participation mask |
| `cnt` | Thread count (barrier_sync_count, barrier_arrive_count) |
| `arg0..arg3` | Generic numbered arguments |
| `parg` | Predicate argument (vote) |
| `retVal`, `dummy` | Return/placeholder (tcgen05 guardrails) |
| `activemask`, `warp_rank` | Cooperative group queries |
Buffer Allocation Sizes
sub_4DA340(size, allocator) allocates an exact-fit buffer per prototype:
| Metric | Value |
|---|---|
| Minimum allocation | 72 bytes |
| Maximum allocation | 1,200 bytes |
| Median allocation | ~130 bytes |
| Most common sizes | 132 (37x), 182 (31x), 192 (30x), 125 (29x), 118 (28x) |
| Total allocations | 1,080 |
The 45 qmemcpy cases have the largest buffers: 386--1,200 bytes. These are WMMA mma operations whose prototypes enumerate all 3--4 fragment matrices (a, b, c, d) with 4--8 elements each, producing prototype strings that exceed 900 bytes.
Case Range Layout
The 1,080 cases follow the body template registration order from sub_5D7430, roughly grouped by SM generation:
| Case Range | Count | Category | Linkage |
|---|---|---|---|
| 0--69 | 70 | SM20 IEEE math (div, rem, rcp, sqrt, bfe, bfi, dsqrt, drsqrt) | .weak |
| 70--73 | 4 | SM3x optimized division (rn_ftz/noftz f32 + slowpaths) | .weak |
| 74--75 | 2 | SM62 dp2a/dp4a | .weak |
| 76--92 | 17 | Redux sync (b32/s32/u32/f32 add/max/min/xor/and/or/abs/NaN) | .weak .unique |
| 93--~274 | ~182 | SM70 barriers (sync/arrive/red, 16 IDs x with/without count) | .weak .unique |
| ~275--~302 | ~28 | SM70 vote, shuffle, match (bfly/down/idx/up, all/any/b32/b64) | .weak .unique / .FORCE_INLINE |
| ~303--~665 | ~363 | SM70 WMMA load/store (m16n16k16, m32n8k16, m8n32k16, all types/spaces) | .weak .unique |
| ~666--~889 | ~224 | SM7x/SM72/SM8x extended WMMA (sub-byte, integer, tf32, bf16, f64) | .FORCE_INLINE |
| ~890--~964 | ~75 | SM70 WMMA store_d (all shapes/layouts/spaces/types) | .weak |
| ~965--~1048 | ~84 | SM70 WMMA mma + SM8x/SM9x/SM10x MMA (f16/f32, sub-byte, bit, sparse) | .weak |
| ~1049--~1055 | ~7 | SM10x tcgen05 guardrail traps | .weak |
| ~1056--~1060 | ~5 | SM8x direct MMA (mma_shfl, row/col f16/f32 combos) | .weak |
| ~1061--~1072 | ~12 | SM10x tcgen05 alloc/guardrails check functions + get_warp_rank + create_mask | .FORCE_INLINE / .weak |
| 1073--1079 | 7 | Compute-sanitizer hooks (readmetadata, generic, global, local, shared, malloc, free) | .weak |
Statistics
| Metric | Value |
|---|---|
| Machine code size | 362,496 bytes (0x5FF700--0x658B00) |
| Decompiled lines | 9,414 |
| Switch cases | 1,080 (case 0 through case 1079 + default) |
| Local variables declared | ~716 (IDA SSA artifacts) |
| `.weak` prototypes | 616 (571 strcpy + 45 qmemcpy) |
| `.FORCE_INLINE` prototypes | 464 |
| `.unique`-qualified prototypes | ~410 |
| `.param .align` prototypes | 327 (326 align-16, 1 align-8) |
| Void-return prototypes | ~140 |
| Predicate-using prototypes | ~10 |
Major Codegen Handlers
The four largest codegen handlers together represent ~500KB of code and cover the tensor core instruction families.
sub_5C7A50 -- WMMA.MMA Codegen (173KB)
The largest codegen handler. Generates inline PTX code for wmma.mma instructions across all variant combinations.
- Allocates a 50,000-byte buffer for code generation
- Covers shapes: m16n16k16, m32n8k16, m8n32k16
- Data types: f16, f32, bf16, tf32, s8, u8, s4, u4, b1
- Layouts: row/col for each of the A, B, C, D matrices (4 layout combinations)
- Satfinite variants for each configuration
- Address spaces: generic, global, shared
sub_5C10A0 -- MMA Codegen (120KB)
Handles the newer mma.sync API (non-WMMA). Covers the post-Volta PTX MMA instructions.
- Shapes: m8n8k4, m16n8k8, m16n8k16, m16n8k32, m16n8k64, m16n8k128, m16n8k256
- Types: f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1
- Sparse variants for sm_80+ and sm_90+ (structured sparsity 2:4)
sub_5BBC30 -- TCGen05.MMA Codegen (90KB)
Blackwell 5th-generation tensor core MMA code generation. Handles the tcgen05.mma instruction family introduced in sm_100.
- Allocates a 50,000-byte buffer
- Queries `sub_70FA00(*, 29)` to validate tcgen05 capability
- Handles standard, sparse, and warp-shared (`.ws`) variants
- Uses `sub_70F0A0` for sparse metadata parameter extraction
- Generates code for tcgen05-specific tensor memory addressing
sub_5B76D0 -- Division Codegen (64KB)
Generates inline PTX code for all div variants.
- Integer division: s16, s64, u16, u64
- Floating-point division: f32, f64 with all rounding modes (rn, rd, ru, rz)
- Flush-to-zero (ftz) variants for f32
- Checks operand type via `sub_70CA60(*(_QWORD *)(a1+1096), 0) == 21`
- Emits both fastpath and slowpath (Newton-Raphson) code sequences
OCG Intrinsic System -- sub_6C9EB0
The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical system at sub_5D1660 maps CUDA runtime helper names to integer IDs and generates inline PTX, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handlers that validate parameters and emit SASS instructions directly -- bypassing PTX entirely. The OCG table contains 44 operations across 9 categories: arithmetic, packed float, vector integer, async copy/TMA, load/store/cache, reduction/fence, tensor core, tensor memory, and synchronization.
See OCG Intrinsic System (44 Operations) for the complete builtin name table, handler functions, validation strings, SASS-level handlers, and the full five-stage lowering pipeline with operand buffer layout.
Intrinsic Families by SM Generation
Each SM generation introduces new intrinsic families while preserving all earlier ones. The per-SM intrinsic table initializer functions (sub_60AXXX cluster, registered in Map 3 of the capability dispatch) control which intrinsics are available on each target.
sm_20 -- Software IEEE Math (70 entries)
The foundation layer. 70 intrinsics providing IEEE-754-compliant software implementations of math operations that either lack hardware support or need exact rounding guarantees. All later SM targets inherit these.
- Division: `div_s16`, `div_u64`, `div_rn_f32`, `div_rn_f64_full`, etc. -- all rounding modes (rn/rd/ru/rz) and types (s16/s64/u16/u64/f32/f64)
- Reciprocal: `rcp_rn_f32`, `rcp_rn_f64`, etc. -- all rounding modes
- Square root: `sqrt_rn_f32`, `sqrt_rn_f64`, etc. -- all rounding modes
- Double-precision sqrt: `dsqrt_rn`, `dsqrt_rd`, `dsqrt_ru`, `dsqrt_rz`
- Double-precision reciprocal sqrt: `drsqrt_rn`
- Bit extract/insert: `bfe` (bit field extract), `bfi` (bit field insert)
- Remainder: `rem_s32`, `rem_u32`, `rem_s64`, `rem_u64`
Codegen handlers: sub_5B76D0 (div, 64KB), sub_5B0CD0 (rcp, 44KB), sub_5B4040 (sqrt, 49KB).
sm_3x -- Optimized Division (4 entries)
Four optimized division variants introduced on Kepler to improve throughput on common division patterns.
sm_62 -- Integer Dot Product (2 entries)
dp2a and dp4a integer dot product intrinsics introduced on Pascal (GP10x). Software emulation of the hardware instructions added in sm_61/sm_62.
sm_70 -- Volta Warp-Synchronous + WMMA (393 entries)
The largest single block. Volta introduced mandatory warp-synchronous programming with explicit sync masks and the first generation of tensor core (WMMA) instructions.
Synchronization primitives:
- `barrier_arrive`/`barrier_sync`/`barrier_red` (0--15, with/without count)
- `matchsync_all/any_b32/b64` with predicate variants
- `shflsync_bfly/down/idx/up` with predicate variants
- `votesync_all/any/ballot/uni`
- `warpsync`
WMMA (Warp Matrix Multiply-Accumulate):
- Shapes: m16n16k16, m32n8k16, m8n32k16
- Operations per shape: `load_a`, `load_b`, `load_c`, `store_d`, `mma`
- Layouts: row/col combinations for A and B matrices
- Types: f16, f32 (with satfinite optional)
- Address spaces: generic, global, shared
sm_80 -- Ampere Cache Policy (3 entries)
Three createpolicy intrinsics for L2 cache management: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode.
sm_10x -- Blackwell MMA Metadata + Bit MMA (10 entries)
10 hmma_mdata/imma_mdata + bit MMA intrinsics for sm_100: metadata variants at m16n8k16/k32/k64 shapes, and 1-bit AND/XOR MMA at m8n8k128/m16n8k128/m16n8k256.
sm_8x -- Direct MMA (14 entries)
14 mma_* intrinsics for sm_8x: 12 direct MMA operations (f16/f32 accumulator x 4 layout combinations of col/row for A and B) plus mma_shfl_f16 and mma_shfl_f32 for register-to-register MMA shuffle.
sm_9x -- Sub-Byte + Bit MMA (51 entries)
51 Hopper-era intrinsics: 3 bit-XOR MMA (m8n8k128/m16n8k128/m16n8k256), 24 dense sub-byte MMA (s4/u4 at m16n8k32/m16n8k64/m8n8k32, with satfinite), 8 sparse m16n8k128, and 16 sparse m16n8k64 (with _0/_1 split variants and satfinite).
sm_10x (via __cuda_sm10x_*) -- Blackwell Tensor Memory + Guardrails (11 entries)
- 1 `create_mask_from_bit_idx_and_alloc_size_v1` helper
- 10 `tcgen05_guardrail_trap_*` intrinsics for debug validation of tensor memory operations
sm_1xx -- Bulk Copy (18 entries)
18 bulk copy and cp.async.bulk.tensor intrinsics covering 1D through 5D tensor copies with tile and im2col addressing modes, both unicast and multicast variants.
Intrinsic Lookup Flow
The lookup path from a function call in PTX source to the codegen handler follows this sequence:
PTX source: call.uni __cuda_sm70_warpsync, (%mask);
|
v
sub_5D1660 hash map (a1+1064)
key: "__cuda_sm70_warpsync"
value: integer ID (within 0x89..0x211 range)
|
v
sub_5FF700 switch(ID)
Emits: .weak .func __cuda_sm70_warpsync (.reg .u32 %a0)
|
v
sub_5D4190 named opcode hash map (a1+808)
key: PTX opcode (e.g., "shfl", "vote", "barrier")
value: codegen handler function pointer
|
v
Codegen handler (e.g., sub_5801D0 for "shfl")
Queries instruction properties via sub_70XXXX accessors
Generates inline PTX code into 50KB buffer
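The chain above can be modeled as two dictionary lookups. This is a toy model: the real maps live at a1+1064 and a1+808, the warpsync ID shown here is a placeholder within the documented 0x89..0x211 range, and only the `shfl` handler address comes from this page:

```python
# Stage 1 map (sub_5D1660, a1+1064): intrinsic name -> logical ID.
name_to_id = {"__cuda_sm70_warpsync": 0x89}       # ID is a placeholder value
# Stage 2 map (sub_5D4190, a1+808): PTX opcode -> codegen handler.
opcode_to_handler = {"shfl": "sub_5801D0"}

def lower(call_name, opcode):
    tid = name_to_id[call_name]           # sub_5FF700(tid) emits the prototype
    handler = opcode_to_handler[opcode]   # handler generates the inline PTX
    return tid, handler

assert lower("__cuda_sm70_warpsync", "shfl") == (0x89, "sub_5801D0")
```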
For OCG intrinsics on SM100+, the path bypasses PTX entirely: sub_6A97B0 matches call nodes to SASS instructions via an RB-tree, sub_6C9BC0 parses the __nv_ptx_builtin_ocg_* name into an operation enum + sub-op array, sub_6CC690 routes to type-specific handlers and assembles operands, and sub_6CB8A0 emits the final SASS instruction. See the OCG Intrinsic System page for the full five-stage pipeline breakdown with operand buffer layout and internal SASS opcode enum values.
Per-SM Intrinsic Initializers
Each SM target has its own intrinsic table initializer function registered in Map 3 of the capability dispatch (sub_607DB0). These functions control which subset of the 607 intrinsics are available on each target.
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | ||
| sm_90 | sub_60A5F0 |
Sub-variants (e.g., sm_100a, sm_100f) share the same initializer as their base SM since they represent the same silicon with different feature exposure levels.
Instruction Description Loader -- sub_9EE390
sub_9EE390 (3,584 bytes, 0x9EE390--0x9EF190) is the constructor for an instruction description object that feeds the register allocator's pre-coloring pass. Despite the diagnostic string "IntrinsicDescrFile=%s", the function loads instruction descriptions broadly -- not just intrinsic operations. It determines which instructions exist for the target SM, what register classes they use, and what scheduling properties apply. The sole caller is sub_991790 (pre-coloring pass, 12KB).
Invocation pattern: The pre-coloring pass checks context+1936 before calling sub_9EE390. If the descriptor for the current SM class already exists, it is reused. This means the expensive initialization happens once per SM architecture per ptxas process lifetime.
Initialization Sequence
1. Extract target properties. Read the target descriptor from `context+1584`. Compute the SM architecture class: `v111 = target_descriptor[+372] >> 12`. Read resource descriptors from option interface slots 41--44.
2. Check option 404 (IntrinsicDescrFile). Query the option interface at `context[208]`. If option 404 is set, extract the file path and log `" IntrinsicDescrFile=%s"`. This CI-internal mechanism supplies an external description file that overrides or extends the built-in instruction table. When absent, the built-in database is used.
3. Determine instruction format class. Call `sub_7DDB50(context)` (GetSmVersionIndex), subtract 1, and index into `dword_21E6330`:

   | sub_7DDB50 return | v114 | Format |
   |---|---|---|
   | 1 | 0 | basic 64-bit |
   | 2 | 1 | 128-bit |
   | 3 | 3 | extended |
   | 4 | 2 | 192-bit |
   | 5+ | 3 | extended (default) |

4. Determine SM generation class. Read `context+12` (sm_version_id), subtract 1, and index into `dword_21E5C80`. The table is an identity mapping (1--11), one entry per SM generation.
5. Construct instruction table (648 bytes). Call `sub_10AFF80` with 32 parameters including memory pool, register count, format class, description file path, architecture descriptor (16 bytes from `context+1888`), SM generation class, instruction count limits, and context flags. Follow with `sub_10B1A90` (init pass 2) and `sub_10AEF10` (finalization).
6. Apply option overrides. Options 497, 738, 739 from the option interface set register limits and allocation budget values on the instruction table sub-object at `+312`.
7. Select SM-specific instruction set descriptor. Based on v111 (SM architecture class):

   | v111 | SM range (inferred) | Alloc size | Constructor | Vtable |
   |---|---|---|---|---|
   | 5 | sm_50--sm_62 | 200 B | `sub_9CDF90` | off_23F3B00 |
   | 6 | sm_70--sm_75 | 216 B | `sub_9CE030` | off_22BB738 |
   | 7 | sm_80--sm_89 | 232 B | `sub_9CE120` | off_22B5150 |
   | 8+ | sm_90--sm_121 | 240 B | `sub_9CE190` | off_22AD230 |
   | <5 | (reuse existing) | -- | -- | -- |

   Each successor inherits the previous class and extends it with generation-specific instructions. The descriptor is stored at `context+1936` and `this+48`.
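The format-class and generation-class selections amount to small table lookups; a Python sketch with the `dword_21E6330` values as recovered above (function and table names are this page's labels):

```python
# dword_21E6330 modeled: (sub_7DDB50 return value) -> v114 format class.
FORMAT_CLASS = {1: 0, 2: 1, 3: 3, 4: 2}
FORMAT_NAME = {0: "basic 64-bit", 1: "128-bit", 2: "192-bit", 3: "extended"}

def instruction_format(sm_version_index: int) -> str:
    # Indices 5 and above fall through to the extended-format default.
    v114 = FORMAT_CLASS.get(sm_version_index, 3)
    return FORMAT_NAME[v114]

assert instruction_format(1) == "basic 64-bit"
assert instruction_format(4) == "192-bit"
assert instruction_format(9) == "extended"
```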
Object Layout
| Offset | Size | Contents |
|---|---|---|
| +0 | 8 | Vtable (off_21E6818 -> [sub_9DAA40, sub_9CADF0, sub_9CAE10, sub_9DDEE0]) |
| +8 | 8 | Back-pointer to compilation context |
| +16 | 8 | Instruction table object (648 B, built by sub_10AFF80) |
| +24 | 8 | Scheduling metadata (from sub_1BBBA60) |
| +32 | 8 | Scratch area pointer (context[198]) |
| +40 | 1 | Dirty flag (0 = clean) |
| +48 | 8 | SM-specific instruction set descriptor |
| +56--136 | -- | Resource descriptors, memory pool, sentinel, sub-allocator |
Diagnostic Strings
| String | Location | Context |
|---|---|---|
"__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
"instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
".weak .func" | sub_5FF700 (354KB) | Prototype declaration prefix |
"__cuda_sm20_*", "__cuda_sm70_*", etc. | sub_5D1660 | Intrinsic name patterns in registration |
"__cuda_sanitizer_memcheck_*" | sub_5D1660 | Compute-sanitizer integration hooks |
"__cuda_sm10x_tcgen05_guardrail_trap_*" | sub_5D1660 | Blackwell debug trap intrinsics |
" IntrinsicDescrFile=%s" | sub_9EE390 (0x9EEC9B) | Instruction description loader -- logs external description file path (option 404) |
".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| `sub_5D1660` | 46KB | Master intrinsic registration -- 607 name-to-ID entries (608 table slots) | 99% |
| `sub_5D4190` | 41KB | Opcode dispatch -- ~120 named + ~400 MMA hash entries | 99% |
| `sub_5FF700` | 354KB | Prototype generator -- .weak .func PTX declarations | 99% |
| `sub_5C7A50` | 173KB | wmma.mma codegen (all shapes/types/layouts) | 98% |
| `sub_5C10A0` | 120KB | mma codegen (mma.sync API, post-Volta) | 98% |
| `sub_5BBC30` | 90KB | tcgen05.mma codegen (Blackwell 5th-gen tensor core) | 98% |
| `sub_5B76D0` | 64KB | div codegen (integer + FP, all rounding modes) | 95% |
| `sub_5ADDC0` | 50KB | tex.grad codegen (1D/2D/3D gradient textures) | 95% |
| `sub_5B4040` | 49KB | sqrt codegen (f32/f64, all rounding modes) | 95% |
| `sub_5AB460` | 45KB | cp.async.bulk.tensor codegen (1D--5D, tile/im2col) | 95% |
| `sub_5B0CD0` | 44KB | rcp codegen (f32/f64 reciprocal, all rounding modes) | 95% |
| `sub_6C9EB0` | 13KB | OCG intrinsic table init -- see OCG Intrinsic System for full function map (27 entries) | 95% |
| `sub_6BDE20` | 7KB | Intrinsic operand expansion | 88% |
| `sub_6BEC60` | 5.8KB | LDC/S2R intrinsic handlers | 90% |
| `sub_9EE390` | 3.5KB | Instruction description loader -- builds per-SM instruction table for pre-coloring ("IntrinsicDescrFile=%s") | 92% |
| `sub_9CDF90` | 156B | SM class 5 instruction set descriptor (200B, vtable off_23F3B00) | 85% |
| `sub_9CE030` | 115B | SM class 6 instruction set descriptor (216B, extends sub_9CDF90) | 85% |
| `sub_9CE120` | 112B | SM class 7 instruction set descriptor (232B, vtable off_22B5150) | 85% |
| `sub_9CE190` | 114B | SM class 8+ instruction set descriptor (240B, vtable off_22AD230) | 85% |
| `sub_9EF190` | 1.1KB | Error handler for instruction description loader (ICE on invalid option type) | 88% |
Cross-References
- OCG Intrinsic System -- SM100+ OCG builtin table (44 operations), lowering pipeline, SASS handler map
- SM Architecture Map -- Per-SM capability dispatch tables and intrinsic initializer assignments
- Math Intrinsics -- Detailed coverage of sm_20 IEEE math intrinsic codegen (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 instruction lowering
- Sync & Warp Intrinsics -- Barrier, vote, shuffle, match, redux intrinsics
- Newton-Raphson Templates -- Software math slowpath sequences used by div/rcp/sqrt
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
- Hash Tables & Bitvectors -- Hash map infrastructure (`sub_425CA0`/`sub_426150`/`sub_426D60`)
- Mercury Encoder -- Master SASS encoder `sub_6D9690` (94KB) that encodes validated intrinsics
- SASS Instruction Encoding -- Instruction encoding infrastructure
- Pipeline Overview -- OCG-time measurement covers intrinsic lowering
Appendix: Complete Intrinsic Name Catalog (607 Entries)
Every intrinsic registered by sub_5D1660, extracted from the decompiled source. IDs are contiguous 1--607 (0x001--0x25F). The suffix after stripping the prefix encodes the operation, data type, rounding mode, address space, and optional modifiers.
__cuda_reduxsync_* -- Redux sync (17 entries, 0x001--0x011, sm_70)
| ID | Hex | Name |
|---|---|---|
| 1 | 0x001 | __cuda_reduxsync_b32_and |
| 2 | 0x002 | __cuda_reduxsync_b32_or |
| 3 | 0x003 | __cuda_reduxsync_b32_xor |
| 4 | 0x004 | __cuda_reduxsync_f32_max |
| 5 | 0x005 | __cuda_reduxsync_f32_max_NaN |
| 6 | 0x006 | __cuda_reduxsync_f32_max_abs |
| 7 | 0x007 | __cuda_reduxsync_f32_max_abs_NaN |
| 8 | 0x008 | __cuda_reduxsync_f32_min |
| 9 | 0x009 | __cuda_reduxsync_f32_min_NaN |
| 10 | 0x00A | __cuda_reduxsync_f32_min_abs |
| 11 | 0x00B | __cuda_reduxsync_f32_min_abs_NaN |
| 12 | 0x00C | __cuda_reduxsync_s32_add |
| 13 | 0x00D | __cuda_reduxsync_s32_max |
| 14 | 0x00E | __cuda_reduxsync_s32_min |
| 15 | 0x00F | __cuda_reduxsync_u32_add |
| 16 | 0x010 | __cuda_reduxsync_u32_max |
| 17 | 0x011 | __cuda_reduxsync_u32_min |
__cuda_sanitizer_memcheck_* -- Compute-sanitizer hooks (7 entries, 0x012--0x018, --)
| ID | Hex | Name |
|---|---|---|
| 18 | 0x012 | __cuda_sanitizer_memcheck_free |
| 19 | 0x013 | __cuda_sanitizer_memcheck_generic |
| 20 | 0x014 | __cuda_sanitizer_memcheck_global |
| 21 | 0x015 | __cuda_sanitizer_memcheck_local |
| 22 | 0x016 | __cuda_sanitizer_memcheck_malloc |
| 23 | 0x017 | __cuda_sanitizer_memcheck_readmetadata |
| 24 | 0x018 | __cuda_sanitizer_memcheck_shared |
__cuda_scalar_video_emulation_* -- Video emulation (7 entries, 0x019--0x01F, sm_20)
| ID | Hex | Name |
|---|---|---|
| 25 | 0x019 | __cuda_scalar_video_emulation_operandExtractAndSignExtend01 |
| 26 | 0x01A | __cuda_scalar_video_emulation_operandExtractAndSignExtend11 |
| 27 | 0x01B | __cuda_scalar_video_emulation_operandExtractAndSignExtend12 |
| 28 | 0x01C | __cuda_scalar_video_emulation_operandExtractAndSignExtend22 |
| 29 | 0x01D | __cuda_scalar_video_emulation_optionalMerge32 |
| 30 | 0x01E | __cuda_scalar_video_emulation_saturate64 |
| 31 | 0x01F | __cuda_scalar_video_emulation_secondOp64 |
__cuda_sm10x_* -- Blackwell tcgen05 guardrails + mask (11 entries, 0x020--0x02A, sm_100)
| ID | Hex | Name |
|---|---|---|
| 32 | 0x020 | __cuda_sm10x_create_mask_from_bit_idx_and_alloc_size_v1 |
| 33 | 0x021 | __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds |
| 34 | 0x022 | __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid |
| 35 | 0x023 | __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc |
| 36 | 0x024 | __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid |
| 37 | 0x025 | __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment |
| 38 | 0x026 | __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc |
| 39 | 0x027 | __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env |
| 40 | 0x028 | __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod |
| 41 | 0x029 | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access |
| 42 | 0x02A | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced |
__cuda_sm1xx_* -- Bulk copy + cp.async.bulk.tensor (18 entries, 0x02B--0x03C, sm_100+)
| ID | Hex | Name |
|---|---|---|
| 43 | 0x02B | __cuda_sm1xx_bulk_copy_multicast |
| 44 | 0x02C | __cuda_sm1xx_bulk_copy_unicast |
| 45 | 0x02D | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_multicast |
| 46 | 0x02E | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_unicast |
| 47 | 0x02F | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_multicast |
| 48 | 0x030 | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_unicast |
| 49 | 0x031 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_multicast |
| 50 | 0x032 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_unicast |
| 51 | 0x033 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_multicast |
| 52 | 0x034 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_unicast |
| 53 | 0x035 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_multicast |
| 54 | 0x036 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_unicast |
| 55 | 0x037 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_multicast |
| 56 | 0x038 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_unicast |
| 57 | 0x039 | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_multicast |
| 58 | 0x03A | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_unicast |
| 59 | 0x03B | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_multicast |
| 60 | 0x03C | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_unicast |
__cuda_sm20_* -- IEEE math (70 entries, 0x03D--0x082, sm_20)
| ID | Hex | Name |
|---|---|---|
| 61 | 0x03D | __cuda_sm20_bfe_s64_ |
| 62 | 0x03E | __cuda_sm20_bfe_u64_ |
| 63 | 0x03F | __cuda_sm20_bfi_u64_ |
| 64 | 0x040 | __cuda_sm20_dblrcp_rn_slowpath_v3 |
| 65 | 0x041 | __cuda_sm20_div_rd_f32 |
| 66 | 0x042 | __cuda_sm20_div_rd_f64_v2 |
| 67 | 0x043 | __cuda_sm20_div_rd_ftz_f32 |
| 68 | 0x044 | __cuda_sm20_div_rn_f32 |
| 69 | 0x045 | __cuda_sm20_div_rn_f64_fast |
| 70 | 0x046 | __cuda_sm20_div_rn_f64_full |
| 71 | 0x047 | __cuda_sm20_div_rn_ftz_f32 |
| 72 | 0x048 | __cuda_sm20_div_rn_ftz_f32_slowpath |
| 73 | 0x049 | __cuda_sm20_div_rn_noftz_f32_slowpath |
| 74 | 0x04A | __cuda_sm20_div_ru_f32 |
| 75 | 0x04B | __cuda_sm20_div_ru_f64_v2 |
| 76 | 0x04C | __cuda_sm20_div_ru_ftz_f32 |
| 77 | 0x04D | __cuda_sm20_div_rz_f32 |
| 78 | 0x04E | __cuda_sm20_div_rz_f64_v2 |
| 79 | 0x04F | __cuda_sm20_div_rz_ftz_f32 |
| 80 | 0x050 | __cuda_sm20_div_s16 |
| 81 | 0x051 | __cuda_sm20_div_s64 |
| 82 | 0x052 | __cuda_sm20_div_u16 |
| 83 | 0x053 | __cuda_sm20_div_u64 |
| 84 | 0x054 | __cuda_sm20_drsqrt_f64_slowpath_v2 |
| 85 | 0x055 | __cuda_sm20_drsqrt_f64_v2 |
| 86 | 0x056 | __cuda_sm20_dsqrt_rd_f64 |
| 87 | 0x057 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 |
| 88 | 0x058 | __cuda_sm20_dsqrt_rn_f64_v3 |
| 89 | 0x059 | __cuda_sm20_dsqrt_ru_f64 |
| 90 | 0x05A | __cuda_sm20_dsqrt_rz_f64 |
| 91 | 0x05B | __cuda_sm20_rcp_f64_v3 |
| 92 | 0x05C | __cuda_sm20_rcp_rd_f32 |
| 93 | 0x05D | __cuda_sm20_rcp_rd_f32_slowpath |
| 94 | 0x05E | __cuda_sm20_rcp_rd_f64 |
| 95 | 0x05F | __cuda_sm20_rcp_rd_ftz_f32 |
| 96 | 0x060 | __cuda_sm20_rcp_rd_ftz_f32_slowpath |
| 97 | 0x061 | __cuda_sm20_rcp_rn_f32 |
| 98 | 0x062 | __cuda_sm20_rcp_rn_f32_slowpath |
| 99 | 0x063 | __cuda_sm20_rcp_rn_ftz_f32 |
| 100 | 0x064 | __cuda_sm20_rcp_rn_ftz_f32_slowpath |
| 101 | 0x065 | __cuda_sm20_rcp_ru_f32 |
| 102 | 0x066 | __cuda_sm20_rcp_ru_f32_slowpath |
| 103 | 0x067 | __cuda_sm20_rcp_ru_f64 |
| 104 | 0x068 | __cuda_sm20_rcp_ru_ftz_f32 |
| 105 | 0x069 | __cuda_sm20_rcp_ru_ftz_f32_slowpath |
| 106 | 0x06A | __cuda_sm20_rcp_rz_f32 |
| 107 | 0x06B | __cuda_sm20_rcp_rz_f32_slowpath |
| 108 | 0x06C | __cuda_sm20_rcp_rz_f64 |
| 109 | 0x06D | __cuda_sm20_rcp_rz_ftz_f32 |
| 110 | 0x06E | __cuda_sm20_rcp_rz_ftz_f32_slowpath |
| 111 | 0x06F | __cuda_sm20_rem_s16 |
| 112 | 0x070 | __cuda_sm20_rem_s64 |
| 113 | 0x071 | __cuda_sm20_rem_u16 |
| 114 | 0x072 | __cuda_sm20_rem_u64 |
| 115 | 0x073 | __cuda_sm20_sqrt_rd_f32 |
| 116 | 0x074 | __cuda_sm20_sqrt_rd_f32_slowpath |
| 117 | 0x075 | __cuda_sm20_sqrt_rd_ftz_f32 |
| 118 | 0x076 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath |
| 119 | 0x077 | __cuda_sm20_sqrt_rn_f32 |
| 120 | 0x078 | __cuda_sm20_sqrt_rn_f32_slowpath |
| 121 | 0x079 | __cuda_sm20_sqrt_rn_ftz_f32 |
| 122 | 0x07A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath |
| 123 | 0x07B | __cuda_sm20_sqrt_ru_f32 |
| 124 | 0x07C | __cuda_sm20_sqrt_ru_f32_slowpath |
| 125 | 0x07D | __cuda_sm20_sqrt_ru_ftz_f32 |
| 126 | 0x07E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath |
| 127 | 0x07F | __cuda_sm20_sqrt_rz_f32 |
| 128 | 0x080 | __cuda_sm20_sqrt_rz_f32_slowpath |
| 129 | 0x081 | __cuda_sm20_sqrt_rz_ftz_f32 |
| 130 | 0x082 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath |
__cuda_sm3x_* -- Optimized division (4 entries, 0x083--0x086, sm_30)
| ID | Hex | Name |
|---|---|---|
| 131 | 0x083 | __cuda_sm3x_div_rn_ftz_f32 |
| 132 | 0x084 | __cuda_sm3x_div_rn_ftz_f32_slowpath |
| 133 | 0x085 | __cuda_sm3x_div_rn_noftz_f32 |
| 134 | 0x086 | __cuda_sm3x_div_rn_noftz_f32_slowpath |
__cuda_sm62_* -- Integer dot product (2 entries, 0x087--0x088, sm_62)
| ID | Hex | Name |
|---|---|---|
| 135 | 0x087 | __cuda_sm62_dp2a |
| 136 | 0x088 | __cuda_sm62_dp4a |
__cuda_sm70_* -- Volta sync/warp/WMMA (393 entries, 0x089--0x211, sm_70)
| ID | Hex | Name |
|---|---|---|
| 137 | 0x089 | __cuda_sm70_barrier_arrive |
| 138 | 0x08A | __cuda_sm70_barrier_arrive_0 |
| 139 | 0x08B | __cuda_sm70_barrier_arrive_0_count |
| 140 | 0x08C | __cuda_sm70_barrier_arrive_1 |
| 141 | 0x08D | __cuda_sm70_barrier_arrive_10 |
| 142 | 0x08E | __cuda_sm70_barrier_arrive_10_count |
| 143 | 0x08F | __cuda_sm70_barrier_arrive_11 |
| 144 | 0x090 | __cuda_sm70_barrier_arrive_11_count |
| 145 | 0x091 | __cuda_sm70_barrier_arrive_12 |
| 146 | 0x092 | __cuda_sm70_barrier_arrive_12_count |
| 147 | 0x093 | __cuda_sm70_barrier_arrive_13 |
| 148 | 0x094 | __cuda_sm70_barrier_arrive_13_count |
| 149 | 0x095 | __cuda_sm70_barrier_arrive_14 |
| 150 | 0x096 | __cuda_sm70_barrier_arrive_14_count |
| 151 | 0x097 | __cuda_sm70_barrier_arrive_15 |
| 152 | 0x098 | __cuda_sm70_barrier_arrive_15_count |
| 153 | 0x099 | __cuda_sm70_barrier_arrive_1_count |
| 154 | 0x09A | __cuda_sm70_barrier_arrive_2 |
| 155 | 0x09B | __cuda_sm70_barrier_arrive_2_count |
| 156 | 0x09C | __cuda_sm70_barrier_arrive_3 |
| 157 | 0x09D | __cuda_sm70_barrier_arrive_3_count |
| 158 | 0x09E | __cuda_sm70_barrier_arrive_4 |
| 159 | 0x09F | __cuda_sm70_barrier_arrive_4_count |
| 160 | 0x0A0 | __cuda_sm70_barrier_arrive_5 |
| 161 | 0x0A1 | __cuda_sm70_barrier_arrive_5_count |
| 162 | 0x0A2 | __cuda_sm70_barrier_arrive_6 |
| 163 | 0x0A3 | __cuda_sm70_barrier_arrive_6_count |
| 164 | 0x0A4 | __cuda_sm70_barrier_arrive_7 |
| 165 | 0x0A5 | __cuda_sm70_barrier_arrive_7_count |
| 166 | 0x0A6 | __cuda_sm70_barrier_arrive_8 |
| 167 | 0x0A7 | __cuda_sm70_barrier_arrive_8_count |
| 168 | 0x0A8 | __cuda_sm70_barrier_arrive_9 |
| 169 | 0x0A9 | __cuda_sm70_barrier_arrive_9_count |
| 170 | 0x0AA | __cuda_sm70_barrier_arrive_count |
| 171 | 0x0AB | __cuda_sm70_barrier_red_and |
| 172 | 0x0AC | __cuda_sm70_barrier_red_and_0 |
| 173 | 0x0AD | __cuda_sm70_barrier_red_and_0_count |
| 174 | 0x0AE | __cuda_sm70_barrier_red_and_1 |
| 175 | 0x0AF | __cuda_sm70_barrier_red_and_10 |
| 176 | 0x0B0 | __cuda_sm70_barrier_red_and_10_count |
| 177 | 0x0B1 | __cuda_sm70_barrier_red_and_11 |
| 178 | 0x0B2 | __cuda_sm70_barrier_red_and_11_count |
| 179 | 0x0B3 | __cuda_sm70_barrier_red_and_12 |
| 180 | 0x0B4 | __cuda_sm70_barrier_red_and_12_count |
| 181 | 0x0B5 | __cuda_sm70_barrier_red_and_13 |
| 182 | 0x0B6 | __cuda_sm70_barrier_red_and_13_count |
| 183 | 0x0B7 | __cuda_sm70_barrier_red_and_14 |
| 184 | 0x0B8 | __cuda_sm70_barrier_red_and_14_count |
| 185 | 0x0B9 | __cuda_sm70_barrier_red_and_15 |
| 186 | 0x0BA | __cuda_sm70_barrier_red_and_15_count |
| 187 | 0x0BB | __cuda_sm70_barrier_red_and_1_count |
| 188 | 0x0BC | __cuda_sm70_barrier_red_and_2 |
| 189 | 0x0BD | __cuda_sm70_barrier_red_and_2_count |
| 190 | 0x0BE | __cuda_sm70_barrier_red_and_3 |
| 191 | 0x0BF | __cuda_sm70_barrier_red_and_3_count |
| 192 | 0x0C0 | __cuda_sm70_barrier_red_and_4 |
| 193 | 0x0C1 | __cuda_sm70_barrier_red_and_4_count |
| 194 | 0x0C2 | __cuda_sm70_barrier_red_and_5 |
| 195 | 0x0C3 | __cuda_sm70_barrier_red_and_5_count |
| 196 | 0x0C4 | __cuda_sm70_barrier_red_and_6 |
| 197 | 0x0C5 | __cuda_sm70_barrier_red_and_6_count |
| 198 | 0x0C6 | __cuda_sm70_barrier_red_and_7 |
| 199 | 0x0C7 | __cuda_sm70_barrier_red_and_7_count |
| 200 | 0x0C8 | __cuda_sm70_barrier_red_and_8 |
| 201 | 0x0C9 | __cuda_sm70_barrier_red_and_8_count |
| 202 | 0x0CA | __cuda_sm70_barrier_red_and_9 |
| 203 | 0x0CB | __cuda_sm70_barrier_red_and_9_count |
| 204 | 0x0CC | __cuda_sm70_barrier_red_and_count |
| 205 | 0x0CD | __cuda_sm70_barrier_red_or |
| 206 | 0x0CE | __cuda_sm70_barrier_red_or_0 |
| 207 | 0x0CF | __cuda_sm70_barrier_red_or_0_count |
| 208 | 0x0D0 | __cuda_sm70_barrier_red_or_1 |
| 209 | 0x0D1 | __cuda_sm70_barrier_red_or_10 |
| 210 | 0x0D2 | __cuda_sm70_barrier_red_or_10_count |
| 211 | 0x0D3 | __cuda_sm70_barrier_red_or_11 |
| 212 | 0x0D4 | __cuda_sm70_barrier_red_or_11_count |
| 213 | 0x0D5 | __cuda_sm70_barrier_red_or_12 |
| 214 | 0x0D6 | __cuda_sm70_barrier_red_or_12_count |
| 215 | 0x0D7 | __cuda_sm70_barrier_red_or_13 |
| 216 | 0x0D8 | __cuda_sm70_barrier_red_or_13_count |
| 217 | 0x0D9 | __cuda_sm70_barrier_red_or_14 |
| 218 | 0x0DA | __cuda_sm70_barrier_red_or_14_count |
| 219 | 0x0DB | __cuda_sm70_barrier_red_or_15 |
| 220 | 0x0DC | __cuda_sm70_barrier_red_or_15_count |
| 221 | 0x0DD | __cuda_sm70_barrier_red_or_1_count |
| 222 | 0x0DE | __cuda_sm70_barrier_red_or_2 |
| 223 | 0x0DF | __cuda_sm70_barrier_red_or_2_count |
| 224 | 0x0E0 | __cuda_sm70_barrier_red_or_3 |
| 225 | 0x0E1 | __cuda_sm70_barrier_red_or_3_count |
| 226 | 0x0E2 | __cuda_sm70_barrier_red_or_4 |
| 227 | 0x0E3 | __cuda_sm70_barrier_red_or_4_count |
| 228 | 0x0E4 | __cuda_sm70_barrier_red_or_5 |
| 229 | 0x0E5 | __cuda_sm70_barrier_red_or_5_count |
| 230 | 0x0E6 | __cuda_sm70_barrier_red_or_6 |
| 231 | 0x0E7 | __cuda_sm70_barrier_red_or_6_count |
| 232 | 0x0E8 | __cuda_sm70_barrier_red_or_7 |
| 233 | 0x0E9 | __cuda_sm70_barrier_red_or_7_count |
| 234 | 0x0EA | __cuda_sm70_barrier_red_or_8 |
| 235 | 0x0EB | __cuda_sm70_barrier_red_or_8_count |
| 236 | 0x0EC | __cuda_sm70_barrier_red_or_9 |
| 237 | 0x0ED | __cuda_sm70_barrier_red_or_9_count |
| 238 | 0x0EE | __cuda_sm70_barrier_red_or_count |
| 239 | 0x0EF | __cuda_sm70_barrier_red_popc |
| 240 | 0x0F0 | __cuda_sm70_barrier_red_popc_0 |
| 241 | 0x0F1 | __cuda_sm70_barrier_red_popc_0_count |
| 242 | 0x0F2 | __cuda_sm70_barrier_red_popc_1 |
| 243 | 0x0F3 | __cuda_sm70_barrier_red_popc_10 |
| 244 | 0x0F4 | __cuda_sm70_barrier_red_popc_10_count |
| 245 | 0x0F5 | __cuda_sm70_barrier_red_popc_11 |
| 246 | 0x0F6 | __cuda_sm70_barrier_red_popc_11_count |
| 247 | 0x0F7 | __cuda_sm70_barrier_red_popc_12 |
| 248 | 0x0F8 | __cuda_sm70_barrier_red_popc_12_count |
| 249 | 0x0F9 | __cuda_sm70_barrier_red_popc_13 |
| 250 | 0x0FA | __cuda_sm70_barrier_red_popc_13_count |
| 251 | 0x0FB | __cuda_sm70_barrier_red_popc_14 |
| 252 | 0x0FC | __cuda_sm70_barrier_red_popc_14_count |
| 253 | 0x0FD | __cuda_sm70_barrier_red_popc_15 |
| 254 | 0x0FE | __cuda_sm70_barrier_red_popc_15_count |
| 255 | 0x0FF | __cuda_sm70_barrier_red_popc_1_count |
| 256 | 0x100 | __cuda_sm70_barrier_red_popc_2 |
| 257 | 0x101 | __cuda_sm70_barrier_red_popc_2_count |
| 258 | 0x102 | __cuda_sm70_barrier_red_popc_3 |
| 259 | 0x103 | __cuda_sm70_barrier_red_popc_3_count |
| 260 | 0x104 | __cuda_sm70_barrier_red_popc_4 |
| 261 | 0x105 | __cuda_sm70_barrier_red_popc_4_count |
| 262 | 0x106 | __cuda_sm70_barrier_red_popc_5 |
| 263 | 0x107 | __cuda_sm70_barrier_red_popc_5_count |
| 264 | 0x108 | __cuda_sm70_barrier_red_popc_6 |
| 265 | 0x109 | __cuda_sm70_barrier_red_popc_6_count |
| 266 | 0x10A | __cuda_sm70_barrier_red_popc_7 |
| 267 | 0x10B | __cuda_sm70_barrier_red_popc_7_count |
| 268 | 0x10C | __cuda_sm70_barrier_red_popc_8 |
| 269 | 0x10D | __cuda_sm70_barrier_red_popc_8_count |
| 270 | 0x10E | __cuda_sm70_barrier_red_popc_9 |
| 271 | 0x10F | __cuda_sm70_barrier_red_popc_9_count |
| 272 | 0x110 | __cuda_sm70_barrier_red_popc_count |
| 273 | 0x111 | __cuda_sm70_barrier_sync |
| 274 | 0x112 | __cuda_sm70_barrier_sync_0 |
| 275 | 0x113 | __cuda_sm70_barrier_sync_0_count |
| 276 | 0x114 | __cuda_sm70_barrier_sync_1 |
| 277 | 0x115 | __cuda_sm70_barrier_sync_10 |
| 278 | 0x116 | __cuda_sm70_barrier_sync_10_count |
| 279 | 0x117 | __cuda_sm70_barrier_sync_11 |
| 280 | 0x118 | __cuda_sm70_barrier_sync_11_count |
| 281 | 0x119 | __cuda_sm70_barrier_sync_12 |
| 282 | 0x11A | __cuda_sm70_barrier_sync_12_count |
| 283 | 0x11B | __cuda_sm70_barrier_sync_13 |
| 284 | 0x11C | __cuda_sm70_barrier_sync_13_count |
| 285 | 0x11D | __cuda_sm70_barrier_sync_14 |
| 286 | 0x11E | __cuda_sm70_barrier_sync_14_count |
| 287 | 0x11F | __cuda_sm70_barrier_sync_15 |
| 288 | 0x120 | __cuda_sm70_barrier_sync_15_count |
| 289 | 0x121 | __cuda_sm70_barrier_sync_1_count |
| 290 | 0x122 | __cuda_sm70_barrier_sync_2 |
| 291 | 0x123 | __cuda_sm70_barrier_sync_2_count |
| 292 | 0x124 | __cuda_sm70_barrier_sync_3 |
| 293 | 0x125 | __cuda_sm70_barrier_sync_3_count |
| 294 | 0x126 | __cuda_sm70_barrier_sync_4 |
| 295 | 0x127 | __cuda_sm70_barrier_sync_4_count |
| 296 | 0x128 | __cuda_sm70_barrier_sync_5 |
| 297 | 0x129 | __cuda_sm70_barrier_sync_5_count |
| 298 | 0x12A | __cuda_sm70_barrier_sync_6 |
| 299 | 0x12B | __cuda_sm70_barrier_sync_6_count |
| 300 | 0x12C | __cuda_sm70_barrier_sync_7 |
| 301 | 0x12D | __cuda_sm70_barrier_sync_7_count |
| 302 | 0x12E | __cuda_sm70_barrier_sync_8 |
| 303 | 0x12F | __cuda_sm70_barrier_sync_8_count |
| 304 | 0x130 | __cuda_sm70_barrier_sync_9 |
| 305 | 0x131 | __cuda_sm70_barrier_sync_9_count |
| 306 | 0x132 | __cuda_sm70_barrier_sync_count |
| 307 | 0x133 | __cuda_sm70_matchsync_all_b32 |
| 308 | 0x134 | __cuda_sm70_matchsync_all_b32_p |
| 309 | 0x135 | __cuda_sm70_matchsync_all_b64 |
| 310 | 0x136 | __cuda_sm70_matchsync_all_b64_p |
| 311 | 0x137 | __cuda_sm70_matchsync_any_b32 |
| 312 | 0x138 | __cuda_sm70_matchsync_any_b64 |
| 313 | 0x139 | __cuda_sm70_shflsync_bfly |
| 314 | 0x13A | __cuda_sm70_shflsync_bfly_p |
| 315 | 0x13B | __cuda_sm70_shflsync_down |
| 316 | 0x13C | __cuda_sm70_shflsync_down_p |
| 317 | 0x13D | __cuda_sm70_shflsync_idx |
| 318 | 0x13E | __cuda_sm70_shflsync_idx_p |
| 319 | 0x13F | __cuda_sm70_shflsync_up |
| 320 | 0x140 | __cuda_sm70_shflsync_up_p |
| 321 | 0x141 | __cuda_sm70_votesync_all |
| 322 | 0x142 | __cuda_sm70_votesync_any |
| 323 | 0x143 | __cuda_sm70_votesync_ballot |
| 324 | 0x144 | __cuda_sm70_votesync_uni |
| 325 | 0x145 | __cuda_sm70_warpsync |
| 326 | 0x146 | __cuda_sm70_wmma_m16n16k16_load_a_col |
| 327 | 0x147 | __cuda_sm70_wmma_m16n16k16_load_a_col_global |
| 328 | 0x148 | __cuda_sm70_wmma_m16n16k16_load_a_col_shared |
| 329 | 0x149 | __cuda_sm70_wmma_m16n16k16_load_a_row |
| 330 | 0x14A | __cuda_sm70_wmma_m16n16k16_load_a_row_global |
| 331 | 0x14B | __cuda_sm70_wmma_m16n16k16_load_a_row_shared |
| 332 | 0x14C | __cuda_sm70_wmma_m16n16k16_load_b_col |
| 333 | 0x14D | __cuda_sm70_wmma_m16n16k16_load_b_col_global |
| 334 | 0x14E | __cuda_sm70_wmma_m16n16k16_load_b_col_shared |
| 335 | 0x14F | __cuda_sm70_wmma_m16n16k16_load_b_row |
| 336 | 0x150 | __cuda_sm70_wmma_m16n16k16_load_b_row_global |
| 337 | 0x151 | __cuda_sm70_wmma_m16n16k16_load_b_row_shared |
| 338 | 0x152 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16 |
| 339 | 0x153 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_global |
| 340 | 0x154 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_shared |
| 341 | 0x155 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32 |
| 342 | 0x156 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_global |
| 343 | 0x157 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_shared |
| 344 | 0x158 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16 |
| 345 | 0x159 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_global |
| 346 | 0x15A | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_shared |
| 347 | 0x15B | __cuda_sm70_wmma_m16n16k16_load_c_row_f32 |
| 348 | 0x15C | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_global |
| 349 | 0x15D | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_shared |
| 350 | 0x15E | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16 |
| 351 | 0x15F | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16_satfinite |
| 352 | 0x160 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32 |
| 353 | 0x161 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32_satfinite |
| 354 | 0x162 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16 |
| 355 | 0x163 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16_satfinite |
| 356 | 0x164 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32 |
| 357 | 0x165 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32_satfinite |
| 358 | 0x166 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16 |
| 359 | 0x167 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16_satfinite |
| 360 | 0x168 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32 |
| 361 | 0x169 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32_satfinite |
| 362 | 0x16A | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16 |
| 363 | 0x16B | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16_satfinite |
| 364 | 0x16C | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32 |
| 365 | 0x16D | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32_satfinite |
| 366 | 0x16E | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16 |
| 367 | 0x16F | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16_satfinite |
| 368 | 0x170 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32 |
| 369 | 0x171 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32_satfinite |
| 370 | 0x172 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16 |
| 371 | 0x173 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16_satfinite |
| 372 | 0x174 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32 |
| 373 | 0x175 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32_satfinite |
| 374 | 0x176 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16 |
| 375 | 0x177 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16_satfinite |
| 376 | 0x178 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32 |
| 377 | 0x179 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32_satfinite |
| 378 | 0x17A | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16 |
| 379 | 0x17B | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16_satfinite |
| 380 | 0x17C | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32 |
| 381 | 0x17D | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32_satfinite |
| 382 | 0x17E | __cuda_sm70_wmma_m16n16k16_store_d_col_f16 |
| 383 | 0x17F | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_global |
| 384 | 0x180 | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_shared |
| 385 | 0x181 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32 |
| 386 | 0x182 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_global |
| 387 | 0x183 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_shared |
| 388 | 0x184 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16 |
| 389 | 0x185 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_global |
| 390 | 0x186 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_shared |
| 391 | 0x187 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32 |
| 392 | 0x188 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_global |
| 393 | 0x189 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_shared |
| 394 | 0x18A | __cuda_sm70_wmma_m32n8k16_load_a_col |
| 395 | 0x18B | __cuda_sm70_wmma_m32n8k16_load_a_col_global |
| 396 | 0x18C | __cuda_sm70_wmma_m32n8k16_load_a_col_shared |
| 397 | 0x18D | __cuda_sm70_wmma_m32n8k16_load_a_row |
| 398 | 0x18E | __cuda_sm70_wmma_m32n8k16_load_a_row_global |
| 399 | 0x18F | __cuda_sm70_wmma_m32n8k16_load_a_row_shared |
| 400 | 0x190 | __cuda_sm70_wmma_m32n8k16_load_b_col |
| 401 | 0x191 | __cuda_sm70_wmma_m32n8k16_load_b_col_global |
| 402 | 0x192 | __cuda_sm70_wmma_m32n8k16_load_b_col_shared |
| 403 | 0x193 | __cuda_sm70_wmma_m32n8k16_load_b_row |
| 404 | 0x194 | __cuda_sm70_wmma_m32n8k16_load_b_row_global |
| 405 | 0x195 | __cuda_sm70_wmma_m32n8k16_load_b_row_shared |
| 406 | 0x196 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16 |
| 407 | 0x197 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_global |
| 408 | 0x198 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_shared |
| 409 | 0x199 | __cuda_sm70_wmma_m32n8k16_load_c_col_f32 |
| 410 | 0x19A | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_global |
| 411 | 0x19B | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_shared |
| 412 | 0x19C | __cuda_sm70_wmma_m32n8k16_load_c_row_f16 |
| 413 | 0x19D | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_global |
| 414 | 0x19E | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_shared |
| 415 | 0x19F | __cuda_sm70_wmma_m32n8k16_load_c_row_f32 |
| 416 | 0x1A0 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_global |
| 417 | 0x1A1 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_shared |
| 418 | 0x1A2 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16 |
| 419 | 0x1A3 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16_satfinite |
| 420 | 0x1A4 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32 |
| 421 | 0x1A5 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32_satfinite |
| 422 | 0x1A6 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16 |
| 423 | 0x1A7 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16_satfinite |
| 424 | 0x1A8 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32 |
| 425 | 0x1A9 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32_satfinite |
| 426 | 0x1AA | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16 |
| 427 | 0x1AB | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16_satfinite |
| 428 | 0x1AC | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32 |
| 429 | 0x1AD | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32_satfinite |
| 430 | 0x1AE | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16 |
| 431 | 0x1AF | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16_satfinite |
| 432 | 0x1B0 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32 |
| 433 | 0x1B1 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32_satfinite |
| 434 | 0x1B2 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16 |
| 435 | 0x1B3 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16_satfinite |
| 436 | 0x1B4 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32 |
| 437 | 0x1B5 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32_satfinite |
| 438 | 0x1B6 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16 |
| 439 | 0x1B7 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16_satfinite |
| 440 | 0x1B8 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32 |
| 441 | 0x1B9 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32_satfinite |
| 442 | 0x1BA | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16 |
| 443 | 0x1BB | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16_satfinite |
| 444 | 0x1BC | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32 |
| 445 | 0x1BD | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32_satfinite |
| 446 | 0x1BE | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16 |
| 447 | 0x1BF | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16_satfinite |
| 448 | 0x1C0 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32 |
| 449 | 0x1C1 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32_satfinite |
| 450 | 0x1C2 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16 |
| 451 | 0x1C3 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_global |
| 452 | 0x1C4 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_shared |
| 453 | 0x1C5 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32 |
| 454 | 0x1C6 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_global |
| 455 | 0x1C7 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_shared |
| 456 | 0x1C8 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16 |
| 457 | 0x1C9 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_global |
| 458 | 0x1CA | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_shared |
| 459 | 0x1CB | __cuda_sm70_wmma_m32n8k16_store_d_row_f32 |
| 460 | 0x1CC | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_global |
| 461 | 0x1CD | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_shared |
| 462 | 0x1CE | __cuda_sm70_wmma_m8n32k16_load_a_col |
| 463 | 0x1CF | __cuda_sm70_wmma_m8n32k16_load_a_col_global |
| 464 | 0x1D0 | __cuda_sm70_wmma_m8n32k16_load_a_col_shared |
| 465 | 0x1D1 | __cuda_sm70_wmma_m8n32k16_load_a_row |
| 466 | 0x1D2 | __cuda_sm70_wmma_m8n32k16_load_a_row_global |
| 467 | 0x1D3 | __cuda_sm70_wmma_m8n32k16_load_a_row_shared |
| 468 | 0x1D4 | __cuda_sm70_wmma_m8n32k16_load_b_col |
| 469 | 0x1D5 | __cuda_sm70_wmma_m8n32k16_load_b_col_global |
| 470 | 0x1D6 | __cuda_sm70_wmma_m8n32k16_load_b_col_shared |
| 471 | 0x1D7 | __cuda_sm70_wmma_m8n32k16_load_b_row |
| 472 | 0x1D8 | __cuda_sm70_wmma_m8n32k16_load_b_row_global |
| 473 | 0x1D9 | __cuda_sm70_wmma_m8n32k16_load_b_row_shared |
| 474 | 0x1DA | __cuda_sm70_wmma_m8n32k16_load_c_col_f16 |
| 475 | 0x1DB | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_global |
| 476 | 0x1DC | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_shared |
| 477 | 0x1DD | __cuda_sm70_wmma_m8n32k16_load_c_col_f32 |
| 478 | 0x1DE | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_global |
| 479 | 0x1DF | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_shared |
| 480 | 0x1E0 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16 |
| 481 | 0x1E1 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_global |
| 482 | 0x1E2 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_shared |
| 483 | 0x1E3 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32 |
| 484 | 0x1E4 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_global |
| 485 | 0x1E5 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_shared |
| 486 | 0x1E6 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16 |
| 487 | 0x1E7 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16_satfinite |
| 488 | 0x1E8 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32 |
| 489 | 0x1E9 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32_satfinite |
| 490 | 0x1EA | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16 |
| 491 | 0x1EB | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16_satfinite |
| 492 | 0x1EC | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32 |
| 493 | 0x1ED | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32_satfinite |
| 494 | 0x1EE | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16 |
| 495 | 0x1EF | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16_satfinite |
| 496 | 0x1F0 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32 |
| 497 | 0x1F1 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32_satfinite |
| 498 | 0x1F2 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16 |
| 499 | 0x1F3 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16_satfinite |
| 500 | 0x1F4 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32 |
| 501 | 0x1F5 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32_satfinite |
| 502 | 0x1F6 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16 |
| 503 | 0x1F7 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16_satfinite |
| 504 | 0x1F8 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32 |
| 505 | 0x1F9 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32_satfinite |
| 506 | 0x1FA | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16 |
| 507 | 0x1FB | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16_satfinite |
| 508 | 0x1FC | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32 |
| 509 | 0x1FD | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32_satfinite |
| 510 | 0x1FE | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16 |
| 511 | 0x1FF | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16_satfinite |
| 512 | 0x200 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32 |
| 513 | 0x201 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32_satfinite |
| 514 | 0x202 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16 |
| 515 | 0x203 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16_satfinite |
| 516 | 0x204 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32 |
| 517 | 0x205 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32_satfinite |
| 518 | 0x206 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16 |
| 519 | 0x207 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_global |
| 520 | 0x208 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_shared |
| 521 | 0x209 | __cuda_sm70_wmma_m8n32k16_store_d_col_f32 |
| 522 | 0x20A | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_global |
| 523 | 0x20B | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_shared |
| 524 | 0x20C | __cuda_sm70_wmma_m8n32k16_store_d_row_f16 |
| 525 | 0x20D | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_global |
| 526 | 0x20E | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_shared |
| 527 | 0x20F | __cuda_sm70_wmma_m8n32k16_store_d_row_f32 |
| 528 | 0x210 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_global |
| 529 | 0x211 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_shared |
__cuda_sm80_* -- Ampere createpolicy (3 entries, 0x212--0x214, sm_80)
| ID | Hex | Name |
|---|---|---|
| 530 | 0x212 | __cuda_sm80_createpolicy_fractional |
| 531 | 0x213 | __cuda_sm80_createpolicy_fractional_encode |
| 532 | 0x214 | __cuda_sm80_createpolicy_range_encode |
__cuda_sm_10x_* -- Blackwell hmma/imma/bit MMA (10 entries, 0x215--0x21E, sm_100)
| ID | Hex | Name |
|---|---|---|
| 533 | 0x215 | __cuda_sm_10x_hmma_mdata_m16n8k16 |
| 534 | 0x216 | __cuda_sm_10x_hmma_mdata_m16n8k32 |
| 535 | 0x217 | __cuda_sm_10x_imma_mdata_m16n8k32 |
| 536 | 0x218 | __cuda_sm_10x_imma_mdata_m16n8k64 |
| 537 | 0x219 | __cuda_sm_10x_mma_bit_internal_and_m16n8k128 |
| 538 | 0x21A | __cuda_sm_10x_mma_bit_internal_and_m16n8k256 |
| 539 | 0x21B | __cuda_sm_10x_mma_bit_internal_and_m8n8k128 |
| 540 | 0x21C | __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 |
| 541 | 0x21D | __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 |
| 542 | 0x21E | __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 |
__cuda_sm_8x_* -- Direct MMA + shfl (14 entries, 0x21F--0x22C, sm_80+)
| ID | Hex | Name |
|---|---|---|
| 543 | 0x21F | __cuda_sm_8x_mma_col_col_f16_f16_f16_f16 |
| 544 | 0x220 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f16 |
| 545 | 0x221 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f32 |
| 546 | 0x222 | __cuda_sm_8x_mma_col_row_f16_f16_f16_f16 |
| 547 | 0x223 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f16 |
| 548 | 0x224 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f32 |
| 549 | 0x225 | __cuda_sm_8x_mma_row_col_f16_f16_f16_f16 |
| 550 | 0x226 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f16 |
| 551 | 0x227 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f32 |
| 552 | 0x228 | __cuda_sm_8x_mma_row_row_f16_f16_f16_f16 |
| 553 | 0x229 | __cuda_sm_8x_mma_row_row_f32_f16_f16_f16 |
| 554 | 0x22A | __cuda_sm_8x_mma_row_row_f32_f16_f16_f32 |
| 555 | 0x22B | __cuda_sm_8x_mma_shfl_f16 |
| 556 | 0x22C | __cuda_sm_8x_mma_shfl_f32 |
__cuda_sm_9x_* -- Hopper sub-byte/bit MMA (51 entries, 0x22D--0x25F, sm_90)
| ID | Hex | Name |
|---|---|---|
| 557 | 0x22D | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| 558 | 0x22E | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
| 559 | 0x22F | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| 560 | 0x230 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4 |
| 561 | 0x231 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4_satfinite |
| 562 | 0x232 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4 |
| 563 | 0x233 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4_satfinite |
| 564 | 0x234 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4 |
| 565 | 0x235 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4_satfinite |
| 566 | 0x236 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4 |
| 567 | 0x237 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4_satfinite |
| 568 | 0x238 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4 |
| 569 | 0x239 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4_satfinite |
| 570 | 0x23A | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4 |
| 571 | 0x23B | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4_satfinite |
| 572 | 0x23C | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4 |
| 573 | 0x23D | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4_satfinite |
| 574 | 0x23E | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4 |
| 575 | 0x23F | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4_satfinite |
| 576 | 0x240 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4 |
| 577 | 0x241 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4_satfinite |
| 578 | 0x242 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4 |
| 579 | 0x243 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4_satfinite |
| 580 | 0x244 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4 |
| 581 | 0x245 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4_satfinite |
| 582 | 0x246 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4 |
| 583 | 0x247 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4_satfinite |
| 584 | 0x248 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4 |
| 585 | 0x249 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4_satfinite |
| 586 | 0x24A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4 |
| 587 | 0x24B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4_satfinite |
| 588 | 0x24C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4 |
| 589 | 0x24D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4_satfinite |
| 590 | 0x24E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4 |
| 591 | 0x24F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4_satfinite |
| 592 | 0x250 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_0 |
| 593 | 0x251 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_1 |
| 594 | 0x252 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_0 |
| 595 | 0x253 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_1 |
| 596 | 0x254 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_0 |
| 597 | 0x255 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_1 |
| 598 | 0x256 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_0 |
| 599 | 0x257 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_1 |
| 600 | 0x258 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_0 |
| 601 | 0x259 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_1 |
| 602 | 0x25A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_0 |
| 603 | 0x25B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_1 |
| 604 | 0x25C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_0 |
| 605 | 0x25D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_1 |
| 606 | 0x25E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_0 |
| 607 | 0x25F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_1 |
OCG Intrinsic System (44 Builtin Operations, SM100+)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical intrinsic system at sub_5D1660 maps __cuda_* runtime helper names to integer IDs and emits inline PTX code via body templates, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handler functions that validate parameters and emit SASS instructions directly -- bypassing the PTX intermediate step entirely.
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
Initialization -- sub_6C9EB0
sub_6C9EB0 initializes a 10,664-byte (0x29A8) lookup table and sets the vtable pointer to off_202CF48. The operation name prefix is stored at *(_QWORD *)(a1 + 120) = "__nv_ptx_builtin_ocg_". The table contains 44 operations in 248-byte slots starting at offset 128. Each slot holds the operation name followed by up to 30 sub-operation/modifier string pointers (unused slots are NULL from the memset).
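The recovered layout constants can be restated as a small sketch. The constants come from the decompiled initializer; the helper name and docstring are ours, not recovered symbols:

```python
# Sketch of the OCG intrinsic table layout recovered from sub_6C9EB0.
TABLE_SIZE = 0x29A8      # 10,664 bytes, zeroed by memset
SLOT_BASE = 128          # first operation slot follows the table header
SLOT_STRIDE = 248        # name pointer + up to 30 sub-op pointers (31 * 8 bytes)

def slot_offset(slot: int) -> int:
    """Byte offset of an operation slot within the OCG table."""
    return SLOT_BASE + slot * SLOT_STRIDE

# e.g. slot 10 (cp_async_bulk) lands at offset 2608, matching the tables below
```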
OCG Builtin Name Table -- Complete (44 Operations)
The complete OCG builtin table extracted from sub_6C9EB0. Thirty numeric string pointers that IDA left unresolved were recovered by reading null-terminated strings from the ptxas binary at addr - 0x400000 (ELF LOAD virtual address base). The table size 0x29A8 and 248-byte slot stride are verified against the memset in the decompiled code.
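The recovery step maps a pointer value seen in decompiled code to a file offset by subtracting the ELF LOAD base. A minimal sketch, assuming a single LOAD segment at 0x400000 as described above; the function is ours:

```python
# Read a null-terminated ASCII string from the ptxas image at a given
# virtual address, using file offset = vaddr - 0x400000 (ELF LOAD base).
ELF_BASE = 0x400000

def read_cstring(binary: bytes, vaddr: int, max_len: int = 256) -> str:
    off = vaddr - ELF_BASE
    end = binary.index(b"\x00", off, off + max_len)  # find the terminator
    return binary[off:end].decode("ascii")
```

This is how the thirty unresolved numeric pointers were turned back into sub-operation strings.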
Arithmetic and ALU Operations
| Slot | Offset | OCG Name | Sub-Operations / Types | SASS Equivalent |
|---|---|---|---|---|
| 0 | 128 | add | s32, f32, s64, f64, sat | IADD3 / FADD |
| 15 | 3848 | viadd | 32, f16x2 | VIADD |
| 28 | 7072 | mnmx | s32, u32, s64, u64 | IMNMX / FMNMX |
Vector Integer Operations (SM100+ VIMNMX family)
All six vector integer operations share the same type set (s32, u32, s16x2, u16x2), with an optional relu modifier for ReLU clamping.
| Slot | Offset | OCG Name | SASS Equivalent | Description |
|---|---|---|---|---|
| 16 | 4096 | viaddmax | VIADDMNMX | fused add + max |
| 17 | 4344 | viaddmin | VIADDMNMX | fused add + min |
| 18 | 4592 | vimax | VIMNMX | vector integer max |
| 19 | 4840 | vimin | VIMNMX | vector integer min |
| 20 | 5088 | vimax3 | VIMNMX3 | 3-way vector integer max |
| 21 | 5336 | vimin3 | VIMNMX3 | 3-way vector integer min |
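Assuming the '_'-joined naming convention used by the OCG name parser, the VIMNMX family above expands to 48 candidate intrinsic names. This enumeration is an inference from the table and the shared type set, not a list of extracted strings:

```python
# Hypothetical surface-name expansion for the six vector integer operations.
# The prefix and sub-op spellings come from the recovered table; the idea
# that every (op, type, relu) combination exists is an assumption.
PREFIX = "__nv_ptx_builtin_ocg_"
OPS = ["viaddmax", "viaddmin", "vimax", "vimin", "vimax3", "vimin3"]
TYPES = ["s32", "u32", "s16x2", "u16x2"]

def expand(op: str):
    for ty in TYPES:
        yield f"{PREFIX}{op}_{ty}"
        yield f"{PREFIX}{op}_{ty}_relu"   # optional relu modifier

names = [n for op in OPS for n in expand(op)]
# 6 ops x 4 types x (plain / relu) -> 48 candidate names
```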
Packed Float Operations (f16x2 arithmetic)
The three packed operations (fadd2, ffma2, fmul2) share the same modifier set: ftz (flush-to-zero) and rounding modes rn, rm, rp, rz. The 3-way fmax3/fmin3 operations listed with them take ftz and nan modifiers instead.
| Slot | Offset | OCG Name | SASS Equivalent | Description |
|---|---|---|---|---|
| 25 | 6328 | fadd2 | HADD2 / FADD.PACKED | packed f16 addition |
| 26 | 6576 | ffma2 | HFMA2 / FFMA.PACKED | packed f16 fused multiply-add |
| 27 | 6824 | fmul2 | HMUL2 / FMUL.PACKED | packed f16 multiplication |
| 29 | 7320 | fmax3 | FMNMX3 | 3-way float max (ftz, nan modifiers) |
| 30 | 7568 | fmin3 | FMNMX3 | 3-way float min (ftz, nan modifiers) |
Async Copy and TMA Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 1 | 376 | cp_async_commit | mem, bulk, shared, global | LDGDEPBAR |
| 2 | 624 | cp_async_wait | mem, bulk, shared, global, read, write | DEPBAR |
| 10 | 2608 | cp_async_bulk | mbarrier, counted, shared, global, multicast, sequenced, bytemask | UBLKCP |
| 11 | 2856 | cp_red_async_bulk | mbarrier, counted, shared, global; types: u32/s32/u64/s64/f16/f32/f32ftz/f64/bf16; ops: add/min/max/inc/dec/and/or/xor | UBLKCP.RED |
| 12 | 3104 | cp_async_tensor | mbarrier, shared, global, 1d/2d/3d/4d/5d, im2col, multicast | UTMAKCP |
| 13 | 3352 | cp_async_prefetch_tensor | global, 1d/2d/3d/4d/5d, im2col | UTMAPF |
Note: The SASS mnemonics UBLKCP and UTMAKCP do not appear as strings in the ptxas binary. These are SASS assembler-level names visible only in cuobjdump output; the OCG names (cp_async_bulk, cp_async_tensor) are the canonical internal form.
Load, Store, and Cache Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 3 | 872 | cache | tensor, pf (prefetch), iv (invalidate), ivall (invalidate all) | CCTL / PREFETCH |
| 4 | 1120 | ld_mc | ops: add/min/max/f32add/and/or/xor; types: f16x2/f16x4/f16x8/bf16x2/bf16x4/bf16x8/f32/f32x2/f32x4/f64/u32/s32/s64/u64 | LDG.MC |
| 5 | 1368 | ldc | u32, u64 | LDC |
| 6 | 1616 | s2r | (none -- register 0-255) | S2R |
| 22 | 5584 | write_async | release; shared/global; gpu/sys/mmio; v2/v4; u8/s8/u16/s16/b32/b64/u32/f64 | STG.ASYNC |
| 23 | 5832 | cctl_c | ldc/ldcu, shallow/deep, iv/ivall | CCTL |
Async Reduction and Fence Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 9 | 2360 | red_async | release; shared/global; gpu/sys/mmio; v2/v4; u32/s32/u64; add/min/max/inc/dec/and/or/xor | RED.ASYNC |
| 14 | 3600 | fence_view_async | all, global, shared, dshared, tensor | FENCE.VIEW.ASYNC |
Tensor Core Operations (Blackwell TC family)
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 31 | 7816 | tcbar | cta1/cta2, a1t0/a0tx, flush, multicast, b32 | TCBAR |
| 32 | 7880 | mmareadshma | (none) | LDSM variant |
| 33 | 8064 | tccp | 128dp256bit/4dp256bit/128dp128bit/2x64dp128bitlw02lw13/2x64dp128bitlw01lw23/4x32dp128bit/u4x16p64/u6x16p32; cta1/cta2; b32/b64 | TCCP |
| 34 | 8312 | tcmma | gdesc/tmem; h/i/q/o/mxq; cta1/cta2; ashift/scale/lutb; areuse/akeep/breuse/bkeep; ws; buffer0-3; 2x/4x/blockscale/impl; b32/b64/u32 | TCMMA |
| 35 | 8560 | tcshift | cta1/cta2, b32 | TCSHIFT |
| 37 | 9056 | tcatomsws | and/or/findandset/align/cas; cta1/cta2; b32/b64 | TCATOM.SWS |
| 38 | 9304 | tcldsws | cta1/cta2 | TCLD.SWS |
| 39 | 9552 | tcstsws | cta1/cta2; b32/b64 | TCST.SWS |
The tcmma operation at slot 34 is the primary Blackwell MMA instruction, successor to HMMA/IMMA/DMMA. Its sub-operations encode:
- Descriptor mode: gdesc (global descriptor via UR), tmem (tensor memory direct)
- Input formats: h (half/f16), i (integer), q (quarter/fp8), o (output descriptor), mxq (MX-format quarter for microscaled block-scaling)
- Operand reuse: areuse/akeep (A matrix), breuse/bkeep (B matrix) -- register reuse hints
- Warp-shared: ws -- warp-shared execution across 2 warps
- Block scaling: blockscale with 2x/4x multipliers and impl (implementation-defined) -- FP4/FP6 microscaled format support
- Buffers: buffer0--buffer3 -- double/quad buffering for pipelined execution
The SWS (Software Scoreboard) operations (tcatomsws, tcldsws, tcstsws) are a Blackwell synchronization mechanism for tensor core pipelines that replaces hardware scoreboards with software-managed tracking.
Tensor Memory Load/Store (Blackwell native)
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 42 | 10296 | ldtm | formats: 16dp128bit/16dp256bit/32dp32bit/16dp64bit/16dp32bitt0t15/16dp32bitt16t31/16dp32bit; scale: x1-x128; pack16bit; fused/stat; statistics: nan/max/maxabs/min/minabs; types: u32/s32/f32/b32; sparsity: sparsify/u2/spfactor2to4 | LDTM |
| 43 | 10544 | sttm | formats: (same 7 as ldtm); scale: x1-x128; expand16bit; fused; b32 | STTM |
The ldtm/sttm format strings encode the tensor memory data layout:
- 16dp128bit -- 16 data-points, 128-bit total (e.g., 16x fp8)
- 16dp256bit -- 16 data-points, 256-bit total (e.g., 16x fp16)
- 32dp32bit -- 32 data-points, 32-bit total (e.g., 32x 1-bit)
- 16dp32bitt0t15 / 16dp32bitt16t31 -- 16 data-points in thread groups 0--15 / 16--31
- Scale factors x1 through x128 control the number of consecutive elements loaded
- sparsify and spfactor2to4 enable structured 2:4 sparsity metadata generation
- stat with nan/max/maxabs/min/minabs enables online statistics collection during load
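A decoder for these format strings, assuming the grammar `<N>dp<M>bit[t<A>t<B>]` inferred from the recovered names; the function and return shape are ours:

```python
import re

# Illustrative decoder for the ldtm/sttm format strings listed above.
FMT = re.compile(r"^(\d+)dp(\d+)bit(?:t(\d+)t(\d+))?$")

def decode_format(s: str) -> dict:
    m = FMT.match(s)
    if not m:
        raise ValueError(f"not an ldtm/sttm format: {s}")
    dp, bits, lo, hi = m.groups()
    # thread-group range is present only in the t0t15 / t16t31 variants
    threads = (int(lo), int(hi)) if lo is not None else None
    return {"datapoints": int(dp), "bits": int(bits), "threads": threads}
```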
Synchronization and Control
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 7 | 1864 | acqblk | (none) | barrier acquire block |
| 8 | 2112 | preexit | (none) | EXIT.KEEPREFCOUNT |
| 24 | 6080 | getnextworkid | selfcast, broadcast | work distribution primitive |
| 36 | 8808 | virtcount | u32 | virtual warp counter |
| 40 | 9800 | memclear | b32, b64 | MEMCLEAR |
| 41 | 10048 | acqshminit | (none) | shared memory init barrier |
Category Summary
| Category | Count | Operations |
|---|---|---|
| Arithmetic / ALU | 3 | add, mnmx, viadd |
| Packed float | 5 | fadd2, ffma2, fmul2, fmax3, fmin3 |
| Vector integer | 6 | viaddmax, viaddmin, vimax, vimin, vimax3, vimin3 |
| Async copy / TMA | 6 | cp_async_commit, cp_async_wait, cp_async_bulk, cp_red_async_bulk, cp_async_tensor, cp_async_prefetch_tensor |
| Load / store / cache | 6 | ld_mc, ldc, s2r, write_async, cctl_c, cache |
| Async reduction / fence | 2 | red_async, fence_view_async |
| Tensor core (TC) | 8 | tcbar, mmareadshma, tccp, tcmma, tcshift, tcatomsws, tcldsws, tcstsws |
| Tensor memory (TM) | 2 | ldtm, sttm |
| Sync / control | 6 | acqblk, preexit, getnextworkid, virtcount, memclear, acqshminit |
| Total | 44 | |
Handler Functions
The OCG handler cluster at 0x6C0000--0x6CC000 contains ~25--30 specialized handler/validator functions. Each validates parameters, types, sub-operations, and memory domains before delegating to the SASS encoding engine.
| Address | Size | Handler | Confidence |
|---|---|---|---|
| sub_6C0D90 | 19KB | Atomic reduction (atom.add/min/max/cas, scope, memory order, vector width) | 90% |
| sub_6C1CF0 | 16KB | Mbarrier (arrive, wait, test, counted, bytemask variants) | 88% |
| sub_6C2AE0 | 10KB | cp.async (basic async copy) | 85% |
| sub_6C3470 | 20KB | cp.async.bulk (bulk async copy with type validation) | 85% |
| sub_6C46B0 | -- | cp.red.async.bulk (bulk async reduction) | 85% |
| sub_6C4DA0 | 15KB | Load/store (scope, memory order, domain validation) | 85% |
| sub_6C5A40 | 8KB | Cache control (CCTL: shallow/deep, iv/ivall, ldc/ldcu) | 85% |
| sub_6C60B0 | 7KB | Distributed shared memory (selfcast/broadcast) | 80% |
| sub_6C8100 | 9KB | cp.async.tensor / TMA (1--5D, multicast, tile/im2col) | 85% |
| sub_6C9BC0 | -- | Name resolver (operation name -> internal enum) | 80% |
| sub_6CC690 | 22KB | Router (dispatches to type-specific handlers via vtable) | 80% |
OCG Validation Strings
The OCG handlers share a consistent validation pattern. Notable error messages are listed below (NVIDIA consistently misspells "intrinsic" as "instrinsic" throughout the codebase):
| Error String | Handler | Meaning |
|---|---|---|
| "Op {add, min, max, inc, dec, and, or, xor} not specified" | Atomic | Missing reduction operation |
| "Domain param '_shared' or '_global' required" | Atomic/LS | No memory domain specified |
| "Unsupported non _add global memory reduction" | Atomic | Only add supported for global reductions |
| "Deprecated scope without memory order semantics" | Memory order | Legacy scope usage |
| "Required scope with memory order semantics" | Memory order | Missing scope on memory-ordered op |
| "byte mask not allowed with counted" | Mbarrier | Conflicting mbarrier modifiers |
| "Exactly one of the 'shallow' or 'deep' modifiers must be used." | CCTL | Missing cache depth modifier |
| "Cannot use both the selfcast and the broadcast modifier." | Dshmem | Conflicting multicast mode |
| "Unexpected instrinsic name (%s)" | Name resolver | Unknown OCG operation name |
| "Unexpected instrinsic subop (%s)" | Name resolver | Unknown sub-operation |
| "Unexpected instrinsic type (%s) instead of (%s) in param (%d)" | Type validator | Parameter type mismatch |
| "LDC requires a constant/immediate bank number" | LDC/S2R | Missing constant bank operand |
| "S2R register must be between 0 and 255 inclusive" | LDC/S2R | System register out of range |
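The checks behind two of these messages can be restated as a sketch. The real handlers are decompiled C++; the function names here are ours, and the error texts are the recovered strings:

```python
# Modifier checks implied by the CCTL and distributed-shared-memory error
# strings above: exactly-one-of for shallow/deep, at-most-one-of for
# selfcast/broadcast.
def check_cctl_c(mods: set) -> None:
    if len(mods & {"shallow", "deep"}) != 1:
        raise ValueError(
            "Exactly one of the 'shallow' or 'deep' modifiers must be used.")

def check_dshmem(mods: set) -> None:
    if {"selfcast", "broadcast"} <= mods:
        raise ValueError(
            "Cannot use both the selfcast and the broadcast modifier.")
```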
OCG SASS-Level Handlers
Separate from the validation layer, the SASS encoding zone at 0x6D0000--0x6E0000 contains MMA-specific handlers that operate during final instruction encoding:
| Address | Size | Handler | Confidence |
|---|---|---|---|
| sub_6D4350 | 30KB | MMA intrinsic lowering (HMMA, IMMA, DMMA) | 90% |
| sub_6D5CB0 | 16KB | MMA operand encoder (matrix fragments, accumulator registers) | 80% |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (SM100 5th-gen tensor core encoding) | 90% |
| sub_6D69B0 | 12KB | TCGen05 MMA validator (parameter validation only) | 80% |
Notable validation strings from the tcgen05 MMA handler:
- "fused and l16dp32bit must be specified together"
- "Inputs vector length is inconsistent with layout and num modifiers"
OCG Intrinsic Lowering Pipeline -- sub_6A97B0 + sub_6CC690
The full end-to-end flow that takes a PTX call.uni __nv_ptx_builtin_ocg_* intrinsic and produces a binary SASS instruction passes through five stages. Three are data-structure manipulation (matching, cleanup), two are instruction encoding (operand assembly, SASS emission).
sub_6B5F30 (intrinsic lowering driver)
|
├─ sub_6B40C0 ── pre-processing
|
├─ sub_6A97B0 (LowerIntrinsicOp, 26KB) ──────────────────────────────┐
| │ |
| │ Phase 1: SASS instruction matching |
| │ For each intrinsic call node in linked list [a1+32..a1+40): |
| │ Walk operand tree at node+288 |
| │ For each leaf: read instruction ID at leaf+24 |
| │ Search RB-tree at context+760 for matching SASS defn |
| │ On match: store ptr at node+464, back-link at SASS+440 |
| │ |
| │ Phase 2: Unmatched node garbage collection |
| │ If node+464 == 0 (no SASS match): |
| │ Walk use-def chain at node+40..48 |
| │ Delete matching RB-tree entries (full rebalance via |
| │ sub_6A92E0) |
| │ Unlink node from work list |
| │ Release internal resources (operands, types) |
| │ Return node to free pool at a1+80 |
| │ |
| │ Phase 3: Secondary cleanup (re-scan remaining nodes) |
| │ Nodes with SASS match but no definition link: |
| │ Clear back-pointer, clean up, recycle to free pool |
| │ |
| │ Key data: context+760 = RB-tree root (SASS instruction defs) |
| │ context+768/776 = min/max pointers |
| │ context+784 = tree node count |
| │ context+792 = tree free list |
| |
├─ (post-processing: sub_693D00 per remaining node) ─────────────────┘
|
v
sub_6D9690 (master SASS encoder, 94KB)
|
├─ sub_6D9290 (OCG vtable entry point) ────────────────────────────────┐
| │ |
| │ 1. Extract intrinsic name from IR node |
| │ 2. Call sub_6C9BC0(this+120, name) ── ParseOCGBuiltinName |
| │ Strips "__nv_ptx_builtin_ocg_" prefix |
| │ Iterates 43 operation slots (248B each) in OCG table |
| │ Matches operation name, then parses '_'-delimited sub-ops |
| │ Output: this+10688 = operation enum (0..42) |
| │ this+10704 = int[] of sub-op indices |
| │ this+10712 = sub-op count |
| │ 3. Fall through to sub_6D8B20 for secondary dispatch |
| |
├─ sub_6CC690 (OCGRouter, 22KB) ──────────────────────────────────────┘
| │
| │ Input: (self, instruction_node, sass_descriptor)
| │
| │ 1. Read SASS opcode from descriptor+8
| │ 2. Read target profile from context+1584
| │ Key profile fields:
| │ +503 = operand decode flag
| │ +1012 = target SM enum
| │ +1020 = extended address mode
| │ +1021 = barrier mode
| │ +1041 = memory order capabilities bitmask
| │
| │ 3. Vtable dispatch (off_202CF48):
| │ vtable[2] = OpcodeValidator (default: sub_6BC1D0)
| │ vtable[24] = ScopeValidator (default: sub_6BCE50)
| │ vtable[25] = MemOrderValidator (default: sub_6BBEC0)
| │ Each is compared by address; if overridden, calls the
| │ custom validator; if default, uses inline fast-path.
| │
| │ 4. Opcode-range dispatch (descriptor+8):
| │ 178..189: Memory ops (ld_mc, st) -> SASS enum 243/245/247
| │ 416..420, 434: Reduction/atomic -> SASS enum 243/246/261
| │ 445..448: Barrier/fence -> memory op path
| │ 467: cp.async.tensor/special -> SASS enum 70 or 257
| │ default: zero-init modifiers, use raw descriptor
| │
| │ 5. Operand assembly into v134[] (312-byte buffer):
| │ sub_6CAFD0: decode src/dst registers -> v134[8..10]
| │ sub_6CAE80: encode uniform operands -> v134[16]
| │ sub_6CAF50: encode scope/mem-order -> v134[13]
| │ sub_6CBA50: encode barrier level -> v134[26..28]
| │
| │ 6. Build control words:
| │ v134[26] = 0x60000000 | modifier_bits
| │ v134[27] = 0x60000000 | ordering | barrier | write_mask
| │ v134[28] = 0x60000000 | scope_flags | 0x81000
| │
| v
sub_6CB8A0 (EmitSASS)
│
│ Input: (self, sass_opcode_enum, instr_node, v134[], flags...)
│ 1. Read SM version from profile+372 (>> 12)
│ 2. sub_6CB4B0: final operand validation
│ 3. sub_C3F490(opcode, ...): look up SASS encoding template
│ 4. Encode instruction word from template + v134[] operands
│ 5. sub_9253C0: commit encoded instruction to output
Internal SASS opcode enum values assigned by the router (not binary SASS opcodes -- these are routing keys that sub_C3F490 maps to encoding templates):
| Enum | Hex | Meaning |
|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) |
| 243 | 0xF3 | Default memory operation |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) |
| 246 | 0xF6 | Reduction/atomic default |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) |
| 257 | 0x101 | Async copy without memory order |
| 261 | 0x105 | Atomic with pre-existing value read |
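The router's opcode-range dispatch (step 4 in the diagram above) can be restated as a lookup. The ranges and targets come from the decompiled sub_6CC690; the category labels and function name are ours:

```python
# Opcode-range dispatch performed by the OCG router (sub_6CC690) on the
# SASS descriptor opcode at descriptor+8.
def route_opcode(desc_opcode: int) -> str:
    if 178 <= desc_opcode <= 189:
        return "memory"          # ld_mc, st -> SASS enum 243/245/247
    if desc_opcode in (416, 417, 418, 419, 420, 434):
        return "reduction"       # reduction/atomic -> SASS enum 243/246/261
    if 445 <= desc_opcode <= 448:
        return "barrier_fence"   # barrier/fence -> memory op path
    if desc_opcode == 467:
        return "async_tensor"    # cp.async.tensor -> SASS enum 70 or 257
    return "default"             # zero-init modifiers, use raw descriptor
```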
Operand buffer layout (v134[], 39 QWORDs passed to sub_6CB8A0):
| Slot | Content |
|---|---|
| 0--3 | Reserved (zero) |
| 4 | Barrier register (0x90000000 | reg) |
| 5--7 | Extra source operands (from instruction node) |
| 8--10 | Primary operands (from sub_6CAFD0 decode) |
| 11 | Secondary operand (LDC, conditional loads) |
| 12 | Predicate thread operand |
| 13 | Scope / memory-order (from sub_6CAF50) |
| 14 | Cache mode operand |
| 15 | Memory fence operand |
| 16 | Uniform / extended operand (from sub_6CAE80) |
| 17 | Memory ordering constant / barrier tracking |
| 19--21 | Source address (bulk/tensor ops) |
| 22--24 | Destination address (bulk/tensor ops) |
| 25 | Extra predicate (opcode 187 only) |
| 26 | Control word 0: 0x60000000 | modifier_bits |
| 27 | Control word 1: 0x60000000 | ordering | flags |
| 28 | Control word 2: 0x60000000 | scope | 0x81000 |
| 29 | Write mask operand (conditional) |
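The control-word construction for slots 26--28 can be restated as a sketch. The 0x60000000 tag and the 0x81000 constant are recovered values; the function and parameter names are descriptive, not recovered symbols:

```python
# Control-word construction for v134[26..28], as recovered from the OCG
# router (step 6 of sub_6CC690). The real code builds these inline.
CW_TAG = 0x60000000

def build_control_words(modifier_bits: int, ordering: int, barrier: int,
                        write_mask: int, scope_flags: int):
    cw0 = CW_TAG | modifier_bits                      # v134[26]
    cw1 = CW_TAG | ordering | barrier | write_mask    # v134[27]
    cw2 = CW_TAG | scope_flags | 0x81000              # v134[28]
    return cw0, cw1, cw2
```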
OCG Lookup Flow
PTX source: call.uni __nv_ptx_builtin_ocg_tcmma, (%args...);
|
v
sub_6A97B0 (LowerIntrinsicOp, 26KB)
Matches call node to SASS instruction via RB-tree at ctx+760
Garbage-collects unmatched nodes
|
v
sub_6D9290 -> sub_6C9BC0 (ParseOCGBuiltinName)
Strips "__nv_ptx_builtin_ocg_" prefix
Parses op name + sub-ops from 43-slot table (sub_6C9EB0)
|
v
sub_6CC690 (OCGRouter, 22KB)
Vtable dispatch: validates opcode, scope, memory order
Decodes operands into 312-byte buffer via sub_6CAFD0 cluster
Builds control words (0x60000000 | modifier_bits)
|
v
sub_6CB8A0 (EmitSASS)
Looks up encoding template via sub_C3F490(sass_opcode_enum)
Encodes instruction word, commits to output via sub_9253C0
See OCG Intrinsic Lowering Pipeline for the full five-stage breakdown with operand buffer layout and internal SASS opcode enum values.
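A minimal sketch of the name-parsing step, assuming the prefix-strip and '_'-delimited sub-op convention described above. The two-entry operation table and the longest-match rule are illustrative (operation names themselves contain underscores, so some disambiguation rule is required):

```python
# Sketch of ParseOCGBuiltinName (sub_6C9BC0): strip the prefix, match an
# operation name from the table, then split the remainder into sub-ops.
PREFIX = "__nv_ptx_builtin_ocg_"
OPS = {"cp_async_bulk": 10, "tcmma": 34}  # name -> slot (tiny subset)

def parse_name(sym: str):
    if not sym.startswith(PREFIX):
        raise ValueError(f"Unexpected instrinsic name ({sym})")  # sic
    rest = sym[len(PREFIX):]
    # longest-match first, so cp_async_bulk beats a hypothetical cp_async
    for op in sorted(OPS, key=len, reverse=True):
        if rest == op or rest.startswith(op + "_"):
            subops = rest[len(op):].lstrip("_")
            return OPS[op], subops.split("_") if subops else []
    raise ValueError(f"Unexpected instrinsic name ({sym})")
```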
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_6C9EB0 | 13KB | OCG intrinsic table init (__nv_ptx_builtin_ocg_*) | 95% |
| sub_6CC690 | 22KB | OCG router -- vtable-dispatched operand assembly and SASS emission | 90% |
| sub_6C9BC0 | -- | OCG name parser -- decomposes __nv_ptx_builtin_ocg_X_Y_Z into enum + sub-op array | 95% |
| sub_6C0D90 | 19KB | OCG atomic/reduction handler | 90% |
| sub_6C1CF0 | 16KB | OCG mbarrier handler | 88% |
| sub_6C3470 | 20KB | OCG cp.async.bulk handler | 85% |
| sub_6C4DA0 | 15KB | OCG load/store handler | 85% |
| sub_6C5A40 | 8KB | OCG cache control handler | 85% |
| sub_6C60B0 | 7KB | OCG distributed shared memory handler | 80% |
| sub_6C8100 | 9KB | OCG cp.async.tensor / TMA handler | 85% |
| sub_6D4350 | 30KB | MMA intrinsic lowering (SASS encoding) | 90% |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (SASS encoding) | 90% |
| sub_6D5CB0 | 16KB | MMA operand encoder | 80% |
| sub_6D69B0 | 12KB | TCGen05 MMA validator | 80% |
| sub_6D9290 | -- | OCG vtable entry point (calls sub_6C9BC0 then sub_6D8B20) | 85% |
| sub_6CB8A0 | -- | SASS instruction emitter (template lookup via sub_C3F490) | 80% |
| sub_6CAFD0 | -- | Operand decoder (registers into v134[] slots) | 85% |
| sub_6CAE80 | -- | Uniform operand encoder | 85% |
| sub_6CAF50 | -- | Scope / memory-order encoder | 85% |
| sub_6CBA50 | -- | Barrier-level encoder | 85% |
| sub_6CB4B0 | -- | Operand validator (called by sub_6CB8A0) | 80% |
| sub_6A97B0 | 26KB | LowerIntrinsicOp -- SASS matching and unmatched-node GC | 90% |
| sub_6B5F30 | -- | Intrinsic lowering driver (calls sub_6B40C0 then sub_6A97B0) | 90% |
| sub_6A92E0 | -- | RB-tree fixup (rotation/recolor after deletion) | 90% |
| sub_6BC1D0 | -- | Default opcode validator (vtable[2] of off_202CF48) | 90% |
| sub_6BCE50 | -- | Default scope validator (vtable[24]) | 90% |
| sub_6BBEC0 | -- | Default memory-order validator (vtable[25]) | 90% |
Diagnostic Strings
| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |
Cross-References
- Intrinsic Table Architecture -- Classical 607-entry intrinsic system and body template tables
- Math Intrinsics -- IEEE math software emulation (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 lowering
- Sync & Warp Intrinsics -- Barriers, vote, shuffle, match, redux
- SM Architecture Map -- Per-SM capability dispatch tables
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
- Mercury Encoder -- Master SASS encoder sub_6D9690 (94KB)
- SASS Instruction Encoding -- Instruction encoding infrastructure
- ISel Pattern Matching -- Internal SASS routing values
- Pipeline Overview -- OCG-time measurement covers intrinsic lowering
Math Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas provides built-in IEEE-compliant math functions for operations that have no single-instruction hardware implementation. When a PTX instruction like div.rn.f64, rcp.rn.f32, sqrt.rn.f64, or rsqrt.approx.f64 is encountered, ptxas either emits a single MUFU (Multi-Function Unit) instruction for approximate results, or generates a multi-instruction SASS sequence using Newton-Raphson refinement for IEEE-compliant precision. For operations too complex to inline, ptxas emits a call to one of 70 registered __cuda_sm20_* helper functions.
| Intrinsic ID range | 0x3D--0x86 (70 math entries + 4 sm_3x division variants) |
| Math codegen handlers | 9 functions: div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh |
| Newton-Raphson templates | 4 top-level handlers at 0x1700000--0x172A090 (~180 KB) |
| MUFU internal opcode | 0x3C (60 decimal), Ori mnemonic ZHSH |
| MUFU Mercury major | 0x58, minor sub-function encoded in operand fields |
| SFU functional unit | Index 8 in the latency model (RCP, RSQ, SIN, COS, EX2, LG2) |
| MUFU encoding variants | 14 (reg/reg, reg/pred, reg/ureg, bar operands) |
MUFU -- Multi-Function Unit
The MUFU instruction is a single-cycle-issue instruction that computes transcendental approximations on the SFU (Special Function Unit). Each SM has a dedicated SFU pipe that executes MUFU operations independently of the ALU pipes.
Sub-Function Table
The MUFU sub-function is encoded in the instruction's modifier field (not a separate operand). The following sub-functions are available, with minimum SM requirements noted per row:
| Sub-Function | Operation | Input | Output | Precision |
|---|---|---|---|---|
| MUFU.COS | cos(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.SIN | sin(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.EX2 | 2^x | FP32 | FP32 | ~22 bits mantissa |
| MUFU.LG2 | log2(x) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.RCP | 1/x | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RSQ | 1/sqrt(x) | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RCP64H | 1/x (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.RSQ64H | 1/sqrt(x) (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.TANH | tanh(x) | FP32 | FP32 | ~22 bits, sm_75+ |
MUFU.RCP and MUFU.RSQ produce results accurate to approximately 1 ULP of the true FP32 value (23 mantissa bits). The trigonometric and exponential sub-functions (SIN, COS, EX2, LG2) are slightly less precise at approximately 22 bits. MUFU.TANH was added in Turing (sm_75).
MUFU in the Ori IR
In the ptxas internal representation, MUFU uses Ori opcode 0x3C (decimal 60) with the mnemonic ZHSH. During instruction selection, PTX operations like sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32 are each lowered to a single ZHSH (MUFU) instruction with the appropriate sub-function selector.
The lowering pass responsible for MUFU emission is at sub_80E9B0 (LowerSpecialFunctions), called from the master lowering dispatcher sub_8380A0. It converts Ori-level special function opcodes into MUFU SASS instructions with appropriate sub-function encoding.
MUFU Encoding (sm_100+)
In the Mercury/Blackwell encoding, MUFU is major opcode 0x58 with a single variant at the basic encoding level (sub_10C0170). The encoding signature:
| Field | Value |
|---|---|
| Major opcode | 0x58 |
| f19 | 0xB |
| Format | 1 (single-operand class) |
| Operand count | 1 (destination implicit, source = register) |
| Encoding function | sub_10C0170 |
The variant table at 0xF7CEB0--0xF80760 defines 14 encoding patterns for MUFU, supporting combinations of:
- reg2, reg2 -- standard register source and destination
- reg2, pred3 -- predicated source
- reg2, reg10 -- extended register class
- reg2, ureg4 -- uniform register source (sm_100+ addition)
- reg2, bar6 -- barrier operand (scheduling)
Uniform register support (ureg4) in MUFU is a Blackwell-specific addition, allowing MUFU to consume values directly from the uniform register file without a prior UMOV to a general-purpose register.
Pre-Assignment Constraints
The register allocator applies pre-assignment constraints for MUFU at sub_93F000. MUFU (internal opcode 22 in the constraint check, mapped from Ori opcode 0x3C) requires its operands in specific register classes. The constraint handler calls sub_93E9D0 with constraint type 1 (early) for MUFU operands.
Precision Levels
ptxas implements two distinct precision tiers for every math operation, selected by PTX instruction modifiers:
Approximate (.approx)
A single MUFU instruction. This is the default for sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32. The MUFU hardware provides approximately 22--23 bits of mantissa precision in a single instruction dispatch on the SFU pipe.
PTX to SASS mapping (approximate):
| PTX Instruction | SASS | Latency |
|---|---|---|
| sin.approx.f32 | MUFU.SIN | SFU pipe, ~4 cycles |
| cos.approx.f32 | MUFU.COS | SFU pipe, ~4 cycles |
| ex2.approx.f32 | MUFU.EX2 | SFU pipe, ~4 cycles |
| lg2.approx.f32 | MUFU.LG2 | SFU pipe, ~4 cycles |
| rcp.approx.f32 | MUFU.RCP | SFU pipe, ~4 cycles |
| rsqrt.approx.f32 | MUFU.RSQ | SFU pipe, ~4 cycles |
| tanh.approx.f32 | MUFU.TANH | SFU pipe, ~4 cycles (sm_75+) |
IEEE-Compliant (.rn, .rd, .ru, .rz)
Multi-instruction sequences that use MUFU as a seed and refine with Newton-Raphson iterations using FMA instructions. These produce results that are correctly rounded to the specified IEEE 754 rounding mode (round-to-nearest-even, round-down, round-up, round-toward-zero). The instruction count ranges from ~15 for FP32 operations to ~120 for FP64 operations.
The IEEE-compliant paths are implemented in two ways:
- Inline templates -- Multi-instruction SASS sequences emitted directly at the call site by the Newton-Raphson template subsystem (0x1700000--0x172A090). Used for FP64 division, reciprocal, sqrt, and rsqrt.
- Callable helpers -- Calls to __cuda_sm20_* functions whose bodies are pre-compiled PTX routines linked from libdevice. Used for FP32 operations with directed rounding modes and all slowpath variants.
PTX Math Codegen Handlers
When sub_5D4190 builds the opcode dispatch table, it registers 9 math-related PTX instruction names to codegen handler functions. Each handler allocates a 50,000-byte temporary buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.
| PTX Opcode | Handler | Size | Description |
|---|---|---|---|
div.full | sub_573860 | ~7 KB | FP64 full-precision division (calls Newton-Raphson template) |
div | sub_5B76D0 | 64 KB | General division: dispatches by type (s16/u16/s64/u64/f32/f64) and rounding mode |
rem | sub_589810 | ~13 KB | Integer remainder (s16/u16/s64/u64) |
rcp | sub_5B0CD0 | 44 KB | Reciprocal: dispatches by type (f32/f64) and rounding mode |
rsqrt | sub_57BFC0 | ~10 KB | Reciprocal square root |
sqrt | sub_5B4040 | 49 KB | Square root: dispatches by type (f32/f64) and rounding mode |
ex2 | sub_583190 | ~14 KB | Base-2 exponential |
lg2 | sub_52A5C0 | ~5 KB | Base-2 logarithm |
tanh | sub_505B00 | ~5 KB | Hyperbolic tangent |
Handler Dispatch Logic
All math codegen handlers follow the same structural pattern. The div handler (sub_5B76D0, 1,466 lines decompiled, 64 KB) is the largest because it covers the most type/rounding/precision combinations. The handler:
- Allocates a 50,000-byte output buffer via `sub_424070`
- Queries the operand type via `sub_70CA60(*(a1+1096), 0)`:
  - Type 58 = FP32 (`f32`)
  - Type 59 = FP64 (`f64`)
  - Type 54 = signed 16-bit (`s16`)
  - Type 56 = unsigned 16-bit / other integer
- Queries rounding mode, FTZ flag, and precision modifier via additional accessors (`sub_707BC0`, `sub_70B820`, `sub_70B8E0`, `sub_70B710`)
- Selects the appropriate intrinsic function name and emits a PTX call via `sprintf()` into the output buffer
- Copies the result to a final allocation and frees the temporary buffer
For the tanh handler (sub_505B00, 121 lines), the dispatch is simpler:
- Type 56: emits a short-form call to a hardware-supported path
- Type 54: emits a multi-operand call querying register info via `sub_70B8E0` and `sub_70B710`
- Default: emits a generic call with 6 operand parameters
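The shared handler structure can be sketched as follows. The dispatch keys (58/59/54/56) are the recovered type codes; the emitted call text and function shape are illustrative -- the real handlers build the string in C with sequential `sprintf()` calls into the 50,000-byte buffer.

```python
# Hypothetical sketch of the common codegen-handler pattern: query the operand
# type code, choose an intrinsic name, and append PTX text to a growing buffer.
TYPE_F32, TYPE_F64, TYPE_S, TYPE_U = 58, 59, 54, 56

def emit_div_call(type_code, rounding, dst, a, b):
    buf = []                                        # stands in for the 50 KB buffer
    if type_code == TYPE_F32:
        name = "__cuda_sm20_div_%s_f32" % rounding
    elif type_code == TYPE_F64:
        name = "__cuda_sm20_div_%s_f64_full" % rounding
    elif type_code == TYPE_S:
        name = "__cuda_sm20_div_s16"
    else:
        name = "__cuda_sm20_div_u16"
    buf.append("call.uni (%s), %s, (%s, %s);" % (dst, name, a, b))
    return "\n".join(buf)
```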
Type Codes in Math Handlers
The operand type returned by sub_70CA60(instr, 0) maps to PTX data types:
| Code | PTX Type | Used By |
|---|---|---|
| 54 | .s16 / .s32 | Integer div, rem |
| 56 | .u16 / .u32 | Integer div, rem, tanh variant |
| 58 | .f32 | Float div, rcp, sqrt, rsqrt, ex2, lg2, tanh |
| 59 | .f64 | Double div, rcp, sqrt, rsqrt |
Registered Math Intrinsics (IDs 0x3D--0x86)
The master registration function sub_5D1660 registers 70 math helper functions with IDs 0x3D through 0x82, plus 4 sm_3x-optimized division variants at 0x83--0x86. These are the __cuda_sm20_* functions whose PTX prototypes are emitted by the prototype generator sub_5FF700.
Division Intrinsics (23 entries)
| ID | Name | Category |
|---|---|---|
0x41 | __cuda_sm20_div_rd_f32 | FP32 div, round-down |
0x42 | __cuda_sm20_div_rd_f64_v2 | FP64 div, round-down |
0x43 | __cuda_sm20_div_rd_ftz_f32 | FP32 div, round-down, flush-to-zero |
0x44 | __cuda_sm20_div_rn_f32 | FP32 div, round-to-nearest |
0x45 | __cuda_sm20_div_rn_f64_fast | FP64 div, round-to-nearest (fast path) |
0x46 | __cuda_sm20_div_rn_f64_full | FP64 div, round-to-nearest (full IEEE) |
0x47 | __cuda_sm20_div_rn_ftz_f32 | FP32 div, round-to-nearest, FTZ |
0x48 | __cuda_sm20_div_rn_ftz_f32_slowpath | FP32 div RN FTZ (denormal handler) |
0x49 | __cuda_sm20_div_rn_noftz_f32_slowpath | FP32 div RN no-FTZ (denormal handler) |
0x4A | __cuda_sm20_div_ru_f32 | FP32 div, round-up |
0x4B | __cuda_sm20_div_ru_f64_v2 | FP64 div, round-up |
0x4C | __cuda_sm20_div_ru_ftz_f32 | FP32 div, round-up, FTZ |
0x4D | __cuda_sm20_div_rz_f32 | FP32 div, round-toward-zero |
0x4E | __cuda_sm20_div_rz_f64_v2 | FP64 div, round-toward-zero |
0x4F | __cuda_sm20_div_rz_ftz_f32 | FP32 div, round-toward-zero, FTZ |
0x50 | __cuda_sm20_div_s16 | Signed 16-bit integer div |
0x51 | __cuda_sm20_div_s64 | Signed 64-bit integer div |
0x52 | __cuda_sm20_div_u16 | Unsigned 16-bit integer div |
0x53 | __cuda_sm20_div_u64 | Unsigned 64-bit integer div |
0x83 | __cuda_sm3x_div_rn_ftz_f32 | sm_30+ optimized FP32 div RN FTZ |
0x84 | __cuda_sm3x_div_rn_ftz_f32_slowpath | sm_30+ FP32 div RN FTZ slowpath |
0x85 | __cuda_sm3x_div_rn_noftz_f32 | sm_30+ optimized FP32 div RN |
0x86 | __cuda_sm3x_div_rn_noftz_f32_slowpath | sm_30+ FP32 div RN slowpath |
Reciprocal Intrinsics (21 entries)
| ID | Name | Category |
|---|---|---|
0x40 | __cuda_sm20_dblrcp_rn_slowpath_v3 | FP64 reciprocal slowpath |
0x5B | __cuda_sm20_rcp_f64_v3 | FP64 reciprocal (default rounding) |
0x5C | __cuda_sm20_rcp_rd_f32 | FP32 rcp, round-down |
0x5D | __cuda_sm20_rcp_rd_f32_slowpath | FP32 rcp RD slowpath |
0x5E | __cuda_sm20_rcp_rd_f64 | FP64 rcp, round-down |
0x5F | __cuda_sm20_rcp_rd_ftz_f32 | FP32 rcp RD FTZ |
0x60 | __cuda_sm20_rcp_rd_ftz_f32_slowpath | FP32 rcp RD FTZ slowpath |
0x61 | __cuda_sm20_rcp_rn_f32 | FP32 rcp, round-to-nearest |
0x62 | __cuda_sm20_rcp_rn_f32_slowpath | FP32 rcp RN slowpath |
0x63 | __cuda_sm20_rcp_rn_ftz_f32 | FP32 rcp RN FTZ |
0x64 | __cuda_sm20_rcp_rn_ftz_f32_slowpath | FP32 rcp RN FTZ slowpath |
0x65 | __cuda_sm20_rcp_ru_f32 | FP32 rcp, round-up |
0x66 | __cuda_sm20_rcp_ru_f32_slowpath | FP32 rcp RU slowpath |
0x67 | __cuda_sm20_rcp_ru_f64 | FP64 rcp, round-up |
0x68 | __cuda_sm20_rcp_ru_ftz_f32 | FP32 rcp RU FTZ |
0x69 | __cuda_sm20_rcp_ru_ftz_f32_slowpath | FP32 rcp RU FTZ slowpath |
0x6A | __cuda_sm20_rcp_rz_f32 | FP32 rcp, round-toward-zero |
0x6B | __cuda_sm20_rcp_rz_f32_slowpath | FP32 rcp RZ slowpath |
0x6C | __cuda_sm20_rcp_rz_f64 | FP64 rcp, round-toward-zero |
0x6D | __cuda_sm20_rcp_rz_ftz_f32 | FP32 rcp RZ FTZ |
0x6E | __cuda_sm20_rcp_rz_ftz_f32_slowpath | FP32 rcp RZ FTZ slowpath |
Square Root Intrinsics (21 entries)
| ID | Name | Category |
|---|---|---|
0x56 | __cuda_sm20_dsqrt_rd_f64 | FP64 sqrt, round-down |
0x57 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 | FP64 sqrt RN (medium-complexity path) |
0x58 | __cuda_sm20_dsqrt_rn_f64_v3 | FP64 sqrt, round-to-nearest |
0x59 | __cuda_sm20_dsqrt_ru_f64 | FP64 sqrt, round-up |
0x5A | __cuda_sm20_dsqrt_rz_f64 | FP64 sqrt, round-toward-zero |
0x73 | __cuda_sm20_sqrt_rd_f32 | FP32 sqrt, round-down |
0x74 | __cuda_sm20_sqrt_rd_f32_slowpath | FP32 sqrt RD slowpath |
0x75 | __cuda_sm20_sqrt_rd_ftz_f32 | FP32 sqrt RD FTZ |
0x76 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath | FP32 sqrt RD FTZ slowpath |
0x77 | __cuda_sm20_sqrt_rn_f32 | FP32 sqrt, round-to-nearest |
0x78 | __cuda_sm20_sqrt_rn_f32_slowpath | FP32 sqrt RN slowpath |
0x79 | __cuda_sm20_sqrt_rn_ftz_f32 | FP32 sqrt RN FTZ |
0x7A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath | FP32 sqrt RN FTZ slowpath |
0x7B | __cuda_sm20_sqrt_ru_f32 | FP32 sqrt, round-up |
0x7C | __cuda_sm20_sqrt_ru_f32_slowpath | FP32 sqrt RU slowpath |
0x7D | __cuda_sm20_sqrt_ru_ftz_f32 | FP32 sqrt RU FTZ |
0x7E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath | FP32 sqrt RU FTZ slowpath |
0x7F | __cuda_sm20_sqrt_rz_f32 | FP32 sqrt, round-toward-zero |
0x80 | __cuda_sm20_sqrt_rz_f32_slowpath | FP32 sqrt RZ slowpath |
0x81 | __cuda_sm20_sqrt_rz_ftz_f32 | FP32 sqrt RZ FTZ |
0x82 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath | FP32 sqrt RZ FTZ slowpath |
Reciprocal Square Root Intrinsics (2 entries)
| ID | Name | Category |
|---|---|---|
0x54 | __cuda_sm20_drsqrt_f64_slowpath_v2 | FP64 rsqrt slowpath |
0x55 | __cuda_sm20_drsqrt_f64_v2 | FP64 rsqrt default |
Remainder Intrinsics (4 entries)
| ID | Name | Category |
|---|---|---|
0x6F | __cuda_sm20_rem_s16 | Signed 16-bit remainder |
0x70 | __cuda_sm20_rem_s64 | Signed 64-bit remainder |
0x71 | __cuda_sm20_rem_u16 | Unsigned 16-bit remainder |
0x72 | __cuda_sm20_rem_u64 | Unsigned 64-bit remainder |
Bit-Field Intrinsics (3 entries)
| ID | Name | Category |
|---|---|---|
0x3D | __cuda_sm20_bfe_s64_ | 64-bit signed bit-field extract |
0x3E | __cuda_sm20_bfe_u64_ | 64-bit unsigned bit-field extract |
0x3F | __cuda_sm20_bfi_u64_ | 64-bit unsigned bit-field insert |
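The arithmetic these three helpers implement is plain bit manipulation. A simplified model follows (Python; it ignores the PTX clamping rules for out-of-range `pos`/`len`):

```python
MASK64 = (1 << 64) - 1

def bfe_u64(x, pos, length):
    """Unsigned 64-bit bit-field extract (simplified: pos/len assumed in range)."""
    if length == 0:
        return 0
    return (x >> pos) & ((1 << length) - 1)

def bfe_s64(x, pos, length):
    """Signed variant: sign-extend from the top bit of the extracted field."""
    v = bfe_u64(x, pos, length)
    if length and v & (1 << (length - 1)):
        v -= 1 << length
    return v

def bfi_u64(base, field, pos, length):
    """Insert the low `length` bits of `field` into `base` at bit `pos`."""
    mask = ((1 << length) - 1) << pos
    return (base & ~mask & MASK64) | ((field << pos) & mask)
```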
Naming Conventions
The intrinsic name encodes the complete variant specification:
__cuda_sm{gen}_{op}_{rounding}_{ftz}_{type}_{suffix}
| Component | Values | Meaning |
|---|---|---|
sm{gen} | sm20, sm3x | Minimum SM architecture |
{op} | div, rcp, sqrt, dsqrt, drsqrt, rem, bfe, bfi | Mathematical operation |
{rounding} | rn, rd, ru, rz | IEEE 754 rounding mode |
{ftz} | ftz, noftz | Flush-to-zero denormal behavior |
{type} | f32, f64, s16, s64, u16, u64 | Operand data type |
{suffix} | slowpath, mediumpath, full, fast, v2, v3 | Implementation variant |
The slowpath suffix indicates a handler for denormalized inputs or edge cases (NaN, infinity, zero) that the fast path branches around. The v2/v3 suffixes mark successive revisions of the implementation (each version may use different Newton-Raphson step counts or algorithm improvements).
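A name of this shape can be decomposed mechanically. The decoder below is a sketch for the common patterns, not a parser recovered from the binary; a few irregular names (e.g. `__cuda_sm20_dblrcp_rn_slowpath_v3`, which omits the type field) fall outside it.

```python
import re

# Sketch of the naming scheme: __cuda_sm{gen}_{op}[_{rounding}][_{ftz}]_{type}[_{suffix}]
_PAT = re.compile(
    r"__cuda_(sm\w+?)_(dblrcp|drsqrt|dsqrt|div|rcp|sqrt|rem|bfe|bfi)"
    r"(?:_(rn|rd|ru|rz))?(?:_(ftz|noftz))?"
    r"_(f32|f64|s16|s64|u16|u64)(?:_(.+))?$"
)

def parse_intrinsic(name):
    """Split an intrinsic name into (gen, op, rounding, ftz, type, suffix)."""
    m = _PAT.match(name)
    return m.groups() if m else None
```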
Prototype Format
The prototype generator sub_5FF700 (354 KB) emits .weak .func PTX declarations for every registered intrinsic. Example prototypes:
.weak .func (.reg .s32 %d) __cuda_sm20_div_s16
(.reg .s32 %a0, .reg .s32 %a1)
.weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
(.reg .u64 %rda1, .reg .u64 %rda2)
.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
(.reg .f32 %fa1, .reg .f32 %fa2)
.weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
(.reg .f64 %fda1, .reg .f64 %fda2)
The .weak linkage allows user-provided implementations to override the built-in versions at link time.
Newton-Raphson Refinement Templates
For FP64 operations, ptxas emits multi-instruction SASS sequences inline rather than calling helper functions. These sequences are generated by the template subsystem at 0x1700000--0x172A090 (36 functions, ~180 KB). The templates use MUFU hardware as the initial seed and iterate Newton-Raphson to achieve full FP64 precision. See Newton-Raphson & Math Templates for complete details.
Template Hierarchy
sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
|
+-- sub_170E8B0 (DDIV handler) -- FP64 division
| +-- sub_170E260 (coordinator) -- 298 vregs, 6 sub-expanders
|
+-- sub_1718D60 (DRCP/DSQRT handler) -- FP64 reciprocal / square root
| +-- sub_1718790 (coordinator) -- 289 vregs, 7 sub-expanders
|
+-- sub_17276C0 (DRSQRT handler) -- FP64 reciprocal square root
| +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders
| +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
|
+-- sub_1704070 (Inline DDIV handler) -- Register-pressure variants
FP64 Division (DDIV)
Algorithm for a / b:
- Extract the high 32 bits of the FP64 divisor `b`
- Convert to FP32 and compute `MUFU.RCP` -- ~23-bit seed for `1/b`
- Newton-Raphson iteration 1: `x1 = x0 * (2 - b * x0)` via DFMA -- ~46 bits
- Newton-Raphson iteration 2 (partial): guard bits for correct rounding
- Compute `a * (1/b)` using the refined reciprocal
- Apply IEEE 754 rounding, handle overflow/underflow/NaN
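The numeric core of steps 2--5 can be modeled in Python, with a correctly rounded float32 reciprocal standing in for the MUFU.RCP seed and float64 arithmetic standing in for DFMA. This is a model of the algorithm's shape only -- no special-case handling, in-range inputs assumed.

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def ddiv(a, b):
    """Numeric model of the DDIV template: f32 seed + two NR steps + multiply."""
    x = f32(1.0 / f32(b))          # MUFU.RCP stand-in: ~23-bit seed for 1/b
    x = x * (2.0 - b * x)          # NR iteration 1 (DFMA) -> ~46 bits
    x = x * (2.0 - b * x)          # NR iteration 2 -> full float64 precision
    return a * x                   # a * (1/b)
```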
The complete DDIV template emits ~100--120 SASS instructions across 3 named code sections (__ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3), using 298 virtual registers. Three register-pressure variants are available:
| Register Limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 | Full unrolled, maximum ILP |
| > 16,383 | sub_1701F10 | Partially spilled |
| <= 16,383 | sub_1701860 | Minimal-register, more instructions |
FP64 Reciprocal (DRCP)
Algorithm for 1/b:
- `MUFU.RCP(float32(b))` -- ~23-bit seed
- Newton-Raphson iteration 1: `x1 = x0 * (2 - b * x0)` via DFMA
- Newton-Raphson iteration 2: doubles precision to ~52+ bits
- Final rounding to FP64 precision
Implemented by sub_1718D60 (coordinator at sub_1718790, 289 vregs, 7 sub-expanders: sub_170ED40 through sub_1717470).
FP64 Square Root (DSQRT)
Algorithm for sqrt(a):
- `MUFU.RSQ(float32(a))` -- ~23-bit seed for `1/sqrt(a)`
- Newton-Raphson refinement: `y1 = y0 * (3 - a * y0^2) / 2`
- Compute `sqrt(a) = a * (1/sqrt(a))`
- Apply IEEE 754 rounding
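The same seed-and-refine flow can be modeled numerically (a float32 rsqrt stands in for MUFU.RSQ, float64 for DFMA; illustrative only, no special-case handling):

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def dsqrt(a):
    """Numeric model of the DSQRT template: MUFU.RSQ seed + NR + final multiply."""
    y = f32(1.0 / f32(a) ** 0.5)      # MUFU.RSQ stand-in: ~23-bit seed for 1/sqrt(a)
    y = y * (3.0 - a * y * y) / 2.0   # NR refinement -> ~46 bits
    y = y * (3.0 - a * y * y) / 2.0   # second refinement -> full float64 precision
    return a * y                      # sqrt(a) = a * (1/sqrt(a))
```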
Shares the coordinator with DRCP (sub_1718790), selecting the DSQRT sub-expanders (sub_1715910, sub_1717470) based on the original PTX operation.
FP64 Reciprocal Square Root (DRSQRT)
The most complex template handler (sub_17276C0). Dispatches based on a hardware capability flag at *(*(ctx+1584)+1037) & 1:
- Flag set (sm_80+ with enhanced SFU): `sub_1727130` -- 59 vregs, fewer refinement iterations due to `MUFU.RSQ64H` providing better initial precision
- Flag clear (older architectures): `sub_1720D60` -- 247 vregs, full Newton-Raphson with 5 sub-expanders
Integer Division via MUFU
Integer division by variable values also uses MUFU.RCP as a starting point. The algorithm for unsigned 32-bit a / b:
I2F(b) -> MUFU.RCP -> F2I -> IMAD.HI -> correction
Specifically:
- `float_b = I2F(b)` -- convert divisor to FP32
- `rcp = MUFU.RCP(float_b)` -- ~23-bit reciprocal approximation
- `int_rcp = F2I(rcp)` -- convert back to integer
- `q_est = IMAD.HI(a, int_rcp, 0)` -- estimated quotient
- `r_est = IMAD(q_est, -b, a)` -- estimated remainder
- Correction: up to 2 iterations of `if (r_est >= b) q_est++; r_est -= b`
The correction steps (at most 2) are needed because MUFU.RCP is accurate to within 2 ULP. This sequence emits ~50 SASS instructions for 32-bit (sub_1724A20, 28 KB decompiled) and ~80 for 64-bit unsigned (sub_1728930, 16.5 KB).
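The quotient-estimation flow can be modeled in Python. Because a correctly rounded float32 reciprocal is used here instead of the raw MUFU seed, the sketch adds one fixed-point refinement step before the correction loop; the binary's exact instruction sequence differs.

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def udiv32(a, b):
    """Model of reciprocal-based unsigned 32-bit division (assumes b != 0)."""
    r = int(f32(1.0 / f32(b)) * 2**32)    # I2F -> MUFU.RCP -> F2I: fixed-point 2^32/b
    r = r + (r * (2**32 - r * b) >> 32)   # one fixed-point NR step tightens the seed
    q = (a * r) >> 32                     # IMAD.HI analogue: estimated quotient
    rem = a - q * b                       # IMAD analogue: estimated remainder
    while rem >= b:                       # final correction, a couple of steps at most
        q, rem = q + 1, rem - b
    while rem < 0:
        q, rem = q - 1, rem + b
    return q
```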
FP32 Math Paths
Approximate vs Full-Range
For FP32 operations, the codegen handler selects between:
- Single MUFU -- for the `.approx` modifier. One instruction, ~23-bit precision.
- MUFU + correction -- for `.rn`/`.rd`/`.ru`/`.rz` with FTZ. MUFU seed plus 1--2 FMA correction steps, inline.
- Helper function call -- for directed rounding modes (RD/RU/RZ) without FTZ, or when denormal handling is required (slowpath variants). Calls to `__cuda_sm20_*` or `__cuda_sm3x_*` functions.
Flush-to-Zero (FTZ)
The .ftz modifier on FP32 operations flushes denormalized inputs and outputs to zero, which simplifies the math sequence:
- Eliminates denormal input handling branches
- Eliminates denormal output rounding logic
- Allows a shorter inline sequence instead of a function call
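The flush itself is a simple bit-level operation on the binary32 encoding, sketched here for reference:

```python
import struct

def ftz_f32(x):
    """Flush-to-zero on an IEEE binary32 value: denormals become signed zero."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    if bits & 0x7F800000 == 0:        # biased exponent 0 -> zero or denormal
        bits &= 0x80000000            # keep only the sign bit
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```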
Each FP32 math intrinsic exists in both FTZ and non-FTZ variants (e.g., __cuda_sm20_rcp_rn_ftz_f32 vs __cuda_sm20_rcp_rn_f32), and many also have a slowpath variant for edge cases.
sm_3x Optimized Division
Four additional division intrinsics at IDs 0x83--0x86 provide sm_30+ optimized paths for FP32 round-to-nearest division. The __cuda_sm3x_div_rn_ftz_f32 and __cuda_sm3x_div_rn_noftz_f32 variants (plus their slowpath counterparts) take advantage of Kepler+ hardware improvements to produce shorter instruction sequences than the sm_20 versions.
FP16 Math Handling
FP16 (half) math operations do not use MUFU directly. Instead, ptxas:
- Promotes FP16 inputs to FP32 via `H2F` (half-to-float conversion)
- Performs the FP32 MUFU operation
- Converts the result back to FP16 via `F2H` (float-to-half conversion)
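The promote/operate/demote flow can be modeled with Python's binary16 codec. This is illustrative only; `ex2` is chosen as the example operation, and wider-precision Python floats stand in for the FP32 MUFU step.

```python
import struct

def h2f(h16):
    """Widen a raw 16-bit half pattern to a Python float (H2F stand-in)."""
    return struct.unpack('<e', struct.pack('<H', h16))[0]

def f2h(f):
    """Round a float back to a raw 16-bit half pattern (F2H stand-in)."""
    return struct.unpack('<H', struct.pack('<e', f))[0]

def ex2_f16(h16):
    # promote -> operate at wider precision -> demote, as described above
    return f2h(2.0 ** h2f(h16))
```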
For HMMA (half-precision matrix multiply-accumulate) operations, the tensor core path is used instead -- see Tensor Core Intrinsics.
The HADD2, HMUL2, HFMA2 instructions operate on packed FP16x2 values and are separate from the MUFU path. These are direct hardware instructions dispatched to the ALU pipe, not the SFU.
Codegen Handler Deep Dive
sub_5B76D0 -- Division Codegen (64 KB)
The largest math codegen handler at 1,466 decompiled lines. Its dispatch tree:
sub_70CA60(instr, 0) -> operand type
|
+-- type 58 (f32)
| +-- sub_707BC0(instr) -> rounding mode check
| | +-- mode 1 -> short-form call (approx)
| | +-- mode > 39 -> full Newton-Raphson inline sequence
| | +-- else -> helper function call
| +-- sub_70B820(instr) -> precision modifier
| +-- <= 39 -> 3-operand compact call
| +-- > 39 -> multi-segment inline expansion
|
+-- type 59 (f64)
| +-- full/fast path selection
| +-- rounding mode -> specific __cuda_sm20_div_r{n,d,u,z}_f64 call
|
+-- type 54 (s16/s32)
| +-- __cuda_sm20_div_s{16,64} call
|
+-- type 56 (u16/u32)
+-- __cuda_sm20_div_u{16,64} call
The FP32 path at rounding mode > 39 generates a multi-segment inline PTX sequence with ~20 sprintf() calls, each appending a PTX instruction to the output buffer. This is the full-range IEEE-compliant FP32 division path that uses MUFU.RCP as a seed followed by FMA-based correction.
sub_5B0CD0 -- Reciprocal Codegen (44 KB)
Similar structure to the division handler. Dispatches by type (f32/f64) and rounding mode. For FP64, calls __cuda_sm20_rcp_f64_v3. For FP32, selects between 4 rounding modes x 2 FTZ variants x 2 paths (fast/slowpath) = up to 16 different intrinsic calls.
sub_5B4040 -- Square Root Codegen (49 KB)
Handles both FP32 (__cuda_sm20_sqrt_*) and FP64 (__cuda_sm20_dsqrt_*) variants. For FP64, the dsqrt_rn_f64_mediumpath_v1 variant provides an intermediate-complexity path between the fast approximation and the full Newton-Raphson template.
sub_583190 -- Base-2 Exponential (ex2)
Dispatches by operand type:
- FP32 with mode 1: short-form approximate path (single MUFU.EX2)
- FP32 with rounding mode > 39: full-range inline sequence with ~18 `sprintf()` segments generating a PTX sequence that includes range reduction, MUFU.EX2, and polynomial correction
- FP64: multi-operand call to a helper function
sub_57BFC0 -- Reciprocal Square Root (rsqrt)
Dispatches by type:
- FP64 with mode 1: short-form call to `__cuda_sm20_drsqrt_f64_v2`
- FP64 with rounding mode > 39: full inline sequence with ~35 `sprintf()` segments -- the longest inline math expansion for a single-precision-equivalent operation. The sequence implements range reduction, MUFU.RSQ, Newton-Raphson correction, and renormalization
- FP32: `MUFU.RSQ` for approximate, helper call for IEEE-compliant
Scheduling and Latency
MUFU instructions are scheduled on the SFU (Special Function Unit), which is functional unit index 8 in the ptxas latency model. Key scheduling properties:
| Property | Value |
|---|---|
| Functional unit | SFU (index 8) |
| Issue latency | 1 cycle (can issue every cycle) |
| Result latency | ~4 cycles (pipeline depth) |
| Throughput | 1 per 4 cycles per SM partition (16 per SM for 4 partitions) |
| Dual-issue | Cannot dual-issue with ALU on same warp |
The scheduler (sub_815820) places MUFU instructions to maximize overlap with ALU operations from other warps. The Newton-Raphson sequences interleave MUFU, DFMA, IMAD, and MOV instructions to hide the SFU pipeline latency behind ALU computation.
Fast-Math vs IEEE-Compliant Summary
| PTX Operation | Fast-Math (-use_fast_math) | IEEE-Compliant |
|---|---|---|
div.f32 | MUFU.RCP + FMUL (2 instr) | __cuda_sm20_div_rn_f32 call (~15 instr) |
div.f64 | N/A (no FP64 fast-math) | DDIV template (~100--120 instr) |
rcp.f32 | MUFU.RCP (1 instr) | __cuda_sm20_rcp_rn_f32 call (~10 instr) |
rcp.f64 | N/A | DRCP template (~90 instr) |
sqrt.f32 | MUFU.RSQ + FMUL (2 instr) | __cuda_sm20_sqrt_rn_f32 call (~12 instr) |
sqrt.f64 | N/A | DSQRT template (~80 instr) |
rsqrt.f32 | MUFU.RSQ (1 instr) | __cuda_sm20_drsqrt_f64_v2 (for f64) |
sin.f32 | MUFU.SIN (1 instr) | Range reduction + MUFU.SIN + correction |
cos.f32 | MUFU.COS (1 instr) | Range reduction + MUFU.COS + correction |
ex2.f32 | MUFU.EX2 (1 instr) | Range reduction + MUFU.EX2 + correction |
lg2.f32 | MUFU.LG2 (1 instr) | Range reduction + MUFU.LG2 + correction |
Key Function Reference
| Address | Size | Identity |
|---|---|---|
sub_5B76D0 | 64 KB | div codegen handler -- dispatches all division variants |
sub_5B0CD0 | 44 KB | rcp codegen handler -- reciprocal for f32/f64 |
sub_5B4040 | 49 KB | sqrt codegen handler -- square root for f32/f64 |
sub_57BFC0 | ~10 KB | rsqrt codegen handler -- reciprocal square root |
sub_583190 | ~14 KB | ex2 codegen handler -- base-2 exponential |
sub_52A5C0 | ~5 KB | lg2 codegen handler -- base-2 logarithm |
sub_505B00 | ~5 KB | tanh codegen handler -- hyperbolic tangent |
sub_573860 | ~7 KB | div.full codegen handler -- FP64 full-precision division |
sub_589810 | ~13 KB | rem codegen handler -- integer remainder |
sub_5D1660 | 46 KB | Master intrinsic registration (608 entries) |
sub_5FF700 | 354 KB | Prototype generator (PTX .weak .func declarations) |
sub_80E9B0 | ~1.5 KB | LowerSpecialFunctions -- MUFU emission pass |
sub_170E8B0 | -- | DDIV top-level handler |
sub_1718D60 | 790 B | DRCP/DSQRT coordinator wrapper |
sub_17276C0 | 1,011 B | DRSQRT coordinator wrapper |
sub_1704070 | 263 B | DDIV register-pressure dispatcher |
sub_1724A20 | 28 KB | 32-bit integer division via MUFU.RCP |
sub_1728930 | 16.5 KB | 64-bit unsigned integer division |
sub_1727AC0 | 13.8 KB | 64-bit signed integer division |
sub_AED3C0 | 28 KB | Master lowering dispatcher (invokes all templates) |
sub_10C0170 | ~5 KB | MUFU Mercury encoding function (sm_100+) |
Tensor Core Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas supports five generations of tensor core operations spanning SM 70 through SM 100. The binary contains three major codegen handlers -- sub_5C7A50 (173KB, WMMA), sub_5C10A0 (120KB, MMA), and sub_5BBC30 (90KB, tcgen05.mma) -- plus four WGMMA handlers, eleven tcgen05 instruction handlers, and ~400 numeric MMA hash table entries. Together these constitute ~500KB of code generation logic dedicated to tensor core instructions, making this the single largest functional subsystem in ptxas.
| WMMA codegen | sub_5C7A50 (173KB) -- wmma.mma instruction code generation |
| MMA codegen | sub_5C10A0 (120KB) -- mma.sync instruction code generation |
| TCGen05 MMA codegen | sub_5BBC30 (90KB) -- tcgen05.mma instruction code generation |
| WMMA load/store | sub_5A0EA0 (7.8KB), sub_5A8E40 (9.8KB), sub_5A6BD0 (8.8KB), sub_5A2D10 (8.1KB) |
| WGMMA handlers | sub_50AC70 (1.3KB), sub_4DA380 (295B), sub_4DA4B0 (295B), sub_4DA5E0 (311B) |
| MMA validator | sub_4C2FD0 (12.2KB), sub_49BBA0 (11.4KB), sub_4BFED0 (10.3KB) |
| Numeric MMA hash | ~400 entries at compilation context offset a1+816 |
| Prototype generator | sub_5FF700 (354KB) -- generates .weak .func PTX declarations |
| SASS MMA encoders | sub_6D4350, sub_6D7AF0, sub_6D5CB0, sub_6D69B0 |
Tensor Core Generations
ptxas tracks five distinct tensor core generations, each introducing new SASS opcodes, data types, and matrix shapes. The internal scheduling counters (visible in DUMPIR statistics output) reveal the SASS-level taxonomy.
| Gen | SM | SASS Opcodes | PTX API | Key Addition |
|---|---|---|---|---|
| 1st | 75 (Turing) | HMMA (m16n8k8, m8n8k16) | wmma.mma | FP16 tensor cores; INT8/INT4/B1 WMMA |
| 2nd | 80 (Ampere) | HMMA (m16n8k16), IMMA, DMMA, BMMA | wmma.mma, mma.sync | BF16, TF32, FP64, structured sparsity |
| 3rd | 89 (Ada) | HMMA (extended) | mma.sync | FP8 (E4M3, E5M2), block-scale MMA |
| 4th | 90 (Hopper) | WGMMA | wgmma.mma_async | Warpgroup MMA, async pipeline, sub-byte sparse |
| 5th | 100 (Blackwell) | UTCHMMA, UTCIMMA, tcmma | tcgen05.mma, tcgen05.mma.ws | Tensor memory (TMEM), warp-shared, block-scale |
Scheduling Unit Names
The binary's statistics printer functions (clones at 0x700-byte intervals from sub_ABBA50) emit per-unit throughput counters using the internal SASS operation classification:
| Counter | SASS Operation | Matrix Shape | Description |
|---|---|---|---|
hmma1688 | HMMA m16n8k8 | 16x8x8 | 1st-gen FP16 MMA (Turing) |
hmma1688f16 | HMMA m16n8k8 (f16 accum) | 16x8x8 | FP16 accumulation variant |
hmma16816 | HMMA m16n8k16 | 16x8x16 | 2nd-gen FP16 MMA (Ampere+) |
hmma16816f16 | HMMA m16n8k16 (f16 accum) | 16x8x16 | FP16 accumulation variant |
hmmaSp1688 | HMMA.SP m16n8k8 | 16x8x(8*2) | Sparse FP16 (2:4 sparsity) |
hmmaSp1688f16 | HMMA.SP m16n8k8 (f16 accum) | 16x8x(8*2) | Sparse FP16, f16 accum |
imma16816 | IMMA m16n8k16 | 16x8x16 | Integer MMA (INT8) |
imma16832 | IMMA m16n8k32 | 16x8x32 | Integer MMA (INT4/sub-byte) |
immaSp8832 | IMMA.SP m8n8k32 | 8x8x(32*2) | Sparse integer (m8 variant) |
immaSp16832 | IMMA.SP m16n8k32 | 16x8x(32*2) | Sparse integer (m16 variant) |
dmma | DMMA | 8x8x4 | FP64 tensor MMA |
fma64 | FMA64 | -- | FP64 FMA (non-tensor) |
Format strings from the binary:
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [est hmma16816=%d] [est hmma16816f16=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f]
These counters are emitted by the post-scheduling statistics pass. The scheduler treats each counter as a separate functional unit throughput class, with dedicated latency table entries:
| Ori Opcode | Latency Class | Description |
|---|---|---|
0x144 | 600 | Tensor fence |
0x145--0x146 | 759 | HMMA/BMMA tensor core operations |
0x147--0x148 | 757 or 761 | Narrow/wide DP tensor operations |
0x149 | 604 | Tensor sync |
PTX-Level Instruction Lowering
WMMA Instructions
The WMMA (Warp Matrix Multiply-Accumulate) API is the oldest tensor core interface. Five PTX instructions are registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Size | Purpose |
|---|---|---|---|
wmma.load.a | sub_5A0EA0 | 7,779B | Load matrix fragment A |
wmma.load.b | sub_5A8E40 | 9,757B | Load matrix fragment B |
wmma.load.c | sub_5A6BD0 | 8,813B | Load accumulator fragment C |
wmma.store.d | sub_5A2D10 | 8,074B | Store result fragment D |
wmma.mma | sub_5C7A50 | 173KB | Matrix multiply-accumulate |
All five handlers allocate a 50,000-byte code generation buffer. The load/store handlers are ~8--10KB each and cover the combinatorial product of shape, layout, data type, and address space. The wmma.mma handler at 173KB is the largest single codegen handler in ptxas.
Instruction property accessors used by WMMA codegen:
| Accessor | Purpose |
|---|---|
sub_7075C0 | Get instruction flag A (layout/type encoding) |
sub_707BC0 | Get instruction flag B (variant/mode encoding) |
sub_7075E0 | Get layout string (row/col) |
sub_707BE0 | Get shape string (m16n16k16 etc.) |
sub_70A810 | Get scale string (satfinite etc.) |
WMMA Shapes and Types
sm_70 (Volta/Turing) -- 1st generation:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m32n8k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m8n32k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
sm_72 (Turing) -- integer WMMA extension:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | s8, u8 | s32 | 2 x b32 | 2 x b32 | 8 x b32 |
| m32n8k16 | s8, u8 | s32 | 4 x b32 | 1 x b32 | 8 x b32 |
| m8n32k16 | s8, u8 | s32 | 1 x b32 | 4 x b32 | 8 x b32 |
| m8n8k32 | s4, u4 (sub-byte) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
| m8n8k128 | b1 (1-bit) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
sm_80 (Ampere) -- 2nd generation extensions:
| Shape | Data Types | Accumulator | Notes |
|---|---|---|---|
| m16n16k16 | bf16 | f32 | New BF16 support, 4 x b32 per fragment |
| m16n16k8 | tf32 | f32 | New TF32 support, 4 x b32 per fragment |
| m8n8k4 | f64 | f64 | Double-precision MMA, 1 x f64 per A/B, 2 x f64 per C/D |
All WMMA shapes support three address spaces (generic, global, shared) and optional descriptor-based addressing (_desc variants). The binary contains separate intrinsic registrations for each combination, with the full prototype available in the string table.
MMA (mma.sync) Instructions
The mma.sync API uses a single PTX opcode mma dispatched to sub_5C10A0 (120KB). Unlike WMMA, it uses asymmetric matrix shapes with M and N decoupled from K, and operates at a single-warp granularity.
The numeric MMA hash table at a1+816 collapses the full variant space (shape + type + layout) into ~400 hash entries. Each entry maps a numeric string key (e.g., "2644314910") to a specific codegen handler function pointer, avoiding multi-dimensional dispatch.
sm_80+ MMA intrinsics (IDs 0x209--0x22F):
The 39 intrinsics registered under __cuda_sm_8x_mma_* cover:
| Intrinsic Pattern | Layout | Types (D, C, A, B) | Regs |
|---|---|---|---|
mma_row_col_f16_f16_f16_f16 | row x col | f16, f16, f16, f16 | D: 4 x b32 |
mma_row_col_f32_f16_f16_f16 | row x col | f32, f16, f16, f16 | D: 8 x b32 |
mma_row_col_f32_f16_f16_f32 | row x col | f32, f32, f16, f16 | D: 8 x b32, C: 8 x b32 |
mma_col_col_* | col x col | same set | same |
mma_row_row_* | row x row | same set | same |
mma_col_row_* | col x row | same set | same |
mma_shfl_f16 | -- | shuffle f16 | D: 2 x b32 |
mma_shfl_f32 | -- | shuffle f32 | D: 4 x b32 |
Prototype examples from the binary:
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_8x_mma_row_col_f16_f16_f16_f16
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3);
.weak .func (.param .align 16 .b32 mma_dst[8])
__cuda_sm_8x_mma_row_col_f32_f16_f16_f32
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 c4, .reg .b32 c5, .reg .b32 c6, .reg .b32 c7);
Note the .param .align 16 .b32 mma_dst[N] return convention -- MMA results are returned through aligned parameter space, not registers, because the warp-cooperative nature of the operation means each thread holds only a fragment.
MMA Shape Summary Across Generations
| Shape | Types | SM Floor | SASS Opcode | Notes |
|---|---|---|---|---|
| m8n8k4 | f64 | 80 | DMMA | Double-precision |
| m8n8k16 | f16 | 75 | HMMA | Original Turing shape |
| m8n8k32 | s4/u4 | 75 | IMMA | Sub-byte integer |
| m8n8k128 | b1 | 75 | BMMA | 1-bit (XOR/AND pop) |
| m16n8k8 | f16 | 75 | HMMA | Asymmetric Turing shape |
| m16n8k16 | f16, bf16, s8/u8 | 80 | HMMA, IMMA | Primary Ampere shape |
| m16n8k32 | s8/u8, s4/u4 | 80/90 | IMMA | Integer, sub-byte at sm_90 |
| m16n8k64 | s4/u4 | 90 | IMMA | Hopper sub-byte extension |
| m16n8k128 | s4/u4 (sparse), b1 | 90 | IMMA, BMMA | Hopper sparse sub-byte |
| m16n8k256 | b1 | 90/100 | BMMA | Extended 1-bit MMA |
MMA Validation
Three validator functions gate MMA features by SM version:
sub_4C2FD0 (12.2KB) -- WMMA/MMA master validator:
Performs three-way version checks:
- SM 75: base WMMA (f16)
- SM 80: extended types (BF16, TF32, FP64 -- "MMA with double types")
- SM 90: WGMMA features
- FP8: "mma with FP8 floating point type" (gated by sm_89+)
sub_49BBA0 (11.4KB) -- MMA type/scale validator:
Validates FP8 and block-scale configurations:
"mma with FP8 floating point type"-- sm_89+ gate"mma with FP8 floating point type and FP16 accumulation"-- additional FP16 accum check"mma with FP8 floating point type and .m16n8k16 shape"-- shape/type cross-validation"Sparse mma with block scale"-- block-scale + sparsity interaction.block_scalemodifier validation
sub_4BFED0 (10.3KB) -- WMMA shape validator:
Validates WMMA-specific shapes and the .aligned modifier:
".aligned modifier for wmma"-- alignment enforcement- SM 75/80 version checks for shape legality
sub_490F90 -- Integer MMA validator:
Checks integer MMA shape validity: "Integer MMA with shape " -- validates m/n/k dimensions against the SM-level capability set.
sub_494210 (2.3KB) -- Sparse GMMA validator:
Validates sparse MMA metadata: "Sparse GMMA with " -- checks 2:4 sparsity pattern encoding.
sub_495900 -- WMMA floating-point validator:
Checks: "'wmma.mma' with floating point type" -- validates FP type compatibility with the target shape.
sub_4428E0 -- FP64 MMA validator:
Validates: "mma with .f64 type" -- gates double-precision MMA on sm_80+.
sm_90+ Sub-Byte MMA Intrinsics (IDs 0x23A--0x25F)
Hopper introduces 38 sub-byte MMA intrinsics covering s4/u4 sparse operations at warp granularity. These are distinct from the WGMMA API and provide backward-compatible sub-byte operations through the classical mma.sync interface.
Dense sub-byte (m8n8k32, m16n8k32, m16n8k64):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m8n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k64 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
Sparse sub-byte (m16n8k64, m16n8k128):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m16n8k64 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite, split _0/_1 | 16 |
| m16n8k128 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
The _0 and _1 suffixes on sparse m16n8k64 represent the two halves of a split operation -- the K dimension is decomposed into two steps for the sparsity pattern. The sparse variants take an additional e (metadata) operand encoding the 2:4 sparsity pattern.
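The 2:4 pattern behind the `e` operand can be sketched with a small model. This is an illustrative helper (not ptxas code): for each group of four elements it keeps two and packs their positions as two 2-bit indices, which is the information the metadata operand carries per group.

```python
def compress_2to4(values):
    """Compress a dense vector into 2:4 structured-sparse form.

    For every group of 4 elements, keep the 2 largest-magnitude entries
    and record their positions as two 2-bit indices -- a model of the
    per-group information carried by the sparse metadata operand e.
    """
    assert len(values) % 4 == 0
    kept, meta = [], []
    for g in range(0, len(values), 4):
        group = values[g:g + 4]
        # Indices of the two largest-magnitude entries, ascending order.
        idx = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        kept.extend(group[i] for i in idx)
        # Pack the two 2-bit indices into a 4-bit metadata nibble.
        meta.append(idx[0] | (idx[1] << 2))
    return kept, meta

kept, meta = compress_2to4([5, 0, 0, 7,   0, 3, 1, 0])
# First group keeps indices 0 and 3; second keeps indices 1 and 2.
```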
Bit-operations (b1):
| Shape | Operation | SM | Intrinsic |
|---|---|---|---|
| m8n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| m16n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| m16n8k256 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
Prototype example (sparse m16n8k128):
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
(.reg .b32 a0, .reg .b32 a1, .reg .b32 a2, .reg .b32 a3,
.reg .b32 b0, .reg .b32 b1, .reg .b32 b2, .reg .b32 b3,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 e);
The sparse m16n8k128 variant takes 4 A-operands, 4 B-operands (double-width due to sparsity encoding), 4 C-operands, and 1 metadata operand e.
sm_100 Blackwell MMA (IDs 0x230--0x239)
Blackwell introduces 10 intrinsics for extended MMA operations:
HMMA/IMMA metadata helpers (__cuda_sm_10x_*_mdata_*):
| Intrinsic | Shape | Returns | Inputs |
|---|---|---|---|
__cuda_sm_10x_hmma_mdata_m16n8k16 | m16n8k16 | ret_dst[3] | a0, a1, e, f_temp |
__cuda_sm_10x_hmma_mdata_m16n8k32 | m16n8k32 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
__cuda_sm_10x_imma_mdata_m16n8k32 | m16n8k32 | ret_dst[3] | a0, a1, e, f_temp |
__cuda_sm_10x_imma_mdata_m16n8k64 | m16n8k64 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
These mdata functions compute sparse metadata for Blackwell's 5th-gen tensor cores. The e parameter is the sparsity selector, f_temp is a scratch register. The return array includes the transformed A-operands plus the computed metadata word.
Bit MMA (AND + XOR for sm_100):
| Intrinsic | Shape | Operation | Regs |
|---|---|---|---|
__cuda_sm_10x_mma_bit_internal_and_m8n8k128 | m8n8k128 | AND | D: 2, A: 1, B: 1, C: 2 |
__cuda_sm_10x_mma_bit_internal_and_m16n8k128 | m16n8k128 | AND | D: 4, A: 2, B: 1, C: 4 |
__cuda_sm_10x_mma_bit_internal_and_m16n8k256 | m16n8k256 | AND | D: 4, A: 4, B: 2, C: 4 |
__cuda_sm_10x_mma_bit_internal_xor_m8n8k128 | m8n8k128 | XOR | same as AND |
__cuda_sm_10x_mma_bit_internal_xor_m16n8k128 | m16n8k128 | XOR | same as AND |
__cuda_sm_10x_mma_bit_internal_xor_m16n8k256 | m16n8k256 | XOR | same as AND |
Blackwell adds the AND reduction mode for 1-bit MMA (sm_90 only had XOR).
WGMMA -- Warpgroup MMA (SM 90+)
WGMMA (Warp Group Matrix Multiply-Accumulate) operates at warpgroup granularity (4 warps, 128 threads) and uses an asynchronous pipeline protocol. Four PTX instructions are registered:
| PTX Instruction | Handler | Size | Role |
|---|---|---|---|
wgmma.mma_async | sub_50AC70 | 1,282B | Dispatch asynchronous MMA operation |
wgmma.fence | sub_4DA380 | 295B | Open pipeline stage |
wgmma.commit_group | sub_4DA4B0 | 295B | Close pipeline stage |
wgmma.wait_group | sub_4DA5E0 | 311B | Wait for N committed groups |
WGMMA Pipeline Protocol
The hardware requires strict sequencing:
wgmma.fence -- open pipeline stage
wgmma.mma_async (1..N) -- asynchronous MMA operations sharing accumulators
wgmma.commit_group -- close pipeline stage
wgmma.wait_group N -- wait for N outstanding groups to complete
Between fence and wait, strict register constraints apply:
- No non-WGMMA definitions of accumulator registers
- No non-WGMMA reads of accumulator registers
- No non-WGMMA definitions of WGMMA input registers (including descriptors)
Violation of any constraint triggers pipeline serialization via sub_AE47B0, which collapses the pipeline to individual fence/mma/commit/wait per operation. The serialization reason is reported through warning codes 0x1D55--0x1D5E (see GMMA Pipeline).
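The three constraints amount to a scan over the fence..wait window. The sketch below is a simplified model of that check, not the sub_AE47B0 implementation; instruction and register names are hypothetical.

```python
def needs_serialization(window, accum_regs, input_regs):
    """Check a fence..wait instruction window against the three WGMMA
    register constraints. `window` is a list of (opcode, defs, uses)
    triples; any opcode not starting with 'wgmma' counts as a
    non-WGMMA instruction. Returns the violated rule, or None.
    Illustrative model only."""
    for op, defs, uses in window:
        if op.startswith("wgmma"):
            continue
        if defs & accum_regs:
            return "non-WGMMA def of accumulator"
        if uses & accum_regs:
            return "non-WGMMA read of accumulator"
        if defs & input_regs:
            return "non-WGMMA def of WGMMA input"
    return None

window = [
    ("wgmma.mma_async", {"acc0"}, {"a0", "desc_b"}),
    ("mov",             {"acc0"}, set()),   # clobbers an accumulator
]
print(needs_serialization(window, {"acc0"}, {"a0", "desc_b"}))
```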
WGMMA Descriptor Format
The wgmma.mma_async handler (sub_50AC70, 1,282 bytes) encodes the operation's matrix dimensions, data types, layout, and scale factors into the instruction. The A operand can be either a register operand or a descriptor -- a 64-bit value encoding the matrix base address, leading dimension, stride, and swizzle pattern. The B operand is always descriptor-based.
The descriptor format allows the hardware to fetch matrix data directly from shared memory via the TMA (Tensor Memory Accelerator), bypassing register file involvement for the B matrix operand entirely.
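A descriptor of this kind can be sketched as a 64-bit bitfield pack. The field layout below follows NVIDIA's public PTX ISA description of wgmma shared-memory descriptors (addresses and offsets stored in 16-byte units) and is an assumption for illustration; it was not recovered from the binary.

```python
def make_smem_descriptor(base_addr, lead_off, stride_off, swizzle_mode):
    """Pack a 64-bit shared-memory matrix descriptor (assumed layout:
    bits 0-13 base address, 16-29 leading-dim offset, 32-45 stride
    offset, 62-63 swizzle mode; all address fields >> 4)."""
    d = 0
    d |= (base_addr >> 4) & 0x3FFF           # bits 0-13: matrix base
    d |= ((lead_off >> 4) & 0x3FFF) << 16    # bits 16-29: leading-dim offset
    d |= ((stride_off >> 4) & 0x3FFF) << 32  # bits 32-45: stride offset
    d |= (swizzle_mode & 0x3) << 62          # bits 62-63: swizzle pattern
    return d

desc = make_smem_descriptor(0x400, lead_off=0x80, stride_off=0x100,
                            swizzle_mode=1)
```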
Ori Internal Encoding
| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator registers |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator set |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator set |
Compiler-Inserted Warpgroup Instructions
The GMMA pipeline passes (phases 85 and 87) insert three compiler-internal pseudo-operations prefixed with underscore:
| Pseudo-Op | SASS Output | Purpose |
|---|---|---|
_warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup synchronization (arrive) |
_warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup synchronization (wait) |
_warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |
These are not directly written by the programmer. The compiler inserts them to manage register ownership transfer between the warpgroup's register file and the tensor core's accumulator pipeline.
TCGen05 -- 5th Generation Tensor Cores (SM 100+)
Blackwell introduces TCGen05 (Tensor Core Generation 5), which operates through tensor memory (TMEM) -- a dedicated on-chip memory visible only to the tensor core engine, separate from the register file and shared memory. Eleven PTX instructions are registered:
| PTX Instruction | Handler | Size | Purpose |
|---|---|---|---|
tcgen05.mma | sub_5BBC30 | 90KB | Tensor core MMA from TMEM |
tcgen05.mma.ws | sub_58FA20 | 4,604B | Warp-shared MMA variant |
tcgen05.ld | sub_574050 | -- | Load data into TMEM |
tcgen05.ld.red | sub_578DB0 | -- | Load-reduce into TMEM |
tcgen05.st | sub_571FE0 | -- | Store from TMEM |
tcgen05.cp | sub_5427F0 | -- | Copy within TMEM |
tcgen05.commit | sub_56C190 | -- | Commit pending operations |
tcgen05.shift | sub_4F1A90 | -- | Shift TMEM contents |
tcgen05.alloc | sub_569180 | -- | Allocate TMEM columns |
tcgen05.dealloc | sub_58C7F0 | -- | Deallocate TMEM columns |
tcgen05.relinquish_alloc_permit | sub_526370 | -- | Release allocation rights |
TCGen05 MMA Codegen
The tcgen05.mma handler (sub_5BBC30, 90KB) is the third-largest codegen handler in ptxas. It:
- Allocates a 50,000-byte code generation buffer
- Validates tcgen05 capability via sub_70FA00(*, 29)
- Handles standard, sparse (.sp), and warp-shared (.ws) variants
- Extracts sparse metadata via sub_70F0A0
- Generates TMEM address computation code
The tcgen05.mma.ws formatter (sub_58FA20, 4,604B, also used for tcgen05.shift) handles the warp-shared variant where multiple warps contribute to a single MMA operation.
TCGen05 SASS Encoding
At the SASS level, TCGen05 operations are encoded by four specialized Mercury encoder functions:
| Address | Handler | Purpose |
|---|---|---|
sub_6D4350 | SASS MMA encoder | Primary MMA SASS emission |
sub_6D7AF0 | SASS MMA encoder | Alternate MMA variant |
sub_6D5CB0 | SASS MMA encoder | Additional MMA mode |
sub_6D69B0 | SASS MMA encoder | Additional MMA mode |
The SASS encoder at sub_6D4350 references the tcmma operation namespace and validates block-scale configurations:
"tcmma_*_o must be specified with blockscale"-- output operand requires block-scale modifier"uri width for tcmma_*_o must be 2"-- output URI width constraint"tcmma_*_q with blockscale must have uri width of 2"-- scale factor operand constraint"tcmma_*_mxq must be specified with blockscale"-- MX quantization operand constraint"For UTCHMMA, #scaleU4 must be 0 in SPA 10.1."-- SM 100 vs 103 compatibility
The string "UTCHMMA" (Unified Tensor Core HMMA) and "tcmma" (Tensor Core MMA) are the internal SASS-level names for Blackwell's tensor core operations.
TCGen05 Guardrails
Blackwell includes a debug/validation mode activated by --g-tensor-memory-access-check. When enabled, ptxas wraps TMEM operations with guardrail trap functions. Ten guardrail intrinsics are registered (IDs 0x20--0x2A):
| Intrinsic | Trap Condition |
|---|---|
tcgen05_guardrail_trap_phase_invalid_during_alloc | TMEM allocation during invalid phase |
tcgen05_guardrail_trap_current_warp_owner_invalid | Warp accessing TMEM it does not own |
tcgen05_guardrail_trap_unallocated_columns_access | Access to unallocated TMEM columns |
tcgen05_guardrail_trap_unallocated_columns_being_dealloced | Deallocation of unallocated columns |
tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | Dealloc of column not from alloc |
tcgen05_guardrail_trap_allocation_granularity_invalid | Invalid allocation granularity |
tcgen05_guardrail_trap_access_out_of_physical_bounds | Out-of-bounds TMEM access |
tcgen05_guardrail_trap_invalid_datapath_alignment | TMEM datapath misalignment |
tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | Sparsity mismatch in instruction descriptor |
tcgen05_guardrail_trap_sp_used_in_unsupported_env | Sparsity in unsupported config |
Eight guardrail PTX instructions are registered in the opcode dispatch table:
| PTX Instruction | Check |
|---|---|
_tcgen05.guardrails.is_phase_valid | Phase validity for alloc |
_tcgen05.guardrails.is_current_warp_valid_owner | Warp ownership |
_tcgen05.guardrails.are_columns_allocated | Column allocation status |
_tcgen05.guardrails.in_physical_bounds | Physical bounds check |
_tcgen05.guardrails.allocation_granularity | Granularity validation |
_tcgen05.guardrails.datapath_alignment | Alignment validation |
_tcgen05.guardrails.sp_consistency_across_idesc_mod | Sparsity consistency |
_tcgen05.guardrails.check_sparse_usage | Sparse usage validation |
The guardrail check functions are .FORCE_INLINE and return a boolean retVal:
.FORCE_INLINE .func (.reg .b32 retVal)
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment
(.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group,
.reg .u32 hasWS, .reg .u32 hasSP, .reg .u32 matrix_kind);
The parameters reveal TMEM addressing structure: tmemAddr (TMEM base address), iDesc (instruction descriptor), cta_group (CTA group for cluster operations), hasWS (warp-shared flag), hasSP (sparse flag), matrix_kind (operand role).
Block-Scale MMA
Block-scale MMA allows per-block scaling factors for mixed-precision computation. In ptxas, this is gated by the .block_scale modifier on the PTX mma instruction:
- Validator sub_49BBA0 checks ".block_scale" and "Sparse mma with block scale" (sparsity + block-scale interaction)
- Additional intrinsic suffix __cuda_sm_100_tcgen05_ld_immhalfSplitOff and variants handle block-scale-aware loads
- The bf16x2.ue8m0x2 type string in the binary indicates the UE8M0 (unsigned exponent-only) scale factor format for MX (microscaling) quantization
Intrinsic Registration Summary
Full Tensor Core Intrinsic ID Map
| ID Range | Count | SM | Category | SASS Target |
|---|---|---|---|---|
0x89--0x1FA (subset) | ~200 | 70+ | __cuda_sm70_wmma_* -- WMMA load/store/mma (f16) | HMMA |
0x1FB--0x208 | 14 | 80 | __cuda_sm80_* -- bf16/tf32/s4/s8/b1 MMA, createpolicy | HMMA, IMMA, DMMA, BMMA |
0x209--0x22F | 39 | 80+ | __cuda_sm_8x_mma_* -- direct MMA operations | HMMA, IMMA |
0x230--0x239 | 10 | 100 | __cuda_sm_10x_* -- hmma/imma mdata + bit MMA | UTCHMMA, UTCIMMA |
0x23A--0x25F | 38 | 90 | __cuda_sm_9x_mma_sub_byte_internal_* -- sub-byte sparse | IMMA |
TCGen05 Intrinsics (Not in Master ID Table)
TCGen05 operations are dispatched through the named opcode table, not the numeric ID table. The 11 tcgen05.* instructions are registered directly in the opcode hash map at a1+808:
| PTX Opcode | Codegen Handler |
|---|---|
tcgen05.alloc | sub_569180 |
tcgen05.relinquish_alloc_permit | sub_526370 |
tcgen05.dealloc | sub_58C7F0 |
tcgen05.ld | sub_574050 |
tcgen05.ld.red | sub_578DB0 |
tcgen05.st | sub_571FE0 |
tcgen05.commit | sub_56C190 |
tcgen05.cp | sub_5427F0 |
tcgen05.shift | sub_4F1A90 |
tcgen05.mma | sub_5BBC30 |
tcgen05.mma.ws | sub_58FA20 |
OCG-Level MMA Operations
The OCG system at sub_6C9EB0 registers additional SASS-level MMA operations for SM100+:
tcmma (2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit)
tcbar, mmareadshma
16dp32bitt0t15, 16dp32bitt16t31
sparsify, spfactor2to4
tcshift, tcatomsws, tcldsws, tcstsws
These are internal Mercury-level operations that do not correspond 1:1 to PTX instructions. The tcmma variants encode the specific SASS datapath configuration: 2x64dp128bit means 2 datapaths, 64-element wide, 128-bit lane width.
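The datapath naming rule above is regular enough to parse mechanically. This is a purely string-level illustration of the stated convention; `parse_dp_config` is a hypothetical helper, not a recovered function.

```python
import re

def parse_dp_config(name):
    """Split a Mercury datapath suffix like '2x64dp128bitlw02lw13'
    into (num_datapaths, elements, lane_bits), per the naming rule
    stated above. Trailing lane-wiring suffixes (lw...) are ignored."""
    m = re.fullmatch(r"(\d+)x(\d+)dp(\d+)bit\w*", name)
    if not m:
        return None
    return tuple(int(g) for g in m.groups())

assert parse_dp_config("2x64dp128bitlw02lw13") == (2, 64, 128)
assert parse_dp_config("4x32dp128bit") == (4, 32, 128)
```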
Data Type Matrix
Complete data type support across tensor core generations as visible in the binary:
| Data Type | PTX Type | Width | SM Floor | Use |
|---|---|---|---|---|
| FP16 | .f16 | 16b | 75 | Primary HMMA operand |
| BF16 | .bf16 | 16b | 80 | HMMA alternate format |
| TF32 | .tf32 | 19b (stored as 32b) | 80 | Reduced-precision FP32 |
| FP32 | .f32 | 32b | 75 | HMMA accumulator |
| FP64 | .f64 | 64b | 80 | DMMA (double-precision) |
| FP8 E4M3 | .e4m3 | 8b | 89 | Ada/Hopper FP8 MMA |
| FP8 E5M2 | .e5m2 | 8b | 89 | Ada/Hopper FP8 MMA |
| INT8 | .s8, .u8 | 8b | 72 | IMMA integer MMA |
| INT4 | .s4, .u4 | 4b | 72 | IMMA sub-byte MMA |
| INT1 | .b1 | 1b | 75 | BMMA 1-bit MMA (XOR/AND) |
| UE8M0 | .ue8m0x2 | 8b (packed) | 100 | Block-scale exponent factor |
| B1024 | .b1024 | 1024b | 100 | TMEM-width operand |
ELF Metadata
The cubin output includes tensor-core-specific EIATTR attributes:
| EIATTR | Purpose |
|---|---|
EIATTR_SPARSE_MMA_MASK | Records which MMA operations use structured sparsity |
EIATTR_TCGEN05_1CTA_USED | Kernel uses 1-CTA tcgen05 operations |
EIATTR_TCGEN05_2CTA_USED | Kernel uses 2-CTA tcgen05 operations |
Knobs and Options
| Knob / Option | Purpose |
|---|---|
suppress-sparse-mma-advisory-info | Suppress advisory info for mma.sp operations |
--g-tensor-memory-access-check | Enable tcgen05 guardrail instrumentation |
Key Function Table
| Address | Size | Identity | Confidence |
|---|---|---|---|
0x5C7A50 | 173KB | WMMA.MMA codegen (all shapes/types/layouts) | 98% |
0x5C10A0 | 120KB | MMA.sync codegen (post-Volta shapes) | 98% |
0x5BBC30 | 90KB | TCGen05.MMA codegen (Blackwell TMEM-based) | 98% |
0x5A0EA0 | 7,779B | wmma.load.a formatter | 95% |
0x5A8E40 | 9,757B | wmma.load.b formatter | 95% |
0x5A6BD0 | 8,813B | wmma.load.c formatter | 95% |
0x5A2D10 | 8,074B | wmma.store.d formatter | 95% |
0x50AC70 | 1,282B | wgmma.mma_async handler | 99% |
0x4DA380 | 295B | wgmma.fence handler | 99% |
0x4DA4B0 | 295B | wgmma.commit_group handler | 99% |
0x4DA5E0 | 311B | wgmma.wait_group handler | 99% |
0x58FA20 | 4,604B | tcgen05.mma.ws / tcgen05.shift formatter | 95% |
0x4DA720 | 343B | tcgen05.mma.ws formatter | 90% |
0x569180 | -- | tcgen05.alloc handler | 90% |
0x526370 | -- | tcgen05.relinquish_alloc_permit handler | 90% |
0x58C7F0 | -- | tcgen05.dealloc handler | 90% |
0x574050 | -- | tcgen05.ld handler | 90% |
0x578DB0 | -- | tcgen05.ld.red handler | 90% |
0x571FE0 | -- | tcgen05.st handler | 90% |
0x56C190 | -- | tcgen05.commit handler | 90% |
0x5427F0 | -- | tcgen05.cp handler | 90% |
0x4F1A90 | -- | tcgen05.shift handler | 90% |
0x4C2FD0 | 12.2KB | WMMA/MMA master validator (sm_75/80/90) | 90% |
0x49BBA0 | 11.4KB | MMA type/scale validator (FP8, block-scale) | 90% |
0x4BFED0 | 10.3KB | WMMA shape validator | 90% |
0x490F90 | -- | Integer MMA shape validator | 85% |
0x494210 | 2,276B | Sparse GMMA validator | 85% |
0x495900 | -- | WMMA floating-point type validator | 85% |
0x496570 | -- | FP8 MMA shape validator | 85% |
0x4961F0 | -- | FP8 MMA accumulation validator | 85% |
0x4428E0 | -- | FP64 MMA type validator | 85% |
0x6D4350 | -- | SASS TCGen05 MMA encoder (UTCHMMA/tcmma) | 85% |
0x6D7AF0 | -- | SASS MMA encoder variant | 85% |
0x6D5CB0 | -- | SASS MMA encoder variant | 85% |
0x6D69B0 | -- | SASS MMA encoder variant | 85% |
0x50D4B0 | 1,187B | ldmatrix formatter | 90% |
0x4DAEA0 | -- | movmatrix formatter | 90% |
0x4F05D0 | -- | stmatrix formatter | 90% |
0x5D1660 | 46KB | Master intrinsic registration (608 entries) | 99% |
0x5D4190 | 41KB | Opcode dispatch (MMA hash table builder) | 99% |
0x5FF700 | 354KB | Prototype generator (.weak .func declarations) | 99% |
Cross-References
- Intrinsic Table Architecture -- Master registration, ID ranges, opcode dispatch
- GMMA/WGMMA Pipeline -- Phases 85/87, pipeline constraints, serialization
- Ada & Hopper Targets -- SM 89/90 feature gates, WGMMA details
- Turing & Ampere Targets -- SM 75--88 tensor core introduction
- Latency Model -- HMMA/IMMA/DMMA functional unit scheduling
- Register Model -- reg_type 6 (tensor/accumulator, allocator class 6)
- Mercury Encoder -- SASS encoding of MMA instructions
- ELF Output -- EIATTR_SPARSE_MMA_MASK, EIATTR_TCGEN05_*
Synchronization & Warp Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents how ptxas v13.0.88 handles the lowering of synchronization primitives and warp-level intrinsic operations from PTX source through Ori IR to final SASS instructions. The coverage spans warp vote, shuffle, match, and redux operations; thread-block barriers; memory barriers and fences; warp-level synchronization; asynchronous barriers (mbarrier); and atomic/reduction intrinsic lowering.
| PTX codegen handlers | sub_580E50 (vote), sub_5801D0 (shfl), sub_58A730 (match), sub_567680 (redux), sub_524FB0 (bar), sub_570290 (barrier), sub_500BF0 (bar.arrive), sub_570940 (barrier.arrive), sub_52D590 (bar.red), sub_5889B0 (barrier.red), sub_56A5A0 (bar.warp), sub_4DB410 (membar), sub_6C0D90 (atom/red) |
| Intrinsic IDs | 0x01--0x11 (reduxsync, 17), 0x89--0x1FA (sm70 warp/barrier/wmma, 370) |
| Ori IR opcodes | 96 (barrier), 119 (vote), 130 (HSET2 in ROT13; sync internal / shared-mem LEA), 314 (atom/red) |
| SASS opcodes | VOTE, VOTEU, BAR, BAR_INDEXED, MATCH, MEMBAR, WARPSYNC, BSYNC, BSSY, DEPBAR, ERRBAR, ELECT, NANOSLEEP, ATOM, ATOMG, ATOMS, RED |
| Blackwell additions | FENCE_G, FENCE_S, FENCE_T, CGABAR_ARV, CGABAR_GET, CGABAR_SET, CGABAR_WAIT, CGAERRBAR, SYNCS_BASIC, SYNCS_LD_UNIFM |
| Optimizer phases | 25 (StageAndFence), 26 (OriRemoveRedundantBarriers), 42 (ExpandMbarrier), 71 (OptimizeSyncInstructions), 72 (LateExpandSyncInstructions), 99, 100, 114 |
| Intrinsic detection | sub_A9A410 (sm70 warp-sync prefix matcher), sub_A94440 (mbarrier classifier) |
| Related EIATTR | EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_SYNC_STACK, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Related pages | Synchronization & Barriers (passes), Intrinsic Table |
Instruction Classification
ptxas groups synchronization and warp operations into three Ori IR opcode categories that govern scheduling, dependency tracking, and optimization treatment throughout the pipeline.
| Ori Opcode | Category | Instructions | WAR Resource Mask |
|---|---|---|---|
| 96 | Barrier | bar.sync, bar.red, bar.arrive, barrier.* | 0x200001 (bit 0 + bit 21) |
| 119 | Vote | vote.{all,any,uni,ballot}, match.*, redux.*, activemask, elect.sync | 0x1 (bit 0 only) |
130 (HSET2) | Sync internal | BAR/MEMBAR pseudo-ops during optimization (actual SASS BAR = 61, MEMBAR = 111) | Used by phases 26, 71 for redundancy analysis |
The WAR resource mask at opcode 96 (2097153 = 0x200001) uniquely identifies barrier instructions to the scoreboard tracker. For all other opcodes, the base mask is 1. The scheduler uses this to insert appropriate stall cycles between barrier producers and consumers.
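The mask arithmetic is easy to verify directly. A minimal sketch of decomposing the barrier mask into its set bit positions:

```python
BARRIER_WAR_MASK = 0x200001  # 2097153, the opcode-96 barrier mask

def war_bits(mask):
    """List the set bit positions of a WAR resource mask."""
    return [i for i in range(32) if mask >> i & 1]

# Bit 0 is the base resource every opcode carries; bit 21 is the
# barrier-specific scoreboard resource that makes barriers unique.
assert war_bits(BARRIER_WAR_MASK) == [0, 21]
assert war_bits(0x1) == [0]  # base mask for all other opcodes
```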
Warp Vote
Warp vote operations evaluate a per-lane predicate across the active threads in a warp and return a collective result. In ptxas these are lowered through the vote codegen handler at sub_580E50 (approximately 3.2KB of decompiled code).
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Result | Membermask |
|---|---|---|---|
vote.sync.all.pred p, q, membermask | VOTE.ALL | Predicate: true iff all active lanes have q=true | Explicit 32-bit mask |
vote.sync.any.pred p, q, membermask | VOTE.ANY | Predicate: true iff any active lane has q=true | Explicit 32-bit mask |
vote.sync.uni.pred p, q, membermask | VOTE.UNI | Predicate: true iff q is uniform across active lanes | Explicit 32-bit mask |
vote.sync.ballot.b32 d, q, membermask | VOTE.BALLOT | R: 32-bit ballot mask of lanes where q=true | Explicit 32-bit mask |
activemask.b32 d | CS2R (read SR_LANEMASK_ACTIVE) | R: current active lane mask | Implicit (all active) |
elect.sync d, membermask | ELECT.SYNC | Predicate: true for exactly one active lane | Explicit 32-bit mask (sm75+) |
On sm100+ (Blackwell), VOTEU is available as a uniform-register variant of VOTE for cases where the result feeds only uniform consumers.
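The vote result semantics in the table can be modeled over a 32-lane warp. This is an illustrative simulation of the documented behavior, not ptxas code.

```python
def vote(mode, preds, membermask):
    """Model warp-vote semantics: `preds` is the per-lane predicate,
    `membermask` selects participating lanes. Mirrors the results of
    VOTE.{ALL,ANY,UNI,BALLOT} as described in the table above."""
    lanes = [i for i in range(32) if membermask >> i & 1]
    vals = [bool(preds[i]) for i in lanes]
    if mode == "all":
        return all(vals)
    if mode == "any":
        return any(vals)
    if mode == "uni":
        return len(set(vals)) <= 1      # predicate uniform across lanes
    if mode == "ballot":
        return sum(1 << i for i in lanes if preds[i])
    raise ValueError(mode)

preds = [i % 2 for i in range(32)]       # odd lanes true
assert vote("ballot", preds, 0xF) == 0xA # lanes 1 and 3 set
```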
Codegen Handler Structure -- sub_580E50
The vote handler follows the standard intrinsic codegen pattern: allocates a 50,000-byte scratch buffer via sub_424070, then builds an inline PTX function body through sequential sprintf() calls. The handler dispatches on three architecture tiers:
sub_580E50(ctx, string_table):
instr = *(ctx + 1096)
buf = alloc(50000)
// Common prologue: function header + parameter declarations
sprintf(buf, string_table[309261...]) // ".reg .pred %p; ..."
// Feature check: sub_70B6E0(instr) -- has membermask operand?
if has_membermask:
mask_val = sub_70B780(instr)
sprintf(buf, format_string, mask_val)
// Architecture dispatch:
if (sub_70FA00(instr, 11) || SM > 89) && sub_7081E0(instr) == 1:
// Path 1: sm90+ with sync variant
// Emits: vote.sync.{mode} with full operand set
// Reads operands 0,1,2 via sub_70B8E0, sub_70CA70, sub_709E80, sub_70B4F0
else if SM > 69 && sub_7081E0(instr) == 1:
if sub_70FA00(instr, 10) || sub_709E60(instr) == 1:
// Path 2a: sm70+ with explicit sync
// Reads operands via sub_70B510, sub_70B8E0
else:
// Path 2b: sm70+ standard path
// Checks sub_70FA00(instr, 17) -- has predicate output
// Checks sub_70BA10(instr) -- ballot mode
// SM > 75 branch: different register conventions
// sub_70A910(instr) == 1: uniform result path
// Epilogue: closing braces, return buffer
The accessor sub_70FA00(instr, 0) returns the SM architecture level (e.g., 70, 75, 80, 89, 90). The value at parameter 11 checks for a specific feature flag (cluster/sync extension). sub_7081E0(instr) returns the instruction variant (1 = sync form).
Intrinsic Registration
Vote intrinsics are registered as part of the sm70 block (0x89--0x1FA) with four entries:
| Intrinsic Name | PTX Equivalent |
|---|---|
__cuda_sm70_votesync_all | vote.sync.all.pred |
__cuda_sm70_votesync_any | vote.sync.any.pred |
__cuda_sm70_votesync_ballot | vote.sync.ballot.b32 |
__cuda_sm70_votesync_uni | vote.sync.uni.pred |
Detection of these intrinsics at the IR level is handled by sub_A9A410 (194 bytes binary, 908 bytes decompiled), which performs prefix matching against three string patterns:
// sub_A9A410 -- IntrinsicDetector::isSM70WarpSync
static const char* prefixes[] = {
"__cuda_sm70_warpsync",
"__cuda_sm70_votesync_",
"__cuda_sm70_matchsync_",
};
for (each prefix) {
name = getSymbolName(instr.symbol_id);
if (!strncmp(prefix, name, strlen(prefix)))
return 1;
}
return 0;
This function is called during instruction lowering (subsystem 6 at 0xA9F000--0xAA8000) to identify warp-synchronous call sites that need special handling during barrier optimization.
Warp Shuffle
Warp shuffle moves data between lanes within a warp. The codegen handler sub_5801D0 (approximately 3.3KB decompiled) generates inline PTX for the four shuffle modes.
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Data Movement |
|---|---|---|
shfl.sync.idx.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[b] (direct indexed)
shfl.sync.up.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid - b] (shift up)
shfl.sync.down.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid + b] (shift down)
shfl.sync.bfly.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid ^ b] (butterfly XOR)
The c operand packs the clamp value and width: c = ((width - 1) << 8) | clamp. The optional predicate output p indicates whether the source lane was within bounds.
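The source-lane computation for the four modes can be sketched as follows. This is a simplified model of the table's data-movement column (segment clamping shown for power-of-two widths); `shfl_src_lane` is a hypothetical helper, not ptxas code.

```python
def shfl_src_lane(mode, laneid, b, width=32):
    """Compute the source lane for each shfl mode, keeping the lane's
    own value when the source falls outside its width-aligned segment
    (this out-of-bounds case is what the optional predicate reports).
    Simplified model: width must be a power of two."""
    seg_base = laneid & ~(width - 1)     # start of this lane's segment
    if mode == "idx":
        src = seg_base + (b & (width - 1))
    elif mode == "up":
        src = laneid - b
    elif mode == "down":
        src = laneid + b
    elif mode == "bfly":
        src = laneid ^ b
    else:
        raise ValueError(mode)
    in_bounds = seg_base <= src < seg_base + width
    return (src, True) if in_bounds else (laneid, False)

assert shfl_src_lane("bfly", 5, 1) == (4, True)
assert shfl_src_lane("up", 2, 4) == (2, False)  # out of range: keep own value
```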
Codegen Handler -- sub_5801D0
The shuffle handler is structurally similar to vote. It reads up to 5 operands (source value, lane offset, clamp/width, membermask, and optional predicate output) through the accessor chain:
sub_5801D0(ctx, string_table):
// string_table offsets start at 311376
// Prologue with 4 sprintf calls for parameter declarations
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: reads operands 0..4
// sub_70B960(instr) -- gets shuffle mode enum
// sub_70B450(instr) -- gets data type
// Emits shfl.sync.{mode}.b32 with full 8-operand format
else if SM > 69 && variant == 1:
if feature_10 || sub_709E60(instr) == 1:
// sm70+ explicit sync path
// 7-operand format
else:
// Standard sm70 path
// Checks sub_70BA10 for predicate output
// SM > 75: different operand packing
The shuffle mode is obtained via sub_70B960(instr), returning an enum: 0=idx, 1=up, 2=down, 3=bfly. sub_70B450(instr) returns the data type (b32 for standard shuffles).
Intrinsic Registration
| Intrinsic Name | PTX Equivalent |
|---|---|
__cuda_sm70_shflsync_idx | shfl.sync.idx.b32 |
__cuda_sm70_shflsync_up | shfl.sync.up.b32 |
__cuda_sm70_shflsync_down | shfl.sync.down.b32 |
__cuda_sm70_shflsync_bfly | shfl.sync.bfly.b32 |
Each has a variant with predicate output (e.g., __cuda_sm70_shflsync_idx_pred).
Warp Match
Warp match instructions compare a value across lanes and return which lanes hold matching values. The codegen handler sub_58A730 (approximately 4.5KB decompiled) is the largest in this group.
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Result |
|---|---|---|
match.sync.any.b32 d, a, membermask | MATCH.ANY | d: mask of lanes holding the same value as the calling lane |
match.sync.all.b32 d\|p, a, membermask | MATCH.ALL | d: mask of lanes holding the same value; p: true if ALL active lanes match
match.sync.any.b64 d, a, membermask | MATCH.ANY | 64-bit value comparison variant
match.sync.all.b64 d\|p, a, membermask | MATCH.ALL | 64-bit value comparison variant
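The MATCH.ANY result can be modeled directly from the table's description. An illustrative simulation, not ptxas code:

```python
def match_any(values, membermask, laneid):
    """Model MATCH.ANY for one calling lane: return the mask of
    participating lanes holding the same value as `laneid`."""
    lanes = [i for i in range(32) if membermask >> i & 1]
    return sum(1 << i for i in lanes if values[i] == values[laneid])

vals = [7, 7, 3, 7] + [0] * 28
assert match_any(vals, 0xF, 0) == 0b1011  # lanes 0, 1, 3 share value 7
assert match_any(vals, 0xF, 2) == 0b0100  # lane 2 matches only itself
```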
Codegen Handler -- sub_58A730
The match handler has three architecture tiers and handles both b32 and b64 operand widths:
sub_58A730(ctx, string_table):
// string_table offsets start at 323786
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: 7-operand format
// Reads: sub_70B4F0, sub_709E80, sub_70CA70, sub_70B8E0(0..2)
else if SM > 69 && variant == 1:
// sm70+ path
// Checks feature_10 and sub_709E60 for explicit sync
// sub_70B940(instr) -- has match predicate output?
// sub_70D1F0(instr, 0) -- gets operand by index
// sub_70B950(instr) -- gets comparison mode
Intrinsic Registration
| Intrinsic Name | Variants |
|---|---|
__cuda_sm70_matchsync_any_b32 | Standard |
__cuda_sm70_matchsync_any_b64 | 64-bit |
__cuda_sm70_matchsync_all_b32 | With predicate output |
__cuda_sm70_matchsync_all_b64 | With predicate output, 64-bit |
Detection uses the same sub_A9A410 prefix matcher with "__cuda_sm70_matchsync_".
Warp Redux
Warp redux performs a warp-wide reduction operation and returns the result to all participating lanes. The codegen handler sub_567680 (approximately 2.0KB decompiled) is relatively compact.
PTX to SASS Mapping
| PTX Instruction | SASS Functional Unit | Operation |
|---|---|---|
redux.sync.add.s32 d, a, membermask | redux | Warp-wide signed integer addition |
redux.sync.min.s32 d, a, membermask | redux | Warp-wide signed integer minimum |
redux.sync.max.s32 d, a, membermask | redux | Warp-wide signed integer maximum |
redux.sync.min.u32 d, a, membermask | redux | Warp-wide unsigned integer minimum |
redux.sync.max.u32 d, a, membermask | redux | Warp-wide unsigned integer maximum |
redux.sync.add.u32 d, a, membermask | redux | Warp-wide unsigned integer addition |
redux.sync.and.b32 d, a, membermask | redux | Warp-wide bitwise AND |
redux.sync.or.b32 d, a, membermask | redux | Warp-wide bitwise OR |
redux.sync.xor.b32 d, a, membermask | redux | Warp-wide bitwise XOR |
redux.sync.min.f32.NaN d, a, membermask | redux | Warp-wide float minimum (NaN-propagating) |
redux.sync.max.f32.NaN d, a, membermask | redux | Warp-wide float maximum (NaN-propagating) |
redux.sync.min.f32.abs d, a, membermask | redux | Warp-wide float absolute minimum |
The scheduler tracks redux operations on the dedicated redux functional unit pipeline, alongside adu, alu, cbu, fma2x, fma, half, transcendental, ipa, lsu, schedDisp, tex, ttu, udp, and the various MMA pipelines.
Codegen Handler -- sub_567680
sub_567680(ctx, string_table):
// Prologue: function header
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: 8-operand format
// Reads: sub_709E80, sub_70CA70, sub_707530, sub_7087C0,
// sub_707630, sub_70B8E0(0..2)
// sub_707630 -- gets reduction operation type
// sub_7087C0 -- gets data type qualifier
else if SM > 79 && variant == 1:
// sm80+ path
// Two sub-branches:
// - feature_10 || explicit_sync || feature_19: simplified format
// - Standard: full format with sub_707650, sub_7087F0,
// sub_707540, sub_70D1F0
The accessor sub_707630(instr) returns the reduction operation enum (add/min/max/and/or/xor), while sub_7087C0(instr) returns the data type qualifier (s32/u32/b32/f32). Note that redux requires sm80+ in the hardware; the sm70 block in the intrinsic table registers redux-sync intrinsics as software emulation routines.
Redux Sync Intrinsic Registration (IDs 0x01--0x11)
The earliest 17 intrinsic IDs are dedicated to software-emulated redux-sync operations for pre-sm80 targets:
| ID | Intrinsic Name | Operation |
|---|---|---|
0x01 | __cuda_reduxsync_b32_and | Bitwise AND reduction |
0x02 | __cuda_reduxsync_b32_or | Bitwise OR reduction |
0x03 | __cuda_reduxsync_b32_xor | Bitwise XOR reduction |
0x04 | __cuda_reduxsync_f32_max | Float maximum |
0x05 | __cuda_reduxsync_f32_min | Float minimum |
0x06 | __cuda_reduxsync_f32_max_abs | Float absolute maximum |
0x07 | __cuda_reduxsync_f32_min_abs | Float absolute minimum |
0x08 | __cuda_reduxsync_f32_max_NaN | Float maximum (NaN-propagating) |
0x09 | __cuda_reduxsync_f32_min_NaN | Float minimum (NaN-propagating) |
0x0A | __cuda_reduxsync_s32_add | Signed integer sum |
0x0B | __cuda_reduxsync_s32_max | Signed integer maximum |
0x0C | __cuda_reduxsync_s32_min | Signed integer minimum |
0x0D | __cuda_reduxsync_u32_add | Unsigned integer sum |
0x0E | __cuda_reduxsync_u32_max | Unsigned integer maximum |
0x0F | __cuda_reduxsync_u32_min | Unsigned integer minimum |
0x10--0x11 | (additional variants) | Extended operations |
On sm80+, redux PTX instructions lower directly to hardware SASS instructions and bypass the software emulation path.
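For the pre-sm80 emulation path, the classic software pattern is a log2-depth butterfly of shfl.bfly exchanges. The sketch below models that pattern over a list of lane values; it is illustrative of the technique, since the actual __cuda_reduxsync_* bodies are not reproduced here.

```python
def redux_add_emulated(values, width=32):
    """Emulate redux.sync.add with a log2(width)-step butterfly of
    lane exchanges (the shfl.bfly pattern). After the loop, every
    lane holds the full sum. Illustrative model of the software
    fallback, not the registered intrinsic bodies."""
    vals = list(values)
    offset = width // 2
    while offset:
        # Each lane adds the value of its butterfly partner (lane ^ offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals

result = redux_add_emulated(range(32))
assert result[0] == sum(range(32))  # 496, and identical in every lane
```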
Thread-Block Barriers
Thread-block barriers synchronize all threads within a CTA (Cooperative Thread Array). ptxas provides codegen handlers for three PTX barrier families plus their PTX 8.0 cluster-aware equivalents.
PTX Barrier Family
| PTX Handler | Address | Size | PTX Instructions |
|---|---|---|---|
| sub_524FB0 | 0x524FB0 | 1.8KB | bar.sync, bar |
| sub_500BF0 | 0x500BF0 | 1.2KB | bar.arrive |
| sub_52D590 | 0x52D590 | 1.5KB | bar.red.{and,or,popc} |
| sub_570290 | 0x570290 | 2.5KB | barrier.sync, barrier |
| sub_570940 | 0x570940 | -- | barrier.arrive |
| sub_5889B0 | 0x5889B0 | 4.8KB | barrier.red |
| sub_56A5A0 | 0x56A5A0 | 1.9KB | bar.warp.sync |
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Behavior |
|---|---|---|
| bar.sync N | BAR.SYNC | Block until all CTA threads arrive at barrier N |
| bar.sync N, count | BAR.SYNC | Block until count threads arrive at barrier N |
| bar.arrive N | BAR.ARV | Signal arrival at barrier N without blocking |
| bar.arrive N, count | BAR.ARV | Signal arrival with thread count |
| bar.red.and N, p | BAR.RED.AND | Barrier + warp-level AND reduction on predicate |
| bar.red.or N, p | BAR.RED.OR | Barrier + warp-level OR reduction |
| bar.red.popc N, d | BAR.RED.POPC | Barrier + warp-level population count |
| barrier.cta.sync N | BAR.SYNC | PTX 8.0 cluster-aware CTA barrier |
| barrier.cta.arrive N | BAR.ARV | PTX 8.0 cluster-aware CTA arrive |
| barrier.cta.red N | BAR.RED | PTX 8.0 cluster-aware CTA reduction |
The hardware provides 16 named barriers (indices 0--15). The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used per kernel, which the driver uses to partition the convergence barrier file at launch.
Codegen Handler Details -- sub_524FB0
The bar.sync / bar handler dispatches across three architecture generations:
sub_524FB0(ctx, string_table):
// string_table offsets start at 294205
if feature_flag_11 || SM > 89:
// sm90+ path: 4 format strings for prologue
// Checks sub_70B930(instr) for count variant:
// count != 2: single-operand format (barrier index only)
// count == 2: two-operand format (barrier index + thread count)
else if SM > 69:
if !feature_13 || sub_70B860(instr) > 69:
// sm70+ standard path
// Same count dispatch as sm90
else:
// sm70 with feature_13: extended format
// Reads sub_709400 (barrier scope) and sub_708200 (barrier type)
else: // SM <= 69 (pre-Volta)
// Legacy path
// sub_709400 -- barrier scope identifier
// sub_708200 -- barrier operation type
// count dispatch with 3-operand format strings
The accessor sub_70B930(instr) returns the operand count mode: 1 for single-operand (barrier index only), 2 for two-operand (barrier index + thread count). sub_70C180(instr) returns a special value (-1 for default thread count).
Named Barrier Intrinsic Registration
The sm70 intrinsic block registers barrier operations combinatorially:
| Sub-Category | Count | Combinatorial Source |
|---|---|---|
| __cuda_sm70_barrier_arrive_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_arrive_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_sync_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_sync_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_red_and_{0..15} | 16 | AND reduction per barrier |
| __cuda_sm70_barrier_red_and_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_or_{0..15} | 16 | OR reduction per barrier |
| __cuda_sm70_barrier_red_or_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_popc_{0..15} | 16 | POPC reduction per barrier |
| __cuda_sm70_barrier_red_popc_{0..15}_cnt | 16 | With thread count |
This combinatorial explosion produces 160 intrinsic entries for barriers alone (16 indices x 2 count variants x 5 operation types).
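The combinatorial registration above can be reproduced mechanically. A short sketch for illustration, assuming the naming pattern __cuda_sm70_barrier_{op}_{index}[_cnt] recovered from the strings table is exhaustive (the function name is ours, not the binary's):

```python
# Enumerate the sm70 named-barrier intrinsic names exactly as the
# registration table describes: 5 operations x 16 indices x 2 count variants.
OPS = ["arrive", "sync", "red_and", "red_or", "red_popc"]

def sm70_barrier_intrinsics():
    names = []
    for op in OPS:
        for idx in range(16):            # 16 hardware barrier indices
            for suffix in ("", "_cnt"):  # without / with explicit thread count
                names.append(f"__cuda_sm70_barrier_{op}_{idx}{suffix}")
    return names

names = sm70_barrier_intrinsics()
assert len(names) == 160                 # matches the 160-entry count above
```

This mirrors how the intrinsic table is evidently generated at ptxas build time rather than hand-written.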
Barrier Codegen Pattern -- sub_570290
The barrier (PTX 8.0 form) handler at sub_570290 (2.5KB) is the most complex barrier handler. It adds cluster-awareness for sm90+ and handles the barrier.cta.* variants. The handler has an elaborate multi-level dispatch:
sub_570290(ctx, string_table):
// sm90+ path: additional CTA scope parameters
// sub_709E80(instr) -- barrier scope enum
// sub_70B8E0(instr, 0) -- barrier index operand
// sub_70B8E0(instr, 1) -- thread count operand
// sm70+ with feature_10 or explicit sync:
// Separate code paths for count=1 vs count=2
// sm70+ standard (no explicit sync, no feature_10):
// Six format strings building an elaborate inline PTX body
// Handles sub_70B930 count modes 1 and 2
// Checks sub_70C180 for default-count (-1) vs explicit count
// Generates bar.sync N [, count]; or bar.arrive N [, count];
Memory Barriers and Fences
Memory Barriers -- sub_4DB410
The membar codegen handler at sub_4DB410 (84 lines decompiled) is the smallest sync handler. It generates memory barrier instructions across three scope levels.
| PTX Instruction | SASS Opcode | Scope |
|---|---|---|
| membar.cta | MEMBAR.CTA | Thread block (CTA) |
| membar.gpu | MEMBAR.GPU | Device (GPU) |
| membar.sys | MEMBAR.SYS | System (all agents) |
sub_4DB410(ctx, string_table):
mode = sub_709FE0(instr)
if mode != 2 && mode != 4:
// Standard 3-operand format
// sub_70B710(instr) -- scope qualifier
// sub_709FF0(instr) -- barrier type
// sub_70A530(instr) -- additional qualifier
else: // mode 2 or 4 (cta or sys)
if SM > 49:
// sm50+ uses 3-operand format with scope
else:
// Pre-sm50: 2-operand membar + separate scope annotation
The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS metadata records the byte offset of every membar.sys instruction in the output SASS. This allows the driver to apply software workarounds (WAR patches) at specific membar.sys locations for known hardware errata.
Fence Operations
Fence operations enforce ordering between different memory proxy domains. These are not exposed as separate PTX codegen handlers but are inserted by the compiler's synchronization passes (phases 25 and 72).
| PTX Instruction | SASS Opcode (sm100+) | Purpose |
|---|---|---|
| fence.proxy.alias | (expanded inline) | Orders generic/alias memory accesses |
| fence.proxy.async | (expanded inline) | Orders async copy completion visibility |
| fence.proxy.async.global | (expanded inline) | Global memory async fence |
| fence.sc.cta | FENCE_S | Sequentially-consistent fence, CTA scope |
| fence.sc.gpu | FENCE_G | Sequentially-consistent fence, GPU scope |
| fence.acq_rel.cta | FENCE_T | Acquire-release fence, CTA scope |
On Blackwell (sm100+), dedicated FENCE_G, FENCE_S, and FENCE_T SASS opcodes replace the older pattern of MEMBAR + proxy annotation sequences.
StageAndFence (Phase 25)
The StageAndFence pass (sub_1392E30, 166 bytes entry, sub_1390B30 8,956 bytes core) inserts fence instructions after loop unrolling to re-establish memory ordering correctness. When loop unrolling replicates memory operations that crossed a synchronization boundary in the original loop body, this pass inserts fence.proxy or MEMBAR pseudo-instructions at the boundaries of unrolled iterations.
The core function takes floating-point parameters (double/__m128d), indicating it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay them to overlap with independent computation.
Warp-Level Synchronization
WARPSYNC
WARPSYNC mask synchronizes the threads in a warp identified by the lane mask. This is the fundamental warp-level sync primitive on sm70+ (Volta and later).
| PTX | SASS | Purpose |
|---|---|---|
| bar.warp.sync membermask | WARPSYNC mask | Synchronize warp lanes specified by mask |
The intrinsic __cuda_sm70_warpsync (single entry in the sm70 block) is the simplest warp-sync intrinsic, and is detected by the same sub_A9A410 prefix matcher that handles vote and match.
BSSY / BSYNC (Convergence Barriers)
The BSSY / BSYNC instruction pair replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points:
| SASS Opcode | Purpose |
|---|---|
| BSSY B, target | Push a synchronization barrier; target is the reconvergence point |
| BSYNC B | Pop and wait at the convergence barrier B |
These are not directly exposed as PTX instructions -- they are inserted by the compiler during phase 72 (LateExpandSyncInstructions) and the architecture-specific sync expansion passes (phases 99, 100, 114). The EIATTR_SYNC_STACK metadata records the convergence barrier stack depth.
ELECT
ELECT.SYNC (sm75+) elects a single active lane from the warp, setting a predicate register to true for exactly one thread.
In the SASS opcode table, ELECT appears among the Blackwell-era additions alongside ENDCOLLECTIVE, PREXIT, SETMAXREG, and SETSMEMSIZE. The ELECT opcode is used for leader-based algorithms where only one thread per warp performs a shared operation.
Asynchronous Barriers (MBARRIER)
Introduced with sm90 (Hopper), mbarrier provides hardware-accelerated asynchronous barriers in shared memory. These are critical for async copy (cp.async.bulk), TMA operations, and warpgroup-level synchronization.
MBARRIER Operation Classification
ptxas classifies mbarrier operations through sub_A94440 (MBarrierDetector, 412 bytes binary) and sub_A9A5F0 (MBarrierClassifier). The classifier resolves the mbarrier symbol name (prefix %mbarrier_) and returns an operation type enum:
| Type ID | Suffix | Operation |
|---|---|---|
| 0 | INIT | Initialize barrier object in shared memory |
| 1 | ARRIVE | Signal arrival (non-blocking) |
| 2 | ARRIVE_NOCOMPLETE | Arrive without completing the phase |
| 3 | ARRIVE_DROP | Arrive and decrement expected count |
| 4 | ARRIVE_DROP_NOCOMPLETE | Arrive, drop, no completion |
| 5 | TEST_WAIT | Test if barrier phase is complete |
| 6 | TEST_WAIT_PARITY | Phase-parity-based completion test |
| 7 | CP_ASYNC_ARRIVE | cp.async arrival notification |
| 8 | INVAL | Invalidate barrier |
| 9 | TRY_WAIT | Wait with timeout |
| 10 | TRY_WAIT_PARITY | Phase-parity-based wait with timeout |
| 11 | EXPECT_TX | Set expected transaction byte count |
| 12 | COMPLETE_TX | Mark transaction bytes as complete |
The "trivial" mbarrier operations (types 0--8) are handled inline; "non-trivial" operations (types 9--12, including EXPECT_TX and the extended TRY_WAIT variants) require more complex lowering.
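The trivial/non-trivial split can be expressed directly from the type-ID table. The enum values and names below are taken from the table; the Python function itself is an illustrative sketch of the check that sub_A94440 (isNonTrivialMBarrier) performs:

```python
# Mbarrier operation type IDs as recovered from the classifier (sub_A9A5F0).
MBARRIER_OPS = {
    0: "INIT", 1: "ARRIVE", 2: "ARRIVE_NOCOMPLETE", 3: "ARRIVE_DROP",
    4: "ARRIVE_DROP_NOCOMPLETE", 5: "TEST_WAIT", 6: "TEST_WAIT_PARITY",
    7: "CP_ASYNC_ARRIVE", 8: "INVAL", 9: "TRY_WAIT", 10: "TRY_WAIT_PARITY",
    11: "EXPECT_TX", 12: "COMPLETE_TX",
}

def is_non_trivial(type_id: int) -> bool:
    # Types 0-8 are handled inline; types 9-12 take the complex lowering path.
    return type_id > 8
```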
Mbarrier Symbol Naming
Mbarrier objects are tracked through shared memory symbols following the pattern:
%mbarrier_{basename}_{operation}
The resolver at sub_A9A920 extracts the base name from the full symbol (e.g., %mbarrier_pipeline0_ARRIVE yields base name pipeline0). The format string "%%mbarrier_%s_%s" at sub_AA33C0 is used for symbol construction during mbarrier expansion.
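A sketch of the naming scheme: construction mirrors the recovered sprintf format "%%mbarrier_%s_%s" (the doubled '%' is sprintf escaping for a literal '%'), and base-name extraction mirrors what sub_A9A920 does. The Python helper names are ours:

```python
PREFIX = "%mbarrier_"

def make_mbarrier_symbol(base: str, op: str) -> str:
    # Python's %-formatting treats "%%" the same way sprintf does: literal '%'.
    return "%%mbarrier_%s_%s" % (base, op)

def mbarrier_base_name(symbol: str) -> str:
    # Strip the fixed prefix, then drop the trailing _OPERATION component.
    assert symbol.startswith(PREFIX)
    return symbol[len(PREFIX):].rsplit("_", 1)[0]
```

Note that base names containing underscores survive round-tripping because only the last `_`-separated component is treated as the operation suffix.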
Reserved shared memory regions for TMA pipeline mbarriers:
__nv_reservedSMEM_tmem_allocation_pipeline_mbarrier
__nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity
ExpandMbarrier (Phase 42)
Phase 42 expands mbarrier pseudo-instructions into native barrier sequences through architecture vtable dispatch:
mov rdi, [rsi+0x630] ; rdi = ctx->arch_backend (offset 1584)
mov rax, [rdi] ; rax = arch_backend->vtable
jmp [rax+0x168] ; call vtable[45] -- ExpandMbarrier impl
The expansion is architecture-specific:
- Pre-sm90: No mbarrier pseudo-ops exist; the phase is a no-op
- sm90 (Hopper): Expands to hardware mbarrier instruction sequences, resolves shared memory addresses, inserts fence.proxy for coherence
- sm100+ (Blackwell): Extended semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations
Expansion pattern:
Before (Ori pseudo-ops): After (native SASS):
MBARRIER_INIT %mbar, count MBARRIER.INIT [smem], count
MBARRIER_ARRIVE_EXPECT_TX MBARRIER.ARRIVE.EXPECT_TX [smem], bytes
%mbar, bytes
CP.ASYNC.BULK.TENSOR CP.ASYNC.BULK.TENSOR [dst], [src], [smem]
dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY MBARRIER.TRY_WAIT.PARITY pred, [smem],
%mbar, parity, pred parity
EIATTR Metadata
| EIATTR | Purpose |
|---|---|
EIATTR_NUM_MBARRIERS | Count of mbarrier objects used by the kernel |
EIATTR_MBARRIER_INSTR_OFFSETS | Byte offsets of mbarrier instructions for driver patching |
Blackwell CGA Barriers
On sm100+ (Blackwell), a new class of CGA (Cooperative Grid Array) barriers extends the mbarrier concept to cluster-level synchronization:
| SASS Opcode | Purpose |
|---|---|
| CGABAR_ARV | CGA barrier arrive |
| CGABAR_GET | Query CGA barrier state |
| CGABAR_SET | Set CGA barrier parameters |
| CGABAR_WAIT | Wait on CGA barrier |
| CGAERRBAR | CGA error barrier |
Atomic/Reduction Intrinsic Lowering
The OCG atomic/reduction handler at sub_6C0D90 (19KB, 813 decompiled lines) lowers atom.* and red.* intrinsic calls into Ori IR opcode 314 instructions. Unlike the warp-level sync handlers (which generate inline PTX via sprintf), this function works at the Ori IR level directly: it parses a sub-op parameter array, validates all qualifier combinations, resolves operands to register references, and emits the final instruction through sub_92C240. All diagnostics use error code 7308 and the prefix "Instrinsic - \"%s\"" (the typo is in the binary).
Parameter Array Parsing
The intrinsic name is decomposed by the OCG name parser (sub_6C9BC0) into an integer token array stored at *(ctx+10704). Each token encodes one qualifier dimension of the atomic operation. The handler reads tokens sequentially through a switch-case loop:
| Token | Variable | Semantic | Decoded Value |
|---|---|---|---|
| 0 | memory_order=4 | Memory ordering | Relaxed |
| 1 | domain=12, is_mmio=1 | Memory domain | MMIO (global, mapped) |
| 2 | domain=5 | Memory domain | Shared (_shared) |
| 3 | scope=5 | Visibility scope | .cta |
| 4 | scope=6 | Visibility scope | .gpu |
| 5 | is_noreturn=1 | Return behavior | Reduction (fire-and-forget, no return value) |
| 6 | data_size=2 | Operand width | 64-bit (u64) |
| 7 | data_size=4 | Operand width | Vector (v2/v4) |
| 8 | data_type=12 | Data type | .f32 |
| 9 | data_type=11 | Data type | .s32 |
| 10 | data_type=10 | Data type | .u32 (also .u64 when size=2) |
| 11 | op=0 | Operation | .add |
| 12 | op=1 | Operation | .min |
| 13 | op=2 | Operation | .max |
| 14 | op=3 | Operation | .inc |
| 15 | op=4 | Operation | .dec |
| 16 | op=5 | Operation | .and |
| 17 | op=6 | Operation | .or |
| 18 | op=7 | Operation | .xor |
Tokens not matching any case are silently skipped. If the parameter array is empty (no tokens), all values take defaults: data_type=1 (unspecified), op=-1 (unspecified), data_size=1 (32-bit), and all flags zero.
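The switch-case loop can be sketched as a table-driven decoder. The token-to-field assignments and the defaults come from the table and text above; the dictionary representation and function name are illustrative, not the binary's actual data structures:

```python
# Default qualifier state when the parameter array is empty (per the text above).
DEFAULTS = dict(memory_order=0, domain=0, is_mmio=0, scope=0,
                is_noreturn=0, data_size=1, data_type=1, op=-1)

# token -> list of (field, value) assignments, from the parameter-array table.
TOKEN_TABLE = {
    0: [("memory_order", 4)],
    1: [("domain", 12), ("is_mmio", 1)],
    2: [("domain", 5)],
    3: [("scope", 5)], 4: [("scope", 6)],
    5: [("is_noreturn", 1)],
    6: [("data_size", 2)], 7: [("data_size", 4)],
    8: [("data_type", 12)], 9: [("data_type", 11)], 10: [("data_type", 10)],
    11: [("op", 0)], 12: [("op", 1)], 13: [("op", 2)], 14: [("op", 3)],
    15: [("op", 4)], 16: [("op", 5)], 17: [("op", 6)], 18: [("op", 7)],
}

def decode_tokens(tokens):
    state = dict(DEFAULTS)
    for tok in tokens:
        for field, value in TOKEN_TABLE.get(tok, []):  # unknown tokens are skipped
            state[field] = value
    return state
```

For example, the token sequence [2, 3, 9, 11] decodes to a shared-domain, .cta-scope, s32, .add atomic.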
Modifier Word Encoding
The parsed qualifiers are packed into a single 32-bit modifier word that accompanies the Ori instruction through the pipeline to ISel:
Bit [14:13] = type encoding: 00=u32 01=s32 10=u64/f32 11=invalid
Bit [12:10] = operation: 0=add 1=min 2=max 3=inc 4=dec 5=and 6=or 7=xor
Bit [8] = no-return: 1=reduction (red.*) 0=atomic (atom.*)
Bit [7:5] = memory order: 4=relaxed (only value supported here)
Bit [4:2] = scope: 5=cta 6=gpu
Bit [1:0] = operand flags: bit0=(addr_type==u32) bit1=(data_type==u32)
Top nibble = 0x6 (constant marker: 0x60000000)
The type encoding bits [14:13] are set during cross-validation: s32 with 32-bit width sets 0x2000, u32/f32 with 64-bit width sets 0x4000, and invalid combinations set 0xE000 (with an error).
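The packing can be cross-checked with a short sketch. The bit positions and the 0x60000000 marker are taken verbatim from the layout above; the helper function and its parameter names are ours:

```python
def pack_modifier(type_bits, op, no_return, mem_order, scope, flags):
    # Field positions per the modifier-word layout above.
    word = 0x60000000                 # constant top-nibble marker
    word |= (type_bits & 0x3) << 13   # bits [14:13] type encoding
    word |= (op & 0x7) << 10          # bits [12:10] operation
    word |= (no_return & 1) << 8      # bit  [8]     red.* vs atom.*
    word |= (mem_order & 0x7) << 5    # bits [7:5]   memory order (4 = relaxed)
    word |= (scope & 0x7) << 2        # bits [4:2]   scope (5 = cta, 6 = gpu)
    word |= flags & 0x3               # bits [1:0]   operand-type flags
    return word

# red.shared.add.u32 with relaxed order and .cta scope, both operand flags set:
word = pack_modifier(type_bits=0, op=0, no_return=1, mem_order=4, scope=5, flags=3)
assert word == 0x60000197
```

The s32 cross-validation case (0x2000) corresponds to type_bits=1, i.e. bits [14:13] = 01.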
Validation Chain
The handler enforces a strict 10-phase validation sequence. Each failure emits error 7308 with a descriptive message:
| Phase | Check | Error Message |
|---|---|---|
| 1 | Domain must be _shared (5) or _mmio (12); otherwise global is assumed but errors if explicitly set to something else | "Domain param '_shared' or '_global' required" |
| 2 | Vector type operand count must match expected operand count from data_size | "Vector type does not match number of subops" |
| 3 | Data type must be explicitly set (not default 1) | "Type {u32, s32, u64} not specified" |
| 4 | Vector width (v12>1) requires u32 (10) or f32 (12) type | "Vector supported only for {u32, u64}" |
| 5 | Operation must be explicitly set (not default -1); emitted twice | "Op {add, min, max, inc, dec, and, or, xor} not specified" |
| 6 | Shared-domain reductions only support .add | "Unsupported non _add global memory reduction" |
| 7a | Scope without memory order is deprecated | "Deprecated scope without memory order semantics" (warning) |
| 7b | Memory order requires scope | "Required scope with memory order semantics" |
| 8 | MMIO semantics require global domain | "Domain param '_global' required for mmio semantics" |
| 9 | s32 requires 32-bit; f32+64-bit only with add; otherwise invalid | "Invalid data type / op combination" or "Invalid vector / data type combination" |
| 10 | Each data operand's type field must match declared type; address operand must be u32 (10) or f32 (12) | "Operand type does not match specified type" / "Unexpected instrinsic type (%s) in param (%d)" |
Operand Resolution and Shared Memory Address Materialization
After validation, the handler resolves up to three operand slots:
- Destination/address: Resolved via sub_926370 into a 24-bit register ID, then tagged with 0x50000000 (output register class marker).
- Source data operand: Read from the operand descriptor array at *(ctx+10728). Routing depends on bits [30:28] of the operand word:
  - Value 5 (shared memory pointer): Allocates a temporary register in class 6 via sub_91BF30, then emits an Ori opcode 130 pseudo-instruction via sub_934630 to materialize the shared memory address into a general register. This extra LEA/MOV is necessary because ATOMS requires an explicit register operand, not a symbolic shared-memory reference.
  - Value 1 with !(operand[1] & 0x1000000): Direct register reference (24-bit register ID from the low bits).
  - Otherwise: Full register resolution through sub_91D150 plus operand legalization through sub_7DEFA0.
- Second data operand (MMIO only, v109=1): Same three-way resolution for the second source, reading from operand descriptor offset +12.
The final instruction is emitted as:
sub_92C240(output, ctx, 314, data_type, operand_count, operand_buffer, 1)
where Ori opcode 314 represents the unified ATOM/RED operation.
SASS Opcode Selection
The Ori opcode 314 instruction flows through the optimizer pipeline and reaches ISel (sub_C0EB10), which selects the final SASS opcode based on the domain and no-return flag encoded in the modifier word:
| Modifier Bits | SASS Opcode | ROT13 | Table Entry | PTX Equivalent |
|---|---|---|---|---|
| domain=global, no-return=0 | ATOMG | NGBZT | 103 | atom.global.* |
| domain=shared, no-return=0 | ATOMS | NGBZF | 105 | atom.shared.* |
| domain=generic, no-return=0 | ATOM | NGBZ | 102 | atom.* |
| no-return=1 | RED | ERQ | 104 | red.* |
The operation bits [12:10] further select the SASS sub-opcode qualifier (.ADD, .MIN, .MAX, .INC, .DEC, .AND, .OR, .XOR). The type bits [14:13] determine the data type suffix (.32, .64, .S32, .U32, .F32).
Scope and Memory Order (sm70+)
When scope and memory order are both present, the modifier word carries them through to ISel where they become SASS instruction modifiers:
- Scope .cta (token 3, value 5): Atomic is visible only within the CTA
- Scope .gpu (token 4, value 6): Atomic is visible to all thread blocks on the device
- Memory order relaxed (token 0, value 4): No ordering guarantees beyond atomicity
The handler does not encode acquire, release, or acq_rel memory orders -- these are handled by the separate memory fence/order handler at sub_6C1CF0. The deprecation warning for scope-without-order indicates ptxas is transitioning toward requiring explicit memory order qualifiers for all scoped atomics.
Limitations and Notable Behavior
- No CAS/EXCH tokens: The parameter array parser has no tokens for .cas (compare-and-swap) or .exch (exchange). These operations are either encoded through a different OCG intrinsic or use a distinct sub-op encoding not visible in this function's switch-case.
- Shared-memory restriction: Only atom.shared.add is supported as a reduction. All other shared-memory reduction operations (red.shared.{min,max,and,or,xor}) are rejected.
- MMIO path: Token 1 (domain=MMIO) enables a separate code path that processes two data operands instead of one. This supports the MMIO atomic semantics where both the address and a data value must be explicitly resolved.
- Error message bug: The message "Unsupported non _add global memory reduction" fires for shared-memory non-add reductions despite saying "global". This is likely a copy-paste artifact in the ptxas source.
Synchronization Pipeline Summary
The complete synchronization processing pipeline spans 8 optimizer phases:
| Phase | Name | Category | Purpose |
|---|---|---|---|
| 25 | StageAndFence | Lowering | Insert fences after loop unrolling |
| 26 | OriRemoveRedundantBarriers | Optimization | Dataflow-driven barrier elimination (multi-function) |
| 42 | ExpandMbarrier | Lowering | Expand mbarrier pseudo-ops via arch vtable |
| 71 | OptimizeSyncInstructions | Optimization | Sync instruction redundancy elimination |
| 72 | LateExpandSyncInstructions | Lowering | Final sync pseudo-op expansion to SASS |
| 99 | (Architecture-specific) | Lowering | Post-RA sync expansion |
| 100 | (Architecture-specific) | Lowering | Architecture vtable dispatch for sync |
| 114 | (Post-scheduling) | Fixup | Post-scheduling dependency barrier fixup |
The progression is: early fence insertion (25) -> cross-function barrier elimination (26) -> mbarrier expansion (42) -> optimization within partial-SSA (71) -> final expansion (72) -> post-RA architecture hooks (99, 100) -> post-scheduling fixup (114).
Ori IR Opcode 130 -- Sync Analysis Target
The optimizer phases 26 and 71 identify synchronization instructions by checking for Ori opcode 130 (HSET2 in the ROT13 name table; used as an internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111). For each barrier instruction found:
- Extract the destination operand from field+84
- Resolve the register through the register table at context+88
- Test whether the register's use-count (reg+24) indicates consumption
- If the barrier result is dead (no thread observes it before the next dominating barrier), eliminate the barrier
- At block boundaries, attempt to merge barriers with compatible scopes
Knobs and Gating
| Knob / Flag | Effect |
|---|---|
| Knob 487 | Master gate for sync optimization (checked by phases 25, 71, 72) |
| Knob 358 | Sync mode selector (0=none, 1=conservative, 2=aggressive, 3+=full multi-function) |
| Knob 472 | Barrier liveness analysis mode |
| context+1368 bit 0 | Global sync optimization enable |
| context+1397 bits[6:7] | Architecture-specific sync configuration |
| context+1398 bits[3:4] | Sync expansion mode (architecture-dependent) |
| DisableErrbarAfterMembar | Suppress error barrier insertion after membar |
SASS Opcode Summary Table
Complete mapping of all synchronization and warp SASS opcodes, with their ROT13-encoded internal names as found in the ptxas binary:
| ROT13 (Binary) | SASS Opcode | Table Offset | Category |
|---|---|---|---|
| IBGR | VOTE | 4600 | Warp vote |
| IBGRH | VOTEU | -- | Uniform warp vote (sm100+) |
| ONE | BAR | 5160 | Thread-block barrier |
| ONE_VAQRKRQ | BAR_INDEXED | -- | Indexed barrier variant |
| REEONE | ERRBAR | 4184 | Error barrier |
| QRCONE | DEPBAR | -- | Dependency barrier (scoreboard) |
| ZNGPU | MATCH | -- | Warp match |
| ZRZONE | MEMBAR | -- | Memory barrier |
| JNECFLAP | WARPSYNC | -- | Warp synchronize |
| OFLAP | BSYNC | -- | Convergence barrier sync |
| OFFL | BSSY | -- | Convergence barrier set |
| E2O | R2B | 5128 | Register to barrier transfer |
| -- | ELECT | -- | Warp lane election (sm75+) |
| -- | NANOSLEEP | -- | Nanosleep hint |
| -- | FENCE_G | -- | Global fence (sm100+) |
| -- | FENCE_S | -- | Shared fence (sm100+) |
| -- | FENCE_T | -- | Texture fence (sm100+) |
| -- | CGABAR_ARV | -- | CGA barrier arrive (sm100+) |
| -- | CGABAR_GET | -- | CGA barrier query (sm100+) |
| -- | CGABAR_SET | -- | CGA barrier set (sm100+) |
| -- | CGABAR_WAIT | -- | CGA barrier wait (sm100+) |
| -- | CGAERRBAR | -- | CGA error barrier (sm100+) |
| -- | SYNCS_BASIC | -- | Basic sync (sm100+) |
| -- | SYNCS_LD_UNIFM | -- | Sync with uniform load (sm100+) |
| NGBZ | ATOM | 102 | Atomic (generic address space) |
| NGBZT | ATOMG | 103 | Atomic (global memory) |
| ERQ | RED | 104 | Reduction (fire-and-forget) |
| NGBZF | ATOMS | 105 | Atomic (shared memory) |
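The internal names are plain ROT13 over the ASCII letters; digits and underscores pass through unchanged, which is why E2O decodes to R2B and ONE_VAQRKRQ to BAR_INDEXED. A minimal decoder:

```python
import string

# ROT13 over uppercase ASCII only; all other characters are left as-is.
_UPPER = string.ascii_uppercase
_ROT13 = str.maketrans(_UPPER, _UPPER[13:] + _UPPER[:13])

def decode_opcode_name(name: str) -> str:
    return name.translate(_ROT13)

assert decode_opcode_name("IBGR") == "VOTE"
assert decode_opcode_name("JNECFLAP") == "WARPSYNC"
assert decode_opcode_name("NGBZF") == "ATOMS"
```

ROT13 is an involution, so the same function also re-encodes a SASS mnemonic back into its binary-internal form.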
Key Function Reference
| Address | Identity | Size | Purpose |
|---|---|---|---|
| sub_580E50 | VoteCodegen | ~3.2KB | PTX vote.* to inline PTX body |
| sub_5801D0 | ShflCodegen | ~3.3KB | PTX shfl.* to inline PTX body |
| sub_58A730 | MatchCodegen | ~4.5KB | PTX match.* to inline PTX body |
| sub_567680 | ReduxCodegen | ~2.0KB | PTX redux.* to inline PTX body |
| sub_524FB0 | BarSyncCodegen | ~1.8KB | PTX bar.sync / bar |
| sub_570290 | BarrierCodegen | ~2.5KB | PTX barrier.* (PTX 8.0) |
| sub_500BF0 | BarArriveCodegen | ~1.2KB | PTX bar.arrive |
| sub_570940 | BarrierArriveCodegen | -- | PTX barrier.arrive |
| sub_52D590 | BarRedCodegen | ~1.5KB | PTX bar.red.{and,or,popc} |
| sub_5889B0 | BarrierRedCodegen | ~4.8KB | PTX barrier.red |
| sub_56A5A0 | BarWarpCodegen | ~1.9KB | PTX bar.warp.sync |
| sub_4DB410 | MembarCodegen | ~0.8KB | PTX membar.* |
| sub_A9A410 | isSM70WarpSync | 194B | Intrinsic prefix detection |
| sub_A94440 | isNonTrivialMBarrier | 412B | Mbarrier operation classifier |
| sub_A9A5F0 | classifyMBarrier | -- | Mbarrier type enum resolver |
| sub_A9A920 | resolveMBarrierBaseName | -- | Extract mbarrier base name from symbol |
| sub_AA33C0 | constructMBarrierSymbol | -- | Build %%mbarrier_%s_%s symbol |
| sub_1392E30 | StageAndFence (phase 25) | 166B entry | Post-unroll fence insertion |
| sub_1390B30 | StageAndFence core | 8,956B | Main fence insertion logic |
| sub_790A40 | RemoveRedundantBarriers | 2,288B | Cross-function barrier elimination |
| sub_90A340 | OptimizeSyncInstructions | 1,670B | Sync instruction optimization |
| sub_1381DA0 | LateExpandSync | 1,517B | Final sync expansion |
| sub_6C0D90 | LowerAtomicRedIntrinsic | ~19KB | OCG atom.*/red.* to Ori opcode 314 |
| sub_C0EB10 | InstructionSelector | 185KB | Main ISel (handles all sync opcodes) |
Custom ELF Emitter
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas builds its ELF/cubin output without libelf or any external ELF library. The entire ELF construction pipeline is a custom implementation spread across approximately 20 functions in the 0x1C99--0x1CD1 address range, totaling roughly 300 KB of binary code. At the center is a 672-byte in-memory object called the "ELF world" (ELFW), which owns all sections, symbols, and string tables. The emitter writes standard ELF headers with NVIDIA extensions: EM_CUDA (0xBE / 190) as the machine type, NVIDIA-specific section types (SHT_CUDA_INFO = 0x70000064), and CUDA-specific ELF flags encoding the SM architecture version. The design handles both 32-bit and 64-bit ELF classes, with a standard class byte at e_ident[4] (1 = ELFCLASS32, 2 = ELFCLASS64) and a non-standard OSABI byte at e_ident[7] set to '3' (0x33) for 32-bit cubins or 'A' (0x41) for 64-bit ones. Finalization is a single-pass algorithm that orders sections into 8 priority buckets, assigns file offsets with alignment, and handles the ELF extended section index mechanism (SHN_XINDEX) when the section count exceeds 65,280.
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 callers) |
| Text section creator | sub_1CB42D0 (SHT_PROGBITS, SHF_ALLOC|SHF_EXECINSTR) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes, 29 KB decompiled) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes) |
| Section index remap | sub_1C99BB0 (4,900 bytes) |
| ELF structure dumper | sub_1CB91C0 (2,668 bytes) |
| File serializer | sub_1CD13A0 (2,541 bytes) |
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF magic | 0x464C457F (\x7fELF) |
| Memory pool | "elfw memory space" (4,096-byte initial) |
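The single-pass finalization described above (order sections into priority buckets, then assign aligned file offsets) can be sketched as follows. Only the bucket-then-align structure is recovered from sub_1C9DC60; the bucket keys, section names, and the 64-byte header base in this sketch are illustrative assumptions:

```python
def align_up(offset: int, alignment: int) -> int:
    # Round offset up to the next multiple of alignment (0 or 1 means none).
    return (offset + alignment - 1) & ~(alignment - 1) if alignment > 1 else offset

def layout_sections(sections, base=64):
    # sections: list of dicts with 'bucket' (0-7 priority), 'size', 'align'.
    offset = base  # file offset after the ELF header
    for sec in sorted(sections, key=lambda s: s["bucket"]):
        sec["offset"] = align_up(offset, sec["align"])
        offset = sec["offset"] + sec["size"]
    return offset  # total payload size before the section header table

secs = [
    {"name": ".nv.info", "bucket": 2, "size": 100, "align": 4},
    {"name": ".text.k0", "bucket": 1, "size": 384, "align": 128},
]
layout_sections(secs)
assert secs[1]["offset"] == 128        # .text.k0 placed first, aligned to 128
assert secs[0]["offset"] == 128 + 384  # .nv.info follows immediately
```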
Architecture
sub_446240 (real main -- compilation driver)
|
v
sub_612DE0 (47KB -- cubin generation entry)
| Parses: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
| Sets up setjmp/longjmp error recovery
| Handles recursive self-call for nested finalization
|
+-- sub_1CB53A0 -------- ELFW constructor (672-byte object)
| | Creates "elfw memory space" pool (4096 initial)
| | Writes ELF magic 0x464C457F into header
| | Sets e_machine = 0xBE (EM_CUDA)
| | Sets EI_CLASS = 1 (32-bit) or 2 (64-bit); OSABI byte = '3' or 'A'
| | Creates 7 standard sections:
| | .shstrtab, .strtab, .symtab, .symtab_shndx,
| | .note.nv.tkinfo, .note.nv.cuinfo, .nv.uft.entry
| v
+-- sub_1CB3570 x N ---- Section creator (44 call sites)
| | Creates section: name, type, flags, link, info, align, entsize
| | Auto-creates .rela/.rel companion for executable sections
| v
+-- sub_1CB42D0 --------- .text.<funcname> section creator
| | SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR
| | One section per kernel/function
| v
+-- sub_1CB68D0 --------- Symbol table builder (~1700 lines)
| | Iterates internal symbol list
| | Filters deleted symbols
| | Handles __cuda_syscall special symbol
| | Manages SHN_XINDEX overflow (>= SHN_LORESERVE)
| | Builds .symtab_shndx extended index table
| v
+-- sub_1CB2CA0 --------- Symbol fixup
| | Renumbers symbols after section deletion
| | Creates missing section symbols
| v
+-- sub_1C99BB0 --------- Section index remap
| | Reindexes sections after dead elimination
| | Remaps .symtab_shndx / .nv.merc.symtab_shndx
| v
+-- sub_1C9DC60 --------- Section layout calculator (29KB)
| | Computes section offsets and sizes
| | Skips .nv.constant0, .nv.reservedSmem
| | Handles .debug_line special padding
| v
+-- sub_1C9F280 --------- Master ELF emitter (97KB -- largest function)
| | Copies ELF header (64 bytes via SSE loadu)
| | Iterates sections via sub_1CB9FF0 / sub_1CB9C40
| | Skips virtual sections (flag & 4)
| | Patches ELF flags (SM version, ELFCLASS32/64)
| | Handles program headers
| | Embeds Mercury capsule if capmerc mode
| | Processes debug sections
| v
+-- sub_1CD13A0 --------- File serializer
| Iterates sections, writes with alignment padding
| Validates sizes: "section size mismatch"
| Handles 32-bit and 64-bit ELF formats
v
OUTPUT: .cubin / .o file on disk
ELFW Object -- sub_1CB53A0
The ELFW constructor allocates and initializes a 672-byte object that serves as the central data structure for the entire ELF construction pipeline. Every section, symbol, and string table lives under this object. The constructor is called exactly once per compilation unit.
Construction Sequence
// sub_1CB53A0 -- ELFW constructor (simplified)
void* elfw_init(int elf_class, int sm_version) {
// 1. Allocate 672-byte ELFW object from pool allocator
void* elfw = sub_424070(672);
// 2. Create dedicated memory pool
sub_4258D0("elfw memory space", 0, 4096);
// 3. Write ELF header
*(uint32_t*)(elfw + 0) = 0x464C457F; // e_ident[EI_MAG0..3] = "\x7fELF"
*(uint8_t*)(elfw + 4) = (elf_class == 64) ? 2 : 1; // EI_CLASS = ELFCLASS64/32
*(uint8_t*)(elfw + 7) = (elf_class == 64) ? 'A' : '3'; // EI_OSABI (non-standard)
*(uint16_t*)(elfw + 18) = 0xBE; // e_machine = EM_CUDA (190)
// 4. Initialize section/symbol/string containers
init_section_table(elfw);
init_symbol_table(elfw);
init_string_table(elfw);
// 5. Create 7 mandatory sections
add_section(elfw, ".shstrtab", SHT_STRTAB, 0, ...);
add_section(elfw, ".strtab", SHT_STRTAB, 0, ...);
add_section(elfw, ".symtab", SHT_SYMTAB, 0, ...);
add_section(elfw, ".symtab_shndx", SHT_SYMTAB_SHNDX, 0, ...);
add_section(elfw, ".note.nv.tkinfo", SHT_NOTE, 0, ...);
add_section(elfw, ".note.nv.cuinfo", SHT_NOTE, 0, ...);
add_section(elfw, ".nv.uft.entry", SHT_PROGBITS, 0, ...);
return elfw;
}
The ELFW object stores:
- The ELF header (first 64 bytes for 64-bit class, 52 for 32-bit)
- A section table (dynamic array of section descriptors)
- A symbol table (dynamic array of symbol entries)
- String tables for section names (
.shstrtab) and symbol names (.strtab) - Metadata for relocation processing and section ordering
ELFW Object Layout (672 bytes)
The 672-byte ELFW object divides into 13 regions. Offsets 0--63 overlay a standard ELF header (whose internal layout depends on ELF class). All pointer-sized fields are 8 bytes (the allocator returns 8-byte-aligned memory). The v17 variable in the decompilation is a uint64_t*, so v17[N] = byte offset N * 8.
Region 1 -- ELF Header Embed (bytes 0--63)
The ELF header is stored inline at the start of the ELFW object. Field positions within it vary by class (32-bit vs 64-bit), matching the standard Elf32_Ehdr / Elf64_Ehdr layout, except that EI_CLASS and EI_OSABI use non-standard CUDA values.
| Offset | Size | Name | Evidence |
|---|---|---|---|
| 0 | 4B | e_ident[EI_MAG0..3] | *(_DWORD*)v17 = 0x464C457F |
| 4 | 1B | e_ident[EI_CLASS] | (v11 != 0) + 1: 1 = ELFCLASS32, 2 = ELFCLASS64 |
| 5 | 1B | e_ident[EI_DATA] | Hardcoded 1 (little-endian) |
| 6 | 1B | e_ident[EI_VERSION] | Hardcoded 1 (EV_CURRENT) |
| 7 | 1B | e_ident[EI_OSABI] | 0x41 ('A') for 64-bit cubin, 0x33 ('3') for 32-bit |
| 8 | 1B | e_ident[EI_ABIVERSION] | Constructor parameter a3 |
| 16 | 2B | e_type | Constructor parameter a1 (cast to uint16) |
| 18 | 2B | e_machine | Hardcoded 0x00BE (EM_CUDA = 190) |
| 62 | 2B | e_shstrndx | *(_WORD*)(v17 + 31) -- set to .shstrtab section index |
For 32-bit class (EI_CLASS = 1):
| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 20 | 4B | e_version | *(_DWORD*)(v17 + 5) |
| 28 | 4B | e_phoff | *(_DWORD*)(a1 + 28) |
| 32 | 4B | e_shoff | *(_DWORD*)(a1 + 32) |
| 36 | 4B | e_flags | *(_DWORD*)(a1 + 36) -- dumper prints "flags=%x" |
| 44 | 2B | e_phnum | *(uint16*)(a1 + 44) -- dumper prints "phnum" |
| 48 | 2B | e_shnum | *(uint16*)(a1 + 48) -- dumper prints "shnum" |
For 64-bit class (EI_CLASS = 2):
| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 32 | 8B | e_phoff | *(_QWORD*)(a1 + 32) -- dumper prints "phoff=%llx" |
| 40 | 8B | e_shoff | *(_QWORD*)(a1 + 40) -- dumper prints "shoff=%llx" |
| 48 | 4B | e_flags | *(_DWORD*)(a1 + 48) -- dumper prints "flags=%x" |
| 56 | 2B | e_phnum | *(uint16*)(a1 + 56) -- dumper prints "phnum" |
| 60 | 2B | e_shnum | *(uint16*)(a1 + 60) -- dumper prints "shnum" |
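For the 64-bit class, the dumper accessors above line up with the standard Elf64_Ehdr layout. A sketch with compile-time offset checks (field names are the standard ELF ones; this is an illustration, not recovered source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes 0-63 of the ELFW object for EI_CLASS = 2, matching the
   dumper accessors: e_phoff at 32, e_shoff at 40, e_flags at 48,
   e_phnum at 56, e_shnum at 60, e_shstrndx at 62. */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type;       /* byte 16: constructor parameter a1 */
    uint16_t e_machine;    /* byte 18: 0x00BE (EM_CUDA) */
    uint32_t e_version;    /* byte 20 */
    uint64_t e_entry;      /* byte 24 */
    uint64_t e_phoff;      /* byte 32 */
    uint64_t e_shoff;      /* byte 40 */
    uint32_t e_flags;      /* byte 48 */
    uint16_t e_ehsize;     /* byte 52 */
    uint16_t e_phentsize;  /* byte 54 */
    uint16_t e_phnum;      /* byte 56 */
    uint16_t e_shentsize;  /* byte 58 */
    uint16_t e_shnum;      /* byte 60 */
    uint16_t e_shstrndx;   /* byte 62 */
} ElfwHeader64;
```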
Region 2 -- Metadata and Flags (bytes 64--107)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 64 | 1B | debugMode | a8 parameter: deviceDebug |
| 68 | 4B | compilationFlags | rawOptions & 0x70000 -- preserved option bits 16--18 |
| 72 | 4B | smVersion | a4 parameter: SM architecture number (e.g., 100 for Blackwell) |
| 76 | 4B | rawOptions | Full options bitmask a9, possibly OR'd with 0x80000 for relocatable |
| 80 | 1B | lineInfoMode | a6 parameter: lineInfo |
| 82 | 1B | is32bit | Dumper gate: controls 32-bit vs 64-bit format in sub_1CB91C0 |
| 83 | 1B | hasSymbolRemap | Set to *(WORD*)(v17 + 42) != 0 |
| 84 | 1B | flag_relocatable | rawOptions & 1 |
| 85 | 1B | flag_executable | (rawOptions & 2) != 0 |
| 86 | 1B | flag_PIC | (rawOptions & 0x200) != 0 -- position-independent code |
| 87 | 1B | flag_bit2 | (rawOptions & 4) != 0 |
| 88 | 1B | flag_bit3 | (rawOptions & 8) != 0 |
| 89 | 1B | flag_relocOrBit4 | (rawOptions >> 4) & 1, forced to 1 if relocatable mode |
| 90 | 1B | flag_bit5 | (rawOptions & 0x20) != 0 |
| 91 | 1B | flag_bit14 | (rawOptions & 0x4000) != 0 |
| 92 | 1B | flag_bit6 | (rawOptions & 0x40) != 0 |
| 93 | 1B | flag_byte1_bit0 | BYTE1(rawOptions) & 1 -- bit 8 of options |
| 94 | 1B | flag_archGuard | (a5 > 0x45) & (rawOptions >> 7) -- arch-gated feature |
| 96 | 1B | flag_bit11 | (rawOptions & 0x800) != 0 |
| 99 | 1B | flag_notBit12 | ((rawOptions >> 12) ^ 1) & 1 -- inverted bit 12 |
| 100 | 1B | flag_bit13 | (rawOptions & 0x2000) != 0 |
| 101 | 1B | highClass | (a9 & 0x8000) != 0 -- selects 64-bit header variant with wider ELF fields |
Region 3 -- Inline String Tables (bytes 108--171)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 108 | 32B | shstrtab | Section header string table, initialized via sub_1CB0530(v17 + 108, 1000) |
| 140 | 32B | strtab | Symbol name string table, initialized via sub_1CB0530(v17 + 140, 2000) |
These are 32-byte inline structures (not heap pointers). sub_1CB0530 initializes them with the given initial capacity (1000 and 2000 bytes respectively). The .shstrtab is also referenced by sub_1CA6650 during .note.nv.cuinfo attribute injection.
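A minimal sketch of the string-table behavior these structures must provide (the fixed buffer and function names are illustrative, not the recovered sub_1CB0530 layout, which is certainly a growable variant):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative ELF string table: names are appended NUL-terminated
   and the byte offset of each name is returned for use as sh_name
   or st_name. Index 0 is the empty string, per ELF convention. */
typedef struct {
    char   data[1000];  /* initial capacity, as passed to sub_1CB0530 */
    size_t used;
} StrTab;

static void strtab_init(StrTab *t) {
    t->data[0] = '\0';
    t->used = 1;
}

static size_t strtab_add(StrTab *t, const char *name) {
    size_t off = t->used;
    size_t len = strlen(name) + 1;
    memcpy(t->data + off, name, len);
    t->used += len;
    return off;  /* stored in the section/symbol header */
}
```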
Region 4 -- Section Index Cache (bytes 196--215)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 200 | 2B | strtabSecIdx | .strtab section index -- *(_WORD*)(v17 + 101) = v54 |
| 202 | 2B | symtabSecIdx | .symtab section index -- *(_WORD*)(v17 + 102) = v56 |
| 204 | 2B | xindexSecIdx | .symtab_shndx section index -- *(_WORD*)(v17 + 103) |
| 206 | 2B | cuinfoSecIdx | .note.nv.cuinfo section index -- *(_WORD*)(v17 + 104) |
| 208 | 2B | tkinfoSecIdx | .note.nv.tkinfo section index -- *(_WORD*)(v17 + 105) |
These cached indices avoid repeated linear scans of the section table when cross-referencing sections (e.g., .symtab's sh_link must point to .strtab).
Region 5 -- Sorted Maps and Counters (bytes 288--327)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 288 | 8B | sortedMap_A | Red-black tree via sub_425CA0, initial capacity 512 |
| 296 | 8B | sortedMap_B | Red-black tree via sub_425CA0, initial capacity 512 |
| 304 | 4B | mapCount | Counter for sorted maps, cleared to 0 |
| 312 | 8B | countPair | Packed 0x100000000 = high DWORD 1, low DWORD 0 |
| 320 | 4B | activeFlag | Set to 1 during initialization |
Region 6 -- Section and Symbol Containers (bytes 344--419)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 344 | 8B | sectionList_A | Indexed vector (cap=64) via sub_1CD2F90. Dumper: sub_1CD3060(*(a1+344)) returns symbol count |
| 352 | 8B | sectionList_B | Indexed vector (cap=64). Dumper: sub_1CD3060(*(a1+352)) returns secondary count |
| 360 | 8B | sectionList_C | Indexed vector (cap=64). Dumper iterates *(a1+360) for section dump loop |
| 368 | 8B | secIndexMap | Virtual-to-real section index map. Dumper: *(a1+368) + 4*v11 for reverse lookup |
| 376 | 8B | relocList | Linked list head for relocations. Dumper: for (k = *(a1+376); k; k = *k) |
| 392 | 8B | nvinfoList | Linked list head for .nv.info entries. Dumper: for (j = *(a1+392); j; j = *j) |
| 408 | 8B | auxVector | Indexed vector (cap=32) via sub_1CD2F90 |
| 416 | 4B | auxCount | Counter for auxVector, cleared to 0 |
sub_1CD2F90 creates an indexed vector (growable array with count/capacity tracking). sub_1CD3060 returns the element count; sub_1CD31F0 returns the element at the current iteration index.
Region 7 -- Deletion Remap Tables (bytes 456--479)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 456 | 8B | symDeleteMap | Symbol index remap table (positive indices). Dumper: *(a1+456) + 4*idx |
| 464 | 8B | symDeleteMapNeg | Symbol index remap table (negative indices). Dumper: *(a1+464) + 4*(-idx) |
| 472 | 8B | secDeleteMap | Section index remap table. Dumper: *(a1+472) + 4*idx |
After dead code elimination deletes sections and symbols, these tables map old indices to new indices. The negative-index variant handles symbols stored with inverted sign conventions (a ptxas-internal encoding for unresolved forward references).
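A sketch of how such a pair of remap tables would be consulted (function and table names are hypothetical; only the `4*idx` / `4*(-idx)` indexing comes from the dumper evidence):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lookup over the two remap tables. Positive old
   indices go through symDeleteMap (*(a1+456) + 4*idx); negative
   ones -- the unresolved-forward-reference encoding -- go through
   symDeleteMapNeg (*(a1+464) + 4*(-idx)). */
static int32_t remap_symbol_index(const int32_t *pos_map,
                                  const int32_t *neg_map,
                                  int32_t old_idx) {
    return (old_idx >= 0) ? pos_map[old_idx] : neg_map[-old_idx];
}
```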
Region 8 -- Architecture State (bytes 488--495)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 488 | 8B | archState | Architecture descriptor pointer. Initialized via sub_1CD04F0 (relocatable) or sub_1CCEEE0 (non-relocatable). Fatal error "couldn't initialize arch state" on failure |
Region 9 -- Name Sets and Input Tracking (bytes 496--519)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 496 | 8B | sectionNameSet | Sorted set of well-known section name strings. Populated from static table off_2403A60 (22 entries ending at dword_2403B70) |
| 512 | 8B | inputFileList | Indexed vector (cap=8). First entry is a 16-byte descriptor: {ptr="<input>", arch_version, ...} |
Region 10 -- Hash Maps (bytes 520--567)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 520 | 8B | hashMap_A | Hash map via sub_42D150, initial capacity 16 |
| 528 | 8B | hashMap_B | Hash map (cap=16) |
| 536 | 8B | hashMap_C | Hash map (cap=16) |
| 544 | 8B | hashMap_D | Hash map (cap=16) |
| 552 | 8B | hashMap_E | Hash map (cap=16) |
| 560 | 8B | hashMap_F | Hash map (cap=16) |
Six hash maps initialized identically with sub_42D150(sub_427630, sub_4277B0, 0x10). The two function pointers are the hash function and equality comparator. These maps serve section/symbol lookups during the construction pipeline. The specific role of each map (by-name, by-type, etc.) requires tracing callers of the hash map accessors.
Region 11 -- Extended Index and Miscellaneous (bytes 576--607)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 576 | 8B | smallSortedSet | Sorted set via sub_425CA0, element size 8 |
| 592 | 8B | symtabShndxVec | .symtab_shndx data vector. Dumper: sub_1CD31F0(*(a1+592)) for SHN_XINDEX resolution |
| 600 | 8B | mercSymtabShndx | .nv.merc.symtab_shndx data vector. Dumper: *(a1+600) for Mercury SHN_XINDEX |
Region 12 -- Memory Pool and Tail (bytes 608--671)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 608 | 8B | memoryPool | "elfw memory space" pool pointer. Only set when (a9 & 0x400) != 0 |
| 616 | 8B | memoryPoolCursor | Pool allocation cursor |
| 624 | 4B | elfFormatVersion | sub_1C97990() return value -- ELF format version from global config |
| 664 | 8B | tailSentinel | v17[83] = 0 -- zeroed during init, marks end of object |
Visual Layout
ELFW Object (672 bytes = 0x2A0)
+---------+------+--------------------------------------------------+
| 0x000 | 64B | ELF Header Embed (Elf32_Ehdr or Elf64_Ehdr) |
+---------+------+--------------------------------------------------+
| 0x040 | 44B | Metadata + Option Flags (debugMode, smVersion, |
| | | rawOptions, 16 boolean flags) |
+---------+------+--------------------------------------------------+
| 0x06C | 64B | Inline String Tables |
| | | +0x06C: shstrtab (32B, cap=1000) |
| | | +0x08C: strtab (32B, cap=2000) |
+---------+------+--------------------------------------------------+
| 0x0AC | 36B | Section Index Cache + Padding |
| | | 5 x uint16 section indices (.strtab, .symtab, |
| | | .symtab_shndx, .cuinfo, .tkinfo) |
+---------+------+--------------------------------------------------+
| 0x120 | 40B | Sorted Maps + Counters |
+---------+------+--------------------------------------------------+
| 0x158 | 76B | Section/Symbol Containers |
| | | 3 indexed vectors (sections), index map, |
| | | relocation list, nvinfo list, aux vector |
+---------+------+--------------------------------------------------+
| 0x1C8 | 24B | Deletion Remap Tables (sym+, sym-, sec) |
+---------+------+--------------------------------------------------+
| 0x1E8 | 8B | Architecture State Pointer |
+---------+------+--------------------------------------------------+
| 0x1F0 | 24B | Name Sets + Input Tracking |
+---------+------+--------------------------------------------------+
| 0x208 | 48B | Six Hash Maps (16-entry initial) |
+---------+------+--------------------------------------------------+
| 0x240 | 32B | Extended Index Vectors + Small Sorted Set |
+---------+------+--------------------------------------------------+
| 0x260 | 64B | Memory Pool + Format Version + Tail |
+---------+------+--------------------------------------------------+
ELF Class and OSABI Selection
Two e_ident bytes control the output format. The class byte at offset 4 carries the standard values (1 = ELFCLASS32, 2 = ELFCLASS64), as shown in the Region 1 evidence table. The CUDA-specific marker lives in the OSABI byte at offset 7:
| EI_OSABI byte | Standard ELF | ptxas meaning |
|---|---|---|
| '3' (0x33) | n/a | 32-bit CUDA ELF |
| 'A' (0x41) | n/a | 64-bit CUDA ELF |
Standard ELF reserves this byte for the OS/ABI identifier. The values '3' and 'A' are a CUDA-specific convention that identifies the binary as a cubin rather than a generic ELF; the CUDA driver recognizes them during cubin loading.
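Putting the identification bytes together, a hypothetical checker for a 64-bit cubin (per the Region 1 evidence table: standard class byte at offset 4, CUDA marker in the OSABI byte at offset 7):

```c
#include <assert.h>
#include <stdint.h>

/* is_cubin64 is an illustrative checker, not recovered code. */
static int is_cubin64(const uint8_t *e_ident) {
    return e_ident[0] == 0x7F && e_ident[1] == 'E'
        && e_ident[2] == 'L'  && e_ident[3] == 'F'
        && e_ident[4] == 2        /* ELFCLASS64 */
        && e_ident[7] == 0x41;    /* 'A' -- 64-bit CUDA marker */
}
```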
Section Creator -- sub_1CB3570
The generic section creation function, called from 44 sites throughout the ELF construction pipeline. It accepts the full set of ELF section header parameters and optionally creates a companion relocation section.
// sub_1CB3570 -- add section to ELFW (simplified)
int add_section(elfw_t* elfw, const char* name, uint32_t type, uint64_t flags,
                uint32_t link, uint32_t info, uint64_t align, uint64_t entsize) {
// 1. Add name to .shstrtab, get string table offset
int name_idx = strtab_add(elfw->shstrtab, name);
// 2. Allocate section descriptor, fill ELF section header fields
section_t* sec = alloc_section(elfw);
sec->sh_name = name_idx;
sec->sh_type = type;
sec->sh_flags = flags;
sec->sh_link = link;
sec->sh_info = info;
sec->sh_addralign = align;
sec->sh_entsize = entsize;
// 3. For executable sections, auto-create relocation companion
if (flags & SHF_EXECINSTR) {
char rela_name[256];
snprintf(rela_name, sizeof(rela_name), ".rela%s", name);
// -- or ".rel%s" depending on ELF class --
section_t* rela = alloc_section(elfw);
rela->sh_type = SHT_RELA; // or SHT_REL for the 32-bit class
rela->sh_link = symtab_index; // cached .symtab section index (ELFW offset 202)
rela->sh_info = sec->index; // section the relocations apply to
}
return sec->index;
}
The assertion "adding function section after callgraph completed" fires if a section is added after the call graph analysis phase has already run. This enforces the ordering constraint: all .text.<funcname> sections must exist before dead code elimination and call graph construction begin.
Text Section Creator -- sub_1CB42D0
Creates a per-function code section with the naming convention .text.<funcname>:
| Field | Value |
|---|---|
| sh_type | SHT_PROGBITS (1) |
| sh_flags | SHF_ALLOC | SHF_EXECINSTR (0x6) |
| Section name | .text.<funcname> |
| Companion | .rela.text.<funcname> (auto-created) |
Each kernel entry point and each device function gets its own .text section. This per-function section layout enables the linker (nvlink) to perform function-level dead code elimination and allows the CUDA driver to load individual kernels.
Symbol Table Builder -- sub_1CB68D0
One of the largest functions in the ELFW subsystem at 9,578 bytes (approximately 1,700 decompiled lines); only the master emitter sub_1C9F280 is bigger. It constructs the .symtab section from the internal symbol representation, handling several CUDA-specific concerns.
Processing Steps
- Iterate internal symbol list -- walks the ELFW symbol container
- Filter deleted symbols -- skips entries marked deleted, emits the "reference to deleted symbol" warning (12 occurrences of this check in the function)
- Handle __cuda_syscall -- special-cases the CUDA syscall dispatcher symbol, which serves as the entry point for device-side system calls (vprintf, malloc, __assertfail, etc.)
- Compute symbol values/sizes -- resolves virtual addresses from section offsets
- Create section symbols -- ensures every section has a corresponding STT_SECTION symbol
- Handle SHN_XINDEX overflow -- when the section index exceeds SHN_LORESERVE (0xFF00 = 65,280), the symbol's st_shndx field is set to SHN_XINDEX (0xFFFF) and the real index is stored in the .symtab_shndx table
- Build .symtab_shndx -- populates the extended section index table for overflow cases
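The overflow convention in the last two steps is the standard ELF one; reader-side, the real section index is recovered like this (a sketch, not recovered code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHN_XINDEX 0xFFFF

/* Resolve a symbol's section index: ordinary values come straight
   from st_shndx; SHN_XINDEX redirects to the parallel
   .symtab_shndx array, indexed by the symbol's position. */
static uint32_t real_section_index(uint16_t st_shndx,
                                   const uint32_t *symtab_shndx,
                                   size_t sym_idx) {
    return (st_shndx == SHN_XINDEX) ? symtab_shndx[sym_idx]
                                    : (uint32_t)st_shndx;
}
```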
Error Messages
| String | Condition |
|---|---|
"reference to deleted symbol" | Symbol was deleted but still referenced (12 checks) |
"ignore symbol %s in unused section" | Symbol in dead-eliminated section |
"missing sec strtab" | String table not initialized |
"missing std sections" | Standard sections (.shstrtab, .strtab, .symtab) missing |
"overflow number of sections %d" | Section count exceeds ELF limits |
CUDA Syscall Functions
The __cuda_syscall symbol is the dispatcher for device-side system calls. The known syscall functions referenced throughout the ptxas binary:
| Syscall | Purpose |
|---|---|
| vprintf | Device-side formatted output |
| malloc | Device-side dynamic memory allocation |
| free | Device-side memory deallocation |
| __assertfail | Device assertion failure handler |
| __profile | Profiling counter increment |
| cnpGetParameterBuffer | Cooperative launch parameter access |
These are compiled as indirect calls through the __cuda_syscall dispatch mechanism. The symbol __cuda_syscall_32f3056bbb (observed in binary strings) is a hash-mangled variant used for linking.
Section Layout Calculator -- sub_1C9DC60
Computes file offsets and virtual addresses for all sections in the ELF. This is a multi-pass algorithm that respects alignment constraints and handles several special cases.
Special Section Handling
| Section | Treatment |
|---|---|
| .nv.constant0 | Skipped (handled separately by OCG constant bank allocation) |
| .nv.reservedSmem | Skipped (shared memory layout computed by master allocator sub_1CABD60) |
| .debug_line | Receives special alignment padding for DWARF line table requirements |
The layout calculator assigns offsets in section-table order, which itself is determined by the 8-bucket priority sort performed during finalization.
ELF Finalization -- sub_1C9F280
The master ELF emitter at 15,263 binary bytes (97 KB decompiled) is the single largest function in the post-codegen address range. It assembles the complete ELF output from the ELFW internal representation.
Execution Flow
- Copy ELF header -- 64 bytes transferred via SSE loadu (128-bit unaligned loads) for performance
- Iterate sections -- uses the accessor pair sub_1CB9FF0 (section count) / sub_1CB9C40 (get section by index)
- Skip virtual sections -- sections with flag & 4 set have no file data (metadata-only)
- Filter .nv.constant0 -- detected via strstr(".nv.constant0"), handled by separate constant bank logic
- Copy section headers -- SIMD-width stride memcpy of section header entries
- Patch ELF flags -- mask 0x7FFFBFFF clears CUDA-specific flag bits, then sets SM version and relocatable/executable mode
- Emit program headers -- creates PT_LOAD segments for loadable sections
- Build symbol table -- delegates to sub_1CB68D0
- Resolve section indices -- handles cross-references between sections
- Embed Mercury capsule -- if capmerc mode, embeds the .nv.merc.* sections
- Process debug sections -- maps .debug_info, .debug_line, .debug_frame sections
- Error recovery -- uses _setjmp for non-local error exit on fatal corruption
Section Ordering -- 8 Priority Buckets
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF. The bucket assignment ensures the correct layout for the CUDA driver's section scanner:
| Bucket | Typical Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.*, EIATTR sections |
| 7 (lowest) | .debug_*, .nv.merc.* (debug/mercury metadata) |
Offset Assignment and Alignment
Each section's file offset is aligned to its sh_addralign value. The algorithm walks the sorted section list, advancing a running offset counter with alignment padding:
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    if (sec->sh_addralign > 1)  // sh_addralign is 0 or a power of two per the ELF spec
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}
Extended Section Index Handling (SHN_XINDEX)
When the total section count exceeds 65,280 (SHN_LORESERVE = 0xFF00), standard ELF e_shnum cannot hold the value. The emitter activates the SHN_XINDEX mechanism:
- Sets e_shnum = 0 in the ELF header
- Stores the real section count in sh_size of section header index 0 (the null section)
- Sets e_shstrndx = SHN_XINDEX (0xFFFF)
- Stores the real .shstrtab index in sh_link of section header index 0
This is the standard ELF extension for large section counts, and it is necessary for large CUDA programs with many kernels (each kernel generates at minimum a .text, .rela.text, .nv.info, and .nv.constant0 section).
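The four steps can be condensed into one writer-side helper (structures trimmed to the relevant fields; names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00
#define SHN_XINDEX    0xFFFF

/* Only the two fields of the null section header that matter here. */
typedef struct { uint32_t sh_link; uint64_t sh_size; } NullShdr;

/* Patch e_shnum/e_shstrndx for large section counts, spilling the
   real values into section header index 0 (the null section). */
static void patch_section_counts(uint16_t *e_shnum, uint16_t *e_shstrndx,
                                 NullShdr *null_sec,
                                 uint64_t shnum, uint32_t shstrndx) {
    if (shnum >= SHN_LORESERVE) {
        *e_shnum = 0;
        null_sec->sh_size = shnum;      /* real section count */
        *e_shstrndx = SHN_XINDEX;
        null_sec->sh_link = shstrndx;   /* real .shstrtab index */
    } else {
        *e_shnum = (uint16_t)shnum;
        *e_shstrndx = (uint16_t)shstrndx;
    }
}
```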
Cubin Generation Entry -- sub_612DE0
The top-level ELF/cubin generation function at 47 KB. Called from the compilation driver sub_446240 after all per-kernel OCG passes complete. This function orchestrates the entire output pipeline.
Key Behaviors
- Option parsing -- reads compilation flags: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
- Fastpath optimization -- the "Finalizer fastpath optimization" string indicates a fast path for cross-target finalization when no complex linking is needed
- Version embedding -- writes "Cuda compilation tools, release 13.0, V13.0.88" and build ID "Build cuda_13.0.r13.0/compiler.36424714_0" into the cubin
- Error recovery -- establishes its own setjmp/longjmp frame independent of the top-level driver's
- Recursive self-call -- handles nested finalization for scenarios where the output pipeline must invoke itself (e.g., generating both a primary cubin and an embedded Mercury cubin simultaneously)
Symbol Fixup -- sub_1CB2CA0
Adjusts symbol indices after dead code elimination removes sections from the ELFW. When sections are deleted, all symbol references to those sections become stale and must be renumbered.
For each section in ELFW:
If section lacks a STT_SECTION symbol:
Create one
If section has multiple STT_SECTION symbols:
Emit "found multiple section symbols for %s"
Renumber all symbol st_shndx values to match new section indices
Called from 4 sites, indicating it runs at multiple points during the output pipeline (after dead function elimination, after mercury section cloning, etc.).
Section Index Remap -- sub_1C99BB0
Handles the .symtab_shndx and .nv.merc.symtab_shndx extended index tables when section indices change. This is the companion to sub_1CB2CA0: while that function fixes symbol st_shndx fields, this one fixes the extended index tables that hold the real indices when SHN_XINDEX is in use.
ELF Structure Dumper -- sub_1CB91C0
Debug-mode function that prints a formatted dump of the ELFW internal state. Triggered by internal debug flags, not by any user-visible CLI option.
Output format:
elfw structure:
header: size=%d type=%d abiv=%d, flags=%x
shnum=%d shoff=%d
phnum=%d phoff=%d
section <v/r>:
[idx] name type flags offset size link info align entsize
symbol <v/r>:
[idx] name value size bind type shndx
The <v/r> suffix indicates virtual (v) or real (r) mode, corresponding to whether the dump shows the in-memory intermediate state or the final file-ready values. Both 32-bit and 64-bit format strings are present.
File Serializer -- sub_1CD13A0
The final step: writes the assembled ELF binary to disk. Called from 2 sites (main cubin and Mercury capsule cubin).
Validation Checks
| Check | Error String |
|---|---|
| Section data size negative | "Negative size encountered" |
| Computed size != declared size | "section size mismatch" |
| Write failure | "writing file" (logged 12 times across write operations) |
The serializer handles both 32-bit and 64-bit ELF formats, adjusting section header entry sizes (40 bytes for 32-bit, 64 bytes for 64-bit) and symbol table entry sizes (16 bytes for 32-bit, 24 bytes for 64-bit).
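The class-dependent entry sizes quoted above are the standard Elf32/Elf64 ones; as tiny helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Standard ELF entry sizes, per class:
   Elf32_Shdr = 40 B, Elf64_Shdr = 64 B;
   Elf32_Sym  = 16 B, Elf64_Sym  = 24 B. */
static uint32_t shdr_entsize(int is64) { return is64 ? 64 : 40; }
static uint32_t sym_entsize(int is64)  { return is64 ? 24 : 16; }

/* e.g. a 64-bit .symtab with 100 entries occupies 100 * 24 bytes */
```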
ELF Header Layout
The ELF header written by the ELFW constructor follows the standard ELF format with CUDA-specific overrides:
| Offset | Size | Field | Value |
|---|---|---|---|
| 0x00 | 4B | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1B | e_ident[EI_CLASS] | 0x01 (ELFCLASS32) or 0x02 (ELFCLASS64) |
| 0x05 | 1B | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1B | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1B | e_ident[EI_OSABI] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) -- CUDA cubin marker |
| 0x12 | 2B | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x14 | 4B | e_version | 0x00000001 |
| 0x24 | 4B | e_flags | SM version + CUDA flags (masked via 0x7FFFBFFF) |
The e_flags field encodes the target SM architecture (e.g., sm_100 for Blackwell) and several CUDA-specific flags including relocatable object mode vs executable mode. The mask 0x7FFFBFFF clears bits 14 and 31, which are reserved CUDA control bits.
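The exact bit layout of e_flags is not recovered in this section. As an illustration only, assume the SM number occupies the low byte (consistent with cubins observed in the wild) and that the 0x7FFFBFFF mask is applied first; helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative e_flags patching: clear reserved CUDA control bits
   14 and 31, then write the SM version into the low byte.
   ASSUMPTION: the low-byte placement of the SM number is an
   observation about real cubins, not a recovered constant. */
static uint32_t patch_e_flags(uint32_t flags, uint32_t sm) {
    uint32_t f = flags & 0x7FFFBFFF;    /* clears bits 14 and 31 */
    return (f & ~0xFFu) | (sm & 0xFFu);
}

static uint32_t e_flags_sm(uint32_t flags) { return flags & 0xFFu; }
```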
NVIDIA-Specific Section Types
Beyond standard ELF section types, the emitter uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range:
| Type Constant | Value | Section |
|---|---|---|
| SHT_CUDA_INFO | 0x70000000 | .nv.info.* -- per-entry EIATTR attributes |
| SHT_CUDA_CALLGRAPH | 0x70000064 | .nv.callgraph -- inter-function call edges |
| SHT_CUDA_COMPAT | 0x70000086 | .nv.compat -- forward-compatibility attributes |
The magic constant 1879048292 appearing in the emitter decompilation is 0x70000064 -- the SHT_CUDA_CALLGRAPH value, which also serves as the base of the CUDA-specific type range check in the section creator. The full inventory of these types appears in the Section Catalog & EIATTR page.
Cross-References
- Section Catalog & EIATTR -- complete inventory of section types and EIATTR attributes
- Relocations & Symbols -- relocation resolution and UFT/UDT management
- Debug Information -- DWARF generation and .debug_* section handling
- Mercury Encoder -- Mercury/capmerc encoding that feeds into the ELF emitter
- Pipeline Overview -- where the ELF phase fits in the compilation pipeline
- Memory Pool Allocator -- the sub_424070 pool allocator used by ELFW
Function Map
| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
| sub_1CB53A0 | 3,480 B | 13 KB | 1 | 25 | ELFW constructor (672-byte object) |
| sub_1CB3570 | 1,963 B | 10 KB | 44 | 13 | Section creator (.rela/.rel auto-create) |
| sub_1CB42D0 | -- | -- | -- | -- | .text.<funcname> section creator |
| sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
| sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
| sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
| sub_1C9DC60 | 5,663 B | 29 KB | 1 | 24 | Section layout calculator |
| sub_1C9F280 | 15,263 B | 97 KB | 1 | 42 | Master ELF emitter (assembles complete output) |
| sub_1CB91C0 | 2,668 B | 13 KB | 1 | 5 | ELF structure dumper (debug) |
| sub_1CD13A0 | 2,541 B | 11 KB | 2 | 6 | File serializer (final write to disk) |
| sub_612DE0 | ~12,000 B | 47 KB | 1 | -- | Cubin generation entry point |
| sub_1CB9FF0 | -- | -- | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | -- | -- | Get section by index |
Section Catalog & EIATTR
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
A CUDA cubin is a standard ELF container with NVIDIA-proprietary extensions. ptxas v13.0.88 populates it with approximately 4*(N+M) sections minimum for a program with N entry kernels and M device functions. Each section carries a specific kind of data -- SASS instructions, constant bank contents, relocation entries, per-kernel resource metadata (EIATTR), shared memory layout, debug information, or Mercury-encoded streams for deferred finalization. This page catalogs every section type ptxas can emit, the NVIDIA-specific ELF section types used, the section ordering rules, and the complete EIATTR attribute encoding.
| Section attribute builder | sub_60FBF0 (76 KB decompiled -- per-kernel section config + codegen launch) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Text section creator | sub_1CB42D0 (SHF_ALLOC | SHF_EXECINSTR) |
| nvinfo section creator | sub_1CC7FB0 (creates .nv.info / .nv.info.<func>) |
| EIATTR record emitter | sub_1CC85F0 (emits one TLV record) |
| EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled, 2,786 lines) |
| EIATTR propagator | sub_1CC8950 (2,634 bytes -- barrier/register propagation) |
| .nv.compat handler | sub_1CC93A0 (.nv.compat attribute processor) |
| Call graph builder | sub_1CBE1B0 (.nv.callgraph section) |
| Layout calculator | sub_1C9DC60 (5,663 bytes -- offset assignment) |
| Master section allocator | sub_1CABD60 (11,856 bytes -- shared/constant/local addresses) |
| SHT_CUDA_INFO | 0x70000000 (1,879,048,192) |
| SHT_CUDA_CALLGRAPH | 0x70000064 (1,879,048,292) |
| .nv.compat section type | 0x70000086 (1,879,048,326) |
NVIDIA-Specific Section Types
Beyond standard ELF section types (SHT_PROGBITS, SHT_STRTAB, SHT_SYMTAB, SHT_RELA, SHT_NOTE), ptxas uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range (0x70000000--0x7FFFFFFF):
| Constant | Value | Decimal | Used by |
|---|---|---|---|
| SHT_CUDA_INFO | 0x70000000 | 1,879,048,192 | .nv.info, .nv.info.<func> |
| SHT_CUDA_CALLGRAPH | 0x70000064 | 1,879,048,292 | .nv.callgraph |
| SHT_CUDA_COMPAT | 0x70000086 | 1,879,048,326 | .nv.compat |
The section creator sub_1CB3570 contains a range check on CUDA-specific types:
// sub_1CB3570 -- section type validation
if (elf_mode != 1 && is_relocatable
&& ((sh_type - 0x70000064) <= 0x1A || sh_type == 0x70000006)) {
// These CUDA section types require special handling in relocatable mode
}
This tells us that NVIDIA reserves the range 0x70000064--0x7000007E (27 types) plus 0x70000006 for CUDA-specific sections that receive special treatment in relocatable object mode.
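Restated as a predicate (the unsigned subtract-and-compare idiom is exactly what the decompiler shows):

```c
#include <assert.h>
#include <stdint.h>

/* True for the 27 reserved types 0x70000064..0x7000007E plus
   0x70000006 -- the CUDA section types that get special handling
   in relocatable mode. */
static int is_special_cuda_type(uint32_t sh_type) {
    return (sh_type - 0x70000064u) <= 0x1Au || sh_type == 0x70000006u;
}
```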
Complete Section Catalog
Standard ELF Infrastructure Sections
Created unconditionally by the ELFW constructor (sub_1CB53A0). These form the skeleton of every cubin.
| Section | Type | Flags | Purpose |
|---|---|---|---|
| (null) | SHT_NULL | -- | Required ELF null section (index 0) |
| .shstrtab | SHT_STRTAB | -- | Section name string table |
| .strtab | SHT_STRTAB | -- | Symbol name string table |
| .symtab | SHT_SYMTAB | -- | Symbol table |
| .symtab_shndx | SHT_SYMTAB_SHNDX | -- | Extended section indices (when section count > 65,280) |
NVIDIA Note Sections
Created unconditionally. Carry module-level metadata the CUDA driver reads before launching any kernel.
| Section | Type | Flags | Purpose |
|---|---|---|---|
| .note.nv.tkinfo | SHT_NOTE | -- | Toolkit info: version string, build ID, CLI arguments |
| .note.nv.cuinfo | SHT_NOTE | -- | CUDA info: SM version, feature flags |
| .note.nv.cuver | SHT_NOTE | -- | CUDA version note |
Per-Kernel Code Sections
Created by sub_1CB42D0, one set per kernel entry and device function:
| Section | Type | Flags | sh_link | Purpose |
|---|---|---|---|---|
| .text.<func> | SHT_PROGBITS | SHF_ALLOC | SHF_EXECINSTR (0x6) | -- | SASS instruction bytes |
| .rela.text.<func> | SHT_RELA | -- | .symtab index | Relocations for the code section |
The .rela companion is auto-created by the section creator when SHF_EXECINSTR is set. The assertion "adding function section after callgraph completed" fires if a code section is added after call graph analysis.
Per-Kernel Metadata Sections
| Section | Type | Flags | sh_link | Purpose |
|---|---|---|---|---|
| .nv.info.<func> | SHT_CUDA_INFO | SHF_LINK_ORDER (0x40) | .text.<func> symbol | EIATTR TLV records for this kernel |
| .nv.constant0.<func> | SHT_PROGBITS | SHF_ALLOC | -- | Constant bank 0: kernel params + literal constants |
| .nv.shared.<func> | SHT_NOBITS | SHF_ALLOC | SHF_WRITE | -- | Shared memory layout (size only, no file data) |
| .nv.local.<func> | SHT_NOBITS | SHF_ALLOC | SHF_WRITE | -- | Local (spill) memory layout |
The .nv.info.<func> section uses SHF_LINK_ORDER (flag 0x40) to declare its association with the function's symbol. The SHT_CUDA_INFO type value 0x70000000 is used; note that the nvlink wiki previously documented 0x70000064 for this -- the discrepancy arises because nvlink uses a different constant in its own emitter. Binary evidence from ptxas shows sub_1CC7FB0 consistently passes 1879048192 (0x70000000).
Global Metadata Sections
| Section | Type | Flags | Purpose |
|---|---|---|---|
| .nv.info | SHT_CUDA_INFO | -- | Global EIATTR attributes (sh_link = 0, not per-function) |
| .nv.compat | SHT_CUDA_COMPAT | -- | Forward-compatibility attributes (sm version negotiation) |
| .nv.metadata | SHT_PROGBITS | -- | Module-level metadata |
| .nv.callgraph | SHT_CUDA_CALLGRAPH | -- | Inter-function call edges (relocatable mode, -c) |
| .nv.prototype | SHT_PROGBITS | -- | Prototype information for cross-module linking |
| .nv.rel.action | SHT_PROGBITS | -- | Relocation action table |
| .nv.resolvedrela | SHT_PROGBITS | -- | Resolved relocations (post-linking) |
| .nv.host | SHT_PROGBITS | -- | Host-side interop data |
Constant Banks
CUDA supports up to 18 numbered constant banks (0--17) plus named constant sections:
| Section | Purpose |
|---|---|
| .nv.constant0 | Merged constant bank 0 (whole-program mode) |
| .nv.constant0.<func> | Per-function constant bank 0 (kernel params + compiler constants) |
| .nv.constant1 -- .nv.constant17 | User-declared __constant__ variables |
| .nv.constant.entry_params | Entry point parameter block |
| .nv.constant.entry_image_header_indices | Texture/surface header index table |
| .nv.constant.driver | Driver-injected constants |
| .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| .nv.constant.user | User-specified constants |
| .nv.constant.pic | Position-independent code constants |
| .nv.constant.tools_data | Tools/debugger-injected data |
The layout calculator sub_1C9DC60 skips .nv.constant0 sections during offset assignment because their addresses are managed by the OCG constant bank allocator, not the ELF layout engine.
Shared Memory Sections
| Section | Purpose |
|---|---|
.nv.shared.<func> | Per-kernel shared memory (size declaration, no data) |
.nv.shared.reserved. | Reserved shared memory for runtime allocation |
.nv.reservedSmem | Reserved shared memory master section |
.nv.reservedSmem.begin | Start offset of reserved region |
.nv.reservedSmem.cap | Capacity of reserved region |
.nv.reservedSmem.offset0 | Offset within reserved region 0 |
.nv.global.init | Initialized global variables |
The master section allocator sub_1CABD60 assigns addresses to shared, constant, and local memory sections. The layout calculator skips .nv.reservedSmem for the same reason it skips .nv.constant0 -- its address comes from the shared memory master allocator.
Unified Function/Data Tables
| Section | Purpose |
|---|---|
.nv.uft | Unified Function Table (indirect call dispatch) |
.nv.uft.entry | UFT entry point table |
.nv.udt | Unified Data Table |
.nv.udt.entry | UDT entry point table |
The error "Number of .nv.uft jump slots != Number of entries" fires when the UFT and entry tables are inconsistent. "missing nv.uft.entry" fires when the required entry table section was never created.
DWARF Debug Sections
Generated when --device-debug or --generate-line-info is active:
| Section | Content |
|---|---|
.debug_info | DWARF DIE tree (compilation units, types, variables) |
.debug_abbrev | DWARF abbreviation table |
.debug_line | Source-to-address line number mapping |
.debug_frame | Call frame information for unwinding |
.debug_loc | Location lists for variables |
.debug_str | DWARF string table |
.debug_ranges | Address ranges |
.debug_aranges | Address range lookup table |
.debug_pubnames | Public name index |
.debug_pubtypes | Public type index |
NVIDIA Debug Extensions
| Section | Content |
|---|---|
.nv_debug_ptx_txt | Embedded PTX source text |
.nv_debug_line_sass | SASS-level line number mapping |
.nv_debug_info_reg_sass | Register allocation debug info |
.nv_debug_info_reg_type | Register type information |
.nv_debug.shared | Shared memory debug layout |
Mercury / Capsule Mercury Sections (SM 100+)
For Capsule Mercury output (Blackwell and later), the cubin contains a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization:
| Section | Purpose |
|---|---|
.nv.capmerc | Capsule Mercury descriptor |
.nv.merc.symtab_shndx | Extended section index table (Mercury copy) |
.nv.merc.nv.shared.reserved | Shared memory reservation metadata |
.nv.merc.rela<secname> | Per-section relocation tables |
.nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
.nv.merc.debug_info | Cloned DWARF info |
.nv.merc.debug_line | Cloned DWARF line table |
.nv.merc.debug_frame | Cloned DWARF frame info |
.nv.merc.debug_loc | Cloned DWARF locations |
.nv.merc.debug_str | Cloned DWARF string table |
.nv.merc.debug_ranges | Cloned DWARF ranges |
.nv.merc.debug_aranges | Cloned DWARF address ranges |
.nv.merc.debug_pubnames | Cloned DWARF public names |
.nv.merc.debug_pubtypes | Cloned DWARF public types |
.nv.merc.debug_macinfo | Cloned DWARF macro info |
.nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
.nv.merc.nv_debug_line_sass | SASS-level line mapping |
.nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
.nv.merc.nv_debug_info_reg_type | Register type debug info |
The Mercury section cloner (sub_1CA2E40) iterates all sections and duplicates constant, global, shared, and local sections into the .nv.merc.* namespace, creating corresponding .nv.merc.rela sections for relocations.
Global vs Per-Kernel Sections
The .nv.info / .nv.info.<func> split is the primary distinction between global and per-kernel metadata:
Global .nv.info (one per cubin):
- sh_link = 0 (no associated symbol)
- Contains module-wide EIATTR records: EIATTR_CUDA_API_VERSION, EIATTR_STATISTICS, EIATTR_HAS_PRE_V10_OBJECT, EIATTR_MERCURY_ISA_VERSION
- Created by sub_1CC7FB0(elfw, 0) -- the zero argument selects global mode
Per-kernel .nv.info.<func> (one per kernel):
- Section name: sprintf(".nv.info.%s", func_name) (visible in sub_1CC7FB0)
- sh_link points to the symbol table entry for the function
- sh_flags includes SHF_LINK_ORDER (0x40) to declare its association
- Contains per-kernel EIATTR records: EIATTR_REGCOUNT, EIATTR_NUM_BARRIERS, EIATTR_FRAME_SIZE, etc.
- Created by sub_1CC7FB0(elfw, sym_idx) when sym_idx != 0
The .nv.info section creator (sub_1CC7FB0) first searches for an existing section of type 0x70000000 with the appropriate name. If none exists, it creates one. The per-function variant links the new section to the function's .text section via sub_1CB4180.
Section Ordering
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:
| Bucket | Priority | Contents |
|---|---|---|
| 0 | Highest | ELF header pseudo-section, .shstrtab |
| 1 | | .strtab, .symtab, .symtab_shndx |
| 2 | | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | | .text.<func> code sections |
| 4 | | .nv.constant0.*, .nv.shared.*, .nv.local.* data sections |
| 5 | | .rela.*, .rel.* relocation sections |
| 6 | | .nv.info.* EIATTR metadata sections |
| 7 | Lowest | .debug_*, .nv.merc.* debug and Mercury metadata |
Within each bucket, sections appear in creation order. Section file offsets are assigned by sub_1C9DC60 walking the sorted list with alignment padding. The .debug_line section receives special alignment padding for DWARF line table requirements.
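The bucket scheme above can be sketched as a classification function. This is an illustrative reconstruction, not recovered code: `bucket_of` is our name, and the prefix matching is a guess at how the real comparator distinguishes sections; a stable sort on (bucket, creation index) then reproduces the documented ordering.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the 8-bucket section classification (names and the exact
 * prefix tests are ours, inferred from the table above). Order of the
 * checks matters: ".rela" must be tested before the shorter ".rel". */
static int bucket_of(const char *name) {
    if (!strcmp(name, ".shstrtab")) return 0;
    if (!strcmp(name, ".strtab") || !strncmp(name, ".symtab", 7)) return 1;
    if (!strncmp(name, ".note.nv.", 9)) return 2;
    if (!strncmp(name, ".text", 5)) return 3;
    if (!strncmp(name, ".nv.constant", 12) || !strncmp(name, ".nv.shared", 10)
        || !strncmp(name, ".nv.local", 9)) return 4;
    if (!strncmp(name, ".rela", 5) || !strncmp(name, ".rel", 4)) return 5;
    if (!strncmp(name, ".nv.info", 8)) return 6;
    return 7;  /* .debug_*, .nv.merc.* and everything else */
}
```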
Offset Assignment
// sub_1C9DC60 -- simplified layout algorithm
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
section_t* sec = sorted_sections[i];
if (is_virtual(sec)) continue; // flag & 4 -> no file data
if (is_nv_constant0(sec)) continue; // OCG allocator manages these
if (is_nv_reservedSmem(sec)) continue; // shared memory allocator manages these
if (sec->sh_addralign > 1)
offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
sec->sh_offset = offset;
offset += sec->sh_size;
}
Three section types are skipped during offset assignment:
- Virtual sections (flag bit 2 set) -- have no file data, only metadata
- .nv.constant0 -- address assigned by the OCG constant bank allocator
- .nv.reservedSmem -- address assigned by the shared memory master allocator sub_1CABD60
EIATTR Encoding
Each .nv.info section contains a flat sequence of EIATTR (Entry Information Attribute) records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled) -- one of the three largest functions in the output pipeline.
TLV Record Format
Offset Size Field
------ ---- -----
0x00 1 format Format byte (determines payload structure)
0x01 1 attr_code EIATTR type code (identifies the attribute)
0x02 2 size Payload size in bytes (little-endian uint16)
0x04 var payload Attribute-specific data (size bytes)
Total record size = 4 + ALIGN_UP(size, 4). Records are 4-byte aligned.
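The framing arithmetic is simple enough to state as code. This is a minimal sketch of the record-size rule just described (the macro and function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* EIATTR record framing: a 4-byte TLV header followed by the payload,
 * padded out to the next 4-byte boundary. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static uint32_t eiattr_record_size(uint16_t payload_size) {
    return 4 + ALIGN_UP((uint32_t)payload_size, 4);
}
```

So an indexed record (format 0x04) with its 8-byte payload occupies 12 bytes, and a 5-byte free-form payload also rounds up to a 12-byte record.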
Format Byte
| Format | Name | Payload structure |
|---|---|---|
0x01 | Free | Raw bytes, attribute-specific layout |
0x02 | Value | Single 32-bit value (no symbol index) |
0x03 | Sized | 16-bit value + padding |
0x04 | Indexed | [sym_index:4][value:4] -- per-symbol attribute |
Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to, enabling the linker to remap symbol indices during merge.
Parsing Pseudocode
uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;
while (ptr < end) {
    uint8_t format = ptr[0];
    uint8_t attr_code = ptr[1];
    uint16_t size = *(uint16_t *)(ptr + 2);
    switch (format) {
    case 0x04: {                      // Indexed: [sym_index:4][value:4]
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value = *(uint32_t *)(ptr + 8);
        process_indexed(attr_code, sym_idx, value);
        break;
    }
    case 0x02: {                      // Value: single 32-bit payload
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global(attr_code, value);
        break;
    }
    default:                          // Free / Sized: raw payload bytes
        process_raw(attr_code, ptr + 4, size);
        break;
    }
    ptr += 4 + ALIGN_UP(size, 4);     // records are 4-byte aligned
}
EIATTR Record Emitter -- sub_1CC85F0
The low-level function that writes one EIATTR TLV record. Called from the builder and propagator with parameters:
// sub_1CC85F0 -- emit one EIATTR record
void emit_eiattr(
ELFW* elfw, // a1: ELFW object
uint8_t attr_code, // a2: EIATTR type code (e.g., 0x2F for REGCOUNT)
int16_t size, // a3: payload size in bytes
void* payload, // a4: pointer to payload data
uint32_t sym_idx // a5: symbol index (0 = global)
);
Before emitting, it calls sub_1C97840 to check whether the attribute code is supported on the current SM architecture. If not supported, the record is silently skipped. It then calls sub_1CC7FB0 to obtain or create the appropriate .nv.info section, allocates a 16-byte record descriptor, fills the format byte and attribute code, and appends it to the section's linked list (offset +392 in the ELFW object).
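The byte layout that emission ultimately produces can be illustrated with a hypothetical serializer for the common indexed format. This is not the recovered emitter (which builds a 16-byte descriptor and defers serialization); it only shows the on-disk bytes, assuming the little-endian layout documented in the TLV table:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical serializer for a format-0x04 (indexed) EIATTR record.
 * memcpy of native integers matches the wire format on a little-endian
 * host, which is what the x86-64 emitter effectively does. */
static size_t emit_indexed_record(uint8_t *out, uint8_t attr_code,
                                  uint32_t sym_idx, uint32_t value) {
    uint16_t size = 8;           /* payload = [sym_index:4][value:4] */
    out[0] = 0x04;               /* format byte: indexed */
    out[1] = attr_code;          /* e.g. 0x2F = EIATTR_REGCOUNT */
    memcpy(out + 2, &size, 2);   /* payload size, little-endian uint16 */
    memcpy(out + 4, &sym_idx, 4);
    memcpy(out + 8, &value, 4);
    return 12;                   /* 4-byte header + 8-byte payload */
}
```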
EIATTR Attribute Catalog
ptxas v13.0.88 defines 97 EIATTR codes numbered 0 through 96 (plus the EIATTR_ERROR_LAST sentinel at 96). The complete catalog below is cross-referenced against the nvlink v13.0.88 name table (extracted from pointer table at VA 0x1D37D60) and verified against EIATTR codes observed in the ptxas EIATTR builder (sub_1CC9800 switch cases and sub_1CC85F0 call sites).
Complete Code Table
| Code | Hex | Name | Fmt | Category |
|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | Sentinel |
| 1 | 0x01 | EIATTR_PAD | -- | Sentinel |
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Idx | Texture |
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | Metadata |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Idx | Metadata |
| 5 | 0x05 | EIATTR_MAX_THREADS | Idx | Resource |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Idx | Texture |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Idx | Texture |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Idx | Texture |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Idx | Texture |
| 10 | 0x0A | EIATTR_PARAM_CBANK | Idx | Param |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | Param |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | Param |
| 13 | 0x0D | EIATTR_SYNC_STACK | Idx | Metadata |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | Texture |
| 15 | 0x0F | EIATTR_EXTERNS | Free | Metadata |
| 16 | 0x10 | EIATTR_REQNTID | Idx | Resource |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Idx | Resource |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Idx | Resource |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Idx | Texture |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | Texture |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Idx | Texture |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Idx | Texture |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | Param |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Idx | Param |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | Param |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Idx | Metadata |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | Resource |
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | Offsets |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | Offsets |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Idx | Resource |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Idx | Metadata |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Idx | Metadata |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Idx | Metadata |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Idx | Metadata |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Idx | Resource |
| 36 | 0x24 | EIATTR_SUQ_USED | Idx | Metadata |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | Offsets |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Idx | Metadata |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | Offsets |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | Offsets |
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Idx | Cluster |
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | WAR |
| 43 | 0x2B | EIATTR_WMMA_USED | Idx | Metadata |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Val | Metadata |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | Offsets |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | Offsets |
| 47 | 0x2F | EIATTR_REGCOUNT | Idx | Resource |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | WAR |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | Offsets |
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Idx | Shared |
| 51 | 0x33 | EIATTR_STATISTICS | Free | Metadata |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | Offsets |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | WAR |
| 54 | 0x36 | EIATTR_SW_WAR | Free | WAR |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Idx | Metadata |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Idx | Resource |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | Offsets |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | Offsets |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Idx | Resource |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | Metadata |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Idx | Cluster |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Idx | Cluster |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Idx | Cluster |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | Metadata |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Idx | Shared |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Idx | Shared |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | Metadata |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | Offsets |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | Param |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | Offsets |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | WAR |
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Idx | Graphics |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Idx | Graphics |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Idx | Graphics |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Idx | Metadata |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Idx | Resource |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Idx | Texture |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | Metadata |
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGMENTS | Free | Blackwell |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | Blackwell |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Idx | Blackwell |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Idx | Blackwell |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Idx | Blackwell |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Idx | Blackwell |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | Metadata |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | Sentinel |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | Offsets |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Idx | Metadata |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | Offsets |
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | Mercury |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Idx | Cluster |
| 92 | 0x5C | EIATTR_SANITIZE | Idx | Blackwell |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | Metadata |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | Metadata |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | Mercury |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | Sentinel |
Fmt column: Idx = format 0x04 (indexed, per-symbol), Free = format 0x01 (raw bytes), Val = format 0x02 (single 32-bit value), Sized = format 0x03 (16-bit value).
EIATTR Codes Confirmed in ptxas Builder
The following codes appear as explicit case labels in the sub_1CC9800 switch statement or as arguments to sub_1CC85F0:
| Code | Hex | Confirmed via |
|---|---|---|
| 4 | 0x04 | case 0x4 in builder -- CTAIDZ_USED |
| 13 | 0x0D | case 0xD -- SYNC_STACK |
| 15 | 0x0F | case 0xF + sub_1CC85F0(_, 0xF, ...) -- EXTERNS |
| 17 | 0x11 | case 0x11 -- FRAME_SIZE |
| 18 | 0x12 | case 0x12 + sub_1CC85F0(_, 0x12, ...) -- MIN_STACK_SIZE |
| 27 | 0x1B | case 0x1B -- MAXREG_COUNT |
| 30 | 0x1E | case 0x1E + sub_1CC85F0(_, 0x1E, ...) -- CRS_STACK_SIZE |
| 35 | 0x23 | case 0x23 -- MAX_STACK_SIZE |
| 38 | 0x26 | case 0x26 -- LOAD_CACHE_REQUEST |
| 47 | 0x2F | case 0x2F + sub_1CC85F0(_, 0x2F, ...) -- REGCOUNT |
| 56 | 0x38 | case 0x38 -- NUM_MBARRIERS |
| 59 | 0x3B | case 0x3B + sub_1CC85F0(_, 0x3B, ...) -- SAM_REGION_STACK_SIZE |
| 65 | 0x41 | case 0x41 -- RESERVED_SMEM_USED |
| 74 | 0x4A | case 0x4A -- VRC_CTA_INIT_COUNT |
| 76 | 0x4C | case 0x4C -- NUM_BARRIERS |
| 79 | 0x4F | case 0x4F + sub_1CC85F0(_, 0x4F, ...) -- AT_ENTRY_FRAGMENTS |
| 80 | 0x50 | case 0x50 + sub_1C97840(0x50, ...) -- SPARSE_MMA_MASK |
| 81 | 0x51 | case 0x51 -- TCGEN05_1CTA_USED |
| 82 | 0x52 | case 0x52 -- TCGEN05_2CTA_USED |
| 84 | 0x54 | case 0x54 -- REG_RECONFIG |
The builder's first pass uses a switch with cases 0x04, 0x0D, 0x0F, 0x11, 0x12, 0x1B, 0x1E, 0x23, 0x26, 0x2F, 0x38, 0x3B, 0x41, 0x4A, 0x4C, 0x4F, 0x50, 0x51, 0x52, 0x54 to sort EIATTR records into per-entry arrays. A second pass emits the final records via sub_1CC85F0 and sub_1CC86D0.
Symbol Index Resolution Pass
Before the main builder runs, the EIATTR builder performs a symbol index resolution pass (lines 700--884 in the decompiled builder). This pass walks all pre-existing EIATTR records and resolves symbol indices through the linker's mapping tables:
// Simplified from sub_1CC9800 lines ~716-824
for (record in eiattr_list) {
switch (record->attr_code) {
case 0x02: case 0x06: case 0x07: case 0x08: case 0x09:
case 0x0A: case 0x11: case 0x12: case 0x13: case 0x14:
case 0x17: case 0x23: case 0x26: case 0x2F: case 0x3B:
case 0x45:
// Indexed format: resolve sym_idx through mapping table
int32_t *sym_ptr = (int32_t *)record->payload;
if (mapping_table && *sym_ptr != 0) {
if (*sym_ptr < 0)
*sym_ptr = negative_mapping[-(*sym_ptr)];
else
*sym_ptr = mapping_table[*sym_ptr];
}
      if (*sym_ptr == 0 && record->attr_code != 0x45 && record->attr_code != 0x17) {
record->attr_code = 0; // disable record
}
break;
case 0x0F:
// EXTERNS: resolve each 4-byte symbol index in the array
int count = record->size / 4;
for (int i = 0; i < count; i++) {
resolve_sym(&payload[i], mapping_table, negative_mapping);
}
break;
}
}
The bitmask 0x800800060000 (seen at line 716) is consulted alongside this switch. Note that the bits actually set in that constant are 17, 18, 35, and 47 -- FRAME_SIZE, MIN_STACK_SIZE, MAX_STACK_SIZE, and REGCOUNT -- so it selects only a subset of the indexed codes handled by the switch cases above, presumably as a fast path.
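Membership in such a mask reduces to a single shift-and-test, which is presumably how the decompiled code uses the constant. A quick check of which attribute codes the mask actually covers (function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Test whether an EIATTR code is selected by a 64-bit code bitmask. */
static int code_in_mask(uint64_t mask, unsigned code) {
    return code < 64 && ((mask >> code) & 1) != 0;
}
```

Evaluating `code_in_mask(0x800800060000ULL, c)` for all c confirms that exactly codes 17, 18, 35, and 47 are set.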
Barrier and Register Propagation -- sub_1CC8950
When a device function uses barriers or a high register count, those requirements must propagate upward through the call graph to each entry kernel. The propagator sub_1CC8950 handles this:
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
of %s of entry symbol %s"
"regcount %d for %s propagated to entry %s"
The propagator emits EIATTR_REGCOUNT (0x2F) records via sub_1CC85F0(_, 0x2F, 8, ...) and handles EIATTR_NUM_BARRIERS (0x4C) through the sub_1CC7FB0 path. Barrier counts are extracted from the section flags field at bit offset 20 (7-bit field, mask 0x7F), then cleared from the section flags (&= 0xF80FFFFF) after being moved into an EIATTR record.
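The bitfield arithmetic here is worth making concrete. A minimal sketch of the barcount extraction and clearing described above (function names are ours; the masks are the recovered constants):

```c
#include <assert.h>
#include <stdint.h>

/* Barrier count lives in a 7-bit field at bit offset 20 of the section
 * flags word. */
static uint32_t barcount_get(uint32_t sh_flags) {
    return (sh_flags >> 20) & 0x7F;
}

/* Clearing mask 0xF80FFFFF zeroes exactly bits 20-26 (0x07F00000),
 * i.e. the barcount field, after it has been moved into an EIATTR record. */
static uint32_t barcount_clear(uint32_t sh_flags) {
    return sh_flags & 0xF80FFFFF;
}
```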
EIATTR Categories by Function
Resource allocation (GPU driver reads these to configure hardware at launch):
| Code | Name | Description |
|---|---|---|
0x2F | REGCOUNT | Physical register count per thread (primary occupancy determinant) |
0x05 | MAX_THREADS | Maximum threads per block (.maxntid) |
0x10 | REQNTID | Required block dimensions (.reqntid, 3x uint32) |
0x11 | FRAME_SIZE | Per-thread local memory frame size (bytes) |
0x12 | MIN_STACK_SIZE | Minimum call stack (non-recursive) |
0x23 | MAX_STACK_SIZE | Maximum call stack (recursive) |
0x1E | CRS_STACK_SIZE | Call-Return-Sync stack |
0x3B | SAM_REGION_STACK_SIZE | SAM (Streaming Async Memory) region stack |
0x4C | NUM_BARRIERS | Named barrier count (0--16) |
0x38 | NUM_MBARRIERS | Memory barrier (mbarrier) object count |
0x1B | MAXREG_COUNT | Register count hint (--maxrregcount / .maxnreg) |
Parameter bank:
| Code | Name | Description |
|---|---|---|
0x0A | PARAM_CBANK | Constant bank number + offset for parameters |
0x19 | CBANK_PARAM_SIZE | Parameter constant bank size |
0x18 | SMEM_PARAM_SIZE | Shared memory parameter region size |
0x0B | SMEM_PARAM_OFFSETS | Per-parameter shared memory offsets |
0x0C | CBANK_PARAM_OFFSETS | Per-parameter constant bank offsets |
0x17 | KPARAM_INFO | Per-parameter metadata (v1) |
0x45 | KPARAM_INFO_V2 | Per-parameter metadata (v2, extended) |
Instruction offset tables (driver/tools locate and patch instructions at load time):
| Code | Name | Description |
|---|---|---|
0x1C | EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions |
0x1D | S2RCTAID_INSTR_OFFSETS | Offsets of S2R SR_CTAID.* instructions |
0x25 | LD_CACHEMOD_INSTR_OFFSETS | Load instructions with cache modifier |
0x27 | ATOM_SYS_INSTR_OFFSETS | Atomic instructions with .sys scope |
0x28 | COOP_GROUP_INSTR_OFFSETS | Cooperative group instructions |
0x2D | ATOMF16_EMUL_INSTR_OFFSETS | Emulated FP16 atomics |
0x2E | ATOM16_EMUL_INSTR_REG_MAP | Register map for 16-bit atomic emulation |
0x31 | INT_WARP_WIDE_INSTR_OFFSETS | Integer warp-wide instructions |
0x34 | INDIRECT_BRANCH_TARGETS | Valid indirect branch targets (CFI) |
0x39 | MBARRIER_INSTR_OFFSETS | MBAR memory barrier instructions |
0x3A | COROUTINE_RESUME_OFFSETS | Device coroutine resume points |
0x44 | UNUSED_LOAD_BYTE_OFFSET | Unused load instruction byte offset |
0x46 | SYSCALL_OFFSETS | __cuda_syscall invocation offsets |
0x57 | STACK_CANARY_TRAP_OFFSETS | Stack canary trap instructions |
0x59 | LOCAL_CTA_ASYNC_STORE_OFFSETS | CTA-local async store instructions |
Texture and surface:
| Code | Name | Description |
|---|---|---|
0x02 | IMAGE_SLOT | Texture/surface image slot assignment |
0x06 | IMAGE_OFFSET | Image descriptor table offset |
0x07 | IMAGE_SIZE | Image descriptor size |
0x08 | TEXTURE_NORMALIZED | Normalized texture coordinates flag |
0x09 | SAMPLER_INIT | Sampler initialization data |
0x0E | TEXID_SAMPID_MAP | Texture-to-sampler mapping table |
0x13 | SAMPLER_FORCE_UNNORMALIZED | Force unnormalized sampler |
0x14 | BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offsets |
0x15 | BINDLESS_TEXTURE_BANK | Constant bank for bindless textures |
0x16 | BINDLESS_SURFACE_BANK | Constant bank for bindless surfaces |
0x4D | TEXMODE_INDEPENDENT | Independent texture mode |
Cluster and cooperative launch (sm_90+):
| Code | Name | Description |
|---|---|---|
0x29 | COOP_GROUP_MASK_REGIDS | Cooperative group mask register IDs |
0x3D | CTA_PER_CLUSTER | CTAs per cluster (Hopper+) |
0x3E | EXPLICIT_CLUSTER | Explicit cluster dimensions |
0x3F | MAX_CLUSTER_RANK | Maximum cluster rank |
0x5B | BLOCKS_ARE_CLUSTERS | CTA blocks are clusters flag |
Shared memory:
| Code | Name | Description |
|---|---|---|
0x32 | SHARED_SCRATCH | Shared memory scratch for register spilling |
0x41 | RESERVED_SMEM_USED | Reserved shared memory in use |
0x42 | RESERVED_SMEM_0_SIZE | Reserved shared memory partition 0 size |
Hardware workarounds:
| Code | Name | Description |
|---|---|---|
0x2A | SW1850030_WAR | HW bug 1850030 workaround |
0x30 | SW2393858_WAR | HW bug 2393858 workaround |
0x35 | SW2861232_WAR | HW bug 2861232 workaround |
0x36 | SW_WAR | Generic workaround container |
0x47 | SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | MEMBAR.SYS workaround offsets |
Blackwell+ (sm_100+):
| Code | Name | Description |
|---|---|---|
0x4F | AT_ENTRY_FRAGMENTS | Fragment descriptors at function entry |
0x50 | SPARSE_MMA_MASK | Structured sparsity mask for MMA |
0x51 | TCGEN05_1CTA_USED | 5th-gen tensor core (single-CTA mode) |
0x52 | TCGEN05_2CTA_USED | 5th-gen tensor core (two-CTA mode) |
0x53 | GEN_ERRBAR_AT_EXIT | Generate error barrier at kernel exit |
0x54 | REG_RECONFIG | Dynamic register reconfiguration (setmaxnreg) |
0x5C | SANITIZE | Address sanitizer instrumentation present |
Mercury:
| Code | Name | Description |
|---|---|---|
0x5A | MERCURY_FINALIZER_OPTIONS | Options for Mercury FNLZR post-link pass |
0x5F | MERCURY_ISA_VERSION | Mercury ISA version for shader binary |
Graphics-specific:
| Code | Name | Description |
|---|---|---|
0x48 | GRAPHICS_GLOBAL_CBANK | Global constant bank for graphics shaders |
0x49 | SHADER_TYPE | Shader type (vertex, fragment, compute, etc.) |
0x4A | VRC_CTA_INIT_COUNT | Virtual Register Count CTA init count |
.nv.compat Section
The .nv.compat section (SHT_CUDA_COMPAT = 0x70000086) stores forward-compatibility attributes. Its records use a different format from EIATTR -- each is a small TLV with:
Offset Size Field
------ ---- -----
0x00 1 format (always 0x02 = value)
0x01 1 compat_code
0x02 1 value
The sub_1CC93A0 handler processes these with a switch over compat codes 2--6:
| Code | Behavior |
|---|---|
| 2 | Max of existing and new value (keeps higher) |
| 3 | OR existing with new value (accumulate flags) |
| 4 | Reset to zero |
| 5 | Per-nibble max (two 2-bit fields) |
| 6 | Set to 1 if values differ (conflict detection) |
The guard *(_DWORD *)(a1 + 72) <= 0x59 (SM version <= 89 decimal) skips compat processing, so it only runs for SM 90 (Hopper) and later. Unknown compat codes trigger: "unknown .nv.compat attribute (%x) encoutered with value %x." (note the typo "encoutered" in the binary string).
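The merge behaviors in the table above can be sketched as a switch. This is an illustrative reconstruction, not the recovered sub_1CC93A0: the function and parameter names are ours, and we interpret "per-nibble max" as two independent 4-bit fields.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the .nv.compat merge semantics for compat codes 2-6. */
static uint8_t compat_merge(int code, uint8_t oldv, uint8_t newv) {
    switch (code) {
    case 2:  return oldv > newv ? oldv : newv;   /* keep the higher value  */
    case 3:  return oldv | newv;                 /* accumulate flag bits   */
    case 4:  return 0;                           /* reset to zero          */
    case 5: {                                    /* per-nibble max         */
        uint8_t lo = (oldv & 0x0F) > (newv & 0x0F) ? (oldv & 0x0F) : (newv & 0x0F);
        uint8_t hi = (oldv & 0xF0) > (newv & 0xF0) ? (oldv & 0xF0) : (newv & 0xF0);
        return hi | lo;
    }
    case 6:  return oldv != newv;                /* 1 on conflict          */
    default: return oldv;                        /* unknown: leave as-is   */
    }
}
```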
Architecture-Gated EIATTR Emission
Not all EIATTR codes are valid on all SM architectures. The function sub_1C97840 performs architecture checks before emitting a record. Observed gates:
| EIATTR Code | Gate | Meaning |
|---|---|---|
0x04 (CTAIDZ_USED) | Always emitted | |
0x41 (RESERVED_SMEM_USED) | sub_1C97840(0x41, sm) | SM-version dependent |
0x4C (NUM_BARRIERS) | sub_1C97840(0x4C, sm) | SM-version dependent |
0x50 (SPARSE_MMA_MASK) | sub_1C97840(0x50, sm) | SM 100+ (Blackwell) |
0x51 (TCGEN05_1CTA) | sub_1C97840(0x51, sm) implicit | SM 100+ |
0x52 (TCGEN05_2CTA) | sub_1C97840(0x52, sm) implicit | SM 100+ |
0x54 (REG_RECONFIG) | sub_1C97840(0x54, sm) implicit | SM 100+ |
The sub_1C97840 function takes an EIATTR code and the SM version from the ELFW object's field at offset 624, returning a boolean. This prevents older EIATTR codes from appearing in Blackwell cubins and prevents Blackwell-only codes from appearing in Hopper cubins.
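In the spirit of sub_1C97840, the gating can be pictured as a code-to-minimum-SM predicate. This sketch encodes only the gates listed in the table above; the real function's full table has not been recovered, and the SM-version-dependent cases (0x41, 0x4C) are left as always-true placeholders:

```c
#include <assert.h>

/* Illustrative architecture gate: Blackwell-only EIATTR codes require
 * SM >= 100; everything else passes in this simplified model. */
static int eiattr_supported(int code, int sm) {
    switch (code) {
    case 0x50:  /* SPARSE_MMA_MASK  */
    case 0x51:  /* TCGEN05_1CTA     */
    case 0x52:  /* TCGEN05_2CTA     */
    case 0x54:  /* REG_RECONFIG     */
        return sm >= 100;
    default:    /* e.g. 0x04 CTAIDZ_USED: always emitted */
        return 1;
    }
}
```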
Constant Bank Optimization
The master section allocator sub_1CABD60 (11,856 bytes) performs two major space optimizations during address assignment: constant value deduplication within .nv.constant0 banks, and shared memory interference-graph coloring for extern shared variables. Both run before final offset assignment.
Constant Value Deduplication -- sub_1CA6890
When multiple kernels in the same compilation unit use identical constant values, the OCG constant bank can contain duplicates. sub_1CA6890 (454 lines decompiled) eliminates them by value-matching, reducing .nv.constant0 section size.
The algorithm dispatches on constant value width:
| Value Width | Dedup Strategy | Data Structure |
|---|---|---|
| 4 bytes | Hash map lookup (sub_426D60) | Hash table keyed on 32-bit value |
| 8 bytes | Hash map lookup (separate table) | Hash table keyed on 64-bit value |
| 12, 16, 20, 24, 32, 48, 64 bytes | Linear scan with memcmp (sub_1CA6760) | Per-width linked list |
| Other | No deduplication | Direct append |
For each constant data node in the section's linked list (at section+72):
- Extract the value bytes (node+0), alignment (node+16), and size (node+24).
- Look up the value in the appropriate dedup structure.
- If a duplicate is found: alias the current symbol's offset to the existing symbol's offset. Debug output: "found duplicate value 0x%x, alias %s to %s" (32-bit), "found duplicate 64bit value 0x%llx, alias %s to %s" (64-bit), or "found duplicate %d byte value, alias %s to %s" (N-byte via sub_1CA6760).
- If not found: align the section cursor to the required alignment, append the data via sub_1CA6650, and insert the value into the dedup structure.
After aliasing, the function rewrites pending relocations that targeted the now-eliminated range:
// Simplified relocation rewriting after dedup alias
for (reloc in pending_relocs) {
if (reloc.section == target_section
&& reloc.offset >= old_data_offset
&& reloc.offset < old_data_offset + old_data_size) {
reloc.offset = reloc.offset + alias_target_offset - old_data_offset;
// "optimize ocg constant reloc offset from %lld to %lld"
unlink(reloc); // remove from pending list
}
}
Special cases:
- Zero-valued constants: A "seen set" (parameter a15) prevents distinct zero-valued symbols from being aliased to each other, since different __constant__ variables may legitimately hold zero but need separate addresses.
- Redirect mode: When parameter a13 is set and sub_1CB15C0 returns true for a symbol, the constant is redirected to its defining section rather than deduplicated.
The caller sub_1CABD60 wraps this in an optimization check: "optimize OCG constants for %s, old size = %lld". If dedup does not reduce the section size, it reverts: "ocg const optimization didn't help so give up".
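The 32-bit dedup path can be illustrated with a toy model. This is not the recovered implementation: a linear table stands in for the hash map (sub_426D60), and all names are ours. It shows the essential contract -- a hit returns the existing offset to alias to, a miss appends at the section cursor:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CONSTS 64

/* Toy dedup table for 4-byte constants in a constant bank. */
typedef struct {
    uint32_t value[MAX_CONSTS];
    uint64_t offset[MAX_CONSTS];
    int      count;
    uint64_t cursor;   /* next free byte in the section */
} dedup32_t;

static uint64_t dedup32_place(dedup32_t *d, uint32_t v) {
    for (int i = 0; i < d->count; i++)
        if (d->value[i] == v)
            return d->offset[i];        /* duplicate: alias to existing */
    d->value[d->count]  = v;            /* miss: append 4 bytes */
    d->offset[d->count] = d->cursor;
    d->count++;
    d->cursor += 4;
    return d->offset[d->count - 1];
}
```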
Shared Memory Interference Graph -- sub_1CA92F0
When a CUDA program declares multiple extern __shared__ variables used by different kernels, they can potentially share the same memory if no single kernel uses both simultaneously. sub_1CA92F0 (585 lines decompiled) builds an interference graph and performs greedy graph coloring to pack shared objects into minimum total space.
Phase 1 -- Build usage sets (which kernels reference each shared object):
For each global shared object, walk all referencing functions. A kernel "uses" a shared object if it directly references it or transitively calls a device function that does (traced via sub_1CBD800). Objects used by exactly one kernel are privatized -- moved into a per-entry .nv.shared.<func> section. Unused objects are removed entirely (symbol flags set to mark deleted).
"global shared %s only used in entry %d" -- privatize
"remove unused global shared %s" -- delete
Phase 2 -- Build interference edges:
For each pair of remaining shared objects (i, j), test whether their usage sets intersect (via sub_42E460 set membership). If any kernel uses both, they interfere -- they cannot overlap in memory. Edges are stored as linked lists per object.
Phase 3 -- Greedy graph coloring:
Objects are processed in sorted order. For each object:
- Mark all colors used by interfering neighbors as unavailable.
- Assign the lowest available color (starting from 1).
- Update the color's alignment requirement (max of all objects in that color group).
- Update the color's size requirement (max of all objects in that color group).
" allocate to group %d" -- color assignment
Phase 4 -- Compute group offsets:
Groups are laid out sequentially with alignment padding:
group_offset[1] = align_up(base, group_align[1]);
for (g = 2; g <= num_groups; g++)
group_offset[g] = align_up(group_offset[g-1] + group_size[g-1], group_align[g]);
total_size = group_offset[last] + group_size[last];
Each shared object's final offset is group_offset[its_color]. The total extern shared size is written to the section descriptor. Per-entry shared sections are expanded if a referenced object's offset + size exceeds their current size.
"esh %s size = %lld"
"for shared object (%d) %s:"
" offset = 0x%llx, size = 0x%llx"
" edge to %d"
" allocate to group %d"
Constant Bank Optimization Functions
| Address | Size | Purpose |
|---|---|---|
sub_1CA6890 | 454 lines | Constant value deduplication (32/64-bit hash, N-byte memcmp) |
sub_1CA6760 | 57 lines | N-byte value dedup helper (12--64 byte constants) |
sub_1CA6650 | 65 lines | Constant data node appender (40-byte node, alignment + append) |
sub_1CA92F0 | 585 lines | Shared memory interference graph + greedy coloring |
sub_1CA91A0 | -- | Per-entry shared section creator (.nv.shared.<func>) |
sub_1CA5360 | -- | Shared object comparison function (sort key) |
sub_1CA5A00 | -- | Shared memory data copier (offset overlap check) |
Section Attribute Builder -- sub_60FBF0
The per-kernel section attribute builder sub_60FBF0 (76 KB decompiled, 2,541 lines, VA 0x60FBF0) runs once for each kernel entry point and device function. It assembles the full per-function section configuration object (648 bytes), parses compile option overrides, remaps PTX memory space codes to ELF section type IDs, conditionally creates three section types, then invokes the Mercury codegen pipeline (sub_6F52F0, DecodePipeline::RunStages).
Inputs
The function takes three parameters:
| Parameter | Content |
|---|---|
a1 | Per-function descriptor: SM version (a1[0..1]), key-value option list (a1+38), assembler flags (a1+39), global/extern symbol lists (a1+6, a1+7), boolean flags (a1+180..182) |
a2 | Compilation context: config base (a2+248), function list (a2+136), optional symbol tables for textures (a2+112), surfaces (a2+120), globals (a2+72), sass_map flag (a2+232), mutex (a2+240), ELFW object (a2+32), target descriptor (a2+56) |
a3 | Output handle (released and reallocated at function entry) |
Option Parsing
The function iterates the key-value list at a1+38 and matches five string keys by character-by-character comparison:
// Simplified from sub_60FBF0 lines ~638-812
for (int i = 0; i < list_length(a1->options); i++) {
const char** kv = list_get(a1->options, i);
if (strcmp(kv[0], "deviceDebug") == 0)
config->deviceDebug = 1; // config+24
else if (strcmp(kv[0], "lineInfo") == 0)
config->lineInfo = 1; // config+25
else if (strcmp(kv[0], "optLevel") == 0) {
if (!config->optLevel_locked) // config+108
config->optLevel = strtol(kv[1], ...); // config+104
}
else if (strcmp(kv[0], "IsCompute") == 0)
config->isCompute = (strcmp(kv[1], "True") == 0); // config+184
else if (strcmp(kv[0], "IsPIC") == 0)
config->isPIC = (strcmp(kv[1], "True") == 0); // config+191
}
Additional inputs read from a1 and a2:
| Source | Config field | Description |
|---|---|---|
| a2+408 | config+44 | Optimization level from compilation context |
| a1+180 | config+190 | Negated boolean flag |
| a1+181 | config+188 | Boolean flag |
| a1+182 | config+189 | Boolean flag |
| word_2020620[a2+64 - 20] | v31+4 | SM version lookup table (SM 20..121 range) |
Memory Space Type Remapping
PTX internal memory space type codes in the 0x10000 range are remapped to compact ELF section type IDs by the helper sub_60DA40 (and inline copies). This remapping is applied to every symbol in the global, extern, texture, surface, and per-function symbol lists:
| PTX Code | Hex | Section Type ID | Memory Space |
|---|---|---|---|
| 65538 | 0x10002 | 83 | .nv.shared (per-kernel) |
| 65539 | 0x10003 | 80 | .nv.constant0 (kernel params) |
| 65540 | 0x10004 | 84 | .nv.local (spill memory) |
| 65541 | 0x10005 | 81 | .nv.constant1 |
| 65542 | 0x10006 | 82 | .nv.constant2 |
| 65544 | 0x10008 | 85 | .nv.constant3 |
| 65545 | 0x10009 | 86 | .nv.constant4 |
| 65546 | 0x1000A | 87 | .nv.constant5 |
| 65576 | 0x10028 | 88 | .nv.global.init |
| 65577 | 0x10029 | 89 | .nv.global |
| 65586 | 0x10032 | 93 | High constant bank |
| 65587 | 0x10033 | 90 | High constant bank |
| 65598 | 0x1003E | 91 | Texture/surface descriptor |
| 65599 | 0x1003F | 92 | Texture/surface descriptor |
Special handling: when the space code is 0x10003 (constant0) and the compilation mode is relocatable (*(a3+48) == 12), the descriptor's needs_reloc flag (byte 33) is set to 1, indicating the constant0 section requires special relocation handling during linking.
The value 65596 (0x1003C) serves as a threshold -- symbols with (space_type - 0x1003C) < 2 are counted into the texture/surface allocation arrays.
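The remapping can be sketched as a plain switch over the recovered table; the function names and the `-1` fallback are illustrative conventions, and only the code/ID pairs come from the table above.

```c
#include <assert.h>

/* Map a PTX memory space code (0x10000 range) to its compact ELF
   section type ID, per the recovered table. Returns -1 for codes the
   table does not cover (illustrative convention, not the binary's). */
static int remap_space_code(int space) {
    switch (space) {
    case 0x10002: return 83;  /* .nv.shared (per-kernel)       */
    case 0x10003: return 80;  /* .nv.constant0 (kernel params) */
    case 0x10004: return 84;  /* .nv.local (spill memory)      */
    case 0x10005: return 81;  /* .nv.constant1                 */
    case 0x10006: return 82;  /* .nv.constant2                 */
    case 0x10008: return 85;  /* .nv.constant3                 */
    case 0x10009: return 86;  /* .nv.constant4                 */
    case 0x1000A: return 87;  /* .nv.constant5                 */
    case 0x10028: return 88;  /* .nv.global.init               */
    case 0x10029: return 89;  /* .nv.global                    */
    case 0x10032: return 93;  /* high constant bank            */
    case 0x10033: return 90;  /* high constant bank            */
    case 0x1003E: return 91;  /* texture/surface descriptor    */
    case 0x1003F: return 92;  /* texture/surface descriptor    */
    default:      return -1;
    }
}

/* Threshold check from the text: (space - 0x1003C) < 2 as an unsigned
   comparison, i.e. only codes 0x1003C and 0x1003D are counted into the
   texture/surface allocation arrays. */
static int is_tex_surf_counted(int space) {
    return (unsigned)(space - 0x1003C) < 2;
}
```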
Conditional Section Creation
Three per-kernel sections are conditionally created:
.sass_map.<func> -- created when *(a2+232) != 0 (sass_map generation enabled):
if (context->sass_map_enabled) { // a2+232
descriptor = alloc(64); // 64-byte section descriptor
memset(descriptor, 0, 64);
pthread_mutex_lock(context->mutex); // a2+240
// Allocate instruction tree node and connect to codegen state
name = sprintf(".sass_map%s", func_name); // ".sass_map" + func_name
descriptor->name = name;
pthread_mutex_unlock(context->mutex);
}
.nv.local.<func> -- created when the register spill size (config+112) is non-zero:
if (config->spill_size != 0) { // config+112
descriptor = alloc(64);
descriptor->size = config->spill_size;
// Name: ".nv.local." + bare_func_name (skip ".text." prefix)
name = sprintf(".nv.local.%s", func_name + 6);
}
The spill size at config+112 is set from the sum of the register spill count and frame size when the spill flag is non-zero.
.nv.constant<N>.<func> -- created when:
- The compilation mode field equals 2 (*(a1->target+48) == 2)
- No pre-existing constant section exists (*(a1+172) == 0)
- The function's symbol list is empty
if (mode == 2 && !has_constant_section && no_symbols) {
descriptor = alloc(64);
int bank = target->get_section_type() - 0x70000064;
int size;
if (func_const_size <= target->get_min_const_size())
size = target->get_min_const_size(); // vtable+80
else
size = target->get_max_const_size(); // vtable+88
descriptor->size = size + func_const_size;
name = sprintf(".nv.constant%d.%s", bank, bare_func_name);
descriptor->data = calloc(descriptor->size);
}
Assembler Flag Processing
The assembler flag list at a1+39 is iterated. Each entry's value string (at offset +8) is split on spaces via strtok_r. Each token is validated by sub_60F790, which constructs a temporary 656-byte object to test the flag. Valid tokens are concatenated with spaces and appended to config+48 (the toolkit info string that ends up in .note.nv.tkinfo).
Codegen Pipeline Invocation
After configuration, the function calls sub_6F52F0 (DecodePipeline::RunStages) with 18 parameters including the configuration object, all 7 descriptor arrays, the ELFW context, and the function name. The return code is mapped:
| sub_6F52F0 return | sub_60FBF0 return | Meaning |
|---|---|---|
| 0 | 0 | Success |
| 1 | 14 | Mercury encode failure |
| 2 | 22 | Mercury decode failure |
Post-Pipeline Section Registration
After the Mercury pipeline returns successfully:
- Calls sub_60DD30 twice for pre/post code region finalization
- Calls sub_60DBE0 for each optional symbol table (texture, surface, global) to register their sections with the ELFW emitter
- Calls sub_1CB9C30 on the ELFW object (a2+32) to commit all sections
- If SM version <= 0x45 (SM 69): creates UFT/UDT entries (section types 68/69) for each resolved symbol
- Under mutex lock, ORs the per-function WAR bitmask (config+232..240) into the global accumulator at a2+504
Thread Safety
All shared state modifications are protected by the mutex at a2+240:
- String length accumulator updates (a2+296, a2+304) for string table pre-allocation
- WAR bitmask accumulation (a2+504)
- .sass_map section setup (instruction tree access)
- Instruction merge from secondary codegen contexts (a2+80, a2+88)
Key Functions
| Address | Size | Purpose |
|---|---|---|
| sub_60FBF0 | ~76 KB decompiled | Per-kernel section attribute builder (section above) |
| sub_1CC9800 | 14,764 B (90 KB decompiled) | EIATTR builder -- master nvinfo section constructor |
| sub_1CC8950 | 2,634 B | EIATTR propagator -- barrier/register cross-function propagation |
| sub_1CC85F0 | ~200 B | EIATTR record emitter -- writes one TLV record |
| sub_1CC86D0 | ~500 B | Per-entry EIATTR emission (MIN_STACK_SIZE, CRS_STACK_SIZE, SAM_REGION_STACK_SIZE) |
| sub_1CC7FB0 | ~200 B | .nv.info section creator/finder |
| sub_1CC93A0 | ~500 B | .nv.compat attribute processor |
| sub_1CB3570 | 1,963 B | Generic section creator (44 call sites) |
| sub_1CB42D0 | -- | .text.<func> section creator |
| sub_1C9DC60 | 5,663 B | Section layout calculator (offset assignment) |
| sub_1CABD60 | 11,856 B | Master section allocator (shared/constant/local addresses) |
| sub_1CBE1B0 | ~10 KB | .nv.callgraph section builder |
| sub_1C97840 | -- | Architecture-gated EIATTR check |
| sub_1CA6890 | 454 lines | Constant bank value deduplication |
| sub_1CA92F0 | 585 lines | Shared memory interference graph + coloring |
Cross-References
- Custom ELF Emitter -- ELFW object, section ordering, ELF header
- Relocations & Symbols -- relocation resolution, UFT/UDT management
- Debug Information -- DWARF generation and .debug_* sections
- Mercury Encoder -- Mercury encoding that feeds .nv.merc.* sections
- Capsule Mercury -- SM 100+ capsule and .nv.capmerc sections
- Pipeline Overview -- where section emission fits in the pipeline
Debug Information
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas generates DWARF-based debug information for cuda-gdb and other GPU debuggers. The debug subsystem spans three distinct code regions: an early-pipeline DWARF line table generator at 0x45A--0x45C that encodes PTX-level source mappings, a mid-pipeline SASS-level emitter at 0x860--0x868 that produces both .debug_line and .nv_debug_line_sass sections along with register mapping tables, and a late-stage DWARF processor/dumper cluster at 0x1CBF--0x1CC9 that handles .debug_info, .debug_abbrev, .debug_loc, and .debug_frame parsing and emission. The design follows a two-tier model: PTX-level debug info records source file/line to PTX instruction mappings, while SASS-level debug info records the final PTX-to-SASS address correspondence after all optimizations. NVIDIA extends standard DWARF with proprietary sections (.nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type, .nv_debug_info_ptx) and Mercury-namespace variants (.nv.merc.debug_*) for Capsule Mercury binaries.
| DWARF line generator (PTX) | sub_45C3A0 (9,041 bytes) -- PTX source line to address mapping |
| LEB128 encoder | sub_45A870 (5,293 bytes) -- variable-length integer encoding for DWARF |
| Debug line table (SASS) | sub_866BB0 (3,273 bytes) -- .debug_line / .nv_debug_line_sass |
| Debug top-level entry | sub_867880 (100 bytes) -- calls line generator twice (PTX + SASS) |
| Reg info SASS emitter | sub_8679F0 (225 bytes) -- .nv_debug_info_reg_sass |
| Reg type emitter | sub_867B00 (230 bytes) -- .nv_debug_info_reg_type |
| Post-RA debug annotator | sub_88D870 (2,656 bytes) -- final source line annotation |
| DWARF form name table | sub_1CBF820 (400 bytes) -- DW_FORM_* ID-to-string |
| DWARF attribute name table | sub_1CBF9B0 (1,600 bytes) -- DW_AT_* ID-to-string |
| .debug_abbrev parser | sub_1CC0850 (3,704 bytes) -- abbreviation table handler |
| .debug_info parser | sub_1CC4A40 (5,218 bytes) -- DIE tree walker |
| CU header parser | sub_1CC5EB0 (2,023 bytes) -- compilation unit headers |
| Location expression printer | sub_1CC34E0 (3,094 bytes) -- DW_OP_* decoder |
| DWARF info processor | sub_1CC24C0 (3,993 bytes) -- non-dump emission mode |
| Debug section classifier | sub_1C9D1F0 (2,667 bytes) -- section name to type ID mapper |
| Mercury debug classifier | sub_1C98C60 (1,755 bytes) -- .nv.merc.debug_* classifier |
| SASS debug classifier | sub_1C99340 -- .debug_* standard section classifier |
| Debug section type mapper | sub_1C998D0 -- maps section name to internal buffer pointer |
| DWARF attribute emitter | sub_66A0B0 (28 KB) -- emits DWARF attributes during IR lowering |
| DWARF debug info builder | sub_66F4E0 (59 KB) -- main DWARF debug info section builder |
| DWARF line table builder | sub_66E250 (33 KB) -- builds .debug_line during IR phase |
| Debug line number formatter | sub_671C00 (11 KB) -- formats line number records |
CLI Flags
Two flags control debug information generation:
| Flag | Internal field | Effect |
|---|---|---|
--device-debug / -g | deviceDebug at ELFW context + 432 | Full debug info: all DWARF sections (.debug_info, .debug_abbrev, .debug_frame, .debug_line, .debug_loc, .debug_str, .debug_aranges), plus NVIDIA extensions. Disables most optimizations, preserving source-level variable correspondence. |
--lineinfo / -ln | lineInfo in option context | Line tables only: generates .debug_line and .nv_debug_line_sass without full DWARF DIE trees. Preserves optimization levels. Sufficient for cuda-memcheck and profiler source correlation. |
--suppress-debug-info | suppresses emission | Strips all debug sections from output, even if debug input was provided. |
The cubin entry point sub_612DE0 reads both deviceDebug and lineInfo flags and passes them through the ELF output pipeline. The section classifier at sub_1C9D1F0 checks the byte at context offset +432 (deviceDebug) to decide whether to emit .debug_frame and .debug_line sections -- when this byte is zero (no -g), those sections are conditionally suppressed.
The --lineinfo flag is described in the CLI as "Generate debug line table information" and is orthogonal to -g. When only --lineinfo is active, ptxas generates the two line table sections but omits the heavyweight .debug_info/.debug_abbrev/.debug_loc sections. The string "device-debug or lineinfo" appears in a validation check that prevents --extensible-whole-program from being combined with either debug mode.
Debug Section Catalog
ptxas generates three tiers of debug sections depending on compilation mode. Standard DWARF sections use the conventional .debug_* namespace. NVIDIA extensions use .nv_debug_* names. Capsule Mercury binaries additionally carry .nv.merc.debug_* clones.
Standard DWARF Sections
| Section | DWARF standard | Content |
|---|---|---|
| .debug_abbrev | Yes (DWARF 2+) | Abbreviation table defining DIE tag/attribute schemas |
| .debug_aranges | Yes | Address range table mapping compilation units to code ranges |
| .debug_frame | Yes | Call frame information (CFA rules for unwinding) |
| .debug_info | Yes | DIE tree: compilation units, subprograms, variables, types |
| .debug_line | Yes | Line number program (source file/line to PTX address mapping) |
| .debug_loc | Yes | Location lists for variables with multiple storage locations |
| .debug_macinfo | Yes | Macro information |
| .debug_pubnames | Yes | Public name lookup table |
| .debug_str | Yes | String table referenced by DW_FORM_strp |
NVIDIA Extension Sections
| Section | Content |
|---|---|
| .nv_debug_line_sass | SASS-level line table: maps SASS instruction addresses to PTX source lines. Parallel to .debug_line but at the machine code level. |
| .nv_debug_info_reg_sass | Register-to-variable mapping for SASS. Records which physical GPU register(s) hold each source variable at each program point. |
| .nv_debug_info_reg_type | Type information for register mappings. Associates register locations with DWARF type descriptions. |
| .nv_debug_info_ptx | PTX-level debug info section. Created by sub_1CC5EB0 as a PTX-namespace mirror of .debug_info. |
| .nv_debug.shared | Debug metadata for shared memory variables. |
Mercury Namespace Variants
For Capsule Mercury binaries (SM 100+), every debug section is cloned into the .nv.merc.* namespace. The Mercury debug classifier sub_1C98C60 recognizes 15 Mercury-namespaced debug sections:
.nv.merc.debug_abbrev .nv.merc.debug_aranges
.nv.merc.debug_frame .nv.merc.debug_info
.nv.merc.debug_loc .nv.merc.debug_macinfo
.nv.merc.debug_pubnames .nv.merc.debug_str
.nv.merc.debug_line (and additional variants)
These sections carry the PTX-level debug information that travels inside the Mercury capsule, enabling deferred finalization to produce debug-capable SASS without re-invoking the full compiler.
Debug Information Pipeline
Debug data flows through three pipeline stages. Each stage operates on a different intermediate representation and produces output at a different abstraction level.
STAGE 1: PTX PARSING (0x45A-0x45C)
PTX source --> .loc directives --> DWARF line number program
sub_45C3A0: Reads .loc directives from PTX input
Builds file/directory tables
Generates DWARF line number program (LEB128-encoded)
Creates "$LDWend" end-of-program label
Uses function-to-index map ("function index not found
in debug function map" on error)
sub_45A870: LEB128 encoder for all numeric fields:
file number, prologue size, address advance,
line advance, context, function offset
STAGE 2: IR LOWERING (0x66A-0x672)
Ori IR instructions carry debug info at instruction node offset +20
sub_66F4E0 (59KB): Main DWARF debug info builder
sub_66E250 (33KB): DWARF .debug_line builder
sub_66A0B0 (28KB): DWARF attribute emitter (directory id, time stamp, file size)
sub_671C00 (11KB): Debug line number formatter (context, functionOffset, line number)
STAGE 3: POST-RA + ELF EMISSION (0x860-0x868, 0x1CBF-0x1CC9)
After register allocation, physical register assignments are known.
sub_88D870: PostRA debug info annotator -- finalizes source line
mappings after all code motion/scheduling
sub_867880: Top-level debug entry -- calls sub_866BB0 twice:
once with a3=0 for .debug_line (PTX-level)
once with a3=1 for .nv_debug_line_sass (SASS-level)
sub_866BB0: DWARF .debug_line section generator
sub_8679F0: .nv_debug_info_reg_sass emitter
sub_867B00: .nv_debug_info_reg_type emitter
Line Number Table Generation
The DWARF .debug_line section generator sub_866BB0 is the central function for line table construction. It produces standard DWARF 2 line number programs that map addresses to source locations.
Parameters
// sub_866BB0 -- DebugLineTableGenerator
// a1: debug_line_context (pointer to ~460-byte state structure)
// a2: ELF output context (ELFW object)
// a3: section index (0 = .debug_line, nonzero = .nv_debug_line_sass)
// a4: unused
// a5: source file path (const char*, used for .nv_debug_line_sass only)
Algorithm
- Section selection: Based on a3, selects the target section name. For a3 == 0, looks up or creates .debug_line via sub_1CB2C60 / sub_1CA7AB0. For a3 != 0, uses .nv_debug_line_sass.
- Source file collection: If a5 provides a source file path and the section is the SASS variant, copies the filename into the debug context at offset +272. Iterates source files via sub_4271E0 (directory iterator), sorts entries using sub_866B80 (file comparison callback).
- Directory table construction: For each source file, splits the path into directory and filename components. Records unique directories via a hash table (sub_426150 / sub_426D60). Each directory gets a sequential index.
- File table construction: Each source file entry is a 40-byte record:
  - File name string
  - Directory index (LEB128)
  - Modification timestamp from stat() (st_mtim.tv_sec)
  - File size from stat() (st_size)
- Line number program generation: Generates the DWARF line number state machine program using standard opcodes (DW_LNS_copy, DW_LNS_advance_pc, DW_LNS_advance_line, DW_LNS_set_file, etc.) and special opcodes for compact address/line delta encoding.
- Finalization: Writes the complete section via sub_1CA7180 (ELF section write).
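The special opcodes mentioned in the line-number-program step fold an (address advance, line advance) pair into a single byte using the standard DWARF 2 formula. The header parameters below are common defaults used here for illustration, not values recovered from the ptxas binary.

```c
#include <assert.h>

/* Standard DWARF 2 special-opcode computation (not ptxas-specific).
   Illustrative header parameters; real values come from the line
   program header the producer writes. */
enum { OPCODE_BASE = 10, LINE_BASE = -5, LINE_RANGE = 14 };

/* Returns the one-byte special opcode for the given deltas, or -1 if
   the pair does not fit and the producer must fall back to explicit
   DW_LNS_advance_pc / DW_LNS_advance_line opcodes. */
static int special_opcode(int addr_delta, int line_delta) {
    if (line_delta < LINE_BASE || line_delta >= LINE_BASE + LINE_RANGE)
        return -1;
    int op = (line_delta - LINE_BASE) + LINE_RANGE * addr_delta + OPCODE_BASE;
    return (op <= 255) ? op : -1;
}
```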
Debug Line Context Structure
debug_line_context (at a1, ~460 bytes):
+0: vtable pointer
+16: SASS line info pointer (nonzero triggers second pass)
+64: per-section context base (160-byte stride, indexed by a3)
+96: filename buffer pointer
+104: filename buffer size
+108: file_count
+112: directory buffer pointer
+120: directory_count
+216: source file count (.debug_line variant)
+256: raw filename pointer
+268: raw filename flag
+272: filename copy buffer (for .nv_debug_line_sass)
+280: filename copy length
+376: source file count (.nv_debug_line_sass variant)
+408: .nv_debug_info_reg_sass buffer chain (linked list head)
+416: .nv_debug_info_reg_sass final buffer pointer
+424: .nv_debug_info_reg_sass total size
+432: .nv_debug_info_reg_type buffer chain (linked list head)
+440: .nv_debug_info_reg_type final buffer pointer
+448: .nv_debug_info_reg_type total size
+456: memory arena / pool allocator
Top-Level Entry
The top-level debug emitter sub_867880 is minimal -- it calls the line table generator twice:
// sub_867880 -- DebugInfoTopLevel (simplified)
void emit_debug_info(ctx, elf, aux, source_path) {
sub_866BB0(ctx, elf, 0, aux, source_path); // .debug_line
if (ctx->sass_line_info) // offset +16
sub_866BB0(ctx, elf, 1, aux, source_path); // .nv_debug_line_sass
}
LEB128 Encoding
The DWARF standard uses LEB128 (Little-Endian Base 128) variable-length encoding for integers throughout debug sections. ptxas implements this in sub_45A870, which handles encoding for multiple fields. Error strings in this function reveal the field types being encoded:
| Field | Error string on overflow |
|---|---|
| File number | "when generating LEB128 number for file number" |
| Prologue marker | "when generating LEB128 number for setting prologue" |
| Address advance | "when generating LEB128 number for address advance" |
| Line advance | "when generating LEB128 number for line advance" |
| Context | "when generating LEB128 number for setting context" |
| Function offset | "when generating LEB128 number for setting function Offset" |
| File timestamp | "when generating LEB128 number for timestamp" (in sub_866BB0) |
| File size | "when generating LEB128 number for file size" (in sub_866BB0) |
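All of the fields above feed the same core encoding. A minimal standard ULEB128 encoder, equivalent to what sub_45A870 must produce (the interface is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard unsigned LEB128: emit 7 bits per byte, low bits first,
   setting bit 7 on every byte except the last. Returns the number of
   bytes written to out. */
static size_t encode_uleb128(uint64_t value, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value)
            byte |= 0x80;   /* more bytes follow */
        out[n++] = byte;
    } while (value);
    return n;
}
```

For example, the classic DWARF test value 624485 encodes as the three bytes 0xE5 0x8E 0x26.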
Register-to-Variable Mapping
After register allocation, ptxas knows the physical GPU registers assigned to each source variable. Two NVIDIA extension sections capture this:
.nv_debug_info_reg_sass
Emitted by sub_8679F0. Records which physical registers (R0--R255, P0--P7, UR0--UR63) hold which source variables at each SASS instruction address. The data is accumulated during code generation into a linked list of buffer chunks at debug context offsets +408/+416/+424. At emission time, the chunks are concatenated into a contiguous buffer and written as an ELF section:
// sub_8679F0 -- simplified
void emit_reg_sass(debug_ctx, elf) {
// Collect linked list of buffer chunks from debug_ctx+408
chunks = linked_list_to_array(debug_ctx->reg_sass_chain);
buf = allocate(debug_ctx->reg_sass_total_size);
offset = 0;
for (chunk in chunks) {
memcpy(buf + offset, chunk->data, chunk->size);
offset += chunk->size;
}
section = create_section(elf, ".nv_debug_info_reg_sass", 0, 1, 0);
write_section(elf, section, buf, 1, total_size);
}
.nv_debug_info_reg_type
Emitted by sub_867B00. Structurally identical to the reg_sass emitter but operates on offsets +432/+440/+448. Associates register locations with DWARF type information, enabling the debugger to interpret register contents correctly (e.g., distinguishing a 32-bit float in R5 from a 32-bit integer in R5).
DWARF Processing Subsystem
The DWARF processing cluster at 0x1CBF--0x1CC9 handles both generation and diagnostic dumping of DWARF sections. The code can operate in two modes: a dump mode that prints human-readable representations (for --dump-debug-info or internal diagnostics), and an emission mode that processes raw DWARF bytes for the final binary.
DWARF Form Table -- sub_1CBF820
Maps DWARF form IDs to string names. Supports DWARF 2 forms:
| ID | Form | Encoding |
|---|---|---|
| 1 | DW_FORM_addr | Target address |
| 3 | DW_FORM_block2 | 2-byte length block |
| 4 | DW_FORM_block4 | 4-byte length block |
| 5 | DW_FORM_data2 | 2-byte unsigned |
| 6 | DW_FORM_data4 | 4-byte unsigned |
| 7 | DW_FORM_data8 | 8-byte unsigned |
| 8 | DW_FORM_string | Null-terminated inline |
| 9 | DW_FORM_block | ULEB128 length block |
| 10 | DW_FORM_block1 | 1-byte length block |
| 11 | DW_FORM_data1 | 1-byte unsigned |
| 12 | DW_FORM_flag | Boolean byte |
| 13 | DW_FORM_sdata | Signed LEB128 |
| 14 | DW_FORM_strp | 4-byte offset into .debug_str |
| 15 | DW_FORM_udata | Unsigned LEB128 |
| 16 | DW_FORM_ref_addr | Address-sized reference |
| 17 | DW_FORM_ref1 | 1-byte CU-relative reference |
| 18 | DW_FORM_ref2 | 2-byte CU-relative reference |
| 19 | DW_FORM_ref4 | 4-byte CU-relative reference |
| 20 | DW_FORM_ref8 | 8-byte CU-relative reference |
| 21 | DW_FORM_ref_udata | ULEB128 CU-relative reference |
| 22 | DW_FORM_indirect | Form specified inline |
The absence of DWARF 4/5 forms (e.g., DW_FORM_sec_offset, DW_FORM_exprloc, DW_FORM_flag_present) indicates ptxas targets DWARF version 2, consistent with the pointer size and CU header format observed in sub_1CC5EB0.
DWARF Attribute Table -- sub_1CBF9B0
Maps DWARF attribute IDs to string names. The function recognizes a comprehensive set of standard attributes. Notable entries include:
- Location attributes: DW_AT_location (2), DW_AT_frame_base (64), DW_AT_data_member_location (56)
- Name/type: DW_AT_name (3), DW_AT_type (73), DW_AT_encoding (62)
- Scope: DW_AT_low_pc (17), DW_AT_high_pc (18), DW_AT_stmt_list (16)
- Producer: DW_AT_producer (37), DW_AT_comp_dir (27), DW_AT_language (19)
- Subprogram: DW_AT_inline (32), DW_AT_prototyped (39), DW_AT_artificial (52)
- Array: DW_AT_lower_bound (34), DW_AT_upper_bound (47), DW_AT_count (55)
- Calling: DW_AT_calling_convention (54), DW_AT_return_addr (42)
- Accessibility: DW_AT_accessibility (50), DW_AT_external (63)
- C++ support: DW_AT_vtable_elem_location (77), DW_AT_containing_type (29)
.debug_abbrev Parser -- sub_1CC0850
Parses the abbreviation table that defines the schema for each DIE tag. The dump mode output header is:
Contents of the .debug_abbrev section:
Number TAG
Each entry includes:
- Abbreviation number
- TAG name (e.g., DW_TAG_compile_unit, DW_TAG_subprogram)
- Children indicator: [has children] or [has no children]
- Attribute-form pairs
The function includes a safety check: "unexpectedly too many dwarf attributes for any DW_TAG entry!" -- a guard against malformed or corrupt abbreviation tables.
.debug_info Parser -- sub_1CC4A40
Walks the DIE tree, printing entries with nesting depth indentation:
<%d><%x>: Abbrev Number: %d (0x%02x %s)
Format: <nesting_depth><byte_offset>: Abbrev Number: <n> (<tag_hex> <tag_name>). Null DIEs are printed as " (nill) ". Attribute values are formatted by sub_1CC4100 (the attribute value printer) which dispatches on form type.
Compilation Unit Header -- sub_1CC5EB0
Parses and prints CU headers, and creates the NVIDIA extension .nv_debug_info_ptx section:
Compilation Unit @ offset 0x%zx:
Length: %d
Version: %d
Abbrev Offset: %d
Pointer Size: %d
The pointer size field is significant -- it determines the size of DW_FORM_addr values and DW_FORM_ref_addr references throughout the CU.
Location Expression Decoder -- sub_1CC34E0
Decodes DWARF location expressions (DW_OP_* operations) used in DW_AT_location and related attributes. The supported operations reveal how ptxas encodes GPU variable locations:
| Operation | String | GPU usage |
|---|---|---|
| DW_OP_addr | "DW_OP_addr: 0x%x" | Absolute memory address (global/shared/local) |
| DW_OP_const4u | "DW_OP_const4u: %d" | 4-byte unsigned constant |
| DW_OP_xderef | "DW_OP_xderef" | Cross-address-space dereference (GPU memory spaces) |
| DW_OP_plus_uconst | "DW_OP_plus_uconst: %llu" | Add unsigned constant to stack top |
| DW_OP_lit0--DW_OP_lit31 | "DW_OP_lit%u" | Push literal 0--31 |
| DW_OP_reg0--DW_OP_reg31 | "DW_OP_reg%d" | Variable in register N |
| DW_OP_breg0--DW_OP_breg31 | "DW_OP_breg%d %lld" | Register N + signed offset |
| DW_OP_fbreg | "DW_OP_fbreg: %lld" | Frame base + signed offset (stack variables) |
| DW_OP_nop | "DW_OP_nop" | No operation |
| DW_OP_stack_value | "DW_OP_stack_value" | Value is on DWARF expression stack, not in memory |
The presence of DW_OP_xderef is particularly noteworthy -- this is a DWARF operation rarely used in CPU debuggers but essential for GPU debugging, where variables may reside in different memory spaces (global, shared, local, constant) that require address-space-qualified access.
Debug Section Classification
Three classifier functions map section names to internal type IDs. The type IDs route sections to the correct processing pipeline during ELF assembly.
SASS Classifier -- sub_1C99340
Recognizes standard DWARF sections by comparing the section name (obtained via sub_1CB9E50) against hardcoded strings. Returns 1 (is-debug-section) for:
.debug_abbrev .debug_aranges .debug_frame
.debug_info .debug_loc .debug_macinfo
.debug_pubnames .debug_str .debug_line
Plus the NVIDIA extension:
.nv_debug_info_reg_sass
Mercury Classifier -- sub_1C98C60
The Mercury classifier sub_1C98C60 checks for the .nv.merc. prefix on the same set of debug section names. It uses a strcmp chain against 15 Mercury-namespaced section names. This classifier is called from 4 sites, primarily during Capsule Mercury construction when debug sections need to be cloned into the merc namespace.
Unified Classifier -- sub_1C9D1F0
The master debug section classifier sub_1C9D1F0 (2,667 bytes, 13 callees) handles both SASS and Mercury variants. It:
- Checks whether the section has the Mercury flag (bit 0x10 in section flags byte at offset +11)
- For Mercury sections, dispatches to the .nv.merc.debug_* name check
- For standard sections, dispatches to the .debug_* name check
- Recognizes additional NVIDIA extensions: .nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type
- Uses setjmp/longjmp error recovery for malformed section handling
The function also checks the deviceDebug flag at context offset +432 to suppress .debug_frame and .debug_line when debug info is not requested. This is the gate that prevents line tables from appearing in release builds.
Section Type Mapper -- sub_1C998D0
Maps debug section names to internal buffer pointers within the debug context object:
| Section name | Returns pointer at offset |
|---|---|
| .debug_line | context + 80 (a1[10]) |
| .debug_frame | context + 72 (a1[9]) |
| .nv_debug_line_sass | context + 88 (a1[11]) |
| .debug_info | (subsequent check) |
| .debug_loc | (subsequent check) |
This enables the ELF emitter to route section data to the correct output buffer during final assembly.
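A sketch of that strcmp chain, modeling the debug context as an array of pointer-sized slots so that context+80 is slot a1[10]; the NULL fallback standing in for the subsequent .debug_info/.debug_loc checks is illustrative.

```c
#include <assert.h>
#include <string.h>

/* Sketch of sub_1C998D0: route a debug section name to the matching
   output-buffer slot in the debug context. The context is modeled as
   an array of 8-byte slots, so context+80 is ctx[10], etc. */
static void **debug_buffer_slot(void **ctx, const char *name) {
    if (strcmp(name, ".debug_line") == 0)
        return &ctx[10];                    /* context + 80 */
    if (strcmp(name, ".debug_frame") == 0)
        return &ctx[9];                     /* context + 72 */
    if (strcmp(name, ".nv_debug_line_sass") == 0)
        return &ctx[11];                    /* context + 88 */
    return 0;  /* .debug_info / .debug_loc handled by subsequent checks */
}
```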
Instruction-Level Debug Metadata
Each internal instruction node in the Ori IR carries debug metadata at offset +20 (a pointer or encoded value for source line information) and offset +0 (a pointer to the PTX source location). This metadata travels through the entire optimization pipeline:
- PTX parsing: The parser records .loc directives and attaches source file/line/column to each instruction as it is lowered from PTX to the Ori IR.
- Optimization passes: Most optimization passes preserve or propagate debug metadata. When instructions are cloned (e.g., loop unrolling), the clone inherits the original's debug info. When instructions are deleted, their debug info is lost -- the debugger will map those addresses to the nearest surviving instruction's source line.
- Post-RA annotation (sub_88D870): After register allocation and scheduling, this pass finalizes the source-line-to-SASS-address correspondence. It walks all instructions and records the final mapping that will be encoded into the .nv_debug_line_sass section.
- ELF emission: The debug line table generator sub_866BB0 reads the finalized mappings and encodes them as a DWARF line number program.
The PTX-level line generator sub_45C3A0 uses the label $LDWend as an end-of-debug-range marker. The error message "function index not found in debug function map" indicates a function-to-index mapping that translates internal function identifiers to DWARF subprogram indices.
DWARF Version and Extensions
ptxas generates DWARF version 2 debug information. Evidence:
- The form table (sub_1CBF820) covers exactly forms 1--22, which is the DWARF 2 form set. No DWARF 3+ forms (DW_FORM_sec_offset = 0x17, DW_FORM_exprloc = 0x18) are present.
- The CU header parser (sub_1CC5EB0) prints "Version: %d" as a field, consistent with the 11-byte DWARF 2 CU header format.
- The attribute table includes DWARF 2 attributes only.
CUDA-Specific DWARF Extensions
NVIDIA extends standard DWARF for GPU debugging through:
- Address space encoding: DW_OP_xderef is used with address space qualifiers to distinguish GPU memory spaces. The DW_AT_address_class attribute (recognized in the attribute table at ID 51) encodes CUDA memory space identifiers.
- Parallel execution model: GPU warps execute 32 threads simultaneously. Debug info must account for the fact that each "program counter" corresponds to 32 concurrent threads, and divergent threads may be at different source locations.
- Register mapping: The .nv_debug_info_reg_sass section provides a CUDA-specific register-to-variable mapping that goes beyond standard DWARF location lists. GPU register files are much larger (up to 255 general-purpose registers) and have different allocation semantics than CPU registers.
- PTX/SASS duality: The two-tier line table (.debug_line for PTX, .nv_debug_line_sass for SASS) reflects the unique compilation model where PTX is an intermediate representation with its own debug significance.
Section Layout Considerations
The ELF section layout calculator sub_1C9DC60 applies special handling to .debug_line:
- .debug_line sections receive special padding during layout to ensure proper alignment
- Debug sections are placed after code and data sections in the ELF file
- The .debug_line section name is one of only three section names explicitly checked by the layout calculator (alongside .nv.constant0 and .nv.reservedSmem)
Key Address Summary
| Address range | Subsystem | Function count |
|---|---|---|
| 0x45A000--0x45F000 | PTX-level DWARF line generator + LEB128 | ~6 |
| 0x660000--0x675000 | IR-level DWARF builder/emitter | ~8 |
| 0x860000--0x869000 | SASS-level debug line + register info | ~8 |
| 0x88D000--0x88E000 | Post-RA debug annotator | 1 |
| 0x1C98000--0x1C9A000 | Section classifiers (merc + SASS) | ~6 |
| 0x1C9D000--0x1C9E000 | Unified section classifier | 1 |
| 0x1CBF000--0x1CC7000 | DWARF processor/dumper cluster | ~12 |
Relocations & Symbols
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas defines two parallel relocation type systems for CUBIN ELF files: R_CUDA_* (117 types, ordinals 0--116) for SASS-encoded cubins targeting SM 30--90a, and R_MERCURY_* (65 types, ordinals 0--64) for Mercury-encoded cubins targeting SM 100+ (Blackwell and later). Both systems use standard Elf64_Rela relocation entries in .rela.text.<funcname> sections, with a custom resolution algorithm that handles alias redirection, dead function filtering, UFT/UDT pseudo-relocations, PC-relative branch validation, and sub-byte instruction patching. The symbol table (.symtab) follows standard ELF Elf64_Sym format with CUDA-specific symbol types and an extended section index mechanism (.symtab_shndx) for programs exceeding 65,280 sections.
| Relocation resolver | sub_1CD48C0 (4,184 bytes binary, 22 KB decompiled, 17 callees) |
| Relocation writer | sub_1CD5920 (1,985 bytes binary, 11 KB decompiled) |
| Relocation creator (SASS) | sub_1CD4510 (860 bytes binary) |
| Relocation creator (Mercury) | sub_1CD46B0 (540 bytes binary) |
| Relocation pre-scan | sub_1CD43A0 (560 bytes binary) |
| Bit-field patcher | sub_1CD34E0 (3,700 bytes binary, sub_1CD33F0/sub_1CD3330 helpers) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes binary, 49 KB decompiled, 36 callees) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes binary, 4 call sites) |
| Section index remap | sub_1C99BB0 (4,900 bytes binary) |
| UFT manager | sub_1CD22E0 (1,979 bytes binary, 10 KB decompiled) |
| UFT slot validator | sub_1CD2AA0 (~800 bytes binary) |
| Bindless handler | sub_1CAB300 (2,157 bytes binary, 12 KB decompiled) |
| R_CUDA table address | off_2408B60 (117 entries x 64 bytes) |
| R_MERCURY table address | off_2407B60 (65 entries x 64 bytes) |
Relocation Type Systems
Table Selection Logic
The ELFW object stores the ELF class byte at offset 7 and a flags word at offset 48. The relocation subsystem selects between the two tables based on the IsPIC flag combined with the ELF class:
// Table selection (reconstructed from sub_1CD48C0, sub_1CD4510, sub_1CD5920)
uint32_t test_bit = (elfw->ei_class == 'A') ? 1 : 0x80000000;
bool is_mercury = (test_bit & elfw->flags) != 0;
if (is_mercury) {
// SM 100+ Mercury encoding: off_2407B60
// Type codes start at 0x10000; subtract to index the table
table = &R_MERCURY_table; // off_2407B60
index = raw_type - 0x10000; // range check: index <= 0x3F (63)
} else {
// SM 30-90a SASS encoding: off_2408B60
table = &R_CUDA_table; // off_2408B60
index = raw_type; // range check: index <= 0x73 (115)
}
Mercury relocation type codes are stored with a 0x10000 offset in the internal relocation entry's type field. This lets a single code path handle both systems -- the table selection just subtracts the offset for Mercury types.
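The selection-plus-normalization logic can be restated as a single helper. The following is a minimal sketch of the reconstructed behavior; the function and type names (`select_reloc_table`, `reloc_table_ref`) are ours, not recovered from the binary:

```c
#include <stdbool.h>
#include <stdint.h>

#define MERCURY_TYPE_BASE 0x10000u  /* Mercury ordinals are stored offset by this */

/* Hypothetical normalized result: which table, and the descriptor index. */
typedef struct {
    bool is_mercury;
    uint32_t index;   /* ordinal into the R_CUDA or R_MERCURY descriptor table */
    bool valid;       /* false if the raw type code is out of range */
} reloc_table_ref;

/* Normalize a raw relocation type code into a (table, index) pair,
 * mirroring the reconstructed logic of sub_1CD48C0/sub_1CD4510. */
reloc_table_ref select_reloc_table(uint32_t raw_type, bool mercury_flag) {
    reloc_table_ref r = {0};
    if (mercury_flag) {
        r.is_mercury = true;
        r.index = raw_type - MERCURY_TYPE_BASE;
        r.valid = raw_type >= MERCURY_TYPE_BASE && r.index <= 0x3F;
    } else {
        r.is_mercury = false;
        r.index = raw_type;
        r.valid = r.index <= 0x73;
    }
    return r;
}
```

Because Mercury type codes carry the 0x10000 bias in every internal entry, downstream code never needs a separate "which ISA" flag per relocation; the bias itself disambiguates.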
Relocation Descriptor Table Format
Each entry in the relocation type descriptor table is 64 bytes (8 qwords). The layout is accessed through pointer arithmetic patterns like table[8 * index + N] where the table pointer type is char** (8-byte stride):
// Relocation type descriptor -- 64 bytes per entry (reconstructed)
struct reloc_type_desc {
const char* name; // +0: R_CUDA_* or R_MERCURY_* name string
uint32_t unknown_08; // +8: unknown field
uint32_t unknown_0c; // +12: unknown field
uint32_t bit_start; // +16: starting bit position in instruction
uint32_t bit_width; // +20: field width in bits
uint32_t patch_mode; // +24: patching mode (0=none, 1=direct, 6/7=split)
uint32_t flags_hi; // +28: high flags (value 12-15 triggers callgraph)
// ... remaining 32 bytes: additional patching parameters
};
The patch_mode field at offset +24 drives the bit-field patching logic in sub_1CD34E0. The switch statement handles these modes:
| Mode | Description | Types |
|---|---|---|
| 0 | No-op (sentinel/terminator) | R_CUDA_NONE, R_CUDA_NONE_LAST |
| 1, 0x12, 0x2E | Direct bit-field write (full or partial 64-bit word) | Most absolute/PC-relative types |
| 6, 0x37 | Split low-word patching (handles cross-qword boundaries) | LO types, sub-byte 8_N types |
| 7, 0x38 | Split high-word patching (uses HIDWORD of value) | HI types |
When flags_hi (at descriptor offset +28) is in the range 12--15, the relocation creator calls sub_1CBD0D0 to register the relocation's target section in the call graph. This triggers call graph edge creation for function descriptors and branch targets.
R_CUDA_* Relocation Types
117 types from R_CUDA_NONE (ordinal 0) to R_CUDA_NONE_LAST (ordinal 116). String addresses span 0x23FBE0E--0x23FC6B6 in the ptxas binary, confirming these are contiguous in the read-only data section. Ordinals are assigned by string table order.
Absolute Address Relocations
| Ordinal | Name | Bit Field | Purpose |
|---|---|---|---|
| 0 | R_CUDA_NONE | -- | Sentinel / no relocation |
| 1 | R_CUDA_32 | 32-bit | Absolute 32-bit address |
| 2 | R_CUDA_64 | 64-bit | Absolute 64-bit address |
| 5 | R_CUDA_ABS32_26 | 32-bit at bit 26 | Absolute address, 26-bit encoding |
| 10 | R_CUDA_ABS32_LO_26 | low 32 at bit 26 | Low half of 64-bit address |
| 11 | R_CUDA_ABS32_HI_26 | high 32 at bit 26 | High half of 64-bit address |
| 12 | R_CUDA_ABS32_23 | 32-bit at bit 23 | Absolute address, 23-bit encoding |
| 13 | R_CUDA_ABS32_LO_23 | low 32 at bit 23 | Low half, 23-bit encoding |
| 14 | R_CUDA_ABS32_HI_23 | high 32 at bit 23 | High half, 23-bit encoding |
| 15 | R_CUDA_ABS24_26 | 24-bit at bit 26 | 24-bit absolute address |
| 16 | R_CUDA_ABS24_23 | 24-bit at bit 23 | 24-bit absolute, 23-bit encoding |
| 17 | R_CUDA_ABS16_26 | 16-bit at bit 26 | 16-bit absolute address |
| 18 | R_CUDA_ABS16_23 | 16-bit at bit 23 | 16-bit absolute, 23-bit encoding |
| 42 | R_CUDA_ABS32_20 | 32-bit at bit 20 | Volta+ encoding format |
| 43 | R_CUDA_ABS32_LO_20 | low 32 at bit 20 | Low half, 20-bit encoding |
| 44 | R_CUDA_ABS32_HI_20 | high 32 at bit 20 | High half, 20-bit encoding |
| 45 | R_CUDA_ABS24_20 | 24-bit at bit 20 | 24-bit, 20-bit encoding |
| 46 | R_CUDA_ABS16_20 | 16-bit at bit 20 | 16-bit, 20-bit encoding |
| 55 | R_CUDA_ABS32_32 | 32-bit at bit 32 | Ampere+ encoding format |
| 56 | R_CUDA_ABS32_LO_32 | low 32 at bit 32 | Low half, 32-bit position |
| 57 | R_CUDA_ABS32_HI_32 | high 32 at bit 32 | High half, 32-bit position |
| 58 | R_CUDA_ABS47_34 | 47-bit at bit 34 | 47-bit wide field |
| 59 | R_CUDA_ABS16_32 | 16-bit at bit 32 | 16-bit, 32-bit position |
| 60 | R_CUDA_ABS24_32 | 24-bit at bit 32 | 24-bit, 32-bit position |
| 74 | R_CUDA_ABS24_40 | 24-bit at bit 40 | 24-bit at offset 40 |
| 75 | R_CUDA_ABS55_16_34 | 55-bit, 16+34 split | Split wide field |
| 100 | R_CUDA_ABS20_44 | 20-bit at bit 44 | 20-bit at offset 44 |
| 114 | R_CUDA_ABS56_16_34 | 56-bit, 16+34 split | Split wide field |
| 70 | R_CUDA_32_LO | low 32 | Low half of 64-bit |
| 71 | R_CUDA_32_HI | high 32 | High half of 64-bit |
The naming convention encodes the bit-field geometry: R_CUDA_ABS<width>_<start_bit> indicates that <width> bits of the resolved address are patched into the instruction at bit position <start_bit>. The LO/HI suffix indicates low or high 32 bits of a 64-bit value. The different start positions (20, 23, 26, 32, 34, 40, 44) correspond to different SASS instruction encoding formats across SM generations: Kepler (26), Maxwell/Pascal (23), Volta/Turing (20), Ampere/Ada/Hopper (32).
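Because the convention is mechanical, the geometry can be recovered from the name itself. A small sketch (the helper name is ours; note that LO/HI forms are rejected by the trailing-suffix mismatch, while split forms like ABS55_16_34 would yield only their first two numbers):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse the bit-field geometry out of an R_CUDA_ABS<width>_<start> name.
 * Returns false when the name does not follow the two-number convention
 * (e.g. R_CUDA_ABS32_LO_26, where "LO" breaks the second %u). */
bool parse_abs_geometry(const char* name, unsigned* width, unsigned* start) {
    const char* p = strstr(name, "ABS");
    if (!p) return false;
    return sscanf(p + 3, "%u_%u", width, start) == 2;
}
```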
Global Address Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 3 | R_CUDA_G32 | Global-space 32-bit address |
| 4 | R_CUDA_G64 | Global-space 64-bit address |
| 84 | R_CUDA_G8_0 | Global-space byte 0 of 64-bit instruction |
| 85 | R_CUDA_G8_8 | Global-space byte 1 |
| 86 | R_CUDA_G8_16 | Global-space byte 2 |
| 87 | R_CUDA_G8_24 | Global-space byte 3 |
| 88 | R_CUDA_G8_32 | Global-space byte 4 |
| 89 | R_CUDA_G8_40 | Global-space byte 5 |
| 90 | R_CUDA_G8_48 | Global-space byte 6 |
| 91 | R_CUDA_G8_56 | Global-space byte 7 |
Global address relocations target .nv.global and .nv.global.init sections. The G8_* sub-byte variants patch individual bytes within a 64-bit instruction word, used when the instruction encoding requires the address to be spread across non-contiguous bit fields.
PC-Relative Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 40 | R_CUDA_PCREL_IMM24_26 | PC-relative 24-bit immediate at bit 26 |
| 41 | R_CUDA_PCREL_IMM24_23 | PC-relative 24-bit immediate at bit 23 |
PC-relative relocations resolve branch and call targets. The resolver enforces a critical constraint:
"PC relative branch address should be in the same section"
This means intra-function branches use PC-relative relocations, but cross-function calls use absolute or function descriptor relocations. The 24-bit immediate provides a +/-8 MB range from the instruction address, sufficient for any single kernel.
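The range arithmetic for a signed 24-bit displacement can be sketched directly (helper name ours): 2^23 bytes in each direction is 8 MB.

```c
#include <stdbool.h>
#include <stdint.h>

/* A signed 24-bit immediate covers displacements in [-2^23, 2^23 - 1] bytes,
 * i.e. roughly +/-8 MB from the instruction address. */
bool pcrel24_in_range(uint64_t insn_addr, uint64_t target_addr) {
    int64_t disp = (int64_t)(target_addr - insn_addr);
    return disp >= -(1 << 23) && disp < (1 << 23);
}
```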
Constant Field Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 24 | R_CUDA_CONST_FIELD19_28 | 19-bit constant bank offset at bit 28 |
| 25 | R_CUDA_CONST_FIELD19_23 | 19-bit constant bank offset at bit 23 |
| 36 | R_CUDA_CONST_FIELD21_26 | 21-bit constant bank offset at bit 26 |
| 38 | R_CUDA_CONST_FIELD19_26 | 19-bit constant bank offset at bit 26 |
| 39 | R_CUDA_CONST_FIELD21_23 | 21-bit constant bank offset at bit 23 |
| 50 | R_CUDA_CONST_FIELD19_20 | 19-bit constant bank offset at bit 20 |
| 54 | R_CUDA_CONST_FIELD21_20 | 21-bit constant bank offset at bit 20 |
| 64 | R_CUDA_CONST_FIELD19_40 | 19-bit constant bank offset at bit 40 |
| 66 | R_CUDA_CONST_FIELD21_38 | 21-bit constant bank offset at bit 38 |
| 115 | R_CUDA_CONST_FIELD22_37 | 22-bit constant bank offset at bit 37 |
Constant field relocations patch .nv.constant0.<func> bank offsets into load constant (LDC) instructions. The field width (19, 21, or 22 bits) determines the maximum addressable constant bank size: 19-bit supports 512 KB, 21-bit supports 2 MB, 22-bit supports 4 MB. During resolution, the constant bank deduplication pass (sub_1CA6890) may adjust the relocation offset:
"optimize ocg constant reloc offset from %lld to %lld"
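The width-to-capacity relationship stated above is a straight power of two; a sketch (helper names ours) for checking that a bank offset fits the chosen field width:

```c
#include <stdbool.h>
#include <stdint.h>

/* Maximum constant-bank size addressable by a CONST_FIELD of the given width:
 * 19 bits -> 512 KB, 21 bits -> 2 MB, 22 bits -> 4 MB. */
uint64_t const_field_capacity(unsigned width_bits) {
    return 1ull << width_bits;
}

bool const_offset_fits(uint64_t offset, unsigned width_bits) {
    return offset < const_field_capacity(width_bits);
}
```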
Function Descriptor Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 31 | R_CUDA_FUNC_DESC32_23 | 32-bit function descriptor at bit 23 |
| 32 | R_CUDA_FUNC_DESC32_LO_23 | Low 32 of descriptor at bit 23 |
| 33 | R_CUDA_FUNC_DESC32_HI_23 | High 32 of descriptor at bit 23 |
| 34 | R_CUDA_FUNC_DESC_32 | Full 32-bit function descriptor |
| 35 | R_CUDA_FUNC_DESC_64 | Full 64-bit function descriptor |
| 47 | R_CUDA_FUNC_DESC32_20 | 32-bit function descriptor at bit 20 |
| 48 | R_CUDA_FUNC_DESC32_LO_20 | Low 32 of descriptor at bit 20 |
| 49 | R_CUDA_FUNC_DESC32_HI_20 | High 32 of descriptor at bit 20 |
| 61 | R_CUDA_FUNC_DESC32_32 | 32-bit function descriptor at bit 32 |
| 62 | R_CUDA_FUNC_DESC32_LO_32 | Low 32 of descriptor at bit 32 |
| 63 | R_CUDA_FUNC_DESC32_HI_32 | High 32 of descriptor at bit 32 |
| 92--99 | R_CUDA_FUNC_DESC_8_0 -- R_CUDA_FUNC_DESC_8_56 | Sub-byte function descriptor patches |
Function descriptors are used for indirect calls through function pointers. The descriptor contains the target function's entry point address and is loaded by the GPU's indirect call mechanism. The sub-byte FUNC_DESC_8_* variants patch individual bytes of the descriptor into instruction encoding slots, used in wide instruction formats where the descriptor address is spread across multiple fields. When the relocation creator detects a flags_hi value of 12--15 in the descriptor table entry, it calls sub_1CBD0D0 to register the call edge in the call graph.
Texture, Sampler, and Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 6 | R_CUDA_TEX_HEADER_INDEX | Texture header table index |
| 7 | R_CUDA_SAMP_HEADER_INDEX | Sampler header table index |
| 8 | R_CUDA_SURF_HW_DESC | Surface hardware descriptor |
| 9 | R_CUDA_SURF_HW_SW_DESC | Surface hardware+software descriptor |
| 19 | R_CUDA_TEX_SLOT | Texture binding slot |
| 20 | R_CUDA_SAMP_SLOT | Sampler binding slot |
| 21 | R_CUDA_SURF_SLOT | Surface binding slot |
| 26 | R_CUDA_TEX_SLOT9_49 | 9-bit texture slot at bit 49 |
| 52 | R_CUDA_SURF_HEADER_INDEX | Surface header table index |
| 101 | R_CUDA_SAMP_HEADER_INDEX_0 | Sampler header index variant |
These relocations connect texture/sampler/surface operations to their runtime-allocated descriptor table entries. The CUDA driver fills in the actual descriptor indices at launch time based on the kernel's resource binding.
Bindless Texture/Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 22 | R_CUDA_TEX_BINDLESSOFF13_32 | Bindless texture offset, 13-bit at bit 32 |
| 23 | R_CUDA_TEX_BINDLESSOFF13_47 | Bindless texture offset, 13-bit at bit 47 |
| 29 | R_CUDA_TEX_BINDLESSOFF13_41 | Bindless texture offset, 13-bit at bit 41 |
| 30 | R_CUDA_TEX_BINDLESSOFF13_45 | Bindless texture offset, 13-bit at bit 45 |
| 51 | R_CUDA_BINDLESSOFF13_36 | Bindless offset, 13-bit at bit 36 |
| 65 | R_CUDA_BINDLESSOFF14_40 | Bindless offset, 14-bit at bit 40 |
Bindless texture/surface relocations are handled by sub_1CAB300, which creates $NVLINKBINDLESSOFF_<name> symbols for each bindless reference. During resolution:
"change reloc symbol from %d to %d"
"no bindless ref in section %s"
"unexpected usage of non-unified surface descriptors"
Sub-Byte Patch Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 76--83 | R_CUDA_8_0 -- R_CUDA_8_56 | Patch byte 0--7 of 64-bit instruction |
These relocations patch a single byte at a specific 8-bit-aligned position within a 64-bit instruction word. They are used when the resolved value must be inserted into a non-standard bit position that does not align with the instruction encoding's immediate field boundaries.
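The byte-lane arithmetic behind the 8_N family is plain masking. A minimal sketch (helper name ours), where `bit_pos` is the N in R_CUDA_8_N:

```c
#include <stdint.h>

/* Patch one byte of a 64-bit instruction word, as the R_CUDA_8_<N> family does:
 * R_CUDA_8_0 writes bits [0,8), R_CUDA_8_56 writes bits [56,64). */
uint64_t patch_byte(uint64_t insn, unsigned bit_pos, uint8_t value) {
    uint64_t mask = 0xFFull << bit_pos;
    return (insn & ~mask) | ((uint64_t)value << bit_pos);
}
```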
Miscellaneous Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 27 | R_CUDA_6_31 | 6-bit field at bit 31 |
| 28 | R_CUDA_2_47 | 2-bit field at bit 47 |
| 37 | R_CUDA_QUERY_DESC21_37 | Query descriptor, 21-bit at bit 37 |
| 53 | R_CUDA_INSTRUCTION64 | Whole 64-bit instruction replacement |
| 67 | R_CUDA_INSTRUCTION128 | Whole 128-bit instruction replacement |
| 68 | R_CUDA_YIELD_OPCODE9_0 | YIELD opcode, 9-bit at bit 0 |
| 69 | R_CUDA_YIELD_CLEAR_PRED4_87 | Clear YIELD predicate, 4-bit at bit 87 |
| 72 | R_CUDA_UNUSED_CLEAR32 | Zero out 32-bit unused field |
| 73 | R_CUDA_UNUSED_CLEAR64 | Zero out 64-bit unused field |
| 116 | R_CUDA_NONE_LAST | Sentinel marking end of relocation table |
The R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 types replace entire instruction words, used for instruction-level patching by the linker when the instruction encoding changes based on the final resolved address.
The R_CUDA_YIELD_* types handle YIELD-to-NOP conversion. When a kernel has forward-progress requirements that prevent yielding, the resolver converts YIELD instructions to NOPs:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
The R_CUDA_UNUSED_CLEAR* types zero out instruction fields that are unused in the final encoding, ensuring deterministic output.
Unified Address Space Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 102 | R_CUDA_UNIFIED | Unified address (generic pointer) |
| 103 | R_CUDA_UNIFIED_32 | 32-bit unified address |
| 104--111 | R_CUDA_UNIFIED_8_0 -- R_CUDA_UNIFIED_8_56 | Unified address sub-byte patches |
| 112 | R_CUDA_UNIFIED32_LO_32 | Low 32 of unified at bit 32 |
| 113 | R_CUDA_UNIFIED32_HI_32 | High 32 of unified at bit 32 |
Unified address relocations resolve generic pointers that can point to global, shared, or constant memory. During final resolution, the resolver performs a type conversion from unified (type 103) to absolute (type 1):
// In sub_1CD48C0: unified reloc replacement
if (reloc_type == 103) // R_CUDA_UNIFIED_32
reloc_type = 1; // R_CUDA_32
R_MERCURY_* Relocation Types
65 types from R_MERCURY_NONE (ordinal 0) to R_MERCURY_NONE_LAST (ordinal 64). String addresses span 0x23FB8C5--0x23FBDFA. Mercury relocations serve the same purpose as R_CUDA types but are designed for the Mercury intermediate representation used on SM 100+ targets.
Mercury Type Categories
| Category | Types | Purpose |
|---|---|---|
| Address | R_MERCURY_G64, R_MERCURY_ABS64, R_MERCURY_ABS32, R_MERCURY_ABS16 | Memory addresses |
| Split address | R_MERCURY_ABS32_LO, R_MERCURY_ABS32_HI | 64-bit address halves |
| Program-relative | R_MERCURY_PROG_REL64, R_MERCURY_PROG_REL32, R_MERCURY_PROG_REL32_LO, R_MERCURY_PROG_REL32_HI | Offsets from program base |
| Tex/samp/surf | R_MERCURY_TEX_HEADER_INDEX, R_MERCURY_SAMP_HEADER_INDEX, R_MERCURY_SURF_HEADER_INDEX | Resource descriptors |
| Function | R_MERCURY_FUNC_DESC_64 | Function descriptor |
| Sub-byte | R_MERCURY_8_0 -- R_MERCURY_8_56 (8 types) | Byte-level patches |
| Global sub-byte | R_MERCURY_G8_0 -- R_MERCURY_G8_56 (8 types) | Global-space byte patches |
| Func desc sub-byte | R_MERCURY_FUNC_DESC_8_0 -- R_MERCURY_FUNC_DESC_8_56 (8 types) | Function descriptor byte patches |
| Abs-program-relative | R_MERCURY_ABS_PROG_REL32_LO, R_MERCURY_ABS_PROG_REL32_HI, R_MERCURY_ABS_PROG_REL32, R_MERCURY_ABS_PROG_REL64 | Absolute program-relative |
| Program-relative sub-byte | R_MERCURY_PROG_REL8_0 -- R_MERCURY_PROG_REL8_56 (8 types) | Program-relative byte patches |
| Unified | R_MERCURY_UNIFIED, R_MERCURY_UNIFIED_32, R_MERCURY_UNIFIED_8_0 -- R_MERCURY_UNIFIED_8_56, R_MERCURY_UNIFIED32_LO, R_MERCURY_UNIFIED32_HI | Unified address space |
| Cleanup | R_MERCURY_UNUSED_CLEAR64 | Zero out unused fields |
| Sentinels | R_MERCURY_NONE, R_MERCURY_NONE_LAST | Table boundaries |
Mercury introduces program-relative relocations (PROG_REL*) that do not exist in the R_CUDA set. These compute offsets relative to the program base address rather than absolute virtual addresses, enabling position-independent code for the Mercury deferred finalization model. The Mercury finalizer (running at link or load time) resolves these program-relative relocations after the final code layout is known.
Relocation Encoding
ELF Relocation Entry Format
Cubin relocations use standard Elf64_Rela entries in .rela.text.<funcname> sections:
typedef struct {
Elf64_Addr r_offset; // Byte offset within the section
Elf64_Xword r_info; // Symbol index (high 32) | Type (low 32)
Elf64_Sxword r_addend; // Addend for the relocation computation
} Elf64_Rela; // 24 bytes
The r_info field packs the symbol table index in the upper 32 bits and the R_CUDA/R_MERCURY type code in the lower 32 bits:
#define ELF64_R_SYM(info) ((info) >> 32)
#define ELF64_R_TYPE(info) ((info) & 0xFFFFFFFF)
For Mercury types, the type code stored in r_info is the ordinal plus 0x10000. The resolver subtracts 0x10000 before indexing the R_MERCURY descriptor table.
Internal Relocation Entry
The ELFW object maintains relocations in an internal linked list at offset +376 of the ELFW structure. Each internal entry is a 32-byte node:
// Internal relocation entry (reconstructed from sub_1CD4510, sub_1CD46B0)
struct elfw_reloc {
uint64_t offset; // +0: byte offset in target section
uint64_t type_and_section; // +8: (target_section << 32) | reloc_type
uint64_t addend; // +16: relocation addend
uint32_t symbol_index; // +24: index into ELFW symbol table
uint32_t alias_index; // +28: original symbol if aliased, else 0
};
The type_and_section field encodes both the relocation type code (low 32 bits) and the target section index (high 32 bits) in a single 64-bit field.
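The packing is the same high/low split used by `r_info`. A sketch of the accessors (names ours, matching the reconstructed layout):

```c
#include <stdint.h>

/* Pack/unpack the internal entry's type_and_section field:
 * target section index in the high 32 bits, relocation type in the low 32. */
uint64_t pack_type_section(uint32_t section, uint32_t type) {
    return ((uint64_t)section << 32) | type;
}
uint32_t unpack_type(uint64_t ts)    { return (uint32_t)ts; }
uint32_t unpack_section(uint64_t ts) { return (uint32_t)(ts >> 32); }
```

Note that Mercury type codes keep their 0x10000 bias in the low half, so the packed value remains unambiguous across both relocation systems.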
Resolved Relocation Output
Resolved relocations are written by sub_1CD5920 to .nv.resolvedrela sections. Additionally, .nv.rel.action sections carry relocation action metadata for the CUDA driver's runtime linker.
Symbol Table Structure
.symtab Format
The symbol table uses standard Elf64_Sym entries (24 bytes each for 64-bit, 16 bytes for 32-bit):
typedef struct {
Elf32_Word st_name; // String table offset
unsigned char st_info; // Type (low 4 bits) | Binding (high 4 bits)
unsigned char st_other; // Visibility (low 2 bits) | Flags
Elf64_Half st_shndx; // Section index (or SHN_XINDEX = 0xFFFF)
Elf64_Addr st_value; // Symbol value (section offset)
Elf64_Xword st_size; // Symbol size
} Elf64_Sym;
Internal Symbol Representation
The ELFW maintains an internal symbol structure (40+ bytes) with additional metadata:
| Offset | Size | Field | Description |
|---|---|---|---|
| +4 | 1 | st_info | Low nibble = type (STT_*), high nibble = binding strength |
| +5 | 1 | st_other | Bits 0-1 = visibility, bits 4-7 = CUDA-specific flags |
| +6 | 2 | st_shndx | Section index (0xFFFF = use extended index) |
| +8 | 8 | st_value | Symbol address; -1 = unallocated |
| +24 | 4 | section_link | Internal section reference |
| +28 | 4 | extra_index | Secondary symbol link |
| +32 | 8 | name_ptr | Pointer to symbol name string |
Symbol Types
| ELF Type | Value | CUDA Usage |
|---|---|---|
| STT_NOTYPE | 0 | Undefined/external symbols |
| STT_OBJECT | 1 | Global/constant/shared variables |
| STT_FUNC | 2 | Kernel entry points, device functions |
| STT_SECTION | 3 | Section symbols (one per section) |
| STT_COMMON | 5 | Common symbols (.common symbol) |
| STT_CUDA_TEXTURE | 10 | Texture reference symbols |
| STT_CUDA_SURFACE | 11 | Surface reference symbols |
| STT_CUDA_SAMPLER | 12 | Sampler reference symbols |
| STT_CUDA_FUNC_DESC | 13 | Function descriptor (indirect call target) |
The internal type field at offset +4 uses the low nibble for ELF standard types and the high nibble for binding/scope information. The resolver checks st_info & 0xF throughout its processing.
Function descriptor symbols (type 13) receive special handling in the relocation resolver. When the resolver encounters a type-13 symbol, it checks whether the symbol is allocated:
// sub_1CD48C0: function descriptor symbol handling
if ((sym->st_info & 0xF) == 13) { // STT_CUDA_FUNC_DESC
shndx = get_section_index(elfw, sym);
if (shndx == 0) {
// Unresolved -- check binding and ELFW flags
if ((sym->st_other & 0xE0) == 0x20 // STB_GLOBAL
|| (sym->st_other & 0x10)) // CUDA-specific extern flag
{
// External function descriptor: keep relocation for linker
}
}
}
Symbol Binding and Visibility
The st_other byte encodes both ELF visibility (bits 0-1) and CUDA-specific binding flags (bits 4-7):
| Bits | Field | Values |
|---|---|---|
| 0-1 | ELF visibility | 0 = STV_DEFAULT, 1 = STV_INTERNAL, 2 = STV_HIDDEN, 3 = STV_PROTECTED |
| 4 | Extern flag | 1 = external linkage (for nvlink) |
| 5-7 | Binding strength | 0x20 = STB_GLOBAL, 0x80 = STB_WEAK (the resolver masks with st_other & 0xE0) |
The 2-bit binding value checked via st_other & 3 in the resolver maps to:
| Value | Meaning | Resolution |
|---|---|---|
| 1 | STB_LOCAL / dead | Skip relocation ("ignore reloc on dead func %s") |
| 2 | STB_GLOBAL | Normal resolution |
| 3 | STB_WEAK | Resolve if available, otherwise use default |
Symbol Table Builder -- sub_1CB68D0
The symbol table builder (9,578 bytes, approximately 1,700 decompiled lines) processes the ELFW internal symbol list in these steps:
- Iterate symbols -- walks the symbol list from the ELFW object
- Filter deleted symbols -- 12 separate checks for "reference to deleted symbol" guard against stale entries from dead code elimination
- Handle __cuda_syscall -- special-cases the device-side syscall dispatcher symbol
- Resolve aliases -- follows alias chains to find the canonical symbol
- Compute values -- resolves st_value from section base + offset
- Create section symbols -- ensures every section has an STT_SECTION symbol; emits "found multiple section symbols for %s" if duplicates exist
- Handle SHN_XINDEX overflow -- when section index >= SHN_LORESERVE (0xFF00 = 65,280), sets st_shndx = SHN_XINDEX (0xFFFF) and stores the real index in .symtab_shndx
- Build .symtab_shndx -- populates the extended index table for overflow sections
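The SHN_XINDEX overflow step is a standard ELF mechanism and can be sketched compactly (helper name ours, mirroring the builder's step, not its literal code):

```c
#include <stdint.h>

#define SHN_LORESERVE 0xFF00u
#define SHN_XINDEX    0xFFFFu

/* Decide what to store in st_shndx for a given real section index.
 * Indices at or above SHN_LORESERVE overflow into the .symtab_shndx
 * side table; in-range indices are stored directly. */
uint16_t encode_shndx(uint32_t real_index, uint32_t* shndx_table_entry) {
    if (real_index >= SHN_LORESERVE) {
        *shndx_table_entry = real_index;  /* goes to .symtab_shndx */
        return SHN_XINDEX;
    }
    *shndx_table_entry = 0;               /* unused for in-range indices */
    return (uint16_t)real_index;
}
```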
Error strings observed in the builder:
| String | Condition |
|---|---|
| "reference to deleted symbol" | Symbol marked deleted but still referenced (12 checks) |
| "ignore symbol %s in unused section" | Symbol in eliminated section |
| "ignore symbol string %s for sym %d" | Skipping symbol name for unnamed/internal symbol |
| "found multiple section symbols for %s" | Duplicate STT_SECTION entries |
| "symbol already assigned" | Duplicate assignment attempt |
| "adding global symbols of same name" | Name collision |
| "alias to unknown symbol" | Alias target not found |
| "unallocated symbol" | Symbol value is -1 (never assigned an address) |
| "missing sec strtab" | String table not initialized |
Symbol Fixup -- sub_1CB2CA0
After dead code elimination removes sections, symbol indices become stale. The fixup pass (2,038 bytes, called from 4 sites) renumbers all symbol st_shndx values:
- For each section in the ELFW:
  - If the section lacks an STT_SECTION symbol, create one
  - If the section has multiple STT_SECTION symbols, warn
- Walk the symbol table and remap st_shndx values through the section index mapping
The fixup runs at multiple pipeline points: after dead function elimination, after Mercury section cloning, and after any section deletion.
Section Index Remap -- sub_1C99BB0
The companion to sub_1CB2CA0 for the extended index mechanism. When section indices change, this function updates both .symtab_shndx and .nv.merc.symtab_shndx to keep the extended index tables consistent.
Relocation Resolution Algorithm
The master resolver sub_1CD48C0 implements a 7-step algorithm that processes every relocation entry in the ELFW's linked list:
Step 1: Symbol Address Computation
For each relocation entry, compute the symbol's resolved address by adding the symbol's st_value (from the section base) to the relocation offset:
if (reloc->alias_index) {
sym = lookup_symbol(elfw, reloc->alias_index);
reloc->offset += sym->st_value;
}
For Mercury cubins (64-bit ELF class 'A' with Mercury flag set), the resolver applies an additional address transformation that accounts for the Mercury instruction stride:
if (is_mercury && sym_value != 0) {
int stride = 2 * arch_vtable->get_merc_stride();
reloc->offset += stride * (sym_value >> 7);
}
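Restated as a standalone helper (a sketch under the assumption that `merc_stride` stands in for `arch_vtable->get_merc_stride()`): every 128 bytes of symbol value contribute one doubled instruction stride to the offset.

```c
#include <stdint.h>

/* Mercury address transformation from Step 1 (reconstructed sketch):
 * offset += (2 * merc_stride) * (sym_value >> 7), skipped for zero values. */
uint64_t merc_adjust_offset(uint64_t offset, uint64_t sym_value, int merc_stride) {
    if (sym_value == 0)
        return offset;
    return offset + (uint64_t)(2 * merc_stride) * (sym_value >> 7);
}
```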
Step 2: Alias Resolution
If the relocation targets an alias symbol (ELF type STT_NOTYPE with section index pointing to another symbol), redirect the relocation to the canonical target:
"change alias reloc %s to %s"
The resolver follows the alias chain through sub_1CB1E00 (get section index) and sub_1CB3D20 (get section by index), replacing the alias with its real target.
Step 3: Dead Function Filtering
If the relocation's target symbol has local binding (st_other & 3 == 1) and is in a deleted section, the relocation is zeroed out:
"ignore reloc on dead func %s"
The relocation's type is set to 0 (R_CUDA_NONE), effectively removing it. When the output mode is not 2 (relocatable output), dead relocations on STT_NOTYPE symbols with a binding prefix of 2 are also removed.
Step 4: UFT/UDT Pseudo-Relocation Handling
Relocations targeting special synthetic symbols are intercepted:
| Symbol | Action |
|---|---|
| __UFT_OFFSET | Record for UFT slot assignment, zero the relocation |
| __UFT_CANONICAL | Map to canonical UFT entry |
| __UDT_OFFSET | Record for UDT slot assignment |
| __UDT_CANONICAL | Map to canonical UDT entry |
| __UFT, __UFT_END | UFT boundary markers |
| __UDT, __UDT_END | UDT boundary markers |
The resolver checks if a symbol name starts with "__UFT_OFFSET" (a 13-byte comparison in the decompiled code, covering the 12 characters plus the NUL terminator). If matched:
"ignore reloc on UFT_OFFSET"
The relocation entry is then processed by the UFT manager (sub_1CD22E0) which maps UUIDs to UFT slot indices.
Step 5: PC-Relative Branch Validation
For relocations whose descriptor table entry has *(table + 8*index + 5) == 16 (a flag qword at descriptor byte offset +40, in the additional-parameter area, marking PC-relative types), the resolver validates that the source and target sections are identical:
if (desc_qword5 == 16 && reloc->section != target_section)
fatal("PC relative branch address should be in the same section");
Step 6: YIELD-to-NOP Conversion
If the relocation type is R_CUDA_YIELD_OPCODE9_0 or R_CUDA_YIELD_CLEAR_PRED4_87, and the kernel has forward-progress requirements, the resolver skips the NOP conversion:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
Step 7: Bit-Field Patching
The final step delegates to sub_1CD34E0, the bit-field patcher. This function uses the relocation descriptor table entry's parameters to extract the current field value (via sub_1CD33F0), add the resolved address, and write the result back (via sub_1CD3330):
// sub_1CD34E0 -- bit-field patching (simplified)
bool apply_reloc(reloc_desc_table, index, is_addend, instruction_data,
symbol_value, reloc_offset, sym_addr, sym_shndx,
section_type_offset, old_value_out) {
entry = (uint32_t*)((char*)reloc_desc_table + index * 64 + 12); // operations start at byte 12
end = (uint32_t*)((char*)reloc_desc_table + index * 64 + 60); // up to 3 x 16-byte operations
while (entry < end) {
uint32_t bit_start = entry[0];
uint32_t bit_width = entry[1];
uint32_t mode = entry[2];
switch (mode) {
case 0: // NOP
break;
case 1: // Direct write: place value at [bit_start, bit_start+bit_width)
case 0x12: case 0x2E:
old = extract_bits(instruction_data, bit_start, bit_width);
insert_bits(instruction_data, resolved_value, bit_start, bit_width);
break;
case 6: // Split low-word write (cross-qword boundary handling)
case 0x37:
// Write low portion, advance to next qword if needed
break;
case 7: // Split high-word write
case 0x38:
// Write HIDWORD of value
break;
}
entry += 4; // Next 16-byte operation
}
return true;
}
If the NVRS (NVIDIA Register Spill) check fails during patching, the resolver emits:
"unexpected NVRS"
NVRS relocations are special-purpose relocations for register spill slot references. When the bit-field patcher returns false, the relocation is invalid for the current context.
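The extract/insert helpers can be sketched over a two-qword instruction model. This is our reconstruction of the general shape of sub_1CD33F0/sub_1CD3330 (little-endian, fields may cross the qword boundary, which is what the split modes handle), not their literal code:

```c
#include <stdint.h>

/* Read `width` bits starting at bit `start` of a 128-bit instruction
 * modeled as two little-endian qwords. */
uint64_t extract_bits(const uint64_t insn[2], unsigned start, unsigned width) {
    unsigned q = start / 64, s = start % 64;
    uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
    uint64_t v = insn[q] >> s;
    if (s + width > 64)                     /* field crosses into the next qword */
        v |= insn[q + 1] << (64 - s);
    return v & mask;
}

/* Write `width` bits of `value` at bit `start`, splitting across qwords
 * when the field straddles the boundary. */
void insert_bits(uint64_t insn[2], uint64_t value, unsigned start, unsigned width) {
    unsigned q = start / 64, s = start % 64;
    uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
    value &= mask;
    insn[q] = (insn[q] & ~(mask << s)) | (value << s);
    if (s + width > 64) {
        unsigned hi = s + width - 64;       /* bits spilling into the high qword */
        uint64_t himask = (1ull << hi) - 1;
        insn[q + 1] = (insn[q + 1] & ~himask) | (value >> (64 - s));
    }
}
```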
Post-Resolution
Successfully resolved relocations are either:
- Removed from the linked list (the relocation was fully applied to the instruction bytes)
- Kept for the output
.nv.resolvedrelasection (the relocation needs runtime resolution by the CUDA driver)
The relocation writer sub_1CD5920 validates every remaining relocation before serializing it:
| Check | Error |
|---|---|
| Symbol value == -1 | "symbol never allocated" |
| Offset >= section size | "relocation is past end of offset" |
| Target section unallocated | "rela section never allocated" |
| Address not found in section data | "reloc address not found" |
Unified Function Table (UFT) and Unified Data Table (UDT)
Purpose
UFT and UDT support indirect function calls and generic data references across compilation units. When nvcc compiles a program using function pointers, virtual functions, or __device__ function addresses taken in host code, the compiler generates UFT/UDT entries that the runtime linker resolves at load time.
Sections
| Section | Purpose |
|---|---|
.nv.uft | Jump slot table (one slot per indirect-callable function) |
.nv.uft.entry | UFT entry metadata (UUID, offset pairs) |
.nv.udt | Data slot table (one slot per externally-referenced data object) |
.nv.udt.entry | UDT entry metadata |
.nv.uft.rel | UFT relocation table |
UFT Entry Structure
Each UFT entry contains a 128-bit UUID and a 64-bit offset:
struct uft_entry {
uint64_t uuid_lo; // Low 64 bits of UUID
uint64_t uuid_hi; // High 64 bits of UUID
uint64_t offset; // Offset into the jump slot table
}; // 24 bytes per entry
UFT Manager -- sub_1CD22E0
The UFT manager (1,979 bytes, 10 KB decompiled) processes UFT/UDT entries across all compilation units:
- Build UID-to-key map -- hashes uuid_lo ^ uuid_hi as the lookup key
- Detect conflicts -- reports "uft map conflict: 0x%llx" when two entries hash to the same key
- Detect duplicates -- reports "duplicate ids in uft.entry" when identical UUIDs appear
- Reorder entries -- "Re-ordering UFT entries" / "Re-ordering UDT entries" sorts entries for deterministic output
- Match UUIDs -- cross-references UUIDs against the existing UFT for linking: "matching uuid not found" if a referenced UUID does not exist
- Align UDT -- "udt size %lld needs aligning" pads UDT entries to required alignment
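The keying and conflict-detection steps can be sketched as follows. The fold of the 128-bit UUID to a 64-bit key (uuid_lo ^ uuid_hi) and the diagnostic wording come from the recovered strings; the data layout and control flow here are illustrative assumptions.

```python
# Sketch of the UFT manager's (sub_1CD22E0) UUID-to-key mapping.
# Conflicts = different UUIDs folding to the same key; duplicates =
# the same UUID registered twice. Structure is hypothetical.
def build_uft_map(entries):
    uft_map, diagnostics = {}, []
    for uuid_lo, uuid_hi, offset in entries:
        key = uuid_lo ^ uuid_hi  # fold 128-bit UUID to 64-bit lookup key
        if key in uft_map:
            prev_lo, prev_hi, _ = uft_map[key]
            if (prev_lo, prev_hi) == (uuid_lo, uuid_hi):
                diagnostics.append("duplicate ids in uft.entry")
            else:
                diagnostics.append("uft map conflict: 0x%x" % key)
            continue
        uft_map[key] = (uuid_lo, uuid_hi, offset)
    return uft_map, diagnostics
```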
UFT Slot Validator -- sub_1CD2AA0
Validates consistency between .nv.uft (jump slots) and .nv.uft.entry (metadata):
"missing nv.uft.entry"
"Number of .nv.uft jump slots != Number of entries in .nv.uft.entry"
"size of uidx window != nv.uft"
Synthetic Symbols
The resolver recognizes these synthetic symbol names:
| Symbol | Purpose |
|---|---|
__UFT_OFFSET | Points to a UFT jump slot |
__UFT_CANONICAL | Canonical UUID entry for a UFT slot |
__UDT_OFFSET | Points to a UDT data slot |
__UDT_CANONICAL | Canonical UUID entry for a UDT slot |
__UFT / __UFT_END | UFT table start/end boundaries |
__UDT / __UDT_END | UDT table start/end boundaries |
$NVLINKBINDLESSOFF_<name> | Bindless texture/surface offset symbol |
__cuda_syscall | Device-side syscall dispatcher |
Extern Shared Memory Relocations
Extern shared memory variables (declared with extern __shared__) are handled specially because their addresses are not known until kernel launch. The resolver tracks these through dedicated strings:
"extern shared variable %s at offset %lld"
"reloc of extern shared %d replaced with symbol %d"
"new extern shared instance %d"
Multiple kernels may reference the same extern shared variable. The linker creates separate instances when necessary and patches the relocation to point to the correct instance.
Weak Symbol Handling
When nvlink encounters a weak symbol that conflicts with a strong definition:
"Could not replace weak symbol '%s'"
This occurs during the relocation pre-scan (sub_1CD43A0) when processing relocations that reference weak symbols. The pre-scan walks all relocations and checks the symbol binding at sym->st_other & 0xE0:
- 0x80 = weak: eligible for replacement by a strong definition
- 0x20 = global: normal binding
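The binding test from the pre-scan reduces to a single mask-and-compare. Note this is the document's recovered convention (binding bits in st_other & 0xE0), which differs from standard ELF where binding lives in the high nibble of st_info; the helper name is hypothetical.

```python
# Sketch of the weak-symbol eligibility check in the relocation pre-scan
# (sub_1CD43A0): mask the binding bits out of st_other and compare.
WEAK, GLOBAL = 0x80, 0x20
BINDING_MASK = 0xE0

def is_replaceable_weak(st_other: int) -> bool:
    """True if the symbol is weak and eligible for strong replacement."""
    return (st_other & BINDING_MASK) == WEAK
```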
Linking Model
Relocatable Object Mode (-c)
When ptxas produces a relocatable object (.o), all relocations are preserved in .rela.text.<func> sections. The call graph is written to .nv.callgraph. Symbols retain their binding information for nvlink to resolve.
"No relocatable objects found. Did not generate callgraph."
"Generate relocatable object"
The --preserve-relocs flag additionally preserves relocations that would normally be resolved internally:
"This option will make PTXAS to generate relocatable references for variables and preserve ..."
Executable Mode (default)
In the default mode, ptxas resolves all internal relocations and writes .nv.resolvedrela for any relocations that require runtime resolution. External references and function descriptors for indirect calls are preserved as unresolved relocations for the CUDA driver's runtime linker.
PIC Mode
Position-independent code mode (IsPIC flag) changes the relocation encoding. The ELF flags word at ELFW offset +48 encodes this mode. PIC cubins use additional program-relative relocations and avoid absolute addresses where possible.
Cross-References
- Custom ELF Emitter -- ELFW object, header construction, file serialization
- Section Catalog & EIATTR -- complete section type inventory, EIATTR encoding
- Debug Information -- DWARF section generation
- Pipeline Overview -- where relocation resolution fits in the 11-phase pipeline
- Capsule Mercury -- Mercury-specific relocation handling
Function Map
| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
sub_1CD48C0 | 4,184 B | 22 KB | 1 | 17 | Master relocation resolver (7-step algorithm) |
sub_1CD5920 | 1,985 B | 11 KB | 1 | 9 | Relocation writer (.nv.resolvedrela) |
sub_1CD4510 | ~860 B | 4 KB | -- | -- | Relocation creator (SASS) |
sub_1CD46B0 | ~540 B | 4 KB | -- | -- | Relocation creator (Mercury) |
sub_1CD43A0 | ~560 B | 3 KB | -- | -- | Relocation pre-scan (weak/extern) |
sub_1CD34E0 | 3,700 B | 17 KB | 1 | 2 | Bit-field patcher (sub_1CD33F0 extract, sub_1CD3330 insert) |
sub_1CD33F0 | ~300 B | 2 KB | 7 | 1 | Extract bits from instruction word |
sub_1CD3330 | ~200 B | 1 KB | 5 | 0 | Insert bits into instruction word |
sub_1CD22E0 | 1,979 B | 10 KB | 2 | 20 | UFT manager (UUID-to-slot mapping) |
sub_1CD2AA0 | ~800 B | 3 KB | -- | -- | UFT slot validator |
sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
sub_1CB64A0 | ~500 B | 2 KB | -- | -- | Symbol resolver (checks .nv.* special names) |
sub_1CAB300 | 2,157 B | 12 KB | 1 | 19 | Bindless texture/surface handler |
sub_1CA6890 | 2,286 B | 15 KB | 2 | 11 | Constant bank deduplication |
sub_1CBD0D0 | -- | -- | -- | -- | Call graph edge registration |
CLI Options
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas v13.0.88 accepts approximately 160 command-line options: 51 documented in --help output and roughly 109 internal/undocumented options discovered through binary analysis. All option names are registered via sub_432A00 (6,427 bytes at 0x432A00) using a generic option framework shared with other NVIDIA tools. The framework library (sub_1C960C0--sub_1C97640) supports short options (-X), long options (--name), and four value types: boolean toggle, list append, scalar value, and multi-value. Internal option names are stored ROT13-encoded in the binary.
| Total options | ~160 (51 documented + ~109 internal) |
| Option registration | sub_432A00 (0x432A00, 6,427 bytes) |
| Option parser | sub_434320 (0x434320, 10,289 bytes) |
| Framework constructor | sub_1C960C0 |
| Argv processor | sub_1C96680 |
| Help printer | sub_1C97640 |
| Options block | 1,352 bytes on stack in compilation driver |
| Name obfuscation | ROT13 for internal option names |
Architecture
argv
│
▼
┌───────────────────┐ ┌─────────────────────────┐
│ sub_1C960C0 │ │ sub_432A00 │
│ Parser ctor │◄─────│ Register ~160 options │
│ (56-byte context)│ │ (name, type, default, │
└───────┬───────────┘ │ help text, callback) │
│ └─────────────────────────┘
▼
┌───────────────────┐
│ sub_1C96680 │
│ Process argv │
│ Detect - and -- │
│ Type dispatch: │
│ 1=bool toggle │
│ 2=list append │
│ 3=scalar value │
│ 4=multi-value │
└───────┬───────────┘
│
▼
┌───────────────────┐
│ sub_434320 │
│ Validate combos │
│ Populate 1352B │
│ options block │
└───────┬───────────┘
│
▼
Compilation driver
(sub_446240)
Quick Start
The most common ptxas invocations and essential options, ordered by frequency of use:
# 1. Basic compilation: PTX -> cubin for a specific GPU
ptxas -arch sm_90 -o kernel.cubin kernel.ptx
# 2. Compilation with optimization control
ptxas -arch sm_100 -O3 -o kernel.cubin kernel.ptx
# 3. Debug build with line info
ptxas -arch sm_90 -g -lineinfo -o kernel.cubin kernel.ptx
# 4. Register-limited compilation (occupancy tuning)
ptxas -arch sm_90 -maxrregcount 64 -o kernel.cubin kernel.ptx
# 5. Verbose output with resource statistics
ptxas -arch sm_90 -v -o kernel.cubin kernel.ptx
# 6. Relocatable object for separate linking
ptxas -arch sm_90 -c -o kernel.o kernel.ptx
# 7. Fast-compile mode (trade codegen quality for build speed)
ptxas -arch sm_100 -Ofc max -o kernel.cubin kernel.ptx
# 8. Parallel compilation with multiple threads
ptxas -arch sm_90 -split-compile 0 -o kernel.cubin kernel.ptx
# 9. Internal knob override (developer/debugging)
ptxas -arch sm_90 -knob DUMPIR=AllocateRegisters -o kernel.cubin kernel.ptx
# 10. Discover all 1,294 internal knob values
DUMP_KNOBS_TO_FILE=/tmp/knobs.txt ptxas -arch sm_90 -o kernel.cubin kernel.ptx
| Goal | Options |
|---|---|
| Maximize performance | -O3 -allow-expensive-optimizations -fmad |
| Maximize occupancy | -maxrregcount N (N = 32, 64, 128, ...) |
| Minimize compile time | -Ofc max -split-compile 0 |
| Debug build | -g -lineinfo -sp-bounds-check |
| Spill diagnostics | -v -warn-spills -warn-lmem-usage |
| Internal tuning | -knob NAME=VALUE (see Knobs System) |
Option Discovery Methodology
Options were extracted from four independent sources:
- Official --help output -- 51 options with full metadata.
- Binary string extraction -- strings(1) reveals plaintext option names used in error messages and format strings.
- ROT13 decode -- internal option names are stored as ROT13 in the registration function. Decoding fj4575628 yields sw4575628, pbzcvyre-fgngf yields compiler-stats, etc.
- Decompiled code cross-reference -- string references in option processing functions (sub_434320, sub_4428E0, sub_43A400) confirm option semantics.
Tables below use these markers:
- Unmarked rows = documented in --help
- Rows marked (internal) = discovered through RE, not in --help
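The ROT13 decode step is trivial to reproduce: Python's built-in rot13 codec rotates letters and passes digits and hyphens through unchanged, which matches how ptxas obfuscates internal option names.

```python
import codecs

# Decode a ROT13-obfuscated internal option name as stored in the
# ptxas option registration function (sub_432A00).
def decode_option(name: str) -> str:
    return codecs.decode(name, "rot13")
```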
Core Compilation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--opt-level | -O | int | 3 | Optimization level (0--4) |
--output-file | -o | file | elf.o | Output file name and location |
--gpu-name | -arch | enum | sm_75 | Target GPU architecture (sm_XX, compute_XX, lto_XX) |
--compile-only | -c | bool | false | Generate relocatable object |
--entry | -e | list | (all) | Entry function name(s) to compile |
--verbose | -v | bool | false | Print code generation statistics |
--version | -V | bool | -- | Print version information |
--help | -h | bool | -- | Print help text |
--machine | -m | int | 64 | Host architecture bitness (only 64 supported) |
--input-as-string | -ias | list | -- | PTX modules as strings instead of files |
--options-file | -optf | list | -- | Include CLI options from file |
--compile-functions (internal) | -- | list | -- | Restrict compilation to named functions |
--ptx-length (internal) | -- | int | -- | PTX input length for --input-as-string mode |
--tool-name (internal) | -- | string | -- | Tool name for diagnostics (nvcc integration) |
--cuda-api-version (internal) | -- | int | (auto) | CUDA API version for compatibility |
--abi-compile (internal) | -- | bool | false | Compile using strict ABI conventions |
Debug and Instrumentation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--device-debug | -g | bool | false | Generate debug information for device code |
--generate-line-info | -lineinfo | bool | false | Generate line-number information |
--sp-bounds-check | -sp-bounds-check | bool | false | Stack-pointer bounds checking; auto-enabled with -g or -O0 |
--suppress-debug-info | -suppress-debug-info | bool | false | Suppress debug sections in output; ignored without -g or -lineinfo |
--device-stack-protector | -device-stack-protector | bool | false | Stack canaries; heuristic per-function risk assessment |
--sanitize | -sanitize | enum | -- | Instrumented code: memcheck or threadsteer |
--g-tensor-memory-access-check | -g-tmem-access-check | bool | (with -g) | Tensor memory access checks for tcgen05 |
--gno-tensor-memory-access-check | -gno-tmem-access-check | bool | false | Override: disable tensor memory access checks |
--dont-merge-basicblocks | -no-bb-merge | bool | false | Prevent basic block merging (debuggable code) |
--return-at-end | -ret-end | bool | false | Preserve last return instruction for breakpoints |
--make-errors-visible-at-exit | -make-errors-visible-at-exit | bool | false | Generate instructions to surface memory faults at exit |
--trap-into-debugger (internal) | -- | bool | false | Insert trap instructions for debugger attachment |
--device-stack-protector-size (internal) | -- | int | (varies) | Stack protector canary size |
--device-stack-protector-frame-size-threshold (internal) | -- | int | (varies) | Frame size threshold for canary insertion |
Register and Occupancy Control
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--maxrregcount | -maxrregcount | int/enum | (unlimited) | Max registers per function; accepts N, archmax, archmin |
--minnctapersm | -minnctapersm | int | -- | Min CTAs per SM; ignored if -maxrregcount is set |
--maxntid | -maxntid | list | -- | Max thread-block dimensions; ignored if -maxrregcount is set |
--device-function-maxrregcount | -func-maxrregcount | int/enum | (unlimited) | Max registers for device functions (with -c); overrides --maxrregcount for non-entry functions |
--register-usage-level | -regUsageLevel | int | 5 | Register-usage optimization aggressiveness (0--10); BETA |
--override-directive-values | -override-directive-values | bool | false | CLI values override PTX directives for minnctapersm, maxntid, maxrregcount |
--first-reserved-rreg (internal) | -- | int | -- | First reserved register number (tools integration) |
--reg-fatpoint (internal) | -- | string | -- | Fatpoint register allocation mode selector |
--no-fastreg (internal) | -- | bool | false | Disable fast register allocation path |
--no-spill (internal) | -- | bool | false | Disable register spilling (debug/stress) |
Performance and Optimization
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--Ofast-compile | -Ofc | enum | 0 | Fast-compile level: 0 (disabled), min, mid, max |
--fast-compile (internal) | -- | bool | false | Internal fast-compile flag (predecessor of --Ofast-compile) |
--allow-expensive-optimizations | -allow-expensive-optimizations | bool | (auto at O2+) | Allow max resources for expensive optimizations |
--split-compile | -split-compile | int | -- | Max concurrent threads for optimizer; 0 = num CPUs |
--fmad | -fmad | bool | true | Contract FP multiply + add into FMA (FMAD/FFMA/DFMA) |
--optimize-float-atomics | -opt-fp-atomics | bool | false | FP atomic optimizations (may affect precision) |
--disable-optimizer-constants | -disable-optimizer-consts | bool | false | Disable optimizer constant bank |
--cloning (internal) | -- | enum | (auto) | Inline function cloning control (yes/no) |
--perf-per-watt-opt-level (internal) | -- | int | -- | Performance-per-watt optimization level |
--lds128convert (internal) | -- | enum | (auto) | LDS.128 conversion: always, nonconst, never |
--opt-pointers (internal) | -- | bool | (varies) | Enable pointer optimization passes |
--fastpath-off (internal) | -- | bool | false | Disable fast-path optimizations |
--full-double-div (internal) | -- | bool | (varies) | Full-precision double division |
--limit-fold-fp (internal) | -- | bool | (varies) | Limit floating-point constant folding |
--shift-right (internal) | -- | bool | false | Shift-right optimization control |
--dont-reserve-null-pointer (internal) | -- | bool | false | Do not reserve null pointer in address space |
Cache Control
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--def-load-cache | -dlcm | enum | (arch-dep) | Default cache modifier on global/generic load |
--def-store-cache | -dscm | enum | (arch-dep) | Default cache modifier on global/generic store |
--force-load-cache | -flcm | enum | -- | Force cache modifier on global/generic load |
--force-store-cache | -fscm | enum | -- | Force cache modifier on global/generic store |
Warnings and Diagnostics
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--warning-as-error | -Werror | bool | false | Promote all warnings to errors |
--disable-warnings | -w | bool | false | Inhibit all warnings |
--warn-on-spills | -warn-spills | bool | false | Warn when registers spill to local memory |
--warn-on-local-memory-usage | -warn-lmem-usage | bool | false | Warn when local memory is used |
--warn-on-double-precision-use | -warn-double-usage | bool | false | Warn when doubles are used |
--suppress-stack-size-warning | -suppress-stack-size-warning | bool | false | Suppress undetermined-stack-size warning |
--suppress-double-demote-warning | -suppress-double-demote-warning | bool | false | Suppress double demotion warning on SM without double support |
--suppress-async-bulk-multicast-advisory-warning | -suppress-async-bulk-multicast-advisory-warning | bool | false | Suppress .multicast::cluster advisory |
--suppress-sparse-mma-advisory-info | -suppress-sparse-mma-advisory-info | bool | false | Suppress mma.sp advisory |
--print-potentially-overlapping-membermasks (internal) | -- | bool | false | Diagnostic for overlapping member masks |
--no-membermask-overlap (internal) | -- | bool | false | Disable member mask overlap checks |
Output Format and Relocation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--preserve-relocs | -preserve-relocs | bool | false | Preserve relocations in linked executable |
--position-independent-code | -pic | bool | false (whole-prog: true) | Generate PIC; default on for whole-program compilation |
--compiler-annotations | -annotate | bool | false | Annotate compiler-internal information in binary |
--binary-kind (internal) | -- | enum | (arch-dep) | Target binary format: mercury, capmerc, sass |
--force-rela (internal) | -- | bool | false | Force RELA-style relocations |
--gen-std-elf (internal) | -- | bool | false | Generate standard ELF (vs NVIDIA custom format) |
--link-info (internal) | -- | string | -- | Link information for assembler |
--force-externals (internal) | -- | bool | false | Force functions as external |
--forcetext (internal) | -- | bool | false | Force text-mode SASS output |
--emit-internal-clo (internal) | -- | bool | false | Emit internal compiler-level object metadata |
--hide-user-functions (internal) | -- | bool | false | Hide user function symbols in output |
Workaround Flags
Hardware and software bug workarounds tied to internal NVIDIA bug-tracking IDs. All names are ROT13-encoded in the binary (e.g., fj2614554 decodes to sw2614554). These flags toggle specific code paths that avoid known errata or compiler defects. New workarounds appear (and old ones become permanent) with each ptxas release. The validator in sub_434320 enforces architecture restrictions: a flag set on an unsupported architecture is cleared and a diagnostic warning is emitted.
| Long Name | Short Name | Type | Default | Arch Gate | Description |
|---|---|---|---|---|---|
--sw2614554 (internal) | -- | bool | false | all | Thread-safety workaround; incompatible with --split-compile. When set, forces single-threaded compilation -- validator emits "'--sw2614554' ignored because of '--split-compile'" and disables split-compile. Addresses a race condition in the parallel optimizer. |
--sw2837879 (internal) | -- | bool | false | all | Backend codegen workaround. No architecture gating or validator logic; consumed directly in DAG/OCG pipeline phases. Specific behavioral effect not traced beyond registration. |
--sw1729687 (internal) | -- | bool | false | sm_50--sm_53 | Maxwell-era hardware errata workaround. Validator checks (arch_ordinal - 14) > 2 and clears the flag with a warning on any architecture beyond sm_53. Activates an alternate codegen path on Maxwell GPUs. |
--sw200428197 (internal) | -- | bool | false | sm_80+ | Sanitizer-compatible ABI workaround. Forces scratch register reservation for CUDA sanitizer instrumentation state and applies ABI-minimum register counts. Consumed in function/ABI setup (sub_43F400, sub_441780) alongside --compile-as-tools-patch. Validator clears it with "-arch=X ignored because of --sw200428197" on sm_75 and earlier. |
--sw200387803 (internal) | -- | bool | false | deprecated | Retired workaround. Setting it triggers a deprecation advisory (dword_29FBDB0) but no behavioral change -- the underlying fix has been permanently integrated. |
--sw200764156 (internal) | -- | bool | true | sm_90 only | Hopper-specific hardware errata. Default is true (unique among all sw* flags). Help text reads "Enable/Disable sw200764156", confirming it is a toggle that can be turned off. On any architecture other than sm_90, the user-set value is discarded: "option -arch=X ignored because of --sw200764156". |
--sw4575628 (internal) | -- | bool | false | sm_100+ | Cache and texturing mode workaround. Validator clears it with a warning on architectures sm_100 and earlier. In target configuration (sub_43A400), the target profile at offset +2465 independently determines whether the workaround is needed; if both the profile and the CLI flag are set simultaneously, the CLI flag is cleared with "--sw4575628 conflicts with specified texturing mode". |
--sw200531531 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200531531). No help text, no validator cross-references, no decompiled consumption. Consumed in backend passes not covered by available decompiled functions. |
--sw200380282 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200380282). Same as --sw200531531 -- registered but with no traceable validator or target configuration logic. |
--sw4915215 (internal) | -- | bool | false | all (behavior varies) | Generation-dependent workaround. On Blackwell (sm_100+, generation=100), when enabled alongside non-PIC mode, emits informational "sw4915215=true". On other architectures, emits a different informational. Behavioral effect is in backend codegen. |
--sw4936628 (internal) | -- | bool | false | all | Stored at options block offset +503, adjacent to --blocks-are-clusters in the registration sequence. No architecture gating in the validator. Specific behavioral effect requires deeper backend tracing; registration proximity suggests cluster/CTA-level code generation relevance. |
EIATTR-Level Workarounds
Three EIATTR attributes encode workaround metadata directly in the output ELF. These are set by target architecture rather than CLI flags -- ptxas emits them unconditionally when the target requires it, and the GPU driver applies fixups at load time.
| EIATTR Code | Name | Knob Name | Description |
|---|---|---|---|
42 (0x2A) | EIATTR_SW1850030_WAR | OneFlapJne1850030 | Instruction offsets requiring driver-side fixup for HW bug 1850030. |
48 (0x30) | EIATTR_SW2393858_WAR | OneFlapJne2393858 | Instruction offsets requiring driver-side fixup for HW bug 2393858. |
53 (0x35) | EIATTR_SW2861232_WAR | -- | Instruction offsets for HW bug 2861232 workaround. |
54 (0x36) | EIATTR_SW_WAR | -- | Generic software workaround container (variable payload). |
71 (0x47) | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | -- | Offsets of MEMBAR.SYS instructions needing software workaround. |
Tool and Patch Modes
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--compile-as-tools-patch | -astoolspatch | bool | false | Compile patch code for CUDA tools; forces ABI-minimum regcount |
--extensible-whole-program | -ewp | bool | false | Extensible whole-program mode |
--compile-as-at-entry-patch (internal) | -asatentrypatch | bool | false | Compile as at-entry instrumentation patch |
--compile-as-entry-exit-patch (internal) | -- | bool | false | Compile as entry/exit instrumentation patch |
--compile-device-func-without-entry (internal) | -- | bool | false | Allow device function compilation without entry point |
--assyscall (internal) | -- | bool | false | System-call instrumentation mode |
--fdcmpt (internal) | -- | bool | false | Forward-compatibility mode |
--enable-syscall-abi (internal) | -- | bool | false | Enable syscall ABI for device functions |
--assume-extern-functions-do-not-sync (internal) | -- | bool | false | Assume external functions do not synchronize |
--function-pointer-is-function-pointer (internal) | -- | bool | false | Treat function pointers as true function pointers |
Statistics and Profiling
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--compiler-stats (internal) | -- | bool | false | Print per-phase timing (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) and peak memory |
--compiler-stats-file (internal) | -- | file | -- | Write statistics to JSON file |
--fdevice-time-trace (internal) | -- | file | -- | Chrome DevTools trace format (JSON) for time profiling |
--ftrace-phase-after (internal) | -- | string | -- | Trace/dump IR state after named optimization phase |
--perf-stats (internal) | -- | bool | false | Print performance statistics |
--dump-perf-stats (internal) | -- | bool | false | Dump performance statistics to output |
--phase-wise (internal) | -- | bool | false | Per-phase statistics breakdown |
--use-trace-pid (internal) | -- | bool | false | Include process ID in trace output |
--verbose-tkinfo (internal) | -- | bool | false | Verbose token/parse information |
Mercury and Capsule Mercury
These options control the Mercury intermediate encoding and Capsule Mercury format, which is the default output format on sm_100+ (Blackwell).
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--cap-merc (internal) | -- | bool | (arch-dep) | Generate Capsule Mercury format |
--self-check (internal) | -- | bool | false | Validate capmerc by comparing reconstituted SASS with original |
--out-sass (internal) | -- | bool | false | Output reconstituted SASS from capmerc |
--opportunistic-finalization-lvl (internal) | -- | int | -- | Opportunistic finalization level for Mercury pipeline |
Threading and Parallelism
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--jobserver | -jobserver | bool | false | Enable GNU Make jobserver support (make -j<N>) |
--threads-dynamic-scheduling (internal) | -- | bool | (varies) | Dynamic scheduling for thread pool tasks |
--threads-min-section-size (internal) | -- | int | (varies) | Minimum section size for thread pool partitioning |
Texture and Memory Modes
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--legacy-bar-warp-wide-behavior | -legacy-bar-warp-wide-behavior | bool | false | Legacy PTX bar semantics; deprecated, ignored for sm_70+ |
--set-texmode-independent (internal) | -- | bool | false | Set texture mode to independent |
--set-texmode-raw (internal) | -- | bool | false | Set texture mode to raw |
--disable-fast-video-emulation (internal) | -- | bool | false | Disable fast video emulation path |
--treat-bf16-as-e6m9 (internal) | -- | bool | false | Treat BF16 as E6M9 format |
--legacy-cvtf64 (internal) | -- | bool | false | Legacy cvt.f64 conversion behavior |
--use-gmem-for-func-addr (internal) | -- | bool | false | Global memory for function addresses |
--blocks-are-clusters (internal) | -- | bool | false | Treat blocks as clusters (sm_90a+ TBC) |
--enable-extended-smem (internal) | -- | bool | false | Extended shared memory support |
--disable-smem-reservation (internal) | -- | bool | false | Disable shared memory reservation |
--membermask-overlap (internal) | -- | bool | (varies) | Member mask overlap control |
--ld-prefetch-random-seed (internal) | -- | int | -- | Random seed for load prefetch heuristic |
--max-stack-size (internal) | -- | int | (auto) | Max kernel stack size |
Constant Bank Allocation
NVIDIA GPUs provide 18 hardware constant banks (c[0] through c[17]), each a 64 KB read-only memory segment accessible by all threads in a warp with uniform-address broadcast -- loads from constant banks cost a single memory transaction when all threads in the warp read the same address. The compiler assigns different data categories (kernel parameters, driver state, user constants, PIC tables, etc.) to separate banks to avoid address-space collisions. These options override the default bank assignments; all are ROT13-encoded.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--sw-kernel-params-bank (internal) | -- | int | (varies) | Constant bank for kernel parameters |
--sw-driver-bank (internal) | -- | int | (varies) | Constant bank for driver data |
--sw-compiler-bank (internal) | -- | int | (varies) | Constant bank for compiler-generated constants |
--sw-user-bank (internal) | -- | int | (varies) | Constant bank for user constants |
--sw-pic-bank (internal) | -- | int | (varies) | Constant bank for PIC data |
--sw-ocl-param1-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 1 |
--sw-ocl-param2-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 2 |
--sw-devtools-data-bank (internal) | -- | int | (varies) | Constant bank for developer tools data |
--sw-bindless-tex-surf-table-bank (internal) | -- | int | (varies) | Constant bank for bindless texture/surface table |
Stress Testing
Internal options for compiler stress testing and regression verification.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--stress-no-crp (internal) | -- | bool | false | Disable CRP (Caller/callee Register Partitioning) |
--stress-maxrregcount (internal) | -- | int | -- | Override maxrregcount for stress testing |
--stress-noglobalregalloc (internal) | -- | bool | false | Disable global register allocation |
Query and Control Interface
Internal options for the query/control interface used by nvcc and other tools.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--ext-desc-file (internal) | -- | file | -- | External description file for instruction metadata |
--ext-desc-string (internal) | -- | string | -- | External description string for instruction metadata |
--query-controls (internal) | -- | string | -- | Query control parameters |
--query-schema (internal) | -- | string | -- | Query schema definition |
--apply-controls (internal) | -- | string | -- | Apply control parameters to compilation |
--profile-options (internal) | -- | string | -- | Pass profiling options to backend |
--knob (internal) | -knob | list | -- | Set internal knob: -knob NAME=VALUE; repeatable; see Knobs System |
--omega-knob (internal) | -- | string | -- | Pass omega-subsystem knob settings |
--expand-macros-in-omega (internal) | -- | bool | false | Expand macros in omega (instruction expansion) phase |
--force-expand-macros-after-errors (internal) | -- | bool | false | Force macro expansion after errors |
--enable-func-clone-sc (internal) | -- | bool | false | Enable function cloning for self-check |
--use-alternate-query-implementation (internal) | -- | bool | false | Alternate query implementation |
--use-alternate-const-ptr-implementation (internal) | -- | bool | false | Alternate constant pointer implementation |
Syscall Integration
Internal options for system-call based operations (texturing, bulk copy).
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--use-tex-grad-syscall (internal) | -- | bool | false | Syscall for texture gradient operations |
--use-tex-surf-syscall (internal) | -- | bool | false | Syscall for texture/surface operations |
--use-bulk-copy-syscall (internal) | -- | bool | false | Syscall for bulk copy operations |
Knobs Configuration
The -knob flag is the primary CLI mechanism for setting internal knob values -- the 1,294 tuning parameters documented in Knobs System. It is not listed in --help output and uses a single-dash prefix (not --knob).
Syntax
-knob NAME=VALUE Set a typed knob (int, float, double, string, range)
-knob NAME Set a boolean knob (presence = true)
-knob "A=1~B=2~C=3" Multiple knobs in one argument, separated by ~ (tilde)
Multiple -knob flags are accumulated (list-append semantics):
ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE -knob RegAllocBudget=5000 \
-arch sm_90 -o out.cubin input.ptx
Knob names are case-insensitive. The name is resolved via ROT13-encoded lookup tables in GetKnobIndex (sub_6F0820 for DAG knobs, sub_79B240 for OCG knobs). An unrecognized name produces warning 7203: "Invalid knob specified (%s)".
Value Types
The value after = is parsed according to the knob's registered type:
| Type | Syntax | Example |
|---|---|---|
| Boolean | (no value) | -knob DisableCSE |
| Integer | decimal, 0x hex, 0 octal | -knob SchedNumBB_Limit=100 |
| Float | decimal with . | -knob CostWeight=0.75 |
| Double | decimal with . | -knob PriorityScale=1.5 |
| String | raw text | -knob DUMPIR=AllocateRegisters |
| Int-range | low..high | -knob AllowedRange=100..200 |
| Int-list | comma-separated | -knob TargetOpcodes=1,2,3,4 |
Conditional Overrides (WHEN=)
Knobs can be set conditionally based on shader or instruction hash, applied only when a specific function is compiled:
# Apply knob only when shader hash matches
ptxas -knob "WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200" -arch sm_90 -o out.cubin input.ptx
# Multiple conditional overrides separated by ~
ptxas -knob "WHEN=SH=0xDEAD;DisableCSE~WHEN=IH=0x1234;RegAllocBudget=1000" ...
Condition prefixes: SH= (shader hash), IH= (instruction hash), K= (direct knob, no condition).
Interaction with Other Knob Sources
KnobsInit (sub_79D990) processes knob sources in this order -- later sources override earlier ones for the same knob index:
| Priority | Source | Mechanism |
|---|---|---|
| 1 (lowest) | Environment variables | KnobsInitFromEnv (sub_79C9D0), comma-separated name=value pairs |
| 2 | Knobs file | ReadKnobsFile (sub_79D070), plain-text with [knobs] header |
| 3 | -knob CLI flags | Accumulated list-append from argv processing |
| 4 | PTX .pragma | Per-function; disabled by DisablePragmaKnobs knob |
| 5 (highest) | WHEN= overrides | Per-function conditional, matched by shader/instruction hash |
Environment Variable: DUMP_KNOBS_TO_FILE
The DUMP_KNOBS_TO_FILE environment variable causes ptxas to write all 1,294 knob names and their resolved values to a file:
DUMP_KNOBS_TO_FILE=/tmp/all_knobs.txt ptxas -arch sm_90 -o out.cubin input.ptx
This is the primary mechanism for discovering which knobs exist, their current defaults for a given architecture, and verifying that CLI overrides took effect.
Commonly Used Knobs
| Knob | Type | Purpose |
|---|---|---|
DUMPIR | string | Dump IR after a named phase (e.g., AllocateRegisters) |
DisableCSE | bool | Disable common subexpression elimination |
DisablePhases | string | +-delimited list of phases to skip |
SchedNumBB_Limit | int | Basic block limit for scheduling heuristic |
RegAllocBudget | int | Budget for register allocation cost model |
EmitLDCU | bool | Emit LDCU instructions (SM90: requires -forcetext -sso) |
IgnorePotentialMixedSizeProblems | bool | Suppress mixed-size register warnings |
DisablePragmaKnobs | bool | Ignore all .pragma knob directives in PTX |
For the complete knob type system, file format, and all 1,294 knob categories, see Knobs System.
Version and Architecture Queries
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--list-arch | -arch-ls | bool | -- | Print supported GPU architectures |
--list-version | -version-ls | bool | -- | Print supported PTX ISA versions |
Option Interaction Rules
Several options interact in non-obvious ways, as revealed by the validation logic in sub_434320:
- `--maxrregcount` dominance -- When `--maxrregcount` is specified, `--minnctapersm` and `--maxntid` are ignored. The register constraint calculator (sub_43B660) enforces this precedence.
- `--override-directive-values` -- Only affects `--minnctapersm`, `--maxntid`, and `--maxrregcount`. Without this flag, PTX directives (`.maxnreg`, `.minnctapersm`, `.maxntid`) take precedence over CLI values.
- `--device-function-maxrregcount` vs `--maxrregcount` -- The former overrides the latter for device functions only, and only under `--compile-only` mode. For whole-program compilation, `--device-function-maxrregcount` is ignored.
- `--Ofast-compile` vs `--fast-compile` -- The documented `--Ofast-compile` supersedes the internal `--fast-compile`. Both may conflict with `--allow-expensive-optimizations` (the validator in sub_434320 checks for this).
- `--device-debug` auto-enables checks -- Setting `-g` auto-enables `--sp-bounds-check` and `--g-tensor-memory-access-check`. The flag `--gno-tensor-memory-access-check` explicitly overrides this regardless of ordering.
- `--suppress-debug-info` requires debug info -- Has no effect unless `--device-debug` or `--generate-line-info` is also specified.
- `--compile-as-tools-patch` forces ABI minimum -- Automatically sets maxrregcount to the ABI minimum. Interacts with the `--sw200428197` workaround in the function/ABI setup path (sub_43F400).
- `--split-compile` and `--allow-expensive-optimizations` -- Both activate the thread pool (sub_1CB18B0). The jobserver client (sub_1CC7300) integrates with GNU Make's `--jobserver-auth=` to respect parallel build limits.
Function Map
| Address | Size | Identity |
|---|---|---|
0x403588 | 75 B | Usage printer (calls sub_1C97640) |
0x432A00 | 6,427 B | Option registration (~160 options) |
0x434320 | 10,289 B | Option parser and validator |
0x439880 | 2,935 B | Chrome trace JSON parser (--fdevice-time-trace) |
0x43A400 | 4,696 B | Target configuration (cache defaults, --sw4575628) |
0x43B660 | 3,843 B | Register/resource constraint calculator |
0x4428E0 | 13,774 B | PTX input setup (--compile-only, --extensible-whole-program)
0x446240 | 11,064 B | Compilation driver (options block consumer)
0x60B040 | 4,500 B | Stress test option handler |
0x703AB0 | 10,000 B | Binary-kind / capmerc CLI parser |
0x1C960C0 | ~1,500 B | Option parser constructor |
0x1C96680 | ~2,000 B | Argv processor |
0x1C97210 | ~1,500 B | Option value validator |
0x1C97640 | -- | Options help printer |
Knobs System
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The knobs system is ptxas's internal configuration mechanism -- a separate layer beneath the public CLI flags that exposes 1,294 tuning parameters to NVIDIA developers. Every significant compiler heuristic (register allocation thresholds, scheduling priorities, pass enable/disable, peephole rules) has a corresponding knob. The system is shared with cicc via a common header (generic_knobs_impl.h), but ptxas instantiates it twice: once for the DAG scheduler pipeline (99 knobs) and once for the OCG (Optimizing Code Generator) backend (1,195 knobs). All knob names are stored ROT13-encoded in the binary -- a lightweight obfuscation that defeats casual discovery with the `strings` utility while remaining trivially reversible.
The knobs infrastructure lives primarily in two address regions: 0x6F0000--0x6F8000 (DAG knob instantiation, shared with the Mercury SASS pipeline) and 0x797000--0x7A2000 (OCG knob instantiation, the larger set). Both regions are compiled from the same template in generic_knobs_impl.h.
| Total knobs | 1,294 (99 DAG + 1,195 OCG) |
| Source header | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| DAG GetKnobIndex | sub_6F0820 (2,782 bytes) |
| OCG GetKnobIndex | sub_79B240 (518 bytes) |
| ParseKnobValue | sub_6F7360 / sub_79F540 (DAG: 18KB, OCG: 18KB) |
| ReadKnobsFile | sub_79D070 (9,879 bytes) |
| KnobsInit (master) | sub_79D990 (40,817 bytes) |
| KnobInit (per-knob) | sub_7A0C10 (13,874 bytes) |
| Knob descriptor | 64 bytes per entry |
| Knob runtime value | 72 bytes per slot |
| Name obfuscation | ROT13 with case-insensitive comparison |
| Setting mechanisms | -knob NAME=VALUE, knobs file ([knobs] header), PTX pragma, env var |
| Debug dump | DUMP_KNOBS_TO_FILE environment variable |
Architecture
┌──────────────────────────────────────────┐
│ KnobsInit (sub_79D990) │
│ Called once from global init sub_662920 │
└─────┬──────────┬──────────┬──────────────┘
│ │ │
┌─────────▼──┐ ┌───▼──────┐ ┌▼───────────────┐
│ ReadKnobsFile│ │ -knob CLI│ │ PTX pragma │
│ sub_79D070 │ │ parsing │ │ (unless │
│ [knobs] fmt │ │ │ │ DisablePragma) │
└─────────┬───┘ └───┬──────┘ └┬───────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ ParseKnobsString (sub_79B530) │
│ Handles WHEN=, INJECTSTRING, ~-delimited │
└──────────────────┬──────────────────────────┘
│
┌────────────▼────────────┐
│ GetKnobIndex │
│ sub_6F0820 (DAG) │
│ sub_79B240 (OCG) │
│ ROT13 decode + lookup │
└────────────┬─────────────┘
│
┌────────────▼────────────┐
│ ParseKnobValue │
│ sub_6F7360 (DAG) │
│ sub_79F540 (OCG) │
│ Type-specific parsing │
└────────────┬─────────────┘
│
┌────────────▼────────────┐
│ Runtime knob array │
│ 72 bytes per slot │
│ Accessed by index │
└──────────────────────────┘
ROT13 Name Obfuscation
Every knob name in the binary is stored as a ROT13-encoded string. The GetKnobIndex function decodes each character inline during comparison, without ever materializing the cleartext name in memory. The decode is combined with a case-insensitive tolower() comparison against the user-supplied query.
The inline ROT13 decode from sub_6F0820:
// For each character in the stored ROT13 name:
char c = stored_name[i];
if ((unsigned char)((c & 0xDF) - 65) <= 12)
c += 13; // A-M (or a-m) -> N-Z (or n-z)
else if ((unsigned char)((c & 0xDF) - 78) < 13)
c -= 13; // N-Z (or n-z) -> A-M (or a-m)
// Then compare case-insensitively:
if (tolower(query_char) != tolower(c))
goto mismatch;
The & 0xDF trick converts lowercase to uppercase before range-checking, so both 'a'-'m' and 'A'-'M' hit the first branch. Non-alphabetic characters pass through unchanged. This means knob names like SchedNumBB_Limit with underscores and digits are handled correctly -- only the alphabetic portion rotates.
To reverse-engineer knob names from the binary: extract the ROT13 strings from the knob definition table (64-byte stride at the table base pointer), apply ROT13, and you get the cleartext name.
Knob Descriptor Layout
Each knob is described by a 64-byte entry in the knob definition table. The table is an array at (knob_state + 16) with count at (knob_state + 24).
Offset Size Field
────── ──── ─────────────────────────────────────
+0 8 name_ptr Pointer to ROT13-encoded primary name
+8 8 name_len Length of primary name
+16 1 type_tag Knob type (OKT_* enum, 1-12)
+17 7 (padding)
+24 16 (reserved)
+40 8 alias_ptr Pointer to ROT13-encoded alias name
+48 8 alias_len Length of alias name
+56 8 (reserved)
────── ────
64 Total
Both primary and alias names are checked during lookup. A knob matches if either its primary name or alias decodes to the query string (case-insensitive). The alias mechanism allows backward-compatible renaming of knobs across toolkit versions.
Knob Value Layout
Runtime knob values are stored in a flat array of 72-byte slots at (knob_state + 72 * index). The slot layout depends on the type:
Offset Size Field
────── ──── ─────────────────────────────────────
+0 1 type_tag Runtime type (0=unset, 1-10)
+1 7 (padding)
+8 8 value / pointer Primary value (int32, int64, float, double, or pointer)
+16 8 list_begin For list types: first element pointer
+24 8 list_sentinel For list types: sentinel node
+32 4 aux_value Secondary value (per-type bookkeeping; see slot usage below)
+36 4 (padding)
+40 8 list_tail For list types: last element pointer
+48 8 list_head For list types: head pointer
+56 4 element_count For list types: number of elements
+60 4 (padding)
+64 8 allocator Arena allocator pointer for list/range types
────── ────
72 Total
The type tag at runtime differs from the definition-table type tag. The definition type drives parsing; the runtime type reflects what was actually stored:
| Runtime Type | Meaning | Payload |
|---|---|---|
| 0 | Unset / invalid | None |
| 1 | int32 | *(int32*)(slot + 8) |
| 2 | float | *(float*)(slot + 8) |
| 3 | double / int64 | *(int64*)(slot + 8) |
| 4 | boolean (true) | No payload; presence = true |
| 5 | string | *(char**)(slot + 8) |
| 6 | when-condition list | Doubly-linked list at +16..+48, count at +56 |
| 7 | int32 with secondary | *(int32*)(slot + 8), *(int32*)(slot + 12) |
| 8 | int-range | *(int32*)(slot + 8) = low, *(int32*)(slot + 12) = high |
| 9 | opcode-string-list | Doubly-linked list (same structure as type 6) |
| 10 | int-list (dynamic) | Growable array at +16, count at +24 |
Per-Type Slot Usage (confirmed from decompilation)
Types 1, 2, 3, 4, 5, 7, 8 -- scalar types using only bytes +0 through +15:
Type 1 (int32): +0 = 0x01, +8 = int32 value (4 bytes)
Type 2 (float): +0 = 0x02, +8 = float value (4 bytes, upper 4 undefined)
Type 3 (double): +0 = 0x03, +8 = double value (8 bytes)
Type 4 (boolean): +0 = 0x04 (no payload -- presence = true)
Type 5 (string): +0 = 0x05, +8 = char* pointer (8 bytes, NOT owned)
Type 7 (budget): +0 = 0x07, +8 = int32 primary, +12 = int32 secondary
Type 8 (int-range): +0 = 0x08, +8 = int32 low, +12 = int32 high
Types 6 and 9 -- doubly-linked list types using the full 72 bytes:
+0: byte type tag (6 or 9)
+8: ptr next pointer (initially 0)
+16: ptr → slot+24 (sentinel backward link)
+24: ptr → slot+8 (sentinel forward link)
+32: int64 (unused, set to 0)
+40: ptr tail of list
+48: ptr head of list
+56: int32 element count (starts at 2 for sentinel nodes)
+64: ptr arena allocator (for node allocation)
Each list node is 24 bytes, allocated from the arena at +64:
Type 6 node: [next(8), prev(8), string_ptr(8)]
Type 9 node: [next(8), prev(8), opcode_id(4) | int_value(4)]
Type 10 -- dynamic growable array:
+0: byte = 0x0A
+8: ptr arena allocator
+16: ptr array base (int32 elements, grown via sub_6EFD20)
+24: int32 element count (initialized to 0xFFFFFFFF = -1; first insert sets to 0)
The array grows by calling sub_6EFD20(slot+8, count+2) before each insertion, which reallocates if capacity is exceeded. Elements are 4-byte int32 values stored contiguously starting at the base pointer.
Knob Type System
The definition-table type tag (at descriptor offset +16) determines how ParseKnobValue interprets the value string. There are 10 logical knob types with 1,294 total registrations:
| Type Tag | Name | Count | Parse Rule |
|---|---|---|---|
| 1 | OKT_NONE | 139 | Boolean flag -- presence = true, no value needed |
| 2 | OKT_INT | 616 | strtol(value, NULL, 0) -- accepts decimal, hex (0x), octal (0) |
| 3 | OKT_BDGT | 88 | Same as INT but stores with secondary field zeroed (budget type) |
| 4 | OKT_IRNG | 8 | "lo..hi" range -- two integers separated by .. |
| 5 | OKT_ILIST | 3 | Comma-separated integers: "1,2,3,4" |
| 6 | OKT_FLOAT | 12 | sscanf(value, "%f", &result) |
| 7 | OKT_DBL | 100 | sscanf(value, "%lf", &result) |
| 8 | OKT_STR | 28 | Direct string assignment (pointer copy) |
| 9 | OKT_WHEN | 2 | When-condition string; parsed into linked list of condition nodes |
| 10 | OKT_OPCODE_STR_LIST | 4 | Opcode-name,integer pairs: "FADD,3,FMUL,2" |
| 11 | OKT_STR (variant) | — | Same as type 8 (alternate string slot) |
| 12 | OKT_ILIST (variant) | — | Int-list with pre-initialized allocator |
The INT type (616 knobs, 47.6%) dominates. These control thresholds, limits, and numeric heuristic parameters across the entire compiler. BDGT (budget) knobs (88) are semantically similar to INT but carry a secondary field used for budget-tracking in cost models. The 100 DBL knobs control floating-point heuristic weights (scheduling priorities, cost ratios, etc.).
Definition-Type to Runtime-Type Mapping
The definition-table type tag drives parsing; ParseKnobValue writes a different runtime type tag into the 72-byte slot. The mapping is not 1:1 -- several definition types collapse into the same runtime type, and compound types undergo a pre-initialization phase before the main parse:
| Def Type | Definition Name | Runtime Type | Runtime Name | Pre-init? |
|---|---|---|---|---|
| 1 | OKT_NONE | 4 | boolean (true) | No |
| 2 | OKT_INT | 1 | int32 | No |
| 3 | OKT_BDGT | 7 | int32 + secondary | No |
| 4 | OKT_IRNG | 8 | int-range (low, high) | No |
| 5 | OKT_ILIST | 10 | int-list (dynamic array) | No |
| 6 | OKT_FLOAT | 2 | float (single precision) | No |
| 7 | OKT_DBL | 3 | double (8-byte) | No |
| 8 | OKT_STR | 5 | string (pointer) | No |
| 9 | OKT_WHEN | 6 | linked list (when-condition) | Yes |
| 10 | OKT_OPCODE_STR_LIST | 9 | linked list (opcode-string) | Yes |
| 11 | OKT_STR (variant) | 5 | string (pointer) | No |
| 12 | OKT_ILIST (variant) | 10 | int-list (dynamic array) | Yes |
Types 11 and 12 are aliases: type 11 shares the exact handler with type 8 (both produce runtime type 5), and type 12 shares parsing logic with type 5 but its pre-switch initializes the allocator from the knob state object instead of inline.
ParseKnobValue Dispatch Algorithm
ParseKnobValue (sub_79F540, source lines 435--551 of generic_knobs_impl.h) implements a two-phase dispatch. The first switch pre-initializes compound types; the second switch parses the value string.
Phase 1 -- Pre-initialization (compound types only):
// v15 = definition type tag at (knob_descriptor + 16)
// v14 = runtime slot at (knob_state[9] + 72 * index)
switch (v15) {
case 9: // OKT_WHEN -> runtime type 6
KnobValueReset(v14);
v14[0] = 6;
// Initialize doubly-linked list with two sentinel nodes:
// +8 = 0 (next), +16 -> +24, +24 -> +8 (circular sentinels)
// +40 = tail, +48 = head, +56 = count (starts at 2)
// +64 = allocator from knob_state[1]
break;
case 10: // OKT_OPCODE_STR_LIST -> runtime type 9
KnobValueReset(v14);
v14[0] = 9;
// Same linked-list initialization as case 9
break;
case 12: // OKT_ILIST variant -> runtime type 10
KnobValueReset(v14);
v14[0] = 10;
*(ptr*)(v14 + 16) = NULL; // growable array base
*(ptr*)(v14 + 8) = allocator; // from knob_state[1]
*(int32*)(v14 + 24) = 0xFFFFFFFF; // sentinel count (-1)
break;
}
Phase 2 -- Value parsing (all types):
Type 1 (OKT_NONE, boolean): No value string needed. Stores runtime type 4 (boolean true). Presence alone indicates the knob is set.
Type 2 (OKT_INT, integer): Calls sub_6F71D0(value, NULL) -- a strtol wrapper with base 0, which auto-detects decimal, hex (0x prefix), and octal (0 prefix). Stores runtime type 1, value at slot+8 as int32.
Type 3 (OKT_BDGT, budget): Same integer parsing as type 2. Stores runtime type 7 with the primary value at slot+8 and the secondary (budget counter) at slot+12 zeroed. Cost models decrement the secondary field as optimization budget is consumed.
Type 4 (OKT_IRNG, integer range): Parses "low..high" format with these edge cases:
"100..200" -> low=100, high=200 Standard range
"100.." -> low=100, high=0x7FFFFFFF Open upper bound
"..200" -> low=0x80000000, high=200 Open lower bound
".." -> low=0x80000000, high=0x7FFFFFFF Full range
"42" -> low=42, high=42 Degenerate (single value)
"" -> error "Empty integer range value"
The .. separator is detected by checking *endptr == '.' && endptr[1] == '.'. Default bounds are INT_MIN (0x80000000) and INT_MAX (0x7FFFFFFF). Stores runtime type 8 with low at slot+8, high at slot+12.
Type 5 (OKT_ILIST, integer list): Parses comma-separated integers. Validation requires each element to start with a digit or -. Uses a growable array (runtime type 10) at slot+16, grown via sub_6EFD20(slot+8, count+2) before each insertion. Elements are 4-byte int32 values stored contiguously. Example: "1,2,3,4" produces a 4-element array.
Type 6 (OKT_FLOAT, float): Calls sscanf(value, "%f", &result). Stores runtime type 2, value at slot+8 as a 4-byte IEEE 754 single. Returns error "Invalid floating point value" if sscanf does not return 1.
Type 7 (OKT_DBL, double): Calls sscanf(value, "%lf", &result). Stores runtime type 3, value at slot+8 as an 8-byte IEEE 754 double. Returns error "Invalid double value" if sscanf does not return 1.
Type 8/11 (OKT_STR, string): Both handled identically. Stores runtime type 5 with a direct pointer copy: *(char**)(slot+8) = value. The string is NOT duplicated -- the pointer references the original buffer, so the caller must ensure the string's lifetime exceeds the knob's.
Type 9 (OKT_WHEN, when-condition): Pre-switch already initialized the linked list (runtime type 6). Allocates a 24-byte node via the allocator's vtable (allocator_vtable[3](allocator, 24)). Node layout: [next_ptr(8), prev_ptr(8), string_ptr(8)]. The condition string pointer is stored at node+16. Nodes are inserted at the tail of the doubly-linked list. Error if value is NULL; empty string is permitted.
Type 10 (OKT_OPCODE_STR_LIST, value-pair list): Pre-switch already initialized the linked list (runtime type 9). Parsing loop:
- Call `vtable+40` to split the next comma-delimited token into opcode name and integer value strings
- If the opcode name is NULL: error `"Empty opcode string"` (line 520)
- If the integer value is NULL: error `"Empty integer value"` (line 522)
- Parse the integer via `strtol(nptr, 0, 10)` (base 10 only, unlike OKT_INT)
- Resolve the opcode name to an internal ID via `vtable+56` (SASS opcode table lookup)
- Allocate a 24-byte node: `[next(8), prev(8), opcode_id(4) | int_value(4)]`
- Insert into the linked list; loop until the input is exhausted
Format: "FADD,3,FMUL,2" produces two nodes: (FADD_id, 3) and (FMUL_id, 2). The opcode resolution uses the same 11,240-byte opcode recognition table as the peephole optimizer.
Type 12 (OKT_ILIST variant, opcode list): Pre-switch already initialized the growable array (runtime type 10). Parsing loop:
- Call `vtable+64` to extract the next comma-delimited opcode name
- Resolve it to an internal ID via `vtable+56`
- Grow the array via `sub_6EFD20(slot+8, count+2)`
- Store the opcode ID as an `int32` in the array
Format: "FADD,FMUL,IADD3" -- opcode names only, no integers. Each is resolved to its internal opcode ID.
Default: Error "Invalid knob type" (line 551).
Parse Error Messages
ParseKnobValue (sub_79F540 / sub_6F7360) produces these diagnostic strings on parse failure:
| Error String | Source Line | Def Type | Condition |
|---|---|---|---|
"Empty when-string" | 435 | 9 | WHEN knob with NULL value |
"Empty integer range value" | 445 | 4 | IRNG knob with NULL or empty value |
"Empty integer list value" | 451 | 5 | ILIST knob with NULL or empty value |
"Integer list value is not an integer" | 453 | 5 | First char not digit or - |
"End of integer range value is not ',' or null character" | 457 | 5 | ILIST terminator not , or \0 |
"Empty integer value" | 470 | 2 | INT knob with NULL or empty value |
"Empty integer value" | 478 | 3 | BDGT knob with NULL or empty value |
"Empty floating point value" | 491 | 6 | FLOAT knob with NULL or empty value |
"Invalid floating point value" | 496 | 6 | sscanf returns != 1 |
"Empty double value" | 502 | 7 | DBL knob with NULL or empty value |
"Invalid double value" | 506 | 7 | sscanf returns != 1 |
"Empty value pair list" | 515 | 10 | OPCODE_STR_LIST with NULL value |
"Empty opcode string" | 520 | 10 | Opcode name resolves to NULL |
"Empty integer value" | 522 | 10 | Integer after opcode resolves to NULL |
"Empty opcode list" | 536 | 12 | Opcode-list variant with NULL value |
"Invalid knob type" | 551 | — | Unrecognized type tag in definition table |
"Invalid knob identifier" | 395 | — | GetKnobIndex -- name not found |
All errors carry source attribution: generic_knobs_impl.h with a line number and function name ("GetKnobIndex", "ParseKnobValue", "ReadKnobsFile"). Error constructors: sub_79CDB0 (simple format string) and sub_79AED0 (format with knob name and value context).
Setting Knobs
Method 1: -knob CLI Flag
ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE=1 input.ptx -o output.cubin
Multiple -knob flags accumulate. Each is parsed by KnobsInit (sub_79D990) during startup. The knob name is looked up via GetKnobIndex, then the value is parsed according to the knob's type.
Method 2: Knobs File
A knobs file is a plain-text file with a required [knobs] section header:
; Comments or metadata can appear before the header.
; ReadKnobsFile ignores everything until [knobs] is found.
[knobs]
SchedNumBB_Limit=100
DisableCSE=1
RegAllocBudget=5000
; WHEN= syntax is also supported inside the file:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200
ReadKnobsFile (sub_79D070, source lines 1060--1090 of generic_knobs_impl.h) processes the file:
1. fopen(path, "r") line ~1060
2. fseek(file, 0, SEEK_END) line 1075
3. size = ftell(file) line 1075
4. fseek(file, 0, SEEK_SET) line 1075
5. buffer = allocator->vtable[2](allocator, size+1) (heap alloc)
6. bytes = fread(buffer, 1, size, file) line 1070
7. buffer[bytes] = '\0' (null-terminate)
8. marker = strstr(buffer, "[knobs]") line 1065
9. if (!marker) error "Knobs header not found"
10. content = marker + 7 (skip "[knobs]")
11. vtable[4](result, knob_state, content, 0) (parse callback)
12. fclose(file) line 1085
Key implementation details:
- **Entire file read at once.** The file is `fseek`/`ftell`-measured, then `fread` into a single buffer of `size + 1` bytes. No line-by-line streaming.
- **`strstr`-based header detection.** The `[knobs]` marker is located via `strstr`, so it can appear anywhere in the file -- not necessarily on the first line. Everything before it (comments, version metadata, other INI sections) is silently ignored.
- **Parsing starts at marker+7.** Exactly 7 characters (`[knobs]`) are skipped. The parse callback is ParseKnobsString (sub_79B530), which processes newline-delimited `key=value` pairs. The `~` separator and `WHEN=` conditional syntax are supported.
- **Result/Expected monad.** Every I/O operation has a corresponding error path. Errors are accumulated via sub_79A3D0 (ErrorChainAppend) and propagated through a tagged result object. Multiple errors from a single file are chained, not short-circuited.
Error strings with source line numbers:
| Error String | Source Line | Condition |
|---|---|---|
"fseek() error knobsfile %s" | 1075 | fseek(SEEK_END) or fseek(SEEK_SET) fails |
"fseek() error for knobsfile %s" | 1080 | fseek(SEEK_END) fails (alternate path) |
"fread() error knobsfile %s" | 1070 | fread returns <= 0 |
"Knobs header not found in %s" | 1065 | strstr(buffer, "[knobs]") returns NULL |
"fclose() error for knobsfile %s" | 1085 | fclose returns non-zero |
Method 3: PTX Pragma
Knobs can be set from PTX source via .pragma directives, unless the DisablePragmaKnobs knob is set. The pragma string is copied into a temporary buffer and parsed by ParseKnobsString (sub_79B530), following the same key=value syntax.
Method 4: WHEN= Conditional Overrides
The most powerful mechanism allows setting knobs conditionally, based on shader hash or instruction hash. The override string uses ~ (tilde) as a record separator:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200~WHEN=IH=0x12345;DisableCSE=1
ParseKnobsString (sub_79B530) recognizes these prefixes (case-insensitive):
- `WHEN=` -- conditional knob application
- `SH=` -- match by shader hash (decimal, hex with `0x`, or range with `..`)
- `IH=` -- match by instruction hash
- `K=` -- direct knob setting (no condition)
- `INJECTSTRING` -- special directive terminated by `;;` (double semicolon)
The full conditional override system is parsed by ParseKnobOverrides (sub_79C210), which iterates a linked list of override entries at knob_state + 68904. Each entry carries the condition (hash match criterion) and the knob assignment to apply when matched.
Hash matching uses FNV-1a (magic 0x811C9DC5, prime 16777619) for the per-function override table lookup at ctx+120 → +1128. See IsPassDisabledFull (sub_7992A0).
Priority Order
When the same knob is set by multiple mechanisms, the last write wins. KnobsInit (sub_79D990) processes sources in this order:
1. Environment variable overrides (`getenv`)
2. Knobs file (if specified via `-knobs-file` or equivalent)
3. `-knob` CLI flags
4. PTX pragma knobs (applied per-function at compile time)
5. `WHEN=` conditional overrides (applied per-function when hash matches)
Later sources override earlier ones for the same knob index.
Two Instantiations: DAG and OCG
The knob system is a C++ template instantiated twice with different knob definition tables:
DAG Knobs (sub_6F0820)
The DAG (Directed Acyclic Graph) scheduler knob table contains 99 entries. These control the Mercury SASS pipeline: instruction expansion, WAR hazard handling, scoreboard configuration, and the decode/expand/opex pipeline stages.
| Property | Value |
|---|---|
| GetKnobIndex | sub_6F0820 |
| ParseKnobValue | sub_6F7360 |
| InitializeKnobs | sub_6F68C0 (9KB, 24 references to generic_knobs_impl.h) |
| Table size | 99 entries x 64 bytes = 6,336 bytes |
DAG knobs referenced in the binary include knob indices 8 and 17 (pipeline options in sub_6F52F0), 16 (WAR generation options in sub_6FBC20), and 743/747 (expansion options in sub_6FFDC0).
OCG Knobs (sub_79B240)
The OCG (Optimizing Code Generator) knob table contains 1,195 entries -- the vast majority of all knobs. These control the optimization passes, register allocation, instruction scheduling, and code generation.
| Property | Value |
|---|---|
| GetKnobIndex | sub_79B240 |
| ParseKnobValue | sub_79F540 |
| KnobsInit | sub_79D990 (40,817 bytes, master initializer) |
| KnobInit | sub_7A0C10 (per-knob state constructor) |
| Table size | 1,195 entries x 64 bytes = 76,480 bytes |
| Runtime values | 1,195 entries x 72 bytes = 86,040 bytes |
OCG knob indices referenced across the codebase include: 185 (pass-disable string, offset 13320), 294 (epilogue instruction count, used in tepid scheduling), 487 (LoopMakeSingleEntry enablement), 956-957 (shader hint settings at offsets 68832/68904).
Knob State Object
The master knob state object is constructed by KnobInit (sub_7A0C10):
Offset Size Field
──────── ────── ──────────────────────────────
+0 8 vtable pointer (off_21C0738)
+8 8 arena allocator
+16 8 knob definition table pointer
+24 8 knob count
+32 40 (zero-initialized control fields)
+72 var knob value array (72 * count bytes)
+80 4 max knob index (initially 0xFFFFFFFF)
+88 16 DUMP_KNOBS_TO_FILE path (growable string)
The vtable at off_21C0738 provides virtual methods for knob access:
- `vtable+72`: `IsKnobSet(index)` -- check whether a knob has a value
- `vtable+152`: `GetKnobIntValue(index)` -- retrieve the int32 value
- Additional slots cover bool, string, and double retrieval
Knob Access Helpers
Throughout the codebase, knobs are accessed by index via small helper functions:
| Function | Address | Purpose |
|---|---|---|
GetKnobIntValue | sub_7A1B80 | Returns *(int32*)(state + 72*idx + 8) |
GetKnobBoolValue | sub_7A1CC0 | Checks type == 4, returns presence |
GetKnobStringValue | sub_7A1E10 | Returns string pointer (type 5/8) |
SetKnobValue | sub_7A2860 | Writes value with optional WHEN=SH= condition |
IsKnobSet | (inlined) | Checks *(byte*)(state + 72*idx) != 0 |
Access is O(1) by index -- no hash lookup or name comparison at runtime. The GetKnobIndex name-to-index translation happens only during initialization.
Pass Disable Mechanism
The knobs system provides a string-based pass disable mechanism through knob index 185 (OCG offset 13320). The string contains +-delimited pass names:
-knob DisablePhases=LoopMakeSingleEntry+SinkCodeIntoBlock
Two check functions consult this string:
IsPassDisabled (sub_799250)
Simple version. Reads the disable flag byte at ctx+13320:
- If byte == 0: no pass-disable configured, returns false
- If byte == 5: string pointer at
ctx+13328, performs substring match viasub_6E1520(strcasestr-like)
Called from 16+ sites across the codebase: sub_78B430 (LoopMakeSingleEntry), sub_78DB70 (SinkCodeIntoBlock), sub_8236B0, sub_8D0640, sub_8F45E0, and others.
IsPassDisabledFull (sub_7992A0)
Full version with per-function overrides. First checks a per-function hash table at ctx+120 → +1128 using FNV-1a on the function identifier. If the function has a specific override entry, reads the disable string from there. Otherwise falls back to the global disable string at ctx+72 → +13320.
// FNV-1a hash for per-function lookup
uint32_t hash = 0x811C9DC5;
for (each byte b in function_id)
hash = 16777619 * (hash ^ b);
uint32_t bucket = hash & (table_size - 1);
The + character is used as a delimiter between alternative phase names in the disable string, allowing "phaseA+phaseB" to match either name.
NamedPhases Parser (sub_798B60)
Parses a comma-separated list of name=value pairs into parallel arrays (max 256 entries). Used by KnobsInitFromEnv (sub_79C9D0) to process environment variable-based knob overrides.
Input: "knob1=value1,knob2=value2,knob3=value3"
Output: names[256], values[256], full_strings[256]
Knob Categories
The 1,294 knobs cluster into functional categories. Prefix analysis of decoded knob names reveals these major groups:
| Prefix | Count | Domain |
|---|---|---|
| Sched* / PostSched* / Sb* | 89 | Instruction scheduling heuristics and thresholds |
| RegAlloc* / Reg* | 87 | Register allocation parameters, spill cost model, target selection |
| Disable* | 75 | Pass/feature disable switches (boolean) |
| Remat* / SinkRemat* | 35 | Rematerialization cost model, enable switches, placement control |
| Mercury* / Merc* | 21 | Mercury encoder configuration |
| URF* | 24 | Uniform Register File optimization |
| Enable* | 19 | Pass/feature enable switches (boolean) |
| Dump* | 15 | Debug dump controls (DUMPIR, DumpSched, etc.) |
| Peephole* | ~20 | Peephole optimization rules |
| Loop* | ~15 | Loop optimization parameters |
| Sync* / Barrier* | ~12 | Synchronization and barrier handling |
| WAR* | ~8 | Write-after-read hazard parameters |
| GMMA* / MMA* | ~10 | Matrix multiply-accumulate configuration |
| Spill* | ~8 | Spill code generation parameters |
| Budget* | ~10 | Cost model budgets (BDGT type knobs) |
| Copy* / CSE* | ~8 | Copy propagation and CSE parameters |
| (other) | ~577 | Miscellaneous per-pass tuning knobs |
Notable Individual Knobs
Selected knobs referenced by address in the binary:
| Index | Name (decoded) | Type | Referenced At | Purpose |
|---|---|---|---|---|
| 8 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 16 | (WAR generation) | INT | sub_6FBC20 | WAR pass behavior |
| 17 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 185 | (pass-disable string) | STR | sub_799250, sub_7992A0 | DisablePhases string |
| 294 | (epilogue count) | INT | sub_7A46E0 | Tepid scheduling divisor |
| 487 | (loop single-entry) | BOOL | sub_78B430 | LoopMakeSingleEntry enable |
| 743 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 747 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 956 | (shader hint) | — | sub_79C210 | Shader hint knob (offset 68832) |
| 957 | (shader hint) | — | sub_79C210 | Shader hint linked list (offset 68904) |
Register Allocation Knobs (87 knobs, indices 613--699)
The register allocator is the most heavily parameterized subsystem in ptxas. Its 87 knobs span indices 613 through 699 in the OCG knob table, registered in ctor_005 at addresses 0x4197F0--0x41B2E0. The knobs cluster into seven functional sub-categories. All names decoded from ROT13 strings at 0x21B9730--0x21BA6C0.
A. Spill Cost Model (26 knobs)
The spill guidance engine (sub_96D940, 84 KB) uses these knobs to compute per-candidate spill costs. The model multiplies hardware-specific latency and resource metrics by configurable scale factors, then applies threshold-based activation logic.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 658 | RegAllocSpillBarriersAcrossSuspend | NONE | Enable spill barriers across suspend points |
| 659 | RegAllocSpillBit | INT | Master spill-bit mode selector |
| 660 | RegAllocSpillBitHighRegCountHeur | INT | High register count heuristic for spill-bit decisions |
| 661 | RegAllocSpillBitHighRegScale | DBL | Scale factor for high-register-count spill cost |
| 662 | RegAllocSpillBitInfPerRegThreshold | INT | Interference-per-register threshold for spill-bit activation |
| 663 | RegAllocSpillBitLowRegCountHeur | INT | Low register count heuristic for spill-bit decisions |
| 664 | RegAllocSpillBitLowRegScale | DBL | Scale factor for low-register-count spill cost |
| 665 | RegAllocSpillBitMediumRegScale | DBL | Scale factor for medium-register-count spill cost |
| 666 | RegAllocSpillBitNonRematSpillThreshold | INT | Threshold for non-rematerializable spill-bit activation |
| 667 | RegAllocSpillBitRLivePerRegThreshold | INT | Live-per-register threshold for R-type spill decisions |
| 668 | RegAllocSpillBitRLiveThreshold | INT | Global R-live threshold for spill activation |
| 669 | RegAllocSpillForceXBlockHoistRefill | INT | Force cross-block hoisting of refill instructions |
| 670 | RegAllocSpillLatencyScale | DBL | Scale factor for latency in spill cost model |
| 671 | RegAllocSpillLatencyScale2 | DBL | Secondary latency scale (nested loops) |
| 672 | RegAllocSpillMemResScale | DBL | Scale factor for memory resource pressure in spill cost |
| 673 | RegAllocSpillMioHeavyThreshold | DBL | Threshold for MIO-heavy (memory-intensive) spill classification |
| 674 | RegAllocSpillOptBudget | BDGT | Budget for spill optimization passes |
| 675 | RegAllocSpillResourceScale | DBL | Scale factor for resource usage in spill cost |
| 676 | RegAllocSpillResCostsScale | DBL | Scale factor for resource costs (secondary weighting) |
| 677 | RegAllocSpillReturnRegister | INT | Spill handling mode for return-value registers |
| 678 | RegAllocSpillSmemFlatMode | INT | Shared memory spill: flat addressing mode selector |
| 679 | RegAllocSpillSmemLatencyScale | DBL | Scale factor for shared-memory spill latency |
| 680 | RegAllocSpillTexDepScale | DBL | Scale factor for texture dependency in spill cost |
| 681 | RegAllocSpillValidateDebug | INT | Debug: validate spill correctness (0=off, >0=level) |
| 682 | RegAllocSpillXBlock | INT | Cross-block spill mode (hoist/refill strategy) |
| 683 | RegAllocSpillXBlock2 | INT | Secondary cross-block spill mode |
The cost model uses three register-count tiers (low/medium/high), each with independent scale factors (664, 665, 661). The tier boundaries are set by the heuristic knobs (663, 660). Latency scales (670, 671) multiply the estimated stall cycles, while resource scales (672, 675, 676) multiply memory bandwidth consumption. The MIO-heavy threshold (673) triggers a separate cost path when the basic block is already saturated with memory operations.
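A hedged sketch of how the tiered composition plausibly fits together (the knob indices are real; the formula structure is inferred from the knob semantics, not recovered from sub_96D940):

```c
/* Illustrative spill-cost composition using the tier knobs above. */
typedef struct {
    int    low_count_heur;   /* knob 663: low tier boundary */
    int    high_count_heur;  /* knob 660: high tier boundary */
    double low_scale;        /* knob 664 */
    double medium_scale;     /* knob 665 */
    double high_scale;       /* knob 661 */
    double latency_scale;    /* knob 670: multiplies stall cycles */
    double mem_res_scale;    /* knob 672: multiplies memory pressure */
} SpillCostKnobs;

static double spill_cost(const SpillCostKnobs *k, int live_regs,
                         double stall_cycles, double mem_pressure) {
    double tier_scale =
        live_regs < k->low_count_heur  ? k->low_scale  :
        live_regs < k->high_count_heur ? k->medium_scale :
                                         k->high_scale;
    return tier_scale * (k->latency_scale * stall_cycles +
                         k->mem_res_scale * mem_pressure);
}
```

The MIO-heavy path (knob 673) would branch to a separate cost function before this computation; it is omitted here.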
B. Rematerialization (12 knobs)
Rematerialization recomputes values instead of spilling them. The allocator treats remat as a first-class spill alternative with its own budget and candidate ordering.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 619 | RegAllocCtxSensitiveRemat | INT | Enable context-sensitive rematerialization |
| 622 | RegAllocEnableOptimizedRemat | INT | Enable optimized rematerialization pass |
| 627 | RegAllocLiveRemat | INT | Enable live-range-aware rematerialization |
| 632 | RegAllocMaxRematHeight | INT | Max expression DAG height for remat candidates |
| 633 | RegAllocMaxRematInst | INT | Max instructions in a remat sequence |
| 635 | RegAllocMultiRegclassRemat | INT | Enable remat across multiple register classes |
| 636 | RegAllocMultiRegRemat | INT | Enable multi-register rematerialization |
| 637 | RegAllocMultiRegRematBudget | BDGT | Budget for multi-register remat attempts |
| 650 | RegAllocRematDisableRange | IRNG | Disable remat for instruction index range lo..hi |
| 651 | RegAllocRematEnable | INT | Master enable for rematerialization (0=off) |
| 652 | RegAllocRematReuseBudget | BDGT | Budget for remat-reuse optimization attempts |
| 654 | RegAllocOrderRematCandHeuristic | INT | Heuristic for ordering remat candidates |
Knob 650 (RegAllocRematDisableRange) is unique as the only IRNG-type knob in the set, accepting "lo..hi" to disable rematerialization for a range of instruction indices -- a debugging aid for bisecting remat-related miscompiles.
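The "lo..hi" syntax can be handled as sketched below (illustrative; the actual IRNG parser in the knob subsystem has not been identified by address):

```c
#include <stdio.h>

/* Parse an OKT_IRNG value of the form "lo..hi", e.g. "10..42". */
static int parse_irange(const char *s, int *lo, int *hi) {
    return sscanf(s, "%d..%d", lo, hi) == 2;
}

/* With the range set, remat would be skipped for any instruction
 * index i satisfying lo <= i && i <= hi. */
```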
C. Pre-Assignment / MAC (8 knobs)
MAC (Machine-level Allocation with Constraints) pre-assigns physical registers to high-priority operands before the main Fatpoint allocator runs. Entry: sub_94A020 (331 lines).
| Index | Name | Type | Purpose |
|---|---|---|---|
| 613 | RegAllocAvoidBankConflictMac | INT | Enable bank-conflict-aware MAC pre-assignment |
| 614 | RegAllocAvoidBankConflictMacPenalty | INT | Penalty weight for bank conflicts during MAC pre-assignment |
| 615 | RegAllocAvoidBankConflictMacWindowSize | INT | Instruction window size for bank conflict analysis |
| 628 | RegAllocMacForce | NONE | Force MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 630 | RegAllocMacVregAllocOrderCompileTime | INT | Compile-time variant of MAC vreg allocation order |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 647 | RegAllocPrefMacOperandsMaxDepth | INT | Max operand chain depth for MAC preference propagation |
D. Coalescing (3 knobs)
Register coalescing eliminates unnecessary register-to-register copies by merging live ranges.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 617 | RegAllocCoalesceBudget | BDGT | Budget limit for coalescing iterations |
| 618 | RegAllocCoalescing | NONE | Enable register coalescing |
| 634 | RegAllocMmaCoalescing | NONE | Enable MMA-specific coalescing |
E. Performance-Difference Backoff (5 knobs)
Progressive constraint relaxation: on retry iteration N, if the performance difference exceeds a limit, constraints relax between the begin and end iterations.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 641 | RegAllocPerfDiffBackoff | NONE | Enable perf-diff based constraint backoff |
| 642 | RegAllocPerfDiffBackoffBegin | INT | Iteration at which backoff begins |
| 643 | RegAllocPerfDiffBackoffEnd | INT | Iteration at which full relaxation is reached |
| 644 | RegAllocPerfDiffConflictWeight | INT | Weight factor for conflicts in perf-diff calculation |
| 645 | RegAllocPerfDiffLimit | INT | Performance difference limit triggering relaxation |
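The Begin/End pair suggests a linear ramp between the two iterations; a sketch of that model (the actual interpolation inside the allocator retry loop is not recovered):

```c
/* 0.0 = full constraints, 1.0 = fully relaxed. `begin` corresponds
 * to knob 642, `end` to knob 643. */
static double backoff_relaxation(int iteration, int begin, int end) {
    if (iteration <= begin) return 0.0;
    if (iteration >= end)   return 1.0;
    return (double)(iteration - begin) / (double)(end - begin);
}
```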
F. Register Target Selection (13 knobs)
The target selection phase determines how many physical registers to aim for -- the occupancy/performance tradeoff. More registers per thread means fewer warps can execute concurrently.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 687 | RegTargetList | ILIST | Comma-separated list of target register counts to try |
| 688 | RegTgtLowerLimitMMASlack | INT | Slack added to MMA lower register limit |
| 689 | RegTgtLowerLimitTCGENSlack | INT | Slack added to TCGEN lower register limit |
| 690 | RegTgtLowerLimitSPARSIFYSlack | INT | Slack added to SPARSIFY lower register limit |
| 691 | RegTgtLowerLimitDECOMPRESSSlack | INT | Slack added to DECOMPRESS lower register limit |
| 692 | RegTgtSelHigherWarpCntHeur | INT | Heuristic mode for higher-warp-count target selection |
| 693 | RegTgtSelHigherWarpCntHeurValue | DBL | Weight value for higher-warp-count heuristic |
| 694 | RegTgtSelHighLiveRangeHeurValue | DBL | Weight for high-live-range target selection heuristic |
| 695 | RegTgtSelLowerWarpCntHeur | INT | Heuristic mode for lower-warp-count target selection |
| 696 | RegTgtSelLowerWarpCntHeurValue | DBL | Weight value for lower-warp-count heuristic |
| 697 | RegTgtSelLowLiveRangeHeurValue | DBL | Weight for low-live-range target selection heuristic |
| 698 | RegTgtSelWithSMemSpillHeur | INT | Heuristic mode when shared-memory spilling is active |
| 699 | RegUsageLevel | INT | Register usage reporting level |
The four "Slack" knobs (688--691) fine-tune lower register limits for specific architectural features that have minimum register requirements: MMA (matrix multiply), TCGEN (tensor core generation), SPARSIFY (structured sparsity), DECOMPRESS (decompression).
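The occupancy arithmetic driving target selection follows the standard register-file constraint; a simplified sketch (the real selector additionally applies the warp-count and live-range heuristic weights above, and hardware limits vary per sm_XX target):

```c
/* Warps resident per SM as limited by the register file. The 64K
 * registers / 64 warps / 32 threads-per-warp figures are a common
 * SM configuration, used here only for illustration; real targets
 * also round allocations to a hardware granularity. */
static int warps_by_regs(int regs_per_thread) {
    const int regfile = 65536, threads_per_warp = 32, max_warps = 64;
    if (regs_per_thread <= 0) return max_warps;
    int w = regfile / (regs_per_thread * threads_per_warp);
    return w < max_warps ? w : max_warps;
}
```

Fewer registers per thread raises resident warp count (latency hiding) at the cost of more spilling, which is exactly the tradeoff RegTargetList enumerates candidate points along.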
G. General Allocation Control (12 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 616 | RegAllocCacheSize | INT | Cache size parameter for interference graph |
| 620 | RegAllocDebugConflictDetails | INT | Debug: print conflict graph details (verbosity level) |
| 621 | RegAllocDepDistanceThresholdForHighConflicts | INT | Dep-distance threshold above which high-conflict registers are deprioritized |
| 624 | RegAllocIndexAbiScratchRegs | INT | Index into ABI scratch register set |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 640 | RegAllocOptLevel | INT | Regalloc optimization level (controls aggressiveness) |
| 648 | RegAllocPrintDetails | NONE | Enable detailed regalloc diagnostic printing |
| 649 | RegAllocRefineInf | INT | Refine interference graph iteration limit |
| 653 | RegAllocOptimizeABI | INT | Enable ABI-aware register optimization (setmaxnreg handling) |
| 655 | RegAllocReportMaxRegsAllowed | INT | Report maximum registers allowed per thread (diagnostic) |
| 656 | RegAllocCudaSmemSpillEnable | INT | Enable CUDA shared memory spill path |
| 685 | RegAllocUserSmemBytesPerCTA | INT | User-specified shared memory bytes per CTA (overrides computed) |
H. Miscellaneous (8 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 623 | RegAllocEstimatedLoopIterations | STR | String hint providing estimated loop iteration counts for spill cost weighting |
| 625 | RegAllocL1SpillRegThres | INT | Register count threshold for L1 spill mode activation |
| 626 | RegAllocL1SpillScale | DBL | Scale factor for L1 cache spill cost |
| 631 | RegAllocMaxGmmaDisallowedReg | INT | Max registers disallowed during GMMA (warp group MMA) allocation |
| 638 | RegAllocNoRetargetPrefs | NONE | Disable retarget-preference optimization |
| 657 | RegAllocSortRegs | INT | Sorting order for register candidates during allocation |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference count above which conflicts are discarded (default 50) |
| 686 | RegAttrReuseVectorBudget | BDGT | Budget for register-attribute vector reuse optimization |
Scheduling Knobs (89 knobs, indices 229--978)
The instruction scheduler is the second most heavily parameterized subsystem after register allocation. Its 89 knobs span two contiguous blocks (indices 738--811 for the core Sched* set, and 569--574 for the PostSched* set) plus 11 scattered entries for scheduling-adjacent features. All names decoded from ROT13 strings at 0x21B6CB0--0x21BE100, registered in ctor_005 at code addresses 0x411FF0--0x420A00.
The knobs control every aspect of the list scheduler: how latencies are modeled, which functional units are treated as busy, how aggressively cross-block motion is attempted, and how register pressure feedback loops interact with the priority function. Three Blackwell-era SchedResBusy* knobs (QMMA at 964, OMMA at 977, MXQMMA at 978) sit outside the main block because they were appended in a later toolkit version for new MMA unit types.
A. Resource Busy Overrides (23 knobs)
The SchedResBusy* knobs override the hardware-profile resource busy times for individual functional units. Each knob sets the number of cycles the named unit is considered occupied after issuing an instruction to it. When unset, the scheduler uses the value from the latency model's per-SM hardware profile. Setting a SchedResBusy* knob to 0 effectively makes the unit appear always free to the scheduler.
Two knobs accept string values instead of integers: SchedResBusyOp and SchedResBusyMachineOpcode take a string identifying a specific opcode or machine opcode to override, enabling per-instruction busy-time tuning.
| Index | Name | Type | Functional Unit |
|---|---|---|---|
| 781 | SchedResBusyADU | INT | Address divergence unit |
| 782 | SchedResBusyALU | INT | Arithmetic logic unit |
| 783 | SchedResBusyCBU | INT | Convergence barrier unit |
| 784 | SchedResBusyDMMA | INT | Double-precision MMA unit |
| 785 | SchedResBusyFMA | INT | Fused multiply-add unit |
| 786 | SchedResBusyFMAWide | INT | Wide FMA unit (multi-cycle) |
| 787 | SchedResBusyFP16 | INT | Half-precision FP unit |
| 788 | SchedResBusyFP64 | INT | Double-precision FP unit |
| 789 | SchedResBusyGMMA | INT | Warp group MMA (WGMMA) unit |
| 790 | SchedResBusyHMMA16 | INT | Half-precision MMA, 16-wide |
| 791 | SchedResBusyHMMA16816 | INT | Half-precision MMA, 16x8x16 shape |
| 792 | SchedResBusyHMMA1688 | INT | Half-precision MMA, 16x8x8 shape |
| 793 | SchedResBusyHMMA32 | INT | Half-precision MMA, 32-wide |
| 794 | SchedResBusyIMMA | INT | Integer MMA unit |
| 795 | SchedResBusyLSU | INT | Load/store unit |
| 796 | SchedResBusyLSUL1 | INT | Load/store unit (L1 path) |
| 797 | SchedResBusyOp | STR | Per-opcode override (string: opcode name) |
| 798 | SchedResBusyMachineOpcode | STR | Per-machine-opcode override (string) |
| 799 | SchedResBusyUDP | INT | Uniform datapath unit |
| 800 | SchedResBusyXU64 | INT | Extended-precision (64-bit) unit |
| 964 | SchedResBusyQMMA | INT | Quarter-precision MMA unit (Blackwell) |
| 977 | SchedResBusyOMMA | INT | Octal MMA unit (Blackwell) |
| 978 | SchedResBusyMXQMMA | INT | MX-quantized MMA unit (Blackwell) |
The four HMMA variants (790--793) correspond to different tensor core shapes: HMMA16 for 16-wide half-precision, HMMA1688 for the 16x8x8 tile used on Volta/Turing, HMMA16816 for the 16x8x16 tile used on Ampere+, and HMMA32 for 32-wide half-precision operations. IMMA (794) handles integer tensor operations (INT8/INT4).
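The override semantics described above reduce to a simple resolution rule; a sketch (the "unset" encoding of -1 is illustrative, not recovered):

```c
/* Resolve the busy time for a functional unit issue: a set
 * SchedResBusy* knob replaces the hardware-profile value, and an
 * override of 0 makes the unit appear always free. */
static int busy_cycles(int knob_override, int profile_cycles) {
    return (knob_override >= 0) ? knob_override : profile_cycles;
}
```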
B. Latency Overrides (12 knobs)
These override the default latency values the scheduler uses for dependency edges. The SchedRead* prefix indicates read-after-write latencies; the SchedTex* and SchedLDS* variants target texture and shared-memory operations specifically.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 757 | SchedLDSLatency | INT | Shared memory (LDS) load latency in cycles |
| 770 | SchedReadAvailTarget | INT | Target availability delay for read operands |
| 771 | SchedReadLatency | INT | Default read-after-write latency |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
| 774 | SchedReadSbDmmaLatency | INT | Scoreboard read latency for DMMA operations |
| 775 | SchedReadSbLdgstsLatency | INT | Scoreboard read latency for LDGSTS (async copy) operations |
| 802 | SchedSyncsLatency | INT | Synchronization barrier latency |
| 803 | SchedSyncsPhasechkLatency | INT | Phase-check synchronization latency |
| 804 | SchedTex2TexIssueRate | INT | Minimum cycles between back-to-back texture issues |
| 808 | SchedTexLatency | INT | Texture fetch latency in cycles |
| 811 | SchedXU64Latency | INT | Extended 64-bit unit latency |
C. Register Pressure Feedback (8 knobs)
The scheduler's priority function incorporates register pressure awareness through these knobs. They control how aggressively the scheduler tries to reduce live register count: SchedMaxRTarget sets the target register count, while the SchedMaxRLive* knobs define slack bands around that target. SchedReduceIncLimit* throttles how quickly the scheduler increases its pressure-reduction efforts.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 758 | SchedLocalRefRatio | DBL | Local reference ratio weight in priority function |
| 760 | SchedMaxRLiveCarefulSlack | INT | Slack before aggressive register pressure reduction |
| 761 | SchedMaxRLiveOKslack | INT | Slack band where register pressure is acceptable |
| 762 | SchedMaxRLiveOKslackColdBlocks | INT | OK-slack for cold (infrequently executed) blocks |
| 763 | SchedMaxRTarget | INT | Target maximum register count for scheduling |
| 776 | SchedReduceIncLimit | INT | Limit on incremental register pressure reduction steps |
| 778 | SchedReduceIncLimitHigh | INT | Upper bound on incremental reduction |
| 779 | SchedReduceRegBudget | BDGT | Budget for register-pressure-reduction iterations |
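The slack bands around SchedMaxRTarget suggest a three-state classification; a sketch of how the priority function might consume these knobs (structure inferred, not recovered):

```c
/* Pressure band around the scheduling register target (knob 763),
 * using the OK slack (761) and Careful slack (760). */
typedef enum { PRESSURE_OK, PRESSURE_CAREFUL, PRESSURE_AGGRESSIVE } Pressure;

static Pressure classify_pressure(int live, int target,
                                  int ok_slack, int careful_slack) {
    if (live <= target + ok_slack)      return PRESSURE_OK;
    if (live <= target + careful_slack) return PRESSURE_CAREFUL;
    return PRESSURE_AGGRESSIVE;  /* bias priority toward reducing liveness */
}
```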
D. Cross-Block Scheduling (8 knobs)
Cross-block motion allows the scheduler to move instructions across basic block boundaries for better latency hiding. These knobs control the scope and cost limits of cross-block speculation.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 742 | SchedCrossBlock | INT | Master cross-block scheduling mode selector |
| 743 | SchedCrossBlockInstsToSpeculate | INT | Max instructions to speculate across block boundary |
| 744 | SchedCrossBlockLimit | INT | Overall cross-block motion limit |
| 745 | SchedCrossBlockSpeculate | INT | Speculation mode for cross-block motion |
| 746 | SchedCrossBlockSpeculateBudget | BDGT | Budget for cross-block speculation attempts |
| 747 | SchedCrossBlockTexToSpeculate | INT | Max texture instructions to speculate across blocks |
| 288 | EnableXBlockSchedInMultiBlockInMMALoop | INT | Enable cross-block scheduling within multi-block MMA loops |
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
E. Texture Batching (7 knobs)
Texture operations have high latency, so the scheduler groups them into batches to maximize memory-level parallelism. These knobs control batch formation and target selection.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 741 | SchedCountLoadsPerTex | INT | Max loads to count per texture operation |
| 755 | SchedLastHybridInBBWithIssueRate | INT | Last hybrid scheduler position in BB with issue rate |
| 756 | SchedLDGBatchDelayBias | INT | Delay bias for global load batching |
| 805 | SchedTexBatchTargetSelectRegisterTarget | INT | Batch formation: prefer register-target-aware grouping |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | INT | Batch formation: prefer scheduler-target grouping |
| 807 | SchedTexBatchTargetTexReadTogether | INT | Batch formation: prefer grouping tex reads together |
| 931 | UseGroupOpexesForResourceScheduling | INT | Use grouped opexes for resource scheduling decisions |
F. Dependency Modeling (6 knobs)
These control how the scheduler builds and refines the dependency graph between instructions.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 753 | SchedAddDepFromGlobalMembarToCB | INT | Add dependency edge from global membar to CB unit |
| 759 | SchedMaxMemDep | INT | Max memory dependencies per instruction |
| 764 | SchedMemNoAlias | NONE | Assume no memory aliasing (aggressive scheduling) |
| 777 | SchedReduceRefPsuedoDepLimit | INT | Limit on reducing reference pseudo-dependencies |
| 780 | SchedRefineMemDepBudget | BDGT | Budget for memory dependency refinement iterations |
| 801 | SchedSymmetricAntiDepConflictWindow | BOOL | Enable symmetric anti-dependency conflict window |
G. Post-Scheduler (6 knobs)
The post-scheduler runs after register allocation (phase 103) and adjusts the schedule to account for actual register assignments. It primarily inserts stall cycles and adjusts issue delays.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 569 | PostSchedAdvLatencyHiding | BOOL | Enable advanced latency hiding in post-scheduler |
| 570 | PostSchedBudget | BDGT | Budget for post-scheduler iterations |
| 571 | PostSchedEarlyStall | INT | Early stall insertion mode |
| 572 | PostSchedForceReverseOrder | INT | Force reverse traversal order in post-scheduler |
| 573 | PostSchedIssueDelay | BOOL | Enable issue delay computation |
| 574 | PostSchedIssueDelayForNoWBStalls | BOOL | Compute issue delays for no-writeback stalls |
H. Ordering and Preservation (5 knobs)
These control whether the scheduler preserves the original instruction order (from the optimizer or PTX source) versus reordering freely.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 229 | ForcePreserveSchedOrderSameNvOpt | INT | Force preserve scheduling order from NvOpt pass |
| 594 | PreserveSchedOrder | NONE | Preserve source scheduling order (boolean) |
| 595 | PreserveSchedOrderSame | BOOL | Preserve scheduling order for same-priority instructions |
| 751 | SchedForceReverseOrder | INT | Force reverse scheduling order (bottom-up) |
| 769 | SchedPrefFurthestDep | BOOL | Prefer instructions with furthest dependency |
I. Scoreboard (4 knobs)
The hardware scoreboard tracks instruction completion. These knobs tune how the scheduler predicts scoreboard occupancy to avoid stalls.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
| 739 | SbXBlockLLSB | INT | Cross-block long-latency scoreboard tracking |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
Note: SbXBlock appears in both cross-block (D) and scoreboard (I) categories because it serves both purposes -- it controls whether the scoreboard state propagates across block boundaries, which is a prerequisite for cross-block scheduling correctness.
J. MMA Coupling (3 knobs)
Matrix multiply-accumulate instructions on certain architectures share functional unit resources. These knobs control how the scheduler models coupled execution.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 752 | SchedFP16CoupledMaxellPascal | INT | FP16 coupled execution mode on Maxwell/Pascal |
| 754 | SchedHmmaImmaBmmaCoupledAmperePlus | INT | HMMA/IMMA/BMMA coupled execution on Ampere+ |
| 366 | GroupOpexesForResourceSchedulingThreshold | DBL | Threshold for grouping opexes in resource scheduling |
K. Scheduler Model (4 knobs)
These control how the scheduler models the hardware pipeline and instruction movement costs.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 765 | SchedModelIdentityMove | INT | Model identity moves as zero-latency |
| 766 | SchedModelSharedPhysicalPipe | INT | Model shared physical pipe contention |
| 767 | SchedMultiRefDeltaLive | INT | Delta-live threshold for multi-reference instructions |
| 768 | SchedMultiRefDeltaLiveMinRefs | INT | Minimum reference count for delta-live calculation |
L. Budget, Scale, and Control (7 knobs)
General scheduling control knobs covering budgets, loop iteration estimates, the master disable switch, and validation.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 740 | SchedBumpScaleAugmentFactor | DBL | Augment factor for priority bump scaling |
| 748 | SchedDisableAll | INT | Master disable for all scheduling passes |
| 749 | SchedDynBatchBudget | BDGT | Budget for dynamic batching iterations |
| 750 | SchedEstimatedLoopIterations | STR | Estimated loop iterations (string: per-loop hints) |
| 809 | ScheduleKILs | INT | Schedule KIL (kill/discard) instructions |
| 810 | SchedValidateLiveness | INT | Enable liveness validation after scheduling |
| 811 | SchedXU64Latency | INT | XU64 unit latency override |
Disable Switches (75 knobs)
The disable switches are boolean knobs that turn off specific passes, optimizations, or workarounds. All 75 knobs containing "Disable" were decoded from ROT13 strings at 0x21BDE30--0x21BFA10. Nearly all are OKT_NONE (boolean) type -- setting them with no value or any value disables the corresponding feature. The single exception is RegAllocRematDisableRange, which is OKT_IRNG and accepts a "lo..hi" instruction index range.
The bare Disable knob at 0x21BE860 appears to be a master pass-disable switch. SchedDisableAll is the master scheduler disable. DisablePragmaKnobs prevents PTX .pragma directives from setting knobs -- a meta-level control that protects the knob system itself.
A. Workaround (WAR) Switches (9 knobs)
These disable hardware or compiler bug workarounds. Each War_SW* knob corresponds to an NVIDIA internal bug tracker ID. Disabling a WAR reverts to the unpatched behavior -- useful for bisecting whether a WAR is causing a regression.
| Name | Feature Disabled |
|---|---|
| DisableWar_SW200655588 | Workaround for bug SW-200655588 |
| DisableWar_SW2549067 | Workaround for bug SW-2549067 |
| DisableWar_SW2789503 | Workaround for bug SW-2789503 |
| DisableWar_SW2965144 | Workaround for bug SW-2965144 |
| DisableWar_SW3093632 | Workaround for bug SW-3093632 |
| DisableForwardProgressWar1842954 | Forward-progress guarantee workaround (bug 1842954) |
| DisableForwardProgressWar1842954ForDeferBlocking | Same WAR, variant for defer-blocking scheduling |
| DisableHMMARegAllocWar | HMMA (half-precision MMA) register allocation workaround |
| DisableMultiViewPerfWAR | Multi-view rendering performance workaround |
B. Memory and Addressing (11 knobs)
These control address computation, memory access conversion, and shared-memory optimizations.
| Name | Feature Disabled |
|---|---|
| DisableCvtaForGenmemToSmem | Generic-to-shared address space conversion via cvta |
| DisableDoubleIndexedAddress | Double-indexed addressing mode optimization |
| DisableErrbarAfterMembar | Error barrier (BAR.SYNC 15) insertion after membar.sys |
| DisableForceLDCTOLDCUConv | LDC to LDCU (constant uniform load) conversion |
| DisableImplicitMemDesc | Implicit memory descriptor inference |
| DisableLDCU256 | LDCU.256 -- 256-bit constant uniform load |
| DisableLDCUWithURb | LDCU with uniform register base addressing |
| DisableLongIntArithAddressFolding | Long integer arithmetic folding into address computation |
| DisableRemoveSmemLea | Shared memory LEA (load effective address) removal |
| DisableSmemSizePerCTACheck | Shared memory size per CTA validation check |
| DisableStrideOnAddr | Stride-on-address optimization (base+stride*index folding) |
C. Register Allocation and Uniform Registers (9 knobs)
These control uniform register (UR) file usage, live range management, and remat-related disable ranges.
| Name | Type | Feature Disabled |
|---|---|---|
| DisableConvergentWriteUR | NONE | Convergent write-to-UR optimization |
| DisableExtendedLiveRange | NONE | Extended live range optimization |
| DisableU128 | NONE | 128-bit uniform register support |
| DisableURLiveAcrossConvBound | NONE | UR liveness across convergence boundaries |
| DisableURLivenessTradeOff | NONE | UR liveness trade-off heuristic |
| DisableUreg | NONE | Uniform register file usage entirely |
| MercuryDisableLegalizationOfTexToURBound | NONE | Mercury tex-to-UR-bound legalization |
| RegAllocRematDisableRange | IRNG | Rematerialization for instruction index range lo..hi |
| RematDisableTexThrottleRegTgt | NONE | Texture throttle register target during remat |
D. Loop Optimization (6 knobs)
| Name | Feature Disabled |
|---|---|
| DisableAlignHotLoops | Hot loop alignment (NOP padding for fetch efficiency) |
| DisableDeadLoopElimination | Dead loop elimination pass |
| DisableLoopLevelVaryingAnalysis | Loop-level varying/invariant analysis |
| DisableLoopPrecheckForYields | Loop pre-check insertion for yield points (cooperative groups) |
| DisableMeshVCTALoop | Mesh shader virtual CTA loop optimization |
| DisablePartialUnrollOverflowCheck | Overflow check during partial loop unrolling |
E. Code Motion and Scheduling (6 knobs)
| Name | Feature Disabled |
|---|---|
| DisableLatTransitivity | Latency transitivity in scheduling dependency chains |
| DisableMoveCommoning | MOV-based equivalence propagation (commoning walker) |
| DisableNestedHoist | Nested code hoisting (loop-invariant-like motion) |
| DisableOffDeck | Off-deck scheduling (prefetch to off-deck buffer) |
| DisableSourceOrder | Source-order scheduling constraint |
| SchedDisableAll | Master switch: all scheduling passes |
F. Vectorization (4 knobs)
| Name | Feature Disabled |
|---|---|
| DisableFastvecEnhancement | Fast vectorization enhancement pass |
| DisableHalfPartialVectorWrites | Half-precision partial vector write coalescing |
| DisableReadVectorization | Load vectorization (coalescing scalar reads into vector loads) |
| DisableWriteVectorization | Store vectorization (coalescing scalar writes into vector stores) |
G. Predication and Branching (4 knobs)
| Name | Feature Disabled |
|---|---|
| CmpToMovPredCrossBlockDisable | CMP-to-MOV predicate propagation across basic blocks |
| DisableBranchPredInput | Branch predicate input optimization |
| DisableCmpToPred | CMP-to-predicate conversion |
| DisablePredication | Predication pass (phase 63, OriDoPredication) |
H. Synchronization and Barriers (2 knobs)
| Name | Feature Disabled |
|---|---|
| DisableRedundantBarrierRemoval | Redundant barrier removal pass |
| DisableStageAndFence | Stage-and-fence synchronization insertion |
I. Dead Code and Store Elimination (2 knobs)
| Name | Feature Disabled |
|---|---|
| DisableDeadStoreElimination | Dead store elimination pass |
| DisableStraightenInSimpleLiveDead | Straightening within simple live/dead analysis |
J. Control Flow Merging (5 knobs)
| Name | Feature Disabled |
|---|---|
| DisableEarlyExtractBCO | Early extraction of BCO (branch code optimization objects) |
| DisableMergeEquivalentConditionalFlow | Phase 133: tail merging of equivalent conditional branches |
| DisableMergeFp16MovPhi | FP16 MOV-PHI merge optimization |
| DisableMergeSamRamBlocks | SAM/RAM block merging (surface/texture access coalescing) |
| DisableOptimizeHotColdFlow | Hot/cold flow optimization (code layout splitting) |
K. Pass Control (2 knobs)
| Name | Feature Disabled |
|---|---|
| Disable | Master disable switch (bare name) |
| DisablePragmaKnobs | PTX .pragma-based knob overrides |
L. Sanitizer (3 knobs)
These control the address sanitizer instrumentation for different memory spaces. When the sanitizer is active, these knobs can selectively disable checking for one space while keeping the others.
| Name | Feature Disabled |
|---|---|
| SanitizeDisableGlobal | Address sanitizer for global memory accesses |
| SanitizeDisableLocal | Address sanitizer for local memory accesses |
| SanitizeDisableShared | Address sanitizer for shared memory accesses |
M. Floating Point (2 knobs)
| Name | Feature Disabled |
|---|---|
FPFoldDisable | Floating-point constant folding |
FPRefactoringDisable | Floating-point expression refactoring |
N. Miscellaneous (10 knobs)
| Name | Feature Disabled |
|---|---|
DisableBW225LongIntArith | BW225 (Blackwell) long integer arithmetic optimization |
DisableBptTrapNoReturn | BPT.TRAP no-return semantics (debugger breakpoint trap) |
DisableDependentConstExpr | Dependent constant expression optimization |
DisableISBESharing | ISBE (indexed set buffer entry) sharing for bindless textures |
DisableMarkF2FPackbTo16Bit | Marking F2F.PACKB as 16-bit operation |
DisableNonUniformQuadDerivatives | Non-uniform quad derivative computation |
DisablePadding | NOP padding insertion (alignment and scheduling) |
DisablePicCodeGen | Position-independent code generation |
DisableSopSr | SOP (scalar operation) on special registers (SR) |
DisableSuperUdp | Super-UDP (enhanced uniform datapath) optimization |
Rematerialization Knobs (35 knobs)
Rematerialization knobs control the three dedicated remat pipeline phases (Phase 28: SinkRemat, Phase 54: OriDoRematEarly, Phase 69: OriDoRemat) and the cost model that decides whether recomputing a value is cheaper than keeping it live in a register. These are separate from the 12 RegAlloc*Remat* knobs documented above in section B, which control allocator-integrated rematerialization. The distinction matters: allocator-integrated remat fires during register allocation itself (sub_93AC90), while these knobs tune the standalone pre-allocation and post-predication remat passes.
The 35 knobs form two contiguous blocks in the descriptor table; one related knob (index 475, documented at the end of this page) sits outside them:
- Remat* (27 knobs, indices 702--728): Late rematerialization (Phase 69) and shared cost model
- SinkRemat* (8 knobs, indices 824--831): Early sink+remat (Phase 28)
A. Remat Enable/Disable (5 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 709 | RematDisableTexThrottleRegTgt | INT | Disable texture-throttle register targeting during remat |
| 710 | RematEarlyEnable | INT | Enable Phase 54 early remat mode activation |
| 711 | RematEnable | INT | Master enable for Phase 69 late rematerialization |
| 712 | RematEnablePReg | NONE | Enable predicate register rematerialization (boolean flag) |
| 726 | RematStressTest | NONE | Force all remat candidates to be rematerialized (debug, boolean flag) |
Knob 711 (RematEnable) is the master switch. When zeroed via -knob RematEnable=0, Phase 69 skips its core loop entirely. Knob 710 (RematEarlyEnable) independently controls Phase 54's mode flag write (ctx+1552 = 4). Knob 726 (RematStressTest) is a debug-only boolean that forces every candidate to be rematerialized regardless of profitability -- useful for stress-testing correctness.
B. Remat Cost Model (10 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 702 | RematAbsCostFactor | DBL | Absolute cost scaling factor for remat profitability |
| 703 | RematBackOffRegTargetFactor | DBL | Back-off factor for register pressure target during remat |
| 705 | RematColdBlockRatio | DBL | Cost discount ratio for cold (rarely executed) blocks |
| 713 | RematGlobalCostFactor | DBL | Global cost multiplier for cross-block rematerialization |
| 714 | RematGlobalLowCostFactor | DBL | Cost factor for low-cost (cheap ALU: MOV, IADD, LOP3) remat |
| 716 | RematLdcCost | DBL | Cost weight assigned to LDC (load-from-constant-bank) remat |
| 719 | RematMemCost | DBL | Cost weight for memory-sourced (LD/ST) rematerialization |
| 722 | RematReadUAsLdc | INT | Treat uniform address reads as LDC for cost classification |
| 727 | RematTexInstRatioThreshold | DBL | Texture instruction ratio threshold for throttle activation |
| 728 | RematTexThrottleRegTgtScale | DBL | Scale factor for register target when texture throttle is active |
These 10 knobs parameterize the remat profitability function (sub_90B790). The cost model computes remat_cost = instruction_cost * factor and compares against register savings. The DBL-typed knobs (9 of 10) are floating-point multipliers that allow fine-grained tuning. The texture-specific knobs (727, 728) implement a throttle: when the ratio of texture instructions exceeds the threshold, the register target is scaled to avoid excessive register use that would harm texture unit throughput.
C. Register Pressure Control (5 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 706 | RematConservativeRegSlack | INT | Extra registers to reserve beyond target (conservative mode) |
| 708 | RematCostRegLimit | INT | Max register count considered during cost analysis |
| 718 | RematMaxRegCount | INT | Absolute ceiling on registers for remat decisions |
| 723 | RematRegTargetFactor | DBL | Scaling factor for computing the register pressure target |
| 724 | RematRegTargetTrialLimit | INT | Max iterations when searching for optimal register target |
The register target is the pressure level below which rematerialization becomes profitable. RematRegTargetFactor (723) scales the occupancy-derived target. RematRegTargetTrialLimit (724) caps the binary-search iterations in the target-finding loop. RematMaxRegCount (718) is a hard ceiling -- if current pressure exceeds this value, the remat pass operates in aggressive mode.
D. Instruction and Code Limits (2 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 707 | RematCostInstLimit | INT | Max instruction count for inclusion in cost model |
| 715 | RematInflationSlack | INT | Allowed code-size inflation slack (extra instructions from remat) |
RematCostInstLimit (707) prevents the cost model from analyzing extremely large remat sequences. RematInflationSlack (715) limits how many extra instructions rematerialization may introduce before the pass backs off.
E. Placement Control (4 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 717 | RematLowCostPlacementLimit | DBL | Max placement distance for low-cost remat candidates |
| 720 | RematMinDistance | INT | Minimum def-to-remat distance (instructions) before remat is attempted |
| 721 | RematPlacementLookback | INT | Lookback window size for placement-site search |
| 725 | RematSortRematChain | INT | Sort remat chain by priority before placement (0=off, 1=on) |
These knobs control where rematerialized instructions are placed relative to their uses. RematMinDistance (720) ensures remat is not attempted for short live ranges where the original definition is close enough. RematPlacementLookback (721) limits how far back the placement algorithm scans when searching for a profitable insertion point.
F. Remat Budget (1 knob)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 704 | RematBudget | BDGT | Optimization budget for the late remat pass (phase 69) |
BDGT-typed knobs carry a primary value and a secondary counter. The budget is decremented as each remat decision is committed. When exhausted (secondary reaches zero), the pass stops processing further candidates. This provides a deterministic cap on compile-time cost.
G. SinkRemat (Phase 28) Knobs (8 knobs, indices 824--831)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 824 | SinkRematAbsCostLimit | DBL | Absolute cost ceiling for sinking+remat decisions |
| 825 | SinkRematBudget | BDGT | Optimization budget for the sink+remat pass |
| 826 | SinkRematDeltaRegsRatio | DBL | Register pressure delta ratio threshold for sink profitability |
| 827 | SinkRematEnable | INT | Master enable for Phase 28 SinkRemat |
| 828 | SinkRematMinDefPlaceDist | INT | Minimum definition-to-placement distance for sinking |
| 829 | SinkRematMinPlaceRefDist | INT | Minimum placement-to-reference distance for sinking |
| 830 | SinkRematMultiRefXBlkUsesPenaltyFactor | DBL | Penalty multiplier for multi-reference cross-block uses |
| 831 | SinkRematPredPenaltyFactor | DBL | Penalty multiplier for sinking predicated instructions |
Phase 28's SinkRemat pass (entry: sub_913A30, core: sub_A0F020) sinks instructions closer to their uses and marks remat candidates. Knob 827 (SinkRematEnable) is the master switch. The distance knobs (828, 829) prevent unprofitable micro-sinks. The penalty factors (830, 831) make the cost model more conservative for predicated instructions and for instructions with multiple cross-block uses, where sinking may duplicate code along multiple paths.
Related Knob Outside the Remat Block
| Index | Name | Type | Purpose |
|---|---|---|---|
| 475 | MovWeightForRemat | DBL | MOV instruction weight in remat profitability scoring |
This knob sits in the general MOV-weight family (indices 474--476) rather than the Remat block. It tunes how MOV instructions contribute to the scheduling cost model's remat profitability calculation. When the remat candidate is a MOV chain, this weight determines the per-MOV cost used to decide whether rematerialization beats keeping the value live.
DUMP_KNOBS_TO_FILE
The DUMP_KNOBS_TO_FILE environment variable triggers a full dump of all knob values to a file. Checked during KnobInit (sub_7A0C10) via getenv("DUMP_KNOBS_TO_FILE"):
char* dump_path = getenv("DUMP_KNOBS_TO_FILE");
if (dump_path) {
size_t len = strlen(dump_path);
// Store into SSO string at knob_state+88..104
}
The path is stored in a small-string-optimized (SSO) buffer at knob_state offsets +88 through +104:
Offset Size Field
────── ──── ─────────────────────────────────────
+88 8 data pointer (or first 8 inline bytes if len <= 15)
+96 8 string length
+104 8 capacity (or remaining inline bytes)
Paths of 15 bytes or fewer are stored inline without heap allocation. Longer paths allocate via the arena allocator at knob_state+8. The dump is produced later during compilation -- KnobInit only stores the path; the actual file write happens after all knobs are resolved.
This is the primary mechanism for discovering which knobs exist and what their current values are. Setting it produces a text file with all 1,294 knob names and their resolved values.
Error Handling
The knob system uses structured error descriptors (96 bytes each) allocated from an arena:
Offset Size Field
────── ──── ─────────────────────────────────────
+0 8 formatted message string pointer
+8 8 message length
+16 8 source file path pointer
+24 8 source file path length
+32 8 line number
+40 8 function name pointer
+48 48 (additional context fields)
Three error-handling functions (two constructors plus a merge combinator):
| Function | Address | Purpose |
|---|---|---|
FormatKnobError | sub_79CDB0 | General knob error with vsnprintf formatting |
FormatKnobErrorWithContext | sub_79AED0 | Error with additional context (knob name, value) |
KnobError::Merge | sub_79A780 | Chains multiple errors for accumulated reporting |
Errors propagate through a tagged result: bit 0 of *(result + 16) is set on error, cleared on success. The GetKnobIndex return protocol:
// Success:
*(byte*)(result + 16) &= ~1; // clear error bit
*(int32*)(result) = knob_index; // store index
// Failure:
*(byte*)(result + 16) |= 1; // set error bit
*(result + 0..15) = error_desc; // store error descriptor
KnobValue Lifecycle
Destruction
KnobValue::Destroy (sub_797790) resets a 72-byte value slot before writing a new value. It switches on the type tag:
| Type | Destruction Action |
|---|---|
| 0-5, 7, 8 | No-op (POD types, no heap allocation) |
| 6 (int-list) | Walk doubly-linked list, free each node via allocator+32 |
| 9 (opcode-list) | Walk doubly-linked list, free each node via allocator+32 |
| 10 (int-list dynamic) | Free the growable array block |
Deep Copy
KnobValue::CopyFrom (sub_7978F0) handles deep copy of value slots, switching on type to properly duplicate linked lists and allocated buffers.
KnobInit (sub_7A0C10) constructs a new knob state object by allocating 72 * count bytes for the value array, then deep-copying each slot from a source state if one exists.
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
sub_6F04B0 | 6,824 | ReportKnobError (DAG) | HIGH |
sub_6F0820 | 2,782 | GetKnobIndex (DAG) | CERTAIN |
sub_6F0A30 | 8,700 | RegisterKnob (DAG) | HIGH |
sub_6F0FF0 | 13,000 | GetKnobValue (DAG) | HIGH |
sub_6F1B10 | 13,000 | BuildKnobTable (DAG) | HIGH |
sub_6F2380 | 14,000 | ParseKnobString (DAG) | HIGH |
sub_6F68C0 | 9,000 | InitializeKnobs (DAG) | HIGH |
sub_6F7360 | 18,306 | ParseKnobValue (DAG) | CERTAIN |
sub_6F83C0 | — | ParseWhenShorthand (DAG) | MEDIUM |
sub_797790 | 385 | KnobValue::Destroy | HIGH |
sub_7978F0 | 240 | KnobValue::CopyFrom | MEDIUM |
sub_7973E0 | 400 | KnobType::GetSize | MEDIUM |
sub_798280 | 900 | ParsePhaseNameFragment | MEDIUM |
sub_798B60 | 1,776 | NamedPhases::ParsePhaseList | CERTAIN |
sub_799250 | 68 | IsPassDisabled | HIGH |
sub_7992A0 | 894 | IsPassDisabledFull | HIGH |
sub_79A490 | 600 | KnobError::AppendContext | MEDIUM |
sub_79A5D0 | 800 | KnobError::Format | MEDIUM |
sub_79A780 | 2,200 | KnobError::Merge | MEDIUM |
sub_79AED0 | 1,000 | FormatKnobErrorWithContext | HIGH |
sub_79B240 | 518 | GetKnobIndex (OCG) | CERTAIN |
sub_79B450 | 200 | GetKnobIndexWithValidation | HIGH |
sub_79B530 | 3,296 | ParseKnobsString | HIGH |
sub_79C210 | 2,200 | ParseKnobOverrides | HIGH |
sub_79C9D0 | 1,600 | KnobsInitFromEnv | HIGH |
sub_79CDB0 | 1,400 | FormatKnobError | HIGH |
sub_79D070 | 2,312 | ReadKnobsFile | CERTAIN |
sub_79D990 | 7,073 | KnobsInit (master) | HIGH |
sub_79F540 | 3,640 | ParseKnobValue (OCG) | CERTAIN |
sub_7A0A90 | 350 | KnobValue::CopyListValue | MEDIUM |
sub_7A0C10 | 1,745 | KnobInit (per-knob) | HIGH |
sub_7A1B80 | 400 | GetKnobIntValue | MEDIUM |
sub_7A1CC0 | 350 | GetKnobBoolValue | MEDIUM |
sub_7A1E10 | 400 | GetKnobStringValue | MEDIUM |
sub_7A2860 | 2,100 | SetKnobValue | MEDIUM |
sub_7ACEA0 | 3,700 | OCGKnobSetup | MEDIUM |
Reimplementation Notes
To reimplement the knobs system:
1. Define the knob table as a compile-time array of descriptors (name, alias, type). No need for ROT13 -- that is purely obfuscation. Use an enum for knob indices so call sites reference `KNOB_SchedNumBB_Limit` instead of magic index 294.
2. Parse order matters. Process sources in the documented priority order (env, file, CLI, pragma, WHEN). Last-write-wins semantics.
3. The WHEN= system is the complex part. You need FNV-1a hashing of function identifiers and a per-function override table. The hash table at `ctx+120 → +1128` uses open addressing with linear probing.
4. Budget knobs (OKT_BDGT) are just integers with a secondary tracking field. The secondary starts at 0 and is used by cost models to track how much "budget" remains during optimization.
5. Int-range knobs (OKT_IRNG) use `..` as the range separator: `"100..200"` means [100, 200]. Missing bounds default to `INT_MIN` (0x80000000) / `INT_MAX` (0x7FFFFFFF).
6. The opcode-string-list type (OKT_OPCODE_STR_LIST) carries pairs of (opcode_name, integer). The opcode name is resolved to an internal opcode ID via the SASS opcode table. Used for per-instruction tuning overrides.
Cross-References
- CLI Options -- public command-line flags, the user-facing layer above knobs
- Optimization Levels -- O-levels set specific knob presets
- DUMPIR & NamedPhases -- DUMPIR knob and phase-level dump control
- Phase Manager -- pass disable mechanism consumes the DisablePhases knob
- Scheduling Algorithm -- consumes Sched* knobs
- Allocator Architecture -- consumes RegAlloc* knobs
- Mercury Encoder -- consumes Mercury* knobs and DAG knob table
Optimization Levels
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The --opt-level (-O) flag controls how aggressively ptxas optimizes during the 159-phase pipeline. The option is parsed into a 32-bit integer at options block offset +148 by sub_434320 (line 216: sub_1C96470(v10, "opt-level", a3 + 148, 4)). The default value is 3. The documented range is 0--4, but the internal NvOpt recipe system supports levels 0--5, and the scheduler and rematerialization passes distinguish level 5 from lower values.
The optimization level propagates through the compilation context and is read by individual passes via sub_7DDB50 (232 bytes at 0x7DDB50), which combines the opt-level check with a knob 499 guard. Passes that call sub_7DDB50 receive the opt-level value (stored at compilation context offset +2104) only if knob 499 is enabled; otherwise, the function returns 1 (effectively clamping the level to O1 behavior).
| CLI option | --opt-level / -O |
| Options block offset | +148 (int32) |
| Default | 3 |
| Documented range | 0--4 |
| Internal range | 0--5 (NvOpt levels) |
| Accessor | sub_7DDB50 (0x7DDB50, 232 bytes) |
| Knob guard | 499 (if disabled, accessor returns 1) |
| Parse location | sub_434320 line 216 |
| Debug override | sub_431A40 forces level to 0 when -g is set |
| Ofast-compile override | sub_434320 lines 635--679 |
Level Summary
| Level | Name | Use Case |
|---|---|---|
| 0 | No optimization | Debug builds (-g), maximum source fidelity |
| 1 | Minimal optimization | Fast compile, basic folding/DCE |
| 2 | Standard optimization | Balanced speed/compile-time (previous default) |
| 3 | Full optimization | Default; all standard passes enabled |
| 4 | Aggressive optimization | Extra loop peeling, speculative hoisting |
| 5 | Maximum optimization | Full SinkRemat+Cutlass, highest compile time |
Options Block Fields Affected by Opt Level
The option parser (sub_434320) and debug handler (sub_431A40) modify several options block fields based on the optimization level. Key interactions discovered from decompiled code:
| Offset | Field | O0 | O1 | O2+ | Source |
|---|---|---|---|---|---|
+148 | opt_level | 0 | 1 | 2, 3, 4 | Direct from -O |
+160 | register_usage_level | Forced to 5 if set with -O0 (warning issued) | 5 (default) | 5 (default) | sub_434320 line 359--363 |
+235 | cloning_disabled | 0 (disabled) | per-CLI | per-CLI | sub_434320 line 776 |
+288 | device_debug | (from -g) | (from -g) | (from -g) | CLI only |
+292 | sp_bounds_check | 1 (auto-enabled) | per-CLI | per-CLI | sub_434320 line 775 |
+326 | no_cloning | 1 (when -g) | per-CLI | per-CLI | sub_431A40 line 42 |
+408 | allow_expensive_optimizations | false | false | true | sub_434320 line 768 |
+477 | fast_compile | forced 0 with -g | per-CLI | per-CLI | sub_431A40 line 28 |
The critical line at sub_434320 offset 768:
// allow-expensive-optimizations defaults to (opt_level > 1)
if (!user_set_allow_expensive)
options->allow_expensive_optimizations = (options->opt_level > 1);
Debug Mode Override (-g)
When --device-debug (-g) is active, sub_431A40 (at 0x431A40) forces the optimization level to 0 and disables most optimization features:
// sub_431A40 pseudocode
void ApplyDebugOverrides(options_block* opts, bool suppress_warning) {
if (suppress_warning) {
opts->device_debug = 1;
opts->sp_bounds_check_pair = 0x0101; // +16 = {1, 1}
}
opts->sp_bounds_check = 1; // +292
// Warn if user explicitly set opt-level with -g
if (was_set("opt-level") && opts->opt_level != 0)
warn("ignoring -O with -g");
// Warn about incompatible options
if (was_set("register-usage-level"))
warn("'--device-debug' overrides '--register-usage-level'");
// Force register_usage_level to {5, 5} (pair at +160)
*(int64*)(opts + 160) = 0x500000005LL;
if (opts->fast_compile)
warn("'--device-debug' overrides '--fast-compile'");
opts->fast_compile = 0;
if (opts->ofast_compile is "max" or "mid" or "min")
warn("'--device-debug' overrides '--Ofast-compile'");
opts->ofast_compile = "0";
opts->opt_level = 0; // +148
// Handle cloning
if (was_set("cloning") && opts->device_debug && !opts->no_cloning)
warn("-cloning=yes incompatible with -g");
opts->cloning_disabled = 0; // +235
}
The 0x500000005LL write to offset +160 sets both the 32-bit register-usage-level and the adjacent 32-bit field to 5, resetting any user override.
Ofast-Compile Interaction
The --Ofast-compile (-Ofc) option provides a compile-time vs code-quality tradeoff orthogonal to -O. It has four settings: 0 (disabled), min, mid, max. Each setting overrides the opt-level and related flags:
| Ofast-compile | Forces opt_level to | Cloning | Split-compile | Expensive opts | Notes |
|---|---|---|---|---|---|
0 | (no change) | (no change) | (no change) | (no change) | Default |
min | 1 | disabled | (no change) | true (forced) | Warns if -O set to a value other than 1 |
mid | 1 | disabled | (no change) | (no change) | Disables cloning when no split-compile |
max | 0 | disabled | (no change) | (no change) | Most aggressive compile-time reduction |
From sub_434320 lines 635--679:
if (ofast_compile == "max") {
if (was_set("cloning") && !no_cloning)
warn("-cloning=yes incompatible with --Ofast-compile=max");
no_cloning = 1;
if (was_set("opt-level") && opt_level != 0)
warn("-opt-level=<1,2,3> incompatible with --Ofast-compile=max");
opt_level = 0;
}
if (ofast_compile == "mid") {
no_cloning = 1;
if (!split_compile) cloning_disabled = 1;
if (was_set("opt-level") && opt_level != 1)
warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=mid");
opt_level = 1;
fast_compile_mode = 1;
}
if (ofast_compile == "min") {
no_cloning = 1;
if (!split_compile) cloning_disabled = 1;
if (was_set("opt-level") && opt_level != 1)
warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=min");
opt_level = 1;
fast_compile_mode = 1;
if (was_set("allow-expensive-optimizations") && !allow_expensive)
warn("-allow-expensive-optimizations=false incompatible with --Ofast-compile=min");
allow_expensive_optimizations = 1;
}
Per-Phase Gating by Optimization Level
Optimization levels control the pipeline through two mechanisms:
- Static `isNoOp()` overrides -- the AdvancedPhase gate vtables are overridden at pipeline construction time based on the target architecture and opt-level.
- Runtime opt-level checks -- individual pass execute functions call `sub_7DDB50` (the opt-level accessor) and early-return when the level is below their threshold.
Gate Accessor: sub_7DDB50
// sub_7DDB50 pseudocode (232 bytes at 0x7DDB50)
int getOptLevel(compilation_context* ctx) {
knob_vtable* kv = ctx->knob_state; // ctx + 1664
query_func qf = kv->vtable[19]; // vtable + 152
if (qf == sub_67EB60) { // fast-path: known vtable
check_func cf = kv->vtable[9]; // vtable + 72
bool knob_499;
if (cf == sub_6614A0)
knob_499 = (kv->state[35928] != 0); // direct field read
else
knob_499 = cf(ctx->knob_state, 499); // indirect query
if (!knob_499)
return ctx->opt_level; // ctx + 2104
int64 state = kv->state;
int iteration = state[35940];
if (state[35936] > iteration) {
state[35940] = iteration + 1; // increment pass counter
return ctx->opt_level;
}
} else if (qf(ctx->knob_state, 499, 1)) {
return ctx->opt_level;
}
return 1; // fallback: treat as O1
}
When knob 499 is disabled (or its iteration budget is exhausted), sub_7DDB50 returns 1 regardless of the actual opt-level. This provides a master kill-switch: setting knob 499 to false effectively caps all opt-level-gated behavior at O1.
Phase Activation Table
The following table lists every phase where the optimization level has been confirmed to affect behavior, based on decompiled isNoOp() methods and execute-function guard checks.
Threshold notation: > N means the phase requires opt_level > N (i.e., level N+1 and above).
| Phase | Name | Threshold | Effect at threshold | Source |
|---|---|---|---|---|
| 14 | DoSwitchOptFirst | > 0 | Branch/switch optimization enabled | isNoOp returns true at O0 |
| 15 | OriBranchOpt | > 0 | Branch folding enabled | isNoOp returns true at O0 |
| 18 | OriLoopSimplification | 4--5 | Aggressive loop peeling enabled at O4+ | sub_78B430 checks opt_level |
| 22 | OriLoopUnrolling | > 1 | Loop unrolling requires at least O2 | Execute guard via sub_7DDB50 |
| 24 | OriPipelining | > 1 | Software pipelining requires at least O2 | Execute guard |
| 26 | OriRemoveRedundantBarriers | > 1 | Barrier optimization at O2+ | Gating: opt_level > 1 |
| 28 | SinkRemat | > 1 / > 4 | O2+: basic path; O5: full cutlass mode | Two-tier guard in sub_913A30 |
| 30 | DoSwitchOptSecond | > 0 | Second switch pass at O1+ | isNoOp returns true at O0 |
| 38 | OptimizeNestedCondBranches | > 0 | Nested branch simplification at O1+ | isNoOp returns true at O0 |
| 49 | GvnCse | > 1 | GVN-CSE requires at least O2 | Execute guard |
| 54 | OriDoRematEarly | > 1 | Early rematerialization at O2+ | sub_7DDB50 check |
| 63 | OriDoPredication | > 1 | If-conversion at O2+ | Execute guard |
| 69 | OriDoRemat | > 1 | Late rematerialization at O2+ | sub_7DDB50 check |
| 71 | OptimizeSyncInstructions | > 1 | Sync optimization at O2+ | Gating: opt_level > 1 |
| 72 | LateExpandSyncInstructions | > 2 | Late sync expansion at O3+ | Gating: opt_level > 2 |
| 95 | SetAfterLegalization | > 1 | Post-legalization flag at O2+ | sub_7DDB50 check |
| 99 | OriDoSyncronization | > 1 | Sync insertion at O2+ | Gating: opt_level > 1 |
| 100 | ApplyPostSyncronizationWars | > 1 | WAR fixup at O2+ | Gating: opt_level > 1 |
| 110 | PostSchedule | > 0 | Full post-schedule at O1+ | Mode selection |
| 115 | AdvancedScoreboardsAndOpexes | > 0 | Full scoreboard generation at O1+ | Hook activated at O1+ |
| 116 | ProcessO0WaitsAndSBs | == 0 | Conservative scoreboards at O0 only | Active only at O0 |
O-Level Feature Matrix
| Feature | O0 | O1 | O2 | O3 | O4 | O5 |
|---|---|---|---|---|---|---|
| Basic block merging | off | on | on | on | on | on |
| Branch/switch optimization | off | on | on | on | on | on |
| Copy propagation + const folding | off | on | on | on | on | on |
| Dead code elimination | partial | on | on | on | on | on |
| Loop canonicalization | basic | basic | full | full | aggressive | aggressive |
| Loop unrolling | off | off | on | on | on | on |
| Software pipelining | off | off | on | on | on | on |
| Strength reduction | off | on | on | on | on | on |
| GVN-CSE | off | off | on | on | on | on |
| Predication (if-conversion) | off | off | on | on | on | on |
| Rematerialization (early) | off | off | on | on | on | on |
| Rematerialization (late) | off | off | on | on | on | on |
| SinkRemat (full) | off | off | off | off | off | on |
| Cutlass iterative remat | off | off | off | off | off | on |
| Loop peeling (aggressive) | off | off | off | off | on | on |
| Barrier optimization | off | off | on | on | on | on |
| Sync instruction optimization | off | off | on | on | on | on |
| Late sync expansion | off | off | off | on | on | on |
| Post-legalization mark | off | off | on | on | on | on |
| Allow expensive optimizations | off | off | on | on | on | on |
| Speculative hoisting | off | off | on | on | on | on |
| Hot/cold partitioning | off | on | on | on | on | on |
| Full scoreboard analysis (115) | off | on | on | on | on | on |
| Conservative scoreboards (116) | on | off | off | off | off | off |
| Stack-pointer bounds check | auto-on | off | off | off | off | off |
| Cloning | disabled | on | on | on | on | on |
Notes:
- "partial" DCE at O0: `EarlyOriSimpleLiveDead` (phase 10) still runs for basic cleanup even at O0.
- O4 and O5 are not documented in `--help` but are accepted internally. O4 is equivalent to O3 plus aggressive loop peeling. O5 adds the full SinkRemat pass with cutlass iteration support.
Scoreboard Path Selection
The scoreboard generation subsystem has two mutually exclusive paths, selected by optimization level:
O0: Conservative Path (Phase 116)
Phase 116 (ProcessO0WaitsAndSBs) inserts maximum-safety scheduling metadata:
For every instruction:
stall_count = 15 // maximum stall (15 cycles)
wait_mask = 0x3F // wait on all 6 barriers
write_barrier = 7 // no barrier assignment (7 = none)
read_mask = 0 // no read barriers
yield = 1 // yield after every instruction
This eliminates all instruction-level parallelism. Every instruction stalls for the maximum count and waits on all six dependency barriers before issuing. The result is correct but extremely slow code -- suitable only for debugging.
O1+: Full Analysis Path (Phase 115)
Phase 115 (AdvancedScoreboardsAndOpexes) runs the complete dependency analysis pipeline:
sub_A36360(52 KB) -- Master control word generator with per-opcode dispatchsub_A23CF0(54 KB) -- DAG list scheduler heuristic for barrier assignment- Per-field encoders for stall, yield, barrier, and scoreboard dependency fields
The full path computes precise stall counts based on actual instruction latencies from the hardware profile, assigns the minimum necessary dependency barriers (6 available per SM), and inserts yield hints only where the warp scheduler benefits from switching.
Scheduling Direction
The scheduling infrastructure (sub_8D0640) selects scheduling direction based on opt-level:
| Condition | Direction | Strategy |
|---|---|---|
opt_level <= 2 | Forward-pass | Register-pressure-reducing: prioritizes freeing registers |
opt_level > 2 | Reverse-pass | Latency-hiding: prioritizes ILP and memory latency overlap |
At the default O3, the scheduler uses the reverse-pass strategy, which hides memory latencies at the cost of potentially higher register pressure. At O1--O2, the forward-pass strategy minimizes peak register usage.
The direction selection happens in PreScheduleSetup (sub_8CBAD0), called from the scheduling orchestrator with the boolean opt_level > 2:
PreScheduleSetup(sched, opt_level > 2); // sub_8CBAD0
Additionally, the ScheduleInstructionsReduceReg phase (mode 0x39) is enabled by default at O3 and above, providing a dedicated register-pressure-reduction scheduling pass before the ILP pass.
Register Allocation Differences
The register allocator itself (fat-point greedy at sub_957160) does not directly branch on the optimization level. However, the opt-level affects register allocation indirectly through:
- `--register-usage-level` (offset +160, range 0--10, default 5): At O0 with `-g`, this is forced to 5 regardless of user setting. The value modulates the per-class register budget at `alloc + 32*class + 884`.
- `allow-expensive-optimizations` (offset +408): Defaults to `true` when `opt_level > 1`. When true, the allocator and related passes are permitted to spend more compile time on better solutions (e.g., more spill-retry iterations, more aggressive coalescing).
- Phase gating: At O0, passes that reduce register pressure (rematerialization, predication, loop optimizations) are disabled, so the allocator receives un-optimized IR with higher register demand. This typically results in more spills at O0.
NvOpt Recipe System
The NvOpt recipe system (Phase 1: ApplyNvOptRecipes, option 391) provides an additional optimization-level axis. When enabled, the PhaseManager allocates a 440-byte NvOptRecipe sub-manager that configures per-phase aggressiveness:
| NvOpt level | Behavior |
|---|---|
| 0 | Minimal optimization (fast-compile path, many phases set to isNoOp()) |
| 1--2 | Standard optimization |
| 3--4 | Aggressive optimization (loop unrolling, speculative hoisting enabled) |
| 5 | Maximum optimization (may significantly increase compile time) |
The NvOpt level is validated in sub_C173E0 via the string "Invalid nvopt level : %d.", confirming the range 0--5. Recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes.
The NvOpt level is distinct from the -O CLI level. The -O level controls which phases run at all (via isNoOp() and sub_7DDB50 guards); the NvOpt level controls how aggressively the phases that do run behave (via recipe parameters).
Knob Defaults That Change Per Level
Several knobs have default values that vary by optimization level. The most significant:
| Knob | O0 Default | O1 Default | O2+ Default | Effect |
|---|---|---|---|---|
| 487 | enabled | enabled | enabled | Master optimization enable (checked by many passes) |
| 499 | (varies) | (varies) | (varies) | Guard knob for sub_7DDB50 opt-level accessor |
| 595 | true | true | true | Scheduling enable (but O0 uses conservative path) |
| 419 | -- | -- | -- | Forward scheduling mode (bit 3 in scheduler flags) |
Knob 487 is the most pervasive: it is checked by loop simplification, barrier optimization, sync optimization, predication, rematerialization, and scheduling passes. Disabling it overrides the opt-level and turns off the corresponding pass regardless of -O setting.
Key Decompiled Evidence
Options block opt_level field (offset +148)
// sub_434320 line 216: parse opt-level from CLI
sub_1C96470(v10, "opt-level", a3 + 148, 4);
allow-expensive-optimizations defaults to (opt_level > 1)
// sub_434320 line 768
if (!user_set_allow_expensive)
*(a3 + 408) = *(a3 + 148) > 1;
O0 forces sp-bounds-check and disables cloning
// sub_434320 lines 773-776
if (opt_level == 0) {
*(a3 + 292) = 1; // sp_bounds_check = true
*(a3 + 292) = 1; // sp_bounds_check = true
}
Debug mode forces O0
// sub_431A40 line 33
*(a3 + 148) = 0; // opt_level = 0
Ofast-compile=max forces O0
// sub_434320 line 646
*(a3 + 148) = 0; // opt_level = 0 for Ofast=max
Ofast-compile=mid/min forces O1
// sub_434320 lines 659, 674
*(a3 + 148) = 1; // opt_level = 1 for Ofast=mid and min
*(a3 + 572) = 1; // fast_compile_mode = 1
Scoreboard mutually exclusive paths
// Phase 115 (AdvancedScoreboardsAndOpexes): isNoOp() returns true at O0
// Phase 116 (ProcessO0WaitsAndSBs): isNoOp() returns true at O1+
SinkRemat two-tier gating
// sub_913A30 (SinkRemat core, phase 28)
if (getOptLevel(ctx) <= 1) return; // O0/O1: skip entirely
if (getOptLevel(ctx) <= 4) return; // O2-O4: skip full sink+remat
// O5: proceed to cutlass iterative mode
Sync barrier gating
// Phase 99 (OriDoSyncronization), Phase 71 (OptimizeSyncInstructions):
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return ; skip if opt_level <= 1
// Phase 72 (LateExpandSyncInstructions):
// Requires opt_level > 2
Cross-References
- CLI Options -- `--opt-level`, `--device-debug`, `--Ofast-compile`, `--allow-expensive-optimizations`
- Knobs System -- Knob 487 (master enable), knob 499 (opt-level guard)
- Pass Inventory -- Complete 159-phase table with per-phase descriptions
- Phase Manager -- AdvancedPhase hooks and dispatch loop
- Scoreboards -- Phase 115/116 scoreboard generation
- Scheduling -- Forward vs reverse scheduling direction
- Rematerialization -- O-level-dependent remat behavior
- Branch & Switch -- O0 disables all branch/switch optimization
- Sync & Barriers -- Per-pass opt_level thresholds
- Loop Passes -- O4/O5 aggressive loop peeling
- Optimization Pipeline -- Pipeline-level O0 vs O1+ gating
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_434320 | -- | CLI option parser; parses --opt-level at line 216, handles --Ofast-compile at lines 635--679, sets allow-expensive-optimizations default at line 768 | 0.95 |
sub_431A40 | -- | Debug mode override; forces opt-level to 0, disables cloning, resets register-usage-level when -g is active | 0.95 |
sub_7DDB50 | 232B | Opt-level accessor; returns ctx+2104 opt-level if knob 499 is enabled, otherwise returns 1 (O1 fallback). Called by 20+ passes as the runtime opt-level gate | 0.95 |
sub_1C96470 | -- | Generic CLI argument reader; called by sub_434320 to read --opt-level into options block offset +148 | 0.85 |
sub_67EB60 | -- | Fast-path knob query vtable function; identified inside sub_7DDB50 for knob 499 check | 0.80 |
sub_6614A0 | -- | Knob state direct-read function; used by sub_7DDB50 to read knob 499 via direct field access at offset 35928 | 0.80 |
sub_78B430 | -- | OriLoopSimplification execute function; checks opt_level for aggressive loop peeling (O4+) | 0.85 |
sub_913A30 | -- | SinkRemat core (phase 28); two-tier opt-level guard: skips at O0--O1, limited at O2--O4, full cutlass mode at O5 | 0.90 |
sub_8D0640 | -- | Scheduling infrastructure; selects forward (O1--O2) vs reverse (O3+) scheduling direction | 0.85 |
sub_8CBAD0 | -- | PreScheduleSetup; called with opt_level > 2 boolean to configure scheduling direction | 0.85 |
sub_957160 | -- | Fat-point greedy register allocator; does not directly branch on opt-level but is affected indirectly | 0.90 |
sub_A36360 | 52KB | Master scoreboard control word generator (phase 115, O1+ path) | 0.90 |
sub_A23CF0 | 54KB | DAG list scheduler heuristic for barrier assignment (phase 115, O1+ path) | 0.90 |
sub_C173E0 | -- | NvOpt level validator; emits "Invalid nvopt level : %d." for out-of-range values | 0.85 |
DUMPIR & NamedPhases
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The DUMPIR knob and NamedPhases option are the two primary mechanisms for inspecting ptxas's internal IR at arbitrary points in the 159-phase optimization pipeline. DUMPIR is an OCG string knob that triggers an IR dump after a named phase completes. NamedPhases is a separate OCG string knob (index 298) that restricts the pipeline to execute only the specified phases, effectively allowing selective phase execution and reordering. Both knobs accept phase names resolved through a case-insensitive binary search over a sorted table of 144 phase names (sub_C641D0, 305 bytes).
| DUMPIR knob | OCG string knob (ROT13: QhzcVE), registered in ctor_005 at 0x412B80 |
| NamedPhases knob | OCG knob index 298, runtime offset 21456 in knob value array |
| Phase name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Table sort | sub_C63FA0 (on-demand iterative quicksort via sub_C639A0) |
| Name table | 144 entries at off_22BD0C0 + 5 arch-specific additions |
| NamedPhases parser | sub_798B60 (1,776 bytes) |
| Phase fragment parser | sub_798280 (900 bytes) |
| Report passes | Phases 9, 96, 102, 126, 129, 130 |
| Sentinel return | 158 (NOP phase, returned on lookup failure) |
DUMPIR Knob
The DUMPIR knob is a string-valued OCG knob that takes one or more phase names. When set, the compiler dumps the Ori IR state after the named phase executes. This is the primary IR inspection mechanism for NVIDIA developers debugging the optimization pipeline.
Usage
ptxas -knob DUMPIR=AllocateRegisters input.ptx -o output.cubin
The knob value is a phase name string. The name is resolved through the phase name lookup function (sub_C641D0) using case-insensitive comparison, so allocateregisters, ALLOCATEREGISTERS, and AllocateRegisters all match.
The DUMPIR knob exists in two instantiations:
- OCG instance (ROT13: `QhzcVE` at `0x21BDBAD`): registered in `ctor_005` at `0x412B80`. This is the primary instance for the optimization pipeline.
- DAG instance (ROT13: `QhzcVE` at `0x21DCC95`): registered in `ctor_007` at `0x421920`. This controls IR dumps in the Mercury SASS/DAG pipeline.
Diagnostic Reference
The DUMPIR knob is referenced in register allocation error diagnostics. When a register allocation verification failure occurs, sub_A55D80 and sub_A76030 emit:
Please use -knob DUMPIR=AllocateRegisters for debugging
This tells the developer to re-run with the DUMPIR knob set to AllocateRegisters to inspect the IR state entering register allocation, which helps diagnose mismatches between pre- and post-allocation reaching definitions.
Related Dump Knobs
DUMPIR is part of a family of 17 dump-related OCG knobs across two constructor registrations. The OCG pipeline registers 11 dump knobs in ctor_005 (0x412A40--0x412D60); the Mercury/DAG pipeline registers 6 in ctor_007 (0x421880--0x421A10). All knob names and their definition-table offsets are ROT13-encoded in the binary (e.g. 0k14q0 decodes to 0x14D0).
OCG Pipeline Dump Knobs (ctor_005)
| Knob Name | ROT13 | Reg Address | Def Offset | Purpose |
|---|---|---|---|---|
DumpCallGraph | QhzcPnyyTencu | 0x412A40 | 0x1490 | Dump the inter-procedural call graph |
DumpCFG | QhzcPST | 0x412A90 | 0x14A0 | Dump the control flow graph |
DumpFlow | QhzcSybj | 0x412AE0 | 0x14B0 | Dump data flow information (reaching defs, live sets) |
DumpInstPhase | QhzcVafgCunfr | 0x412B30 | 0x14C0 | Dump per-instruction phase annotations |
DumpIR | QhzcVE | 0x412B80 | 0x14D0 | Dump the Ori IR after a named phase |
DumpIRInfoAsInteger | QhzcVEVasbNfVagrtre | 0x412BD0 | 0x14E0 | Dump IR with integer-format operand info |
DumpKnobs | QhzcXabof | 0x412C20 | 0x14F0 | Dump all knob values to stderr |
DumpPerfMetricsForBlock | QhzcCresZrgevpfSbeOybpx | 0x412C70 | 0x1500 | Dump per-basic-block performance metrics |
DumpPerfStats | QhzcCresFgngf | 0x412CC0 | 0x1510 | Dump performance statistics |
DumpSASS | QhzcFNFF | 0x412D10 | 0x1520 | Dump generated SASS assembly |
DumpSBInstInfo | QhzcFOVafgVasb | 0x412D60 | 0x1530 | Dump scoreboard per-instruction info |
The "Def Offset" column is the byte offset into the 16-byte-stride knob definition table. Dividing by 16 gives the definition-table index: DumpCallGraph is index 329, DumpSBInstInfo is index 339. These indices are distinct from the 72-byte runtime knob slot indices used by GetKnobIntValue.
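The offset-to-index arithmetic is trivial but worth pinning down; `def_table_index` is our name for it, not a recovered symbol:

```c
/* Definition-table index from a ROT13-decoded def offset, per the
   16-byte stride described above. */
static int def_table_index(int def_offset)
{
    return def_offset / 16;
}
```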
Adjacent knobs in ctor_005 (for boundary context):
- `0x4129F0`: `DoYieldInsertionWAR_SW2491854` (offset `0x1480`) -- immediately before DumpCallGraph
- `0x412DB0`: `EmitLDCU` (offset `0x1540`) -- immediately after DumpSBInstInfo
Mercury/DAG Pipeline Dump Knobs (ctor_007)
| Knob Name | ROT13 | Reg Address | Purpose |
|---|---|---|---|
DumpAnnot | QhzcNaabg | 0x421880 | Dump instruction annotations |
DumpCFG | QhzcPST | 0x4218D0 | Dump DAG pipeline CFG |
DumpIR | QhzcVE | 0x421920 | Dump DAG pipeline IR |
DumpMercOpCounts | QhzcZrepBcPbhagf | 0x421970 | Dump Mercury opcode distribution |
DumpReconstitutedBinary | QhzcErpbafgvghgrqOvanel | 0x4219C0 | Dump reconstituted binary output |
DumpRPO | QhzcECB | 0x421A10 | Dump reverse post-order traversal |
Two knob names appear in both pipelines, DumpCFG and DumpIR, but the instances are distinct: their string addresses differ (0x21BDBF0 vs 0x21DCCA0 for DumpCFG), and setting one does not affect the other.
NamedPhases Knob
The NamedPhases knob (OCG index 298) provides a mechanism to restrict the optimization pipeline to execute only specific phases. Unlike DUMPIR which passively observes, NamedPhases actively controls which phases run.
Knob Location
NamedPhases is at OCG knob index 298. The runtime byte offset is 298 * 72 = 21456 from the knob state base. This is confirmed by the decompiled code in sub_798B60:
// sub_798B60 (NamedPhases parser)
v11 = *(ctx + 72); // knob state base pointer
v12 = *(byte*)(v11 + 21456); // type tag at knob index 298
if (!v12) return 0; // knob not set => no filtering
if (v12 == 5) // type 5 = string
v14 = *(ptr*)(v11 + 21464); // string value at +8 from type tag
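A minimal C rendering of this access pattern, assuming a 72-byte slot whose first byte is the type tag and whose string payload sits at +8. The struct and all names here are ours for illustration; only the offsets and the type-tag values are recovered:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative 72-byte runtime knob slot as accessed in sub_798B60. */
typedef struct {
    uint8_t     type_tag;   /* 0 = unset, 5 = string */
    uint8_t     pad[7];
    const char *str_value;  /* valid only when type_tag == 5 */
    uint8_t     rest[56];   /* remainder of the 72-byte slot */
} KnobSlot;

/* NamedPhases is index 298, so its slot starts at byte 298 * 72 = 21456. */
static const char *read_string_knob(const KnobSlot *slots, int index)
{
    return slots[index].type_tag == 5 ? slots[index].str_value : NULL;
}
```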
Parser -- sub_798B60
The NamedPhases parser (sub_798B60, 1,776 bytes) reads the knob value string and parses it into parallel arrays of up to 256 entries. It is called from two sites:
- OCG pipeline (`sub_798B60` direct): parses the NamedPhases string from OCG knob index 298, referenced at address `0x798E90` where the string "NamedPhases" (`0x21B64C8`) appears in an error/diagnostic message.
- Mercury pipeline (`sub_9F4040`): the Mercury encoder's phase reordering mechanism also references the "NamedPhases" string at `0x9F42B0`, using the same knob to control Mercury-side phase execution.
The parser operates as follows:
- Reads knob value at offset 21456 from the knob state
- If the knob is unset (type byte == 0), returns immediately (no filtering)
- If the knob is a string (type byte == 5), extracts the string pointer
- Copies the string into a pool-allocated buffer
- Tokenizes using `strtok_r` with comma (`,`) as delimiter
- For each token, calls `sub_798280` (ParsePhaseNameFragment) to split the phase name from optional parameters
- Stores results in parallel arrays: `names[]`, `values[]`, `full_strings[]` (max 256 entries)
Phase Name Fragment Parser -- sub_798280
Each comma-separated token in the NamedPhases string is parsed by sub_798280 into two components:
- Phase name: characters up to the first `,` separator, uppercased during parsing
- Parameter suffix: characters after the `,`, up to the next `+` delimiter or end-of-string
The + character acts as an entry separator (analogous to how the DisablePhases string uses + to delimit multiple phase names). This allows:
-knob NamedPhases=PhaseA,param1+PhaseB,param2+PhaseC
Mercury NamedPhases -- sub_9F4040
The Mercury encoder pipeline (sub_9F4040, 1,850 lines decompiled) uses the NamedPhases knob to support phase reordering within the Mercury backend. In addition to standard pipeline phase names, it recognizes Mercury-specific pseudo-phases:
| Name | Decompiled Line | Match Method | Purpose |
|---|---|---|---|
shuffle | 843 | strlen + byte compare (8 chars) | Mercury instruction shuffle pass |
swap1 | 950 | strlen + byte compare (6 chars) | Mercury register swap level 1 |
swap2 | 1007 | strlen + byte compare (6 chars) | Mercury register swap level 2 |
swap3 | 1061 | strlen + byte compare (6 chars) | Mercury register swap level 3 |
swap4 | 1119 | strlen + byte compare (6 chars) | Mercury register swap level 4 |
swap5 | 1162 | strlen + byte compare (6 chars) | Mercury register swap level 5 |
swap6 | 1202 | strcmp() | Mercury register swap level 6 |
OriPerformLiveDead | 1556 | sub_C641D0() lookup | Liveness analysis within Mercury context |
OriCopyProp | 1648 | sub_C641D0() lookup | Copy propagation within Mercury context |
shuffle and swap1--swap6 are pure Mercury pseudo-phases: they do not exist in the main 144-entry phase name table at off_22BD0C0. Their name matching is done inline with strlen-guarded character comparison (not strcmp -- except swap6 which uses a full strcmp call, likely because it is the last in a fallthrough chain).
OriPerformLiveDead and OriCopyProp resolve through sub_C641D0 (the standard binary search), meaning they ARE in the main phase table. They are special in that Mercury conditionally inserts them into its own phase sequence rather than inheriting them from the standard pipeline ordering. The insertion is guarded by state flags (v234, v252, v240 for OriPerformLiveDead; v222, v236, v257 for OriCopyProp), suggesting they are injected only when the Mercury encoder detects certain register-pressure or correctness conditions.
Phase Name Lookup -- sub_C641D0
The binary search function sub_C641D0 (305 bytes) resolves a phase name string to a phase index. It is the core name resolution used by both DUMPIR and NamedPhases.
Algorithm
int PhaseManager::lookup_phase(const char* query) {
ensure_sorted(); // sub_C63FA0
// Binary search over sorted {name_ptr, index} pairs
// Each entry is 16 bytes: [8-byte name pointer, 4-byte phase index, 4-byte padding]
int lo = 0, hi = sorted_count;
while (hi > 0) {
int mid = hi / 2;
// Case-insensitive string comparison via tolower()
int cmp = strcasecmp(table[lo + mid].name, query);
if (cmp < 0) {
hi -= mid + 1;
lo += mid + 1;
} else if (cmp == 0) {
return table[lo + mid].index; // found
} else {
hi = mid;
}
}
// Verify final position (handles edge case)
if (lo < sorted_count && strcasecmp(table[lo].name, query) == 0)
return table[lo].index;
return 158; // sentinel: NOP phase
}
The comparison uses tolower() on each character individually, making the search fully case-insensitive. On lookup failure, the function returns 158 (the sentinel NOP phase), not an error code. This means misspelled phase names silently resolve to a no-op rather than producing an error.
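The same semantics in a compact runnable form: a case-insensitive binary search returning the 158 sentinel on failure. The lo/hi formulation here is equivalent to the decompiled count-based loop above, and any table contents a caller supplies are illustrative, not the real 145-entry table:

```c
#include <strings.h>

/* Minimal rendering of the lookup semantics: sorted {name, index}
   table, strcasecmp comparison, NOP sentinel 158 on failure. */
typedef struct {
    const char *name;
    int index;
} PhaseEntry;

static int lookup_phase(const PhaseEntry *table, int count, const char *query)
{
    int lo = 0, hi = count - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcasecmp(table[mid].name, query);

        if (cmp < 0)
            lo = mid + 1;
        else if (cmp > 0)
            hi = mid - 1;
        else
            return table[mid].index;
    }
    return 158; /* NOP sentinel: misspelled names silently no-op */
}
```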
Sorted Table Construction -- sub_C63FA0
The sorted name table is lazily constructed. sub_C63FA0 checks whether the current sorted count matches the expected count (stored at PhaseManager+104). If they differ, it:
- Grows the sorted table array if needed (1.5x growth policy)
- Copies name pointers from the raw phase name table (`off_22BD0C0`)
- Each entry is 16 bytes: `{char* name, int phase_index}`, where `phase_index` is the array position
- Sorts using iterative quicksort (`sub_C639A0`) with median-of-three pivot selection
The sort is performed once and cached. Subsequent lookups reuse the sorted table without re-sorting.
Report Passes
Six phases in the pipeline are dedicated diagnostic/dump passes. They are no-ops by default and activate only when specific debug options are enabled:
| Phase | Name | Trigger | Output |
|---|---|---|---|
| 9 | ReportInitialRepresentation | DUMPIR knob, --keep | Ori IR after initial lowering (pre-optimization) |
| 96 | ReportBeforeScheduling | DUMPIR knob, --keep | Ori IR entering scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | DUMPIR knob, --keep | Ori IR after register allocation |
| 126 | ReportFinalMemoryUsage | --stat=phase-wise | Memory pool consumption summary |
| 129 | DumpNVuCodeText | --keep, DUMPIR | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | --keep, DUMPIR | Raw SASS hex dump |
Additionally, ReportBeforeRegisterAllocation (at 0x22BD068) is a phase name in the table but is handled as an arch-specific phase (index >= 139), providing an IR dump point immediately before register allocation in backends that override it.
Report Pass Activation
Report passes check their activation condition in the isNoOp() virtual method. When the DUMPIR knob is set to a phase name, the report pass compares the current phase name against the DUMPIR value. If they match, isNoOp() returns false and the pass executes its dump logic.
The dispatch loop in sub_C64F70 constructs diagnostic context strings around each phase execution:
// Before execution (line 117 of sub_C64F70):
*(_QWORD *)buffer = 0x2065726F666542LL; // "Before " as 8-byte LE literal
memcpy(buffer + 7, phase_name, len + 1); // append phase name after "Before "
// After execution (line 196 of sub_C64F70):
strcpy(buffer, "After "); // 6-byte prefix
memcpy(buffer + 6, phase_name, len + 1); // append phase name after "After "
The literal 0x2065726F666542 decomposes as bytes 42 65 66 6F 72 65 20 = ASCII "Before " (7 bytes including trailing space, plus a null in the 8th byte that gets overwritten by memcpy). The "After " path uses strcpy instead of a literal store because it is only 6 bytes and the code path is post-execution (not latency-critical).
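A quick sanity check of that decomposition (on a little-endian host, matching ptxas's x86-64 target); the helper name is ours:

```c
#include <stdint.h>
#include <string.h>

/* Storing the recovered 64-bit literal little-endian writes the bytes
   42 65 66 6F 72 65 20 00, i.e. "Before " plus a trailing NUL. */
static void write_before_prefix(char buf[8])
{
    uint64_t lit = 0x2065726F666542ULL;
    memcpy(buf, &lit, 8);
}
```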
These strings appear in diagnostic output when --stat=phase-wise is enabled:
Before GeneralOptimize :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
After GeneralOptimize :: [Total 1456 KB] [Freeable 789 KB] [Freeable Leaked 23 KB] (3%)
The string addresses in the binary are:
- `"Before "` at `0x22BC3D3`
- `"After "` at `0x22BC3DB`
- `" :: "` at `0x22BC3E2` (separator between phase name and stats)
- `"[Total "` at `0x22BC3E9`
- `"[Freeable "` at `0x22BC3F6`
- `"[Freeable Leaked "` at `0x22BC401`
- `"All Phases Summary"` at `0x22BC416` (final summary label)
Phase-Wise Statistics -- --stat=phase-wise
The --stat CLI option (processed in sub_432A00 at 0x432E5A) accepts a comma-separated list of report modes:
ptxas --stat=phase-wise input.ptx -o output.cubin
| Mode | Short | Description |
|---|---|---|
time | t | Print compilation time |
memory | m | Print peak memory usage |
phase-wise | p | Print per-phase time and memory delta |
detailed | d | Print all of the above |
When phase-wise is enabled (string comparison at 0x4460F8 in sub_445EB0), the dispatch loop's timing flag (PhaseManager+72) is set, and sub_C64310 runs after every phase to print memory deltas.
IR Output Format
The DUMPIR dump emits a per-function statistics header (using # comment prefix) followed by the Ori IR listing. The statistics header is emitted by sub_A3A7E0 and contains hardware performance estimates computed from the current IR state.
Per-Function Statistics Header
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [instHint=142] [instPairs=4]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [Precise inst=0]
The format strings are at two locations in rodata:
| Address Range | Context | Notes |
|---|---|---|
0x21EBF76--0x21EC3B0 | Pre-register-allocation stats | Commas between some [SSpillB=%d], [SRefillB=%d] fields |
0x21FA008--0x21FA0A0 | Post-register-allocation stats | No commas: [SSpillB=%d] [SRefillB=%d] |
The typo "trancedental" (for "transcendental") in the estimate and throughput lines is present in the binary itself and matches NVIDIA's original source.
Statistics Field Glossary
| Field | Meaning |
|---|---|
inst | Total instruction count |
texInst | Texture/surface instruction count |
tepid | Texture instruction count (alternate metric) |
rregs | R-register (GPR) count |
LSpillB / LRefillB | Local-memory spill/refill byte counts |
SSpillB / SRefillB | Shared-memory spill/refill byte counts |
LowLmemSpillSize | Low local-memory spill total |
FrameLmemSpillSize | Frame-level local-memory spill total |
LNonSpillB / LNonRefillB | Local-memory non-spill traffic bytes |
Occupancy | Estimated warp occupancy (0.0--1.0) |
numDivergentBranches | Estimated divergent branch count |
attributeMemUsage | Attribute memory usage (shader inputs) |
programSize | Total program size in bytes |
issue thru | Issue throughput (instructions per cycle) |
fp thru / half thru | FP32 / FP16 throughput |
trancedental thru | Transcendental (SFU) throughput |
ipa thru | Interpolation throughput |
shared thru | Shared memory throughput |
texLoadStore thru | Texture + load/store throughput |
reg thru | Register throughput |
warp thru | Warp-level throughput |
CB-Bound Tex | Constant-bank-bound texture references |
UR-Bound Tex | Uniform-register-bound texture references |
Bindless Tex | Bindless texture references |
UDP inst | Uniform datapath instruction count |
numVecToURConverts | Vector-to-uniform-register conversion count |
maxNumLiveValuesAtSuspend | Peak live values at suspension point |
instHint / instPairs | Instruction hint count / instruction pair count |
worstcaseLat / avgcaseLat | Worst-case / average-case latency estimates |
SharedMem Alloc thru | Shared memory allocation throughput |
Precise inst | Precise (non-relaxed) instruction count |
Mercury Pipeline Dump Points
The Mercury/DAG pipeline emits its own "After" labels at fixed points in the encode-decode-expand flow. These labels are used by both the DumpIR (DAG) knob and the DumpAnnot knob:
| Label | String Address | Pipeline Stage |
|---|---|---|
After Decode | 0x202D5CE | After initial SASS decode |
After Expansion | 0x202D5DB | After instruction expansion |
After WAR post-expansion | 0x202D604 | After WAR insertion (post-expansion) |
After Opex | 0x202D60F | After operand expansion |
After WAR post-opexing | 0x202DCD0 | After WAR insertion (post-opex) |
After MercWARs | 0x202DCDF | After Mercury WAR pass |
After MercOpex | 0x21E5C33 | After Mercury operand expansion |
After MercConverter | 0x22B7B38 | After Mercury format conversion |
After MercExpand | 0x22BC3DB | After Mercury instruction expansion |
After EncodeAndDecode | 0x23D1A60 | After encode-decode round-trip |
Memory Statistics Format
The sub_C64310 (ReportPhaseStats) function formats memory sizes using three thresholds:
| Size Range | Format | Example |
|---|---|---|
| < 1024 bytes | %d (raw integer) | 512 |
| < 10 MB | %.3lf KB (kilobytes, 3 decimals) | 1234.567 KB |
| >= 10 MB | %.3lf MB (megabytes, 3 decimals) | 12.345 MB |
The memory format reuses the suffix from "PeakMemoryUsage = %.3lf KB" (at 0x1CE7BB6) by referencing the string at offset +24 to extract just " KB". The pool-consumption variant uses "[Pool Consumption = " at 0x22BC3B3.
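The three-threshold formatting can be sketched as follows, assuming "10 MB" means 10 × 1024 × 1024 bytes (the exact cutoff constant was not separately recovered); the function name is ours:

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Sketch of sub_C64310's size formatting: raw bytes under 1 KB,
   "%.3lf KB" under the assumed 10 MB cutoff, "%.3lf MB" above it. */
static void format_mem_size(char *out, size_t cap, unsigned long bytes)
{
    if (bytes < 1024)
        snprintf(out, cap, "%lu", bytes);
    else if (bytes < 10UL * 1024 * 1024)
        snprintf(out, cap, "%.3lf KB", bytes / 1024.0);
    else
        snprintf(out, cap, "%.3lf MB", bytes / (1024.0 * 1024.0));
}
```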
Phase Name Table
The static phase name table at off_22BD0C0 contains 145 entries: 1 sentinel ("All Phases Summary") plus 144 phase names. After sorting by sub_C63FA0, the binary search in sub_C641D0 provides O(log n) lookup -- approximately 8 comparisons for 145 entries.
The 144 non-sentinel entries include:
- 139 base pipeline phases (indices 0--138) with fixed names
- 5 arch-specific phase aliases that map to indices >= 139:
`LateEnforceArgumentRestrictions`, `UpdateAfterScheduleInstructions`, `UpdateAfterOriDoSyncronization`, `ReportBeforeRegisterAllocation`, `UpdateAfterOriAllocateRegisters`
The AllocateRegisters string (0x21F0229) also appears as a phase name referenced by the register allocation subsystem (sub_A55D80, sub_A76030) and is present in the name table at 0x22BD490.
Interaction with --keep
The --keep flag triggers output file retention and activates certain report passes. When --keep is set:
- Phase 129 (`DumpNVuCodeText`) writes a human-readable SASS disassembly to a `.sass` file
- Phase 130 (`DumpNVuCodeHex`) writes raw SASS binary as hex
- Report phases 9, 96, and 102 may produce `.ori` intermediate representation dumps
The --keep flag is processed in the CLI option handler (sub_43CC70 at 0x43D850) which generates the .sass file extension.
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
sub_798280 | 900 | ParsePhaseNameFragment -- splits NAME,PARAM from NamedPhases token | MEDIUM |
sub_798B60 | 1,776 | NamedPhases::ParsePhaseList -- tokenizes NamedPhases knob string | CERTAIN |
sub_9F4040 | ~7,400 | MercuryNamedPhases -- Mercury pipeline phase selection/reordering | HIGH |
sub_A3A7E0 | ~2,000 | CodeObject::EmitStats -- per-function statistics header printer | HIGH |
sub_C639A0 | ~800 | QuicksortNameTable -- iterative quicksort for phase name table | MEDIUM |
sub_C63FA0 | ~600 | EnsureSortedNameTable -- lazy sorted table construction | MEDIUM |
sub_C641D0 | 305 | PhaseManager::LookupPhase -- case-insensitive binary search | CERTAIN |
sub_C64310 | 3,168 | PhaseManager::ReportPhaseStats -- per-phase timing/memory reporter | HIGH |
sub_C64F70 | 1,455 | PhaseManager::Dispatch -- main phase execution loop | CERTAIN |
sub_A55D80 | ~2,000 | RegAlloc::VerifyReachingDefs -- references DUMPIR in error message | HIGH |
sub_A76030 | ~1,000 | RegAlloc::VerifyMismatch -- references DUMPIR in error message | HIGH |
Reimplementation Notes
- DUMPIR is a string knob, not a boolean. The value is a phase name that triggers a dump after that specific phase. To dump at multiple points, run separate compilations with different DUMPIR values. There is no comma-separated multi-phase dump syntax for DUMPIR itself.
- NamedPhases uses comma+plus syntax. Commas separate name-from-parameter within a single entry; `+` separates multiple entries. The phase name portion is uppercased during parsing. Parameters are preserved as-is.
- Lookup failure is silent. An unrecognized phase name in DUMPIR or NamedPhases resolves to phase index 158 (NOP sentinel), not an error. The compiler does not warn about misspelled phase names.
- The sorted table is 16 bytes per entry: `{char* name, int32 index, int32 padding}`. The sort is stable only within the quicksort's three-way partitioning -- duplicate names (which do not occur in practice) would have undefined ordering.
- Two DumpIR knob instances exist (OCG and DAG). They are independent -- setting one does not affect the other. The OCG instance controls the 159-phase optimization pipeline; the DAG instance controls the Mercury SASS pipeline. `DumpCFG` and `DumpIR` each have separate OCG and DAG instances with distinct ROT13 string addresses; `DumpAnnot` and `DumpRPO` exist only on the DAG side.
- Memory statistics format uses three thresholds: bytes (< 1 KB), kilobytes with 3 decimals (< 10 MB), megabytes with 3 decimals (>= 10 MB). The reporter is `sub_C64310`.
- NamedPhases in Mercury (`sub_9F4040`) supports 7 pure pseudo-phases (`shuffle`, `swap1`--`swap6`) that do not exist in the main phase table. These use inline strlen-guarded byte comparison, not `strcmp` (except `swap6`). Two additional names (`OriPerformLiveDead`, `OriCopyProp`) ARE in the main table but are conditionally injected into Mercury's phase sequence based on register-pressure/correctness state flags.
- The "Before" string is a raw 8-byte LE literal store, not a `strcpy`. The dispatch loop writes `0x2065726F666542` directly to the buffer, which is `"Before "` in ASCII. This is a micro-optimization for the hot path (pre-phase execution). The "After" path uses `strcpy` since it is post-execution.
- Statistics header has two variants. The pre-register-allocation format strings (at `0x21EC050`) use commas between some spill fields: `[SSpillB=%d], [SRefillB=%d]`. The post-register-allocation variant (at `0x21FA008`) drops those commas: `[SSpillB=%d] [SRefillB=%d]`. A reimplementation should match whichever variant is appropriate for the dump point.
- The "trancedental" typo is canonical. Both the format string and the stats output use "trancedental" (for "transcendental"). A reimplementation should preserve this spelling for compatibility with tools that parse the output.
Cross-References
- Knobs System -- DUMPIR and NamedPhases are OCG knobs; ROT13 encoding, type system, access patterns
- CLI Options -- `--stat=phase-wise`, `--keep` flags that activate report passes
- Phase Manager -- dispatch loop, phase factory, name table infrastructure
- Pass Inventory -- complete 159-phase table with report pass positions
- Register Allocator -- DUMPIR=AllocateRegisters diagnostic reference
- Mercury Encoder -- Mercury-side NamedPhases and DAG DumpIR knob
Memory Pool Allocator
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas replaces malloc/free with a custom hierarchical pool allocator for the vast majority of allocations. The allocator (sub_424070, 3,809 callers) is the single most-used allocation function in the binary. Every IR node, hash map, linked list, phase object, and temporary buffer flows through pools. The design serves two goals: fast allocation via size-class free lists, and per-compilation-unit lifetime management via hierarchical pool ownership.
| Allocator | sub_424070 (2,098 bytes, 3,809 callers) |
| Deallocator | sub_4248B0 (923 bytes, 1,215 callers) |
| Reallocator | sub_424C50 (488 bytes, 27 callers) |
| OOM handler | sub_42BDB0 (14 bytes, 3,825 callers) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers) |
| Stats header | sub_423A10 (323 bytes) -- prints "Memory space statistics for ..." banner |
| Stats detail | sub_425020 (~1,500 bytes) -- full per-pool metrics, recursive into children |
| Stats entry | sub_425AB0 (80 bytes) -- mutex-wrapped entry point for stats dump |
| OCG stats | sub_6936B0 (120 bytes) -- OCG mem space fixed-format stats to stderr |
| Pool teardown | sub_4234D0 (258 bytes) |
| Pool accounting | sub_423600 (922 bytes) |
| Slab registration | sub_423E50 (544 bytes) |
| Size-class index | sub_42BE50 (floor-log2, 64 bytes) |
| Slab growth | sub_423B60 / sub_423C70 |
| Global fallback | sub_427A10 (raw malloc wrapper) |
| System free | sub_427B30 (raw free wrapper) |
| Pool reporter | sub_C62200 (888 bytes) |
| Consumption query | sub_8DAE60 (32 bytes) |
| Snapshot | sub_8DADE0 (48 bytes) |
Pool Object Layout
The pool object is at least 7,136 bytes. It contains pool metadata at low offsets, large-block free lists indexed by power-of-2 order in the middle range, small-block free lists indexed by size class starting at offset +2128, and a mutex pointer at the end.
Pool Object (~7136 bytes)
+0 ptr large_block_list singly-linked list of large-block slab descriptors
+32 u32 min_slab_size minimum slab allocation (default from pool creator)
+44 u32 slab_count number of slabs allocated for this pool
+48 ptr large_free_list free list head for large blocks
+56 u32 fragmentation_count decremented on block split
+60 u32 max_order highest power-of-2 order currently tracked
+64.. ptr[] order_free_lists per-order free list: *(pool + 32*(order+2)) = head
+2112 ptr tracking_map hash map for allocation metadata (when enabled)
+2128.. ptr[] small_free_lists 625 bins: *(pool + 8*(size>>3) + 2128) = head
+7128 mutex* pool_mutex pthread_mutex_t* for thread safety
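The layout above can be rendered as an illustrative C struct. All field names are ours, and the 64-order count is an inference from the 32-byte per-order stride between +64 and the tracking map at +2112 ((2112 - 64) / 32 = 64); only the offsets themselves are recovered:

```c
#include <stdint.h>
#include <stddef.h>
#include <pthread.h>

/* One per-order slot: a free-list head plus unrecovered metadata,
   giving the observed 32-byte stride (*(pool + 32*(order+2))). */
typedef struct PoolOrderSlot {
    void   *head;
    uint8_t meta[24];
} PoolOrderSlot;

typedef struct Pool {
    void         *large_block_list;      /* +0    slab descriptor list */
    uint8_t       pad0[24];
    uint32_t      min_slab_size;         /* +32  */
    uint8_t       pad1[8];
    uint32_t      slab_count;            /* +44  */
    void         *large_free_list;       /* +48  */
    uint32_t      fragmentation_count;   /* +56  */
    uint32_t      max_order;             /* +60  */
    PoolOrderSlot order_lists[64];       /* +64   32-byte stride/order */
    void         *tracking_map;          /* +2112 */
    uint8_t       pad2[8];
    void         *small_free_lists[625]; /* +2128 625 size-class bins */
    pthread_mutex_t *pool_mutex;         /* +7128 */
} Pool;                                  /* sizeof(Pool) == 7136 */
```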
Size-Class Bins (Small Path)
Small allocations (up to 4,999 bytes) are served from 625 free-list bins. Each bin holds blocks of exactly one size class. The bin index is computed from the 8-byte-aligned allocation size:
aligned_size = max(16, (requested + 7) & ~7)
bin_index = aligned_size >> 3
bin_head = *(pool + 8 * bin_index + 2128)
This gives bins for sizes 16, 24, 32, 40, ... up to 4,992 bytes (the largest multiple of 8 that is <= 4,999). The minimum allocation is 16 bytes because each free block stores a next-pointer (8 bytes) and a slab-descriptor back-pointer (8 bytes).
Order Free Lists (Large Path)
Large allocations (above 4,999 bytes) use power-of-2 order free lists. The order is computed by sub_42BE50, which returns floor(log2(size)) by clearing all bits except the highest set bit, then using _BitScanForward64. The free list for order k is at pool offset 32*(k+2), i.e., within the order_free_lists array that starts at +64 (see the pool layout above). The pool tracks max_order at +60 to avoid scanning empty higher-order lists.
Allocation Algorithm -- sub_424070
The allocator takes two arguments: a pool pointer (a1) and a size (a2). When a1 is NULL, it falls through to the global allocator (sub_427A10) which wraps malloc. Otherwise, it acquires the pool mutex and dispatches to one of two paths based on the aligned size.
// Pseudocode for sub_424070
void* pool_alloc(Pool* pool, size_t size) {
if (!pool)
return global_alloc(size); // sub_427A10 -> malloc
pthread_mutex_lock(pool->mutex); // pool + 7128
size_t aligned = (size + 7) & ~7;
if (aligned <= 4999) {
// --- Small path ---
if (aligned < 16) aligned = 16;
size_t bin = aligned >> 3;
FreeNode** head = &pool->small_free_lists[bin];
if (!*head) {
// Bin empty: allocate a new slab from parent pool
if (!can_grow(pool->min_slab_size)) // sub_423B60
goto oom;
// 1. Allocate 56-byte slab descriptor from parent pool
Pool* parent = get_tls_context()->parent_pool;
SlabDesc* desc = pool_alloc(parent, 56);
// 2. Compute slab memory: aligned * ceil(min_slab_size / aligned)
size_t slab_bytes = aligned * ((aligned + pool->min_slab_size - 1) / aligned);
// 3. Allocate slab memory from parent
void* slab_mem = pool_alloc(parent, slab_bytes);
// 4. Initialize slab descriptor
desc->total_size = slab_bytes; // +8
desc->available_size = slab_bytes; // +16
desc->owning_pool = pool; // +24
desc->memory_base = slab_mem; // +32
desc->is_small_slab = 1; // +40
desc->slab_id = atomic_inc(&global_slab_counter);
desc->bin_size = aligned; // +48
// 5. Carve slab into free-list nodes
char* cursor = slab_mem + slab_bytes;
FreeNode* list = NULL;
while (cursor > slab_mem) {
cursor -= aligned;
((FreeNode*)cursor)->next = list;
((FreeNode*)cursor)->slab = desc;
list = (FreeNode*)cursor;
}
*head = list;
// 6. Register slab in tracking structures
register_slab(desc); // sub_423E50
pool->slab_count++;
}
// Pop from free list
FreeNode* block = *head;
*head = block->next;
block->slab->available_size -= aligned;
pthread_mutex_unlock(pool->mutex);
return block;
}
// --- Large path ---
size_t total = aligned + 32; // 32 bytes for boundary tag header
// Search order free lists starting from floor(log2(total))
int order = floor_log2(total); // sub_42BE50
while (order <= pool->max_order) {
BoundaryTag* block = pool->order_lists[order];
while (block) {
if (block->payload_size >= total) {
// Found a fit: unlink from free list
unlink_free_block(block);
block->sentinel = -1; // mark allocated
// Split remainder if >= 40 bytes
size_t remainder = block->payload_size - total;
if (remainder > 39) {
split_block(block, total, remainder);
pool->fragmentation_count--;
}
// Update slab accounting
slab_desc->available_size -= block->tag_offset;
pthread_mutex_unlock(pool->mutex);
return (char*)block + 32; // skip header
}
block = block->next_free;
}
order++;
}
// No fit found: allocate new large slab from parent
// (allocates 88-byte slab descriptor + slab memory + 64 bytes for
// header/footer boundary tags)
allocate_large_slab(pool, total);
// retry search...
pthread_mutex_unlock(pool->mutex);
return result;
}
Critical Constants
| Constant | Value | Meaning |
|---|---|---|
| 0x1387 | 4,999 | Small/large allocation threshold |
| 16 | Minimum allocation | Free node: 8-byte next + 8-byte slab pointer |
| 32 | Boundary tag header size | Sentinel + prev + tag_offset + payload_size |
| 39 (0x27) | Minimum split remainder | Must hold a full boundary tag + at least 8 bytes |
| 56 | Slab descriptor size (small) | 7 fields |
| 88 | Slab descriptor size (large) | Extended with boundary-tag metadata |
| 64 | Overhead for large slab | Header (32) + footer (32) boundary tags |
Deallocation Algorithm -- sub_4248B0
The deallocator takes a single pointer argument. It locates the owning pool through the slab descriptor back-pointer (stored either inline for small blocks, or recoverable from boundary tags for large blocks), then returns the memory to the appropriate free list.
// Pseudocode for sub_4248B0
void pool_free(void* ptr) {
  if (!ptr) { system_free(ptr); return; } // sub_427B30; free(NULL) is a no-op
// Locate slab descriptor via tracking map
SlabDesc* desc = find_slab(ptr);
if (!desc) { system_free(ptr); return; }
Pool* pool = desc->owning_pool;
pthread_mutex_lock(pool->mutex);
if (desc->is_small_slab) {
// Small block: push back onto size-class free list
size_t bin_size = desc->bin_size;
size_t bin = bin_size & ~7;
FreeNode** head = &pool->small_free_lists[bin >> 3];
((FreeNode*)ptr)->slab = desc;
((FreeNode*)ptr)->next = *head;
*head = (FreeNode*)ptr;
desc->available_size += bin_size;
} else {
// Large block: coalesce with adjacent free blocks
BoundaryTag* header = (BoundaryTag*)((char*)ptr - 32);
size_t block_size = header->payload_size;
// Validate sentinel (must be -1 = allocated)
assert(header->sentinel == -1);
desc->available_size += block_size;
// Check next block's sentinel
BoundaryTag* next = (BoundaryTag*)((char*)ptr - 32 + block_size);
if (next->sentinel != -1) {
// Next block is free: unlink and merge
unlink_free_block(next);
header->payload_size += next->payload_size;
// Update footer
}
// Check prev block via footer tag
BoundaryTag* prev_footer = (BoundaryTag*)((char*)header - header->prev_free);
if (prev_footer->sentinel != -1) {
// Prev block is free: merge into prev
prev_footer->payload_size += header->payload_size;
// Update footer
} else {
// Header becomes free: insert into order free list
header->sentinel = 0; // mark free
int order = floor_log2(header->payload_size);
insert_free_block(pool, order, header);
}
}
pthread_mutex_unlock(pool->mutex);
}
Small Block Free-List Node
Each free block in a small bin stores two pointers in the returned memory region itself (since the block is not in use):
Small Free Node (aligned_size bytes, minimum 16)
+0 ptr next next free node in this bin, or NULL
+8 ptr slab_desc back-pointer to owning slab descriptor
On allocation, the node is popped from the head. On deallocation, the node is pushed back to the head. This is a classic LIFO (stack) free list with O(1) alloc and free.
Boundary Tag Format (Large Blocks)
Large blocks use a classic Knuth-style boundary tag scheme. Every allocated or free block has a 32-byte header before the user payload and a 32-byte footer at the end. The sentinel field distinguishes allocated blocks (-1) from free blocks (pointer to next free block, or 0).
Large Block Layout
┌──────────────────────────────────────────────────────────────────┐
│ Header (32 bytes) │
│ +0 i64 sentinel -1 = allocated, else next_free ptr │
│ +8 ptr prev_free previous in order free list │
│ +16 u64 tag_offset always 32 (header size) │
│ +24 u64 payload_size user allocation size │
├──────────────────────────────────────────────────────────────────┤
│ User Payload (payload_size - 64 bytes) │
│ ... returned to caller ... │
├──────────────────────────────────────────────────────────────────┤
│ Footer (32 bytes, at end of block) │
│ +0 i64 sentinel mirrors header sentinel │
│ +8 ptr prev_free (unused in footer) │
│ +16 u64 footer_tag always 32 │
│ +24 u64 block_size total block size including headers │
└──────────────────────────────────────────────────────────────────┘
The footer allows the deallocator to coalesce with the preceding block by reading block_size from the footer of the previous block, then checking whether that block's header sentinel is -1 (allocated) or a free-list pointer. This enables bidirectional coalescing in O(1) without maintaining a separate block-address data structure.
Block Splitting
When a large free block is larger than needed, the allocator splits it if the remainder exceeds 39 bytes (enough for a header + footer + at least 8 bytes of payload). The split creates a new free block from the remainder and inserts it into the appropriate order free list. The pool's fragmentation_count is decremented on each split.
Slab Descriptor
Every slab (contiguous memory region backing allocations) is tracked by a descriptor. Small slabs use 56-byte descriptors; large slabs use 88-byte descriptors with additional boundary-tag metadata.
Small Slab Descriptor (56 bytes)
SlabDesc (56 bytes)
+0 ptr chain_link next descriptor in pool's slab chain
+8 u64 total_size total slab memory in bytes
+16 u64 available_size bytes currently free (decremented on alloc)
+24 ptr owning_pool back-pointer to the pool that owns this slab
+32 ptr memory_base base address of the contiguous slab memory
+40 u8 is_small_slab 1 = small-alloc slab, 0 = large-alloc slab
+44 u32 slab_id global atomic sequence number
+48 u32 bin_size size class this slab serves
Large Slab Descriptor (88 bytes)
Large slab descriptors extend the base 56 bytes with fields for boundary-tag free-list management. The memory base at +32 points to the raw allocation, which begins with a 32-byte header boundary tag. The descriptor at +48 points to the final footer boundary tag.
Hierarchical Pool Model
Pools form a tree. The root is a global fallback that wraps malloc/free. Below it are named pools created by the compilation driver. Each named pool allocates its slab memory from its parent pool.
┌─────────────────────────────────┐
│ Global Fallback (a1 = NULL) │
│ sub_427A10 -> malloc │
│ sub_427B30 -> free │
└─────────┬───────────────────────┘
│
┌─────────▼───────────────────────┐
│ "Top level ptxas memory pool" │
│ Created in sub_446240 (driver) │
│ Lifetime: entire compilation │
└─────┬───────────┬───────────────┘
│ │
┌─────▼─────┐ ┌─▼──────────────────────────┐
│ "Command │ │ Per-compilation-unit pool │
│ option │ │ (from compilation_ctx +16) │
│ parser" │ └──┬──────────┬───────────────┘
└───────────┘ │ │
┌───────▼──┐ ┌──▼───────────────────┐
│ "PTX │ │ "Permanent OCG │
│ parsing │ │ memory pool" │
│ state" │ │ per-kernel OCG state │
└──────────┘ └───┬───────────────────┘
│
┌───▼───────────────┐
│ "elfw memory │
│ space" (4096 init)│
│ ELF output buffer │
└───────────────────┘
Known Named Pools
| Name | Creator | Lifetime | Purpose |
|---|---|---|---|
"Top level ptxas memory pool" | sub_446240 | Entire process | Root of all sub-pools |
"Command option parser" | sub_446240 | Entire process | CLI option storage |
"Permanent OCG memory pool" | 0x1CE7B2B ref | Per-kernel | OCG IR and pass state |
"PTX parsing state" | sub_451730 | Per-parse | Lexer/parser temporaries |
"elfw memory space" | sub_1CB53A0 / sub_4258D0 | Per-ELF-output | ELF world (672-byte object, 4096 initial) |
Parent Pool Resolution
When the allocator needs a new slab, it calls sub_4280C0 to get the thread-local context, which holds a parent pool pointer at byte offset +192 (qword offset 24). This TLS context is a 280-byte (0x118) struct allocated via raw malloc on first access per thread, initialized with pthread_cond_t at +128, pthread_mutex_t at +176, and sem_t at +216.
// TLS context layout (280 bytes = 0x118)
struct TLSContext {
uint64_t error_flags; // +0
uint64_t has_error; // +8
// ... diagnostic fields ...
void* parent_pool; // +192 (qword index 24)
// ...
pthread_cond_t cond; // +128 (48 bytes)
pthread_mutex_t mutex; // +176 (40 bytes)
sem_t sem; // +216
// ... diagnostic suppression ... // +384-416
};
The parent pool pointer determines where slab memory is allocated from. For the top-level pool, the parent is the global allocator (NULL pool, i.e., malloc). For sub-pools, the parent is the enclosing pool.
Thread Safety
Every pool operation acquires a per-pool mutex at offset +7128. The mutex is lazily initialized: on first use, sub_4286A0 (a once-init guard) creates the mutex via sub_428320 (pthread_mutex_init). The initialization itself is serialized through a separate global once-init mechanism (sub_42BDD0 saves/restores some state around the initialization).
There is also a global mutex at qword_29FDC08 that protects the outstanding slab growth counter (dword_29FDBF4) and the global emergency-reclaim state (qword_29FDC00). The allocator acquires this global mutex briefly after creating new slabs to decrement the outstanding-growth counter.
Locking Sequence
1. Lock pool->mutex (per-pool, offset +7128)
2. Perform allocation or deallocation
3. If new slab was created:
a. Lock global_mutex (qword_29FDC08)
b. Decrement dword_29FDBF4 (outstanding growth count)
c. Unlock global_mutex
4. Unlock pool->mutex
The locking is strictly ordered (pool mutex first, then global mutex if needed), preventing deadlock between pool operations. There is no lock-free fast path -- every allocation takes the pool mutex.
OOM Handling
The OOM handler sub_42BDB0 is a 14-byte stub that forwards to sub_42F590 (the central diagnostic/fatal-error emitter) with the error descriptor at unk_29FA530. This triggers a longjmp to abort the current compilation.
// sub_42BDB0 -- 14 bytes, 3825 callers
void alloc_fail_abort(void* pool, size_t size, ...) {
fatal_error(&internal_error_descriptor, size, ...);
// does not return -- longjmp
}
Every allocation site in ptxas follows the same pattern:
void* p = pool_alloc(pool, size);
if (!p) alloc_fail_abort(pool, size); // sub_42BDB0
The 3,825 call sites for sub_42BDB0 closely track the 3,809 callers of sub_424070 (the difference comes from realloc and a few indirect call sites). This is an unconditional abort -- there is no graceful degradation or fallback allocation strategy.
Emergency Reclaim
Before aborting, the allocator at a1 = NULL (global path) checks for a reclaimable cache at qword_29FDC00. If present, it locks the global mutex, calls sub_427B30 to free the cached block, zeroes the cache pointer, then retries the allocation. This provides a one-shot emergency reserve for the global allocator only.
Per-Phase Memory Reporting
When --stat=phase-wise is enabled (option 17928), the phase manager takes memory snapshots before and after each phase, then reports deltas.
Memory Snapshot
sub_8DADE0 captures a 48-byte snapshot from the pool state:
// sub_8DADE0 -- take_snapshot(snapshot, pool_state)
void take_snapshot(Snapshot* snap, PoolState* ps) {
snap->pool_state = ps; // +0
  snap->total_alloc   = ((uint64_t*)ps)[80]; // +8  (qword offset 80 = byte +640)
  snap->freeable      = ((uint64_t*)ps)[78]; // +16 (qword offset 78 = byte +624)
  snap->freeable_leak = ((uint64_t*)ps)[79]; // +24 (qword offset 79 = byte +632)
  snap->metric4       = ((uint64_t*)ps)[76]; // +32 (qword offset 76 = byte +608)
snap->current_usage = ps->vtable->get_usage(ps); // +40
}
Memory Delta Queries
Three helper functions compute deltas between the current pool state and a saved snapshot:
| Function | Computation | Metric |
|---|---|---|
| sub_8DAE20 | pool[632] - snap[3] | Total memory delta |
| sub_8DAE30 | pool[624] - snap[2] | Freeable memory delta |
| sub_8DAE40 | snap[1] + pool[624] - snap[2] - pool[640] | Freeable leaked delta |
Pool Consumption Query
sub_8DAE60 returns the current pool consumption as a single integer:
// sub_8DAE60 -- pool_consumption(pool_state)
int64_t pool_consumption(PoolState* ps) {
return *(ps->vtable->field_at_32) - ps[5];
// i.e., total allocated from parent minus some baseline
}
Reporter Output
The pool reporter (sub_C62200) prints to stderr:
[Pool Consumption = 45.678 MB]
Size formatting follows the same thresholds used throughout ptxas:
- 0--1023 bytes: raw integer with B suffix
- 1,024--10,485,760 bytes: %.3lf KB
- Above 10 MB: %.3lf MB
The per-phase reporter (sub_C64310) prints one line per phase:
<phase_name> :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
The leak percentage is computed only when both freeable and freeable-leaked are positive.
Memory Space Statistics Dump
ptxas contains a detailed memory-space statistics subsystem for debugging the pool allocator. The output is gated by a byte flag at context+404 (initialized to 0 in sub_434320; not exposed as a user-facing knob). When the flag is non-zero, the compilation driver calls into the statistics printers at two points: after each per-kernel compilation (sub_436DF0, sub_4428E0) and on error-path exit from the main driver (sub_446240).
Generic Pool Statistics -- sub_425020
The entry point is sub_425AB0, which acquires the pool mutex, builds a stack-local stats-context struct, and calls sub_425020. The stats context is 28 bytes:
StatsContext (28 bytes, on stack)
+0 ptr output_stream FILE* for sub_42BB30 (formatted output)
+8 u8 verbosity_flag enables/disables output
+12 u32 detail_level 0 = compact, 1 = standard, 2 = per-page
+16 u8 recurse_flag walk child pools if set
+20 u32 indent_level current tab depth
+24 u32 indent_step tabs added per recursion level
sub_425020 first calls sub_423A10 to print the banner, then walks two structures to compute totals:
- Large-block slab chain (the pool+48 linked list): for each slab descriptor, accumulates total_size and available_size, and counts free blocks within each slab.
- Small-block bin scan (the pool+2112 hash map, via sub_426D60): iterates all 625 size classes (0..4992 in steps of 8), summing per-bucket total_size and available_size.
The three output metrics (total allocated, total available, and in_use = total_available - total_allocated) are all formatted as hex strings via sprintf("0x%llx", ...).
Detail level 1 (standard) output:
Memory space statistics for 'Top level ptxas memory pool'
==========================================================
Page size : 0x10000 bytes
Total allocated : 0x1a2b3c4 bytes
Total available : 0x1ffffff bytes
Total in use : 0x05d4c3b bytes
Nrof small block pages : 42
Nrof large block pages : 7
Longest free list size : 3
Average free list size : 0
Detail level 2 adds per-page breakdowns:
@@ large block page 0 : 0x1234/0x10000, #=2 max=0x5000
@@ small block size 24: 0x600/0x1800 (64/128 blocks) 3 pages
Detail level 0 (compact) prints a single line:
available= 0x1ffffff, allocated= 0x1a2b3c4, used= 0x05d4c3b
When recurse_flag is set, sub_425020 calls sub_42D4C0(child_chain, sub_425020, stats_context) to recursively walk and print statistics for all child pools, incrementing the indentation at each level.
OCG Memory Space Statistics -- sub_6936B0
The OCG (Optimizing Code Generator) uses a separate fixed-page allocator tracked in a 1048-byte hash-table object with 128 buckets. sub_6936B0 prints its statistics to stderr via sub_427540:
Memory space statistics for 'OCG mem space'
===========================================
Page size : 0x100000 bytes
Total allocated : 0x340000 bytes
Total available : 0x400000 bytes
The page size is hardcoded at 0x100000 (1 MB). The counters are read from the OCG state object at offsets +1032 (total_allocated) and +1040 (total_available) of the hash-table structure at OCG-context+24.
After printing, sub_693630 tears down the OCG allocator: it walks all 128 hash buckets freeing every linked-list entry, frees the overflow list at +1024, then frees the hash table object and the parent allocation via sub_4248B0.
Trigger
Both statistics paths are gated by the same flag: *(uint8_t*)(context + 404). This flag defaults to 0 and is not registered as a CLI knob. It is an internal debug mechanism, likely set only by NVIDIA-internal debug builds or environment variables not present in the release binary.
Pool Reset and Reuse
The pool system does not expose an explicit "reset" operation that returns all allocations without freeing slabs. Instead, pool lifetime is managed through the hierarchical ownership model:
- Per-parse pool ("PTX parsing state"): created before parsing, destroyed after parsing is complete. All lexer/parser temporaries are freed in bulk when the pool is torn down.
- Per-kernel pool ("Permanent OCG memory pool"): created before the 159-phase pipeline runs on a kernel, destroyed afterward. All IR nodes, analysis results, and phase-local data die with this pool.
- ELF output pool ("elfw memory space"): scoped to the ELF emission phase.
The teardown helper sub_4234D0 walks the pool's slab chain and returns each slab's memory to the parent pool via sub_4248B0 (free), then frees the slab descriptors themselves. Because slabs are allocated from the parent pool, this cascades upward -- destroying a child pool returns memory to the parent without touching the system heap.
Allocation Pattern: The 50KB Buffer
A pervasive allocation pattern across ptxas is the "alloc-format-shrink" idiom, observed in all PTX text formatters:
// ~100+ call sites follow this exact pattern
Pool* pool = get_arena_pool(ctx, table); // sub_4280C0 -> offset 24
char* buf = pool_alloc(pool, 50000); // 50KB temp buffer
if (!buf) alloc_fail_abort(pool, 50000);
int len = snprintf(buf, 50000, format, ...);
char* result = pool_alloc(pool2, len + 1); // exact-size copy
memcpy(result, buf, len + 1);
pool_free(buf); // return 50KB to pool
return result;
The 50,000-byte temporary buffer is a "one size fits all" strategy. Because it exceeds the 4,999-byte small-path threshold, every format operation takes the large-block path. However, because the buffer is freed immediately after use, it is typically coalesced back and reused by the next formatter call, making this effectively a per-thread scratch buffer recycled through the pool.
Global State
The allocator uses several global variables for cross-pool coordination:
| Address | Type | Purpose |
|---|---|---|
| dword_29FDBF4 | u32 | Outstanding slab growth count (decremented after slab creation) |
| dword_29FDBF8 | u32 | Emergency cache flag (zeroed when cache is reclaimed) |
| qword_29FDC00 | ptr | Emergency reclaimable cache block pointer |
| qword_29FDC08 | mutex* | Global mutex protecting the above three fields |
| dword_29FDBE8 | u32 | Global slab sequence number (atomic increment) |
| qword_29FDBE0 | ptr | Global slab tracking map (for cross-pool slab lookup) |
| qword_29FDBD8 | mutex* | Mutex protecting qword_29FDBE0 |
| byte_29FA4C0 | u8 | Flag enabling per-pool slab tracking maps |
The slab tracking map (qword_29FDBE0) is a hash map keyed by address >> 3 that maps any allocated pointer to its owning slab descriptor. When the per-pool tracking flag (byte_29FA4C0) is enabled, the deallocator (sub_4248B0) consults the owning pool's own tracking map (pool +2112); when per-pool tracking is disabled, it falls back to this global map.
Key Functions Reference
| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_424070 | 2,098 | 3,809 | pool_alloc(pool, size) -- main allocator |
| sub_4248B0 | 923 | 1,215 | pool_free(ptr) -- main deallocator |
| sub_424C50 | 488 | 27 | pool_realloc(ptr, new_size) -- alloc+copy+free |
| sub_42BDB0 | 14 | 3,825 | alloc_fail_abort() -- fatal OOM via longjmp |
| sub_4280C0 | 597 | 3,928 | get_tls_context() -- per-thread state accessor |
| sub_427A10 | -- | -- | global_alloc(size) -- malloc wrapper for NULL pool |
| sub_427B30 | -- | -- | global_free(ptr) -- free wrapper for non-pool memory |
| sub_423A10 | 323 | 1 | pool_stats_header() -- prints "Memory space statistics for ..." banner |
| sub_425020 | ~1,500 | 1 | pool_stats_detail() -- full metrics dump, recursive child walk |
| sub_425AB0 | 80 | 2 | pool_stats_entry() -- mutex-wrapped entry point |
| sub_6936B0 | 120 | 2 | ocg_memspace_stats() -- OCG allocator stats to stderr |
| sub_693630 | 166 | 2 | ocg_memspace_teardown() -- free OCG hash-table allocator |
| sub_4234D0 | 258 | 1 | pool_teardown() -- recursive slab deallocation |
| sub_423600 | 922 | 3 | pool_accounting_init() -- accounting/hash-set setup |
| sub_423E50 | 544 | 2 | register_slab() -- slab tracking insertion |
| sub_423B60 | -- | -- | can_grow() -- checks whether slab expansion is permitted |
| sub_423C70 | 480 | 2 | pool_grow() -- slab expansion handler |
| sub_42BE50 | 64 | -- | floor_log2(size) -- clear-to-highest-bit + BSF |
| sub_42B990 | -- | -- | slab_lookup(map, addr>>3) -- find slab for address |
| sub_4258D0 | -- | -- | create_named_pool(name, flags, init_size) |
| sub_8DADE0 | 48 | -- | take_snapshot(snap, pool_state) |
| sub_8DAE20 | 16 | -- | delta_total(snap) |
| sub_8DAE30 | 16 | -- | delta_freeable(snap) |
| sub_8DAE40 | 32 | -- | delta_freeable_leaked(snap) |
| sub_8DAE60 | 32 | -- | pool_consumption(pool_state) |
| sub_C62200 | 888 | 1 | Pool consumption reporter (stderr) |
| sub_C64310 | 3,168 | -- | Per-phase timing/memory reporter |
Hash Tables & Bitvectors
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas contains two independent hash map implementations and a dedicated bitvector library. These three container abstractions underpin nearly every subsystem in the compiler -- from the PTX parser's symbol tables to the optimizer's liveness analysis to the code generator's instruction deduplication cache. This page documents their binary-level layouts, hash algorithms, SIMD acceleration, and usage patterns.
Overview
| Container | Object Size | Hash Algorithm | Address Range | Callers |
|---|---|---|---|---|
| General hash map | 112 bytes | MurmurHash3 (strings), pointer-shift, or identity | 0x425B20--0x42D850 | 2800+ (sub_426150) |
| CFG hash map | 40 bytes (header) | FNV-1a (32-bit) | 0xBDED20--0xBDFB10 | ~80 |
| Bitvector | 20 bytes (header) | N/A | 0xBDBA60--0xBDE150 | 500+ |
The general hash map is a self-contained 112-byte object used for string-keyed lookups (intrinsic tables, PTX directive dispatch, ELF section names), pointer-keyed caches (instruction deduplication), and integer-keyed registries (opcode tables). The CFG hash map is a separate, purpose-built implementation for graph edge storage with embedded sub-hash tables for multi-edge blocks. The bitvector library provides 17+ SSE2-accelerated operations optimized for the iterative dataflow workloads that dominate ptxas compile time.
General Hash Map (112 bytes)
Construction
The constructor sub_425CA0 takes three arguments: a hash function pointer, a compare function pointer, and an initial capacity hint. It delegates allocation to sub_425B20, which allocates the 112-byte map object, a bucket pointer array, an entries array, and a used-bits bitmap.
HashMap* HashMap_create(HashFn hash, CmpFn compare, uint32_t capacity_hint);
// sub_425CA0(hash_fn, cmp_fn, initial_capacity)
After allocation, the constructor detects two well-known function-pointer pairs and sets fast-path mode flags in the flags field at offset +84 (bits 4-7):
| Mode | Bits 4-7 | Hash Function | Compare Function | Key Semantics |
|---|---|---|---|---|
| 0 (custom) | 0x00 | User-supplied | User-supplied | String or arbitrary (MurmurHash3 typical) |
| 1 (pointer) | 0x10 | sub_4277F0 | sub_427810 | Raw pointer identity |
| 2 (integer) | 0x20 | sub_427750 | sub_427760 | 32/64-bit integer identity |
Mode detection is a construction-time optimization: the insert and lookup fast paths test the mode bits and branch directly to specialized hash/compare code, avoiding the overhead of indirect function calls through the hash/compare pointers.
Object Layout (112 bytes)
GeneralHashMap (112 bytes, allocated by sub_425B20)
+0 ptr hash_fn // Hash function pointer (or NULL for mode 1/2)
+8 ptr compare_fn // Compare function pointer
+16 ptr (reserved) // Additional function pointer slot
+24 ptr (reserved) // Additional function pointer slot
+32 u32 has_custom_cmp // Non-zero if custom compare supplied
+40 u32 bucket_mask // (bucket_count - 1); power-of-2 mask
+48 u32 entry_count // Number of occupied entries
+52 u32 (padding)
+56 u32 (reserved)
+60 u32 (reserved)
+64 u32 load_threshold // entry_count threshold triggering resize
+68 u32 (reserved)
+72 u32 first_free // Free-slot tracking index
+76 u32 entry_capacity // Allocated capacity of entries array
+80 u32 bitmap_capacity // Capacity of used-bits bitmap (in uint32_t words)
+84 u16 flags // Bits 4-7: hash mode (0=custom, 1=pointer, 2=integer)
// Bits 0-1: bitmap/entry state flags
+86 u16 (padding)
+88 ptr entries // Array of 16-byte entries [key (8B), value (8B)]
+96 ptr used_bitmap // Bitmap tracking which entry slots are occupied
+104 ptr bucket_array // Array of pointers to index lists per bucket
Entry Format (16 bytes)
Each entry in the entries array at +88 is 16 bytes:
Entry (16 bytes, at entries + 16 * index)
+0 ptr/u64 key // Key (string pointer, raw pointer, or integer)
+8 ptr/u64 value // Associated value
Bucket chains are stored as arrays of uint32_t indices terminated by the sentinel value 0xFFFFFFFF (-1). Each bucket pointer at bucket_array[hash & bucket_mask] points to a sentinel-terminated index list. Collision resolution is therefore separate chaining through index arrays into the entries table, not linked-list chaining through the entries themselves.
Hash Functions
Mode 0 (Custom / String) -- MurmurHash3:
String-keyed maps use sub_427630 (273 bytes, 73 callers), which implements MurmurHash3_x86_32 with the standard mixing constants:
#define ROTL32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

uint32_t MurmurHash3_x86_32(const char* key, int len) {
    // sub_427630 (seed value supplied per call site)
    uint32_t h = seed;
    const uint32_t c1 = 0xCC9E2D51; // -862048943
    const uint32_t c2 = 0x1B873593; // 461845907
    // Body: process 4 bytes at a time
    for (int i = 0; i < len / 4; i++) {
        uint32_t k = ((const uint32_t*)key)[i];
        k *= c1;
        k = ROTL32(k, 15);
        k *= c2;
        h ^= k;
        h = ROTL32(h, 13);
        h = h * 5 - 430675100; // 0xE6546B64
    }
    // Tail: remaining 1-3 bytes (standard MurmurHash3 tail)
    const uint8_t* tail = (const uint8_t*)key + (len & ~3);
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= tail[2] << 16; // fall through
    case 2: k ^= tail[1] << 8;  // fall through
    case 1: k ^= tail[0];
            k *= c1; k = ROTL32(k, 15); k *= c2;
            h ^= k;
    }
    // Finalization
    h ^= len;
    h ^= h >> 16;
    h *= 0x85EBCA6B; // -2048144789
    h ^= h >> 13;
    h *= 0xC2B2AE35; // -1028477387
    h ^= h >> 16;
    return h;
}
Mode 1 (Pointer):
Pointer-keyed maps use a shift-XOR hash that destroys the alignment pattern common in heap pointers:
uint32_t pointer_hash(void* key) {
uintptr_t p = (uintptr_t)key;
return (p >> 11) ^ (p >> 8) ^ (p >> 5);
}
The right-shifts at 5, 8, and 11 bits fold the significant bits of a 64-bit heap address into a 32-bit range. The compare function (sub_427810) performs raw pointer equality.
Mode 2 (Integer):
Integer-keyed maps use the key value directly as the hash (identity hash). The compare function (sub_427760) performs integer equality. Bucket selection is key & bucket_mask.
Insert Operation (sub_426150)
sub_426150 (2534 bytes, 2800 callers) is the most heavily used hash map function. Its signature:
void* HashMap_put(HashMap* map, void* key, void* value);
// Returns: previous value if key existed, NULL if new insertion
Algorithm:
- Hash computation. Branch on the mode bits at +84 to select the hash function.
- Bucket lookup. Compute bucket = hash & *(map+40). Load the index list from bucket_array[bucket].
- Key scan. Walk the index list. For each index i where i != 0xFFFFFFFF, compare entries[i].key with the target key using the mode-specific compare.
- Hit: swap entries[i].value with the new value. Return the old value.
- Miss: find a free slot via the used-bits bitmap at +96. Set entries[free].key = key and entries[free].value = value. Append the free index to the bucket's index list. Increment entry_count at +48.
- Resize check. If entry_count > load_threshold, double the capacity: allocate a new entries array (2 * old_capacity) and a new bucket array, then rehash all entries.
Lookup Operation (sub_426D60)
sub_426D60 (345 bytes, 422 callers) is the read-only lookup:
void* HashMap_get(HashMap* map, void* key);
// Returns: value if found, NULL (0) if not found
Same hash and scan logic as insert, but no modification path. The three mode fast paths avoid indirect calls on the hot path.
Contains Operation (sub_426EC0)
sub_426EC0 (349 bytes, 29 callers) returns 1 if the key exists, 0 otherwise. Nearly identical to lookup but returns a boolean rather than the value.
Resize Policy
The resize doubles capacity when entry_count exceeds load_threshold. The threshold is set to 4 * bucket_count during initialization (from sub_425B20: v6[8] = 4 << log2(capacity)), meaning the default load factor limit is approximately 4.0 entries per bucket. Initial bucket count is rounded up to the next power of two from the capacity hint (minimum 1). The constructor at sub_425B20 calls sub_42BDF0 to compute ceil(log2(capacity)).
Destructor (sub_425D20)
sub_425D20 (121 bytes, 63 callers) frees the entries array, used-bits bitmap, bucket array, and the 112-byte map object itself. All four allocations are returned to the pool allocator.
Iteration
Hash map iteration (used by sub_425DB0, 9 callers) walks the entries array linearly, testing the used-bits bitmap to identify occupied slots. There is no guaranteed iteration order -- the order depends on insertion history and resize operations.
Usage Examples
| Subsystem | Hash Mode | Key Type | Purpose |
|---|---|---|---|
| Intrinsic table (sub_5AB660) | Custom (MurmurHash3) | String | 608 intrinsic name lookups (capacity 0x80) |
| PTX opcode dispatch | Custom (MurmurHash3) | String | Named PTX opcodes at ctx+808 |
| SM version backends | Custom (MurmurHash3) | String | sm_XX, compute_XX, lto_XX registries |
| Instruction dedup (sub_737760) | Pointer | Pointer | Avoid re-encoding identical instructions |
| Opcode hash tables | Integer | Integer | Fast opcode-to-handler dispatch |
| File offset cache | Custom | String | Cache file offsets for #line directives |
| Symbol table (sub_621480) | Custom (MurmurHash3) | String | Named symbol lookups at ctx+30016 |
| Per-function disable list | Custom (FNV-1a) | String | Function-specific pass disable at ctx+120->+1128 |
CFG Hash Map (Graph Edges)
The CFG edge hash map is a completely separate implementation from the general hash map, located in the address range 0xBDED20--0xBDFB10. It is designed specifically for graph edge storage -- mapping block indices to successor/predecessor sets -- and has a different object layout, hash function, and collision resolution strategy.
Object Layout
CFGHashMap (40 bytes header)
+0 ptr first_free_node // Free list for node recycling
+8 ptr node_arena // Pool allocator for new nodes
+16 ptr bucket_array // Array of 24-byte bucket headers
+24 u64 num_buckets // Power of two, initial = 8
+32 i32 total_elements // Total entries across all buckets
+36 i32 num_unique_keys // Distinct keys inserted
Two distinct hash map configurations exist for different node sizes:
Full Node (64 bytes) -- Successor Edge Map
Used by sub_BDED20 (12KB) for the successor edge hash map at Code Object +648. Each node represents a block's successor edge set with an optional sub-hash table for multi-successor blocks:
FullNode (64 bytes)
+0 ptr next // Chain link within bucket
+8 i32 key // Block index (bix)
+12 i32 value_info // Edge count or flags
+16 ptr value_array // Pointer to sub-array of successor indices
+24 i32 value_count // Number of successors in sub-array
+28 i32 (padding)
+32 ptr sub_hash_data // Embedded sub-hash for multi-edge blocks
+40 u64 sub_hash_size // Sub-hash capacity
+48 u64 (reserved)
+56 u32 cached_hash // Cached FNV-1a hash of key
+60 u32 (padding)
Simple Node (16 bytes) -- Backedge Set
Used by sub_BDF480 (10KB) for the backedge hash map at Code Object +680. Each node is a minimal key-hash pair for set membership testing:
SimpleNode (16 bytes)
+0 ptr next // Chain link within bucket
+8 i32 key // Block index
+12 u32 cached_hash // Cached hash
Bucket Header (24 bytes)
Both full and simple node maps use the same bucket header:
Bucket (24 bytes)
+0 ptr head // First node in collision chain
+8 ptr tail // Last node in collision chain
+16 i32 count // Number of nodes in this bucket
+20 i32 (padding)
FNV-1a Hash Function
All CFG hash lookups use FNV-1a (32-bit) with standard parameters, confirmed across 50+ call sites:
uint32_t fnv1a_hash_u32(uint32_t key) {
uint32_t hash = 0x811C9DC5; // FNV offset basis
hash = 16777619 * (hash ^ (key & 0xFF)); // byte 0
hash = 16777619 * (hash ^ ((key >> 8) & 0xFF)); // byte 1
hash = 16777619 * (hash ^ ((key >> 16) & 0xFF)); // byte 2
hash = 16777619 * (hash ^ ((key >> 24) & 0xFF)); // byte 3
return hash;
}
Bucket selection: bucket = hash & (num_buckets - 1).
This appears inline in the decompiled code as:
v9 = 16777619 * (HIBYTE(*a3)
^ (16777619 * ((unsigned __int8)BYTE2(*a3)
^ (16777619 * (BYTE1(v8)
^ (16777619 * ((unsigned __int8)v8 ^ 0x811C9DC5)))))));
v11 = v9 & (a2[3] - 1); // bucket index
Resize Policy
The CFG hash map resizes when total_elements > num_unique_keys (load factor > 1.0). New capacity is 4 * old_bucket_count. Both sub_BDED20 and sub_BDF480 allocate a new 192-byte bucket array (8 buckets * 24 bytes/bucket = 192) during initial creation, then redistribute nodes during resize.
The resize algorithm:
- Allocate new bucket array (192 bytes for 8 buckets).
- Walk all old buckets, removing each node from old chains.
- Re-hash each node: new_bucket = cached_hash & (new_num_buckets - 1).
- Insert at head or tail of new bucket chain.
- Free old bucket array via pool allocator (vtable dispatch at allocator +32 for free).
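The rehash step above is cheap because each node carries its cached hash; no key is re-hashed. A minimal sketch (Node and cfg_resize are hypothetical names, with the bucket header simplified to a bare head pointer and the pool allocator replaced by malloc/free):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Node { struct Node *next; uint32_t key, cached_hash; } Node;

/* Redistribute all nodes into a larger power-of-two bucket array. */
static Node **cfg_resize(Node **old_buckets, uint64_t old_n, uint64_t new_n) {
    Node **nb = calloc(new_n, sizeof(Node *));
    for (uint64_t b = 0; b < old_n; b++) {
        Node *n = old_buckets[b];
        while (n) {                                   /* unlink from old chain */
            Node *next = n->next;
            uint64_t nbix = n->cached_hash & (new_n - 1);  /* cached hash reused */
            n->next = nb[nbix];                       /* head insertion */
            nb[nbix] = n;
            n = next;
        }
    }
    free(old_buckets);                                /* old bucket array released */
    return nb;
}
```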
Edge Map Operations
Insert/Find (sub_BDED20, sub_BDF480):
- Compute FNV-1a hash of the block index key.
- Index into bucket array: bucket = hash & (num_buckets - 1).
- Walk the collision chain comparing keys.
- If found: return existing node.
- If not found: allocate node from arena, initialize, insert into bucket, increment counters.
Erase (sub_BDE6C0, 3KB):
Removes an entry from the successor edge map and recursively erases successor edges for blocks that become unreachable (single-predecessor blocks whose only predecessor was removed). This recursive cleanup maintains CFG consistency during block deletion.
Print (sub_BDE8B0, 2KB):
Iterates a block's successors and prints "\tbix%d -> bix%d\n" for each edge. Used by the CFG debug dump infrastructure.
Multi-Edge Sub-Hash
For blocks with many successors (switch statements, computed branches), the full node at +32 contains a pointer to an embedded sub-hash table. This sub-hash maps successor block indices within a single node's edge set, enabling O(1) lookup of whether a specific successor edge exists. The sub-hash uses the same 24-byte bucket structure as the outer hash map.
Storage Locations
| Code Object Offset | Field | Map Type | Node Size |
|---|---|---|---|
| +648 | succ_map | Successor edges | 64 bytes (full) |
| +680 | backedge_map | Backedge set | 16 bytes (simple) |
Bitvector Library
The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas. It supports 17+ operations, all SSE2-accelerated with manual alignment handling. The library is the backbone of liveness analysis (6 dedicated phases), dominance computation, register interference detection, and dead code elimination.
Object Layout (20 bytes)
struct BitVector { // 20 bytes
uint32_t* data; // +0: pointer to word array (heap-allocated)
int32_t word_count; // +8: number of 32-bit words in use
int32_t capacity; // +12: allocated words (>= word_count)
int32_t bit_count; // +16: number of valid bits
};
Word count derivation: word_count = (bit_count + 31) >> 5.
Memory is allocated through the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). The allocate function (sub_BDBA60) implements grow-only semantics: if the new word count exceeds the current capacity, the old buffer is freed and a new one allocated. The buffer is never shrunk.
Bit Addressing
Individual bits are addressed by word index and bit position within the word:
// Set bit i: data[i >> 5] |= (1 << (i & 0x1F))
// Clear bit i: data[i >> 5] &= ~(1 << (i & 0x1F))
// Test bit i: (data[i >> 5] >> (i & 0x1F)) & 1
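A runnable form of these identities (bv_set, bv_clear, bv_test are hypothetical helper names; the binary inlines the operations exactly as shown above):

```c
#include <assert.h>
#include <stdint.h>

static void bv_set(uint32_t *d, int i)   { d[i >> 5] |= (1u << (i & 0x1F)); }
static void bv_clear(uint32_t *d, int i) { d[i >> 5] &= ~(1u << (i & 0x1F)); }
static int  bv_test(const uint32_t *d, int i) { return (d[i >> 5] >> (i & 0x1F)) & 1; }
```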
SSE2 Acceleration Pattern
All bulk operations follow a three-phase structure:
- Alignment prologue. Process scalar words until the destination pointer is 16-byte aligned. The alignment count is (-(uintptr_t)dst_ptr >> 2) & 3, yielding 0-3 scalar iterations.
- SSE2 main loop. Process 4 words (128 bits) per iteration using _mm_load_si128 (aligned destination), _mm_loadu_si128 (potentially unaligned source), and the appropriate SSE2 intrinsic (_mm_or_si128, _mm_and_si128, _mm_andnot_si128, _mm_xor_si128). The loop count is remaining_words >> 2.
- Scalar epilogue. Process the remaining 0-6 words individually. The decompiler shows an unrolled epilogue handling up to 6 trailing words with sequential if-chains rather than a loop.
Example from sub_BDCF40 (orIfChanged):
// Prologue: align dst
int align = (int)((-(uintptr_t)dst >> 2) & 3);
for (int i = 0; i < align && i < word_count; i++)
    dst[i] |= src[i];
// SSE2 loop: 4 words per iteration
int sse_iters = (word_count - align) >> 2;
for (int i = 0; i < sse_iters; i++) {
    __m128i d = _mm_load_si128((__m128i *)&dst[align + 4*i]);
    __m128i s = _mm_loadu_si128((const __m128i *)&src[align + 4*i]);
    _mm_store_si128((__m128i *)&dst[align + 4*i], _mm_or_si128(d, s));
}
// Epilogue: 0-6 remaining scalar words
Operation Catalog
Allocation and Lifecycle
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBA60 | allocate | (bv*, alloc*, num_bits) | Grow-only allocation. Sets bit_count, recomputes word_count, reallocates if capacity insufficient. |
| sub_BDDC00 | clear | (bv*, start_bit) | Zeroes all words from start_bit >> 5 to end. When start_bit = 0, equivalent to memset(data, 0, ...). |
| sub_BDCA60 | operator= | (dst*, src*) | Copy assignment. Reallocates dst if capacity insufficient, then memcpy. |
Bit Manipulation
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBFB0 | setBit | (bv*, bit_index) | data[i>>5] \|= (1 << (i&31)) |
| sub_BDC0E0 | clearBit | (bv*, bit_index) | data[i>>5] &= ~(1 << (i&31)) |
| sub_BDC200 | testBit | (bv*, bit_index) -> bool | (data[i>>5] >> (i&31)) & 1 |
Bulk Boolean Operations (SSE2)
| Address | Operation | Signature | SSE2 Intrinsic | Description |
|---|---|---|---|---|
| sub_BDCDE0 | operator\|= | (dst*, src*) | _mm_or_si128 | dst \|= src |
| sub_BDC5F0 | operator&= | (dst*, src*) | _mm_and_si128 | dst &= src; zeroes dst tail beyond src length |
| sub_BDDAA0 | operator^= | (dst*, src*) | _mm_xor_si128 | dst ^= src |
Fixed-Point Detection (IfChanged variants)
These return 1 if any bit changed, 0 if the result is identical to the previous state. They are the core of iterative dataflow convergence detection.
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDCF40 | orIfChanged | (dst*, src*) -> bool | Scans for (~dst & src) != 0 first, then applies dst \|= src from the first differing word. Returns whether any bit was newly set. |
| sub_BDC790 | andIfChanged | (dst*, src*) -> bool | Scans for (~src & dst) != 0 first, then applies dst &= src. Zeroes trailing dst words beyond src. Returns whether any bit was cleared. |
The IfChanged early-exit optimization: the function first scans word-by-word for the first position where a change would occur (~dst & src for OR, ~src & dst for AND). If no such position exists, it returns 0 immediately without touching memory. If found, it applies the operation only from that position forward, reducing cache pollution when most blocks have already converged.
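A scalar sketch of the scan-then-apply pattern for the OR variant (or_if_changed is a hypothetical name; the real sub_BDCF40 applies the same logic with the three-phase SSE2 loop):

```c
#include <assert.h>
#include <stdint.h>

static int or_if_changed(uint32_t *dst, const uint32_t *src, int words) {
    int first = -1;
    for (int i = 0; i < words; i++)            /* phase 1: find first change */
        if (~dst[i] & src[i]) { first = i; break; }
    if (first < 0) return 0;                   /* fixed point: no memory writes */
    for (int i = first; i < words; i++)        /* phase 2: apply from there on */
        dst[i] |= src[i];
    return 1;
}
```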
Three-Input Operations
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDC3F0 | assignAND | (dst*, a*, b*) | dst = a & b (SSE2 _mm_and_si128) |
| sub_BDD8C0 | assignANDNOT | (dst*, a*, b*) | dst = a & ~b (SSE2 _mm_andnot_si128) |
| sub_BDCC20 | assignOR | (dst*, a*, b*) | dst = a \| b (SSE2 _mm_or_si128) |
| sub_BDD140 | orWithAND | (dst*, a*, b*) | dst \|= a & b (SSE2 _mm_and_si128 + _mm_or_si128) |
Liveness-Specific Fused Operations
The most important bitvector operations for compiler performance are the fused transfer functions used by liveness analysis:
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDD300 | orWithAndNot | (dst*, gen*, in*, kill*) | dst \|= gen \| (in & ~kill) |
| sub_BDD560 | orWithAndNotIfChanged | (dst*, gen*, in*, kill*) -> bool | Same as above + returns change flag |
These implement the liveness transfer function LiveIn(B) = gen(B) | (LiveOut(B) - kill(B)) in a single SIMD pass, avoiding materialization of intermediate bitvectors.
orWithAndNot inner loop (sub_BDD300):
// SSE2: process 4 words per iteration
*(__m128i *)(dst + offset) = _mm_or_si128(
    _mm_or_si128(
        _mm_loadu_si128((const __m128i *)(gen + offset)),   // gen
        *(__m128i *)(dst + offset)),                        // existing dst
    _mm_andnot_si128(
        _mm_loadu_si128((const __m128i *)(kill + offset)),  // ~kill
        _mm_loadu_si128((const __m128i *)(in + offset))));  // in & ~kill
orWithAndNotIfChanged (sub_BDD560):
This function first scans for any word position where the new value would differ from the current dst:
// Phase 1: scan for first change
for (int i = 0; i < word_count; i++) {
uint32_t new_bits = gen[i] | (in[i] & ~kill[i]);
if (~dst[i] & new_bits)
goto apply_from_i; // Found change at position i
}
return 0; // No change -- already at fixed point
// Phase 2: apply from first change position
apply_from_i:
// SSE2 loop: dst[j] |= gen[j] | (in[j] & ~kill[j])
// ... (same three-phase SSE2 pattern) ...
return 1;
This two-phase design is critical for liveness analysis performance: in the late iterations of the fixed-point solver, most blocks have already converged and the scan returns 0 without writing memory.
Query Operations
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDDD40 | popcount | (bv*) -> int | Count set bits. Uses __popcountdi2 intrinsic per word. |
| sub_BDBFB0 | findLastSetWord | (bv*) -> int | Scans from high word to low, returns index of last non-zero word. Returns 0xFFFFFFFF if all zero. |
| sub_BDDC00 | nextSetBit | (bv*, start) -> int | Find next set bit at or after start. Uses tzcnt (trailing zero count) for intra-word scanning. Returns 0xFFFFFFFF if none found. |
| sub_BDDCB0 | prevSetBit | (bv*, start) -> int | Find previous set bit at or before start. Uses bsr (bit scan reverse). Returns 0xFFFFFFFF if none found. |
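A scalar sketch of nextSetBit under the semantics listed above (next_set_bit is a hypothetical name; the GCC/Clang builtin __builtin_ctz stands in for the tzcnt instruction):

```c
#include <assert.h>
#include <stdint.h>

static uint32_t next_set_bit(const uint32_t *d, int words, uint32_t start) {
    uint32_t wi = start >> 5;
    if ((int)wi >= words) return 0xFFFFFFFFu;
    uint32_t w = d[wi] & (~0u << (start & 0x1F));  /* drop bits below start */
    for (;;) {
        if (w) return (wi << 5) + (uint32_t)__builtin_ctz(w);  /* tzcnt */
        if ((int)++wi >= words) return 0xFFFFFFFFu;
        w = d[wi];
    }
}
```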
Extract Operation
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBD60 | extractBits | (bv*, out[], start, end) | Extract bit range [start, end] into output array. Handles cross-word boundaries with shift-and-mask. Supports extraction spans > 32 bits by filling multiple output words. |
The extract function handles three cases:
- Same word: Both start and end are in the same 32-bit word. Single mask-and-shift.
- Small span (<=32 bits, two words): Combine partial words with shift and OR.
- Large span (>32 bits): Loop over full words with optional shift for non-aligned start.
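The small-span case can be sketched as a shift-and-OR over at most two words (extract_span is a hypothetical name; the recovered routine additionally handles the multi-word case by looping):

```c
#include <assert.h>
#include <stdint.h>

/* Extract `len` bits (1..32) starting at bit `start`, possibly straddling
 * a 32-bit word boundary. */
static uint32_t extract_span(const uint32_t *d, uint32_t start, uint32_t len) {
    uint32_t wi = start >> 5, sh = start & 0x1F;
    uint64_t lo = d[wi] >> sh;
    if (sh && sh + len > 32)                       /* straddles into next word */
        lo |= (uint64_t)d[wi + 1] << (32 - sh);
    return len == 32 ? (uint32_t)lo : (uint32_t)(lo & ((1u << len) - 1));
}
```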
Bitvector Subset Test
The isSubsetOf operation tests whether bitvector A is a subset of B, i.e. (A & ~B) == 0. This is used by dominance queries (checking if block X is dominated by block Y). The test at sub_1245740 referenced in the copy-prop-CSE pass performs a single-bit indexed test against the dominator bitvector: testBit(dom_set[block], dominator_id).
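The subset test reduces to a word-by-word check that no bit of A falls outside B (is_subset_of is a hypothetical name for the recovered operation):

```c
#include <assert.h>
#include <stdint.h>

/* A is a subset of B iff (A & ~B) == 0 in every word. */
static int is_subset_of(const uint32_t *a, const uint32_t *b, int words) {
    for (int i = 0; i < words; i++)
        if (a[i] & ~b[i]) return 0;   /* bit set in A but not in B */
    return 1;
}
```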
Usage Across Subsystems
Liveness Analysis
The iterative fixed-point solver runs in reverse post-order, computing LiveIn and LiveOut bitvectors per basic block:
LiveOut(B) |= LiveIn(S) -- for each successor S: orIfChanged
LiveIn(B) = gen(B) | (LiveOut(B) - kill(B)) -- orWithAndNotIfChanged
Six dedicated phases (10, 16, 19, 33, 61, 84) perform liveness + DCE. The orWithAndNotIfChanged fusion and early-exit optimization minimize cache traffic in later iterations when most sets have stabilized. Bitvectors for liveness are stored at Code Object +832 (R registers) and +856 (UR registers).
Dominance
The iterative dominator computation (sub_BE2330, 4KB) uses bitvector intersection:
dom[entry] = {entry}
dom[b] = {b} union (intersection of dom[p] for all predecessors p)
Each block's dominator set is a bitvector indexed by block ID. Convergence uses andIfChanged to detect when the intersection no longer removes any dominators.
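A minimal model of this iteration, assuming at most 32 blocks so each dominator set fits one word (compute_dom and the preds bitmask encoding are hypothetical; the real solver uses full bitvectors and andIfChanged):

```c
#include <assert.h>
#include <stdint.h>

#define NB 4   /* blocks in the test CFG */

/* preds[b] is a bitmask of b's predecessors; block 0 is the entry. */
static void compute_dom(uint32_t dom[NB], const uint32_t preds[NB]) {
    dom[0] = 1u << 0;                            /* dom[entry] = {entry} */
    for (int b = 1; b < NB; b++) dom[b] = 0xFFFFFFFFu;
    int changed = 1;
    while (changed) {                            /* iterate to fixed point */
        changed = 0;
        for (int b = 1; b < NB; b++) {
            uint32_t meet = 0xFFFFFFFFu;
            for (int p = 0; p < NB; p++)         /* intersect predecessor sets */
                if (preds[b] & (1u << p)) meet &= dom[p];
            uint32_t nv = (1u << b) | meet;      /* {b} union intersection */
            if (nv != dom[b]) { dom[b] = nv; changed = 1; }
        }
    }
}
```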
Register Allocation
The interference graph builder (sub_926A30, 155KB decompiled) uses bitvector intersection to detect live range overlaps. Two registers interfere if their liveness bitvectors have overlapping set bits at any program point. The allocator at sub_9721C0 explicitly rebuilds liveness before allocation.
GVN-CSE
The GVN-CSE pass (phase 49) uses the general hash map with FNV-1a hashing for the value numbering table. Each hash entry is 24 bytes: [next_ptr (8B), key (8B), value/metadata (8B)] with chained collision resolution. The hash incorporates opcode, type, and recursively resolved value numbers of all operands.
CFG Analysis
The CFG builder (sub_BE0690, 54KB) populates successor and backedge hash maps during phase 3 (AnalyzeControlFlow). RPO computation (sub_BDE150, 9KB) reads from the successor map. The DOT dumper (sub_BE21D0) iterates edge sets for visualization.
Key Function Table
General Hash Map
| Address | Size | Identity | Callers | Confidence |
|---|---|---|---|---|
| sub_425B20 | 0.5KB | HashMap::allocate -- allocates 112-byte map + arrays | 127 (via sub_425CA0) | HIGH |
| sub_425CA0 | 114B | HashMap::create -- constructor with hash/compare fn ptrs | 127 | HIGH |
| sub_425D20 | 121B | HashMap::destroy -- frees all internal arrays + map | 63 | MEDIUM |
| sub_425DB0 | 292B | HashMap::clear / iterator init | 9 | MEDIUM |
| sub_426150 | 2.5KB | HashMap::put -- insert or update key/value pair | 2800 | HIGH |
| sub_426D60 | 345B | HashMap::get -- lookup key, return value or 0 | 422 | HIGH |
| sub_426EC0 | 349B | HashMap::contains -- test key existence | 29 | HIGH |
| sub_427630 | 273B | MurmurHash3_x86_32 -- string hash | 73 | HIGH |
| sub_427750 | -- | Integer hash function (identity) | -- | HIGH |
| sub_4277F0 | -- | Pointer hash function (shift-XOR) | -- | HIGH |
| sub_42D850 | 2.5KB | HashMap::insertSet -- set-mode insert with auto-resize | 282 | HIGH |
CFG Hash Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDED20 | 12KB | CFGHashMap::insertOrFind -- full 64-byte node | HIGH |
| sub_BDF480 | 10KB | CFGHashMap::insertOrFind_simple -- 16-byte node | HIGH |
| sub_BDE6C0 | 3KB | CFGHashMap::erase -- recursive edge removal | MEDIUM |
| sub_BDE8B0 | 2KB | CFGHashMap::printEdges -- "bix%d -> bix%d" | HIGH |
| sub_BDEA50 | 4KB | CFGHashMap::dumpRPOAndBackedges -- debug dump | HIGH |
| sub_BDFB10 | 24KB | CFGHashMap::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BDDDF0 | ~2KB | CFGHashMap::destroyAll -- walk and free all nodes | MEDIUM |
Bitvector Library
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90) |
| sub_BDBFB0 | ~120B | BitVector::findLastSetWord | HIGH (0.90) |
| sub_BDBD60 | ~370B | BitVector::extractBits | HIGH (0.88) |
| sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90) |
| sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90) |
| sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90) |
| sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90) |
| sub_BDC5F0 | ~480B | BitVector::operator&= | HIGH (0.95) |
| sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95) |
| sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85) |
| sub_BDCC20 | ~320B | BitVector::assignOR | HIGH (0.90) |
| sub_BDCDE0 | ~400B | BitVector::operator\|= | HIGH (0.95) |
| sub_BDCF40 | ~560B | BitVector::orIfChanged | HIGH (0.95) |
| sub_BDD140 | ~480B | BitVector::orWithAND | HIGH (0.90) |
| sub_BDD300 | ~490B | BitVector::orWithAndNot | HIGH (0.92) |
| sub_BDD560 | ~650B | BitVector::orWithAndNotIfChanged | HIGH (0.92) |
| sub_BDD8C0 | ~320B | BitVector::assignANDNOT | HIGH (0.88) |
| sub_BDDAA0 | ~400B | BitVector::operator^= | HIGH (0.95) |
| sub_BDDC00 | ~200B | BitVector::nextSetBit / clear | HIGH (0.90) |
| sub_BDDCB0 | ~180B | BitVector::prevSetBit | MEDIUM (0.80) |
| sub_BDDD40 | ~160B | BitVector::popcount | MEDIUM (0.80) |
Design Notes
Why two hash map implementations? The general hash map is optimized for string and pointer lookups with three mode-specific fast paths that avoid indirect calls. The CFG hash map is optimized for integer-keyed graph edge storage with chained nodes and embedded sub-hashes for multi-edge blocks. The general map uses open addressing with index arrays; the CFG map uses linked-list chaining with explicit node allocation. These are different design tradeoffs for different access patterns.
Why 32-bit bitvector words (not 64-bit)? The SSE2 instructions operate on 128-bit vectors regardless of word size, so 32-bit and 64-bit words yield the same SIMD throughput. The 32-bit choice minimizes wasted bits in the trailing word when bit_count is not a multiple of 64, and simplifies the alignment prologue calculation ((-(uintptr_t)ptr >> 2) & 3 vs. (-(uintptr_t)ptr >> 3) & 1). It also matches the tzcnt/bsr operand size used in the bit-scan operations, avoiding the need for 64-bit variants.
Why orWithAndNotIfChanged as a single function? This fusion eliminates three intermediate bitvectors that would be needed if the liveness transfer function were decomposed into separate AND-NOT, OR, and change-detection operations. On a function with 200 registers and 50 basic blocks, this saves ~60KB of intermediate allocations per iteration. More importantly, the two-phase scan-then-apply design avoids cache writes for converged blocks, which is the common case in later iterations.
Cross-References
- Data Structure Layouts -- Compilation context, Code Object field map, pool allocator
- Basic Blocks & CFG -- CFG edge hash maps, RPO computation, backedge detection
- Liveness Analysis -- Bitvector usage in iterative dataflow, SSE2 loop details
- Copy Propagation & CSE -- GVN value table hash map, FNV-1a in expression hashing
- Memory Pool Allocator -- Pool allocator used by both hash maps and bitvectors
Thread Pool & Concurrency
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas compiles multiple entry functions (kernels) in a single PTX input file. When --split-compile or --allow-expensive-optimizations is active, the compilation driver dispatches each kernel to a worker thread for parallel DAGgen+OCG+ELF+DebugInfo processing. The threading infrastructure lives in two address ranges: the TLS system at 0x4280xx--0x4286xx (front-end) and the thread pool at 0x1CB17xx--0x1CB1Axx (shared infrastructure).
| TLS accessor | sub_4280C0 (597 bytes, 3,928 callers, 280-byte struct) |
| TLS key init | ctor_001 at 0x4094C0 (static constructor) |
| TLS destructor | destr_function at 0x427F10 |
| Thread pool ctor | sub_1CB18B0 (184-byte pool struct) |
| Task submit | sub_1CB1A50 (24-byte task node) |
| Worker thread | start_routine at 0x1CB1780 |
| Wait-all | sub_1CB1AE0 |
| Pool destroy | sub_1CB1970 |
| CPU count | sub_1CB1890 (sysconf(_SC_NPROCESSORS_CONF)) |
| Mutex init helper | sub_428620 (recursive mutex factory) |
| Global mutex lock | sub_4286A0 (lazy-init + lock) |
| Jobserver client | sub_1CC7300 (GNU Make integration) |
Architecture
┌─────────────────────────────────────────┐
│ Compilation Driver │
│ sub_446240 │
│ │
│ if (thread_count > 0): │
│ pool = sub_1CB18B0(thread_count) │
│ for each kernel: │
│ snapshot config → 360-byte buffer │
│ copy hash maps for isolation │
│ sub_1CB1A50(pool, sub_436DF0, buf) │
│ sub_1CB1AE0(pool) // wait-all │
│ sub_1CB1970(pool) // destroy │
└────────────┬────────────────────────────┘
│ enqueue tasks
▼
┌────────────────────────────────────┐
│ Thread Pool (184 bytes) │
│ │
│ mutex ──────── guards all fields │
│ cond_work ──── wake workers │
│ cond_done ──── signal completion │
│ work_queue ─── priority heap │
│ pending ────── pending task count │
│ shutdown ───── termination flag │
└───┬──────┬──────┬──────────────────┘
│ │ │
┌─────┘ │ └─────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker N │
│(detached)│ │(detached)│ │(detached)│
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ TLS ctx │ │ TLS ctx │ │ TLS ctx │
│ 280 bytes│ │ 280 bytes│ │ 280 bytes│
│ + pool │ │ + pool │ │ + pool │
└──────────┘ └──────────┘ └──────────┘
CLI Activation
Two options activate the thread pool:
| Option | Type | Behavior |
|---|---|---|
| --split-compile N | int | Set thread count directly. 0 = CPU count via sysconf(83). 1 = serial (pool not created). N > 1 = N worker threads. |
| --allow-expensive-optimizations | bool | Auto-enabled at -O2 and above. Enables the thread pool with an automatically determined thread count. |
Both flags ultimately result in a non-zero value at offset +668 in the compilation driver's state block. The driver checks this field to decide between sequential per-kernel iteration and thread pool dispatch.
Two internal options fine-tune pool behavior:
| Option | Type | Purpose |
|---|---|---|
| --threads-dynamic-scheduling | bool | Enable dynamic scheduling of thread pool tasks (work stealing vs static partitioning). |
| --threads-min-section-size | int | Minimum section size for thread pool work partitioning; prevents excessive task granularity. |
CPU Count Detection
// sub_1CB1890 -- returns the configured CPU count
__int64 sub_1CB1890() {
return sysconf(83); // _SC_NPROCESSORS_CONF on Linux
}
When --split-compile 0 is specified, the pool constructor receives the return value of sub_1CB1890() as its nmemb argument.
Thread-Local Storage (TLS)
The TLS system is the foundation of ptxas's concurrency model. Every thread -- main and workers alike -- gets its own 280-byte context struct, accessed through the single most-called function in the binary: sub_4280C0 (3,928 call sites).
Initialization: ctor_001 (0x4094C0)
The TLS key is created in a static constructor that runs before main:
// ctor_001 -- .init_array entry
int ctor_001() {
if (!qword_29FE0A0) { // one-time guard
pthread_key_create(&key, destr_function);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
qword_29FE0A0 = &sentinel_node; // marks "initialized"
// ... initialize linked list sentinels
}
__cxa_atexit(cleanup_func, ...);
}
The SCHED_RR priority range is queried but never used for thread creation (all threads are created with default attributes). This appears to be vestigial infrastructure for priority-based scheduling that was never activated.
TLS Accessor: sub_4280C0
char *sub_4280C0() {
if (!qword_29FE0A0)
goto init_once; // fallback init (race protection)
char *result = pthread_getspecific(key);
if (result)
return result; // fast path: already allocated
char *ctx = malloc(0x118); // 280 bytes
memset(ctx, 0, 0x118);
pthread_cond_init(ctx + 128, NULL);
pthread_mutex_init(ctx + 176, NULL);
sem_init(ctx + 216, 0, 0);
// Insert into global doubly-linked list (under global mutex)
pthread_mutex_lock(&mutex);
ctx->prev = sentinel;
ctx->next = sentinel->next;
sentinel->next->prev = ctx;
sentinel->next = ctx;
pthread_mutex_unlock(&mutex);
pthread_setspecific(key, ctx);
return ctx;
}
The global doubly-linked list at offsets +256 (prev) and +264 (next) allows the system to enumerate all live TLS contexts. This is used during cleanup to destroy all thread contexts at program exit.
TLS Context Layout (280 bytes)
The full byte-level layout, verified against the constructor (sub_4280C0), destructor (destr_function), and the two diagnostic reporters (sub_42F590, sub_42FBA0):
| Offset | Size | Type | Purpose | Accessor |
|---|---|---|---|---|
| 0 | 1 | byte | has_warning_or_error flag -- set to 1 when severity > 2 | sub_42F590 (direct write) |
| 1 | 1 | byte | has_fatal_error flag -- set to 1 when severity > 4 | sub_42F590 (direct write) |
| 2 | 6 | -- | padding (zeroed by memset) | -- |
| 8 | 8 | jmp_buf* | Error recovery longjmp target (installed by sub_431ED0 before _setjmp) | sub_431ED0 (save/restore) |
| 16 | 8 | void* | Error descriptor pointer -- set to the faulting error descriptor on longjmp | sub_42F590 (write on fatal), sub_431ED0 (propagate on re-throw) |
| 24 | 8 | void* | Per-thread memory pool pointer (used by sub_424070) | sub_42BDD0 (swap), sub_4287D0 (read) |
| 32 | 8 | char* | Program name string (e.g. "ptxas") -- prepended to diagnostic messages | sub_430590 (set), sub_430570 (get) |
| 40 | 8 | char* | Diagnostic suffix string -- appended to message body when non-NULL | sub_42F590 (read, format as " %s") |
| 48 | 1 | byte | Info suppression flag -- suppresses severity-2 (info) messages | sub_42F590 (check) |
| 49 | 1 | byte | Diagnostic suppression flag -- suppresses severity-3 (warning) messages | sub_430560 (set) |
| 50 | 1 | byte | Werror promotion flag (--Werror) -- promotes warnings to errors | sub_430550 (set) |
| 51 | 1 | byte | Machine-readable annotation flag -- adds @E@/@W@/@O@/@I@ severity tags | sub_4305A0 (get) |
| 52 | 1 | byte | Multi-line continuation flag -- suppresses ". " prefix on wrapped lines | sub_4305C0 (set) |
| 53 | 75 | -- | padding (zeroed by memset) | -- |
| 128 | 48 | pthread_cond_t | Per-thread condition variable | constructor: pthread_cond_init |
| 176 | 40 | pthread_mutex_t | Per-thread mutex | constructor: pthread_mutex_init |
| 216 | 32 | sem_t | Semaphore for cross-thread signaling | constructor: sem_init(0, 0) |
| 248 | 8 | void* | Saved semaphore pointer (from pool, for destr_function to sem_post) | destr_function (read at qword[31]) |
| 256 | 8 | void* | Linked list prev pointer (global TLS chain) | constructor (qword[32]) |
| 264 | 8 | void* | Linked list next pointer (global TLS chain) | constructor (qword[33]) |
| 272 | 1 | byte | Destroyed flag (prevents double-destroy) | destr_function (set to 1) |
| 273 | 7 | -- | padding (zeroed by memset) | -- |
The fields fall into seven functional groups:
Error state (offsets 0--1). Two byte flags set by the diagnostic reporter sub_42F590. Byte 0 records whether any error-or-above diagnostic was emitted; byte 1 records fatal errors specifically. The compilation driver reads these after each kernel compilation to determine the process exit code.
Error recovery (offsets 8--16). A setjmp/longjmp mechanism for non-local error exits. sub_431ED0 saves the current jmp_buf* and error byte flags, installs a fresh jmp_buf, then enters the compiler. On a fatal diagnostic, sub_42F590 stores the error descriptor at offset +16 and calls longjmp to the target at offset +8. If no jmp_buf is installed, the handler calls sub_4275E0 (abort).
Per-thread allocator (offset 24). The most performance-critical field. The pool allocator sub_424070 reads this pointer on every allocation (accessed as sub_4280C0()[3]). When non-NULL, allocations go to the calling thread's own slab without any locking. sub_42BDD0 provides an atomic swap primitive that replaces the pool pointer and returns the old value -- used during pool migration at compilation boundaries. This is used pervasively: 3,928 call sites to sub_4280C0 are predominantly pool allocator calls that need the current thread's arena.
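The swap pattern can be sketched as follows. This is a hypothetical model: tls_ctx stands in for sub_4280C0, swap_thread_pool for sub_42BDD0, and a static array stands in for the real 280-byte TLS struct; since the field is thread-local, a plain read-modify-write suffices in the sketch.

```c
#include <assert.h>
#include <stddef.h>

static void *g_fake_tls[35];                 /* test stand-in for the 280-byte ctx */
static void **tls_ctx(void) { return g_fake_tls; }

/* Replace the calling thread's pool pointer (ctx offset +24, i.e. pointer
 * slot [3]) and return the previous value -- the sub_42BDD0 semantics. */
static void *swap_thread_pool(void *new_pool) {
    void **ctx = tls_ctx();
    void *old = ctx[3];
    ctx[3] = new_pool;
    return old;
}
```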
Diagnostic context (offsets 32, 40). The program name at +32 (e.g. "ptxas") is prepended to all diagnostic messages. The suffix at +40 is appended after the message body. Both are set per-thread to support library mode where multiple tool names coexist in the same process.
Diagnostic flags (offsets 48--52). Five single-byte flags controlling diagnostic formatting and filtering. The info suppression flag (+48) silences informational messages. The diagnostic suppression flag (+49) silences warnings entirely. The Werror flag (+50) promotes warnings to errors. The annotation flag (+51) enables machine-readable severity tags (@E@, @W@, @O@, @I@). The continuation flag (+52) enables multi-line continuation mode where wrapped lines omit the ". " prefix.
Synchronization primitives (offsets 128--248). The condvar, mutex, and semaphore are used by the thread pool for task coordination and cross-thread signaling. The saved semaphore pointer at +248 is set by the pool when assigning work to a thread; on thread exit, the destructor calls sem_post on it to notify the pool's shutdown logic.
Global linked list (offsets 256--272). A doubly-linked list threading through all live TLS contexts, protected by the global mutex at 0x29FE0xx. Used by the atexit handler to enumerate and destroy all contexts. The destroyed flag at +272 prevents double-destroy when contexts move to the free list for recycling.
TLS Destructor: destr_function (0x427F10)
When a worker thread terminates, the POSIX TLS destructor fires:
void destr_function(char *ctx) {
if (!ctx) return;
pthread_mutex_lock(&global_mutex);
if (ctx[272]) { // already destroyed?
pthread_mutex_unlock(&global_mutex);
return;
}
sem_t *sem = ctx->saved_semaphore; // offset +248
// Unlink from global TLS chain
ctx->prev->next = ctx->next;
ctx->next->prev = ctx->prev;
// Destroy sync primitives
pthread_cond_destroy(ctx + 128);
pthread_mutex_destroy(ctx + 176);
sem_destroy(ctx + 216);
// Move to free list (recycling, not freed)
ctx[272] = 1; // mark destroyed
ctx->next = free_list_sentinel;
ctx->prev = free_list_head;
free_list_head->next = ctx;
free_list_head = ctx;
pthread_mutex_unlock(&global_mutex);
if (sem)
sem_post(sem); // notify pool that worker exited
}
The destructor does not free() the 280-byte struct. Instead, it moves it to a free list rooted at a second sentinel node (unk_29FDC40 / unk_29FDD60). This is a deliberate choice: the destructor runs during pthread_exit() or thread detachment, and ptxas reuses TLS structs across multiple pool lifetimes within a single process invocation (library mode).
The sem_post at the end notifies the pool shutdown code (sub_1CB1970) that a worker has fully terminated.
Thread Pool Implementation
Pool Struct (184 bytes)
struct ThreadPool { // calloc(1, 0xB8) = 184 bytes
void *thread_handles; // +0: array of (pthread_t, 16 bytes each)
void *work_queue; // +8: priority heap (from sub_1CBEC10)
int32_t pending; // +16: count of tasks awaiting execution
// padding
pthread_mutex_t mutex; // +24: guards all mutable pool state (40 bytes)
pthread_cond_t cond_work; // +64: broadcast when new work arrives (48 bytes)
pthread_cond_t cond_done; // +112: signaled when all work completes (48 bytes)
int64_t active_count; // +160: workers currently executing tasks
int64_t max_threads; // +168: total worker count (= nmemb)
int8_t shutdown; // +176: set to 1 to terminate all workers
};
Constructor: sub_1CB18B0
ThreadPool *sub_1CB18B0(size_t nmemb) {
ThreadPool *pool = calloc(1, 0xB8); // 184 bytes, zero-init
pool->thread_handles = calloc(nmemb, 0x10); // 16 bytes per thread
pool->max_threads = nmemb;
pool->pending = 0;
pthread_mutex_init(&pool->mutex, NULL); // default (non-recursive)
pthread_cond_init(&pool->cond_work, NULL);
pthread_cond_init(&pool->cond_done, NULL);
// Create priority heap for task ordering
pool->work_queue = sub_1CBEC10(sub_1CB1770, 0);
// Spawn nmemb detached worker threads
for (int i = 0; i < nmemb; i++) {
pthread_create(&pool->thread_handles[i].tid, NULL,
start_routine, pool);
pthread_detach(pool->thread_handles[i].tid);
}
return pool;
}
Workers are detached immediately after creation. This means the pool does not use pthread_join for cleanup -- instead, it relies on the cond_done / max_threads countdown protocol in sub_1CB1970.
Work Queue: Priority Heap
The work queue at pool offset +8 is not a simple linked list -- it is a binary min-heap (priority queue) backed by a dynamically-resized array.
Heap structure (32 bytes):
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 8 | void** | Array of element pointers |
| 8 | 8 | int64 | Current element count |
| 16 | 8 | int64 | Allocated capacity |
| 24 | 8 | fn_ptr | Comparator function |
Constructor (sub_1CBEC10): Allocates the heap struct from the pool allocator, sets the comparator to sub_1CB1770 (which always returns 1, making the heap behave as a FIFO -- every parent "beats" every child, so new elements sink to the end).
Enqueue (sub_1CBECC0): Standard heap push with sift-up. Appends element, then bubbles up through parent comparisons. Auto-grows the backing array (doubles capacity) when full via sub_424C50 (realloc equivalent).
Dequeue (sub_1CBEDD0): Standard heap pop with sift-down. Extracts root, moves last element to root, then percolates down via comparator-guided child selection.
Since the comparator sub_1CB1770 unconditionally returns 1, the heap degenerates into FIFO order. Tasks are dispatched in submission order.
Task Nodes (24 bytes)
struct TaskNode { // malloc(0x18) = 24 bytes
void (*func)(void *arg); // +0: task function pointer
void *arg; // +8: argument passed to func
void *reserved; // +16: zeroed (unused)
};
Task Submit: sub_1CB1A50
int sub_1CB1A50(ThreadPool *pool, void (*func)(void*), void *arg) {
if (!func || !pool)
return 0;
TaskNode *task = malloc(0x18); // 24 bytes
task->func = func;
task->arg = arg;
task->reserved = NULL;
pthread_mutex_lock(&pool->mutex);
sub_1CBECC0(task, pool->work_queue); // heap push
++pool->pending;
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
return 1;
}
The broadcast wakes all sleeping workers, not just one. This is correct for the use case: multiple tasks are typically submitted in a burst (one per kernel), and all idle workers should attempt to pick up work immediately.
Worker Thread: start_routine (0x1CB1780)
void *start_routine(ThreadPool *pool) {
pthread_mutex_t *mtx = &pool->mutex;
pthread_cond_t *done = &pool->cond_done;
while (1) {
pthread_mutex_lock(mtx);
// Wait for work or shutdown
while (pool->pending == 0 && !pool->shutdown)
pthread_cond_wait(&pool->cond_work, mtx);
if (pool->shutdown)
goto exit;
// Dequeue task
TaskNode *task = (TaskNode *)sub_1CBEDD0(pool->work_queue);
--pool->pending;
++pool->active_count;
pthread_mutex_unlock(mtx);
// Execute task outside the lock
if (task) {
task->func(task->arg);
free(task);
}
// Post-execution bookkeeping
pthread_mutex_lock(mtx);
--pool->active_count;
if (!pool->shutdown && pool->active_count == 0 && pool->pending == 0)
pthread_cond_signal(done); // all work complete
pthread_mutex_unlock(mtx);
}
exit:
--pool->max_threads; // decrement live worker count
pthread_cond_signal(done); // notify shutdown waiter
pthread_mutex_unlock(mtx);
return NULL;
}
The completion signal on cond_done fires only when both active_count == 0 and pending == 0. This is the condition that sub_1CB1AE0 (wait-all) blocks on.
Wait-All: sub_1CB1AE0
Blocks until all submitted tasks complete:
void sub_1CB1AE0(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
}
In the non-shutdown case, it waits for pending == 0 && active_count == 0. During shutdown, it waits for max_threads == 0 (all workers have exited).
Destroy: sub_1CB1970
Initiates graceful shutdown:
void sub_1CB1970(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
// Drain any remaining queued tasks
sub_1CBEBF0(pool->work_queue); // free heap contents
pool->pending = 0;
pool->shutdown = 1;
// Wake all workers so they see the shutdown flag
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
// Wait for all workers to exit
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
// Destroy synchronization primitives
pthread_mutex_destroy(&pool->mutex);
pthread_cond_destroy(&pool->cond_work);
pthread_cond_destroy(&pool->cond_done);
free(pool->thread_handles);
free(pool);
}
The shutdown sequence is: (1) drain the queue, (2) set the shutdown flag, (3) broadcast on cond_work so all sleeping workers wake up and check the flag, (4) wait on cond_done until max_threads reaches zero (each exiting worker decrements it), (5) destroy synchronization primitives and free memory.
Multi-Kernel Parallel Compilation
The compilation driver (sub_446240) decides between serial and parallel execution based on the thread count at offset +668:
Serial Path (default)
for each kernel in compile_unit:
sub_43A400(kernel) // target configuration
sub_43CC70(kernel) // DAGgen → OCG → ELF → DebugInfo
Parallel Path (--split-compile / --allow-expensive-optimizations)
pool = sub_1CB18B0(thread_count)
for each kernel in compile_unit:
sub_43A400(kernel) // target config (still serial)
buf = allocate 360-byte work buffer from pool
snapshot 15 config vectors into buf
deep-copy hash maps for thread isolation
sub_1CB1A50(pool, sub_436DF0, buf)
sub_1CB1AE0(pool) // block until all kernels done
sub_1CB1970(pool) // destroy pool
Each task runs sub_436DF0, which performs the per-kernel backend pipeline:
- Set the thread-local program name via sub_430590
- Acquire a jobserver token (if --jobserver is active): sub_1CC6EC0()
- Record the start time
- sub_432500 -- run the full DAGgen+OCG pipeline
- Record the end time and write it to the timing array at a1->timing[112 * cu_index]
- Update the peak wall-clock counter (under lock via sub_607D70/sub_607D90)
- Release the jobserver token: sub_1CC7040()
- Free the 360-byte work buffer
Thread Isolation Strategy
Each worker thread operates on an isolated copy of compilation state:
| Resource | Isolation Mechanism |
|---|---|
| Memory pool | Per-thread pool pointer at TLS offset +24. Each thread's allocations go to a separate arena, eliminating heap contention. |
| Error state | Per-thread flags at TLS offsets 0--1 (error bytes), 8 (longjmp target), 16 (error descriptor), 48--52 (diagnostic control). Each thread tracks its own errors independently. |
| Hash maps | Deep-copied from the master compilation context before task submission. Workers never share mutable lookup tables. |
| Config vectors | Snapshot of 15 configuration vectors into a 360-byte per-task buffer. Workers read their own copy. |
| Timing data | Per-kernel slots in a pre-allocated timing array (112 bytes per kernel). Each worker writes only to its own kernel's slot. |
The only shared mutable state during parallel compilation is the peak wall-clock counter at offset +224 in the compilation driver's state block, protected by a global lock (lock index 6, via sub_607D70/sub_607D90). This lock is acquired briefly at the end of each per-kernel task to compare-and-update the maximum observed wall-clock time.
GNU Make Jobserver Integration
When both --jobserver and --split-compile are active, ptxas participates in GNU Make's parallel job token protocol. The compilation driver creates the jobserver client object before spawning the thread pool, and each per-kernel worker task must acquire a token before starting and release it when done. This throttles ptxas to never exceed the make -j N budget, even when --split-compile would otherwise use more threads.
Jobserver Object (296 bytes)
The jobserver state is a 296-byte heap object allocated once per compilation, stored at global qword_29FE128. The constructor (sub_1CC7AF0) is called from the compilation driver (sub_4428E0) when *(_BYTE*)(context + 993) is set (the --jobserver CLI flag).
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | int32 (atomic) | State code (0=OK; see state table below) |
| 4 | 4 | int32 | Saved errno from last failed syscall |
| 8 | 1 | byte | Implicit token available (1=unconsumed) |
| 16 | 8 | int64 | Pending waiters (threads blocked in acquire) |
| 24 | 8 | int64 | Active count (tokens currently held) |
| 32 | 8 | void* | Token buffer base (std::vector<char> data) |
| 40 | 8 | void* | Token buffer cursor (stack top) |
| 48 | 8 | void* | Token buffer capacity end |
| 56 | 40 | pthread_mutex_t | Inner mutex (guards token accounting) |
| 96 | 40 | pthread_mutex_t | Write mutex (guards write() to Make pipe) |
| 136 | 48 | pthread_cond_t | Condition variable (wakes acquire waiters and reader thread) |
| 184 | 1 | byte | Token-ready flag (set by reader thread / release handoff) |
| 185 | 1 | byte | Last byte read from Make pipe |
| 188 | 4 | int32 | Read fd (Make pipe/FIFO read end) |
| 192 | 4 | int32 | Write fd (Make pipe/FIFO write end) |
| 196 | 4 | int32 | Internal pipe read fd (shutdown wakeup) |
| 200 | 4 | int32 | Internal pipe write fd (shutdown wakeup) |
| 204 | 1 | byte | Opened-fds flag (1=ptxas opened the Make fds itself) |
| 205 | 1 | byte | Shutdown flag |
| 208 | 8 | void* | Reader thread handle (std::thread) |
| 216 | 80 | bytes | Outer mutexes (serializing full acquire/release operations) |
MAKEFLAGS Parser: sub_1CC7300
Called during object construction to detect the Make jobserver channel:
// sub_1CC7300 -- parse MAKEFLAGS, open pipe/FIFO
void sub_1CC7300(JobserverObject *obj) {
char *flags = getenv("MAKEFLAGS");
if (!flags) {
CAS(&obj->state, 5, 0); // no MAKEFLAGS
return;
}
std::string s(flags);
size_t pos = s.find("--jobserver-auth=");
if (pos == npos) {
CAS(&obj->state, 6, 0); // no --jobserver-auth=
return;
}
size_t val = pos + 17; // skip "--jobserver-auth="
if (s.substr(val, 5) == "fifo:") {
// --- FIFO mode ---
std::string path = s.substr(val + 5, next_space);
int fd = open(path.c_str(), O_RDWR | O_NONBLOCK); // 0x802
if (fd == -1) { CAS(&obj->state, 7, 0); return; }
obj->read_fd = fd; // same fd for both
obj->write_fd = fd;
obj->opened_fds = 1;
} else {
// --- Pipe mode ---
// parse "R,W" -- e.g. "3,4"
std::string r_str = s.substr(val, comma_pos - val);
std::string w_str = s.substr(comma_pos + 1, ...);
// validate: digits only
if (r_str.find_first_not_of("0123456789") != npos ||
w_str.find_first_not_of("0123456789") != npos) {
CAS(&obj->state, 7, 0); return;
}
int rd = dup(stoi(r_str)); // private copy
if (fcntl(rd, F_SETFD, FD_CLOEXEC) == -1) {
CAS(&obj->state, 7, 0); return;
}
int wd = dup(stoi(w_str));
if (fcntl(wd, F_SETFD, FD_CLOEXEC) == -1) {
close(rd);
CAS(&obj->state, 7, 0); return;
}
obj->read_fd = rd;
obj->write_fd = wd;
obj->opened_fds = 1;
}
}
| Protocol | --jobserver-auth= value | Detection | fd Setup |
|---|---|---|---|
| FIFO | fifo:/path/to/fifo | Prefix match on fifo: | open(path, O_RDWR|O_NONBLOCK) -- single fd for both read and write |
| Pipe | R,W (e.g. 3,4) | Comma-separated integers after auth= | dup() each fd + fcntl(F_SETFD, FD_CLOEXEC) -- prevents fd leak to children |
Object Construction: sub_1CC7AF0
After sub_1CC7300 succeeds (state == 0), the constructor continues:
- Creates an internal wakeup pipe via pipe() -- fds stored at +196/+200
- Spawns the reader thread (sub_1CC6720) -- passed as a std::thread functor via off_2406838
- Pre-allocates the token buffer vector to hold thread_count bytes
If state is 5 or 6 (no MAKEFLAGS, no auth string), the caller (sub_4428E0) emits: "GNU Jobserver support requested, but no compatible jobserver found. Ignoring '--jobserver'" and proceeds without throttling.
Reader Thread: sub_1CC6720
A dedicated background thread that reads tokens from the Make pipe/FIFO and buffers them for the acquire function:
loop:
if state != 0 or shutdown → exit
lock(mutex_inner)
while pending_waiters == 0 and not shutdown:
cond_wait(cond, mutex_inner) // sleep until someone needs a token
unlock(mutex_inner)
fd_set = { read_fd, internal_pipe_read }
select(max_fd + 1, &fd_set, NULL, NULL, NULL) // block
if shutdown → exit
n = read(read_fd, &byte, 1)
if n == 1:
if pending_waiters > 0:
lock(mutex_inner + mutex_write)
push byte onto token_buffer
token_ready = 1
unlock(mutex_write)
cond_signal(cond) // wake one acquire waiter
else:
write(write_fd, &byte, 1) // no waiter → return token immediately
else if errno == EAGAIN:
continue // expected for non-blocking fd
else:
state = 11; exit // I/O error
The select() monitors two fds simultaneously: the Make pipe (for incoming tokens) and the internal wakeup pipe (for shutdown notification). The internal pipe avoids a race between checking the shutdown flag and blocking in select().
Token Acquire: sub_1CC6EC0
Called by each per-kernel worker before compilation begins. Returns 0 on success.
int sub_1CC6EC0(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_0);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
if (obj->implicit_token_available) {
// Fast path: consume the implicit token (no pipe I/O)
obj->implicit_token_available = 0;
obj->active_count++;
unlock_all;
return 0;
}
// Slow path: wait for reader thread to supply a token
obj->pending_waiters++;
if (obj->pending_waiters == 1)
cond_signal(cond); // wake reader thread
while (!obj->token_ready && !obj->shutdown)
cond_wait(cond, mutex_inner);
if (obj->shutdown) { unlock_all; return 3; }
obj->token_ready = 0;
obj->pending_waiters--;
obj->active_count++;
unlock_all;
return 0;
}
The implicit token is the standard GNU Make convention: the parent Make gives the first child an implicit token (no byte in the pipe). The first worker to call acquire consumes it for free; subsequent workers must wait for bytes to be read from the pipe.
Token Release: sub_1CC7040
Called by each per-kernel worker after compilation completes. Returns 0 on success.
int sub_1CC7040(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_1);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
lock(mutex_write);
if (token_buffer not empty) {
// Path A: write a buffered byte back to Make pipe
byte = pop(token_buffer);
if (write(obj->write_fd, &byte, 1) == 1) {
obj->active_count--;
unlock_all;
return 0;
}
// write error → set state 11 or 2
}
unlock(mutex_write);
if (obj->pending_waiters > 0) {
// Path B: hand off directly to a waiting acquirer
obj->token_ready = 1;
obj->active_count--;
cond_signal(cond);
unlock_all;
return 0;
}
if (!obj->implicit_token_available && obj->active_count == 1) {
// Path C: return the implicit token
obj->implicit_token_available = 1;
obj->active_count = 0;
unlock_all;
return 0;
}
// Protocol error (double-free)
CAS(&obj->state, 12, 0);
unlock_all;
return 12;
}
Release has three paths, in priority order:
| Path | Condition | Action |
|---|---|---|
| A | Token buffer non-empty | Pop byte, write() back to Make pipe |
| B | No buffered token but waiters exist | Set token_ready, signal condvar (avoids pipe round-trip) |
| C | No buffered token, no waiters, last active | Restore implicit token flag |
Per-Kernel Worker Integration
In sub_436DF0 (the per-kernel compilation task submitted to the thread pool):
void sub_436DF0(int64_t *task) {
sub_430590("ptxas", kernel_name); // set TLS program name
if (task[5] && sub_1CC6EC0(task[5])) // acquire token if jobserver active
sub_42F590(FATAL); // acquire failed → fatal error
// ... sub_432500(): full DAGgen + OCG pipeline ...
if (!task[5] || !sub_1CC7040(task[5])) // release token
return; // normal return
sub_42F590(FATAL); // release failed → fatal error
}
task[5] is populated from qword_29FE128 during task dispatch in sub_4428E0. When --jobserver is not active, task[5] == 0 and both acquire/release calls are skipped.
Destroy: sub_1CC6C20
Called after sub_1CB1AE0 (wait-all) and sub_1CB1970 (pool destroy) complete:
- Set the shutdown flag (+205) via _InterlockedCompareExchange8
- Lock the inner mutex, signal the condvar (wake the reader thread), unlock
- Write 1 byte to the internal pipe write end (+200) -- unblocks select() in the reader thread
- Join the reader thread
- Lock the inner mutex, drain all buffered tokens by writing each byte back to write_fd
- Unlock the inner mutex
- Close the Make fds if opened_fds is set (for FIFO: close once, since read_fd == write_fd; for pipe: close both if different)
- Close the internal pipe fds (+196, +200)
- Destroy the condvar, free the token buffer, free the 296-byte object
State Machine
All state transitions use _InterlockedCompareExchange(state, new_value, 0) -- only the first error sticks; subsequent errors are silently dropped.
| State | Meaning | Set by |
|---|---|---|
| 0 | OK (operational) | Constructor |
| 2 | Unexpected I/O (short write/read) | Release, reader thread |
| 5 | No MAKEFLAGS environment variable | sub_1CC7300 |
| 6 | No --jobserver-auth= in MAKEFLAGS | sub_1CC7300 |
| 7 | open()/dup()/fcntl() failed | sub_1CC7300 |
| 11 | I/O error (errno recorded at +4) | Reader thread, release, constructor |
| 12 | Protocol error (double-free of token) | Release |
Throttling Semantics
With make -jN and --split-compile M where M > N:
ptxas creates M worker threads in the pool
but only N-1 pipe tokens exist + 1 implicit token = N total
workers that cannot acquire a token block in cond_wait
→ at most N kernels compile simultaneously, matching Make's budget
→ as each kernel finishes and releases its token, a blocked worker wakes
Without --jobserver, all M workers run freely with no external throttling.
Pool Allocator Thread Safety
The pool allocator (sub_424070) achieves thread safety through a combination of per-thread arenas and a global mutex:
- Per-thread arena: The TLS context at offset +24 holds a pointer to the current thread's memory pool. sub_424070 reads sub_4280C0()[3] to obtain this pointer. When non-NULL, allocations come from the thread-local slab without any locking.
- Global pool mutex: The pool struct contains a pthread_mutex_t at offset +7128 (within the ~7,200-byte pool descriptor). This mutex is acquired only for operations that modify the global pool state: slab acquisition from the parent pool, large-block allocation via mmap/malloc, and pool destruction.
- Slab-level lock freedom: Within a thread-local slab (56-byte descriptor), bump-pointer allocation requires no synchronization. The allocator advances a pointer and returns; only when the slab is exhausted does it acquire the global lock to request a new slab.
Recursive Mutex Pattern
All mutexes created by sub_428620 (the mutex factory used throughout ptxas) are PTHREAD_MUTEX_RECURSIVE:
bool sub_428620(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    return pthread_mutex_init(mutex, &attr) == 0;
}
This is necessary because ptxas functions may re-enter locking code paths through the diagnostic emitter (sub_42FBA0) or pool allocator (sub_424070), both of which are called from virtually everywhere.
Global Synchronization Objects
Global TLS Mutex (mutex at 0x29FE0xx)
Protects the global doubly-linked list of TLS contexts. Acquired during:
- TLS context allocation (sub_4280C0)
- TLS context destruction (destr_function)
- sub_4286A0 (explicit lock for cross-thread operations)
This is a recursive mutex (initialized in ctor_001).
Global Lock Array (sub_607D70 / sub_607D90)
A global array of locks accessed by index. Lock index 6 is used to protect the peak wall-clock counter during parallel compilation. The total number of lock indices and their complete purpose map is not fully recovered; index 6 is the only one observed in the threading hot path.
sub_607D70(6); // acquire lock 6
// update peak wall-clock
sub_607D90(6); // release lock 6
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x4094C0 | 204 B | 0 | ctor_001 -- TLS key + global mutex init (.init_array) |
| 0x427F10 | 376 B | 0 | destr_function -- TLS destructor (via pthread_key_create) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct, lazy alloc) |
| 0x428600 | 27 B | -- | Mutex destroy + free wrapper |
| 0x428620 | 62 B | -- | Recursive mutex init factory |
| 0x428670 | 6 B | -- | pthread_mutex_destroy PLT thunk |
| 0x428680 | 6 B | -- | pthread_mutex_lock PLT thunk |
| 0x428690 | 6 B | -- | pthread_mutex_unlock PLT thunk |
| 0x4286A0 | 163 B | -- | Global mutex lazy-init + lock |
| 0x1CB1770 | 8 B | 1 | Priority comparator (always returns 1 = FIFO) |
| 0x1CB1780 | 202 B | 0 | start_routine -- worker thread main loop |
| 0x1CB1890 | 11 B | -- | CPU count via sysconf(_SC_NPROCESSORS_CONF) |
| 0x1CB18B0 | 159 B | -- | Thread pool constructor (184-byte struct) |
| 0x1CB1970 | 168 B | -- | Thread pool graceful destroy |
| 0x1CB1A50 | 90 B | -- | Task submit (24-byte task node, heap push, broadcast) |
| 0x1CB1AE0 | 109 B | -- | Wait-all (block until pending=0, active=0) |
| 0x1CBEBF0 | -- | 1 | Heap drain (free all queued elements) |
| 0x1CBEC10 | -- | 1 | Priority heap constructor (32-byte struct) |
| 0x1CBECC0 | -- | -- | Priority heap push (sift-up) |
| 0x1CBEDD0 | -- | -- | Priority heap pop (sift-down) |
| 0x1CC6720 | ~700 B | 1 | Jobserver reader thread (select loop, pushes tokens to buffer) |
| 0x1CC6C20 | ~300 B | 1 | Jobserver destroy (drain tokens, close fds, free 296-byte object) |
| 0x1CC6EC0 | 384 B | 1 | Jobserver token acquire (consume implicit or wait for pipe token) |
| 0x1CC7040 | ~280 B | 1 | Jobserver token release (write byte back or hand off to waiter) |
| 0x1CC7300 | 2,027 B | 1 | Jobserver MAKEFLAGS parser (FIFO vs pipe detection, fd setup) |
| 0x1CC7AF0 | ~700 B | 1 | Jobserver constructor (alloc 296B, spawn reader thread) |
Cross-References
- Entry Point & CLI -- ctor_001 TLS init, sub_446240 serial vs parallel dispatch
- CLI Options -- --split-compile, --allow-expensive-optimizations, --jobserver
- Memory Pool Allocator -- per-thread arena via TLS offset +24, global pool mutex at +7128
- Pipeline Overview -- per-kernel compilation phases run as pool tasks
- Code Generation Overview -- sub_436DF0 per-kernel worker, timing lock 6
SASS Opcode Catalog
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Complete reference table of all SASS opcode mnemonics known to ptxas v13.0.88. Extracted from the ROT13-encoded opcode name table in the InstructionInfo constructor (sub_7A5D10, vtable off_233ADC0). The table stores exactly 322 named entries (indices 0--321) at object offset +0x1058, with each entry occupying 16 bytes (8-byte string pointer + 8-byte length). A parallel constructor sub_BE7390 initializes an identical table. Immediately after the name table, a 322-element identity-mapped index array (0x508 bytes of 4-byte integers 0..321) is bulk-copied from unk_21C0E00 to object offset +0x2478; this is a separate data structure (encoding category map), not additional opcode names.
All SASS mnemonic strings in the ptxas binary are ROT13-obfuscated. The cleartext names shown here are the result of applying ROT13 decoding to the stored strings.
Table Organization
Opcodes are partitioned by SM generation through explicit boundary markers embedded in the table:
| Index | Entry | Meaning |
|---|---|---|
| 0--135 | Base ISA | sm_70 (Volta) and all later architectures |
| 136 | SM70_LAST | End of sm_70 range |
| 137--170 | sm_73+ | Volta extensions (uniform registers, tensor shapes) |
| 171 | SM73_LAST | End of sm_73 range |
| 172--192 | sm_82+ | Ampere additions (MMA shapes, gather, REDUX) |
| 193 | SM82_LAST | End of sm_82 range |
| 194--198 | sm_86+ | Ampere+ additions (conversion packed, SUQUERY) |
| 199 | SM86_LAST | End of sm_86 range |
| 200--204 | sm_89+ | Ada Lovelace additions (QMMA shapes) |
| 205 | SM89_LAST | End of sm_89 range |
| 206--251 | sm_90+ | Hopper additions (GMMA, CGA barriers, fences, TMA) |
| 252 | SM90_LAST | End of sm_90 range |
| 253--279 | sm_100+ | Blackwell datacenter additions (UTC, QFMA4, MEMSET) |
| 280 | SM100_LAST | End of sm_100 range |
| 281--319 | sm_104+ | Blackwell Ultra additions (uniform FP, new conversions) |
| 320 | SM104_LAST | End of sm_104 range |
| 321 | LAST | Sentinel (end of table) |
Each SM generation only adds opcodes; no base opcodes are removed. The Ori IR uses the 12-bit index into this table as the base opcode field (instruction offset +72, lower 12 bits). Bits 12--13 of the opcode word encode sub-operation modifiers (.HI, .WIDE, etc.) and are stripped by the 0xFFFFCFFF mask to recover the base index.
Encoding Format Summary
SASS instructions use three widths, selected per opcode during encoding:
| Format Code | Width | Usage |
|---|---|---|
| 0x1 | 64-bit | Simple moves, branches, barriers, NOPs, short-form ALU |
| 0x2 | 128-bit | Most ALU, load/store, texture, tensor core, atomics |
| 0x8 | 256-bit | IMAD.WIDE variants with 16 constant-bank operand slots |
The 3-level opcode hierarchy within the encoded instruction word is: major (9 bits, at bits [8:16]) / minor (8 bits, at bits [17:24]) / sub-opcode (7 bits, at bits [25:31]). See the encoding page for full details.
Duplicate Mnemonic Entries
Five entries in the table share a SASS mnemonic with an earlier index. These are not errors in the table -- they are distinct IR opcodes that happen to produce the same assembly mnemonic but with different binary encodings, operand widths, or functional-unit routing. The duplicates fall into two categories:
Category A -- SM-generation re-introduction. The same operation is re-implemented for a newer GPU generation with a different SASS major opcode and encoding path, typically because the tensor core or ALU microarchitecture changed:
| Later Index | Earlier Index | Mnemonic | Why re-introduced |
|---|---|---|---|
| 215 (sm_90) | 180 (sm_82) | DMMA | Hopper warpgroup-aware TC path (enc. cat. 515 vs 434) |
| 220 (sm_90) | 14 (sm_70) | FMNMX | Hopper adds 5-entry operand sub-mode table (enc. cat. 534 vs 510) |
Category B -- Operand-width extension. Blackwell Ultra (sm_104) adds 64-bit operand variants of existing integer ALU instructions. The SASS printer appends a .64 suffix at render time; the IR name table stores the same base mnemonic for both widths:
| Later Index | Earlier Index | Mnemonic | What the later index adds |
|---|---|---|---|
| 284 (sm_104) | 37 (sm_70) | IMNMX | 32-bit form, new encoding path |
| 285 (sm_104) | 37 (sm_70) | IMNMX | 64-bit form (IMNMX.64, .64.UI, .64.LO) |
| 288 (sm_104) | 7 (sm_70) | ISETP | 64-bit comparison (ISETP.64, .64.UI, .64.LO) |
Binary evidence: in the constructor sub_7A5D10, indices 284 and 285 store identical "VZAZK" string pointers at adjacent 16-byte slots (v2+8728 and v2+8744). The SASS printer (sub_7CB560) maps them to IMNMX vs IMNMX.64 based on operand metadata.
Base ISA -- sm_70 (Volta) and Later (Indices 0--135)
These opcodes are available on all SM architectures supported by ptxas v13.0.
Integer Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 1 | VZNQ | IMAD | Integer multiply-add (32-bit) |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add, 32x32->64 result |
| 3 | VNQQ3 | IADD3 | Three-input integer add with carry |
| 4 | OZFX | BMSK | Generate bitmask from position and width |
| 5 | FTKG | SGXT | Sign-extend from specified bit position |
| 6 | YBC3 | LOP3 | Three-input logic operation (arbitrary LUT) |
| 7 | VFRGC | ISETP | Integer compare and set predicate (32-bit; re-introduced at index 288 for sm_104 with 64-bit support) |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address (shift-add) |
| 10 | FUS | SHF | Funnel shift (concatenate two regs, shift) |
| 33 | VQC | IDP | Integer dot product (4-element) |
| 34 | VQR | IDE | Integer dot expand |
| 37 | VZAZK | IMNMX | Integer min/max (32-bit only; re-introduced at indices 284--285 for sm_104 with 32/64-bit split) |
| 38 | CBCP | POPC | Population count (count set bits) |
| 39 | SYB | FLO | Find leading one (bit scan) |
| 53 | OERI | BREV | Bit reverse |
FP32 Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 11 | SSZN | FFMA | FP32 fused multiply-add |
| 12 | SNQQ | FADD | FP32 add |
| 13 | SZHY | FMUL | FP32 multiply |
| 14 | SZAZK | FMNMX | FP32 min/max (base encoding cat. 510; re-introduced at index 220 for sm_90 with extended operand modes) |
| 15 | SFJMNQQ | FSWZADD | FP32 swizzle add (cross-lane partial reduction) |
| 16 | SFRG | FSET | FP32 compare and set result register |
| 17 | SFRY | FSEL | FP32 select (conditional move) |
| 18 | SFRGC | FSETP | FP32 compare and set predicate |
| 40 | SPUX | FCHK | FP check for NaN/Inf/denorm |
| 42 | ZHSH | MUFU | Multi-function unit: RCP, RSQ, SIN, COS, EX2, LG2, RCP64H, RSQ64H |
FP64 Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 122 | QSZN | DFMA | FP64 fused multiply-add |
| 123 | QNQQ | DADD | FP64 add |
| 124 | QZHY | DMUL | FP64 multiply |
| 125 | QFRGC | DSETP | FP64 compare and set predicate |
FP16 Packed Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 126 | UNQQ2 | HADD2 | Packed FP16x2 add |
| 127 | UNQQ2_S32 | HADD2_F32 | Packed FP16x2 add with FP32 accumulator |
| 128 | USZN2 | HFMA2 | Packed FP16x2 fused multiply-add |
| 129 | UZHY2 | HMUL2 | Packed FP16x2 multiply |
| 130 | UFRG2 | HSET2 | Packed FP16x2 compare and set |
| 131 | UFRGC2 | HSETP2 | Packed FP16x2 compare and set predicate |
Type Conversion
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 35 | V2V | I2I | Integer to integer conversion (width/sign change) |
| 36 | V2VC | I2IP | Integer to integer, packed variant |
| 43 | S2S | F2F | Float to float conversion (precision change) |
| 44 | S2S_K | F2F_X | Float to float, extended (with carry chain) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer, extended |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float, extended |
| 49 | SEAQ | FRND | FP round to integer (within FP format) |
| 50 | SEAQ_K | FRND_X | FP round, extended |
Data Movement
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 19 | ZBI | MOV | Move register to register |
| 20 | FRY | SEL | Predicated select (ternary conditional) |
| 21 | C2E | P2R | Pack predicate registers into GPR |
| 22 | E2C | R2P | Unpack GPR bits into predicate registers |
| 24 | CEZG | PRMT | Byte-level permute (4-byte shuffle) |
| 41 | VCN | IPA | Interpolate pixel attribute (fragment shader) |
| 57 | F2E | S2R | Read special register to GPR |
| 27 | PF2E_32 | CS2R_32 | Control/status register to GPR (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status register to GPR (64-bit) |
Predicate Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 23 | CYBC3 | PLOP3 | Three-input predicate logic (arbitrary LUT) |
| 26 | IBGR | VOTE | Warp-wide vote (ballot/any/all/unanimity) |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference, 4-way |
Memory -- Load/Store
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 89 | YQP | LDC | Load from constant memory bank c[bank][offset] |
| 90 | NYQ | ALD | Attribute load (vertex/fragment attributes) |
| 91 | NFG | AST | Attribute store |
| 94 | YQF | LDS | Load from shared memory |
| 95 | FGF | STS | Store to shared memory |
| 96 | YQT | LDG | Load from global memory |
| 97 | FGT | STG | Store to global memory |
| 98 | YQY | LDL | Load from local memory (per-thread stack) |
| 99 | FGY | STL | Store to local memory |
| 100 | YQ | LD | Load, generic address space |
| 101 | FG | ST | Store, generic address space |
Atomic and Reduction
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 102 | NGBZ | ATOM | Atomic operation (generic address space) |
| 103 | NGBZT | ATOMG | Atomic operation (global memory) |
| 104 | ERQ | RED | Reduction (global memory, fire-and-forget) |
| 105 | NGBZF | ATOMS | Atomic operation (shared memory) |
Cache and Memory Control
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 106 | DFCP | QSPC | Query address space type |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control, no scoreboard wait |
| 108 | PPGY | CCTL | Cache control (invalidate/writeback/etc.) |
| 109 | PPGYY | CCTLL | Cache control, L2 level |
| 110 | PPGYG | CCTLT | Cache control, texture cache |
| 111 | ZRZONE | MEMBAR | Memory barrier (fence) |
Texture Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 83 | GRK | TEX | Texture fetch (filtered sample) |
| 84 | GYQ | TLD | Texture load (unfiltered, integer coords) |
| 85 | GYQ4 | TLD4 | Texture gather (fetch 4 texels for bilinear) |
| 86 | GZZY | TMML | Query texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with explicit derivatives |
| 88 | GKD | TXQ | Texture query (dimensions, levels, format) |
Surface Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |
Graphics Pipeline
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 51 | NY2C | AL2P | Attribute location to patch offset |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch, indexed variant |
| 92 | BHG | OUT | Tessellation output emit |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output emit (final, cut primitive) |
| 116 | CVKYQ | PIXLD | Pixel information load (coverage, sample mask) |
| 117 | VFOREQ | ISBERD | Indexed set buffer for read (bindless) |
| 118 | VFORJE | ISBEWR | Indexed set buffer for write (bindless) |
Control Flow
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 67 | OEN | BRA | Branch (relative) |
| 68 | OEK | BRX | Branch indirect (register target) |
| 69 | WZC | JMP | Jump (absolute) |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return from function |
| 73 | OFFL | BSSY | Push convergence point onto branch sync stack |
| 74 | OERNX | BREAK | Break out of convergence region |
| 77 | RKVG | EXIT | Thread exit |
| 76 | XVYY | KILL | Kill thread (discard fragment) |
| 75 | OCG | BPT | Breakpoint trap (debugger) |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync (pop convergence stack, reconverge) |
Synchronization and Warp
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 54 | OZBI_O | BMOV_B | Barrier move (barrier register, B variant) |
| 55 | OZBI_E | BMOV_R | Barrier move (barrier register, R variant) |
| 56 | OZBI | BMOV | Barrier move |
| 58 | O2E | B2R | Barrier register to GPR |
| 59 | E2O | R2B | GPR to barrier register |
| 61 | ONE | BAR | Named barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier, indexed variant |
| 66 | QRCONE | DEPBAR | Dependency barrier (wait for scoreboard) |
| 80 | ZNGPU | MATCH | Warp match (find lanes with same value) |
| 119 | FUSY | SHFL | Warp shuffle (cross-lane data exchange) |
| 120 | JNECFLAP | WARPSYNC | Warp-wide synchronization barrier |
| 81 | ANABFYRRC | NANOSLEEP | Thread sleep for specified nanoseconds |
| 82 | ANABGENC | NANOTRAP | Nano trap (lightweight trap) |
System and Miscellaneous
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal pseudo-instruction) |
| 25 | ABC | NOP | No-operation |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | PFZGRFG | CSMTEST | CSM (compute shader model) test |
| 60 | YRCP | LEPC | Load effective PC (get current instruction address) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA (thread block) ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base address |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base address |
| 121 | LVRYQ | YIELD | Yield execution (internal, scheduler hint) |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo-opcode, lowered before encoding) |
Tensor Core (Base)
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 132 | UZZN_16 | HMMA_16 | FP16 matrix multiply-accumulate, 16-wide |
| 133 | UZZN_32 | HMMA_32 | FP16 matrix multiply-accumulate, 32-wide |
| 134 | VZZN | IMMA | Integer matrix multiply-accumulate |
sm_73 Extensions (Indices 137--171)
Volta+ additions. Primarily uniform register variants and additional tensor core shapes.
Uniform Register Operations
Uniform registers (UR0--UR63) hold values shared across the warp, enabling scalar execution of warp-uniform computations.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 138 | HOERI | UBREV | Uniform bit reverse |
| 139 | HOZFX | UBMSK | Uniform bitmask |
| 140 | HPYRN | UCLEA | Uniform clear address |
| 141 | HVFRGC | UISETP | Uniform integer set-predicate |
| 142 | HYQP | ULDC | Uniform load constant |
| 143 | HYRN | ULEA | Uniform load effective address |
| 144 | HC2HE | UP2UR | Uniform predicate to uniform register |
| 145 | HYBC3 | ULOP3 | Uniform three-input logic |
| 146 | HCYBC3 | UPLOP3 | Uniform predicate three-input logic |
| 147 | HFRY | USEL | Uniform select |
| 148 | HFTKG | USGXT | Uniform sign-extend |
| 149 | HSYB | UFLO | Uniform find leading one |
| 150 | HVNQQ3 | UIADD3 | Uniform three-input integer add |
| 151 | HVZNQ | UIMAD | Uniform integer multiply-add |
| 152 | HZBI | UMOV | Uniform move |
| 153 | HCEZG | UPRMT | Uniform byte permute |
| 154 | IBGRH | VOTEU | Uniform vote |
| 155 | HCBCP | UPOPC | Uniform population count |
| 156 | HFUS | USHF | Uniform funnel shift |
Additional sm_73 Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 157 | FPNGGRE | SCATTER | Scatter write |
| 158 | S2SC | F2FP | Float to float, packed conversion |
| 159 | UZZN_1688 | HMMA_1688 | FP16 MMA, 16x8x8 shape |
| 160 | UZZN_16816 | HMMA_16816 | FP16 MMA, 16x8x16 shape |
| 161 | OZZN | BMMA | Binary (1-bit) matrix multiply-accumulate |
| 162 | GGHPPGY | TTUCCTL | Tensor texture unit cache control |
| 163 | GGHZNPEB | TTUMACRO | Tensor texture unit macro |
| 164 | E2HE | R2UR | GPR to uniform register |
| 165 | ZBIZ | MOVM | Move with mask |
| 166 | YQFZ | LDSM | Load from shared memory to matrix register |
| 167 | YQGENZ | LDTRAM | Load from TRAM (transposed shared memory) |
| 168 | SBBGCEVAG | FOOTPRINT | Texture footprint query |
| 169 | F2HE | S2UR | Special register to uniform register |
| 170 | OEKH | BRXU | Branch indirect, uniform target |
sm_82 Extensions (Indices 172--193)
Ampere additions. New MMA shapes, gather/scatter metadata, and reduction variants.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 173 | TNGURE | GATHER | Gather (multi-address load) |
| 174 | TRAZRGNQNGN | GENMETADATA | Generate metadata (for sparse MMA) |
| 175 | FCZRGNQNGN | SPMETADATA | Sparse metadata |
| 176 | OZZN_88128 | BMMA_88128 | Binary MMA, 8x8x128 shape |
| 177 | OZZN_168128 | BMMA_168128 | Binary MMA, 16x8x128 shape |
| 178 | OZZN_168256 | BMMA_168256 | Binary MMA, 16x8x256 shape |
| 179 | PYZNQ | CLMAD | Carry-less multiply-add (GF(2) arithmetic) |
| 180 | QZZN | DMMA | FP64 matrix multiply-accumulate (Ampere; encoding category 434; re-introduced at index 215 for Hopper with different TC path) |
| 181 | UZZN_FC_1688 | HMMA_SP_1688 | FP16 sparse MMA, 16x8x8 |
| 182 | USZN2_ZZN | HFMA2_MMA | FP16 FMA2, MMA variant |
| 183 | UZAZK2 | HMNMX2 | Packed FP16x2 min/max |
| 184 | VZZN_88 | IMMA_88 | Integer MMA, 8x8 shape |
| 185 | VZZN_FC_88 | IMMA_SP_88 | Integer sparse MMA, 8x8 |
| 186 | VZZN_16816 | IMMA_16816 | Integer MMA, 16x8x16 |
| 187 | VZZN_16832 | IMMA_16832 | Integer MMA, 16x8x32 |
| 188 | VZZN_FC_16832 | IMMA_SP_16832 | Integer sparse MMA, 16x8x32 |
| 189 | NEEVIRF | ARRIVES | Async barrier arrive signal |
| 190 | YQTQRCONE | LDGDEPBAR | Load-global dependency barrier |
| 191 | YQTFGF | LDGSTS | Load-global, store-to-shared (async copy) |
| 192 | ERQHK | REDUX | Warp-wide reduction (uniform result) |
sm_86 Extensions (Indices 194--199)
Ampere+ (GA106/GA107) additions.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 195 | S2VC | F2IP | Float to integer, packed |
| 196 | HS2SC | UF2FP | Uniform float to float, packed |
| 197 | V2SC | I2FP | Integer to float, packed |
| 198 | FHDHREL | SUQUERY | Surface query (dimensions, format) |
sm_89 Extensions (Indices 200--205)
Ada Lovelace additions. Quarter-precision MMA shapes for FP8/INT4.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 201 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA, 16x8x16 (FP8) |
| 202 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA, 16x8x32 |
| 203 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA, 16x8x32 |
| 204 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA, 128x64 |
sm_90 Extensions (Indices 206--252)
Hopper additions. Major expansion: CGA (Cooperative Grid Array) barriers, fences, GMMA (Group MMA), TMA (Tensor Memory Accelerator), and collective operations.
CGA Barriers and Synchronization
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 207 | NPDOYX | ACQBLK | Acquire block (CTA resource acquisition) |
| 208 | PTNONE_NEI | CGABAR_ARV | CGA barrier arrive |
| 209 | PTNONE_TRG | CGABAR_GET | CGA barrier get (query state) |
| 210 | PTNONE_FRG | CGABAR_SET | CGA barrier set |
| 211 | PTNONE_JNVG | CGABAR_WAIT | CGA barrier wait |
| 212 | PTNREEONE | CGAERRBAR | CGA error barrier |
Collective and Election
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 213 | PERNGRCBYVPL | CREATEPOLICY | Create scheduling/cache policy |
| 214 | PIGN | CVTA | Convert address space (generic to specific) |
| 215 | QZZN | DMMA | FP64 matrix multiply-accumulate (Hopper re-introduction; encoding category 515 vs 434 for index 180; uses warpgroup-aware tensor core path, shared dispatch with CVTA at case 0xD6/0xD7 in sub_6575D0) |
| 216 | RYRPG | ELECT | Elect a leader lane in warp |
| 217 | RAQPBYYRPGVIR | ENDCOLLECTIVE | End collective operation scope |
Fences
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 218 | SRAPR_T | FENCE_G | Fence, global scope |
| 219 | SRAPR_F | FENCE_S | Fence, shared/CTA scope |
| 220 | SZAZK | FMNMX | FP32 min/max (Hopper re-introduction; encoding category 534 vs 510 for index 14; adds 5-entry operand sub-mode table via dword_2026FC0 for extended rounding/precision modes not in base encoding) |
GMMA (Group Matrix Multiply-Accumulate)
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 221 | TZZN | GMMA | Group (warpgroup) matrix multiply-accumulate |
Memory Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 222 | YQPH | LDCU | Load constant, uniform (warp-coherent constant load) |
| 223 | YRCP | LEPC | Load effective PC (sm_90 variant) |
| 224 | ZNCN | MAPA | Map address (for TMA address translation) |
| 225 | CERRKVG | PREEXIT | Pre-exit (cleanup before thread exit) |
| 226 | E2HE_U | R2UR_H | Register to uniform register, high half |
| 227 | ERQNF | REDAS | Reduction, async (fire-and-forget with arrive) |
Configuration
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 228 | FRGZNKERT | SETMAXREG | Set maximum register count for dynamic partitioning |
| 229 | FRGFZRZFVMR | SETSMEMSIZE | Set shared memory size dynamically |
| 230 | FGNF | STAS | Store async (to shared, with barrier) |
| 231 | FGFZ | STSM | Store to shared memory, matrix layout |
Synchronization Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 232 | FLAPF_ONFVP | SYNCS_BASIC | Sync scope, basic |
| 233 | FLAPF_YQ_HAVSZ | SYNCS_LD_UNIFM | Sync scope with uniform load |
Uniform Block Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 234 | HOYXPC | UBLKCP | Uniform block copy |
| 235 | HOYXERQ | UBLKRED | Uniform block reduction |
| 236 | HOYXCS | UBLKPF | Uniform block prefetch |
| 237 | HPIGN | UCVTA | Uniform convert address space |
| 238 | HYRCP | ULEPC | Uniform load effective PC |
| 239 | HZNCN | UMAPA | Uniform map address |
TMA (Tensor Memory Accelerator) Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 240 | HGZNPPGY | UTMACCTL | TMA cache control |
| 241 | HGZNPZQSYHFU | UTMACMDFLUSH | TMA command flush |
| 242 | HGZNYQT | UTMALDG | TMA load global |
| 243 | HGZNCS | UTMAPF | TMA prefetch |
| 244 | HGZERQT | UTMREDG | TMA reduction global |
| 245 | HGZNYFG | UTMALST | TMA load/store |
Vector Min/Max Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 246 | IUZAZK | VHMNMX | Vector half min/max (FP16x2) |
| 247 | IVNQQ | VIADD | Vector integer add |
| 248 | IVNQQZAZK | VIADDMNMX | Vector integer add with min/max |
| 249 | IVZAZK | VIMNMX | Vector integer min/max |
| 250 | IVZAZK3 | VIMNMX3 | Vector integer three-input min/max |
| 251 | JNECTEBHC | WARPGROUP | Warpgroup collective operation |
sm_100 Extensions (Indices 253--280)
Blackwell datacenter additions. UTC (Unified Tensor Core) operations, quad-precision FP, FP32x2 packed operations, and tensor core swizzle load/store.
Packed FP32 and Reduction
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 254 | PERQHK | CREDUX | CTA-scope reduction (cross-warp) |
| 255 | SNQQ2 | FADD2 | Packed FP32x2 add |
| 256 | SSZN2 | FFMA2 | Packed FP32x2 fused multiply-add |
| 257 | SZAZK3 | FMNMX3 | FP32 three-input min/max |
| 258 | SZHY2 | FMUL2 | Packed FP32x2 multiply |
Tensor Memory
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 259 | YQGZ | LDTM | Load via tensor memory (5th-gen tensor core) |
| 260 | HTRGARKGJBEXVQ | UGETNEXTWORKID | Uniform get next work ID (dynamic scheduling) |
UTC (Unified Tensor Core) Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 261 | HGPONE_1PGN | UTCBAR_1CTA | UTC barrier, 1 CTA scope |
| 262 | HGPONE_2PGN | UTCBAR_2CTA | UTC barrier, 2 CTA scope |
| 263 | HGPPC_1PGN | UTCCP_1CTA | UTC copy, 1 CTA scope |
| 264 | HGPPC_2PGN | UTCCP_2CTA | UTC copy, 2 CTA scope |
| 265 | HGPZZN_1PGN | UTCMMA_1CTA | UTC MMA, 1 CTA scope |
| 266 | HGPZZN_2PGN | UTCMMA_2CTA | UTC MMA, 2 CTA scope |
| 267 | HGPFUVSG_1PGN | UTCSHIFT_1CTA | UTC shift, 1 CTA scope |
| 268 | HGPFUVSG_2PGN | UTCSHIFT_2CTA | UTC shift, 2 CTA scope |
Tensor Core Swizzle
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 269 | IVEGPBHAG | VIRTCOUNT | Virtual thread count query |
| 270 | GPNGBZFJF | TCATOMSWS | Tensor core atomic with swizzle |
| 271 | GPYQFJF | TCLDSWS | Tensor core load with swizzle |
| 272 | GPFGFJF | TCSTSWS | Tensor core store with swizzle |
Quad-Precision FP
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 273 | DSZN4 | QFMA4 | Quad-element FP fused multiply-add |
| 274 | DNQQ4 | QADD4 | Quad-element FP add |
| 275 | DZHY4 | QMUL4 | Quad-element FP multiply |
Additional sm_100
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 276 | ZRZFRG | MEMSET | Memory set (block fill) |
| 277 | NPDFUZVAVG | ACQSHMINIT | Acquire shared memory and initialize |
| 278 | FGGZ | STTM | Store via tensor memory |
| 279 | SRAPR_G | FENCE_T | Fence, tensor scope |
sm_104 Extensions (Indices 281--320)
Blackwell Ultra additions. Uniform FP operations, additional integer widths, conversion variants, MMA shape extensions, and MXQMMA sparse variants.
Integer Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 282 | VNQQ | IADD | Integer add (two-input, distinct from IADD3) |
| 283 | HIVNQQ | UVIADD | Uniform vector integer add |
| 284 | VZAZK | IMNMX | Integer min/max, 32-bit operands (sm_104 re-introduction; new Blackwell Ultra encoding path distinct from base index 37) |
| 285 | VZAZK | IMNMX | Integer min/max, 64-bit operands (SASS prints as IMNMX.64; consecutive with 284 to form the 32/64-bit pair; .64.UI and .64.LO sub-modifiers select unsigned/low-half comparison modes) |
| 286 | HVZAZK | UIMNMX | Uniform integer min/max |
| 287 | HIVZAZK | UVIMNMX | Uniform vector integer min/max |
| 288 | VFRGC | ISETP | Integer set-predicate (sm_104 re-introduction; supports 64-bit operand comparison as ISETP.64 with .64.UI/.64.LO sub-modifiers; new encoding path, case 0x120 in sub_7482B0 and sub_8380A0) |
| 289 | HVFRGC | UISETP | Uniform integer set-predicate (sm_104 re-introduction of index 141; pairs with ISETP index 288 for 64-bit uniform comparison) |
Data Movement Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 290 | ZBI | MOV | Move (sm_104 variant) |
| 291 | HZBI | UMOV | Uniform move (sm_104 variant) |
| 292 | FRY | SEL | Select (sm_104 variant) |
| 293 | HFRY | USEL | Uniform select (sm_104 variant) |
Uniform FP Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 294 | HSNQQ | UFADD | Uniform FP add |
| 295 | HSFRY | UFSEL | Uniform FP select |
| 296 | HSSZN | UFFMA | Uniform FP fused multiply-add |
| 297 | HSZHY | UFMUL | Uniform FP multiply |
| 298 | HSFRG | UFSET | Uniform FP compare and set |
| 299 | HSFRGC | UFSETP | Uniform FP compare and set predicate |
Uniform Conversion
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 300 | HV2V | UI2I | Uniform integer to integer conversion |
| 301 | HV2VC | UI2IP | Uniform integer to integer, packed |
| 302 | HS2S | UF2F | Uniform float to float |
| 303 | HSEAQ | UFRND | Uniform FP round |
| 304 | HS2V | UF2I | Uniform float to integer |
| 305 | HS2VC | UF2IP | Uniform float to integer, packed |
| 306 | HV2S | UI2F | Uniform integer to float |
| 307 | HV2SC | UI2FP | Uniform integer to float, packed |
| 308 | HVNOF | UIABS | Uniform integer absolute value |
| 309 | PF2HE | CS2UR | Control/status register to uniform register |
| 310 | HS2SC | UF2FP | Uniform float to float, packed (sm_104 variant) |
MMA Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 311 | ZKDZZN_FS_16832 | MXQMMA_SF_16832 | Mixed-quantized structured-sparse MMA, 16x8x32 |
| 312 | BZZN_16864 | OMMA_16864 | Operand MMA, 16x8x64 shape |
| 313 | BZZN_FC_168128 | OMMA_SP_168128 | Operand sparse MMA, 16x8x128 |
| 314 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA (sm_104 variant) |
| 315 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA (sm_104 variant) |
| 316 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA (sm_104 variant) |
| 317 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA (sm_104 variant) |
| 318 | DZZN_FS_16832 | QMMA_SF_16832 | Quarter-precision structured sparse MMA |
| 319 | DZZN_FS_FC_16864 | QMMA_SF_SP_16864 | Quarter-precision structured+unstructured sparse MMA |
Boundary Markers
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 136 | FZ70_YNFG | SM70_LAST | End of sm_70 base ISA |
| 137 | FZ73_SVEFG | SM73_FIRST | Start of sm_73 extensions |
| 171 | FZ73_YNFG | SM73_LAST | End of sm_73 |
| 172 | FZ82_SVEFG | SM82_FIRST | Start of sm_82 extensions |
| 193 | FZ82_YNFG | SM82_LAST | End of sm_82 |
| 194 | FZ86_SVEFG | SM86_FIRST | Start of sm_86 extensions |
| 199 | FZ86_YNFG | SM86_LAST | End of sm_86 |
| 200 | FZ89_SVEFG | SM89_FIRST | Start of sm_89 extensions |
| 205 | FZ89_YNFG | SM89_LAST | End of sm_89 |
| 206 | FZ90_SVEFG | SM90_FIRST | Start of sm_90 extensions |
| 252 | FZ90_YNFG | SM90_LAST | End of sm_90 |
| 253 | FZ100_SVEFG | SM100_FIRST | Start of sm_100 extensions |
| 280 | FZ100_YNFG | SM100_LAST | End of sm_100 |
| 281 | FZ104_SVEFG | SM104_FIRST | Start of sm_104 extensions |
| 320 | FZ104_YNFG | SM104_LAST | End of sm_104 |
| 321 | YNFG | LAST | End-of-table sentinel |
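All table names above are stored ROT13-encoded; only letters rotate, while digits, dots, and underscores pass through unchanged (FZ104_YNFG decodes to SM104_LAST). A minimal decoder equivalent to the inline decode loop in the lookup path — the function name is illustrative, not a recovered symbol:

```c
#include <string.h>

/* Decode a ROT13-obfuscated table name in place. Only A-Z/a-z rotate;
   digits, '.', and '_' pass through, matching the entries above.
   (Function name is illustrative, not a recovered symbol.) */
static char *rot13_decode(char *s)
{
    for (char *p = s; *p; ++p) {
        if (*p >= 'A' && *p <= 'Z')
            *p = 'A' + (*p - 'A' + 13) % 26;
        else if (*p >= 'a' && *p <= 'z')
            *p = 'a' + (*p - 'a' + 13) % 26;
    }
    return s;
}
```

ROT13 is an involution, so the same routine also turns a SASS mnemonic back into its table key (IMAD encodes to VZNQ).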
Encoding Category Map at unk_21C0E00
The 0x508 bytes (1288 bytes) at unk_21C0E00 are not additional opcode names. They are a 322-element int32 array mapping each opcode index to an encoding category number -- a level of indirection between opcode indices and binary encoding format descriptors.
Binary Evidence
- RSI is loaded with 0x21C0E00 (at 0x7A5D9F: mov $0x21c0e00, %esi)
- RDI is set to obj+0x2478 (at 0x7A5D82: lea 0x2478(%rbx), %rdi)
- RCX is set to 161 (at 0x7A5D22: mov $0xa1, %r13d; 0x7A5D69: mov %r13, %rcx)
- The rep movsq at 0x7A791D copies 161 quadwords = 1288 bytes = 322 x 4 bytes
The destination offset +0x2478 (decimal 9336) is immediately after the 322-entry name table (+4184 through +9328). Three arch-specific constructors each populate this array from a different static source table:
| Constructor | Source Table | Map Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped (selected entries differ) |
| sub_BE7390 | unk_22B2320 | Arch-remapped (selected entries differ) |
Reader: sub_1377C60 (SASS Mnemonic Lookup)
The SASS mnemonic lookup function at sub_1377C60 reads this map at line 292:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336); // encoding_category_map[opcode_index]
After matching an input mnemonic string against the ROT13 name table (with inline decoding at lines 264-273), the function reads encoding_category_map[opcode_index] and uses the result as a hash key -- combined with a 24-bit architecture discriminator via FNV-1a -- to look up the encoding format descriptor in the hash table at InstructionInfo+10672.
This is why duplicate mnemonics (e.g. DMMA at indices 180 and 215, or FMNMX at indices 14 and 220) can have different encoding categories (434 vs 515, 510 vs 534): the category map provides the indirection needed to select different binary encoders for the same mnemonic across architectures. The opcode name table has exactly 322 entries and no more.
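The encode-side path therefore reduces to: index the category map at +0x2478, then hash (category, arch) into the descriptor table. The sketch below is a reconstruction — the identity initialization matches sub_7A5D10, the FNV-1a constants are the standard 32-bit ones, and the key byte order is an assumption for illustration, not recovered from the binary:

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_OPCODES 322

/* Encoding category map as built by the base constructor sub_7A5D10:
   identity. The arch constructors (sub_7C5410, sub_BE7390) instead copy
   remapped static tables, changing selected entries. */
static int32_t category_map[NUM_OPCODES];

static void init_base_category_map(void)
{
    for (int i = 0; i < NUM_OPCODES; ++i)
        category_map[i] = i;
}

/* Standard 32-bit FNV-1a over an arbitrary byte string. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;
    while (len--) {
        h ^= *data++;
        h *= 16777619u;
    }
    return h;
}

/* Descriptor hash key: encoding category combined with a 24-bit
   architecture discriminator. Little-endian byte order is assumed. */
static uint32_t descriptor_hash_key(int opcode_index, uint32_t arch24)
{
    uint32_t cat = (uint32_t)category_map[opcode_index];
    uint8_t key[7] = {
        (uint8_t)cat, (uint8_t)(cat >> 8),
        (uint8_t)(cat >> 16), (uint8_t)(cat >> 24),
        (uint8_t)arch24, (uint8_t)(arch24 >> 8), (uint8_t)(arch24 >> 16),
    };
    return fnv1a(key, sizeof key);
}
```

Even under the identity map, FMNMX at indices 14 and 220 hash to different keys; the arch-remapped maps push those indices to categories 510 and 534, selecting different binary encoders for the same mnemonic.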
Opcode Category Summary
| Category | Base ISA | sm_73+ | sm_82+ | sm_86+ | sm_89+ | sm_90+ | sm_100+ | sm_104+ | Total |
|---|---|---|---|---|---|---|---|---|---|
| Integer ALU | 16 | 10 | 1 | 0 | 0 | 2 | 0 | 5 | 34 |
| FP32 | 10 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 15 |
| FP64 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 |
| FP16 | 6 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 8 |
| Conversion | 10 | 1 | 0 | 3 | 0 | 0 | 0 | 10 | 24 |
| Data Movement | 9 | 5 | 0 | 0 | 0 | 2 | 0 | 5 | 21 |
| Predicate/Vote | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| Load/Store | 11 | 3 | 2 | 0 | 0 | 5 | 2 | 0 | 23 |
| Atomic/Reduce | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6 |
| Cache/Fence | 6 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 11 |
| Texture | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| Surface | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Control Flow | 13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 15 |
| Sync/Warp | 10 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 14 |
| Tensor Core | 3 | 3 | 10 | 0 | 4 | 1 | 9 | 9 | 39 |
| TMA | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
| Uniform Block | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 6 | 10 |
| CGA/Collective | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
| Graphics | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| System/Misc | 7 | 0 | 1 | 0 | 0 | 4 | 2 | 0 | 14 |
| Boundaries | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 16 |
Encoding Format Correlation
From the encoding page analysis, the approximate distribution of 64-bit vs 128-bit formats for the base ISA:
64-bit format (format code 0x1): NOP, BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, BPT, KILL, RTT, BAR, DEPBAR, WARPSYNC, BMOV, B2R, R2B, S2R, CS2R, MOV (short form), YIELD, ERRBAR, NANOSLEEP, NANOTRAP, SHFL. These are primarily control-flow, barriers, and simple data movement instructions that need fewer operand bits.
128-bit format (format code 0x2): All ALU operations (IMAD, IADD3, FFMA, FADD, FMUL, LOP3, ISETP, FSETP, etc.), all memory operations (LDG, STG, LDS, STS, LDL, STL, LD, ST, LDC), all atomics (ATOM, ATOMG, ATOMS, RED), all texture operations (TEX, TLD, TLD4, TMML, TXD, TXQ), all surface operations, tensor core operations (HMMA, IMMA, BMMA, GMMA, etc.), conversion instructions, and most uniform register operations.
256-bit format (format code 0x8): IMAD.WIDE variants with 16 constant-bank operand slots. Extremely rare -- only 2 encoder functions use this format.
The 64-bit short-form encoders cover 27 opcode classes across 174 encoder functions total. The 128-bit encoders cover the remaining ~75+ opcode classes across 912+ encoder functions.
SM100 Encoding Variant Counts
Per-opcode variant counts for the SM100 (Blackwell datacenter) SASS encoder, extracted from the 683 concrete encoding handler functions at 0xED1520--0xFA5F10. Each function encodes one (opcode, operand-form) pair -- e.g., FFMA reg,reg,reg vs FFMA reg,reg,imm vs FFMA reg,reg,pred. The "Enc ID" column is the numeric value written to *(WORD*)(a2+12) by each handler, which maps to the SASS binary major opcode through the encoding dispatch megafunctions. The "SASS Mnemonic" column gives the canonical name from the 322-entry ROT13 opcode name table in InstructionInfo. Where two encoder IDs map to the same mnemonic (e.g. IADD3 IDs 0+1, LOP3 IDs 4+10), both are listed; the "Combined" column gives the merged count for that instruction.
Source: sweep report p1.14-sweep-0xED1000-0xFA6000.txt, ptxas v13.0.88.
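Each of the 683 handlers follows the same skeleton in the decompilation: store the encoder ID at offset +12 of the instruction object, then pack operand bitfields per the handler's format descriptor. A schematic stub — function names and buffer layout are reconstructions, not recovered symbols; only the +12 store is the recovered behavior:

```c
#include <stdint.h>

/* Schematic SM100 encoding handler: writes its encoder ID to
   *(WORD *)(a2 + 12), as recovered from the sweep. Real handlers then
   pack operand bitfields according to their format descriptor. */
static void encode_iadd3_rrr(void *a2)
{
    *(uint16_t *)((uint8_t *)a2 + 12) = 0;   /* Enc ID 0: IADD3 reg,reg,reg */
    /* ... operand packing per format 23F1DF8 would follow ... */
}

static void encode_mov(void *a2)
{
    *(uint16_t *)((uint8_t *)a2 + 12) = 18;  /* Enc ID 18: MOV */
}
```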
Integer ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 0 | 8 | IADD3 | 13 (IDs 0+1) | 23F1DF8, 23F1F08 |
| 1 | 5 | IADD3 | | 23F1DF8, 23F1F08 |
| 15 | 19 | IMAD | 19 | 23F1DF8, 23F2018 |
| 40 | 23 | IMAD (wide) | 23 | 23F1DF8, 23F21B0 |
| 42 | 34 | IMAD (extended) | 34 | 23F1DF8, 23F21B0 |
| 4 | 4 | LOP3 | 12 (IDs 4+10) | 23F2018 |
| 10 | 8 | LOP3 | | 23F2018 |
| 34 | 33 | ISETP | 33 | 23F1DF8, 23F29A8 |
| 30 | 2 | IMNMX | 2 | 23F1D70 |
| 43 | 13 | FLO | 13 | 23F1D70, 23F1DF8 |
| 44 | 4 | IABS | 4 | 23F1F08, 23F1F90 |
| 47 | 5 | POPC | 5 | 23F1F08, 23F1F90 |
| 49 | 2 | BREV | 2 | 23F1DF8 |
| 21 | 5 | SHF | 5 | 23F1DF8, 23F1F08 |
| 84 | 6 | SHF | 6 | 23F1F08, 23F1F90 |
| Subtotal | 171 |
FP32 ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 13 | 30 | FFMA | 30 | 23F2018..23F2EF8 |
| 14 | 11 | FADD | 11 | 23F1F90, 23F2E70 |
| 22 | 18 | FMUL | 18 | 23F1DF8..23F2678 |
| 31 | 2 | FMNMX | 2 | 23F1D70 |
| 35 | 30 | FSETP | 30 | many formats |
| 33 | 2 | FSET/CSET | 2 | 23F2238 |
| 38 | 2 | FSWZADD | 2 | 23F2128 |
| 103 | 9 | extended FMA | 9 | 23F1DF8..23F2678 |
| Subtotal | 104 |
FP64 ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 59 | 6 | DFMA | 6 | 23F2678, 23F2EF8 |
| 91 | 2 | DADD | 2 | 23F1DF8 |
| 57 | 5 | DMUL | 5 | 23F1F08 |
| 65 | 6 | DSETP | 6 | 23F2678, 23F2EF8 |
| Subtotal | 19 |
FP16 / Half-Precision
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 23 | 18 | HFMA2/HMUL2 | 18 | 23F1DF8..23F2678 |
| 37 | 34 | HSETP2/DSETP | 34 | 23F1DF8, 23F21B0 |
| Subtotal | 52 |
Data Movement
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 18 | 78 | MOV | 78 | many formats |
| 32 | 28 | SEL | 28 | 23F1D70, 23F1DF8 |
| 71 | 45 | P2R/R2P | 45 | many formats |
| 19 | 3 | PRMT | 3 | 23F1C60, 23F1D70 |
| 20 | 3 | LEA | 3 | 23F1DF8, 23F1F08 |
| 6 | 5 | S2R | 5 | 23F1F08, 23F1F90 |
| 7 | 2 | CS2R | 2 | 23F2018 |
| Subtotal | 164 |
Memory
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 27 | 24 | LDG/STG | 24 | 23F1F08, 23F29A8 |
| 77 | 18 | LDS/STS | 18 | 23F29A8 |
| 94 | 16 | LDL/STL | 16 | 23F29A8 |
| 74 | 6 | ST | 6 | 23F1DF8, 23F1F08 |
| 50 | 5 | ATOM/ATOMG | 5 | 23F1DF8, 23F1F08 |
| 81 | 6 | RED | 6 | 23F1F08, 23F1F90 |
| 100 | 3 | SULD | 3 | 23F1DF8, 23F1F08 |
| Subtotal | 78 |
Tensor Core
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 78 | 35 | HMMA/IMMA | 35 | 23F1DF8, 23F29A8 |
| 90 | 5 | BMMA/QMMA | 5 | 23F2678 |
| Subtotal | 40 |
Texture
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 5 | 1 | TLD | 1 | 23F1F08 |
| 8 | 2 | TEX | 2 | 23F1DF8, 23F1F90 |
| 9 | 1 | TLD4 | 1 | 23F1F08 |
| 88 | 2 | TEX (variant) | 2 | 23F1F08 |
| Subtotal | 6 |
Predicate / Warp
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 79 | 7 | PLOP3 | 7 | 23F1F08..23F2018 |
| 82 | 6 | VOTE | 6 | 23F1F08, 23F1F90 |
| 48 | 7 | SHFL | 7 | 23F1D70, 23F1DF8 |
| Subtotal | 20 |
Control Flow / Sync
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 17 | 1 | BRA | 1 | 23F1F08 |
| 73 | 10 | BAR | 10 | 23F1F08, 23F2238 |
| 92 | 1 | DEPBAR | 1 | 23F1F08 |
| 98 | 1 | MEMBAR | 1 | 23F1F08 |
| 11 | 14 | MUFU | 14 | 23F1F08, 23F1F90 |
| 45 | 1 | NOP | 1 | 23F1D70 |
| 46 | 1 | YIELD/EXIT | 1 | 23F2238 |
| Subtotal | 29 |
Totals
| Category | Encoder Functions | Distinct Opcodes |
|---|---|---|
| Integer ALU | 171 | 15 (across 10 mnemonics) |
| FP32 ALU | 104 | 8 |
| FP64 ALU | 19 | 4 |
| FP16 | 52 | 2 |
| Data Movement | 164 | 7 |
| Memory | 78 | 7 |
| Tensor Core | 40 | 2 |
| Texture | 6 | 4 |
| Predicate/Warp | 20 | 3 |
| Control/Sync | 29 | 7 |
| Total | 683 | 59 |
The top 5 instructions by variant count -- MOV (78), P2R/R2P (45), HMMA/IMMA (35), IMAD extended (34), HSETP2/DSETP (34) -- account for 226 of 683 encoders (33%). MOV alone accounts for 11.4% of all encoder functions because every possible source type (GPR, uniform reg, immediate, constant bank, predicate, special reg) and every destination type requires a separate encoder with a distinct operand signature and bitfield extraction sequence.
The 21 encoding format descriptors (xmmword groups) cluster into three tiers by usage: heavy (165+141+101 = 407 functions across 3 formats), medium (87+47+36 = 170 across 3 formats), and light (106 functions across 15 formats). The heavy-tier formats (23F1F08, 23F1DF8, 23F29A8) are the simple/compact, primary ALU, and memory/load-store formats respectively -- these three alone cover 60% of all SM100 encoders.
Internal Index vs. Numeric Opcode
The index in this table (the position within the ROT13 name array) is the value stored in the Ori IR instruction's opcode field at offset +72 (lower 12 bits). However, this index is distinct from the encoded SASS major opcode in the binary instruction word. The mapping between IR opcode index and SASS binary major opcode is performed by the encoding dispatch tables (the "six megafunctions" at 0x10C0B20--0x10E32E0, which switch on up to 370 opcode category values from 0x0 through 0x171). A single IR opcode index may map to multiple SASS major opcodes depending on operand types and modifier bits, and vice versa.
Known IR-index-to-numeric correlations (confirmed from switch statements across multiple independent functions):
| IR Index | Numeric (encoding switch) | Mnemonic |
|---|---|---|
| 1 | 0x59 | IMAD |
| 3 | 0x29 | IADD3 |
| 25 | (64-bit, no major) | NOP |
| 52 | (pseudo) | BB boundary |
| 77 | (64-bit, no major) | EXIT |
| 91 | 0x1E | ATOM |
| 95 | (64-bit, no major) | EXIT/RET |
| 96 | 0x38 | LDG |
| 221 | 0xDF | GMMA |
Extended Mnemonic Table (sub_896D50)
A second, much larger mnemonic table is constructed by sub_896D50 (21KB, vtable off_21DA9F8). This "extended" table serves a different purpose from the primary 322-entry table: it is used during SASS disassembly input parsing (string-to-index lookup), whereas the primary table is used during encoding (index-to-string). The two tables share the same base class (sub_A2B110) but have different vtables and different object layouts.
Table Dimensions
| Property | Primary (sub_7A5D10) | Extended (sub_896D50) |
|---|---|---|
| Entry count | 322 (indices 0--321) | 773 (indices 0--772) |
| Effective mnemonics | 306 (excl. 16 boundary markers) | 772 (excl. NONE sentinel) |
| Entry size | 16 bytes (8B ptr + 8B len) | 16 bytes (8B ptr + 8B len) |
| Object offset | +0x1058 (+4184) | +0x2C60 (+11360) |
| Ordering | By IR opcode index | Alphabetical by ROT13 name |
| Encoding category map | 322 x int32 at +0x2478 | 772 x int32 at +0x5CB0 (+23728), from unk_21D92E0 |
| Vtable | off_233ADC0 | off_21DA9F8 |
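The recovered offsets are internally consistent: the 773 name-table entries (16 bytes each) starting at +0x2C60 end exactly at +0x5CB0, where the 772-entry category map begins. A quick arithmetic cross-check:

```python
NAME_TABLE_OFF = 0x2C60     # +11360, start of the 773 x 16-byte entry array
CATEGORY_MAP_OFF = 0x5CB0   # +23728, start of the 772 x int32 category map
ENTRY_COUNT = 773           # indices 0-772, including the NONE sentinel
ENTRY_SIZE = 16             # 8-byte string pointer + 8-byte length

assert NAME_TABLE_OFF == 11360 and CATEGORY_MAP_OFF == 23728
assert NAME_TABLE_OFF + ENTRY_COUNT * ENTRY_SIZE == CATEGORY_MAP_OFF
```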
Why 772 Entries?
The extended table is 2.4x larger because it expands each base mnemonic into its modifier-qualified SASS forms. For example, the primary table stores one IMAD entry (index 1), but the extended table stores eight (the base form plus seven dot-qualified variants):
| Extended entry | ROT13 | Description |
|---|---|---|
| IMAD | VZNQ | Base form |
| IMAD.HI | VZNQ.UV | High-half variant |
| IMAD.WIDE | VZNQ.JVQR | 32x32->64 |
| IMAD.WIDE.READ.AB | VZNQ.JVQR.ERNQ.NO | Paired read, A+B |
| IMAD.WIDE.READ.CH | VZNQ.JVQR.ERNQ.PU | Paired read, C high |
| IMAD.WIDE.READ.CL | VZNQ.JVQR.ERNQ.PY | Paired read, C low |
| IMAD.WIDE.WRITE.DH | VZNQ.JVQR.JEVGR.QU | Paired write, D high |
| IMAD.WIDE.WRITE.DL | VZNQ.JVQR.JEVGR.QY | Paired write, D low |
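The ROT13 column is trivially reversible; Python's codecs module regenerates the cleartext mnemonics directly:

```python
import codecs

rot13_names = ["VZNQ", "VZNQ.UV", "VZNQ.JVQR.ERNQ.NO"]
cleartext = [codecs.decode(n, "rot13") for n in rot13_names]
print(cleartext)   # ['IMAD', 'IMAD.HI', 'IMAD.WIDE.READ.AB']
# ROT13 only rotates A-Z/a-z; dots and digits pass through unchanged,
# which is why suffixes like .32I survive the obfuscation intact.
```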
Entry Composition
The 771 populated entries (from the decompiled string assignments at a1+11360 through a1+23712) break down as:
| Category | Count | Examples |
|---|---|---|
| SASS base mnemonics (also in primary table) | 244 | IMAD, FADD, LDG, BRA, MOV, ... |
| SASS dot-modified variants | 125 | FENCE.G, ISETP.64, BAR.SYNC.DEFER_BLOCKING, HMMA.SP.16832.F16.* |
| SASS new base names (not in primary) | 81 | BGMMA, RPCMOV, SYNCS, MOV32I, SHL, SHR, LOP, BITEXTRACT |
| Mercury internal descriptors | 321 | MERCURY_addmin_srcs_r_ur_0, MERCURY_mbarrier_try_wait_... |
| Total SASS | 450 | |
| Total (SASS + Mercury) | 771 | |
Of the 450 SASS entries, 7 carry annotation text in parentheses: F2F (not F64), F2I (not *64), FRND (not F64), I2F (not F64), NANOSLEEP (with Rb), NANOTRAP (with Rb), WARPSYNC (with Rb). These annotations indicate operand-type restrictions or register-variant qualifiers used by the SASS parser to disambiguate instruction forms.
32-Bit Immediate Forms
These mnemonics represent SASS instructions with a 32-bit immediate operand packed directly into the instruction word. They do not appear as separate entries in the primary IR opcode table because the immediate form is selected during encoding based on operand type, not during IR construction:
| ROT13 | Mnemonic | Description |
|---|---|---|
| SNQQ32V | FADD32I | FP32 add with 32-bit immediate |
| SSZN32V | FFMA32I | FP32 FMA with 32-bit immediate |
| SZHY32V | FMUL32I | FP32 multiply with 32-bit immediate |
| UNQQ2_32V | HADD2_32I | FP16x2 add with 32-bit immediate |
| USZN2_32V | HFMA2_32I | FP16x2 FMA with 32-bit immediate |
| UZHY2_32V | HMUL2_32I | FP16x2 multiply with 32-bit immediate |
| VNQQ32V | IADD32I | Integer add with 32-bit immediate |
| VNQQ2 | IADD2 | Two-input integer add (32I related) |
| VZHY32V | IMUL32I | Integer multiply with 32-bit immediate |
| VZHY32V.JVQR | IMUL32I.WIDE | Integer multiply-wide with 32-bit immediate |
| VFPNQQ32V | ISCADD32I | Integer scaled-add with 32-bit immediate |
| YBC32V | LOP32I | Logic operation with 32-bit immediate |
| ZBI32V | MOV32I | Move 32-bit immediate to register |
| ZBI64VHE | MOV64IUR | Move 64-bit immediate to uniform register |
| HYBC32V | ULOP32I | Uniform logic with 32-bit immediate |
Mercury Pseudo-Instructions (321 Entries)
The single largest category. These are not real SASS instructions -- they are internal pseudo-instructions representing Mercury IR operations that need mnemonic-string identity for diagnostic and dump output. They follow a rigid naming convention:
MERCURY_{operation}_{srcs|dests}_{regclass}_{variant_index}
Register class codes in the mnemonic:
- `r` = GPR (R0--R255)
- `ur` = Uniform register (UR0--UR63)
- `p` = Predicate register (P0--P6)
- `simm` = Signed immediate
- `uimm` = Unsigned immediate
- `r2` / `ur2` = Register pair
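A sketch of a parser for this convention follows; the regular expression and field names are inferred from the recovered strings, not from any confirmed internal structure, and names without a srcs/dests segment (e.g. MERCURY__intr) deliberately fall through:

```python
import re

# Register-class tokens from the naming convention above; longer
# alternatives first so "ur2" is not consumed as "ur" plus a stray "2"
_CLS = r"(?:simm|uimm|ur2|r2|ur|r|p)"
MERCURY_RE = re.compile(
    rf"^MERCURY_(?P<op>.+?)_(?P<dir>srcs|dests)"
    rf"_(?P<classes>{_CLS}(?:_{_CLS})*)_(?P<variant>\d+)$"
)

m = MERCURY_RE.match("MERCURY_addmin_srcs_r_ur_0")
print(m.group("op"), m.group("classes").split("_"), m.group("variant"))
```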
Representative entries (decoded from ROT13):
| ROT13 | Cleartext | Operation |
|---|---|---|
| ZREPHEL__vage | MERCURY__intr | Generic intrinsic placeholder |
| ZREPHEL_nqqzva_fepf_e_he_0 | MERCURY_addmin_srcs_r_ur_0 | Fused add-min, GPR + uniform |
| ZREPHEL_nqqznk_fepf_he_e_0 | MERCURY_addmax_srcs_ur_r_0 | Fused add-max, uniform + GPR |
| ZREPHEL_ngbz_pnf_vag_npd_ery_... | MERCURY_atom_cas_int_acq_rel_... | Atomic CAS with acquire-release |
| ZREPHEL_flapf_neevir_n1g0_n0g1_... | MERCURY_syncs_arrive_a1t0_a0t1_... | Sync arrive with token spec |
New Base Mnemonics
Mnemonics that appear in the extended table but have no base-name match in the primary 322-entry table at all. Some are legacy forms (pre-Volta mnemonics preserved for disassembly compatibility), others are specialized operations:
| ROT13 | Mnemonic | Category |
|---|---|---|
| NPDOHYX | ACQBULK | CGA bulk resource acquire |
| OVGRKGENPG | BITEXTRACT | Bitfield extract |
| QRPBZCERFF | DECOMPRESS | Data decompression |
| VQC4N | IDP4A | Integer dot-product accumulate (4-element) |
| VZHY | IMUL | Integer multiply (non-fused, legacy) |
| VFPNQQ | ISCADD | Integer scaled-add (legacy LEA form) |
| YQTZP | LDGMC | Load global with memory consistency |
| YQG | LDT | Load from texture memory |
| YBC | LOP | Two-input logic operation (legacy) |
| CFRGC | PSETP | Predicate set-predicate |
| ERQT | REDG | Reduction, global (explicit address space) |
| FUY | SHL | Shift left (legacy, replaced by SHF) |
| FUE | SHR | Shift right (legacy, replaced by SHF) |
| FCNEFVSL | SPARSIFY | Convert dense to sparse format |
| FGG | STT | Store to texture memory |
| GNGBZT | TATOMG | Texture atomic, global scope |
| IVFRG | VISET | Vector integer set |
| JNECTEBHCFRG | WARPGROUPSET | Configure warpgroup parameters |
Modifier Suffix Patterns
Five distinct modifier suffix patterns are used in the extended table's dot-separated SASS mnemonics:
Pattern 1 -- Sub-operation mode. The suffix selects a functional sub-operation within a single hardware instruction. CCTL has the most variants (7):
| Extended Mnemonic | Sub-operation |
|---|---|
| CCTL.C | Clean |
| CCTL.C.LDC | Clean via constant cache |
| CCTL.C.LDC.IVALL | Clean constant cache, invalidate all |
| CCTL.E.LDC | Evict via constant cache |
| CCTL.I | Invalidate |
| CCTL.LDCU | Load constant, uniform path |
| CCTL.QFAULT | Query fault status |
Also: SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK (8 variants); and BPT.DRAIN, BPT.PAUSE.
Pattern 2 -- Operand width. The .64 suffix (with optional .HI/.LO half-selectors) indicates 64-bit operand mode. Added for sm_104 (Blackwell Ultra):
| Extended Mnemonic | Base Opcode |
|---|---|
| ISETP.64, ISETP.64.HI, ISETP.64.LO | ISETP (idx 288) |
| IMNMX.64, IMNMX.64.HI, IMNMX.64.LO | IMNMX (idx 285) |
| IADD.64, IADD.64.HI, IADD.64.LO | IADD (idx 282) |
| IADD2.64, IADD2.64.HI, IADD2.64.LO | IADD2 |
| MOV.64, MOV.64.HI, MOV.64.LO | MOV (idx 290) |
| SEL.64, SEL.64.HI, SEL.64.LO | SEL (idx 292) |
| UMOV.64, USEL.64, UIADD3.64, UIMNMX.64, UISETP.64 | Uniform 64-bit variants |
Pattern 3 -- Data access direction. IMAD.WIDE has 5 sub-variants controlling which 32-bit half of the 64-bit accumulator is read or written. These correspond to the 256-bit instruction format (format code 0x8) with 16 constant-bank operand slots:
| Extended Mnemonic | Meaning |
|---|---|
| IMAD.WIDE | Default wide multiply-add |
| IMAD.WIDE.READ.AB | Read both A and B input halves |
| IMAD.WIDE.READ.CL / .CH | Read accumulator low / high half |
| IMAD.WIDE.WRITE.DL / .DH | Write result low / high half |
| IMAD.HI | High-half result only |
Pattern 4 -- Scope qualifier. Fences, barriers, UTC operations, and synchronization carry scope suffixes:
| Extended Mnemonic | Scope |
|---|---|
| FENCE.G | Global (GPU-wide) |
| FENCE.S | Shared/CTA |
| FENCE.T | Tensor (sm_100+) |
| UTCBAR.1CTA, UTCBAR.2CTA | 1-CTA / 2-CTA scope |
| UTCBAR.1CTA.FLUSH | 1-CTA with flush |
| BAR.SYNC.DEFER_BLOCKING | Deferred blocking sync |
| USETMAXREG.RELEASE | Release variant |
| USETSHMSZ.FLUSH | Flush variant |
Pattern 5 -- Shape and type descriptor. Tensor core operations carry shape geometry and data type. Brace-delimited alternation syntax indicates a single encoder handling multiple shapes:
| Extended Mnemonic | Meaning |
|---|---|
| HMMA.F32.{16816.F16|16816.E8M7|1688.E8M10} | FP16 MMA with FP32 accum, multiple shapes |
| HMMA.SP.16832.F16.* | Sparse FP16 MMA, 16x8x32 |
| IMMA.{8816.*|8832.*} | Integer MMA, 8x8x16 or 8x8x32 |
| IMMA.SP.{16832.*|16864.*4.*4} | Sparse integer MMA |
| QMMA.SF.SP | Structured + unstructured sparse |
| MUFU.EX2.LOW_ACC.{F16x2, BF16x2} | Low-accuracy EX2 for half types |
Top Opcodes by Dot-Variant Count
| Base Opcode | Variants | Category |
|---|---|---|
| HMMA | 8 | Tensor core shape + sparse + FP type |
| SYNCS | 8 | Scope-aware synchronization modes |
| CCTL | 7 | Cache control sub-operations |
| IMAD | 7 | .HI, .WIDE, .WIDE.READ., .WIDE.WRITE. |
| IMMA | 6 | Tensor core shape + sparse |
| QMMA | 6 | Shape + structured/unstructured sparse |
| USYNCS | 6 | Uniform sync scope modes |
| MUFU | 5 | .EX2, .RCP, .RSQ, plus half-precision and low-accuracy .EX2 forms |
| IADD | 4 | .64, .64.HI, .64.LO, .XOR |
| WARPGROUP | 3 | .ARRIVE, .DEPBAR, .WAIT |
| RPCMOV | 3 | .32, .32.READ, .64 |
| UTCBAR | 3 | .1CTA, .1CTA.FLUSH, .2CTA |
Complete New SASS Mnemonics by Category
The following 206 SASS mnemonics appear only in the extended table -- they have no corresponding entry in the base 322-entry name table. Many represent modifier-suffixed forms of base opcodes; others are entirely new operations.
GMMA type-specialized (8): BGMMA, BGMMA_GSB, HGMMA, HGMMA_GSB, IGMMA, IGMMA_GSB, QGMMA, QGMMA_GSB
UTC type-specialized (20): UTCHMMA.1CTA, UTCHMMA.2CTA, UTCIMMA.1CTA, UTCIMMA.2CTA, UTCMXQMMA.1CTA, UTCMXQMMA.2CTA, UTCOMMA.1CTA, UTCOMMA.2CTA, UTCQMMA.1CTA, UTCQMMA.2CTA, UTCBAR.1CTA.FLUSH, UTCATOMSWS, UTCLDSWS, UTCSTSWS, UTCBAR.1CTA, UTCBAR.2CTA, UTCCP.1CTA, UTCCP.2CTA, UTCSHIFT.1CTA, UTCSHIFT.2CTA
DLC/DPC operations (13): UDLCBAR, UDLCCP, UDLCHMMA, UDLCIMMA, UDLCQMMA, UDPCBLKCP, UDPCBLKL2CCTL, UDPCBLKRED, UDPCTMACCTL, UDPCTMAL2CCTL, UDPCTMALDG, UDPCTMAREDG, UDPCTMASTG
Synchronization (17): SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK, SYNCSU.ARRIVE.A1T0, SYNCSU.ARRIVE.MULTICAST.A1T0, WARPGROUP.ARRIVE, WARPGROUP.DEPBAR, WARPGROUP.WAIT, WARPGROUPSET, BAR.SYNC.DEFER_BLOCKING, BPT.DRAIN, BPT.PAUSE
Uniform sync (6): USYNCS.ARRIVE, USYNCS.ARRIVE.MULTICAST, USYNCS.CAS.EXCH, USYNCS.CCTL, USYNCS.LD, USYNCS.PHASECHK
Integer 64-bit variants (20): IADD.64, IADD.64.HI, IADD.64.LO, IADD.XOR, IADD2, IADD2.64, IADD2.64.HI, IADD2.64.LO, IMNMX.64, IMNMX.64.HI, IMNMX.64.LO, ISETP.64, ISETP.64.HI, ISETP.64.LO, MOV.64, MOV.64.HI, MOV.64.LO, SEL.64, SEL.64.HI, SEL.64.LO
Uniform scalar extended (27): UIADD3.64, UIMNMX.64, UISETP.64, UMOV.64, USEL.64, ULOP, ULOP32I, UMEMSETS.64, UPSETP, UR2UP, USHL, USHR, UCCTL, UBLKL2CCTL, UCGABAR_ARV, UCGABAR_GET, UCGABAR_SET, UCGABAR_WAIT, USETMAXREG, USETMAXREG.RELEASE, USETSHMSZ, USETSHMSZ.FLUSH, UREDGR, UREGPRERELEASE, USTGR, UTRACEEVENT, UVIRTCOUNT
IMAD/IMUL variants (8): IMAD.HI, IMAD.WIDE.READ.AB, IMAD.WIDE.READ.CH, IMAD.WIDE.READ.CL, IMAD.WIDE.WRITE.DH, IMAD.WIDE.WRITE.DL, IMUL.WIDE, IMUL32I.WIDE
Tensor core shapes (28): HMMA.16816.F16.*, HMMA.1688.F16.*, HMMA.F32.{...} (4 entries), HMMA.SP.{...} (4 entries), IMMA.{...} (3 entries), IMMA.SP.{...} (3 entries), DMMA.1684, DMMA.1688, DMMA.16816, BMMA.88128, BMMA.168128, BMMA.168256, QMMA.16816, QMMA.16832, QMMA.SF, QMMA.SF.SP, QMMA.SP.16832, QMMA.SP.16864, OMMA.SP
FP extensions (16): FADD32I, FFMA32I, FMUL32I, FHADD, FHADD2, FHFMA, FHFMA2, FHMUL2, UFHADD, UFHFMA, UFMNMX, MUFU.EX2, MUFU.RCP, MUFU.RSQ, MUFU.EX2.{F16x2, BF16x2}, MUFU.EX2.LOW_ACC.{F16x2, BF16x2}
Cache control (7): CCTL.C, CCTL.C.LDC, CCTL.C.LDC.IVALL, CCTL.E.LDC, CCTL.I, CCTL.LDCU, CCTL.QFAULT
Texture extensions (8): TATOMG, TTUCLOSE, TTUGO, TTULD, TTULD_CLOSE, TTUMACROFUSE, TTUOPEN, TTUST
Fence/scope (3): FENCE.G, FENCE.S, FENCE.T
Data movement (8): MOV32I, MOV64IUR, RPCMOV, RPCMOV.32, RPCMOV.32.READ, RPCMOV.64, CS2R (base without size), DECOMPRESS
Memory (4): LDGMC, LDT, STT, REDG
Other new (13): ACQBULK, BRA_IMM, JMP_IMM, JMXU, NONE, PSETP, HADD2_32I, HFMA2_32I, HMUL2_32I, IADD32I, IMUL, LOP, LOP32I
Parallel Constructor Regions
The ROT13 string data for the extended table exists in two parallel, near-identical regions:
| Region | Address Range | SASS Entries | MERCURY Entries |
|---|---|---|---|
| 1 | 0x2039000--0x203A500 | 139 unique | 32 |
| 2 | 0x21CA000--0x21CB100 | 139 unique | 40 |
Region 2 has 8 additional MERCURY entries not in region 1, all for sm_100/sm_104 cluster barrier and atomic operations: MERCURY_barrier_cluster_arrive_sync_unaligned_* (4), MERCURY_atom_shared_cta_popc_inc_* (3), MERCURY_atom_shared_cta_int_acq_rel_* (1). This indicates at least two InstructionInfo variant objects for different target architectures, where the newer variant gains additional Mercury instruction templates.
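Under the assumption that each region is just a flat run of ROT13 strings, the additions in the newer variant fall out of a simple set difference. The toy inputs below stand in for full region dumps:

```python
import codecs

def mercury_entries(strings):
    """Decode ROT13 names and keep only the Mercury pseudo-instructions."""
    decoded = (codecs.decode(s, "rot13") for s in strings)
    return {d for d in decoded if d.startswith("MERCURY_")}

# Toy stand-ins for the strings dumped from each region
region1 = mercury_entries(["ZREPHEL_nqqzva_fepf_e_he_0", "VZNQ"])
region2 = mercury_entries(["ZREPHEL_nqqzva_fepf_e_he_0",
                           "ZREPHEL_oneevre_pyhfgre_neevir", "VZNQ"])
print(sorted(region2 - region1))   # entries only the newer variant registers
```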
Hash Table for O(1) Lookup
After populating the flat sorted array, sub_896D50 constructs a hash table for O(1) mnemonic lookup during SASS parsing. The hash table is allocated as a 488-byte header object with three backing arrays:
| Array | Slot size | Slots | Total bytes | Purpose |
|---|---|---|---|---|
| 1 | 64 bytes | 772 | 49,408 | Open-addressing hash (key prefix + metadata) |
| 2 | 36 bytes | 772 | 27,792 | Auxiliary data per mnemonic |
| 3 | 16 bytes | 35 | 560 | Overflow / collision chain |
Array 1 slots are initialized to 0xFF (empty sentinel). The hash function used for lookup is the same FNV-1a variant used by sub_1377C60 for the primary table.
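The exact parameterization inside sub_1377C60 (width, seed, any post-mix) has not been fully characterized here; for reference, the textbook 64-bit FNV-1a, which such variants are typically derived from, looks like this:

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """Textbook 64-bit FNV-1a; the binary's variant may differ in details."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

# Hypothetical probe-start computation for the 772-slot open-addressing array
slot = fnv1a_64(b"IMAD") % 772
```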
Object Tail Configuration
After building the tables and hash structure, the constructor:
- Queries 15 knobs via `context+1664` (knobs 1, 2, 5, 11, 14, 18, 22, 25, 28, 273, 774, 775, 803, 983, 998) to conditionally register feature-gated instruction families at `context+1728`
- Stores knob 803's value at `obj+108`
- Sets the vtable to `off_21DA9F8` (line 2438 in decompiled source)
- Writes feature bitmask `0x48018BA65` at `obj+26856`
- Stores the hash table pointer at `obj+26832` and the arena pointer at `obj+26840`
Related Pages
- Instructions & Opcodes -- Ori IR instruction layout, opcode encoding, full ROT13 table
- SASS Encoding -- Instruction encoding pipeline, format groups, encoder templates
- Instruction Selection -- Pattern matching from IR to SASS
- SM Architecture Map -- SM version numbering and feature sets
- Scheduling -- How opcodes are assigned to functional units
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_7A5D10 | -- | InstructionInfo constructor; initializes the 322-entry ROT13 opcode name table at object offset +0x1058 and the 322-entry encoding category identity map at +0x2478 (vtable off_233ADC0) | 0.92 |
| sub_BE7390 | -- | Parallel InstructionInfo constructor; initializes an identical 322-entry name table | 0.90 |
| sub_7CB560 | -- | SASS printer; maps duplicate opcode indices (e.g., 284 vs 285) to distinct mnemonic strings (IMNMX vs IMNMX.64) based on operand metadata | 0.85 |
| sub_6575D0 | 49KB | Register-class-to-opcode dispatch; handles DMMA (index 215) shared dispatch with CVTA at cases 0xD6/0xD7 | 0.85 |
| sub_7482B0 | -- | Encoding path for ISETP (index 288, sm_104); handles case 0x120 for 64-bit integer set-predicate | 0.80 |
| sub_8380A0 | -- | Encoding path for ISETP (index 288, sm_104); second handler for case 0x120 | 0.80 |
| sub_896D50 | 21KB | Extended mnemonic table constructor; builds the 772-entry alphabetically-sorted SASS mnemonic lookup table at object offset +11360, with parallel 772-entry encoding category map from unk_21D92E0, plus 3-array hash table for O(1) string lookup during disassembly parsing (vtable off_21DA9F8) | 0.90 |
| sub_A2B110 | -- | Base class constructor shared by both primary (sub_7A5D10) and extended (sub_896D50) mnemonic table objects | 0.85 |
PTX Instruction Table
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Complete catalog of PTX instructions recognized by ptxas v13.0.88 (CUDA 13.0). All entries are verified against the binary: instruction names come from the Flex lexer's 552 token rules, type signatures from the instruction table builder's 1,141 descriptor registrations (sub_46E000 calling sub_46BED0), and formatter names from the 580 PTX text generation functions dispatched by sub_5D4190. Internal-only instructions (prefixed with _) are included where they appear in the binary but are marked accordingly.
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry) / sub_46C6E0 (6.4 KB matcher) |
| PTX text formatter dispatch | sub_5D4190 (12.9 KB, 81 string + 473-entry hash) |
| Formatter functions | 0x4DA340--0x5A8E40 (580 functions) |
| Semantic validators | 0x460000--0x4D5000 (~20 validator functions) |
| Operand type encoding | Single-char codes: F=float, H=half, I=int, B=bits, N=imm, P=pred, E=bf16, Q=fp8, R=fp4 |
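A hypothetical decoder for these codes; the letter-plus-width token shape is an assumption based on signatures like F32 and E16 that appear in the cvt table later on:

```python
import re

# Single-char operand type codes as recovered from the binary
TYPE_CODES = {"F": "float", "H": "half", "I": "int", "B": "bits",
              "N": "imm", "P": "pred", "E": "bf16", "Q": "fp8", "R": "fp4"}

def decode_signature(sig: str):
    """Expand a packed operand signature such as 'F32I64' into readable types."""
    return [f"{TYPE_CODES[c]}{w}"
            for c, w in re.findall(r"([FHIBNPEQR])(\d+)", sig)]

print(decode_signature("F32I64"))   # ['float32', 'int64']
```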
Organization
Instructions are grouped by functional category following NVIDIA's PTX ISA documentation structure. Each table entry lists:
- Mnemonic: the PTX instruction name as recognized by the lexer
- Type suffixes: legal type qualifiers (from instruction table builder encoding strings)
- Operands: operand pattern (`d`=dest, `a`/`b`/`c`=source, `p`=predicate, `[a]`=memory)
- SM req: minimum SM architecture (from `sub_489390` version gates in validators)
- PTX req: minimum PTX ISA version (from `sub_489050` version gates)
- Description: brief functional description
Type abbreviations in the suffix column: s=signed int, u=unsigned int, f=float, b=bits, pred=predicate. Widths: 8/16/32/64/128. Packed: f16x2, bf16x2.
Integer Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer addition |
sub | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer subtraction |
mul.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, low half of result |
mul.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, high half of result |
mul.wide | .s16 .s32 .u16 .u32 | d, a, b | all | 1.0 | Widening multiply (16->32 or 32->64) |
mul24.lo | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, low half (deprecated sm_20+) |
mul24.hi | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, high half (deprecated sm_20+) |
mad.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, low half |
mad.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, high half |
mad.wide | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Widening multiply-add |
mad24.lo | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, low (deprecated) |
mad24.hi | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, high (deprecated) |
mad.cc | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-out |
madc.lo | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, low |
madc.hi | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, high |
mad.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add, high half |
madc.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add with carry, high |
div | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer division |
rem | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer remainder |
abs | .s16 .s32 .s64 | d, a | all | 1.0 | Absolute value |
neg | .s16 .s32 .s64 | d, a | all | 1.0 | Negate |
min | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Minimum |
max | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Maximum |
popc | .b32 .b64 | d, a | 20+ | 2.0 | Population count (count set bits) |
clz | .b32 .b64 | d, a | 20+ | 2.0 | Count leading zeros |
bfind | .s32 .s64 .u32 .u64 | d, a | 20+ | 2.0 | Find most significant set bit |
brev | .b32 .b64 | d, a | 20+ | 2.0 | Bit reverse |
bfe | .s32 .s64 .u32 .u64 | d, a, b, c | 20+ | 2.0 | Bit field extract |
bfi | .b32 .b64 | d, f, a, b, c | 20+ | 2.0 | Bit field insert |
dp4a | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 4-element dot product accumulate |
dp2a.lo | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, low |
dp2a.hi | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, high |
sad | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Sum of absolute differences |
Floating-Point Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .f32 .f64 | d, a, b | all | 1.0 | FP addition (.rn .rz .rm .rp rounding) |
sub | .f32 .f64 | d, a, b | all | 1.0 | FP subtraction |
mul | .f32 .f64 | d, a, b | all | 1.0 | FP multiplication |
fma | .f32 .f64 | d, a, b, c | 20+ | 2.0 | Fused multiply-add |
mad | .f32 .f64 | d, a, b, c | all | 1.0 | Multiply-add (non-fused on sm_20+) |
mad.rnd.f32 | .f32 | d, a, b, c | all | 1.0 | Multiply-add with explicit rounding |
div | .f32 .f64 | d, a, b | all | 1.0 | FP division (.approx .full .rn .rz .rm .rp) |
div.full | .f32 | d, a, b | all | 1.0 | Full-range division (specialized formatter) |
div.rnd.f32 | .f32 | d, a, b | all | 1.0 | Division with explicit rounding |
div.rn.f64 | .f64 | d, a, b | all | 1.0 | Double-precision division, round-nearest |
abs | .f32 .f64 | d, a | all | 1.0 | FP absolute value |
neg | .f32 .f64 | d, a | all | 1.0 | FP negate |
min | .f32 .f64 | d, a, b | all | 1.0 | FP minimum |
max | .f32 .f64 | d, a, b | all | 1.0 | FP maximum |
rcp | .f32 .f64 | d, a | all | 1.0 | Reciprocal (.approx .rn .rz .rm .rp) |
rcp.approx.f64 | .f64 | d, a | all | 1.0 | Approximate double reciprocal |
rcp.rnd.f32 | .f32 | d, a | all | 1.0 | Reciprocal with explicit rounding |
rcp.rn.f64 | .f64 | d, a | all | 1.0 | Double reciprocal, round-nearest |
sqrt | .f32 .f64 | d, a | all | 1.0 | Square root (.approx .rn .rz .rm .rp) |
rsqrt | .f32 .f64 | d, a | all | 1.0 | Reciprocal square root (.approx) |
sin | .f32 | d, a | all | 1.0 | Sine (approximate) |
cos | .f32 | d, a | all | 1.0 | Cosine (approximate) |
lg2 | .f32 | d, a | all | 1.0 | Log base 2 (approximate) |
ex2 | .f32 | d, a | all | 1.0 | Exp base 2 (approximate) |
tanh | .f32 | d, a | 75+ | 6.5 | Hyperbolic tangent (approximate) |
testp | .f32 .f64 | p, a | 20+ | 2.0 | Test FP property (.finite .infinite .number .notanumber .normal .subnormal) |
copysign | .f32 .f64 | d, a, b | 20+ | 2.0 | Copy sign from b to a |
fma.f32 | .f32 | d, a, b, c | 20+ | 2.0 | FP32 fused multiply-add (rounding modes) |
fma.f64 | .f64 | d, a, b, c | 20+ | 2.0 | FP64 fused multiply-add |
Half-Precision and BFloat16 Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision addition |
sub | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision subtraction |
mul | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision multiplication |
fma | .f16 .f16x2 .bf16 .bf16x2 | d, a, b, c | 53+ | 4.2 | Half-precision fused multiply-add |
neg | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision negate |
abs | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision absolute value |
min | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision minimum |
max | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision maximum |
min.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Min with NaN propagation |
max.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Max with NaN propagation |
ex2.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision exp2 |
tanh.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision tanh |
fma.rn.relu | .f16 .bf16 | d, a, b, c | 80+ | 7.0 | Fused multiply-add with ReLU |
Comparison and Selection
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
setp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .f16 .bf16 .b16 .b32 .b64 | p[|q], a, b | all | 1.0 | Set predicate on comparison |
selp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .b16 .b32 .b64 | d, a, b, p | all | 1.0 | Select on predicate |
slct | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b, c | all | 1.0 | Select on comparison |
set | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b | all | 1.0 | Compare and set |
Comparison operators for setp/set: .eq .ne .lt .le .gt .ge .lo .ls .hi .hs .equ .neu .ltu .leu .gtu .geu .num .nan
Logic and Shift
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
and | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise AND |
or | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise OR |
xor | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise XOR |
not | .b16 .b32 .b64 .pred | d, a | all | 1.0 | Bitwise NOT |
cnot | .b16 .b32 .b64 | d, a | all | 1.0 | C-style logical NOT |
lop3 | .b32 | d, a, b, c, immLut | 50+ | 4.0 | 3-input logic operation (LUT-encoded) |
shl | .b16 .b32 .b64 | d, a, b | all | 1.0 | Shift left |
shr | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Shift right (arithmetic for .s, logical for .u) |
shf.l | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift left |
shf.r | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift right (.clamp .wrap) |
prmt | .b32 | d, a, b, c | 20+ | 2.0 | Byte permute (.f4e .b4e .rc8 .ecl .ecr .rc16) |
Data Movement
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mov | .b16 .b32 .b64 .b128 .u16 .u32 .u64 .s16 .s32 .s64 .f16 .f32 .f64 .pred | d, a | all | 1.0 | Move register-to-register |
shfl | .b32 | d|p, a, b, c | 30+ | 3.0 | Warp shuffle (.up .down .bfly .idx) |
shfl.sync | .b32 | d|p, a, b, c, membermask | 70+ | 6.0 | Warp shuffle with sync |
vote | .pred | d, {p}, a | 20+ | 2.0 | Warp vote (.all .any .uni .ballot) |
vote.sync | .pred .b32 | d, {p}, a, membermask | 70+ | 6.0 | Warp vote with sync |
match | .b32 .b64 | d, a | 70+ | 6.0 | Warp match (.any .all) |
match.sync | .b32 .b64 | d, a, membermask | 70+ | 6.0 | Warp match with sync |
redux | .s32 .u32 | d, a | 80+ | 7.0 | Warp reduction (.add .min .max .and .or .xor) |
redux.sync | .s32 .u32 | d, a, membermask | 80+ | 7.0 | Warp reduction with sync |
activemask | .b32 | d | 70+ | 6.2 | Get active thread mask |
elect | .pred | p | 90+ | 8.0 | Elect one leader thread |
elect.one | -- | d, {p} | 90+ | 8.0 | Elect one thread, return success |
Load, Store, and Memory
Global, Local, Shared, Const
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
ld | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | d, [a] | all | 1.0 | Load from memory (.global .shared .local .const .param) |
ld.nc | .b32 .b64 .b128 .f32 .f64 | d, [a] | 35+ | 3.2 | Non-coherent load (read-only cache) |
ld.param | .b8 .b16 .b32 .b64 .b128 | d, [a] | all | 1.0 | Load from kernel parameter space |
st | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | [a], b | all | 1.0 | Store to memory |
ldu | .b32 .b64 .b128 .f32 .f64 | d, [a] | 20+ | 2.0 | Load via uniform cache (deprecated) |
prefetch | .L1 .L2 | [a] | 20+ | 2.0 | Prefetch to cache level |
prefetchu | .L1 | [a] | 20+ | 2.0 | Prefetch uniform |
isspacep | .global .shared .local .const | p, a | 20+ | 2.0 | Test address space |
cvta | -- | d, a | 20+ | 2.0 | Convert address space (generic <-> specific) |
cvta.to | .global .shared .local .const | d, a | 20+ | 2.0 | Convert to specific state space |
Cache qualifiers for ld/st: .ca .cg .cs .cv .lu .wb .wt
Eviction policy (PTX 7.4+): .L2::evict_first .L2::evict_last .L2::evict_normal .L2::cache_hint
Async Copy
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
cp.async | .ca .cg | [dst], [src], size | 80+ | 7.0 | Async copy (4/8/16 bytes, global->shared) |
cp.async.commit_group | -- | -- | 80+ | 7.0 | Commit outstanding async copies |
cp.async.wait_group | -- | N | 80+ | 7.0 | Wait for async copy group completion |
cp.async.wait_all | -- | -- | 80+ | 7.0 | Wait for all async copies |
cp.async.mbarrier.arrive | -- | [mbar] | 80+ | 7.0 | Arrive at mbarrier on async copy completion |
cp.async.bulk | -- | [dst], [src], size | 90+ | 8.0 | Bulk async copy (TMA) |
cp.async.bulk.tensor | -- | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy (TMA, 1-5D tiles) |
cp.async.bulk.prefetch | -- | [src], size | 90+ | 8.0 | Prefetch via TMA |
cp.async.bulk.prefetch.tensor | -- | [src], dims... | 90+ | 8.0 | Tensor prefetch via TMA |
cp.async.bulk.commit_group | -- | -- | 90+ | 8.0 | Commit bulk async group |
cp.async.bulk.wait_group | -- | N | 90+ | 8.0 | Wait for bulk group completion |
cp.reduce.async.bulk | .add .min .max .and .or .xor .inc .dec | [dst], [src], size | 90+ | 8.0 | Bulk async copy with reduction |
cp.reduce.async.bulk.tensor | .add .min .max .and .or .xor .inc .dec | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy with reduction |
st.async | .b32 .b64 .b128 | [a], b | 90+ | 8.1 | Async store |
st.bulk | -- | [a], b, size | 90+ | 8.0 | Bulk store |
red.async | .add .min .max .and .or .xor .inc .dec | [a], b | 90+ | 8.1 | Async reduction |
Multimem
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
multimem.ld_reduce | .f16 .bf16 .f32 .u32 .s32 .u64 | d, [a] | 90+ | 8.1 | Multicast memory load with reduction |
multimem.st | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory store |
multimem.red | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory reduction |
Cache Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
discard | .global .L2 | [a], size | 80+ | 7.4 | Discard data (hint: no writeback) |
applypriority | .global .L2 | [a], size, prio | 80+ | 7.4 | Set cache eviction priority |
createpolicy.cvt | .L2 | d, imm | 80+ | 7.4 | Create cache policy from immediate |
createpolicy.fractional | .L2 | d, fraction | 80+ | 7.4 | Create fractional cache policy |
createpolicy.range | .L2 | d, lo, hi, hit, miss | 80+ | 7.4 | Create cache policy for address range |
Conversion
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
cvt | all int/float combinations | d, a | all | 1.0 | Type conversion (rounding: .rn .rz .rm .rp .rna) |
cvt.pack | .b16.b32 .b32.b32 | d, a, b | 80+ | 7.1 | Pack two values into one register |
cvt type combinations (from instruction table encoding strings):
The instruction table builder registers extensive type-pair combinations for cvt. Representative signatures:
| Source | Destination | Notes |
|---|---|---|
F[16|32|64] | F[16|32|64] | Float-to-float, rounding modes apply |
F[16|32|64] | I[8|16|32|64] | Float-to-integer, rounding + saturation |
I[8|16|32|64] | F[16|32|64] | Integer-to-float, rounding modes |
I[8|16|32|64] | I[8|16|32|64] | Integer-to-integer, sign extend / truncate |
E16 | F[16|64] / I[8|16|32|64] | bf16 source conversions (sm_80+) |
F[16|64] / I[8|16|32|64] | E16 | bf16 destination conversions (sm_80+) |
H32 (tf32) | various | TensorFloat-32 conversions (sm_80+) |
Q16 (fp8 e5m2) | F32 / H32 / E32 | FP8 conversions (sm_89+) |
R8 (fp8 e4m3) | F32 / H32 / E32 | FP8 e4m3 conversions (sm_89+) |
R16 (fp4) | F32 | FP4 conversions (sm_100+, PTX 8.6) |
Q32 (fp8 e5m2) | F32 | Extended FP8 conversions (sm_100+) |
The modifiers .ftz (flush-to-zero), .sat (saturation), and .relu (clamp negative values to 0) are recognized. The .rna rounding mode (round-to-nearest-away, PTX 8.3+) is registered for cvt.tf32.f32.
szext -- Sign/Zero Extend
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
szext | .b32 | d, a, pos | 90+ | 8.0 | Sign- or zero-extend at bit position |
Texture and Surface
Texture
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tex | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord] | all | 1.0 | Texture fetch with sampler |
tex.base | .1d .2d .3d | d, [tex, coord] | 60+ | 5.0 | Texture fetch base level |
tex.level | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, lod] | all | 1.0 | Texture fetch at explicit LOD |
tex.grad | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, dPdx, dPdy] | all | 1.0 | Texture fetch with explicit gradients |
tld4 | .2d .a2d | d, [tex, sampler, coord] | 20+ | 2.0 | Texture gather (4 texels) |
txq | .width .height .depth .channel_data_type .channel_order .normalized_coords .filter_mode .addr_mode_0 .addr_mode_1 .addr_mode_2 .samp_pos .num_mip_levels .num_samples | d, [tex] | 20+ | 2.0 | Texture query |
Return types for texture ops: .v4.s32 .v4.u32 .v4.f32 (4-component return).
Surface
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
suld.b | .1d .2d .3d .a1d .a2d | d, [surf, coord] | 20+ | 2.0 | Surface load (bindless) |
sust.b | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (bindless) |
sust.p | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (packed format) |
sured.b | .1d .2d .3d | d, [surf, coord], a | 20+ | 2.0 | Surface reduction (.add .min .max .and .or) |
suq | .width .height .depth .channel_data_type .channel_order .array_size | d, [surf] | 20+ | 2.0 | Surface query |
Surface clamp modes: .trap .clamp .zero
Tensormap
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tensormap.replace | .tile .im2col | d, [tmap], field, value | 90+ | 8.0 | Replace tensormap field at runtime |
tensormap.cp_fenceproxy | -- | [tmap] | 90+ | 8.0 | Tensormap copy fence proxy |
Control Flow
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bra | -- | target | all | 1.0 | Branch (unconditional or predicated) |
bra.uni | -- | target | all | 1.0 | Uniform branch (all threads take same direction) |
brx.idx | -- | a, [targets] | 70+ | 6.0 | Indexed branch (jump table) |
call | -- | (ret), func, (params) | 20+ | 2.0 | Function call (with ABI) |
ret | -- | -- | 20+ | 2.0 | Return from function |
exit | -- | -- | all | 1.0 | Exit kernel / terminate thread |
trap | -- | -- | all | 1.0 | Trigger error |
brkpt | -- | -- | 11+ | 1.0 | Breakpoint (debugger halt) |
pmevent | -- | imm | 20+ | 2.0 | Performance monitor event |
pmevent.mask | -- | imm | 20+ | 3.0 | Performance monitor event with mask |
nanosleep | -- | t | 70+ | 6.3 | Sleep for t nanoseconds |
Synchronization and Barriers
Legacy Barrier
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bar.sync | -- | a{, b} | all | 1.0 | Barrier synchronize (CTA-level) |
bar.arrive | -- | a, b | all | 1.0 | Barrier arrive (non-blocking) |
bar.red | .and .or .popc | d, a, {b}, p | all | 1.0 | Barrier with reduction |
bar.warp | .sync | membermask | 70+ | 6.0 | Warp-level barrier |
Named Barrier (PTX 6.0+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
barrier | -- | a{, b} | 70+ | 6.0 | Named barrier synchronize |
barrier.arrive | -- | a, b | 70+ | 6.0 | Named barrier arrive |
barrier.red | .and .or .popc | d, a, {b}, p | 70+ | 6.0 | Named barrier with reduction |
CTA-Cluster Barrier (PTX 7.8+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bar.cta | -- | -- | 90+ | 7.8 | CTA-level barrier sync |
bar.cta.arrive | -- | -- | 90+ | 7.8 | CTA-level barrier arrive |
bar.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA barrier with reduction |
barrier.cta | -- | -- | 90+ | 7.8 | CTA named barrier sync |
barrier.cta.arrive | -- | -- | 90+ | 7.8 | CTA named barrier arrive |
barrier.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA named barrier with reduction |
barrier.cluster.arrive | -- | -- | 90+ | 7.8 | Cluster-level barrier arrive |
barrier.cluster.wait | -- | -- | 90+ | 7.8 | Cluster-level barrier wait |
Asynchronous Barriers (mbarrier)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mbarrier.init | .shared.b64 | [mbar], count | 80+ | 7.0 | Initialize mbarrier with expected count |
mbarrier.inval | .shared.b64 | [mbar] | 80+ | 7.0 | Invalidate mbarrier |
mbarrier.arrive | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive at mbarrier |
mbarrier.arrive_drop | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive and drop expected count |
mbarrier.test_wait | .shared.b64 | p, [mbar], state | 80+ | 7.0 | Test if mbarrier phase complete |
mbarrier.test_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.1 | Test mbarrier parity |
mbarrier.try_wait | .shared.b64 | p, [mbar], state | 80+ | 7.8 | Try-wait on mbarrier (with timeout) |
mbarrier.try_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.8 | Try-wait on mbarrier parity |
mbarrier.pending_count | .b64 | d, state | 80+ | 7.0 | Get pending arrival count |
mbarrier.complete_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Complete transaction at mbarrier |
mbarrier.expect_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Set expected transaction count |
mbarrier.tx | -- | [mbar], count | 90+ | 8.0 | Transaction mbarrier arrive |
mbarrier.arrive.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive with expected tx count |
mbarrier.arrive_drop.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive-drop with expected tx count |
Memory Fence
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
membar | .cta .gl .sys | -- | all | 1.0 | Memory barrier (scope) |
membar.proxy | .alias | -- | 75+ | 6.4 | Proxy memory barrier (alias scope) |
fence.proxy | .alias .async .async.global .async.shared::cta | -- | 70+ | 6.0 | Fence proxy (alias/async) |
fence.proxy.tensormap | -- | [addr] | 90+ | 8.0 | Fence tensormap proxy |
Grid Dependency Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
griddepcontrol | .launch_dependents .wait | -- | 90+ | 7.8 | Grid dependency control |
Atomic and Reduction
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
atom | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | d, [a], b | all | 1.1 | Atomic RMW (.add .min .max .and .or .xor .exch .cas .inc .dec) |
atom.global | (same as atom) | d, [a], b | all | 1.1 | Atomic on global memory |
atom.shared | (same as atom) | d, [a], b | all | 1.1 | Atomic on shared memory |
red | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | [a], b | all | 1.1 | Reduction (no return value) |
red.global | (same as red) | [a], b | all | 1.1 | Reduction on global memory |
Atom/red scope modifiers (PTX 6.0+): .cta .gpu .sys .cluster
Matrix (MMA / Tensor Core)
WMMA (Warp Matrix Multiply-Accumulate, PTX 6.0+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
wmma.load.a | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix A fragment |
wmma.load.b | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix B fragment |
wmma.load.c | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load accumulator C fragment |
wmma.store.d | .sync .aligned | [ptr], d, stride | 70+ | 6.0 | Store result D fragment |
wmma.mma | .sync .aligned | d, a, b, c | 70+ | 6.0 | Matrix multiply-accumulate |
WMMA shapes (from validator sub_4BFED0): .m16n16k16 .m32n8k16 .m8n32k16 .m16n16k8 etc.
WMMA type combinations (from string table): F16F16F16F16, F32F16F16F32, F32F32 (TF32), I32I8I8I32, I32I4I4I32, I32B1B1I32, F64F64F64F64
MMA (PTX 6.5+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mma | .sync .aligned | d, a, b, c | 75+ | 6.5 | Matrix multiply-accumulate (Turing+) |
MMA type combinations (verified from instruction table builder strings):
| D type | A type | B type | C type | SM | Notes |
|---|---|---|---|---|---|
F16 | F16 | F16 | F16 | 75+ | Native FP16 |
F32 | F16 | F16 | F32 | 75+ | Mixed-precision |
F32 | F32 | F32 | -- | 80+ | TF32 Tensor Core |
F32 | E16 | E16 | F32 | 80+ | BFloat16 |
F32 | T32 | T32 | F32 | 80+ | TF32 path (string: F32T32T32F32) |
I32 | I8 | I8 | I32 | 75+ | INT8 |
I32 | I4 | I4 | I32 | 75+ | INT4 |
I32 | B1 | B1 | I32 | 75+ | Binary (1-bit) |
F64 | F64 | F64 | F64 | 80+ | Double-precision |
F16 | Q8 | Q8 | F16 | 89+ | FP8 (e5m2) |
F32 | Q8 | Q8 | F32 | 89+ | FP8 mixed |
F16 | R4 | Q8 | F16 | 100+ | FP4 x FP8 |
F32 | R4 | Q8 | F32 | 100+ | FP4 x FP8 mixed |
F32 | R4 | R4 | F32 | 100+ | FP4 x FP4 |
F32 | R4 | R4 | F32.Q8 | 100+ | FP4 with scale (string: F32R4R4F32Q8) |
F32 | Q8 | Q8 | F32.Q8 | 100+ | FP8 with block scale |
MMA shapes: .m16n8k16 .m16n8k32 .m16n8k64 .m16n8k128 .m16n8k256
Sparse MMA modifiers: .sp with metadata selector and sparsity pattern
WGMMA (Warp-Group MMA, PTX 7.8+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
wgmma.mma_async | .aligned | d, a_desc, b_desc | 90+ | 7.8 | Warp-group async matrix multiply |
wgmma.fence | .aligned | -- | 90+ | 7.8 | WGMMA fence (ordering) |
wgmma.commit_group | .aligned | -- | 90+ | 7.8 | Commit WGMMA group |
wgmma.wait_group | .aligned | N | 90+ | 7.8 | Wait for WGMMA group completion |
WGMMA operand encoding strings (from instruction table, selection):
hUUhP, fUUfP, hUhhP, fUhfP, hhUhP, fhUfP (H=half dest, F=float dest, U=desc operand, P=pred)
With accumulator: hUUhdC, fUUfdC, hUhhdC, fUhfdC
With scale: hUUhdCP, fUUfdCP (P=pred control for scale)
TCGen05 (5th Generation Tensor Core, sm_100+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tcgen05.mma | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen tensor core MMA |
tcgen05.mma.ws | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen MMA with warpgroup scale |
tcgen05.ld | -- | d, [desc] | 100+ | 8.6 | TC load from descriptor |
tcgen05.ld.red | -- | d, [desc], src | 100+ | 8.6 | TC load with reduction |
tcgen05.st | -- | [desc], src | 100+ | 8.6 | TC store to descriptor |
tcgen05.cp | -- | [desc], [src] | 100+ | 8.6 | TC copy |
tcgen05.commit | -- | [mbar] | 100+ | 8.6 | TC commit |
tcgen05.shift | -- | [desc] | 100+ | 8.6 | TC shift accumulator |
tcgen05.alloc | -- | d, nCols | 100+ | 8.6 | Allocate TC columns |
tcgen05.dealloc | -- | nCols | 100+ | 8.6 | Deallocate TC columns |
tcgen05.relinquish_alloc_permit | -- | -- | 100+ | 8.6 | Relinquish TC allocation permit |
tcgen05.fence | -- | -- | 100+ | 8.6 | TC fence |
tcgen05.wait | -- | -- | 100+ | 8.6 | TC wait |
TCGen05 MMA operand encodings (from instruction table):
MUUuP, MUUMuP, MMUuP, MMUMuP (M=matrix, U=desc, u=uniform, P=pred)
With accumulator descriptors: MUUudP, MUUuPC, MMUudP, MMUuPC, MUUMudP, MMUMudPC
With metadata: MUUuMMP, MUUMuMMP, MMUuMMP, MMUMuMMP (sparse)
TCGen05 Guardrails (internal)
These are internal debug/verification instructions, not user-facing PTX:
| Mnemonic | SM | Description |
|---|---|---|
_tcgen05.guardrails.is_phase_valid | 100+ | Validate TC phase |
_tcgen05.guardrails.are_columns_allocated | 100+ | Check column allocation |
_tcgen05.guardrails.is_current_warp_valid_owner | 100+ | Check warp ownership |
_tcgen05.guardrails.in_physical_bounds | 100+ | Check physical bounds |
_tcgen05.guardrails.allocation_granularity | 100+ | Validate allocation granularity |
_tcgen05.guardrails.datapath_alignment | 100+ | Validate datapath alignment |
_tcgen05.guardrails.sp_consistency_across_idesc_mod | 100+ | Sparse consistency check |
_tcgen05.guardrails.check_sparse_usage | 100+ | Validate sparse usage |
Matrix Utility Instructions
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
ldmatrix | .sync .aligned .m8n8 .num | d, [ptr] | 75+ | 6.5 | Load matrix from shared memory |
stmatrix | .sync .aligned .m8n8 .num | [ptr], a | 90+ | 7.8 | Store matrix to shared memory |
movmatrix | .aligned | d, a | 80+ | 7.1 | Move/transform matrix fragment |
SIMD Video Instructions
These 8/16-bit SIMD instructions operate on packed sub-word elements within 32-bit registers.
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
vadd | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD add with secondary op |
vsub | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD subtract |
vmad | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD multiply-add |
vmin | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD minimum |
vmax | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD maximum |
vabsdiff | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD absolute difference |
vset | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD set on compare |
vshl | .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift left |
vshr | .s32.u32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift right |
vadd2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD add |
vsub2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD subtract |
vmin2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD minimum |
vmax2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD maximum |
vabsdiff2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit absolute difference |
vset2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit set on compare |
vavrg2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit average |
vadd4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD add |
vsub4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD subtract |
vmin4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD minimum |
vmax4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD maximum |
vabsdiff4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit absolute difference |
vset4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit set on compare |
vavrg4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit average |
Element selectors: .b0 .b1 .b2 .b3 (byte), .h0 .h1 (half-word)
Secondary ops: .add .min .max
Saturation: .sat
Miscellaneous
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
getctarank | .shared .global | d, a | 90+ | 7.8 | Get CTA rank in cluster from address |
istypep | .texref .samplerref .surfref | p, a | all | 1.0 | Test if variable is a type |
preexit | -- | -- | all | 1.0 | Pre-exit notification |
stacksave | -- | d | 20+ | 2.0 | Save stack pointer |
stackrestore | -- | a | 20+ | 2.0 | Restore stack pointer |
alloca | -- | d, size | 20+ | 2.0 | Dynamic stack allocation |
clusterlaunchcontrol.try_cancel.async | -- | [mbar], d | 100+ | 8.7 | Cluster launch cancel (async) |
clusterlaunchcontrol.query_cancel | -- | d, [mbar] | 100+ | 8.7 | Query cluster launch cancel status |
Register and Shared Memory Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
setmaxnreg.inc | -- | N | 90+ | 7.8 | Increase max register count |
setmaxnreg.dec | -- | N | 90+ | 7.8 | Decrease max register count |
setmaxreg.alloc | -- | N | 100+ | 8.6 | Allocate registers to max |
setmaxreg.dealloc | -- | N | 100+ | 8.6 | Deallocate from max registers |
setmaxreg.try_alloc | -- | d, N | 100+ | 8.6 | Try-allocate registers |
setsmemsize | -- | N | 90+ | 7.8 | Set dynamic shared memory size |
setsmemsize.flush | -- | N | 100+ | 8.6 | Set shared memory size with flush |
getnextworkid | -- | d | 90+ | 8.0 | Get next dynamic work unit ID |
Warp-Group Management
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
_warpgroup.arrive | -- | -- | 90+ | 7.8 | Warpgroup arrive (internal) |
_warpgroup.wait | -- | N | 90+ | 7.8 | Warpgroup wait |
_warpgroup.commit_batch | -- | -- | 90+ | 7.8 | Warpgroup commit batch |
Internal Instructions
These underscore-prefixed instructions are not part of the public PTX ISA. They are generated internally by ptxas during lowering, stub synthesis, or as pre-codegen IR representations. All are registered in the instruction table builder sub_46E000 and appear in --dumpir output, but users never write them directly.
Internal Memory
| Mnemonic | Type suffixes | Operands | String addr | Handler / Formatter | Description |
|---|---|---|---|---|---|
_ldldu | (varies) | d, [a] | 0x1d080ee | formatter sub_4DD860 | Unified load-uniform; combines ld+ldu semantics for uniform-cache-path loads |
_ldsm | .b8 .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, [M] | 0x1d076c2 | handlers sub_46B0C0--sub_46B160, validator sub_4AEB60 | Load shared matrix; loads matrix tiles from shared memory into registers for MMA. Opcode ID 28 |
_movm | .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, a | 0x1d076da | handlers sub_46B1B0--sub_46B260 | Move matrix; register-to-register matrix data movement with optional format conversion. Opcode ID 29 |
Internal Cache Control
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_createpolicy.fractional | .L2 | d, fraction | 0x1d0813a | 0x47752f | Internal form of createpolicy.fractional; creates fractional L2 cache eviction policy |
_createpolicy.range | .L2 | d, lo, hi, hit, miss | 0x1d08158 | 0x477579 | Internal form of createpolicy.range; creates L2 policy for address range |
Internal Surface
| Mnemonic | Type suffixes | Operands | String addr | Table builder xrefs | Description |
|---|---|---|---|---|---|
_sulea.b | (varies) | d, [surf, coord] | 0x1d088bc | 0x4815cd, 0x48166b | Surface load effective address, bindless; computes address for suld.b without performing the load |
_sulea.p | (varies) | d, [surf, coord] | 0x1d088c5 | 0x48161c, 0x4816ba | Surface load effective address, packed; computes address for sust.p-mode surface access |
Internal FP / Guard
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_checkfp.divide | (varies) | d, a, b | 0x1d088d2 | 0x481709 | FP division guard; inserted during lowering to validate divisor (handles division-by-zero, denormals) before SASS div emission |
Internal Control Flow / ABI
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_gen_proto | -- | (opaque) | 0x1d08903 | 0x48189a | Generate function prototype; synthesizes call prototypes for indirect / device-runtime calls during ABI resolution |
_jcall | -- | target | 0x1d0890e | 0x4818df | Internal jump-call; used inside auto-generated unified-function-stub (UFT) wrappers synthesized by sub_451680 (.func .attribute(.unified_func_stub) __cuda_uf_stub_%s() { _jcall %s; }) |
Internal Warp
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_match | (varies) | d, a | 0x1d08a24 | 0x483404 | Internal match; pre-sync lowered form of warp match instruction, distinct from the public match.sync |
Internal MMA / Tensor Core
| Mnemonic | Type suffixes | Operands | String addr | Handlers | Description |
|---|---|---|---|---|---|
_mma.warpgroup | 135 type combos (F16, BF16, TF32, FP8, INT8) | d, a, b, c | 0x1d072e3 | 135 handlers sub_4668A0--sub_469FD0 | Warp-group MMA; pre-codegen form of WGMMA. Each handler registers one (src, dst, acc) type triple via sub_465030. Lowers to MERCURY_warpgroup_mma_* SASS opcodes (sm_90+) |
_zzn.z8a8x4 | (sub-byte int) | d, a, b, c | 0x1cfdc03 | data table at 0x1cfe678 | ROT13-obfuscated _mma.m8n8k4; sub-byte integer MMA with tile shape m8n8k4 for INT4/INT2 and bit-level XOR MMA (sm_75+) |
Handler address summary for internal instructions:
| Range | Contents |
|---|---|
sub_46B0C0--sub_46B260 | _ldsm (3) + _movm (3) type-variant handlers |
sub_4668A0--sub_469FD0 | _mma.warpgroup 135 type-variant handlers |
sub_4AEB60 | _ldsm validator (3.7 KB) -- handles .s8.s4/.u8.u4 format rules |
sub_451680 | _jcall UFT stub generator |
sub_4DD860 | _ldldu PTX text formatter |
Instruction Table Builder Internals
Registration Mechanism
The 93 KB function sub_46E000 runs once during parser initialization (sub_451730). Each of its 1,141 calls to sub_46BED0 has the form:
sub_46BED0(table, operand_encoding, opcode_id, opcode_name,
type_flags, sm_requirement, xmm_data, extra);
Where:
- operand_encoding is a compact string like "F32F32", "I32I8I8I32", or "MUUuP"
- opcode_id is the internal opcode integer (mapped 1:1 to Ori IR opcodes from ctor_003)
- type_flags is a bitfield encoding which .sNN / .uNN / .fNN / .bNN qualifiers are legal
- sm_requirement gates the instruction to architectures >= this SM version
The operand encoding characters are:
| Char | ID | Meaning | Type bits registered |
|---|---|---|---|
F | 1 | Float operand | .f16 .f32 .f64 |
H | 2 | Half-precision | .f16 .f16x2 |
N | 3 | Immediate / numeric | (no type suffix) |
I | 4 | Integer operand | .s8--.s64 .u8--.u64 |
B | 5 | Bitwise operand | .b8--.b128 |
P | 6 | Predicate | .pred |
O | 7 | Optional operand | (no type suffix) |
E | 8 | Extended type | .bf16 .e4m3 .e5m2 |
Q | 10 | FP8 type | .e5m2 .e4m3 (fp8) |
R | 11 | FP4/narrow type | .e2m1 (fp4) |
M | -- | Matrix descriptor | Tensor core descriptor |
U | -- | Uniform descriptor | TMA/TC uniform register |
C | -- | Carry/accumulator | MMA accumulator control |
When a letter is followed by a digit, that digit constrains the bit-width: F32 means only .f32, I16 means only .s16/.u16. The function sub_1CB0850 registers each valid width into a bitset.
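The width-constraint rule above can be illustrated with a short sketch. This is a hypothetical reimplementation of the decode step, not recovered code: the type names, width bitset values, and the function name are assumptions; only the encoding grammar (class letter, optional width digits, at most 4 operand slots) comes from the analysis above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical sketch of how an operand-encoding string such as "F32F32" or
 * "I32I8I8I32" might be decoded into per-operand width bitsets (the role the
 * text attributes to sub_46BED0 / sub_1CB0850). Bit assignments are
 * illustrative, not recovered constants. */

#define W8   (1u << 0)
#define W16  (1u << 1)
#define W32  (1u << 2)
#define W64  (1u << 3)
#define WALL (W8 | W16 | W32 | W64)

typedef struct {
    char kind;       /* class letter: F, H, I, B, P, M, U, ... */
    uint32_t widths; /* bitset of legal bit-widths for this slot */
} operand_slot;

/* Returns the number of operand slots decoded (max 4, per descriptor layout). */
int decode_encoding(const char *enc, operand_slot out[4]) {
    int n = 0;
    while (*enc && n < 4) {
        out[n].kind = *enc++;
        if (isdigit((unsigned char)*enc)) {
            /* A trailing digit constrains the width: F32 -> .f32 only. */
            int w = 0;
            while (isdigit((unsigned char)*enc))
                w = w * 10 + (*enc++ - '0');
            out[n].widths = (w == 8)  ? W8  :
                            (w == 16) ? W16 :
                            (w == 32) ? W32 :
                            (w == 64) ? W64 : 0;
        } else {
            out[n].widths = WALL; /* no digit: all widths the class allows */
        }
        n++;
    }
    return n;
}
```

Under this reading, "F32F32" yields two slots locked to 32-bit, while a digitless letter such as the P in "MUUuP" leaves the slot unconstrained.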
Descriptor Layout
Each registered instruction creates a 368-byte descriptor node via sub_424070. Key fields:
| Offset | Field | Description |
|---|---|---|
| +0 | opcode_id | Internal opcode identifier |
| +8 | type_flags | Bitfield of legal type qualifiers |
| +12 | xmm_data | 128-bit SIMD data (rounding, modifiers) |
| +28 | extra_flags | Architecture / behavior flags |
| +36 | operand_count | Number of operands |
| +40+ | operand_slots | Per-operand type bitsets (4 slots max, each 8 bytes) |
| +232 | name_length | Length of the opcode name string |
The two hash tables at offsets 2472 and 2480 in the instruction table provide dual-path lookup -- the first table is the primary lookup; if the opcode is not found, the second table is checked. This two-table scheme separates core instructions from extended/variant instructions.
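The dual-path lookup can be sketched as follows. ptxas uses two hash tables; flat arrays and linear search stand in here so the example is self-contained, and the function names and the "add"/"mov"/"cvt" entries are illustrative (the _ldsm/_movm opcode IDs 28/29 are from the internal-instruction tables above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the two-table lookup scheme: primary (core) table first, then
 * the extended/variant table on a miss. Real ptxas hashes; arrays stand in. */

typedef struct { const char *name; int opcode_id; } descriptor;

static const descriptor core_table[] = {
    { "add", 1 }, { "mov", 2 }, { "cvt", 3 },      /* illustrative IDs */
};
static const descriptor variant_table[] = {
    { "_ldsm", 28 }, { "_movm", 29 },              /* IDs per tables above */
};

static const descriptor *find_in(const descriptor *t, size_t n,
                                 const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(t[i].name, name) == 0)
            return &t[i];
    return NULL;
}

/* Dual-path lookup: check the primary table, fall back to the second. */
const descriptor *lookup_opcode(const char *name) {
    const descriptor *d = find_in(core_table,
                                  sizeof core_table / sizeof *core_table,
                                  name);
    if (!d)
        d = find_in(variant_table,
                    sizeof variant_table / sizeof *variant_table, name);
    return d;
}
```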
Instruction Count by Category
Based on the 1,141 registration calls and the 431 decoded Ori IR opcodes (from ctor_003):
| Category | Approximate descriptor count |
|---|---|
| Integer arithmetic (add/sub/mul/mad/div/rem/...) | ~120 |
| Float arithmetic (fadd/fmul/fma/div/rcp/sqrt/...) | ~100 |
| Half/BF16 arithmetic | ~60 |
| Comparison and selection (setp/selp/set/slct) | ~50 |
| Logic and shift (and/or/xor/shl/shr/lop3/shf) | ~40 |
| Conversion (cvt + 100+ type combinations) | ~180 |
| Load/store/atomic (ld/st/atom/red + variants) | ~150 |
| Texture/surface (tex/suld/sust/sured/txq) | ~80 |
| Control flow (bra/call/ret/exit/bar/barrier) | ~50 |
| MMA/WMMA/WGMMA/TCGen05 | ~160 |
| SIMD video (vadd/vsub/vmin/vmax/...) | ~60 |
| Miscellaneous (prmt/bfe/bfi/shfl/vote/...) | ~91 |
| Total | 1,141 |
PTX 8.x New Instructions
Instructions introduced in PTX ISA 8.0--8.7 (CUDA 12.0--12.8), verified from SM architecture gates and PTX version checks in the ptxas v13.0.88 binary:
| PTX | Instruction | SM | Description |
|---|---|---|---|
| 8.0 | cp.async.bulk | 90+ | TMA bulk async copy |
| 8.0 | cp.async.bulk.tensor | 90+ | TMA tensor async copy |
| 8.0 | cp.reduce.async.bulk | 90+ | Bulk async copy with reduction |
| 8.0 | cp.reduce.async.bulk.tensor | 90+ | Tensor async copy with reduction |
| 8.0 | cp.async.bulk.prefetch | 90+ | TMA prefetch |
| 8.0 | cp.async.bulk.prefetch.tensor | 90+ | TMA tensor prefetch |
| 8.0 | cp.async.bulk.commit_group | 90+ | Commit TMA group |
| 8.0 | cp.async.bulk.wait_group | 90+ | Wait for TMA group |
| 8.0 | tensormap.replace | 90+ | Runtime tensormap modification |
| 8.0 | elect / elect.one | 90+ | Elect leader thread |
| 8.0 | mbarrier.complete_tx | 90+ | Transaction mbarrier completion |
| 8.0 | mbarrier.expect_tx | 90+ | Set expected transaction count |
| 8.0 | st.bulk | 90+ | Bulk store |
| 8.0 | szext | 90+ | Sign/zero extend at bit position |
| 8.0 | getnextworkid | 90+ | Dynamic work unit |
| 8.1 | st.async | 90+ | Asynchronous store |
| 8.1 | red.async | 90+ | Asynchronous reduction |
| 8.1 | multimem.ld_reduce | 90+ | Multicast load-reduce |
| 8.1 | multimem.st | 90+ | Multicast store |
| 8.1 | multimem.red | 90+ | Multicast reduction |
| 8.6 | tcgen05.mma | 100+ | 5th-gen tensor core MMA |
| 8.6 | tcgen05.mma.ws | 100+ | 5th-gen MMA with warp scale |
| 8.6 | tcgen05.ld / tcgen05.st | 100+ | TC load/store |
| 8.6 | tcgen05.ld.red | 100+ | TC load with reduction |
| 8.6 | tcgen05.cp | 100+ | TC copy |
| 8.6 | tcgen05.commit | 100+ | TC commit |
| 8.6 | tcgen05.shift | 100+ | TC shift accumulator |
| 8.6 | tcgen05.alloc / tcgen05.dealloc | 100+ | TC column allocation |
| 8.6 | tcgen05.relinquish_alloc_permit | 100+ | Relinquish TC permit |
| 8.6 | tcgen05.fence / tcgen05.wait | 100+ | TC fence/wait |
| 8.6 | setmaxreg.alloc / .dealloc / .try_alloc | 100+ | Register management |
| 8.6 | setsmemsize.flush | 100+ | Shared memory with flush |
| 8.7 | clusterlaunchcontrol.try_cancel.async | 100+ | Cluster launch cancel |
| 8.7 | clusterlaunchcontrol.query_cancel | 100+ | Query cancel status |
Cross-References
- PTX Parser (Flex + Bison) -- scanner, parser, instruction table builder details
- PTX Directive Handling -- .version, .target, .entry, etc.
- PTX-to-Ori Lowering -- how PTX instructions become Ori IR
- Ori IR Instructions -- the 431 Ori IR opcodes derived from PTX
- Intrinsic Table -- 608 built-in helper functions
- SM Architecture Map -- architecture version requirements
- SASS Opcode Catalog -- SASS equivalents of PTX instructions
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_46E000 | 93KB | Instruction table builder; runs once during parser init, makes 1,141 calls to sub_46BED0 to register all PTX instruction descriptors | 0.95 |
sub_46BED0 | -- | Instruction descriptor registrar; called 1,141 times with operand encoding, opcode ID, type flags, SM requirement | 0.95 |
sub_46C690 | -- | Instruction lookup entry point; dispatches to sub_46C6E0 for name-to-descriptor matching | 0.90 |
sub_46C6E0 | 6.4KB | Instruction name matcher; resolves PTX mnemonic strings to instruction table descriptors | 0.90 |
sub_5D4190 | 12.9KB | PTX text formatter dispatch; 81-string + 473-entry hash table routing to per-instruction formatters | 0.90 |
sub_489390 | -- | SM version gate; validates minimum SM architecture for each instruction | 0.85 |
sub_489050 | -- | PTX ISA version gate; validates minimum PTX version for each instruction | 0.85 |
sub_4BFED0 | -- | WMMA shape validator; checks legal WMMA shape combinations (.m16n16k16, .m32n8k16, etc.) | 0.85 |
sub_451680 | -- | UFT stub generator; synthesizes _jcall wrappers for unified-function-stub entries | 0.90 |
sub_451730 | -- | Parser initialization; calls sub_46E000 to build the instruction table | 0.85 |
sub_4DD860 | -- | _ldldu PTX text formatter; handles internal unified load-uniform instruction | 0.85 |
sub_4AEB60 | 3.7KB | _ldsm validator; validates .s8.s4/.u8.u4 format rules for shared matrix loads | 0.85 |
sub_465030 | -- | MMA type triple registrar; called by 135 _mma.warpgroup handlers to register (src, dst, acc) type combinations | 0.85 |
sub_424070 | -- | Instruction descriptor allocator; creates 368-byte descriptor nodes for each registered instruction | 0.85 |
sub_1CB0850 | -- | Width bitset registrar; registers valid bit-widths (e.g., F32 -> .f32 only) into per-operand bitsets | 0.80 |
EIATTR Attribute Catalog
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
EIATTR (ELF Info ATTRibute) is NVIDIA's proprietary metadata system embedded in .nv.info ELF sections within CUBIN files. Every CUDA kernel carries EIATTR records that tell the GPU driver how many registers to allocate, how much shared memory to reserve, what barriers the kernel uses, and dozens of other resource descriptors. Without this metadata, the driver cannot launch the kernel -- it has no way to determine the kernel's hardware resource footprint.
ptxas v13.0.88 defines 97 EIATTR codes, numbered 0 through 96 (0x00--0x60). The code-to-name mapping was extracted from the pointer table at VA 0x23FDC20 in the ptxas binary (16-byte entries: 8-byte string pointer + 8-byte metadata word, indexed by code number). The string names reside at 0x23FC6C7--0x23FD040. Code assignments were cross-verified against the nvlink v13.0.88 pointer table at 0x1D37D60, confirming identical enumeration across both tools.
| ELF section type | SHT_CUDA_INFO = 0x70000064 |
| Section name (global) | .nv.info |
| Section name (per-function) | .nv.info.<function_name> |
| Record format | Type-Length-Value (TLV), 4-byte aligned |
| Known attribute count | 97 codes: 0--96 (v13.0.88) |
| Name table VA | 0x23FDC20 (97 entries x 16 bytes = 1,552 bytes) |
| EIATTR builder function | sub_1CC9800 (14,764 bytes, 90 KB decompiled -- third largest in output range) |
| Barrier/register propagator | sub_1CC8950 (2,634 bytes, propagates counts across call graph) |
| TLV record emitter | sub_1CC85F0 (44 lines, writes individual EIATTR records) |
| SM-version gating | sub_1C97840 (checks whether an EIATTR code is valid for a given SM version) |
TLV Record Format
Each .nv.info section contains a flat sequence of 4-byte-aligned TLV records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially.
Record Layout
Offset Size Field
------ ---- -----
0x00 1 format Format byte (determines payload structure)
0x01 1 attr_code EIATTR type code (0x00--0x60)
0x02 2 size Payload size in bytes (little-endian uint16)
0x04 var payload Attribute-specific data (size bytes)
Total record size = 4 + size, padded up to 4-byte alignment. The minimum record is 4 bytes (format + code + size=0, no payload).
Format Byte
The format byte at offset 0 controls how the payload is interpreted:
| Format | Name | Payload structure | Typical use |
|---|---|---|---|
0x01 | Free | Raw bytes, attribute-specific layout | Offset tables, parameter info |
0x02 | Value | Single 32-bit value (no symbol index) | Global flags |
0x03 | Sized | 16-bit value + padding | Counts, sizes |
0x04 | Indexed | [sym_index:4][value:4] -- per-symbol attribute | Per-kernel resources |
Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to. The linker uses this index for symbol remapping during merge and for per-function property extraction during finalization.
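A minimal encoder for this format, as a sketch: the helper name and buffer API are assumptions (ptxas itself builds records through sub_1CC85F0 and a linked list, shown below); only the byte layout comes from the record-layout table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: serialize one format-0x04 (Indexed) EIATTR record into a caller
 * buffer. Payload is [sym_index:4][value:4]; all multi-byte fields are
 * little-endian. Returns the padded record size. */
size_t emit_indexed_record(uint8_t *buf, uint8_t attr_code,
                           uint32_t sym_idx, uint32_t value) {
    const uint16_t size = 8;              /* payload: [sym_index:4][value:4] */
    buf[0] = 0x04;                        /* format = Indexed */
    buf[1] = attr_code;                   /* EIATTR type code */
    buf[2] = (uint8_t)(size & 0xff);      /* payload size, little-endian */
    buf[3] = (uint8_t)(size >> 8);
    for (int i = 0; i < 4; i++) {         /* sym index + value, little-endian */
        buf[4 + i] = (uint8_t)(sym_idx >> (8 * i));
        buf[8 + i] = (uint8_t)(value   >> (8 * i));
    }
    return 4 + ((size + 3u) & ~3u);       /* header + payload, 4-byte aligned */
}
```

With an 8-byte payload the record is already aligned, so the returned size is 12; odd payload sizes would be padded up to the next multiple of 4, matching the walker pseudocode later in this page.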
Binary Evidence -- sub_1CC85F0
The TLV record emitter function directly confirms the encoding:
// sub_1CC85F0 -- simplified from decompilation
// a2 = attr_code, a3 = 16-bit payload size, a4 = payload data, a5 = symbol index
void emit_eiattr(void* elfw, uint8_t attr_code, uint16_t size, void* data, uint32_t sym_idx) {
    if (!is_valid_for_sm(attr_code, elfw->sm_version))   // SM gate, cf. sub_1C97840
        return;
    int section_index = get_nvinfo_section(elfw, sym_idx);
    // Allocate a 16-byte record node: 4-byte TLV header, 4-byte symbol
    // index, 8-byte pointer to the payload (serialized at section write-out)
    uint8_t* record = pool_alloc(16);
    record[0] = 0x04;                                    // format = Indexed
    record[1] = attr_code;                               // EIATTR type code
    *(uint16_t*)(record + 2) = size;                     // payload size
    *(uint32_t*)(record + 4) = section_index;            // symbol index
    *(void**)(record + 8) = data;                        // payload data pointer
    // Append to the .nv.info section's linked list
    list_append(record, &elfw->nvinfo_list);
}
Parsing Pseudocode
uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;
while (ptr < end) {
    uint8_t  format    = ptr[0];
    uint8_t  attr_code = ptr[1];
    uint16_t size      = *(uint16_t *)(ptr + 2);
    if (format == 0x04) {
        // Indexed: first 4 bytes of payload = symbol index
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value   = *(uint32_t *)(ptr + 8);
        process_indexed_attribute(attr_code, sym_idx, value);
    } else if (format == 0x03) {
        // Sized: the 16-bit size field IS the value; no payload bytes follow
        process_global_attribute(attr_code, size);
        ptr += 4;
        continue;
    } else if (format == 0x02) {
        // Value: single 32-bit immediate
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global_attribute(attr_code, value);
    } else {
        // Free: attribute-specific handling
        process_raw_attribute(attr_code, ptr + 4, size);
    }
    ptr += 4 + ALIGN_UP(size, 4);
}
Section Variants
A cubin contains two kinds of .nv.info sections:
Global .nv.info -- A single section named .nv.info with sh_link = 0 (no associated symbol). Contains attributes that apply to the entire compilation unit: CUDA API version, compatibility flags, and shared metadata not specific to any one kernel.
Per-function .nv.info.<name> -- One section per kernel or device function, named .nv.info.<function_name> with sh_link pointing to the corresponding symbol table entry. Carries per-kernel resource descriptors: register count, barrier count, stack sizes, parameter bank layout, and instruction-offset tables.
Both section variants use sh_type = SHT_CUDA_INFO (0x70000064). The ELF section type is the authoritative way to identify .nv.info sections; the name is only a convention.
Complete Code Table
All 97 EIATTR codes in numeric order. Extracted from the ptxas pointer table at VA 0x23FDC20. The "Format" column reflects the typical TLV format byte used when emitting that attribute. The "Meta" column shows the metadata word from the pointer table (lo word encodes minimum toolkit version compatibility, hi word encodes flags).
| Code | Hex | Name | Format | Meta | Category |
|---|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | 1 | Sentinel |
| 1 | 0x01 | EIATTR_PAD | -- | 1 | Sentinel |
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Indexed | 1 | Texture |
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | 1 | Metadata |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Indexed | 1 | Metadata |
| 5 | 0x05 | EIATTR_MAX_THREADS | Indexed | 1 | Resource |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Indexed | 1 | Texture |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Indexed | 1 | Texture |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Indexed | 1 | Texture |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Indexed | 1 | Texture |
| 10 | 0x0A | EIATTR_PARAM_CBANK | Indexed | 1 | Param |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | 1 | Param |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | 1 | Param |
| 13 | 0x0D | EIATTR_SYNC_STACK | Indexed | 1 | Metadata |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | 1 | Texture |
| 15 | 0x0F | EIATTR_EXTERNS | Free | 1 | Metadata |
| 16 | 0x10 | EIATTR_REQNTID | Indexed | 1 | Resource |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Indexed | 1 | Resource |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Indexed | 1 | Resource |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Indexed | 1 | Texture |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | 1 | Texture |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Indexed | 1 | Texture |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Indexed | 1 | Texture |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | 1 | Param |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Indexed | 1 | Param |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | 1 | Param |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Indexed | 1 | Metadata |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | 1 | Resource |
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | 1 | Offsets |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | 1 | Offsets |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Indexed | 1 | Resource |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Indexed | 1 | Metadata |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Indexed | 1 | Metadata |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Indexed | 1 | Metadata |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Indexed | 1 | Metadata |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Indexed | 1 | Resource |
| 36 | 0x24 | EIATTR_SUQ_USED | Indexed | 1 | Metadata |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | 1 | Offsets |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Indexed | 1 | Metadata |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | 1 | Offsets |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | 1 | Offsets |
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Indexed | 1 | Cluster |
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | 1 | WAR |
| 43 | 0x2B | EIATTR_WMMA_USED | Indexed | 2 | Metadata |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Value | 3 | Metadata |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | 3 | Offsets |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | 5 | Offsets |
| 47 | 0x2F | EIATTR_REGCOUNT | Indexed | 5 | Resource |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | 5 | WAR |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | 5 | Offsets |
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Indexed | 5 | Shared |
| 51 | 0x33 | EIATTR_STATISTICS | Free | 5 | Metadata |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | 5 | Offsets |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | 5 | WAR |
| 54 | 0x36 | EIATTR_SW_WAR | Free | 5 | WAR |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Indexed | 5 | Metadata |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Indexed | 5 | Resource |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | 5 | Offsets |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | 5 | Offsets |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Indexed | 5 | Resource |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | 5 | Metadata |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Indexed | 5 | Cluster |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Indexed | 5 | Cluster |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Indexed | 5 | Cluster |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | 5 | Metadata |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Indexed | 5 | Shared |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Indexed | 5 | Shared |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | 5 | Metadata |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | 5 | Offsets |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | 5 | Param |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | 5 | Offsets |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | 5 | WAR |
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Indexed | 5 | Graphics |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Indexed | 5 | Graphics |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Indexed | 5 | Graphics |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Indexed | 5 | Metadata |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Indexed | 5 | Resource |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Indexed | 5 | Texture |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | 5 | Metadata |
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGEMENTS | Free | 5 | Blackwell |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | 5 | Blackwell |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Indexed | 5 | Blackwell |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Indexed | 5 | Blackwell |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Indexed | 5 | Blackwell |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | 5 | Blackwell |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | 5 | Metadata |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | 5 | Sentinel |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | 5 | Offsets |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Indexed | 5 | Metadata |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | 5 | Offsets |
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | 5 | Mercury |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Indexed | 5 | Cluster |
| 92 | 0x5C | EIATTR_SANITIZE | Indexed | 5 | Blackwell |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | 5 | Metadata |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | 5 | Metadata |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | 5 | Mercury |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | 5 | Sentinel |
Metadata Word Encoding
Each entry in the pointer table carries an 8-byte metadata word alongside the string pointer. The low 32 bits encode the minimum toolkit version required to parse this attribute. The high 32 bits encode flags (0 = legacy, 1 = internal-only, 2 = standard).
| Meta lo | Interpretation |
|---|---|
| 1 | Legacy attribute, present since earliest CUDA versions |
| 2 | Introduced in CUDA ~9.0 era (Volta) |
| 3 | Introduced in CUDA ~10.0 era (Turing) |
| 5 | Introduced in CUDA ~11.0+ era (Ampere and later) |
Codes 0--42 all carry meta=1 (legacy). The boundary at code 43 (EIATTR_WMMA_USED) marks the Volta-era expansion. Codes 46+ carry meta_lo=5, indicating the major expansion that happened with Ampere and continued through Blackwell.
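The lo/hi split of the metadata word can be sketched directly (hypothetical helper; the field semantics are as described above):

```python
def decode_meta(meta_word: int):
    """Split an 8-byte pointer-table metadata word into (meta_lo, meta_hi)."""
    return meta_word & 0xFFFFFFFF, meta_word >> 32

# Hypothetical Ampere-era entry: meta_lo = 5 (toolkit floor), meta_hi = 2 (standard)
assert decode_meta((2 << 32) | 5) == (5, 2)

# Legacy entries (codes 0--42) carry meta_lo = 1 with no flags set
assert decode_meta(1) == (1, 0)
```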
Attribute Categories
Resource Allocation (GPU Driver Critical)
These attributes directly control how the GPU driver allocates hardware resources for kernel launch. Incorrect values cause silent performance degradation or launch failure.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 47 | 0x2F | EIATTR_REGCOUNT | Indexed | Physical register count per thread. The GPU driver computes max_warps_per_SM = total_registers / (regcount * warp_size). Single most important occupancy-determining attribute. |
| 5 | 0x05 | EIATTR_MAX_THREADS | Indexed | Maximum threads per block (from .maxntid PTX directive). |
| 16 | 0x10 | EIATTR_REQNTID | Indexed | Required thread count per dimension (from .reqntid). |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Indexed | Per-thread local memory frame size in bytes. |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Indexed | Minimum stack size per thread (non-recursive case). |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Indexed | Maximum stack size per thread (recursive case, computed via call graph propagation). |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Indexed | Call-Return-Stack size for nested function calls. |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Indexed | SAM (Streaming Asynchronous Memory) region stack size. |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Indexed | Number of named barriers used (max 16 on most architectures). Propagated from callees to entry points by sub_1CC8950. |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Indexed | Number of memory barriers (mbarrier objects) used. |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | Maximum register count hint (from --maxrregcount or .maxnreg). |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | Dynamic register reconfiguration support (setmaxnreg instruction, sm_100+). |
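The occupancy relation quoted for EIATTR_REGCOUNT can be worked through numerically. This sketch implements only the stated formula; real drivers additionally clamp by shared memory, warp slots, and register-allocation granularity:

```python
def max_warps_per_sm(total_registers: int, regcount: int, warp_size: int = 32) -> int:
    # max_warps_per_SM = total_registers / (regcount * warp_size), rounded down
    return total_registers // (regcount * warp_size)

# 65,536 registers per SM at 64 registers/thread -> 32 resident warps
assert max_warps_per_sm(65536, 64) == 32
# Doubling regcount to 128 halves the warp bound
assert max_warps_per_sm(65536, 128) == 16
```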
Parameter Bank Layout
Describes how kernel parameters are laid out in constant memory bank 0 (c[0x0]).
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 10 | 0x0A | EIATTR_PARAM_CBANK | Indexed | Constant bank number and offset for kernel parameters. |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | Size of the parameter constant bank in bytes. |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Indexed | Size of shared memory parameter region. |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | Offsets of parameters within shared memory. |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | Offsets of parameters within constant bank. |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | Kernel parameter metadata (types, sizes, alignments). |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | Extended kernel parameter info (v2 format with additional fields, no metadata version constraint). |
Instruction Offset Tables
Record byte offsets of specific instruction types within the kernel's .text section, enabling the driver and tools to locate and patch instructions at load time.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | Byte offsets of all EXIT instructions. |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | Offsets of S2R instructions reading SR_CTAID (CTA ID). Used for cluster launch CTA-ID remapping. |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | Offsets of load instructions with explicit cache modifier. |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | Offsets of atomic instructions with .sys scope. |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | Offsets of cooperative group instructions. |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | Offsets of emulated FP16 atomic instructions. |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | Register map for 16-bit atomic emulation. |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | Offsets of integer warp-wide instructions. |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | Valid targets of indirect branches (for control flow integrity). |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | Offsets of MBAR (memory barrier) instructions. |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | Resume point offsets for device-side coroutines. Variant name EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8. |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | Byte offset of unused load instruction. |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | Offsets of __cuda_syscall invocations. |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | Offsets of stack canary trap instructions (stack protector). |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | Offsets of CTA-local async store instructions. |
Texture and Surface Binding
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Indexed | Texture/surface image slot assignment. |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Indexed | Offset within the image descriptor table. |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Indexed | Size of the image descriptor. |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Indexed | Whether texture coordinates are normalized. |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Indexed | Sampler initialization parameters. |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | Texture ID to sampler ID mapping table. |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Indexed | Force unnormalized sampler coordinates. |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | Offsets for bindless image references. |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Indexed | Constant bank used for bindless texture descriptors. |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Indexed | Constant bank used for bindless surface descriptors. |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Indexed | Independent texture mode flag. |
Cluster and Cooperative Launch (sm_90+)
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Indexed | Register IDs used for cooperative group masks. |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Indexed | Number of CTAs per cluster (Hopper cluster launch). |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Indexed | Kernel uses explicit cluster dimensions. |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Indexed | Maximum cluster rank for scheduling. |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Indexed | CTA blocks are clusters flag. |
Shared Memory and Reserved Resources
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Indexed | Shared memory scratch space for register spilling. |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Indexed | Whether reserved shared memory is used. |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Indexed | Size of reserved shared memory partition 0. |
Software Workarounds
Hardware errata requiring instruction-level patching by the driver. Each WAR attribute carries a list of instruction byte offsets that the driver must modify at kernel load time.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | Workaround for HW bug 1850030. |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | Workaround for HW bug 2393858. |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | Workaround for HW bug 2861232. |
| 54 | 0x36 | EIATTR_SW_WAR | Free | Generic software workaround container. |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | Offsets of MEMBAR.SYS instructions needing software workaround. |
Graphics-Specific
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Indexed | Global constant bank for graphics shaders. |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Indexed | Shader type (vertex, fragment, compute, etc.). |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Indexed | Virtual Register Count CTA init count. |
Blackwell+ Features (sm_100+)
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGEMENTS | Free | Fragment descriptors at function entry. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1. |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | Sparsity mask for structured-sparse MMA operations. |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Indexed | tcgen05 (5th-gen tensor core) single-CTA mode used. |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Indexed | tcgen05 two-CTA mode used. |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Indexed | Generate error barrier at kernel exit. |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | Dynamic register reconfiguration (setmaxnreg). |
| 92 | 0x5C | EIATTR_SANITIZE | Indexed | Address sanitizer instrumentation present. |
Mercury-Specific
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | Options for the Mercury FNLZR post-link pass. |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | Mercury ISA version for the shader binary. |
Compilation Metadata
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | Jump table relocation entries. |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Indexed | Whether kernel uses %ctaid.z (3D grid). |
| 13 | 0x0D | EIATTR_SYNC_STACK | Indexed | Synchronization stack depth. |
| 15 | 0x0F | EIATTR_EXTERNS | Free | External symbol references list. |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Indexed | Number of queryable attributes. |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Indexed | Kernel needs CUDA Nested Parallelism wrapper. |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Indexed | Kernel needs CNP patching at load time. |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Indexed | Explicit cache control directives present. |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Indexed | isspacep instruction used. |
| 36 | 0x24 | EIATTR_SUQ_USED | Indexed | Surface query instruction used. |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Indexed | Load cache request configuration. |
| 43 | 0x2B | EIATTR_WMMA_USED | Indexed | Warp Matrix Multiply-Accumulate instructions used. |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Value | Object contains pre-CUDA 10 compiled code. |
| 51 | 0x33 | EIATTR_STATISTICS | Free | Compilation statistics (instruction counts, etc.). |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Indexed | CUDA API version the kernel was compiled for. |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | Per-register-target performance statistics. |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | Instruction-to-register mapping for profiling. |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | Microcode section data (internal). |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Indexed | Function patching descriptor for CUDA tools (cuda-gdb, Nsight). |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | Performance statistics for the profiler. |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | General-purpose annotation data. |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Indexed | Stub function classification. |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | Syscall fallback mechanism offsets. |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | CUDA requirements descriptor. |
Sentinel and Error
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | Invalid/error sentinel. Never emitted in valid cubins. |
| 1 | 0x01 | EIATTR_PAD | -- | Padding record (ignored by parser). |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | Unknown attribute placeholder. |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | Upper bound sentinel for the enum range. Code 96 is never emitted; it serves as a bound check (if (attr_code > 0x2F) at line 760 of the builder). |
Payload Format Reference (Codes 0--32)
Per-attribute wire-format documentation derived from sub_1CC9800 (master EIATTR builder), sub_1CC86D0 (per-entry stack emitter), sub_1CC8950 (barrier/register propagator), and sub_1CC85F0 (TLV record emitter). Payload layouts describe the bytes that follow the 4-byte TLV header.
For Indexed-format (0x04) attributes the first 4 payload bytes are always a u32 symbol index. The remaining bytes (if any) carry the value. For Sized-format (0x03) attributes the value is encoded directly in the 16-bit size field of the TLV header -- there are no additional payload bytes.
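The two conventions can be captured in a pair of decoders (Python sketch; helper names are illustrative, the byte layout follows the text above):

```python
import struct

def decode_sized(record: bytes) -> int:
    """Format 0x03: the 16-bit size field IS the value; the record is 4 bytes total."""
    fmt, _code, value = struct.unpack_from("<BBH", record, 0)
    assert fmt == 0x03
    return value

def decode_indexed(record: bytes):
    """Format 0x04: payload = [sym_index:4][value:4] after the 4-byte header."""
    fmt, _code, size = struct.unpack_from("<BBH", record, 0)
    assert fmt == 0x04 and size >= 8
    return struct.unpack_from("<II", record, 4)

# EIATTR_MAXREG_COUNT (0x1B), Sized: maxreg = 64 lives in the size field
assert decode_sized(struct.pack("<BBH", 0x03, 0x1B, 64)) == 64
# EIATTR_REGCOUNT (0x2F), Indexed: symbol 3 uses 40 registers
assert decode_indexed(struct.pack("<BBHII", 0x04, 0x2F, 8, 3, 40)) == (3, 40)
```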
Sentinel Codes (0--1)
| Code | Hex | Name | Payload |
|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | None. Never emitted. |
| 1 | 0x01 | EIATTR_PAD | None. Padding, ignored by parser. |
Texture and Image Binding (2, 6--9, 14, 19--22)
All Indexed attributes in this group share the same 8-byte payload layout: [sym_index:4][value:4]. The builder's first switch (line 722) routes all of these through the same symbol-index resolution path.
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol table index
0x04 4 value Attribute-specific (see per-code table)
| Code | Hex | Name | value field semantics |
|---|---|---|---|
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Image slot number (texture unit binding point) |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Byte offset within image descriptor table |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Image descriptor size in bytes |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | 0 = unnormalized, 1 = normalized coordinates |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Packed sampler initialization parameters |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Sampler ID to force unnormalized |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Constant bank ID for bindless texture descriptors |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Constant bank ID for bindless surface descriptors |
Code 14 (0x0E) -- EIATTR_TEXID_SAMPID_MAP: Free format. Variable-length array of u32 pairs mapping texture IDs to sampler IDs.
Payload: repeating [tex_id:4][samp_id:4] pairs
Size: N * 8 bytes (N = number of tex-sampler bindings)
Code 20 (0x14) -- EIATTR_BINDLESS_IMAGE_OFFSETS: Free format. Array of u32 byte offsets for bindless image descriptor references in the kernel's constant bank. Each u32 is a symbol index that gets resolved during link.
Payload: u32[] symbol indices (resolved to byte offsets at link)
Size: N * 4 bytes
Jump Table Relocations (3)
Code 3 (0x03) -- EIATTR_JUMPTABLE_RELOCS: Free format. Array of u32 byte offsets into the .text section where jump table relocations are needed.
Payload: u32[] byte offsets into .text
Size: N * 4 bytes
CTAIDZ Flag (4)
Code 4 (0x04) -- EIATTR_CTAIDZ_USED: Indexed format, zero-value flag attribute. Presence of the record signals the kernel reads %ctaid.z. SM-version gated via sub_1C97840(0x04, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol
(no value field -- presence is the signal)
The builder creates this record with two different format bytes depending on context: 0x04 (Indexed) via the TLV emitter, or 0x01 (Free) via inline construction (magic 0x0401). Both encode the same semantic: flag-only, no value.
Resource Allocation (5, 16--18, 25, 27, 30)
Codes 5, 16, 17, 18 -- Indexed, 8-byte payload [sym_index:4][value:4]:
| Code | Hex | Name | value field semantics |
|---|---|---|---|
| 5 | 0x05 | EIATTR_MAX_THREADS | Maximum threads per block (from .maxntid) |
| 16 | 0x10 | EIATTR_REQNTID | Required thread count per dimension (from .reqntid) |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Per-thread local memory frame size in bytes |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Minimum per-thread stack size in bytes |
EIATTR_FRAME_SIZE is weak-symbol filtered: dropped when a weak function is replaced by a stronger definition (bitmask 0x800800020000).
EIATTR_MIN_STACK_SIZE is emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x12, 8, buf, 0) where buf is [sym_index:4][min_stack:4]. A sentinel value of -1 in min_stack means "not yet computed." When sm_version == 0xFF00 (Mercury), the record is suppressed.
Code 25 (0x19) -- EIATTR_CBANK_PARAM_SIZE: Sized format (0x03). Value encoded directly in the 16-bit size field. No separate payload bytes.
TLV header: [fmt=0x03][code=0x19][param_bank_size:2]
Total record: 4 bytes (header only)
Code 27 (0x1B) -- EIATTR_MAXREG_COUNT: Sized format (0x03). Value encoded in the low byte of the 16-bit size field (range 0--255). Per-compilation-unit hint, not per-function. Set by --maxrregcount CLI flag or .maxnreg PTX directive.
TLV header: [fmt=0x03][code=0x1B][maxreg:2]
Total record: 4 bytes (header only)
Effective range: low byte only (0--255), high byte 0
Binary evidence: second switch case 0x1B (line 1094) reads *(u8*)(v150+2) -- the low byte of the size field -- as the register count value.
Code 30 (0x1E) -- EIATTR_CRS_STACK_SIZE: Indexed format, 4-byte value payload. Emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x1E, 4, buf, sym_index).
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol
0x04 4 crs_bytes Call-Return-Stack size in bytes
Total record: 12 bytes (4 header + 8 payload). Diagnostic "conflicting crs_stack attribute" fires when two records target the same function.
Parameter Bank Layout (10--12, 23--24)
Code 10 (0x0A) -- EIATTR_PARAM_CBANK: Indexed format, packed value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cbank_desc lo16 = bank number, hi16 = byte offset
Typical value: bank=0, offset=0x160 (standard CUDA kernel parameter ABI).
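The packed cbank_desc word splits as a lo16/hi16 pair (pack/unpack helpers are illustrative, not ptxas code):

```python
def pack_cbank_desc(bank: int, offset: int) -> int:
    """lo16 = constant bank number, hi16 = byte offset of the parameter area."""
    return (offset << 16) | (bank & 0xFFFF)

def unpack_cbank_desc(desc: int):
    return desc & 0xFFFF, desc >> 16

# Standard CUDA kernel parameter ABI: bank 0, offset 0x160
assert unpack_cbank_desc(pack_cbank_desc(0, 0x160)) == (0, 0x160)
```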
Codes 11 (0x0B) and 12 (0x0C) -- Free format, variable-length u32 arrays:
EIATTR_SMEM_PARAM_OFFSETS (0x0B):
Payload: u32[] byte offsets within shared memory, one per parameter
Size: N * 4 bytes
EIATTR_CBANK_PARAM_OFFSETS (0x0C):
Payload: u32[] packed entries, one per parameter
Each u32: lo16 = byte offset in cbank, hi16 = parameter size
Size: N * 4 bytes
Code 23 (0x17) -- EIATTR_KPARAM_INFO: Free format, complex per-parameter descriptors. This is the only attribute in codes 0--32 with a multi-field sub-record structure.
Payload: repeating 12-byte per-parameter entries:
Offset Size Field
------ ---- -----
0x00 4 param_index Ordinal position (0-based)
0x04 4 param_offset Byte offset in constant bank
0x08 2 param_size Size in bytes
0x0A 1 log_alignment log2(alignment)
0x0B 1 flags Bit flags (pointer, ordinal, etc.)
Size: N * 12 bytes
Special behavior: the builder exempts KPARAM_INFO from being zeroed when its symbol index resolves to 0 (line 755: (_BYTE)v5 == 23 check). This allows global-scope parameter info records.
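The repeating 12-byte entry layout can be walked with a small parser (sketch; field names follow the table above, and the sample entry is hypothetical):

```python
import struct

def parse_kparam_info(payload: bytes):
    """Decode the repeating 12-byte EIATTR_KPARAM_INFO entries."""
    entries = []
    for off in range(0, len(payload), 12):
        idx, p_off, size, log_align, flags = struct.unpack_from("<IIHBB", payload, off)
        entries.append({"index": idx, "offset": p_off, "size": size,
                        "align": 1 << log_align, "flags": flags})
    return entries

# Hypothetical 8-byte pointer parameter at cbank offset 0x160, 8-byte aligned
entry = parse_kparam_info(struct.pack("<IIHBB", 0, 0x160, 8, 3, 0x1))[0]
assert entry["align"] == 8 and entry["offset"] == 0x160
```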
Code 24 (0x18) -- EIATTR_SMEM_PARAM_SIZE: Indexed, [sym_index:4][smem_param_bytes:4].
Synchronization (13)
Code 13 (0x0D) -- EIATTR_SYNC_STACK: Indexed format.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 sync_depth lo16 = stack depth (u16), hi16 = 0
Binary evidence: case 0x0D (line 1038) reads *(u16**)(v150+8) as a pointer to a u16 value. The depth value (v343) is a 16-bit unsigned integer. Used with sub_1CBD8F0 for sync stack tracking.
External Symbol References (15)
Code 15 (0x0F) -- EIATTR_EXTERNS: Free format, most complex processing of any attribute in the 0--32 range.
Payload: u32[] symbol table indices
Size: N * 4 bytes (N = size_field / 4)
The builder handles EXTERNS in both switches:
- First switch (line 779): iterates the u32 array, resolving each symbol index through the link-time symbol table. Dead symbols (resolved to 0) are zeroed in-place.
- Second switch (line 1054): collects extern refs into a set (v643) for the current function.
- Emission (line 1706): sub_1CC85F0(a1, 0x0F, 4*count, buf, sym_index) emits the final record.
- The size field encodes N * 4 and the element count is recovered as size >> 2.
Metadata Query (26)
Code 26 (0x1A) -- EIATTR_QUERY_NUMATTRIB: Indexed, [sym_index:4][num_attributes:4].
Instruction Offset Tables (28--29)
Both attributes are Free format carrying arrays of u32 byte offsets into the .text section.
Code 28 (0x1C) -- EIATTR_EXIT_INSTR_OFFSETS:
Payload: u32[] byte offsets of EXIT instructions
Size: N * 4 bytes
Confirmed by the builder's loop (line 2011): code 28 is explicitly checked and routed past the symbol-resolution path, confirming the payload is a plain offset array with no embedded symbol indices.
Code 29 (0x1D) -- EIATTR_S2RCTAID_INSTR_OFFSETS:
Payload: u32[] byte offsets of S2R SR_CTAID.* instructions
Size: N * 4 bytes
At line 2001, code 29 triggers CNP (CUDA Nested Parallelism) wrapper generation. The symbol index from the record is added to the CNP wrapper list, driving emission of NEED_CNP_WRAPPER (code 31) and NEED_CNP_PATCH (code 32) records.
CUDA Nested Parallelism Flags (31--32)
Both are Indexed-format flag attributes with no value payload, driven by the same S2RCTAID-based CNP analysis (their emission sets differ slightly; see each code below).
Code 31 (0x1F) -- EIATTR_NEED_CNP_WRAPPER:
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
SM-version gated: sub_1C97840(0x1F, sm_version). Builder constructs with internal format 0x01 (magic 0x1F01 = 7937). Emitted for every function that the S2RCTAID analysis identified as needing a CNP wrapper.
Code 32 (0x20) -- EIATTR_NEED_CNP_PATCH:
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
SM-version gated: sub_1C97840(0x20, sm_version). Builder constructs with internal format 0x01 (magic 0x2001 = 8193). Emitted for every function in the CNP call tree.
Payload Format Summary (Codes 0--32)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 0 | ERROR | -- | 0 | none |
| 1 | PAD | -- | 0 | none |
| 2 | IMAGE_SLOT | 0x04 | 8 | [sym:4][slot_id:4] |
| 3 | JUMPTABLE_RELOCS | 0x01 | N*4 | u32[] byte offsets |
| 4 | CTAIDZ_USED | 0x04 | 4 | [sym:4] flag-only |
| 5 | MAX_THREADS | 0x04 | 8 | [sym:4][max_threads:4] |
| 6 | IMAGE_OFFSET | 0x04 | 8 | [sym:4][offset:4] |
| 7 | IMAGE_SIZE | 0x04 | 8 | [sym:4][size:4] |
| 8 | TEXTURE_NORMALIZED | 0x04 | 8 | [sym:4][normalized:4] |
| 9 | SAMPLER_INIT | 0x04 | 8 | [sym:4][params:4] |
| 10 | PARAM_CBANK | 0x04 | 8 | [sym:4][lo16=bank,hi16=off:4] |
| 11 | SMEM_PARAM_OFFSETS | 0x01 | N*4 | u32[] param offsets |
| 12 | CBANK_PARAM_OFFSETS | 0x01 | N*4 | u32[] lo16=off,hi16=size |
| 13 | SYNC_STACK | 0x04 | 8 | [sym:4][depth_u16:4] |
| 14 | TEXID_SAMPID_MAP | 0x01 | N*8 | [tex_id:4][samp_id:4] pairs |
| 15 | EXTERNS | 0x01 | N*4 | u32[] symbol indices |
| 16 | REQNTID | 0x04 | 8 | [sym:4][reqntid:4] |
| 17 | FRAME_SIZE | 0x04 | 8 | [sym:4][frame_bytes:4] |
| 18 | MIN_STACK_SIZE | 0x04 | 8 | [sym:4][stack_bytes:4] |
| 19 | SAMPLER_FORCE_UNNORM | 0x04 | 8 | [sym:4][sampler_id:4] |
| 20 | BINDLESS_IMAGE_OFFSETS | 0x01 | N*4 | u32[] sym indices |
| 21 | BINDLESS_TEXTURE_BANK | 0x04 | 8 | [sym:4][bank_id:4] |
| 22 | BINDLESS_SURFACE_BANK | 0x04 | 8 | [sym:4][bank_id:4] |
| 23 | KPARAM_INFO | 0x01 | N*12 | 12B per-param descriptors |
| 24 | SMEM_PARAM_SIZE | 0x04 | 8 | [sym:4][size_bytes:4] |
| 25 | CBANK_PARAM_SIZE | 0x03 | 0 | value in TLV size field |
| 26 | QUERY_NUMATTRIB | 0x04 | 8 | [sym:4][count:4] |
| 27 | MAXREG_COUNT | 0x03 | 0 | value in TLV size field (u8) |
| 28 | EXIT_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 29 | S2RCTAID_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 30 | CRS_STACK_SIZE | 0x04 | 8 | [sym:4][crs_bytes:4] |
| 31 | NEED_CNP_WRAPPER | 0x04 | 4 | [sym:4] flag-only |
| 32 | NEED_CNP_PATCH | 0x04 | 4 | [sym:4] flag-only |
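Taken together, the wire formats in the table above admit a compact decoder. The following is an illustrative sketch (not recovered code; `walk_nv_info` is a hypothetical name) that walks a raw `.nv.info` byte blob using the documented TLV layout, including the special case of records that reuse the size field as a value while still carrying a `[sym:4]` payload:

```python
import struct

# Codes documented in this wiki that reuse the 16-bit size field as a value
# while still carrying a fixed [sym_index:4] payload (VRC_CTA_INIT_COUNT,
# NUM_BARRIERS). This set is reconstructed from the tables above.
VALUE_IN_SIZE_WITH_SYM = {0x4A, 0x4C}

def walk_nv_info(data: bytes):
    """Return (fmt, code, size_field, payload) records from a raw .nv.info blob."""
    pos, records = 0, []
    while pos + 4 <= len(data):
        fmt, code, size = struct.unpack_from("<BBH", data, pos)
        pos += 4
        if fmt == 0x03:
            nbytes = 0          # Sized: the value lives in the size field itself
        elif fmt == 0x02 and code in VALUE_IN_SIZE_WITH_SYM:
            nbytes = 4          # size field is a value; payload is [sym:4]
        else:
            nbytes = size       # Free/Value/Indexed: size counts payload bytes
        records.append((fmt, code, size, data[pos:pos + nbytes]))
        pos += nbytes
    return records
```

This is a sketch of the common-case framing only; attributes with structured payloads (KPARAM_INFO, ANNOTATIONS, etc.) need per-code decoding on top of it.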
Payload Format Reference (Codes 33--64)
Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--32 section above.
Metadata Flags (33--34, 36, 43)
Code 33 (0x21) -- EIATTR_EXPLICIT_CACHING: Indexed format, flag-only. Signals the kernel uses explicit cache control directives (ld.ca, ld.cg, etc.). SM-gated via sub_1C97840(0x21, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x2101 (line 1733). Emitted when cache-on flag (v648) is set. When both cache-on and cache-off flags are set simultaneously (conflicting directives), sub_1CC8100 (cache conflict resolver) is called instead of emitting this record. The diagnostic "Turning caching %s for entry '%s' as per its request" logs cache resolution decisions.
Code 34 (0x22) -- EIATTR_ISTYPEP_USED: Indexed format, flag-only. Signals the kernel uses isspacep (type predicate) instructions.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic -- passes through the default path.
Code 36 (0x24) -- EIATTR_SUQ_USED: Indexed format, flag-only. Signals the kernel uses surface query instructions.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic.
Code 43 (0x2B) -- EIATTR_WMMA_USED: Indexed format, flag-only. Signals the kernel uses Warp Matrix Multiply-Accumulate instructions. First attribute introduced in the Volta era (meta=2).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic.
Resource Allocation (35, 47, 50, 55--56, 59)
Code 35 (0x23) -- EIATTR_MAX_STACK_SIZE: Indexed format, 4-byte value. Maximum per-thread stack size for recursive call chains, computed via call-graph propagation in sub_1CC8950.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 max_stack_bytes Maximum stack size in bytes
Binary evidence: second switch case 0x23 (line 1128) reads v354[1] as the stack size value and stores it in the per-entry array s[]. Weak-symbol filtered: bitmask 0x800800020000 includes this code. Mercury suppression: when sm_version == 0xFF00, the code byte is zeroed, dropping the record.
Code 47 (0x2F) -- EIATTR_REGCOUNT: Indexed format, 4-byte value. Physical register count per thread. The single most important attribute for GPU occupancy: max_warps_per_SM = total_registers / (regcount * warp_size).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 regcount Physical registers per thread
Binary evidence: second switch case 0x2F (line 1176) resolves the symbol and stores the record pointer in v642[] (per-entry regcount array). Diagnostic "invalid index" (line 1180) fires if the symbol resolves to null. Weak-symbol filtered: bitmask 0x800800020000 includes this code.
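As a worked example of the occupancy formula above (the 64K-registers-per-SM and warp-size-32 constants are illustrative hardware parameters, not values read from the binary):

```python
def max_warps_per_sm(regcount: int, total_registers: int = 65536,
                     warp_size: int = 32) -> int:
    """Occupancy bound from EIATTR_REGCOUNT, per the formula documented above.
    Defaults are illustrative Turing-class numbers, not recovered from ptxas."""
    return total_registers // (regcount * warp_size)
```

For example, a kernel compiled to 32 registers per thread caps at 64 warps per SM, while one using 64 registers caps at 32.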
Code 50 (0x32) -- EIATTR_SHARED_SCRATCH: Indexed format, 4-byte value. Shared memory scratch space allocated for register spilling.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 scratch_bytes Shared scratch size in bytes
No special builder logic.
Code 55 (0x37) -- EIATTR_CUDA_API_VERSION: Indexed format, 4-byte value. Records the CUDA API version the kernel was compiled for.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 api_version CUDA API version number
No special builder logic -- passes through the default path.
Code 56 (0x38) -- EIATTR_NUM_MBARRIERS: Sized format (0x03), value encoded in the TLV size field. Number of memory barrier (mbarrier) objects used by the kernel.
TLV header: [fmt=0x03][code=0x38][mbar_count:2]
Total record: 4 bytes (header only)
Binary evidence: magic 0x3803 (14339) at lines 1664 and 2446. The mbarrier count is stored in the 16-bit size field: *((_WORD *)v511 + 1) = v651 (line 1669). SM-gated via sub_1C97840(0x38, sm_version) at lines 1654 and 2436.
Accumulative semantics: the builder sums mbarrier counts from callees during call-graph propagation (second switch case 0x38 at line 1183, falling through to LABEL_331). If any callee reports -1 (unknown), the sum stays -1 (lines 1255--1256). The emission loop at lines 2407--2454 propagates the count to all entry points that call the function.
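The accumulative semantics can be sketched as follows (hypothetical helper name; it mirrors the -1 poisoning described above, not the recovered code verbatim):

```python
def merge_mbarrier_counts(counts):
    """Sum mbarrier counts across callees; a -1 ("unknown") from any callee
    poisons the sum, mirroring the fall-through at LABEL_331."""
    total = 0
    for c in counts:
        total = -1 if (c == -1 or total == -1) else total + c
    return total
```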
Code 59 (0x3B) -- EIATTR_SAM_REGION_STACK_SIZE: Indexed format, 8-byte payload. SAM (Streaming Asynchronous Memory) region stack size.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 sam_stack_bytes SAM region stack size in bytes
Binary evidence: emitted by sub_1CC86D0 at line 114: sub_1CC85F0(a1, 0x3B, 8, buf, 0) where buf is [sym_index:4][sam_stack:4]. Only emitted when sub_1CBD9E0(a1, a2) returns nonzero, indicating the kernel actually uses SAM regions. Second switch case 0x3B (line 1186) calls sub_1CBD940(a1, sym, value) to record the SAM stack size.
Cache Control (38)
Code 38 (0x26) -- EIATTR_LOAD_CACHE_REQUEST: Indexed format, 4-byte value. Per-kernel cache mode configuration. Controls whether the driver enables explicit caching for this kernel.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cache_mode 0 = off, nonzero = on
Binary evidence: second switch case 0x26 (line 1134) is the most complex handler in this range. The builder first checks the function kind: if (byte & 3) == 1 (device function), the record is dropped by zeroing the code byte (line 1141). For entry-point kernels, the verbose trace "Turning caching %s for entry '%s' as per its request" is emitted (line 1153), where %s is either "OFF" or "ON". When cache_mode is nonzero: adds the symbol to the caching-on list (v639[]) and sets the per-entry status to 2. When cache_mode is zero: sets status to 1 (off). The v648 and v655 flags track the presence of on/off requests for conflict detection.
Global Flags (44)
Code 44 (0x2C) -- EIATTR_HAS_PRE_V10_OBJECT: Value format (0x02), global scope. Signals the compilation unit contains pre-CUDA 10 compiled code.
TLV header: [fmt=0x02][code=0x2C][size:2]
Payload: [flags:4]
Total record: 8 bytes
Binary evidence: top-level gating at line 686--688 checks three conditions: link mode (v609 == 2), toolkit version (> 0x63), and SM compatibility (sub_1C97840(0x2C, sm_version)). The magic 0x2C01 at line 709 constructs the record with internal format byte 0x01, which the emitter translates to Value format (0x02) for the wire encoding since the record is global scope. This is the only Value-format attribute in the 33--64 range.
Instruction Offset Tables (37, 39--40, 45--46, 48--49, 52, 57--58)
All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section. None have explicit switch cases in the builder -- they pass through the default path. The payload layout for all is identical:
Payload: u32[] byte offsets into .text section
Size: N * 4 bytes (N = size_field / 4)
| Code | Hex | Name | Offset semantics |
|---|---|---|---|
| 37 | 0x25 | LD_CACHEMOD_INSTR_OFFSETS | Load instructions with explicit cache modifier |
| 39 | 0x27 | ATOM_SYS_INSTR_OFFSETS | Atomic instructions with .sys scope |
| 40 | 0x28 | COOP_GROUP_INSTR_OFFSETS | Cooperative group instructions |
| 45 | 0x2D | ATOMF16_EMUL_INSTR_OFFSETS | Emulated FP16 atomic instructions |
| 48 | 0x30 | SW2393858_WAR | HW bug 2393858 patch locations |
| 49 | 0x31 | INT_WARP_WIDE_INSTR_OFFSETS | Integer warp-wide instructions |
| 52 | 0x34 | INDIRECT_BRANCH_TARGETS | Valid targets of indirect branches |
| 57 | 0x39 | MBARRIER_INSTR_OFFSETS | Memory barrier instructions |
| 58 | 0x3A | COROUTINE_RESUME_OFFSETS | Coroutine resume point offsets |
Code 46 (0x2E) -- EIATTR_ATOM16_EMUL_INSTR_REG_MAP: Free format, but NOT a simple offset array. Carries a register map for 16-bit atomic emulation with a structured per-entry layout rather than flat offsets. The exact sub-record layout is not fully determined from the builder alone (constructed by a separate pass).
Payload: structured register-map entries (not flat u32[] offsets)
Size: variable
Software Workarounds (42, 48, 53--54)
All use Free format (0x01) with u32 offset arrays. The driver patches the instructions at the listed byte offsets during kernel load.
| Code | Hex | Name |
|---|---|---|
| 42 | 0x2A | SW1850030_WAR |
| 48 | 0x30 | SW2393858_WAR |
| 53 | 0x35 | SW2861232_WAR |
| 54 | 0x36 | SW_WAR |
SW_WAR (0x36) is a generic container -- unlike the numbered WAR attributes, its payload format may include sub-type discriminators, though the builder treats it as a flat pass-through.
Cluster and Cooperative Launch (41, 61--63)
Code 41 (0x29) -- EIATTR_COOP_GROUP_MASK_REGIDS: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 mask_regids Register IDs for cooperative group masks
Code 61 (0x3D) -- EIATTR_CTA_PER_CLUSTER: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 ctas_per_cluster Number of CTAs per cluster (Hopper sm_90+)
Code 62 (0x3E) -- EIATTR_EXPLICIT_CLUSTER: Indexed, flag-only.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence signals explicit cluster dimensions)
Code 63 (0x3F) -- EIATTR_MAX_CLUSTER_RANK: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 max_rank Maximum cluster rank for scheduling
Compilation Metadata (51, 60, 64)
Code 51 (0x33) -- EIATTR_STATISTICS: Free format. Variable-length compilation statistics (instruction counts, etc.). Internal diagnostic data not consumed by the GPU driver.
Payload: structured statistics data (format varies)
Size: variable
Code 60 (0x3C) -- EIATTR_PER_REG_TARGET_PERF_STATS: Free format. Per-register-target performance statistics for the profiler.
Payload: structured performance data (format varies)
Size: variable
Code 64 (0x40) -- EIATTR_INSTR_REG_MAP: Free format. Instruction-to-register mapping for profiling and debugging tools.
Payload: structured register-map data
Size: variable
Payload Format Summary (Codes 33--64)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 33 | EXPLICIT_CACHING | 0x04 | 4 | [sym:4] flag-only |
| 34 | ISTYPEP_USED | 0x04 | 4 | [sym:4] flag-only |
| 35 | MAX_STACK_SIZE | 0x04 | 8 | [sym:4][max_stack_bytes:4] |
| 36 | SUQ_USED | 0x04 | 4 | [sym:4] flag-only |
| 37 | LD_CACHEMOD_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 38 | LOAD_CACHE_REQUEST | 0x04 | 8 | [sym:4][cache_mode:4] |
| 39 | ATOM_SYS_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 40 | COOP_GROUP_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 41 | COOP_GROUP_MASK_REGIDS | 0x04 | 8 | [sym:4][mask_regids:4] |
| 42 | SW1850030_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 43 | WMMA_USED | 0x04 | 4 | [sym:4] flag-only |
| 44 | HAS_PRE_V10_OBJECT | 0x02 | 4 | [flags:4] global |
| 45 | ATOMF16_EMUL_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 46 | ATOM16_EMUL_INSTR_REG_MAP | 0x01 | var | structured register map |
| 47 | REGCOUNT | 0x04 | 8 | [sym:4][regcount:4] |
| 48 | SW2393858_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 49 | INT_WARP_WIDE_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 50 | SHARED_SCRATCH | 0x04 | 8 | [sym:4][scratch_bytes:4] |
| 51 | STATISTICS | 0x01 | var | structured stats data |
| 52 | INDIRECT_BRANCH_TARGETS | 0x01 | N*4 | u32[] .text byte offsets |
| 53 | SW2861232_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 54 | SW_WAR | 0x01 | var | generic WAR data |
| 55 | CUDA_API_VERSION | 0x04 | 8 | [sym:4][api_version:4] |
| 56 | NUM_MBARRIERS | 0x03 | 0 | value in TLV size field (u16) |
| 57 | MBARRIER_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 58 | COROUTINE_RESUME_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 59 | SAM_REGION_STACK_SIZE | 0x04 | 8 | [sym:4][sam_stack_bytes:4] |
| 60 | PER_REG_TARGET_PERF_STATS | 0x01 | var | structured perf data |
| 61 | CTA_PER_CLUSTER | 0x04 | 8 | [sym:4][ctas:4] |
| 62 | EXPLICIT_CLUSTER | 0x04 | 4 | [sym:4] flag-only |
| 63 | MAX_CLUSTER_RANK | 0x04 | 8 | [sym:4][max_rank:4] |
| 64 | INSTR_REG_MAP | 0x01 | var | structured register map |
Payload Format Reference (Codes 65--96)
Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--64 sections above. Codes 65--96 represent the newest EIATTR additions (Ampere through Blackwell era). All require SM-version gating via sub_1C97840 before emission. Many have dedicated switch cases in the master builder for call-graph propagation.
Shared Memory (65--66)
Code 65 (0x41) -- EIATTR_RESERVED_SMEM_USED: Indexed format, flag-only. Signals the kernel uses reserved shared memory. SM-gated via sub_1C97840(0x41, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
Binary evidence: magic 0x4101 (16641) at lines 1511 and 2219 of sub_1CC9800. The builder tracks this attribute in the v615[] per-entry array and propagates it to callee entry points during the second pass (lines 2186--2229). When an entry point does not already have this record, the builder creates one using sub_1CC7FB0 for symbol resolution.
Code 66 (0x42) -- EIATTR_RESERVED_SMEM_0_SIZE: Indexed format, 4-byte value. Size of reserved shared memory partition 0 in bytes.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 rsmem_bytes Reserved shared memory size in bytes
No explicit switch case in the builder -- passes through the default path.
Microcode Section (67)
Code 67 (0x43) -- EIATTR_UCODE_SECTION_DATA: Free format. Opaque microcode section data for internal use. Payload format is architecture-specific and not decoded by the builder.
Payload: opaque byte array
Size: variable
Instruction Offset Tables (68, 70--71, 87, 89)
All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section.
Payload: u32[] byte offsets into .text section
Size: N * 4 bytes (N = size_field / 4)
| Code | Hex | Name | Offset semantics | Emitter |
|---|---|---|---|---|
| 68 | 0x44 | UNUSED_LOAD_BYTE_OFFSET | Unused load instructions | sub_60BCF0 (code 70 pattern) |
| 70 | 0x46 | SYSCALL_OFFSETS | __cuda_syscall invocations | sub_60BCF0 |
| 71 | 0x47 | SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | MEMBAR.SYS instructions needing WAR | sub_60BDC0 |
| 87 | 0x57 | STACK_CANARY_TRAP_OFFSETS | Stack canary trap instructions | sub_60BEA0 |
| 89 | 0x59 | LOCAL_CTA_ASYNC_STORE_OFFSETS | CTA-local async store instructions | default path |
Binary evidence for sub_60BCF0 (code 70): allocates 4 * count bytes, copies offsets from the instruction table at struct+40, then calls sub_1CC85F0(a2, 70, (unsigned __int16)count, buf, a4). Emission gated by *(a1+25) flag and count > 0.
Binary evidence for sub_60BDC0 (code 71) and sub_60BEA0 (code 87): identical structure to sub_60BCF0, differing only in the attribute code passed to sub_1CC85F0.
Kernel Parameter Info V2 (69)
Code 69 (0x45) -- EIATTR_KPARAM_INFO_V2: Free format, 12-byte per-parameter entries. Extended version of KPARAM_INFO (code 23) with additional type encoding. Emitted by sub_7FD2B0.
Payload: repeating 12-byte per-parameter entries:
Offset Size Field
------ ---- -----
0x00 4 param_index Ordinal position (0-based)
0x04 4 param_offset Byte offset in constant bank
0x08 2 param_size Size in bytes
0x0A 1 log_alignment log2(alignment)
0x0B 1 flags Packed nibbles:
lo4 = param_type (from lookup table at 0x21D2E60)
bit4 = is_pointer flag
hi3 = reserved
Size: N * 12 bytes
Binary evidence: sub_7FD2B0 at line 116 calls sub_1CC85F0(a3, 69, 12, v16, a4). The flags byte at offset 0x0B is assembled from two sources: the low nibble is looked up from dword_21D2E60 indexed by param_type - 1 (line 110), and bit 4 is set when the parameter is a pointer (line 115: 16 * (*(_BYTE *)(v20 + 25) & 1)).
First-switch handling: code 69 (0x45) appears in the first switch at line 737 alongside texture and resource codes, meaning KPARAM_INFO_V2 records undergo symbol-index resolution during the first pass.
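A sketch of decoding the 12-byte descriptors (hypothetical parser, assuming the little-endian field layout documented above):

```python
import struct

def parse_kparam_info_v2(payload: bytes):
    """Decode KPARAM_INFO_V2 payloads: repeating 12-byte per-param entries."""
    params = []
    for off in range(0, len(payload) - 11, 12):
        idx, p_off, size, log_align, flags = struct.unpack_from(
            "<IIHBB", payload, off)
        params.append({
            "index": idx,
            "offset": p_off,
            "size": size,
            "align": 1 << log_align,           # stored as log2(alignment)
            "param_type": flags & 0x0F,        # lo4 nibble
            "is_pointer": bool(flags & 0x10),  # bit 4
        })
    return params
```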
Graphics-Specific (72--74)
Code 72 (0x48) -- EIATTR_GRAPHICS_GLOBAL_CBANK: Indexed format, 4-byte value. Global constant bank descriptor for graphics shaders.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cbank_desc Global constant bank descriptor
Code 73 (0x49) -- EIATTR_SHADER_TYPE: Indexed format, 4-byte value. Shader type classification.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 shader_type Shader type enum (vertex, fragment, compute, etc.)
Code 74 (0x4A) -- EIATTR_VRC_CTA_INIT_COUNT: Constructed with internal format byte 0x02 (magic 0x4A02 = 18946), but the value is stored in the TLV size field byte, making the wire behavior Sized-like. The builder takes the maximum across all callees.
TLV header: [fmt=0x02][code=0x4A][vrc_count:2]
Payload: [sym_index:4]
Total record: 8 bytes
Binary evidence: magic 18946 at lines 1532 and 2344. The maximum-across-callees logic at lines 1214--1215: if (v675 < *(v150+2)) v328 = *(v150+2); v675 = v328. The final value is written back at line 1538: *((_BYTE *)v196 + 2) = v675. The v617[] per-entry array tracks this attribute for propagation. SM-gated via sub_1C97840(0x4A, sm_version).
Tools Patching (75)
Code 75 (0x4B) -- EIATTR_TOOLS_PATCH_FUNC: Indexed format, 4-byte value. Function patching descriptor for CUDA debugging tools (cuda-gdb, Nsight Compute).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 patch_info Patch descriptor for tool instrumentation
No explicit switch case -- passes through the default path.
Barrier Count (76)
Code 76 (0x4C) -- EIATTR_NUM_BARRIERS: Constructed with internal format byte 0x02 (magic 0x4C02 = 19458), with the barrier count stored in the TLV size field. This is one of the most complex attributes in the 65--96 range, with two distinct code paths.
TLV header: [fmt=0x02][code=0x4C][bar_count:2]
Payload: [sym_index:4]
Total record: 8 bytes
Dual-path behavior controlled by `*(a1+101)`:
- Per-SM tracking mode (when `*(a1+101)` is set, line 1223): reads the barrier count from the size field byte and takes the maximum across all callees: `if (n < *(v150+2)) v323 = *(v150+2); n = v323`. The `v628[]` per-entry array tracks records. SM-gated via `sub_1C97840(0x4C, sm_version)`.
- Accumulative mode (when `*(a1+101)` is clear, falls through to `LABEL_331`): sums barrier counts from callees with -1 sentinel handling (lines 1251--1257): `v298 = v297 + v651; if (v297 == -1) v298 = -1`. The sentinel `-1` means "unknown count" and poisons the sum.
Propagation in sub_1CC8950: the barrier/register propagator (2,634 bytes) also creates NUM_BARRIERS records during barrier count migration from section flags to .nv.info records.
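The record construction implied by the header layout above can be sketched as follows (hypothetical builder; symbol-index resolution is out of scope):

```python
import struct

def build_num_barriers_record(sym_index: int, bar_count: int) -> bytes:
    """Build an 8-byte NUM_BARRIERS record per the layout documented above:
    internal fmt 0x02, code 0x4C, count in the 16-bit size field, then a
    [sym_index:4] payload."""
    return struct.pack("<BBHI", 0x02, 0x4C, bar_count & 0xFFFF, sym_index)
```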
Texture Mode (77)
Code 77 (0x4D) -- EIATTR_TEXMODE_INDEPENDENT: Indexed format, flag-only. Signals the kernel uses independent texture mode.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Performance Statistics (78)
Code 78 (0x4E) -- EIATTR_PERF_STATISTICS: Free format. Performance statistics for the profiler.
Payload: structured performance data
Size: variable
No explicit switch case -- passes through the default path. Internal profiler data, not consumed by the GPU driver.
Fragment Descriptors at Entry (79)
Code 79 (0x4F) -- EIATTR_AT_ENTRY_FRAGEMENTS: Free format. The most complex handler in the 65--96 range. Carries fragment offset arrays that describe function entry point fragments. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1.
Payload: u32[] fragment offsets
Size: N * 4 bytes
Binary evidence: emitted via sub_1CC85F0(a1, 0x4F, 4*count, buf, sym) at lines 1774 and 2539. The builder uses a set data structure (v644) to collect fragment offsets from callees, then merges and deduplicates them:
- Line 1749: collects the total fragment count from the `v644` set.
- Lines 1762--1772: iterates the set entries, extracting each offset via `sub_42F060`.
- Line 1774: emits the merged offset array.
- Lines 2460--2548: callee propagation loop. For each callee, if an existing entry has fragments, the builder extends the array and deduplicates offsets; if not, it creates a new record.
The deduplication logic (lines 2503--2525) does an O(N*M) scan: for each new offset, checks all existing offsets for duplicates before appending.
Cross-function ownership: when *(a1+568) != srca (the current entry's symbol differs from the fragment source), the code byte is zeroed (line 1290: *(_BYTE *)(v150+1)=0), suppressing the record for non-owning functions.
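The dedup scan described above amounts to an append-if-absent merge (hypothetical helper; same O(N*M) shape as the recovered loop):

```python
def merge_fragment_offsets(existing, new_offsets):
    """O(N*M) merge mirroring lines 2503--2525: each candidate offset is
    checked against every existing offset before being appended."""
    merged = list(existing)
    for off in new_offsets:
        if not any(off == e for e in merged):
            merged.append(off)
    return merged
```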
Sparse MMA Mask (80)
Code 80 (0x50) -- EIATTR_SPARSE_MMA_MASK: Sized format (0x03). Sparsity bitmask for structured-sparse MMA (Matrix Multiply-Accumulate) operations on Blackwell. SM-gated via sub_1C97840(0x50, sm_version).
TLV header: [fmt=0x03][code=0x50][mask_bits:2]
Total record: 4 bytes (header only)
Binary evidence: magic 0x5003 (20483) at lines 2085 and 1433. The mask value is stored in the TLV size field. During propagation, the builder OR's mask bits from all callees (line 1407: v158 |= *(_WORD *)(v162 + 2)). New entry-point records are initialized with bit 15 set (line 1436: *((_WORD *)v598 + 1) = 0x8000; line 1438: v158 |= 0x8000u). The v632[] per-entry array tracks records.
The .nv.uft section emission (lines 2068--2090) also creates SPARSE_MMA_MASK records, gated on *(a1+240) (UFT presence flag).
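The OR-merge with bit-15 initialization can be sketched as (hypothetical helper, reconstructed from the propagation behavior described above):

```python
def merge_sparse_mma_masks(callee_masks):
    """OR-merge of callee mask bits; new entry-point records start with
    bit 15 set, per lines 1436--1438."""
    mask = 0x8000  # initialization observed for new records
    for m in callee_masks:
        mask |= m & 0xFFFF
    return mask
```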
Tensor Core Gen05 (81--82)
These two codes are mutually exclusive. The builder enforces that a function cannot use both 1-CTA and 2-CTA tensor core modes simultaneously.
Code 81 (0x51) -- EIATTR_TCGEN05_1CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in single-CTA mode. SM-gated via sub_1C97840(0x51, sm_version) AND requires v673 > 0x81 (SM code > 129, i.e., sm_130+ / Blackwell).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x5101 (20737) at lines 1559 and 2259. Tracked in v614[] per-entry array. The v668 flag indicates any tcgen05_1CTA record was seen. The SM architecture threshold v673 > 0x81 (line 1543) gates emission: only architectures above 0x81 support tcgen05.
Code 82 (0x52) -- EIATTR_TCGEN05_2CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in two-CTA collaborative mode. SM-gated via sub_1C97840(0x52, sm_version) AND requires v673 > 0x81.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x5201 (20993) at lines 1582 and 2300. Tracked in v610[] per-entry array. The v674 flag indicates any tcgen05_2CTA record was seen.
Mutual exclusion enforcement: during callee propagation (lines 2264--2266 and 2304--2307), if a function already has a TCGEN05_1CTA record and the builder attempts to add a TCGEN05_2CTA record (or vice versa), sub_42F590 fires a diagnostic warning with the function name. This catches conflicting tensor core mode usage across the call graph.
Error Barrier at Exit (83)
Code 83 (0x53) -- EIATTR_GEN_ERRBAR_AT_EXIT: Indexed format, flag-only. Instructs the driver to generate an error barrier at kernel exit.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case in the builder -- passes through the default path.
Register Reconfiguration (84)
Code 84 (0x54) -- EIATTR_REG_RECONFIG: Indexed format, flag-only with optional value. Signals the kernel uses dynamic register reconfiguration (setmaxnreg instruction, sm_100+). SM-gated via sub_1C97840(0x54, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x02 1 reconfig_value (in TLV size field lo byte, optional)
Binary evidence: magic 0x5401 (21505) at lines 1637 and 2395. Tracked in v616[] per-entry array with the v666 flag. During callee propagation (lines 2364--2405), if a callee has a reconfig value (ii = *(v230+2)), it is written into the target record's size field byte: *(_BYTE *)(v417 + 2) = ii (line 2403). The value propagates from callee to entry point.
Annotations (85)
Code 85 (0x55) -- EIATTR_ANNOTATIONS: Free format with nested TLV-within-TLV sub-records. Emitted by sub_60C580. General-purpose annotation container for arbitrary metadata.
Payload: sequence of sub-records, each starting with a type byte:
Type 0: [type:4] -- 4 bytes
Type 1: [type:4][value:4] -- 8 bytes
Type 2: [type:4][key:4][len:4][data:len] -- 12+len bytes, 4-byte aligned
Type 3: [type:4][len:4][data:len] -- 8+len bytes, 4-byte aligned
Size: sum of all sub-record sizes
Binary evidence from sub_60C580:
- Line 47: type 2 records copy `key` (4 bytes) + `len` (4 bytes) + `len` bytes of data (lines 51--53: `memcpy(v17+3, v7+3, v22)`). Alignment: `((len + 11) & ~3) + 4` (line 55).
- Line 63: type 3 records copy `len` (4 bytes) + `len` bytes (lines 66--67: `memcpy(v17+2, v7+2, v26)`). Alignment: `((len + 7) & ~3) + 4` (line 68).
- Line 71: type 1 records are 8 bytes (`v19 = 8; v17[1] = v7[1]`).
- Line 79: type 0 (default) records are 4 bytes.
Total allocation: 257 * entry_count dwords (line 29: `v8 = 257LL * count`), providing generous headroom for variable-length sub-records.
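Putting the four sub-record layouts and the recovered alignment arithmetic together, a parser sketch (hypothetical names, not recovered code):

```python
import struct

def parse_annotations(payload: bytes):
    """Walk ANNOTATIONS sub-records. Record sizes round up to 4-byte
    alignment, matching the ((len+11)&~3)+4 / ((len+7)&~3)+4 arithmetic
    recovered from sub_60C580."""
    out, pos = [], 0
    while pos + 4 <= len(payload):
        (rtype,) = struct.unpack_from("<I", payload, pos)
        if rtype == 1:                       # [type:4][value:4]
            (value,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((1, value))
            pos += 8
        elif rtype == 2:                     # [type:4][key:4][len:4][data:len]
            key, dlen = struct.unpack_from("<II", payload, pos + 4)
            out.append((2, key, payload[pos + 12:pos + 12 + dlen]))
            pos += ((dlen + 11) & ~3) + 4
        elif rtype == 3:                     # [type:4][len:4][data:len]
            (dlen,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((3, payload[pos + 8:pos + 8 + dlen]))
            pos += ((dlen + 7) & ~3) + 4
        else:                                # type 0 / default: [type:4]
            out.append((rtype,))
            pos += 4
    return out
```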
Sentinel (86)
Code 86 (0x56) -- EIATTR_UNKNOWN: Never emitted. Placeholder in the enum, analogous to EIATTR_ERROR (code 0).
Stub Function Kind (88)
Code 88 (0x58) -- EIATTR_STUB_FUNCTION_KIND: Indexed format, 4-byte value. Classifies the type of stub function.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 stub_kind Stub function classification enum
No explicit switch case -- passes through the default path.
Mercury Finalizer Options (90)
Code 90 (0x5A) -- EIATTR_MERCURY_FINALIZER_OPTIONS: Free format. Options for the Mercury FNLZR post-link pass. Emitted by sub_462220. Contains null-terminated key-value string pairs with a trailing CRC hash.
Payload: sequence of key-value entries followed by a hash:
Per-entry:
Offset        Size     Field
------        ----     -----
0x00          2        key_len   strlen(key) + 1 (includes null terminator)
0x02          2        val_len   strlen(val) + 1 (includes null terminator)
0x04          key_len  key_str   Null-terminated key string
0x04+key_len  val_len  val_str   Null-terminated value string
Trailer: CRC/hash (computed by sub_4305D0)
Size: sum of all entries + hash
Binary evidence: sub_462220 at line 656 calls sub_1CC85F0(v7, 90, v234, v225, *a5). Lines 640--647 show the key-value pair packing: strlen of key and value, packed as u16 lengths, followed by strcpy of both strings. The hash is computed at line 653 via sub_4305D0(0x123456, ...).
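A sketch of the per-entry packing described above (hypothetical helper; the trailing CRC from sub_4305D0 is not modeled):

```python
import struct

def pack_finalizer_kv(key: str, val: str) -> bytes:
    """Pack one key-value entry: two u16 lengths (strlen + 1, counting the
    NUL terminator) followed by both NUL-terminated strings."""
    kb = key.encode() + b"\x00"
    vb = val.encode() + b"\x00"
    return struct.pack("<HH", len(kb), len(vb)) + kb + vb
```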
Cluster Configuration (91)
Code 91 (0x5B) -- EIATTR_BLOCKS_ARE_CLUSTERS: Indexed format, flag-only. Signals that CTA blocks are clusters (every block is its own cluster).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Address Sanitizer (92)
Code 92 (0x5C) -- EIATTR_SANITIZE: Indexed format, flag-only. Signals the kernel has been instrumented with address sanitizer.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Syscall Fallback (93)
Code 93 (0x5D) -- EIATTR_SYSCALLS_FALLBACK: Free format. Syscall fallback mechanism data.
Payload: structured syscall fallback data
Size: variable
No explicit switch case -- passes through the default path.
CUDA Requirements (94)
Code 94 (0x5E) -- EIATTR_CUDA_REQ: Free format. CUDA requirements descriptor specifying minimum runtime capabilities.
Payload: structured requirements data
Size: variable
No explicit switch case -- passes through the default path.
Mercury ISA Version (95)
Code 95 (0x5F) -- EIATTR_MERCURY_ISA_VERSION: Sized format (0x03). Mercury ISA version encoded in the TLV size field.
TLV header: [fmt=0x03][code=0x5F][isa_version:2]
Total record: 4 bytes (header only)
Error Last Sentinel (96)
Code 96 (0x60) -- EIATTR_ERROR_LAST: Never emitted. Upper bound sentinel for the enum range. Used for bound checks in the builder: if (attr_code > 0x2F) at line 760.
Payload Format Summary (Codes 65--96)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 65 | RESERVED_SMEM_USED | 0x04 | 4 | [sym:4] flag-only |
| 66 | RESERVED_SMEM_0_SIZE | 0x04 | 8 | [sym:4][rsmem_bytes:4] |
| 67 | UCODE_SECTION_DATA | 0x01 | var | opaque byte array |
| 68 | UNUSED_LOAD_BYTE_OFFSET | 0x01 | N*4 | u32[] .text byte offsets |
| 69 | KPARAM_INFO_V2 | 0x01 | N*12 | 12B per-param descriptors |
| 70 | SYSCALL_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 71 | SW_WAR_MEMBAR_SYS_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 72 | GRAPHICS_GLOBAL_CBANK | 0x04 | 8 | [sym:4][cbank_desc:4] |
| 73 | SHADER_TYPE | 0x04 | 8 | [sym:4][shader_type:4] |
| 74 | VRC_CTA_INIT_COUNT | 0x02 | 4 | [sym:4] count in TLV size byte |
| 75 | TOOLS_PATCH_FUNC | 0x04 | 8 | [sym:4][patch_info:4] |
| 76 | NUM_BARRIERS | 0x02 | 4 | [sym:4] count in TLV size byte |
| 77 | TEXMODE_INDEPENDENT | 0x04 | 4 | [sym:4] flag-only |
| 78 | PERF_STATISTICS | 0x01 | var | structured perf data |
| 79 | AT_ENTRY_FRAGEMENTS | 0x01 | N*4 | u32[] fragment offsets |
| 80 | SPARSE_MMA_MASK | 0x03 | 0 | bitmask in TLV size field (u16) |
| 81 | TCGEN05_1CTA_USED | 0x04 | 4 | [sym:4] flag-only |
| 82 | TCGEN05_2CTA_USED | 0x04 | 4 | [sym:4] flag-only |
| 83 | GEN_ERRBAR_AT_EXIT | 0x04 | 4 | [sym:4] flag-only |
| 84 | REG_RECONFIG | 0x04 | 4 | [sym:4] value in TLV size byte |
| 85 | ANNOTATIONS | 0x01 | var | nested TLV sub-records |
| 86 | UNKNOWN | -- | 0 | none (never emitted) |
| 87 | STACK_CANARY_TRAP_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 88 | STUB_FUNCTION_KIND | 0x04 | 8 | [sym:4][stub_kind:4] |
| 89 | LOCAL_CTA_ASYNC_STORE_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 90 | MERCURY_FINALIZER_OPTIONS | 0x01 | var | key-value pairs + hash |
| 91 | BLOCKS_ARE_CLUSTERS | 0x04 | 4 | [sym:4] flag-only |
| 92 | SANITIZE | 0x04 | 4 | [sym:4] flag-only |
| 93 | SYSCALLS_FALLBACK | 0x01 | var | structured syscall data |
| 94 | CUDA_REQ | 0x01 | var | structured requirements |
| 95 | MERCURY_ISA_VERSION | 0x03 | 0 | value in TLV size field (u16) |
| 96 | ERROR_LAST | -- | 0 | none (never emitted) |
Generation Pipeline
EIATTR attributes are generated during Phase 6 of the ELF output pipeline, after all per-kernel SASS encoding and memory allocation have completed. The generation is orchestrated by two functions working in sequence.
Barrier/Register Propagation -- sub_1CC8950
Before per-entry attribute emission begins, sub_1CC8950 (2,634 bytes, called once per entry point) propagates resource requirements from callees to entry kernels via the call graph:
- Register count propagation: walks the call graph depth-first, finding the maximum register count among all callees. The verbose trace `"regcount %d for %s propagated to entry %s"` logs this.
- Barrier count creation: when a kernel's section flags contain a barrier count (bits 20--26 of `section_header + 8`) but no `EIATTR_NUM_BARRIERS` record exists, creates one and clears the section flag bits:
  Creating new EIATTR_NUM_BARRIERS and moving barcount %d
  from section flags of %s to nvinfo for entry symbol %s
- SM-version gating: uses `sub_1C97840` to check whether `EIATTR_NUM_BARRIERS` (0x4C) and `EIATTR_NUM_MBARRIERS` (0x38) are valid for the target SM version before emitting.
Master EIATTR Builder -- sub_1CC9800
The main builder function (14,764 bytes binary, 90 KB decompiled -- third largest function in the output range) constructs the complete set of .nv.info.<func> sections. It has 51 callees and is called once per compilation unit.
The builder iterates over every entry point and device function, emitting the applicable EIATTR records for each. The SM-version gating function sub_1C97840 is called before emitting each attribute to check compatibility. Observed EIATTR code checks in the builder:
| Hex code | EIATTR name | Gating condition |
|---|---|---|
| 0x04 | CTAIDZ_USED | SM-version check |
| 0x21 | EXPLICIT_CACHING | SM-version check |
| 0x1F | NEED_CNP_WRAPPER | SM-version check |
| 0x20 | NEED_CNP_PATCH | SM-version check |
| 0x2C | HAS_PRE_V10_OBJECT | SM-version check |
| 0x38 | NUM_MBARRIERS | SM-version check |
| 0x41 | RESERVED_SMEM_USED | SM-version check |
| 0x4A | VRC_CTA_INIT_COUNT | SM-version check |
| 0x4C | NUM_BARRIERS | SM-version check |
| 0x50 | SPARSE_MMA_MASK | SM-version check |
| 0x51 | TCGEN05_1CTA_USED | SM-version check |
| 0x52 | TCGEN05_2CTA_USED | SM-version check |
| 0x54 | REG_RECONFIG | SM-version check |
The SM version comes from offset +624 of the compilation state object, consistent with the SM version field at a1 + 624 observed throughout ptxas.
Weak Symbol Filtering
During linking (nvlink), three specific EIATTR codes are treated specially during weak symbol resolution. When a weak function is replaced by a stronger definition, records for these three codes are dropped using the bitmask 0x800800020000:
- Code 17 (`0x11`) -- `EIATTR_FRAME_SIZE`
- Code 35 (`0x23`) -- `EIATTR_MAX_STACK_SIZE`
- Code 47 (`0x2F`) -- `EIATTR_REGCOUNT`
The rationale: when a weak function is replaced, its resource descriptors must not contaminate the replacement's resource accounting.
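The recovered bitmask can be reproduced arithmetically: setting exactly bits 17, 35, and 47 yields the constant observed in the binary. A small sketch verifying this, with a helper name (`is_dropped_on_weak_override`) that is illustrative, not recovered:

```python
# Reconstruct the weak-symbol filter mask from the three dropped codes.
DROPPED_CODES = (0x11, 0x23, 0x2F)          # 17, 35, 47
mask = 0
for code in DROPPED_CODES:
    mask |= 1 << code
assert mask == 0x800800020000               # matches the recovered constant

def is_dropped_on_weak_override(eiattr_code: int) -> bool:
    """True if this record is discarded when a weak symbol is replaced."""
    return bool(mask >> eiattr_code & 1)
```

This confirms the mask is a straightforward one-bit-per-code set, indexed by EIATTR code number.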
Consumer Tools
cuobjdump
`cuobjdump --dump-elf-section=.nv.info` dumps raw hex bytes of the global `.nv.info` section. With `--dump-resource-usage`, it decodes EIATTR records into human-readable resource summaries (register count, shared memory, stack sizes).
nvdisasm
`nvdisasm -nvi` decodes `.nv.info` sections into named EIATTR records with decoded values. This is the primary tool for inspecting EIATTR content without writing a custom parser.
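For cases where a custom parser is wanted anyway, a minimal TLV walker can be sketched. The record layout assumed here (u8 format byte, u8 attribute code, u16 value-or-size, with `EIFMT_SVAL` payloads following inline, padded to 4 bytes) matches what nvdisasm prints for `.nv.info` sections, but it is an assumption of this sketch, not recovered from this binary:

```python
import struct

EIFMT_NVAL, EIFMT_BVAL, EIFMT_HVAL, EIFMT_SVAL = 1, 2, 3, 4

def parse_nv_info(data: bytes):
    """Walk raw .nv.info bytes, yielding (attr_code, format, value) tuples."""
    records, off = [], 0
    while off + 4 <= len(data):
        fmt, attr, val = struct.unpack_from("<BBH", data, off)
        off += 4
        if fmt == EIFMT_SVAL:                 # u16 field is a byte count
            payload = data[off:off + val]
            off += (val + 3) & ~3             # payloads are 4-byte aligned
        else:                                 # u16 field is the value itself
            payload = val
        records.append((attr, fmt, payload))
    return records
```

For example, an `EIATTR_REGCOUNT`-style HVAL record followed by a 4-byte SVAL record decodes into two tuples without consuming each other's bytes.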
cuda-gdb
The debugger uses EIATTR_TOOLS_PATCH_FUNC (code 75, 0x4B) to locate patchable function entry points for breakpoint insertion and instrumentation.
How EIATTR Drives GPU Resource Allocation
The .nv.info section is not just metadata for tools -- it is the primary input to the GPU driver's kernel launch resource allocator:
- **Register allocation:** `EIATTR_REGCOUNT` (0x2F) tells the driver how many registers each thread needs. The driver computes `max_warps_per_SM = total_registers / (regcount * warp_size)`.
- **Shared memory reservation:** `EIATTR_SMEM_PARAM_SIZE` (0x18) and `EIATTR_RESERVED_SMEM_0_SIZE` (0x42) determine how much shared memory to carve out before dynamic shared memory allocation.
- **Stack allocation:** `EIATTR_CRS_STACK_SIZE` (0x1E) and `EIATTR_MAX_STACK_SIZE` (0x23) determine per-thread stack allocation. Too small causes memory corruption; too large reduces occupancy.
- **Barrier reservation:** `EIATTR_NUM_BARRIERS` (0x4C) reserves named barrier slots. Hardware supports 16 barriers per CTA on most architectures.
- **Instruction patching:** Offset tables (`EXIT_INSTR_OFFSETS`, `S2RCTAID_INSTR_OFFSETS`, `SW*_WAR`) tell the driver which instruction words to patch at load time. This enables hardware workarounds and CTA-ID remapping for cluster launch without recompilation.
- **Cluster configuration:** `EIATTR_CTA_PER_CLUSTER` (0x3D) and `EIATTR_EXPLICIT_CLUSTER` (0x3E) control the cluster launch hardware on sm_90+, determining how many CTAs share distributed shared memory.
- **Tensor core mode:** `EIATTR_TCGEN05_1CTA_USED` (0x51) and `EIATTR_TCGEN05_2CTA_USED` (0x52) inform the driver about 5th-gen tensor core usage modes on sm_100+.
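The register-limited occupancy bound from the first bullet can be made concrete. The sketch below uses sm_75-like example figures (65,536 32-bit registers per SM, warp size 32) as assumed inputs; the real driver applies further limits (shared memory, barriers, hardware warp caps) on top of this:

```python
# Illustration of the register-limited warp bound the driver derives
# from EIATTR_REGCOUNT alone.
def max_warps_per_sm(regcount: int,
                     total_registers: int = 65536,
                     warp_size: int = 32) -> int:
    """Upper bound on resident warps imposed by per-thread register use."""
    return total_registers // (regcount * warp_size)

# A kernel using 64 registers per thread:
# 65536 // (64 * 32) == 32 resident warps at most.
```

Halving `regcount` from 128 to 64 doubles this bound, which is why the allocator's spill-vs-pressure trade-offs feed directly into launch-time occupancy.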
Binary Artifacts
Pointer Table Layout
The EIATTR name table at VA 0x23FDC20 consists of 97 entries of 16 bytes each (1,552 bytes total):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0x00 | 8 | `name_ptr` | Pointer to null-terminated EIATTR name string |
| 0x08 | 4 | `meta_lo` | Minimum toolkit version compatibility |
| 0x0C | 4 | `meta_hi` | Flags (0=legacy, 1=internal, 2=standard) |
The table is indexed directly by EIATTR code number: `entry = table_base + code * 16`.
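Given a raw dump of the mapped binary, a table entry can be decoded straight from this layout. In the sketch, `image` and `image_base` are hypothetical inputs (a byte dump of the loaded binary and the address it starts at); only the table address and entry layout come from the analysis above:

```python
import struct

ENTRY_SIZE = 16
TABLE_BASE = 0x23FDC20          # VA of the EIATTR name table in v13.0.88

def read_entry(image: bytes, image_base: int, code: int):
    """Decode one 16-byte table entry: (name_ptr, meta_lo, meta_hi)."""
    off = TABLE_BASE - image_base + code * ENTRY_SIZE
    name_ptr, meta_lo, meta_hi = struct.unpack_from("<QII", image, off)
    return name_ptr, meta_lo, meta_hi
```

`name_ptr` would then be dereferenced the same way (pointer minus `image_base`) to recover the name string.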
Typos Preserved in the Binary
| String in binary | Correct spelling | Address |
|---|---|---|
| `EIATTR_AT_ENTRY_FRAGEMENTS` | `EIATTR_AT_ENTRY_FRAGMENTS` | 0x23FCCBD (code 79 name) |
A corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1, and EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8 is an alternate name for code 58, both outside the main table.
Diagnostic Strings
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s" (0x2406960)
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for non-entry symbol %s" (0x24069D0)
"Creating new EIATTR_NUM_BARRIERS and propagating higher
barcount %d from section flags of %s to nvinfo
for entry symbol %s" (0x2406B10)
"conflicting crs_stack attribute" (sub_1CC9800 evidence)
"Turning caching %s for entry '%s' as per its request" (sub_1CC9800 evidence)
"regcount %d for %s propagated to entry %s" (sub_1CC8950 evidence)
"no regcount?" (sub_1CC8950 evidence)
Key Functions
| Address | Size | Identity | Role |
|---|---|---|---|
| sub_1CC9800 | 14,764 B | Master EIATTR builder | Constructs all .nv.info.<func> sections (90 KB decompiled, 51 callees) |
| sub_1CC8950 | 2,634 B | Barrier/register propagator | Propagates resource counts across call graph |
| sub_1CC85F0 | ~180 B | TLV record emitter | Writes individual EIATTR records to the nvinfo linked list |
| sub_1C97840 | ~100 B | SM-version gate | Checks if an EIATTR code is valid for a given SM target |
| sub_1CC86D0 | ~600 B | Per-entry stack emitter | Emits MIN_STACK_SIZE (0x12), CRS_STACK_SIZE (0x1E), SAM_REGION_STACK_SIZE (0x3B) per function |
| sub_1CC84A0 | ~400 B | EIATTR helper | Attribute lookup helper |
| sub_1CC83F0 | ~200 B | EIATTR helper | Section flag extractor |
| sub_1CC8100 | ~1 KB | Cache conflict resolver | Resolves conflicting cache preference attributes |
Cross-References
- ELF/Cubin Output -- Phase 6 in the 11-phase output pipeline
- Custom ELF Emitter -- Section creation and layout
- Synchronization & Barriers -- Barrier count source
- Register Allocation -- REGCOUNT source
Glossary
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Quick-reference for terms used throughout this wiki. Each entry links to the primary page with full details.
| Term | Definition |
|---|---|
| Barrier | Hardware synchronization primitive that blocks threads until a condition is met. PTXAS inserts and optimizes barriers via dedicated passes. See Synchronization & Barriers. |
| BMMA | Binary Matrix Multiply-Accumulate — tensor core operation on 1-bit inputs. Part of the WMMA/GMMA family. See Tensor Core Intrinsics. |
| BSSY | Barrier Set Synchronization — SASS instruction that sets a convergence barrier for divergent control flow. Paired with BSYNC. See Scoreboards & Dependency Barriers. |
| BSYNC | Barrier Synchronize — SASS instruction that waits on a convergence barrier set by BSSY. See Scoreboards & Dependency Barriers. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) embedding a secondary Mercury-encoded representation of the kernel for debug metadata and binary patching. See Capsule Mercury & Finalization. |
| CGA | Cooperative Grid Array — Hopper+ hardware grouping of thread blocks that can synchronize and share distributed shared memory. See Ada & Hopper. |
| Convergence | The point where divergent warp threads rejoin a common execution path, marked by BSSY/BSYNC pairs in SASS. See Predication. |
| Cubin | CUDA Binary — the ELF-based output format produced by ptxas, containing .text (SASS), .nv.info, .nv.constant0, and other NVIDIA-specific sections. See ELF/Cubin Output. |
| DAG | Directed Acyclic Graph — the core data structure within Ori IR basic blocks; instructions form a DAG of def-use edges rather than a flat list. See IR Overview & Design. |
| DEPBAR | Dependency Barrier — SASS instruction (DEPBAR) that stalls until a scoreboard counter reaches a threshold, enforcing producer-consumer ordering. See Scoreboards & Dependency Barriers. |
| Divergence | When threads within a warp take different control-flow paths, requiring the hardware to serialize execution. PTXAS manages divergence through predication and BSSY/BSYNC insertion. See Predication. |
| DMMA | Double-precision Matrix Multiply-Accumulate — FP64 tensor core operation available on sm_80+. See Tensor Core Intrinsics. |
| DynBatch | Dynamic Batch — one of the instruction scheduler's two modes (alongside ReduceReg), which batches independent instructions to maximize ILP. See Scheduler Architecture. |
| EIATTR | Extended Info Attributes — per-kernel metadata in .nv.info sections: tag-length-value records carrying register counts, barrier usage, shared memory sizes, and other properties consumed by the CUDA runtime and driver. See EIATTR Attribute Catalog. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB) — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. See Custom ELF Emitter. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers; the allocator maps these sets to physical registers. See Fatpoint Algorithm. |
| HMMA | Half-precision Matrix Multiply-Accumulate — FP16 tensor core operation, the original WMMA instruction class from Volta/Turing. See Tensor Core Intrinsics. |
| IMMA | Integer Matrix Multiply-Accumulate — INT8/INT4 tensor core operation. See Tensor Core Intrinsics. |
| Knob | An internal tuning parameter (1,294 total) stored as a ROT13-obfuscated string in the binary, read from environment variables or INI-format knob files. Controls per-pass thresholds, feature toggles, and scheduler behavior. See Knobs System. |
| MEMBAR | Memory Barrier — SASS instruction that enforces memory ordering across threads, CTAs, or the GPU. See Synchronization & Barriers. |
| MercConverter | The subsystem that converts abstract Ori IR instructions into Mercury-compatible instruction objects for SASS encoding. Part of instruction selection. See Instruction Selection. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words via ~4,000 per-variant handler functions. See Mercury Encoder. |
| MovPhi | A pseudo-instruction in the Ori IR that represents SSA phi-node moves — parallel copies resolved during register allocation and out-of-SSA conversion. See IR Overview & Design. |
| NvOptRecipe | NVIDIA Optimization Recipe — a predefined sequence of optimization phases selected by optimization level. The PhaseManager reads the recipe to determine which phases run and in what order. See Optimization Levels. |
| Occupancy | The ratio of active warps to the maximum warps a streaming multiprocessor can support, determined by register count, shared memory usage, and barrier count. Higher occupancy helps hide memory latency. See Allocator Architecture. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization and codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. See Optimization Pipeline. |
| Opex | Operand Expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields (virtual registers, immediates, address modes to bit patterns). See SASS Code Generation. |
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. See IR Overview & Design. |
| PhaseManager | The infrastructure class (sub_C62720) that drives the 159-phase optimization pipeline: a factory, vtable dispatch, execute/isNoOp/getName interface. See Phase Manager Infrastructure. |
| Pipeline progress counter | A hardware counter (Hopper+) that tracks the stage of an asynchronous pipeline operation, used by cp.async and TMA to overlap compute with memory transfers. See Ada & Hopper. |
| PTX | Parallel Thread Execution — NVIDIA's virtual ISA for GPU compute. The textual input format consumed by ptxas. See PTX Instruction Table. |
| QMMA | Quarter-precision Matrix Multiply-Accumulate — FP8 (E4M3/E5M2) tensor core operation available on Hopper+. See Tensor Core Intrinsics. |
| Register pressure | The number of live virtual registers at a program point relative to the physical register file capacity. High pressure causes spilling. See Allocator Architecture. |
| Remat | Rematerialization — recomputing a value instead of spilling and reloading it, trading ALU cycles for register file pressure reduction. See Rematerialization. |
| ROT13 | The trivial Caesar cipher (rotate-13) used to obfuscate all 1,294 knob name strings in the ptxas binary. Decoded at lookup time by GetKnobIndex. See Knobs System. |
| SASS | Shader Assembly — NVIDIA's native GPU machine code. The binary output produced by ptxas, encoded as 128-bit instruction words. See SASS Opcode Catalog. |
| Scoreboard | A hardware dependency-tracking mechanism (6 barriers on pre-Hopper, more on Hopper+) that enforces ordering between long-latency producers and their consumers. Managed by DEPBAR instructions. See Scoreboards & Dependency Barriers. |
| sm_backend | The per-architecture codegen backend selected by --gpu-name. Each SM family (Turing, Ampere, Ada, Hopper, Blackwell) has distinct encoding tables, latency profiles, and feature gates. See SM Architecture Map. |
| Spill | Storing a live register value to local memory when the allocator cannot fit all live values into the physical register file. Spills degrade performance significantly on GPUs. See Spilling. |
| tcgen05 | Fifth-generation tensor core instruction set on Blackwell (sm_100+). Replaces WMMA/GMMA with a new ISA for matrix operations. See TCGen05. |
| TMA | Tensor Memory Accelerator — Hopper+ hardware unit that performs bulk asynchronous copies between global and shared memory with address generation offloaded from the SM. See Ada & Hopper. |
| UFT | Uniform Function Table — a data structure in the CUBIN that maps function indices to code offsets, used by the driver for indirect call dispatch. See ELF/Cubin Output. |
| UDT | Uniform Data Table — a companion to UFT that maps data indices to constant bank offsets within the CUBIN. See ELF/Cubin Output. |
| Warpgroup | A Hopper+ scheduling unit consisting of 4 warps (128 threads) that execute WGMMA and other warpgroup-level instructions collectively. See Ada & Hopper. |
| WGMMA | Warpgroup Matrix Multiply-Accumulate — Hopper+ tensor core instruction that operates at warpgroup granularity (4 warps), supporting asynchronous execution with pipeline progress counters. See GMMA/WGMMA Pipeline. |