PTXAS v13.0 — Reverse Engineering Reference
Purpose: reimplementation-grade documentation of NVIDIA's PTX-to-SASS assembler, recovered entirely from static analysis of the stripped x86-64 binary.
PTX (Parallel Thread Execution) is NVIDIA's virtual ISA for GPU compute. SASS (Shader Assembly) is the native machine code executed by GPU hardware. PTXAS is the binary that transforms PTX into SASS. At 37.7 MB stripped, it is a fully proprietary compiler with no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, every data structure, and every encoding table was built in-house by NVIDIA. This wiki documents its internal architecture using IDA Pro 8.x and Hex-Rays decompilation.
Version note: All addresses and binary offsets in this wiki apply to ptxas v13.0.88 (CUDA Toolkit 13.0). Other versions will have different addresses.
| Property | Value |
|---|---|
| Binary | ptxas v13.0.88, 37,741,528 bytes, x86-64, stripped |
| Build | cuda_13.0.r13.0/compiler.36424714_0 (Aug 20 2025) |
| Decompilation | 40,185 functions, IDA Pro 8.x + Hex-Rays |
| Strings | 30,632 extracted |
| Call graph | 548,693 edges |
| Version string | Cuda compilation tools, release 13.0, V13.0.88 (sub_612DE0) |
| LLVM code | None — fully proprietary compiler |
| Default target | sm_75 (Turing) |
| Supported SMs | sm_75 through sm_121f (Turing through DGX Spark) |
| Internal codename | OCG (Optimizing Code Generator), Mercury (SASS encoder) |
Glossary
| Term | Meaning |
|---|---|
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words. Named in NVIDIA source paths and error strings. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization+codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers. The allocator works by computing these sets and mapping them to physical registers. |
| Opex | Operand expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields. Converts virtual register references, immediates, and address modes into the bit patterns Mercury expects. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) that embeds a secondary Mercury-encoded representation of the kernel alongside the primary .text section. Used for debug metadata and binary patching support. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB). Not a standard library — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. |
| EIATTR | Extended Info Attributes — per-kernel metadata encoded in .nv.info sections. Each attribute is a tag-length-value record carrying register counts, barrier usage, shared memory sizes, CRS stack depth, and other kernel properties consumed by the CUDA runtime and driver. |
Three Subsystems
PTXAS is not a monolithic assembler. It decomposes into three largely independent subsystems with distinct coding conventions, data structures, and lineages:
1. PTX Frontend (~3 MB, 0x400000--0x5AA000) — A Flex-generated DFA scanner (sub_720F00, 64 KB, ~552 rules) feeds tokens into a Bison-generated LALR(1) parser (sub_4CE6B0, 48 KB). The parser is driven from sub_446240 (the real main, 11 KB), which orchestrates the full pipeline: parse, DAGgen, OCG, ELF, DebugInfo. The frontend also contains 1,141 instruction descriptors registered via sub_46E000 (93 KB) that define accepted type combinations for every PTX opcode, 608 CUDA runtime intrinsics registered in sub_5D1660 (46 KB), and a suite of per-instruction semantic validators (0x460000--0x4D5000) that check architecture requirements, type compatibility, and operand constraints before lowering. See PTX Parser and Entry Point & CLI.
2. Ori Optimizer (~8 MB, 0x5AA000--0xC52000) — A proprietary 159-phase optimization pipeline managed by the PhaseManager (sub_C62720). The phase factory at sub_C60D30 is a 159-case switch that allocates polymorphic phase objects from a vtable table at off_22BD5C8. Each phase has virtual methods for execute(), isNoOp(), and getName(). Major subsystems include: a fatpoint-based register allocator (sub_957160 core, sub_95DC10 driver, sub_926A30 interference graph builder), a 3-phase instruction scheduler (sub_688DD0 with ReduceReg/DynBatch modes and 9 register pressure counters), copy propagation, strength reduction, predication (if-conversion), rematerialization, and GMMA/WGMMA pipelining. The pipeline reads its default phase ordering from a 159-entry table at 0x22BEEA0. See Optimization Pipeline and Phase Manager.
3. SASS Backend (~14 MB, 0xC52000--0x1CE3000) — The Mercury encoder generates native SASS binary code. Instruction encoding is handled by ~4,000 per-variant handler functions (683 + 678 = 1,361 in the SM100 Blackwell encoding tables alone at 0xED1000--0x107B000, with additional tables for other SM generations). Each handler follows a rigid template: set opcode ID, load a 128-bit encoding format descriptor via SIMD, initialize a 10-slot register class map, register operand descriptors via sub_7BD3C0/sub_7BD650/sub_7BE090, finalize with sub_7BD260, then extract bitfields from the packed instruction word. The backend also contains 3 peephole optimizers (the PeepholeOptimizer class at 0x7A5D10 with Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, and SchedulingAwarePatterns methods), a capsule Mercury ELF embedder for debug metadata (sub_1CB53A0, section .nv.capmerc), and a custom ELF emitter (sub_1C9F280, 97 KB) that builds the final CUBIN output. See SASS Code Generation, Mercury Encoder, and Peephole Optimization.
Additionally, the binary embeds a custom pool allocator (sub_424070, 3,809 callers), MurmurHash3-based hash maps (sub_426150 insert / sub_426D60 lookup), a thread pool with pthread-based parallel compilation support, and a GNU Make jobserver client for integration with build systems.
Compilation Pipeline
Both standalone and library-mode invocations converge on the same pipeline, visible in the timing strings emitted by sub_446240:
```
PTX text (.ptx file or string)
  |
  +-- Flex Scanner (sub_720F00, 64KB)
  |     552-rule DFA, off_203C020 transition table
  |     Tokens: 340+ terminal symbols for Bison grammar
  |
  +-- Bison LALR(1) Parser (sub_4CE6B0, 48KB)
  |     Semantic validators: 0x460000-0x4D5000
  |     1,141 instruction descriptors via sub_46E000
  |
  +-- Ori IR Construction (DAGgen phase)
  |     Internal representation: basic blocks + instruction DAG
  |     608 CUDA runtime intrinsics (sub_5D1660)
  |
  +-- 159-Phase Optimization Pipeline (PhaseManager, sub_C62720)
  |     Phase factory: sub_C60D30 (159-case switch)
  |     Fatpoint register allocator (sub_957160)
  |     3-phase instruction scheduler (sub_688DD0)
  |     Copy propagation, CSE, strength reduction, predication,
  |     rematerialization, GMMA pipelining, late legalization
  |
  +-- Mercury SASS Encoder
  |     Instruction encoding: ~4000 per-variant handlers
  |     3 peephole optimizers (PeepholeOptimizer at 0x7A5D10)
  |     WAR hazard resolution (sub_6FC240)
  |     Operand expansion (Opex pipeline)
  |
  +-- ELF/CUBIN Output (sub_1C9F280, 97KB)
        Sections: .text, .nv.constant0, .nv.info, .symtab
        Capsule Mercury: .nv.capmerc (debug metadata)
        DWARF: .debug_line, .debug_info, .debug_frame
```
The driver at sub_446240 reports per-stage timing: Parse-time, CompileUnitSetup-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time, plus PeakMemoryUsage in KB. For multi-entry PTX files, each compile unit is processed independently with the header "\nCompile-unit with entry %s".
Dual Compilation Modes
PTXAS operates in two modes selected at invocation:
| | Standalone CLI | Library Mode |
|---|---|---|
| Invocation | ptxas [options] file.ptx | Called from nvcc/nvlink as a subprocess |
| Entry | main at 0x409460 | sub_9F63D0 (library/ftrace entry) |
| Real driver | sub_446240 (11 KB) | Same pipeline, alternate setup |
| Input | PTX file on disk | PTX string via --input-as-string |
| Output | .cubin / .o file | Binary blob returned to caller |
| Usage string | "Usage : %s [options] <ptx file>,...\n" | N/A |
The main function (0x409460, 84 bytes) is a thin wrapper: it stores argv[0], sets stdout/stderr to unbuffered via setvbuf, and delegates to sub_446240. The --input-as-string flag enables accepting PTX source directly as a CLI argument rather than reading from a file.
Configuration
PTXAS exposes three layers of configuration:
CLI Options (~100 flags) — Registered in sub_432A00 and parsed by sub_434320. Key options include --gpu-name (target SM), --maxrregcount (register limit), --opt-level (0--4), --verbose, --warn-on-spills, --warn-on-local-memory-usage, --fast-compile, --fdevice-time-trace (Chrome trace JSON output), --compile-as-tools-patch (sanitizer mode), and --extensible-whole-program. Help is printed by sub_403588 which calls sub_1C97640 to enumerate all registered options.
Internal Knobs (1,294 ROT13-encoded entries) — A separate configuration system implemented in generic_knobs_impl.h (source path recovered: /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h). The knob table is populated by two massive static constructors: ctor_005 at 0x40D860 (80 KB, ~2,000 general OCG knobs) and ctor_007 at 0x421290 (8 KB, 98 Mercury scheduler knobs). All knob names are ROT13-obfuscated in the binary. Examples after decoding: MercuryUseActiveThreadCollectiveInsts, MercuryTrackMultiReadsWarLatency, MercuryPresumeXblockWaitBeneficial, ScavInlineExpansion, ScavDisableSpilling. Knobs are read from environment variables and knob files via ReadKnobsFile (sub_79D070) which parses [knobs]-header INI files. Lookup is performed by GetKnobIndex (sub_79B240) with inline ROT13 decoding and case-insensitive comparison. See Knobs System.
SM Profile Tables — Per-architecture capability maps initialized by sub_607DB0 (14 KB) which creates 7 hash maps indexing sm_XX / compute_XX strings to handler functions. Profile objects are constructed by sub_6765E0 (54 KB) with architecture-to-family mappings (sm_75 -> Turing, sm_80/86/87/88 -> Ampere, sm_89 -> Ada Lovelace, sm_90/90a -> Hopper, sm_100/100a/100f -> Blackwell, sm_103/103a/103f -> Blackwell Ultra, sm_110/110a/110f -> Jetson Thor, sm_120/120a/120f -> RTX 50xx, sm_121/121a/121f -> DGX Spark). See SM Architecture Map.
Reading This Wiki
The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with GPU compiler experience.
Section Index
Overview
- Function Map — Address-to-identity lookup for key functions with confidence levels.
- Binary Layout — Subsystem address map at pass granularity.
- Methodology — How this analysis was performed.
- Version Tracking — Cross-version address deltas.
Compilation Pipeline
- Pipeline Overview — End-to-end PTX-to-SASS flow diagram with links to every stage.
- Entry Point & CLI — CLI parsing, `main` at 0x409460, the real driver at `sub_446240`.
- PTX Parser (Flex + Bison) — 552-rule Flex DFA scanner, Bison LALR(1) parser, instruction descriptor table.
- PTX Directive Handling — `.version`, `.target`, `.entry`, `.func`, `.reg`, `.shared`, `.const` processing.
- PTX-to-Ori Lowering — How parsed PTX is lowered into the Ori internal representation.
- Optimization Pipeline (159 Phases) — PhaseManager, phase factory, default phase ordering, per-phase timing.
- SASS Code Generation — Mercury encoder, instruction selection, operand expansion.
- ELF/Cubin Output — Custom ELF emitter, section layout, capsule Mercury, DWARF generation.
Ori IR — Internal Representation
- IR Overview & Design — Instruction DAG, basic blocks, typed virtual registers.
- Instructions & Opcodes — Ori opcode set and instruction encoding.
- Basic Blocks & CFG — Control flow graph construction and manipulation.
- Register Model (R/UR/P/UP) — Four register classes and their constraints.
- Data Structure Layouts — Memory layout of key IR objects.
Optimization Passes
- Pass Inventory & Ordering — All 159 phases with names, addresses, and pipeline positions.
- Phase Manager Infrastructure — Phase factory, vtable dispatch, execute/isNoOp/getName.
- GeneralOptimize Bundles — Mega-pass bundles that group related sub-passes.
- Loop Passes — Unrolling, LICM, induction variable optimization, strength reduction.
- Copy Propagation & CSE — Value forwarding and common subexpression elimination.
- Predication — If-conversion for GPU divergence control.
- Rematerialization — Recomputing values to reduce register pressure.
- Synchronization & Barriers — Barrier insertion and dead barrier elimination.
- Late Expansion & Legalization — Final lowering before codegen.
Register Allocation
- Allocator Architecture — Fatpoint algorithm, interference graph, spilling, ABI constraints.
- Fatpoint Algorithm — Core allocation loop and heuristics.
- Spilling — Spill cost model and spill code generation.
- GPU ABI & Calling Convention — Register assignment rules and caller/callee contracts.
Instruction Scheduling
- Scheduler Architecture — 3-phase scheduler, ReduceReg/DynBatch modes.
- Scheduling Algorithm — Priority list scheduling with register pressure tracking.
- Latency Model & HW Profiles — Per-SM instruction latency tables.
- Scoreboards & Dependency Barriers — WAR hazard resolution, barrier allocation.
SASS Code Generation
- Code Generation Overview — Instruction selection, encoding, peephole, Mercury.
- Instruction Selection — Pattern-based DAG-to-SASS lowering.
- SASS Instruction Encoding — 128-bit instruction word format and bitfield packing.
- Peephole Optimization — Three peephole dispatchers with SM-variant patterns.
- Mercury Encoder — Per-variant handler architecture, encoding tables.
- Capsule Mercury & Finalization — `.nv.capmerc` section, debug metadata embedding.
- SASS Text Generation — Disassembly-format printing for `--verbose` output.
GPU Architecture Targets
- SM Architecture Map — SM feature gates from sm_75 through sm_121f.
- Turing & Ampere (SM 75--88) — Feature delta between generations.
- Ada & Hopper (SM 89--90a) — Async copy, TMA, distributed shared memory.
- Blackwell (SM 100--121) — TCGen05, fifth-gen tensor cores, new SM variants.
- TCGen05 — 5th Gen Tensor Cores — Blackwell tensor core instruction set.
CUDA Intrinsics
- Intrinsic Table (608 Entries) — Math, tensor, sync, warp intrinsics.
- Math Intrinsics — Fast-math, Newton-Raphson, special functions.
- Tensor Core Intrinsics — WMMA, GMMA, WGMMA instruction families.
- Sync & Warp Intrinsics — Barrier, vote, shuffle, match.
ELF/Cubin Output
- Custom ELF Emitter — ELFW internals, section construction, symbol table.
- Section Catalog & EIATTR — `.nv.info` attribute encoding, per-kernel metadata.
- Debug Information — DWARF generation for GPU debugging.
- Relocations & Symbols — CUBIN relocation types and symbol conventions.
Configuration
- CLI Options — ~100 flags registered in `sub_432A00`.
- Knobs System (1,294 Knobs) — ROT13 knob table, environment variables, INI files.
- Optimization Levels — O-level to phase mapping, `--fast-compile` tiers.
- DUMPIR & NamedPhases — Dumping IR at specific pipeline points.
Infrastructure
- Memory Pool Allocator — `sub_424070`, 3,809 callers, arena-style allocation.
- Hash Tables & Bitvectors — MurmurHash3-based maps, bitvector liveness sets.
- Thread Pool & Concurrency — pthread pool, GNU Make jobserver client.
Reference
- SASS Opcode Catalog — Complete SASS opcode enumeration.
- PTX Instruction Table — All PTX instructions with type signatures.
- EIATTR Attribute Catalog — Tag-length-value format for `.nv.info` attributes.
Reading Path 1: End-to-End Pipeline Understanding
Goal: understand how PTX text becomes SASS binary, what each stage does, and how control flows between subsystems.
- Pipeline Overview — The complete flow diagram. Establishes all stages and their address ranges.
- Entry Point & CLI — How ptxas is invoked, the ~100 CLI flags, and the `sub_446240` driver function.
- PTX Parser — The Flex scanner and Bison parser. How PTX text becomes an internal parse tree.
- PTX-to-Ori Lowering — How the parse tree is lowered to Ori IR (basic blocks + instruction DAG).
- Optimization Pipeline — The 159-phase PhaseManager. Phase factory, ordering, timing infrastructure.
- SASS Code Generation — Mercury encoder, instruction selection, operand expansion, peephole.
- ELF/Cubin Output — Custom ELF emitter, section layout, DWARF debug info, capsule Mercury.
Reading Path 2: Reimplementing a Specific Pass
Goal: reproduce the exact behavior of one optimization phase deeply enough to write a compatible replacement.
- Pass Inventory & Ordering — Locate the phase in the 159-entry table. Note its index, vtable address, and pipeline position.
- The phase's dedicated page (e.g., Copy Propagation & CSE, Predication). Every dedicated page contains the function address, decompiled algorithm, data flow, and controlling knobs.
- Knobs System — Find which ROT13 knobs control the phase's behavior (enable/disable toggles, thresholds).
- Ori IR Overview — Understand the IR data structures the phase operates on.
- Register Model — The R/UR/P/UP register classes and their constraints.
- Function Map — Cross-reference internal function addresses with the master function map.
Reading Path 3: Debugging Correctness
Goal: diagnose a miscompilation, crash, or incorrect SASS output by tracing the problem to a specific phase.
- DUMPIR & NamedPhases — How to dump IR at specific pipeline points. Use `DUMPIR` to observe the IR before and after each phase.
- Optimization Levels — Compare phase pipelines at different O-levels. If a bug appears at `-O2` but not `-O1`, the diff identifies suspect phases.
- Pipeline Overview — The pipeline is linear: Parse -> DAGgen -> OCG (159 phases) -> Mercury -> ELF. The stage where output first goes wrong narrows the search.
- Knobs System — Check whether the suspect phase has enable/disable knobs. Toggle them to confirm or rule out the phase.
- Instruction Scheduling and Scoreboards & Dependency Barriers — If the generated SASS hangs or produces wrong results under specific warp configurations, the scheduler or barrier insertion may be at fault.
Reading Path 4: Tuning Performance
Goal: understand what ptxas does at each optimization level and what knobs control aggressiveness.
- Optimization Levels — The O-level to phase mapping, including `--fast-compile` tiers.
- Knobs System — The 1,294 ROT13-encoded internal tuning parameters. The primary mechanism for fine-grained control.
- Register Allocation — The fatpoint allocator directly determines register count, which determines maximum occupancy.
- Instruction Scheduling — The scheduler's ReduceReg and DynBatch modes, WAR hazard resolution, and interaction with register pressure.
- Peephole Optimization — The 3 peephole dispatchers that perform late SASS-level rewrites.
- SM Architecture Map — Per-SM feature gates that influence code generation decisions.
Function Map
Binary: ptxas v13.0.88, 37.7 MB stripped ELF, ~40,000 functions
Documented: 2,063 unique functions across 70 wiki pages
This page: Top ~100 most cross-referenced functions, plus routing tables
Complete listings: Each wiki page has its own Function Map section with full details
This page is the central lookup index for identified functions in ptxas. It lists the functions that appear most frequently across the wiki (cross-cutting infrastructure and major entry points), and provides routing tables to find any function by address range or subsystem.
Confidence levels: CERTAIN = named in symbols or strings. HIGH = strong evidence from strings and call patterns (>90%). MEDIUM = structural analysis with partial string evidence (70-90%).
Core Infrastructure
These functions appear in 10+ wiki pages -- they are the universal building blocks called by nearly every subsystem.
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x424070 | pool_alloc(pool, size) | 19 | 3,809 | Custom slab allocator, 8-byte aligned |
0x4248B0 | pool_free(ptr) | 8 | 1,215 | Coalescing free, boundary tags |
0x4280C0 | get_thread_local_context | 10 | 3,928 | Most-called function in ptxas; 280-byte TLS struct |
0x42BDB0 | fatal_OOM_handler | 8 | 3,825 | Called on every allocation failure |
0x426150 | hashmap_put(map, key, value) | 11 | 2,800 | Open-addressing + chaining, auto-resize |
0x426D60 | hashmap_get(map, key) | 11 | 422 | Returns value or 0 |
0x425CA0 | hashmap_create(hash_fn, cmp_fn, cap) | 7 | 127 | Integer/pointer/custom hash modes |
0x427630 | murmurhash3_x86_32(str) | 5 | 73 | Constants: 0xcc9e2d51, 0x1b873593 |
0x42D850 | hashset_insert(set, key) | 4 | 282 | Hash set variant |
0x42FBA0 | diagnostic_emit(desc, loc, fmt...) | 7 | 2,350 | Central error/warning reporter |
0x42F590 | fatal_internal_error(desc, ...) | 8 | 3,825 | Assertion handler |
0x4279D0 | starts_with(str, prefix) | 4 | 185 | Returns suffix pointer or 0 |
0x42CA60 | list_push_front(node, head_ptr) | 4 | 298 | Pool-allocated linked list |
0xBDBA60 | bitvector_allocate | 8 | many | (bits+31)>>5 word count |
0xBDCDE0 | bitvector_or_assign (SSE2) | 5 | many | _mm_or_si128 on 128-bit chunks |
Details: Memory Pools, Hash & Bitvector, Threading
Compilation Driver & CLI
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x409460 | main | 5 | 1 | Delegates to 0x446240 |
0x446240 | real_main (top-level driver) | 13 | 1 | Orchestrates entire pipeline |
0x4428E0 | ptx_input_setup | 6 | 1 | Version/target validation |
0x43CC70 | per_entry_compile_unit | 5 | 1 | Processes each entry through pipeline |
0x43F400 | function_abi_config | 4 | 1 | Parameter regs, return addr, scratch |
0x43A400 | compilation_target_config | 7 | 1 | SM-specific defaults |
0x43B660 | register_constraint_calculator | 5 | 1 | Balances .maxnreg, occupancy |
0x432A00 | option_registration | 9 | 1 | CLI option definitions |
0x434320 | option_parser | 9 | 1 | Validates combinations, applies state |
Details: Pipeline Entry, Pipeline Overview, CLI Options
PTX Front End
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x46E000 | instruction_table_builder | 9 | 1 | 93 KB, 1168 callees, one per PTX opcode |
0x451730 | parser_setup (special register init) | 9 | 1 | %ntid, %laneid, %clock, etc. |
0x4CE6B0 | bison_parser (directive/decl) | 7 | 1 | .local_maxnreg, .alias, .pragma |
0x720F00 | flex_lexer (ptxlex / yylex) | 8 | 2 | ~550 Flex rules, DFA scanner |
0x4B2F20 | ptx_validator_general | 4 | 1 | Validates texture, surface, cvt, call |
0x4C5FB0 | ptx_validator_mma_wmma_tcgen05 | 4 | 1 | MMA, WMMA, tensor core validation |
0x71F630 | preprocessor_dispatch | 4 | 1 | .MACRO, .ELSE, .INCLUDE |
0x489050 | ptx_to_ori_converter | 5 | 1 | PTX AST to ORI IR translation |
Details: PTX Parser, PTX Directives, PTX to ORI
Static Initialization
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x4094C0 | ctor_001 -- thread infra init | 4 | 0 | pthread_key_create, mutex |
0x4095D0 | ctor_003 -- PTX opcode name table | 6 | 0 | ~900 ROT13-encoded PTX mnemonics |
0x40D860 | ctor_005 -- tuning knob registry | 6 | 0 | 80 KB, 2000+ ROT13 knob names |
0x421290 | ctor_007 -- scheduler knob registry | 4 | 0 | 98 ROT13 scheduler knobs |
Details: Pipeline Entry, Binary Layout
Phase Manager & Optimization Framework
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0xC60D30 | phase_factory (159-case switch) | 12 | 1 | Allocates phase objects |
0xC62720 | PhaseManager_ctor | 10 | 2 | 159-entry phase table |
0xC64F70 | phase_dispatch_loop | 5 | 2 | Executes phases, reports timing |
0xC64310 | per_phase_timing_reporter | 5 | 1 | "[Total N KB] [Freeable N KB]" |
0xC641D0 | phase_name_to_index_lookup | 5 | 3 | Binary search, case-insensitive |
0x7DDB50 | phase_run_dispatch | 14 | many | Vtable-based phase execution |
0x9F4040 | NamedPhases_parse_and_build | 6 | 1 | "shuffle", "OriCopyProp", etc. |
0x798B60 | NamedPhases_parser | 4 | 2 | PTXAS_DISABLE env var parsing |
0x799250 | IsPassDisabled | 5 | 4 | Checks knob index 185 |
0xA36360 | pass_sequence_builder | 6 | 1 | Constructs NvOptRecipe pass list |
Details: Phase Manager, Pass Inventory, Optimizer Pipeline
ORI IR & Instruction Access
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x9253C0 | instruction_operand_get | 11 | many | Operand accessor on ORI instructions |
0x7E6090 | instruction_modifier_set | 10 | many | IR modification helper |
0x781F80 | instruction_iterator | 12 | many | Doubly-linked list traversal |
0x7DF3A0 | instruction_property_query | 5 | many | Instruction flag/attribute checker |
0x91BF30 | register_type_query | 8 | many | Register class/type inspection |
0x9314F0 | register_class_id_query | 7 | 1,547 | Most-called non-trivial regalloc fn |
0x931920 | register_class_compat_checker | 6 | 328 | Pair register class handling |
0x934630 | register_id_packer | 9 | 856 | Packs reg#/class/type into 32-bit |
0xB28E00 | ir_node_type_query | 5 | many | Node kind discrimination |
0xB28E90 | ir_node_field_accessor | 6 | many | Generic field getter |
0xA50650 | CodeObject_EmitRecords | 1 | 8 | 74 KB, ORI record serializer (56 section types) |
0xA53840 | EmitRecords_wrapper | 1 | 1 | Thin wrapper, adds type-44 header |
Details: Instructions, Registers, Data Structures, CFG
Intrinsic Infrastructure
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x5D1660 | intrinsic_table_register (608 entries) | 7 | 1 | Master name-to-ID table |
0x5D4190 | intrinsic_dispatch_builder | 13 | 1 | PTX opcode -> codegen handler mapping |
0x5FF700 | intrinsic_prototype_emitter | 5 | 1 | 354 KB -- largest function in binary |
0x5C7A50 | wmma_mma_codegen | 4 | 1 | 173 KB, all shapes/types/layouts |
0x5C10A0 | mma_codegen (mma.sync) | 4 | 1 | 120 KB, m8n8k4 through m16n8k256 |
0x5BBC30 | tcgen05_mma_codegen (Blackwell) | 5 | 1 | 90 KB, 5th-gen tensor core |
0x70FA00 | ocg_intrinsic_handler | 8 | 1 | OCG-level intrinsic routing |
0x6A97B0 | intrinsic_lowering_main | 4 | 1 | 26 KB, switch-based lowering |
0x6C9EB0 | ocg_builtin_name_lookup | 5 | 1 | Blackwell+ OCG name table |
Details: Intrinsics Index, Math Intrinsics, Tensor Intrinsics, Sync & Warp
Register Allocator
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x9721C0 | regalloc_entry ("REGALLOC GUIDANCE") | 6 | 1 | Top-level allocator entry |
0x957160 | fatpoint_allocator_core | 7 | 1 | Core fatpoint graph coloring |
0x96D940 | spill_guidance_engine | 5 | 1 | Determines spill strategy |
0x971A90 | full_alloc_with_spill_retry | 4 | 1 | "NOSPILL REGALLOC" path |
0x9714E0 | regalloc_failure_reporter | 6 | 1 | "Register allocation failed..." |
0x926A30 | interference_graph_builder | 9 | 7 | 22 KB, SSE bitvectors |
0x92C240 | liveness_bitvector_ops | 5 | 87 | Set/clear/query with aliasing |
0x917A60 | opcode_to_regclass_mapping | 4 | 221 | Massive switch |
0x910840 | ConvertMemoryToRegisterOrUniform | 5 | 1 | Pass driver |
Details: RegAlloc Overview, RegAlloc Algorithm, Spilling, ABI
Instruction Scheduling
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x8D0640 | ScheduleInstructions (top-level) | 7 | 1 | String: "ScheduleInstructions" |
0x688DD0 | scheduler_engine (main BB loop) | 5 | 1 | ReduceReg / DynBatch selection |
0x8C9320 | scheduling_priority_function | 4 | 0 | ~300 locals, core heuristic |
0x68B9C0 | dependency_graph_builder | 4 | 1 | RAW/WAR/WAW hazard analysis |
0x6820B0 | build_ready_list | 5 | 1 | Zero-dependency instructions |
0x8CD6E0 | reverse_scheduling_driver | 4 | 1 | Reverse post-order iteration |
0x8CEE80 | register_budget_with_occupancy | 4 | 1 | Pressure coeff default 0.045 |
0x8E4400 | hw_profile_table_init | 6 | 3 | Encoding/latency property tables |
0xA9CDE0 | scheduling_metadata_builder | 6 | 1 | Per-instruction sched metadata |
0xA9CF90 | scheduling_metadata_accessor | 5 | many | Sched metadata field queries |
0xAED3C0 | scheduling_optimization_mega_pass | 4 | 0 | 137 KB, ~560 locals, largest vtable pass |
Details: Scheduling Overview, Scheduling Algorithm, Latency Model, Scoreboards
Codegen & ISel
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x169B190 | isel_pattern_dispatch (master) | 5 | 1 | 280 KB, 65,999 insns -- largest by insn count |
0x143C440 | sm120_peephole_dispatch | 4 | 1 | SM120 (RTX 50), 373-case switch |
0x198BCD0 | sm100_peephole_dispatch | 4 | 1 | SM100 (Blackwell), 1336 callees |
0x83EF00 | main_peephole_pass | 6 | 0 | 29 KB, 392 callees |
0x6D9690 | master_instruction_encoder | 7 | 1 | 94 KB, opcode switch |
0x6E4110 | sass_codegen_main | 4 | 1 | EmitSASSForFunction, FNV-1a BB hash |
0x6F52F0 | SASS_pipeline_run_stages | 5 | 1 | Mercury SASS compilation pipeline |
0x9ED2D0 | MercConverter_entry | 6 | 1 | ORI to Mercury IR conversion |
0x9F1A90 | MercConverter_builder | 6 | 1 | Mercury instruction construction |
Bitfield Encoding
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x7B9B80 | bitfield_insert(insn, off, wid, val) | 9 | 18,347 | Most-called by caller count |
0x7BC030 | encode_register_operand | 4 | 6,147 | 1-bit + 4-bit type + 10-bit reg |
0x7B9D60 | encode_reuse_flags_predicate | 4 | 2,408 | 1-bit reuse + 5-bit predicate |
0x7BC5C0 | encode_immediate_const_operand | 4 | 1,449 | Const buffer index or immediate |
0x7BCF00 | encode_predicate_register | 4 | 1,657 | PT=14, 2-bit type + 3-bit condition |
0x10B6180 | 1_bit_boolean_encoder | 3 | 8,091 | .S/.U, .STRONG, etc. |
Details: Encoding, SASS Printing
ELF / CUBIN Output
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x612DE0 | section_attr_builder | 11 | 1 | 76 KB, ELF section/attribute config |
0x1C9F280 | master_elf_emitter | 9 | 1 | Complete CUBIN assembly |
0x1CB53A0 | elf_world_init | 7 | 1 | 672-byte ELFW context |
0x1CB68D0 | symbol_table_builder | 5 | 1 | .symtab from internal symbols |
0x1CABD60 | master_section_allocator | 5 | 1 | Shared/const/local memory |
0x1CB3570 | add_function_section | 5 | 44 | Creates .text.FUNCNAME + .rela |
0x1CD48C0 | relocation_processor | 5 | 1 | Relocation section emission |
0x1C9B110 | mercury_capsule_builder | 4 | 1 | Creates embedded .nv.merc ELF |
Details: ELF Emitter, Sections, Relocations, Debug Info, Capsule Mercury
Knobs System
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x79B240 | GetKnobIndex | 6 | 2 | ROT13 name lookup, case-insensitive |
0x79D070 | ReadKnobsFile | 5 | 1 | Parses [knobs] section from file |
0x79F540 | ParseKnobValue | 4 | 1 | 12-type switch: bool/int/float/string/... |
0x79D990 | ProcessKnobs (top-level) | 4 | 1 | File + pragma + numbered config |
0xA0F020 | knob_conditional_evaluator | 5 | many | [WHEN condition] handler |
Details: Knobs, Opt Levels
Target-Specific Code
| Address | Identity | Pages | Callers | Notes |
|---|---|---|---|---|
0x6765E0 | target_profile_selector | 7 | 1 | SM-dependent profile dispatch |
0x607DB0 | target_feature_query | 7 | many | SM feature capability checks |
0x896D50 | sass_mnemonic_table_init (ROT13) | 4 | 1 | ~400+ SASS instruction names |
0x89FBA0 | instruction_latency_init | 4 | 3 | Encoding/latency property tables |
Details: Targets Index, Turing-Ampere, Ada-Hopper, Blackwell, tcgen05
Subsystem Routing Table
To find a specific function, locate it by address range or subsystem topic in this table. Each page contains a detailed Function Map section with complete listings.
By Subsystem Topic
| Subsystem | Primary Pages | Functions |
|---|---|---|
| Memory allocator, pools | memory-pools.md | 30 |
| Hash maps, bitvectors, sets | hash-bitvector.md | 51 |
| Threading, TLS, jobserver | threading.md | 41 |
| CLI parsing, option handling | cli-options.md | 17 |
| Tuning knobs (2000+ knobs) | knobs.md | 56 |
| Optimization levels | opt-levels.md | 14 |
| DumpIR debug output | dumpir.md | 14 |
| Compilation pipeline | overview.md, entry.md | 56+25 |
| PTX lexer & parser | ptx-parser.md | 75 |
| PTX directives | ptx-directives.md | 41 |
| PTX-to-ORI translation | ptx-to-ori.md | 41 |
| Optimizer pipeline | optimizer.md | 28 |
| ORI instruction IR | instructions.md | 80 |
| CFG construction | cfg.md | 18 |
| Register representation | registers.md | 40 |
| IR data structures | data-structures.md | 74 |
| Phase manager (159 phases) | phase-manager.md | 26 |
| Copy propagation, CSE, GVN | copy-prop-cse.md | 65 |
| General optimization passes | general-optimize.md | 71 |
| Loop optimization (unroll, LICM, SWP) | loop-passes.md | 92 |
| Branch/switch optimization | branch-switch.md | 24 |
| Strength reduction | strength-reduction.md | 25 |
| Predication | predication.md | 28 |
| Rematerialization | rematerialization.md | 55 |
| Liveness analysis | liveness.md | 42 |
| Sync barriers | sync-barriers.md | 66 |
| Late legalization | late-legalization.md | 59 |
| Hot/cold splitting | hot-cold.md | 10 |
| GMMA pipelining | gmma-pipeline.md | 47 |
| Uniform registers | uniform-regs.md | 22 |
| Register allocator core | algorithm.md | 50 |
| Spilling | spilling.md | 54 |
| ABI handling | abi.md | 87 |
| Scheduling overview | overview.md | 112 |
| Scheduling algorithm | algorithm.md | 121 |
| Latency model & HW profiles | latency-model.md | 78 |
| Scoreboards & barriers | scoreboards.md | 56 |
| ISel pattern matching | isel.md | 182 |
| SASS encoding | encoding.md | 92 |
| Peephole optimization | peephole.md | 67 |
| Mercury IR conversion | mercury.md | 79 |
| SASS templates | templates.md | 46 |
| SASS printing / renderer | sass-printing.md | 96 |
| Capsule Mercury | capmerc.md | 20 |
| Intrinsic infrastructure | index.md | 159 |
| Math intrinsics | math.md | 42 |
| Tensor core intrinsics | tensor.md | 45 |
| Sync & warp intrinsics | sync-warp.md | 65 |
| SM targets & features | index.md | 70 |
| ELF emitter | elf-emitter.md | 29 |
| ELF sections | sections.md | 33 |
| Debug info (DWARF) | debug-info.md | 33 |
| Relocations | relocations.md | 19 |
By Address Range
Functions in the binary are clustered by subsystem. This table maps address ranges to the pages that document them.
| Address Range | Primary Subsystem | Key Pages |
|---|---|---|
0x400000-0x424000 | Entry, static init, main | entry.md, binary-layout.md |
0x424000-0x42E000 | Memory pools, hash maps, lists | memory-pools.md, hash-bitvector.md |
0x42E000-0x446000 | Diagnostics, CLI parsing | cli-options.md, entry.md |
0x446000-0x452000 | Compilation driver | overview.md, entry.md |
0x452000-0x4D5000 | PTX parser & validator | ptx-parser.md, ptx-directives.md |
0x4D5000-0x5AA000 | PTX-to-ORI, early IR | ptx-to-ori.md, instructions.md |
0x5AA000-0x612000 | Intrinsic infrastructure | index.md, math.md, tensor.md |
0x612000-0x67F000 | Section builder, target config | sections.md, index.md |
0x67F000-0x6E4000 | Scheduling engine, OCG lowering, encoding | overview.md, encoding.md |
0x6E4000-0x754000 | SASS codegen, SASS pipeline | mercury.md, overview.md |
0x754000-0x7C0000 | Liveness, knobs, bitfield encoding | liveness.md, knobs.md, encoding.md |
0x7C0000-0x8FE000 | Peephole, SASS mnemonics, scheduling upper | peephole.md, algorithm.md |
0x8FE000-0x9D3000 | Register allocator | overview.md, algorithm.md, abi.md |
0x9D3000-0xAA8000 | Post-regalloc, named phases, remat | rematerialization.md, phase-manager.md |
0xAA8000-0xC52000 | Mega-passes, sync barriers, dataflow | sync-barriers.md, general-optimize.md |
0xC52000-0xD27000 | Phase manager, phase factory | phase-manager.md, optimizer.md |
0xD27000-0x10B7000 | 592 SASS encoder bodies | encoding.md, isel.md |
0x10B7000-0x1225000 | Field encoders, ISel helpers | encoding.md, isel.md |
0x1225000-0x13CF000 | Bitvector, ISel coordinators | hash-bitvector.md, isel.md |
0x13CF000-0x17F8000 | SM-specific ISel, pattern matchers, templates | isel.md, templates.md |
0x17F8000-0x1C21000 | SASS printing, peephole mega-dispatchers | sass-printing.md, peephole.md |
0x1C21000-0x1CE3000 | ELF emitter, capsule mercury, relocations | elf-emitter.md, capmerc.md |
Statistics
Top 10 Most-Called Functions
| Rank | Address | Identity | Callers |
|---|---|---|---|
| 1 | 0x7B9B80 | bitfield_insert | 18,347 |
| 2 | 0x10B6180 | 1-bit boolean encoder | 8,091 |
| 3 | 0x7BC030 | encode_register_operand | 6,147 |
| 4 | 0x4280C0 | get_thread_local_context | 3,928 |
| 5 | 0x42BDB0 | fatal_OOM_handler | 3,825 |
| 6 | 0x424070 | pool_alloc | 3,809 |
| 7 | 0x426150 | hashmap_put | 2,800 |
| 8 | 0x7B9D30 | clear_const_buffer_slots | 2,408 |
| 9 | 0x7B9D60 | encode_reuse_flags_predicate | 2,408 |
| 10 | 0x42FBA0 | diagnostic_emit | 2,350 |
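With 18,347 call sites, bitfield_insert (0x7B9B80) is the workhorse of the SASS encoder. Its recovered behavior is the classic masked insert into a wide instruction word; a sketch (the argument order and the use of Python's arbitrary-precision ints for the 128-bit word are assumptions):

```python
def bitfield_insert(word: int, value: int, offset: int, width: int) -> int:
    """Insert `value` into bits [offset, offset+width) of `word`.
    Any previous contents of the field are cleared; `value` is
    truncated to `width` bits, matching masked-insert semantics."""
    mask = (1 << width) - 1
    return (word & ~(mask << offset)) | ((value & mask) << offset)
```

Thousands of template-generated encoding handlers reduce to sequences of such calls against a packed 128-bit machine word.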
Top 5 Largest Functions
| Rank | Address | Identity | Size |
|---|---|---|---|
| 1 | 0x5FF700 | intrinsic_prototype_emitter | 354 KB |
| 2 | 0x169B190 | isel_pattern_dispatch | 280 KB |
| 3 | 0x198BCD0 | sm100_peephole_dispatch | 233 KB |
| 4 | 0x143C440 | sm120_peephole_dispatch | 233 KB |
| 5 | 0x5C7A50 | wmma_mma_codegen | 173 KB |
Top 10 Most Cross-Referenced (by wiki page count)
| Rank | Address | Identity | Pages |
|---|---|---|---|
| 1 | 0x424070 | pool_alloc | 19 |
| 2 | 0x7DDB50 | phase_run_dispatch | 14 |
| 3 | 0x446240 | real_main | 13 |
| 3 | 0x5D4190 | intrinsic_dispatch_builder | 13 |
| 5 | 0x781F80 | instruction_iterator | 12 |
| 5 | 0xC60D30 | phase_factory | 12 |
| 7 | 0x9253C0 | instruction_operand_get | 11 |
| 7 | 0x612DE0 | section_attr_builder | 11 |
| 7 | 0x426150 | hashmap_put | 11 |
| 7 | 0x426D60 | hashmap_get | 11 |
Documentation Coverage
| Metric | Count |
|---|---|
| Total unique functions documented | 2,063 |
| Wiki pages with function maps | 70 |
| Functions in 5+ pages (high cross-reference) | 89 |
| Functions in 1 page only (subsystem-internal) | 1,324 |
| Confidence CERTAIN | ~40 |
| Confidence HIGH | ~1,400 |
| Confidence MEDIUM | ~620 |
Binary Layout
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
PTXAS v13.0.88 is a 37,741,528-byte stripped x86-64 ELF executable. Its .text section spans 26.2 MB (0x403520--0x1CE2DE2) containing 40,185 functions. This page maps every byte of the binary to the subsystem that owns it, derived from the 30 sweep reports (p1.01 through p1.30) covering the complete address range.
ELF Section Map
| Section | Address | Size | Notes |
|---|---|---|---|
.plt | 0x402C00 | 2,336 B (146 stubs) | Procedure linkage table for libc/libpthread imports |
.text | 0x403520 | 26,212,546 B (26.2 MB) | All executable code -- 40,185 functions |
.rodata | 0x1CE2E00 | 7,508,368 B (7.5 MB) | Read-only data: encoding tables, strings, DFA tables |
.eh_frame_hdr | 0x240BF90 | 358,460 B (350 KB) | Exception handling frame index |
.eh_frame | 0x2664A60 | 3,751,640 B (3.7 MB) | Unwinding data for 40K functions |
.gcc_except_table | 0x29F8938 | 940 B | C++ exception filter tables |
.ctors | 0x29F8CE8 | 104 B (12 entries) | Static constructor table |
.data.rel.ro | 0x29F8D60 | 4,256 B | Vtable pointers, resolved at load time |
.got.plt | 0x29FA000 | 1,184 B (148 entries) | Global offset table for PLT |
.data | 0x29FA4A0 | 14,032 B (13.7 KB) | Initialized globals: function pointers, defaults |
.bss | 0x29FDB80 | 85,864 B (83.9 KB) | Zero-init globals: knob tables, TLS keys, mutexes |
Total file composition:
| Component | Size | Percentage |
|---|---|---|
.text | 26.2 MB | 69.4% |
.rodata | 7.5 MB | 19.9% |
.eh_frame + .eh_frame_hdr | 4.0 MB | 10.7% |
.data + .bss + other | 0.1 MB | 0.3% |
Program Headers
| Segment | VirtAddr | MemSiz | Flags | Contents |
|---|---|---|---|---|
| LOAD 0 | 0x400000 | 32.4 MB | R E | .text + .rodata + headers + .eh_frame_hdr |
| LOAD 1 | 0x2664A60 | 3.7 MB | RW | .eh_frame + .data + .bss + .got |
| GNU_RELRO | 0x2664A60 | 3.6 MB | R | Read-only after relocation (.eh_frame through .data.rel.ro) |
| GNU_EH_FRAME | 0x240BF90 | 350 KB | R | Exception handling index |
| GNU_STACK | 0x0 | 0 | RW | Non-executable stack |
Entry point: 0x42333C (ELF e_entry), which is inside .text (the CRT startup stub _start). The actual main is at 0x409460.
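The e_entry value can be confirmed directly from the file header without any ELF library. A minimal pure-Python check (the synthetic header below stands in for the real binary):

```python
import struct

def read_elf64_entry(data: bytes) -> int:
    """Return e_entry from an ELF64 header (little-endian).
    e_entry is the 8-byte field at offset 24 of the 64-byte header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return struct.unpack_from("<Q", data, 24)[0]

# Synthetic 64-byte header used for illustration:
hdr = bytearray(64)
hdr[0:4] = b"\x7fELF"
struct.pack_into("<Q", hdr, 24, 0x42333C)
```

Running `read_elf64_entry(open("ptxas", "rb").read(64))` against v13.0.88 should return 0x42333C, matching the value above.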
Three Subsystems
The .text section decomposes into three subsystems with distinct coding styles, data structures, and origins:
.text linear address map (26.2 MB)
```
0x403520                    0x67F000                  0xC52000                      0x1CE2DE2
|--- PTX Frontend 2.9 MB ---|-- Ori Optimizer 5.8 MB --|---- SASS Backend 17.6 MB ----|
|            11%            |           22%            |             67%              |
| parsers, validators,      | passes, regalloc,        | encoding handlers, ISel,     |
| intrinsics, formatters    | scheduling, CFG analysis | peephole, codecs, ABI, ELF   |
```
| Subsystem | Address Range | Size | Functions | Share | Avg Fn Size | Largest Function |
|---|---|---|---|---|---|---|
| PTX Frontend | 0x403520--0x67F000 | 2.9 MB | ~2,592 | 11% | ~1,170 B | sub_46E000 (93 KB, opcode table builder) |
| Ori Optimizer | 0x67F000--0xC52000 | 5.8 MB | ~11,001 | 22% | ~550 B | sub_926A30 (155 KB decomp, interference graph) |
| SASS Backend | 0xC52000--0x1CE2DE2 | 17.6 MB | ~26,592 | 67% | ~690 B | sub_169B190 (280 KB, master ISel dispatch) |
The backend dominates the binary because SASS instruction encoding is template-generated code: each of the ~4,000 encoding handler functions is a standalone vtable entry, never called directly. The optimizer has the highest function density (many small pass helpers), while the frontend has the largest average function size (complex validators and parsers).
Complete .text Address Map
The table below maps every address range in the .text section to its subsystem, function count, and key entry points. Data is aggregated from the 30 sweep partitions (p1.01 through p1.30).
PTX Frontend (0x403520--0x67F000, 2.9 MB)
Note on the 0x400000--0x403520 gap: the LOAD segment begins at 0x400000, but the first 13.6 KB before .text contains the ELF header (64 B at 0x400000), program headers (7 entries, 392 B), .interp (28 B, path to ld-linux-x86-64.so.2), .hash/.gnu.hash (symbol hash tables), .dynsym/.dynstr (dynamic symbol table, 146 entries), .gnu.version/.gnu.version_r (symbol versioning), .rela.plt (PLT relocations, 146 entries), and the .plt stub table (2,336 B, 146 stubs at 0x402C00--0x403520). These are standard ELF infrastructure, not ptxas application code. The first ptxas function begins at 0x403520.
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0x403520--0x430000 | 178 KB | ~300 | Runtime infrastructure: pool allocator, hash maps, TLS, diagnostics, error reporting, string utilities | sub_424070 (pool alloc, 3809 callers), sub_4280C0 (TLS context, 3928 callers), sub_426150 (hash insert, 2800 callers), sub_42FBA0 (diagnostic emitter, 2350 callers), sub_427630 (MurmurHash3) |
0x430000--0x460000 | 200 KB | ~120 | CLI parsing and compilation driver: option registration, argument parser, target configuration, register/resource constraints, Chrome trace JSON parser | sub_446240 (real main, 11 KB), sub_432A00 (option registration, 6 KB), sub_434320 (option parser, 10 KB), sub_43B660 (register constraint calc), sub_439880 (trace JSON parser) |
0x460000--0x4D5000 | 470 KB | ~350 | PTX instruction validators: per-opcode semantic checkers for MMA, WMMA, load/store, cvt, atomics, barriers, tensormap, async copy | sub_4B2F20 (general validator, 52 KB), sub_4CE6B0 (Bison parser, 48 KB), sub_4C5FB0 (operand validator, 28 KB), sub_4C2FD0 (WMMA/MMA validator, 12 KB), sub_4A73C0 (tensormap validator, 11 KB) |
0x4D5000--0x5AA000 | 872 KB | 581 | PTX instruction text generation: 580 per-opcode formatters that convert internal IR to PTX assembly text, plus a built-in function declaration emitter | sub_5D4190 (formatter dispatch, 13 KB), sub_5FF700 (builtin decl emitter, 34 KB), ~580 formatter functions (avg 1.5 KB each) |
0x5AA000--0x67F000 | 874 KB | 628 | Intrinsic infrastructure: 608 CUDA intrinsic handlers, MMA/WMMA/tcgen05 tensor core codegen, SM profile tables (sm_75 through sm_121), special register init, ELF/DWARF finalization, memory space management | sub_5D1660 (608 intrinsics, 46 KB), sub_607DB0 (SM profile hash maps, 14 KB), sub_6765E0 (arch capability constructor, 54 KB), sub_612DE0 (version string) |
Ori Optimizer (0x67F000--0xC52000, 5.8 MB)
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0x67F000--0x754000 | 869 KB | ~500 | Mercury SASS backend core: scheduling engine (ReduceReg/DynBatch, 9 reg pressure counters), WAR hazard management, Opex (operand expansion) pipeline, OCG intrinsic lowering, instruction encoding core, Flex DFA scanner, ELF section helpers | sub_688DD0 (scheduler engine, 20 KB), sub_6D9690 (encoding switch, 94 KB), sub_6FC240 (WAR/scoreboard), sub_720F00 (Flex scanner, 64 KB, 552 rules) |
0x754000--0x829000 | 872 KB | 1,545 | Knobs infrastructure (1,294 entries) and peephole optimizer class: knob lookup/read/file parsing, PeepholeOptimizer with 7 virtual methods (Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, SchedulingAwarePatterns), pipeline orchestrator, Mercury operand registration helpers | sub_79B240 (GetKnobIndex), sub_79D070 (ReadKnobsFile), sub_7A5D10 (PeepholeOptimizer), sub_7BD3C0/sub_7BD650/sub_7BE090 (operand registrars), sub_7BD260 (encoding finalize) |
0x829000--0x8FE000 | 872 KB | 1,069 | Debug line tables, scheduler core, and HW profiles: ScheduleInstructions pipeline (context setup, priority computation, reverse scheduling, register budget with occupancy optimization), ROT13 SASS mnemonic table, architecture-specific latency/throughput profiles, constant bank naming, peephole/legalization passes, cutlass-aware scheduling heuristics | sub_8BF000--0x8D1600 (ScheduleInstructions), sub_896D50 (ROT13 SASS mnemonics), sub_8F0D00 (HW latency profiles), sub_8F4820 (cutlass heuristics) |
0x8FE000--0x9D3000 | 872 KB | 1,090 | Register allocator: fatpoint algorithm core, interference graph builder (155 KB decompiled -- largest non-dispatch function), spill/refill mechanism, live range analysis, retry with reduced register count, memory-to-register promotion, ConvertMemoryToRegisterOrUniform pass | sub_926A30 (interference graph, 155 KB decomp), sub_957160 (fatpoint core), sub_95DC10 (regalloc driver), sub_9714E0 (failure handler + retry), sub_910840 (ConvertMemoryToRegister) |
0x9D3000--0xAA8000 | 860 KB | 1,218 | Post-RA pipeline phases: NamedPhases registry (OriPerformLiveDead, OriCopyProp, shuffle, swap1--swap6), DAG/dependency analysis, IR statistics printer (instruction count, reg count, estimated latency, spill bytes, occupancy, throughput), hot/cold split, mbarrier intrinsics, regalloc verification, uninitialized register detection | sub_9F4040 (NamedPhases registry), sub_A3A7E0 (IR stats printer), sub_A0B5E0 (uninitialized reg detector), sub_A9EDB0 (mbarrier/scheduling, 85 KB decomp) |
0xAA8000--0xB7D000 | 862 KB | 4,493 | GMMA/WGMMA pipeline optimizer, ISel, and instruction emission: GMMA register allocation, warpgroup sync injection, instruction emission helpers (SASS encoder dispatch), post-scheduling IR statistics, operand legalization, 1,269 tiny vtable dispatchers (~160 bytes each), live range analysis, scheduler-integrated mega-pass | sub_AED3C0 (mega scheduling/ISel pass, 137 KB decomp), sub_AF7DF0/sub_AF7200 (register decode helpers), ~1,269 vtable dispatchers |
0xB7D000--0xC52000 | 870 KB | 1,086 | CFG analysis, bitvectors, and IR manipulation: ~390 instruction operand pattern matchers, bitvector dataflow framework (alloc, OR, AND, XOR, clear, iterate), CFG analysis (edge printing, reverse post-order, DOT graph dump), scoreboard and instruction classification, sync analysis | sub_BDC000 (bitvector infra), sub_BDE8B0 (CFG/RPO/DOT), sub_BE2E40 (scoreboard classification), ~390 operand pattern matchers |
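The fatpoint algorithm that anchors the register allocator (see the glossary) can be illustrated on a straight-line block: walk liveness backward, record the live set at each program point, and the maximum set size is the block's register demand. A toy model -- the real allocator works over the full CFG with iterative dataflow:

```python
def fatpoints(instrs):
    """Compute the fatpoint (live-in set) at each instruction of a
    single straight-line block. `instrs` is a list of (defs, uses)
    tuples in program order; liveness is propagated backward."""
    live = set()
    points = []
    for defs, uses in reversed(instrs):
        live -= set(defs)   # a def kills the register above this point
        live |= set(uses)   # a use makes it live above this point
        points.append(frozenset(live))
    return list(reversed(points))

# r1 = ...; r2 = ...; r3 = r1 + r2; use r3
block = [
    (("r1",), ()),
    (("r2",), ()),
    (("r3",), ("r1", "r2")),
    ((), ("r3",)),
]
```

The maximum fatness over all points (here 2, at the add) is what the allocator compares against the physical register budget before mapping virtuals to hardware registers or spilling.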
SASS Backend (0xC52000--0x1CE2DE2, 17.6 MB)
| Address Range | Size | Functions | Subsystem | Key Functions |
|---|---|---|---|---|
0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager (159 phases): phase factory (159-case switch), phase vtable table at off_22BD5C8, default phase ordering table at 0x22BEEA0, 530 encoding table initialization bodies, instruction handler vtable bodies | sub_C60D30 (phase factory), sub_C62720 (PhaseManager constructor), sub_C60D20 (default table pointer), ~530 phase table body functions |
0xD27000--0xDFC000 | 853 KB | 592 | SASS encoder table (SM100 Blackwell, set 1): 592 uniform template-generated encoding handlers, each packing operands into a 1,280-bit instruction word at a1+544. Covers 60 opcode classes across 16 format groups. All vtable-dispatched (zero direct callers). | 592 per-variant handlers (avg 1,473 B), sub_7B9B80 (bitfield insert helper) |
0xDFC000--0xED1000 | 877 KB | 591 | SASS encoder/decoder (SM100 Blackwell, set 2): 494 encoders translating IR to packed SASS bitfields, plus 97 decoders for the reverse direction (disassembly/validation). All vtable-dispatched. | 494 encoders (0xDFC--0xEB2), 97 decoders (0xEB3--0xED0), sub_E0F370 (largest, 11 KB) |
0xED1000--0xFA6000 | 860 KB | 683 | SM100 SASS encoders (set 3): 683 per-variant encoding handlers for 59 SASS opcodes. Each sets opcode ID, loads 128-bit format descriptor via SSE, initializes 10-slot register class map, registers operands, finalizes, extracts bitfields. | 683 template-generated handlers, 128-bit xmmword format descriptors |
0xFA6000--0x107B000 | 851 KB | 678 | SM100 SASS encoders (set 4): 587 primary encoders (opcodes 16--372, predicate/comparison/memory/tensor/control flow), plus 91 alternate-form encoders for dual-width or SM-variant instruction encodings. Combined with sets 1--3: 2,544 SM100 encoding handlers total. Six mega dispatch tables. | 587 primary + 91 alternate-form encoders, 6 dispatch tables |
0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec completion: 641 final encoding handlers, 78 object lifecycle and scheduling support functions (FNV-1a hash, instruction construction), 2,095 bitfield accessor functions (machine-generated read/write primitives for the packed encoding format). Seven core extractors handle 1-bit, 2-bit, and multi-bit fields across 192-bit words. | sub_10AFF80 (instruction constructor, 11 KB, 32 params), 2,095 bitfield accessors, 7 core extractors |
0x1150000--0x1225000 | 852 KB | 733 | SASS codec (decoders + encoders): both directions of the instruction codec for an older SM target (likely sm_89 Ada Lovelace or sm_90 Hopper). Decoders read 128-bit words and extract fields; encoders pack fields back. Three mega-decoders (29--33 KB each) and two mega-dispatchers (78--104 KB, too large for Hex-Rays). | 3 mega decoders (29--33 KB), 2 mega dispatchers (78--104 KB), 728 of 733 vtable-dispatched |
0x1225000--0x12FA000 | 860 KB | 1,552 | Register-pressure scheduling + ISel + encoders: register-pressure-aware instruction scheduling (0x1225--0x1240), instruction selection and emission pipeline (0x1240--0x1254), 982 SASS binary encoders packing operand fields into 128-bit words (0x1254--0x12FA). All encoders vtable-dispatched. | Scheduling at 0x1225--0x1240, ISel at 0x1240--0x1254, 982 encoding handlers |
0x12FA000--0x13CF000 | 845 KB | 1,282 | Operand legalization and peephole: 522 per-instruction bit-field encoders (366 KB), 186 peephole pattern matchers (81 KB), 11 operand legalization/materialization functions (40 KB), 38 operand encoding emitters (31 KB), 8 live-range analysis functions (14 KB). | sub_137B790 (operand legalization, 8.5 KB), 186 peephole matchers, 522 encoders |
0x13CF000--0x14A4000 | 844 KB | 1,219 | SM120 (RTX 50-series) peephole pipeline: 1,087 instruction pattern matchers (429 KB), one 233 KB master opcode dispatch switch (sub_143C440, 373-case primary switch), 123 instruction encoders (180 KB). Pattern matchers validate opcode, modifiers, and operand types; dispatch rewrites opcode byte and operand mapping. | sub_143C440 (233 KB dispatch, 373-case switch), 1,087 pattern matchers, 123 encoders |
0x14A4000--0x1579000 | 852 KB | 606 | Blackwell ISA encode/decode: 332 encoder functions (0x14A4--0x1520) packing SASS bitstreams, 1 dispatcher (vtable router at 0x15209F0), 273 decoder functions (0x1520--0x1578) unpacking bitstreams and validating fields. Encoder state struct is 600+ bytes with 128-bit format descriptor at +8, operand arrays at +24--+143. | 332 encoders, 273 decoders, 1 dispatcher |
0x1579000--0x164E000 | 852 KB | 1,324 | SASS encoding + peephole matchers: Zone A has 367 instruction encoders, Zone B has 78 utility/transition functions, Zone C has 469 peephole pattern matchers. All pattern matchers are called from a single 280 KB mega-dispatcher (sub_169B190). | 367 encoders, 469 peephole matchers, 78 utilities |
0x164E000--0x1723000 | 873 KB | 899 | ISel pattern matching core: 762 PTX opcode pattern matchers (Zone A), the master dispatch function sub_169B190 at 280 KB / 66K instructions (Zone B -- the single largest function in the binary), 100 encoding table entries, and 36 multi-instruction template expanders. The dispatch tries every matcher, selects the highest-scoring match, and records which SASS expansion template to use. | sub_169B190 (280 KB, 66K insns, 15,870 callees), 762 matchers, 36 template expanders |
0x1723000--0x17F8000 | 852 KB | 631 | ISA description database: ~555 SASS instruction format descriptor classes (one per opcode variant), ~316 bitfield layout initializers, ~239 opcode handler vtable entries. Also contains instruction sequence generators (multi-instruction expansions for complex PTX operations), register allocation helpers, and Newton-Raphson approximation templates. 91.8% of functions have zero static callers (vtable-dispatched). | ~555 format descriptor classes, ~316 bitfield initializers, ~239 vtable entries |
0x17F8000--0x18CD000 | 852 KB | 1,460 | SASS instruction printer + peephole: Subsystem A (0x17F8--0x181F) implements SASS disassembly rendering via virtual method overrides on a builder/visitor with a 4,080+ byte vtable. Subsystem B (0x1820--0x18CC) is a 231 KB peephole dispatch function (sub_18A2CA0, 54K instructions, 1,330 unique callees). | sub_18189C0 (SASS printer, 45 KB), sub_181B370 (SASS printer, 28 KB), sub_18A2CA0 (231 KB peephole dispatch) |
0x18CD000--0x19A2000 | 877 KB | 1,598 | Scheduling + peephole dispatchers: Zone A (275 KB) is the instruction scheduling core (list scheduler, dependency graph, ready queue, register pressure tracking). Zone B (130 KB) contains 318 opcode property/classification tables. Zones C+D (460 KB) contain 888 peephole pattern matchers called from sub_198BCD0 (239 KB, 1,336 unique callees). | sub_198BCD0 (239 KB peephole dispatch), 392 scheduling functions, 318 opcode property tables, 888 pattern matchers |
0x19A2000--0x1A77000 | 880 KB | 1,393 | GPU ABI/calling convention + SM89/90 encoders: Zone A (250 KB, 276 functions) implements the NVIDIA GPU calling convention -- parameter register allocation, return address placement, scratch/preserved classification, convergent boundary enforcement, coroutine SUSPEND semantics, uniform register support, per-SM ABI lowering (sm_35 through sm_100+). Zone B (480 KB) has ~1,117 supplementary SASS encoding vtable handlers. | sub_19D1AF0 (master ABI setup, 5.6 KB), 276 ABI functions, ~1,117 encoding handlers |
0x1A77000--0x1B4C000 | 829 KB | 1,518 | SASS emission backend (4 SM families): Zone A has 1,083 bit-field packing encoders spanning sm_50 through sm_100+. Zone B has 339 instruction lowering/expansion functions (two SM families: sm_8x and sm_9x/10x). Zone C has 84 Ampere/Ada/Hopper-era encoders. Zone D has 92 Blackwell-era encoders. | sub_1B6B250 (register-class-to-HW mapping, 254 callers), 1,083 emitters, 339 lowering functions |
0x1B4C000--0x1C21000 | 876 KB | 1,974 | SASS emission + format descriptors: register-class encoding tables (Zone A), per-SM instruction bit-field encoders (Zone B), instruction emission orchestrators (Zone C), multi-operand dispatch emitters (Zone D), mirrored SM-variant emitters (Zone E), instruction format descriptors (Zone F, 0x1C05--0x1C21). | 487 functions exceed 2 KB decompiled |
0x1C21000--0x1CE2DE2 | 776 KB | 1,628 | Library layer: custom ELF emitter (CUBIN output), capsule Mercury ELF (.nv.capmerc debug metadata), section layout and memory allocation (shared/constant/local/global), relocation resolution (branch targets, UFT/UDT, YIELD-to-NOP), call graph analysis (recursion detection, dead function elimination), DWARF debug generation (.debug_info/.debug_line/.debug_frame), option parsing library, thread pool (pthread-based), JSON builder, GNU Make jobserver client, C++ name demangler (Itanium ABI), ELF file writer | sub_1C9F280 (ELF emitter, 97 KB decomp), sub_1CABD60 (section allocator, 67 KB), sub_1CC9800 (EIATTR builder, 90 KB), sub_1CDC780 (demangler, 93 KB), sub_1CB53A0 (ELF world init), sub_1CD48C0 (relocation resolver, 22 KB), sub_1CBB920 (recursion detector), sub_1CB18B0 (thread pool), sub_1CD13A0 (file writer, 11 KB) |
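The try-every-matcher, keep-the-best scheme recovered from the ISel dispatcher (sub_169B190) reduces to a max-scan over scored matches. A schematic version -- the matcher signature and scoring are assumptions; the real dispatcher consults 762 matchers and records a SASS expansion template:

```python
def select_template(insn, matchers):
    """Try every pattern matcher; keep the highest-scoring match.
    Each matcher returns (score, template_name) or None."""
    best = None
    for match in matchers:
        result = match(insn)
        if result is not None and (best is None or result[0] > best[0]):
            best = result
    return best[1] if best else None

# Two toy matchers: a generic add, and a more specific add-with-immediate
# that outscores it when it applies. Names are illustrative only.
def m_generic_add(insn):
    return (1, "IADD3") if insn["op"] == "add" else None

def m_add_imm(insn):
    if insn["op"] == "add" and isinstance(insn.get("src1"), int):
        return (2, "IADD3_IMM")
    return None
```

Scoring lets specific patterns (immediate forms, fused variants) win over generic fallbacks without any ordering constraint on the matcher list.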
.rodata Contents (7.5 MB)
The .rodata section at 0x1CE2E00--0x240BF8F accounts for 20% of the binary by size. Its dominant consumers:
| Content | Estimated Size | Notes |
|---|---|---|
| SASS encoding format descriptors | ~3.5 MB | 128-bit xmmword constants loaded via SSE by ~4,000 encoding handlers |
| Flex DFA transition tables | ~600 KB | off_203C020, the 552-rule PTX scanner's state machine |
| Bison parser tables | ~400 KB | LALR(1) action/goto tables for the PTX grammar |
| Error/diagnostic format strings | ~300 KB | 30,632 strings extracted from the binary |
| Phase ordering + vtable tables | ~100 KB | Default 159-entry phase table at 0x22BEEA0, vtable table at off_22BD5C8 |
| ROT13-encoded string tables | ~200 KB | PTX opcode names (~900 entries), knob names (~2,000 entries) |
| Architecture capability tables | ~150 KB | Per-SM feature maps (sm_75 through sm_121), HW latency profiles |
| DWARF name tables | ~50 KB | DW_FORM_*, DW_AT_*, DW_OP_* string tables |
| Hash constants + misc | ~2.2 MB | MurmurHash3 mixing constants, lookup tables, padding |
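The MurmurHash3 mixing constants listed above are most recognizable in the 64-bit finalizer, which makes a convenient .rodata fingerprint when matching hash code in a stripped binary. The standard finalizer, for reference:

```python
MASK64 = (1 << 64) - 1

def fmix64(h: int) -> int:
    """MurmurHash3's 64-bit finalizer. The two multiplicative
    constants are the values visible in ptxas's .rodata."""
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h
```

Spotting either constant in a decompiled function is strong evidence of a MurmurHash3 variant, as used to identify sub_427630.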
.bss Contents (84 KB)
| Content | Notes |
|---|---|
| ROT13 PTX opcode name table | Populated by ctor_003 (0x4095D0, 17 KB) at startup |
| General OCG knob table | Populated by ctor_005 (0x40D860, 80 KB) -- ~2,000 entries |
| Mercury scheduler knob table | Populated by ctor_007 (0x421290, 8 KB) -- 98 entries |
| Thread-local storage keys | pthread_key_t for per-thread context (280-byte struct) |
| Global pool allocator mutex | pthread_mutex_t at pool struct offset 7128 |
| Diagnostic suppression bitmaps | Per-warning-ID suppression flags |
| SM architecture profile objects | Constructed on demand per sub_6765E0 |
| Global error/warning counters | Incremented by sub_42FBA0 |
| Make jobserver state | Atomic state machine (0=init, 5=no MAKEFLAGS, 6=no auth, 7=failed) |
.data Contents (14 KB)
| Content | Notes |
|---|---|
| Function pointer tables | Exit wrapper (off_29FA4B0), error handler dispatch |
| Default option values | Populated by sub_432A00 (option registration) |
| Static string table pointers | Version strings, format strings |
| Diagnostic output tables | Severity prefix strings: "error ", "warning ", "info ", "fatal " |
Static Constructors
The .ctors section holds 12 entries executed before main. The four largest are:
| Constructor | Address | Binary Size | Purpose |
|---|---|---|---|
ctor_001 | 0x4094C0 | 204 B | Thread infrastructure: pthread_key_create, mutex init, thread priority range |
ctor_003 | 0x4095D0 | 17,007 B | PTX opcode name table: ~900 ROT13-encoded opcode mnemonics |
ctor_005 | 0x40D860 | 80,397 B | General OCG knob table: ~2,000 ROT13-encoded knob names + hex defaults |
ctor_007 | 0x421290 | 7,921 B | Mercury scheduler knob table: 98 ROT13-encoded scheduler knobs |
The remaining 8 constructors handle memory allocator pool initialization, hash map infrastructure setup, diagnostic system initialization, and architecture vtable factory registration (sub_1CCD900).
Mega-Functions (>50 KB binary)
| Address | Binary Size | Decompiled | Function | Callees |
|---|---|---|---|---|
sub_169B190 | 280 KB | N/A | Master ISel pattern dispatch (66K instructions) | 15,870 |
sub_198BCD0 | 239 KB | N/A | Peephole dispatch, SM variant 2 | 1,336 |
sub_143C440 | 233 KB | N/A | SM120 peephole dispatch (373-case switch) | ~1,100 |
sub_18A2CA0 | 231 KB | N/A | Peephole dispatch, SM variant 1 | 1,330 |
sub_6D9690 | 94 KB | N/A | Instruction encoding switch | ~500 |
sub_46E000 | 93 KB | N/A | PTX opcode-to-handler table builder | 1,168 |
sub_40D860 | 80 KB | N/A | ctor_005: general knob registration | ~2,000 |
sub_720F00 | 64 KB | N/A | Flex DFA scanner (552 rules) | ~50 |
These eight functions account for 1.3 MB of code (4.9% of .text) but only 0.02% of the function count.
Most-Called Functions
| Address | Callers | Identity |
|---|---|---|
sub_4280C0 | 3,928 | Thread-local context accessor (pthread_getspecific) |
sub_42BDB0 | 3,825 | Fatal OOM handler (called from every allocation site) |
sub_424070 | 3,809 | Pool memory allocator (alloc) |
sub_426150 | 2,800 | Hash map insert/update |
sub_42FBA0 | 2,350 | Central diagnostic message emitter |
sub_4248B0 | 1,215 | Pool memory deallocator (free) |
sub_42CA60 | 298 | Linked list prepend |
sub_42D850 | 282 | Hash set insert |
sub_1B6B250 | 254 | Register-class-to-hardware-number lookup (SASS emission) |
sub_4279D0 | 185 | String prefix match (starts_with) |
The top five functions are all in the runtime infrastructure region (0x403520--0x42F000). Together they represent the core allocation, error handling, and data structure layer that the rest of the binary depends on.
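The alloc/OOM pairing in the table (pool_alloc at 3,809 callers, fatal_OOM_handler at 3,825) suggests a bump-style pool with a non-returning failure path reachable from every allocation site. A simplified model -- the capacity, alignment rule, and internal layout here are all assumptions:

```python
class Pool:
    """Minimal sketch of the pool_alloc / fatal_OOM_handler pattern.
    Allocations bump an offset within a preallocated buffer; exhaustion
    takes the fatal path (modeled here as MemoryError)."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.off = 0

    def alloc(self, n: int, align: int = 8) -> int:
        """Return the offset of an n-byte, align-aligned allocation."""
        off = (self.off + align - 1) & ~(align - 1)
        if off + n > len(self.buf):
            raise MemoryError("pool exhausted")  # fatal_OOM_handler analogue
        self.off = off + n
        return off
```

Pool allocation explains why the deallocator (sub_4248B0) has far fewer callers than the allocator: most objects die with their pool rather than being freed individually.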
Binary Composition by Purpose
Estimated from function classification across 30 sweep reports (p1.01--p1.30). Each function was assigned to a single purpose category based on its dominant behavior; functions straddling categories (e.g., a scheduling pass that also emits SASS) are attributed to the category consuming the larger share of their code.
| Purpose | Estimated Size | Share of .text |
|---|---|---|
| SASS instruction encoding/decoding | ~12 MB | 46% |
| Optimization passes + scheduling | ~5 MB | 19% |
| Peephole pattern matching + dispatch | ~3 MB | 12% |
| Frontend: parsing + validation | ~2 MB | 8% |
| ISel pattern matching + templates | ~1.5 MB | 6% |
| Infrastructure: allocator, hash, ELF, debug | ~1.5 MB | 6% |
| GPU ABI + calling convention | ~0.7 MB | 3% |
The single largest consumer of code space is SASS instruction encoding. Each SM architecture generation requires its own set of per-opcode encoding/decoding handler functions. With support for sm_75 through sm_121 (six major generations), this yields approximately 4,000 encoding handlers, each a standalone function averaging 1,400 bytes.
Methodology
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents how the reverse engineering of ptxas v13.0.88 was performed. It serves as a transparency record so readers can assess the confidence of any claim in this wiki, and as a practical guide for anyone who wants to reproduce or extend the analysis.
Scope and Scale
PTXAS is a 37.7 MB stripped x86-64 ELF binary with no debug symbols, no DWARF information, and no export table beyond 146 libc/libpthread PLT stubs. Unlike NVIDIA's cicc (which is an LLVM fork), ptxas contains no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, data structure, and encoding table is proprietary NVIDIA code. This makes the analysis harder than LLVM-derived binaries -- there is no upstream source to compare against.
| Metric | Value |
|---|---|
| Binary size | 37,741,528 bytes |
| Build string | cuda_13.0.r13.0/compiler.36424714_0 |
| Total functions detected | 40,185 |
| Functions decompiled | 39,881 (99.2%) |
| Strings extracted | 30,632 |
| Call graph edges | 548,693 |
| Cross-references | 7,427,044 |
| IDA comments recovered | 66,598 |
| IDA auto-names recovered | 16,019 |
| Control flow graphs exported | 80,078 |
| PLT imports | 146 (libc, libpthread, libm, libgcc) |
| Functions with 0 static callers | 15,907 (39.6%) -- vtable-dispatched |
| Functions < 100 bytes | 11,532 (28.7%) |
| Functions > 10 KB | 86 (0.2%) |
| Named functions (not sub_*) | 319 (0.8%) |
| Internal codenames | OCG (Optimizing Code Generator), Mercury (SASS encoder), Ori (IR) |
The 304 functions that Hex-Rays could not decompile are predominantly PLT stubs, computed-jump trampolines in the Flex DFA scanner, and the four mega-dispatch functions exceeding 200 KB (too large for Hex-Rays to handle within default limits). None are in critical analysis paths -- the dispatch functions are understood from their callee lists and the PLT stubs from their import names.
Why PTXAS Is Harder Than LLVM-Based Binaries
Reverse engineering cicc (NVIDIA's LLVM-based CUDA compiler) benefits from extensive prior art: LLVM's open-source codebase provides structural templates, pass names are registered in predictable patterns, and cl::opt strings directly name their global variables. PTXAS offers none of these advantages:
- No upstream source. Every function was identified from first principles -- string evidence, callgraph position, structural fingerprinting, or decompiled algorithm analysis. There is no reference implementation to compare against.
- ROT13 obfuscation. Internal names for tuning knobs and PTX opcode mnemonics are ROT13-encoded in the binary, requiring decoding before they become useful anchors.
- Vtable-heavy architecture. 39.6% of functions have zero static callers because they are dispatched through vtable pointers or function pointer tables. The call graph alone cannot reach them.
- Template-generated code. The SASS backend contains approximately 4,000 encoding handler functions generated from templates, each structurally near-identical. These dominate the function count but carry almost no unique identifying features.
- No pass registration infrastructure. LLVM passes register themselves via `PassInfo` objects with name strings. PTXAS phases are allocated by a factory switch (`sub_C60D30`), and their names are only visible through the `NamedPhases` registry and `AdvancedPhase*` timing strings -- far fewer anchors than LLVM's registration system.
Toolchain
All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. The entire effort is static analysis of the binary at rest -- no dynamic analysis (debugging, tracing, instrumentation) was used for function identification. Runtime tools (ptxas --stat, DUMPIR knob, --keep) were used only for validation and cross-referencing.
| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, vtable reconstruction |
| Hex-Rays decompiler | Pseudocode generation for 39,881 recovered functions |
| IDA Python scripting | Complete database extraction: all 8 JSON artifact exports |
| Custom Python script | analyze_ptxas.py: batch string, function, graph, xref, and decompilation export |
| ptxas CLI | --stat, --verbose, --compiler-stats, --fdevice-time-trace for runtime validation |
| ptxas DUMPIR knob | -knob DUMPIR=<phase> to dump IR at specific pipeline points |
| ROT13 decoder | Standard codecs.decode(s, "rot_13") for 2,000+ obfuscated knob/opcode names |
IDA Pro Setup and Initial Analysis
Loading the Binary
PTXAS is a dynamically-linked ELF with 146 PLT imports but no symbol table beyond those imports. IDA auto-analysis settings:
- Processor: Meta PC (x86-64)
- Analysis options: default. IDA correctly identifies the Flex DFA scanner tables, Bison parser tables, and the `.ctors`/`.dtors` sections.
- Auto-analysis time: approximately 8-10 minutes on a modern machine for the 37.7 MB binary.
- Compiler detection: IDA identifies GCC as the compiler. The binary uses the Itanium C++ ABI (confirmed by the embedded C++ name demangler at `sub_1CDC780`, 93 KB).
Post-Auto-Analysis Steps
After auto-analysis completes:
- Run string extraction. IDA's auto-analysis finds 30,632 strings. All are exported via the `analyze_ptxas.py` IDA Python script.
- Force function creation. Some address ranges, particularly the template-generated encoding handlers, are not automatically recognized as functions. IDA's "Create function" (P key) was applied selectively in the `0xD27000`--`0x1579000` range where encoding handler stubs are tightly packed.
- Batch decompile. The IDA Python script iterates all 40,185 detected functions and calls `ida_hexrays.decompile()` on each, saving per-function `.c` files. 39,881 succeeded; 304 failed (PLT stubs, computed-jump trampolines, and 4 mega-functions exceeding decompiler limits).
- Export control flow graphs. For each function, the script extracts the `FlowChart` (basic blocks, edges, per-instruction disassembly) as JSON. 80,078 graph files were produced.
Type Recovery
PTXAS uses no C++ RTTI (no typeid, no dynamic_cast -- the binary has no .data.rel.ro RTTI structures). Type recovery relies on:
- Vtable layout analysis. Each vtable is a contiguous array of function pointers in `.data.rel.ro` (4,256 bytes total). The phase vtable table at `off_22BD5C8` contains 159 entries, one per optimization phase class, each pointing to that phase's vtable.
- Structure offset patterns. The pool allocator struct has free-list bins at offset +2128 and a mutex at +7128. The thread-local context is a 280-byte struct accessed via `pthread_getspecific`. These offsets were recovered from the decompiled code of `sub_424070` (pool alloc, 3,809 callers) and `sub_4280C0` (TLS accessor, 3,928 callers).
- Parameter/return type propagation. Once a function's signature is established (e.g., `pool_alloc(pool*, size_t) -> void*`), Hex-Rays propagates types to all 3,809 call sites, improving decompilation quality throughout the binary.
String-Driven Analysis
Strings are the single most productive source of function identification in ptxas. Of the 30,632 strings extracted, several categories are particularly valuable.
ROT13-Encoded Knob Names (2,000+ entries)
PTXAS uses ROT13 encoding as a light obfuscation layer on internal configuration names. Two massive static constructors populate these tables at startup:
- `ctor_005` at `0x40D860` (80 KB) registers approximately 2,000 general OCG tuning knobs
- `ctor_007` at `0x421290` (8 KB) registers 98 Mercury scheduler knobs
Each entry pairs a ROT13-encoded name with a hex-encoded default value. Decoding examples:
| ROT13 in binary | Decoded name |
|---|---|
| `ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf` | `MercuryUseActiveThreadCollectiveInsts` |
| `ZrephelGenpxZhygvErnqfJneYngrapl` | `MercuryTrackMultiReadsWarLatency` |
| `ZrephelCerfhzrKoybpxJnvgOrarsvpvny` | `MercuryPresumeXblockWaitBeneficial` |
| `ZrephelZretrCebybthrOybpxf` | `MercuryMergePrologueBlocks` |
| `ZrephelTraFnffHPbqr` | `MercuryGenSassUCode` |
| `FpniVayvarRkcnafvba` | `ScavInlineExpansion` |
| `FpniQvfnoyrFcvyyvat` | `ScavDisableSpilling` |
The knob names directly reveal subsystem organization. Names prefixed with Mercury* belong to the SASS encoder. Names prefixed with Scav* belong to the register allocator's scavenger. Names like XBlockWait* and WarDeploy* belong to the instruction scheduler. The knob lookup function GetKnobIndex at sub_79B240 performs inline ROT13 decoding and case-insensitive comparison, which was itself identified by tracing the xrefs from the ROT13-encoded strings.
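The decoding itself is a one-liner with Python's built-in codec. A minimal sketch, using entries recovered from the constructor tables above (the helper name is ours):

```python
import codecs

def decode_knob_name(rot13_name: str) -> str:
    """Undo ptxas's ROT13 obfuscation on an internal knob/opcode name."""
    return codecs.decode(rot13_name, "rot_13")

# Entries recovered from the ctor_005/ctor_003 tables:
for s in ("ZrephelZretrCebybthrOybpxf", "FpniQvfnoyrFcvyyvat", "SZN"):
    print(decode_knob_name(s))
# MercuryMergePrologueBlocks
# ScavDisableSpilling
# FMA
```

Running this over all strings referenced by the three constructors recovers the full knob and opcode namespaces in one pass.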
ROT13-Encoded PTX Opcode Names (~900 entries)
A third static constructor, ctor_003 at 0x4095D0 (17 KB), populates a table of ~900 ROT13-encoded PTX opcode mnemonics. Decoding examples:
| ROT13 | Decoded |
|---|---|
| `NPDOHYX` | `ACQBULK` |
| `OFLAP` | `BSYNC` |
| `SZN` | `FMA` |
| `FRGC` | `SETP` |
| `ERGHEA` | `RETURN` |
| `RKVG` | `EXIT` |
These strings are used by the PTX parser to match instruction mnemonics. Each xref from one of these strings leads to a parser action or instruction validator function.
Timing and Phase Name Strings
The compilation driver at sub_446240 emits per-stage timing via format strings:
Parse-time : %.3f ms (%.2f%%)
CompileUnitSetup-time : %.3f ms (%.2f%%)
DAGgen-time : %.3f ms (%.2f%%)
OCG-time : %.3f ms (%.2f%%)
ELF-time : %.3f ms (%.2f%%)
DebugInfo-time : %.3f ms (%.2f%%)
PeakMemoryUsage = %.3lf KB
Tracing the xrefs from these format strings identifies the code that brackets each pipeline stage, revealing the stage boundaries within sub_446240.
The NamedPhases registry (string at 0x21B64C8, xrefs to sub_9F4040) and the AdvancedPhase* timing strings provide phase-level anchors within the 159-phase optimization pipeline:
- `AdvancedPhaseBeforeConvUnSup`, `AdvancedPhaseAfterConvUnSup`
- `AdvancedPhaseEarlyEnforceArgs`, `AdvancedPhaseLateConvUnSup`
- `AdvancedPhasePreSched`, `AdvancedPhaseAllocReg`, `AdvancedPhasePostSched`
- `AdvancedPhaseOriPhaseEncoding`, `AdvancedPhasePostFixUp`
- `GeneralOptimizeEarly`, `GeneralOptimize`, `GeneralOptimizeMid`, `GeneralOptimizeMid2`
- `GeneralOptimizeLate`, `GeneralOptimizeLate2`
- `OriPerformLiveDead`, `OriPerformLiveDeadFirst` through `OriPerformLiveDeadFourth`
Each AdvancedPhase* string xrefs to exactly one call site, which is a boundary marker in the phase pipeline. These 15 markers divide the 159-phase pipeline into named segments whose boundaries were used to identify the phases between each pair of markers.
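Extracting these anchors from the exported artifacts is mechanical. A sketch assuming the `ptxas_strings.json` schema documented below (`{value, xrefs:[{func, ...}]}`); the single-xref property is asserted rather than assumed:

```python
def phase_anchor_sites(strings):
    """Map each AdvancedPhase* marker string to its unique xref site.

    `strings` iterates records shaped like ptxas_strings.json entries:
    {"value": ..., "xrefs": [{"func": ...}, ...]}. Each marker string is
    expected to have exactly one call-site xref (a pipeline boundary).
    """
    anchors = {}
    for s in strings:
        if s["value"].startswith("AdvancedPhase"):
            assert len(s["xrefs"]) == 1, "marker with multiple xrefs: " + s["value"]
            anchors[s["value"]] = s["xrefs"][0]["func"]
    return anchors
```

The `GeneralOptimize*` and `OriPerformLiveDead*` names would need their own prefix filters; this sketch covers only the `AdvancedPhase*` family.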
Error and Diagnostic Strings
The central diagnostic emitter sub_42FBA0 (2,350 callers) prints error messages whose text reveals the calling function's purpose. Examples:
"Please use -knob DUMPIR=AllocateRegisters for debugging"-- identifies the register allocator failure path atsub_9714E0"SM does not support LDCU"-- identifies SM capability checking in the instruction legalizer"Invalid knob identifier","Invalid knob specified (%s)"-- identifies the knob parsing infrastructure aroundsub_79D070"fseek() error knobsfile %s","[knobs]"-- identifiesReadKnobsFileatsub_79D070
Source File Path
One recovered source path provides a structural anchor:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h
This string (at 0x202D4D8, 66 xrefs) is referenced from assertion checks throughout the knobs infrastructure, confirming that the knob system is a shared utility component (generic_knobs_impl.h) used across NVIDIA's compiler drivers.
Build and Version Strings
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
The version string at sub_612DE0 identifies both the exact build and the version reporting function. The Usage : string at 0x1CE3666 identifies the usage printer. The "\nCompile-unit with entry %s" string identifies the per-kernel compilation loop within the driver.
Vtable-Driven Discovery
The Phase Vtable Table
The most productive vtable discovery was the phase vtable table at off_22BD5C8 in .rodata. This is an array of 159 pointers, each pointing to a vtable for one optimization phase class. The phase factory function at sub_C60D30 is a 159-case switch statement that allocates a 16-byte phase object and assigns the corresponding vtable from this table:
// Simplified from decompiled sub_C60D30
switch (phase_index) {
case 0: obj->vtable = off_22BD5C8[0]; break;
case 1: obj->vtable = off_22BD5C8[1]; break;
...
case 158: obj->vtable = off_22BD5C8[158]; break;
}
return obj;
Each vtable contains pointers to the phase's virtual methods: slot 0 is `execute()` (the phase body), slot 1 is `isNoOp()` (returns whether the phase should be skipped), and slot 2 is `getName()` (returns the phase name string).
By following each of the 159 vtable entries to their execute() slot, every optimization phase's main function was identified. The getName() slot provided the phase name for phases that implement it. For phases that return a constant empty string, the name was inferred from the NamedPhases registry or from the AdvancedPhase* timing strings that bracket the phase in the pipeline.
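The walk itself is mechanical once the table address is known. A sketch of the logic, with `read_qword` standing in for IDA's `ida_bytes.get_qword` so it can run (and be tested) outside IDA; the slot layout is the one described above:

```python
def map_phase_methods(read_qword, table_addr, n_phases=159, ptr_size=8):
    """Walk a phase vtable table: entry i -> vtable -> method slots.

    `read_qword(addr)` abstracts IDA's ida_bytes.get_qword. Slot layout
    follows the recovered convention: slot 0 = execute(), slot 1 = isNoOp(),
    slot 2 = getName().
    """
    phases = []
    for i in range(n_phases):
        vtbl = read_qword(table_addr + i * ptr_size)
        phases.append({
            "index": i,
            "vtable": vtbl,
            "execute": read_qword(vtbl + 0 * ptr_size),
            "isNoOp": read_qword(vtbl + 1 * ptr_size),
            "getName": read_qword(vtbl + 2 * ptr_size),
        })
    return phases
```

Inside IDA, passing `ida_bytes.get_qword` as `read_qword` and `off_22BD5C8`'s address as `table_addr` yields the 159 `execute()` targets directly.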
Encoding Handler Vtables
The SASS backend uses vtable dispatch for instruction encoding. Each SASS opcode variant has its own encoding handler function, registered in dispatch tables rather than called directly. This explains why 15,907 functions (39.6%) have zero static callers -- they are reached exclusively through indirect calls via function pointer tables.
The encoding handler vtables were identified by their structural uniformity: every handler in the 0xD27000--0x1579000 range follows an identical template:
- Set opcode ID via bitfield insert into the instruction word at `a1+544`
- Load a 128-bit format descriptor from `.rodata` via SSE (`movaps xmm0, xmmword_XXXXXX`)
- Initialize a 10-slot register class map
- Register operand descriptors via `sub_7BD3C0`/`sub_7BD650`/`sub_7BE090`
- Finalize encoding via `sub_7BD260`
- Extract bitfields from the packed instruction word
The uniformity of this template allowed batch identification: once the template was recognized in a few handlers, the remaining ~4,000 were identified by structural matching alone.
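The batch match can be expressed directly over the exported function records. A sketch assuming the `ptxas_functions.json` schema (`{name, addr, callees[], ...}`); the marker sets mirror the template steps above:

```python
OPERAND_REGISTRARS = {"sub_7BD3C0", "sub_7BD650", "sub_7BE090"}

def is_encoding_handler(func):
    """Heuristic structural match for the SASS encoding-handler template.

    `func` is one record from ptxas_functions.json. A handler must call the
    bitfield-insert primitive, at least one operand registrar, and the
    encoding finalizer, and must lie in the encoder address range.
    """
    callees = set(func.get("callees", []))
    in_range = 0xD27000 <= func["addr"] < 0x1579000
    return bool(
        in_range
        and "sub_7B9B80" in callees          # bitfield insert
        and callees & OPERAND_REGISTRARS     # operand descriptors
        and "sub_7BD260" in callees          # encoding finalize
    )
```

Applying this predicate to all 40,185 records is how a few manually confirmed handlers generalize to the ~4,000 template instances.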
Peephole Optimizer Vtable
The PeepholeOptimizer class at 0x7A5D10 has a reconstructed vtable with 7 virtual methods:
| Slot | Method | Purpose |
|---|---|---|
| 0 | Init | Initialize peephole state for a compilation unit |
| 1 | RunOnFunction | Entry point for per-function peephole optimization |
| 2 | RunOnBB | Per-basic-block dispatch |
| 3 | RunPatterns | Standard pattern matching pass |
| 4 | SpecialPatterns | Architecture-specific pattern pass |
| 5 | ComplexPatterns | Multi-instruction pattern pass |
| 6 | SchedulingAwarePatterns | Schedule-preserving pattern pass |
The three peephole dispatch mega-functions (sub_143C440 at 233 KB, sub_18A2CA0 at 231 KB, sub_198BCD0 at 239 KB) each serve a different SM generation family and call 1,100--1,336 pattern matcher functions. These dispatchers were identified by their enormous callee counts and their position in the pipeline after instruction encoding.
Callgraph Analysis
The 548,693-edge call graph, exported from IDA, reveals the binary's module structure and function relationships. Several callgraph properties were systematically exploited.
Hub Function Identification
Functions with extreme callee or caller counts serve as structural anchors:
Top callees (hub functions -- "fan-out" nodes):
| Address | Name | Size | Callees | Role |
|---|---|---|---|---|
| `sub_169B190` | ISel master dispatch | 280 KB | 15,870 | The single largest function in the binary. Dispatches to all ISel pattern matchers. |
| `sub_143C440` | SM120 peephole dispatch | 233 KB | 13,425 | SM120 (RTX 50-series) peephole optimization |
| `sub_198BCD0` | Peephole dispatch (variant 2) | 239 KB | 13,391 | Peephole optimization for another SM family |
| `sub_18A2CA0` | Peephole dispatch (variant 1) | 231 KB | 12,974 | Peephole optimization for another SM family |
| `sub_BA9D00` | Bitvector/CFG analysis | 204 KB | 11,335 | Dataflow framework core |
Top callers (utility functions -- "fan-in" nodes):
| Address | Name | Size | Callers | Role |
|---|---|---|---|---|
| `sub_B28F30` | (unknown leaf) | 12 B | 31,399 | Tiny utility, likely a type tag or opcode check |
| `sub_10AE5C0` | (unknown leaf) | 60 B | 30,768 | Small encoding helper |
| `.sprintf` | libc sprintf | 6 B | 20,398 | String formatting (PLT stub) |
| `sub_7B9B80` | Bitfield insert | 216 B | 18,347 | Inserts bits into the 1280-bit instruction word |
| `sub_424070` | Pool allocator | 2,098 B | 3,809 | Custom memory allocator |
| `sub_4280C0` | TLS context accessor | 597 B | 3,928 | Thread-local storage via pthread_getspecific |
| `sub_42FBA0` | Diagnostic emitter | 2,388 B | 2,350 | Central error/warning reporter |
The fan-out nodes identify the mega-dispatch functions: ISel, peephole, and dataflow. The fan-in nodes identify the shared infrastructure layer: memory allocation, encoding primitives, string formatting, and error reporting.
Module Boundary Detection
The call graph reveals clear module boundaries. Functions in the 0x400000--0x67F000 range (PTX frontend) rarely call functions in 0xC52000--0x1CE3000 (SASS backend) directly, and vice versa. The optimizer region (0x67F000--0xC52000) bridges the two, calling into both the frontend (for IR construction) and the backend (for encoding).
The call graph was used to validate the three-subsystem decomposition:
| Call direction | Edge count | Interpretation |
|---|---|---|
| Frontend -> Frontend | ~8,000 | Internal frontend cohesion |
| Frontend -> Optimizer | ~1,200 | IR construction handoff |
| Optimizer -> Optimizer | ~15,000 | Phase-to-phase internal calls |
| Optimizer -> Backend | ~3,500 | Scheduling, encoding setup |
| Backend -> Backend | ~18,000 | Encoding handler internal calls |
| Backend -> Frontend | ~500 | Shared infrastructure (allocator, hash) |
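The tallies above can be recomputed from the exported call graph. A sketch assuming the `ptxas_callgraph.json` schema (`{from_addr, to_addr}` per edge) and the subsystem address ranges quoted above:

```python
# Subsystem address ranges as recovered on this page.
REGIONS = [
    ("frontend",  0x400000, 0x67F000),
    ("optimizer", 0x67F000, 0xC52000),
    ("backend",   0xC52000, 0x1CE3000),
]

def region_of(addr):
    """Classify an address into one of the three subsystems."""
    for name, lo, hi in REGIONS:
        if lo <= addr < hi:
            return name
    return "other"

def tally_cross_module(edges):
    """Count call edges by (caller region -> callee region).

    `edges` iterates records shaped like ptxas_callgraph.json entries.
    """
    counts = {}
    for e in edges:
        key = (region_of(e["from_addr"]), region_of(e["to_addr"]))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Note that only direct-call edges are counted; vtable-dispatched calls never appear in the export, so the true cross-module traffic is higher than these figures.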
Propagation from Known Functions
Once a high-confidence function is identified, its callees and callers gain contextual identity. The most productive propagation chains:
- `sub_446240` (real main, CERTAIN) -> calls stage entry points for Parse, DAGgen, OCG, ELF, DebugInfo. Each stage's entry point was identified by following the timing format string pattern.
- `sub_C62720` (PhaseManager constructor) -> allocates 159 phase objects via `sub_C60D30` (factory). The factory's 159 case targets are the phase constructors. Each constructor installs a vtable whose slot 0 points to the phase's `execute()` method.
- `sub_79B240` (GetKnobIndex) -> called from every function that reads a tuning knob. The first argument to `GetKnobIndex` is the ROT13-encoded knob name, so every call site reveals which knob a function checks.
- `sub_42FBA0` (diagnostic emitter) -> the format string argument at each of the 2,350 call sites reveals the error context. A call with `"Cannot take address of texture/surface variable (%s)"` identifies a PTX semantic checker.
Pattern Recognition
16-Byte Phase Objects
All 159 optimization phases share a uniform object layout:
Offset 0: vtable pointer (8 bytes) -- points to phase-specific vtable
Offset 8: phase data pointer or inline data (8 bytes)
The phase factory (sub_C60D30) allocates each phase as a 16-byte object from the pool allocator, sets the vtable pointer from the vtable table at off_22BD5C8, and returns the object. The PhaseManager stores these 159 objects in its internal array and iterates them to execute the pipeline.
Pool Allocator Usage Pattern
The custom pool allocator (sub_424070, 3,809 callers) is the dominant allocation mechanism. Its usage pattern is recognizable throughout the binary:
ptr = sub_424070(pool, size); // Allocate
if (!ptr) sub_42BDB0(); // Fatal OOM -- never returns
// ... use ptr ...
sub_4248B0(ptr); // Free (1,215 callers)
The OOM handler sub_42BDB0 (14 bytes, 3,825 callers) is a tiny wrapper that calls sub_42F590 (fatal internal error). Because every allocation site checks for failure and calls the same handler, the allocator usage pattern is a reliable structural marker. Finding sub_42BDB0 in a function's callee list confirms that function performs heap allocation.
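This marker can be applied mechanically to the exported function records. A sketch assuming the `ptxas_functions.json` schema (`{name, callees[], ...}`):

```python
POOL_ALLOC  = "sub_424070"   # pool allocator
OOM_HANDLER = "sub_42BDB0"   # fatal OOM wrapper, present at every allocation site

def allocating_functions(functions):
    """Yield names of functions matching the pool-allocator usage pattern.

    `functions` iterates records from ptxas_functions.json. Calling both
    the pool allocator and the OOM handler is the structural marker
    described above.
    """
    for f in functions:
        callees = set(f.get("callees", []))
        if POOL_ALLOC in callees and OOM_HANDLER in callees:
            yield f["name"]
```

Because `sub_42BDB0` has 3,825 callers against the allocator's 3,809, nearly every allocation site matches this predicate.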
SASS Encoding Handler Template
Every encoding handler in the backend follows a rigid 6-step template (described in the vtable section above). The key identification markers:
- Calls to `sub_7B9B80` (bitfield insert, 18,347 callers)
- SSE `movaps` loading a 128-bit constant from `.rodata`
- Calls to `sub_7BD3C0`, `sub_7BD650`, or `sub_7BE090` (operand registrars)
- Final call to `sub_7BD260` (encoding finalize)
Any function matching this pattern is a SASS encoding handler. This template recognition identified approximately 4,000 handlers spanning 6 SM architecture generations.
Hash Map Infrastructure Pattern
The MurmurHash3-based hash map infrastructure (sub_426150 insert, sub_426D60 lookup, sub_427630 MurmurHash3) appears throughout the binary with a consistent usage pattern:
map = sub_425CA0(hash_fn, cmp_fn, initial_capacity); // Create
sub_426150(map, key, value); // Insert (2,800 callers)
result = sub_426D60(map, key); // Lookup (422 callers)
sub_425D20(map); // Destroy
The MurmurHash3 constants (0xcc9e2d51, 0x1b873593) in sub_427630 confirmed the hash algorithm. The hash map supports three modes (custom function pointers, pointer hash, integer hash) selected by flags at struct offset 84.
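The fingerprint is easy to verify independently: the canonical MurmurHash3 x86 32-bit routine, written out below from the published reference algorithm (not code lifted from `sub_427630`), is built around exactly those two multipliers:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Canonical MurmurHash3 x86 32-bit (public reference algorithm)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593   # the constants recovered in sub_427630
    h = seed & 0xFFFFFFFF
    n = len(data)
    for i in range(0, n - n % 4, 4):                    # 4-byte body chunks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF        # rotl 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF        # rotl 13
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[n - n % 4:]                             # 1--3 trailing bytes
    if tail:
        k = 0
        for b in reversed(tail):
            k = (k << 8) | b
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= n                                              # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Matching the decompiled `sub_427630` against this reference structure (body rotation/multiply sequence, tail handling, finalization mixers `0x85EBCA6B`/`0xC2B2AE35`) is what confirmed the identification, beyond the two headline constants alone.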
Data Artifacts
The complete IDA database was exported via analyze_ptxas.py into 8 JSON artifacts. These artifacts are the foundation for all subsequent analysis.
| Artifact | File | Size | Entries | Schema |
|---|---|---|---|---|
| Functions | ptxas_functions.json | 92 MB | 40,185 | {addr, end, name, size, insn_count, is_library, is_thunk, callers[], callees[]} |
| Strings | ptxas_strings.json | 4.8 MB | 30,632 | {addr, value, type, xrefs[{from, func, type}]} |
| Call graph | ptxas_callgraph.json | 64 MB | 548,693 | {from, from_addr, to, to_addr} -- one edge per call site |
| Cross-references | ptxas_xrefs.json | 978 MB | 7,427,044 | Complete xref database (code, data, string references) |
| Comments | ptxas_comments.json | 5.9 MB | 66,598 | {addr, type, text} -- IDA auto-comments and analyst annotations |
| Names | ptxas_names.json | 972 KB | 16,019 | {addr, name} -- IDA auto-generated and analyst-assigned names |
| Imports | ptxas_imports.json | 17 KB | 146 | {module, name, addr, ordinal} -- PLT import stubs |
| Segments | ptxas_segments.json | 3 KB | 24 | {name, start, end, size, type, perm} -- ELF segment map |
Total artifact storage: 1.14 GB (dominated by the 978 MB xref database).
What Each Artifact Reveals
Functions (ptxas_functions.json): The master index. Every function's address, size, instruction count, caller list, and callee list. The caller/callee lists are the basis for callgraph analysis. The is_thunk flag identifies PLT stubs (exclude from analysis). The is_library flag identifies functions IDA tagged as library code (CRT startup, jemalloc-like allocator internals).
Strings (ptxas_strings.json): The primary identification tool. Each string's xref list shows which functions reference it. Searching for "AdvancedPhase" returns 15 strings, each xref pointing to a pipeline boundary in the PhaseManager. Searching for strings starting with "Z" (ROT13 "M" for "Mercury") returns the Mercury subsystem's knob names. The 2,035 hex-encoded default value strings ("0k..." / "0x...") are paired 1:1 with knob name strings in the constructors.
Call graph (ptxas_callgraph.json): The structural backbone. Each edge records a direct call from one function to another. Indirect calls (vtable dispatch, function pointer callbacks) are not captured, which is the primary limitation -- the 15,907 zero-caller functions are almost all vtable-dispatched. The call graph is used for module boundary detection, propagation from known functions, and entry/exit point analysis.
Cross-references (ptxas_xrefs.json): The most comprehensive artifact. Contains all code-to-code, code-to-data, and data-to-data references detected by IDA. At 7.4 million entries, it is too large to load into memory on machines with less than 16 GB RAM. Used for deep analysis of specific functions: finding all references to a particular .rodata constant, tracing data flow through global variables, and identifying vtable consumers.
Comments (ptxas_comments.json): IDA's auto-generated comments (e.g., "File format: \\x7FELF") plus analyst-added annotations. The auto-comments on function prologues identify calling conventions and stack frame layouts. Analyst comments record identification rationale for reviewed functions.
Names (ptxas_names.json): IDA's auto-generated names for data and code addresses. Of 16,019 entries, approximately 9,670 are auto-generated string reference names (aLib64LdLinuxX8, aGnu, etc.) and ~6,349 are analyst-assigned or IDA-recovered names (PLT stubs, constructors, etc.). These names appear in the callgraph edges as from/to identifiers.
Imports (ptxas_imports.json): The 146 PLT imports. Key imports include pthread_* (13 functions), malloc/free/realloc, _setjmp/longjmp (used by the error recovery system), select/fcntl (used by the GNU Make jobserver client), and clock (used by the timing infrastructure).
Segments (ptxas_segments.json): The 24 ELF segments/sections. Used to establish the address space layout and map code/data boundaries. The .ctors section (104 bytes, 12 entries) is particularly important -- it lists the static constructors that initialize the ROT13 tables and the knob registry.
The 30-Region Sweep Approach
The primary analysis was conducted as a systematic address-range sweep of the entire .text section, divided into 30 contiguous regions. Each region was analyzed independently in a single session, producing a raw sweep report. The 40 report files (including sub-region splits) total 34,880 lines of working notes.
Region Partitioning
The .text section (0x403520--0x1CE2DE2, 26.2 MB) was divided into approximately 870 KB regions. The partitioning was not arbitrary -- region boundaries were chosen to align with subsystem boundaries where possible, so that each sweep report covers a coherent functional area.
| Report | Address Range | Size | Functions | Subsystem |
|---|---|---|---|---|
| p1.01 | 0x400000--0x4D5000 | 853 KB | 1,383 | Runtime infra + CLI + PTX validators |
| p1.02 | 0x4D5000--0x5AA000 | 853 KB | 581 | PTX text generation (580 formatters) |
| p1.03 | 0x5AA000--0x67F000 | 853 KB | 628 | Intrinsics + SM profiles |
| p1.04 | 0x67F000--0x754000 | 469 KB | ~500 | Mercury core + scheduling engine |
| p1.05 | 0x754000--0x829000 | 853 KB | 1,545 | Knobs + peephole optimizer class |
| p1.06 | 0x829000--0x8FE000 | 853 KB | 1,069 | Debug tables + scheduler + HW profiles |
| p1.07 | 0x8FE000--0x9D3000 | 853 KB | 1,090 | Register allocator (fatpoint) |
| p1.08 | 0x9D3000--0xAA8000 | 853 KB | 1,218 | Post-RA pipeline + NamedPhases |
| p1.09 | 0xAA8000--0xB7D000 | 853 KB | 4,493 | GMMA/WGMMA + ISel + emission |
| p1.10 | 0xB7D000--0xC52000 | 853 KB | 1,086 | CFG analysis + bitvectors |
| p1.11 | 0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager + phase factory |
| p1.12 | 0xD27000--0xDFC000 | 853 KB | 592 | SM100 SASS encoders (set 1) |
| p1.13 | 0xDFC000--0xED1000 | 853 KB | 591 | SM100 SASS encoders (set 2) + decoders |
| p1.14 | 0xED1000--0xFA6000 | 853 KB | 683 | SM100 SASS encoders (set 3) |
| p1.15 | 0xFA6000--0x107B000 | 853 KB | 678 | SM100 SASS encoders (set 4) |
| p1.16 | 0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec + 2,095 bitfield accessors |
| p1.17 | 0x1150000--0x1225000 | 853 KB | 733 | SM89/90 codec (decoders + encoders) |
| p1.18 | 0x1225000--0x12FA000 | 853 KB | 1,552 | Reg-pressure scheduling + ISel + encoders |
| p1.19 | 0x12FA000--0x13CF000 | 853 KB | 1,282 | Operand legalization + peephole |
| p1.20 | 0x13CF000--0x14A4000 | 853 KB | 1,219 | SM120 peephole pipeline |
| p1.21 | 0x14A4000--0x1579000 | 853 KB | 606 | Blackwell ISA encode/decode |
| p1.22 | 0x1579000--0x164E000 | 853 KB | 1,324 | Encoding + peephole matchers |
| p1.23 | 0x164E000--0x1723000 | 853 KB | 899 | ISel pattern matching core |
| p1.24 | 0x1723000--0x17F8000 | 853 KB | 631 | ISA description database |
| p1.25 | 0x17F8000--0x18CD000 | 853 KB | 1,460 | SASS printer + peephole dispatch |
| p1.26 | 0x18CD000--0x19A2000 | 853 KB | 1,598 | Scheduling + peephole dispatchers |
| p1.27 | 0x19A2000--0x1A77000 | 853 KB | 1,393 | GPU ABI + SM89/90 encoders |
| p1.28 | 0x1A77000--0x1B4C000 | 853 KB | 1,518 | SASS emission backend |
| p1.29 | 0x1B4C000--0x1C21000 | 853 KB | 1,974 | SASS emission + format descriptors |
| p1.30 | 0x1C21000--0x1CE3000 | 780 KB | 1,628 | ELF emitter + infra library layer |
Several regions were further split into sub-reports (p1.04a/b, p1.05a/b, p1.06a/b, p1.07a/b, p1.08a/b) when the initial analysis revealed that a region contained multiple distinct subsystems requiring separate treatment.
Sweep Report Structure
Each sweep report follows a consistent format:
================================================================================
P1.XX SWEEP: Functions in address range 0xAAAA000 - 0xBBBB000
================================================================================
Range: 0xAAAA000 - 0xBBBB000
Files found: NNN decompiled .c files (of which ~MMM are > 1KB)
Total decompiled size: X,XXX,XXX bytes
Functions in range (from DB): NNN
Named functions: NNN (or 0 if all are sub_XXXXXX)
Functions with identified callers: NNN
CONTEXT: [1-paragraph summary of the region's purpose]
================================================================================
SECTION 1: [Subsystem name]
================================================================================
### 0xAAAAAA -- sub_AAAAAA (NNNN bytes / NNN lines)
**Identity**: [Function identification]
**Confidence**: [CERTAIN / HIGH / MEDIUM]
**Evidence**:
- [String evidence]
- [Structural evidence]
- [Callgraph evidence]
**Key code**:
[Relevant decompiled excerpts]
**Note**: [Additional observations]
Each function entry records the address, size, decompiled line count, proposed identity, confidence level, evidence citations, and key code excerpts. The reports are raw working notes -- they contain false starts, corrections, and evolving hypotheses that were resolved as more context became available.
Analysis Ordering
The sweep was not performed in address order. The analysis followed an information-maximizing sequence:
- p1.01 (infrastructure + CLI) first -- establishes the allocator, hash map, TLS, and diagnostic patterns that appear throughout the binary.
- p1.11 (PhaseManager) second -- identifies all 159 phases and their vtable entries, providing the skeleton of the optimization pipeline.
- p1.07 (register allocator) and p1.06 (scheduler) third -- these are the highest-complexity subsystems with the richest string evidence.
- p1.12--p1.15 (SASS encoders) in batch -- once the encoding template was recognized, all encoder regions were swept rapidly with template matching.
- p1.30 (library layer) late -- identifies shared infrastructure (ELF emitter, demangler, thread pool) referenced by earlier regions.
- Remaining regions filled in by decreasing information density.
Cross-Referencing with PTXAS CLI
Several ptxas command-line features and internal mechanisms provide runtime validation of static analysis findings.
--stat and --verbose
Running ptxas --stat input.ptx prints per-kernel resource usage (register count, shared memory, stack frame size). This output is generated by sub_A3A7E0 (the IR statistics printer), which was identified from the format strings:
ptxas info : Used %d registers, %d bytes smem, %d bytes cmem[0]
Comparing the --stat output against the decompiled statistics printer confirms the register counting and resource tracking logic.
--compiler-stats
Enables the timing output (Parse-time, DAGgen-time, OCG-time, etc.) from sub_446240. This confirms the pipeline stage ordering and the stage boundary functions identified by string xrefs.
--fdevice-time-trace
Generates Chrome trace JSON output showing per-phase timing. The trace parser at sub_439880 and the ftracePhaseAfter string at 0x1CE383F confirm the per-phase instrumentation infrastructure. The trace output lists phase names that can be cross-referenced against the 159-entry phase table.
DUMPIR Knob
The internal DUMPIR knob (accessed via -knob DUMPIR=<phase_name>) dumps the Ori IR at specified pipeline points. The string "Please use -knob DUMPIR=AllocateRegisters for debugging" at 0x21EFBD0 confirms this mechanism. The NamedPhases registry at sub_9F4040 maps phase names to pipeline positions. Available DUMPIR points include:
- `OriPerformLiveDead`, `OriPerformLiveDeadFirst` through `OriPerformLiveDeadFourth`
- `AllocateRegisters` (the register allocation phase)
- `swap1` through `swap6` (swap elimination phases)
- `shuffle` (instruction scheduling)
The DUMPIR output format reveals the IR structure: basic block headers, instruction opcodes, register names (R0--R255, UR0--UR63, P0--P7, UP0--UP7), and operand encodings. This runtime output was used to validate the IR format reconstructed from static analysis.
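The register namespaces seen in DUMPIR output can be checked mechanically when parsing dumps. A small sketch (the regex and bounds come from the register names listed above; the helper name is ours):

```python
import re

# Register namespaces observed in DUMPIR output.
REG_BOUNDS = {
    "R":  256,   # general registers R0--R255
    "UR": 64,    # uniform registers UR0--UR63
    "P":  8,     # predicates P0--P7
    "UP": 8,     # uniform predicates UP0--UP7
}

def classify_register(tok):
    """Return (class, index) for a valid register token, else None."""
    m = re.fullmatch(r"(UR|UP|R|P)(\d+)", tok)
    if not m:
        return None
    cls, idx = m.group(1), int(m.group(2))
    return (cls, idx) if idx < REG_BOUNDS[cls] else None
```

Validating every operand token this way while parsing a dump is a cheap consistency check on the reconstructed IR format.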
--keep Flag
The --keep flag preserves intermediate files. While ptxas does not emit intermediate text files in the same way as nvcc, the --keep behavior in the overall CUDA compilation pipeline (nvcc -> cicc -> ptxas) allows inspecting the PTX input that reaches ptxas, confirming the PTX grammar and instruction format expectations.
Confidence Levels
Every function identification in this wiki carries one of three confidence levels:
| Level | Meaning | Basis |
|---|---|---|
| CERTAIN | Identity established beyond reasonable doubt | Direct string evidence naming the function, or the function is a PLT import with a known name |
| HIGH | Strong identification (>90%) | Multiple corroborating indicators: string xrefs, callgraph position, structural fingerprint, decompiled algorithm match |
| MEDIUM | Probable identification (70--90%) | Single indicator (vtable position, size fingerprint, callgraph context) or inferred from surrounding identified functions |
The distribution across the ~200 key identified functions in the Function Map:
- CERTAIN: ~30 functions (PLT imports, `main`, functions with unique identifying strings)
- HIGH: ~130 functions (string evidence + structural confirmation)
- MEDIUM: ~40 functions (inferred from callgraph context or structural similarity)
The remaining ~39,985 functions are either unidentified (template-generated encoding handlers, small utility stubs) or identified at subsystem level only (e.g., "this is an SM100 SASS encoding handler" without knowing which specific opcode it encodes).
Reproducing the Analysis
To reproduce this analysis from scratch:

1. **Obtain the binary.** Install CUDA Toolkit 13.0. The binary is at `<cuda>/bin/ptxas`. Verify: `ptxas --version` should report `V13.0.88` and the binary should be 37,741,528 bytes. Build string: `cuda_13.0.r13.0/compiler.36424714_0`.
2. **Run IDA auto-analysis.** Open ptxas in IDA Pro 8.x with default x86-64 settings. Allow auto-analysis to complete (8-10 minutes). Accept GCC as the detected compiler.
3. **Run the extraction script.** Load `analyze_ptxas.py` in IDA's Python console. The script exports all 8 JSON artifacts plus per-function decompiled C files, disassembly files, and control flow graph JSON files. Expected runtime: 4-8 hours for the full export (the xref export dominates).
4. **Decode ROT13 strings.** Apply `codecs.decode(s, "rot_13")` to all strings in the knob constructors (`ctor_003`, `ctor_005`, `ctor_007`). This decodes ~3,000 obfuscated names into readable English identifiers.
5. **Identify anchor functions.** Start with the highest-confidence identifications:
   - `main` at `0x409460` (named in symbol table)
   - `sub_446240` (real main -- called from `main`, contains timing format strings)
   - `sub_C60D30` (phase factory -- 159-case switch)
   - `sub_C62720` (PhaseManager constructor -- references phase vtable table)
   - `sub_79B240` (GetKnobIndex -- inline ROT13 decoding)
   - `sub_42FBA0` (diagnostic emitter -- 2,350 callers, severity dispatch)
6. **Sweep the address space.** Work through the `.text` section in regions of ~870 KB. For each region:
   - Count functions and decompiled file sizes
   - Identify string anchors (search for region-specific strings)
   - Classify functions by structural template (encoding handler, phase body, utility, etc.)
   - Propagate identities from known callers/callees
   - Record findings in the sweep report format
7. **Cross-reference with runtime.** Compile a simple CUDA kernel and run `ptxas --stat --verbose --compiler-stats` to observe runtime behavior. Use `-knob DUMPIR=<phase>` to dump IR at specific pipeline points. Compare the dumped IR format against the IR structure reconstructed from decompiled code.
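The ROT13 decoding step is mechanical; a minimal sketch is below. The encoded example string is constructed here for illustration (by encoding a knob name documented in this wiki), not extracted from the binary:

```python
import codecs

def decode_knob(obfuscated: str) -> str:
    """ROT13-decode an obfuscated knob name (letters shift; digits pass through)."""
    return codecs.decode(obfuscated, "rot_13")

# ROT13 is its own inverse, so encoding a known decoded name and
# decoding it again round-trips exactly.
encoded = codecs.encode("MercuryUseActiveThreadCollectiveInsts", "rot_13")
print(encoded)               # ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf
print(decode_knob(encoded))  # MercuryUseActiveThreadCollectiveInsts
```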
Dependencies
The extraction script (analyze_ptxas.py) requires IDA Pro 8.x with Hex-Rays decompiler and Python 3.x. No external Python packages are needed -- only the IDA Python API (idautils, idc, idaapi, ida_bytes, ida_funcs, ida_segment, ida_nalt, ida_gdl, ida_hexrays).
Post-export analysis requires only the Python 3.8+ standard library (json, codecs, collections).
Debug Infrastructure: bugspec.txt
ptxas contains an internal fault injection framework that deliberately corrupts the Mercury IR to test compiler verification passes. The mechanism is entirely file-driven: if a file named ./bugspec.txt exists in the current working directory when ptxas runs, the function sub_A83AC0 reads it and injects controlled mutations into the post-register-allocation instruction stream. No CLI flag activates this -- file presence alone is sufficient. If the file is absent, a diagnostic is printed to stdout (Cannot open file with bug specification) and compilation proceeds normally.
File Format
The file contains a single line of six integers:
COUNT0,COUNT1,COUNT2,COUNT3 COUNT4 COUNT5
The first four are comma-separated; then a space; then two space-separated values. Each integer specifies the number of faults to inject for that bug category. Zero or negative disables the category.
| Field | Variable | Category | Target |
|---|---|---|---|
| COUNT0 | v78 | Register bugs | General (R) and uniform (UR) register operands |
| COUNT1 | v79 | Predicate bugs | Predicated instruction operands |
| COUNT2 | v80 | Offset/spill bugs | Memory offsets in spill/refill instructions |
| COUNT3 | v81 | Remat bugs | Rematerialized value operands |
| COUNT4 | v82 | R2P/P2R bugs | Register-to-predicate conversion instructions |
| COUNT5 | v83 | Bit-spill bugs | Bit-level spill storage operands |
Example: 3,2,1,0 0 1 injects 3 register bugs, 2 predicate bugs, 1 offset bug, and 1 bit-spill bug.
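The format is simple enough to sketch a parser for. The dictionary keys below are descriptive labels taken from the category table above, not names used in the binary:

```python
def parse_bugspec(line: str) -> dict:
    """Parse the single bugspec.txt line: 'C0,C1,C2,C3 C4 C5'.

    Zero or negative counts disable a category, mirroring the
    behavior described above, so negatives are clamped to zero.
    """
    head, c4, c5 = line.split()
    counts = [int(x) for x in head.split(",")] + [int(c4), int(c5)]
    names = ["register", "predicate", "offset_spill",
             "remat", "r2p_p2r", "bit_spill"]
    return {name: max(0, n) for name, n in zip(names, counts)}

print(parse_bugspec("3,2,1,0 0 1"))
# {'register': 3, 'predicate': 2, 'offset_spill': 1, 'remat': 0, 'r2p_p2r': 0, 'bit_spill': 1}
```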
Bug Kind String Table
Each injected fault record carries a kind code (1--10) mapped to a string table at 0x21F0500:
| Kind | String | Meaning |
|---|---|---|
| 1 | r-ur register | General or uniform register replaced with wrong register |
| 2 | p-up register | Predicate or uniform predicate register corrupted |
| 3 | any reg | Any register class operand corrupted |
| 4 | offset | Memory offset shifted by +16 bytes |
| 5 | regular bug | Generic operand value replacement |
| 6 | predicated bug | Predicate source operand corrupted |
| 7 | remat bug | Rematerialization value corrupted |
| 8 | spill-regill bug | Spill or refill path value corrupted |
| 9 | r2p-p2r bug | Register-predicate conversion operand corrupted |
| 10 | bit-spill bug | Bit-level spill storage operand corrupted |
Injection Algorithm
The injection proceeds in four phases:
1. Candidate collection. The function walks the Mercury IR instruction linked list (from context[0]+272). For each instruction, it checks which bug categories are active and whether the instruction qualifies:
   - Register bugs (field0): Scans operands for type-tag 1 (register) with register class 6 (general) or 3 (predicate), excluding opcodes 41--44. Eligible instructions are collected into a candidate list.
   - Predicate bugs (field1): Checks flag byte at instruction+73 for bit 0x10 (predicated). Eligible instructions are collected separately.
   - Offset/spill bugs (field2): Calls `sub_A56DE0`/`sub_A56CE0` against the register allocator state (context[133]) to identify spill/refill instructions.
   - Remat bugs (field3): Queries the rematerialization hash table (context+21 via `sub_A54200`) for instructions with remat entries.
   - R2P/P2R bugs (field4): Checks instruction opcode (offset +72) for values 268, 155, 267, 173 (the R2P and P2R conversion opcodes, with bit-masked variants).
   - Bit-spill bugs (field5): Checks operand count > 2, flag bit 0x10 at offset +28, and calls `sub_A53DB0`/`sub_A53C40`/`sub_A56880` for bit-spill eligibility.
2. Random selection. Seeds the RNG with time(0) via srand(). For each active category, sub_A83490 randomly selects N instruction indices from the candidate list, where N is the count from bugspec.txt. The selector uses FNV-1a hashing on instruction addresses for collision avoidance, re-rolling duplicates.
3. Mutation application. For register and predicate categories, sub_A5EC40 iterates over selected instructions and calls sub_A5E9E0, which finds the last register operand, allocates a new register of the same class via sub_91BF30, and replaces the operand value. For offset bugs, the mutation adds +16 to the signed 24-bit offset field directly: *operand = (sign_extend_24(*operand) + 16) & 0xFFFFFF | (*operand & 0xFF000000).
4. Reporting. Prints to stdout:
Num forced bugs N
Created a bug at index I : kind K inst # ID [OFF] in operand OP correct val V replaced with W
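The offset mutation in phase 3 is self-contained bit arithmetic; a sketch reproducing the decompiled expression quoted above:

```python
def sign_extend_24(v: int) -> int:
    """Interpret the low 24 bits of v as a signed two's-complement value."""
    v &= 0xFFFFFF
    return v - 0x1000000 if v & 0x800000 else v

def mutate_offset(operand: int) -> int:
    """Offset-bug mutation: add +16 to the signed 24-bit offset field while
    preserving the top byte of the 32-bit operand word, matching
    (sign_extend_24(op) + 16) & 0xFFFFFF | (op & 0xFF000000)."""
    return ((sign_extend_24(operand) + 16) & 0xFFFFFF) | (operand & 0xFF000000)

print(hex(mutate_offset(0xAB000010)))  # 0xab000020 -- top byte untouched
print(hex(mutate_offset(0x00FFFFF0)))  # 0x0 -- offset -16 wraps forward to 0
```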
Fault Record Structure (40 bytes)
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Kind (1--10) |
| +8 | 8 | Pointer to Mercury instruction node |
| +16 | 4 | Operand index within instruction |
| +20 | 4 | Original operand value |
| +24 | 4 | Replacement operand value |
| +28 | 4 | Selection index (position in candidate list) |
| +32 | 4 | Instruction ID (from instruction+16) |
Records are stored in a dynamic array at context[135].
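The 40-byte layout can be expressed as a `struct` format string. Treating the 4 bytes at +4 and the trailing 4 bytes as padding is an assumption (alignment for the pointer at +8, and array stride, respectively); the documented table does not name those bytes:

```python
import struct

# kind(i) pad(4x) inst_ptr(Q) op_idx/orig_val/repl_val/sel_idx/inst_id(5i) pad(4x)
FAULT_RECORD = struct.Struct("<i4xQ5i4x")
assert FAULT_RECORD.size == 40  # matches the documented record size

def parse_fault_record(buf: bytes) -> dict:
    """Decode one 40-byte fault record into named fields."""
    kind, inst_ptr, op_idx, orig, repl, sel_idx, inst_id = FAULT_RECORD.unpack(buf)
    return {"kind": kind, "inst_ptr": inst_ptr, "operand_index": op_idx,
            "original_value": orig, "replacement_value": repl,
            "selection_index": sel_idx, "instruction_id": inst_id}

# Round-trip a synthetic record (all values are illustrative).
raw = struct.pack("<i4xQ5i4x", 4, 0xDEADBEEF, 2, 16, 32, 0, 7)
print(parse_fault_record(raw)["kind"])  # 4
```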
Function Map
| Address | Function | Role | Confidence |
|---|---|---|---|
| 0xA83AC0 | sub_A83AC0 | bugspec.txt reader and injection coordinator | CERTAIN (string: `./bugspec.txt`) |
| 0xA83490 | sub_A83490 | Random index selector with FNV-1a dedup | HIGH |
| 0xA5E9E0 | sub_A5E9E0 | Register operand mutation (allocates new register) | HIGH |
| 0xA5EC40 | sub_A5EC40 | Batch mutation applicator (iterates selected instructions) | HIGH |
| 0xA832D0 | sub_A832D0 | Hash table resize for dedup tracking | MEDIUM |
Significance
This is NVIDIA's internal compiler testing infrastructure for stochastic fault injection. It targets specific vulnerability surfaces in the register allocator and post-allocation pipeline: wrong-register assignments, address calculation errors, predicate propagation failures, rematerialization correctness, spill code integrity, and register-predicate conversion accuracy. The time(0)-seeded RNG produces different fault patterns on each run for the same bugspec.txt, enabling randomized stress testing of verification passes.
Embedded C++ Name Demangler
PTXAS statically embeds an Itanium ABI C++ name demangler rather than linking libc++abi or libstdc++. The demangler is a self-contained 41-function cluster spanning 0x1CD8B00--0x1CE1E60 in .text, with a single external entry point. The core recursive-descent parser at sub_1CDC780 (93 KB decompiled, 3,442 lines) handles the full Itanium mangling grammar: nested names, template arguments, substitutions, function types, and special names.
API and Integration
The public-facing function is sub_1CE23F0, whose signature matches __cxa_demangle exactly: it takes a mangled name string, an optional output buffer with length pointer, and a status pointer; it returns a malloc-allocated demangled string or NULL with a status code (-1 = memory failure, -3 = invalid arguments). The only caller of this function is the embedded terminate handler at sub_1CD7850, which prints the standard "terminate called after throwing an instance of '...'" diagnostic to stderr, demangling the exception type name before display.
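Because the embedded entry point matches `__cxa_demangle`, its contract can be illustrated by calling the same API from the system C++ runtime via ctypes. This is a Linux-specific sketch that assumes `libstdc++.so.6` is loadable; the point is the signature and ownership rules (malloc-allocated result, status out-parameter), not the embedded implementation:

```python
import ctypes

libstdcxx = ctypes.CDLL("libstdc++.so.6")
libc = ctypes.CDLL(None)  # main program namespace; provides free() on Linux

# char* __cxa_demangle(const char* mangled, char* buf, size_t* len, int* status)
demangle_fn = getattr(libstdcxx, "__cxa_demangle")
demangle_fn.restype = ctypes.c_void_p
demangle_fn.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                        ctypes.POINTER(ctypes.c_size_t),
                        ctypes.POINTER(ctypes.c_int)]
libc.free.argtypes = [ctypes.c_void_p]

def demangle(name: str):
    """Return the demangled form, or None on failure (status != 0)."""
    status = ctypes.c_int(0)
    ptr = demangle_fn(name.encode(), None, None, ctypes.byref(status))
    if status.value != 0 or not ptr:
        return None
    try:
        return ctypes.string_at(ptr).decode()
    finally:
        libc.free(ptr)  # result is malloc-allocated, as in ptxas's embedded copy

print(demangle("_Z3foov"))  # foo()
```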
Why Embedded
PTXAS imports only libc, libpthread, libm, and libgcc_s (146 PLT stubs total). It has no dependency on any C++ runtime library. The only C++ ABI symbol in the PLT is __cxa_atexit (at 0x401989), used to register the terminate handler. By embedding the demangler and terminate handler directly, NVIDIA avoids a runtime dependency on libstdc++ or libc++abi, which would otherwise be required solely for exception type name display in fatal error messages. This is consistent with the binary's overall strategy of minimizing external dependencies.
Function Map
| Address | Function | Size | Role | Confidence |
|---|---|---|---|---|
| 0x1CDC780 | sub_1CDC780 | 93 KB | Demangler core (recursive-descent parser): parses Itanium-mangled names via large switch dispatch | HIGH (size, structure, callgraph isolation) |
| 0x1CE0600 | sub_1CE0600 | 580 B | Recursive dispatch wrapper: re-enters the parser for nested name components (76 call sites from core) | HIGH (mutual recursion with sub_1CDC780) |
| 0x1CE23F0 | sub_1CE23F0 | 340 B | __cxa_demangle-compatible API: mangled string in, demangled string out, malloc-allocated | CERTAIN (API shape, status codes, free/memcpy/strlen callees) |
| 0x1CE1E60 | sub_1CE1E60 | ~200 B | Parse entry point: initializes parse state and invokes the core | HIGH (bridge between API and parser) |
| 0x1CD7850 | sub_1CD7850 | 280 B | Terminate handler (__cxa_terminate): prints "terminate called after throwing..." to stderr | CERTAIN (string: "terminate called after throwing an instance of '") |
Version Update Procedure
All addresses, function counts, and structural offsets in this wiki are specific to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0, 37,741,528 bytes). When a new CUDA toolkit ships a different ptxas binary, the wiki must be updated. This section documents the procedure.
Version-Stable vs Version-Fragile Findings
Not everything changes between versions. Understanding what is stable dramatically reduces update effort.
Version-stable (survives across minor and most major releases unchanged):
| Category | Examples | Why stable |
|---|---|---|
| Algorithm logic | Copy propagation worklist walk, fatpoint pressure computation, MurmurHash3 constants | Algorithms are rarely rewritten between releases |
| Data structure layouts | Pool allocator bins at +2128, Mercury instruction node at 112 bytes, 16-byte phase objects | Struct layouts change only when fields are added or reordered |
| Knob names | MercuryUseActiveThreadCollectiveInsts, ScavInlineExpansion, all 2,000+ ROT13 names | Knob names are API-like -- changing them breaks internal test harnesses |
| ROT13 encoding | The ROT13 obfuscation layer itself, decoded by codecs.decode(s, "rot_13") | Obfuscation scheme has been consistent across observed versions |
| Phase count and ordering | 159 phases in the OCG pipeline, ordered by the PhaseManager vtable table | Phase count may grow but existing phases retain their relative order |
| Pipeline stage names | Parse-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time | Stage names are embedded in format strings unlikely to change |
| Subsystem names | OCG, Mercury, Ori, Scav | Internal codenames are stable across releases |
| Encoding handler template | 6-step pattern: opcode ID, movaps format descriptor, register class map, operand registration, finalize, bitfield extract | Template structure is generated from a stable code generator |
| Error message text | "SM does not support LDCU", "Invalid knob identifier" | Diagnostic strings are rarely reworded |
Version-fragile (changes with every recompilation):
| Category | Examples | Why fragile |
|---|---|---|
| Function addresses | Every sub_XXXXXX reference, vtable addresses like off_22BD5C8 | ASLR-style shifts from any code or data size change |
| Address ranges | Sweep boundaries 0x400000--0x4D5000, subsystem regions | Functions move when preceding code grows or shrinks |
| Function sizes | sub_446240 at 12,345 bytes | Inlining decisions change, optimizer improvements add/remove code |
| Caller/callee counts | sub_424070 at 3,809 callers | New call sites added, old ones removed |
| Struct offsets | context[133], context+1584 | New fields inserted into context structs |
| `.rodata` addresses | String locations like 0x202D4D8, encoding table addresses | Data layout shifts with code changes |
| Call graph edge counts | 548,693 edges | New functions and call sites |
| Total function count | 40,185 | New SM targets add encoding handlers |
Identifying Function Address Changes
When loading a new ptxas version into IDA:

1. **Extract the same 8 JSON artifacts** using `analyze_ptxas.py` (or equivalent). The critical artifacts for diffing are `ptxas_functions.json` (address, size, callee list) and `ptxas_strings.json` (string content, xref locations).
2. **Match functions by invariant properties.** Functions cannot be matched by address alone. Use these matching criteria in priority order:
   - String anchors. Functions containing unique string references (e.g., the function referencing `"Please use -knob DUMPIR=AllocateRegisters"`) can be matched across versions by searching for the same string in the new binary. This is the highest-confidence matching method.
   - Size + callee signature. For functions without string anchors, match by (approximate size, sorted callee list). A function of ~2,100 bytes calling the pool allocator, OOM handler, and hash map insert is almost certainly the same function even if its address shifted by megabytes.
   - Callgraph position. Functions identified by their caller/callee topology: the phase factory is the function called from the PhaseManager constructor with 159+ case targets; the diagnostic emitter is the function with 2,000+ callers that calls `vfprintf`.
   - Vtable slot position. Phase `execute()` methods are at vtable slot 0. If the vtable table address changes but still contains 159 entries, the slot positions identify each phase.
   - Template fingerprinting. Encoding handlers matching the 6-step template (bitfield insert via the highest-caller utility, `movaps` from `.rodata`, operand registrars, finalize call) are encoding handlers in any version.
3. **Diff the function lists.** Produce a mapping `{old_addr -> new_addr}` for all matched functions. Functions present in the new binary but absent in the old are new (likely new SM target support). Functions absent in the new binary are removed (dropped legacy SM support) or merged.
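The size + callee signature criterion lends itself to a mechanical first pass. A sketch, assuming a simplified record schema of `{"addr", "size", "callees"}` dicts (the real ptxas_functions.json layout may differ):

```python
from collections import defaultdict

def match_functions(old_funcs, new_funcs):
    """First-pass cross-version matcher: bucket new-binary functions by
    (size rounded to 64 bytes, callee count) and accept only buckets with
    exactly one candidate. Returns {old_addr: new_addr}; string-anchor and
    callgraph matching would refine the unmatched remainder."""
    def key(fn):
        return (fn["size"] // 64, len(fn["callees"]))
    buckets = defaultdict(list)
    for fn in new_funcs:
        buckets[key(fn)].append(fn)
    mapping = {}
    for fn in old_funcs:
        candidates = buckets.get(key(fn), [])
        if len(candidates) == 1:  # reject ambiguous buckets outright
            mapping[fn["addr"]] = candidates[0]["addr"]
    return mapping

# Illustrative records (addresses and sizes are made up for the example).
old = [{"addr": 0x446240, "size": 2112, "callees": [1, 2, 3]}]
new = [{"addr": 0x447AB0, "size": 2120, "callees": [7, 8, 9]}]
print(match_functions(old, new))  # maps 0x446240 -> 0x447AB0
```

Rounding the size to 64-byte buckets tolerates small codegen drift between versions; near-boundary sizes can still mis-bucket, which is why ambiguous buckets are discarded rather than guessed.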
Updating Sweep Reports
The 30-region sweep reports in `ptxas/raw/` are version-locked historical records -- they document the analysis of v13.0.88 and should not be overwritten. For a new version:

1. **Re-run the sweep** with new address ranges derived from the new binary's function list. The region partitioning should follow the same subsystem-aligned strategy: infrastructure first, then PhaseManager, then high-complexity subsystems, then batch encoding handlers.
2. **Name new reports with a version suffix:** `p2.01-sweep-v13.1-0xNNN-0xMMM.txt` (or whatever scheme distinguishes the version).
3. **Cross-reference against old reports.** For each region, note which functions moved, which are new, and which disappeared. The old sweep reports provide the expected function identities; the new sweep validates whether those identities still hold at the new addresses.
Pages Most Sensitive to Version Changes
These wiki pages require immediate updates when the binary changes:
| Page | Sensitivity | What changes |
|---|---|---|
| `function-map.md` | Critical | Every address in every table row. The entire page is address-indexed. |
| `binary-layout.md` | Critical | Section addresses, subsystem boundaries, address-range diagram. |
| `VERSIONS.md` | Critical | Binary size, build string, function count, version number. |
| `pipeline/overview.md` | High | Phase factory address, PhaseManager constructor address, vtable table address. |
| `scheduling/algorithm.md` | High | Scheduler function addresses, priority function addresses. |
| `regalloc/algorithm.md` | High | Allocator function addresses, fatpoint computation address. |
| `codegen/encoding.md` | High | Encoding handler address ranges, format descriptor addresses. |
| `config/knobs.md` | Medium | Knob constructor addresses (content of knob names is stable). |
| `ir/instructions.md` | Medium | Opcode numbers may shift if new instructions are added. |
| `targets/index.md` | Medium | New SM targets may appear, changing validation table sizes. |
| `methodology.md` | Low | The methodology itself is version-stable; only the "Scope and Scale" table needs updating. |
Recommended Update Workflow
The update follows a five-step sequence. Steps 1-2 are mechanical; steps 3-5 require analyst judgment.
Step 1: Extract new IDA artifacts.
Load the new ptxas binary into IDA Pro 8.x. Run analyze_ptxas.py to produce the 8 JSON artifacts and per-function decompiled .c files. Store them in a version-specific directory (e.g., ptxas/ida-v13.1/ or alongside the existing artifacts with clear version labeling).
Step 2: Diff against the old artifacts.
Write or use a diff script that:

- Compares `ptxas_functions.json` (old vs new) by matching on string anchors, size+callee signature, and callgraph position.
- Produces a `{old_addr -> new_addr}` mapping for matched functions.
- Lists unmatched functions in both directions (new functions, removed functions).
- Compares `ptxas_strings.json` to detect new strings, removed strings, and strings whose xref functions changed.
- Reports total function count delta, binary size delta, and new section addresses.
Step 3: Update address-sensitive pages.
Using the address mapping from Step 2:
- Update every `sub_XXXXXX` reference in `function-map.md`, `binary-layout.md`, and all pages listed in the sensitivity table above.
- Update the "Scope and Scale" table in `methodology.md` with new function counts, string counts, binary size, and build string.
- Update `VERSIONS.md` with the new binary metadata.
- For pages with address ranges (sweep boundaries, subsystem regions), recompute the ranges from the new function list.
Step 4: Verify key struct layouts.
Struct offset changes are the most dangerous kind of version drift because they silently invalidate decompiled code analysis. For each documented struct:
- Re-decompile the struct's primary accessor function (e.g., `sub_424070` for the pool allocator, `sub_4280C0` for the TLS context).
- Compare field offsets against the documented layout.
- If offsets shifted, update the struct documentation and propagate the change to all pages that reference those offsets.
Priority structs to verify: pool allocator (free-list bins at +2128, mutex at +7128), TLS context (280 bytes), Mercury instruction node (112 bytes), scheduler context (~1000 bytes), allocator state (1590+ bytes), phase objects (16 bytes).
Step 5: Validate phase pipeline.
- Re-extract the phase vtable table (find the new address of the 159-entry pointer array in `.data.rel.ro`).
- Verify all 159 phases are present and in the expected order.
- Check for new phases (count > 159) or removed phases (count < 159).
- Re-run `ptxas --fdevice-time-trace` on a test kernel and cross-reference the phase names in the trace output against the wiki's phase list.
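The final cross-reference check above can be automated. This sketch assumes only the standard Chrome trace schema (a top-level `traceEvents` array of events with `name` fields), which is the format `--fdevice-time-trace` emits:

```python
import json

def unknown_phases(trace_text: str, wiki_phases: set) -> set:
    """Phase names appearing in the trace output but missing from the
    wiki's 159-entry phase list -- candidates for newly added phases."""
    events = json.loads(trace_text).get("traceEvents", [])
    return {e["name"] for e in events if "name" in e} - wiki_phases

# Minimal synthetic trace; real traces come from ptxas --fdevice-time-trace.
trace = ('{"traceEvents": [{"name": "AllocateRegisters", "ph": "X"},'
         ' {"name": "OriCopyProp", "ph": "X"}]}')
print(unknown_phases(trace, {"AllocateRegisters"}))  # {'OriCopyProp'}
```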
Raw Data Locations
All raw analysis artifacts for the current version (v13.0.88) live in the repository under ptxas/:
| Directory | Contents |
|---|---|
| `ptxas/raw/` | 40 sweep reports (p1.01--p1.30 plus sub-region splits), per-task investigation reports (P0_*, P1_*, P2_*, etc.) |
| `ptxas/decompiled/` | Per-function Hex-Rays decompiled C files (`sub_XXXXXX.c`, named functions like `ctor_003_0x4095d0.c`) |
| `ptxas/disasm/` | Per-function disassembly files |
| `ptxas/graphs/` | Per-function control flow graph JSON files (80,078 files) |
| `ptxas/` (root) | The 8 JSON artifacts (`ptxas_functions.json`, `ptxas_strings.json`, `ptxas_callgraph.json`, `ptxas_xrefs.json`, `ptxas_comments.json`, `ptxas_names.json`, `ptxas_imports.json`, `ptxas_segments.json`), the IDA database (`ptxas.i64`), the extraction script (`analyze_ptxas.py`), and the binary itself (`ptxas`) |
| `ptxas/wiki/src/` | The wiki source pages (this document and all others) |
When updating to a new version, preserve the existing artifacts for v13.0.88 (rename or move to a versioned subdirectory) and store new artifacts alongside them. The sweep reports in ptxas/raw/ are historical records and should never be overwritten.
Limitations and Known Gaps
- **No dynamic validation of optimization correctness.** All findings are from static analysis. The identified phase algorithms have not been tested against runtime inputs to verify they produce correct output for all corner cases.
- **39.6% of functions are vtable-dispatched.** Functions with zero static callers can only be reached by finding the vtable or function pointer table that references them. Some vtables in deep `.rodata` may have been missed, leaving some functions orphaned.
- **No upstream reference for any code.** Unlike cicc (LLVM fork) or nvcc (EDG frontend), ptxas has no open-source analog. Every identification is from first principles. This limits confidence for functions where string evidence is absent and structural analysis is the only basis.
- **Template-generated code is indistinguishable.** The ~4,000 SASS encoding handlers are generated from internal templates. Without the template source, mapping individual handlers to specific opcodes requires tracing the dispatch table entries, which has only been done for select handlers.
- **Mega-functions are partially opaque.** The four functions exceeding 200 KB (`sub_169B190` at 280 KB, `sub_143C440` at 233 KB, `sub_198BCD0` at 239 KB, `sub_18A2CA0` at 231 KB) could not be decompiled by Hex-Rays. Their behavior is understood from their callee lists (13,000--15,870 callees each) and their position in the pipeline, but the internal dispatch logic is known only at the disassembly level.
- **ROT13 decoding is necessary but not sufficient.** Decoding the 2,000+ knob names reveals the existence of tuning parameters but not their semantics. A knob named `MercuryPresumeXblockWaitBeneficial` can be decoded from ROT13, but understanding what "xblock wait beneficial" means requires analyzing the code paths that read the knob.
- **Version-specific addresses.** All addresses in this wiki apply to ptxas v13.0.88 (build `cuda_13.0.r13.0/compiler.36424714_0`). Other CUDA toolkit versions will have different addresses, different function counts, and potentially different phase orderings. However, the analysis methodology (string-driven, vtable-driven, callgraph propagation) applies to any version.
- **Indirect calls are undercounted.** The 548,693-edge call graph captures only direct `call` instructions resolved by IDA. Virtual calls through vtable pointers, function pointer callbacks, and computed jumps are not fully captured. The true call graph is significantly denser than what is recorded.
Corrections Log
This section documents every factual error discovered and corrected during the wiki improvement pass. Each entry records the error, the correction, affected pages, and the agent task that performed the fix. The full detail for each correction is in ptxas/raw/P5_11_corrections_log_report.txt.
Summary
| Metric | Count |
|---|---|
| Distinct factual errors corrected | 22 |
| Wiki pages with at least one fix | 30+ |
| Agent tasks that discovered errors | 15 |
| Agent tasks that propagated fixes | 5 |
Corrections by Severity
Systematic errors (affected 5+ pages each)
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 01 | Opcode numbering: wiki assumed two numbering systems; "Selected Opcode Values" table had wrong SASS mnemonic labels (e.g., 93=CALL, 95=EXIT, 97=MOV, 130=BAR) | One numbering system: ROT13 name table index IS the instruction opcode. Correct labels: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2 | 15 pages (ir/instructions, ir/cfg, passes/predication, passes/sync-barriers, passes/liveness, passes/general-optimize, passes/rematerialization, passes/copy-prop-cse, passes/strength-reduction, regalloc/abi, regalloc/spilling, intrinsics/sync-warp, codegen/isel, scheduling/latency-model, scheduling/algorithm) | P0-01, P4-02, P5-01 |
| 02 | Register class 6 = UB (Uniform Barrier); classes 2-6 all wrong | Class 6 = Tensor/Accumulator (MMA/WGMMA). Correct table: 2=R(alt), 3=UR, 4=UR(ext), 5=P/UP, 6=Tensor/Acc. Barrier regs use reg_type 9, outside the 7-class system | 7 pages (ir/registers, regalloc/overview, regalloc/algorithm, regalloc/spilling, passes/gmma-pipeline, intrinsics/tensor, ir/overview) | P0-02 |
| 03 | context+1584 had 5 conflicting names: code_object, sched_ctx, arch_backend, optimizer_state, function manager | Single object: SM-specific architecture backend ("sm_backend"), constructed per-compilation-unit in sub_662920 via SM version switch | 3 pages corrected (ir/data-structures, ir/overview, passes/copy-prop-cse); 14 pages acceptable as-is | P0-03 |
Identity misattributions
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 06 | sub_83EF00 (29KB) listed as "Top-level unrolling driver" | sub_83EF00 is MainPeepholeOptimizer (opcode switch on 2, 134, 133, 214, 213, 210). Actual unrolling driver: sub_1390B30 via Phase 22 entry sub_1392E30 | passes/loop-passes.md | P1-04, P5-03 |
| 07 | sub_926A30 (22KB) listed as "Main pipelining engine (modulo scheduling)" | sub_926A30 is the operand-level latency annotator and interference weight builder, called by sub_92C0D0 per-instruction | passes/loop-passes.md | P1-06 |
| 08 | sub_7E7380 described as "full structural equivalence" (opcode, type, all operands, register class comparison) | sub_7E7380 is 30 lines / 150 bytes: narrow predicate-operand compatibility check (predicate bit parity + last operand 24-bit ID + penultimate 8-byte encoding). Full structural comparison done by the 21 callers | passes/copy-prop-cse.md, passes/general-optimize.md | P1-07, P5-06 |
Inverted semantics
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 05 | isNoOp()=1 "means it executes unconditionally" | isNoOp()=1 means the dispatch loop SKIPS execute(). Code: if (!phase->isNoOp()) { phase->execute(ctx); } | passes/rematerialization.md | P0-05 |
| 09 | Hot-cold priority: "1 = cold, 0 = hot" | 1 = hot = higher priority, 0 = cold = lower priority. sub_A9CDE0 (hot detector) returns true -> bit 5 set -> higher priority | passes/hot-cold.md | P1-09, P5-06 |
| 10 | "Fatpoint" implied to be maximum-pressure point | Fatpoint scans for MINIMUM-cost slot. The name refers to the exhaustive (fat) scan evaluating all slots, not to picking the maximum | (verified correct across all pages -- 0 fixes needed) | P1-10, P5-06 |
Wrong numeric values
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 04 | context+1552 = "Legalization stage counter" with 3 values (3, 7, 12) | Pipeline progress counter with 22 values (0-21) spanning all pipeline categories | 4 pages (ir/data-structures, passes/late-legalization, passes/rematerialization, passes/copy-prop-cse) | P0-04 |
| 12 | 5 SASS opcode mnemonic typos: PSMTEST, LGDEPBAR, LGSTS, UBLKPC, UTMAREDG | CSMTEST, LDGDEPBAR, LDGSTS, UBLKCP, UTMREDG | reference/sass-opcodes.md | P2-11 |
| 14 | WGMMA case 9 = 0x1D5D (7517), case 10 = 0x1D5E (7518) | Case 9 = 0x1D5E (7518), case 10 = 0x1D60 (7520). Codes 0x1D5D/0x1D5F are advisory (non-serialization) warnings | passes/gmma-pipeline.md | P3-25 |
| 15 | ABI minimum: gen 5 (sm_60-sm_89) = 16 regs, gen 9+ = 24 regs | gen 3-4 (sm_35-sm_53) = 16, gen 5-9 (sm_60-sm_100) = 24. Binary: (generation - 5) < 5 ? 24 : 16 | regalloc/abi.md | P3-26 |
| 17 | Unrolling rejection table at 0x21D1980 with 36-byte structures | Rejection string pointer array at 0x21D1EA0 with simple integer indices 7-24. The 0x21D1980 table is for peephole operand range lookups | passes/loop-passes.md | P1-04 |
Phantom data and scope errors
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 11 | "Approximately 80 additional entries bulk-copied from unk_21C0E00" at SASS opcode indices 322-401, "totaling roughly 402 named opcodes" | Table has exactly 322 entries. The 1288-byte block at unk_21C0E00 is a 322-element identity map {0,1,...,321} copied to a different data structure (encoding category map at obj+0x2478) | reference/sass-opcodes.md | P2-11 |
| 13 | "139 explicitly named phases and 20 architecture-specific unnamed phases" | All 159 phases have names in the static table at off_22BD0C0. The original 139-phase inventory missed 20 phases (e.g., OriCopyProp, Vectorization, MercConverter, AllocateRegisters) | pipeline/overview.md, passes/index.md | P2-14, P4-03 |
| 16 | Warning 7018 (0x1B6A) attributed to SUSPEND/preserved scratch diagnostic | Code 0x1B6A does not exist in the binary. The actual code is 7011 (0x1B63) | regalloc/abi.md | P3-26 |
| 18 | Unrolling rejection codes listed as 0x80000001-0x80000018 | Those hex values appear in diagnostic message STRINGS, not as internal codes. Internal codes are simple integers 7-24 | passes/loop-passes.md | P1-04 |
Minor corrections
| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 19 | sub_80B700/sub_80BC80 listed as unrolling functions | Both are peephole optimizer functions (called through sub_83EF00), not unrolling | passes/loop-passes.md | P1-04 |
| 22 | general-optimize.md called sub_7E7380 "instruction_equivalent" / "structural instruction equivalence" in 6 locations | Renamed to "predicate_operand_compatible" / "predicate-operand compatibility check" | passes/general-optimize.md | P5-06 |
Error Categories
| Category | Count | Examples |
|---|---|---|
| Identity misattribution | 5 | Wrong function-to-role mappings, wrong names for context fields |
| Wrong numeric values | 5 | Wrong opcode labels, wrong hex codes, wrong thresholds, wrong addresses |
| Inverted semantics | 3 | isNoOp skip-vs-execute, hot-cold bit polarity, fatpoint min-vs-max |
| Conflicting definitions | 3 | Register class contradictions across pages |
| Phantom data | 2 | Nonexistent SASS entries 322-401, nonexistent warning 7018 |
| Scope mischaracterization | 2 | context+1552 scope too narrow, phase naming scope too narrow |
| Encoding confusion | 2 | Hex-in-message-string vs internal code, wrong address for lookup table |
Lessons Learned
- Behavioral inference is unreliable for opcode identity. Observing that an opcode appears in branch contexts does not make it BRA. Always check the authoritative ROT13 name table.
- Cross-page consistency checks catch conflicting speculations. Five pages independently naming the same field (context+1584) is a strong signal that at least four are wrong.
- Counts from partial analysis are systematically low. The "3 values" for context+1552 and "139 named phases" both resulted from stopping the search too early. Exhaustive binary sweeps consistently reveal more entries.
- Function size is not a reliable identity signal. sub_83EF00 (29KB) was large enough to seem like a major driver, but size alone does not distinguish a peephole optimizer from a loop unroller.
- ROT13 decoding + binary cross-validation is the gold standard. Every correction that replaced speculative labels with ROT13-decoded names has held up under subsequent audits.
Version Tracking
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the exact ptxas binary under analysis and the version-related metadata recovered from the stripped ELF.
Binary Under Analysis
| Field | Value |
|---|---|
| Tool | ptxas (PTX optimizing assembler) |
| Version | 13.0.88 |
| Build tag | cuda_13.0.r13.0/compiler.36424714_0 |
| Build date | Wed Aug 20 01:55:12 PM PDT 2025 |
| Source path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| ELF size | 37,741,528 bytes (37.7 MB) |
| Architecture | x86-64 (AMD64) |
| Linking | Dynamically linked, stripped |
| Functions | ~40,000 (estimated from IDA/Ghidra DB) |
Embedded Version Strings
sub_432A00 (0x432A00, CLI option registration) contains the self-identification
strings that ptxas prints for --version / --list-version:
| String | Role |
|---|---|
| "Ptx optimizing assembler" | Product name |
| "NVIDIA (R)" | Vendor |
| Copyright 2005-2025 | Date range |
| "ptxocg.0.0" | OCG backend version tag |
The "ptxocg.0.0" tag also appears in sub_43A400 (compilation setup) and at
address 0x1CE74AB in the .rodata section, identifying the backend optimizer
component embedded inside ptxas.
Default Target Architecture
sub_6784B0 returns sm_75 (Turing) as the default compilation target when no
--gpu-name flag is supplied. This is consistent with the CUDA 13.0 toolkit
defaulting to a Turing-class GPU.
The full set of architecture strings referenced in the front-end validators (addresses 0x460000-0x4D5000) includes:
sm_20 sm_30 sm_35 sm_50 sm_60 sm_75 sm_80 sm_86 sm_89 sm_90
with sm_%d format-string patterns covering all supported SM codes.
Output ELF Format
Cubins emitted by ptxas use the ELF standard with:
| Field | Value |
|---|---|
| e_machine | EM_CUDA (0xBE = 190) |
| ELF class | ELFCLASS32 or ELFCLASS64 (per target) |
| Custom section type | SHT_CUDA_INFO = 0x70000064 |
| Magic (code object header) | 0x16375564E ("dUWc" + version nibble) |
The SM-version-to-code-object mapping lives in the ELF emitter at
sub_1C9F280. Example encodings recovered from sub_A3D000 range:
| field[93] | Target | Version encoding |
|---|---|---|
| 12288 | sm_30 | 0x70007 |
| 20481 | sm_50 | 0xC000C |
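A downstream tool can use the recovered constants above to recognize ptxas output. The following is a minimal sketch, assuming a standard 64-bit ELF header layout and a little-endian host; the function name and struct are ours, only the EM_CUDA and SHT_CUDA_INFO values come from this page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constants recovered in this page; the names are our own. */
#define EM_CUDA        190          /* 0xBE, e_machine for cubins   */
#define SHT_CUDA_INFO  0x70000064u  /* custom .nv.info section type */

/* Minimal prefix of the 64-bit ELF header -- enough to check
   the magic and machine fields of a candidate cubin. */
typedef struct {
    uint8_t  e_ident[16];  /* 0x7F 'E' 'L' 'F', class, data, ... */
    uint16_t e_type;
    uint16_t e_machine;    /* EM_CUDA for ptxas output */
} ElfHeaderPrefix;

/* Returns 1 if the buffer looks like a cubin (ELF magic + EM_CUDA). */
static int looks_like_cubin(const uint8_t *buf, size_t len)
{
    ElfHeaderPrefix h;
    if (len < sizeof h) return 0;
    memcpy(&h, buf, sizeof h);
    if (memcmp(h.e_ident, "\x7f" "ELF", 4) != 0) return 0;
    return h.e_machine == EM_CUDA;
}
```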
Build System Metadata
The source path leaked through __FILE__ macros in the knobs infrastructure
reveals the NVIDIA internal build tree layout:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/
drivers/common/utils/generic/impl/generic_knobs_impl.h
Key observations:
- /dvs/p4/ -- Perforce depot root on the DVS (Driver Verification System) build farm.
- sw/rel/gpgpu/toolkit/r13.0/ -- Release branch for CUDA toolkit 13.0.
- compiler/drivers/common/ -- Shared compiler driver code (used by both ptxas and cicc).
- generic_knobs_impl.h -- The knob system implementation header; the __FILE__ macro at lines 395-1090 of this file is embedded in ptxas error metadata.
Evidence Index
| Claim | Source |
|---|---|
| Version 13.0.88, 37.7 MB | Headers of all 30 sweep reports (e.g. p1.23, p1.28) |
| sub_432A00 strings | p1.01 lines 514-521 |
| sub_6784B0 default sm_75 | User-provided; corroborated by sm_75 prevalence across all validators |
| Source path | p1.05 lines 14-16, p1.04a line 628 |
| ptxocg.0.0 | p1.01 line 553, p1.05 line 1256 |
| ELF emitter / EM_CUDA | p1.30 lines 46-69 |
| SM version encoding table | p1.08b lines 217-237 |
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_432A00 | -- | CLI option registration; contains --version / --list-version self-identification strings ("Ptx optimizing assembler", "NVIDIA (R)", "ptxocg.0.0") | 0.92 |
| sub_43A400 | -- | Compilation setup; references the "ptxocg.0.0" backend version tag | 0.85 |
| sub_6784B0 | -- | Default target architecture selector; returns sm_75 (Turing) when no --gpu-name flag is supplied | 0.90 |
| sub_1C9F280 | -- | ELF emitter; SM-version-to-code-object mapping for cubin output | 0.85 |
| sub_A3D000 | -- | SM version encoding table; example encodings (12288 = sm_30, 20481 = sm_50) | 0.80 |
Compilation Pipeline Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page maps the complete end-to-end flow of a PTX assembly through ptxas v13.0.88, from the initial CLI invocation to the final ELF/cubin binary output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.
Pipeline Diagram
nvcc / cicc
| (PTX text file or --input-as-string)
v
+================================================================+
| ptxas v13.0.88 (37.7 MB, ~40,000 functions) |
| |
| 1. Entry & CLI Parsing ----------> [entry.md] |
| | main -> sub_446240 -> sub_434320 |
| | target arch, opt level, --maxrregcount, knobs |
| v |
| 2. PTX Lexer + Parser -----------> [ptx-parser.md] |
| | sub_451730: Flex scanner, Bison grammar |
| | ROT13-decoded opcode table (900+ mnemonics) |
| | 30+ per-instruction semantic validators |
| v |
| 3. PTX Directive Handling --------> [ptx-directives.md] |
| | .version, .target, .entry, .func, .reg, .shared |
| | register constraints, ABI configuration |
| v |
| 4. PTX-to-Ori Lowering ----------> [ptx-to-ori.md] |
| | PTX AST -> Ori IR (basic blocks, virtual registers) |
| | address space annotation, special register mapping |
| v |
| 5. 159-Phase Optimization -------> [optimizer.md] |
| | PhaseManager: sub_C62720 (constructor), |
| | sub_C64F70 (executor) |
| | 10 stages, 17 AdvancedPhase hooks, |
| | 8-phase Mercury encoding sub-pipeline |
| | per-kernel via sub_7FBB70 -> sub_7FB6C0 |
| v |
| 6. Register Allocation ----------> [../regalloc/overview.md] |
| | Fatpoint algorithm, phase 101 (AdvancedPhaseAllocReg) |
| | spill/fill insertion, ABI register reservations |
| v |
| 7. Instruction Scheduling -------> [../scheduling/overview.md]|
| | 3-phase: pre-schedule (97), post-schedule (106), |
| | post-fixup (111) |
| | scoreboard generation, dependency barriers |
| v |
| 8. SASS Encoding ----------------> [../codegen/encoding.md] |
| | 530 instruction encoding handlers (vtable dispatch) |
| | Mercury format: phases 113-122 |
| | Capsule Mercury (default on sm_100+) |
| v |
| 9. ELF/Cubin Output -------------> [output.md] |
| | sub_612DE0 (finalizer) -> sub_1C9F280 (ELF emitter) |
| | section layout, symbol table, relocations |
| | DWARF debug info, EIATTR attributes |
| v |
| OUTPUT: .cubin / .o (ELF) |
+================================================================+
Side paths:
* Capsule Mercury (--cap-merc) -----> [../codegen/capmerc.md]
* Debug info (all stages) ----------> [../output/debug-info.md]
* SASS text (--verbose) ------------> [../codegen/sass-printing.md]
Narrative Walk-Through: One Kernel, Start to Finish
A concrete trace of a single-kernel PTX module compiled for sm_100 at -O2:
1. PTX text arrives (~2--200 KB). Either read from a .ptx file or received in-memory via --input-as-string from nvcc. The driver sub_446240 establishes a setjmp recovery point, parses CLI options into the 1,352-byte options block, and allocates the "Top level ptxas memory pool".
2. Lexer + Parser (sub_451730). A Flex-generated scanner tokenizes the PTX text into a token stream. Tokens flow into a Bison-generated LALR parser that builds an AST. The opcode dispatch table (sub_46E000, 93 KB, 1,168 callees) routes each instruction mnemonic through ROT13 decoding, type resolution, and 30+ per-instruction semantic validators. For a 5 KB PTX kernel, the parser typically produces ~200--500 AST nodes with ~50 virtual register declarations. The "PTX parsing state" pool holds all AST memory.
3. Directive processing and CompileUnitSetup. .version/.target directives configure the SM profile via sub_6765E0 (54 KB profile constructor). .entry/.func directives establish the kernel boundary. .reg/.shared/.const directives declare resources. sub_43B660 computes the physical register budget from .maxnreg, --maxrregcount, and .maxntid constraints. The 1,936-byte profile object is now populated with codegen factory value (36864 for sm_100), scheduling parameters, and capability flags.
4. PTX-to-Ori lowering (DAGgen). sub_6273E0 (44 KB) converts each AST instruction into an Ori IR node: a basic block with virtual registers, control flow edges, and memory space annotations. Special registers (%ntid, %laneid, %smid) map to internal IDs. Address computation uses a 6-bit operand type encoding. A 500-instruction PTX kernel typically produces ~600--1,200 Ori instructions (expansion from pseudo-ops, address calculations, and predicate materialization). The "Permanent OCG memory pool" is created here to hold all IR state.
5. 159-phase OCG pipeline (sub_C62720 constructs, sub_C64F70 executes). Each phase is a 16-byte polymorphic object with execute(), isNoOp(), and getName() vtable methods. The PhaseManager iterates the phase table at 0x22BEEA0, skipping any phase whose isNoOp() returns true. At -O2, roughly 80--100 of the 159 phases are active. Typical expansion factors: the initial 1,000 Ori instructions may grow to 1,200--1,500 after unrolling and intrinsic expansion, then shrink to 800--1,000 after CSE/DCE, then re-expand to 1,500--2,500 after register allocation spill/fill insertion. The PhaseManager logs "Before <phase>" / "After <phase>" strings (visible in the sub_C64F70 decompile) for DUMPIR.
6. Register allocation (phase 101, sub_971A90). The Fatpoint algorithm attempts NOSPILL allocation first. If pressure exceeds the register budget, the spill guidance engine (sub_96D940, 84 KB) computes spill candidates across 7 register classes, and the retry loop makes up to N attempts (knob 638/639) with progressively more aggressive spilling. Physical register assignments are committed; spill/fill instructions are inserted into the Ori IR.
7. Instruction scheduling (phases 97, 106, 111). Three scheduling passes assign dependency barriers and reorder instructions for pipeline throughput. The scoreboard generator tracks 6 dependency barriers per warp. For a 1,500-instruction kernel, scheduling typically produces a ~2,000--3,000-entry instruction stream after barrier insertion and NOP padding.
8. SASS encoding (phases 113--122). Each Ori instruction is lowered to a 128-bit SASS binary instruction via the 530-handler vtable dispatch. The 1,280-bit (160-byte) encoding workspace at instruction+544 is filled by sub_7B9B80 (bitfield insert, 18,347 callers). A 2,000-instruction kernel produces ~32 KB of raw SASS binary. On sm_100+, Capsule Mercury (capmerc) is the default format, embedding PTX source alongside the SASS.
9. ELF/cubin emission (sub_612DE0, 47 KB). The finalizer assembles the cubin: .text.FUNCNAME (SASS binary), .nv.info.FUNCNAME (EIATTR attributes), .nv.shared.FUNCNAME (shared memory layout), .nv.constant0.FUNCNAME (constant bank), plus global sections (.shstrtab, .strtab, .symtab). Section layout (sub_1CABD60, 67 KB) assigns addresses; the master ELF emitter (sub_1C9F280, 97 KB) writes headers, section tables, and program headers. A single-kernel cubin for a medium-complexity kernel is typically 40--120 KB.
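The bitfield-insert primitive from step 8 (sub_7B9B80, which fills the 1,280-bit encoding workspace) has not been fully reverse-engineered here; the following is a minimal sketch assuming it behaves like a conventional put-bits into a little-endian array of 64-bit words. The function name `put_bits` is ours.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical analogue of the bitfield-insert helper: write
   `width` bits of `value` at bit offset `pos` into a little-endian
   array of 64-bit words (e.g. the 160-byte encoding workspace). */
static void put_bits(uint64_t *words, unsigned pos, unsigned width,
                     uint64_t value)
{
    /* Insert bit by bit to sidestep cross-word shift edge cases. */
    for (unsigned i = 0; i < width; i++) {
        unsigned bit = pos + i;
        uint64_t mask = 1ULL << (bit % 64);
        if ((value >> i) & 1)
            words[bit / 64] |= mask;
        else
            words[bit / 64] &= ~mask;
    }
}
```

A 128-bit SASS word would then be two such `uint64_t` entries, with opcode and operand fields written at their architecture-specific offsets.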
Approximate data sizes at each stage (medium kernel, sm_100, -O2):
| Stage | Input | Output | Peak Memory |
|---|---|---|---|
| PTX text | -- | 5--50 KB text | 100 KB (file buffer + parser state) |
| AST | Token stream | 200--500 nodes (~40--100 KB) | 200 KB |
| Ori IR (initial) | AST | 600--1,200 instructions (~100--250 KB) | 500 KB |
| Ori IR (post-OCG) | 1,200 instr | 1,500--2,500 instr (~300--600 KB) | 2--8 MB (peak during regalloc) |
| SASS binary | Scheduled IR | 32--128 KB | 1 MB |
| Cubin (ELF) | SASS + metadata | 40--120 KB | 2 MB |
Timed Phases
The compilation driver sub_446240 measures six timed phases per compile unit and reports them when --compiler-stats is enabled. The format strings are embedded directly in the binary:
| Phase | Format String | Subsystem |
|---|---|---|
| Parse-time | "Parse-time : %.3f ms (%.2f%%)\n" | PTX lexer + Bison parser + semantic validation |
| CompileUnitSetup-time | "CompileUnitSetup-time : %.3f ms (%.2f%%)\n" | Target configuration, ABI setup, register constraints |
| DAGgen-time | "DAGgen-time : %.3f ms (%.2f%%)\n" | PTX-to-Ori lowering, CFG construction, initial DAG formation |
| OCG-time | "OCG-time : %.3f ms (%.2f%%)\n" | Optimized Code Generation: all 159 optimization phases, register allocation, instruction scheduling, SASS encoding |
| ELF-time | "ELF-time : %.3f ms (%.2f%%)\n" | ELF construction, section layout, symbol table, relocations, EIATTR, file write |
| DebugInfo-time | "DebugInfo-time : %.3f ms (%.2f%%)\n" | DWARF .debug_info/.debug_line/.debug_frame generation, LEB128 encoding |
Additional aggregate stats:
CompileTime = %f ms (100%)
PeakMemoryUsage = %.3lf KB
The per-unit header prints "\nCompile-unit with entry %s" before each kernel's phase breakdown.
Per-Kernel Parallelism
ptxas supports two compilation modes for multi-kernel PTX modules:
Single-Threaded Mode (Default)
The compilation driver sub_446240 iterates over compile units sequentially. For each kernel entry:
- sub_43CC70 -- per-entry compilation unit processor, skips __cuda_dummy_entry__
- sub_7FBB70 -- per-kernel entry point, prints "\nFunction name: " + kernel name
- sub_7FB6C0 -- pipeline orchestrator: builds phases via sub_C62720, executes via sub_C64F70
- Cleanup: destroys 17 analysis data structures (live ranges, register maps, scheduling state)
Each kernel runs through the entire 159-phase pipeline independently. Cross-kernel state is limited to shared memory layout and the global symbol table.
Thread Pool Mode (--split-compile)
When --allow-expensive-optimizations or --split-compile is active, ptxas uses a pthread-based thread pool for per-kernel parallelism:
- Pool constructor (sub_1CB18B0): allocates a 184-byte pool struct (0xB8), spawns N detached worker threads via pthread_create, initializes a mutex at +24 and two condition variables at +64 and +112
- Task submit (sub_1CB1A50): allocates a 24-byte task node {func_ptr, arg, next}, enqueues it on a linked list, broadcasts on cond_work
- Jobserver integration (sub_1CC7300): reads the MAKEFLAGS environment variable, parses --jobserver-auth= for either fifo: named pipes or pipe-based file descriptors, and throttles the thread count to respect GNU Make's -j slot limit
The thread pool is used throughout the OCG and ELF phases (stages 5-9 in the diagram). Each worker thread receives its own thread-local context (sub_4280C0, 280-byte TLS struct with per-thread error flags, diagnostic suppression state, Werror flag, and synchronization primitives).
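The queue discipline described above (linked-list task nodes, condvar broadcast on submit) can be sketched as follows. This is a simplified reconstruction, not the recovered code: the struct layouts and names are ours, and the workers here are joinable rather than detached so the example can be shut down cleanly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical reconstruction of the 24-byte task node. */
typedef struct Task {
    void (*func)(void *);
    void *arg;
    struct Task *next;
} Task;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond_work;   /* broadcast on submit */
    Task *head, *tail;
    int shutdown;
} Pool;

static void *worker(void *p)
{
    Pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (!pool->head && !pool->shutdown)
            pthread_cond_wait(&pool->cond_work, &pool->lock);
        if (!pool->head && pool->shutdown) {   /* drain, then exit */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        Task *t = pool->head;                  /* pop from list head */
        pool->head = t->next;
        if (!pool->head) pool->tail = NULL;
        pthread_mutex_unlock(&pool->lock);
        t->func(t->arg);
        free(t);
    }
}

static void pool_submit(Pool *pool, void (*func)(void *), void *arg)
{
    Task *t = malloc(sizeof *t);
    t->func = func; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->lock);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->lock);
}

/* Example task used below: atomically bump a counter. */
static pthread_mutex_t g_cnt_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_cnt;
static void bump(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&g_cnt_lock);
    g_cnt++;
    pthread_mutex_unlock(&g_cnt_lock);
}
```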
Thread-Local Context Layout
struct ThreadLocalContext { // 280 bytes (0x118), per-thread via pthread_getspecific
uint64_t error_flags; // +0: error/warning state flags
uint64_t has_error; // +8: nonzero if error occurred
// +16..+48: internal fields (jmp_buf pointer, pool pointer, counters)
uint8_t diag_suppress; // +49: diagnostic suppression flag
uint8_t werror_flag; // +50: --Werror promotion flag
// +51..+127: reserved / internal state
pthread_cond_t cond; // +128: condition variable (48 bytes)
pthread_mutex_t mutex; // +176: per-thread mutex (40 bytes)
sem_t sem; // +216: semaphore (32 bytes)
// +248..+279: linked-list pointers (global thread list at +256/+264)
};
Accessed by sub_4280C0 (3,928 callers -- the single most-called function in the binary). On first call in a new thread, allocates and initializes via malloc(0x118) + memset + pthread_cond_init + pthread_mutex_init + sem_init. The decompiled code confirms the 280-byte size: v5 = malloc(0x118u), followed by memset(v5, 0, 0x118u), pthread_cond_init(v5 + 128), pthread_mutex_init(v5 + 176), sem_init(v5 + 216). After initialization, the struct is inserted into a global doubly-linked list (offsets +256 and +264 hold prev/next pointers, protected by a global mutex). The pthread_setspecific(key, v5) call stores the pointer for subsequent pthread_getspecific retrieval.
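The lazy-initialization pattern the decompile describes (pthread_getspecific, allocate-and-init on first call, pthread_setspecific) can be sketched as below. This is a reduced stand-in, not the recovered struct: only a few fields are kept, and the names `TlsCtx` / `get_tls_ctx` are ours.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the 280-byte ThreadLocalContext above. */
typedef struct {
    unsigned long error_flags;
    unsigned long has_error;
    pthread_mutex_t mutex;
} TlsCtx;

static pthread_key_t  g_key;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;

static void make_key(void) { pthread_key_create(&g_key, free); }

/* Analogue of sub_4280C0: return this thread's context,
   allocating and zero-initializing it on first use. */
static TlsCtx *get_tls_ctx(void)
{
    pthread_once(&g_once, make_key);
    TlsCtx *ctx = pthread_getspecific(g_key);
    if (!ctx) {
        ctx = malloc(sizeof *ctx);        /* cf. malloc(0x118)      */
        memset(ctx, 0, sizeof *ctx);      /* cf. memset(v5,0,0x118) */
        pthread_mutex_init(&ctx->mutex, NULL);
        pthread_setspecific(g_key, ctx);
    }
    return ctx;
}
```

The real function additionally links each context into a global doubly-linked list under a global mutex, which this sketch omits.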
Key Function Call Chain
The top-level control flow from program entry to ELF output:
main (0x409460, 84 bytes)
| setvbuf(stdout/stderr, unbuffered)
v
sub_446240 (0x446240, 11KB) ---- "Top-level compilation driver"
|
|-- sub_434320 (0x434320, 10KB) -- Parse CLI options, validate flags
| reads: --gpu-name, --maxrregcount, --opt-level, --verbose,
| --compiler-stats, --split-compile, --fast-compile
|
|-- [allocate "Top level ptxas memory pool"]
|-- [allocate "Command option parser" pool]
|
|-- sub_445EB0 (setup) ----------- Target configuration, texturing mode
| sub_43A400 --------------- SM-specific defaults ("ptxocg.0.0")
| sub_43B660 --------------- Register/resource constraint calculation
|
|-- sub_451730 (0x451730, 14KB) -- Parser initialization
| | "PTX parsing state" pool allocation
| | Builtin symbol table: %ntid, %laneid, %smid, %clock64, ...
| | sub_46E000 (93KB) ---- Opcode-to-handler dispatch table (1168 callees)
| v
| [Flex lexer + Bison parser: PTX text -> AST]
|
|-- for each compile unit:
| sub_4428E0 (0x4428E0, 14KB) -- PTX input validation
| | .version/.target checks, ABI mode selection
| | --extensible-whole-program, --compile-only handling
| |
| sub_43CC70 (5.4KB) --------- Per-entry unit processor
| | skip __cuda_dummy_entry__
| | generate .sass and .ucode sections
| |
| sub_7FBB70 (198 bytes) ----- Per-kernel entry point
| |
| sub_7FB6C0 (1.2KB) ------- Pipeline orchestrator
| | check knob 298 (NamedPhases mode)
| | if NamedPhases: delegate to sub_9F63D0
| | else:
| | sub_C62720 -- PhaseManager constructor (159 phases)
| | sub_C60D20 -- get default phase table (at 0x22BEEA0)
| | sub_C64F70 -- execute all phases
| | cleanup: destroy 17 analysis data structures
| v
| [159-phase pipeline: optimization -> regalloc -> scheduling -> encoding]
|
|-- sub_612DE0 (0x612DE0, 47KB) -- Kernel finalizer / ELF builder
| | "Finalizer fastpath optimization"
| | version: "Cuda compilation tools, release 13.0, V13.0.88"
| | build: "Build cuda_13.0.r13.0/compiler.36424714_0"
| |
| sub_1CB53A0 (13KB) ------- ELF world initializer (672-byte object)
| | "elfw memory space", .shstrtab, .strtab, .symtab
| |
| sub_1CB3570 (10KB) ------- Add .text.FUNCNAME sections (44 callers)
| sub_1CB68D0 (49KB) ------- Symbol table builder
| sub_1CABD60 (67KB) ------- Section layout & memory allocation
| sub_1CD48C0 (22KB) ------- Relocation resolver
| sub_1C9B110 (23KB) ------- Mercury capsule builder (capmerc)
| sub_1C9F280 (97KB) ------- Master ELF emitter (largest in range)
| sub_1CD13A0 (11KB) ------- Final file writer
|
v
[report CompileTime, PeakMemoryUsage, per-phase breakdown]
Memory Pools
ptxas uses a custom hierarchical pool allocator (sub_424070 / sub_4248B0, the most-called allocation functions with 3,809 and 1,215 callers respectively) instead of the system malloc/free. Three named pools are created during the top-level driver:
| Pool Name | Created By | Lifetime | Purpose |
|---|---|---|---|
"Top level ptxas memory pool" | sub_446240 | Entire compilation | Global allocations, cross-kernel data structures |
"Command option parser" | sub_446240 | Entire compilation | CLI option storage, flag validation state |
"Permanent OCG memory pool" | OCG initialization | Per-kernel | Optimization phase state, instruction IR, register maps |
Additional per-subsystem pools exist:
"PTX parsing state"-- created bysub_451730, holds the lexer/parser symbol tables and AST nodes"elfw memory space"-- created bysub_1CB53A0, holds the ELF world object (672 bytes) and section data
Pool Allocator Internals
The allocator at sub_424070 implements a dual-path design:
- Small allocations (up to 4,999 bytes / 0x1387): 8-byte-aligned, size-class binned free lists at pool struct offset +2128. Pop from the free list head on alloc, push back on free.
- Large allocations (above 4,999 bytes): boundary-tag allocator with coalescing of adjacent free blocks.
- Thread safety: pthread_mutex_lock/unlock around all pool operations, mutex at pool struct offset +7128.
- OOM handling: calls sub_42BDB0 (3,825 callers), which triggers a longjmp-based fatal abort via sub_42F590.
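The small-allocation path (size-class binned free lists, 8-byte alignment, push/pop at the list head) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the recovered allocator: the bin math, struct names, and malloc fallback are ours; the 4,999-byte limit and head push/pop discipline come from the decompile.

```c
#include <assert.h>
#include <stdlib.h>

#define SMALL_LIMIT 4999                 /* 0x1387, from the decompile */
#define ALIGN       8
#define NUM_BINS    ((SMALL_LIMIT + ALIGN) / ALIGN)

/* Hypothetical small-allocation path: one free list per 8-byte
   size class (cf. the bins at pool struct offset +2128). */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

typedef struct {
    FreeNode *bins[NUM_BINS];
} SmallPool;

static size_t bin_index(size_t size)
{
    return (size + ALIGN - 1) / ALIGN;   /* round up to 8 bytes */
}

static void *pool_alloc_small(SmallPool *p, size_t size)
{
    size_t i = bin_index(size);
    if (p->bins[i]) {                    /* pop from free-list head */
        FreeNode *n = p->bins[i];
        p->bins[i] = n->next;
        return n;
    }
    return malloc(i * ALIGN);            /* fresh bin-sized block */
}

static void pool_free_small(SmallPool *p, void *ptr, size_t size)
{
    size_t i = bin_index(size);
    FreeNode *n = ptr;                   /* push onto free-list head */
    n->next = p->bins[i];
    p->bins[i] = n;
}
```

The large path (boundary tags with coalescing) and the pool mutex are omitted here; in the real allocator every operation runs under the mutex at pool offset +7128.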
Pipeline Stage Breakdown
Terminology note. The 6 stages below (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) correspond to the 6 timed phases measured by
--compiler-stats. They cover the entire program lifecycle. The OCG stage (Stage 4 here) is itself subdivided into 10 internal stages in the Pass Inventory, numbered OCG-Stage 1--10. To avoid confusion, cross-references use "timed phase" for the 6 whole-program stages and "OCG stage" for the 10 optimizer sub-stages.
Stage 1: Parse (Parse-time)
The Flex-generated scanner and Bison-generated parser consume PTX text and produce an internal AST. The opcode dispatch table at sub_46E000 (93KB, 1,168 callees) registers type-checking rules for every PTX instruction. Thirty separate validator functions (in 0x460000-0x4D5000) enforce SM architecture requirements, PTX version constraints, operand types, and state space compatibility. See PTX Parser.
Stage 2: CompileUnitSetup (CompileUnitSetup-time)
Target configuration via sub_43A400: sets SM-specific defaults (texturing mode, cache policies, def-load-cache, force-load-cache), applies --fast-compile shortcuts, configures ABI (parameter registers, return address register, scratch registers). Register constraints computed by sub_43B660 from .maxnreg, --maxrregcount, .minnctapersm, and .maxntid directives. See Entry Point & CLI.
Stage 3: DAGgen (DAGgen-time)
Lowers the validated PTX AST into the Ori intermediate representation: basic blocks with a control flow graph, virtual registers, and memory space annotations. Special PTX registers (%ntid, %laneid, %smid, %ctaid, etc.) are mapped to internal identifiers. Operand processing at sub_6273E0 (44KB) handles address computation with a 6-bit operand type encoding. See PTX-to-Ori Lowering.
Stage 4: OCG (OCG-time)
The core of ptxas: the 159-phase Optimized Code Generation pipeline. This single timed phase encompasses:
- Early optimization (phases 13-36): general optimization, branch/switch, loop simplification, strength reduction, unrolling, pipelining, barrier removal
- Mid-level optimization (phases 37-58): GVN/CSE, reassociation, commoning, late expansion, speculative hoisting
- Late optimization (phases 59-95): loop fusion, predication, GMMA propagation, legalization
- Register allocation (phase 101): Fatpoint algorithm
- Instruction scheduling (phases 97, 106, 111): pre-schedule, post-schedule, post-fixup
- Mercury encoding (phases 113-122): SASS binary format generation
The PhaseManager (sub_C62720) instantiates phases via a 159-case factory switch (sub_C60D30), each phase a 16-byte polymorphic object with a vtable providing execute(), isNoOp(), and getName() methods. See Optimization Pipeline.
Stage 5: ELF (ELF-time)
The finalizer sub_612DE0 (47KB) assembles the NVIDIA ELF/cubin from the compiled SASS. Section layout (sub_1CABD60, 67KB) assigns addresses to shared memory, constant banks (with OCG deduplication), local memory, and reserved shared memory (.nv.reservedSmem.begin/cap/offset0). The master ELF emitter sub_1C9F280 (97KB) constructs headers, section tables, and program headers. Three binary output modes exist:
- mercury -- traditional SASS binary format
- capmerc -- Capsule Mercury (default on sm_100+), embeds PTX source in .nv.merc.* sections
- sass -- direct SASS output
See ELF/Cubin Output.
Stage 6: DebugInfo (DebugInfo-time)
DWARF debug information generation: .debug_info, .debug_line, .debug_frame sections. The LEB128 encoder at sub_45A870 handles all variable-length integer encoding. Source location tracking uses the location map (hash map at sub_426150/sub_426D60) with file offset caching every 10 lines for fast random access. Labels follow the pattern .L__$locationLabel$__%d. See Debug Information.
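The variable-length integer format handled by the LEB128 encoder (sub_45A870) is the standard DWARF ULEB128 scheme: seven value bits per byte, with the high bit set on every byte except the last. A minimal sketch of the unsigned encoder (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standard unsigned LEB128 encoding as used throughout DWARF.
   Returns the number of bytes written to `out`. */
static size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;   /* low 7 bits */
        value >>= 7;
        if (value) byte |= 0x80;       /* continuation bit */
        out[n++] = byte;
    } while (value);
    return n;
}
```

Values below 128 encode in one byte, which is why DWARF line tables and attribute forms lean heavily on small integers.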
Error Paths and Recovery
ptxas uses setjmp/longjmp as its sole error recovery mechanism -- there are no C++ exceptions (the binary is compiled as C). Three nested recovery points exist, each catching progressively more localized failures.
Recovery Point Hierarchy
sub_446240 (top-level driver)
setjmp(jmp_buf_global) // Level 1: catches any fatal anywhere
|
sub_43A400 (per-kernel worker)
setjmp(jmp_buf_kernel) // Level 2: catches per-kernel fatals
|
sub_432500 (finalization bridge)
setjmp(jmp_buf_local) // Level 3: catches OCG pipeline fatals
|
[159-phase pipeline, regalloc, encoding, ELF]
Level 1 (global). Established by sub_446240 at function entry. If any code path anywhere in ptxas calls sub_42FBA0 with severity >= 6 (fatal), execution longjmps back here. The handler cleans up global resources and returns a non-zero exit code. This is the last-resort handler.
Level 2 (per-kernel). Established by sub_43A400 before the OCG pipeline runs. On longjmp, the handler destroys the partially-compiled kernel's state, clears the error flags in the TLS context, and continues to the next kernel. This allows multi-kernel compilations to survive a single kernel's failure.
Level 3 (finalization). Established by sub_432500, which saves and replaces the TLS jmp_buf pointer for nested recovery. On longjmp: restores the previous jmp_buf, sets error_flags = 1, releases output buffers, and calls report_internal_error(). Execution returns false to the Level 2 handler.
Parse Error Recovery
Parse errors in sub_451730 (the Flex/Bison parser) invoke sub_42FBA0 with the message "syntax error":
- Severity 4--5 (non-fatal error): The error is printed with file:line location, and the parser attempts to continue via Bison's error recovery rules. Multiple non-fatal parse errors can accumulate. After parsing completes, if the error count > 0, the compilation is aborted before entering the OCG pipeline.
- Severity 6 (fatal): Triggers longjmp to the Level 1 handler immediately. The parser state pool is leaked (an accepted trade-off, since the process is about to exit).
Bison error recovery operates through the error token in the grammar. When the parser encounters a token that matches no production, it discards tokens until it finds one that allows the error production to reduce, then resumes parsing. This provides reasonable error recovery for common mistakes (missing semicolons, misspelled opcodes) but can cascade badly for structural errors (mismatched braces, corrupt PTX).
Phase Failure in PhaseManager
The phase executor sub_C64F70 runs each phase by calling its vtable execute() method. There is no explicit per-phase error check -- phases that detect internal errors call the diagnostic emitter sub_42FBA0 directly. The error handling cascade:
- Non-fatal phase error (severity 3--5): The error is printed and the error flag is set in the TLS context. The PhaseManager continues executing subsequent phases. This allows multiple diagnostics to be collected in a single run.
- Fatal phase error (severity 6): Triggers longjmp to Level 2 or Level 3. The current kernel's compilation is aborted. The PhaseManager's loop is unwound non-locally -- no cleanup of intermediate phase state occurs. Resources are reclaimed when the per-kernel memory pool is destroyed.
- OOM during phase execution: Any allocation failure calls sub_42BDB0 (3,825 callers), which forwards to sub_42F590 with a severity-6 descriptor at unk_29FA530. This always triggers longjmp.
The PhaseManager logs phase transitions using "Before <phase>" and "After <phase>" string construction (visible in sub_C64F70). When DUMPIR is set to a phase name, the IR is dumped to a file after that phase completes. This enables bisection of phase failures: --knob DUMPIR=<phase_name> isolates which phase corrupted the IR.
Register Allocation Failure and Retry
The register allocator has its own retry mechanism that operates within the normal pipeline (not via longjmp). The retry driver sub_971A90 (355 lines) wraps the Fatpoint allocator in a two-phase strategy:
Phase 1 -- NOSPILL. Attempt allocation without spilling. If the allocator fits within the register budget, proceed directly to finalization.
Phase 2 -- SPILL retry loop. If NOSPILL fails:
- The spill guidance engine sub_96D940 (84 KB) computes per-register-class spill candidates
- The allocator retries with progressively more aggressive spilling, up to N attempts (controlled by knobs 638/639)
- Each attempt prints "-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d" (note: the typo "attemp" is in the binary)
- The best result across all attempts is tracked by sub_93D070
- The finalization function sub_9714E0 (290 lines) commits the best result or emits a fatal error
On allocation failure (all retry attempts exhausted):
Register allocation failed with register count of '%d'.
Compile the program with a higher register target
This error is emitted by sub_9714E0 through two paths: with source location (via sub_895530, including function name and PTX line number) or without source location (via sub_7EEFA0, generic). After this error, sub_9714E0 returns with HIBYTE(status) set, causing the retry driver to clear all register assignments to -1 and propagate the failure.
A dedicated DUMPIR hook exists: "Please use -knob DUMPIR=AllocateRegisters for debugging" -- this string (found at sub_9714E0's error path) directs users to dump the IR state before the allocator runs.
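The two-phase control structure described above (NOSPILL first, then a bounded retry loop that keeps the best result) can be sketched as follows. This is a shape sketch only: the callback type, struct, and toy allocator are ours, and the real driver's notion of "best" and its interaction with the spill guidance engine are more involved.

```c
#include <assert.h>

/* Hypothetical shape of the retry driver (cf. sub_971A90). */
typedef struct {
    int used_regs;     /* registers the attempt needed */
    int ok;            /* nonzero if within the budget */
} AllocResult;

typedef AllocResult (*AllocFn)(int aggressiveness, void *ir);

static AllocResult regalloc_with_retry(AllocFn alloc, void *ir,
                                       int budget, int max_attempts)
{
    AllocResult best = alloc(0, ir);            /* attempt 0: NOSPILL */
    if (best.used_regs <= budget) { best.ok = 1; return best; }
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        AllocResult r = alloc(attempt, ir);     /* spill harder each time */
        if (r.used_regs < best.used_regs) best = r;   /* track best */
        if (best.used_regs <= budget) break;
    }
    best.ok = best.used_regs <= budget;         /* else: fatal error path */
    return best;
}

/* Toy allocator for the usage example: each extra level of
   spilling frees four registers. */
static AllocResult toy_alloc(int aggressiveness, void *ir)
{
    (void)ir;
    AllocResult r = { 80 - 4 * aggressiveness, 0 };
    return r;
}
```

When `ok` comes back zero after all attempts, the real driver clears all register assignments to -1 and emits the "Register allocation failed" diagnostic shown above.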
Fatal Error Handler Chain
The complete chain from any error site to process termination:
[any function, 2,350 call sites]
sub_42FBA0(descriptor, location, ...) // central diagnostic emitter
| checks descriptor[0] for severity
| severity 0: silently ignored
| severity 1-2: prints "info " message
| severity 3: prints "warning " (or "error " if TLS[50] Werror flag set)
| severity 4-5: prints "error " / "error* " (non-fatal)
| severity 6: prints "fatal " then:
v
longjmp(tls->jmp_buf, 1)
| unwinds to nearest setjmp recovery point
v
[Level 3] sub_432500 -> restore jmp_buf, set error_flags, return false
[Level 2] sub_43A400 -> cleanup kernel state, continue to next kernel
[Level 1] sub_446240 -> cleanup global state, exit(non-zero)
Resource leak note. Because longjmp bypasses normal stack unwinding, all heap allocations made between the setjmp and the fatal error are leaked unless tracked in a pool. This is why ptxas uses pool allocators -- the per-kernel pool can be destroyed wholesale at the Level 2 recovery point, reclaiming all leaked memory without tracking individual allocations.
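The severity dispatch in the chain above -- return to the caller for severities 0-5, longjmp for severity 6 -- can be sketched as follows. This is a hypothetical reconstruction: the function names, the single global jmp_buf (the real one lives in the TLS context), and the exact prefix strings are ours, modeled on the table in the diagram.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/* Stand-ins for the TLS jmp_buf and error flag. */
static jmp_buf g_recover;
static int g_error_flag;

/* Severity-to-prefix mapping per the dispatch table above;
   the Werror flag promotes severity-3 warnings to errors. */
static const char *severity_prefix(int sev, int werror)
{
    if (sev <= 0) return NULL;                 /* silently ignored */
    if (sev <= 2) return "info    ";
    if (sev == 3) return werror ? "error   " : "warning ";
    if (sev <= 5) return "error   ";
    return "fatal   ";
}

/* Hypothetical analogue of the central emitter (cf. sub_42FBA0):
   print, record the error, and unwind non-locally on fatal. */
static void emit_diag(int sev, int werror, const char *msg)
{
    const char *p = severity_prefix(sev, werror);
    if (!p) return;
    fprintf(stderr, "%s: %s\n", p, msg);
    if (sev >= 4) g_error_flag = 1;
    if (sev >= 6) longjmp(g_recover, 1);       /* to nearest setjmp */
}
```

A caller establishes a recovery point with `if (setjmp(g_recover) == 0) { ... }`; the nonzero return path plays the role of the Level 1-3 handlers.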
Architecture Dispatch
An architecture vtable factory at sub_1CCEEE0 (17KB, 244 callees) constructs a 632-byte vtable object (79 function pointers) based on the target SM version. The version dispatch ranges:
| Range | Architecture | Generation | Status in v13.0.88 |
|---|---|---|---|
| sm_30-39 | Kepler | 1st gen | Validation only -- accepted by bsearch in unk_1D16220, but no codegen factory, no capability dispatch handlers, and no SASS encoders ship for these targets. Compilation fails after parsing. |
| sm_50-59 | Maxwell | 2nd gen | Validation only -- same as Kepler. Present in the base validation table for backward-compatible PTX version/target checking, but no backend support. |
| sm_60-69 | Pascal | 3rd gen | Validation only -- same as above. The codegen factory value 24576 (6 << 12) is referenced in comparison thresholds but no Pascal-specific encoder tables exist. |
| sm_70-73 | Volta | 4th gen | Validation only -- sm_70, sm_72, sm_73 are in the base table but have no active capability dispatch handlers in sub_607DB0. |
| sm_75 | Turing | 4th gen | Active -- lowest SM with full codegen support (factory 24577). |
| sm_80-89 | Ampere / Ada | 5th gen | Active -- factory 28673. |
| sm_90 | Hopper | 6th gen | Active -- factory 32768. |
| sm_100-110 | Blackwell | 7th gen | Active -- factory 36864. |
| sm_120-121 | Consumer / DGX Spark | 7th gen (desktop) | Active -- factory 36864 (shared with Blackwell datacenter). |
The distinction between "validation only" and "active" is critical: the base validation table at unk_1D16220 contains 32 entries including all legacy SMs back to sm_20, allowing ptxas to parse PTX files that declare .target sm_30 without immediately rejecting them. However, the capability dispatch initializer sub_607DB0 only registers handler functions for sm_75 through sm_121 (13 base targets). Attempting to compile code for an unregistered SM produces a fatal error during codegen factory lookup -- the architecture vtable factory sub_1CCEEE0 cannot construct a backend object for these targets.
The legacy codegen factory values (12288 for sm_30, 16385/20481 for sm_50, 24576 for sm_60) survive as comparison constants in feature-gating checks throughout the backend (e.g., if (factory_value > 28673) gates sm_90+ features), but the code paths they would activate no longer exist.
Each vtable entry is a function pointer to an SM-specific implementation of a codegen or emission primitive. This is the central dispatch mechanism for all architecture-dependent behavior in the backend.
Obfuscation: ROT13 Encoding
All internal identifiers in ptxas's static initializers are ROT13-encoded:
- Opcode table (`ctor_003` at `0x4095D0`, 17KB): 900+ PTX opcode mnemonics. Example: `NPDOHYX` decodes to `ACQBULK`, `SZN` decodes to `FMA`, `RKVG` decodes to `EXIT`.
- General knob table (`ctor_005` at `0x40D860`, 80KB): 2,000+ Mercury/OCG tuning knob names with hex default values. Example: `ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf` decodes to `MercuryUseActiveThreadCollectiveInsts`.
- Scheduler knob table (`ctor_007` at `0x421290`, 8KB): 98 scheduler-specific knob names. Examples: `XBlockWaitOut`, `ScavInlineExpansion`.
The ROT13 decoding is performed inline during lookup (in sub_79B240, GetKnobIndex) using character-range detection: bytes in A-M get +13, bytes in N-Z get -13, with case-insensitive comparison via tolower().
Cross-References
- Binary Layout -- address ranges for every subsystem
- Function Map -- master index of recovered function addresses
- CLI Options -- complete flag catalog
- Knobs System -- 1,294 internal tuning parameters
- Optimization Levels -- what changes at `-O0` / `-O1` / `-O2` / `-O3`
- Phase Manager -- PhaseManager object layout and dispatch
- Memory Pool Allocator -- pool struct layout and allocation algorithm
- Thread Pool & Concurrency -- thread pool struct, task submit, jobserver
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
0x409460 | 84 B | -- | main (entry point) | CERTAIN |
0x446240 | 11 KB | 1 | Top-level compilation driver | HIGH |
0x434320 | 10 KB | 1 | CLI option parser + validator | HIGH |
0x445EB0 | -- | 1 | Target configuration setup | HIGH |
0x43A400 | 4.7 KB | 1 | SM-specific default configuration | HIGH |
0x43B660 | 3.8 KB | 1 | Register/resource constraint calculator | HIGH |
0x451730 | 14 KB | 1 | Parser init + special register setup | HIGH |
0x46E000 | 93 KB | 1 | Opcode dispatch table builder (1,168 callees) | HIGH |
0x4428E0 | 14 KB | 1 | PTX input validation + preprocessing | HIGH |
0x43CC70 | 5.4 KB | 1 | Per-entry compilation unit processor | HIGH |
0x7FBB70 | 198 B | vtable | Per-kernel entry point | CERTAIN |
0x7FB6C0 | 1.2 KB | 1 | Pipeline orchestrator | CERTAIN |
0xC62720 | 4.7 KB | 1 | PhaseManager constructor | VERY HIGH |
0xC60D30 | 3.6 KB | 1 | Phase factory (159-case switch) | VERY HIGH |
0xC64F70 | -- | 1 | Phase executor | HIGH |
0x9F63D0 | 342 B | 1 | NamedPhases executor | VERY HIGH |
0x612DE0 | 47 KB | 1 | Kernel finalizer / ELF builder | HIGH |
0x1C9F280 | 97 KB | 1 | Master ELF emitter | HIGH |
0x1CB53A0 | 13 KB | 1 | ELF world initializer (672-byte object) | HIGH |
0x1CABD60 | 67 KB | 1 | Section layout & memory allocator | HIGH |
0x1CD13A0 | 11 KB | 2 | Final ELF file writer | HIGH |
0x1CB18B0 | ~200 B | 1 | Thread pool constructor | HIGH |
0x1CB1A50 | ~200 B | N | Thread pool task submit | HIGH |
0x1CC7300 | 8 KB | 1 | GNU Make jobserver client | HIGH |
0x1CCEEE0 | 17 KB | 3 | Architecture vtable factory | HIGH |
0x424070 | 2.1 KB | 3,809 | Pool allocator: alloc(pool, size) | HIGH |
0x4248B0 | 923 B | 1,215 | Pool allocator: free(ptr) | HIGH |
0x4280C0 | 597 B | 3,928 | Thread-local context accessor | HIGH |
0x426150 | 2.5 KB | 2,800 | Hash map: put(map, key, value) | HIGH |
0x42FBA0 | 2.4 KB | 2,350 | Diagnostic message emitter | HIGH |
Entry Point & CLI
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas binary has a deceptively simple entry point. The exported main at 0x409460 is an 84-byte wrapper that sets up unbuffered I/O and immediately tail-calls sub_446240 -- the real compilation driver. This driver is a monolithic 11 KB function that allocates a 1,352-byte master options block on the stack, establishes setjmp-based error recovery, parses all command-line options through a generic framework, reads PTX input, and then loops over compile units running the full Parse -> CompileUnitSetup -> DAGgen -> OCG -> ELF -> DebugInfo pipeline for each. The entire error-handling strategy is non-local: any of the 2,350 call sites to the central diagnostic emitter sub_42FBA0 can trigger a longjmp back to the driver's recovery point on fatal errors.
The same binary doubles as an in-process library. When nvcc loads ptxas as a shared object rather than spawning a subprocess, three extra arguments to the driver carry an output buffer pointer, an extra option count, and an extra options array. Callback function pointers at fixed offsets in the options block allow the host process to receive diagnostics and progress notifications without going through stderr.
| main() | 0x409460 (84 bytes) -- setvbuf + tail-call to sub_446240 |
| Real main | sub_446240 (11,064 bytes, ~900 lines) |
| Options block | 1,352 bytes on stack |
| Error recovery | setjmp / longjmp (no C++ exceptions) |
| Option registration | sub_432A00 (6,427 bytes, ~100 options via sub_1C97210) |
| Option parser | sub_434320 (10,289 bytes, ~800 lines) |
| Diagnostic emitter | sub_42FBA0 (2,388 bytes, 2,350 callers, 7 severity levels) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers, 280-byte per-thread struct) |
| Pipeline phases | Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo |
| Library mode | sub_446240(argc, argv, output_buf, extra_opt_count, extra_opts) |
Architecture
main (0x409460, 84B)
│
├─ nullsub_1(*argv) // store program name (no-op)
├─ setvbuf(stdout, _IONBF)
├─ setvbuf(stderr, _IONBF)
│
└─ sub_446240(argc, argv, 0, 0, 0) // REAL MAIN
│
├─ setjmp(jmp_buf) // fatal error recovery point
│
├─ sub_434320(opts_block, ...) // OPTION PARSER
│ └─ sub_432A00(...) // register ~100 options via sub_1C97210
│
├─ sub_4428E0(...) // PTX INPUT SETUP
│ ├─ validate .version / .target
│ ├─ handle --input-as-string
│ └─ generate __cuda_dummy_entry__ if --compile-only
│
├─ sub_43A400(...) // TARGET CONFIGURATION
│ └─ set cache defaults, texmode, arch-specific flags
│
├─ FOR EACH compile unit:
│ ├─ sub_451730(...) // parser/lexer init + special regs
│ ├─ sub_43B660(...) // register constraint calculator
│ ├─ sub_43F400(...) // function/ABI setup
│ └─ sub_43CC70(...) // per-entry: DAGgen → OCG → ELF → DebugInfo
│
├─ timing / memory stats output (--compiler-stats)
│
└─ cleanup + return exit code
Pre-main Static Constructors
Before main executes, four static constructors run as part of the ELF .init_array. Three of them populate ROT13-obfuscated lookup tables that are foundational to the rest of the binary. This obfuscation is deliberate -- it prevents trivial string searching for internal opcode names and tuning knob identifiers in the stripped binary.
ctor_001 -- Thread Infrastructure (0x4094C0, 204 bytes)
Initializes the POSIX threading foundation used throughout ptxas:
pthread_key_create(&key, destr_function); // TLS key for sub_4280C0
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
__cxa_atexit(cleanup_func, ...); // registered destruction
The TLS key created here is the one used by sub_4280C0 (3,928 callers), making it the single most important piece of global state in the binary.
ctor_003 -- PTX Opcode Name Table (0x4095D0, 17,007 bytes)
Populates a table at 0x29FE300+ with approximately 900 ROT13-encoded PTX opcode mnemonic strings. Each entry is a (string_ptr, length) pair. The ROT13 encoding maps A-Z to N-Z,A-M and a-z to n-z,a-m, leaving digits and punctuation unchanged.
| Encoded | Decoded | Instruction |
|---|---|---|
NPDOHYX | ACQBULK | Bulk acquire |
NPDFUZVAVG | ACQSHMINIT | Shared memory acquire init |
OFLAP | BSYNC | Barrier sync |
PPGY.P | CCTL.C | Cache control |
SZN | FMA | Fused multiply-add |
FRGC | SETP | Set predicate |
ERGHEA | RETURN | Return |
RKVG | EXIT | Thread exit |
These decoded names are the canonical PTX opcode mnemonics used during parsing and validation. The table is consumed by the PTX lexer initialization at sub_451730 and the opcode-to-handler dispatch table at sub_46E000 (93 KB, the largest function in the front-end range).
ctor_005 -- Mercury Tuning Knob Registry (0x40D860, 80,397 bytes)
The single largest function in the front-end address range. Registers 2,000+ ROT13-encoded internal tuning knob names, each paired with a hexadecimal default value string. These are the "Mercury" (OCG) backend tuning parameters that control every aspect of code generation, scheduling, and register allocation.
| Encoded Name | Decoded Name | Default |
|---|---|---|
ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts | 0x3e40 |
ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency | — |
ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial | — |
ZrephelZreteCebybthrOybpxf | MercuryMergePrologueBlocks | — |
ZrephelTraFnffHPbqr | MercuryGenSassUCode | — |
The knob system is documented in detail in the Knobs System page. The ROT13 encoding applies identically to all knob name strings in all four constructors.
ctor_007 -- Scheduler Knob Registry (0x421290, 7,921 bytes)
A smaller companion to ctor_005 that registers 98 scheduler-specific knobs. These control the instruction scheduler (Mercury/OCG) behavior at a finer granularity than the general knobs:
Decoded examples: XBlockWaitOut, XBlockWaitInOut, XBlockWaitInOnTarget, WarDeploySyncsFlush_SW4397903, WaitToForceCTASwitch, VoltageWar_SW4981360PredicateOffDummies, TrackMultiReadsWarLatency, ScavInlineExpansion, ScavDisableSpilling.
Knob names containing _SW followed by a number (e.g., _SW4397903) indicate workarounds for specific hardware bugs identified by NVIDIA's internal bug tracking system.
Real Main -- sub_446240
The exported main() tail-calls sub_446240 with three zero arguments appended. This function is the complete compilation orchestrator: it owns the options block, the error recovery, the compilation loop, and the statistics output.
| Field | Value |
|---|---|
| Address | 0x446240 |
| Size | 11,064 bytes |
| Stack frame | 1,352+ bytes (master options block + locals) |
| Callers | 1 (main) |
| Error recovery | setjmp at function entry |
Signature and Library Mode
int sub_446240(int argc, char **argv,
void *output_buf, // a3: cubin output buffer (NULL for standalone)
int extra_opt_count, // a4: count of extra options from nvcc
char **extra_opts); // a5: array of extra option strings
When main calls this, a3/a4/a5 are all zero -- standalone mode. When nvcc loads ptxas as a shared library and calls the entry point directly, these arguments carry non-null values:
- a3 (output_buf): Pointer to a memory buffer where the compiled cubin is written. Eliminates the need for temporary files and filesystem round-trips, which matters for large CUDA compilations where nvcc may invoke ptxas hundreds of times.
- a4 (extra_opt_count): Number of additional option strings injected by nvcc beyond what appears on the command line.
- a5 (extra_opts): Array of those extra option strings.
Additionally, callback function pointers at offsets 37--39 of the 1,352-byte options block (byte offsets ~296, ~304, ~312) allow the host process to receive progress notifications and diagnostic messages in-process rather than through stderr.
Error Recovery with setjmp/longjmp
The first significant action in sub_446240 is establishing a setjmp recovery point:
if (setjmp(jmp_buf) != 0) {
// Fatal error occurred somewhere in the pipeline.
// Clean up and return non-zero exit code.
goto cleanup;
}
This is the only error recovery mechanism in ptxas -- there are no C++ exceptions (the binary is compiled as C, not C++). Any function anywhere in the call tree that encounters an unrecoverable error calls sub_42FBA0 with severity >= 6, which internally calls longjmp(jmp_buf, 1) to unwind directly back to this point. The approach is simple but has a critical implication: all resources allocated between the setjmp and the fatal error are leaked unless explicitly tracked and cleaned up at the recovery site.
The 1,352-Byte Options Block
The master options block lives on the stack and accumulates all compilation state during option parsing. It is passed by pointer to virtually every subsystem. Key fields (approximate offsets based on access patterns):
| Offset Range | Purpose |
|---|---|
| 0--63 | Input/output file paths, PTX version, target SM |
| 64--127 | Optimization level, debug flags, cache modes |
| 128--255 | Register limits, occupancy constraints |
| 256--295 | Warning/error control flags |
| 296--319 | Library-mode callback function pointers (offsets 37--39) |
| 320--1351 | Per-pass configuration, knob overrides, feature flags |
Compilation Loop
After option parsing and PTX input setup, the driver enters a loop over compile units. Each unit corresponds to one entry function (or device function in --compile-only mode). The per-entry processing is handled by sub_43CC70, which prints a separator:
printf("\n# ============== entry %s ==============\n", entry_name);
and then sequences: DAGgen (PTX-to-Ori lowering), OCG (optimization and code generation), ELF (binary emission), and DebugInfo (DWARF generation). The special entry __cuda_dummy_entry__ is silently skipped.
Timing and Memory Statistics
When --compiler-stats is active, sub_446240 prints per-phase timing and peak memory after all compile units complete:
CompileTime = 42.3 ms (100%)
Parse-time : 12.1 ms (28.61%)
CompileUnitSetup-time : 1.4 ms ( 3.31%)
DAGgen-time : 8.7 ms (20.57%)
OCG-time : 15.2 ms (35.93%)
ELF-time : 3.8 ms ( 8.98%)
DebugInfo-time : 1.1 ms ( 2.60%)
PeakMemoryUsage = 2048.000 KB
When --compiler-stats-file is specified, the same data is written in JSON format using the shared JSON builder (sub_1CBA950). When --fdevice-time-trace is active, sub_439880 parses Chrome DevTools trace format JSON and merges ptxas timing events into the trace.
Option Parser -- sub_434320 and sub_432A00
Option parsing is split into two phases: registration and processing.
Option Registration -- sub_432A00
This 6,427-byte function calls sub_1C97210 approximately 100 times, once per recognized option. Each call provides the option's long name, short name, value type, help text, and default value to the generic option framework (implemented in the 0x1C96xxx--0x1C97xxx range, shared with other NVIDIA tools).
| Option | Short | Type | Help Text |
|---|---|---|---|
--arch | -arch | string | "Specify the 'sm_' name of the target architecture" |
--output-file | -o | string | "Specify name and location of the output file" |
--opt-level | -O | int | "Specify optimization level" |
--maxrregcount | — | int | "Specify the maximum number of registers" |
--register-usage-level | — | enum(0..10) | Register usage reporting level |
--verbose | -v | bool | Verbose output |
--version | -V | — | Print version and exit |
--compile-only | — | bool | Compile without linking |
--compile-functions | — | string | "Entry function name" |
--input-as-string | — | string | "PTX string" (compile from memory) |
--fast-compile | — | bool | Reduce compile time at cost of code quality |
--suppress-stack-size-warning | — | bool | Suppress stack size warnings |
--warn-on-local-memory-usage | — | bool | Warn when local memory is used |
--warn-on-spills | — | bool | Warn on register spills |
--warn-on-double-precision-use | — | bool | Warn on FP64 usage |
--compiler-stats | — | bool | Print compilation timing |
--compiler-stats-file | — | string | "/path/to/file" (JSON output) |
--fdevice-time-trace | — | string | Chrome trace JSON output |
--def-load-cache | — | enum | Default load cache operation |
--force-load-cache | — | enum | Force load cache operation |
--position-independent-code | — | bool | Generate PIC |
--compile-as-tools-patch | — | bool | CUDA sanitizer/tools patch mode |
--extensible-whole-program | — | bool | Whole-program compilation |
--cloning | — | enum(yes/no) | Inline cloning control |
--ptxlen | — | — | PTX length statistics |
--list-version / --version-ls | — | — | List supported PTX versions |
--disable-smem-reservation | — | bool | Disable shared memory reservation |
--generate-relocatable-object | -c | bool | Generate relocatable object |
Option Processing -- sub_434320
The 10,289-byte parser iterates over argv (and any extra options from library mode), matches each argument against registered options via the framework, and populates fields in the 1,352-byte options block. Special handling exists for:
- `--version`: Prints the identification string "Ptx optimizing assembler" followed by the version (e.g., "Cuda compilation tools, release 13.0, V13.0.88") and exits.
- `--help`: Delegates to `sub_403588`, which prints `"Usage : %s [options] <ptx file>,...\n"` followed by all registered options, then exits.
- `--fast-compile`: Validated against conflicting optimization options.
- `-cloning=yes` / `-cloning=no`: Inline cloning control parsed as an equality option.
Generic Option Framework
The option parsing library lives in the 0x1C96000--0x1C97FFF range and is shared with other NVIDIA tools (nvlink, fatbinary, etc.):
| Address | Identity | Role |
|---|---|---|
sub_1C960C0 | Option parser constructor | Creates the option parser state |
sub_1C96680 | Argv processor | Matches argv entries against registered options |
sub_1C97210 | Option registrar | Registers one option with name, type, help |
sub_1C97640 | Help printer | Iterates all registered options, prints help text |
Diagnostic System -- sub_42FBA0
The central diagnostic emitter is the most important error-reporting function in ptxas. With 2,350 call sites, it handles every warning, error, and fatal message in the entire binary.
Signature
void sub_42FBA0(
int *descriptor, // a1: points to severity level at *a1
void *location, // a2: source location context
... // variadic: printf-style format args
);
Severity Levels
| Level | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (none) | — | Suppressed -- message is silently discarded |
| 1 | "info " | @I@ | Informational |
| 2 | "info " | @I@ | Informational (alternate) |
| 3 | "warning " / "error " | @W@ / @E@ | Warning, promoted to error if TLS[50] set |
| 4 | "error* " | @O@ | Non-fatal error with special marker |
| 5 | "error " | @E@ | Non-fatal error |
| 6 | "fatal " | @E@ | Fatal -- triggers longjmp(jmp_buf, 1) |
The machine-readable tags (@E@, @W@, @O@, @I@) allow nvcc and other tools to parse ptxas output programmatically, extracting severity without parsing the human-readable text.
Warning-to-Error Promotion
Severity level 3 has context-dependent behavior controlled by two flags in the thread-local storage:
v5 = *a1; // severity
if (v5 == 3) {
if (sub_4280C0()[49]) // TLS offset 49: suppression flag
return; // silently discard
if (sub_4280C0()[50]) // TLS offset 50: Werror flag
prefix = "error ";
else
prefix = "warning ";
}
This implements the --Werror equivalent: when the Werror flag is active in the TLS context, all warnings become errors.
Output Format
<filename>, line <N>; <severity>: <message>
When source is available, the diagnostic emitter reads the PTX input file, seeks to line N, and prints the source line prefixed with "# ". To avoid O(n) seeking through large files on every diagnostic, it maintains a hash map (sub_426150/sub_426D60) that caches file byte offsets every 10 lines for fast random access to arbitrary line numbers.
Fatal Error Handler -- sub_42BDB0
A 14-byte wrapper called from 3,825 sites (nearly every allocation in ptxas). It fires whenever the pool allocator sub_424070 returns NULL:
void sub_42BDB0(...) {
return sub_42F590(&unk_29FA530, ...); // internal error descriptor
}
The descriptor at unk_29FA530 has severity 6 (fatal), so this always triggers longjmp back to the driver's recovery point.
Thread-Local Storage -- sub_4280C0
The most-called function in the entire binary (3,928 callers). Returns a pointer to a 280-byte per-thread context struct, allocating and initializing it on first access.
void *sub_4280C0(void) {
void *ctx = pthread_getspecific(key);
if (ctx) return ctx;
ctx = malloc(0x118); // 280 bytes
memset(ctx, 0, 0x118);
pthread_cond_init(ctx + 128, NULL);
pthread_mutex_init(ctx + 176, NULL);
sem_init(ctx + 216, 0, 0);
pthread_setspecific(key, ctx);
return ctx;
}
TLS Context Layout (280 bytes)
| Offset | Size | Type | Purpose |
|---|---|---|---|
| 0 | 8 | int/flags | Error/warning state flags |
| 8 | 8 | int | has_error flag |
| 49 | 1 | byte | Diagnostic suppression flag |
| 50 | 1 | byte | Werror promotion flag |
| 128 | 48 | pthread_cond_t | Condition variable |
| 176 | 40 | pthread_mutex_t | Per-thread mutex |
| 216 | 32 | sem_t | Semaphore for synchronization |
The TLS key is created by ctor_001 before main runs, and a destructor function registered via pthread_key_create frees the 280-byte struct when a thread terminates. This per-thread context enables concurrent compilation of multiple compile units (when the thread pool is active), with each thread maintaining independent error state and diagnostic suppression flags.
PTX Input Setup -- sub_4428E0
After options are parsed, this 13,774-byte function reads and preprocesses the PTX input:
- Version and target validation. Checks `.version` and `.target` directives in the input. Emits synthetic headers (`"\t.version %s\n"`, `"\t.target %s\n"`) when needed.
- Compile-only mode. When `--compile-only` is active and no real entries exist, generates a dummy entry, `"\t.entry %s { ret; }\n"`, with the name `__cuda_dummy_entry__`.
- Input-as-string mode. When `--input-as-string` is active, PTX is read from memory (passed as a string argument) rather than from a file. This is used by the library-mode interface.
- Whole-program mode. `--extensible-whole-program` enables inter-function optimization across all entries in the compilation unit.
- Cache and debug configuration. Applies `--def-load-cache`, `--def-store-cache`, `--force-load-cache`, `--force-store-cache`, and `suppress-debug-info` settings.
- Tools-patch mode. `--compile-as-tools-patch` activates the CUDA sanitizer compilation path, checking for `__cuda_sanitizer` symbols.
Key diagnostic strings from this function:
"'--fast-compile'""calls without ABI""compilation without ABI""device-debug or lineinfo""unified Functions"
Target Configuration -- sub_43A400
A 4,696-byte function that configures target-specific defaults after option parsing completes. It reads the SM architecture number from the options block and sets:
- Texturing mode: `texmode_unified` vs raw texture mode.
- Cache defaults: Based on architecture capabilities.
- Feature flags: Hardware-specific workaround flags (e.g., `--sw4575628`).
- Indirect function support: `"Indirect Functions or Extern Functions"` validation.
The function references "NVIDIA" and "ptxocg.0.0" (the internal name for the OCG optimization pass), suggesting it also initializes the pass pipeline configuration for the target architecture.
Register Constraint Calculator -- sub_43B660
A 3,843-byte function that resolves potentially conflicting register limit specifications into a single register budget per function. Register constraints come from four sources with different priorities:
| Source | Directive/Option | Priority |
|---|---|---|
| PTX directive | .maxnreg N | Per-function, highest priority |
| CLI option | --maxrregcount N | Global, overridden by .maxnreg |
| PTX directive | .minnctapersm N | Occupancy target, derived limit |
| PTX directive | .maxntid Nx,Ny,Nz | Thread block size, derived limit |
The occupancy-derived limit is computed from .minnctapersm and .maxntid: given a minimum number of CTAs per SM and a maximum thread count per CTA, the function calculates the maximum register count that allows the requested occupancy level, accounting for per-SM register file size.
Diagnostic strings indicate the resolution process:
- `"computed using thread count"` -- derived from `.maxntid`
- `"of .maxnreg"` -- explicit per-function limit
- `"of maxrregcount option"` -- CLI override
- `"global register limit specified"` -- global cap applied
Per-Entry Compilation -- sub_43CC70
A 5,425-byte function that processes each entry function through the complete backend pipeline. For each entry:
- Skips `__cuda_dummy_entry__` (generated by compile-only mode).
- Prints the entry separator: `"\n# ============== entry %s ==============\n"`.
- Runs DAGgen (PTX-to-Ori lowering).
- Runs OCG (the 159-phase optimization pipeline + SASS code generation).
- Generates `.sass` and `.ucode` ELF sections.
- Generates DWARF debug information if requested.
The function also handles reg-fatpoint configuration (the register allocation algorithm, documented in the Fatpoint Algorithm page).
Function/ABI Setup -- sub_43F400
A 9,078-byte function that configures the calling convention for each function before compilation. This includes:
| Resource | Diagnostic String |
|---|---|
| Parameter passing registers | "number of registers used for parameter passing" |
| First parameter register | "first parameter register" |
| Return address register | "return address register" |
| Scratch data registers | "scratch data registers" |
| Scratch control barriers | "scratch control barriers" |
| Call prototype | "callprotoype" (sic -- misspelled in binary) |
| Call target | "calltarget" |
The function handles both entry functions (kernels launched from the host) and device functions (callable from other device code), with different ABI requirements for each. Entry functions use a simplified ABI where parameters come from constant memory, while device functions use register-based parameter passing.
The --compile-as-tools-patch and --sw200428197 flags activate a special ABI variant for CUDA sanitizer instrumentation, which inserts additional scratch registers for sanitizer state.
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
0x409460 | 84 B | — | main (entry point thunk) |
0x4094C0 | 204 B | — | ctor_001 (thread infrastructure init) |
0x4095D0 | 17 KB | — | ctor_003 (ROT13 opcode table, ~900 entries) |
0x40D860 | 80 KB | — | ctor_005 (ROT13 knob registry, 2000+ entries) |
0x421290 | 8 KB | — | ctor_007 (scheduler knob registry, 98 entries) |
0x403588 | 75 B | 1 | Usage printer (--help) |
0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct) |
0x42BDB0 | 14 B | 3,825 | OOM fatal error handler |
0x42FBA0 | 2.4 KB | 2,350 | Central diagnostic emitter |
0x42F590 | — | 1 | Internal fatal error handler |
0x430570 | — | 2 | Program name getter |
0x432A00 | 6.4 KB | 1 | Option registration (~100 options) |
0x434320 | 10 KB | 1 | Option parser and validator |
0x439880 | 2.9 KB | 1 | Chrome trace JSON parser |
0x43A400 | 4.7 KB | 1 | Target configuration |
0x43B660 | 3.8 KB | 1 | Register constraint calculator |
0x43CC70 | 5.4 KB | 1 | Per-entry compilation processor |
0x43F400 | 9 KB | 1 | Function/ABI setup |
0x4428E0 | 13.8 KB | 1 | PTX input setup and preprocessing |
0x446240 | 11 KB | 1 | Compilation driver (real main) |
0x451730 | 14 KB | 1 | Parser/lexer init + special registers |
0x46E000 | 93 KB | 1 | Opcode-to-handler dispatch table builder |
0x1C960C0 | — | — | Option parser constructor |
0x1C96680 | — | — | Argv processor |
0x1C97210 | — | ~100 | Option registrar (per-option) |
0x1C97640 | — | 1 | Help text printer |
0x1CBA950 | — | — | JSON context constructor |
0x1CBAC20 | 2.9 KB | 3 | JSON recursive descent parser |
Cross-References
- Pipeline Overview -- full PTX-to-SASS compilation flow
- CLI Options -- complete option catalog
- Knobs System -- the 2,000+ Mercury tuning knobs registered in ctor_005
- Memory Pool Allocator -- the allocator (`sub_424070`) that calls `sub_42BDB0` on OOM
- Hash Tables & Bitvectors -- the hash map used by diagnostics for line offset caching
- Thread Pool & Concurrency -- thread pool that creates the TLS contexts
- PTX Parser -- the parser initialized by `sub_451730`
- Optimization Pipeline -- the 159-phase pipeline invoked per compile unit
- Fatpoint Algorithm -- register allocation referenced in per-entry compilation
PTX Parser (Flex + Bison)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas front-end parses PTX assembly text into internal IR using a classic two-stage architecture: a Flex-generated DFA scanner (lexer) and a Bison-generated LALR(1) shift-reduce parser. Unlike most compiler front-ends, the parser does not construct an AST. Instead, Bison reduction actions directly build IR nodes, populate the instruction table, and emit validation calls -- the parse tree is consumed inline and never materialized as a data structure. A separate macro preprocessor handles .MACRO, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives at the character level before tokens reach the Flex DFA. The instruction table builder (sub_46E000, 93 KB) registers all PTX opcodes with their legal type combinations during parser initialization, and an instruction lookup subsystem classifies operands into 12 categories at parse time.
| Flex scanner | sub_720F00 (15.8 KB, 64 KB with inlined helpers) |
| DFA table | off_203C020 (transition/accept array) |
| Scanner rules | ~552 Flex rules, 162 token types (codes 258--422) |
| Scanner prefix | ptx (all Flex symbols: ptxlex, ptxensure_buffer_stack, etc.) |
| Bison parser | sub_4CE6B0 (48 KB, spans 0x4CE6B0--0x4DA337) |
| Grammar size | ~512 productions, 443 reduction cases |
| LALR tables | word_1D146A0 (yydefact), word_1D121A0 (yycheck), word_1D13360 (yypact), word_1D150C0 (yypgoto), byte_1D15960 (yyr2) |
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry), sub_46C6E0 (6.4 KB descriptor matcher) |
| Macro preprocessor | sub_71F630 (14 KB dispatcher), sub_71E2B0 (32 KB conditional handler) |
| Parser state object | 1,128 bytes (+ 2,528-byte lexer state via pointer at +1096) |
| Error handler | sub_42FBA0 (2,350 callers, central diagnostics) |
| Parser init | sub_451730 (14 KB, symbol table + special registers + opcode table) |
Architecture
PTX source text
│
▼
┌─────────────────────────────────────────────────────────┐
│ MACRO PREPROCESSOR (character-level, 0x71B000-0x720000)│
│ sub_71F630 dispatch: .MACRO / .ELSE / .INCLUDE │
│ sub_71E2B0 conditional: .ELSE / .ELIF / .ENDIF (32KB) │
│ sub_71DCA0 macro definition handler │
│ sub_71C310 .INCLUDE file handler │
└────────────────────┬────────────────────────────────────┘
│ preprocessed character stream
▼
┌─────────────────────────────────────────────────────────┐
│ FLEX DFA SCANNER sub_720F00 (15.8KB, 552 rules) │
│ off_203C020 DFA transition table │
│ Token codes: 258-422 (162 types) │
│ Helper: sub_720410 (yy_get_next_buffer) │
│ sub_720630 (yy_get_previous_state) │
│ sub_720BA0 (yy_scan_string) │
└────────────────────┬────────────────────────────────────┘
│ token stream (code + attribute)
▼
┌─────────────────────────────────────────────────────────┐
│ BISON LALR(1) PARSER sub_4CE6B0 (48KB, 512 prods) │
│ 5 LALR tables at 0x1D12xxx-0x1D15xxx │
│ 443 reduction actions → direct IR construction │
│ NO AST: reductions emit IR nodes inline │
└────────────────────┬────────────────────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
INSTRUCTION TABLE SEMANTIC VALIDATORS
sub_46E000 (93KB) sub_4B2F20 (52KB, general)
sub_46BED0 (per-opcode) sub_4C5FB0 (28KB, operands)
sub_46C690 (lookup) sub_4C2FD0 (12KB, WMMA/MMA)
sub_46C6E0 (6.4KB match) sub_4ABFD0 (11KB, async copy)
sub_4A73C0 (10KB, tensormap)
+ 20 more validators
Flex DFA Scanner -- sub_720F00
The scanner is a standard Flex-generated DFA with the ptx prefix (all exported symbols use ptx instead of yy: ptxlex, ptxensure_buffer_stack, ptx_create_buffer, etc.). At 15.8 KB of core logic (64 KB including inlined buffer management), it is the largest single function in the lexer region. The DFA transition table lives at off_203C020 and is indexed by *(DWORD*)(state + 76) (the current start condition). The main loop structure follows the textbook Flex pattern:
// DFA transition core (reconstructed from sub_720F00)
while (1) {
    v10 = (DWORD *)(table_base + 8 * state);          // row for table[state]
    if (current_char != *v10)                         // no transition for this char:
        break;                                        //   fall back to last accept
    state = v10[1];                                   // goto next state
    action = *(DWORD *)(table_base + 8 * state - 4);  // accept action (or 0)
    if (action != 0)
        break;                                        // matched a rule
    current_char = *input_ptr++;                      // advance input (reconstructed)
}
// Giant switch on action number (0..~550)
switch (action) { ... }
The scanner returns integer token codes to the Bison parser. The value 550 is YY_NULL (end-of-input sentinel). Token attributes are communicated through the lexer state object, which the parser state carries as a pointer at offset +1096. The scanner receives this pointer as its a3 argument and dereferences it (e.g., *(_QWORD *)(a3 + 1096)) to reach the 2,528-byte lexer state.
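The table-driven shape of this loop can be illustrated with a toy matcher. This is a didactic sketch, not the recovered code: the per-state transition rows, the accept-action array, and the "remember the last accepting action" behavior mirror the Flex core above, but the table contents and action numbers are invented.

```c
#include <assert.h>

/* Toy table-driven DFA: each row maps a character class to a next
   state, accepting states carry a nonzero action number, and the
   scanner remembers the last accepting action seen. */
enum { N_STATES = 3 };

static const int next_state[N_STATES][2] = { /* [state][is_digit] */
    { 1, 2 },    /* start: letter -> ident, digit -> number */
    { 1, 1 },    /* ident continues on letters and digits   */
    { -1, 2 },   /* number continues on digits only         */
};
static const int accept_action[N_STATES] = { 0, 1, 2 }; /* invented codes */

static int scan_token(const char *s) {
    int state = 0, action = 0;
    for (; *s; s++) {
        int cls = (*s >= '0' && *s <= '9');
        int nxt = next_state[state][cls];
        if (nxt < 0) break;                 /* no transition: stop   */
        state = nxt;
        if (accept_action[state])
            action = accept_action[state];  /* last accepting wins   */
    }
    return action;                          /* 0 = no rule matched   */
}
```

On mixed input such as `12a` the matcher stops at the first character with no transition and returns the action of the last accepting state, which is exactly how Flex implements longest-match.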
Token Categories
The 552 Flex rules map PTX lexemes to 162 distinct token types. Bison terminal codes range from 258 to 422. The scanner switch cases reveal the following category structure:
| Switch cases | Token code | Category | Examples / attributes |
|---|---|---|---|
| 2 | 364 | Semicolons / newlines | Statement terminator |
| 5--7 | 340, 341, 344 | Keywords | PTX keywords |
| 63--65 | 302 | Register names | Attribute: -1, chr-48, chr-38 (register numbering) |
| 74--91 | 320 | Data types | Values 1--18: .b8 through .f64 (18 type qualifiers) |
| 92--94 | 322 | Comparison types | Values 9, 7, 11 |
| 95--99 | 323 | Rounding modes | Values 24--29: .rn, .rz, .rm, .rp, etc. |
| 1 | (internal) | #include | Strips whitespace, copies filename |
| 3 | (dispatch) | Preprocessor directive | Calls sub_71F630 |
| 4 | 339 | #pragma | Strips whitespace |
Line and column tracking uses fields at *(state+48) (line number) and *(state+52) (column), incremented on each newline character.
Buffer Management
The scanner uses the standard Flex buffer stack for nested input sources (includes, macros, inline strings). Key buffer management functions:
| Address | Size | Identity | Purpose |
|---|---|---|---|
| sub_720190 | 2.0 KB | ptxensure_buffer_stack | Grows buffer stack via realloc |
| sub_7202E0 | 1.3 KB | ptx_create_buffer | Creates YY_BUFFER_STATE from FILE* |
| sub_720410 | 3.3 KB | yy_get_next_buffer | Refills character buffer, handles EOF |
| sub_720630 | 9.7 KB | yy_get_previous_state | Restores DFA state, SIMD-optimized memmove |
| sub_720BA0 | 4.3 KB | ptx_scan_string | Scans inline string into buffer |
| sub_724CC0 | 4.9 KB | ptx_scan_bytes | Macro expansion buffer allocation |
| sub_725070 | 2.7 KB | ptx_scan_buffer | Buffer creation with error recovery |
Notable: sub_720630 contains SSE2-optimized memmove using __m128i aligned 16-byte copies for buffer compaction -- a Flex optimization for large input buffers. The ptx_scan_bytes function (sub_724CC0) is called from the Bison parser actions (3 call sites in sub_4CCF30) to handle inline macro expansion during parsing.
Error strings in the buffer system:
"out of dynamic memory in ptxensure_buffer_stack()""out of dynamic memory in ptx_create_buffer()""out of dynamic memory in yy_get_next_buffer()""out of dynamic memory in ptx_scan_bytes()""bad buffer in ptx_scan_bytes()""out of dynamic memory in ptx_scan_buffer()""fatal flex scanner internal error--no action found""fatal flex scanner internal error--end of buffer missed""unexpected EOF while scanning"
Macro Preprocessor
Before tokens reach the Flex DFA, a character-level macro preprocessor handles .MACRO/.ENDM, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives. The preprocessor lives at 0x71B000--0x720000 (~20 KB) and operates on raw character streams, not tokens. This design is identical to C's preprocessor running before the lexer.
Preprocessor Dispatch -- sub_71F630
The top-level dispatcher (14 KB) is called from the Flex scanner's case 3 (directive detection). It examines the directive name and routes to the appropriate handler:
| Directive | Handler | Size | Description |
|---|---|---|---|
| .MACRO | sub_71DCA0 | 8.4 KB | Macro definition: records body text, handles nesting |
| .ELSE / .ELIF | sub_71E2B0 | 32 KB | Conditional code: skips blocks, handles nested conditionals |
| .ENDIF | sub_71E2B0 | (shared) | End of conditional block |
| .INCLUDE | sub_71C310 | 8.3 KB | File inclusion: pushes new input source onto lexer stack |
The dispatcher uses strstr for substring matching on directive names and returns token codes (e.g., 364 for end-of-directive).
Conditional Handler -- sub_71E2B0
At 32 KB, this is the largest preprocessor function. It handles .ELSE, .ELIF, and .ENDIF by scanning ahead through the input character stream, counting nesting levels, and skipping entire blocks of PTX text when conditions are false. It calls sub_4287D0 (the token reader) to evaluate conditional expressions and sub_428C40 (string compare) for keyword matching. Two nearly-duplicate code blocks handle .ELSE and .ELIF paths with identical scanning logic but different branch conditions.
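The nesting-count skip logic can be sketched as follows. This is a conceptual illustration only: tokens stand in for the character-level scan the real handler performs, and `.IF` is a hypothetical stand-in for whatever opens a conditional block in the recovered grammar.

```c
#include <assert.h>
#include <string.h>

/* Sketch: scan forward tracking nesting depth so a false branch
   skips past its own .ENDIF but not an outer one. Returns the index
   of the matching terminator, or -1 if the block is unterminated. */
static int find_matching_endif(const char *toks[], int n) {
    int depth = 0;
    for (int i = 0; i < n; i++) {
        if (!strcmp(toks[i], ".IF")) {
            depth++;                       /* enter nested block    */
        } else if (!strcmp(toks[i], ".ENDIF")) {
            if (depth == 0) return i;      /* our own terminator    */
            depth--;                       /* close nested block    */
        }
    }
    return -1;                             /* unterminated          */
}
```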
Macro Definition -- sub_71DCA0
Handles .MACRO directives by recording the macro body text. The function is recursive to support nested .MACRO definitions. It delegates to sub_71D710 (macro body scanner, 7.5 KB) and sub_71D1B0 (macro argument scanner, 6.8 KB). The argument scanner uses strlen + strncmp for keyword matching against a delimiter string parameter.
Include Handler -- sub_71C310
Processes .INCLUDE by pushing a new file onto the lexer's input stack. The function is recursive (calls itself 4 times) for nested includes. It manages the include-stack pointers at offsets +2128, +2136, +2160, and +2168 of the lexer state object (the 2,528-byte struct pointed to by parser+1096), and uses the "pushback character" register at offset +2441 of the same lexer state. String reference: "ptxset_lineno called with no buffer".
Error Handling
Macro errors are reported through sub_71BF60 (fatal macro abort) which calls sub_71BF30 to print "out of dynamic memory..." messages, and sub_71C140 (format error) which calls sub_42CA60 (error output). Nesting depth is checked by sub_724CC0 which prints "macro nesting too deep!" on overflow.
Bison LALR(1) Parser -- sub_4CE6B0
The parser is a standard Bison-generated LALR(1) shift-reduce parser spanning 48 KB (addresses 0x4CE6B0--0x4DA337). It contains ~512 grammar productions with 443 reduction cases. The function calls ptxlex (sub_720F00) to obtain tokens and uses five LALR tables for state transitions:
| Table | Address | Bison name | Purpose |
|---|---|---|---|
| word_1D146A0 | 0x1D146A0 | yydefact | Default reduction rule for each state |
| word_1D121A0 | 0x1D121A0 | yycheck | Valid lookahead verification |
| word_1D13360 | 0x1D13360 | yypact | Parser action table (shift/reduce) |
| word_1D150C0 | 0x1D150C0 | yypgoto | Goto table for nonterminals |
| byte_1D15960 | 0x1D15960 | yyr2 | Right-hand-side length for each rule |
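How these tables cooperate can be sketched with a simplified version of the Bison skeleton's action lookup: yypact biases the state by the lookahead, yycheck guards against collisions introduced by table compression, and yydefact supplies the default reduction. The two-state toy tables below are invented for illustration, and the real skeleton first translates raw token codes into internal symbol numbers, a step omitted here.

```c
#include <assert.h>

enum { YYPACT_NINF = -99, FIRST_TERMINAL = 258 };

static const int yypact[2]   = { 0, YYPACT_NINF };
static const int yytable[2]  = { 7, 8 };       /* packed actions    */
static const int yycheck[2]  = { 258, 259 };   /* expected tokens   */
static const int yydefact[2] = { 0, 3 };       /* default rules     */

/* Positive result: shift/reduce action from the packed table.
   Negative result: reduce by the state's default rule. */
static int lalr_action(int state, int token) {
    int off = yypact[state];
    if (off != YYPACT_NINF) {
        int idx = off + (token - FIRST_TERMINAL);
        if (idx >= 0 && idx < 2 && yycheck[idx] == token)
            return yytable[idx];           /* valid packed entry    */
    }
    return -yydefact[state];               /* fall back to default  */
}
```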
Direct IR Construction (No AST)
The critical architectural decision: Bison reduction actions directly construct IR nodes rather than building an intermediate AST. When a grammar rule is reduced, the semantic action immediately:
- Allocates IR nodes via the pool allocator (sub_424070)
- Populates instruction fields from token attributes
- Calls instruction validators for semantic checking
- Links nodes into the instruction stream
- Registers symbols in the symbol table (via sub_426150, the hash map)
This means the parser is a single-pass translator from PTX text to IR. The trade-off is clear: no AST means no multi-pass source-level analysis, but it eliminates an entire allocation and traversal phase. For an assembler (as opposed to a high-level language compiler), this is the right choice -- PTX is already a linearized instruction stream with no complex scoping or overload resolution that would benefit from an AST.
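The AST-less pattern can be sketched as a reduction action that allocates an IR node and links it into the instruction stream in one step. The node layout and names below are invented for illustration, with calloc standing in for the sub_424070 pool allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct IRInsn {
    const char *opcode;
    struct IRInsn *next;
} IRInsn;

typedef struct { IRInsn *head, *tail; } InsnStream;

/* What a reduction for `instruction: OPCODE operand_list` might do:
   build the IR node immediately -- no parse-tree node is ever
   materialized. */
static IRInsn *reduce_instruction(InsnStream *s, const char *opcode) {
    IRInsn *n = calloc(1, sizeof *n);   /* stands in for sub_424070 */
    n->opcode = opcode;
    if (s->tail) s->tail->next = n; else s->head = n;
    s->tail = n;                         /* append to live stream    */
    return n;
}
```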
Reduction Actions -- Semantic Processing
The 443 reduction cases in the parser body handle PTX constructs from simple register declarations to complex matrix instruction specifications. Diagnostic strings found in the parser tail (0x4D5000--0x4DA337) reveal the kinds of semantic checks performed during reduction:
Directive validation:
"Defining labels in .section""dwarf data"-- DWARF section processing"reqntid"/".reqntid directive"-- required thread count".minnctapersm directive"-- min CTAs per SM".maxnctapersm"/".maxnctapersm directive"-- max CTAs per SM (deprecated)".maxntid and .reqntid cannot both be specified"".maxnctapersm directive deprecated..."".minnctapersm is ignored..."
Type and operand validation:
"Vector Type not specified properly"".f16x2 packed data-type"-- half-precision packed type"matrix shape"-- matrix instruction dimensions".scale_vectorsize"-- vector scaling modifier"too many layout specifiers"
Resource limits:
"Kernel parameter size larger than 4352 bytes"
Architecture gating:
"sm_50","sm_20","sm_53"-- target architecture checks viasub_485520(ctx, sm_number)- PTX version checks via
sub_485570(ctx, major, minor)
Expression handling:
"%s+%llu"/"%s-%s"-- label arithmetic in address expressions"Negative numbers in dwarf section"-- DWARF data validation
Symbol resolution:
"unrecognized symbol"-- lexer/symbol table failure"syntax error"-- generic parse error".extern"-- external declarations".noreturn directive"-- function attributes"texmode_unified"/"texmode_raw"-- texture mode selection"cache eviction priority"/".level::eviction_priority"-- cache policy
Error Recovery
Parse errors trigger sub_42FBA0 with "syntax error" as the message. The central diagnostic emitter (sub_42FBA0, 2,388 bytes, 2,350 callers) handles all severity levels:
| Severity | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (suppressed) | -- | Silently ignored |
| 1--2 | "info " | @I@ | Informational message |
| 3 | "warning " or "error " | @W@ or @E@ | Context-dependent; promoted to error by --Werror |
| 4 | "error* " | @E@ | Non-fatal error |
| 5 | "error " | @E@ | Error |
| 6+ | "fatal " | (none) | Calls longjmp to abort compilation |
The diagnostic system reads the source file to display context lines (prefixed with "# "), caching file offsets every 10 lines in a hash map for fast random-access seeking.
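The severity mapping in the table above can be sketched as a single dispatch function. This is a model of the recovered behavior, not the decompiled body: the `werror` flag models --Werror promotion of level 3, and NULL models suppression.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the severity-to-prefix mapping recovered from
   sub_42FBA0. Returns the message prefix, or NULL if suppressed. */
static const char *diag_prefix(int severity, int werror) {
    if (severity <= 0) return NULL;             /* suppressed        */
    if (severity <= 2) return "info ";          /* @I@               */
    if (severity == 3) return werror ? "error " : "warning ";
    if (severity == 4) return "error* ";        /* non-fatal, @E@    */
    if (severity == 5) return "error ";         /* @E@               */
    return "fatal ";                            /* 6+: longjmp abort */
}
```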
Parser Initialization -- sub_451730
Parser initialization (14 KB) builds the lexer's symbol table with all built-in PTX names before parsing begins. This function is called from the compilation driver (sub_446240) and performs three major tasks:
1. Special Register Registration
All PTX special registers are pre-registered in the symbol table with their internal identifiers:
| Category | Registers |
|---|---|
| Thread/block ID | %ntid, %laneid, %warpid, %nwarpid, %smid, %nsmid, %ctaid, %nctaid, %gridid |
| Clocks | %clock, %clock_hi, %clock64 |
| Performance counters | %pm0--%pm7, %pm0_64--%pm7_64 |
| Lane masks | %lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt |
| Environment | %envreg0--%envreg31 |
| Timers | %globaltimer_lo, %globaltimer_hi |
| Shared memory | %total_smem_size, %dynamic_smem_size |
| Texture types | .texref, .samplerref, .surfref |
| Predefined macros | GPU_ARCH, PTX_MAJOR_VERSION, PTX_MINOR_VERSION |
2. Opcode Table Construction
Calls sub_46E000 -- the 93 KB instruction table builder -- to register all PTX opcodes with their legal type combinations. See the dedicated section below.
3. Context State Initialization
Allocates and initializes two objects: the parser state (1,128 bytes, sub_424070(pool, 1128)) and the lexer state (2,528 bytes, sub_424070(pool, 2528)). The parser state stores a pointer to the lexer state at offset +1096. The string "PTX parsing state" identifies the parser state allocation in memory dumps. The string "<builtin>" serves as the filename for built-in declarations. Both objects are zeroed via memset before field initialization.
Instruction Table Builder -- sub_46E000
This is the largest single function in the front-end region at 93 KB. It is not a normal function body but a massive initialization sequence that calls sub_46BED0 exactly 1,141 times -- once per legal PTX instruction variant. Each call registers an opcode name together with its accepted type combinations using compact encoding strings.
Operand Encoding Strings
Each instruction variant is registered with a string that encodes its operand signature. The encoding uses single-character codes for operand categories:
| Code | Meaning |
|---|---|
F | Float operand (.f16, .f32, .f64) |
H | Half-precision (.f16, .f16x2) |
I | Integer operand (.s8--.s64, .u8--.u64) |
B | Bitwise operand (.b8--.b128) |
N | Immediate / numeric literal |
P | Predicate operand |
String references found in the function include composite type signatures:
"F32F32"-- binary float32 operation"F16F16F16F16"-- quad half-precision"I32I8I8I32"-- integer MMA (int32 accumulator, int8 operands)"F64F64F64F64"-- quad float64 (double-precision MMA)"_mma.warpgroup"-- warp-group MMA marker
Hash Tables
The instruction table builder populates two hash tables at offsets +2472 and +2480 within the lexer state object (the 2,528-byte struct passed as the first argument to sub_46E000). These hash tables provide O(1) lookup from opcode name to the registered type combination list.
Registration Function -- sub_46BED0
Called 1,141 times from sub_46E000. Each call takes an opcode name string and an operand encoding string, creates a descriptor node, and inserts it into the hash table. The descriptor captures the opcode, its legal operand types, and the semantic validation function to call during parsing.
Instruction Lookup -- sub_46C690 and sub_46C6E0
At parse time, when the parser reduces an instruction production, it calls sub_46C690 to look up the instruction name in the hash table built by sub_46E000. The lookup returns a descriptor list, and sub_46C6E0 (6.4 KB, the descriptor matcher) walks the list to find the variant matching the actual operands present in the source.
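The registration/lookup scheme can be sketched as a chained hash table keyed by opcode name, with the matcher walking the per-name chain comparing operand signatures. Everything below -- bucket count, hash function, struct layout -- is invented for illustration; only the overall shape (register per variant, hash by name, walk chain to match) is attested.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64

typedef struct Desc {
    const char *opcode;     /* e.g. "add"    */
    const char *signature;  /* e.g. "F32F32" */
    struct Desc *next;
} Desc;

static Desc *buckets[NBUCKETS];

static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Per-variant registration, cf. sub_46BED0 (called 1,141 times). */
static void register_variant(Desc *d) {
    unsigned h = hash_name(d->opcode);
    d->next = buckets[h];
    buckets[h] = d;
}

/* Lookup + chain walk, cf. sub_46C690 / sub_46C6E0. */
static const Desc *lookup_variant(const char *opcode, const char *sig) {
    for (const Desc *d = buckets[hash_name(opcode)]; d; d = d->next)
        if (!strcmp(d->opcode, opcode) && !strcmp(d->signature, sig))
            return d;
    return NULL;            /* no variant accepts these operands */
}
```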
Operand Classification -- 12 Categories
The descriptor matcher (sub_46C6E0) classifies each operand into one of 12 categories based on its syntactic form, then matches the category sequence against the registered encoding strings. The 12 categories cover:
- General-purpose register (R)
- Predicate register (P)
- Uniform register (UR)
- Uniform predicate (UP)
- Integer immediate
- Float immediate
- Address expression (register + offset)
- Label / symbol reference
- Special register
- Vector operand
- Texture / surface / sampler reference
- Bitfield / compound modifier
The classification examines token attributes set by the lexer -- register type bits at (field >> 28) & 7, immediate flag (0x1000000), uniform flag (0x6000000), and operand descriptor fields at instruction offset 84+.
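The attribute-word decoding can be written out directly. The shift and mask constants below are the recovered values; the accessor names are invented for illustration.

```c
#include <assert.h>

#define IMM_FLAG     0x1000000u   /* immediate operand flag      */
#define UNIFORM_MASK 0x6000000u   /* uniform register/pred flags */

/* Register type lives in bits 28..30 of the attribute word. */
static unsigned reg_type_bits(unsigned field) { return (field >> 28) & 7u; }
static int is_immediate(unsigned field) { return (field & IMM_FLAG) != 0; }
static int is_uniform(unsigned field)   { return (field & UNIFORM_MASK) != 0; }
```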
Parser State Object (1,128 bytes)
The parser passes a state object through all phases. This 1,128-byte structure (sub_424070(pool, 1128)) carries compilation context and pointers to sub-systems. It is indexed as _QWORD* (8-byte slots), so QWORD index [N] = byte offset N*8. The highest accessed byte is +1120 (index [140]), fitting exactly within the 1,128-byte allocation.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | pool_context | Pool allocator handle (from sub_4258D0) |
| +8 | 8 | compilation_unit | Pointer to compilation unit (parameter a2) |
| +16 | 8 | macro_symbol_table | Hash table for macros (sub_425CA0, 64 buckets) |
| +24 | 8 | module_ptr | Pointer to module object (parameter a3) |
| +32 | 8 | container_a | Sorted set container (8,192 buckets) |
| +56 | 8 | scope_chain[0] | Scope chain entry (sub_44F7C0), used for symbol resolution |
| +64 | 8 | scope_chain[1] | Second scope chain entry |
| +72 | 8 | scope_chain[2] | Third scope chain entry |
| +80 | 8 | type_map | Type descriptor hash map (sub_42D150, 8 buckets) |
| +96 | 8 | symbol_tables[0..5] | Six hash tables for symbol lookup (at +96, +104, +112, +120, +128, +136) |
| +152 | 8 | current_function | Pointer to current function being parsed |
| +160 | 4 | ptx_major_version | PTX ISA major version (set by Bison reduction) |
| +164 | 4 | ptx_minor_version | PTX ISA minor version |
| +168 | 4 | sm_version_check | SM target version for feature gating |
| +177 | 1 | flag_a | Initialization flag |
| +192 | 2 | word_96 | Zero-initialized word at WORD index 96 |
| +196 | 4 | address_size | 32 or 64 (address width) |
| +208 | 8 | hash_ref_a | Hash table reference (64-bucket) |
| +236 | 1 | default_flag | Initialized to 1 |
| +264 | 16 | list_a | Linked list (head at +264, tail ptr at +272 points to head) |
| +280 | 8 | sorted_set_b | Sorted set (8,192 buckets) |
| +288 | 8 | sorted_set_c | Sorted set (1,024 buckets) |
| +296 | 16 | sorted_maps[0..1] | Two sorted maps (sub_42A300) |
| +320 | 8 | hash_e | Hash table (1,024 buckets) |
| +328 | 16 | list_b | Linked list (head/tail pair) |
| +344 | 16 | list_c | Linked list (head/tail pair) |
| +360 | 256 | offset_table[16] | SSE-initialized offset table (16 entries of 16 bytes each, computed from base address + constants at xmmword_1CFDA00--1CFDA70) |
| +616 | 16 | list_d | Linked list (head/tail pair) |
| +632 | 16 | list_e | Linked list (head/tail pair); low bits of first word used as address_space_flags |
| +648 | 8 | local_symbol_table | Per-scope local symbol table pointer |
| +824 | 8 | symbol_lookup_ref | Hash table for symbol name lookup |
| +832 | 1 | dwarf_section_flag | Nonzero when inside .section DWARF data |
| +834 | 1 | directive_flag_a | Checked as pair with +835 |
| +836 | 1 | directive_flag_b | Set to 1 by multiple Bison reductions |
| +840 | 8 | builtin_filename | Interned string "<builtin>" |
| +848 | 8 | empty_string | Interned empty string "" |
| +856 | 4 | sm_arch_number | SM architecture number (parameter a6, e.g. 90 for sm_90) |
| +860 | 1 | feature_a | Feature flags set during parsing |
| +861 | 1 | feature_b | |
| +862 | 1 | feature_c | |
| +864 | 1 | feature_d | |
| +865 | 1 | feature_e | ORed with 1 by Bison reductions |
| +869 | 1 | flag_h | Initialized to 0 |
| +960 | 4 | sm_target_code | SM target code used in sub_454E70 checks |
| +968 | 8 | insn_stream_a | Instruction stream pointer A (set in Bison) |
| +976 | 8 | insn_stream_b | Instruction stream pointer B |
| +984 | 8 | insn_stream_c | Instruction stream pointer C |
| +1000 | 1 | insn_state_flag | Instruction state flag (= 0) |
| +1008 | 8 | string_pool | String pool pointer |
| +1016 | 8 | context_ref | Compilation context reference (parameter a4) |
| +1048 | 4 | dword_262 | Zero-initialized |
| +1053 | 1 | parsing_active | Toggled 1/0 during active parsing |
| +1080 | 16 | list_f | Linked list (head/tail pair) |
| +1096 | 8 | lexer_state_ptr | Pointer to 2,528-byte lexer state object (see below) |
| +1104 | 16 | list_g | Linked list (head/tail pair) |
| +1120 | 1 | param_flag | From parameter a10 |
Lexer State Object (2,528 bytes)
The lexer state is a separate heap-allocated object (sub_424070(pool, 2528)) pointed to by parser_state+1096. It is the primary state carrier for the Flex DFA scanner and the instruction table subsystem. All functions that need scanner state (the Bison parser, the Flex scanner, the include handler, and the instruction table builder) access this object through the pointer at +1096.
| Offset | Size | Field | Description |
|---|---|---|---|
| +48 | 4 | line_number | Current source line (incremented on newline) |
| +52 | 4 | column_number | Current source column |
| +64 | 8 | buffer_limit | Pointer to end of current scan buffer |
| +76 | 4 | start_condition | Flex DFA start condition (*(state+76), indexes off_203C020) |
| +152 | 1 | flag_a | Scanner state flag |
| +156 | 8 | sentinel_a | Initialized to -1 (0xFFFFFFFFFFFFFFFF) |
| +164 | 8 | sentinel_b | Initialized to -1 |
| +172 | 4 | address_size_proxy | Written by Bison via sub_4563E0; -1 on init |
| +180 | 8 | zero_pair | Zero-initialized |
| +188 | 8 | sentinel_c | Initialized to 0xFFFFFFFF00000000 |
| +196 | 8 | sentinel_d | Initialized to -1 |
| +204 | 4 | sentinel_e | DWORD[51], initialized to -1 |
| +208 | 2 | word_104 | WORD[104], zero-initialized |
| +540 | 1 | flag_b | Scanner flag |
| +541 | 1 | include_active | Checked by Flex (lexer+541) and Bison to gate .INCLUDE behavior |
| +784 | 8 | current_filename | Pointer to current filename string (set during include handling) |
| +1984 | 128 | version_array[32] | DWORD array of version fields; written by sub_70FDD0(lexer, index, value) as *(lexer + 4*index + 1984) = value |
| +2104 | 4 | ptx_major_ver | version_array[30] = PTX major version (initialized to 9) |
| +2108 | 4 | ptx_minor_ver | version_array[31] = PTX minor version (initialized to 0) |
| +2128 | 8 | include_stack_a | Include nesting pointer 1 (linked list for file stack) |
| +2136 | 8 | include_stack_b | Include nesting pointer 2 |
| +2160 | 8 | include_stack_head | Head of include stack (walked by sub_71C310) |
| +2168 | 8 | include_stack_file | Include stack filename pointer |
| +2441 | 1 | pushback_char | Character pushed back into input stream by scanner |
| +2464 | 2 | word_1232 | Zero-initialized |
| +2466 | 1 | flag_c | Flag |
| +2472 | 8 | opcode_hash_a | Opcode lookup hash table (populated by sub_46E000) |
| +2480 | 8 | opcode_hash_b | Second opcode lookup hash table (populated by sub_46E000) |
| +2488 | 8 | context_sub_ref | Compilation context sub-reference (parameter a9); accessed by Bison for sub_457CB0/sub_70A5B0 calls |
| +2496 | 1 | flag_d | Flag |
| +2504 | 24 | tail_fields | Three zero-initialized QWORD slots (indices [313],[314],[315]) |
Version checks use sub_485520(ctx, sm_number) (SM architecture >= N) and sub_485570(ctx, major, minor) (PTX version >= major.minor). For example, the address-space attribute setter (sub_4035D3) checks sm_90 and PTX 7.8:
if (!sub_485520(ctx, 90))
sub_42FBA0(&err, loc, "sm_90", ...); // Error: requires sm_90
if (!sub_485570(ctx, 7, 8))
sub_42FBA0(&err, loc, "7.8", ...); // Error: requires PTX 7.8
*(byte*)(v15 + 632) = (old & 0xFC) | (a2 & 3); // Set address space bits
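The semantics of the two gate functions can be sketched directly. The ctx layout and function names below are assumptions; only the comparison logic (SM number at least N; PTX version at least major.minor, compared lexicographically) is attested.

```c
#include <assert.h>

typedef struct { int sm_arch; int ptx_major, ptx_minor; } GateCtx;

/* sub_485520: target SM architecture is at least `sm`. */
static int arch_at_least(const GateCtx *c, int sm) {
    return c->sm_arch >= sm;
}

/* sub_485570: PTX ISA version is at least major.minor.
   Lexicographic: 8.0 passes a 7.8 gate, 7.7 does not. */
static int ptx_at_least(const GateCtx *c, int major, int minor) {
    return c->ptx_major > major ||
           (c->ptx_major == major && c->ptx_minor >= minor);
}
```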
Semantic Validators
The parser's reduction actions dispatch to specialized validator functions for each instruction category. These functions live in 0x460000--0x4D5000 and check SM architecture requirements, type compatibility, operand constraints, and instruction-specific invariants.
| Address | Size | Identity | Coverage |
|---|---|---|---|
| sub_4B2F20 | 52.6 KB | General instruction validator | Textures, surfaces, loads, stores, cvt, calls |
| sub_4CE6B0 tail | 48 KB | Directive/declaration validator | .local_maxnreg, .alias, .unified, .pragma, .noreturn |
| sub_4C5FB0 | 28.5 KB | Operand validator | State spaces, rounding, barriers, cache levels |
| sub_4C2FD0 | 12.2 KB | WMMA/MMA validator | Matrix dimensions, FP8 types, layout specifiers |
| sub_49BBA0 | 11.4 KB | MMA scale/block validator | .scale_vec_size, .block_scale, sparse GMMA |
| sub_4ABFD0 | 11.1 KB | Async copy validator | cp.async, bulk copy, cvt.tf32.f32.rna |
| sub_4A73C0 | 10.9 KB | Tensormap validator | .tile, field ranges, .tensormap::generic |
| sub_4BFED0 | 10.3 KB | WMMA shape/type validator | .m%dn%dk%d shapes, .aligned modifier |
| sub_4AF9F0 | 5.8 KB | CVT validator | cvt.f16x2.f32, type combinations, rounding |
| sub_4AEB60 | 3.7 KB | LDSM validator | _ldsm.s8.s4/_ldsm.u8.u4 format conversion |
| sub_4B1630 | 4.6 KB | Function address validator | cudaDeviceSynchronize, kernel/device addresses |
| sub_498AF0 | 3.9 KB | MMA layout validator | Row/col layout, floating-point type constraints |
| sub_497C00 | 3.0 KB | Prototype validator | .FORCE_INLINE, .noreturn, .unique, register counts |
| sub_496690 | 3.6 KB | Scope/barrier validator | Scope modifiers, barrier constraints |
| sub_494210 | 2.3 KB | Sparse GMMA validator | Sparse GMMA with specific types |
| sub_492C80 | 4.0 KB | Cache eviction validator | L2 eviction priority, .v8.b32/.v4.b64 |
| sub_49A5A0 | 3.5 KB | Special register validator | %laneid, %clock64, %lanemask_*, arch gating |
| sub_4A0CD0 | 4.9 KB | Variable declaration validator | .texref, .managed, .reserved, .common |
| sub_4A02A0 | 2.6 KB | Initializer validator | generic() operator, function addresses |
| sub_4036D9 | 437 B | Parameter list validator | Count, types, alignment, state space |
Validators follow a uniform pattern: they receive the parser context and instruction data, check constraints against the current SM architecture and PTX version, and call sub_42FBA0 with descriptive error messages when violations are found. The general validator (sub_4B2F20, 52.6 KB) is the second-largest function in the front-end and covers the broadest range of PTX instructions.
ROT13 Opcode Name Obfuscation
PTX opcode names stored in the binary are ROT13-encoded as an obfuscation measure. The static constructor ctor_003 at 0x4095D0 (17 KB, ~1,700 lines) decodes and populates the opcode name table at 0x29FE300 during program startup. Each entry is a (string_ptr, length) pair. Decoded examples:
| ROT13 | Decoded | PTX instruction |
|---|---|---|
NPDOHYX | ACQBULK | acqbulk |
OFLAP | BSYNC | bsync |
PPGY.P | CCTL.C | cctl.c |
SZN | FMA | fma |
FRGC | SETP | setp |
ERGHEA | RETURN | return |
RKVG | EXIT | exit |
The table covers the entire PTX ISA vocabulary -- hundreds of opcodes. A separate ROT13 table in ctor_005 (0x40D860, 80 KB) encodes 2,000+ internal Mercury/OCG tuning knob names (see Knobs System).
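The decoding step the constructor performs per entry is plain ROT13 over ASCII letters, with punctuation (such as the `.` in `PPGY.P`) passing through untouched. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <string.h>

/* ROT13 a single character; non-letters pass through unchanged. */
static char rot13_char(char c) {
    if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
    if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
    return c;
}

/* Decode one table entry (what ctor_003 must do per string). */
static void rot13_decode(const char *src, char *dst) {
    size_t i;
    for (i = 0; src[i]; i++) dst[i] = rot13_char(src[i]);
    dst[i] = '\0';
}
```

Because ROT13 is its own inverse, the same routine works for encoding the table in the first place.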
Compilation Pipeline Integration
The parser is invoked from the top-level compilation driver sub_446240 (11 KB), which orchestrates the full pipeline:
Parse → CompileUnitSetup → DAGgen → OCG → ELF → DebugInfo
The driver reports timing for each phase:
"Parse-time : %.3f ms (%.2f%%)""CompileUnitSetup-time : %.3f ms (%.2f%%)""DAGgen-time : %.3f ms (%.2f%%)""OCG-time : %.3f ms (%.2f%%)""ELF-time : %.3f ms (%.2f%%)""DebugInfo-time : %.3f ms (%.2f%%)"
The parse phase encompasses the Flex scanner, macro preprocessor, Bison parser, instruction table lookup, and all semantic validation. Since the parser directly builds IR, the output of the parse phase is a populated instruction stream ready for the DAG generation phase.
PTX Text Generation (Reverse Direction)
The inverse of parsing -- converting IR back to PTX text -- lives in 0x4DA340--0x5A8E40 (580 formatter functions). Each handles one PTX opcode. A dispatcher at sub_5D4190 (12.9 KB) routes by opcode name using 81 direct string comparisons plus a 473-entry hash switch. Every formatter follows an identical allocation pattern:
pool = sub_4280C0(ctx)[3]; // Get allocator pool
buf = sub_424070(pool, 50000); // 50KB temp buffer
// ... sprintf() operands into buf ...
len = strlen(buf);
result = sub_424070(pool, len + 1); // Exact-size allocation
strcpy(result, buf);
sub_4248B0(buf); // Free temp buffer
return result;
A monolithic format string table (~1.8 MB), reached through the formatters' a2 parameter, contains pre-assembled PTX text templates with %s/%llu/%d placeholders. This trades memory for speed: instead of building instruction text dynamically, ptxas simply fills in operand names at runtime.
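The oversized-temp-then-exact-copy pattern above can be made runnable as follows, with malloc/free standing in for the sub_424070/sub_4248B0 pool calls and an invented function name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format into a deliberately oversized scratch buffer, then copy the
   result into an exact-size allocation -- the recovered formatter
   pattern, with libc allocation standing in for the pool. */
static char *format_insn(const char *templ, const char *dst,
                         const char *src) {
    char *buf = malloc(50000);               /* 50KB temp buffer   */
    snprintf(buf, 50000, templ, dst, src);   /* fill placeholders  */
    char *result = malloc(strlen(buf) + 1);  /* exact-size alloc   */
    strcpy(result, buf);
    free(buf);                               /* release scratch    */
    return result;
}
```

The oversized scratch avoids measuring the formatted length up front; the exact-size copy keeps long-lived allocations tight.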
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_720F00 | 15.8 KB | ptxlex -- Flex DFA scanner main | 98% |
| sub_4CE6B0 | 48 KB | ptxparse -- Bison LALR(1) parser | HIGH |
| sub_46E000 | 93 KB | Instruction table builder (1,141 opcode registrations) | HIGH |
| sub_46BED0 | -- | Per-opcode registration function (called 1,141x) | HIGH |
| sub_46C690 | -- | Instruction lookup entry | HIGH |
| sub_46C6E0 | 6.4 KB | Descriptor matcher (12-category operand classifier) | HIGH |
| sub_451730 | 14 KB | Parser initialization (allocs 1,128B parser state + 2,528B lexer state) | HIGH |
| sub_70FDD0 | 14 B | Lexer version array writer: *(a1 + 4*a2 + 1984) = a3 | HIGH |
| sub_71F630 | 14 KB | Preprocessor directive dispatcher | 93% |
| sub_71E2B0 | 32 KB | Conditional handler (.ELSE/.ELIF/.ENDIF) | 92% |
| sub_71DCA0 | 8.4 KB | Macro definition handler (.MACRO) | 90% |
| sub_71C910 | 13 KB | Directive scanner | 91% |
| sub_71C310 | 8.3 KB | Include handler (.INCLUDE) | 90% |
| sub_71D1B0 | 6.8 KB | Macro argument scanner | 89% |
| sub_71D710 | 7.5 KB | Macro body scanner | 89% |
| sub_71BA10 | 2.3 KB | Macro character peek | 88% |
| sub_71BB80 | 2.6 KB | Macro buffer reader | 88% |
| sub_71BE20 | 1.1 KB | Macro expansion entry | 85% |
| sub_71BF60 | 1.8 KB | Macro fatal abort | 90% |
| sub_71C140 | 2.5 KB | Macro format error | 88% |
| sub_720190 | 2.0 KB | ptxensure_buffer_stack | 95% |
| sub_7202E0 | 1.3 KB | ptx_create_buffer | 96% |
| sub_720410 | 3.3 KB | yy_get_next_buffer | 95% |
| sub_720630 | 9.7 KB | yy_get_previous_state (SSE2 optimized) | 94% |
| sub_720BA0 | 4.3 KB | ptx_scan_string | 93% |
| sub_724CC0 | 4.9 KB | ptx_scan_bytes / macro nesting check | 91% |
| sub_725070 | 2.7 KB | ptx_scan_buffer | 93% |
| sub_42FBA0 | 2.4 KB | Central diagnostic emitter (2,350 callers) | HIGH |
| sub_4280C0 | 597 B | Thread-local context accessor (3,928 callers) | HIGH |
| sub_424070 | 2.1 KB | Pool allocator (3,809 callers) | HIGH |
| sub_4248B0 | 923 B | Pool deallocator (1,215 callers) | HIGH |
| sub_42BDB0 | 14 B | Fatal OOM handler (3,825 callers) | HIGH |
| sub_446240 | 11 KB | Top-level compilation driver | HIGH |
| sub_4095D0 | 17 KB | ROT13 opcode name table initializer | HIGH |
| sub_5D4190 | 12.9 KB | PTX text format dispatcher | HIGH |
| sub_4B2F20 | 52.6 KB | General instruction validator | HIGH |
| sub_4C5FB0 | 28.5 KB | Instruction operand validator | HIGH |
| sub_4C2FD0 | 12.2 KB | WMMA/MMA validator | HIGH |
| sub_485520 | -- | SM architecture check (sm >= N) | HIGH |
| sub_485570 | -- | PTX version check (version >= M.N) | HIGH |
Cross-References
- Pipeline Overview -- where the parser fits in the compilation flow
- PTX Directive Handling -- detailed directive processing after parsing
- PTX-to-Ori Lowering -- what happens to the IR the parser builds
- Knobs System -- ROT13-encoded knob names from ctor_005
- Memory Pool Allocator -- sub_424070 / sub_4248B0 pool system
- Hash Tables & Bitvectors -- sub_426150 / sub_426D60 hash map
- PTX Instruction Table -- full opcode catalog
- CLI Options -- sub_432A00 / sub_434320 option handling
PTX Directive Handling
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
PTX directives -- .version, .target, .entry, .func, .global, .shared, .local, .const, .reg, .param, .weak, .common, .extern, .visible, .alias, .pragma -- are parsed and semantically validated by the Bison reduction actions embedded in the 48 KB parser function sub_4CE6B0. Unlike instructions, which pass through opcode table lookup (sub_46E000) and per-instruction semantic validators, directives are handled entirely within the Bison reduction switch: each grammar production's action block reads values from the parser value stack, validates them against the current PTX version and target architecture, and writes the results into the 1,128-byte parser state object or its child compile-unit state (CU_state). No intermediate AST is constructed; directives take effect immediately during parsing.
The state object maintains 18 linked lists (9 head/tail pairs at offsets 368--512) that track symbols per state space, a string-keyed hash map (offset 208) for target feature flags, and a scope chain (offset 984) rooted at offset 968 for nested function declarations. Two version-gating functions -- sub_489050 (PTX ISA version) and sub_489390 (SM architecture) -- guard every directive that was introduced after the baseline ISA.
| Bison parser | sub_4CE6B0 (48,263 bytes, 631 case labels) |
| Version validator | sub_44A100 (bsearch over 44 valid PTX version IDs at xmmword_1CFD940) |
| PTX version gate | sub_489050 -- sub_454E70 + sub_455A80(major, minor, state) |
| SM arch gate | sub_489390 -- checks state+168 >= required_sm |
| Target handler | sub_4B1080 (per-target, texmode logic) |
| Function handler | sub_497C00 (entry/func declarations, ABI) |
| Variable handler | sub_4A0CD0 (state-space declarations, type validation) |
| Parameter allocator | sub_44F6E0 (48-byte parameter nodes) |
| Scope manager | sub_44B9C0 (scope hash map at state+1008) |
| State-space lists | 18 linked lists at state+368--state+512 |
| Target feature map | Hash map at state+208 (string keys, presence values) |
Architecture
PTX source text
|
v
+-------------------------------------------------------------------+
| BISON LALR(1) PARSER sub_4CE6B0 |
| 631 reduction cases, each a direct action |
| |
| DIRECTIVE CASES HANDLER |
| .version 35 sscanf + sub_44A100 |
| .target 5, 38 sub_4B1080 (per-target) |
| .address_size 10 inline validation |
| .entry 82, 86-88 sub_497C00 |
| .func 97, 100-105 sub_497C00 |
| .global/shared 57-68 sub_4A0CD0 |
| /local/const |
| .reg/.param 110-112 inline + sub_48BE80 |
| .weak 55 sub_489050(3,1) |
| .common 56 sub_489050(5,0) |
| .extern 79 sets CU+81, linkage=3 |
| .visible 80 sets CU+81, linkage=2 |
| .alias 41 sub_4036D9 (param match) |
| .pragma 42 prefix-match dispatch chain |
| |
+-------------------+-----------------------------------------------+
|
+---------+---------+
v v
PARSER STATE OBJECT CU_STATE (compile-unit)
~1200 bytes pointed to by state+1096
state+144: version CU+0: linkage code
state+152: target CU+24: state-space ID
state+160: ptx_major CU+48: func metadata buf
state+164: ptx_minor CU+80: return type
state+168: sm_id CU+81: declaration linkage
state+196: addr_size CU+88: current function
state+208: feature map CU+156: noinline pragma
state+368: 18 ll heads CU+172: reg-usage pragma
state+968: scope root CU+784: arch capability
state+984: scope chain CU+2448: target string
state+1008: scope map CU+2456: version string
.version X.Y -- Case 35
The .version directive establishes the PTX ISA version for the compilation unit. The parser extracts the major and minor version integers from the grammar, validates the combined version against a sorted table of 44 known versions, and stores both the numeric and string forms.
// Reconstructed from case 35 of sub_4CE6B0
int major = sub_449950(); // extract major from parser state
int minor = sub_449960(); // extract minor from parser state
sscanf(token, "%d.%d", &major, &minor);
// Allocate formatted version string
char* ver_str = pool_alloc(pool, 5);
sprintf(ver_str, "%d.%d", major, minor);
// Validate: bsearch over 44 valid version IDs
int combined = major * 10 + minor;
if (!sub_44A100(combined))
fatal_error("Unsupported PTX version %s", ver_str);
pool_free(ver_str);
// Store in parser state
state->version_string = token; // state+144
state->ptx_major = major; // state+160
state->ptx_minor = minor; // state+164
CU_state->version_string = token; // CU+2456
Version Validation -- sub_44A100
// sub_44A100: validate PTX version against known versions
bool sub_44A100(int version_id) {
int key = version_id;
return bsearch(&key,
xmmword_1CFD940, // sorted table base
0x2C, // 44 entries
4, // sizeof(int)
compar) != NULL; // simple integer compare
}
The 44-entry table at xmmword_1CFD940 contains the combined version IDs (major*10 + minor) for every PTX ISA version recognized by ptxas v13.0. This covers PTX 1.0 through 8.7+.
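The membership test is a plain `bsearch` over sorted integers and can be reproduced directly. The table below is an illustrative subset only, not the recovered contents of xmmword_1CFD940:

```c
#include <stdlib.h>

/* Illustrative subset of the version table; combined IDs are
 * major*10 + minor, kept sorted so bsearch works. */
static const int known_versions[] = {
    10, 11, 12, 13, 14, 15,          /* PTX 1.x */
    20, 21, 22, 23,                  /* PTX 2.x */
    30, 31, 32,                      /* PTX 3.x */
    80, 81, 82, 83, 84, 85, 86, 87,  /* PTX 8.x */
};

static int compar(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Mirrors sub_44A100: membership test via bsearch. */
int is_valid_ptx_version(int major, int minor) {
    int key = major * 10 + minor;
    return bsearch(&key, known_versions,
                   sizeof known_versions / sizeof known_versions[0],
                   sizeof(int), compar) != NULL;
}
```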
.target sm_XX -- Cases 5 and 38
The .target directive accepts a comma-separated list of targets: SM architecture identifiers (sm_XX, compute_XX) and feature modifiers (texmode_unified, texmode_independent, texmode_raw, map_f64_to_f32, debug).
Case 38 -- Target List Iteration
// Reconstructed from case 38
for (node = list_begin(*v5); !list_end(node); node = list_next(node)) {
char* target_str = list_value(node);
sub_4B1080(target_str, location, state);
}
Per-Target Handler -- sub_4B1080
The function branches on whether the target string contains "sm_" or "compute_".
SM/compute targets:
// SM target path in sub_4B1080
state->target_string = target_str; // state+152
CU->target_string = target_str; // CU+2448
state->arch_variant = sub_1CBEFD0(target_str); // state+177
int sm_id;
sscanf(target_str + prefix_len, "%d", &sm_id);
state->target_id = sm_id; // state+168
if (sm_id > state->max_target)
state->max_target = sm_id; // state+204
// Validate against one of three target tables:
// compute_ targets: unk_1D16160 (6 entries, 12 bytes each)
// sm_ sub-variant: unk_1D161C0 (7 entries, 12 bytes each)
// standard sm_: unk_1D16220 (32 entries, 12 bytes each)
// Each entry: { sm_id, required_ptx_major, required_ptx_minor }
entry = bsearch(&sm_id, table, count, 12, sub_484B70);
if (entry) {
if (!sub_455A80(entry->ptx_major, entry->ptx_minor, state))
state->version_mismatch_flag |= 1; // state+178
}
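The 12-byte-entry lookup can be sketched in C; the entry values below are illustrative placeholders, not the recovered contents of unk_1D16220:

```c
#include <stdlib.h>

/* Each table entry is 12 bytes: { sm_id, required_ptx_major,
 * required_ptx_minor }. Values are placeholders for the sketch. */
typedef struct {
    int sm_id;
    int ptx_major;
    int ptx_minor;
} sm_entry;

static const sm_entry sm_table[] = {
    { 75, 6, 4 },   /* sm_75 (Turing) -- illustrative requirement */
    { 80, 7, 0 },   /* sm_80 (Ampere) -- illustrative requirement */
    { 90, 8, 0 },   /* sm_90 (Hopper) -- illustrative requirement */
};

static int cmp_sm(const void *key, const void *elem) {
    return *(const int *)key - ((const sm_entry *)elem)->sm_id;
}

/* Mirrors the bsearch in sub_4B1080: map an sm_id to the minimum
 * PTX version that may target it; NULL for unknown targets. */
const sm_entry *lookup_sm(int sm_id) {
    return bsearch(&sm_id, sm_table,
                   sizeof sm_table / sizeof sm_table[0],
                   sizeof(sm_entry), cmp_sm);
}
```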
Feature modifiers:
| Modifier | PTX Requirement | Action | CU State |
|---|---|---|---|
map_f64_to_f32 | Deprecated for sm > 12 | Stored in feature map; CU+152 |= 1 | Feature flag |
texmode_unified | -- | Stored in feature map; default if none specified | Default |
texmode_independent | PTX >= 1.5 | Stored in feature map; CU+2464 = 1 | Tex mode |
texmode_raw | Requires state+220 flag | Stored in feature map; CU+2465 = 1 | Tex mode |
debug | PTX >= 3.0 | CU+2466 = 1; state+1033 = 1; state+834 = 1 | Debug on |
Texmode values are mutually exclusive. Each setter checks the feature hash map at state+208 for conflicting entries before inserting:
// texmode_unified path in sub_4B1080
if (map_get(state->feature_map, "texmode_independent"))
error("conflicting texmode: %s", target_str);
if (map_get(state->feature_map, "texmode_raw"))
error("conflicting texmode: %s", target_str);
map_put(state->feature_map, "texmode_unified", 1);
Case 5 -- Automatic Texmode Inference
When the .target directive omits an explicit texmode, case 5 infers one based on CLI flags:
if (arch_supports_texmode(CU->arch_capability)) {
if (!map_has(feature_map, "texmode_independent") &&
!map_has(feature_map, "texmode_raw")) {
if (state->cli_texmode_independent)
sub_4B1080("texmode_independent", loc, state);
else if (state->cli_texmode_raw)
sub_4B1080("texmode_raw", loc, state);
else
sub_4B1080("texmode_unified", loc, state);
}
}
.address_size 32|64 -- Case 10
// Reconstructed from case 10
sub_489050(state, 2, 3, ".address_size directive", location); // PTX >= 2.3
int value = stack_value;
if (((value - 32) & ~0x20) != 0) // allows exactly 32 and 64
error("Invalid address size: %d", value);
state->address_size = value; // state+196
The bit trick (v - 32) & ~0x20 passes for exactly two values:
- v=32: (32 - 32) & 0xFFFFFFDF = 0
- v=64: (64 - 32) & 0xFFFFFFDF = 0
Any other value produces a nonzero result and triggers an error.
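As a runnable check of the trick:

```c
/* The case-10 check: (v - 32) & ~0x20 is zero only for v == 32
 * (0 & ~0x20) and v == 64 (32 & ~0x20, where bit 5 is masked off). */
int is_valid_address_size(int v) {
    return ((v - 32) & ~0x20) == 0;
}
```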
.entry / .func Declarations -- Cases 76+, 82, 88, 97, 103
Function and entry declarations span multiple Bison productions because the grammar decomposes them into prototype, parameter list, linkage qualifier, and body productions. The central handler sub_497C00 processes both entry functions and device functions.
sub_497C00 -- Function Declaration Handler
// Reconstructed signature
int64 sub_497C00(
state, // parser state
int decl_type, // 1=visible, 2=forward, 3=extern, 4=static, 5=definition
name, // function name token
return_params, // return parameter list (NULL for entries)
params, // input parameter list
bool is_entry, // 1 for .entry, 0 for .func
bool is_func, // CU+80 qualifier for .func
scratch_regs, // scratch register list
int retaddr, // return address allocno (-1 if none)
bool noreturn, // .noreturn attribute
bool unique, // .unique attribute
bool force_inline, // .FORCE_INLINE attribute
location // source location token
);
Processing steps:
1. Scope creation: sub_44B9C0(state) creates a new scope context. The scope hash map at state+1008 maps scope IDs (starting at 61) to 40-byte scope descriptors.
2. Parameter node allocation: sub_44F6E0(state, scope, name, 0, 0, location) allocates a 48-byte parameter descriptor: {type_info, name, scope, alignment, init_data, location}.
3. Symbol lookup: sub_4504D0(state+968, name, 1, state) searches the current scope chain for an existing declaration.
4. Forward declaration resolution: if a matching forward declaration exists, the handler validates compatibility:
   - Declaration type consistency (except 2->1 and 4->1 promotions)
   - Parameter list type/alignment/state-space matching via sub_484DA0
   - Return parameter matching via sub_484DA0
   - Scratch register count and types
   - Return address register, first parameter register
   - .noreturn and .unique attribute consistency
   - Unified identifier matching
5. New function creation: if no prior declaration exists:
   - Registers in state+968 (regular scope) or state+976 (extern scope)
   - Calls sub_44FDC0 to record ABI metadata
   - For the Blackwell GB10B architecture (sub_70FA00(CU, 33)): allocates __nv_reservedSMEM_gb10b_war_var in shared memory as a hardware workaround
Case 82 -- Entry Function
// Case 82: .entry declaration
if (CU->output_param_context)
error("Parameter to entry function");
result = sub_497C00(state, decl_type, name,
NULL, // no return params for entries
params,
1, // is_entry = true
0, // is_func = false
NULL, // no scratch regs
-1, // no retaddr
0, 0, 0, // no .noreturn/.unique/.force_inline
location);
Case 88 -- Entry Function Body Completion
After the function body is parsed, case 88 performs the final validation pass:
1. Performance directive validation:
   - .maxntid and .reqntid are mutually exclusive
   - .maxnctapersm/.minnctapersm require either .maxntid or .reqntid
   - .reqntid + .reqnctapercluster require .blocksareclusters
   - .reqnctapercluster and .maxclusterrank are mutually exclusive
2. Kernel parameter size limits (computed via sub_42CBF0 + sub_484ED0):

   | PTX Version | Max Kernel Param Size |
   |---|---|
   | < 1.5 | 256 bytes |
   | >= 1.5, < 8.1 | 4,352 bytes |
   | >= 8.1 | 32,764 bytes |

   Parameters exceeding 4,352 bytes also require SM >= 70 and PTX >= 8.1.
3. Debug labels: generates __$startLabel$__<name> and __$endLabel$__<name> for DWARF debug info.
4. Debug hash: if debug mode is enabled (state+856 != 0), computes CRC32(name) % 0xFFFF + base as a debug identifier stored at func->80+176.
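Assuming the standard reflected CRC-32 polynomial (0xEDB88320) -- the exact variant used by ptxas is not confirmed from the binary -- the case-88 debug hash can be sketched as:

```c
#include <stdint.h>

/* Bitwise CRC-32 (reflected, poly 0xEDB88320) over a C string.
 * Whether ptxas uses this exact polynomial is an assumption. */
static uint32_t crc32_str(const char *s) {
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s; s++) {
        crc ^= (uint8_t)*s;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Debug identifier as described for case 88: CRC32(name) % 0xFFFF + base. */
uint32_t debug_hash(const char *name, uint32_t base) {
    return crc32_str(name) % 0xFFFFu + base;
}
```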
Case 97 -- Device Function
// Case 97: .func declaration
result = sub_497C00(state, decl_type, name,
return_params, params,
0, // is_entry = false
CU->return_qualifier, // CU+80
scratch_regs, retaddr,
noreturn, unique, force_inline,
location);
State-Space Declarations -- .global, .shared, .local, .const
State-space directives set the "current state space" field (CU+24) and then delegate to sub_4A0CD0 for variable declaration processing or sub_4A2020 for declaration-without-initializer processing.
State-Space Code Assignment
| Case | Action | State Space |
|---|---|---|
| 57 | *CU = 1 | (extern/unresolved) |
| 59 | *CU = 3 | .shared |
| 61 | *CU = 2 | .global |
| 63 | *CU = 4 | .local |
| 65 | *CU = 5 | .const |
| 67 | *CU = 0 | .reg |
| 58, 60, 62, 64, 66, 68 | sub_4A2020(...) | Process declaration in current space |
The odd-numbered cases set the state-space code; the immediately following even-numbered cases trigger the actual declaration processing.
Variable Validator -- sub_4A0CD0
This 4,937-byte function validates variable declarations across all state spaces. Key checks:
1. Type validation: resolves .texref via sub_450D00. For types 9 (.surfref) and 10 (.texref), enforces .tex deprecation after PTX 1.5 and .surfref scope restrictions.
2. .b128 type: requires PTX >= 8.3 (sub_455A80(8, 3)) and SM >= 70 (sub_489390(state, 70)).
3. State-space restrictions:
   - .managed valid only with .global (space 5)
   - .reserved valid only with .shared (space 8); reserved shared alignment must be <= 64
   - .common valid only with .const
   - .param at file scope requires .const space
   - .local const disallowed at file scope
4. Texmode interaction:
   - .surfref types require texmode_independent in the feature map
   - .tex/.texref types are incompatible with texmode_raw
5. Initializer handling: if an initializer is present, calls sub_4A02A0 to validate constant expressions (no function pointers, no entry functions as values, no opaque type initializers).
State-Space Linked Lists -- 18 Lists at state+368
The parser maintains 18 linked list heads (9 head/tail pairs) at state offsets 368--512 to track declared symbols per state space:
| Offset Pair | Index | State Space |
|---|---|---|
| 368/376 | 0 | .global |
| 384/392 | 1 | .shared |
| 400/408 | 2 | .local |
| 416/424 | 3 | .const |
| 432/440 | 4 | .param |
| 448/456 | 5 | .tex |
| 464/472 | 6 | .surf |
| 480/488 | 7 | .sampler |
| 496/504 | 8 | reserved / other |
Initialization (case 3 -- section begin): Iterates j from 0 to 144 in steps of 8, allocating an 88-byte sentinel node (type=6) for each list. Each node's +48 field links to per-section tracking data at state+656 + j.
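A minimal C sketch of the case-3 initialization loop, with an oversized byte-array stand-in for the ~1,200-byte state object (struct shape and helper names are assumptions; the offsets match the text):

```c
#include <stdlib.h>

enum { NODE_SIZE = 88, SENTINEL_TYPE = 6 };

/* Stand-in for the parser state object: raw bytes, addressed by offset. */
typedef struct {
    char state[2048];
} parser_state;

void init_state_space_lists(parser_state *st) {
    for (int j = 0; j < 144; j += 8) {           /* 18 list slots * 8 bytes */
        char *node = calloc(1, NODE_SIZE);       /* 88-byte sentinel node   */
        *(int *)node = SENTINEL_TYPE;            /* node type = 6           */
        /* +48 links to per-section tracking data at state+656 + j */
        *(void **)(node + 48) = st->state + 656 + j;
        *(void **)(st->state + 368 + j) = node;  /* list head at state+368+j */
    }
}
```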
Scope teardown (case 76 -- new compilation unit): Destroys old symbol tables via sub_425D20, clears the target feature map, and merges scope-level lists into the parent scope by concatenating linked list chains for offsets 16, 48, 112, 128, 144, and 184 of the scope node.
.reg / .param -- Register and Parameter Declarations
Within function bodies, .reg and .param declarations create typed register/parameter entries. Three grammar productions handle the variants:
Declaration Node Layout (56 bytes)
| Offset | Type | Field |
|---|---|---|
| 0 | ptr | Type list pointer |
| 8 | ptr | Name pointer |
| 16 | int32 | State-space code |
| 20 | byte | Is array |
| 21 | byte | Is vector |
| 24 | int32 | Alignment |
| 28 | byte | Extra flags |
| 40 | int32 | Count / range start |
| 44 | int32 | Range end (0xFFFFFFFF = no upper bound) |
| 48 | ptr | Auxiliary data |
Case 110 -- Single declaration: Reads type info from CU_state (offsets +16, +24, +28, +29, +32, +36), allocates the 56-byte node, sets count from the parsed integer, and calls sub_48BE80(state) to validate.
Case 111 -- Range declaration: Same as 110 but sets both start and end bounds. The sentinel value 0xFFFFFFFF at offset 44 distinguishes range from single declarations.
Case 112 -- Vector declaration: Handles vector type qualifiers (.v2, .v4).
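A hypothetical C view of the 56-byte node, with explicit padding so the field offsets match the table above (field names are inferred, not recovered):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    void    *type_list;    /* +0  */
    char    *name;         /* +8  */
    int32_t  space;        /* +16 state-space code */
    uint8_t  is_array;     /* +20 */
    uint8_t  is_vector;    /* +21 (then 2 bytes of natural padding) */
    int32_t  alignment;    /* +24 */
    uint8_t  flags;        /* +28 */
    uint8_t  _pad[11];     /* +29..+39: unrecovered */
    int32_t  start;        /* +40: count / range start */
    int32_t  end;          /* +44: 0xFFFFFFFF = no upper bound */
    void    *aux;          /* +48 */
} decl_node;

/* The sentinel at +44 distinguishes range from single declarations. */
int is_range_decl(const decl_node *d) {
    return d->end != (int32_t)0xFFFFFFFF;
}
```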
Visibility / Linkage Directives
.weak -- Case 55
sub_489050(state, 3, 1, ".weak directive", location); // PTX >= 3.1
.common -- Case 56
sub_489050(state, 5, 0, ".common directive", location); // PTX >= 5.0
Linkage Qualifiers -- Cases 78--81
These set CU+81 (declaration linkage type) within function prototype production contexts:
| Case | Linkage | PTX Directive |
|---|---|---|
| 78 | 1 | (none -- default/internal) |
| 79 | 3 | .extern |
| 80 | 2 | .visible |
| 81 | 4 | .weak |
.alias -- Case 41
Symbol aliasing requires PTX >= 6.3 and SM >= 30:
// Reconstructed from case 41
sub_489050(state, 6, 3, ".alias", location); // PTX >= 6.3
sub_489390(state, 0x1E, ".alias", location); // SM >= 30
sym1 = sub_4504D0(state->scope_chain, name1, 1, state);
sym2 = sub_4504D0(state->scope_chain, name2, 1, state);
if (!sym1) error("undefined symbol: %s", name1);
if (!sym2) error("undefined symbol: %s", name2);
Validation:
- Both symbols must be function type (node type == 5)
- sym1 must not already have a body defined (sym1->80->88 == 0)
- Neither can be an entry function
- No self-aliasing (names must differ)
- Parameter lists must match (calls sub_4036D9 twice: once for return params, once for input params)
- .noreturn attribute must be consistent across both symbols
- Cannot alias to .extern or declaration-qualified functions

On success: sym1->80->64 = sym2 (sets the alias-target pointer).
.pragma -- Case 42
The .pragma directive requires PTX >= 2.0 and dispatches through a prefix-matching chain. Each pragma string is compared against known prefixes via sub_4279D0 (starts-with test):
// Reconstructed dispatch structure from case 42
for (node = list_begin(pragma_list); !list_end(node); node = list_next(node)) {
char* pragma_str = list_value(node);
sub_489050(state, 2, 0, ".pragma directive", location); // PTX >= 2.0
char* arch_str = sub_457CB0(CU->arch_descriptor, index);
if (starts_with(arch_str, pragma_str)) {
// matched known pragma
dispatch_to_handler(pragma_str, state);
}
}
Pragma Dispatch Chain
| Priority | Prefix Index | Pragma | Handler | Storage |
|---|---|---|---|---|
| 1 | sub_457CB0(arch, 1) | "noinline" | sub_456A50 + sub_48D8F0 | CU+156, CU+192 |
| 2 | sub_457CB0(arch, 3) | inline-related | Sets CU+160 = 1 | CU+160 |
| 3 | sub_457CB0(arch, 16) | register-usage | sub_4563E0 + sub_48C370 | CU+172 |
| 4 | sub_457CB0(arch, 5) | min threads | sub_4563E0 + sub_48C6F0 | CU+164 or CU+168 |
| 5 | sub_457CB0(arch, 9) | max constraint | sub_4567E0 + sub_403D2F | CU+176 |
| 6 | sub_457CB0(arch, 10) | min constraint | sub_4567E0 + sub_403D2F | CU+184 |
| 7 | sub_457CB0(arch, 18) | deprecated | Warning via dword_29FA6C0 | -- |
| 8 | sub_457CC0(arch, 1) | deprecated | Warning via dword_29FA6C0 | -- |
| 9--11 | sub_457C60/CA0/C70 | unsupported | Warning via dword_29FA7F0 | -- |
| 12 | sub_457D30/D50 | unsupported | Warning via dword_29FA7F0 | -- |
| 13 | sub_457CB0(arch, 22) | function-level | Appends to func or module pragma list | func->80->80 or state+272 |
Unmatched pragmas trigger an error via dword_29FA6C0.
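The dispatch shape -- a starts-with test tried against prefixes in priority order -- can be sketched in C. The prefix strings here are illustrative only, since the real ones come from per-architecture sub_457CB0 lookups:

```c
#include <string.h>

/* Mirrors sub_4279D0: does s begin with prefix? */
static int starts_with(const char *prefix, const char *s) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Illustrative pragma classes; the real chain has 13 priority slots. */
typedef enum { P_NOINLINE, P_REGUSAGE, P_UNKNOWN } pragma_kind;

pragma_kind classify_pragma(const char *pragma_str) {
    if (starts_with("noinline", pragma_str)) return P_NOINLINE;
    if (starts_with("regcount", pragma_str)) return P_REGUSAGE; /* placeholder prefix */
    return P_UNKNOWN;   /* unmatched -> error path (dword_29FA6C0) */
}
```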
Feature Version Gating
Two functions guard every directive against minimum PTX ISA version and SM architecture requirements. They are called hundreds of times throughout the Bison reduction actions.
sub_489050 -- PTX ISA Version Check
// sub_489050(state, required_major, required_minor, directive_name, location)
char sub_489050(state, int major, int minor, char* name, location) {
    if (sub_454E70(state->version_check_disabled)) // state+960
        return 1; // checks disabled
    if (state->lenient_mode) // state+832
        return 1; // lenient mode
    char ok = sub_455A80(major, minor, state);
    if (!ok) {
        char buf[152];
        sprintf(buf, "%d.%d", major, minor);
        sub_42FBA0(error_desc, location, name, buf); // emit version error
    }
    return ok;
}
sub_489390 -- SM Architecture Check
// sub_489390(state, required_sm, directive_name, location)
char sub_489390(state, uint required_sm, char* name, location) {
    if (sub_454E70(state->version_check_disabled)) // state+960
        return 1;
    char ok = state->target_string && state->target_id >= required_sm;
    if (!ok) {
        // state+152 == NULL or state+168 < required_sm
        char buf[48];
        sprintf(buf, "sm_%d", required_sm);
        sub_42FBA0(error_desc, location, name, buf); // emit arch error
    }
    return ok;
}
Version Requirements by Directive
| Directive | PTX ISA | SM Architecture |
|---|---|---|
.address_size | >= 2.3 | -- |
.weak | >= 3.1 | -- |
.common | >= 5.0 | -- |
.alias | >= 6.3 | >= 30 |
.branchtargets | >= 6.0 | >= 30 |
.calltargets | >= 2.1 | >= 20 |
.callprototype | >= 2.1 | >= 20 |
.pragma | >= 2.0 | -- |
texmode_independent | >= 1.5 | -- |
debug target | >= 3.0 | -- |
| kernel param list | >= 1.4 | -- |
| opaque types | >= 1.5 | -- |
.b128 type | >= 8.3 | >= 70 |
| kernel params > 4352B | >= 8.1 | >= 70 |
Parser State Object Layout
The parser state object (v1127 / a1 in sub_4CE6B0) is approximately 1,200 bytes. Key offsets for directive handling:
| Offset | Type | Field |
|---|---|---|
| 72 | ptr | Module-level output buffer |
| 88 | ptr | Current function link |
| 144 | char* | .version string (e.g., "8.5") |
| 152 | char* | .target string (e.g., "sm_90") |
| 160 | int32 | PTX major version |
| 164 | int32 | PTX minor version |
| 168 | int32 | SM architecture ID |
| 177 | byte | Architecture sub-variant flag |
| 178 | byte | Version mismatch flag |
| 196 | int32 | .address_size (32 or 64) |
| 204 | int32 | Maximum SM ID encountered |
| 208 | ptr | Target feature hash map |
| 219 | byte | CLI texmode_independent flag |
| 220 | byte | CLI texmode_raw flag |
| 272 | ptr | Module pragma list head |
| 368--512 | ptr[18] | State-space linked list heads |
| 656--800 | bytes | Per-section tracking data (144 bytes) |
| 832 | byte | Lenient mode flag |
| 834 | word | Debug mode flags |
| 856 | int32 | Debug hash base |
| 960 | int32 | Version check disable flag |
| 968 | ptr | Scope root (top-level symbol table) |
| 976 | ptr | Extern function scope |
| 984 | ptr | Current scope chain pointer |
| 1000 | byte | Function body active flag |
| 1008 | ptr | Scope hash map |
| 1033 | byte | Debug info enabled |
| 1096 | ptr | CU_state pointer |
Function Map
| Address | Size | Identity | Callers |
|---|---|---|---|
sub_44A100 | 39 B | PTX version bsearch validator | case 35 |
sub_44B9C0 | 171 B | Scope context creator | case 82, 97 via sub_497C00 |
sub_44F6E0 | 135 B | Parameter node allocator (48 B nodes) | sub_497C00 |
sub_489050 | 115 B | PTX ISA version gate | ~30 directive cases |
sub_489390 | 85 B | SM architecture version gate | ~15 directive cases |
sub_497C00 | 2,992 B | Function/entry declaration handler | cases 82, 97 |
sub_4A0CD0 | 4,937 B | Variable/symbol declaration validator | cases 58--68 |
sub_4A02A0 | 2,607 B | Initializer/constant expression validator | sub_4A0CD0 |
sub_4B1080 | ~700 B | Per-target handler (SM + texmode) | cases 5, 38 |
sub_4036D9 | 437 B | Parameter list compatibility check | case 41 (.alias) |
sub_4CE6B0 | 48,263 B | Bison parser (all directive cases) | compilation driver |
PTX-to-Ori Lowering
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The PTX-to-Ori lowering is the transition from parsed PTX assembly into the Ori internal representation -- the SASS-level, virtual-register IR that all subsequent optimization operates on. Unlike a traditional compiler where the parser builds an AST and a separate lowering pass consumes it, ptxas has no materialized AST: the Bison parser's reduction actions directly construct Ori IR nodes, basic blocks, and CFG edges inline. What the --compiler-stats timer calls "DAGgen-time" measures this inline construction phase. The result is a raw Ori IR that still uses PTX-derived opcodes and has unresolved architecture-dependent constructs. Fourteen "bridge phases" (pipeline indices 0--13) then transform this raw IR into the optimizer-ready form where every instruction carries its final SASS opcode, the CFG is fully annotated, and architecture-incompatible operations have been legalized.
The key architectural consequence of this design: there is no separate "lowering" function that you can point at and say "this converts PTX to Ori." The conversion is distributed across (1) the Bison parser's 443 reduction actions, (2) a 44 KB operand processing function, (3) the MercConverter instruction legalization pass, and (4) six additional bridge phases that handle FP16 promotion, control flow canonicalization, macro fusion, and recipe application.
| DAGgen timer | "DAGgen-time : %.3f ms (%.2f%%)\n" (inline Bison -> Ori construction) |
| Bison parser | sub_4CE6B0 (48 KB, 512 productions, 443 reductions, no AST) |
| Operand processing | sub_6273E0 (44 KB, 6-bit operand type switch) |
| MercConverter | sub_9F1A90 (35 KB, opcode-dispatched visitor) |
| MercConverter orchestrator | sub_9F3340 (7 KB) |
| Opcode dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF) |
| Post-conversion lowering | sub_9EF5E0 (27 KB, string "CONVERTING") |
| Bridge phases | Phases 0--13 (14 phases, first group in the 159-phase pipeline) |
| Diagnostic dump | Phase 9: ReportInitialRepresentation (sub_A3A7E0 stats emitter) |
| Intrinsic descriptors | sub_9EE390 (20 KB, "IntrinsicDescrFile=%s") |
Architecture
PTX source text
|
v
[Flex scanner] sub_720F00 (15.8KB, 552 rules)
| token stream
v
[Bison parser] sub_4CE6B0 (48KB, 512 productions)
| NO AST -- reduction actions build IR directly:
| - allocate instruction nodes from pool
| - set opcode field (instruction +72)
| - build operand array (instruction +84)
| - link into doubly-linked list per basic block
| - create basic block entries (40B each)
| - populate CFG hash maps (Code Object +648, +680)
|
v "DAGgen-time"
[Operand processing] sub_6273E0 (44KB) boundary
| 6-bit type switch (v12 & 0x3F) ----------
| address computation, state space annotation
v
+----------------------------------------------------------+
| RAW ORI IR (PTX-derived opcodes, virtual registers) |
| Instructions: PTX-level names (add.f32, ld.global, etc) |
| Registers: virtual R-file, typed descriptors |
| CFG: basic blocks + edge hash maps (partially formed) |
+----------------------------------------------------------+
|
| Phase 0: OriCheckInitialProgram (validate)
| Phase 1: ApplyNvOptRecipes (configure opt levels)
| Phase 2: PromoteFP16 (FP16 -> FP32 where needed)
| Phase 3: AnalyzeControlFlow (finalize CFG + RPO + backedges)
| Phase 4: AdvancedPhaseBeforeConvUnSup (arch hook, no-op default)
| Phase 5: ConvertUnsupportedOps (MercConverter: PTX ops -> SASS ops)
| Phase 6: SetControlFlowOpLastInBB (CFG structural fixup)
| Phase 7: AdvancedPhaseAfterConvUnSup (arch hook, no-op default)
| Phase 8: OriCreateMacroInsts (fuse instruction sequences)
| Phase 9: ReportInitialRepresentation (diagnostic dump)
| Phase 10: EarlyOriSimpleLiveDead (dead code elimination)
| Phase 11: ReplaceUniformsWithImm (fold known constants)
| Phase 12: OriSanitize (validate post-bridge IR)
| Phase 13: GeneralOptimizeEarly (bundled copy-prop + const-fold)
v "OCG-time"
+----------------------------------------------------------+ begins
| OPTIMIZER-READY ORI IR |
| Instructions: SASS opcodes (FADD, IMAD, LDG, STG, ...) |
| Registers: virtual R/UR/P/UP files |
| CFG: complete with RPO, backedge map, loop headers |
+----------------------------------------------------------+
|
v
[Phase 14+: main optimization pipeline]
Inline IR Construction (Bison -> Ori)
The Bison parser at sub_4CE6B0 has 512 grammar productions with 443 reduction-action cases. Each reduction action constructs IR directly -- no intermediate AST is ever materialized. The instruction table builder (sub_46E000, 93 KB, 1,141 per-opcode registration calls to sub_46BED0) runs during parser initialization and registers the legal type combinations for every PTX instruction. The instruction lookup subsystem (sub_46C690 entry, sub_46C6E0 matcher at 6.4 KB) classifies operands into 12 categories at parse time.
When the parser encounters a PTX instruction like add.f32 %r1, %r2, %r3, it:
1. Looks up add.f32 in the opcode table to get the internal opcode index and validate the type qualifier .f32
2. Allocates an Ori instruction node from the memory pool
3. Writes the opcode into the instruction field at offset +72
4. Processes each operand through sub_6273E0 to build the packed operand array at offset +84
5. Links the instruction into the current basic block's doubly-linked list
6. If the instruction is a branch/jump/return, creates a CFG edge in the successor hash map at Code Object +648
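The steps above can be sketched as a minimal C fragment. Field names and layout are simplified stand-ins for the real pool-allocated nodes (which keep the opcode at +72 and operands at +84):

```c
#include <stdlib.h>

/* Simplified instruction node; the real node is pool-allocated with
 * many more fields. */
typedef struct instr {
    int opcode;
    int operands[4];
    int num_operands;
    struct instr *prev, *next;   /* per-basic-block doubly-linked list */
} instr;

typedef struct {
    instr *head, *tail;
} basic_block;

/* Build one instruction and append it to the block's list, as the
 * reduction actions do inline during parsing. */
instr *emit(basic_block *bb, int opcode, const int *ops, int n) {
    instr *i = calloc(1, sizeof *i);
    i->opcode = opcode;
    i->num_operands = n;
    for (int k = 0; k < n; k++) i->operands[k] = ops[k];
    i->prev = bb->tail;
    if (bb->tail) bb->tail->next = i; else bb->head = i;
    bb->tail = i;
    return i;
}
```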
Special PTX registers (%ntid, %laneid, %smid, %ctaid, %clock64, etc.) are mapped to internal identifiers during parser initialization at sub_451730. The mapping table is built from the ROT13-encoded opcode table populated by ctor_003 at 0x4095D0.
Operand Processing -- sub_6273E0
The 44 KB operand processing function handles all PTX operand forms. It switches on a 6-bit type encoding extracted as v12 & 0x3F:
| Operand kind | Description | PTX syntax | Processing |
|---|---|---|---|
| Register | Direct register reference | %r1, %rd1, %f1 | Look up register descriptor via *(ctx+88) + 8*regId |
| Register pair | 64-bit register pair | %rd1 (on 32-bit ALU) | Allocate paired descriptors, link hi/lo |
| Immediate | Integer constant | 42, 0xFF | Pack into operand field |
| Float immediate | Floating-point constant | 0F3F800000 | Encode IEEE 754 bits |
| Address | Base + offset | [%rd1+16] | Compute effective address, annotate state space |
| Constant bank | Constant memory ref | c[2][0x100] | Bank index + offset encoding |
| Label | Branch target | $L__BB0_1 | Resolve to basic block index |
| Special register | Built-in register | %ntid.x, %laneid | Map to internal ID from sub_451730 table |
String evidence in sub_6273E0:
- ".nv.reservedSmem.offset0" -- reserved shared memory region handling
- "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
- "__$endLabel$__%s" -- label generation for structured control flow expansion
The function bridges PTX's explicitly-typed operand model (where .u32, .f32, .b64 qualifiers are part of the syntax) to Ori's implicitly-typed model where the operand type is determined by the SASS opcode.
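The dispatch shape -- mask the low 6 bits, then switch on the kind -- can be sketched as follows. The numeric kind values are placeholders, not the recovered encoding:

```c
#include <stdint.h>
#include <string.h>

/* Placeholder 6-bit kind values; the real encoding is not recovered. */
enum operand_kind {
    OP_REG = 0, OP_REGPAIR = 1, OP_IMM = 2, OP_FIMM = 3,
    OP_ADDR = 4, OP_CBANK = 5, OP_LABEL = 6, OP_SREG = 7,
};

const char *describe_operand(uint32_t v12) {
    switch (v12 & 0x3F) {            /* low 6 bits select the kind */
    case OP_REG:     return "register";
    case OP_REGPAIR: return "register pair";
    case OP_IMM:     return "immediate";
    case OP_FIMM:    return "float immediate";
    case OP_ADDR:    return "address";
    case OP_CBANK:   return "constant bank";
    case OP_LABEL:   return "label";
    case OP_SREG:    return "special register";
    default:         return "unknown";
    }
}
```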
Bridge Phases (0--13)
Phase 0: OriCheckInitialProgram -- Validation
Validates the raw Ori IR produced by the Bison parser for structural correctness: all basic blocks have valid entry/exit points, instruction operand counts match opcode requirements, register references are within bounds, and CFG edges are consistent. This is a pure validation pass that produces no IR transformations. It catches malformed IR early, before any optimization pass can amplify a structural error into a hard-to-diagnose miscompile.
Phase 1: ApplyNvOptRecipes -- Optimization Level Configuration
Applies NvOptRecipe transformations controlled by option 391. When enabled, the PhaseManager's constructor (sub_C62720) allocates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This sub-manager configures per-phase behavior based on the NvOpt level (0--5), controlling which later phases are active and their aggressiveness:
| NvOpt level | Behavior |
|---|---|
| 0 | Minimal optimization (fast-compile path, many phases isNoOp()) |
| 1--2 | Standard optimization |
| 3--4 | Aggressive optimization (loop unrolling, speculative hoisting enabled) |
| 5 | Maximum optimization (may significantly increase compile time) |
The string "Invalid nvopt level : %d." in sub_C173E0 confirms the valid range. The recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes. The sub-manager maintains its own sorted array (+376) and hash table (+400..+416) for fast recipe lookup by phase index.
NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)
+0 compilation_unit
+8 phase_manager back-reference
+16 ref_counted_list_1
+312 recipe_data
+336 allocator
+344 timing_records (stride = 584 per entry)
+376 sorted_array (for binary search by phase index)
+400 hash_bucket_count
+408 hash_buckets
+432 shared_list_ptr (ref-counted)
Phase 2: PromoteFP16 -- Half-Precision Type Promotion
Promotes half-precision (FP16) operations where hardware support is insufficient or promotion yields better throughput. The promotion strategy is architecture-dependent:
- Pre-sm_53: no native FP16 ALUs. All FP16 arithmetic is expanded to FP32 with narrowing conversions at stores.
- sm_53+: native FP16 support. Only operations that require expensive multi-instruction sequences in FP16 (certain transcendentals, complex comparisons) are promoted.
- sm_89+ (Ada, Blackwell): wide FP16 tensor paths. Promotion is minimal; most FP16 stays native.
The phase walks the instruction linked list, inspects each instruction's type encoding at offset +72, and rewrites FP16 operations to FP32 equivalents by replacing the opcode and inserting conversion instructions (F2F in SASS terminology) at use/def boundaries.
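The pre-sm_53 expansion shape can be sketched as a list rewrite in C. Opcode names are placeholders, not the recovered internal opcodes:

```c
enum { MAX_OUT = 16 };

/* Placeholder opcodes: HADD is an FP16 add; F2F_W/F2F_N are the
 * widening and narrowing conversions inserted at use/def boundaries. */
typedef enum { HADD, FADD, F2F_W, F2F_N, NOP } op;

/* Expand each FP16 add into widen / FP32 add / narrow -- the shape of
 * the pre-sm_53 promotion described above. Returns the output count. */
int promote_fp16(const op *in, int n, op *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (in[i] == HADD) {
            out[m++] = F2F_W;   /* convert FP16 sources up to FP32 */
            out[m++] = FADD;    /* do the arithmetic in FP32 */
            out[m++] = F2F_N;   /* convert the result back to FP16 */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}
```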
Phase 3: AnalyzeControlFlow -- CFG Finalization
Builds and finalizes the control flow graph data structures that the optimizer requires:
- Successor edges: populates the FNV-1a hash table at Code Object +648
- Backedge map: computes backedges and stores them at Code Object +680
- RPO array: builds the reverse post-order traversal at Code Object +720
- Loop identification: marks loop headers and backedge targets for later loop optimization passes (phases 18, 22, 24, 59)
The Bison parser constructs basic blocks and edges incrementally as it processes PTX instructions, but the CFG is not guaranteed to be fully consistent until this phase runs. For example, forward branch targets may reference blocks that were not yet created at parse time. This phase resolves all pending edges and ensures the CFG is complete.
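The RPO construction itself is a standard DFS post-order, reversed. A minimal sketch over a toy adjacency-list CFG (the real implementation walks the hash tables at Code Object +648):

```c
enum { MAX_BB = 32 };

/* Toy CFG: up to two successors per block, block 0 is the entry. */
typedef struct {
    int succ[MAX_BB][2];
    int num_succ[MAX_BB];
} cfg;

static void dfs(const cfg *g, int b, int *seen, int *post, int *n) {
    seen[b] = 1;
    for (int i = 0; i < g->num_succ[b]; i++)
        if (!seen[g->succ[b][i]])
            dfs(g, g->succ[b][i], seen, post, n);
    post[(*n)++] = b;                /* record block in post-order */
}

/* Fills rpo[] with blocks reachable from block 0 in reverse
 * post-order; returns the number of reachable blocks. */
int compute_rpo(const cfg *g, int *rpo) {
    int seen[MAX_BB] = {0}, post[MAX_BB], n = 0;
    dfs(g, 0, seen, post, &n);
    for (int i = 0; i < n; i++)
        rpo[i] = post[n - 1 - i];    /* reverse the post-order */
    return n;
}
```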
Phases 4 and 7: Architecture Hook Points
Phases 4 (AdvancedPhaseBeforeConvUnSup) and 7 (AdvancedPhaseAfterConvUnSup) are no-op-by-default hook points that bracket ConvertUnsupportedOps. Architecture backends override their vtables to inject target-specific processing:
- Phase 4 (before): prepare target-specific state, mark instructions that need special handling on this architecture
- Phase 7 (after): clean up after legalization, fix architecture-specific edge cases introduced by the generic lowering
These hooks are part of the 16 AdvancedPhase injection points distributed throughout the 159-phase pipeline. The architecture vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the sm_version.
Phase 5: ConvertUnsupportedOps -- Instruction Legalization
The most substantial bridge phase. Lowers PTX operations that have no direct SASS equivalent for the target architecture. This phase runs the MercConverter engine (see next section) and handles:
- 64-bit integer arithmetic on architectures with 32-bit ALUs: splits `add.s64`/`mul.lo.s64` into hi/lo 32-bit instruction pairs using carry chains
- Complex addressing modes: decomposes multi-component addresses into separate arithmetic instructions
- PTX-specific operations: converts PTX instructions that have no 1:1 SASS mapping (e.g., `bfe`, `bfi`, `prmt` variants not supported on all targets)
- Architecture availability: gates instructions by SM version (an instruction added in sm_80 is lowered to a multi-instruction sequence on sm_70)
- Texture/surface operations: legalizes texture sampling and surface access patterns (`sub_9E8B20`, 17 KB)
- Memory operations: legalizes load/store patterns, address register handling (`sub_9D76D0`/`sub_9D80E0`, 17--18 KB each)
After ConvertUnsupportedOps completes, every instruction in the IR has a valid SASS opcode for the target architecture.
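The 64-bit split amounts to materializing a 32-bit carry chain. A sketch of the arithmetic being emitted (illustrative only; the actual pass rewrites Ori instruction nodes rather than computing values):

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add lowered to two 32-bit adds linked by a carry, the pattern
   ConvertUnsupportedOps emits on 32-bit-ALU targets. The SASS mnemonics
   in the comments are the carry-chain pair this models. */
static void add64_split(uint32_t alo, uint32_t ahi,
                        uint32_t blo, uint32_t bhi,
                        uint32_t *rlo, uint32_t *rhi) {
    uint32_t lo    = alo + blo;       /* low add, sets carry-out        */
    uint32_t carry = (lo < alo);      /* detect 32-bit overflow         */
    *rlo = lo;
    *rhi = ahi + bhi + carry;         /* high add consumes the carry-in */
}
```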
The late phase 132 (UpdateAfterConvertUnsupportedOps) runs cleanup for edge cases introduced by this phase that are only detectable after optimization.
Phase 6: SetControlFlowOpLastInBB -- CFG Structural Fixup
Enforces a critical structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering when a PTX instruction expands to a sequence ending in a branch), this phase splits the block at the control flow point.
The invariant is required by the scheduler (which assumes only the last instruction in a block can transfer control) and the register allocator (which computes live-out sets at block boundaries). The phase rewrites the instruction linked list and allocates new 40-byte basic block entries as needed.
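A minimal sketch of the splitting invariant over simplified stand-in types (not the recovered 40-byte block layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Enforce "control flow op is last in its block": whenever a CF
   instruction has a successor inside the same block, move the tail
   into a freshly allocated block. Simplified stand-in types. */
typedef struct Inst  { struct Inst *next; bool is_cf; } Inst;
typedef struct Block { Inst *first; struct Block *next_block; } Block;

static int split_after_cf(Block *b) {
    int splits = 0;
    while (b) {
        for (Inst *i = b->first; i; i = i->next) {
            if (i->is_cf && i->next) {          /* CF op not last in block */
                Block *nb = calloc(1, sizeof *nb);
                nb->first = i->next;            /* tail becomes a new block */
                nb->next_block = b->next_block;
                b->next_block = nb;
                i->next = NULL;                 /* truncate original block  */
                splits++;
                break;       /* the tail is scanned as its own block next */
            }
        }
        b = b->next_block;
    }
    return splits;
}
```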
Phase 8: OriCreateMacroInsts -- Macro Fusion
Identifies and fuses instruction sequences into macro instructions for hardware efficiency. The phase scans the instruction linked list for patterns that the GPU hardware can execute as a single macro-op:
- Compare + branch: fused into a conditional branch macro instruction
- Multiply + add: fused into FMA where not already (different from PTX `fma` -- this catches `mul` followed by `add` on the same operands)
- Address computation + memory access: fused sequences for coalesced access patterns
The fused macro instructions carry composite semantics in a single IR node. They are expanded back into individual SASS instructions much later at phase 118 (MercExpandInstructions), after scheduling has determined the optimal placement. This late expansion allows the optimizer to treat the fused sequence as atomic, preventing passes from inserting unrelated instructions between the components.
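The mul-followed-by-add case can be sketched as a pattern scan over stand-in IR types. Note the real pass must also verify that the multiply's result has no other uses before killing it; this sketch simply assumes single use:

```c
#include <assert.h>

/* Fuse an adjacent mul/add pair into an FMA-style macro instruction.
   Opcode values and operand fields are illustrative, not the recovered
   encoding at instruction offset +72. Assumes the mul result is
   single-use; the real pass must prove that. */
enum { OP_MUL, OP_ADD, OP_FMA, OP_DEAD };
typedef struct MInst {
    struct MInst *next;
    int op;
    int dst, src0, src1, src2;
} MInst;

static int fuse_mul_add(MInst *head) {
    int fused = 0;
    for (MInst *i = head; i && i->next; i = i->next) {
        MInst *j = i->next;
        if (i->op == OP_MUL && j->op == OP_ADD &&
            (j->src0 == i->dst || j->src1 == i->dst)) {
            int addend = (j->src0 == i->dst) ? j->src1 : j->src0;
            j->op   = OP_FMA;              /* j := i.src0 * i.src1 + addend */
            j->src0 = i->src0;
            j->src1 = i->src1;
            j->src2 = addend;
            i->op   = OP_DEAD;             /* mul now dead; DCE removes it  */
            fused++;
        }
    }
    return fused;
}
```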
Phase 9: ReportInitialRepresentation -- Diagnostic Dump
Dumps the Ori IR state for debugging, active when DUMPIR or --ftrace diagnostics are enabled. The stats emitter at sub_A3A7E0 prints a per-function profile:
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [instHint=0] [instPairs=0]
This snapshot provides the pre-optimization baseline. Comparing it against ReportBeforeScheduling (phase 96) and ReportFinalMemoryUsage (phase 126) shows the optimizer's impact on instruction count, register pressure, and estimated latency.
Phases 10--13: Early Cleanup
| Phase | Name | Purpose |
|---|---|---|
| 10 | EarlyOriSimpleLiveDead | First dead code elimination pass. Removes instructions whose results are unused. Uses the SIMD-accelerated bitvector library (sub_BDBA60..sub_BDE150) for liveness computation. |
| 11 | ReplaceUniformsWithImm | Folds known-constant uniform register loads into immediate operands. Important for kernel launch parameters passed through constant memory. |
| 12 | OriSanitize | Second structural validation after all bridge transformations. Catches errors introduced by phases 1--11 before the main optimizer begins. |
| 13 | GeneralOptimizeEarly | First compound optimization pass: copy propagation + constant folding + algebraic simplification in a single fixed-point iteration. Cleans up redundancies introduced by the bridge phases. |
The MercConverter Engine
The MercConverter (sub_9F1A90, 35 KB) is the instruction conversion engine at the heart of ConvertUnsupportedOps. Despite its name referencing "Mercury" (NVIDIA's SASS encoding format), it operates purely at the IR level -- converting instruction semantics, not binary encodings.
Call Chain
sub_9F3340 (orchestrator, 7KB)
|
+-- sub_9F1A90 (MercConverter main pass, 35KB)
| |
| +-- sub_9ED2D0 (opcode dispatch, 25KB)
| | |
| | | Large switch on (*(instr+72)) with byte-1 mask:
| | | BYTE1(opcode) &= 0xCF -- strips modifier bits 4-5
| | |
| | +-- case 1: sub_9DA5C0 (2KB) -- opcode class 1
| | +-- case 6: sub_9DA100 (9KB) -- arithmetic operations
| | +-- case 8: sub_9D2440 -- specific class
| | +-- case 10,11,149,151,152,290,291:
| | | sub_9D80E0 (17KB) -- memory load/store
| | +-- default: vfunc[0](a1, a2) -- vtable dispatch
| |
| +-- sub_934630 (instruction creation utility, called N times)
|
+-- sub_9EF5E0 (post-conversion lowering, 27KB)
| string "CONVERTING"
+-- sub_9EC160, sub_7C11F0, sub_7BFC30 (intrinsic expansion)
Per-Category Handlers
| Handler | Size | Category | Key behavior |
|---|---|---|---|
sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
sub_9D80E0 | 17 KB | Memory legalization (variant) | Same opcode set as sub_9D76D0, alternate code path for different operand patterns. |
sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
sub_9E6600 | 25 KB | Instruction expansion | Splits instructions into multiple SASS equivalents (e.g., 64-bit ops on 32-bit ALU). Uses sub_9D4380 ~10 times. |
sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
sub_9DE890 | 17 KB | Control flow legalization | Branch/call instruction patterns. Calls sub_9D4380 (builder) 5 times. |
sub_9DDEE0 | 14 KB | Address computation | Address arithmetic lowering, complex addressing mode decomposition. |
Intrinsic Descriptor Loading
sub_9EE390 (20 KB) loads architecture-specific instruction descriptions from a file ("IntrinsicDescrFile=%s"). This allows the MercConverter to query which intrinsic operations are natively supported on the target SM and which require multi-instruction expansion. The descriptor file is architecture-versioned and loaded once during the first compilation of a kernel targeting that architecture.
The PTX-to-SASS Opcode Transition
The fundamental semantic transformation during lowering: PTX uses high-level, explicitly-typed opcodes; Ori uses SASS-level opcodes where the type is encoded in the mnemonic. All SASS opcode strings in the binary are ROT13-encoded.
PTX source (typed virtual ISA) Ori IR (SASS machine-level)
--------------------------------- ---------------------------------
add.f32 %r1, %r2, %r3 --> FADD R1, R2, R3
add.s32 %r4, %r5, %r6 --> IADD3 R4, R5, R6, RZ
mul.f64 %d1, %d2, %d3 --> DMUL D1, D2, D3
mad.lo.s32 %r7, %r8, %r9, %r10 --> IMAD R7, R8, R9, R10
ld.global.f32 %r11, [%rd1] --> LDG R11, [R1]
st.shared.f32 [%rd2], %r12 --> STS [R2], R12
bra $L__BB0_1 --> BRA bix1
@%p0 bra $L__BB0_2 --> @P0 BRA bix2
exit --> EXIT
bar.sync 0 --> BAR
ROT13 encoding in the binary:
SNQQ = FADD VZNQ = IMAD SSZN = FFMA
VNQQ3 = IADD3 QZHY = DMUL YQT = LDG
FGT = STG OEN = BRA RKVG = EXIT
ERG = RET ONE = BAR FGF = STS
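The encoding is plain ROT13 over letters, with digits passing through unchanged (hence IADD3 <-> VNQQ3). A decoder sufficient to recover the opcode strings:

```c
#include <assert.h>
#include <string.h>

/* ROT13 over A-Z/a-z; digits and other characters pass through,
   matching the IADD3 <-> VNQQ3 pairing in the recovered tables. */
static void rot13(const char *in, char *out) {
    size_t i;
    for (i = 0; in[i]; i++) {
        char c = in[i];
        if (c >= 'A' && c <= 'Z')      c = 'A' + (c - 'A' + 13) % 26;
        else if (c >= 'a' && c <= 'z') c = 'a' + (c - 'a' + 13) % 26;
        out[i] = c;
    }
    out[i] = '\0';
}
```

Applying it to a stored string such as SNQQ yields the SASS mnemonic FADD.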
Key semantic differences at the transition:
- Type moves into the opcode: PTX `add.f32` becomes `FADD` (the "F" encodes float); PTX `add.s32` becomes `IADD3` (the "I" encodes integer). The type qualifier disappears from the instruction syntax.
- Register namespace unification: PTX's typed virtual registers (`%r` for int, `%f` for float, `%rd` for 64-bit, `%p` for predicate) merge into Ori's four register files (R, UR, P, UP) with type tracked in the register descriptor at offset +64.
- Operand count changes: SASS `IADD3` takes 3 source operands where PTX `add` takes 2 -- the third source defaults to `RZ` (the hardware zero register). This is handled by the expansion in `sub_9E6600`.
- Multi-instruction expansion: Complex PTX operations expand to multiple SASS instructions. A PTX `div.f32` may become a Newton-Raphson sequence of `RCP` + `FMUL` + correction iterations.
- Predication mapping: PTX `@%p0 instruction` maps to an Ori predicate operand in the P register file, attached to the instruction node's predicate slot.
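The Newton-Raphson expansion for div.f32 can be sketched numerically. The seed below stands in for the hardware RCP approximation by injecting a 1% error; the two refinement steps model the FMA-based correction iterations (a numeric illustration, not the emitted SASS sequence):

```c
#include <assert.h>
#include <math.h>

/* div.f32 lowered as reciprocal + Newton-Raphson refinement + multiply.
   The seed simulates a coarse RCP result; each refinement step squares
   the relative error (1e-2 -> 1e-4 -> 1e-8). */
static float div_nr(float a, float b) {
    float y = (1.0f / b) * 1.01f;       /* RCP y, b  (coarse seed, +1%) */
    y = y * (2.0f - b * y);             /* correction iteration 1       */
    y = y * (2.0f - b * y);             /* correction iteration 2       */
    return a * y;                       /* FMUL q, a, y                 */
}
```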
Error Detection During Lowering
The bridge phases include two error detection mechanisms:
Internal compiler error assertion (sub_9EB990, 1.4 KB): contains three references to the string "Internal compiler error.". Called when a bridge phase encounters an impossible IR state (e.g., an opcode value outside the known range in the MercConverter dispatch switch). Triggers a longjmp-based fatal abort via sub_42F590 back to the driver's error recovery point in sub_446240.
Uninitialized register detector (sub_A0B5E0, 7 KB): "Found %d potentially uninitialized register(s) in function %s". Walks the instruction list per block, checks register descriptor flags at offset +48 (bit 5 = "defined"). Reports registers that appear as sources without any prior definition. This detector fires after the bridge phases to catch conversion errors that leave registers undefined.
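The detector's core is a def-before-use walk. A sketch, simplified to a 64-register bitmask in place of the per-register flag word at descriptor offset +48:

```c
#include <assert.h>
#include <stdint.h>

/* Count source operands read before any definition. Sources are checked
   against the "defined" set before the destination is marked, so an
   instruction reading its own output still counts as uninitialized.
   Simplified stand-in types; <=64 registers; -1 means "no operand". */
typedef struct UInst { struct UInst *next; int dst, src0, src1; } UInst;

static int count_uninit_uses(UInst *head) {
    uint64_t defined = 0;
    int bad = 0;
    for (UInst *i = head; i; i = i->next) {
        if (i->src0 >= 0 && !(defined & (1ull << i->src0))) bad++;
        if (i->src1 >= 0 && !(defined & (1ull << i->src1))) bad++;
        if (i->dst >= 0) defined |= 1ull << i->dst;  /* def after uses */
    }
    return bad;
}
```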
Key Data Structures
Instruction Node
Instruction (variable size, linked list node)
+0 prev_ptr // doubly-linked list: previous instruction
+8 next_ptr // doubly-linked list: next instruction
+16 child_ptr // child/expanded instruction chain
+32 control_word_ptr // set later during scheduling (initially NULL)
+72 opcode // byte 0: primary opcode
// byte 1 bits 4-5: modifier (masked with 0xCF)
+80 operand_count // number of operands
+84 operand_array // packed operand descriptors
Operand Encoding
Each operand is a packed 32-bit value:
Bits 28-30: operand kind ((value >> 28) & 7)
1 = register operand
5 = predicate register
(other values for immediate, constant bank, label, etc.)
Lower bits: operand-kind-specific payload (register ID, immediate value, etc.)
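Pack/unpack helpers matching this encoding. The kind values 1 (register) and 5 (predicate) are recovered; treating the full low 28 bits as payload is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit operand word: kind in bits 28-30, payload in the low bits.
   The 28-bit payload width is an assumption for illustration. */
#define OPND_KIND_REG  1u
#define OPND_KIND_PRED 5u

static uint32_t opnd_pack(uint32_t kind, uint32_t payload) {
    return ((kind & 7u) << 28) | (payload & 0x0FFFFFFFu);
}
static uint32_t opnd_kind(uint32_t v)    { return (v >> 28) & 7u; }
static uint32_t opnd_payload(uint32_t v) { return v & 0x0FFFFFFFu; }
```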
Register Descriptor
Register descriptor (accessed via *(ctx+88) + 8*regId)
+12 register number (int)
+48 flags (bit 5 = "defined", other bits for liveness state)
+64 type (3=address, 6=GPR, 7=predicate)
Timing Boundary
The lowering spans two --compiler-stats timer phases:
| Timer | Covers |
|---|---|
DAGgen-time | Bison parser reduction actions -> Ori instruction nodes, operand processing (sub_6273E0), basic block / CFG construction |
OCG-time | Phases 0--13 (bridge), then phases 14--158 (optimization + codegen) |
The boundary between "lowering" and "optimization" is therefore between phase 13 (GeneralOptimizeEarly, the last bridge phase) and phase 14 (DoSwitchOptFirst, the first pure optimization). After phase 13, the IR is in its final SASS-opcode form with validated structure, ready for the main optimization pipeline.
Cross-References
- PTX Parser -- Flex scanner + Bison LALR(1) parser (the source of raw Ori IR)
- Ori IR -- IR design: Code Object, basic blocks, instruction format, register files
- Optimization Pipeline -- 159-phase pipeline (phases 0--13 are the bridge)
- Phase Manager -- PhaseManager object, phase factory, dispatch loop
- Optimization Levels -- NvOpt levels 0--5 and their effect on recipes
- SASS Opcodes -- target SASS instruction set after lowering
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
0x451730 | 14 KB | 1 | Parser init, special register setup | HIGH |
0x46E000 | 93 KB | 1 | Opcode table builder (1,141 per-opcode calls) | HIGH |
0x4CE6B0 | 48 KB | 1 | Bison LALR(1) parser (512 productions) | HIGH |
0x6273E0 | 44 KB | N | Operand processing (6-bit type switch) | MEDIUM |
0x9D4380 | 7 KB | ~10 | Instruction builder / inserter into linked list | HIGH |
0x9D76D0 | 18 KB | 1 | Memory instruction legalization (load/store) | HIGH |
0x9D80E0 | 17 KB | 1 | Memory instruction legalization (variant) | HIGH |
0x9DA100 | 9 KB | 1 | Arithmetic operation handler (case 6) | HIGH |
0x9DE890 | 17 KB | 1 | Control flow legalization (branch/call) | MEDIUM |
0x9DDEE0 | 14 KB | 1 | Address computation legalization | MEDIUM |
0x9E6600 | 25 KB | 1 | Instruction expansion (64-bit split, etc.) | HIGH |
0x9E8B20 | 17 KB | 1 | Texture/surface lowering | MEDIUM |
0x9EB990 | 1.4 KB | 3 | Internal compiler error assertion | HIGH |
0x9EC340 | 23 KB | 1 | Multi-operand instruction legalization | MEDIUM |
0x9ED2D0 | 25 KB | 1 | Opcode dispatch (master switch, & 0xCF mask) | HIGH |
0x9EE390 | 20 KB | 1 | Intrinsic descriptor file loader | MEDIUM |
0x9EF5E0 | 27 KB | 1 | Post-MercConverter lowering ("CONVERTING") | HIGH |
0x9F1A90 | 35 KB | 1 | MercConverter main instruction conversion pass | HIGH |
0x9F3340 | 7 KB | 1 | MercConverter orchestrator ("After MercConverter") | HIGH |
0xA0B5E0 | 7 KB | N | Uninitialized register detector | HIGH |
0xA3A7E0 | 6 KB | N | Scheduling statistics printer (phase 9 output) | VERY HIGH |
Optimization Pipeline (159 Phases)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas optimizer is a fixed-order pipeline of 159 compilation phases that transform Ori IR from its initial post-lowering form into scheduled, register-allocated SASS machine code. Unlike LLVM's PassManager -- which uses dependency-driven scheduling and analysis preservation -- ptxas runs every phase unconditionally in a predetermined order, relying on per-phase isNoOp() checks to skip inapplicable transformations. This design trades flexibility for predictability: the phase ordering is identical across all compilations, and architecture-specific behavior is injected through 16 "AdvancedPhase" hook points whose vtables are overridden per target.
Each phase is a polymorphic C++ object exactly 16 bytes in size, allocated from a memory pool by a 159-case factory switch. The PhaseManager constructs all 159 phase objects up front during initialization, stores them in a flat array, and iterates the array in a simple dispatch loop. Per-phase timing and memory consumption are optionally tracked for --stat=phase-wise output.
Key Facts
| Field | Value |
|---|---|
| Total phases | 159 (indices 0--158) |
| Named phases (static table) | 139 (indices 0--138) |
| Dynamic phases (vtable names) | 20 (indices 139--158) |
| AdvancedPhase hook points | 16 |
| Mercury sub-pipeline phases | 8 (phases 113--114, 117--122) |
| Phase object size | 16 bytes: {vtable_ptr, allocator_ptr} |
| Factory switch | sub_C60D30 (3554 bytes, 159 cases) |
| PhaseManager constructor | sub_C62720 (4734 bytes) |
| Dispatch loop | sub_C64F70 (1455 bytes) |
| Phase name table | off_22BD0C0 (159 entries, 1272 bytes) |
| Default ordering table | unk_22BEEA0 (159-entry index array) |
| Vtable range | off_22BD5C8..off_22BEE78 (40-byte stride) |
| NamedPhases option ID | 298 |
| Pipeline orchestrator | sub_7FB6C0 |
Phase Object Layout
Every phase is a 16-byte polymorphic object created by the factory:
struct Phase {
void** vtable; // +0: pointer to phase-specific vtable in .data.rel.ro
void* allocator; // +8: memory pool used for allocation
};
The vtable provides three virtual methods common to all phases:
| Offset | Signature | Purpose |
|---|---|---|
+0 | execute(Phase*, CompilationContext*) | Run the phase on the IR |
+8 | isNoOp(Phase*) -> bool | Return true to skip execution |
+16 | getName(Phase*) -> int | Return index into the phase name table |
Additional vtable slots (+24 pool alloc, +32 pool free) are present but belong to the allocator interface, not the phase protocol.
Dispatch Loop
The dispatch loop at sub_C64F70 drives execution:
// sub_C64F70 -- simplified
void dispatch(PhaseManager* pm, int* phase_indices, int count) {
MemorySnapshot baseline = take_snapshot();
for (int i = 0; i < count; i++) {
int idx = phase_indices[i];
Phase* phase = pm->phase_list[idx];
const char* name = pm->name_table[phase->getName()];
if (!phase->isNoOp()) {
MemorySnapshot before = take_snapshot();
phase->execute(pm->compilation_unit);
if (pm->timing_enabled) {
report_phase_stats(pm, name, &before);
}
}
}
if (pm->timing_enabled) {
report_summary(pm, "All Phases Summary", &baseline);
report_pool_consumption(pm);
}
}
Timing output format (to stderr when --stat=phase-wise):
<phase_name> :: [Total 42 KB ] [Freeable 8 KB ] [Freeable Leaked 0 KB ] (0%)
Complete Phase Table
Group 1 -- Initial Setup (phases 0--13)
Program validation, recipe application, FP16 promotion, control flow analysis, macro instruction creation.
| # | Phase Name | Category |
|---|---|---|
| 0 | OriCheckInitialProgram | Validation |
| 1 | ApplyNvOptRecipes | Recipe application |
| 2 | PromoteFP16 | Type promotion |
| 3 | AnalyzeControlFlow | CFG analysis |
| 4 | AdvancedPhaseBeforeConvUnSup | Hook (no-op default) |
| 5 | ConvertUnsupportedOps | Legalization |
| 6 | SetControlFlowOpLastInBB | CFG fixup |
| 7 | AdvancedPhaseAfterConvUnSup | Hook (no-op default) |
| 8 | OriCreateMacroInsts | Macro expansion |
| 9 | ReportInitialRepresentation | Diagnostics |
| 10 | EarlyOriSimpleLiveDead | Early DCE |
| 11 | ReplaceUniformsWithImm | Immediate folding |
| 12 | OriSanitize | IR validation |
| 13 | GeneralOptimizeEarly | Bundled early opts |
Phase 0 validates the initial Ori IR for structural correctness. Phase 1 applies NvOptRecipe transformations (controlled by option 391, which allocates a 440-byte sub-manager at PhaseManager+56). Phase 2 promotes FP16 operations where profitable. Phases 4 and 7 are architecture hooks that bracket ConvertUnsupportedOps -- backends override them to inject target-specific pre/post-legalization logic.
Group 2 -- Early Optimization (phases 14--32)
Branch optimization, loop canonicalization, strength reduction, software pipelining, SSA formation.
| # | Phase Name | Category |
|---|---|---|
| 14 | DoSwitchOptFirst | Switch optimization |
| 15 | OriBranchOpt | Branch optimization |
| 16 | OriPerformLiveDeadFirst | Liveness / DCE |
| 17 | OptimizeBindlessHeaderLoads | Texture header opt |
| 18 | OriLoopSimplification | Loop canonicalization |
| 19 | OriSplitLiveRanges | Live range splitting |
| 20 | PerformPGO | Profile-guided opt |
| 21 | OriStrengthReduce | Strength reduction |
| 22 | OriLoopUnrolling | Loop unrolling |
| 23 | GenerateMovPhi | SSA phi insertion |
| 24 | OriPipelining | Software pipelining |
| 25 | StageAndFence | Memory fence insertion |
| 26 | OriRemoveRedundantBarriers | Barrier elimination |
| 27 | AnalyzeUniformsForSpeculation | Uniform analysis |
| 28 | SinkRemat | Sink + rematerialization |
| 29 | GeneralOptimize | Bundled mid opts |
| 30 | DoSwitchOptSecond | Switch optimization (2nd) |
| 31 | OriLinearReplacement | Linear scan replacement |
| 32 | CompactLocalMemory | Local memory compaction |
The GeneralOptimize* phases (13, 29, 37, 46, 58, 65) are compound passes that bundle multiple small optimizations (copy propagation, constant folding, algebraic simplification) into a single fixed-point iteration. They appear at multiple pipeline positions to re-clean the IR after major transformations. Liveness/DCE also runs repeatedly (OriPerformLiveDead at phases 16, 33, 61, 84) to remove dead code exposed by intervening passes.
Group 3 -- Mid-Level Optimization (phases 33--52)
GVN-CSE, reassociation, shader constant extraction, CTA expansion, argument enforcement.
| # | Phase Name | Category |
|---|---|---|
| 33 | OriPerformLiveDeadSecond | Liveness / DCE (2nd) |
| 34 | ExtractShaderConstsFirst | Shader constant extraction |
| 35 | OriHoistInvariantsEarly | LICM (early) |
| 36 | EmitPSI | PSI emission |
| 37 | GeneralOptimizeMid | Bundled mid opts |
| 38 | OptimizeNestedCondBranches | Nested branch opt |
| 39 | ConvertVTGReadWrite | VTG read/write conversion |
| 40 | DoVirtualCTAExpansion | Virtual CTA expansion |
| 41 | MarkAdditionalColdBlocks | Cold block marking |
| 42 | ExpandMbarrier | Mbarrier expansion |
| 43 | ForwardProgress | Forward progress guarantee |
| 44 | OptimizeUniformAtomic | Uniform atomic opt |
| 45 | MidExpansion | Mid-level legalization |
| 46 | GeneralOptimizeMid2 | Bundled mid opts (2nd) |
| 47 | AdvancedPhaseEarlyEnforceArgs | Hook (no-op default) |
| 48 | EnforceArgumentRestrictions | ABI enforcement |
| 49 | GvnCse | GVN + CSE |
| 50 | OriReassociateAndCommon | Reassociation + commoning |
| 51 | ExtractShaderConstsFinal | Shader constants (final) |
| 52 | OriReplaceEquivMultiDefMov | Redundant move elimination |
Shader constant extraction (phases 34, 51) identifies uniform values that can be loaded from constant memory rather than recomputed per-thread. GvnCse (phase 49) combines global value numbering with common subexpression elimination in a single pass. The MidExpansion (phase 45) performs target-dependent lowering of operations that must be expanded before register allocation but after high-level optimizations have had their chance.
Group 4 -- Late Optimization (phases 53--77)
Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.
| # | Phase Name | Category |
|---|---|---|
| 53 | OriPropagateVaryingFirst | Varying propagation |
| 54 | OriDoRematEarly | Early rematerialization |
| 55 | LateExpansion | Late legalization |
| 56 | SpeculativeHoistComInsts | Speculative hoisting |
| 57 | RemoveASTToDefaultValues | AST cleanup |
| 58 | GeneralOptimizeLate | Bundled late opts |
| 59 | OriLoopFusion | Loop fusion |
| 60 | DoVTGMultiViewExpansion | Multi-view expansion |
| 61 | OriPerformLiveDeadThird | Liveness / DCE (3rd) |
| 62 | OriRemoveRedundantMultiDefMov | Dead move elimination |
| 63 | OriDoPredication | If-conversion |
| 64 | LateOriCommoning | Late commoning |
| 65 | GeneralOptimizeLate2 | Bundled late opts (2nd) |
| 66 | OriHoistInvariantsLate | LICM (late) |
| 67 | DoKillMovement | Kill movement |
| 68 | DoTexMovement | Texture movement |
| 69 | OriDoRemat | Rematerialization |
| 70 | OriPropagateVaryingSecond | Varying propagation (2nd) |
| 71 | OptimizeSyncInstructions | Sync optimization |
| 72 | LateExpandSyncInstructions | Late sync expansion |
| 73 | ConvertAllMovPhiToMov | Phi destruction |
| 74 | ConvertToUniformReg | Uniform reg conversion |
| 75 | LateArchOptimizeFirst | Arch-specific late opt |
| 76 | UpdateAfterOptimize | IR update pass |
| 77 | AdvancedPhaseLateConvUnSup | Hook (no-op default) |
Predication (phase 63) converts short conditional branches into predicated instruction sequences, eliminating branch divergence. Rematerialization runs twice (phases 54 and 69) -- the early pass targets values that are cheap to recompute, while the late pass handles cases exposed by predication and loop fusion. Phase 73 (ConvertAllMovPhiToMov) destroys SSA form by converting phi nodes into move instructions, preparing the IR for register allocation. Phase 74 converts qualifying values to uniform registers (UR), reducing general register pressure.
Group 5 -- Legalization (phases 78--96)
Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attribute setting, final inspection.
| # | Phase Name | Category |
|---|---|---|
| 78 | LateExpansionUnsupportedOps | Late unsupported ops |
| 79 | OriHoistInvariantsLate2 | LICM (late 2nd) |
| 80 | ExpandJmxComputation | JMX expansion |
| 81 | LateArchOptimizeSecond | Arch-specific late opt (2nd) |
| 82 | AdvancedPhaseBackPropVReg | Hook (no-op default) |
| 83 | OriBackCopyPropagate | Backward copy propagation |
| 84 | OriPerformLiveDeadFourth | Liveness / DCE (4th) |
| 85 | OriPropagateGmma | GMMA propagation |
| 86 | InsertPseudoUseDefForConvUR | UR pseudo use/def |
| 87 | FixupGmmaSequence | GMMA sequence fixup |
| 88 | OriHoistInvariantsLate3 | LICM (late 3rd) |
| 89 | AdvancedPhaseSetRegAttr | Hook (no-op default) |
| 90 | OriSetRegisterAttr | Register attribute setting |
| 91 | OriCalcDependantTex | Texture dependency calc |
| 92 | AdvancedPhaseAfterSetRegAttr | Hook (no-op default) |
| 93 | LateExpansionUnsupportedOps2 | Late unsupported ops (2nd) |
| 94 | FinalInspectionPass | Final IR validation |
| 95 | SetAfterLegalization | Post-legalization marker |
| 96 | ReportBeforeScheduling | Diagnostics |
GMMA (phases 85, 87) handles WGMMA (warp group matrix multiply-accumulate) instruction sequences that require specific register arrangements and ordering constraints. OriSetRegisterAttr (phase 90) annotates registers with scheduling attributes (latency class, bank assignment) consumed by the downstream scheduler. FinalInspectionPass (phase 94) is a validation gate that catches illegal IR patterns before the irreversible scheduling/RA phases.
Group 6 -- Pre-Scheduling and Register Allocation (phases 97--103)
Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.
| # | Phase Name | Category |
|---|---|---|
| 97 | AdvancedPhasePreSched | Hook (no-op default) |
| 98 | BackPropagateVEC2D | Vector back-propagation |
| 99 | OriDoSyncronization | Synchronization insertion |
| 100 | ApplyPostSyncronizationWars | Post-sync WAR fixup |
| 101 | AdvancedPhaseAllocReg | Hook (no-op default) |
| 102 | ReportAfterRegisterAllocation | Diagnostics |
| 103 | Get64bRegComponents | 64-bit register splitting |
Phase 99 inserts the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model. Phase 100 fixes write-after-read hazards exposed by sync insertion. Register allocation is driven through the hook at phase 101 -- the actual allocator is architecture-specific and invoked from the AdvancedPhase override. Phase 103 splits 64-bit register pairs into their 32-bit components for architectures that require it.
Group 7 -- Post-RA and Post-Scheduling (phases 104--116)
Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboards.
| # | Phase Name | Category |
|---|---|---|
| 104 | AdvancedPhasePostExpansion | Hook (no-op default) |
| 105 | ApplyPostRegAllocWars | Post-RA WAR fixup |
| 106 | AdvancedPhasePostSched | Hook (no-op default) |
| 107 | OriRemoveNopCode | NOP removal |
| 108 | OptimizeHotColdInLoop | Hot/cold in loops |
| 109 | OptimizeHotColdFlow | Hot/cold flow opt |
| 110 | PostSchedule | Post-scheduling |
| 111 | AdvancedPhasePostFixUp | Hook (no-op default) |
| 112 | PlaceBlocksInSourceOrder | Block layout |
| 113 | PostFixForMercTargets | Mercury target fixup |
| 114 | FixUpTexDepBarAndSync | Texture barrier fixup |
| 115 | AdvancedScoreboardsAndOpexes | Scoreboard generation |
| 116 | ProcessO0WaitsAndSBs | O0 wait/scoreboard |
Hot/cold partitioning (phases 108--109) separates frequently executed blocks from cold paths, improving instruction cache locality. PlaceBlocksInSourceOrder (phase 112) determines the final layout of basic blocks in the emitted binary. The scoreboard sub-system has two paths: at -O1 and above, AdvancedScoreboardsAndOpexes (phase 115) performs full dependency analysis to compute the 23-bit control word per instruction (4-bit stall count, 1-bit yield, 3-bit write barrier, 6-bit read barrier mask, 6-bit wait barrier mask, plus reuse flags). At -O0, phase 115 is a no-op and ProcessO0WaitsAndSBs (phase 116) inserts conservative waits.
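The control word fields listed above can be packed as follows. Only the field widths come from the recovered analysis; the field order and bit positions in this sketch are assumptions, and the reuse flags are omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the per-instruction scheduling control word: 4-bit stall count,
   1-bit yield, 3-bit write barrier index, 6-bit read barrier mask,
   6-bit wait barrier mask. Bit POSITIONS here are assumed for
   illustration; only the widths are from the recovered analysis. */
static uint32_t pack_ctrl(uint32_t stall, uint32_t yield, uint32_t wrbar,
                          uint32_t rdmask, uint32_t waitmask) {
    return  (stall    & 0xFu)
         | ((yield    & 0x1u)  << 4)
         | ((wrbar    & 0x7u)  << 5)
         | ((rdmask   & 0x3Fu) << 8)
         | ((waitmask & 0x3Fu) << 14);
}
```

Under this layout, the conservative -O0 policy corresponds to emitting the maximum stall field on every instruction.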
Group 8 -- Mercury Backend (phases 117--122)
SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.
| # | Phase Name | Category |
|---|---|---|
| 117 | MercEncodeAndDecode | Mercury encode/decode |
| 118 | MercExpandInstructions | Instruction expansion |
| 119 | MercGenerateWARs1 | WAR generation (1st pass) |
| 120 | MercGenerateOpex | Opex generation |
| 121 | MercGenerateWARs2 | WAR generation (2nd pass) |
| 122 | MercGenerateSassUCode | SASS microcode generation |
"Mercury" is NVIDIA's internal name for the SASS encoding framework. Phase 117 converts Ori instructions into Mercury's intermediate encoding, then decodes them back to verify round-trip correctness. Phase 118 expands pseudo-instructions into their final SASS sequences. WAR generation runs in two passes (119, 121) because expansion in phase 118 can introduce new write-after-read hazards. Phase 120 generates "opex" (operation extension) annotations. Phase 122 produces the final SASS microcode bytes. The MercConverter infrastructure (sub_9F1A90, 35KB) drives the instruction-level legalization using a visitor pattern dispatched through a large opcode switch (sub_9ED2D0, 25KB).
Group 9 -- Post-Mercury (phases 123--131)
Register map, diagnostics, debug output.
| # | Phase Name | Category |
|---|---|---|
| 123 | ComputeVCallRegUse | Virtual call reg use |
| 124 | CalcRegisterMap | Register map computation |
| 125 | UpdateAfterPostRegAlloc | Post-RA update |
| 126 | ReportFinalMemoryUsage | Diagnostics |
| 127 | AdvancedPhaseOriPhaseEncoding | Hook (no-op default) |
| 128 | UpdateAfterFormatCodeList | Code list formatting |
| 129 | DumpNVuCodeText | SASS text dump |
| 130 | DumpNVuCodeHex | SASS hex dump |
| 131 | DebuggerBreak | Debugger breakpoint |
CalcRegisterMap (phase 124) computes the final physical-to-logical register mapping emitted as EIATTR metadata in the output ELF. DumpNVuCodeText and DumpNVuCodeHex (phases 129--130) produce the human-readable SASS text and raw hex dumps used by cuobjdump and debugging workflows. DebuggerBreak (phase 131) is a development-only hook that triggers a breakpoint when a specific phase is reached.
Group 10 -- Finalization (phases 132--158)
Late merge operations, late unsupported-op expansion, high-pressure live range splitting, architecture-specific fixups.
| # | Phase Name | Category |
|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Post-conversion update |
| 133 | MergeEquivalentConditionalFlow | Conditional flow merge |
| 134 | AdvancedPhaseAfterMidExpansion | Hook (no-op default) |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Hook (no-op default) |
| 136 | LateMergeEquivalentConditionalFlow | Late conditional merge |
| 137 | LateExpansionUnsupportedOpsMid | Late unsupported mid |
| 138 | OriSplitHighPressureLiveRanges | High-pressure splitting |
| 139--158 | (architecture-specific) | Arch-specific fixups |
Phases 132--138 handle late-breaking transformations that must run after the Mercury backend but before finalization. OriSplitHighPressureLiveRanges (phase 138) is a last-resort live range splitter that fires when register pressure exceeds hardware limits after the main allocation pass.
Phases 139--158 are 20 additional slots whose names are not in the static name table but are returned by their vtable getString() methods. These are architecture-specific phases registered in the factory switch (vtable addresses off_22BEB08..off_22BEE78) that target particular SM generations or compilation modes. They provide extensibility for new architectures without modifying the fixed 139-phase base table.
Optimization Level Gating
AdvancedPhase Hook Points
Sixteen phases serve as conditional extension points. Their isNoOp() method returns true by default, causing the dispatch loop to skip them. Architecture backends and optimization-level configurations override the vtable to activate these hooks:
| Phase | Name | Gate Location |
|---|---|---|
| 4 | AdvancedPhaseBeforeConvUnSup | Before unsupported-op conversion |
| 7 | AdvancedPhaseAfterConvUnSup | After unsupported-op conversion |
| 47 | AdvancedPhaseEarlyEnforceArgs | Before argument enforcement |
| 77 | AdvancedPhaseLateConvUnSup | Late unsupported-op boundary |
| 82 | AdvancedPhaseBackPropVReg | Before backward copy prop |
| 89 | AdvancedPhaseSetRegAttr | Before register attr setting |
| 92 | AdvancedPhaseAfterSetRegAttr | After register attr setting |
| 97 | AdvancedPhasePreSched | Before scheduling |
| 101 | AdvancedPhaseAllocReg | Register allocation driver |
| 104 | AdvancedPhasePostExpansion | After post-RA expansion |
| 106 | AdvancedPhasePostSched | After post-scheduling |
| 111 | AdvancedPhasePostFixUp | After post-fixup |
| 115 | AdvancedScoreboardsAndOpexes | Full scoreboard analysis |
| 127 | AdvancedPhaseOriPhaseEncoding | Phase encoding hook |
| 134 | AdvancedPhaseAfterMidExpansion | After mid-expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Late sync expansion |
The pattern is consistent: AdvancedPhase hooks bracket major pipeline stages, allowing backends to insert target-specific transformations without altering the fixed phase ordering. Phase 101 (AdvancedPhaseAllocReg) is notable because register allocation itself is entirely driven through this hook -- the base pipeline has no hardcoded allocator.
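The skip-or-run decision can be sketched as a vtable-style dispatch. The following mock is illustrative only — the struct layout, field names, and instrumentation are assumptions, not the recovered C++ object layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of a phase object: each phase exposes isNoOp()
 * and run(); the dispatch loop consults isNoOp() before executing.
 * AdvancedPhase hooks return true by default and are thus skipped.   */
typedef struct Phase {
    bool (*isNoOp)(const struct Phase *self);
    void (*run)(struct Phase *self);
    int   runs;   /* instrumentation for this sketch only */
} Phase;

static bool noop_true(const Phase *p)  { (void)p; return true;  }
static bool noop_false(const Phase *p) { (void)p; return false; }
static void run_impl(Phase *p)         { p->runs++; }

/* Dispatch over an ordering array of phase indices, skipping any
 * phase whose isNoOp() says no backend has activated it.            */
static int dispatch(Phase *phases, const int *ordering, int n) {
    int executed = 0;
    for (int i = 0; i < n; i++) {
        Phase *p = &phases[ordering[i]];
        if (p->isNoOp(p))
            continue;          /* inactive AdvancedPhase hook */
        p->run(p);
        executed++;
    }
    return executed;
}
```

An architecture backend "activates" a hook simply by swapping the vtable so that `isNoOp` returns false; the fixed phase ordering never changes.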
O0 vs O1+ Behavior
At -O0, the pipeline skips most optimization phases via their individual isNoOp() checks. The critical difference is in scoreboard generation:
- `-O1` and above: Phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using `sub_A36360` (52 KB control word encoder) and `sub_A23CF0` (54 KB DAG list scheduler heuristic). Phase 116 is a no-op.
- `-O0`: Phase 115 is a no-op. Phase 116 (ProcessO0WaitsAndSBs) inserts conservative stall counts and wait barriers -- every instruction gets the maximum stall, and barriers are placed at every potential hazard point. This produces correct but slow code.
Individual phases also check the optimization level internally via the compilation context. The scheduling infrastructure (sub_8D0640) reads the opt-level via sub_7DDB50 and selects between forward-pass scheduling (opt-level <= 2, register-pressure-reducing) and reverse-pass scheduling (opt-level > 2, latency-hiding).
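The direction gate described above reduces to a single comparison; a minimal sketch with illustrative names:

```c
#include <assert.h>

typedef enum { SCHED_FORWARD, SCHED_REVERSE } SchedDirection;

/* Per the recovered logic in sub_8D0640: opt-level <= 2 selects the
 * register-pressure-reducing forward pass; higher levels select the
 * latency-hiding reverse pass. Enum and function names are ours.    */
static SchedDirection select_sched_direction(int opt_level) {
    return (opt_level <= 2) ? SCHED_FORWARD : SCHED_REVERSE;
}
```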
NamedPhases Override (Option 298)
The NamedPhases mechanism allows complete replacement of the default 159-phase pipeline with a user-specified phase sequence, primarily used for debugging and performance investigation.
Activation
The pipeline orchestrator (sub_7FB6C0) checks option ID 298 via a vtable call at compilation context offset +72. When set, the orchestrator bypasses the default pipeline and delegates to sub_9F63D0 (NamedPhases entry point):
// sub_7FB6C0 -- simplified
void orchestrate(CompilationUnit* cu) {
if (cu->config->getOption(298)) {
// NamedPhases mode -- user-specified phase sequence
NamedPhases_run(cu); // sub_9F63D0
} else {
// Default mode -- fixed 159-phase pipeline
PhaseManager* pm = PhaseManager_new(cu); // sub_C62720
int* ordering = get_default_ordering(); // sub_C60D20
dispatch(pm, ordering, 159); // sub_C64F70
PhaseManager_destroy(pm); // sub_C61B20
}
// ... cleanup 17 data structures, refcounted objects ...
}
Configuration String Format
Option 298 is set via a knob string (environment variable or command-line). The string is stored at compilation context offset 21464 with a type indicator at offset 21456. The parser (sub_798B60, NamedPhases::ParsePhaseList) tokenizes the comma-delimited string:
"phase_name1,phase_name2=param,shuffle,swap1,..."
Maximum 256 entries. The parser populates three parallel arrays:
- Phase name strings
- Parameter value strings (parsed via `strtol`)
- Full `name=value` pairs
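The tokenization can be sketched as follows. This is a simplified stand-in for `sub_798B60`, assuming only what is documented above (comma delimiters, `name=value` pairs, `strtol` for parameters, 256-entry cap); the array layout and buffer sizes are illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 256   /* parser cap noted in the recovered code */

typedef struct {
    char names[MAX_ENTRIES][64];   /* phase name strings             */
    long params[MAX_ENTRIES];      /* parameter values (via strtol)  */
    char pairs[MAX_ENTRIES][96];   /* full name=value tokens         */
    int  count;
} PhaseListConfig;

/* Minimal sketch of the comma-delimited knob-string parse. */
static int parse_phase_list(const char *knob, PhaseListConfig *out) {
    char buf[1024];
    out->count = 0;
    snprintf(buf, sizeof buf, "%s", knob);
    for (char *tok = strtok(buf, ",");
         tok && out->count < MAX_ENTRIES;
         tok = strtok(NULL, ",")) {
        int i = out->count++;
        snprintf(out->pairs[i], sizeof out->pairs[i], "%s", tok);
        char *eq = strchr(tok, '=');
        if (eq) {
            *eq = '\0';                         /* split name=value */
            out->params[i] = strtol(eq + 1, NULL, 10);
        } else {
            out->params[i] = 0;
        }
        snprintf(out->names[i], sizeof out->names[i], "%s", tok);
    }
    return out->count;
}
```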
Phase List Builder
The core builder (sub_9F4040, 49 KB) processes the parsed configuration:
- Allocates a `0x2728`-byte stack frame with 256-entry string tables
- Initializes a 158-entry phase descriptor table (zeroed `0x400` bytes)
- Resolves phase names to indices via `sub_C641D0` (case-insensitive binary search)
- Recognized manipulation keywords:
  - `shuffle` -- randomize the phase ordering
  - `swap1`..`swap6` -- swap specific phase pairs (for A/B testing)
  - `OriPerformLiveDead` -- override liveness pass placement
  - `OriCopyProp` -- override copy propagation placement
- Constructs the final phase index sequence and dispatches via `sub_C64F70`
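The `shuffle` keyword amounts to permuting the phase index sequence before dispatch. A Fisher-Yates sketch (the actual PRNG and any exclusion rules inside `sub_9F4040` are not recovered; `rand()` here is purely illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Randomize a phase ordering in place -- the effect of the
 * "shuffle" manipulation keyword, under assumed PRNG choice. */
static void shuffle_phases(int *order, int n, unsigned seed) {
    srand(seed);
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);          /* pick from [0, i] */
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
}
```

Whatever the seed, the result is a permutation of the original sequence, so every phase still runs exactly once -- only the order varies, which is what makes it useful for flushing out hidden phase-ordering dependencies.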
Pass-Disable Integration
Individual passes can be disabled without reordering the pipeline. The check function sub_799250 (IsPassDisabled, 68 bytes) performs a case-insensitive substring match against the PTXAS_DISABLED_PASSES string at context offset 13328:
// sub_799250 -- simplified
bool is_pass_disabled(Context* ctx, const char* pass_name) {
if (ctx->pass_disable_flag == 0) return false; // offset 13320
if (ctx->pass_disable_flag == 5) {
return strcasestr(ctx->pass_disable_string, pass_name); // offset 13328
}
return false;
}
This check is called from 16+ sites across the codebase, guarding passes like LoopMakeSingleEntry and SinkCodeIntoBlock. A more thorough variant (sub_7992A0, IsPassDisabledFull) uses FNV-1a hashing for function-specific override tables.
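For reference, 32-bit FNV-1a is a short multiply-and-xor loop. The binary's exact hash width and table layout are not fully recovered, so the 32-bit variant below is shown for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (; *s; s++) {
        h ^= (uint8_t)*s;              /* xor first (the "1a" order) */
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```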
PhaseManager Data Structures
PhaseManager Object (~112 bytes)
Offset Type Field
------ ---- -----
+0 int64 compilation_unit pointer
+8 int64* allocator
+16 void* sorted_name_table (for binary search)
+24 int32 sorted_name_count
+28 int32 sorted_name_capacity
+32 int64* allocator_copy
+40 void* phase_list (array of 16-byte Phase entries)
+48 int32 phase_list_count
+52 int32 phase_list_capacity
+56 int64 nvopt_recipe_ptr (NvOptRecipe sub-manager, or NULL)
+64 int64 (reserved)
+72 bool timing_enabled (from options[17928])
+76 int32 (flags)
+80 bool flag_byte
+88 int64* timing_allocator
+96 void* phase_name_raw_table
+104 int32 phase_name_raw_count
+108 int32 phase_name_raw_capacity
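Under natural x86-64 alignment, the recovered offsets fall out of a plain C struct (the bool fields at +72 and +80 are followed by padding). This mirror uses the wiki's reconstructed field names and can be sanity-checked with `offsetof`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* C mirror of the recovered PhaseManager layout; field names are
 * reconstructions, not symbols from the binary.                   */
typedef struct PhaseManager {
    int64_t  compilation_unit;        /* +0   */
    int64_t *allocator;               /* +8   */
    void    *sorted_name_table;       /* +16  */
    int32_t  sorted_name_count;       /* +24  */
    int32_t  sorted_name_capacity;    /* +28  */
    int64_t *allocator_copy;          /* +32  */
    void    *phase_list;              /* +40  */
    int32_t  phase_list_count;        /* +48  */
    int32_t  phase_list_capacity;     /* +52  */
    int64_t  nvopt_recipe_ptr;        /* +56  */
    int64_t  reserved;                /* +64  */
    bool     timing_enabled;          /* +72  (3 bytes padding follow) */
    int32_t  flags;                   /* +76  */
    bool     flag_byte;               /* +80  (7 bytes padding follow) */
    int64_t *timing_allocator;        /* +88  */
    void    *phase_name_raw_table;    /* +96  */
    int32_t  phase_name_raw_count;    /* +104 */
    int32_t  phase_name_raw_capacity; /* +108 */
} PhaseManager;                       /* sizeof == 112 */
```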
Timing Record (32 bytes)
Offset Type Field
------ ---- -----
+0 int32 phase_index (-1 = sentinel)
+8 int64 phase_name_or_magic (0x2030007 = sentinel)
+16 int64 timing_value
+24 int32 memory_flags
NvOptRecipe Sub-Manager (440 bytes, at PhaseManager+56)
Created when option 391 is set. Contains timing records with 584-byte stride, a hash table for recipe lookup, sorted arrays, and ref-counted shared lists. The sub-manager inherits the phase chain from the previous execution context, enabling recipe-based pipeline modification across compilation units.
Function Map
| Address | Size | Identity |
|---|---|---|
sub_C60D20 | 16 B | Default phase table pointer |
sub_C60D30 | 3554 B | Phase factory (159-case switch) |
sub_C60BD0 | 334 B | Multi-function phase invoker |
sub_C61B20 | 1753 B | PhaseManager destructor |
sub_C62200 | 888 B | Pool consumption reporter |
sub_C62580 | 253 B | Timing record array resizer (1.5x growth) |
sub_C62640 | 223 B | Phase list resizer (1.5x growth) |
sub_C62720 | 4734 B | PhaseManager constructor |
sub_C639A0 | 1535 B | Case-insensitive quicksort (median-of-3) |
sub_C63FA0 | 556 B | Phase name table sort/rebuild |
sub_C641D0 | 305 B | Phase name-to-index binary search |
sub_C64310 | 3168 B | Per-phase timing reporter |
sub_C64F70 | 1455 B | Phase dispatch loop |
sub_7FB6C0 | 1193 B | Pipeline orchestrator (option 298 gate) |
sub_798B60 | 1776 B | NamedPhases::ParsePhaseList |
sub_799250 | 68 B | IsPassDisabled (substring check) |
sub_7992A0 | 894 B | IsPassDisabledFull (FNV-1a hash) |
sub_9F4040 | 9093 B | NamedPhases::parseAndBuild |
sub_9F63D0 | 342 B | NamedPhases::run |
sub_9F1A90 | 6310 B | MercConverter main pass |
sub_9F3340 | ~7 KB | MercConverter orchestrator |
sub_9ED2D0 | ~25 KB | MercConverter opcode dispatch |
Diagnostic Strings
| String | Location | Trigger |
|---|---|---|
"All Phases Summary" | sub_C64F70 | End of dispatch loop (timing enabled) |
"[Pool Consumption = " | sub_C62200 | End of dispatch loop (timing enabled) |
" :: " | sub_C64310 | Per-phase timing line |
"[Total ", "[Freeable ", "[Freeable Leaked " | sub_C64310 | Memory delta columns |
"Before ", "After " | sub_C64F70 | Phase execution markers |
"NamedPhases" | sub_9F4040 | NamedPhases config parsing |
"shuffle", "swap1".."swap6" | sub_9F4040 | NamedPhases manipulation keywords |
"After MercConverter" | near sub_9F3340 | Post-MercConverter diagnostic |
"CONVERTING" | sub_9EF5E0 | During MercConverter lowering |
"Internal compiler error." | sub_9EB990 | ICE assertion (3 sites) |
Cross-References
- Phase Manager Infrastructure -- detailed PhaseManager internals
- Pass Inventory & Ordering -- per-pass documentation index
- GeneralOptimize Bundles -- compound optimization passes
- Scheduler Architecture -- scheduling phases 97--116
- Mercury Encoder -- Mercury backend phases 117--122
- Scoreboards & Dependency Barriers -- control word generation
- Optimization Levels -- O-level gating details
- DUMPIR & NamedPhases -- NamedPhases configuration reference
- Memory Pool Allocator -- pool allocator used by phase objects
SASS Code Generation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page covers the top-level compilation orchestration layer of ptxas: the code that sits between the PTX front-end (parsing, directive handling) and the 159-phase optimization pipeline. It is responsible for validating the parsed PTX, selecting a compilation strategy, computing register constraints, dispatching per-kernel compilation (either sequentially or via a thread pool), and collecting per-kernel outputs for finalization. The orchestrator is the single largest function in the front-end region at 2,141 decompiled lines.
Key Facts
| Core orchestrator | sub_4428E0 (13,774 bytes, 2,141 decompiled lines) |
| Per-kernel worker | sub_43A400 (4,696 bytes, 647 lines) |
| Per-kernel DAGgen+OCG | sub_64BAF0 (~30 KB, 1,006 decompiled lines) |
| Per-entry output | sub_43CC70 (5,425 bytes, 1,077 decompiled lines) |
| Thread pool worker | sub_436DF0 (485 bytes, 59 decompiled lines) |
| Thread pool constructor | sub_1CB18B0 (184-byte pool struct, calls pthread_create) |
| Finalization | sub_432500 (461 bytes, 47 decompiled lines) |
| Regalloc finalize | sub_4370F0 (522 bytes, 64 decompiled lines) |
| Compilation strategies | 4 (normal, compile-only, debug, non-ABI) |
| Error recovery | setjmp/longjmp (non-local, no C++ exceptions) |
Architecture
sub_446240 (top-level driver)
|
v
sub_4428E0 (core orchestrator, 2141 lines)
|
|-- Option validation: .version/.target, --compile-only, --compile-as-tools-patch
|-- Cache config: def-load-cache, force-load-cache, def-store-cache, force-store-cache
|-- Strategy selection: 4 function-pointer pairs (see below)
|-- Register constraints: sub_43B660 per kernel (via strategy function)
|-- Compile-unit table: 48-byte per-CU entry at a1+336
|-- Timing array: 112-byte per-kernel entry at a1+256
|
+-- IF single-threaded (thread_count == 0):
| |
| FOR EACH compile unit:
| |
| +-- sub_43A400 (per-kernel setup, 647 lines)
| | |-- Target-specific defaults ("ptxocg.0.0", cache, texmode)
| | |-- ABI configuration, fast-compile shortcuts
| | +-- Error recovery via setjmp
| |
| +-- sub_432500 (finalization wrapper, 47 lines)
| | |-- setjmp error guard
| | +-- vtable call at a1+96: invokes the actual OCG pipeline
| |
| +-- sub_4370F0 (timing finalization, 64 lines)
| | +-- Accumulates per-kernel timing into 112-byte records
| |
| +-- sub_43CC70 (per-entry output, 1077 lines)
| |-- Skip __cuda_dummy_entry__
| |-- Generate .sass and .ucode sections
| +-- Emit "# ============== entry %s ==============" header
|
+-- IF multi-threaded (thread_count > 0):
|
|-- sub_1CB18B0(thread_count) --> create thread pool
|
FOR EACH compile unit:
|
+-- sub_43A400 (per-kernel setup, same as above)
+-- Snapshot 15 config vectors (v158[3]..v158[17])
+-- Copy hash maps for thread isolation
+-- sub_1CB1A50(pool, sub_436DF0, task_struct) --> enqueue
|
v
sub_436DF0 (thread pool worker, 59 lines)
|-- sub_430590("ptxas", ...) -- set thread-local program name
|-- Jobserver slot check (sub_1CC6EC0)
|-- sub_432500(...) -- finalize via vtable call (DAGgen+OCG+SASS)
|-- Timing: wall-clock and phase timers into per-CU record
|-- sub_693630 (release compiled output to downstream)
+-- sub_4248B0 (free task struct)
|
sub_1CB1AE0(pool) --> wait for all tasks
sub_1CB1970(pool) --> destroy pool
sub_4370F0(a1, -1) --> finalize aggregate timing
The Core Orchestrator: sub_4428E0
This is a 2,141-line monolith that drives the entire compilation after the PTX has been parsed. Its responsibilities, in execution order:
1. Cache and Option Validation
The first 200+ lines read four cache-mode knobs from the options store at a1+904:
def_load_cache = get_knob(a1->options, "def-load-cache");
force_load_cache = get_knob(a1->options, "force-load-cache");
def_store_cache = get_knob(a1->options, "def-store-cache");
force_store_cache = get_knob(a1->options, "force-store-cache");
It then validates a long series of option combinations:
- `--compile-as-tools-patch` (a1+727) incompatibility checks against shared memory, textures, surfaces, samplers, constants
- `--assyscall` (a1+627) resource allocation checks
- `--compile-only` (a1+726) vs unified functions
- Non-ABI mode (a2+218, a2+235): disables `--fast-compile`, `--extensible-whole-program`
- `--position-independent-code` vs `--extensible-whole-program` mutual exclusion
- Architecture version checks: `.target` SM version vs `--gpu-name` SM version
- `noFwdPrg` forward-progress flag against SM version
- `--legacy-bar-warp-wide-behavior` against SM >= 70
2. Strategy Selection
Four compilation strategies are expressed as pairs of function pointers (v314, v293), selected at lines 756-779 of the decompilation. Each strategy pair consists of a register-constraint calculator and a compile-unit builder:
| Strategy | Condition | Register Calculator (v314) | CU Builder (v293) |
|---|---|---|---|
| Compile-only | --compile-only OR --assyscall OR --compile-as-tools-patch | sub_43C6F0 (225 lines) | sub_4383B0 (177 lines) |
| Debug | --compile-as-tools-patch AND NOT debug mode | sub_43CAE0 (91 lines) | sub_4378E0 (250 lines) |
| Non-ABI | --extensible-whole-program | sub_43C570 (77 lines) | sub_438B50 (375 lines) |
| Normal | default | sub_43CA80 (24 lines) | sub_438B50 (375 lines) |
The selection logic:
if (compile_only || assyscall || tools_patch) {
calc_regs = sub_43C6F0;
build_cus = sub_4383B0;
} else if (tools_patch_mode) {
calc_regs = debug_mode ? sub_43C6F0 : sub_43CAE0;
build_cus = debug_mode ? sub_4383B0 : sub_4378E0;
} else {
calc_regs = extensible_whole_program ? sub_43C570 : sub_43CA80;
build_cus = sub_438B50;
}
3. Register Constraint Calculation
Each strategy's register calculator iterates the kernel list and calls sub_43B660 to compute per-kernel register limits. The result is stored in a hash map at a1+1192:
// sub_43CA80 (normal strategy, 24 lines) -- simplest form
for (node = kernel_list; node; node = node->next) {
entry = node->data;
name = entry->func_ptr; // at +16
limit = sub_43B660(a1, name, a1->opt_level, entry->thread_count);
map_put(a1->reg_limit_map, name, limit); // a1+1192
}
sub_43B660 (3,843 bytes) balances register pressure against occupancy by considering .maxnreg, --maxrregcount, .minnctapersm, and .maxntid. String evidence: ".minnctapersm and .maxntid", "threads per SM", "computed using thread count", "of .maxnreg", "of maxrregcount option", "global register limit specified".
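A hedged sketch of the balance `sub_43B660` strikes: divide the register file by the resident thread count implied by `.maxntid` x `.minnctapersm`, then clamp by any explicit `.maxnreg` / `--maxrregcount` limit. The 65536-registers-per-SM figure, the 255-register encoding cap, and the absence of rounding granularity are assumptions of this sketch, not recovered constants:

```c
#include <assert.h>

/* Illustrative per-thread register limit calculation. maxnreg == 0
 * means no explicit per-function/global limit was specified.        */
static int reg_limit(int regs_per_sm, int max_ntid, int min_ncta_per_sm,
                     int maxnreg) {
    int resident_threads = max_ntid * min_ncta_per_sm;
    int limit = regs_per_sm / resident_threads;   /* occupancy-driven */
    if (limit > 255) limit = 255;                 /* assumed hw encoding cap */
    if (maxnreg > 0 && maxnreg < limit) limit = maxnreg;
    return limit;
}
```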
4. Compile-Unit Table Construction
The CU builder (v293) constructs a linked list of 72-byte compile-unit descriptors. Each descriptor contains:
struct CompileUnitDescriptor { // 72 bytes
int32 index; // +0: CU ordinal
void* dep_list; // +8: dependency tracking set
void* entry_ptr; // +16: pointer to entry function symbol
bool is_entry; // +25: true if .entry, false if .func
int32 regalloc[2]; // +28: register allocation mode pair
bool flags[4]; // +36: has_shared_mem, has_surfaces, has_textures, has_samplers
int16 cap_flags; // +40: capability flags from func_attr+240
int32 min_regs; // +44: minimum register count (from profile check)
// +48..55: additional attribute OR bits from func_attr+208..236
// +56..63: reserved
void* smem_info; // +48: 24-byte sub-struct for shared memory
};
The builder allocates this via sub_424070(pool, 72), populates it from the function attribute struct at entry_ptr+80, and enqueues it into the output list via sub_42CA60.
5. Per-Kernel Dispatch
After building the CU list, the orchestrator enters one of two dispatch modes based on the thread count at a1+668:
Single-Threaded Path (thread_count == 0)
The loop at decompilation lines 1607-1686 iterates each CU sequentially:
for (node = cu_list; node; node = node->next) {
cu_desc = node->data;
// Record start time in 112-byte timing record
timing_record[cu_desc->index].start = get_time();
// Allocate 360-byte work buffer
work = pool_alloc(pool, 360);
memset(work, 0, 360);
// Per-kernel setup: target config, cache defaults, ABI
sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
// Finalization: runs the actual DAGgen + OCG pipeline
sub_432500(a1, cu_desc + 16, work[0], work[1]);
// Timing finalization for this kernel
sub_4370F0(a1, cu_desc->index);
// Per-entry output: .sass/.ucode sections
sub_43CC70(a1, parser_state, cu_desc, work);
pool_free(work);
}
Multi-Threaded Path (thread_count > 0)
The thread pool path (decompilation lines 1709-1929) uses the pthread-based thread pool:
// Phase 1: prepare all tasks
pool_obj = sub_1CB18B0(thread_count); // create pool with N threads
for (node = cu_list; node; node = node->next) {
cu_desc = node->data;
// Allocate 360-byte work buffer (same as single-threaded)
work = pool_alloc(pool, 360);
// Extra per-thread state: 3 hash maps for thread isolation
work[288] = hashmap_new(8); // per-thread reg constraints
work[296] = hashmap_new(8); // per-thread symbol copies
work[304] = hashmap_new(8); // per-thread attribute copies
sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
// Snapshot 15 config vectors from global state (a1+1072..a1+1296)
// into work[48]..work[288] for thread-safe access
for (i = 0; i < 15; i++)
work[48 + 16*i] = load_128bit(a1 + 1072 + 16*i);
// Copy hash maps from shared state into per-thread copies
// (reg constraints, symbol tables, attribute maps)
// Enqueue: sub_1CB1A50(pool, sub_436DF0, task_struct)
task = malloc(48);
task->pool_data = work;
task->timing_base = ...;
sub_1CB1A50(pool_obj, sub_436DF0, task);
}
// Phase 2: wait for all tasks
sub_1CB1AE0(pool_obj); // wait-for-all
sub_1CB1970(pool_obj); // destroy pool
// Phase 3: aggregate timing
sub_4370F0(a1, -1); // -1 = aggregate all CUs
Jobserver Integration
When --split-compile is active with GNU Make, the thread pool integrates with Make's jobserver protocol. The worker function sub_436DF0 checks sub_1CC6EC0() (jobserver slot acquire) before starting compilation and calls sub_1CC7040() (jobserver slot release) after completion. This prevents ptxas from exceeding the -j slot limit during parallel builds.
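The jobserver token discipline itself is simple: Make hands child processes a token pipe (via `--jobserver-auth` in `MAKEFLAGS`), a slot is acquired by reading one byte from it, and released by writing a byte back. The sketch below demonstrates that discipline with a local pipe standing in for Make's inherited descriptors; function names are ours, not the recovered `sub_1CC6EC0`/`sub_1CC7040`:

```c
#include <assert.h>
#include <unistd.h>

/* Acquire a job slot: blocking read of one token byte. */
static int jobserver_acquire(int rfd, char *token) {
    return read(rfd, token, 1) == 1;
}

/* Release the slot: write the token byte back to the pipe. */
static int jobserver_release(int wfd, char token) {
    return write(wfd, &token, 1) == 1;
}
```

Because a worker that fails to acquire a token must not start compiling, `sub_436DF0` reports a fatal error on acquisition failure rather than proceeding unthrottled.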
Per-Kernel Worker: sub_43A400
This 647-line function sets up target-specific defaults for each kernel before the OCG pipeline runs. Key responsibilities:
- Timing instrumentation -- records start timestamps, wall-clock time
- Target configuration -- reads `"ptxocg.0.0"` defaults, sets cache mode, texturing mode (`"specified texturing mode"` string evidence)
- Fast-compile shortcuts -- when `--fast-compile` is active, reduces optimization effort
- ABI setup -- configures parameter passing, return address register, scratch registers
- Error recovery -- establishes `setjmp` point for fatal errors during kernel compilation
The function allocates a _jmp_buf on the stack for error recovery. If any phase in the downstream pipeline calls the fatal diagnostic path (sub_42F590 with severity >= 6), execution longjmps back to sub_43A400's recovery handler, which cleans up the partially-compiled kernel and continues to the next.
String evidence: "def-load-cache", "force-load-cache", "--sw4575628", "NVIDIA", "ptxocg.0.0", "specified texturing mode", "Indirect Functions or Extern Functions".
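The skip-and-continue recovery pattern can be sketched in plain C. Names here are illustrative, not the recovered symbols; note the `volatile` qualifiers, which C requires for locals modified between `setjmp` and `longjmp`:

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf recovery;

/* Stand-in for the fatal diagnostic path (severity >= 6). */
static void fatal_diagnostic(void) { longjmp(recovery, 1); }

/* Stand-in per-kernel compile that fails for one chosen kernel. */
static void compile_kernel(int idx, int fail_idx) {
    if (idx == fail_idx)
        fatal_diagnostic();
}

/* Driver loop: a failed kernel longjmps back to the guard, and the
 * loop cleans up and continues with the next kernel.               */
static int compile_all(int n, int fail_idx) {
    volatile int completed = 0;
    for (volatile int i = 0; i < n; i++) {
        if (setjmp(recovery))
            continue;              /* recovery handler: skip this kernel */
        compile_kernel(i, fail_idx);
        completed++;
    }
    return completed;
}
```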
Finalization: sub_432500
This 47-line wrapper function is the bridge between the orchestrator and the actual DAGgen+OCG pipeline. It:
- Retrieves the thread-local context via `sub_4280C0`
- Saves and replaces the `jmp_buf` pointer in the TLS struct (for nested error recovery)
- Saves the current error/warning flags
- Clears the error flags to create a clean compilation context
- Calls through a vtable pointer at `a1+96` to invoke the actual compilation:
// sub_432500 -- simplified
bool finalize(Context* ctx, CUDesc* cu, void* sass_out, void* ucode_out) {
char* tls = get_tls();
jmp_buf* old_jmp = tls->jmp_buf;
tls->jmp_buf = &local_jmp;
char saved_err = tls->error_flags;
char saved_warn = tls->warning_flags;
tls->error_flags = 0;
tls->warning_flags = 0;
if (setjmp(local_jmp)) {
// Error path: restore state, cleanup, report ICE
tls->jmp_buf = old_jmp;
tls->error_flags = 1; tls->warning_flags = 1;
release_output(ucode_out);
release_output(cu->output);
report_internal_error();
return false;
}
// Normal path: invoke the pipeline
bool ok = ctx->vtable->compile(ctx->state, sass_out, ctx + 384);
if (!ok) report_internal_error();
// Merge error flags
tls->jmp_buf = old_jmp;
tls->error_flags = saved_err ? true : (tls->error_flags != 0);
tls->warning_flags = saved_warn ? true : (tls->warning_flags != 0);
return ok;
}
The vtable call at a1+96 is the entry point into sub_64BAF0 (the 1,006-line function that runs DAGgen, the 159-phase OCG pipeline, and Mercury SASS encoding for a single kernel).
Regalloc Finalization: sub_4370F0
This 64-line function accumulates per-kernel timing results into the master timing array at a1+256. Each entry in this array is a 112-byte record:
struct KernelTimingRecord { // 112 bytes, at a1->timing_array + 112*index
char* kernel_name; // +0
float ocg_time; // +20
float total_time; // +36
float cumulative; // +40
double wall_clock; // +72
// ... other timing fields
};
When called with index == -1 (aggregate mode after multi-threaded compilation), it sums all per-kernel records into the global timing counters at a1+176 (total parse time), a1+184 (total OCG time), and a1+208 (peak wall-clock).
Per-Entry Output: sub_43CC70
This 1,077-line function produces the final per-kernel output artifacts. Key behaviors:
- Skip dummy entries -- checks for `"__cuda_dummy_entry__"` and returns immediately
- Section generation -- creates `.sass` and `.ucode` ELF sections for each kernel
- Entry banner -- emits `"\n# ============== entry %s ==============\n"` to the SASS text output
- Register map -- calls `"reg-fatpoint"` to annotate the register allocation
- Verbose SASS output -- when `--verbose` is active, formats and writes human-readable SASS text
- Multiple output paths -- supports mercury, capmerc, and direct SASS output modes
Thread Pool Worker: sub_436DF0
The worker function dispatched to each thread pool thread is compact (59 lines) but carefully structured for thread safety:
void thread_worker(Context* a1, TaskStruct* task) {
set_thread_program_name("ptxas", task);
// Acquire jobserver slot if applicable
if (a1->jobserver_enabled && !jobserver_acquire())
report_fatal_error(); // Cannot get build slot
float time_before = get_pool_time(a1->timer);
double wall_before = get_wall_time();
// Run the actual compilation
sub_432500(a1->state, task + 64, task->sass_output, task->ucode_output);
float time_after = get_pool_time(a1->timer);
double wall_after = get_wall_time();
// Record timing in per-CU record
int cu_index = *(int*)task->cu_desc;
TimingRecord* rec = &a1->timing_array[cu_index];
rec->ocg_time = time_after - time_before;
rec->cumulative += (time_after - time_before);
if (wall_after - wall_before > 0)
rec->wall_clock = wall_after - wall_before;
// Peak wall-clock tracking (under lock)
if (a1->compiler_stats && a1->per_kernel_stats) {
lock_timing(6);
double peak = a1->peak_wall_clock;
if (get_wall_time() - a1->start_time > peak)
a1->peak_wall_clock = get_wall_time() - a1->start_time;
unlock_timing(6);
}
// Emit compiled output downstream
if (a1->dump_sass)
dump_sass(task->ucode_output);
release_output(task->ucode_output);
// Transfer kernel name to output
task->output->name = **(task->cu_desc->entry_ptr + 88);
// Release jobserver slot
if (a1->jobserver_enabled && !jobserver_release())
report_fatal_error();
pool_free(task);
}
The timing lock at index 6 (sub_607D70(6) / sub_607D90(6)) serializes access to the peak wall-clock counter across threads. This is the only shared mutable state in the multi-threaded path -- all other per-kernel state is isolated in the 360-byte work buffer and per-thread hash map copies.
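The peak-tracking update under the lock reduces to a compare-and-store inside a critical section. A sketch with a pthreads mutex standing in for lock slot 6 (the binary's `sub_607D70`/`sub_607D90` wrappers; names here are ours):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t timing_lock = PTHREAD_MUTEX_INITIALIZER;
static double peak_wall_clock = 0.0;

/* Update the shared peak under the lock -- the only shared mutable
 * state in the multi-threaded compilation path.                     */
static void record_wall_clock(double elapsed) {
    pthread_mutex_lock(&timing_lock);
    if (elapsed > peak_wall_clock)
        peak_wall_clock = elapsed;
    pthread_mutex_unlock(&timing_lock);
}
```

Without the lock, two threads could both read a stale peak, and the larger of their values could be overwritten by the smaller -- the classic lost-update race this serialization prevents.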
Data Flow Summary
PTX text
|
v (parsed by sub_451730 into AST at parser_state+88)
|
sub_4428E0: strategy_calc(kernel_list) --> reg_limit_map (a1+1192)
sub_4428E0: strategy_build(kernel_list) --> cu_descriptor_list (72-byte nodes)
|
v (for each CU descriptor)
|
sub_43A400: target_config(cu_desc) --> 360-byte work buffer
|
sub_432500: vtable->compile()
| invokes sub_64BAF0 (DAGgen + 159-phase OCG + Mercury)
| |
| +-- Ori IR construction (DAGgen phase)
| +-- 159 phases via PhaseManager (sub_C62720 / sub_C64F70)
| +-- Mercury SASS encoding (phases 113-122)
| |
v v
work[0] = .sass output work[1] = .ucode output
|
sub_4370F0: record timing
|
sub_43CC70: emit .sass/.ucode sections, entry banner, verbose output
|
v
ELF builder (sub_612DE0)
Cross-References
- Pipeline Overview -- end-to-end compilation flow
- Entry Point & CLI -- the top-level driver that calls sub_4428E0
- Optimization Pipeline (159 Phases) -- the OCG pipeline invoked per-kernel
- Code Generation Overview -- detailed codegen subsystem
- SASS Instruction Encoding -- Mercury encoding phases
- Register Allocation -- Fatpoint algorithm invoked at phase 101
- Thread Pool & Concurrency -- thread pool struct and jobserver
- Memory Pool Allocator -- pool allocator used throughout
- Knobs System -- cache-mode knobs read by sub_4428E0
Function Map
| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
sub_4428E0 | 13,774 B | 2,141 | Core compilation orchestrator | HIGH |
sub_43CA80 | 192 B | 24 | Normal strategy: register calculator | HIGH |
sub_438B50 | 2,419 B | 375 | Normal/non-ABI strategy: CU builder | HIGH |
sub_43C6F0 | 1,600 B | 225 | Compile-only strategy: register calculator | HIGH |
sub_4383B0 | 1,320 B | 177 | Compile-only/debug strategy: CU builder | HIGH |
sub_43CAE0 | 648 B | 91 | Debug strategy: register calculator | HIGH |
sub_4378E0 | 2,010 B | 250 | Debug strategy: CU builder | HIGH |
sub_43C570 | 577 B | 77 | Non-ABI strategy: register calculator | HIGH |
sub_43A400 | 4,696 B | 647 | Per-kernel worker (target config + setup) | HIGH |
sub_64BAF0 | ~30 KB | 1,006 | DAGgen + OCG + SASS (per-kernel pipeline) | MEDIUM |
sub_43CC70 | 5,425 B | 1,077 | Per-entry output (.sass/.ucode sections) | HIGH |
sub_436DF0 | 485 B | 59 | Thread pool worker function | HIGH |
sub_432500 | 461 B | 47 | Finalization wrapper (setjmp + vtable call) | HIGH |
sub_4370F0 | 522 B | 64 | Timing finalization (per-kernel + aggregate) | HIGH |
sub_43B660 | 3,843 B | ~300 | Register/resource constraint calculator | HIGH |
sub_1CB18B0 | ~200 B | 33 | Thread pool constructor (184-byte struct) | HIGH |
sub_1CB1A50 | ~200 B | 21 | Thread pool task submit | HIGH |
sub_1CB1AE0 | -- | -- | Thread pool wait-for-all | HIGH |
sub_1CB1970 | -- | -- | Thread pool destructor | HIGH |
sub_1CC7300 | 2,027 B | -- | GNU Make jobserver client | HIGH |
Diagnostic Strings
| String | Location | Purpose |
|---|---|---|
"def-load-cache" | sub_4428E0 | Cache mode knob read |
"force-load-cache" | sub_4428E0 | Cache mode knob read |
"def-store-cache" | sub_4428E0 | Cache mode knob read |
"force-store-cache" | sub_4428E0 | Cache mode knob read |
"--compile-only" | sub_4428E0 | Option validation |
"--compile-as-tools-patch" | sub_4428E0 | Option validation |
"--extensible-whole-program" | sub_4428E0 | Option validation |
"calls without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
"compilation without ABI" | sub_4428E0 | Non-ABI mode diagnostic |
"unified Functions" | sub_4428E0 | Compile-only restriction |
"suppress-debug-info" | sub_4428E0 | Debug info suppression warning |
"position-independent-code" | sub_4428E0 | PIC mode configuration |
"__cuda_dummy_entry__" | sub_43CC70 | Dummy entry skip check |
"reg-fatpoint" | sub_43CC70 | Register map annotation |
".sass", ".ucode" | sub_43CC70 | Output section names |
"# ============== entry %s ==" | sub_43CC70 | Per-entry SASS banner |
"ptxocg.0.0" | sub_43A400 | Target config identifier |
"specified texturing mode" | sub_43A400 | Texturing mode diagnostic |
".local_maxnreg" | sub_438B50 | Per-function register limit |
"device functions" | sub_438B50 | Compile-only device function handling |
"--compile-only option" | sub_438B50 | Compile-only restriction |
"ptxas" | sub_436DF0 | Thread-local program name |
ELF/Cubin Output
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
After all per-kernel SASS encoding completes, ptxas enters the ELF output phase -- the final stage of the compilation pipeline. This phase transforms the accumulated per-kernel SASS bytes, relocation metadata, constant bank data, shared memory layouts, and debug information into a complete NVIDIA CUBIN file. The CUBIN is a standard ELF container with NVIDIA-proprietary extensions: machine type EM_CUDA (0xBE), non-standard ELF class bytes, CUDA-specific section types, and a rich per-entry metadata system called EIATTR. The output pipeline is a custom implementation with no libelf dependency -- ptxas constructs every byte of the ELF from scratch, including headers, section tables, symbol tables, string tables, relocations, and program headers.
The output phase handles three binary kinds: SASS (raw resolved SASS, legacy default), Mercury (SM 75--99 default), and Capsule Mercury (SM 100+ default, supporting deferred finalization). All three produce a valid CUBIN ELF; the difference is whether the .text sections contain final SASS bytes or Mercury-encoded streams that a downstream finalizer resolves at link or load time.
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte central object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 decompiled lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled -- largest function in output range) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes) |
| Master section allocator | sub_1CABD60 (11,856 bytes, 67 KB decompiled -- shared/constant/local addresses) |
| nvinfo/EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled) |
| Master relocation resolver | sub_1CD48C0 (4,184 bytes, 22 KB decompiled) |
| File serializer | sub_1CD13A0 (2,541 bytes, writes final bytes to disk) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF timing | "ELF-time : %.3f ms (%.2f%%)" in --compiler-stats output |
| Peak ELF memory | "PeakELFMemoryUsage : %.3lf KB" |
Pipeline Overview
The ELF output pipeline runs as a single-threaded sequence after all per-kernel OCG passes have completed (multi-kernel compilation may be parallel, but ELF emission is serialized). The flow is orchestrated by sub_612DE0, which reads compilation flags, constructs the ELFW central object, then drives 11 phases to produce the final .cubin or .o file.
sub_446240 (compilation driver -- "real main")
| all per-kernel OCG passes complete
v
sub_612DE0 (cubin entry, 47KB)
| reads: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
| establishes setjmp/longjmp error recovery
| writes "Cuda compilation tools, release 13.0, V13.0.88"
| "Build cuda_13.0.r13.0/compiler.36424714_0"
|
|-- Phase 1: ELFW construction
| sub_1CB53A0 -- create 672-byte ELFW object, 7 standard sections
|
|-- Phase 2: Per-kernel section creation
| sub_1CB42D0 x N -- .text.<func>, .rela.text.<func> (one per kernel)
| sub_1CB3570 x 44 -- .nv.constant0.<func>, .nv.shared.<func>, etc.
|
|-- Phase 3: Call graph analysis
| sub_1CBB920 -- recursion detection (DFS)
| sub_1CBC090 -- dead function elimination
| sub_1CBE1B0 -- .nv.callgraph section builder
|
|-- Phase 4: Symbol fixup
| sub_1CB2CA0 -- renumber symbols after dead code elimination
| sub_1C99BB0 -- remap .symtab_shndx extended indices
|
|-- Phase 5: Memory allocation
| sub_1CABD60 -- assign addresses: shared, constant, local memory
| sub_1CA92F0 -- shared memory interference graph
| sub_1CA6890 -- constant bank deduplication
|
|-- Phase 6: nvinfo/EIATTR generation
| sub_1CC9800 -- build .nv.info.<func> sections (EIATTR attributes)
| sub_1CC8950 -- propagate barrier/register counts across call graph
|
|-- Phase 7: Symbol table construction
| sub_1CB68D0 -- build .symtab, handle SHN_XINDEX overflow
|
|-- Phase 8: Section layout
| sub_1C9DC60 -- compute file offsets with alignment padding
|
|-- Phase 9: Relocation resolution
| sub_1CD48C0 -- resolve all R_CUDA_* relocations
| sub_1CD5920 -- write .nv.resolvedrela sections
|
|-- Phase 10: Capsule Mercury embedding (SM 100+ only)
| sub_1C9B110 -- create .nv.merc.* section namespace
| sub_1CA2E40 -- clone memory-space sections into merc namespace
| sub_1C9C300 -- build 328-byte capsule descriptors per function
|
|-- Phase 11: Final assembly & write
| sub_1C9F280 -- master ELF emitter (assemble complete CUBIN)
| sub_1CD13A0 -- file serializer (write to disk)
|
v
OUTPUT: .cubin / .o file
Custom ELF Emitter
ptxas builds the entire ELF output without libelf. The custom implementation spans approximately 20 functions in the 0x1C99--0x1CD6 address range (~300 KB of binary code). At the center is the ELFW ("ELF world") object -- a 672-byte structure that owns all sections, symbols, and string tables for a single compilation unit.
ELFW Object Layout
The ELFW constructor sub_1CB53A0 allocates 672 bytes from the pool allocator sub_424070, creates a dedicated "elfw memory space" pool (4,096-byte initial allocation), writes the ELF header, and initializes 7 mandatory sections:
| Index | Section | Type | Purpose |
|---|---|---|---|
| 0 | (null) | SHT_NULL | Required ELF null section |
| 1 | .shstrtab | SHT_STRTAB | Section name string table |
| 2 | .strtab | SHT_STRTAB | Symbol name string table |
| 3 | .symtab | SHT_SYMTAB | Symbol table |
| 4 | .symtab_shndx | SHT_SYMTAB_SHNDX | Extended section indices (for >65,280 sections) |
| 5 | .note.nv.tkinfo | SHT_NOTE | NVIDIA toolkit info (version, build ID, CLI args) |
| 6 | .note.nv.cuinfo | SHT_NOTE | NVIDIA CUDA info (SM version, features) |
| 7 | .nv.uft.entry | SHT_PROGBITS | Unified Function Table entries |
ELF Header
The ELF header follows the standard layout with NVIDIA-specific overrides:
| Offset | Size | Field | CUDA Value |
|---|---|---|---|
| 0x00 | 4 | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1 | e_ident[EI_CLASS] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) |
| 0x05 | 1 | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1 | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1 | e_ident[EI_OSABI] | CUDA ABI version |
| 0x12 | 2 | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x24 | 4 | e_flags | SM version bits [7:0] + CUDA control flags |
Standard ELF uses ELFCLASS32 = 1 and ELFCLASS64 = 2. CUDA cubins use non-standard values '3' (0x33) and 'A' (0x41), which the CUDA driver recognizes as the cubin signature. Any tool using standard libelf will reject these as invalid, which is one reason ptxas uses a custom emitter.
The e_flags field packs the SM architecture version in the low byte (e.g., 100 for sm_100) along with flags for relocatable vs. executable mode and address size. The mask 0x7FFFBFFF (clears bits 14 and 31) is applied during finalization to strip internal control flags that must not appear in the output cubin.
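A minimal probe for these overrides, using the standard Elf64 field offsets (illustrative, not part of ptxas; assumes a little-endian host):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EM_CUDA 0x00BE  /* 190, per the header table above */

/* Returns the SM version from e_flags (low byte), or -1 if the
   buffer does not look like a 64-bit cubin. Offsets are standard
   Elf64: e_machine at 0x12, e_flags at 0x24. */
static int cubin_sm_version(const uint8_t *buf, size_t len) {
    if (len < 0x28) return -1;
    if (memcmp(buf, "\x7f""ELF", 4) != 0) return -1;
    if (buf[4] != 0x41) return -1;        /* 'A' = non-standard ELFCLASS64 */
    uint16_t e_machine;
    memcpy(&e_machine, buf + 0x12, 2);    /* little-endian host assumed */
    if (e_machine != EM_CUDA) return -1;
    uint32_t e_flags;
    memcpy(&e_flags, buf + 0x24, 4);
    return (int)(e_flags & 0xFF);         /* SM version in bits [7:0] */
}
```

Standard libelf rejects the buffer at the EI_CLASS check, which is exactly why a custom probe like this is needed.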
Section Generation
Each kernel/function produces a set of ELF sections. For a program with N entry functions and M device functions, the CUBIN contains at least 4*(N+M) sections. The section creator sub_1CB3570 (44 call sites) handles the generic case; sub_1CB42D0 is specialized for .text.<funcname> code sections.
Per-Kernel Sections
For each kernel entry my_kernel, the output pipeline generates:
| Section | Type | Content |
|---|---|---|
| .text.my_kernel | SHT_PROGBITS | SASS instruction bytes (SHF_ALLOC \| SHF_EXECINSTR) |
| .rela.text.my_kernel | SHT_RELA | Relocation entries for the code section |
| .nv.info.my_kernel | SHT_CUDA_INFO (0x70000064) | EIATTR metadata (register count, barriers, stack sizes, etc.) |
| .nv.constant0.my_kernel | SHT_PROGBITS | Constant bank 0 data (kernel parameters + literal constants) |
Additional per-kernel sections are generated as needed:
| Section | Condition |
|---|---|
| .nv.shared.my_kernel | Kernel uses shared memory |
| .nv.local.my_kernel | Kernel uses local (spill) memory |
| .nv.global.init | Program uses initialized global variables |
| .nv.callgraph | Relocatable object mode (-c) |
| .nv.prototype | Prototype information for cross-module linking |
Global Sections
| Section | Purpose |
|---|---|
| .nv.info | Global EIATTR attributes (not per-kernel) |
| .nv.constant0 | Merged constant bank (whole-program mode) |
| .nv.reservedSmem | Reserved shared memory for runtime (tmem allocation, mbarrier parity) |
| .nv.metadata | Module-level metadata |
| .nv.compat | Forward-compatibility attributes |
| .note.nv.cuver | CUDA version note |
| .nv.uft | Unified Function Table (indirect call support) |
| .nv.udt | Unified Data Table |
| .nv.uft.entry | UFT entry point table |
| .nv.udt.entry | UDT entry point table |
| .nv.rel.action | Relocation action table |
| .nv.resolvedrela | Resolved relocations (post-linking) |
| .nv.host | Host-side interop data |
Constant Banks
CUDA supports up to 18 numbered constant banks (0--17) plus 7 named banks:
| Bank | Name | Purpose |
|---|---|---|
| 0 | .nv.constant0 | Kernel parameters + compiler constants (per-entry) |
| 1--17 | .nv.constant1--.nv.constant17 | User-declared __constant__ variables |
| -- | .nv.constant.entry_params | Entry point parameter block |
| -- | .nv.constant.entry_image_header_indices | Texture/surface header index table |
| -- | .nv.constant.driver | Driver-injected constants |
| -- | .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| -- | .nv.constant.user | User-specified constants |
| -- | .nv.constant.pic | Position-independent code constants |
| -- | .nv.constant.tools_data | Tools/debugger-injected data |
Section Ordering
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:
| Bucket | Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.* (EIATTR metadata sections) |
| 7 (lowest) | .debug_*, .nv.merc.* (debug and Mercury metadata) |
Section file offsets are assigned by sub_1C9DC60, walking the sorted list with alignment padding:
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
section_t* sec = sorted_sections[i];
if (sec->sh_addralign > 1)
offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
sec->sh_offset = offset;
offset += sec->sh_size;
}
Two section types receive special treatment during layout: .nv.constant0 (address assigned by the OCG constant bank allocator, not the layout calculator) and .nv.reservedSmem (address assigned by the shared memory master allocator sub_1CABD60).
EIATTR Metadata
Each kernel's .nv.info.<funcname> section contains a sequence of EIATTR (Entry Information Attribute) records. These encode per-kernel metadata that the CUDA driver reads at launch time to configure the hardware correctly. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled, 51 callees) -- one of the largest functions in the output pipeline.
EIATTR Encoding
Each EIATTR record is a TLV (type-length-value) structure:
+--------+--------+------------------+
| type | length | value |
| 2 bytes| 2 bytes| length bytes |
+--------+--------+------------------+
The type field is a 16-bit EIATTR code. The length field specifies the payload size. The value is type-dependent (scalar, array, or structured).
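A minimal walker over this TLV layout as described above (illustrative helper, not recovered code; a real consumer must also know each type's payload semantics):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the TLV records of an .nv.info payload (2-byte LE type,
   2-byte LE length, then `length` value bytes) and return the
   offset of the first matching record's value, or -1. */
static long eiattr_find(const uint8_t *p, size_t len, uint16_t want) {
    size_t off = 0;
    while (off + 4 <= len) {
        uint16_t type = (uint16_t)(p[off]     | (p[off + 1] << 8));
        uint16_t vlen = (uint16_t)(p[off + 2] | (p[off + 3] << 8));
        if (off + 4 + vlen > len) break;   /* truncated record */
        if (type == want) return (long)(off + 4);
        off += 4 + (size_t)vlen;           /* skip header + payload */
    }
    return -1;
}
```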
EIATTR Catalog
ptxas v13.0.88 defines 98 EIATTR codes. The critical ones that every cubin emitter must produce:
| EIATTR | Purpose | Encoding |
|---|---|---|
| EIATTR_REGCOUNT | Register count for this kernel | 4-byte LE integer |
| EIATTR_NUM_BARRIERS | Hardware barrier count (0--16) | 4-byte LE integer |
| EIATTR_FRAME_SIZE | Per-thread stack frame size in bytes | 4-byte LE integer |
| EIATTR_MIN_STACK_SIZE | Minimum call stack size | 4-byte LE integer |
| EIATTR_MAX_STACK_SIZE | Maximum call stack size (recursive) | 4-byte LE integer |
| EIATTR_CRS_STACK_SIZE | Call/Return/Sync stack size | 4-byte LE integer |
| EIATTR_EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions in .text | Array of 4-byte offsets |
| EIATTR_S2RCTAID_INSTR_OFFSETS | Byte offsets of S2R SR_CTAID.* instructions | Array of 4-byte offsets |
| EIATTR_CTAIDZ_USED | Kernel reads CTA ID Z dimension | Flag (0-byte payload) |
| EIATTR_REQNTID | Required thread block dimensions | 3x 4-byte integers (X, Y, Z) |
| EIATTR_MAX_THREADS | Maximum threads per block | 4-byte LE integer |
| EIATTR_PARAM_CBANK | Constant bank for kernel parameters | 4-byte bank index + offset |
| EIATTR_CBANK_PARAM_SIZE | Size of parameter constant bank region | 4-byte LE integer |
| EIATTR_KPARAM_INFO | Per-parameter ordinal/offset/size/alignment | Structured (V1) |
| EIATTR_KPARAM_INFO_V2 | Per-parameter info, extended format | Structured (V2) |
| EIATTR_MAXREG_COUNT | Maximum register count directive | 4-byte LE integer |
| EIATTR_EXTERNS | List of external symbol references | Array of symbol indices |
Additional EIATTR codes for textures/surfaces, barriers, cooperative groups, tensor cores, and hardware workarounds:
| EIATTR | Purpose |
|---|---|
| EIATTR_IMAGE_SLOT | Texture/surface image binding slot |
| EIATTR_SAMPLER_INIT | Sampler initialization data |
| EIATTR_TEXID_SAMPID_MAP | Texture-to-sampler mapping |
| EIATTR_BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offset table |
| EIATTR_SYNC_STACK | Synchronization stack requirements |
| EIATTR_COOP_GROUP_MASK_REGIDS | Cooperative group register IDs |
| EIATTR_NUM_MBARRIERS | Number of mbarrier objects used |
| EIATTR_MBARRIER_INSTR_OFFSETS | mbarrier instruction locations |
| EIATTR_WMMA_USED | Kernel uses WMMA (Tensor Core) instructions |
| EIATTR_TCGEN05_1CTA_USED | Kernel uses 5th-gen Tensor Core (1-CTA mode) |
| EIATTR_TCGEN05_2CTA_USED | Kernel uses 5th-gen Tensor Core (2-CTA mode) |
| EIATTR_SPARSE_MMA_MASK | Structured sparsity mask for MMA |
| EIATTR_CTA_PER_CLUSTER | CTAs per cluster (SM 90+) |
| EIATTR_EXPLICIT_CLUSTER | Kernel requires explicit cluster launch |
| EIATTR_MAX_CLUSTER_RANK | Maximum cluster rank |
| EIATTR_REG_RECONFIG | Register reconfiguration (setmaxreg) |
| EIATTR_SAM_REGION_STACK_SIZE | SAM (Shared Address Mode) region stack |
| EIATTR_RESERVED_SMEM_USED | Kernel uses reserved shared memory |
| EIATTR_RESERVED_SMEM_0_SIZE | Size of reserved shared memory region 0 |
| EIATTR_SW1850030_WAR | Hardware workaround (bug SW-1850030) |
| EIATTR_SW2393858_WAR | Hardware workaround (bug SW-2393858) |
| EIATTR_SW2861232_WAR | Hardware workaround (bug SW-2861232) |
| EIATTR_CUDA_API_VERSION | Required CUDA API version |
| EIATTR_MERCURY_ISA_VERSION | Mercury ISA version for capmerc binaries |
| EIATTR_MERCURY_FINALIZER_OPTIONS | Finalizer tuning for capmerc |
| EIATTR_SYSCALL_OFFSETS | Byte offsets of syscall instructions |
| EIATTR_INSTR_REG_MAP | Instruction-to-register allocation map (debug) |
| EIATTR_STATISTICS | Per-kernel compilation statistics |
| EIATTR_PERF_STATISTICS | Performance statistics |
Barrier and Register Count Propagation
The EIATTR builder runs a cross-function propagation pass via sub_1CC8950 (2,634 bytes). When a device function uses barriers or a high register count, these requirements must propagate upward to the entry kernel that calls it:
"regcount %d for %s propagated to entry %s"
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
of %s of entry symbol %s"
This ensures that the CUDA driver allocates sufficient barriers and registers for the entry kernel, accounting for all callees.
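The propagation those strings describe amounts to a max-fold over reachable callees. A sketch under stated assumptions: the adjacency-matrix call graph below is illustrative, not ptxas's data structure.

```c
#include <assert.h>

/* Illustrative upward propagation: each entry kernel's barrier
   count must cover every transitively reachable callee.
   The NFUNC x NFUNC adjacency matrix is a made-up layout. */
#define NFUNC 4
static int propagate_barcount(int calls[NFUNC][NFUNC],
                              int own[NFUNC], int entry) {
    int seen[NFUNC] = {0};
    int stack[NFUNC], sp = 0, max = own[entry];
    stack[sp++] = entry;
    seen[entry] = 1;
    while (sp) {
        int f = stack[--sp];
        for (int g = 0; g < NFUNC; g++)
            if (calls[f][g] && !seen[g]) {
                seen[g] = 1;
                if (own[g] > max) max = own[g];  /* "Propagating higher barcount" */
                stack[sp++] = g;
            }
    }
    return max;
}
```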
Relocation Processing
The relocation system handles symbol resolution for branch targets, constant bank references, function descriptors, texture/surface bindings, and address computations. The master relocation resolver is sub_1CD48C0 (4,184 binary bytes, 22 KB decompiled). ptxas defines 117 CUDA-specific relocation types (R_CUDA_NONE through R_CUDA_NONE_LAST).
Relocation Categories
| Category | Types | Purpose |
|---|---|---|
| Absolute address | R_CUDA_32, R_CUDA_64, R_CUDA_ABS* | Global memory addresses |
| Global address | R_CUDA_G32, R_CUDA_G64, R_CUDA_G8_* | Global-space addresses |
| PC-relative | R_CUDA_PCREL_IMM24_26, R_CUDA_PCREL_IMM24_23 | Branch/call targets |
| Constant field | R_CUDA_CONST_FIELD19_*, R_CUDA_CONST_FIELD21_*, R_CUDA_CONST_FIELD22_* | Constant bank references |
| Function descriptor | R_CUDA_FUNC_DESC* | Indirect function call targets |
| Texture/surface | R_CUDA_TEX_HEADER_INDEX, R_CUDA_SAMP_HEADER_INDEX, R_CUDA_SURF_* | Texture/surface binding |
| Bindless | R_CUDA_BINDLESSOFF*, R_CUDA_TEX_BINDLESSOFF* | Bindless texture/surface offsets |
| Sub-byte patching | R_CUDA_8_0 through R_CUDA_8_56 | Individual byte within a 64-bit instruction |
| Unified address | R_CUDA_UNIFIED, R_CUDA_UNIFIED_32, R_CUDA_UNIFIED_8_* | Unified address space |
| Instruction-level | R_CUDA_INSTRUCTION64, R_CUDA_INSTRUCTION128 | Whole-instruction replacement |
| Yield/NOP | R_CUDA_YIELD_OPCODE9_0, R_CUDA_YIELD_CLEAR_PRED4_87 | YIELD-to-NOP patching |
| Cleanup | R_CUDA_UNUSED_CLEAR32, R_CUDA_UNUSED_CLEAR64 | Zero out unused instruction fields |
Resolution Logic
The resolver performs these operations for each relocation entry:
- Alias resolution -- redirect relocations from alias symbols to their targets ("change alias reloc %s to %s")
- Dead function filtering -- skip relocations on eliminated functions ("ignore reloc on dead func %s")
- UFT/UDT pseudo-relocation -- handle __UFT_OFFSET, __UFT_CANONICAL, __UDT_OFFSET, __UDT_CANONICAL synthetic symbols
- PC-relative validation -- ensure branch targets are in the same section ("PC relative branch address should be in the same section")
- YIELD-to-NOP conversion -- convert YIELD instructions to NOP when forward progress requirements prevent yielding
- Unified reloc replacement -- convert type 103 (unified) to type 1 (absolute) for final resolution
- Address computation -- compute final patched value from symbol address + addend
Output relocation sections (.nv.resolvedrela) are written by sub_1CD5920.
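The address-computation step reduces to patching S + A at the relocation's offset. A sketch for a 32-bit absolute relocation in the style of R_CUDA_32 (illustrative; little-endian host assumed, and real resolution also validates ranges and section bounds):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Patch symbol_address + addend into the section bytes at
   r_offset, as the final step of absolute-reloc resolution. */
static void patch_abs32(uint8_t *section, uint64_t r_offset,
                        uint64_t sym_addr, int64_t addend) {
    uint32_t value = (uint32_t)(sym_addr + (uint64_t)addend);
    memcpy(section + r_offset, &value, 4);  /* little-endian host assumed */
}
```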
Debug Information
When --device-debug or --generate-line-info is active, ptxas generates DWARF debug sections. The debug subsystem spans 0x1CBF--0x1CC9 and includes parsers, emitters, and dumpers for the standard DWARF sections plus NVIDIA-specific debug extensions.
Debug Sections
| Section | Content |
|---|---|
| .debug_info | DWARF DIE tree (compilation units, types, variables) |
| .debug_abbrev | DWARF abbreviation table |
| .debug_line | Source-to-address line number mapping |
| .debug_frame | Call frame information for unwinding |
| .debug_loc | Location lists for variables |
| .debug_str | DWARF string table |
| .debug_ranges | Address ranges |
| .debug_aranges | Address range lookup table |
| .debug_pubnames | Public name index |
| .debug_pubtypes | Public type index |
| .nv_debug_ptx_txt | Embedded PTX source text |
| .nv_debug_line_sass | SASS-level line number mapping |
| .nv_debug_info_reg_sass | Register allocation debug info |
| .nv_debug_info_reg_type | Register type information |
Key debug infrastructure functions:
| Function | Purpose |
|---|---|
| sub_1CBF820 | DWARF form name table (DW_FORM_addr, DW_FORM_data4, etc.) |
| sub_1CBF9B0 | DWARF attribute name table (DW_AT_producer, DW_AT_comp_dir, etc.) |
| sub_1CC0850 | .debug_abbrev parser/emitter (18 KB decompiled) |
| sub_1CC4A40 | .debug_info DIE tree walker (28 KB decompiled) |
| sub_1CC34E0 | DWARF location expression decoder (DW_OP_* operations) |
| sub_1CC24C0 | .debug_info emission pass (18 KB decompiled) |
| sub_1CC5EB0 | Compilation unit header parser |
| sub_1C9D1F0 | Debug section classifier/mapper (16 KB decompiled) |
The --suppress-debug-info option (sub_432A00) disables debug section generation even when debug flags are present.
Capsule Mercury Output Path
For SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series), the default output mode is Capsule Mercury. In this mode, the CUBIN ELF contains two layers of content: standard CUBIN sections and a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization.
Mercury-Specific Sections
| Section | Purpose |
|---|---|
| .nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
| .nv.merc.debug_info | Cloned DWARF info |
| .nv.merc.debug_line | Cloned DWARF line table |
| .nv.merc.debug_frame | Cloned DWARF frame info |
| .nv.merc.debug_loc | Cloned DWARF locations |
| .nv.merc.debug_str | Cloned DWARF string table |
| .nv.merc.debug_ranges | Cloned DWARF ranges |
| .nv.merc.debug_aranges | Cloned DWARF address ranges |
| .nv.merc.debug_pubnames | Cloned DWARF public names |
| .nv.merc.debug_pubtypes | Cloned DWARF public types |
| .nv.merc.debug_macinfo | Cloned DWARF macro info |
| .nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
| .nv.merc.nv_debug_line_sass | SASS-level line mapping |
| .nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
| .nv.merc.nv_debug_info_reg_type | Register type debug info |
| .nv.merc.symtab_shndx | Extended section index table (merc copy) |
| .nv.merc.nv.shared.reserved | Shared memory reservation metadata |
| .nv.merc.rela<secname> | Per-section relocation tables |
Capsule Mercury Construction
The capmerc path is integrated into the master ELF emitter. The sequence:
1. sub_1C9B110 (23 KB decompiled) creates the .nv.merc.* section namespace
2. sub_1CA2E40 (18 KB decompiled) clones constant/global/shared/local sections into the merc namespace, creating .nv.merc.rela relocation sections
3. sub_1C9C300 (24 KB decompiled) processes .nv.capmerc<funcname> sections. It constructs a 328-byte capsule descriptor per function containing: the Mercury-encoded instruction stream, relocation metadata, a KNOBS configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage)
4. sub_1CA3A90 (45 KB decompiled) merges sections that have both merc and non-merc copies
5. sub_1C99BB0 (25 KB decompiled) remaps section indices after merc section insertion
Off-Target Finalization
The cubin entry sub_612DE0 implements a "fastpath optimization" for off-target finalization:
"[Finalizer] fastpath optimization applied for off-target %u -> %u finalization"
When a capmerc binary compiled for SM X is finalized for SM Y (within the same family), the fastpath patches the Mercury instruction stream directly without full recompilation. The compatibility checker sub_60F290 determines whether fastpath is safe based on instruction set compatibility, register file layout, and memory model.
Self-Check
The --self-check option performs roundtrip verification: generate capmerc, reconstitute SASS from the capsule, and compare section-by-section. The verifier uses a Flex SASS lexer (sub_720F00, 64 KB) and a comparator (sub_729540, 35 KB). Error string: "Failure of '%s' section in self-check for capsule mercury".
Multi-Kernel Output
A typical CUDA program compiles multiple kernels and device functions into a single CUBIN. The output pipeline handles this through per-function section isolation, combined with cross-function analysis for call graph construction, barrier propagation, and dead code elimination.
Per-Function Section Layout
Each entry function and each device function gets its own .text section (the -ffunction-sections pattern). This enables:
- Function-level dead code elimination -- sub_1CBC090 removes .text, .rela.text, .nv.info, and .nv.constant0 sections for unreachable functions
- Linker granularity -- nvlink can select individual functions from relocatable objects
- Driver loading -- the CUDA runtime can load individual kernels by name
Call Graph Construction
The call graph builder (sub_1CBE1B0) emits a .nv.callgraph section that encodes inter-function call edges. This section is present only in relocatable object mode (-c). The recursion detector (sub_1CBB920) performs a DFS traversal with manual 9-level unrolling, emitting "recursion at function %d" for each cycle found.
Dead functions are eliminated by sub_1CBC090:
"dead function %d(%s)"
"removed un-used section %s (%d)" (x8 -- once per section type)
"function %d(%s) has address taken but no call to it"
Memory Allocation Across Kernels
The master section allocator sub_1CABD60 (11,856 binary bytes, 67 KB decompiled, 69 callees) assigns addresses to all memory-space sections across all kernels. It runs a multi-pass algorithm:
- Global shared allocation -- shared variables visible to multiple kernels
- Per-entry shared memory -- shared variables private to each kernel
- Extern shared handling -- dynamically-sized shared memory (extern __shared__)
- Reserved shared memory -- runtime reservations (.nv.reservedSmem.begin, .nv.reservedSmem.cap, .nv.reservedSmem.offset0, .nv.reservedSmem.offset1)
- Local memory -- per-thread spill storage
- Constant bank merging -- merges constant bank data across kernels, with deduplication (sub_1CA6890: "found duplicate value 0x%x, alias %s to %s")
The shared memory allocator sub_1CA92F0 (2,804 bytes) builds an interference graph for shared objects and performs group allocation for non-overlapping variables.
SHN_XINDEX Overflow
Large CUDA programs can exceed the ELF 65,280-section limit (SHN_LORESERVE = 0xFF00). Each kernel generates at minimum 4 sections (.text, .rela.text, .nv.info, .nv.constant0), so a program with 16,000+ kernels triggers the overflow mechanism:
- e_shnum = 0 in the ELF header (signals overflow)
- Section header [0].sh_size = real section count
- e_shstrndx = SHN_XINDEX (0xFFFF)
- Section header [0].sh_link = real .shstrtab index
- Symbol st_shndx = SHN_XINDEX when real index >= 0xFF00
- .symtab_shndx entries hold the actual section indices
This is standard ELF overflow handling, and it is production-critical -- sub_1CB68D0 checks for it with "overflow number of sections %d".
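The consumer-side lookup is the standard one (a minimal sketch, not recovered code): fall back to the parallel .symtab_shndx array only when st_shndx is the escape value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00
#define SHN_XINDEX    0xFFFF

/* Standard ELF extended-index lookup: when a symbol's st_shndx
   is SHN_XINDEX, the real section index lives in the parallel
   .symtab_shndx array (one u32 per symbol table entry). */
static uint32_t real_shndx(uint16_t st_shndx,
                           const uint32_t *shndx_tab, size_t sym_idx) {
    if (st_shndx == SHN_XINDEX)
        return shndx_tab[sym_idx];
    return st_shndx;
}
```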
Key Functions
| Address | Size | Decompiled | Purpose |
|---|---|---|---|
| sub_612DE0 | ~12 KB | 47 KB | Cubin generation entry point |
| sub_1C9F280 | 15,263 B | 97 KB | Master ELF emitter |
| sub_1CC9800 | 14,764 B | 90 KB | nvinfo/EIATTR section builder |
| sub_1CABD60 | 11,856 B | 67 KB | Master section allocator (shared/const/local) |
| sub_1CB68D0 | 9,578 B | 49 KB | Symbol table builder (.symtab) |
| sub_1CA3A90 | 6,289 B | 45 KB | Section merger (merc + non-merc) |
| sub_1C9DC60 | 5,663 B | 29 KB | Section layout calculator |
| sub_1C99BB0 | 4,900 B | 25 KB | Section index remap (.symtab_shndx) |
| sub_1C9C300 | 3,816 B | 24 KB | Capsule Mercury section processor |
| sub_1C9B110 | 4,585 B | 23 KB | Mercury capsule builder |
| sub_1CD48C0 | 4,184 B | 22 KB | Master relocation resolver |
| sub_1CBC090 | 2,870 B | 20 KB | Dead function eliminator |
| sub_1CA2E40 | 3,152 B | 18 KB | Mercury section cloner |
| sub_1CA92F0 | 2,804 B | 16 KB | Shared memory interference graph |
| sub_1C9D1F0 | 2,667 B | 16 KB | Debug section classifier |
| sub_1CA6890 | 2,286 B | 15 KB | Constant bank deduplication |
| sub_1CC8950 | 2,634 B | 15 KB | Barrier/register count propagator |
| sub_1CBB920 | ~2,000 B | 14 KB | Recursion detector (DFS) |
| sub_1CB91C0 | 2,668 B | 13 KB | ELF structure dumper (debug) |
| sub_1CB53A0 | 3,480 B | 13 KB | ELFW constructor |
| sub_1CAB300 | 2,157 B | 12 KB | Bindless texture/surface handler |
| sub_1CD5920 | 1,985 B | 11 KB | Relocation writer (.nv.resolvedrela) |
| sub_1CD13A0 | 2,541 B | 11 KB | File serializer (final disk write) |
| sub_1CBE1B0 | 1,992 B | 10 KB | .nv.callgraph section builder |
| sub_1CD22E0 | 1,979 B | 10 KB | UFT manager |
| sub_1CB3570 | 1,963 B | 10 KB | Section creator (44 call sites) |
| sub_1C98C60 | 1,755 B | 9 KB | Mercury debug section classifier |
| sub_1CB2CA0 | 2,038 B | 8 KB | Symbol fixup (post-deletion) |
| sub_1CC7FB0 | -- | -- | .nv.info section name formatter |
| sub_1CB9FF0 | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | Get section by index |
Cross-References
- Custom ELF Emitter -- deep dive into ELFW object, header construction, section management, file serialization
- Section Catalog & EIATTR -- complete inventory of section types and EIATTR attribute encoding
- Relocations & Symbols -- relocation resolution, UFT/UDT management, symbol table details
- Debug Information -- DWARF generation and
.debug_*section handling - Capsule Mercury & Finalization -- capmerc packaging format, off-target finalization, self-check
- Mercury Encoder -- Mercury instruction encoding (phases 117--122) that feeds the ELF emitter
- SASS Code Generation -- the upstream per-kernel compilation that produces SASS bytes
- Pipeline Overview -- where the ELF phase fits in the full PTX-to-SASS flow
The Ori Internal Representation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Ori -- short for "Original IR" -- is ptxas's sole intermediate representation. It is a fully proprietary, SASS-level IR with virtual registers, its own CFG infrastructure, and a partial-SSA discipline. Ori has no relationship to LLVM IR: there is no LLVM Value hierarchy, no LLVM-style use-def chains, no SSA dominance-frontier construction. Every IR-level optimization pass in ptxas (prefixed Ori in the NamedPhases table: OriCopyProp, OriSanitize, OriBranchOpt, OriLoopSimplification, OriStrengthReduce, OriDoPredication, etc.) operates on this representation.
The key design decision that distinguishes Ori from PTX: Ori uses SASS opcode names, not PTX mnemonics. After the MercConverter pass (sub_9F1A90, 35KB) runs, every instruction carries the name of the hardware SASS instruction it will become -- IMAD, FFMA, LDG, STG, BAR, BRA, EXIT, etc. -- just with virtual (not physical) register operands. This means the optimizer already knows exactly which hardware functional unit each instruction will execute on, enabling accurate latency modeling and scheduling from the earliest optimization phases.
Key Facts
| Property | Value |
|---|---|
| Name | Ori ("Original IR") |
| Heritage | Fully proprietary (not LLVM-based) |
| Level | SASS machine-level with virtual registers |
| SSA form | Partial -- constructed by phase 23, destroyed by phase 73 |
| Code Object size | ~1136 bytes per function (C++ object) |
| Code Object vtable | 0x21EE238 |
| Register files | 4: R (GPR), UR (uniform), P (predicate), UP (uniform predicate) |
| Operand kinds | 10 distinct types |
| CFG representation | FNV-1a hash maps for successor/backedge edges |
| Opcode encoding | ROT13 of real SASS mnemonic names |
| BB entry size | 40 bytes per basic block, contiguous array |
| Instruction linkage | Doubly-linked list within each basic block |
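Per the Key Facts table, stored opcode names are the ROT13 of the real SASS mnemonics, so a dumper only needs a standard ROT13 pass to recover them (this helper is illustrative, not recovered code):

```c
#include <assert.h>
#include <string.h>

/* In-place ROT13 over A-Z/a-z; digits and punctuation (e.g. the
   '.' in modifier suffixes) pass through unchanged. ROT13 is its
   own inverse, so the same pass encodes and decodes. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + 13) % 26);
    }
}
```

For example, a stored "VZNQ" decodes to the SASS mnemonic IMAD.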
Architecture Overview
PTX source
|
v
[Flex/Bison parser] -- see pipeline/ptx-parser.md
|
v
[PTX-to-Ori lowering] -- see pipeline/ptx-to-ori.md
|
v
+-------------------------------------------+
| Ori IR |
| |
| Code Object (per-function container) |
| +-- Basic Block array (40B entries) |
| | +-- Instruction linked list |
| | +-- Packed operand array |
| +-- CFG (FNV-1a hash map edges) |
| +-- RPO array |
| +-- Register file arrays |
| +-- Backedge map |
+-------------------------------------------+
|
| 159 optimization phases (phases 0-158)
| phase 23: GenerateMovPhi (enter partial SSA)
| phase 73: ConvertAllMovPhiToMov (exit partial SSA)
|
v
[Instruction selection] -- see codegen/isel.md
|
v
[Register allocation] -- see regalloc/overview.md
|
v
[Instruction scheduling] -- see scheduling/overview.md
|
v
[SASS binary encoding] -- see codegen/encoding.md
The Code Object
Every function under compilation is represented by a single Code Object -- a ~1136-byte C++ structure that owns all IR data for that function. The Code Object vtable is at 0x21EE238. Its constructor is at sub_A3B080.
Field Map
| Offset | Type | Field | Description |
|---|---|---|---|
| +24 | u32 | sm_version | SM target (encoded: 12288=sm30, 20481=sm50, 36865=sm90) |
| +72 | ptr | code_buf | Output code object buffer |
| +88 | ptr | reg_file | Register descriptor array. *(ctx+88)+8*regId -> descriptor |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +272 | ptr | instr_head | Instruction linked-list head |
| +296 | ptr | bb_array | Basic block array pointer (40B per entry) |
| +304 | u32 | bb_index | Basic block array count/current index |
| +312 | ptr | options | OptionsManager* for knob queries |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +768 | ptr | const_sections | Constant memory section array |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (see data-structures.md) |
| +1664 | ptr | knob_container | Knob container pointer (for -knob queries) |
| +1928 | ptr | codegen_ctx | Code object / code generation context |
Register and Instruction Counts (SM Backend Object)
The register counts and instruction counts live in the SM backend object at *(code_obj+1584), accessed via DWORD-indexed fields (not Code Object byte offsets). Earlier versions of this page incorrectly listed these as Code Object offsets +99, +102, +159, +335, +341 -- those are DWORD indices, making the actual byte offsets 396, 408, 636, 1340, and 1364 respectively within the SM backend.
| DWORD Index | Byte Offset | Type | Field | Description |
|---|---|---|---|---|
| [99] | +396 | u32 | ur_count | Uniform register (UR) count |
| [102] | +408 | u32 | r_alloc | R-register count (allocated) |
| [159] | +636 | u32 | r_reserved | R-register count (reserved) |
| [335] | +1340 | u32 | instr_hi | Instruction count (upper bound) |
| [341] | +1364 | u32 | instr_lo | Instruction count (lower bound) |
Register count formula (from sub_A4B8F0, where v5 = *(_DWORD **)(ctx + 1584)):
total_R_regs = v5[159] + v5[102] // reserved + allocated
instruction_count = v5[335] - v5[341] // upper - lower
The stats emitter at sub_A3A7E0 prints a detailed per-function profile:
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
Basic Blocks
Basic blocks are stored as 40-byte entries in a contiguous array at Code Object +976. The block count is at +984.
Block Entry Layout (40 bytes)
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Head instruction pointer (first instruction in BB) |
| +8 | ptr | Instruction list link / tail |
| +28 | i32 | bix -- block index (unique ID for CFG operations) |
| +32 | u64 | Flags / padding |
Blocks are additionally accessible via a sub-block array at Code Object +368, indexed as *(ctx+368) + 8*blockIndex.
The debug dumper (sub_BE21D0) emits Graphviz DOT output for the CFG:
digraph f {
node [fontname="Courier" ...]
bix0 -> bix1
bix0 -> bix3
bix1 -> bix2
bix2 -> bix1 // backedge (loop)
bix2 -> bix3
}
Control Flow Graph
The CFG uses FNV-1a hash maps to represent edges. Two separate hash tables exist at Code Object offsets +648 (successor edges) and +680 (backedge info).
FNV-1a Hashing
All CFG hash lookups use the same parameters, confirmed across 50+ call sites:
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
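A C sketch of the hash under these parameters (the generic byte loop is ours; ptxas inlines the equivalent code at each call site):

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a with the recovered parameters. */
static uint32_t fnv1a(const uint8_t *p, size_t n) {
    uint32_t h = 0x811C9DC5u;           /* initial hash (offset basis) */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];                      /* xor the byte first ...      */
        h *= 16777619u;                 /* ... then multiply by prime  */
    }
    return h;
}

/* CFG lookups hash the 4-byte block index byte-by-byte (little-endian). */
static uint32_t cfg_hash_bix(uint32_t bix) {
    uint8_t b[4] = { bix & 0xFF, (bix >> 8) & 0xFF,
                     (bix >> 16) & 0xFF, (bix >> 24) & 0xFF };
    return fnv1a(b, 4);
}
```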
Hash Map Structure
Each hash map uses chained hashing with 24-byte bucket entries:
Bucket (24 bytes):
+0 node* head // first node in chain
+8 node* tail // last node in chain
+16 i32 count // entries in this bucket
Full Node (64 bytes):
+0 node* next // chain link
+8 i32 key // block index
+16 ptr values // successor/predecessor block list
+32 sub-hash data // embedded sub-table for multi-edge blocks
+56 u32 hash // cached FNV-1a hash
Simple Node (16 bytes):
+0 node* next
+8 i32 key
+12 u32 hash
Growth policy: rehash when total_elements > num_unique_keys (load factor exceeds 1.0). Capacity doubles on each rehash.
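Putting the pieces together, a lookup against one of these maps looks like the following sketch (struct and function names are ours; the embedded sub-table and cached-hash fields of the full node are omitted):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct CfgNode {
    struct CfgNode *next;    /* +0  chain link           */
    int32_t         key;     /* +8  block index          */
    void           *values;  /* +16 successor block list */
} CfgNode;

typedef struct {
    CfgNode *head;           /* +0  first node in chain */
    CfgNode *tail;           /* +8  last node in chain  */
    int32_t  count;          /* +16 entries in bucket   */
} CfgBucket;

static void *cfg_lookup(const CfgBucket *b, size_t nbuckets, int32_t bix) {
    uint32_t h = 0x811C9DC5u;                 /* FNV-1a over the bix bytes */
    for (int i = 0; i < 4; i++)
        h = (h ^ (((uint32_t)bix >> (8 * i)) & 0xFF)) * 16777619u;
    for (const CfgNode *n = b[h % nbuckets].head; n; n = n->next)
        if (n->key == bix)
            return n->values;                 /* found: successor list */
    return NULL;                              /* no edge info recorded */
}
```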
Key CFG Functions
| Address | Size | Function | Notes |
|---|---|---|---|
sub_BDE150 | 9KB | CFG::computeRPO | Explicit DFS stack, assigns RPO numbers into +720 array |
sub_BDE8B0 | 2KB | CFG::printEdges | FNV-1a lookup, prints "bix%d -> bix%d\n" |
sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges | RPO + backedge debug dump |
sub_BE0690 | 54KB | CFG::buildAndAnalyze | Main CFG constructor: predecessors, successors, RPO, loop detection |
sub_BE21D0 | 1.4KB | CFG::dumpDOT | Graphviz DOT format output |
sub_BE2330 | 4KB | CFG::computeDominators | Post-build dominator/loop analysis with bitvector ops |
The RPO dump (sub_BDEA50) produces output like:
Showing RPO state for each basic block:
bix0 -> RPONum: 0
bix1 -> RPONum: 1
bix2 -> RPONum: 3
bix3 -> RPONum: 2
RPO traversal order: bix0, bix1, bix3, bix2
Showing backedge info:
bix2 -> backedge's successor BB: 1
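The computeRPO pass (sub_BDE150) uses an explicit DFS stack rather than recursion. A self-contained sketch of the numbering scheme (the adjacency arrays are ours -- the real pass walks the FNV-1a edge maps -- and exact numbers depend on successor visit order, so they can differ from the dump above):

```c
#define MAXBB 16

/* Reverse post-order: number blocks as the DFS *finishes* them,
   counting down, so the entry block (bix0) always gets RPO 0. */
static void compute_rpo(int nblocks, int succ[][MAXBB],
                        const int nsucc[], int rpo_num[]) {
    int  stack[MAXBB], child_ix[MAXBB];
    char visited[MAXBB] = {0};
    int  sp = 0, next_po = nblocks;
    stack[0] = 0; child_ix[0] = 0; visited[0] = 1;  /* start at entry */
    while (sp >= 0) {
        int b = stack[sp];
        if (child_ix[sp] < nsucc[b]) {
            int s = succ[b][child_ix[sp]++];
            if (!visited[s]) {                      /* descend into s */
                visited[s] = 1;
                ++sp; stack[sp] = s; child_ix[sp] = 0;
            }
        } else {
            rpo_num[b] = --next_po;                 /* b is finished */
            --sp;
        }
    }
}
```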
Instructions
Instructions are C++ objects with a large vtable, linked into per-basic-block doubly-linked lists. Each instruction carries a unique integer ID, an opcode, and a packed operand array.
Instruction Layout
| Offset | Type | Field | Description |
|---|---|---|---|
| +8 | varies | reg_class | Register class / encoding fields |
| +16 | i32 | id | Unique instruction ID |
| +28 | u32 | reserved | Reserved / padding |
| +36 | u32 | flags | Flags (bits 19-21 = subtype) |
| +48 | u8 | special_flags | Volatile/special (bit 5 = volatile) |
| +72 | u32 | opcode | SASS opcode (lower 12 bits = base, bits 12-13 = modifier; confirmed 50+ sites) |
| +76 | u32 | opcode_aux | Auxiliary opcode data / per-instruction flags |
| +80 | u32 | operand_count | Number of operands |
| +84 | u32[] | operands | Packed operand array (8 bytes per operand) |
| +160 | ptr | enc_buf | Encoding buffer pointer (post-selection) |
| +184 | u32 | enc_mode | Encoding mode |
| +200 | u64 | imm_value | Immediate value |
Packed Operand Encoding
Each operand occupies 8 bytes in the operand array starting at instruction offset +84:
 31  30 29 28  27            20  19                    0
+---+----------+----------------+----------------------+
| S | type (3) | modifier bits  |    index (20 bits)   |
+---+----------+----------------+----------------------+
  ^       ^                       ^
  |       bits 28-30: type        bits 0-19: reg/sym index
  bit 31: sign/negate flag (S)
type field (bits 28-30):
1 = register operand -> index into *(ctx+88) register file
5 = symbol/const operand -> index into *(ctx+152) symbol table
Operand Word 1 (Upper 4 Bytes)
Each 8-byte operand slot has two DWORDs. Word 0 (documented above) carries type/modifier/index. Word 1 carries extended flags:
Word 1 (at instr + 84 + 8*i + 4):
31 30 29 28 27 26 25 24 23 0
+---+---+---+---+---+---+---+---+-------------------------------+
| reserved / mod flags |CB | auxiliary data |
+---+---+---+---+---+---+---+---+-------------------------------+
^
bit 24: const-bank flag (CB)
Bits 25-31 (mask 0xFE000000): extended modifier flags
When any bit is set, the operand has special semantics.
Peephole matchers bail out early if (word1 & 0xFE000000) != 0.
Bit 25 (0x2000000): operand reuse / negation extension
Bit 26 (0x4000000): absolute-value modifier (|x|)
Bit 24 (mask 0x1000000): const-bank flag
When set, indicates the source references a constant bank (c[N][offset]).
The scheduler uses this to distinguish FADD (standard) from FADD (const-bank)
for latency modeling (see scheduling/latency-model.md).
Bits 0-23: auxiliary data
For symbol/const operands (type 5): constant bank number
For predicate guards (type 6): predicate sense (true/false)
For register operands (type 1): typically zero
Evidence: sub_40848E checks (word1 & 0xFE000000) != 0 across all operands; sub_405769 tests both 0x1000000 and 0x6000000 combinations; sub_404AD0 verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms. Confirmed in 30+ decompiled functions (confidence 0.92).
Extraction Pattern
Extraction pattern (appears in 50+ functions):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7;
int index = operand & 0xFFFFF;
int mods = (operand >> 20) & 0xFF;
uint32_t word1 = *(uint32_t*)(instr + 84 + 8 * i + 4);
bool has_const_bank = (word1 & 0x1000000) != 0;
bool has_ext_mods = (word1 & 0xFE000000) != 0;
Opcode Constants
Selected confirmed opcodes (from multiple independent functions):
| Value | Instruction | Notes |
|---|---|---|
| 47 | NOP / barrier | |
| 72 | RET | Return (CALL is opcode 71) |
| 91 | ATOM | Atomic memory operation |
| 92 | RED | Reduction operation |
| 95 | STS | Store to shared memory (ROT13: FGF). Note: EXIT = opcode 77 (RKVG), RET = opcode 72 (ERG) |
| 155 | LD variant | Load instruction |
| 173 | ST variant | Store instruction |
| 183 | LD.E | Extended load (& 0xFFFFCFFF mask removes modifier bits) |
| 267 | ST variant | Store (& 0xFFFFCFFF) |
| 268 | LD variant | Load (& 0xFFFFCFFF) |
| 288 | ST.E | Extended store |
The 0xFFFFCFFF mask (clear bits 12-13) strips modifier/suboperation bits from the opcode, yielding the base instruction class. This pattern appears in InstructionClassifier, MBarrierDetector, and OperandLowering code.
ROT13 Opcode Names
All SASS opcode mnemonic strings stored in the binary are ROT13-encoded. The master table is initialized in sub_BE7390 (InstructionInfo constructor) at offset 4184 of the InstructionInfo object, with 16-byte {name, length} entries. This is lightweight obfuscation -- not a security measure.
Selected decoded names (~200+ total, covering the full sm_70+ SASS ISA):
| ROT13 | Real | Category |
|---|---|---|
VZNQ | IMAD | Integer multiply-add |
VNQQ3 | IADD3 | 3-input integer add |
SSZN | FFMA | FP fused multiply-add |
SNQQ | FADD | FP add |
SZHY | FMUL | FP multiply |
ZBI | MOV | Move |
FRY | SEL | Select |
YBC3 | LOP3 | 3-input logic |
VFRGC | ISETP | Integer set-predicate |
SFRGC | FSETP | FP set-predicate |
YRN | LEA | Load effective address |
FUS | SHF | Shift / funnel shift |
ZHSH | MUFU | Multi-function unit (SFU) |
YQT | LDG | Load global |
FGT | STG | Store global |
YQP | LDC | Load constant |
YQY | LDL | Load local |
YQF | LDS | Load shared |
NGBZ | ATOM | Atomic |
ONE | BAR | Barrier |
OEN | BRA | Branch |
PNYY | CALL | Call |
ERG | RET | Return |
RKVG | EXIT | Exit |
GRK | TEX | Texture |
ZRZONE | MEMBAR | Memory barrier |
JNECFLAP | WARPSYNC | Warp synchronize |
C2E | P2R | Predicate to register |
E2C | R2P | Register to predicate |
ABC | NOP | No-op |
OFFL | BSSY | Branch sync stack push |
OFLAP | BSYNC | Branch sync |
QRCONE | DEPBAR | Dependency barrier |
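The table above can be verified mechanically with a standard ROT13 transform:

```c
/* Undo the name-table encoding in place: VZNQ -> IMAD, ABC -> NOP, ... */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + 13) % 26;
        else if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + 13) % 26;
        /* digits (YBC3 -> LOP3, C2E -> P2R) pass through unchanged */
    }
}
```

ROT13 is its own inverse, so the same function also re-encodes a mnemonic for searching the binary's string table.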
Register Files
Ori maintains four distinct register files, mirroring the SASS hardware register model.
Register File Summary
| File | Width | Range | Special | ABI type | Code Object offset |
|---|---|---|---|---|---|
| R | 32-bit | R0 -- R255 | RZ (read-zero) | 2 | +102 (alloc), +159 (reserved) |
| UR | 32-bit | UR0 -- UR63 | URZ (read-zero) | 3 | +99 |
| P | 1-bit | P0 -- P6 | PT (always-true) | 5 | (tracked separately) |
| UP | 1-bit | UP0 -- UP6 | UPT (always-true) | -- | (tracked separately) |
R registers are the main 32-bit general-purpose registers. 64-bit values occupy consecutive pairs (e.g., R4:R5). The total R-register count for a function is field[159] + field[102] (reserved + allocated). Maximum is 255 usable registers (R0-R254); R255 is the hardware zero register RZ.
UR registers (sm_75+) are uniform registers shared across the warp. Every thread sees the same value. UR0-UR63 on supported architectures. The count is at Code Object +99.
P registers are 1-bit predicate registers used for conditional execution. P0-P6 are usable; PT is the hardwired always-true predicate (writes are discarded).
UP registers are the uniform variant of predicates, shared across the warp like UR.
Register Descriptor
Each register is described by a descriptor in the register file array, accessed as *(ctx+88) + 8*regId:
| Offset | Type | Field |
|---|---|---|
| +8 | u32 | Size / live range info |
| +12 | u32 | Register number |
| +16 | u32 | Register class (enum) |
| +20 | u32 | Physical register name (assigned after regalloc) |
| +24 | ptr | Definition info (0 = undefined / uninitialized) |
| +36 | u32 | Flags (bits 19-21 = subtype) |
| +48 | u8 | Volatile/special flags (bit 5 = volatile marker) |
| +64 | u32 | Register file type enum |
| +68 | u32 | Physical register number (post-allocation) |
Register file type values at descriptor +64:
| Value | Meaning |
|---|---|
| 2 | General-purpose (R) |
| 3 | Uniform (UR) |
| 5 | Predicate (P) |
| 6 | General register (alternate classification) |
| 7 | Predicate (alternate classification) |
| 10 | Extended register pair (64-bit) |
| 11 | Extended register quad (128-bit) |
The register class name table at off_21D2400 maps reg_type enum values to string names. The stat collector (sub_A60B60, 24KB) enumerates ~25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others. The allocator processes classes 0--6 (matching reg_type values 0--6); barrier registers (reg_type 9) are handled separately.
Partial SSA
Ori does not maintain full SSA form at all times. Instead, it uses a bounded "partial SSA" window managed by two phases in the 159-phase optimization pipeline.
Phase 23: GenerateMovPhi
Constructs phi-like MovPhi pseudo-instructions at CFG merge points. Inserted after loop unrolling (phase 22) and before pipelining (phase 24). This establishes partial SSA form -- not through LLVM-style dominance-frontier phi insertion, but through explicit MovPhi nodes that represent value merging at control-flow join points.
Phase 73: ConvertAllMovPhiToMov
Destroys SSA form by lowering every MovPhi into a plain MOV instruction. Runs after sync instruction expansion (phase 72) and before uniform register conversion (phase 74). This is SSA destruction without the need for interference-graph-based coalescing -- the MovPhi nodes simply become copies.
The SSA Window
The partial-SSA window spans phases 23 through 73, covering the bulk of the optimization pipeline:
Phase 23 GenerateMovPhi <-- SSA construction
Phase 24 OriPipelining
Phase 25 StageAndFence
Phase 26 OriRemoveRedundantBarriers
Phase 29 GeneralOptimize
Phase 37 GeneralOptimizeMid
Phase 46 GeneralOptimizeMid2
Phase 49 GvnCse
Phase 50 OriReassociateAndCommon
Phase 54 OriDoRematEarly
Phase 58 GeneralOptimizeLate
Phase 63 OriDoPredication
Phase 65 GeneralOptimizeLate2
Phase 69 OriDoRemat
Phase 70 OriPropagateVaryingSecond
Phase 71 OptimizeSyncInstructions
Phase 72 LateExpandSyncInstructions
Phase 73 ConvertAllMovPhiToMov <-- SSA destruction
All optimizations between these two phases can rely on the single-definition property of MovPhi nodes for reaching-definition analysis.
MovPhi Instruction Format
A MovPhi is not a distinct opcode -- it reuses the MOV opcode (19) with a distinguishing flag in the instruction's auxiliary fields. Phase 73 (ConvertAllMovPhiToMov) converts MovPhi to plain MOV by clearing this flag, without changing the opcode value.
MovPhi operand layout:
+72 opcode = 19 (MOV)
+76 opcode_aux = flag distinguishing MovPhi from plain MOV
+80 operand_count = 2*N + 1 (variable, one destination + N source-predecessor pairs)
operand[0]: destination register (the merged value)
operand[1], [2]: {source_reg, predecessor_bix} for predecessor 0
operand[3], [4]: {source_reg, predecessor_bix} for predecessor 1
...
operand[2*N-1], [2*N]: {source_reg, predecessor_bix} for predecessor N-1
This is the operational equivalent of an SSA phi node. For a CFG merge with two predecessors:
;; PTX-level CFG: ;; Ori MovPhi:
;; bix1 defines R7 ;;
;; bix2 defines R9 ;; MovPhi R3, R7, bix1, R9, bix2
;; bix3 merges ;;
;; uses R3 ;; "if from bix1, R3 = R7; if from bix2, R3 = R9"
Phase 23 (GenerateMovPhi) inserts these at merge points where a register has different reaching definitions from different predecessors. Phase 73 destructor linearizes them: it inserts a MOV R3, R7 at the end of bix1 and a MOV R3, R9 at the end of bix2, then deletes the MovPhi.
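The phase-73 lowering amounts to copy insertion at the end of each predecessor. A simplified, self-contained model (these tiny structs are ours -- real ptxas rewrites 296-byte instruction objects in place):

```c
/* One MOV to be inserted at the end of predecessor block pred_bix. */
typedef struct { int dst, src, pred_bix; } Copy;

/* MovPhi: operand[0] = dst, then N {source, predecessor-bix} pairs. */
typedef struct { int dst; int n; int src[8]; int pred_bix[8]; } MovPhi;

/* Emit MOV dst, src[i] for each predecessor; caller then deletes the phi. */
static int lower_movphi(const MovPhi *phi, Copy *out) {
    for (int i = 0; i < phi->n; i++) {
        out[i].dst      = phi->dst;
        out[i].src      = phi->src[i];
        out[i].pred_bix = phi->pred_bix[i];
    }
    return phi->n;   /* number of copies inserted */
}
```

For the example above, MovPhi R3, R7, bix1, R9, bix2 yields MOV R3, R7 in bix1 and MOV R3, R9 in bix2.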
Operand Kinds
The IR supports 10 distinct operand kinds, identified through the register allocator verifier (sub_A55D80) and the instruction selection pattern matcher infrastructure.
| # | Kind | Description |
|---|---|---|
| 1 | R/UR register | General-purpose or uniform register operand |
| 2 | P/UP register | Predicate or uniform-predicate register operand |
| 3 | Any register | Wildcard -- matches any register class |
| 4 | Offset | Memory offset for address computation |
| 5 | Regular | Standard immediate or constant value |
| 6 | Predicated | Guard predicate controlling conditional execution |
| 7 | Remat | Rematerialization marker (value can be recomputed instead of spilled) |
| 8 | Spill-refill | Spill/refill pair marker for register allocator |
| 9 | R2P / P2R | Register-to-predicate or predicate-to-register conversion pair |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |
The regalloc verifier (sub_A55D80, confidence 0.95) classifies 10 problem categories that map to these operand kinds:
- Missing spill match for refill
- Refill reads uninitialized memory
- P2R-R2P pattern match failure
- Bit-spill-refill pattern match failure
- Previously defined operand now uninitialized
- Extra post-regalloc definitions (mixed-size check)
- Rematerialization problem
- P2R-R2P base destroyed
- Bit-spill-refill base destroyed
- Definitions disappeared without new ones added
The pattern matcher infrastructure at 0xB7D000--0xBA9D00 (~390 functions) uses a separate classification for instruction selection:
| Function | Predicate |
|---|---|
sub_B28E10 | isRegOperand |
sub_B28E20 | isPredOperand |
sub_B28E40 | isImmOperand |
sub_B28E80 | isConstOperand |
sub_B28E90 | isUReg |
sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
Ori vs. PTX
PTX is a virtual ISA -- a stable interface between the compiler frontend and the architecture-specific backend. Ori is the architecture-specific backend representation that replaces PTX opcodes with actual SASS instructions early in compilation.
| Aspect | PTX | Ori |
|---|---|---|
| Opcode set | Virtual mnemonics (add, mul, ld, st) | SASS hardware opcodes (IMAD, FFMA, LDG, STG) |
| Register model | Unlimited virtual registers, typed | 4 hardware register files (R, UR, P, UP) with virtual numbering |
| SSA form | Not applicable (PTX is a linear ISA) | Partial SSA between phases 23 and 73 |
| CFG representation | Implicit (labels + branches) | Explicit hash-map-based CFG with RPO, backedges, dominators |
| Target dependence | Architecture-independent (forward-compatible) | Architecture-specific (per-SM instruction selection) |
| Conversion point | Input to ptxas | After MercConverter (sub_9F1A90) |
The MercConverter pass is the boundary: it transforms PTX-derived intermediate opcodes into SM-specific SASS opcodes by dispatching through a large opcode switch (sub_9ED2D0, 25KB). After MercConverter, the string "After MercConverter" appears in diagnostic output, and the IR is fully in SASS-opcode form. Each instruction then carries enough information for the scheduler to compute accurate latencies, throughputs, and functional-unit assignments.
Worked Example: add.f32 to FADD
This traces a single PTX instruction through the Ori representation, showing exactly how the opcode, operands, and register references are encoded in memory.
PTX Input
add.f32 %f3, %f1, %f2
After MercConverter (sub_9F1A90), this becomes the Ori instruction:
FADD R3, R1, R2
The type qualifier .f32 disappears -- the "F" in FADD encodes the float type. Register names %f1, %f2, %f3 become virtual register IDs R1, R2, R3 in the R (GPR) register file.
Instruction Object in Memory
FADD is opcode 12 in the ROT13 name table (ROT13: SNQQ, at InstructionInfo+4184+16*12). The 296-byte instruction object:
Offset Value Field
------ ----------------- ---------------------
+0 prev_ptr Linked-list prev
+8 next_ptr Linked-list next
+16 <id> Unique instruction ID
+72 0x0000000C opcode = 12 (FADD)
+80 0x00000003 operand_count = 3
+84 0x10000003 operand[0] word0: dst R3
+88 0x00000000 operand[0] word1: no ext flags
+92 0x10000001 operand[1] word0: src R1
+96 0x00000000 operand[1] word1: no ext flags
+100 0x10000002 operand[2] word0: src R2
+104 0x00000000 operand[2] word1: no ext flags
Operand Decoding
Take operand[0] word0 = 0x10000003:
0x10000003 in binary:
bit 31 = 0 (no sign/negate)
bits 28-30 = 001 (type = 1 = register operand)
bits 20-27 = 00000000 (no modifiers)
bits 0-19 = 00003 (register index = 3)
The register index resolves through the register descriptor array:
reg_desc = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * 3);
// reg_desc + 64: reg_file_type = 2 (R / GPR file)
// reg_desc + 12: register number = 3
If the source operand were a constant-bank reference (e.g., FADD R3, R1, c[0][0x10]), operand[2] would have type=5 (symbol/constant) in word0 and the const-bank flag (0x1000000) set in word1. The scheduler distinguishes these two FADD variants for latency modeling: standard FADD gets throughput class 0x3D, while const-bank FADD gets 0x78.
Memory Space Classification
Memory operands carry a space type enum, resolved by sub_91C840 which maps the PTX-level space identifier to an internal category number. The full input enumeration (from complete decompilation of sub_91C840, confidence 0.98):
| Input | PTX Space | Internal Category | Notes |
|---|---|---|---|
| 0 | (none) | -- | Unmapped, no memory space |
| 1 | Register / generic | 16 | Register file address |
| 2 | Code / function | 12 | Function address |
| 3 | (gap) | -- | Unmapped |
| 4 | .shared | 1 | Shared memory |
| 5 | .const | 3 | Constant memory |
| 6 | .global | 11 | Global memory |
| 7 | .local | 2 | Local memory |
| 8 | (gap) | -- | Unmapped |
| 9 | .local (variant) | 2 | Same as 7, alternate encoding |
| 10--11 | (gap) | -- | Unmapped |
| 12 | .param | 4 | Parameter memory |
| 13 | Generic (unqualified) | 0 | Generic address space |
| 14 | .tex | 8 | Texture memory |
| 15 | .surf | 17 | Surface memory |
| 16 | Spill space | 7 | Register spill/fill scratch |
| 17 | (gap) | -- | Unmapped |
| 18 | (instruction-dependent) | varies | Sub-classifies by opcode at a2[1] |
| 19 | .uniform | 15 | Uniform (sm_75+) |
| 20 | .global (extended) | 6 | Global, extended variant |
| 21 | .const (extended) | 5 | Constant, extended store-to-global path |
| 22 | .const (extended, alt) | 5 | Constant, alternate extended |
| 23 | .surf / tensor (ext) | 18 | Surface/tensor extended (sm_90+) |
Case 18 (0x12) uses a sub-switch on the opcode value at a2[1] to further classify: opcodes 7, 43, 45, 53 map to category 6 (global-like); opcode 111 and opcodes in the 183--199 range map to category 5 (constant-like); opcodes 54 and 189 map to category 9 (special).
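The mapping in the table can be expressed as a plain switch. A sketch of sub_91C840's input-to-category translation (function name is ours; the instruction-dependent case 18 is elided, and -1 marks unmapped inputs):

```c
static int space_category(int input) {
    switch (input) {
    case 1:  return 16;  /* register / generic        */
    case 2:  return 12;  /* code / function address   */
    case 4:  return 1;   /* .shared                   */
    case 5:  return 3;   /* .const                    */
    case 6:  return 11;  /* .global                   */
    case 7:
    case 9:  return 2;   /* .local (both encodings)   */
    case 12: return 4;   /* .param                    */
    case 13: return 0;   /* generic (unqualified)     */
    case 14: return 8;   /* .tex                      */
    case 15: return 17;  /* .surf                     */
    case 16: return 7;   /* spill/fill scratch        */
    case 19: return 15;  /* .uniform (sm_75+)         */
    case 20: return 6;   /* .global (extended)        */
    case 21:
    case 22: return 5;   /* .const (extended paths)   */
    case 23: return 18;  /* surf/tensor ext (sm_90+)  */
    default: return -1;  /* unmapped gap, or case 18  */
    }
}
```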
The hot/cold classifier pair (sub_A9CDE0 / sub_A9CF90) consumes the internal category to partition instructions for scheduling. Hot memory operations (global loads/stores, certain atomics -- category 11) have long latencies and benefit from aggressive scheduling; cold operations (constant loads -- category 3) have shorter latencies and are treated more conservatively.
Related Pages
- Instructions -- detailed instruction format and encoding
- CFG -- basic block and control-flow-graph internals
- Registers -- register model, descriptor layout, allocation interface
- Data Structures -- hash tables, bitvectors, linked lists
- Pipeline Overview -- where Ori sits in the full PTX-to-SASS flow
- PTX-to-Ori Lowering -- how PTX becomes Ori
- Optimizer -- the 159-phase optimization pipeline
- Hash Tables and Bitvectors -- FNV-1a maps and SSE2 bitvectors used by the CFG
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_A3B080 | -- | Code Object constructor; allocates ~1136-byte per-function IR container (vtable at 0x21EE238) | 0.90 |
sub_A4B8F0 | -- | Register count formula: total_R = v5[159] + v5[102], instr_count = v5[335] - v5[341] | 0.90 |
sub_A3A7E0 | -- | Stats emitter; prints per-function profile (instruction count, register count, occupancy, latency) | 0.90 |
sub_BE21D0 | 1.4KB | CFG::dumpDOT; emits Graphviz DOT output for the control flow graph | 0.92 |
sub_BDE150 | 9KB | CFG::computeRPO; explicit DFS stack, assigns reverse post-order numbers into Code Object +720 array | 0.92 |
sub_BDE8B0 | 2KB | CFG::printEdges; FNV-1a lookup, prints "bix%d -> bix%d\n" | 0.92 |
sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges; RPO traversal order + backedge debug dump | 0.92 |
sub_BE0690 | 54KB | CFG::buildAndAnalyze; main CFG constructor -- predecessors, successors, RPO, loop detection | 0.92 |
sub_BE2330 | 4KB | CFG::computeDominators; post-build dominator and loop analysis with bitvector operations | 0.92 |
sub_BE7390 | -- | InstructionInfo constructor; initializes 322-entry ROT13 opcode name table at object offset +4184 | 0.90 |
sub_9F1A90 | 35KB | MercConverter pass; transforms PTX-derived opcodes into SM-specific SASS opcodes | 0.92 |
sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization | 0.90 |
sub_91C840 | -- | Memory space classifier; maps PTX-level space identifiers (0--23) to internal category numbers | 0.98 |
sub_A9CDE0 | -- | Hot/cold memory classifier (hot path); partitions instructions by memory category for scheduling | 0.85 |
sub_A9CF90 | -- | Hot/cold memory classifier (cold path); complement of sub_A9CDE0 | 0.85 |
sub_A60B60 | 24KB | Register stat collector; enumerates ~25 register sub-classes (R, P, B, UR, UP, UB, Tensor/Acc, etc.) | 0.85 |
sub_A55D80 | -- | Register allocator verifier; classifies 10 operand-kind problem categories for regalloc validation | 0.95 |
sub_40848E | -- | Operand extended-flag checker; tests (word1 & 0xFE000000) != 0 across all operands | 0.85 |
sub_405769 | -- | Operand flag tester; tests 0x1000000 and 0x6000000 combinations in operand word 1 | 0.85 |
sub_404AD0 | -- | Peephole guard; verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms | 0.85 |
sub_B28E10 | -- | isRegOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E20 | -- | isPredOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E40 | -- | isImmOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E80 | -- | isConstOperand; ISel pattern matcher operand predicate | 0.90 |
sub_B28E90 | -- | isUReg; ISel pattern matcher operand predicate | 0.90 |
sub_B28E00 | -- | getRegClass; returns register class (1023 = wildcard, 1 = GPR) | 0.90 |
Instructions & Opcodes
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the Ori IR instruction representation: in-memory layout, opcode encoding, operand model, instruction flags, creation/iteration APIs, the master descriptor table, and opcode categories. All offsets are from ptxas v13.0.88 (37.7 MB stripped x86-64 ELF).
Instruction Object Layout
Every Ori instruction is a 296-byte C++ object allocated from the Code Object's arena. Instructions are linked into per-basic-block doubly-linked lists via pointers at offsets +0 and +8. The allocator at sub_7DD010 allocates exactly 296 bytes per instruction and zeroes the object before populating it.
Memory Layout (296 bytes)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | prev | Previous instruction in BB linked list (nullptr for head) |
| +8 | 8 | ptr | next | Next instruction in BB linked list (nullptr for tail) |
| +16 | 4 | i32 | id | Unique instruction ID (monotonically increasing within function) |
| +20 | 4 | i32 | ref_count | Reference/use count (incremented by sub_7E6090) |
| +24 | 4 | i32 | bb_index | Basic block index (bix) this instruction belongs to |
| +28 | 4 | u32 | reserved_28 | Reserved / padding |
| +32 | 4 | u32 | control_word | Scheduling control word (stall cycles, yield, etc.) |
| +36 | 4 | u32 | flags_36 | Instruction flags (bits 19-21 = subtype, see below) |
| +40 | 8 | ptr | sched_slot | Scheduling state pointer |
| +48 | 8 | u64 | flag_bits | Extended flag bits (bit 5 = volatile, bit 27 = reuse) |
| +56 | 8 | ptr | def_instr | Defining instruction (for SSA def-use chains) |
| +64 | 8 | ptr | reserved_64 | Reserved / register class info |
| +72 | 4 | u32 | opcode | Full opcode word (lower 12 bits = base opcode, bits 12-13 = modifier) |
| +76 | 4 | u32 | opcode_aux | Auxiliary opcode data (sub-operation, comparison predicate) |
| +80 | 4 | u32 | operand_count | Total number of operands (destinations + sources) |
| +84 | var | u32[N*2] | operands[] | Packed operand array (8 bytes per operand slot) |
| +88 | 4 | u32 | operands[0].extra | High word of first operand slot |
| +100 | 1 | u8 | type_flags | Data type / modifier flags (bits 0-2 = data type code) |
| +104 | 4 | u32 | reserved_104 | Reserved |
| +112 | 8 | ptr | use_chain | Use chain linked list head (for CSE) |
| +120 | 8 | ptr | reserved_120 | Reserved |
| +136 | 4 | i32 | reserved_136 | Reserved |
| +160 | 8 | ptr | enc_buf | Encoding buffer pointer (populated during code generation) |
| +168 | 8 | ptr | reserved_168 | Reserved |
| +184 | 4 | u32 | enc_mode | Encoding mode selector |
| +200 | 8 | u64 | imm_value | Immediate value (for instructions with constant operands) |
| +208 | 16 | xmm | sched_params | Scheduling parameters (loaded via _mm_load_si128) |
| +240 | 4 | u32 | reserved_240 | Reserved |
| +244 | 1 | u8 | reserved_244 | Reserved |
| +248 | 8 | i64 | sentinel_248 | Initialized to -1 (0xFFFFFFFFFFFFFFFF) |
| +256 | 8 | i64 | sentinel_256 | Initialized to 0xFFFFFFFF |
| +264 | 8 | i64 | bb_ref | Basic block reference / block index storage |
| +272 | 8 | i64 | reserved_272 | Reserved |
| +280 | 16 | u128 | reserved_280 | Zeroed on creation |
Linked-List Pointers
Instructions form a doubly-linked list within each basic block. The Code Object stores the global list head at offset +272 and tail at offset +280:
Code Object +272 --> head instruction (prev = nullptr)
|
v (+8 = next)
instruction 2
|
v
instruction 3
|
v ...
Code Object +280 --> tail instruction (next = nullptr)
The linked-list traversal pattern appears in hundreds of functions throughout ptxas:
// Forward iteration over all instructions
for (instr = *(ptr*)(code_obj + 272); instr != nullptr; instr = *(ptr*)(instr + 8)) {
uint32_t opcode = *(uint32_t*)(instr + 72);
uint32_t num_ops = *(uint32_t*)(instr + 80);
// process instruction...
}
Opcode Encoding
The opcode field at offset +72 is a 32-bit word with a structured layout.
Opcode Word Format
31 16 15 14 13 12 11 0
+------------------+---+---+---+---+---------------+
| upper flags | | | M | M | base opcode |
+------------------+---+---+---+---+---------------+
^ ^
| bit 12: modifier bit 0
bit 13: modifier bit 1
M = modifier bits (stripped by the 0xFFFFCFFF mask)
base opcode = 12-bit instruction class identifier (0-4095)
The mask 0xFFFFCFFF (clear bits 12-13) is used throughout InstructionClassifier, MBarrierDetector, OperandLowering, and many other subsystems to extract the base instruction class, stripping sub-operation modifier bits:
uint32_t raw_opcode = *(uint32_t*)(instr + 72);
uint32_t base_opcode = raw_opcode & 0xFFFFCFFF;
Additionally, bit 11 is sometimes used in operand count calculations:
// Effective operand count adjustment (appears in 50+ functions)
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2; // 0 or 2
int dst_count = *(uint32_t*)(instr + 80) - adj;
Canonical Opcode Reference
The opcode value stored at instruction+72 is used directly as the index into the ROT13 name table at InstructionInfo+4184. There is a single numbering system -- the ROT13 table index IS the runtime opcode. This was verified by tracing sub_BEBAC0 (getName), which computes InstructionInfo + 4184 + 16 * opcode with no remapping.
The following table lists frequently-referenced opcodes from decompiled code, with their canonical SASS mnemonic names from the ROT13 table. Each opcode appears in 10+ decompiled functions reading *(instr+72).
| Base Opcode | SASS Mnemonic | Category | Reference Count |
|---|---|---|---|
| 0 | ERRBAR | Error barrier (internal) | Sentinel in scheduler |
| 1 | IMAD | Integer multiply-add | 100+ functions |
| 7 | ISETP | Integer set-predicate | sub_7E0030 switch |
| 18 | FSETP | FP set-predicate | sub_7E0030 switch |
| 19 | MOV | Move | 80+ functions |
| 23 | PLOP3 | Predicate 3-input logic | sub_7E0030 case 23 |
| 25 | NOP | No-op | Scheduling, peephole |
| 52 | AL2P_INDEXED | BB boundary pseudo-opcode | sub_6820B0, 100+ |
| 54 | BMOV_B | Barrier move (B) | sub_7E6090 case 54 |
| 61 | BAR | Barrier synchronization | Sync passes |
| 67 | BRA | Branch | sub_74ED70, CFG builders |
| 71 | CALL | Function call | sub_7B81D0, ABI, spill |
| 72 | RET | Return | sub_74ED70 (with 67) |
| 77 | EXIT | Exit thread | sub_7E4150, CFG sinks |
| 93 | OUT_FINAL | Tessellation output (final) | sub_734AD0, 25+ |
| 94 | LDS | Load shared | sub_7E0650 case 94 |
| 95 | STS | Store shared | sub_7E0030, 40+ |
| 96 | LDG | Load global | Memory analysis |
| 97 | STG | Store global | sub_6820B0, 30+ |
| 102 | ATOM | Atomic | Encoding switch |
| 104 | RED | Reduction | Encoding switch |
| 111 | MEMBAR | Memory barrier | Sync passes |
| 119 | SHFL | Warp shuffle | sub_7E0030 case 119 |
| 122 | DFMA | Double FP fused mul-add | sub_7E0030 case 122 |
| 130 | HSET2 | Half-precision set (packed) | 20+ functions |
| 135 | INTRINSIC | Compiler intrinsic (pseudo) | ISel, lowering |
| 137 | SM73_FIRST | SM gen boundary (real instr) | Strength reduction |
| 183 | sm_82+ opcode | Extended mem operation | & 0xFFFFCFFF mask |
Important caveats:
- Opcode 52 (AL2P_INDEXED in the name table) is universally used as a basic block delimiter in 100+ decompiled functions. The SASS mnemonic name may be vestigial; no decompiled code uses it for attribute-to-patch operations.
- SM boundary markers (136=SM70_LAST, 137=SM73_FIRST, etc.) have marker names in the ROT13 table but are valid runtime opcodes. Instructions with these opcode values exist in the IR and are processed by optimization passes (e.g., strength reduction operates on opcode 137).
- Earlier versions of this page had a "Selected Opcode Values" table that assigned incorrect SASS mnemonics based on behavioral inference rather than the ROT13 name table. Those labels (93=BRA/CALL, 95=EXIT, 97=CALL/label, 130=MOV) were wrong. The correct labels are: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2. Branch/call/exit are at 67=BRA, 71=CALL, 77=EXIT.
Opcode Ranges by SM Generation
The ROT13 opcode name table in sub_BE7390 (InstructionInfo constructor) includes explicit SM generation boundary markers:
| Marker Opcode | Decoded Name | Meaning |
|---|---|---|
| 136 | SM70_LAST | Last sm_70 (Volta) opcode |
| 137 | SM73_FIRST | First sm_73 (Volta+) opcode |
| 171 | SM73_LAST | Last sm_73 opcode |
| 172 | SM82_FIRST | First sm_82 (Ampere) opcode |
| 193 | SM82_LAST | Last sm_82 opcode |
| 194 | SM86_FIRST | First sm_86 (Ampere+) opcode |
| 199 | SM86_LAST | Last sm_86 opcode |
| 200 | SM89_FIRST | First sm_89 (Ada) opcode |
| 205 | SM89_LAST | Last sm_89 opcode |
| 206 | SM90_FIRST | First sm_90 (Hopper) opcode |
| 252 | SM90_LAST | Last sm_90 opcode |
| 253 | SM100_FIRST | First sm_100 (Blackwell) opcode |
| 280 | SM100_LAST | Last sm_100 opcode |
| 281 | SM104_FIRST | First sm_104 (Blackwell Ultra) opcode |
| 320 | SM104_LAST | Last sm_104 opcode |
| 321 | LAST | Sentinel (end of table) |
This gives a clear partitioning: opcodes 0-136 are the base sm_70+ ISA, 137-171 extend to sm_73, and so on up through sm_104. Each SM generation only adds opcodes; no base opcodes are removed.
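The boundary markers reduce generation checks to simple range comparisons. A minimal classifier sketch under the table above (the function name and return strings are ours, not from the binary; ptxas compares against these constants inline):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical helper: maps an opcode index to the SM generation that
   introduced it, using the boundary markers recovered from sub_BE7390. */
static const char *opcode_sm_generation(uint32_t opcode)
{
    if (opcode <= 136) return "sm_70";   /* base ISA, up to SM70_LAST */
    if (opcode <= 171) return "sm_73";   /* 137..171                  */
    if (opcode <= 193) return "sm_82";   /* 172..193                  */
    if (opcode <= 199) return "sm_86";   /* 194..199                  */
    if (opcode <= 205) return "sm_89";   /* 200..205                  */
    if (opcode <= 252) return "sm_90";   /* 206..252                  */
    if (opcode <= 280) return "sm_100";  /* 253..280                  */
    if (opcode <= 320) return "sm_104";  /* 281..320                  */
    return "invalid";                    /* 321 is the LAST sentinel  */
}
```

Because each generation only appends opcodes, the first matching upper bound identifies the generation.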
Operand Model
Packed Operand Encoding
Each operand occupies 8 bytes (two 32-bit words) in the operand array starting at instruction offset +84. The first word carries the type, modifier bits, and index. The second word carries additional data (extended flags, immediate bits, etc.).
Word 0 (at instr + 84 + 8*i):
31 30 29 28 27 26 25 24 23 22 21 20 19 0
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
| S | type(3) | modifier (8 bits) | index (20 bits) |
+---+---+---+---+---+---+---+---+---+---+---+---+---------------------+
  bit 31:     sign/negative flag (S)
  bits 28-30: operand type
  bits 20-27: modifier
  bits 0-19:  register/symbol index
Word 1 (at instr + 88 + 8*i):
31 0
+--------------------------------------------------------------------+
| extended data / immediate bits / flags |
+--------------------------------------------------------------------+
Operand Type Field (bits 28-30)
| Value | Type | Index Meaning |
|---|---|---|
| 0 | Unused / padding | — |
| 1 | Register | Index into *(code_obj+88) + 8*index register descriptor array |
| 2 | Predicate register | Index into predicate register file |
| 3 | Uniform register | UR file index |
| 4 | Address/offset | Memory offset value |
| 5 | Symbol/constant | Index into *(code_obj+152) symbol table |
| 6 | Predicate guard | Guard predicate controlling conditional execution |
| 7 | Immediate | Encoded immediate value |
Operand Extraction Pattern
This exact extraction pattern appears in 50+ functions across scheduling, regalloc, encoding, and optimization passes:
uint32_t operand_word = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand_word >> 28) & 7; // bits 28-30
int index = operand_word & 0xFFFFF; // bits 0-19 (also seen as 0xFFFFFF)
int mods = (operand_word >> 20) & 0xFF; // bits 20-27
bool is_neg = (operand_word >> 31) & 1; // bit 31
// Register operand check (most common pattern)
if (type == 1) {
reg_descriptor = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * index);
reg_file_type = *(uint32_t*)(reg_descriptor + 64);
reg_number = *(uint32_t*)(reg_descriptor + 12);
}
Some functions use a 24-bit index mask (& 0xFFFFFF) instead of 20-bit, packing additional modifier bits into the upper nibble of the index field.
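The word-0 layout round-trips cleanly. A sketch of a packer plus the extraction masks above (the packer is our own construction; only the extraction side appears in the binary):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative packer for the recovered word-0 layout (20-bit index form). */
static uint32_t pack_operand(bool neg, uint32_t type, uint32_t mods, uint32_t index)
{
    return ((uint32_t)neg << 31) | ((type & 7u) << 28)
         | ((mods & 0xFFu) << 20) | (index & 0xFFFFFu);
}

/* Extraction side, matching the masks seen in 50+ functions. */
static uint32_t op_type(uint32_t w)  { return (w >> 28) & 7u; }
static uint32_t op_index(uint32_t w) { return w & 0xFFFFFu; }
static uint32_t op_mods(uint32_t w)  { return (w >> 20) & 0xFFu; }
static bool     op_neg(uint32_t w)   { return (w >> 31) & 1u; }
```

Note that `op_type` masks with 7 after the shift, so bit 31 (the sign flag) never leaks into the type field.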
Operand Classification Predicates
Small predicate functions at 0xB28E00-0xB28E90 provide the instruction selection interface for operand queries:
| Address | Function | Logic |
|---|---|---|
| sub_B28E00 | getRegClass | Returns register class; 1023 = wildcard, 1 = GPR |
| sub_B28E10 | isRegOperand | ((word >> 28) & 7) == 1 |
| sub_B28E20 | isPredOperand | ((word >> 28) & 7) == 2 |
| sub_B28E40 | isImmOperand | ((word >> 28) & 7) == 7 |
| sub_B28E80 | isConstOperand | ((word >> 28) & 7) == 5 |
| sub_B28E90 | isUReg | ((word >> 28) & 7) == 3 |
Destination vs. Source Operand Split
Destinations come first in the operand array, followed by sources. The boundary is computed from the operand_count field and the modifier bits in the opcode:
uint32_t total_ops = *(uint32_t*)(instr + 80);
int adj = (*(uint32_t*)(instr + 72) >> 11) & 2; // 0 or 2
int first_src_index = total_ops - adj; // or total_ops + ~adj + 1
// Destinations: operands[0 .. first_src_index-1]
// Sources: operands[first_src_index .. total_ops-1]
For most instructions, adj = 0 and the split point equals operand_count. Instructions with bit 11 set in the opcode word shift the split by 2, indicating 2 extra destination operands (e.g., predicated compare-and-swap operations that write both a result register and a predicate).
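The decompiler emits the split point in two equivalent forms, `total_ops - adj` and `total_ops + ~adj + 1`; these are identical because two's-complement negation is `~adj + 1`. A quick demonstration (function names are ours):

```c
#include <stdint.h>

/* Both spellings of the destination/source split point, as seen in the
   decompiled output. With unsigned arithmetic they always agree. */
static uint32_t split_sub(uint32_t total_ops, uint32_t adj) { return total_ops - adj; }
static uint32_t split_not(uint32_t total_ops, uint32_t adj) { return total_ops + ~adj + 1; }
```

Recognizing this identity helps when matching the pattern across functions, since Hex-Rays does not normalize the two forms.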
Predicate Guard Operand
The last operand (at index operand_count - 1) can be a predicate guard (type 6) controlling conditional execution. The guard predicate check in sub_7E0E80:
bool has_pred_guard(char *instr) {
    int last_idx = *(uint32_t*)(instr + 80) + ~((*(uint32_t*)(instr + 72) >> 11) & 2);
    uint32_t last_op = *(uint32_t*)(instr + 84 + 8 * last_idx);
    return ((last_op & 0xF) - 2) < 7; // unsigned range check on type bits in low nibble
}
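The `((last_op & 0xF) - 2) < 7` expression is the classic compiled form of a range test: since `last_op` is unsigned, subtracting 2 wraps values below 2 around to huge numbers, so the single comparison is true exactly for values in [2..8]. A sketch isolating the trick (the helper name is ours):

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned-wrap range check, as emitted by the compiler for 2 <= t <= 8.
   Values below 2 wrap to near UINT32_MAX and fail the comparison. */
static bool in_guard_type_range(uint32_t t)
{
    return (t - 2u) < 7u;   /* true exactly for t in [2..8] */
}
```

Type 6 (predicate guard) falls inside this range, which is why the check accepts guarded instructions.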
Instruction Flags and Modifiers
Opcode Modifier Bits (offset +72, bits 12-13)
Bits 12-13 of the opcode word encode sub-operation modifiers. The 0xFFFFCFFF mask strips them to yield the base opcode. Common uses:
| Modifier | Meaning |
|---|---|
| 0 | Default operation |
| 1 | .HI or alternate form |
| 2 | .WIDE or extended form |
| 3 | Reserved / architecture-specific |
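The 0xFFFFCFFF mask and its inverse shift appear throughout the binary. A minimal sketch of the two helpers (names are ours; the binary inlines the masks):

```c
#include <stdint.h>

/* Bits 12-13 of the opcode word carry the sub-operation modifier.
   The 0xFFFFCFFF mask, seen in dozens of functions, strips them
   to recover the base opcode. */
static uint32_t base_opcode(uint32_t opcode_word) { return opcode_word & 0xFFFFCFFFu; }
static uint32_t sub_op(uint32_t opcode_word)      { return (opcode_word >> 12) & 3u; }
```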
Extended Flag Bits (offset +48)
The 64-bit flag word at offset +48 accumulates flags throughout the compilation pipeline:
| Bit | Hex Mask | Flag | Set By |
|---|---|---|---|
| 6 | 0x40 | Live-out | sub_7E6090 (def-use builder) |
| 16 | 0x10000 | Has single def | sub_7E6090 |
| 25 | 0x2000000 | Has prior use | sub_7E6090 |
| 27 | 0x8000000 | Same-block def | sub_7E6090 |
| 33 | 0x200000000 | Source-only ref | sub_7E6090 |
Control Word (offset +32)
The control word encodes scheduling metadata added by the instruction scheduler. It is initialized to zero and populated during scheduling (phases ~150+):
- Stall cycles (how many cycles to wait before issuing the next instruction)
- Yield hint (whether the warp scheduler should yield after this instruction)
- Dependency barrier assignments
- Reuse flags (register reuse hints for the hardware register file cache)
The stall cycle field is checked during scoreboard computation at sub_A08910. The control word format is the same as the SASS encoding control field.
Data Type Flags (offset +100)
The byte at offset +100 encodes the instruction's data type in its low 3 bits:
uint8_t type_code = *(uint8_t*)(instr + 100) & 7;
These correspond to SASS data type suffixes (.F32, .F64, .U32, .S32, .F16, .B32, etc.). The exact encoding is architecture-specific and queried through the InstructionInfo descriptor table.
ROT13 Opcode Name Table
All SASS opcode mnemonic strings in the binary are ROT13-encoded. This is lightweight obfuscation, not a security measure. The InstructionInfo constructor at sub_BE7390 populates a name table at object offset +4184 with 16-byte {char* name, uint64_t length} entries.
Table Structure
InstructionInfo object:
+0 vtable pointer (off_233ADC0)
+8 parent pointer
...
+4184 opcode_names[0].name_ptr -> "REEONE" (ROT13 of ERRBAR)
+4192 opcode_names[0].length -> 6
+4200 opcode_names[1].name_ptr -> "VZNQ" (ROT13 of IMAD)
+4208 opcode_names[1].length -> 4
...
+9320 opcode_names[321].name_ptr -> "YNFG" (ROT13 of LAST)
+9328 opcode_names[321].length -> 4
+9336 encoding_category_map[0..321] (322 x int32, from unk_22B2320)
+10624 (end of encoding category map)
Total: 322 named opcodes (indices 0-321). The 0x508 bytes at +9336 are not additional name entries -- they are a 322-element int32 array mapping each opcode index to an encoding category number (see Encoding Category Map below).
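Since ROT13 is an involution over A-Z, decoding the name table is a single pass; digits and underscores (e.g. `PF2E_32`) pass through unchanged. A decoder sketch (ours; the binary stores only the encoded strings):

```c
#include <string.h>

/* Decode a ROT13-obfuscated mnemonic. Returns a pointer to a static
   buffer for convenience in one-off lookups; not thread-safe. */
static const char *rot13_str(const char *s)
{
    static char buf[64];
    size_t i;
    for (i = 0; s[i] && i < sizeof buf - 1; ++i) {
        char c = s[i];
        if (c >= 'A' && c <= 'Z')      c = 'A' + (c - 'A' + 13) % 26;
        else if (c >= 'a' && c <= 'z') c = 'a' + (c - 'a' + 13) % 26;
        buf[i] = c;   /* digits and '_' pass through */
    }
    buf[i] = '\0';
    return buf;
}
```

Running the decoder over the 322 entries reproduces the table below.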
Full Decoded Opcode Table (Base ISA, sm_70+)
| Idx | ROT13 | SASS | Category |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal) |
| 1 | VZNQ | IMAD | Integer multiply-add |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add wide |
| 3 | VNQQ3 | IADD3 | 3-input integer add |
| 4 | OZFX | BMSK | Bit mask |
| 5 | FTKG | SGXT | Sign extend |
| 6 | YBC3 | LOP3 | 3-input logic |
| 7 | VFRGC | ISETP | Integer set-predicate |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address |
| 10 | FUS | SHF | Funnel shift |
| 11 | SSZN | FFMA | FP fused multiply-add |
| 12 | SNQQ | FADD | FP add |
| 13 | SZHY | FMUL | FP multiply |
| 14 | SZAZK | FMNMX | FP min/max |
| 15 | SFJMNQQ | FSWZADD | FP swizzle add |
| 16 | SFRG | FSET | FP set |
| 17 | SFRY | FSEL | FP select |
| 18 | SFRGC | FSETP | FP set-predicate |
| 19 | ZBI | MOV | Move |
| 20 | FRY | SEL | Select |
| 21 | C2E | P2R | Predicate to register |
| 22 | E2C | R2P | Register to predicate |
| 23 | CYBC3 | PLOP3 | Predicate 3-input logic |
| 24 | CEZG | PRMT | Byte permute |
| 25 | ABC | NOP | No-op |
| 26 | IBGR | VOTE | Warp vote |
| 27 | PF2E_32 | CS2R_32 | Control/status to register (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status to register (64-bit) |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | CFZGRFG | PSMTEST | PSM test |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference (4-way) |
| 33 | VQC | IDP | Integer dot product |
| 34 | VQR | IDE | Integer dot expand |
| 35 | V2V | I2I | Integer to integer conversion |
| 36 | V2VC | I2IP | Integer to integer (packed) |
| 37 | VZAZK | IMNMX | Integer min/max |
| 38 | CBCP | POPC | Population count |
| 39 | SYB | FLO | Find leading one |
| 40 | SPUX | FCHK | FP check (NaN/Inf) |
| 41 | VCN | IPA | Interpolate attribute |
| 42 | ZHSH | MUFU | Multi-function unit (SFU) |
| 43 | S2S | F2F | Float to float conversion |
| 44 | S2S_K | F2F_X | Float to float (extended) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer (extended) |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float (extended) |
| 49 | SEAQ | FRND | FP round |
| 50 | SEAQ_K | FRND_X | FP round (extended) |
| 51 | NY2C | AL2P | Attribute to patch |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch (indexed) |
| 53 | OERI | BREV | Bit reverse |
| 54 | OZBI_O | BMOV_B | Barrier move (B) |
| 55 | OZBI_E | BMOV_R | Barrier move (R) |
| 56 | OZBI | BMOV | Barrier move |
| 57 | F2E | S2R | Special register to register |
| 58 | O2E | B2R | Barrier to register |
| 59 | E2O | R2B | Register to barrier |
| 60 | YRCP | LEPC | Load effective PC |
| 61 | ONE | BAR | Barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier (indexed) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base |
| 66 | QRCONE | DEPBAR | Dependency barrier |
| 67 | OEN | BRA | Branch |
| 68 | OEK | BRX | Branch indirect |
| 69 | WZC | JMP | Jump |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return |
| 73 | OFFL | BSSY | Branch sync stack push |
| 74 | OERNX | BREAK | Break |
| 75 | OCG | BPT | Breakpoint trap |
| 76 | XVYY | KILL | Kill thread |
| 77 | RKVG | EXIT | Exit |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync |
| 80 | ZNGPU | MATCH | Warp match |
| 81 | ANABFYRRC | NANOSLEEP | Nanosleep |
| 82 | ANABGENC | NANOTRAP | Nano trap |
| 83 | GRK | TEX | Texture fetch |
| 84 | GYQ | TLD | Texture load |
| 85 | GYQ4 | TLD4 | Texture load 4 |
| 86 | GZZY | TMML | Texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with derivatives |
| 88 | GKD | TXQ | Texture query |
| 89 | YQP | LDC | Load constant |
| 90 | NYQ | ALD | Attribute load |
| 91 | NFG | AST | Attribute store |
| 92 | BHG | OUT | Tessellation output |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output (final) |
| 94 | YQF | LDS | Load shared |
| 95 | FGF | STS | Store shared |
| 96 | YQT | LDG | Load global |
| 97 | FGT | STG | Store global |
| 98 | YQY | LDL | Load local |
| 99 | FGY | STL | Store local |
| 100 | YQ | LD | Load (generic) |
| 101 | FG | ST | Store (generic) |
| 102 | NGBZ | ATOM | Atomic |
| 103 | NGBZT | ATOMG | Atomic global |
| 104 | ERQ | RED | Reduction |
| 105 | NGBZF | ATOMS | Atomic shared |
| 106 | DFCP | QSPC | Query space |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control (no scoreboard) |
| 108 | PPGY | CCTL | Cache control |
| 109 | PPGYY | CCTLL | Cache control (L2) |
| 110 | PPGYG | CCTLT | Cache control (texture) |
| 111 | ZRZONE | MEMBAR | Memory barrier |
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |
| 116 | CVKYQ | PIXLD | Pixel load |
| 117 | VFOREQ | ISBERD | Indexed set binding for redirect |
| 118 | VFORJE | ISBEWR | Indexed set binding for write |
| 119 | FUSY | SHFL | Warp shuffle |
| 120 | JNECFLAP | WARPSYNC | Warp synchronize |
| 121 | ZVRYQ | MYELD | Yield (internal) |
| 122 | QSZN | DFMA | Double FP fused multiply-add |
| 123 | QNQQ | DADD | Double FP add |
| 124 | QZHY | DMUL | Double FP multiply |
| 125 | QFRGC | DSETP | Double FP set-predicate |
| 126 | UNQQ2 | HADD2 | Half-precision add (packed) |
| 127 | UNQQ2_S32 | HADD2_F32 | Half-precision add (F32 accum) |
| 128 | USZN2 | HFMA2 | Half FP fused multiply-add (packed) |
| 129 | UZHY2 | HMUL2 | Half-precision multiply (packed) |
| 130 | UFRG2 | HSET2 | Half-precision set (packed) |
| 131 | UFRGC2 | HSETP2 | Half-precision set-predicate (packed) |
| 132 | UZZN_16 | HMMA_16 | Half MMA (16-wide) |
| 133 | UZZN_32 | HMMA_32 | Half MMA (32-wide) |
| 134 | VZZN | IMMA | Integer MMA |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo) |
Opcode Categories
The ~400 opcodes group into these functional categories:
Integer ALU (18 opcodes): IMAD, IMAD_WIDE, IADD3, IADD, IMNMX, IABS, BMSK, SGXT, LOP3, ISETP, LEA, SHF, POPC, FLO, BREV, IDP, IDE, PRMT
FP32 ALU (9 opcodes): FFMA, FADD, FMUL, FMNMX, FSWZADD, FSET, FSEL, FSETP, FCHK
FP64 ALU (4 opcodes): DFMA, DADD, DMUL, DSETP
FP16 Packed (6 opcodes): HADD2, HADD2_F32, HFMA2, HMUL2, HSET2, HSETP2
Conversion (12 opcodes): F2F, F2I, I2F, I2I, F2FP, F2IP, I2FP, I2IP, FRND, and their _X extended variants
Data Movement (6 opcodes): MOV, UMOV, MOVM, SEL, USEL, PRMT
Special Function (1 opcode): MUFU (sin, cos, rsqrt, rcp, etc.)
Predicate (4 opcodes): PLOP3, P2R, R2P, VOTE
Memory -- Global (4 opcodes): LDG, STG, LD, ST
Memory -- Shared (4 opcodes): LDS, STS, LDSM, STSM
Memory -- Local (2 opcodes): LDL, STL
Memory -- Constant (2 opcodes): LDC, LDCU
Atomic/Reduction (6 opcodes): ATOM, ATOMG, ATOMS, RED, REDUX, REDAS
Texture (6 opcodes): TEX, TLD, TLD4, TMML, TXD, TXQ
Surface (4 opcodes): SULD, SUST, SUATOM, SURED
Control Flow (12 opcodes): BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, KILL, BPT
Synchronization (6 opcodes): BAR, BAR_INDEXED, DEPBAR, MEMBAR, WARPSYNC, NANOSLEEP
Tensor Core / MMA (25+ opcodes): HMMA_*, IMMA_*, BMMA_*, DMMA, GMMA, QMMA_*, OMMA_*, and their sparse (_SP_) variants
Uniform Register (30+ opcodes): All U-prefixed variants (UIMAD, UIADD3, UMOV, USEL, ULOP3, ULEPC, etc.) that operate on uniform registers shared across the warp
Blackwell sm_100+ (28 opcodes): ACQBLK, CGABAR_*, CREATEPOLICY, ELECT, ENDCOLLECTIVE, FENCE_G/S/T, LDTM, STTM, MEMSET, ACQSHMINIT, UTCBAR_*, UTCMMA_*, UTCSHIFT_*, UTCCP_*, TCATOMSWS, TCLDSWS, TCSTSWS, VIRTCOUNT, UGETNEXTWORKID, FADD2, FFMA2, FMUL2, FMNMX3, CREDUX, QFMA4, QADD4, QMUL4, WARPGROUP
Instruction Descriptor Table
The InstructionInfo class at sub_BE7390 (inheriting from the base class at sub_738E20) provides a per-opcode descriptor table consulted by every pass in the compiler. The derived constructor calls the base class constructor sub_738E20, then populates the ROT13 name table, allocates the per-opcode descriptor block, and queries SM-specific configuration knobs. The resulting object is ~11,240 bytes inline plus a 10,288-byte dynamically allocated descriptor block.
Construction Sequence
sub_BE7390(this, parent_context) executes in this order:
- Base class init (`sub_738E20`): sets vtable, stores parent pointer, allocates the opcode-to-descriptor mapping array (512 bytes, 64 QWORD slots), zeroes all four descriptor data areas (+744..+3624), queries SM version and stores it at +3728, allocates the per-opcode property array (`4 * sm_opcode_count` bytes at +4112), allocates a reference-counted descriptor block (24 bytes at +4136), and queries knobs 812/867/822/493 for configuration. Sets `+4132 = 8` and `+4176 = 0` (init incomplete).
- Override vtable: `+0 = off_233ADC0` (derived vtable).
- Populate ROT13 name table: 322 inline entries (indices 0-321) at offsets +4184..+9328, each 16 bytes (`{char* name_ptr, u64 length}`).
- Bulk-copy encoding category map: `qmemcpy(+9336, unk_22B2320, 0x508)` -- a 322-entry `int32` array (1288 bytes) mapping opcode index to encoding category number. The source table varies by arch constructor (see below).
- Initialize post-table fields: zero offsets +10624..+10680.
- Store sentinels: `+11200 = -2`, `+11224 = 0xFFFFFFFF`.
- Set constants: `+4048 = 2`, `+4056 = 10`, `+3733 = 1`.
- Descriptor defaults (`sub_1370BD0`): populates scheduling templates and operand defaults at +192..+704.
- Override property mode: `+4132 = 7` (overwriting the base class's 8).
- Allocate descriptor block: 10,288 bytes via the MemoryManager, partitioned into 3 sections.
- Query SM-specific config: reads `parent->+1664->+72->+55080` and stores the result at +10648.
InstructionInfo Object Layout
The complete byte-level field map, derived from sub_BE7390 (derived constructor), sub_738E20 (base constructor), and sub_1370BD0 (descriptor defaults init).
Region 1: Vtable, Parent, and Core Identity (+0 to +91)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | off_233ADC0 (derived); base chain: off_21DB6E8 / off_21B4790 |
| +8 | 8 | ptr | parent_ctx | Parent compilation context pointer |
| +44 | 8 | u64 | operand_counts | Packed pair 0x100000001: lo=1 dst, hi=1 src (base default) |
Region 2: Scheduling Defaults and Flags (+92 to +159)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +92 | 16 | xmm | sched_defaults | Scheduling parameter defaults (loaded from xmmword_2029FE0) |
| +108 | 4 | i32 | desc_idx_a | Descriptor index sentinel = 0 |
| +112 | 4 | i32 | desc_idx_b | Descriptor index sentinel = -1 (0xFFFFFFFF) |
| +116 | 1 | u8 | flag_116 | = 0 |
| +117 | 1 | u8 | flag_117 | = 0 |
| +118 | 1 | u8 | flag_118 | = 1 |
| +120 | 3 | u8[3] | flags_120 | All = 0 |
| +136 | 4 | i32 | sentinel_136 | = -1 (0xFFFFFFFF) |
| +148 | 8 | u64 | reserved_148 | = 0 |
Region 3: Opcode-to-Descriptor Mapping (+160 to +191)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +160 | 8 | ptr | mapping_allocator | MemoryManager used for mapping array |
| +168 | 8 | ptr | mapping_array | Dynamically allocated QWORD array (initial: 512 bytes, 64 entries) |
| +176 | 4 | i32 | mapping_count | Current entry count (initially 63) |
| +180 | 4 | i32 | mapping_capacity | Current capacity (initially 64) |
| +184 | 8 | u64 | packed_flags | = 0x4000000000 (bit 38: descriptor config flag) |
Region 4: Descriptor Defaults (+192 to +704, set by sub_1370BD0)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +192 | 8 | u64 | default_operand_cfg | Packed 0x200000002: lo=2, hi=2 |
| +200 | 4 | u32 | default_dst_count | = 4 |
| +208 | 4 | u32 | default_modifier | = 2 |
| +216 | 16 | xmm | sched_template_a | Scheduling template (from xmmword_233B1E0) |
| +240 | 4 | u32 | default_operand_w | = 4 |
| +448 | 8 | u64 | section_marker_448 | = 1 |
| +456 | 4 | u32 | section_id_456 | = 2 |
| +464 | 4 | u32 | section_id_464 | = 3 |
| +472 | 16 | xmm | sched_template_b | Scheduling template (from xmmword_233B1F0) |
| +496 | 4 | u32 | default_value_496 | = 5 |
Gaps within +204..+447 and +500..+695 are zero-initialized by sub_1370BD0.
Region 5: Primary Descriptor Data (+744 to +2155)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +744 | 8 | u64 | desc_data_start | Primary area header = 0 |
| +752..+2155 | 1404 | u8[] | desc_data | Zero-initialized per-opcode descriptor records |
Region 6: Secondary Descriptor Area (+2156 to +2211)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +2156 | 8 | u64 | secondary_header | = 0 |
| +2164..+2211 | 48 | u8[] | secondary_data | Zero-initialized |
Region 7: Tertiary Descriptor Area (+2212 to +3623)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +2212 | 8 | u64 | tertiary_header | = 0 |
| +2220..+3623 | 1404 | u8[] | tertiary_data | Zero-initialized |
| +2372 | 4 | u32 | desc_record_type_a | = 4 (set by derived constructor) |
| +2400 | 4 | u32 | desc_record_type_b | = 4 (set by derived constructor) |
Region 8: Quaternary Descriptor Area and Target Config (+3624 to +3735)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +3624 | 8 | u64 | quaternary_header | = 0 |
| +3640..+3664 | 32 | u64[4] | quat_ptrs | All = 0 |
| +3672 | 1 | u8 | is_sm75_plus | = 1 if SM ID >= 16389, else 0 |
| +3673 | 1 | u8 | target_flag_bit6 | Bit 6 of *(target+1080) |
| +3674 | 1 | u8 | target_flag_bit7 | Bit 7 of *(target+1080) |
| +3675..+3682 | 8 | u8[8] | zero_pad | All = 0 |
| +3684 | 32 | u128[2] | zero_pad_3684 | = 0 |
| +3716..+3717 | 2 | u8[2] | flags_3716 | = 0 |
| +3720 | 4 | u32 | value_3720 | = 0 |
| +3724 | 1 | u8 | flag_3724 | = 1 |
| +3725 | 1 | u8 | flag_3725 | = 0 |
| +3728 | 4 | u32 | sm_opcode_count | SM version / total opcode count from arch query |
| +3732 | 1 | u8 | knob_812_flag | Knob 812 derived flag |
| +3733 | 1 | u8 | derived_flag | = 1 (set by derived constructor; base leaves at 0) |
Region 9: Scheduling Configuration (+4016 to +4111)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4016 | 16 | u128 | sched_config_a | = 0 |
| +4032 | 8 | u64 | sched_config_b | = 0 |
| +4040 | 16 | xmm | sched_constants | Loaded from xmmword_21B4EE0 |
| +4048 | 4 | u32 | constant_2 | = 2 (derived overrides base default 0) |
| +4056 | 4 | u32 | constant_10 | = 10 (derived overrides base default 0x7FFFFFFF) |
| +4060..+4064 | 8 | u32[2] | zero_pad | = 0 |
| +4072 | 8 | u64 | sched_ptr | = 0 |
| +4080 | 8 | u64 | sched_ext | = 0 |
| +4088 | 1 | u8 | flag_4088 | = 0 |
| +4089 | 1 | u8 | knob_867_flag | = 1 if knob absent; = (knob_value == 1) otherwise |
| +4090 | 1 | u8 | flag_4090 | = 0 |
| +4092 | 4 | u32 | knob_822_value | Default 7; overridden by knob 822 |
| +4096 | 4 | u32 | knob_493_value | Default 5; overridden by knob 493 |
Region 10: Per-Opcode Property Array (+4112 to +4183)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4112 | 8 | ptr | property_array | Allocated: 4 * sm_opcode_count bytes; 4 bytes per opcode |
| +4120 | 4 | u32 | property_count | = 4 * !hasExtendedPredicates (0 or 4) |
| +4124 | 4 | u32 | property_aux | = 0 |
| +4128 | 1 | u8 | property_init_flag | = 1 |
| +4132 | 4 | u32 | property_mode | Base sets 8, derived overwrites to 7 |
| +4136 | 8 | ptr | ref_counted_block | 24-byte block: [refcount=2, data=0, allocator_ptr] |
| +4144..+4160 | 24 | u64[3] | rc_aux | All = 0 |
| +4176 | 1 | u8 | init_complete | = 0 initially; set to 1 after full initialization |
Region 11: ROT13 Opcode Name Table (+4184 to +10623)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +4184 | 5152 | struct[322] | opcode_names[0..321] | 322 inline entries, each 16 bytes: {char* name, u64 len} |
| +9336 | 1288 | int32[322] | encoding_category_map[0..321] | Per-opcode encoding category; bulk-copied from arch-specific static table (see below) |
Total: 322 named opcodes. Index N name is at offset 4184 + 16*N. The getName accessor at sub_BEBAC0 computes this + 4184 + 16 * opcode directly. Encoding category for opcode N is at +9336 + 4*N.
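The fixed 16-byte stride makes the accessor arithmetic trivial to reproduce. A sketch of the offset computations used by `sub_BEBAC0` and the category lookup (the struct and helper names are our reconstructions):

```c
#include <stdint.h>

/* Recovered 16-byte name-table entry layout at +4184. */
typedef struct { const char *name; uint64_t len; } NameEntry;

/* Byte offsets within the InstructionInfo object, per sub_BEBAC0. */
static uint64_t name_entry_offset(uint32_t opcode) { return 4184u + 16u * (uint64_t)opcode; }
static uint64_t category_offset(uint32_t opcode)   { return 9336u + 4u * (uint64_t)opcode; }
```

Offsets at the table edges line up with the layout above: entry 0 at +4184, entry 321 at +9320, and category 321 ending the map just before +10624.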
Encoding Category Map
The 1288-byte block at +9336 is a 322-element int32 array that maps each opcode index to an encoding category number. The SASS mnemonic lookup function (sub_1377C60) uses this to resolve a (mnemonic, arch) pair to a binary encoding format descriptor.
Arch-specific source tables:
| Constructor | Source Table | Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity map: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped: some entries differ from identity |
| sub_BE7390 | unk_22B2320 | Arch-remapped: some entries differ from identity |
The base constructor uses a pure identity map where opcode N maps to encoding category N. Arch-specific constructors override selected entries so the same mnemonic at different opcode indices can map to different encoding formats. For example, DMMA at opcode index 180 maps to encoding category 434 on one arch, while DMMA at opcode index 215 maps to encoding category 515 on another.
Reader: sub_1377C60 (SASS mnemonic lookup)
// After matching mnemonic string v11 to opcode index v18 via ROT13 comparison:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336); // encoding_category_map[v18]
// v84 is then FNV-1a hashed together with arch discriminator v16,
// and looked up in the hash table at *(a1 + 10672) to find the
// encoding format descriptor for this (category, arch) pair.
The hash table at +10672 stores entries of the form {encoding_category, arch_code, format_value}, keyed by FNV-1a of (encoding_category, arch_discriminator). This is the central mechanism that maps a SASS mnemonic string plus target architecture to the correct binary encoding format.
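The exact byte order and seed that ptxas feeds into the hash for the (category, arch) key has not been fully recovered; as a sketch of the mechanism, a standard 32-bit FNV-1a over an arbitrary byte span:

```c
#include <stdint.h>
#include <stddef.h>

/* Generic 32-bit FNV-1a. Treat this as an illustration of the hashing
   scheme sub_1377C60 uses, not the binary's literal key construction. */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;            /* FNV offset basis */
    while (len--) {
        h ^= *p++;
        h *= 0x01000193u;                /* FNV prime */
    }
    return h;
}
```

A plausible key would be the concatenated `(encoding_category, arch_discriminator)` bytes fed through this routine, with the result used to index the table at +10672.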
Region 12: Descriptor Block Control (+10624 to +10687)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +10624 | 8 | u64 | block_ctrl_a | = 0 |
| +10632 | 8 | u64 | block_ctrl_b | = 0 |
| +10648 | 4 | u32 | arch_config | SM-specific config from target+55080/55088 |
| +10656 | 8 | ptr | descriptor_block | Pointer to allocated 10,288-byte per-opcode descriptor block |
| +10664 | 8 | ptr | block_allocator | MemoryManager that allocated the descriptor block |
| +10672 | 8 | ptr | encoding_lookup_table | Hash table for (encoding_category, arch) -> format descriptor lookup; read by sub_1377C60 |
| +10680 | 8 | u64 | block_aux_b | = 0 |
Region 13: Sentinels and Architecture Handler (+11200 to +11240)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +11200 | 4 | i32 | sentinel | = -2 (0xFFFFFFFE) |
| +11208 | 8 | ptr | arch_handler | = parent_ctx->+16 (MemoryManager) |
| +11216 | 8 | u64 | zero_11216 | = 0 |
| +11224 | 8 | u64 | sentinel_11224 | = 0xFFFFFFFF |
| +11232 | 1 | u8 | flag_11232 | = 0 |
| +11236 | 4 | u32 | zero_11236 | = 0 |
Per-Opcode Descriptor Block (10,288 bytes)
Allocated by the derived constructor and stored at +10656. The block is 10288 / 8 = 1286 QWORD entries, partitioned into three sections:
+--------------------+ block + 0
| Section 0 header | QWORD[0] = 0
+--------------------+ block + 8
| Section 0 payload | QWORD[1..640] = all zero (memset)
| (640 slots) | Per-opcode descriptors for opcodes 0..639
+--------------------+ block + 5128
| Section 1 header | QWORD[641] = 0
+--------------------+ block + 5136
| Section 1 payload | QWORD[642..1283] (NOT explicitly zeroed)
| (642 slots) | Modifier-variant descriptors (opcode | 0x1000, etc.)
+--------------------+ block + 10272
| Section 2 (16B) | QWORD[1284] = parent_ctx (back-pointer)
| | QWORD[1285] = instr_info (self back-pointer)
+--------------------+ block + 10288
Section 0 (5,128 bytes): 641 QWORD slots. Only the payload (slots 1..640, 5,120 bytes) is explicitly zeroed. Each slot corresponds to a base opcode index. With 402 named opcodes, ~240 slots remain spare.
Section 1 (5,144 bytes): 643 QWORD slots. The header is zeroed but the payload is NOT explicitly zeroed -- it relies on the arena allocator's default behavior or lazy initialization during opcode registration. Likely stores modifier-variant descriptors (e.g., entries for opcode | 0x1000 when bits 12-13 carry sub-operation modifiers).
Section 2 (16 bytes): Two back-pointers for navigating from the descriptor block back to its owning objects (parent compilation context and the InstructionInfo instance).
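The section boundaries in the diagram follow directly from the QWORD counts. Reproducing the arithmetic as named constants (the constant names are ours):

```c
#include <stdint.h>

/* Section boundaries of the 10,288-byte per-opcode descriptor block. */
enum {
    SEC0_QWORDS = 641,                      /* header + 640 payload slots */
    SEC1_QWORDS = 643,                      /* header + 642 payload slots */
    SEC2_BYTES  = 16,                       /* two back-pointers          */
    SEC0_BYTES  = SEC0_QWORDS * 8,          /* 5128                       */
    SEC1_BYTES  = SEC1_QWORDS * 8,          /* 5144                       */
    BLOCK_BYTES = SEC0_BYTES + SEC1_BYTES + SEC2_BYTES  /* 10288 total    */
};
```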
Architecture-Specific Sub-Tables (sub_896D50, 26,888 bytes)
The architecture-specific extended property object is NOT stored inside InstructionInfo. It is lazily allocated by sub_7A4650, which gates on target+372 == 0x8000 (sm_80 / Ampere targets). The allocation is 26,888 bytes, constructed by sub_896D50(block, parent_context).
sub_896D50 Object Layout
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| +0 | 8 | ptr | vtable | off_21DADF8 |
| +8 | 8 | ptr | parent_ctx | From construction parameter |
| +40 | 8 | ptr | allocator_base | MemoryManager from parent->+16 |
Property Array A (at sub-object +56):
| Sub-offset | Field | Description |
|---|---|---|
| +56 | ptr | Array pointer: 64 bytes per entry, 772 entries (49,408 bytes allocated) |
| +64 | i32 | Count = 771 |
| +68 | i32 | Capacity = 772 |
Each 64-byte entry: bytes [0..11] initialized to 0xFF (pipeline-unassigned sentinel), bytes [12..63] zeroed. Stores latency, throughput, port mask, and register class requirements per opcode.
Property Array B (at sub-object +80):
| Sub-offset | Field | Description |
|---|---|---|
| +80 | ptr | Array pointer: 36 bytes per entry, 772 entries (27,792 bytes allocated) |
| +88 | i32 | Count = 771 |
| +92 | i32 | Capacity = 772 |
Each 36-byte entry: all zeroed. Stores encoding class, format identifiers, operand encoding rules.
Property Array C (at sub-object +176):
| Sub-offset | Field | Description |
|---|---|---|
| +176 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +184 | i32 | Count = 34 |
| +188 | i32 | Capacity = 35 |
Each 16-byte entry: zeroed. Stores functional unit properties for major FU categories.
Property Array D (at sub-object +200):
| Sub-offset | Field | Description |
|---|---|---|
| +200 | ptr | Array pointer: 16 bytes per entry, 35 entries (560 bytes allocated) |
| +208 | i32 | Count = 34 |
Parallel table for alternate functional unit configurations.
Dimension Table (at sub-object +472):
| Sub-offset | Field | Description |
|---|---|---|
| +472 | ptr | 168-byte block: [count=40, entries[0..39]], 4 bytes per entry, zero-initialized |
Alphabetical SASS Name Table (at sub-object +11360):
Starting at offset +11360, sub_896D50 populates an alphabetically sorted ROT13 name table using the same {char*, u64} format. Unlike the InstructionInfo name table (indexed by opcode), this table is sorted by decoded mnemonic name and includes modifier variants:
- `OZZN.168128` (BMMA.168128)
- `PPGY.P.YQP.VINYY` (CCTL.C.LDC.IVALL)
- `VZNQ.JVQR.ERNQ.NO` (IMAD.WIDE.READ.AB)
- `VZZN.FC.{168128.*|16864.*8.*8}` (IMMA.SP.{...} -- regex patterns for variant matching)
This table is used for SASS assembly parsing and opcode-to-encoding resolution, where a single base opcode may map to multiple encoding variants distinguished by modifier suffixes.
Knob-derived fields:
| Sub-offset | Field | Source |
|---|---|---|
| +108 | i32 | Knob 803 value (instruction scheduling latency override) |
| +468 | u8 | = 0 |
| +469 | u8 | = 1 |
| +470 | u8 | = 1 |
Accessor Stubs
40+ tiny vtable accessor stubs at 0x859F80-0x85A5F0 and 0x868500-0x869700 provide virtual dispatch access to per-opcode properties. Typical pattern:
int getLatency(ArchSpecificInfo* this, int opcode) {
return *(int*)(this->property_array_a + 64 * opcode + latency_offset);
}
PTX Text-Generation Operand Accessor API
The PTX text generation subsystem (instruction pretty-printer, dispatcher at sub_5D4190) converts Ori IR instructions into PTX assembly text. The ~580 formatter functions at 0x4DA340-0x5A9FFF query a PTX instruction context object through a stable API of 48 small accessor helpers concentrated at 0x707000-0x710FFF.
PTX Instruction Context Object
The accessor functions do NOT operate on the 296-byte Ori IR instruction directly. They take a PTX instruction context object (~2500+ bytes) that contains pre-decoded fields for text generation. The raw Ori instruction is accessible at *(context + 1096). Each formatter receives this context as argument a1 and a pool allocator table as argument a2.
Partial field map of the PTX instruction context (offsets used by accessors):
| Offset | Size | Type | Field | Accessed By |
|---|---|---|---|---|
| +544 | 8 | ptr | predicate_ptr | has_predicate, get_opcode_string |
| +564 | 4 | u32 | saturation_code | get_saturation_mode (== 12 means saturate) |
| +596 | 4 | u32 | field_operand_count | get_field_a..get_field_d |
| +600 | 1 | u8 | flag_byte_a | Bit 0: precision, bit 6: addressing, bit 7: addr_mode |
| +604 | 1 | u8 | rounding_mode | Bits 0-2: rounding mode code (3 bits) |
| +605 | 1 | u8 | scale_byte | Bits 4-7: scale code (4 bits, 16 entries) |
| +609 | 1 | u8 | base_addr_byte | Bits 2-3: base address mode (2 bits, 4 entries) |
| +611 | 1 | u8 | param_flags | Bits 4-5: parameter variant selector |
| +615 | 1 | u8 | ftz_byte | Bits 6-7: FTZ flag code (2 bits, 4 entries) |
| +620 | 1 | u8 | variant_index | Variant string lookup index (8 bits, 256 entries) |
| +627 | 1 | u8 | flag_byte_b | Bits 0-1: extended_op, 2-3: flag_b, 4-5: modifier/variant |
| +640 | 4 | i32 | precision_code | Index into precision string table |
| +648 | var | ptr[] | operand_names | Per-operand name string pointer array (8B per slot) |
| +800 | 4 | u32 | operand_count | Number of operands for comparison/count accessors |
| +816 | var | ptr[] | reg_operands | Register operand pointer array (8B per slot) |
| +944 | var | u32[] | operand_types | Per-operand type code array (4B per slot) |
| +1024 | var | ptr[] | src_part0 | Source part 0 pointer array (8B per slot) |
| +1264 | var | ptr[] | src_part1 | Source part 1 pointer array (8B per slot) |
| +1504 | var | ptr[] | data_types_0 | Data type array, part 0 (8B per slot) |
| +1744 | var | ptr[] | data_types_1 | Data type array, part 1 (8B per slot) |
| +1984 | var | u32[] | target_sm | Target SM version array (4B per slot) |
| +2120 | 8 | ptr | opcode_name | Opcode mnemonic string pointer |
| +2488 | 8 | ptr | string_intern | String interning table for modifier deduplication |
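The flag bytes in this map pack several modifier codes into single bytes. A minimal sketch of the bit extraction, assuming only the recovered offsets and bit positions above (the helper names and the standalone-byte calling convention are hypothetical; the real accessors read the bytes out of the context object):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers mirroring the recovered accessor bit math.
 * Each takes the raw flag byte that lives at the named context offset. */
static unsigned rounding_mode(uint8_t byte_604)  { return byte_604 & 7; }          /* bits 0-2 */
static unsigned scale_code(uint8_t byte_605)     { return (byte_605 >> 4) & 0xF; } /* bits 4-7 */
static unsigned base_addr_mode(uint8_t byte_609) { return (byte_609 >> 2) & 3; }   /* bits 2-3 */
static unsigned variant_count(uint8_t byte_627)  { return (byte_627 >> 4) & 3; }   /* bits 4-5 */
```

The same shift-and-mask shapes recur verbatim in the accessor catalog below (e.g. getRoundingMode, getScaleString, getBaseAddress, getVariantCount).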
Accessor Catalog
Tier 1: Core Accessors (>200 callers)
Used by nearly every formatter function. These are the fundamental building blocks of PTX text generation.
| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_710860 | getDataType | 39B | 2953 | (ctx, idx, part) -> u8 | part ? **(ctx+1744+8*idx) & 0x3F : **(ctx+1504+8*idx) & 0x3F |
| sub_70B910 | getSrcPart0 | 12B | 1656 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1024) |
| sub_70B8E0 | getRegOperand | 12B | 1449 | (ctx, idx) -> ptr | *(ctx + 8*idx + 816) |
| sub_70B920 | getSrcPart1 | 12B | 1296 | (ctx, idx) -> ptr | *(ctx + 8*idx + 1264) |
| sub_70B700 | hasPredicate | 14B | 946 | (ctx) -> bool | *(ctx + 544) != 0 |
| sub_70B780 | getPredicateName | 151B | 514 | (ctx, pool) -> str | Allocates "@" + opcode_name; inserts "!" if negated |
| sub_70CA60 | getOperandType | 11B | 480 | (ctx, idx) -> u32 | *(ctx + 4*idx + 944) |
| sub_70B710 | getOpcodeString | 111B | 348 | (ctx, pool) -> str | Allocates "@" + *(ctx+2120) from arena pool |
| sub_70FA00 | getTargetSM | 10B | 286 | (ctx, idx) -> u32 | *(ctx + 4*idx + 1984) |
Tier 2: Modifier and Property Accessors (10-200 callers)
Used by instruction-class families (memory ops, float ops, texture ops, etc.).
| Address | Name | Size | Callers | Signature | Logic |
|---|---|---|---|---|---|
| sub_70CA70 | getTypeSuffix | 427B | 191 | (ctx, pool) -> str | Iterates *(ctx+796) type codes; looks up in off_2032300[] with interning |
| sub_70CD20 | getOperandOffset | 122B | 158 | (ctx, idx) -> str | off_2032300[*(ctx+4*idx+944)]; resolves via string interning for codes <= 0x39 |
| sub_707CE0 | getAddressOperand | 22B | 93 | (ctx) -> str | off_2033DE0[*(ctx+600) >> 7] |
| sub_70B930 | getOperandCount | 7B | 68 | (ctx) -> u32 | *(ctx + 800) |
| sub_70B4C0 | getBaseAddress | 22B | 46 | (ctx) -> str | off_2032700[(*(ctx+609) >> 2) & 3] |
| sub_709A10 | getVariantString | 73B | 46 | (ctx) -> str | off_2033060[*(ctx+620)] resolved via string interning |
| sub_70B6E0 | hasPredicate_v2 | 14B | 42 | (ctx) -> bool | *(ctx + 544) != 0 (identical body to hasPredicate) |
| sub_709760 | getComparisonOp | 127B | 21 | (ctx, pool) -> str | Iterates *(ctx+800) operand names from +648 array with " , " separator |
| sub_709FE0 | getRoundingMode | 11B | 17 | (ctx) -> u8 | *(ctx + 604) & 7 |
| sub_70A500 | getSaturationMode | 13B | 15 | (ctx) -> bool | *(ctx + 564) == 12 |
| sub_709910 | getVariantCount | 14B | 13 | (ctx) -> u8 | (*(ctx+627) >> 4) & 3 |
| sub_708E40 | getExtendedOperand | 29B | 10 | (ctx, idx) -> str | off_2033720[(*(ctx+627) >> (idx==1 ? 0 : 2)) & 3] |
Tier 3: Instruction-Class-Specific Accessors (<10 callers)
Used by specific instruction families (MMA/tensor, texture, guardrail formatters).
| Address | Name | Size | Callers | Signature | Purpose |
|---|---|---|---|---|---|
| sub_70FA10 | checkTargetSM | 66B | 7 | (ctx, idx, str) -> bool | sscanf(str, "sm_%d") then compare to *(ctx+1984+4*idx) |
| sub_70C890 | getOperandDetail | ~300B | varies | (ctx, pool, maxlen, type) -> str | Complex: hex parse, fallback to sub_707380, type-dispatch |
| sub_70A810 | getScaleString | 22B | varies | (ctx) -> str | off_2032BA0[(*(ctx+605) >> 4) & 0xF] |
| sub_70B3F0 | getFtzFlag | 22B | varies | (ctx) -> str | off_20327C0[(*(ctx+615) >> 6) & 3] |
| sub_707530 | getPrecisionString | 12B | varies | (ctx) -> str | off_2033FA0[*(ctx+640)] |
| sub_707C60 | getAddressingMode | 12B | varies | (ctx) -> bool | (*(ctx+600) & 0x40) != 0 |
| sub_707C80 | getScopeString | 22B | varies | (ctx) -> str | off_2033E00[(*(ctx+600) & 0x40) != 0] |
| sub_7075E0 | getLayoutString | 22B | varies | (ctx) -> str | off_2033EE0[*(ctx+600) & 1] -- WMMA/TCGEN05 |
| sub_707BE0 | getShapeString | 22B | varies | (ctx) -> str | off_2033E30[(*(ctx+600) & 4) != 0] -- WMMA/TCGEN05 |
| sub_7075C0 | getInstrFlagA | 7B | varies | (ctx) -> u8 | *(ctx+600) & 1 -- WMMA/rsqrt |
| sub_707BC0 | getInstrFlagB | varies | varies | (ctx) -> varies | Secondary flag accessor -- WMMA/rsqrt |
| sub_70D3B0 | getFieldA | 91B | 2 | (ctx) -> str | Returns ".transA" if operand count matches MMA shape |
| sub_70D410 | getFieldB | 99B | 2 | (ctx) -> str | Returns ".transB" (symmetric with getFieldA) |
| sub_70D480 | getFieldC | 91B | 2 | (ctx) -> str | MMA field C modifier string |
| sub_70D4E0 | getFieldD | 91B | 2 | (ctx) -> str | MMA field D modifier string |
| sub_70D360 | getModifier | 76B | 1 | (ctx, pool) -> str | Reads operand at index 3 or 5 depending on byte 627 |
| sub_70D2F0 | getImmediate | 107B | 1 | (ctx, pool) -> str | Reads operand at +672, conditionally appends second value |
| sub_70FCB0 | getParamA | varies | varies | (ctx) -> u64 | Dispatch on (*(ctx+611) & 0x30): selects guardrail constant |
| sub_70FCF0 | getParamB | varies | varies | (ctx) -> u64 | Similar dispatch on different bit field |
| sub_70E670 | getParamC | varies | varies | (ctx) -> u64 | Third parameter accessor |
Static String Tables
The accessor functions perform table-driven lookups using static string pointer arrays in .rodata. Each table is indexed by a small bit-field extracted from the context object:
| Table Address | Entries | Indexed By | Content |
|---|---|---|---|
| off_2032300 | >57 | Operand type code | Type suffix strings (.f32, .u16, .b64, etc.) |
| off_2032700 | 4 | (ctx+609 >> 2) & 3 | Base address mode strings |
| off_20327C0 | 4 | (ctx+615 >> 6) & 3 | FTZ flag strings (empty, .ftz, etc.) |
| off_2032BA0 | 16 | (ctx+605 >> 4) & 0xF | Scale modifier strings |
| off_2033060 | 256 | ctx+620 | Variant name strings |
| off_2033720 | 4 | (ctx+627 >> N) & 3 | Extended operand strings |
| off_2033DE0 | 2 | ctx+600 >> 7 | Address operand strings |
| off_2033E00 | 2 | (ctx+600 & 0x40) != 0 | Scope strings (.cta, .gpu, etc.) |
| off_2033E30 | 2 | (ctx+600 & 4) != 0 | Shape strings -- WMMA/TCGEN05 |
| off_2033EE0 | 2 | ctx+600 & 1 | Layout strings -- WMMA/TCGEN05 |
| off_2033FA0 | indexed by int | ctx+640 | Precision strings for texture ops |
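The lookup pattern can be sketched as follows. Only the index math comes from the decompilation; the table contents shown here are assumptions for illustration (the real strings live in .rodata at the off_* addresses above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed table contents -- stand-ins for off_20327C0 and off_2033EE0. */
static const char *ftz_table[4]    = { "", ".ftz", "", "" };
static const char *layout_table[2] = { ".row", ".col" };

/* Same index math as sub_70B3F0 (getFtzFlag): bits 6-7 of the byte at ctx+615. */
static const char *get_ftz_flag(uint8_t byte_615) {
    return ftz_table[(byte_615 >> 6) & 3];
}

/* Same index math as sub_7075E0 (getLayoutString): bit 0 of the byte at ctx+600. */
static const char *get_layout(uint8_t byte_600) {
    return layout_table[byte_600 & 1];
}
```

Because every table is sized to exactly cover its index range (2, 4, 16, or 256 entries for 1-, 2-, 4-, or 8-bit fields), the lookups need no bounds checks.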
Architectural Notes
- String interning: String-returning accessors for type codes <= 0x39 go through a string interning table at *(ctx+2488). The pattern is: look up a candidate string from the static table, then pass it through sub_426D60 (hash lookup) or sub_7072A0 (insert-and-return). This deduplicates PTX modifier strings across the entire text generation pass.
- Pool allocation: Accessors that construct new strings (prefixing "@", joining with separators) receive a pool allocator parameter. They allocate from the formatter's 50KB temp buffer via sub_4280C0 (get pool) -> sub_424070 (alloc from pool) -> sub_42BDB0 (abort on failure).
- Duplicate functions: sub_70B700 (hasPredicate, 946 callers) and sub_70B6E0 (hasPredicate_v2, 42 callers) have bytewise-identical bodies. Both return *(a1+544) != 0. These are likely methods in different classes (base and derived, or two sibling classes) that were not merged by the linker because they have distinct mangled names.
- MMA/tensor accessors: getFieldA through getFieldD, getLayoutString, and getShapeString are used exclusively by WMMA, HMMA, and TCGEN05 instruction formatters. They decode matrix operation modifiers (.transA, .transB, .row, .col) from compressed bit fields.
Instruction Creation
Allocation: sub_7DD010
The primary instruction allocator at sub_7DD010 (called from pass code that needs to create new instructions):
- Allocates 296 bytes from the Code Object's arena allocator (vtable+16, size 296)
- Zeroes the entire 296-byte object
- Initializes sentinel fields: offset +248 = -1, +256 = 0xFFFFFFFF, +264 and +272 = 0xFFFFFFFF00000000
- Loads scheduling parameter defaults from xmmword_2027620 into offset +208
- Appends the new instruction to the Code Object's instruction index array at +368 (resizable, 1.5x growth policy)
- Assigns a unique instruction index: *(instr + 264) = index
- Invalidates cached analysis (RPO at +792)
The instruction is created unlinked -- it is not yet in any basic block's linked list.
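The steps above can be sketched as follows, with malloc standing in for the Code Object's arena allocator and all names hypothetical (only the size, offsets, and sentinel values come from the decompilation):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 296-byte opaque instruction object, as allocated by sub_7DD010. */
typedef struct Instr { uint8_t bytes[296]; } Instr;

static Instr *instr_create(uint32_t next_index) {
    Instr *in = malloc(sizeof *in);            /* arena allocation in the real code */
    memset(in, 0, sizeof *in);                 /* zero the entire object */
    *(int32_t  *)(in->bytes + 248) = -1;       /* sentinel */
    *(uint32_t *)(in->bytes + 256) = 0xFFFFFFFFu;
    *(uint64_t *)(in->bytes + 264) = 0xFFFFFFFF00000000ull;
    *(uint64_t *)(in->bytes + 272) = 0xFFFFFFFF00000000ull;
    *(uint32_t *)(in->bytes + 264) = next_index; /* unique index overwrites the low half */
    return in;                                 /* unlinked: not yet in any block list */
}
```

Note how assigning the index into the low 32 bits of +264 leaves the 0xFFFFFFFF sentinel intact in the upper half.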
Linking: sub_925510 (Insert Before)
sub_925510 inserts instruction a2 before instruction a3 in the doubly-linked list of Code Object a1:
void InsertBefore(CodeObject* ctx, Instr* instr, Instr* before) {
// 1. Check if instruction removal impacts scheduling state
if (IsScheduleRelevant(instr, ctx))
UpdateScheduleState(ctx, instr);
// 2. Notify observers
NotifyObservers(ctx->observer_chain + 1952, instr);
// 3. Unlink from current position
if (instr->prev) {
instr->prev->next = instr->next;
if (instr->next)
instr->next->prev = instr->prev;
else
ctx->tail = instr->prev; // was tail
} else {
ctx->head = instr->next; // was head
if (instr->next)                  // guard: instr may be the only element
instr->next->prev = nullptr;
}
// 4. Insert before target
instr->next = before;
instr->bb_index = before->bb_index;
instr->prev = before->prev;
if (before->prev)
before->prev->next = instr;
if (before == ctx->head)
ctx->head = instr;
before->prev = instr;
// 5. Post-insert bookkeeping
PostInsertUpdate(ctx, instr);
}
Removal: sub_9253C0
sub_9253C0 (634 callers) removes an instruction from its linked list:
- Checks if the instruction affects scheduling state (same check as insert)
- Notifies the observer chain at Code Object +1952
- Unlinks from the doubly-linked list (updating head/tail pointers at +272/+280)
- Optionally updates the instruction map at Code Object +1136 (if the a3 flag is set)
- Handles debug info cleanup if the debug flag at byte +1421 bit 5 is set
Instruction Removal Check: sub_7E0030
Before removing an instruction (sub_7E0030, called from both sub_9253C0 and sub_925510), the compiler checks whether the removal is legal. This function examines:
- Whether the instruction is an STS (store shared, base opcode 95) with specific operand count and data type patterns (operand_count - adj == 5 with data type codes 1, 2, or 4 prevents removal)
- Whether a target-specific scheduler hook (vtable offset 2128 on the SM backend at compilation context +1584) vetoes the removal
- Whether the instruction is a PLOP3 (predicate logic, opcode 23) writing to a special register (register file type 9 at descriptor +64)
- Whether the dead-code check (sub_7DF3A0) clears the instruction, excluding opcodes 93 (OUT_FINAL), 124 (DMUL), and 248 (SM90+ opcode), which have required side effects
- Whether the opcode class has a "must keep" flag in the per-opcode property array at Code Object +776 (byte[4*opcode + 2] & 4)
Instruction Iteration
Forward Walk
The standard forward walk over a basic block's instructions:
// code_obj->head is at +272, tail at +280
instr_ptr instr = *(ptr*)(code_obj + 272);
while (instr) {
// process instruction
instr = *(ptr*)(instr + 8); // next
}
Reverse Walk
instr_ptr instr = *(ptr*)(code_obj + 280); // tail
while (instr) {
// process instruction
instr = *(ptr*)(instr + 0); // prev
}
Block-Scoped Iteration
When iterating within a specific basic block (used by scheduling, regalloc, and peephole passes), the walk starts at the block's head instruction pointer at block_entry +0 and continues until it reaches either the list tail or the next block boundary. The boundary is opcode 52 -- named AL2P_INDEXED in the ROT13 table but used universally as a BB delimiter pseudo-opcode:
// Block info at code_obj+976, 40 bytes per block
ptr block_head = *(ptr*)(*(ptr*)(code_obj + 976) + 40 * block_index);
for (instr = block_head; instr != nullptr; instr = *(ptr*)(instr + 8)) {
uint32_t op = *(uint32_t*)(instr + 72) & 0xFFFFCFFF;
if (op == 52) // BB boundary
break;
// process instruction
}
Def-Use Chain Iterator: sub_7E6090
The complex def-use chain builder sub_7E6090 (650 lines decompiled) is the core instruction analysis function. Called from sub_8E3A80 and numerous optimization passes, it:
- Walks all instructions in program order
- For each register operand (type == 1 via (word >> 28) & 7), updates the register descriptor's def/use counts at offsets +20 and +24
- Builds use chains via linked list nodes allocated from the arena (16-byte nodes with {next, instruction_ptr})
- Sets flag bits in register descriptors (+48) for live-out, same-block-def, has-prior-use, and source-only-ref
- Tracks the single-definition instruction at register descriptor +56
- Handles CSE matching: compares operand arrays of instructions with matching opcode, operand count, and auxiliary data to detect redundant computations
- Takes parameter a5 as a bitmask of register file types to process (one bit per register class)
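The use-chain construction can be sketched with the 16-byte {next, instruction_ptr} nodes described above. All names are hypothetical, and malloc stands in for the arena; the counts model the def/use counters at register-descriptor offsets +20/+24:

```c
#include <stddef.h>
#include <stdlib.h>

/* 16-byte use-chain node: {next, instruction_ptr}. */
typedef struct UseNode { struct UseNode *next; void *instr; } UseNode;

/* Simplified register descriptor: chain head plus def/use counters. */
typedef struct RegDesc { UseNode *uses; int def_count; int use_count; } RegDesc;

static void record_use(RegDesc *rd, void *instr) {
    UseNode *n = malloc(sizeof *n);  /* arena allocation in the real code */
    n->next  = rd->uses;             /* push onto the front of the chain */
    n->instr = instr;
    rd->uses = n;
    rd->use_count++;
}
```

Walking rd->uses then visits uses in reverse discovery order, which is sufficient for the counting and flag-setting queries the passes make.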
Instruction Lowering Handler -- sub_65D640 (48 KB)
The central PTX-to-Ori instruction lowering handler lives at sub_65D640. It is installed at vtable offset +32 in the ISel Phase 1 dispatch table (sub_660CE0) and called through the vtable for every PTX instruction during lowering.
Signature: int64 sub_65D640(context*, bb_ref, ptx_node*, ori_instr*)
The function reads the PTX opcode from *(*(ptx_node+32)+8) and dispatches through a ~60-case switch. An entry gate (sub_44AC80) diverts certain opcode types to an alternate handler (sub_656600). The function calls sub_A2FD90 (operand setter) 59 times to populate Ori operands on the resulting instructions.
Opcode Case Map
| Case(s) | PTX family | Handler | Description |
|---|---|---|---|
| 5 | prmt (byte permute) | inline | Decodes 8-bit per-byte channel mask, sets 2 operands |
| 6 | prmt (extended) | inline | Two-operand permute with address computation via sub_6294E0 |
| 10 | mov (special) | inline | Clears immediate flag for float type 109 |
| 12 | (delegated) | sub_659F90 | -- |
| 13 | multi-operand expansion | inline | Expands via sub_62E840, resolves type 87 (address) and 97 (register) operands |
| 17, 18, 24 | mov/cvt variants | sub_652FA0 | -- |
| 19, 20, 23 | surface ops | inline | ~200 lines: multi-register data, sub_6273E0 operand classification, up to 4 data regs + address |
| 34, 35 | load/store | inline | Optional address resolution gated on (ptx_node+61 & 0xC) |
| 45, 238 | conversion | inline | Rewrites operand type to 20 (integer), binds address via sub_6294E0 |
| 68, 71 | register indirect rewrite | inline | Checks operand size == 8, rewrites descriptor to type 110 |
| 81 | instruction expansion | inline | Creates IADD3 (opcode 38) with constant 0, reg class 12 |
| 82 | instruction expansion | inline | Rewrites to opcode 162 with IADD3 operand |
| 84 | load expansion | inline | Creates IADD3 with offset, flags 0x2000 |
| 85 | operand reorder | inline | 3-operand shuffle |
| 87 | reg class adjustment | inline | Table lookup at dword_2026C60, swaps operands 1/2, sets opcode 150 |
| 88 | matrix config | inline | MMA dimension table at dword_2026C48, sets fields 179/180 |
| 104 | 4-wide load | inline | Creates 4-operand instruction, address binding via sub_6294E0 |
| 110 | (delegated) | sub_652610 | -- |
| 123 | generic addressing | inline | Converts flat-to-specific addresses; SM-version-dependent multi-instruction sequences |
| 124, 125 | cvta / isspacep | inline | Address space conversion; creates CVTA opcode 538/539 on SM > 0x1A |
| 130 | instruction fusion | inline | Fuses instruction if operand count is not 3 or 4 |
| 165 | (delegated) | sub_65BF40 | -- |
| 175--178 | texture addr_mode | inline | Resolves .addr_mode_0/1/2 attributes from texture descriptor |
| 179 | atomic address mode | inline | Classifies atomic op type, creates SEL + ATOM sequence |
| 180 | (delegated) | sub_65CE90 | -- |
| 181, 182 | (delegated) | sub_64FF20 | -- |
| 183 | conditional atomic | inline | State space 0x20: rewrites to opcode 71 with mask 0xFF01010101 |
| 184--190 | surface/texture lowering | inline | Handles SULD/SUST/SURED (opcodes 449-456); SM-dependent operand resolution |
| 197, 198 | call site lowering | inline | Same-module vs cross-module call dispatch |
| 201--204, 208--211 | wide load/store | inline | .v2/.v4 multi-element operations with IADD3 offset computation |
| 206, 207, 212, 213 | 3-op wide load/store | inline | 3-operand variants of wide memory operations |
| 221, 222 | TMA operations | inline | Sets field 197 with value 365/366 |
Addressing Mode Types
ptxas handles four distinct addressing mode categories during instruction lowering, all resolved by sub_65D640:
1. Texture Addressing Modes (per-dimension)
Cases 175--178 resolve .addr_mode_0, .addr_mode_1, .addr_mode_2 attributes from texture descriptors. These are the PTX txq query targets.
The function walks the texture descriptor's attribute linked list at *(descriptor+16)+24, comparing each attribute name string:
// Pseudocode for cases 175-178:
addr_mode_0 = addr_mode_1 = addr_mode_2 = 0;
found = false;
for (node = attr_list_head; node != NULL; node = *node) {
name = *(node[1] + 16); // attribute name string
value = *(*(node[1] + 24) + 16); // integer value
if (strcmp(name, "addr_mode_0") == 0) { addr_mode_0 = value; found = true; }
else if (strcmp(name, "addr_mode_1") == 0) { addr_mode_1 = value; found = true; }
else if (strcmp(name, "addr_mode_2") == 0) { addr_mode_2 = value; found = true; }
}
For 2D textures (state space byte & 0xB0 == 0x20), the function checks addr_mode_0 == addr_mode_1. For 3D textures (0x30), it checks all three equal. If modes are uniform (all equal), the instruction gets a single addressing mode flag (field 91 = 1 for clamp_to_border). If modes differ, it delegates to sub_64FC90 for a multi-instruction lowering that handles per-dimension mode selection.
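The uniformity decision can be sketched as a predicate (hypothetical names; the state-space byte decoding that selects the dimensionality is omitted):

```c
#include <stdbool.h>

/* 2D textures require addr_mode_0 == addr_mode_1; 3D textures require all
 * three modes equal. A uniform result takes the single-flag lowering path;
 * a non-uniform result falls through to the sub_64FC90 multi-instruction path. */
static bool addr_modes_uniform(int dims, int m0, int m1, int m2) {
    if (dims == 2) return m0 == m1;
    if (dims == 3) return m0 == m1 && m1 == m2;
    return true;   /* 1D: a single mode is trivially uniform */
}
```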
2. Generic-to-Specific Address Conversion (case 123)
Converts flat/generic pointers to specific memory space pointers. The address space ID from *(ptx_node+40) selects the conversion strategy:
| Space ID | Memory space | Strategy |
|---|---|---|
| 4 | shared | sub_654A90 (direct conversion) |
| 5 | combined | OR of global + shared + local conversions |
| 6 | local | sub_64F7A0 with register pair 101/102 |
| 7 | generic (flat) | SM-dependent: sub_654FB0 (SM <= 0x1A) or SHR/AND extraction + SEL mux (SM > 0x10) |
| 8 | global | sub_64F7A0 with register pair 98/99 |
For generic space on older architectures (SM <= 0x1A with feature flag via sub_61AF90), a simpler single-instruction path is used. On newer architectures, a multi-instruction sequence extracts the space tag from the upper address bits.
3. Address Space Conversion (cases 124--125, cvta/isspacep)
The cvta (Convert Address) and isspacep (Is Space Predicate) instructions convert between generic and specific address spaces. For global space (type 8) on SM > 0x1A, the handler creates CVTA with opcode 538 (isspacep) or 539 (cvta) and sets register class 7 with width 4 or 16 bytes.
4. Memory Addressing Modes (implicit)
Memory addressing modes for load/store/atomic instructions are not enumerated as named constants. Instead, they emerge from the operand construction patterns in cases 19--23, 34--35, 81--84, 104, 201--213:
| Pattern | PTX syntax | Ori representation |
|---|---|---|
| Register indirect | [%rd1] | Operand type 87 from sub_629E40 |
| Register + offset | [%rd1+16] | Register operand + immediate via sub_6273E0 |
| Constant bank | c[2][0x100] | Constant operand via sub_620320 (type 12) |
| Immediate address | .local space | Constant value via sub_620320 |
| Base + index | [%rd1], %r2 | Two-operand form |
ISel Phase 1 Dispatch Vtable
sub_660CE0 constructs a 17-slot vtable at context offset +3784 for the ISel Phase 1 instruction handlers:
| Offset | Handler | Size | Role |
|---|---|---|---|
| +0 | sub_650840 | -- | Primary handler |
| +8 | sub_64EEB0 | -- | Operand handler |
| +16 | sub_64F270 | -- | Type handler |
| +24 | sub_6575D0 | 49 KB | Register-class-to-opcode dispatch |
| +32 | sub_65D640 | 48 KB | Instruction lowering (this function) |
| +40 | sub_64EDD0 | -- | Auxiliary handler |
| +128 | sub_64EEC0 | -- | Lowering helper |
Key Function Reference
| Address | Size | Function | Description |
|---|---|---|---|
| sub_7DD010 | 1.3KB | Instruction::create | Allocate and initialize 296-byte instruction |
| sub_7E0030 | 3.6KB | Instruction::canRemove | Check if instruction removal is legal |
| sub_7E0650 | 0.7KB | Instruction::hasPredGuard | Check if instruction has predicate guard |
| sub_7E0E80 | 0.1KB | Instruction::lastOpIsPred | Quick predicate-guard check on last operand |
| sub_7E6090 | 10KB | DefUseChain::build | Build def-use chains for all instructions |
| sub_7DDCA0 | 0.2KB | Observer::notify | Walk observer chain and notify |
| sub_9253C0 | 0.5KB | Instruction::remove | Remove instruction from linked list (634 callers) |
| sub_925510 | 0.5KB | Instruction::insertBefore | Insert instruction before another (13 callers) |
| sub_917A60 | 6.8KB | InstrInfo::getRegClass | Opcode-to-register-class mapping (221 callers) |
| sub_91A0F0 | 5.6KB | InstrInfo::resolveRegClass | Resolve operand register class with constraints |
| sub_9314F0 | 0.4KB | RegClass::query | Register class query (1,547 callers) |
| sub_738E20 | 10KB | InstrDescTable::init | Base instruction descriptor table constructor |
| sub_BE7390 | 16KB | InstructionInfo::init | InstructionInfo constructor (ROT13 table + descriptors) |
| sub_896D50 | 21KB | InstrMnemTable::init | Architecture-specific mnemonic table initializer |
| sub_65D640 | 48KB | InstrLowering::handle | PTX-to-Ori instruction lowering handler (60+ opcode cases, addressing mode resolution) |
| sub_660CE0 | 0.3KB | InstrLowering::initVtable | Constructs ISel Phase 1 dispatch vtable (17 slots) |
| sub_6575D0 | 49KB | RegClassOpcodeDispatch::handle | Register-class-to-opcode dispatch (vtable +24 sibling) |
| sub_6D9690 | 94KB | Instruction::encode | Master SASS instruction encoder |
| sub_B28E00 | varies | isReg/isPred/isImm | Operand type predicates (isel infrastructure) |
| sub_5D4190 | 12.9KB | PTXFormatter::dispatch | PTX text generation dispatcher (580 formatters) |
| sub_710860 | 39B | PTXCtx::getDataType | Data type accessor (2,953 callers) |
| sub_70B8E0 | 12B | PTXCtx::getRegOperand | Register operand accessor (1,449 callers) |
| sub_70B910 | 12B | PTXCtx::getSrcPart0 | Source part 0 accessor (1,656 callers) |
| sub_70B700 | 14B | PTXCtx::hasPredicate | Predicate presence check (946 callers) |
| sub_70CA60 | 11B | PTXCtx::getOperandType | Operand type code accessor (480 callers) |
| sub_70B710 | 111B | PTXCtx::getOpcodeString | Opcode string with "@" prefix (348 callers) |
| sub_70FA00 | 10B | PTXCtx::getTargetSM | Target SM version accessor (286 callers) |
Related Pages
- Ori IR Overview -- Code Object, basic blocks, CFG, register files
- Registers -- Register descriptor layout, register file types
- CFG -- Basic block structure, control-flow graph
- Data Structures -- Hash tables, bitvectors, linked lists
- Peephole Optimization -- Instruction rewriting passes
- SASS Encoding -- How Ori instructions become SASS binary
- Instruction Selection -- Pattern matching for instruction selection
- PTX-to-Ori Pipeline -- Full lowering pipeline context for sub_65D640
- Scheduling -- 3-phase instruction scheduler
Basic Blocks & Control Flow Graph
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas maintains a custom CFG infrastructure built entirely from scratch -- no LLVM BasicBlock, no LLVM MachineBasicBlock, no LLVM dominator framework. Basic blocks are stored in contiguous arrays, edges are stored in FNV-1a hash maps, and RPO / backedge / loop information is computed by dedicated functions in the 0xBDE000--0xBE2400 address range.
Key Facts
| Property | Value |
|---|---|
| BasicBlock object size | 136 bytes (allocated by sub_62BB00) |
| Block info entry (scheduling) | 40 bytes per entry, contiguous array |
| Block naming | bix%d (block index, 0-based integer) |
| Edge representation | FNV-1a hash map (key = block index, value = successor list) |
| RPO storage | int[] array, indexed by RPO position |
| Backedge storage | Separate FNV-1a hash map |
| CFG construction phase | Phase 3: AnalyzeControlFlow |
| Block layout phase | Phase 112: PlaceBlocksInSourceOrder |
| BB merge suppression | --dont-merge-basicblocks / -no-bb-merge CLI flag |
Two-Level Block Representation
ptxas uses two distinct but linked representations for basic blocks. The first is owned by the Code Object (used by all optimization passes); the second is owned by the scheduling/CFG analysis context (used by scheduling and post-regalloc passes).
Code Object Block Array
The Code Object stores an array of pointers to full BasicBlock objects:
| Code Object Offset | Type | Field | Description |
|---|---|---|---|
| +296 | ptr | bb_array | Array of BasicBlock* pointers (8 bytes each) |
| +304 | i32 | bb_count | Number of basic blocks |
Access pattern (from sub_78B430):
int bb_count = *(int*)(ctx + 304);
for (int i = 0; i <= bb_count; i++) {
BasicBlock* bb = *(BasicBlock**)(*(ctx + 296) + 8 * i);
int rpo = *(int*)(bb + 144);
// ...
}
Scheduling Block Info Array
The scheduling context maintains a parallel 40-byte-per-entry array:
| Scheduling Context Offset | Type | Field | Description |
|---|---|---|---|
| +976 | ptr | block_info | Contiguous array of 40-byte entries |
| +984 | i32 | num_blocks | Max block index (0-based; actual count = num_blocks + 1) |
Block Info Entry Layout (40 bytes)
| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | bb_ptr | Pointer to the full BasicBlock object |
| +8 | ptr | insn_head | Pointer to the instruction list head (or sentinel) |
| +16 | u64 | reserved | Reserved / padding |
| +24 | u32 | flags | Block flags |
| +28 | i32 | bix | Block index (unique ID used in all CFG operations) |
| +32 | u64 | aux | Auxiliary data (varies by pass) |
The DOT dumper at sub_BE21D0 iterates this array with a 40-byte stride:
for (int i = 0; i <= num_blocks; i++) {
entry = *(sched_ctx + 976) + 40 * i;
int bix = *(int*)(entry + 28);
int label = *(int*)(*(ptr*)(entry + 0) + 152);
printf("bix%d(L%x)", bix, label);
}
BasicBlock Object (136 bytes)
Allocated by sub_62BB00 during the parsing/lowering phase. The parser references the string "bb-controlflow" when constructing these objects. After allocation, the 136-byte block is zeroed via memset, then individual fields are populated.
BasicBlock Field Map
| Offset | Type | Field | Description |
|---|---|---|---|
| +0 | ptr | vtable | Virtual function table pointer (or type tag) |
| +8 | ptr | insn_list | Instruction doubly-linked list head/sentinel |
| +16 | ptr | insn_tail | Instruction list tail (for O(1) append) |
| +24 | u32 | insn_count | Number of instructions in the block |
| +28 | u32 | flags_a | Block attribute flags (see below) |
| +104 | ptr | bb_next | Linked-list link to next BasicBlock in function |
| +108 | u8 | opcode_flags | Terminator opcode classification bits |
| +128 | ptr | succ_list | Linked list of successor block references |
| +136 | ptr | pred_list | Linked list of predecessor block references |
| +144 | i32 | rpo_number | Reverse post-order number (set by RPO computation) |
| +152 | i32 | label_id | Label / source line identifier (displayed as L%x in DOT) |
The insn_list at +8 is the head of a doubly-linked list. Each instruction node has a next pointer at offset +8 of the instruction object. The sentinel/end is detected by comparing the current node pointer against the tail stored in the BasicBlock or against a per-block sentinel address.
Successor/Predecessor Lists
Both succ_list (+128) and pred_list (+136) are singly-linked lists of small nodes. Each node contains:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | next pointer (NULL = end of list) |
| +8 | i32 | Block index of the referenced block |
Iteration pattern (from sub_78B430 -- LoopStructurePass):
// Walk predecessor list
PredNode* pred = *(PredNode**)(bb + 136);
while (pred) {
BasicBlock* pred_bb = *(BasicBlock**)(*(ctx + 296) + 8 * pred->bix);
int pred_rpo = *(int*)(pred_bb + 144);
// ...
pred = pred->next;
}
// Walk successor list
SuccNode* succ = *(SuccNode**)(bb + 128);
while (succ) {
BasicBlock* succ_bb = *(BasicBlock**)(*(ctx + 296) + 8 * succ->bix);
// ...
succ = succ->next;
}
CFG Edge Hash Maps
In addition to the per-block predecessor/successor linked lists, the scheduling context maintains two global FNV-1a hash maps for fast edge queries. These are the primary edge representation used by RPO computation, backedge detection, and the scheduling pass.
Successor Edge Map (Code Object +648)
Maps block index to a set of successor block indices. Used by CFG::computeRPO (sub_BDE150), CFG::printEdges (sub_BDE8B0), and CFG::buildAndAnalyze (sub_BE0690).
Backedge Map (Code Object +680)
Maps block index to the set of backedge targets. A backedge exists when block bix_src has a successor bix_dst where RPO(bix_dst) <= RPO(bix_src) -- i.e., the successor was visited before the source in the DFS traversal, indicating a loop.
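The backedge condition can be written as a predicate over the RPO array (names hypothetical):

```c
#include <stdbool.h>

/* An edge bix_src -> bix_dst is a backedge when the destination was
 * visited no later than the source in the DFS traversal. Note that the
 * <= makes a self-loop (bix_src == bix_dst) count as a backedge. */
static bool is_backedge(const int *rpo, int bix_src, int bix_dst) {
    return rpo[bix_dst] <= rpo[bix_src];
}
```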
FNV-1a Hash Parameters
All CFG hash lookups use identical parameters, confirmed across 50+ call sites:
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| FNV prime | 16777619 (0x01000193) |
| Key size | 4 bytes (block index) |
| Hash method | Byte-by-byte XOR-fold |
The hash computation for a 32-bit block index bix:
uint32_t hash = 0x811C9DC5;
hash = 16777619 * (hash ^ (bix & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 8) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 16) & 0xFF));
hash = 16777619 * (hash ^ ((bix >> 24) & 0xFF));
uint32_t bucket = hash & (num_buckets - 1);
Hash Map Structure
HashMap:
+0 ptr first_free_node // Free list for node recycling
+8 ptr node_arena // Pool allocator for new nodes
+16 ptr bucket_array // Array of 24-byte bucket headers
+24 u64 num_buckets // Power of two, initial = 8
+32 i32 total_elements // Total entries across all buckets
+36 i32 num_unique_keys // Distinct keys inserted
Bucket (24 bytes):
+0 ptr head // First node in collision chain
+8 ptr tail // Last node in collision chain
+16 i32 count // Number of nodes in this bucket
Full Node (64 bytes, for edge maps):
+0 ptr next // Chain link within bucket
+8 i32 key // Block index (bix)
+12 i32 value_info // Edge count or flags
+16 ptr value_array // Pointer to sub-array of successor indices
+24 i32 value_count // Number of successors in sub-array
+32 ptr sub_hash_data // Embedded sub-hash for multi-edge blocks
+40 u64 sub_hash_size // Sub-hash capacity
+56 u32 cached_hash // Cached FNV-1a hash of key
Simple Node (16 bytes, for backedge set membership):
+0 ptr next // Chain link within bucket
+8 i32 key // Block index
+12 u32 cached_hash // Cached hash
Growth policy: rehash when total_elements > num_unique_keys (load factor > 1.0). New capacity = 4 * old_bucket_count. Hash map insert/find is implemented at sub_BDED20 (full nodes, 64 bytes) and sub_BDF480 (simple nodes, 16 bytes).
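The capacity side of the policy can be sketched directly (names hypothetical). Quadrupling keeps the bucket count a power of two, so the hash & (num_buckets - 1) bucket computation stays valid after every rehash:

```c
#include <stdint.h>

/* Bucket count after a rehash: 4x the old capacity (initial capacity is 8). */
static uint64_t rehash_capacity(uint64_t num_buckets) {
    return 4 * num_buckets;
}
```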
CFG Construction: Phase 3 (AnalyzeControlFlow)
AnalyzeControlFlow is phase 3 in the 159-phase optimizer pipeline. It runs immediately after the parser builds the initial instruction list and before any optimization. This phase:
- Populates the successor edge hash table at Code Object +648 by scanning the last instruction of each basic block. Branch instructions (opcode 67 = BRA, opcode 77 = EXIT; the code also checks opcodes 93 and 95, which are OUT_FINAL and STS respectively in the ROT13 name table but serve as internal control-flow markers in this context) provide the target block indices.
- Computes the backedge map at Code Object +680 by identifying edges where the target has a lower or equal position in the DFS tree.
- Builds the reverse post-order (RPO) array at Code Object +720 via iterative DFS.
- Identifies loop headers and backedges for later loop optimization passes.
The phase is critical because the Bison parser constructs basic blocks and instruction lists incrementally. AnalyzeControlFlow ensures the CFG is fully consistent and annotated before optimization begins.
Phase 6: SetControlFlowOpLastInBB
Phase 6 enforces a structural invariant: control flow operations must be the last instruction in their basic block. If a branch, jump, return, or exit instruction is followed by other instructions in the same block (which can happen during lowering passes), this phase splits the block at the control-flow instruction. New basic block entries are allocated and the instruction linked list is rewritten.
This invariant is required by all downstream passes -- the scheduler and register allocator assume that only the final instruction in a block can be a control-flow transfer.
Reverse Post-Order (RPO) Computation
RPO is computed by sub_BDE150 (CFG::computeRPO), a 9KB function that implements iterative DFS using an explicit stack.
RPO Storage
| Code Object Offset | Type | Field |
|---|---|---|
| +720 | ptr | rpo_array -- int*, indexed by RPO position |
| +728 | i32 | rpo_size -- number of entries used |
| +732 | i32 | rpo_capacity -- allocated capacity |
The array is resized with the standard ptxas growth policy: new_capacity = old + (old + 1) / 2, with a minimum of num_blocks + 1. Growth is implemented in sub_BDFB10.
Algorithm
The RPO computation uses a standard iterative DFS with post-order numbering:
function computeRPO(cfg, entry_block):
    stack = [entry_block]                 // Explicit stack at offset +88..+100
    visited = new BitArray(num_blocks)    // At offset +16..+40
    numbered = new BitArray(num_blocks)   // Second bit array at offset +40
    counter = num_blocks - 1              // Decremented as blocks complete
    while stack is not empty:
        bix = stack.top()
        if visited[bix]:
            stack.pop()
            if not numbered[bix]:         // A block pushed by several predecessors
                numbered[bix] = true      // must be numbered only once
                rpo_number[bix] = counter // *(cfg+64)[bix] = counter
                rpo_array[counter] = bix  // *(*(cfg+720))[counter] = bix
                counter--
            continue
        visited[bix] = true
        for each successor s of bix (via hash map lookup):
            if not visited[s]:
                stack.push(s)
    return counter                        // -1 if all blocks reachable
The key assignment line from the decompilation:
*(_DWORD *)(*(_QWORD *)(a1 + 64) + 4 * v16) = *a3; // rpo_number[bix] = counter
*(_DWORD *)(*(_QWORD *)(*(_QWORD *)a1 + 720) + 4 * (*a3)--) = v16; // rpo_array[counter--] = bix
After completion, rpo_array[0] is the entry block, and rpo_array[num_blocks - 1] is the deepest post-dominator (typically the EXIT block).
RPO Debug Dump
sub_BDEA50 (CFG::dumpRPOAndBackedges) prints the RPO state:
Showing RPO state for each basic block:
bix0 -> RPONum: 0
bix1 -> RPONum: 1
bix2 -> RPONum: 3
bix3 -> RPONum: 2
RPO traversal order: [0, 1, 3, 2]
Showing backedge info:
bix2 -> backedge's successor BB: 1
This output is gated by option flag #24 at offset +1728 relative to the options manager.
Backedge Detection and Loop Identification
Backedges are identified during CFG::buildAndAnalyze (sub_BE0690, 54KB). A backedge from block src to block dst exists when dst has already been visited in the DFS traversal (i.e., dst has a smaller or equal RPO number than src). Backedges are stored in the hash map at Code Object +680.
Natural Loop Detection
The LoopStructurePass (sub_78B430) combines RPO numbering with backedge analysis to identify natural loops:
- Calls sub_781F80 (BasicBlockAnalysis) to compute RPO numbers and dominance.
- Iterates the bb_array at Code Object +296.
- For each block, checks if rpo_number (+144) is non-zero and equals the value at +152 (loop exit RPO marker). Combined with a branch opcode check ((opcode & 0xFFFFFFFD) == 0x5D, i.e. BRA or conditional branch -- matching opcodes 93 and 95), this identifies loop header blocks.
- Walks the predecessor list to find the backedge source -- the predecessor with the largest RPO number that is still less than the header's RPO.
- Walks the successor list to find the loop latch -- the successor with the smallest RPO number greater than the loop preheader's RPO.
The RPO range [header_rpo, exit_rpo] defines the set of blocks belonging to the loop body. A block with header_rpo <= block_rpo <= exit_rpo is inside the loop.
LoopMakeSingleEntry Transformation
If a natural loop has multiple entry points, sub_78B430 transforms it into a single-entry loop. This is gated by:
- The LoopMakeSingleEntry pass-disable check (via sub_799250)
- Knob 487 (queried via the knob vtable at +152)
Two code paths handle different branch types:
- Opcode 93 (OUT_FINAL in the ROT13 name table; used here as a control-flow boundary marker): calls sub_9253C0 to rewrite the branch target
- Conditional branches: calls sub_748BF0 to insert a new preheader block and redirect edges
After transformation, sub_931920 is called to split blocks and update the instruction list.
Dominance
Dominance is computed by sub_BE2330 (4KB) and/or within sub_781F80 (12KB, BasicBlockAnalysis). The implementation uses bitvector operations -- each block has a bitvector of dominators, and the fixpoint iteration proceeds in RPO order.
The bitvector layout used by the dominator computation:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | data -- pointer to uint32_t[] words |
| +8 | i32 | word_count |
| +12 | i32 | capacity |
| +16 | i32 | bit_count |
Evidence for an iterative dataflow approach (rather than Lengauer-Tarjan) comes from the function sizes and patterns: sub_781F80 at 12KB and sub_BE2330 at 4KB are both small enough that they likely implement the simple iterative algorithm:
dom[entry] = {entry}
for all other blocks b: dom[b] = all_blocks
repeat until no changes:
for each block b in RPO order (skip entry):
dom[b] = {b} union (intersection of dom[p] for all predecessors p)
This is adequate for the small CFGs typical of GPU kernels (rarely exceeding a few hundred blocks). The O(n^2) worst case is not a concern at GPU kernel scale.
Block Layout: Phase 112 (PlaceBlocksInSourceOrder)
Phase 112 (PlaceBlocksInSourceOrder) runs in the post-scheduling stage of the pipeline, after register allocation and before Mercury encoding. It reorders the basic block array to restore source-order layout.
The implementation at sub_A92C50 (3.5KB binary, ~19KB decompiled) manipulates linked list structures and uses hash table lookups to reorder blocks. The goal is to minimize branch distances in the final SASS output -- placing fall-through successors immediately after their predecessors.
Hot/Cold Block Layout
Two companion phases handle hot/cold partitioning:
| Phase | Name | Purpose |
|---|---|---|
| 108 | OptimizeHotColdInLoop | Moves cold blocks out of loop bodies |
| 109 | OptimizeHotColdFlow | Global hot/cold block separation |
Cold blocks (e.g., error handlers, unlikely branches, assert paths) are moved to the end of the function's block sequence. The MarkAdditionalColdBlocks pass marks blocks as cold based on heuristics. This separation improves instruction cache utilization on the GPU's SM instruction fetch unit.
BB Merge Suppression
The --dont-merge-basicblocks (alias -no-bb-merge) CLI flag prevents the optimizer from merging consecutive basic blocks. This is used for debuggable code -- without it, the debugger cannot set breakpoints at the original source line boundaries. The flag is documented in the binary as:
"Normally, ptxas attempts to merge consecutive basic blocks as part of its optization process. However, for debuggable code this is very confusing. This option prevents basic block merging, at a slight perfomance cost."
(Note: "optization" and "perfomance" are typos in the original binary string.)
Entry and Exit Blocks
Block index 0 (bix0) is always the function entry block. It is the first element in the bb_array and the root of the RPO traversal. The entry block has no predecessors (its predecessor list at +136 is NULL).
The exit block is the block containing the EXIT instruction (opcode 77 = EXIT in the ROT13 name table). For functions with multiple exit points, each EXIT-containing block is a CFG sink. The RPO computation assigns these the highest RPO numbers. The SetControlFlowOpLastInBB phase (phase 6) ensures each EXIT is the final instruction in its block.
The CFG::buildAndAnalyze function (sub_BE0690) checks the terminator opcode at instruction offset +28. Opcodes 4 and 7 (internal control-flow opcodes) receive special treatment during edge construction:
| Opcode | Type | Edge behavior |
|---|---|---|
| 4 | Unconditional branch | Single successor edge to target block |
| 7 | Conditional branch | Two successor edges (taken + fall-through) |
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used as a control-flow boundary marker in CFG construction |
| 95 | STS | ROT13 name is STS; used as a control-flow terminator marker in CFG construction |
CFG Update Protocol
Passes that modify the CFG (block splitting, merging, edge redirection) must maintain consistency across several data structures:
- Block array -- both the Code Object bb_array (+296) and the scheduling block_info (+976) must be updated.
- Predecessor/successor linked lists -- the per-block lists at +128 and +136 must reflect the new edges.
- Edge hash maps -- the successor map (+648) and backedge map (+680) must be invalidated or updated.
- RPO array -- the RPO order at +720 must be recomputed after structural changes.
- Block count -- both bb_count (+304) and num_blocks (+984) must be incremented.
The general pattern observed in sub_931920 (block splitter called from sub_78B430):
function splitBlock(ctx, bb, split_point):
new_bb = allocateBasicBlock()
// Move instructions after split_point to new_bb
new_bb->insn_list = split_point->next
bb->insn_list_tail = split_point
split_point->next = sentinel
// Transfer successors from bb to new_bb
new_bb->succ_list = bb->succ_list
bb->succ_list = new_node(new_bb->bix)
// Update predecessor lists of old successors
for each succ in new_bb->succ_list:
replace bb in succ->pred_list with new_bb
// new_bb's only predecessor is bb
new_bb->pred_list = new_node(bb->bix)
// Invalidate and recompute RPO
ctx->bb_count++
recomputeRPO(ctx)
The AnalyzeControlFlow phase (phase 3) is explicitly re-run or incrementally updated after phases that modify the CFG structure. The phase pipeline contains multiple OriPerformLiveDead and GeneralOptimize passes that may rebuild portions of the CFG.
Key CFG Functions
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_62BB00 | 16.5KB | BasicBlock::allocate -- allocates 136-byte block, initializes fields | HIGH |
| sub_781F80 | 12KB | BasicBlockAnalysis -- RPO, loop detection, dominance | MEDIUM |
| sub_78B430 | 1.2KB | LoopStructurePass -- single-entry loop transformation | HIGH |
| sub_BDE150 | 9KB | CFG::computeRPO -- iterative DFS with explicit stack | HIGH |
| sub_BDE6C0 | 3KB | HashMap::erase -- remove node from edge hash map | MEDIUM |
| sub_BDE8B0 | 2KB | CFG::printEdges -- prints "bix%d -> bix%d\n" | HIGH |
| sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges -- RPO + backedge debug dump | HIGH |
| sub_BDED20 | 12KB | HashMap::insertOrFind -- full 64-byte node insert | HIGH |
| sub_BDF480 | 10KB | HashMap::insertOrFind_simple -- 16-byte node insert | HIGH |
| sub_BDFB10 | 24KB | CFG::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BE0690 | 54KB | CFG::buildAndAnalyze -- master CFG builder | HIGH |
| sub_BE21D0 | 1.4KB | CFG::dumpDOT -- Graphviz DOT format output | HIGH |
| sub_BE2330 | 4KB | CFG::computeDominators -- bitvector-based dominance | MEDIUM |
| sub_A92C50 | 3.5KB | PlaceBlocksInSourceOrder -- block reordering (phase 112) | MEDIUM |
CFG Visualization
The CFG::dumpDOT function (sub_BE21D0) generates Graphviz DOT output when option flag #20 is enabled (offset +1440 from the options manager). The output format:
digraph f {
node [fontname="Courier",fontsize=10,shape=Mrecord];
"bix0"
[label="bix0(L0)"]
bix0 -> bix1
bix0 -> bix3
"bix1"
[label="bix1(L10)"]
bix1 -> bix2
"bix2"
[label="bix2(L20)"]
bix2 -> bix1
bix2 -> bix3
"bix3"
[label="bix3(L30)"]
}
Where L%x is the label identifier at BasicBlock +152. This can be converted to a visual graph with dot -Tpng.
If option flag #24 is also enabled (offset +1728), the RPO and backedge dump from sub_BDEA50 is appended.
Related Pages
- Ori IR Overview -- Code Object layout, instruction format, register files
- Instructions -- instruction format and opcode details
- Data Structures -- FNV-1a hash maps, bitvectors, linked lists
- Optimizer Pipeline -- the 159-phase pipeline including CFG phases
- Branch & Switch Optimization -- OriBranchOpt pass
- Loop Optimization -- OriLoopSimplification, LoopUnrolling
- Hot/Cold Partitioning -- OptimizeHotColdFlow, MarkAdditionalColdBlocks
Register Model (R / UR / P / UP)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas models four hardware register files plus two auxiliary barrier register files. Every Ori instruction references registers from one or more of these files. During the optimization phases (0--158), registers carry virtual numbers; the fat-point register allocator (phase 159+) maps them to physical hardware slots. This page documents the register files, the virtual/physical register descriptor, the 7 allocator register classes, wide register conventions, special registers, the operand encoding format, pressure tracking, and SM-specific limits.
Four Register Files
| File | Mnemonic | Width | Usable range | Zero/True | ABI type | Introduced |
|---|---|---|---|---|---|---|
| R | General-purpose | 32 bits | R0 -- R254 | RZ (R255) | 2 | sm_30 |
| UR | Uniform | 32 bits | UR0 -- UR62 | URZ (UR63) | 3 | sm_75 |
| P | Predicate | 1 bit | P0 -- P6 | PT (P7) | 5 | sm_30 |
| UP | Uniform predicate | 1 bit | UP0 -- UP6 | UPT (UP7) | -- | sm_75 |
R registers are per-thread 32-bit general-purpose registers. They hold integers, floating-point values, and addresses. 64-bit values occupy consecutive even/odd pairs (R4:R5); 128-bit values occupy aligned quads (R0:R1:R2:R3). The total R-register count for a function is field[159] + field[102] (reserved + allocated), stored in the Code Object at offsets +159 and +102. Maximum usable: 254 (R0--R254). R255 is the hardware zero register RZ -- reads return 0, writes are discarded.
UR registers (uniform general-purpose) are warp-uniform: every thread in a warp sees the same value. Available on sm_75 and later. Range: UR0--UR62 usable, UR63 is the uniform zero register URZ. The UR count is at Code Object +99. Attempting to use UR on pre-sm_75 targets triggers the diagnostic "Uniform registers were disallowed, but the compiler required (%d) uniform registers for correct code generation.".
P registers are 1-bit predicates used for conditional execution (@P0 FADD ...) and branch conditions. P0--P6 are usable; P7 is the hardwired always-true predicate PT. Writes to PT are discarded. The assembler uses PT as the default predicate for unconditional instructions. In the allocator, predicate registers support half-width packing: two virtual predicates can be packed into one physical predicate slot, with the hi/lo distinction stored in bit 23 (0x800000) of the virtual register flags.
UP registers are the uniform predicate variant. UP0--UP6 are usable; UP7 is UPT (always-true). Available on sm_75+.
Seven Allocator Register Classes
The fat-point allocator processes 7 register classes, indexed by the reg_type field at vreg+64. Class 0 is the cross-class constraint propagation channel and is skipped in the main allocation loop. Classes 1--6 are allocated independently, in order. The allocator distribution loop in sub_9721C0 (lines 520--549) reads *(int*)(vreg+64) and uses it directly as the class bucket index, guarded by reg_type <= 6:
| Class ID | Name | Width | HW limit | Description |
|---|---|---|---|---|
| 0 | (unified) | -- | -- | Cross-class constraint propagation (skipped) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (used for RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |
The class ID is the reg_type value stored at vreg+64. The allocator class at vreg+12 is a separate field used for instruction-level classification, not for the per-class allocation passes. The allocator's per-class linked lists at alloc[3*reg_type + 138] are populated directly from vreg+64.
Per-class state is initialized via the target descriptor vtable call vtable[896](alloc_state, class_id), which populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class).
Barrier Registers
Barrier registers (B and UB) are a distinct register file used by the BAR, DEPBAR, BSSY, and BSYNC instructions for warp-level and CTA-level synchronization. B0--B15 are the non-uniform barrier registers; UB0--UB15 are the uniform variant. Barrier registers have reg_type = 9, which is above the <= 6 cutoff for the main allocator class buckets. They are handled by a separate allocation mechanism outside the 7-class system.
Tensor/Accumulator Registers (Class 6)
Class 6 registers are created during intrinsic lowering of tensor core operations (MMA, WGMMA, HMMA, DMMA). Over 30 intrinsic lowering functions in the 0x6B--0x6D address range call sub_91BF30(ptr, ctx, 6) to create these registers. The GMMA pipeline pass (sub_ADA740, sub_69E590) identifies accumulator operands by checking *(vreg+64) == 6. The accumulator counting function at sub_78C6B0 uses the pair-mode bits at vreg+48 (bits 20--21) to determine whether a type-6 register consumes 1 or 2 physical R slots.
Virtual Register Descriptor
Every virtual register in a function is represented by a 160-byte descriptor allocated from the per-function arena. The register file array is at Code Object +88, indexed as *(ctx+88) + 8*regId. The descriptor is created by sub_91BF30 (register creation function).
Descriptor Layout
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 8 | ptr | next | Linked list pointer (allocation worklist) |
| +8 | 4 | i32 | id | Unique register ID within function |
| +12 | 4 | i32 | class_index | Allocator register class (0--6) |
| +20 | 1 | u8 | flags_byte | Bit 0x20 = live |
| +24 | 4 | i32 | bb_index | Basic block of definition |
| +28 | 4 | i32 | epoch | Epoch counter for liveness tracking |
| +32 | 8 | ptr | alias_next | Next aliased register (coalescing chain) |
| +36 | 8 | ptr | alias_parent | Coalesced parent pointer |
| +40 | 4 | f32 | spill_cost | Accumulated spill cost |
| +48 | 8 | u64 | flags | Multi-purpose flag word (see below) |
| +56 | 8 | ptr | def_instr | Defining instruction pointer |
| +64 | 4 | i32 | reg_type | Register file type enum |
| +68 | 4 | i32 | physical_reg | Physical register number (-1 = unassigned) |
| +72 | 1 | u8 | size | 0 = scalar, nonzero = encoded width |
| +76 | 4 | f32 | secondary_cost | Secondary spill cost |
| +80 | 4 | i32 | spill_flag | 0 = not spilled, 1 = spilled |
| +97 | 2 | u16 | reserved | |
| +104 | 8 | ptr | use_chain | Use chain head (instruction pointer) |
| +112 | 8 | ptr | def_chain | Definition chain |
| +120 | 8 | ptr | regfile_next | Next in register file linked list |
| +128 | 8 | ptr | linked_next | Next in linked-register chain |
| +136 | 8 | ptr | reserved2 | |
| +144 | 8 | ptr | constraint_list | Constraint list head for allocator |
| +152 | 8 | ptr | reserved3 |
Initial values set by the constructor (sub_91BF30):
vreg->next = NULL; // +0
vreg->id = ctx->reg_count + 1; // +8, auto-incrementing
vreg->class_index = 0; // +12
vreg->flags_byte = 0; // +20
vreg->alias_parent = (ptr)-1; // +20..27 (qword write)
vreg->physical_reg = -1; // +68 (unassigned)
vreg->reg_type = a3; // +64 (passed as argument)
vreg->size = 0; // +72
vreg->spill_flag = 0; // +80
vreg->use_chain = NULL; // +104
vreg->def_chain = NULL; // +112
vreg->constraint_list = NULL; // +144
For predicate types (a3 == 2 or a3 == 3), the flags word at +48 is initialized to 0x1000 (4096). For all other types, it is initialized to 0x1018 (4120). If the type is 7 (alternate predicate classification), the physical register is initialized to 0 instead of -1.
Flag Bits at +48
| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 12 | 0x1000 | Base flag (set for all types) |
| 14 | 0x4000 | Spill marker (already spilled) |
| 18 | 0x40000 | Needs-spill (allocator sets when over budget) |
| 20--21 | (pair mode) | 0 = single, 1 = lo-half of pair, 3 = double-width |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair (predicate half-width packing) |
| 27 | 0x8000000 | Special handling flag |
Register File Type Enum (at +64)
This enum determines the register file a VR belongs to. It is used by the register class name table at off_21D2400 to map type values to printable strings ("R", "UR", "P", etc.) for diagnostic output such as "Referencing undefined register: %s%d".
| Value | File | Alloc class | Description |
|---|---|---|---|
| 1 | R | 1 | General-purpose register (32-bit) |
| 2 | R (alt) | 2 | GPR variant (RZ sentinel in sub_7D82E0, stat collector alternate) |
| 3 | UR | 3 | Uniform register (32-bit) |
| 4 | UR (ext) | 4 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 5 | Predicate register (1-bit); covers both P and UP |
| 6 | Tensor/Acc | 6 | Tensor/accumulator register for MMA/WGMMA operations |
| 7 | P (alt) | -- | Predicate variant (physical = 0 at init); above allocator cutoff |
| 8 | -- | -- | Extended type (created by sub_83EF00); above allocator cutoff |
| 9 | B / UB | -- | Barrier register; above allocator cutoff, separate allocation |
| 10 | R2 | -- | Extended register pair (64-bit, two consecutive R regs) |
| 11 | R4 | -- | Extended register quad (128-bit, four consecutive R regs) |
Values 0--6 are within the allocator's class system (the distribution loop in sub_9721C0 guards with reg_type <= 6). Values 7+ are handled by separate mechanisms. The off_21D2400 name table is indexed by reg_type and provides display strings for diagnostic output.
The stat collector at sub_A60B60 (24 KB) enumerates approximately 25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others by iterating vtable getter functions per register class.
Wide Registers
NVIDIA GPUs have only 32-bit physical registers. Wider values are composed from consecutive registers.
64-Bit Pairs (R2)
A 64-bit value occupies two consecutive registers where the base register has an even index: R0:R1, R2:R3, R4:R5, and so on. The low 32 bits reside in the even register; the high 32 bits in the odd register. In the Ori IR, a 64-bit pair is represented by a single virtual register with:
- vreg+64 (type) = 10 (extended pair)
- vreg+48 bits 20--21 (pair mode) = 3 (double-width)
The allocator selects even-numbered physical slots by scanning with stride 2 instead of 1. The register consumption function (sub_939CE0) computes slot + (1 << (pair_mode == 3)) - 1, consuming two physical slots.
128-Bit Quads (R4)
A 128-bit value occupies four consecutive registers aligned to a 4-register boundary: R0:R1:R2:R3, R4:R5:R6:R7, etc. Used by texture instructions, wide loads/stores, and tensor core operations. In the Ori IR:
- vreg+64 (type) = 11 (extended quad)
- Allocator scans with stride 4
Alignment Constraints
| Width | Base alignment | Stride | Example |
|---|---|---|---|
| 32-bit (scalar) | Any | 1 | R7 |
| 64-bit (pair) | Even | 2 | R4:R5 |
| 128-bit (quad) | 4-aligned | 4 | R8:R9:R10:R11 |
The texture instruction decoder (sub_1170920) validates even-register alignment via a dedicated helper (sub_1170680) that checks if a register index falls within the set {34, 36, 38, ..., 78} and returns 0 if misaligned.
The SASS instruction encoder for register pairs (sub_112CDA0, 8.9 KB) maps 40 register pair combinations (0/1, 2/3, ..., 78/79) to packed 5-bit encoding values at 0x2000000 (33,554,432) intervals.
Special Registers
Zero and True Registers
| Register | File | Index | Internal sentinel | Behavior |
|---|---|---|---|---|
| RZ | R | 255 | 1023 | Reads return 0; writes discarded |
| URZ | UR | 63 | 1023 | Uniform zero; reads return 0 |
| PT | P | 7 | 31 | Always-true predicate; writes discarded |
| UPT | UP | 7 | 31 | Uniform always-true |
The internal sentinel value 1023 (0x3FF) represents "don't care" or "zero register" throughout the Ori IR and allocator. During SASS encoding, hardware register index 255 is mapped to sentinel 1023 for R/UR files, and hardware index 7 is mapped to sentinel 31 for P/UP files. These sentinels are checked in encoders to substitute the default register value:
// Decoder: extract register operand (sub_9B3C20)
if (reg_idx == 255)
internal_idx = 1023; // RZ sentinel
// Decoder: extract predicate operand (sub_9B3D60)
if (pred_idx == 7)
internal_idx = 31; // PT sentinel
// Encoder: emit register field
if (reg == 1023)
use *(a1+8) as default; // encode physical RZ
Architectural Predicate Indices
The allocator skips architectural predicate registers by index number:
| Index | Register | Treatment |
|---|---|---|
| 39 | (special) | Skipped during allocation (skip predicate sub_9446D0) |
| 41 | PT | Skipped -- hardwired true predicate |
| 42 | P0 | Skipped -- architectural predicate |
| 43 | P1 | Skipped -- architectural predicate |
| 44 | P2 | Skipped -- architectural predicate |
The skip check in sub_9446D0 returns true (skip) for register indices 41--44 and 39, regardless of register class. For other registers, it checks whether the instruction is a CSSA phi (opcode 195 with barrier type 9) or whether the register is in the exclusion set hash table at alloc+360.
Special System Registers (S2R / CS2R)
Thread identity and hardware state are accessed through the S2R (Special Register to Register) and CS2R (Control/Status Register to Register) instructions. These read read-only hardware registers into R-file registers.
Common system register values (from PTX parser initialization at sub_451730):
| PTX name | Hardware | Description |
|---|---|---|
| %tid / %ntid | SR_TID_X/Y/Z | Thread ID within CTA |
| %ctaid / %nctaid | SR_CTAID_X/Y/Z | CTA ID within grid |
| %laneid | SR_LANEID | Lane index within warp (0--31) |
| %warpid / %nwarpid | SR_WARPID | Warp index within CTA |
| %smid / %nsmid | SR_SMID | SM index |
| %gridid | SR_GRIDID | Grid identifier |
| %clock / %clock_hi / %clock64 | SR_CLOCK / SR_CLOCK_HI | Cycle counter |
| %lanemask_eq/lt/le/gt/ge | SR_LANEMASK_* | Lane bitmask variants |
The S2R register index must be between 0 and 255 inclusive, enforced by the string "S2R register must be between 0 and 255 inclusive". Special system register ranges are tracked at Code Object offsets +1712 (start) and +1716 (count).
Operand Encoding in Ori Instructions
Each instruction operand is encoded as a 32-bit packed value in the operand array starting at instruction offset +84. The operand at index i is at *(instr + 84 + 8*i).
Packed Operand Format (Ori IR)
  31     30..28        27..20              19..0
+----+-----------+----------------+---------------------+
|sign| type (3)  |  modifier (8)  |     index (20)      |
+----+-----------+----------------+---------------------+
bit 31:     sign/direction flag
bits 28-30: operand type (3 bits)
bits 20-27: modifier byte (bit 24 = pair extension flag)
bits 0-19:  register/symbol index
Extraction pattern (50+ call sites):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7; // bits 28-30
int index = operand & 0xFFFFF; // bits 0-19
int mods = (operand >> 20) & 0xFF; // bits 20-27
bool is_neg = (operand >> 31) & 1; // bit 31
| Type value | Meaning |
|---|---|
| 1 | Register operand (index into register file at *(ctx+88) + 8*index) |
| 5 | Symbol/constant operand (index into symbol table at *(ctx+152)) |
| 6 | Special operand (barrier, system register) |
For register operands (type 1), the index is masked as operand & 0xFFFFFF (24 bits) to extract the full register ID. Indices 41--44 are architectural predicates that are never allocated.
SASS Instruction Register Encoding
During final SASS encoding, the register operand encoder (sub_7BC030, 814 bytes, 6147 callers) packs register operands into the 128-bit instruction word:
Encoded register field (16 bits at variable bit offset):
bit 0: presence flag (1 = register present)
bits 1-4: register file type (4 bits, 12 values)
bits 5-14: register number (10 bits)
The 4-bit register file type field in the SASS encoding maps the internal operand type tag to hardware encoding:
| Operand type tag | Encoded value | Register file |
|---|---|---|
| 1 | 0 | R (32-bit) |
| 2 | 1 | R pair (64-bit) |
| 3 | 2 | UR (uniform 32-bit) |
| 4 | 3 | UR pair (uniform 64-bit) |
| 5 | 4 | P (predicate) |
| 6 | 5 | (reserved) |
| 7 | 6 | (reserved) |
| 8 | 7 | B (barrier) |
| 16 | 8 | (extended) |
| 32 | 9 | (extended) |
| 64 | 10 | (extended pair) |
| 128 | 11 | (extended quad) |
The predicate operand encoder (sub_7BCF00, 856 bytes, 1657 callers) uses a different format: 2-bit predicate type, 3-bit predicate condition, and 8-bit value. It checks for PT (operand byte[0] == 14) and handles the always-true case.
Register-Class-to-Hardware Encoding
The function sub_1B6B250 (2965 bytes, 254 callers) implements the mapping from the compiler's abstract (register_class, sub_index) pair to hardware register numbers:
hardware_reg = register_class * 32 + sub_index
For example: class 0, index 1 returns 1; class 1, index 1 returns 33; class 2, index 1 returns 65. The guard wrapper sub_1B73060 (483 callers) returns 0 for the no-register case (class=0, index=0).
The register field writer (sub_1B72F60, 483 callers) packs the encoded register number into the 128-bit instruction word with the encoding split across two bitfields:
*(v2 + 12) |= (encoded_reg << 9) & 0x3E00; // bits [13:9]
*(v2 + 12) |= (encoded_reg << 21) & 0x1C000000; // bits [28:26]
Register Pressure Tracking
Scheduling Phase Pressure Counters
The scheduler maintains 10 per-block register pressure counters at offsets +4 through +40 of the per-BB scheduling record (72 bytes per basic block). At BB entry, these are copied into the scheduler context at context offsets +48 through +87. The counters track live register counts for each register class:
| BB record offset | Context offset (idx) | Register class |
|---|---|---|
| +4 | +48 (idx 12) | R (general-purpose) |
| +8 | +52 (idx 13) | P (predicate) |
| +12 | +56 (idx 14) | UR (uniform) |
| +16 | +60 (idx 15) | UP (uniform predicate) |
| +20 | +64 (idx 16) | B (barrier) |
| +24 | +68 (idx 17) | (arch-specific class 0) |
| +28 | +72 (idx 18) | (arch-specific class 1) |
| +32 | +76 (idx 19) | (arch-specific class 2) |
| +36 | +80 (idx 20) | (arch-specific class 3) |
| +40 | +84 (idx 21) | (arch-specific class 4 / control total) |
The spill cost analyzer (sub_682490, 14 KB) allocates two stack arrays (v94[511] and v95[538]) as per-register-class pressure delta arrays. For each instruction, it computes pressure increments and decrements based on the instruction's register operand definitions and uses.
The register pressure coefficient is controlled by knob 740 (double, default 0.045). The pressure curve function uses a piecewise linear model with parameters (4, 2, 6) via sub_8CE520.
Liveness Bitvectors
The Code Object maintains register liveness as bitvectors:
| Offset | Bitvector | Description |
|---|---|---|
| +832 | Main register liveness | One bit per virtual register; tracks which registers are live at the current program point |
| +856 | Uniform register liveness | Separate bitvector for UR/UP registers |
These bitvectors are allocated via sub_BDBAD0 (bitvector allocation, with size = register count + 1 bits) and manipulated via the SSE2-optimized bitvector primitives at sub_BDBA60 / sub_BDC180 / sub_BDCDE0 / sub_BDC300.
For each basic block during dependency graph construction (sub_A0D800, 39 KB), the per-block liveness is computed by iterating instructions and checking operand types (((v >> 28) & 7) == 1 for register operands), then updating the bitvector at +832 with set/clear operations.
Allocator Pressure Arrays
The fat-point allocator (sub_957160) uses two 512-DWORD (2048-byte) arrays per allocation round:
| Array | Role |
|---|---|
| Primary (v12[512]) | Per-physical-register interference count |
| Secondary (v225[512]) | Tie-breaking cost metric |
Both are zeroed with SSE2 vectorized _mm_store_si128 loops at the start of each round. For each VR being allocated, the pressure builder (sub_957020) walks the VR's constraint list and increments the corresponding physical register slots. The threshold (knob 684, default 50) filters out congested slots.
ABI Register Reservations
Reserved Registers
Registers R0--R3 are unconditionally reserved by the ABI across all SM generations. The diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s" fires if they are targeted by parameter assignment or user directives.
Minimum Register Counts by SM Generation
| SM generation | Value | SM targets | Minimum registers |
|---|---|---|---|
| 3 | (sm_target+372) >> 12 == 3 | sm_35, sm_37 | (no minimum) |
| 4 | == 4 | sm_50 -- sm_53 | 16 |
| 5 | == 5 | sm_60 -- sm_89 | 16 |
| 9 | == 9 | sm_90, sm_90a | 24 |
| >9 | > 9 | sm_100+ | 24 |
Violating the minimum emits warning 7016: "regcount %d specified below abi_minimum of %d".
Per-Class Hardware Limits
| Class | Limit | Notes |
|---|---|---|
| R | 255 | R0--R254 usable; controlled by --maxrregcount and --register-usage-level (0--10) |
| UR | 63 | UR0--UR62 usable; sm_75+ only |
| P | 7 | P0--P6 usable |
| UP | 7 | UP0--UP6 usable; sm_75+ only |
| B | 16 | B0--B15 |
| UB | 16 | UB0--UB15 |
The --maxrregcount CLI option sets a per-function hard ceiling for R registers. The --register-usage-level option (0--10, default 5) modulates the register allocation target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class budget at alloc + 32*class + 884 reflects the interaction between the CLI limit and the optimization level.
The --device-function-maxrregcount option overrides the kernel-level limit for device functions when compiling with -c.
Dynamic Register Allocation (setmaxnreg)
sm_90+ (Hopper and later) supports dynamic register allocation through the setmaxnreg.inc and setmaxnreg.dec instructions, which dynamically increase or decrease the per-thread register count at runtime. ptxas tracks these as internal states setmaxreg.try_alloc, setmaxreg.alloc, and setmaxreg.dealloc. Multiple diagnostics guard correct usage:
- "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count in the program (%d)"
- "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d) allowed"
- "Potential Performance Loss: 'setmaxnreg' ignored to maintain minimum register requirements."
Pair Modes and Coalescing
The pair mode at vreg+48 bits 20--21 controls how the allocator handles wide registers:
| Pair mode | Value | Behavior |
|---|---|---|
| Single | 0 | Occupies one physical register slot |
| Lo-half | 1 | Low half of a register pair |
| Double-width | 3 | Occupies two consecutive physical slots |
The allocator computes register consumption via sub_939CE0:
consumption = slot + (1 << (pair_mode == 3)) - 1;
// single: slot + 0 = slot (1 slot)
// double: slot + 1 = slot+1 (2 slots)
The coalescing pass (sub_9B1200, 800 lines) eliminates copy instructions by merging the source and destination VRs into the same physical register. The alias chain at vreg+36 (coalesced parent) is followed during assignment (sub_94FDD0) to propagate the physical register through all aliased VRs:
alias = vreg->alias_parent; // vreg+36
while (alias != NULL) {
alias->physical_reg = slot; // alias+68
alias = alias->alias_parent; // alias+36
}
Register Name Table
The register class name table at off_21D2400 is a pointer array indexed by the register file type enum (from vreg+64). Each entry points to a string: "R", "UR", "P", "UP", "B", "UB", etc. This table is used by diagnostic functions:
- sub_A4B9F0 (StatsEmitter::emitUndefinedRegWarning): "Referencing undefined register: %s%d", where %s is off_21D2400[*(vreg+64)] and %d is *(vreg+68) (physical register number).
- sub_A60B60 (RegisterStatCollector::collectStats, 24 KB): Enumerates ~25 register sub-classes by iterating vtable getters, one per register class. The enumerated classes include R, P, B, UR, UP, UB, SRZ, PT, RZ, and others.
- "Fatpoint count for entry %s for regclass %s : %d": Prints per-function per-class allocation statistics.
Key Functions
| Address | Size | Function | Description |
|---|---|---|---|
| sub_91BF30 | 99 lines | createVirtualRegister | Allocates 160-byte VR descriptor, initializes fields, appends to register file array |
| sub_9446D0 | 29 lines | shouldSkipRegister | Returns true for indices 41--44, 39 (architectural specials); checks CSSA phi and exclusion set |
| sub_A4B8F0 | 248B | emitInstrRegStats | Emits "instr/R-regs: %d instructions, %d R-regs" |
| sub_A4B9F0 | 774B | emitUndefinedRegWarning | Walks operands backward, formats "Referencing undefined register: %s%d" |
| sub_A60B60 | 4560B | collectRegisterStats | Enumerates ~25 register sub-classes via vtable getters |
| sub_7BC030 | 814B | encodeRegOperand | Packs register into SASS instruction: 1-bit presence + 4-bit type + 10-bit number |
| sub_7BCF00 | 856B | encodePredOperand | Packs predicate into SASS: 2-bit type + 3-bit condition + 8-bit value |
| sub_9B3C20 | -- | decodeRegOperand | Decoder helper: extracts register, maps 255 to 1023 (RZ) |
| sub_9B3D60 | -- | decodePredOperand | Decoder helper: extracts predicate, maps 7 to 31 (PT) |
| sub_1B6B250 | 2965B | regClassToHardware | Maps (class, sub_index) to hardware number: class * 32 + sub_index |
| sub_1B73060 | 19B | regClassToHardwareGuard | Guard wrapper: returns 0 for no-register case |
| sub_1B72F60 | 32B | writeRegField | Packs encoded register into instruction word bits [13:9] and [28:26] |
| sub_112CDA0 | 8.9KB | encodeRegisterPair | Maps 40 register pair combinations to 5-bit packed encoding values |
| sub_939CE0 | 23 lines | computeConsumption | Pair-aware register slot consumption counter |
| sub_94FDD0 | 155 lines | assignRegister | Commits physical register assignment, propagates through alias chain |
| sub_A0D800 | 39KB | buildDependencyGraph | Per-block dependency graph with register-to-instruction mapping |
| sub_A06A60 | 15KB | scheduleWithPressure | Per-block scheduling loop tracking live register set bitvector |
| sub_682490 | 14KB | computeRegPressureDeltas | Per-instruction register pressure delta computation |
| sub_B28E00 | -- | getRegClass | Returns register class (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | -- | isRegOperand | Predicate: is this a register operand? |
| sub_B28E20 | -- | isPredOperand | Predicate: is this a predicate operand? |
| sub_B28E90 | -- | isUReg | Predicate: is this a uniform register? |
Opcode Register Class Table
Every Ori opcode carries an implicit register class contract: which register files its operands may reference, what data widths are valid, and which addressing modes apply. The function sub_6575D0 (49 KB, buildEncodingDescriptor) is the central dispatch that translates each instruction's opcode into a packed encoding descriptor consumed by the SASS encoder.
Function Signature
// sub_6575D0 -- buildEncodingDescriptor
// a1 = compiler context
// a2 = Ori instruction node pointer
// a3 = output: 4-DWORD packed encoding descriptor
char buildEncodingDescriptor(Context *a1, Instruction *a2, uint32_t *a3);
Architecture
The function is a two-level dispatch:
- Outer switch on the Ori opcode at *(instr->info + 8) -- 168 unique case values spanning opcodes 3 (IADD3) through 0xF5 (PIXLD).
- Inner encoding per opcode (or group): assigns an encoding category ID to a3[0], then calls the bitfield packers to fill a3[1..2] with register class attributes.
Two helper functions pack fields into the descriptor:
| Function | Role | Call count | Field ID range |
|---|---|---|---|
| sub_917A60 (packRegClassField) | Bitfield encoder -- field IDs 91--340 map to specific bit positions in a3[1] and a3[2] | 112 | 91--340 |
| sub_A2FF00 (packOperandField) | Alternate encoder for operand-level slots (data type, memory space) | 28 | 3--71 |
Encoding Category Assignment
The encoding category at a3[0] selects which SASS instruction format template the downstream per-SM encoder uses. Key mappings (opcode index to encoding category):
| Opcode(s) | SASS mnemonic | Category | Register class summary |
|---|---|---|---|
| 3 | IADD3 | 489 | R dest, R/UR sources, P carry |
| 4 | BMSK | 106 | R only |
| 5--6 | SGXT / LOP3 | 490--491 | R dest, R/UR sources |
| 7 | ISETP | 59 | P dest, R/UR sources + memory ordering fields |
| 8 | IABS | 60 | R dest, R source + memory ordering fields |
| 0x0E--0x10 | FSET/FSEL/FSETP | 510 | R/P dest, FP operation variant |
| 0x11/0x12/0x18 | FSETP/MOV/PRMT | 517 | FP comparison, combine, data width (IDs 288--299) |
| 0x15--0x16 | P2R/R2P | 524--525 | P-to-R or R-to-P conversion |
| 0x19 | VOTE | 526 | R dest, optional memory class |
| 0x1A | CS2R variant | 527 | UR source width (494--496), data type from a2+92 |
| 0x1B | CS2R_32 | 497 | Source width (494/495/496), predicate flag (ID 270) |
| 0x1E | IPA | 494 | Interpolation mode (440--442), flat/smooth (443/444) |
| 0x1F | MUFU | 501 | Subfunction (445--447), precision (450--459) |
| 0x20 | SHF | 502 | Direction (461--463), source class (464--466), clamp, data type |
| 0x21 | SHFL | 503 | Mode (470/471), operand classes (472--482) |
| 0x22--0x23 | I2I/I2IP | 55/56 | Integer conversion type (23 entries in dword_2026B20) |
| 0x28--0x2A | IPA/MUFU ext | 512 | Extended encoding variants (428--430) |
| 0x2B--0x2C | F2F/F2F_X | 513 | Conversion direction (432/433), saturation (434/435) |
| 0x2D | FRND | 516 | Rounding variant (526), mode (528/529) |
| 0x51--0x53 | AL2P, AL2P_IDX | 437--438 | Bindless flag (ID 148), predicate (ID 147) |
| 0x54--0x56 | BMOV_B/BMOV_R/BMOV | 423--424 | B-register class |
| 0x64--0x67 | SETLMEMBASE/ATOM | 156/463 | Atom-vs-red (ID 178), data width (ID 181) |
| 0x68 | BRX | 468 | Target (ID 190), call convention (IDs 191--192) |
| 0x6A/0x6C/0x6D | JMP/JMX/CALL | 469 | Control flow target class (ID 176) |
| 0x77--0x79 | BSSY/BREAK/BSYNC | 528--530 | Sync mode (ID 324), variant (ID 325) |
| 0x82 | NANOTRAP | 487 | Trap operation class (ID 257), has-source (ID 256) |
| 0x9E--0x9F | Hopper+ instrs | 535--536 | Hopper class A/B (IDs 337--338) |
| 0xAF--0xB2 | LD/ST variants | 431--446 | Full modifier set: uniform (91), pair (92--102) |
| 0xB8--0xBE | LDG/STG/LDL/STL | 449--456 | Cache policy (131), float mode (134), width (131) |
| 0xC1 | Conditional | 10/13 | Branch type (ID 167), divergent (ID 168) |
| 0xC8 | PRMT | 24 | Permute selector (ID 65/66) |
| 0xC9--0xD3 | Texture/surface | 61/455 | Texture data type (IDs 17/18), surface (IDs 19--22) |
| 0xD6--0xD7 | DMMA/CVTA | 515 | Direction (304), predicate (305), data type (306) |
| 0xDA--0xDB | SUATOM | 521/533 | Data width (326--331), sync mode (328) |
| 0xDC | SURED | 534 | Data width (331), type (335--336), sync (333) |
| 0xE0 | WGMMA | 500 | Data type (198), enable (199), barrier (201) |
| 0xF5 | PIXLD | 532 | Mode from dword_2026AA0 (ID 323) |
Extended Opcode Path (Memory/Atomic Sub-dispatch)
When the opcode falls in the 0xF6--0x10C range (memory/atomic extended instructions), a separate sub-dispatch applies. The function sub_44AC80 gates entry; sub_44AC60 and sub_44AC70 select among three encoding categories:
| Category | Gate function | Meaning |
|---|---|---|
| 441 | default | Base memory operation |
| 442 | sub_44AC60 true | Predicated memory variant |
| 443 | sub_44AC70 true | Extended memory variant |
Within each category, the sub-opcode selects register class fields:
| Sub-opcode | Register class (field 115) | Data width (field 113) |
|---|---|---|
| 0xF6/0xFF/0x106 | 69 (class A) | 60 (standard) |
| 0xF7/0x100/0x107 | 71 (class B) | 60 (standard) |
| 0xF8/0x102/0x109 | 0 (default) | 63 (wide) |
| 0xF9/0x103/0x10A | 0 (default) | 61 (narrow) |
| 0xFA/0x104/0x10B | 0 (default) | 62 (medium) |
| 0xFB | 0 (default) | 65 (type A) |
| 0xFC | 0 (default) | 66 (type B) |
| 0xFD | 0 (default) | 68 (type C) |
| 0xFE/0x105/0x10C | 0 (from table) | 64 (from dword_2026C30) |
| 0x101/0x108 | 72 (class C) | 60 (standard) |
Packed Descriptor Layout
The output descriptor a3 is nominally a 4-DWORD (16-byte) structure; a fifth DWORD, a3[4], is additionally written for a few opcode families:
| DWORD | Content |
|---|---|
| a3[0] | Encoding category ID (0--542) -- selects SASS format template |
| a3[1] | Packed bitfield: memory space (bits 0--3), address type (bits 4--7) |
| a3[2] | Packed bitfield: register class attributes (data width, type, modifiers) |
| a3[3] | Auxiliary flags (bit 1 = texture scope, bit 29 = special) |
| a3[4] | Operand count override (set to 12 for KILL/extended mem ops) |
Register Class Field Groups
The 112 calls to packRegClassField (sub_917A60) use field IDs organized into functional groups. Each field ID maps to a specific bit range in the output descriptor via a mask-and-OR encoding:
// Example: field 113 (data width) -- bits 7-9 of a3[2]
case 113:
val = dword_21DEB20[a3_value - 61]; // 8-entry lookup
a3[2] = (val << 7) | (a3[2] & 0xFFFFF87F);
break;
// Example: field 91 (uniform flag) -- bit 16 of a3[2]
case 91:
a3[2] = ((value == 1) << 16) | (a3[2] & 0xFFFEFFFF);
break;
| Field group | IDs | Bits written | Purpose |
|---|---|---|---|
| Core class | 91--102 | a3[2] bits 5--22 | Uniform, pair, predicate, data type, saturate, negate, abs, complement |
| Data width | 113--117 | a3[2] bits 0--9 | Width code, uniform-mem, source regclass, type specifier, write-back |
| Load/store | 118--134 | a3[1] + a3[2] | Memory space, address type, cache policy, atomic op, scope, float mode |
| Texture/surface | 135--165 | a3[2] bits 1--31 | Texture type, dimension, LOD mode, ordering, acquire, scope hint |
| Control flow | 167--202 | a3[2] bits 1--6 | Branch type, divergent, WGMMA data type/enable/barrier |
| FP/conversion | 230--264 | a3[2] various | FP operation, comparison, combine, interpolation, MUFU, SHF, SHFL |
| Extended | 269--299 | a3[2] various | CS2R, FSETP, rounding, data type wide, destination regclass |
| Hopper/Blackwell | 304--340 | a3[2] various | DMMA, WGMMA, TMA hints, surface sync, Hopper-specific classes |
Sub-handler Functions
Complex opcode families delegate register class encoding to dedicated sub-functions:
| Function | Opcodes handled | Purpose |
|---|---|---|
sub_650390 | TEX, TLD, texture family | Texture register class (sampler, coordinate, LOD) |
sub_650220 | LDG, STG, LD, ST, ATOM, RED | Memory instruction register class |
sub_651330 | FMUL (opcode 0x0D) | FP multiply register class |
sub_650920 | LEA, special (0x09, 0x72, 0x74, 0x7A, 0x80, 0x81) | LEA / special instruction |
sub_650A90 | I2I, F2F, conversions (0x24--0x27, 0xE2--0xEB) | Type conversion register class |
sub_652190 | Branch/call (0x13, 0x14, 0x17) | Branch/call register class |
sub_653B90 | Misc (0x0C) | Miscellaneous instruction |
sub_650C80 | Memory barrier modifiers | Applied when (a2+56) & 0x4F0 is nonzero |
sub_651A90 | Texture modifiers (0x83) | Applied before texture encoding |
sub_62D5D0 | Memory space computation | Computes memory space tag from operand types |
Lookup Tables
The function references 28 static lookup tables that map instruction attribute values to register class encoding values:
| Table | Size | Used by field(s) | Content |
|---|---|---|---|
dword_21DEB80 | 5 | 94 | Data type encoding |
dword_21DEB50 | 3 | 107, 115, 145, 157, 165 | 3-value encoding (reused across 5 fields) |
dword_21DEB20 | 8 | 113 | Data width code |
dword_21DEB00 | 7 | 116, 126, 131, 170 | Type encoding (reused across 4 fields) |
dword_21DEAE0 | 5 | 119/123, 136, 143, 159 | Variant table (reused across 4 fields) |
dword_21DEAA0 | 13 | 120 | Memory space code |
dword_21DEA60 | 10 | 121, 135/151 | Address/texture type |
dword_21DEA20 | 15 | 124/125 | Reduction type |
dword_21DE9F0 | 6 | 129/130, 150 | Scope code |
dword_2026C30 | 6 | 116 (ext path) | Sub-opcode to data type |
dword_2026C80 | 20 | 165 (surface) | Surface operation codes |
dword_2026E20 | 17 | 286 | Data type (wide) |
dword_2026AC0 | 16 | 198 | WGMMA data type |
dword_2026B20 | 23 | I2I conversion | Integer conversion type |
Related Pages
- Ori IR Overview -- register files in the context of the full IR
- Instructions -- packed operand format and opcode encoding
- Allocator Architecture -- the 7-class fat-point allocator
- Fat-Point Algorithm -- pressure arrays, constraint types, selection loop
- GPU ABI -- reserved registers, parameter passing, return address
- Spilling -- spill/reload for each register class
- Scheduler -- 10 per-block pressure counters at record +4..+40
- SASS Encoding -- how the descriptor drives instruction word layout
Data Structure Layouts
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents the key internal data structures in ptxas v13.0.88: the compilation context ("god object"), the Ori Code Object, symbol tables, constant/shared memory descriptors, the pool allocator's object model, and the generic container types (hash maps, linked lists, growable arrays) that underpin nearly every subsystem.
All offsets are byte offsets from the structure base unless otherwise noted. Types are inferred from decompiled access patterns. Field names are reverse-engineered -- the binary is stripped.
Compilation Context (the "God Object")
The compilation context is the central state object passed to every phase in the pipeline. It is not the Code Object (which is per-function); it is the per-compilation-unit container that owns the Code Object, the knob system, the output stream, the function list, and all per-pass configuration. The sub_7FBB70 (PerKernelEntry) function receives this as a1, and every phase's execute() receives it as the second argument.
The context is a polymorphic C++ object with a vtable at offset +0. It is allocated by the compilation driver and persists for the lifetime of a single compilation unit. Key observations:
- The vtable at +0 provides 263+ virtual methods (vtable spans to offset 2104+)
- The object is at least 1928 bytes based on the highest confirmed field access (+1928 = codegen_ctx)
- The knob/options system is accessed through an indirection at +1664 (pointer to knob container object)
- The output stream lives at +1440
Compilation Context Field Map
| Offset | Type | Field | Evidence |
|---|---|---|---|
| +0 | vtable* | vtable | *(_QWORD *)a1 in every virtual dispatch |
| +8 | ptr | parent / driver_ctx | Back-pointer; sub_A3A7E0 reads v2 = *(a1+8) then v2[198] for Code Object |
| +80 | u32 | last_exit_code | sub_663C30: *(a1+80) = v2[462] |
| +96 | u32 | compile_unit_index | sub_663C30: *(a1+96) = 1 on first call |
| +139 | u8 | multi_function_flag | sub_663C30: if (!*(a1+139)) |
| +144 | ptr | name_table (via vtable+144) | sub_7FBB70: *(*a1 + 144) -> name lookup vtable |
| +296 | ptr | current_function | sub_7FBB70: *(*(a1+296) + 164) = function index |
| +368 | ptr | function_name_array | sub_7FBB70: *(a1+368 + 8*func_id) -> name object |
| +1144 | ptr | function_list_head | sub_663C30: linked list of function descriptors |
| +1160 | ptr | entry_list_head | sub_663C30: linked list of kernel entry descriptors |
| +1376 | u32 | scheduling_mode_flags | Bit 0x08 = forward, bit 0x10 = bidirectional |
| +1412 | i8 | compilation_flags_byte | sub_A3B080: *(char*)(a2+1412) < 0 |
| +1416 | u8 | output_detail_flags | sub_7FBB70: *(a1+1416) |= 0x80; bits 4-5 control latency reporting mode |
| +1418 | u8 | codegen_mode_flags | sub_A3B080: *(a2+1418) & 4 |
| +1428 | i32 | function_index | sub_7FBB70: *(a1+1428) < 0 means first invocation |
| +1440 | stream* | output_stream | sub_7FBB70: sub_7FE930(a1+1440, "\nFunction name: ") |
| +1560 | ptr | timing_records | Growable array of 32-byte timing entries |
| +1576 | u32 | timing_count | sub_C62720: cu->timing_count++ |
| +1552 | i32 | pipeline_progress | Pipeline progress counter (0--21), monotonically increases; see known values |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (polymorphic, 1712--1992B depending on SM target); provides vtable dispatch for legalization, optimization, scheduling, and codegen; see note below |
| +1664 | ptr | knob_container | sub_7FB6C0, sub_A3B080: options/knob dispatch object |
| +1864 | ptr | bb_structure | sub_7FB6C0: destroyed via sub_77F880 |
| +1872 | ptr | per_func_data | sub_7FB6C0: destroyed via sub_7937D0 |
| +1880 | ptr | function_context | sub_7FB6C0: 17 analysis-result pairs at qword offsets |
| +1928 | ptr | codegen_ctx | Confirmed in overview.md Code Object table |
SM Backend Object at +1584
The pointer at context+0x630 (decimal 1584) is the single most confusing field in the compilation context, because it serves multiple roles through a single polymorphic C++ object. Different wiki pages historically called it different names depending on which role they observed:
- Legalization pages see it dispatching MidExpansion, LateExpansionUnsupportedOps, etc., and call it "SM backend" or "arch_backend"
- Scheduling pages see it providing hardware latency profiles at *(sm_backend+372) and call it "scheduler context" or "hw_profile"
- Optimization pages see it dispatching GvnCse (vtable[23]) and OriReassociateAndCommon (vtable[44]) and call it "optimizer state" or "function manager"
- Codegen/template pages see it holding register file capacity at +372 and hardware capability flags at +1037
It is one object. The canonical name is sm_backend. It is constructed per-compilation-unit in sub_662920 with a switch on SM version bits (v3 >> 12). Each SM generation gets a different-sized allocation and a different vtable:
| SM Case | Size | Base Constructor | Vtable | SM Generations |
|---|---|---|---|---|
| 3 | 1712B | sub_A99A30 | off_2029DD0 | sm_30 (Kepler) |
| 4 | 1712B | sub_A99A30 | off_21B4A50 | sm_50 (Maxwell) |
| 5 | 1888B | sub_A99A30 | off_22B2A58 | sm_60 (Pascal) |
| 6 | 1912B | sub_A99A30 | off_21D82B0 | sm_70 (Volta) |
| 7 | 1928B | sub_ACDE20 | off_21B2D30 | sm_80 (Ampere) |
| 8 | 1992B | sub_662220 | off_21C0C68 | sm_89 (Ada) |
| 9 | 1992B | sub_662220 | off_21D6860 | sm_90+ (Hopper/Blackwell) |
Key sub-fields on the SM backend:
- +372 (i32): codegen factory value / encoded SM architecture version (e.g., 28673 = sm_80)
- +1037 (u8): hardware capability flags (bit 0 = has high-precision FP64 MUFU seeds)
- Vtable slots provide architecture-specific dispatch for 50+ operations
Pipeline Progress Counter at +1552
The field at context+1552 is a monotonically increasing int32 that tracks how far the compilation has progressed through the 159-phase pipeline. It is not a legalization-only counter -- it is incremented by phases across all categories (legalization, optimization, scheduling, regalloc). Each increment is performed by a small thunk function whose sole body is *(ctx + 1552) = N.
Known values and their associated phases:
| Value | Thunk Address | Phase / Context |
|---|---|---|
| 0 | (init) | sub_7F7DC0 -- compilation context constructor |
| 1 | sub_C5F620 | Early pipeline (before ConvertUnsupportedOps) |
| 2 | sub_C5F5A0 | After ConvertUnsupportedOps (phase 5) |
| 3 | sub_C5EF80 | After MidExpansion (phase 45) |
| 4 | sub_C5EF30 | After OriDoRematEarly (phase 54) -- signals remat mode active |
| 5 | sub_1233D70 | Mid-pipeline scheduling/ISel context |
| 7 | sub_6612E0 / sub_C60AA0 | After LateExpansion (phase 55) |
| 8 | sub_849C60 | Post-optimization context |
| 9 | sub_C5EB80 | After OriBackCopyPropagate (phase 83) |
| 10 | sub_88E9D0 | Late optimization |
| 11 | sub_C5EA80 | After SetAfterLegalization (phase 95) region |
| 12 | sub_C5E980 | Post-legalization |
| 13 | sub_13B5C80 | ISel/scheduling |
| 14 | sub_C5E830 | Post-scheduling |
| 15 | sub_C5E7C0 | Register allocation phase |
| 16 | sub_C5E6E0 | Post-regalloc |
| 17 | sub_C5E5A0 | Mercury/codegen |
| 18 | sub_C5E4D0 | Post-Mercury |
| 19 | sub_C5E440 | Late codegen |
| 20 | sub_C5E390 | Post-RA cleanup |
| 21 | sub_C5E0B0 | Final pipeline stage |
Readers of downstream passes use *(ctx+1552) > N to gate behavior that should only run after a certain pipeline point. For example, the rematerialization cross-block pass checks *(ctx+1552) > 4 to enable its second-pass mode.
Knob Container Access Pattern
The knob container at +1664 is accessed through a two-level virtual dispatch pattern that appears at 100+ call sites:
// Fast path: known vtable -> direct array read
_QWORD *v2 = *(_QWORD **)(ctx + 1664);
bool (*query)(__int64, int) = *(bool (**)(...))(*v2 + 72);
if (query == sub_6614A0)
result = *(u8*)(v2[9] + knob_index * 72 + offset) != 0;
else
result = query((int64)v2, knob_index); // slow path
The fast path reads directly from the knob value array at v2[9] (offset +72 of the knob state object), where each knob value occupies 72 bytes. The slow path invokes the virtual method for derived knob containers.
Function Context (at +1880)
When a function is under compilation, +1880 points to a large context object containing 17 pairs of analysis-result data structures. Each pair consists of a sorted container and a hash map, holding results such as live ranges, register maps, and scheduling data. The cleanup code in sub_7FB6C0 destroys pairs at qword offsets [102, 97, 92, 87, 82, 77, 72, 67, 62, 57, 52, 47, 42, 36, 31, 26, 21] from the context base, then handles reference-counted objects at offsets [10] and [2].
Ori Code Object (~1136 bytes)
The Code Object is the per-function container for all IR data. One instance exists for each function under compilation. Constructor is at sub_A3B080, vtable at 0x21EE238.
Constructor Analysis
The constructor (sub_A3B080) takes two arguments: a1 (the Code Object to initialize) and a2 (the compilation context). It:
- Sets +8 = a2 (back-pointer to compilation context)
- Sets +0 = &unk_21EE238 (vtable)
- Zeroes approximately 250 distinct fields across the 1136-byte range
- Loads two SSE constants from xmmword_2027600 and xmmword_21EFAE0 into offsets +96 and +112 (likely default register file descriptors or encoding parameters)
- Reads a2+1412 and a2+1418 to set mode flags at +1101 and +1008
- Accesses the knob container at a2+1664 to query knob 367 for initial configuration
- Sets +1008 = 0x300000050 (default) or 0x400000080 (if a2+1418 & 4)
Code Object Field Map
| Offset | Type | Field | Evidence / Notes |
|---|---|---|---|
| +0 | vtable* | vtable | 0x21EE238, 263+ virtual methods |
| +8 | ptr | compilation_ctx | Back-pointer to owning compilation context |
| +16 | u128 | (zeroed) | SSE zero-store in constructor |
| +24 | u32 | sm_version | Encoded SM target (12288=sm30, 20481=sm50, 36865=sm90) |
| +32 | u128 | (zeroed) | SSE zero-store |
| +48 | u128 | (zeroed) | SSE zero-store |
| +64 | u32 | init_flags | Zeroed in constructor |
| +72 | ptr | code_buf | Output code buffer |
| +80 | u128 | (zeroed) | |
| +88 | ptr | reg_file | Register descriptor array: *(ctx+88) + 8*regId |
| +96 | u128 | reg_defaults_1 | Loaded from xmmword_2027600 |
| +99 | u32 | ur_count | Uniform register (UR) count |
| +102 | u32 | r_alloc | R-register allocated count |
| +112 | u128 | reg_defaults_2 | Loaded from xmmword_21EFAE0 |
| +128--175 | u128[3] | (zeroed) | SSE zero-stores |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +159 | u32 | r_reserved | R-register reserved count |
| +176 | ptr | (zeroed) | |
| +184 | u32 | (zeroed) | |
| +192 | ptr | (zeroed) | |
| +200 | u128 | (zeroed) | |
| +216 | u128 | (zeroed) | |
| +232 | u32 | (zeroed) | |
| +236 | u32 | (zeroed) | |
| +240 | ptr | (zeroed) | |
| +248 | u128 | (zeroed) | |
| +264 | u128 | (zeroed) | |
| +272 | ptr | instr_head | Instruction linked-list head |
| +280 | u32 | (zeroed) | |
| +288 | ptr | (zeroed) | |
| +296 | ptr | bb_array | Basic block array pointer (40 bytes per entry) |
| +304 | u32 | bb_index | Current basic block count |
| +312 | ptr | options | OptionsManager* for knob queries |
| +320--359 | u128[3] | (zeroed) | |
| +335 | u32 | instr_hi | Instruction count upper bound |
| +336 | u32 | tex_inst_count | Texture instruction count (stats emitter) |
| +338 | u32 | fp16_vect_inst | FP16 vectorized instruction count |
| +340 | u32 | inst_pairs | Instruction pair count |
| +341 | u32 | instr_lo | Instruction count lower bound |
| +342 | u32 | tepid_inst | Tepid instruction count |
| +360 | ptr | (zeroed) | |
| +368 | u32 | sub_block_flags | |
| +372 | u32 | instr_total | Total instruction count (triggers chunked scheduling at > 0x3FFF) |
| +376 | u32 | (zeroed) | |
| +384--416 | ptr[5] | (zeroed) | |
| +424 | u32 | (zeroed) | |
| +432 | ptr | (zeroed) | |
| +440 | u32 | (zeroed) | |
| +448 | ptr | (zeroed) | |
| +464 | ptr | (zeroed) | |
| +472 | u8 | (zeroed) | |
| +473 | u8 | (zeroed) | |
| +536 | u32 | (zeroed) | |
| +540 | u32 | (zeroed) | |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +728 | ptr | bitmask_array | Grow-on-demand bitmask array for scheduling |
| +740 | u32 | bitmask_capacity | Capacity of bitmask array |
| +752 | ptr | (zeroed) | |
| +760 | u32 | (zeroed) | |
| +764 | u32 | (zeroed) | |
| +768 | ptr | const_sections | Constant memory section array |
| +772 | u8 | (zeroed) | |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +996 | u32 | annotation_offset | Current offset into annotation buffer (sub_A4B8F0) |
| +1000 | ptr | annotation_buffer | Annotation data buffer (sub_A4B8F0) |
| +1008 | u64 | encoding_params | Default 0x300000050 or 0x400000080 |
| +1016 | ptr | (zeroed) | |
| +1024 | u32 | (zeroed) | |
| +1032 | ptr | (zeroed) | |
| +1040 | ptr | (zeroed) | |
| +1064 | ptr | (zeroed) | |
| +1080 | u128 | (zeroed) | |
| +1096 | u32 | (zeroed) | |
| +1100 | u8 | (zeroed) | |
| +1101 | u8 | optimization_mode | Set from knob 367 and compilation_ctx+1412 |
| +1102 | u8 | (zeroed) | |
| +1104 | ptr | (zeroed) | |
| +1120 | u128 | (zeroed) |
Register Count Formula
From the stats emitter at sub_A3A7E0 and the register count function at sub_A4B8F0 (which both use vtable+2104 dispatch with sub_859FC0 as the fast path):
total_R_regs = code_obj[159] + code_obj[102] // reserved + allocated
instruction_count = code_obj[335] - code_obj[341] // upper - lower
Stats Emitter Field Map
The stats emitter (sub_A3A7E0) accesses a per-function stats record through the SM backend: v3 = *(compilation_ctx+8)[198] (offset +1584 from the outer compilation context points to the SM backend object; the emitter then reads per-function stats fields within it). It uses DWORD indexing (4-byte), and reveals these additional fields:
| DWORD Index | Byte Offset | Field | Stat String |
|---|---|---|---|
| 8 | +32 | est_latency | [est latency = %d] |
| 10 | +40 | worst_case_lat | [worstcaseLat=%f] |
| 11 | +44 | avg_case_lat | [avgcaseLat=%f] |
| 12 | +48 | spill_bytes | [LSpillB=%d] |
| 13 | +52 | refill_bytes | [LRefillB=%d] |
| 14 | +56 | s_refill_bytes | [SRefillB=%d] |
| 15 | +60 | s_spill_bytes | [SSpillB=%d] |
| 16 | +64 | low_lmem_spill | [LowLmemSpillSize=%d] |
| 17 | +68 | frame_lmem_spill | [FrameLmemSpillSize=%d] |
| 18 | +72 | non_spill_bytes | [LNonSpillB=%d] |
| 19 | +76 | non_refill_bytes | [LNonRefillB=%d] |
| 20 | +80 | non_spill_size | [NonSpillSize=%d] |
| 26 | +104 | occupancy (float) | [Occupancy = %f] |
| 27 | +108 | div_branches | [est numDivergentBranches=%d] |
| 28 | +112 | attr_mem_usage | [attributeMemUsage=%d] |
| 29 | +116 | program_size | [programSize=%d] |
| 42 | +168 | precise_inst | [Precise inst=%d] |
| 44 | +176 | udp_inst | [UDP inst=%d] |
| 45 | +180 | vec_to_ur | [numVecToURConverts inst=%d] |
| 49 | +196 | max_live_suspend | [maxNumLiveValuesAtSuspend=%d] |
| 87 | +348 | partial_unroll | [partially unrolled loops=%d] |
| 88 | +352 | non_unrolled | [non-unrolled loops=%d] |
| 89 | +356 | cb_bound_tex | [CB-Bound Tex=%d] |
| 90 | +360 | partial_bound_tex | [Partially Bound Tex=%d] |
| 91 | +364 | bindless_tex | [Bindless Tex=%d] |
| 92 | +368 | ur_bound_tex | [UR-Bound Tex=%d] |
| 93 | +372 | sm_version_check | > 24575 triggers UR reporting |
| 99 | +396 | ur_count_stats | [urregs=%d] |
| 102 | +408 | r_alloc | R-register allocated count |
| 159 | +636 | r_reserved | R-register reserved count |
| 303 | +1212 | est_fp | [est fp=%d] |
| 306 | +1224 | est_half | [est half=%d] |
| 307 | +1228 | est_transcendental | [est trancedental=%d] |
| 308 | +1232 | est_ipa | [est ipa=%d] |
| 310 | +1240 | est_shared | [est shared=%d] |
| 311 | +1244 | est_control_flow | [est controlFlow=%d] |
| 315 | +1260 | est_load_store | [est loadStore=%d] |
| 316 | +1264 | est_tex | [est tex=%d] |
| 334 | +1336 | inst_pairs | [instPairs=%d] |
| 335 | +1340 | instr_hi | Instruction count upper bound |
| 336 | +1344 | tex_inst_count | [texInst=%d] |
| 337 | +1348 | fp16_inst | [FP16 inst=%d] |
| 338 | +1352 | fp16_vect_inst | [FP16 VectInst=%d] |
| 339 | +1356 | inst_hint | [instHint=%d] |
| 340 | +1360 | inst_pairs_2 | checked for non-zero to print instHint line |
| 341 | +1364 | instr_lo | Instruction count lower bound |
| 342 | +1368 | tepid_inst | [tepid=%d] |
Note: The stats emitter accesses the stats record through a float pointer (v3), so DWORD indices map to byte offsets as index * 4 for both integer and float fields. Float fields at indices 9, 26, 50, 54, 57, 58, 59, 61, 62, 65, 84, 85, 86 hold throughput and occupancy metrics. A linked list at qword index 55 (byte offset +440) holds additional string annotations.
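The DWORD indexing can be made concrete with a pair of accessors. This is an illustrative sketch, not recovered code: the accessor names and the enum are inventions of this page, while the indices come from the field map above.

```c
#include <stdint.h>

/* Hypothetical accessors mirroring the emitter's DWORD indexing.
   Integer and float fields share one base pointer; only the
   interpretation of the 4-byte slot differs. */
enum {
    STAT_EST_LATENCY = 8,   /* [est latency = %d]            */
    STAT_OCCUPANCY   = 26,  /* [Occupancy = %f], float       */
    STAT_R_ALLOC     = 102, /* R-register allocated count    */
    STAT_R_RESERVED  = 159  /* R-register reserved count     */
};

static inline int32_t stat_i32(const void *stats, int dword_idx) {
    return ((const int32_t *)stats)[dword_idx];
}

static inline float stat_f32(const void *stats, int dword_idx) {
    return ((const float *)stats)[dword_idx];
}

/* Register count formula recovered from sub_A3A7E0 / sub_A4B8F0. */
static inline int32_t total_r_regs(const void *stats) {
    return stat_i32(stats, STAT_R_RESERVED) + stat_i32(stats, STAT_R_ALLOC);
}
```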
Basic Block Entry (40 bytes)
Basic blocks are stored in a contiguous array at Code Object +976, with count at +984.
BasicBlock (40 bytes)
+0 ptr instr_head // first instruction in this BB
+8 ptr instr_tail // last instruction (or list link)
+16 ptr (reserved)
+24 u32 (reserved)
+28 i32 bix // block index (unique ID for CFG ops)
+32 u64 flags // scheduling/analysis flags
The scheduling pass (sub_8D0640) initializes per-block scheduling state by iterating the block list and zeroing qword offsets [7], [13], [19], and setting [21] = -1 on each block.
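On x86-64 the recovered layout corresponds to a natural C struct. The field names below are this wiki's labels (the binary is stripped, so no real names survive); the offsets are the ones from the table:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the 40-byte BasicBlock entry. Natural alignment on
   x86-64 reproduces the recovered offsets exactly. */
typedef struct BasicBlock {
    void    *instr_head; /* +0  first instruction in this BB        */
    void    *instr_tail; /* +8  last instruction (or list link)     */
    void    *reserved_p; /* +16                                     */
    uint32_t reserved_u; /* +24                                     */
    int32_t  bix;        /* +28 block index (unique ID for CFG ops) */
    uint64_t flags;      /* +32 scheduling/analysis flags           */
} BasicBlock;
```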
Instruction Layout
Instructions are polymorphic C++ objects linked into per-BB doubly-linked lists. The instruction format is detailed in Instructions; this section covers only the structural linkage.
Each instruction carries a unique integer ID at +16, an opcode at +72 (the peephole optimizer masks with & 0xCF on byte 1 to strip modifier bits), and a packed operand array starting at +84. The operand count is at +80. Operands are 8 bytes each.
Packed Operand Format
  31   30    28   27        20   19                    0
+----+--------+---------------+------------------------+
| -  |  type  |   modifiers   |         index          |
+----+--------+---------------+------------------------+
(bit 31 is not read by the confirmed extraction sites, which mask the type with & 7)
Extraction (50+ confirmed sites):
uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type = (operand >> 28) & 7; // bits 28-30
int index = operand & 0xFFFFF; // bits 0-19
int mods = (operand >> 20) & 0xFF; // bits 20-27
| Type Value | Meaning | Resolution |
|---|---|---|
| 1 | Register operand | Index into *(code_obj+88) register file |
| 5 | Symbol/constant operand | Index into *(code_obj+152) symbol table |
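A round-trip sketch makes the field boundaries concrete. The unpacking side mirrors the 50+ confirmed extraction sites; the packing direction is inferred, and both function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical packer for the 32-bit operand word (inferred from
   the extraction sites; ptxas's own construction code has not been
   confirmed to look like this). */
static inline uint32_t pack_operand(int type, int mods, int index) {
    return ((uint32_t)(type & 7) << 28)
         | ((uint32_t)(mods & 0xFF) << 20)
         | ((uint32_t)index & 0xFFFFF);
}

/* Mirrors the confirmed extraction pattern. */
static inline void unpack_operand(uint32_t operand,
                                  int *type, int *mods, int *index) {
    *type  = (operand >> 28) & 7;     /* bits 28-30 */
    *mods  = (operand >> 20) & 0xFF;  /* bits 20-27 */
    *index = operand & 0xFFFFF;       /* bits 0-19  */
}
```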
The operand classifier functions at 0xB28E00--0xB28E90 provide predicate checks:
| Function | Predicate |
|---|---|
| sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | isRegOperand |
| sub_B28E20 | isPredOperand |
| sub_B28E40 | isImmOperand |
| sub_B28E80 | isConstOperand |
| sub_B28E90 | isUReg |
Symbol Table
The symbol table is accessed through Code Object +152. Based on the symbol table builder at sub_621480 (21KB, references a1+30016 for the symbol table base), symbols are stored in a hash-map-backed structure where each symbol has a name and associated properties (address, type, section binding).
Internal Symbol Names
The following internal symbol names appear in decompiled code, indicating the kinds of entities tracked:
| Symbol | Purpose |
|---|---|
__ocg_const | OCG-generated constant data |
__shared_scratch | Shared memory scratch space |
__funcAddrTab_g | Global indirect function call table |
__funcAddrTab_c | Constant indirect function call table |
_global_ptr_%s | Global pointer for named variable |
$funcID$name | Function-local relocation symbol |
__cuda_dummy_entry__ | Dummy entry generated by --compile-only |
__cuda_sanitizer | CUDA sanitizer instrumentation symbol |
Symbol Resolution Flow
Symbol resolution (sub_625800, 27KB) traverses the symbol table to resolve references during the PTX-to-Ori lowering and subsequent optimization phases. The format %s[%d] (from sub_6200A0) is used for array-subscripted symbol references, and __$endLabel$__%s markers delimit function boundaries.
Constant Buffer Layout
Constant memory is organized into banks (c[0], c[1], ...) corresponding to the CUDA .nv.constant0, .nv.constant2, etc. ELF sections. The constant section array at Code Object +768 tracks all constant banks for the current function.
Constant Bank Handling
The constant bank handler at sub_6BC560 (4.9KB) manages references to constant memory using the c[%d] (integer bank) and c[%s] (named bank, sw-compiler-bank) notation. It enforces:
- A maximum constant register count (error: "Constant register limit exceeded; more than %d constant registers")
- LDC (Load Constant) requires a constant or immediate bank number
ELF Constant Symbols
The ELF symbol emitter (sub_7FD6C0) creates symbols for constant bank metadata:
| Symbol Name | Purpose |
|---|---|
.nv.ptx.const0.size | Size of constant bank 0 (kernel parameters) |
The constant emission function (sub_7D14C0, 5.6KB) iterates the constant section array and copies bank data into the output ELF sections.
Shared Memory Layout
Shared memory (.nv.shared) allocations are tracked through the shared memory section array at Code Object +776. Reserved shared memory regions are managed by sub_6294E0 (12.1KB) and sub_629E40 (6.1KB).
Reserved Shared Memory Symbols
The ELF emitter recognizes these special symbols for shared memory layout:
| Symbol Name | Purpose |
|---|---|
.nv.reservedSmem.begin | Start of reserved shared memory region |
.nv.reservedSmem.cap | Capacity of reserved shared memory |
.nv.reservedSmem.end | End of reserved shared memory region |
.nv.reservedSmem.offset0 | First reserved offset within shared memory |
.nv.reservedSmem.offset1 | Second reserved offset within shared memory |
The --disable-smem-reservation CLI option disables the reservation mechanism. Shared memory intrinsic lowering (sub_6C4DA0, 15KB) validates that shared memory operations use types {b32, b64}.
Descriptor Size Symbols
Additional ELF symbols track texture/surface descriptor sizes in shared memory:
| Symbol Name | Purpose |
|---|---|
.nv.unified.texrefDescSize | Unified texture reference descriptor size |
.nv.independent.texrefDescSize | Independent texture reference descriptor size |
.nv.independent.samplerrefDescSize | Independent sampler reference descriptor size |
.nv.surfrefDescSize | Surface reference descriptor size |
Pool Allocator
The pool allocator (sub_424070, 3,809 callers) is the single most heavily used allocation function. Every dynamic data structure in ptxas is allocated through pools.
Pool Object Layout
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | ptr | large_block_list | Singly-linked list of large (>4999 byte) blocks |
| +32 | u32 | min_slab_size | Minimum slab allocation size |
| +44 | u32 | slab_count | Number of slabs allocated |
| +48 | ptr | large_free_list | Free list for large blocks (boundary-tag managed) |
| +56 | u32 | fragmentation_count | Fragmentation counter (decremented on split) |
| +60 | u32 | max_order | Maximum power-of-2 order for large blocks |
| +64 | ... | (large block free lists) | a1 + 32*(order+2) = per-order free list head |
| +2112 | ptr | tracking_map | Hash map for allocation metadata tracking |
| +2128 | ptr[N] | small_free_lists | Size-binned free lists: *(pool + 8*(size>>3) + 2128) = head |
| +7128 | mutex* | pool_mutex | pthread_mutex_t* for thread safety |
Allocation Paths
Small path (size <= 4999 bytes = 0x1387):
- Round the size up to 8-byte alignment: aligned = (size + 7) & ~7, with a 16-byte minimum
- Compute the bin: bin = pool + 8 * (aligned >> 3) + 2128
- If the bin has a free block: pop it from the free list and decrement the slab's available bytes
- If the bin is empty: allocate a new slab from the parent (size = aligned * ceil(min_slab_size / aligned)) and carve it into free-list nodes
Large path (size > 4999 bytes):
- Add 32 bytes for boundary tags
- Search the power-of-2 order free lists starting from log2(size + 32)
- If found: split the block when the remainder is > 39 bytes and return the payload
- If not found: call sub_423B60 to grow the pool and allocate a new slab from the parent
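The size arithmetic in both paths can be sketched in plain C. Function names here are hypothetical, and whether the order search starts at the floor or the ceiling of log2(size + 32) is not pinned down by the decompilation, so the ceiling is an assumption:

```c
#include <stddef.h>

#define POOL_SMALL_MAX 4999 /* 0x1387 */

/* Small path: 8-byte alignment with a 16-byte floor. */
static size_t pool_small_align(size_t size) {
    size_t aligned = (size + 7) & ~(size_t)7;
    return aligned < 16 ? 16 : aligned;
}

/* Byte offset of the size-binned free-list head within the pool object. */
static size_t pool_small_bin_offset(size_t aligned) {
    return 8 * (aligned >> 3) + 2128;
}

/* Large path: smallest power-of-2 order covering size + 32 tag bytes
   (ceiling-of-log2 is an assumption, see lead-in). */
static int pool_large_order(size_t size) {
    size_t total = size + 32;
    int order = 0;
    while (((size_t)1 << order) < total)
        order++;
    return order;
}
```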
Boundary Tag Format (Large Blocks)
Large blocks use boundary tags for coalescing on free:
Block Header (32 bytes):
+0 i64 sentinel // -1 = allocated, else -> next free
+8 ptr prev_free // previous in free list (or 0)
+16 u64 tag_offset // 32 (header size)
+24 u64 payload_size // user-requested allocation size
Block Footer (32 bytes at end):
+0 i64 sentinel
+8 ptr prev_free
+16 u64 footer_tag // 32
+24 u64 block_size // total size including headers
Slab Descriptor (56 bytes)
Each slab is tracked by a 56-byte descriptor:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | chain_link |
| +8 | u64 | total_size |
| +16 | u64 | available_size |
| +24 | ptr | owning_pool |
| +32 | ptr | memory_base |
| +40 | u8 | is_small_slab |
| +44 | u32 | slab_id |
| +48 | u32 | bin_size |
Hierarchical Pools
Pools are hierarchical. When sub_424070 is called with a1 = NULL, it falls back to a global allocator (sub_427A10) that uses malloc directly. Non-null a1 values are pool objects that allocate from their own slabs, which are themselves allocated from a parent pool (the TLS context at offset +24 holds the per-thread pool pointer). The top-level pool is named "Top level ptxas memory pool" and is created in the compilation driver.
Hash Map
The hash map (sub_426150 insert / sub_426D60 lookup, 2,800+ and 422+ callers respectively) is the primary associative container in ptxas.
Hash Map Object Layout (~112 bytes)
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | fptr | hash_func | Custom hash function pointer |
| +8 | fptr | compare_func | Custom compare function pointer |
| +16 | fptr | hash_func_2 | Secondary hash (or NULL) |
| +24 | fptr | compare_func_2 | Secondary compare (or NULL) |
| +32 | u32 | has_custom_compare | Flag |
| +40 | u64 | bucket_mask | capacity - 1 for power-of-2 masking |
| +48 | u64 | entry_count | Number of stored entries |
| +64 | u64 | load_factor_threshold | Resize when entry_count exceeds this |
| +72 | u32 | first_free_slot | Tracking for bitmap-based slot allocation |
| +76 | u32 | entries_capacity | Capacity of entries array |
| +80 | u32 | bitmap_capacity | Capacity of used-bits bitmap |
| +84 | u32 | flags | Hash mode in bits 4-7 |
| +88 | ptr | entries | Array of 16-byte {key, value} pairs |
| +96 | ptr | used_bitmap | Bitmap tracking occupied slots |
| +104 | ptr | buckets | Array of pointers to chained index lists |
Hash Modes
The hash mode is encoded in bits 4-7 of the flags field at offset +84:
| Mode | Flag Bits | Hash Function | Use Case |
|---|---|---|---|
| 0 | 0x00 | Custom (+0 function pointer) | User-defined hash/compare |
| 1 | 0x10 | Pointer hash: (key>>11) ^ (key>>8) ^ (key>>5) | Pointer-keyed maps |
| 2 | 0x20 | Identity: key used directly | Integer-keyed maps |
Mode selection happens automatically in the constructor (sub_425CA0): if the hash/compare pair matches (sub_427750, sub_427760), mode 2 is set; if (sub_4277F0, sub_427810), mode 1.
Lookup Algorithm
// Mode 1 (pointer hash) example:
uint64_t hash = (key >> 11) ^ (key >> 8) ^ (key >> 5);
uint64_t bucket_idx = hash & map->bucket_mask;
int32_t* chain = map->buckets[bucket_idx];
while (*++chain != -1) {
entry_t* e = map->entries + 16 * (*chain);
if (key == e->key)
return e->value; // found
}
return 0; // not found
Growth policy: the map doubles capacity and rehashes when entry_count > load_factor_threshold.
String-Keyed Maps
String-keyed maps use MurmurHash3 (sub_427630, 73 callers) as the hash function. The implementation uses the standard MurmurHash3_x86_32 constants:
| Constant | Value | Standard Name |
|---|---|---|
| c1 | 0xCC9E2D51 (-862048943) | MurmurHash3 c1 |
| c2 | 0x1B873593 (461845907) | MurmurHash3 c2 |
| fmix1 | 0x85EBCA6B (-2048144789) | MurmurHash3 fmix |
| fmix2 | 0xC2B2AE35 (-1028477387) | MurmurHash3 fmix |
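For reference, the standard MurmurHash3_x86_32 routine built from these constants is reproduced below. Whether sub_427630 matches it byte-for-byte (seed choice, tail handling) has not been verified, so treat this as the reference algorithm rather than recovered code:

```c
#include <stdint.h>
#include <string.h>

/* Reference MurmurHash3_x86_32 (Austin Appleby, public domain),
   using the same c1/c2/fmix constants recovered from sub_427630. */
static uint32_t murmur3_x86_32(const void *key, int len, uint32_t seed) {
    const uint8_t *data = (const uint8_t *)key;
    const int nblocks = len / 4;
    const uint32_t c1 = 0xCC9E2D51u, c2 = 0x1B873593u;
    uint32_t h1 = seed;
    for (int i = 0; i < nblocks; i++) {
        uint32_t k1;
        memcpy(&k1, data + 4 * i, 4);       /* little-endian block load */
        k1 *= c1;
        k1 = (k1 << 15) | (k1 >> 17);
        k1 *= c2;
        h1 ^= k1;
        h1 = (h1 << 13) | (h1 >> 19);
        h1 = h1 * 5 + 0xE6546B64u;
    }
    const uint8_t *tail = data + 4 * nblocks;
    uint32_t k1 = 0;
    switch (len & 3) {
    case 3: k1 ^= (uint32_t)tail[2] << 16; /* fallthrough */
    case 2: k1 ^= (uint32_t)tail[1] << 8;  /* fallthrough */
    case 1: k1 ^= tail[0];
            k1 *= c1; k1 = (k1 << 15) | (k1 >> 17); k1 *= c2; h1 ^= k1;
    }
    h1 ^= (uint32_t)len;
    h1 ^= h1 >> 16;  h1 *= 0x85EBCA6Bu;    /* fmix avalanche */
    h1 ^= h1 >> 13;  h1 *= 0xC2B2AE35u;
    h1 ^= h1 >> 16;
    return h1;
}
```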
CFG Hash Map (FNV-1a)
The control flow graph uses a separate hash map implementation based on FNV-1a hashing, distinct from the general-purpose hash map above. Two instances exist per Code Object at offsets +648 (successor edges) and +680 (backedge info).
| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 (-2128831035) |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
Bucket entry: 24 bytes {head, tail, count}. Node: 64 bytes with chain link, key, values, sub-hash data, and cached hash. See CFG for the full CFG hash map specification.
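The byte-by-byte FNV-1a over a block index can be sketched from the recovered parameters. Little-endian byte order is an assumption here (it matches how a 4-byte index sits in memory on x86-64), and the function name is hypothetical:

```c
#include <stdint.h>

/* FNV-1a over a 4-byte block index, using the recovered basis
   0x811C9DC5 and prime 16777619 (0x01000193). */
static uint32_t cfg_fnv1a_bix(uint32_t bix) {
    uint32_t h = 0x811C9DC5u;            /* initial hash */
    for (int i = 0; i < 4; i++) {
        h ^= (bix >> (8 * i)) & 0xFFu;   /* xor byte, then multiply */
        h *= 16777619u;
    }
    return h;
}
```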
Linked List
The linked list (sub_42CA60 prepend, 298 callers; sub_42CC30 length, 48 callers) is a singly-linked list of 16-byte nodes:
ListNode (16 bytes, pool-allocated)
+0 ptr next // pointer to next node (NULL = end)
+8 ptr data // pointer to payload object
Prepend allocates a 16-byte node from the pool, sets node->data = payload, and links it at the list head. This is used for function lists, relocation lists, annotation chains, and many intermediate pass-local collections.
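A minimal analogue of the two list routines, with malloc standing in for the pool allocator (function names are hypothetical):

```c
#include <stdlib.h>

/* 16-byte node layout as recovered. */
typedef struct ListNode {
    struct ListNode *next; /* +0 pointer to next node (NULL = end) */
    void *data;            /* +8 pointer to payload object         */
} ListNode;

/* sub_42CA60 analogue: allocate a node and link it at the head. */
static ListNode *list_prepend(ListNode *head, void *payload) {
    ListNode *n = (ListNode *)malloc(sizeof *n);
    n->next = head;
    n->data = payload;
    return n;
}

/* sub_42CC30 analogue: walk the chain and count nodes. */
static int list_length(const ListNode *head) {
    int n = 0;
    for (; head; head = head->next) n++;
    return n;
}
```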
Growable Array (Pool Vector)
Growable arrays appear throughout the PhaseManager and elsewhere. The layout is a triple of {data_ptr, count, capacity}:
PoolVector (24 bytes inline, or embedded in parent struct)
+0 ptr data // pointer to element array
+8 i32 count // current element count
+12 i32 capacity // allocated capacity
Growth strategy (confirmed in the PhaseManager timing records): new_capacity = max(old + old/2 + 1, requested) (1.5x growth factor). Elements are typically 8 bytes (pointers) or 16 bytes (pointer pairs). Reallocation uses sub_424C50 (pool realloc, 27 callers).
The PhaseManager uses this pattern for the phase list (16-byte {phase_ptr, pool_ptr} pairs), the name table (8-byte string pointers), and the timing records (32-byte entries).
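The growth rule reduces to a one-line capacity function (name hypothetical):

```c
#include <stdint.h>

/* 1.5x growth rule as recovered from the PhaseManager's vectors:
   new_capacity = max(old + old/2 + 1, requested). */
static int32_t pv_grow_capacity(int32_t old_cap, int32_t requested) {
    int32_t grown = old_cap + old_cap / 2 + 1;
    return grown > requested ? grown : requested;
}
```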
Knob Value Array
Knob values are stored in a contiguous array of 72-byte slots, accessed at knob_state[9] + 72 * knob_index (where knob_state[9] is offset +72 of the knob state object).
Knob Value Slot (72 bytes)
| Offset | Type | Field |
|---|---|---|
| +0 | u8 | Type tag (0=unset, 1=bool, 2=int, ..., 12=opcode list) |
| +8 | i64 | Integer value / pointer to string / linked list head |
| +16 | i64 | Secondary value (range max, list count, etc.) |
| +24 | i64 | Tertiary value |
| +64 | ptr | Allocator reference |
Supported types:
| Type | Tag | Storage |
|---|---|---|
| Boolean | 1 | Flag at +0 |
| Integer | 2 | Value at +8 |
| Integer+extra | 3 | Value at +8, extra at +12 |
| Integer range | 4 | Min at +8, max at +16 |
| Integer list | 5 | Growable array of ints |
| Float | 6 | float at +8 |
| Double | 7 | double at +8 |
| String | 8/11 | Pointer at +8 |
| When-string | 9 | Linked list of 24-byte condition+value nodes |
| Value-pair list | 10 | Opcode:integer pairs via vtable |
| Opcode list | 12 | Opcode names resolved through vtable |
Knob Descriptor (64 bytes)
Knob descriptors are stored in a table at knob_state+16, with count at knob_state+24:
| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Primary name (ROT13-encoded) |
| +8 | u64 | Primary name length |
| +16 | u32 | Type tag |
| +24 | ... | (reserved) |
| +40 | ptr | Alias name (ROT13-encoded) |
| +48 | u64 | Alias name length |
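Since ROT13 is an involution, a single routine both decodes and re-encodes the stored names. A minimal sketch (the binary's own decoder has not been isolated, so this is illustrative):

```c
/* In-place ROT13 over ASCII letters; non-letters pass through,
   matching how knob names round-trip through the descriptor table. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'a' && *s <= 'z')
            *s = 'a' + (*s - 'a' + 13) % 26;
        else if (*s >= 'A' && *s <= 'Z')
            *s = 'A' + (*s - 'A' + 13) % 26;
    }
}
```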
Stream Object
The output stream used for diagnostics and stats reporting (e.g., at compilation context +1440) is a C++ iostream-like object with operator overloads. Field layout (from sub_7FE5D0 and sub_7FECA0):
| Offset | Type | Field |
|---|---|---|
| +0 | vtable* | vtable (dispatch for actual I/O) |
| +8 | u32 | width |
| +12 | u32 | precision |
| +16 | u64 | char_count |
| +24 | ptr | format_buffer |
| +56 | u32 | flags (bit 0=hex, bit 1=oct, bit 2=left-align, bit 3=uppercase, bits 7-8=sign) |
ORI Record Serializer (sub_A50650)
The ORI Record Serializer (sub_A50650, 74 KB, 2,728 decompiled lines) is the central function that takes a Code Object's in-memory state and flattens it into a linear output buffer organized as a table of typed section records. It is the serialization backbone for both the DUMPIR diagnostic subsystem and the compilation output path. Despite the _ORI_ string it contains, it is not an optimization pass -- it is infrastructure.
| Address | 0xA50650 |
| Size | ~74 KB |
| Identity | CodeObject::EmitRecords |
| Confidence | 0.90 |
| Called from | sub_A53840 (wrapper), sub_AACBF0 / sub_AAD2A0 (DUMPIR diagnostic path) |
| Calls | sub_A4BC60 (register serializer, new format), sub_A4D3F0 (legacy format), sub_A4B8F0 (register count annotation), sub_A47330 + sub_A474F0 (multi-section finalization), sub_1730890 / sub_17308C0 / sub_17309A0 (scheduling serializers), sub_1730FE0 (register file map) |
Parameters
a1 is a serialization state object ("OriRecordContext") that carries the section table, compilation context back-pointer, and per-subsection index/size pairs. a2 is the output buffer write cursor, advanced as data is emitted.
Key fields on a1:
| Offset | Type | Field | Evidence |
|---|---|---|---|
| +8 | ptr | compilation_ctx | Dereferenced to reach sm_backend at +1584 |
| +24 | i32 | header_section_idx | v5 + 32 * (*(a1+24) + 1) |
| +72 | ptr | section_table | Array of 32-byte section entries |
| +180 | u32 | instr_counter_1 | Reset to 0 at entry |
| +472 | u8 | has_debug_info | Gates debug section emission |
| +916 | i32 | multi_section_count | > 0 triggers link-record emission and tail call to sub_A47330 |
| +1102 | u8 | multi_section_enabled | Master flag for multi-section mode |
| +1120 | ptr | scheduling_ctx | Scheduling context for barrier/scope serialization |
Section Record Format
Each section occupies a 32-byte entry in the table at *(a1+72) + 32 * section_index:
Offset Type Field
+0 u16 type_tag section type identifier
+4 u32 data_size byte size of data payload
+8 ptr data_ptr pointer to data in output buffer
+16 u32 element_count number of elements (or auxiliary metadata)
+20 u32 aux_field additional per-type context
+24 u32 aux_field_2 secondary per-type context
Data payloads are 16-byte aligned: cursor += (size + 15) & ~0xF.
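The cursor-advance rule just stated, as a helper (align16 is this wiki's shorthand, not a symbol from the binary):

```c
#include <stdint.h>

/* 16-byte alignment of the serializer's write cursor. */
static inline uint64_t align16(uint64_t n) {
    return (n + 15) & ~(uint64_t)0xF;
}
```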
Section Type Tag Catalog
The serializer emits up to 56 unique section types across three tag ranges.
Base types (0x01--0x58):
| Tag | Hex | Content | Evidence |
|---|---|---|---|
| 1 | 0x01 | Instruction stream (register-allocated code body) | Emitted via sub_A4BC60 or sub_A4D3F0 |
| 3 | 0x03 | Virtual-dispatch section (vtable+48 on state obj) | Conditional on *(a1+64) > 0 |
| 16 | 0x10 | Source operand bank (v7[199] entries at v7+97) | *(entry+48) = v7[199] |
| 17 | 0x11 | Destination operand bank (bit-packed from v7+203) | Conditional on !v7[1414] |
| 19 | 0x13 | Annotation stream | *(a1+232) counter |
| 34 | 0x22 | Original-definition name table (_ORI_ prefixed) | strcpy(v50, "_ORI_") at line 1762 |
| 35 | 0x23 | Instruction info snapshot (340 bytes from v7+4) | qmemcpy of 340 bytes |
| 46 | 0x2E | Texture/surface binding table | v7[248] entries, 16 bytes each |
| 50 | 0x32 | Live range interval table (spill map) | From compilation context +984 |
| 51 | 0x33 | Register file occupancy table | *(ctx+1424) & 4 |
| 53 | 0x35 | Source operand type bitmap (4-bit per operand) | v7[131] operands, 20-byte stride |
| 54 | 0x36 | Destination operand type bitmap | v7[134] operands, 20-byte stride |
| 55 | 0x37 | Scheduling barrier data | via sub_1730890 |
| 56 | 0x38 | Register file mapping | via sub_1730FE0 |
| 58 | 0x3A | Scheduling dependency graph | via sub_17309A0 |
| 59 | 0x3B | Multi-section link record | Conditional on *(a1+1102) |
| 64 | 0x40 | External reference (from ctx+2120) | Pointer stored, no data copy |
| 68 | 0x44 | Performance counter section | *(a1+932) counter |
| 70 | 0x46 | Spill/fill metadata | v7[408] |
| 71 | 0x47 | Call graph edge table | From v7+61, linked list traversal |
| 73 | 0x49 | Codegen context snapshot | From ctx+932 register allocation state |
| 80 | 0x50 | Hash table section | v7+207/208, hash bucket traversal |
| 81 | 0x51 | Extended call info | From v7+84 |
| 83 | 0x53 | Convergence scope data | via sub_17308C0 |
| 85 | 0x55 | Register geometry record (banks, warps, lanes) | From ctx+1600, writes bank/warp/lane counts |
| 88 | 0x58 | Extended scheduling annotations | Conditional on *(a1+1088) > 0 |
Extended types (0x1208--0x1221): Emitted only when *(char*)(ctx+1412) < 0, which enables the full post-register-allocation diagnostic mode. These 16 types carry per-register-class live range and operand definition data:
| Tag | Hex | Content |
|---|---|---|
| 4616 | 0x1208 | Extended operand class 0 |
| 4617--4623 | 0x1209--0x120F | Extended operand classes 1--7 |
| 4624 | 0x1210 | Block-level operand summary |
| 4625 | 0x1211 | Live-in vector (12 bytes/element, count at *(a1+668)) |
| 4626 | 0x1212 | Live-out vector (12 bytes/element) |
| 4627 | 0x1213 | Extended operand class 8 |
| 4628--4629 | 0x1214--0x1215 | Extended operand classes 9--10 |
| 4630 | 0x1216 | Memory space descriptor (SM arch > 0x4FFF) |
| 4631 | 0x1217 | Extended scheduling flag (SM arch > 0x4FFF) |
| 4632 | 0x1218 | Instruction hash (ctx+1386 bit 3) |
| 4633 | 0x1219 | Annotation metadata |
| 4640 | 0x1220 | Extended section metadata |
| 4641 | 0x1221 | Optimization level record (from knob system, knob 988) |
The _ORI_ Name Prefix
The _ORI_ string is not a pass name. At line 1762 the serializer iterates the linked list at v7+55 (the original-definition chain maintained for rematerialization debugging) and for each entry creates a string "_ORI_<original_name>":
// Line 1748-1770 (simplified)
for (def = v7->original_defs; def; def = def->next) {
entry = §ion_table[16 * (state->instr_offset + idx)];
entry->type_tag = 34; // original-definition name
entry->data_ptr = cursor;
strcpy(cursor, "_ORI_");
strcpy(cursor + 5, def->name);
cursor += align16(strlen(def->name) + 21);
}
These names are consumed by the register allocation verifier (sub_A55D80) when it compares pre- and post-allocation reaching definitions. A mismatch triggers the "REMATERIALIZATION PROBLEM" diagnostic (string at 0xa55dd8), which lists original definitions under their _ORI_ names alongside the post-allocation state.
Wrapper: sub_A53840
sub_A53840 (48 lines) is a thin wrapper that:
- Emits a type-44 header record if *(ctx+1600)[1193] is set (scheduling metadata header)
- Calls sub_A50650 with the output buffer
- Optionally emits a type-62 trailer record if *(ctx+1600)[48] is set
This wrapper is the typical entry point reached through vtable dispatch.
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_A3B080 | ~700 B | multiple | Code Object constructor |
| sub_A3A7E0 | ~700 B | 1 | Stats emitter (per-function profile) |
| sub_A4B8F0 | ~250 B | 1 | Register count / annotation writer |
| sub_A50650 | ~74 KB | 8 | ORI Record Serializer (CodeObject::EmitRecords) |
| sub_A53840 | ~400 B | 1 | EmitRecords wrapper (adds type-44 header) |
| sub_424070 | 2,098 B | 3,809 | Pool allocator (alloc) |
| sub_4248B0 | 923 B | 1,215 | Pool deallocator (free) |
| sub_424C50 | 488 B | 27 | Pool reallocator (realloc) |
| sub_426150 | ~1.2 KB | 2,800 | Hash map insert |
| sub_426D60 | 345 B | 422 | Hash map lookup |
| sub_426EC0 | 349 B | 29 | Hash map contains |
| sub_425CA0 | 114 B | 127 | Hash map constructor |
| sub_425D20 | 121 B | 63 | Hash map destructor |
| sub_42CA60 | 81 B | 298 | Linked list prepend |
| sub_42CC30 | 34 B | 48 | Linked list length |
| sub_427630 | 273 B | 73 | MurmurHash3 string hash |
| sub_621480 | 21 KB | low | Symbol table builder |
| sub_625800 | 27 KB | low | Symbol resolution |
| sub_6BC560 | 4.9 KB | low | Constant bank handler |
| sub_6294E0 | 12.1 KB | low | Reserved shared memory management |
| sub_6C4DA0 | 15 KB | low | Shared memory intrinsic lowering |
| sub_7FD6C0 | ~800 B | 3 | ELF symbol emitter |
| sub_7FB6C0 | ~800 B | 1 | Pipeline orchestrator (context cleanup) |
| sub_7FBB70 | ~100 B | 1 | Per-kernel entry point |
| sub_663C30 | ~300 B | 1 | Compilation loop body |
| sub_662920 | varies | 1 | Global initialization (calls KnobsInit) |
Related Pages
- Ori IR Overview -- top-level IR design, Code Object field summary
- Instructions -- detailed instruction format and encoding
- CFG -- FNV-1a hash map CFG implementation
- Registers -- register descriptor layout
- Phase Manager -- PhaseManager object layout, phase dispatch
- Memory Pool Allocator -- full allocator internals
- Hash Tables & Bitvectors -- hash map and bitvector details
- Knobs System -- knob descriptors, value types, ROT13 encoding
- Entry Point & CLI -- compilation driver, options block
Pass Inventory & Ordering
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- phase skipping is handled per-phase via isNoOp() virtual method overrides, not by reordering the table. This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.
All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |
Phase Categories
Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.
| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |
Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.
Numbering Discrepancy
Warning: The phase numbers 0--138 on this page use a compressed numbering scheme established before the full 159-entry name table was discovered (P2-14). The true static name table at off_22BD0C0 contains 159 entries indexed 0--158, and 16 of the 20 newly-discovered names occupy indices within the 0--138 range. In the true table, these 16 entries sit at their listed indices and all subsequent phases shift up, so the wiki's compressed numbering diverges from the true binary indices starting around phase 8. Phases 139--158 are correctly numbered (they match the true static table indices). A full renumbering of phases 0--138 to match the true binary indices is deferred as a separate task because it would affect cross-references across 40+ wiki pages.
The 16 omitted name table entries (with their true static table indices) are:
| True Index | Name | Category | Relationship to Wiki |
|---|---|---|---|
| 22 | OriCopyProp | Optimization | Sub-pass within all 6 GeneralOptimize bundles; also injected into Mercury pipeline |
| 32 | OptimizeNaNOrZero | Optimization | Standalone NaN/zero folding pass; not documented under current wiki numbering |
| 37 | ConvertMemoryToRegisterOrUniform | Optimization | Sub-pass of GeneralOptimizeMid; gated by knob 487; sub_910840 |
| 41 | Vectorization | Optimization | Load/store vectorization; gated by DisableReadVectorization/DisableWriteVectorization knobs |
| 57 | OriCommoning | Optimization | Commoning sub-pass; related to LateOriCommoning (wiki phase 64) |
| 69 | OriSimpleLiveDead | Optimization | Liveness/DCE sub-pass; related to EarlyOriSimpleLiveDead (wiki phase 10) |
| 73 | LateVectorization | Optimization | Late vectorization (2nd instance, after optimization exposes new opportunities) |
| 77 | SinkCodeIntoBlock | Optimization | Code sinking; sub_78DB70; DisablePhases=SinkCodeIntoBlock gate |
| 103 | LateEnforceArgumentRestrictions | Lowering | Late counterpart to EnforceArgumentRestrictions (wiki phase 48) |
| 114 | ScheduleInstructions | Scheduling | Worker for AdvancedPhasePreSched; sub_8D0640 (22 KB) |
| 115 | UpdateAfterScheduleInstructions | Cleanup | IR metadata refresh after scheduling completes |
| 118 | UpdateAfterOriDoSyncronization | Cleanup | IR metadata refresh after sync insertion (wiki phase 99) |
| 120 | ReportBeforeRegisterAllocation | Reporting | DUMPIR target; diagnostic dump before register allocation |
| 122 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg; canonical allocator entry |
| 124 | UpdateAfterOriAllocateRegisters | Cleanup | IR metadata refresh after register allocation |
| 127 | PostExpansion | Lowering | Worker for AdvancedPhasePostExpansion; post-RA expansion |
All 16 are valid DUMPIR targets (resolvable through sub_C641D0 binary search over the phase name table). Several are also valid DisablePhases targets.
Gate Passes (AdvancedPhase)
Seventeen phases are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.
Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.
The naming convention is consistent: the AdvancedPhase prefix followed by the pipeline position or action name. The one exception is AdvancedScoreboardsAndOpexes (phase 115), which uses the Advanced prefix without Phase.
Gate Pass Worker Correspondence
Several gate passes dispatch to named worker functions when activated by a backend. The worker names appear in the static name table and are valid DUMPIR/NamedPhases targets:
| Gate Pass (Wiki #) | Worker Function (True Table Index) | Evidence |
|---|---|---|
| AdvancedPhasePreSched (97) | ScheduleInstructions [114] | sub_8D0640, string "ScheduleInstructions" |
| AdvancedPhaseAllocReg (101) | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvancedPhasePostExpansion (104) | PostExpansion [127] | Post-RA expansion dispatch |
| AdvancedPhasePostFixUp (111) | PostFixUp [140] | Target vtable+0x148 dispatch |
See Optimization Levels for per-gate activation rules.
Update Passes
Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):
| Phase | Name | Refreshes |
|---|---|---|
| 76 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after late unsupported-op expansion |
| 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |
These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.
Report Passes
Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):
| Phase | Name | Output |
|---|---|---|
| 9 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |
Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).
GeneralOptimize Bundles
The GeneralOptimize* passes are compound optimization bundles that run several small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until a full sweep produces no changes. They appear at six positions in the pipeline, re-cleaning the IR after each group of major transformations:
| Phase | Name | Position |
|---|---|---|
| 13 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | GeneralOptimize | After early loop/branch optimizations |
| 37 | GeneralOptimizeMid | After mid-level transformations |
| 46 | GeneralOptimizeMid2 | After VTG/CTA/mbarrier expansion |
| 58 | GeneralOptimizeLate | After late expansion |
| 65 | GeneralOptimizeLate2 | After predication and late commoning |
See GeneralOptimize Bundles for the sub-pass decomposition.
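The fixed-point structure of these bundles can be sketched with trivial stand-in sub-passes. Everything below is illustrative: the two sub-passes are toys, and only the driver shape (sweep all sub-passes, repeat while any reports a change) reflects the recovered behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using SubPass = std::function<bool(std::vector<int>&)>;  // returns "changed?"

// Stand-in "algebraic simplify": halve every value greater than 1.
static bool halveLarge(std::vector<int>& ir) {
    bool changed = false;
    for (int& v : ir)
        if (v > 1) { v /= 2; changed = true; }
    return changed;
}

// Stand-in "DCE": remove zero-valued entries.
static bool dropZeros(std::vector<int>& ir) {
    size_t before = ir.size();
    ir.erase(std::remove(ir.begin(), ir.end(), 0), ir.end());
    return ir.size() != before;
}

// Fixed-point driver: sweep all sub-passes until one full sweep makes
// no change. Returns the sweep count (the last sweep proves quiescence).
int runToFixedPoint(std::vector<int>& ir, const std::vector<SubPass>& passes) {
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& p : passes)
            changed |= p(ir);
        ++sweeps;
    }
    return sweeps;
}
```

Running one sub-pass can expose work for another (here, simplification eventually produces values DCE could remove), which is why the bundle iterates rather than running each sub-pass once.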
O-Level Gating
Twenty-two phases have confirmed optimization-level gates. The O-Level column in the stage tables below annotates every phase whose activation threshold has been verified from decompiled isNoOp() methods or execute-function guards. Phases without an O-Level annotation run at all optimization levels (O0--O5). Threshold notation: > N means the phase requires opt_level > N; == 0 means the phase is active only at O0.
See Optimization Levels for the complete per-phase activation table, the O-level accessor (sub_7DDB50), and the NvOpt recipe system.
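A minimal model of the threshold checks: each gated phase reads the compilation context's opt level through one accessor (the role recovered for sub_7DDB50) and compares it against a hardcoded threshold. The `Context` layout and function names here are ours.

```cpp
#include <cassert>

struct Context { int opt_level; };                 // illustrative layout
int getOptLevel(const Context& ctx) { return ctx.opt_level; }

// "> 0": e.g. DoSwitchOptFirst (phase 14)
bool switchOptActive(const Context& c)      { return getOptLevel(c) > 0; }
// "> 1": e.g. GvnCse (phase 49)
bool gvnCseActive(const Context& c)         { return getOptLevel(c) > 1; }
// "> 2": e.g. LateExpandSyncInstructions (phase 72)
bool lateExpandSyncActive(const Context& c) { return getOptLevel(c) > 2; }
// "== 0": e.g. ProcessO0WaitsAndSBs (phase 116)
bool o0ScoreboardsActive(const Context& c)  { return getOptLevel(c) == 0; }
```

At -O1, for example, the first gate opens while the second and third stay closed, matching the per-phase thresholds in the tables below.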
Complete 159-Phase Table
Stage 1 -- Initial Setup (Phases 0--13)
Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | Validation | Validates structural correctness of the initial Ori IR after PTX lowering | ||
| 1 | ApplyNvOptRecipes | Optimization | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | ||
| 2 | PromoteFP16 | Lowering | Promotes FP16 operations to FP32 where hardware lacks native support | ||
| 3 | AnalyzeControlFlow | Analysis | Builds the CFG: identifies loops, dominators, back edges | ||
| 4 | AdvancedPhaseBeforeConvUnSup | Gate | Hook before unsupported-op conversion; no-op by default | ||
| 5 | ConvertUnsupportedOps | Lowering | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization | |
| 6 | SetControlFlowOpLastInBB | Cleanup | Ensures control flow instructions are the final instruction in each basic block | ||
| 7 | AdvancedPhaseAfterConvUnSup | Gate | Hook after unsupported-op conversion; no-op by default | ||
| 8 | OriCreateMacroInsts | Lowering | Expands PTX-level macro instructions into Ori instruction sequences | ||
| 9 | ReportInitialRepresentation | Reporting | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | ||
| 10 | EarlyOriSimpleLiveDead | Optimization | Quick early dead code elimination pass | Liveness | |
| 11 | ReplaceUniformsWithImm | Optimization | Replaces uniform register reads with immediate constants where value is known | Uniform Regs | |
| 12 | OriSanitize | Validation | Validates IR consistency after initial setup transformations | ||
| 13 | GeneralOptimizeEarly | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |
Stage 2 -- Early Optimization (Phases 14--32)
Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 14 | DoSwitchOptFirst | Optimization | > 0 | Optimizes switch statements: jump table generation, case clustering (1st pass) | Branch & Switch |
| 15 | OriBranchOpt | Optimization | > 0 | Branch folding, unreachable block elimination, conditional branch simplification | Branch & Switch |
| 16 | OriPerformLiveDeadFirst | Analysis | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness | |
| 17 | OptimizeBindlessHeaderLoads | Optimization | Hoists and deduplicates bindless texture header loads | ||
| 18 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | OriSplitLiveRanges | Optimization | Splits live ranges at loop boundaries to reduce register pressure | Liveness | |
| 20 | PerformPGO | Optimization | Applies profile-guided optimization data (block weights, branch probabilities) | ||
| 21 | OriStrengthReduce | Optimization | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction | |
| 22 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | GenerateMovPhi | Lowering | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | ||
| 24 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | StageAndFence | Lowering | Inserts memory fence and staging instructions for coherence | Sync & Barriers | |
| 26 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | AnalyzeUniformsForSpeculation | Analysis | Identifies uniform values safe for speculative execution | Uniform Regs | |
| 28 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | GeneralOptimize | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize | |
| 30 | DoSwitchOptSecond | Optimization | > 0 | Second switch optimization pass after loop/branch transformations | Branch & Switch |
| 31 | OriLinearReplacement | Optimization | Replaces branch-heavy patterns with linear (branchless) sequences | ||
| 32 | CompactLocalMemory | Optimization | Compacts local memory allocations by eliminating dead slots and reordering |
Stage 3 -- Mid-Level Optimization (Phases 33--52)
GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 33 | OriPerformLiveDeadSecond | Analysis | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness | |
| 34 | ExtractShaderConstsFirst | Optimization | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | ||
| 35 | OriHoistInvariantsEarly | Optimization | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes | |
| 36 | EmitPSI | Lowering | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | ||
| 37 | GeneralOptimizeMid | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize | |
| 38 | OptimizeNestedCondBranches | Optimization | > 0 | Simplifies nested conditional branches into flatter control flow | Branch & Switch |
| 39 | ConvertVTGReadWrite | Lowering | Converts vertex/tessellation/geometry shader read/write operations | ||
| 40 | DoVirtualCTAExpansion | Lowering | Expands virtual CTA operations into physical CTA primitives | ||
| 41 | MarkAdditionalColdBlocks | Analysis | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold | |
| 42 | ExpandMbarrier | Lowering | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers | |
| 43 | ForwardProgress | Lowering | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | ||
| 44 | OptimizeUniformAtomic | Optimization | Converts thread-uniform atomic operations into warp-level reductions | ||
| 45 | MidExpansion | Lowering | Target-dependent mid-level expansion of operations before register allocation | Late Legalization | |
| 46 | GeneralOptimizeMid2 | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize | |
| 47 | AdvancedPhaseEarlyEnforceArgs | Gate | Hook before argument enforcement; no-op by default | ||
| 48 | EnforceArgumentRestrictions | Lowering | Enforces ABI restrictions on function arguments (register classes, alignment) | ||
| 49 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | OriReassociateAndCommon | Optimization | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE | |
| 51 | ExtractShaderConstsFinal | Optimization | Final shader constant extraction pass (after GVN may expose new constants) | ||
| 52 | OriReplaceEquivMultiDefMov | Optimization | Eliminates redundant multi-definition move instructions with equivalent sources |
Stage 4 -- Late Optimization (Phases 53--77)
Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 53 | OriPropagateVaryingFirst | Optimization | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | ||
| 54 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | LateExpansion | Lowering | Expands operations that must be lowered after high-level optimizations | Late Legalization | |
| 56 | SpeculativeHoistComInsts | Optimization | Speculatively hoists common instructions above branches | ||
| 57 | RemoveASTToDefaultValues | Cleanup | Removes AST (address space type) annotations that have been lowered to defaults | ||
| 58 | GeneralOptimizeLate | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize | |
| 59 | OriLoopFusion | Optimization | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes | |
| 60 | DoVTGMultiViewExpansion | Lowering | Expands multi-view operations for vertex/tessellation/geometry shaders | ||
| 61 | OriPerformLiveDeadThird | Analysis | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness | |
| 62 | OriRemoveRedundantMultiDefMov | Optimization | Removes dead multi-definition move instructions | ||
| 63 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | LateOriCommoning | Optimization | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE | |
| 65 | GeneralOptimizeLate2 | Optimization | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize | |
| 66 | OriHoistInvariantsLate | Optimization | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes | |
| 67 | DoKillMovement | Optimization | Moves kill annotations closer to last use to improve register pressure | ||
| 68 | DoTexMovement | Optimization | Moves texture fetch instructions to minimize latency exposure | ||
| 69 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | OriPropagateVaryingSecond | Optimization | Propagates varying annotations (2nd pass, after predication changes control flow) | ||
| 71 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | ConvertAllMovPhiToMov | Lowering | Destroys SSA form: converts MOV.PHI instructions into plain MOV | ||
| 74 | ConvertToUniformReg | Optimization | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs | |
| 75 | LateArchOptimizeFirst | Optimization | Architecture-specific late optimizations (1st pass) | ||
| 76 | UpdateAfterOptimize | Cleanup | Rebuilds IR metadata invalidated by the late optimization group | ||
| 77 | AdvancedPhaseLateConvUnSup | Gate | Hook at the late unsupported-op boundary; no-op by default |
Stage 5 -- Legalization (Phases 78--96)
Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 78 | LateExpansionUnsupportedOps | Lowering | Expands remaining unsupported operations after all optimizations | Late Legalization | |
| 79 | OriHoistInvariantsLate2 | Optimization | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes | |
| 80 | ExpandJmxComputation | Lowering | Expands JMX (jump with index computation) pseudo-instructions | ||
| 81 | LateArchOptimizeSecond | Optimization | Architecture-specific late optimizations (2nd pass) | ||
| 82 | AdvancedPhaseBackPropVReg | Gate | Hook before backward copy propagation; no-op by default | ||
| 83 | OriBackCopyPropagate | Optimization | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE | |
| 84 | OriPerformLiveDeadFourth | Analysis | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness | |
| 85 | OriPropagateGmma | Optimization | Propagates WGMMA accumulator values through the IR | GMMA Pipeline | |
| 86 | InsertPseudoUseDefForConvUR | Lowering | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs | |
| 87 | FixupGmmaSequence | Lowering | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline | |
| 88 | OriHoistInvariantsLate3 | Optimization | LICM (late 3rd pass) after GMMA fixup | Loop Passes | |
| 89 | AdvancedPhaseSetRegAttr | Gate | Hook before register attribute setting; no-op by default | ||
| 90 | OriSetRegisterAttr | Analysis | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling | |
| 91 | OriCalcDependantTex | Analysis | Computes texture instruction dependencies for scheduling | ||
| 92 | AdvancedPhaseAfterSetRegAttr | Gate | Hook after register attribute setting; no-op by default | ||
| 93 | LateExpansionUnsupportedOps2 | Lowering | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization | |
| 94 | FinalInspectionPass | Validation | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | ||
| 95 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | ReportBeforeScheduling | Reporting | Dumps IR before scheduling (no-op unless diagnostic options enabled) |
Stage 6 -- Scheduling & Register Allocation (Phases 97--103)
Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 97 | AdvancedPhasePreSched | Gate | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling | |
| 98 | BackPropagateVEC2D | Optimization | Backward-propagates 2D vector register assignments | ||
| 99 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | AdvancedPhaseAllocReg | Gate | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture | |
| 102 | ReportAfterRegisterAllocation | Reporting | Dumps IR after register allocation (no-op unless diagnostic options enabled) | ||
| 103 | Get64bRegComponents | RegAlloc | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |
Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)
Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 104 | AdvancedPhasePostExpansion | Gate | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | ||
| 105 | ApplyPostRegAllocWars | RegAlloc | Fixes write-after-read hazards exposed by register allocation | ||
| 106 | AdvancedPhasePostSched | Gate | Hook after post-scheduling; no-op by default | ||
| 107 | OriRemoveNopCode | Cleanup | Removes NOP instructions and dead code inserted as placeholders | ||
| 108 | OptimizeHotColdInLoop | Optimization | Separates hot and cold paths within loops for cache locality | Hot/Cold | |
| 109 | OptimizeHotColdFlow | Optimization | Separates hot and cold paths at the function level | Hot/Cold | |
| 110 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | AdvancedPhasePostFixUp | Gate | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | ||
| 112 | PlaceBlocksInSourceOrder | Cleanup | Determines final basic block layout in the emitted binary | ||
| 113 | PostFixForMercTargets | Encoding | Fixes up instructions for Mercury encoding requirements | Mercury | |
| 114 | FixUpTexDepBarAndSync | Scheduling | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards | |
| 115 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |
Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
Stage 8 -- Mercury Backend (Phases 117--122)
SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 117 | MercEncodeAndDecode | Encoding | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury | |
| 118 | MercExpandInstructions | Encoding | Expands pseudo-instructions into final SASS instruction sequences | Mercury | |
| 119 | MercGenerateWARs1 | Encoding | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury | |
| 120 | MercGenerateOpex | Encoding | Generates "opex" (operation extension) annotations for each instruction | Mercury | |
| 121 | MercGenerateWARs2 | Encoding | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury | |
| 122 | MercGenerateSassUCode | Encoding | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |
"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121) because instruction expansion in phase 118 can introduce new write-after-read hazards. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).
Stage 9 -- Post-Mercury (Phases 123--131)
Register map computation, diagnostics, debug output.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 123 | ComputeVCallRegUse | RegAlloc | Computes register usage for virtual call sites | ||
| 124 | CalcRegisterMap | RegAlloc | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture | |
| 125 | UpdateAfterPostRegAlloc | Cleanup | Rebuilds IR metadata after post-RA processing | ||
| 126 | ReportFinalMemoryUsage | Reporting | Prints memory pool consumption summary to stderr | ||
| 127 | AdvancedPhaseOriPhaseEncoding | Gate | Phase encoding hook; no-op by default | ||
| 128 | UpdateAfterFormatCodeList | Cleanup | Rebuilds the code list after Mercury encoding reformats instructions | ||
| 129 | DumpNVuCodeText | Reporting | Dumps human-readable SASS text disassembly | ||
| 130 | DumpNVuCodeHex | Reporting | Dumps raw SASS binary as hex | ||
| 131 | DebuggerBreak | Cleanup | Development hook: triggers a debugger breakpoint at this pipeline position |
Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)
Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Cleanup | Rebuilds IR metadata after late unsupported-op conversion | ||
| 133 | MergeEquivalentConditionalFlow | Optimization | Merges basic blocks with equivalent conditional flow (tail merging) | ||
| 134 | AdvancedPhaseAfterMidExpansion | Gate | Hook after mid-level expansion; no-op by default | ||
| 135 | AdvancedPhaseLateExpandSyncInstructions | Gate | Hook for late sync instruction expansion; no-op by default | ||
| 136 | LateMergeEquivalentConditionalFlow | Optimization | Second conditional flow merge pass (catches cases exposed by late transforms) | ||
| 137 | LateExpansionUnsupportedOpsMid | Lowering | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization | |
| 138 | OriSplitHighPressureLiveRanges | RegAlloc | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture | |
| 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0; inserts maximum wait counts at every hazard | Scoreboards |
| 140 | PostFixUp | Cleanup | Target-specific post-fixup dispatch (calls target vtable+0x148) | ||
| 141 | MercConverter | Encoding | Initial Mercury conversion: translates Ori instructions to Mercury format (sub_9F3760) | Mercury | |
| 142 | MercEncodeAndDecode | Encoding | Encode/decode round-trip verification of SASS binary encoding (sub_18F21F0) | Mercury | |
| 143 | MercExpandInstructions | Encoding | Expands Mercury pseudo-instructions into final SASS sequences; gated by ctx+0x570 bit 5 | Mercury | |
| 144 | MercGenerateWARs1 | Encoding | WAR hazard annotation (1st pass, pre-expansion); gated by ctx+0x570 sign bit | Mercury | |
| 145 | MercGenerateOpex | Encoding | Generates operation extension annotations per instruction; gated by ctx+0x570 bit 6 | Mercury | |
| 146 | MercGenerateWARs2 | Encoding | WAR hazard annotation (2nd pass, covers hazards from expansion in phase 143) | Mercury | |
| 147 | MercGenerateSassUCode | Encoding | Final SASS microcode emission: produces the binary bytes for the ELF; gated by ctx+0x571 bit 0 | Mercury | |
| 148 | ComputeVCallRegUse | RegAlloc | Computes register usage for virtual call sites (EIATTR metadata for indirect calls) | ||
| 149 | CalcRegisterMap | RegAlloc | Computes the final physical-to-logical register mapping; gated by ctx+0x590 bit 1 | RegAlloc Architecture | |
| 150 | UpdateAfterPostRegAlloc | Cleanup | Rebuilds IR metadata after post-RA processing (no-op by default, isNoOp=1) | ||
| 151 | ReportFinalMemoryUsage | Reporting | Prints memory pool consumption summary (no-op by default, isNoOp=1) | ||
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | Phase encoding gate; when active, sets ctx+0x610 (pipeline_progress) = 0x15 (21) to mark encoding boundary | ||
| 153 | FormatCodeList | Encoding | Formats the instruction list for ELF output; dispatches through ctx+0x648 vtable+0x10 | Mercury | |
| 154 | UpdateAfterFormatCodeList | Cleanup | Rebuilds IR data structures after FormatCodeList reformats instructions (no-op by default, isNoOp=1) | ||
| 155 | DumpNVuCodeText | Reporting | Dumps human-readable SASS text disassembly; guarded by ctx+0x598 > 0 and ctx+0x740 non-null | ||
| 156 | DumpNVuCodeHex | Reporting | Dumps raw SASS binary as hex; same guard as phase 155 | ||
| 157 | DebuggerBreak | Cleanup | Development hook: convenient breakpoint location for pipeline debugging (empty body in release) | ||
| 158 | NOP | Cleanup | Terminal no-op sentinel; final phase in the 159-phase pipeline |
Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride). All 20 have names in the static table at off_22BD0C0 (159 entries, not 139). The vtable slot at +16 is isNoOp() (returns 0 for active phases, 1 for phases skipped by default); name resolution goes through the static table indexed by getIndex() at +8.
The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing backends to selectively enable/disable encoding passes. WAR generation runs in two passes (144, 146) bracketing instruction expansion (143) because expansion can introduce new write-after-read hazards.
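The flag-bit gating can be modeled as bit tests on two context bytes. Offsets and bit positions follow the table above; the `Ctx` struct is a stand-in for the real (much larger) context, and reading the "sign bit" of ctx+0x570 as bit 7 of a byte-sized flag is our assumption.

```cpp
#include <cassert>
#include <cstdint>

// Model context: one flag byte at +0x570 and one at +0x571.
struct Ctx { uint8_t flags570; uint8_t flags571; };

bool mercExpandActive(const Ctx& c) { return (c.flags570 >> 5) & 1; } // phase 143
bool mercWARs1Active(const Ctx& c)  { return (c.flags570 >> 7) & 1; } // phase 144
bool mercOpexActive(const Ctx& c)   { return (c.flags570 >> 6) & 1; } // phase 145
bool mercUCodeActive(const Ctx& c)  { return (c.flags571 >> 0) & 1; } // phase 147
```

Because each encoding phase tests its own bit, a backend can, for example, emit SASS text without producing final microcode by setting some flags and not others.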
Pipeline Ordering Notes
Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.
Identity ordering. The default ordering table at 0x22BEEA0 is an identity mapping: exec[N] = factory[N] for all 159 phases. The factory index IS the execution order. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below). In the true 159-phase table, phases execute in strict index order 0--158.
Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:
| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |
Cross-References
- Optimization Pipeline -- pipeline infrastructure, PhaseManager data structures, dispatch loop
- Phase Manager Infrastructure -- PhaseManager object layout, constructor, destructor, factory switch
- GeneralOptimize Bundles -- sub-pass decomposition of compound optimization passes
- Branch & Switch Optimization -- phases 14, 15, 30, 38
- Loop Passes -- phases 18, 22, 24, 35, 59, 66, 79, 88
- Strength Reduction -- phase 21
- Copy Propagation & CSE -- phases 49, 50, 64, 83
- Predication -- phase 63
- Rematerialization -- phases 28, 54, 69
- Liveness Analysis -- phases 10, 16, 19, 33, 61, 84
- Synchronization & Barriers -- phases 25, 26, 42, 71, 72, 99, 100, 114
- Hot/Cold Partitioning -- phases 41, 108, 109
- GMMA/WGMMA Pipeline -- phases 85, 87
- Uniform Register Optimization -- phases 11, 27, 74, 86
- Late Expansion & Legalization -- phases 5, 45, 55, 78, 93, 137
- Register Allocator Architecture -- phases 101, 103, 105, 123, 124, 138, 148, 149
- Scheduler Architecture -- phases 90, 97--100, 110
- Scoreboards & Dependency Barriers -- phases 114, 115, 116
- Mercury Encoder -- phases 113, 117--122, 141--147, 153
- Optimization Levels -- O-level gating of passes
- DUMPIR & NamedPhases -- user-specified phase targeting and reordering
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
sub_7DDB50 | 232B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
sub_A36360 | 52KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
sub_A23CF0 | 54KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
sub_9F1A90 | 35KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |
Phase Manager Infrastructure
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The PhaseManager is the central orchestration layer in ptxas. It owns the entire 159-phase optimization and code generation pipeline, constructs each phase as a polymorphic object via an abstract factory, and drives execution through a virtual dispatch loop. Every compilation unit passes through the same PhaseManager sequence: construct all 159 phase objects, iterate the phase index array calling execute() on each, optionally collect per-phase timing and memory statistics, then tear down. The PhaseManager also hosts an optional NvOptRecipe sub-manager (440 bytes) for architecture-specific "advanced phase" hooks that inject additional processing at 16 defined points in the pipeline.
The design is a textbook Strategy + Abstract Factory pattern: a 159-case switch statement maps phase indices to vtable pointers, each vtable provides execute(), isNoOp(), and getName() virtual methods, and the dispatch loop iterates a flat index array that defines execution order. This makes the pipeline fully data-driven -- reordering, disabling, or injecting phases requires only modifying the index array, not the dispatch logic.
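As a sketch of this data-driven structure (all class and function names here are illustrative, not recovered symbols): a factory switch maps indices to polymorphic phase objects, and the index array alone determines execution order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal model of the data-driven pipeline: phases are polymorphic objects
// built by a factory switch; execution order is just an index array.
struct Phase { virtual void execute(std::string& log) = 0; virtual ~Phase() {} };
struct PhaseA : Phase { void execute(std::string& log) override { log += "A"; } };
struct PhaseB : Phase { void execute(std::string& log) override { log += "B"; } };

// Stands in for the 159-case factory switch (sub_C60D30).
Phase* factory(int idx) {
    switch (idx) { case 0: return new PhaseA; case 1: return new PhaseB; default: return nullptr; }
}

// Stands in for the dispatch loop (sub_C64F70): iterate the order array.
std::string run_pipeline(const std::vector<int>& order) {
    std::string log;
    for (int idx : order) { Phase* p = factory(idx); p->execute(log); delete p; }
    return log;
}
```

Reordering or disabling phases means editing only the vector passed to run_pipeline, never the loop itself -- exactly the property the identity table at 0x22BEEA0 exploits.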
| Core range | 0xC60000--0xC66000 (13 functions, ~17.5 KB) |
| Constructor | sub_C62720 (4,734 bytes) |
| Destructor | sub_C61B20 (1,753 bytes) |
| Phase factory | sub_C60D30 (3,554 bytes, 159-case switch) |
| Dispatch loop | sub_C64F70 (1,455 bytes) |
| Name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Timing reporter | sub_C64310 (3,168 bytes) |
| Pool reporter | sub_C62200 (888 bytes) |
| Total phases | 159 (139 explicitly named + 20 arch-specific) |
| AdvancedPhase hooks | 16 no-op-by-default insertion points |
| Default phase table | Static array at 0x22BEEA0 (returned by sub_C60D20) |
| Phase name table | Static array at off_22BD0C0 (159 string pointers) |
| Vtable range | off_22BD5C8 (phase 0) through off_22BEE78 (phase 158) |
| Callers | sub_7FB6C0 (main compilation driver), sub_9F63D0 (library/ftrace entry) |
PhaseManager Object Layout
The PhaseManager is a plain C++ object (no vtable of its own) allocated by the compilation driver. Minimum size is 112 bytes, though the full extent depends on whether timing and NvOptRecipe are enabled.
PhaseManager (112+ bytes)
+0 int64 compilation_unit // back-pointer to owning compilation unit
+8 int64* allocator // pool allocator (from compilation_unit->field_16)
+16 void* sorted_name_table // sorted {name_ptr, index} pairs for binary search
+24 int sorted_name_count
+28 int sorted_name_capacity
+32 int64* allocator2 // copy of allocator (for phase list ops)
+40 void* phase_list // array of 16-byte {phase_ptr, pool_ptr} pairs
+48 int phase_list_count // always 159 after construction
+52 int phase_list_capacity
+56 int64 nvopt_recipe_ptr // NvOptRecipe sub-manager, or NULL
+64 int64 (reserved)
+72 bool timing_enabled // set from compilation_unit->config->options[17928]
+76 int (flags/padding)
+80 bool flag_byte // initialized to 1, reset after first timing report
+88 int64* timing_allocator
+96 void* phase_name_raw_table // 159 name string pointers, copied from off_22BD0C0
+104 int phase_name_raw_count
+108 int phase_name_raw_capacity
The two allocator fields (+8 and +32) both point to the same pool allocator extracted from the compilation unit, but are used in different contexts: +8 for name table operations, +32 for phase list operations.
Phase Object Model
Each phase is a 16-byte polymorphic object:
Phase (16 bytes)
+0 vtable* // points to one of 159 vtable instances
+8 void* // pool pointer (memory pool for phase-local allocations)
The vtable provides the interface contract:
| Vtable offset | Method | Signature |
|---|---|---|
+0 | execute | void execute(phase*, compilation_context*) |
+8 | isNoOp | bool isNoOp(phase*) -- returns true to skip execution |
+16 | getName | int getName(phase*) -- returns index into name table |
+24 | alloc | void* alloc(pool*, size_t) -- pool allocator |
+32 | free | void free(pool*, void*) -- pool deallocator |
The vtable addresses span off_22BD5C8 (phase 0) through off_22BEE78 (phase 158), with a stride of 0x28 (40 bytes) between consecutive entries. All vtables reside in .data.rel.ro.
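A hypothetical C++ view of the recovered layout -- the struct and member names are ours; only the sizes and offsets come from the binary. Note that five 8-byte slots give exactly the observed 0x28 vtable stride:

```cpp
#include <cassert>
#include <cstddef>

struct PhaseObject;
struct Pool;
struct CompilationContext;

// 5-slot vtable, matching the interface contract table above.
struct PhaseVtable {
    void  (*execute)(PhaseObject*, CompilationContext*); // vtable +0
    bool  (*isNoOp)(PhaseObject*);                       // vtable +8
    int   (*getName)(PhaseObject*);                      // vtable +16
    void* (*alloc)(Pool*, std::size_t);                  // vtable +24
    void  (*release)(Pool*, void*);                      // vtable +32
};

// 16-byte phase object: vtable pointer + pool pointer.
struct PhaseObject {
    const PhaseVtable* vtable; // +0: one of 159 vtables in .data.rel.ro
    Pool*              pool;   // +8: pool for phase-local allocations
};
```

On x86-64, sizeof(PhaseVtable) is 40 bytes (0x28), consistent with the stride between off_22BD5C8 and off_22BEE78 entries.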
Phase Factory -- sub_C60D30
The factory is a 159-case switch statement that serves as the sole point of phase instantiation. For each case:
- Extracts the pool allocator from context->field_16
- Allocates 16 bytes via pool_alloc (vtable offset +24)
- Writes the case-specific vtable pointer at offset +0
- Returns a {phase_ptr, pool_ptr} pair
The default case returns {NULL, NULL}, which the caller treats as an invalid phase index.
// Pseudocode for sub_C60D30
pair<phase*, pool*> PhaseFactory(int phase_index, context* ctx) {
pool* p = ctx->allocator;
phase* obj = p->alloc(16);
switch (phase_index) {
case 0: obj->vtable = off_22BD5C8; break; // OriCheckInitialProgram
case 1: obj->vtable = off_22BD5F0; break; // ApplyNvOptRecipes
case 2: obj->vtable = off_22BD618; break; // PromoteFP16
// ... 156 more cases ...
case 158: obj->vtable = off_22BEE78; break; // sentinel/NOP
default: return {NULL, NULL};
}
return {obj, p};
}
Called exclusively by the constructor (sub_C62720).
Construction Sequence -- sub_C62720
The constructor performs 11 steps, building all internal data structures and instantiating every phase:
// Pseudocode for sub_C62720
bool PhaseManager::construct(compilation_unit* cu) {
this->cu = cu;
this->allocator = cu->field_16; // extract pool allocator
this->allocator2 = cu->field_16;
// 1. Check timing flag
this->timing_enabled = cu->config->options[17928];
// 2. Allocate and copy phase name table (1272 = 159 * 8 bytes)
this->phase_name_raw_table = alloc(1272);
memcpy(this->phase_name_raw_table, off_22BD0C0, 1272);
this->phase_name_raw_count = 159;
this->phase_name_raw_capacity = 159;
// 3. Initialize timing records
resize_timing(/*capacity=*/159); // sub_C62580
cu->timing_count++; // at cu+1576
append_timing({index=-1, name=0x2030007, time=0, flags=0}); // sentinel
// 4. Create all 159 phase objects
resize_phase_list(/*capacity=*/159); // sub_C62640
for (int i = 0; i < 159; i++) {
auto [phase, pool] = PhaseFactory(i, cu); // sub_C60D30
phase_list[i] = {phase, pool};
}
// 5. Optionally create NvOptRecipe sub-manager
if (cu->config->getOption(391)) {
auto* recipe = alloc(440);
// initialize hash table, ref-counted lists, timing arrays (8 entries)
// inherit phase chain from previous execution context
this->nvopt_recipe_ptr = recipe;
}
this->flag_byte = 1;
return true;
}
Key constants:
- 159 -- total phase count, used as loop bound and array capacities
- 1272 -- 159 * 8, phase name pointer table size in bytes
- 440 -- NvOptRecipe sub-manager object size
- 0x2030007 (33,751,047) -- timing sentinel magic value
- Option 17928 -- enables per-phase timing/memory reporting
- Option 391 -- enables NvOptRecipe sub-manager
Destruction Sequence -- sub_C61B20
Teardown mirrors construction in reverse order, with careful handling of the NvOptRecipe's reference-counted shared state:
// Pseudocode for sub_C61B20
void PhaseManager::destroy() {
// 1. Free raw name table
timing_allocator->free(phase_name_raw_table);
// 2. Tear down NvOptRecipe if present
if (nvopt_recipe_ptr) {
auto* r = nvopt_recipe_ptr;
// decrement shared_list ref-count at +432
if (--r->shared_list_refcount == 0)
free_list_nodes(r->shared_list);
free(r->hash_buckets); // +408
free(r->sorted_array); // +376
free(r->timing_records); // +344, stride=584 per entry
free(r->node_pool); // +16
free(r);
}
// 3. Destroy each phase via virtual destructor (vtable+32)
for (int i = 0; i < phase_list_count; i++) {
auto [phase, pool] = phase_list[i];
pool->free(phase); // invokes vtable+32
}
// 4. Free base arrays
allocator2->free(phase_list);
allocator->free(sorted_name_table);
}
The ref-count decrement-and-destroy pattern on shared_list at +432 follows C++ shared_ptr semantics: the NvOptRecipe may share state across multiple compilation units in library mode.
Phase Dispatch Loop -- sub_C64F70
The dispatch loop is the runtime engine. It takes a slice of the phase index array and executes each phase in order:
// Pseudocode for sub_C64F70
bool PhaseManager::dispatch(int* phase_indices, int count) {
memory_snapshot_t base_snap;
take_snapshot(&base_snap); // sub_8DADE0
for (int i = 0; i < count; i++) {
int idx = phase_indices[i];
phase* p = this->phase_list[idx].phase;
// Resolve phase name
int name_idx = p->getName(); // vtable+16
const char* name = this->phase_name_raw_table[name_idx];
// Record timing entry
append_timing({idx, name, opt_level, flags, metrics});
// Take pre-execution snapshot
memory_snapshot_t pre_snap;
take_snapshot(&pre_snap);
// Execute (unless no-op)
if (!p->isNoOp()) { // vtable+8
p->execute(this->cu); // vtable+0
// Construct diagnostic: "Before <name>" or "After <name>"
}
// Report per-phase stats
if (this->timing_enabled) {
report_phase_stats(name, &pre_snap, false); // sub_C64310
this->flag_byte = 0;
}
}
// Summary after all phases
if (this->timing_enabled) {
report_phase_stats("All Phases Summary", &base_snap, true);
report_pool_consumption(); // sub_C62200
}
return true;
}
The "Before" / "After" diagnostic strings use an interesting encoding trick: the string "Before " is stored as the 64-bit integer 0x2065726F666542 in little-endian, allowing the compiler to emit a single mov instruction instead of a memcpy.
Phase Name Lookup -- sub_C641D0
External callers (e.g., --ftrace-phase-after option processing in sub_9F4040) resolve phase names to indices through a case-insensitive binary search:
// Pseudocode for sub_C641D0
int PhaseManager::lookup_phase(const char* name) {
ensure_sorted(); // sub_C63FA0
int lo = 0, hi = sorted_name_count - 1;
while (lo <= hi) {
int mid = (lo + hi) / 2;
int cmp = strcasecmp(sorted_name_table[mid].name, name);
if (cmp == 0) return sorted_name_table[mid].index;
if (cmp < 0) lo = mid + 1;
else hi = mid - 1;
}
return 158; // sentinel: last phase (NOP)
}
The sorted name table is rebuilt on demand by sub_C63FA0 whenever the raw count differs from the sorted count. Sorting uses an iterative quicksort (sub_C639A0) with median-of-three pivot selection and three-way partitioning. The sort stack is pre-allocated to 33 entries, comfortably above the worst-case partition depth for a 159-entry table (ceil(log2(159)) = 8).
Per-Phase Timing and Memory Reporting
When timing is enabled (option 17928), the dispatch loop calls sub_C64310 after each phase to print memory statistics:
<indent><phase_name> :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
The reporter computes three memory deltas from snapshot pairs:
| Metric | Helper | Meaning |
|---|---|---|
| Total | sub_8DAE20 | Total memory allocated since snapshot |
| Freeable | sub_8DAE30 | Memory eligible for release |
| Freeable Leaked | sub_8DAE40 | Freeable memory not actually released |
Size formatting thresholds:
- 0--1023: raw bytes (suffix B)
- 1024--10,485,760: kilobytes with 3 decimal places (suffix KB)
- above 10 MB: megabytes with 3 decimal places (suffix MB)
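A sketch of the formatter implied by these thresholds; the exact rounding, padding, and separator behavior inside sub_C64310 are unverified.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Size formatting per the recovered thresholds: bytes below 1 KB,
// kilobytes up to the 10 MB boundary (10,485,760), megabytes above.
std::string format_size(uint64_t bytes) {
    char buf[32];
    if (bytes < 1024)
        std::snprintf(buf, sizeof buf, "%llu B", (unsigned long long)bytes);
    else if (bytes <= 10485760) // 10 MB boundary
        std::snprintf(buf, sizeof buf, "%.3f KB", bytes / 1024.0);
    else
        std::snprintf(buf, sizeof buf, "%.3f MB", bytes / (1024.0 * 1024.0));
    return buf;
}
```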
After all phases complete, the loop prints an "All Phases Summary" line using the same reporter, then calls sub_C62200 to print the pool consumption total:
[Pool Consumption = 45.678 MB]
Timing Record Format
Each timing entry is 32 bytes:
Timing Record (32 bytes)
+0 int phase_index // -1 for sentinel
+8 int64 phase_name // string pointer, or 0x2030007 for sentinel
+16 int64 timing_value // elapsed time
+24 int memory_flags // opt level / additional metrics
Records are stored in a growable array at compilation_unit+1560. Growth uses a 1.5x strategy: new_capacity = max(old + old/2 + 1, requested).
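The growth rule in isolation (helper name is ours):

```cpp
// Recovered 1.5x growth strategy for the timing-record array:
// new_capacity = max(old + old/2 + 1, requested).
int grow_capacity(int old_cap, int requested) {
    int grown = old_cap + old_cap / 2 + 1; // 1.5x plus one, integer math
    return grown > requested ? grown : requested;
}
```

The "+1" term guarantees progress from capacity 0, and the max() with the requested size handles large one-shot reservations.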
NvOptRecipe Sub-Manager (440 bytes)
When option 391 is enabled, the constructor creates a 440-byte NvOptRecipe sub-manager at PhaseManager+56. This object provides the runtime for "AdvancedPhase" hooks -- the 16 phases that are no-ops by default but can be activated for architecture-specific or optimization-level-specific processing. The NvOpt level (0--5) controls per-phase aggressiveness independently of the -O CLI level: -O gates which phases run at all, while the NvOpt level controls how aggressively active phases behave.
Object Layout
NvOptRecipe (440 bytes)
+0 int64 compilation_unit // back-pointer to owning CU
+8 int64 phase_manager_backref // back-pointer to PhaseManager
+16 void* node_pool // 24-byte ref-counted list node
+24 int64 secondary_bucket_count // secondary hash (migration buffer)
+32 void* secondary_bucket_array // secondary hash bucket array
+40 int64 secondary_total_entries // secondary hash entry count
+48 (264 B) [opaque internal region] // +48..+311 undecoded
+312 int64 recipe_data // from option 391 value (ext. pointer)
+320 int64 (reserved) // zeroed in constructor
+328 (8 B) [alignment gap]
+336 int64 allocator // cu->field_16->field_16
+344 void* timing_records // stride = 584 bytes per entry
+352 int32 timing_count // init -1 (empty sentinel)
+356 int32 timing_flags // init 0
+360 int32 timing_extra // init 0
+364 (4 B) (padding)
+368 int64* timing_allocator // cu->field_16->field_16 copy
+376 void* sorted_array // 4-byte entries, init capacity = 8
+384 int32 sorted_count // init 7 (pre-filled)
+388 int32 sorted_capacity // init 8
+392 void* ref_counted_list_2 // 24-byte ref-counted list node
+400 int32 hash_bucket_count // primary hash table bucket count
+404 (4 B) (padding)
+408 void* hash_buckets // primary hash, 24-byte stride/bucket
+416 int64 hash_size // total entries across all buckets
+424 (8 B) (padding)
+432 void* shared_list_ptr // ref-counted, shared across CUs
Sub-Structures
Ref-Counted List Node (24 bytes) -- used at +16, +392, +432:
RefCountedListNode (24 bytes)
+0 int64 refcount // manual shared_ptr: decrement-and-destroy
+8 void* next // singly-linked list chain
+16 void* allocator // for self-deallocation when refcount → 0
When the refcount reaches zero, the destructor walks the next chain freeing each node, then frees the head node itself through the allocator at +16.
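A minimal model of the decrement-and-destroy teardown, with the allocator slot modeled as a plain deallocation function pointer; names are illustrative.

```cpp
#include <cstdlib>

// Mirrors the 24-byte node: refcount at +0, next at +8, allocator at +16.
struct RefNode {
    long     refcount;          // +0: manual shared_ptr semantics
    RefNode* next;              // +8: singly-linked chain
    void   (*dealloc)(void*);   // +16: stand-in for the allocator pointer
};

// Decrement the head's refcount; on reaching zero, free the whole chain.
// Returns the number of nodes freed (0 if other references remain).
int release_list(RefNode* head) {
    if (--head->refcount != 0) return 0;
    int freed = 0;
    for (RefNode* n = head; n != nullptr; ) {
        RefNode* next = n->next;
        n->dealloc(n);
        ++freed;
        n = next;
    }
    return freed;
}
```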
Hash Bucket Entry (24 bytes) -- array at +408:
HashBucketEntry (24 bytes)
+0 void* chain_head // first element in bucket chain
+8 void* chain_sentinel // end-of-chain sentinel
+16 int32 chain_count // number of elements in this bucket
Timing Record (584 bytes) -- array at +344:
TimingRecord (584 bytes)
+0 (40 B) header
+40 void* sub_allocator // allocator for sub-data at +48
+48 void* sub_data // freed during cleanup
+56 int32 sub_count // set to -1 when cleaned
+60 int32 cleanup_flag // if >= 0: sub_data exists, free it
+64 (520 B) timing/metric data
Records are iterated backward during cleanup (base + 584 * (count + 1) - 584 down to base). The sentinel value -1 at offset +56 marks an entry as already cleaned up.
Construction Sequence
The constructor (sub_C62720, lines 356--850 in decompilation) performs these steps:
1. Check option 391 -- fast path: *(config_obj[9] + 28152) != 0; slow path: virtual call with argument 391. If disabled, skip entirely.
2. Read option 391 value -- the value is the recipe_data pointer. Fast path checks type tag 5 (int64) at config offset 28152, reads the 64-bit value at offset 28160. This is an externally-provided pointer, not computed locally.
3. Allocate 440 bytes from the pool allocator at compilation_unit->field_16.
4. Initialize core fields -- back-pointers at +0/+8, node_pool at +16 (24-byte ref-counted node, refcount=1), zero +24/+32/+40, store recipe_data at +312.
5. Initialize timing -- zero +344, set +352 to -1 (empty sentinel), zero +360, copy allocator to +336 and +368.
6. Allocate sorted_array -- initial capacity 8 entries (32 bytes), pre-fill 7 entries, set +384 = 7, +388 = 8.
7. Allocate ref_counted_list_2 at +392 (24-byte node, refcount=1), zero +400/+408/+416.
8. Allocate shared_list at +432 (24-byte node, refcount=1).
9. Inherit from previous recipe -- if PhaseManager+56 already holds an NvOptRecipe from a prior compilation unit:
   - Decrement old shared_list refcount; free if zero
   - Migrate hash bucket chains from old recipe to new ref_counted_list_2
   - Walk old timing records backward (stride 584), freeing sub-allocations
   - Drain old secondary hash table, release old node_pool
   - Free old NvOptRecipe object
10. Install -- set PhaseManager+56 = new recipe, PhaseManager+64 = allocator.
Destruction Sequence
The destructor (sub_C61B20) tears down the recipe in reverse:
- Decrement shared_list_ptr (+432) refcount; free linked nodes if zero
- Walk hash buckets (+408, stride 24, count from +416): for each chain element, clean sub-entries (timing at offsets +56/+60/+64/+76), decrement per-entry refcounts at element[9], append to ref_counted_list_2; zero bucket; reset +400 to 0
- Clean up ref_counted_list_2 (+392); free if refcount zero
- Free sorted_array (+376) if sorted_count (+388) >= 0
- Walk timing_records (+344) backward, stride 584, freeing sub-allocations; reset +352 to -1
- Drain secondary hash (+24/+32/+40), move chains to node_pool
- Release node_pool (+16); free if refcount zero
- Free the 440-byte object via PhaseManager+64 allocator
NvOpt Level Validation
The recipe application function sub_C173E0 validates the NvOpt level at each recipe record:
// At sub_C173E0 + 0x2FD9 (line 1431)
int nvopt_level = *(int*)(recipe_record + 344);
if (nvopt_level > 5) {
emit_warning(cu + 1232, 8000, "Invalid nvopt level : %d.", nvopt_level);
// warning 8000 (0x1F40) -- non-fatal, compilation continues
}
Valid levels are 0--5. The level is consumed as a bitmask 1 << nvopt_level, passed to a vtable call that dispatches on a recipe configuration byte at target descriptor offset 35280 (8-case switch: cases 0--5, 7). This byte controls which recipe application mode is used for the target architecture.
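A sketch of the check and bitmask conversion; the warning plumbing is modeled here as a boolean return (an assumption -- the binary routes it through emit_warning with code 8000).

```cpp
// Levels 0--5 are valid; the level is consumed downstream as (1 << level),
// the bitmask handed to the recipe-mode vtable dispatch.
bool validate_nvopt_level(int level, unsigned* mask_out) {
    if (level < 0 || level > 5) {
        // ptxas emits warning 8000: "Invalid nvopt level : %d." (non-fatal)
        return false;
    }
    *mask_out = 1u << level;
    return true;
}
```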
Shared State in Library Mode
The shared_list at +432 enables recipe state persistence across compilation units in library mode (multiple .ptx files compiled by one ptxas invocation):
- Each new NvOptRecipe sets its
shared_listrefcount to 1 - During inheritance (step 9), hash bucket contents are migrated from the old recipe to the new one, accumulating per-kernel recipe decisions
- When a PhaseManager is destroyed, the recipe decrements the shared_list refcount; only the last reference frees the nodes
- This allows the NvOptRecipe to cache per-kernel optimization decisions across compilation passes
Key Constants
| Value | Meaning |
|---|---|
| 440 | NvOptRecipe object size (bytes) |
| 584 | Per-entry timing record stride (bytes) |
| 24 | Hash bucket entry size / ref-counted list node size |
| 8 | Initial sorted_array capacity |
| 7 | Initial sorted_count (pre-filled entries) |
| 391 | Option ID (enables NvOptRecipe; value = recipe data pointer) |
| 28152 | Option 391 type-tag offset in config storage |
| 28160 | Option 391 value offset (8 bytes after type tag) |
| 0x1F40 | Warning code 8000: "Invalid nvopt level" |
| 5 | Maximum valid NvOpt level |
| 35280 | Recipe config byte offset in target descriptor |
Multi-Function Dispatch -- sub_C60BD0
When a compilation unit contains more than one function, sub_C60BD0 redirects to a per-function dispatch path:
// Pseudocode for sub_C60BD0
void PhaseManager::invoke_multi(compilation_unit* cu) {
int func_count = get_function_count(cu); // sub_7DDB50
if (func_count > 1) {
auto list1 = create_refcounted_list();
auto list2 = create_refcounted_list();
this->phase_chain = current_chain; // +88
per_function_dispatch(cu, list1, list2); // sub_790A40
release(list1);
release(list2);
}
}
Complete Phase Table
Stage numbering. The 7 groups below are a coarse summary of the 159-phase OCG pipeline. The authoritative fine-grained grouping is the 10-stage scheme in the Pass Inventory (OCG-Stage 1--10). The 7-group table here collapses several of those stages for brevity; phase boundaries differ slightly. When citing a stage by number, prefer the Pass Inventory's 10-stage numbering.
Group 1: Initial Setup (phases 0--12)
| Index | Phase Name | Purpose |
|---|---|---|
| 0 | OriCheckInitialProgram | Validate initial Ori IR |
| 1 | ApplyNvOptRecipes | Apply NvOptRecipe transformations |
| 2 | PromoteFP16 | Promote FP16 operations where beneficial |
| 3 | AnalyzeControlFlow | Build/analyze control flow graph |
| 4 | AdvancedPhaseBeforeConvUnSup | Hook -- before unsupported op conversion |
| 5 | ConvertUnsupportedOps | Lower unsupported operations to supported sequences |
| 6 | SetControlFlowOpLastInBB | Mark control flow ops as last in basic block |
| 7 | AdvancedPhaseAfterConvUnSup | Hook -- after unsupported op conversion |
| 8 | OriCreateMacroInsts | Create macro instruction patterns |
| 9 | ReportInitialRepresentation | Diagnostic dump of initial IR |
| 10 | EarlyOriSimpleLiveDead | Early dead code elimination |
| 11 | ReplaceUniformsWithImm | Replace uniform register loads with immediates |
| 12 | OriSanitize | IR consistency checks |
Group 2: Early Optimization (phases 13--36)
| Index | Phase Name | Purpose |
|---|---|---|
| 13 | GeneralOptimizeEarly | First GeneralOptimize pass (peephole + simplify) |
| 14 | DoSwitchOptFirst | Switch statement optimization, first pass |
| 15 | OriBranchOpt | Branch simplification and folding |
| 16 | OriPerformLiveDeadFirst | Liveness analysis, first pass |
| 17 | OptimizeBindlessHeaderLoads | Optimize bindless texture header loads |
| 18 | OriLoopSimplification | Canonicalize loop structure |
| 19 | OriSplitLiveRanges | Split long live ranges to reduce pressure |
| 20 | PerformPGO | Apply profile-guided optimizations |
| 21 | OriStrengthReduce | Strength reduction on induction variables |
| 22 | OriLoopUnrolling | Loop unrolling |
| 23 | GenerateMovPhi | Convert phi nodes to MOV-phi representation |
| 24 | OriPipelining | Software pipelining of loops |
| 25 | StageAndFence | Memory staging and fence insertion |
| 26 | OriRemoveRedundantBarriers | Remove unnecessary barrier instructions |
| 27 | AnalyzeUniformsForSpeculation | Identify uniform values for speculative execution |
| 28 | SinkRemat | Sink rematerializable instructions |
| 29 | GeneralOptimize | Main GeneralOptimize pass |
| 30 | DoSwitchOptSecond | Switch optimization, second pass |
| 31 | OriLinearReplacement | Replace complex patterns with linear sequences |
| 32 | CompactLocalMemory | Compact local memory layout |
| 33 | OriPerformLiveDeadSecond | Liveness analysis, second pass |
| 34 | ExtractShaderConstsFirst | Extract shader constants, first pass |
| 35 | OriHoistInvariantsEarly | Early loop-invariant hoisting |
| 36 | EmitPSI | Emit program state information |
Group 3: Mid-Level Optimization (phases 37--58)
| Index | Phase Name | Purpose |
|---|---|---|
| 37 | GeneralOptimizeMid | Mid-pipeline GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Simplify nested conditional branches |
| 39 | ConvertVTGReadWrite | Convert vertex/tessellation/geometry read/write ops |
| 40 | DoVirtualCTAExpansion | Expand virtual CTA operations |
| 41 | MarkAdditionalColdBlocks | Mark additional basic blocks as cold |
| 42 | ExpandMbarrier | Expand mbarrier intrinsics |
| 43 | ForwardProgress | Ensure forward progress guarantees |
| 44 | OptimizeUniformAtomic | Optimize uniform atomic operations |
| 45 | MidExpansion | Mid-pipeline lowering and expansion |
| 46 | GeneralOptimizeMid2 | Second mid-pipeline GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Hook -- before argument restrictions |
| 48 | EnforceArgumentRestrictions | Enforce ABI argument constraints |
| 49 | GvnCse | Global value numbering and common subexpression elimination |
| 50 | OriReassociateAndCommon | Reassociation and commoning |
| 51 | ExtractShaderConstsFinal | Extract shader constants, final pass |
| 52 | OriReplaceEquivMultiDefMov | Replace equivalent multi-def MOVs |
| 53 | OriPropagateVaryingFirst | Varying propagation, first pass |
| 54 | OriDoRematEarly | Early rematerialization |
| 55 | LateExpansion | Late lowering of complex operations |
| 56 | SpeculativeHoistComInsts | Speculatively hoist common instructions |
| 57 | RemoveASTToDefaultValues | Remove AST nodes set to default values |
| 58 | GeneralOptimizeLate | Late GeneralOptimize |
Group 4: Late Optimization (phases 59--95)
| Index | Phase Name | Purpose |
|---|---|---|
| 59 | OriLoopFusion | Fuse compatible loops |
| 60 | DoVTGMultiViewExpansion | Expand multi-view VTG operations |
| 61 | OriPerformLiveDeadThird | Liveness analysis, third pass |
| 62 | OriRemoveRedundantMultiDefMov | Remove redundant multi-def MOVs |
| 63 | OriDoPredication | If-conversion / predication |
| 64 | LateOriCommoning | Late value commoning |
| 65 | GeneralOptimizeLate2 | Second late GeneralOptimize |
| 66 | OriHoistInvariantsLate | Late invariant hoisting |
| 67 | DoKillMovement | Move kill instructions for better scheduling |
| 68 | DoTexMovement | Move texture instructions for latency hiding |
| 69 | OriDoRemat | Main rematerialization pass |
| 70 | OriPropagateVaryingSecond | Varying propagation, second pass |
| 71 | OptimizeSyncInstructions | Optimize synchronization instructions |
| 72 | LateExpandSyncInstructions | Expand sync instructions to HW sequences |
| 73 | ConvertAllMovPhiToMov | Convert all MOV-phi to plain MOV |
| 74 | ConvertToUniformReg | Promote values to uniform registers |
| 75 | LateArchOptimizeFirst | Architecture-specific late optimization, first pass |
| 76 | UpdateAfterOptimize | Post-optimization bookkeeping |
| 77 | AdvancedPhaseLateConvUnSup | Hook -- before late unsupported op expansion |
| 78 | LateExpansionUnsupportedOps | Late lowering of unsupported operations |
| 79 | OriHoistInvariantsLate2 | Second late invariant hoisting |
| 80 | ExpandJmxComputation | Expand JMX (join/merge) computations |
| 81 | LateArchOptimizeSecond | Architecture-specific late optimization, second pass |
| 82 | AdvancedPhaseBackPropVReg | Hook -- before back-copy propagation |
| 83 | OriBackCopyPropagate | Backward copy propagation |
| 84 | OriPerformLiveDeadFourth | Liveness analysis, fourth pass |
| 85 | OriPropagateGmma | GMMA/WGMMA propagation |
| 86 | InsertPseudoUseDefForConvUR | Insert pseudo use/def for uniform reg conversion |
| 87 | FixupGmmaSequence | Fix up GMMA instruction sequences |
| 88 | OriHoistInvariantsLate3 | Third late invariant hoisting |
| 89 | AdvancedPhaseSetRegAttr | Hook -- before register attribute setting |
| 90 | OriSetRegisterAttr | Set register attributes (types, constraints) |
| 91 | OriCalcDependantTex | Calculate dependent texture operations |
| 92 | AdvancedPhaseAfterSetRegAttr | Hook -- after register attribute setting |
| 93 | LateExpansionUnsupportedOps2 | Second late unsupported op expansion |
| 94 | FinalInspectionPass | Final IR validity checks |
| 95 | SetAfterLegalization | Mark legalization complete |
Group 5: Scheduling and Register Allocation (phases 96--105)
| Index | Phase Name | Purpose |
|---|---|---|
| 96 | ReportBeforeScheduling | Diagnostic dump before scheduling |
| 97 | AdvancedPhasePreSched | Hook -- before scheduling |
| 98 | BackPropagateVEC2D | Back-propagate 2D vector instructions |
| 99 | OriDoSyncronization | Insert synchronization instructions |
| 100 | ApplyPostSyncronizationWars | Apply post-synchronization write-after-read fixes |
| 101 | AdvancedPhaseAllocReg | Hook -- register allocation |
| 102 | ReportAfterRegisterAllocation | Diagnostic dump after regalloc |
| 103 | Get64bRegComponents | Extract 64-bit register components |
| 104 | AdvancedPhasePostExpansion | Hook -- after post-expansion |
| 105 | ApplyPostRegAllocWars | Apply post-regalloc write-after-read fixes |
Group 6: Post-Schedule and Code Generation (phases 106--131)
| Index | Phase Name | Purpose |
|---|---|---|
| 106 | AdvancedPhasePostSched | Hook -- after scheduling |
| 107 | OriRemoveNopCode | Remove NOP instructions |
| 108 | OptimizeHotColdInLoop | Hot/cold partitioning within loops |
| 109 | OptimizeHotColdFlow | Hot/cold partitioning across flow |
| 110 | PostSchedule | Post-scheduling fixups |
| 111 | AdvancedPhasePostFixUp | Hook -- after post-schedule fixup |
| 112 | PlaceBlocksInSourceOrder | Reorder blocks to match source order |
| 113 | PostFixForMercTargets | Mercury target-specific fixups |
| 114 | FixUpTexDepBarAndSync | Fix texture dependency barriers and sync |
| 115 | AdvancedScoreboardsAndOpexes | Hook -- before scoreboard generation |
| 116 | ProcessO0WaitsAndSBs | Process O0-level waits and scoreboards |
| 117 | MercEncodeAndDecode | Mercury encode to SASS and decode-verify |
| 118 | MercExpandInstructions | Expand macro instructions to SASS |
| 119 | MercGenerateWARs1 | Generate write-after-read hazard stalls, pass 1 |
| 120 | MercGenerateOpex | Generate operand exchange stalls |
| 121 | MercGenerateWARs2 | Generate write-after-read hazard stalls, pass 2 |
| 122 | MercGenerateSassUCode | Emit final SASS microcode |
| 123 | ComputeVCallRegUse | Compute virtual call register usage |
| 124 | CalcRegisterMap | Calculate final register map |
| 125 | UpdateAfterPostRegAlloc | Post-regalloc bookkeeping |
| 126 | ReportFinalMemoryUsage | Report final memory consumption |
| 127 | AdvancedPhaseOriPhaseEncoding | Hook -- before final encoding |
| 128 | UpdateAfterFormatCodeList | Update after code list formatting |
| 129 | DumpNVuCodeText | Dump NV microcode as text (debug) |
| 130 | DumpNVuCodeHex | Dump NV microcode as hex (debug) |
| 131 | DebuggerBreak | Debugger breakpoint (debug) |
Group 7: Late Cleanup (phases 132--158)
| Index | Phase Name | Purpose |
|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Bookkeeping after late conversion |
| 133 | MergeEquivalentConditionalFlow | Merge equivalent conditional branches |
| 134 | AdvancedPhaseAfterMidExpansion | Hook -- after mid-expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Hook -- after late sync expansion |
| 136 | LateMergeEquivalentConditionalFlow | Late merge of equivalent conditionals |
| 137 | LateExpansionUnsupportedOpsMid | Mid-point late unsupported op expansion |
| 138 | OriSplitHighPressureLiveRanges | Split live ranges under high register pressure |
| 139--158 | (architecture-specific) | 20 additional phases whose names come from their vtable getName() methods |
Phases 139--158 are not in the static name table at off_22BD0C0. Their names are returned by each phase's getName() virtual method. These are conditionally-enabled phases for specific architecture targets (SM variants) or optimization levels.
AdvancedPhase Hook Points
The 16 AdvancedPhase entries are insertion points for architecture-specific or optimization-level-specific processing. All return isNoOp() == true by default. When activated (typically by NvOptRecipe configuration for a specific SM target), they execute additional transformations at precisely defined points in the pipeline:
| Index | Hook Name | Insertion Context |
|---|---|---|
| 4 | AdvancedPhaseBeforeConvUnSup | Before ConvertUnsupportedOps |
| 7 | AdvancedPhaseAfterConvUnSup | After ConvertUnsupportedOps |
| 47 | AdvancedPhaseEarlyEnforceArgs | Before EnforceArgumentRestrictions |
| 77 | AdvancedPhaseLateConvUnSup | Before LateExpansionUnsupportedOps |
| 82 | AdvancedPhaseBackPropVReg | Before OriBackCopyPropagate |
| 89 | AdvancedPhaseSetRegAttr | Before OriSetRegisterAttr |
| 92 | AdvancedPhaseAfterSetRegAttr | After OriSetRegisterAttr |
| 97 | AdvancedPhasePreSched | Before scheduling pipeline |
| 101 | AdvancedPhaseAllocReg | Register allocation entry point |
| 104 | AdvancedPhasePostExpansion | After post-regalloc expansion |
| 106 | AdvancedPhasePostSched | After scheduling |
| 111 | AdvancedPhasePostFixUp | After post-schedule fixup |
| 115 | AdvancedScoreboardsAndOpexes | Before scoreboard/opex generation |
| 127 | AdvancedPhaseOriPhaseEncoding | Before final instruction encoding |
| 134 | AdvancedPhaseAfterMidExpansion | After mid-level expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | After late sync instruction expansion |
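The default-no-op contract above can be sketched as a small model. This is illustrative only: the class shape and method names are assumptions modeled on the description, not recovered symbols; only the isNoOp-by-default behavior and the recipe-driven activation are taken from the binary.

```python
# Hypothetical sketch of an AdvancedPhase hook slot. The real object is a
# C++ phase with a virtual isNoOp(); here a hook only runs a body when a
# recipe has installed one for the current SM target.
class AdvancedPhase:
    def __init__(self, name, body=None):
        self.name = name
        self.body = body            # installed by an NvOptRecipe, else None

    def is_no_op(self):
        return self.body is None    # default: True, per the text above

    def execute(self, ctx):
        if self.is_no_op():
            return ctx              # pipeline position preserved, nothing done
        return self.body(ctx)
```

A dormant hook costs only the isNoOp() check, which is why all 16 slots can stay in the phase list unconditionally regardless of target architecture.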
Mercury Encoding Sub-Pipeline
Phases 113--122 form a self-contained sub-pipeline that transforms the optimized, register-allocated Ori IR into final SASS machine code via the Mercury encoding format:
PostFixForMercTargets (113)
→ FixUpTexDepBarAndSync (114)
→ [AdvancedScoreboardsAndOpexes hook (115)]
→ ProcessO0WaitsAndSBs (116)
→ MercEncodeAndDecode (117) ← encode to SASS + decode for verification
→ MercExpandInstructions (118) ← expand remaining macros
→ MercGenerateWARs1 (119) ← first WAR hazard pass
→ MercGenerateOpex (120) ← operand exchange stalls
→ MercGenerateWARs2 (121) ← second WAR hazard pass
→ MercGenerateSassUCode (122) ← final microcode emission
"Mercury" is NVIDIA's internal name for the SASS encoding format on recent GPU architectures (Blackwell-era SM 100/103/110/120).
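Structurally, the sub-pipeline is an ordered list of transformations applied to the code list in sequence. A sketch of that shape (the phase names mirror the table above; the bodies here are placeholders, not the recovered implementations):

```python
def run_mercury_pipeline(code, phases):
    """Apply each (name, fn) phase to the code list, in table order."""
    for _name, fn in phases:
        code = fn(code)            # each phase consumes and returns the code list
    return code

# Phases 113-122 in dispatch order (hook 115 included as a pipeline slot).
MERCURY_ORDER = [
    "PostFixForMercTargets", "FixUpTexDepBarAndSync",
    "AdvancedScoreboardsAndOpexes", "ProcessO0WaitsAndSBs",
    "MercEncodeAndDecode", "MercExpandInstructions",
    "MercGenerateWARs1", "MercGenerateOpex",
    "MercGenerateWARs2", "MercGenerateSassUCode",
]
```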
Diagnostic Strings
| Address | String | Emitted By | Context |
|---|---|---|---|
| 0x22BC3B3 | "[Pool Consumption = " | sub_C62200 | After all phases summary |
| 0x22BC416 | "All Phases Summary" | sub_C64F70 | End of dispatch loop |
| (inline) | " :: " | sub_C64310 | Phase timing line separator |
| (inline) | "[Total " | sub_C64310 | Total memory delta |
| (inline) | "[Freeable " | sub_C64310 | Freeable memory delta |
| (inline) | "[Freeable Leaked " | sub_C64310 | Leaked memory delta |
| (inline) | "Before " / "After " | sub_C64F70 | Phase execution diagnostic |
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
| sub_C60D20 | 16 | Default phase table pointer | HIGH |
| sub_C60D30 | 3,554 | Phase factory (159-case switch) | VERY HIGH |
| sub_C60BD0 | 334 | Multi-function phase invoker | MEDIUM-HIGH |
| sub_C61B20 | 1,753 | PhaseManager destructor | VERY HIGH |
| sub_C62200 | 888 | Pool consumption reporter | VERY HIGH |
| sub_C62580 | 253 | Timing record array resizer | HIGH |
| sub_C62640 | 223 | Phase list resizer | HIGH |
| sub_C62720 | 4,734 | PhaseManager constructor | VERY HIGH |
| sub_C639A0 | 1,535 | Case-insensitive quicksort | HIGH |
| sub_C63FA0 | 556 | Phase name table sort/rebuild | HIGH |
| sub_C641D0 | 305 | Phase name-to-index lookup | VERY HIGH |
| sub_C64310 | 3,168 | Per-phase timing reporter | VERY HIGH |
| sub_C64F70 | 1,455 | Phase dispatch loop | VERY HIGH |
Cross-References
- Pass Inventory & Ordering -- full phase sequence and stage grouping
- GeneralOptimize Bundles -- phases 13, 29, 37, 46, 58, 65
- Synchronization & Barriers -- phases 26, 71, 72, 99, 100
- Liveness Analysis -- phases 10, 16, 33, 61, 84
- Mercury Encoder -- phases 113--122
- Memory Pool Allocator -- pool allocation infrastructure used by PhaseManager
- Optimization Levels -- how opt level controls phase behavior
- DUMPIR & NamedPhases -- phase name resolution for debug output
GeneralOptimize Bundles
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The GeneralOptimize* passes are compound optimization bundles that run multiple sub-transformations in sequence on each basic block, repeating until no further changes occur (fixed-point iteration). They serve as the primary IR cleanup mechanism throughout the pipeline: after any major transformation introduces new dead code, redundant copies, or foldable constants, a GeneralOptimize pass re-normalizes the IR before the next major phase.
Six instances exist at strategic positions in the 159-phase pipeline. Despite sharing the "GeneralOptimize" name prefix, the six instances decompose into three distinct implementation families -- a lightweight block-iteration variant, a heavyweight bitvector-tracked orchestrator, and an indirect vtable dispatch stub. Each family shares a common architectural pattern (per-block iteration with convergence check) but invokes different sub-pass combinations and has different gate conditions.
| Instances | 6 (phases 13, 29, 37, 46, 58, 65) |
| Pattern | Per-block iteration with convergence check |
| Sub-passes | Copy propagation, constant folding, structural equivalence elimination, dead code elimination, predicate simplification, register promotion (Phase 37) |
| Convergence | Boolean change flag per iteration; stops when no sub-pass reports a change |
| Iteration cap | Knob-controlled (option 464); breaks loop if knob returns false |
| Single-function fast path | Phases 13 and 65 have direct tail-call paths bypassing the multi-function dispatch |
| Multi-function gate | Variants check the function count via sub_7DDB50(ctx) before entering the main loop -- > 1 for most variants, > 2 for phase 58 |
| Code range | Execute functions at 0xC5F940--0xC60870; sub-pass bodies at 0x7917F0--0x910840 |
Instance Map
| Phase | Name | Vtable | execute() | Sub-pass Body | Gate Conditions |
|---|---|---|---|---|---|
| 13 | GeneralOptimizeEarly | off_22BD7D0 | 0xC5F940 | sub_7917F0 (multi-func) / 0x1C64BF0 (single-func) | bit 2 of ctx+1382 must be set |
| 29 | GeneralOptimize | off_22BDA50 | 0xC5FC50 | sub_908EB0 | Option 487 enabled; option 231 not set; option 461 pass |
| 37 | GeneralOptimizeMid | off_22BDB90 | 0xC5FD70 | sub_910840 | sub_8F3EA0 pre-check; option 487; "ConvertMemoryToRegisterOrUniform" name-gate |
| 46 | GeneralOptimizeMid2 | off_22BDCF8 | 0xC60840 | indirect via [*(ctx+1584)]->vtable[0x1C0] | Vtable dispatch; skips if target == sub_7D6DD0 (no-op sentinel) |
| 58 | GeneralOptimizeLate | off_22BDED8 | 0xC5FF20 | sub_8F7080 | Function count > 2; bits 4-5 of ctx+1396 != 0x20; option 31 checked |
| 65 | GeneralOptimizeLate2 | off_22BDFF0 | 0xC60550 | indirect via [*(ctx+1584)]->vtable[392] | Function count > 1; indirect dispatch through compilation unit vtable |
Architecture: Three Structural Variants
Variant A: Block-Iteration with Explicit Fixed-Point Loop (Phases 13, 29)
The Early and standard GeneralOptimize passes iterate over basic blocks with an explicit convergence loop. Phase 13 (GeneralOptimizeEarly) at sub_7917F0 is the simplest and best-documented:
// sub_7917F0 -- GeneralOptimizeEarly (multi-function path)
void GeneralOptimizeEarly(int64_t ctx) {
if (!(*(uint8_t*)(ctx + 1382) & 4)) return; // gate: optimization flag
// Option 214 check -- uses vtable fast-path comparison:
// if vtable[72] == sub_6614A0, reads *(config + 15408) directly
// otherwise calls the virtual getOption(214)
if (getOption(ctx, 214)) return; // gate: skip if set
// Option 487 check -- uses vtable[152] fast-path:
// if vtable[152] == sub_67EB60, calls sub_7468B0(config, 487)
// otherwise calls the virtual isOptionSet(487, 1)
if (!getOption_v2(ctx, 487)) return; // gate: general opt enable
if (*(int64_t*)(*(int64_t*)ctx + 1056)) return; // gate: already processed
sub_785E20(ctx, 0); // reset per-block change tracking
sub_781F80(ctx, 1); // initialize instruction flags
sub_7E6090(ctx, 0, 0, 0, 0); // prepare operand use/def chains
sub_7E6AD0(ctx, 0, ...); // build def-use/use-def links
// Iterate over basic blocks (block_count at ctx+520)
int bb_count = *(int32_t*)(ctx + 520);
for (int i = 1; i <= bb_count; i++) {
// block_order at ctx+512, block_table at ctx+296
int bb_idx = *(int32_t*)(*(int64_t*)(ctx + 512) + 4*i);
BasicBlock* bb = *(BasicBlock**)(*(int64_t*)(ctx + 296) + 8*bb_idx);
// Fixed-point loop on this block
int64_t state[...]; // stack-allocated state at rbp-0x88
while (true) {
bool changed = sub_753600(&state, bb); // run sub-passes
if (!changed) break;
// Iteration cap: knob 464
if (!getOption_v2(ctx, 464)) break;
sub_753B50(&state); // apply instruction rewrites
}
}
if (any_changed)
sub_785E20(ctx, 0); // re-normalize if anything changed
}
The inner function sub_753600 runs on a single basic block and returns a boolean indicating whether any transformation fired. When it returns true, sub_753B50 applies the accumulated changes (instruction replacement, operand rewriting, def-use chain updates), and the loop re-runs sub_753600 on the same block to check if the new IR enables further simplifications.
The convergence check for option 464 acts as an emergency brake: if the knob returns false, the loop breaks even if changes were detected. This prevents pathological cases where mutual transformations oscillate indefinitely.
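The convergence protocol can be modeled compactly. In this sketch, run_subpasses, apply_rewrites, and iteration_knob are stand-ins for sub_753600, sub_753B50, and the option-464 query; the names and the return of an iteration count are assumptions for illustration.

```python
def optimize_block(block, run_subpasses, apply_rewrites, iteration_knob):
    """Re-run sub-passes on one block until nothing changes (fixed point)."""
    iterations = 0
    while True:
        changed = run_subpasses(block)   # True if any transformation fired
        if not changed:
            break                        # converged
        if not iteration_knob():         # emergency brake: knob 464
            break                        # stop even though changes remain
        apply_rewrites(block)            # commit accumulated rewrites
        iterations += 1
    return iterations
```

Note the brake is checked after the change flag, so a disabled knob aborts the loop before any rewrite is applied in that iteration.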
Phase 29 (sub_C5FC50) follows the same pattern but delegates to sub_908EB0, which implements a more complex instruction walk with additional opcode dispatch (opcodes 97 [STG in ROT13; used here as a definition anchor], 18 [FSETP], 124 [conditional select]) and predicate-aware propagation.
Variant B: Full-Program Sub-Pass Orchestration (Phases 37, 58)
The Mid and Late variants operate at a higher level: they construct a multi-field context structure, initialize bitvector tracking infrastructure, and call a heavyweight sub-pass orchestrator.
Phase 37 -- GeneralOptimizeMid (sub_910840)
- Calls `sub_8F3EA0` -- a pre-condition check (returns false to skip the entire pass)
- Checks option 487 (general optimization enable) via the same vtable fast-path pattern
- Calls `sub_799250` with the string `"ConvertMemoryToRegisterOrUniform"` (at `0x21DD228`) -- a named phase gate that allows the pass to be selectively disabled via `--no-phase`
- Constructs a 0x408-byte context object on the stack with vtable pointer `off_21DBEF8` at offset 0. The layout is:
  GeneralOptimizeMid Context (0x408 bytes)
  +0x000  vtable_ptr = off_21DBEF8
  +0x008  allocator = *(ctx + 16)
  +0x010  (zero-init) ...
  +0x018  (zero-init) ...
  +0x020  (zero-init) ...
  +0x030  int count = 0
  +0x040  sub_context -- initialized by sub_905B50 (bitvectors, register tracking)
  ...
- Calls `sub_905B50` -- a 500+ line setup function that creates bitvector arrays for tracking register definitions, use-def chains, and per-block change flags. Allocates three pairs of {bitvector, metadata, capacity} structures for tracking definition reach, register liveness, and fold eligibility
- Calls `sub_90FBA0` -- the main optimization loop that iterates over all blocks, running sub-passes per instruction
After sub_90FBA0 returns, the function destroys three RAII-style bitvector containers at offsets +0x200, +0x228, and +0x1E0 by invoking their vtable destructors via *(vtable + 32).
Phase 58 -- GeneralOptimizeLate (sub_8F7080)
- Checks function count > 2 via `sub_7DDB50` (stricter than other variants that check > 1)
- Checks optimization level bits at `ctx+1396`: the condition `(flags & 0x30) != 0x20` ensures the pass is skipped at certain reduced optimization levels
- Checks option 31 via the vtable fast-path; when option 31 reports as "extended" (value at `config+2232` is 1 with non-zero extra word at `config+2240`), an additional `sub_7DC0E0` check determines a secondary control flag `v7`
- Constructs a 0x168-byte context on the stack with 7 sub-pass tracking groups. Each group occupies 56 bytes (three `__int128` values + a boolean changed-flag + a counter):
  GeneralOptimizeLate Context (0x168 bytes)
  +0x000  ctx_ptr = ctx (the compilation context)
  +0x008  flag_a -- initialized from (ctx+1396 & 4)
  +0x009  flag_b -- initialized from (ctx+1396 & 8)
  +0x00C  counter_0 = 0   |
  +0x010  changed_0 = 0   | Sub-pass group 0 (56 bytes)
  +0x018  ...             |
  +0x048  counter_1 = 0   | Sub-pass group 1
  ...
  +0x12C  counter_6 = 0   | Sub-pass group 6
  +0x130  changed_6 = 0   |
  +0x138  ...             |
- Calls `sub_8F6FA0` -- the block iterator
The block iterator sub_8F6FA0 initializes per-context flags from ctx+1396:
- Bit 2 (`& 4`): stored at `context+9`, controls whether opcode-7 instructions are processed
- Bit 3 (`& 8`): stored at `context+8`, controls whether opcode-6 (MOV variant) instructions are processed
It then calls sub_7E6090 to rebuild use-def chains and walks the block list calling sub_8F6530 per block.
Variant C: Indirect Vtable Dispatch (Phases 46, 65)
The Mid2 and Late2 variants use indirect vtable dispatch to call their sub-pass bodies, making the exact implementation architecture-dependent:
Phase 46 (GeneralOptimizeMid2) at 0xC60840:
mov rdi, [rsi+0x630] ; load sm_backend (compilation_context+1584)
mov rax, [rdi] ; load vtable
mov rax, [rax+0x1C0] ; load vtable slot 56 (offset 0x1C0 = 448)
cmp rax, 0x7D6DD0 ; compare against no-op sentinel
jne call_it ; if not sentinel, call it
ret ; otherwise, return (phase is no-op)
call_it:
jmp rax ; tail-call the vtable method
Phase 65 (GeneralOptimizeLate2) at sub_C60550:
// sub_C60550 -- GeneralOptimizeLate2 execute
int64_t GeneralOptimizeLate2(int64_t phase, int64_t ctx) {
int64_t result = sub_7DDB50(ctx); // get function count
if ((int)result > 1) {
int64_t comp_unit = *(int64_t*)(ctx + 1584);
return (*(int64_t(**)(int64_t, int64_t))(*(int64_t*)comp_unit + 392))(comp_unit, ctx);
}
return result;
}
This indirection means the actual optimization behavior for phases 46 and 65 is determined by the compilation unit's vtable, which varies by target architecture and optimization level. The no-op sentinel sub_7D6DD0 (for phase 46) indicates that some architectures skip this pass entirely.
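A minimal model of the sentinel-dispatch pattern, assuming a dict-based vtable for illustration (noop_sentinel plays the role of sub_7D6DD0; the real check is a single pointer compare, as in the assembly above):

```python
def noop_sentinel(comp_unit, ctx):
    """Stands in for sub_7D6DD0: the shared do-nothing vtable entry."""
    return None

def run_indirect_phase(comp_unit, ctx, slot):
    """Fetch a vtable slot; skip the phase if it still holds the sentinel."""
    fn = comp_unit["vtable"][slot]
    if fn is noop_sentinel:        # identity compare, like cmp rax, imm
        return None                # this architecture did not install the pass
    return fn(comp_unit, ctx)      # tail-call the installed body
```

Comparing against a known sentinel (rather than calling it) lets the dispatcher skip the call entirely, matching the early `ret` in the phase 46 assembly.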
Sub-Pass Decomposition
The sub-passes that run inside a GeneralOptimize iteration are not named individually in the binary -- they are inline code within the per-block processing functions. Based on the decompiled logic, the following sub-transformations are identifiable:
Copy Propagation Algorithm
String evidence: "OriCopyProp" at 0x21E6CE1 appears in the phase name table at index 22, confirming that copy propagation is a recognized sub-pass within the system.
Two distinct copy propagation algorithms exist across the GeneralOptimize variants:
Algorithm A: Chain-Matching Copy Propagation (Phase 13 -- sub_753600)
Phase 13's copy propagation operates by matching structurally equivalent instruction pairs connected through single-use def-use chains. The 253-line function sub_753600 uses a state structure (8 int64_t fields, allocated on the stack at rbp-0x88 in sub_7917F0) that accumulates matched chain endpoints:
sub_753600 State Layout (8 qwords)
state[0] = ctx -- Code Object pointer (set by caller)
state[1] = match_start -- first matched instruction in chain
state[2] = match_end -- last matched instruction in chain
state[3] = def_entry_a -- first definition chain entry (from sub_753520)
state[4] = reg_entry -- register/BB entry for replacement target
state[5] = def_entry_b -- extended chain entry (second level)
state[6] = reg_entry_b -- extended register/BB entry
The algorithm proceeds in eight steps:
// sub_753600 -- Phase 13 copy propagation (decompiled pseudocode)
function copy_prop_early(state, basic_block):
ctx = state[0]
first_instr = *(basic_block[1]) // head of instruction list
// Step 1: Entry gate -- only process blocks starting with control-flow terminator
if first_instr.opcode != 95: return false // opcode 95 = STS in ROT13; used as terminator class
if first_instr.operand_count != 5: return false
format = first_instr[25] & 7
if format != 3 and format != 4: return false // must be imm or reg source
// Step 2: Single-use chain check
use_link = basic_block[17] // use-def chain link
if use_link == NULL: return false
if *use_link == NULL: return false
if **use_link != NULL: return false // must be SINGLE consumer
// Step 3: Follow to defining instruction via opcode-97 anchor
next_instr = *(basic_block[1] + 8) // linked list next
if next_instr.opcode != 97: return false // must be def anchor
reg_entry = *(ctx+296)[ next_instr.bb_index ] // BB/def lookup
// Step 4: Walk def-use chain to find structural match
chain_a = follow_chain_filtered(state, reg_entry) // sub_753520
if chain_a == NULL: return false
state[3] = chain_a
// Step 5: Walk reverse chain from chain_a
chain_b = follow_reverse_chain(state, chain_a) // sub_753570
if chain_b == NULL: return false
state[1] = chain_b
state[2] = chain_b
// Step 6: Predicate-operand compatibility check
endpoint_instr = *(chain_b[1])
if endpoint_instr.opcode != 95: return false
  if !predicate_operand_compatible(first_instr, endpoint_instr): return false // sub_7E7380
// Step 7: Operand-level matching
if operand formats differ (format-4 parity mismatch): return false
if reg_indices match AND metadata matches AND modifiers match:
goto apply // direct match
// Step 7b: Deep sub-DAG equivalence (for non-trivial patterns)
if both sources are register type (bits 28-30 == 1)
and both have use_count <= 1
and both defining instructions have opcode 119
and no aliasing hazards (sub_748570)
and sub_1245740(ctx, def_a, def_b, 2): // depth-2 DAG compare
goto apply
return false
apply:
// Step 8: Record replacement target
state[4] = register_entry_for_replacement
// Optionally follow one more chain level for state[5]/state[6]
return true // caller invokes sub_753B50 to rewrite
The chain walker sub_753480 (43 lines) is the core of this algorithm. It follows single-use, single-def chains within a basic block:
// sub_753480 -- def-use chain walker (at 0x753480)
function follow_chain(ctx, entry, &skip_flag):
skip_flag = false
if entry == NULL: return NULL
current = entry
loop:
if check_multi_condition_skip(current): // sub_7E5120
skip_flag = true // chain crossed a skip point
if current[16] == NULL: break // no next-use link
if *current[16] != NULL: break // MULTI-USE: stop
if current[17] == NULL: break // no def link
if *current[17] != NULL: break // MULTI-DEF: stop
def_bb_idx = *(current[17] + 8)
instr_bb_idx = *(current[1] + 8).bb_index // at +24
if def_bb_idx != instr_bb_idx: break // CROSS-BB: stop
next_instr = *(current[1] + 8)
if next_instr.opcode == 97: // def anchor
current = *(ctx+296)[ def_bb_idx ] // follow chain
continue
else:
return NULL // chain broken
return current // last valid entry
Key properties of this walker:
- Only follows single-use chains (`current[16]` must have exactly one consumer)
- Only follows single-def chains (`current[17]` must have exactly one producer)
- Only follows intra-block chains (definition and use must share the same BB index)
- Only traverses through opcode 97 (definition anchor) instructions
- The `check_multi_condition_skip` helper (`sub_7E5120`, 18 lines) tests four conditions: vtable dispatch at `ctx+1784`, block ordering bounds at `ctx+1776`, instruction flags at `+283` bit 0, and knob 91
The helper sub_753520 wraps sub_753480 with an additional opcode-93 gate: the chain endpoint's instruction must have opcode 93 (OUT_FINAL in ROT13; used as an internal chain-link marker) and the use-chain at entry[16] must be empty. sub_753570 performs the reverse direction check, verifying that following the chain backward from a given entry reaches the expected starting point with matching register indices.
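The walker's stopping conditions can be modeled directly. This sketch uses assumed node records with `bb`, `uses`, `defs`, and `next` fields in place of the offset-based links (`current[16]`, `current[17]`) shown above, and models only the termination rules, not the opcode-97 anchor hop:

```python
def follow_chain(entry):
    """Walk forward while each next link is single-use, single-def, same block."""
    current = entry
    while True:
        nxt = current["next"]
        if nxt is None:
            return current                # end of chain
        if len(nxt["uses"]) != 1:         # MULTI-USE: stop
            return current
        if len(nxt["defs"]) != 1:         # MULTI-DEF: stop
            return current
        if nxt["bb"] != current["bb"]:    # CROSS-BB: stop
            return current
        current = nxt                     # link is safe; keep walking
```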
Algorithm B: Forward Walk with Flag Marking (Phase 29 -- sub_908EB0)
Phase 29's copy propagation walks the instruction linked list sequentially from *(ctx+272) (instruction list head) and marks eligible operands with flag bits for later consumption. The 217-line function sub_908EB0 maintains three key state variables:
| Variable | Type | Purpose |
|---|---|---|
v10 | bool | "previous instruction was a recognized copy" -- gates liveness fallback |
v11 | int64_t | Current definition tracking entry (BB array pointer, from opcode 97) |
v21 | char | Architecture-allows-predicate-marking flag (from vtable at **(ctx+1584)+1312) |
// sub_908EB0 -- Phase 29 forward copy propagation (decompiled pseudocode)
function copy_prop_forward(ctx):
// Gate checks: option 487, option 231, option 461, function count,
// architecture check via sub_7DC0E0, vtable dispatch at +1312
v21 = check_arch_predicate_marking(ctx)
sub_781F80(ctx, 1) // initialize per-instruction flags
v10 = initial_gate_flag // from option 487 check
v11 = 0 // no current definition context
for instr in instruction_list(ctx+272):
opcode = instr.opcode & ~0x3000 // mask bits 12-13
switch opcode:
case 97: // DEFINITION ANCHOR
v10 = initial_gate_flag // reset copy-tracking
v11 = *(ctx+296)[ instr.operand[0] & 0xFFFFFF ]
// Updates definition context -- subsequent opcodes 18/124
// reference v11 for their propagation decisions
continue
case 18: // SET-PREDICATE (FSETP/ISETP)
if sub_8F2E50(ctx, instr): // eligible?
v10 = false // suppress liveness check
if v21: // arch supports pred marking?
dst_idx = count + ~((opcode>>11) & 2)
instr.operand[dst_idx] |= 0x400 // mark: propagated-under-predicate
continue
case 124: // CONDITIONAL SELECT
if !sub_8F2E50(ctx, instr): continue
dst = instr.operand[ count + ~((opcode>>11) & 2) ]
if (ctx+1379) & 7 == 0: // simple mode
dst |= 0x100 // mark: propagated
continue
if (dst & 0xF) == 1: // integer constant type
if !sub_8F29C0(ctx): continue // arch check
// fall through to direct marking
else:
if !sub_8F29C0(ctx) or (ctx+1379 & 0x1B) != 0:
// Two-pass predicate simplifier
sub_908A60(ctx, v11, instr, 1, &hit, &partial) // forward
if hit: goto mark_propagated
if !partial:
sub_908A60(ctx, v11, instr, 0, &hit, &partial) // backward
if hit: goto mark_propagated
if !partial: continue // no match at all
// Direct propagation: convert operand type
dst = (dst & 0xFFFFFDF0) | 0x201 // clear type, set reg+deferred
continue
// Liveness-gated propagation check for extended chains
if !v10 or v21:
mark_propagated:
instr.operand[dst_idx] |= 0x100 // mark: propagated
else:
// Follow definition chain from v11 for additional candidates
follow_and_check_chain(ctx, v11, instr)
continue
default:
if !v10: // no prior copy recognized
status = sub_7DF3A0(instr, ctx) // liveness check
v10 = (*status & 0xC) != 0 // live uses exist?
continue
Target Opcodes in Copy Propagation Context
| Opcode | IR Meaning | Role in Copy Prop | Evidence |
|---|---|---|---|
| 97 | Definition anchor / label marker (STG in the ROT13 name table; used here as a definition anchor, not an actual store-global instruction) | Updates the current definition tracking context (v11). Its operand instr+84 & 0xFFFFFF is an index into the BB array at ctx+296, retrieving the BasicBlock descriptor for the definition point. All subsequent propagation decisions for opcodes 18 and 124 reference this context. | sub_908EB0 lines 74--78: v11 = *(*(a1+296) + 8 * (*(v9+84) & 0xFFFFFF)) |
| 18 | FSETP/ISETP (set predicate) | A predicate-setting comparison instruction. Copy propagation treats it as a "predicated copy" target: when source operands have type 2 or 3 (predicate/uniform register) and pass sub_91D150 register constraint checks, the destination predicate can be folded into consumers. Marked with 0x400 when the architecture supports it. | sub_908EB0 lines 84--96, sub_8F2E50 lines 19--61 |
| 124 | Conditional select (phi-like) | A two-source selection instruction controlled by a predicate. Copy propagation attempts to simplify it to a direct assignment when one source is a constant or when structural analysis shows the predicate is trivially true/false. Marked with 0x100 or type-converted via `(operand & 0xFFFFFDF0) \| 0x201`. | sub_908EB0 case 124; sub_8F2E50 lines 42--51 |
Flag Bit Semantics
The propagation marks operands with three flag bits on the destination operand word at instr + 84 + 8*dst_idx:
| Bit | Mask | Name | Set When | Effect |
|---|---|---|---|---|
| 8 | 0x100 | Propagated | Conditional select (opcode 124) is eligible for propagation AND the architecture/mode checks pass | Downstream apply-changes passes replace all uses of this destination with its source operand. Checked as a guard in sub_8F2E50: if (dst & 0x100) return false prevents double-propagation. |
| 9 | 0x200 | Deferred cleanup | Direct propagation of an integer-constant conditional select rewrites the operand type: `(operand & 0xFFFFFDF0) \| 0x201` | Set together with bit 0 (the 0x201 constant sets register type + deferred); flags the operand for deferred cleanup by the downstream apply-changes passes |
| 10 | 0x400 | Propagated under predicate | Set-predicate instruction (opcode 18) is eligible AND the architecture flag v21 is true (vtable dispatch at **(ctx+1584)+1312 returned non-zero) | Marks a conditional propagation: the destination predicate can be folded into consumers, but only if the guarding predicate is maintained. Distinguished from 0x100 because the propagation is predicate-dependent rather than unconditional. |
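The marks are plain bit operations. A small model using the mask values from the table (operand words are treated as bare integers here, an assumption for illustration):

```python
PROPAGATED       = 0x100  # bit 8
DEFERRED_CLEANUP = 0x200  # bit 9
PRED_PROPAGATED  = 0x400  # bit 10

def mark_propagated(operand):
    return operand | PROPAGATED

def mark_under_predicate(operand):
    return operand | PRED_PROPAGATED

def convert_to_deferred(operand):
    # mirrors the decompiled rewrite: clear the type nibble and bit 9,
    # then set bit 0 (register type) and bit 9 (deferred cleanup)
    return (operand & 0xFFFFFDF0) | 0x201

def already_propagated(operand):
    # the double-propagation guard used by sub_8F2E50
    return bool(operand & PROPAGATED)
```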
Eligibility Checker: sub_8F2E50
The 64-line function sub_8F2E50 is the central gatekeeper for both opcodes 18 and 124. Decompiled logic:
// sub_8F2E50 -- copy/fold eligibility (from decompiled code at 0x8F2E50)
function is_eligible(ctx, instr):
opcode = instr[18] with BYTE1 &= 0xCF // mask bits 12-13
if opcode == 18: // set-predicate
dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
type_nibble = (dst >> 2) & 0xF
if type_nibble == 10: return false // type 10: never foldable
if type_nibble == 0 and !(dst & 0x400): // no type bits, not yet marked
// Architecture-gated source operand check
vtable_fn = **(ctx+1584) + 1320
if vtable_fn == sub_7D7240: // sentinel: direct check
if (instr[23] >> 28) & 7 not in {2, 3}: return false
else:
if vtable_fn() returns true: goto opcode_124_check
// Register constraint check on both source operands
if sub_91D150(ctx, instr[23] & 0xFFFFFF): goto opcode_124_check
if sub_91D150(ctx, instr[25] & 0xFFFFFF): goto opcode_124_check
return true
return false
opcode_124_check:
if opcode == 124: // conditional select
dst = instr[2 * (count + ~((opcode>>11)&2)) + 21]
if dst & 0x100: return false // already propagated
if dst & 0x70: return false // has modifier bits
type = dst & 0xF
sm_version = *(*(ctx+1584) + 372)
if (type == 1 or type == 2) // integer or float
and (sm_version <= 20479 or !(dst & 0x1C00)): // SM gate
return true
return false
The SM version threshold 20479 (0x4FFF) divides generation-4-and-below architectures (Kepler/Maxwell, where constant propagation through conditional selects is unconditionally safe) from generation-5+ architectures (Pascal onward) that require the constraint bits at dst & 0x1C00 to be zero. See SM Version Encoding and the 20479 Boundary for the encoding formula.
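The whole opcode-124 gate condenses into one predicate over the destination word and the SM version. Constants are taken from the decompiled check above; the function name and bare-integer operand are illustrative assumptions:

```python
def cond_select_foldable(dest, sm_version):
    """Opcode-124 eligibility: type, modifier, flag, and SM-version gates."""
    if dest & 0x100:                 # already propagated
        return False
    if dest & 0x70:                  # modifier bits (neg/abs/sat)
        return False
    if (dest & 0xF) not in (1, 2):   # integer or float destination only
        return False
    # Kepler/Maxwell-era encodings (<= 20479) are unconditionally safe;
    # later generations require the constraint bits 0x1C00 to be clear.
    return sm_version <= 20479 or (dest & 0x1C00) == 0
```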
Architecture Predicate Query: sub_8F29C0
The 9-line function sub_8F29C0 at 0x8F29C0 determines whether the compilation unit's target architecture supports predicate-aware copy propagation:
// sub_8F29C0 -- architecture predicate query (decompiled verbatim)
bool check_arch_predicate(int64_t ctx) {
int64_t comp_unit = *(int64_t*)(ctx + 1584);
return sub_7DC0E0(comp_unit) // primary arch capability
|| sub_7DC050(comp_unit) // secondary arch capability
|| sub_7DC030(comp_unit); // tertiary arch capability
}
This same query is used inside sub_908A60 (the two-pass predicate simplifier) to initialize the default "safe to transform" flag before instruction-level analysis refines the answer.
Two-Pass Predicate Simplifier: sub_908A60
When simple eligibility checks pass for opcode 124 but additional predicate analysis is needed (specifically: when sub_8F29C0 returns false OR ctx+1379 & 0x1B has bits set), the two-pass predicate simplifier sub_908A60 at 0x908A60 is invoked. It takes a direction argument (1 = forward, 0 = backward) and scans the instruction stream in the specified direction looking for matching definitions:
- Forward pass (`a4 = 1`): Starts from the current definition context `v11`, walks forward through the block's instruction list. For each instruction, dispatches on opcode: 97 updates tracking context, 124/18 checks eligibility via `sub_8F2E50`, others check liveness. Uses a hash-set membership test (`sub_767240`) to avoid visiting the same instruction twice.
- Backward pass (`a4 = 0`): Starts from the definition chain at `v11+136`, walks backward through linked definitions with the same opcode dispatch logic.
The function outputs two flags: out_a (full match found -- propagation is safe) and out_b (partial match found -- further analysis may help). Phase 29 invokes forward first; if forward finds neither a full nor partial match, it invokes backward. This handles PHI-like merge patterns where the definition chain has both forward paths (normal control flow) and backward paths (loop back-edges).
Comparison of Algorithm A vs Algorithm B
| Aspect | Phase 13 (sub_753600) | Phase 29 (sub_908EB0) |
|---|---|---|
| Pattern | Chain matching (pair structural equivalence) | Forward walk with flag marking |
| Opcodes handled | 95 (entry gate), 93 (chain gate), 97 (anchor), 119 (deep eq) | 97 (anchor), 18 (pred copy), 124 (cond select) |
| Chain depth | Multi-level (follows through opcode 97 anchors) | Single-level (immediate operand check) |
| Result mechanism | Direct instruction rewriting via sub_753B50 | Flag marking (0x100/0x200/0x400), consumed later |
| Convergence | Fixed-point loop in sub_7917F0 (option 464 cap) | Single pass, flags consumed by subsequent iterations |
| Complexity | 253 lines + 5 helper functions | 217 lines + 4 helper functions |
| Scope | Intra-block, single-use chains only | Intra-block, all instructions in sequence |
Constant Folding Patterns
Constant folding in GeneralOptimize is a two-level mechanism. At the Ori IR level (phases 29 and 37), the fold-eligibility check sub_8F2E50 at 0x8F2E50 decides which operands can be marked as constant-propagation-eligible. Separately, at the SASS level, the peephole pass sub_1249B50 performs instruction-combining folds on ALU operations whose sources are both MOV-from-immediate. The Ori-level fold does not evaluate arithmetic at compile time -- it marks operands with flag bits that downstream passes consume to replace registers with immediates.
The Eligibility Check: sub_8F2E50
The central gatekeeper, called by sub_908EB0 (phase 29) and sub_908A60 (predicate simplifier). Returns boolean: 1 = foldable, 0 = not foldable. Two dispatch paths based on the masked opcode at instr[18] & ~0x3000:
// sub_8F2E50 -- Fold eligibility check (complete, annotated)
bool is_fold_eligible(int64_t ctx, uint32_t* instr) {
uint32_t raw = instr[18];
uint32_t opcode = raw;
BYTE1(opcode) &= 0xCF; // clear bits 12-13 (predication variant)
// --- Path A: opcode 18 (predicated copy) ---
if (opcode == 18) {
int dest_idx = instr[20] + ~((raw >> 11) & 2); // last-operand index
int dest = instr[2 * dest_idx + 21];
int type_nibble = (dest >> 2) & 0xF;
if (type_nibble == 10) return false; // operand type 10: never foldable
// Require both type nibble == 0 AND no predicate-propagated flag (0x400)
if (type_nibble != 0 || (dest & 0x400))
return false;
    // Vtable dispatch at comp_unit->vtable[1320]:
    //   sentinel sub_7D7240 -> check source operand type bits directly
    //   otherwise -> call virtual method
    comp_unit = *(int64_t*)(ctx + 1584);
    fn = *(comp_unit->vtable + 1320);
    if (fn == sub_7D7240) {
src_type = (instr[23] >> 28) & 7;
if (src_type - 2 > 1) return false; // only types 2,3 eligible
} else {
if (fn() != 0) goto check_opcode_124;
src_type = (instr[23] >> 28) & 7;
if (src_type - 2 > 1) return false;
}
// Verify register constraints via sub_91D150 for both sources
if (sub_91D150(ctx, instr[23] & 0xFFFFFF))
goto check_opcode_124;
src2_type = (instr[25] >> 28) & 7;
if (src2_type - 2 <= 1 && !sub_91D150(ctx, instr[25] & 0xFFFFFF))
return true;
// Fall through to opcode 124 check on constraint failure
}
check_opcode_124:
// --- Path B: opcode 124 (conditional select / phi-like move) ---
if (opcode == 124) {
int dest_idx = instr[20] + ~((raw >> 11) & 2);
int dest = instr[2 * dest_idx + 21];
if (dest & 0x100) return false; // already propagated
if (dest & 0x70) return false; // has modifier bits (neg/abs/sat)
int type = dest & 0xF;
int sm_ver = *(int32_t*)(*(int64_t*)(ctx + 1584) + 372);
if ((type == 1 || type == 2) &&
(sm_ver <= 20479 || (dest & 0x1C00) == 0))
return true;
}
return false;
}
The function sub_91D150 is a trivial lookup into a per-register constraint array: return *(uint32_t*)(*(ctx+440) + 4 * reg_index). A return value of 0 means the register has no fold-blocking constraint.
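Expressed as typed C, the lookup is a single array load (a sketch; the FoldCtx struct and function name are invented for illustration -- only the pointer arithmetic mirrors the decompilation):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical typed view of the context field at ctx+440: a flat
 * per-register array of 32-bit constraint words. */
typedef struct {
    uint32_t *reg_constraints; /* what the decompilation reads via *(ctx+440) */
} FoldCtx;

/* Mirrors sub_91D150: one array load, no computation.
 * 0 => the register has no fold-blocking constraint. */
static uint32_t reg_constraint(const FoldCtx *ctx, uint32_t reg_index) {
    return ctx->reg_constraints[reg_index];
}
```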
Fold Eligibility Table
| ORI Opcode | Operation | Foldable? | Conditions | Evidence |
|---|---|---|---|---|
| 18 | Predicated copy | Yes | Source operand types must be 2 (predicate) or 3 (uniform); operand type nibble must be 0; no 0x400 flag; both source registers pass sub_91D150 constraint check | sub_8F2E50 lines 17--61 |
| 124 | Conditional select | Yes | Dest type 1 (integer) or 2 (float); no modifier bits (& 0x70 == 0); not already propagated (& 0x100 == 0); SM-version-dependent constraint check | sub_8F2E50 lines 42--51 |
| 97 | Register-to-register move | Propagated, not folded | Dest register replaced by source in all uses (copy propagation); no type/SM checks | sub_908EB0 lines 75--79 |
| 98 | Local load (LDL) | Cost-exempt fold target | In phase 37 only; target symbol looked up from constant bank; foldable if symbol is in constant bank | sub_90FBA0 lines 261--270 |
| 130 | HSET2 (packed FP16x2 compare) | Cost-exempt | Phase 37 bitmask: opcode in {130,133,134,137} bypasses fold cost penalty | sub_90FBA0 bitmask 0x99 |
| 133 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 134 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 137 | (SM-range-dependent ALU) | Cost-exempt | Same bitmask as 130 | sub_90FBA0 |
| 272 | Extended instruction | Cost-exempt | Phase 37: excluded from cost penalty alongside 273 | sub_90FBA0 line 226 |
| 273 | Extended instruction | Cost-exempt | Same as 272 | sub_90FBA0 |
| Others | General ALU/memory | Not directly foldable | GeneralOptimize does not attempt to fold; deferred to peephole passes | sub_8F2E50 returns 0 |
"Cost-exempt" means that when phase 37 (GeneralOptimizeMid) encounters these opcodes, it does not apply the standard cost penalty to the fold decision. Other opcodes accumulate a cost of 1.0 per fold, while these accumulate only v104 (default 0.25, adjustable via knob 474). Exception: even cost-exempt opcodes pay full cost if sub_91E860 reports a use count exceeding 4, preventing over-folding of widely-referenced constants.
SM Version 20479 Threshold
The value at *(*(ctx+1584) + 372) encodes the target SM architecture. The threshold 20479 (0x4FFF) is consulted in three places within the fold logic:
- sub_8F2E50 line 47: fold eligibility for opcode 124
- sub_908A60 line 107: the predicate simplifier's opcode-124 handling
- Transitively through sub_8F2E50 calls from sub_908A60
The threshold divides two immediate-encoding regimes:
| SM range | Encoded value | Fold rule | Rationale |
|---|---|---|---|
| <= 20479 | Legacy encoding | Integer (type 1) and float (type 2) constants in conditional selects fold unconditionally | Legacy architectures use fixed-width immediate slots with no sign/width constraints |
| > 20479 | Extended encoding | Same types fold only if (dest & 0x1C00) == 0 -- constraint bits at operand positions 10--12 must all be zero | Extended architectures introduced variable-width immediate encoding with sign-extension rules; bits 10--12 encode width/signedness constraints that prevent certain constants from being represented as immediates |
The encoded value at comp_unit+372 uses the formula (generation << 12) | variant. Known values: 12288 = sm_30 (gen 3), 16385 = sm_50 (gen 4), 20481 = sm_50a (gen 5), 24576 = sm_60 (gen 6), 28672 = sm_70 (gen 7), 32768 = sm_90 (gen 8), 36864 = sm_100 (gen 9). The threshold 20479 = (5 << 12) - 1 = 0x4FFF falls exactly at the generation 4/5 boundary: all generation-4 values (Kepler/Maxwell) are at or below 20479, while the first generation-5 value (20481) exceeds it. This aligns with the introduction of extended immediate encoding formats in Pascal (sm_60, gen 6) and its predecessors in the gen-5 range.
How Fold Results Are Consumed
When sub_8F2E50 returns true in sub_908EB0 (phase 29), the action depends on the opcode:
Opcode 18 (predicated copy): If the v21 flag is set (vtable dispatch at comp_unit+1312 returned non-zero, confirming the target supports this transformation), the destination operand is marked with 0x400 (propagated-under-predicate). No immediate evaluation occurs.
Opcode 124 (conditional select): A tier flag at ctx+1379 controls aggressiveness:
// In sub_908EB0, after sub_8F2E50 returns true for opcode 124:
int tier = *(uint8_t*)(ctx + 1379) & 7;
if (tier == 0) {
// AGGRESSIVE: mark dest byte-1 |= 1 (fold-committed, fast path)
dest_operand[1] |= 1;
} else {
// CONSERVATIVE: type-dispatched analysis required
if ((dest & 0xF) == 1) { // integer immediate
if (sub_8F29C0(ctx)) // predicate analysis passes
dest = (dest & 0xFFFFFDF0) | 0x201; // clear type, set propagated+eligible
} else { // float or other
if (!sub_8F29C0(ctx) || (*(uint8_t*)(ctx + 1379) & 0x1B) != 0) {
// Two-pass predicate simplifier (forward, then backward)
sub_908A60(ctx, reg, instr, 1, &out_a, &out_b); // forward
if (!out_a && !out_b[0])
sub_908A60(ctx, reg, instr, 0, &out_a, &out_b); // backward
}
dest = (dest & 0xFFFFFDF0) | 0x201; // set propagated+eligible
}
}
The tier value at ctx+1379 & 7 distinguishes:
- 0 = aggressive fold (unconditional fast path, no predicate analysis)
- 1--7 = conservative fold (requires sub_8F29C0 predicate analysis and potentially sub_908A60 two-pass simplification)
The actual constant value is not computed during GeneralOptimize. The fold marks operands with flag bits (0x100, 0x200, 0x400, byte-1 |= 1) that downstream passes consume: the apply-changes function sub_753B50 rewrites instruction lists, and the peephole/codegen passes emit the actual immediates.
The limit-fold-fp Knob
| String | "limit-fold-fp" at 0x1CE3D23 |
| Help text | "Enable/disable constant folding of float operations." at 0x1CE63B0 |
| Type | Boolean |
| Default | "false" (FP folding is NOT limited -- folding is enabled) |
| Config offset | config + 340 (registered at sub_434320 line 268) |
| Category | Optimization control (registration category 4) |
| Visibility | Internal (not exposed on public CLI) |
Despite the name, limit-fold-fp follows the convention that limit-X = true means restrict/disable X. When set to true:
- The config+340 byte propagates into per-function context flags at ctx+1379 during compilation context setup
- The ctx+1379 & 7 tier value becomes non-zero, forcing all type-2 (float) operands through the conservative fold path
- The conservative fold requires predicate analysis via sub_8F29C0 and potentially the two-pass sub_908A60 simplifier, which rejects folds where predicate conditions are ambiguous
- This prevents FP constants from being folded when the fold could alter precision semantics -- for example, folding an FMA source operand might lose the fused multiply-add precision guarantee of the original instruction
The predicate analysis helper sub_8F29C0 performs three sequential checks on the compilation unit at ctx+1584: sub_7DC0E0, sub_7DC050, and sub_7DC030. If any returns true, the predicate allows safe propagation. These check architecture capability flags for predicated constant operations.
Phase 37 Fold Cost Model
sub_90FBA0 (the main loop for GeneralOptimizeMid) integrates fold decisions into a cost-weighted convergence model rather than a simple boolean. Key elements:
Opcode classification (lines 226--228 of the decompiled output): a bitmask 0x99 applied to the range 130--137 classifies opcodes as cost-exempt. The expression ~(0x99 >> (uint8_t)(opcode + 126)) & v15 clears v15 (the cost flag) for opcodes where the corresponding bit in 0x99 is set -- the 8-bit truncation of opcode + 126 maps opcode 130 to shift 0, 133 to 3, 134 to 4, and 137 to 7. Combined with the range check for 272--273:
| Cost-exempt opcodes | Bitmask bit | Interpretation |
|---|---|---|
| 130 | bit 0 of 0x99 | FP16x2 compare (HSET2 family) |
| 133 | bit 3 of 0x99 | SM-range-dependent ALU |
| 134 | bit 4 of 0x99 | SM-range-dependent ALU |
| 137 | bit 7 of 0x99 | SM-range-dependent ALU |
| 272, 273 | Direct check | Extended load/store variants |
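Under the recovered wrap-around indexing, the classification can be reproduced in a few lines of C (a sketch; is_cost_exempt is our name, not a symbol from the binary):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the phase-37 cost-exemption test recovered from sub_90FBA0.
 * The mask 0x99 (binary 10011001) is indexed by (opcode + 126) mod 256,
 * so opcodes 130/133/134/137 land on bits 0/3/4/7. Opcodes 272-273 are
 * checked directly rather than through the mask. */
static bool is_cost_exempt(uint32_t opcode) {
    if (opcode == 272 || opcode == 273)
        return true;
    uint8_t shift = (uint8_t)(opcode + 126); /* wraps mod 256 */
    if (shift > 7)                           /* outside the 8-bit mask */
        return false;
    return (0x99u >> shift) & 1u;
}
```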
Cost computation: For cost-exempt opcodes, fold cost = v104 * v35 where v104 defaults to 0.25 (overridable via knob 474) and v35 is 0.0 if the instruction is dead (checked via sub_7DF3A0), 1.0 otherwise. For non-exempt opcodes, fold cost = 1.0 * v35 (full weight).
Use-count gate: Even cost-exempt opcodes pay full cost if sub_91E860 (use-count estimator) reports more than 4 uses, preventing over-folding of widely-referenced constants.
Convergence: Accumulated costs at context+26 (weighted) and context+27 (unweighted) are doubles. The loop continues until the cost delta falls below the threshold (default 0.25 from knob 474; overridable by knob 135 when *(config+9720) is set).
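Putting the pieces together, the per-fold cost contribution reduces to a small function (a hypothetical distillation -- the boolean inputs stand in for the sub_7DF3A0 dead check and the sub_91E860 use count; names are ours):

```c
#include <assert.h>
#include <stdbool.h>

/* Distilled phase-37 fold cost rules recovered from sub_90FBA0:
 *   dead instruction   -> 0.0
 *   >4 uses            -> full cost 1.0 (even when cost-exempt)
 *   cost-exempt opcode -> v104 (knob 474, default 0.25)
 *   everything else    -> 1.0                                   */
static double fold_cost(bool cost_exempt, bool is_dead, int use_count,
                        double v104 /* knob 474 value */) {
    double v35 = is_dead ? 0.0 : 1.0;
    if (cost_exempt && use_count <= 4)
        return v104 * v35;
    return 1.0 * v35;
}
```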
Register Constraint Validation: sub_8F3FE0
Phase 37 uses sub_8F3FE0 to validate that folding an instruction's operands respects register-class constraints. The function:
- Queries comp_unit->vtable[904] for the per-element operand size of the instruction
- Queries comp_unit->vtable[936] (if not sentinel sub_7D7040) for per-instruction fold metadata
- Iterates over all source operands:
  - Requires operand type bits (>> 28) & 7 to be 2 or 3 (predicate or uniform register)
  - Calls sub_91D150 to look up the register constraint for each source operand
  - Compares against a previously cached constraint at context + 7 (8-byte stride per fold group)
  - Returns 0 (fold invalid) on any constraint mismatch
- Loop count is determined by the destination operand format field (& 7)
- Returns 1 only if all source operands have consistent register constraints
Constant Folding and Propagation Marking Architecture
The term "constant folding" in the context of GeneralOptimize is misleading. The pass does not evaluate arithmetic at compile time (e.g., replacing 3 + 5 with 8). Instead, it performs constant propagation eligibility marking -- identifying operands that hold constant or propagatable values and setting flag bits so downstream passes can exploit this information. Actual arithmetic evaluation occurs elsewhere in the pipeline.
Three Levels of Constant Handling in ptxas
Constant handling spans three distinct pipeline stages, each with different scope and mechanism:
| Level | Stage | Functions | What It Does | What It Does NOT Do |
|---|---|---|---|---|
| 1 -- ORI-IR Propagation Marking | GeneralOptimize (phases 13/29/37/46/58/65) | sub_908EB0 (body), sub_8F2E50 (gate), sub_908A60 (deep analysis) | Marks operands with flag bits (0x100/0x200/0x400) indicating they are eligible for constant propagation | Evaluate arithmetic; rewrite instructions; emit immediates |
| 2 -- SASS Peephole Combining | Post-ISel peephole (phases 83+) | sub_83EF00 (156KB mega-peephole), sub_1249B50 (integer ALU fold), sub_1249940 (MOV-pair matcher) | Combines MOV-from-immediate + ALU instruction pairs into single instructions with folded constants | Operate on ORI IR; handle non-MOV sources |
| 3 -- Frontend Expression Evaluation | PTX parser/validator (address range 0x460000--0x4D5000) | Multiple validator functions (string evidence: "Constant expression has division by zero", "Constant overflow") | Evaluates PTX-level constant expressions during parsing; reports errors for invalid expressions | Operate on internal IR; run during optimization |
The limit-fold-fp knob controls Level 1 only -- specifically whether float-typed operands take the fast path or must go through predicate analysis before being marked.
SM Version Encoding and the 20479 Boundary
The SM version at comp_unit->profile[+372] is not a direct sm_XX number. It uses a packed encoding:
encoded_sm = (generation << 12) | variant
Concrete values from the binary:
| Encoded | Hex | Generation | Variant | Architecture |
|---|---|---|---|---|
| 12288 | 0x3000 | 3 | 0 | sm_30 (Kepler) |
| 16385 | 0x4001 | 4 | 1 | sm_50 (Maxwell) |
| 20481 | 0x5001 | 5 | 1 | sm_50a (Maxwell alt / gen-5 base) |
| 24576 | 0x6000 | 6 | 0 | sm_60 (Pascal) |
| 28672 | 0x7000 | 7 | 0 | sm_70 (Volta) |
| 28673 | 0x7001 | 7 | 1 | sm_80 (Ampere) |
| 32768 | 0x8000 | 8 | 0 | sm_90 (Hopper) |
| 36864 | 0x9000 | 9 | 0 | sm_100 (Blackwell) |
The threshold 20479 = (5 << 12) - 1 = 0x4FFF. This is the largest value that fits in generation 4. Every generation-5+ encoded value exceeds it.
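The encoding and threshold check are trivial to reproduce (a sketch; function names are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SM_GEN_THRESHOLD 20479u /* 0x4FFF = (5 << 12) - 1 */

/* Packed SM encoding recovered from comp_unit->profile[+372]. */
static uint32_t encode_sm(uint32_t generation, uint32_t variant) {
    return (generation << 12) | variant;
}

/* Generation-5+ targets use the extended immediate encoding and must
 * additionally pass the (dest & 0x1C00) == 0 constraint-bit check. */
static bool needs_constraint_bits_clear(uint32_t encoded_sm) {
    return encoded_sm > SM_GEN_THRESHOLD;
}
```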
The fold-eligibility impact:
- SM <= 20479 (generation 4 and below -- Kepler, Maxwell): Integer and float immediates in conditional-select instructions (opcode 124) fold unconditionally. The hardware uses fixed-width immediate slots with no sign/width constraints at operand bit positions 10--12.
- SM > 20479 (generation 5+ -- Pascal and all newer): The operand's constraint bits at positions 10--12 (mask 0x1C00) must all be zero for folding to proceed. These bits encode hardware constraints introduced with extended immediate formats:
  - Bit 10: immediate width constraint (narrow vs wide encoding)
  - Bit 11: sign-extension requirement
  - Bit 12: bank-relative vs absolute encoding
The threshold appears in 6 locations across the binary, confirming it is a fundamental architectural boundary rather than an ad-hoc check: sub_8F2E50 (fold eligibility), sub_406C5E (peephole), sub_406018 (peephole operand matcher), sub_751940 (instruction walker), sub_78DB70 (phase pre-check), sub_848790 (register bank coalescer).
Architecture Class Predicate: sub_8F29C0 Internals
The 9-line function sub_8F29C0 queries three architecture capability checks in sequence. If any returns true, the conservative fold path (which requires additional predicate analysis) is the correct approach for the target:
bool arch_needs_conservative_fold(int64_t ctx) {
int64_t cu = *(int64_t*)(ctx + 1584);
if (sub_7DC0E0(cu)) return true; // isDualIssue
if (sub_7DC050(cu)) return true; // isNvlinkArch
return sub_7DC030(cu); // isGraceArch
}
Each sub-function reads the architecture class field at comp_unit->profile[+12]:
| Function | Check | Class ID | Architecture Family |
|---|---|---|---|
| sub_7DC0E0 | profile[+12] == 4 | 4 | Dual-issue (Maxwell sm_50) |
| sub_7DC050 | profile[+12] == 11 OR profile[+1418] & 1 | 11 | NVLink-capable (Volta+) |
| sub_7DC030 | profile[+12] == 10 OR profile[+1417] >> 7 | 10 | Grace (ARM-based) |
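Typed out under the recovered offsets, the three predicates and their disjunction look like this (the Profile struct and function names are invented; only the field offsets and class IDs come from the decompilation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented typed view of the fields the three checks read. */
typedef struct {
    int32_t arch_class; /* profile[+12] */
    uint8_t flags_1417; /* profile[+1417] */
    uint8_t flags_1418; /* profile[+1418] */
} Profile;

static bool is_dual_issue(const Profile *p)  { return p->arch_class == 4; }
static bool is_nvlink_arch(const Profile *p) { return p->arch_class == 11 || (p->flags_1418 & 1); }
static bool is_grace_arch(const Profile *p)  { return p->arch_class == 10 || (p->flags_1417 >> 7); }

/* Mirrors sub_8F29C0's short-circuit sequence. */
static bool arch_needs_conservative_fold(const Profile *p) {
    return is_dual_issue(p) || is_nvlink_arch(p) || is_grace_arch(p);
}
```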
When sub_8F29C0 returns true: folding a constant into a conditional select requires predicate analysis first, because these architectures have immediate encoding differences between conditional and unconditional instruction forms, or because predicate evaluation may have observable side effects.
When sub_8F29C0 returns false (simpler architectures): the fold attempt still proceeds but falls through to the more expensive two-pass predicate simplifier (sub_908A60) as a fallback rather than using the direct marking path.
Two-Pass Predicate Simplifier: sub_908A60 Internals
When the eligibility check passes for opcode 124 but the conservative path is required (either sub_8F29C0 returns false, or the tier flags at ctx+1379 & 0x1B have bits set), sub_908A60 performs a bidirectional scan of the instruction stream to validate that the fold is safe.
Signature: sub_908A60(ctx_array, basic_block_id, instr, direction, &out_hit, &out_partial)
| Parameter | Type | Meaning |
|---|---|---|
a1 | int64_t* | Context as QWORD array (a1[37] = block array, a1[198] = comp_unit) |
a2 | int | Basic block index (from the definition anchor) |
a3 | int64_t | Current instruction pointer |
a4 | int | Direction: 1 = forward scan, 0 = backward scan |
a5 | int* | Output: 1 if a complete safe-fold chain was found |
a6 | int* | Output: 1 if architecture supports aggressive mode |
Algorithm:
- Allocates a 24-byte tracking structure via comp_unit->vtable[24]
- Queries architecture mode via sub_7DC0E0 / sub_7DC050 / sub_7DC030
- Walks instructions in the specified direction within the basic block:
  - Opcode 97 (STG in ROT13; used as definition anchor/label marker): follows the label chain to the next definition
  - Opcode 52 (NOP/delimiter): stops the scan (block boundary)
  - Opcode 124 or 18: recursively calls sub_8F2E50 on the chained instruction to verify fold safety through the chain
- Sets output flags based on whether a complete safe-fold chain was found
Invocation pattern in sub_908EB0:
// Forward pass first
sub_908A60(ctx, bb_id, instr, 1, &hit, &partial);
if (hit) goto mark_propagated;
// If forward found nothing useful, try backward
if (!partial) {
sub_908A60(ctx, bb_id, instr, 0, &hit, &partial);
if (hit) goto mark_propagated;
if (!partial) continue; // neither direction found a match
}
The two-pass strategy (forward then backward) handles PHI-like merge patterns at loop boundaries. Forward catches definitions along normal control flow; backward catches definitions from loop back-edges. The partial flag prevents unnecessary backward scans when the forward pass already determined the chain is definitively unfoldable.
Algebraic Simplification and Structural Equivalence
The algebraic simplifier in GeneralOptimize is not a traditional constant-identity pattern matcher. It does not check operand values against constants (0, 1, -1) to recognize identities like x+0 or x*1. Instead, it is a structural equivalence-based pattern recognizer that detects when two instructions in a def-use chain compute identical values, enabling one to be eliminated. Traditional algebraic identity patterns (x+0->x, x*1->x, x&0->0, x-x->0, etc.) are handled by the separate MainPeepholeOptimizer -- see the comparison table below.
The simplifier lives in sub_753600 (Phase 13, GeneralOptimizeEarly) and is approximately 253 lines of decompiled code. It operates on chains of instructions linked through def-use relationships.
Entry Guard
The function only triggers on instructions matching a narrow pattern:
// sub_753600 entry guard
if (instr[18] == 95 // opcode 95 (STS in ROT13; used as terminator class)
&& instr[20] == 5 // exactly 5 operands
&& (instr[25] & 7) - 3 <= 1) // operand format 3 (register) or 4 (immediate)
{
// proceed to chain walk
}
The restriction to opcode 95 means this simplifier targets conditional exit/return sequences where a guard predicate or condition is computed redundantly. The 5-operand constraint ensures the instruction has the expected layout: result, predicate, and three source operands.
Chain-Walking Algorithm
After the entry guard passes, sub_753600 executes a 9-step algorithm:
Step 1 -- Def-chain traversal. Reads the use-list pointer at instr[17] (offset 136). Checks that the use-list head exists, points to a single definition (head's first element is null), and that the next instruction in the chain has opcode 97 (STG in ROT13; used as definition anchor/label).
Step 2 -- Register resolution. Follows the register index through the register table at ctx+296 to resolve the first chain link to a concrete register entry. Both chain paths (via instr[17]+8 field, "use-list index", and via the register table) must point to the same entry.
Step 3 -- First pair detection via sub_753520. This helper calls sub_753480 to walk the single-def chain forward, looking for an instruction with opcode 93 (OUT_FINAL in ROT13; used as a chain-link marker). At each step, sub_753480 checks:
- sub_7E5120 -- is the current entry eligible for chain-following? (checks constant bank membership, block region flags, and opcode 91 via sub_7A1A90)
- The use-list pointer at entry[16] has a null head (single use)
- The use-list pointer at entry[17] has a null head (single def)
- The register index at entry[17]+8 matches the next instruction's register at entry[1]+8 -> +24
Step 4 -- Second pair detection via sub_753570. Starting from the first pair's result, follows the chain one more step looking for a second opcode-93 instruction that references back to the same register as the first pair's target.
Step 5 -- Predicate-operand compatibility check via sub_7E7380:
// sub_7E7380 -- predicate-operand compatibility check (narrow, not full structural)
bool predicate_operand_compatible(Instr* a, Instr* b) {
bool a_has_pred = (a->opcode & 0x1000) != 0; // bit 12: predicated
bool b_has_pred = (b->opcode & 0x1000) != 0;
if (a_has_pred != b_has_pred)
return false;
if (a_has_pred && b_has_pred) {
// Compare last operand (predicate register): 24-bit register index
int a_idx = a->operands[a->operand_count - 1] & 0xFFFFFF;
int b_idx = b->operands[b->operand_count - 1] & 0xFFFFFF;
if (a_idx != b_idx) return false;
// Compare preceding operand pair (full 64-bit equality)
return a->operands[a->operand_count - 2] == b->operands[b->operand_count - 2];
}
return true; // both unpredicated: predicate-compatible at this level
}
This confirms the two instructions have matching predication structure -- same predicate register, same predicate condition encoding.
Step 6 -- Operand format classification. Computes the effective operand position as operand_count - ((opcode >> 11) & 2) and checks whether it equals 5. When it does, reads the format code at instr[25] & 7. Format 3 means register operand, format 4 means immediate. Both instructions must have the same format classification (both register or both immediate).
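A minimal sketch of the Step 6 computation, assuming the recovered bit layout (both function names are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The effective operand position drops trailing predicate operands when
 * opcode bit 12 is set: (opcode >> 11) & 2 is 2 for predicated forms. */
static int effective_operand_pos(uint32_t opcode, uint32_t operand_count) {
    return (int)(operand_count - ((opcode >> 11) & 2));
}

/* Format code 3 = register operand, 4 = immediate. Both instructions
 * must carry the same classification for the match to proceed. */
static bool formats_compatible(uint32_t fmt_a, uint32_t fmt_b) {
    return fmt_a == fmt_b && (fmt_a == 3 || fmt_a == 4);
}
```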
Step 7 -- Register index equality. Compares the 24-bit register index: (instr_a[v23+21] & 0xFFFFFF) == (instr_b[v24+21] & 0xFFFFFF). When equal and the full operand descriptors at instr[23] and instr[24] also match, the instructions provably compute the same value. The function jumps to the success path.
Step 8 -- Modifier verification via sub_747F40 and sub_747F80:
// sub_747F40 -- negation flag extraction
int get_negation(Instr* instr) {
int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
return (instr->data[25] >> 3) & 1; // bit 3 of format byte
return 0;
}
// sub_747F80 -- absolute-value flag extraction
int get_abs(Instr* instr) {
int eff = instr->operand_count - ((instr->opcode >> 11) & 2);
if (eff == 5 && (instr->data[25] & 7) - 3 < 2)
return (instr->data[25] >> 4) & 1; // bit 4 of format byte
return 0;
}
Both instructions must have identical negation and absolute-value flags. If neg(a) != neg(b) or abs(a) != abs(b), the pattern is rejected. This prevents incorrectly identifying neg(x) as equivalent to x.
Step 9 -- Deep sub-DAG equivalence. When register indices differ but operand type bits (bits 28-30) equal 1 (register type), the simplifier follows the definition chain to the defining instruction and attempts structural matching at depth:
// Deep equivalence path (sub_753600, lines 149-189)
if (((operand_a >> 28) & 7) == 1) { // register-type operand
RegEntry* reg_a = reg_table[operand_a & 0xFFFFFF];
if (reg_a->use_count_field == 5) { // field at +64
Instr* def_a = reg_a->defining_instr; // field at +56
// ...same for operand_b...
if (def_a->opcode == 119 && def_b->opcode == 119) { // both SHFL
int res_a = def_a->operands[2 * def_a->operand_count + 19];
int res_b = def_b->operands[2 * def_b->operand_count + 19];
if ((res_a & 1) == 0 && (res_b & 1) == 0 // bit 0 clear
&& ((res_a | res_b) & 8) == 0 // bit 3 clear
&& !sub_748570(def_a, ctx) // no alias hazard
&& !sub_748570(def_b, ctx) // no alias hazard
&& def_a->data[25] == def_b->data[25] // format match
&& def_a->data[26] == def_b->data[26] // descriptor match
&& sub_1245740(ctx, def_a, def_b, 2)) // depth-2 DAG eq
{
// Match found -- proceed to chain extension
}
}
}
}
The depth limit of 2 (fourth argument to sub_1245740) prevents exponential blowup in the equivalence check while still catching common patterns like f(g(x)) == f(g(y)) when x == y.
Chain Extension and Accumulation
After finding one matching pair, the function extends the search down the chain. It calls sub_753520 and sub_753570 on subsequent entries, accumulating the full matching sequence in the state array at a1[1] through a1[6]. The state layout is:
State array (passed as a1, 7 qword slots):
a1[0] = ctx (compilation context)
a1[1] = first matched instruction (start of sequence)
a1[2] = second matched instruction (end of first pair)
a1[3] = third matched instruction (from sub_753520)
a1[4] = fourth instruction (next link)
a1[5] = fifth instruction (from secondary sub_753520)
a1[6] = sixth instruction (final chain link)
The function returns true (changed) when the full chain is successfully matched. The caller (sub_7917F0) then invokes sub_753B50 to rewrite the matched sequence.
What This Actually Eliminates
The pattern this simplifier catches is: a sequence of conditional exit instructions where the guard predicates, condition codes, and source operands are structurally equivalent. In practice, this arises from lowering transformations that produce redundant conditional exit/return pairs -- for example, when a function has multiple return paths that were not merged during earlier optimization, or when predicated code duplication creates exit sequences with identical conditions.
The rewrite performed by sub_753B50 replaces the redundant chain with a single exit/return sequence, updating the block's instruction list, register-to-instruction mappings, and def-use chains.
Algebraic Pattern Location Map
The following table clarifies which optimization pass handles each category of algebraic simplification:
| Pattern Category | Pass | Location | Evidence |
|---|---|---|---|
| Structural equivalence (identical computation chains) | GeneralOptimize Phase 13 | sub_753600 | CERTAIN -- decompiled |
| Modifier canonicalization (neg/abs flag matching) | GeneralOptimize Phase 13 | sub_747F40, sub_747F80 | CERTAIN -- decompiled |
| Sub-DAG equivalence (depth-limited tree comparison) | GeneralOptimize Phase 13 | sub_1245740 | CERTAIN -- decompiled |
| Copy propagation (reg-reg, predicated, conditional) | GeneralOptimize Phase 29 | sub_908EB0 | CERTAIN -- decompiled |
| Predicate simplification (constant predicates) | GeneralOptimize Phase 29 | sub_908A60 | CERTAIN -- decompiled |
| Register promotion (memory-to-register conversion) | GeneralOptimize Phase 37 | sub_90EF70 | CERTAIN -- decompiled |
| Identity: x+0->x, x*1->x, x&(-1)->x, x\|0->x, x^0->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Annihilator: x*0->0, x&0->0 | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Inverse: x-x->0, x^x->0, !!x->x | MainPeepholeOptimizer | sub_169B190 et al. | HIGH -- 3,185 pattern matchers |
| Strength reduction: x*2->x<<1, x/1->x | StrengthReduction (phase 26) | documented separately | CERTAIN -- separate pass |
| Predicate identity: p&true->p, p\|false->p | MainPeepholeOptimizer + Phase 29 | combined | MEDIUM |
The MainPeepholeOptimizer operates on the full SASS opcode set via three 233-280 KB dispatch functions with 373-case primary switches. Its pattern tables encode the constant-identity rules (IADD3 with zero source becomes MOV, IMAD with unit multiplier becomes shift/add, LOP3 with identity LUT becomes passthrough, etc.) as prioritized rewrite rules. See Peephole Optimization for full details.
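For illustration, the identity rows above reduce to a matcher of this shape (a toy sketch on an invented IR -- the real pass matches SASS opcodes through its dispatch tables, not this enum):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy two-operand integer op, for illustration only. */
typedef enum { OP_ADD, OP_MUL, OP_AND, OP_OR, OP_XOR } Op;

/* Returns true (and sets *is_copy) when "x <op> imm" reduces to a copy
 * of x, mirroring the identity rows: x+0, x*1, x&-1, x|0, x^0. */
static bool fold_identity(Op op, int64_t imm, bool *is_copy) {
    switch (op) {
    case OP_ADD: *is_copy = (imm == 0);  break;
    case OP_MUL: *is_copy = (imm == 1);  break;
    case OP_AND: *is_copy = (imm == -1); break;
    case OP_OR:  *is_copy = (imm == 0);  break;
    case OP_XOR: *is_copy = (imm == 0);  break;
    }
    return *is_copy;
}
```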
Helper Functions: sub_753E30 and sub_753F70
Three additional helpers extend the Phase 13 algebraic simplifier beyond the main sub_753600 path:
sub_753E30 (67 lines) -- secondary chain matcher that handles the case where the first instruction in the chain has a source register index (instr[25] & 0xFFFFFF) that differs from the current block's register at *(a2 + 24). It follows a more complex chain topology involving three register entries (at state slots a1[7], a1[8], a1[9]) and validates that the secondary chain loops back to the primary entry. This catches equivalences across register renaming boundaries.
sub_753F70 (49 lines) -- vtable-dispatched transformation that performs the actual rewrite for chains detected by sub_753E30. It calls through comp_unit->vtable[656] (with sentinel check against sub_744F30). When the vtable method returns true, it constructs opcode-93 replacement instructions via sub_92E1B0 and splices the old chain out via sub_91E310. This is the surgical rewrite counterpart to sub_753B50's rewrite for the main path.
sub_753DB0 (33 lines) -- chain tail finder that walks from a given register entry forward through the def-chain, following opcode-97 links via the register table. Returns the last reachable entry in the chain (the "tail") or the entry one step before a broken link. Used by the extended chain detection logic to determine where the equivalence region ends.
Dead Code Elimination
DCE within GeneralOptimize is lightweight compared to the standalone OriPerformLiveDead passes (phases 16, 33, 61, 84). It operates locally within basic blocks using the sub_7DF3A0 function:
// sub_7DF3A0 -- instruction liveness check
// Returns pointer to status word
// Bits 2-3 (mask 0xC): has live uses
// Bit 0 (mask 0x1): marked dead
int8_t* check_liveness(int64_t instr, int64_t* ctx) {
// ... examines use-def chains ...
return status_ptr; // caller checks (*result & 0xC) != 0
}
In sub_908EB0, the DCE check appears as the fallback for unrecognized opcodes:
if (!v10) { // v10 = "previous instruction was a recognized copy"
int8_t* status = sub_7DF3A0(instr, ctx);
v10 = (*status & 0xC) != 0; // live uses exist?
}
When (*status & 0xC) == 0, the instruction has no live consumers and is effectively dead. In Variant A, dead instructions are not immediately deleted -- they are marked for removal by the convergence loop cleanup phase (sub_753B50), which rewires the instruction list to skip dead nodes and updates the block's def-use chains via sub_931920, sub_932E80, sub_749090, and sub_9253C0.
In Variant B (phase 58), sub_8F6530 uses the same sub_7DF3A0 liveness check but integrates the result into its 7-counter change tracking structure, incrementing the appropriate sub-pass counter when a dead instruction is found.
Predicate Simplification
A distinct sub-pass handles predicate register operations. The code in sub_908EB0 at the opcode-18 and opcode-124 branches processes predicated moves and conditional selects:
- Opcode 18 (predicated move): if the predicate is known-true (from prior constant folding), simplifies to an unconditional move. If the v21 flag is set (indicating the vtable dispatch at comp_unit+1312 returned non-zero, i.e. the target supports this transformation), marks the destination operand with 0x400
- Opcode 124 (conditional select): if both source operands are identical (detected via def-chain comparison), simplifies to an unconditional copy; if the predicate is constant, selects the appropriate source. The two-pass approach via sub_908A60 handles phi-like patterns where direction matters:
  - Pass 1: sub_908A60(ctx, reg_entry, instr, 1, &out_a, &out_b) -- forward direction
  - Pass 2 (if pass 1 found no simplification but detected a partial match): sub_908A60(ctx, reg_entry, instr, 0, &out_a, &out_b) -- backward direction
The helper sub_8F29C0 at 0x8F29C0 performs predicate-specific analysis, determining whether the predicate condition allows safe propagation given the current instruction context.
The Per-Block Sub-Pass Runner: sub_8F6530 (Variant B Detail)
The 550-line function sub_8F6530 is the core of Variant B (phase 58). It processes a single basic block using a 6-slot circular buffer of instruction pairs, tracked at 56-byte intervals:
sub_8F6530 Context (passed as a1)
+0x000 ctx_ptr -- compilation context
+0x008 flag_ctrl_flow_4 -- from ctx+1396 bit 2 (opcode-7 enable)
+0x009 flag_ctrl_flow_8 -- from ctx+1396 bit 3 (opcode-6 enable)
+0x00C slot_index -- current slot (modulo 6)
+0x010 slot_0_changed -- boolean: did this slot's pair fire?
+0x014 slot_0_count -- how many pairs stored in this slot
Slot layout (each 56 bytes = 7 int64_t):
+0x00 count/used flag
+0x04 changed flag
+0x08 instr_ptr_a -- first instruction of the pair
+0x10 instr_ptr_b -- second instruction of the pair
+0x18 (reserved)
...
6 slots at offsets: +0x10, +0x48, +0x80, +0xB8, +0xF0, +0x128
The slot index increments with (*(a1+3) + 1) % 6 after each pair is processed. When a new instruction pair is encountered that doesn't match any existing slot, the oldest slot is evicted (slot index advances). Each slot can hold up to 2 instruction pointers.
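The slot rotation itself is a plain modulo-6 ring (a sketch with invented types; only the (index + 1) % 6 advance and the pair-per-slot layout mirror the decompilation):

```c
#include <assert.h>
#include <stdint.h>

/* Invented flat view of the 6 slots that sit at 56-byte intervals in
 * the real sub_8F6530 context. */
typedef struct {
    uint64_t instr_a, instr_b; /* the tracked instruction pair */
    int      used;
} Slot;

typedef struct {
    Slot slots[6];
    int  index; /* mirrors *(a1+3) */
} PairRing;

/* Insert a pair, evicting whatever occupies the oldest slot;
 * returns the slot that received the pair. */
static int ring_insert(PairRing *r, uint64_t a, uint64_t b) {
    int i = r->index;
    r->slots[i].instr_a = a;
    r->slots[i].instr_b = b;
    r->slots[i].used = 1;
    r->index = (i + 1) % 6; /* matches (*(a1+3) + 1) % 6 */
    return i;
}
```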
The function walks the instruction list looking for specific opcode patterns:
- Opcodes 139 and 110 (MOV variants with different addressing modes): these are the primary targets. The function checks the operand field at instr+76 for value 6 (register operand) or 7 (immediate operand), with the flag_ctrl_flow_4 and flag_ctrl_flow_8 gates controlling which variants are processed
- For register operands (type field bits 28-30 == 1), it verifies:
  - Use count == 1 (*(reginfo+24) == 1)
  - No aliasing flags (*(reginfo+50) & 1 == 0)
  - Register class not in range 2-8 (*(reginfo+20) - 2 > 6)
- For instructions with opcode 139 and no modifier bits (*(instr+88) & 0x603FFFF == 0), the function attempts to find the instruction in the circular buffer and either promotes it (if found) or inserts it as a new entry
- Option 605 (getOption(ctx, 605)) at 0x8F6530+0x1A0: when enabled, restricts matching to instructions already present in the buffer, preventing new insertions. This is an architecture-gated optimization
Fixed-Point Convergence
Per-Block Iteration Model
All GeneralOptimize variants use a per-block convergence model: they iterate over basic blocks in the order given by the block ordering table at ctx+512 (reverse postorder), and for each block, run the sub-passes repeatedly until convergence. This differs from the global worklist model used by other optimizers (GVN-CSE at phase 49 uses a global worklist).
for each block B in reverse postorder:
repeat:
changed = run_sub_passes(B)
until !changed OR !getOption(464)
The block ordering table is an array of int32_t indices at *(ctx+512), with the count at *(ctx+520). Block iteration starts at index 1 (not 0) and proceeds through bb_count inclusive. Each index is used to look up the actual basic block pointer via *(*(ctx+296) + 8 * block_order[i]).
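The lookup can be sketched in C as follows; the struct names and fields are invented stand-ins for the raw offsets quoted above (ctx+296, ctx+512, ctx+520):

```c
/* Sketch of the ordered block walk: indices come from the int32 table
 * (modeling *(ctx+512), count at *(ctx+520)), are used 1-based through
 * bb_count inclusive, and select block pointers from the pointer array
 * (modeling *(ctx+296)). All names here are illustrative assumptions. */
#include <stdint.h>

typedef struct { int id; } basic_block;

typedef struct {
    basic_block  **blocks;       /* models the pointer array at *(ctx+296) */
    const int32_t *block_order;  /* models the ordering table at *(ctx+512) */
    int64_t        bb_count;     /* models the count at *(ctx+520) */
} opt_ctx;

/* Copy out the blocks in optimization order; returns how many were visited. */
static int collect_ordered_blocks(const opt_ctx *ctx, basic_block **out)
{
    int n = 0;
    for (int64_t i = 1; i <= ctx->bb_count; i++)  /* starts at 1, inclusive */
        out[n++] = ctx->blocks[ctx->block_order[i]];
    return n;
}
```

Note the off-by-one convention: slot 0 of the ordering table is never read, matching the decompiled loop bounds.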
Change Detection Mechanism
Changes are detected through different protocols depending on the variant:
- Variant A (sub_753600): returns a boolean. The return value is the logical OR of all sub-pass fire events. The state machine in sub_7917F0 stores the result in v15 (mapped to register bp) and accumulates across iterations via v4 = v15
- Variant B, phase 58 (sub_8F6530): maintains 7 independent counters at 56-byte intervals in the context structure. Counters are at *(a1 + 5), *(a1 + 19), *(a1 + 33), *(a1 + 47), *(a1 + 61), *(a1 + 75). The corresponding boolean changed-flags are at *(a1 + 16), *(a1 + 72), *(a1 + 128), *(a1 + 184), *(a1 + 240), *(a1 + 296). All are zero-initialized at entry. The caller checks whether any counter is non-zero to determine convergence
- Variant B, phase 37 (sub_90FBA0): uses a different approach -- tracks a floating-point "cost" accumulator at context+25/26/27 (three double values representing total cost, weighted cost, and instruction count). Convergence is determined when the cost delta falls below a threshold (initialized to 0.25, adjustable via knob 474 at 0x90FBA0+0x50). Knob 135 at 0x90FBA0+0x20 controls an initial threshold override when enabled (checked via *(config+9720))
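The phase-37 cost-delta criterion can be sketched like this; run_subpasses is a hypothetical stand-in for one sweep of sub_90FBA0's inner loop, and the threshold models the knob-474 default of 0.25:

```c
/* Sketch of cost-based convergence: iterate until the change in the
 * accumulated cost drops below a threshold. Returns the total number of
 * sweeps executed. Names are illustrative, not recovered symbols. */
#include <math.h>

typedef double (*pass_fn)(void *ctx);

static int iterate_to_cost_fixpoint(void *ctx, pass_fn run_subpasses,
                                    double threshold, int max_iters)
{
    double prev = run_subpasses(ctx);
    for (int i = 1; i < max_iters; i++) {
        double cur = run_subpasses(ctx);
        if (fabs(prev - cur) < threshold)
            return i + 1;  /* converged: cost delta below threshold */
        prev = cur;
    }
    return max_iters;      /* safety cap reached */
}

/* Demo pass whose cost halves every sweep (ctx points at the cost). */
static double halving_pass(void *ctx) { double *c = ctx; *c *= 0.5; return *c; }
```

The point of a delta criterion (rather than a boolean changed flag) is that the pass can keep firing marginal rewrites forever; it stops once further rewrites no longer pay for themselves.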
Iteration Limits
The fixed-point loop is guarded by option 464 in Variant A. In sub_7917F0:
while (true) {
bool changed = sub_753600(&state, bb);
if (!changed) break;
// Option 464 check -- same vtable fast-path pattern:
// vtable[152] == sub_67EB60 => sub_7468B0(config, 464)
// otherwise => vtable[152](config, 464, 1)
if (!getOption_v2(ctx, 464)) break;
sub_753B50(&state); // apply rewrites before re-scanning
}
The option 464 check is called after each successful iteration (when changed == true). If the option returns false, the loop terminates even though more changes could be made. The exact semantics of option 464 depend on the knob's implementation -- it could be a simple counter that decrements, a boolean that gets cleared after N iterations, or a cost-based threshold. The default behavior (when option 464 always returns true) allows unbounded iteration until convergence.
Variant B (phases 37 and 58) does not use option 464 for iteration control. Phase 37 uses the cost-based threshold described above. Phase 58 makes a single pass over the block list via sub_8F6FA0, which does not loop -- each block is visited exactly once, with the 6-slot circular buffer providing limited lookback within the walk.
In practice, most basic blocks converge in 1--3 iterations. A block that generates new optimization opportunities typically does so because copy propagation exposes a constant, which enables constant folding, which creates a dead instruction. The second iteration catches any cascading effects, and the third confirms convergence. Blocks requiring more than 3 iterations are rare and typically involve chains of dependent copies or nested predicate simplifications.
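The cascade can be made concrete with a toy model: copy propagation exposes a constant, constant folding then rewrites an ALU op into an immediate move, and a further sweep records the new constant. This is purely illustrative and uses none of ptxas's actual data structures:

```c
/* Toy per-block fixed point: sweep until no sub-pass fires. */
enum op { MOVI, MOVR, ADD };

typedef struct { enum op op; int dst, src1, src2, imm; } instr;

/* One sub-pass sweep over the block; returns 1 if anything changed. */
static int run_subpasses(instr *code, int n, int *konst, int *known)
{
    int changed = 0;
    for (int i = 0; i < n; i++) {
        instr *I = &code[i];
        if (I->op == MOVI && !known[I->dst]) {
            known[I->dst] = 1; konst[I->dst] = I->imm; changed = 1;
        } else if (I->op == MOVR && known[I->src1] && !known[I->dst]) {
            /* copy propagation: the source is a known constant */
            known[I->dst] = 1; konst[I->dst] = konst[I->src1]; changed = 1;
        } else if (I->op == ADD && known[I->src1] && known[I->src2]) {
            /* constant folding: rewrite to an immediate move */
            I->op = MOVI; I->imm = konst[I->src1] + konst[I->src2]; changed = 1;
        }
    }
    return changed;
}

/* Returns the number of sweeps that fired; one extra confirms convergence. */
static int iterate_block(instr *code, int n)
{
    int konst[16] = {0}, known[16] = {0}, iters = 0;
    while (run_subpasses(code, n, konst, known))
        iters++;
    return iters;
}
```

For r0 = 5; r1 = r0; r2 = r1 + r0, two sweeps fire (the second re-scans the folded MOVI) and a third sweep confirms convergence, matching the 1-3 iteration pattern described above.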
The Apply-Changes Function: sub_753B50
After sub_753600 reports changes, sub_753B50 applies the accumulated transformations. This is a compact 70-line function that performs instruction-list surgery:
- Creates a replacement instruction via sub_931920(ctx, state->instr_pair, *(*(state->instr_pair+8)+8), -1) -- the -1 argument (0xFFFFFFFF) signals "allocate new"
- Updates the block's instruction head at *(ctx+232) with the new instruction's head pointer
- Clears the block's instruction count at *(ctx+264) = 0
- Calls sub_932E80 to relink the instruction into the block's doubly-linked list
- Propagates flags: if the original instruction had flag bit 3 of *(instr+280) set (indicating a control-flow-sensitive instruction), the replacement inherits it via new_instr[70] |= 8
- Walks the state's instruction chain (from state[1] through state[2]), creating replacements for each and calling sub_749090 to update register-to-instruction mappings
- Final cleanup: calls sub_9253C0 to remove the dead instructions from their blocks, sub_749290 to update the register numbering, and sub_91E310 to splice the old instruction range out of the linked list
Differences Between Early/Mid/Late Variants
1. Gate Conditions (Who Runs)
| Phase | Gate Logic |
|---|---|
| 13 (Early) | Requires ctx->flags_1382 & 4; skips if option 214 is set; requires option 487; skips if *(*(ctx)+1056) is non-null |
| 29 | Requires option 487; skips if option 231 (dump mode) is set; requires *(config+33192) check or option 461 pass; skips if function count == 1 |
| 37 (Mid) | Requires sub_8F3EA0 pre-check; option 487; can be disabled via --no-phase ConvertMemoryToRegisterOrUniform; skips if function count == 1 |
| 46 (Mid2) | Indirect dispatch; skips if vtable slot [0x1C0] points to no-op sentinel sub_7D6DD0 |
| 58 (Late) | Requires function count > 2 (not just > 1); checks optimization level bits (ctx+1396 & 0x30) != 0x20; checks option 31 with extended-value semantics |
| 65 (Late2) | Requires function count > 1; indirect dispatch through compilation unit vtable slot at offset 392 |
2. Sub-Pass Selection (What Runs)
| Phase | Sub-Passes Included |
|---|---|
| 13 (Early) | Structural equivalence detection via sub_753600 (def-use chain walking, instruction pair matching, modifier verification, depth-2 sub-DAG comparison via sub_1245740), instruction rewrite via sub_753B50. No instruction-level constant folding. Lightweight -- designed for quick cleanup after initial lowering. |
| 29 | Copy prop with full opcode dispatch (97, 18, 124), predicate-aware propagation via sub_8F2E50/sub_8F29C0, two-pass predicate simplification via sub_908A60, liveness-gated DCE via sub_7DF3A0. Flag marking with 0x100/0x200/0x400 bits. |
| 37 (Mid) | Full sub-pass suite plus ConvertMemoryToRegisterOrUniform (memory-to-register promotion). Bitvector-based change tracking. Cost-driven convergence with configurable threshold (default 0.25, knob 474). Most comprehensive instance. |
| 46 (Mid2) | Architecture-dependent (vtable dispatch). May include additional target-specific simplifications. |
| 58 (Late) | 6-slot circular buffer pattern matching over MOV/copy instructions (opcodes 139, 110). Register use-count and aliasing checks. Option-605-gated restriction mode. Per-block single-pass (no iteration). |
| 65 (Late2) | Architecture-dependent (vtable dispatch). Final cleanup before register allocation. |
3. Infrastructure Weight (How It Runs)
| Phase | Context Size | Tracking | Complexity |
|---|---|---|---|
| 13 (Early) | Minimal (0x88 bytes on stack) | Boolean changed flag | Low (78 lines in sub_7917F0) |
| 29 | Stack frame (~0x60 bytes) | Boolean + instruction flag bits | Medium (218 lines in sub_908EB0) |
| 37 (Mid) | 0x408-byte stack context + heap bitvectors | Cost-based convergence (3 doubles) + bitvector arrays | High (500+ lines in setup + 400+ in loop) |
| 46 (Mid2) | Vtable-dependent | Vtable-dependent | Variable |
| 58 (Late) | 0x168-byte stack context | 7 counters at 56-byte stride + 6-slot circular buffer | Medium-high (550 lines in sub_8F6530) |
| 65 (Late2) | Vtable-dependent | Vtable-dependent | Variable |
Initialization Infrastructure
Two large helper functions set up the state required before the sub-passes can run:
sub_785E20 -- Change Tracking Reset
Called at the start of phase 13 and after the convergence loop completes (if any changes were made). Resets per-block change flags and instruction state. Takes (ctx, 0) -- the second argument selects the reset mode.
sub_781F80 -- Instruction Flag Initialization
A large function (~1800 lines) that walks every instruction in every basic block, setting per-instruction optimization flags. Called with argument 1 to enable full initialization. These flags control which instructions are eligible for the sub-passes: instructions marked with certain flag patterns are skipped by copy prop, others are skipped by the algebraic simplifier.
sub_7E6090 -- Use-Def Chain Builder
Builds operand use-def chains for copy propagation. Called with (ctx, 0, 0, 0, 0) at the start of phases 13 and 58. The zero arguments indicate "build from scratch" rather than incremental update.
sub_7E6AD0 -- Def-Use Link Builder
Builds bidirectional def-use/use-def links. Called only by phase 13 (Variant A). Variant B phases use their own bitvector-based tracking instead.
sub_905B50 -- Bitvector Infrastructure (Phase 37 Only)
A 500+ line setup function specific to GeneralOptimizeMid. Allocates and initializes three major bitvector structures for tracking:
- Register definition reach (which definitions reach each block entry)
- Per-register liveness within basic blocks
- Fold eligibility tracking (which operands have known-constant sources)
These bitvectors are destroyed by RAII-style cleanup after sub_90FBA0 returns, using vtable destructors at offsets +32 in the bitvector vtables.
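For reference, a dense bitvector of the kind sub_905B50 allocates can be modeled as below; the layout and API are illustrative assumptions, not recovered structure definitions:

```c
/* Minimal dense bitvector: one bit per register (or per definition site),
 * backed by an array of 64-bit words. Illustrative sketch only. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t *words; size_t nbits; } bitvec;

static bitvec bv_alloc(size_t nbits)
{
    bitvec v = { calloc((nbits + 63) / 64, sizeof(uint64_t)), nbits };
    return v;  /* calloc zero-fills: all bits start clear */
}
static void bv_set(bitvec *v, size_t i)
{
    v->words[i / 64] |= (uint64_t)1 << (i % 64);
}
static int bv_test(const bitvec *v, size_t i)
{
    return (int)((v->words[i / 64] >> (i % 64)) & 1);
}
static void bv_free(bitvec *v) { free(v->words); v->words = NULL; }
```

The three tracked relations (reaching definitions, per-register liveness, fold eligibility) would each be one such vector per block, which is why phase 37's setup and teardown are so much heavier than the other variants'.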
Pipeline Positioning
The six instances are positioned to clean up after specific groups of transformations:
Phase 0-12: Initial setup, FP16 promotion, unsupported op conversion
--> Phase 13: GeneralOptimizeEarly (clean up after lowering artifacts)
Phase 14-28: Branch opt, loop passes, strength reduction, pipelining
--> Phase 29: GeneralOptimize (clean up after loop transformations)
Phase 30-36: Switch opt, linear replacement, LICM
--> Phase 37: GeneralOptimizeMid (heavy cleanup + mem-to-reg promotion)
Phase 38-45: Nested branch opt, CTA expansion, mbarrier, mid expansion
--> Phase 46: GeneralOptimizeMid2 (clean up after mid-level expansion)
Phase 47-57: GVN-CSE, reassociation, remat, late expansion, speculative hoist
--> Phase 58: GeneralOptimizeLate (clean up after late expansion)
Phase 59-64: Loop fusion, predication, late commoning
--> Phase 65: GeneralOptimizeLate2 (final cleanup before register work)
After phase 65, the pipeline transitions to register-attribute setting (phase 90), synchronization (phase 99), and register allocation (phase 101). No GeneralOptimize instance runs after register allocation -- the post-RA pipeline uses different peephole mechanisms.
Knobs and Options
| Option | Decoded Name | Type | Code Default | Used By | Description |
|---|---|---|---|---|---|
| 31 | AllowReassociateCSE | OKT_INT | unset | Phase 58 | Architecture-dependent fold eligibility gate; extended-value semantics via config+2232/+2240 |
| 135 | ConvertMemoryToRegIndexedSizeLimit | OKT_INT | unset (fallback: 0.25 from knob 474) | Phase 37 | Threshold override for cost-based convergence when *(config+9720) is set; controls indexed-access size limit for memory-to-register conversion |
| 214 | DisableMergeEquivalentConditionalFlow | OKT_NONE | false | Phase 13 only | When present, skips GeneralOptimizeEarly entirely (if (getOption(ctx, 214)) return;) |
| 231 | DisableRedundantBarrierRemoval | OKT_NONE | false | Phase 29 only | Dump mode -- when present, skips GeneralOptimize to preserve IR state for debugging |
| 461 | MembarFlowControl | OKT_INT | unset | Phase 29 | Secondary gate; controls whether memory barrier flow analysis runs during standard GeneralOptimize; passed through sub_661470 |
| 464 | MergeEquivalentConditionalFlowBudget | OKT_BDGT | unset (= unbounded) | Phase 13 (Variant A) | Iteration cap -- budget knob that breaks the fixed-point loop when exhausted; prevents oscillating transformations |
| 474 | MovWeightForConvertMemToReg | OKT_DBL | 0.25 | Phase 37 (sub_90FBA0) | Cost convergence threshold and per-fold cost weight for cost-exempt opcodes (v104 in cost computation) |
| 487 | (not yet decoded) | -- | enabled | Phases 13, 29, 37 | General optimization enable -- master switch for all GeneralOptimize passes |
| 499 | OptBudget | OKT_BDGT | enabled (pass-through) | sub_7DDB50 (opt-level accessor) | Master guard for opt-level accessor; when disabled, caps all opt-level-gated behavior at O1 |
| 605 | ReassociateCSEWindow | OKT_NONE | false | Phase 58 (sub_8F6530) | When present, restricts 6-slot circular buffer matching to existing entries only (no new entries added during walk) |
| limit-fold-fp | -- | bool | "false" (config+340) | Phase 37 | When true, forces conservative fold path via ctx+1379 tier flags; prevents FP folds that could alter precision semantics |
The "ConvertMemoryToRegisterOrUniform" named-phase gate at 0x21DD228 allows phase 37 to be disabled via the --no-phase command-line option.
Function Map
| Address | Name | Role | Confidence |
|---|---|---|---|
0xC5F940 | Phase 13 execute | Tail-calls 0x1C64BF0 (single-func) or sub_7917F0 (multi-func) | CERTAIN |
0xC5FC50 | Phase 29 execute | Checks count > 1, calls sub_908EB0 | CERTAIN |
0xC5FD70 | Phase 37 execute | Checks count > 1, calls sub_910840 | CERTAIN |
0xC60840 | Phase 46 execute | Indirect vtable dispatch through comp_unit->vtable[0x1C0] | CERTAIN |
0xC5FF20 | Phase 58 execute | Checks count > 1, calls sub_8F7080 | CERTAIN |
0xC60550 | Phase 65 execute | Checks count > 1, indirect dispatch through comp_unit->vtable[392] | CERTAIN |
0x7917F0 | GeneralOptimizeEarly body | Multi-function path: iterates blocks, fixed-point loop with sub_753600 | HIGH |
0x908EB0 | GeneralOptimize body | Per-block copy prop + predicate simplification with flag marking | HIGH |
0x910840 | GeneralOptimizeMid body | Full suite with mem-to-reg; delegates to sub_905B50 + sub_90FBA0 | HIGH |
0x8F7080 | GeneralOptimizeLate body | Bitvector-tracked 7-counter pass; calls sub_8F6FA0 | HIGH |
0x753600 | Per-block sub-pass runner (Early) | Structural equivalence detection on def-use chains; returns boolean changed | HIGH |
0x753B50 | Per-block apply changes (Early) | Instruction rewriting: sub_931920, sub_932E80, sub_749090, sub_9253C0 | HIGH |
0x753480 | Chain walker (Early) | Walks single-def chain forward, checking sub_7E5120 eligibility | HIGH |
0x753520 | Pair detector (Early) | Finds opcode-93 instruction in chain via sub_753480 | HIGH |
0x753570 | Secondary pair detector (Early) | Finds second opcode-93 link referencing back to primary | HIGH |
0x753DB0 | Chain tail finder (Early) | Walks opcode-97 links to find end of chain | MEDIUM |
0x753E30 | Secondary chain matcher (Early) | Handles register renaming boundaries; stores a1[7..9] | MEDIUM |
0x753F70 | Vtable rewrite dispatcher (Early) | Calls comp_unit->vtable[656]; constructs opcode-93 replacements | HIGH |
0x7E5120 | Chain eligibility predicate | Checks constant bank, block region, opcode 91 | HIGH |
0x8F6530 | Per-block sub-pass runner (Late) | 6-slot circular buffer; 7-counter change tracking; 550-line function | HIGH |
0x8F6FA0 | Block iterator (Late) | Walks block list calling sub_8F6530 per block; single pass, no iteration | HIGH |
0x905B50 | Setup/init (Mid) | ~500 lines; creates bitvector infrastructure; 3 tracked structures | HIGH |
0x90FBA0 | Main loop (Mid) | Cost-based instruction-level iteration with register bank analysis | HIGH |
0x90EF70 | Register promotion (Mid) | Memory-to-register conversion; threshold-based (default 0.93, knob 136) | HIGH |
0x903A10 | Register bank helper (Mid) | Per-instruction register bank assignment for LD/ST materialization | MEDIUM |
0x8F3FE0 | Register constraint fold validator (Mid) | Validates all source operand types are 2/3 and sub_91D150 constraints match cached values; queries vtable[904] for element size and vtable[936] for fold metadata | HIGH |
0x8F2E50 | Fold eligibility check | Two-path dispatch: opcode 18 checks source types 2/3 + sub_91D150 constraints; opcode 124 checks dest type 1/2 + SM <= 20479 threshold + constraint bits & 0x1C00 | HIGH |
0x8F29C0 | Architecture predicate query | 9 lines; returns sub_7DC0E0(cu) || sub_7DC050(cu) || sub_7DC030(cu) on ctx+1584 | HIGH |
0x908A60 | Two-pass predicate simplify | Called with direction flag (1 = forward, 0 = backward) | HIGH |
0x785E20 | Change tracking reset | Resets per-block change flags | MEDIUM |
0x781F80 | Instruction flag init | Initializes per-instruction optimization flags (~1800 lines) | MEDIUM |
0x7E6090 | Use-def chain builder | Builds operand use-def chains; called with (ctx, 0, 0, 0, 0) | HIGH |
0x7E6AD0 | Def-use link builder | Builds def-use/use-def bidirectional links | HIGH |
0x7DF3A0 | Liveness check | Returns status pointer; bits 2-3 (& 0xC) indicate live uses | HIGH |
0x7E7380 | Predicate-operand compatibility | Narrow check: predicate modifier parity + last-operand 24-bit ID + penultimate 8-byte encoding (not full structural comparison) | HIGH |
0x747F40 | Negation flag extractor | Extracts negation modifier from operand encoding | HIGH |
0x747F80 | Absolute-value flag extractor | Extracts abs modifier from operand encoding | HIGH |
0x748570 | Alias hazard check | Returns true if operand has aliasing hazard | MEDIUM |
0x1245740 | Sub-DAG equivalence | Compares two instruction sub-DAGs for structural equivalence (arg 2 = depth) | HIGH |
0x91D150 | Register constraint lookup | Trivial: return *(*(ctx+440) + 4*reg_index); 0 = no fold-blocking constraint | CERTAIN |
0x91E860 | Use-count estimator | Returns estimated use count for cost-based decisions (used by phase 37) | MEDIUM |
0xA9BD30 | Register-class remapper | Maps opcode indices in set {1,2,3,7,11,15,20,24} via vtable[632]; writes value 0x60000000 (constant class marker) | HIGH
0x1249B50 | SASS-level integer ALU fold | Combines IMAD_WIDE/IADD3/SGXT/CCTLT (opcodes 2,3,5,110) with MOV source pairs via sub_1249940 and sub_1245740 | HIGH |
0x1249940 | MOV-pair fold combiner | Matches two MOV-from-immediate (opcode 139) instructions feeding an ALU op; validates structural equivalence at depth 1 and 2 | HIGH |
0x7E19E0 | Operand info extractor | Builds 52-byte operand descriptor for opcodes 2,3,5,6,7; classifies source types and constant bank membership | MEDIUM |
0x7DC0E0 | Architecture capability check A | Checks compilation unit capability flag; used by sub_8F29C0 for predicate fold safety | MEDIUM |
0x7DC050 | Architecture capability check B | Secondary capability check for sub_8F29C0 | MEDIUM |
0x7DC030 | Architecture capability check C | Tertiary capability check for sub_8F29C0 | MEDIUM |
Cross-References
- Pass Inventory -- full 159-phase table with GeneralOptimize instances highlighted
- Phase Manager -- dispatch loop, vtable protocol, factory switch at sub_C60D30
- Optimization Pipeline -- overall pipeline stages
- Copy Propagation & CSE -- standalone copy propagation passes (phases 49, 50, 64, 83)
- Liveness Analysis -- standalone OriPerformLiveDead passes (heavier DCE)
- Peephole Optimization -- MainPeepholeOptimizer; handles constant-identity patterns (x+0, x*1, x&0, etc.)
- Strength Reduction -- standalone strength reduction pass (phase 26)
- Knobs System -- MergeEquivalentConditionalFlowBudget (464, iteration cap), option 487 (general opt enable), OptBudget (499, opt-level guard), AllowReassociateCSE (31), MovWeightForConvertMemToReg (474, cost threshold), limit-fold-fp
- Optimization Levels -- knob 499 (OptBudget) as opt-level accessor guard
Branch & Switch Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Four phases in the ptxas pipeline transform branch and switch-statement control flow in the Ori IR. Two phases optimize switch statements (phases 14 and 30), one performs general branch simplification (phase 15), and one flattens nested conditional branches (phase 38). Together they reduce branch count, eliminate unreachable code, and prepare the CFG for downstream passes like predication (phase 63), liveness analysis (phase 16), and loop canonicalization (phase 18).
These phases operate on the Ori IR before register allocation and scheduling. At this pipeline stage, branch instructions use the Ori OEN opcode (SASS BRA), conditional execution is controlled by predicate registers (P0--P6, PT), and the CFG is a hash-map-based structure with FNV-1a-keyed successor/predecessor edges.
| DoSwitchOptFirst | Phase 14 -- vtable at off_22BD7F8 |
| OriBranchOpt | Phase 15 -- vtable at off_22BD820 |
| DoSwitchOptSecond | Phase 30 -- vtable at off_22BDA78 |
| OptimizeNestedCondBranches | Phase 38 -- vtable at off_22BDBB8 |
| Phase factory | sub_C60D30 cases 14, 15, 30, 38 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers |
| Key opcodes | OEN (BRA), OFFL (BSSY), OFLAP (BSYNC) |
| CFG infrastructure | FNV-1a hash maps at Code Object +648 (successors), +680 (backedges) |
| Related passes | 31 OriLinearReplacement, 63 OriDoPredication, 80 ExpandJmxComputation, 133/136 MergeEquivalentConditionalFlow |
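The edge maps are described as FNV-1a-keyed. For reference, the standard 64-bit FNV-1a over a byte key looks like this; exactly how ptxas forms the key bytes (block IDs, pointers, or something else) is left open here:

```c
/* Standard 64-bit FNV-1a: xor the byte in, then multiply by the FNV prime.
 * The constants are the published FNV-1a parameters, not values recovered
 * from the binary. */
#include <stdint.h>
#include <stddef.h>

static uint64_t fnv1a64(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                         /* xor first ...            */
        h *= 1099511628211ULL;             /* ... then multiply (1a order) */
    }
    return h;
}
```

FNV-1a is a natural choice for CFG maps: it is a few instructions per byte, needs no tables, and distributes small integer keys well enough for open-addressed buckets.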
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG (predecessors, successors, RPO, dominators)
Phase 6 SetControlFlowOpLastInBB ── ensures branches are last in each block
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds branch info)
Phase 14 DoSwitchOptFirst ── SWITCH OPTIMIZATION (1st pass)
Phase 15 OriBranchOpt ── BRANCH SIMPLIFICATION
Phase 16 OriPerformLiveDeadFirst ── DCE cleanup of dead branches
...
Phase 29 GeneralOptimize ── const fold after loop transforms
Phase 30 DoSwitchOptSecond ── SWITCH OPTIMIZATION (2nd pass)
Phase 31 OriLinearReplacement ── branchless replacement
...
Phase 37 GeneralOptimizeMid ── const fold + copy prop (feeds nested cond info)
Phase 38 OptimizeNestedCondBranches ── NESTED CONDITIONAL FLATTENING
...
Phase 63 OriDoPredication ── if-conversion (converts short branches to predicates)
...
Phase 80 ExpandJmxComputation ── expands jump-table index computations
...
Phase 133 MergeEquivalentConditionalFlow ── tail merging
Phase 136 LateMergeEquivalentConditionalFlow
Why Two DoSwitchOpt Passes?
The first pass (phase 14) runs immediately after the initial GeneralOptimizeEarly compound pass. At this point, constant folding and copy propagation have resolved many switch selector values, enabling the optimizer to determine case density and choose a lowering strategy.
The second pass (phase 30) runs after loop unrolling (phase 22), strength reduction (phase 21), SSA phi insertion (phase 23), and software pipelining (phase 24). These transformations can expose new switch patterns -- particularly after loop unrolling duplicates switch bodies, creating opportunities for case clustering that were not visible before.
Despite their names, the two passes use different dispatch paths. Phase 14 dispatches through the SM backend's vtable at offset +136 (*(*(ctx+1584)+136)), making it a polymorphic, architecture-specific switch optimization. Phase 30 calls the generic switch optimization core (sub_77CF40 via sub_791F00). This means phase 14 runs whatever switch optimization the current SM target provides, while phase 30 always runs the generic algorithm. The two passes share pipeline position semantics (first pass vs. second pass) but not necessarily the same code.
DoSwitchOpt -- Switch Statement Optimization (Phases 14, 30)
Overview
DoSwitchOpt transforms high-level switch statements from their initial representation as cascading conditional branches into one of three lowered forms, selected based on case density and count. The input is a chain of ISETP (integer set-predicate) + BRA (conditional branch) instruction pairs that compare the switch selector against successive case constants. The output is one of:
- Jump table -- a contiguous array of branch targets indexed by the selector value
- Binary search tree -- a balanced tree of comparisons that narrows the target in O(log n)
- Cascading if-else chain -- the original form, retained when the switch is small or sparse
Input: Switch Pattern Recognition
The pass scans each basic block for a characteristic pattern:
// Input: cascading if-else for switch(x)
BB0:
ISETP.EQ P0, x, #case_0 // compare selector against constant
@P0 BRA target_0 // conditional branch to case body
ISETP.EQ P0, x, #case_1
@P0 BRA target_1
ISETP.EQ P0, x, #case_2
@P0 BRA target_2
...
BRA default_target // fallthrough to default case
The recognizer collects:
- The selector register (the common first operand of all ISETP instructions)
- The set of case constants (immediate operands of each ISETP)
- The branch targets (one per case, plus the default target)
- The case count N
Decision: Strategy Selection
The strategy is selected by evaluating case density and count:
function select_switch_strategy(cases[], N, min_val, max_val):
range = max_val - min_val + 1
density = N / range // fraction of range covered by cases
if N <= SMALL_SWITCH_THRESHOLD: // observed: ~4 cases
return CASCADING_IF_ELSE // keep original form
if density >= JUMP_TABLE_DENSITY: // observed: ~0.4 (40%)
if range <= MAX_JUMP_TABLE_SIZE: // observed: ~1024 entries
return JUMP_TABLE
return BINARY_SEARCH_TREE
The thresholds are not configurable via the knob system. They are hardcoded constants in the phase execute function.
Jump table is preferred when case values are dense -- the selector maps directly to a table index with a subtraction and a bounds check. This produces the fastest code but consumes table storage proportional to the value range (not the case count).
Binary search tree is the default for large sparse switches. The pass sorts case constants and generates a balanced BST of ISETP + BRA pairs. Each comparison eliminates half the remaining candidates, reaching the target in O(log N) branches.
Cascading if-else is retained for small switches (typically 4 or fewer cases) where the overhead of a jump table or BST setup exceeds the cost of linear comparison.
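The selection heuristic above can be written out directly. The thresholds used here are the observed hardcoded values (~4 cases, ~0.4 density, ~1024 entries); they are estimates recovered from the binary, not exact or configurable constants:

```c
/* Executable version of select_switch_strategy from the pseudocode above.
 * Threshold values are approximations observed in the binary. */
enum strategy { CASCADING_IF_ELSE, JUMP_TABLE, BINARY_SEARCH_TREE };

static enum strategy select_switch_strategy(int n_cases, int min_val, int max_val)
{
    int range = max_val - min_val + 1;
    double density = (double)n_cases / (double)range;

    if (n_cases <= 4)                     /* small: linear compares win */
        return CASCADING_IF_ELSE;
    if (density >= 0.4 && range <= 1024)  /* dense and bounded: table */
        return JUMP_TABLE;
    return BINARY_SEARCH_TREE;            /* large and/or sparse */
}
```

Note that a dense switch whose value range still exceeds the table-size cap falls through to the BST path, matching the nested check in the pseudocode.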
Output: Jump Table Lowering
For jump-table-eligible switches, the pass produces:
// Output: jump table lowering
BB_switch:
IADD3 Rtmp, Rselector, -min_val, RZ // normalize to 0-based index
ISETP.GE.U32 P0, Rtmp, #range // bounds check (unsigned)
@P0 BRA default_target // out-of-range -> default
// The jump table index computation is left as a pseudo-instruction
// that phase 80 (ExpandJmxComputation) expands later into:
// LEA Raddr, Rtmp, #table_base, 2 // Raddr = table_base + index * 4
// BRX Raddr, #table_base // indexed branch
The actual BRX (branch indexed) instruction is a SASS-level indirect branch through a table embedded in the .text section. Each table entry is a 4-byte relative offset. Phase 80 (ExpandJmxComputation) runs much later (after legalization) to expand the index computation pseudo-instruction into the final LEA + BRX sequence.
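The IADD3 + ISETP.GE.U32 pair implements the classic single-comparison range check: after subtracting min_val, one unsigned compare covers both "below min" (which wraps to a huge unsigned value) and "above max". In C terms:

```c
/* Single-comparison bounds check, as emitted by the jump-table prologue.
 * A selector below min_val wraps around to a large unsigned index and
 * fails the same `idx < range` test as an index past the table end. */
#include <stdint.h>

static int in_table_range(int32_t sel, int32_t min_val, uint32_t range)
{
    uint32_t idx = (uint32_t)(sel - min_val);  /* normalize to 0-based index */
    return idx < range;                        /* one unsigned compare */
}
```

This is why the SASS sequence needs only one ISETP before the BRX rather than separate lower- and upper-bound branches.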
Output: Binary Search Tree Lowering
For BST-eligible switches:
function emit_bst(cases[], lo, hi, selector, default_target):
if lo > hi:
emit: BRA default_target
return
mid = (lo + hi) / 2
if lo == hi:
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit: BRA default_target
return
emit: ISETP.LT P0, selector, #cases[mid].value
emit: @P0 BRA left_subtree_label
// Right subtree (selector >= cases[mid].value)
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit_bst(cases, mid+1, hi, selector, default_target)
// Left subtree (selector < cases[mid].value)
left_subtree_label:
emit_bst(cases, lo, mid-1, selector, default_target)
This produces a balanced tree with depth ceil(log2(N+1)). Each internal node performs at most two comparisons (less-than and equality), though the pass may optimize nodes with consecutive case values to use range checks.
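The depth claim can be checked by counting the deepest chain of node visits the recursion above would emit; this mirrors emit_bst's lo/mid/hi splitting without generating any instructions:

```c
/* Depth of the comparison tree emit_bst would generate for cases[lo..hi].
 * Each internal node counts as one level (its ISETP.LT / ISETP.EQ pair). */
static int bst_depth(int lo, int hi)
{
    if (lo > hi)
        return 0;           /* empty subtree: just a BRA to default */
    if (lo == hi)
        return 1;           /* leaf: single equality test */
    int mid = (lo + hi) / 2;
    int left  = bst_depth(lo, mid - 1);
    int right = bst_depth(mid + 1, hi);
    return 1 + (left > right ? left : right);
}
```

For N = 7 and N = 15 cases this yields depths 3 and 4, i.e. ceil(log2(N+1)) as stated.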
GPU-Specific: SIMT Divergence Impact
Switch optimization interacts directly with SIMT execution. On a GPU, when threads in a warp take different switch cases, the warp diverges and each case executes serially. The optimizer considers this:
- Jump tables produce a single divergence point at the
BRXinstruction. All threads that pick the same case reconverge naturally. The hardwareBSSY/BSYNC(branch sync stack push/pop) mechanism ensures reconvergence after the switch. - BST lowering produces O(log N) potential divergence points. Threads that agree on the BST path stay converged; threads that disagree at each BST node split into independently masked sub-warps.
- Cascading if-else produces N potential divergence points. Each comparison can split the warp.
For GPU code, jump tables are strongly preferred when density permits, because they minimize the number of divergence points to exactly one (the BRX), regardless of case count.
OriBranchOpt -- Branch Simplification (Phase 15)
Overview
OriBranchOpt performs four categories of CFG-level simplification on the Ori IR. It runs as a single pass that iterates over all basic blocks and applies the following transformations until no further changes occur:
- Unconditional branch folding -- eliminates BRA instructions that jump to the immediately following block
- Unreachable block elimination -- removes basic blocks with no predecessors (except the entry block)
- Conditional branch simplification -- simplifies conditional branches where the condition is provably constant or the true/false targets are identical
- Branch chain threading -- redirects branches that target blocks consisting of a single unconditional BRA directly to the final destination
Transformation 1: Unconditional Branch Folding
When a basic block ends with an unconditional BRA to the block that immediately follows in layout order, the branch is redundant and is deleted:
// Before: // After:
BB_A: BB_A:
... ...
BRA BB_B // fallthrough
BB_B: BB_B:
... ...
This is the most common transformation. It arises frequently after switch optimization introduces new blocks and after loop unrolling creates copies of loop bodies that end with unconditional jumps back to the next iteration.
Transformation 2: Unreachable Block Elimination
Other branch simplifications may redirect branches away from certain blocks; once those blocks lose all predecessors they become unreachable, and the pass deletes them:
function eliminate_unreachable(func):
for each block B in func (excluding entry):
if predecessor_count(B) == 0:
// Remove B from successor lists of all blocks
// Delete all instructions in B
// Remove B from the block list
// Update CFG hash maps
The CFG hash maps at Code Object offsets +648 (successors) and +680 (backedges) must be updated atomically with block deletion to maintain consistency for downstream passes.
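The cascading nature of the deletion (removing one block can orphan its successors) can be sketched with a simple predecessor-count model; the CFG representation here is illustrative, not the FNV-1a hash maps the real pass updates:

```c
/* Sketch of transformation 2: repeatedly delete non-entry blocks whose
 * predecessor count has dropped to zero. succ[b] lists the successors of
 * block b, terminated by -1. dead[] is the output. Illustrative only. */
#define MAXB 32

static int eliminate_unreachable(int nblocks, int succ[][4], int *dead)
{
    int preds[MAXB] = {0}, removed = 0, changed = 1;
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; succ[b][i] >= 0; i++)
            preds[succ[b][i]]++;
    while (changed) {
        changed = 0;
        for (int b = 1; b < nblocks; b++) {  /* never delete the entry block */
            if (dead[b] || preds[b] != 0)
                continue;
            dead[b] = 1; removed++; changed = 1;
            for (int i = 0; succ[b][i] >= 0; i++)
                preds[succ[b][i]]--;         /* detach b's out-edges */
        }
    }
    return removed;
}
```

Decrementing the successors' counts when a block dies is what lets whole unreachable chains collapse in one call rather than needing repeated passes from the outside.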
Transformation 3: Conditional Branch Simplification
Two sub-cases:
Constant condition. If copy propagation or constant folding (in the preceding GeneralOptimizeEarly, phase 13) has determined that a predicate register always holds a known value at the branch point, the conditional branch is replaced:
// Before: condition always true // After:
BB: BB:
ISETP.EQ PT, R0, R0 // (deleted -- tautology)
@PT BRA target BRA target // unconditional
BRA fallthrough // (deleted)
Equivalent targets. If both the taken and not-taken paths of a conditional branch go to the same block, the condition test is dead and the branch becomes unconditional:
// Before: both targets identical // After:
BB: BB:
@P0 BRA target BRA target // unconditional
BRA target // (deleted)
Transformation 4: Branch Chain Threading
When a branch targets a block whose only content is another unconditional branch, the pass redirects the original branch directly to the final target:
// Before: // After:
BB_A: BB_A:
@P0 BRA BB_B @P0 BRA BB_C // threaded
BB_B: // BB_B may become unreachable
BRA BB_C BB_C:
BB_C: ...
...
The pass applies threading iteratively, following chains of single-branch blocks until a non-trivial block is reached. A depth limit prevents infinite loops on pathological CFGs with cycles of empty blocks (which should not exist in well-formed IR but are guarded against defensively).
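The bounded chain walk can be modeled as follows. The `trivial` map is a hypothetical representation (block -> target for blocks whose only instruction is an unconditional BRA); the real pass inspects each block's instruction list directly:

```python
def ultimate_target(block, trivial, depth_limit=8):
    """Follow chains of single-BRA blocks to the final destination.
    The depth limit bounds the walk even on a (malformed) cycle of
    empty blocks, mirroring the defensive guard described above."""
    for _ in range(depth_limit):
        if block not in trivial:
            break                 # reached a non-trivial block
        block = trivial[block]
    return block
```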
Fixed-Point Iteration
The four transformations are applied in a worklist-driven loop. Each transformation can enable others:
- Threading can make intermediate blocks unreachable (enables transformation 2)
- Unreachable block elimination can make remaining branches target the immediately following block (enables transformation 1)
- Folding can expose equivalent-target conditionals (enables transformation 3)
The pass terminates when a full iteration over all blocks produces no changes.
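The driver loop has this general shape (a generic sketch, not the recovered implementation). Note the use of `|=` rather than a short-circuiting `any()`, so every transformation runs on every sweep:

```python
def run_to_fixpoint(state, transforms, max_sweeps=1000):
    # Each transform mutates `state` and reports whether it changed it.
    for _ in range(max_sweeps):
        changed = False
        for t in transforms:
            changed |= t(state)
        if not changed:
            break                 # full sweep with no changes: done
    return state
```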
OptimizeNestedCondBranches -- Nested Conditional Flattening (Phase 38)
Overview
Phase 38 targets a specific control flow pattern: nested conditional branches that test related predicates. This pattern commonly arises from C/C++ code with compound conditions (if (a && b), if (a || b)) and from switch-case fall-through after DoSwitchOpt lowering.
The pass runs after GeneralOptimizeMid (phase 37), which provides fresh constant folding and copy propagation results. It runs before OriDoPredication (phase 63), feeding it simpler CFG patterns that are easier to convert to predicated code.
Pattern: Nested If-Then
// Before: nested conditional
BB_outer:
@P0 BRA BB_inner
BRA BB_merge
BB_inner:
@P1 BRA BB_body
BRA BB_merge
BB_body:
... body instructions ...
BRA BB_merge
BB_merge:
...
// After: flattened with combined predicate
BB_entry:
LOP3 Ptmp, P0, P1, 0xC0 // Ptmp = P0 AND P1
@Ptmp BRA BB_body
BRA BB_merge
BB_body:
... body instructions ...
BRA BB_merge
BB_merge:
...
The LOP3 (3-input logic) instruction with truth table 0xC0 computes AND. This combines two branch tests into one, eliminating a basic block and a divergence point.
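The truth-table convention can be checked directly, assuming the usual LOP3 encoding where bit `(a<<2)|(b<<1)|c` of the 8-bit immediate gives the result:

```python
def lop3(a, b, c, lut):
    # a, b, c are each 0 or 1; lut is the 8-bit truth-table immediate.
    idx = (a << 2) | (b << 1) | c
    return (lut >> idx) & 1
```

For every input combination, 0xC0 behaves as `a AND b` and 0xFC as `a OR b`, independent of the third input.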
Pattern: Nested If-Or
// Before: short-circuit OR
BB_test1:
@P0 BRA BB_body // first condition true -> body
BRA BB_test2
BB_test2:
@P1 BRA BB_body // second condition true -> body
BRA BB_merge // both false -> merge
BB_body:
...
// After: flattened with OR predicate
BB_entry:
LOP3 Ptmp, P0, P1, 0xFC // Ptmp = P0 OR P1
@Ptmp BRA BB_body
BRA BB_merge
BB_body:
...
Safety Constraints
The pass applies these transformations only when:
- No side effects between the nested branches -- the intermediate block must contain only the branch instruction (and optionally predicate-setting ISETP/FSETP instructions)
- No live-out values from the intermediate block other than the predicate -- if the intermediate block defines registers used after the merge, the transformation would change semantics
- Both branches target the same merge point -- the not-taken path of both the outer and inner branches must reach the same merge block
- The predicates are independent -- P0 and P1 must not be related by a def-use chain within the nested pattern (otherwise folding changes the evaluation order)
Relationship to Predication
Phase 38 is a stepping stone toward phase 63 (OriDoPredication). By reducing nested branches to single-level branches, it creates more opportunities for if-conversion -- the predication pass can then convert the single remaining branch into a fully predicated (branchless) instruction sequence.
The transformation pipeline for an if (a && b) { x = y; } pattern is:
Phase 38: nested {if(a) { if(b) { ... }}} --> if(a AND b) { ... }
Phase 63: if(a AND b) { x = y; } --> @(a AND b) MOV x, y
Without phase 38, the predication pass would see a multi-level branch diamond that exceeds its nesting-depth threshold, and both branches would remain in the output.
GPU-Specific Considerations
SIMT Divergence and Reconvergence
On NVIDIA GPUs, branch optimization has a direct impact on warp execution efficiency. Every conditional branch is a potential divergence point where threads in a 32-thread warp may take different paths. Divergence serializes execution: the warp must execute both paths, masking inactive threads.
The BSSY (branch sync stack push) / BSYNC (branch sync) mechanism on modern NVIDIA architectures (sm_75+) manages reconvergence:
BSSY B0, reconvergence_point // push reconvergence point onto sync stack
@P0 BRA taken_path // diverge
... not-taken path ...
BSYNC B0 // threads arriving here wait
taken_path:
... taken path ...
BSYNC B0 // all threads reconverge here
reconvergence_point:
... // continue with full warp
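A toy cost model makes the serialization point concrete. This is entirely illustrative -- real warp scheduling costs are more subtle -- but it captures why a divergent diamond pays for both sides while a uniform branch pays for one:

```python
def diamond_cost(taken_mask, cost_taken, cost_not_taken, warp_size=32):
    """If any lane in the warp takes a path, the whole warp executes
    that path (inactive lanes masked); a uniform branch pays for
    exactly one side."""
    full = (1 << warp_size) - 1
    cost = 0
    if taken_mask != 0:          # at least one lane takes the branch
        cost += cost_taken
    if taken_mask != full:       # at least one lane falls through
        cost += cost_not_taken
    return cost
```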
Branch optimization directly reduces the number of BSSY/BSYNC pairs needed:
- Branch folding (phase 15) eliminates unconditional branches that do not cause divergence but still consume BSSY/BSYNC bookkeeping
- Nested conditional flattening (phase 38) reduces two nested BSSY/BSYNC regions to one, cutting sync-stack depth by one level
- Jump table lowering (phases 14/30) collapses N divergence points into one BRX instruction
Reconvergence Stack Depth
The hardware branch sync stack has finite depth (varies by architecture, typically 16--32 entries on sm_75+). Deeply nested branches can overflow the stack, causing hardware serialization or requiring the compiler to restructure control flow. Branch optimization reduces sync-stack pressure by flattening nesting.
Uniform Branches
When all threads in a warp evaluate a branch condition identically (uniform branch), no divergence occurs. The optimizer detects uniform branches via the AnalyzeUniformsForSpeculation pass (phase 27) and the OriPropagateVarying passes (phases 53, 70). Uniform branches are cheaper because:
- No BSSY/BSYNC is needed (the warp stays converged)
- On sm_75+, uniform branches can use the UBRA (uniform branch) encoding, which has lower latency
Branch optimization interacts with uniformity analysis: simplifications that eliminate branches also eliminate divergence-point metadata, and conversely, branches proven uniform may not need optimization because their execution cost is already minimal.
Switch Tables and Warp Divergence
A switch with K active cases in a 32-thread warp incurs at most K serialized case executions (one per unique case value across threads). Jump table lowering does not change this thread-level divergence cost, but it does change the instruction-level cost:
| Strategy | Instructions (worst case) | Divergence points | Sync-stack entries |
|---|---|---|---|
| Cascading if-else (N cases) | 2N (ISETP + BRA per case) | N | N |
| BST (N cases) | 2 * ceil(log2(N)) | ceil(log2(N)) | ceil(log2(N)) |
| Jump table | 3 (IADD3 + ISETP + BRX) | 1 | 1 |
The jump table is strongly preferred for GPU execution because it minimizes sync-stack entries to exactly 1, regardless of case count.
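The table above reduces to a small cost function (a sketch; the function and key names are illustrative, not recovered from the binary):

```python
import math

def switch_costs(n_cases):
    """Worst-case instruction count, divergence points, and sync-stack
    entries for the three lowering strategies."""
    log_n = math.ceil(math.log2(n_cases))
    return {
        "cascade":    {"inst": 2 * n_cases, "div": n_cases, "stack": n_cases},
        "bst":        {"inst": 2 * log_n,   "div": log_n,   "stack": log_n},
        "jump_table": {"inst": 3,           "div": 1,       "stack": 1},
    }
```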
Implementation Details
Phase Vtable Structure
All four phases follow the standard 16-byte phase object model. Each vtable has three methods: +0 execute, +8 getPhaseNumber, +16 isNoOp.
| Phase | Factory case | Vtable address | execute body | isNoOp |
|---|---|---|---|---|
| 14 DoSwitchOptFirst | case 14 | off_22BD7F8 | sub_C5F720 (42B) | returns false |
| 15 OriBranchOpt | case 15 | off_22BD820 | sub_C5F950 (34B) | returns false |
| 30 DoSwitchOptSecond | case 30 | off_22BDA78 | sub_C5FC80 (34B) | returns false |
| 38 OptimizeNestedCondBranches | case 38 | off_22BDBB8 | sub_C5FA70 (34B) | returns false |
All four isNoOp methods return false unconditionally -- gating is performed inside the execute body, not via isNoOp. Each execute body calls sub_7DDB50 (156B), which reads the optimization level from compilation_context+2104 and checks knob 499. The guard is opt_level > 1, so these phases execute at -O2 and above. At -O0 and -O1, sub_7DDB50 returns 1 and the execute body returns without action.
Execute Body Details
Phase 14 -- sub_C5F720 (42 bytes). After the sub_7DDB50 gate, dispatches through the SM backend object's vtable: (*(*(ctx+1584) + 136))(*(ctx+1584)). Offset +136 is vtable slot 17 on the SM backend. This is a polymorphic call -- each SM target (sm_50, sm_75, sm_89, sm_100, etc.) provides its own switch optimization implementation. The SM backend object at compilation_context+1584 is documented in data-structures.md.
Phase 15 -- sub_C5F950 (34 bytes). After the gate, calls sub_7917F0 (529B) directly -- no polymorphic dispatch. sub_7917F0 is the branch simplification core:
- Checks context+1382 bit 2 (CFG validity flag) -- returns immediately if clear
- Checks knob 214 via the knob state dispatcher -- if set, skips the pass (OriBranchOpt disable switch)
- Checks knob 487 (general optimization enablement)
- Calls sub_785E20 (266B) to rebuild the CFG
- Calls sub_781F80 (8335B) for block preparation infrastructure
- Calls sub_7E6090 (2614B) to scan branch patterns and sub_7E6AD0 (33B) for chain setup
- Iterates over basic blocks in RPO order (block list at *(ctx+296), RPO indices at *(ctx+512)). For each block, calls sub_753600 (1351B) for the transformation, with a convergence loop gated by knob 464
- After convergence, calls sub_785E20 again to finalize the CFG
Phase 30 -- sub_C5FC80 (34 bytes). After the gate, calls sub_791F00(ctx, 1). The second argument 1 indicates this is the second switch optimization pass. sub_791F00 (587B) performs lazy initialization of a 152-byte SwitchOptContext cached at code_object+1288:
SwitchOptContext (152 bytes, allocated at code_object+1288):
+0 back-pointer to code object
+8 allocator reference (from code_object+16)
+16 case collection array (capacity = block_count + 2)
+56 secondary collection array
+96 code_object reference copy
+104 initialized sentinel (0xFFFFFFFF)
+112 tertiary collection array
After setup, sub_791F00 calls sub_77CF40 (4698B, 987 instructions) -- the main switch optimization algorithm containing pattern matching, strategy selection (jump table vs. BST vs. cascading if-else), and code emission.
Phase 38 -- sub_C5FA70 (34 bytes). After the gate, calls sub_A0F020 (2375B, 563 instructions) directly. sub_A0F020 implements the nested conditional optimizer as a fixed-point loop. It allocates a 16-byte work context at code_object+1648 (lazy init), then iterates: scan blocks for nested branch patterns, combine predicates with LOP3, remove eliminated blocks, repeat until stable. The function also accesses code object fields +832 (block hash map) and +856 (edge data) for CFG manipulation.
Knob Gating Summary
| Knob | Index | Effect | Checked by |
|---|---|---|---|
| ConvertUnsupportedOps | 499 | Master opt-level gate (all 4 phases) | sub_7DDB50 |
| OriBranchOpt disable | 214 | Disables branch simplification | sub_7917F0 (phase 15) |
| General optimization | 487 | Enables/disables optimizer passes | sub_7917F0 (phase 15) |
| Convergence loop | 464 | Gates the fixed-point iteration | sub_7917F0 (phase 15) |
Interaction with ExpandJmxComputation (Phase 80)
Phase 80 is the delayed lowering phase for jump table index computations created by DoSwitchOpt. The separation exists because:
- Jump table index computation requires knowing the final table address, which is not available until after legalization
- Intervening optimization passes (GVN-CSE, strength reduction) may simplify the index computation before it is expanded
- Register allocation needs to see the index computation as a single pseudo-instruction for accurate pressure estimation
The pseudo-instruction left by DoSwitchOpt is expanded by phase 80 into the final LEA + BRX sequence after all high-level optimizations are complete.
Interaction with OriLinearReplacement (Phase 31)
Phase 31 runs immediately after DoSwitchOptSecond (phase 30). It targets branch-heavy patterns that survived switch optimization and attempts to replace them with branchless (linear) computation sequences using SEL (select) and predicated MOV instructions. This is a complement to predication (phase 63) -- it operates earlier in the pipeline on simpler patterns, while predication handles more complex diamond-shaped control flow later.
Interaction with MergeEquivalentConditionalFlow (Phases 133, 136)
Two late-pipeline passes perform tail merging -- identifying basic blocks with identical instruction sequences that branch to the same targets, and merging them into a single block. This catches redundancy left over after branch optimization, particularly in switch case bodies that perform similar operations on different case values.
Algorithmic Summary
| Pass | Algorithm | Complexity | CFG Changes |
|---|---|---|---|
| DoSwitchOpt (14, 30) | Pattern match + decision tree for strategy selection | O(N log N) per switch | Rewrites blocks, adds jump table pseudo-ops |
| OriBranchOpt (15) | Worklist-driven CFG simplification (fixed-point) | O(B + E) per iteration | Deletes blocks, removes edges, threads branches |
| OptimizeNestedCondBranches (38) | Pattern match on nested branch diamonds | O(B) | Merges blocks, replaces branches with LOP3+BRA |
Where N = number of switch cases, B = number of basic blocks, E = number of CFG edges.
Function Map
All addresses from ptxas v13.0.88. Vtable entries resolved by reading the ELF .rodata section at file offset VA - 0x400000. Confidence: HIGH for vtable functions (direct binary read), HIGH for core algorithms (single-caller chains from vtable execute bodies).
Phase Vtable Functions
| Address | Size | Phase | Vtable slot | Role |
|---|---|---|---|---|
| sub_C5F720 | 42B | 14 | +0 | execute -- dispatches to SM backend vtable[17] |
| sub_C5F4A0 | 6B | 14 | +8 | getPhaseNumber -- returns 14 |
| sub_C5F4B0 | 3B | 14 | +16 | isNoOp -- returns false |
| sub_C5F950 | 34B | 15 | +0 | execute -- calls sub_7917F0 |
| sub_C5F480 | 6B | 15 | +8 | getPhaseNumber -- returns 15 |
| sub_C5F490 | 3B | 15 | +16 | isNoOp -- returns false |
| sub_C5FC80 | 34B | 30 | +0 | execute -- calls sub_791F00(ctx, 1) |
| sub_C5F2A0 | 6B | 30 | +8 | getPhaseNumber -- returns 30 |
| sub_C5F2B0 | 3B | 30 | +16 | isNoOp -- returns false |
| sub_C5FA70 | 34B | 38 | +0 | execute -- calls sub_A0F020 |
| sub_C5F1A0 | 6B | 38 | +8 | getPhaseNumber -- returns 38 |
| sub_C5F1B0 | 3B | 38 | +16 | isNoOp -- returns false |
Core Algorithm Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_77CF40 | 4698B | 1 | DoSwitchOpt core -- pattern match, strategy select, code emit |
| sub_7917F0 | 529B | 2 | OriBranchOpt core -- worklist CFG simplification |
| sub_A0F020 | 2375B | 11 | OptimizeNestedCondBranches core -- predicate combining |
| sub_791F00 | 587B | 3 | DoSwitchOpt setup -- SwitchOptContext init, calls sub_77CF40 |
Infrastructure Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_7DDB50 | 156B | 180 | Optimization level gate (knob 499 + opt-level check) |
| sub_781F80 | 8335B | 131 | Block preparation infrastructure (major shared function) |
| sub_785E20 | 266B | 34 | CFG rebuild after transformation |
| sub_7E6090 | 2614B | 80 | Branch pattern scanner |
| sub_7E6AD0 | 33B | 10 | Branch chain setup |
| sub_753600 | 1351B | 1 | Block-level branch transform (phase 15 inner loop) |
| sub_753B50 | 598B | 1 | Block transform continuation |
Factory and Vtable Data
| Symbol | Address | Description |
|---|---|---|
| sub_C60D30 | 0xC60D30 | Phase factory -- 159-case switch, allocates 16-byte phase objects |
| off_22BD5C8 | 0x22BD5C8 | Vtable base -- 40-byte stride, index = phase number |
| off_22BD7F8 | 0x22BD7F8 | Phase 14 vtable (base + 14 * 0x28) |
| off_22BD820 | 0x22BD820 | Phase 15 vtable (base + 15 * 0x28) |
| off_22BDA78 | 0x22BDA78 | Phase 30 vtable (base + 30 * 0x28) |
| off_22BDBB8 | 0x22BDBB8 | Phase 38 vtable (base + 38 * 0x28) |
Cross-References
- Pass Inventory -- complete 159-phase table with phase numbers and categories
- Optimization Pipeline -- pipeline infrastructure, dispatch loop, phase ordering
- Ori IR -- instruction layout, opcode table (OEN = BRA, OFFL = BSSY, OFLAP = BSYNC), CFG hash maps
- GeneralOptimize Bundles -- phases 13, 29, 37 that feed constant/copy information to branch passes
- Liveness Analysis -- phase 16 (DCE cleanup after branch/switch optimization)
- Predication -- phase 63 (if-conversion, consumes simplified CFG from phases 15 and 38)
- Hot/Cold Partitioning -- phases 41, 108, 109 (block placement interacts with branch layout)
- Synchronization & Barriers -- BSSY/BSYNC reconvergence mechanism
- Data Structures -- SM backend object at +1584 (phase 14 polymorphic dispatch target)
- Optimization Levels -- sub_7DDB50 opt-level gate, knob 499 interaction
- Knobs System -- knobs 214, 464, 487, 499 gating branch/switch phases
Loop Passes
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Eight phases in the ptxas pipeline transform loops in the Ori IR: one canonicalizer (phase 18), one unroller (phase 22), one software pipeliner (phase 24), four LICM instances (phases 35, 66, 79, 88), and one fusion pass (phase 59). Together they account for the largest category of repeated-pass instances in the pipeline -- the LICM family alone runs four times because intervening transformations (predication, legalization, GMMA fixup) continuously expose new invariants.
ptxas is not built on LLVM. Its loop infrastructure is a custom, non-SSA representation operating directly on the Ori IR's basic-block graph. Loop detection is performed by AnalyzeControlFlow (phase 3), which identifies back-edges, computes dominators, and annotates each basic block with a loop nesting depth stored at block offset +144. This nesting depth is the primary loop identity used by all eight passes.
| OriLoopSimplification | Phase 18 -- vtable at off_22BD898 |
| OriLoopUnrolling | Phase 22 -- vtable at off_22BD938 |
| OriPipelining | Phase 24 -- vtable at off_22BD988 |
| OriHoistInvariantsEarly | Phase 35 -- vtable at off_22BDB40 |
| OriLoopFusion | Phase 59 -- vtable at off_22BDF00 |
| OriHoistInvariantsLate | Phase 66 -- vtable at off_22BE018 |
| OriHoistInvariantsLate2 | Phase 79 -- vtable at off_22BE220 |
| OriHoistInvariantsLate3 | Phase 88 -- vtable at off_22BE388 |
| Phase factory | sub_C60D30 cases 18, 22, 24, 35, 59, 66, 79, 88 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers, pre-RA |
| Loop detection | AnalyzeControlFlow (phase 3) -- back-edges, dominators, nesting depth |
| Related passes | 3 AnalyzeControlFlow, 19 OriSplitLiveRanges, 21 OriStrengthReduce, 108 OptimizeHotColdInLoop |
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG, identifies loops, computes dominators
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds loop analysis)
Phase 15 OriBranchOpt ── branch simplification (may change loop shape)
Phase 16 OriPerformLiveDeadFirst ── DCE removes dead loop bodies
Phase 18 OriLoopSimplification ── CANONICALIZATION: single entry, preheader insertion
Phase 19 OriSplitLiveRanges ── splits live ranges at loop boundaries
Phase 21 OriStrengthReduce ── induction variable strength reduction
Phase 22 OriLoopUnrolling ── UNROLLING: full/partial based on trip count
Phase 23 GenerateMovPhi ── SSA phi insertion (after unrolling changes CFG)
Phase 24 OriPipelining ── SOFTWARE PIPELINING: overlaps iterations
...
Phase 35 OriHoistInvariantsEarly ── LICM #1: after GVN, before mid-expansion
...
Phase 59 OriLoopFusion ── FUSION: merges adjacent compatible loops
...
Phase 66 OriHoistInvariantsLate ── LICM #2: after predication
...
Phase 79 OriHoistInvariantsLate2 ── LICM #3: after late unsupported-op expansion
...
Phase 88 OriHoistInvariantsLate3 ── LICM #4: after GMMA fixup
...
Phase 108 OptimizeHotColdInLoop ── separates hot/cold paths within loops (post-RA)
Ordering Rationale
The eight loop passes are deliberately spread across the pipeline rather than clustered together. Each occupies a specific position dictated by what has been lowered or optimized upstream:
- Phase 18 (simplification) must run before strength reduction (21) and unrolling (22) because both require canonical loop forms.
- Phase 22 (unrolling) runs after strength reduction so that induction variable simplifications are already applied, avoiding redundant computation in unrolled copies.
- Phase 24 (pipelining) runs after unrolling because pipelining targets loops that were not fully unrolled.
- Phase 35 (early LICM) runs after GeneralOptimize at phase 29, which performs partial CSE, giving it common subexpressions to hoist.
- Phase 59 (fusion) runs after late expansion (phase 55) because expansion can split a single operation into a loop pair that fusion can reunite.
- Phases 66, 79, 88 (late LICM instances) each follow a major transformation that can create new loop-invariant code: predication (63), unsupported-op expansion (78), and GMMA fixup (87), respectively.
Loop Representation in Ori IR
ptxas does not use a dedicated loop descriptor data structure (no LoopInfo object like LLVM's). Instead, loop membership is implicit in the CFG through annotations computed by AnalyzeControlFlow (phase 3):
| BB Field | Offset | Type | Meaning |
|---|---|---|---|
| loop_depth | +144 | int | Loop nesting depth (0 = not in loop) |
| loop_depth_equal | +152 | int | Copy of loop_depth, used for sibling detection |
| predecessor_list | +128 | linked_list* | List of predecessor block indices |
| successor_list | +136 | linked_list* | List of successor block indices |
A loop header is a block whose loop_depth equals its own back-edge source's depth. Back-edge information is stored in the Code Object's back-edge hash map at offset +680. Diagnostic output from sub_BDEA50 prints this information as bix%d -> backedge's successor BB: %d.
The block iteration order is controlled by a reverse-post-order (RPO) array stored at Code Object offset +512. All loop passes iterate over this array, ensuring they visit headers before inner blocks. The array length is at Code Object offset +520.
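The traversal that the +512 array encodes can be sketched as a standard depth-first walk (illustrative; ptxas precomputes and stores the array rather than recomputing it per pass):

```python
def reverse_post_order(entry, succs):
    """Compute reverse post-order over a CFG given as a successor map.
    RPO guarantees that a loop header is visited before the blocks it
    dominates, which is why all loop passes iterate in this order."""
    order, visited = [], set()
    def dfs(block):
        visited.add(block)
        for s in succs.get(block, ()):
            if s not in visited:
                dfs(s)
        order.append(block)    # post-order append
    dfs(entry)
    return order[::-1]         # reverse: headers before bodies
```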
Phase 18 -- OriLoopSimplification
Purpose
Canonicalizes loop structure to simplify downstream analysis. Ensures each natural loop has a single entry edge, inserts dedicated preheader blocks where needed, and normalizes back-edge shapes. This is a prerequisite for strength reduction, unrolling, and pipelining, all of which assume canonical loop form.
Entry Point
sub_C5FB00 (34 bytes) ── vtable execute(), calls sub_7DDB50
└─ sub_78B430 (1,172 bytes) ── LoopMakeSingleEntry core
├─ sub_7753F0 ── pre-pass: loop peeling setup
├─ sub_789BE0 ── canonicalize back-edges
├─ sub_781F80 ── rebuild instruction list
└─ sub_9253C0 ── split edges / insert preheader
Algorithm
function LoopSimplification(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
// Phase 1: optional loop peeling for O4+ or flagged functions
    if opt_level in {4,5} or flags[1382] & 4 is set:
peeled = PeelOuterEdges(code_object, 0) // sub_7753F0
canonicalized = CanonicalizeBackEdges(code_object, peeled) // sub_789BE0
else:
canonicalized = CanonicalizeBackEdges(code_object, 0)
if code_object.flags[1368] & 1 == 0: // re-check after canon
return
// Phase 2: single-entry enforcement
if not QueryKnob("LoopMakeSingleEntry", knob_487): // OCG knob 487
return
RebuildInstructionList(code_object, 1) // sub_781F80
for each block in RPO order:
if block.loop_depth > 0 and block is loop header:
// find the deepest-nesting back-edge target
// if multiple entries exist, split into single-entry form
// insert preheader block between external predecessors and header
InsertPreheaderIfNeeded(code_object, block) // sub_9253C0
GPU-Specific Considerations
The simplification pass checks the optimization level at offset +896 of the code object. Levels 4 and 5 (-O4, -O5) enable aggressive loop peeling via sub_7753F0 before canonicalization. At the default -O2, peeling is suppressed to avoid code size growth that could cause instruction cache thrashing.
The LoopMakeSingleEntry knob (OCG knob 487) is the master enable. When disabled, only back-edge canonicalization runs -- preheader insertion is skipped. This knob is checked via the standard OCG knob query at offset +152 of the allocator vtable.
The pass also inspects the convergence flag at offset +1380 (bit 7). When set, it indicates a convergent execution context (e.g., warp-synchronous code), and certain edge-splitting transformations are suppressed to avoid disrupting convergence guarantees.
Related Knobs
| Knob Name | Default | Description |
|---|---|---|
| LoopInversion | enabled | Enable loop inversion (do-while to while conversion) |
| LoopInversionBudget | unset | Maximum instruction count for loop inversion |
| LoopPeelInversion | disabled | Enable loop peeling combined with inversion |
| EnableSingleThreadPeelingLoops | unset | Enable peeling for single-thread execution paths |
| GenPeelingLoopsForSyncs | unset | Generate peeling loops around sync instructions |
| AssertIfPeelingLoopForTexSurf | unset | Assert (debug) if peeling a loop for texture/surface ops |
Phase 22 -- OriLoopUnrolling
Purpose
Performs full unrolling of loops with known small trip counts and partial unrolling of larger loops to amortize loop overhead and expose instruction-level parallelism. This is one of the most impactful optimization passes for GPU code, where loops over texture coordinates, reduction accumulators, and matrix tiles dominate execution time.
Function Map
Correction (P1-04): The W023 report incorrectly listed sub_83EF00 as the unrolling driver. That function is the MainPeepholeOptimizer (confirmed by p1.06a sweep). The actual unrolling call chain starts at sub_1392E30.
| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_1392E30 | 25 lines | Phase 22 execute entry: guards, calls initializer + driver + cleanup | HIGH |
| sub_1389AF0 | 593 lines | Unrolling context initializer: reads all knobs from OCG profile | HIGH |
| sub_1390B30 | 1,598 lines | Main unrolling driver: per-loop decision, factor selection, dispatch | HIGH |
| sub_138A6E0 | 774 lines | Post-unroll cleanup: frees working structures | HIGH |
| sub_7E5120 | 19 lines | Nounroll/skip check: pragma flag, convergence, knob 91 | HIGH |
| sub_7F5D20 | 99 lines | Rejection recording: indexes string table at 0x21D1EA0 | HIGH |
| sub_138E3E0 | 125 lines | Loop body scanner: three-pass analysis (header, forward, backward) | HIGH |
| sub_13858C0 | 42 lines | Loop back-edge locator | HIGH |
| sub_1385E90 | ~200 lines | Trip count bound extractor (init, limit, stride from IV) | MEDIUM |
| sub_1383620 | 1,157 lines | Full unroll profitability evaluator (foldable constants, addresses) | MEDIUM |
| sub_1387C30 | ~400 lines | Partial unroll body replicator | MEDIUM |
| sub_13880F0 | ~200 lines | Post-unroll CFG fixup | MEDIUM |
| sub_1385950 | ~300 lines | Induction variable analysis | MEDIUM |
| sub_138E9C0 | ~400 lines | IV stride/direction verification | MEDIUM |
| sub_1385CC0 | ~200 lines | IV constant detection | MEDIUM |
| sub_13829F0 | ~200 lines | Profitability: foldable constant load counting | MEDIUM |
| sub_A3A7E0 | 1,236 lines | Post-unroll statistics (DUMPIR output) | HIGH |
Unrolling Decision Algorithm
The unrolling decision is a multi-stage pipeline implemented in sub_1390B30. The function iterates over loops in reverse RPO order (innermost first, matching the RPO array at code_object+512) and applies a series of eligibility checks, trip count analysis, factor selection, and profitability evaluation before committing to the unroll.
Entry Guard (sub_1392E30)
function OriLoopUnrolling_Execute(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
if code_object.flags[1397] & 0xC0 == 0x40: // global nounroll override
return
if DUMPIR_skip("LoopUnrolling"): // sub_799250
return
if CountBlocks(code_object) <= 2: // sub_7DDB50
return
if not QueryKnob(487, true): // master loop pass guard
return
ctx = InitializeContext(code_object) // sub_1389AF0
RunUnrolling(ctx) // sub_1390B30
Cleanup(ctx) // sub_138A6E0
Context Initialization and Knob Defaults (sub_1389AF0)
The initializer reads unrolling parameters from the OCG profile object. Each knob carries a mode flag selecting its source: 0 = use hardcoded default, 1 = use integer override, 2 = use float override, 3 = use double override. The defaults recovered from the binary:
| Context Field | Profile Offset | Default | Knob Name (inferred) |
|---|---|---|---|
| ctx+168 (int32) | +31320 | 140 | UnrollBudget |
| ctx+172 (float) | +31032 | 0.25 | UnrollFlexableFullLimit |
| ctx+176 (int32) | +30960 | 4 | UnrollUnknownCount |
| ctx+180 (int32) | +30816 | 4 | UnrollSmallLoopLimit |
| ctx+184 (dbl) | +64656 | 0.4 | LoopUnrollLargePartOfShaderPct |
| ctx+192 (float) | +31392 | 20.0 | UnrollInstLimit |
| ctx+196 (int32) | +64872 | 50 | UnrollPregThreshold |
| ctx+200 (int32) | +31248 | 2 | UnrollExtraInstPerPercentSaving |
| ctx+204 (int32) | +31176 | 200 | UnrollFullInstLimit |
| ctx+208 (int32) | +64296 | 46 | LoopUnrollNumExtraInstBase |
Boolean and integer knobs read via vtable dispatch:
| Knob ID | Profile Offset | Default | Knob Name |
|---|---|---|---|
| 437 | +31464 | true | LoopUnroll (master enable) |
| 894 | +64368 | true | LoopUnrollNonInnermost |
| 897 | +64584 | true | UnrollMultiBlockLoops |
| 902 | +64944 | true | UnrollVariableBounds |
| 896 | +64512 | 0 | LoopUnrollFactor (INT override; 0 = heuristic) |
| 895 | +64440 | 0 | EpilogueLoopUnrollCount |
| 900 | +64800 | 0 | LoopUnrollNumInstTex |
| 903 | +65016 | false | DisablePartialUnrollOverflowCheck |
String knob: knob 427 (profile+30744) returns the LoopUnrollFactor per-block override string, with the format "-N-" to skip block N, "+N+" to force-unroll block N, "-" to skip all, "+" to force all.
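The override grammar is simple enough to sketch directly (the function name is hypothetical; the real check is a pair of strstr calls inside sub_1390B30):

```python
def unroll_override(override, block_id):
    """Interpret the per-block LoopUnrollFactor override string:
    '-' skips every block, '+' forces every block,
    '-N-' skips block N, '+N+' forces block N."""
    if override == "-" or f"-{block_id}-" in override:
        return "skip"
    if override == "+" or f"+{block_id}+" in override:
        return "force"
    return None                  # no override: use the heuristic
```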
Nounroll Pragma Check (sub_7E5120)
Returns true (suppress unrolling) when any of these conditions hold:
- Convergence constraint: The back-edge analysis context at code_object+1784 is active, and the loop header's entry in the back-edge table (code_object+1776+16) is valid and within the convergence limit. This suppresses unrolling of warp-synchronous loops.
- PTX nounroll pragma: Byte 292 of the block descriptor at (code_object+368 + 8*block_idx) has bit 1 set. This bit is set during PTX-to-Ori lowering when the nounroll pragma string (at 0x1CFE126) is parsed.
- Instruction-level marker: Byte 283 of the loop header instruction has bit 0 set.
- Per-block knob: OCG knob 91 is set for this block (queried via sub_7A1A90).
Main Decision Flowchart (sub_1390B30)
function RunUnrolling(ctx):
code_object = ctx.code_object
// Phase 1: Read master enable and per-block override string
master_enable = QueryKnob(437) // LoopUnroll
override_string = QueryKnobString(427) // "-N-" / "+N+" format
RecomputeRegisterPressure(code_object) // sub_7E6090
RebuildInstructionList(code_object) // sub_781F80
// Phase 2: Pre-scan -- count inlinable calls and non-unrollable instructions
for each instruction in code_object.instruction_list:
if opcode == 97 (BRX):
if callee.entry_block == callee.exit_block:
inlinable_calls++
if trip_count > 1:
multi_exit |= AnalyzeMultiExit(ctx, callee)
// Phase 3: Iterate loops in reverse RPO (innermost first)
rpo_count = code_object.rpo_count // offset +520
for idx = rpo_count-1 downto 0:
block = code_object.blocks[code_object.rpo[idx]]
// ── Step A: nounroll annotation propagation ──
if block.nounroll_annotation: // byte +246
propagate nounroll to all blocks at >= same nesting depth
// ── Step B: eligibility filter ──
if block.loop_depth == 0: continue // not a loop
if block.loop_depth != block.loop_depth_equal: continue
if block.nounroll and not ctx.force_all: continue
// ── Step C: structure analysis ──
latch = LocateBackEdge(ctx, block) // sub_13858C0
if not latch: continue
exit_inst = latch.last_instruction
if exit_inst.opcode != 95: // not conditional branch
Reject(block, 13); continue // indirect jump
// ── Step D: nounroll / convergence check ──
if CheckNounroll(block, code_object): // sub_7E5120
Reject(block, 11); continue
// ── Step E: execution frequency analysis ──
freq_header = code_object.freq_table[header_reg]
freq_latch = code_object.freq_table[latch_reg]
is_hot = (freq_latch > 999) and (freq_header > 0)
and (freq_latch / freq_header > 3)
// ── Step F: body analysis ──
body_info = ScanLoopBody(ctx, block, latch) // sub_138E3E0
// body_info contains: tex_count, body_size, foldable_ldc_count,
// has_cross_edges, mem_count
if body_info.has_cross_edges: continue
// ── Step G: budget computation ──
budget_scale = QueryKnobDouble(898, 0.5) // default 0.5
scaled_body = (int)(budget_scale * body_size)
remaining = total_budget - body_size - scaled_body - ...
// ── Step H: per-block override check ──
if override_string:
needle = "-{block_id}-"
if override_string == "-" or strstr(override_string, needle):
continue // skip this block
needle = "+{block_id}+"
if override_string == "+" or strstr(override_string, needle):
force_unroll = true
// ── Step I: pragma force-unroll ──
if flags[1397] & 0xC0 == 0x80: // PTX pragma force
force_unroll = true
// ── Step J: non-innermost filter ──
if not ctx.allow_non_innermost and not force_unroll:
if 10 * body_info.tex_count < remaining:
Reject(block, 7); continue
// ── Step K: factor selection ──
if force_unroll:
factor = 1 << ctx.force_factor // power-of-2 override
else if known_trip_count:
factor = trip_count
// Budget-constrain: while factor * body_cost > UnrollBudget:
// factor--
if factor > 4 and trip_count == 1:
factor &= ~3 // round to mult-of-4
if factor <= 1:
Reject(block, 12); continue
else:
if body_size <= 49 and body_info.tex_count > 0:
factor = 2 // conservative default
else:
factor = max(1, UnrollBudget / body_cost)
// ── Step L: knob override ──
if QueryKnob(429): // LoopUnrollFactor INT
factor = GetKnobInt(429)
// ── Step M: IV analysis ──
iv_info = AnalyzeIV(ctx, latch) // sub_1385950
if not iv_info: Reject(block, 14); continue
if not ValidateIV(ctx, iv_info): // sub_1387870
Reject(block, 14); continue
bound = ExtractBound(ctx, iv_info) // sub_1385E90
if not bound or bound.opcode != 2:
Reject(block, 16); continue
if bound.def_block.predecessor_count != 1:
Reject(block, 17); continue
if bound.init_reg == bound.limit_reg:
Reject(block, 18); continue
stride_ok = VerifyStride(ctx, block, latch, iv_info, bound)
if stride_ok & 2: Reject(block, 17); continue
if stride_ok & 1: Reject(block, 18); continue
// ── Step N: detect constant trip count ──
const_iv = DetectConstantIV(ctx, iv_info) // sub_1385CC0
// ── Step O: profitability for full unroll ──
if factor == trip_count and single_block_body:
if CheckFoldableProfitability(ctx, block, iv_info, factor):
ReplicateFullUnroll(ctx, block, factor) // sub_1383620
stats.unrolled_count++
continue
// ── Step P: partial unroll execution ──
if factor >= 2:
remainder = trip_count % factor
iterations_per_copy = (trip_count - remainder) / factor
block.iterations_per_copy = iterations_per_copy
if remainder > 0:
for r = 0 to remainder-1:
DuplicateBody(ctx, block) // sub_932E40
ReplicatePartialUnroll(ctx, block, latch,
factor, remainder) // sub_1387C30
stats.unrolled_count++
else:
Reject(block, 24) // budget exceeded
// Phase 4: Post-unroll fixup
stats.non_unrolled = total_loops - stats.unrolled - stats.failed
if any_unrolled:
RebuildBackEdges(code_object) // sub_7846F0
RerunLiveness(code_object) // sub_A0F020
RerunControlFlow(code_object) // sub_752E40
MarkModified(code_object) // sub_7B52B0
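The remainder arithmetic in Step P can be checked numerically with a standalone sketch (illustrative names, not ptxas code): the remainder iterations are peeled off as duplicated bodies, and the unrolled kernel runs the quotient.

```python
def partial_unroll_shape(trip_count: int, factor: int):
    """Split a loop of trip_count iterations for an unroll factor.

    Returns (remainder, iterations_per_copy): `remainder` peeled body
    copies run first, then the unrolled kernel executes
    `iterations_per_copy` times with `factor` body copies per pass.
    """
    remainder = trip_count % factor
    iterations_per_copy = (trip_count - remainder) // factor
    return remainder, iterations_per_copy

# 103 iterations unrolled by 4: 3 peeled copies, then 25 kernel passes
assert partial_unroll_shape(103, 4) == (3, 25)
# total iteration count is preserved: 3 + 25 * 4 == 103
```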
Unroll Rejection Table
When a loop cannot be unrolled, sub_7F5D20 records the reason by indexing a string pointer array at 0x21D1EA0. The diagnostic strings contain hex codes like "0x80000001 - Not unrolled: Irregular loop" -- these hex values are part of the printed message text, not the internal array index. The W023 report originally described a 36-byte structure table at 0x21D1980; that table belongs to the operand range lookup in the peephole optimizer (sub_7E39B0), not the unrolling pass. The actual internal rejection codes are simple integers indexing the string array:
| Code | Category | Reason |
|---|---|---|
| 7 | Performance | Body too large relative to texture savings (10 * tex_count < remaining_budget) |
| 11 | Pragma/knob | PTX nounroll pragma, convergence constraint, or per-block knob 91 |
| 12 | Budget | Partial unroll factor reduced to 1 (no factor >= 2 fits within UnrollBudget) |
| 13 | Ineligible | Loop exit contains BRX (indirect jump, opcode 95 with special flags) |
| 14 | Unsupported IV | Induction variable analysis failed (sub_1385950 or sub_1387870) |
| 15 | Unsupported IV | IV register class is not integer (class 1) or pointer (class 2/3) |
| 16 | Trip count | Trip count bound extraction failed (sub_1385E90) |
| 17 | Irregular | IV definition block has multiple predecessors, or stride/direction verification failed |
| 18 | Trip count | IV initial value register equals IV limit register (degenerate zero-trip loop) |
| 19 | Unsupported IV | IV stride sign inconsistent between loop header and induction increment |
| 24 | Budget | Catch-all: budget exceeded after all factor reduction attempts |
The diagnostic output is gated by flags[1421] & 0x20 (DUMPIR verbose mode). When enabled, the rejection string is recorded in a hash map keyed by the loop header instruction node, using FNV-1a hashing of the node's block index.
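The FNV-1a hash itself is standard; how ptxas serializes the block index into bytes (width, byte order) is an assumption in this sketch:

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: xor each byte in, multiply by the FNV prime."""
    h = 0x811C9DC5                            # 32-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF     # FNV prime, truncate to 32 bits
    return h

# hashing a block index as 4 little-endian bytes (byte order assumed)
key = fnv1a_32((42).to_bytes(4, "little"))
assert 0 <= key < 2**32
```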
Heuristic Thresholds (Knobs)
The unrolling decision is controlled by a rich set of OCG knobs. All knob names are stored ROT13-encoded in the binary:
| Knob Name | Type | Default | Description |
|---|---|---|---|
LoopUnroll | BOOL | true | Master enable for loop unrolling |
LoopUnrollFactor | INT | 0 | Override unroll factor (0 = heuristic) |
UnrollBudget | INT | 140 | Maximum total instruction count after unrolling |
UnrollInstLimit | FLOAT | 20.0 | Maximum instructions in a single unrolled loop body |
UnrollFullInstLimit | INT | 200 | Maximum body size for full unrolling |
UnrollFlexableFullLimit | FLOAT | 0.25 | Flexible full-unroll limit (adjusted by loop characteristics) |
UnrollSmallLoopLimit | INT | 4 | Body size threshold below which loops are always fully unrolled |
UnrollPregThreshold | INT | 50 | Maximum predicate register pressure for unrolling |
UnrollMultiBlockLoops | BOOL | true | Allow unrolling of multi-basic-block loop bodies |
UnrollVariableBounds | BOOL | true | Allow unrolling when trip count is not compile-time constant |
UnrollUnknownCount | INT | 4 | Default trip count assumption when count is unknown |
UnrollUnknownInstLimit | INT | 0 | Maximum body size for unrolling with unknown trip count |
UnrollExtraInstPerPercentSaving | INT | 2 | Instructions allowed per percent of cycle saving |
UnrollTex3DPercentSavedThreshold | INT | 0 | Minimum savings percent for 3D texture loops |
UnrollProfiledColdInstsScale | INT | 0 | Scale factor for instruction count in profiled-cold blocks |
LoopUnrollExtraFoldableLdcWeight | INT | 0 | Extra weight for foldable constant loads in unroll benefit |
LoopUnrollFoldableAddrWeight | INT | 0 | Weight for foldable address computations |
LoopUnrollLargePartOfShaderPct | DOUBLE | 0.4 | Percentage threshold: loop is "large part of shader" |
LoopUnrollNumExtraInstBase | INT | 46 | Base extra instruction allowance per unroll iteration |
LoopUnrollNumInstSmallLoop | INT | 0 | Instruction count defining "small loop" |
LoopUnrollNumInstTex | INT | 0 | Texture instruction count bonus for unrolling |
LoopUnrollSingleLoopSavedPctFactor | INT | 0 | Savings factor for single-loop shaders |
LoopUnrollNonInnermost | BOOL | true | Allow unrolling of non-innermost loops |
LoopUnrollUnknownMultiBlock | BOOL | false | Allow multi-block unroll with unknown bounds |
EpilogueLoopUnrollCount | INT | 0 | Unroll count for epilogue (remainder) loops |
DisablePartialUnrollOverflowCheck | BOOL | false | Skip overflow check on partial unroll count |
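Since ROT13 is an involution, recovering a knob name from its encoded binary string is a one-liner with Python's codecs module:

```python
import codecs

# ROT13 is its own inverse, so the same transform encodes and decodes
assert codecs.encode("LoopUnroll", "rot13") == "YbbcHaebyy"
assert codecs.decode("YbbcHaebyy", "rot13") == "LoopUnroll"
```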
GPU-Specific Unrolling Concerns
Register pressure. GPU threads share a fixed register file per SM. Unrolling increases live ranges, potentially reducing occupancy (the number of concurrent warps). The unroller queries register pressure estimates and compares against UnrollPregThreshold before committing.
Instruction cache. GPU instruction caches are small (typically 128KB L1i per SM). Aggressive unrolling of large loop bodies can cause i-cache thrashing. The UnrollBudget knob caps the total instruction growth.
Texture instruction scheduling. Texture fetches have high latency (hundreds of cycles). Unrolling loops containing texture operations is especially profitable because it exposes independent fetches that the scheduler can overlap. The LoopUnrollNumInstTex and UnrollTex3DPercentSavedThreshold knobs give extra weight to texture-heavy loops.
PTX nounroll pragma. The PTX string nounroll at 0x1CFE126 is parsed during PTX-to-Ori lowering and sets bit 1 of byte 292 in the block descriptor at (code_object+368 + 8*block_idx). The check is performed by sub_7E5120, which also tests three additional suppression conditions: the convergence constraint (back-edge table at code_object+1776), an instruction-level marker (byte 283 bit 0), and per-block knob 91. Any single condition is sufficient to suppress unrolling for that loop (rejection code 11).
Convergence constraint. When the back-edge analysis context at code_object+1784 is active (indicating warp-synchronous code), the unroller checks whether the loop header falls within the convergence region. If it does, unrolling is suppressed to avoid breaking warp-level synchronization guarantees. This is particularly important for cooperative groups and ballot-based algorithms.
DUMPIR Statistics
When diagnostics are enabled, the pass outputs:
# [partially unrolled loops=N] [non-unrolled loops=M]
This line appears in eight SM-variant statistics printers (sub_ABBA50 through sub_ABEB50), each a 1,771-byte clone specializing output format for a specific SM generation.
Phase 24 -- OriPipelining
Purpose
Performs modulo software pipelining on loops that were not fully unrolled. The pass overlaps successive loop iterations by interleaving instructions from different iterations within a single loop body, hiding functional unit and memory latency. This is the single most complex loop transformation in ptxas.
Two-Layer Pipelining Architecture
ptxas implements software pipelining in two cooperating layers:
- Phase 24 (OriPipelining, pre-RA): Annotates instruction operands with pipeline latency classes, computes the minimum initiation interval (MII), and performs the modulo scheduling loop transformation (iteration overlap, prolog/epilog generation). Operates on the Ori IR before register allocation.
- Post-RA SoftwarePipeline (sub_8B9390, 23KB): A scheduling algorithm variant within the post-RA instruction scheduler (address range 0x893000--0x8FE000) that performs instruction-level scheduling of already-pipelined loop bodies using physical registers. One of approximately 12 scheduling variants alongside DualIssueScheduler, TensorScheduler, LoopScheduler, PrefetchScheduler, etc.
The two layers cooperate: Phase 24 transforms the loop structure (instruction replication, prolog/epilog construction) before register allocation. The post-RA SoftwarePipeline variant handles the cycle-accurate instruction placement of already-pipelined loops.
Function Map
| Function | Size | Role | Confidence |
|---|---|---|---|
sub_926A30 | 22,116 bytes | Per-instruction operand latency annotator and encoding rewriter | HIGH |
sub_91A0F0 | 5,550 bytes | Opcode-to-latency-class classifier (~350 opcodes, 13 distinct classes) | HIGH |
sub_9203A0 | 4,881 bytes | Pipeline stage cost calculator (ResMII computation, FP cost accumulation) | MEDIUM |
sub_921820 | 1,592 bytes | Prolog/epilog code generator | MEDIUM |
sub_9202D0 | 207 bytes | Two-operand pipeline feasibility check (returns 60=reject, 130=accept) | HIGH |
sub_91E610 | 399 bytes | Register-class-based latency lookup (class 4→26, class 5/2→20) | HIGH |
sub_91E900 | 470 bytes | Pipe-assignment-based stall cycle calculator (32/64 cycle caps) | HIGH |
sub_92C0D0 | 358 bytes | Per-instruction annotation wrapper (calls sub_926A30, checks opcode changes) | HIGH |
sub_92C240 | 8,033 bytes | Extended GEMM-loop pipeliner (SM90+ TMA pipeline depth management) | MEDIUM |
sub_8B9390 | 22,841 bytes | Post-RA software pipelining scheduling variant (in scheduler subsystem) | MEDIUM |
Correction (P1-06): The original function map listed sub_926A30 as the "main pipelining engine (modulo scheduling)." Decompilation reveals it is the per-instruction operand latency annotator -- it iterates over each operand of an instruction, calls sub_91A0F0 to classify the operand's latency class, and rewrites the operand encoding with the latency annotation. The modulo scheduling loop transformation is distributed across the remaining functions, with sub_9203A0 computing stage costs and sub_921820 generating prolog/epilog code.
Software Pipelining Algorithm
Phase 1: Operand Latency Annotation
For each instruction in the loop body, sub_92C0D0 calls sub_926A30 to annotate operands:
function AnnotateOperandLatencies(code_object, instruction):
opcode = instruction.word & 0xFFFFCFFF // strip modifier bits (bits 12-13)
secondary_opcode = instruction.secondary_opcode
operand_array = instruction.operands // offset +84
operand_count = instruction.operand_count // offset +80
for i in 0..operand_count-1:
operand_type = (operand_array[i].word >> 28) & 7
if operand_type in {2, 3}: // register or register pair
// Adjust count for predicated instructions (bit 12)
adjusted_count = operand_count - 2 * ((opcode >> 11) & 2 != 0)
if i < adjusted_count:
latency_class = ClassifyLatency(opcode, secondary_opcode,
operand_array, adjusted_count, i)
if latency_class != default:
RewriteOperandEncoding(operand_array[i], code_object, latency_class)
// For register operands: call full rewriter sub_922210
// For non-register operands: call sub_9267C0
Phase 2: Pipeline Feasibility Filtering
Each instruction is checked by sub_9202D0:
function CheckPipelineFeasibility(code_object, instruction):
// Reject instructions with special operand flags
if (operand_array[1] & 0x603FFFF) != 0 or (operand_array[3] & 0xF8000000) != 0:
if optimization_level > 1:
return REJECT // return code 60
// Reject if pipe assignment class <= 3 (control/barrier pipe)
pipe_class = PipeAssignment(code_object, primary_opcode) // vtable+904
if pipe_class <= 3:
return REJECT
// Reject if operand 0 and operand 1 have different latency classes
lat0 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 0)
lat1 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 1)
if lat0 != lat1:
return REJECT // asymmetric latencies
// Reject if extended operands have blocking flags
if operand_count > 2 and (operand_array[4] & 0xF) or (operand_array[4] >> 4) & 1:
return REJECT
// Accept: trim to 2-operand form
result_operands = &operand_array[2]
result_count = 2
return ACCEPT // return code 130
Phase 3: MII Computation
The minimum initiation interval is computed as:
MII = max(RecMII, ResMII)
RecMII (recurrence-constrained): The longest data dependence cycle in the DDG divided by the iteration distance it spans. For a cycle of total latency L spanning D iterations: RecMII = ceil(L / D).
ResMII (resource-constrained): Computed by sub_9203A0 using floating-point cost accumulation. The function classifies each instruction's pipe class using a 7-entry pipe class table at code_object+16 and accumulates per-pipe instruction counts:
function ComputeResMII(loop_body, pipe_table):
pipe_counts[0..6] = {0}
for each instruction in loop_body:
lat0 = ClassifyLatency(instruction, operand=0)
lat1 = ClassifyLatency(instruction, operand=1)
pipe = MapLatencyToPipe(lat0, pipe_table) // 7-entry lookup
pipe_counts[pipe] += cost(instruction) // FP cost weights
ResMII = max(pipe_counts[i] / pipe_width[i] for i in 0..6)
The pipe class boundaries stored at code_object+16 define 7 functional unit classes. Each class has a capacity (number of execution slots per cycle). ResMII is the maximum ratio of instruction demand to capacity across all pipe classes.
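Putting the two bounds together (pipe counts and widths below are invented for illustration; the real capacities come from the 7-entry table at code_object+16):

```python
import math

def compute_mii(cycle_latency, cycle_distance, pipe_counts, pipe_width):
    """MII = max(RecMII, ResMII).

    RecMII: longest dependence cycle latency / iteration distance.
    ResMII: worst-case ratio of instruction demand to pipe capacity.
    """
    rec_mii = math.ceil(cycle_latency / cycle_distance)
    res_mii = math.ceil(max(c / w for c, w in zip(pipe_counts, pipe_width)))
    return max(rec_mii, res_mii)

# a dependence cycle of latency 9 spanning 2 iterations (RecMII = 5), and
# 6 instructions on a single-slot pipe (ResMII = 6) -> II of at least 6
assert compute_mii(9, 2, [6, 2], [1, 2]) == 6
```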
Phase 4: Modulo Schedule Construction
function ModuloSchedule(loop_body, MII):
II = MII
while II <= MAX_II:
MRT = new ModuloReservationTable(II) // II rows x pipe_classes columns
success = true
for each instruction in priority order:
earliest = max(data_dependency_constraints)
latest = earliest + II - 1
placed = false
for slot in earliest..latest:
row = slot mod II
pipe = instruction.pipe_class
if MRT[row][pipe] has capacity:
MRT[row][pipe] -= 1
instruction.scheduled_time = slot
instruction.stage = slot / II
placed = true
break
if not placed:
success = false
break
if success:
return (II, schedule)
II += 1
return FAILURE // could not pipeline
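The construction above can be exercised with a tiny runnable model. The instruction tuples, pipe indices, and capacities here are invented; ptxas keys its reservation table by the 7 pipe classes at code_object+16:

```python
def modulo_schedule(insts, mii, pipe_width, max_ii=64):
    """Place (name, pipe, earliest) tuples into an II-row reservation table.

    Mirrors the recovered loop: try each II starting at MII, scan the
    window [earliest, earliest + II - 1], and claim the first slot whose
    row (slot mod II) still has capacity on the instruction's pipe.
    """
    for ii in range(mii, max_ii + 1):
        mrt = [[0] * len(pipe_width) for _ in range(ii)]
        sched, ok = {}, True
        for name, pipe, earliest in insts:
            for slot in range(earliest, earliest + ii):
                if mrt[slot % ii][pipe] < pipe_width[pipe]:
                    mrt[slot % ii][pipe] += 1
                    sched[name] = (slot, slot // ii)   # (time, stage)
                    break
            else:
                ok = False   # no free slot in the window at this II
                break
        if ok:
            return ii, sched
    return None   # could not pipeline within max_ii

# three loads competing for one memory slot per cycle land in distinct rows
ii, sched = modulo_schedule(
    [("ld0", 0, 0), ("ld1", 0, 0), ("ld2", 0, 0)], mii=3, pipe_width=[1])
assert ii == 3 and {t % 3 for t, _ in sched.values()} == {0, 1, 2}
```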
Phase 5: Prolog/Epilog Generation
Once a valid schedule is found at initiation interval II with S pipeline stages, sub_921820 generates:
function GeneratePrologEpilog(loop, II, num_stages):
// Prolog: S-1 partial iterations
for stage in 0..num_stages-2:
emit instructions assigned to stages 0..stage
// Each prolog iteration adds one more stage
// Kernel: steady-state loop body
emit all instructions from all stages
// Trip count adjusted: new_trip = original_trip - (num_stages - 1)
// Epilog: S-1 drain iterations
for stage in num_stages-2..0:
emit instructions assigned to stages stage+1..num_stages-1
// Each epilog iteration removes one stage
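The fill/drain structure is easy to visualize with a sketch that lists which stages each emitted section contains (stage numbering as in the pseudocode above; helper name is illustrative):

```python
def pipeline_shape(num_stages: int):
    """Stages executed per prolog iteration, kernel pass, and epilog iteration."""
    prolog = [list(range(0, s + 1)) for s in range(num_stages - 1)]
    kernel = list(range(num_stages))                       # steady state
    epilog = [list(range(s + 1, num_stages))
              for s in range(num_stages - 2, -1, -1)]      # drain in reverse
    return prolog, kernel, epilog

# S = 3 stages: two fill iterations, the full kernel, two drain iterations
prolog, kernel, epilog = pipeline_shape(3)
assert prolog == [[0], [0, 1]]
assert kernel == [0, 1, 2]
assert epilog == [[2], [1, 2]]
```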
Instruction Latency Classifier (sub_91A0F0)
The classifier is a 5.5KB, 1372-line switch statement mapping approximately 350 Ori opcodes to 13 distinct latency class values. It takes five parameters: (opcode, secondary_opcode, operand_array, operand_count, operand_index) and returns a class ID -- not a cycle count. The scheduler maps class IDs to actual cycle counts via the hardware profile.
Latency Class Table
| Class | Typical opcodes | Meaning |
|---|---|---|
| 1 | Past-end operands, invalid indices | Skip / not used |
| 6 | Simple ALU, bitwise, short integer | Short-pipe latency (~80 opcodes) |
| 7 | Paired register operations | Medium-short (~5 opcodes) |
| 8 | Special cases (via lookup table dword_21E1340) | Medium |
| 9 | Type conversions (via lookup table) | Medium |
| 10 | Integer multiply, shifts, IMAD | Medium-long (~40 opcodes) |
| 11 | Address computations, LEA variants | Medium-long (~15 opcodes) |
| 12 | Memory operations, FP32, barriers | Standard long (~100 opcodes) |
| 14 | Wide memory, atomics, FP64 stores | Extended long (~20 opcodes) |
| 16 | FP64 special variants | Extended long (~3 opcodes) |
| 20 | Texture fetches, uniform loads | Very long (~30 opcodes) |
| 26 | Global memory loads, uncached access | Maximum latency (~25 opcodes) |
| 31 | Scoreboard/barrier-related operands | Special handling (~5 opcodes) |
Opcode Family Handling
| Opcode range | Category | Latency behavior |
|---|---|---|
0x03--0x24 | Integer ALU | Mostly passthrough default; 0x23 always returns 10 |
0x3C, 0x3E, 0x4E, 0x4F | Memory (load/store) | Returns field from operand_array[4] bits for operands 0--1 |
0x46, 0xF3--0x106 | Texture | Returns 6 normally; 10 for MIO-dependent with extended flag check |
0x49, 0x4A, 0x51, 0x143, 0x15E | Atomic/reduce | Always returns 12 |
0x55--0x6F | Floating-point | Complex per-operand logic; 0x55 uses lookup table dword_21E1340 |
0x5B, 0x5C, 0x137 | Barriers/sync | Returns 12 for operand 1, else default |
0xB7, 0x120 | WGMMA setup | Per-operand latency (10--20) based on accumulator flags |
0x135 | HMMA/IMMA | Calls sub_7E39B0/sub_7E3A70/sub_7E3BA0/sub_7E3C30 for matrix latency |
0x13D, 0x13E | Extended FP | Accumulator-flag-dependent returns (10 or 12) |
Stall Cycle Calculator (sub_91E900)
sub_91E900 computes the stall penalty for an instruction by mapping latency classes through the pipe assignment function (vtable+904):
function ComputeStallCycles(code_object, instruction):
lat0 = ClassifyLatency(instruction, operand=0)
pipe0 = PipeAssignment(code_object, lat0) // vtable+904
if pipe0 == 8: // long-latency pipe
stall = StallTable[instruction.index] // code_object+440
return min(stall, 64) // cap at 64 cycles
lat1 = ClassifyLatency(instruction, operand=1)
pipe1 = PipeAssignment(code_object, lat1)
if pipe1 == 8:
stall = StallTable[instruction.index]
return min(stall, 64)
// Neither operand on long pipe
stall = StallTable[instruction.index]
return min(stall, 32) // cap at 32 cycles
The pipe assignment value 8 corresponds to the long-latency functional unit (memory/texture). Instructions on this pipe get a 64-cycle cap; all others are capped at 32 cycles.
GEMM Pipelining (sub_92C240)
The GemmPipeliner* family of knobs controls a specialized pipelining mode for GEMM (matrix multiply) loops:
| Knob Name | Type | Default | Description |
|---|---|---|---|
GemmPipelinerEnabled | BOOL | false | Master enable for GEMM-specific pipelining |
GemmPipelinerPipelineDepthEnforceDeltaFull | INT | 0 | Pipeline depth adjustment for full enforcement |
GemmPipelinerPipelineDepthEnforceDeltaPartial | INT | 0 | Pipeline depth adjustment for partial enforcement |
GemmPipelinerDependenciesPopbl | BOOL | false | Dependency resolution policy between DMA and compute stages |
GemmPipelinerScoreboardHashPopbl | BOOL | false | Scoreboard hash policy for GEMM barrier tracking |
GemmPipelinerUseRegisterCalculation | INT | 0 | Use register-based calculation for pipeline depth vs. fixed |
The extended pipelining in sub_92C240 (8KB) handles GEMM-like patterns where the loop body contains WGMMA/IMMA instructions. From decompilation:
- Activation: The GEMM pipeliner activates when code_object+48 (GEMM mode flag) is set and the pipeline context at code_object+56 has a valid stage range.
- Stage iteration: Iterates from context+84 (start stage) to context+88 (end stage), with 96-byte descriptors per stage at context+136.
- Pipeline depth management: Uses sub_8A4DA0 to validate stage depth and sub_6E6650 for dynamic array resizing when pipeline depth exceeds the current allocation. Writes stage bitmasks (1 << stage_index) into the stage descriptor arrays.
- Hardware model: On SM90+ (Hopper), TMA supports up to 8 outstanding asynchronous copy operations. The GEMM pipeliner matches this hardware depth, staging DMA (memory) and compute (math) operations to fill the pipeline.
The DUMPIR diagnostic output includes For Dma Loop and For Math Loop sections from sub_7A4500, confirming the pipeliner explicitly distinguishes between DMA and compute loop stages.
Other Pipelining Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
OkToPipelineNoUnroll | INT | 0 (disabled) | Allow pipelining even when unrolling was also suppressed |
PipelineHoistCondLimit | INT | unset | Maximum condition complexity for hoisting in pipelined loops |
PipelineHoistRRegPressureLimit | INT | unset | R-register pressure limit for hoisting inside pipelined body |
PipelineHoistPRegPressureLimit | INT | unset | P-register pressure limit for hoisting inside pipelined body |
PipelineMIOVQToInstRatio | DBL | unset | MIOVQ-to-instruction ratio threshold for pipeline profitability |
PipelineMultiOutputTex | INT | 0 (disabled) | Enable pipelining of loops with multi-output texture instructions |
PipelineSpecUsesInHeadOnly | INT | 0 (disabled) | Restrict speculative uses to loop header only |
GPU-Specific Pipeline Concerns
Warp divergence. Pipelined loops assume all threads in a warp execute the same number of iterations. If the trip count is warp-divergent, the prolog/epilog handling must account for early-exit threads. The pass checks the varying analysis (phases 53, 70) to determine divergence.
Barrier placement. Pipelined loops containing BAR.SYNC or MEMBAR instructions are checked by sub_9202D0 -- if the pipe assignment class for a barrier instruction is <= 3, the instruction is rejected from pipelining. The latency classifier (sub_91A0F0) assigns class 12 to barrier operands (opcodes 0x5B, 0x5C, 0x137), but the feasibility check rejects based on pipe class, not latency class.
Memory pipeline depth. The sub_92C240 extended pipeliner for GEMM-like loops manages the hardware memory pipeline on SM90+. It explicitly tracks DMA pipeline depth using 96-byte per-stage descriptors, resizing arrays dynamically when depth exceeds allocation. The stage descriptor at context+136 + 96*stage holds bitmask membership, latency counters, and dependency links.
Pipe class model. The 7-entry pipe class table at code_object+16 partitions the functional units into classes. The post-RA software pipelining variant (sub_8B9390) uses the same table to determine which functional unit class each instruction uses, ensuring resource conflict detection is consistent between the two pipelining layers.
Phases 35, 66, 79, 88 -- OriHoistInvariants (LICM)
Purpose
Hoists computations that produce the same result on every loop iteration out of the loop body and into the preheader. This reduces the dynamic instruction count proportionally to the trip count. The four instances are not redundant -- each targets invariants created by different intervening transformations.
Function Map
All four instances share the same core implementation:
| Function | Size | Role | Confidence |
|---|---|---|---|
sub_C5FE00 | 34 bytes | Phase 35 execute wrapper | CERTAIN |
sub_C5FE30 | 34 bytes | Phase 66 execute wrapper | CERTAIN |
sub_C5FE60 | 34 bytes | Phase 79 execute wrapper | CERTAIN |
sub_C5FE90 | 34 bytes | Phase 88 execute wrapper | CERTAIN |
sub_7DDB50 | 156 bytes | Optimization guard: checks knob 499, block count > 2 | HIGH |
sub_8FFDE0 | 573 bytes | HoistInvariants orchestrator: iterates blocks, queries knob 381, dispatches inner worker | HIGH |
sub_8FF780 | 1,622 bytes | LICM inner worker: identifies and moves invariant instructions | HIGH |
sub_8FEAC0 | 2,053 bytes | Invariance marking: forward/backward operand scan per block | HIGH |
sub_8F76E0 | 90 bytes | Per-instruction invariance test: checks output register def-block | HIGH |
sub_8F7770 | 810 bytes | Hoisting safety check: operand class + latency analysis | HIGH |
sub_8F8CB0 | 658 bytes | Profitability check: budget-weighted score vs latency penalty | HIGH |
sub_8F7DD0 | 374 bytes | Transitive invariance propagation through def-use chains | HIGH |
sub_8F7AE0 | 558 bytes | Instruction mover: unlinks from loop, inserts at preheader | HIGH |
sub_8FF2D0 | 1,186 bytes | Budget computation + invariant marking + hoist dispatch | HIGH |
sub_8F8BC0 | 257 bytes | Instruction counting: header/body weight via isNoOp | HIGH |
sub_74D720 | 353 bytes | Loop boundary analysis: barrier/jump/predecessor checks | HIGH |
sub_74F500 | -- | Preheader location finder | MEDIUM |
sub_7DF3A0 | 88 bytes | Opcode flags table lookup (side-effect classification) | HIGH |
sub_7E0540 | 156 bytes | Observable side-effect checker (memory, call, barrier) | HIGH |
Execute Flow
sub_C5FExxx(phase_obj) // 34-byte vtable dispatch
└─ sub_8FFDE0(code_object, pass_id) // orchestrator
├─ sub_7DDB50(code_object) // guard: returns block count, checks knob 499
├─ sub_799250(allocator, "HoistInvariants", &skip) // DUMPIR check
└─ sub_8FF780(context) // per-loop LICM core
├─ sub_781F80 // rebuild instruction list
├─ sub_7E6090 // recompute register pressure
├─ sub_773140 // recompute loop depths
├─ sub_74D720 // analyze loop boundaries
├─ sub_74F500 // find preheader
├─ sub_7A1A90 / sub_7A1B80 // query knob 381 per block
└─ sub_8F8BC0 // header/body instruction counting
Why Four Instances?
| Phase | Pass ID (a2) | Pipeline Position | What Creates New Invariants |
|---|---|---|---|
35 (Early) | 0 | After GeneralOptimize (29), ExtractShaderConsts (34) | CSE eliminates redundant expressions, exposing loop-invariant results; shader constant extraction hoists uniform loads |
66 (Late) | 1 | After predication (63), GeneralOptimizeLate2 (65) | Predication converts conditional branches to predicated instructions; if the condition is loop-invariant, the entire predicated instruction becomes invariant |
79 (Late2) | 2 | After LateExpansionUnsupportedOps (78) | Late expansion splits compound operations into sequences; address computations and constant sub-expressions in expanded sequences are often invariant |
88 (Late3) | 3 | After FixupGmmaSequence (87) | GMMA fixup reorders/inserts instructions for wgmma hardware constraints; descriptor loads and accumulator setup become visible as invariants |
Pass ID Controls Aggressiveness
The pass_id parameter (parameter a2 of sub_8FFDE0) affects which loops are processed and how aggressively hoisting is performed. From the decompiled logic at sub_8FFDE0:
// sub_8FFDE0 lines 58-89 (simplified)
v7 = sub_7A1B80(allocator, 381, block); // query knob 381 for this block
if (v7 == 1) { // knob says "inner loops only"
if (pass_id == 1) goto hoist_block; // Late pass: proceed
goto skip_block; // Early pass: skip
}
if (v7 == 3) { // knob says "never"
if (pass_id <= 1) goto handle_conservative;
goto skip_block;
}
if (v7 == 0) { // knob says "always"
if (pass_id == 0) goto hoist_aggressively;
goto skip_block;
}
- pass_id = 0 (Early): Hoists aggressively and calls sub_A112C0(code_object, 1) to re-run sub-analyses afterward. This is the most aggressive pass.
- pass_id = 1 (Late): Includes inner-loop-only blocks, but skips the re-analysis call.
- pass_id >= 2 (Late2, Late3): Most conservative -- only hoists from blocks where knob 381 returns 0 (always-hoist).
Per-Block Knob 381 Policy
The LICM pass queries OCG knob 381 (sub_7A1A90 / sub_7A1B80) per basic block to determine the hoisting policy:
| Knob 381 Result | Meaning |
|---|---|
| 0 | Always hoist from this block |
| 1 | Hoist from inner loops only |
| 3 | Never hoist from this block |
This per-block granularity allows the knob system to selectively disable hoisting in specific loop nests (e.g., those known to be register-pressure-critical).
Guard Function (sub_7DDB50)
Before the LICM core runs, sub_7DDB50 (156 bytes) gates execution on two conditions:
- Knob 499 enabled. Queries the allocator vtable at +72 for OCG knob 499 (the master LICM switch). If disabled, returns 1, which causes the orchestrator to bail (since 1 <= 2).
- Rate limiter. When knob 499 is enabled, the guard checks a pair of counters at allocator[9]+35936 (max invocations) and allocator[9]+35940 (current count). If the current count has reached the maximum, returns 1. Otherwise it increments the counter and returns the actual basic block count from code_object+2104. This bounds the number of LICM invocations for compile-time control in functions with many loops.
- Block count > 2. The orchestrator (sub_8FFDE0) checks the return value: if <= 2, no hoisting is attempted. Single-block functions have no loops; two-block functions have at most a trivial loop not worth processing.
LICM Invariant Detection Algorithm
The invariance detection pipeline runs inside sub_8FF2D0 (1,186 bytes), which is called from sub_8FF780 once per loop nest level. It executes five stages in sequence: budget computation, forward invariance marking, backward non-invariance marking, transitive propagation, and profitability gating.
Stage 1: Budget Computation (sub_8FF2D0)
function ComputeHoistBudget(context, block, is_simple, num_preds, hoist_mode, is_inner):
// Base budget from knob 483 (HoistBudget)
if QueryKnob(483):
budget = QueryKnobValue(483) // 0 = unlimited
else:
budget = 10 // default
// CBO budget from knob 482
if QueryKnob(482):
cbo_budget = QueryKnobValue(482)
else:
cbo_budget = (pass_id == 0) ? 22 : 100
// Adjust by loop type and depth
if pass_id > 0 and is_simple:
budget = (hoist_mode < 2) ? cbo_budget : 300
else if pass_id == 0 and is_simple:
budget = (hoist_mode < 2) ? cbo_budget : 200
// Conservative multiplier for Late3
if pass_id == 3:
budget *= 100 // generous once decided to hoist
// Split budget among back-edge blocks
if hoist_mode == 3: // processing back-edge block
budget /= num_preds
// Inner-loop divisor from knob 380
if is_inner:
if QueryKnob(380):
budget /= QueryKnobValue(380)
else:
budget /= 10
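Mirrored as runnable Python, the budget computation behaves as follows (a sketch only; the function and argument names are ours, and None stands for an unset knob):

```python
def compute_hoist_budget(pass_id, is_simple, num_preds, hoist_mode, is_inner,
                         knob483=None, knob482=None, knob380=None):
    """Sketch of the Stage 1 budget computation recovered above.
    knob arguments of None model 'knob not set' (compiled-in defaults)."""
    budget = knob483 if knob483 is not None else 10
    cbo_budget = knob482 if knob482 is not None else (22 if pass_id == 0 else 100)

    # Adjust by loop type
    if pass_id > 0 and is_simple:
        budget = cbo_budget if hoist_mode < 2 else 300
    elif pass_id == 0 and is_simple:
        budget = cbo_budget if hoist_mode < 2 else 200

    if pass_id == 3:
        budget *= 100          # Late3: generous once hoisting was decided

    if hoist_mode == 3:        # back-edge block: split among predecessors
        budget //= num_preds

    if is_inner:               # inner-loop divisor (knob 380, default 10)
        budget //= knob380 if knob380 is not None else 10
    return budget
```

Running the early pass on a simple loop header (pass_id 0, hoist_mode 0) yields the CBO default of 22, while the Late3 pass on an inner back-edge block divides the inflated budget back down.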
Stage 2: Forward Invariance Marking (sub_8FEAC0, a3=1)
The forward pass iterates every instruction in the basic block and marks each register operand's invariance status based on where it was defined.
function MarkInvariants_Forward(context, block_index):
block = blocks[block_index]
header_depth = context.header_depth
loop_depth_range = [context.header_depth, context.max_depth]
// Two code paths based on knob 934 (UseNewLoopInvariantRoutineForHoisting)
if QueryKnob(934):
// Advanced path: set-based computation via sub_768BF0 + sub_8F7280
return MarkInvariants_SetBased(context, block_index)
// Default path: single-pass scan
for each instruction in block (linked list: block+0 .. sentinel at block+8):
has_side_effect = isNoOp(instruction) // vtable+1824
opcode = instruction+72 (masked: BYTE1 &= 0xCF)
num_operands = instruction+80
// Special case: opcode 195 + first dst is reg class 9 (predicate)
is_predicate_def = (opcode == 195 and dst_reg.class == 9)
is_invariant = true
for each operand from LAST to FIRST: // reverse scan
operand = instruction + 84 + 8*i
type = (operand >> 28) & 7
if type != 1: // not a register
continue // immediates are always invariant
if IsFixedRegister(operand, code_object): // sub_7DEB90
continue // e.g., RZ, PT — always available
if pass_id == 3: // Late3 extra check
if IsSpecialRegClass(operand, code_object): // sub_7DA2F0
// Exception: IADD3 (opcode 130, flag 0x1000) carry-out
if not (opcode == 130 and flag_0x1000 and is_penultimate_operand):
continue
reg = RegisterDescriptor(code_object, operand & 0xFFFFFF)
if reg.def_block (reg+76) == block_index:
// Defined in THIS block — not invariant for this loop
is_invariant = false
else if context.is_multi_depth:
def_instr = reg.def_instruction (reg+56)
if def_instr is null or reg has pinned bit:
handle_predicate_invariance()
else:
def_block = blocks[def_instr.block_index]
def_depth = def_block.loop_depth (offset +144)
if def_depth < header_depth or def_depth > max_depth:
reg.use_count (reg+80) = 0 // mark as loop-external
else:
is_invariant = false
reg.def_block (reg+76) = block_index
else:
reg.use_count (reg+80) = 0 // simple loop: mark external
// Side-effect check for the entire instruction
flags = LookupOpcodeFlags(instruction, code_object) // sub_7DF3A0
if (flags & 2) != 0: // has memory/control side effect
is_invariant = false
if MemoryOverlapsLoopLiveSet(instruction): // sub_74F5E0
is_invariant = false
if is_multi_depth and HasObservableSideEffects(instruction): // sub_7E0540
is_invariant = false
// Mark destination operands
for each dst_operand (bit 31 set = definition):
if type == 1 and not pinned:
if is_invariant:
reg.def_block = block_index // mark for hoisting
else:
reg.use_count += 1 // count loop-internal uses
The key insight is that invariance is determined by definition site: if every source register was defined outside the loop (or in a block already processed), the instruction is invariant. Immediates and constants are trivially invariant. The check is not purely structural -- it uses the reg+76 field which gets updated as hoisting proceeds, allowing transitive invariance discovery.
Stage 3: Backward Non-Invariance Marking (sub_8FEAC0, a3=0)
The backward pass uses the same function with a3=0. Instead of marking definitions as external, it marks operands whose definitions are inside the loop as non-invariant by setting reg.def_block = block_index. This clears any false positives from the forward pass where a register appeared invariant but its defining instruction depends on a loop-variant value.
For destination operands, the backward pass increments reg.use_count for all non-pinned register definitions, building the use-count information needed by the profitability check.
Stage 4: Transitive Invariance Propagation (sub_8F7DD0)
After the two marking passes, sub_8F7DD0 propagates invariance transitively through the instruction chain. This handles the case where instruction A is invariant and defines register R, and instruction B uses R and is otherwise invariant -- the forward pass may have marked B as non-invariant because R's definition was in the loop, but A (the definer) is itself invariant.
function PropagateInvariance(context, block_index):
block = blocks[block_index]
side_effect_mask = 0
for each instruction in block:
aliases_memory = CheckMemoryAlias(code_object, instruction) // sub_74F5E0
for each operand (type == 1, register):
reg = RegisterDescriptor(operand)
if operand is definition (bit 31 set):
if isNoOp(instruction):
if IsInvariant(instruction, block_index): // sub_8F76E0
side_effect_mask |= reg.flags & 0x3
else:
reg.flags |= aliases_memory ? 1 : 0
else:
reg.flags |= (has_side_effect ? 1 : 0) | 2
else: // use
if has_side_effect:
reg.def_block = block_index // taint defining register
else:
reg.use_count += 1
return side_effect_mask
Stage 5: Profitability Check (sub_8F8CB0)
The final gate before hoisting. Computes a cost-benefit ratio and rejects hoisting if the ratio is unfavorable.
function IsProfitable(context, block_index, budget, is_hoist_safe):
header_weight = context.header_insn_count // from sub_8F8BC0
body_weight = context.body_insn_count
// Scoring weights depend on pass aggressiveness and safety
if is_hoist_safe:
noOp_weight = (pass_id == 0) ? 60 : 150
real_weight = 5
else:
noOp_weight = (pass_id == 0) ? 12 : 30
real_weight = 1
score = 0
latency_penalty = 0
instruction_count = 0
for each instruction in block:
instruction_count += 1
if IsInvariant(instruction, block_index): // sub_8F76E0
if isNoOp(instruction):
score += noOp_weight
else:
score += 1
for each dst_operand with scoreboard flag:
score += real_weight
latency = GetLatencyClass(instruction) // sub_91E860
latency_penalty += (latency > 4) ? 2 : 1
else:
for each high-latency dst_operand:
latency_penalty += (latency > 4) ? 2 : 1
// Final decision: weighted score vs latency cost
if pass_id == 0: // aggressive
denominator = real_weight * instruction_count
else:
denominator = body_weight / 3 + header_weight
return denominator != 0 and (score * budget) / (real_weight * denominator) >= latency_penalty
The profitability check encodes a fundamental GPU tradeoff: hoisting reduces dynamic instruction count (proportional to trip count) but extends live ranges (increasing register pressure and reducing occupancy). The budget parameter, which varies by 100x between pass_id 0 and 3, controls how aggressively this tradeoff is resolved. Pass_id 0 (Early) uses the smallest denominator, making it easiest to exceed the threshold.
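The scoring loop above can be condensed into a runnable sketch (Python; the instruction field names and the treatment of non-invariant high-latency outputs are our reading of the decompilation, not ptxas's own structures):

```python
def is_profitable(pass_id, budget, is_hoist_safe, instructions,
                  header_weight, body_weight):
    """Sketch of the Stage 5 score-vs-latency gate. Each instruction is a
    dict with keys: invariant (bool), noop (bool), sb_dsts (count of
    scoreboarded destination operands), latency (latency class)."""
    if is_hoist_safe:
        noop_w = 60 if pass_id == 0 else 150
        real_w = 5
    else:
        noop_w = 12 if pass_id == 0 else 30
        real_w = 1

    score, penalty, count = 0, 0, 0
    for insn in instructions:
        count += 1
        lat_cost = 2 if insn["latency"] > 4 else 1
        if insn["invariant"]:
            if insn["noop"]:
                score += noop_w
            else:
                score += 1
                score += real_w * insn["sb_dsts"]
                penalty += lat_cost
        else:
            # non-invariant: only high-latency outputs add cost
            penalty += lat_cost * insn["sb_dsts"]

    denom = real_w * count if pass_id == 0 else body_weight // 3 + header_weight
    return denom != 0 and (score * budget) // (real_w * denom) >= penalty
```

With pass_id 0 the denominator is just the weighted instruction count, which is why the Early pass clears the threshold most easily.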
Per-Instruction Invariance Test (sub_8F76E0)
The leaf-level invariance test used by stages 4 and 5 is a simple definition-site check:
function IsInvariant(instruction, current_block_index):
num_operands = instruction.operand_count // inst+80
if num_operands == 0:
return false
// Find the last "interesting" operand (skip immediates/constants)
// Immediates have type bits in the 0x70000000 range
last_operand = scan backwards from operand[num_operands-1]
while (operand XOR 0x70000000) & 0x70000000 == 0
// Check: is this a register definition outside the current block?
if last_operand is negative (bit 31 = definition)
and type_bits == 1 (register)
and not pinned (byte+7 bit 0 == 0):
reg = RegisterDescriptor(last_operand & 0xFFFFFF)
return reg.def_block (reg+76) != current_block_index
return false
This is the most-called function in the LICM pipeline. It checks whether an instruction's primary output register was defined outside the current block -- if so, the instruction is considered invariant (already hoisted or defined in a dominating block).
Side-Effect Blocking Rules
An instruction is blocked from hoisting if any of the following conditions hold, regardless of operand invariance:
| Check | Function | Condition |
|---|---|---|
| Memory store | sub_7DF3A0 | Flags byte bits 2-3 set and bit 5 clear |
| Memory barrier | sub_74D720 | Opcode 159 (BAR.SYNC), 32 (MEMBAR), or 271 (barrier variant) |
| Indirect jump | sub_74D720 | Opcode 236 (BRX) |
| Volatile/atomic access | sub_7DFA80 | Called from sub_7E0540; detects volatile or atomic memory |
| Function call | vtable+1456 | isBarrier() returns true |
| Texture side effect | sub_7DF3A0 | Flags byte bit 6 set with operand modifier flag |
| Address-space effect | sub_7E0540 | Opcodes 85/109 (memory ops) with (flags+20 & 2) != 0 |
The boundary analysis (sub_74D720) also produces a 5-byte result array that gates the entire loop:
| Byte | Meaning | Effect |
|---|---|---|
| 0 | Has external predecessor (outside loop depth range) | Skip loop (not a natural loop) |
| 1 | Non-header block with different nesting | Marks as complex multi-depth loop |
| 2 | Contains barrier instruction | Skip loop entirely |
| 3 | Contains indirect jump | Skip loop entirely |
| 4 | Multi-depth safety flag | AND-ed with sub_7E5120 per inner block |
Instruction Counting (sub_8F8BC0)
Before the profitability check, sub_8F8BC0 counts instructions in the loop header and body separately. It walks the instruction linked list for each block in the loop and classifies each instruction using isNoOp (vtable+1824):
- No-op instruction (scheduling placeholder, predicate set, etc.): weight 1
- Real instruction (ALU, memory, branch, etc.): weight 30
The header count is stored at context+64 and the body count at context+68. The profitability formula uses these to normalize the hoisting score: a loop with a heavy header relative to the body benefits less from hoisting.
Instruction Movement (sub_8F7AE0)
After all checks pass, sub_8F7AE0 physically moves each invariant instruction from the loop body to the preheader:
- Invariance re-check. Calls sub_8F76E0 one final time per instruction. Instructions whose invariance status changed during the marking passes are skipped.
- Knob 484 gate. Queries the allocator for knob 484; if disabled, no movement occurs. This provides a fine-grained override separate from the loop-level knob 381.
- Preheader creation. On the first hoisted instruction, creates or locates the preheader block:
  - If the loop has an existing preheader block (context+16 non-null): clones it via sub_931920, copies convergence flags from the original preheader's offset +282 bit 3, and links it into the CFG via sub_8F7610.
  - If no preheader exists: creates a new block via sub_92E1F0 and links it.
- Unlink and reinsert. For each invariant instruction:
  - sub_9253C0(code_object, instruction, 1): unlinks the instruction from the current block.
  - sub_91E290(code_object, instruction): inserts at the preheader insertion point.
  - Updates the Ori instruction's control word at instruction+32 (not the SchedNode): sets bit 1 at byte offset +13 to mark the instruction as hoisted (prevents the scheduler from reordering it back into the loop).
- Destination register tracking. For each output operand, if the defining instruction at reg+56 differs from the current instruction, sets context+44 (hoisted_cbo flag). For pass_id == 2, additionally sets reg+48 bit 26 if the register class is in {2, 3, 4} (GPR classes) and the preheader has the convergence flag.
- Special IADD3 handling. For pass_id == 3, instructions with opcode 130 (IADD3), flag 0x1000, and a negative byte at +90 (carry chain) receive special treatment via sub_9232B0, which adjusts the carry-out register linkage before movement.
Multi-Depth Loop Handling
For loops with nesting depth > 1 (inner loops within the hoisting target), sub_8FF780 performs multiple rounds of sub_8FF2D0 calls:
- Header block. First call processes the loop header with hoist_mode = 0.
- Intermediate blocks. For each depth level between header_depth+1 and max_depth, checks if the block's parent depth (block+148) matches the header depth. If the block is a back-edge predecessor of the loop header, uses hoist_mode = 3. Otherwise, checks a dominance bitmap at block[25] + 4*(depth >> 5): if bit (1 << depth) is set, uses hoist_mode = 1 (dominated); otherwise hoist_mode = 2 (non-dominated).
- Back-edge block. Final call with hoist_mode = 3 and the deepest back-edge block index, ensuring the budget is split among back-edge predecessors.
Multi-depth permission is gated by knob 220 (queried at allocator[9]+15840 for the fast path) and the DisableNestedHoist knob. When hoisting from an inner loop to the header of an outer loop, an additional constraint applies:
allow_nested = allow_nested_hoist AND is_simple_loop
AND body_insn_count > 1
AND num_predecessors == 1
AND body_insn_count < header_insn_count * max_iterations
This prevents hoisting from inner loops where the cost (extended live range across multiple loop levels) exceeds the benefit (reduced inner-loop dynamic count).
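Expressed as a predicate (Python sketch; the parameter names are ours):

```python
def allow_nested_hoist(nested_knob_enabled, is_simple_loop,
                       body_insn_count, header_insn_count,
                       num_predecessors, max_iterations):
    """Sketch of the inner-to-outer hoist gate: nested hoisting must be
    enabled, the loop simple with a single predecessor, and the body
    small relative to header weight times the iteration limit."""
    return (nested_knob_enabled
            and is_simple_loop
            and body_insn_count > 1
            and num_predecessors == 1
            and body_insn_count < header_insn_count * max_iterations)
```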
LICM Outer Loop (sub_8FF780)
The complete outer driver that iterates over all loop nests:
function HoistInvariantsCore(context):
code_object = context.code_object
pass_id = context.pass_id
// Read iteration limit from allocator+34632
config_byte = allocator[34632]
max_iterations = (config_byte == 0) ? 2
: (config_byte == 1) ? allocator[34640]
: 0 // unlimited
allow_nested_hoist = (allocator[20016] != 0)
// Prepare IR
RebuildInstructionList(code_object, 1) // sub_781F80
RecomputeRegisterPressure(code_object, 1, 0, 0, 0) // sub_7E6090
RecomputeLoopDepths(code_object, 0) // sub_773140
if code_object.flags[176] & 2 and pass_id > 1:
RecomputeLoopNesting(code_object) // sub_789280
// Clear prior invariance markers
for each block in instruction list:
block.marker (offset +76) = 0xFFFFFFFF
// Iterate from innermost loop outward (last RPO entry first)
current = blocks[rpo[block_count]]
while current is valid:
if current has no predecessors or no first instruction:
advance; continue
// Count predecessors at >= current loop depth
header_depth = current.loop_depth // offset +144
for each predecessor:
if pred.loop_depth >= header_depth:
num_at_depth++; track deepest back-edge index
if num_at_depth == 0: // not a loop header
advance; continue
// Simple vs multi-depth
if max_depth == header_depth:
is_simple = true
else:
info = AnalyzeBoundaries(code_object, header_depth, max_depth)
if has_external_pred or has_barrier or has_indirect_jump:
advance; continue
if !MultiDepthAllowed(knob_220):
advance; continue
context.is_multi_depth = true
// Find preheader and query knob 381
context.insert_pt = FindPreheader(code_object, current, ...)
if !ShouldHoist(QueryKnob381(381, current), pass_id, opt_level):
advance; continue
// Count instruction weights
CountInstructions(context) // sub_8F8BC0
// === CORE HOISTING PIPELINE (per loop) ===
sub_8FF2D0(context, header_block, ...) // header block
if context.is_multi_depth:
for depth in (header_depth+1 .. max_depth-1):
sub_8FF2D0(context, block_at_depth, ..., hoist_mode, ...)
sub_8FF2D0(context, back_edge_block, ..., 3, ...) // back-edge
// Post-hoist cleanup
if context.changed and current.num_back_edge_successors > 1:
RebuildInstructionList(code_object, 0)
advance to next loop
Hoisting Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| HoistBudget | FLOAT | 10 | Maximum number of instructions to hoist per loop (0 = unlimited) |
| HoistLoopInvBudget | FLOAT | 22 (early) / 100 (late) | Budget specifically for loop-invariant hoisting; pass_id 0 uses 22, pass_id > 0 uses 100 |
| HoistConservativeScale | INT | 10 (divisor) | Inner-loop budget divisor; budget /= scale when hoisting from inner loops |
| HoistLate | INT | per-block policy | Per-block hoisting policy (0=always, 1=inner only, 3=never) |
| HoistCBOMode | INT | 0 | Constant-buffer-object hoisting mode |
| HoistCBOLoad | INT | unset | Enable hoisting of CBO load instructions |
| HoistCBOFromLoopWithColdNest | INT | 1 (enabled) | Hoist CBO loads even from loops with cold nesting |
| HoistCBOHighCostSBInstRatioThreshold | INT | unset | Scoreboard cost threshold for CBO hoisting |
| HoistCBOLoadIDOMTravseLimit | INT | 4 | IDOM traversal limit for CBO load hoisting |
| HoistCBORRegPressureLimitApplyRate | INT | 80 | R-register pressure limit application rate (percentage) |
| HoistTexToInstRatioHigh | DBL | 0.045 | High texture-to-instruction ratio threshold for aggressive hoisting |
| HoistTexToInstRatioLow | DBL | 0.03 | Low texture-to-instruction ratio threshold for conservative hoisting |
| DisableNestedHoist | BOOL | false | Disable hoisting from nested loops (false = nested hoisting allowed) |
| NestedHoistInnerThreshold | INT | 22 / 100 | Inner loop instruction threshold for nested hoisting (same value as HoistLoopInvBudget) |
| NestedHoistOuterThreshold | INT | 10 | Outer loop instruction threshold for nested hoisting (same value as HoistBudget) |
| UseNewLoopInvariantRoutineForHoisting | BOOL | false | Use updated set-based invariance check routine (legacy single-pass is default) |
| MaxMidHeaderSizeRateForAggressiveHoist | INT | 2 | Maximum LICM iteration count (limits repeated hoisting passes) |
| EnableHoistLowLatencyInstMidBlock | BOOL | false | Hoist low-latency instructions from mid-block positions |
| MovWeightForSinkingHoisting | DBL | 0.25 | Weight for MOV instructions in sink/hoist decisions |
GPU-Specific LICM Concerns
Constant buffer loads. GPU shaders frequently load from constant buffers (LDC). These loads are loop-invariant by definition (the buffer is read-only during kernel execution). The HoistCBO* knobs control a specialized path that aggressively hoists these loads, trading register pressure for reduced memory traffic.
Register pressure vs. occupancy. Every hoisted instruction extends its live range from the preheader through the entire loop. On GPUs, this directly reduces occupancy. The four LICM passes use increasingly conservative heuristics (controlled by pass_id) to avoid excessive register growth in later pipeline stages where register allocation is imminent.
Texture instruction hoisting. Texture fetches (TEX, TLD, TLD4) are high-latency and loop-invariant when their coordinates are loop-invariant. The HoistTexToInstRatio* knobs provide thresholds for deciding when to hoist texture instructions -- a tradeoff between reducing loop body latency and increasing preheader register pressure.
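The exact consumer of the texture-ratio knobs was not recovered, so the following is only an illustration of the threshold arithmetic they imply (Python; the function name and return values are ours):

```python
def tex_hoist_policy(tex_count, inst_count, high=0.045, low=0.03):
    """Hypothetical gate built from HoistTexToInstRatioHigh/Low:
    a high texture density favors aggressive hoisting, a moderate
    density a conservative policy, and a low density none."""
    if inst_count == 0:
        return "none"
    ratio = tex_count / inst_count
    if ratio >= high:
        return "aggressive"
    if ratio >= low:
        return "conservative"
    return "none"
```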
Phase 59 -- OriLoopFusion
Purpose
Fuses adjacent loops with compatible bounds and no inter-loop data dependencies into a single loop. This reduces loop overhead (branch, induction variable update) and creates opportunities for the scheduler to overlap instructions from the formerly separate loop bodies.
Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| PerformLoopFusion | INT | 0 (disabled) | Master enable for loop fusion; must be explicitly set to a nonzero value |
| PerformLoopFusionBudget | FLOAT | unset | Maximum instruction count in fused body |
Fusion Criteria
Two adjacent loops L1 followed by L2 are candidates for fusion when:
- Same trip count. Both loops iterate the same number of times (same induction variable bounds and stride, or equivalent after normalization).
- No violated inter-loop dependencies. No flow dependence (write in L1, read in L2) that crosses iteration boundaries differently after fusion. Since both loops are sequential pre-fusion, this reduces to: L2 must not read a value written by L1 at a different iteration index.
- Compatible loop structure. Both must be single-basic-block bodies (or the fused body must remain within the PerformLoopFusionBudget instruction limit).
- No intervening barriers. No BAR.SYNC, MEMBAR, or fence instructions between the two loop bodies.
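A minimal illustration of the transformation, using Python as a stand-in for two adjacent same-trip-count loops with no cross-loop dependence:

```python
def before(n, a, b, c, d):
    for i in range(n):        # L1
        c[i] = a[i] + b[i]
    for i in range(n):        # L2: reads only its own iteration's inputs
        d[i] = a[i] * 2
    return c, d

def after(n, a, b, c, d):
    for i in range(n):        # fused body: one branch + IV update per iteration
        c[i] = a[i] + b[i]
        d[i] = a[i] * 2
    return c, d
```

Both versions compute identical results; the fused form halves the loop overhead and lets the scheduler interleave the two bodies.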
Pipeline Position Rationale
Phase 59 runs after GeneralOptimizeLate (phase 58) and before predication (phase 63). This position is chosen because:
- Late expansion (phase 55) may have split a single operation into a pair of loops (e.g., an atomic-reduce pattern becomes a compare loop followed by an exchange loop).
- After fusion, the merged loop body gives predication (phase 63) a larger basic block to work with, improving if-conversion opportunities.
- The subsequent LICM (phase 66) can hoist invariants from the fused loop that were not hoistable from either original loop individually (because they appeared in the "between-loops" region).
Loop Infrastructure Functions
Several utility functions are shared across the loop passes:
| Function | Address | Size | Purpose |
|---|---|---|---|
| sub_781F80 | 0x781F80 | -- | Rebuild instruction linked list after CFG modification |
| sub_789280 | 0x789280 | -- | Recompute loop nesting depths (called when flags[176] & 2 set) |
| sub_773140 | 0x773140 | -- | Recompute register pressure estimates |
| sub_7E6090 | 0x7E6090 | 2,614 | Create complex multi-operand instruction (used in unroll body duplication) |
| sub_7753F0 | 0x7753F0 | -- | Loop peeling setup (splits first/last iterations) |
| sub_789BE0 | 0x789BE0 | -- | Back-edge canonicalization |
| sub_74D720 | 0x74D720 | -- | Loop boundary analysis (determines header, latch, exit) |
| sub_74F500 | 0x74F500 | -- | Find preheader block for a given loop |
| sub_9253C0 | 0x9253C0 | -- | Edge splitting / preheader block insertion |
| sub_7A1A90 | 0x7A1A90 | -- | OCG knob query (boolean) |
| sub_7A1B80 | 0x7A1B80 | -- | OCG knob query (multi-valued) |
| sub_799250 | 0x799250 | -- | Named-phase DUMPIR check (string match against phase name) |
| sub_A112C0 | 0xA112C0 | -- | Trigger sub-analysis re-run (liveness, CFG refresh) |
| sub_BDEA50 | 0xBDEA50 | -- | Back-edge information printer (bix%d -> backedge's successor BB: %d) |
Related Passes
| Phase | Name | Relationship |
|---|---|---|
| 3 | AnalyzeControlFlow | Builds the CFG, identifies loops, computes dominators -- prerequisite for all loop passes |
| 19 | OriSplitLiveRanges | Splits live ranges at loop boundaries to reduce register pressure post-simplification |
| 20 | PerformPGO | Applies profile data that informs unrolling and pipelining heuristics |
| 21 | OriStrengthReduce | Reduces induction variable strength before unrolling |
| 23 | GenerateMovPhi | Inserts SSA phi nodes after unrolling changes the CFG |
| 25 | StageAndFence | Inserts memory fences needed by pipelined loops |
| 56 | SpeculativeHoistComInsts | Speculatively hoists common instructions above branches (related to LICM) |
| 108 | OptimizeHotColdInLoop | Post-RA hot/cold partitioning within loop bodies |
| 138 | OriSplitHighPressureLiveRanges | Last-resort splitter when unrolling or LICM caused excessive register pressure |
Cross-References
- Pass Inventory & Ordering -- complete 159-phase table
- Strength Reduction -- phase 21, IV simplification before unrolling
- Predication -- phase 63, creates new LICM opportunities for phase 66
- GMMA/WGMMA Pipeline -- phases 85, 87, creates LICM opportunities for phase 88
- Late Legalization -- phase 78, creates LICM opportunities for phase 79
- Hot/Cold Partitioning -- phase 108, loop-interior hot/cold splitting
- Liveness Analysis -- phases 16, 33, 61, 84 -- liveness drives unroll register pressure
- Knobs System -- knob infrastructure, ROT13 encoding
- Scheduling Architecture -- pipelined loops interact with the instruction scheduler
Strength Reduction
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Phase 21 (OriStrengthReduce) replaces expensive arithmetic operations with cheaper equivalents in the Ori IR. It runs early in the optimization pipeline -- after loop simplification (phase 18) and live range splitting (phase 19), but before loop unrolling (phase 22) and software pipelining (phase 24). This placement is deliberate: strength reduction benefits from canonicalized loop structure and benefits subsequent loop transformations by simplifying induction variable expressions.
Strength reduction in ptxas is not a single monolithic pass. It is distributed across three layers, each operating at a different abstraction level:
- Phase 21 (OriStrengthReduce) -- Ori-level induction variable strength reduction on the use-def graph
- Peephole patterns -- SASS-level algebraic simplifications in the MainPeepholeOptimizer (sub_83EF00)
- Division lowering templates -- Newton-Raphson integer division sequences emitted during instruction selection
| Phase index | 21 |
| Phase name | OriStrengthReduce |
| Category | Optimization |
| Pipeline position | Stage 2 (Early Optimization), between PGO (phase 20) and loop unrolling (phase 22) |
| Vtable address | off_22BD910 |
| execute() | sub_C5FB30 (wrapper) -> sub_752E40 (core logic, 359 lines decompiled, ~1.2 KB binary) |
| isNoOp() | sub_C5F3D0 -- returns 0 (always runs) |
| getName() | sub_C5F3C0 -- returns 21 |
| Gate knob | 487 (general optimization enablement) |
| Key helpers | sub_745A80 (replacement register creator), sub_91BF30 (virtual register allocator), sub_A13890 (use-def chain iterator), sub_9253C0 (instruction deleter) |
| Peephole SHR+SHL->BFE | sub_81DB30 (matcher: sub_81D7E0) |
| Peephole BFE+ADD | sub_81DDD0 (matcher: sub_81DBC0) |
| Division templates | sub_1724A20 (32-bit, 28 KB), sub_1728930 (64-bit unsigned, 16.5 KB), sub_1727AC0 (64-bit signed, 13.7 KB) |
Phase 21: Induction Variable Strength Reduction
Architecture
The execute wrapper (sub_C5FB30) gates on multi-function compilation (function count > 1 via sub_7DDB50) and delegates to sub_752E40 with parameters (context, 0, 0, 0).
sub_752E40 is the core. It performs a single-pass walk over the instruction list, focusing on a specific intermediate opcode -- opcode 137 (SM73_FIRST), masked with & 0xFFFFCFFF to strip modifier bits in the opcode field at instruction offset +72. The ROT13 name SM73_FIRST is a generation boundary marker name, but the Ori IR reuses this opcode slot at runtime for IMAD-like multiply-accumulate instructions in their pre-lowered form. The actual SASS IMAD is opcode 1.
Algorithm
The pass executes in two phases within a single call:
Phase 1 -- Trivial multiply elimination. The first loop walks the instruction list (*(context+272) is the list head). For each instruction with masked opcode == 137 (SM73_FIRST; IMAD-like):
- Check if the destination register (operand at +84) has no uses (*(def+56) == NULL) AND the source chain is empty (*src_chain == NULL). If both hold, delete the instruction via sub_9253C0 -- it is dead.
- Otherwise, for each source operand (iterating from operand count - 1 down to 0):
  - Check operand type: must be register ((operand >> 28) & 7 == 1)
  - Look up the register definition in the SSA value table (*(context+88) + 8 * (operand & 0xFFFFFF))
  - Check the definition has no special flags (*(def+48) & 0x400000022 == 0)
  - Check the register type is not 9 (predicate register)
  - Check the source operand's use chain is empty (single-use) and the def has no other users
  - If all conditions hold, call sub_91BF30 to allocate a replacement register with the same type, then patch the operand in place
- Check operand type: must be register (
Phase 2 -- Use-def chain traversal. The second loop walks the instruction list again. For each instruction with operands that have been marked (flag 0x100 at instruction[6], set during initialization):
- For each source operand with a use chain:
  - Compute the replacement register via sub_745A80(context, def, a4), which:
    - Allocates a new virtual register via sub_91BF30 with the same type as the original
    - Copies the data type field (+16) and relevant flags (0x40, 0x10, 0x8 bits of the flags word at +48)
    - Returns the new register ID
  - If the operand was not yet marked (flag 0x100 bit not set), initialize it and mark as "needs strength reduction"
  - Traverse the use chain as a worklist: for each user of the replaced register, check if its uses also need updating, growing the worklist dynamically (doubling allocation via pool allocator)
- Track how many source operands were rewritten (v72 counter)
- After processing all operands of an instruction: if the instruction is still opcode 137 (SM73_FIRST; IMAD-like) and certain conditions hold (destination matches source pattern, specific operand bit patterns), either delete it or convert it to opcode 130 / 0x82 (HSET2 in ROT13; used as an internal MOV-like marker -- actual SASS MOV is opcode 19)
The worklist traversal is the key algorithmic insight: when a multiply's result feeds into another multiply, the chain of strength reductions propagates transitively through the def-use graph.
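The worklist propagation can be sketched as follows (Python; users_of is a hypothetical flattened view of the use chains, and the marked set stands in for the 0x100 flag):

```python
from collections import deque

def propagate_replacements(roots, users_of):
    """Sketch of the transitive def-use worklist described above: starting
    from the directly reduced registers, visit every dependent user exactly
    once, growing the worklist as new users are discovered."""
    marked = set(roots)            # flag 0x100 analogue
    work = deque(roots)
    while work:
        reg = work.popleft()
        for user in users_of.get(reg, ()):
            if user not in marked:
                marked.add(user)
                work.append(user)
    return marked
```

A chain r1 -> r2 -> r3 is fully reached from r1, while an unrelated chain r4 -> r5 is untouched.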
Data Flow
sub_C5FB30 (execute wrapper)
|
+-- sub_7DDB50: check function count > 1
|
+-- sub_752E40 (core logic)
|
+-- sub_7468B0 / vtable+152: check knob 487 (optimization enabled)
|
+-- Phase 1: Walk instruction list (*(ctx+272))
| +-- For opcode 137 (`SM73_FIRST`; IMAD-like) instructions:
| | +-- sub_9253C0: delete dead instructions
| | +-- sub_91BF30: allocate replacement registers
| |
| +-- Clear flag 0x100 on all basic blocks (*(ctx+104) chain)
| +-- Set flag 0x40 at ctx+1385
|
+-- sub_A13890: initialize use-def traversal context
| +-- Creates context object with vtable off_21DBEF8
| +-- Sets up iterator with vtable off_21B4FD0
|
+-- Phase 2: Walk instruction list again
| +-- For each source operand with use chain:
| | +-- sub_745A80: create replacement register
| | +-- Worklist propagation through use chain
| |
| +-- Convert trivial IMAD to MOV (opcode 130 / `0x82`, `HSET2`; MOV-like)
| +-- sub_9253C0: delete fully reduced instructions
|
+-- sub_7B52B0: optional post-reduction scheduling pass
| (called if any replacements were made, flag v76)
|
+-- sub_8E3A20: destroy use-def context
Instruction Representation
The pass operates on the Ori IR instruction format. Relevant fields:
| Offset | Field | Usage in this pass |
|---|---|---|
| +8 | next pointer | Instruction list traversal |
| +64 | source operand chain | Array of {use_chain_ptr, ...} per operand |
| +72 | opcode (DWORD) | Bits 0-11 = base opcode, bits 12-13 = modifier (masked with 0xCF) |
| +80 | operand count | Number of source operands |
| +84 | operand[0] | First source operand descriptor (bits 28-30 = type tag, bits 0-23 = register ID) |
| +92 | operand[1] | Second source operand |
| +100 | operand[2] | Third source operand (for IMAD: accumulator) |
Operand type tags (bits 28-30):
- 1 = register operand (index into SSA value table at *(context+88))
- 2, 3 = immediate operand
- 7 = special/predicate
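Decoding one 32-bit operand descriptor under this layout can be sketched as (Python; the dict key names are ours):

```python
def decode_operand(word):
    """Decode an Ori operand descriptor per the field layout above:
    bit 31 = definition flag, bits 28-30 = type tag,
    bits 0-23 = register ID (for type tag 1)."""
    return {
        "is_def": bool(word & 0x80000000),
        "type":   (word >> 28) & 7,
        "reg_id": word & 0xFFFFFF,
    }
```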
Register definition structure (from SSA value table):
| Offset | Field | Usage |
|---|---|---|
| +8 | register ID | Unique identifier |
| +16 | data type | Copied to replacement register |
| +20 | use count | Checked for single-use optimization |
| +28 | replacement ID | Set by sub_745A80 to point to strength-reduced version |
| +48 | flags (QWORD) | Bit 0x100 = "marked for strength reduction", bit 0x40 = volatile, bit 0x10/0x8 = scheduling hints |
| +56 | defining instruction | Pointer to the instruction that defines this register |
| +64 | register class | Type code (2/3 = integer, 4 = predicate, 7 = special, 9 = predicate) |
Peephole Strength Reduction Patterns
The MainPeepholeOptimizer (sub_83EF00, 29 KB, case 2 of the opcode switch) applies algebraic strength reduction patterns at the SASS instruction level. These run later in the pipeline than phase 21 and operate on concrete SASS opcodes rather than the pre-lowered intermediate form.
Pattern: SHR + SHL -> BFE (Bit-Field Extract)
Matcher: sub_81D7E0 (166 lines decompiled)
Emitter: sub_81DB30
Target opcodes: 290, 151, or 2 (various ALU forms) with operand size 11 or 12
Recognition:
- The instruction must have two register source operands (type tag 1), no modifier bits, no special flags
- Source operand 0's definition must be opcode 213 (SHL) or 214 (SHR)
- Source operand 1's definition must be the complementary shift (SHR if 0 was SHL, or vice versa)
- Both shift definitions must have immediate shift amounts (type tag 2 or 3)
- The shift amounts must sum to 32 (i.e.,
SHL(x, n)paired withSHR(x, 32-n)) - Both definitions must dominate the current instruction (
sub_1245740dominance check) - Loop depth heuristic: if the shift definitions are in a shallower loop than the current instruction (checked via block RPO depth at
*(block+156)), the transformation may be suppressed to avoid increasing register pressure
Transformation:
Before: t1 = SHL(x, n) ; opcode 213
t2 = SHR(x, 32-n) ; opcode 214
r = ALU(t1, t2) ; opcode 290/151/2
After: r = BFE(x, ...) ; opcode 210 (bit-field extract)
The emitter calls sub_9314F0 to create a BFE instruction (opcode 210) with the appropriate operands, then sub_9253C0 to delete the original ALU instruction.
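The recognition rules above can be restated as a standalone matcher. The following is a hedged Python sketch over a toy instruction form -- `Instr`, its field names, and the `(type_tag, def_instr)` operand encoding are illustrative, not the binary's actual layout, and the dominance and loop-depth checks are omitted:

```python
# Toy IR sketch of the SHR+SHL -> BFE recognition rules described above.
# Field names and operand encoding are illustrative, not the binary's layout.
SHL, SHR = 213, 214

class Instr:
    def __init__(self, opcode, operands, imm=None):
        self.opcode = opcode
        self.operands = operands   # list of (type_tag, def_instr) pairs
        self.imm = imm             # immediate shift amount, if any

def is_imm(tag):
    return tag in (2, 3)          # type tags 2/3 = immediate operand

def matches_bfe_pattern(alu):
    """True if ALU(t1, t2) combines SHL(x, n) with SHR(x, 32-n)."""
    if len(alu.operands) != 2:
        return False
    (tag0, d0), (tag1, d1) = alu.operands
    if tag0 != 1 or tag1 != 1:    # both sources must be registers
        return False
    # the two definitions must be complementary shifts
    if {d0.opcode, d1.opcode} != {SHL, SHR}:
        return False
    # both shift amounts must be immediates
    for d in (d0, d1):
        if not is_imm(d.operands[1][0]):
            return False
    return d0.imm + d1.imm == 32  # amounts must sum to 32

shl = Instr(SHL, [(1, None), (2, None)], imm=5)
shr = Instr(SHR, [(1, None), (2, None)], imm=27)
alu = Instr(290, [(1, shl), (1, shr)])
print(matches_bfe_pattern(alu))   # True
```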
Pattern: BFE Folding into ADD
Matcher: sub_81DBC0 (83 lines decompiled)
Emitter: sub_81DDD0
Target opcode: 2 (IADD) with operand size 11 or 12
Recognition:
- One source operand is defined by opcode 210 (BFE)
- The BFE has no modifier bits, no special flags on the last operand
- The BFE's immediate operand (shift amount) is 1-31
- The BFE has a single use (use count <= 1)
- Dominance check passes
Transformation:
Before: t = BFE(x, amount) ; opcode 210
r = IADD(t, y) ; opcode 2
After: r = LOP3/SHF(x, y, ...) ; opcode 102, combining shift+add
The emitter creates opcode 102 (a combined shift-and-add operation) with encoded shift amount (8 * amount | 0x60000002).
Integer Division Lowering
Integer division and modulo by non-constant values are lowered to multi-instruction sequences during instruction selection. This is not part of phase 21 but is the most visible strength reduction in ptxas output. The sequences use the classic Barrett reduction / Newton-Raphson reciprocal algorithm.
32-bit Division -- sub_1724A20
Size: 28,138 bytes decompiled (the largest function in the 0x1723000-0x17F8000 ISA description range)
Called from: sub_1727130 (template driver)
Instruction count: ~35 SASS instructions emitted
Algorithm (unsigned 32-bit a / b):
- Convert divisor to float: `I2F(b)` (opcode 0xD5)
- Compute approximate reciprocal via `MUFU.RCP` (opcode 0x3C)
- Convert back to integer: `F2I(1/b)` (opcode 0xD6)
- Refine via multiply-add: `IMAD(q, b, ...)` (opcode 0x6E)
- Correction step with conditional branch: `ISETP` + `BRA` (opcodes 0xC9, 0x5F)
- Final adjustment via `IADD` (opcode 0x02)
Key constants allocated via sub_91D160:
- 23 (float exponent bias for mantissa extraction)
- 255 (exponent mask)
- 127 (IEEE 754 single-precision bias)
- 254 (double-bias for overflow guard)
- 1, -1 (correction increments)
The temporary register pool uses indices 90-126 from a parameter array (a7[]), providing 37 dedicated scratch registers for the sequence.
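The shape of the sequence can be emulated in scalar code. The following is a hedged Python sketch of the same reciprocal-estimate-plus-correction scheme; Python doubles stand in for `MUFU.RCP`, so this mirrors the algorithm's structure, not the exact SASS rounding behavior:

```python
def udiv32_newton(a, b):
    """Unsigned 32-bit a // b via float reciprocal + correction, mirroring
    the I2F / MUFU.RCP / F2I / IMAD / ISETP shape described above.
    Not bit-exact with SASS: Python floats stand in for MUFU.RCP."""
    assert 0 < b < 2**32 and 0 <= a < 2**32
    r = 1.0 / float(b)            # I2F + MUFU.RCP approximation
    q = int(a * r)                # F2I of the scaled estimate
    rem = a - q * b               # IMAD: rem = a - q*b
    while rem >= b:               # ISETP + BRA correction (under-estimate)
        q += 1                    # IADD adjustment
        rem -= b
    while rem < 0:                # correction for float over-estimate
        q -= 1
        rem += b
    return q

print(udiv32_newton(4294967295, 3))   # 1431655765
```

Because the float estimate is off by at most a few units, the correction loops run only a handful of iterations; the SASS sequence achieves the same effect with a fixed number of predicated adjustment instructions instead of a loop.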
64-bit Division
Two variants handle 64-bit operands:
- `sub_1728930` (16,545 bytes): unsigned 64-bit division. Emits longer sequences with double-width multiply and carry propagation.
- `sub_1727AC0` (13,776 bytes): signed 64-bit division. Parallel structure with sign-handling logic.
Both are called from sub_1729B50.
Division by Constant
Division by compile-time constant is handled separately during constant folding (in the GeneralOptimize bundle passes). The classic magic-number multiplication technique (Granlund-Montgomery) converts x / C into MULHI(x, magic) >> shift, avoiding the Newton-Raphson sequence entirely. This produces 2-3 instructions instead of ~35.
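The magic-number search itself is short. The following is a hedged sketch of the classic Granlund-Montgomery derivation (`magic_u32` is an illustrative name, not a ptxas symbol): it finds `(m, p)` such that `x // d == (x * m) >> p` for all 32-bit unsigned `x`, using the exactness bound that the worst-case accumulated error must stay below `2^p`:

```python
def magic_u32(d, N=32):
    """Find (m, p) with x // d == (x * m) >> p for all 0 <= x < 2^N.
    m = ceil(2^p / d); writing x = q*d + r, the shift is exact when
    q_max*e + (d-1)*m < 2^p, where e = m*d - 2^p is the rounding error."""
    assert d > 1
    for p in range(N, 2 * N + 1):
        m = -(-(1 << p) // d)            # ceil(2^p / d)
        e = m * d - (1 << p)             # 0 <= e < d
        q_max = ((1 << N) - 1) // d      # largest possible quotient
        if q_max * e + (d - 1) * m < (1 << p):
            return m, p
    raise ValueError("no magic found")

m, p = magic_u32(7)
for x in (0, 1, 6, 7, 100, 2**32 - 1):
    assert (x * m) >> p == x // 7
```

Note that for some divisors (7 included) the magic constant exceeds 32 bits, which is why real codegen also has an add-and-shift fixup variant; this sketch ignores that refinement and simply allows a wider multiplier.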
SASS Cost Model
The profitability of strength reduction on NVIDIA GPUs differs from CPUs in several important ways:
Integer multiply is cheap. Modern NVIDIA GPUs (sm_70+) have dedicated integer multiply-add (IMAD) functional units. IMAD has the same throughput as IADD on most architectures -- both are single-cycle operations on the integer ALU. This means the classical "replace multiply with shift+add" transformation is often not profitable on GPU. ptxas does not aggressively replace multiplies with shift chains the way CPU compilers do.
Integer division is expensive. There is no hardware integer divider. Division must be lowered to the ~35-instruction Newton-Raphson sequence described above. This is why division-by-constant is a high-priority optimization -- replacing 35 instructions with 2-3 is a massive win.
Shift operations. SHL and SHR are single-cycle on the integer ALU, same throughput as IADD and IMAD. However, they use a different functional unit slot on some architectures, which can matter for scheduling.
BFE (bit-field extract) is a dedicated single-cycle instruction. Recognizing SHR+SHL pairs and folding them to BFE saves an instruction and a register, which is the primary motivation for the peephole patterns.
Register pressure dominates. On GPUs, the primary cost metric is not instruction count but register pressure, because register count directly determines occupancy (the number of concurrent warps). The strength reduction pass checks loop depth before transformations and suppresses replacements that would increase register pressure in inner loops (the RPO depth comparison in sub_81D7E0).
This explains why phase 21's core logic is relatively compact (~1.2 KB binary) compared to CPU compilers' strength reduction passes: the GPU cost model makes fewer algebraic replacements profitable, so the pass focuses narrowly on use-def chain simplification and trivial multiply elimination rather than elaborate pattern tables.
Pipeline Context
Phase 21 runs after:
- Phase 18 (`OriLoopSimplification`) -- loops are canonicalized with single entry, single back-edge, and preheaders
- Phase 19 (`OriSplitLiveRanges`) -- live ranges are split at loop boundaries
- Phase 20 (`PerformPGO`) -- profile data is applied (block weights inform the cost model)
Phase 21 runs before:
- Phase 22 (`OriLoopUnrolling`) -- simplified induction variables enable better unroll decisions
- Phase 24 (`OriPipelining`) -- strength-reduced loops are more amenable to software pipelining
- Phase 29 (`GeneralOptimize`) -- compound pass cleans up any dead code left by strength reduction
The GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) also perform algebraic simplification that overlaps with strength reduction -- specifically constant folding of multiply-by-power-of-2 to shifts. Phase 21 handles the more complex cases that require use-def chain analysis, while GeneralOptimize handles local, single-instruction rewrites.
Function Map
| Address | Size | Function | Role |
|---|---|---|---|
| sub_C5FB30 | 9 bytes | OriStrengthReduce::execute | Vtable entry, gates on function count |
| sub_C5F3C0 | 16 bytes | OriStrengthReduce::getName | Returns phase index 21 |
| sub_C5F3D0 | 16 bytes | OriStrengthReduce::isNoOp | Returns 0 (never skipped) |
| sub_752E40 | ~1.2 KB | Core strength reduction | Use-def chain walk, replacement |
| sub_745A80 | 168 bytes | Replacement register creator | Allocates new register with copied type/flags |
| sub_91BF30 | ~400 bytes | Virtual register allocator | Creates 160-byte register descriptor |
| sub_9253C0 | 325 bytes | Instruction deleter | Unlinks and removes instruction (634 callers) |
| sub_A13890 | ~2 KB | Use-def context initializer | Sets up chain traversal structures |
| sub_81D7E0 | ~660 bytes | SHR+SHL->BFE matcher | Peephole pattern recognizer |
| sub_81DB30 | ~112 bytes | SHR+SHL->BFE emitter | Emits BFE (opcode 210) |
| sub_81DBC0 | ~330 bytes | BFE+ADD matcher | Peephole pattern recognizer |
| sub_81DDD0 | ~100 bytes | BFE+ADD emitter | Emits combined shift-add (opcode 102) |
| sub_1724A20 | 28,138 bytes | 32-bit div/mod template | Newton-Raphson integer division |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div template | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div template | Signed variant |
| sub_1727130 | ~2 KB | Division template driver | Allocates temps, dispatches to templates |
Cross-References
- Pass Inventory & Ordering -- complete 159-phase table showing phase 21's position
- GeneralOptimize Bundles -- algebraic simplification sub-passes that complement strength reduction
- Loop Passes -- loop canonicalization (phase 18) that enables induction variable analysis
- Ori IR Overview -- instruction format, opcode encoding (ROT13), register model
- Peephole Optimization -- `MainPeepholeOptimizer` containing SHR+SHL->BFE patterns
- Newton-Raphson Templates -- detailed analysis of division lowering sequences
- Scheduling -- `sub_7B52B0` scheduling pass called after strength reduction
- Knobs System -- knob 487 controlling optimization enablement
Copy Propagation & CSE
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Copy propagation and common subexpression elimination in ptxas are spread across four dedicated pipeline phases (49, 50, 64, 83) plus a forward copy propagation sub-pass (OriCopyProp) embedded inside every GeneralOptimize bundle. Together these passes form the value-redundancy elimination subsystem: they detect computations that produce values already available elsewhere in the program, then eliminate the redundant instructions or replace them with cheaper copies.
The four dedicated phases run at specific pipeline positions chosen to exploit opportunities created by preceding transformations. GvnCse (phase 49) runs after mid-level expansion and argument enforcement when the IR is maximally normalized. OriReassociateAndCommon (phase 50) immediately follows GvnCse to catch near-misses through algebraic normalization. LateOriCommoning (phase 64) runs after predication (phase 63) converts branches into predicated instructions, exposing new redundancies. OriBackCopyPropagate (phase 83) runs late in the pipeline to shorten MOV chains before register allocation.
| Phases covered | 49 (GvnCse), 50 (OriReassociateAndCommon), 64 (LateOriCommoning), 83 (OriBackCopyPropagate) |
| Forward copy prop | OriCopyProp sub-pass inside each GeneralOptimize bundle (phases 13, 29, 37, 46, 58, 65) |
| Related knobs | 22 knobs controlling budgets, modes, and enable/disable flags |
| Pipeline position | Mid-optimization (49--50), post-predication (64), pre-regalloc legalization (83) |
| Prerequisite passes | AnalyzeControlFlow (3), GeneralOptimizeMid2 (46), EnforceArgumentRestrictions (48) |
| Downstream consumers | ExtractShaderConstsFinal (51), OriDoPredication (63), register allocation (101) |
Phase Summary Table
| Phase | Name | Vtable | execute | getName | isNoOp | Default |
|---|---|---|---|---|---|---|
| 49 | GvnCse | off_22BDD70 | 0xC5F000 (thunk) | 0xC5F010 (ret 49) | 0xC5F020 (ret 0) | Enabled |
| 50 | OriReassociateAndCommon | off_22BDD98 | sub_C604D0 | 0xC5EFE0 (ret 50) | 0xC5EFF0 (ret 0) | Enabled |
| 64 | LateOriCommoning | off_22BDFC8 | sub_C60020 | 0xC5EDF0 (ret 64) | 0xC5EE00 (ret 0) | Enabled |
| 83 | OriBackCopyPropagate | off_22BE2C0 | sub_C5EB80 | 0xC5EB90 (ret 83) | 0xC5EBA0 (ret 1) | Disabled |
Phase name strings (from static name table at off_22BD0C0, verified in ptxas_strings.json):
| Phase | String Address | Name Table Ref |
|---|---|---|
| 49 | 0x22BC80C | 0x22BD280 |
| 50 | 0x22BC813 | 0x22BD290 |
| 64 | 0x22BC949 | 0x22BD310 |
| 83 | 0x22BCAE5 | 0x22BD3C8 |
All four vtables are laid out at uniform 0x28-byte (40-byte) spacing in .data.rel.ro, matching the 5-pointer-per-vtable pattern used by all 159 phases. The factory switch at sub_C60D30 allocates each phase as a 16-byte object and installs the corresponding vtable pointer.
Phase 83 is disabled by default (isNoOp returns 1). It is activated through the AdvancedPhaseBackPropVReg gate (phase 82), which architecture-specific backends override to enable backward copy propagation for their target.
Phase 49 -- GvnCse (Global Value Numbering + CSE)
Overview
GvnCse combines global value numbering (GVN) with common subexpression elimination (CSE) in a single pass. GVN assigns a canonical "value number" to every expression in the program such that two expressions with the same value number are guaranteed to compute the same result. CSE then uses these value numbers to detect and eliminate redundant computations.
The pass is gated by the EnableGvnCse knob (address 0x21BDA50). When disabled, the pass is skipped entirely.
Dispatch Mechanism
The execute function at 0xC5F000 is a 16-byte thunk:
mov rdi, [rsi+0x630] ; rdi = compilation_context->sm_backend
mov rax, [rdi] ; rax = sm_backend->vtable
jmp [rax+0xB8] ; tail-call vtable[23] -- the actual GVN-CSE implementation
The real implementation lives in the compilation context's SM backend object (at context+0x630 / +1584), dispatched through its vtable at offset 0xB8 (slot 23). This indirection means the GVN-CSE algorithm can be overridden by architecture-specific backends that provide a different SM backend vtable. (This object was previously called "optimizer_state" on this page, but it is the same polymorphic SM backend used for legalization, scheduling, and all other architecture-dependent dispatch -- see data-structures.md.)
Algorithm (Reconstructed)
The ptxas GVN-CSE operates on the Ori IR basic block list with dominator-tree-guided traversal:
procedure GvnCse(function F):
build dominator tree DT for F
initialize value_table: hash_map<expression_key, value_number>
vn_counter = 0
for each block B in RPO(DT):
for each instruction I in B:
key = canonicalize(I.opcode, I.type, [lookup_vn(op) for op in I.operands])
if key in value_table:
existing_vn = value_table[key]
replace all uses of I.dest with representative(existing_vn)
mark I as dead
else:
value_table[key] = ++vn_counter
set_representative(vn_counter, I.dest)
run dead code elimination to remove marked instructions
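The reconstruction above can be exercised on a toy straight-line IR. The following is a minimal sketch (single block, so the dominator-tree scoping degenerates; the `(dest, opcode, operands)` tuple form and the replacement map are illustrative, not the binary's representation):

```python
COMMUTATIVE = {"add", "mul", "and", "or", "xor", "min", "max"}

def gvn_cse(instrs):
    """instrs: list of (dest, opcode, operands) in program order.
    Returns surviving instructions plus a map dest -> representative."""
    value_table = {}   # canonical expression key -> value number
    vn_of = {}         # register -> value number
    rep_of = {}        # value number -> representative register
    counter = 0
    replaced = {}
    out = []
    for dest, op, operands in instrs:
        # resolve operands to value numbers (fresh VN for unseen inputs)
        vns = []
        for o in operands:
            if o not in vn_of:
                counter += 1
                vn_of[o] = counter
                rep_of[counter] = o
            vns.append(vn_of[o])
        if op in COMMUTATIVE:
            vns.sort()               # a+b and b+a get the same key
        key = (op, tuple(vns))
        if key in value_table:       # redundant: reuse representative
            vn = value_table[key]
            vn_of[dest] = vn
            replaced[dest] = rep_of[vn]
        else:                        # new value: record representative
            counter += 1
            value_table[key] = counter
            vn_of[dest] = counter
            rep_of[counter] = dest
            out.append((dest, op, operands))
    return out, replaced

prog = [("t1", "add", ("a", "b")),
        ("t2", "add", ("b", "a")),   # commutes to the same key as t1
        ("t3", "mul", ("t1", "c")),
        ("t4", "mul", ("t2", "c"))]  # t2 == t1, so t4 == t3
kept, repl = gvn_cse(prog)
print(repl)   # {'t2': 't1', 't4': 't3'}
```

The second-order elimination of `t4` falls out for free: once `t2` is assigned `t1`'s value number, `t4`'s key collides with `t3`'s.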
Key design decisions visible from the binary:
- Hash-based value table. The value numbering table uses FNV-1a hashing (seed `0x811C9DC5`, prime `16777619`/`0x01000193`), the same hash primitive used throughout ptxas for instruction fingerprinting, code caching, and scheduling table lookups. The hash function incorporates the opcode, type, and recursively resolved value numbers of all operands. Hash table entries are 24 bytes each: `[next_ptr (8B), key (8B), value/metadata (8B)]` with chained collision resolution.
- Dominator-tree scoping. Values defined in block B are only visible to blocks dominated by B. When the walk exits a dominator subtree, value table entries scoped to that subtree are removed. This prevents CSE from moving computations to positions where they would not dominate all uses. Dominance is checked via `sub_1245740`, which performs a single-bit test against a per-block dominator bitvector: the dominator set at block descriptor offset `+176` is indexed by the dominator block's ID from offset `+144`. The check is O(1).
- Commutativity normalization. For commutative operations (ADD, MUL, AND, OR, XOR, MIN, MAX), operands are sorted by value number before hashing. This ensures `a + b` and `b + a` get the same value number without requiring a separate reassociation pass.
- Address space awareness. Memory operations in different address spaces (shared, global, local, constant) are never considered equivalent even if they have identical operands. The address space qualifier is encoded in the instruction opcode or modifier bits (not the operand), so the opcode comparison in the structural equivalence check inherently preserves this distinction.
- Predicate handling. Predicated instructions (`@P0 IADD R1, R2, R3`) hash the predicate register's value number as an additional operand. Two identical computations under different predicates are distinct values.
- Predicate-operand compatibility (`sub_7E7380`). After opcode and type matching in the caller, `sub_7E7380` performs a focused predicate-operand compatibility check (30 lines, 150 bytes). The function tests: (a) predicate modifier parity -- `instr+73` bit 4 versus `instr+72` bit 12 (`0x1000`); if one instruction has a predicate modifier and the other does not, they are incompatible; (b) last operand 24-bit value ID -- `(instr + 84 + 8*(operand_count-1)) & 0xFFFFFF` must match; (c) second-to-last operand 8-byte encoding -- the two dwords immediately before the last operand slot must be identical. The broader structural comparison (opcodes masked with `& 0xFFFFCFFF`, data types at `+76`, operand counts at `+80`, full per-operand encoding, register class at `+64`) is performed by each of the 21 callers of `sub_7E7380`, not by the function itself. Instructions with volatile flags (bit `0x20` at register descriptor offset `+48`) and barrier-type registers (type 9) are excluded from CSE by the callers' pre-checks.
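The FNV-1a primitive with the recovered constants is easy to reproduce. The following is a hedged sketch of how an expression key might be folded into a 32-bit hash; the exact field order and byte layout used by ptxas is not recovered, so the `expr_key_hash` packing below is illustrative and only demonstrates the stated seed/prime and the commutative sort:

```python
FNV_SEED  = 0x811C9DC5
FNV_PRIME = 0x01000193   # 16777619

def fnv1a(data, h=FNV_SEED):
    """32-bit FNV-1a over a byte string."""
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

def expr_key_hash(opcode, dtype, operand_vns, commutative=False):
    """Fold opcode, type, and operand value numbers into one hash.
    Byte layout is illustrative -- the real packing is not recovered."""
    # Sort operand VNs for commutative ops, per the normalization above
    vns = sorted(operand_vns) if commutative else list(operand_vns)
    data = opcode.to_bytes(2, "little") + dtype.to_bytes(1, "little")
    for vn in vns:
        data += vn.to_bytes(4, "little")
    return fnv1a(data)

# a + b and b + a hash identically under the commutative sort
assert expr_key_hash(2, 11, [7, 3], commutative=True) == \
       expr_key_hash(2, 11, [3, 7], commutative=True)
```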
GVN Algorithm Details (Binary Trace)
The GVN-CSE body was located by reading SM backend vtable slot 23 (offset +0xB8) from all seven SM backend vtables in the ptxas binary. The actual function pointer varies by SM generation:
| SM Backend | Vtable | Slot 23 Function | Behavior |
|---|---|---|---|
| SM30 (Kepler) | off_2029DD0 | sub_661250 | Returns 0 -- NO-OP |
| SM50 (Maxwell) | off_21B4A50 | sub_661250 | Returns 0 -- NO-OP |
| SM60 (Pascal) | off_22B2A58 | sub_BEE590 | Real GVN-CSE |
| SM70 (Volta) | off_21D82B0 | sub_BEE590 | Real GVN-CSE |
| SM80 (Ampere) | off_21B2D30 | sub_661250 | Returns 0 -- NO-OP |
| SM89 (Ada) | off_21C0C68 | sub_661250 | Returns 0 -- NO-OP |
| SM90+ (Hopper) | off_21D6860 | sub_BEE590 | Real GVN-CSE |
GVN-CSE (phase 49) is a no-op on Kepler, Maxwell, Ampere, and Ada. It only executes on Pascal, Volta, and Hopper/Blackwell. SM80/SM89 backends rely on LateOriCommoning (phase 64) and the GeneralOptimize sub-passes for CSE coverage instead. This is a deliberate per-generation decision embedded in each SM backend's vtable.
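The per-generation dispatch pattern reduces to a vtable slot whose function pointer differs per backend. The following is a minimal Python analogue of that structure (class and function names are illustrative; the real dispatch is a C++ virtual call through slot 23 of the SM backend vtable):

```python
class SmBackend:
    def gvn_cse(self, func):          # analogue of vtable slot 23
        raise NotImplementedError

class NoOpGvn(SmBackend):             # sub_661250: Kepler/Maxwell/Ampere/Ada
    def gvn_cse(self, func):
        return 0                      # returns 0, does nothing

class RealGvn(SmBackend):             # sub_BEE590: Pascal/Volta/Hopper
    def gvn_cse(self, func):
        return run_gvn(func)          # would then dispatch on knob 402 mode

def run_gvn(func):                    # stand-in for the real GVN body
    return f"gvn({func})"

BACKENDS = {"sm_70": RealGvn(),       # per the slot-23 table above
            "sm_80": NoOpGvn(),       # Ampere opts out of phase 49
            "sm_89": NoOpGvn(),
            "sm_90": RealGvn()}

def phase49_execute(sm, func):
    # GvnCse::execute thunk: load the backend, tail-call its slot
    return BACKENDS[sm].gvn_cse(func)

print(phase49_execute("sm_80", "kernel"))   # 0 -- no-op on Ampere
print(phase49_execute("sm_90", "kernel"))   # gvn(kernel)
```

The design consequence is that disabling the pass for a generation costs nothing at phase-dispatch time: the phase object is identical everywhere, and only the pointer installed in the backend vtable changes.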
Call Chain
GvnCse::execute (0xC5F000)
-> sm_backend->vtable[23] (indirect dispatch)
-> sub_BEE590 (GVN entry, SM60/70/90)
-> sub_781F80(ctx, 0) (rebuild def chains, mode=full)
-> sub_BEE370 (mode dispatcher)
-- queries knob 402 via knob_container->vtable[9] --
mode 0: disabled, return
mode 1: sub_BEA450 (simple GVN)
mode 2: sub_BEAD00 (standard dominator-guided GVN)
mode 3-6: sub_BED7E0 (full GVN with extended block scope)
Mode Dispatcher (sub_BEE370)
The mode is determined by knob 402 (EnableGvnCseMode), queried through two vtable calls on the knob container at context+1664:
- Boolean query -- `knob_container->vtable[9](402)` (offset +72): checks if the knob is set at all. The dispatcher has a fast-path optimization: when `vtable[9]` is `sub_6614A0` (the standard implementation), it reads directly from `knob_container+72+28944` instead of dispatching through the vtable.
- Integer query -- `knob_container->vtable[15](402)` (offset +120): reads the mode value as an integer. Similarly fast-pathed when `vtable[15]` is `sub_661470`.
If both queries return truthy, the integer value selects the GVN variant:
| Mode | Function | Description |
|---|---|---|
| 0 | (none) | Pass disabled, return immediately |
| 1 | sub_BEA450 | Simple single-block GVN (111 lines, ~2KB) |
| 2 | sub_BEAD00 | Standard dominator-guided GVN (157 lines, ~2.5KB) |
| 3 | sub_BED7E0 | Full GVN (when sm_backend+1106 bit 6 AND context+1416 bit 0) |
| 4 | sub_BED7E0 | Full GVN (remapped to mode 2 if bit 6 is clear) |
| 5-6 | sub_BED7E0 | Full GVN with extended block scope |
| >6 | (none) | Return immediately (no operation) |
Additional flags modulate the mode selection:
- SM backend flag at `sm_backend+1106` bit 6 (`0x40`): when set, enables modes 5-6 (enhanced scope). When clear and mode is 4, the dispatcher remaps to mode 2.
- Context flag at `context+1416` bit 0: when set (and bit 6 is set), selects mode 3 over modes 5-6.
- SM version threshold `sm_backend+372 <= 0x7FFF` (32767): gates the EBB pre-pass `sub_BED430` via knob 210.
Before the standard GVN (sub_BEAD00), the mode dispatcher may invoke sub_BED430 -- an extended basic block (EBB) pre-pass that identifies and marks multi-block CSE opportunities within single-entry regions. The EBB pre-pass is called unless: (a) SM version > 0x7FFF, AND (b) knob 210 is set or context+1368 bit 0 is clear.
Simple GVN (sub_BEA450, Mode 1)
Mode 1 provides the lightest GVN variant -- single-scope CSE without cross-dominator lookup. Reconstructed pseudocode:
procedure SimpleGvn(gvn_state S):
context = S.context
first_reg = operand_24bit(first_instr(context+272))
value_record = context.reg_table[first_reg] // context+296
if not value_record: return
for each value_record in linked order:
if knob_query(257, value_record): break // per-instruction gate
first_instr = value_record.head // value_record[0]
sentinel = value_record.sentinel // value_record[1]
eligible = false
for each instr from first_instr to sentinel:
if not eligible:
eligible = check_eligibility(instr) // sub_BEA1E0
if eligible: advance and check sentinel
if instr.opcode_masked == 145: // barrier
if sm_backend->vtable[371](instr): // safe to CSE
mark eligible
else: break scope
if eligible:
// Directly generate MOV replacement -- no dominator check
context+232 = value_record.head
context+264 = value_record.head->field_20
sub_9314F0(context, 0x124, 1, 0, 0) // insert MOV 292
advance to next block via opcode 97 (block header) -> field +24
This variant does not examine the immediate-dominator chain at instruction+148. It only replaces redundancies that are visible within the current value record's instruction list (effectively single-block scope).
Standard GVN (sub_BEAD00, Mode 2)
Mode 2 extends the simple GVN with cross-dominator CSE. After finding an eligible instruction and reaching the end of a block, it follows the immediate-dominator chain:
procedure StandardGvn(gvn_state S, char cross_block_flag):
// ... (same entry and block walk as SimpleGvn) ...
// After eligibility walk reaches sentinel:
idom = instr.field_148 // immediate dominator index
if idom != 0:
dom_record = context.reg_table[context.idom_map[idom]] // context+296[context+512[4*idom]]
if dom_record and (not cross_block_flag or dom_record.opcode != 1):
if not dominance_check(S, value_record): // sub_BEA3B0
leader = dom_record.head
if leader.next.opcode != 292: // not already a MOV
context+232 = leader
context+264 = leader.field_20
sub_9314F0(context, 0x124, 1, 0, 0) // insert MOV
// Fallback: if idom chain is empty, try block-level CSE
block_desc = context.block_table[instr.field_164] // context+368
if block_desc+280 bit 0 is clear:
leader = reg_table[operand_24bit(block_desc.first_instr)]
if leader.next.opcode != 292:
generate MOV replacement
The cross_block_flag parameter (passed from the mode dispatcher) controls whether the standard GVN allows replacement when the dominator has opcode == 1 (a block-header sentinel). When set, it skips such cases to avoid unsafe cross-block hoisting.
Dominance Check with Cache (sub_BEA3B0)
The dominance check is guarded by context+1377 bit 5 (0x20). When this flag is clear, the function returns 0 immediately (no dominance, meaning "safe to CSE" -- the caller inverts the result).
When the flag is set, the function implements a single-entry global cache to accelerate repeated dominator queries:
procedure DominanceCheck(gvn_state S, value_record vr):
if not (context+1377 & 0x20): return 0 // no extended scope
idom = vr.field_148
if idom == 0: return 1 // no dominator -> can't CSE
dom_record = reg_table[idom_map[idom]]
if dom_record == NULL: return 1
// Check global cache (single-entry, TLS-safe through static storage)
if dom_record == cached_key: // qword_2A12A08
return cached_result ^ 1 // byte_2A129FE[0] ^ 1
// Cache miss: compute dominator ordering via sub_74D720
if idom >= 0 and vr.field_152 >= 0:
cached_key = dom_record
sub_74D720(context, idom, vr.field_152, &cached_result)
return cached_result ^ 1
else:
return 1 // negative index -> can't CSE
The cache stores a single (key, result) pair in global statics qword_2A12A08 and byte_2A129FE. This is effective because the GVN walk processes instructions within a block sequentially, and many consecutive instructions share the same dominator. The cache hit rate is high for blocks dominated by a single predecessor.
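The single-entry cache is the simplest possible memoization. The following sketch of the pattern (names are illustrative; `expensive_dom_order` stands in for `sub_74D720`) shows why it pays off when consecutive queries share a key:

```python
class OneEntryCache:
    """Single (key, result) memo, mirroring the qword_2A12A08 /
    byte_2A129FE global pair used by sub_BEA3B0."""
    def __init__(self, compute):
        self.compute = compute
        self.key = None
        self.result = None
        self.hits = 0
        self.misses = 0

    def query(self, key):
        if self.key is not None and key == self.key:
            self.hits += 1                 # cache hit: no recomputation
            return self.result
        self.misses += 1                   # miss: recompute and overwrite
        self.key = key
        self.result = self.compute(key)
        return self.result

def expensive_dom_order(dom):
    return dom % 2                         # stand-in for sub_74D720

cache = OneEntryCache(expensive_dom_order)

# A GVN-style walk: long runs of instructions share one dominator
for q in [5] * 40 + [9] * 60:
    cache.query(q)
print(cache.hits, cache.misses)            # 98 2
```

With only two distinct dominators appearing in runs, one cached entry absorbs 98 of 100 queries -- the access pattern, not the cache size, is what makes this effective.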
EBB Pre-Pass (sub_BED430)
The Extended Basic Block (EBB) pre-pass runs before mode 2 GVN when the SM version and knob conditions are met. It identifies cross-block CSE opportunities within single-entry CFG regions.
procedure EbbPrePass(gvn_state S):
// Phase 1: Clear previous markings
for each block B in linked order:
B.field_264 = 0 // clear EBB marking
// Phase 2: Find first CSE-eligible instruction
for each instr in instruction list:
if check_eligibility(instr) and instr.opcode != 211:
break // found seed
if not found: return
// Phase 3: Build dominator tree and compute block ordering
sub_7846D0(context) // dominator tree + RPO
sub_A12EA0(context, walker_context, visitor) // dominator tree walk
sub_775010(context) // predecessor setup
sub_773140(context, 0) // successor setup
sub_770E60(context, 0) // block ordering
// Phase 4: Mark CSE candidates on every instruction
for each instr in instruction list:
if check_eligibility(instr) and instr.opcode != 211:
instr.field_48 = 1 // mark as CSE candidate
else:
instr.field_48 = 0
// Phase 5: Propagate eligibility through operand chains
sub_BED0A0(walker_state) // fixed-point propagation
// Phase 6: Evaluate cross-block candidates
for each value_record in RPO order:
if knob_query(257, vr): continue // per-instruction gate
idom = vr.field_148
if idom != 0:
dom_record = resolve_idom(context, idom)
if dom_record and dom_record.field_264 == 0:
dom_record.field_264 = sub_BEA000(walker, dom_record, 0) ? 2 : 1
The EBB propagation engine (sub_BED0A0) is a fixed-point iteration that propagates CSE eligibility backward through operand use-chains. For each instruction with field_48 bit 0 set, it follows the operand-to-instruction back-references at context+88 to mark defining instructions as eligible too. The iteration continues until no more changes occur. This ensures that an entire expression tree is marked eligible when any of its consumers is eligible.
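Backward fixed-point propagation over use-def chains can be sketched directly. The following is a hedged Python model in which the operand back-references become an explicit `defs` map and `field_48` becomes membership in an `eligible` set (both illustrative):

```python
def propagate_eligibility(instrs, defs, seed_eligible):
    """instrs: instruction ids in program order; defs: id -> list of
    defining instruction ids (operand back-references); seed_eligible:
    ids whose field_48 bit is initially set. Iterates to a fixed point,
    marking every definition reachable backward from an eligible use."""
    eligible = set(seed_eligible)
    changed = True
    while changed:                      # fixed-point iteration
        changed = False
        for i in instrs:
            if i in eligible:
                for d in defs.get(i, ()):
                    if d not in eligible:
                        eligible.add(d)
                        changed = True
    return eligible

# i4 consumes i2 and i3; i2 consumes i1. Seeding i4 marks the whole tree.
defs = {"i4": ["i2", "i3"], "i2": ["i1"]}
print(sorted(propagate_eligibility(["i1", "i2", "i3", "i4"], defs, {"i4"})))
# ['i1', 'i2', 'i3', 'i4']
```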
Full GVN Body (sub_BED7E0, 689 lines, ~18KB binary)
This is the most complete GVN variant (modes 3-6). Reconstructed pseudocode:
procedure FullGvnCse(gvn_state S):
context = S.context
mode_flags = S.mode & ~0x02 // strip bit 1
extended_scope = (mode_flags - 5) // >1 enables cross-block scope
// Phase 1: Initialization
block_count = context.block_count // at +376
visited[] = allocate_zeroed(block_count + 1) // one byte per block
build_dominator_tree(context) // sub_7846D0
scope_tree = new_scoped_tree() // sub_661750
rpo = context.rpo_ordering // at +792
// Phase 2: RPO block walk
for i = 0 to rpo.count - 1:
block_idx = rpo.indices[i]
block = context.block_table[block_idx] // context+368
if block.head == NULL or block.flags & SKIP:
continue
first_instr = lookup_first_instruction(block)
dominator_candidate = NULL
has_redundant = false
for each instr in block:
// Per-instruction knob gate (knob 257)
if knob_query(257, instr):
break to next block boundary
eligible = check_eligibility(instr) // sub_BEA1E0
if eligible:
visited[block_idx] = not block.visited_flag
elif opcode_masked(instr) in {32, 159}: // branch, return
propagate visited flag from predicate operand
elif opcode_masked(instr) == 145: // barrier/sync
safe = sm_backend->vtable[371](instr)
if safe: mark_as_candidate
// Check dominator for existing equivalent
idom_ref = instr.idom_ref // at +148
if idom_ref != 0:
dom_block = resolve_idom(context, idom_ref)
if dom_block dominates current position:
leader = dom_block.first_instr
if leader.opcode != 292: // not already a MOV
replace_with_mov(context, leader, 0x124)
record_block_in_scope_tree(scope_tree, block_idx)
// Phase 3: Post-processing dominated blocks
for each (node, bit_pos) in scope_tree.bit_iterate():
block_idx = bit_pos | (node.data << 6)
block_record = reg_table[block_idx]
cse_dominated_block(S, block_record) // sub_BEA5F0
// Phase 4: Cleanup
flush_deferred_instructions(scope_tree)
destroy_scoped_tree(scope_tree)
Key observations from the binary:
- Block walk order is RPO. The outer loop reads `context+792` -- a struct containing `{int count; int indices[]}` -- and iterates in that order. The RPO array is pre-computed by `sub_7846D0`, which also builds the dominator tree.
- The value table is a register-indexed array, not a hash map. Values are stored in `context+296` (an array of pointers indexed by the 24-bit register/value identifier from the operand encoding at `instruction+84`). This gives O(1) lookup by register ID. The dominator tree is used for scoping, not a stack-based hash table.
- Dominator scoping uses a balanced binary tree with bitset nodes. Each tree node stores a 64-bit bitset of block indices, traversed with `tzcnt` for efficient iteration. The block index is recovered as `bit_position | (node_data << 6)`, supporting up to 64 * depth blocks.
- Replacement is MOV insertion. When a redundant instruction is found, the pass calls `sub_9314F0(context, 0x124, 1, 0, 0)` to generate a replacement MOV instruction (opcode `0x124` = 292 decimal). The original computation is recorded at `context+232` (source) and `context+264` (metadata) before the MOV is generated.
- Barrier instructions (opcode 145) have a dedicated safety check via `sm_backend->vtable[371]` (offset `+2968`), an architecture-specific predicate that determines whether an instruction can be CSE'd across a barrier boundary.
Instruction Eligibility (sub_BEA1E0)
| Opcode (masked) | Category | Condition |
|---|---|---|
| 16 | Register copy / PHI | Always, unless last operand bit 1 set |
| 183 | Memory load/compute | Bit 5 of last operand, or sub_91BC40 safety check |
| 119 | GPU special | SM flag +1106 bit 6 required; operand bit 1 |
| 186 | GPU special | SM flag +1106 bit 6 required; operand bit 0 |
| 211 | GPU special | SM flag +1106 bit 6 required; operand bit 2 |
| 283 | GPU special | SM flag +1106 bit 6 required; operand bit 3 |
| 122 | Conditional | Type 2-3: always; type 7-8: bit 7 set |
| 310 | Specialized | (flags & 0xF) == 2 and (flags & 0x30) != 0x30 |
| 145 | Barrier/sync | Separate sm_backend->vtable[371] check |
| all others | -- | Not eligible |
Opcodes 119, 186, 211, 283 are only CSE-eligible when sm_backend+1106 bit 6 (0x40) is set. This bit appears to be an architecture-specific capability flag enabling extended CSE for certain GPU-specific instruction classes.
Per-Dominated-Block CSE (sub_BEA5F0)
After the RPO walk populates the scope tree, sub_BEA5F0 processes each dominated block:
- SM version gate: if `sm_backend+372 <= 28671` (SM70 or earlier), enables a special operand canonicalization path for commutative operations
- Instruction walk: iterates via the `block+128` child pointer chain
- Dominator ordering: compares `instruction+144` (dominator number) to test dominance
- Commutative canonicalization (opcode 95): calls `sm_backend->vtable[79]` (offset +632) to sort operands by value number. Rewrites operand encoding with flags `0x60000000` and `0x40000000` to mark canonicalized operands
- Replacement: calls `sub_931920` to insert copy instructions when a dominating equivalent is found
Scope Tree Bit-Iteration Detail
The scope tree post-processing (lines 498-664 of sub_BED7E0) uses a binary tree where each node contains a 4-word (32-byte) bitset region starting at node+32. The iteration:
- Start at the leftmost node: follow `node->left` until NULL
- Scan the 4-word bitset region (`node+32` through `node+64`), finding each set bit via `tzcnt` (x86 trailing-zero count)
- Recover the block index: `bit_position | ((word_offset_in_node | (node.field_24 << 2)) << 6)`
- After processing a bit, mask it out: `word &= ~(0xFFFFFFFFFFFFFFFF >> (64 - (bit_pos + 1)))`
- When the current word is exhausted, advance to the next word in the 4-word region
- When all 4 words are exhausted, follow parent/right-child links to traverse the tree in order
Each block index recovered from the tree triggers a call to sub_BEA5F0 for per-dominated-block CSE. The tree structure allows the scope walk to skip large ranges of blocks that have no CSE candidates, making it efficient for sparse CFGs.
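The tzcnt bitset walk over a single node is straightforward to model. The following sketch recovers block indices from a node's 4-word bitset the same way; Python's `(w & -w).bit_length() - 1` stands in for `tzcnt`, and clearing the lowest bit with `w &= w - 1` is equivalent for lowest-bit-first iteration to the mask shown above:

```python
def iterate_node_bits(words, node_field_24):
    """words: the node's 4-word (32-byte) bitset region at node+32.
    Yields block indices as bit | ((word_off | (field_24 << 2)) << 6)."""
    assert len(words) == 4
    for word_off, w in enumerate(words):
        while w:
            bit = (w & -w).bit_length() - 1   # tzcnt: lowest set bit
            yield bit | ((word_off | (node_field_24 << 2)) << 6)
            w &= w - 1                        # clear the processed bit

# Node covering blocks 256..511 (field_24 = 1): word 0 bit 3 and
# word 1 bit 6 are set, i.e. blocks 259 and 326
words = [0b1000, 1 << 6, 0, 0]
print(list(iterate_node_bits(words, 1)))      # [259, 326]
```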
GVN Function Map
| Address | Name | Size | Role |
|---|---|---|---|
| sub_BEE590 | GvnEntry | ~200B | Entry point (vtable slot 23, SM60/70/90) |
| sub_BEE370 | ModeDispatcher | ~550B | Selects GVN variant via knob 402 |
| sub_BED7E0 | FullGvn | ~18KB | Full GVN body (modes 3-6, RPO + scope tree) |
| sub_BED430 | EbbPrePass | ~2KB | Extended basic block pre-pass |
| sub_BED0A0 | EbbPropagate | ~3KB | EBB eligibility propagation (fixed-point) |
| sub_BEC880 | EbbInit | -- | EBB state initialization |
| sub_BEAD00 | StandardGvn | ~2.5KB | Standard dominator-guided GVN (mode 2) |
| sub_BEA5F0 | PerBlockCse | ~9KB | Per-dominated-block CSE + commutative canon. |
| sub_BEA450 | SimpleGvn | ~2KB | Simple single-block GVN (mode 1) |
| sub_BEA3B0 | DomCheckCached | ~300B | Dominance check with global cache |
| sub_BEA1E0 | EligibilityCheck | ~500B | Instruction eligibility (opcode classes) |
| sub_BEA000 | EbbCandidateCheck | ~700B | EBB candidate dominator-chain walk |
| sub_7E7380 | PredicateCompat | ~150B | Predicate-operand compatibility check |
| sub_661250 | NoOp | ~6B | No-op stub (SM30/50/80/89) |
| sub_7846D0 | BuildDomTree | -- | Dominator tree + RPO ordering builder |
| sub_661750 | ScopeTreeInit | -- | Scoped value tree init/destroy |
| sub_9314F0 | InsertMov | -- | Instruction insertion (generates MOV 292) |
| sub_934630 | InsertMulti | -- | Instruction insertion (multi-operand variant) |
| sub_931920 | InsertNode | -- | Instruction node insertion into linked list |
| sub_9253C0 | DeleteInstr | -- | Instruction deletion |
| sub_6B4520 | RecordBlock | -- | Block recording for dominator scoping |
| sub_74D720 | DomOrdering | -- | Dominator ordering comparison |
| sub_69DD70 | TreeExtract | -- | Tree node extraction (deferred processing) |
| sub_7A1A90 | KnobQuery | -- | Knob query (per-instruction enablement) |
| sub_91BC40 | MemSafetyCheck | -- | Memory operation safety check |
| sub_A12EA0 | DomTreeWalk | -- | Dominator tree walker (EBB discovery) |
GPU-Specific CSE Constraints
GPU CSE must respect constraints that do not arise in CPU compilers:
- **Divergence.** A uniform subexpression (same value across all threads in a warp) can be safely hoisted. A divergent subexpression may have different values per thread and must only be CSE'd within the same control-flow path. The GvnCse pass runs after `AnalyzeUniformsForSpeculation` (phase 27), which provides divergence annotations.
- **Barrier sensitivity.** A computation that reads shared memory before a `BAR.SYNC` cannot be commoned with an identical computation after the barrier, because intervening threads may have written different values. Memory operations with barrier dependencies are assigned unique value numbers. The actual barrier check is performed by `sm_backend->vtable[371]` (offset +2968), an architecture-specific predicate.
- **Register pressure.** Aggressive CSE can increase register pressure by extending the live range of the representative value. The `EnableGvnCse` knob allows the pass to be disabled when register pressure is the binding constraint.
- **Per-SM enablement.** GVN-CSE is only active on SM60, SM70, and SM90+. SM80/SM89 rely on LateOriCommoning (phase 64) and GeneralOptimize sub-passes instead. This per-generation selection is embedded in the SM backend vtable at slot 23.
Phase 50 -- OriReassociateAndCommon
Overview
Reassociation normalizes the algebraic structure of expressions to expose commoning opportunities that GvnCse missed. GvnCse cannot detect that (a + b) + c and (a + c) + b compute the same value unless the expressions are first reassociated into a canonical form. This pass performs that reassociation and then runs a second commoning pass over the normalized IR.
Dispatch Mechanism
// sub_C604D0 -- OriReassociateAndCommon::execute
int64 execute(phase* self, compilation_context* ctx) {
int func_count = get_function_count(ctx); // sub_7DDB50
if (func_count > 1)
return ctx->field_1584->vtable[44](ctx->field_1584, ctx);
return func_count;
}
For multi-function compilation units, the pass dispatches through the compilation context's SM backend (field +1584 / 0x630), calling vtable slot 44 (offset 0x160). This enables per-function reassociation with function-level isolation of value numbering state.
Algorithm (Reconstructed)
Reassociation works on associative and commutative operators:
procedure ReassociateAndCommon(function F):
for each basic block B in RPO:
for each instruction I in B:
if I.opcode is associative+commutative (ADD, MUL, AND, OR, XOR):
flatten expression tree rooted at I into a list of leaves
sort leaves by canonical order (constants last, then by register number)
rebuild balanced binary tree from sorted leaves
if I.opcode is SUB:
rewrite (a - b) as (a + (-b)) for uniformity
// Second pass: hash-based commoning over the reassociated IR
run local CSE over each basic block
Why Reassociation Matters
The reassociation and commoning phases are tightly coupled because reassociation's primary goal is to enable commoning:
BB0: R5 = (R2 + R3) + R4 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R3)), vn(R4))
BB1: R6 = (R2 + R4) + R3 ; GvnCse sees: VN(ADD, VN(ADD,vn(R2),vn(R4)), vn(R3))
-- These are NOT the same VN because the inner ADDs differ.
After reassociation, both flatten to {R2, R3, R4} sorted canonically, then rebuild as (R2 + R3) + R4. Now they share the same value number and the second is eliminated.
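As a sketch of why canonicalization makes the two value numbers collide, the flatten-sort step can be modeled on plain leaf lists. This is illustrative only: leaves are reduced to virtual register numbers, and ptxas's real ordering (constants last, then by register number) is simplified to a plain numeric sort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the flatten-sort step: once both ADD trees are flattened to
 * leaf multisets and sorted canonically, a later hash-based commoning pass
 * sees identical keys for (R2 + R3) + R4 and (R2 + R4) + R3. */
static int cmp_u32(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Canonicalize a flattened leaf list in place. */
static void canonicalize(unsigned *leaves, size_t n)
{
    qsort(leaves, n, sizeof *leaves, cmp_u32);
}
```

After `canonicalize`, both expressions from the example above yield the leaf order {R2, R3, R4}, so rebuilding a tree from either list produces the same shape and the same value number.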
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
AllowReassociateCSE | 0x21C0180 | Master enable/disable |
ReassociateCSEBudget | 0x21BA810 | Max instructions processed per function |
ReassociateCSEWindow | 0x21BA7D0 | Sliding window size for local CSE after reassociation |
ReassociateCSESkip | 0x21BA7F0 | Skip first N instructions (debugging) |
ReassociateLargeImmInUIADD64 | 0x21BA7A0 | Large immediates in 64-bit unsigned ADD |
DistributeAndReassociateMulBudget | 0x21BDDC0 | Budget for a*b + a*c -> a*(b+c) |
Phase 64 -- LateOriCommoning
Overview
LateOriCommoning is a late CSE pass that runs immediately after predication (phase 63, OriDoPredication). If-conversion transforms conditional branches into predicated instructions, which can expose new redundancies: two computations that were previously in mutually exclusive branches become adjacent predicated instructions that may compute the same value.
Dispatch Mechanism
// sub_C60020 -- LateOriCommoning::execute
char execute(phase* self, compilation_context* ctx) {
int func_count = get_function_count(ctx); // sub_7DDB50
if (func_count > 1)
return sub_9059B0(ctx); // late commoning implementation
return func_count;
}
Implementation -- sub_9059B0
sub_9059B0 is the entry point for late commoning. It:
- Checks knob 487 (`ForceLateCommoning` at `0x21BD2F0`) to determine whether the pass is enabled
- Verifies the function's optimization state has commoning enabled: the byte at `context->field_1664->field_72 + 60696` must be 1, and the dword at offset `+60704` must be nonzero
- Allocates a ref-counted working set via the pool allocator
- Calls `sub_9055F0` -- the core commoning walker
Core Commoning Walker -- sub_9055F0
sub_9055F0 (203 lines decompiled) is the central commoning algorithm for late CSE. Its structure, reconstructed from the decompilation:
procedure LateCommoning(function_state S):
if not knob_enabled(487): return
if S.flags & 0x02: return // already processed
if (S.flags | S.flags2) & 0x08: return // conflicting mode
rebuild_def_chains(S, mode=1) // sub_781F80
rebuild_use_chains(S) // sub_763070
compute_hash_values(S, 0, 0, 0, 0) // sub_7E6090
block_count = S.field_520 + 1
allocate bit_array[block_count]
// Reset hash/VN slots on all instructions
for each instruction I in S.instruction_list:
I.field_88 = 0xFFFFFFFF00000000 // upper 32 bits = -1, lower = 0
// Main commoning loop over code list
for each instruction I in S.code_list:
// Phase 1: Remap operands through equivalence table
for each operand (reverse order):
if operand is register ref (type 0x10000000):
resolve to canonical representative
// Phase 2: Try commoning based on opcode class
if I.opcode == 72 (MOV):
propagate_equivalence(I) // sub_8F2CD0
elif is_pure(I): // sub_7DF3A0
opcode_class = I.opcode & 0xCF00
if opcode_class == 0x0061 (SEL): // conditional select
reset_tracking()
elif opcode_class == 0x0034 (PHI):
record_phi_equivalence(S, I)
else:
if not try_common(S, I): // sub_901A90
hash = compute_hash(S, I) // sub_74ED70
record_hash_for_future_matching(hash)
The three infrastructure functions called at the beginning are shared with the GeneralOptimize sub-passes:
- `sub_781F80` -- rebuilds reaching definition chains (also used by GeneralOptimizeEarly)
- `sub_763070` -- rebuilds use-def chains
- `sub_7E6090` -- pre-computes instruction hash values
Commoning Check -- sub_901A90
sub_901A90 (387 lines) is the instruction-level CSE checker. It:
- Examines the instruction's opcode, type, and operand value numbers
- Looks up the instruction's hash in the per-block equivalence table
- If a match is found, verifies that the matched instruction dominates the current position via `sub_1245740` (an O(1) bitvector bit test: `(1 << def_dom_id) & dom_set[def_dom_id >> 5]`)
- If domination holds, replaces the current instruction's destination with the matched instruction's destination
- Returns true if commoning succeeded, false otherwise
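A portable C sketch of that dominance bit test follows (the helper name and 32-bit word size are assumptions; the explicit `& 31` makes visible the shift-count masking that the x86 `shl` instruction performs implicitly in the decompiled `1 << def_dom_id`):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative rendering of the sub_1245740-style test: dom_set is a
 * bitvector over dominator-tree IDs, one bit per ID, 32 IDs per word. */
static int dominates(const uint32_t *dom_set, uint32_t def_dom_id)
{
    return (dom_set[def_dom_id >> 5] >> (def_dom_id & 31)) & 1;
}
```

Because the test is a single load, shift, and mask, the commoning checker can afford to run it on every candidate match without a measurable cost.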
A related commoning pattern was confirmed from sub_90A340 (1670 bytes, 21 callees), which performs commoning on opcode 130 (HSET2 in the ROT13 name table; used as an internal marker for MOV-like instructions -- actual SASS MOV is opcode 19) instructions. From the decompilation, the operand comparison loop:
// Operand-by-operand equivalence check within commoning body
for (i = operand_count - 1; i >= 0; i--) {
if (candidate.operands[2*i + 21] != existing.operands[2*i + 21])
break; // operand value mismatch
if (candidate.operands[2*i + 22] != existing.operands[2*i + 22])
break; // operand modifier mismatch
}
// If all operands match AND opcodes match AND operand counts match:
// verify dominance, then replace
The reverse iteration order (from last operand to first) is an optimization: destination operands at lower indices are more likely to differ, so checking source operands first (higher indices) allows early exit.
Instruction Hashing -- sub_74ED70
sub_74ED70 (304 lines) computes a hash value for an instruction, incorporating:
- Opcode and type qualifiers
- Value numbers of all source operands (recursively resolved through MOV chains)
- Address space for memory operations
- Predicate register (if predicated)
- Immediate values (folded into the hash)
The hash is stored at instruction field +88 (the upper 32 bits that were reset to 0xFFFFFFFF during initialization). The function calls sub_7DF3A0 (purity check), sub_7E0030 and sub_7E2530 (operand accessors), and sub_748440 (hash combining).
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
ForceLateCommoning | 0x21BD2F0 | Force-enable late commoning |
DisableMoveCommoning | 0x21BE2C0 | Disable MOV-based equivalence propagation within the commoning walker |
Phase 83 -- OriBackCopyPropagate
Overview
Backward copy propagation propagates values backward through MOV chains, eliminating intermediate copies. Unlike forward copy propagation (which replaces uses of a copy's destination with the copy's source), backward copy propagation replaces the definition of a copy's source with the copy's destination, allowing the copy instruction itself to be deleted.
Phase 83 uses a split-phase design with phase 82 (AdvancedPhaseBackPropVReg). The actual backward copy propagation algorithm lives in architecture-specific SM backend overrides of phase 82. Phase 83 is a pipeline progress marker that advances the pipeline counter context+1552 to 9 after backward copy propagation completes, signaling to downstream operand encoding functions that they may apply relaxed register constraints.
This phase is disabled by default (isNoOp returns 1). It is activated only when an architecture backend overrides phase 82 to provide its own backward propagation implementation.
Dispatch Mechanism
The execute function is a 7-byte stub that advances the pipeline progress counter:
// sub_C5EB80 -- OriBackCopyPropagate::execute
void execute(phase* self, compilation_context* ctx) {
ctx->field_1552 = 9; // advance pipeline progress counter to backward-copy-prop stage
}
Phase 83 does not contain the backward copy propagation algorithm. The actual algorithm is provided by the architecture-specific SM backend that overrides phase 82 (AdvancedPhaseBackPropVReg). The split-phase design works as follows:
| Phase | Role | Default behavior | When arch-activated |
|---|---|---|---|
82 (AdvancedPhaseBackPropVReg) | Gate + algorithm provider | No-op (hook, isNoOp = 1) | Arch backend installs backward copy propagation body |
83 (OriBackCopyPropagate) | Pipeline progress marker | No-op (isNoOp = 1) | Sets context+1552 = 9, enabling downstream constraint relaxation |
The factory switch at sub_C60D30 installs vtable off_22BE298 for phase 82 and off_22BE2C0 for phase 83. Both vtables are 40-byte (5-pointer) structures at consecutive addresses in .data.rel.ro.
Gate Mechanism (Phase 82)
Phase 82 (AdvancedPhaseBackPropVReg) is one of 16 AdvancedPhase hook points in the pipeline. By default its isNoOp returns true, meaning the phase is skipped entirely. When an architecture backend needs backward copy propagation, it:
- Overrides phase 82's vtable to install the actual backward propagation algorithm as the execute function
- Overrides phase 82's `isNoOp` to return 0 (enabled)
- Configures phase 83's `isNoOp` to return 0, enabling the pipeline counter advancement
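The gating scheme can be modeled as a small vtable dance (all names and the reduced two-slot vtable are illustrative; the real phase vtables are 40-byte, 5-pointer structures):

```c
#include <assert.h>

/* Toy model of the phase-82/83 gate: a phase is skipped while its isNoOp
 * slot returns true; an arch backend enables it by swapping the slot. */
typedef struct phase_vtbl {
    void (*execute)(int *ctx_counter);
    int  (*is_noop)(void);
} phase_vtbl;

static int  noop_true(void)  { return 1; }
static int  noop_false(void) { return 0; }
static void marker_execute(int *counter) { *counter = 9; } /* phase-83 body */

static void run_phase(const phase_vtbl *p, int *counter)
{
    if (!p->is_noop())          /* default vtable: phase is skipped */
        p->execute(counter);
}
```

With the default `is_noop`, `run_phase` leaves the pipeline counter untouched; after an arch backend installs `noop_false`, the 7-byte marker body runs and advances the counter to 9.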
The BackCopyPropBudget knob (index 808, address 0x21BFDF0) limits the number of backward propagations performed. This knob is read by sub_8C0270 (scheduler initialization) at the point where the scheduler allocates its per-function work structure. When knob 808 is not set by the user, the budget falls back to a default stored in the scheduler state object at offset +92.
Algorithm (Reconstructed)
The backward copy propagation algorithm is reconstructed from the phase name, the infrastructure it shares with forward copy propagation (sub_781F80, sub_763070), the BackCopyPropBudget knob, and the pipeline position constraints. The actual algorithm body resides in architecture-specific SM backend code, not in the generic binary.
procedure BackCopyPropagate(function F):
budget = knob(808) // BackCopyPropBudget
count = 0
// Phase 1: rebuild def-use chains (shared infrastructure)
rebuild_def_chains(F) // sub_781F80
rebuild_use_chains(F) // sub_763070
// Phase 2: walk blocks in RPO, instructions in reverse
for each basic block B in reverse postorder:
for each instruction I in B (last to first):
if count >= budget:
return
if I is not MOV (opcode & 0xCF00 != MOV class):
continue
// I is: Rd = MOV Rs
def_of_Rs = reaching_def(Rs)
// Guard 1: Rs must have exactly one use (this MOV)
if use_count(Rs) != 1:
continue
// Guard 2: def(Rs).dest can be renamed to Rd without conflict
if not can_rename(def_of_Rs.dest, Rd):
continue
// Guard 3: no intervening definition of Rd between def(Rs) and I
if has_intervening_def(Rd, def_of_Rs, I):
continue
// Perform backward propagation: rename definition
rename def_of_Rs.dest from Rs to Rd
delete I // MOV is now redundant
count++
The backward walk direction is essential for cascading chain collapse:
Before: R1 = expr; R2 = R1; R3 = R2
^^^^^^ processed first (backward)
Step 1: R1 = expr; R3 = R1; (deleted R3=R2, renamed R2→R3 in "R2=R1")
^^^^^^ processed next
Step 2: R3 = expr; (deleted R3=R1, renamed R1→R3 in "R1=expr")
Result: entire 3-instruction chain collapses to single "R3 = expr"
If the walk were forward, R2 = R1 would be processed first (renaming R1 = expr to R2 = expr), and R3 = R2 would then need a second pass to collapse further. The backward direction achieves full chain collapse in a single pass.
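The cascading collapse can be demonstrated on a toy single-block IR (illustrative only: guards 2 and 3 from the pseudocode are trivially satisfied in a straight-line block with single definitions and are therefore omitted):

```c
#include <assert.h>

/* Toy model of the backward walk: instructions are either "dest = expr"
 * or "dest = MOV src". Walking last to first, each MOV whose source has
 * exactly one use (the MOV itself) is deleted and the source's defining
 * instruction is renamed to write the MOV's destination. */
typedef struct { int dest; int src; int is_mov; int live; } Instr;

static int use_count(const Instr *code, int n, int reg)
{
    int uses = 0;
    for (int i = 0; i < n; i++)
        if (code[i].live && code[i].is_mov && code[i].src == reg)
            uses++;
    return uses;
}

static void back_copy_prop(Instr *code, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        if (!code[i].live || !code[i].is_mov) continue;
        if (use_count(code, n, code[i].src) != 1) continue; /* guard 1 */
        for (int j = i - 1; j >= 0; j--) {
            if (code[j].live && code[j].dest == code[i].src) {
                code[j].dest = code[i].dest;   /* backward rename */
                code[i].live = 0;              /* delete the MOV  */
                break;
            }
        }
    }
}
```

Running this on the three-instruction chain from the example leaves a single live instruction writing R3, matching the single-pass collapse described above.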
Why Phase 83 Runs So Late
Phase 83 is positioned at pipeline slot 83 out of 158, immediately before the register attribute computation sequence (phases 84--95). This late position serves three purposes:
- **Catches late-created copies.** Phases 66--81 include late optimizations (LICM, texture movement, rematerialization, late arch-specific peepholes) that frequently insert new MOV instructions. Backward copy propagation after these passes cleans up the residual chains that forward propagation (which last ran in phase 65) cannot see.
- **Reduces register pressure for allocation.** Every eliminated MOV is one fewer live range the register allocator (phase 101) must handle. By running just before the liveness/DCE pass (phase 84, `OriPerformLiveDeadFourth`), backward copy propagation minimizes the input to register allocation.
- **Safe renaming window.** After phase 83, the pipeline enters the register attribute and legalization sequence. Renaming destinations before this point avoids conflicts with the fixed register assignments that legalization may impose.
Why Disabled by Default
Phase 83 is disabled by default (isNoOp returns 1) for several reasons:
- **Backward renaming is inherently riskier than forward propagation.** Forward copy propagation modifies uses (safe because the original definition still exists). Backward copy propagation modifies definitions -- changing which register an instruction writes to. A bug here can silently corrupt values used by other instructions.
- **Architecture-specific register constraints.** The legality of renaming a destination depends on target-specific constraints: fixed-function registers (thread ID, special purpose), register bank conflicts, paired/grouped register requirements for 64-bit operations, and uniform register constraints on newer architectures (Volta+). Only the architecture backend knows which renames are safe.
- **Diminishing returns.** Forward copy propagation (`OriCopyProp`) runs six times during the GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65) and handles the majority of copy elimination. Backward propagation catches only residual chains that forward propagation structurally cannot eliminate.
- **Gate requirement.** Architecture backends that enable backward copy propagation via phase 82 may also need to pre-process the IR (e.g., marking registers that must not be renamed, or inserting constraints that protect fixed-function registers).
Downstream Effects: Pipeline Counter and Encoding Relaxation
When phase 83 sets context+1552 to 9, two operand encoding pattern functions (sub_9BF350 and sub_9BFAF0) change behavior. These functions gate on two conditions:
// Gate check in sub_9BF350 and sub_9BFAF0
if ((context->field_1398 & 0x04) != 0 && context->field_1552 > 9) {
// Apply register constraint relaxation
// Check if operand register class == 3 (address register) or reg_id == 41
// Assign special operand mask 0xFFFFFA (16777210) instead of 0xFFFFFF
}
The flag at context+1398 bit 2 is an architecture capability flag. When both conditions are met (capability flag set AND pipeline has progressed past phase 83), the encoding functions relax operand constraints for address registers (class 3) and special register 41, allowing these to participate in operand patterns that they would otherwise be excluded from.
The pipeline counter value 9 is part of a progression: phase 95 (SetAfterLegalization, sub_C5E440) later advances the counter to 19, enabling a further tier of relaxation in the scheduler initialization (sub_8C0270).
Forward vs. Backward Copy Propagation
The two propagation directions are complementary and handle different structural patterns:
| Property | Forward (OriCopyProp) | Backward (OriBackCopyPropagate) |
|---|---|---|
| Direction | Replaces uses of copy destination with copy source | Replaces definitions to eliminate copies |
| Example | R2=R1; ADD R3,R2,R4 -> ADD R3,R1,R4 | R1=expr; R2=R1 -> R2=expr |
| Runs | 6 times (phases 13,29,37,46,58,65) | Once (phase 83) |
| Default | Always enabled | Disabled (arch-gated) |
| Risk | Low (original def unchanged) | Higher (modifies defs) |
| Catches | Most copies from expansion and lowering | Residual chains from late passes (66--81) |
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
BackCopyPropBudget | 0x21BFDF0 | Maximum backward propagations per function (knob index 808) |
Forward Copy Propagation -- OriCopyProp
Overview
Forward copy propagation is not a standalone pipeline phase but a sub-pass within each of the six GeneralOptimize bundles (phases 13, 29, 37, 46, 58, 65). It is identified by the name string OriCopyProp at address 0x21E6CE1 and can be individually targeted via the --named-phases mechanism.
The OriCopyProp name appears in the NamedPhases parser (sub_9F4040 at offset +1648), where it is looked up via sub_C641D0 (case-insensitive binary search over the phase name table). When the user specifies --named-phases OriCopyProp, the system resolves this to the appropriate sub-pass within GeneralOptimize.
Target Opcodes and Flag Bits
Three Ori opcodes are candidates for forward copy propagation:
| Opcode | Meaning | Propagation Rule |
|---|---|---|
| 97 | Definition anchor (STG in ROT13; used internally as a register-to-register MOV/definition marker -- actual SASS MOV is opcode 19) | Replace uses of destination with source |
| 18 | Predicated copy | Propagate only under matching predicate guard |
| 124 | Conditional select (CSEL) | Propagate when select condition is provably constant |
Opcode matching uses a mask: (instr.opcode & 0xCF00) == target, stripping modifier bits in the upper nibble of the opcode field at instruction offset +72.
Three flag bits on instruction field [6] (byte offset 24) track propagation state:
| Bit | Hex | Meaning |
|---|---|---|
| 8 | 0x100 | Copy has been propagated |
| 9 | 0x200 | Deferred cleanup (instruction may still be needed) |
| 10 | 0x400 | Under predicate guard (requires predicate-aware handling) |
Eligibility Check (sub_8F2E50)
The eligibility function checks whether a copy can be safely propagated, with an SM-version-dependent constraint:
function isEligibleForPropagation(instr, ctx):
sm_version = *(ctx + 372)
    if sm_version <= 20479:      // pre-Turing (sm_70 and earlier)
        return true              // unconditionally safe
    else:                        // Turing+ (sm_75+)
        return (instr.operand_type & 0x1C00) == 0  // constraint bits must be clear
The SM version threshold 20479 corresponds to the boundary between Volta (sm_70) and Turing (sm_75). Turing introduced new operand constraint bits that restrict when copies can be folded.
Algorithm
Forward copy propagation replaces uses of a copy's destination with the copy's source:
procedure OriCopyProp(function F):
for each basic block B in RPO:
for each instruction I in B:
if I is MOV Rd, Rs:
for each use U of Rd that I dominates:
if Rs is still live at U:
replace Rd with Rs in U
if Rd has no remaining uses:
mark I as dead
Within the GeneralOptimize loop, copy propagation interacts with constant folding and algebraic simplification: a copy propagation may expose a constant operand, enabling constant folding in the next iteration, which may create a dead instruction for DCE. This is why GeneralOptimize runs as a fixed-point loop. In Variant A (phases 13, 29), the fixed-point iteration is capped by knob 464. In Variant B (phases 37, 58), convergence uses a cost-based threshold of 0.25 (knob 474). Two-pass predicate simplification via sub_908A60 runs within the copy propagation loop to handle predicate-conditional copies.
Controlling Knobs
| Knob | Address | Purpose |
|---|---|---|
CopyPropBudget | 0x21BECD0 | Maximum instructions processed per invocation |
CopyPropGlobalBudget | 0x21BEC70 | Budget for cross-block (global) copy propagation |
CopyPropForceGlobal | 0x21BEC90 | Force global copy propagation |
CopyPropAddr | 0x21BECE8 | Propagate through address computations |
CopyPropConstantBank | 0x21BECB0 | Propagate constant bank references |
CopyPropUseReachingDefs | 0x21BEBD0 | Use reaching definitions for more aggressive propagation |
CopyPropPreAllocReg | 0x21BEBF0 | Enable for pre-allocated (fixed) registers |
CopyPropNoWriteNonRR | 0x21BEC10 | Disable into non-register-register contexts |
CopyPropNonRegMultiDef | 0x21BEC30 | Handle non-register multi-definition copies |
CopyPropNoMmaCb | 0x21BEC50 | Disable into MMA constant bank operands |
LateCopyPropComplPred | 0x21BC680 | Late copy propagation for complementary predicates |
The CopyPropUseReachingDefs knob is particularly significant: when enabled, the pass uses reaching definitions analysis (built by sub_781F80) instead of simple dominator checks, allowing more aggressive propagation at the cost of additional analysis time.
Complete Knob Reference
All 24 knobs controlling copy propagation and CSE:
| Knob | ROT13 | Address | Controls |
|---|---|---|---|
EnableGvnCse | RanoyrTiaPfr | 0x21BDA50 | Master enable for phase 49 |
EnableGvnCseMode | -- | knob 402 | GVN mode selector (0=off, 1=simple, 2=standard, 3-6=full) |
EnableGvnCsePerInstr | -- | knob 257 | Per-instruction GVN enablement gate |
AllowReassociateCSE | NyybjErnffbpvngrPFR | 0x21C0180 | Master enable for reassociation CSE |
ReassociateCSEBudget | ErnffbpvngrPFROhqtrg | 0x21BA810 | Instruction budget |
ReassociateCSEWindow | ErnffbpvngrPFRJvaqbj | 0x21BA7D0 | Sliding window size |
ReassociateCSESkip | ErnffbpvngrPFRFxvc | 0x21BA7F0 | Skip first N |
ReassociateLargeImmInUIADD64 | ErnffbpvngrYnetrVzzVaHVNQQ64 | 0x21BA7A0 | 64-bit ADD imm |
DistributeAndReassociateMulBudget | QvfgevohgrNaqErnffbpvngrZhyOhqtrg | 0x21BDDC0 | Distributive law |
ForceLateCommoning | SbeprYngrPbzzbavat | 0x21BD2F0 | Force phase 64 |
DisableMoveCommoning | QvfnoyrZbirPbzzbavat | 0x21BE2C0 | Disable MOV commoning |
BackCopyPropBudget | OnpxPbclCebcOhqtrg | 0x21BFDF0 | Phase 83 budget |
CopyPropBudget | PbclCebcOhqtrg | 0x21BECD0 | Per-invocation budget |
CopyPropGlobalBudget | PbclCebcTybonyOhqtrg | 0x21BEC70 | Cross-block budget |
CopyPropForceGlobal | PbclCebcSbeprTybony | 0x21BEC90 | Force global |
CopyPropAddr | PbclCebcNqqe | 0x21BECE8 | Address prop |
CopyPropConstantBank | PbclCebcPbafgnagOnax | 0x21BECB0 | Constant bank |
CopyPropUseReachingDefs | PbclCebcHfrErnpuvatQrsf | 0x21BEBD0 | Reaching defs |
CopyPropPreAllocReg | PbclCebcCerNyybpErt | 0x21BEBF0 | Fixed registers |
CopyPropNoWriteNonRR | PbclCebcAbJevgrAbaEE | 0x21BEC10 | Non-RR disable |
CopyPropNonRegMultiDef | PbclCebcAbaErtZhygvQrs | 0x21BEC30 | Multi-def |
CopyPropNoMmaCb | PbclCebcAbZznPo | 0x21BEC50 | MMA disable |
LateCopyPropComplPred | YngrPbclCebcPbzcyCerq | 0x21BC680 | Compl pred |
SpeculativeHoistCommonInsts | FcrphyngivruBvfgPbzzbaVafgf | 0x21B81B0 | Spec hoist (phase 56) |
Interaction Between Passes
The copy propagation and CSE passes interact with each other and with the rest of the pipeline in a specific sequence designed to maximize redundancy elimination:
Phase 46: GeneralOptimizeMid2
|-- OriCopyProp (forward copy propagation)
|-- constant folding, algebraic simplification, DCE
Phase 48: EnforceArgumentRestrictions
|-- may insert MOVs for ABI compliance -> new copy prop opportunities
Phase 49: GvnCse
|-- global value numbering + CSE
|-- eliminates redundant computations across basic blocks
Phase 50: OriReassociateAndCommon
|-- normalizes expression trees for better commoning
|-- local CSE over reassociated IR
|-- catches cases GvnCse missed due to non-canonical form
Phase 51: ExtractShaderConstsFinal
|-- may replace computations with constant loads -> dead code
Phase 58: GeneralOptimizeLate
|-- OriCopyProp again (cleans up after expansion passes)
Phase 63: OriDoPredication
|-- converts branches to predicated instructions
|-- previously mutually-exclusive code becomes linear
Phase 64: LateOriCommoning
|-- CSE on newly-linearized predicated code
|-- eliminates redundancies exposed by if-conversion
Phase 65: GeneralOptimizeLate2
|-- OriCopyProp + DCE (final cleanup)
Phase 82: AdvancedPhaseBackPropVReg (gate, arch-specific)
Phase 83: OriBackCopyPropagate
|-- backward MOV chain elimination (disabled by default)
|-- reduces copy count before register allocation
Key Function Map
| Address | Size | Name | Purpose |
|---|---|---|---|
0xC5F000 | 16 B | GvnCse::execute | Thunk to sm_backend (context+0x630)->vtable[23] |
0xC5F010 | 6 B | GvnCse::getName | Returns 49 |
0xC5F020 | 6 B | GvnCse::isNoOp | Returns 0 (enabled) |
sub_BEE590 | ~200 B | GvnCse body (SM60/70/90) | Entry point: rebuilds def chains, dispatches to mode |
sub_BEE370 | ~550 B | GvnCse mode dispatcher | Queries knob 402, selects mode 0-6 |
sub_BED7E0 | ~18 KB | FullGvnCse (modes 3-6) | RPO block walk + dominator-scoped CSE, 689 lines |
sub_BEAD00 | ~2.5 KB | StandardGvnCse (mode 2) | Dominator-guided GVN for SM < 32K threshold |
sub_BEA5F0 | ~9 KB | PerDominatedBlockCse | Per-block CSE within dominator subtree, commutative canon |
sub_BEA450 | ~2 KB | SimpleGvn (mode 1) | Basic GVN variant |
sub_BEA1E0 | ~500 B | GvnCse eligibility check | Opcode-based CSE eligibility (16,122,145,183,186,...) |
sub_BED430 | ~2 KB | EBB pre-pass | Extended basic block identification (gated by knob 210) |
sub_661250 | 6 B | GvnCse no-op stub | Returns 0 (SM30/50/80/89 vtable slot 23) |
sub_7846D0 | -- | Build dominator tree | Also computes RPO ordering at context+792 |
sub_661750 | -- | Scoped value tree | Init/destroy balanced BST for dominator scoping |
0xC604D0 | 42 B | OriReassociate::execute | Dispatches to sm_backend (context+1584)->vtable[44] |
0xC5EFE0 | 6 B | OriReassociate::getName | Returns 50 |
0xC5EFF0 | 6 B | OriReassociate::isNoOp | Returns 0 (enabled) |
0xC60020 | 48 B | LateOriCommoning::execute | Calls sub_9059B0 |
0xC5EDF0 | 6 B | LateOriCommoning::getName | Returns 64 |
0xC5EE00 | 6 B | LateOriCommoning::isNoOp | Returns 0 (enabled) |
0xC5EB80 | 7 B | BackCopyProp::execute | Sets context+1552 = 9 (pipeline progress marker) |
0xC5EB90 | 6 B | BackCopyProp::getName | Returns 83 |
0xC5EBA0 | 6 B | BackCopyProp::isNoOp | Returns 1 (disabled) |
0xC5EBB0 | 6 B | AdvancedPhaseBackPropVReg::getName | Returns 82 |
0xC5EBC0 | 6 B | AdvancedPhaseBackPropVReg::isNoOp | Returns 0 (overridden to 1 at runtime by default vtable) |
sub_9BF350 | 8.6 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
sub_9BFAF0 | 9.0 KB | Encoding pattern (post-phase-83) | Checks context+1552 > 9 for register constraint relaxation |
sub_8C0270 | 14 KB | Scheduler vtable init | Reads knob 808 (BackCopyPropBudget), checks +1552 == 19 |
sub_9059B0 | ~320 B | LateOriCommoning impl | Knob check + ref-counted working set + core walker |
sub_9055F0 | ~800 B | LateCommoning core | Iterates code list, remaps operands, calls commoning check |
sub_901A90 | ~1.5 KB | Commoning check | Hash lookup + dominance verify + replacement |
sub_74ED70 | ~1.2 KB | Instruction hash | Opcode + type + operand VNs + address space -> hash |
sub_781F80 | -- | Rebuild def chains | Reaching definitions for commoning |
sub_763070 | -- | Rebuild use chains | Use-def chains |
sub_7E6090 | -- | Compute hash values | Pre-computes per-instruction hashes |
sub_7DDB50 | ~140 B | get_function_count | Returns func count from compilation context |
sub_7DF3A0 | ~80 B | is_pure_instruction | Side-effect-free check (bits 2-3 of status word) |
sub_748440 | -- | Hash combine | Mixes operand hashes into instruction hash |
sub_8F2CD0 | -- | Propagate equivalence | MOV-based value equivalence propagation |
sub_8FCE70 | ~150 B | Ref-count release | Releases ref-counted working set objects |
sub_1245740 | -- | Dominance check | O(1) bitvector bit test for CSE safety |
sub_6B9180 | -- | Set membership test | Commoning set contains check |
sub_9253C0 | -- | Instruction deletion | Removes dead/redundant instructions |
sub_90A340 | 1.7 KB | Commoning body | Commoning pass instance (21 callees, confirms operand comparison pattern) |
sub_908A60 | -- | Predicate simplifier | Two-pass (forward+backward) predicate simplification in copy prop |
sub_8F2E50 | -- | Copy/fold eligibility | SM-version-dependent eligibility check (threshold 20479) |
sub_7BA510 | 5.2 KB | HashCompute | Program/instruction sequence hash (FNV/Jenkins variant) |
sub_7BB260 | 3.5 KB | HashAccumulate | Incremental hash accumulation |
sub_8DCF20 | 23 KB | FNV-1a hash table | 8-byte key hash table with chained collision (24-byte entries) |
sub_8DF1C0 | 16 KB | FNV-1a hash table | 32-bit key hash table, two-level structure |
sub_9B1200 | 7.7 KB | Code-caching hash | Jenkins-style instruction fingerprint for RA cache |
Hash Infrastructure
The GVN/CSE passes share hash infrastructure with other subsystems (scheduling, code caching, register allocation). All FNV-1a implementations in ptxas use the same constants:
| Constant | Value | Purpose |
|---|---|---|
| FNV offset basis | 0x811C9DC5 | Initial hash state |
| FNV prime | 16777619 (0x01000193) | Multiplication factor per byte |
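A reference 32-bit FNV-1a using exactly these constants, shown for comparison against the recovered tables (this is the textbook algorithm, not code lifted from the binary):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a: xor the byte in first, then multiply by the
 * prime -- the same order the ptxas hash tables use. */
static uint32_t fnv1a(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];                       /* xor byte (1a order) */
        h *= 16777619u;                  /* FNV prime 0x01000193 */
    }
    return h;
}
```

The xor-then-multiply order is what distinguishes FNV-1a from FNV-1 and gives it better avalanche behavior on short keys such as 8-byte instruction fingerprints.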
Hash-related functions identified in the binary:
| Address | Size | Function | Used By |
|---|---|---|---|
| sub_7BA510 | 5.2 KB | HashCompute -- program/instruction sequence hash | Shader hash matching (SH= knob) |
| sub_7BB260 | 3.5 KB | HashAccumulate -- incremental hash accumulation | Instruction-at-a-time hashing |
| sub_8DCF20 | 23 KB | FNV-1a hash table (8-byte keys, chained collision) | Instruction deduplication in scheduling |
| sub_8DF1C0 | 16 KB | FNV-1a hash table (32-bit keys, two-level) | Opcode pattern classification |
| sub_9B1200 | 7.7 KB | Jenkins-style instruction hash for code caching | Register allocator cache hit detection |
| sub_74ED70 | ~1.2 KB | Per-instruction hash for commoning | LateOriCommoning (phase 64) |
| sub_748440 | -- | Hash combine helper | Mixes operand hashes into instruction hash |
The code-caching hash at sub_9B1200 uses a different algorithm from FNV-1a:
hash = (1025 * (value + hash)) ^ ((1025 * (value + hash)) >> 6)
It processes instruction opcodes (offset +72), operand counts (+80), operand encodings (+76), register properties (+64), and variable pair mode (bits 20-21 of the descriptor at offset +48).
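The recovered formula can be sketched in a few lines of Python; the fold loop and field list below are illustrative (the real function walks the instruction fields at the offsets listed above), but the per-step arithmetic is the recovered formula verbatim:

```python
def cache_hash_step(h: int, value: int) -> int:
    """One accumulation step of the sub_9B1200-style hash.

    1025 * x == (x << 10) + x, so this is a shift-add multiply followed by
    folding the high bits back down with a right-shift XOR (Jenkins-style).
    """
    t = (1025 * (value + h)) & 0xFFFFFFFF
    return (t ^ (t >> 6)) & 0xFFFFFFFF

def fingerprint(fields) -> int:
    """Fold a sequence of instruction fields (opcode, operand count, operand
    encodings, ...) into a single 32-bit fingerprint."""
    h = 0
    for f in fields:
        h = cache_hash_step(h, f)
    return h
```

Because each step mixes the running hash into the multiply, the fingerprint is order-sensitive, which is what an instruction-sequence cache key needs.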
Cross-References
- Pass Inventory -- complete 159-phase table
- GeneralOptimize Bundles -- forward copy propagation (OriCopyProp) sub-pass
- Predication -- phase 63 creates opportunities for LateOriCommoning
- Liveness Analysis -- liveness data consumed by copy propagation
- Strength Reduction -- produces normalized expressions for GvnCse
- Knobs System -- ROT13-encoded knob infrastructure
- Phase Manager -- vtable dispatch, phase factory
- Ori IR -- instruction representation, operand encoding
Predication (If-Conversion)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
OriDoPredication (phase 63) is the if-conversion pass in ptxas. It transforms short conditional branch regions into predicated straight-line code, eliminating branches by guarding individual instructions with predicate registers. On NVIDIA GPUs, where all threads in a warp execute in lockstep, eliminating divergent branches avoids the performance penalty of serialized path execution under the SIMT model.
| Phase index | 63 |
| Phase name | OriDoPredication |
| Category | Optimization |
| Entry point | sub_1381DA0 (1,517 bytes) |
| Core driver | sub_1381CD0 (206 bytes) |
| Main loop | sub_1381010 (3,249 bytes) |
| Total code | ~17 KB across 19 functions in 0x137D8B0--0x13829F0 |
| SSA window | Yes -- runs at phase 63, within the partial-SSA window (phases 23--73) |
| Pipeline position | After OriRemoveRedundantMultiDefMov (62), before LateOriCommoning (64) |
| Gating | Disabled when bit 5 of context+1376 flags is set; can be disabled via PTXAS_DISABLED_PASSES containing "Predication" |
| Knob controls | Knob 487 (enable/limit gate), knob 577 (per-region enable), knob 579 (texture-bearing region gate), knob 582 (block-level cold-region query), knob 260 (extra-latency penalty check) |
GPU Motivation
The SIMT execution model makes predication qualitatively different from its role on scalar CPUs.
On a scalar CPU, a correctly-predicted branch is essentially free -- the branch predictor eliminates the control flow cost. If-conversion on CPUs is a niche optimization applied only when branches are highly unpredictable.
On a GPU, a divergent conditional branch forces the warp to serialize: the hardware executes the taken path with some threads masked off, then executes the not-taken path with the complementary mask. Both paths execute regardless, and the warp reconverges at the post-dominator. The cost is the sum of both paths, not the maximum.
Predication eliminates this divergence penalty entirely. Both paths still execute, but without the overhead of stack-based reconvergence (BSSY/BSYNC pairs on sm_70+), without the branch instruction itself, and with the ability for the scheduler to interleave the predicated instructions with other independent work. For short regions (a few instructions per side), predication is strictly superior to branching.
Branching (divergent):
ISETP.NE P0, R4, R5
BSSY B0, target
@P0 BRA taken_path
// not-taken:
IADD3 R7, R7, 1, RZ
BRA rejoin
// taken:
IADD3 R6, R6, 1, RZ
// rejoin:
BSYNC B0
Predicated:
ISETP.NE P0, R4, R5
@P0 IADD3 R6, R6, 1, RZ
@!P0 IADD3 R7, R7, 1, RZ
// continues straight-line
The branching version requires 6 instructions (including BSSY/BSYNC convergence bookkeeping) and forces warp serialization. The predicated version requires 3 instructions and executes without divergence.
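The "sum, not maximum" arithmetic can be condensed into a toy cost model. This is a sketch under stated assumptions: the cycle counts and the 3-instruction bookkeeping overhead (BRA + BSSY/BSYNC) are illustrative constants, not measured values.

```python
def divergent_cost(then_cycles: int, else_cycles: int, overhead: int = 3) -> int:
    """Toy SIMT model: a divergent branch executes both paths serially,
    plus branch/reconvergence bookkeeping (illustrative constant)."""
    return then_cycles + else_cycles + overhead

def predicated_cost(then_cycles: int, else_cycles: int) -> int:
    """Predicated code also issues both paths, but carries no branch or
    reconvergence overhead."""
    return then_cycles + else_cycles
```

Under this model predication always wins by exactly the bookkeeping overhead when the warp diverges; the real trade-off appears only for warps that do *not* diverge, where branching would skip one path entirely -- which is why the pass limits region size.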
Algorithm Overview
The pass operates in three layers:
- Entry and gating (sub_1381DA0): checks the "Predication" disable flag and knob 487, initializes working state, calls the driver.
- Iterative driver (sub_1381CD0): initializes via the SM backend's vtable dispatch at sm_backend+1296, then calls the main loop up to 3 times (controlled by a knob at options offset 41768) with different aggressiveness settings.
- Main RPO loop (sub_1381010): walks the RPO block order, identifies candidate branch regions, evaluates profitability, and applies the transformation.
Entry Point -- sub_1381DA0
sub_1381DA0(compilation_unit):
if context+1376 bit 5 set:
return // phase disabled by flag
knob_state = *(context+1664) // OCG knob container
mode = *(*(knob_state+72) + 16416)
if mode == 0:
limit = (context+1419 bit 4) != 0
elif mode == 1:
limit = *(*(knob_state+72) + 16424)
else:
// mode >= 2: skip limit check
IsPassDisabled(knob_state, "Predication", &disabled)
if disabled or limit:
return
// Check knob 487 iteration limit
CheckKnob487(knob_state)
// Set up working state (allocate two pool objects)
context+1385 |= 1 // mark predication active
sub_1381CD0(state) // call driver
context+1385 &= ~1 // clear predication flag
// Cleanup: release pool objects and tree structures
The context+1385 byte has bit 0 set during predication execution, which signals downstream code (such as sub_137EE50) that the pass is active.
Iterative Driver -- sub_1381CD0
sub_1381CD0(state):
// Initialize via SM backend
sm_backend = *(context+1584)
init_fn = vtable(sm_backend)+1296
if init_fn == sub_7D82C0: // fast path: zero-init
clear state fields
else:
init_fn(sm_backend, state) // backend-specific init
bb_count = *(context+520)
if bb_count <= 1: return 0 // nothing to if-convert
// Determine iteration count from knob at options+41760
iterations = 0
if *(options+41760) == 1:
iterations = *(options+41768)
// First pass: always run
state[14].byte[8] = 0 // not second-pass mode
changed = sub_1381010(state)
// Optional second/third pass with relaxed thresholds
while changed and iterations > 0:
state[14].byte[8] = (iterations == 1)
changed = sub_1381010(state)
if iterations <= 2: break
The iteration mechanism allows the pass to make a second (and potentially third) traversal with progressively relaxed profitability thresholds. The flag at state[14].byte[8] signals the final iteration, which changes some size-limit comparisons in the profitability heuristic.
Main Loop -- sub_1381010
The main loop walks basic blocks in RPO order (via the block index array at context+512), identifies candidate branch regions, and decides whether to if-convert each one.
sub_1381010(state):
// Rebuild liveness and CFG
sub_781F80(context, 1) // rebuild liveness
if context+1370 bit 4 set:
sub_A10160(context, 1) // rebuild analysis
sub_7E6090(context, 0,0,0,0) // refresh CFG
// Clear block-76 fields
for each block in chain:
block+76 = 0
sub_791F00(context, 0) // clear RPO numbering
changed = false
for rpo_idx = 2 .. bb_count:
bb = bb_array[rpo_order[rpo_idx]]
if bb is same as previous region tail:
// Continuation of prior diamond -- reuse state
restore saved state
else:
// Fresh candidate: analyze new region
init candidate state
if not isTriangleDiamondCandidate(bb):
skip
if not analyzeRegion(state, candidate):
skip
// Region identified -- extract branch info
header = bb
true_target = successor of header's terminator
branch_pred = extractBranchPredicate(header)
false_target = fallthrough
// Try to if-convert both sides
if evaluateProfitability(true_side, false_side):
applyTransformation(...)
changed = true
if changed:
context+1370 &= ~4 // invalidate CFG
sub_785E20(context, 0) // rebuild
return changed
CFG Pattern Recognition
The pass recognizes three CFG shapes for if-conversion:
Triangle Pattern
One arm of the branch is empty (falls through directly to the merge point).
[header]
/ \
/ \
[then] |
\ /
\ /
[merge]
Requirements:
- header ends with a conditional branch (opcode 93; OUT_FINAL in the ROT13 name table, but checked here as a control-flow terminator marker)
- then block has a single predecessor (the header)
- then block's sole successor is the merge block
- merge has exactly two predecessors: header and then
- No backedges into the region
Diamond Pattern
Both arms contain instructions.
[header]
/ \
/ \
[then] [else]
\ /
\ /
[merge]
Requirements (same as triangle, plus):
- The else block has a single predecessor (the header)
- The else block's sole successor is the same merge block
- merge has exactly two (or three, for extended diamonds) predecessors
Extended Diamond Pattern
The pass can also handle diamonds where one or both arms chain through a successor block before merging. The sub_137FE10 function implements this extended analysis, walking forward through fall-through blocks until it reaches a merge point or encounters a block that fails the candidate check.
[header]
/ \
/ \
[then] [else]
| |
[then2] [else2] (optional chain blocks)
\ /
\ /
[merge]
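The three shape tests reduce to a few adjacency checks. A minimal Python sketch, assuming the CFG is given as successor/predecessor lists (it omits the backedge test and the extended-chain walk, and the function name is ours):

```python
def classify_region(header, succs, preds):
    """Classify a two-way branch at `header` as a triangle or diamond."""
    if len(succs[header]) != 2:
        return None
    a, b = succs[header]

    def side_ok(arm, merge):
        # arm must be single-entry and fall straight into the merge block
        return preds[arm] == [header] and succs[arm] == [merge]

    # Triangle: one arm is empty -- header falls through directly to merge
    if side_ok(a, b):
        return ("triangle", a, b)      # a = then-block, b = merge
    if side_ok(b, a):
        return ("triangle", b, a)
    # Diamond: both arms are single-entry and feed the same merge block
    if succs[a] == succs[b] and len(succs[a]) == 1:
        merge = succs[a][0]
        if preds[a] == [header] and preds[b] == [header]:
            return ("diamond", a, b, merge)
    return None
```

The real pass does the equivalent checks through the predecessor-count arithmetic in sub_137E3A0 rather than explicit set comparisons.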
Region Analysis -- sub_137E3A0
This function (sub_137E3A0, 367 bytes) validates that a basic block is part of a valid if-conversion candidate. It checks:
- Predecessor count: The merge block must have exactly header_predecessor_count + 1 predecessors.
- Terminator type: The header's terminator must match opcode 95 after masking bits 12-13 (STS in the ROT13 name table; used here as a control-flow terminator class marker, not an actual store-shared instruction).
- Branch predicate: The branch guard must be a non-negated register operand (type field (>>28)&7 == 1) from the predicate register file (register file type checked against the state's expected file types 2 or 3, corresponding to R or UR).
- No backedges: The predecessor list must not contain a self-edge.
- Merge block successor check: Validates that the merge block's sole successor leads to the expected continuation block.
// Pseudocode for sub_137E3A0
bool isTriangleDiamondCandidate(state, bb):
pred_count = bb->predecessor_count // at bb+144
if pred_count == 0: return false
preds = bb->predecessor_list // at bb+128
if preds == NULL: return false
if preds->next != NULL: return false // must be single-entry
header = bb_array[preds->block_index]
if header->predecessor_count + 1 != pred_count:
return false
terminator = header->first_instr
opcode = terminator->opcode & 0xFFFFCFFF // mask bits 12-13
if opcode != 95: return false // opcode 95 = STS in ROT13 table; used as control-flow terminator class
// Extract branch predicate from last operand
last_op_idx = terminator->num_operands - ((opcode >> 11) & 2) - 2
pred_operand = terminator->operands[last_op_idx]
if operand_type(pred_operand) != 1: return false // must be register
if pred_operand is negated: return false
reg_file = get_register_file(pred_operand)
if reg_file != state->expected_file: return false
// Check successor list for backedges
for each successor of header:
if successor == bb: continue
if other_successor exists: return false // at most one other
return true
Region Scanning -- sub_137D990
This function (1,270 bytes) walks all instructions in a candidate block, counting them and checking each for predicability. It builds a cost model:
Per-Instruction Checks
For each instruction in the candidate block:
- Already-predicated check (opcode bit 12 = 0x1000): Instructions that already carry a predicate guard are flagged via state+48 for special handling.
- MOV counting (opcode 130): Instructions with opcode 130 (HSET2 in the ROT13 name table; the code treats this value as an internal marker for MOV-like operations) that match specific operand patterns increment a separate MOV counter at state+4, used to adjust profitability thresholds. The actual SASS MOV instruction is opcode 19.
- Predicable instruction check (sub_137D8B0): Each instruction is tested via the SM backend's canPredicate vtable method at sm_backend+1424. Instructions that cannot be predicated (atomics, certain memory operations, barriers) cause the scan to fail.
- Primary memory load classification: For load instructions (opcode 125 after masking), the memory space is queried via sub_91C840. The internal category number is tested against bitmask 0x90E ((1 << category) & 0x90E), which selects the five primary data memory spaces: .shared (1), .local (2), .const (3), .tex (8), .global (11). When a load targets one of these spaces, the has_primary_memory_load flag is set at candidate+12, which affects profitability thresholds in the heuristic. See the Memory Space Classification for Predication section for the full bitmask decode.
- Extra-latency check: Instructions matching opcodes in the set {22, 23, 41, 42, 55, 57, 352, 297} (long-latency operations including texture, surface, and certain memory ops) have their latency contribution tallied at state+16 via the SM backend's getExtraLatency method at sm_backend+1392.
- Predicate-register conflict: If any destination operand writes to the same predicate register that the branch uses as its guard, the region cannot be if-converted (the predicate would be clobbered before all instructions are guarded).
- Instruction count limit: The non-MOV instruction count at state+8 is compared against a threshold from the state object. If it is exceeded and the block is not marked as "must-predicate" (state+20), the scan returns failure.
// Pseudocode for sub_137D990
bool analyzeRegion(state, candidate):
bb = candidate->basic_block
if bb->flags & 2: return false // block excluded
first_instr = bb->first_instruction
// Check if first instruction is speculative-safe
if isSpeculativelyUnsafe(first_instr, context):
candidate->has_unsafe = first_instr
// Extract branch predicate register index
header = bb_array[bb->predecessor->block_index]
terminator = header->first_instruction
branch_pred_idx = extractPredicateIndex(terminator)
// Walk all instructions in the block
for instr = first_instr; instr != bb->tail; instr = instr->next:
// Track already-predicated flag
candidate->has_predicated |= (instr->opcode & 0x1000) != 0
// Count MOVs
if isMOV(instr) and matchesMOVPattern(instr):
candidate->mov_count++
// Check speculation safety for uniform operands
if state->has_uniform_speculation:
check uniform register SSA chain
// Check predicability via backend
if not canPredicateInstruction(state, instr, header):
fail with "too many instructions"
// Primary memory load classification (0x90E bitmask)
if isLoadOp(instr):
space = getMemorySpace(instr)
if space is in {shared, local, const, tex, global}:
candidate->has_primary_memory_load = true
// Extra latency accounting
if isLongLatencyOp(instr):
candidate->extra_latency += getExtraLatency(instr)
// Count non-trivial instructions
if not isMOVPHI(instr): // opcode 263 = MOV.PHI
candidate->instr_count++
if not candidate->must_predicate:
if candidate->instr_count > state->threshold:
return false
// Check for predicate-register clobber
for each destination operand:
if dest is register and dest index == branch_pred_idx:
return false
candidate->complete = true
return true
Profitability Heuristic -- sub_1380BF0
The profitability decision (sub_1380BF0, 1,055 bytes) is the most complex part of the pass. It considers multiple factors to decide whether converting a branch region to predicated code is profitable.
Decision Flow
sub_1380BF0(state, true_side, false_side, is_reverse, result):
result = false
// 1. Texture-bearing region check
if true_side->has_predicated:
if not CheckKnob579(knob_state):
return false
// 2. Must-predicate override
if true_side->must_predicate:
return true
// 3. CONV.ALLOC check
if state->has_conv_alloc:
if not (bb->flags & 8) or not state->flag_byte76:
return false
// 4. Branch-predicate matching
// Check if the branch condition matches a known pattern
// (SEL instruction producing the predicate)
header_terminator = state->header->first_instruction
pred_operand = extractLastPredicate(header_terminator)
if predicateMatchesSELPattern(pred_operand):
return true
// 5. False-side memory load check
if false_side->has_primary_memory_load:
return sub_137F800(...) // speculation safety analysis
// 6. Extra-latency penalty
if CheckKnob260(knob_state):
if true_side->extra_latency > 0 and false_side->extra_latency > 0:
return false // both sides have long-latency ops
// 7. Size-based thresholds (main heuristic)
instr_count = true_side->instr_count
if true_side->has_primary_memory_load:
// Memory loads route to extended diamond analysis
return sub_137FE10(...) // extended diamond analysis
mov_count = true_side->mov_count
if mov_count <= state->mov_threshold:
if state->flag_byte76:
// Uniform-speculation-aware thresholds
if true_side->has_predicated:
return state->uniform_tex_limit >= instr_count
else:
return state->uniform_limit >= instr_count
else:
if true_side->has_predicated:
return state->tex_limit >= instr_count
else:
return state->base_limit >= instr_count
and (true_extra <= 2 or false_extra <= 2)
// 8. Fallback: combined size check
combined = true_side->instr_count + false_side->instr_count
if state->combined_limit < instr_count and combined > state->threshold:
return false
// 9. False-side memory loads boost profitability
if false_side->has_primary_memory_load:
return true // scheduling overlap benefit
return sub_1380810(...) // fall-through block analysis
Threshold Fields
The state object contains multiple instruction-count thresholds, initialized by the scheduler backend during sub_1381CD0:
| State offset (as int32 index) | Field | Typical role |
|---|---|---|
[8] | base_limit | Maximum instructions for simple (non-textured, non-uniform) regions |
[9] | tex_limit | Maximum instructions for textured regions (without uniform speculation) |
[10] | uniform_limit | Maximum instructions with uniform-speculation enabled |
[11] | uniform_tex_limit | Maximum for textured + uniform-speculation regions |
[12] | threshold | Hard ceiling on non-MOV instruction count |
[13] | combined_limit | Maximum for combined (both-sides) instruction count |
[14] | fallthrough_limit | Threshold for fall-through block extension |
[15] | extended_limit | Threshold for extended diamond regions |
[16] | mov_threshold | MOV count below which standard limits apply |
[17] | mov_limit | MOV-specific threshold |
These values are architecture-specific -- the scheduler backend's vtable method at offset 1296 initializes them based on the SM target and optimization level.
Instruction Predication -- sub_9324E0
Once a region passes the profitability check, each instruction in the region is predicated. The predication is performed by sub_9324E0 (280 bytes), which transforms each instruction by adding a predicate guard operand.
Transformation Rules
For a non-branch instruction with opcode op:
- Copy the operand array, appending the guard control word as the penultimate operand and the predicate register as the new last operand (matching the encoding example below and the Ori IR operand layout).
- Set bit 12 of the opcode (op | 0x1000) to mark the instruction as predicated.
- Special case for opcode 188: remapped to 190.
- Special case for opcode 93 (OUT_FINAL in the ROT13 name table; used here as a branch marker): replaced with opcode 95 (STS in the ROT13 name table; used here as a conditional-select construct), not simply predicated.
- Emit the new instruction via sub_92C240, which creates the replacement in the code list.
- Transfer debug info: *new_instr+32 = *old_instr+32 (debug location).
- Delete the original instruction via sub_9253C0.
// Predicate guard encoding in operand word:
// guard_pred = predicate_reg_index | 0x60000000
// (type field 3 = 0x6000_0000 >> 28, register index in low 24 bits)
//
// Example: @P2 IADD3 R0, R1, R2, RZ
// Original IADD3 operands: [R0_def, R1, R2, RZ]
// Predicated operands: [R0_def, R1, R2, RZ, guard_word, P2 | 0x60000000]
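The opcode and operand rewrites condense into a short Python sketch. The helper names are ours, and whether the 188-to-190 remap also sets bit 12 is an assumption; the opcode values and the guard-word packing are from the recovered rules above.

```python
PREDICATED_BIT = 0x1000        # bit 12 marks a predicated instruction

def predicate_opcode(op: int) -> int:
    """Opcode rewrite applied when guarding an instruction (internal
    ptxas opcode values, not SASS encodings)."""
    if op == 188:              # special case: 188 is remapped to 190
        op = 190
    if op == 93:               # branch marker: replaced by the conditional-
        return 95              # select construct (95), not simply predicated
    return op | PREDICATED_BIT

def guard_operand(pred_reg: int) -> int:
    """Pack a guard-predicate operand word: high type bits plus the
    register index in the low 24 bits, per the recovered layout."""
    return 0x60000000 | (pred_reg & 0xFFFFFF)
```

For the @P2 IADD3 example above, guard_operand(2) produces the 0x60000002 word appended to the operand array.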
Already-Predicated Instructions
Instructions that already have a predicate guard (bit 12 set in original opcode) are handled by sub_9321B0, which must compose the existing predicate with the new guard using a predicate-AND or predicate-SEL operation rather than simply replacing the guard.
Post-Transformation -- sub_137DE90
After predicating all instructions in a region, sub_137DE90 (1,286 bytes) performs cleanup:
- Bitvector maintenance: For each register operand in the predicated instructions, checks whether the register is live in the dominator's bitvector (at context+832). If the register is newly defined under the predicate, marks it in the bitvector via sub_BDBB80. This ensures later liveness analysis accounts for the conditionally-defined values.
- Per-instruction predication: Walks the block's instruction list and calls sub_9324E0 on each instruction, passing the predicate register index and the guard operand word.
- Predicate register tracking: If any register was newly exposed to the bitvector, and the guard predicate is a non-negated register operand, marks the predicate register's descriptor at +76 with bit 0 set and increments a counter at state+200.
- Cleanup: Resets the per-block tracking arrays (stored at state[27]/state[56..57]) that track which registers were bitvector-updated during this region.
Speculative Execution Safety -- sub_137EE50
After the main if-conversion, sub_137EE50 (969 bytes) performs a secondary scan to identify instructions that were speculatively moved above their original control-flow guard. This function:
- Checks the global predication flag at context+1412 and the per-function flag at context+1392 bit 0. If the function already has speculated instructions from a prior pass, returns immediately.
- Scans the true-side block for load instructions to global or surface memory (opcodes 183 and 288 after masking). For each such load, queries the memory space via sub_91C840 and checks whether space type 18 (.surf/tensor extended) could be accessed.
- Records speculatively unsafe instructions in a tracking hash set (at state+240), used by later passes to insert appropriate guard instructions or to avoid further speculation.
- Scans the false-side block with the same logic.
The post-predication speculation safety check targets exclusively category 18 (.surf/tensor extended, sm_90+). This is the only memory space that sub_137EE50 treats as requiring speculative-unsafe tracking; global loads and texture loads are considered acceptable for speculative execution in the predication cost model.
Memory Space Classification for Predication
The bitmask 0x90E appears in five functions within the predication pass (sub_137D990, sub_137F560, sub_137F220, sub_137FB60, sub_1380810). All five use the identical test pattern:
category = sub_91C840(operand); // classify memory space
if (category <= 0xB && ((1LL << category) & 0x90E) != 0)
// load targets a primary data memory space
Bitmask Decode
0x90E = binary 1001 0000 1110 -- bits {1, 2, 3, 8, 11} are set.
| Bit | Category | PTX Space | In 0x90E? | Role in predication |
|---|---|---|---|---|
| 0 | 0 | Generic (unqualified) | No | Unresolved address space -- cannot be classified, excluded |
| 1 | 1 | .shared | Yes | CTA-scope scratchpad; always mapped for executing CTA; 20--30 cycle latency |
| 2 | 2 | .local | Yes | Thread-private stack/frame; always mapped; backed by L1/L2 |
| 3 | 3 | .const | Yes | Constant bank (c[bank][offset]); loaded by driver before launch; always mapped |
| 4 | 4 | .param | No | Kernel parameter memory; typically constant-folded or register-promoted by earlier passes |
| 5 | 5 | .const (extended) | No | Extended constant path (PTX inputs 21, 22); different scheduling model |
| 6 | 6 | .global (extended) | No | Extended global variant (PTX input 20); different scheduling model |
| 7 | 7 | Spill space | No | Compiler-generated register spill/fill; handled separately by regalloc |
| 8 | 8 | .tex | Yes | Texture memory; high latency (200+ cycles); texture cache always valid when bound |
| 9 | 9 | Special (opcode-dep.) | No | Ambiguous classification from case-18 sub-switch in sub_91C840 |
| 10 | -- | (unused) | No | No memory space maps to category 10 |
| 11 | 11 | .global | Yes | DRAM-backed global memory; highest latency (300+ cycles) |
Categories 12--18 (code/function, uniform, register file, surface, surface/tensor extended) all exceed the <= 0xB range check and are excluded from the bitmask test automatically.
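The range-check-then-mask pattern is easy to validate end to end. A minimal Python sketch (the function name is ours; the constant and the test shape are the recovered pattern):

```python
PRIMARY_MEMORY_MASK = 0x90E    # bits {1, 2, 3, 8, 11} set

def is_primary_memory_load(category: int) -> bool:
    """The shared test from the five predication functions: range check
    first, then bitmask membership. Categories above 0xB (code, uniform,
    register file, surface variants) fail the range check outright."""
    return category <= 0xB and ((1 << category) & PRIMARY_MEMORY_MASK) != 0
```

Enumerating every category recovers exactly the five primary spaces (.shared, .local, .const, .tex, .global), confirming the bitmask decode above.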
What the Bitmask Selects
The five selected categories -- shared, local, const, texture, global -- are the primary data memory spaces: the ones that involve real data movement through the GPU memory hierarchy and carry meaningful scheduling latency. These are the loads a scheduler can profitably overlap with predicated computation.
The excluded categories are either:
- Unresolvable (generic -- could be anything)
- Non-load in practice (param -- folded away, code -- function pointers)
- Compiler-internal (spill, special -- the compiler already knows how to handle these)
- Out of range (register file, uniform, surface, surface/tensor -- categories > 11)
How the Bitmask Affects Profitability
The bitmask test does NOT directly determine speculation safety. It sets a has_primary_memory_load flag at candidate offset +12, which the profitability heuristic (sub_1380BF0) uses in three ways:
- True-side memory loads (a2+12 set): The profitability check routes to the extended diamond analysis (sub_137FE10) instead of the standard size-threshold path. This allows larger regions to be if-converted when they contain meaningful loads.
- False-side memory loads -- speculation guard (a3+12 set): If the false side has memory loads AND the SM backend's speculation policy (vtable at sm_backend+1200) allows it, the detailed speculation analysis (sub_137F800) is invoked. If that analysis flags the loads as risky, predication is rejected.
- False-side memory loads -- profitability boost (a3+12 set, passes safety): If the false side has memory loads and passes safety checks, the profitability heuristic returns true directly (line 166 of sub_1380BF0). The reasoning: if the false-side code contains real memory loads, converting the branch to predicated straight-line code lets the scheduler overlap those loads with other work.
Speculation Safety (Separate Mechanism)
The actual speculation safety tracking is handled by sub_137EE50 (post-predication scan), which uses a different criterion from the 0x90E bitmask:
- Scans both sides for opcodes 183 (LDG) and 288 (STG) after masking
- For each, queries sub_91C840 and checks if category == 18 (.surf/tensor extended)
- Only category 18 loads are tracked as "speculatively unsafe" in the hash set at state+240
- The context+1392 bit 0 flag persists and is checked by OriHoistInvariantsLate (phase 66)
This means global loads (category 11) that are speculatively predicated are not tracked as unsafe. In the ptxas cost model, global memory loads under a predicate guard are considered acceptable: the hardware will issue the load speculatively, and if the predicate is false, the result is simply discarded. On architectures with memory access traps (e.g., page faults on unmapped addresses), the hardware masks the fault for lanes where the predicate is false. Surface/tensor extended operations (category 18), however, may have side effects that cannot be masked, so they receive the unsafe designation.
Fall-Through Block Analysis -- sub_1380810
When the standard profitability check is inconclusive, sub_1380810 (980 bytes) analyzes the fall-through continuation of the merge block. The idea: even if the region itself is borderline, if the code immediately after the merge point contains long-latency operations (loads, texture fetches), the predicated version may be better because the scheduler can overlap the predicated instructions with those long-latency operations.
The function walks instructions in the merge block's successor(s), using the same 0x90E bitmask test to identify primary-data-memory loads. Non-load instructions are checked via the SM backend's vtable at sm_backend+1824. The function counts:
- Primary-memory-space loads (via the 0x90E mask)
- Other long-latency operations (via the backend vtable check)
- Total instruction count
If the fall-through region contains enough long-latency work (compared to state->fallthrough_limit and state->extended_limit), the function returns true, indicating that predication is profitable despite the region being above the standard size threshold.
Extended Diamond Analysis -- sub_137FE10
For complex diamonds where one side has primary-memory loads that affect profitability thresholds, sub_137FE10 (2,550 bytes) performs a more thorough analysis. It can "look through" the diamond to the merge block and even one block beyond, checking whether the instruction mix in the continuation makes predication worthwhile. It invokes sub_137F560 (which also uses the 0x90E bitmask) to scan continuation blocks for scheduling-relevant loads.
The function also handles the case where the merge block falls through to another conditional branch that itself is a predication candidate -- effectively analyzing a chain of adjacent diamonds.
Interaction with Later Passes
The predication pass is positioned to maximize the benefit of subsequent passes:
| Phase | Name | Interaction |
|---|---|---|
| 64 | LateOriCommoning | Predication may create duplicate computations on both sides of the original branch. Commoning eliminates these by recognizing that @P0 IADD3 R0, R1, R2, RZ and @!P0 IADD3 R0, R1, R2, RZ with the same inputs can be merged into an unconditional instruction. |
| 65 | GeneralOptimizeLate2 | The copy propagation and constant folding sub-passes clean up the predicated code: dead predicate definitions, redundant MOVs introduced by the PHI destruction at merge points, and constant-foldable predicates. |
| 66 | OriHoistInvariantsLate | Predication can convert loop-varying branches into predicated straight-line code. LICM then hoists any newly-exposed loop-invariant computations. |
| 69 | OriDoRemat | Predicated instructions that define values used far from their definition are candidates for rematerialization, reducing register pressure. |
| 70 | OriPropagateVaryingSecond | After predication changes the control flow, varying annotations must be recomputed. The second varying-propagation pass updates which values are uniform vs. divergent. |
The context+1392 bit 0 flag set by sub_137EE50 persists through these passes and is checked by OriHoistInvariantsLate to avoid hoisting speculatively-unsafe instructions out of their guarded context.
Key Functions
| Address | Size | Function | Role |
|---|---|---|---|
| sub_1381DA0 | 1,517 B | OriDoPredication::execute | Phase entry point; gating, setup, cleanup |
| sub_1381CD0 | 206 B | runPredicationDriver | Iterative driver; calls main loop up to 3 times |
| sub_1381010 | 3,249 B | predicationMainLoop | RPO walk, region identification, transformation dispatch |
| sub_137E3A0 | 367 B | isTriangleDiamondCandidate | CFG pattern validation |
| sub_137D990 | 1,270 B | analyzeRegion | Per-block instruction scan, cost modeling |
| sub_137D8B0 | 209 B | canPredicateInstruction | Single-instruction predicability check |
| sub_1380BF0 | 1,055 B | evaluateProfitability | Multi-factor profitability decision |
| sub_137FE10 | 2,550 B | analyzeExtendedDiamond | Extended diamond and chain analysis |
| sub_137F800 | 864 B | analyzeSpeculationSafety | Speculation safety for side-effect loads |
| sub_1380810 | 980 B | analyzeFallThrough | Fall-through block continuation analysis |
| sub_137EE50 | 969 B | markSpeculativeInstructions | Post-transformation speculative-load tracking |
| sub_137DE90 | 1,286 B | applyPredication | Instruction rewriting and bitvector update |
| sub_137FB60 | 687 B | classifyInstruction | Per-instruction classification during walk |
| sub_137F560 | 665 B | scanBlockForUnsafe | Block scan for speculative safety |
| sub_137F220 | 828 B | classifyInstructionExtended | Classification with bitvector tracking |
| sub_137E510 | 2,360 B | moveInstructionsToHash | Instruction movement during transformation |
| sub_9324E0 | 280 B | predicateInstruction | Adds predicate guard to single instruction |
| sub_9321B0 | ~800 B | predicateAlreadyGuarded | Handles already-predicated instructions |
| sub_92C240 | (shared) | createInstruction | Instruction builder (shared utility) |
SASS Predicate Model
NVIDIA SASS provides 7 usable predicate registers (P0--P6) plus the hardwired always-true register PT. Every instruction in the SASS ISA can optionally carry a predicate guard:
@P0 IADD3 R0, R1, R2, RZ // executes only if P0 is true
@!P2 FMUL R3, R4, R5 // executes only if P2 is false
FADD R6, R7, R8 // unconditional (implicit @PT)
Predicate conditions are set by comparison instructions:
ISETP.GT.AND P0, PT, R1, R2, PT // P0 = (R1 > R2) AND PT
FSETP.LT.AND P1, P2, R3, R4, PT // P1 = (R3 < R4), P2 = !(R3 < R4)
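The dual-destination semantics can be made concrete with a toy model. This is an illustrative sketch, not recovered code (`setp_and` is an invented name); it assumes the `.AND` combine mode applies the input predicate to both destinations, which matches the annotation on the FSETP example above.

```python
# Toy model of SETP dual-result semantics (illustrative only; assumes .AND
# applies the input predicate to both the result and its complement).
def setp_and(cmp_result: bool, input_pred: bool):
    p = cmp_result and input_pred          # first destination predicate
    q = (not cmp_result) and input_pred    # second destination: complement
    return p, q

# FSETP.LT.AND P1, P2, R3, R4, PT with R3=1.0, R4=2.0 (PT is always true):
p1, p2 = setp_and(1.0 < 2.0, True)
assert (p1, p2) == (True, False)
```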
Uniform predicates (UP0--UP6, UPT) are the warp-uniform variant available on sm_75+. When all threads in a warp have the same predicate value, using UP instead of P avoids consuming a per-thread predicate register and enables the hardware to skip the entire instruction rather than masking per-thread.
In the Ori IR, predicate operands are encoded with type field 5 (bits 28-30 of the packed operand word). The guard predicate is appended as a pair of extra operands: the guard control word (type 3, 0x60000000 | reg_index) followed by the predicate register operand itself.
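A small sketch of this packed layout, assuming the bit positions stated above (type in bits 28-30, register index in the low 24 bits, as used elsewhere in this wiki). The helper names are invented for illustration; only the constants come from the decompilation.

```python
# Sketch of the Ori packed operand word layout described above.
# Assumptions: type field in bits 28-30, register index in the low 24 bits.
TYPE_SHIFT, TYPE_MASK = 28, 0x7
REG_MASK = 0xFFFFFF

def operand_type(word: int) -> int:
    """Extract the 3-bit operand type field."""
    return (word >> TYPE_SHIFT) & TYPE_MASK

def make_predicate_operand(reg_index: int) -> int:
    # Type field 5 marks a predicate operand per the text above.
    return (5 << TYPE_SHIFT) | (reg_index & REG_MASK)

def make_guard_control(reg_index: int) -> int:
    # Guard control word exactly as quoted above: 0x60000000 | reg_index.
    return 0x60000000 | (reg_index & REG_MASK)

word = make_predicate_operand(3)   # predicate register P3
assert operand_type(word) == 5
assert word & REG_MASK == 3
assert make_guard_control(3) == 0x60000003
```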
Opcode Reference
Key opcodes referenced by the predication pass (after BYTE1 &= 0xCF masking to clear bits 12-13):
| Value | Mnemonic | Role in predication |
|---|---|---|
| 93 | OUT_FINAL | ROT13 name is OUT_FINAL; used here as a conditional branch marker -- the instruction being eliminated. Actual SASS BRA is opcode 67. |
| 95 | STS | ROT13 name is STS; used here as the branch terminator class marker and conditional-select replacement target. Actual SASS EXIT is opcode 77. |
| 97 | STG | ROT13 name is STG; used here as a block boundary sentinel for scan termination. Actual SASS CALL is opcode 71. |
| 125 | LD (variant) | Load -- checked for speculative safety |
| 130 | HSET2 | ROT13 name is HSET2; used here as an internal marker for MOV-like instructions counted separately for profitability. Actual SASS MOV is opcode 19. |
| 183 | LDG | Global load -- speculative-unsafe |
| 188 | (variant) | Remapped to 190 when predicated |
| 263 | MOV.PHI | SSA phi -- not counted in instruction totals |
| 286 | CONV.ALLOC | Convergence allocation marker -- special handling in profitability check |
| 288 | STG | Global store -- speculative-unsafe |
| 352, 297 | (long-latency) | Texture/surface ops -- extra latency penalty |
Cross-References
- Pass Inventory -- phase 63 in the 159-phase table
- IR Overview -- Ori instruction format, operand encoding, register files
- Copy Propagation & CSE -- phase 64 (LateOriCommoning) runs immediately after
- GeneralOptimize Bundles -- phase 65 cleans up after predication
- Loop Passes -- phase 66 (OriHoistInvariantsLate) hoists newly exposed invariants
- Rematerialization -- phase 69 (OriDoRemat) handles increased register pressure
- Liveness Analysis -- liveness rebuilt at entry, bitvectors maintained during transformation
- Knobs System -- knobs 260, 487, 577, 579, 582 control predication behavior
- Scheduling -- scheduler backend initializes profitability thresholds
Rematerialization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Rematerialization is the compiler technique of recomputing a value near its use instead of keeping the original definition live across a long range. In ptxas, rematerialization is implemented through three cooperating pipeline phases and tightly integrated with the register allocator's spill-vs-remat decision logic. On GPUs, where register pressure directly determines occupancy and therefore throughput, aggressive rematerialization is one of the most performance-critical optimizations in the entire pipeline.
| Phase 28 | SinkRemat -- sinks instructions closer to uses, marks remat candidates |
| Phase 54 | OriDoRematEarly -- sets remat mode flag (ctx+1552 = 4) |
| Phase 69 | OriDoRemat -- late rematerialization after predication and fusion |
| Address range (phase 28) | Execute: sub_C5FC20, core: sub_913A30 -> sub_A0F020 |
| Address range (phase 69) | Execute: sub_C5F910, core: sub_A112C0 -> sub_A11060 -> sub_A107B0 |
| Minimum opt level | Phase 28: requires level > 4 (knob 487); Phase 69: requires level > 1 |
| Operand kind 7 | "Remat" marker in the Ori IR operand classification |
| Vreg flags (offset +80) | 0x80000001 = remat candidate; 0x80000007 = remat with predication; 0x80000008 = remat committed |
| Regalloc integration | sub_93AC90 (remat check), sub_99A9D0/sub_99AA50 (range remat cost) |
| DUMPIR name | SinkRemat, OriDoRematEarly, OriDoRemat |
Why Rematerialization Matters on GPUs
On NVIDIA GPUs, register count per thread inversely determines the number of concurrent warps (occupancy). Each additional register consumed by a kernel reduces the number of warps that can be resident on an SM. Since GPU performance depends on hiding memory latency through massive parallelism, even a single extra register can measurably degrade throughput.
Rematerialization trades instruction count for register pressure reduction. Instead of keeping a computed value alive in a register from its definition to its last use, the compiler recomputes it where needed. This is profitable when:
- The original instruction is cheap (single-cycle ALU: IADD, IMAD, MOV, SEL, LOP3, SHF)
- All source operands are still available at the use point (not overwritten)
- The live range of the result is long enough to actually cause register pressure
- The instruction has no side effects (no memory writes, no barrier interactions)
On GPUs, the cost-benefit tradeoff is skewed much further toward remat than on CPUs. A single spill/refill pair (STL + LDL) costs 20--100 cycles of local memory latency, while a rematerialized IADD costs 1 cycle. More importantly, the spill itself consumes a register for the address computation, potentially cascading into more spills.
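The occupancy arithmetic described above can be sketched directly. The figures used here (a 64K-entry per-SM register file, 32 threads per warp, a 32-warp residency cap) are illustrative assumptions about a Turing-class SM, not values recovered from the binary, and the model ignores the hardware's real register allocation granularity:

```python
# Illustrative occupancy model (assumed figures, not recovered from ptxas):
# 64K-entry register file per SM, 32 threads/warp, 32-warp cap, no rounding
# to allocation granules.
REGFILE = 65536
THREADS_PER_WARP = 32
MAX_WARPS = 32

def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit on one SM given per-thread register usage."""
    return min(MAX_WARPS, REGFILE // (regs_per_thread * THREADS_PER_WARP))

assert resident_warps(64) == 32    # exactly fills the register file
assert resident_warps(65) == 31    # one extra register costs a whole warp
assert resident_warps(128) == 16
```

Even under this simplified model, crossing a register boundary drops a full warp of latency-hiding parallelism, which is why the remat cost model is so aggressive.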
Pipeline Position
Phase 23 GenerateMovPhi SSA phi nodes -> MOV instructions
Phase 24 OriPipelining Software pipelining
Phase 25 StageAndFence Memory fence insertion
Phase 26 OriRemoveRedundantBarriers
Phase 27 AnalyzeUniformsForSpeculation
Phase 28 SinkRemat *** Sink + remat candidate marking ***
Phase 29 GeneralOptimize Bundled mid-level optimizations
...
Phase 53 OriPropagateVaryingFirst
Phase 54 OriDoRematEarly *** Sets remat mode flag ***
Phase 55 LateExpansion
...
Phase 63 OriDoPredication If-conversion (creates new opportunities)
...
Phase 66 OriHoistInvariantsLate
Phase 67 DoKillMovement
Phase 68 DoTexMovement
Phase 69 OriDoRemat *** Late rematerialization ***
Phase 70 OriPropagateVaryingSecond
The three-phase design is deliberate:
- Phase 28 (early): Runs after SSA construction and pipelining but before the main optimization passes. Sinks instructions closer to their uses and identifies candidates. This is the most complex of the three phases.
- Phase 54 (mode setter): A trivial phase that writes 4 to ctx+1552 (the pipeline progress counter), signaling to downstream passes that rematerialization mode is active. Its isNoOp() returns 1 in the default vtable, meaning the dispatch loop skips its execute() by default. The phase is only active when an architecture backend overrides the vtable to return 0, at which point the single-store execute body runs.
- Phase 69 (late): Runs after predication (phase 63) and loop fusion (phase 59), which restructure control flow and create new rematerialization opportunities that did not exist at phase 28 time. Also runs after OriHoistInvariantsLate (phase 66), which may have extended live ranges by hoisting invariants.
Phase 28: SinkRemat
Entry and Guard Logic
The execute function (sub_C5FC20) applies two layers of gating:
function SinkRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_913A30(ctx) // actual implementation
sub_913A30 (131 lines) performs additional checks before invoking the core:
- Optimization level >= 5: Required for the full sink+remat pass
- Knob 487: Must be enabled (queried via vtable+152 dispatch on ctx+1664)
- Cutlass detection (sub_8F47E0): Checks if the function name contains "cutlass" via strstr(). Cutlass kernels receive special treatment
- Flag check (ctx+1368 bit 0): Must be set (compilation is in SSA window)
- Feature flags (ctx+1376): Must have bit 26 set (0x4000000) but NOT bit 53 (0x20000000000000) simultaneously
When the cutlass flag (ctx+1381 bit 6) is set, the pass enters an iterative mode:
function sub_913A30(ctx):
if opt_level <= 4:
return
if not knob_enabled(487):
return
is_cutlass = function_name_contains("cutlass")
if not (flag_byte(ctx+1368) & 1):
return
if not is_cutlass and not (flag_byte(ctx+1381) & 0x40):
return
// Feature flag gating
features = *(ctx+1376) & 0x20000004000000
if features != 0x4000000:
return
// Cutlass iterative mode
if flag_byte(ctx+1381) & 0x40:
max_iters = 5 // default
if hw_config->field_62064: // architecture-specific override
max_iters = getKnob(862) // configurable iteration limit
if max_iters <= 0: goto sinkRemat_core
for iter in 0..max_iters:
sub_8F5220(&state, ctx) // initialize iteration state
changed = sub_911030(&state, iter) // core sink+remat
if not changed or sub_8F59C0(&state): // convergence check
break
sub_8F5AD0(&state) // update state for next iter
sub_909A20(&state) // propagate changes
// clean up 4 bitvectors + 2 hash tables
return
// Non-cutlass path: single invocation
sinkRemat_core:
if is_cutlass:
// Instruction count limit check
if *(ctx+1584)->field_372 > 0x7FFF:
// Warn via vtable dispatch (diagnostic knob 356, severity 2)
sub_A0F020(ctx) // CORE: sink + remat driver
vtable_callback() // post-processing hook
sub_781F80(ctx, 1) // rebuild liveness
sub_8F4820(ctx, &worklist) // build remat worklist
// Process worklist in reverse order
for item in worklist (descending):
sub_8F4F90(ctx, &item) // apply remat decisions
Core Sink+Remat Driver: sub_A0F020
sub_A0F020 (494 lines) is the main workhorse of phase 28. It operates on the entire function body, processing basic blocks in reverse postorder through the dominator tree.
The algorithm has two main stages:
Stage 1: Per-block sinking analysis (via sub_A06A60 calling sub_A08250)
For each basic block in reverse postorder:
- Walk the instruction list backward
- For each instruction, check if it has a single use in a dominated block
- If so, sink the instruction to the use block (moves the instruction node in the linked list)
- Track whether any changes were made for convergence
Stage 2: Cross-block rematerialization (via sub_A06A60 calling sub_A07DA0)
For each basic block in reverse postorder:
- Walk the instruction list
- For each rematerialization-eligible instruction, check if the cost model approves duplication
- If profitable, clone the instruction at the use site and mark the original's result register with the remat flag
The pass alternates between sinking and rematerialization in a fixed-point loop, repeating until no more changes occur. The two worklist callbacks (sub_A08250 for sinking, sub_A07DA0 for remat) operate on a per-block basis through a generic block visitor (sub_A06A60).
The block visitor manages per-block liveness bitvectors:
- block+16: live-in bitvector
- block+40: live-out bitvector
- block+64: kill set
- block+112: live-through set (computed as intersection of live-in and live-out)
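A toy model of these per-block sets, using Python integers as bitvectors. The class is purely illustrative (the offsets above are fields of the real block object; none of these names are recovered code):

```python
# Toy model of the per-block liveness sets maintained by the block visitor.
# Python ints stand in for the real bitvector objects; bit i = vreg i live.
class Block:
    def __init__(self, live_in=0, live_out=0, kill=0):
        self.live_in = live_in        # block+16
        self.live_out = live_out      # block+40
        self.kill = kill              # block+64
        self.live_through = 0         # block+112

    def update_live_through(self):
        # Live-through = values live on entry AND on exit of the block.
        self.live_through = self.live_in & self.live_out

b = Block(live_in=0b1011, live_out=0b0110, kill=0b0001)
b.update_live_through()
assert b.live_through == 0b0010     # only vreg 1 is live across the block
```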
Phase 69: OriDoRemat (Late Rematerialization)
Entry Logic
The execute function (sub_C5F910) checks the optimization level and delegates:
function OriDoRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_A112C0(ctx, -1) // -1 = process all blocks
Remat State Initialization: sub_A0C310
sub_A112C0 first initializes a rematerialization state object (88+ bytes) via sub_A0C310:
struct RematState {
+0 compilation_ctx* ctx
+8 RefCountedObj* block_worklist // refcounted allocator object
+16 int hash_capacity // hash table for seen blocks
+20 int hash_count
+24 HashNode** hash_buckets
+32 int hash_bucket_count
+72 int iteration_count // convergence counter
+80 RefCountedObj* instruction_pool // for cloned instructions
+88 int pass_id // -1 for late remat
+92 byte requires_tex_check // hw_config->field_17064 == 1
};
The pass ID -1 distinguishes OriDoRemat from OriDoRematEarly. When the hardware configuration at hw_config+17064 is 1 and hw_config+17072 != 0, the requires_tex_check flag is set, enabling additional texture-instruction awareness.
Iterative Remat Loop: sub_A112C0 + sub_A11060
The late remat pass runs in a convergence loop:
function sub_A112C0(ctx, pass_id):
init_remat_state(&state, ctx, pass_id)
// Iterative convergence loop
while sub_A11060(&state) and getOptLevel(ctx) != 1
and sub_785E20(ctx, 0): // instruction budget check
continue
// Cleanup: drain worklist, release refcounted objects
...
Per-Iteration Worker: sub_A11060
Each iteration of sub_A11060 (155 lines) processes the entire instruction list:
function sub_A11060(state):
ctx = state->ctx
sub_7E6090(ctx, 0, 1, 0, 0) // rebuild use-def chains
// Reset all basic block depth markers to 0x80000000 (unvisited)
for bb in ctx->block_list:
bb->field_76 = 0x80000000
// Drain hash table back into instruction pool
drain_hash_table(state)
first_pass = !state->requires_tex_check
changed = false
// Walk instructions in program order
instr = ctx->first_instruction // ctx+280
while instr:
if first_pass:
first_pass = false
while instr:
opcode = instr->opcode & 0xFFFFCFFF
if opcode == 97: // STG in ROT13; used as definition anchor/label marker
changed |= sub_A10DF0(state, instr)
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
else:
// Non-first-pass: skip MOV processing
while instr:
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
if not changed_flag:
goto check_second_pass
// Decrement iteration counter, check convergence
if --state->iteration_count == 0:
return sink_flag
check_second_pass:
if remat_flag and *(ctx+1552) > 4:
// Second pass: walk block list for cross-block opportunities
for bb in ctx->block_list:
if (bb->field_20 & 1) == 0 or bb->size <= 0
or (bb->field_20 & 6) == 6:
continue // skip empty/dead/cold blocks
instr = ctx->first_block_instruction
while instr:
instr = sub_A0C540(state, instr, &changed, ...)
if changed:
// Reset depth markers and loop
continue
--state->iteration_count
return sink_flag
Per-Instruction Remat Worker: sub_A107B0
sub_A107B0 (316 lines) is the core per-instruction decision function called from both phases 28 and 69. It determines whether a specific instruction should be sunk, rematerialized, or left alone.
function sub_A107B0(state, instr, sink_flag_out, changed_out, remat_flag_out,
allow_remat):
// Quick rejection: check if instruction is sinkable
result = sub_A105F0(state, instr, sink_flag_out, changed_out)
if result:
return result // already sunk, done
num_operands = instr->operand_count // at instr+80
if num_operands <= 0:
return 0
// Walk destination operands
for i in 0..num_operands:
operand = instr->operands[i] // at instr+84 + 8*i
operand_type = (operand >> 28) & 7
if operand_type == 7: // barrier register
// Track barrier liveness
continue
if operand_type != 1: // not a GPR destination
continue
// GPR destination operand
if operand < 0: // bit 31 set = definition
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
vreg->flags |= 0x80000001 // mark as remat candidate
if has_predication_flag and last_operand_is_0x20:
vreg->flags |= 0x80000007 // enhanced remat with predication
if sub_A0C410(state, vreg, instr, allow_remat):
// Remat is profitable: clear depth flag, update block assignment
vreg->field_76 = ~instr->block_id
else:
// Not profitable: process as regular definition
// Check for multi-use definitions
...
else:
// Source operand: track liveness contribution
...
return result
Sinkability Check: sub_A105F0
sub_A105F0 (77 lines) determines if an instruction can be sunk to a single-use block. It enforces strict criteria:
- Opcode filter: Only opcode 0x5F (95; STS in the ROT13 name table, used here as a constant/immediate load variant marker) with state->byte_92 clear
- Single-use check via sub_A07940: The instruction must have exactly one use
- Dominator check: The use must be in a block dominated by the definition block
- MOV chain check: If the instruction feeds opcode 93 (OUT_FINAL in ROT13; used here as a MOV-like chain link), verifies through an FNV-1a hash table that the definition matches the expected pattern
- Cost check via sub_A0C4A0: Verifies that sinking reduces pressure (returns the pressure delta)
When sinking succeeds, the instruction is physically moved in the linked list via sub_92E1B0 (insert at new position) and sub_9253C0 (remove from old position).
Rematerialization Eligibility Criteria
The eligibility check spans multiple functions. An instruction is rematerializable if it passes ALL of these filters:
Opcode Whitelist
From sub_911030 and sub_A11060, the eligible opcode set (after masking opcode & 0xFFFFCFFF) is:
| Opcode | Identity | Category |
|---|---|---|
| 22 | IADD/IADD3 | Integer add (1 cycle) |
| 50 | SHF | Funnel shift (1 cycle) |
| 77 | IMAD | Integer multiply-add (1 cycle on modern SM) |
| 83 | ISETP | Integer set-predicate (1 cycle) |
| 93 | OUT_FINAL in ROT13; used as MOV-like marker | Register move (0--1 cycles, often eliminated). Actual SASS MOV is opcode 19. |
| 95 | STS in ROT13; used as constant-load marker | Constant materialization |
| 297 | LOP3 | 3-input logic (1 cycle) |
| 352 | SEL | Conditional select (1 cycle) |
The eligibility bitmask is encoded as 0x2080000010000001 >> (opcode - 22) for opcodes in range [22, 83], with explicit checks for opcodes 297 and 352. This is a compile-time-constant bitmask covering single-cycle ALU instructions.
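The mask can be verified mechanically. Note that a shift window based at opcode 22 tops out at bit 61 (opcode 83), so opcodes 93, 95, 297, and 352 from the table above cannot be represented in it and must be handled by separate comparisons, as the text states for 297 and 352. A quick check (helper names invented):

```python
# Verify that the quoted eligibility mask selects exactly opcodes
# 22 (IADD), 50 (SHF), 77 (IMAD), 83 (ISETP) within the [22, 83] window.
MASK = 0x2080000010000001

def mask_eligible(opcode: int) -> bool:
    """True if the opcode is covered by the shift-window bitmask."""
    return 22 <= opcode <= 83 and (MASK >> (opcode - 22)) & 1 == 1

assert [op for op in range(22, 84) if mask_eligible(op)] == [22, 50, 77, 83]
assert not mask_eligible(297)   # outside the window: explicit check in ptxas
assert not mask_eligible(352)   # outside the window: explicit check in ptxas
```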
Operand Source Liveness
sub_90C010 (70 lines) checks that all source operands are still available (live) at the proposed remat point:
function check_sources_available(state, instr, operand_idx, cost_out):
operand = &instr->operands[operand_idx]
// Immediate operand: always available
if sub_7DEB90(operand, state->ctx):
return 1
// Must be a GPR (type 1) and not a definition (bit 31 clear)
type = (operand->value >> 28) & 7
if type != 1 or (operand->value_high & 1):
return 0
// Check if the source vreg has a single reaching definition
vreg = lookup_vreg(ctx, operand->value & 0xFFFFFF)
single_def = vreg->field_56
if single_def:
return sub_90B790(state, single_def, cost_out, false)
// Multiple definitions: walk the def-use chain
min_cost = UINT_MAX
for def in vreg->def_chain: // at instr->field_64 + 8*operand_idx
cost = sub_90B790(state, def->instruction, cost_out, false)
if cost == 0:
return 0 // any unavailable source -> reject
// For rematerializable defs, add depth cost
if def is rematerializable:
cost += (def->block_depth <= instr->block_depth) ? 1 : 0
min_cost = min(min_cost, cost)
return min_cost
Cost Model: sub_90B790
sub_90B790 (large function, ~350 lines) implements the core cost/benefit analysis. It returns a non-negative integer cost where:
- 0 = not profitable, do not rematerialize
- 1+ = profitable, higher values indicate cheaper remat
The function considers:
- Opcode-specific register consumption: Different opcodes produce different register-type results. sub_7E36C0, sub_7E40E0, sub_7E3790, sub_7E3800, and sub_7E3640 extract per-operand register class (R/P/UR/UP) and width
- Live range length: Longer live ranges benefit more from remat
- Use count: Multiple uses may require multiple remat copies -- still profitable if the live range is long enough
- Block depth: Instructions in deeper loop nests get higher remat cost thresholds since the duplicated instruction executes more frequently
- Predication state: Predicated instructions have additional constraints on remat safety
- Pre-existing flags: If vreg+80 already has 0x80000001 set, the register is already a remat candidate
Cross-Block Rematerialization: sub_A0C540
sub_A0C540 (228 lines) handles rematerialization across basic block boundaries, invoked in the second pass of sub_A11060. It processes definitions that are used in different blocks:
function cross_block_remat(state, instr, changed_out):
// Walk operands in reverse order (destinations first)
for i in (instr->operand_count - 1) downto 0:
operand = instr->operands[i]
if (operand >> 28) & 7 != 1: // not a GPR
continue
if (operand_high & 1): // skip source operands
continue
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
if (vreg->flags_48 & 0x22) != 0: // skip special vregs
continue
if vreg->reg_index in [41..44]: // skip architectural predicates
continue
flags80 = vreg->field_80
if not (flags80 & 1): // not a remat candidate
continue
if vreg->use_count <= 0:
continue
if (flags80 & 2) and (flags80 & 4): // already fully processed
continue
// Compute instruction-level remat cost
cost = sub_91E860(ctx, instr, i)
if operand < 0: // definition
if cost <= 3:
vreg->field_80 |= 0x80000008 // commit remat
continue
// Remat profitable: insert remat copy
adjust_pressure(state, instr, -1) // sub_A0C4A0
duplicate_at_use(ctx, instr) // vtable dispatch +1280
adjust_pressure(state, instr, +1)
vreg->field_80 |= 0x80000008
vreg->flags_48 &= ~0x300000 // clear live-through bits
// Rebuild interference for affected ranges
adjust_pressure(state, instr, -1)
sub_92C0D0(ctx, instr, 0, ...) // clone instruction at use
adjust_pressure(state, instr, +1)
*changed_out = 1
Interaction with Register Allocator
The rematerialization flags set during phases 28 and 69 are consumed by the fat-point register allocator in several ways:
Remat Detection During Assignment: sub_93AC90
During per-instruction register assignment (sub_9680F0, 3722 lines), the allocator calls sub_93AC90 (29 lines) to check if a virtual register is a rematerialization candidate:
function check_remat_opportunity(alloc, vreg_index, reg_class):
if alloc->vreg_count == 0:
BUG()
entry = hash_lookup(alloc->remat_table, vreg_index)
cost = entry->field_144[reg_class]
if cost < entry->threshold:
return true
return (cost == entry->threshold) and (reg_class == entry->field_12)
Range Remat Cost: sub_99AA50
The live-range infrastructure at 0x994000--0x9A1000 includes remat-aware cost functions. sub_99AA50 (51 lines) inserts a rematerialization cost node into a range's cost linked list, enabling the allocator to compare spill cost against remat cost when choosing between spilling and rematerializing a value.
Spill-vs-Remat Decision
The allocator's main iteration driver (sub_9AEF60, 1415 lines) uses remat information to guide the spill-vs-remat tradeoff:
- During interference analysis, remat candidates get lower interference weights (they can be killed and recreated)
- When a spill is triggered, the allocator first checks if the value is rematerializable. If so, it inserts a remat copy instead of a spill/refill pair
- Remat linked lists are maintained at alloc+161..+175 in the per-class allocator state
Verification: sub_A55D80
The post-allocation verifier (sub_A55D80, referenced by "REMATERIALIZATION PROBLEM..." string) validates that rematerialization was applied correctly. Error case 7 in the verifier specifically checks that:
- The rematerialized instruction produces the same value as the original
- The reaching definitions before and after allocation match (modulo known-safe remat transformations)
- No rematerialized instruction references a register that was invalidated by the allocation
Operand Kind 7: Remat Markers
The Ori IR operand classification includes a dedicated "Remat" kind (value 7) that marks operands participating in rematerialization. This marker is orthogonal to the vreg+80 flags -- it exists in the instruction's operand descriptors and tells downstream passes that this particular use was created by rematerialization rather than by the original program.
The 10 operand kinds in the Ori IR:
| Kind | Name | Description |
|---|---|---|
| 0 | R register | General-purpose register |
| 1 | Offset | Memory offset |
| 2 | P/UP register | Predicate register |
| 3 | Any register | Wildcard |
| 4 | Regular | Immediate or constant |
| 5 | Predicated | Guard predicate |
| 6 | -- | (reserved) |
| 7 | Remat | Rematerialization marker |
| 8 | Spill-refill | Spill/refill pair |
| 9 | R2P/P2R | Register-to-predicate conversion |
Vreg Flags at Offset +80
The virtual register's field at offset +80 encodes rematerialization state through a bitmask:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Remat candidate -- this value CAN be recomputed |
| 1 | 0x2 | Remat source processed -- cross-block analysis done |
| 2 | 0x4 | Remat committed -- the allocator should prefer remat over spill |
| 31 | 0x80000000 | Depth marker / unvisited sentinel |
Common flag combinations:
- 0x80000001: Candidate identified by sub_A107B0, pending cost analysis
- 0x80000007: Candidate with predication awareness (stronger guarantee for predicated code paths)
- 0x80000008: Remat committed by cross-block analysis (sub_A0C540), allocator should use remat
Knobs and Configuration
| Knob ID | Role | Default | Notes |
|---|---|---|---|
| 487 | Gate for SinkRemat pass | (enabled) | Must be true for phase 28 to execute |
| 862 | Cutlass iteration limit | 5 | Max iterations in cutlass-specific iterative mode |
| 356 | Instruction count diagnostic | -- | Severity-2 warning when instruction count exceeds 32767 |
The optimization level gating:
- Level <= 1 (-O0/-O1): All three remat phases are disabled
- Level <= 4: Phase 28 runs the non-cutlass path only
- Level >= 5 (-O3+): Full sink+remat with cutlass iteration support
Function Map
Phase 28 (SinkRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5FC20 | sub_C5FC20 | 12 | Phase execute dispatcher |
| 0xC5F2E0 | sub_C5F2E0 | 7 | getName() -> returns 28 |
| 0xC5F2F0 | sub_C5F2F0 | 7 | isNoOp() -> returns 0 (always runs) |
| 0x913A30 | sub_913A30 | 131 | SinkRemat entry with knob/feature gating |
| 0xA0F020 | sub_A0F020 | 494 | Core sink+remat driver (block visitor loop) |
| 0x911030 | sub_911030 | 2408 | Per-block promotion/sinking engine |
| 0x90C010 | sub_90C010 | 70 | Source operand liveness check for remat |
| 0x90B790 | sub_90B790 | ~350 | Cost model: remat profitability analysis |
| 0x8F47E0 | sub_8F47E0 | 12 | Cutlass detection (strstr("cutlass")) |
| 0x8F4820 | sub_8F4820 | -- | Build remat worklist |
| 0x8F4F90 | sub_8F4F90 | -- | Apply remat decisions from worklist |
Phase 54 (OriDoRematEarly)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5EF30 | sub_C5EF30 | 7 | Phase execute: writes ctx+1552 = 4 |
| 0xC5EF40 | sub_C5EF40 | 7 | getName() -> returns 54 |
| 0xC5EF50 | sub_C5EF50 | 7 | isNoOp() -> returns 1 |
Phase 54 is a degenerate phase. Its execute body is a single store: *(ctx + 1552) = 4. Its isNoOp() returns 1, so the dispatch loop skips execute() by default -- the phase does nothing unless an architecture backend overrides the vtable to activate it. When active, the value 4 written to ctx+1552 advances the pipeline progress counter, which sub_A11060 checks (if *(ctx+1552) > 4 triggers the cross-block second pass).
Phase 69 (OriDoRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5F910 | sub_C5F910 | 24 | Phase execute dispatcher |
| 0xC5ED50 | sub_C5ED50 | 7 | getName() -> returns 69 |
| 0xC5ED60 | sub_C5ED60 | 7 | isNoOp() -> returns 0 (always runs) |
| 0xA112C0 | sub_A112C0 | 245 | Late remat entry + cleanup |
| 0xA0C310 | sub_A0C310 | 45 | RematState initialization |
| 0xA11060 | sub_A11060 | 155 | Per-iteration remat worker |
| 0xA107B0 | sub_A107B0 | 316 | Per-instruction remat decision |
| 0xA105F0 | sub_A105F0 | 77 | Sinkability check (opcode 0x5F) |
| 0xA10DF0 | sub_A10DF0 | 138 | MOV chain analysis (FNV-1a hash table) |
| 0xA0C540 | sub_A0C540 | 228 | Cross-block rematerialization |
| 0xA0C4A0 | sub_A0C4A0 | -- | Pressure adjustment (+1 or -1) |
| 0xA0C410 | sub_A0C410 | -- | Remat profitability check for a vreg |
Register Allocator Integration
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0x93AC90 | sub_93AC90 | 29 | Remat opportunity check during assignment |
| 0x99A9D0 | sub_99A9D0 | 38 | Range rematerialization cost cleanup |
| 0x99AA50 | sub_99AA50 | 51 | Range rematerialization cost insertion |
| 0x9AEF60 | sub_9AEF60 | 1415 | Main allocation driver with remat support |
| 0xA55D80 | sub_A55D80 | ~800 | Post-allocation remat verification |
Sinking vs. Rematerialization
The SinkRemat pass (phase 28) and the late OriDoRemat pass (phase 69) both move instructions closer to their uses, but through fundamentally different mechanisms:
Sinking moves the original instruction. The definition is physically relocated from its original position to a dominated block closer to the use. This does not increase instruction count but may change schedule. Sinking is legal only when:
- The instruction has exactly one use
- The use is in a block dominated by the current definition block
- Moving the instruction does not cross any barrier or synchronization point
- All source operands remain available at the new position
Rematerialization duplicates the instruction. The original definition remains in place (or is deleted if dead), and a fresh copy is inserted near each use. This increases instruction count but can dramatically reduce register pressure. Remat is legal for any instruction in the opcode whitelist, subject to:
- All source operands available at the use point
- The cost model approves the duplication
- The instruction has no side effects
The sub_A105F0 sinkability check runs first in sub_A107B0. Only if sinking fails does the function proceed to the rematerialization path. This prioritizes the cheaper transformation (sinking = zero instruction overhead) before falling back to the more expensive one (remat = duplicated instructions).
Architectural Notes
The three-phase structure with an interleaved flag-setter (phase 54) suggests the rematerialization infrastructure evolved over multiple ptxas generations. Phase 54's isNoOp() = 1 default means its execute() is skipped unless an architecture backend activates it by overriding the vtable. This indicates the phase was likely once a full pass that was later simplified to a flag write, with its analysis logic migrated into phase 69.
The CUTLASS-specific iterative mode in phase 28 (sub_913A30) reveals that NVIDIA's matrix-multiply library is important enough to warrant dedicated compiler heuristics. The strstr("cutlass") check is a name-based pattern match on the function name, not a property of the IR itself. This coupling between compiler optimization and library naming conventions is a pragmatic choice for a production compiler targeting known workloads.
The FNV-1a hash (constant 0x811C9DC5, prime 16777619) appears in both the rematerialization infrastructure (sub_A10DF0 for MOV chain tracking) and the register allocator (sub_926A30 for interference). This shared hash implementation is one of ptxas's standard infrastructure components (see Hash Tables & Bitvectors).
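For reference, the 32-bit FNV-1a algorithm with the constants quoted above. This is the standard public-domain algorithm, shown here to make the constants concrete, not the recovered ptxas implementation:

```python
# Standard 32-bit FNV-1a with the constants cited above.
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 16777619  # 0x01000193

def fnv1a_32(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                          # xor first: the "1a" variant
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # multiply, truncate to 32 bits
    return h

assert fnv1a_32(b"") == 0x811C9DC5    # empty input yields the offset basis
assert fnv1a_32(b"a") == 0xE40C292C   # standard FNV-1a test vector
```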
Liveness Analysis & Dead Code Elimination
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Liveness analysis is the most frequently repeated computation in the ptxas pipeline. Six dedicated phases perform liveness analysis combined with dead code elimination (DCE), and at least four additional subsystems recompute liveness on demand. The core algorithm is a standard backward dataflow analysis over the CFG, but the implementation is notable for its SSE2-accelerated bitvector library, per-register-file liveness tracking, and the orWithAndNotIfChanged fused transfer function that implements the entire dataflow step in a single SIMD pass.
| Dedicated liveness phases | 6 (phases 10, 16, 19, 33, 61, 84) |
| Core bitvector library | 0xBDBA60--0xBDE150 (15+ functions, SSE2) |
| BitVector object size | 20 bytes header + dynamic word array |
| Word size | 32-bit (uint32_t) -- indexed by >> 5 and & 0x1F |
| Transfer function | `out = gen \| (in & ~kill)` -- fused `orWithAndNotIfChanged` |
| Fixed-point detection | orIfChanged / andIfChanged return bool |
| Liveness storage | Code Object +832 (main), +856 (uniform) |
| NamedPhases override | "OriPerformLiveDead" controls all 4 instances |
| Related phase | 138 OriSplitHighPressureLiveRanges (late cleanup) |
Pipeline Placement
The six liveness-related phases are distributed across the entire optimization pipeline. Each runs after a group of transformations that may have introduced dead code or invalidated previous liveness information:
Phase 10 EarlyOriSimpleLiveDead ── Initial Setup
Phase 16 OriPerformLiveDeadFirst ── Early Optimization
Phase 19 OriSplitLiveRanges ── Early Optimization
Phase 33 OriPerformLiveDeadSecond ── Mid-Level Optimization
Phase 61 OriPerformLiveDeadThird ── Late Optimization
Phase 84 OriPerformLiveDeadFourth ── Legalization
Phase 138 OriSplitHighPressureLiveRanges ── Late Cleanup (related)
The four OriPerformLiveDead instances are identical passes invoked at different pipeline positions. They share the same vtable execute function and differ only in when they run. The NamedPhases system addresses all four through the single name "OriPerformLiveDead".
Why Four Instances?
Each instance cleans up dead code introduced by the preceding optimization group:
| Phase | Runs After | Cleans Up |
|---|---|---|
| 16 (First) | Branch optimization, switch optimization | Dead branches, unreachable code from CFG simplification |
| 33 (Second) | GeneralOptimize, loop unrolling, pipelining, strength reduction | Dead loop induction variables, redundant computations from unrolling |
| 61 (Third) | GeneralOptimizeLate, loop fusion, VTG expansion, late expansion | Dead code from loop fusion, expanded macro instructions |
| 84 (Fourth) | Backward copy propagation, late arch optimization | Dead copies, redundant moves from backward propagation |
Without these intermediate liveness passes, dead code would accumulate through the pipeline, inflating register pressure and increasing compile time for downstream passes.
Dataflow Algorithm
Classical Backward Liveness
ptxas implements textbook backward dataflow analysis. For each basic block B, the analysis computes:
LiveIn(B) = gen(B) | (LiveOut(B) - kill(B))
LiveOut(B) = Union over all successors S of LiveIn(S)
Where:
- gen(B): registers used in B before any definition in B (upward-exposed uses)
- kill(B): registers defined in B (regardless of whether they are also used)
- LiveIn(B): registers live at the entry of B
- LiveOut(B): registers live at the exit of B
Iterative Fixed-Point Solver
The analysis iterates in reverse post-order (RPO) until no LiveIn/LiveOut set changes:
function compute_liveness(func):
compute_RPO(func) // sub_BDE150
// Initialize gen/kill sets per block
for each block B in func:
gen(B) = {}
kill(B) = {}
for each instruction I in B (reverse order):
for each source operand r of I:
if r not in kill(B):
gen(B) |= {r}
for each destination operand d of I:
kill(B) |= {d}
// Initialize LiveOut to empty for all blocks
for each block B:
LiveOut(B) = {}
// Iterate until fixed point
changed = true
while changed:
changed = false
for each block B in reverse RPO:
// LiveOut = union of successors' LiveIn
for each successor S of B:
changed |= LiveOut(B).orIfChanged(LiveIn(S))
// LiveIn = gen | (LiveOut - kill)
// implemented as: orWithAndNotIfChanged
changed |= LiveIn(B).orWithAndNotIfChanged(
gen(B), LiveOut(B), kill(B))
The key optimization is the fused orWithAndNotIfChanged operation (sub_BDD560), which computes dst |= gen | (in & ~kill) and returns whether any bit changed -- all in a single SSE2 pass over the bitvector words. This avoids materializing intermediate bitvectors for (LiveOut - kill).
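A word-level Python model of the fused operation (a sketch over lists of uint32 words; the real routine processes 128 bits per iteration with SSE2 and scans for the first changed word before writing):

```python
def or_with_and_not_if_changed(dst, gen, inp, kill):
    """Model of dst |= gen | (inp & ~kill) over 32-bit words.
    Returns True if any bit of dst changed (fixed-point detection)."""
    changed = False
    for i in range(len(dst)):
        new = dst[i] | gen[i] | (inp[i] & ~kill[i] & 0xFFFFFFFF)
        if new != dst[i]:
            dst[i] = new
            changed = True
    return changed
```

The returned bool is exactly what the iterative solver accumulates into its `changed` flag, so convergence detection costs nothing extra.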
Convergence
The analysis converges because:
- All sets are initialized to empty (bottom of the lattice).
- Each iteration can only add bits (the transfer function is monotone).
- The lattice has finite height (bounded by the total number of virtual registers).
- RPO traversal order minimizes the number of iterations -- typically 2--3 passes for acyclic code, proportional to loop nesting depth for loops.
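The solver above can be modeled directly in Python, with per-block gen/kill/live sets encoded as integer bitmasks (a sketch of the algorithm, not the recovered code):

```python
def compute_liveness(blocks, succ, rpo):
    """Backward liveness to a fixed point.
    blocks: {B: (gen, kill)} with register sets as int bitmasks;
    succ:   {B: [successor blocks]};
    rpo:    blocks in reverse post-order."""
    live_in = {b: 0 for b in blocks}
    live_out = {b: 0 for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(rpo):              # visit blocks backward
            out = 0
            for s in succ[b]:                # LiveOut = union of succ LiveIn
                out |= live_in[s]
            gen, kill = blocks[b]
            new_in = gen | (out & ~kill)     # LiveIn = gen | (LiveOut - kill)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out
```

For a two-block chain where bit 0 (say, R0) is defined in A and used in B, the solver reports R0 live out of A and live into B, and dead on entry to A.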
BitVector Implementation
The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas dataflow analysis. All operations are SSE2-accelerated with manual alignment handling.
Layout
struct BitVector { // 20 bytes total
uint32_t* data; // +0: pointer to word array (heap-allocated)
int32_t word_count; // +8: number of 32-bit words in use
int32_t capacity; // +12: allocated words (>= word_count)
int32_t bit_count; // +16: number of valid bits
};
Word count is computed from bit count: word_count = (bit_count + 31) >> 5. Memory is allocated via the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). Reallocation occurs only when the new word count exceeds the current capacity.
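A minimal Python model of the layout's bit indexing (sketch only -- the real object carries a pool-allocator reference and a separate capacity field, omitted here):

```python
class BitVector:
    """Model of the 32-bit-word bitvector: index >> 5 selects the word,
    index & 0x1F selects the bit within it."""
    def __init__(self, num_bits):
        self.bit_count = num_bits
        self.word_count = (num_bits + 31) >> 5   # round up to whole words
        self.data = [0] * self.word_count

    def set_bit(self, i):        # sub_BDBFB0 analogue
        self.data[i >> 5] |= 1 << (i & 0x1F)

    def clear_bit(self, i):      # sub_BDC0E0 analogue
        self.data[i >> 5] &= ~(1 << (i & 0x1F)) & 0xFFFFFFFF

    def test_bit(self, i):       # sub_BDC200 analogue
        return (self.data[i >> 5] >> (i & 0x1F)) & 1
```

For example, 100 bits round up to 4 words, and bit 37 lands in word 1, bit position 5.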
Core Operations
| Address | Operation | Signature | Notes |
|---|---|---|---|
| sub_BDBA60 | allocate | (bv*, alloc*, num_bits) | Grow-only; no shrink |
| sub_BDBFB0 | setBit | (bv*, bit_index) | data[i>>5] \|= 1 << (i&31) |
| sub_BDC0E0 | clearBit | (bv*, bit_index) | data[i>>5] &= ~(1 << (i&31)) |
| sub_BDC200 | testBit | (bv*, bit_index) -> bool | (data[i>>5] >> (i&31)) & 1 |
| sub_BDCDE0 | operator\|= | (dst*, src*) | SSE2 _mm_or_si128 loop |
| sub_BDCF40 | orIfChanged | (dst*, src*) -> bool | Scans (~dst & src) != 0 first |
| sub_BDC5F0 | operator&= | (dst*, src*) | SSE2 _mm_and_si128; zeroes tail |
| sub_BDC790 | andIfChanged | (dst*, src*) -> bool | Scans (~src & dst) != 0 first |
| sub_BDDAA0 | operator^= | (dst*, src*) | SSE2 _mm_xor_si128 |
| sub_BDC3F0 | assignAND | (dst*, a*, b*) | dst = a & b |
| sub_BDD300 | orWithAndNot | (dst*, gen*, in*, kill*) | dst \|= gen \| (in & ~kill) |
| sub_BDD560 | orWithAndNotIfChanged | (dst*, gen*, in*, kill*) -> bool | Core transfer function |
| sub_BDBD60 | extractBits | (out[], start, end) | Cross-word boundary handling |
| sub_BDD8C0 | popcount | (bv*) -> int | Count set bits |
| sub_BDDC00 | clear | (bv*) | memset(data, 0, ...) |
| sub_BDCA60 | operator= | (dst*, src*) | Copy with possible realloc |
| sub_BDCC20 | isSubsetOf | (a*, b*) -> bool | Tests (a & ~b) == 0 |
SSE2 Loop Structure
All bulk operations follow the same pattern:
// Alignment prologue: process scalar words until 16-byte aligned
int align_count = (-(uintptr_t)(dst_ptr) >> 2) & 3;
for (int i = 0; i < min(align_count, word_count); i++)
dst_ptr[i] |= src_ptr[i];
// SSE2 main loop: process 4 words (128 bits) per iteration
int sse_count = (word_count - align_count) >> 2;
for (int i = 0; i < sse_count; i++) {
__m128i d = _mm_load_si128((const __m128i*)&dst_ptr[aligned_offset + 4*i]);
__m128i s = _mm_loadu_si128((const __m128i*)&src_ptr[aligned_offset + 4*i]);
_mm_store_si128((__m128i*)&dst_ptr[aligned_offset + 4*i], _mm_or_si128(d, s));
}
// Scalar epilogue: remaining 0-3 words
for (remaining words)
dst_ptr[j] |= src_ptr[j];
The orWithAndNot transfer function fuses three operations into one SSE2 expression:
__m128i result = _mm_or_si128(
_mm_or_si128(gen_vec, dst_vec),
_mm_andnot_si128(kill_vec, in_vec) // in & ~kill
);
The IfChanged variants first scan for any bit that would change (~dst & new_bits), then apply the operation only from the first differing word forward. This early-exit optimization avoids unnecessary writes when the analysis has already converged for most blocks.
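A Python sketch of that early-exit pattern, again modeling words as uint32 values:

```python
def or_if_changed(dst, src):
    """Model of BitVector::orIfChanged (sub_BDCF40): scan for the first
    word where (~dst & src) != 0, then apply the OR only from there on.
    Returns False -- with zero writes -- if dst already covers src."""
    first = None
    for i in range(len(dst)):
        if ~dst[i] & src[i] & 0xFFFFFFFF:    # src has a bit dst lacks
            first = i
            break
    if first is None:
        return False                         # already converged, no writes
    for i in range(first, len(dst)):
        dst[i] |= src[i]
    return True
```

When most blocks have converged, the scan rejects the update without dirtying any cache lines, which is where the savings come from.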
Per-Register-File Liveness
GPU register allocation manages multiple independent register files. ptxas tracks liveness separately for each:
| Register File | Bit Range | Storage |
|---|---|---|
| R (GPR, 32-bit) | Bits 0..254 | Code Object +832 (main bitvector) |
| UR (uniform GPR) | Bits 0..63 | Code Object +856 (uniform bitvector) |
| P (predicate, 1-bit) | Separate tracking | Operand type (v >> 28) & 7 == 5 |
| UP (uniform predicate) | Separate tracking | Flag at Code Object +1378 bit 4 |
| B (barrier) | Indices 20, 21 | Special-cased in dependency graph |
The main liveness bitvector at Code Object +832 covers R registers. The uniform register bitvector at +856 is conditionally allocated: it exists only when the flag at Code Object +1378 bit 4 is set (indicating the function uses uniform registers). The scheduling pass (sub_A06A60) allocates both bitvectors via sub_BDBAD0 and processes them in parallel.
Predicate registers are handled at the operand level during scheduling: the operand type check ((operand >> 28) & 7) == 5 identifies predicate operands, which are tracked in a separate per-block set rather than the main bitvector.
Barrier registers (IDs 20, 21 for sm >= 4.0) receive special treatment in the dependency graph builder (sub_A0D800): they generate ordering dependencies rather than data dependencies, since barriers enforce execution ordering constraints independent of register values.
Phase 10: EarlyOriSimpleLiveDead
The earliest liveness pass, running immediately after initial IR construction (after ReportInitialRepresentation at phase 9). This is a simplified liveness + DCE pass that removes obviously dead instructions from the freshly-lowered IR.
Pipeline context: At this point, the IR has just been lowered from PTX. Many PTX instructions expand to multiple Ori instructions, some of which produce values that are immediately dead (e.g., condition codes that are never tested, intermediate values from multi-instruction expansions). EarlyOriSimpleLiveDead removes this low-hanging dead code before the main optimization pipeline begins, reducing the working set for all subsequent passes.
Implementation evidence: The sweep at p1.10 (W010) confirms this pass uses the bitvector infrastructure at sub_BDBA60--sub_BDE150 for liveness computation. The "simple" in the name may indicate a local-only (per-BB) analysis that avoids the cost of full global iterative dataflow -- sufficient for removing obviously dead definitions that have no uses within the same block.
Phases 16, 33, 61, 84: OriPerformLiveDead
The four instances of the full liveness + DCE pass. These perform global iterative dataflow analysis followed by dead instruction removal.
Algorithm
function OriPerformLiveDead(func):
// 1. Rebuild basic block metadata
rebuild_basic_blocks(func, mode) // sub_781F80
// 2. Compute global liveness (iterative fixed-point)
compute_global_liveness(func) // iterative solver
// 3. Dead code elimination
for each block B in func:
for each instruction I in B:
if all destinations of I are dead (not in LiveOut):
if I has no side effects:
remove(I)
// 4. Update IR metadata
// (instruction counts, block sizes, etc.)
Side-Effect Preservation
Not all instructions with dead destinations can be removed. The DCE must preserve:
- Memory stores (`STG`, `STS`, `STL`, `ATOM`, etc.) -- observable side effects
- Barrier instructions (`BAR`, `MEMBAR`) -- synchronization semantics
- Control flow (`BRA`, `EXIT`, `RET`, `CALL`) -- program structure
- Texture operations with side effects
- Instructions with volatile flags
The opcode mask & 0xCFFF (seen in sub_A06A60) strips modifier bits to obtain the base opcode for side-effect classification. Opcodes 93 (OUT_FINAL in the ROT13 name table; used as a call-like marker -- actual CALL is opcode 71), 94 (LDS)/95 (STS) (used as block boundary markers), 97 (STG; used as a branch-like marker -- actual BRA is opcode 67), and 52 (AL2P_INDEXED; used as NOP/boundary) receive special handling.
DCE Integration
The OriPerformLiveDead pass combines liveness computation with DCE in a single pass rather than running them as separate analyses. After computing LiveOut sets for each block, the pass walks each block backward: for each instruction, it checks whether every destination register is absent from the current live set. If so and the instruction has no side effects, it is unlinked from the instruction list. Source operands of removed instructions are themselves removed from the live set, potentially enabling cascading removal of further dead instructions within the same backward walk.
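A Python sketch of that backward walk (registers as plain ints, instructions as `(dests, sources)` pairs; `has_side_effects` is a caller-supplied predicate standing in for the opcode classification above):

```python
def dce_block(instrs, live_out, has_side_effects):
    """Backward DCE over one block. instrs is in program order; live_out
    is the set of registers live at block exit. Returns surviving instrs."""
    live = set(live_out)
    kept = []
    for dsts, srcs in reversed(instrs):
        dead = not (set(dsts) & live)
        if dead and not has_side_effects((dsts, srcs)):
            continue            # unlink: its sources never enter the live set,
                                # so earlier producers may become dead too
        live -= set(dsts)       # definitions kill liveness
        live |= set(srcs)       # uses generate liveness
        kept.append((dsts, srcs))
    kept.reverse()
    return kept
```

Because a removed instruction's sources are never added to the live set, an entire dead chain (r1 = f(r0); r2 = f(r1); r3 = f(r2)) cascades away in a single backward pass.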
Phase 19: OriSplitLiveRanges
This phase splits live ranges at loop boundaries and across phi/copy chains to reduce register pressure. It runs after OriPerformLiveDeadFirst (phase 16) and OriLoopSimplification (phase 18), when the loop structure is canonical.
String reference: "OriSplitLiveRanges" at 0x22BC5C0.
Core implementation: sub_BEF110 (108KB, 3,414 decompiled lines). Called via sub_A1D3A0 (vtable execute) -> sub_BF33D0 (knob-gated entry, reads register budget from ctx+1624 and knob 456).
Motivation
On GPUs, register pressure directly determines occupancy (the number of concurrent warps). A value defined before a loop and used only after the loop occupies a register for the entire loop body, even though it is not accessed within the loop. Splitting the live range at the loop boundary -- by inserting a copy before the loop and a copy after -- can free the register for use inside the loop, reducing peak pressure and enabling higher occupancy.
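A toy pressure calculation illustrates the payoff. The split itself only enables the allocator's choice; the model below assumes the loop-spanning middle segment is then kept out of registers between the two copies, which is the allocator's decision rather than recovered behavior:

```python
def max_pressure(ranges):
    """Fatpoint-style pressure: max count of simultaneously live ranges.
    Each range is (start, end) over instruction indices, inclusive;
    pressure can only change at range endpoints, so sampling those suffices."""
    points = sorted({p for r in ranges for p in r})
    return max(sum(s <= p <= e for s, e in ranges) for p in points)

# Toy function: instructions 0..9, loop body spans 2..7.
# v is defined at 0, used only at 9; three loop temps are live over 2..7.
unsplit = [(0, 9), (2, 7), (2, 7), (2, 7)]
# After splitting v at the loop boundary (copy out at 1, copy back at 8),
# only the short end segments of v compete for registers:
split = [(0, 1), (8, 9), (2, 7), (2, 7), (2, 7)]
```

The unsplit version needs 4 registers at its fatpoint; the split version needs 3, which on a real GPU can translate directly into an extra resident warp.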
Algorithm (Decompiled from sub_BEF110)
The function operates in five distinct phases:
Phase 1: Pre-analysis -- Rebuilds basic blocks (sub_781F80), allocates three bitvector fields per virtual register (kill at VR+96, gen at VR+24, live-through at VR+176), then runs the standard iterative liveness solver (sub_775010 + sub_773140). Walks the register table checking interference chains: for each VR with a chain at VR+136, tests whether the chain target's kill set is a subset of the VR's kill set (sub_BDC390 = isSubsetOf). Non-subset cases receive the +264 bit 1 flag, marking them as interference candidates.
Phase 2: Work structure allocation -- Allocates a scratch array s[] (one entry per split candidate), a hash table for interference tracking (power-of-2 buckets sized via _BitScanReverse64), and an array of 64-byte per-block split records:
struct PerBlockSplitRecord { // 64 bytes, indexed by block ID
void* list_head; // +0: interference linked list
void* first_in_block; // +8: first entry pointer
void* sentinel; // +16: self-pointer
void* reserved; // +24
void* last_in_block; // +32: last entry pointer
void* tail; // +40: tail pointer
int32_t count; // +48: entry count
int32_t pad; // +52
void* allocator_ref; // +56: refcounted allocator
};
Phase 3: Main splitting loop -- Iterates the ordered register array at ctx+792 in reverse order (highest VR ID first). For each VR, walks the def-use chain via ctx+296 (register table), classifying instructions by opcode:
| Opcode (masked) | Meaning | Split Action |
|---|---|---|
| 167 (0xA7) | Phi-like | Walk up phi chain, split at each level via sub_931920 |
| 158 (0x9E) | Copy-like | Similar chain walk with copy-specific handling |
| 188 (0xBC) | Multi-operand special | Check operand types, dispatch to sub_BE3720 for multi-source split |
| 27 (0x1B) | Register move | Standard split point; emit via sub_9314F0 with 4 operands |
| 269 (0x10D) | Copy | Lightweight split; emit via sub_9314F0 with 2 operands |
For each split: allocates a new VR via sub_931920, copies the three bitvector fields (sub_BDBA60 allocates, sub_BDC1B0 copies dst |= src), validates the register class via sub_9314F0 (called 11 times total across different split patterns), and updates the interference hash via sub_BEEC80.
The inline interference check in the hot path:
// Fast single-bit test: is vr_class_id live in the kill set?
if (((1 << vr_class_id) & kill_set[vr_class_id >> 5]) != 0)
// VRs interfere -- cannot share a physical register
Phase 4: Interference hash processing -- Builds a global interference hash table using FNV-1a (0x811C9DC5 offset basis, 16777619 prime). Walks per-block split records, for each entry scans the kill bitvector (sub_BDDC00 clears from position, scanning forward) to find concurrently live VRs. Tests interference via sub_BEE7F0 and emits split instructions via sub_934630 (opcode 46). The hash table resizes when load factor exceeds 50%.
Phase 5: Cleanup -- Marks phi/copy chains with the +245 rewrite flag (triggering opcode mutation from 188 to 93 or 95), frees hash tables and per-block records, clears ctx+1370 bit 2 to signal liveness invalidation.
function OriSplitLiveRanges(func):
// Phase 1: Pre-analysis
rebuild_basic_blocks(func, 0) // sub_781F80
alloc_kill_bitvectors(func) // sub_BEAFD0: VR+96
alloc_gen_bitvectors(func) // sub_BEB110: VR+24
compute_liveness(func) // sub_775010
propagate_per_block(func, 0) // sub_773140
mark_interference_candidates(func) // inline: walk chains, test subsets
// Phase 2: Work structure allocation
allocate_work_structures(split_candidate_count)
// Phase 3: Main splitting loop
for each VR in ordered_array[ctx+792] (reverse):
walk def-use chain via ctx+296:
classify instruction by opcode
if splittable:
new_vr = allocate_vr(func, vr, def_instr) // sub_931920
copy_bitvectors(new_vr, vr) // sub_BDBA60 + sub_BDC1B0
validate_reg_class(new_vr, opcode, operands) // sub_9314F0
update_interference_hash(new_vr) // sub_BEEC80
// Phase 4: Interference hash processing
for each entry in interference_hash:
for each concurrent_vr in kill_bitvector:
if interferes(entry, concurrent_vr): // sub_BEE7F0
emit_split_instruction(entry, concurrent_vr) // sub_934630
// Phase 5: Cleanup
mark_rewrite_flags() // byte +245
free_work_structures()
ctx[+1370] &= ~4 // invalidate liveness
Three Bitvector Fields per Virtual Register
The splitting pass maintains three independent bitvectors per VR, all using the standard 32-bit-word BitVector from 0xBDBA60--0xBDE150:
| VR Offset | Name | Content | Allocated by |
|---|---|---|---|
+96 | Kill set | Registers defined by this VR's instructions | sub_BEAFD0 |
+24 | Gen set | Registers used before definition in this VR's range | sub_BEB110 |
+176 | Live-through set | Registers live through the range without kill or gen | Derived |
These per-VR bitvectors differ from the per-block liveness bitvectors used by OriPerformLiveDead. The per-block sets track global liveness; the per-VR sets track interference within a single virtual register's live range, enabling the split decision: if two VRs have overlapping kill sets (tested via the fast inline (1 << id) & word[id >> 5] check), they interfere and splitting one of them at the boundary reduces the overlap.
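A Python rendering of that fast check (note the x86 shift instruction wraps its count mod 32, which is what makes the recovered `1 << vr_class_id` correct without an explicit mask; the model makes the masking explicit):

```python
def vr_in_kill_set(vr_class_id, kill_words):
    """The inline hot-path test: is vr_class_id set in a VR's kill
    bitvector? (word = id >> 5, bit = id & 31)"""
    return bool((kill_words[vr_class_id >> 5] >> (vr_class_id & 31)) & 1)

def kill_sets_overlap(a_words, b_words):
    """Two VRs interfere when their kill sets share any bit."""
    return any(x & y for x, y in zip(a_words, b_words))
```

For example, class ID 40 lives in word 1, bit 8, so a kill set whose word 1 holds 0x100 reports interference for ID 40 but not ID 41.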
Helper Functions
| Address | Identity | Role |
|---|---|---|
sub_BEAFD0 | AllocKillBitvectors | Allocate VR+96 kill sets; propagate via interference chain VR+136 |
sub_BEB110 | AllocGenBitvectors | Allocate VR+24 gen sets; scan phi/copy defs (opcodes 158, 167) |
sub_BE3390 | ComputeSplitCount(interference) | Count split points for interference-chain case |
sub_BE3590 | ComputeSplitCount(clean) | Count split points for non-interfering case |
sub_BE3720 | ComputeSplitCount(multiSrc) | Count split points for multi-source operand case |
sub_BEE7F0 | TestInterference | Test bitvector interference between two VRs |
sub_BEEC80 | UpdateHashWithSplit | Update per-split hash table (192-byte entries, 8 buckets) |
Relationship to Phase 138
Phase 138 (OriSplitHighPressureLiveRanges) performs a similar transformation but much later in the pipeline (late cleanup stage), targeting live ranges that still cause excessive pressure after all optimization and legalization passes have run. Phase 19 is the early, conservative version; phase 138 is the late, aggressive fallback.
Liveness Consumers
The liveness information computed by these phases is consumed throughout the pipeline:
Register Allocator
The fat-point register allocator (sub_9721C0) is the primary consumer. Its entry point explicitly rebuilds liveness before allocation:
sub_781F80(ctx, 1); // rebuild basic blocks
sub_A10160(ctx, 1); // recompute liveness
The allocator uses liveness information for:
- Interference computation: Two virtual registers interfere if their live ranges overlap. The interference graph builder (`sub_926A30`, 155KB decompiled) uses bitvector intersection to detect overlaps.
- Spill cost estimation: `sub_94E620` computes spill costs weighted by liveness range length and instruction properties.
- Spill placement: `sub_9449B0` (liveness range calculator, 1800 bytes) iterates instructions in reverse block order using bitvector operations to determine optimal spill/reload insertion points.
Instruction Scheduler
The scheduling subsystem maintains its own liveness tracking at Code Object +832:
- Pre-scheduling: `sub_8DBAF0` (16KB, LivenessAnalysis) computes register liveness for the scheduling priority function.
- Per-BB liveness: `sub_8DB5F0` (8.4KB, LivenessCompute) computes per-basic-block liveness sets.
- Initialization: `sub_8DB070` (8.2KB, LivenessInit) sets up the liveness data structures.
- Iterative solver: `sub_8DE7A0` (12KB) runs the iterative fixed-point computation for scheduling-specific dataflow.
The scheduler uses liveness to:
- Estimate register pressure at each scheduling point
- Identify last-use operands for dead-register marking (`sub_A08250` checks `(1 << reg_num) & *(live_set + 4*(reg_num >> 5))`)
- Compute instruction priority based on register pressure impact
DAG Construction
The dependency graph builder (sub_A0F970, sub_A0D800) uses liveness to:
- Determine which registers are live at block boundaries
- Identify anti-dependencies (WAR) that constrain scheduling
- Track callee-clobbered registers at call sites (opcode 93; `OUT_FINAL` in ROT13, used as call-like marker -- actual CALL is opcode 71)
Multi-Set Register Manager
sub_A7BC80 (36KB) manages multiple parallel liveness bitvectors for different register classes (R, P, B, UR, UP) during post-allocation scheduling. It allocates and deallocates bitvectors in coordinated groups, updating each set based on instruction defs/uses.
Uninitialized Register Detector
sub_A0B5E0 uses liveness information to detect potentially uninitialized registers. After scheduling, it walks each block's entry live set: for each live register, it checks the 0x20 flag at register descriptor offset 48. If the flag is clear, the register is reported as potentially uninitialized via warning strings "Found %d potentially uninitialized register(s) in function %s" (warning 0x1E14).
Data Flow Infrastructure for Scheduling
The scheduling subsystem has its own dataflow infrastructure (separate from the optimizer's OriPerformLiveDead):
| Address | Size | Identity |
|---|---|---|
sub_8DB070 | 8.2KB | LivenessInit -- allocate and initialize per-BB liveness structures |
sub_8DB5F0 | 8.4KB | LivenessCompute -- compute liveness per basic block |
sub_8DBAF0 | 16KB | LivenessAnalysis -- full liveness analysis (red-black tree interval structure) |
sub_8DC3F0 | 3.0KB | ComputeDataFlowState -- scheduling-specific dataflow |
sub_8DC620 | 3.3KB | UpdateDataFlowOnSchedule -- update flow after scheduling decisions |
sub_8DC880 | 10KB | PropagateDataFlow -- propagate dataflow information |
sub_8DCF20 | 23KB | BuildDFGForScheduling -- build scheduling data flow graph |
sub_8DE7A0 | 12KB | IterativeDataFlow -- iterative fixed-point solver |
sub_8DEF90 | 2.0KB | FinalizeDataFlow -- finalize dataflow results |
The sub_8DBAF0 function implements a red-black tree (evidenced by tree rotations and color flags at node offset +40 in the decompiled code), used to store liveness intervals as an ordered set. This enables efficient range queries: "which registers are live at program point P?" is answered by a tree search in O(log n) rather than a linear scan.
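An illustrative Python stand-in for the interval query (a sorted list plus bisect rather than a red-black tree; the names and structure here are assumptions -- only the query semantics match what the recovered structure supports):

```python
import bisect

class LiveIntervals:
    """Toy model of an ordered liveness-interval set: intervals kept
    sorted by start point; live_at(p) bisects to bound the scan.
    The recovered sub_8DBAF0 structure is a red-black tree keyed the
    same way."""
    def __init__(self):
        self.starts, self.intervals = [], []

    def add(self, start, end, reg):
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.intervals.insert(i, (start, end, reg))

    def live_at(self, p):
        """Registers whose interval covers program point p."""
        hi = bisect.bisect_right(self.starts, p)   # only intervals starting <= p
        return {r for s, e, r in self.intervals[:hi] if e >= p}
```

A balanced tree makes the lookup O(log n) where this toy still scans candidates, but both answer the same query: "which registers are live at program point P?".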
sub_781F80: Basic Block Rebuild
This function appears ubiquitously as a prerequisite to liveness computation. It is called with a mode parameter:
sub_781F80(func, 0): Reset/rebuild basic block metadata for reverse scheduling modesub_781F80(func, 1): Full rebuild for forward analysis (used before register allocation)
Over 50 call sites reference this function across the optimizer, register allocator, and scheduler. It refreshes the basic block linked lists, instruction counts, and block boundary markers that the liveness analysis depends on.
Key Function Table
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90) |
sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90) |
sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90) |
sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90) |
sub_BDCDE0 | ~400B | BitVector::operator|= (OR) | HIGH (0.95) |
sub_BDCF40 | ~564B | BitVector::orIfChanged | HIGH (0.95) |
sub_BDC5F0 | ~484B | BitVector::operator&= (AND) | HIGH (0.95) |
sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95) |
sub_BDDAA0 | ~400B | BitVector::operator^= (XOR) | HIGH (0.95) |
sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90) |
sub_BDD300 | ~488B | BitVector::orWithAndNot | HIGH (0.92) |
sub_BDD560 | ~648B | BitVector::orWithAndNotIfChanged | HIGH (0.92) |
sub_BDBD60 | ~368B | BitVector::extractBits | HIGH (0.88) |
sub_BDD8C0 | ~320B | BitVector::popcount | MEDIUM (0.80) |
sub_BDDC00 | ~140B | BitVector::clear | HIGH (0.90) |
sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85) |
sub_BDCC20 | ~320B | BitVector::isSubsetOf | MEDIUM (0.85) |
sub_BDE150 | 9KB | CFG::computeRPO | HIGH (0.90) |
sub_781F80 | varies | Basic block rebuild | HIGH (0.85) |
sub_A10160 | ~2KB | Liveness computation entry | MEDIUM (0.75) |
sub_A0BA40 | 15KB | Block-level liveness iteration | HIGH (0.85) |
sub_A06A60 | 15KB | Per-block register set tracking | HIGH (0.95) |
sub_A0D800 | 39KB | Dependency graph construction | HIGH (0.95) |
sub_A0F970 | 10KB | DAG construction entry | HIGH (0.95) |
sub_92C240 | 8KB | Liveness bitvector operations (regalloc) | HIGH (87 callers) |
sub_9449B0 | 1.8KB | Liveness range calculator (spill codegen) | HIGH |
sub_8DBAF0 | 16KB | LivenessAnalysis (scheduling) | HIGH (0.85) |
sub_8DB5F0 | 8.4KB | LivenessCompute (per-BB scheduling) | HIGH (0.85) |
sub_8DB070 | 8.2KB | LivenessInit (scheduling) | HIGH (0.85) |
sub_8DE7A0 | 12KB | IterativeDataFlow (scheduling solver) | HIGH (0.80) |
sub_A0B5E0 | varies | Uninitialized register detector | HIGH (0.97) |
sub_A7BC80 | 36KB | RegisterSetManager (multi-file liveness) | MEDIUM (0.65) |
sub_BEF110 | 108KB | OriSplitLiveRanges core (Phase 19) | HIGH (0.90) |
sub_BF33D0 | ~1KB | OriSplitLiveRanges knob-gated entry (reads knob 456) | HIGH (0.90) |
sub_A1D3A0 | ~0.2KB | OriSplitLiveRanges vtable execute | HIGH (0.90) |
sub_BEAFD0 | ~2KB | AllocKillBitvectors (VR+96 per-VR kill sets) | HIGH (0.85) |
sub_BEB110 | ~3KB | AllocGenBitvectors (VR+24 per-VR gen sets) | HIGH (0.85) |
sub_BE3390 | varies | ComputeSplitCount(interference) | MEDIUM (0.80) |
sub_BE3590 | varies | ComputeSplitCount(clean) | MEDIUM (0.80) |
sub_BE3720 | varies | ComputeSplitCount(multiSrc) | MEDIUM (0.80) |
sub_BEE7F0 | varies | TestInterference (BV interference test) | MEDIUM (0.80) |
sub_BEEC80 | ~1KB | UpdateHashWithSplit (per-split hash update) | MEDIUM (0.80) |
sub_BEB9C0 | varies | Hash table init/destroy (secondary) | MEDIUM (0.75) |
sub_BEBA40 | varies | Hash table init/destroy (primary) | MEDIUM (0.75) |
Key Constants
| Value | Meaning |
|---|---|
+832 | Code Object offset: main register liveness bitvector (R registers) |
+856 | Code Object offset: uniform register liveness bitvector (UR registers) |
+840 | Code Object offset: max live register count |
+848 | Code Object offset: liveness info pointer |
+720 | Code Object offset: RPO order array |
+984 | Code Object offset: number of basic blocks |
+1378 bit 4 | Flag: function uses uniform registers (enables +856 bitvector) |
0xCFFF | Opcode mask: strips modifier bits for side-effect classification |
+792 | Context offset: reverse-ordered register array (for live range splitting) |
+1370 bit 2 | Flag: liveness invalid (cleared by sub_BEF110 on exit) |
+1624 | Context offset: register budget (double, read by sub_BF33D0) |
VR+24 | Virtual register offset: gen bitvector (allocated by sub_BEB110) |
VR+96 | Virtual register offset: kill bitvector (allocated by sub_BEAFD0) |
VR+136 | Virtual register offset: interference chain (linked list of aliased VRs) |
VR+144 | Virtual register offset: register class ID (int32) |
VR+176 | Virtual register offset: live-through bitvector |
VR+245 | Virtual register byte flag: needs-opcode-rewrite (set by Phase 19 cleanup) |
VR+264 | Virtual register flags: bit 0 = has-interference-chain, bit 1 = non-subset, bit 2 = was-split |
VR+280 | Virtual register flags: bit 2 = needs-split, bit 4 = propagated, bit 12 = predicate-qualified |
0x811C9DC5 | FNV-1a offset basis (used in Phase 19 interference hash) |
16777619 | FNV-1a prime (0x01000193) |
0x22BC5C0 | String address: "OriSplitLiveRanges" |
0x22BCFE8 | String address: "OriSplitHighPressureLiveRanges" |
Cross-References
- Pass Inventory -- full 159-phase listing with liveness phase positions
- Optimizer Pipeline -- pipeline stage groupings
- Ori IR -- Code Object layout, bitvector infrastructure details
- Allocator Architecture -- liveness as input to fat-point allocator
- Spill Mechanism -- liveness range calculator for spill placement
- Scheduler Architecture -- scheduling-specific liveness
- Scheduling Algorithm -- priority function's pressure estimates
Synchronization & Barriers
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas synchronization pipeline manages the insertion, optimization, and expansion of all GPU synchronization and barrier instructions. Eight phases span the full compilation pipeline, from early memory-ordering fence insertion through post-scheduling dependency barrier fixup. These phases collectively translate the PTX memory model into the hardware synchronization primitives required by each SM architecture: thread block barriers (BAR), memory barriers (MEMBAR), dependency barriers (DEPBAR), warp-level synchronization (WARPSYNC/BSYNC/BSSY), and asynchronous barriers (MBARRIER).
| Phases | 25, 26, 42, 71, 72, 99, 100, 114 |
| Categories | Lowering (25, 42, 72), Optimization (26, 71), Scheduling (99, 100, 114) |
| Pipeline span | Phase 25 (early optimization) through phase 114 (post-scheduling) |
| Key opcodes | BAR (opcode 61), MEMBAR (opcode 111), DEPBAR, BSYNC, BSSY, WARPSYNC, MBARRIER.*. Note: the code uses opcode 130 (HSET2 in the ROT13 name table) as an internal marker for barrier/sync instructions in the Ori IR. |
| Architecture gates | Phases 100, 114 dispatch through architecture vtable; phase 42 dispatches through backend vtable at ctx+1584 offset 0x168 |
| Related EIATTR | EIATTR_SYNC_STACK, EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Knobs | DisableErrbarAfterMembar, knob 487 (iteration gate), knob 358 (sync mode), knob 472 (barrier liveness) |
GPU Synchronization Model
NVIDIA GPUs provide four distinct synchronization mechanisms, each operating at a different scope and addressing different hazards.
Thread Block Barriers (BAR)
Thread block barriers synchronize all threads within a cooperative thread array (CTA). The hardware provides 16 named barriers (indices 0--15), each tracking participation counts. PTX exposes these as:
- `bar.sync N` -- block until all threads in the CTA arrive at barrier N
- `bar.red.{and,or,popc} N` -- barrier with warp-level reduction
- `bar.arrive N` -- signal arrival without blocking
- `barrier.cta.{sync,arrive,red}` -- PTX 8.0+ cluster-aware variants
In SASS, these map to the BAR instruction family (opcode 61 in the ROT13 name table). The Ori IR uses opcode 130 (HSET2 in the ROT13 name table) as an internal barrier/sync marker. The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used, which the hardware uses to partition the convergence barrier file.
PTX: bar.sync 0;
SASS: BAR.SYNC 0x0;
// stalls warp until all CTASize threads arrive at barrier 0
Memory Barriers (MEMBAR)
Memory barriers enforce ordering of memory operations across different visibility scopes:
- `membar.cta` -- visible to threads in the same CTA
- `membar.gpu` -- visible to threads on the same GPU device
- `membar.sys` -- visible to all agents (including host CPU and peer GPUs)
Additionally, fence.proxy instructions enforce ordering between different memory proxy domains (generic, texture, surface, constant).
The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS table records the byte offsets of membar.sys instructions so the driver can inject its software WAR workaround at those sites.
Dependency Barriers (DEPBAR / Scoreboards)
Dependency barriers are the micro-architectural mechanism for tracking instruction-level data hazards. Each SM provides 6 scoreboard entries (barriers 0--5) that track completion of long-latency operations. SASS instructions encode a 23-bit control word containing:
- Stall count (4 bits): cycles to wait before issuing the next instruction
- Yield flag (1 bit): hint to give up the scheduling quantum
- Write barrier (3 bits): scoreboard index to set on result writeback
- Read barrier mask (6 bits): scoreboard entries to wait for before reading
- Wait barrier mask (6 bits): scoreboard entries to clear/release
DEPBAR is the explicit dependency barrier instruction that waits for a specific set of scoreboard entries. Scoreboards are assigned by phase 115 (AdvancedScoreboardsAndOpexes) and phase 116 (ProcessO0WaitsAndSBs); the sync passes described here prepare the IR for scoreboard generation but do not assign scoreboards directly.
Warp-Level Synchronization
Warp-level sync instructions operate within a single warp (32 threads):
- WARPSYNC mask -- synchronizes threads identified by the lane mask (sm70+)
- BSSY B, target -- pushes a synchronization barrier for convergence
- BSYNC B -- pops and waits at the convergence barrier
The BSSY/BSYNC mechanism replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points. EIATTR_SYNC_STACK records metadata about the convergence barrier stack depth.
Asynchronous Barriers (MBARRIER)
Introduced in sm90 (Hopper), MBARRIER provides hardware-accelerated asynchronous barriers in shared memory. These support non-blocking arrival, expected transaction count tracking, and parity-based phase completion -- critical for async copy (cp.async.bulk) and TMA (Tensor Memory Accelerator) operations.
MBARRIER operations in PTX:
| PTX instruction | Purpose |
|---|---|
mbarrier.init | Initialize barrier object in shared memory |
mbarrier.arrive | Signal arrival (non-blocking) |
mbarrier.arrive_drop | Arrive and decrement expected count |
mbarrier.arrive.expect_tx | Arrive with expected transaction byte count |
mbarrier.test_wait | Test if barrier phase is complete |
mbarrier.try_wait | Wait with timeout |
mbarrier.try_wait.parity | Phase-parity-based wait |
mbarrier.pending_count | Query remaining arrivals |
mbarrier.inval | Invalidate barrier |
mbarrier.complete_tx | Mark transaction bytes as complete |
The EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS metadata inform the runtime about barrier allocation and instruction locations for driver patching.
Phase 25 -- StageAndFence
| Phase name | StageAndFence |
| Category | Lowering |
| Execute wrapper | sub_C5FBC0 (34 bytes) |
| Implementation | sub_1392E30 (166 bytes) |
| Core logic | sub_1390B30 (8,956 bytes, 97 callees) |
| Setup | sub_1389AF0 (3,049 bytes) |
| Teardown | sub_138A6E0 (3,408 bytes) |
| Gating | Requires opt_level > 1 AND context+1368 bit 0 AND context+1397 bits[6:7] != 0x40; additionally guarded by "LoopUnrolling" disable check and knob 487 |
| Total code | ~16 KB across 0x1389AF0--0x1393340 |
Purpose
StageAndFence inserts memory fence and staging instructions to enforce coherence ordering after loop unrolling. When loop unrolling replicates memory operations, the replicated loads and stores may violate the memory model if they cross a synchronization boundary that was inside the original loop body. This pass re-establishes correctness by inserting fence operations at the boundaries of unrolled iterations.
Execution Flow
sub_1392E30(compilation_unit):
// Guard: must have loops and bit flags set
if !(context+1368 bit 0) or (context+1397 & 0xC0) == 0x40:
return
// Check if "LoopUnrolling" pass is disabled
IsPassDisabled(knob_state, "LoopUnrolling", &disabled)
if disabled: return
if opt_level <= 2: return
// Check knob 487
if !CheckKnob(knob_state, 487, 1): return
// Core execution
sub_1389AF0(state, compilation_unit) // allocate working structures
sub_1390B30(state) // main fence insertion pass
sub_138A6E0(state) // cleanup
Main Pass -- sub_1390B30
The main pass (8,956 bytes) is the largest function in this phase group. It:
- Iterates over the basic block list via the instruction chain (context+272)
- Identifies memory operations that cross unrolled loop iteration boundaries
- Computes fence requirements based on the memory model and target architecture
- Calls sub_A0F020 (the scheduling entry point) to build dependency information and determine where fences are needed
- Inserts fence.proxy or MEMBAR pseudo-instructions at identified locations
- Updates the instruction list metadata via sub_781F80 (basic block refresh)
The function takes floating-point parameters (double a2, double a3, __m128d a4), suggesting it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay fences to overlap with independent computation.
Phase 26 -- OriRemoveRedundantBarriers
| Phase name | OriRemoveRedundantBarriers |
| Category | Optimization |
| Execute wrapper | sub_C60BD0 (334 bytes) |
| Implementation | sub_790A40 (2,288 bytes, 33 callees) |
| Helper: post-RA sched | sub_790020 (1,200 bytes) |
| Helper: pre-RA opt | sub_7904D0 (1,381 bytes) |
| Helper: barrier opt | sub_7923A0 (2,344 bytes, 30 callees) |
| Helper: barrier pass | sub_792CD0 (1,360 bytes, 25 callees) |
| Gating | Multi-function dispatch: only runs when sub_7DDB50(ctx) > 1 (i.e., the compilation unit contains more than one function) |
| Total code | ~10 KB across 0x790020--0x793220 |
Purpose
OriRemoveRedundantBarriers performs dataflow-driven elimination of provably redundant barrier instructions. When the compiler can prove that all threads in a warp (or CTA) must have already passed through a dominating synchronization point, subsequent barriers to the same scope are redundant and can be removed. This reduces the synchronization overhead without changing program semantics.
Execution Flow
The execute wrapper sub_C60BD0 is a multi-function dispatch pattern: when a compilation unit contains multiple functions, it creates two reference-counted list objects, stores the current phase chain pointer, and calls sub_790A40 for cross-function barrier analysis. For single-function units, it returns directly.
sub_C60BD0(phase, compilation_unit):
func_count = sub_7DDB50(compilation_unit)
if func_count <= 1: return
// Create two ref-counted analysis lists
list1 = pool_alloc(24)
list1->refcount = 1
list2 = pool_alloc(24)
list2->refcount = 1
// Store current phase chain
saved_chain = compilation_unit->field_88
// Run multi-function barrier analysis
sub_790A40(&compilation_unit)
// Release ref-counted lists
release(list1)
release(list2)
Main Analysis -- sub_790A40
The main analysis function (2,288 bytes) operates through several stages:
1. Mode selection: Queries knob 358 (sync mode) through the knob container at ctx+1664. Four modes exist:
   - Mode 0: no barrier removal (return immediately via sub_756F10)
   - Mode 1: conservative removal (calls sub_790020)
   - Mode 2: aggressive removal (calls sub_790020 with flag)
   - Mode >= 3: full multi-function analysis
2. Graph construction (sub_7E6090): Builds an instruction-level dependency graph with 32-bit flags. Called with (ctx, 0, 0, 0, 0).
3. Liveness refresh (sub_781F80): Refreshes the basic block liveness information with mode parameter 1 (compute barrier liveness).
4. Dependency tracking (sub_A10160): Sets up dependency tracking data structures.
5. Block iteration (sub_769300, sub_752AB0): Builds block-level analysis structures for the function.
6. Redundancy analysis: For each barrier instruction (opcode 130; HSET2 in the ROT13 name table, but used as the internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111), checks whether the barrier's destination register is live in any successor block. If the barrier result is dead (no thread could observe it before the next dominating barrier), the barrier is eliminated.
7. Block-level merging (sub_75EAE0, sub_75E2F0): Merges barriers at block boundaries where adjacent blocks have compatible barrier scopes.
The algorithm checks barriers by walking the instruction chain and testing opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/sync instructions -- not the actual HSET2 half-precision set instruction). For each barrier, it extracts the destination operand (field+84), resolves the register through the register table at context+88, and tests whether the register's use-count (reg+24) indicates the barrier result is consumed.
Phase 42 -- ExpandMbarrier
| Phase name | ExpandMbarrier |
| Category | Lowering |
| Execute wrapper | 0xC5F110 (6 bytes) |
| Implementation | Architecture-dispatch via *(*(ctx+0x630))->vtable[0x168/8] |
| isNoOp | Always false (0xC5F130 returns 0) |
| No opt-level check | Runs at all optimization levels |
Purpose
ExpandMbarrier expands MBARRIER pseudo-instructions into native barrier instruction sequences. This is critically important for sm90+ (Hopper and later) architectures that use asynchronous barriers for TMA operations, cp.async.bulk, and warpgroup-level synchronization.
Dispatch Mechanism
Unlike most phases that tail-call a fixed function after an optimization level check, ExpandMbarrier performs a direct vtable dispatch:
mov rdi, [rsi+0x630] ; rdi = ctx->arch_backend (offset 1584)
mov rax, [rdi] ; rax = arch_backend->vtable
jmp [rax+0x168] ; call vtable[45] -- ExpandMbarrier impl
The architecture backend at ctx+1584 provides the actual expansion logic. This design allows each SM generation to define its own mbarrier expansion rules:
- Pre-sm90: MBARRIER pseudo-ops do not exist; the phase is effectively a no-op.
- sm90 (Hopper): Expands MBARRIER pseudo-ops into hardware mbarrier instruction sequences using the mbarrier object in shared memory. Handles mbarrier.init, mbarrier.arrive, mbarrier.arrive.expect_tx, mbarrier.try_wait.parity, and mbarrier.inval.
- sm100+ (Blackwell): Extended mbarrier semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations.
MBARRIER Expansion Patterns
A typical async copy pattern in the Ori IR and its expansion:
Before expansion (pseudo-ops):
MBARRIER_INIT %mbar, count
MBARRIER_ARRIVE_EXPECT_TX %mbar, bytes
CP.ASYNC.BULK.TENSOR dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY %mbar, parity, pred
After expansion (native):
MBARRIER.INIT [smem_addr], count
MBARRIER.ARRIVE.EXPECT_TX [smem_addr], bytes
CP.ASYNC.BULK.TENSOR [dst], [src], [smem_addr]
MBARRIER.TRY_WAIT.PARITY pred, [smem_addr], parity
The expansion resolves shared memory addresses for the mbarrier objects, handles the naming of __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier and __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity reserved shared memory regions, and inserts any required fence.proxy operations for proxy domain coherence.
Phase 71 -- OptimizeSyncInstructions
| Phase name | OptimizeSyncInstructions |
| Category | Optimization |
| Execute wrapper | sub_C60080 (34 bytes) |
| Implementation | sub_90A340 (1,670 bytes, 21 callees) |
| Sync predicate | sub_18F6930 (185 bytes) -- determines if sync optimization should run |
| Gating | Requires opt_level > 2; additionally checks knob 487, architecture flags at context+1368, and sub_18F6930 predicate |
| Pipeline position | After OriPropagateVaryingSecond (70), before LateExpandSyncInstructions (72) |
Purpose
OptimizeSyncInstructions performs redundancy elimination and simplification of synchronization instructions within the partial-SSA window. It identifies and removes sync instructions that are provably unnecessary based on the data flow and the GPU memory model, and simplifies complex sync patterns into cheaper equivalents.
Gating Logic
The pass has elaborate gating controlled by sub_18F6930, which evaluates:
sub_18F6930(ctx, mode):
// Check architecture-specific sync flags
flags = *(ctx+1398)
if (flags & 0x18) != 0:
return (flags & 0x18) == 8 // specific arch config
// Check whether SM requires explicit sync
if !(*(ctx+1412) bit 7) or *(ctx+1584)->field_372 <= 28673:
return true
// Functions with <= 4 registers always need sync
if *(ctx+1704) <= 4:
return true
// Mode-specific knob checks at offsets 51120/51192
...
The value 28673 corresponds to sm70/sm72/sm73/sm75 architecture IDs. The predicate returns true (optimize) for architectures that have explicit synchronization requirements (Volta and later), and false for older architectures where synchronization is implicit.
Main Algorithm -- sub_90A340
sub_90A340(ctx):
if opt_level <= 2: return
if !CheckKnob(ctx+1664, 487, 1): return
// Determine sync optimization mode
has_uniform_regs = (ctx+1412 bit 7) && !(ctx+1368 bit 4)
arch_data = *(*(ctx+1664)+72)
sync_mode = *(arch_data + 15480)
if sync_mode == 1: mode = *(arch_data + 15488)
// Main path: combined sync + barrier optimization
if (ctx+1368 flags 0x20000001 all set) && (ctx+1377 bit 6) && !mode:
need_expand = sub_18F6930(ctx, 0)
sub_781F80(ctx, 1) // refresh liveness
if !need_expand && !has_uniform_regs:
sub_7E6090(ctx, 0, 0, 0, 32) // build dep graph, 32-bit mode
goto optimize
else:
need_expand = sub_18F6930(ctx, 0)
if !has_uniform_regs && !need_expand: return
sub_781F80(ctx, 1)
// Barrier liveness computation
sub_775010(ctx)
sub_7E6090(ctx, 0, 0, 0, 32)
// Walk instruction list, find opcode 130 (HSET2 in ROT13; internal barrier/sync marker)
for instr = ctx->first_instr; instr; instr = instr->next:
if instr->opcode != 130: continue
// Extract operand, check register type
operand = instr->field_84
if operand_type(operand) != 1: continue
reg = register_table[operand & 0xFFFFFF]
if !check_liveness(reg): continue
// For uniform-register-aware path:
if has_uniform_regs:
if (instr->field_91 & 1): continue // skip if flagged
if reg->file != 6: continue // must be barrier reg
if reg->use_count <= 1: continue
// Check all uses via use-def chain...
try_merge_barriers(ctx, instr)
// Standard redundancy elimination
try_eliminate_redundant_sync(ctx, instr)
cleanup_lists()
The pass iterates the flat instruction list (not per-block), checking every instruction with opcode 130 (HSET2 in the ROT13 name table; used as the internal Ori IR opcode for barrier/synchronization instructions). For each barrier, it examines the operand to determine:
- Whether the barrier result register is consumed by any subsequent instruction
- Whether the barrier can be merged with an adjacent barrier of the same scope
- Whether the barrier guards a memory region that is provably thread-local
The sub_1245740 call performs the actual redundancy proof by checking dominance relationships between barrier pairs.
Phase 72 -- LateExpandSyncInstructions
| Phase name | LateExpandSyncInstructions |
| Category | Lowering |
| Execute wrapper | sub_C600B0 (34 bytes) |
| Implementation | sub_1381DA0 (1,517 bytes, 3 callees) |
| Core driver | sub_1381CD0 (206 bytes) |
| Gating | Requires opt_level > 1; checks context+1376 bit 5, "Predication" disable flag, and knob 487 with iteration counter |
| Error diagnostic | "ExpandSyncInstLate option is not supported on this architecture." (via sub_7EF030) |
| Pipeline position | After OptimizeSyncInstructions (71), before ConvertAllMovPhiToMov (73) |
| Gate pass | Phase 135 (AdvancedPhaseLateExpandSyncInstructions) provides an additional architecture hook |
Purpose
LateExpandSyncInstructions performs the final expansion of synchronization pseudo-instructions into their target-specific SASS instruction sequences. This runs late in the pipeline (phase 72, within the partial-SSA window) so that earlier optimization passes can work with high-level sync pseudo-ops rather than architecture-specific instruction sequences.
Execution Flow
The entry function sub_1381DA0 is structurally similar to the Predication pass entry: both live in the same address range (0x1381000--0x1382000) and share infrastructure for walking the instruction list within the partial-SSA window.
sub_1381DA0(ctx):
if context+1376 bit 5: return // disabled by phase flag
// Read expansion mode from knob container
knob_state = *(ctx+1664)
mode = *(*(knob_state+72) + 16416)
if mode == 0:
limit = (ctx+1419 bit 4) != 0
elif mode == 1:
limit = *(*(knob_state+72) + 16424)
IsPassDisabled(knob_state, "Predication", &disabled)
if disabled or limit: return
// Knob 487 iteration gating with counter
if !CheckKnob487WithCounter(knob_state): return
// Set up working state
context+1385 |= 1 // mark expansion active
// Call core driver
sub_1381CD0(state)
context+1385 &= ~1 // clear expansion flag
cleanup_pools()
Expansion Rules
The pass transforms sync pseudo-instructions according to the target SM:
| Pseudo-instruction | sm70+ expansion | sm90+ expansion |
|---|---|---|
SYNC.WARP mask | WARPSYNC mask | WARPSYNC mask |
SYNC.BLOCK | BAR.SYNC 0 | BAR.SYNC 0 |
SYNC.CONVERGE target | BSSY B, target ... BSYNC B | BSSY B, target ... BSYNC B |
MBARRIER.WAIT pseudo | (not expanded here) | MBARRIER.TRY_WAIT.PARITY loop |
ERRBAR | BAR.SYNC 15 (error barrier) | Conditional on DisableErrbarAfterMembar |
The ERRBAR (error barrier) is a compiler-inserted synchronization point placed after membar.sys instructions to ensure memory ordering is observable before proceeding. The DisableErrbarAfterMembar knob (accessible via the CLI option string at 0x1D04BC0) controls whether these error barriers are emitted. When set to 1, the compiler omits the error barrier, trading safety for performance.
Phase 99 -- OriDoSyncronization
| Phase name | OriDoSyncronization |
| Category | Scheduling |
| Execute wrapper | sub_C5FAD0 (34 bytes) |
| Implementation | sub_A0F020 (2,375 bytes, 32 callees) -- DAG scheduler entry |
| Dependency builder | sub_A0D800 (dependency DAG construction) |
| Per-block processor | sub_A06A60 (3,045 bytes, 53 callees) |
| Uninit reg check | sub_A0B5E0 |
| Gating | Requires opt_level > 1 |
| Pipeline position | After BackPropagateVEC2D (98), before ApplyPostSyncronizationWars (100) |
| Callers of sub_A0F020 | 11 sites: sub_913A30, sub_9AEF60 (x2), sub_C5FA40/sub_C5FA70/sub_C5FAA0/sub_C5FAD0 (4 arch wrappers), sub_1390B30 (x2), sub_1395850 (x2) |
Purpose
OriDoSyncronization is the post-optimization synchronization insertion pass. It runs after all IR-level optimizations are complete and before register allocation, using the scheduling infrastructure to analyze data dependencies and insert the synchronization instructions (BAR, DEPBAR, MEMBAR) required by the GPU memory model for correctness.
Note the intentional misspelling "Syncronization" (missing 'h') -- this is present in the binary's string table and preserved here for fidelity.
Architecture
OriDoSyncronization reuses the DAG scheduler's infrastructure (sub_A0F020) rather than implementing its own analysis. The same function serves as the scheduling entry point in multiple contexts:
- Phase 99 (OriDoSyncronization): inserts sync instructions based on dependency analysis
- Phase 25 (StageAndFence): inserts fences via sub_1390B30
- Multiple architecture-specific scheduling wrappers: sub_C5FA40, sub_C5FA70, sub_C5FAA0
Execution Flow
sub_A0F020(ctx):
while true:
if *(ctx+1648) == 0: break
// Initialize dependency context
dep_ctx = pool_alloc(16)
dep_ctx->refcount = 2
dep_ctx->parent = ctx->pool
// Build dependency DAG
sub_A0D800(ctx, dep_ctx)
// Process blocks in reverse order
for each basic_block in reverse(block_list):
if block->opcode == 8: continue // skip NOP/exit blocks
sub_A06A60(ctx, callback, block, flags...)
// Check for uninitialized register usage
sub_A0B5E0(ctx, dep_ctx)
// Diagnostic output if enabled
sub_7F44D0(ctx)
// Break or retry based on scheduling result
...
Per-Block Synchronization -- sub_A06A60
The per-block processor (3,045 bytes, 53 callees) is the core of sync insertion. For each basic block:
- Allocates temporary liveness bitsets via sub_BDBA60 (bitvector alloc)
- Copies the block-entry live set from ctx+832 via sub_BDC300
- Walks instructions forward, examining each opcode (masked by 0xCFFF):
  - Opcode 93 (OUT_FINAL in ROT13; used here as a call-like control-flow marker -- actual CALL is opcode 71): copies the callee-save register set, handles arguments
  - Opcode 95 (STS in ROT13; used here as a barrier/terminator marker -- actual BAR is opcode 61): AND-merges successor block live sets
  - Opcode 97 (STG in ROT13; used here as a branch/control marker -- actual BRA is opcode 67): tests whether the live set changed since block entry
- Inserts sync instructions where data dependencies cross synchronization boundaries
- Updates uniform register liveness at ctx+856 when ctx+1378 bit 3 is set
The function uses extensive bitvector operations (13 different bitvector functions from the sub_BDB*/sub_BDC* infrastructure) to track register liveness through synchronization points.
Phase 100 -- ApplyPostSyncronizationWars
| Phase name | ApplyPostSyncronizationWars |
| Category | Scheduling |
| Execute wrapper | sub_C607A0 (51 bytes) |
| Implementation | Architecture-dispatch via *(*(ctx+0x630))->vtable[0x110/8] |
| Nullsub guard | Skips if vtable entry equals nullsub_170 (0x7D6C80) |
| Gating | Requires opt_level > 1 |
| Pipeline position | After OriDoSyncronization (99), before AdvancedPhaseAllocReg (101) |
Purpose
ApplyPostSyncronizationWars fixes write-after-read (WAR) hazards that are introduced or exposed by the synchronization insertion in phase 99. When OriDoSyncronization inserts new barrier or memory fence instructions, these insertions can create new register hazards (the barrier instruction may read a register that a subsequent instruction writes). This pass scans for and resolves those hazards.
Dispatch Mechanism
; sub_C607A0
mov rbx, rsi ; save ctx
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return ; skip if opt_level <= 1
mov rdi, [rbx+0x630] ; rdi = ctx->arch_backend
mov rax, [rdi] ; rax = arch_backend->vtable
mov rax, [rax+0x110] ; vtable[34] = ApplyPostSyncWars impl
cmp rax, 0x7D6C80 ; compare with nullsub_170
jne call_impl ; if not nullsub, call it
return:
ret
call_impl:
jmp rax ; tail-call architecture implementation
The nullsub_170 check (at 0x7D6C80) is the no-op sentinel: if the architecture backend does not override this vtable entry, the phase is silently skipped. This allows architectures that do not have post-sync WAR hazards to avoid unnecessary work.
Phase 114 -- FixUpTexDepBarAndSync
| Phase name | FixUpTexDepBarAndSync |
| Category | Scheduling |
| Execute wrapper | sub_C60600 (51 bytes) |
| Implementation | Architecture-dispatch via *(*(*(ctx+0x630)+0x10))->vtable[0x70/8] |
| Nullsub guard | Skips if vtable entry equals nullsub_43 (0x680170) |
| Gating | Requires opt_level > 1 |
| Pipeline position | After PostFixForMercTargets (113), before AdvancedScoreboardsAndOpexes (115) |
Purpose
FixUpTexDepBarAndSync performs a post-scheduling fixup of texture dependency barriers and synchronization instructions. After the main scheduling passes (phases 97--110) have reordered instructions and the Mercury encoder (phases 117--122) has finalized SASS encoding, texture fetch instructions may have dependency barriers that are incorrect due to instruction movement. This phase corrects those barriers.
Dispatch Mechanism
The dispatch is doubly-indirect, going through two vtable levels:
; sub_C60600
mov rbx, rsi
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return
mov rax, [rbx+0x630] ; arch_backend
mov rdi, [rax+0x10] ; secondary object at arch_backend+16
mov rax, [rdi] ; secondary vtable
mov rax, [rax+0x70] ; vtable[14] = FixUpTexDepBar impl
cmp rax, 0x680170 ; compare with nullsub_43
jne call_impl
return:
ret
call_impl:
jmp rax ; tail-call implementation
The double indirection (arch_backend -> arch_backend+16 -> vtable+0x70) indicates that the texture dependency barrier fixup lives in a secondary object owned by the architecture backend -- likely the scheduling/scoreboard subsystem object.
Texture Dependency Barriers
Texture fetches are long-latency operations (hundreds of cycles). The hardware uses dependency barriers (scoreboards) to track their completion. When the scheduler moves a texture fetch away from its original position, the dependency barrier assignment from AdvancedScoreboardsAndOpexes (phase 115) may become suboptimal or incorrect. This fixup pass:
- Scans for texture fetch instructions (opcode 0x17 / class 0x37/0x38 in the scheduling tables)
- Checks that the assigned write-barrier index correctly covers the instruction's result register
- Verifies that consumer instructions have the corresponding read-barrier bit set in their wait mask
- Adjusts stall counts and yield flags if the texture result is consumed sooner than the original schedule assumed
Memory Order Intrinsic Lowering
Before the eight sync phases operate on the Ori IR, the OCG intrinsic lowering pipeline translates PTX memory-ordering intrinsics into Ori IR instruction sequences. Three sibling functions in the OCG body dispatcher (sub_6D8B20) handle the three families of memory-ordering intrinsics. All three share an identical subop-array parsing protocol and the same scope/memory-order/deprecation validation logic.
Dispatcher and Function Family
The OCG body dispatcher at sub_6D8B20 (432 lines) reads the intrinsic ID from *(state+10688) and dispatches to per-family lowering functions via a 28-case switch statement. The three memory-ordering handlers are:
| Case | Function | Size | Family | PTX instructions |
|---|---|---|---|---|
| 9 | sub_6C0D90 | 19KB (812 lines) | Atomic/reduction | atom.add, atom.cas, atom.exch, red.add |
| 0xA | sub_6C1CF0 | 16KB (633 lines) | Mbarrier | mbarrier.arrive, mbarrier.test_wait, mbarrier.try_wait, counted/bytemask variants |
| 0x16 | sub_6C4DA0 | 15KB (647 lines) | Fence / load-store | fence.sc, ld.acquire, st.release with scope/domain |
Subop Array Protocol
Each intrinsic descriptor carries a subop array at state+10704 (an int[]) with the count at state+10712. The subop values encode orthogonal PTX qualifiers (scope, memory order, type, domain) into a flat integer sequence that the lowering functions parse in positional order.
Reconstructed subop value map (shared by all three functions):
| Subop | Meaning | IR effect |
|---|---|---|
| 0 | Scope qualifier (.sys/.gpu/.cta) | Sets scope_level = 4 |
| 1 | Counted mode (mbarrier arrival count) | Adds extra type-14 parameter |
| 2 | Shared domain (_shared) | scope = 5 |
| 3 | Memory order acquire | Sets order = 5 |
| 4 | Memory order release | Sets order = 6 |
| 5 | MMIO flag (.mmio) | Sets flag bit 8 |
| 6 | Vector width 2x | scope_width = 2 |
| 7 | Vector width 4x | scope_width = 4 |
| 8 | Type u32 | IR type 12 |
| 9 | Type s32 | IR type 11 |
| 0xA | Type u64 | IR type 10 |
| 0xB--0x12 | Reduction ops (add/min/max/inc/dec/and/or/xor) | Op index 0--7 |
Scope and Memory Order Validation
All three functions enforce the PTX 8.0 scoped memory model rules through a three-way decision tree. The logic (taken from sub_6C0D90 and sub_6C4DA0 where the strings appear verbatim; sub_6C1CF0 enforces equivalent constraints via positional subop checks) is:
if scope_qualifier_present:
if memory_order NOT present:
ERROR 7308: "Required scope with memory order semantics"
elif memory_order_present:
WARNING 7308 (via sub_7F7C10): "Deprecated scope without memory order semantics"
// Deprecation warning — may be promoted to error in future PTX versions.
// If location info available (ctx+104), emits follow-up via sub_8955D0.
if mmio_flag AND NOT global_domain:
ERROR 7308: "Domain param \"_global\" required for mmio semantics"
The warning path uses sub_7F7C10 (the deprecation-warning emitter at context+1176), which returns a boolean indicating whether the warning was promoted to an error. This implements NVIDIA's staged deprecation of unscoped memory operations: PTX code using old-style membar.cta without explicit .acquire/.release qualifiers triggers the deprecation path, while new-style fence.sc.cta.acquire requires the full scope + order combination.
Mbarrier Intrinsic Lowering -- sub_6C1CF0
The mbarrier handler (16KB, case 0xA) lowers mbarrier.* PTX intrinsics into Ori IR instruction sequences. It handles:
1. Scope/domain parsing: The first subop must be 2 (shared) or 3 (global). If the first subop is > 1, it is treated as the domain selector directly; otherwise the function enters the two-position scope path where the second subop supplies the domain.
2. Counted mode (subop 1): Enables arrival-count tracking. When active, the parameter list includes an extra type-14 (integer) parameter for the expected arrival count. Bytemask mode (subop 6) is incompatible with counted mode -- error 7300: "byte mask not allowed with counted".
3. Bytemask mode (subop 6): Requires a global destination (subop[1] == 3) and a shared source (subop[2] == 2). Sets flag bit 17 (0x20000). Error messages: "global dst should be specified with bytemask" and "shared src should be specified with bytemask".
4. Sequenced mode (subop 5): Explicitly unsupported. Error 7300: "sequenced : Not yet supported".
5. MMIO flag (subop 4 when value == 4 in the optional-subop loop): Sets bit 3 in the flag word. Only valid with the global domain (scope 2); enforced by the same "_global required for mmio" rule.
Parameter Processing
Parameters are stored at state+10728 as 12-byte records {value[4], flags[4], type[4]}. The function iterates over v100 parameters (2 or 3 depending on counted mode):
- Each parameter type must be 10 (predicate register) or 12 (scope domain). Other types trigger error 7302 using the type name table at off_229E8C0.
- For scope-domain parameters, the top 3 bits of the value word ((value >> 28) & 7) select the resolution mode:
  - Mode 5: Named barrier resolution via sub_91BF30, then sub_934630 (opcode 130) to create a barrier pseudo-op in the Ori IR.
  - Mode 1 (no bit 24): Direct register reference (fast path, no resolution needed).
  - Other modes: Full register resolution via sub_91D150 + sub_7DEFA0.
Output Instruction Sequence
The function generates three Ori IR instructions:
| Step | Builder | Opcode | Purpose |
|---|---|---|---|
| 1 | sub_934630 | 214 | Mbarrier scope-domain setup; template mask 0x90FFFFFF |
| 2 | sub_934630 | 273 | Memory ordering constraint / fence |
| 3 | sub_92C240 | 299 | Mbarrier operation with full flags (arrive/wait/test) |
The flag word passed to opcode 299 encodes: flags | 0x60000000, where flags accumulates mmio (bit 3), bytemask (bit 17), and other qualifiers from the subop parsing.
Error Codes
| Code | Message template | Severity |
|---|---|---|
| 7300 | "Unexpected intrinsic name (%s)" | Semantic restriction (hard error) |
| 7301 | "Unexpected intrinsic param number (%d)" | Parameter count mismatch |
| 7302 | "Unexpected intrinsic type (%s) in param (%d)" | Wrong parameter type |
| 7303 | "Unexpected intrinsic type (%s) instead of (%s) in param (%d)" | Type mismatch with expected |
| 7306 | "Unexpected intrinsic subop in position (%d)" | Positional subop error |
| 7307 | "Unexpected intrinsic subop (%s) in position (%d)" | Named subop error |
| 7308 | "Instrinsic - \"%s\"" (sic) | Scope/order/domain validation |
Two diagnostic functions handle these errors: sub_895530 emits directly when source location is available (ctx+48); sub_7EEFA0 builds a deferred diagnostic record.
Function Map
| Address | Size | Identity | Phase | Confidence |
|---|---|---|---|---|
sub_C5FBC0 | 34 | StageAndFence execute wrapper | 25 | CERTAIN |
sub_1392E30 | 166 | StageAndFence entry | 25 | HIGH |
sub_1389AF0 | 3,049 | StageAndFence setup | 25 | HIGH |
sub_1390B30 | 8,956 | StageAndFence core (fence insertion) | 25 | HIGH |
sub_138A6E0 | 3,408 | StageAndFence teardown | 25 | HIGH |
sub_C60BD0 | 334 | OriRemoveRedundantBarriers execute wrapper | 26 | CERTAIN |
sub_790A40 | 2,288 | OriRemoveRedundantBarriers main | 26 | HIGH |
sub_790020 | 1,200 | Post-RA scheduling helper | 26 | MEDIUM |
sub_7904D0 | 1,381 | Pre-RA optimization helper | 26 | MEDIUM |
sub_7923A0 | 2,344 | Barrier placement optimization | 26 | MEDIUM |
sub_792CD0 | 1,360 | Top-level barrier pass | 26 | MEDIUM |
0xC5F110 | 6 | ExpandMbarrier execute (vtable dispatch) | 42 | CERTAIN |
sub_C60080 | 34 | OptimizeSyncInstructions execute wrapper | 71 | CERTAIN |
sub_90A340 | 1,670 | OptimizeSyncInstructions main | 71 | HIGH |
sub_18F6930 | 185 | Sync optimization predicate | 71 | HIGH |
sub_C600B0 | 34 | LateExpandSyncInstructions execute wrapper | 72 | CERTAIN |
sub_1381DA0 | 1,517 | LateExpandSyncInstructions entry | 72 | HIGH |
sub_1381CD0 | 206 | LateExpandSyncInstructions core driver | 72 | HIGH |
sub_C5FAD0 | 34 | OriDoSyncronization execute wrapper | 99 | CERTAIN |
sub_A0F020 | 2,375 | DAG scheduler entry (sync insertion) | 99 | HIGH |
sub_A0D800 | -- | Dependency DAG builder | 99 | MEDIUM |
sub_A06A60 | 3,045 | Per-block sync processor | 99 | HIGH |
sub_A0B5E0 | -- | Uninitialized register check | 99 | MEDIUM |
sub_C607A0 | 51 | ApplyPostSyncronizationWars execute wrapper | 100 | CERTAIN |
sub_C60600 | 51 | FixUpTexDepBarAndSync execute wrapper | 114 | CERTAIN |
sub_A9C550 | 2,178 | Barrier instruction lowering | -- | HIGH |
sub_80F400 | 1,779 | Sync instruction SASS lowering | -- | HIGH |
sub_AA3BB0 | 2,726 | MBARRIER encoding | -- | HIGH |
sub_AA33C0 | -- | MBARRIER mnemonic builder | -- | MEDIUM |
sub_775010 | 18 | Barrier liveness computation entry | -- | MEDIUM |
sub_6D8B20 | 432 lines | OCG intrinsic body dispatcher (28-case switch) | -- | HIGH |
sub_6C0D90 | 812 lines | Atomic/reduction intrinsic lowering (scope+order) | -- | HIGH |
sub_6C1CF0 | 633 lines | Mbarrier intrinsic lowering (arrive/wait/test) | -- | HIGH |
sub_6C4DA0 | 647 lines | Fence/load-store intrinsic lowering (scope+domain) | -- | HIGH |
Pipeline Position and Data Flow
The eight sync phases are distributed across the pipeline to operate at the appropriate abstraction level:
Phase 25 StageAndFence ─── Early: after loop unrolling (24)
Phase 26 OriRemoveRedundantBarriers ─── Early: before GeneralOptimize (29)
... (mid-level optimization) ...
Phase 42 ExpandMbarrier ─── Mid: after CTA expansion (40)
... (late optimization) ...
Phase 71 OptimizeSyncInstructions ─── Late: after varying propagation (70)
Phase 72 LateExpandSyncInstructions ─── Late: before SSA destruction (73)
... (legalization, scheduling setup) ...
Phase 99 OriDoSyncronization ─── Post-opt: sync insertion pass
Phase 100 ApplyPostSyncronizationWars ─── Post-opt: WAR fixup
... (register allocation, scheduling) ...
Phase 114 FixUpTexDepBarAndSync ─── Post-sched: texture dep fixup
Data dependencies between phases:
- Phase 25 -> 26: StageAndFence inserts fences; OriRemoveRedundantBarriers may then eliminate redundant ones.
- Phase 42 -> 71: ExpandMbarrier materializes mbarrier ops; OptimizeSyncInstructions may simplify the resulting sequences.
- Phase 71 -> 72: OptimizeSyncInstructions reduces sync count; LateExpandSyncInstructions expands remaining pseudo-ops to SASS.
- Phase 99 -> 100: OriDoSyncronization inserts sync instructions; ApplyPostSyncronizationWars fixes hazards introduced by the insertion.
- Phase 114 -> 115: FixUpTexDepBarAndSync prepares texture barriers for AdvancedScoreboardsAndOpexes.
Architecture-Specific Behavior
The sync passes have significant architecture-dependent behavior controlled through the architecture backend vtable at ctx+1584:
| SM generation | Key behavior |
|---|---|
| sm70--sm75 (Volta/Turing) | Explicit BSSY/BSYNC convergence; WARPSYNC required; --no-membermask-overlap warning active; EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS emitted for membar.sys WAR |
| sm80--sm89 (Ampere/Ada) | cp.async commit/wait groups; ERRBAR after membar.sys; barrier number range checked [0..15] |
| sm90--sm90a (Hopper) | Full MBARRIER support; TMA async pipeline barriers; EIATTR_NUM_MBARRIERS and EIATTR_MBARRIER_INSTR_OFFSETS emitted; wgmma.fence / tcgen05.fence sync fences for tensor operations |
| sm100+ (Blackwell) | Extended cluster barriers (barrier.cluster.arrive/wait); fence.proxy with proxy domain annotations; sync_restrict::shared::{cta,cluster} scope qualifiers; async bulk multicast |
The sub_18F6930 predicate (185 bytes) encodes the architecture-specific decision logic. The magic value 28673 at *(ctx+1584)+372 corresponds to an architecture version threshold that enables explicit synchronization optimization for Volta-class and later architectures.
Related CLI Options
| Option | Effect |
|---|---|
--assume-extern-functions-do-not-sync | Tells the compiler that external function calls do not execute synchronization instructions, enabling more aggressive barrier elimination |
--no-membermask-overlap | Asserts that no sync instruction is executed with different but overlapping thread masks (sm70--sm75 only). Enables additional optimizations. |
--print-potentially-overlapping-membermasks | Diagnostic: prints locations of sync instructions where the compiler must assume overlapping masks |
Related Knobs
| Knob | Effect |
|---|---|
DisableErrbarAfterMembar | When set to 1, suppresses error barrier (BAR.SYNC 15) insertion after membar.sys instructions |
| Knob 358 | Sync optimization mode selector (0=disabled, 1=conservative, 2=aggressive, 3+=full analysis) |
| Knob 472 | Barrier liveness tracking enable |
| Knob 487 | Iteration gate (shared with multiple passes); controls maximum number of iterations |
Cross-References
- Pass Inventory -- complete 159-phase table with sync phases at positions 25, 26, 42, 71, 72, 99, 100, 114
- Scheduler Architecture -- the scheduling infrastructure reused by OriDoSyncronization
- Scoreboards & Dependency Barriers -- phases 114, 115, 116; scoreboard generation
- Phase Manager -- vtable dispatch mechanism, factory switch
- Predication -- shares entry infrastructure with LateExpandSyncInstructions
- Intrinsics Index -- OCG body dispatcher (sub_6D8B20) and per-family lowering functions
- OCG Intrinsic Lowering -- dispatch table for sub_6C0D90 / sub_6C1CF0 / sub_6C4DA0
- GMMA/WGMMA Pipeline -- wgmma.fence and tcgen05.fence interactions
- SM Architecture Map -- per-SM sync capabilities
- Knobs System -- knobs 358, 472, 487, DisableErrbarAfterMembar
Hot/Cold Partitioning
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas implements hot/cold partitioning across three dedicated phases that mark cold blocks, reorganize loop internals, and restructure whole-function control flow to improve instruction cache utilization and warp scheduling efficiency. The system operates at two distinct granularities: instruction-level classification (used by the scheduler's priority function) and block-level classification (used by code layout and predication). Both are static heuristics -- no hardware performance counters are read at runtime -- though profile-guided data from phase 20 (PerformPGO) can influence block weights when available.
| Phases | 41 (MarkAdditionalColdBlocks), 108 (OptimizeHotColdInLoop), 109 (OptimizeHotColdFlow) |
| Category | Analysis (41), Optimization (108, 109) |
| Pipeline positions | Phase 41: mid-optimization (after DoVirtualCTAExpansion); Phases 108--109: post-scheduling (after OriRemoveNopCode) |
| Vtable addresses | off_22BDC30 (41), off_22BE6A8 (108), off_22BE6D0 (109) |
| Instruction classifiers | sub_A9CDE0 (isHotMemoryOp, 380B), sub_A9CF90 (isColdMemoryOp, 367B) |
| Block layout consumer | Phase 112: PlaceBlocksInSourceOrder (sub_A92C50) |
| Related knob | Knob 582 (block-level cold-region query, consumed by predication at phase 63) |
| PGO feeder | Phase 20: PerformPGO (block weights, branch probabilities) |
GPU Motivation
Hot/cold partitioning on a GPU serves fundamentally different purposes than on a CPU.
On a CPU, the primary goal is to keep the hot path in L1 icache lines and push cold code to distant addresses that never evict hot cache lines. The branch predictor handles the control flow; the optimization is purely about cache geometry.
On a GPU, three factors make hot/cold partitioning more impactful:
- Instruction cache pressure. GPU SMs have small instruction caches (typically 32--128 KB shared across all warps on the SM). With dozens of warps in flight, each executing the same kernel, icache misses stall the entire SM. Moving cold code (error paths, rare branches) away from hot loops reduces the working set that must remain cached.
- Warp scheduling. The warp scheduler selects ready warps from a pool. If cold-path instructions are interleaved with hot-path instructions in the binary layout, warps executing the cold path occupy instruction fetch bandwidth that could serve warps on the hot path. Physical separation means the fetch unit can service hot warps without cache line conflicts from cold code.
- Convergence overhead. On sm_70+ architectures, divergent branches require BSSY/BSYNC convergence barriers. Cold blocks that are reached by divergent branches incur barrier setup costs even when the cold path is rarely taken. The predication pass (phase 63) uses knob 582 to query whether a block is in a cold region, allowing it to avoid if-converting cold regions where the divergence penalty is acceptable.
Architecture Overview
The three phases form a pipeline with increasing scope:
Phase 41: MarkAdditionalColdBlocks (mid-optimization, Ori IR)
|
| Sets cold-block flags on basic blocks based on static heuristics
| and PGO data. These flags are read by subsequent optimization
| passes (predication, scheduling, code layout).
|
v
Phase 108: OptimizeHotColdInLoop (post-scheduling, SASS-level)
|
| Within each loop body, separates hot and cold paths. Moves cold
| blocks to the end of the loop region so that the hot path forms
| a contiguous instruction sequence.
|
v
Phase 109: OptimizeHotColdFlow (post-scheduling, SASS-level)
|
| At function scope, restructures control flow to place cold blocks
| after all hot blocks. Adjusts branch targets to maintain correctness.
|
v
Phase 112: PlaceBlocksInSourceOrder (final block layout)
|
| Determines the physical ordering of all basic blocks in the
| emitted binary, consuming the hot/cold annotations set above.
The key architectural decision is that phase 41 runs at the Ori IR level (before scheduling and register allocation), while phases 108--109 run post-scheduling on the nearly-final SASS representation. This two-stage design is necessary because:
- Cold-block annotations must be available early for predication decisions (phase 63) and scheduling priority (the 8-bit priority encoder).
- Block reordering can only happen after scheduling has assigned stall counts and dependency barriers, since moving blocks changes instruction fetch distances and potentially invalidates scoreboard computations.
Phase 41: MarkAdditionalColdBlocks
Phase 41 is an analysis pass that annotates basic blocks with cold flags. The name "Additional" implies that some initial cold marking occurs earlier (likely during AnalyzeControlFlow at phase 3 or PerformPGO at phase 20), and this pass extends those annotations using additional heuristics available after mid-level optimization.
Pipeline Context
Phase 41 runs after DoVirtualCTAExpansion (40) and before ExpandMbarrier (42). At this point in the pipeline:
- The CFG is fully built (phase 3) and loop structure is known (phase 18).
- PGO data has been applied (phase 20) if available.
- Branch optimization (phase 15) has simplified the control flow.
- The IR is still in Ori form -- no register allocation or scheduling has occurred.
Cold-Block Heuristics
The cold-block classification uses both static and profile-guided signals. Based on analysis of consumers of the cold-block flag:
Static heuristics (always available):
| Signal | Classification | Rationale |
|---|---|---|
Error handling / trap terminator | Cold | Error paths are rarely executed in correct programs |
EXIT with non-zero error code | Cold | Abnormal termination paths |
| Deeply nested conditional with uniform condition | Cold | Threads rarely diverge on uniform values |
| Block dominated by a back-edge but not in the loop body | Cold | Loop exit paths taken only once |
| Very low instruction count + unconditional branch to return | Cold | Cleanup epilogues |
Profile-guided signals (when PGO data is available via phase 20):
| Signal | Classification | Rationale |
|---|---|---|
| Execution count below threshold (relative to function entry) | Cold | Directly measured low frequency |
| Branch probability < 5% on the edge leading to the block | Cold | Rarely-taken branch target |
Cold Flag Storage
The cold annotation is stored in the BasicBlock flags field at offset +28 of the 136-byte BasicBlock object. The predication pass queries this via knob 582 (block-level cold-region query), and the scheduling priority function reads it when computing the 8-bit packed priority at bit position 5.
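A minimal model of the flag query, assuming the layout above: the flags word sits at BasicBlock+28 in the 136-byte object, but the exact bit position of the cold flag is unconfirmed (see the confidence table below), so COLD_BIT here is hypothetical.

```python
# Illustrative cold-block flag query. Offset +28 and the 136-byte object
# size are from the recovered layout; COLD_BIT is a hypothetical bit
# position, since the exact bit within the flags word is unconfirmed.
import struct

COLD_BIT = 1 << 0  # hypothetical position within the +28 flags word

def is_cold_block(bb_bytes) -> bool:
    """Read the 32-bit flags word at offset +28 of a BasicBlock object."""
    (flags,) = struct.unpack_from("<I", bb_bytes, 28)
    return bool(flags & COLD_BIT)

bb = bytearray(136)                        # zeroed BasicBlock-sized buffer
assert not is_cold_block(bb)
struct.pack_into("<I", bb, 28, COLD_BIT)   # mark the block cold
assert is_cold_block(bb)
```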
Consumers of Cold Annotations
| Consumer | Phase | Usage |
|---|---|---|
OriDoPredication | 63 | Knob 582: skips if-conversion of cold regions (divergence penalty acceptable in cold code) |
| Scheduling priority | 97--101 | Bit 5 of 8-bit priority: hot instructions get higher scheduling priority (1 = hot, 0 = cold) |
OptimizeHotColdInLoop | 108 | Reads cold flags to identify which loop blocks to move |
OptimizeHotColdFlow | 109 | Reads cold flags for whole-function layout |
PlaceBlocksInSourceOrder | 112 | Final block ordering uses cold annotations |
Instruction-Level Hot/Cold Classification
Independent of the block-level cold marking, ptxas classifies individual memory instructions as "hot" or "cold" for scheduling purposes. This classification is performed by two small, dual functions.
sub_A9CDE0 -- isHotMemoryOp (380 bytes)
Classifies an instruction as a hot memory operation. Hot instructions access memory spaces with high latency where early scheduling is beneficial.
isHotMemoryOp(scheduler, context, instruction):
opcode = instruction->opcode & 0xFFFFCFFF // mask modifier bits
if opcode == 183 or opcode == 288: // LD.E / ST.E (global load/store)
operand = resolve_last_source(instruction)
memspace = getMemorySpace(operand)
if memspace == 6: // global memory
return true
if memspace == 4: // shared memory
return ((operand->modifier >> 19) & 7) == 1 // specific variant
return false
if opcode in {91, 92}: // ATOM / RED
modifier = instruction->operand[last]
return ((modifier ^ 6) & 6) == 0 and (modifier & 1) // specific addressing mode
return false
sub_A9CF90 -- isColdMemoryOp (367 bytes)
The exact dual of isHotMemoryOp. Classifies an instruction as a cold memory operation.
isColdMemoryOp(scheduler, context, instruction):
opcode = instruction->opcode & 0xFFFFCFFF
if opcode == 183 or opcode == 288: // LD.E / ST.E
operand = resolve_last_source(instruction)
memspace = getMemorySpace(operand)
if memspace == 5: // constant memory (vs 6 for hot)
return true
if memspace == 4: // shared memory
return ((operand->modifier >> 19) & 7) == 2 // complement variant (vs 1 for hot)
return false
if opcode in {91, 92}: // ATOM / RED
modifier = instruction->operand[last]
return ((modifier ^ 6) & 6) == 0 and (modifier & 1) == 0 // complement of hot check
return false
Memory Space Classification
The memory space type is resolved by sub_91C840 from register file metadata at context+152:
| Space Code | Memory Type | Hot/Cold | Scheduling Implication |
|---|---|---|---|
| 4 | Shared memory | Depends on variant | Low latency (~20 cycles), variant-dependent |
| 5 | Constant memory | Cold | Cached, low latency (~4 cycles via constant cache) |
| 6 | Global memory | Hot | High latency (~200--800 cycles), benefits from early issue |
The shared memory case splits on a 3-bit subfield at operand bits 19--21: variant 1 is hot (bank-conflicted or special access pattern), variant 2 is cold (standard access).
For atomic operations (opcodes 91/92 = ATOM/RED), the hot/cold split is on the addressing mode: specific atomics targeting global memory in reduction mode are hot; others are cold.
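The load/store side of the classification above can be restated as a runnable sketch. The opcodes (183/288) and space codes (4/5/6) come from the recovered tables; the dataclass is a simplified stand-in for the Ori instruction object, and the ATOM/RED addressing-mode branch is omitted for brevity.

```python
# Runnable restatement of the LD.E/ST.E hot/cold split described above.
from dataclasses import dataclass

SHARED, CONSTANT, GLOBAL = 4, 5, 6  # recovered memory space codes

@dataclass
class MemOp:
    opcode: int
    memspace: int
    modifier: int = 0  # operand modifier word; bits 19-21 = shared variant

def classify(op: MemOp) -> str:
    masked = op.opcode & 0xFFFFCFFF       # mask modifier bits
    if masked in (183, 288):              # LD.E / ST.E
        if op.memspace == GLOBAL:
            return "hot"                  # long latency: issue early
        if op.memspace == CONSTANT:
            return "cold"                 # cached, short latency
        if op.memspace == SHARED:
            variant = (op.modifier >> 19) & 7
            return {1: "hot", 2: "cold"}.get(variant, "neither")
    return "neither"

assert classify(MemOp(183, GLOBAL)) == "hot"
assert classify(MemOp(288, CONSTANT)) == "cold"
assert classify(MemOp(183, SHARED, 1 << 19)) == "hot"
assert classify(MemOp(183, SHARED, 2 << 19)) == "cold"
```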
Scheduling Priority Integration
The instruction-level hot/cold classification feeds directly into the scheduler's 8-bit priority encoding (documented in Scheduling Algorithm):
Bit 7: yield-related
Bit 6: yield
Bit 5: hot/cold (1 = hot = higher priority, 0 = cold = lower priority)
Bit 4: register pressure overflow
Bit 3: same-BB preference
Bit 2: stall-free
Bit 1: critical path
Bit 0: tiebreaker
Hot memory instructions (global loads, global atomics) get higher scheduling priority because their long latencies benefit from being issued early -- the scheduler can then fill the latency window with independent instructions. Cold memory instructions (constant loads) have short latencies and do not benefit from early issue, so they receive lower priority.
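The bit layout above can be sketched as a packing function (the field names follow the bit table; the packing helper itself is illustrative, not a recovered function):

```python
# Sketch of the 8-bit scheduling priority encoding documented above.
def pack_priority(yield_rel=0, yld=0, hot=0, pressure=0,
                  same_bb=0, stall_free=0, crit_path=0, tiebreak=0) -> int:
    return ((yield_rel << 7) | (yld << 6) | (hot << 5) | (pressure << 4) |
            (same_bb << 3) | (stall_free << 2) | (crit_path << 1) | tiebreak)

# A hot global load on the critical path outranks a cold constant load
# that is otherwise identical:
hot_load = pack_priority(hot=1, crit_path=1)   # bit 5 set
cold_load = pack_priority(hot=0, crit_path=1)
assert hot_load > cold_load
```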
Phase 108: OptimizeHotColdInLoop
Phase 108 operates at the post-scheduling level, after register allocation and NOP removal have completed. It optimizes the layout of basic blocks within loop bodies.
Pipeline Context
Phase 107: OriRemoveNopCode (NOP removal)
Phase 108: OptimizeHotColdInLoop (loop-internal reordering)
Phase 109: OptimizeHotColdFlow (function-wide reordering)
Phase 110: PostSchedule (post-scheduling fixup)
Phase 112: PlaceBlocksInSourceOrder (final layout)
At this point, instructions have been scheduled and stall counts assigned. The optimization must preserve scheduling correctness while improving spatial locality.
Algorithm
The pass iterates over each loop in the function (loop structure computed at phase 18, maintained through the pipeline):
1. Identify loop blocks. Using the loop header RPO number and exit RPO number, enumerate all blocks in the loop body.
2. Classify blocks. Each block in the loop is classified as hot or cold based on the cold-block flags set by phase 41 (and potentially refined by phases between 41 and 108).
3. Partition. Hot blocks remain at the top of the loop body; cold blocks are moved to the bottom (higher addresses within the loop region).
4. Adjust branches. Branch targets are updated to reflect the new block positions. Cold blocks that were fall-through targets of hot blocks receive explicit branch instructions (since they are no longer adjacent).
The effect is that the hot loop body forms a contiguous instruction sequence that fits in fewer icache lines:
Before: After:
loop_header (hot) loop_header (hot)
hot_block_1 hot_block_1
cold_error_check hot_block_2
hot_block_2 BRA loop_header
BRA loop_header cold_error_check (moved)
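The partition step can be sketched as a stable split that pins the loop header (a simplified model of the layout transformation; block names and the predicate interface are illustrative):

```python
# Sketch of the phase-108 intra-loop layout: the header stays first, hot
# blocks keep their relative order at the top of the loop region, and
# cold blocks sink to the bottom.
def partition_loop(blocks, is_cold):
    header, body = blocks[0], blocks[1:]        # header cannot move
    hot = [b for b in body if not is_cold(b)]
    cold = [b for b in body if is_cold(b)]
    return [header] + hot + cold                # stable within each class

loop = ["loop_header", "hot_block_1", "cold_error_check", "hot_block_2"]
cold_set = {"cold_error_check"}
assert partition_loop(loop, cold_set.__contains__) == [
    "loop_header", "hot_block_1", "hot_block_2", "cold_error_check"]
```

In the real pass, step 4 (branch adjustment) then inserts explicit branches for cold blocks that lost their fall-through adjacency.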
Constraints
- The loop header block cannot be moved (it must be the entry point).
- Blocks with back-edges to the loop header must maintain their branch reachability.
- The transformation must not change the set of scoreboard/dependency barrier states visible at each instruction (since scheduling has already completed).
Phase 109: OptimizeHotColdFlow
Phase 109 extends the hot/cold separation to the entire function, operating on blocks that are not inside loops (or that span multiple loops).
Algorithm
The whole-function pass:
1. Scan all blocks in RPO order. Classify each non-loop block as hot or cold.
2. Partition the function into a hot region (placed first in the binary) and a cold region (placed last). Loop bodies are treated as atomic units -- the internal ordering was already optimized by phase 108.
3. Insert or adjust branches at hot-to-cold and cold-to-hot transitions. Cold-to-hot transitions require a branch back to the hot region.
4. Update block ordering metadata consumed by phase 112 (PlaceBlocksInSourceOrder).
The combined effect of phases 108 and 109 is a two-level layout:
Function layout after phases 108+109:
[hot loop bodies, internally sorted by phase 108]
[hot non-loop blocks]
[cold blocks from all regions]
Tepid Scheduling
Between the extremes of hot and cold, ptxas recognizes a "tepid" scheduling mode that balances math and memory instruction interleaving. The tepid infrastructure lives at 0x7A4350--0x7A5000 and computes ratios:
| Metric | Formula | Purpose |
|---|---|---|
MathToDmaWaitRatio | field[756] / a5 | Ratio of math cycles to memory wait cycles |
MathToDmaTepidRatio | field[752] / a6 | Ratio of math cycles to memory tepid cycles |
MathToEpilogueWaitRatio | field[756] / (a5 / epilogue_count) | Per-epilogue math-to-wait ratio |
MathToEpilogueTepidRatio | a6 / epilogue_count | Per-epilogue tepid ratio |
These ratios are computed by sub_7A4350 (TepidSchedulingCompute) and reported by sub_7A46E0 (TepidSchedulingReport) when verbosity > 0. Epilogue blocks are identified by sub_754510 (IsEpilogueBlock), with the epilogue instruction count controlled by knob 294.
The tepid mode affects how aggressively the scheduler interleaves memory and math instructions -- hot regions use aggressive overlap, cold regions use conservative scheduling, and tepid regions use an intermediate policy.
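The ratio table above can be restated as a small computation. The parameter names here (math_wait_cycles and so on) are hypothetical stand-ins for the recovered field[756], field[752], and the a5/a6 arguments of sub_7A4350; only the formulas themselves come from the table.

```python
# Illustrative computation of the tepid scheduling ratios tabulated above.
def tepid_ratios(math_wait_cycles, math_tepid_cycles,
                 dma_wait, dma_tepid, epilogue_count):
    return {
        # field[756] / a5
        "MathToDmaWaitRatio": math_wait_cycles / dma_wait,
        # field[752] / a6
        "MathToDmaTepidRatio": math_tepid_cycles / dma_tepid,
        # field[756] / (a5 / epilogue_count)
        "MathToEpilogueWaitRatio": math_wait_cycles / (dma_wait / epilogue_count),
        # a6 / epilogue_count
        "MathToEpilogueTepidRatio": dma_tepid / epilogue_count,
    }

r = tepid_ratios(1200, 800, 400, 200, 4)
assert r["MathToDmaWaitRatio"] == 3.0
assert r["MathToEpilogueWaitRatio"] == 12.0
```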
Interaction with Other Passes
Predication (Phase 63)
The predication pass queries knob 582 to determine whether a branch region lies in a cold block. If the region is cold, predication may be skipped because:
- The cold path is rarely executed, so the branch divergence penalty is amortized.
- Predication would execute both paths unconditionally, wasting functional units on cold-path instructions.
- Keeping the branch allows the cold path to be physically separated by phases 108--109.
PlaceBlocksInSourceOrder (Phase 112)
Phase 112 is the final block layout pass. It consumes the hot/cold annotations and the reordering decisions made by phases 108--109 to determine the physical position of every basic block in the emitted binary. The function sub_A92C50 implements this with a complex block-sorting algorithm that uses FNV-1a hash maps and an explicit work stack.
Key fields consumed from the Code Object:
| Offset | Field | Usage |
|---|---|---|
| +232 | Current block pointer | Block being placed |
| +264 | Block type/mode | Controls placement strategy |
| +296 | BB array | Block pointers for lookup |
| +648 | Successor edge map | Determines fall-through targets |
| +720 | RPO array | Provides initial ordering |
PerformPGO (Phase 20)
When profile data is available (from prior compilation runs with --generate-line-info and feedback), phase 20 applies execution counts and branch probabilities to the IR. These weights directly influence cold-block identification at phase 41 -- blocks with execution counts below a threshold relative to the function entry are marked cold regardless of static heuristics.
Key Functions
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_A9CDE0 | 380B | isHotMemoryOp -- classifies instruction as hot memory access | HIGH (0.90) |
sub_A9CF90 | 367B | isColdMemoryOp -- classifies instruction as cold memory access | HIGH (0.90) |
sub_91C840 | ~200B | getMemorySpace -- resolves memory space type from operand metadata | MEDIUM |
sub_A92C50 | ~5KB | PlaceBlocksInSourceOrder -- final block layout algorithm | HIGH |
sub_7A46E0 | ~1.1KB | TepidSchedulingReport -- reports tepid scheduling ratios | HIGH |
sub_7A4350 | ~500B | TepidSchedulingCompute -- computes tepid scheduling metrics | MEDIUM |
sub_754510 | ~200B | IsEpilogueBlock -- identifies epilogue blocks | MEDIUM |
Vtable Layout
| Phase | Index | Vtable Address | Name String Address |
|---|---|---|---|
| MarkAdditionalColdBlocks | 41 | off_22BDC30 | 0x22BC763 |
| OptimizeHotColdInLoop | 108 | off_22BE6A8 | 0x22BCD1D |
| OptimizeHotColdFlow | 109 | off_22BE6D0 | 0x22BCD33 |
All three vtables follow the standard 5-entry layout:
| Vtable Offset | Entry |
|---|---|
| +0 | execute(phase*, compilation_context*) |
| +8 | isNoOp(phase*) -> bool |
| +16 | getName(phase*) -> int |
| +24 | alloc(pool*, size) |
| +32 | free(pool*, ptr) |
Confidence Assessment
| Claim | Confidence | Evidence |
|---|---|---|
| Phase names and indices (41, 108, 109) | VERY HIGH | Static name table at off_22BD0C0, factory switch at sub_C60D30 |
| Vtable addresses | VERY HIGH | Computed from base off_22BD5C8 + index * 40 |
isHotMemoryOp / isColdMemoryOp identity | HIGH | Dual function structure, memory space checks, opcode patterns |
| Memory space codes (4=shared, 5=constant, 6=global) | HIGH | Confirmed across multiple consumers |
| Scheduling priority bit 5 = hot/cold | HIGH | Decompiled priority function at sub_8C9320 |
| Phase 41 runs before scheduling | VERY HIGH | Factory index and pipeline ordering table |
| Phases 108--109 run post-scheduling | VERY HIGH | Pipeline ordering table, position after OriRemoveNopCode |
| Knob 582 cold-region query in predication | HIGH | Decompiled predication pass at sub_1381010 |
| Block layout consumer at phase 112 | HIGH | sub_A92C50 identified via string xref to PlaceBlocksInSourceOrder |
| Cold-block flag in BB+28 | MEDIUM | Inferred from consumer patterns; exact bit position unconfirmed |
| Tepid scheduling ratios | HIGH | String evidence from decompiled sub_7A46E0 |
| PGO influence on cold marking | MEDIUM | Inferred from pipeline ordering (PGO at 20, cold marking at 41) |
Cross-References
- Pass Inventory -- phases 41, 108, 109, 112 in the complete 159-phase table
- Basic Blocks & CFG -- BasicBlock object layout, RPO computation, edge hash maps
- Scheduling Algorithm -- 8-bit priority encoding, hot/cold bit 5
- Scheduler Overview -- hot/cold classification in scheduling context
- Predication -- knob 582 cold-region gate
- Instruction Format -- instruction +72 opcode, +80 operand count, +84 operand array
- Optimization Pipeline -- dispatch loop and phase execution order
GMMA/WGMMA Pipeline
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The GMMA pipeline handles warpgroup matrix multiply-accumulate (WGMMA) instructions introduced with SM 90 (Hopper). Two dedicated compiler phases -- OriPropagateGmma (phase 85) and FixupGmmaSequence (phase 87) -- transform the IR to satisfy the hardware's strict pipelining requirements for asynchronous tensor-core operations. These are the only passes in ptxas whose sole purpose is WGMMA instruction handling.
WGMMA operates at warpgroup granularity (4 warps executing in lockstep). The hardware requires a specific sequencing protocol: wgmma.fence to open a pipeline stage, a sequence of wgmma.mma_async operations that share accumulator registers, wgmma.commit_group to close the stage, and wgmma.wait_group to synchronize on completion. Between the fence and wait, strict constraints govern which registers can be touched by non-WGMMA instructions. Violating these constraints forces the compiler to serialize the WGMMA pipeline, destroying throughput.
| Pipeline phases | 85 (OriPropagateGmma), 87 (FixupGmmaSequence) |
| Target architectures | SM 90+ (Hopper, Blackwell) |
| Phase 85 entry | sub_AE5030 (2,967 bytes) -- outer driver, SM gate check |
| Phase 85 core | sub_ADAD60 (2,170 bytes) -- accumulator propagation per instruction |
| Phase 87 entry | sub_AE4F70 (182 bytes) -- sequencing orchestrator |
| Phase 87 core | sub_ADEB40 (7,077 bytes) -- sequence fixup, warpgroup inject |
| Serialization warnings | sub_ACE480 (1,908 bytes) -- 10 distinct warning codes |
| Pipeline validation | sub_AE3D40 (2,511 bytes) -- sequence structural check |
| Accumulator collect | sub_ADA740 (146 bytes) -- gathers accumulator register set |
| Live range propagation | sub_ADBD30 (3,364 bytes) -- per-basic-block propagation |
| Phase name strings | 0x22BCB13 (OriPropagateGmma), 0x22BCB40 (FixupGmmaSequence) |
Hardware Background
Warpgroup Execution Model
A warpgroup consists of 4 consecutive warps (128 threads). WGMMA instructions execute cooperatively across all 4 warps, with each warp contributing a slice of the matrix operation. The hardware tensor core pipeline is decoupled from the main pipeline: wgmma.mma_async dispatches work to the tensor core and returns immediately, while the accumulator registers remain in-flight until a wgmma.wait_group completes.
The PTX-level instructions that constitute a WGMMA pipeline stage:
| PTX Instruction | Ori Opcode | Role |
|---|---|---|
wgmma.fence | (via handler sub_4DA380) | Opens a pipeline stage; prevents reordering across the fence |
wgmma.mma_async | 309 | Dispatches an asynchronous matrix multiply-accumulate |
wgmma.commit_group | (via handler sub_4DA4B0) | Closes the current pipeline stage |
wgmma.wait_group | (via handler sub_4DA5E0) | Waits for N committed groups to complete |
_warpgroup.arrive | 323 | Compiler-inserted warpgroup synchronization (arrive) |
_warpgroup.wait | 271 (masked & 0xFFFFCFFF) | Compiler-inserted warpgroup synchronization (wait) |
_warpgroup.commit_batch | -- | Compiler-inserted commit batch
The _warpgroup.* instructions (prefixed with underscore) are compiler-internal pseudo-operations inserted by ptxas, not directly written by the programmer. They map to SASS WARPGROUP.ARRIVE, WARPGROUP.WAIT, and WARPGROUP.DEPBAR instructions.
Accumulator Register Constraints
WGMMA accumulator registers are the output (D) operands of wgmma.mma_async. While a pipeline stage is open (between fence and wait), strict rules apply:
- No non-WGMMA definitions of accumulator registers. Another instruction cannot write to a register that a WGMMA in the current stage uses as an accumulator.
- No non-WGMMA reads of accumulator registers. Another instruction cannot read from an accumulator register between the producing WGMMA and the completing wait.
- No non-WGMMA definitions of WGMMA input registers. The A and B matrix input registers (including descriptor registers) must not be redefined by non-WGMMA instructions within the stage.
Violation of any constraint forces serialization -- the compiler collapses the pipeline to issue one WGMMA at a time with individual fence/commit/wait per operation.
Sparse GMMA
The binary contains support for sparse GMMA variants (structured sparsity). The string "Sparse GMMA with " at 0x1D0B430 appears in sub_494210 (2,276 bytes), which handles sparse matrix metadata validation. Sparse WGMMA uses an additional metadata operand encoding the 2:4 or other sparsity pattern.
Phase 85: OriPropagateGmma
Purpose
Phase 85 propagates WGMMA accumulator register liveness information through the IR. For each wgmma.mma_async instruction (Ori opcode 309), it identifies the accumulator register set and builds a compact encoding that downstream passes use to track which registers are "in-flight" at each program point. This information is consumed by phase 87 to determine where warpgroup.arrive and warpgroup.wait instructions must be injected.
SM Gate
The outer driver sub_AE5030 checks the target architecture before proceeding. At offset +1381 of the compilation context, a flag indicates whether the target supports WGMMA. The check at the function entry:
if (*(char*)(context + 1381) >= 0) // bit 7 clear = no WGMMA support
return;
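The `>= 0` comparison tests bit 7 because the byte is read as a signed char: any value with bit 7 set is negative. A quick model of the gate (offset 1381 is from the recovered check; the buffer-based context is a stand-in):

```python
# Why the signed-char comparison above tests bit 7: reading the WGMMA
# capability byte at ctx+1381 as signed makes bit-7 values negative.
import struct

def wgmma_supported(ctx) -> bool:
    (flag,) = struct.unpack_from("b", ctx, 1381)  # signed char
    return flag < 0  # bit 7 set -> negative -> WGMMA supported

ctx = bytearray(2048)
assert not wgmma_supported(ctx)   # 0x00: the pass returns early
ctx[1381] = 0x80                  # set bit 7
assert wgmma_supported(ctx)
```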
An additional mode check reads from the target descriptor at offset 26208 (within a 72-byte sub-structure at the descriptor's offset 72):
- Value 0: no WGMMA support -- skip entirely
- Value 1 with sub-field at 26216 nonzero: use the simple single-function path (sub_ADCA60)
- Otherwise: use the full pipeline analysis path
Accumulator Register Encoding
The core function sub_ADAD60 processes each wgmma.mma_async instruction and encodes its accumulator register set into a packed 32-bit word. The encoding uses the FNV-1a hash (prime 16777619, offset basis 0x811C9DC5) for register-set lookup in a hash table:
hash = 16777619 * (HIBYTE(reg_id) ^
(16777619 * (BYTE2(reg_id) ^
(16777619 * (BYTE1(reg_id) ^
(16777619 * ((uint8_t)reg_id ^ 0x811C9DC5)))))));
Accumulator entries are stored with a type tag in the high nibble:
- `0x90000000 | (encoded_accum & 0xFFFFFF)` -- source accumulator register set
- `0x10000000 | (encoded_accum & 0xFFFFFF)` -- destination accumulator register set
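The unrolled hash expression above is FNV-1a over the four bytes of reg_id, least-significant byte first (the innermost XOR consumes the low byte). A sketch of the hash and the type-tagged entries (the 32-bit truncation is implicit in the binary's unsigned arithmetic; here it is applied explicitly, and the function names are descriptive, not recovered):

```python
# Model of the recovered FNV-1a register-set hash and the high-nibble
# type tags applied to accumulator entries.
FNV_PRIME, FNV_BASIS, M = 16777619, 0x811C9DC5, 0xFFFFFFFF

def regset_hash(reg_id: int) -> int:
    """FNV-1a over the four bytes of reg_id, LSB first, mod 2**32."""
    h = FNV_BASIS
    for b in reg_id.to_bytes(4, "little"):
        h = ((h ^ b) * FNV_PRIME) & M
    return h

def tag_entry(encoded_accum: int, is_dest: bool) -> int:
    """Apply the high-nibble type tag to a 24-bit encoded accumulator set."""
    tag = 0x10000000 if is_dest else 0x90000000
    return tag | (encoded_accum & 0xFFFFFF)

assert tag_entry(0x123456, is_dest=False) == 0x90123456
assert regset_hash(1) != regset_hash(2)   # distinct ids hash apart
```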
Live Range Limit Check
After accumulator propagation, the pass checks whether the number of active GMMA live ranges exceeds the hardware limit. The limit is stored at offset 56 of the pass object (field *(DWORD*)(a1 + 56) = maxActiveGmmaLiveRanges). If exceeded, a diagnostic is emitted:
"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"
This diagnostic uses warning code 0x1CEF (7407). The limit is architecture-dependent and reflects the number of accumulator register banks available to the tensor core pipeline.
Call Chain
sub_AE5030 (2,967B -- SM gate, iteration over basic blocks)
└─ sub_ADCA60 (3,643B -- per-function pipeline analysis)
└─ sub_ADBD30 (3,364B -- per-block accumulator propagation)
└─ sub_ADAD60 (2,170B -- per-instruction accumulator encoding)
├─ sub_AD4500 -- hash table lookup for register set
├─ sub_AD4940 -- hash table insert/update
├─ sub_AD6280 -- register set cache insert
├─ sub_AD8E50 -- instruction iterator setup
├─ sub_AD0C50 -- begin accumulator iteration
├─ sub_AD3EA0 -- advance accumulator iterator
├─ sub_AD1FA0 -- advance to next accumulator slot
├─ sub_75A670 -- grow dynamic array (accumulator list)
└─ sub_895530 -- emit diagnostic warning
Accumulator Collection Helper
sub_ADA740 (146 bytes) collects the set of registers that are accumulators for a given instruction. It iterates over an instruction's operands, checking:
- Operand type tag: `(operand >> 28) & 7 == 1` (register operand)
- Not an immediate-flagged operand: `(byte_flag & 1) == 0`
- `reg_type == 6` at `vreg+64` (tensor/accumulator register class)
Matching registers are added to a bitvector-like set via sub_768AB0.
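The three operand checks can be expressed compactly. The struct layout below is a simplification for illustration (the real operand record lives inside the Ori instruction object); only the tested bit positions are taken from the decompilation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified operand record; field names are illustrative. */
typedef struct {
    uint32_t operand;   /* packed operand word: type tag in bits 28..30 */
    uint8_t  byte_flag; /* bit 0 set = immediate-flagged operand */
    uint8_t  reg_type;  /* value at vreg+64; 6 = tensor/accumulator class */
} operand_t;

/* Mirrors the predicate sub_ADA740 applies to each operand. */
static bool is_accumulator_operand(const operand_t *op) {
    if (((op->operand >> 28) & 7u) != 1u) return false; /* must be a register operand */
    if (op->byte_flag & 1u) return false;               /* immediate-flagged: skip */
    return op->reg_type == 6u;                          /* tensor/accumulator class */
}
```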
Phase 87: FixupGmmaSequence
Purpose
Phase 87 is the critical legalization pass. It analyzes WGMMA instruction sequences, verifies that the hardware pipeline constraints are satisfied, and inserts warpgroup.arrive / warpgroup.wait instructions where registers used by non-WGMMA instructions conflict with in-flight WGMMA accumulators. If the pipeline cannot be formed correctly, it triggers serialization and emits performance warnings.
Orchestrator: sub_AE4F70
The 182-byte wrapper orchestrates the complete fixup sequence:
sub_AE4F70 (FixupGmmaSequence orchestrator)
│
├─ [1] sub_ADEB40 -- primary sequence fixup (inject arrive/wait)
├─ [2] sub_ADA7E0 -- verify pipeline consistency
├─ [3] sub_AE3D40 -- structural validation of sequences
├─ [4] sub_AD8F90 -- secondary validation pass
├─ [5] sub_AE4710 -- finalize sequence metadata
├─ [6] sub_AE17C0 -- late pipeline consistency check
│
└─ On failure at any step:
├─ Set serialization flag: *(BYTE*)(context + 1920) = 1
├─ sub_ACE480 -- emit serialization warning
└─ sub_AE47B0 -- serialize the WGMMA pipeline (fallback)
The return value encodes the failure reason in the low 32 bits and a function identifier in the high 32 bits, which sub_ACE480 uses to select the appropriate warning message.
Primary Fixup: sub_ADEB40
This 7,077-byte function is the heart of the GMMA pipeline. Its logic:
1. Initialization. Allocates two dynamic arrays (v224/v225 for warpgroup.wait insertion points, i/v228 for warpgroup.arrive insertion points) and initializes them with sentinel values (0xFFFFFFFF).
2. First pass -- identify WGMMA sequences. Iterates over all instructions in the function's code list. For each instruction with opcode 309 (wgmma.mma_async):
- Collects the instruction's accumulator register set via the `sub_ACC0A0`/`sub_AD50B0` iterator pattern
- Checks whether each of the instruction's operands (positions 1--4) has already been marked with arrival/wait flags
- For unmarked operands, calls `sub_ADA740` to collect accumulator registers and add them to the tracking set
The pass checks operand flag bits at instruction + 84 + 8*operand_index + 4:
- Bit 0 (`& 1`): operand has been processed for arrive
- Bit 1 (`& 2`): operand has been processed for wait
- Bit 2 (`& 4`): operand requires a warpgroup.arrive/wait boundary
3. Second pass -- walk pipeline stages. For each WGMMA sequence identified in the compilation context's sequence table (context->field_99), the pass walks forward through basic blocks:
- Tracks the current pipeline stage state (`v206`: 0=initial, 1=arrived, 2=committed)
- When encountering a `wgmma.mma_async` (opcode 309), records it as part of the current stage
- When encountering a `_warpgroup.commit_batch` (opcode 323), marks the stage boundary and sets bit 2 on the last accumulator operand
- When encountering an `arrive` (opcode 271 masked) or `wait` (opcode 32 masked), updates the pipeline state
- When encountering a function call (opcode 236), forces a pipeline break
For non-WGMMA instructions within a stage, checks whether their register operands conflict with the active accumulator set by querying the bitvector (the balanced binary tree at v238). If a conflict is found, the instruction needs a warpgroup.arrive or warpgroup.wait to be injected before it.
4. Injection. Creates new instructions:
- `sub_ACBE60` creates `warpgroup.arrive` pseudo-instructions
- `sub_ACBF80` creates `warpgroup.wait` pseudo-instructions
These are added to the arrival/wait lists and later inserted into the code.
5. Commit pass. After analysis, iterates over the collected injection points:
- For each `warpgroup.arrive` insertion, checks whether the injection needs a diagnostic via `sub_ACBCA0` (knob-gated), then emits advisory warning `0x1D5F` (7519): "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
- For each `warpgroup.wait` insertion, emits advisory warning `0x1D5D` (7517): "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
6. Finalization. Calls sub_ADD8A0 (1,349 bytes) to rebuild the WGMMA sequence metadata after injection.
Pipeline Stage State Machine
The fixup pass maintains a state machine as it walks through instructions within a WGMMA sequence:
┌──────────────┐
│ state = 0 │ (initial / outside pipeline)
│ no active │
│ stage │
└──────┬───────┘
│ encounter wgmma.mma_async
▼
┌──────────────┐
│ state = 1 │ (in pipeline stage, arrived)
│ tracking │
│ accumulators│
└──────┬───────┘
│ encounter commit_batch
▼
┌──────────────┐
│ state = 2 │ (committed, waiting)
│ accumulators│
│ in-flight │
└──────┬───────┘
│ encounter wait or stage end
▼
┌──────────────┐
│ state = 0 │ (back to initial)
└──────────────┘
At any state, encountering a function call (opcode 236)
or a conflicting register use forces:
→ inject warpgroup.arrive/wait
→ potentially serialize the pipeline
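The transitions in the diagram reduce to a small switch. This is a behavioral sketch of the state machine only, using the recovered opcode and state constants; the event model and names are illustrative, and the real pass carries much more per-stage bookkeeping:

```c
#include <assert.h>

/* States match the recovered v206 values; opcodes match the Ori constants. */
enum { ST_INITIAL = 0, ST_ARRIVED = 1, ST_COMMITTED = 2 };
enum { OP_WGMMA = 309, OP_COMMIT = 323, OP_WAIT = 32, OP_CALL = 236 };

/* One step of the stage walk: returns the next state and flags a forced
 * pipeline break (serialization candidate) on a function call. */
static int gmma_step(int state, int opcode, int *force_break) {
    *force_break = 0;
    if (opcode == OP_CALL) { *force_break = 1; return ST_INITIAL; }
    switch (state) {
    case ST_INITIAL:   return (opcode == OP_WGMMA)  ? ST_ARRIVED   : ST_INITIAL;
    case ST_ARRIVED:   return (opcode == OP_COMMIT) ? ST_COMMITTED : ST_ARRIVED;
    case ST_COMMITTED: return (opcode == OP_WAIT)   ? ST_INITIAL   : ST_COMMITTED;
    }
    return ST_INITIAL;
}
```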
Register Conflict Detection
Register type 6 (vreg+64 == 6) is the tensor/accumulator register class. The conflict check compares operand register IDs against the active accumulator bitvector using a balanced binary search tree (v238 / v148 in the decompilation). The tree is keyed by register_id >> 8 (register bank) with a 64-bit bitmap per node tracking individual registers within the bank:
bit_index = register_id & 0x3F;
bank_offset = (register_id >> 6) & 3; // 0..3 for 4 64-bit words per node
is_conflict = (node->bitmap[bank_offset + 4] >> bit_index) & 1;
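The per-node lookup can be made concrete. Each tree node covers one bank of 256 registers (keyed by `register_id >> 8`), with four 64-bit bitmap words; in the decompilation those words sit at node word offsets +4..+7, which is why the index is `bank_offset + 4`. The struct below is a sketch of just the bitmap portion, with illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of one tree node's payload: words 4..7 hold the live bits for
 * the node's 256-register bank. The real node also carries tree links. */
typedef struct {
    uint64_t bitmap[8];
} gmma_node_t;

/* Record a register as an in-flight accumulator in its bank node. */
static void mark_live(gmma_node_t *node, uint32_t register_id) {
    node->bitmap[((register_id >> 6) & 3u) + 4] |= 1ull << (register_id & 0x3Fu);
}

/* The recovered conflict test, verbatim in structure. */
static bool is_conflict(const gmma_node_t *node, uint32_t register_id) {
    uint32_t bit_index   = register_id & 0x3Fu;
    uint32_t bank_offset = (register_id >> 6) & 3u;  /* 0..3 */
    return (node->bitmap[bank_offset + 4] >> bit_index) & 1u;
}
```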
Serialization Warnings
When the pipeline cannot be formed correctly, sub_ACE480 (1,908 bytes) emits one of 10 distinct performance warnings. The function receives a packed 64-bit error code: the low 4 bits select the warning case (1--10) and the high 32 bits identify the function that triggered the failure. The function name is resolved via a vtable callback: context->field_0->vtable[18]->method_1(context->field_0->vtable[18], function_id).
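The packed failure value decomposes as follows; the helper names are illustrative, but the bit layout (case selector in the low 4 bits, function identifier in the high 32) is as recovered:

```c
#include <assert.h>
#include <stdint.h>

/* Build the packed 64-bit failure value returned up to sub_AE4F70. */
static uint64_t pack_failure(uint32_t warning_case, uint32_t function_id) {
    return ((uint64_t)function_id << 32) | (warning_case & 0xFu);
}

/* What sub_ACE480 extracts to select the message and resolve the name. */
static uint32_t failure_case(uint64_t packed) { return (uint32_t)(packed & 0xFu); }
static uint32_t failure_func(uint64_t packed) { return (uint32_t)(packed >> 32); }
```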
Warning Emission Mechanism
Each warning is gated by a per-function flag at context->field_208 + 72 + 26280:
- Byte == 1 with DWORD at +26288 nonzero: emit via `sub_895530` (direct diagnostic with source location). Falls back to `sub_7EEFA0` (format-to-buffer, no location) if the source location callback at `context->vtable + 48` is null.
- Byte != 1 (default): emit via `sub_7FA2C0` (warning-once gate, keyed on the hex code at `context + 154`). If the gate passes (first occurrence for this function), emits via `sub_895670` (diagnostic through the `context->vtable + 128` callback). This prevents the same warning from being emitted multiple times for the same function.
All warnings use the prefix "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...".
Serialization Warning Table
| Case | Hex | Decimal | Message suffix | Source function |
|---|---|---|---|---|
| 1 | 0x1D55 | 7509 | ...the presence of Extern calls in the function '%s' | sub_ADEB40 |
| 2 | 0x1D56 | 7510 | ...wgmma pipeline crossing function boundary at a function call in the function '%s' | sub_ADEB40 |
| 3 | 0x1D57 | 7511 | ...insufficient register resources for the wgmma pipeline in the function '%s' | sub_ADA7E0, orchestrator fallback |
| 4 | 0x1D58 | 7512 | ...insufficient register resources for the function '%s' | orchestrator resource check |
| 5 | 0x1D59 | 7513 | ...non wgmma instructions defining input registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_ADEB40, sub_AE17C0 |
| 6 | 0x1D5A | 7514 | ...non wgmma instructions reading accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_AE17C0 |
| 7 | 0x1D5B | 7515 | ...non wgmma instructions defining accumulator registers of a wgmma between start and end of the pipeline stage in the function '%s' | sub_ADEB40, sub_AE17C0 |
| 8 | 0x1D5C | 7516 | ...ill formed pipeline stage in the function '%s' | sub_AE3D40 structural check |
| 9 | 0x1D5E | 7518 | ...program dependence on compiler-inserted WG.DP in divergent path in the function '%s' | sub_ADEB40 finalization |
| 10 | 0x1D60 | 7520 | ...program dependence on compiler-inserted WG.AR in divergent path in the function '%s' | sub_ADEB40 finalization |
Note: The hex codes are not contiguous. Codes 0x1D5D (7517) and 0x1D5F (7519) are advisory injection warnings, not serialization warnings (see below).
Advisory Injection Warnings
During successful (non-serialized) pipeline fixup, sub_ADEB40 emits advisory warnings when it injects warpgroup synchronization instructions. These are gated by a knob check in sub_ACBCA0 and by the per-block flag at bb_info + 282 bit 3:
| Hex | Decimal | Message |
|---|---|---|
| 0x1D5D | 7517 | "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'" |
| 0x1D5F | 7519 | "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'" |
These are informational: they indicate the compiler successfully handled a register conflict by inserting synchronization, without falling back to serialization.
Detailed Trigger Conditions
Case 1 (0x1D55): Extern calls prevent pipelining
Trigger. During the instruction walk in sub_ADEB40, a call instruction (Ori opcode 236) is encountered within a WGMMA pipeline stage, or an operand references a basic block with no instructions (opaque/extern function target). The compiler cannot verify that the callee preserves the accumulator register state.
Detection code. In sub_ADEB40: when opcode == 236 (function call), or when a callee basic block's instruction pointer is null (*(_QWORD*)v114 == 0), v206 is set to 1.
Code pattern that causes it:
wgmma.fence;
extern_function_call(); // <-- triggers case 1
wgmma.mma_async ...;
wgmma.commit_group;
wgmma.wait_group;
Fix. Mark the callee as __forceinline__ so the compiler can see its register usage. Move non-inlineable function calls outside the fence--wait region. Restructure the kernel so that no opaque calls occur between wgmma.fence and wgmma.wait_group.
Case 2 (0x1D56): Pipeline crosses function call boundary
Trigger. The bitvector conflict check finds a non-WGMMA instruction's register operand colliding with the active accumulator bitvector, at a point where the pipeline already has active state from a preceding call-boundary violation. Specifically, the register is looked up in the balanced binary tree (node->bitmap[bank_offset + 4] >> bit_index) and if the conflict bit is set while v206 was already zero, it is promoted to case 2.
Detection code. In sub_ADEB40 lines 418--426: after the accumulator bitvector lookup returns a match, v206 is set to 2 (the first conflict after a call boundary was detected).
Code pattern that causes it:
// Function A:
wgmma.fence;
wgmma.mma_async ...;
call function_B(); // pipeline spans across this call
wgmma.commit_group; // in function_B or after return
wgmma.wait_group;
Fix. Keep the entire fence--mma--commit--wait sequence within a single function. Do not split WGMMA pipeline stages across function boundaries.
Case 3 (0x1D57): Insufficient register resources for pipeline
Trigger. Three distinct paths produce this code:
- `sub_ADA7E0` returns 3 when its internal call to `sub_AD5120()` fails (line 233). This function attempts to propagate accumulator tracking through the FNV-1a hash table, and failure means the pipeline's register sets cannot be simultaneously tracked.
- `sub_AE3D40` (structural validation) returns with low byte 0, meaning `sub_ACE3D0()` rejected the pipeline structure. The orchestrator uses case 3 as the generic fallback (`v20 = 3` at line 66 of `sub_AE4F70`).
- `sub_AD8F90` (secondary validation) returns with low byte 0 similarly.
Code pattern that causes it:
// Too many concurrent accumulators
wgmma.fence;
wgmma.mma_async D0, ...; // accum set 0
wgmma.mma_async D1, ...; // accum set 1
wgmma.mma_async D2, ...; // accum set 2
// ... many more with distinct accumulators
wgmma.commit_group;
wgmma.wait_group;
Fix. Reduce the number of concurrent WGMMA operations with distinct accumulator register sets. Split large tile computations into smaller stages with intervening waits. Reduce accumulator tile dimensions.
Case 4 (0x1D58): Insufficient register resources for function
Trigger. The function's overall register pressure (including non-WGMMA code) is too high. The WGMMA pipeline requires dedicated accumulator register banks, and if the function's total register demand exceeds what is available after reserving the pipeline's needs, serialization is triggered.
Code pattern that causes it:
__global__ void kernel(...) {
float local_array[256]; // high register pressure
complex_computation(local_array);
wgmma.fence;
wgmma.mma_async ...; // needs accumulator regs too
wgmma.commit_group;
wgmma.wait_group;
}
Fix. Reduce register usage in the kernel: use shared memory for large arrays, reduce live variable counts, split the kernel into smaller functions. Compile with -maxrregcount to force spilling of non-critical values.
Case 5 (0x1D59): Non-WGMMA defines input registers
Trigger. Two paths:
- In `sub_ADEB40` (lines 960--990): for each non-WGMMA instruction within a pipeline stage, operand position 4 (WGMMA input operands) is checked. If a non-WGMMA instruction writes to a register that a WGMMA uses as matrix A or B input, and the write is in the same basic block (`v84+24 == v36[6]`) and after the WGMMA (`v84+52 > v36[13]`), the conflict is flagged.
- In `sub_AE17C0` (lines 384--386): `sub_AE0D20()` validates the pipeline's input register sets against arrive/wait annotations. Failure at either the arrive set (offset +69) or the wait set (offset +74) returns code 5.
Code pattern that causes it:
wgmma.fence;
// desc_a = make_descriptor(smem_ptr);
wgmma.mma_async D, desc_a, desc_b;
desc_a = make_descriptor(smem_ptr + offset); // <-- redefines input
wgmma.mma_async D, desc_a, desc_b; // uses redefined input
wgmma.commit_group;
wgmma.wait_group;
Fix. Compute all WGMMA input values (descriptors, pointers) before wgmma.fence. Use separate register variables for distinct input values within a single pipeline stage. If different tiles need different descriptors, pre-compute them all before entering the pipeline.
Case 6 (0x1D5A): Non-WGMMA reads accumulators
Trigger. Detected only by sub_AE17C0 (late consistency check), at two points:
- Lines 707--741: for each WGMMA instruction, operand 0 (accumulator) is examined via `sub_AD4BE0`/`sub_ACBB60`. If the accumulator data set is non-empty (`!sub_ACC3A0`), a non-WGMMA instruction reads from an in-flight accumulator register.
- Lines 870--885: the same check in a per-basic-block iteration context.
Code pattern that causes it:
wgmma.fence;
wgmma.mma_async D, A, B;
float val = D[0]; // <-- reads accumulator before wait
wgmma.commit_group;
wgmma.wait_group;
Fix. Move all reads of accumulator registers after wgmma.wait_group. The accumulator values are undefined until the wait completes. If the compiler cannot automatically insert a warpgroup.wait at the read point (e.g., divergent control flow), serialization occurs.
Case 7 (0x1D5B): Non-WGMMA defines accumulators
Trigger. Three paths:
- In `sub_ADEB40` (lines 994--1028): for each non-WGMMA instruction, operand position 3 is checked. If the operand is a register (not immediate, tag != `0x70000000`), and it belongs to the same basic block and pipeline stage, and the defining instruction's opcode (after masking) is not 309 (`wgmma.mma_async`), the conflict is flagged.
- In `sub_AE17C0` (lines 684--703): `sub_AD4CC0` checks WGMMA accumulator operands against the conflict set. If a match is found and the set is non-empty, code 7 is returned.
- In `sub_AE17C0` (lines 1296--1302): a catch-all at the end of the late validation walk.
Code pattern that causes it:
wgmma.fence;
D[0] = 0.0f; // <-- writes to accumulator
wgmma.mma_async D, A, B; // D is accumulator
wgmma.commit_group;
wgmma.wait_group;
Fix. Initialize accumulators before wgmma.fence, or use the WGMMA .useC mode to let the hardware handle accumulator initialization. Never write to accumulator registers from non-WGMMA instructions inside a pipeline stage.
Case 8 (0x1D5C): Ill-formed pipeline stage
Trigger. sub_AE3D40 (structural validation) detects that the fence/mma/commit/wait structure is malformed. The function walks the WGMMA sequence and checks structural properties via sub_ACE3D0. When the structure check fails (line 447), an error with low byte 0 is returned. The orchestrator maps structural failures to code 3 as fallback, but code 8 is emitted when sub_ADEB40 detects the stage state machine in an inconsistent state.
Code pattern that causes it:
wgmma.fence;
if (condition) {
wgmma.mma_async D, A, B;
wgmma.commit_group; // commit only on one path
}
wgmma.wait_group; // wait on all paths -- mismatch
Fix. Ensure each wgmma.fence is matched by exactly one wgmma.commit_group and one wgmma.wait_group on every control flow path. Keep pipeline stages in straight-line code. Do not use goto, early return, or conditional branches between fence and wait.
Case 9 (0x1D5E): WG.DP in divergent path
Trigger. During the finalization pass in sub_ADEB40 (lines 1308--1370), the compiler iterates over warpgroup.wait injection points. For each injection, it checks the basic block's convergence flag at bb_info + 282 bit 3. If bit 3 is NOT set (block is divergent) and v206 was previously zero, v206 is set to 9 with the function ID from the basic block at offset +200.
WG.DP = WARPGROUP.DEPBAR (dependency barrier), the SASS-level instruction that implements warpgroup.wait.
Code pattern that causes it:
wgmma.fence;
wgmma.mma_async D, A, B;
wgmma.commit_group;
if (threadIdx.x < 64) { // warp-divergent condition
use(D[0]); // compiler needs WG.DP here, but path is divergent
}
wgmma.wait_group;
Fix. Ensure WGMMA pipeline stages execute in uniform (non-divergent) control flow. Move conditional logic outside the fence--wait region. Use predication instead of branching for minor variations within a stage.
Case 10 (0x1D60): WG.AR in divergent path
Trigger. During the finalization pass in sub_ADEB40 (lines 1242--1306), the compiler iterates over warpgroup.arrive injection points. When the compiler needs to inject a warpgroup.arrive (to start a new pipeline stage after a conflict) but the injection point is in a divergent basic block, v206 is set to 10. This occurs at line 1302 when a knob-gated diagnostic check at sub_ACBCA0 indicates the injection is not suppressed but the block divergence prevents safe insertion.
WG.AR = WARPGROUP.ARRIVE (arrival barrier), the SASS-level instruction that synchronizes warpgroup warps before entering a pipeline stage.
Code pattern that causes it:
if (threadIdx.x < 64) { // divergent
wgmma.fence; // <-- compiler needs WG.AR, but divergent
wgmma.mma_async D, A, B;
wgmma.commit_group;
wgmma.wait_group;
}
Fix. Same as case 9. Keep pipeline stage entry points (fences) and exit points (waits) in uniform control flow. All warps in the warpgroup must execute the same WGMMA pipeline structure.
Orchestrator Error Code Flow
The orchestrator sub_AE4F70 calls validation functions in sequence. Each returns a packed 64-bit value with the error code in the low bits and a function identifier in the high 32 bits:
sub_AE4F70
│
├─ sub_ADEB40 (primary fixup)
│ returns: 1, 2, 5, 7, 9, 10 in low 4 bits
│ (0 = success)
│
├─ sub_ADA7E0 (pipeline consistency)
│ returns: 3 if FNV-1a accumulator tracking fails
│ (0 = success)
│
├─ sub_AE3D40 (structural validation)
│ returns: low byte 1 = pass, low byte 0 = fail
│ (orchestrator maps fail to case 3)
│
├─ sub_AD8F90 (secondary validation)
│ returns: low byte 1 = pass, low byte 0 = fail
│ (orchestrator maps fail to case 3)
│
├─ sub_AE4710 (finalize metadata) -- only on success
│
└─ sub_AE17C0 (late consistency)
returns: 5, 6, 7 in low bits
(0 = success)
Any nonzero result triggers the serialization path: *(BYTE*)(context->field_0->field_1584 + 1920) = 1, followed by sub_ACE480 (warning emission) and sub_AE47B0 (pipeline collapse).
The serialization fallback function sub_AE47B0 replaces the pipelined WGMMA sequence with individual fence/mma/commit/wait groups per operation, which is functionally correct but eliminates all overlap between tensor core operations.
Interaction with Register Allocation
The GMMA pipeline runs at phases 85/87, before register allocation (phase 101). This is by design -- the pass operates on virtual registers and needs to:
- Track accumulator live ranges before physical register assignment constrains placement
- Insert warpgroup.arrive/wait with freedom to position them optimally
- Propagate accumulator liveness to inform the register allocator about the extended live ranges that WGMMA creates
The live range limit check (warning code 0x1CEF) directly impacts register allocation: if too many WGMMA accumulators are simultaneously live, the register allocator will not have enough physical registers, and the pipeline must be serialized.
Phase 86 (InsertPseudoUseDefForConvUR) runs between the two GMMA phases. It inserts pseudo use/def instructions for uniform register conversion, which must account for the accumulator regions identified by phase 85.
Phase 88 (OriHoistInvariantsLate3) runs immediately after phase 87, exploiting the now-explicit pipeline boundaries as LICM barriers.
PTX Instruction Handlers
The PTX-to-Ori lowering registers four WGMMA-related handlers in sub_5D4190:
| PTX Mnemonic | Handler | Size |
|---|---|---|
| wgmma.mma_async | sub_50AC70 | 1,282 bytes |
| wgmma.fence | sub_4DA380 | 295 bytes |
| wgmma.commit_group | sub_4DA4B0 | 295 bytes |
| wgmma.wait_group | sub_4DA5E0 | 311 bytes |
The wgmma.mma_async handler is the largest, handling the complex operand encoding (matrix dimensions, data types, layout, scale factors, descriptor format). The fence/commit/wait handlers are thin wrappers producing single Ori instructions.
The internal warpgroup synchronization instructions (_warpgroup.arrive, _warpgroup.wait, _warpgroup.commit_batch) are registered separately as _mma.warpgroup-prefixed handlers at 0x466000--0x467900 (approximately 36 small ~96-byte handler functions covering the various warpgroup synchronization variants).
SASS Output
The Ori WGMMA instructions are encoded to the following SASS opcodes by the Mercury encoder:
| Ori Instruction | SASS Opcode | Description |
|---|---|---|
| wgmma.mma_async | WGMMA.MMA_ASYNC | Asynchronous warpgroup matrix multiply |
| wgmma.fence | WGMMA.FENCE | Pipeline fence |
| wgmma.commit_group | WGMMA.COMMIT_GROUP | Commit current group |
| wgmma.wait_group N | WGMMA.WAIT_GROUP N | Wait for N groups |
| _warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup arrival barrier |
| _warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup wait barrier |
| _warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |
The Mercury encoder at sub_62E890 (118 KB) handles the SASS-level encoding of warpgroup operations, referenced by strings "warpgroup-arrive", "warpgroup-wait", and "warpgroup-commit_batch" used as internal Mercury instruction tags.
Key Constants
| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Call opcode | 236 | Forces pipeline break |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator regs |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator encoding |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator encoding |
| FNV-1a prime | 16777619 | Hash function prime for register set lookup |
| FNV-1a offset | 0x811C9DC5 | Hash function offset basis |
| Live range warning | 0x1CEF | Warning code for excessive live ranges |
| Serialization base | 0x1D55 | First serialization warning code (extern calls) |
| Serialization end | 0x1D60 | Last serialization warning code (WG.AR divergent) |
| Advisory wait inject | 0x1D5D | Advisory: warpgroup.wait injected |
| Advisory arrive inject | 0x1D5F | Advisory: warpgroup.arrive injected |
Key Function Table
| Address | Size | Name / Role |
|---|---|---|
| 0xAE5030 | 2,967 | Phase 85 outer driver (SM gate, BB iteration) |
| 0xADCA60 | 3,643 | Phase 85 per-function pipeline analysis |
| 0xADBD30 | 3,364 | Phase 85 per-block accumulator propagation |
| 0xADAD60 | 2,170 | Phase 85 per-instruction accumulator encoding |
| 0xADA740 | 146 | Accumulator register collector |
| 0xAE4F70 | 182 | Phase 87 orchestrator |
| 0xADEB40 | 7,077 | Phase 87 primary sequence fixup |
| 0xADB5E0 | 1,867 | Phase 87 sequence metadata builder |
| 0xADD8A0 | 1,349 | Phase 87 post-injection metadata rebuild |
| 0xAE3D40 | 2,511 | Sequence structural validation |
| 0xAD8F90 | 2,924 | Secondary validation pass |
| 0xAE17C0 | 7,538 | Late pipeline consistency check |
| 0xAE47B0 | 1,975 | Serialization fallback (collapse pipeline) |
| 0xACE480 | 1,908 | Serialization warning emitter (10 codes) |
| 0xACBE60 | 279 | Create warpgroup.arrive instruction |
| 0xACBF80 | 279 | Create warpgroup.wait instruction |
| 0xACBCA0 | 191 | Knob-gated injection diagnostic check |
| 0x50AC70 | 1,282 | PTX handler: wgmma.mma_async |
| 0x4DA380 | 295 | PTX handler: wgmma.fence |
| 0x4DA4B0 | 295 | PTX handler: wgmma.commit_group |
| 0x4DA5E0 | 311 | PTX handler: wgmma.wait_group |
| 0x494210 | 2,276 | Sparse GMMA validation |
| 0x62E890 | 118,150 | Mercury encoder for warpgroup SASS ops |
Cross-References
- Pass Inventory -- phases 85, 87 in the 159-phase table
- Synchronization & Barriers -- warpgroup barriers, `DEPBAR` generation
- Register Model -- reg_type 6 (tensor/accumulator, allocator class 6)
- Register Allocator -- live range pressure from WGMMA accumulators
- Mercury Encoder -- SASS encoding of WGMMA instructions
- Uniform Register Optimization -- phase 86 between the two GMMA phases
- Loop Passes -- phase 88 LICM after GMMA fixup
- Late Legalization -- phase 93 catches ops exposed by GMMA passes
- SM Architecture Map -- SM 90+ architecture support
- Knobs System -- diagnostic gating for injection warnings
Uniform Register Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Four passes in the ptxas pipeline collectively manage the conversion of general-purpose register (R) values to uniform registers (UR) on sm_75+ targets. The UR register file is a dedicated 63-entry, 32-bit register bank shared across all threads in a warp: every thread reads the same value from a given UR. By routing warp-uniform computations through the UR file, ptxas reduces R-register pressure (the dominant occupancy limiter), enables the UR-specific ALU datapath, and avoids broadcasting the same value 32 times across the register file.
| Phases | 11, 27, 74, 86 |
| Phase names | ReplaceUniformsWithImm, AnalyzeUniformsForSpeculation, ConvertToUniformReg, InsertPseudoUseDefForConvUR |
| Target | sm_75+ (Turing and later) -- no-op on earlier architectures |
| Register file | UR: UR0--UR62 usable, UR63 = URZ (zero register); UP: UP0--UP6, UP7 = UPT |
| Hardware limit | 63 uniform GPRs, 7 uniform predicates per thread |
| Code Object field | +99 = UR count; +856 = UR liveness bitvector |
| Context flags | +1368 bit 1 = has-uniform; +1376 bit 4 = UR tracking enabled; +1378 bit 3 = has-UR-regs |
| Key knobs | 487 (general optimization gate), 628 (pre-allocation UR promotion), 687 (uniform register mode) |
| Related passes | OriPropagateVaryingFirst (53), OriPropagateVaryingSecond (70), OptimizeUniformAtomic (44), ConvertMemoryToRegisterOrUniform (sub_910840) |
Background: Uniform vs. Divergent Values
A value is uniform (warp-uniform) if every active thread in the warp holds the same value for that register at a given program point. A value is divergent if different threads may hold different values.
Sources of uniformity:
- Kernel parameters. All threads receive the same parameter values. Parameters loaded from constant memory (`LDC`) with a uniform address are uniform by construction.
- Constant memory loads. `LDC` with a uniform base address produces a uniform result.
- S2R of warp-uniform special registers. Registers like `SR_CTAID_X/Y/Z`, `SR_GRIDID`, and `SR_SMID` are uniform across the warp. `SR_TID_X/Y/Z` and `SR_LANEID` are divergent.
- Arithmetic on uniform inputs. If all source operands are uniform, the result of any pure ALU operation is uniform.
- Convergent control flow. A value defined before a divergent branch and used after reconvergence is still uniform if the definition was uniform.
Sources of divergence:
- Thread identity registers. `SR_TID_X/Y/Z` and `SR_LANEID` vary per thread.
- Memory loads from thread-dependent addresses. `LDG [R_addr]` where `R_addr` is divergent produces a divergent result.
- Phi merges across divergent branches. A `MOV.PHI` that merges values from two sides of a divergent branch is divergent even if each incoming value was individually uniform.
ptxas tracks uniformity through two complementary mechanisms: forward "varying" propagation (OriPropagateVarying, phases 53 and 70) marks registers as divergent, while the uniform analysis passes (this page) identify which remaining values are safe to move to the UR file.
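The forward "varying" rule described above (a pure ALU result is divergent iff any source is divergent, seeded by thread-identity sources) can be sketched over a straight-line block. This is an illustration of the dataflow rule only, not the recovered OriPropagateVarying implementation, and the toy instruction form is an assumption:

```c
#include <assert.h>
#include <stdbool.h>

#define NREGS 8

/* Toy three-address instruction: dst = op(src0, src1); -1 = unused source. */
typedef struct { int dst, src0, src1; } ins_t;

/* Forward pass: a destination is divergent iff any source is divergent.
 * dvg[] is seeded before the walk (e.g. the S2R SR_TID destination). */
static void propagate_varying(const ins_t *code, int n, bool dvg[NREGS]) {
    for (int i = 0; i < n; i++) {
        bool d = false;
        if (code[i].src0 >= 0) d |= dvg[code[i].src0];
        if (code[i].src1 >= 0) d |= dvg[code[i].src1];
        dvg[code[i].dst] = d;
    }
}
```

Loops and divergent branches require iterating this to a fixed point and poisoning phi merges, which the sketch omits.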
UR Hardware ISA
sm_75+ architectures provide a dedicated set of uniform-only SASS instructions that operate on UR/UP registers. These execute on the uniform datapath, which processes one value per warp instead of 32:
| SASS mnemonic | ROT13 in binary | Operation |
|---|---|---|
| UIADD3 | HVNQQ3 | Uniform 3-input integer add |
| UIMAD | HVZNQ | Uniform integer multiply-add |
| ULOP3 | HYBC3 | Uniform 3-input logic |
| UISETP | HVFRGC | Uniform integer set-predicate |
| USGXT | HFTKG | Uniform sign-extend |
| UPRMT | HCEZG | Uniform byte permute |
| UPOPC | HCBCP | Uniform population count |
| UBREV | HOERI | Uniform bit reverse |
| UP2UR | HC2HE | Uniform predicate to uniform register |
| UPLOP3 | HCYBC3 | Uniform predicate LOP3 |
| VOTEU | IBGRH | Uniform vote |
Blackwell (sm_100+) extends the uniform ISA with:
- `UFADD`, `UFFMA`, `UFSEL`, `UFSETP` -- uniform floating-point operations
- `UVIADDR` -- uniform virtual address computation
- `UCLEA`, `UCVTA`, `ULEPC` -- uniform address operations
- `UTMAPC`, `UTMALDG`, `UTMAPF`, `UTMAREDG` -- uniform TMA (tensor memory accelerator) operations
- `UBLKPC`, `UBLKRED`, `UBLKPF` -- uniform block operations
The R2UR instruction transfers a value from the R file to the UR file; UR2R does the reverse. These are the bridge instructions that ConvertToUniformReg inserts at file boundaries.
The SASS encoder at sub_7BC360 (126 callers) handles UR register encoding using the register-variant-B format, distinct from the main register encoder sub_7BC030. The decoder sub_7BD7D0 (4 callers) extracts UR operands with type=4 (uniform register). In the Mercury encoding layer, Major 0x0E (6 variants, sub_10C0550) encodes the uniform ALU instructions (UIADD3, ULOP3, etc.).
Phase 11: ReplaceUniformsWithImm
| Phase index | 11 |
| Pipeline position | Stage 1 (Initial Setup), after EarlyOriSimpleLiveDead (10), before OriSanitize (12) |
| Category | Optimization |
Purpose
Replaces uniform register reads with immediate constants when the value is known at compile time. This is the earliest uniform-related optimization in the pipeline, running before any loop or branch optimization.
Motivation
Kernel launch parameters are passed through constant memory. After PTX-to-Ori lowering, a kernel parameter access looks like:
LDC R3, c[0x0][0x160] // load parameter from constant bank
IMAD R4, R3, R5, RZ // use the parameter
If the compiler can prove that the constant bank address contains a known immediate (e.g., from .param directives with known offsets), the LDC is dead and the use can be folded:
IMAD R4, 42, R5, RZ // parameter replaced with immediate 42
This eliminates constant memory traffic and reduces register pressure by one register.
When It Fires
The pass is most effective for:
- Kernel parameters with known constant offsets
- Shared memory size constants
- Grid/block dimension constants when known at compile time
- Constant expressions that survive PTX-to-Ori lowering as LDC loads
The pass is gated by knob 487 (general optimization enablement).
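The LDC-to-immediate folding described above can be sketched on a toy tuple-based IR. This is a hedged illustration, not the real Ori data structures; the instruction shape `(dest, opcode, operands)` and the `known_const` map are assumptions:

```python
# Toy IR: each instruction is (dest, opcode, operands).
# known_const maps constant-bank offsets to compile-time-known values.
def fold_uniform_constants(insts, known_const):
    imm = {}   # dest register -> folded immediate value
    out = []
    for dest, op, args in insts:
        if op == "LDC" and args[0] in known_const:
            imm[dest] = known_const[args[0]]   # the LDC becomes dead
            continue
        # rewrite uses of folded registers as immediate operands
        args = [imm.get(a, a) for a in args]
        out.append((dest, op, args))
    return out

prog = [("R3", "LDC", [0x160]),             # load parameter from constant bank
        ("R4", "IMAD", ["R3", "R5", "RZ"])] # use the parameter
print(fold_uniform_constants(prog, {0x160: 42}))
# -> [('R4', 'IMAD', [42, 'R5', 'RZ'])]
```

As in the SASS example above, the constant-memory load disappears and the use reads the immediate directly, freeing one register.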
Phase 27: AnalyzeUniformsForSpeculation
| Phase index | 27 |
| Pipeline position | Stage 2 (Early Optimization), after OriRemoveRedundantBarriers (26), before SinkRemat (28) |
| Category | Analysis |
Purpose
Identifies uniform values that are safe for speculative execution. This analysis feeds subsequent passes that may hoist or speculatively execute instructions -- most immediately SinkRemat (phase 28) and SpeculativeHoistComInsts (phase 56).
Speculative Uniformity
A value is "speculatively uniform" if it would be uniform under all possible execution paths, not just the currently taken path. This is a stronger property than simple uniformity: a value that is uniform within one branch arm might not be speculatively safe to hoist above the branch if the other arm would produce a different value or a side effect.
The analysis must be conservative:
- Memory loads cannot be speculated unless the address is provably valid on all paths (no faults).
- Atomic operations are never speculative candidates.
- Values defined under divergent control flow require careful handling -- the analysis must determine whether the definition dominates all paths that could reach the speculation point.
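The conservativeness rules above amount to a simple rejection predicate. A minimal sketch under an assumed instruction model (the `Inst` fields and the `dominates_all_paths` callback are hypothetical, not recovered structures):

```python
from dataclasses import dataclass

@dataclass
class Inst:
    opcode: str
    is_load: bool = False
    address_provably_valid: bool = False
    under_divergent_flow: bool = False

UNSAFE_OPCODES = {"ATOM", "RED"}   # atomics are never speculative candidates

def is_speculation_candidate(inst, dominates_all_paths=lambda i: True):
    if inst.opcode in UNSAFE_OPCODES:
        return False
    if inst.is_load and not inst.address_provably_valid:
        return False   # load could fault on a path that never executes it
    if inst.under_divergent_flow and not dominates_all_paths(inst):
        return False   # def must dominate every path to the speculation point
    return True

print(is_speculation_candidate(Inst("IADD3")))              # True
print(is_speculation_candidate(Inst("ATOM")))               # False
print(is_speculation_candidate(Inst("LDG", is_load=True)))  # False
```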
Pipeline Position Rationale
Phase 27 runs after:
- Loop unrolling (22), which may duplicate uniform definitions
- SSA phi insertion (23), which provides single-definition reaching information
- Software pipelining (24), which may interleave loop iterations
- Barrier removal (26), which may relax synchronization constraints
And before:
SinkRemat(28), which uses the analysis to decide what can be sunk/recomputedGeneralOptimize(29), which benefits from knowing which values are uniform
Phase 74: ConvertToUniformReg
| Phase index | 74 |
| Pipeline position | Stage 4 (Late Optimization), after ConvertAllMovPhiToMov (73), before LateArchOptimizeFirst (75) |
| Category | Optimization |
| String reference | "ConvertToUniformReg" at 0x22BCA12 |
| Related function | sub_911030 (10,741 bytes, 56 callees) |
Purpose
The main UR promotion pass. Converts qualifying R-register values to UR registers, replacing per-thread general-purpose register storage with warp-uniform storage. This is the highest-impact uniform register optimization in the pipeline.
Pipeline Position Rationale
Phase 74 runs immediately after SSA destruction (ConvertAllMovPhiToMov, phase 73). This is deliberate:
- After SSA destruction: phi nodes have been converted to plain MOVs, giving the pass a clear view of all definitions and uses without phi-node complications.
- After varying propagation (phases 53 and 70): the divergence annotations are complete -- the pass knows which values are proven uniform.
- After predication (phase 63): if-conversion has already eliminated short branches, which may have exposed new uniform values.
- Before register allocation: UR conversion reduces R-register demand before the fat-point allocator runs (phase 101), directly improving occupancy.
- Before scheduling: the scheduler (phases 97+) can exploit UR-specific latency characteristics.
Conversion Criteria
A value qualifies for R-to-UR conversion when all of the following hold:

1. Uniformity: the value is proven warp-uniform -- all threads compute the same result. This is established by the varying propagation passes and the phase 27 analysis.
2. UR-expressible operation: the defining instruction has a uniform-datapath equivalent. Not all SASS instructions have UR variants. Operations like IMAD, IADD3, LOP3, ISETP, MOV, SEL, PRMT, SGXT, POPC, and BREV have UR counterparts. Complex operations like FFMA, LDG, STG, texture instructions, and atomics do not (until sm_100 added some uniform FP).
3. UR pressure budget: the conversion must not exceed the 63-register UR hardware limit. The pass tracks live UR count and aborts conversion for a value if it would push the UR pressure beyond the limit.
4. All uses accept UR sources: every consumer of the value must be able to read from the UR file. Some instructions have encoding restrictions that prohibit UR operands in certain source positions.
5. No cross-warp dependencies: the value must not participate in cross-warp communication patterns (e.g., shuffle instructions that explicitly exchange values between lanes).
Algorithm
The pass operates in two main phases:
Phase A -- Candidate identification. Walks the instruction list and marks each definition as a UR candidate based on the criteria above. For each candidate, it checks:
- The vreg+64 register file type is R (type 1 or 2, not already UR type 3)
- The varying propagation flag on the register indicates uniformity (bit 2 of vreg+49 clear)
- The defining opcode has a UR-equivalent instruction form
- All consumers of this register accept UR sources
Phase B -- Conversion. For each approved candidate:
- Changes the register's file type from R (type 1) to UR (type 3) at vreg+64
- Updates the register's allocator class from class 1 (R) to class 4 (UR) at vreg+12
- Rewrites the defining instruction to use the UR-variant opcode (e.g., IMAD becomes UIMAD)
- Inserts UR2R bridge instructions where a converted UR value flows into an instruction that requires an R-file source
- Inserts R2UR bridge instructions where an R-file value needs to flow into a converted UR instruction
- Updates the UR count at Code Object +99
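The Phase B rewrite can be sketched on a toy tuple IR. This is a hedged illustration only: the real pass mutates Ori IR register descriptors in place, while here promoted values simply get a `U` prefix, and the bridge-insertion policy is simplified (one bridge per crossing, no reuse):

```python
# Subset of the R-form -> UR-form opcode mapping (assumed, not exhaustive).
UR_FORM = {"IMAD": "UIMAD", "IADD3": "UIADD3", "LOP3": "ULOP3"}

def convert_to_ur(insts, promoted):
    out = []
    for dest, op, srcs in insts:
        if dest in promoted:
            # Promoted def: switch to the UR-variant opcode; any plain
            # R-file source is bridged into the UR file with R2UR.
            new = []
            for s in srcs:
                if s in promoted:
                    new.append(f"U{s}")
                elif s.startswith("R") and s != "RZ":
                    out.append((f"U{s}", "R2UR", [s]))   # R -> UR bridge
                    new.append(f"U{s}")
                else:
                    new.append(s)
            out.append((f"U{dest}", UR_FORM[op], new))
        else:
            # Unconverted consumer: a promoted source comes back through
            # a UR -> R bridge before the instruction reads it.
            new = []
            for s in srcs:
                if s in promoted:
                    out.append((s, "UR2R", [f"U{s}"]))
                    new.append(s)
                else:
                    new.append(s)
            out.append((dest, op, new))
    return out

prog = [("R3", "IADD3", ["R1", "R2", "RZ"]),   # uniform def, promoted
        ("R5", "IMAD",  ["R3", "R4", "RZ"])]   # varying consumer
for inst in convert_to_ur(prog, {"R3"}):
    print(inst)
```

The promoted definition becomes a UIADD3 in the UR file, and its varying consumer reads it back through a UR2R bridge.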
UR Pressure Management
The UR file has only 63 usable registers (UR0--UR62), compared to 254 for the R file. The pass must be conservative about how many values it converts:
- Greedy allocation with pressure cap: candidates are evaluated in program order (RPO). Each conversion increments a pressure counter. If the counter reaches the hardware limit, remaining candidates are skipped.
- Priority by benefit: conversions that save the most R-register pressure (long live ranges with many uses) are preferred.
- Retry mechanism: the scheduling infrastructure at sub_A0D800 supports a "retry without uniform regs" fallback (controlled by flag v63). If scheduling with UR-converted code fails to meet latency targets, the scheduler can request a re-run without UR conversion.
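The greedy, benefit-ranked selection under the hardware cap can be sketched as follows. The benefit metric (live-range length times use count) is an assumption for illustration, and the simple one-counter pressure model mirrors the document's description rather than a true max-live computation:

```python
UR_LIMIT = 63   # UR0..UR62 usable

def pick_ur_candidates(candidates):
    """candidates: list of (vreg name, live-range length, number of uses)."""
    # Priority by benefit: long, heavily used live ranges first (assumed metric).
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    chosen, pressure = [], 0
    for name, _, _ in ranked:
        if pressure + 1 > UR_LIMIT:
            break               # hardware limit reached: skip remaining candidates
        chosen.append(name)
        pressure += 1           # each conversion increments the pressure counter
    return chosen

cands = [("v1", 120, 9), ("v2", 4, 1), ("v3", 300, 2)]
print(pick_ur_candidates(cands))   # -> ['v1', 'v3', 'v2']
```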
Interaction with Register Allocation
The UR conversion reduces R-register demand but introduces UR-register demand. The fat-point allocator (phase 101) handles R and UR as separate register classes (class 1 and class 4 respectively), with independent allocation passes. The trade-off:
| R file | UR file | |
|---|---|---|
| Capacity | 254 usable | 63 usable |
| Pressure impact | Reduced by conversion | Increased by conversion |
| Occupancy impact | Positive (fewer R regs = higher occupancy) | Neutral (UR count does not affect warp occupancy on most SMs) |
| Spill cost | Spilled to local memory | Spilled to R file, then to local memory |
The allocator state at alloc+440 tracks the uniform register promotion flag (controlled by knob 628 and context flag +1414). When this flag is set, the pre-allocation pass (sub_94A020) enables UR-aware allocation.
Phase 86: InsertPseudoUseDefForConvUR
| Phase index | 86 |
| Pipeline position | Stage 5 (Legalization), after OriPropagateGmma (85), before FixupGmmaSequence (87) |
| Category | Lowering |
Purpose
Inserts pseudo use/def instructions to maintain correct liveness information for UR-converted registers. After ConvertToUniformReg (phase 74) converts values from R to UR, subsequent optimization and legalization passes may invalidate the liveness information. This pass inserts lightweight pseudo-instructions that prevent later passes from incorrectly eliminating UR definitions or extending UR live ranges beyond their intended scope.
Why Pseudo Instructions Are Needed
The UR conversion in phase 74 changes register file assignments, but does not update all downstream data structures. Between phase 74 and register allocation (phase 101), several passes run:
74 ConvertToUniformReg <-- UR conversion happens here
75 LateArchOptimizeFirst
76 UpdateAfterOptimize
77 AdvancedPhaseLateConvUnSup
78 LateExpansionUnsupportedOps
79 OriHoistInvariantsLate2
80 ExpandJmxComputation
81 LateArchOptimizeSecond
82 AdvancedPhaseBackPropVReg
83 OriBackCopyPropagate
84 OriPerformLiveDeadFourth <-- DCE could kill "unused" UR defs
85 OriPropagateGmma
86 InsertPseudoUseDefForConvUR <-- pseudo use/def insertion
87 FixupGmmaSequence
...
101 AdvancedPhaseAllocReg <-- register allocation
The critical problem: OriPerformLiveDeadFourth (phase 84) runs liveness analysis and dead code elimination. If a UR-converted value appears dead (no R-file use remaining because the uses were also converted), DCE would remove it. The pseudo use/def instructions inserted by phase 86 create artificial uses that keep UR definitions alive through DCE.
Pseudo Instruction Properties
The pseudo use/def instructions:
- Have no hardware encoding -- they are removed before SASS emission
- Carry register operand references that maintain the def-use chain
- Are transparent to instruction scheduling (zero latency, no functional unit)
- Are removed during post-RA cleanup or Mercury encoding
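The DCE hazard that motivates these pseudo-instructions is easy to reproduce on a toy backward-liveness DCE. A hedged sketch (tuple IR and opcode names are hypothetical):

```python
def dce(insts, side_effect_ops={"STG", "PSEUDO_USE"}):
    """Backward DCE: keep an instruction only if its dest is live or it
    has a side effect. PSEUDO_USE is modeled as a side-effecting use."""
    live, kept = set(), []
    for dest, op, srcs in reversed(insts):
        if op in side_effect_ops or dest in live:
            kept.append((dest, op, srcs))
            live.discard(dest)
            live.update(srcs)   # sources become live upward
    return list(reversed(kept))

prog = [("UR4", "UIADD3", ["UR1", "UR2", "URZ"])]
print(dce(prog))   # -> []  the UR def has no R-file use left, so it looks dead

prog.append((None, "PSEUDO_USE", ["UR4"]))
print(dce(prog))   # the artificial use keeps the UR def alive through DCE
```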
Convergent Boundary Interaction
The pass also interacts with the convergent boundary enforcement mechanism. The string "Missing proper convergent boundary around func call annotated with allowConvAlloc" (from sub_19D13F0) indicates that UR-converted values crossing function call boundaries require convergent allocation markers. The allowConvAlloc annotation on function calls triggers convergent boundary checking, and "Multiple functions calls within the allowConvAlloc convergent boundary" (sub_19C6400) warns when a convergent region contains more than one call.
The CONV.ALLOC pseudo-instruction (opcode 286 / 0x11E) is inserted by sub_19D7A70 to mark convergent allocation boundaries. This prevents the register allocator from assigning the same physical UR to values that are live across a convergent boundary where the UR might be redefined.
Varying Propagation (Supporting Analysis)
The OriPropagateVarying passes (phases 53 and 70) propagate divergence information forward through the IR. They are not part of the four-pass uniform register group, but provide the critical input data.
Phase 53 (OriPropagateVaryingFirst) runs before rematerialization (OriDoRematEarly, 54) and late expansion (55). It marks each register as either "uniform" or "varying" (divergent) by propagating divergence from known-divergent sources (thread ID registers, divergent memory loads) through the def-use chain. The propagation is a forward dataflow analysis: if any source operand of an instruction is varying, the destination is varying.
Phase 70 (OriPropagateVaryingSecond) repeats the analysis after predication (phase 63) and rematerialization (phase 69) may have changed the divergence landscape.
The varying flag is stored in the virtual register descriptor (bit 2 of vreg+49). During ConvertToUniformReg, only registers marked as non-varying are candidates for UR promotion.
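The forward propagation rule ("any varying source makes the destination varying") can be sketched as a fixpoint iteration over def-use information. A minimal model, assuming a flat `(dest, srcs)` instruction list rather than real Ori IR:

```python
def propagate_varying(insts, seeds):
    """insts: list of (dest, srcs). seeds: registers divergent by
    construction (thread-ID-derived values, divergent loads)."""
    varying = set(seeds)
    changed = True
    while changed:              # iterate to a fixpoint (handles loops/back-edges)
        changed = False
        for dest, srcs in insts:
            if dest not in varying and any(s in varying for s in srcs):
                varying.add(dest)
                changed = True
    return varying

prog = [("R1", ["TID"]),        # derived from the thread id -> varying
        ("R2", ["R0"]),         # from a uniform source      -> stays uniform
        ("R3", ["R1", "R2"])]   # mixes in a varying input   -> varying
print(propagate_varying(prog, {"TID"}))   # -> {'TID', 'R1', 'R3'}
```

Everything not in the returned set corresponds to a register with the varying bit clear, i.e. a potential UR promotion candidate.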
Uniform Atomic Optimization (Phase 44)
OptimizeUniformAtomic (phase 44) is a mid-pipeline optimization that converts thread-uniform atomic operations into warp-level reductions. When all threads in a warp perform the same atomic operation on the same address with the same value, the hardware can coalesce them into a single atomic. This pass detects such patterns and rewrites them using REDUX (reduction) or ATOM.UNIFORM instruction forms.
Code Object Uniform Register Tracking
The Code Object maintains several fields related to UR state:
| Offset | Field | Description |
|---|---|---|
| +99 | ur_count | Number of uniform registers allocated for this function |
| +832 | Main liveness bitvector | One bit per virtual register (R + UR combined) |
| +856 | UR liveness bitvector | Separate bitvector for UR/UP registers only |
| +1368 bit 1 | has-uniform flag | Set when the function uses any UR registers |
| +1376 bit 4 | UR tracking enabled | Controls whether scheduling tracks UR pressure |
| +1378 bit 3 | has-UR-regs flag | Secondary flag confirming UR register usage |
The scheduling dependency builder at sub_A0D800 (39 KB) tracks UR pressure separately. When +1376 bit 4 is set, the control word computation at sub_A09850 doubles the register count for uniform operands (v15 = type==3 ? 2 : 1) and writes a 9-bit register count to the control word bits [0:8].
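The register-count computation in sub_A09850 can be modeled in a few lines. This is a hedged sketch of the described behavior only: operand file types are taken as integers (3 = uniform register, matching `v15 = type==3 ? 2 : 1`), and the total lands in a 9-bit field at bits [0:8] of the control word:

```python
def pack_reg_count(operand_types):
    """Sum per-operand register counts: uniform operands (type 3) count
    double; pack the total into the 9-bit field at control word bits [0:8]."""
    count = sum(2 if t == 3 else 1 for t in operand_types)   # v15 = type==3 ? 2 : 1
    assert count < (1 << 9), "count must fit the 9-bit field"
    return count & 0x1FF

print(pack_reg_count([1, 1, 3]))   # two R operands + one UR operand -> 4
```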
The scheduling statistics printer (sub_A3A7E0) reports texture binding mode as "UR-bound" when textures are accessed via uniform-register-based descriptors:
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
Disallowed Uniform Register Diagnostic
The function sub_A465F0 (CodeObject::buildCodeObjectHeader, 2.6 KB binary) checks whether UR registers were used despite being disallowed. The diagnostic:
"Uniform registers were disallowed, but the compiler required (%d) uniform
registers for correct code generation."
This fires on pre-sm_75 targets where the UR file does not exist, or when a CLI option explicitly disables UR usage. Knob 687 controls the uniform register mode.
SM Architecture Availability
| SM range | UR support | UR ALU instructions | Uniform FP |
|---|---|---|---|
| sm_30 -- sm_72 | None | None | None |
| sm_75 -- sm_89 | UR0--UR62, UP0--UP6 | UIADD3, UIMAD, ULOP3, UISETP, UMOV, UPRMT, USGXT, UPOPC, UBREV | None |
| sm_90 -- sm_90a | UR0--UR62, UP0--UP6 | Full integer uniform ALU | None (LDCU requires -forcetext -sso) |
| sm_100+ | UR0--UR62, UP0--UP6 | Full integer + FP uniform ALU | UFADD, UFFMA, UFSEL, UFSETP, UVIADDR |
The LDCU (Load Constant Uniform) instruction is gated by architecture capability. The validation at sub_B28400 (345 bytes) checks:
"SM does not support LDCU. On SM90 -knob EmitLDCU is only supported when
options '-forcetext' and '-sso out.sass' are provided."
This check queries vtable+1336 for the LDCU capability.
ConvertMemoryToRegisterOrUniform
The function sub_910840 (ConvertMemoryToRegisterOrUniform, gated by knob 487) is a pre-allocation optimization that promotes stack-resident variables to registers, with the option of promoting to UR when the variable is proven uniform. It is not one of the four numbered phases but works closely with them.
| Entry | sub_910840 (2,100 bytes) |
| Core | sub_911030 (10,741 bytes, 56 callees) |
| Liveness builder | sub_905B50 (5,407 bytes) |
| Promotion transform | sub_90FBA0 (~4,000 bytes) |
| Gate knob | 487 |
| String | "ConvertMemoryToRegisterOrUniform" at 0x910897 |
The entry function checks knob 487 for enablement (via vtable+152 dispatch), builds def-use chains via sub_905B50, then calls sub_90FBA0 for the actual promotion.
The sub_911030 core function (10.7 KB) handles the "OrUniform" part -- it iterates through the variable list, checks variable properties (address space, type), and decides whether to promote to R or UR. The decision process involves:
- Checking the register's vreg+49 flags byte (bit 2 = uniform marker from sub_907870)
- Evaluating whether the variable's address space permits UR promotion
- Confirming that the defining and using instructions have UR-compatible forms
- Verifying UR pressure headroom
The per-register-class property accessors at sub_900C50--sub_9013F0 (6 nearly identical 391-byte functions, 2 callers each) provide the class-indexed lookups for the promotion decision.
Key Functions
| Address | Size | Function | Description |
|---|---|---|---|
| sub_910840 | 2.1 KB | ConvertMemoryToRegisterOrUniform | Promotes stack variables to R or UR registers (knob 487 gated) |
| sub_911030 | 10.7 KB | Core UR promotion logic | Iterates variables, decides R vs UR promotion based on uniformity |
| sub_905B50 | 5.4 KB | Liveness builder for promotion | Builds def-use chains for promotion analysis |
| sub_90FBA0 | ~4 KB | Promotion transform | Applies the actual memory-to-register transformation |
| sub_8FEAC0 | 2.1 KB | Per-BB pressure analyzer | Walks instruction list, decodes operand types, updates pressure via vtable+1824; called from sub_910840 |
| sub_A465F0 | 2.6 KB | CodeObject::buildCodeObjectHeader | Writes UR count into code object, checks disallowed-UR diagnostic |
| sub_B28E90 | small | isUReg | Predicate: is operand a uniform register? |
| sub_19D13F0 | 4.3 KB | Convergent boundary checker | Validates allowConvAlloc boundaries around function calls |
| sub_19C6400 | 330 B | Per-instruction convergent classifier | Callback: warns on opcode 159 within convergent boundary |
| sub_19D7A70 | 3.3 KB | CONV.ALLOC marker insertion | Inserts opcode 0x11E pseudo-instructions at convergent boundaries |
| sub_A0D800 | 39 KB | Scheduling dependency builder | Builds per-block dependency graph; tracks UR pressure via +856 bitvector |
| sub_A09850 | ~2 KB | Control word computation | Doubles count for uniform operands: type==3 ? 2 : 1 |
| sub_B28400 | 345 B | LDCU validator | Checks SM support for Load Constant Uniform |
| sub_7BC360 | ~1 KB | UR register encoder | Encodes UR operands in SASS instruction words (126 callers) |
| sub_7BD7D0 | ~1 KB | UR register decoder | Decodes UR operands from SASS instruction words (type=4) |
| sub_94A020 | ~3.5 KB | Pre-allocation setup | Sets alloc+440 UR promotion flag from knob 628 + context flag +1414 |
| sub_900C50 | 391 B | Register class property accessor | Per-class property lookup (one of 6 identical functions for GP, predicate, UR, etc.) |
Related Pages
- Register Model -- UR file, register descriptor layout, allocator classes
- Ori IR Overview -- instruction format, partial SSA window
- Pass Inventory -- complete 159-phase table
- Liveness Analysis -- bitvector infrastructure used by UR liveness tracking
- Rematerialization -- phases 28, 54, 69 (interact with speculation analysis)
- Predication -- phase 63, changes divergence landscape before UR conversion
- Register Allocator -- 7-class allocator handling R and UR independently
- GMMA Pipeline -- phases 85, 87 (adjacent to InsertPseudoUseDefForConvUR)
- GPU ABI -- convergent allocation, allowConvAlloc enforcement
- Scheduler Architecture -- UR pressure tracking in scheduling
Late Expansion & Legalization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas pipeline contains six legalization passes spread across the 159-phase sequence. Their collective job is to replace Ori IR operations that the target SM cannot execute natively with equivalent sequences of legal instructions. "Unsupported ops" means exactly this: operations that exist in the PTX ISA or internal Ori representation but have no single-instruction mapping on the compilation target. The replacement may be an inline expansion (a sequence of simpler instructions), a call to a libdevice helper function, or an SM-specific intrinsic sequence.
The six passes run at deliberately different pipeline positions because each intervening group of optimization passes can expose new unsupported operations or create new legalization opportunities.
| Passes covered | 6 (phases 5, 45, 55, 78, 93, 137) |
| Category | Lowering |
| Backend dispatch | Architecture-specific via two backend objects at context+0x630 and context+0x640 |
| Libdevice functions | 608 helper functions registered at sub_5D1660 (9,728-byte table from unk_1D4D940) |
| Legalization flag | SetAfterLegalization (phase 95) marks the point past which no unsupported ops should remain |
| Update pass | UpdateAfterConvertUnsupportedOps (phase 132, factory 8) rebuilds IR metadata after late expansion |
| Knob gates | Knob 499 (ConvertUnsupportedOps, LateExpansionUnsupportedOps), knob 487 (LateExpansion, SetAfterLegalization, LateExpansionUnsupportedOps), knob 214 / 464 (LateExpansionUnsupportedOps inner loop) |
Why Six Passes
A monolithic legalize-everything pass early in the pipeline would cripple optimization. Many optimizations (CSE, LICM, strength reduction, predication) work on high-level operation semantics. If div.rn.f64 were expanded into a 30-instruction Newton-Raphson sequence at phase 5, loop-invariant code motion at phase 35 would see 30 independent instructions instead of one hoistable division. Conversely, some unsupported operations only appear after optimization passes transform the IR: predication (phase 63) can create new predicated ops that need legalization, GMMA fixup (phase 87) can introduce new WGMMA-related sequences, and conditional flow merging (phases 133/136) can expose operations that were previously dead.
The six passes form a progressive legalization strategy:
| Phase | Name | Pipeline Position | Purpose |
|---|---|---|---|
| 5 | ConvertUnsupportedOps | Before optimization (stage 1) | Early legalization of obviously unsupported ops; preserves optimization opportunities for everything else |
| 45 | MidExpansion | After early/mid optimization (stage 3) | Target-dependent expansion after loop unrolling, strength reduction, and GVN have run |
| 55 | LateExpansion | After high-level optimizations (stage 4) | Expansion of ops that optimization passes should see in unexpanded form |
| 78 | LateExpansionUnsupportedOps | After all optimization (stage 5) | Catches remaining unsupported ops after predication, rematerialization, and uniform conversion |
| 93 | LateExpansionUnsupportedOps2 | After GMMA/attr passes (stage 5) | Second catch -- handles ops exposed by GMMA propagation, GMMA fixup, and register attribute setting |
| 137 | LateExpansionUnsupportedOpsMid | After late merge (stage 10) | Final catch between the two conditional flow merge passes |
Architecture Backend Dispatch
None of the six passes contain legalization logic directly. Each is a thin dispatcher that forwards to a virtual method on one of two architecture backend objects stored in the compilation context. The backend objects are constructed per-SM-target and provide the actual SM-specific legalization implementations.
Two backend objects:
| Context Offset | Used By | Role |
|---|---|---|
| context+0x640 | ConvertUnsupportedOps, LateExpansion | Outer backend -- wraps an inner object at +0x10, provides two-level dispatch |
| context+0x630 | MidExpansion, LateExpansionUnsupportedOps, LateExpansionUnsupportedOps2, LateExpansionUnsupportedOpsMid, SetAfterLegalization | SM backend -- single-level dispatch through vtable |
The two-level dispatch through context+0x640 allows the outer backend to override the entire legalization strategy (by replacing vtable slot 0), while the inner object provides the SM-specific implementation when the outer backend delegates. This separation exists because ConvertUnsupportedOps and LateExpansion may need to coordinate with higher-level compilation modes (e.g., library compilation, OptiX IR) that wrap the SM backend.
Backend Vtable Slots
The SM backend at context+0x630 dispatches legalization through these vtable offsets:
| Vtable Offset | Decimal | Called By |
|---|---|---|
| +0xB0 | 176 | MidExpansion |
| +0xD8 | 216 | LateExpansionUnsupportedOps2 |
| +0x108 | 264 | SetAfterLegalization |
| +0x178 | 376 | LateExpansionUnsupportedOps |
| +0x180 | 384 | LateExpansionUnsupportedOpsMid |
The outer backend at context+0x640 dispatches:
| Vtable Offset | Decimal | Called By |
|---|---|---|
| +0x00 | 0 | ConvertUnsupportedOps (type check -- compared against sub_661280) |
| +0x78 | 120 | ConvertUnsupportedOps (delegated to inner object) |
| +0x58 | 88 | LateExpansion (type check -- compared against sub_6612E0) |
| inner +0xE0 | 224 | LateExpansion (delegated to inner object) |
Pass Details
Phase 5 -- ConvertUnsupportedOps
Factory index: 5
Vtable: off_22BD690
execute(): sub_C60A20 (thunk -> context+0x640 dispatch)
isNoOp(): sub_C5F610 (returns 0 -- always runs)
Flag side-effect: sets context+1378 bit 0 (isConvertUnsupportedDone)
Knob gate: 499 (checked via sub_7DDB50)
Pipeline: Bracketed by AdvancedPhaseBeforeConvUnSup (4) and AdvancedPhaseAfterConvUnSup (7)
This is the earliest legalization pass, running at phase 5 before any optimization. It converts operations that are clearly illegal on the target SM into equivalent sequences. The pass always runs (isNoOp = false) and is unconditional -- every compilation executes it.
Dispatch mechanism. The execute function (sub_C60A20) reads the backend at context+0x640, checks whether vtable slot 0 is the default implementation (sub_661280), and either calls the overridden method directly or unwraps to the inner object at backend+0x10 and calls vtable offset +0x78 (120). This two-level indirection allows library-mode and OptiX-mode compilation to inject custom legalization logic.
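The two-level indirection can be modeled abstractly. A hedged Python sketch of the described control flow (class and method names are hypothetical; the real mechanism is raw vtable-pointer comparison, not Python dispatch):

```python
class InnerSMBackend:
    """Per-SM legalization implementation (the object at backend+0x10)."""
    def legalize_early(self, fn):          # stands in for vtable offset +0x78
        return f"sm-default legalization of {fn}"

class OuterBackend:
    """The object at context+0x640."""
    def __init__(self, inner, override=None):
        self.inner = inner                 # inner object at +0x10
        self.override = override           # non-None models vtable slot 0 != sub_661280

    def convert_unsupported_ops(self, fn):
        if self.override is not None:
            return self.override(fn)       # overridden strategy runs whole pass
        return self.inner.legalize_early(fn)   # unwrap and delegate to SM backend

plain = OuterBackend(InnerSMBackend())
print(plain.convert_unsupported_ops("kernel"))

# A library/OptiX-style compilation mode injecting custom legalization:
optix = OuterBackend(InnerSMBackend(),
                     override=lambda fn: f"optix legalization of {fn}")
print(optix.convert_unsupported_ops("kernel"))
```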
Flag effect. After execution, the pass sets bit 0 of context+1378, signaling to downstream passes that early legalization has completed. Passes like OriCreateMacroInsts (phase 8) check this flag to know whether certain patterns have already been lowered.
What gets legalized early: Operations that cannot survive optimization in their original form. Examples include operations that reference address spaces not supported on the target, certain modifier combinations that have no encoding, and PTX instructions that are syntactically valid but architecturally illegal (e.g., atom.add.f64 on targets without native FP64 atomics).
Phase 45 -- MidExpansion
Factory index: 51
Vtable: off_22BDDC0
execute(): sub_C5EFB0 (thunk -> context+0x630 vtable+0xB0)
isNoOp(): sub_C5EFD0 (returns 0 -- always runs)
Field side-effect: sets context+1552 = 3
Pipeline: After ExpandMbarrier (42), ForwardProgress (43), OptimizeUniformAtomic (44)
Before GeneralOptimizeMid2 (46)
MidExpansion runs after the CTA/mbarrier/barrier expansion passes and before the second mid-level GeneralOptimize bundle. It handles target-dependent expansions that must occur after barrier-related lowering but before the mid-level optimization cleanup.
Dispatch. Dispatches directly through the SM backend vtable at offset +0xB0 (176). No two-level indirection -- the SM backend provides the implementation directly.
Side effect. Sets context+1552 to 3. This field is the pipeline progress counter (not exclusively a legalization counter -- see Context Fields below) and is read by subsequent passes to determine which pipeline stages have completed. The value 3 indicates "mid-expansion complete."
Phase 55 -- LateExpansion
Factory index: 63
Vtable: off_22BDFA0
execute(): sub_C60AA0 (thunk -> context+0x640 dispatch)
isNoOp(): sub_C5EE20 (returns 0 -- always runs)
Field side-effect: sets context+1552 = 7 (via inner dispatch)
Pipeline: After OriDoRematEarly (54), before SpeculativeHoistComInsts (56)
Followed by GeneralOptimizeLate (58)
LateExpansion is the primary post-optimization legalization pass. It runs after all high-level optimizations (loop unrolling, strength reduction, GVN-CSE, reassociation, predication setup) have completed, expanding operations that were deliberately kept in high-level form for those passes.
Dispatch. Uses the outer backend at context+0x640. Checks vtable slot +0x58 (88) against the default (sub_6612E0). If overridden, calls the override. Otherwise, calls the inner object's vtable at +0xE0 (224) and then sets context+1552 = 7, advancing the pipeline progress counter.
What gets expanded here: This is the pass where most math library calls are introduced. Operations like div.rn.f64, sqrt.rn.f32, rcp.rd.f64 that were kept as single Ori instructions through optimization are now replaced with Newton-Raphson sequences or calls to the 608-function libdevice library. The SM20 library functions (division, square root, reciprocal, bit-field extract/insert) and SM70 functions (WMMA matrix operations, barrier reductions) are the primary candidates.
Optimization interaction. GeneralOptimizeLate (phase 58) runs immediately after, cleaning up the expanded sequences with copy propagation, constant folding, and dead code elimination. This is why expansion happens here rather than later -- the expanded code benefits from one more optimization round.
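To make the Newton-Raphson expansion concrete, here is the numerical core of such a sequence. This is an illustrative sketch of the mathematics, not ptxas's actual emitted code; on hardware the seed would come from an approximation instruction such as MUFU.RCP and each step would be an FMA pair:

```python
def nr_reciprocal(a, x0, steps=3):
    """Refine an initial reciprocal estimate x0 ~ 1/a.
    Each iteration x <- x * (2 - a*x) roughly doubles the correct bits."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - a * x)   # FMA-friendly refinement step
    return x

approx = nr_reciprocal(3.0, 0.3)   # crude seed for 1/3
print(approx)                      # converges toward 0.333333...
```

A division `b / a` then becomes `b * nr_reciprocal(a, seed)` plus rounding fixups, which is why a single div instruction expands into a multi-instruction sequence.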
Phase 78 -- LateExpansionUnsupportedOps
Factory index: 90
Vtable: off_22BE3D8
execute(): sub_C5EA50 (thunk -> context+0x630 vtable+0x178)
isNoOp(): sub_C5EA70 (returns 0 -- always runs)
Knob gate: 499 (via sub_7DDB50), plus flag check: context+1414 bit 2
Pipeline: After AdvancedPhaseLateConvUnSup (77), before OriHoistInvariantsLate2 (79)
The first of three "late unsupported ops" catches. It runs after all optimizations have completed (phases 13-76) and catches operations that optimization passes themselves introduced or exposed.
Gating. This pass has the most complex gating of the six. In addition to the standard knob 499 check (via sub_7DDB50), it also checks bit 2 of context+1414. If the bit is clear, the pass is skipped even though isNoOp returns false. This allows the backend to dynamically disable the pass when no unsupported ops were detected during earlier compilation phases.
Implementation. When active, calls sub_7917F0 which:
- Checks context+1382 bit 2 (another prerequisite flag)
- Checks knob 214 (via the capability dispatch at context+1664)
- If the function table at context+0 + 1056 is not yet initialized, calls the expansion setup functions (sub_785E20, sub_781F80, sub_7E6090, sub_7E6AD0)
- Iterates over basic blocks, applying per-instruction legalization with convergence check (knob 464 gates the inner loop)
This iterative structure -- expand, check if more work needed, repeat -- handles cascading expansions where expanding one operation exposes another unsupported operation.
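The expand-until-fixpoint structure can be sketched in a few lines. The rule table below is hypothetical (invented opcodes for illustration); the point is the convergence loop that re-scans after each round because an expansion can itself produce expandable operations:

```python
# Hypothetical expansion rules: one opcode rewrites into a legal sequence,
# but that sequence may contain another opcode that needs expanding.
EXPANSIONS = {
    "DIV64": ["RCP64", "MUL64"],                    # division via reciprocal-multiply
    "RCP64": ["MUFU_RCP", "NR_STEP", "NR_STEP"],    # reciprocal via NR refinement
}

def legalize(insts, max_iters=10):
    for _ in range(max_iters):
        out, changed = [], False
        for op in insts:
            if op in EXPANSIONS:
                out.extend(EXPANSIONS[op])   # cascading expansion
                changed = True
            else:
                out.append(op)
        insts = out
        if not changed:                      # convergence: no work left
            return insts
    raise RuntimeError("legalization did not converge")

print(legalize(["LDG", "DIV64", "STG"]))
# -> ['LDG', 'MUFU_RCP', 'NR_STEP', 'NR_STEP', 'MUL64', 'STG']
```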
Phase 93 -- LateExpansionUnsupportedOps2
Factory index: 109
Vtable: off_22BE6D0
execute(): sub_C5E790 (thunk -> context+0x630 vtable+0xD8)
isNoOp(): sub_C5E7B0 (returns 0 -- always runs)
Pipeline: After AdvancedPhaseAfterSetRegAttr (92), before FinalInspectionPass (94)
The second late catch, positioned after the GMMA/WGMMA passes (85-87), register attribute setting (90), and texture dependency analysis (91). These intervening passes can introduce new operations that need legalization:
- GMMA propagation (phase 85) may introduce WGMMA accumulator movement operations
- GMMA sequence fixup (phase 87) may insert hardware ordering instructions
- Register attribute setting (phase 90) may expose operations that become illegal once register classes are assigned
Dispatch. Uses the SM backend vtable at offset +0xD8 (216). The dispatch is architecture-dependent: the execute function reads vtable slot 12 (backend[12]), compares against a default implementation (sub_661310), and either calls the override or falls through to a two-step sequence that calls methods at offsets 280 and 3088 on an inner object.
Phase 137 -- LateExpansionUnsupportedOpsMid
Factory index: 93
Vtable: off_22BE450
execute(): sub_C607E0 (thunk -> context+0x630 vtable+0x180)
isNoOp(): sub_C5EA00 (returns 0 -- always runs)
Default check: compares vtable+0x180 against sub_7D6D50 -- if default, entire pass is no-op
Pipeline: After LateMergeEquivalentConditionalFlow (136), before OriSplitHighPressureLiveRanges (138)
The final legalization catch, positioned between the two conditional flow merge passes (133, 136) and the last-resort live range splitter (138). The merge passes can combine basic blocks in ways that create new instruction sequences containing unsupported operations.
Conditional execution. Unlike the other five passes, this one has a soft no-op mechanism: the execute function reads vtable slot +0x180 (384) and compares the function pointer against the default implementation (sub_7D6D50). If the backend has not overridden this slot, the pass returns immediately without doing any work. This means the pass is truly active only on SM targets that define a LateExpansionUnsupportedOpsMid handler -- typically newer architectures (Hopper/Blackwell) that have more complex merge and expansion interactions.
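The soft no-op mechanism is an ordinary compare-against-default-vtable-slot pattern. A minimal sketch (types and function names are hypothetical analogues of the recovered vtable+0x180 / sub_7D6D50 pair):

```c
#include <assert.h>
#include <stddef.h>

typedef struct Backend Backend;
typedef int (*expand_fn)(Backend *);

/* Analogue of sub_7D6D50: the default (un-overridden) implementation. */
static int default_expand(Backend *b) { (void)b; return 0; }
/* Analogue of an arch-specific override (e.g. a Hopper/Blackwell backend). */
static int arch_expand(Backend *b)    { (void)b; return 42; }

struct Backend { expand_fn mid_expand; };   /* analogue of slot +0x180 */

/* The pass body: if the slot still holds the default, return immediately
 * without doing any work; otherwise dispatch to the override. */
static int run_late_expansion_mid(Backend *b) {
    if (b->mid_expand == default_expand)
        return 0;                     /* soft no-op on this SM target */
    return b->mid_expand(b);
}
```

The function-pointer equality test is cheap, which is why this pass can afford to "always run" at the phase-manager level while remaining free on targets without a handler.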
Supporting Passes
Phase 95 -- SetAfterLegalization
Factory index: 111
Vtable: off_22BE720
execute(): sub_C5F8A0
isNoOp(): sub_C5E9C0 (returns 0 -- always runs)
Pipeline: After FinalInspectionPass (94), before ReportBeforeScheduling (96)
Not a legalization pass per se. It marks the compilation context as post-legalization by calling the SM backend's vtable at offset +0x108 (264). This sets the legalization_complete flag that downstream passes (scheduling, register allocation, encoding) check to assert that no unsupported operations remain. The pass is gated by optimization level: sub_7DDB50 returns the current optimization level, and the dispatch only fires at -O2 and above.
Phase 132 -- UpdateAfterConvertUnsupportedOps
Factory index: 8
Vtable: off_22BD708
execute(): sub_C5F570 (rep ret -- NOP)
isNoOp(): sub_C5F590 (returns 1 -- skipped by default)
Pipeline: First pass in Stage 10
A placeholder update pass that rebuilds IR metadata after late unsupported-op conversion. Its execute() is a NOP (rep ret) and isNoOp() returns 1 (true), so it is skipped by default. Architecture backends can override the vtable to activate it when late expansion produces structural changes requiring metadata rebuild.
Libdevice Function Library
The legalization passes replace unsupported operations with calls to a library of 608 predefined helper functions. These are not external libraries -- they are PTX function bodies embedded in the ptxas binary itself, compiled and linked into the output as needed.
The function table is initialized by sub_5D1660, which copies a 9,728-byte pre-built table from unk_1D4D940 and registers 608 function names in a hash map for lookup.
Library Function Categories
| SM Prefix | Count | Operations |
|---|---|---|
| __cuda_sm20_ | 70 | Division (f32/f64, all rounding modes), reciprocal (f32/f64, all rounding modes), square root (f32/f64), double-precision reciprocal sqrt, bit-field extract/insert 64-bit, integer division/remainder (s16/s64/u16/u64) |
| __cuda_sm3x_ | 4 | FP32 division with FTZ variants (Kepler-specific paths) |
| __cuda_sm62_ | 2 | DP2A, DP4A dot-product accumulate (pre-Volta emulation) |
| __cuda_sm70_ | 397 | Barrier operations (arrive/red/wait with 0-15 barrier IDs and count variants), WMMA matrix operations (204 variants for different shapes/types), warp shuffle sync, warp vote sync, match sync |
| __cuda_sm80_ | 3 | Cache policy creation (fractional, range encode) |
| __cuda_sm1xx_ | 18 | Bulk copy (unicast/multicast), async bulk tensor copy (1D-5D tile/im2col, unicast/multicast) |
| __cuda_sm10x_ | 16 | TCGen05 guardrail traps (bounds check, alignment, allocation), tcgen05 MMA operations, mask creation |
| __cuda_scalar_video_emulation_ | 7 | Video instruction emulation (operand extract, sign extend, saturate, merge) |
| __cuda_reduxsync_ | 18 | Redux-sync reductions (and/or/xor for b32, add/max/min for s32/u32/f32 with NaN/abs variants) |
| __cuda_sanitizer_ | 6 | Memory sanitizer checks (malloc/free/generic/global/local/shared/metadata) |
| Other | ~67 | Miscellaneous: dummy entries, user-function stubs, device synchronize |
SM-Dependent Legalization Examples
The core design principle: what is "unsupported" depends entirely on the target SM. An operation legal on one architecture may require library expansion on another.
Integer division/remainder. PTX div.s64 and rem.u64 have no single SASS instruction on any SM. They are always expanded to multi-instruction sequences via __cuda_sm20_div_s64, __cuda_sm20_rem_u64, etc. These are "sm20" functions because the expansion has been the same since Fermi.
FP32 division with rounding. div.rn.f32 on Turing (sm_75) uses a hardware-assisted Newton-Raphson (MUFU.RCP + refinement). On Kepler (sm_3x, no longer shipped but the code path remains), different refinement sequences are needed, using __cuda_sm3x_div_rn_ftz_f32 and its slowpath variant.
Barrier operations. On Volta+ (sm_70), barrier.arrive with a specific barrier ID and thread count is a single SASS instruction (BAR.ARV). On pre-Volta targets, these must be emulated; the __cuda_sm70_barrier_* library functions (part of the 397-function sm70 group) implement the semantic equivalent using older synchronization primitives.
WMMA/Tensor Core. Warp-level matrix multiply-accumulate (wmma.*) on sm_70 has dedicated hardware instructions (HMMA). The 204 __cuda_sm70_wmma_* variants cover the combinatorial explosion of shapes (m16n16k16, m8n32k16, m32n8k16), types (f16, bf16, tf32, s8, u8, s4, u4, b1), layouts (row/col), and accumulator types.
DP2A/DP4A. The integer dot-product-accumulate instructions have native hardware support starting at sm_61. On sm_62 (Xavier), they use __cuda_sm62_dp2a and __cuda_sm62_dp4a emulation routines.
Bulk tensor copy (Blackwell). The cp.async.bulk.tensor family on sm_100+ (Blackwell) supports 1D through 5D tile and im2col access patterns, with unicast and multicast variants. These 18 __cuda_sm1xx_cp_async_bulk_tensor_* functions provide the expansion for targets where hardware support is partial or absent.
TCGen05 guardrails (Blackwell). The 5th-generation tensor core operations (sm_100+) include runtime guardrail traps -- bounds checking, alignment validation, allocation granularity checks -- implemented as __cuda_sm10x_tcgen05_guardrail_trap_* functions inserted during legalization.
Context Fields
The legalization passes interact with several fields on the compilation context:
| Offset | Type | Description |
|---|---|---|
| +0x630 | void* | SM backend object (main legalization dispatch target) |
| +0x640 | void* | Outer backend object (wraps SM backend, used by ConvertUnsupportedOps and LateExpansion) |
| +1378 | byte | Bit 0: ConvertUnsupportedOps has run |
| +1382 | byte | Bit 2: prerequisite flag for LateExpansionUnsupportedOps |
| +1414 | byte | Bit 2: enable flag for LateExpansionUnsupportedOps |
| +1552 | int32 | Pipeline progress counter -- written by multiple passes across legalization, optimization, and post-RA stages (see value table below) |
| +1664 | void* | Capability dispatch object (knob/option queries) |
The pipeline progress counter at context+1552 provides a monotonically increasing value that downstream passes can check to determine which pipeline stages have completed. Despite being documented previously as a "legalization stage counter," it is written by passes outside the legalization family (rematerialization, backward copy propagation, architecture-specific peephole, post-RA finalization):
| Value | Writer | Phase | Function |
|---|---|---|---|
| 0 | Context constructor | -- | sub_7F7DC0 |
| 3 | MidExpansion | 45 | sub_C5EF80 |
| 4 | OriDoRematEarly | 54 | sub_C5EF30 |
| 7 | LateExpansion | 55 | sub_6612E0 |
| 8 | Peephole/ISel refinement (arch-specific) | varies | sub_849C60 |
| 9 | OriBackCopyPropagate | 83 | sub_C5EB80 |
| 10 | PostRAFinalizer (arch-specific) | varies | sub_88E9D0 |
| 12 | SetAfterLegalization | 95 | sub_C5E980 |
Downstream passes compare against these thresholds: sub_A11060 checks > 4 to enable cross-block rematerialization; sub_752CF0 checks <= 3; sub_766520 checks <= 11; sub_781F80 checks <= 12; sub_78B8D0 checks > 18.
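These threshold comparisons are simple integer gates on the context+1552 value. A sketch of two of the recovered checks (the predicate names are descriptive labels, not recovered symbols):

```c
#include <assert.h>

/* Analogue of sub_A11060's gate: cross-block rematerialization is
 * enabled only after the counter passes 4 (i.e. OriDoRematEarly and
 * LateExpansion have written their values). */
static int remat_cross_block_enabled(int progress) { return progress > 4; }

/* Analogue of sub_752CF0's gate: active only in the early pipeline,
 * before LateExpansion writes 7. */
static int early_phase_only(int progress)          { return progress <= 3; }
```

Because the counter is monotonically increasing, a single `>` or `<=` test is enough to establish which pipeline stages have completed.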
Pipeline Position Summary
Phase 0-4: Initial setup, FP16 promotion, CFG analysis
Phase 5: ConvertUnsupportedOps <-- LEGALIZATION #1
Phase 6-44: Optimization passes (branch, loop, strength reduction, GVN, barrier expansion)
Phase 45: MidExpansion <-- LEGALIZATION #2
Phase 46-54: Mid/late optimization (GVN-CSE, reassociation, predication setup, remat)
Phase 55: LateExpansion <-- LEGALIZATION #3
Phase 56-77: Late optimization (predication, commoning, LICM, remat, sync, phi destruction, uniform)
Phase 78: LateExpansionUnsupportedOps <-- LEGALIZATION #4
Phase 79-92: Post-opt (LICM, arch opt, back copy prop, GMMA, reg attrs)
Phase 93: LateExpansionUnsupportedOps2 <-- LEGALIZATION #5
Phase 94: FinalInspectionPass
Phase 95: SetAfterLegalization (marks legalization complete)
Phase 96-136: Scheduling, RA, Mercury, post-RA, late merge
Phase 137: LateExpansionUnsupportedOpsMid <-- LEGALIZATION #6
Phase 138: OriSplitHighPressureLiveRanges
Key Functions
| Address | Size | Role |
|---|---|---|
| sub_C60A20 | ~40B | ConvertUnsupportedOps execute dispatcher |
| sub_C5EFB0 | ~16B | MidExpansion execute dispatcher |
| sub_C60AA0 | ~50B | LateExpansion execute dispatcher |
| sub_C5EA50 | ~16B | LateExpansionUnsupportedOps execute dispatcher |
| sub_C607E0 | ~30B | LateExpansionUnsupportedOpsMid execute dispatcher |
| sub_C5E790 | ~16B | LateExpansionUnsupportedOps2 execute dispatcher |
| sub_C5F8A0 | ~30B | SetAfterLegalization execute |
| sub_7DDB50 | 232B | Optimization level gate (knob 499 check) |
| sub_7917F0 | ~400B | LateExpansionUnsupportedOps core implementation |
| sub_9059B0 | ~500B | LateExpansion core implementation (with expansion loop) |
| sub_5D1660 | ~8KB | Libdevice function table initializer (608 entries) |
| sub_785E20 | -- | Expansion setup (function table initialization) |
| sub_781F80 | -- | Expansion setup (mode configuration) |
| sub_7E6090 | -- | Instruction expansion driver |
| sub_7E6AD0 | -- | Instruction expansion driver (secondary) |
| sub_753600 | -- | Per-instruction legalization check |
| sub_753B50 | -- | Retry/convergence loop for iterative expansion |
| sub_13AF3D0 | 26,795B | Operand legalization dispatcher -- 164-case switch on opcode, called from sub_A29220 |
| sub_13A6280 | 1,289B | General operand materializer -- ensures operand is in legal register (called 83x) |
| sub_13A6AE0 | ~250B | Special-class operand materializer -- handles condition code and predicate classes |
| sub_13A7410 | ~50B | Try-inline-then-materialize wrapper -- checks sub_822750 before falling back |
| sub_13A6F90 | ~40B | Arch-immediate materializer -- like sub_13A7410 without pre-check |
| sub_13A45E0 | -- | Predicate operand materializer |
| sub_13A75D0 | -- | Uniform register conversion (class 6 to class 3) |
| sub_A29220 | -- | Pass driver that calls sub_13AF3D0 per instruction |
| sub_13ADB90 | 3,353B | Extended operand legalization variant (arch-specific override, vtable-dispatched) |
Operand Legalization Dispatcher
The SASS encoding backend cannot encode arbitrary operand forms. Before an instruction reaches the per-instruction encoder, every operand must be in a form the hardware encoding supports: a register in the correct class, an immediate that fits the bit-field width, or an absent-operand sentinel. The operand legalization dispatcher (sub_13AF3D0, 26,795 bytes) enforces these constraints. It is called once per instruction from the pass driver sub_A29220 and runs after ISel but before the SASS encoders.
Dispatcher Structure
The function reads the instruction opcode from field +72, masks off the predication flags (bits 12-13, mask & 0xCFFF), and enters a switch with 164 case labels covering Ori IR opcodes 0 through 352. Each case implements the legalization recipe for one opcode or a group of opcodes with identical operand layouts.
Before the switch, a pre-pass handles predicated instructions. If bit 12 of the opcode is set (indicating a predicate guard is present), the function first checks backend vtable slot +3232 for a custom handler. If none exists or it declines, sub_13A6AE0 is called on the predicate guard operand (at position operand_count - 2) to ensure it is in a legal register.
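The opcode normalization is a plain bit mask: 0xCFFF clears bits 12-13 (the predication flags) so the switch sees the base opcode. A minimal sketch (helper names are descriptive, not recovered symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Mask 0xCFFF = ~0x3000: clears bits 12-13 (predication flags) while
 * preserving the rest of the opcode field read from instruction +72. */
static uint16_t base_opcode(uint16_t raw)   { return raw & 0xCFFF; }

/* Bit 12 set indicates a predicate guard is present. */
static int      is_predicated(uint16_t raw) { return (raw & 0x1000) != 0; }
```

So a predicated form of opcode 0x89 arrives as 0x1089, and the dispatcher's switch still routes it to case 0x89 after masking.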
The switch routes to five categories of legalization logic:
Direct operand materialization. The majority of cases call sub_13A6280 on each operand that might need conversion. Example for a 3-source FMA (case 6):
sub_13A6280(context, instruction, 3, insert_point, ...) // src0
sub_13A7410(backend, instruction, 4, 1, insert_point, ...) // src1 (try inline first)
sub_13A6280(context, instruction, 5, insert_point, ...) // src2
// then check optional predicate operands 6,7 via sentinel test
Variable-length operand scanning. Case 16 (store) scans up to 15 operand slots, testing each against the 0x70000000 sentinel to find where active operands end before legalizing each one.
Architecture-specific delegation. Cases 70, 243, 245-247, 254-255, 257-259, 262 delegate entirely to vtable+2816. Cases 280-281 delegate to vtable+2328 with adjusted operand counts. These are SM-specific instructions (tensor core, WGMMA, bulk copy) where operand constraints vary by architecture.
Opcode rewriting. Case 137 (MOV) rewrites the opcode field itself: to 0x82 (130) for conditional MOV, or to 0x109 (265) for MOV-from-special-register when the source is in register class 4.
Passthrough. Cases 22, 24, 34, 38, 44, 45, 59, 73, 74, 77, 83, 106, 135, 161, 180, 182, 192, 194, 198, 209, 213-215, 221, 297, 352 and the default case require no operand legalization and exit immediately.
The 0x70000000 Null-Operand Sentinel
Each operand occupies an 8-byte slot in the instruction. The lower 4 bytes encode the operand value and type:
| Bits | Field | Values |
|---|---|---|
| [30:28] | Type | 1=register, 2=signed immediate, 3=unsigned immediate, 5=predicate, 7=null |
| [23:0] | Payload | Register index or immediate value |
| [31] | Negate | 1=operand is negated |
| +7 (byte) | Flags | Bit 0: uniform/constant bank reference |
The sentinel value 0x70000000 encodes type 7 ("null") with zero payload and no negation. It marks operand slots that are architecturally absent -- optional predicate guards not specified, trailing source operands of variable-width instructions, or unused operand positions in instructions with fewer sources than the maximum slot count.
The dispatcher tests for the sentinel with:
if ( ((*((_DWORD *)instr + offset) ^ 0x70000000) & 0x70000000) != 0 )
// operand is PRESENT -- legalize it
The XOR produces zero in bits [30:28] only when they are exactly 0b111 (type 7). The AND isolates those bits. If the result is zero, the operand is null and legalization is skipped. If non-zero, the operand is present and must be processed.
The function contains 59 references to 0x70000000. The heaviest user is case 16 (store), which chains 14 successive sentinel tests (at instruction offsets +84 through +196) to determine the store's vector width -- effectively implementing for each slot: if sentinel, stop; else legalize.
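The sentinel test and the store-width scan combine into a simple loop. This sketch reuses the dispatcher's exact XOR/AND test; the slot values and the function names are illustrative (the binary expresses the scan as 14 chained tests rather than a loop):

```c
#include <assert.h>
#include <stdint.h>

#define NULL_SENTINEL 0x70000000u

/* The dispatcher's test: present iff bits [30:28] are not exactly 0b111. */
static int operand_present(uint32_t slot_lo) {
    return ((slot_lo ^ NULL_SENTINEL) & NULL_SENTINEL) != 0;
}

/* Case-16-style scan: walk successive operand slots (lower DWORD of each
 * 8-byte slot) and stop at the first null sentinel. The count of active
 * slots is the store's vector width. */
static int store_data_width(const uint32_t *slots, int max_slots) {
    int n = 0;
    while (n < max_slots && operand_present(slots[n]))
        n++;                          /* present operand: legalize, continue */
    return n;
}
```

Note that any type-7 slot tests as null regardless of payload, because only bits [30:28] participate in the comparison.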
Operand Materialization Helpers
The dispatcher calls six helper functions depending on the operand class:
| Function | Calls | Role |
|---|---|---|
sub_13A6280 | 83 | General materializer. The core function. Checks if the operand can remain as-is (register in a legal class, or inline immediate that fits). If not, creates a MOV instruction via sub_92E800 to load the value into a fresh register, inserts it before the current instruction, and replaces the operand slot with a register reference (0x10000000 | reg_index). Short-circuits immediately for uniform registers (class 6). Uses sub_7DBC80 to test inline-immediate feasibility and sub_91D150/sub_91D160 for constant pool operations. |
sub_13A7410 | 15 | Try-inline-then-materialize. Checks sub_822750 first ("can this immediate be encoded inline for this arch?"). If yes, keeps the immediate. If no, tries sub_822990/sub_8229D0 for extended encoding paths. Falls back to sub_13A6280 only if all inline attempts fail. |
sub_13A6AE0 | 15 | Special-class materializer. Handles operands in non-standard register classes. For class 5 (predicate): returns immediately. For class 2 (condition code): creates a MOV with opcode 0x108. For immediates: calls sub_91D150 for constant pool lookup and replaces the operand. Used on predicate guard operands and instructions with condition-code sources. |
sub_13A6F90 | 7 | Arch-immediate materializer. Like sub_13A7410 but skips the sub_822750 pre-check. Used for operands where inline encoding is known to be architecture-dependent (texture coordinates, barrier IDs). |
sub_13A45E0 | 5 | Predicate materializer. Handles materialization of optional predicate operand slots, called exclusively after a sentinel test confirms the operand is present. |
sub_13A75D0 | 1 | Uniform register conversion. Called once (case 6, FMA) to handle uniform register class 6 operands that need conversion to general-purpose class 3. |
Materialization Flow (sub_13A6280 Detail)
The general materializer at sub_13A6280 (1,289 bytes) implements this decision tree for a single operand:
- Uniform register early exit. If the operand is a register (type 1) in class 6 (uniform), return immediately -- uniform registers are always legal in the encoding.
- Inline immediate check. If the operand is an immediate (type 2/3), call sub_7DBC80 to test whether the value fits in the instruction's immediate field. If it fits and passes the floating-point validity check (vtable+1504) and architecture encoding check (vtable+3248), keep the immediate as-is.
- Register reclassification. If the operand is a register in class 3 (general-purpose), query the architecture via vtable+1240 and vtable+904 to determine if the register should be reclassified to uniform class 6 (for data types with width <= 3 register slots).
- Data-type conversion. For boolean (sub_7D66E0) or floating-point (sub_7D6780) operand types, call vtable+904 to map the data type to the appropriate register class.
- Materialization. Call sub_92E800 to create a MOV instruction (opcode 0x82 = 130) that loads the constant/immediate into a new register. Insert it at the insertion point. Replace the operand slot: lower word becomes 0x10000000 | new_reg_index (type 1 = register), upper word is cleared with & 0xFEC00000.
- Insertion point update. If the insertion point a4 currently points to the instruction being legalized, advance it to the newly inserted MOV so subsequent materializations are ordered correctly.
Opcode Groups and Legalization Recipes
| Opcodes | Instruction Class | Operands Legalized | Notes |
|---|---|---|---|
| 2-7 | Arithmetic (ADD/MUL/FMA) | dst, src0, src1 [, src2] | FMA (6) has optional predicate slots checked via sentinel |
| 8 | LD (load) | Variable based on addressing mode | Operand count read from +80 |
| 10-11, 151-152, 290-291 | Compare/select | src0, src1 | Standard 2-source legalization |
| 16 | ST (store) | 1-15 data operands | Sentinel-scanned variable width |
| 32 | ATOM (atomic) | dst, addr, data | Specialized register conversion |
| 36 | TEX (texture) | coords + handle | Texture handle materialization |
| 42, 53, 55 | Shift/logic | src0 + try-inline src1 | sub_13A6280 + sub_13A7410 |
| 51 | PRMT (permute) | src0, control, src1 | sub_13A6F90 for arch-dependent control operand |
| 61 | Branch-conditional | Nested switch on modifier bits | 6 sub-cases for different branch forms |
| 70, 243-262 | Tensor/WGMMA/bulk | Delegated to vtable+2816 | Architecture-specific |
| 82, 166, 196 | FP convert | src + try-inline | sub_13A6280 + sub_13A7410 + optional sub_13A6F90 |
| 88-89 | ATOMS/ATOMG | Loop over sources | Per-source legalization with count |
| 110-121 | Wide arithmetic | src0, src1, src2 | 3 consecutive sub_13A6280 calls |
| 137 | MOV | Opcode rewrite | Rewrites to 0x82 or 0x109 based on register class |
| 230-232 | LD/ST extended | src + inline + arch | sub_13A6280 + sub_13A7410 + sub_13A6F90 |
| 270-289 | Control flow / misc | Variable | Several sub-groups with different patterns |
| 280-281 | Multi-source | Delegated to vtable+2328 | Operand count adjusted by -4 |
Architecture Override Points
The dispatcher provides three escape hatches for architecture-specific behavior:
| Vtable Offset | Decimal | Opcodes | Purpose |
|---|---|---|---|
+2816 | 0xB00 | 70, 243, 245-247, 254-255, 257-259, 262 | Full delegation for SM-specific instructions |
+2328 | 0x918 | 280-281 (+ other cases) | Multi-source instructions with adjusted operand counts |
+3232 | 0xCA0 | Pre-switch (predicated instructions) | Custom predicate guard handling |
The vtable+2816 handler receives (backend, instruction, insert_point, pass_context, mode_flag) and is expected to perform complete operand legalization for the instruction. The vtable+2328 handler receives an adjusted operand count (total - 4), suggesting these instructions have 4 fixed operands plus a variable source list.
Relationship to Legalization Passes
The operand legalization dispatcher operates at a different abstraction level than the six legalization passes described above. The legalization passes (phases 5-137) operate on the Ori IR, replacing unsupported operations with sequences of supported ones. The operand legalization dispatcher operates on individual operands within already-legal instructions, ensuring each operand is in a form the SASS encoder can bit-pack into machine code.
The dispatcher runs as part of the SASS encoding pipeline (called from sub_A29220), well after all six Ori-level legalization passes have completed. It is invoked per-instruction during the encoding walk, not as a standalone pass.
Ori legalization passes (phases 5-137)
Replace unsupported OPERATIONS with legal sequences
|
v
SASS operand legalization (sub_13AF3D0, during encoding)
Ensure each OPERAND of a legal instruction is encodable
|
v
SASS per-instruction encoders (522 functions)
Pack operands into binary instruction word
Cross-References
- Pass Inventory & Ordering -- Complete 159-phase table with legalization passes highlighted
- Phase Manager Infrastructure -- Phase factory, vtable layout, dispatch loop
- SM Architecture Map -- Per-SM capability tables driving legalization decisions
- GeneralOptimize Bundles -- Cleanup passes that run after expansion (phases 46, 58)
- GMMA/WGMMA Pipeline -- Phases 85, 87 that create work for LateExpansionUnsupportedOps2
- Synchronization & Barriers -- Barrier expansion (phase 42) that feeds MidExpansion
- Mercury Encoder -- Post-legalization encoding (must see only legal ops)
- Optimization Levels -- SetAfterLegalization gating by -O level
- Knobs System -- Knobs 214, 464, 487, 499 controlling legalization
- SASS Encoding Format -- Per-instruction SASS encoders that consume legalized operands
- Instruction Representation -- Ori IR operand layout (8-byte slots, type/payload encoding)
Allocator Architecture
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas register allocator is a fat-point greedy allocator, not a graph-coloring allocator. There is no interference graph, no Chaitin-Briggs simplify-select-spill loop, and no graph coloring in the main allocation path. Instead, the allocator maintains per-physical-register pressure histograms (512-DWORD arrays) and greedily assigns each virtual register to the physical slot with the lowest interference count. This design trades theoretical optimality for speed on the very large register files of NVIDIA GPUs (up to 255 GPRs per thread).
A secondary live-range-based infrastructure (~80 functions at 0x994000--0x9A1000) supports coalescing, splitting, and pre-coloring but feeds results into the fat-point allocator rather than replacing it.
| Entry point | sub_9721C0 (1086 lines) |
| Per-class driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry |
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Spill guidance | sub_96D940 (2983 lines) -- per-class priority queues |
| Spill codegen | sub_94F150 (561 lines) -- emit spill/reload instructions |
| Pre-coloring | sub_991790 (2677 lines) -- full-function pre-assignment |
| Address range | 0x8FE000 -- 0x9D3000 (~860 KB, ~950 functions) |
| Knobs | 87 OCG knobs (RegAlloc* / RegTgt* / RegUsageLevel, indices 613--699) |
Pipeline Position
The register allocator runs in the late pipeline, after all optimization passes and instruction scheduling preparation, but before final SASS encoding:
... optimization passes ...
Late Legalization / Expansion
AdvancedPhaseAllocReg gate <-- pipeline entry guard
HoistInvariants <-- sub_8FFDE0 (optional)
ConvertMemoryToRegisterOrUniform <-- sub_910840
Pre-coloring <-- sub_991790
Instruction lowering <-- sub_98F430 / sub_98B160
Register allocation entry <-- sub_9721C0
Per-class allocation x 6 <-- sub_971A90 for classes 1..6
Core fat-point allocator <-- sub_957160
Post-allocation fixup
Instruction scheduling
SASS encoding
Register Classes
The allocator processes 7 register classes. Class 0 (unified) is skipped in the normal per-class loop; it is used for cross-class constraint propagation. Classes 1--6 are allocated independently in order:
| ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 0 | -- | -- | -- | Unified / cross-class (skipped in main loop) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (extended uniform) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers (MMA/WGMMA) |
Barrier registers (B, UB) have reg_type = 9, above the <= 6 allocator cutoff, so they are handled by a separate mechanism.
Special registers that are always skipped during allocation:
- Indices 41--44: PT, P0--P3 (architectural predicates)
- Index 39: special register
The class ID is the reg_type value at vreg+64. The allocator distribution loop in sub_9721C0 reads this field directly and uses it as the bucket index.
Pair modes (vreg+48, bits 20--21): 0 = single, 1 = lo-half of pair, 3 = double-width (consumes two physical slots).
Entry Point: sub_9721C0
The top-level register allocation driver (1086 lines). Called once per function after the AdvancedPhaseAllocReg pipeline gate.
function regalloc_entry(alloc_state, compilation_ctx):
// 1. Rebuild liveness
rebuild_basic_blocks(compilation_ctx, 1) // sub_781F80
compute_liveness(compilation_ctx, 1) // sub_A10160
// 2. Initialize 7 register classes
for class_id in 1..6:
vtable[896](alloc_state, class_id) // init register file state
// 3. Sort instructions by priority
sort_instructions_by_priority(alloc_state) // sub_9375C0
// 4. Distribute vregs into per-class linked lists
for each vreg in function:
class = vreg.register_class
append(class_lists[class], vreg)
debug("\nREGALLOC GUIDANCE:\n")
// 5. Allocate each class independently
for class_id in 1..6:
alloc_with_spill_retry( // sub_971A90
alloc_state, compilation_ctx, class_id)
// 6. Post-allocation fixup
fix_load_opcode_187(alloc_state)
fix_call_saved_registers(alloc_state)
// 7. Handle OptixIR mode (ctx+896 == 4 or 5)
if is_optix_ir(compilation_ctx):
record_register_counts(compilation_ctx)
The entry point calls sub_789280 when a pre-allocation fixup bit (flag bit 2) is set, handles live-through-call register counting at lines 343--352, and sets up rematerialization lists at alloc_state[161..175].
Per-Class Driver: sub_971A90
The outer retry loop (355 lines) that wraps the core allocator with a two-phase strategy:
Phase 1 -- NOSPILL: Attempt allocation without allowing spills. Debug string: "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary).
Phase 2 -- SPILL: If NOSPILL fails, invoke spill guidance (sub_96D940) and retry with spilling enabled.
function alloc_with_spill_retry(alloc_state, ctx, class_id):
no_retarget = query_knob(638) // RegAllocNoRetargetPrefs (bool)
num_trials = query_knob(639) // RegAllocNumNonSpillTrials (int)
// Phase 1: NOSPILL
pre_allocation_pass(alloc_state) // sub_94A020
secondary_driver(alloc_state, ctx) // sub_95DC10
result = fatpoint_allocate(alloc_state, ctx, NOSPILL) // sub_957160
record_best_result(alloc_state, result) // sub_93D070
if result == SUCCESS:
return
// Phase 2: SPILL retry loop
for attempt in 1..num_trials:
guidance = compute_spill_guidance(ctx, attempt) // sub_96D940
result = fatpoint_allocate(alloc_state, ctx, SPILL)
record_best_result(alloc_state, result)
if result == SUCCESS:
break
if result == FAILURE:
final_fallback(alloc_state) // sub_936FD0
post_allocation_finalize(alloc_state) // sub_9714E0
For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates sub_939BD0 (spill setup) followed by sub_94F150 (spill codegen) before entering the retry loop.
Core Fat-Point Allocator: sub_957160
The central allocation function (1658 lines). This is where physical registers are actually chosen.
Data Structures
Two 2056-byte arrays (512 DWORDs + 2-DWORD sentinel each):
| Array | Role |
|---|---|
Primary (v12) | Per-physical-register interference count |
Secondary (v225) | Per-physical-register secondary cost (tie-breaking) |
Both arrays are zeroed with SSE2 vectorized loops at the start of each allocation round.
Algorithm
function fatpoint_allocate(alloc_state, ctx, mode):
maxRegs = alloc_state.hw_limit + 7 // from alloc+756
if mode == CSSA_PAIRED (6): maxRegs *= 2
if mode == CSSA (3): maxRegs *= 4
primary[512] = {0} // SSE2 memset
secondary[512] = {0}
threshold = query_knob(684) // RegAllocThresholdForDiscardConflicts, default 50
for each vreg in alloc_state.register_list: // linked list at +744
// Populate interference bitmaps for this vreg
build_interference_bitmaps(vreg, primary, secondary) // sub_957020
// Scan for minimum-pressure physical register
best_slot = -1
best_cost = MAX_INT
for slot in 0..maxRegs:
if primary[slot] > threshold:
continue // too congested
cost = primary[slot]
if cost < best_cost:
best_cost = cost
best_slot = slot
elif cost == best_cost:
// tie-break on secondary bitmap
if secondary[slot] < secondary[best_slot]:
best_slot = slot
if best_slot == -1:
emit_error("Register allocation failed with register count of '%d'")
return FAILURE
// Assign physical register
assign_register(alloc_state, ctx, mode, // sub_94FDD0
vreg, best_slot)
return alloc_state.register_count + 1
The interference threshold (RegAllocThresholdForDiscardConflicts, knob 684, default 50) is the key heuristic parameter. Slots with interference above this value are discarded (skipped entirely), forcing the allocator toward less-contested register slots even if they are not globally minimal.
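The scan-with-discard-threshold step can be isolated as a small function. This sketch mirrors the pseudocode above (minimum primary cost, secondary array as tie-breaker, slots above the knob-684 threshold discarded); the function name is descriptive, not a recovered symbol:

```c
#include <assert.h>
#include <limits.h>

/* Pick the least-contested physical slot. Slots whose interference count
 * exceeds `threshold` (RegAllocThresholdForDiscardConflicts, knob 684,
 * default 50) are skipped entirely. Returns -1 if no slot qualifies,
 * which corresponds to the allocation-failure path. */
static int pick_slot(const int *primary, const int *secondary,
                     int max_regs, int threshold) {
    int best_slot = -1, best_cost = INT_MAX;
    for (int slot = 0; slot < max_regs; slot++) {
        if (primary[slot] > threshold)
            continue;                            /* too congested: discard */
        if (primary[slot] < best_cost ||
            (primary[slot] == best_cost && best_slot >= 0 &&
             secondary[slot] < secondary[best_slot])) {
            best_cost = primary[slot];
            best_slot = slot;
        }
    }
    return best_slot;
}
```

The discard rule is what makes the allocator "greedy but spread out": a slot that is globally minimal can still lose to a slightly worse slot if its interference count crosses the threshold.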
Register Assignment: sub_94FDD0
The assignment function (155 lines) writes the physical register and propagates through alias chains:
function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
max_regs = regclass_info.max_regs // at +16
if slot >= max_regs and not vreg.is_spilled(): // flag 0x4000
vreg.set_needs_spill() // flag 0x40000
return
if vreg.needs_spill(): // flag 0x40000
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, vreg) // sub_94F150
return
// Non-spill path: commit assignment
consumption = compute_consumption(vreg) // sub_939CE0
update_peak_usage(alloc, consumption)
vreg.physical_register = slot
// Check for pre-allocated candidate
apply_preallocated_candidate(alloc, vreg) // sub_950100
// Propagate through alias chain
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot
alias = alias.alias_parent
Register consumption computation (sub_939CE0, 23 lines) accounts for paired registers: it returns assignment + (1 << (pair_mode == 3)) - 1, effectively consuming two slots for double-width registers.
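The consumption formula is compact enough to restate directly (function name is descriptive, not a recovered symbol):

```c
#include <assert.h>

/* sub_939CE0's formula: pair_mode == 3 (double-width) consumes two
 * physical slots, anything else consumes one. The return value is the
 * highest slot index touched by the assignment. */
static int register_consumption(int assignment, int pair_mode) {
    return assignment + (1 << (pair_mode == 3)) - 1;
}
```

For a single register assigned to slot 10 this yields 10; for a double-width pair it yields 11, reflecting the second consumed slot.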
Constraint System
The fat-point interference builder (sub_926A30, 4005 lines) processes 15+ constraint types extracted from instruction operand descriptors. Each operand encodes: bits 28--30 = operand type, bits 0--23 = register index.
| Type | Name | Description |
|---|---|---|
| 0 | Point interference | Single-instruction conflict at a specific program point |
| 1 | Register operand | Standard read/write interference |
| 2 | Immediate operand | No register interference generated |
| 3 | Paired register | Double-width; bit 23 distinguishes hi/lo half |
| 4 | Exclude-one | Specific physical register excluded from assignment |
| 5 | Exclude-all-but | Only one physical register permitted |
| 6 | Below-point | Interference active below the current program point |
| 7 | Range | Interference over an interval of program points |
| 8 | Phi-related | CSSA phi instruction (opcode 195) constraint |
| 9 | Barrier | Barrier register class constraint |
| 10--15 | Extended | Additional constraint variants |
The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops for bulk interference weight accumulation and dispatches through 7+ vtable entries for OCG knob queries.
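The quoted parameters match the standard 32-bit FNV-1a variant, which can be reproduced directly (the exact key type ptxas feeds into the hash is not recovered; bytes are used here for illustration):

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = 0x811C9DC5                            # offset basis (the "seed" above)
    for b in data:
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF       # FNV prime, wrapped to 32 bits
    return h

print(hex(fnv1a_32(b"")))   # 0x811c9dc5
print(hex(fnv1a_32(b"a")))  # 0xe40c292c
```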
Spilling Overview
Spilling triggers when the fat-point allocator cannot find a physical register within the budget. The subsystem has three components:
Spill guidance (sub_96D940, 2983 lines): Computes which registers to spill and in what order. Builds a 7-element guidance array (one per register class), each backed by an 11112-byte working structure containing 128-element bitmask arrays. Constructs priority queues of spill candidates using bitvector-based live range analysis. The function contains 7 near-identical code blocks (one per class), likely unrolled from a template.
Spill codegen (sub_94F150, 561 lines): Emits actual spill/reload instructions. Allocates a per-register spill info array (12 bytes per entry, initialized to {0, -1, -1}). Default spill cost is 15.0, reduced to 3.0 for certain architecture modes. Handles loop nesting via block frequency callbacks (vtable offset +8) and provides special handling for uniform registers (bit 0x200 in flags).
Spill memory targets:
| Target | Description |
|---|---|
| LMEM (local memory) | Default spill destination. Per-thread private memory. |
| SMEM (shared memory) | Alternative spill destination. Faster but shared across CTA. Assertion: "Smem spilling should not be enabled when functions use abi." |
Spill setup (sub_939BD0, 65 lines) selects configuration based on RegAllocEstimatedLoopIterations (knob 623) and the cost threshold at alloc+776:
| Condition | Bucket size | Alignment | Max size |
|---|---|---|---|
| Cost threshold == 0 | 8 | 4 | 1 MB |
| Cost threshold != 0 | 16 | 16 | 1 MB |
See Spilling for the full spill subsystem analysis.
Pre-Allocation and Mem-to-Reg
Two important pre-passes run before the main allocator:
ConvertMemoryToRegisterOrUniform
Entry: sub_910840 (327 lines). Promotes stack variables to registers or uniform registers. Gated by sub_8F3EA0 (eligibility check) and NumOptPhasesBudget (knob 487, budget type).
- sub_910840 -- entry (string: "ConvertMemoryToRegisterOrUniform")
- sub_905B50 (1046 lines) -- build promotion candidates
- sub_911030 (2408 lines) -- detailed analysis engine (def-use chains, dominance)
- sub_90FBA0 (653 lines) -- execute promotion, insert phi nodes
- sub_914B40 (1737 lines) -- post-promotion rewrite / phi-resolution
Pre-Allocation Pass
Entry: sub_94A020 (331 lines). Assigns physical registers to high-priority operands before the main allocator runs. Gated by RegAllocMacForce (knob 628, bool), RegAllocMacVregAllocOrder (knob 629, int), and RegAllocCoalescing (knob 618, bool).
For allocation modes 3, 5, or 6: iterates basic blocks calling sub_9499E0 (per-block scanner) and sub_93ECB0 (per-operand pre-assigner). Priority levels from RegAllocPrefMacOperands (knob 646): 1 = read operands, 2 = write operands, 3 = both.
Uses an opcode eligibility bitmask table (shift-based membership test on opcode - 22) to filter which instructions are candidates for pre-assignment.
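A plausible reconstruction of the shift-based membership test (the mask contents below are invented for illustration; only the opcode - 22 biasing and the shift-and-test idiom are recovered):

```python
# Hypothetical 64-bit eligibility mask: bit i set => opcode (22 + i) is a
# pre-assignment candidate. The real table contents are not recovered.
ELIGIBLE_MASK = (1 << 0) | (1 << 5) | (1 << 17)

def is_preassign_candidate(opcode: int) -> bool:
    """Bias the opcode by 22, then test the corresponding mask bit."""
    idx = opcode - 22
    if idx < 0 or idx > 63:
        return False
    return (ELIGIBLE_MASK >> idx) & 1 == 1

print(is_preassign_candidate(22))  # True  (bit 0)
print(is_preassign_candidate(27))  # True  (bit 5)
print(is_preassign_candidate(23))  # False
```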
Live Range Infrastructure
An interval-based live range system at 0x994000--0x9A1000 (~80 functions) supports auxiliary operations. This is not the main allocator but feeds results into it:
| Subsystem | Range | Count | Key Functions |
|---|---|---|---|
| Live range primitives | 0x994000--0x996000 | ~25 | Constructor, interval queries, weight, color get/set |
| Interference graph | 0x996000--0x99A000 | ~18 | Node/edge construction, adjacency, degree, coloring |
| Range operations | 0x99C000--0x9A1000 | ~35 | Merge, split, interference add/remove, copy detection |
| Register coalescing | sub_9B1200 | 1 | Copy elimination pass (800 lines) |
| Live range splitting | sub_9AEF60 | 1 | Interference graph update (900 lines, self-recursive) |
| Range merge engine | sub_9AD220 | 1 | Coalescing with cost heuristics (700 lines) |
| Range construction | sub_9A5170 | 1 | Build ranges from def-use chains (750 lines) |
Allocator State Object Layout
Full reconstruction from the constructor sub_947150 (1088 lines), cross-referenced with the core allocator, per-class driver, entry point, and spill subsystem. The object is at least 1752 bytes (the last initialized field is the 8-byte pointer at +1744). The constructor is called once per function before the allocation pipeline runs.
Header and Compilation Context (+0 -- +24)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 8 | ptr | &off_21E1648 | Vtable pointer (strategy dispatch, 40+ virtual methods) |
| +8 | 8 | ptr | arg | Compilation context (parent object) |
| +16 | 8 | ptr | off_21DBEF8 | Secondary vtable (allocation sub-strategy) |
| +24 | 8 | ptr | ctx->func | Function object pointer (from ctx+16) |
Pre-Allocation Candidate Tables (+32 -- +443)
Arena-allocated hash tables for pre-assigned registers. Each table is a 3-QWORD header {base, size, capacity} plus an arena node (24 bytes, allocated from the function memory pool with an incrementing class tag).
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +32 | 8 | ptr | 0 | Pre-alloc candidate list A head |
| +40 | 8 | ptr | 0 | Pre-alloc candidate list B head |
| +48 | 4 | DWORD | 0 | Pre-alloc candidate count A |
| +56 -- +208 | 160 | -- | 0 | Per-class registration slots (6 x {ptr, ptr, DWORD} = 24B each) |
| +216 | 8 | ptr | 0 | Registration slots tail |
| +224 | 8 | ptr | alloc(24) | Exclusion set arena node (class tag = 1) |
| +232 | 8 | ptr | alloc(24) | Pre-alloc hash table A arena node (class tag = 2) |
| +240 | 8 | ptr | 0 | Pre-alloc hash table A: base pointer |
| +248 | 8 | ptr | 0 | Pre-alloc hash table A: count |
| +256 | 8 | ptr | 0 | Pre-alloc hash table A: capacity |
| +272 | 8 | ptr | alloc(24) | Pre-alloc hash table B arena node |
| +280 | 24 | -- | 0 | Pre-alloc hash table B: {base, count, capacity} |
| +312 | 8 | ptr | alloc(24) | Pre-alloc hash table C arena node |
| +320 | 24 | -- | 0 | Pre-alloc hash table C: {base, count, capacity} |
| +352 | 8 | ptr | alloc(24) | Exclusion set hash table arena node (class tag = 3) |
| +360 | 8 | ptr | 0 | Exclusion set: base pointer |
| +368 | 8 | ptr | 0 | Exclusion set: count |
| +376 | 8 | ptr | 0 | Exclusion set: capacity |
| +384 | 4 | DWORD | 0 | Exclusion set: element count |
| +392 | 8 | ptr | =+352 | Exclusion alias A (points to same node) |
| +400 | 24 | -- | 0 | Exclusion secondary: {base, count, capacity} |
| +424 | 4 | DWORD | 0 | Exclusion secondary: element count |
| +432 | 8 | ptr | =+352 | Exclusion alias B |
| +440 | 1 | BYTE | 0 | MAC force pre-alloc flag (RegAllocMacForce, knob 628) |
| +441 | 1 | BYTE | 0 | Coalescing enable flag (RegAllocCoalescing, knob 618) |
| +442 | 1 | BYTE | 0 | MAC vreg alloc order (RegAllocMacVregAllocOrder, knob 629) |
| +443 | 1 | BYTE | 0 | Per-class mode flag (set by vtable+296 callback) |
Per-Class Bitvector Sets (+448 -- +695)
An array of 6 bitvector set entries (one per allocatable register class, classes 1--6). Each entry is 40 bytes: a linked-list header {head, data, tail, count} (32 bytes) plus an arena node pointer (8 bytes). The arena nodes carry incrementing class tags (4, 6, 8, 10, 12, 14). The constructor loop starts at +456 and increments by 40 until +656.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +448 | 8 | QWORD | 0 -> 6 | Bitvector set count (incremented in init loop) |
| +456 | 240 | array | -- | 6 x BitvectorSet (40B each): classes 1--6 |
| +696 | 24 | -- | 0 | Remat candidate list: {base, data, tail} |
| +720 | 4 | DWORD | 0 | Remat candidate list: count |
| +728 | 8 | ptr | alloc(24) | Remat candidate arena node (class tag = 2) |
Core Allocation State (+736 -- +872)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +736 | 8 | ptr | 0 | Register linked list: secondary head |
| +744 | 8 | ptr | 0 | Register linked list head (main walk list for sub_957160) |
| +752 | 1 | BYTE | 0 | Register list initialized flag |
| +756 | 4 | DWORD | -1 | Hardware register limit (max physical regs, per-class) |
| +760 | 4 | DWORD | -1 | Secondary HW limit |
| +764 | 4 | DWORD | -1 | Pre-alloc constraint count |
| +776 | 8 | double | -1.0 | Spill cost threshold |
| +788 | 4 | DWORD | -1 | Best allocation result (reset to 0 per allocation round) |
| +792 | 1 | BYTE | 0 | Allocation-in-progress flag |
| +800 | 1 | BYTE | 0 | Retry-active flag |
| +808 | 4 | DWORD | (dynamic) | Live range interference state |
| +816 | 8 | ptr | (dynamic) | Live range secondary structure (4-byte DWORD array at +816) |
| +824 | 1 | BYTE | 0 | Pre-coloring done flag |
| +832 | 8 | ptr | 0 -> dyn | Per-function spill info array pointer |
| +840 | 8 | ptr | 0 -> dyn | Per-function spill info arena node |
| +848 | 8 | ptr | 0 | Spill info secondary |
| +856 | 8 | ptr | 0 | Spill info tertiary |
| +864 | 1 | BYTE | 0 | Bank conflict awareness flag |
| +865 | 1 | BYTE | 0 | Spill-already-triggered flag |
| +872 | 8 | ptr | 0 | Debug / trace output state |
Per-Class Register File Descriptors (+880 -- +1103)
An array of 7 register class descriptors (one per class 0--6), each 32 bytes. Indexed as alloc + 880 + 32 * class_id. The per-class driver (sub_971A90) accesses max_regs as a1[32 * class_id + 884] and base_offset as a1[32 * class_id + 880].
RegClassDesc (32 bytes):
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 4 | DWORD | 0 | Base register offset (first physical reg in class) |
| +4 | 4 | DWORD | -1 | Max regs / HW limit (set by vtable[896] init callback) |
| +8 | 4 | DWORD | 0 | Current allocation count |
| +12 | 1 | BYTE | 0 | Class active flag |
| +13 | 1 | BYTE | 0 | Class overflow flag |
| +14 | 1 | BYTE | 0 | Class spill flag |
| +15 | 1 | -- | -- | Padding |
| +16 | 4 | DWORD | 148 | Phase ID begin (148 = unset sentinel) |
| +20 | 4 | DWORD | 148 | Phase ID end (148 = unset sentinel) |
| +24 | 8 | QWORD | -1 | Class auxiliary link |
Concrete addresses:
| Class | Offset Range | Description |
|---|---|---|
| 0 (unified) | +880 -- +911 | Cross-class (skipped in main loop) |
| 1 (R) | +912 -- +943 | GPR 32-bit |
| 2 (R alt) | +944 -- +975 | GPR variant |
| 3 (UR) | +976 -- +1007 | Uniform GPR |
| 4 (UR ext) | +1008 -- +1039 | Uniform GPR variant |
| 5 (P/UP) | +1040 -- +1071 | Predicate registers |
| 6 (Tensor) | +1072 -- +1103 | Tensor / accumulator |
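The offset ranges in the table follow mechanically from the 32-byte stride; a quick check of the recovered indexing formula alloc + 880 + 32 * class_id:

```python
def regclass_desc_range(class_id: int) -> tuple:
    """Byte-offset range of the 32-byte RegClassDesc for a class (0--6)."""
    base = 880 + 32 * class_id
    return (base, base + 31)

for cid, name in enumerate(["unified", "R", "R alt", "UR", "UR ext", "P/UP", "Tensor"]):
    lo, hi = regclass_desc_range(cid)
    print(f"class {cid} ({name}): +{lo} -- +{hi}")
```

The output reproduces the concrete-address table above, e.g. class 3 (UR) at +976 -- +1007.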
Extended Class Metadata (+1096 -- +1127)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1096 | 8 | QWORD | -1 | Class 6 extended auxiliary link |
| +1104 | 8 | ptr | 0 | Extended class info: pointer A |
| +1112 | 8 | ptr | 0 | Extended class info: pointer B |
| +1120 | 4 | DWORD | 0 | Extended class info: count |
Per-Class Rematerialization Lists (+1128 -- +1271)
Six rematerialization candidate lists (one per allocatable class), each 24 bytes {ptr base, ptr data, DWORD count}. Initialized to zero. Populated before the allocation loop in sub_9721C0 for classes that support rematerialization.
| Class | Offset Range |
|---|---|
| 1 | +1128 -- +1151 |
| 2 | +1152 -- +1175 |
| 3 | +1176 -- +1199 |
| 4 | +1200 -- +1223 |
| 5 | +1224 -- +1247 |
| 6 | +1248 -- +1271 |
Coalescing / Live Range Lists (+1272 -- +1432)
Self-referential circular linked lists used for register coalescing and live range splitting. Each list has a sentinel structure where prev and next point into the list body.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1272 | 8 | ptr | arg2 | Back-pointer to compilation context |
| +1280 | 8 | ptr | 0 | Coalesce list A: sentinel head |
| +1288 | 8 | ptr | self+1296 | Coalesce list A: prev (self-referential) |
| +1296 | 8 | ptr | self+1280 | Coalesce list A: next (circular) |
| +1304 | 8 | ptr | 0 | Coalesce list A: data |
| +1312 | 4 | DWORD | (checked) | Coalesce list A: count (bit 0 = non-empty flag) |
| +1320 | 8 | ptr | self+1296 | Coalesce list A: end marker |
| +1328 | 4 | DWORD | 2 | Coalesce list A: type tag |
| +1336 | 8 | ptr | alloc(24) | Coalesce list A: arena node |
| +1344 | 8 | ptr | 0 | Coalesce list B: sentinel head |
| +1352 | 8 | ptr | self+1360 | Coalesce list B: prev |
| +1360 | 8 | ptr | self+1344 | Coalesce list B: next |
| +1368 | 8 | ptr | 0 | Coalesce list B: data (bit 2 checked as ABI flag) |
| +1376 | 8 | ptr | self+1344 | Coalesce list B: tail |
| +1384 | 8 | ptr | self+1360 | Coalesce list B: end marker |
| +1392 | 4 | DWORD | 2 | Coalesce list B: type tag |
| +1400 | 8 | ptr | alloc(24) | Coalesce list B: arena node |
| +1408 | 8 | ptr | alloc(24) | Interference graph arena node (bit 1 = call-saved mode) |
| +1416 | 8 | ptr | 0 | Interference graph: base |
| +1424 | 8 | ptr | 0 | Interference graph: data (bit 7 checked in sub_97EC60) |
| +1432 | 8 | ptr | 0 | Interference graph: capacity |
Debug / Rematerialization Infrastructure (+1440 -- +1496)
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1440 | 8 | -- | (tree) | Remat exclusion set (tree root, queried via sub_99C5B0) |
| +1448 | 1 | BYTE | 0 | Remat exclusion: active flag (checked in sub_962840, sub_94E620) |
| +1452 | 4 | DWORD | 0 | Remat exclusion: instruction threshold |
| +1464 | 16 | OWORD | 0 | Remat exclusion: data block B |
| +1472 | 8 | ptr | 0 | Remat candidate: linked list (freed in sub_99D190) |
| +1480 | 16 | -- | 0 | Remat candidate list (iterated by sub_94BDF0) |
| +1488 | 4 | DWORD | 0 | Remat candidate: count (checked in sub_99C690) |
| +1496 | 8 | ptr | 0 | Remat candidate: root pointer |
Spill / Retry Control Block (+1504 -- +1594)
The core state for the NOSPILL / SPILL retry loop. Zeroed at allocation start, populated by the per-class driver (sub_971A90), read/written by the fat-point allocator (sub_957160).
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1504 | 4 | DWORD | 0 | Allocation mode (0=normal, 3=CSSA, 5=SMEM, 6=paired) |
| +1508 | 4 | DWORD | 0 | Spill attempt counter |
| +1512 | 4 | DWORD | 0 -> 44 | Spill instruction count (knob 635, default 44) |
| +1516 | 4 | DWORD | -1 | Budget lower bound |
| +1520 | 4 | DWORD | -1 | Budget lower bound secondary (part of 128-bit at +1516) |
| +1524 | 4 | DWORD | -1 | Register budget (from per-class desc max_regs) |
| +1528 | 4 | DWORD | (dynamic) | Peak register usage (copied from +1532 per round) |
| +1532 | 16 | __m128i | (global) | Strategy parameters (loaded from xmmword_21E17F0) |
| +1540 | 4 | DWORD | 0 | Secondary budget limit (knob 633) |
| +1544 | 4 | DWORD | 0 | Tertiary budget limit (knob 632) |
| +1548 | 4 | float | 4.0 | Spill cost multiplier (knob 680) |
| +1552 | 4 | DWORD | -1 | Rollback sentinel |
| +1556 | 4 | DWORD | -1 | Max regs aligned: (budget + 4) & ~3 |
| +1560 | 4 | DWORD | -1 | Best result sentinel |
| +1564 | 4 | DWORD | 0 | Current max assignment (zeroed per allocation round) |
| +1568 | 8 | double | 0.0 | Total spill cost accumulator (zeroed per round) |
| +1576 | 4 | DWORD | 0 | Spill event counter (zeroed per round) |
| +1580 | 4 | DWORD | (dynamic) | Effective budget: max(budget, SMEM_min) |
| +1584 | 4 | DWORD | (dynamic) | Adjusted budget (from vtable+256 callback) |
Mode Flags (+1588 -- +1594)
Knob-derived boolean flags controlling allocation strategy. When the function has more than one basic block (sub_7DDB50 > 1), flags +1588, +1589, +1590 are all forced to 1.
| Offset | Size | Type | Init | Knob | Field |
|---|---|---|---|---|---|
| +1588 | 1 | BYTE | 0 | 682 | Epoch-aware allocation mode |
| +1589 | 1 | BYTE | 0 | 683 | Paired-register allocation mode |
| +1590 | 1 | BYTE | 0 | 619 | SMEM spill enable |
| +1591 | 1 | BYTE | 0 | 627 | Bank-aware allocation |
| +1592 | 1 | BYTE | 0 | -- | Spill status / has-spilled flag |
| +1593 | 1 | BYTE | 1 | 636 | Precolor reuse (default enabled) |
| +1594 | 1 | BYTE | 1 | 649 | ABI compatibility (default enabled; cleared for small kernels) |
Budget Pressure Model (+1600 -- +1744)
Occupancy-aware register budget interpolation. Computes a dynamic register budget based on thread occupancy, using knob-derived coefficients and a linear interpolation model. The slope at +1736 is (coeffB - coeffC) / (maxOccupancy - minOccupancy), enabling the allocator to trade register count for occupancy.
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Budget model: auxiliary pointer |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Budget model: occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob | Max threads per block (default 119) |
| +1628 | 4 | DWORD | 160 / knob | Pressure threshold (default 160) |
| +1632 | 8 | double | 0.2 | Interpolation coefficient A (knob-overridable) |
| +1640 | 8 | double | 1.0 | Interpolation coefficient B (knob-overridable) |
| +1648 | 8 | double | 0.3 | Interpolation coefficient C (knob-overridable) |
| +1656 | 8 | double | (computed) | Total threads as double |
| +1664 | 8 | double | = coeff A | Interpolation point [0] |
| +1672 | 8 | double | (computed) | Interpolation point [1]: max_threads as double |
| +1680 | 8 | double | = coeff A | Interpolation point [2] |
| +1688 | 8 | double | (computed) | Interpolation point [3]: threshold as double |
| +1696 | 8 | double | = coeff A | Interpolation point [4] |
| +1704 | 8 | double | (computed) | Interpolation point [5]: 255 minus vtable result |
| +1712 | 8 | double | = coeff B | Interpolation point [6] |
| +1720 | 8 | double | (computed) | Linear model: x_min (thread count) |
| +1728 | 8 | double | = coeff C | Linear model: y_min |
| +1736 | 8 | double | (computed) | Linear model: slope |
| +1744 | 8 | ptr | 0 | Budget model: tail sentinel |
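The slope formula implies a simple linear model; a sketch under the assumption that the interpolation is evaluated as y_min + slope * (occupancy - x_min) (the evaluation site itself is not recovered, only the stored coefficients and slope):

```python
def budget_slope(coeff_b, coeff_c, max_occ, min_occ):
    """(coeffB - coeffC) / (maxOccupancy - minOccupancy), as stored at +1736."""
    return (coeff_b - coeff_c) / (max_occ - min_occ)

def interpolate_budget(occ, min_occ, max_occ, coeff_b, coeff_c):
    """Linear model from the recovered fields: y = coeff_c at min_occ,
    rising toward coeff_b at max_occ."""
    slope = budget_slope(coeff_b, coeff_c, max_occ, min_occ)
    return coeff_c + slope * (occ - min_occ)

# With the default coefficients B = 1.0 and C = 0.3 over a normalized
# occupancy range, the model interpolates from 0.3 up to 1.0.
print(interpolate_budget(0.0, 0.0, 1.0, 1.0, 0.3))  # 0.3
print(interpolate_budget(1.0, 0.0, 1.0, 1.0, 0.3))
```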
Virtual Register Object Layout
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Next pointer (linked list) |
| +12 | 4 | Register class index |
| +20 | 1 | Flags byte (bit 0x20 = live) |
| +36 | 8 | Alias chain (coalesced parent) |
| +40 | 4 | Spill cost (float, accumulated) |
| +48 | 8 | Flags qword (see below) |
| +64 | 4 | Register type (1=GPR, 3=pred, 9=barrier) |
| +68 | 4 | Physical assignment (-1 = unassigned) |
| +72 | 1 | Size byte (0 = scalar) |
| +76 | 4 | Secondary spill cost (float) |
| +80 | 4 | Spill flag (0 = not spilled, 1 = spilled) |
| +104 | 8 | Use chain head |
| +112 | 8 | Def chain |
| +128 | 8 | Next in linked-register chain |
| +144 | 8 | Constraint list |
Flag bits at +48:
| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 14 | 0x4000 | Spill marker |
| 18 | 0x40000 | Needs-spill flag |
| 20--21 | -- | Pair mode (0=single, 1=lo-half, 3=double-width) |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair |
| 27 | 0x8000000 | Special handling flag |
Key Knobs
87 OCG knobs (indices 613--699) control register allocation heuristics. The complete catalog with sub-category grouping is in Knobs System -- Register Allocation Knobs. The most important ones:
| Knob | Name | Type | Role |
|---|---|---|---|
| 381 | (not yet decoded) | -- | HoistInvariants policy: 0=always, 1=inner loops, 3=never |
| 487 | NumOptPhasesBudget | BDGT | Budget counter that gates ConvertMemoryToRegisterOrUniform |
| 618 | RegAllocCoalescing | bool | Enables register coalescing in the allocator |
| 623 | RegAllocEstimatedLoopIterations | STR | Loop iteration estimate hint for spill cost weighting |
| 628 | RegAllocMacForce | bool | Forces MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 638 | RegAllocNoRetargetPrefs | bool | Disables retarget-preference optimization |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference discard threshold. Default 50 |
| 934 | UseNewLoopInvariantRoutineForHoisting | bool | Selects new LICM routine for HoistInvariants pre-pass |
Function Map
| Address | Lines | Role |
|---|---|---|
sub_8FFDE0 | 119 | HoistInvariants entry |
sub_905B50 | 1046 | Mem-to-reg candidate builder |
sub_910840 | 327 | ConvertMemoryToRegisterOrUniform entry |
sub_911030 | 2408 | Mem-to-reg analysis engine |
sub_914B40 | 1737 | Post-promotion rewrite |
sub_926A30 | 4005 | Fat-point interference builder |
sub_947150 | 1088 | Allocator state constructor (initializes 1752-byte object)
sub_939BD0 | 65 | Spill allocator setup |
sub_939CE0 | 23 | Register consumption counter |
sub_93D070 | 155 | Best result recorder |
sub_93ECB0 | 194 | Pre-assign registers |
sub_93FBE0 | 940 | Spill slot assignment |
sub_94A020 | 331 | Pre-allocation pass |
sub_94E620 | 617 | Spill cost accumulator |
sub_94F150 | 561 | Spill code generation |
sub_94FDD0 | 155 | Register assignment + alias propagation |
sub_950100 | 205 | Pre-allocated candidate applier |
sub_957160 | 1658 | Core fat-point allocator |
sub_9539C0 | 1873 | Shared-memory spill allocator |
sub_95A350 | 1390 | Cost / benefit evaluator |
sub_95BC90 | 1250 | Allocation retry / refinement |
sub_95DC10 | 2738 | Multi-class ABI-aware driver |
sub_9680F0 | 3722 | Per-instruction assignment core loop |
sub_96D940 | 2983 | Spill guidance (7-class priority queues) |
sub_971A90 | 355 | NOSPILL / SPILL retry driver |
sub_9721C0 | 1086 | Register allocation entry point |
sub_991790 | 2677 | Pre-coloring pass |
sub_9A5170 | 750 | Live range construction |
sub_9AD220 | 700 | Live range merge / coalescing engine |
sub_9AEF60 | 900 | Live range splitting |
sub_9B1200 | 800 | Register coalescing / copy elimination |
Fat-Point Allocation Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas register allocator uses a fat-point greedy algorithm. For each virtual register, it scans a per-physical-register pressure array, picks the slot with the lowest interference count, and commits the assignment. There is no graph coloring, no simplify-select-spill loop, and no worklist -- just two 512-DWORD pressure histograms and a linear scan. This page documents the algorithm in full detail: pressure array construction, constraint evaluation, register selection, assignment propagation, the retry loop, and the supporting knobs.
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Occupancy bitvector | sub_957020 -- resizes bitvector; sub_94C9E0 -- marks slot ranges |
| Interference builder | sub_926A30 (4005 lines) -- constraint solver |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Pre-allocation | sub_94A020 (331 lines) -- pre-assign high-priority operands |
| Retry driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry loop |
| Best result recorder | sub_93D070 (155 lines) -- compare and keep best attempt |
| Entry point | sub_9721C0 (1086 lines) -- per-function allocation driver |
Pressure Array Construction
The core allocator (sub_957160) allocates two stack-local arrays at the start of each allocation round. Each array is 2056 bytes: 512 DWORDs (2048 bytes) of pressure data plus a 2-DWORD sentinel.
| Array | Variable | Role |
|---|---|---|
| Primary | v12 | Per-physical-register interference count. Lower is better. |
| Secondary | v225 | Per-physical-register secondary cost. Breaks ties when primary values are equal. |
Both arrays are zeroed using SSE2 vectorized _mm_store_si128 loops aligned to 16-byte boundaries. Each iteration stores 128 bits (4 DWORDs), so the 512-DWORD array is cleared in 128 iterations.
For each virtual register in the allocation worklist (linked list at alloc+744), the allocator zeroes the pressure arrays and then walks the VR's constraint list (vreg+144). For each constraint, it increments the appropriate pressure array entries at the physical register slots that conflict with the current virtual register. The result is a histogram: primary[slot] holds the total interference weight for physical register slot, accumulated over all constraints of all previously-assigned virtual registers that conflict with the current one. The full per-VR algorithm is documented in the Pressure Computation Algorithm section below.
The secondary array accumulates a separate cost metric used for tie-breaking. It captures weaker interference signals -- preferences and soft constraints that do not represent hard conflicts but indicate suboptimal placement.
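The two-array selection can be sketched as a lexicographic minimum (array and function names here are assumptions; what is recovered is that primary is compared first and secondary consulted only on ties):

```python
def select_slot(primary, secondary):
    """Pick the slot with the lowest primary pressure; on a primary tie,
    the lower secondary cost wins (secondary is a pure tie-breaker)."""
    return min(range(len(primary)), key=lambda s: (primary[s], secondary[s]))

# Slots 1 and 3 tie on primary pressure (2); the secondary array decides.
print(select_slot([9, 2, 7, 2], [0, 6, 0, 1]))  # 3
```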
Budget Computation
Before the pressure scan begins, the allocator computes the maximum physical register count for the current class:
v231 = hardware_limit + 7              // alloc+756, with headroom
if allocation_mode == 6 (CSSA paired):
    v231 *= 4                          // quad range for paired allocation
elif allocation_mode == 3 (CSSA):
    v231 *= 2                          // doubled range
alloc.budget = v231                    // stored at alloc+60
The hardware limit comes from the target descriptor and reflects the physical register file size for the current class (e.g. 255 for GPRs, 7 for predicates). The +7 headroom allows the allocator to explore slightly beyond the architectural limit before triggering a hard failure -- this is clamped during assignment by the register budget check in sub_94FDD0.
The register budget at alloc+1524 interacts with --maxrregcount and --register-usage-level (values 0--10). The CLI-specified maximum register count is stored in the compilation context and propagated to the allocator as the hard ceiling. The register-usage-level option modulates the target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class register budget stored at alloc+32*class+884 reflects this interaction.
Occupancy Bitvector
After computing the budget, the allocator initializes an occupancy bitvector (sub_957020 + sub_94C9E0) that tracks which physical register slots are already assigned. The bitvector is sized to ceil(budget / 64) 64-bit words. For each VR being allocated, sub_94C9E0 sets bits covering the VR's footprint in the bitvector using a word-level OR with computed masks:
function mark_occupancy(bitvec, start_bit, end_bit):
    // Set every bit in [start_bit, end_bit) using whole-word ORs.
    lo_word = start_bit >> 6
    hi_word = (end_bit - 1) >> 6
    lo_mask = 0xFFFFFFFFFFFFFFFF << (start_bit & 63)            // clears bits below start
    hi_mask = 0xFFFFFFFFFFFFFFFF >> (63 - ((end_bit - 1) & 63)) // clears bits above end
    for word_idx in lo_word .. hi_word:
        mask = 0xFFFFFFFFFFFFFFFF
        if word_idx == lo_word: mask &= lo_mask
        if word_idx == hi_word: mask &= hi_mask
        bitvec[word_idx] |= mask
During the fatpoint scan, a set bit means "slot occupied -- skip it." This prevents the allocator from considering slots already committed to other VRs in the current round.
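An executable check of this word-level marking (a clean Python stand-in for the recovered behavior, with the bitvector as a list of 64-bit words, rather than a literal transcription of the decompiled masks):

```python
def mark_range(bitvec, start, end):
    """Set bits [start, end) using whole-word ORs with edge masks."""
    full = (1 << 64) - 1
    lo_word, hi_word = start >> 6, (end - 1) >> 6
    for w in range(lo_word, hi_word + 1):
        mask = full
        if w == lo_word:
            mask &= (full << (start & 63)) & full    # clear bits below start
        if w == hi_word:
            mask &= full >> (63 - ((end - 1) & 63))  # clear bits above end-1
        bitvec[w] |= mask

bv = [0, 0, 0]
mark_range(bv, 60, 70)  # footprint spans the word boundary at bit 64
print(hex(bv[0]), hex(bv[1]))
```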
Pressure Computation Algorithm
The per-VR pressure computation is the core of the fat-point allocator. For each unassigned virtual register, the allocator builds a fresh pressure histogram, selects the minimum-cost physical register slot, and commits the assignment. The algorithm has seven steps, all executed inside the main loop of sub_957160 (lines 493--1590 of the decompiled output).
Step 1: VR Geometry
For each VR, the allocator computes the physical register footprint via sub_7DAFD0:
function aligned_width(vreg):
stride = 1 << vreg.alignment // vreg+72, uint8
size = vreg.width // vreg+74, uint16
return (-stride) & (stride + size - 1) // = ceil(size / stride) * stride
| stride | size | result | Meaning |
|---|---|---|---|
| 1 | 1 | 1 | Single register |
| 1 | 2 | 2 | Unaligned pair |
| 2 | 2 | 2 | Aligned pair |
| 2 | 3 | 4 | Aligned quad (rounded up) |
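The expression (-stride) & (stride + size - 1) is the classic round-up-to-a-multiple idiom; it reproduces the table directly (Python's arbitrary-precision integers handle the negative mask without explicit truncation):

```python
def aligned_width(stride, size):
    """ceil(size / stride) * stride, via the recovered mask expression."""
    return (-stride) & (stride + size - 1)

for stride, size in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    print(stride, size, aligned_width(stride, size))  # results: 1, 2, 2, 4
```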
The pair mode is extracted from (vreg+48 >> 20) & 3:
| pair_mode | step_size | Behavior |
|---|---|---|
| 0 | stride | Normal single-width scan |
| 1 | stride | Paired mode -- is_paired = 1, scan ceiling doubled |
| 3 | 2 * stride | Double-width -- aligned width doubled, step by 2x stride |
Step 2: Scan Range
The scan range defines which physical register slots are candidates:
function compute_scan_range(alloc, vreg):
    max_slot = alloc.budget                        // alloc+1524
    if vreg.flags & 0x400000:                      // per-class ceiling override
        class_limit = alloc.class_limits[alloc.current_class]
        if max_slot > class_limit:
            max_slot = class_limit
    ceiling = ((max_slot + 1) << is_paired) - 1
    start = vtable[320](alloc, vreg, bitvec)       // class-specific start offset
    alignment = alloc.scan_alignment << is_paired  // alloc+1556
    scan_width = alloc.slot_count + 4              // alloc+1584 + 4
    return (start, ceiling, scan_width, alignment)
The +4 on scan_width provides padding beyond the register file limit. For pair modes, the ceiling is shifted left: double for is_paired, quad for pair_mode 3 with the 0x40 flag at ctx+1369.
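The ceiling arithmetic from Step 2, isolated for checking (assumption: is_paired behaves as a 0/1 shift amount, as the pseudocode suggests):

```python
def scan_ceiling(max_slot, is_paired):
    """((max_slot + 1) << is_paired) - 1: doubles the slot range for paired
    allocation so hi/lo halves index distinct pressure-array entries."""
    return ((max_slot + 1) << is_paired) - 1

print(scan_ceiling(254, 0))  # 254  (plain GPR scan)
print(scan_ceiling(254, 1))  # 509  (paired: doubled range)
```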
Step 3: Zero Pressure Arrays
Before accumulating interference for this VR, both arrays are zeroed over scan_width DWORDs:
function zero_pressure(primary[], secondary[], scan_width):
    if scan_width > 14 and arrays_dont_overlap:
        // SSE2 vectorized path: zero 4 DWORDs per iteration
        for i in 0 .. scan_width/4:
            _mm_store_si128(&primary[4*i], zero_128)
            _mm_store_si128(&secondary[4*i], zero_128)
        // scalar cleanup for remainder
    else:
        // scalar path
        for i in 0 .. scan_width:
            primary[i] = 0
            secondary[i] = 0
The SSE2 path has a non-overlap guard (secondary >= primary + 16 || primary >= secondary + 16) to ensure the vectorized stores do not alias. The scalar path is used for narrow scan ranges (width <= 14).
Step 4: Constraint Walk
The allocator iterates the constraint list at vreg+144. For VRs with alias chains (coalesced registers via vreg+32), the walk processes constraints for the entire chain, accumulating pressure from all aliases into the same arrays. Each constraint node is a 24-byte structure:
| Offset | Type | Field |
|---|---|---|
| +0 | pointer | Next constraint (linked list) |
| +8 | int32 | Constraint type (0--15) |
| +12 | int32 | Target VR index or physical register |
| +16 | int32 | Weight (interference cost) |
| +20 | uint8 | Soft flag (skip in later iterations) |
The constraint type dispatches to different accumulation patterns:
function accumulate_pressure(primary[], secondary[], constraint_list, scan_width,
                             base_offset, pair_mode, half_width_mode, iteration):
    soft_count = 0
    for node in constraint_list:
        // --- Soft constraint relaxation (iteration > 0) ---
        if node.soft_flag and iteration > 0:
            soft_count++
            skip_threshold = iteration * knob_weight   // OCG knob at +46232
            if soft_count <= skip_threshold:
                continue                               // relax this constraint
            if bank_aware and soft_count > relaxation_ceiling:
                continue
        type = node.type
        target = node.target
        weight = node.weight
        switch type:
            case 0:   // Point interference
                phys = lookup_vreg(target).physical_reg
                if phys < 0: continue                  // target unassigned
                offset = phys - base_offset
                if offset < 0 or offset >= scan_width: continue
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                primary[offset] += weight
            case 1:   // Exclude-one
                if half_width_mode:
                    offset = 2 * offset + hi_half_bit(target)
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight
            case 2:   // Exclude-all-but (target is the only allowed slot)
                for slot in 0 .. scan_width:
                    if slot != target:
                        primary[slot] += weight
            case 3:   // Below-point: penalize every slot below target
                for slot in 0 .. scan_width:
                    if target > slot:
                        primary[slot] += weight
            case 5:   // Paired-low (even slots only)
                primary[offset] += weight
                // For pair_mode 3: also primary[offset+1] += weight
            case 6:   // Paired-high (odd slots only)
                primary[offset + 1] += weight
            case 7:   // Aligned-pair (both halves)
                primary[offset] += weight
                primary[offset + 1] += weight
            case 8:   // Phi-related (parity-strided accumulation)
                parity = compute_phi_parity(target, vreg)
                for slot in parity .. scan_width step 2:
                    primary[slot] += weight
            case 11:  // Paired-even-parity
                for slot in 0 .. scan_width:
                    if slot != offset:
                        primary[slot] += weight
            case 12:  // Paired-odd-parity
                inverse = compute_odd_inverse(offset, pair_mode)
                for slot in 0 .. scan_width:
                    if slot != inverse:
                        primary[slot] += weight
            case 13:  // Paired-parity-group (even-only exclusion)
                if offset & 1: continue
                for slot in 0 .. scan_width:
                    if slot != offset + 1:
                        primary[slot] += weight
            case 14:  // Paired-parity-extended (odd-only exclusion)
                if !(offset & 1): continue
                for slot in 0 .. scan_width:
                    if slot != offset - 1:
                        primary[slot] += weight
            case 15:  // Range (SECONDARY array)
                range_end = min(offset, scan_width)
                // SSE2 vectorized: broadcast weight, add to secondary[0..range_end]
                for slot in 0 .. range_end:            // vectorized
                    secondary[slot] += weight
                // Tail: slots beyond (range_end + pair_width)
                for slot in range_end .. scan_width:
                    if slot >= offset + pair_width:
                        secondary[slot] += weight
            default:  // Types 4, 9, 10 and custom extensions
                vtable[240](alloc, primary, 514, node, scan_width, offset, pair_flag)
Type 15 (range) is the only constraint type that writes to the secondary array. All others write to primary. This is the architectural decision that makes secondary a pure tie-breaker: it captures long-range preference signals while primary captures hard interference.
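The accumulation walk above can be modeled compactly. The following is a minimal C sketch covering three representative constraint types (0, 2, and 15); the node layout and the reduction to three cases are simplified for illustration, and for the point case `target` is treated directly as a physical slot rather than going through a VR lookup.

```c
#include <assert.h>

/* Simplified model of the pressure-accumulation walk for constraint
 * types 0 (point), 2 (exclude-all-but), and 15 (range). The binary's
 * 24-byte node and full 16-type switch are described in the text. */
enum { C_POINT = 0, C_EXCLUDE_ALL_BUT = 2, C_RANGE = 15 };

typedef struct constraint {
    struct constraint *next;
    int type;
    int target;
    int weight;
} constraint;

static void accumulate(int *primary, int *secondary, int scan_width,
                       int base_offset, const constraint *list)
{
    for (const constraint *n = list; n != 0; n = n->next) {
        int offset = n->target - base_offset;
        switch (n->type) {
        case C_POINT:            /* cost only on the conflicting slot */
            if (offset >= 0 && offset < scan_width)
                primary[offset] += n->weight;
            break;
        case C_EXCLUDE_ALL_BUT:  /* penalize every slot except target */
            for (int s = 0; s < scan_width; s++)
                if (s != n->target)
                    primary[s] += n->weight;
            break;
        case C_RANGE:            /* long-range preference: secondary only */
            for (int s = 0; s < offset && s < scan_width; s++)
                secondary[s] += n->weight;
            break;
        }
    }
}
```

Note how the range case is the only one touching `secondary`, mirroring the primary/secondary split described above.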
SSE2 Vectorization in Constraint Walk
Three inner loops use SSE2 intrinsics:
- Type 0 (point) with large width: `_mm_add_epi32` adds the broadcast weight to 4 primary slots per iteration. An alignment pre-loop handles the first 1--3 slots to reach 16-byte alignment.
- Type 15 (range) secondary accumulation: `_mm_shuffle_epi32(_mm_cvtsi32_si128(weight), 0)` broadcasts the weight to all 4 lanes. The vectorized loop processes 4 secondary slots per iteration with `_mm_add_epi32(_mm_load_si128(...), broadcast)`.
- Type 8 (phi) stride-2 accumulation: uses `_mm_shuffle_ps` with mask 136 (0b10001000) to extract every other element, then `_mm_add_epi32` to accumulate. This implements stride-2 addition across the primary array.
Step 5: Iteration-Dependent Constraint Relaxation
On retry iterations (iteration > 0), the allocator progressively relaxes soft constraints to reduce pressure:
function should_skip_soft_constraint(soft_count, iteration, knob_weight,
knob_ceiling, bank_aware):
threshold = iteration * knob_weight // more skipped each retry
if soft_count <= threshold:
return true // skip (relax)
    if !bank_aware:
        return false // ceiling applies only in bank-aware mode; keep constraint
if soft_count > (total_soft - iteration * knob_ceiling):
return true // beyond ceiling
return false
The relaxation formula means: on iteration N, the first N * knob_weight soft constraints are ignored. The knob_ceiling parameter (OCG knob at offset +46304) controls how aggressively the tail is also relaxed. This trades bank-conflict quality for register pressure reduction, allowing the allocator to find assignments that fit within the budget on later retries.
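The predicate can be written as a small standalone function. This is a reconstruction of the relaxation formula as stated above (first `N * knob_weight` soft constraints skipped on iteration N, tail relaxation only in bank-aware mode); parameter names are ours.

```c
#include <assert.h>

/* Should the soft constraint numbered `soft_count` be relaxed (skipped)
 * on this retry iteration? */
static int should_skip_soft(int soft_count, int iteration, int knob_weight,
                            int knob_ceiling, int total_soft, int bank_aware)
{
    if (iteration == 0)
        return 0;                                /* first pass keeps all */
    if (soft_count <= iteration * knob_weight)
        return 1;                                /* head of list relaxed */
    if (bank_aware && soft_count > total_soft - iteration * knob_ceiling)
        return 1;                                /* tail beyond ceiling */
    return 0;
}
```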
Step 6: Fatpoint Selection (Minimum Scan)
After pressure accumulation, the allocator scans for the physical register slot with the lowest cost:
function select_fatpoint(primary[], secondary[], start, ceiling, budget,
step_size, shift, occupancy_bv, threshold,
first_pass, bank_mode, bank_mask, prev_assignment):
best_slot = start
best_primary = 0
best_secondary = 0
// --- Pre-scan threshold check (first pass only) ---
if first_pass:
for slot in start .. ceiling step step_size:
if occupancy_bv[slot]: continue // already occupied
if primary[slot >> shift] > threshold: // knob 684, default 50
first_pass = false // congestion detected
break
// --- Main scan ---
slot = start
while slot < budget and slot < alloc.slot_count << is_paired:
// Occupancy filter
if slot < bv_size and occupancy_bv[slot]:
slot += step_size; continue
p = primary[slot >> shift]
s = secondary[slot >> shift]
if prev_assignment >= 0: // not first VR
if first_pass:
if s >= best_secondary:
slot += step_size; continue // secondary-only comparison
else:
if p > best_primary:
slot += step_size; continue
if p == best_primary and s >= best_secondary:
slot += step_size; continue
// Bank conflict filter
if bank_mode and ((slot ^ prev_assignment) & bank_mask) == 0:
slot += step_size; continue // same bank → skip
// Ceiling check
if slot > ceiling:
best_slot = slot; break // accept (over ceiling)
// Accept this slot
if slot < scan_width_shifted:
best_secondary = secondary[slot >> shift]
best_primary = primary[slot >> shift]
if best_secondary == 0 and (best_primary == 0 or first_pass):
best_slot = slot; break // zero-cost → immediate accept
best_slot = slot
slot += step_size; continue
best_slot = slot
best_primary = 0
break
return best_slot
Key design decisions in the fatpoint scan:
Two-mode comparison. On the first pass (first_pass = true, iteration 0), the scan uses secondary cost as the sole criterion, ignoring primary. This makes the first attempt pure-affinity-driven: it places VRs at their preferred locations based on copy/phi hints in the secondary array. On subsequent passes, primary cost dominates and secondary breaks ties.
Immediate zero-cost accept. When a slot has both primary == 0 and secondary == 0 (or just primary == 0 on first pass), the scan terminates immediately. This means the first zero-interference slot wins -- no further searching. Combined with the priority ordering of VRs, this produces a fast, greedy assignment.
Bank-conflict avoidance. The bank mask (-8 for pair mode 1, -4 otherwise) partitions the register file into banks. The filter ((slot ^ prev_assignment) & mask) == 0 ensures consecutive assignments land in different banks, reducing bank conflicts in the SASS execution units.
Occupancy bitvector filtering. The bitvector provides O(1) per-slot filtering of already-assigned registers. Bits are set by sub_94C9E0 for each committed assignment, preventing the scan from considering occupied slots.
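The bank-conflict filter reduces to a single XOR-and-mask test. A tiny self-contained check (the helper name is ours): with mask -8 (pair mode 1) or -4 (otherwise), two slots count as the same bank exactly when they fall in the same 8- or 4-aligned group.

```c
#include <assert.h>

/* Nonzero when `slot` lands in the same bank as the previous assignment. */
static int same_bank(int slot, int prev_assignment, int mask)
{
    return ((slot ^ prev_assignment) & mask) == 0;
}
```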
Step 7: Commit and Advance
The selected slot is committed via sub_94FDD0:
alloc.cumulative_pressure += best_primary // alloc+788
sub_94FDD0(alloc, ctx, iteration, vreg, &local_state, best_slot, best_primary)
vreg = vreg.next // vreg+128, advance worklist
The cumulative pressure counter at alloc+788 tracks the total interference weight across all VR assignments in this attempt. The retry driver uses this to compare attempts.
End-of-Round Result
After all VRs are processed, the allocator computes the result (lines 1594--1641):
peak_usage = alloc+1580 // max physical register used
class_slot = ctx+1584 + 4 * mode + 384
*class_slot = peak_usage
if peak_usage > 0x989677 + 6: // sanity threshold (~10M)
emit_error("Register allocation failed with register count of '%d'."
" Compile the program with a higher register target",
alloc+1524 + 1)
return peak_usage + 1 // number of registers used
The return value feeds into the retry driver's comparison: target >= result means success (the allocation fits within the register budget).
Constraint Types
The fat-point interference builder (sub_926A30, 4005 lines) processes constraints attached to each virtual register. Constraints are extracted from instruction operand descriptors encoded as 32-bit values: bits 28--30 encode the operand type, bits 0--23 encode the register index, bit 24 is the pair extension bit, and bit 31 is a sign/direction flag.
The builder recognizes 15 constraint types. Each constraint type adds interference weight to specific physical register slots in the pressure arrays:
| Type | Name | Pressure effect |
|---|---|---|
| 0 | Point interference | Adds weight to specific physical register slots that are live at the same program point as this VR. The most common constraint -- represents a simple "these two VRs cannot share a physical register because both are live at instruction I." |
| 1 | Exclude-one | Adds weight to exactly one physical register slot, excluding it from consideration. Used when a specific physical register is reserved (e.g. for ABI constraints or hardware requirements). |
| 2 | Exclude-all-but | Adds weight to all slots except one. Forces the VR into a single permitted physical register. Used for fixed-register operands (e.g. R0 for return values). |
| 3 | Below-point | Adds interference weight for registers live below (after) the current program point. Captures downward-exposed liveness -- the VR must avoid physical registers that are used by later instructions. |
| 4 | (reserved) | Not observed in common paths. |
| 5 | Paired-low | Constrains the VR to an even-numbered physical register. Used for the low half of a 64-bit register pair. The pressure builder increments only even-indexed slots. |
| 6 | Paired-high | Constrains the VR to an odd-numbered physical register (the slot immediately after its paired-low partner). Increments only odd-indexed slots. |
| 7 | Aligned-pair | Constrains a pair of VRs to consecutive even/odd physical registers simultaneously. Combines the effects of types 5 and 6. |
| 8 | Phi-related | Marks interference from CSSA phi instructions (opcode 195). Phi constraints are softer -- they add lower weight because the phi can potentially be eliminated by the coalescing pass. |
| 9 | (reserved) | Not observed in common paths. |
| 10 | (reserved) | Not observed in common paths. |
| 11 | Paired-even-parity | Constrains the VR to a physical register whose index has even parity with respect to a bank partition. Used for bank-conflict avoidance on architectures where register bank is determined by reg_index % N. |
| 12 | Paired-odd-parity | Constrains to odd parity within the bank partition. |
| 13 | Paired-parity-group | Constrains a group of VRs to compatible parity assignments across a bank. |
| 14 | Paired-parity-extended | Extended variant of parity constraints for wider register groups (quads). |
| 15 | Range | Adds interference over an interval of program points rather than a single point. Represents a VR whose live range spans multiple instructions and conflicts with another VR whose live range overlaps. The weight is proportional to the overlap length. |
The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops (_mm_add_epi64) for bulk interference weight accumulation when processing large constraint lists. The builder dispatches through 7+ vtable entries for OCG knob queries that modulate constraint weights.
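The constants recovered from the binary are the standard 32-bit FNV-1a parameters. For reference, the hash as it would be computed:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Standard 32-bit FNV-1a: offset basis 0x811C9DC5, prime 16777619. */
static uint32_t fnv1a32(const unsigned char *data, size_t len)
{
    uint32_t h = 0x811C9DC5u;      /* offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];              /* xor first ... */
        h *= 16777619u;            /* ... then multiply (the "1a" order) */
    }
    return h;
}
```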
Constraint List Structure
Each virtual register carries a constraint list at vreg+144. The list is a linked chain of constraint nodes, each containing:
- Constraint type (one of the 15 types above)
- Target VR or physical register index
- Weight (integer, typically 1 for hard constraints, lower for soft)
- Program point or interval (for types 0, 3, 15)
- Pair/alignment specification (for types 5--7, 11--14)
The interference builder iterates this list for every VR being assigned, accumulating weights into the pressure arrays. The total cost of assignment to slot S is the sum of all constraint weights that map to S.
Register Selection
After the pressure arrays are populated for a given VR, the allocator scans physical register candidates and selects the one with minimum cost:
function select_register(primary[], secondary[], maxRegs, threshold, pair_mode):
best_slot = -1
best_primary_cost = MAX_INT
best_secondary_cost = MAX_INT
stride = 1
if pair_mode != 0:
stride = 2 << shift // 2 for pairs, 4 for quads
for slot in range(0, maxRegs, stride):
if primary[slot] > threshold: // knob 684, default 50
continue // skip congested slots
p = primary[slot]
s = secondary[slot]
if p < best_primary_cost:
best_slot = slot
best_primary_cost = p
best_secondary_cost = s
elif p == best_primary_cost and s < best_secondary_cost:
best_slot = slot
best_secondary_cost = s
return best_slot // -1 if nothing found
Key design decisions in the selection loop:
Threshold filtering. The interference threshold (OCG knob 684, default 50) acts as a congestion cutoff. Any physical register slot with total interference weight above this value is immediately skipped. This prevents the allocator from assigning a VR to a slot that would cause excessive register pressure, even if that slot happens to be the global minimum. The threshold trades a small increase in the number of spills for a significant improvement in allocation quality -- high-interference slots tend to require cascading reassignments.
Alignment stride. For paired registers (pair mode 1 or 3 in vreg+48 bits 20--21), the scan steps by 2 instead of 1, ensuring the VR lands on an even-numbered slot. For quad-width registers, the stride is 4. The shift amount comes from the register class descriptor and varies by allocation mode.
Two-level tie-breaking. When two candidates have equal primary cost, the secondary array breaks the tie. This provides a smooth gradient for the allocator to follow when the primary interference picture is flat. The secondary array typically captures weaker signals like register preference hints, pre-allocation suggestions, and copy-related affinities.
No backtracking. The selection is final once made. There is no local search, no Kempe-chain swapping, and no reassignment of previously-colored VRs. If the selection leads to a spill later, the retry loop (see below) handles it by rerunning the entire allocation with updated spill guidance.
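The threshold filter and the two-level tie-break can be rendered directly as executable C. This follows the `select_register` pseudocode above, with the stride passed in directly rather than derived from the pair-mode shift.

```c
#include <assert.h>
#include <limits.h>

/* Scan physical register slots, skip congested ones, and pick the
 * minimum (primary, secondary) pair lexicographically. */
static int select_register(const int *primary, const int *secondary,
                           int max_regs, int threshold, int stride)
{
    int best_slot = -1, best_p = INT_MAX, best_s = INT_MAX;
    for (int slot = 0; slot < max_regs; slot += stride) {
        int p = primary[slot];
        if (p > threshold)
            continue;                    /* congestion cutoff (knob 684) */
        int s = secondary[slot];
        if (p < best_p || (p == best_p && s < best_s)) {
            best_slot = slot;
            best_p = p;
            best_s = s;
        }
    }
    return best_slot;                    /* -1 if nothing found */
}
```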
Assignment: sub_94FDD0
Once a physical register slot is selected, sub_94FDD0 (155 lines) commits the assignment. This function handles four cases:
Case 1: Normal Assignment
The physical register number is written to vreg+68. The register consumption counter (sub_939CE0, 23 lines) computes how many physical slots this VR occupies:
consumption = slot + (1 << (pair_mode == 3)) - 1
For single registers, this is just slot. For double-width pairs (pair_mode 3), it is slot + 1, consuming two consecutive physical registers. The peak usage trackers at alloc+1528 and alloc+1564 are updated if consumption exceeds the current maximum.
Case 2: Predicate Half-Width
For predicate registers (class 2, type 3), the allocator performs a half-width division. The physical slot is divided by 2, and the odd/even bit is stored at vreg+48 bit 23 (the 0x800000 flag):
physical_reg = slot / 2
if slot is odd:
vreg.flags |= 0x800000 // hi-half of pair
else:
vreg.flags &= ~0x800000 // lo-half of pair
This maps two virtual predicate registers to one physical predicate register, since NVIDIA's predicate register file supports sub-register addressing (each physical predicate holds two 1-bit values).
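The half-width division is a divide-by-two plus a parity bit. A minimal sketch (the function name is ours; the flag bit matches the 0x800000 value given above):

```c
#include <assert.h>

/* Map a half-width predicate slot to a physical predicate register,
 * recording which half of the pair it occupies in bit 23 of `flags`. */
static int assign_predicate(int slot, unsigned *flags)
{
    if (slot & 1)
        *flags |= 0x800000u;     /* hi half of the pair */
    else
        *flags &= ~0x800000u;    /* lo half of the pair */
    return slot / 2;             /* physical predicate register */
}
```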
Case 3: Over-Budget / Spill Trigger
If slot >= regclass_info.max_regs and the VR is not already marked as spilled (flag 0x4000 at vreg+48), the allocator sets the needs-spill flag:
vreg.flags |= 0x40000 // needs-spill flag (bit 18)
When the needs-spill flag is later detected, the allocator calls:
- `sub_939BD0` -- spill allocator setup (selects bucket size, alignment, and maximum based on knob 623 and the cost threshold at `alloc+776`)
- `sub_94F150` -- spill code generation (561 lines, emits spill/reload instructions)
The spill cost is accumulated:
alloc.total_spill_cost += vreg.spill_cost // double at alloc+1568
alloc.secondary_cost += vreg.secondary_cost // float at alloc+1576
Case 4: Alias Chain Propagation
After writing the physical register, the function follows the alias chain at vreg+36 (coalesced parent pointer). Every VR in the chain receives the same physical assignment:
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot // alias+68
alias = alias.alias_parent // alias+36
This propagation ensures that coalesced registers (merged by the coalescing pass at sub_9B1200) share a single physical register without requiring the allocator to re-derive the relationship.
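The propagation loop above is a plain linked-list walk. Sketched in C, with struct fields modeling `vreg+68` (physical register) and `vreg+36` (alias parent); the names are ours:

```c
#include <assert.h>
#include <stddef.h>

typedef struct vreg {
    int physical_register;       /* models vreg+68 */
    struct vreg *alias_parent;   /* models vreg+36 */
} vreg;

/* Assign `slot` to a VR and every coalesced VR up its alias chain. */
static void propagate_alias_chain(vreg *v, int slot)
{
    v->physical_register = slot;
    for (vreg *a = v->alias_parent; a != NULL; a = a->alias_parent)
        a->physical_register = slot;
}
```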
Pre-Allocated Candidate Check
Before committing a normal assignment, sub_94FDD0 calls sub_950100 (205 lines) to check if the VR has a pre-allocated candidate in the hash table at alloc+248. If a candidate exists (FNV-1a keyed lookup), the pre-assigned physical register is used instead of the one selected by the pressure scan. For paired registers, the pre-assigned slot is doubled (type 1 -> slot * 2) to account for pair stride.
Pre-Allocation Pass: sub_94A020
Before the core allocator runs, the pre-allocation pass (331 lines) optionally assigns physical registers to high-priority operands. This pass is gated by three knobs:
| Knob | Role |
|---|---|
| 628 | Enable pre-allocation pass |
| 629 | Enable coalescing-aware pre-allocation |
| 618 | Enable uniform register pre-allocation |
When enabled and the allocation mode is 3, 5, or 6, the pass:
- Clears the pre-allocation candidate hash tables at `alloc+240..336` (six tables covering candidates, results, and overflow).
- Iterates basic blocks, calling `sub_9499E0` (per-block scanner, 304 lines) to identify pre-assignment opportunities.
- For each eligible instruction, calls `sub_93ECB0` (194 lines) to pre-assign operands.
sub_93ECB0 iterates instruction operands in reverse order (last to first). Operands must be type 1 (register) with an index that is not 41--44 (architectural predicates) or 39 (special). A switch on the masked opcode determines how many operands qualify: opcode 22 dispatches to sub_7E40E0, opcode 50 uses a lookup table, and opcodes 77/83/110--112/279/289/297/352 each have dedicated handlers. The function calls sub_93E9D0 with a priority level determined by OCG knob 646:
| Priority | Meaning |
|---|---|
| 1 | Pre-assign read operands only |
| 2 | Pre-assign write operands only |
| 3 | Pre-assign both read and write operands |
sub_93E9D0 (125 lines) creates a spill candidate node via sub_93E290 (allocates 192-byte structures from the arena freelist at alloc+232), marks the live range via sub_93DBD0 (356 lines), and recursively processes dependent operands via sub_93EC50.
Retry Loop: sub_971A90
The per-class allocation driver (355 lines) wraps the core allocator in a two-phase retry loop.
Phase 1: NOSPILL
The first attempt runs the core allocator without spill permission. The debug log emits:
"-CLASS NOSPILL REGALLOC: attemp N, used M, target T"
(Note: "attemp" is a typo present in the binary.)
The call sequence for each NOSPILL attempt:
sub_93FBE0(alloc, ctx, iteration) // reset state for attempt
if iteration == 0:
sub_956130(alloc, class) // build interference masks (first attempt only)
result = sub_957160(alloc, ctx, iteration) // core fat-point allocator
sub_93D070(&best, class, iteration, // record best result
result, pressure, alloc, cost)
The NOSPILL loop runs up to v102 attempts. Retry mode selection (from sub_971A90 lines 199--240):
| Condition | v102 (max attempts) | Behavior |
|---|---|---|
| Knob 638 enabled + special mode | 0 | No allocation at all |
| Knob 638 enabled, knob 639 set | knob 639 value | Custom iteration count |
| Knob 638 enabled, knob 639 unset | 1 | Single attempt |
| Knob 638 disabled, pressure low | 2 | Standard 2-attempt retry |
| Knob 638 disabled, pressure high | 0 | Skip to spill |
Exit conditions within the NOSPILL loop:
- `target >= adjusted_result`: allocation fits within budget (success)
- `target >= result`: no improvement possible between iterations (give up)
- The best-result recorder (`sub_93D070`) compares the current attempt against the best seen so far using a multi-criterion ranking: register count first, then cost (double at `best+56`), then spill count, then class width. It uses `128 / register_count` as an inverse density metric.
Phase 2: SPILL
If all NOSPILL attempts fail, the driver invokes spill guidance:
guidance = sub_96D940(ctx, guidance_array, attempt_no) // 2983 lines
The spill guidance function builds priority queues of spill candidates for each of the 7 register classes. Each guidance entry is an 11112-byte working structure containing 128-element bitmask arrays. The function contains 7 near-identical code blocks (one per class), likely unrolled from a C++ template.
After spill guidance, a final allocation attempt runs via sub_9714E0 (finalize/spill). If this also fails, sub_936FD0 (fallback allocation) makes a last-ditch effort. If that fails too, register assignments are cleared to -1 and the allocator reports:
"Register allocation failed with register count of '%d'.
Compile the program with a higher register target"
SMEM Spill Activation
For allocation modes 3 or 6 when the compilation target is device type 5, shared-memory spilling is activated before the retry loop:
if (class == 3 || class == 6) and device_type == 5:
if num_variables > 0:
sub_939BD0(alloc) // spill allocator setup
sub_94F150(alloc, ctx, 1) // spill codegen to SMEM
alloc.spill_triggered = 1 // flag at alloc+865
This path generates spill/reload instructions targeting shared memory instead of local memory, which is faster but limited in size and shared across the CTA.
Per-Class Iteration
The top-level entry point (sub_9721C0, 1086 lines) drives allocation for all register classes sequentially:
for class_id in 1..6:
if class_list[class_id] is empty:
continue
alloc.current_class = class_id // alloc+376
while sub_971A90(alloc, ctx, class_id) != 0:
sub_8E3A80(alloc+2) // arena cleanup between attempts
Classes 1--6 are initialized via the target descriptor vtable at offset +896. The vtable call vtable[896](alloc_state, class_id) populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class). The class IDs correspond to reg_type values (1 = R, 2 = R alt, 3 = UR, 4 = UR ext, 5 = P/UP, 6 = Tensor/Acc).
| Class ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |
Class 0 (unified/cross-class) is skipped in the main loop. It is used for cross-class constraint propagation during the interference building phase. Classes 3 (UR) and 6 (Tensor/Acc) have early-out conditions: if alloc+348 == 2 (class 3) or alloc+332 == 2 (class 6), allocation is skipped because no VRs of that class exist. Barrier registers (B, UB) have reg_type = 9, which is above the <= 6 allocator cutoff and are handled by a separate mechanism outside the 7-class system.
Before the per-class loop, virtual registers are distributed into class-specific linked lists (lines 520--549 of sub_9721C0):
for each vreg in function_vreg_list: // from ctx+104
if vreg.id in {41, 42, 43, 44}: // skip architectural predicates
continue
class = vreg.register_class // vreg+12
if class >= 1 and class <= 6 and vreg.type != 0:
insert(class_lists[class], vreg)
The VR list is sorted by priority (sub_9375C0) before distribution. Priority ordering ensures that VRs with more constraints and higher spill costs are allocated first, giving them first pick of the register file.
Fast Register Allocation: Knob 638
Knob 638 (register pressure analysis enable / fast allocation mode) controls a single-pass no-retry allocation path. When enabled with the special mode flag set, the allocator sets v102 = 0, meaning the NOSPILL retry loop body never executes. Allocation proceeds directly to spill handling without iterating.
When knob 638 is enabled without the special mode flag:
- The iteration count is set to 1 (or the value of knob 639 if set)
- This creates a limited-retry mode where the allocator makes at most `knob_639` attempts
- Each attempt still uses the full fat-point algorithm, but with no fallback to the multi-attempt guidance-driven loop
This mode is intended for fast compilation (--fast-compile) where compilation time matters more than register allocation quality. The allocator accepts the first viable assignment rather than searching for an optimal one.
Interference Builder: sub_926A30
The interference builder (4005 lines) is the largest single function in the allocator. It constructs the constraint lists that feed the pressure arrays. For each basic block and each instruction within it, the builder:
- Iterates instruction operands. Each operand is a 32-bit descriptor:
- Bits 27--25: operand type (1 = register, 6 = special, 7 = immediate)
- Bits 23--0: register/variable ID
- Bit 31: sign/direction flag
- Bit 24: pair extension bit
- For register operands (type 1), extracts the VR ID and looks up the VR object.
- Determines the constraint type based on the operand's role (def, use, or both), the instruction's properties, and the VR's pair mode.
- Creates a constraint node and appends it to the VR's constraint list.
- For paired registers (type 3 in the operand descriptor), generates two constraints: one for the low half and one for the high half (distinguished by bit 23).
- Uses SSE2 vectorized loops for bulk weight accumulation when processing large basic blocks with many live registers.
The builder queries multiple OCG knobs via vtable dispatches at offsets +72, +120, +152, +224, +256, +272, and +320. These knobs modulate constraint weights and enable/disable specific constraint categories (e.g. bank-conflict-aware constraints are gated by knob 641).
Special register IDs 41--44 (PT, P0--P3) and 39 are always skipped. The skip predicate (sub_9446D0, 29 lines) additionally checks for CSSA phi instructions (opcode 195 with type 9 = barrier) and performs hash-table lookups in the exclusion set at alloc+360.
Best Result Recorder: sub_93D070
The best-result recorder (155 lines) compares the current allocation result against the best seen across all retry attempts. It maintains state at offsets best[10..20]:
best[10] = register_count // best count so far
best[13] = 128 / register_count // inverse density metric
best[16] = max_pressure // peak live registers
best[17] = spill_score
*(double*)(best + 56) = cost // floating-point cost metric
best[18] = arch_peak_1 // from architecture state +408
best[20] = arch_peak_2 // from architecture state +400
Comparison uses lexicographic ordering:
- Lower register count wins
- On tie: lower cost (double) wins
- On tie: lower spill count wins
- On tie: lower class width wins
When the current attempt improves over the best, the recorder allocates a per-register assignment array and copies the full VR-to-physical-register mapping for later restoration.
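The four-level lexicographic ordering is easy to state as a standalone comparator. Field names here are ours; the offsets they model are given in the listing above.

```c
#include <assert.h>

typedef struct {
    int regs;       /* register count (best[10])       */
    double cost;    /* floating-point cost (best + 56) */
    int spills;     /* spill score (best[17])          */
    int width;      /* class width                     */
} attempt;

/* Nonzero when attempt a ranks strictly better than attempt b. */
static int better(const attempt *a, const attempt *b)
{
    if (a->regs != b->regs)     return a->regs < b->regs;
    if (a->cost != b->cost)     return a->cost < b->cost;
    if (a->spills != b->spills) return a->spills < b->spills;
    return a->width < b->width;
}
```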
Per-Instruction Assignment: sub_9680F0
The per-instruction assignment core loop (3722 lines, the largest function in part 2 of the allocator) handles the actual instruction-by-instruction walk during allocation:
- Iterates instructions via linked list (`v87 = *(_QWORD *)v87`)
- For each instruction, calls `sub_961A60` to attempt register assignment
- Tracks register pressure via the `v86` counter and 256-bit bitvectors at `alloc+1342..1350`
- Manages three bitvector masks per instruction: assigned, must-not-spill, and used
- Detects rematerialization opportunities (flag `v570`) and calls `sub_93AC90`
- Detects bank conflicts via `sub_9364B0` and resolves them
- Handles special opcodes: 187 / IMMA_16832 (`VZZN_16832`, behavioral: LOAD), 97 / STG (`FGT`, behavioral: STORE), 52 / BB boundary (behavioral: BRANCH), 236 / UBLKPF (`HOYXCS`, behavioral: CALL)
- Tracks first-spill-candidate (`alloc+1354`) and fallback-spill-candidate (`alloc+1355`)
- On allocation failure for an instruction, calls `sub_96CE90`, which recursively invokes `sub_9680F0` with different flags for the spill fallback path
Post-Allocation Verification
After register allocation completes, a verification pass called "memcheck" (NVIDIA's internal name, unrelated to Valgrind) compares reaching definitions before and after allocation. Every instruction's operands are checked: the set of definitions that could reach each use must be preserved, or any change must be explained by a known-safe transformation (spill/refill, rematerialization, predicate packing). Unexplained changes indicate an allocator bug.
The verification runs inside the post-regalloc scheduling pass (sub_A8B680), after all register assignments are finalized and spill/reload instructions have been inserted.
Verification Call Flow
sub_A8B680 (PostAllocPass::run)
+-- sub_A5B1C0 build pre/post def-use chains (48KB, all instructions)
+-- sub_A76030 MemcheckPass::run -- entry point
|
for each instruction in function:
| +-- sub_A56790 fast per-instruction check (returns bool)
| | true -> skip (defs match)
| | false -> deep verify
| |
| +-- sub_A54140 look up pre-regalloc def set
| +-- sub_A54140 look up post-regalloc def set
| +-- sub_A75CC0 deep single-instruction verification
| +-- sub_A56400 build Additions list (new defs)
| +-- sub_A56400 build Removals list (lost defs)
| +-- sub_A55D80 diagnostic reporter (10 error codes)
| +-- sub_A55D20 print uninitialized-def detail
|
printf("TOTAL MISMATCH %d MISMATCH ON OLD %d\n", ...)
Fast Check vs Deep Verify
The fast check (sub_A56790) performs a lightweight comparison per instruction. It returns true when pre-regalloc and post-regalloc reaching definitions match exactly. Only on failure does the verifier invoke the expensive deep path (sub_A75CC0), which:
- Builds two difference lists -- "Additions" (defs present after but not before) and "Removals" (defs present before but not after).
- Classifies each difference as either `BENIGN (explainable)` or `POTENTIAL PROBLEM` by pattern-matching against known-safe transformations: spill-store/refill-load pairs, P2R/R2P predicate packing, bit-spill patterns, and rematerialized instructions.
- For each unexplained difference, creates an error record with a category code (1--10), pointers to the offending instructions, and operand type flags.
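At its core, building the Additions and Removals lists is a two-way set difference over small definition-ID sets. A linear-scan sketch (helper names are ours; in the binary this is the role of sub_A56400):

```c
#include <assert.h>

static int contains(const int *set, int n, int v)
{
    for (int i = 0; i < n; i++)
        if (set[i] == v)
            return 1;
    return 0;
}

/* Write elements of a not present in b into out; return the count.
 * Additions = post \ pre; Removals = pre \ post. */
static int set_minus(const int *a, int na, const int *b, int nb, int *out)
{
    int n = 0;
    for (int i = 0; i < na; i++)
        if (!contains(b, nb, a[i]))
            out[n++] = a[i];
    return n;
}
```

Using the defs from the reporter example below ([42]/[38] before, [42]/[55] after), the additions come out as {55} and the removals as {38}.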
Error Categories
The reporter (sub_A55D80) dispatches on the error code at record+24:
| Code | Name | Message | Trigger |
|---|---|---|---|
| 1 | Spill-refill mismatch | Failed to find matching spill for refilling load that is involved in this operand computation | A post-regalloc refill load has no corresponding spill store. The verifier walks the spill-refill chain and cannot find the matching pair. |
| 2 | Refill reads uninitialized | This operand was fully defined before register allocation, however refill that is involved in this operand computation reads potentially uninitialized memory | The refill reads from a stack slot that was never written -- the spill store was optimized away or placed on a non-executing path. |
| 3 | P2R-R2P pattern failure | Failed to establish match for P2R-R2P pattern involved in this operand computation | Predicate-to-register / register-to-predicate instruction pairs used to spill predicate registers through GPRs have a broken chain -- the matching counterpart is missing. |
| 4 | Bit-spill-refill failure | Failed to establish match for bit-spill-refill pattern involved in this operand computation | The bit-packing variant of predicate spilling (multiple predicates packed into GPR bits) failed pattern matching. Same root cause as code 3 but for the packed representation. |
| 5 | Uninitialized value introduced | Before register allocation this operand was fully defined, now an uninitialized value can reach it | Pre-regalloc: all paths to this use had a definition. Post-regalloc: at least one path has no definition. The register holds a stale value from a prior computation. Typically caused by a spill placed on the wrong path or a definition eliminated during allocation. |
| 6 | Extra defs introduced | After reg-alloc there are some extra def(s) that participate in this operand computation. They were not used this way before the allocation. | The post-regalloc definition set is a strict superset of the pre-regalloc set. New definitions were introduced through register reuse or aliasing. When the extra def involves a short/byte type in a wider register, the reporter prints: Does this def potentially clobber upper bits of a register that is supposed to hold unsigned short or unsigned byte? and suggests -knob IgnorePotentialMixedSizeProblems. |
| 7 | Rematerialization mismatch | Rematerialization problem: Old instruction: [%d] New instruction: [%d] | A rematerialized instruction does not match the original. The new instruction created to recompute a value differs from the original in an unexpected way. |
| 8 | P2R-R2P base destroyed | Some instruction(s) are destroying the base of P2R-R2P pattern involved in this operand computation | The GPR holding packed predicate bits is overwritten between the P2R store and R2P restore by another instruction that defs the same physical register. |
| 9 | Bit-spill base destroyed | Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation | Same as code 8 but for the bit-packing spill variant. The base register holding packed predicate bits is overwritten between store and restore. |
| 10 | Definitions disappeared | During register allocation we did not add any new definitions for this operand and yet some original old ones disappeared | The post-regalloc definition set is a strict subset of the pre-regalloc set. Definitions were removed without replacement by spill/refill or rematerialization. |
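The subset/superset relations behind codes 5, 6, and 10 can be summarized as a set comparison. This is an illustrative Python sketch, not the binary's implementation; it assumes each operand carries its pre- and post-regalloc reaching-definition sets as plain sets of instruction ids.

```python
def classify_def_sets(pre_defs: set, post_defs: set) -> str:
    """Classify a pre/post reaching-definition mismatch along the lines
    of verifier categories 5/6 (superset) and 10 (subset). Sketch only."""
    if post_defs == pre_defs:
        return "ok"                       # no mismatch
    if post_defs > pre_defs:
        return "extra defs introduced"    # code 6: strict superset
    if post_defs < pre_defs:
        return "definitions disappeared"  # code 10: strict subset
    return "mixed additions and removals"

# A use with defs {42, 38} before allocation but {42, 55} after falls
# into the mixed case: one addition (55) and one removal (38), exactly
# the Additions/Removals lists the reporter prints.
```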
Reporter Output Format
When DUMPIR=AllocateRegisters is enabled (knob ID 266), the reporter (sub_A55D80) prints a structured diagnostic per mismatch:
=== ... (110 '=' banner) ===
REMATERIALIZATION PROBLEM. New Instruction [N] Old Instruction [M] // only if instruction changed
INSTRUCTION: [N]
=== ... ===
Operand # K
Producers for operand K of instruction [N] before register allocation:
[42] def opr # 0 for src opr # 2
[38] def opr # 1 for src opr # 2
Producers for operand # K of instruction [N] after register allocation:
[42] def opr # 0 for src opr # 2
[55] def opr # 0 for src opr # 2 // <-- new, from refill
Additions
[55] def opr # 0 src opr # 2 BENIGN (explainable)
Removals
[38] def opr # 1 src opr # 2 POTENTIAL PROBLEM
<error-category-specific message from the table above>
If DUMPIR=AllocateRegisters is not enabled and mismatches exist, the verifier prints a one-shot suggestion:
Please use -knob DUMPIR=AllocateRegisters for debugging
The one-shot flag at verifier+1234 ensures this message appears at most once per allocation attempt.
Mismatch Counters
The verifier tracks two counters reported at the end of the pass:
| Offset | Counter | Meaning |
|---|---|---|
verifier+1236 | Total mismatches | Instructions where post-regalloc defs differ from pre-regalloc defs in an unexplained way. |
verifier+1240 | Old mismatches | Subset of total mismatches that represent pre-existing issues -- the pre-regalloc def chain was already empty (no reaching definitions before allocation either). These are not regressions from the current allocation attempt. |
Knob Interactions
| Knob | Effect |
|---|---|
DUMPIR=AllocateRegisters (ID 266) | Enables verbose per-mismatch diagnostic output. Without this, only the summary line and suggestion are printed. |
IgnorePotentialMixedSizeProblems | Suppresses the mixed-size aliasing warning in error code 6 (extra defs involving short/byte types in wider registers). |
memcheck flag at function+1384 | Gates whether verification runs at all. When zero, sub_A76030 is not called. |
Verification Function Map
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_A54140 | -- | Def-use chain lookup (hash table query into pre/post maps) | HIGH |
sub_A55D20 | ~100B | Print uninitialized-def warning helper | HIGH |
sub_A55D80 | 1454B | Diagnostic reporter -- 10 error categories, structured output | HIGH |
sub_A56400 | -- | Build additions/removals lists for deep comparison | MEDIUM |
sub_A56560 | 698B | Verify single operand's reaching definitions | HIGH |
sub_A56790 | ~250B | Per-instruction fast check (returns bool pass/fail) | HIGH |
sub_A5B1C0 | 8802B | Full-function def-use chain builder (pre and post regalloc) | HIGH |
sub_A60B60 | 4560B | Pre/post chain comparison engine | HIGH |
sub_A62480 | -- | Reset scratch arrays between operand checks | MEDIUM |
sub_A75220 | 2640B | Compare reaching definitions (builds diff lists) | HIGH |
sub_A75CC0 | 866B | Deep single-instruction verifier (classifies diffs) | HIGH |
sub_A76030 | 770B | MemcheckPass::run -- verification entry point | HIGH |
Occupancy-Aware Budget Model
The allocator maintains a 144-byte budget pressure model at alloc+1600--alloc+1744 that adjusts the effective register budget based on thread occupancy. The model is initialized by sub_947150 (the allocator constructor) and consumed by the spill guidance function sub_96D940. The goal: kernels that need high occupancy get tighter register budgets (more aggressive spilling), while kernels that can tolerate low occupancy get relaxed budgets (more registers, fewer spills).
Coefficient Initialization
Three knob-overridable coefficients control the interpolation:
| Field | Offset | Knob | Knob Name | Type | Default | Meaning |
|---|---|---|---|---|---|---|
| coeffA | +1632 | 664 | RegAllocSpillBitLowRegScale | DBL | 0.2 | Scale at low register counts (piecewise default value) |
| coeffB | +1640 | 661 | RegAllocSpillBitHighRegScale | DBL | 1.0 | Scale at high register counts (linear model y_max) |
| coeffC | +1648 | 665 | RegAllocSpillBitMediumRegScale | DBL | 0.3 | Scale at medium register counts (linear model y_min) |
Two integer knobs set the tier boundaries:
| Field | Offset | Knob | Knob Name | Type | Default | Meaning |
|---|---|---|---|---|---|---|
| maxThreads | +1624 | 663 | RegAllocSpillBitLowRegCountHeur | INT | 119 | Low register count tier boundary |
| pressureThresh | +1628 | 660 | RegAllocSpillBitHighRegCountHeur | INT | 160 | High register count tier boundary |
All five knobs use the standard OCG type-check pattern: byte at knobArray + 72*index encodes 0 (use default), 1 (use INT value at +8), or 3 (use DBL value at +8).
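The type-check pattern can be sketched as follows. The 72-byte stride, tag values, and +8 payload offset come from the text above; the function name and buffer representation are hypothetical.

```python
import struct

KNOB_STRIDE = 72  # bytes per knob record in knobArray

def read_knob(knob_bytes: bytes, index: int, default):
    """Decode one OCG knob record: the tag byte selects the compiled-in
    default vs. an INT or DBL payload 8 bytes into the record (sketch)."""
    base = KNOB_STRIDE * index
    tag = knob_bytes[base]
    if tag == 0:
        return default                                            # knob not set
    if tag == 1:
        return struct.unpack_from("<i", knob_bytes, base + 8)[0]  # INT override
    if tag == 3:
        return struct.unpack_from("<d", knob_bytes, base + 8)[0]  # DBL override
    raise ValueError(f"unknown knob tag {tag}")
```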
Piecewise Interpolation Table
After reading the knobs, sub_947150 queries the target descriptor for the hardware maximum thread count (vtable slot 90, offset +720). This value is clamped to maxThreads - 1 if a knob override is active. The result becomes totalThreads -- the kernel's maximum achievable occupancy.
An optional override through the function-object vtable at context+1584 (vtable slot 118, offset +944) can adjust the architectural register limit. When the override is active, it calls override_fn(totalThreads, param, 255.0) and sets the adjusted limit to 255 - result. When inactive, the limit stays at 255.
The piecewise array stores 7 doubles -- alternating (value, x-boundary) entries ending with a final value -- that define a step function mapping register counts to scale factors:
interpTable[0] = coeffA interpTable[1] = maxThreads
interpTable[2] = coeffA interpTable[3] = pressureThresh
interpTable[4] = coeffA interpTable[5] = adjustedLimit // 255 or (255 - override)
interpTable[6] = coeffB
This means the budget scale is coeffA (0.2) for every register count up to the adjusted limit (255) -- the maxThreads (119) and pressureThresh (160) boundaries all carry the same coeffA value at default knob settings -- and jumps to coeffB (1.0) only beyond that limit. In practice the piecewise table establishes a two-tier system: a low scale for most of the register range, jumping to the high scale only at the top.
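A minimal sketch of the resulting step function, assuming "at or below a tier's x-boundary" selects that tier's value (the exact boundary semantics are not recovered):

```python
def budget_scale(reg_count: int, coeff_a: float = 0.2, coeff_b: float = 1.0,
                 max_threads: int = 119, pressure_thresh: int = 160,
                 adjusted_limit: int = 255) -> float:
    """Walk the 7-entry interpolation table: three tiers that all hold
    coeff_a at default knobs, then coeff_b past the adjusted limit."""
    table = [(coeff_a, max_threads),
             (coeff_a, pressure_thresh),
             (coeff_a, adjusted_limit)]
    for value, boundary in table:
        if reg_count <= boundary:
            return value
    return coeff_b  # interpTable[6]: final tier, no boundary
```

With non-default knobs the three tiers can diverge, which is presumably why the table keeps all three boundaries rather than collapsing to a single threshold.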
Linear Interpolation Model
A separate linear model provides continuous interpolation for the spill guidance decision. Two more vtable queries establish the domain:
x_min = target->getMaxOccupancy() // vtable slot 90 on target descriptor via context+1584
x_max = target->getMinOccupancy() // vtable slot 96 (offset +768) on function-object via context+1584
y_min = coeffC // 0.3
y_max = coeffB // 1.0
slope = (coeffB - coeffC) / (x_max - x_min)
The slope is stored at alloc+1736. Since x_max (minimum occupancy, meaning fewest concurrent threads = most registers allowed) is typically greater than x_min (maximum occupancy, meaning most concurrent threads = fewest registers), the slope is positive: as the function moves toward allowing more registers (fewer threads), the budget fraction increases.
Spill Guidance Consumption
The spill guidance function sub_96D940 (line 1520 in the decompiled output) uses the linear model to compute a dynamic spill threshold:
budget_fraction = (current_reg_count - x_min) * slope + y_min
spill_threshold = budget_fraction * (class_budget - class_floor + 1)
Where:
- `current_reg_count`: the current register allocation count for this class (from `alloc+884`, indexed by class)
- `class_budget` and `class_floor`: per-class allocation bounds at `alloc + 32*class + 884` and `alloc + 32*class + 880`
- For paired registers, `current_reg_count` is halved: `(count + 1) >> 1`
The comparison at line 1527:
if (spill_count + unspilled_need_spill + current_reg_count) > spill_threshold:
trigger_spill(sub_948B80)
If the total pressure (live registers needing spill + those already marked for spill + current allocation count) exceeds the occupancy-adjusted threshold, the allocator triggers a spill. The sub_948B80 call adds the current VR to the spill candidate queue.
Worked Example
For a Blackwell SM100 kernel with default knobs:
| Parameter | Value | Source |
|---|---|---|
| coeffA | 0.2 | Knob 664 default |
| coeffB | 1.0 | Knob 661 default |
| coeffC | 0.3 | Knob 665 default |
| maxOccupancy (x_min) | 240 | SM100 target vtable slot 90 |
| minOccupancy (x_max) | 480 | SM100 target vtable slot 96 |
| slope | (1.0 - 0.3) / (480 - 240) = 0.00292 | Computed |
If the current GPR class has a budget of 128 and floor of 0 (range = 129), and the function currently uses 300 registers:
budget_fraction = (300 - 240) * 0.00292 + 0.3 = 0.475
spill_threshold = 0.475 * 129 = 61.3
If more than 61 VRs are pending spill or already allocated, the allocator triggers a spill rather than attempting to fit another register. With fewer registers in play (say 250), the fraction drops to 0.329 and the threshold tightens to 42 -- more aggressive spilling at higher occupancy targets.
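The worked numbers above can be reproduced directly. This is a plain restatement of the model in Python; the helper name and default arguments are illustrative.

```python
coeff_b, coeff_c = 1.0, 0.3        # knob 661 / 665 defaults (y_max, y_min)
x_min, x_max = 240.0, 480.0        # SM100 occupancy bounds from the target vtable
slope = (coeff_b - coeff_c) / (x_max - x_min)   # 0.7 / 240 ~= 0.00292

def spill_threshold(current_reg_count: float,
                    class_budget: int = 128, class_floor: int = 0) -> float:
    """Occupancy-adjusted spill threshold for one register class."""
    budget_fraction = (current_reg_count - x_min) * slope + coeff_c
    return budget_fraction * (class_budget - class_floor + 1)

# 300 registers in play -> threshold ~61.3; 250 -> ~42.5 (tighter budget)
```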
Budget Model Field Summary
| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Auxiliary pointer (unused at init) |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob 663 | maxThreads (low reg count tier boundary) |
| +1628 | 4 | DWORD | 160 / knob 660 | pressureThresh (high reg count tier boundary) |
| +1632 | 8 | double | 0.2 / knob 664 | coeffA (low-register scale) |
| +1640 | 8 | double | 1.0 / knob 661 | coeffB (high-register scale / y_max) |
| +1648 | 8 | double | 0.3 / knob 665 | coeffC (medium-register scale / y_min) |
| +1656 | 8 | double | (computed) | totalThreads as double |
| +1664 | 8 | double | = coeffA | interpTable[0]: value for tier 0 |
| +1672 | 8 | double | = maxThreads | interpTable[1]: x-boundary for tier 0 |
| +1680 | 8 | double | = coeffA | interpTable[2]: value for tier 1 |
| +1688 | 8 | double | = pressureThresh | interpTable[3]: x-boundary for tier 1 |
| +1696 | 8 | double | = coeffA | interpTable[4]: value for tier 2 |
| +1704 | 8 | double | = adjustedLimit | interpTable[5]: x-boundary for tier 2 |
| +1712 | 8 | double | = coeffB | interpTable[6]: value for tier 3 (final) |
| +1720 | 8 | double | (computed) | x_min: max occupancy thread count |
| +1728 | 8 | double | = coeffC | y_min (linear model intercept at x_min) |
| +1736 | 8 | double | (computed) | slope: (coeffB - coeffC) / (minOcc - maxOcc) |
| +1744 | 8 | ptr | 0 | Tail sentinel |
Post-Init: sub_939BD0
Immediately after building the interpolation tables, sub_947150 calls sub_939BD0 which configures the spill guidance lookup strategy at alloc+1784. This function queries knob 623 (RegAllocEstimatedLoopIterations) through the function-object vtable:
- If knob 623 is set: the spill guidance uses the knob's value to estimate loop iteration counts, passed to the lookup strategy via vtable slot 3 (offset +24).
- If knob 623 is unset: the lookup strategy is initialized with default parameters. When the budget model's auxiliary weight at `alloc+776` is zero, the strategy uses `(8, 4, 0x100000)`; otherwise `(16, 16, 0x100000)`.
Function Map
| Address | Lines | Role | Confidence |
|---|---|---|---|
sub_926A30 | 4005 | Fat-point interference builder / constraint solver | HIGH |
sub_93D070 | 155 | Best result recorder (multi-criterion comparison) | HIGH |
sub_93E290 | 397 | Spill candidate node creator (192-byte arena alloc) | HIGH |
sub_93E9D0 | 125 | Pre-assign individual operand | HIGH |
sub_93ECB0 | 194 | Pre-assign registers (per-instruction dispatcher) | HIGH |
sub_93FBE0 | 940 | Per-iteration allocation state reset | HIGH |
sub_939BD0 | 63 | Spill guidance strategy initializer (knob 623 query) | HIGH |
sub_939CE0 | 23 | Register consumption counter (pair-aware) | HIGH |
sub_9446D0 | 29 | Register skip predicate (special regs, exclusion set) | HIGH |
sub_947150 | ~700 | Allocator constructor (budget model + interpolation init) | HIGH |
sub_94A020 | 331 | Pre-allocation pass (knobs 628/629/618) | HIGH |
sub_94FDD0 | 155 | Register assignment + alias propagation | HIGH |
sub_950100 | 205 | Pre-allocated candidate applier (FNV-1a lookup) | HIGH |
sub_956130 | 873 | Register class interference mask builder (SSE2) | HIGH |
sub_957020 | 24 | Occupancy bitvector resizer (arena-backed realloc + memset) | HIGH |
sub_94C9E0 | 59 | Occupancy bitmask range setter (word-level OR with masks) | HIGH |
sub_7DAFD0 | 7 | VR aligned width computation (ceil(size/stride)*stride) | CERTAIN |
sub_957160 | 1658 | Core fat-point allocator (coloring engine) | HIGH |
sub_9680F0 | 3722 | Per-instruction assignment core loop | HIGH |
sub_96D940 | 2983 | Spill guidance (7-class priority queues) | HIGH |
sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
sub_9714E0 | -- | Post-allocation finalization | MEDIUM |
sub_9721C0 | 1086 | Register allocation entry point | HIGH |
sub_936FD0 | -- | Final fallback allocation | MEDIUM |
sub_9375C0 | -- | VR priority sort | MEDIUM |
Spill Mechanism
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
When the fat-point register allocator cannot fit all simultaneously-live virtual registers into the hardware register budget, it spills excess values to memory and reloads them on demand. The spill subsystem is the second-largest component of the register allocator by code volume, spanning roughly 25 functions and 12,000+ lines of decompiled code. It implements a cost-driven, retry-based spill strategy with two memory targets (local memory and shared memory), a per-class priority queue guidance engine, and a multi-attempt allocation loop that progressively relaxes constraints until allocation succeeds or fails fatally.
| Spill trigger | sub_94FDD0 (155 lines) -- sets flag 0x40000 when assignment exceeds budget |
| Spill guidance | sub_96D940 (2983 lines) -- builds 7 priority queues of spill candidates |
| Spill codegen | sub_94F150 (561 lines) -- inserts spill stores and refill loads |
| LMEM setup | sub_939BD0 (65 lines) -- local memory slot allocator configuration |
| SMEM allocator | sub_9539C0 (1873 lines) -- shared memory spill alternative |
| Retry driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry loop |
| Finalization | sub_9714E0 (290 lines) -- commit spills, emit errors on failure |
| SASS codegen | sub_9850F0 (520 lines) -- generate LDL/STL instruction sequences |
| Key knobs | 623 (spill mode), 638/639 (retry limits), 684 (interference threshold) |
Spill Trigger
The spill trigger fires inside the per-virtual-register assignment function sub_94FDD0 (155 lines). When the fat-point allocator (sub_957160) selects a physical slot for a virtual register, it calls sub_94FDD0 to commit the assignment. If the chosen slot index equals or exceeds the per-class register budget, the function does not commit -- instead it marks the virtual register for spilling.
function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
max_regs = regclass_info.max_regs // at regclass_info+16
// Budget check
if slot >= max_regs and not vreg.has_flag(0x4000): // not already spilled
vreg.flags |= 0x40000 // set "needs-spill" bit
return // do NOT commit assignment
// Spill path: flag was set on a previous call
if vreg.has_flag(0x40000): // needs-spill
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, ctx, 1) // sub_94F150
return
// Non-spill path: commit the assignment
consumption = compute_consumption(vreg) // sub_939CE0
update_peak(alloc+1564, consumption)
update_peak(alloc+1528, consumption)
vreg.physical_register = slot // vreg+68
// Accumulate spill cost even for successful assignments
*(double*)(alloc+1568) += *(float*)(vreg+40) // store weight
*(float*)(alloc+1576) += load_weight // load weight
apply_preallocated_candidate(alloc, vreg) // sub_950100
// Propagate through alias chain
alias = vreg.alias_parent // vreg+36
while alias != NULL:
alias.physical_register = slot
alias = alias.alias_parent
The two flag bits at vreg+48 encode spill state:
| Bit | Mask | Meaning |
|---|---|---|
| 14 | 0x4000 | Already spilled -- prevents the same vreg from being spilled again |
| 18 | 0x40000 | Needs spill -- triggers spill codegen on the next sub_94FDD0 call |
Register consumption (sub_939CE0, 23 lines) accounts for paired registers. For double-width registers (pair mode 3 at vreg+48 bits 20--21), it returns assignment + 1, consuming two physical slots.
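The two flag bits and the pair-aware consumption rule can be sketched together. Masks and bit positions are taken from the tables above; the function names and the exact placement of the pair-mode field are assumptions.

```python
FLAG_SPILLED     = 0x4000    # bit 14 of vreg+48: already spilled
FLAG_NEEDS_SPILL = 0x40000   # bit 18 of vreg+48: spill codegen pending
PAIR_MODE_SHIFT  = 20        # pair mode occupies bits 20-21 (assumed layout)
PAIR_MODE_MASK   = 0x300000

def register_consumption(assignment: int, flags: int) -> int:
    """Pair mode 3 consumes two physical slots (sub_939CE0 sketch)."""
    pair_mode = (flags & PAIR_MODE_MASK) >> PAIR_MODE_SHIFT
    return assignment + 1 if pair_mode == 3 else assignment

def mark_for_spill(flags: int) -> int:
    """Set the needs-spill bit unless this vreg was already spilled."""
    if flags & FLAG_SPILLED:
        return flags                  # never spill the same vreg twice
    return flags | FLAG_NEEDS_SPILL
```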
Spill Retry Loop
The outer allocation driver sub_971A90 (355 lines) wraps the core allocator in a two-phase strategy: first attempt allocation without spilling, then retry with progressively more aggressive spill guidance.
function alloc_with_spill_retry(alloc, ctx, class_id):
// Phase 1: NOSPILL
pre_allocation_pass(alloc) // sub_94A020
per_class_driver(alloc, ctx, class_id) // sub_95DC10
result = fatpoint_allocate(alloc, ctx, attempt=0) // sub_957160
record_best_result(&best, class_id, 0, result) // sub_93D070
if alloc.target >= adjusted_result:
goto finalize // success
// Phase 2: SPILL retry
max_attempts = query_knob(638) // default varies
if knob_639_set:
max_attempts = query_knob(639) // override
for attempt in 0..max_attempts:
reset_alloc_state(alloc, ctx, attempt) // sub_93FBE0
if attempt == 0:
build_interference_masks(alloc, class_id) // sub_956130
result = fatpoint_allocate(alloc, ctx, attempt)
debug("-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d",
attempt, result, alloc.target)
record_best_result(&best, class_id, attempt, result)
if alloc.target >= adjusted_result:
break
if all_attempts_failed and no_spill_recorded:
result = final_fallback(&best, result) // sub_936FD0
// Finalize
status = finalize_allocation(alloc, result, class_id, &best) // sub_9714E0
if HIBYTE(status):
clear_all_assignments_to_minus_one() // allocation failed
else:
commit_results()
The debug string "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary) is printed for every attempt.
For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates spill setup before entering the retry loop:
if (class_id == 3 or class_id == 6) and device_type == 5:
if vreg_count > 0:
setup_spill_allocator(alloc) // sub_939BD0
generate_spill_code(alloc, ctx, 1) // sub_94F150
alloc.spill_done_flag = 1 // alloc+865
Spill Guidance Engine
sub_96D940 (2983 lines, 84 KB decompiled) computes which registers should be spilled and in what order. It is the largest spill-related function and one of the largest in the entire allocator.
Structure
The function contains 7 near-identical code blocks, one per register class (R, P, B, UR, UP, UB, tensor/accumulator). Each block is approximately 400 lines of bitvector iteration and set intersection. This repetition strongly suggests C++ template instantiation or macro expansion in the original source.
Spill Guidance Structure (11,112 bytes)
The guidance engine allocates a single 11,112-byte working structure from the arena (vtable +24). The structure is organized into five regions.
Region 0 -- Header and core pointers (bytes 0--271)
| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 0 | [0] | ptr | ctx | Back-pointer to allocation context |
| 24 | [3] | ptr | alloc+16 | Pointer into allocator state object |
| 32 | [4] | QWORD | 0 | Run counter / iteration state |
| 40 | [5] | QWORD | 0 | Class processing counter |
| 48 | [6] | QWORD | 0 | Spill mode flags |
| 96 | [12] | ptr | arena | Arena allocator pointer (from ctx+16) |
| 104 | [13] | ptr | arena | Queue header base / candidate list base |
| 112 | [14] | DWORD+DWORD | -1, 0 | Max element index sentinel, entry count |
| 136 | [17] | ptr | arena | Third arena reference |
| 144 | [18] | QWORD | 0 | Queue 0 entry count |
| 152 | [19] | QWORD | -1 | Queue 0 sentinel |
| 160 | [20] | ptr | arena | Fourth arena reference |
| 168 | [21] | QWORD | 0 | Queue 0b entry count |
| 176 | [22] | QWORD | -1 | Queue 0b sentinel |
| 184 | [23] | ptr | ctx | Back-pointer to context |
| 192 | [24] | ptr | node | Candidate node list head (24-byte arena node) |
| 200 | [25] | ptr | node | Candidate node list tail |
| 208 | [26] | QWORD | 0 | Node count |
| 216 | [27] | QWORD | 0 | Node capacity |
| 240 | [30] | ptr | node | Sentinel node (same as initial node at [24]) |
| 248 | [31] | QWORD | 0 | Free list head |
| 256 | [32] | QWORD | 0 | Free list count |
Region 1 -- Bitmask arrays (bytes 272--1327)
Two 508-byte bitmask arrays (127 DWORDs each), separated by single-byte sentinels:
| Byte range | Content |
|---|---|
| 284 | Sentinel byte (set to 0x80 after zeroing) |
| 288--795 | Bitmask array 0: 127 DWORDs for live range set intersection |
| 808 | Sentinel byte (set to 0x80 after zeroing) |
| 812--1319 | Bitmask array 1: 127 DWORDs for second class pair |
Each bitmask array is zeroed via an SSE2 vectorized loop (16 bytes per iteration, 0x1F iterations). The 0x80 sentinel byte at the start of each array marks initialization completion.
Region 2 -- Priority queue table blocks (bytes 1328--2063)
Five embedded priority queue tables, each containing an entry count (QWORD) followed by an array of 6 queue entries (24 bytes each):
| QWORD idx | Byte offset | Content |
|---|---|---|
| [166] | 1328 | Queue block 1 entry count (incremented by 7 per pass) |
| [167]--[184] | 1336--1479 | Queue block 1: 6 entries x 24 bytes |
| [188] | 1504 | Queue block 2 entry count |
| [189]--[206] | 1512--1655 | Queue block 2: 6 entries x 24 bytes |
| [210] | 1680 | Queue block 3 entry count |
| [211]--[228] | 1688--1831 | Queue block 3: 6 entries x 24 bytes |
| [232] | 1856 | Queue block 4 entry count |
| [233]--[250] | 1864--2007 | Queue block 4: 6 entries x 24 bytes |
| [256] | 2048 | Queue block 5 (overflow) count |
Each 24-byte queue entry has this layout:
| Entry offset | Type | Init | Field |
|---|---|---|---|
| +0 | ptr | arena | Bitvector storage pointer |
| +8 | QWORD | 0 | Bitvector data pointer |
| +16 | DWORD | -1 | Max element index |
| +20 | DWORD | 0 | Current element count |
Queue entries are built by sub_8BE190 and sorted by sub_7553C0. Candidates are inserted via sub_9370A0 (with tie-breaking) and removed via sub_9365A0 (bit-clear in bitvector).
Region 3 -- Candidate node management (bytes ~2064--10591)
The largest region (~8,528 bytes). Contains working storage for spill candidate evaluation across all 7 register classes. This region is zeroed during initialization and populated during the instruction walk phase by sub_93BF50 (candidate evaluation), sub_936610 (candidate insertion with cost), sub_9680F0 (cost propagation), and sub_93A1F0 (interference counting). The exact internal sub-layout varies by register class and virtual register count.
Region 4 -- Linked list, accumulators, and tail (bytes 10592--11111)
| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 10592 | [1324] | QWORD | 0 | Linked list head (spill candidate chain) |
| 10600 | [1325] | ptr | &self[1326] | Forward pointer (circular doubly-linked) |
| 10608 | [1326] | ptr | &self[1324] | Backward pointer |
| 10616 | [1327] | QWORD | 0 | List count |
| 10624 | [1328] | ptr | &self[1324] | Secondary forward pointer |
| 10632 | [1329] | ptr | &self[1326] | Secondary backward pointer |
| 10640 | -- | DWORD | 2 | Node type tag |
| 10648 | [1331] | ptr | node | Primary candidate node (24B, type=2) |
| 10656 | [1332] | ptr | node | Secondary candidate node (24B, type=2) |
| 10696 | [1337] | ptr | node | Secondary tail pointer |
| 10704 | [1338] | ptr | 0 | Instruction walk context (knob 622 gate) |
| 10712 | [1339] | QWORD | 0 | Walk state |
| 10720 | [1340] | QWORD | 0 | Walk counter |
| 10728 | -- | BYTE | 0 | Walk active flag |
| 10736 | [1342] | QWORD | 0 | Spill cost accumulator 0 |
| 10744 | [1343] | QWORD | 0 | Spill cost accumulator 1 |
| 10752--10824 | [1344]--[1353] | QWORD | 0 | Additional cost/range counters (10 slots) |
| 10840 | [1355] | QWORD | 0 | Interference counter |
| 10872 | -- | DWORD | 0 | Class mask |
| 10888 | [1361] | QWORD | 0 | Result register count |
| 10896 | [1362] | QWORD | 0 | Result cost metric |
| 10904 | [1363] | QWORD | 0 | Result spill count |
| 10912 | [1364] | QWORD | 0 | Result class width |
| 10920 | [1365] | QWORD | 0 | Best-attempt index |
| 10960 | [1370] | QWORD | 0 | Phase indicator |
| 10968 | [1371] | QWORD | 0 | Phase state |
| 10976 | [1372] | QWORD | 0 | Output flag |
| 11008 | [1376] | QWORD | 0 | SMEM spill tracking |
| 11016 | [1377] | QWORD | 0 | SMEM spill state |
| 11048 | [1381] | QWORD | 0 | Output queue pointer |
| 11056 | [1382] | QWORD | 0 | Output queue size |
| 11072 | [1384] | ptr | 0 | Callee-save tracking (freed by sub_96CFA0) |
| 11080 | [1385] | ptr | 0 | Callee-save arena ref (freed by sub_96CFA0) |
| 11089 | -- | BYTE | 1 | Guidance enabled flag |
| 11096 | [1387] | ptr | 0 | Final candidate object (freed by sub_96CFA0) |
| 11104 | [1388] | ptr | 0 | Final candidate arena ref (freed by sub_96CFA0) |
The linked list at [1324]--[1329] is initialized as a circular doubly-linked list with self-referential pointers, following the standard intrusive list pattern. The cleanup function sub_96CFA0 (694 lines) deallocates the candidate node objects at offsets 11072, 11080, 11096, and 11104.
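The self-referential initialization described above is the standard empty state of an intrusive circular list: the head's forward and backward pointers both name the head itself. A minimal Python sketch (class and method names are illustrative):

```python
class Node:
    """Minimal intrusive node: the link fields live in the object itself."""
    def __init__(self):
        self.prev = None
        self.next = None

class IntrusiveListHead:
    """Circular doubly-linked head; an empty list points at itself,
    matching the self-referential init at [1324]--[1329]."""
    def __init__(self):
        self.next = self
        self.prev = self
        self.count = 0

    def push_back(self, node: Node) -> None:
        node.prev, node.next = self.prev, self
        self.prev.next = node
        self.prev = node
        self.count += 1
```

Emptiness is then a single pointer comparison (`head.next is head`), with no null checks anywhere on the traversal path.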
Candidate node object (24 bytes)
Each candidate node is a 24-byte arena-allocated object used in the doubly-linked list and as priority queue sentinels:
| Offset | Type | Field |
|---|---|---|
| +0 | QWORD | Type tag: 2 = priority queue node, 3 = initial sentinel |
| +8 | QWORD | Count / payload |
| +16 | ptr | Arena back-reference |
Stack-local queue header array
In addition to the 11,112-byte structure, the function maintains a stack-local 7-element queue header array (v514, 168 bytes on stack). Each entry is 24 bytes (3 QWORDs) with the same layout as the embedded queue entries above. The 7 entries map to the 7 register classes:
| Index | Class |
|---|---|
| 0 | R (general-purpose registers) |
| 1 | P (predicate registers) |
| 2 | B (barrier registers) |
| 3 | UR (uniform registers) |
| 4 | UP (uniform predicates) |
| 5 | UB (uniform barriers) |
| 6 | Acc (tensor/accumulator registers) |
After bitvector iteration, each stack-local queue header is built by sub_8BE190 and sorted by sub_7553C0.
Algorithm
function compute_spill_guidance(ctx, guidance_array, attempt):
for class_id in 0..6:
entry = &guidance_array[class_id]
// 1. Initialize working bitmask arrays
zero_fill(entry.bitmasks, 128 elements)
// 2. Iterate live range bitvectors
for each live_range in class[class_id]:
// 3. Compute set intersection with other live ranges
intersect(entry.bitmasks, live_range.bitvector)
// 4. Build priority queue of spill candidates
build_priority_queue(entry) // sub_8BE190
// 5. Sort by spill cost (ascending -- cheapest to spill first)
sort_priority_queue(entry) // sub_7553C0
The guidance output is consumed by the retry loop: after each failed allocation attempt, the allocator consults the guidance to decide which virtual registers to allow to spill on the next attempt.
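Step 5's ascending sort (cheapest candidates first) can be sketched with a heap. This is illustrative Python; the candidate representation is hypothetical, and the binary uses sub_7553C0's sort rather than a heap.

```python
import heapq

def order_spill_candidates(costs: dict) -> list:
    """Return vreg ids cheapest-first -- the order in which the retry
    loop would permit additional registers to spill (sketch)."""
    heap = [(cost, vreg_id) for vreg_id, cost in costs.items()]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, vreg_id = heapq.heappop(heap)
        ordered.append(vreg_id)
    return ordered
```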
Spill Cost Model
The allocator uses a multi-level cost model to evaluate which registers are cheapest to spill.
Per-virtual-register weights
| Field | Type | Meaning |
|---|---|---|
vreg+40 | float | Primary spill cost (accumulated from usage frequency) |
vreg+76 | float | Secondary spill cost (alternate weighting) |
vreg+80 | int | Spill flag: 0 = not spilled, 1 = spilled |
Allocator-level accumulators
| Field | Type | Meaning |
|---|---|---|
alloc+1568 | double | Total spill-store cost |
alloc+1576 | float | Total spill-load cost |
Default cost weight
The base spill cost weight is 15.0 for normal register classes, reduced to 3.0 for register classes under high pressure. The selection is made by a per-class flag at alloc + 32 * class_id + 893:
float spill_weight = 15.0f;
if (*(uint8_t*)(alloc + 32 * regclass + 893))
spill_weight = 3.0f; // high-pressure class: lower cost to encourage spilling
Block frequency weighting
Spill cost is multiplied by the enclosing loop's nesting depth, obtained via a block frequency callback at vtable offset +8. Inner-loop spills receive higher penalties, discouraging the allocator from spilling values that are live across loop back-edges.
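Taken literally, the weighting multiplies the per-class base weight by nesting depth. A sketch under that reading (treating depth 0 as a factor of 1 is an assumption; the binary obtains the factor through the block frequency callback):

```python
def weighted_spill_cost(base_weight: float, nesting_depth: int) -> float:
    """Scale the per-class base weight (15.0 normal, 3.0 high-pressure)
    by loop nesting depth so inner-loop spills cost more (sketch)."""
    return base_weight * max(nesting_depth, 1)
```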
Best-result comparison
sub_93D070 (155 lines) records the best allocation result across retry attempts. Comparison uses tie-breaking priority:
- Register count (lower is better)
- Cost metric (double at `result+56`)
- Spill count
- Register class width
An inverse density metric 128 / register_count is used for secondary comparison. The per-variable assignment array is saved to a backup when a new best is found.
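The tie-breaking priority maps naturally onto lexicographic tuple comparison. A Python sketch with hypothetical field names; the inverse density metric is omitted because its exact position in the ordering is not recovered.

```python
def result_key(reg_count: int, cost: float,
               spill_count: int, class_width: int) -> tuple:
    """Lexicographic priority mirroring sub_93D070: fewer registers,
    then lower cost, then fewer spills, then narrower class."""
    return (reg_count, cost, spill_count, class_width)

def better(a: tuple, b: tuple) -> tuple:
    """Return whichever allocation result wins the comparison."""
    return a if result_key(*a) <= result_key(*b) else b
```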
Spill cost infrastructure
A suite of functions at 0x998000--0x99E000 implements the cost computation:
| Address | Role |
|---|---|
sub_9997D0 | Spill cost initialization |
sub_9998A0 | Spill cost computation |
sub_999950 | Spill cost comparison |
sub_999AA0 | Spill benefit estimation |
sub_999D10 | Spill cost aggregation |
sub_999F00 | Spill cost finalization |
sub_99A0B0 | Range spill cost query |
sub_9A8270 | Live range spill cost (14 KB) |
Spill Code Generation
sub_94F150 (561 lines) inserts actual spill-store and refill-load instructions into the IR instruction stream.
Per-register spill info
The function allocates a tracking array: 12 bytes per virtual register, initialized to {0, -1, -1}:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Spill state (0 = none) |
| +4 | 4 | Spill slot index (-1 = unassigned) |
| +8 | 4 | Refill slot index (-1 = unassigned) |
Execution flow
function generate_spill_code(alloc, ctx, mode):
// 1. Allocate per-block tracking
tracking = arena_alloc(12 * (numBlocks + 1))
// 2. Set up instruction iteration
setup_instruction_walk(ctx, walk=1, dir=0) // sub_7E6090
// 3. Multi-block liveness (if numBlocks > 1 and mode == 1)
compute_cross_block_liveness(alloc, ctx) // sub_9449B0
// Uses bitvectors: sub_BDBA60, sub_BDC180, sub_BDCDE0
// 4. Clear per-instruction spill flags
for each instr:
instr.flags &= ~0xE00
// 5. Walk instruction list
spill_weight = 15.0
if high_pressure_class:
spill_weight = 3.0
for each instr in instruction_list:
for each operand:
vreg = lookup_vreg(operand)
if vreg.was_previously_spilled:
// Track for potential refill
update_refill_tracking(vreg, instr)
// Accumulate spill cost weighted by block frequency
vreg.spill_cost += spill_weight * block_frequency(instr.block)
// Handle call instructions (opcode 97; STG in ROT13, used as CALL marker -- actual CALL is opcode 71)
if instr.opcode == 97:
handle_callee_save(alloc, instr)
// Track "use after last def" for enhanced cost
update_use_after_def(vreg, instr)
// 6. Uniform register special handling (flag 0x200)
if vreg.flags & 0x200:
apply_uniform_spill_rules(vreg)
Epoch tracking
sub_936CF0 (81 lines) checks basic block boundaries for epoch increments. It returns true if the block's successor is a CALL instruction (opcode 52) with a target that has opcode 236 (special call), or if allocator flags at alloc+1588/alloc+1589 are set. This mechanism tracks liveness across call boundaries, ensuring that spilled values are correctly reloaded after calls that may clobber registers.
LMEM Spilling
Local memory (LMEM) is the default spill destination. Each GPU thread has private local memory backed by DRAM and cached in L2.
Slot allocation
sub_939BD0 (65 lines) configures the spill slot allocator. It queries OCG knob 623 for a custom spill mode; if the knob is disabled, it selects between two default configurations based on the cost threshold at alloc+776:
| Condition | Bucket size | Alignment | Max pool |
|---|---|---|---|
| Cost threshold == 0.0 | 8 bytes | 4 bytes | 1 MB |
| Cost threshold != 0.0 | 16 bytes | 16 bytes | 1 MB |
The 8-byte/4-byte configuration handles standard 32-bit register spills. The 16-byte/16-byte configuration handles double-precision or 64-bit values that require stricter alignment.
if (*(double*)(alloc + 776) == 0.0)
return init_spill_pool(mem_alloc, 8, 4, 0x100000); // 8B buckets, 4B aligned
else
return init_spill_pool(mem_alloc, 16, 16, 0x100000); // 16B buckets, 16B aligned
When knob 623 is enabled, the knob value at offset +224 supplies a custom spill limit, passed to the spill allocator init function via vtable +24.
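The dispatch in sub_939BD0 can be summarized as follows. This is a sketch: the bucket/alignment/pool constants come from the table above, but the knob-override path is simplified (the real code passes the custom limit through a vtable call at +24 rather than returning a dict).

```python
MAX_POOL = 0x100000  # 1 MB pool cap in both default configurations

def spill_pool_config(cost_threshold, knob_623_value=None):
    """Select the spill slot allocator configuration (sketch of sub_939BD0)."""
    if knob_623_value is not None:
        # Knob 623 enabled: value at offset +224 supplies a custom limit
        return {"custom_limit": knob_623_value, "max_pool": MAX_POOL}
    if cost_threshold == 0.0:
        # Standard 32-bit register spills
        return {"bucket": 8, "align": 4, "max_pool": MAX_POOL}
    # Double-precision / 64-bit values needing stricter alignment
    return {"bucket": 16, "align": 16, "max_pool": MAX_POOL}
```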
SASS instruction sequences
sub_9850F0 (520 lines) generates the actual SASS load/store instruction sequences for spill traffic. Architecture-specific registers drive the address computation:
| Field | Source | Role |
|---|---|---|
| target_info[400] | Architecture vtable | Spill base register |
| target_info[401] | Architecture vtable | Spill slot stride |
| target_info[402] | Architecture vtable | Spill offset register |
Spill store sequence:
IADD addr, spill_base, offset // compute slot address
IMAD addr, addr, stride, 0 // scale by slot stride
ST [addr], value // store to local memory
Refill load sequence:
IADD addr, spill_base, offset // compute slot address
IMAD addr, addr, stride, 0 // scale by slot stride
LD value, [addr] // load from local memory
The generated instructions use these SASS opcodes:
| Opcode | SASS mnemonic | Role in spill sequence |
|---|---|---|
0xC9 | IADD | Add offset to base register |
0x11B | IMAD | Multiply-add for address |
0xC3 | MOV | Move value |
0x82 (130) | HSET2 in ROT13; used as LD/LDL-like | Load from local memory (refill) |
0xB7 | ST / STL | Store to local memory (spill) |
0x14 | ISETP | Set predicate (conditional spill) |
0x8B | SHL | Shift for alignment |
0x110 | LOP3 | Logical operation for masking |
0x5F | BRA | Branch (conditional spill) |
0x120 | STS | Store to shared memory (SMEM path) |
At the hardware level, local memory spills become LDL/STL instructions. The SETLMEMBASE (opcode 147) and GETLMEMBASE (opcode 148) instructions manage the local memory base pointer for the thread.
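Reading the IADD/IMAD pair literally, the slot address is the sum of the spill base and the slot offset, scaled by the per-slot stride. A sketch of that arithmetic only (not the encoder), with parameter names taken from the target_info fields above:

```python
def spill_slot_address(spill_base, offset, stride):
    """Model the two-instruction address computation in the spill sequences."""
    addr = spill_base + offset   # IADD addr, spill_base, offset
    addr = addr * stride + 0     # IMAD addr, addr, stride, 0
    return addr
```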
SMEM Spilling
Shared memory (SMEM) spilling is an alternative to local memory spilling. Shared memory is on-chip SRAM, offering significantly lower latency than LMEM (which goes through L2/DRAM). However, shared memory is a finite resource shared across all threads in a CTA, making this optimization applicable only in specific circumstances.
Entry point
sub_9539C0 (1873 lines, 54 KB decompiled) implements the SMEM spill allocator.
Hard restriction
"Smem spilling should not be enabled when functions use abi."
This assertion (checked at two call sites within sub_9539C0) prevents SMEM spilling for ABI-conformant functions. Functions using the GPU calling convention require a stable stack frame in local memory; shared memory spill slots would conflict with the calling convention's stack layout requirements.
Activation conditions
SMEM spilling activates when all of the following hold:
- Register class is 3 (UR) or 6 (Tensor/Acc)
- Device type is 5 (ctx+896 == 5)
- The class has virtual registers to allocate (vreg_count > 0)
- The function does not use ABI calling conventions
Algorithm
function smem_spill_allocate(alloc, ctx):
    // 1. Assert no ABI conflict
    assert(not function_uses_abi())
    // 2. Allocate per-block tracking (24-byte slots)
    tracking = arena_alloc(24 * numBlocks)
    // 3. Set up SSE-width bitmaps for shared memory tracking
    init_smem_bitmaps(alloc)
    // 4. Walk instruction list, identify spill candidates
    for each vreg marked for spill:
        // 5. Allocate shared memory slot
        slot = allocate_smem_slot(alloc)
        vreg.smem_slot = slot
        // 6. Generate shared memory load/store
        insert_sts_instruction(slot, vreg)   // STS (store to smem)
        insert_lds_instruction(slot, vreg)   // LDS (load from smem)
    // 7. Update shared memory allocation bitmap
    update_smem_allocation(alloc)
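Steps 5 and 7 amount to first-fit allocation against a shared-memory occupancy bitmap. A hedged model: the real allocator works on SSE-width bitmaps (step 3) over a reserved SMEM region, not a Python list, and the slot granularity here is an assumption.

```python
def allocate_smem_slot(bitmap):
    """First-fit slot allocation. bitmap: list of bools, True = slot in use.

    Returns the slot index, or -1 when the shared-memory budget is exhausted
    (shared memory is a finite per-CTA resource).
    """
    for i, used in enumerate(bitmap):
        if not used:
            bitmap[i] = True  # step 7: update the allocation bitmap
            return i
    return -1
```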
SMEM symbols
The SMEM spill infrastructure uses two internal symbols for allocation tracking:
- __nv_reservedSMEM_allocation_phase (address 0x1CFCE80)
- __nv_reservedSMEM_allocation_mask (address 0x1CFCEA8)
The CLI flag --disable-smem-reservation can disable shared memory reservation entirely.
Spill Statistics
The allocator collects detailed spill metrics into a per-function statistics object. These are used for compilation reporting, performance analysis, and the --warn-on-spills diagnostic.
Statistics fields
The statistics object stores spill counters at fixed DWORD offsets:
| Offset (DWORDs) | Name | Description |
|---|---|---|
| [12] | LSpillB | Local memory spill bytes |
| [13] | LRefillB | Local memory refill bytes |
| [14] | SRefillB | Shared memory refill bytes |
| [15] | SSpillB | Shared memory spill bytes |
| [16] | LowLmemSpillSize | Low-bound local memory spill size |
| [17] | FrameLmemSpillSize | Frame-level local memory spill size |
| [18] | LNonSpillB | Non-spill local memory bytes |
| [19] | LNonRefillB | Non-refill local memory bytes |
| [20] | NonSpillSize | Total non-spill memory size |
Format strings
The statistics printing subsystem (sub_A3A7E0) emits two lines for spill metrics:
# [est latency = %d] [LSpillB=%d] [LRefillB=%d], [SSpillB=%d],
[SRefillB=%d], [LowLmemSpillSize=%d] [FrameLmemSpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d], [NonSpillSize=%d]
The function properties string (used in verbose output):
Function properties for %s
%d bytes stack frame, %d bytes spill stores, %d bytes spill loads
Spill warning
When --warn-on-spills is active, the following warning is emitted for any function with spills:
Registers are spilled to local memory in function '%s',
%d bytes spill stores, %d bytes spill loads
The flag is registered in sub_432A00 / sub_434320 and stored at compilation_ctx + 473.
Allocation Failure
When all spill retry attempts are exhausted and the allocator still cannot fit within the register budget, a fatal error is emitted:
Register allocation failed with register count of '%d'.
Compile the program with a higher register target
This error originates from sub_9714E0 (allocation finalization, 290 lines), which is the last function called in the retry pipeline. Two emission paths exist:
| Path | Function | Context |
|---|---|---|
| With source location | sub_895530 | Includes function name and PTX line number |
| Without source location | sub_7EEFA0 | Generic error when location unavailable |
The alternate allocator path sub_964130 (1794 lines) also references this error at six separate points, covering different failure/retry scenarios. A dedicated failure reporter sub_966490 (474 lines) handles error diagnostic formatting.
Spill-Refill Pattern Optimization
The Ori IR includes dedicated instruction type markers for spill/refill patterns, enabling post-allocation optimization of spill traffic:
| Type ID | Name | Description |
|---|---|---|
| 8 | Spill-refill | Spill/refill pair marker |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |
The SpillRefill pass attempts to match and optimize these patterns. Error strings reveal three failure modes:
- "Failed to find matching spill for refilling load that is involved in this operand computation" -- the refill load has no corresponding spill store
- "Failed to establish match for bit-spill-refill pattern involved in this operand computation" -- the bit-spill pattern does not match expected form
- "Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation" -- instructions between spill and refill clobber the base address register
Debug strings include " spill-regill bug " and " bit-spill bug " (both with typos present in the binary).
Function Map
| Address | Lines | Role | Confidence |
|---|---|---|---|
| sub_939BD0 | 65 | Spill allocator setup (knob 623 dispatch) | HIGH |
| sub_93C0B0 | 582 | Spill range optimizer | HIGH |
| sub_93D070 | 155 | Best allocation result recorder | HIGH |
| sub_93E290 | 397 | Spill candidate node creator (192-byte arena nodes) | HIGH |
| sub_93F130 | 544 | Spill code inserter | HIGH |
| sub_93FBE0 | 940 | Per-iteration state reset / slot assignment | HIGH |
| sub_940EF0 | 764 | Alternate spill slot assignment path | MEDIUM |
| sub_944740 | 138 | Interference count at program point | HIGH |
| sub_9449B0 | 418 | Cross-block liveness range calculator | HIGH |
| sub_94B200 | 642 | Spill weight accumulator | HIGH |
| sub_94E620 | 617 | Spill cost accumulator / liveness annotator | HIGH |
| sub_94F150 | 561 | Spill code generation (main entry) | HIGH |
| sub_94FDD0 | 155 | Register assignment + spill trigger | HIGH |
| sub_9539C0 | 1,873 | SMEM (shared memory) spill allocator | HIGH |
| sub_9714E0 | 290 | Allocation finalization / error emission | MEDIUM |
| sub_96D940 | 2,983 | Spill guidance engine (7 class-parallel queues) | HIGH |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
| sub_9850F0 | 520 | SASS-level spill instruction generator | HIGH |
| sub_9997D0 | -- | Spill cost initialization | MEDIUM |
| sub_9998A0 | -- | Spill cost computation | MEDIUM |
| sub_999950 | -- | Spill cost comparison | MEDIUM |
| sub_999AA0 | -- | Spill benefit estimation | MEDIUM |
| sub_9A8270 | -- | Live range spill cost computation (14 KB) | MEDIUM |
GPU ABI & Calling Convention
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas ABI engine implements the NVIDIA GPU calling convention for device-side function calls. It manages parameter register allocation, return address placement, scratch/preserved register classification, and per-function ABI lowering across the full range of SM architectures (sm_35 through sm_100+). The engine runs as a multi-pass pipeline invoked per-function from the compilation driver (sub_98F430), positioned between the optimization passes and the register allocator. It spans approximately 250 KB (276 functions) at address range 0x19C6230--0x1A00FFF.
| Master ABI setup | sub_19D1AF0 (5608 bytes) -- orchestrates full per-function ABI pipeline |
| Per-pass lowering | sub_19DC4B0 (6459 bytes) -- 3-pass instruction transform driver |
| Opcode-level dispatch | sub_19CFC30 -- routes 11 opcodes to ABI handlers |
| Parameter allocator | sub_19CA730 (2277 bytes) -- 2048-bit free-list bitmap allocator |
| Return address validator | sub_19CDFF0 (7.5 KB) -- 12 diagnostic strings, warnings 7001--7009 |
| Return address setup | sub_19D1720 (4.8 KB) -- validates and assigns return address registers |
| Register transfer lowering | sub_19CC1A0 (3873 bytes) -- generates MOV/STS/LDS/PRMT sequences |
| gb10b WAR | sub_19D9E00 + sub_19DA2A0 -- __nv_reservedSMEM_gb10b_war_var |
| Convergent checker | sub_19D13F0 (4.3 KB) -- allowConvAlloc boundary validation |
| Address range | 0x19C6230--0x1A00FFF (~250 KB, 276 functions) |
Reserved Registers
Registers R0--R3 are reserved by the ABI and cannot be used for general allocation. The allocator enforces this with the diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s". These four registers serve fixed ABI roles (stack pointer, thread parameters, etc.) and are excluded from both parameter passing and general register allocation.
The reservation is unconditional across all SM generations. Any .maxreg directive or ABI specification that attempts to assign these registers to parameter or return roles triggers a diagnostic.
SM Generation Dispatch
The ABI engine determines the target SM generation by reading a field from the SM target descriptor:
generation = *(int*)(sm_target + 372) >> 12
| Generation value | SM targets | Key ABI differences |
|---|---|---|
| 3 | sm_35, sm_37 | Kepler ABI: no uniform registers, no convergent boundaries |
| 4 | sm_50, sm_52, sm_53 | Maxwell ABI: 16-register minimum, label fixups, coroutine insertion |
| 5 | sm_60--sm_89 | Pascal through Ada ABI: 24-register minimum, cooperative launch support |
| 9 | sm_90, sm_90a | Hopper ABI: 24-register minimum, uniform return address support |
| >9 | sm_100+ | Blackwell ABI: no minimum enforced (skips check), extended register reservation |
The minimum register count varies by generation. For generations 3--4 (sm_35 through sm_53), the ABI requires at least 16 registers per function. For generations 5--9 (sm_60 through sm_90a), the minimum is 24. Generations below 3 and above 9 skip the minimum check entirely. Violating these minimums emits warning 7016: "regcount %d specified below abi_minimum of %d". The abi_minimum value is computed as (generation - 5) < 5 ? 24 : 16.
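The decode and minimum check can be sketched as follows. Note a hedge: the recovered expression (generation - 5) < 5 ? 24 : 16 only reproduces the observed 16/24 split if the subtraction is treated as an unsigned comparison (generation 3 wraps to a huge value and fails the < 5 test), so the model below assumes 32-bit unsigned wrap.

```python
def sm_generation(descriptor_field_372):
    """Decode the SM generation from the target descriptor field at +372."""
    return descriptor_field_372 >> 12

def abi_minimum(generation):
    """Return the abi_minimum regcount, or None when the check is skipped
    (generations below 3 and above 9 enforce no minimum)."""
    if generation < 3 or generation > 9:
        return None
    u = (generation - 5) & 0xFFFFFFFF  # assumed unsigned 32-bit wrap
    return 24 if u < 5 else 16
```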
Master ABI Setup: sub_19D1AF0
The top-level ABI entry point (5608 bytes), called once per function by the per-function compilation driver sub_98F430. It orchestrates the full ABI pipeline in 10 steps:
function abi_master_setup(func, sm_target, abi_spec):
    // 1. Validate register count vs. ABI minimums
    generation = *(sm_target + 372) >> 12
    if generation in 3..4: min_regs = 16   // sm_35-sm_53
    if generation in 5..9: min_regs = 24   // sm_60-sm_90a
    if func.maxreg < min_regs:
        warn(7016, "regcount %d specified below abi_minimum of %d",
             func.maxreg, min_regs)
    // 2. Validate register reservation range
    if available_regs < requested_reservation:
        warn(7017, "register available %d for reservation is less "
                   "than the requested number of registers %d",
             available_regs, requested_reservation)
    // 3. Validate coroutine SUSPEND semantics
    for each register in func.preserved_set:
        if register.is_scratch_at_suspend:
            warn(7011, "Register (%s%d)is defined as scratch on "
                       "SUSPEND but preserved for coroutine function",
                 register.class_name, register.index)
    // 4. Iterate callee list, mark ABI-callable entries
    for each callee in func.callees:
        callee.abi_flags |= ABI_CALLABLE
        propagate_abi_attributes(func, callee)
    // 5. Propagate register limits to callees
    abi_propagate_limits(func)        // sub_19CE590
    // 6. Check return-address / parameter overlap
    abi_overlap_precheck(func)        // sub_19CA3C0
    // 7. Allocate parameter registers
    abi_alloc_params(func)            // sub_19CA730
    // 8. Validate return address assignment
    abi_return_addr_setup(func)       // sub_19D1720
    // 9. Detailed return address validation
    abi_return_addr_validate(func)    // sub_19CDFF0
    // 10. Adjust register file limits via vtable
    vtable[736](func, sm_target)
Parameter Passing
Parameters are passed in consecutive R registers starting from a configurable base register. The ABI tracks "number of registers used for parameter passing" and "first parameter register" as per-function properties. The parameter register range begins after the reserved registers (R0--R3) and the return address register.
Parameter Register Allocator: sub_19CA730
The core parameter allocator (2277 bytes, 98% confidence). It uses a 2048-bit free-list bitmap (v103[], 256 bytes) to track available register slots.
function abi_alloc_params(func):
    // Initialize 2048-bit free-list (256 bytes)
    bitmap[256] = {0xFF...}                   // all slots free
    // Mark reserved registers as occupied
    clear_bits(bitmap, 0, 3)                  // R0-R3 always reserved
    // Mark already-allocated registers
    popcount = register_usage_popcount(func)  // sub_19C99B0
    // Allocate PARAMETER registers
    for each param in func.parameters:
        align = param_alignment(param.type_width)   // 4/8/16 bytes
        slot = find_contiguous_free(bitmap, param.reg_count, align)
        if slot == -1:
            error("Function %s size requires more registers(%d) "
                  "than available(%d)", func.name, needed, available)
            return FAILURE
        assign_register(slot, param)                  // sub_7FA420
        mark_allocated(bitmap, slot, param.reg_count) // sub_BDBB80
    // Allocate RETURN registers (same algorithm, separate class)
    for each ret in func.return_values:
        slot = find_contiguous_free(bitmap, ret.reg_count, align)
        assign_register(slot, ret)
        mark_allocated(bitmap, slot, ret.reg_count)
The allocator processes parameters and return values as separate classes, each requiring contiguous register ranges with natural alignment. For 8-byte parameters, the base register must be even-aligned. For 16-byte parameters, the base must be 4-register-aligned.
The population count helper (sub_19C99B0, 2568 bytes) uses the __popcountdi2 intrinsic to count live registers in the function's usage bitmap, determining how many slots remain available.
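The aligned contiguous-free search is the core of the allocator. A self-contained sketch on a Python big integer standing in for the real 2048-bit v103[] array (bit i set = register Ri free); the helper names are ours, not the binary's:

```python
NUM_SLOTS = 2048  # 2048-bit free-list bitmap (256 bytes in the binary)

def find_contiguous_free(bitmap, count, align):
    """Find `count` consecutive free slots at an `align`-register boundary.

    Returns the base register index, or -1 on exhaustion (the
    "requires more registers than available" error path).
    """
    for base in range(0, NUM_SLOTS - count + 1, align):
        run = ((1 << count) - 1) << base
        if (bitmap & run) == run:  # all `count` slots free
            return base
    return -1

def mark_allocated(bitmap, base, count):
    """Clear the free bits for an allocated run."""
    return bitmap & ~(((1 << count) - 1) << base)

# Start fully free, then reserve R0-R3 per the ABI rule above.
free = (1 << NUM_SLOTS) - 1
free = mark_allocated(free, 0, 4)
```

With R0-R3 reserved, an even-aligned 64-bit parameter (count=2, align=2) lands at R4-R5, matching the alignment rules described above.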
Return Address Register
The return address occupies a dedicated register (or register pair) whose location is validated against parameter ranges. The diagnostic "Parameter registers from R%d to R%d overlap with return address register R%d to R%d" fires when parameter and return address ranges collide.
Return Address Modes
The return address validator (sub_19CDFF0, 7.5 KB, 99% confidence) handles four modes, selected by the v7 field in the ABI specification:
| Mode | v7 | Behavior |
|---|---|---|
| Fixed | 1 | Return address at register 4 + 2 = R6. Fixed by architecture. |
| Regular | 2 | General-purpose register, validated < max_reg. |
| Uniform | 3 | Uniform register (UR) for return address. Requires SM support (sm_75+). |
| Computed | 5 | Derived from parameter layout. Auto-aligned to even register number. |
Return Address Validator: sub_19CDFF0
The most thoroughly instrumented function in the ABI engine (7 distinct warning codes across two mode-specific paths). It performs these validations in sequence:
| Code | Condition | Message |
|---|---|---|
| 7001 | return_addr & 1 != 0 | "ABI return address %d is unaligned" |
| 7002 | return_addr >= max_reg | "Return Address (%d) should be less than %d" |
| 7003 | stack_ptr in [return_addr, return_addr+1] | "Return address (%d) should not overlap with the stack pointer (%d)" |
| 7004 | Return addr bit set in parameter bitmap | "Return Address %d overlaps with parameters in range %d - %d" |
| 7005 | param_end + align > max_reg (auto-placement) | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" |
| 7008 | return_addr < lower_bound or return_addr > upper_bound | "Return address (%d) should be between %d and %d" |
| 7009 | Mode 3 and !(func+1408 byte & 0x02) | "SM does not support uniform registers for return address" |
The checks are mode-dependent. Mode 2 (regular GPR) enters the 7002/7001/7003/7004 path. Modes 3 and 5 (uniform/computed) enter the 7009/7008/7001 path. Mode 1 and mode 5 share the auto-placement path where 7005 fires. Warning 7001 (unaligned) appears in both paths because 64-bit return address pairs always require even alignment.
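The mode-2 (regular GPR) checks can be modeled as a small predicate chain that collects the warning codes that would fire. A hedged sketch: parameter names are ours (the real routine reads these fields from the ABI spec object), and the check ordering here follows the table rather than the exact emission order in the binary.

```python
def validate_return_addr_gpr(ret, max_reg, stack_ptr, param_bitmap):
    """Collect mode-2 warning codes for a return address register `ret`.

    param_bitmap: int, bit i set = Ri allocated to a parameter.
    """
    warnings = []
    if ret & 1:
        warnings.append(7001)  # unaligned: 64-bit pairs need even base
    if ret >= max_reg:
        warnings.append(7002)  # exceeds the register budget
    if stack_ptr in (ret, ret + 1):
        warnings.append(7003)  # overlaps the stack pointer
    if param_bitmap & (1 << ret):
        warnings.append(7004)  # overlaps the parameter range
    return warnings
```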
Return Address Setup: sub_19D1720
The setup function (4.8 KB, 95% confidence) runs before the validator. It propagates ABI flag 0x04 to the function state (byte 1389), validates that the return address register (register 1) is not classified as scratch when it must be preserved (warning 7012: "%d register should not be classified as scratch"), sizes the preserved register set to 255 entries via sub_BDBAD0, and computes the effective register range as return_size + param_size for comparison against the maximum available. The 7012 check fires when *(abi_spec+88) & 0x01 and *(abi_spec+48) & 0x02 are both set, always with argument 1 (the return address register).
The function also enforces the mutual exclusion rule (warning 7006): "ABI allows either specifying return address or return address before params". This fires when mode is 1 (fixed, "return address before params") but an explicit return address register is also assigned (return_addr != -1). You pick one strategy, not both.
Scratch Data Registers
Registers not reserved by the ABI and not used for parameters or return values may be classified as scratch (callee-clobbered). The ABI engine tracks scratch classification per register and validates it against coroutine semantics. At SUSPEND points in coroutine functions, a register marked as scratch must not also appear in the preserved set. Violation triggers warning 7011.
The scratch/preserved classification feeds into the register allocator's spill decisions. Registers marked as scratch across a call boundary must be saved by the caller; preserved registers must be saved by the callee.
Per-Pass Instruction Lowering: sub_19DC4B0
The instruction-level ABI transform driver (6459 bytes, 95% confidence). Called from both sub_98F430 and sub_A9DDD0. It makes three passes over the instruction stream, each performing different transformations:
Pass 1 -- Convergent Boundary Fixup
- Fixes convergent boundary annotations (allowConvAlloc).
- Handles SHFL.NI (shuffle, no-index) fixups for intra-warp communication.
- Propagates the .uniform bit on CAL (call) instructions.
Pass 2 -- Instruction Lowering
Lowers high-level Ori opcodes into ABI-conforming SASS sequences:
| Ori opcode | Mnemonic | Transform |
|---|---|---|
| 109 | CALL | Parameter register setup, save/restore insertion |
| 16 | ST | Shared memory store lowering |
| 77 | LD | Shared memory load lowering |
| 185 | ATOMG | Atomic operation lowering |
| 183 | (special) | Mode 2/3 reclassification |
Pass 3 -- Architecture-Specific Fixups
Conditioned on SM generation:
sm_50 (generation == 4): Label fixups, coroutine code insertion, shared memory WAR insertion, convergent boundary checks.
sm_60+ (generation == 5): Additional register reservation for ABI conformance, cooperative launch handling, extended register file support.
All architectures: Per-block instruction scanning for opcode 195 (MOV) and opcode 205 reclassification. Register reservation range setup via sub_7358F0 / sub_7AC150.
Opcode-Level ABI Dispatch: sub_19CFC30
A dispatcher called twice from sub_98F430 that routes individual opcodes to specialized ABI handlers:
| Ori opcode | Handler | Transform |
|---|---|---|
| 9 | sub_19CF9A0 | PRMT (permute) lowering |
| 54 | (inline) | Function parameter preallocation |
| 72 | sub_19CDED0 + sub_19CB590 + sub_19CB7E0 | SMEM reservation + pre/post call register save/restore |
| 98 | sub_19CBAC0 | Shared load (LD.S) ABI lowering |
| 159 | sub_19CD0D0 | Barrier instruction lowering |
| 164 | sub_19CC1A0 | Register load (transfer lowering) |
| 168 | sub_19CC1A0 | Register store (transfer lowering) |
| 183 | sub_19CBE00 | Special instruction fixup |
| 226 | sub_19CD950 | Predicate lowering |
| 236 | sub_19CD510 | Conversion instruction lowering |
| 335 | sub_19CDED0 | SMEM reservation instruction handler |
Register Transfer Lowering: sub_19CC1A0
The register-to-register transfer lowering function (3873 bytes, 95% confidence). It converts abstract register load/store operations (opcodes 164 and 168) into concrete SASS instruction sequences. The lowering path depends on the ABI function properties:
Direct copy path (byte 12 == 0): Register-to-register MOV instructions.
| Data width | Generated sequence |
|---|---|
| 4 bytes (32-bit) | Single MOV-like (opcode 130 / 0x82, HSET2 in ROT13; actual SASS MOV is opcode 19) |
| 8 bytes (64-bit) | STS + LDS pair (opcodes 0x86/0x85) through shared memory |
| Permute | PRMT (opcode 0x120) for byte-lane rearrangement |
Shared memory indirect path (byte 13 == 1): All transfers go through shared memory via STS/LDS pairs, using a reserved shared memory region as a scratch buffer. This path is used when direct register-to-register transfer is not possible (e.g., cross-warp parameter passing on older architectures or when the register file is partitioned).
The function also generates opcode 0xB7 (special) for shared-memory-based transfers that require additional synchronization. It calls sub_92E800 (instruction builder) for each generated SASS instruction.
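The width dispatch in the direct-copy path reduces to a simple table. A sketch of the selection logic only; the mnemonics come from the table above, while the sequence construction (and the indirect-path trigger as a boolean) is our simplification of the byte-12/byte-13 property checks.

```python
def lower_transfer(width_bytes, indirect=False):
    """Pick the SASS sequence for an abstract register transfer (sketch)."""
    if indirect:
        # byte 13 == 1: all transfers bounce through reserved shared memory
        return ["STS", "LDS"]
    if width_bytes == 4:
        return ["MOV"]          # single 32-bit register-to-register move
    if width_bytes == 8:
        return ["STS", "LDS"]   # 64-bit value staged through shared memory
    raise ValueError("unsupported transfer width")
```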
Convergent Boundary Enforcement
Two functions enforce convergent allocation boundaries for function calls annotated with allowConvAlloc:
Convergent boundary checker (sub_19D13F0, 4.3 KB): Walks the basic block list, builds a bitmask of convergent register definitions, and validates that every allowConvAlloc-annotated call has a proper convergent boundary. Emits "Missing proper convergent boundary around func call annotated with allowConvAlloc" when the boundary is absent.
CONV.ALLOC insertion (sub_19D7A70, 3313 bytes): Scans the instruction list for convergent boundary violations. When a register def flows to a convergent use through a non-convergent path, inserts a CONV.ALLOC placeholder instruction (opcode 0x11E = 286). Uses a 64-bit-word bitmask array to track which register slots are live across convergent boundaries.
The single-call checker (sub_19C6400) warns when a convergent region contains more than one call: "Multiple functions calls within the allowConvAlloc convergent boundary".
Coroutine Support
Functions with coroutine support (flag 0x01 at function byte +1369) receive special ABI handling. Registers that are live across SUSPEND points must be saved to and restored from the coroutine frame.
Coroutine SUSPEND handler (sub_19D5F10, 1568 bytes): Scans the instruction stream for suspend points. For each register defined before and used after a SUSPEND, inserts save/restore pairs to/from the coroutine frame.
Coroutine frame builder (sub_19D4B80, 1925 bytes): Constructs the frame layout for coroutine-style functions, allocating slots for each register that must survive a SUSPEND.
The ABI engine validates that the scratch/preserved classification is consistent with coroutine semantics. Warning 7011 fires when a register marked as scratch at a SUSPEND point is also required to be preserved for the coroutine function. Warning 7012 fires when the return address register itself is misclassified as scratch.
gb10b Hardware WAR
Two functions implement a shared-memory-based workaround for a hardware bug on the gb10b variant (SM 75, Turing). Both reference the reserved symbol __nv_reservedSMEM_gb10b_war_var.
Entry block variant (sub_19D9E00): Generates a complex instruction sequence using additional temp registers (opcodes ADD, MOV, BAR) for the function entry block.
Body variant (sub_19DA2A0, 95% confidence): Generates a 7-instruction SASS sequence:
1. MOV.C temp_reg, <constant> // opcode 195, class 3
2. LD.S temp_reg, [__nv_reservedSMEM_gb10b_war_var] // opcode 98
3. AND temp_reg, temp_reg, 4 // opcode 214
4. SETP P, temp_reg, 0x4000 // opcode 272
5. STS [__nv_reservedSMEM_gb10b_war_var], temp_reg // opcode 277
6. @P BRA target // opcode 18, predicated
7. MOV result, 0 // zero-initialization
The reserved SMEM checker (sub_19DDEF0, 1687 bytes) iterates instructions looking for opcode 335 (SMEM reservation). When found and the function is not allowed to use reserved shared memory, it emits warning 7801: "Function '%s' uses reserved shared memory when not allowed.".
ABI Register Limit Propagation
The limit propagator (sub_19CE590) handles inter-procedural ABI attribute forwarding. For SM generations 4 and 5 (sm_50, sm_60 families), it iterates the call graph and copies the max-register limit from caller to callee (field +264 to +268) unless the callee has an explicit ABI specification. This ensures that callees do not exceed the register budget established by their callers.
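The propagation rule can be sketched over a toy call graph. A hedged model: the dict representation is an illustration, and the explicit-spec check stands in for the field +264 to +268 copy guard in sub_19CE590.

```python
def propagate_limits(call_graph, max_reg, has_abi_spec):
    """Copy each caller's max-register limit to callees lacking an
    explicit ABI specification (generations 4-5 behavior, sketched).

    call_graph: {caller: [callees]}; max_reg is mutated in place.
    """
    for caller, callees in call_graph.items():
        for callee in callees:
            if not has_abi_spec.get(callee, False):
                max_reg[callee] = max_reg[caller]  # field +264 -> +268
```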
Call Instruction ABI Lowering: sub_19D41E0
The call lowering function (2247 bytes, 85% confidence) processes each call instruction (opcode 97; STG in the ROT13 name table, but used here as an internal CALL-like marker -- actual SASS CALL is opcode 71) in the function. For each call site it:
- Sets up parameter passing registers according to the callee's ABI specification.
- Inserts pre-call register save sequences for caller-saved registers.
- Modifies the call target to use ABI-conforming register assignments.
- Inserts post-call register restore sequences.
Register File Types
The ABI handles three register file types, each with distinct allocation rules:
| Type | v7 value | File | Range | SM requirement |
|---|---|---|---|---|
| GPR | 2 | General-purpose | R0--R255 | All architectures |
| Uniform | 3 | Uniform GPR | UR0--UR63 | sm_75+ |
| Predicate | 5 | Predicate | P0--P7 | All architectures |
Uniform registers (type 3) are only available on sm_75 and later. Attempting to use a uniform register for the return address on an older SM triggers warning 7009.
Pipeline Integration
The ABI engine sits between the optimization passes and the register allocator in the ptxas pipeline:
... optimization passes ...
Late Legalization / Expansion
ABI Master Setup <-- sub_19D1AF0 (per-function)
ABI Pass 1 (convergent) <-- sub_19DC4B0 (a2=1)
ABI Pass 2 (lowering) <-- sub_19DC4B0 (a2=2)
ABI Opcode Dispatch <-- sub_19CFC30 (2x)
ABI Pass 3 (arch-specific) <-- sub_19DC4B0 (a2=3)
Register Allocation <-- sub_9721C0
Instruction Scheduling
SASS Encoding
The ABI engine produces new SASS instructions via sub_934630 / sub_9314F0 (instruction builder/inserter) and uses sub_91BF30 (temp register allocation) for scratch registers needed during lowering. During final emission, the encoding functions in Zone B (0x1A01000--0x1A76F30) convert the ABI-lowered instructions into binary SASS words.
ABI State Object Layout
The ABI engine operates on three nested data structures: the ABI engine context (the this pointer passed as a1 to all ABI functions), the per-callee ABI specification (one per callee in the call graph), and parameter/return descriptor entries (one per parameter or return value). All offsets are byte offsets from the structure base.
ABI Engine Context
The top-level per-function ABI state, passed as a1 to sub_19D1AF0, sub_19CA730, sub_19CDFF0, and sub_19D1720. Total size is at least 4672 bytes.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 8 | ptr | vtable | Dispatch table; method at +144 dispatches per-callee validation, +152 selects register reservation strategy |
+8 | 8 | ptr | func_ctx | Pointer to per-function compilation context (1716+ bytes); accessed everywhere as *(_QWORD *)(a1+8) |
+16 | 1 | byte | abi_mode_flags | Master ABI mode selector; 0 = no ABI lowering, nonzero = full pipeline |
+64 | 4 | int | max_param_offset | Highest parameter register offset seen during callee iteration |
+76 | 4 | int | preserved_param_start | Start register for preserved parameter range |
+80 | 4 | int | preserved_param_align | Alignment requirement for preserved parameter range |
+88 | 8 | ptr | current_callee_entry | Pointer to the callee entry node being processed in the current iteration |
+97 | 1 | byte | skip_popcount | When set, skips the register usage population count (sub_19C99B0) |
+98 | 1 | byte | has_return_addr_spec | Set to 1 when any callee has a return address ABI specification |
+4428 | 4 | int | cached_reg_R3 | Cached physical register ID for R3 (from sub_7FA420(regfile, 6, 3)) |
+4432 | 4 | int | cached_reg_R2 | Cached physical register ID for R2 (from sub_7FA420(regfile, 6, 2)) |
+4449 | 1 | byte | first_callee_seen | Set after the first callee with an ABI spec is processed; controls whether per-class reservation bitmaps are populated or inherited |
+4456 | 16+ | bitvec | param_alloc_bitmap | Bitvector tracking which physical registers have been assigned to parameters; manipulated via sub_BDBB80 (set bit), sub_BDDCB0 (find highest), sub_BDDD40 (popcount) |
+4472 | 4 | int | param_alloc_count | Number of registers allocated for parameter passing |
+4480 | 16+ | bitvec | retval_alloc_bitmap | Bitvector tracking which physical registers have been assigned to return values |
+4496 | 4 | int | retval_alloc_count | Number of registers allocated for return values |
+4528 | 144 | bitvec[6] | per_class_reservation | Per-register-class ABI reservation bitmaps; 6 entries (classes 1--6), 24 bytes each; the loop in sub_19D1AF0 iterates v148 from 1 to 6, incrementing the pointer by 3 qwords per iteration |
The param_alloc_bitmap and retval_alloc_bitmap are used after parameter/return allocation to compute the effective register file occupancy. The master setup reads the highest set bit in each (sub_BDDCB0) to determine func_ctx+361 (total register demand) and compares against func_ctx+367 (register file limit).
Per-Callee ABI Specification
Pointed to by *(callee_entry + 64). One instance per callee in the call graph. Describes how parameters are passed, return values are received, and the return address is placed. Accessed as v3/v12/v14 (cast to _DWORD *) in the decompiled code, so integer-indexed fields are at 4-byte stride.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 4 | int | param_count | Number of parameter descriptor entries |
+4 | 4 | int | return_count | Number of return value descriptor entries |
+8 | 8 | ptr | param_descriptors | Pointer to array of 32-byte parameter descriptor entries |
+16 | 8 | ptr | return_descriptors | Pointer to array of 32-byte return value descriptor entries |
+24 | 4 | int | return_addr_register | Explicit return address register number; -1 = unassigned |
+28 | 4 | int | return_addr_mode | Return address placement strategy (see table below) |
+32 | 4 | int | first_param_register | First register available for parameter passing; -1 = use default |
+36 | 4 | int | available_reg_count | Number of registers available; -1 = target default, -2 = computed from target descriptor |
+40 | 1 | byte | ret_addr_before_params | If set, return address is placed before the parameter range |
+44 | 4 | int | preserved_reg_type | Preserved register specification type; 1 triggers per-register scratch bitmap construction |
+48 | 8 | uint64 | scratch_gpr_bitmask | Bit 1 (& 2) = scratch classification active for GPR return address register |
+57 | 1 | byte | has_abi_spec | Master enable: 0 = callee has no ABI specification, 1 = specification is active |
+58 | 1 | byte | allocation_complete | Set to 1 after parameter/return allocation finishes successfully |
+64 | 8 | ptr | abi_detail_ptr | Pointer to extended ABI detail sub-object (preserved bitmasks, scratch classification) |
+80 | 8 | uint64 | preserved_pred_bitmask | Per-predicate-register preserved bitmask; bit N = predicate register N is preserved |
+88 | 4 | uint32 | preserved_class_flags | Bit 0 (& 1) = GPR preserved set active; bit 1 (& 2) = scratch classification active |
+96 | 1 | byte | return_addr_validated | Set to 1 after sub_19CDFF0 completes validation for this callee |
Return address mode values (field +28):
| Value | Mode | Behavior |
|---|---|---|
| 1 | Fixed | Return address at first_param_register + 2 (e.g., R6 when base is R4) |
| 2 | Regular | General-purpose register, validated < max_reg |
| 3 | Uniform | Uniform register (UR), requires SM75+ (func_ctx+1408 & 0x02) |
| 5 | Computed | Derived from parameter layout, auto-aligned to even register boundary |
Parameter/Return Descriptor Entry (32 bytes)
Each parameter or return value is described by a 32-byte entry. The allocator iterates the parameter array with stride 32 (v34 += 32 per parameter) and the return array identically (v43 += 32 per return value).
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
+0 | 4 | int | element_count | Number of elements (e.g., 4 for a float4) |
+4 | 4 | int | element_size | Size per element in bytes (e.g., 4 for float) |
+8 | 4 | int | alignment_hint | Alignment in bytes, clamped to [4, 16]; 8 = even-aligned, 16 = quad-aligned |
+12 | 1 | byte | is_register_allocated | 0 = stack-passed (fallback), 1 = register-allocated |
+16 | 4 | int | assigned_register_id | Physical register ID assigned by the allocator (from sub_7FA420) |
The total byte size is element_count * element_size. The register count is ceil(total_bytes / 4), computed as (total + 3) >> 2. The alignment mask applied to register slot selection is -(alignment_hint >> 2), producing a bitmask that enforces natural alignment: 8-byte parameters require even-aligned base registers, 16-byte parameters require 4-register-aligned bases.
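The size and alignment arithmetic above can be sketched as follows (helper names are ours; the constants mirror the decompiled code):

```python
# Illustrative rendering of the descriptor arithmetic: register count is
# ceil(total_bytes / 4), and the alignment mask -(alignment_hint >> 2)
# rounds a candidate register slot down to its natural alignment.

def reg_count(element_count: int, element_size: int) -> int:
    """Registers needed for a parameter: (total_bytes + 3) >> 2."""
    total = element_count * element_size
    return (total + 3) >> 2

def base_mask(alignment_hint: int) -> int:
    """Alignment mask applied to register slot selection."""
    return -(alignment_hint >> 2)

# A float4 parameter: 4 elements x 4 bytes = 16 bytes -> 4 registers,
# base register constrained to a multiple of 4.
assert reg_count(4, 4) == 4
assert (5 & base_mask(16)) == 4   # slot 5 rounds down to a 4-aligned base
assert (7 & base_mask(8)) == 6    # 8-byte params need an even base register
```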
2048-Bit Free-List Bitmap (Stack Local)
The parameter allocator (sub_19CA730) constructs the free-list bitmap as a stack-local variable (not stored in the engine context). It is declared as v103[31] (a 248-byte QWORD array) plus v104 (4 bytes), v105 (2 bytes), and v106 (1 byte), totaling 255 bytes -- 2040 bits, one byte short of the nominal 2048.
Initialization:
memset(v103, 0xFF, 248) // 248 bytes all-ones
v104 = 0xFFFFFFFF // 4 bytes
v105 = 0xFFFF // 2 bytes
v106 = 0xFF // 1 byte
Result: 2040 bits all-ones (255 bytes)
A bit value of 1 means the register slot is free; 0 means occupied. The bitmap is indexed relative to first_param_register, not absolute R0. When a contiguous run of free slots is found for a parameter, the allocator zeroes the corresponding bytes using a size-optimized zeroing sequence (special-cased for lengths < 4, == 4, and >= 8 bytes). After allocation, the assigned registers are also recorded in the persistent bitvectors at +4456 (parameters) and +4480 (return values) via sub_BDBB80.
The bitmap supports up to 2040 register slots, far exceeding the 255-register GPR limit. This over-provisioning accommodates the allocator's use for both parameter and return value allocation in a single bitmap, and provides headroom for potential multi-class allocation in future architectures.
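The contiguous-run search can be modeled as below. This is purely illustrative: the binary works on a 255-byte stack buffer with byte-granular, size-special-cased zeroing, not a Python integer, and the `step` parameter stands in for the alignment mask described above.

```python
# Toy model of the free-list search: find the lowest step-aligned run of
# `count` free (1) bits, then clear the run to mark it occupied (0).

def allocate_run(bitmap: int, nbits: int, count: int, step: int):
    """Return (new_bitmap, start) for the lowest aligned free run,
    or (bitmap, -1) if no run fits."""
    run = (1 << count) - 1
    for start in range(0, nbits - count + 1, step):
        if (bitmap >> start) & run == run:           # all `count` bits free
            return bitmap & ~(run << start), start   # zero = occupied
    return bitmap, -1

free = (1 << 16) - 1                       # 16 free slots
free, a = allocate_run(free, 16, 3, 1)     # 3-register param, no alignment
free, b = allocate_run(free, 16, 4, 4)     # 16-byte param, 4-aligned base
assert (a, b) == (0, 4)
```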
Target Descriptor Fields Referenced by ABI Engine
The ABI engine accesses the target descriptor (at func_ctx+1584) through these offsets during ABI setup:
| Offset | Type | Purpose |
|---|---|---|
+372 | int | SM generation index (value >> 12; 3=Kepler, 4=Maxwell, 5=Pascal+, 9=Hopper, >9=Blackwell) |
+452 | int | SM version number; > 4 gates 64-bit return address pair semantics |
+616 | int | Available register count ceiling for the target |
+636 | int | Register count subtraction base (for computed available_reg_count) |
+896 | vfunc | Register range query; called with (target, func_ctx, &query, 6), returns low/high range pair at query+24 |
+2096 | vfunc | Register class capacity query; called with (target, reg_class) |
+3000 | vfunc | Validator callback; nullsub_464 = no-op (validation skipped) |
The vtable call at +896 takes a 32-byte query structure initialized to {hi=-1 lo=0, 0, 0, 0, 0, 148, 148, -1}. The result at query +24 (as two 32-bit halves) returns the reserved register range boundaries. This is used by warnings 7014 (reserved range overlaps parameters) and 7017 (insufficient registers for reservation).
ABI Validation Diagnostics
The ABI engine emits 16 distinct warning codes in the range 7001--7017 from six functions; codes 7007 and 7018 are unused in this binary version. All emitted codes share the contiguous hex ID range 0x1B59--0x1B69 and flow through two parallel paths: sub_7EEFA0 (standalone diagnostic buffer) and sub_895530 (context-attached diagnostic using the compilation context at *(func+48)).
Complete Warning Catalog
| Code | Hex | Emitter | Message | Trigger |
|---|---|---|---|---|
| 7001 | 0x1B59 | sub_19CDFF0 | "ABI return address %d is unaligned" | return_addr & 1 != 0 (odd register for 64-bit pair) |
| 7002 | 0x1B5A | sub_19CDFF0 | "Return Address (%d) should be less than %d" | return_addr >= max_reg (exceeds register file) |
| 7003 | 0x1B5B | sub_19CDFF0 | "Return address (%d) should not overlap with the stack pointer (%d)" | Stack pointer falls within [return_addr, return_addr+1] |
| 7004 | 0x1B5C | sub_19CDFF0 | "Return Address %d overlaps with parameters in range %d - %d" | Return addr bit set in parameter allocation bitmap |
| 7005 | 0x1B5D | sub_19CDFF0 | "With specified parameters, return address is %d registers and exceeds specified max reg (%d)" | Auto-placed return addr pushed beyond register file limit |
| 7006 | 0x1B5E | sub_19D1720 | "ABI allows either specifying return address or return address before params" | Mode 1 (fixed) with explicit return_addr != -1 |
| 7007 | 0x1B5F | -- | -- | Unused/reserved in this binary version |
| 7008 | 0x1B60 | sub_19CDFF0 | "Return address (%d) should be between %d and %d" | Return addr outside valid range from target vtable query |
| 7009 | 0x1B61 | sub_19CDFF0 | "SM does not support uniform registers for return address" | Mode 3 (uniform) on target without UR support (!(func+1408 & 0x02)) |
| 7010 | 0x1B62 | sub_13B6DF0 | "Relative 32-bit return address requires a caller-save 64-bit scratch register pair" | 32-bit relative call without available scratch pair |
| 7011 | 0x1B63 | sub_19D1AF0 | "Register (%s%d)is defined as scratch on SUSPEND but preserved for coroutine function" | Register in preserved set is scratch in SUSPEND bitmap |
| 7012 | 0x1B64 | sub_19D1720, sub_19D1AF0 | "%d register should not be classified as scratch" | Preserved ABI register (return addr) misclassified as scratch |
| 7013 | 0x1B65 | sub_19CA730 | "%d register used to return value cannot be classified as preserved" | Return-value register appears in preserved bitmap |
| 7014 | 0x1B66 | sub_19CA730 | "Reserved register range %d - %d overlaps with parameters in range %d - %d" | Explicit reserved range collides with parameter range |
| 7015 | 0x1B67 | sub_19C69D0 | "Reserved register range %d - %d overlaps with retAddr %d" | Reserved range collides with return address register |
| 7016 | 0x1B68 | sub_19D1AF0 | "regcount %d specified below abi_minimum of %d" | func.maxreg below generation minimum (16 or 24) |
| 7017 | 0x1B69 | sub_19D1AF0 | "register available %d for reservation is less than the requested number of registers %d " | Available regs after reservation base < requested count |
Diagnostic Emission Architecture
The ABI engine uses three diagnostic emitters:
sub_7EEFA0 (standalone path): Takes a stack buffer, the decimal warning code, and a printf-format string. Used as the fallback when no compilation context is available (when *(*(func)+48) == NULL). This is the path that produces warnings visible in non-context mode (e.g., standalone ptxas invocations).
sub_895530 (context-attached path): Takes the function object, the output context, flags (always 0), the hex warning code, and the format string. Used when the compilation context exists. This is the primary path during normal nvcc-driven compilation.
sub_7F7C10 (conditional emitter): Returns a bool indicating whether the diagnostic was accepted (not suppressed by the diagnostic context at func+1176). Used exclusively for warning 7011 (SUSPEND). When it returns true, the caller additionally invokes sub_8955D0 to attach the diagnostic to the compilation context.
Validation Order
The ABI master setup (sub_19D1AF0) invokes validators in this order:
1. regcount vs. abi_minimum -> 7016
2. register reservation overflow -> 7017
3. return address setup -> 7006, 7012 (sub_19D1720)
4. parameter allocation -> 7013, 7014 (sub_19CA730)
5. reserved range vs. retAddr -> 7015 (sub_19C69D0)
6. return address validation -> 7001-7005, 7008, 7009 (sub_19CDFF0)
7. coroutine SUSPEND validation -> 7011, 7012
Unreferenced ABI Strings
Three ABI-related strings exist in ptxas_strings.json with no cross-references in the decompiled binary. They may be dead code, referenced via indirect dispatch, or used only in debug builds:
- "Caller and callee expected to have different return address register but '%s' and '%s' both use R%d as return address register"
- "Function '%s' specifies register R%d as scratch register which is used as return address register"
- "Mismatch in return address abi when '%s' calls '%s'"
Function Map
| Address | Size | Confidence | Role |
|---|---|---|---|
sub_19C6400 | ~200 | 90% | Convergent boundary single-call checker |
sub_19C69D0 | ~600 | 90% | Reserved register overlap checker |
sub_19C7350 | ~900 | 80% | Register bitmap manipulation helper |
sub_19C7890 | ~600 | 80% | Register range validator |
sub_19C7B20 | ~600 | 80% | Register alignment checker |
sub_19C7D60 | ~700 | 80% | Register pair allocator helper |
sub_19C8040 | ~700 | 80% | Register contiguous-range finder |
sub_19C84A0 | 1927 | 85% | Multi-function register dispatcher |
sub_19C8D30 | ~600 | 80% | Register usage merger |
sub_19C9010 | ~700 | 85% | Per-function register limit setter |
sub_19C92F0 | ~1050 | 85% | Register bitmap AND/OR combiner |
sub_19C99B0 | 2568 | 90% | Register usage population counter |
sub_19CA3C0 | ~300 | 95% | Return address overlap pre-check |
sub_19CA730 | 2277 | 98% | Parameter register allocator |
sub_19CB020 | ~200 | 85% | Shared-mem base address calculator |
sub_19CB230 | ~200 | 85% | Shared-mem offset calculator |
sub_19CB590 | ~350 | 80% | Post-call register restore |
sub_19CB7E0 | ~350 | 80% | Pre-call register save |
sub_19CBAC0 | ~600 | 85% | Shared load (LD.S) ABI lowering |
sub_19CBE00 | ~600 | 85% | Special instruction ABI fixup |
sub_19CC1A0 | 3873 | 95% | Register transfer lowering (STS/LDS) |
sub_19CD0D0 | ~1050 | 85% | Barrier instruction ABI lowering |
sub_19CD510 | ~900 | 85% | Conversion instruction ABI lowering |
sub_19CD950 | ~700 | 85% | Predicate lowering |
sub_19CDDB0 | ~200 | 80% | Reserved SMEM helper |
sub_19CDED0 | ~200 | 85% | SMEM reservation instruction handler |
sub_19CDFF0 | ~7500 | 99% | Return address validator |
sub_19CE590 | ~300 | 90% | Register limit propagator |
sub_19CE6D0 | ~300 | 85% | ABI flag propagator |
sub_19CEEF0 | ~200 | 80% | ABI attribute copier |
sub_19CF030 | ~200 | 80% | Function entry ABI setup |
sub_19CF140 | ~700 | 85% | Register-save sequence builder |
sub_19CF530 | ~350 | 80% | Parameter setup helper |
sub_19CF9A0 | ~600 | 85% | PRMT instruction ABI lowering |
sub_19CFC30 | ~500 | 95% | Opcode-based ABI dispatch |
sub_19D01E0 | ~1200 | 85% | Multi-callee ABI propagation |
sub_19D0680 | ~300 | 80% | Iterator initialization |
sub_19D0A80 | ~200 | 80% | Iterator filter setup |
sub_19D0AF0 | ~100 | 95% | Iterator filter check |
sub_19D0BC0 | ~40 | 95% | Iterator advance (next instruction) |
sub_19D0C10 | ~40 | 95% | Iterator advance (next matching) |
sub_19D0C70 | ~40 | 95% | Iterator advance (skip non-matching) |
sub_19D0CE0 | ~40 | 95% | Iterator advance (reverse) |
sub_19D0EE0 | ~40 | 95% | Iterator reset |
sub_19D1030 | ~200 | 80% | Iterator state query |
sub_19D13F0 | ~4300 | 90% | Convergent boundary checker |
sub_19D1720 | ~4800 | 95% | ABI return address setup |
sub_19D1AF0 | 5608 | 98% | Master ABI setup |
sub_19D32C0 | 1902 | 85% | Per-block register reservation builder |
sub_19D41E0 | 2247 | 85% | CALL instruction ABI lowering |
sub_19D4B80 | 1925 | 85% | Coroutine frame builder |
sub_19D5850 | ~900 | 80% | Shared-mem instruction lowering |
sub_19D5F10 | 1568 | 85% | Coroutine SUSPEND handler |
sub_19D67B0 | ~800 | 80% | Function exit ABI lowering |
sub_19D7160 | ~600 | 85% | Sub-pass: scan for ABI-relevant ops |
sub_19D7470 | 1526 | 80% | Register classification propagator |
sub_19D7A70 | 3313 | 85% | CONV.ALLOC insertion (dead instruction insertion) |
sub_19D8CE0 | ~1100 | 80% | Register save/restore pair generator |
sub_19D9290 | ~1000 | 80% | Register live range computation |
sub_19D9710 | ~1000 | 80% | Register conflict detector |
sub_19D9E00 | ~700 | 95% | gb10b WAR code generator (entry) |
sub_19DA2A0 | ~500 | 95% | gb10b WAR code generator (body) |
sub_19DA8F0 | 1580 | 80% | SSA-form instruction rebuilder |
sub_19DAF20 | ~1300 | 80% | Multi-dest instruction splitter |
sub_19DB440 | ~700 | 80% | Additional register reservation pass |
sub_19DC070 | ~900 | 85% | Sub-pass dispatcher |
sub_19DC4B0 | 6459 | 95% | Per-pass instruction lowering |
sub_19DDEF0 | 1687 | 95% | Reserved SMEM checker |
sub_19DE8F0 | 1842 | 80% | Register renaming for ABI conformance |
sub_19DF170 | 1928 | 80% | Instruction list rewriter |
Instruction Scheduler Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas instruction scheduler is a priority list scheduler with a 3-phase architecture. A single top-level orchestrator (sub_8D0640, ScheduleInstructions) drives three passes through one unified scheduling engine (sub_688DD0), each configured by a mode parameter that selects a different optimization objective: register pressure reduction, ILP/latency hiding, or dynamic batch optimization for tensor warpgroup operations. The scheduler runs twice in the ptxas pipeline -- once before register allocation on virtual registers (pre-scheduling) and once after physical register assignment (post-scheduling).
The scheduler consumes a dependency DAG built over the instruction list and produces a final instruction ordering together with SASS control words encoding stall counts, yield hints, barrier assignments, and scoreboard dependencies. The entire subsystem spans roughly 436 KB of code (0x893000--0x8FE000) with an additional 250 KB of supporting infrastructure in the 0x67F000--0x6A0000 range.
| Orchestrator | sub_8D0640 (22 KB) -- ScheduleInstructions |
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized scheduling loop |
| Priority function | sub_8C9320 (47 KB) -- multi-criteria heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency graph | sub_8CF880 (28 KB) + sub_8D9930 (19 KB) |
| Register budget | sub_8CEE80 (8.7 KB) -- occupancy-aware computation |
| HW latency profiles | sub_8E7300--sub_8E9DC0 -- per-SM tables |
| Opcode table | sub_896D50 (90 KB) -- ROT13-encoded SASS mnemonics |
| Scheduling arena | sub_8E3970 / sub_8E3A80 -- bump allocator |
| Key knobs | 76 Sched* knobs; see Configuration |
| Enable gate | "ScheduleInstructions" named option at (a1+8)+1664 |
3-Phase Pipeline
The orchestrator sub_8D0640 executes the following sequence. All three scheduling phases invoke the same unified engine sub_688DD0 -- the only difference is the mode byte passed as the second argument.
function ScheduleInstructions(sched):
// 1. Build dependency graph
BuildDependencyGraph(sched, func) // sub_8CF880
vtable[29](sched) // InitScheduleData
PreScheduleSetup(sched, opt_level > 2) // sub_8CBAD0
// 2. Gate check
if not KnobGetBool("ScheduleInstructions"):
return
// 3. Set mode flags from knobs 419 (LivenessCountRegComp), 420 (LivenessUseHiLo)
sched.flags |= (knob_419 << 3) | (knob_420 << 4)
// 4. Optionally create register pressure tracker
if sched.flags & 0x10:
sched.scoreboard = alloc(952) // sub_69A1A0
sched.tracker = alloc(208) // sub_6B8F70
if sched.flags & 0x100:
sched.warp_analysis = alloc(856) // sub_6BB7C0
// 5. Reset per-instruction SchedNode fields between passes
// (iterates func+104 metadata chain, NOT instruction list)
for sched_node in func.sched_node_list: // linked via func+104
sched_node.depChainHead = 0 // QWORD +56
sched_node.extendedState = 0 // QWORD +104
sched_node.schedulingCost = 0 // DWORD +76
sched_node.schedulingClass = -1 // DWORD +84, sentinel
// 6. Phase 1 — ReduceReg
if KnobGetBool("ScheduleInstructionsReduceReg"):
ScheduleEngine(sched, mode=0x39, ...) // sub_688DD0
// 7. Phase 2 — Reverse scheduling (ILP / latency)
ReverseSchedule(sched) // sub_8CD6E0
ComputeRegisterBudget(sched) // sub_8CEE80
// 8. Phase 3 — DynBatch
if KnobGetBool("ScheduleInstructionsDynBatch"):
AllocDynBatchData(sched) // sub_8BF890
ScheduleEngine(sched, mode=0x41, ...) // sub_688DD0
// 9. Cleanup
FreeBitvector(sched.bv) // sub_BDC050
ArenaFreeAll(sched.arena) // sub_8E3A80
Phase 1 -- ReduceReg (mode 1, callback 0x39)
Goal: minimize register pressure so the register allocator has headroom. This phase reorders instructions to reduce the maximum number of simultaneously-live virtual registers.
- Enabled by the named option "ScheduleInstructionsReduceReg" (default: on at -O3).
- Register targets set from knobs 776 (SchedReduceIncLimit) and 778 (SchedReduceIncLimitHigh) (defaults approximately 250 and 300).
- The mode byte 0x39 selects the register-pressure-minimizing priority weights inside the unified engine.
- The engine's inner dispatch reads *(DWORD*)(scheduler+60) == 1 to enter the ReduceReg path.
Phase 2 -- ILP / Latency Hiding (mode 0, callback 0x49)
Goal: maximize instruction-level parallelism and hide memory latencies by interleaving independent operations.
- Always runs (no separate enable gate).
- Uses reverse post-order BB iteration via sub_8CD6E0: iterates basic blocks from last to first after resetting liveness with sub_781F80(func, 0).
- Computes a register budget capped at min(archLimit, 0.95 * maxRegs) via sub_8CEE80.
- The mode byte 0x49 selects latency-oriented priority weights.
- After this phase, sub_8CF5D0 evaluates dual-issue eligibility and produces a dual-issue benefit score stored at scheduler+328.
Phase 3 -- DynBatch (mode 2, callback 0x41)
Goal: batch-aware scheduling for GMMA/WGMMA warpgroup tensor operations. Groups tensor instructions into batches that can execute as warpgroup-cooperative operations with minimal pipeline stalls.
- Enabled by the named option "ScheduleInstructionsDynBatch"; only activates when the function has varying instruction counts across BBs.
- Controlled by knob 742 (SchedCrossBlock, cross-block scheduling mode).
- Reads stall/batch depth limits from knobs 805 (SchedTexBatchTargetSelectRegisterTarget), 741 (SchedCountLoadsPerTex), 761 (SchedMaxRLiveOKslack), and 762 (SchedMaxRLiveOKslackColdBlocks).
- Allocates a 184-byte DynBatch context (sub_8BF890) with an 8 * numBBs sub-array for per-BB batch tracking.
- Context initialization (sub_8C1BA0) sets the batch window to 0xFFFFFFFF (sentinel) and copies register liveness from func+832.
- The mode byte 0x41 selects batch-aware priority weights.
DynBatch Context Object (184 bytes)
sub_8BF890 allocates a 184-byte DynBatch context from the scheduling arena at sched+840 and stores the pointer at sched+272. The object is a flat structure containing a function context reference, a 20-slot working array, and a pointer to a variable-length per-BB sub-array.
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 8 | ptr | funcCtx | funcContext | Pointer to CompilationContext (copied from sched+8) |
| +8 | 160 | QWORD[20] | 0 | batchWorkArray | Fixed-size working array for batch state tracking; likely holds instruction pointers or batch boundary markers during scheduling |
| +168 | 8 | ptr | alloc'd | perBBArray | Per-BB batch tracking sub-array; 8 * numBBs bytes, zero-initialized. Each 8-byte entry holds a batch start/end instruction pointer for one basic block |
| +176 | 4 | DWORD | 0 | flags | Status/control flags |
| +180 | 4 | -- | -- | (padding) | Pad to 184-byte allocation |
The per-BB sub-array size is derived from *(sched+392) (maxBBSizeForAlloc), with an overflow check capping the multiplication at 0xFFFFFFFFFFFFFFF (2^60 - 1) entries.
DynBatch Working State (in scheduler context)
The bulk of the DynBatch working state lives directly in the scheduler context, initialized by sub_8C1BA0 (InitDynBatchState). These fields are used by the priority function during Phase 3 scheduling.
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +464 | 4 | int32 | 0 | batchSlotCount | Number of instructions accumulated in the current batch |
| +468 | 4 | int32 | -- | prevBatchSize | Size of previously-completed batch |
| +476 | 4 | int32 | adj | adjustedBatchTarget | Adjusted batch depth target; capped to min(maxStallCycles, batchTargetCount), halved when 2 * maxStall > target |
| +480 | 4 | int32 | -- | lastBatchEndPos | Scheduling position of the last instruction in the current batch |
| +488 | 8 | QWORD | 0xFFFFFFFF | batchWindow | Batch window start BB offset; sentinel 0xFFFFFFFF means "no batch active" |
| +492 | 4 | int32 | 0 | regDelta | Register pressure delta accumulator across batch boundaries |
| +496 | 4 | int32 | 0 | maxRegInBatch | Maximum register pressure observed within current batch |
| +500 | 4 | int32 | from +72 | regBaseCount | Base register count; copied from sched+72, reset on batch boundary |
| +504 | 8 | QWORD | 0 | maxRegSpan | Maximum register span (pressure peak minus baseline) across all batches |
| +508 | 4 | int32 | 0 | regBaseline | Register count baseline for delta computation |
| +512 | 4 | int32 | 0 | minOverflowCost | Minimum overflow cost; updated when batch exceeds register budget |
| +516 | 4 | int32 | -1 | batchDepthLimit | Per-batch maximum depth; -1 = unlimited (overwritten from BB analysis) |
| +520 | 1 | byte | 0 | batchOverflow | Set to 1 when batch exceeds register budget + base count |
| +521 | 1 | byte | 0 | batchAbort | Set to 1 when opcode 96 (WGMMA commit) detected with sched+524 flag |
| +536+ | var | ptr[] | -- | batchSlots | Array of instruction pointers in the current batch; sched+536 + 8*i for slot i |
The batch target adjustment algorithm in sub_8C1BA0:
adjustedTarget = maxStallCycles // from sched+404
if maxStallCycles > batchTargetCount:
adjustedTarget = batchTargetCount // cap to target
if batchTargetCount > maxStallCycles:
if batchMode == 0 and maxStallCycles < batchTargetCount:
if batchTargetCount >= 2 * maxStallCycles:
adjustedTarget = batchTargetCount / ceil(batchTargetCount / maxStallCycles)
else:
adjustedTarget = batchTargetCount / 2
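The adjustment pseudocode above, rendered as runnable Python (integer division assumed, matching the decompiled arithmetic; parameter names are ours):

```python
from math import ceil

def adjusted_batch_target(max_stall: int, target: int, batch_mode: int) -> int:
    """Sketch of sub_8C1BA0's batch target adjustment."""
    adjusted = min(max_stall, target)        # cap to the smaller of the two
    if target > max_stall and batch_mode == 0:
        if target >= 2 * max_stall:
            # split the target into roughly stall-sized chunks
            adjusted = target // ceil(target / max_stall)
        else:
            adjusted = target // 2
    return adjusted

assert adjusted_batch_target(8, 4, 0) == 4    # target below stall limit: capped
assert adjusted_batch_target(4, 16, 0) == 4   # 16 // ceil(16/4)
assert adjusted_batch_target(4, 6, 0) == 3    # under 2x the stall limit: halved
```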
When a batch boundary is detected (instruction's BB start offset exceeds the batch window), sub_8C1BA0 evaluates the batch: it computes the register pressure delta, checks whether the batch overflows the combined register budget (regBaseCount + regDelta + maxRegSpan), and either accepts the batch or trims it by walking backward through the batchSlots array to find a smaller valid batch.
Unified Scheduling Engine
sub_688DD0 (20 KB) is the single engine that all three phases invoke. Its behavior is parameterized by:
- Mode byte (argument a2): 0x39 = ReduceReg, 0x49 = ILP/Latency, 0x41 = DynBatch.
- Rebuild flag (argument a4): when true, reconstructs the dependency DAG via sub_6833F0.
- Vtable dispatch: uses *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks.
function ScheduleEngine(sched, mode, arg3, rebuild):
if rebuild:
InitScheduleRegion(sched) // sub_6833F0
// allocates 72-byte per-BB records, queries knobs 595 (PreserveSchedOrderSame), 743 (SchedCrossBlockInstsToSpeculate), 747 (SchedCrossBlockTexToSpeculate)
for each bb in sched.basic_blocks:
// 10 register pressure counters from per-BB record +4..+40 into context +48..+87
InitResourceTracking(sched, bb) // sub_A091C0
ReadyList = BuildReadyList(sched) // sub_6820B0
while ReadyList is not empty:
best = SelectBestInstruction(sched) // via priority vtable
ScheduleInstruction(sched, best) // sub_682200
UpdateResourceState(sched, best) // sub_A09530
UpdateWARTracking(sched, best) // sub_A09D40
RelinkInstruction(best) // sub_925510
// Update dependency counts, add newly-ready instructions
for each successor of best:
successor.dep_count -= 1
if successor.dep_count == 0:
ReadyList.insert(successor)
The engine manages 10 register pressure counters at scheduler context offsets 48--87 (copied from the per-block record offsets +4--+40 at BB entry). These correspond to the GPU register classes: R (general), P (predicate), UR (uniform), UP (uniform predicate), B (barrier), and 5 architecture-specific classes. Counter [0] (R class) uses a separate update path; counters [1]--[9] are decremented from a per-opcode resource cost table during the scheduling loop.
Ready List Construction
sub_6820B0 (1.5 KB) builds the initial ready list by scanning the instruction linked list for nodes with zero unsatisfied dependencies.
function BuildReadyList(sched):
for instr in sched.instruction_list:
if instr.opcode == 52: // NOP/BB boundary
continue // follow through to real instruction
if instr.dep_count == 0:
instr.next_ready = sched.ready_head
sched.ready_head = instr
vtable_call(sched, 104, instr) // ready-list insertion callback
instr.latency_counter = 0
The ready list is maintained as a sorted linked list (via pointer at instruction offset +16). The priority function determines sort order.
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority score as a single integer packed from 8-bit fields, each field encoding one heuristic factor. The function uses approximately 200 local variables and a 0x330-byte stack frame.
Priority Factors
| Factor | Source | Weight adjustment |
|---|---|---|
| Register pressure | Current live count vs budget at sched+432 | Primary factor in ReduceReg mode |
| Instruction latency | sub_693BC0 latency query | Primary factor in ILP mode |
| Critical path position | DAG depth from sched+464, sched+380 | Favors critical-path instructions |
| FU contention | 10-element resource vector via sub_8C7290 | Avoids saturating a single pipe |
| Hot/cold memory | sub_A9CDE0 (hot=global) / sub_A9CF90 (cold=const) | Prioritizes latency-sensitive ops |
| Anti-dependency | WAR hazard cost | Breaks ties with anti-dep distance |
| Barrier dependencies | Barrier flag at instr+376 | Defers barrier-blocked instructions |
| Priority queue depth | Knob 770 (default 4) | Limits lookahead window |
Priority Encoding
The priority value is packed into an integer with 8-bit fields. Each field is computed from the corresponding factor and shifted into position. The packed encoding allows the ready list to maintain sort order with a single integer comparison, avoiding multi-key sorting overhead.
Key subroutines called during priority computation:
| Address | Purpose |
|---|---|
sub_8C67A0 | Compute resource cost for instruction and update BB resource table |
sub_8C7120 | Barrier tracking update |
sub_8C7290 | Copy 10-element resource vector from per-BB slot (SSE-optimized) |
sub_8C7720 | Instruction reordering within BB (red-black tree operations) |
sub_693BC0 | Memory space classification / latency query |
sub_6818D0 | Register count to hardware-aligned unit conversion |
Resource Tracking
The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset (within slot) | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per FU |
| 40--76 | 10 x int32 | Resource pressure delta |
| 80 | int32 | BB-entered flag and auxiliary bits |
The 10 functional unit pipes (inferred from resource model queries):
| Index | Pipe | Typical instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS, LDS, STS |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
sub_8C67A0 computes per-instruction resource costs by calling the resource model (sub_A08A00) three times:
- Mode 1: the instruction's own execution cost
- Mode 2: operand release costs for last-use operands
- Mode 3: combined instruction + BB-level impact
SSE intrinsics (_mm_add_epi32) are used for vector accumulation.
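An illustrative model of this three-mode accumulation (the cost vectors below are made up; the binary sums 10-element per-FU vectors with _mm_add_epi32):

```python
# Each resource-model query returns a 10-element cost vector, one slot
# per functional-unit pipe; the three modes' results are summed elementwise.

def accumulate_costs(*vectors):
    total = [0] * 10
    for vec in vectors:
        total = [a + b for a, b in zip(total, vec)]
    return total

exec_cost    = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # mode 1: own execution cost
release_cost = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # mode 2: last-use operand release
bb_impact    = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # mode 3: combined BB-level impact
assert accumulate_costs(exec_cost, release_cost, bb_impact) == \
       [2, 0, 0, 0, 2, 0, 0, 0, 0, 0]
```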
Register Budget
sub_8CEE80 (8.7 KB) computes the occupancy-aware register budget that the scheduler respects during instruction ordering.
function ComputeRegisterBudget(sched):
hw = sched.func.sm_backend // at func+1584 (provides hw latency profiles)
maxRegs = hw[154] // architecture register limit
coeff = KnobGetDouble(740) // default 0.045
if KnobGetBool(763): // budget disabled
budget = hw[157] // use fixed count from profile
else:
physRegs = VirtToPhys(sched, maxRegs) // sub_A99FE0
budget = physRegs - (physRegs >> 6) // 98.4% utilization
// For sm_50: apply special dual-issue budget
if arch_id == 5:
budget = DualIssueBudget(budget)
pressureCurve = ComputePressureCurve(sched, budget - 2) // sub_8CE520
// Piecewise linear model with parameters (4, 2, 6)
sched.regBudget = budget // offset +432
sched.committedTarget = ... // offset +324
sched.minRegs = ... // offset +316
sched.pressureSlack = ... // offset +320
The register pressure curve (sub_8CE520) uses a piecewise linear model parameterized by (4, 2, 6) or a custom string-encoded function from knob 750 (SchedEstimatedLoopIterations).
Dependency Graph
The dependency DAG is built in two stages:
Stage 1: Pre-scheduling scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order. For each BB:
- Checks knobs 314 (FenceInterference) / 313 (FenceCode) for per-instruction scheduling fence conditions
- Walks the instruction linked list, identifying NOP/control instructions
- Builds dependency edges via sub_8D9930
- Manages memory arenas with SSE-optimized copies for instruction metadata arrays
Stage 2: Edge construction (sub_8D9930, 19 KB)
For each pair of instructions in a BB, checks for:
- RAW (true) dependencies: read-after-write on the same register
- WAR (anti) dependencies: write-after-read
- WAW (output) dependencies: write-after-write
- Memory dependencies: through shared/global memory (conservative ordering)
- Barrier dependencies: through barrier/sync instructions
Uses operand analysis from sub_894290 (27 KB) which processes 16-bit operand descriptors encoding register class, bank, and dependency type.
Supplementary dependency builders
| Address | Size | Purpose |
|---|---|---|
sub_68A690 | 31 KB | BuildDependencies -- def-use chain construction |
sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level edges |
sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints |
sub_6A78F0 | 23 KB | ProcessOperands -- operand dependency extraction |
Pre-Scheduling Setup
sub_8CBAD0 (2.9 KB) performs BB scanning and resource allocation before the scheduling passes begin.
Key behaviors:
- Counts instructions per basic block. If any BB exceeds 4095 instructions, it inserts a scheduling barrier (sub_931920) to split the block.
- Tracks maximum BB size at scheduler+388.
- Detects opcode 246 (texture operations) and sets scheduler+384 = 1.
- Allocates per-slot arrays:
  - scheduler+672: 84-byte scheduling slots (resource tracking)
  - scheduler+280: 48-byte analysis slots (if opt_level > 2)
  - scheduler+248, scheduler+256: register pressure bitvectors sized to (numRegs+1) or (2*numRegs+2) if knob 420 (LivenessUseHiLo, dual-register tracking) is active
Pre-Scheduling vs Post-Scheduling
The scheduler runs at two distinct points in the ptxas pipeline:
| Aspect | Pre-scheduling | Post-scheduling |
|---|---|---|
| Timing | Before physical register allocation | After physical register allocation |
| Register model | Virtual registers | Physical registers |
| Primary goal | Reduce register pressure, order for regalloc | Hide latencies, minimize stalls |
| Phases active | All 3 (ReduceReg, ILP, DynBatch) | Refinement pass |
| Budget source | Occupancy model estimate | Actual allocation result |
| Entry | sub_8D0640 | sub_7F5D50 / sub_A97600 (42 KB) |
Post-scheduling uses the actual physical register assignments for precise dependency distances and can make final decisions about stall insertion and scoreboard barrier placement.
Scheduling Variants
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies, each implementing a different approach or targeting a different code pattern:
| Address | Size | Strategy | Notes |
|---|---|---|---|
sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping |
sub_8B77C0 | 15 KB | Dual-issue scheduling | Pair co-issuable instructions |
sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection |
sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA grouping |
sub_8BAAE0 | 15 KB | Loop-aware scheduling | Trip count + register awareness |
sub_8B6D60 | 12 KB | Pressure-optimized | Minimize live range overlap |
sub_8B5400 | 14 KB | Latency-optimized | Maximize memory latency hiding |
sub_8B1190 | 16 KB | Backtracking scheduler | Undo and retry on conflict |
sub_8B2D90 | 18 KB | Global schedule optimization | Cross-BB considerations |
sub_8B4590 | 13 KB | Permutation search | Try schedule permutations |
sub_8A9D80 | 21 KB | Depth-first scheduling | DFS-based instruction ordering |
sub_8AB750 | 9.8 KB | Critical path computation | DAG analysis for priorities |
sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Memory prefetch insertion |
sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merge adjacent barriers |
sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimize scoreboard usage |
sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning |
sub_8BE320 | 25 KB | Complex scheduling pass | Multi-strategy combined pass |
These variants are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level.
Hardware Latency Profiles
Per-architecture latency and throughput tables are constructed by a family of functions at 0x8E7300--0x8E9DC0. Each table specifies pipeline latencies (integer, FP32, FP64, tensor, memory), scoreboard wait counts, barrier stall cycles, and dual-issue pair compatibility for the target GPU.
| Address | Architecture | Size |
|---|---|---|
sub_8E7300 | sm_70 (Volta) | 3.3 KB |
sub_8E7540 | sm_72 | 2.9 KB |
sub_8E7720 | sm_75 (Turing) | 3.5 KB |
sub_8E7940 | sm_80 base | 2.9 KB |
sub_8E7B40 | sm_80 (Ampere) | 3.3 KB |
sub_8E7D80 | sm_86 | 4.4 KB |
sub_8E8070 | sm_87 | 3.5 KB |
sub_8E8280 | sm_89 (Ada Lovelace) | 3.1 KB |
sub_8E8480 | sm_90 (Hopper) | 5.2 KB |
sub_8E8780 | sm_90a | 4.6 KB |
sub_8E8A90 | sm_100 (Blackwell DC) | 3.0 KB |
sub_8E8DB0 | sm_103 (Blackwell Ultra) | 1.7 KB |
sub_8E9000 | sm_120 (RTX 50xx) | 2.9 KB |
sub_8E92E0 | sm_120 extended | 5.5 KB |
sub_8E97B0 | Universal fallback | 8.8 KB |
The warp-level hardware profile (sub_8E4400) maps architecture IDs to dispatch parameters:
| Architecture range | Warps | Dispatch slots | Era |
|---|---|---|---|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| <= 24575 | 6 | 176 | sm_60 (Pascal) |
| <= 28672 | 7 | 192 | sm_70 (Volta) |
| <= 32767 | 7 | 208 | sm_75 (Turing) |
| <= 36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |
Sub-architecture variants (stored at profile offset +26) are assigned by specific SM version codes: 8193, 20481, 24576, 28674--28677, 32768, 36864--36869.
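The threshold mapping in sub_8E4400 reduces to a simple range lookup; the thresholds and values below are taken directly from the table above:

```python
# Sketch of the architecture-ID -> dispatch parameter mapping recovered from
# sub_8E4400. Returns (warps, dispatch_slots).

def warp_profile(arch_id: int):
    table = [
        (20479, (4, 96)),    # sm_50 (Maxwell)
        (24575, (6, 176)),   # sm_60 (Pascal)
        (28672, (7, 192)),   # sm_70 (Volta)
        (32767, (7, 208)),   # sm_75 (Turing)
        (36863, (8, 224)),   # sm_80 (Ampere)
    ]
    for limit, params in table:
        if arch_id <= limit:
            return params
    return (16, 240)         # sm_90+ (Hopper, Blackwell)
```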
See Latency Model for per-opcode latency tables and functional unit mapping.
Scheduling Knobs
The scheduler reads approximately 76 knobs. The most significant ones (names decoded from ROT13 in the binary):
| Knob ID | Name | Type | Default | Purpose |
|---|---|---|---|---|
| 313 | FenceCode | when-list | -- | Skip scheduling for specific opcodes (per-instruction WHEN condition) |
| 314 | FenceInterference | when-list | -- | Mark interference fences for specific opcodes |
| 419 | LivenessCountRegComp | int32 | -- | Forward scheduling mode flag (bit 3 in sched+1376) |
| 420 | LivenessUseHiLo | int32 | -- | Dual-register hi/lo tracking (bit 4 in sched+1376) |
| 487 | -- | bool | true | Master scheduling/peephole enable |
| 510 | OptimizeUniformAtomicMode | int32 | -- | BB pre-optimization mode for uniform atomics |
| 595 | PreserveSchedOrderSame | when-list | -- | Preserve scheduling order (per-instruction WHEN condition) |
| 740 | SchedBumpScaleAugmentFactor | double | 0.045 | Register pressure bump scale augmentation coefficient |
| 741 | SchedCountLoadsPerTex | int32 | 3 | Load count per texture operation (stall threshold) |
| 742 | SchedCrossBlock | int32 | -- | Cross-block scheduling mode |
| 743 | SchedCrossBlockInstsToSpeculate | int32 | -- | Cross-block instruction speculation count |
| 747 | SchedCrossBlockTexToSpeculate | int32 | -- | Cross-block texture speculation count |
| 750 | SchedEstimatedLoopIterations | string | -- | Estimated loop iteration count override |
| 760 | SchedMaxRLiveCarefulSlack | int32 | -- | Reserved register headroom (careful slack for live registers) |
| 761 | SchedMaxRLiveOKslack | int32 | -- | Acceptable live-register slack (batch depth on non-sm_50) |
| 762 | SchedMaxRLiveOKslackColdBlocks | int32 | -- | Extra register slack for cold basic blocks |
| 763 | SchedMaxRTarget | int32 | -- | Maximum register target; 0 disables register budget |
| 769 | SchedPrefFurthestDep | when-list | -- | Per-BB scheduling query: prefer furthest dependency |
| 770 | SchedReadAvailTarget | int32 | 4 | Priority queue depth (read-availability lookahead window) |
| 776 | SchedReduceIncLimit | int32 | ~250 | Forward pass primary register increment limit |
| 778 | SchedReduceIncLimitHigh | int32 | ~300 | Forward pass secondary (high) register increment limit |
| 805 | SchedTexBatchTargetSelectRegisterTarget | int32 | -- | Texture batch register target stall limit (capped at 16) |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | int32 | -- | Texture batch scheduler target stall limit (capped at 16) |
Knob names are stored ROT13-encoded in the binary (see Knobs System for the obfuscation scheme). The when-list type marks knobs that support per-instruction or per-BB conditional overrides via WHEN= syntax.
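Since the obfuscation is plain ROT13 over letters, knob names can be recovered with a one-liner. The encoded string below is constructed for illustration, not extracted from the binary:

```python
# ROT13 decode of obfuscated knob names, as described above.
# codecs.decode(..., "rot13") applies ROT13 to ASCII letters only.
import codecs

def decode_knob_name(obfuscated: str) -> str:
    return codecs.decode(obfuscated, "rot13")
```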
The full scheduling context configuration is performed by sub_A95DC0 (35 KB), which reads dozens of knob values and populates the scheduling context structure.
Data Flow Analysis
The scheduler includes a dedicated data flow analysis subsystem (0x8DBAF0--0x8DF1C0) that computes register liveness and propagates def-use information across BB boundaries:
| Address | Size | Purpose |
|---|---|---|
sub_8DB070 | 8.2 KB | Initialize liveness data structures |
sub_8DB5F0 | 8.4 KB | Compute per-BB liveness |
sub_8DBAF0 | 16 KB | Full liveness analysis |
sub_8DC3F0 | 3.0 KB | Compute data flow state |
sub_8DC620 | 3.3 KB | Update data flow on schedule |
sub_8DC880 | 10 KB | Propagate data flow information |
sub_8DCF20 | 23 KB | Build data flow graph for scheduling |
sub_8DE7A0 | 12 KB | Iterative data flow solver (fixed-point) |
sub_8DEF90 | 2.0 KB | Finalize data flow |
The iterative solver runs until convergence, updating per-BB liveness sets. This information feeds into the priority function's register pressure estimates and into the register budget computation.
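The fixed-point iteration is the textbook backward liveness dataflow. A minimal sketch, with an illustrative CFG representation (use/def sets plus successor lists) standing in for the real per-BB structures:

```python
# Minimal round-robin liveness solver of the kind sub_8DE7A0 implements:
# iterate live-in/live-out over the CFG until nothing changes.

def solve_liveness(bbs, succs):
    """bbs: {name: (use_set, def_set)}; succs: {name: [successor names]}."""
    live_in = {b: set() for b in bbs}
    live_out = {b: set() for b in bbs}
    changed = True
    while changed:                           # run until convergence (fixed point)
        changed = False
        for b, (use, defs) in bbs.items():
            out = set().union(*(live_in[s] for s in succs[b])) if succs[b] else set()
            new_in = use | (out - defs)      # live-in = use U (live-out - def)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in, live_out
```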
Scheduling Output
After instruction ordering is determined, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) converts the abstract schedule into SASS control words:
function EmitScheduleForBB(sched, bb):
for each instruction in scheduled order:
stall = ComputeStallCycles(sched, instr) // distance to consumer
yield = ComputeYieldHint(sched, instr) // warp scheduling hint
barrier = AssignBarrier(sched, instr) // 6 barriers available
sb_deps = ComputeScoreboardDeps(sched, instr) // read/write dependencies
control_word = EncodeControlWord(stall, yield, barrier, sb_deps)
EmitControlWord(instr, control_word)
Key encoding functions:
| Address | Purpose |
|---|---|
sub_8F1EB0 | Main schedule encoding entry |
sub_8F3130 | Encode stall count field |
sub_8F31F0 | Encode barrier field |
sub_8F3650 | Encode yield hint field |
sub_8F3860 | Encode scoreboard dependency field |
sub_8F4140 | Encode complete control word |
sub_8F6530 | Output complete schedule for function |
Seven verification functions at 0x8F7610--0x8F8CB0 validate the generated schedule: stall counts, barrier assignments, dependency chains, scoreboard correctness, control word format, yield hints, and overall schedule integrity.
See Scoreboards for the scoreboard and dependency barrier encoding format.
Memory Management
The scheduler uses two allocator strategies:
- Arena allocator (sub_8E3970): bump allocator with 10 KB block granularity, 8-byte alignment. Allocations within a scheduling pass use the arena for fast allocation. sub_8E3A80 frees all blocks at once at pass completion.
- Free-list allocator (sub_8DA6D0): free-list with block coalescing for persistent scheduling data. Maintains multiple free lists for different size classes. Blocks larger than 0x1FF bytes go to a separate large-block list. Adjacent free blocks are merged on deallocation.
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 (sched_slot) to a separate heap-allocated scheduling metadata block called a SchedNode. The metadata offsets documented throughout the scheduling pages (e.g., metadata+24, metadata+32, metadata+108) are relative to this SchedNode, not to the 296-byte Ori instruction object itself. The SchedNode block is at least 112 bytes; all nodes are linked into a singly-linked list at func+104 (Code Object offset +104), separate from the instruction linked list at func+272.
SchedNode Layout
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 8 | ptr | -- | nextInList | Singly-linked next pointer for the func+104 metadata chain |
| +8 | 4 | i32 | 0 | depCount | Unsatisfied dependency count; decremented as predecessors are scheduled; instruction is ready when this reaches 0 |
| +12 | 4 | -- | -- | (pad) | Alignment padding |
| +16 | 8 | ptr | -- | nextReady | Ready list singly-linked next pointer; threaded by sub_6820B0 (BuildReadyList) |
| +24 | 4 | i32 | seq | bbSlot | 1-based position within the BB (assigned sequentially by sub_8D9930); used for program-order tiebreaking in priority decisions |
| +28 | 4 | i32 | 0 | latencyCounter | Remaining latency cycles until the instruction's result is available; reset to 0 when placed on the ready list; updated by sub_A09530 (UpdateStallCycles) |
| +32 | 4 | i32 | -- | earliestCycle | Earliest available cycle -- the latest completion time among all producer instructions; stall-free when earliestCycle >= scheduler+480 (current cycle) |
| +36 | 4 | -- | -- | (reserved) | Alignment padding or internal use |
| +40 | 4 | i32 | 0 | latestDeadline | Latest deadline cycle for scheduling; secondary tiebreaker in the candidate comparison cascade |
| +44 | 4 | i32 | -- | barrierGroupIndex | Barrier group assignment; identifies which of the 6 hardware barriers this instruction participates in |
| +48 | 4 | i32 | -- | schedulingFenceCode | Scheduling fence code from knob 313 (FenceCode) / 314 (FenceInterference) checks; controls per-instruction scheduling boundaries |
| +56 | 8 | i64 | 0 | depChainHead | Dependency chain data; reset to 0 between scheduling passes |
| +76 | 4 | i32 | 0 | schedulingCost | Per-instruction scheduling cost; accumulated during priority evaluation; reset between passes |
| +84 | 4 | i32 | -1 | schedulingClass | Scheduling class index assigned by the latency model (sub_89FBA0); indexes into per-architecture latency tables; -1 = unclassified (sentinel) |
| +88 | 4 | i32 | -- | maxPredecessorCycle | Highest cycle value among predecessor instructions; used in the priority pre-scan to compute max_pred_cycle |
| +92 | 4 | i32 | -- | maxDependencyCycle | Highest cycle value along the dependency chain; used to compute max_dep_cycle for critical-path analysis |
| +104 | 8 | i64 | 0 | extendedState | Extended scheduling state; reset to 0 between scheduling passes |
| +108 | 1 | byte | -- | flags | Primary flag byte: bit 0 = barrier-target, bit 1 = has-dependency-set, bit 2 = fence-early (knob 314), bit 3 = fence-late (knob 313), bit 4 = has-register-operand |
| +111 | 1 | byte | -- | extendedFlags | Extended flags: bit 7 = uses expensive register file (triggers barrier tracking update in sub_8C7120) |
Relationship to the Instruction Object
Ori Instruction (296 bytes) SchedNode (>= 112 bytes)
+--------------------------+ +---------------------------+
| +0: prev (BB list) | instr+40 | +0: nextInList |
| +8: next (BB list) |---sched_slot--> |
| +16: id | | +8: depCount |
| +72: opcode | | +16: nextReady |
| +80: operand_count | | +24: bbSlot |
| +84: operands[] | | +28: latencyCounter |
| | | +32: earliestCycle |
| | | +40: latestDeadline |
| | | +88: maxPredecessorCycle |
| | | +92: maxDependencyCycle |
| | | +108: flags |
+--------------------------+ +---------------------------+
Lifecycle
- Allocation: InitScheduleData (vtable[29], called from sub_8D0640) allocates one SchedNode per instruction from the scheduling arena and stores the pointer at instr+40. Nodes are linked into the func+104 chain.
- Initialization: sub_8D9930 (EdgeBuilder) initializes depCount, bbSlot, latencyCounter, latestDeadline, and flags while building dependency edges. Between scheduling phases, the orchestrator resets pass-specific fields: +56 = 0, +104 = 0, DWORD +76 = 0, DWORD +84 = -1.
- Population: The dependency graph builder populates depCount from edge analysis. Critical-path computation fills earliestCycle, maxPredecessorCycle, and maxDependencyCycle.
- Use: sub_6820B0 (BuildReadyList) checks depCount == 0 and threads ready instructions via nextReady. sub_8C9320 (PriorityFunction) reads all fields to compute the 8-bit scheduling priority.
- Cleanup: sub_8E3A80 (ArenaFreeAll) reclaims all SchedNode blocks when the scheduling pass completes.
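The ready-list threading in the Use step can be sketched directly from the SchedNode fields. The node class below mirrors only the two fields involved (depCount, nextReady); everything else is omitted for illustration:

```python
# Sketch of ready-list construction in the style of sub_6820B0: nodes whose
# depCount has reached zero are threaded through nextReady into a singly-
# linked list. The SchedNode class here is a minimal stand-in.

class SchedNode:
    def __init__(self, name, dep_count):
        self.name = name
        self.depCount = dep_count
        self.nextReady = None

def build_ready_list(nodes):
    head = None
    for n in reversed(nodes):     # push in reverse so the list keeps program order
        if n.depCount == 0:
            n.nextReady = head
            head = n
    return head
```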
Sentinel Values
- bbSlot = -1: unscheduled (set during inter-pass reset at DWORD +84)
- latencyCounter = 99999 (0x1869F): infinity (used as min_barrier_latency initial value in the priority pre-scan)
- earliestCycle bit 31 set (>= 0x80000000): not-yet-available (tested in sub_8C9320 pre-scan via < 0x80000000 comparison)
Large Function Handling
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is split into chunks that are scheduled independently and then merged. This avoids quadratic blowup in the dependency DAG construction for very large kernels.
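A naive version of the split looks like this. The 0x3FFF threshold comes from the text; the split points chosen here are purely positional, whereas sub_A9DDD0 presumably respects basic-block boundaries when forming chunks:

```python
# Sketch of chunk-based scheduling for oversized functions. Splitting caps
# the dependency DAG size per chunk, avoiding quadratic edge construction.

CHUNK_LIMIT = 0x3FFF  # 16383 instructions

def split_into_chunks(instrs):
    if len(instrs) <= CHUNK_LIMIT:
        return [instrs]
    return [instrs[i:i + CHUNK_LIMIT] for i in range(0, len(instrs), CHUNK_LIMIT)]
```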
Per-Block Scheduling Record (72 bytes)
sub_6833F0 (InitScheduleRegion, 10 KB) allocates an array of (numBBs + 1) records at 72 bytes each, stored at scheduler+184. Each record tracks the register pressure snapshot, region context pointers, and scheduling characteristic flags for a single basic block. The scheduling engine loads a BB's pressure state from this record at region entry and saves it back when moving to the next BB.
Field Map
| Offset | Size | Type | Init | Name | Purpose |
|---|---|---|---|---|---|
| +0 | 4 | i32 | 0 | crossBlockId | Non-zero when the BB is active/scheduled; set to the predecessor BB index during cross-block merging. Tested as a boolean gate by 8+ functions before processing a BB. |
| +4 | 4 | i32 | 0 | pressure[0] | Register pressure snapshot -- R (general-purpose 32-bit registers) |
| +8 | 4 | i32 | 0 | pressure[1] | Register pressure snapshot -- P (predicate registers) |
| +12 | 4 | i32 | 0 | pressure[2] | Register pressure snapshot -- UR (uniform registers) |
| +16 | 4 | i32 | 0 | pressure[3] | Register pressure snapshot -- UP (uniform predicate registers) |
| +20 | 4 | i32 | 0 | pressure[4] | Register pressure snapshot -- B (barrier registers) |
| +24 | 4 | i32 | 0 | pressure[5] | Register pressure snapshot -- arch-specific class 0 |
| +28 | 4 | i32 | 0 | pressure[6] | Register pressure snapshot -- arch-specific class 1 |
| +32 | 4 | i32 | 0 | pressure[7] | Register pressure snapshot -- arch-specific class 2 |
| +36 | 4 | i32 | 0 | pressure[8] | Register pressure snapshot -- arch-specific class 3 |
| +40 | 4 | i32 | 0 | pressure[9] | Register pressure snapshot -- arch-specific class 4 / control total |
| +44 | 4 | -- | -- | (padding) | Not initialized, not accessed |
| +48 | 8 | ptr | 0 | regionContext | Pointer to 136-byte per-region scheduling state allocated by sub_682F10. Contains region boundaries, mode flags, and instruction range metadata. |
| +56 | 8 | ptr | 0 | regionContext2 | Second region context pointer, written via successor-BB index mapping. Dereferenced by sub_681C00 to check barrier presence (bit 4 of pointed-to byte). |
| +64 | 1 | byte | & 0x80 | flags | Per-BB characteristic flags (see below). Low 7 bits cleared on init; bit 7 preserved. |
| +65 | 7 | -- | -- | (padding) | Padding to 72-byte stride |
Pressure Counter Transfer
At the start of each BB's scheduling pass, sub_A091C0 (InitResourceTracking) copies the 10 DWORDs at record offsets +4 through +40 into the scheduler context at context offsets +48 through +87. The scheduling engine then updates the context counters as instructions are scheduled. When cross-block scheduling produces a new pressure snapshot, the engine writes it back with SSE bulk stores:
*(OWORD*)(record + 4) = pressure[0..3] // 16 bytes via _mm_store_si128
*(OWORD*)(record + 20) = pressure[4..7] // 16 bytes via _mm_store_si128
*(QWORD*)(record + 36) = pressure[8..9] // 8 bytes
During the main scheduling loop, the engine decrements pressure[1] through pressure[9] (9 counters) from a 40-byte per-opcode resource cost table. pressure[0] (R class) is handled via a separate path.
Flags Byte (+64)
| Bit | Name | Set by | Meaning |
|---|---|---|---|
| 0 | crossBlockBoundary | sub_688DD0 (ScheduleEngine) | BB is a cross-block scheduling boundary |
| 1 | regionActive | sub_688DD0 (ScheduleEngine) | BB belongs to an active scheduling region |
| 2 | hasCall | sub_6833F0 for opcode 96 | BB contains a CALL instruction |
| 3 | hasBranch | sub_6833F0 for opcodes 188, 190 | BB contains a branch instruction |
| 4 | hasBarrierInstr | sub_6833F0 via sub_7DF3A0 test (bit 6) | BB contains a barrier-flagged instruction |
| 5 | hasLongLatencyOp | sub_6833F0 for memory/texture/tensor opcodes; also vtable[183] arch check | BB contains a long-latency operation (memory, texture, or tensor) |
| 6 | crossBlockTarget | sub_6833F0 cross-block merge | BB is the target of a cross-block scheduling region |
| 7 | (preserved) | Not cleared during init | Carries data from a prior pipeline stage; purpose unknown |
The opcodes that set bit 5 (hasLongLatencyOp): 18 (with knob 62 gate), 23, 26, 32, 57, 81, 101, 124 (with knob 461 gate), 178, 188, 190, 197, 236, 248, 271, 315. Additionally, any instruction where vtable[183] returns true (architecture-specific long-latency classification) sets bit 5.
Cross-Block Scheduling Setup
After per-BB initialization, sub_6833F0 walks the CFG to identify cross-block scheduling opportunities, gated by knob 744 (SchedCrossBlockReorder). For each predecessor-successor pair within the speculative distance threshold (knobs 743 SchedCrossBlockInstsToSpeculate and 747 SchedCrossBlockTexToSpeculate):
- Sets
record[pred].crossBlockId = succ_bb_index(marks predecessor active). - Clears bit 6 of
record[pred].flags(predecessor is not a cross-block target). - Sets bit 6 of
record[succ].flags(successor is a cross-block target). - Calls
sub_682F10to allocate the 136-byte region scheduling context and store pointers atrecord[pred]+48andrecord[succ]+56.
+0 +4 +44 +48 +56 +64 +72
| crossBlockId (4B) | pressure[0..9] (40B = 10 x i32) |pad | regionCtx (8B) | regionCtx2 (8B)| fl | pad |
+-------------------+----+----+----+----+----+----+----+----+----+----+----------------+----------------+----+------+
Scheduler Context Object Layout
The scheduling context object (sched / a1) is the central state structure passed as the first argument to every function in the scheduling subsystem. It is populated by sub_A95DC0 (SchedulingContext::configure, 35 KB) which reads dozens of knob values and architecture parameters. The object spans approximately 1600 bytes, from a vtable pointer at offset 0 through architecture-specific SSE vectors at offset +1584.
Core Fields (offsets 0--176)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +0 | 8 | void* | vtable | Polymorphic dispatch; pre/post scheduling hooks at *(a1+40), *(a1+48) |
| +8 | 8 | ptr | funcContext | Pointer to CompilationContext; all func/arch queries go through this |
| +16 | 8 | ptr | allocator | Memory allocator interface (vtable-dispatched alloc/free) |
| +40 | 8 | ptr | preHookVtable | Pre-scheduling callback (mode-specific polymorphic hook) |
| +48 | 40 | int32[10] | regPressureCounters | Per-register-class live counts (copied from per-BB record +4..+40): R, P, UR, UP, B, and 5 arch-specific. The engine decrements counters [1]..[9] in the scheduling loop; counter [0] (R class) uses a separate path. |
| +60 | 4 | int32 | mode | Scheduling mode: 0 = ILP/Latency, 1 = ReduceReg, 2 = DynBatch |
| +88 | 4 | int32 | maxBBDepth | Maximum dependency depth across all basic blocks |
| +92 | 4 | int32 | maxBBDepthNonTensor | Maximum depth excluding tensor instructions |
| +176 | 1 | byte | scheduleActive | 1 during ReduceReg and DynBatch phases, 0 during ILP/Latency |
| +178 | 1 | byte | reduceRegMode | When set, tightens register budget by ~12.5% + 3 |
Phase Control (offsets 240--312)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +240 | 4 | int32 | currentPhase | Phase ID: 0 = budget computation, 1 = ReduceReg, 2 = ILP |
| +248 | 8 | ptr | regBitvector1 | Register pressure bitvector (numRegs + 1 words) |
| +256 | 8 | ptr | regBitvector2 | Second bitvector for dual-register tracking (knob 420, LivenessUseHiLo) |
| +280 | 8 | ptr | analysisSlots | 48-byte per-BB analysis slots (allocated when opt_level > 2) |
| +292 | 1 | byte | regTargetValid | Whether register targets from knobs 776/778 (SchedReduceIncLimit/SchedReduceIncLimitHigh) are valid |
| +296 | 4 | int32 | regTargetPrimary | Forward-pass primary register target (knob 776 SchedReduceIncLimit, in HW register units) |
| +300 | 4 | int32 | regTargetSecondary | Forward-pass secondary register target (knob 778 SchedReduceIncLimitHigh, in HW register units) |
| +311 | 1 | byte | cfgFlag1 | Priority queue depth configuration flag |
| +312 | 4 | int32 | cfgParam1 | Configuration parameter (default 10) |
Register Budget (offsets 316--432)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +316 | 4 | int32 | minRegs | Minimum register count from architecture register limit |
| +320 | 4 | int32 | pressureSlack | Register pressure headroom (initialized to 0) |
| +324 | 4 | int32 | committedTarget | Committed register target (set to regBudget after budget computation) |
| +328 | 4 | int32 | dualIssueBenefit | Dual-issue benefit score from sub_8CF5D0 (sm_50 only) |
| +380 | 4 | int32 | latencyCutoff | Barrier-target latency cutoff; controls critical-path bit activation |
| +384 | 1 | byte | hasTextureOps | Set to 1 when opcode 246 (texture operation) found in any BB |
| +388 | 4 | int32 | maxBBSize | Maximum basic block size in instructions (capped at 4095) |
| +392 | 4 | int32 | maxBBSizeForAlloc | Copy of maxBBSize used for resource slot allocation sizing |
Stall / Batch Parameters (offsets 404--420)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +404 | 4 | int32 | maxStallCycles | Max stall cycles; from knob 805/806 (SchedTexBatchTargetSelect{Register,Scheduler}Target), capped at 16 |
| +408 | 4 | int32 | stallThreshold | Stall threshold; knob 741 (SchedCountLoadsPerTex), default 3 |
| +412 | 4 | int32 | batchDepth | Batch depth; knob 761 (SchedMaxRLiveOKslack), default 3 (6 or 12 for sm_50 with dual-issue) |
| +416 | 4 | int32 | extraRegReserve | Extra register reservation; knob 762 (SchedMaxRLiveOKslackColdBlocks), default -1 (disabled) |
| +420 | 4 | int32 | spillModeCountdown | Spill-mode countdown; when > 0, forces aggressive scheduling with critical-path bit always set |
Register Budget and Pressure Tracking (offsets 432--485)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +432 | 4 | int32 | regBudget | Target register count (occupancy-aware, from sub_8CEE80) |
| +440 | 8 | ptr | livenessBV.data | Register liveness bitvector data (via sub_BDBA60); sized to numRegs+1 or 2*numRegs+2 if dual-reg |
| +448 | 8 | ptr | livenessBV.alloc | Bitvector allocator reference |
| +456 | 4 | int32 | livenessBV.size | Bitvector size in 64-bit words |
| +464 | 4 | int32 | depthThreshold | Number of barrier-target instructions required to activate critical-path bit |
| +480 | 4 | int32 | currentCycle | Current scheduling cycle; used for stall-free evaluation |
| +484 | 1 | byte | phaseActive | Phase activity flag: 1 = ReduceReg active, 0 = ILP/budget |
| +485 | 1 | byte | schedDirty | Reset to 0 at orchestrator start |
Hot-Cold and Yield State (offsets 523--532)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +523 | 1 | byte | hotColdEnable | Hot-cold memory tracking enable; result of sub_8CF5D0 (dual-issue check) |
| +524 | 1 | byte | yieldState | Current yield state; propagated to CONTROL instructions via priority bit 6 |
| +532 | 4 | int32 | hotColdBudget | Hot-cold budget counter; decremented per cold instruction; tracking deactivates at zero |
Architecture Parameters (offsets 604--616)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +604 | 4 | int32 | archParam1 | Architecture-dependent parameter (6 for sm_60 era) |
| +616 | 4 | int32 | archParam2 | Architecture-dependent limit (63 for sm_50 era, 255 for sm_60+) |
Resource Tracking and Dependency Data (offsets 672--744)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +672 | 8 | ptr | resourceSlots | Per-BB resource cost table; 84 bytes per slot (21 DWORDs: 10 FU usage + 10 FU delta + 1 flag) |
| +680 | 8 | ptr | depData | Dependency tracking data (zeroed at orchestrator start) |
| +720 | 8 | ptr | arenaAllocRef | Arena allocator reference for bitvector buffer resizing |
| +728 | 8 | ptr | bvBuffer | Growable bitvector buffer pointer (1.5x growth factor on realloc) |
| +736 | 4 | int32 | bvCapacity | Bitvector capacity in words (-1 = uninitialized sentinel) |
| +740 | 4 | int32 | bvAllocated | Bitvector allocated word count |
| +744 | 8 | ptr | funcContextRef2 | Second reference to function context for bitvector sizing |
Liveness Bitvector (offset 832)
The scheduler tracks register liveness via a bitvector at offset +832 (referenced only in the scheduling algorithm). Each bit represents one register; pressure is computed as popcount(live_bv). This field is part of the larger scheduling state managed by the engine and priority function.
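The pressure computation stated above is a one-line popcount over the liveness bitvector:

```python
# pressure = popcount(live_bv): each set bit is one live register.
# bin().count() is used for portability; int.bit_count() needs Python 3.10+.

def pressure(live_bv: int) -> int:
    return bin(live_bv).count("1")

# e.g. live registers {R0, R3, R5} -> bitvector 0b101001 -> pressure 3
```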
Arena Allocator (offset 840+)
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +840 | ~120 | ArenaAllocator | arena | Embedded bump allocator; freed via sub_8E3A80(sched+840) at each pass end; 10 KB block granularity, 8-byte alignment |
Configuration Bitfields (offsets 1032--1098)
The region from +1032 through +1098 (~67 bytes) is a dense bitfield array set by sub_A95DC0 (SchedulingContext::configure). Individual bits control fine-grained scheduling features, gated by architecture version, optimization level, and knob queries. Key fields:
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1032 | 1 | byte | featureFlags0 | Pipeline feature enables (OR'd with 0x4F) |
| +1052 | 4 | int32 | cfgMaxDepth | Knob 449 value (default 5); scheduling depth limit |
| +1064 | 1 | byte | cfgSmFlags | Bit 0: SM-specific flag (knob 931 or arch > 16386) |
| +1072 | 8 | double | pressureCoeff | Knob 366 value (default 0.25); register pressure coefficient |
| +1080 | 1 | byte | cfgBitmask | Bits: [7] always set, [6] knob 868, [5] hot-cold, [4] knob 410, [3] knob 868 alt |
| +1084 | 4 | int32 | cfgThreshold | Knob 876 value (default 50) |
| +1088 | 1 | byte | cfgBitmask2 | Bit 3: knob 752 related |
| +1089 | 1 | byte | cfgBitmask3 | Bit 7: arch == 16387 or arch == 0x4000 |
| +1096 | 1 | byte | cfgBitmask4 | Bit 7: external flag from target descriptor +788 |
| +1097 | 1 | byte | cfgBitmask5 | Bits: [7] target+1844, [4] arch <= 16386, [3] sm_50 dual-issue, [1,0] target+788 |
| +1098 | 1 | byte | cfgBitmask6 | Bit 0: knob 462 (scheduling heuristic), Bit 5: arch == 16386 |
Architecture-Specific Defaults (offsets 1408--1584)
Set early in sub_A95DC0 based on *(a1+372) >> 12 (architecture class). Three code paths populate these fields for sm_50 era (class < 3), sm_60--sm_89 era (class == 4), and sm_90+ era (class >= 5):
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +1408 | 1 | byte | archMode0 | Architecture scheduling mode flag |
| +1411 | 1 | byte | archMode1 | Scheduling sub-mode |
| +1412 | 1 | byte | archMode2 | Scheduling sub-mode |
| +1413 | 1 | byte | archMode3 | Scheduling sub-mode |
| +1414 | 1 | byte | archMode4 | Architecture mode flag |
| +1415 | 1 | byte | archMode5 | Architecture mode flag; bit 2 checked during batch depth selection |
| +1416 | 1 | byte | archMode6 | Architecture mode flag |
| +1440 | 16 | __m128i | archVector | SSE-loaded scheduling parameters (4 x int32) |
| +1452 | 4 | int32 | archWarpSize | Warp/thread configuration: 64 or 128 |
| +1456 | 4 | int32 | archDispatchSize | Dispatch slot parameter: 16, 32, or 64 |
| +1460 | 4 | int32 | archMaxThreads | Max threads per SM: 512 or 1024 |
| +1464 | 4 | int32 | archParam5 | Architecture parameter: 4 (sm_60+ only) |
| +1472 | 4 | int32 | archBlockSize | Block size parameter: 32 |
| +1480 | 8 | int64 | archSpecData | Architecture-specific encoded scheduling data |
| +1584 | 16 | __m128i | archProfile | SSE-loaded architecture profile vector |
Memory Layout Diagram
SchedulerContext (~1600 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0 vtable |+8 funcContext |+16 allocator |+24 (padding) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 (padding) |+40 preHookVtable |+48 regPressureCounters[0..9] |
+--------+--------+--------+--------+--------+--------+--------+--------+
| ...counters... |+60 mode |+64..84 |+88 maxBBDepth |+92 maxBBDpthNT |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+96..175 (internal state) |+176 active|+178 rrMode| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+240 phase |+248 regBV1 |+256 regBV2 | |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+280 analysisSlots |+292 valid|+296 tgtPri|+300 tgtSec| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+316 minR|+320 slack|+324 commit|+328 dualIss| ... |+380 latCut| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+384 tex |+388 maxBB|+392 alloc | |+404 stall|+408 thresh|+412 batch|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+416 xtraReg|+420 spillCnt| |+432 budget|+440..456 livenessBV |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+464 depth | |+480 cycle|+484 act|+485 dirty| |+523 hcE|+524 yld|
+---------+-------+---------+--------+---------+-------+--------+--------+
|+532 hcBudget| |+604 archP1| |+616 archP2| |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+672 resourceSlots |+680 depData | ...bitvector mgr... |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+720 arenaRef |+728 bvBuf |+736 cap |+740 alloc|+744 funcRef|
+---------+-------+---------+--------+---------+-------+--------+--------+
| ...gap / internal state... |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+832 liveness bitvector ref | |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+840 ArenaAllocator (embedded sub-object, ~120 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+960..1031 (internal/padding) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1032..1098 configuration bitfield array (~67 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1099..1407 (internal state, ~308 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1408..1416 architecture mode flags (9 bytes) |
+---------+-------+---------+--------+---------+-------+--------+--------+
|+1440 archVector (16B) |+1452..1484 arch params |+1584 archProfile (16B)|
+---------+-------+---------+--------+---------+-------+--------+--------+
Function Map
| Address | Size | Identity |
|---|---|---|
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup and knob query |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode engine |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG |
| sub_68B9C0 | 46 KB | DependencyGraphBuilder -- full DAG construction |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges |
| sub_896D50 | 90 KB | InitOpcodeTable -- ROT13 SASS mnemonic table |
| sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode latency table |
| sub_8BF890 | 929 B | AllocDynBatchData -- DynBatch context allocation |
| sub_8C1BA0 | 6.3 KB | InitDynBatchState -- batch initialization |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree reordering |
| sub_8C9320 | 47 KB | ComputePriority -- multi-criteria heuristic |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop |
| sub_8CD6E0 | 1.3 KB | ReverseSchedule -- reverse post-order BBs |
| sub_8CE520 | 12 KB | RegisterBudgetCurve -- piecewise linear model |
| sub_8CEE80 | 8.7 KB | ComputeRegisterBudget -- occupancy-aware |
| sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scheduling DAG |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW edges |
| sub_8E3970 | ~53 B | ArenaAlloc -- bump allocator |
| sub_8E3A80 | ~22 lines | ArenaFreeAll -- release all blocks |
| sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params |
| sub_8E5CA0 | 20 KB | MasterHWProfileBuilder -- latency/throughput |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output |
| sub_8F6530 | 13 KB | OutputCompleteSchedule -- final output assembly |
| sub_A95DC0 | 35 KB | SchedulingContext::configure -- knob loading |
| sub_A97600 | 42 KB | PostSchedulePass::runOnFunction |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling |
Cross-References
- Scheduling Algorithm -- priority list scheduling internals, ready list management, backtracking
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- register allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
Priority List Scheduling Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The scheduling engine implements a classical priority list scheduling algorithm extended with GPU-specific heuristics for register pressure management, functional unit contention avoidance, yield hint generation, and barrier-aware instruction ordering. A single unified engine (sub_688DD0, 20 KB) serves all three scheduling phases -- ReduceReg, ILP/Latency, and DynBatch -- differentiated only by a mode byte that selects different priority weight configurations. The algorithm iterates basic blocks, builds a ready list of zero-dependency instructions, selects the highest-priority candidate via an 8-bit packed heuristic, emits it into the final schedule, updates the dependency DAG, and repeats until all instructions in the block are placed.
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized core loop |
| Priority function | sub_8C9320 (47 KB, ~1300 lines) -- 8-bit packed heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency pre-scan | sub_8CF880 (28 KB) -- reverse BB iteration |
| Edge builder | sub_8D9930 (19 KB) -- RAW/WAR/WAW/memory/barrier edges |
| Instruction mover | sub_925510 (341 bytes) -- doubly-linked list relink |
| Resource tracker | sub_A09530 (365 bytes) -- per-instruction stall update |
| Stall/barrier encoder | sub_8D7760 (41 KB) -- control word generation |
| Alternative loop | sub_68B9C0 (46 KB) -- combined DAG + scheduling |
| BB size limit | 4095 instructions (split via sub_931920) |
| Large function limit | 16383 instructions (chunk-based via sub_A9DDD0) |
Core Algorithm
The unified scheduling engine executes the following sequence for each basic block. All three phases (ReduceReg mode 0x39, ILP mode 0x49, DynBatch mode 0x41) follow this identical structure; only the priority weight selection differs.
function ScheduleEngine(sched, mode, arg3, rebuild):
if rebuild:
InitScheduleRegion(sched) // sub_6833F0
// Allocates 72-byte per-BB records
// Queries knobs 595, 743, 747
// Calls sub_7E5120 for instruction characterization
for each bb in sched.basic_blocks: // 72-byte stride
InitResourceTracking(sched, bb) // sub_A091C0
// Zeroes 40-byte resource records (one per register class + 1)
BuildReadyList(sched) // sub_6820B0
while ready_list is not empty:
best = SelectHighestPriority(sched) // via priority vtable
UnlinkFromReadyList(sched, best) // sub_682200
MoveInstruction(sched, best, ref) // sub_925510
UpdateStallCycles(sched, best) // sub_A09530
UpdateWARTracking(sched, best) // sub_A09D40
for each successor of best:
successor.dep_count -= 1
if successor.dep_count == 0:
ready_list.insert(successor)
// Sorted insertion using priority value
PostBBCleanup(sched, bb) // sub_BDC200 / sub_BDCDE0
The outer loop iterates basic blocks via an array of 72-byte records (v112 = 72 * bb_index). The inner loop is a standard worklist algorithm: remove the highest-priority ready instruction, schedule it, and propagate readiness to its successors in the dependency DAG by decrementing their predecessor counts.
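The worklist structure above can be condensed into a short sketch. This is an illustration only -- `schedule_block` and its dict-based DAG are stand-ins for the recovered 72-byte BB records and SchedNode fields, not ptxas identifiers:

```python
# Illustrative priority-list scheduler: the ready list is kept sorted so
# that its head is always the highest-priority candidate, mirroring the
# sorted-insertion ready list described above.
def schedule_block(nodes, succs, priority):
    """nodes: ids in program order; succs: id -> list of dependent ids;
    priority: id -> int (stands in for the packed heuristic byte)."""
    dep_count = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs.get(n, []):
            dep_count[s] += 1
    # Seed with zero-dependency instructions, highest priority first.
    ready = sorted((n for n in nodes if dep_count[n] == 0),
                   key=priority.get, reverse=True)
    order = []
    while ready:
        best = ready.pop(0)                 # head = highest priority
        order.append(best)
        for s in succs.get(best, []):
            dep_count[s] -= 1
            if dep_count[s] == 0:           # became ready: sorted insert
                i = 0
                while i < len(ready) and priority[ready[i]] >= priority[s]:
                    i += 1
                ready.insert(i, s)
    return order
```

Because sort order is maintained at insertion time, selection is O(1) at the head of the list, which is the same trade-off the recovered engine makes.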
Mode Selection
The mode byte stored at *(DWORD*)(scheduler+60) controls which priority weight set the engine uses:
| Mode | Value | Callback | Objective |
|---|---|---|---|
| ReduceReg | 1 | 0x39 | Minimize register pressure. Prioritizes instructions that release registers (last-use operands). |
| ILP/Latency | 0 | 0x49 | Maximize instruction-level parallelism. Prioritizes critical-path and long-latency instructions. |
| DynBatch | 2 | 0x41 | Batch-aware tensor scheduling. Groups GMMA/WGMMA operations for warpgroup cooperation. |
The engine uses vtable dispatch at *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks. This allows each mode to inject custom behavior at scheduling boundaries without modifying the core loop.
Ready List Construction
sub_6820B0 (1.5 KB) scans the instruction linked list and collects every instruction with zero unsatisfied dependencies into a sorted ready list.
function BuildReadyList(sched):
for instr in sched.instruction_list: // linked list at sched[20]
if instr.opcode == 52: // NOP / BB boundary marker
continue // follow to real instruction
metadata = *(QWORD*)(instr + 40) // SchedNode pointer
if *(DWORD*)(metadata + 8) == 0: // depCount == 0
*(QWORD*)(metadata + 16) = sched[5] // link to current head
sched[5] = instr // new head
vtable_callback(sched, 104, instr) // insertion hook
*(DWORD*)(metadata + 28) = 0 // reset latency counter
The ready list is a singly-linked list threaded through SchedNode offset +16 (the nextReady field). Sort order is maintained at insertion time by the priority function -- each new instruction is inserted at its correct position so that the head of the list is always the highest-priority candidate. All metadata+N offsets throughout the scheduling pages refer to fields within the SchedNode block pointed to by instr+40 (sched_slot), not offsets from the instruction object itself. See the SchedNode layout for the complete field map.
Opcode 52 instructions are phantom BB boundary markers. The builder skips them but follows their linked-list successors to reach real instructions beyond the boundary.
The vtable+104 callback provides a polymorphic insertion hook. Different scheduling strategies can override this to implement custom ready-list policies (e.g., the dual-issue scheduler uses it to pair co-issuable instructions).
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority as an integer with 8-bit packed fields. Each bit encodes a different heuristic criterion. Because priority comparison reduces to a single integer comparison, the ready list maintains sort order without multi-key sorting overhead.
8-Bit Priority Encoding
| Bit | Name | Meaning | Notes |
|---|---|---|---|
| 7 (MSB) | yield-related | Instruction is near a yield boundary | Higher priority ensures yield hints align with scheduling boundaries |
| 6 | yield flag | Instruction triggers or participates in a yield sequence | Controls warp scheduler round-robin interaction |
| 5 | hot-cold | Memory access temperature (1 = hot, 0 = cold) | Hot = global/texture/surface loads with long latencies; cold = constant/shared |
| 4 | hot-cold / pressure | Packed byte holds hot-cold flag; pressure overflow acts as comparison override | See Priority Function Internals for the dual mechanism |
| 3 | same-BB preference | Instruction belongs to the currently-scheduled BB | Discourages cross-BB instruction motion |
| 2 | stall-free | Scheduling this instruction introduces zero stall cycles | All producer latencies have completed |
| 1 | latency-bound | Instruction is on the DAG critical path | Prioritizes latency-sensitive dependency chains |
| 0 (LSB) | tiebreaker | Additional ordering factor | Encodes instruction position, operand count, or FU preference |
Higher numeric value = higher priority. Bit 7 is the most significant criterion: yield-boundary instructions always schedule before non-yield instructions when both are ready.
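The dominance of the high bits falls directly out of integer comparison, which a small sketch makes concrete (`pack_priority` is a hypothetical helper; the field order follows the table above):

```python
# Pack the eight heuristic bits into one byte, MSB (bit 7) first.
# A sketch of the encoding in the table above, not the recovered code.
def pack_priority(yield_related, yield_flag, hot_cold, hot_pressure,
                  same_bb, stall_free, latency_bound, tiebreaker):
    bits = [yield_related, yield_flag, hot_cold, hot_pressure,
            same_bb, stall_free, latency_bound, tiebreaker]
    p = 0
    for b in bits:                  # shift in bit 7 down to bit 0
        p = (p << 1) | (1 if b else 0)
    return p

# A yield-boundary instruction outranks any non-yield candidate,
# because bit 7 dominates the single integer comparison.
a = pack_priority(1, 0, 0, 0, 0, 0, 0, 0)   # only bit 7 set -> 0x80
b = pack_priority(0, 1, 1, 1, 1, 1, 1, 1)   # all lower bits  -> 0x7F
```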
Hot-Cold Classification
The hot-cold flag (bit 5) classifies memory operations by expected latency:
- Hot (bit 5 = 1, higher priority): global memory loads (LDG), texture fetches (TEX, TLD), surface operations. These have high latency (hundreds of cycles) and benefit most from early scheduling to overlap with computation. Detected by sub_A9CDE0.
- Cold (bit 5 = 0, lower priority): constant memory loads (LDC), shared memory operations (LDS, STS). These have low latency and do not need early scheduling. Detected by sub_A9CF90, which also suppresses the pressure-overflow and critical-path extension signals.
Classification uses sub_A9CDE0 (hot detection) and sub_A9CF90 (cold detection). Memory space type is determined by sub_693BC0, which returns space codes: 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant, 1 = generic. Hot-cold tracking is gated by scheduler+523 and scheduler+532; when the hot-cold budget (scheduler+532) reaches zero, the feature deactivates for the remainder of the BB.
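The space-code split can be sketched as a lookup over the codes recovered from sub_693BC0. Note this is an approximation: `classify`, `HOT_SPACES`, and `COLD_SPACES` are hypothetical names, and texture fetches (which are hot) are identified by opcode rather than by a space code, so they are not captured here:

```python
# Memory-space codes recovered from sub_693BC0.
SPACE = {"generic": 1, "local": 2, "shared": 3, "constant": 7,
         "surface": 11, "global": 16}

HOT_SPACES = {SPACE["global"], SPACE["surface"]}    # long-latency ops
COLD_SPACES = {SPACE["constant"], SPACE["shared"]}  # short-latency ops

def classify(space_code):
    """Return 'hot', 'cold', or 'neutral' -- a sketch of the
    sub_A9CDE0 / sub_A9CF90 split, not their exact logic."""
    if space_code in HOT_SPACES:
        return "hot"
    if space_code in COLD_SPACES:
        return "cold"
    return "neutral"
```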
Pressure Overflow
Bit 4 in the packed byte holds the hot-cold flag (see above). The pressure overflow signal is a separate Boolean computed by checking all four register classes (GPR, predicate, address, UGP) against their respective limits. When any class exceeds its budget, the pressure overflow flag activates and acts as a comparison override: the candidate wins regardless of the packed priority byte, forcing the scheduler to select the instruction that relieves register pressure. This is the primary mechanism by which the ReduceReg phase achieves its objective: the mode sets a tight register budget via scheduler+178, causing pressure overflow to activate frequently and driving the scheduler toward pressure-reducing orderings. See the Priority Function Internals section for the exact per-class threshold checks.
Priority Evaluation Sequence
The priority function evaluates criteria in this order for each candidate instruction:
1. sub_8C7290: extract 4-class register deltas, same-BB flag, and per-BB resource vector (SSE-optimized)
2. Compute yield saturation: check write-barrier counters for predicate, GPR, and UGP register classes against their ceilings (7, 7, and target_desc+624 respectively)
3. sub_8C67A0: compute per-instruction resource cost if BB slot not yet committed
4. sub_8C7120: update barrier tracking state (if metadata+111 bit 7 set)
5. Evaluate register pressure: compute per-class overflow against budget (scheduler+432) and per-class limits; derive pressure-overflow Boolean
6. Evaluate stall-free: compare earliest cycle (metadata+32) vs current cycle (scheduler+480)
7. Evaluate critical path: compare barrier-target count vs depth threshold (scheduler+464)
8. Evaluate yield bits: opcode 39 (yield-related) and opcode 96 (yield flag from scheduler+524)
9. Pack 8 bits into priority byte
10. Evaluate hot/cold: sub_A9CDE0 / sub_A9CF90 (only when scheduler+523 active)
11. Multi-stage comparison against running best: resource vectors, then XOR-based bit scan, then secondary tiebreakers
The function scans the full ready list in a single pass (not limited by knob 770 for the scan itself). Knob 770 (priority queue depth, default 4) controls the depth threshold mechanism for critical-path activation, not the number of candidates evaluated.
Key Internal Variables
| Variable | Source | Content |
|---|---|---|
| budget_hw | sub_6818D0(scheduler, scheduler[432] - scheduler[412]) | Register budget in HW register units |
| reduced_hw | sub_6818D0(scheduler, budget - budget/16) | Tighter budget for critical-path threshold (or knob 760 override) |
| queue_depth | knob 770 | Depth threshold parameter (default 4); controls critical-path activation |
| per_bb_flag | knob 769 | Per-BB scheduling flag; when set, resets yield state between BBs |
| scheduler+420 | state | Spill-mode countdown; when > 0, forces aggressive scheduling with bit 1 = 1 |
| scheduler+464 | state | Depth threshold -- number of barrier targets that must be ready before critical-path activates |
| scheduler+480 | state | Current scheduling cycle; used for stall-free evaluation |
| scheduler+523 | state | Hot-cold tracking enable flag; gated by knob |
| scheduler+524 | state | Current yield state; propagated to CONTROL instructions via bit 6 |
| scheduler+532 | state | Hot-cold budget counter; decremented per cold instruction, disables tracking at zero |
| scheduler+672 | allocation | Per-BB resource cost table (84 bytes per slot) |
Support Subroutines
| Address | Size | Purpose |
|---|---|---|
| sub_8C67A0 | 3.7 KB | Compute per-instruction resource cost. Calls sub_A08A00 (resource model) three times: mode 1 = instruction's own cost, mode 2 = operand release cost, mode 3 = combined BB-level impact. Uses SSE _mm_add_epi32 for vector accumulation. |
| sub_8C7290 | 5.1 KB | Copy 10-element int32 resource vector from per-BB table at scheduler+672. SSE _mm_loadu_si128 bulk copy. Special case: opcode 97 (STG in ROT13; used as control/boundary marker) returns base scheduler state with zeroed deltas. |
| sub_8C7720 | 20 KB | Red-black tree operations for instruction reordering within BB. Maintains a balanced BST of scheduling candidates for O(log N) insertion, removal, and priority update. |
| sub_8C7120 | -- | Barrier tracking state update. |
| sub_693BC0 | -- | Memory space classification and latency query. |
| sub_6818D0 | -- | Register count to hardware-aligned unit conversion. |
Priority Function Internals
The full logic of sub_8C9320 divides into three phases: (1) pre-scan the ready list to collect aggregate BB statistics, (2) iterate the ready list a second time evaluating each candidate and maintaining a running best, and (3) update scheduler state and return the winner. The function signature is (scheduler, &second_best) -> best_instruction.
Phase 1: Pre-Scan Statistics
Before priority evaluation begins, the function iterates the entire ready list (linked via metadata+16) and accumulates per-BB statistics that feed into the per-instruction priority decisions:
| Variable | Init | Accumulation | Meaning |
|---|---|---|---|
| shared_mem_count | 0 | ++ when opcode 183 and sub_693BC0 returns space 3 | Count of shared-memory operations in ready list |
| neg_reg_deficit | 0 | += delta when register delta < 0 | Total register pressure reduction from ready instructions |
| max_dep_cycle | -1 | max(current, metadata+92) | Highest dependency cycle among all ready instructions |
| max_pred_cycle | 0 | max(current, metadata+88) | Highest predecessor cycle among all ready instructions |
| barrier_count | 0 | ++ when metadata+108 & 1 | Count of barrier-target instructions in ready list |
| dep_flag_count | 0 | ++ when metadata+108 & 2 | Count of instructions with dependency-set flag |
| pos_pressure_sum | 0 | += delta when register delta > 0 | Total register pressure increase from ready instructions |
| filtered_pressure | 0 | += delta when within depth threshold | Pressure increase from depth-eligible instructions |
| max_barrier_slot | -1 | max(current, metadata+24) for barrier targets | Latest BB slot among barrier-target instructions |
| min_barrier_latency | 99999 | min(current, metadata+28) for barrier targets | Shortest latency counter among barrier-target instructions |
| max_nonbarrier_cycle | -1 | max(current, metadata+32) for non-barrier | Latest earliest-available-cycle for non-barrier instructions |
| any_stall_free | 0 | \|= (metadata+32 >= 0) | Whether any instruction can issue without stalling |
| total_ready | 0 | ++ for every instruction | Total instructions in ready list |
| preferred_instr | NULL | non-barrier instr with max metadata+24 | The program-order-latest non-barrier instruction |
The pre-scan also maintains a depth-threshold table: an array of up to 32 barrier-target instruction pointers sorted by their latency counter (metadata+28). This table is scanned to compute scheduler+464 (depth threshold) and scheduler+380 (latency cutoff), which control when the critical-path bit activates.
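A subset of the accumulators can be sketched as a single pass over the ready list. Here SchedNode fields are modeled as dict keys (`reg_delta`, `dep_cycle`, `flags`, `latency` are stand-ins for metadata offsets, and `prescan` is a hypothetical name):

```python
# Pre-scan sketch: one pass over the ready list accumulating BB-level
# statistics that mirror a subset of the table above.
def prescan(ready):
    s = {"neg_reg_deficit": 0, "pos_pressure_sum": 0,
         "max_dep_cycle": -1, "barrier_count": 0, "total_ready": 0,
         "min_barrier_latency": 99999}
    for n in ready:
        s["total_ready"] += 1
        d = n["reg_delta"]
        if d < 0:
            s["neg_reg_deficit"] += d       # pressure relief available
        elif d > 0:
            s["pos_pressure_sum"] += d      # pressure growth pending
        s["max_dep_cycle"] = max(s["max_dep_cycle"], n["dep_cycle"])
        if n["flags"] & 1:                  # barrier-target flag
            s["barrier_count"] += 1
            s["min_barrier_latency"] = min(s["min_barrier_latency"],
                                           n["latency"])
    return s
```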
Phase 2: Register Budget Prologue
Before the main loop, the function computes two register budgets from scheduler+432 (target register count):
budget_base = scheduler[432] - scheduler[412] // target minus committed
if ReduceReg_mode (scheduler+178): // ReduceReg tightens budget
if scheduler[416] < 0:
budget_base -= (scheduler[432] / 8) + 3 // reduce by ~12.5% + 3
else:
budget_base -= scheduler[416] // explicit reduction
budget_hw = RegToHWUnits(scheduler, budget_base) // sub_6818D0
reduced_hw = RegToHWUnits(scheduler, budget_base - budget_base/16)
// ~6.25% tighter
if knob_760_active:
reduced_hw = RegToHWUnits(scheduler, budget_base - knob_760_value)
queue_depth = 4 // default
if knob_770_active:
queue_depth = knob_770_value // override
budget_hw sets the threshold for bit 4 (pressure overflow). reduced_hw provides a tighter threshold used in the critical-path assessment. queue_depth (knob 770) parameterizes the depth-threshold mechanism that gates critical-path activation; it does not cap the number of candidates evaluated -- the full ready list is still scanned on every pass.
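A worked instance of the budget prologue, with `compute_budgets` as a hypothetical helper and `to_hw` standing in for sub_6818D0 (modeled here as the identity, since the real register-to-HW-unit rounding is target-specific):

```python
# Sketch of the register budget prologue above.
def compute_budgets(target, committed, reduce_reg, explicit_reduction,
                    knob_760=None, to_hw=lambda r: r):
    """target ~ scheduler[432], committed ~ scheduler[412],
    explicit_reduction ~ scheduler[416]."""
    base = target - committed
    if reduce_reg:                          # ReduceReg tightens budget
        if explicit_reduction < 0:
            base -= target // 8 + 3         # ~12.5% of target, plus 3
        else:
            base -= explicit_reduction
    budget_hw = to_hw(base)
    if knob_760 is not None:                # knob 760 override
        reduced_hw = to_hw(base - knob_760)
    else:
        reduced_hw = to_hw(base - base // 16)   # ~6.25% tighter
    return budget_hw, reduced_hw
```

For example, a 64-register target with 8 registers committed under ReduceReg (no explicit reduction) yields base 56 - 11 = 45, and a critical-path threshold of 43.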
Phase 3: Per-Bit Computation
For each instruction in the ready list, sub_8C7290 extracts its per-register-class deltas (4 classes: GPR, predicate, address, UGP) and the same-BB flag. Then each priority bit is computed:
Bit 7 -- Yield-related. Determined by opcode. Only opcode 39 (YIELD instruction variant) can set this bit. The condition checks the last operand's low 2 bits:
if opcode_masked == 39:
operand_index = operand_count - 1 - ((opcode >> 11) & 2)
yield_related = (instr[84 + 8*operand_index] & 3) == 0
else:
yield_related = 0
When set, the instruction is a yield boundary marker and receives absolute highest priority regardless of all other heuristics.
Bit 6 -- Yield flag. Set only for opcode 96 (CONTROL instruction):
if opcode_masked == 96:
yield_flag = scheduler[524] // current yield state
else:
yield_flag = 0
// Post-adjustment: suppress when hot/pressure bits dominate
if (bit5_set || bit4_set):
yield_flag = 0
if metadata[32] < scheduler[480]: // behind schedule
yield_flag = scheduler[396] ? original_yield : 0
The yield flag propagates the scheduler's warp yield state only through CONTROL instructions, ensuring yield hints align with scheduling barriers.
Bit 5 -- Hot-cold classification. Requires hot-cold tracking to be active (scheduler+523 set, gated by scheduler+532 > 0):
if hot_cold_active:
is_hot = sub_A9CDE0(target_desc, context, instruction)
else:
is_hot = 0
// Cold detection suppresses priority
if sub_A9CF90(target_desc, context, instruction): // is_cold?
pressure_overflow = 0 // suppress bit 4
critical_extension = 0 // suppress lookahead
sub_A9CDE0 returns true for global memory loads (LDG), texture fetches (TEX, TLD), and surface operations -- instructions with latencies in the hundreds of cycles. sub_A9CF90 returns true for constant loads (LDC), shared memory operations (LDS/STS) -- low-latency operations. Hot instructions (bit 5 = 1) get higher priority to schedule early and overlap their long latencies with computation. Cold instructions (bit 5 = 0) are deprioritized.
Bit 4 -- Pressure overflow. This bit does NOT appear directly in the initial packing as a single variable. Instead, the pressure overflow signal (v81 in decompiled source) feeds into the candidate comparison logic as an override. The mechanism:
// For barrier-target instructions:
budget_in_units = RegToHWUnits(scheduler, scheduler[432])
headroom = RegToHWUnits(scheduler, 8)
if budget_in_units > headroom + scheduler[72]: // plenty of headroom
pressure_overflow = 0
elif latency_counter > min_barrier_latency + 9: // far from ready
pressure_overflow = 0
else:
// Check all 4 register classes against their limits:
overflow = false
overflow |= (scheduler[72] + gpr_delta > budget_hw)
overflow |= (scheduler[68] + pred_delta > 7)
overflow |= (scheduler[56] + addr_delta > 7)
overflow |= (scheduler[60] + ugp_delta >= target_desc[624])
pressure_overflow = overflow
When pressure_overflow = 1, the candidate wins the comparison regardless of other bits -- it is the scheduler's mechanism for emergency register pressure relief. In the packed byte's bit 4 position, the hot-cold flag occupies the slot. The pressure overflow signal operates at a higher level: it can force the candidate to win even when its packed priority byte is lower.
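The four-class check reduces to a short predicate. This sketch uses hypothetical names; `ugp_limit` defaults to a placeholder for the per-target value read from target_desc+624, and the headroom/latency early-outs above are omitted:

```python
# Sketch of the 4-class pressure-overflow test above.
def pressure_overflow(live, delta, budget_hw,
                      pred_limit=7, addr_limit=7, ugp_limit=64):
    """live, delta: (gpr, pred, addr, ugp) current counts and
    per-candidate deltas; budget_hw ~ the GPR budget in HW units."""
    gpr, pred, addr, ugp = (l + d for l, d in zip(live, delta))
    return (gpr > budget_hw or        # scheduler[72] + gpr_delta
            pred > pred_limit or      # scheduler[68] + pred_delta
            addr > addr_limit or      # scheduler[56] + addr_delta
            ugp >= ugp_limit)         # scheduler[60] + ugp_delta
```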
Bit 3 -- Same-BB preference. Output parameter from sub_8C7290:
same_bb = sub_8C7290.output_param_5 // boolean from resource copy
Set when the instruction belongs to the currently-scheduled basic block. Instructions imported from other BBs by global scheduling get same_bb = 0, reducing their priority relative to local instructions.
Bit 2 -- Stall-free. Computed from the earliest-available-cycle field:
if countdown_active (scheduler[420] != 0):
if metadata[32] < scheduler[480] AND instr != preferred_instr:
stall_free = 0
if pressure_plus_reg_sum > 0:
goto full_evaluation // positive pressure = needs analysis
else:
stall_free = 1
else:
// Normal mode: stall-free when producers have completed
if metadata[32] >= scheduler[480]:
stall_free = 1
elif instr == preferred_instr:
stall_free = 1
else:
stall_free = 0
metadata+32 is the instruction's earliest available cycle -- the latest completion time among all its producer instructions. scheduler+480 is the current scheduling cycle. When earliest >= current, all producers have retired and the instruction can issue with zero pipeline stalls.
Bit 1 -- Critical-path / latency-bound. Complex multi-path computation:
if countdown_active (scheduler[420] != 0):
// Spill mode: almost always critical
if !(barrier_bits_set_in_priority):
if slot_limit_exceeded:
critical = 1
else:
critical = !(pressure_sum <= 0 && max_reg_class == 0)
else:
critical = 0
else:
// Normal mode: depth threshold comparison
if barrier_count >= scheduler[464]:
critical = 1 // enough barriers ready -> critical path active
else:
critical = 0
In spill mode (active when scheduler+420 > 0), the critical-path bit is set for nearly all instructions to maximize scheduling throughput. In normal mode, it activates when the number of barrier-target instructions in the ready list meets or exceeds the depth threshold computed during the pre-scan, indicating that the scheduler is processing a latency-critical dependency chain.
Bit 0 -- Tiebreaker (barrier-target). Read directly from instruction metadata:
tiebreaker = metadata[108] & 1 // barrier-target flag
Barrier-target instructions (those waiting on a hardware barrier) get bit 0 = 1. Since this is the lowest-priority bit, it only affects ordering when all higher bits are identical. Scheduling barrier targets promptly allows the barrier resource to be retired sooner, freeing scoreboard entries for other instructions.
Packed Byte Assembly
The 8 bits are packed into a single byte using shift-and-mask arithmetic:
priority = (yield_related << 7) // bit 7
| (yield_flag << 6) & 0x7F // bit 6
| (hot_cold << 5) & 0x3F // bit 5 (initially yield copy)
| (hot_flag << 4) & 0x3F // bit 4
| (same_bb << 3) & 0x0F // bit 3
| (stall_free << 2) & 0x0F // bit 2
| (critical_path << 1) & 0x03 // bit 1
| (tiebreaker << 0) & 0x03 // bit 0
Because every input is a 0/1 Boolean, each shifted term contributes a single bit; the & 0xNN masks are byte-truncation residue of the decompiled arithmetic and do not alter the result. In the initial packing, bit 5 and bit 6 both derive from the yield variable; the hot-cold flag (sub_A9CDE0 result) overwrites bit 5 in subsequent repackings that occur during the spill-mode and comparison paths.
Candidate Comparison
The comparison between the current candidate and the running best is NOT a simple integer comparison of the packed bytes. The function performs a multi-stage refinement:
1. Resource vector comparison: If knob-gated architecture checks pass (SM index > 5 at context+1704), a 4-tuple lexicographic comparison of per-register-class resource vectors occurs first. The four classes are compared in order: GPR delta, predicate delta, address delta, UGP delta. The first class that differs determines the winner.
2. Priority byte XOR scan: When resource vectors are equal, the function XORs the current and best packed bytes and checks differing bits in this order:
   - Bit 4 (0x10) -- pressure: winner has bit 4 set (higher pressure need)
   - Bit 6 (0x40) -- yield: winner has bit 6 set (yield participation)
   - Bit 1 (0x02) -- critical: winner has bit 1 set
   - Bit 2 (0x04) -- stall-free: winner has bit 2 set
   - Bit 5 (0x20) -- hot-cold: winner has bit 5 set (hot memory op)
3. Secondary tiebreakers (when all checked bits match):
   - Barrier group index (v213 vs v253)
   - Latency counter comparison (v223 vs v248)
   - Bit 7 yield-related (only when shared-memory count > 0)
   - Contention score (a derived value incorporating register overflow penalty: contention + 2 * RegToHWUnits(pressure_delta) - pressure_sum_sign)
   - Slot manager cycles (scheduling cost estimate from sub_682490)
   - Earliest available cycle (metadata+32)
   - Dependency cycle (metadata+92)
   - Latest deadline (metadata+40)
   - Register delta magnitude
4. Positional fallback: When all heuristic comparisons are tied, the instruction with the higher BB slot (metadata+24) wins, preserving original program order.
The multi-stage comparison explains why the packed byte uses non-obvious bit ordering. Bits 4, 6, 1, 2, 5 are checked before bit 7 in the refinement path, even though bit 7 is the MSB. The packed byte enables fast ready-list insertion sort (integer comparison), while the full comparison function provides nuanced selection for the actual scheduling decision.
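The XOR-scan stage can be sketched directly from the recovered check order (`refine_winner` and `REFINE_ORDER` are hypothetical names; the resource-vector stage and secondary tiebreakers are omitted):

```python
# Refinement-order comparison: when two packed priority bytes differ,
# the differing bits are examined in the recovered order
# 0x10, 0x40, 0x02, 0x04, 0x20 -- not in numeric bit order.
REFINE_ORDER = (0x10, 0x40, 0x02, 0x04, 0x20)

def refine_winner(cand, best):
    """True if cand beats best under the bit-scan rules; None if the
    scanned bits do not decide (fall through to secondary tiebreakers)."""
    diff = cand ^ best
    for mask in REFINE_ORDER:
        if diff & mask:
            return bool(cand & mask)   # winner is whoever has the bit set
    return None
```

For instance, a candidate with only the hot-cold bit (0x20) loses to a running best with the pressure bit (0x10), even though 0x20 > 0x10 numerically.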
Scheduler State Updates
After selecting the best candidate, the function updates scheduler state:
// Spill mode countdown
if winner is barrier-target:
scheduler[420] = computed_countdown - 1
scheduler[396] -= 1 // spill sequence counter
if metadata[32] >= 0:
scheduler[400] -= 1 // stall-free counter
if stall_free_count==0 AND remaining>0 AND countdown>1:
scheduler[420] = 0 // force exit spill mode
scheduler[464] = -1 // reset depth threshold
else:
// Non-barrier winner in countdown mode
if !(barrier_bits in priority) AND slot_cost within budget:
// do nothing, continue countdown
else:
scheduler[420] = 0 // exit spill mode
scheduler[464] = -1 // reset depth threshold
// Slot manager update (when winner has positive scheduling cost)
if best_cost > 0 AND slotManager[76] > 0:
if slotManager[140]:
slotManager[28] += slotManager[44] // advance base
slotManager[76] = 0 // reset count
slotManager[80] = NULL // reset anchor
best.metadata[28] = sub_682490(...) // recompute latency
// Hot-cold counter update
if hot_cold_active AND winner is cold (sub_A9CF90 returns true):
scheduler[532] -= 1 // decrement hot-cold budget
elif hot_flag was set for winner:
scheduler[523] = 0 // disable hot-cold tracking
The function returns the best instruction pointer and writes the second-best to *a2 for lookahead scheduling.
Dependency DAG Construction
The dependency graph is built in two stages before the scheduling loop begins. The DAG is a directed acyclic graph where nodes are instructions and edges represent ordering constraints with associated latency values.
Stage 1: Pre-Scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order (bb[N-1] to bb[0]) using the BB ordering array at func+512.
For each BB:
- Check knobs 314/313 for per-BB scheduling skip flags
- Walk the instruction linked list, identifying NOP/control instructions
- Set bb->next pointers and configure BB scheduling state
- Delegate to Stage 2 (sub_8D9930) for edge construction
- Manage memory arenas with SSE-optimized copies for metadata arrays
Contains approximately 14 nested loops for edge construction. The reverse iteration order ensures that when the scheduler processes a BB, all of its successors have already been characterized.
Stage 2: Edge Construction (sub_8D9930, 19 KB)
For each pair of instructions within a BB, checks for five dependency types:
| Type | Abbreviation | Condition | Edge Latency |
|---|---|---|---|
| True | RAW | Read-after-write on same register | Producer's pipeline latency |
| Anti | WAR | Write-after-read on same register | 0 (ordering constraint only) |
| Output | WAW | Write-after-write on same register | 1 (minimum separation) |
| Memory | -- | Store before load to same memory space | Conservative; full ordering |
| Barrier | -- | Instruction depends on barrier/sync result | Barrier completion latency |
Operand analysis is performed by sub_894290 (27 KB), which processes 16-bit operand descriptors encoding:
| Bits | Field |
|---|---|
| 12--15 | Register class |
| 8--11 | Bank number |
| 0--7 | Dependency type |
Memory dependencies are conservative: all stores are ordered before subsequent loads to the same memory space. The scheduler does not perform alias analysis -- it relies on the memory space classification from sub_693BC0 to determine whether two operations might conflict.
Supplementary Dependency Builders
These functions handle specific aspects of dependency construction in the 0x680000--0x6B0000 range:
| Address | Size | Purpose |
|---|---|---|
sub_68A690 | 31 KB | BuildDependencies -- walks instruction lists and creates producer-consumer dependency edges from def-use chains |
sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level data dependency edges |
sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints (ordering edges between memory operations even without explicit data deps) |
sub_6A78F0 | 23 KB | ProcessOperands -- iterates operand arrays at instruction +84, extracts register file pressure and dependency distance information |
Instruction Emission
sub_925510 (341 bytes, 57 lines) is the universal instruction relocation primitive. It moves an instruction to a new position in the doubly-linked instruction list.
function MoveInstruction(block, instr, insert_before):
// 1. Unlink from current position, remembering the old neighbors
old_prev = instr.prev
old_next = instr.next
old_prev.next = old_next
old_next.prev = old_prev
// 2. Update block boundaries using the pre-move neighbors
if instr was block.head (block+272):
block.head = old_next
if instr was block.tail (block+280):
block.tail = old_prev
// 3. Insert before the reference instruction
instr.next = insert_before
instr.prev = insert_before.prev
insert_before.prev.next = instr
insert_before.prev = instr
// 4. Notify subsystems
UpdateDependencyGraph(block, instr) // sub_7EEC10
UpdateBlockTimestamp(block) // sub_7DDCA0
This function has 13 callers across the codebase. It serves as the shared instruction movement primitive for the scheduler, register allocator, and peephole optimizer.
Resource Tracking
The scheduler maintains 10 functional unit resource counters per basic block, tracking pipeline utilization to avoid saturating any single execution unit.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per functional unit |
| 40--76 | 10 x int32 | Resource pressure delta (change since last step) |
| 80 | int32 | BB-entered flag and auxiliary bits |
Functional Unit Pipes
| Index | Pipe | Typical Instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS (overlaps with pipe 4 for LDS/STS) |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
Resource Tracking Helpers
| Address | Size | Purpose |
|---|---|---|
sub_A091C0 | -- | Initialize per-BB resource arrays to zero |
sub_A09530 | 365 bytes | Update stall cycle counters after scheduling an instruction. Decrements pending latency counters for all tracked resources. |
sub_A09D40 | -- | Update WAR (anti-dependency) resource tracking for register operands |
sub_A08A00 | -- | Resource model query (called in 3 modes by sub_8C67A0) |
The resource model sub_A08A00 is called three times per instruction by sub_8C67A0:
- Mode 1: instruction's own execution cost (FU assignment + pipeline latency)
- Mode 2: operand release costs (freed resources when an operand reaches last-use)
- Mode 3: combined instruction + BB-level impact (aggregate pressure)
SSE intrinsics (_mm_add_epi32, _mm_loadu_si128) are used throughout for vectorized resource accumulation and copying.
Register Pressure Tracking
The scheduler tracks register liveness via a bitvector at scheduler+832. Each bit represents one register; the pressure is the popcount of the live set.
function UpdateRegisterPressure(sched, instr):
for each operand in instr.operands:
if operand.is_def:
set_bit(sched.live_bv, operand.reg) // DEF: mark live
if operand.is_last_use:
clear_bit(sched.live_bv, operand.reg) // LAST-USE: mark dead
sched.current_pressure = popcount(sched.live_bv)
The bitvector is sized to (numRegs + 1) words, or (2 * numRegs + 2) when knob 420 (dual-register tracking) is active. Dual-register tracking separately tracks register pairs for instructions that consume or produce 64-bit values.
Pressure state fields:
| Offset | Content |
|---|---|
scheduler+432 | Target register count (from budget computation) |
scheduler+324 | Committed register target |
scheduler+316 | Minimum register count |
scheduler+320 | Register pressure slack (headroom) |
When current_pressure > scheduler+432, the priority function sets bit 4 (pressure overflow) in the encoding, biasing the scheduler toward instructions that release registers.
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 to a heap-allocated SchedNode block. The offsets below are relative to the SchedNode base, not the 296-byte Ori instruction. See the SchedNode layout for the authoritative field map.
| Offset | Type | Content |
|---|---|---|
| +8 | int32 | dep_count -- unsatisfied predecessor count (0 = ready) |
| +16 | QWORD | next_ready -- linked-list pointer in ready list |
| +24 | int32 | bbSlot -- 1-based BB position (-1 = unscheduled) |
| +28 | int32 | latency_counter -- current stall counter |
| +32 | int32 | earliestCycle -- earliest available cycle |
| +40 | int32 | latestDeadline -- latest deadline cycle |
| +44 | int32 | Barrier group index |
| +88 | int32 | maxPredecessorCycle |
| +92 | int32 | maxDependencyCycle |
| +108 | byte | Flags: bit 0 = barrier target, bit 1 = has dependency, bit 2 = early schedulable, bit 3 = late schedulable, bit 4 = has register operand |
| +111 | byte | Flags: bit 7 = uses expensive register file |
The scheduling loop also reads Ori instruction fields directly (not via the SchedNode): instr+72 (opcode), instr+80 (operand count), instr+84 (operand descriptors).
Sentinel values: bbSlot -1 (unscheduled), latency 0x1869F (99999 = infinity).
The dep_count field at +8 is the key scheduling control: it counts unsatisfied predecessors in the dependency DAG. When a predecessor is scheduled, the engine decrements every successor's dep_count. When dep_count reaches zero, the instruction becomes ready and is inserted into the ready list.
Stall and Barrier Insertion
After the scheduling loop determines instruction order, sub_8D7760 (41 KB) converts the abstract schedule into SASS control words.
For each instruction in the scheduled order:
| Field | Computation | Range |
|---|---|---|
| Stall count | Distance in cycles to the nearest dependent consumer | 0--15 (capped by knob 805, max 16) |
| Yield hint | Warp scheduling hint -- should the HW scheduler switch to another warp? | 0 or 1 |
| Barrier assignment | Which of the 6 available barriers this instruction writes/waits on | 0--5, or none |
| Scoreboard deps | Read/write dependency tracking for the hardware scoreboard | Bitmask |
The function contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+). It manages a 32-entry barrier table for tracking active barrier assignments.
See Scoreboards & Barriers for the control word encoding format.
Alternative Scheduling Loop
sub_68B9C0 (46 KB) is a monolithic function that combines dependency graph construction with the scheduling loop. It serves as an alternative entry point for scheduling passes that need to build the DAG inline rather than using the pre-built graph from Stage 1.
Internal structure:
- Initialize scheduling state (sub_685700)
- Initialize ready-list management (sub_687080)
- Check resource conflicts (sub_687410)
- Inner loop (a while(2) infinite loop with break conditions):
  - Check if the ready list is empty -- break if so
  - Check opcode 97 (STG in ROT13; used as a scheduling barrier/control marker) -- special handling
  - Select the best instruction from the ready list
  - Schedule it: assign cycle, update resources, process edges
  - For each successor: decrement dep_count, add to the ready list if zero
  - Check the boundary condition (v236) -- break if done
- Track first-pass initialization via v215
This function accesses the Ori instruction's opcode at instr+72, plus SchedNode fields (via instr+40 pointer): +24 (bbSlot), +144 (scheduling slot), +164 (resource class), and +236 (latency).
Specialized Scheduling Strategies
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies. These are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level. Each strategy implements a variation of the core list scheduling algorithm with different heuristics or search strategies.
| Address | Size | Strategy | Description |
|---|---|---|---|
sub_8B1190 | 16 KB | Backtracking | Undo and retry on scheduling conflicts. Rolls back the last N steps and tries alternative orderings. Bounded depth prevents exponential blowup. |
sub_8B2D90 | 18 KB | Global optimization | Cross-BB scheduling. Moves instructions across BB boundaries when safe (no side effects, dominance preserved). |
sub_8B4590 | 13 KB | Permutation search | Exhaustive permutation of instruction orderings for small BBs. Falls back to heuristic for larger blocks. |
sub_8B5400 | 14 KB | Latency-optimized | Maximizes memory latency hiding by aggressive interleaving of independent operations. |
sub_8B6D60 | 12 KB | Pressure-optimized | Minimizes live range overlap by scheduling defs as close to their uses as possible. |
sub_8B77C0 | 15 KB | Dual-issue | Pairs co-issuable instructions for dual-issue architectures (sm_50/Maxwell). Uses sub_A9CDE0 and sub_A9CF90 for compatibility checks. |
sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA/BGMMA grouping for warpgroup tensor operations. |
sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping -- interleaves iterations to fill pipeline bubbles. |
sub_8BAAE0 | 15 KB | Loop-aware | Trip count + register awareness for loop bodies. |
sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Inserts and schedules memory prefetch instructions. |
sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merges adjacent barrier instructions to reduce overhead. |
sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimizes scoreboard entries by reusing barrier registers. |
sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning for multi-warp scheduling. |
sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection for dual-issue slots. |
sub_8BE320 | 25 KB | Complex combined pass | Multi-strategy combined pass for complex code patterns. |
sub_8A9D80 | 21 KB | Depth-first | DFS-based instruction ordering for deep dependency chains. |
sub_8AB750 | 9.8 KB | Critical path | DAG analysis for priority weight computation. |
Backtracking Scheduler
The backtracking strategy (sub_8B1190) is notable because it breaks from the greedy nature of standard list scheduling. When a scheduling decision leads to excessive stalls or resource conflicts, it can undo the last N steps (where N is bounded by a configurable depth), re-insert the affected instructions into the ready list, and try a different selection. This provides limited but effective lookahead without the full cost of optimal scheduling.
Dual-Issue Scheduling
For sm_50 (Maxwell), sub_8B77C0 pairs instructions that can execute simultaneously on different functional units. Eligibility is checked by sub_8CF5D0 (3.5 KB), which verifies architecture support and computes a dual-issue benefit score at scheduler+328. Pairing compatibility uses sub_A9CDE0 (is instruction dual-issuable?) and sub_A9CF90 (can this pair with the next instruction?).
Size Limits and Chunking
Two mechanisms prevent the scheduling algorithm from hitting quadratic complexity on large inputs:
BB Size Limit
sub_8CBAD0 scans all basic blocks during pre-scheduling setup. Any BB exceeding 4095 instructions is split by inserting scheduling barriers (sub_931920). This caps the per-BB scheduling problem size, ensuring the O(n^2) dependency graph construction remains tractable. The maximum BB size is tracked at scheduler+388.
Large Function Chunking
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is partitioned into chunks that are scheduled independently, then the results are merged. This avoids the full O(n^2) DAG construction for very large kernels. The chunk boundary selection respects BB boundaries and dependency chains to minimize cross-chunk constraint violations.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan | HIGH |
sub_682200 | -- | UnlinkFromReadyList -- remove and update deps | HIGH |
sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas | HIGH |
sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup, knob query | HIGH |
sub_685700 | -- | InitSchedulingState -- loop initialization | MEDIUM |
sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis | HIGH |
sub_687080 | -- | ReadyListManagementHelper | MEDIUM |
sub_687410 | -- | ResourceConflictCheck | MEDIUM |
sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry | HIGH |
sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode core loop | HIGH |
sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG | HIGH |
sub_68B9C0 | 46 KB | MainSchedulingLoop -- combined DAG + scheduling | HIGH |
sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring | HIGH |
sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation | HIGH |
sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering | HIGH |
sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path | HIGH |
sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis | HIGH |
sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges | HIGH |
sub_89C550 | 14 KB | InnerScheduleLoop -- inner scheduling iteration | HIGH |
sub_89EFC0 | 16 KB | ReadyListManager -- BST management | HIGH |
sub_8A9D80 | 21 KB | DepthFirstSchedule | MEDIUM |
sub_8AB750 | 9.8 KB | CriticalPathCompute | MEDIUM |
sub_8B1190 | 16 KB | ScheduleWithBacktrack | MEDIUM |
sub_8B2D90 | 18 KB | GlobalScheduleOpt -- cross-BB scheduling | MEDIUM |
sub_8B4590 | 13 KB | PermuteSchedule -- permutation search | MEDIUM |
sub_8B5400 | 14 KB | ScheduleForLatency | MEDIUM |
sub_8B6D60 | 12 KB | ScheduleForPressure | MEDIUM |
sub_8B77C0 | 15 KB | DualIssueScheduler | MEDIUM |
sub_8B8900 | 12 KB | TensorScheduler | MEDIUM |
sub_8B9390 | 23 KB | SoftwarePipeline | MEDIUM |
sub_8BAAE0 | 15 KB | LoopScheduler | MEDIUM |
sub_8BB9C0 | 8.2 KB | PrefetchScheduler | MEDIUM |
sub_8BC0B0 | 6.1 KB | BarrierCoalescence | MEDIUM |
sub_8BC990 | 7.6 KB | ScoreboardOpt | MEDIUM |
sub_8BCFA0 | 6.8 KB | WarpScheduleOpt | MEDIUM |
sub_8BDC40 | 7.9 KB | DualIssuePairing | MEDIUM |
sub_8BE320 | 25 KB | ComplexSchedulePass | MEDIUM |
sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
sub_8C7120 | -- | BarrierTrackingUpdate | MEDIUM |
sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree | HIGH |
sub_8C9320 | 47 KB | ComputePriority -- 8-bit packed heuristic | HIGH |
sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit | HIGH |
sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop | HIGH |
sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scan stage 1 | HIGH |
sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator | HIGH |
sub_8D1730 | 19 KB | ExecuteSchedulePass | HIGH |
sub_8D2510 | 3.6 KB | UpdateDependencies -- post-schedule dep update | HIGH |
sub_8D3150 | 2.0 KB | CheckResourceConflict | MEDIUM |
sub_8D32D0 | 14 KB | ScheduleInstruction -- schedule single instruction | HIGH |
sub_8D3D60 | 1.4 KB | InsertStall | HIGH |
sub_8D3E20 | 2.1 KB | ComputeStallCycles | HIGH |
sub_8D4000 | 3.0 KB | InsertBarrier | HIGH |
sub_8D5E00 | 38 KB | MainSchedulingLoop -- workhorse | HIGH |
sub_8D7760 | 41 KB | StallAndBarrierInsertion -- control word generation | HIGH |
sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW/memory/barrier | HIGH |
sub_925510 | 341 bytes | MoveInstruction -- doubly-linked list relink | HIGH |
sub_A08A00 | -- | ResourceModel -- FU cost query (3 modes) | HIGH |
sub_A091C0 | -- | InitResourceTracking | MEDIUM |
sub_A09530 | 365 bytes | UpdateStallCycles -- per-instruction latency update | HIGH |
sub_A09D40 | -- | UpdateWARTracking -- anti-dependency tracking | MEDIUM |
sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling | MEDIUM |
Per-SM Scheduling Backends
Everything documented above describes the main scheduler (Backend A), which covers approximately 436 KB at 0x680000--0x8FE000. ptxas contains two additional complete scheduling implementations activated for newer SM architectures. The three backends coexist in the binary; SM-version-gated dispatch selects which combination runs.
Architecture Dispatch
The function sub_7DDB50 reads an SM architecture index from context+2104 and returns it as an integer. Four dispatch stubs in the 0xC5FE00--0xC61000 range use this value to select the scheduling backend:
| Dispatch Stub | Condition | Backend Selected | Pipeline Stage |
|---|---|---|---|
sub_C5FEF0 | SmVersion > 1 | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
sub_C60910 | SmVersion > 1 && (context+1392 & 1) | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
sub_C5FFC0 | SmVersion > 1 | Backend C (RBT List), mode 1 | Pre-scheduling |
sub_C5FFF0 | SmVersion > 1 | Backend C (RBT List), mode 0 | Post-scheduling |
When SmVersion <= 1 (sm_50 through sm_75 -- Maxwell through Turing), control falls through to the main Backend A documented in the preceding sections. When SmVersion >= 2 (sm_80+ -- Ampere, Ada Lovelace, Hopper, Blackwell), Backends B and C replace Backend A entirely.
sub_C60910 has a secondary activation path: if *(options + 23544) == 1 && *(options + 23552) != 0, Backend B activates regardless of SM version, providing a knob override for testing the codec scheduler on older architectures.
Backends B and C are complementary, not competing. Backend C handles pre-scheduling and post-scheduling (the same pipeline stages as Backend A's 3-phase ReduceReg/ILP/DynBatch), while Backend B handles a separate codec/ISel scheduling step that has no equivalent in the legacy path.
Backend B -- SM89/90 Codec Scheduler (0x1225000)
Backend B is a forward-then-backward scheduling pass with continuous floating-point priority weighting. It replaces Backend A's discrete 8-bit packed heuristic with a configurable pressure/ILP tradeoff expressed as doubles.
| Entry | sub_1233D70 (1,527 B, 321 lines) -- pass phase 5 |
| Forward scheduler | sub_122AD60 (17.5 KB, 4,118 lines) -- largest function in range |
| Backward scheduler | sub_122F650 (18.2 KB, 3,917 lines) |
| Preparation | sub_123E0D0 -- instruction characterization |
| Post-fixup | sub_A112C0 -- scheduling result finalization |
| Priority structure | BST with FNV-1a hash tracking |
| Code region | 0x1225000--0x1240000 (132 functions, 111 KB) |
Float Weighting System
The entry point sub_1233D70 initializes two pairs of floating-point weights from the options object at *(context+1664) + 72:
Pair 1 -- Pressure/ILP tradeoff (options offsets 7200/7208):
| Weight | Default | Meaning |
|---|---|---|
pressure_weight | 1.8 | Contribution of register pressure to scheduling priority. Positive = favors orderings that reduce live register count. |
ilp_weight | -0.8 | Contribution of instruction-level parallelism. Negative = penalizes moves that reduce available parallelism. |
The two weights sum to 1.0 and form a weighted combination on a unit scale. The default 1.8/-0.8 split heavily favors register pressure reduction, accepting moderate ILP degradation -- appropriate for register-hungry Ada Lovelace and Hopper kernels.
Pair 2 -- Secondary scoring axis (options offsets 7560/7568):
| Weight | Default | Meaning |
|---|---|---|
forward_weight | 3.2 | Forward-looking scheduling contribution |
backward_penalty | -2.2 | Backward-looking penalty factor |
Both pairs are overridable. When the configuration byte at the respective offset equals 3, the weight is read from the adjacent double field and the complement is computed as 1.0 - weight:
if (*(BYTE*)(options + 7200) == 3):
pressure_weight = *(double*)(options + 7208)
ilp_weight = 1.0 - pressure_weight
After loading, both weight pairs are normalized by dividing by the register range (float)(max_regs - min_regs), producing per-register slopes:
range = (float)(max_regs) - (float)(min_regs)
pressure_slope = ilp_weight / range
secondary_slope = backward_penalty / range
This normalization ensures the scheduling heuristic scales consistently regardless of the target architecture's register file size.
Forward Pass (sub_122AD60)
The forward scheduler implements list scheduling with a BST priority queue, iterating basic blocks front-to-back. It uses FNV-1a hash tables (seed 0x811C9DC5, multiplier 16777619) for tracking scheduled instruction mappings. Instruction properties are queried via sub_7DF3A0. The function manages a ref-counted working set with proper cleanup at function exit. At 4,118 decompiled lines, it is the largest function in the 0x1225000 scheduling range.
Backward Pass (sub_122F650)
The backward scheduler receives the floating-point weights as direct parameters and processes basic blocks in reverse order. It calls into the barrier/scoreboard system (sub_BDC080, sub_BDBA60, sub_BDC0A0) and performs register liveness analysis via sub_A0EDE0. The function uses BST operations with left/right/parent pointer traversal and explicit rebalancing, then performs iterative tree cleanup at exit.
Backend C -- RBT List Scheduler (0x18CD000)
Backend C is a complete reimplementation of the list scheduling algorithm using a red-black tree priority queue, double-precision scoring, and an evaluate-then-commit model with hash-table solution caching. It replaces Backend A for all sm_80+ targets.
| Orchestrator | sub_1908D90 -- pre/post mode dispatch |
| Driver | sub_1906090 -- per-block scheduling loop |
| Core scheduler | sub_1902B70 (19 KB) -- RBT-based list scheduling |
| Solution evaluator | sub_1904B70 (26 KB) -- constraint check + commit |
| Constraint validator | sub_19043F0 (10 KB) -- feasibility testing |
| Pressure cost model | sub_18F3CB0 (16 KB) -- SIMD register pressure |
| Recursive cost propagation | sub_18FFD70 (23 KB) -- call-graph-aware scoring |
| Dependency update | sub_1902100 (15 KB) -- post-scheduling DAG update |
| RBT insert | sub_18FD370 -- balanced insertion with 3-key comparison |
| RBT extract | sub_18FCDA0 -- pop highest-priority node |
| RBT reset | sub_18F7EC0 -- tree cleanup |
| Score computation | sub_18FDAF0 -- double-precision weighted score |
| Hash table | sub_1906510 (14 KB) -- FNV-1a instruction ID lookup |
| Code region | 0x18CD000--0x190FFFF (392 functions, 275 KB) |
Red-Black Tree Priority Queue
The critical difference from Backend A is the priority queue data structure. Backend A uses a sorted singly-linked list (O(N) insertion per instruction). Backend C uses a red-black tree that maintains balance through rotation operations in sub_18FD170 (called at the end of every insertion).
Each RBT node is 40 bytes allocated from a pool, with node+24 pointing to the instruction's scheduling entry. The tree ordering uses a three-key comparison in sub_18FD370:
- Priority integer at scheduling_entry + 384 (descending -- higher-priority nodes are left children)
- Latency double at scheduling_entry + 368 (descending -- higher latency is scheduled first among equal-priority instructions)
- Instruction ID at *(scheduling_entry + 16) + 12 (ascending -- deterministic tiebreaker)
This three-key comparison provides O(log N) insertion and extraction, a significant improvement for basic blocks with hundreds of instructions where Backend A's O(N) sorted insertion becomes a bottleneck.
Core Scheduling Loop (sub_1902B70)
function RBTListSchedule(context, block, dep_info, bound, constraint):
InitRegisterPressure(context, block) // sub_18F8580
InitRBTree(tree) // sub_18F7EC0
for each instruction in block.instruction_list:
node = AllocPoolNode(40 bytes)
node.scheduling_entry = instruction
RBTreeInsert(tree, node) // sub_18FD370
ResizeScheduleArray(block, tree.count) // sub_18F9CC0
while tree is not empty:
best_node = RBTreeExtractMax(tree) // sub_18FCDA0
instruction = best_node.scheduling_entry
ReturnNodeToPool(best_node) // node read before being recycled
valid = vtable_check(context, block, instruction)
*(instruction + 365) = valid
UpdateDependencies(context, instruction, tree) // sub_1902100
if not valid:
InsertRejection(block + 112, instruction_id)
continue
// Record scheduling decision
position = ++(block + 360)
entry = *(block + 352) + 24 * position
entry[0] = instruction_id
entry[1] = instruction + 2 // scheduling state
entry[2] = instruction // back-pointer
// Compute and accumulate scores
latency = LookupLatency(context, instruction) // sub_18F5460
*(block + 96) += (priority - 2) * latency
*(block + 88) += latency * *(instruction + 376)
// Process successors, check conflicts (binary search on
// sorted 12-byte conflict array with 0xAAAAAAAAAAAAAAAB
// division-by-3 trick for index computation)
Evaluate-Then-Commit Model
Backend A uses a greedy approach: each scheduling decision is final. Backend C introduces a two-phase model where sub_1904B70 evaluates a proposed schedule against constraints before committing it:
- Build a candidate schedule (408-byte evaluation nodes with def/use/pred-deps/succ-deps lists)
- Validate via sub_19043F0 (the scheduling mode at +64 must be 5 or 6)
- Run the architecture-specific check via the vtable at *context + 16
- Verify register pressure via sub_19016E0
- Compute the score via sub_18FDAF0 (returns a double)
- If the score exceeds the threshold at context + 360, insert into the solution hash table at block + 304
This allows Backend C to explore multiple scheduling alternatives and commit only the best-scoring solution, a capability Backend A's greedy model lacks.
Recursive Cost Propagation (sub_18FFD70)
Backend C uniquely supports cross-function scheduling awareness through recursive cost propagation. sub_18FFD70 walks the call graph:
- For a given schedule entry, iterate predecessor blocks (linked list at +12/+13)
- Look up each predecessor in the scheduling map via sub_18F4E70
- If the predecessor is live (byte at +365), recursively process it
- After recursion, scan the instruction operand lists (offsets 80, 84), identifying register operands by the type tag (operand >> 28) & 7 == 1
- Clear register usage bits at reg_entry + 28
- Update the double-precision scores at offsets 88 and 96
This propagation allows scheduling decisions in callee functions to influence caller scheduling priorities -- a form of interprocedural scheduling absent from both Backend A and Backend B.
Backend Comparison
| Feature | Backend A (Main) | Backend B (SM89/90 Codec) | Backend C (RBT List) |
|---|---|---|---|
| Address range | 0x680000--0x8FE000 | 0x1225000--0x1240000 | 0x18CD000--0x190FFFF |
| Code size | ~436 KB | ~111 KB | ~275 KB |
| SM gate | SmVersion <= 1 | SmVersion >= 2 | SmVersion >= 2 |
| Pipeline stage | Pre + post scheduling | Codec/ISel scheduling | Pre + post scheduling |
| Priority encoding | 8-bit packed integer | Float-weighted BST | RB-tree (int + double + ID) |
| Priority function size | 47 KB monolithic | Distributed across weights | 3-key comparison |
| Ready list structure | Sorted singly-linked list | Binary search tree | Red-black tree |
| Insertion complexity | O(N) per instruction | O(log N) | O(log N) |
| Scheduling passes | 3 (ReduceReg / ILP / DynBatch) | 2 (Forward / Backward) | 2 (Pre / Post) |
| Pressure tracking | Bitvector + popcount | Float slope per register | SIMD bitmap + cost model |
| Weight configuration | Knobs 769--805 (integer) | Options 7200/7560 (double) | Vtable dispatch |
| Score type | Integer (packed bits) | Double (weighted sum) | Double (accumulated) |
| Solution search | Greedy (single pass) | Forward + backward | Evaluate + commit |
| Cross-function awareness | None | None | Recursive cost propagation |
| Hash infrastructure | None | FNV-1a | FNV-1a |
| Backtracking | Optional (sub_8B1190) | None | Rejection set + retry |
Backend B + C Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_1233D70 | 1.5 KB | SM89/90 CodecScheduleEntry -- pass phase 5, float weight init | HIGH |
sub_122AD60 | 17.5 KB | ForwardCodecScheduler -- BST list scheduling, FNV-1a hash tracking | HIGH |
sub_122F650 | 18.2 KB | BackwardCodecScheduler -- reverse pass, barrier/scoreboard integration | HIGH |
sub_123ADD0 | 5.8 KB | CodecDependencyGraphBuilder -- dispatched via vtable | MEDIUM |
sub_12371D0 | 3.8 KB | CodecInstructionClassifier -- convergence-based property testing | MEDIUM |
sub_123E0D0 | -- | CodecSchedulePreparation -- instruction characterization | MEDIUM |
sub_A112C0 | -- | CodecSchedulePostFixup -- result finalization | MEDIUM |
sub_1908D90 | -- | RBTScheduleOrchestrator -- pre/post mode dispatch | HIGH |
sub_1906090 | -- | RBTScheduleDriver -- per-block loop, 368-byte block stride | HIGH |
sub_1902B70 | 19 KB | RBTCoreListScheduler -- RB-tree priority queue loop | HIGH |
sub_1904B70 | 26 KB | RBTSolutionEvaluator -- constraint check, score threshold, hash commit | HIGH |
sub_19043F0 | 10 KB | RBTConstraintValidator -- mode 5/6 feasibility | HIGH |
sub_19038E0 | 15 KB | RBTInitialEvaluation -- per-block constraint bootstrapping | MEDIUM |
sub_18F3CB0 | 16 KB | RBTPressureCostModel -- SIMD register pressure computation | HIGH |
sub_18FFD70 | 23 KB | RBTRecursiveCostPropagation -- call-graph-aware scoring | HIGH |
sub_1902100 | 15 KB | RBTDependencyUpdate -- post-scheduling DAG maintenance | HIGH |
sub_18FD370 | -- | RBTreeInsert -- 3-key balanced insertion + fix-up | HIGH |
sub_18FCDA0 | -- | RBTreeExtractMax -- pop highest-priority node | HIGH |
sub_18F7EC0 | -- | RBTreeReset -- tree cleanup | HIGH |
sub_18F8580 | -- | RBTRegisterPressureInit -- pressure state initialization | MEDIUM |
sub_18F5460 | -- | RBTLatencyLookup -- vtable-dispatched latency query | MEDIUM |
sub_18FDAF0 | -- | RBTScoreComputation -- double-precision weighted score | HIGH |
sub_1906510 | 14 KB | RBTHashLookup -- FNV-1a instruction ID hash table | HIGH |
sub_18FB850 | -- | RBTHashResize -- power-of-2 growth, 0.5 load factor | HIGH |
sub_1901200 | -- | RBTScorePropagationDriver -- calls sub_18FFD70 | MEDIUM |
sub_19081F0 | 17 KB | RBTBlockDependencyGraphBuild -- per-block DAG construction | HIGH |
sub_19072F0 | 14 KB | RBTInterBlockScheduling -- cross-BB register dependency | MEDIUM |
sub_18FEE60 | -- | RBTScheduleStateCreate -- 528-byte state construction | MEDIUM |
sub_18FE320 | -- | RBTScheduleDataPrepare -- pre-scheduling data setup | MEDIUM |
sub_18F94C0 | -- | RBTCleanup -- state teardown | MEDIUM |
sub_C5FFC0 | -- | DispatchPreSchedule -- SM gate -> Backend C (mode 1) | CERTAIN |
sub_C5FFF0 | -- | DispatchPostSchedule -- SM gate -> Backend C (mode 0) | CERTAIN |
sub_C5FEF0 | -- | DispatchCodecSchedule -- SM gate -> Backend B | CERTAIN |
sub_C60910 | -- | DispatchConditionalCodecSchedule -- SM gate + knob override | CERTAIN |
sub_7DDB50 | -- | GetSmVersionIndex -- reads context+2104 | HIGH |
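Both Backend B and Backend C track instruction identity with FNV-1a hashing (sub_122AD60, sub_1906510), with power-of-2 table growth at a 0.5 load factor (sub_18FB850). A minimal sketch of 64-bit FNV-1a with a masked power-of-2 index, assuming the standard FNV constants -- the recovered code's exact hash width and seed have not been confirmed:

```python
def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a; offset basis and prime per the FNV spec."""
    h = 0xCBF29CE484222325          # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV 64-bit prime
    return h

def table_index(h: int, table_size: int) -> int:
    # A power-of-2 table size lets the modulo reduce to a mask,
    # matching the growth policy recovered from sub_18FB850.
    return h & (table_size - 1)
```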
Scheduling Guidance Output
After scheduling completes, ptxas can emit statistics comments into the SASS output and DUMPIR stream. Three emitter functions produce scheduling guidance in different contexts, all reading from a shared ~1400-byte statistics object. sub_A46CE0 controls the "SCHEDULING GUIDANCE:" header that wraps per-block scheduling output. sub_A3A7E0 emits per-function statistics as # [field=value] comment lines during DUMPIR. Eight post-regalloc clones at sub_ABBA50--sub_ABEB50 emit a variant with hardware pipe names.
Verbosity Controls
Two independent verbosity mechanisms gate the output:
Scheduling guidance level at *(DWORD*)(vtable + 992):
| Level | Behavior |
|---|---|
| 0 | No scheduling guidance output |
| 1+ | "SCHEDULING GUIDANCE:" header emitted; per-block scheduling dispatched |
| 2+ | Pre-formatting hook called via vtable+816 before header emission |
| 4+ | "LOOP STATIC METRICS : " sub-header appended |
DUMPIR detail bits at context+1416:
| Bit | Mask | Behavior |
|---|---|---|
| 3 | 0x08 | Enable detailed statistics (FP16 vectorization, functional unit breakdown, throughput estimates) |
| 4 | 0x10 | Show worst-case latency: # [worstcaseLat=%f] |
| 5 | 0x20 | Show average-case latency: # [avgcaseLat=%f] |
Bits 4 and 5 are mutually exclusive -- only one latency variant is emitted.
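The two gates can be sketched as a pair of predicates. The bit masks come from the tables above; the function names are hypothetical, and the behavior when both bits 4 and 5 are set is an assumption (neither COND matches, so no latency line is emitted):

```python
def latency_variant(detail_bits):
    """Decode the DUMPIR detail bits at context+1416 (masks per the table)."""
    sel = detail_bits & 0x30          # isolate bits 4-5
    if sel == 0x10:
        return "worstcaseLat"
    if sel == 0x20:
        return "avgcaseLat"
    return None                       # 0x00 or 0x30: no latency line

def detailed_stats_enabled(detail_bits):
    return bool(detail_bits & 0x08)   # bit 3: FP16/FU/throughput breakdown
```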
Emitter Functions
| Address | Size | Identity | Confidence | Context |
|---|---|---|---|---|
sub_A3A7E0 | 1,236 B | Statistics::emitFunctionStats | CERTAIN | Pre-regalloc DUMPIR statistics. 20+ format strings at 0x21EBF76--0x21EC3B0. Uses abstract FU names (fp, half, shared, controlFlow, loadStore). |
sub_A46CE0 | 1,793 B | SchedulingGuidance::buildAndEmit | HIGH | Scheduling guidance header + BB classification. Walks BB array at context+296, dispatches schedulable blocks to vtable+336. |
sub_A4B8F0 | 248 B | StatsEmitter::emitInstrRegStats | HIGH | Binary-embedded metadata. Writes record type 3 (string) into SASS code object at *(a1+1000) + *(a1+996). |
sub_ABBA50--sub_ABEB50 | 8 x 1,771 B | PostSchedStats::emit (SM-variant) | CERTAIN | Post-regalloc statistics. 8 clones at 0x700 spacing. Format strings at 0x21FA008--0x21FA400. Uses hardware pipe names (adu, alu, cbu, fma, lsu). |
Pre-Regalloc Output Format (sub_A3A7E0)
Emitted during DUMPIR. All lines prefixed with # . Lines marked [COND] are gated by the stated condition.
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [urregs=8] [COND: SM > 0x5FFF]
# [_lat2inst=0.0]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00] [COND: +1416 bit 3]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [SharedMem Alloc thru=0.125000] [COND: value != 0.0]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [Precise inst=0]
# [worstcaseLat=87.000000] [COND: +1416 bits 4-5 == 0x10]
# [avgcaseLat=52.500000] [COND: +1416 bits 4-5 == 0x20]
# [instHint=142] [instPairs=4] [COND: instPairs != 0]
# <custom annotation> [COND: linked list at stats[55] != NULL]
Key format details: pre-regalloc uses commas between some bracket groups ([SSpillB=%d], [SRefillB=%d],) and abstract functional unit names (fp, half, trancedental, shared, controlFlow, loadStore, texLoadStore). The misspelling "trancedental" (for "transcendental") exists in the binary's format strings and is reproduced verbatim here.
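The # [field=value] lines are regular enough to parse mechanically. A sketch of a tolerant parser -- the regex and type inference are conveniences, not recovered logic -- which also skips over the stray commas of the pre-regalloc format:

```python
import re

FIELD_RE = re.compile(r"\[([^=\]]+?)\s*=\s*([^\]]+)\]")

def parse_guidance_line(line):
    """Extract field=value pairs from a DUMPIR statistics comment line."""
    out = {}
    for key, val in FIELD_RE.findall(line.lstrip("# ")):
        val = val.strip()
        out[key.strip()] = float(val) if "." in val else int(val)
    return out
```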
Post-Regalloc Output Format (sub_ABBA50 clones)
Emitted after scheduling by SM-variant clones dispatched via vtable. Same # prefix. Differs from the pre-regalloc format in three ways:
- No commas between bracket groups
- SpillSize replaces LowLmemSpillSize + FrameLmemSpillSize
- Hardware pipe names replace abstract unit names; an MMA variant breakdown is added

The unique lines are shown below; lines shared with the pre-regalloc format keep the same structure, minus the commas:
# [est latency = %d] [LSpillB=%d] [LRefillB=%d] [SSpillB=%d] [SRefillB=%d] [SpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d] [NonSpillSize=%d]
# [Occupancy = %f] [est numDivergentBranches=%d] [attributeMemUsage=%d] [programSize=%d]
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] [est fma=%d] [est half=%d]
# [est trancedental=%d] [est ipa=%d] [est lsu=%d] [est redux=%d]
# [est schedDisp=%d] [est tex=%d] [est ttu=%d] [est udp=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d] [est hmma16816=%d] [est hmma16816f16=%d]
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [issue thru=%f] [adu thru=%f] [alu thru=%f] [cbu thru=%f] [fma2x thru=%f] [fma thru=%f]
# [trancedental thru=%f] [ipa thru=%f] [lsu thru=%f] [redux thru=%f]
# [schedDisp thru=%f] [tex thru=%f] [ttu thru=%f] [udp thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f] [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [reg thru=%f] [warp thru=%f]
Hardware Pipe Name Mapping
The post-regalloc format maps abstract functional unit names to hardware execution pipe identifiers:
| Post-Regalloc Pipe | Pre-Regalloc Equivalent | Description |
|---|---|---|
adu | -- | Address Divergence Unit (address computation) |
alu | fp | Arithmetic Logic Unit (integer + FP32 combined) |
cbu | controlFlow | Control/Branch Unit (branch, exit, barrier) |
fma2x | -- | Double-precision FMA (separate pipe on sm_80+) |
fma | fp | Fused Multiply-Add (FP32) |
half | half | FP16 operations |
lsu | loadStore + shared | Load/Store Unit (unified) |
redux | -- | Reduction Unit (warp-level reductions) |
schedDisp | -- | Scheduler Dispatch (internal overhead) |
tex | tex | Texture Unit |
ttu | -- | Tensor Texture Unit (Ada Lovelace+) |
udp | -- | Uniform Data Path operations |
Binary-Embedded Statistics Record (sub_A4B8F0)
Separate from the DUMPIR comment output, sub_A4B8F0 writes a compact binary record into the SASS code object during emission:
Format string: "instr/R-regs: %d instructions, %d R-regs"
instructions = stats[335] - stats[341] (total minus removed)
R-regs = stats[159] + stats[102] (extra + base allocation)
Record layout in output buffer:
+0 DWORD type = 3 (string record type)
+4 DWORD string_length
+8 char[] string_content (formatted text)
The companion function sub_A4B9F0 writes record type 2 for undefined register warnings: "Referencing undefined register: %s%d".
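The record layout maps onto a simple struct pack. A sketch -- the surrounding buffer management of sub_A4B8F0 is not modeled, and whether string_length counts a trailing NUL is unconfirmed:

```python
import struct

def pack_string_record(text):
    """Record layout recovered from sub_A4B8F0:
       +0 DWORD type=3, +4 DWORD string_length, +8 char[] content.
       (Trailing-NUL handling is an assumption: none is appended here.)"""
    payload = text.encode("ascii")
    return struct.pack("<II", 3, len(payload)) + payload
```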
Scheduling Guidance Header (sub_A46CE0)
sub_A46CE0 emits the scheduling guidance wrapper, then walks the BB array to classify and dispatch blocks for scheduling. The header is emitted into the output stream via sub_7FE930 (string builder) at context + 1440.
BB classification algorithm:
For each BB in context+296 (index 0 through context+304):
- Schedulable: sub_7544D0(context, bb) returns true AND sub_754510(context, bb) returns false. Dispatched immediately to scheduling via vtable+336.
- Type-8 (deferred): *(bb+16) == 8. Added to a dynamically grown src array for second-pass processing.
- Loop back-edge: When *(bb+148) != 0 and *(bb+128) != NULL, the function walks the predecessor linked list at bb+128. For each predecessor, it checks whether the predecessor's iteration order (bb+144) exceeds the current block's, and whether the predecessor's terminal instruction is a branch (opcode 0x5D after masking with 0xFFFFCFFD) with a matching program counter at (instr+84) & 0xFFFFFF. If a back-edge is detected, the scheduling dispatch includes the back-edge source instruction as a hint parameter.
After the first pass, deferred type-8 blocks are processed in a second loop with the same back-edge detection logic.
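The two-pass dispatch structure can be sketched with hypothetical predicates standing in for sub_7544D0 and sub_754510; the precedence between the schedulable test and the type-8 test is a guess, and the back-edge hint parameter is omitted:

```python
def dispatch_blocks(blocks, is_schedulable, is_excluded, schedule):
    """Two-pass classification per sub_A46CE0: pass 1 dispatches ordinary
       schedulable blocks and defers type-8 blocks; pass 2 processes the
       deferred list with the same logic."""
    deferred = []
    for bb in blocks:
        if bb["type"] == 8:               # *(bb+16) == 8
            deferred.append(bb)           # the grown 'src' array in the binary
        elif is_schedulable(bb) and not is_excluded(bb):
            schedule(bb)                  # vtable+336 dispatch
    for bb in deferred:                   # second pass
        if is_schedulable(bb) and not is_excluded(bb):
            schedule(bb)
    return deferred
```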
Statistics Object Field Map
Both emitter families read from the same ~1400-byte statistics object. The object is accessed as a float* array; integer fields use the same DWORD index but read as int32.
| Index | Type | Field | Description |
|---|---|---|---|
| 8 | int32 | est_latency | Estimated schedule length in cycles |
| 9 | float | FP16_vectorization_pct | Fraction of FP16 instructions vectorized |
| 10 | int32 | worstcase_latency | Worst-case latency (cast to float for output) |
| 11 | int32 | avgcase_latency | Average-case latency (cast to float for output) |
| 12 | int32 | LSpillB | Long spill byte count |
| 13 | int32 | LRefillB | Long refill byte count |
| 14 | int32 | SRefillB | Short refill byte count |
| 15 | int32 | SSpillB | Short spill byte count |
| 16 | int32 | LowLmemSpillSize | Local-memory low spill allocation |
| 17 | int32 | FrameLmemSpillSize | Frame local-memory spill allocation |
| 18 | int32 | LNonSpillB | Long non-spill byte count |
| 19 | int32 | LNonRefillB | Long non-refill byte count |
| 20 | int32 | NonSpillSize | Non-spill allocation size |
| 26 | float | Occupancy | Occupancy ratio (0.0--1.0) |
| 27 | int32 | numDivergentBranches | Estimated divergent branch count |
| 28 | int32 | attributeMemUsage | Attribute memory usage in bytes |
| 29 | int32 | programSize | Program binary size in bytes |
| 42 | int32 | preciseInst | Count of precise (non-approximate) instructions |
| 44 | int32 | UDPinst | Uniform data-path instruction count |
| 45 | int32 | vecToURConverts | Vector-to-uniform-register conversion count |
| 49 | int32 | maxLiveAtSuspend | Max live register values at suspend points |
| 50 | float | issue_thru | Overall issue throughput (fraction of peak) |
| 54 | float | fp_thru | FP32 pipe throughput |
| 57 | float | half_thru | FP16 pipe throughput |
| 58 | float | transcendental_thru | Transcendental function pipe throughput |
| 59 | float | ipa_thru | Interpolation pipe throughput |
| 61 | float | shared_thru | Shared memory pipe throughput |
| 62 | float | controlFlow_thru | Control flow pipe throughput |
| 65 | float | texLoadStore_thru | Texture and load/store pipe throughput |
| 84 | float | reg_thru | Register throughput |
| 85 | float | warp_thru | Warp throughput |
| 86 | float | sharedMemAlloc_thru | Shared memory allocation throughput |
| 87 | int32 | partiallyUnrolledLoops | Partially unrolled loop count |
| 88 | int32 | nonUnrolledLoops | Non-unrolled loop count |
| 89 | int32 | CBBoundTex | Constant-bank-bound texture count |
| 90 | int32 | PartiallyBoundTex | Partially bound texture count |
| 91 | int32 | BindlessTex | Bindless texture count |
| 92 | int32 | URBoundTex | Uniform-register-bound texture count |
| 93 | int32 | SM_architecture_enum | SM version discriminator (>0x5FFF enables UR stats) |
| 99 | int32 | uniform_reg_count | Uniform register count |
| 102 | int32 | R_reg_base | Base R-register allocation |
| 159 | int32 | R_reg_extra | Extra R-register allocation |
| 303 | int32 | est_fp | Estimated FP32 instruction count |
| 306 | int32 | est_half | Estimated FP16 instruction count |
| 307 | int32 | est_transcendental | Estimated transcendental instruction count |
| 308 | int32 | est_ipa | Estimated IPA instruction count |
| 310 | int32 | est_shared | Estimated shared memory operation count |
| 311 | int32 | est_controlFlow | Estimated control flow operation count |
| 315 | int32 | est_loadStore | Estimated load/store instruction count |
| 316 | int32 | est_tex | Estimated texture instruction count |
| 334 | int32 | est_pairs | Estimated co-issuable instruction pairs |
| 335 | int32 | total_inst | Total instruction count (before removal) |
| 336 | int32 | texInst | Texture instruction count |
| 337 | int32 | FP16inst | FP16 instruction count |
| 338 | int32 | FP16VectInst | FP16 vectorized instruction count |
| 339 | int32 | instHint | Instruction hint value |
| 340 | int32 | instPairs | Instruction pair count (also output gate) |
| 341 | int32 | removed_inst | Removed instruction count |
| 342 | int32 | tepid_inst | TEPID (texture-pending) instruction count |
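Because both emitter families index the same buffer as a float* and reinterpret selected slots as int32, the dual-view reads can be mimicked with struct. The buffer below is synthetic; the indices come from the table above:

```python
import struct

def read_f32(buf, index):
    """Read a statistics slot as float (the object's native view)."""
    return struct.unpack_from("<f", buf, 4 * index)[0]

def read_i32(buf, index):
    """Read the same DWORD slot reinterpreted as int32."""
    return struct.unpack_from("<i", buf, 4 * index)[0]

# Synthetic ~1400-byte stats object: est_latency (index 8, int32) and
# Occupancy (index 26, float) populated as in the sample output above.
stats = bytearray(1400)
struct.pack_into("<i", stats, 4 * 8, 87)
struct.pack_into("<f", stats, 4 * 26, 0.75)
```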
Cross-References
- Scheduler Overview -- 3-phase architecture, register budget, scheduling knobs
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- the allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
- DUMPIR Configuration -- DUMPIR levels that trigger statistics output
- Spilling -- spill metrics (LSpillB, SSpillB) referenced in guidance output
Latency Model & Functional Units
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas instruction scheduler uses a static hardware performance model to estimate instruction latencies, functional unit occupancy, and pipeline conflicts. The model is architecture-parameterized: a family of 15+ profile-builder functions at 0x8E7300--0x8E9DC0 construct per-SM latency/throughput tables consumed by the scheduling engine. A separate 85 KB function (sub_89FBA0, SetOpcodeLatencies) assigns per-opcode scheduling classes that index into these tables. The combination produces a cost model that drives stall-count computation, priority scoring, and dual-issue pairing decisions.
| Per-opcode classifier | sub_89FBA0 (85 KB) -- assigns scheduling class per Ori opcode |
| HW profile builder | sub_8E5CA0 (20 KB) -- assembles scheduling control word tables |
| Warp profile | sub_8E4400 (3.3 KB) -- maps SM ID to warp/dispatch parameters |
| SM-specific tables | sub_8E7300--sub_8E97B0 -- 15 architecture-specific builders |
| Latency query | sub_693BC0 (22 lines) -- memory space classification |
| Long-latency check | sub_8CCF80 (2.3 KB) -- returns true if latency > 19 |
| Resource model | sub_A08A00 (345 lines) -- per-instruction FU cost computation |
| Register query | sub_A08910 (39 lines) -- operand register count/cost |
| Stall update | sub_A09530 (91 lines) -- per-instruction stall cycle accumulation |
| FU class mapper | sub_8F0CD0 -- maps (opcode, unit_name) to scheduling class |
| FU unit query | sub_704D30 (14 KB) -- maps SASS opcodes to functional unit IDs |
| Cutlass detector | sub_8F47E0 -- detects cutlass kernels for tuned scheduling |
| Pipe class assigner | sub_13710B0 (7.1 KB) -- SASS-level execution pipe assignment |
Architecture of the Latency Model
The model has three layers:
Layer 1: Per-Opcode Classification
sub_89FBA0 reads each instruction's Ori opcode (field at instr+72,
masked with 0xFFFFCFFF) and assigns:
- Scheduling class ID (stored at descriptor+4, range 1..772+)
- 9-bit latency index (low 9 bits of descriptor+196)
- Execution pipe mask (bits 15..19 of descriptor+196..200)
- Throughput class (bits in descriptor+198..199)
Layer 2: Architecture-Specific HW Tables
sub_8E7300..sub_8E97B0 build per-SM latency/throughput tables as
96-byte records in a growable array. Each record maps a scheduling
class to its pipeline latency, scoreboard wait count, barrier stall
cycles, and dual-issue compatibility flags.
Layer 3: Runtime Query
The scheduling engine queries the model via:
- sub_A08A00 for per-instruction resource costs (3 modes)
- sub_A08910 for register operand latency
- sub_693BC0 for memory space classification
- sub_8CCF80 for long-latency detection (threshold: 19 cycles)
Scheduling Class Assignment (sub_89FBA0)
sub_89FBA0 (85 KB, 2938 lines decompiled) is the largest function in the scheduling subsystem. It assigns each instruction a scheduling class -- an integer that indexes into the per-architecture latency tables. The function operates as a massive switch on *(instr+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked out).
Scheduling Descriptor Layout
Each instruction carries a scheduling descriptor at offsets 196--200 within the 296-byte Ori instruction object (not the SchedNode). The descriptor is a packed bit-field:
Descriptor at a3+196 (DWORD, 32 bits):
[8:0] 9-bit latency index -- indexes into HW latency table
[14:9] reserved
[19:15] 5-bit execution pipe mask -- identifies functional unit
0x08000 = pipe A (ALU)
0x10000 = pipe B (FP/tensor)
0x18000 = pipe C (memory/texture)
0xF8000 = all pipes (default sentinel)
Descriptor at a3+198 (WORD, 16 bits):
[3:0] pipe sub-class within the execution pipe
0x10 = sub-class 1 (control flow)
0x20 = sub-class 2 (integer ALU)
0x30 = sub-class 3 (FP32)
0x40 = sub-class 4 (FP64 / wide ops)
[8:4] throughput class (5 bits)
0x1F0 = maximum throughput (sentinel)
Descriptor at a3+199 (BYTE, high bits):
[5:1] additional pipe flags
0x3E = all flags set (default)
Specific values: 0x04 (ALU), 0x08 (SFU), 0x0A (FP64), 0x0C (tensor)
Descriptor at a3+200 (WORD, 16 bits):
[4:0] read barrier mask (5 bits, 0x1F)
[9:5] write barrier mask (5 bits, 0x3E0)
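The packed fields of the DWORD at +196 reduce to shifts and masks. A sketch, with field names taken from the layout above:

```python
def decode_descriptor_196(d):
    """Unpack the scheduling descriptor DWORD at instr+196."""
    return {
        "latency_index": d & 0x1FF,                 # bits [8:0]
        "pipe_mask":     (d >> 15) & 0x1F,          # bits [19:15]
        "all_pipes":     (d & 0xF8000) == 0xF8000,  # default sentinel
    }

# Example: WGMMA primary -- latency index 0xF1 with pipe bits left at default.
wgmma_desc = 0xF8000 | 0xF1
```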
Opcode-to-Class Mapping
The switch statement maps Ori opcodes to scheduling class IDs. These IDs are stored at *(v8+4) where v8 is a pointer to the instruction's extended scheduling record. Representative mappings:
| Ori opcode | Scheduling class | Execution pipe | Description |
|---|---|---|---|
| 1 | 130 | sub-class 1 (0x10) | Control flow (BRA, JMP) |
| 2--7 (wide) | 683 | sub-class 4 (0x40), pipe 0xA | Wide FP64 operations |
| 2--7 (narrow, type 19) | 52 | sub-class 2 (0x20) | Integer ALU (narrow) |
| 2--7 (narrow, other) | 72 | sub-class 2 (0x20) | Integer ALU (standard) |
| 3, 5 (medium) | 140 | sub-class 3 (0x30) | FP32 operations |
| 4 (medium) | 131 | sub-class 2 (0x20) | Integer MAD |
| 6 (wide) | 140 | sub-class 4 (0x40), pipe 0xA | FP64 pair operations |
| 8 (flag bit set) | 3 | default | Predicate operations (true) |
| 8 (flag clear) | 2 | default | Predicate operations (false) |
| 0xA, 0xB, 0x6C, 0x95 | 200 | sub-class 2 (0x20) | Integer compare/logic |
| 0xA (extended) | 551 | default | Extended integer (wide encoding) |
| 0xA (extended, Mercury) | 694/700 | default | Mercury-era extended integer |
| 0xE | 5 | default | Conversion operations |
| 0x10 (atomic) | 575 | default | Atomic with flag |
| 0x10 (global) | varies | sub-class 4 (0x40) | Global memory load/store |
| 0x141 | 745 | latency 0xF1 | WGMMA (warpgroup MMA) |
| 0x142 (variant 3) | 744 | latency 0xF0 | WGMMA variant |
| 0x143 | 765--767 | latency 0xFB | BGMMA/QMMA tensor variants |
| 0x144 | 600 | latency 0xE6 | Tensor fence |
| 0x145, 0x146 | 759 | sub-class 4, pipe 0xC | Tensor core (HMMA/BMMA) |
| 0x147, 0x148 (wide) | 761 | latency 0xFA | Double-precision tensor (wide) |
| 0x147, 0x148 (narrow) | 757 | latency 0xF6 | Double-precision tensor (narrow) |
| 0x149 | 604 | latency 0xE7 | Tensor synchronization |
| 0x13E | 749 | latency 0xF4 | Bulk copy (ACQBULK) |
| 0x13F | 748 | latency 0xF3 | Bulk release (RELBULK) |
| 0x13D (variant) | 747/750 | latency 0xF2/0xF5 | Collective operations |
The scheduling class IDs span a wide range (2--772+). Classes below 256 correspond to legacy instruction categories; higher classes (551, 575, 600, 683, 694, 700, 744--767) represent newer instruction types added for Hopper and Blackwell architectures.
Latency Index Encoding
The low 9 bits of the descriptor at a3+196 encode a latency index that maps directly into the per-architecture HW table. The index is formed by combining the descriptor's low byte with a pipe mask:
latency_index = *(WORD*)(a3+196) & 0x1FF
Observed latency index values and their instruction classes:
| Index (hex) | Index (dec) | Instruction class |
|---|---|---|
| 0xE6 | 230 | Tensor fence / sync |
| 0xE7 | 231 | Tensor synchronization |
| 0xF0 | 240 | WGMMA variant |
| 0xF1 | 241 | WGMMA primary |
| 0xF2 | 242 | Collective op (variant A) |
| 0xF3 | 243 | Bulk release |
| 0xF4 | 244 | Bulk copy |
| 0xF5 | 245 | Collective op (variant B) |
| 0xF6 | 246 | DP tensor (narrow) |
| 0xF8 | 248 | Tensor core (HMMA/BMMA) |
| 0xFA | 250 | DP tensor (wide) |
| 0xFB | 251 | BGMMA/QMMA |
The highest index values (0xE6--0xFB) correspond to tensor and collective operations -- the most complex instructions with the longest and most architecture-variable latencies.
Functional Unit Categories
The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe on the SM.
10-Element Resource Vector
Resource tracking uses an 84-byte per-BB slot at *(scheduler+672) + 84 * slot_index:
| Index | Pipe name | Typical SASS instructions | Throughput (IPC) |
|---|---|---|---|
| 0 | Integer ALU (ALU) | IADD3, IMAD, ISETP, LOP3, SHF, IABS, POPC | 1 (full rate) |
| 1 | FP32 (FMA) | FADD, FFMA, FMUL, FSETP, FMNMX, FCHK | 1 (full rate) |
| 2 | FP64 (DFMA) | DADD, DFMA, DMUL, DSETP, DMNMX | 1/2 to 1/64 (SM-dependent) |
| 3 | Tensor core (MMA) | HMMA, IMMA, BMMA, BGMMA, WGMMA, QMMA | varies |
| 4 | Load/store (LSU) | LDG, STG, LDL, STL, LDS, STS, LDGSTS | 1 (full rate) |
| 5 | Texture (TEX) | TEX, TLD, TXQ, TLD4, TEXS | 1/2 to 1/4 |
| 6 | Control flow (BRA) | BRA, JMP, EXIT, RET, CALL, BRK, CONT | 1 |
| 7 | Shared memory (SMEM) | ATOMS, REDS, LDS, STS (atomic/reduce variants) | 1 |
| 8 | Special function (SFU) | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) | 1/4 |
| 9 | Uniform datapath (UDP) | UPLOP3, UISETP, UIMAD, uniform operations | 1 |
The resource vector layout within each 84-byte slot:
Offset Size Content
0..39 10 x int32 Current resource usage per FU (pipe 0..9)
40..79 10 x int32 Resource pressure delta (change from scheduling)
80..83 1 x int32 BB-entered flag and auxiliary state bits
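The 84-byte slot layout maps directly onto a struct unpack. A sketch with synthetic slot contents:

```python
import struct

SLOT_FMT = "<10i10ii"        # usage[10], delta[10], flags -- 84 bytes total

def decode_fu_slot(buf, slot_index):
    """Decode one 84-byte per-BB resource slot at *(scheduler+672)."""
    fields = struct.unpack_from(SLOT_FMT, buf, 84 * slot_index)
    return {
        "usage": list(fields[0:10]),    # current use per pipe 0..9
        "delta": list(fields[10:20]),   # pressure change from scheduling
        "flags": fields[20],            # BB-entered flag + aux state bits
    }
```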
Functional Unit Class Mapping (sub_8F0CD0)
A secondary mapper at sub_8F0CD0 translates (opcode, unit-name-string) pairs to numeric scheduling class IDs for the stall/barrier encoding stage:
| Opcode | Unit string | Class ID | Meaning |
|---|---|---|---|
| 40 | "LSU_T" | 15 | Texture load/store unit |
| 40 | "XU64" | 35 | Extended unit (64-bit ops) |
| 39 | "DMMA" | 118 | Double-precision matrix multiply |
| 53 | "DMMA" | 118 | DMMA (alternate opcode) |
| default | -- | 35 | Fallback to extended unit |
The "LSU_T" and "XU64" string tags appear in the Mercury-era post-scheduling pipeline where the SASS encoder needs to distinguish sub-pipes within the load/store and extended-precision units.
Functional Unit Query (sub_704D30)
sub_704D30 (14 KB) maps SASS opcode character codes to functional unit IDs for the Mercury encoder's latency model. The mapping uses single-character opcode identifiers:
| Char code | Decimal | FU ID | Unit |
|---|---|---|---|
'D' (68) | 68 | 40 | FP64 unit |
'E' (69) | 69 | 44 | Extended unit |
'F' (70) | 70 | 48 | FP32 unit |
'J' (74) | 74 | 52 | Integer unit |
'K' (75) | 75 | 56 | Conversion unit |
'L' (76) | 76 | 60 | Load/store unit |
'N' (78) | 78 | 32 | Tensor unit |
'S' (83) | 83 | 36 | Special function unit |
The function dispatches on *(config+372) >> 12 (the SM architecture selector) to handle architecture-specific unit mapping variations (e.g., Kepler vs Volta).
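The character-code table reduces to a dictionary lookup. A sketch covering the one architecture variant tabulated above; fallback behavior for unlisted codes is not confirmed:

```python
# Opcode character -> functional unit ID, per the table recovered
# from sub_704D30 (one architecture variant; others may differ).
CHAR_TO_FU = {
    "D": 40,  # FP64 unit
    "E": 44,  # extended unit
    "F": 48,  # FP32 unit
    "J": 52,  # integer unit
    "K": 56,  # conversion unit
    "L": 60,  # load/store unit
    "N": 32,  # tensor unit
    "S": 36,  # special function unit
}

def fu_id(opcode_char):
    return CHAR_TO_FU.get(opcode_char)   # None for codes not in the table
```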
Per-Architecture HW Latency Tables
Table Construction Pipeline
The HW latency tables are built during scheduler initialization by a chain of constructors:
sub_8E4400(profile, sm_id, sched_mode) // Warp-level parameters
|
v
sub_8E5CA0(profile, table_ptr, table_size) // Assemble output array
|
+-- sub_8E6760() // Group boundary markers
+-- sub_8E6950() // Barrier entries
+-- sub_8E6B40() // Standard scheduling entries
+-- sub_8E6F20() // Wait dependency entries
+-- sub_8E7110() // Scoreboard entries
|
v
sub_8E7300..sub_8E97B0(profile, ...) // SM-specific table population
|
v
sub_8E3AD0(output, count, entries, ...) // Copy into final profile
Each SM-specific function populates entries in the 96-byte-per-record output array. Records encode latency, throughput, pipe assignment, and barrier compatibility for each scheduling class.
96-Byte Schedule Record Format
Each record in the HW table occupies 96 bytes (6 x 16-byte XMM slots). Records are stored in a growable array at *(context+56) with count at *(context+64) and capacity at *(context+68). The array grows by 1.5x when full. Records are copied using three _mm_loadu_si128 operations (offsets 0, 16, 32) plus manual field-by-field copy for offsets 48--95; the string at +48 is reference-cloned via sub_714160 when the string-backed flag is set.
Offset Size Field Content
------ ---- ----- -------
0..1 WORD type_code Record type (see type table below)
2..3 WORD (padding) Zero
4..7 DWORD aux_size Type-dependent:
root (type 1): table_size
barrier ('M'): 128 (fixed)
wait/scoreboard ('5'/'6'): 36
sched entry (23): 0
8..15 8B (reserved) Zero
16..19 DWORD cost_product Scheduling cost (latency x throughput product)
- Standard entry (23): a2 * a3
- Category header ('!'): entry_count from config+528
- Wait/scoreboard: 280 (fixed sentinel)
- SM-specific (','): 4 * class_count
20..21 WORD base_latency Base latency in cycles (standard entries only)
22..23 WORD dual_issue_flags Dual-issue compatibility mask (standard entries only)
24..31 8B (reserved) Zero
32..39 QWORD data_ptr Pointer to type-specific data block:
- Root: parent profile object
- Wait/scoreboard: dependency tracking table
- Barrier: barrier data array
- Category headers: 0
40..47 QWORD data_size Byte count of data block at data_ptr:
- Root: table_size; barrier: 128
- Wait/scoreboard: 36; headers: 0
48 BYTE inline_flag 0 = data_ptr/data_size carry raw data
1 = this record uses the inline string buffer
49..63 15B inline_str_buf Inline NUL-terminated string (max 15 chars)
64..71 QWORD parent_ptr Back-pointer: SM-specific entries point to table
root; category headers point to profile object
72..79 8B (reserved) Zero
80..87 QWORD string_buf_ptr Pointer to growable string buffer (32-byte header:
data_ptr, size, capacity, allocator) for variable-
length sub-records; self-references +48 when inline
88 BYTE string_backed_flag 1 = record owns allocated string data at +80
0 = no allocated string (uses inline or none)
89..95 7B (padding) Zero
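The scalar header fields of a 96-byte record can be unpacked as follows. A sketch covering only the fixed fields; pointer fields are read as raw integers:

```python
import struct

def parse_schedule_record(rec):
    """Scalar fields of one 96-byte HW table record (layout above)."""
    assert len(rec) == 96
    type_code, _pad, aux_size = struct.unpack_from("<HHI", rec, 0)
    cost_product, base_latency, dual_issue = struct.unpack_from("<IHH", rec, 16)
    data_ptr, data_size = struct.unpack_from("<QQ", rec, 32)
    inline_str = rec[49:64].split(b"\x00", 1)[0].decode("ascii", "replace")
    return {
        "type": type_code, "aux_size": aux_size,
        "cost_product": cost_product, "base_latency": base_latency,
        "dual_issue_flags": dual_issue,
        "data_ptr": data_ptr, "data_size": data_size,
        "inline_flag": rec[48], "inline_str": inline_str,
    }
```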
Record Type Codes
Records are polymorphic -- the type code at offset +0 selects the interpretation of fields +16..+31, +32..+47, and the sub-record format stored in the growable buffer at +80.
| Type | ASCII | Creator | Role |
|---|---|---|---|
| 1 | -- | sub_8E5CA0 | Root container (wraps entire HW table) |
| 23 | -- | sub_8E6B40 | Standard scheduling entry (latency + throughput + dual-issue) |
| 33 | '!' | sub_8E5740 | Category header (begins a named section with string list) |
| 44 | ',' | sub_8E8480 et al. | SM-specific table entry (per-architecture class data) |
| 45 | '-' | sub_8E5CA0 | Barrier section header (links 128-byte barrier table) |
| 49 | '1' | sub_8E5530 | Dimension entries (contains 12-byte sub-records) |
| 53 | '5' | sub_8E7110 | Scoreboard entry (dependency tracking, data_size=36) |
| 54 | '6' | sub_8E6F20 | Wait dependency entry (dependency table, data_size=36) |
| 57 | '9' | sub_8E5740 | Category footer (closes the section opened by type 33) |
| 59 | ';' | sub_8E5310 | Variant section (contains 20-byte sub-records) |
| 60 | '<' | sub_8E6760 | Group boundary marker (separates scheduling groups) |
| 69 | 'E' | sub_8E6950 | Barrier entry (a2 = stall count in cost_product field) |
| 77 | 'M' | sub_8E6D40 | Barrier/sync data entry (data_ptr = barrier array, 128B) |
| 87 | 'W' | sub_8E4F20 | Supplementary weight entry (variable-length string data) |
Sub-Record Formats in the Growable Buffer (+80)
Records with string_backed_flag=1 carry variable-length sub-records in the growable buffer. The buffer header at *(record+80) is a 32-byte object: {data_ptr, size (DWORD), capacity (DWORD), allocator_ptr}.
Type 59 (';') -- Variant sub-records (20 bytes each):
Created by sub_8E5310 iterating the variant list at config+536:
Sub-record layout (20 bytes):
+0 DWORD source_data Variant source identifier
+4 WORD flags Variant flags
+6 WORD zero Reserved
+8 DWORD throughput_value Throughput for this variant
+12 DWORD aux_value Auxiliary parameter
+16 DWORD zero Reserved
The main record additionally stores: +16 = start_index (from config+544), +20 = record_index, +24 = back_ref to previous category.
Type 49 ('1') -- Dimension sub-records (12 bytes each):
Created by sub_8E5530 traversing the BST at config+592:
Sub-record layout (12 bytes):
+0 WORD node_flags BST node flags (from node+38)
+2 WORD zero Reserved
+4 DWORD node_value BST node value (from node+32)
+8 DWORD node_child BST node child pointer (low 32 bits of node+24)
Type 44 (',') -- SM-specific class descriptor (16 bytes + packed bitmasks):
Created by sub_8E8480 and other SM-specific builders, followed by a call to sub_8E3AD0 which appends packed bitmask DWORDs:
Initial 16-byte descriptor:
+0 DWORD class_flags = 2 Fixed flag value
+4 WORD zero Reserved
+8 QWORD mask Latency mask (0xFFFFFFFF00000000)
Followed by bitmask DWORDs (4 bytes each, one per 8 scheduling classes):
Each DWORD encodes 4 bits per entry (4 entries x 4 properties):
bit 4*i+0: entry[i].field_0 != 1
bit 4*i+1: entry[i].field_4 != 1
bit 4*i+2: entry[i].field_8 != 1
bit 4*i+3: entry[i].field_12 != 1
Source entries are 20 bytes apart in the input array.
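The bitmask packing appended by sub_8E3AD0 can be sketched from the description above. The != 1 polarity is as recovered; the 4-entries-per-DWORD grouping follows the bit formula given here:

```python
import struct

def pack_property_dword(entries_blob, base):
    """Pack 4 entries x 4 properties into one DWORD.
       Entries are 20 bytes apart; properties at +0, +4, +8, +12.
       Bit 4*i+p is set when entry i's property p != 1."""
    word = 0
    for i in range(4):
        off = base + 20 * i
        for p in range(4):
            (field,) = struct.unpack_from("<I", entries_blob, off + 4 * p)
            if field != 1:
                word |= 1 << (4 * i + p)
    return word
```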
Assembly Sequence
sub_8E5CA0 orchestrates the complete table by emitting records in this order:
- Barrier header (type '-', conditional on config+336): links the 128-byte barrier data table at config+272.
- Root container (type 1): data_ptr = profile_object, data_size = table_size.
- Category header + footer (types '!'/'9'): emitted by sub_8E5740, which enumerates named sections from config+520..528.
- Variant section (type ';'): emitted by sub_8E5310 if config+544 != 0.
- Supplementary weights (type 'W'): emitted by sub_8E4F20 if config+640 != -1.
- Dimension entries (type '1'): emitted by sub_8E5530 if config+608 > 0.
After all records are appended, the function computes the total serialized size (with 16-byte alignment padding per data block), allocates the output buffer, and writes a 32-byte header per record into the linear output at context+104.
Architecture Dispatch Table
| Address | SM | Architecture | Table size | Notes |
|---|---|---|---|---|
sub_8E7300 | sm_70 | Volta | 3.3 KB | First Turing-era table format |
sub_8E7540 | sm_72 | Xavier | 2.9 KB | Automotive Volta variant |
sub_8E7720 | sm_75 | Turing | 3.5 KB | Added INT8/INT4 tensor ops |
sub_8E7940 | sm_80 (base) | Ampere base | 2.9 KB | Shared base for sm_80/86/87 |
sub_8E7B40 | sm_80 | Ampere | 3.3 KB | Full Ampere with async copy |
sub_8E7D80 | sm_86 | GA10x | 4.4 KB | Consumer Ampere |
sub_8E8070 | sm_87 | Orin | 3.5 KB | Automotive Ampere |
sub_8E8280 | sm_89 | Ada Lovelace | 3.1 KB | Added FP8 tensor ops |
sub_8E8480 | sm_90 | Hopper | 5.2 KB | DPX, WGMMA, TMA |
sub_8E8780 | sm_90a | Hopper accel. | 4.6 KB | WGMMA async extensions |
sub_8E8A90 | sm_100 | Blackwell DC | 3.0 KB | 5th-gen tensor, TCGEN05 |
sub_8E8CB0 | sm_100 (short) | Blackwell DC | 949 B | Supplementary table |
sub_8E8DB0 | sm_103 | Blackwell Ultra | 1.7 KB | GB300 extensions |
sub_8E8F60 | sm_103 (short) | Blackwell Ultra | 618 B | Supplementary table |
sub_8E9000 | sm_120 | RTX 50xx | 2.9 KB | Consumer Blackwell |
sub_8E92E0 | sm_120 (ext) | RTX 50xx | 5.5 KB | Extended consumer table |
sub_8E97B0 | universal | Fallback | 8.8 KB | Default for unknown SM |
sm_90 (Hopper) has the largest combined table (5.2 + 4.6 KB including sm_90a), reflecting the complexity of WGMMA, DPX, and TMA scheduling. sm_120 extended (5.5 KB) is the largest single SM-specific table (only the 8.8 KB universal fallback is bigger), accommodating the consumer Blackwell feature set.
The "short" supplementary tables (sub_8E8CB0 for sm_100, sub_8E8F60 for sm_103) add entries for architecture-specific instructions not covered by the base table -- typically new tensor core variants and collective operations.
Warp-Level Hardware Profile (sub_8E4400)
sub_8E4400 maps the SM architecture ID (a2) to warp-level dispatch parameters stored in a 36-byte structure:
Architecture-to-Warp Mapping
| SM ID range | Warps per SM | Dispatch slots | Architecture era |
|---|---|---|---|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| 20480--24575 | 6 | 176 | sm_60 (Pascal) |
| 24576--28672 | 7 | 192 | sm_70 (Volta) |
| 28673--32767 | 7 | 208 | sm_75 (Turing) |
| 32768--36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |
The packed DWORD at offset +18 encodes (warps, sub-warp-count) as a 32-bit value. For example, 983055 (0x000F000F) = 15 warps in the low half and 15 in the high half, while 1048592 (0x00100010) = 16 warps for sm_90+.
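Decoding the packed DWORD is a straightforward halfword split; a small helper for inspecting recovered values:

```python
def split_warp_dword(value: int):
    """Split the packed (warps, sub-warp-count) DWORD at offset +18
    into its low and high 16-bit halves."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF
```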
Sub-Architecture Variants
Specific SM version IDs map to sub-architecture variant codes stored at offset +26:
| SM ID | Hex | Variant | Architecture |
|---|---|---|---|
| 8193 | 0x2001 | 2 | sm_50 (Maxwell Titan X) |
| 20481 | 0x5001 | 2 | sm_60 variant |
| 24576 | 0x6000 | 0 | sm_70 (Volta base) |
| 28674 | 0x7002 | 2 | sm_75 variant A |
| 28675 | 0x7003 | 3 | sm_75 variant B |
| 28676 | 0x7004 | 4 | sm_75 variant C |
| 28677 | 0x7005 | 5 | sm_75 variant D |
| 32768 | 0x8000 | 0 | sm_80 (Ampere base) |
| 36864 | 0x9000 | 0 | sm_90 (Hopper base) |
| 36867 | 0x9003 | 3 | sm_90 variant A |
| 36868 | 0x9004 | 4 | sm_90 variant B (sm_90a) |
| 36869 | 0x9005 | 5 | sm_90 variant C |
Pipeline Width (offset +24)
The scheduling mode parameter (a3) selects the pipeline width stored at offset +24. This value controls how many instructions the scheduler models as issuing per cycle:
| Mode | Value at +24 | Meaning |
|---|---|---|
| 1, 8, 9 | 1 | Single-issue |
| 3 | 4 | Quad-issue (tensor) |
| 4 | 5 | Penta-issue |
| 5 | 6 | Hexa-issue |
| 6 | 7 | Hepta-issue |
| 7 | 8 | Octa-issue |
| 10 | 9 | Nona-issue |
| 11 | 10 | Deca-issue |
| default | 2 | Dual-issue |
These values model the effective issue width for different scheduling contexts. The tensor core modes (4--11) reflect warpgroup-level cooperative execution where multiple warp slots issue tensor instructions simultaneously.
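The mode-to-width mapping transcribes directly into a lookup, with unknown modes falling back to dual-issue (2) per the default row:

```python
# Mode -> value stored at offset +24, from the table above.
_PIPE_WIDTH = {1: 1, 8: 1, 9: 1, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 10: 9, 11: 10}

def pipeline_width(mode: int) -> int:
    """Issue width the scheduler models for the given scheduling mode."""
    return _PIPE_WIDTH.get(mode, 2)  # default row: dual-issue
```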
Memory Space Classification (sub_693BC0)
sub_693BC0 (22 lines) classifies the memory space of load/store instructions. It extracts the last source operand from the instruction, looks up the register descriptor, and calls sub_91C840 to determine the memory space type. The function returns an integer code:
| Return value | Memory space | Typical latency range |
|---|---|---|
| 1 | Generic (resolved at runtime) | 20--200+ cycles |
| 2 | Local memory (per-thread stack) | 20--200 cycles |
| 3 | Shared memory | 20--30 cycles |
| 4 | Constant memory (cached) | 4--8 cycles |
| 7 | Constant bank (indexed) | 4--8 cycles |
| 11 | Surface memory | 200--500 cycles |
| 16 | Global memory (DRAM) | 200--500 cycles |
The scheduler uses these values in the priority function (sub_8C9320) to distinguish "hot" (long-latency) memory operations from "cold" (short-latency) ones: sub_A9CDE0 classifies hot (global/texture) memory, and sub_A9CF90 classifies cold (constant/shared) memory.
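One plausible hot/cold grouping follows from the latency ranges in the table; the exact code sets tested by sub_A9CDE0/sub_A9CF90 are an assumption, and the function name is descriptive:

```python
# Space codes from the table above, grouped by typical latency range.
HOT_SPACES = {1, 2, 11, 16}   # generic, local, surface, global: long latency
COLD_SPACES = {3, 4, 7}       # shared, constant, constant bank: short latency

def is_hot_memory(space_code: int) -> bool:
    """True for long-latency memory spaces (assumed grouping)."""
    return space_code in HOT_SPACES
```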
Long-Latency Detection (sub_8CCF80)
sub_8CCF80 checks if an instruction qualifies as "long-latency" for scheduling priority purposes. The function:
- Verifies the target architecture supports dual-issue via `sub_7DC0E0`.
- For opcode 183 (LD/ST variant): checks memory space via `sub_693BC0`. Memory spaces 4, 16, 2, 11, 3, 1, and 7 all qualify for long-latency classification.
- For opcode 130 (`HSET2` in the ROT13 name table; used as a generic internal marker): queries via vtable+640 whether the instruction is recognized as long-latency.
- Queries the scheduling oracle (`sub_8BF3A0`) for the instruction's estimated latency.
- Returns `true` if the estimated latency exceeds 19 cycles.
The threshold of 19 cycles is the boundary between "short-latency" instructions (ALU, FP32, shared memory) and "long-latency" instructions (global memory, texture, tensor core) that benefit from latency hiding through instruction reordering.
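The decision above reduces to a short predicate; this is a sketch, with the opcode-183 memory-space set and the 19-cycle threshold taken from the text and the latency estimate standing in for the sub_8BF3A0 oracle:

```python
LONG_LATENCY_THRESHOLD = 19  # cycles, per the text
LONG_LATENCY_SPACES = {1, 2, 3, 4, 7, 11, 16}  # spaces listed above

def is_long_latency(opcode: int, mem_space: int, est_latency: int) -> bool:
    """Approximation of sub_8CCF80's long-latency classification."""
    if opcode == 183 and mem_space in LONG_LATENCY_SPACES:
        return True
    return est_latency > LONG_LATENCY_THRESHOLD
```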
Resource Cost Model (sub_A08A00)
sub_A08A00 (345 lines) computes per-instruction resource costs for the 10-element functional unit vector. It operates in three modes selected by parameter a6:
Mode 0/1: Instruction Cost Initialization
Resets the instruction's resource tracking state:
- `a1[0]` = 0 (accumulated cost)
- `a1[1045]` = 0 (accumulated delta)
- `a1[2071]` = 0 (accumulated pressure)
- Byte at offset 8280 = 0 (flags)
Then computes per-operand resource contributions by iterating source operands (count at a3+80, starting at a3+84):
Mode 2: Differential Cost
Computes the differential cost (new minus old):
v55 = a1[0] // previous instruction cost
v56 = a1[1045] // previous delta cost
Then runs the same operand iteration as mode 1 and subtracts the previous values.
Mode 3: Pressure Accumulation
Adds the instruction's previously computed pressure a1[2071] into the running total at *(a5+24).
Per-Operand Cost Computation
For each source operand, the function:
- Checks operand type: `((operand >> 28) & 7) == 1` means register operand.
- Skips operands with values 41--44 (special sentinel registers).
- Looks up the register descriptor via `*(a1+88) + 8 * (operand & 0xFFFFFF)`.
- Checks if the register class `*(descriptor+64)` is <= 6 (physical register file).
- Calls `sub_A08910` to get the register's latency and count:
  - Returns the starting register index
  - Outputs count (`*a4`) and cost-per-register (`*a5`)
- Iterates over the register range, accumulating costs for registers not in the "already-consumed" bitmask at `*(a1+832)`.
The cost accumulation uses a 9-bit field in the instruction's scheduling word at offset +12, masked as & 0x1FF.
Register Latency Query (sub_A08910)
sub_A08910 (39 lines) returns the register index and cost for a single operand:
function GetRegisterLatency(context, reg_desc, operand, out_count, out_cost):
pipeline_bits = (reg_desc.field_48 >> 20) & 3
count = 1
cost = (pipeline_bits == 3) ? 2 : 1
*out_count = count
*out_cost = cost
if context.flags & 0x10: // dual-register tracking mode
return 2 * reg_desc.field_12 // doubled register index
else:
if context.flags & 0x08 and pipeline_bits != 1 and reg_desc.class == 6:
*out_cost = 2 * cost // double cost for wide registers
return reg_desc.field_12 // register index
The pipeline bits extracted from (reg_desc+48) >> 20 encode the register's pipeline affinity:
- Bits == 1: standard pipeline register
- Bits == 3: double-width register (costs 2 instead of 1)
- Other values: architecture-specific pipeline assignment
When dual-register tracking is active (context flag 0x10, controlled by knob 420), register indices are doubled to provide separate tracking for even/odd register halves.
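As a sanity check, the pseudocode above transliterates directly to Python; field names follow the pseudocode, and this is a model for experimentation, not the decompiled routine:

```python
def get_register_latency(flags: int, field_48: int, field_12: int,
                         reg_class: int):
    """Return (reg_index, count, cost) per the recovered logic above."""
    pipeline_bits = (field_48 >> 20) & 3
    count = 1
    cost = 2 if pipeline_bits == 3 else 1    # double-width register costs 2
    if flags & 0x10:                         # dual-register tracking mode
        return 2 * field_12, count, cost     # doubled register index
    if flags & 0x08 and pipeline_bits != 1 and reg_class == 6:
        cost *= 2                            # double cost for wide registers
    return field_12, count, cost
```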
Latency Hiding Statistics
The post-scheduling analysis pass (sub_73B360, MacLoopSchedulingAnalytics, 28.7 KB) computes and reports latency hiding effectiveness for four categories of long-latency operations:
| Category | String identifier | Stat function | Typical latency |
|---|---|---|---|
| Shared memory loads | "LDS latency hiding" | sub_73A1D0 | 20--30 cycles |
| Global memory loads | "LDG latency hiding" | sub_73A7F0 | 200--500 cycles |
| Extended 64-bit ops | "Xu64 latency hiding" | sub_73ADF0 | 15--30 cycles |
| Anti-dependencies | "Antidep latency hiding" | (inline) | varies |
Each category reports: Num (count of operations), Min (minimum hidden cycles), Max (maximum hidden cycles), Avg (average hidden cycles). The pass also tracks MAC instruction utilization ("MacInsts", "MacReuses", "TepidMacUtil") and resource busy time ("LsuResBusy", "Time", "TepidTime").
This analysis runs after scheduling is complete and drives feedback for the Mac Loop scheduler, which handles fused multiply-accumulate loop bodies. Knob 443 gates the MAC instruction classification.
Dual-Issue Rules
Dual-issue scheduling is controlled by sub_8CF5D0 (CheckDualIssueEligibility, 3.5 KB) and implemented by sub_8B77C0 (DualIssueScheduler, 15 KB) with pairing logic in sub_8BDC40 (7.9 KB).
Eligibility Check
sub_8CF5D0 returns 0 (no dual-issue) if:
- The target architecture does not support dual-issue (`sub_7DC0E0` returns false).
- Function flag bit 2 at `func+1368` is set (incompatible function).
When eligible, the function iterates basic blocks checking instruction pairs:
- `sub_A9CDE0(instr)`: returns true if the instruction is dual-issuable (hot = global/texture).
- `sub_A9CF90(instr)`: returns true if the instruction can pair with the next (cold = constant/shared).
The dual-issue benefit score is stored at scheduler+328 and used by the priority function to bias toward instruction pairs that can co-issue.
Dual-Issue Constraints
Dual-issue pairs must satisfy:
- Pipe compatibility: the two instructions must target different functional units (e.g., ALU + FP32, or ALU + load/store). Same-pipe pairs cannot dual-issue.
- Register conflict: the pair must not have RAW dependencies on the same register within the same cycle.
- Barrier compatibility: neither instruction may be waiting on a scoreboard barrier.
- Architecture support: dual-issue is primarily an sm_50 (Maxwell) feature. Newer architectures (sm_70+) use wider warp schedulers instead.
For sm_50, a special register budget function adjusts the register allocation target to account for the reduced register pressure from dual-issue execution.
Stall Count Computation
The stall count determines how many cycles an instruction must wait before it can issue. Stalls are computed by sub_8D3E20 (2.1 KB) and encoded by sub_8F3130 (1.0 KB).
Stall Encoding in Control Words
Each SASS instruction carries a stall count in its control word:
- Maximum stall: 16 cycles (capped by knobs 805 and 806).
- Minimum stall: 1 cycle (no zero-stall encoding exists).
- Default stall when no dependency: determined by the HW profile's pipeline depth.
The stall/barrier encoding pipeline (sub_8D7760, 41 KB) computes stalls by walking the dependency DAG backward from each instruction:
function ComputeStallCycles(sched, instr):
max_wait = 0
for each predecessor of instr:
distance = instr.cycle - pred.cycle
latency = LookupLatency(sched, pred, instr)
wait = latency - distance
max_wait = max(max_wait, wait)
return min(max_wait, MaxStallFromKnob(sched))
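An executable version of the walk above; predecessors are modeled as (pred_cycle, latency) pairs, and `max_stall` stands in for the knob-derived cap:

```python
def compute_stall_cycles(instr_cycle: int, predecessors, max_stall: int = 16):
    """Backward walk over dependency edges, per the pseudocode above."""
    max_wait = 0
    for pred_cycle, latency in predecessors:
        distance = instr_cycle - pred_cycle   # cycles already elapsed
        max_wait = max(max_wait, latency - distance)
    return min(max_wait, max_stall)           # clamp to the knob cap
```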
The encoding function sub_8F4140 packs the complete control word:
| Field | Encoder | Bits | Range |
|---|---|---|---|
| Stall count | sub_8F3130 | 4 | 1--16 cycles |
| Yield hint | sub_8F3650 | 1 | 0/1 |
| Read barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Write barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Scoreboard wait | sub_8F3860 | 6 | barrier wait mask |
| Reuse flags | (separate) | 4 | register reuse hints |
Sentinel Values
The scheduling system uses several sentinel values:
| Value | Meaning |
|---|---|
| -1 (0xFFFFFFFF) | Unscheduled instruction position |
| 0x1869F (99999) | Infinite latency sentinel |
| 0xFFFFFFFF | Batch window sentinel (DynBatch) |
Resource Cost Accumulation
sub_8C67A0 (ComputeResourceCost, 3.7 KB) drives the per-instruction resource accounting. It calls the resource model sub_A08A00 three times per instruction:
function ComputeResourceCost(sched, instr):
slot = GetResourceSlot(sched, instr)
slot.bb_entered |= 1
// Phase 1: Instruction's own execution cost
sub_A08A00(sched, instr, instr_data, output, slot, mode=1)
// Accumulate: slot[0..9] += output[0..9] (SSE _mm_add_epi32)
// Phase 2: Operand release costs (for last-use operands)
sub_A08A00(sched, instr, instr_data, output, slot, mode=2)
// Accumulate delta: slot[10..19] += output[0..9]
// Phase 3: Combined instruction + BB-level impact
sub_A08A00(sched, instr, instr_data, output, slot, mode=3)
// Accumulate pressure into slot[20]
The SSE-optimized accumulation uses _mm_add_epi32 to add 4 resource counters at a time, processing the full 10-element vector in 3 SSE iterations (4 + 4 + 2).
Cutlass-Specific Scheduling
sub_8F47E0 detects NVIDIA cutlass GEMM kernels by calling strstr(function_name, "cutlass"). When detected, the scheduler activates hand-tuned scheduling parameters for matrix multiplication inner loops. This includes:
- Modified stall counts for the HMMA/WGMMA instruction sequences.
- Adjusted register pressure targets.
- Specific barrier placement patterns for double-buffered shared memory.
This reflects NVIDIA's investment in hand-tuning their cutlass library's scheduling behavior within ptxas itself.
Execution Pipe Assignment (sub_13710B0)
sub_13710B0 (7.1 KB, 1,088 lines decompiled) is the SASS-backend execution pipe class assigner. It runs in the SASS encoding pipeline (address range 0x1370--0x139F) after instruction selection, register allocation, and the main scheduling pass are complete. Where sub_89FBA0 assigns IR-level scheduling class IDs (2--772+) consumed by the priority and stall-computation passes, sub_13710B0 writes SASS-level pipe class IDs (0x00--0x141) that control control-word encoding: stall counts, barrier assignments, and dual-issue pairing in the final binary.
Descriptor Initialization
Before dispatching on the opcode, the function initializes the scheduling descriptor at a3+196..202 to the "all-pipes" default:
*(DWORD*)(a3+196) |= 0xF8000 // pipe mask = all (bits 15..19)
*(BYTE*)(a3+200) |= 0x1F // read barrier mask = all
*(WORD*)(a3+198) = HIWORD | 0x1F0 // throughput class = max
*(WORD*)(a3+200) |= 0x3E0 // write barrier mask = all
*(BYTE*)(a3+199) = ... | 0x3E // pipe flags = all set
Then it switches on *(a2+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked), writing a 9-bit pipe class into the low bits of *(WORD*)(a3+196) and optionally overriding the pipe mask, sub-class, and pipe flags.
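The opcode masking and pipe-class splice can be sketched with two bit helpers; the function names are descriptive, and the 9-bit field width and mask come from the text:

```python
def normalized_opcode(raw_opcode: int) -> int:
    """Strip the modifier bits (12--13) before the opcode switch."""
    return raw_opcode & 0xFFFFCFFF

def write_pipe_class(word_196: int, pipe_class: int) -> int:
    """Splice a 9-bit pipe class into the low bits of the WORD at a3+196,
    preserving the upper bits (pipe mask, etc.)."""
    return (word_196 & ~0x1FF) | (pipe_class & 0x1FF)
```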
Pipe Mask Encoding
Bits 15--19 of *(DWORD*)(a3+196) select the execution pipe:
| Value | Pipe | Functional units | Resource vector indices |
|---|---|---|---|
0x08000 | Pipe A | ALU, integer, FP64, conversion | 0 (ALU), 2 (DFMA) |
0x10000 | Pipe B | FP32, tensor, SFU, half-precision | 1 (FMA), 3 (MMA), 8 (SFU) |
0x18000 | Pipe C | Memory, texture, wide FP64 | 4 (LSU), 5 (TEX) |
0xF8000 | All | Default sentinel (no constraint) | -- |
Sub-Class Encoding
Bits 4--7 of *(WORD*)(a3+198) encode the sub-class within the pipe:
| Value | Sub-class | Instruction category |
|---|---|---|
0x10 | Control flow | Branch, predicate, miscellaneous |
0x20 | Integer ALU | Conversion, barrier, integer ops |
0x30 | FP32 / SFU | Single-precision, half-precision |
0x40 | FP64 / Tensor | Double-precision wide, tensor core |
Pipe Flags Encoding
Bits 1--5 of *(BYTE*)(a3+199) encode sub-unit affinity:
| Value | Meaning |
|---|---|
0x02 | Narrow ALU sub-unit |
0x04 | ALU (integer / conversion) |
0x06 | Load/store or wide ALU |
0x08 | SFU / half-precision pipe |
0x0A | FP64 wide (double-precision) |
0x0C | Tensor core pipe |
0x3E | All flags set (default) |
Opcode-to-Pipe-Class Mapping
The complete switch covers 80+ Ori opcodes. Representative mappings:
| Ori opcode | Pipe class | Pipe | Sub-class | SASS instruction | Decision logic |
|---|---|---|---|---|---|
| 1 | 0x08 | -- | 0x10 | IMAD | Always |
| 2--7 (wide) | 0x03 | B (0x10000) | 0x30 | IMAD_WIDE, IADD3, etc. | sub_7D6780 = true |
| 2--7 (wide, v6=6) | 0x03 | C (0x18000) | 0x40 | LOP3 (wide, FP64) | Opcode 6, wide |
| 2--7 (narrow) | 0x0C | A (0x08000) | -- | IMAD, IADD3, etc. | Narrow, type != 19 |
| 2--7 (narrow, t=19) | 0x7B | -- | -- | IMAD (BF16/FP8 type) | Type 19 path |
| 8 (flag clear) | 0x33 | -- | -- | IABS (no guard) | Operand flag bit 0 |
| 8 (flag set) | 0x34 | -- | -- | IABS (guarded) | Operand flag bit 0 |
| 0x10 (flagged) | 0x68 | -- | -- | ATOM (flagged) | Operand bit 2 |
| 0x10 (mem=3) | 0x67 | -- | -- | ATOM (shared) | sub_7DFFC0 = 3 |
| 0x10 (mem=4) | 0x69 | -- | -- | ATOM (constant) | sub_7DFFC0 = 4 |
| 0x10 (other) | 0x66 | -- | -- | ATOM (global) | Default |
| 0x12 (no 0x400) | 0x3D | -- | -- | FADD (standard) | Operand bit 10 clear |
| 0x12 (0x400 set) | 0x78 | -- | -- | FADD (const-bank) | Operand bit 10 set |
| 0x17 (op1 reg6) | 0x37 | -- | -- | S2R (tensor reg, op1) | *(desc+64) = 6 |
| 0x17 (op2 reg6) | 0x36 | -- | -- | S2R (tensor reg, op2) | *(desc+64) = 6 |
| 0x17 (other) | 0x38 | -- | -- | S2R (standard) | Neither operand reg6 |
| 0x18 | 0x04 | A (0x08000) | 0x20 | FSETP | Always |
| 0x24 (wide) | 0x14 | B (0x10000) | 0x30 | PRMT (FP width) | sub_7D6780 = true |
| 0x24 (narrow) | 0x11 | B (0x10000) | 0x30 | PRMT (integer) | sub_7D6780 = false |
| 0x33 | 0x21 | A (0x08000) | 0x20 | IDP | Always; flags 0x06 |
| 0x3C (mem ops) | 0x2B--0x32 | -- | -- | STG variants | 6-way split on flags |
| 0x3E (mem ops) | 0x2D--0x2E | -- | -- | LDL variants | Flag / no-flag split |
| 0x42 | 0x5D | -- | -- | MUFU (SFU) | Always |
| 0x4D | 0x84--0x85 | B (0x10000) | 0x40 | WGMMA-class | Extended tensor fields |
| 0x4E (mem ops) | 0x2F--0x30 | -- | -- | LD (generic) | Flag / no-flag split |
| 0x66 | 0x09 | B (0x10000) | 0x30 | DEPBAR | Always; flags 0x08 |
| 0x82 / 130 (ext) | 0x17 | -- | -- | NANOTRAP (extended); HSET2 in ROT13 | sub_A9AB10 = true |
| 0x82 / 130 (ctrl) | 0x13 | all (0xF8000) | 0x10 | NANOTRAP (control); HSET2 in ROT13 | vtable+640 |
| 0xC9--0xCA (wide) | 0x07 | A (0x08000) | -- | DFMA, DADD (wide) | sub_7D6780 = true |
| 0xD1 | 0x05 | A (0x08000) | 0x20 | DFMA | Always |
| 0xD2 | 0x0A | A (0x08000) | 0x30 | DFMA variant | Sub-class 0x30, flag 0x04 |
| 0xF0 | 0x0F | A (0x08000) | -- | F2F | Flags 0x04 |
| 0x10E | 0x7E | B (0x10000) | -- | HMMA_16 | Flags 0x08 |
| 0x117 | 0x80 | B (0x10000) | 0x40 | HMMA_32 | Tensor pipe; flags 0x0C |
| 0x11A | 0x81 | B (0x10000) | 0x40 | IMMA | Tensor pipe |
| default | 0x88 | -- | -- | (unrecognized) | Sentinel |
Decision Axes
The function dispatches on three axes beyond the opcode:
- Data type width: `sub_7D6780(*(a2+76))` returns true for wide types (FP64). Wide types route to pipe A or C with sub-class 0x30 or 0x40; narrow types route to pipe A with sub-class 0x20.
- Memory access classification: `sub_7DFFC0(a2, code_obj)` returns a memory space code (3 = shared, 4 = constant). Used for ATOM (case 0x10) to split into 4 pipe classes by memory space.
- Operand register class: `*(descriptor+64)` from the register descriptor. Class 6 (tensor/accumulator register file) triggers distinct pipe classes for S2R (case 0x17) and DFMA/DADD variants.
Additionally, two architectural gates control tensor instruction classes:
- The `*(a1+25)` flag and `sub_1370F40` gate tensor-extended pipe classes. When disabled, tensor instructions fall through to class 0x141 (a sentinel).
- `vtable+3184` on the code object checks a feature gate for CALL instruction classification.
Memory Instruction Pipe Variants
Load/store instructions (cases 0x3C, 0x3E, 0x4E) receive a 6-way pipe class split based on two properties:
| Property | Test method |
|---|---|
| Same-source vs different-source | sub_91E7A0(a2, 0) vs sub_91E7A0(a2, 1) |
| Has flag operand | sub_91E860(code_obj, a2, i) returns 8 |
| Variant | STG (0x3C) | LDL (0x3E) | LD (0x4E) |
|---|---|---|---|
| Same-src, no flag | 0x31 | (n/a) | (n/a) |
| Same-src, flagged | 0x32 | (n/a) | (n/a) |
| Diff-src, no flag | 0x2B | 0x2D | 0x2F |
| Diff-src, flagged | 0x2C | 0x2E | 0x30 |
This fine-grained split allows the SASS encoder to select different stall counts and barrier patterns depending on whether the load/store has a predicate guard and whether the source address register is shared with another operand.
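The 6-way split is a pure table lookup; a direct transcription of the variant table, keyed on (opcode, same-source, flagged):

```python
# (opcode, same_source, flagged) -> pipe class, from the variant table.
_MEM_PIPE_CLASS = {
    (0x3C, True,  False): 0x31, (0x3C, True,  True):  0x32,
    (0x3C, False, False): 0x2B, (0x3C, False, True):  0x2C,
    (0x3E, False, False): 0x2D, (0x3E, False, True):  0x2E,
    (0x4E, False, False): 0x2F, (0x4E, False, True):  0x30,
}

def mem_pipe_class(opcode: int, same_source: bool, flagged: bool) -> int:
    """Pipe class for a load/store variant (STG/LDL/LD cases)."""
    return _MEM_PIPE_CLASS[(opcode, same_source, flagged)]
```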
Type-19 Special Path
When sub_7D6780 returns false (not wide) and *(a2+76) == 19, several instruction groups receive distinct pipe classes in the 0x7A--0x7D range:
| Ori opcode group | Standard class | Type-19 class | Likely type |
|---|---|---|---|
| 2--7 (narrow) | 0x0C | 0x7B | BF16 / FP8 |
| 0x6E--0x72 (narrow) | 0x0B | 0x7A | BF16 / FP8 |
| 0x8B--0x8C (narrow) | 0x0D | 0x7C | BF16 / FP8 |
| 0xC9--0xCA | 0x10/0x12 | 0x7D | BF16 / FP8 |
Type 19 likely corresponds to BF16 or FP8, which require different pipeline routing than standard FP16/FP32/FP64 types on Hopper and Blackwell architectures.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_693BC0 | 22 lines | MemorySpaceClassify -- return memory space code | HIGH |
sub_695530 | 606 lines | ComputeLatencies -- per-BB latency computation | HIGH |
sub_704D30 | 14 KB | GetFunctionalUnit -- SASS opcode to FU mapping | HIGH |
sub_73A1D0 | ~6 KB | LDSLatencyStats -- shared memory latency stats | HIGH |
sub_73A7F0 | ~6 KB | LDGLatencyStats -- global memory latency stats | HIGH |
sub_73ADF0 | 6.5 KB | XU64LatencyStats -- extended unit latency stats | HIGH |
sub_73B360 | 28.7 KB | MacLoopSchedulingAnalytics -- latency hiding report | HIGH |
sub_799860 | 2.9 KB | ClassifyInstructionLatency | HIGH |
sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode scheduling class | HIGH |
sub_8B5400 | 14 KB | ScheduleForLatency -- latency-optimized scheduling | MEDIUM |
sub_8B77C0 | 15 KB | DualIssueScheduler -- dual-issue scheduling engine | MEDIUM |
sub_8BDC40 | 7.9 KB | DualIssuePairing -- instruction pair selection | MEDIUM |
sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility | HIGH |
sub_8D3E20 | 2.1 KB | ComputeStallCycles -- required stall count | HIGH |
sub_8D7760 | 41 KB | StallAndBarrierInsertion -- encode stalls/barriers | HIGH |
sub_8E3AD0 | -- | CopyProfileEntries -- finalize HW table | MEDIUM |
sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params | HIGH |
sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST | HIGH |
sub_8E4D80 | 15 lines | StringRefCleanup -- decref string in record copy | HIGH |
sub_8E4F20 | ~1.5 KB | EmitWeightEntry -- supplementary weight record (type 'W') | HIGH |
sub_8E5310 | ~1.5 KB | EmitVariantSection -- variant sub-records (type ';') | HIGH |
sub_8E5530 | ~1.5 KB | EmitDimensionEntries -- dimension sub-records (type '1') | HIGH |
sub_8E5CA0 | 20 KB | EmitScheduleOutput -- scheduling control words | HIGH |
sub_8E6760 | 2.9 KB | EmitGroupBoundary -- group boundary marker | HIGH |
sub_8E6B40 | 2.9 KB | EmitSchedEntry -- standard scheduling entry | HIGH |
sub_8E6D40 | 2.9 KB | EmitBarrierEntry -- barrier/sync entry | HIGH |
sub_8E6F20 | 2.9 KB | EmitWaitEntry -- wait dependency entry | HIGH |
sub_8E7110 | 2.9 KB | EmitScoreboardEntry -- scoreboard entry | HIGH |
sub_8E7300 | 3.3 KB | HWTable_sm70 -- Volta latency table | CERTAIN |
sub_8E7540 | 2.9 KB | HWTable_sm72 -- Xavier latency table | CERTAIN |
sub_8E7720 | 3.5 KB | HWTable_sm75 -- Turing latency table | CERTAIN |
sub_8E7940 | 2.9 KB | HWTable_sm80_base -- Ampere base table | CERTAIN |
sub_8E7B40 | 3.3 KB | HWTable_sm80 -- Ampere full table | CERTAIN |
sub_8E7D80 | 4.4 KB | HWTable_sm86 -- GA10x table | CERTAIN |
sub_8E8070 | 3.5 KB | HWTable_sm87 -- Orin table | CERTAIN |
sub_8E8280 | 3.1 KB | HWTable_sm89 -- Ada Lovelace table | CERTAIN |
sub_8E8480 | 5.2 KB | HWTable_sm90 -- Hopper table | CERTAIN |
sub_8E8780 | 4.6 KB | HWTable_sm90a -- Hopper accelerated table | CERTAIN |
sub_8E8A90 | 3.0 KB | HWTable_sm100 -- Blackwell DC table | CERTAIN |
sub_8E8CB0 | 949 B | HWTable_sm100_short -- Blackwell supplementary | CERTAIN |
sub_8E8DB0 | 1.7 KB | HWTable_sm103 -- Blackwell Ultra table | CERTAIN |
sub_8E8F60 | 618 B | HWTable_sm103_short -- BU supplementary | CERTAIN |
sub_8E9000 | 2.9 KB | HWTable_sm120 -- RTX 50xx table | CERTAIN |
sub_8E92E0 | 5.5 KB | HWTable_sm120_ext -- RTX 50xx extended | CERTAIN |
sub_8E97B0 | 8.8 KB | HWTable_universal -- fallback table | CERTAIN |
sub_8E9DC0 | 4.8 KB | EmitLatencyEntry -- HW table entry helper | HIGH |
sub_8EFA10 | 18 KB | EmitScheduleReport -- statistics output | HIGH |
sub_8F0CD0 | 24 B | MapFUClassID -- (opcode, name) to class | HIGH |
sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output | HIGH |
sub_8F3130 | 1.0 KB | EncodeStallField | HIGH |
sub_8F31F0 | 6.1 KB | EncodeBarrierField | HIGH |
sub_8F3650 | 2.7 KB | EncodeYieldField | HIGH |
sub_8F3860 | 3.0 KB | EncodeScoreboardField | HIGH |
sub_8F4140 | 5.6 KB | EncodeFullControlWord | HIGH |
sub_8F47E0 | ~50 B | DetectCutlass -- strstr for "cutlass" | CERTAIN |
sub_A08910 | 39 lines | GetRegisterLatency -- operand cost query | HIGH |
sub_A08A00 | 345 lines | ResourceModel -- 3-mode FU cost computation | HIGH |
sub_A09530 | 91 lines | UpdateStallCycles -- per-instruction stall update | HIGH |
sub_A9CDE0 | -- | IsHotMemory -- global/texture classification | HIGH |
sub_A9CF90 | -- | IsColdMemory -- constant/shared classification | HIGH |
sub_13710B0 | 7.1 KB | AssignPipeClass -- SASS-level pipe assignment | HIGH |
sub_1370F40 | ~500 B | CheckTensorFeature -- gates tensor pipe classes | HIGH |
sub_7D6780 | ~100 B | IsWideType -- true for FP64/wide types | HIGH |
sub_7DFFC0 | ~200 B | ClassifyMemAccess -- 3=shared, 4=constant | HIGH |
sub_7E3640 | ~100 B | GetCustomPipe -- 5-bit pipe sub-class | MEDIUM |
sub_91E7A0 | ~100 B | GetSrcEncoding -- source operand encoding query | MEDIUM |
sub_91E860 | ~100 B | GetOperandType -- operand type code | MEDIUM |
sub_A9AB10 | ~100 B | NeedsExtEncoding -- extended encoding check | MEDIUM |
Cross-References
- Scheduler Overview -- 3-phase architecture, HW profile table summary
- Scheduling Algorithm -- priority list scheduling, resource vector usage
- Scoreboards & Barriers -- scoreboard encoding, dependency barriers
- SASS Encoding -- control word format in SASS binary
- Targets Index -- SM architecture map and version codes
- Knobs -- scheduling knobs (740, 741, 805, 806, etc.)
Scoreboards & Dependency Barriers
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The NVIDIA GPU hardware uses a software-managed scoreboard system for instruction-level hazard tracking. Unlike CPUs that detect dependencies in hardware, NVIDIA's warp schedulers rely on per-instruction metadata -- encoded in a control word -- to determine when an instruction's operands are available, when a warp should yield, and which dependency barriers to set or wait on. ptxas generates this metadata in three pipeline phases (114--116) that together produce the final scoreboard annotations embedded in the SASS binary.
| Phase 114 | FixUpTexDepBarAndSync -- texture dependency barrier fixup |
| Phase 115 | AdvancedScoreboardsAndOpexes -- full scoreboard generation (-O1+) |
| Phase 116 | ProcessO0WaitsAndSBs -- conservative scoreboard insertion (-O0) |
| Control word generator | sub_A36360 (52 KB) -- per-instruction control word encoder |
| Scheduling heuristic | sub_A23CF0 (54 KB) -- DAG list scheduler with dependency analysis |
| Instruction dispatcher | sub_85C890 -- opcode-based fast-path / slow-path router |
| Mercury opex pass | sub_6FFDC0 (66 KB) -- MercGenerateOpex, phase 120 |
| HW barrier limit | 6 dependency barriers per warp (hardware constraint) |
Control Word Format
Every SASS instruction carries scheduling metadata in a control word. On pre-Volta architectures (sm_50--sm_62), the control words for a group of 3 real instructions are packed into a dedicated scheduling control word that precedes the group; from sm_70 onward, the control bits are instead embedded in the upper bits of each 128-bit instruction word. The control word encodes stall counts, yield hints, dependency barrier set/wait operations, and source operand reuse flags.
Ori IR Control Word (Internal Representation)
Within ptxas, the control word is stored in the Ori IR instruction node at offsets +196 through +200. sub_A36360 generates the fields, and per-field encoder functions write individual bit ranges.
The internal representation uses wider fields than the final SASS encoding to allow the encoder to track additional state during scoreboard computation:
| Field | Bits | Range | Description |
|---|---|---|---|
| Stall count | 4 | 0--15 | Minimum cycles to wait before issuing this instruction |
| Yield flag | 1 | 0--1 | Hint to warp scheduler: yield execution to another warp |
| Write barrier index | 3 | 0--5 | Which barrier register this instruction's result writes to |
| Read barrier mask | 6 | 0--63 | Bitmask of barriers this instruction must wait for (reads) |
| Wait barrier mask | 6 | 0--63 | Bitmask of barriers this instruction clears upon completion |
| Reuse flags | 6 | 0--63 | Per-source-operand register reuse cache hints |
Total: 26 bits of scheduling metadata per instruction in the internal representation.
SASS Control Word (Binary Encoding)
In the final SASS binary, the control word occupies 23 bits per instruction. On sm_70+ these bits are embedded in the upper portion of the instruction's own 128-bit word; on pre-Volta targets, the per-instruction slots are instead packed into a shared scheduling control word emitted once per group of 3 instructions, yielding a 4:3 encoding-to-instruction ratio:
Shared control word (one per 3 instructions, pre-Volta packing):
┌─────────┬─────────┬─────────┬──────────────────┐
│ Slot 0 │ Slot 1 │ Slot 2 │ Reserved / flags │
└─────────┴─────────┴─────────┴──────────────────┘
Per-instruction 23-bit control layout (sm_70+):
bits [3:0] Stall count (4 bits, values 0--15)
bit [4] Yield flag (1 bit)
bits [7:5] Write barrier index (3 bits, values 0--5; 7 = none)
bits [13:8] Read barrier mask (6 bits, one-hot per barrier)
bits [19:14] Wait barrier mask (6 bits, one-hot per barrier)
bits [22:20] Reserved / extended flags (3 bits)
The reuse flags (6 bits per instruction) are encoded separately in the instruction word itself at architecture-defined bit positions, not in the scheduling control instruction.
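The bit positions in the listing above can be captured in pack/unpack helpers for inspecting recovered control words; field names are descriptive, not recovered symbols:

```python
def pack_ctrl(stall, yield_flag, wr_bar, read_mask, wait_mask, rsvd=0):
    """Pack the 23-bit per-instruction control fields (layout above)."""
    assert stall < 16 and wr_bar < 8 and read_mask < 64 and wait_mask < 64
    return (stall | (yield_flag & 1) << 4 | wr_bar << 5
            | read_mask << 8 | wait_mask << 14 | rsvd << 20)

def unpack_ctrl(word):
    """Inverse of pack_ctrl: split a 23-bit control word into fields."""
    return {"stall": word & 0xF, "yield": (word >> 4) & 1,
            "wr_bar": (word >> 5) & 7, "read_mask": (word >> 8) & 0x3F,
            "wait_mask": (word >> 14) & 0x3F, "rsvd": (word >> 20) & 7}
```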
Bit-Field Diagram
22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ rsvd │ wait mask │ read mask │ wr_bar │ Y │ stall │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Hardware Dependency Barrier Model
NVIDIA GPUs (sm_70+) provide 6 dependency barrier registers per warp, numbered 0 through 5. These are finite hardware resources managed entirely by software (ptxas). The barrier mechanism works as follows:
-
Set barrier (write barrier): When an instruction with long latency (e.g., global memory load, texture fetch) is issued, the compiler assigns it a barrier index from the pool of 6. The hardware marks that barrier as "pending" and associates it with the instruction's completion.
-
Wait on barrier (read barrier / wait mask): When a subsequent instruction needs the result of the long-latency operation, the compiler sets the corresponding bit in the wait mask. The warp scheduler stalls the instruction until the barrier clears.
-
Barrier release: When the long-latency operation completes, the hardware automatically clears the associated barrier register, allowing waiting instructions to proceed.
The key constraint is the hardware limit of 6 simultaneous barriers. If a basic block has more than 6 outstanding long-latency operations, the compiler must either:
- Reuse a barrier (wait for an earlier operation to complete before reassigning its barrier)
- Insert explicit stall cycles to serialize operations
- Use the DEPBAR instruction to manage barrier state programmatically
Stall Count vs. Dependency Barriers
The stall count and dependency barriers serve complementary purposes:
| Mechanism | Latency Range | Use Case |
|---|---|---|
| Stall count (4 bits) | 0--15 cycles | Short-latency operations: ALU (4--6 cycles), shared memory (20--30 cycles when stall is sufficient) |
| Dependency barriers | Arbitrary | Long-latency operations: global memory (200--800 cycles), texture (200--400 cycles), where stall count is insufficient |
For operations with latency <= 15 cycles, the stall count alone suffices. For longer latencies, a dependency barrier must be used because 4 bits cannot encode delays beyond 15 cycles. The yield flag provides an additional hint: when set, it tells the warp scheduler that this warp is about to stall and should be descheduled in favor of another ready warp.
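The decision rule reduces to a comparison against the 4-bit ceiling. A minimal sketch (the yield threshold of 4 is the typical sm_80+ value quoted later on this page, not a recovered constant):

```python
YIELD_THRESHOLD = 4   # architecture-dependent; typical sm_80+ value

def schedule_hint(latency_cycles):
    """Return (mechanism, stall, yield) for a producer of the given latency.

    A sketch of the decision rule described above, not recovered code.
    """
    if latency_cycles <= 15:
        # Short latency: the 4-bit stall field covers it directly.
        stall = latency_cycles
        return ("stall", stall, int(stall >= YIELD_THRESHOLD))
    # Too long for the 4-bit field: allocate a dependency barrier and
    # hint the scheduler to deschedule this warp in favor of another.
    return ("barrier", 0, 1)

assert schedule_hint(6) == ("stall", 6, 1)      # ALU-ish latency
assert schedule_hint(300) == ("barrier", 0, 1)  # global-load-ish latency
```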
Phase 114: FixUpTexDepBarAndSync
Phase 114 runs a texture-specific fixup pass on dependency barriers and synchronization instructions. It operates through a double-indirect vtable dispatch: the architecture backend object contains a secondary scheduling/scoreboard subsystem object (at offset +16), which provides the actual implementation through its vtable at offset +0x70.
Purpose
Texture operations have complex dependency patterns that the general scoreboard pass may not handle optimally:
- Texture fetches have variable latency depending on cache hit rates
- Texture coordinates and sampler state create additional dependencies
- The texture pipeline has its own internal buffering that interacts with scoreboard barriers
Phase 114 patches up the dependency barrier assignments made by the scheduler to account for these texture-specific requirements. It may:
- Reassign barrier indices for texture operations to avoid conflicts with non-texture barriers
- Insert additional DEPBAR instructions where texture dependencies require explicit barrier management
- Adjust synchronization instructions that interact with texture pipeline state
Implementation
The phase is architecture-specific. In the default vtable, it maps to a nullsub (nullsub_43 at 0x680170), indicating a no-op for architectures that handle texture dependencies entirely in the general scoreboard pass. Architecture backends that need texture-specific fixup override this vtable entry with their implementation.
Dispatch path:
PhaseManager::execute(phase=114)
→ arch_backend->vtable[phase_114]()
→ secondary_object = *(arch_backend + 16)
→ (*(secondary_object->vtable + 0x70))(secondary_object, func)
Phase 115: AdvancedScoreboardsAndOpexes
Phase 115 is the main scoreboard generation pass. It is an AdvancedPhase hook -- a no-op in the default vtable, activated only when the architecture backend overrides it. At -O1 and above, this phase runs the full dependency analysis and scoreboard assignment. At -O0, it is skipped entirely (phase 116 handles the conservative path instead).
Architecture Dispatch
The phase entry point dispatches through the architecture backend vtable to sub_85C890, which acts as an opcode-aware router: depending on instruction type, it either handles the instruction via a fast path (direct barrier assignment for known patterns) or falls through to the full DAG list scheduler sub_A23CF0.
Fast Path (sub_85C890)
sub_85C890 classifies instructions by their masked opcode (opcode & 0xFFFFCFFF) and routes them:
Handled by fast path (direct barrier assignment without full DAG analysis):
- Opcodes 60, 62, 78, 79: Texture/surface operations -- processed via sub_A22B40 (write barrier assignment) after checking architecture capability at vtable+1928
- Opcode 4 with operand types (7, 6): Specific ALU patterns with predicate operands -- dual operand processing via sub_A220A0
- Opcode 111 with operand types (7, 7, 6): Triple-source patterns -- processed via triple sub_A220A0 calls
- Opcodes 120, 121: GMMA/tensor operations -- processed via sub_A220A0 + sub_A22B40 with variable operand counts
- Opcodes 126--128: Complex operations with architecture-specific operand counts (2--4 source operands)
- Opcodes 195, 270, 280, 281: Memory operations with specific addressing modes
- Opcodes 350, 351: Extended operations with operand subtype 11--12
Fall-through to slow path (full DAG scheduler):
- All other opcodes
- Fast-path opcodes that fail capability checks (vtable+1928 returns false)
- Instructions with the 0x1000 flag set (bit 12 of opcode word) -- handled via sub_A227F0 first, then fall through
The fast-path check at vtable+1928 tests (*(_BYTE *)(a1 + 1090) & 4) != 0, which corresponds to an architecture feature flag controlling whether the backend supports direct scoreboard assignment for specific instruction classes.
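The routing decision can be sketched as follows. This is a simplification: the real router also checks operand types for opcodes 4 and 111, and the bit-12 path reaches the slow path only after sub_A227F0 pre-processing; `arch_flags_byte` is a stand-in for the byte at backend offset +1090:

```python
# Masked opcodes recovered from sub_85C890's switch (see list above).
FAST_PATH_OPCODES = {60, 62, 78, 79, 4, 111, 120, 121, 126, 127, 128,
                     195, 270, 280, 281, 350, 351}

def route_instruction(opcode_word, arch_flags_byte):
    """Route one instruction to the fast or slow scoreboard path.

    A sketch of the recovered dispatch, not decompiled code.
    """
    masked = opcode_word & 0xFFFFCFFF          # strip modifier bits 12-13
    flagged = bool(opcode_word & 0x1000)       # bit 12: needs pre-processing
    fast_capable = bool(arch_flags_byte & 4)   # the (*(a1+1090) & 4) check
    if masked in FAST_PATH_OPCODES and fast_capable and not flagged:
        return "fast"    # direct barrier assignment
    return "slow"        # full DAG scheduler (sub_A23CF0)

assert route_instruction(60, 0x04) == "fast"
assert route_instruction(60, 0x00) == "slow"            # capability fails
assert route_instruction(60 | 0x1000, 0x04) == "slow"   # bit-12 flagged
```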
Slow Path: DAG List Scheduler (sub_A23CF0, 54 KB)
When the fast path cannot handle an instruction, sub_A23CF0 performs full dependency-driven scoreboard assignment. This function takes 14 parameters including floating-point latency weights and throughput factors.
The scheduler:
1. Classifies instruction dependencies: Iterates operands, extracting register IDs from the operand descriptors at instr+84. For each operand, looks up the register object via *(*(func+88) + 8 * (operand_desc & 0xFFFFFF)) and checks the register type at offset +64.
2. Walks the def-use chain: For each source operand, traces back to the defining instruction. Determines the dependency distance (in instructions and cycles) from each producer to the current consumer.
3. Assigns barrier or stall: Based on the dependency distance and the producer's latency:
   - If the producer's latency fits within the stall count range (0--15), assigns a stall count
   - If the latency exceeds 15 cycles, allocates a dependency barrier from the pool of 6
   - If all 6 barriers are in use, finds the oldest barrier, inserts a wait for it, and recycles it
4. Handles instruction-specific patterns: Contains a large switch on opcode for architecture-specific scheduling decisions. Opcodes handled specially include:
   - Opcodes 2, 3, 20, 21, 24, 28, 60, 61, 62, 67, 78, 79, 98, 100, 110, 120, 126, 139, 141, 162, 164, 201, 209, 210, 213, 214, 272, 273, 311: Direct operand processing with known dependency patterns
   - Opcodes 5, 6, 7, 10, 11, 36, 63, 80, 106, 108, 112, 114: Memory/load operations with variable-latency handling
5. Produces control word fields: After analysis, the function sets the barrier assignment, stall count, and wait mask for the instruction.
Key Support Functions
| Address | Size | Purpose |
|---|---|---|
sub_A220A0 | 9 KB | Instruction attribute / property query -- fills a scheduling descriptor for a specific operand |
sub_A22B40 | -- | Write barrier assignment for a specific operand -- determines which barrier index to assign |
sub_A22BC0 | -- | Read barrier dependency -- sets up wait mask for operand |
sub_A22CE0 | -- | Instruction classification -- determines if instruction needs scoreboard processing |
sub_A231E0 | -- | Scheduling score computation -- determines if full DAG analysis is needed |
sub_A227F0 | -- | Pre-processing for flagged instructions (bit 12 set in opcode) |
sub_A22D00 | -- | Dependency distance computation |
Phase 116: ProcessO0WaitsAndSBs
Phase 116 implements the conservative scoreboard insertion path for -O0 (no optimization) builds. At -O0, phase 115 is a no-op, and phase 116 takes over with a simple, safe strategy.
Conservative Strategy
The -O0 path does not perform dependency analysis. Instead, it applies maximum-safety defaults:
function ProcessO0WaitsAndSBs(func):
for each bb in func.basic_blocks:
for each instr in bb.instructions:
// Set maximum stall count (15 cycles)
instr.stall_count = 15
// Wait on all active barriers before every instruction
instr.wait_mask = 0x3F // all 6 barriers
// No barrier assignment (no long-latency tracking)
instr.write_barrier = 7 // 7 = none
// No read barriers
instr.read_mask = 0
// Yield after every instruction
instr.yield = 1
This produces correct but extremely slow code: every instruction waits the maximum time and clears all barriers, eliminating any possibility of instruction-level parallelism. The primary use case is debugging, where correctness matters more than performance.
At -O1 and above, phase 115 runs the full analysis, and phase 116's isNoOp() returns true, skipping execution entirely.
Control Word Generation Pipeline (sub_A36360)
sub_A36360 (52 KB) is the master control word generator, called via vtable for each instruction in the scheduled order. It orchestrates six per-field encoder functions to produce the complete control word.
Dispatch Architecture
The function takes the scheduling context (a1), the instruction node (a2), and several SIMD/float parameters encoding latency weights and architecture-specific constants. It begins by:
- Loading the function context from *(a1+8) and the SM backend from *(func+1584) (the sm_backend field; provides hardware latency profiles)
- Calling sub_7E1750 to classify the instruction
- Extracting the opcode from *(a2+72) with the standard mask (BYTE1 &= 0xCF)
- Switching on the masked opcode to determine the encoding strategy
Per-Opcode Dispatch
The master switch at the entry of sub_A36360 routes instructions by opcode class:
| Opcode Class | Handler | Description |
|---|---|---|
| 2, 3, 5, 7 | Inline (LABEL_18 path) | Standard ALU/memory with full barrier analysis. Checks operand subtype 9--10 and architecture feature at *(sm_backend+1037) & 0x20. Calls sub_A32C70 for operand analysis, then sub_A31040 for field encoding. |
| 6 | sub_A34B70 | Wait barrier mask encoding for specific memory operations |
| 10, 149, 151, 290 | Inline (large block) | Extended operations with special barrier handling. Calls sub_A32A20 for multi-operand setup, then processes register-type checks at offset +64 (type==5 triggers additional barrier logic). |
| All others | Per-field encoder chain | Default path through the six encoder functions |
Per-Field Encoder Chain
For the default path, sub_A36360 calls these encoders in sequence:
function GenerateControlWord(ctx, instr):
// 1. Initialize operand analysis
sub_7E19E0(&operand_info, ctx.func, instr)
barrier_type = sub_7E53D0(instr.operand_subtype)
// 2. Analyze source/dest operand dependencies
sub_A32C70(&ctx, instr, src_idx, dst_idx,
&dep_info, &barrier_info)
// 3. Encode all control word fields
sub_A31040(&ctx, &dep_info, &barrier_info,
&src_desc, &dst_desc, &flags,
barrier_type, ...)
// 4. Finalize: set register space = 7 (done)
*(ctx.func + 240) = 7
// 5. Emit the control word
sub_9253C0(ctx.func, instr, 1)
Encoder Function Details
| Address | Size | Function | Field Encoded |
|---|---|---|---|
sub_A333A0 | 3 KB | EncodeStallAndYield | 4-bit stall count + 1-bit yield flag. Called twice from sub_A36360. Computes the minimum stall cycles from the dependency distance to the nearest consumer. Sets yield=1 when stall > threshold (architecture-dependent, typically 4+ cycles). |
sub_A33660 | 7 KB | EncodeReadBarrierMask | 6-bit read barrier mask. Determines which barrier registers this instruction must wait for before reading its source operands. Calls sub_935720 to query register-barrier associations. |
sub_A342E0 | 9 KB | EncodeWriteBarrierIndex | 3-bit write barrier index. Allocates a barrier from the pool of 6 for this instruction's result. Calls sub_934630 to find a free barrier; if none available, forces a wait on the oldest active barrier via sub_9253C0. |
sub_A34B70 | 10 KB | EncodeWaitBarrierMask | 6-bit wait barrier mask. Determines which barriers are cleared when this instruction completes. |
sub_A356A0 | 12 KB | EncodeScoreboardFields | Combined scoreboard field encoder. Orchestrates read/write barrier assignment with dependency distance tracking via sub_A318F0 and conflict detection via sub_A31390. |
sub_A31F80 | 7 KB | ComputeReuseFlags | 6-bit reuse flags. Determines which source register values should be cached in the operand reuse buffer. Calls sub_7DB310 for register bank analysis and sub_91BF30 for reuse eligibility checking. |
Supporting Analysis Functions
| Address | Size | Purpose |
|---|---|---|
sub_A318F0 | 4 KB | Barrier dependency distance computation -- measures the instruction distance between a barrier set and its corresponding wait |
sub_A31390 | 4 KB | Barrier set intersection / conflict detection -- checks whether two instructions' barrier usage conflicts |
sub_A32C70 | -- | Source/destination operand dependency analysis -- identifies which operands create dependencies |
sub_A31040 | -- | Master field encoding dispatcher -- coordinates all six per-field encoders |
Dependency Barrier Allocation Algorithm
The barrier allocator manages the 6 hardware barrier registers as a resource pool. The algorithm must satisfy three constraints:
- No two simultaneously-live long-latency operations share a barrier index
- Every consumer instruction waits on the correct barrier before reading its operand
- Barrier reuse is maximized to avoid unnecessary stalls
Allocation State Machine
State per barrier register (6 entries):
barrier[i].status ∈ {FREE, PENDING, COMPLETED}
barrier[i].producer = instruction pointer (or NULL)
barrier[i].set_cycle = cycle when barrier was assigned
barrier[i].consumers = list of waiting instructions
State transitions:
FREE → PENDING: Barrier allocated to a long-latency producer
PENDING → COMPLETED: Hardware signals completion (implicit)
COMPLETED → FREE: All consumers have executed their wait
PENDING → FREE: Forced recycle (all barriers in use, oldest evicted)
Allocation Pseudocode
function AllocateBarrier(ctx, producer_instr):
// 1. Try to find a free barrier
for i in 0..5:
if barrier[i].status == FREE:
barrier[i].status = PENDING
barrier[i].producer = producer_instr
barrier[i].set_cycle = current_cycle
return i
// 2. No free barrier: recycle the oldest
oldest = argmin(barrier[i].set_cycle for i in 0..5)
// 3. Force all consumers of oldest to wait NOW
InsertWaitForBarrier(ctx, oldest)
// 4. Recycle
barrier[oldest].status = PENDING
barrier[oldest].producer = producer_instr
barrier[oldest].set_cycle = current_cycle
return oldest
function AssignWaitMask(ctx, consumer_instr):
wait_mask = 0
for each source_operand in consumer_instr:
producer = FindProducer(source_operand)
if producer.barrier_index != NONE:
if producer.latency > stall_count_range:
wait_mask |= (1 << producer.barrier_index)
consumer_instr.wait_mask = wait_mask
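The allocation pseudocode above can be exercised as a small runnable model. It tracks only FREE/PENDING status and set cycles; hardware completion and per-barrier consumer lists are not simulated:

```python
FREE, PENDING = 0, 1
NUM_BARRIERS = 6

class BarrierPool:
    """Minimal model of the 6-entry allocator described above."""

    def __init__(self):
        self.status = [FREE] * NUM_BARRIERS
        self.set_cycle = [0] * NUM_BARRIERS
        self.forced_waits = []   # barriers drained early by forced recycling

    def allocate(self, cycle):
        # 1. Prefer a free barrier.
        for i in range(NUM_BARRIERS):
            if self.status[i] == FREE:
                self.status[i], self.set_cycle[i] = PENDING, cycle
                return i
        # 2. All busy: pick the oldest (smallest set_cycle) ...
        oldest = min(range(NUM_BARRIERS), key=lambda i: self.set_cycle[i])
        # 3. ... force its consumers to wait now, then recycle it.
        self.forced_waits.append(oldest)
        self.set_cycle[oldest] = cycle
        return oldest

pool = BarrierPool()
assert [pool.allocate(c) for c in range(6)] == [0, 1, 2, 3, 4, 5]
assert pool.allocate(6) == 0          # oldest-first eviction recycles 0
assert pool.forced_waits == [0]       # one forced wait was inserted
```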
Barrier Reuse Heuristics
The allocator uses several heuristics to maximize barrier reuse:
1. Oldest-first eviction: When all 6 barriers are occupied, the oldest (earliest set_cycle) is evicted. This maximizes the chance that the evicted operation has already completed.
2. Type affinity: Texture operations preferentially reuse barriers previously assigned to other texture operations, because texture latencies tend to be similar and the texture pipeline may batch completions.
3. Distance-based freeing: A barrier is marked free without an explicit wait if the instruction distance from the producer exceeds the architecture's maximum latency for that operation class. The sub_A318F0 function computes this distance.
4. Conflict avoidance: sub_A31390 checks whether a proposed barrier assignment would conflict with an existing barrier that has not yet been waited on. If a conflict is detected, the allocator tries a different barrier index.
Scoreboard Tracking State
The scoreboard tracking state is maintained in the scheduling context object. Key fields:
| Offset | Type | Content |
|---|---|---|
ctx+232 | QWORD | Current instruction pointer |
ctx+240 | DWORD | Current register space ID (7 = done) |
ctx+244 | QWORD | Current operand descriptor pair |
ctx+248 | DWORD | Current write barrier index |
ctx+252 | DWORD | Current barrier assignment type (1 or 2) |
ctx+264 | DWORD | Current instruction sequence number |
ctx+1040 | BYTE | Architecture feature flags (bit 5 = texture scoreboard, bit 4 = extended barriers) |
ctx+1090 | BYTE | Capability flags (bit 2 = fast-path scoreboard, bit 4 = extended operand tracking) |
The *(ctx+1040) & 0x20 flag controls whether the architecture supports texture-specific scoreboard handling. The *(ctx+1090) & 4 flag enables the fast-path scoreboard assignment for known instruction patterns.
Scoreboard Object Layout (952 bytes)
The scoreboard object is allocated by sub_8D0640 (ScheduleInstructions) when the architecture feature flag *(func+1385) & 4 is set. The 952-byte allocation goes through the function context's vtable-dispatched allocator at *(func+16), and the constructor sub_69A1A0 initializes it. The pointer is stored at func+1864.
The object has three regions: 35 reference-counted counter slots, a linked-list/tree node for active barrier tracking, and 14 barrier tracking records.
Region 1: Counter Slots (offsets +0 to +272)
35 QWORD pointer slots, each pointing to an externally-allocated 24-byte counter node. Each counter node has the layout:
Counter node (24 bytes):
+0 QWORD refcount (initialized to 1)
+8 QWORD value (initialized to 0)
+16 QWORD allocator back-reference
The 35 slots are organized as barrier state / stall counter pairs for each register class, plus additional scoreboard tracking counters:
| Offset | Slot | Purpose |
|---|---|---|
| +0 | 0 | R (general-purpose register) barrier state |
| +8 | 1 | R stall counter |
| +16 | 2 | P (predicate register) barrier state |
| +24 | 3 | P stall counter |
| +32 | 4 | UR (uniform register) barrier state |
| +40 | 5 | UR stall counter |
| +48 | 6 | UP (uniform predicate) barrier state |
| +56 | 7 | UP stall counter |
| +64 | 8 | B (barrier register) barrier state |
| +72 | 9 | B stall counter |
| +80 | 10 | Arch-specific class 5 barrier state |
| +88 | 11 | Arch-specific class 5 stall counter |
| +96 | 12 | Arch-specific class 6 barrier state |
| +104 | 13 | Arch-specific class 6 stall counter |
| +112 | 14 | Arch-specific class 7 barrier state |
| +120 | 15 | Arch-specific class 7 stall counter |
| +128 | 16 | Arch-specific class 8 barrier state |
| +136 | 17 | Arch-specific class 8 stall counter |
| +144--+272 | 18--34 | Additional scoreboard tracking counters (17 slots) |
Total: 35 slots x 8 bytes = 280 bytes.
Region 2: Linked List / Tree Node (offsets +280 to +391)
This region contains an intrusive data structure (linked list or red-black tree node) used for tracking active barrier assignments. It cross-references counter slots from Region 1.
| Offset | Size | Type | Init | Purpose |
|---|---|---|---|---|
| +280 | 8 | ptr | from a2+16 | Allocator reference (arena/memory pool) |
| +288 | 8 | QWORD | 0 | List sentinel / null node |
| +296 | 8 | ptr | &self+304 | List head pointer |
| +304 | 8 | ptr | &self+288 | Forward link (points to sentinel) |
| +312 | 8 | QWORD | 0 | Node data |
| +320 | 8 | ptr | &self+288 | Backward link (points to sentinel) |
| +328 | 8 | ptr | &self+304 | Secondary forward link |
| +336 | 4 | DWORD | 2 | Node type / RB-tree color (2 = initial) |
| +344 | 8 | ptr | slot 1 ref | Cross-reference to counter slot 1 (R stall counter); refcount incremented |
| +352 | 8 | QWORD | 0 | Pending producer instruction pointer |
| +360 | 8 | QWORD | 0 | Set cycle timestamp |
| +368 | 8 | QWORD | 0 | Consumer list head |
| +376 | 4 | DWORD | 0 | Active flag / barrier index |
| +384 | 8 | ptr | slot 19 ref | Cross-reference to counter slot 19 |
Total: 112 bytes.
Region 3: Barrier Tracking Records (offsets +392 to +951)
14 identical 40-byte records, each tracking one dependency barrier register. The first 6 records correspond to the 6 hardware dependency barriers per warp. Records 6--12 are extended/spare slots for overflow or future barrier model expansion (sm_100+). Record 13 uses a different initialization path (sub_6996C0 instead of sub_69A120), suggesting it serves as a sentinel or special-purpose record.
Per-record layout (40 bytes):
| Offset (within record) | Size | Type | Init | Purpose |
|---|---|---|---|---|
| +0 | 8 | QWORD | 0 | Barrier status: FREE (0), PENDING, COMPLETED |
| +8 | 8 | QWORD | 0 | Producer instruction pointer (or NULL when free) |
| +16 | 8 | QWORD | 0 | Set cycle / consumer tracking state |
| +24 | 4 | DWORD | 0 | Barrier flags / consumer count |
| +28 | 4 | -- | -- | (padding) |
| +32 | 8 | ptr | slot 19 ref | Cross-reference to counter slot 19 (allocator back-pointer) |
Record index to offset mapping:
| Record | Offset | Hardware Barrier |
|---|---|---|
| 0 | +392 | Dependency barrier 0 |
| 1 | +432 | Dependency barrier 1 |
| 2 | +472 | Dependency barrier 2 |
| 3 | +512 | Dependency barrier 3 |
| 4 | +552 | Dependency barrier 4 |
| 5 | +592 | Dependency barrier 5 |
| 6 | +632 | Extended / spare 0 |
| 7 | +672 | Extended / spare 1 |
| 8 | +712 | Extended / spare 2 |
| 9 | +752 | Extended / spare 3 |
| 10 | +792 | Extended / spare 4 |
| 11 | +832 | Extended / spare 5 |
| 12 | +872 | Extended / spare 6 |
| 13 | +912 | Sentinel record (different init via sub_6996C0) |
Tail pointer:
| Offset | Size | Type | Purpose |
|---|---|---|---|
| +944 | 8 | ptr | Counter reference for sentinel record (from slot 25) |
Total: 14 records x 40 bytes = 560 bytes (offsets +392 through +951); the tail pointer at +944 occupies the counter-reference field of the sentinel record, completing the 952-byte object (280 + 112 + 560).
Memory Layout Diagram
ScoreboardObject (952 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
|+0 slot0 (R bar) |+8 slot1 (R stl) |+16 slot2 (P bar)|+24 slot3 (P stl)|
+--------+--------+--------+--------+--------+--------+--------+--------+
|+32 slot4 (UR) |+40 slot5 (UR) |+48 slot6 (UP) |+56 slot7 (UP) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+64 slot8 (B) |+72 slot9 (B) |+80..+272 slots 10--34 |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+280 allocRef |+288 sentinel |+296 listHead |+304 fwdLink |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+312 nodeData |+320 bwdLink |+328 secFwd |+336 type | |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+344 slotRef |+352 producer |+360 setCycle |+368 consumers |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+376 flags| |+384 slotRef19 | |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+392 barrierRecord[0] (40B) |+432 barrierRecord[1] (40B) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+472 barrierRecord[2] |+512 barrierRecord[3] |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+552 barrierRecord[4] |+592 barrierRecord[5] |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+632..+912 barrierRecords[6..13] (8 extended / spare records) |
+--------+--------+--------+--------+--------+--------+--------+--------+
|+944 tailPtr |+952 END |
+--------+--------+--------+--------+--------+--------+--------+--------+
Design Notes
The counter nodes use reference counting (initial refcount = 1, incremented when cross-referenced from Region 2 or Region 3). This enables sharing counter state across multiple tracking contexts -- for example, when the scheduling passes for pre-scheduling and post-scheduling need to track the same barrier state.
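A minimal model of that sharing (field names follow the 24-byte counter node layout in Region 1; the acquire/release protocol is inferred from the refcount initialization, not recovered):

```python
class CounterNode:
    """Model of the 24-byte refcounted counter node (+0 refcount, +8 value)."""

    def __init__(self):
        self.refcount = 1     # constructor initializes refcount to 1
        self.value = 0

    def acquire(self):
        # Cross-referencing from Region 2 or Region 3 bumps the refcount ...
        self.refcount += 1
        return self

    def release(self):
        # ... and the node is freed only when the last reference drops.
        self.refcount -= 1
        return self.refcount == 0

slots = [CounterNode() for _ in range(35)]   # Region 1 slot table
shared = slots[1].acquire()                  # Region 2 references slot 1
shared.value += 5                            # both views see the same counter
assert slots[1].value == 5 and slots[1].refcount == 2
assert shared.release() is False             # slot table still holds a ref
```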
The 14 barrier records provide 6 slots for the hardware barrier registers plus 7 extended/spare slots and a sentinel record. Current architectures use exactly 6 dependency barriers per warp, but the extended slots provide headroom for the expanded barrier model hinted at in sm_100+ configurations (see the *(ctx+1040) & 0x10 extended barriers flag).
Stall Count Computation
The stall count is the minimum number of cycles the warp scheduler must wait before issuing the instruction. It is computed from the dependency distance to the instruction's producers.
Algorithm
function ComputeStallCount(ctx, instr):
max_stall = 0
for each source_operand in instr:
producer = FindProducer(source_operand)
if producer == NULL:
continue
distance = instr.cycle - producer.cycle
latency = GetLatency(producer.opcode, ctx.sm_backend)
required_wait = latency - distance
if required_wait > 0:
max_stall = max(max_stall, required_wait)
// Clamp to 4-bit range
stall = min(max_stall, 15)
// Apply architecture minimum (knob 741, default 3)
stall = max(stall, ctx.min_stall)
// Cap at architecture maximum (knob 805/806, max 16)
stall = min(stall, ctx.max_stall)
return stall
The stall count computation uses the SM backend at *(func+1584) (sm_backend) to look up per-opcode latencies from the architecture's hardware latency tables; see Latency Model.
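The clamping order at the end of the algorithm matters (4-bit ceiling first, then the architecture floor). A runnable distillation using the knob defaults named in the pseudocode:

```python
MIN_STALL = 3    # knob 741 default, per the pseudocode above
MAX_STALL = 15   # 4-bit field ceiling

def clamp_stall(required_wait):
    """Clamp a raw required-wait value into the encodable stall range.

    A sketch of the clamping step only; latency lookup is elided.
    """
    stall = min(max(required_wait, 0), MAX_STALL)  # cap to the 4-bit field
    return max(stall, MIN_STALL)                   # apply architecture floor

assert clamp_stall(0) == 3     # floor applies even with no pending producer
assert clamp_stall(7) == 7
assert clamp_stall(40) == 15   # longer waits need a dependency barrier
```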
Yield Flag
The yield flag is computed by sub_A333A0 alongside the stall count. The decision:
function ComputeYield(ctx, instr, stall):
if stall >= yield_threshold: // arch-dependent, typically 4
return 1 // long stall: yield to another warp
if instr is branch/exit:
return 1 // control flow: always yield
if instr.is_last_in_bb and next_bb.has_barrier:
return 1 // barrier boundary: yield
return 0
The yield threshold is read from the SM backend's latency table and varies by architecture. On sm_80+ it is typically 4 cycles.
Mercury Opex Path (Phase 120)
In addition to the Ori IR scoreboard passes (phases 114--116), the Mercury backend has its own scoreboard generation pass: MercGenerateOpex (phase 120, sub_703480 / sub_6FFDC0).
This pass runs after Mercury encode/decode (phase 117) and instruction expansion (phase 118), operating on the Mercury intermediate representation rather than Ori IR. It generates:
- DEPBAR instructions for explicit barrier management
- Scoreboard wait annotations
- Stall count adjustments for expanded instruction sequences
- Synchronization barriers for cross-warp dependencies
The Mercury opex pass and the Ori scoreboard passes serve different purposes:
- Phases 114--116 generate scoreboard metadata at the Ori IR level, before Mercury encoding
- Phase 120 generates additional scoreboard metadata for instructions introduced during Mercury expansion (pseudo-instruction expansion, WAR hazard insertion)
- The WAR pass must run twice (phases 119 and 121) because opex introduces new instructions that create additional write-after-read hazards
Scheduling Output Encoding
After the control word fields are computed, the scheduling output pipeline (0x8F1EB0--0x8FDD60, ~57 KB) encodes them into the final SASS binary format.
Encoding Pipeline
| Address | Size | Function | Purpose |
|---|---|---|---|
sub_8F1EB0 | 15 KB | EncodeScheduleWords | Main scheduling output encoder -- iterates all instructions and produces control words |
sub_8F3130 | 1.0 KB | EncodeStallField | Packs 4-bit stall count into control word |
sub_8F31F0 | 6.1 KB | EncodeBarrierField | Packs barrier set/wait fields with architecture-specific layout |
sub_8F3650 | 2.7 KB | EncodeYieldField | Packs yield flag |
sub_8F3860 | 3.0 KB | EncodeScoreboardField | Packs scoreboard dependencies |
sub_8F3AB0 | 5.0 KB | EncodeDependencyField | Packs inter-instruction dependency metadata |
sub_8F3DE0 | 1.3 KB | EncodeControlField | Packs control flags |
sub_8F3EA0 | 2.1 KB | ValidateEncoding | Checks encoded control word for consistency |
sub_8F3FE0 | 1.7 KB | EncodeWaitField | Packs wait mask |
sub_8F4140 | 5.6 KB | EncodeFullControlWord | Combines all fields into the final 23-bit encoding |
Emission
sub_8F4510 (EmitControlWordForInstr) writes the packed control word into the output buffer. sub_8F4820 (EmitControlBlock) constructs the complete 128-bit scheduling control instruction from three consecutive instruction slots.
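Assuming the three 23-bit slots occupy the low 69 bits of the 128-bit word contiguously (the exact slot-to-bit placement emitted by sub_8F4820, and the contents of the 59 reserved bits, are not confirmed), block emission reduces to:

```python
def emit_control_block(slot0, slot1, slot2):
    """Pack three 23-bit per-instruction slots into one 128-bit control
    instruction; the 59 reserved/flag bits are left zero in this sketch."""
    for s in (slot0, slot1, slot2):
        assert 0 <= s < (1 << 23)
    word = slot0 | (slot1 << 23) | (slot2 << 46)
    return word.to_bytes(16, "little")   # 128-bit little-endian word

blk = emit_control_block(0x00000F, 0x000010, 0x0000E0)
assert len(blk) == 16
# Slot 1 recovers from bits [45:23]:
assert (int.from_bytes(blk, "little") >> 23) & 0x7FFFFF == 0x000010
```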
Scoreboard Entry Construction
sub_8E4920 (6.9 KB) constructs a balanced BST (red-black tree) of scoreboard entries during schedule output. Each entry contains:
- Instruction pointer
- 16-bit scoreboard register ID
- 16-bit dependency type
The tree is used by the verification pass to check that barrier assignments are consistent across the instruction stream.
Verification
Seven verification functions (0x8F7610--0x8F8CB0) validate the generated schedule:
- Stall count bounds (0--15)
- Barrier index validity (0--5 or 7=none)
- Wait mask consistency (only wait on barriers that have been set)
- Scoreboard dependency completeness (every long-latency producer has a barrier)
- Control word format correctness
- Yield hint plausibility
- Overall schedule integrity (no live-range violations)
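The first three checks follow directly from the field semantics documented above. In this sketch, `active_barriers` (the set of currently pending barrier indices) is a hypothetical input; the real verifiers walk the scoreboard BST to derive it:

```python
def verify_control_word(stall, write_barrier, wait_mask, active_barriers):
    """Check one decoded control word against the first three invariants."""
    errors = []
    if not 0 <= stall <= 15:
        errors.append("stall out of range")            # check 1: bounds
    if write_barrier != 7 and not 0 <= write_barrier <= 5:
        errors.append("invalid barrier index")         # check 2: 0-5 or 7
    for b in range(6):                                 # check 3: wait only
        if wait_mask & (1 << b) and b not in active_barriers:  # on set bars
            errors.append(f"wait on unset barrier {b}")
    return errors

assert verify_control_word(4, 7, 0b000001, {0}) == []
assert verify_control_word(4, 6, 0b000010, {0}) == [
    "invalid barrier index", "wait on unset barrier 1"]
```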
Alternative Control Word Path (sub_8D7760)
sub_8D7760 (41 KB) is the stall/barrier insertion function used by the pre-scheduling passes. Unlike sub_A36360 which generates control words for the final Ori IR, sub_8D7760 operates during the scheduling algorithm itself, computing stall and barrier assignments as instructions are placed.
This function:
- Manages a 32-entry barrier tracking table
- Contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+)
- Computes stall cycles as the distance in cycles to the nearest dependent consumer
- Assigns barriers from the pool of 6 using an oldest-first eviction policy
- Handles architecture-specific barrier count variations
The two control word generators (sub_A36360 for final emission, sub_8D7760 for scheduling) share the same barrier allocation algorithm but operate at different pipeline stages. sub_8D7760 produces preliminary assignments that sub_A36360 may refine during the final scoreboard pass.
Architecture-Specific Control Word Configuration
sub_A2BD90 (23 KB) configures architecture-dependent scheduling parameters by querying feature flags through the architecture vtable at *(ctx+72). Configuration includes:
- Stall count thresholds and caps
- Barrier count (6 for all current architectures, but the infrastructure supports variation)
- Reuse buffer policy
- Yield threshold
- Texture scoreboard behavior
- Extended barrier modes for sm_100+
The function queries multiple feature IDs through vtable dispatch, building an architecture profile that sub_A36360 and its encoder chain use for all per-instruction decisions.
Per-Instruction Control Word (Internal Structure)
Within the scheduling context, the control word state is spread across several per-instruction fields, centered on the stall/barrier word at instruction offsets +196 through +200:
| Field | Offset | Bits | Description |
|---|---|---|---|
| Stall count | *(instr+200) bits [4:0] | 5 | Internal stall count (wider than the SASS encoding to allow values up to 31 during optimization) |
| Extended stall | *(instr+200) bits [9:5] | 5 | Second stall field for dual-issue scheduling |
| Barrier flags | *(instr+200) bits [14:10] | 5 | Barrier control flags |
| Control bits | *(instr+48) bits [17:13] | 5 | Barrier format in Mercury encoding |
| Scoreboard flag | *(instr+32) byte 13 bit 2 | 1 | Instruction has scoreboard information |
| Encoding format | *(instr+56) | DWORD | 4 = barrier format in Mercury |
| Stall bits | *(instr+168) | BYTE | Final stall value for encoding |
The sub_A2D340 (32 KB) function writes these fields through a large opcode switch, handling opcodes 50 (atomics), 73 (BAR), 74 (ST), 77 (LDS/STS), 78 (HMMA), and others with instruction-specific field layouts.
Function Map
| Address | Size | Identity |
|---|---|---|
| sub_85C890 | 1.5 KB | ScoreboardDispatcher -- opcode-based fast/slow path router |
| sub_A220A0 | 9 KB | InstructionPropertyQuery -- scheduling descriptor filler |
| sub_A22B40 | -- | WriteBarrierAssign -- barrier index assignment for operand |
| sub_A22BC0 | -- | ReadBarrierAssign -- wait mask assignment for operand |
| sub_A22CE0 | -- | InstructionClassify -- scoreboard processing classification |
| sub_A22D00 | -- | DependencyDistance -- compute instruction distance |
| sub_A227F0 | -- | FlaggedInstrPreprocess -- bit-12-set instruction handling |
| sub_A231E0 | -- | SchedulingScore -- full-DAG-analysis necessity check |
| sub_A23CF0 | 54 KB | DAGListScheduler -- full dependency-driven scoreboard |
| sub_A265B0 | 10 KB | BarrierDependencyTracker -- barrier assignment tracking |
| sub_A29220 | 12 KB | InstructionEmissionFilter -- instruction emission gating |
| sub_A2BD90 | 23 KB | ArchControlWordConfig -- architecture-specific parameter loader |
| sub_A2D340 | 32 KB | InstructionControlWordEncoder -- per-opcode field writer |
| sub_A31040 | -- | MasterFieldEncoder -- coordinates per-field encoders |
| sub_A31390 | 4 KB | BarrierConflictDetect -- barrier set intersection check |
| sub_A318F0 | 4 KB | BarrierDistanceCompute -- dependency distance to barrier |
| sub_A31F80 | 7 KB | ComputeReuseFlags -- operand reuse buffer hints |
| sub_A32C70 | -- | OperandDependencyAnalysis -- source/dest dep extraction |
| sub_A333A0 | 3 KB | EncodeStallAndYield -- 4-bit stall + 1-bit yield |
| sub_A33660 | 7 KB | EncodeReadBarrierMask -- 6-bit read barrier mask |
| sub_A342E0 | 9 KB | EncodeWriteBarrierIndex -- 3-bit write barrier index |
| sub_A34B70 | 10 KB | EncodeWaitBarrierMask -- 6-bit wait barrier mask |
| sub_A356A0 | 12 KB | EncodeScoreboardFields -- combined scoreboard encoder |
| sub_A36360 | 52 KB | GenerateControlWord -- master control word generator |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- pre-scheduling control words |
| sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST construction |
| sub_8E5CA0 | 20 KB | EmitScheduleOutput -- control word output encoder |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output |
| sub_8F4140 | 5.6 KB | EncodeFullControlWord -- 23-bit packing |
| sub_A95DC0 | 36 KB | SASSControlWordEncoder -- architecture-dispatched encoder |
| sub_6FFDC0 | 66 KB | MercOpexBody -- Mercury opex scoreboard generation |
| sub_703480 | 1.4 KB | RunOpexPass -- MercGenerateOpex entry |
Cross-References
- Scheduler Overview -- 3-phase scheduler architecture, scheduling output pipeline
- Scheduling Algorithm -- priority list scheduling, dependency DAG construction
- Latency Model -- per-opcode latency tables used by stall count computation
- Mercury Encoder -- Mercury pipeline including MercGenerateOpex (phase 120)
- SASS Encoding -- instruction encoding format including control word bit layout
- Phase Manager -- how phases 114--116 fit in the 159-phase pipeline
- Sync & Barriers -- software synchronization barriers (distinct from dependency barriers)
- Knobs -- scheduling knobs 741 (stall threshold), 805/806 (stall caps)
Code Generation Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS code generation subsystem converts optimized Ori IR into executable GPU machine code. It is the largest subsystem in ptxas by every metric: approximately 12,000 functions, 9 MB of binary code, and nine functions so large that Hex-Rays cannot decompile them. The pipeline spans phases 112--158 of the 159-phase PhaseManager and comprises seven interlinked subsystems -- instruction selection, SASS binary encoding, peephole optimization, the Mercury encoding pipeline, Newton-Raphson math templates, SASS text generation, and ELF output packaging. Every subsystem dispatches through per-SM-family tables, so the same high-level flow produces correct output for targets from Kepler (sm_30) through Blackwell Ultra (sm_121).
| Pipeline phases | 112--158 (code generation spans the final third of the pipeline) |
| Total functions | ~12,000 (ISel, encoding, peephole, Mercury, formatters, ELF) |
| Total binary size | ~9 MB of machine code |
| Non-decompilable functions | 9 (3 peephole + 6 encoding megadispatchers) |
| Core primitive | sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers) |
| Architecture selector | *(int*)(config+372) >> 12 -- SM generation ID |
| Largest function | sub_169B190 -- generic peephole dispatcher (280 KB) |
| Output modes | mercury (SM 75--99), capmerc (SM 100+), sass (explicit) |
| CLI option | --binary-kind mercury,capmerc,sass |
Pipeline
Optimized Ori IR (register-allocated, scheduled)
|
v
┌─────────────────────────────────────────────────────────────┐
│ SASS CODE GENERATION │
│ │
│ 1. Instruction Selection (ISel) ───────> [isel.md] │
│ │ DAG pattern matching: ~750 matchers │
│ │ Mega-selector: sub_C0EB10 (185 KB) │
│ │ 4 arch-variant dispatch tables │
│ v │
│ 2. SASS Binary Encoding ───────────────> [encoding.md] │
│ │ ~4,000 template-generated handlers │
│ │ 6 megadispatchers (750 KB total) │
│ │ sub_7B9B80 bitfield packer (18,347 callers) │
│ v │
│ 3. Peephole Optimization ──────────────> [peephole.md] │
│ │ 3 mega-dispatchers: 280+233+233 KB = 746 KB │
│ │ ~3,185 pattern matchers │
│ v │
│ 4. Mercury Pipeline (phases 117-122) ──> [mercury.md] │
│ │ Encode/Decode → Expand → WAR → Opex → WAR → SASS │
│ │ sub_6D9690 master encoder (94 KB) │
│ v │
│ 5. Newton-Raphson Templates ───────────> [templates.md] │
│ │ DDIV/DRCP/DSQRT/DRSQRT software sequences │
│ │ 36 functions, up to 298 virtual registers each │
│ v │
│ 6. SASS Text Generation (phase 129) ──> [sass-printing.md] │
│ │ 580 formatter functions + 12.9 KB dispatcher │
│ v │
│ 7. ELF/Cubin Output ──────────────────> [../output/…] │
│ sub_612DE0 finalizer → sub_1C9F280 ELF emitter │
└─────────────────────────────────────────────────────────────┘
|
v
.cubin / .o (NVIDIA custom ELF)
Phase-to-Subsystem Map
The code generation pipeline occupies phases 112--158. This table maps each phase to its subsystem and documents the six-stage Mercury core that is the dominant path for SM 75+ targets.
| Phase | Name | Subsystem | Detail page |
|---|---|---|---|
| 112 | PlaceBlocksInSourceOrder | Block layout | cfg.md |
| 113 | PostFixForMercTargets | Mercury pre-fixup | mercury.md |
| 114 | FixUpTexDepBarAndSync | Scoreboard / sync | scoreboards.md |
| 115 | AdvancedScoreboardsAndOpexes | Scoreboard hook | scoreboards.md |
| 116 | ProcessO0WaitsAndSBs | Scoreboard (-O0) | scoreboards.md |
| 117 | MercEncodeAndDecode | Mercury core | mercury.md |
| 118 | MercExpandInstructions | Mercury core | mercury.md |
| 119 | MercGenerateWARs1 | Mercury core | mercury.md |
| 120 | MercGenerateOpex | Mercury core | mercury.md |
| 121 | MercGenerateWARs2 | Mercury core | mercury.md |
| 122 | MercGenerateSassUCode | Mercury core | mercury.md |
| 123 | ComputeVCallRegUse | Post-Mercury bookkeeping | -- |
| 124 | CalcRegisterMap | Post-Mercury bookkeeping | -- |
| 125 | UpdateAfterPostRegAlloc | Post-Mercury bookkeeping | -- |
| 126 | ReportFinalMemoryUsage | Reporting | dumpir.md |
| 127 | AdvancedPhaseOriPhaseEncoding | Encoding hook | -- |
| 128 | UpdateAfterFormatCodeList | Post-Mercury bookkeeping | -- |
| 129 | DumpNVuCodeText | SASS text output | sass-printing.md |
| 130 | DumpNVuCodeHex | SASS hex output | dumpir.md |
| 131 | DebuggerBreak | Debug | -- |
| 132 | UpdateAfterConvertUnsupportedOps | Late cleanup | -- |
| 133 | MergeEquivalentConditionalFlow | Late cleanup | -- |
| 134 | AdvancedPhaseAfterMidExpansion | Late cleanup hook | -- |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Late cleanup hook | -- |
| 136 | LateMergeEquivalentConditionalFlow | Late cleanup | -- |
| 137 | LateExpansionUnsupportedOpsMid | Late lowering | -- |
| 138 | OriSplitHighPressureLiveRanges | Late regalloc fixup | -- |
| 139--158 | (architecture-specific) | Arch backends | phase-manager.md |
Subsystem grouping summary:
| Subsystem | Phases | Key property |
|---|---|---|
| Block layout | 112 | Restores source-order block placement |
| Scoreboard / sync | 113--116 | Pre-Mercury texture and dependency bar fixups |
| Mercury core | 117--122 | Six-stage encode-expand-WAR-opex-WAR-emit pipeline |
| Post-Mercury bookkeeping | 123--128 | Register maps, data structure refresh |
| SASS output + debug | 129--131 | Text/hex dumps and debugger hook |
| Late cleanup | 132--138 | Conditional merging, late lowering, live-range splits |
| Arch-specific | 139--158 | 20 backend-overridable phases (no-op by default) |
Scale
| Subsystem | Functions | Binary size | Key entry point |
|---|---|---|---|
| ISel pattern matchers | ~750 | ~1.3 MB | sub_B285D0 (ISel driver, 9 KB) |
| ISel mega-selector | 1 | 185 KB | sub_C0EB10 |
| SASS encoding handlers | ~4,000 | ~2.5 MB | sub_7B9B80 (bitfield packer) |
| Encoding megadispatchers | 6 | ~750 KB | sub_10C0B20 (setField, 180 KB) |
| Peephole mega-dispatchers | 3 | ~746 KB | sub_169B190 (generic, 280 KB) |
| Peephole pattern matchers | ~3,185 | ~1.5 MB | (individual matchers) |
| Mercury pipeline | ~50 | ~400 KB | sub_6F52F0 (orchestrator, 23 KB) |
| Mercury encode tables | 530 | ~500 KB | format initializers at 0xC66000 |
| Encoding vtable methods | ~2,735 | ~450 KB | tiny dispatchers at 0xAF0000 |
| Newton-Raphson templates | 36 | ~180 KB | sub_170E260 (DDIV coordinator) |
| SASS text formatters | 580 | ~850 KB | sub_5D4190 (dispatcher, 12.9 KB) |
| ELF emitter | ~60 | ~300 KB | sub_1C9F280 (master, 97 KB) |
| Total | ~12,000 | ~9 MB | -- |
Nine functions exceed the decompilation threshold: the three peephole mega-dispatchers (280 + 233 + 233 KB) and the six encoding megadispatchers (180 + 197 + 187 + 142 + 68 + 65 KB). All analysis of these functions derives from disassembly, call graphs, and the smaller functions they invoke.
Instruction Selection
ISel converts abstract Ori IR operations into concrete SASS instruction forms using SelectionDAG-style pattern matching. Unlike upstream LLVM's TableGen-driven ISel, ptxas uses handwritten C++ matchers compiled into ~750 functions invoked from the ISel driver via per-opcode dispatch tables. The ISel driver (sub_B285D0, 9 KB, 66 callees) selects architecture-variant builders based on the SM version. The mega-selector (sub_C0EB10, 185 KB) handles the full IR-to-SASS mapping through a giant switch over instruction opcodes. Four nearly identical dispatch functions (15,049 bytes each) at sub_B128E0--sub_B12920 provide architecture-variant opcode routing, all jumping to shared handler code at 0x1C39xxx.
See Instruction Selection for the full DAG matcher protocol, helper function table, architecture dispatch tables, and operand variant selectors.
SASS Binary Encoding
The encoding subsystem translates ISel output into packed binary SASS machine code. Each instruction is encoded into a 1280-bit (160-byte, 20-QWORD) buffer via the universal bitfield packer sub_7B9B80. The full architecture is documented in SASS Instruction Encoding; the key facts for the overview:
- ~4,000 encoding handler functions -- each follows an identical 10-phase template, differing only in constants and modifier helpers
- 6 megadispatchers (750 KB total) route field-level queries by instruction category: setField (180 KB), getFieldOffset (197 KB), hasField (187 KB), setFieldDefault (142 KB), getOperandFieldOffset (68 KB), setOperandField (65 KB)
- 2,095 bitfield accessor functions at 0x10B0000--0x10BF2C0 (1,661 under 200 bytes)
- 530 encoding table initializers at 0xC66000--0xD27000, each populating one instruction format row
- 3-level opcode hierarchy: major (9 bits), minor (8 bits), sub-opcode (7 bits)
- Instruction widths: 64-bit (format code 1), 128-bit (format code 2), 256-bit (format code 8)
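A plausible shape for the universal packer, sketched under the assumption that sub_7B9B80 inserts an arbitrary-width field at an arbitrary bit offset into the 20-QWORD (1280-bit) buffer and handles fields straddling a QWORD boundary (the real signature is not recovered):

```c
#include <stdint.h>

/* Sketch of a universal bitfield insert into the 1280-bit encoding
   buffer (20 QWORDs). Offsets and widths are caller-supplied; a field
   may straddle two adjacent QWORDs. This mirrors what a primitive like
   sub_7B9B80 must do; it is not a recovered implementation. */
static void bitfield_insert(uint64_t buf[20], unsigned bit_off,
                            unsigned width, uint64_t value) {
    unsigned q = bit_off / 64, b = bit_off % 64;
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    value &= mask;
    buf[q] = (buf[q] & ~(mask << b)) | (value << b);
    if (b + width > 64) {                 /* straddles into next QWORD */
        unsigned spill = b + width - 64;
        uint64_t hi_mask = (1ULL << spill) - 1;
        buf[q + 1] = (buf[q + 1] & ~hi_mask) | (value >> (64 - b));
    }
}

/* Demo: write 0xAB across the QWORD boundary at bit 60, then a 9-bit
   field at bit 0; returns 1 if both land where expected. */
static int demo_straddle(void) {
    uint64_t buf[20] = { 0 };
    bitfield_insert(buf, 60, 8, 0xAB);   /* 0xB stays, 0xA spills over */
    bitfield_insert(buf, 0, 9, 0x1FF);
    return (buf[0] >> 60) == 0xB
        && (buf[1] & 0xF) == 0xA
        && (buf[0] & 0x1FF) == 0x1FF;
}
```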
Peephole Optimization
Three monolithic dispatch functions implement brute-force pattern-match-and-rewrite. The full architecture is documented in Peephole Optimization. The key positioning facts:
| Dispatcher | Size | Matchers | Entry trampoline | Runs when |
|---|---|---|---|---|
| sub_169B190 | 280 KB | 762 | sub_B12930 | Pre-scheduling (all SM) |
| sub_143C440 | 233 KB | 1,087 | sub_B12940 | Pre-scheduling (SM 120 only) |
| sub_198BCD0 | 233 KB | 1,336 | sub_B12960 | Post-scheduling (all SM) |
All three use identical architecture: a 373-case primary switch on the 16-bit opcode at instruction+0x0C, per-case pattern matcher invocations with priority tracking, and a secondary switch for rewrite actions. The SM 120 dispatcher (sub_143C440) is architecture-gated and runs only when compiling for consumer RTX 50-series or enterprise Pro GPUs.
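The dispatch shape can be sketched as follows; the Instr layout beyond the opcode offset, the matcher tables, and the priority values are hypothetical stand-ins:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative skeleton of a peephole mega-dispatcher: read the 16-bit
   opcode stored at instr+0x0C, try every matcher registered for that
   opcode, track the highest-priority hit, then run its rewrite. */
typedef struct Instr {
    unsigned char pad[0x0C];  /* operand fields, flags, ...        */
    uint16_t opcode;          /* 16-bit opcode at offset +0x0C     */
    int rewritten;            /* demo flag set by the rewrite step */
} Instr;

typedef int  (*match_fn)(Instr *);   /* returns priority, 0 = no match */
typedef void (*rewrite_fn)(Instr *);
typedef struct { match_fn match; rewrite_fn rewrite; } Pattern;

static void peephole_dispatch(Instr *i, const Pattern *pats[],
                              const size_t counts[]) {
    const Pattern *best = NULL;
    int best_prio = 0;
    for (size_t k = 0; k < counts[i->opcode]; k++) {
        int p = pats[i->opcode][k].match(i);
        if (p > best_prio) { best_prio = p; best = &pats[i->opcode][k]; }
    }
    if (best)
        best->rewrite(i);
}

/* Tiny demo registration: one pattern on opcode 7. */
static int  demo_match(Instr *i)   { return i->opcode == 7 ? 10 : 0; }
static void demo_rewrite(Instr *i) { i->rewritten = 1; }
static const Pattern demo_pats[] = { { demo_match, demo_rewrite } };

static int demo_run(uint16_t op) {
    Instr ins = { {0}, op, 0 };
    const Pattern *tbl[16] = { 0 };
    size_t cnt[16] = { 0 };
    tbl[7] = demo_pats;
    cnt[7] = 1;
    peephole_dispatch(&ins, tbl, cnt);
    return ins.rewritten;
}
```

In the real dispatchers the per-opcode matcher lists are inlined into the 373-case switch rather than table-driven, but the priority-tracked match-then-rewrite flow is the same.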
Mercury Pipeline
Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It occupies phases 113--122 and forms a six-stage sub-pipeline. Three output modes are controlled by --binary-kind: mercury (SM 75--99), capmerc (SM 100+, with embedded PTX source and relocation metadata), and sass (explicit direct SASS output). The master encoder sub_6D9690 (94 KB) is the largest backend function, with the orchestrator sub_6F52F0 (23 KB, 18 parameters) driving the full stage sequence.
See Mercury Encoder Pipeline for the six-stage architecture, key function table, and output mode details. See Capsule Mercury & Finalization for the SM 100+ variant.
Newton-Raphson Templates
Double-precision operations lacking dedicated hardware (DDIV, DRCP, DSQRT, DRSQRT) are lowered into multi-instruction SASS sequences implementing Newton-Raphson iterative refinement. The template system at 0x1700000--0x1722D60 comprises 36 functions organized in a two-level hierarchy: a top-level handler per operation delegates to a coordinator that allocates up to 298 virtual registers and chains 5--7 sub-expander functions. The register-count dispatcher sub_1704070 selects between full inline, partial inline, and template-based expansion paths based on register file pressure (thresholds: 20,479 / 16,383).
See Newton-Raphson Templates for the complete template hierarchy, register-count dispatch logic, and sub-expander details.
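The refinement these templates emit can be illustrated directly: DRCP-style reciprocal computes x_{n+1} = x_n * (2 - d * x_n), squaring the error each step. A pure-C sketch, with frexp()/ldexp() standing in for the hardware's low-precision seed (MUFU.RCP on real silicon); the seed formula and iteration count are illustrative, and d is assumed positive:

```c
#include <math.h>

/* Newton-Raphson reciprocal: x = x * (2 - d * x) converges
   quadratically to 1/d whenever the seed satisfies |1 - d*x0| < 1.
   The frexp/ldexp seed below keeps that residual <= 0.6 for any
   positive d; ptxas's templates instead start from the hardware
   approximation instruction. */
static double nr_recip(double d) {
    int e;
    double m = frexp(d, &e);        /* d = m * 2^e with m in [0.5, 1) */
    double x = ldexp(1.4 - m, -e);  /* crude linear seed for 1/d      */
    for (int i = 0; i < 6; i++)
        x = x * (2.0 - d * x);      /* each step squares the error    */
    return x;
}
```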
SASS Text Generation
Phase 129 (DumpNVuCodeText) converts the internal instruction stream into human-readable SASS assembly text for --verbose output and --out-sass dumps. The dispatcher sub_5D4190 (12.9 KB) routes 81 named opcodes via direct string comparison and 473 via hash-based switch to 580 template-generated formatter functions at 0x4DA340--0x5A8E40 (~850 KB). All formatters use a monolithic 1.8 MB format string table -- an unusual design that trades memory for formatting speed.
See SASS Text Generation for the full formatter architecture and opcode routing details.
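The two-tier routing (string comparison for a short hot list, hash-based switch for the rest) can be sketched as follows; the table contents and hash function are illustrative stand-ins, not recovered data:

```c
#include <string.h>

/* Illustrative two-tier opcode-name router: a small strcmp list for
   hot opcodes, falling back to a hash-checked lookup for the rest. */
typedef struct { const char *name; int formatter_id; } Entry;

static const Entry hot[]  = { { "MOV", 1 }, { "IMAD", 2 }, { "LDG", 3 } };
static const Entry cold[] = { { "HMMA", 10 }, { "BAR", 11 } };

static unsigned djb2(const char *s) {      /* stand-in hash */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static int route_opcode(const char *name) {
    for (size_t i = 0; i < sizeof hot / sizeof hot[0]; i++)
        if (strcmp(name, hot[i].name) == 0)
            return hot[i].formatter_id;
    unsigned h = djb2(name);               /* switch selector in the real code */
    for (size_t i = 0; i < sizeof cold / sizeof cold[0]; i++)
        if (djb2(cold[i].name) == h && strcmp(name, cold[i].name) == 0)
            return cold[i].formatter_id;
    return -1;                             /* unknown opcode */
}
```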
ELF/Cubin Output
The final stage packages the encoded SASS binary into NVIDIA's custom ELF format (.cubin/.o). The kernel finalizer sub_612DE0 (47 KB) feeds the master ELF emitter sub_1C9F280 (97 KB), which delegates to symbol table emission (sub_713710), relocation generation (sub_7163C0), string table construction (sub_7122C0), and section layout finalization (sub_716DC0).
See ELF/Cubin Output for section catalog, relocation format, and EIATTR attribute encoding.
Intrinsic Lowering
The OCG (Optimizing Code Generator; see glossary) intrinsic system at 0x6C0000--0x6D0000 handles PTX builtin operations for SM 100+ targets. The master intrinsic table at sub_6C9EB0 (13 KB) initializes a 10,664-byte dispatch table with prefix "__nv_ptx_builtin_ocg_", covering operations from basic add/load/store through SM 100 tensor core (tcgen05) and bulk async copy:
| Handler | Size | Operations |
|---|---|---|
| sub_6C0D90 | 19 KB | Atomic reduce (atom.add/min/max/cas -- 54 validation strings) |
| sub_6C3470 | 20 KB | cp.async.bulk (bulk async copy) |
| sub_6C1CF0 | 16 KB | mbarrier (arrive, wait, test, counted variants) |
| sub_6C4DA0 | 15 KB | Load/store with scope, memory order, domain validation |
| sub_6D4350 | 30 KB | MMA intrinsics (HMMA, IMMA, DMMA variants) |
| sub_6D7AF0 | 19 KB | TCGen05 MMA (SM 100, 5th generation tensor core) |
Intrinsic parameter validators at sub_6BDB60--sub_6BF910 enforce type, sub-operation, and memory domain constraints. NVIDIA consistently misspells "intrinsic" as "instrinsic" in all validation error strings.
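Prefix-gated routing of this kind can be sketched as follows; only the "__nv_ptx_builtin_ocg_" prefix is recovered, while the suffixes and handler IDs below are illustrative stand-ins:

```c
#include <string.h>

/* Sketch of prefix-gated intrinsic routing: a name is only considered
   if it starts with the recovered prefix; the suffix then selects a
   handler. Suffixes and handler IDs are hypothetical. */
#define OCG_PREFIX "__nv_ptx_builtin_ocg_"

typedef struct { const char *suffix; int handler; } IntrinsicRow;

static const IntrinsicRow rows[] = {
    { "atom_add", 1 }, { "mbarrier_arrive", 2 }, { "cp_async_bulk", 3 },
};

static int route_intrinsic(const char *name) {
    size_t plen = strlen(OCG_PREFIX);
    if (strncmp(name, OCG_PREFIX, plen) != 0)
        return -1;                       /* not an OCG intrinsic      */
    const char *suffix = name + plen;
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        if (strcmp(suffix, rows[i].suffix) == 0)
            return rows[i].handler;
    return 0;                            /* prefix matched, unknown op */
}
```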
Post-Scheduling Statistics
Eight SM-variant statistics printers at sub_ABBA50--sub_ABEB50 (7,603 bytes each, spaced 0x700 apart) generate "# [...] " comments with comprehensive post-codegen metrics: instruction counts, register usage, spill/refill bytes, estimated latency and occupancy, per-functional-unit instruction estimates, MMA counts, and throughput figures. The per-unit instruction counter sub_ABF590 (17 KB) uses SSE2 operations for batch updates.
Operand Legalization
Post-register-allocation operand legalization rewrites instructions that cannot be directly encoded in SASS:
| Address | Size | Purpose |
|---|---|---|
| sub_AB3C30 | 32 KB | Post-RA instruction legalization (opcodes 288, 167, 185, 241, 299, 300, 317) |
| sub_AB2D50 | 18 KB | Per-class operand legalization (opcode 307 = ternary/FMA-like) |
| sub_ACF4D0 | 14 KB | Constraint solver -- splits instructions when direct encoding fails |
| sub_AB8940 | 19 KB | Register move coalescing / copy elimination |
| sub_AC2750 | 36 KB | Operand-to-encoding converter (36-byte operand records) |
When legalization requires instruction splitting, sub_ACF4D0 creates new instructions via sub_934630 (instruction constructor). The constraint solver tries alternative encodings before resorting to splits.
WGMMA Pipeline (SM 90+)
The WGMMA (Warp Group Matrix Multiply-Accumulate) pipeline optimizer at 0xACE000--0xAE6000 manages asynchronous tensor core execution for Hopper and later. It automatically inserts warpgroup.arrive and warpgroup.wait fences to ensure correct register handoff. The warning emitter (sub_ACE480) issues "Potential Performance Loss" advisories (codes 7509--7511) when pipelining fails due to extern calls, insufficient registers, or ill-formed pipeline stages.
See WGMMA Pipeline Optimizer for the full call tree, register pressure estimator, and serialization warning details.
Per-SM Architecture Dispatch
Every code generation subsystem dispatches through architecture-specific tables. The SM generation is determined by *(int*)(config+372) >> 12:
| config+372 >> 12 | Generation | SM versions |
|---|---|---|
| 3 | Kepler | sm_30--sm_37 |
| 5 | Maxwell | sm_50--sm_53 |
| 6 | Pascal | sm_60--sm_62 |
| 7 | Volta / Turing | sm_70--sm_75 |
| 8 | Ampere | sm_80--sm_89 |
| 9 | Hopper | sm_90--sm_90a |
| 10+ | Blackwell | sm_100--sm_121 |
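One packing consistent with this table is (major << 12) | minor, which reproduces every generation ID above. Only the >> 12 extraction is recovered from the binary; the packing itself is inferred and should be treated as an assumption:

```c
/* Hypothetical decode of the architecture selector at config+372,
   assuming the field packs the SM version as (major << 12) | minor
   (e.g. sm_75 -> 0x7005, sm_100 -> 0xA000). Only the >>12 generation
   extraction is observed in the binary. */
static int sm_generation(int arch_field) {
    return arch_field >> 12;   /* 7 = Volta/Turing, 8 = Ampere, ... */
}

static int make_arch_field(int major, int minor) {
    return (major << 12) | minor;
}
```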
Architecture-specific dispatch points across the codegen pipeline:
| Subsystem | Dispatch mechanism | Evidence |
|---|---|---|
| ISel | 4 arch-variant dispatch tables at sub_B128E0--sub_B12920 | All JUMPOUT to shared code at 0x1C39xxx |
| Encoding | vtable at *(context+416) with ~200 virtual methods | Per-opcode encoding, latency, hazard rules |
| Peephole | 3 mega-dispatchers with per-SM case logic | SM 120 dispatcher (sub_143C440) is arch-gated |
| Mercury | sub_6E8EB0 sets arch-specific flags in opcode descriptor table | SM 80: bits 1, 8; SM 84: bits 16, 64 |
| Statistics | 8 SM-variant printer clones at sub_ABBA50--sub_ABEB50 | 7,603 bytes each, 0x700 spacing |
| NR templates | Register-count-based dispatch at sub_1704070 | Thresholds: 20479 / 16383 |
Function Map (Top 10)
| Address | Size | Identity |
|---|---|---|
| sub_169B190 | 280 KB | Generic peephole dispatcher (all SM, 762 matchers) |
| sub_10D5E60 | 197 KB | Encoding getFieldOffset megadispatcher (961 callers) |
| sub_10E32E0 | 187 KB | Encoding hasField megadispatcher (72 callers) |
| sub_C0EB10 | 185 KB | Main instruction selector (500+ locals, giant switch) |
| sub_10C0B20 | 180 KB | Encoding setField megadispatcher (3,109 callers) |
| sub_10CCD80 | 142 KB | Encoding setFieldDefault megadispatcher (4 callers) |
| sub_1C9F280 | 97 KB | Master ELF emitter |
| sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) |
| sub_6FFDC0 | 66 KB | Mercury opex body (scoreboard generation) |
| sub_6E8EB0 | 64 KB | BasicBlock::Initialize (encoder state, opcode descriptors) |
See function-map.md for the complete table (~30 entries with all codegen functions).
Cross-References
- Instruction Selection -- DAG pattern matching, builder variants, operand validation
- SASS Instruction Encoding -- bit-level encoding format, 10-phase template, opcode hierarchy
- Peephole Optimization -- 3 mega-dispatchers, 3,185 matchers, priority-based rewrite
- Mercury Encoder Pipeline -- 6-stage sub-pipeline, WAR resolution, opex
- Capsule Mercury & Finalization -- SM 100+ variant with embedded PTX + relocations
- Newton-Raphson Templates -- DDIV/DRCP/DSQRT/DRSQRT software sequences
- SASS Text Generation -- 580 formatters, format string table
- Pipeline Overview -- full PTX-to-SASS compilation flow
- Phase Manager -- 159-phase pipeline infrastructure
- Scheduling Architecture -- 3-phase scheduler (pre-codegen)
- Register Allocation -- Fatpoint algorithm (pre-codegen)
- ELF/Cubin Output -- custom ELF emitter, section catalog
- Knobs System -- knobs controlling codegen behavior
Instruction Selection
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which uses a single SelectionDAG or GlobalISel framework, ptxas distributes instruction selection across two distinct pipeline stages separated by the entire optimization pipeline: Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (phase 5, ConvertUnsupportedOps), and Phase 2 converts Ori IR to final SASS binary forms during code generation (phases 112--122, ISel driver + Mercury encoder). The two phases serve fundamentally different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the optimal machine encoding for the target architecture after register allocation and scheduling are complete.
| Phase 1 location | Phase 5: ConvertUnsupportedOps (PTX opcode to Ori opcode) |
| Phase 2 location | Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary) |
| MercConverter dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF mask) |
| ISel driver | sub_B285D0 (9 KB, 66 callees, vtable entry) |
| ISel mega-selector | sub_C0EB10 (185 KB, 500+ locals, giant switch) |
| DAG pattern matchers | ~801 functions at 0xB28F60--0xB7D000 (~1.3 MB) |
| Arch dispatch tables | 4 copies at sub_B128E0--sub_B12920 (15,049 bytes each) |
| Mercury master encoder | sub_6D9690 (94 KB, instruction type switch) |
| MercExpand | sub_C3CC60 (26 KB, pseudo-instruction expansion) |
| SM120 pattern coordinator | sub_13AF3D0 (137 KB, 130-case switch, opcodes 2--352) |
| Opcode variant selectors | sub_B0BE00 (19 KB, class 194), sub_B0AA70 (5 KB, class 306) |
Architecture
PTX source text
|
v
[Bison parser] sub_4CE6B0 (48KB)
| Reduction actions build raw Ori nodes with PTX-derived opcodes
v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...) |
+------------------------------------------------------------------+
|
| PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
|
| sub_9F3340 (orchestrator, 7KB)
| -> sub_9F1A90 (MercConverter main, 35KB)
| -> sub_9ED2D0 (opcode dispatch, 25KB)
| Switch on (*(instr+72)) with BYTE1 & 0xCF mask
| ~120 case values -> ~60 handler functions
| + vtable dispatch for architecture-extensible ops
| -> sub_934630 (instruction creation, called N times)
| -> sub_9EF5E0 (post-conversion lowering, 27KB)
|
v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...).|
| Every instruction has a valid SASS opcode for the target SM. |
+------------------------------------------------------------------+
|
| [Phases 14-111: Full optimization pipeline]
| Register allocation, scheduling, peephole, etc.
|
v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled) |
+------------------------------------------------------------------+
|
| PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
|
| sub_B285D0 (ISel driver, 9KB)
| -> sub_C0EB10 (mega-selector, 185KB, default backend)
| -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
| -> sub_B1FA20 / sub_B20E00 (builder variants)
| -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
| -> sub_B128E0..sub_B12920 (4 arch dispatch tables)
|
| sub_6D9690 (Mercury master encoder, 94KB)
| -> Switch on instruction type (*(instr+8))
| -> sub_C00BF0 (opcode lookup)
| -> sub_91D160 (register encoding)
| -> sub_7B9B80 (bitfield insert, 18,347 callers)
|
| sub_C3CC60 (MercExpand, 26KB)
| -> sub_C37A10 (expand instruction, 16KB)
| -> sub_C39B40 (expand memory, 10KB)
| -> sub_C3BCD0 (expand control flow, 19KB)
|
v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words) |
+------------------------------------------------------------------+
Phase 1: PTX-to-Ori Opcode Conversion
Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.
The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.
MercConverter Dispatch -- sub_9ED2D0 (25 KB)
The central dispatch function of Phase 1. Despite the sweep's initial identification as PhaseRunner::executePhaseSequence, the decompiled code reveals a classic opcode switch: it reads *(instr+72), masks byte 1 with 0xCF (stripping modifier bits 4--5), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352) routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.
// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
// Pre-dispatch: check predication eligibility
bool can_predicate = sub_7E18A0(instruction, *(context+8));
if (can_predicate)
can_predicate = vtable[205](*(*(context+8)+1584), instruction);
*(context+40) = can_predicate;
// Read opcode, mask out modifier bits
int opcode = *(DWORD*)(instruction + 72);
BYTE1(opcode) &= 0xCF;
  // Special case: opcode 130 (MOV-like internal marker) with GPR operand -> clear predication
if (opcode == 130) {
int operand = *(DWORD*)(instruction + 84);
if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
*(context+40) = 0;
}
// Main dispatch
switch (opcode) {
case 1: sub_9DA5C0(context, instruction); break; // opcode class 1
case 6: sub_9DA100(context, instruction); break; // arithmetic
case 8: sub_9D2440(context, instruction); break; // specific class
case 10: case 11: case 149: case 151: case 152: case 290: case 291:
sub_9D80E0(context, instruction); break; // memory load/store
case 16: sub_9E8B20(context, instruction); break; // texture/surface
case 61: case 63: case 80:
sub_9E6600(context, instruction); break; // instruction expansion
case 108: sub_9D76D0(context, instruction); break; // memory legalization
// ... ~100 more cases ...
default: emit_noop(context, 0xFFFF); break; // unknown -> passthrough
}
// Post-dispatch: apply predication and operand adjustments
vtable[107](context, instruction);
}
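Expressed on the whole 32-bit opcode word, the BYTE1(opcode) &= 0xCF step clears dword bits 12--13 and nothing else:

```c
#include <stdint.h>

/* The BYTE1 & 0xCF mask from sub_9ED2D0, restated on the full DWORD:
   byte 1 is bits [8:15], so clearing its bits 4-5 clears dword bits
   12-13 (the modifier bits) while leaving every other bit intact. */
static uint32_t strip_modifier_bits(uint32_t raw_opcode) {
    return raw_opcode & ~0x3000u;   /* same effect as BYTE1 &= 0xCF */
}
```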
MercConverter Opcode Dispatch Table
The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.
Direct handler dispatch:
| Opcode(s) | Handler | Size | Category |
|---|---|---|---|
| 1 | sub_9DA5C0 | 2 KB | Opcode class 1 (basic ALU) |
| 6 | sub_9DA100 | 9 KB | Arithmetic operations |
| 8 | sub_9D2440 | -- | Specific class |
| 10, 11, 149, 151, 152, 290, 291 | sub_9D80E0 | 17 KB | Memory load/store |
| 15, 85 | sub_9EC340 | 23 KB | Multi-operand legalization |
| 16 | sub_9E8B20 | 17 KB | Texture/surface lowering |
| 17 | sub_9E7FB0 | -- | Surface operations |
| 22 | sub_9D6DB0 | -- | Specific lowering |
| 23 | sub_9E58F0 | -- | Specific lowering |
| 24 | sub_9D9F60 | -- | Specific lowering |
| 26 | sub_9E54C0 | -- | Specific lowering |
| 27 | sub_9E4BB0 | -- | Specific lowering |
| 28 | sub_9D9E70 | -- | Specific lowering |
| 32, 271 | sub_9E2440 | -- | Bitfield operations |
| 34 | sub_9E55E0 | -- | Specific lowering |
| 38, 59, 106, 180, 182, 192, 194, 215, 221, 242 | sub_9DA6B0 | -- | Generic ALU group |
| 41, 284 | sub_9D1DA0 | -- | Specific lowering |
| 42, 53, 55, 66 | sub_9D54B0 | -- | Grouped operations |
| 47 | sub_9E74E0 | -- | Conditional (arch flag check) |
| 51 | sub_9E2F60 | -- | Specific lowering |
| 52, 54, 72, 97 | sub_9D09C0 | -- | Group with v8=1 (deletion flag) |
| 57, 101 | sub_9D6170 | -- | Paired operations |
| 60, 62, 78, 79 | sub_9E5EE0 | -- | Comparison group |
| 61, 63, 80 | sub_9E6600 | 25 KB | Instruction expansion (64-bit split) |
| 67 | sub_9D9C30 | -- | Specific lowering |
| 70 | sub_9E3490 | -- | Specific lowering |
| 75 | sub_9E0C10 | -- | Specific lowering |
| 77 | sub_9E4DF0 | -- | Specific lowering |
| 83 | sub_9D6AB0 | -- | Specific lowering |
| 88, 89 | sub_9D5990 | -- | Paired operations |
| 90 | sub_9D2820 | -- | Specific lowering |
| 91 | sub_9E7600 | -- | Specific lowering |
| 92 | sub_9E7890 | -- | Specific lowering |
| 93, 95 | sub_9E1D40 | -- | Comparison variants |
| 94 | sub_9E1DF0 | -- | Specific lowering |
| 96 | sub_9D41C0 | -- | Specific lowering |
| 98 | sub_9D3230 | -- | Specific lowering |
| 100 | sub_9D70E0 | -- | Specific lowering |
| 102 | sub_9D9750 | -- | Specific lowering |
| 103, 104 | sub_9E31D0 | -- | Paired operations |
| 108 | sub_9D76D0 | 18 KB | Memory instruction legalization |
| 124 | sub_9E18B0 | -- | Specific lowering |
| 135 | sub_9D6560 | -- | Specific lowering |
| 139, 140, 141, 143 | sub_9D4C10 | -- | Related operations group |
| 145 | sub_9D3020 | -- | Specific lowering |
| 155, 268 | sub_9E5260 | -- | Paired operations |
| 156 | sub_9D94B0 | -- | Specific lowering |
| 158, 167 | sub_9E4A00 | -- | Paired operations |
| 161 | sub_9D21D0 | -- | Specific lowering |
| 162 | sub_9D9660 | -- | Specific lowering |
| 166 | sub_9E2100 | -- | Specific lowering |
| 170 | sub_9E2DF0 | -- | Specific lowering |
| 173, 267 | sub_9EB5C0 | -- | Paired operations |
| 174 | sub_9D9300 | -- | Specific lowering |
| 184 | sub_9D2E70 | -- | Specific lowering |
| 185 | sub_9E32F0 | -- | Specific lowering |
| 188, 190 | sub_9E2970 | -- | Paired operations |
| 195 | sub_9D2AB0 | -- | Specific lowering |
| 196 | sub_9D9080 | -- | Specific lowering |
| 198 | sub_9D66F0 | -- | Specific lowering |
| 201, 202, 204, 285 | sub_9EAC30 | -- | Async/bulk group |
| 203 | sub_9D8E90 | -- | Specific lowering |
| 205 | sub_9E1260 | -- | Specific lowering |
| 209 | sub_9E5740 | -- | Specific lowering |
| 210, 213, 214 | sub_9D8B30 | -- | Grouped operations |
| 240 | sub_9D6280 | -- | Specific lowering |
| 241 | sub_9E2CC0 | -- | Specific lowering |
| 247 | sub_9D0F70 | -- | Specific lowering |
| 248 | sub_9D0DF0 | -- | Specific lowering |
| 262 | sub_9E7440 | -- | Specific lowering |
| 264 | sub_9D73F0 | -- | Specific lowering |
| 276 | sub_9D5EC0 | -- | Specific lowering |
| 292 | sub_9D0E90 | -- | Specific lowering |
Vtable-indirect dispatch (for architecture-extensible operations):
| Opcode(s) | Vtable offset | Category (inferred) |
|---|---|---|
| 2, 3, 4, 5, 7 | vtable[0] (+0) | Generic fallback |
| 14, 39, 40, 105, 125, 299, 300, 321 | vtable[7] (+56) | Group A operations |
| 18 | vtable[3] (+24) | Specific class |
| 31 | vtable[4] (+32) | Specific class |
| 35 | vtable[6] (+48) | Specific class |
| 36 | vtable[21] (+168) | Specific class |
| 43 | vtable[9] (+72) | Specific class |
| 50 | vtable[12] (+96) | Specific class |
| 65 | vtable[22] (+176) | Specific class |
| 73 | vtable[15] (+120) | Specific class |
| 74 | vtable[16] (+128) | Specific class |
| 81 | vtable[24] (+192) | Specific class |
| 110, 111, 112, 114 | vtable[25] (+200) | Warp shuffle group |
| 118 | vtable[10] (+80) | Specific class |
| 119 | vtable[28] (+224) | Specific class |
| 120, 121, 126, 127, 128, 280, 281 | vtable[27] (+216) | Barrier/sync group |
| 122, 123, 310, 311, 312 | vtable[26] (+208) | Related group |
| 130 (HSET2), 169 | vtable[29] (+232) | Move/convert group (130 is MOV-like internally; actual SASS MOV = 19) |
| 157 | vtable[84] (+672) | Specific class |
| 176, 177 | vtable[34] (+272) | Paired operations |
| 183, 288 | vtable[36] (+288) | Paired operations |
| 186 | vtable[35] (+280) | Specific class |
| 211 | vtable[39] (+312) | Specific class |
| 220 | vtable[40] (+320) | Specific class |
| 223, 238 | vtable[41] (+328) | Paired operations |
| 228 | vtable[42] (+336) | Specific class |
| 243 | vtable[43] (+344) | Specific class |
| 245--253, 257 | vtable[67--77] (+536--+624) | SM 100+ operations |
| 265, 266 | vtable[93] (+744) | Paired operations |
| 270 | vtable[77] (+616) | Specific class |
| 277 | vtable[65] or vtable[11] (+520/+88) | Operand-type dependent |
| 279--351 | various high vtable offsets | SM 100+ / Blackwell operations |
The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
Per-Category Handlers
The larger handlers implement non-trivial conversion logic:
| Handler | Size | Category | Key behavior |
|---|---|---|---|
| sub_9E6600 | 25 KB | Instruction expansion | Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls sub_9D4380 (instruction builder) ~10 times per expansion. |
| sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: (v >> 28) & 7 == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
| sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
| sub_9D80E0 | 17 KB | Memory legalization (variant) | Same opcode set as sub_9D76D0, alternate code path for different operand patterns. |
| sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
| sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
| sub_9DA6B0 | -- | Generic ALU group | Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242). |
1:1 vs 1:N Expansion
Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:
PTX Ori IR (after Phase 1)
----------------------------------- -----------------------------------
add.f32 %r1, %r2, %r3 --> FADD R1, R2, R3 [1:1]
add.s32 %r4, %r5, %r6 --> IADD3 R4, R5, R6, RZ [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3 --> IMAD.LO R1, R2, R6, RZ [1:N split]
IMAD.HI R0, R2, R6, RZ
IMAD R0, R3, R6, R0
IMAD R0, R2, R7, R0
div.f32 %r7, %r8, %r9 --> MUFU.RCP R10, R9 [1:N, Newton-Raphson]
FMUL R7, R8, R10
(+ correction iterations)
bar.sync 0 --> BAR [1:1]
The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
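The 64-bit split performed for add.s64 can be sketched in plain C. This is an illustrative model of the hi/lo decomposition only; the struct layout, function name, and carry computation are assumptions, not recovered code (in SASS the carry flows through the IADD3 / IADD3.X pair):

```c
#include <stdint.h>

/* Sketch: 64-bit add lowered onto a 32-bit ALU, as sub_9E6600 does.
 * Low halves add with carry-out; high halves add with carry-in. */
typedef struct { uint32_t lo, hi; } u64_pair;

static u64_pair add64_split(u64_pair a, u64_pair b) {
    u64_pair r;
    r.lo = a.lo + b.lo;               /* IADD3   Rlo, Alo, Blo, RZ */
    uint32_t carry = r.lo < a.lo;     /* carry-out of the low add  */
    r.hi = a.hi + b.hi + carry;       /* IADD3.X Rhi, Ahi, Bhi, RZ */
    return r;
}
```

The same shape generalizes to the mul.lo.s64 expansion above, where partial products replace the two adds.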
Type-Dependent Opcode Selection
PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:
| PTX type | SASS prefix | Example PTX | Example SASS |
|---|---|---|---|
| .f16 / .f16x2 | H | add.f16 | HADD2 |
| .f32 | F | add.f32 | FADD |
| .f64 | D | add.f64 | DADD |
| .s32 / .u32 | I | add.s32 | IADD3 |
| .s64 / .u64 | I (split) | add.s64 | IADD3 + IADD3.X (carry chain) |
| .pred | P | setp.eq.f32 | FSETP |
The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
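The mapping in the table can be summarized as a lookup. ptxas implements this through its opcode tables rather than string matching, so the following is a readable approximation only; the function name is hypothetical:

```c
#include <string.h>

/* Sketch of the type-qualifier-to-mnemonic-prefix mapping. */
static char sass_prefix(const char *ptx_type) {
    if (!strcmp(ptx_type, ".f16") || !strcmp(ptx_type, ".f16x2")) return 'H';
    if (!strcmp(ptx_type, ".f32")) return 'F';
    if (!strcmp(ptx_type, ".f64")) return 'D';
    if (!strcmp(ptx_type, ".s32") || !strcmp(ptx_type, ".u32") ||
        !strcmp(ptx_type, ".s64") || !strcmp(ptx_type, ".u64")) return 'I';
    if (!strcmp(ptx_type, ".pred")) return 'P';
    return '?';   /* type not covered by the table above */
}
```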
SM-Dependent Legalization
The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:
- 64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
- FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by Phase 2 PromoteFP16)
- bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
- Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05
The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has approximately 90 method slots (up to offset +720), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.
Phase 2: Ori-to-SASS Selection & Encoding
Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.
ISel Driver -- sub_B285D0 (9 KB)
The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:
// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
int sm = *(context + 184); // SM version
int opcode = instruction[18] & 0xFFFFCFFF;
// Select architecture-variant builder
if (sm == 14)
Builder_VariantA(context, instruction); // sub_B1FA20 (13 KB)
else
Builder_VariantB(context, instruction); // sub_B20E00 (11 KB)
// Apply post-ISel modifiers
ApplyModifiers(context, instruction); // sub_B1D670 (13 KB)
SetProperties(context, instruction); // sub_B241A0 (7 KB)
}
The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.
ISel Mega-Selector -- sub_C0EB10 (185 KB)
The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).
Dual-Switch SM-Generation Dispatch
The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:
// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {
int64_t *vtable = *(a3->backend);
int opcode = *(int *)(a2 + 8); // SASS opcode type
// Pre-dispatch: capability check via vtable[12]
auto cap_check = vtable[12]; // offset +96
if (cap_check != sub_BFEAA0) // default stub?
if (cap_check(a3, a2))
ctx->flags[256] = 1; // set encoding flag
// Read opcode translator from vtable[2]
auto translator = vtable[2]; // offset +16
if (translator != sub_BFEBF0) {
// PATH A: SM-specific translation
int encoding_index = translator(a3, opcode);
int isel_opcode = *(ctx + 8); // post-translation opcode
switch (isel_opcode) { // PRIMARY SWITCH (169 cases)
case 7: case 34: case 35: case 36:
emit_simple(encoding_index, ...);
break;
case 8: case 38: case 46: ...
/* already encoded */ break;
// ... 169 cases total ...
default: goto high_opcode_path;
}
} else {
// PATH B: static table lookup (default backend)
int encoding_index = 355; // sentinel for extended opcodes
if (opcode <= 0xDD)
encoding_index = word_22B4B60[opcode];
switch (opcode) { // FALLBACK SWITCH (same 169 cases)
case 7: ...: goto handler_7; // jumps into Path A handlers
// ... identical case set ...
default: return;
}
}
high_opcode_path:
if (opcode > 0x199) return;
// Try vtable[3] extension dispatch for SM 100+ / Blackwell
auto extension = vtable[3]; // offset +24
if (extension != sub_BFEA30)
extension(a3, a2); // arch-extension handler
}
The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This doubles the size of the dispatch code but avoids a conditional merge point at every case entry.
Three Vtable Dispatch Points
| Vtable slot | Offset | Default stub | Purpose |
|---|---|---|---|
| vtable[2] | +16 | sub_BFEBF0 | Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: word_22B4B60[] static table. |
| vtable[12] | +96 | sub_BFEAA0 | Pre-dispatch capability check. Returns boolean that sets ctx[256] encoding flag. |
| vtable[3] | +24 | sub_BFEA30 | Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, opcodes > 0x199, SM 100+ extensions). |
The word_22B4B60 static table is a uint16[] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index. Opcodes > 221 receive the sentinel value 355. This provides the default encoding mapping; SM-specific vtable overrides can remap any opcode to a different encoding index, enabling per-architecture instruction variants without modifying the mega-selector logic.
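The fallback lookup reduces to a bounds-checked table read with a sentinel. The sketch below uses placeholder table contents (the real word_22B4B60 values are not reproduced here):

```c
#include <stdint.h>

enum { MAX_TABLED_OPCODE = 0xDD,       /* 221: last entry in word_22B4B60 */
       EXTENDED_SENTINEL = 355 };      /* extended / SM 100+ opcodes      */

/* Placeholder contents standing in for word_22B4B60[]. */
static const uint16_t demo_table[MAX_TABLED_OPCODE + 1] =
    { [7] = 12, [34] = 12, [221] = 90 };

static uint16_t encoding_index(const uint16_t *table, uint32_t opcode) {
    if (opcode <= MAX_TABLED_OPCODE)
        return table[opcode];          /* word_22B4B60[opcode]  */
    return EXTENDED_SENTINEL;          /* routed to vtable[3]   */
}
```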
Opcode Case Routing
The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:
| Group | Opcodes | Handler pattern | Instruction family |
|---|---|---|---|
| No-op passthrough | 8, 38, 46, 87, 89, 90, 93, 97, 98, 208 | goto LABEL_33 (already encoded) | Pre-encoded by upstream ISel |
| Simple emission | 7, 34, 35, 36 | sub_9314F0(encoding_index, 1 operand) | Basic ALU / simple 1-op |
| Branch/call | 9, 10, 11, 12, 13, 22 | sub_926370 / vtable[17] / linked-list walk | Control flow, call frames |
| Memory load/store | 15, 16, 18, 19, 20, 23, 24, 25, 26, 30 | sub_C01840 + address helpers | LDG, STG, LDS, etc. |
| Control flow | 31, 32, 33 | SSA phi nodes, branch tables | Phi, switch, call return |
| Generic ALU | 39, 41, 42, 50, 51, 52, 53 | sub_9314F0 passthrough | Standard arithmetic |
| Special register | 43, 44, 45 | sub_C06E90 symbol lookup | SR access, shared memory alias |
| Constant/predicate | 47, 54, 55, 56 | Direct operand copy / sub_BFFD60 | Constant bank, predicate ops |
| Address compute | 57 | 200-line handler, "__nv_reservedSMEM_offset_0_alias" | Complex addressing with SMEM |
| Immediate ops | 59, 60 | sub_C05CC0 / sub_C07690 | Immediate-operand variants |
| Barrier/sync | 61, 62, 63, 221 | Forward to vtable[3] extension | BAR, MEMBAR, SYNC |
| Conversion/move | 65 | Operand loop with per-element sub_9314F0 | MOV, CVT |
| Texture/surface | 67, 68, 69, 70 | Multi-operand type-qualified encoding | TEX, TLD, TXQ |
| Intrinsics | 71, 74, 75 | Loop-based operand emission | Hardware intrinsics |
| Tensor core | 84, 88, 91, 92 | Wide-operand encoding (case 92 = 354 lines) | HMMA, DMMA, IMMA, TCGen05 |
| Predication ext | 94, 95 | Predicate-dependent path selection | Extended predication |
| Memory extended | 99--130 (19 opcodes) | sub_C0B2C0 or sub_BFFD60 + encoding lookup | Extended memory ops |
| Warp intrinsics | 131--189 (50+ opcodes) | Mixed handlers, vtable[198]+632 dispatch | SHFL, VOTE, MATCH, REDUX |
| Async/bulk | 192--218 (15 opcodes) | sub_C0B2C0 / individual handlers | TMA, async copy, bulk ops |
The largest case handlers:
- Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
- Case 92: ~354 lines (tensor core instructions -- widest operand format)
- Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)
Operand Encoding Protocol
The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):
| Bits | Field | Description |
|---|---|---|
| [31:28] of word 0 | Type tag | 0x1=register, 0x4=constant bank, 0x5=immediate, 0x6=control/modifier, 0x9=special register |
| [23:0] of word 0 | Value | Register index, immediate value, or bank offset |
| word 1 | Flags | Modifier bits, encoding-format flags |
The marshalling pipeline for a typical case:
1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
-> Iterates source operands, writes tagged words to output_buf
-> Returns: number of operand words written
2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
-> Same for destination operands
3. Encoding-index lookup:
if (vtable[2] != default)
index = vtable[2](ctx, opcode);
else
index = word_22B4B60[opcode];
4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
-> Emits the instruction record to the output stream
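The tagged-word format itself is straightforward to model. A minimal sketch, with bits [27:24] of word 0 left unmodeled since their use is not documented above (names are illustrative):

```c
#include <stdint.h>

/* Tagged operand word: type tag in bits [31:28], value in [23:0]
 * of word 0; word 1 carries modifier/encoding flags. */
typedef struct { uint32_t word0, word1; } operand_word;

static operand_word pack_operand(uint32_t tag, uint32_t value, uint32_t flags) {
    operand_word w;
    w.word0 = (tag << 28) | (value & 0x00FFFFFF);
    w.word1 = flags;
    return w;
}

static uint32_t operand_tag(operand_word w)   { return w.word0 >> 28; }
static uint32_t operand_value(operand_word w) { return w.word0 & 0x00FFFFFF; }
```

A register operand R13 would thus be packed as tag 0x1 with value 13; an immediate as tag 0x5 with the literal in the low 24 bits.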
| Helper | Calls | Purpose |
|---|---|---|
| sub_C01840 | 52 | Marshal source operands into tagged-word buffer |
| sub_9314F0 | 31 | Emit instruction with encoding index + operand buffer |
| sub_C00EA0 | 8 | Extract single operand as tagged word |
| sub_91D160 | 8 | Encode register index to encoding bits |
| sub_934630 | 6 | Build new instruction node in IR (for multi-instruction expansion) |
| sub_91D150 | 5 | Decode register index from operand word |
| sub_926370 | 4 | Emit simple instruction (branch/jump) |
| sub_C01F50 | 3 | Marshal destination operands |
| sub_7D6860 | 3 | Encode data type qualifier (FP32/FP64/INT) |
| sub_BFEF10 | 3 | Register bank capacity check / grow |
| sub_92E1B0 | 2 | Emit instruction with constant-bank operand |
Cross-Reference: Arch Dispatch Tables
The 4 arch dispatch tables (sub_B128E0--sub_B12920) are not called from the mega-selector. They operate at the Mercury encoder level:
Mega-selector (sub_C0EB10)
-> Produces (encoding_index, operand_buffer) pairs
-> Calls sub_9314F0 to package into instruction nodes
Mercury encoder (sub_6D9690)
-> Reads instruction type field from instruction node
-> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
-> Encoder emits binary SASS using format + operand data
The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.
Post-ISel Modifiers -- sub_B1D670 (13 KB)
After the main ISel selection, this pass applies architecture-specific instruction modifications:
- Opcode 13: sets instruction field [79] = 3
- Opcode 14: sets instruction field [79] = 2
- Opcode 11: separate modifier path
The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.
Instruction Properties -- sub_B241A0 (7 KB)
Sets scheduling-relevant properties on the selected instruction:
- inst[74] = 7 -- scheduling class
- inst[75] = (opcode == 325) -- special flag for specific opcode
- inst[77] = sub_A3B930(...) -- operand class from register resolver
- inst[79] -- derived from a2[19], architecture-dependent
Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.
DAG Pattern Matchers -- ~800 Functions at 0xB28F60--0xB7D000
Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.
Pattern Matcher Architecture
The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.
Function signature (all 801+ matchers):
char __fastcall match(
int64_t ctx, // a1: ISel context (passed through to field reader)
int64_t dag_node, // a2: pointer to the Ori IR instruction node
int32_t *template_id, // a3: OUT: encoding template index [1..152]
int32_t *priority // a4: IN/OUT: current best priority; written only if better
);
The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher only upgrades the result if its specificity exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.
Matching pipeline (invariant across all 801 matchers):
1. OPCODE PROPERTY CHECKS sub_10AE5C0(ctx, node, field_id)
Check 1-12 instruction properties against expected values.
Any mismatch -> return immediately (early exit).
2. SOURCE OPERAND COUNT sub_B28F50(node) -> source_count
Verify the instruction has the expected number of source operands.
3. SOURCE OPERAND VALIDATION sub_B28F30(node, i) -> operand_record
For each source operand:
a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
b. Register class: class == 1023 (wildcard) OR class == specific_value
4. RESULT OPERAND COUNT sub_B28F40(node) -> result_count
Verify the expected number of result (destination) operands.
5. RESULT OPERAND VALIDATION sub_B28F30(node, first_result + j)
Same type + register-class checks as for source operands.
First-result index = sub_B28E00(*(node + 92)).
6. PRIORITY WRITE if (*a4 <= N) { *a4 = N+1; *a3 = template; }
Conditional update: only overwrite if this pattern is more specific
than whatever was already matched.
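The contest protocol around step 6 can be demonstrated end-to-end. The sketch below uses two toy matchers standing in for the 801 recovered functions; the dispatcher loop and matcher bodies are illustrative, but the conditional-write idiom mirrors the decompiled code:

```c
#include <stdint.h>

typedef char (*matcher_fn)(int64_t ctx, int64_t node,
                           int32_t *template_id, int32_t *priority);

/* Dispatcher: initialize *a4 = 0, try every candidate, keep the best. */
static int32_t run_contest(int64_t ctx, int64_t node,
                           matcher_fn *matchers, int n) {
    int32_t template_id = 0, priority = 0;
    for (int i = 0; i < n; i++)
        matchers[i](ctx, node, &template_id, &priority);
    return template_id;                 /* winning pattern's template */
}

/* Toy specificity ladder: both match, the more specific one wins. */
static char generic_match(int64_t c, int64_t n, int32_t *t, int32_t *p) {
    (void)c; (void)n;
    if (*p <= 8)  { *p = 9;  *t = 12; return 1; }   /* priority 9  */
    return 0;
}
static char specific_match(int64_t c, int64_t n, int32_t *t, int32_t *p) {
    (void)c; (void)n;
    if (*p <= 16) { *p = 17; *t = 13; return 1; }   /* priority 17 */
    return 0;
}
```

Because each matcher only upgrades the pair, the result is order-independent: the priority-17 matcher wins whether it runs first or last.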
Match-Score Priority System
The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.
| Priority range | Count | Interpretation |
|---|---|---|
| 2--5 | 31 | Fallback / generic patterns (few constraints) |
| 6--10 | 253 | Common patterns (3--6 constraints) |
| 11--15 | 293 | Standard patterns (5--8 constraints) |
| 16--20 | 168 | Specific patterns (6--10 constraints) |
| 21--34 | 56 | Highly specific patterns (8--12+ constraints) |
Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.
Dispatcher Mechanism
The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:
sub_B285D0 (ISel driver, 9 KB)
-> opcode switch on (instruction[18] & 0xFFFFCFFF)
-> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
-> builder invokes vtable method on instruction descriptor
-> vtable slot contains pointer to one of the 801 pattern matchers
-> matcher writes (template_id, priority) if pattern matches
The vtable dispatch occurs at various offsets including +2600, +2616, +2656, and +2896 (observed in sub_13AF3D0, the 137 KB ISel pattern coordinator). The matchers have no static callers -- they appear exclusively through indirect function pointer invocation, which is why the sweep reports them as "no callers in function DB."
For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.
DAG Node Property Accessor -- sub_10AE5C0
The field reader is the most-called function in the matcher range (typically 2--12 calls per matcher, so 3,000--8,000 total invocations across all 801 matchers):
// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
if (sub_10E32E0(node, field_id)) // field exists in descriptor?
return sub_10D5E60(node, field_id); // read value from property table
else
return 0xFFFFFFFF; // sentinel: field not present
}
The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
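A sparse property bag with an absent-field sentinel behaves as follows. The linear-scan store here is a stand-in; the actual structure behind sub_10E32E0 / sub_10D5E60 has not been characterized:

```c
#include <stdint.h>

#define FIELD_ABSENT 0xFFFFFFFFu        /* sentinel returned by sub_10AE5C0 */

typedef struct { uint32_t id, value; } prop_entry;
typedef struct { prop_entry *entries; int count; } prop_bag;

/* Sketch of DAGNode_ReadField semantics: look up a logical field ID,
 * return the sentinel if the descriptor does not carry that field. */
static uint32_t read_field(const prop_bag *bag, uint32_t field_id) {
    for (int i = 0; i < bag->count; i++)
        if (bag->entries[i].id == field_id)
            return bag->entries[i].value;
    return FIELD_ABSENT;
}
```

Matchers can therefore test fields unconditionally: an instruction class that lacks field 397 simply fails the field 397 == 2115 gate via the sentinel.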
The companion write functions follow the same field-ID namespace:
// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);
// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);
Inferred semantic groupings for field IDs (from cross-referencing matcher patterns):
| Field range | Likely semantics |
|---|---|
| 5--7 | Opcode class / major instruction group |
| 88 | Sub-operation modifier |
| 105 | Operation variant selector |
| 126 | Data type qualifier (e.g., field 126 in {547,548}) |
| 163 | Addressing mode / operand encoding class |
| 190--211 | Encoding format selectors |
| 220 | Specific encoding property |
| 242 | Width/size qualifier |
| 294 | Generic constraint field |
| 327 | Register format descriptor |
| 345 | Rounding / saturation mode |
| 348 | Precision qualifier |
| 355--429 | Extended instruction properties |
| 397 | Instruction validity flag (value 2115 appears as a near-universal gate) |
| 480 | High opcode range (Blackwell/SM 100+ instructions) |
| 595 | Highest observed field ID |
Field 397 with value 2115 appears in the majority of matchers as a mandatory check, suggesting it encodes a "this instruction is encoding-compatible" or "instruction is valid for ISel" flag.
Operand Record Layout
Each operand is a 32-byte record accessed by index via sub_B28F30:
// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
return *(int64_t*)(node + 32) + 32LL * index;
}
The 32-byte operand record:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 1 | type_tag | Operand kind (see predicate table below) |
| +4 | 4 | primary_class | Register class ID; 1023 = wildcard (any class) |
| +14 | 1 | modifier_a | Written by sub_B28F10 |
| +15 | 1 | modifier_b | Written by sub_B28F20 |
| +20 | 4 | secondary_class | Fallback register class constraint |
Source operand count is stored at node + 92 and doubles as the first-result-operand index:
uint32_t source_count = *(uint32_t*)(node + 92); // sub_B28F50
uint32_t result_count = *(uint32_t*)(node + 40) + 1 - source_count; // sub_B28F40
Operand Type Predicates
Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:
| Address | Name | Test | Semantics |
|---|---|---|---|
| sub_B28E20 | isImmediate | tag == 1 | Constant / immediate literal |
| sub_B28E10 | isGPR | tag == 2 | General-purpose register |
| sub_B28E80 | isPredicate | tag == 3 | Predicate register |
| sub_B28E70 | isType4 | tag == 4 | (specific operand class) |
| sub_B28E60 | isType5 | tag == 5 | (specific operand class) |
| sub_B28E30 | isSpecialReg | tag == 6 | Special register |
| sub_B28ED0 | isType7 | tag == 7 | (specific operand class) |
| sub_B28EF0 | isType8 | tag == 8 | (specific operand class) |
| sub_B28E50 | isType9 | tag == 9 | (specific operand class) |
| sub_B28E40 | isValidReg | tag == 10 | Generic valid register |
| sub_B28EE0 | isType11 | tag == 11 | (specific operand class) |
| sub_B28EA0 | isType13 | tag == 13 | (specific operand class) |
| sub_B28EB0 | isType14 | tag == 14 | (specific operand class) |
| sub_B28E90 | isUniformReg | tag == 15 | Uniform register (SM 75+) |
| sub_B28EC0 | isType16 | tag == 16 | (specific operand class) |
Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.
Register Class Constraint Protocol
Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:
// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);
if (sub_B28E00(primary) == 1023) {
// Wildcard -- operand is unconstrained, accept it
} else {
uint32_t cls = sub_B28E00(secondary);
if (cls != expected_class) return; // mismatch
}
sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.
Observed register class values in matchers:
| Class | Frequency | Likely meaning |
|---|---|---|
| 1023 | ubiquitous | Wildcard (any register class) |
| 1 | very common | 32-bit GPR (R0..R255) |
| 2 | common | 64-bit GPR pair |
| 3 | occasional | 128-bit GPR quad |
| 4 | occasional | Predicate or special register file |
| 5 | rare | Extended register class |
Representative Matcher Walkthroughs
sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):
1. field 480 == 2481 -> opcode/subclass check
2. source_count == 2 -> expects 2 source operands
3. operand[0].type == 1 (immediate) -> first source is a constant
4. operand[1].type == 2 (GPR) -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4 -> expects 4 result operands
7. result[0].type == 2 (GPR) -> first result is GPR
result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15 -> predicate or uniform register
9. result[2].type == 2 (GPR) -> third result is GPR
result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12
sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):
1. field 7 == 21 -> major opcode class
2. field 163 in {705, 706} -> addressing mode variant
3. field 203 in {1113..1117} -> encoding format (5 values)
4. field 105 == 477 -> operation variant
5. field 88 == 408 -> sub-operation modifier
6. field 345 == 1903 -> rounding/saturation mode
7. source_count == 2 -> 2 sources
8. operand[0].type == 1 (immediate) -> constant source
9. operand[1].type == 2 (GPR) -> register source
operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5 -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22
sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):
1. field 5 == 12 -> opcode class 12
2. field 220 == 1206 -> encoding property
3. field 595 in {2937, 2938} -> extended field (high range)
4. field 294 == 1493 -> constraint
5. field 242 in {1281, 1282} -> width qualifier
6. field 355 == 1943 -> extended property
7. field 376 == 2035 -> extended property
8. field 377 in {2037..2041} -> extended property (5 values)
9. field 429 in {2252, 2253} -> extended qualifier
10. field 126 in {547, 548} -> data type
11. field 397 == 2115 -> validity gate
12. source_count == 0 -> no source operands
13. result_count == 7 -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)
This pattern has the most field checks (12) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction like a tensor-core operation.
sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):
1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915} -> precision qualifier
4. field 397 == 2115 -> validity gate
5. source_count == 0 -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42
The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.
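Putting the protocol together, a matcher of this minimal shape (modeled on the sub_B28FE0 walkthrough) looks roughly like the following. The node struct and field names are stand-ins for the sparse property reads in the real code:

```c
#include <stdint.h>

/* Toy node carrying the four properties that sub_B28FE0 checks. */
typedef struct {
    uint32_t f211, f201, f348, f397;
    uint32_t source_count;
} toy_node;

/* Sketch of the minimal matcher: property checks, source-count check,
 * then the conditional (template, priority) write. */
static char match_minimal(const toy_node *n, int32_t *template_id,
                          int32_t *priority) {
    if (n->f211 != 1182) return 0;
    if (n->f201 != 1109) return 0;
    if (n->f348 != 1912 && n->f348 != 1915) return 0;  /* precision qualifier */
    if (n->f397 != 2115) return 0;                     /* validity gate       */
    if (n->source_count != 0) return 0;                /* no source operands  */
    if (*priority <= 7) { *priority = 8; *template_id = 42; return 1; }
    return 0;                                          /* lost the contest    */
}
```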
Helper Function Summary
| Address | Name | Signature | Purpose |
|---|---|---|---|
| sub_10AE5C0 | DAGNode_ReadField | (ctx, node, field_id) -> value | Read instruction property by ID; returns 0xFFFFFFFF if absent |
| sub_10AE590 | DAGNode_WriteField | (ctx, node, field_id, value) | Write single instruction property |
| sub_10AE640 | DAGNode_WriteFields | (ctx, node, f1, v1, v2) | Multi-field atomic update |
| sub_B28F30 | GetOperand | (node, index) -> operand_ptr | Index into operand array (32-byte records at *(node+32)) |
| sub_B28F40 | GetResultCount | (node) -> count | Number of result operands: node[40] + 1 - node[92] |
| sub_B28F50 | GetSourceCount | (node) -> count | Number of source operands: *(node+92) |
| sub_B28E00 | DecodeRegClass | (packed) -> class_id | Identity function (class stored as plain int) |
| sub_B28F00 | DecodeRegClass2 | (packed) -> class_id | Second identity accessor (same semantics) |
| sub_B28F10 | SetModifierA | (operand, value) | Write operand modifier at offset +14 |
| sub_B28F20 | SetModifierB | (operand, value) | Write operand modifier at offset +15 |
| sub_B28E10 | isGPR | (tag) -> bool | tag == 2 |
| sub_B28E20 | isImmediate | (tag) -> bool | tag == 1 |
| sub_B28E30 | isSpecialReg | (tag) -> bool | tag == 6 |
| sub_B28E40 | isValidReg | (tag) -> bool | tag == 10 |
| sub_B28E50 | isType9 | (tag) -> bool | tag == 9 |
| sub_B28E60 | isType5 | (tag) -> bool | tag == 5 |
| sub_B28E70 | isType4 | (tag) -> bool | tag == 4 |
| sub_B28E80 | isPredicate | (tag) -> bool | tag == 3 |
| sub_B28E90 | isUniformReg | (tag) -> bool | tag == 15 |
| sub_B28EA0 | isType13 | (tag) -> bool | tag == 13 |
| sub_B28EB0 | isType14 | (tag) -> bool | tag == 14 |
| sub_B28EC0 | isType16 | (tag) -> bool | tag == 16 |
| sub_B28ED0 | isType7 | (tag) -> bool | tag == 7 |
| sub_B28EE0 | isType11 | (tag) -> bool | tag == 11 |
| sub_B28EF0 | isType8 | (tag) -> bool | tag == 8 |
SM120 Pattern Coordinator -- sub_13AF3D0 (137 KB)
The largest ISel function in the binary (137 KB, 4,225 lines, 570+ locals). It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector sub_C0EB10. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.
Position in the ISel Pipeline
sub_B285D0 (ISel driver, 9 KB)
-> selects builder variant by SM version
-> Builder variant vtable dispatch
|
+-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
| opcodes 7..221, dual-switch, word_22B4B60 encoding table
|
+-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
-> sub_13AF3D0 (pattern coordinator, 137 KB)
opcodes 2..352, single switch, inline operand emission
The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.
Dispatch Structure
The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.
// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
pattern_table *a4, flags a5, int a6) {
int opcode = *(DWORD*)(a2 + 72);
BYTE1(opcode) &= 0xCF;
// Pre-dispatch: predication check when bit 0x1000 is set
if ((*(a2+72) & 0x1000) && opcode != 169) {
if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
}
// Setup output context
vtable[104/8](output_ctx, a1, &context_ref);
switch (opcode) {
case 2: case 4: case 7: // FMA/MAD 2-source
operand_span = 16; src_count = 2;
goto SHARED_FMA_HANDLER;
case 3: case 5: // FMA/MAD 3-source
operand_span = 24; src_count = 3;
goto SHARED_FMA_HANDLER;
case 6: // IMAD/IADD3 with 3+ sources
EmitOperand(ctx, instr, 3, out);
EmitOperand(ctx, instr, 4, out);
EmitOperand(ctx, instr, 5, out);
break;
case 8: // Pure vtable dispatch (vtable+2328)
vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
break;
case 10: case 11: case 151: case 152: case 290: case 291:
vtable[2768/8](a1, a2, a3, a4, a5); // Memory load/store
break;
case 16: // Texture/surface (163-line handler)
for (i = first_src; i < last_src; i++)
EmitOperand(ctx, instr, i, out); // loop up to 15 operands
break;
// ... 120 more cases ...
case 329: // Variable-count loop + vtable+2328
for (i = 0; i < src_count; i++)
EmitOperand(ctx, instr, i, out);
vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
break;
default:
break; // no-op passthrough
}
}
Opcode Case Routing
The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:
| Opcodes | Handler pattern | Instruction family |
|---|---|---|
| 2, 3, 4, 5, 7 | Shared FMA handler with operand-span parametrization | FMA/MAD variants (32/64-bit) |
| 6 | Inline 3-source emission + optional operands 6/7 | IMAD/IADD3 wide |
| 8 | Pure vtable+2328 delegation | Builder-only instructions |
| 10, 11, 151, 152, 290, 291 | vtable+2768 delegation | Memory load/store |
| 16 | 163-line operand loop (up to 15 sources) | Texture/surface |
| 20, 21 | vtable+2680/2688 with stub check | Memory/store alternates |
| 22, 77, 83, 297, 352 | vtable+2744 with nullsub_463 check | Control flow |
| 24, 34, 209, 213, 214 | Passthrough: emit src 1 + dst 2 | Simple 2-operand ALU |
| 29, 95, 96, 190 | Conditional operand-6 check | Predicate-source instructions |
| 38, 59, 106, 180, 182, 192, 194, 215, 221 | Single EmitOperand(1) at high SM | Generic ALU |
| 42, 53, 55 | EmitOperand(1) | Paired ALU |
| 60, 61, 62, 63, 64 | Comparison / inner sub-opcode switch (case 61: 5 sub-cases) | Compare / set-predicate |
| 88, 89 | Variable source count (2 or 3) with sign-dependent offsets | Extended FMA |
| 110, 111, 114, 115, 117 | Warp operand emission | Warp shuffle / vote |
| 120, 121, 126, 127 | Barrier handler with operand loop at LABEL_53 | Barrier / sync |
| 139, 140, 141, 143 | sub_13A4DA0 for commutative operand selection | Commutative ALU |
| 183 | Extended memory with register-class-6 check | Wide memory |
| 201, 202, 204 | vtable+2328 delegation | Async / bulk operations |
| 270, 279, 282, 285, 325--328 | Goto LABEL_53 (barrier/sync shared handler) | Extended memory / warp |
| 280, 281 | vtable+2896 with nullsub_239 check, then LABEL_53 | Sync instructions |
| 329 | Variable-count operand loop + vtable+2328 | Variable-width encoding |
Three Competing-Match Selection Mechanisms
The coordinator selects among competing pattern matchers through three mechanisms:
1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:
LABEL_750:
replacement = vtable[16/8](output_ctx, instruction);
*output = replacement;
return;
This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.
2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.
// sub_13A4DA0 -- simplified
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
if (!pattern_table->active) return default_slot;
uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
for (i = 0; i < pattern_table->count; i++) { // up to 4 entries
if (operand_desc matches pattern_table->entry[i])
{ *out_match = entry[i].preferred; return default_slot; }
}
// Repeat with alt_slot if no match on default_slot
operand_desc = GetOperandDescriptor(instruction, alt_slot);
for (i = 0; i < pattern_table->count; i++) {
if (operand_desc matches pattern_table->entry[i])
{ *out_match = entry[i].preferred; return alt_slot; }
}
return default_slot;
}
3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:
| Vtable offset | Default stub | Purpose |
|---|---|---|
| +2680 | sub_A8CBE0 | Memory operation alternate matcher |
| +2688 | sub_A8CBF0 | Store operation alternate matcher |
| +2744 | nullsub_463 | Control flow alternate |
| +2632 | nullsub_233 | Move/convert alternate |
| +2760 | nullsub_235 | Atomic/barrier alternate |
| +2896 | nullsub_239 | Sync instruction alternate |
| +3232 | sub_868720 | Pre-dispatch predication alternate |
| +3112 | sub_A8CCA0 | MADC alternate (case 36) |
When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
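The check itself is just a pointer comparison against the stub address. A minimal model of the pattern (function names hypothetical; the real code compares raw vtable slot values against the stub addresses in the table above):

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct instr instr;                     /* opaque Ori IR instruction */
typedef bool (*alt_matcher_fn)(void *ctx, instr *in);

/* Stands in for nullsub_463 / sub_868720 etc.: the default do-nothing slot. */
static bool null_stub(void *ctx, instr *in)    { (void)ctx; (void)in; return false; }
static bool always_match(void *ctx, instr *in) { (void)ctx; (void)in; return true;  }

/* Coordinator pattern: if the slot still holds the default stub, skip the
   call and fall through to inline emission; otherwise delegate. */
static bool try_override(alt_matcher_fn slot, void *ctx, instr *in) {
    if (slot == null_stub)
        return false;
    return slot(ctx, in);
}
```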
Primary Callee: sub_13A6280 (239 lines)
The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:
- Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index] and checks the register class at descriptor offset +64.
- Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
- Other types: pass through to the vtable dispatch chain.
The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).
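The tag extraction the emitter performs on each tagged operand word can be sketched as follows (bit placement [31:28] is from the recovered operand word format; the class-6 fast path mirrors the description above):

```c
#include <stdint.h>

/* Operand type tag lives in bits [31:28] of a tagged operand word. */
static unsigned operand_tag(uint32_t word) {
    return (word >> 28) & 0xF;
}

/* sub_13A6280's fast path: register class 6 (UB/dead) is skipped
   without emitting anything. */
static int is_dead_register_class(unsigned reg_class) {
    return reg_class == 6;
}
```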
Function Map Additions
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x operand emission) | HIGH |
sub_A29220 | 435 lines | Instruction iterator / coordinator caller (per-instruction walk) | HIGH |
sub_13A6280 | 239 lines | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
sub_13A4DA0 | 180 lines | Commutative operand position selector (4-entry pattern table) | HIGH |
sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
sub_13A45E0 | -- | Extra operand emitter (operands 6/7 for wide instructions) | MEDIUM |
sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
sub_13A75D0 | -- | Register class 6 (UB) operand substitution handler | MEDIUM |
sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
Architecture Dispatch Tables -- 4 Copies at sub_B128E0--sub_B12920
Four nearly identical functions provide architecture-variant opcode dispatch. Each is only 13 binary bytes (3 instructions -- a thunk into shared code at 0x1C39xxx), yet each decompiles to 15,049 bytes because Hex-Rays pulls in the massive shared switch statement the thunks jump into; the full decompiled output runs to 79,562 bytes.
Each table contains a switch on *(a3+12) (the opcode word field) with 50+ cases, and secondary switches on *(a3+14) (opcode sub-field) within certain cases. The return values are SASS encoding slot indices (e.g., 197, 691, 526, 697, 772, 21). The four copies serve different SM architecture families, mapping the same logical opcode to different encoding slots depending on the target.
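A minimal model of the thunk-plus-shared-switch layout (slot values 197/691/526 are among the observed return values; which family returns which slot is an illustrative assumption, not recovered data):

```c
/* Shared switch body (the code at 0x1C39xxx), parametrized by SM family.
   Return values model SASS encoding slot indices. */
static int shared_opcode_switch(int family, int opcode_field) {
    switch (opcode_field) {
    case 1:  return family == 0 ? 197 : 691;   /* hypothetical rows */
    case 2:  return 526;
    default: return -1;
    }
}

/* The four 13-byte thunks differ only in which family they select. */
static int dispatch_family0(int op) { return shared_opcode_switch(0, op); }
static int dispatch_family3(int op) { return shared_opcode_switch(3, op); }
```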
Opcode Variant Selectors
Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:
sub_B0BE00 (19 KB) -- opcode class 194:
- Massive switch on a2 (100+ cases)
- All cases call sub_10AE590(ctx, inst, 194, N) with sequential N values starting from 827
- Pattern: case K -> sub_10AE590(ctx, inst, 194, 826 + K)
- Maps sub-variant indices to SASS encoding slots for one PTX opcode family
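The whole 100+-case switch therefore collapses to one linear map; a trivial sketch of the recovered relationship:

```c
/* sub_B0BE00: case K calls sub_10AE590(ctx, inst, 194, 826 + K),
   so sub-variant K of opcode class 194 lands on encoding slot 826 + K. */
static int class194_encoding_slot(int k) {
    return 826 + k;
}
```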
sub_B0AA70 (5 KB) -- opcode class 306:
- Same pattern but with opcode class 306
- Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
- The alternating-pair pattern at stride 6 suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)
Instruction Modifier Dispatchers
Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:
sub_B13E10 (5,792 B) -- basic modifier dispatcher:
- All 21 callees are sub_10AE640 (DAG node modifier)
- Switch on BYTE1(a7) & 0x1F for modifier type
- Maps modifier values 1--6 to internal codes 31--35
- Secondary dispatch on (a7 >> 3) for register width encoding
sub_B157E0 (11,815 B) -- extended modifier dispatcher:
- All 37 callees are sub_10AE640
- Handles texture/surface operations specially (opcode type 18)
- Maps sub-opcodes (BYTE5(a7) & 0x3F) to encoding values 54--60
Mercury Master Encoder -- sub_6D9690 (94 KB)
The Mercury master encoder is the single largest backend function and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:
// Simplified encoding flow
void EncodeInstruction(context, instruction) {
int type = *(int*)(instruction + 8);
uint64_t base = 0x2000000000LL; // encoding base constant
switch (type) {
case 61: // FFMA with literal operand
sub_6D9580(ctx, operand); // encode literal
break;
case 455: // complex multi-operand format
// bit-field extraction and assembly
break;
// ... hundreds of cases ...
}
// Common tail: append operand words, commit
sub_6D2750(ctx, word); // append 8-byte operand word
sub_6D28C0(ctx); // commit instruction record
}
Key encoding dispatch details:
- Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
- sub_7D6860 handles data type encoding (FP32/FP64/INT)
- sub_C00BF0 provides opcode lookup from the encoding tables
- Architecture-specific bits accumulated via SM100+ extensions controlled by knob 4176
MercExpand -- Pseudo-Instruction Expansion
sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.
| Handler | Size | Instruction class |
|---|---|---|
sub_C37A10 | 16 KB | General instruction expansion (jump table, 4+ cases) |
def_C37B2E | 13 KB | Complex expansion cases (default handler, string "EXPANDING") |
sub_C39B40 | 10 KB | Memory operations (LDG, STG, LDS, etc.) |
sub_C3A460 | 6 KB | Atomic operations |
sub_C3B560 | 8 KB | Texture operations |
sub_C3BCD0 | 19 KB | Control flow (branches, jumps, calls) |
sub_C3E030 | 18 KB | Finalization and cleanup |
The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.
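The replace step is an ordinary doubly-linked-list splice; a generic sketch (this node layout is hypothetical -- the real Ori IR node offsets are documented in Ori IR):

```c
#include <stddef.h>

typedef struct node {
    struct node *prev, *next;
    int opcode;
} node;

/* Splice the expanded chain first..last into the list in place of the
   pseudo-instruction `old`, mirroring MercExpand's replace step. */
static void splice_replace(node *old, node *first, node *last) {
    first->prev = old->prev;
    last->next  = old->next;
    if (old->prev) old->prev->next = first;
    if (old->next) old->next->prev = last;
    old->prev = old->next = NULL;    /* detached; caller deletes it */
}
```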
OCG Encoding Template Lookup -- sub_C3F490
The OCG (Optimizing Code Generator) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.
Routing values assigned by the OCG router
| Value | Hex | Instruction class | Assigned when |
|---|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) | Barrier register present (v108 != 0 in conditional paths) |
| 243 | 0xF3 | Default memory operation | Fallback for general memory ops without barrier or special fence |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) | Load-type operations (from OCG load/store handler) |
| 246 | 0xF6 | Reduction/atomic default | Atomic operations and reductions |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) | Operations requiring memory fence semantics |
| 257 | 0x101 | Async copy without memory order | Bulk copy ops when no barrier: v108 == 0 selects 257, else 70 |
| 261 | 0x105 | Atomic with pre-existing value read | Atomic exchange / compare-and-swap returning old value |
How sub_C3F490 maps routing values to encoding templates
sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested if-else tree: the first level branches on the routing value, then inner branches refine the template based on the modifier flags.
sub_C3F490(routing_value, a2..a8) -> template_ptr
a2: has pre-existing-value operand (used only by value 257)
a3: SM generation > sm_7x (SM80+)
a4: has predicate attachment
a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
a6: (always 0 from OCG emitter, used by MercExpand callers)
a7: (always 0 from OCG emitter, used by MercExpand callers)
a8: (always 0 from OCG emitter, used by MercExpand callers)
The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
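The template walk reduces to a few lines; a sketch under the recovered layout (byte 0 = operand count, bytes 1..count = slot indices into the 39-QWORD operand buffer; control-word building and the sub_9314F0 commit are elided):

```c
#include <stdint.h>

/* Gather the operand words that a gathering template selects out of the
   39-QWORD operand buffer (v134[] in the decompilation). */
static int gather_operands(const uint8_t *tmpl,
                           const uint64_t operand_buf[39],
                           uint64_t out[39]) {
    int count = tmpl[0];                  /* template[0] = operand count */
    for (int i = 0; i < count; i++)
        out[i] = operand_buf[tmpl[1 + i]];
    return count;
}
```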
Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] masked with & 0xCFFF).
| Value | Hex | Origin | Instruction class |
|---|---|---|---|
| 254 | 0xFE | MercExpand only | Extended memory format (operand gather mode 3) |
| 262 | 0x106 | MercExpand only | Wide memory format (operand gather mode 0, with scope/fence branches) |
Template address space
The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:
| Address range | Routing values served |
|---|---|
0x22B8960--0x22B8E60 | 257 (async copy variants) |
0x22B8E60--0x22B9360 | 70 (barrier memory variants) |
0x22B9360--0x22B9860 | 262 (MercExpand wide memory) |
0x22B9860--0x22B9E60 | 247, 245 (fenced / load variants) |
0x22B9E60--0x22BA960 | 243, 246, 70 (default / reduction / barrier sub-variants) |
0x22BA960--0x22BB460 | Leaf templates for bare operand forms (no modifiers) |
Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.
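That fallback structure can be modeled for a single routing value as follows (the 0x100 stride is the template size above; which flag selects which step is an illustrative assumption):

```c
#include <stdint.h>

/* Modeled on sub_C3F490's inner branches: each cleared modifier flag
   falls through to a simpler template, 0x100 bytes further along. */
static uintptr_t template_for_flags(uintptr_t base, int has_pred, int has_scope) {
    if (has_pred && has_scope) return base;           /* most complex form */
    if (has_pred)              return base + 0x100;
    if (has_scope)             return base + 0x200;
    return base + 0x300;                              /* bare fallback     */
}
```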
Addressing Mode Selection
Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.
PTX addressing modes and their SASS encodings:
| PTX syntax | Addressing mode | SASS instruction | Encoding |
|---|---|---|---|
[%rd1] | Register indirect | LDG.E R0, [R2] | Register + zero offset |
[%rd1+16] | Register + offset | LDG.E R0, [R2+0x10] | Register + immediate offset |
c[2][0x100] | Constant bank | LDC R0, c[0x2][0x100] | Bank index + offset |
[%rd1], %r2 | Base + index | STG.E [R2], R4 | Separate base/data registers |
Special string references in sub_6273E0 confirm complex addressing:
- ".nv.reservedSmem.offset0" -- reserved shared memory region
- "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
- "__$endLabel$__%s" -- label generation for structured control flow
The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.
Vtable Dispatcher Zone -- 0xAF0000--0xB10000
The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:
// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
if (a2 <= N)
return (unsigned int)dword_XXXXXXX[a2]; // table lookup
return default_value;
}
Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).
Comparison with LLVM ISel
| Aspect | LLVM | ptxas |
|---|---|---|
| ISel framework | SelectionDAG or GlobalISel (single pass) | Two-phase: MercConverter (phase 5) + ISel driver (phase 112+) |
| Pattern specification | TableGen .td files, machine-generated | Handwritten C++ (~750 functions) |
| Pattern count | Target-dependent (thousands for x86) | ~801 DAG matchers + 185 KB mega-selector |
| Architecture dispatch | Subtarget feature bits | 4 architecture dispatch tables + vtable overrides |
| Intermediate form | MachineInstr (already selected) | Ori IR (SASS opcodes after phase 5, not yet encoded) |
| Encoding | MCInst emission (separate pass) | Integrated: ISel + Mercury encode in same pipeline |
| Expansion | Pseudo-instruction expansion in AsmPrinter | MercExpand (phase 118, post-ISel) |
| Optimization post-ISel | MachineFunction passes | Phases 14--111 (full optimizer runs between Phase 1 and Phase 2) |
The key architectural difference: LLVM performs instruction selection once, then optimization happens on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so the optimizer can reason about SASS-level semantics, then performs a second selection/encoding pass after optimization is complete. This two-phase design gives the optimizer accurate cost models (it sees real SASS opcodes, not abstract PTX operations) at the cost of architectural complexity.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_C0EB10 | 185 KB | ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch) | HIGH |
sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) | VERY HIGH |
sub_9F1A90 | 35 KB | MercConverter main instruction conversion pass | HIGH |
sub_9EF5E0 | 27 KB | Post-MercConverter lowering ("CONVERTING") | HIGH |
sub_C3CC60 | 26 KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
sub_9ED2D0 | 25 KB | MercConverter opcode dispatch (master switch, & 0xCF mask) | HIGH |
sub_9E6600 | 25 KB | Instruction expansion (64-bit split) | HIGH |
sub_9EC340 | 23 KB | Multi-operand instruction legalization | MEDIUM |
sub_B0BE00 | 19 KB | Opcode variant selector (class 194, 100+ cases) | HIGH |
sub_C3BCD0 | 19 KB | MercExpand::expandControlFlow | MEDIUM |
sub_9D76D0 | 18 KB | Memory instruction legalization (load/store) | HIGH |
sub_C3E030 | 18 KB | MercExpand::finalizeExpansion | MEDIUM |
sub_9D80E0 | 17 KB | Memory instruction legalization (variant) | HIGH |
sub_9E8B20 | 17 KB | Texture/surface lowering | MEDIUM |
sub_C37A10 | 16 KB | MercExpand::expandInstruction (jump table) | HIGH |
sub_B128E0--sub_B12920 | 15 KB x4 | Architecture dispatch tables (4 SM families) | HIGH |
sub_B1FA20 | 13 KB | SASS 3-operand builder (variant A) | HIGH |
sub_B1D670 | 13 KB | Post-ISel instruction modifier | HIGH |
def_C37B2E | 13 KB | MercExpand complex cases ("EXPANDING") | HIGH |
sub_B157E0 | 12 KB | Extended modifier dispatcher (37 callees) | HIGH |
sub_B20E00 | 11 KB | SASS 3-operand builder (variant B) | HIGH |
sub_C39B40 | 10 KB | MercExpand::expandMemoryOp | MEDIUM |
sub_9DA100 | 9 KB | Arithmetic operation handler (case 6) | HIGH |
sub_B285D0 | 9 KB | ISel lowering driver (66 callees) | HIGH |
sub_B241A0 | 7 KB | SASS instruction property setter | HIGH |
sub_9F3340 | 7 KB | MercConverter orchestrator ("After MercConverter") | HIGH |
sub_C3A460 | 6 KB | MercExpand::expandAtomicOp | MEDIUM |
sub_B13E10 | 6 KB | Basic modifier dispatcher (21 callees) | HIGH |
sub_B0AA70 | 5 KB | Opcode variant selector (class 306) | HIGH |
sub_9DA5C0 | 2 KB | Opcode class 1 handler | MEDIUM |
sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x sub_13A6280, opcodes 2--352) | HIGH |
sub_A29220 | ~17 KB | SM120 instruction iterator (calls sub_13AF3D0 per instruction) | HIGH |
sub_13A6280 | ~10 KB | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
sub_13A4DA0 | ~7 KB | Commutative operand position selector (4-entry pattern table) | HIGH |
sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
sub_13A45E0 | -- | Extra operand emitter (wide instruction operands 6/7) | MEDIUM |
sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
sub_10AE5C0 | tiny | DAGNode_ReadField (field_id to value, delegates to sub_10D5E60) | VERY HIGH |
sub_10AE590 | tiny | DAGNode_WriteField (single field write) | VERY HIGH |
sub_10AE640 | tiny | DAGNode_WriteFields (multi-field update) | VERY HIGH |
sub_B28F30 | tiny | GetOperand (index into 32-byte operand array at *(node+32)) | VERY HIGH |
sub_B28F40 | tiny | GetResultCount (node[40] + 1 - node[92]) | VERY HIGH |
sub_B28F50 | tiny | GetSourceCount (*(node+92)) | VERY HIGH |
sub_B28E00 | tiny | DecodeRegClass (identity function, class is plain int) | VERY HIGH |
sub_B28E10 | tiny | isGPR operand predicate (tag == 2) | VERY HIGH |
sub_B28E20 | tiny | isImmediate operand predicate (tag == 1) | VERY HIGH |
sub_B28E40 | tiny | isValidReg operand predicate (tag == 10) | VERY HIGH |
sub_B28E80 | tiny | isPredicate operand predicate (tag == 3) | VERY HIGH |
sub_B28E90 | tiny | isUniformReg operand predicate (tag == 15) | VERY HIGH |
sub_B28F60--sub_B74C60 | ~1.3 MB | ~801 DAG pattern matchers (priority 2--34, template 1--152) | HIGH |
sub_C01840 | -- | Mega-selector source operand marshaller (52 calls from mega-selector) | HIGH |
sub_C01F50 | -- | Mega-selector destination operand marshaller | HIGH |
sub_C00EA0 | -- | Single operand extractor (returns tagged operand word) | HIGH |
sub_BFFD60 | -- | Operand reference resolver (register ref to encoding word) | HIGH |
sub_C06E90 | -- | Symbol/special-register lookup for shared memory | HIGH |
sub_C07690 | -- | Immediate-operand encoding helper | MEDIUM |
sub_C0B2C0 | -- | Extended memory/warp operation encoder | HIGH |
sub_C05CC0 | -- | Immediate operation encoder (flag-dependent path) | MEDIUM |
sub_BFEBF0 | tiny | Default vtable[2] stub (opcode translator, no-op identity) | VERY HIGH |
sub_BFEAA0 | tiny | Default vtable[12] stub (capability check, always false) | VERY HIGH |
sub_BFEA30 | tiny | Default vtable[3] stub (extension handler, no-op) | VERY HIGH |
sub_BFEF10 | -- | Register bank capacity check / grow | MEDIUM |
word_22B4B60 | -- | Static opcode-to-encoding-index table (uint16[222], default backend) | VERY HIGH |
sub_C3F490 | 184 B | OCG encoding template lookup (routing value + 7 flags -> template ptr) | VERY HIGH |
sub_6CB8A0 | -- | OCG SASS instruction emitter (calls sub_C3F490 then sub_9314F0) | HIGH |
sub_C41100 | -- | MercExpand memory encoder (calls sub_C3F490 with full flag set) | HIGH |
sub_C40420 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
sub_C40B90 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
sub_C42330 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
unk_22B8960--unk_22BB460 | ~11 KB | Operand gathering templates (40+ entries, 256 B each) | HIGH |
Cross-References
- PTX-to-Ori Lowering -- Phase 1 context: bridge phases, MercConverter call chain
- Code Generation Overview -- ISel within the codegen pipeline
- SASS Instruction Encoding -- bit-level encoding format, operand encoders
- Mercury Encoder Pipeline -- Mercury master encoder, MercExpand
- Peephole Optimization -- post-ISel pattern rewrites (3 mega-dispatchers)
- Newton-Raphson Templates -- DDIV/DRCP/DSQRT expansion sequences
- Intrinsics: OCG Lowering Pipeline -- OCG router that assigns routing values, operand buffer layout
- Ori IR -- instruction format, opcode field layout
- SASS Opcodes -- target instruction set
SASS Instruction Encoding
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS instruction encoder is the single largest subsystem in ptxas by function count. It translates the internal Ori IR instruction representation into packed binary SASS machine code for a specific SM target. The encoder comprises approximately 4,000 template-generated handler functions dispatched through function-pointer tables indexed by opcode, plus six massive switch-dispatch megafunctions that route field-level queries by instruction category. The core encoding primitive is a single 216-byte bitfield-insert function (sub_7B9B80) called from 18,347 sites throughout the binary. NVIDIA internally names this pipeline phase "Ori Phase Encoding" within the Mercury assembler backend.
| Pipeline phase | OriPhaseEncoding (within Mercury) |
| Core bitfield packer | sub_7B9B80 (216 bytes, 18,347 callers) |
| Encoding buffer | 1280 bits = 20 QWORDs at a1+544 |
| Instruction widths | 64-bit (format 1), 128-bit (format 2), 256-bit (format 8) |
| Opcode hierarchy | 3-level: major (9 bits) / minor (8 bits) / sub-opcode (7 bits) |
| SM100 encoder count | ~1,086 encode functions + ~97 decode functions |
| SM100 opcode categories | 370 (case values 0x0 through 0x171) |
| SM100 major opcodes | 102 unique values |
| Bitfield accessor primitives | 2,095 functions (mostly under 200 bytes) |
| Confirmed strings | "AdvancedPhaseOriPhaseEncoding", "MercEncodeAndDecode", "After EncodeAndDecode", "ENCODING" |
Encoding Buffer Layout
Every encoder operates on an instruction encoding context object passed as a1. The primary encoding target is a 1280-bit (160-byte, 20 QWORD) buffer at offset a1+544. The bitfield packer sub_7B9B80 writes individual fields into this buffer by iterating in 64-bit chunks:
// sub_7B9B80(a1, bit_offset, bit_width, value)
// Insert `value` into bits [bit_offset .. bit_offset+bit_width) of the encoding buffer
void bitfield_insert(int64_t a1, int bit_offset, int bit_width, uint64_t value) {
uint64_t mask = (1ULL << bit_width) - 1;
value &= mask;
int pos = bit_offset;
  if (pos < 1280) {                  // decompiled as a loop, but only one iteration runs
    int qword_idx = pos >> 6;        // which QWORD
    int bit_in_qword = pos & 63;     // bit position within QWORD
    *(uint64_t*)(a1 + 8 * qword_idx + 544) |= value << bit_in_qword;
    // Handle fields that cross a QWORD boundary
    if (bit_in_qword + bit_width > 64)
      *(uint64_t*)(a1 + 8 * (qword_idx + 1) + 544) |= value >> (64 - bit_in_qword);
  }
}
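As a worked check of the field placement, a standalone re-implementation of the packer over a zeroed buffer (our code, not the binary's) reproduces the header layout described under Instruction Word Format:

```c
#include <stdint.h>

/* Standalone model of sub_7B9B80 over a 20-QWORD (1280-bit) buffer. */
static void insert_field(uint64_t buf[20], int off, int width, uint64_t val) {
    val &= (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    buf[off >> 6] |= val << (off & 63);
    if ((off & 63) + width > 64)                       /* QWORD-crossing part */
        buf[(off >> 6) + 1] |= val >> (64 - (off & 63));
}
```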
The encoding context object has this layout:
| Offset | Size | Content |
|---|---|---|
+0x000 | 8B | vtable / allocator pointer |
+0x008 | 16B | Format descriptor (xmmword constant from rodata) |
+0x010 | 4B | Bitfield position base index |
+0x018 | 120B | Register class maps (3 arrays of 10 DWORDs: source classes, dest classes, widths) |
+0x090 | 4B | Operand count (a1+144) |
+0x094--+0x110 | 124B | Explicit operand mapping table (pairs of index + bit position)
+0x194 | 32B | Extended operand attributes (from xmmword tables) |
+0x1D4--+0x214 | 64B | Constant buffer slot table (16 DWORD slots, cleared to 0xFF by sub_7B9D30) |
+0x214 | 4B | Constant buffer slot counter (a1+532) |
+0x218 | 8B | Encoding validation context pointer (a1+536) |
+0x220 | 8B | Instruction bits [63:0] (a1+544) |
+0x228 | 8B | Instruction bits [127:64] (a1+552) |
+0x230+ | -- | Additional encoding space (up to 1280 bits total)
Instruction Word Format
SASS instructions use a 3-level opcode hierarchy packed into the first 32 bits of the encoding buffer. The format code in bits [0:3] determines instruction width:
128-bit instruction word:
bits [0:3] = 0x2 (format code: 128-bit)
bits [4:6] = 0x0 (scheduling group slot 0)
bits [8:16] = MAJOR (9-bit major opcode, 0x00-0x171)
bits [17:24] = MINOR (8-bit minor opcode / variant)
bits [25:31] = SUBOP (7-bit sub-opcode / format ID)
bits [48+] = MODIFIERS (format-dependent modifier fields)
bits [132:134] = 0x0 (extended opcode flag, at offset 0x84)
64-bit instruction word:
bits [0:3] = 0x1 (format code: 64-bit)
bits [4:6] = 0x0 (scheduling group slot 0)
bits [8:16] = MAJOR (9-bit major opcode)
bits [17:24] = MINOR (8-bit minor opcode)
bits [25:31] = SUBOP (7-bit sub-opcode)
(no bit 132 extended flag -- only 5 initial sub_7B9B80 calls)
256-bit instruction word:
bits [0:3] = 0x8 (format code: 256-bit)
(used for IMAD.WIDE variants with 16 constant-bank operand slots)
The 128-bit format uses 6 initial sub_7B9B80 calls (including one at offset 0x84 for the extended flag). The 64-bit format uses only 5 (no 0x84 call). This is the reliable way to distinguish the two during analysis.
The maximum variant value observed is 0x2F (47 decimal), so each major opcode can have up to 48 sub-operations, though most have far fewer.
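The header layout can be sanity-checked with a small round-trip sketch (bit positions taken from the tables above; the helper names are illustrative, not recovered symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Pack the 3-level opcode hierarchy into the low 32 bits of an instruction
   word: format [0:3], MAJOR [8:16], MINOR [17:24], SUBOP [25:31].
   (Sketch for analysis tooling; not code from the binary.) */
static uint32_t pack_header(uint32_t fmt, uint32_t major, uint32_t minor, uint32_t subop) {
    return  (fmt   & 0xFu)
         | ((major & 0x1FFu) << 8)
         | ((minor & 0xFFu)  << 17)
         | ((subop & 0x7Fu)  << 25);
}

static uint32_t header_major(uint32_t w) { return (w >> 8)  & 0x1FF; }
static uint32_t header_minor(uint32_t w) { return (w >> 17) & 0xFF; }
static uint32_t header_subop(uint32_t w) { return (w >> 25) & 0x7F; }
```

Feeding the extractors the first DWORD of a dumped SASS word recovers the (MAJOR, MINOR, SUBOP) triple before consulting the dispatch tables.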
Encoder Template
Every encoding handler function follows an identical 10-phase template. The only differences between the approximately 1,086 encoder functions for SM100 are the specific constant values and which modifier-encoding helpers are called. This is textbook C++ template/macro expansion:
int64_t encode_OPCODE_VARIANT(int64_t a1, int64_t a2) {
// a1 = instruction encoding context (output)
// a2 = Ori IR instruction node (input)
// Phase 1: Set instruction format header
sub_7B9B80(a1, 0, 4, FORMAT_CODE); // bits[0:3] = 1 (64b) / 2 (128b) / 8 (256b)
sub_7B9B80(a1, 4, 3, 0); // bits[4:6] = sched group slot 0
sub_7B9B80(a1, 0x84, 3, 0); // bits[132:134] = extended flag (128-bit only)
sub_7B9B80(a1, 8, 9, OPCODE_CLASS); // bits[8:16] = major opcode
sub_7B9B80(a1, 0x11, 8, OPCODE_MINOR); // bits[17:24] = minor opcode / variant
sub_7B9B80(a1, 0x19, 7, FORMAT_ID); // bits[25:31] = sub-opcode / format ID
// Phase 2: Load operand format descriptor
*(xmmword*)(a1 + 8) = xmmword_23FXXXX; // 128-bit format field layout from rodata
// Copy 3 arrays of 10 DWORDs into a1+24..a1+140 (slot sizes, types, flags)
// Phase 3: Set operand count and modifier table
*(int*)(a1 + 144) = NUM_SOURCE_OPERANDS;
*(xmmword*)(a1 + 404) = xmmword_YYYYYYY; // modifier descriptor table
// Phase 4: Initialize encoding context
sub_7B9D30(a1); // clear constant buffer slot table (memset +468, 0xFF, 64)
sub_7B9D60(a1, a2, 0); // encode reuse flags + guard predicate
// Phase 5: Encode primary opcode ID
void* ctx = *(void**)(a1 + 536);
int opcode = sub_10BFxxx(*(a2+32) + 32 * *(a2+40)); // extract from IR operand
int encoded = sub_10B6180(ctx, opcode); // map through lookup table
sub_7B9B80(a1, 8 * *(a1+16), 1, encoded); // insert at computed position
// Phase 6: Encode source operands (variable number and types)
sub_7BC030(a1, a2, 0, 0x60); // register operand 0 at bit offset 0x60
sub_7BC030(a1, a2, 1, 0x70); // register operand 1 at bit offset 0x70
sub_7BCF00(a1, a2, 2, 0x88); // immediate operand 2 at bit offset 0x88
sub_7BC5C0(a1, a2, 3, 0x98); // predicate operand 3 at bit offset 0x98
// Phase 7: Encode instruction-specific modifiers
int mod_val = sub_10B96A0(a2); // read modifier from IR node
int enc_mod = sub_10B3680(ctx, mod_val); // validate and map
*(int64_t*)(a1+544) |= ((int64_t)enc_mod << 55) & 0x180000000000000LL;
  // Phase 8: Encode explicit operand mapping (source operands with data offsets)
  *(int*)(a1 + 148) = operand_index;
  *(int*)(a1 + 152) = 8 * bit_position;
  // (Phases 9--10 vary per encoder and are omitted from this skeleton)
}
Operand Type Encoders
Four type-specific helper functions encode operands into the instruction word. Each reads the operand descriptor from the IR instruction's operand table at *(a2+32) + 32*operand_index.
Register Operand Encoder -- sub_7BC030
814 bytes, 6,147 callers. Encodes a general-purpose register (R0-R255, UR0-UR63):
// sub_7BC030(insn, ir_insn, operand_index, bit_offset)
void encode_register(int64_t a1, int64_t a2, int op_idx, int bit_off) {
if (op_idx >= *(int*)(a2 + 92)) // check operand count
return;
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
int reg_type_raw = *(int*)(operand + 20);
// Map register file type to 4-bit encoding:
// 1->0, 2->1, 3->2, 4->3, 5->4, 6->5, 7->6, 8->7,
// 16->8, 32->9, 64->10, 128->11
int reg_type = map_regfile(reg_type_raw);
int reg_num = *(int*)(operand + 4); // signed register number
sub_7B9B80(a1, bit_off, 1, 1); // 1-bit presence flag
sub_7B9B80(a1, bit_off + 1, 4, reg_type); // 4-bit register type
sub_7B9B80(a1, bit_off + 6, 10, reg_num); // 10-bit register number
}
The register file type encoding maps raw operand type codes to a 4-bit hardware register file selector. The 12 supported raw values (1 through 8, then the powers of two 16, 32, 64, 128) cover GPR, uniform registers, predicate registers, special registers, and extended register files.
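The mapping is compact enough to restate as a helper (a sketch of the table in the code comment above; `map_regfile` is the placeholder name used in the reconstructed encoder, not a recovered symbol):

```c
#include <assert.h>

/* Raw register-file type code -> 4-bit hardware selector:
   1..8 -> 0..7, then 16 -> 8, 32 -> 9, 64 -> 10, 128 -> 11.
   Returns -1 for codes outside the 12 documented values. */
static int map_regfile(int raw) {
    if (raw >= 1 && raw <= 8)
        return raw - 1;
    switch (raw) {
        case 16:  return 8;
        case 32:  return 9;
        case 64:  return 10;
        case 128: return 11;
        default:  return -1;   /* not observed in the recovered tables */
    }
}
```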
Immediate / Constant-Buffer Encoder -- sub_7BCF00
856 bytes, 1,657 callers. Encodes immediate values and constant memory references (c[bank][offset]):
// sub_7BCF00(insn, ir_insn, operand_index, bit_offset)
void encode_immediate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
int type = *(uint8_t*)operand;
if (type == 14 || type == 15 || type == 16) {
// Predicate-typed immediate: store to constant buffer slot table
*(void**)(a1 + 468 + 8 * *(int*)(a1 + 532)) = operand + 8;
*(int*)(a1 + 532) += 1;
}
sub_7B9B80(a1, bit_off, 1, 1); // presence flag
sub_7B9B80(a1, bit_off + 11, 5, *(int*)(operand + 4)); // 5-bit value
}
Predicate Encoder -- sub_7BC5C0
416 bytes, 1,449 callers. Encodes predicate register operands (PT, P0-P6):
// sub_7BC5C0(insn, ir_insn, operand_index, bit_offset)
void encode_predicate(int64_t a1, int64_t a2, int op_idx, int bit_off) {
void* operand = *(void**)(a2 + 32) + 32 * op_idx;
// pred_type, pred_cond, and pred_value are read from fields of `operand`
// (exact offsets not recovered); only the bitfield layout is certain:
sub_7B9B80(a1, bit_off, 2, pred_type); // 2-bit predicate type
sub_7B9B80(a1, bit_off + 3, 3, pred_cond); // 3-bit condition code
sub_7B9B80(a1, bit_off + 8, 8, pred_value); // 8-bit predicate value
}
Uniform Register Encoder -- sub_7BC360
Used for uniform registers (UR0-UR63) and source operands with alternative bitfield layouts. 126 calls in the SM100 encoding range. Likely handles the UR register file, which has an encoding namespace separate from the main GPR file.
Instruction Format Groups
The encoder functions are organized into 16 format groups, identified by the xmmword constant loaded at a1+8. Each xmmword holds the field layout descriptor for that instruction format. The groups divide into two categories:
128-bit Formats (11 groups)
| Format Descriptor | Format ID | Encoder Count | Description |
|---|---|---|---|
xmmword_23F1DF8 | 0x03 | 145 | General ALU/memory -- most common format |
xmmword_23F29A8 | 0x19 | 117 | Extended format for complex instructions |
xmmword_23F21B0 | 0x0A | 99 | Multi-source ALU operations |
xmmword_23F2678 | 0x13 | 27 | Tensor/extended ALU |
xmmword_23F2018 | 0x07 | 9 | Miscellaneous ALU |
xmmword_23F2348 | 0x0D | 6 | Specialized ALU |
xmmword_23F2EF8 | 0x23 | 5 | Extended variant |
xmmword_23F2810 | 0x16 | 4 | Bulk data / DMA |
xmmword_23F2128 | 0x09 | 2 | Rare format |
xmmword_23F2DE8 | 0x21 | 2 | Rare extended |
xmmword_23F25F0 | 0x12 | 2 | Rare format |
64-bit Formats (5 groups)
| Format Descriptor | Encoder Count | Description |
|---|---|---|
xmmword_23F1F08 | 113 | Short-form general -- widest opcode coverage (27 classes) |
xmmword_23F1D70 | 41 | Short-form 4-operand |
xmmword_23F1F90 | 11 | Short-form variant |
xmmword_23F2238 | 8 | Short-form variant |
xmmword_23F2C50 | 1 | Minimal format |
The 128-bit group (format code 2) encodes long-form SASS instructions (ALU, load/store, texture, tensor core). The 64-bit group (format code 1) encodes short-form instructions (simple moves, branches, barriers, NOP-like control). Two additional functions use format code 8 (256-bit) for IMAD.WIDE variants with 16 constant-bank operand slots.
Instruction Format Group Catalog
Format Descriptor Architecture
Each format group is defined by a 128-bit xmmword constant stored in rodata at addresses 0x23F1xxx--0x23F2xxx. This descriptor is loaded via SSE into the encoding context at a1+8:
*(__m128i *)(a1 + 8) = _mm_loadu_si128(&xmmword_23F29A8);
Immediately following each xmmword in rodata are three arrays of 10 DWORDs that define the operand slot layout. The encoder copies these into the context object at a1+24 through a1+140:
| Rodata Array | Context Offset | Content |
|---|---|---|
dword_XXXXX8[10] | a1+24 .. a1+60 | Operand slot sizes (bits per slot) |
dword_XXXXE0[10] | a1+64 .. a1+100 | Operand slot types (register class selector) |
dword_XXXXX8[10] | a1+104 .. a1+140 | Operand slot flags (encoding mode flags) |
Observed slot-size values: 10 = register (10-bit number + overhead), 12 = register with type, 17 = immediate/cbuf, -1 = unused. Slot-type values: 28 = register-type, 0 = basic, -1 = unused. Slot-flag values: 0 = default, 2 = secondary (uniform/extended), -1 = unused.
The copy uses SSE aligned loads for 16-byte chunks and scalar DWORD stores for remainders. The alignment check visible in every decompiled encoder (if (a1 + 120 <= dword_XXXXX8 || a1 + 24 >= &dword_XXXXX8)) is a compiler-generated overlap guard for the memcpy-like bulk copy.
Bitfield Packer Detail -- sub_7B9B80
The core encoding primitive. 216 bytes compiled, 18,347 callers total. Inserts an arbitrary-width bitfield into the 1280-bit buffer at a1+544:
// sub_7B9B80(a1, bit_offset, bit_width, value)
// Reconstructed algorithm (control flow simplified; the decompiled original
// expresses this loop with nested gotos and an always-false unsigned
// comparison that is really a signed test):
__int64 bitfield_insert(__int64 a1, uint32_t bit_offset, int bit_width, uint64_t value) {
    uint32_t end = bit_offset + bit_width;
    for (uint32_t pos = 0; pos != 1280; pos += 64) {   // walk all 20 QWORD chunks
        uint32_t chunk_end = pos + 64;
        if (bit_offset > pos + 63 || end <= pos)
            continue;                                  // field does not touch this chunk
        uint32_t start = (bit_offset >= pos) ? bit_offset : pos;
        uint32_t stop  = (end <= chunk_end) ? end : chunk_end;
        int width = stop - start;
        // Bits of `value` already emitted into earlier chunks (0 in the first
        // chunk; the decompiled code derives this via neg_base = -64 - bit_offset):
        int shift_right = (pos > bit_offset) ? (int)(pos - bit_offset) : 0;
        int bit_in_qword = start & 0x3F;
        __int64 *qword = (__int64*)(a1 + 8 * (start >> 6) + 544);
        uint64_t shifted = value >> shift_right;
        if (width == 64)
            *qword |= shifted << bit_in_qword;
        else
            *qword |= (shifted & ~(-1ULL << width)) << bit_in_qword;
    }
    return 1280;
}
Key properties:
- Handles cross-QWORD-boundary fields: a 9-bit opcode starting at bit 59 writes 5 bits to QWORD 0 and 4 bits to QWORD 1
- Loop terminates at bit position 1280 (20 QWORDs), hard ceiling
- For typical field widths (1--9 bits), only 1--2 of the 20 chunk iterations produce a write
- Called 8--12 times per encoder function (average ~10)
- The 256-bit format encoders call it with wider fields (up to 32 bits for data values)
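The cross-boundary case in the first bullet is easy to reproduce with a standalone reimplementation over a bare QWORD array (same masking logic as the reconstruction above, minus the chunk loop):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for sub_7B9B80: OR a bit_width-bit value into a QWORD
   array at bit_offset, splitting writes that straddle a QWORD boundary. */
static void insert_bits(uint64_t *buf, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask = (bit_width == 64) ? ~0ULL : (1ULL << bit_width) - 1;
    value &= mask;
    int q = bit_offset >> 6, b = bit_offset & 63;
    buf[q] |= value << b;
    if (b + bit_width > 64)              /* spillover into the next QWORD */
        buf[q + 1] |= value >> (64 - b);
}
```

A 9-bit all-ones field at bit 59 lands as 5 bits in QWORD 0 ([63:59]) and 4 bits in QWORD 1 ([3:0]), exactly the split described above.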
128-bit Format 0x03 -- General ALU/Memory (145 encoders)
The most populous format group. Handles the bread-and-butter ALU and memory instructions.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F1DF8 |
| Format ID | 0x03 (bits[25:31]) |
| Slot arrays | dword_23F1E08, dword_23F1E30, dword_23F1E40 |
| Operand slots | 2--7 per instruction |
| Typical pattern | 3 reg + 1 imm + 1 pred (5 slots) |
| Modifier fields | 4--8 per instruction |
Opcode classes (29): 0x08, 0x0B, 0x0F, 0x10, 0x16, 0x17, 0x19, 0x1A, 0x1B, 0x20, 0x22, 0x25, 0x28, 0x2A, 0x2B, 0x30, 0x32, 0x34, 0x35, 0x36, 0x37, 0x38, 0x3B, 0x41, 0x45, 0x4A, 0x4B, 0x5B, 0x67.
128-bit Format 0x19 -- Extended Complex (117 encoders)
Second most common. Used for instructions with rich modifier fields or unusual operand configurations.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F29A8 |
| Format ID | 0x19 (bits[25:31]) |
| Slot arrays | dword_23F29B8, dword_23F29E0, dword_23F2A08 |
| Operand slots | 3--6 per instruction |
| Modifier fields | 5--8 per instruction |
Opcode classes (8): 0x0F, 0x10, 0x1A, 0x1B, 0x22, 0x38, 0x4D, 0x5E. Notable concentration: opcode 0x1B has 41 variants in this format alone (tensor/MMA family); opcode 0x5E has 26 variants. The load/store family (0x38) uses this format for 7 of its 16 variants -- the ones with extended addressing modes.
128-bit Format 0x0A -- Multi-Source ALU (99 encoders)
Designed for instructions with 4--7 source operands. Heavily weighted toward rich ALU operations.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F21B0 |
| Format ID | 0x0A (bits[25:31]) |
| Operand slots | 4--7 per instruction |
| Typical pattern | 4 reg + 1 imm + 1 pred |
Opcode classes (10): 0x10, 0x16, 0x17, 0x20, 0x25, 0x28, 0x2A, 0x45, 0x4B, 0x67. Opcode 0x2A dominates with 30 variants; opcode 0x25 has 18.
128-bit Format 0x13 -- Tensor/Extended ALU (27 encoders)
Contains the most complex encoders in the binary. Opcode 0x5A variant 0x02 (sub_D89C90, 2015 bytes) has 18 modifier fields -- the maximum observed.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F2678 |
| Format ID | 0x13 (bits[25:31]) |
| Slot arrays | dword_23F2688, dword_23F26B0, dword_23F26D8 |
| Operand slots | 4--7 per instruction |
| Modifier fields | 8--18 per instruction |
Opcode classes (7): 0x10, 0x16, 0x17, 0x1A, 0x41, 0x5A, 0x67.
128-bit Formats 0x07, 0x0D, 0x23, 0x16, 0x09, 0x21, 0x12 -- Rare Formats (35 encoders combined)
| Descriptor | Format ID | Encoders | Opcode Classes |
|---|---|---|---|
xmmword_23F2018 | 0x07 | 9 | 0x0F, 0x10 |
xmmword_23F2348 | 0x0D | 6 | 0x0F, 0x16, 0x67 |
xmmword_23F2EF8 | 0x23 | 5 | 0x10 |
xmmword_23F2810 | 0x16 | 4 | 0x4B (bulk/DMA) |
xmmword_23F2128 | 0x09 | 2 | -- |
xmmword_23F2DE8 | 0x21 | 2 | -- |
xmmword_23F25F0 | 0x12 | 2 | 0x4B |
Formats 0x16 and 0x12 share opcode class 0x4B, suggesting they encode different addressing-mode variants of the same bulk-data instruction.
64-bit Format A (xmmword_23F1F08) -- Short-Form General (113 encoders)
Widest opcode coverage of any single format. Covers 27 distinct opcode classes with few variants each -- the simple, common instructions.
| Property | Value |
|---|---|
| Descriptor | xmmword_23F1F08 |
| Operand slots | 0--3 per instruction |
| Register offsets | 0x40, 0x50, 0x60, 0x70 |
Opcode classes (27): 0x00--0x09, 0x0A--0x0F, 0x10, 0x11, 0x12, 0x14, 0x16, 0x1B, 0x1C, 0x20, 0x21, 0x23, 0x25. Many of these are NOP/control, simple moves, and compact branches.
64-bit Format B (xmmword_23F1D70) -- Short-Form 4-Operand (41 encoders)
Bimodal operand count: either 0 operands (control instructions) or 4 operands (compact arithmetic with all-register sources).
Opcode classes: 0x00--0x09, 0x10, 0x12, 0x14--0x1E, 0x26, 0x28, 0x2A.
64-bit Formats C, D, E -- Specialized Short Forms (20 encoders combined)
| Descriptor | Encoders | Notes |
|---|---|---|
xmmword_23F1F90 | 11 | Short-form variant C |
xmmword_23F2238 | 8 | Short-form variant D |
xmmword_23F2C50 | 1 | Minimal format, single encoder; also appears in 128-bit category with 0 uses |
Distinguishing 64-bit vs 128-bit Encoders
The 128-bit format sets the extended opcode flag at bit offset 0x84, which the 64-bit format does not:
128-bit (6 initial sub_7B9B80 calls):
sub_7B9B80(a1, 0, 4, 2) // format code = 2
sub_7B9B80(a1, 4, 3, 0) // sched group slot
sub_7B9B80(a1, 0x84, 3, 0) // extended flag at bit 132 <-- PRESENT
sub_7B9B80(a1, 8, 9, MAJ) // major opcode
sub_7B9B80(a1, 0x11, 8, MIN) // minor opcode
sub_7B9B80(a1, 0x19, 7, FID) // format ID
64-bit (5 initial sub_7B9B80 calls):
sub_7B9B80(a1, 0, 4, 1) // format code = 1
sub_7B9B80(a1, 4, 3, 0) // sched group slot
// NO 0x84 call <-- ABSENT
sub_7B9B80(a1, 8, 9, MAJ) // major opcode
sub_7B9B80(a1, 0x11, 8, MIN) // minor opcode
sub_7B9B80(a1, 0x19, 7, FID) // format ID
The 256-bit format (format code 8) is used by exactly 2 encoders for IMAD.WIDE (major 0x59, minor 0x02 and 0x03), each with 16 constant-buffer operand slots encoded via sub_7BCF00.
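For batch analysis, the 5-call/6-call signature reduces to a one-line classifier over the recovered bit-offset sequence of each encoder's initial sub_7B9B80 calls (a scripting sketch, not code from the binary):

```c
#include <assert.h>

/* 128-bit encoders emit an initial sub_7B9B80 call at bit offset 0x84
   (the extended flag); 64-bit encoders never do. */
static int is_128bit_encoder(const int *call_offsets, int n) {
    for (int i = 0; i < n; i++)
        if (call_offsets[i] == 0x84)
            return 1;
    return 0;
}
```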
Dispatch Tables -- The Six Megafunctions
Six switch-dispatch megafunctions in the 0x10C0B20--0x10E32E0 range form the central routing logic of the instruction codec. All six switch on the opcode category at *(WORD*)(a1+12) with up to 370 cases (0x0 through 0x171), each containing sub-switches on field ID:
| Function | Size | Decompiled Lines | Callers | Purpose |
|---|---|---|---|---|
sub_10C0B20 | 180 KB | 9,231 | 3,109 | setField -- write a value into a named field |
sub_10D5E60 | 197 KB | 6,491 | 961 | getFieldOffset -- return bit-offset of a named field |
sub_10E32E0 | 187 KB | 6,240 | 72 | hasField -- boolean: does this instruction have field X? |
sub_10CCD80 | 142 KB | 7,581 | 4 | setFieldDefault -- write hardcoded default for a field |
sub_10CAD70 | 68 KB | 1,864 | 74 | getOperandFieldOffset -- bit-offset of a per-operand field |
sub_10C7690 | 65 KB | 2,313 | 288 | setOperandField -- write a per-operand field value |
sub_10C0B20 (setField) is one of the most-called functions in the entire ptxas binary at 3,109 call sites. It writes field values using sub_AF80xx writer stubs and contains jump-out targets (0xAF43F0, 0xAF44C0, 0xAF4550) for bit-manipulation operations that cross word boundaries.
sub_10D5E60 (getFieldOffset) returns extractor(a1+48, bit_position) + base_offset for each field, where base_offset is a field-specific constant (observed values: +125, +790, +1278, etc.). Returns 0xFFFFFFFF when the queried field does not exist in the given instruction category.
sub_10CAD70 (getOperandFieldOffset) operates on per-operand packed records at *(QWORD*)(a1+32) + 32*operand_index + 24. The field IDs it handles include: 1 (register class), 2 (operand type), 7, 8, 12 (operand size), 13 (address mode), 14, 15, 19, 24, 25, 26, 27, 29.
Dead cases (opcode categories without the queried field) share a common pattern: cases 3, 0x11, 0x24, 0x26, 0x2D, 0x75, 0x78, 0x84, 0x8C-0x9F, 0xA1-0xBA, 0xBE-0x16F return 0xFFFFFFFF or false.
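Structurally, each megafunction is a two-level switch with a shared dead-case sentinel. A toy sketch of the getFieldOffset shape (category 0x08 and its offsets are invented for illustration; only the 0xFFFFFFFF sentinel and the nesting are from the binary):

```c
#include <assert.h>
#include <stdint.h>

#define NO_FIELD 0xFFFFFFFFu

/* Toy getFieldOffset: outer switch on opcode category (up to 370 cases in
   the real function), inner switch on field ID. Values are illustrative. */
static uint32_t get_field_offset(int category, int field_id) {
    switch (category) {
    case 0x08:                       /* hypothetical category */
        switch (field_id) {
        case 1:  return 125;         /* hypothetical base offsets */
        case 2:  return 790;
        default: return NO_FIELD;
        }
    default:
        return NO_FIELD;             /* dead case: field absent */
    }
}
```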
Bitfield Accessor Library
The 0x10B0000--0x10BF2C0 range contains 2,095 machine-generated bitfield read/write primitives for the 192-bit packed instruction format. These are the building blocks that the six megafunctions call:
- 1,661 functions under 200 bytes: pure getters/setters for individual fields
- 412 functions between 200-500 bytes: multi-field accessors
- 22 functions above 500 bytes: complex accessors with validation
Seven core extractors handle all bitfield reads:
| Function | Width | Storage Format |
|---|---|---|
sub_10B28E0 | 1-bit | 192-bit (3x QWORD) |
sub_10B2860 | 2-bit | 192-bit |
sub_10B27E0 | 3-bit | 192-bit |
sub_10B2760 | 4-bit | 192-bit |
sub_10B26E0 | 5-bit | 192-bit |
sub_10B2650 | 2-bit | 32-bit array |
sub_10B25C0 | 3-bit | 32-bit array |
The 192-bit format (3 QWORDs = 24 bytes, stored at a1+48) handles boundary crossing: if a bitfield spans a QWORD boundary, the extractor combines partial reads from adjacent words. The 32-bit-array format is used for sub-fields that are naturally DWORD-aligned.
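The boundary-crossing read generalizes to a single width-parameterized extractor (a sketch; the binary instead specializes one function per width):

```c
#include <assert.h>
#include <stdint.h>

/* Extract a width-bit field from a 192-bit (3 x QWORD) record, combining
   partial reads when the field straddles a QWORD boundary. */
static uint64_t extract_bits(const uint64_t rec[3], int bit_offset, int width) {
    uint64_t mask = (width == 64) ? ~0ULL : (1ULL << width) - 1;
    int q = bit_offset >> 6, b = bit_offset & 63;
    uint64_t v = rec[q] >> b;
    if (b + width > 64 && q < 2)
        v |= rec[q + 1] << (64 - b);   /* pull in the spillover bits */
    return v & mask;
}
```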
A typical accessor is trivially simple:
// sub_10BEF80 (140 bytes)
int get_field_X(int64_t a1) {
return (*(uint32_t*)(a1 + 24) & 3) + 51; // extract 2-bit field, add base
}
Modifier Encoding
After operands are encoded, each handler packs instruction-specific modifier fields into the bits at a1+544 (primary word) and a1+552 (extended word). The pattern is:
- Read the modifier value from the IR node via a property extractor (sub_10B9xxx family)
- Validate and map it through an encoding lookup table (sub_10B3xxx / sub_10B4xxx family)
- OR the result into the packed word at a shifted/masked position
The most commonly used modifier-encoding functions:
| Function | Callers | Bits | Likely Meaning |
|---|---|---|---|
sub_10B6180 | 8,091 | 1 | Boolean flag (.S, .STRONG, .SAT, etc.) |
sub_10B6160 | 2,205 | 1 | Boolean flag (.NEG, .ABS, etc.) |
sub_10B6140 | 1,645 | 1 | Boolean flag variant |
sub_10B2D90 | 538 | 2 | Data type, rounding mode |
sub_10B5580 | 475 | 5 | Shift amount, cache policy |
sub_10B44E0 | 416 | 2 | Addressing mode |
sub_10B6220 | 363 | 3 | Register bank, cache level |
sub_10B4650 | 330 | 4 | Type qualifier, address mode |
sub_10B47F0 | 243 | 4 | Type qualifier variant |
sub_10B2F00 | 151 | 3 | 3-bit modifier field |
sub_10B2F20 | 101 | 4 | 4-bit modifier field |
Modifier fields per instruction range from 0 (simple control instructions) to 18 (the most complex encoder, sub_D89C90 for opcode class 0x5A). The average is approximately 6 modifier fields per encoder. Bit positions in a1+544 concentrate in bits 48-63; bit positions in a1+552 concentrate in bits 0-11.
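The phase-7 OR from the encoder template can be checked arithmetically: shifting a 2-bit modifier left by 55 and masking with 0x180000000000000 places it at bits [56:55] of the primary word (a sketch of the mask math only):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the phase-7 modifier OR: place a 2-bit value at bits [56:55]. */
static uint64_t set_modifier(uint64_t word, int enc_mod) {
    return word | (((uint64_t)enc_mod << 55) & 0x180000000000000ULL);
}
```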
Physical Register Encoding
The SASS instruction encoder uses a two-stage pipeline to convert abstract virtual registers into hardware register fields in the final instruction word. The first stage (Ori encoding, described above in "Register Operand Encoder") packs register type and number into operand slots within the 1280-bit encoding buffer. The second stage (SASS emission) maps the compiler's abstract (register_class, sub_index) pair into an 8-bit hardware register number and writes it into the final 128-bit instruction word. This second stage is implemented by the register-class encoding tables at address range 0x1B4C000--0x1B76000 (Zone A of the emission backend).
Class-to-Hardware Formula
sub_1B6B250 (2965 bytes, 254 callers, 0 callees) is a fully unrolled lookup table that implements the mapping:
hardware_reg = register_class * 32 + sub_index
The function takes two integer arguments (a1, a2) where a1 is the register class (0--5) and a2 is the sub-register index within that class. It is compiled as a deeply nested if-chain covering all 156 valid (class, index) combinations. The decompiler output is 495 lines of cascading conditionals, but every return value satisfies the formula a1 * 32 + a2 exactly:
// sub_1B6B250 -- reconstructed from decompiled lookup table
__int64 register_class_to_hardware(int reg_class, int sub_index) {
    // Valid classes: 0, 1, 2, 3, 4, 5
    // Valid sub-indices: 1..15, 17..27 (index 0 and 16 excluded)
    if (reg_class < 0 || reg_class > 5)
        return 0;                                   // fallthrough: unmatched input
    if (sub_index < 1 || sub_index > 27 || sub_index == 16)
        return 0;
    return 32 * reg_class + sub_index;              // every table entry satisfies this formula
}
The guard wrapper sub_1B73060 (19 bytes, 483 callers) short-circuits the no-register case:
// sub_1B73060 -- guard wrapper
__int64 encode_register_guarded(__int64 ctx, int reg_class, int sub_index) {
if (reg_class | sub_index)
return register_class_to_hardware(reg_class, sub_index);
else
return 0; // no register
}
Per-Class Hardware Number Ranges
Each class occupies a 32-number stride in the hardware register namespace. Within each stride, indices 1--15 and 17--27 are populated (26 registers per class). Index 0 maps to the no-register sentinel via the guard wrapper. Index 16 is absent from the lookup table -- a gap in every class.
| Class | a1 | Hardware Range | Populated Indices | Gap | Likely Register File |
|---|---|---|---|---|---|
| 0 | 0 | 0--27 | 1--15, 17--27 | 16 | R (GPR primary) |
| 1 | 1 | 32--59 | 1--15, 17--27 | 48 | R (GPR secondary) |
| 2 | 2 | 64--91 | 1--15, 17--27 | 80 | P (predicate) |
| 3 | 3 | 96--123 | 1--15, 17--27 | 112 | UR (uniform GPR) |
| 4 | 4 | 128--155 | 1--15, 17--27 | 144 | UR (uniform ext) |
| 5 | 5 | 160--187 | 1--15, 17--27 | 176 | P/UP (uniform pred) |
Hardware numbers 28--31 (and the corresponding padding in each class) are unused, providing alignment to 32-register boundaries. The maximum hardware register number produced by the table is 187 (class 5, index 27). The 8-bit encoding field can represent 0--255, so values 188--255 are reserved.
The index-16 gap in every class is consistent across all 6 classes. This likely corresponds to a hardware-reserved slot or a register numbering convention where physical register class*32+16 has special semantics (potentially a sentinel or a register-file-boundary marker).
Split Bitfield Writer
sub_1B72F60 (32 bytes, 483 callers) writes the 8-bit hardware register number into the SASS instruction word. The encoding is split across two non-contiguous bitfields within a single DWORD:
// sub_1B72F60 -- register field writer (decompiled verbatim)
__int64 write_register_field(__int64 a1, int encoded_reg) {
__int64 buf = *(_QWORD *)(a1 + 112); // instruction encoding buffer
__int64 result = *(_DWORD *)(buf + 12) // DWORD at byte offset 12
| ((_WORD)encoded_reg << 9) & 0x3E00u; // low 5 bits -> [13:9]
*(_DWORD *)(buf + 12) = result
| (encoded_reg << 21) & 0x1C000000; // high 3 bits -> [28:26]
return result;
}
Bit-level layout within the DWORD at *(instruction_buffer + 12):
DWORD bits: 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 .. 0
[ h2:h0 ] [ l4:l3:l2:l1:l0 ]
hw[7:5] hw[4:0]
The DWORD at byte offset 12 covers bits [127:96] of the 128-bit instruction word. In full instruction coordinates:
| Field | DWORD Bits | Instruction Bits | Width | Content |
|---|---|---|---|---|
| Low | [13:9] | [109:105] | 5 bits | hardware_reg[4:0] |
| High | [28:26] | [124:122] | 3 bits | hardware_reg[7:5] |
The 12-bit gap between instruction bits [121] and [110] is occupied by other instruction fields (modifiers, flags, secondary operand encodings). This split-field design is common in GPU ISAs where instruction bits are at a premium and different fields must be routed to different functional unit inputs.
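The split encoding is straightforward to invert; a matching reader for the same DWORD layout (a sketch mirroring sub_1B72F60 -- the reader is mine, not a recovered function):

```c
#include <assert.h>
#include <stdint.h>

/* Write / read the 8-bit hardware register number split across
   DWORD bits [13:9] (hw[4:0]) and [28:26] (hw[7:5]). */
static uint32_t write_reg_field(uint32_t dword, int hw_reg) {
    dword |= ((uint32_t)hw_reg << 9)  & 0x3E00u;      /* low 5 bits  */
    dword |= ((uint32_t)hw_reg << 21) & 0x1C000000u;  /* high 3 bits */
    return dword;
}

static int read_reg_field(uint32_t dword) {
    return (int)(((dword >> 9) & 0x1Fu) | (((dword >> 26) & 0x7u) << 5));
}
```

Round-tripping the maximum table output (187, class 5 index 27) confirms the two fields together carry the full class*32+index range.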
sub_1B72FE0 (32 bytes, 104 callers) is byte-identical to sub_1B72F60 but occupies a different vtable slot, used by a secondary operand encoding path.
Extended Register Encoder
sub_1B6EA20 (7194 bytes, 25 callers) extends the base encoding with operand modifier support. It takes 5 parameters:
// sub_1B6EA20 -- register encoding with modifiers
__int64 encode_register_with_modifiers(
int reg_class, // a1: register class (0-5)
int sub_index, // a2: sub-register index
int negation, // a3: .NEG modifier flag
int abs_value, // a4: |.ABS| modifier flag
int type_modifier // a5: type cast modifier
);
When all modifier flags are zero (a3 | a4 | a5 == 0), the function returns the same value as sub_1B6B250 -- the base class * 32 + index result. When modifiers are present, the function continues into extended encoding logic that packs modifier bits alongside the register number. The guard wrapper sub_1B748C0 (35 bytes, 104 callers) provides the same no-register short-circuit for the extended variant.
Additional encoding variants for different operand positions include sub_1B6D590, sub_1B70640, sub_1B71AD0, sub_1B748F0, and sub_1B76100 (5264--6106 bytes each, 2--49 callers each). All share the same nested-if structural pattern and operate on the same class/index domain.
Encoding Pipeline Summary
The complete register encoding pipeline from virtual register to instruction bits:
Virtual Register (vreg+64 = reg_type, vreg+68 = physical_reg)
|
v
[Ori Encoder -- sub_7BC030, 6147 callers]
Reads: operand+20 (reg_type_raw), operand+4 (reg_num)
Writes: 1-bit presence + 4-bit type + 10-bit number into 1280-bit buffer
|
v
[SASS Emission -- sub_1B6B250 via sub_1B73060, 483 callers]
Input: (register_class, sub_index)
Formula: hardware_reg = class * 32 + sub_index
Output: 8-bit hardware register number (0-187)
|
v
[Bitfield Writer -- sub_1B72F60, 483 callers]
Input: 8-bit hardware register number
Output: split across instruction bits [109:105] and [124:122]
Zone A Function Map
| Function | Size | Callers | Role | Confidence |
|---|---|---|---|---|
sub_1B6B250 | 2,965 B | 254 | Core class*32+index lookup table | HIGH |
sub_1B6EA20 | 7,194 B | 25 | Extended encoding with modifier bits | HIGH |
sub_1B73060 | 19 B | 483 | Guard wrapper for sub_1B6B250 | CERTAIN |
sub_1B748C0 | 35 B | 104 | Guard wrapper for sub_1B70640 | CERTAIN |
sub_1B72F60 | 32 B | 483 | Split bitfield register writer | HIGH |
sub_1B72FE0 | 32 B | 104 | Identical writer (different vtable slot) | HIGH |
sub_1B73080 | 6,106 B | 88 | 3-operand register encoding (class, index, modifier) | HIGH |
sub_1B6D590 | 5,264 B | varies | Register encoding variant (operand position A) | HIGH |
sub_1B70640 | varies | varies | Register encoding variant (operand position B) | HIGH |
sub_1B71AD0 | varies | varies | Register encoding variant (operand position C) | HIGH |
sub_1B748F0 | varies | varies | Register encoding variant (operand position D) | HIGH |
sub_1B76100 | varies | varies | Register encoding variant (operand position E) | HIGH |
Decoder Functions
97 decoder functions in the 0xEB3040--0xED0FE0 range reverse the encoding: they extract operand information from packed SASS bitfields back into Ori IR representation. The decoder entry point is sub_EB3040, a dispatcher that performs binary search on the instruction type word (*(a2+12), *(a2+14), *(a2+15)) against a table at off_22E6380. For instruction types 120/121, it falls through to the generic decoder sub_7BFAE0.
The decoder template mirrors the encoder but in reverse:
void decode_OPCODE(int64_t a1, int64_t a2) {
// 1. Set output instruction type
*(uint16_t*)(a2 + 12) = INSTR_TYPE_ID;
// 2. Load operand format table (same xmmword constants as encoder)
// 3. Set operand count
*(int*)(a1 + 144) = NUM_OPERANDS;
// 4. Decode operands using type-specific decoders
sub_7BD3C0(a1, a2, 0, 0x50, 2); // GPR register (type=2)
sub_7BE090(a1, a2, 1, 0x60, 3); // predicate register (type=3)
sub_7BD650(a1, a2, 2, 0x70, 10); // extended register (type=10)
// 5. Extract control bits (reuse flags, stall counts, yield hints)
sub_7BD260(a1, a2);
// 6. Translate encoded values back to IR references
int reg = sub_AF7DF0(*(void**)(a1+536), extracted_bits);
sub_B056B0(dest_ptr, reg);
int pred = sub_AF7200(*(void**)(a1+536), pred_bits);
sub_AFA380(a2, pred);
// 7. Extract modifier bitfields (reverse of encoder phase 7)
sub_AF53B0(*(void**)(a1+536), *(a1+550) & mask);
sub_AFCEB0(); // commit extracted value
}
Decoder operand count distribution: 6 two-operand, 18 three-operand, 22 four-operand, 16 five-operand, 22 six-operand, 12 eight-operand decoders.
Opcode ID Extractors
Over 100 small functions in the 0x10BF000--0x10C0C00 range serve as opcode discriminators. Each maps an IR instruction node to an opcode ID by reading fields from the operand table. The most-used extractors:
| Function | Encoder Users | Major Opcode Family |
|---|---|---|
sub_10BF440 | 48 | Generic (most common) |
sub_10BF230 | 45 | Generic |
sub_10BF590 | 43 | Generic |
sub_10BFA90 | 30 | 0x59 (IMAD variants) |
sub_10BFD30 | 26 | 0xFD family |
sub_10BFFA0 | 25 | 0x4F family |
sub_10BF580 | 23 | 0x29 (IADD/IADD3) |
sub_10BF680 | 16 | 0x38 (load/store) |
sub_10C0AF0 | 14 | 0xDF (WGMMA) |
89 distinct opcode reader functions cover all instruction families.
Per-SM Architecture Encoding
The encoding system is replicated per SM target. Each SM architecture has its own set of encoder/decoder functions with different xmmword opcode constants. The SM100 (Blackwell datacenter) implementation spans these address ranges:
| Range | Functions | Layer |
|---|---|---|
| 0xD27000--0xDFC000 | 592 | Encoder stubs (p1.12) |
| 0xDFC000--0xEB2AE0 | 494 | Encoder stubs continuation (p1.13) |
| 0xEB3040--0xED0FE0 | 97 | Decoder functions (p1.13) |
| 0x107B1E0--0x10AD700 | 641 | Encoder stubs continuation (p1.16) |
| 0x10ADD30--0x10AFF80 | 78 | Instruction lifecycle & scheduling |
| 0x10B0000--0x10BF2C0 | 2,095 | Bitfield accessor library (p1.16) |
| 0x10C0B20--0x10E32E0 | 184 | Dispatch table megafunctions (p1.16) |
| 0x10EE900--0x1134160 | ~400 | Binary encoders: IR fields to bits (p1.16) |
| 0x1134160--0x114F380 | ~132 | High-level encode path (p1.16) |
The total SM100 codec spans roughly 2.5 MB of binary code across approximately 4,700 functions (including the shared bitfield accessor library).
Other SM targets (SM75 Turing, SM80 Ampere, SM86 Ada, SM89 Lovelace, SM90a Hopper, SM103 Blackwell Ultra, SM120 consumer Blackwell) have parallel encoder populations in the p1.14, p1.15, p1.17--p1.22 address ranges, each with matched xmmword constants for their architecture-specific instruction set.
Per-SM Instruction Format Descriptors
316 instruction format descriptor functions at 0x1732170--0x17A9B70 form the shared, architecture-neutral instruction pattern database. Unlike the per-SM encoder stubs (replicated per architecture at separate address ranges), these descriptors are a single set of functions that describe every SASS opcode variant's encoding geometry: bitfield layout, operand slot configuration, and modifier schema. They are invoked exclusively through virtual dispatch (zero static callers) from the ISel passes (sub_A4BC60, sub_A4D3F0) via the FNV-1a hash-based instruction matcher at sub_1731440.
Descriptor Template
Every descriptor function initializes an Encoding Context object through a fixed 4-phase sequence:
// Phase 1: Opcode header (5 calls for 64-bit, 6 for 128-bit)
sub_7B9B80(a1, 0, 4, FORMAT_CODE); // bits[3:0] format: 1=64b, 2=128b
sub_7B9B80(a1, 4, 3, 0); // bits[6:4] sched group slot
sub_7B9B80(a1, 0x84, 3, 0); // bits[134:132] ext flag (128-bit ONLY)
sub_7B9B80(a1, 8, 9, MAJOR_OP); // bits[16:8] 9-bit major opcode
sub_7B9B80(a1, 0x11, 8, MINOR_OP); // bits[24:17] 8-bit minor opcode
sub_7B9B80(a1, 0x19, 7, FORMAT_ID); // bits[31:25] 7-bit format ID
// Phase 2: Format layout descriptor (Tier 1) -- selects operand geometry
*(__m128i*)(a1 + 8) = xmmword_23FXXXX; // 128-bit format template from rodata
// + bulk copy of 3 x 10 DWORD arrays into a1+24..a1+140
// Phase 3: Architecture modifier table (Tier 2) -- selects per-SM encoding
*(__m128i*)(a1 + 404) = xmmword_YYYYYYY; // per-SM modifier constants
*(DWORD*)(a1 + 420) = VAL1; // explicit modifier overrides
*(DWORD*)(a1 + 424) = VAL2;
// Phase 4: Operand count + standard encoding tail
*(DWORD*)(a1 + 144) = NUM_OPERANDS; // 0--7
sub_7B9D30(a1); // clear constant buffer table
sub_7B9D60(a1, a2, 0); // encode reuse + guard predicate
// Then: opcode extraction, register encoding, modifier field packing
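The sub_7B9B80 calls above take (context, bit offset, width, value) and write into the wide encoding buffer. A minimal model of that packer, assuming LSB-first bit numbering within bytes (the function map below identifies sub_7B9B80 as bitfield_insert; the read-modify-write behavior shown here is an assumption):

```c
#include <stdint.h>

/* Hypothetical reconstruction of sub_7B9B80 (bitfield_insert): writes
 * `width` bits of `value` into `buf` starting at absolute bit `bitpos`.
 * LSB-first bit numbering within bytes is an assumption. */
static void bitfield_insert(uint8_t *buf, unsigned bitpos, unsigned width,
                            uint64_t value) {
    for (unsigned i = 0; i < width; i++, bitpos++) {
        uint8_t mask = (uint8_t)(1u << (bitpos & 7));
        if ((value >> i) & 1)
            buf[bitpos >> 3] |= mask;
        else
            buf[bitpos >> 3] &= (uint8_t)~mask;
    }
}
```

Under this model the Phase 1 header becomes bitfield_insert(buf, 0, 4, FORMAT_CODE), bitfield_insert(buf, 8, 9, MAJOR_OP), and so on, with buf sized to the 1280-bit encoding buffer.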
Two-Tier xmmword Architecture
Each descriptor loads two classes of xmmword constants that together fully specify the instruction encoding:
Tier 1 (at a1+8): Format Layout Descriptor. Selects the instruction format -- operand slot sizes, types, and field layout. These are the 16 format groups documented in the "Instruction Format Group Catalog" section above. Addresses in the 0x23F1xxx--0x23F2xxx rodata range.
Tier 2 (at a1+404): Architecture Modifier Table. Selects per-SM encoding variations for the same format layout. Two instructions with the same Tier 1 descriptor but targeting different architectures use different Tier 2 constants. Addresses span three rodata ranges:
| Rodata Range | Group | Functions | Paired With |
|---|---|---|---|
| 0x202A280--0x202A2B0 | A | ~40 | 202A290 or 202A2A0+202A2B0 at a1+420 |
| 0x22F1B30--0x22F1B50 | B/C | ~8 | None (single 16B block) |
| 0x22F1BA0--0x22F1BB0 | D | ~3 | None |
| 0x22F1AA0--0x22F1AE0 | E | ~3 | (observed in SM100 encoder range) |
| 0x22F1C20--0x22F1C30 | F | ~2 | Paired at a1+404/a1+420 |
| 0x23B2DE0 | G | 4 | None (rare/specialized) |
SM Generation Mapping
The Tier 2 modifier groups correspond to GPU architecture generations. The mapping is inferred from operand table sizes (larger = newer), function counts per group (fewer = newer/specialized), and cross-reference with the per-SM encoder stubs at known address ranges:
| Modifier Address | Probable SM Range | ISA Family | Confidence |
|---|---|---|---|
| 0x202A280--0x202A2B0 | sm_50--sm_75 | Maxwell / Pascal / Volta / Turing | MEDIUM |
| 0x22F1B30--0x22F1B50 | sm_80--sm_86 | Ampere / Ada | MEDIUM |
| 0x22F1BA0--0x22F1BB0 | sm_89--sm_90a | Lovelace / Hopper | MEDIUM |
| 0x22F1AA0--0x22F1AE0 | sm_100+ | Blackwell datacenter | MEDIUM |
| 0x22F1C20--0x22F1C30 | sm_103 / sm_120 | Blackwell Ultra / consumer | LOW |
| 0x23B2DE0 | Cross-arch | Specialized / rare instructions | LOW |
The progression from 0x202A to 0x22F1 to 0x23B2 in rodata address space mirrors the SM generation ordering. Group A (Maxwell--Turing) is the most populous, consistent with the longest-supported ISA family. Groups E and F have the fewest functions, consistent with the newest architectures that introduce fewer format changes.
Format Code Distribution
| Format Code | Instruction Width | Descriptor Count | sub_7B9B80 Header Calls | Notes |
|---|---|---|---|---|
| 1 | 64-bit | ~120 | 5 (no 0x84 call) | Simple moves, branches, barriers, NOP-like control |
| 2 | 128-bit | ~194 | 6 (includes 0x84) | ALU, load/store, texture, tensor core |
| 8 | 256-bit | 2 | Extended | IMAD.WIDE with 16 constant-bank slots |
Descriptor-Initialized Context Fields
The format descriptor writes these fields into the Encoding Context object. All offsets are decimal:
| Offset | Size | Initialized By | Content |
|---|---|---|---|
+8 | 16B | Phase 2 (Tier 1 xmmword) | Format layout descriptor |
+24--+60 | 40B | Phase 2 (bulk copy) | Operand slot sizes (10 DWORDs) |
+64--+100 | 40B | Phase 2 (bulk copy) | Operand slot types (10 DWORDs) |
+104--+140 | 40B | Phase 2 (bulk copy) | Operand slot flags (10 DWORDs) |
+144 | 4B | Phase 4 | Operand count (0--7) |
+404 | 16B | Phase 3 (Tier 2 xmmword) | Architecture modifier table |
+420 | 4B | Phase 3 (scalar) | Architecture modifier field 1 |
+424 | 4B | Phase 3 (scalar) | Architecture modifier field 2 |
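The tabulated offsets can be expressed as a flat C layout. This struct is a hypothetical view: the field names are invented, the leading 8-byte header and the +148..+404 gap are assumptions, and only the byte offsets from the table are recovered:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical view of the Encoding Context; field names are invented,
 * only the byte offsets come from the recovered descriptor writes. */
struct enc_ctx {
    uint64_t header;          /* +0   assumed (vtable or flags) */
    uint8_t  tier1[16];       /* +8   format layout descriptor */
    uint32_t slot_size[10];   /* +24  operand slot sizes */
    uint32_t slot_type[10];   /* +64  operand slot types */
    uint32_t slot_flags[10];  /* +104 operand slot flags */
    uint32_t num_operands;    /* +144 operand count (0..7) */
    uint8_t  gap[256];        /* +148 unrecovered region */
    uint8_t  tier2[16];       /* +404 architecture modifier table */
    uint32_t arch_mod1;       /* +420 modifier override 1 */
    uint32_t arch_mod2;       /* +424 modifier override 2 */
};
```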
Pipeline Position
The format descriptors bridge ISel pattern matching and per-SM encoding:
ISel Pattern Matcher (sub_1731440, FNV-1a hash on *(a2+12))
|
v (virtual dispatch via vtable)
Format Descriptor (one of 316 at 0x1732170--0x17A9B70)
Writes: a1+0..a1+144 (format layout + operand geometry)
Writes: a1+404..a1+424 (architecture modifier table)
|
v (encoding context passed down)
Per-SM Encoder Stub (e.g. 0xD27xxx for SM100)
Reads: format context from descriptor
Writes: a1+544..a1+703 (1280-bit encoding buffer)
Representative Examples
sub_1732170 -- 64-bit float conversion (single-dest):
| Field | Value | Meaning |
|---|---|---|
| Format code | 1 | 64-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0D | Variant D |
| Format ID | 5 | Short-form general (23F1F08) |
| Tier 1 | xmmword_23F1F08 | Short-form general, 27 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |
| Operand count | 3 | Register operands at 0x50, 0x60, 0x70 |
| Modifier fields | 12 | Spanning a1+544 and a1+552 |
sub_1740200 -- 128-bit IMAD.WIDE (dual-dest):
| Field | Value | Meaning |
|---|---|---|
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x23 | IMAD.WIDE family |
| Minor opcode | 0x12 | Variant with modifier 0x13 |
| Format ID | 0x13 | Tensor/extended ALU (23F2678) |
| Tier 1 | xmmword_23F2678 | Extended ALU, 7 opcode classes |
| Tier 2 | xmmword_202A280 | Group A (Maxwell--Turing) |
| Dual-dest | Yes | 0x84 field present, set to 0 |
sub_1732E90 -- 128-bit extended complex:
| Field | Value | Meaning |
|---|---|---|
| Format code | 2 | 128-bit instruction |
| Major opcode | 0x0C | Float conversion family |
| Minor opcode | 0x0C | Same as major (self-referencing variant) |
| Format ID | 0x19 | Extended complex (23F29A8) |
| Tier 1 | xmmword_23F29A8 | Extended complex, 8 opcode classes |
| Tier 2 | xmmword_22F1B30 | Group B (Ampere/Ada) |
Operand Encoding Patterns
The 576 encoder functions in the p1.12 range use 52 distinct operand encoding patterns. The most common:
| Pattern (reg, imm, pred) | Count | Description |
|---|---|---|
| 3 reg + 1 pred | 88 | Standard 3-source with predicate |
| 2 reg + 1 pred | 57 | Binary op with predicate |
| 3 reg only | 43 | Ternary ALU, no predicate/immediate |
| 3 reg + 1 imm + 1 pred | 42 | MAD-class with immediate + predicate |
| 2 reg only | 40 | Simple binary |
| 3 reg + 1 imm | 25 | Ternary with immediate |
| 1 reg + 1 pred | 22 | Unary with predicate |
| 4 reg + 1 imm | 21 | Quaternary with immediate |
| 4 reg only | 20 | Quaternary register-only |
Register operand bit offsets are format-dependent:
- 64-bit format: 0x40, 0x50, 0x60, 0x70
- 128-bit format: 0x60, 0x70, 0x88, 0x98, 0xA8
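All of the recovered register bit offsets happen to be byte-aligned (0x40 = byte 8, 0x50 = byte 10, and so on), so register packing for the 64-bit form reduces to byte stores. A minimal sketch; the 8-bit register field width is an assumption, the offsets are the recovered ones:

```c
#include <stdint.h>

/* Hypothetical: pack up to four 8-bit register IDs into a 64-bit-form
 * instruction word at the recovered bit offsets 0x40/0x50/0x60/0x70.
 * All four offsets are byte-aligned, so whole-byte stores suffice. */
static void pack_regs_64bit_form(uint8_t *word /* >= 16 bytes */,
                                 const uint8_t *regs, unsigned nregs) {
    static const unsigned offs[4] = { 0x40, 0x50, 0x60, 0x70 };
    for (unsigned i = 0; i < nregs && i < 4; i++)
        word[offs[i] >> 3] = regs[i];
}
```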
Major Opcode Summary (SM100)
102 unique major opcodes were identified across 494 encoding variants (p1.13 range alone). Opcode-to-mnemonic mapping is inferred from operand patterns and opcode density; exact mnemonic assignment requires correlation with ROT13-obfuscated instruction names found elsewhere in the binary.
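Correlating an inferred opcode with one of the ROT13-obfuscated instruction names is mechanical once the strings are extracted; the rotation itself is the standard one (the workflow is the author's analysis step, not part of ptxas):

```c
/* ROT13 decoder for the obfuscated mnemonic strings found in the binary. */
static void rot13(char *s) {
    for (; *s; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)('A' + (*s - 'A' + 13) % 26);
        else if (*s >= 'a' && *s <= 'z')
            *s = (char)('a' + (*s - 'a' + 13) % 26);
    }
}
```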
Memory / Load-Store
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x38 | 16 | LDG, STG, LDS, STS |
| 0x60 | 2 | Extended load |
| 0x70--0x72 | 9 | Load groups A/B/C |
| 0xA4--0xA6 | 12 | Load/store with addressing modes |
| 0xAD | 9 | Memory extended |
| 0x1E | 4 | ATOM, ATOMS |
| 0x99, 0xA2 | 2 | Extended atomics |
| 0x39 | 2 | REDUX (reduction) |
Integer Arithmetic
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x59 | 30 | IMAD, IMAD.HI, IMAD.WIDE, ISCADD |
| 0x29 | 24 | IADD3, IADD3.64, IADD32I |
| 0x4F | 25 | Extended integer operations |
| 0x3B | 10 | Integer MUL/MAD extended |
Floating Point
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x3A | 1 | Float operation |
| 0x3E--0x40 | 4 | FFMA, FFMA variants |
| 0x43--0x44 | 2 | Float MUL/MAD |
| 0x4A | 4 | FADD, FMUL, FFMA forms |
| 0x49 | 6 | HFMA2, HADD2, HMUL2 |
| 0x5C | 6 | HFMA2 variants |
| 0x5F | 2 | Half-float extended |
Tensor Core / WGMMA
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0xA8--0xA9 | 16 | Tensor core A/B (WGMMA, HMMA) |
| 0xAB--0xAC | 12 | Tensor core C/D |
| 0xAE--0xB0 | 30 | Tensor core E/F/G |
| 0xB1--0xB3 | 15 | Tensor core H/I/J |
| 0xDF | 14 | WGMMA dispatch (main family) |
| 0x12 | 4 | Matrix operations |
| 0x54 | 6 | Extended matrix |
Control Flow
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x18 | 10 | BRA, SSY, CAL, EXIT, RET, BREAK, CONT |
| 0x19 | 5 | Control flow group B |
| 0x7D | 2 | YIELD, control |
| 0x24 | 2 | BAR, barrier/sync |
| 0xCF | 3 | BARRIER |
| 0xD4 | 2 | BARRIER B |
| 0x33 | 2 | DEPBAR |
Comparison / Predicate
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x0D | 10 | ISETP, FSETP, DSETP |
| 0x17 | 8 | PSETP, PLOP3 |
| 0x95 | 6 | Comparison variants |
Data Movement / Conversion
| Major | Variants | Likely SASS Mnemonics |
|---|---|---|
| 0x61 | 5 | MOV, MOV.64, MOV32I |
| 0x46, 0x66, 0x45 | 3 | MOV variants |
| 0x56 | 6 | F2I, I2F, F2F type conversions |
| 0x62 | 6 | Type conversion group 2 |
| 0x10 | 4 | SEL (conditional select) |
| 0x1B | 3 | PRMT (permute) |
Instruction Object Lifecycle
The instruction object constructor sub_10AFF80 (11 KB, 3 callers: sub_6F0A30, sub_6F52F0, sub_9EE390) takes 32 parameters and builds a ~900-byte instruction-level object:
- 13 sub-object allocations via the vtable allocator (vtable+24)
- 4 linked-list structures for instruction chaining
- 2 string buffers for instruction name and alternate name (via strlen+memcpy)
- Architecture descriptor via sub_B19110(arch_id) at offset +408
- Hash table using FNV-1a (seed 0x811C9DC5, prime 16777619) for instruction record lookup
The instruction unlink-and-recycle functions (sub_10ADF90, sub_10AE190) remove an instruction node from a doubly-linked list (head/tail at a1+48/56), update the count at a1+64, free operand attachments via vtable call, and return the node to a free-list at a1+72. The maximum instruction count per list is 16,383 (checked by sub_10AE7C0).
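The FNV-1a parameters recovered from the hash-table code (seed 0x811C9DC5, prime 16777619) are the standard 32-bit constants, so the hash used for instruction record lookup can be reproduced directly:

```c
#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a with the constants recovered from the hash-table code
 * (seed 0x811C9DC5, prime 16777619 = 0x01000193). */
static uint32_t fnv1a_32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}
```

These match the published FNV-1a test vectors, which is a quick way to confirm a recovered hash loop is the stock algorithm rather than a tweaked variant.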
Encoding Pipeline Layers
The full encoding pipeline operates in three layers, from high-level IR to binary output:
Layer 1: High-level encode (0x1134160--0x114F380, ~132 functions)
Populates full IR records before low-level packing. Uses sub_9B3C20(a1, a2, slot, type, mode, width, reg_id) for register operands and sub_9B3D60 for immediates. Handles 255->1023 sentinel translation for "don't care" register values. Sets opcode/modifier fields via sub_AFA910/sub_AFA930. Applies conditional fixups: e.g., if opcode==2038 && subopcode==2257, sets operand_slot+84 = 5.
Layer 2: Binary encoders (0x10EE900--0x1134160, ~400 functions)
Reads operand fields from IR via sub_10BDxxx extractors, transforms through sub_10Bxxx lookup tables, and packs results into the 128-bit output word at *(QWORD*)(a1+40):
// Typical pattern (sub_10F91D0):
int v6 = sub_10BF170(operand_addr); // extract register class
int v7 = sub_10B6180(lookup_table, v6); // translate to encoding value
*(uint64_t*)(a1 + 40) |= ((uint64_t)v7 << 15); // pack at bit position 15
Includes a register pair encoder (sub_112CDA0, 8.9 KB) that maps 40 register pair combinations (R0/R1, R2/R3, ... R78/R79) to packed output values at 0x2000000 intervals.
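If the 40 pair combinations map linearly onto the 0x2000000-interval output values, the 8.9 KB if-chain in sub_112CDA0 collapses to a single expression. This closed form is an inference from the stated interval, not verified against the decompilation:

```c
#include <stdint.h>

/* Hypothetical closed form of sub_112CDA0: even-aligned pair Rn/Rn+1
 * (n = 0, 2, ..., 78) maps to pair_index * 0x2000000. */
static uint64_t encode_reg_pair(unsigned base_reg /* even, 0..78 */) {
    return (uint64_t)(base_reg / 2) * 0x2000000u;
}
```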
Layer 3: Template encoder stubs (0xD27000--0xEB2AE0, ~1,086 functions)
The lowest-level stubs that directly write the encoding buffer via sub_7B9B80. These are the functions described by the encoder template above.
Variant/Sub-opcode Distribution
The variant field (bits[24:17], 8 bits) has a distribution that peaks at variant 0x05 with 128 functions, suggesting this is the default or most common variant (possibly the .F32 type or the unmodified form):
| Variant | Count | Variant | Count |
|---|---|---|---|
| 0x00 | 21 | 0x08 | 13 |
| 0x01 | 25 | 0x09 | 14 |
| 0x02 | 62 | 0x0A | 10 |
| 0x03 | 24 | 0x0B | 19 |
| 0x04 | 20 | 0x0C | 14 |
| 0x05 | 128 | 0x0D | 9 |
| 0x06 | 30 | 0x0E | 11 |
| 0x07 | 10 | 0x0F--0x2F | decreasing |
Maximum observed variant value is 0x2F (47), giving up to 48 sub-operations per major opcode.
SASS Emission Backend
The final stage of the encoding pipeline operates at the instruction-word level: 11 per-instruction-form bitfield packers at addresses 0x1B79940--0x1B9C220 take a pre-decoded instruction descriptor and pack all fields into a 128-bit SASS instruction word. These functions sit at Level 2 of a 4-level emission hierarchy:
Level 0: SM-target dispatch (0xC4DF70, 0xC53330, 0xC54090, 0xC59610, 0xC5ABE0, 0xC5B5C0)
Level 1: Emission orchestrators (Zone C: 0x1BA0000-0x1BE5000, ~150 functions)
Level 2: Per-form bit packers (Zone B: 0x1B79940-0x1B9C220, 11 functions, THIS SECTION)
Level 3: Register class encoders (Zone A: 0x1B4C000-0x1B76000, ~40 functions)
Each function has exactly 1 caller and 0 callees (pure bitfield packing, no external calls). Sizes range from 6836 to 6980 bytes of compiled code. All 11 share an identical combinator body (verified: same 453 LABEL_xxx targets, same 75 unique OR-mask constants, same max comparison value of 27). They differ only in two things: the opcode base constant, and the prologue field-packing sequence.
Input / Output Interface
int *__fastcall emit_instruction_form_X(int *a1) {
// a1 = pre-decoded instruction descriptor (array of 32-bit ints)
// Returns: pointer to the output buffer (also stored at *((_QWORD*)a1 + 14))
int *result = *((_QWORD *)a1 + 14); // output = 128-bit instruction word
// result[0] = instruction bits [31:0] (opcode base, guard pred, sched group)
// result[1] = instruction bits [63:32] (register operand fields, modifiers)
// result[2] = instruction bits [95:64] (immediate/offset, auxiliary fields)
// result[3] = instruction bits [127:96] (predicate control, combinator encoding)
// ... prologue field packing and combinator (Phases 1--2) ...
return result;
}
The input struct a1 is a flat array of pre-extracted instruction fields. Fields a1[0] through a1[3] carry common header values; a1[4] through a1[15] carry instruction-specific operand data (which indices are used depends on the instruction form).
Phase 1: Prologue -- Opcode Base and Field Packing
Every function begins with the same template, parameterized by different constants:
// 1. Load output buffer pointer
result = *((_QWORD *)a1 + 14);
// 2. OR opcode base into result[0] -- unique 12-bit constant per function
*result |= OPCODE_BASE; // e.g., 0xA1E, 0x81B, 0x803
// 3. Pack guard predicate: bits [14:12] of result[0]
*result |= ((unsigned short)a1[1] << 12) & 0x7000;
// 4. Pack scheduling group: bits [16:15] of result[0]
*result |= (unsigned short)((unsigned short)a1[2] << 15);
// 5. Pack predicate encoding: bits [25:20] of result[3]
result[3] |= (a1[3] << 20) & 0x3F00000;
// 6. Pack instruction-specific operand fields (VARIES PER FUNCTION)
// Each function packs a different set of a1[6..15] fields into
// result[0], result[1], result[2] using different shifts and masks.
// 7. Set base combinator mask: bits [19:14] of result[3] = 0x3F
result[3] |= 0xFC000;
The prologue is the sole source of variation between functions. The field-packing differs in which a1[] indices are used, which shift amounts are applied, and which result[] DWORDs are targeted.
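The shared prologue compiles down to a handful of OR operations. A runnable rendering of steps 1-5 and 7 (step 6 is omitted because it varies per function; the function name and output-as-array convention are invented):

```c
#include <stdint.h>

/* Common prologue shared by all 11 packers: OR the opcode base and the
 * three header fields into the 128-bit output word (four DWORDs). */
static void pack_prologue(uint32_t result[4], const int32_t *a1,
                          uint32_t opcode_base) {
    result[0] |= opcode_base;                          /* 12-bit opcode base */
    result[0] |= ((uint32_t)a1[1] << 12) & 0x7000;     /* guard predicate    */
    result[0] |= (uint16_t)((uint16_t)a1[2] << 15);    /* scheduling group   */
    result[3] |= ((uint32_t)a1[3] << 20) & 0x3F00000;  /* predicate encoding */
    result[3] |= 0xFC000;                              /* base combinator mask */
}
```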
The 11 Functions and Their Opcode Bases
| Function | Size | Opcode Base | Family | Caller Chain |
|---|---|---|---|---|
sub_1B79940 | 6,900 B | 0xA1B | 0xAxx | sub_1BA5340 via sub_C4DF70 |
sub_1B7B440 | 6,868 B | 0x81B | 0x8xx | sub_1BA5340 via sub_C4DF70 |
sub_1B87740 | 6,852 B | 0x238 | 0x2xx | sub_1BA8D80 via sub_C53330 |
sub_1B89350 | 6,836 B | 0x213 | 0x2xx | sub_1BA8E80 via sub_C54090 |
sub_1B8FFE0 | 6,852 B | 0x202 | 0x2xx | sub_1BA8D80 via sub_C53330 |
sub_1B92590 | 6,868 B | 0x803 | 0x8xx | sub_1BACB10 (direct) |
sub_1B94390 | 6,964 B | 0x21D | 0x2xx | sub_1BACDC0 via sub_C59610 |
sub_1B95ED0 | 6,980 B | 0xA1E | 0xAxx | sub_1BACDC0 via sub_C59610 |
sub_1B985B0 | 6,852 B | 0x804 | 0x8xx | sub_1BAD6D0 (direct) |
sub_1B9A430 | 6,884 B | 0x807 | 0x8xx | sub_1BB1110 via sub_C5ABE0 |
sub_1B9C220 | 6,884 B | 0x81A | 0x8xx | sub_1BAD920 via sub_C5B5C0 |
The opcode bases cluster into three families by high nibble:
- 0x2xx (4 functions): bases 0x202, 0x213, 0x21D, 0x238
- 0x8xx (5 functions): bases 0x803, 0x804, 0x807, 0x81A, 0x81B
- 0xAxx (2 functions): bases 0xA1B, 0xA1E
Phase 2: Combinator -- 3-Axis Predicate Encoding into result[3]
After the prologue, all 11 functions execute an identical ~1900-line decision tree. This combinator reads three integer values from the input struct and produces a single 32-bit mask that is ORed into result[3].
The three axes are:
- axis0 = a1[0]: instruction class selector, values 0..5 (6 values)
- axis1 = a1[4]: slot/form index, values 1..27 (26 populated, gap at 16)
- axis2 = a1[N]: sub-mode flag, values 0 or 1 (N varies per function: a1[8], a1[9], a1[10], a1[11], or a1[15])
The combinator exits immediately if all three axes are zero (!(axis0 | axis1 | axis2)). Otherwise it walks a nested decision tree that tests axis0 values (0 through 5), axis1 values (1 through 27), and axis2 values (0 or 1), and ORs the appropriate mask into result[3]:
// Reconstructed combinator logic (pseudocode):
if (axis0 == 0 && axis1 == 0 && axis2 == 0) return;
// For axis0 values 1-5 combined with axis1 values 1-15:
// result[3] |= prefix_for_axis0 | 0xFC000 | (axis1 << 9)
//
// For axis1 values 17-27 combined with axis2:
// result[3] |= base_mask_for_axis1 (if axis2 == 0)
// result[3] |= extended_mask_for_axis1 (if axis2 == 1)
Combinator Mask Encoding
The 75 unique masks in the FC/FD series decompose as:
result[3] bit layout for combinator-generated fields:
bits [19:14] = 0x3F (always set by prologue base 0xFC000)
bits [13:9] = slot_index (5-bit, derived from axis1, values 1-27)
bits [28:26] = axis0 prefix encoding (3-bit, for axis0 values 1-5)
The 5 prefix values correspond to axis0 encodings 1-5:
| axis0 Value | Prefix OR'd | Prefix Bits [28:26] |
|---|---|---|
| 0 | 0x00000 | 000 (no prefix) |
| 1 | 0x40xxxxx | 001 |
| 2 | 0x80xxxxx | 010 |
| 3 | 0xC0xxxxx | 011 |
| 4 | 0x100xxxxx | 100 |
| 5 | 0x140xxxxx | 101 |
Combined with 15 slot values (axis1 = 1..15), this produces 5 x 15 = 75 masks in the 0xFC200--0x140FCE00 range.
For axis1 values 17--27, the masks shift into the 0xFE200--0xFF600 range. These slots use only the "no prefix" and "prefix 0x100" variants (axis0 values 0 and 4), and the axis2 flag selects between the two. This yields 12 additional unique masks for the high-slot range (11 base plus 11 extended, with overlapping values counted once).
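For the low-slot range the reconstructed mask formula is simple enough to state as code. This sketch models only that branch, under the bit layout inferred above (the high-slot axis1 = 17..27 masks are irregular and not modeled):

```c
#include <stdint.h>

/* Hypothetical closed form for the low-slot combinator masks:
 * base mask 0xFC000, slot index in bits [13:9], axis0 prefix in
 * bits [28:26], per the reconstructed pseudocode. */
static uint32_t combinator_mask_low(unsigned axis0 /* 0..5 */,
                                    unsigned axis1 /* 1..15 */) {
    return ((uint32_t)axis0 << 26) | 0xFC000u | ((uint32_t)axis1 << 9);
}
```

Note that the formula reproduces both endpoints of the observed mask range: axis0 = 0, axis1 = 1 gives 0xFC200, and axis0 = 5, axis1 = 7 gives 0x140FCE00.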
Why the Combinator Exists
The combinator encodes an architecture-independent mapping from a 3-dimensional instruction property coordinate to a hardware-specific bitfield pattern in the predicate/control section of the 128-bit instruction word. This section (bits [127:96]) controls:
- Guard predicate assignment (bits [25:20] from prologue)
- Scheduling mode (bits [19:14] base + combinator overlay)
- Instruction form variant (bits [13:9] from combinator)
- Predicate class / condition code routing (bits [28:26] from combinator)
The identical combinator across all 11 functions confirms that this is not an opcode-specific encoding but rather a cross-cutting encoding for predicate/scheduling state that applies uniformly to all instruction forms.
Equivalent Lookup Table
The entire ~1,900-line decision tree can be replaced by a flat table of 6 x 28 x 2 = 336 entries (axis2 takes only the values 0 and 1):
// Equivalent reconstruction:
static const uint32_t combinator_table[6][28][2] = { ... };
// Access: result[3] |= combinator_table[axis0][axis1][axis2];
// Table size: 336 * 4 = 1,344 bytes (vs ~6,800 bytes of code per function)
The compiler chose a decision tree over a table lookup, likely because the C++ source used nested switch/case statements (or if/else chains with early return), and the optimizer did not convert this to a table at -O2.
Zone B Function Map (Emission Cluster)
| Address | Size | Opcode Base | Caller | Confidence |
|---|---|---|---|---|
sub_1B79940 | 6,900 B | 0xA1B | sub_1BA5340 | HIGH |
sub_1B7B440 | 6,868 B | 0x81B | sub_1BA5340 | HIGH |
sub_1B87740 | 6,852 B | 0x238 | sub_1BA8D80 | HIGH |
sub_1B89350 | 6,836 B | 0x213 | sub_1BA8E80 | HIGH |
sub_1B8FFE0 | 6,852 B | 0x202 | sub_1BA8D80 | HIGH |
sub_1B92590 | 6,868 B | 0x803 | sub_1BACB10 | HIGH |
sub_1B94390 | 6,964 B | 0x21D | sub_1BACDC0 | HIGH |
sub_1B95ED0 | 6,980 B | 0xA1E | sub_1BACDC0 | HIGH |
sub_1B985B0 | 6,852 B | 0x804 | sub_1BAD6D0 | HIGH |
sub_1B9A430 | 6,884 B | 0x807 | sub_1BB1110 | HIGH |
sub_1B9C220 | 6,884 B | 0x81A | sub_1BAD920 | HIGH |
SM89/90 Codec Layer
SM89 (Ada Lovelace) and SM90 (Hopper) share a pre-encoding instruction reordering layer absent from SM100 (Blackwell). This layer sits above the three-layer encoding pipeline: it manipulates Mercury IR instruction lists to optimize instruction interleaving before the encoding stubs pack bitfields. The entire cluster spans addresses 0x1226E80--0x1233D70, roughly 261 KB of compiled code across 18 functions.
Call Chain
sub_C60910 / sub_C5FEF0 SM-target dispatch (Level 0, 0xC5xxxx range)
|
v
sub_1233D70 (6 KB) Orchestrator: guards on knob 487 and O-level > 1,
| sets up cost-function parameters, calls A then B
|
+-> sub_122AD60 (112 KB) Pass A: classify instructions + reorder within blocks
+-> sub_122F650 (105 KB) Pass B: scheduling-aware emission ordering across blocks
+-> sub_A112C0 Post-pass finalization
The orchestrator sub_1233D70 is called only when the optimization level exceeds 1 (sub_7DDB50(ctx) > 1). It reads floating-point cost weights from the target descriptor via knob offsets +7200, +7560, +7128, +7272 and passes them through to both passes. Default base weights are 1.8, -0.8, 3.2, -2.2.
Pass A: Instruction Classification and Reordering (sub_122AD60)
4,118 decompiled lines. Traverses every instruction in each basic block and sorts them into 4 linked-list queues by instruction category:
| Category | Return Code | Instruction Type | Queue Role |
|---|---|---|---|
| Branch / control-flow | 0 | type 9 (BRA, EXIT, RET, ...) | Held at block boundaries |
| Load | 1 | type 12 (LDG, LDS, ...) | Scheduled early for latency hiding |
| Store | 2 | type 5 (STG, STS, ...) | Deferred to maximize distance from load |
| General ALU | 4 | type 4 (IADD, FFMA, ...) | Interleaved between memory ops |
| Uncategorized | 3 | other / missing info | Treated as general |
The classifier is sub_1228670 (30 lines), which reads the instruction scheduling class via sub_7E2FE0 and returns 0--4. A companion predicate sub_1228EF0 (38 lines) returns 0 for types 9, 5, and 12 (the "special" categories), 1 for everything else.
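Both helpers are small enough to restate directly. This rendering follows the category table and the stated predicate behavior; the scheduling-class numbers are the recovered type codes, and the ALU mapping for type 4 is taken from the table:

```c
/* Reconstruction of sub_1228670: map an instruction's scheduling class
 * to the 5-category queue code used by Pass A. */
static int classify(int sched_class) {
    switch (sched_class) {
    case 9:  return 0;   /* branch / control flow */
    case 12: return 1;   /* load */
    case 5:  return 2;   /* store */
    case 4:  return 4;   /* general ALU */
    default: return 3;   /* uncategorized -> treated as general */
    }
}

/* Reconstruction of sub_1228EF0: returns 0 for the "special" classes
 * 9/5/12, 1 for everything else. */
static int is_special(int sched_class) {
    return !(sched_class == 9 || sched_class == 5 || sched_class == 12);
}
```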
After classification, Pass A performs register-class-aware instruction motion: it uses sub_91BF30 (register class builder), sub_91E390 (class query), and sub_91E610 (class intersection) to verify that moving an instruction does not violate register-class constraints. Instructions that pass the check have their operand flags updated at +48 (bit 0x40 = "moved" marker) and +96 (copy-chain tracking).
The reordering step sub_122AA30 (186 lines) performs the final within-block interleaving. sub_1227D90 (522 lines) handles the actual linked-list surgery: unlink an instruction from its current position and reinsert it at a new location.
Pass B: Scheduling-Aware Emission Ordering (sub_122F650)
3,917 decompiled lines. Takes the classified instruction lists from Pass A and determines the emission order that optimizes scheduling. Operates on 8 bitvector arrays allocated via the sub_BDxxxx bitvector library:
| Bitvector | Purpose |
|---|---|
| v521 | Main liveness set (all instructions) |
| v523 | Load-group register liveness |
| v525 | Store-group register liveness |
| v527 | ALU-group register liveness |
| v529 | Control-flow register liveness |
| v531 | Cross-block interference set |
| v533 | Scheduling priority set |
| v535 | Secondary interference set |
Each bitvector is sized to the function's total register count (*(ctx+224)). Pass B iterates through instructions, populates the bitvectors with defined-register information via sub_BDBC70 (set bit), then merges category-specific vectors into the main set via sub_BDC5F0 (union) in an order determined by the dependency analysis.
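The two bitvector operations Pass B leans on (sub_BDBC70 set-bit, sub_BDC5F0 union) have conventional semantics; a minimal model, sized to the function's register count as described above (names and layout are invented):

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal model of the sub_BDxxxx bitvector library as used by Pass B:
 * one bit per virtual register, word-parallel union. */
typedef struct { uint64_t *words; size_t nwords; } bitvec;

static bitvec bv_alloc(size_t nbits) {
    size_t nw = (nbits + 63) / 64;
    bitvec v = { calloc(nw, sizeof(uint64_t)), nw };
    return v;
}
static void bv_set(bitvec *v, unsigned bit) {          /* cf. sub_BDBC70 */
    v->words[bit >> 6] |= 1ull << (bit & 63);
}
static int bv_test(const bitvec *v, unsigned bit) {
    return (int)((v->words[bit >> 6] >> (bit & 63)) & 1);
}
static void bv_union(bitvec *dst, const bitvec *src) { /* cf. sub_BDC5F0 */
    for (size_t i = 0; i < dst->nwords; i++)
        dst->words[i] |= src->words[i];
}
```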
The single switch at line 2578 dispatches on the instruction category:
case 4 (ALU): merge load + store + ALU vectors into main
case 0 (branch): merge load vector only
case 3 (uncategorized): merge store vector only
case 1 (load): merge ALU vector only
case 2 (store): no merge (kept separate from the main set)
Knob-derived flags control reordering aggressiveness:
- Knob at target offset +7416 (index ~103): enable load reordering
- Knob at target offset +7488: enable general reordering
- All reordering disabled when *(ctx+1584)+372 == 12288 (a specific regalloc configuration)
Pass B also maintains a red-black tree structure for the emission schedule, with standard left/right/parent pointers at node offsets 0, 8, 16.
Differences from SM100
| Aspect | SM89/90 | SM100 (Blackwell) |
|---|---|---|
| Pre-encode reordering | Present (sub_122AD60 + sub_122F650) | Absent -- scheduling integrated into own pass |
| Instruction classification | 5-category scheme (branch/load/store/ALU/other) | 370-category opcode dispatch via megafunctions |
| Cost model | Floating-point heuristic (4 tunable weights) | Table-driven via hardware profile records |
| Liveness tracking | 8 bitvectors per block | Handled in scheduling pass, not in encoding |
| Knob control | Knobs 103, 106, 218, 230, 487, 501 | Different knob set for Blackwell scheduler |
| Register class validation | sub_91BF30/sub_91E390 per-move check | Per-instruction class check at encoding time |
| Binary encoder calls | None -- IR-level manipulation only | sub_7B9B80 (18,347 callers) |
The SM89/90 pair operates entirely at the Mercury IR level and produces no packed instruction bits. It rewrites the instruction linked lists in each basic block to optimize scheduling, after which the standard encoding pipeline (Layers 1--3) runs on the reordered sequence. SM100 Blackwell does not need this layer because its scheduling infrastructure (documented in scheduling/algorithm.md) already integrates instruction ordering into the scheduling pass itself.
SM89/90 Codec Function Map
| Address | Size | Lines | Identity | Confidence |
|---|---|---|---|---|
sub_1233D70 | 6 KB | 321 | sm89_orchestrator -- guards, cost params, calls A+B | HIGH |
sub_122AD60 | 112 KB | 4,118 | sm89_classify_reorder -- instruction classification + block reordering | HIGH |
sub_122F650 | 105 KB | 3,917 | sm89_emission_order -- scheduling-aware emission ordering | HIGH |
sub_122AA30 | ~3 KB | 186 | local_reorder -- within-block instruction interleaving | HIGH |
sub_1227D90 | ~9 KB | 522 | instruction_reinsert -- unlink + reinsert at new position | HIGH |
sub_122F1E0 | ~6 KB | 330 | scheduling_heuristic -- cost-function comparison for emission order | MEDIUM |
sub_1228670 | ~0.5 KB | 30 | instruction_classify -- 5-category classifier (returns 0--4) | CERTAIN |
sub_1228EF0 | ~0.5 KB | 38 | is_special -- predicate: types 9/5/12 return false | CERTAIN |
sub_1226E80 | ~0.3 KB | 22 | list_prepend -- insert instruction at list head | CERTAIN |
sub_1226EB0 | ~5 KB | 274 | instruction_finalize -- post-reorder operand fixup | HIGH |
sub_1227820 | ~1 KB | 77 | operand_offset_update -- adjust operand offsets after move | HIGH |
sub_1227B60 | ~0.5 KB | 31 | motion_check -- can instruction move to new position? | HIGH |
sub_1228FA0 | ~2 KB | 100 | regclass_propagate -- propagate register class after move | HIGH |
sub_12292B0 | ~0.5 KB | 38 | queue_init_A -- initialize classification queue | HIGH |
sub_1229330 | ~0.5 KB | 38 | queue_init_B -- initialize classification queue | HIGH |
sub_1229BD0 | ~2 KB | 107 | tree_rebalance -- red-black tree rebalance | MEDIUM |
sub_122A050 | ~1 KB | 77 | pre_pass_init -- initialize pass A state object | HIGH |
sub_122A1A0 | ~2 KB | 139 | block_resize -- resize bitvector for new block count | HIGH |
Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
sub_7B9B80 | 216 B | 18,347 | bitfield_insert -- core packer into 1280-bit buffer | CERTAIN |
sub_7B9D30 | 38 B | 2,408 | clear_cbuf_slots -- memset(a1+468, 0xFF, 64) | HIGH |
sub_7B9D60 | 408 B | 2,408 | encode_reuse_predicate -- reuse flags + guard predicate | HIGH |
sub_7BC030 | 814 B | 6,147 | encode_register -- GPR operand encoder | HIGH |
sub_7BC360 | ~500 B | 126 | encode_uniform_register -- UR operand encoder | HIGH |
sub_7BC5C0 | 416 B | 1,449 | encode_predicate -- predicate operand encoder | HIGH |
sub_7BCF00 | 856 B | 1,657 | encode_immediate -- immediate/cbuf operand encoder | HIGH |
sub_7BD260 | ~300 B | 96 | decode_finalize -- extract control bits | HIGH |
sub_7BD3C0 | ~500 B | 286 | decode_register -- GPR operand decoder | HIGH |
sub_7BD650 | ~400 B | 115 | decode_register_alt -- destination register decoder | HIGH |
sub_7BE090 | ~400 B | 50 | decode_predicate -- predicate operand decoder | HIGH |
sub_10B6180 | 21 B | 8,091 | encode_bool_field -- 1-bit opcode-to-control mapping | HIGH |
sub_10B6160 | 21 B | 2,205 | encode_bool_field_B -- 1-bit flag variant | HIGH |
sub_10B6140 | 21 B | 1,645 | encode_bool_field_C -- 1-bit flag variant | HIGH |
sub_10AFF80 | 11 KB | 3 | instruction_constructor -- 32-param object builder | HIGH |
sub_10ADF90 | 2.2 KB | 357 | instruction_unlink -- linked-list remove + recycle | HIGH |
sub_10B0BE0 | 6.5 KB | -- | hash_table_insert_64 -- FNV-1a, 8-byte key, 4x resize | HIGH |
sub_10B1C30 | 3.9 KB | -- | hash_table_insert_32 -- FNV-1a, 4-byte key | HIGH |
sub_10C0B20 | 180 KB | 3,109 | setField -- field value writer dispatch | HIGH |
sub_10D5E60 | 197 KB | 961 | getFieldOffset -- field bit-position lookup dispatch | HIGH |
sub_10E32E0 | 187 KB | 72 | hasField -- field existence query dispatch | HIGH |
sub_10CCD80 | 142 KB | 4 | setFieldDefault -- default value writer dispatch | MEDIUM |
sub_10CAD70 | 68 KB | 74 | getOperandFieldOffset -- per-operand field offset dispatch | HIGH |
sub_10C7690 | 65 KB | 288 | setOperandField -- per-operand field writer dispatch | HIGH |
sub_AF7DF0 | -- | 7,355 | encoded_to_ir_register -- hardware reg to IR translation | HIGH |
sub_AF7200 | -- | 552 | encoded_to_ir_predicate -- hardware pred to IR translation | HIGH |
sub_EB3040 | 1.9 KB | -- | decode_dispatcher -- binary search on instruction type | HIGH |
sub_112CDA0 | 8.9 KB | -- | register_pair_encoder -- 40-pair mapping via if-chain | HIGH |
Cross-References
- Mercury Encoder -- the assembler backend that invokes the encoding phase
- Capsule Mercury & Finalization -- post-encoding finalization
- SASS Opcode Catalog -- full mnemonic table
- Instruction Selection -- the preceding pipeline phase
- Blackwell (SM 100-121) -- SM100 architecture details
- IR Instructions & Opcodes -- the Ori IR instruction format consumed by the encoder
Peephole Optimization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The peephole optimization pass in ptxas is the single largest subsystem by code volume in the entire binary. Three monolithic dispatch functions -- totaling approximately 750 KB of machine code -- implement a brute-force pattern-match-and-rewrite engine that recognizes instruction idioms in the internal IR and replaces them with more efficient SASS instruction forms. Each dispatch function serves a different compilation context (generic, SM120-specific, and post-scheduling), but all three share the same architecture: a giant opcode-based switch dispatches to hundreds of pattern matchers; the highest-priority match wins; the winning rewrite modifies the instruction in-place.
None of the three mega-dispatchers can be decompiled by Hex-Rays due to their extreme size (233--280 KB each). All analysis in this page derives from disassembly, call graphs, and the 3,185 pattern-matcher functions that they invoke.
Scale Summary
| Dispatch function | Binary size | Instructions | Pattern matchers | Total call sites | Entry trampoline | Context |
|---|---|---|---|---|---|---|
| sub_169B190 | 280 KB | 65,999 | 762 | 15,870 | sub_B12930 | Generic (all SM) |
| sub_143C440 | 233 KB | ~56,241 | 1,087 | 1,971 | sub_B12940 | SM120-specific |
| sub_198BCD0 | 233 KB | 54,043 | 1,336 | 13,391 | sub_B12960 | Post-scheduling |
All three entry trampolines (sub_B12930, sub_B12940, sub_B12960) are 11-byte thunks that strip or forward one argument and tail-call the corresponding mega-dispatcher.
Pipeline Position
IR instruction stream
|
v
sub_B12930 -----> sub_169B190 (generic peephole)
|
v
sub_B12940 -----> sub_143C440 (SM120 peephole, RTX 50-series / Pro)
|
v
[instruction scheduling]
|
v
sub_B12960 -----> sub_198BCD0 (post-schedule peephole)
|
v
[instruction encoding via vtable]
The generic and SM120 dispatchers run before scheduling; the post-scheduling
dispatcher runs after. The SM120 dispatcher (sub_143C440) appears to be
architecture-gated -- it is called only when compiling for SM 120 targets
(consumer RTX 50-series, enterprise Pro GPUs).
Dispatch Architecture
All three mega-dispatchers follow the same algorithm.
Entry and primary switch
push callee-saves
sub rsp, 10h
mov rbp, rdi ; ctx
mov rbx, rsi ; instruction node
mov [rsp+var_2C], -1 ; best_template_id = NONE
mov [rsp+var_30], -1 ; best_priority = NONE
movzx edi, word [rsi+0Ch] ; read opcode field
call sub_13B9DC0 ; identity / normalization (returns opcode)
cmp ax, 174h ; 373 cases (opcodes 0..372)
ja default
jmp [jump_table + rax*8] ; PRIMARY SWITCH on opcode
The 16-bit opcode at instruction node offset +0x0C selects a primary case.
All three dispatchers use 373-case primary switches.
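The recovered protocol can be sketched in C. Everything below is an illustrative reconstruction: the function and type names, and the table-of-matchers representation, are invented; only the opcode field at +0x0C, the 373-case bound, and the four-argument matcher prototype come from the disassembly.

```c
/* Hypothetical reconstruction of the mega-dispatcher protocol.
 * Names are invented; offsets follow the recovered node layout. */
#include <stdint.h>
#include <stddef.h>

#define TEMPLATE_NONE (-1)
#define NUM_OPCODES   373           /* primary switch: cases 0..372 */

typedef struct PeepCtx PeepCtx;     /* opaque peephole context */
typedef struct Instr   Instr;

/* Matcher prototype shared by all 3,185 pattern matchers. */
typedef char (*MatcherFn)(PeepCtx *ctx, Instr *instr,
                          int32_t *template_id, int32_t *priority);

struct Instr { uint8_t raw[0x90]; };

static uint16_t opcode_of(const Instr *i) {
    return *(const uint16_t *)(i->raw + 0x0C);   /* opcode at +0x0C */
}

/* Per-opcode dispatch: run every matcher registered for the opcode,
 * keep the best (template_id, priority) pair, return the winner. */
int dispatch(PeepCtx *ctx, Instr *instr,
             MatcherFn *const *matchers_by_opcode,
             const size_t *matcher_counts) {
    uint16_t op = opcode_of(instr);
    if (op >= NUM_OPCODES)
        return TEMPLATE_NONE;                    /* the 'ja default' path */
    int32_t best_template = TEMPLATE_NONE;       /* var_2C = -1 */
    int32_t best_priority = TEMPLATE_NONE;       /* var_30 = -1 */
    for (size_t k = 0; k < matcher_counts[op]; k++)
        matchers_by_opcode[op][k](ctx, instr, &best_template, &best_priority);
    return best_template;   /* the secondary switch would act on this */
}

/* Demo matcher: claims template 42 at priority 3 for opcode 5. */
static char demo_matcher(PeepCtx *ctx, Instr *instr,
                         int32_t *template_id, int32_t *priority) {
    (void)ctx;
    if (opcode_of(instr) != 5) return 0;
    if (*priority < 3) { *priority = 3; *template_id = 42; }
    return 1;
}

int demo_dispatch(void) {
    Instr i = {{0}};
    *(uint16_t *)(i.raw + 0x0C) = 5;
    MatcherFn row[] = { demo_matcher };
    MatcherFn *const table[NUM_OPCODES] = { [5] = row };
    size_t counts[NUM_OPCODES] = { [5] = 1 };
    return dispatch(NULL, &i, table, counts);
}
```

In the real binary the matcher lists are not data tables but straight-line call sequences inside each primary case; the table here only makes the per-opcode grouping explicit.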
Per-case pattern matching
Within each primary case, the dispatcher:
- Calls a sequence of pattern-matcher functions, passing pointers to best_template_id and best_priority as out-parameters.
- Each matcher may update these if it finds a match with higher priority than the current best.
- After all matchers for the opcode have run, the dispatcher checks best_template_id. If it is no longer -1, a secondary switch on the template ID selects the rewrite action.
The secondary switches are embedded inside the giant function.
sub_143C440 alone contains 85 secondary jump tables (sizes 7--190 cases),
totaling 1,971 switch cases.
Rewrite action
When a rewrite is selected, the action block performs four operations:
setRewrittenOpcode(instr, new_opcode); // sub_B28F10: writes byte at instr+14
setRewrittenModifier(instr, new_modifier); // sub_B28F20: writes byte at instr+15
setOperandMapping(instr, slot, value); // sub_BA9CF0: writes instr+72+4*slot
markRewritten(instr); // sub_BA9C30 or sub_BA9CB0
sub_BA9C30 (markRewrittenSimple) sets bit 0 of the flags word at instr+140:
*(uint32_t*)(instr + 140) |= 1;
sub_BA9CB0 (markRewrittenComplex) applies priority-aware flag logic that
respects existing rewrites from earlier passes -- it sets bits to 0x8
("superseded") when a higher-priority rewrite exists.
The symmetry of call frequencies in sub_143C440 confirms this: setRewrittenOpcode
and setRewrittenModifier are each called exactly 1,759 times -- every rewrite
always sets both the opcode and modifier bytes.
Pattern Matcher Signature
Every one of the 3,185 pattern matchers shares the same prototype:
char __fastcall match(
int64_t ctx, // a1: peephole optimization context
int64_t instr, // a2: instruction node being examined
int32_t *template_id, // a3: output -- combined opcode / template ID
int32_t *priority // a4: input/output -- current best priority
);
The function returns a char (the last comparison result, used for early-exit
optimization in the caller), but the meaningful outputs are *template_id and
*priority.
Matching algorithm
Every matcher performs a deeply-nested chain of checks:
Step 1 -- Modifier/property checks.
Call queryModifier(ctx, instr, slot) (sub_10AE5C0) repeatedly. Each call
returns an enumerated value for a specific instruction property:
if (queryModifier(ctx, instr, 0xDC) != 1206) return 0; // data type != .f32
if (queryModifier(ctx, instr, 0x163) != 1943) return 0; // rounding != .rn
if (queryModifier(ctx, instr, 0x7E) - 547 > 1) return 0; // saturation out of range
The slot indices (0x05, 0x7B, 0x7E, 0x88, 0x90, 0xA1, 0xBE, 0xD2, 0xD3, 0xDC, 0xF2, 0x101, 0x119, 0x126, 0x127, 0x142, 0x152, 0x155, 0x159, 0x15C, 0x163, 0x167, 0x178, 0x179, 0x18A, 0x18D, 0x196, 0x197, 0x199, 0x19D, 0x1A8, 0x1AD, 0x1AE, 0x1AF, 0x1B2, 0x1D1, 0x1D2, 0x1E0, 0x1E4, 0x1EC, 0x216, 0x253, etc.) index into a per-instruction property table covering type, rounding mode, saturation, negate, comparison type, and architecture-specific modifiers.
Step 2 -- Operand count. Check the number of explicit/fixed operands and the total operand slot count:
int fixed = getExplicitOperandCount(instr); // sub_B28F50: returns *(instr+92)
int total = getTotalOperandSlots(instr); // sub_B28F40: returns *(instr+40)+1 - *(instr+92)
Step 3 -- Operand type and register class validation. For each operand slot, retrieve the operand pointer and check its kind:
void *op = getOperand(instr, idx); // sub_B28F30: returns *(instr+32) + 32*idx
byte kind = *(byte*)op;
if (!isRegister(kind)) return 0; // sub_13B9CD0: kind == 2
if (!isImmediate(kind)) return 0; // sub_13B9CE0: kind == 1 (used by immediate-operand matchers instead)
Register class is checked against expected values:
int regclass = getRegisterClass(*(uint32_t*)(op + 4)); // sub_13B9CC0
if (regclass != 1023 && regclass != 1) return 0; // 1023 = wildcard
Step 4 -- Priority gate. If all checks pass and the current priority allows it:
if (*priority <= threshold) {
*priority = threshold + 1;
*template_id = combined_opcode_id;
}
Since matchers are called sequentially and each checks the running maximum, the highest-priority match always wins.
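The gate can be isolated as a tiny helper to show why the protocol is order-independent. This is a sketch; Best and claim are invented names for the (best_template_id, best_priority) pair and the gate itself.

```c
#include <stdint.h>

typedef struct { int32_t template_id, priority; } Best;

/* The gate at the tail of every matcher; threshold and combined_id are
 * the per-pattern constants baked into each matcher clone. */
static void claim(Best *b, int32_t threshold, int32_t combined_id) {
    if (b->priority <= threshold) {
        b->priority    = threshold + 1;
        b->template_id = combined_id;
    }
}

/* Two patterns claiming the same instruction, run in either order: */
int32_t winner_forward(void) {
    Best b = { -1, -1 };          /* best_template_id/priority = NONE */
    claim(&b, 5, 100);            /* common pattern, priority 5       */
    claim(&b, 20, 200);           /* specific pattern, priority 20    */
    return b.template_id;
}

int32_t winner_reversed(void) {
    Best b = { -1, -1 };
    claim(&b, 20, 200);
    claim(&b, 5, 100);
    return b.template_id;
}
```

In both orders the priority-20 pattern wins (template 200), which is why matcher call order within a primary case does not affect the outcome.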
Operand Type Discriminators
Three families of trivial single-instruction functions serve as operand type predicates, one family per dispatch context:
SM120 matchers (Zone A of sub_143C440)
| Function | Test | Semantic |
|---|---|---|
| sub_13B9CD0 | kind == 2 | isRegister |
| sub_13B9CE0 | kind == 1 | isImmediate |
| sub_13B9D00 | kind == 2 || kind == 1 | isRegOrImm |
| sub_13B9D10 | kind == ? | isConstantBuffer |
| sub_13B9D40 | kind == ? | isPredicate |
| sub_13B9D50 | kind == ? | isUniformRegister |
| sub_13B9CC0 | extracts class | getRegisterClass (1023 = wildcard) |
Generic matchers (Zone A of sub_169B190)
| Function | Test | Semantic |
|---|---|---|
| sub_15F59C0 | a1 == 2 | isRegister |
| sub_15F59D0 | a1 == 1 | isImmediate |
| sub_15F59E0 | a1 == 0 | isNone |
| sub_15F59F0 | a1 == 10 | isConstantMemory |
| sub_15F5A00 | a1 == 9 | isTexRef |
| sub_15F5A30 | a1 == 3 | isPredicate / isConstImm |
| sub_15F5A40 | a1 == 15 | isUniformRegister / isTrueConst |
| sub_15F5A80 | a1 == 6 | isLabel |
| sub_15F5A90 | a1 == 11 | isTexture |
| sub_15F5AB0 | identity | getOperandValue |
Post-schedule matchers (Zone A of sub_198BCD0)
| Function | Test | Semantic | Call count |
|---|---|---|---|
| sub_1820170 | identity | getOpcodeRaw | 9,278 |
| sub_1820180 | a1 == 2 | isRegOperand | 2,743 |
| sub_1820190 | a1 == 1 | isImmOperand | 677 |
| sub_18201A0 | a1 == 8 | isUniform | 7 |
| sub_18201B0 | a1 == 10 | isPredicateReg | 1,228 |
| sub_18201C0 | a1 == 9 | isTexRef | 211 |
| sub_18201D0 | a1 == 5 | isConstBuf | 14 |
| sub_18201E0 | a1 == 4 | isAddress | 9 |
| sub_18201F0 | a1 == 3 | isConstImm | 1,044 |
| sub_1820200 | a1 == 15 | isTrueConst | 1,044 |
| sub_1820210 | a1 == 7 | isBarrier | 9 |
| sub_1820220 | a1 == 12 | isSurface | 12 |
| sub_1820230 | a1 == 11 | isTexture | 12 |
| sub_1820240 | a1 == 6 | isLabel | 2 |
| sub_1820250 | a1 == 14 | isSpecialReg | 2 |
| sub_1820260 | a1 == 13 | isUnknown | 6 |
Priority System
Matchers use a strict numeric priority to resolve conflicts when multiple patterns match the same instruction. Higher priority means more specific and/or more profitable transformation.
| Priority range | Description | Example |
|---|---|---|
| 1--2 | Trivial matches (simple mov, basic arithmetic) | Single-operand passthrough |
| 5--11 | Common 2--3 operand combining patterns | Standard FMA combines |
| 14--20 | Complex 4-operand patterns with constraints | Multi-source ALU combines |
| 22--31 | Highly specific multi-operand patterns | Wide register + predicated ops |
| 33--36 | Maximum specificity (8--9 operands + all modifiers) | Full tensor instruction forms |
Pattern IDs range from 1 to approximately 244 in the generic and SM120 dispatchers. Multiple matchers can target the same pattern ID with different priorities, creating a priority cascade.
Instruction Node Layout
The peephole subsystem reveals the following fields of the instruction IR node:
| Offset | Size | Field | Accessor |
|---|---|---|---|
| +0x00 | 1 B | Operand type tag | isRegister, isImmediate, etc. |
| +0x04 | 4 B | Primary value (register number / immediate) | getRegisterClass / getOperandValue |
| +0x0C | 2 B | Opcode number (16-bit) | Direct read in dispatch entry |
| +0x0E | 1 B | Rewritten opcode | sub_B28F10 (setRewrittenOpcode) |
| +0x0F | 1 B | Rewritten modifier | sub_B28F20 (setRewrittenModifier) |
| +0x14 | 4 B | Secondary register field | Direct read |
| +0x20 | 8 B | Operand array base pointer | sub_B28F30 base address |
| +0x28 | 4 B | Total operand count | Part of sub_B28F40 computation |
| +0x48 | var | Operand mapping table (4 B per slot) | sub_BA9CF0 writes here |
| +0x5C | 4 B | Explicit operand count | sub_B28F50 returns this |
| +0x8C | 4 B | Flags word | Bit 0 = rewritten (set by sub_BA9C30) |
Each operand is a 32-byte record at base + 32 * index:
| Operand offset | Size | Content |
|---|---|---|
| +0 | 1 B | Type tag (1=imm, 2=reg, 3=constImm, 10=pred, 15=trueConst, ...) |
| +4 | 4 B | Primary value (register ID; 1023 = wildcard / any-reg) |
| +20 | 4 B | Secondary value (modifier / sub-register) |
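The recovered offsets can be captured in a speculative C layout whose shape is checkable with offsetof. All field names are invented, the padding regions are unknowns, and the operand-map slot count of 5 is an assumption chosen only to fill the gap between +0x48 and +0x5C.

```c
/* Speculative layouts matching the recovered offsets; names invented. */
#include <stdint.h>
#include <stddef.h>

typedef struct Operand {
    uint8_t  kind;               /* +0:  1=imm, 2=reg, 3=constImm, ...  */
    uint8_t  _pad0[3];
    uint32_t value;              /* +4:  register ID; 1023 = wildcard   */
    uint8_t  _pad1[12];
    uint32_t secondary;          /* +20: modifier / sub-register        */
    uint8_t  _pad2[8];           /* record stride is 32 bytes           */
} Operand;

typedef struct InstrNode {
    uint8_t  _pad0[0x0C];        /* unknown; primary value sits at +0x04 */
    uint16_t opcode;             /* +0x0C: 16-bit opcode number          */
    uint8_t  rewritten_opcode;   /* +0x0E: written by sub_B28F10         */
    uint8_t  rewritten_modifier; /* +0x0F: written by sub_B28F20         */
    uint8_t  _pad1[0x10];        /* covers secondary reg field at +0x14  */
    Operand *operands;           /* +0x20: operand array base pointer    */
    uint32_t total_operands;     /* +0x28                                */
    uint8_t  _pad2[0x1C];
    uint32_t operand_map[5];     /* +0x48: 4 B/slot; count is ASSUMED    */
    uint32_t explicit_operands;  /* +0x5C: returned by sub_B28F50        */
    uint8_t  _pad3[0x2C];
    uint32_t flags;              /* +0x8C: bit 0 = rewritten             */
} InstrNode;
```

With this layout, sub_B28F30's `*(instr+32) + 32*idx` is simply `&node->operands[idx]`.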
Code Duplication
The pattern matchers exhibit extreme structural duplication. Groups of 2--10 functions are near-identical clones differing only in numeric constants (the specific opcode/modifier values they check, the template ID they assign, and the priority level).
Observed clone clusters in sub_169B190's matchers:
| Functions | Byte size each | Address range example |
|---|---|---|
| 5 | ~5,560 B | 0x167CBB0--0x16E7D20 |
| 10 | ~5,282 B | 0x167E3A0--0x16807E0 |
| 4 | ~5,298 B | 0x16EA5F0--0x16ECA30 |
| 3 | ~5,846 B | 0x16EDC00--0x16EE8B0 |
| 7 | ~2,718 B | 0x166F260--0x1692B60 |
| 6 | ~2,604 B | 0x166AC30--0x166E170 |
Similarly, in sub_198BCD0's matchers, eight functions of exactly 5,282 bytes
each (sub_1982810, sub_1982AE0, sub_1982DB0, sub_1983080,
sub_1984B40, sub_1984E10, sub_19850E0, sub_19853B0) share identical
structure, varying only in the opcode/modifier constants passed to
sub_10AE5C0.
This strongly suggests compiler-generated code from C++ templates or macros that instantiate one matcher function per instruction variant from ISA specification tables -- a pattern consistent with NVIDIA's internal build tooling.
Size Distribution of Matchers
SM120 matchers (1,087 functions, 429 KB)
| Size range | Count | Description |
|---|---|---|
| < 200 B | 37 | Simple 1--2 modifier checks |
| 200--400 B | 520 | Typical 4--8 modifier checks |
| 400--600 B | 455 | 6--12 modifier checks + operand validation |
| 600--800 B | 66 | Complex multi-operand patterns |
| > 800 B | 9 | Deepest nesting, most constrained patterns |
Generic matchers (762 functions, ~310 KB)
| Size range | Frequency | Description |
|---|---|---|
| ~2,200 B | most common | 2--4 instruction field checks |
| ~2,800 B | moderate | Patterns with operand constraints |
| ~3,500--4,000 B | fewer | Complex multi-operand patterns |
| ~5,500--8,500 B | rare | 12+ modifier checks, 8--9 operands |
Post-schedule matchers (~1,336 functions)
| Size range | Frequency | Description |
|---|---|---|
| ~2,200 B | most common | Simple 2-instruction patterns |
| ~2,500 B | common | 3-instruction patterns |
| ~3,100 B | moderate | Patterns with predicate checks |
| ~5,300 B | few | Multi-instruction sequences (8+ operands) |
| ~6,800 B | 1 | Largest matcher (sub_1980D10) |
Representative Matcher Examples
Simplest: sub_143C3B0 (132 bytes, priority 2, template 1)
Checks: no explicit operands, 2 total slots, first operand is register-or-immediate
with register class 1023 or 1. Matches a trivial mov-type instruction for
passthrough combining.
Moderate: sub_13CF0C0 (426 bytes, priority 15, template 28)
Checks 5 modifiers: slot 0xD3 == 1181, slot 0xD2 == 1177, slot 0x0C == 59, slot 0xB3 == 772, slot 0xC8 == 1107. Then validates 1 explicit register operand plus 4 additional operands (register, register, immediate, predicate).
Complex: sub_1615980 (priority 36, template 25 -- highest observed priority)
Checks 12 modifier slots: 0x05 == 12, 0xDC == 1206, 0x253 in {2937,2938}, 0x126 == 1493, 0xF2 in {1281,1282}, 0x163 == 1943, 0x178 == 2035, 0x179 in {2037..2041}, 0x1AD in {2253..2257}, 0x7E in {547,548}, 0x19D in {2167,2168}, 0x18D == 2115. No fixed operands, 7 variable operands, each of type 10 (constant memory) with register class 1023 or specific flag constraints. This is the most constrained pattern observed -- likely a fully specified tensor instruction variant.
Post-schedule: sub_1834600 (pattern 17, priority 16)
Checks modifier slots 0xD3 == 1181, 0xD2 == 1177, 0x0C in {60,61}, 0xB3 == 772, 0xC8 == 1107. Then: first operand offset == 1, that operand is immediate, total operand count == 5, followed by register pattern checks.
Infrastructure Helper Functions
Core accessor (sub_10AE5C0, 60 bytes)
The single most-called function in the peephole subsystem (30,768 callers across the full binary). Queries a property of an instruction node by slot ID:
int queryModifier(int64_t ctx, int64_t instr, int slot) {
if (hasProperty(instr, slot)) // sub_10E32E0
return getPropertyValue(instr, slot); // sub_10D5E60
return 0xFFFFFFFF; // property not present
}
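The 0xFFFFFFFF sentinel interacts neatly with the unsigned range-check idiom seen in matcher step 1 (e.g. `queryModifier(ctx, instr, 0x7E) - 547 > 1`). A minimal sketch of the idiom, with in_range as an invented helper name:

```c
#include <stdint.h>

/* x is in [lo, hi] iff (x - lo) <= (hi - lo) under unsigned wraparound;
 * one compare replaces two.  The 0xFFFFFFFF "property absent" sentinel
 * from queryModifier always wraps far outside the window, so absent
 * properties reject the pattern with no extra test. */
int in_range(uint32_t x, uint32_t lo, uint32_t hi) {
    return (uint32_t)(x - lo) <= (hi - lo);
}
```

This is why matchers can write `queryModifier(...) - 547 > 1` for "saturation not in {547, 548}" without separately handling missing properties.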
Node accessors
| Function | Size | Semantics | Call frequency |
|---|---|---|---|
| sub_B28F30 | 12 B | getOperand(instr, idx) -- returns *(instr+32) + 32*idx | 31,399 |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) -- returns *(instr+40)+1 - *(instr+92) | ~2,500 |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) -- returns *(instr+92) | ~2,100 |
Rewrite helpers
| Function | Semantics | Call frequency in sub_143C440 |
|---|---|---|
| sub_B28F10 | setRewrittenOpcode(instr, byte) -- writes instr[14] | 1,759 |
| sub_B28F20 | setRewrittenModifier(instr, byte) -- writes instr[15] | 1,759 |
| sub_BA9CF0 | setOperandMapping(instr, slot, val) -- writes instr[72+4*slot] | 993 |
| sub_BA9C30 | markRewrittenSimple(instr) -- instr[140] |= 1 | 1,222 |
| sub_BA9CB0 | markRewrittenComplex(instr) -- priority-aware flag update | 361 |
The ratio of markRewrittenSimple (1,222) to markRewrittenComplex (361)
shows that approximately 77% of rewrites are straightforward replacements,
while 23% involve priority negotiation with competing rewrites.
Call Frequency in sub_169B190 (Generic Dispatcher)
| Callee | Count | Role |
|---|---|---|
| sub_B28F10 (setRewrittenOpcode) | 2,142 | Write new opcode byte |
| sub_B28F20 (setRewrittenModifier) | 2,142 | Write new modifier byte |
| sub_15F59B0 (getOperandValue) | 1,736 | Extract register number |
| sub_10AE5C0 (queryModifier) | 1,303 | Read instruction property |
| sub_B28F30 (getOperand) | 1,281 | Get operand pointer |
| sub_BA9C30 (markRewrittenSimple) | 1,261 | Simple rewrite commit |
| sub_BA9CF0 (setOperandMapping) | 855 | Map operand slots |
| sub_BA9CB0 (markRewrittenComplex) | 589 | Priority-aware commit |
Relationship to Instruction Encoding
Each dispatch function's address range is adjacent to a zone of SASS instruction encoders that consume the rewritten instructions:
- sub_143C440 (SM120) sits before 123 SM120 encoders at 0x14771E0--0x14A3C80 (180 KB), covering 82 unique SASS opcodes with up to 42 encoding variants per opcode.
- sub_169B190 (generic) sits before 100 encoding table entries at 0x16DF750--0x16FFFF0 and 36 template expanders at 0x1700000--0x1722D60.
- sub_198BCD0 (post-schedule) operates on already-scheduled instructions, performing strength reduction and idiom recognition on the final instruction stream.
The encoders are called via vtable dispatch, not directly from the peephole
functions. Each encoder packs a 128-bit SASS instruction word using
sub_7B9B80(state, bit_offset, bit_width, value) for bit-field insertion.
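The real state layout behind sub_7B9B80 is unrecovered; below is a minimal sketch of a 128-bit field inserter with the same argument shape, assuming the instruction word is held as two 64-bit halves and that fields are OR-inserted into a zeroed word. Sass128 and insert_field are invented names.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } Sass128;

/* Insert `value` into bits [bit_offset, bit_offset+bit_width) of a
 * 128-bit word.  Assumes bit_width in 1..63 and offset+width <= 128. */
void insert_field(Sass128 *w, unsigned bit_offset,
                  unsigned bit_width, uint64_t value) {
    uint64_t mask = (1ULL << bit_width) - 1;
    value &= mask;
    if (bit_offset < 64) {
        w->lo |= value << bit_offset;
        if (bit_offset + bit_width > 64)        /* field straddles halves */
            w->hi |= value >> (64 - bit_offset);
    } else {
        w->hi |= value << (bit_offset - 64);
    }
}

/* Demo: an 8-bit field at bit 60 straddles both halves. */
Sass128 demo_word(void) {
    Sass128 w = {0, 0};
    insert_field(&w, 60, 8, 0xAB);
    return w;
}
```

The straddling case is what makes a dedicated helper worthwhile: with 18,347 callers, every encoder delegates the split-across-halves arithmetic to one routine.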
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_B12930 | 11 B | Entry trampoline for generic peephole | CERTAIN |
| sub_B12940 | 11 B | Entry trampoline for SM120 peephole | CERTAIN |
| sub_B12960 | 11 B | Entry trampoline for post-schedule peephole | CERTAIN |
| sub_169B190 | 280 KB | Generic peephole mega-dispatcher | HIGH |
| sub_143C440 | 233 KB | SM120 peephole mega-dispatcher | HIGH |
| sub_198BCD0 | 233 KB | Post-schedule peephole mega-dispatcher | HIGH |
| sub_10AE5C0 | 60 B | queryModifier(ctx, instr, slot) | HIGH |
| sub_B28F10 | small | setRewrittenOpcode(instr, byte) | HIGH |
| sub_B28F20 | small | setRewrittenModifier(instr, byte) | HIGH |
| sub_B28F30 | 12 B | getOperand(instr, idx) | CERTAIN |
| sub_B28F40 | 10 B | getTotalOperandSlots(instr) | CERTAIN |
| sub_B28F50 | 4 B | getExplicitOperandCount(instr) | CERTAIN |
| sub_BA9C30 | small | markRewrittenSimple(instr) | HIGH |
| sub_BA9CB0 | small | markRewrittenComplex(instr) | HIGH |
| sub_BA9CF0 | small | setOperandMapping(instr, slot, value) | HIGH |
| sub_13B9CC0 | small | getRegisterClass(field) | HIGH |
| sub_13B9CD0 | small | isRegister(byte) | HIGH |
| sub_13B9CE0 | small | isImmediate(byte) | HIGH |
| sub_13B9D00 | small | isRegisterOrImmediate(byte) | HIGH |
| sub_13B9D10 | small | isConstantBuffer(byte) | HIGH |
| sub_13B9D40 | small | isPredicate(byte) | HIGH |
| sub_13B9D50 | small | isUniformRegister(byte) | HIGH |
| sub_13B9DC0 | small | opcodeIdentity(uint) -- passthrough | CERTAIN |
| sub_1909030 | small | opcodePassthrough (post-schedule context) | HIGH |
Macro Instruction Expansion (sub_8127C0)
Separate from the three pattern-match-and-rewrite mega-dispatchers, ptxas contains
a dedicated macro instruction expansion pass at sub_8127C0 (10,720 bytes). This
pass resolves register-file constraints for composite instructions -- cases where
source or destination operands span register files or where multi-word results need
splitting into narrower instruction sequences.
It is called from the master lowering dispatcher sub_8380A0 and runs before
instruction scheduling.
Two-phase algorithm
Phase 1 -- Operand scanning and constraint annotation.
The pass iterates every instruction in the function's linked list (traversing via
instr+8). For each instruction, it reads the opcode at instr+72 and dispatches
through a 15-family if-else cascade. For each opcode, it calls
sub_812550 (getOperandConstraint) on each source operand to determine
register-file affinity:
| Return value | Meaning |
|---|---|
| 0 | Unconstrained |
| -2 | Constrained to register file B (e.g., even-aligned pair) |
| -3 | Constrained to register file A (e.g., odd-aligned pair) |
| -1 | Conflict / unresolvable |
The pass annotates register descriptor entries (indexed via ctx+88) at reg+76
(constraint code) and reg+80 (target width code), and builds a linked list of
instructions requiring expansion (linked via instr+56). Registers consumed by
expansion are marked dead (reg+64 = 5).
Phase 2 -- Instruction rewriting.
If any instruction requires expansion, the pass iterates the worklist and performs
actual rewrites: replacing composite instructions with equivalent sequences,
inserting new instructions via the sub_930040 / sub_92FF10 / sub_92E720
emitters, and deleting originals via sub_9253C0. Register-file mapping uses two
lookup tables: dword_21D5EE0[26] (for constraint -2) and dword_21D5F60[16]
(for constraint -3).
Between phases, a cleanup loop removes worklist entries with conflicting constraints
(both operands invalid), resetting reg+76 = -1.
Opcodes handled
| Opcode | Mnemonic | Expansion pattern |
|---|---|---|
| 10 | SHF | Three-source constraint check; emits I2IP (36) + new SHF when sources span register files |
| 18 | FSETP | Predicate operand finalization when operand count == 6 and modifier bits match |
| 29 | PMTRIG | Last-operand extraction and finalization |
| 36 | I2IP | Destination register marking and two-source constraint checking |
| 60 | LEPC | Store/load legalization: validates flags, checks register file == 6, recursive chain validation via sub_812480 |
| 62, 78, 79 | BAR_INDEXED, RTT, BSYNC | Same legalization path as LEPC |
| 95, 96 | STS, LDG | Last-operand extraction for stores; two-source vector-width constraint checking for loads |
| 97 | STG | Source registration for expansion tracking |
| 130 | HSET2 | Validates single-def destination, recursive source constraint chains; inserts HSET2 rewrites or converts to opcode-201 stores |
| 137 | SM73_FIRST | Same path as HSET2 |
| 149 | UFLO | Two-source validation; marks destination with width code 20; vectorization combining |
| 151 | UIMAD | Shared three-source path with SHF |
| 190 | LDGDEPBAR | Shared last-operand path with PMTRIG |
| 201, 202 | QMMA_16816, QMMA_16832 | Full multi-operand legalization; inserts barrier instructions for QMMA |
| 283 | UVIADD | Penultimate operand extraction and type resolution |
| 290 | MOV (sm_104) | Same constraint path as SHF/UIMAD |
| bit 12 set | (arch-specific) | Last-operand extraction for architecture-extended instructions |
sub_812550 -- getOperandConstraint
The single most-called helper (32 call sites), this 40-byte function reads the constraint code from the register descriptor for a given operand reference:
int getOperandConstraint(int64_t ctx, uint32_t *operand_ref) {
int modifier_bits = operand_ref[1];
int constraint = reg_array[*operand_ref & 0xFFFFFF].constraint; // reg+76
if ((modifier_bits & 0xFE000000) == 0)
return constraint; // no sub-register modifier => raw value
// Apply modifier-aware transformations:
// constraint -2 + certain modifier combos => -3 or -1
// constraint -3 + modifier bit 0x3C000000 => -1; + sign bit => -2
...
}
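The elided modifier transformations can be modeled speculatively. Only the transitions named in the recovered comments are encoded below; the "-2 + certain modifier combos" rules remain unknown and are deliberately left as pass-through. apply_modifier and the enum names are invented.

```c
/* Speculative model of the modifier-transform rules only; the -2
 * transitions are unrecovered and intentionally unmodeled. */
#include <stdint.h>

enum { UNCONSTRAINED = 0, CONFLICT = -1, FILE_B = -2, FILE_A = -3 };

int apply_modifier(int constraint, uint32_t modifier_bits) {
    if ((modifier_bits & 0xFE000000u) == 0)
        return constraint;                 /* no sub-register modifier */
    if (constraint == FILE_A) {            /* -3 */
        if (modifier_bits & 0x3C000000u)   /* documented: -3 -> -1 */
            return CONFLICT;
        if (modifier_bits & 0x80000000u)   /* documented: sign bit -> -2 */
            return FILE_B;
    }
    return constraint;   /* -2 combos unknown: raw value passed through */
}
```

Under this reading, a sub-register modifier can only tighten or flip a file-A constraint, never relax a conflict.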
sub_812480 -- validateOperandChain
Recursively walks use-def chains through HSET2 (130) and SM73_FIRST (137)
instructions to verify that an entire operand chain is compatible with a target
register file. Uses sub_A9BD00 to resolve the register file for a width code,
then checks reg+76 and reg+80 agreement.
Knob gate
Option 183 (target profile offset 13176) controls the expansion distance threshold. When enabled, a secondary value at profile+13184 sets the maximum distance between a register definition and its use before the constraint is considered violated. Default threshold: 7.
Function map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_8127C0 | 10,720 B | ExpandMacroInstructions (main pass) | HIGH |
| sub_812550 | 40 B | getOperandConstraint | HIGH |
| sub_812480 | ~170 B | validateOperandChain | HIGH |
| sub_8125E0 | ~450 B | canExpandStoreChain | MEDIUM |
| sub_800470 | small | isLegalizable | MEDIUM |
| sub_800360 | small | resolveOperandType | MEDIUM |
| sub_800400 | small | finalizeOperand | MEDIUM |
Cross-References
- Instruction Selection -- the isel pass that precedes peephole optimization
- SASS Instruction Encoding -- the encoder vtable entries that consume peephole output
- Newton-Raphson Templates -- multi-instruction template expansion (DDIV, DRCP, DSQRT) in the same address neighborhood as sub_169B190
- Scheduling Algorithm -- the scheduler that runs between pre- and post-schedule peephole
- Blackwell (SM 100--121) -- SM120-specific context for sub_143C440
Evidence Index
| Claim | Source |
|---|---|
| sub_143C440 structure, 1,087 matchers, 373-case switch | p1.20-sweep-0x13CF000-0x14A4000.txt lines 1--486 |
| SM120 encoder zone (123 functions, 180 KB) | p1.20 lines 269--329 |
| sub_169B190 structure, 762 matchers, 280 KB | p1.22 lines 1--460, p1.23 lines 1--588 |
| Generic operand discriminators (sub_15F59C0 family) | p1.22 lines 181--201 |
| Clone clusters in generic matchers | p1.23 lines 156--174 |
| Post-schedule discriminators (sub_1820170 family) | p1.25 lines 271--289 |
| sub_198BCD0 structure, 1,336 callees, 373-case switch | p1.26 lines 355--398 |
| Post-schedule 5,282-byte clone group | p1.26 lines 401--424 |
| Rewrite helper call frequencies | p1.20 lines 216--227, p1.23 lines 228--237 |
| Priority 36 as highest observed | p1.22 lines 316--327 |
| Instruction node layout | p1.20 lines 406--420, p1.22 lines 367--409 |
Mercury Encoder Pipeline
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It is not a direct binary encoding of SASS -- it is a separate representation that contains pseudo-instructions, lacks dependency barriers, and requires multiple transformation passes before it becomes executable GPU code. The Mercury pipeline occupies phases 113--122 of the 159-phase PhaseManager, forming a six-stage sub-pipeline: encode/decode verification, pseudo-instruction expansion, two WAR-hazard passes (one before and one after operation expansion), scoreboard/latency generation ("opex"), and final SASS microcode emission. All recent GPU architectures (SM 75+) use Mercury as the encoding backend; SM 100+ (Blackwell) defaults to "Capsule Mercury" (capmerc), a variant that embeds additional metadata for relocatable patching.
| Pipeline phases | 113--122 (8 active phases within Mercury sub-pipeline) |
| Core orchestrator | sub_6F52F0 (23KB, RunStages -- 18 parameters) |
| Master encoder | sub_6D9690 (94KB, EncodeInstruction -- largest backend function) |
| Opex body | sub_6FFDC0 (66KB, EmitInstructions -- scoreboard generation) |
| Expansion pass | sub_C3CC60 (26KB, MercExpand::run) |
| WAR generator | sub_6FBC20 (7.4KB, GenerateWARHazards) |
| SASS emitter | sub_6E4110 (24KB, MercGenerateSassUCode) |
| Bitfield insert | sub_7B9B80 (216 bytes, 18,347 callers across binary) |
| Encoding table funcs | 530 functions at 0xC66000--0xD27000 |
| Mercury mode flag | *(DWORD*)(context+385) == 2 |
| Mode check | sub_10ADF10 returns bool from target descriptor |
| MercConverter | sub_9F3340 (7KB orchestrator), sub_9EF5E0 (27KB operand reorganization) |
| CLI option | --binary-kind mercury,capmerc,sass |
Architecture
Phase 113 PostFixForMercTargets Late Ori fixups for Mercury targets
Phase 114 FixUpTexDepBarAndSync Texture dependency bars + sync fixups
Phase 115 AdvancedScoreboardsAndOpexes Arch hook point (noop by default)
Phase 116 ProcessO0WaitsAndSBs -O0 scoreboard insertion
──────────────────────────────
Phase 117 MercEncodeAndDecode ┐
Phase 118 MercExpandInstructions │ Six-stage Mercury core
Phase 119 MercGenerateWARs1 │
Phase 120 MercGenerateOpex │
Phase 121 MercGenerateWARs2 │
Phase 122 MercGenerateSassUCode ┘
sub_6F52F0 (23KB orchestrator, 18 params)
│
├─ [1] Decode: sub_6F2BF0 (59KB) — Encode Ori→Mercury binary, decode back
│ └─ sub_6D9690 (94KB master encoder switch)
│ ├─ sub_6D2750 — append operand word
│ ├─ sub_6D28C0 — commit encoded instruction
│ ├─ sub_6D9580 — encode literal values
│ └─ sub_931690 — create instruction record
│
├─ [2] Expansion: sub_C3CC60 (26KB) — Expand pseudo-instructions to SASS
│ ├─ sub_C37A10 (16KB) — expandInstruction (jump table dispatch)
│ ├─ sub_C39B40 (10KB) — expandMemoryOp
│ ├─ sub_C3A460 (6KB) — expandAtomicOp
│ ├─ sub_C3B560 (8KB) — expandTexture
│ ├─ sub_C3BCD0 (19KB) — expandControlFlow
│ └─ sub_C3E030 (18KB) — finalizeExpansion
│
├─ [3] WAR pass 1: sub_6FBC20 (7.4KB) — DEPBAR/scoreboard for pre-opex hazards
│ ├─ sub_6FA5B0 — detect WAR hazard per instruction
│ ├─ sub_6FA930 — insert scoreboard barrier (opcode 54)
│ ├─ sub_6FA7B0 — insert WAITDP (opcode 246)
│ └─ sub_6FAA90 — insert stall cycles
│
├─ [4] Opex: sub_6FFDC0 (66KB) — Generate scoreboards + latency waits
│ └─ sub_703480 (1.4KB entry) or sub_7032A0 (2.3KB MercOpex entry)
│
├─ [5] WAR pass 2: sub_6FBC20 — Same pass, re-run for opex-introduced hazards
│
└─ [6] SASS emit: sub_6E4110 (24KB) — Final SASS microcode generation
└─ sub_735290 — per-instruction encoding pipeline
├─ sub_733FA0 — encode instruction operands
├─ sub_734370 — encode immediates
├─ sub_734820 — encode predicates
├─ sub_734AD0 — encode memory operands
└─ sub_734D20 — encode complex operands (texture/surface/barrier)
Each stage logs its completion via trace infrastructure: "After Decode", "After Expansion", "After WAR post-expansion", "After Opex", "After WAR post-opexing".
Mercury vs SASS vs Capsule Mercury
The ptxas CLI (sub_703AB0) accepts --binary-kind with three values:
| Mode | CLI value | Default for | Description |
|---|---|---|---|
| Mercury | mercury | SM 75--99 | Traditional Mercury intermediate encoding |
| Capsule Mercury | capmerc | SM 100+ (Blackwell) | Mercury + embedded PTX source + relocation metadata |
| Raw SASS | sass | (explicit only) | Direct SASS binary output |
Additional CLI flags:
- --cap-merc -- force Capsule Mercury generation
- --self-check -- roundtrip verification: reconstitute SASS from capmerc, compare with original
- --out-sass -- dump reconstituted SASS from capmerc
Mercury mode is flagged at *(DWORD*)(context+385) == 2. The function sub_10ADF10 queries the target descriptor to determine whether Mercury encoding is active for the current architecture.
MercConverter -- Operand Reorganization for Encoding
| Phase | 141 (MercConverter) |
| Orchestrator | sub_9F3340 (7KB) |
| Post-conversion lowering | sub_9EF5E0 (27KB) |
| Opcode dispatch | sub_9ED2D0 (25KB, shared with phase 5) |
| Strings | "CONVERTING", "After MercConverter" |
Phase 141 runs the MercConverter infrastructure a second time, after the full optimization pipeline has completed. While phase 5 (ConvertUnsupportedOps) performs the initial PTX-to-SASS opcode conversion early in the pipeline, phase 141 re-invokes the same machinery to handle instructions that were introduced or modified by optimization passes (rematerialization, peephole, loop transformations) and may contain PTX-derived opcodes that were never legalized. After phase 141 completes, the "After MercConverter" diagnostic string appears, and every instruction in the IR carries a valid SASS opcode ready for Mercury encoding.
The orchestrator sub_9F3340 runs two steps sequentially:
1. Opcode conversion (sub_9F1A90, 35KB): the main MercConverter dispatch documented in ISel. Converts any remaining PTX-derived opcodes to SASS equivalents via the master switch in sub_9ED2D0. Gated by *(BYTE*)(*(context+8) + 1398) & 0x20.
2. Operand reorganization (sub_9EF5E0, 27KB): post-conversion lowering that restructures operand lists into a form the Mercury encoder can consume directly. Gated by *(BYTE*)(*(context+16) + 1048) != 0 AND *(context+104) != 0 (non-empty instruction BST).
Post-Conversion Lowering -- sub_9EF5E0 (27KB)
This function transforms the BST (binary search tree) of converted instructions produced by step 1 into encoding-ready conversion nodes. For each instruction record in the BST, it performs three operations:
1. Operand sort. Calls sub_9EC160, a linked-list merge sort (Floyd's slow/fast pointer midpoint, recursive split-and-merge) that sorts the operand chain by the operand index at entry+16. This establishes a canonical ordering required by the encoder.
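A self-contained sketch of that sort follows. The node type and names are invented; the real entries key on the operand index stored at entry+16, modeled here as a named field.

```c
/* Linked-list merge sort in the style of sub_9EC160: Floyd's slow/fast
 * pointer finds the midpoint, then recursive split-and-merge. */
#include <stdint.h>
#include <stddef.h>

typedef struct OpEntry {
    struct OpEntry *next;
    uint32_t index;        /* operand index (at entry+16 in the real node) */
} OpEntry;

static OpEntry *merge(OpEntry *a, OpEntry *b) {
    OpEntry head = {0}, *tail = &head;
    while (a && b) {
        OpEntry **min = (a->index <= b->index) ? &a : &b;
        tail->next = *min; tail = *min; *min = (*min)->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

OpEntry *sort_operands(OpEntry *list) {
    if (!list || !list->next) return list;
    OpEntry *slow = list, *fast = list->next;    /* Floyd midpoint */
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    OpEntry *right = slow->next;
    slow->next = NULL;                           /* split in half */
    return merge(sort_operands(list), sort_operands(right));
}

/* Demo: sort indices {96, 0, 64, 32}; returns 1 iff ascending. */
uint32_t demo_sorted(void) {
    OpEntry n[4] = {{&n[1], 96}, {&n[2], 0}, {&n[3], 64}, {NULL, 32}};
    uint32_t ok = 1;
    for (OpEntry *p = sort_operands(&n[0]); p && p->next; p = p->next)
        ok &= (p->index <= p->next->index);
    return ok;
}
```

Merge sort is the natural choice here: it is stable and needs no random access, matching a singly-linked operand chain.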
2. Contiguous/gap partitioning. Walks the sorted operand list and classifies each operand into one of two sublists:
// Simplified partitioning logic (lines 215-348 of decompilation)
for (op = first; op != sentinel && op->next != sentinel; op = op->next) {
    int cur_idx  = *(DWORD*)(op + 16);
    int next_idx = *(DWORD*)(op->next + 16);
    if (next_idx - cur_idx == 32) {
        // Consecutive register indices -> contiguous sublist
        append_to_contiguous_list(node, cur_idx);
    } else {
        // Non-consecutive -> gap sublist (stores both cur and next index)
        append_to_gap_list(node, cur_idx, next_idx);
    }
}
The stride of 32 reflects the operand index encoding: index = register_number * 32 + modifier_bits. Contiguous operands (stride-32 sequences like R0, R1, R2, R3) represent packed register groups -- common in wide loads (LDG.128), GMMA matrix operands, and multi-register moves. The encoder can represent these as a single register-range specifier. Gap operands break the stride and require individual encoding slots.
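The stride test is easy to sketch as standalone helpers. This is an illustration of the recovered index formula, not the decompiled code; `operand_index`, `is_contiguous`, and `longest_run` are names invented here:

```c
#include <assert.h>

// Sketch of the recovered operand-index arithmetic: index = reg * 32 + modifier.
static int operand_index(int reg_number, int modifier_bits) {
    return reg_number * 32 + modifier_bits;
}

// Two sorted operands belong to the same contiguous (register-range) run when
// their indices differ by exactly one register stride.
static int is_contiguous(int cur_idx, int next_idx) {
    return next_idx - cur_idx == 32;
}

// Length of the longest stride-32 run in a sorted index array -- the group a
// single register-range specifier could cover (e.g. R0..R3 for LDG.128).
static int longest_run(const int *idx, int n) {
    int best = (n > 0), cur = (n > 0);
    for (int i = 1; i < n; i++) {
        cur = is_contiguous(idx[i - 1], idx[i]) ? cur + 1 : 1;
        if (cur > best) best = cur;
    }
    return best;
}
```

A run of four (R0..R3 with identical modifier bits) is exactly what the encoder collapses into one range specifier; any break in the stride forces individual slots.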
3. Conversion node construction. Allocates a 168-byte conversion node per instruction, inserts it into a per-record BST sorted by (block_id, sub_block_id), and links the two operand sublists:
Conversion Node (168 bytes):
+0 8B BST left child
+8 8B BST right child
+16 8B BST parent
+24 4B block_id
+28 4B sub_block_id
+32 48B Contiguous operand doubly-linked list (6 pointers)
+80 4B Contiguous operand count
+88 8B Contiguous list ref-counted handle
+96 48B Gap operand doubly-linked list (6 pointers)
+144 4B Gap operand count
+152 8B Gap list ref-counted handle
+160 1B Flags
BST insertion calls sub_7C11F0 for red-black tree rebalancing. The record tracks min/max block IDs at record+32 and record+40 for range queries.
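The slow/fast split-and-merge shape attributed to sub_9EC160 can be sketched as follows. This is a reconstruction with illustrative names (`Op`, `msort`), keyed on the operand-index field that sits at entry+16 in the real structure:

```c
#include <assert.h>
#include <stddef.h>

// Operand chain node, keyed by the operand index.
typedef struct Op { int idx; struct Op *next; } Op;

static Op *merge(Op *a, Op *b) {
    Op head = { 0, NULL }, *tail = &head;
    while (a && b) {
        if (a->idx <= b->idx) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

// Recursive merge sort: find the midpoint with Floyd's slow/fast pointers,
// split the list there, sort both halves, merge.
static Op *msort(Op *list) {
    if (!list || !list->next) return list;
    Op *slow = list, *fast = list->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    Op *right = slow->next;
    slow->next = NULL;
    return merge(msort(list), msort(right));
}
```

Linked-list merge sort is the natural choice here: it is stable, needs no random access, and never reallocates the operand nodes it reorders.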
Encoding Validation and Fallback
After building the conversion node, the function attempts encoding:
// Lines 949-982 of decompilation
nullsub_644(*(a1+16), node, "CONVERTING"); // diagnostic trace
int result = sub_7BFC30(node); // encoding validation
if (result == -1) {
// Encoding failed: recursive fallback
sub_9CE210(a1, node);
// Continue with next instruction in BST
} else {
// Encoding succeeded: emit to output
*(node + 4) = result; // store encoding index
output_slot = vtable_alloc(*(a1+24), 120); // allocate output record
*(output_slot + 96) = node; // link conversion node
sub_9314F0(&scratch, *(a1+8), 0xF, 1, 1, // emit SASS instruction
&control_word); // control = 0x60000000
}
sub_7BFC30 validates the conversion node by traversing its operand tree and checking that the contiguous/gap partition can be represented in the target encoding format. It returns the encoding index on success, or -1 if the instruction's operand pattern cannot be encoded in the available formats.
On failure, sub_9CE210 (a recursive fallback) re-processes the instruction using a different encoding strategy -- typically splitting the operand group into smaller sub-groups that each fit the available encoding width. This handles edge cases like wide operations with mixed register classes.
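One plausible shape for that splitting strategy is greedy decomposition into the widths an encoding slot can hold. This is purely illustrative -- the actual grouping rules inside sub_9CE210 are not fully recovered, and the 4/2/1 widths are an assumption:

```c
#include <assert.h>

// Hypothetical greedy split of an n-register run into encodable group widths
// (4-, 2-, then 1-register groups). Returns the number of groups produced and
// writes each group size into out[].
static int split_groups(int n, int *out) {
    static const int widths[] = { 4, 2, 1 };
    int count = 0;
    for (int w = 0; w < 3; w++)
        while (n >= widths[w]) {
            out[count++] = widths[w];
            n -= widths[w];
        }
    return count;
}
```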
Relationship to Phase 5
Phase 5 and phase 141 share the same code (sub_9F3340 orchestrator, sub_9ED2D0 dispatch, sub_9EF5E0 post-conversion). The difference is context:
| Property | Phase 5 | Phase 141 |
|---|---|---|
| Pipeline position | Before optimization | After optimization, before Mercury encoding |
| Purpose | Convert PTX opcodes to SASS | Re-legalize instructions introduced by optimizer |
| Input | Raw Ori IR with PTX opcodes | Optimized Ori IR with possibly-illegal opcodes |
| Output | Optimizer-ready SASS-opcode IR | Encoding-ready IR for Mercury phase 142+ |
| Gate flag | *(BYTE*)(*(context+8) + 1398) & 0x20 | Same flag, re-checked |
Stage 1: MercEncodeAndDecode -- Roundtrip Verification
| Phase | 117 |
| Orchestrator | sub_6F52F0 (23KB, 18 parameters) |
| Decode worker | sub_6F2BF0 (59KB) |
| String | "After EncodeAndDecode" |
This phase encodes the Ori IR instruction stream into Mercury binary form, then immediately decodes it back and verifies that the decoded result matches the original. This is a self-consistency check that catches encoding bugs early -- if the roundtrip fails, the instruction cannot be correctly represented in Mercury format.
The orchestrator sub_6F52F0 passes the entire pipeline state (18 parameters) to sub_6F2BF0, which performs the actual encode-decode cycle using the master encoder sub_6D9690.
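The verification pattern itself is simple to illustrate. The following uses a toy 32-bit field layout standing in for the real 1280-bit Mercury word (field names are borrowed from the node layout for flavor; the packing is invented):

```c
#include <assert.h>
#include <stdint.h>

// Toy "instruction word" with three fields.
typedef struct { uint16_t opcode; uint8_t sub_key_1, sub_key_2; } ToyInstr;

static uint32_t toy_encode(const ToyInstr *i) {
    return (uint32_t)i->opcode | ((uint32_t)i->sub_key_1 << 16)
                               | ((uint32_t)i->sub_key_2 << 24);
}

static ToyInstr toy_decode(uint32_t w) {
    ToyInstr r = { (uint16_t)(w & 0xFFFF), (uint8_t)(w >> 16), (uint8_t)(w >> 24) };
    return r;
}

// Self-consistency check: fails iff some field cannot survive encode+decode,
// i.e. the instruction is not representable in the binary format.
static int roundtrip_ok(const ToyInstr *i) {
    ToyInstr d = toy_decode(toy_encode(i));
    return d.opcode == i->opcode && d.sub_key_1 == i->sub_key_1
                                 && d.sub_key_2 == i->sub_key_2;
}
```

A roundtrip failure in the real pipeline means either an encoder bug or an instruction whose operands do not fit any Mercury format -- exactly what this phase is designed to surface before final emission.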
Master Encoder -- sub_6D9690 (94KB)
The central SASS instruction encoding function and the single largest function in the ptxas backend. It contains a massive switch statement on the instruction type field (read from instruction+8) with cases covering every SASS instruction format.
// Simplified encoding flow
void EncodeInstruction(context, instruction) {
int type = *(int*)(instruction + 8);
uint64_t base = 0x2000000000LL; // encoding base constant
switch (type) {
case 61: // FFMA with literal operand
sub_6D9580(ctx, operand); // encode literal
break;
case 455: // complex multi-operand format
// ... bit-field extraction and assembly ...
break;
// ... hundreds of cases ...
}
// Common: append operand words, commit
sub_6D2750(ctx, word); // append 8-byte operand word
sub_6D28C0(ctx); // commit instruction record
}
Encoding details:
- Instructions are encoded as sequences of 8-byte words
- Operand word type prefix in bits [31:28]: `0x1` = register, `0x5` = immediate/constant, `0x6` = control/modifier, `0x7` = literal, `0x9` = special
- Control words carry the `0x60000000` prefix
- Architecture-specific bits are accumulated in a flags variable, with SM 100+ extensions via knob 4176
- `sub_7D6860` handles data type encoding (FP32/FP64/INT, etc.)
- `sub_C00BF0` provides opcode lookup from the encoding tables
- `sub_91D160` handles register operand encoding
Instruction Word Format
The Mercury instruction word is a 1280-bit (160-byte, 20-QWORD) structure located at offset +544 in the encoder object. All bit-field insertions use sub_7B9B80:
// sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers)
// Signature: (encoder_obj, bit_offset, bit_width, value)
void bitfield_insert(char *a1, int bit_offset, int bit_width, uint64_t value) {
    uint64_t mask = (bit_width >= 64) ? ~0ULL : (1ULL << bit_width) - 1;
    value &= mask;
    int qword_idx = bit_offset >> 6;
    int bit_pos = bit_offset & 63;
    uint64_t *words = (uint64_t *)(a1 + 544);   // 20-QWORD instruction word
    words[qword_idx] |= value << bit_pos;
    // Cross-QWORD boundary: spill the high bits into the next word (up to bit 1280)
    if (bit_pos + bit_width > 64 && qword_idx + 1 < 20)
        words[qword_idx + 1] |= value >> (64 - bit_pos);
}
Two companion helpers run before operand encoding:
- `sub_7B9D30` (38 bytes) -- clears the 16-entry constant buffer slot table at `a1+468` to `0xFF`
- `sub_7B9D60` (408 bytes) -- encodes reuse flags (1 bit) and predicate register index (5 bits) into the instruction word
Encoding Table Functions (530 functions)
The range 0xC66000--0xD27000 contains 530 functions that each initialize one row of the instruction format table. Every function calls sub_7B9B80 multiple times to describe the SASS bit layout for one instruction format variant:
// Example: sub_C6CF40 — one instruction format initializer
void init_format_XYZ(void *a1) {
sub_7B9B80(a1, 0, 4, 1); // bits[0:3] = opcode field = 1
sub_7B9B80(a1, 4, 3, 0); // bits[4:6] = format = 0
sub_7B9B80(a1, 8, 9, 0xC); // bits[8:16] = subopcode = 12
sub_7B9B80(a1, 0x11, 8, 0x13); // bits[17:24] = modifier = 19
sub_7B9B80(a1, 0x19, 7, 5); // bits[25:31] = unit = 5
}
Function sizes are remarkably uniform (1000--1600 bytes), reflecting mechanical code generation -- roughly 10 functions per ISA opcode group, covering all SASS formats for SM 100+.
Stage 2: MercExpandInstructions -- Pseudo-Instruction Expansion
| Phase | 118 |
| Entry | sub_C3CC60 (26KB, MercExpand::run) |
| Strings | "After MercExpand", "EXPANDING" |
Mercury uses abstract instruction forms that may map to multiple real SASS instructions. This phase expands every pseudo-instruction into its concrete SASS equivalent sequence. The expansion is type-dispatched:
| Handler | Size | Instruction class |
|---|---|---|
sub_C37A10 | 16KB | General instruction expansion (jump table with 4+ cases) |
def_C37B2E | 13KB | Complex expansion cases (default handler, creates new nodes) |
sub_C39B40 | 10KB | Memory operations (LDG, STG, LDS, etc.) |
sub_C3A460 | 6KB | Atomic operations |
sub_C3B560 | 8KB | Texture operations |
sub_C3BCD0 | 19KB | Control flow (branches, jumps, calls) |
sub_C3CC60 iterates over every instruction in the function, dispatching to the appropriate handler. Handlers create new instruction nodes, link them into the list, and delete the original pseudo-instruction. After all expansions, sub_C3E030 (18KB) performs finalization and cleanup.
The expansion engine also uses sub_719D00 (50KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate). The four nearly identical code blocks within that function correspond to template instantiations over operand width types.
Stage 3: WAR Hazard Resolution (Phases 119, 121)
| Phases | 119 (MercGenerateWARs1), 121 (MercGenerateWARs2) |
| Entry | sub_6FC220 / sub_6FC240 |
| Main pass | sub_6FBC20 (7.4KB) |
| String | "After MercWARs" |
| Knob | #16 (WAR generation control) |
Write-After-Read hazards occur when an instruction reads a register that a later instruction will overwrite -- the hardware pipeline can execute them out of order, causing the read to see the wrong value. The WAR pass inserts explicit DEPBAR (dependency barrier) instructions and scoreboard annotations to force correct ordering.
Two passes are needed: WAR1 runs after expansion but before opex, and WAR2 runs after opex. The second pass exists because opex itself introduces new instructions (scoreboard waits, synchronization barriers) that create additional WAR hazards not present in the pre-opex stream.
WAR Pass Algorithm -- sub_6FBC20
// Simplified WAR generation pass
void GenerateWARs(state) {
    // Guard conditions
    if (!(state->instr_flags & 1)) return;  // no WAR-sensitive instrs
    if (state->mode != 2) return;           // not Mercury mode
    // Per-instruction walk
    for (instr = state->first; instr != state->end; instr = instr->next) {
        // Detect hazard
        int severity = DetectWARHazard(state, instr);      // sub_6FA5B0
        if (severity >= 3) {
            InsertScoreboardBarrier(state, instr);         // sub_6FA930, opcode 54
            InsertWAITDP(state, instr);                    // sub_6FA7B0, opcode 246
            InsertWARStalls(state, instr, severity);       // sub_6FAA90
        }
    }
    PostWARAdjustment(state);   // sub_6FB850
    FinalizeWARPass(state);     // sub_6FB350
}
WAR Hazard Detection -- sub_6FA5B0 (2.5KB)
The detector classifies instructions by opcode:
- Always hazardous (opcodes 49, 248, 92): unconditionally increment the WAR counter
- Conditionally hazardous (opcode 75): partial hazard depending on operand configuration
- Special handling (opcodes 35, 246): store/scoreboard instructions with custom WAR rules
- Filtered out: `(opcode - 34) > 0x2C`, plus bitmask `0x100000400001` for irrelevant types
Architecture-specific hazard rules are dispatched through vtable methods at offsets +968, +1008, +528, and +504.
The detector maintains per-instruction state:
- `*(DWORD*)(state+2)` -- WAR counter (incremented per detected hazard)
- `*(DWORD*)(state+3)` -- severity level (3 = medium, 4 = high)
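The range-plus-bitmap filter is a classic decompiler idiom worth unpacking. Below is a sketch of the mechanics; the mask polarity -- whether a set bit marks a relevant or an irrelevant opcode -- is an assumption, as the decompilation does not settle it:

```c
#include <assert.h>
#include <stdint.h>

// Opcodes in the window [34, 34 + 0x2C] pass a range check (the unsigned
// subtraction rejects anything below 34 via wraparound), then a 64-bit
// literal mask selects per-opcode behavior.
static int in_opcode_window(unsigned opcode) {
    return (opcode - 34u) <= 0x2Cu;
}

static int mask_bit(unsigned opcode) {
    if (!in_opcode_window(opcode))
        return 0;
    return (int)((0x100000400001ULL >> (opcode - 34u)) & 1u);
}
```

The compiler emits this pattern whenever a switch over a dense opcode range collapses to "is this one of a few cases": one subtract, one compare, one shift.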
Inserted Instructions
DEPBAR / Scoreboard barrier (opcode 54) -- sub_6FA930:
- Created when `*(BYTE*)(instr+48) & 0x10` is set (barrier-needed flag)
- Barrier type extracted from bits 7:5 of the flag byte
- Encoding: `*(DWORD*)(new_instr+56) = 4` (barrier format)
- Control bits: `*(DWORD*)(new_instr+48) = (*(DWORD*)(new_instr+48) & 0xFFF83FFF) | 0x50000`
WAITDP (opcode 246) -- sub_6FA7B0:
- Skipped if a WAITDP already exists at the insertion point
- Operands configured with codes 102/467 and 301/1520
- Uses FNV-1a hash lookup for instruction deduplication
Stall cycles -- sub_6FAA90 (7.9KB):
- Computes required stall cycles from architecture-specific latency tables
- Vtable methods at +888, +896, +904 for stall calculation
- GPU family dispatch: `v8[14] == 9` triggers specific handling
- Adjusts stall count fields in the instruction control word
Stage 4: Opex -- Operation Expansion
| Phase | 120 (MercGenerateOpex) |
| Entry | sub_703480 (1.4KB, RunOpexPass) |
| MercOpex entry | sub_7032A0 (2.3KB, RunMercOpexPass) |
| Body | sub_6FFDC0 (66KB) |
| String | "After MercOpex" |
| Knobs | #17 (expansion options), #743 (reduce-reg), #747 (dynamic batch) |
"Operation expansion" is the most complex stage. It generates the dependency scoreboards, computes latency waits, and inserts synchronization barriers that the hardware needs to manage instruction-level parallelism. After opex, the instruction stream contains all the scheduling metadata required for correct execution.
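The scoreboard idea can be illustrated with a toy allocator. Six slots matches the dependence-barrier count on Volta-and-later SMs; everything else here (`SbState`, round-robin policy, the helper names) is illustrative, not recovered from the binary:

```c
#include <assert.h>

#define NUM_SB 6  // dependence-barrier count on Volta+ hardware

// Toy round-robin scoreboard allocator: each long-latency producer claims a
// slot, and a consumer waits on the OR of its producers' slot bits (the same
// shape as a DEPBAR wait mask).
typedef struct { int next; } SbState;

static int sb_alloc(SbState *s) {
    int slot = s->next;
    s->next = (s->next + 1) % NUM_SB;
    return slot;
}

static unsigned sb_wait_mask(const int *slots, int n) {
    unsigned mask = 0;
    for (int i = 0; i < n; i++)
        mask |= 1u << slots[i];
    return mask;
}
```

The real pass is far more involved -- it weighs latencies per functional unit and reuses slots as soon as their producers retire -- but every wait it emits reduces to a mask of this form.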
Entry Points
Two entry paths exist, both calling the same sub_6FFDC0 body:
sub_703480 (RunOpexPass, 1.4KB):
- Creates pipeline context via `sub_6FC280`
- Queries knob #17 to disable WAR penalty flags: `*(context->flags+52) &= ~0x10`
- Architecture check: `*(DWORD*)(context+52) == 20481` (SM 100a)
- For SM 100a: queries knob at offset 1296/1304 for loop unroll factor
- Sets Mercury mode: `*(DWORD*)(context+385) = 2`
- Calls `sub_6FFDC0` for actual expansion
sub_7032A0 (RunMercOpexPass, 2.3KB):
- Nearly identical to `sub_703480`
- Additionally calls `sub_10ADF10` to verify Mercury mode is active
- Allocates 40-byte + 24-byte records for Mercury-specific context
- Calls `sub_6FED20` to destroy the previous Mercury context before creating a new one
Opex Body -- sub_6FFDC0 (66KB)
This 66KB function with 200+ local variables performs:
- Instruction iteration -- walks the instruction list per basic block
- Latency computation -- determines execution latency for each instruction based on opcode, functional unit, and architecture
- Scoreboard allocation -- assigns dependency scoreboard entries to track producer-consumer relationships
- Wait insertion -- inserts `DEPBAR.WAIT` instructions where a consumer must wait for a producer to complete
- Stall count computation -- sets per-instruction stall counts in the scheduling control word
- Barrier generation -- inserts memory barriers and synchronization points
The function queries three knobs that control scheduling behavior:
- Knob #17: expansion options, WAR penalty flag control
- Knob #743: reduce-reg scheduling mode (minimize register pressure)
- Knob #747: dynamic batch scheduling mode
New instructions created by opex use sub_10B1F90 (instruction allocator) and sub_10AE590 (operand configuration).
Stage 5: SASS Microcode Emission
| Phase | 122 (MercGenerateSassUCode) |
| Entry | sub_6E4110 (24KB) |
The final stage converts the fully expanded, WAR-resolved, scoreboard-annotated Mercury stream into native SASS binary. This is the point of no return -- after this phase, the output is executable GPU machine code.
sub_6E4110 takes 8 parameters (context, instruction list, descriptors, format info, ...) and dispatches to the per-instruction encoding pipeline:
sub_6E4110 (24KB, final SASS emission)
├─ sub_735290 — per-instruction encoding pipeline
│ ├─ sub_733FA0 (5.1KB) — encode instruction operands
│ │ └─ sub_733870 (10KB) — source operand encoder
│ ├─ sub_734370 (6.1KB) — encode immediates
│ ├─ sub_734820 (4.1KB) — encode predicates
│ ├─ sub_734AD0 (3.3KB) — encode memory operands
│ └─ sub_734D20 (8.1KB) — encode complex operands (texture/surface/barrier)
├─ sub_726E00 (30.6KB) — instruction encoding with FNV-1a dedup cache
│ └─ sub_7266A0 (11.7KB) — hash table lookup (24-byte entries, separate chaining)
├─ sub_6E3F80 (2.2KB) — encode branch offsets
├─ sub_6E3560 (2.6KB) — finalize scheduling control words
└─ sub_712E70 (9.6KB) — handle relocations (cross-BB branch targets)
The encoding pipeline uses FNV-1a hashing (seed 0x811C9DC5, multiplier 16777619) to cache instruction encodings and avoid re-encoding identical instructions.
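The hash itself is the standard 32-bit FNV-1a, matching the recovered constants exactly. A minimal reconstruction follows; `bucket_for` is an illustrative helper, not a recovered function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Standard 32-bit FNV-1a (offset basis 0x811C9DC5, prime 16777619 = 0x01000193).
static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];       // xor-then-multiply is the "1a" ordering
        h *= 16777619u;
    }
    return h;
}

// Illustrative bucket selection for the dedup cache, which hashes the 4-byte
// sequence_id of each instruction node.
static uint32_t bucket_for(uint32_t sequence_id, uint32_t capacity) {
    return fnv1a((const uint8_t *)&sequence_id, 4) % capacity;
}
```

FNV-1a is a sensible choice for this cache: it is two instructions per byte, has good dispersion on short keys like a 4-byte ID, and needs no tables.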
Architecture-Specific Dispatch
Architecture selection reads *(int*)(config + 372) >> 12 to determine the SM generation. A vtable at *(context+416) with approximately 200 methods provides per-architecture behavior for encoding, latency tables, and hazard rules.
| SM generation | config+372 >> 12 | SM versions |
|---|---|---|
| Kepler | 3 | sm_30--sm_37 |
| Maxwell | 5 | sm_50--sm_53 |
| Pascal | 6 | sm_60--sm_62 |
| Volta/Turing | 7 | sm_70--sm_75 |
| Ampere | 8 | sm_80--sm_89 |
| Hopper | 9 | sm_90--sm_90a |
| Blackwell | (10+) | sm_100--sm_121 |
The encoder state initializer sub_6E8EB0 (64KB) sets architecture-specific flags and populates the opcode descriptor table (40+ entries mapping internal opcode IDs to encoding words). For SM 80 (0x5000) it sets bits 1 and 8; for SM 84 (0x5004) it sets bits 16 and 64.
Vtable dispatch helpers at 0xC65530--0xC656E0:
- `sub_C65530` -- 3-key dispatch (opcode, subop1, subop2), binary search through 24-byte table entries
- `sub_C65600` -- instruction-keyed dispatch, reads keys from `instr+12/14/15`
- `sub_C656E0` -- instruction-keyed dispatch with fallback to default handler `sub_9B3020`
Data Structures
Mercury Instruction Word
Offset Size Field
------ ------ --------------------------------------------------
+0 8B vtable pointer (encoder object)
+468 64B Constant buffer slot table (16 x DWORD, cleared to 0xFF)
+532 4B Constant buffer slot count
+544 160B Instruction word (1280 bits = 20 QWORDs)
— populated by sub_7B9B80 bitfield inserts
— max addressable bit: 1280
SASS Encoding Record (~264 bytes)
Output of sub_6D9690. Contains the encoded instruction words, operand data, and metadata. The encoding base constant is 0x2000000000LL.
Pipeline Context
Offset Size Field
------ ------ --------------------------------------------------
+52 4B Architecture ID (20481 = sm100a)
+236 1B Uses shared memory flag
+284 4B Function flags (bits 0, 3, 7 checked by WAR pass)
+385 4B Mercury mode flag (2 = Mercury/Capsule mode)
+416 8B Architecture vtable pointer (~200 virtual methods)
Scheduling Control Word (per SASS instruction)
Offset Size Field
------ ------ --------------------------------------------------
+48 4B Control bits (barrier flags at bits 18:14)
+56 4B Encoding format (4 = barrier format)
+144 4B Scheduling slot
+164 4B Resource class
+168 1B Stall bits
+236 4B Latency value
Mercury Instruction Node Layout
The Mercury pipeline (phases 117--122) operates on its own instruction representation, distinct from the 296-byte Ori IR instruction node documented in Instructions & Opcodes. The master encoder sub_6D9690 (phase 117) reads Ori IR nodes and produces Mercury instruction nodes; all subsequent phases -- expansion, WAR resolution, opex, and SASS emission -- operate exclusively on Mercury nodes.
Allocation
Mercury instruction nodes are allocated by sub_10AF8C0 (92 lines), which either recycles a node from a per-block free list or allocates exactly 160 bytes from the arena. The primary API wrappers are sub_10B1F90 and sub_10B1EE0, which call sub_10AF8C0 and perform additional bookkeeping (FNV-1a deduplication cache registration, scheduling state propagation).
Node Layout (160 bytes)
| Offset | Size | Type | Init value | Field | Description |
|---|---|---|---|---|---|
| +0 | 8 | ptr | 0 | next | Forward pointer in per-block doubly-linked list |
| +8 | 8 | ptr | 0 | prev | Backward pointer in per-block doubly-linked list |
| +16 | 8 | ptr | source loc | source_loc | Source location copied from context (slot 124) |
| +24 | 4 | u32 | 772 (0x304) | node_type | Constant type marker -- never modified after init |
| +28 | 2 | u16 | 0xFFFF | opcode | SASS opcode number (0xFFFF = sentinel / BB boundary) |
| +30 | 1 | u8 | 0xFF | sub_key_1 | Encoding sub-key 1 (format variant selector) |
| +31 | 1 | u8 | 0xFF | sub_key_2 | Encoding sub-key 2 (modifier selector) |
| +32 | 4 | u32 | counter | sequence_id | Monotonically increasing ID; FNV-1a dedup key |
| +36 | 4 | --- | --- | (padding) | Alignment to 8-byte boundary |
| +40 | 8 | ptr | ctx | context_ptr | Back-pointer to allocator / code-object base |
| +48 | 8 | u64 | 0 | encoded_data_0 | Encoded operand / property data |
| +56 | 8 | u64 | 0xFFFFFFFF | sentinel_56 | Sentinel / uninitialized marker |
| +64 | 8 | u64 | 0 | encoded_data_1 | Encoded operand / property data |
| +72 | 8 | u64 | 0 | encoded_data_2 | Encoded operand / property data |
| +80 | 8 | u64 | 0 | encoded_data_3 | Encoded operand / property data |
| +88 | 8 | i64 | -1 | sentinel_88 | Sentinel (end-of-data marker) |
| +96 | 8 | i64 | -1 | sentinel_96 | Sentinel |
| +104 | 8 | u64 | 0xFFFFFFFF | sentinel_104 | Sentinel |
| +112 | 8 | u64 | 0 | reserved_112 | Reserved (zeroed) |
| +120 | 8 | u64 | 0 | reserved_120 | Reserved (zeroed) |
| +128 | 8 | ptr | alloc'd | sched_ctrl_ptr | Pointer to 60-byte scheduling control record |
| +136 | 8 | ptr | ctx sched | sched_context | Context scheduling state (context slot 52) |
| +144 | 4 | u32 | 0xFFFFFFFF | sched_slot | Scheduling slot (sentinel = unscheduled) |
| +148 | 4 | u32 | 0 | node_flags | Node flags (bit 1 = BB boundary, bit 10 = 0x400) |
| +152 | 4 | u32 | 0xFFFFFFFF | block_seq | Basic-block sequence number |
The opcode field at +28 carries the Mercury/SASS opcode number. Known values include: 0xFFFF (sentinel, BB boundary marker), 54 (DEPBAR -- dependency barrier), 246 (WAITDP -- wait for dependency pipeline). All other values are SASS instruction opcodes.
Scheduling Control Record (60 bytes)
Each Mercury instruction node points (via +128) to a separately allocated 60-byte scheduling control record. This record carries barrier state, stall counts, and encoding format metadata that the WAR and opex passes read and modify.
| Offset | Size | Type | Init | Field | Description |
|---|---|---|---|---|---|
| +0 | 16 | xmm | SSE const | header | SSE-initialized from xmmword_2027620 |
| +16 | 16 | xmm | SSE const | latency | SSE-initialized from xmmword_202DC90 |
| +32 | 1 | u8 | 0 | flag_32 | General-purpose flag byte |
| +36 | 8 | i64 | -1 | barrier_state | Barrier tracking sentinel |
| +44 | 4 | u32 | 0 | stall_count | Stall cycle count |
| +48 | 4 | u32 | 0xEE (low byte) | control_bits | Scheduling control word; bits 17:13 = barrier type |
| +56 | 4 | u32 | 0 | encoding_format | Format discriminator (1 = basic, 4 = barrier, 15 = NOP stall) |
The control_bits field at sched+48 is the primary target of WAR pass modifications:
Bits 18:14 — barrier type (masked via 0xFFF83FFF then OR'd with type << 14)
Bit 4 — barrier-needed flag (in byte at sched+50)
Bits 7:5 — barrier sub-type (in byte at sched+50)
WAR insertion functions modify this field with specific patterns:
- `sub_6FA930` (InsertScoreboardBarrier): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x50000`; clears bit 4 of sched[50]; sets `sched[56] = 4`
- `sub_6FA430` (InsertNOP): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x44000`; clears bit 4 of sched[50]; sets `sched[56] = 1`
- `sub_6FAFD0` (InsertStall): `sched[48] = (sched[48] & 0xFFF83FFF) | 0x3C000`; sets bit 4 of sched[50]; sets `sched[56] = 15`
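All three rewrites share one mask-and-OR shape over the barrier field (mask 0xFFF83FFF as recovered; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

// Clear the barrier field of the scheduling control word, then OR in the
// insertion pattern (0x50000 = barrier, 0x44000 = NOP, 0x3C000 = stall).
static uint32_t set_barrier_pattern(uint32_t ctrl, uint32_t pattern) {
    return (ctrl & 0xFFF83FFFu) | pattern;
}
```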
Linked-List Structure
Mercury nodes form a doubly-linked list per basic block, managed through the next (+0) and prev (+8) pointers:
head (ctx+40) tail (ctx+32)
| |
v v
[node_0] <--> [node_1] <--> ... <--> [node_N]
next=node_1 next=node_2 next=0
prev=0 prev=node_0 prev=node_{N-1}
New nodes are inserted before the reference node by sub_10AF8C0. The WAR pass (sub_6FBC20) iterates forward through the list; sub_6FB850 (PostWARAdjustment) iterates backward, skipping sentinel nodes (opcode == 0xFFFF).
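Insert-before on a doubly-linked list is mechanical but easy to get wrong in decompilation; a sketch (next at +0, prev at +8 in the real node; the struct and function names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Minimal Mercury-style list node.
typedef struct MercNode {
    struct MercNode *next, *prev;
    uint16_t opcode;                  // 0xFFFF = sentinel / BB boundary
} MercNode;

// Splice n into the list immediately before ref.
static void insert_before(MercNode *ref, MercNode *n) {
    n->next = ref;
    n->prev = ref->prev;
    if (ref->prev)
        ref->prev->next = n;
    ref->prev = n;
}
```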
FNV-1a Deduplication
The sequence_id at +32 serves as the FNV-1a hash key for the instruction deduplication cache. The hash is computed over the 4-byte ID using the standard FNV-1a parameters (seed 0x811C9DC5, multiplier 16777619). The cache resides at context+488 (hash table pointer) with capacity at context+496 and entry count at context+480. Each hash table entry is 24 bytes with separate chaining via pointer at entry+0, key at entry+8, and value (Mercury encoding record pointer) at entry+16.
Relationship to Ori IR Instruction Node
The Mercury node is distinct from the Ori IR instruction node:
| Property | Ori IR node | Mercury node |
|---|---|---|
| Size | 296 bytes | 160 bytes |
| Allocator | sub_7DD010 | sub_10AF8C0 |
| Opcode location | +72 (32-bit word) | +28 (16-bit) |
| Operand model | Packed array at +84 | Encoded data at +48..+120 |
| Scheduling | Pointer at +40 | Pointer at +128 (60-byte record) |
| List linkage | +0 / +8 (prev/next) | +0 / +8 (next/prev) |
| Pipeline phases | 1--116 | 117--122 |
Phase 117 (MercEncodeAndDecode) reads Ori IR nodes via the master encoder sub_6D9690 and produces Mercury nodes. All subsequent Mercury pipeline phases operate on Mercury nodes exclusively.
Configuration
| Knob | Purpose | Context |
|---|---|---|
| 16 | WAR generation control | Checked in sub_6FBC20 |
| 17 | Expansion/opex options; disables WAR penalty flags | sub_703480 entry |
| 595 | Scheduling enable check | Scheduling pre-check |
| 743 | Scheduling reduce-reg mode | sub_6FFDC0 opex body |
| 747 | Scheduling dynamic batch mode | sub_6FFDC0 opex body |
| 4176 | SM 100+ extension bits for encoding | sub_6D9690 encoder |
Diagnostic Strings
| String | Source | Trigger |
|---|---|---|
"After Decode" | sub_6F2BF0 | Decode stage completion |
"After Expansion" | sub_6F2BF0 | Expansion stage completion |
"After WAR post-expansion" | sub_6F2BF0 | WAR pass 1 completion |
"After Opex" | sub_6F2BF0 | Opex stage completion |
"After WAR post-opexing" | sub_6F2BF0 | WAR pass 2 completion |
"After MercWARs" | sub_6FC240 | WAR pass trace |
"After MercOpex" | sub_7032A0 | Opex pass trace |
"After MercExpand" | sub_C3DFC0 | Expansion pass trace |
"After MercConverter" | 0x9F3818 | MercConverter phase completion |
"CONVERTING" | sub_9EF5E0 | Active operand reorganization (per instruction) |
"After EncodeAndDecode" | 0x23D1A60 | Roundtrip verification |
"EXPANDING" | 0xC381B3 | Active instruction expansion |
"ENCODING" | 0x21C2880 | Active instruction encoding |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_9F1A90 | 35KB | MercConverter::ConvertInstruction (opcode dispatch, phase 5/141) | HIGH |
sub_9EF5E0 | 27KB | MercConverter::ReorganizeOperands (post-conversion lowering) | HIGH |
sub_9ED2D0 | 25KB | MercConverter::Dispatch (master opcode switch, & 0xCF mask) | HIGH |
sub_9F3340 | 7KB | MercConverter::Run (orchestrator, calls 9F1A90 then 9EF5E0) | HIGH |
sub_9EC160 | ~2KB | MergeSort (linked-list merge sort for operand chains) | HIGH |
sub_7BFC30 | ~4KB | MercConverter::ValidateEncoding (returns -1 on failure) | HIGH |
sub_9CE210 | ~6KB | MercConverter::FallbackConvert (recursive re-encoding) | MEDIUM |
sub_6D9690 | 94KB | MercuryEncode::EncodeInstruction (master switch) | HIGH |
sub_6FFDC0 | 66KB | MercuryPipeline::EmitInstructions (opex body) | HIGH |
sub_6E8EB0 | 64KB | BasicBlock::Initialize (encoder state init) | MEDIUM |
sub_6F2BF0 | 59KB | DecodePipeline::DecodeAndExpand | MEDIUM |
sub_719D00 | 50KB | ExpansionEngine::buildOutput | MEDIUM |
sub_726E00 | 30.6KB | Instruction encoding + FNV-1a dedup cache | HIGH |
sub_C3CC60 | 26KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
sub_6FC810 | 24KB | MercuryPipeline::Configure | MEDIUM |
sub_6E4110 | 24KB | MercGenerateSassUCode (final SASS emission) | HIGH |
sub_6F52F0 | 23KB | DecodePipeline::RunStages (orchestrator) | MEDIUM |
sub_C3BCD0 | 19KB | MercExpand::expandControlFlow | HIGH |
sub_6FF070 | 18KB | Predicate handling in expansion | MEDIUM |
sub_C3E030 | 18KB | MercExpand::finalizeExpansion | HIGH |
sub_C37A10 | 16KB | MercExpand::expandInstruction | HIGH |
sub_C38180 | 13KB | MercExpand::expandInstruction (complex cases) | HIGH |
sub_7266A0 | 11.7KB | FNV-1a hash table (instruction cache) | HIGH |
sub_733870 | 10KB | Source operand encoder | MEDIUM |
sub_C39B40 | 10KB | MercExpand::expandMemoryOp | HIGH |
sub_6FAA90 | 7.9KB | WAR stall insertion | HIGH |
sub_735290 | 7.6KB | Per-instruction SASS encoding pipeline | MEDIUM |
sub_6FBC20 | 7.4KB | WAR generation main pass | HIGH |
sub_C3B560 | 8KB | MercExpand::expandTexture | HIGH |
sub_734D20 | 8.1KB | Complex operand encoder (texture/surface/barrier) | MEDIUM |
sub_C3A460 | 6KB | MercExpand::expandAtomicOp | HIGH |
sub_734370 | 6.1KB | Immediate operand encoder | MEDIUM |
sub_733FA0 | 5.1KB | Instruction operand encoder | MEDIUM |
sub_734820 | 4.1KB | Predicate operand encoder | MEDIUM |
sub_734AD0 | 3.3KB | Memory operand encoder | MEDIUM |
sub_6FA5B0 | 2.5KB | WAR hazard detector | HIGH |
sub_7032A0 | 2.3KB | RunMercOpexPass (entry) | HIGH |
sub_6FC280 | 1.8KB | Create pipeline context | MEDIUM |
sub_6FA7B0 | 1.7KB | InsertWAITDP (opcode 246) | HIGH |
sub_703480 | 1.4KB | RunOpexPass (entry) | HIGH |
sub_6FA930 | 1.4KB | InsertScoreboardBarrier (opcode 54) | HIGH |
sub_10AF8C0 | ~0.5KB | MercNode::Allocate (160-byte node allocator, core initializer) | HIGH |
sub_10B1F90 | ~0.2KB | MercNode::Create (wrapper: allocate + dedup cache + sched state) | HIGH |
sub_10B1EE0 | ~0.2KB | MercNode::Clone (wrapper: allocate from clone source) | HIGH |
sub_10B14B0 | ~0.2KB | MercNode::CreateBBBoundary (creates sentinel pair, opcode 0xFFFF) | HIGH |
sub_6FAFD0 | ~1KB | InsertScoreboardStalls (allocate NOP stall nodes) | HIGH |
sub_6FA430 | ~0.5KB | InsertNOP (allocate NOP barrier nodes) | HIGH |
sub_7B9B80 | 216B | Bitfield insert primitive (18,347 callers) | CERTAIN |
sub_7B9D30 | 38B | Clear constant buffer slot table | HIGH |
sub_7B9D60 | 408B | Encode reuse flags + predicate | HIGH |
Cross-References
- ISel & Opcode Selection -- MercConverter opcode dispatch table (`sub_9ED2D0`), handler details
- Instructions & Opcodes -- Ori IR instruction node layout (296 bytes, input to Mercury encoder)
- Code Generation Overview -- high-level codegen pipeline context
- SASS Instruction Encoding -- detailed bit-level encoding format
- Capsule Mercury & Finalization -- capmerc variant and `--self-check`
- Scoreboards & Dependency Barriers -- DEPBAR/WAITDP semantics
- Phase Manager -- 159-phase pipeline infrastructure
- Optimization Levels -- `-O0` vs higher-level scoreboard behavior
- Knobs System -- knob #16, #17, #743, #747 details
Capsule Mercury & Finalization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Capsule Mercury ("capmerc") is a packaging format that wraps Mercury-encoded instruction streams with relocation metadata, debug information, and a snapshot of compilation knobs, enabling deferred finalization for a target SM that may differ from the original compilation target. Where standard Mercury produces a fully-resolved SASS binary bound to a single SM, capmerc produces an intermediate ELF object that a downstream tool (the driver or linker) can finalize into native SASS at load time. This is the default output format for all SM 100+ targets (Blackwell, Jetson Thor, consumer RTX 50-series). The capmerc data lives in .nv.capmerc<funcname> per-function ELF sections alongside 21 types of .nv.merc.* auxiliary sections carrying cloned debug data, memory-space metadata, and Mercury-specific relocations. Finalization can be "opportunistic" -- the same capmerc object may be finalized for different SMs within or across architectural families, controlled by --opportunistic-finalization-lvl.
| Output modes | mercury (SM 75--99 default), capmerc (SM 100+ default), sass (explicit only) |
| CLI parser | sub_703AB0 (10KB, ParsePtxasOptions) |
| Auto-enable | SM arch > 99 sets *(context + offset+81) = 1 |
| Mercury mode flag | *(DWORD*)(context+385) == 2 (shared with Mercury) |
| Capsule descriptor | 328-byte object, one per function (sub_1C9C300) |
| Merc section classifier | sub_1C98C60 (9KB, 15 .nv.merc.* names) |
| Master ELF emitter | sub_1C9F280 (97KB, orchestrates full CUBIN output) |
| Self-check verifier | sub_720F00 (64KB Flex lexer) + sub_729540 (35KB comparator) |
| Off-target checker | sub_60F290 (compatibility validation) |
| Kernel finalizer | sub_612DE0 (47KB, fastpath optimization) |
Output Mode Selection
The ptxas CLI (sub_703AB0) registers three binary-kind options plus related flags:
| Option | String literal | Purpose |
|---|---|---|
--binary-kind | "mercury,capmerc,sass" | Select output format |
--cap-merc | "Generate Capsule Mercury" | Force capmerc regardless of SM |
--self-check | "Self check for capsule mercury (capmerc)" | Roundtrip verification |
--out-sass | "Generate output of capmerc based reconstituted sass" | Dump reconstituted SASS |
--opportunistic-finalization-lvl | (in finalization logic) | Finalization aggressiveness |
When --binary-kind is not specified, the default is determined by SM version:
// Pseudocode from sub_703AB0 + auto-enable logic
if (sm_version > 99) {
*(context + offset + 81) = 1; // capmerc auto-enabled
binary_kind = CAPMERC;
} else if (sm_version >= 75) {
binary_kind = MERCURY;
} else {
binary_kind = SASS; // legacy direct encoding
}
The Mercury mode flag *(DWORD*)(context+385) == 2 is shared between Mercury and capmerc -- both use the identical Mercury encoder pipeline (phases 117--122). The capmerc distinction is purely at the ELF emission level: capmerc wraps the phase-122 SASS output in a capsule descriptor with relocation metadata instead of emitting it directly as a .text section.
Capsule Mercury ELF Structure
A capmerc-mode compilation produces a CUBIN ELF with two layers of content: standard CUBIN sections (.text.<func>, .nv.constant0, .nv.info.<func>, etc.) and a parallel set of .nv.merc.* sections carrying the metadata needed for deferred finalization.
CUBIN ELF (capmerc mode)
├── Standard sections
│ ├── .shstrtab, .strtab, .symtab, .symtab_shndx
│ ├── .text.<funcname> (SASS binary, possibly partial)
│ ├── .nv.constant0.<funcname> (constant bank data)
│ ├── .nv.shared.<funcname> (shared memory layout)
│ ├── .nv.info.<funcname> (EIATTR attributes)
│ ├── .note.nv.tkinfo, .note.nv.cuinfo
│ └── .nv.uft.entry (unified function table)
│
├── Per-function capsule descriptor
│ └── .nv.capmerc<funcname> (328-byte descriptor + payload)
│
└── Mercury auxiliary sections (21 types)
├── .nv.merc.debug_abbrev (DWARF abbreviation table)
├── .nv.merc.debug_aranges (DWARF address ranges)
├── .nv.merc.debug_frame (DWARF frame info)
├── .nv.merc.debug_info (DWARF info)
├── .nv.merc.debug_line (DWARF line table)
├── .nv.merc.debug_loc (DWARF locations)
├── .nv.merc.debug_macinfo (DWARF macro info)
├── .nv.merc.debug_pubnames (DWARF public names)
├── .nv.merc.debug_pubtypes (DWARF public types)
├── .nv.merc.debug_ranges (DWARF ranges)
├── .nv.merc.debug_str (DWARF string table)
├── .nv.merc.nv_debug_ptx_txt (embedded PTX source text)
├── .nv.merc.nv_debug_line_sass (SASS-level line table)
├── .nv.merc.nv_debug_info_reg_sass (register allocation info)
├── .nv.merc.nv_debug_info_reg_type (register type info)
├── .nv.merc.symtab_shndx (extended section index table)
├── .nv.merc.nv.shared.reserved (shared memory reservation)
├── .nv.merc.rela (Mercury relocations)
├── .nv.merc.rela<secname> (per-section relocation tables)
└── .nv.merc.<memory-space> (cloned constant/global/local/shared)
Capsule Descriptor -- sub_1C9C300
Each function produces a .nv.capmerc<funcname> section constructed by sub_1C9C300 (3,816 bytes in the binary; ~24KB decompiled). This function processes .nv.capmerc and .merc markers, embeds KNOBS data (compilation configuration snapshot), manages constant bank replication, and creates the per-function descriptor.
The descriptor is a 328-byte object containing:
- Mercury-encoded instruction stream for the function
- R_MERCURY_* relocation entries that must be patched during finalization
- KNOBS block -- a serialized snapshot of all knob values affecting code generation, optimization level, target parameters, and feature flags
- References to the .nv.merc.* auxiliary sections
- Function-level metadata: register counts, barrier counts, shared memory usage
The KNOBS embedding allows the finalizer to reproduce exact compilation settings without the original command-line arguments. This is critical for off-target finalization where the finalizer runs in a different context (e.g., the CUDA driver at application load time).
Capsule Descriptor Layout (328 bytes)
The descriptor is heap-allocated via sub_424070(allocator, 328) and zero-filled before field initialization. The constructor also creates a companion .merc<funcname> descriptor (same 328-byte layout) when merc section mirroring is active.
Capsule Descriptor (328 bytes = 0x148)
======================================
Group 1: Identity
┌─────────┬──────┬────────────────────────────────────────┐
0x000 │ WORD │ 2B │ desc_version │
0x002 │ WORD │ 2B │ instr_format_version │
0x004 │ DWORD │ 4B │ section_index │
0x008 │ DWORD │ 4B │ weak_symbol_index │
0x00C │ -- │ 4B │ (padding) │
└─────────┴──────┴────────────────────────────────────────┘
Group 2: SASS Data
┌─────────┬──────┬────────────────────────────────────────┐
0x010 │ QWORD │ 8B │ weak_symbol_desc │
0x018 │ QWORD │ 8B │ sass_data_offset │
0x020 │ DWORD │ 4B │ sass_data_size │
0x024 │ -- │ 4B │ (padding) │
0x028 │ QWORD │ 8B │ func_name_ptr │
└─────────┴──────┴────────────────────────────────────────┘
Group 3: Relocation Infrastructure
┌─────────┬──────┬────────────────────────────────────────┐
0x030 │ QWORD │ 8B │ rela_list_a (vector) │
0x038 │ QWORD │ 8B │ rela_list_b (vector) │
0x040 │ QWORD │ 8B │ reloc_symbol_list (vector) │
0x048 │ QWORD │ 8B │ aux_rela_list (vector) │
0x050 │ QWORD │ 8B │ debug_rela_list (vector) │
0x058 │ QWORD │ 8B │ text_section_offset │
0x060 │ QWORD │ 8B │ reloc_index_set (sorted container) │
0x068 │ QWORD │ 8B │ per_reloc_data_set (sorted container) │
0x070 │ BYTE │ 1B │ sampling_mode │
0x071 │ -- │ 7B │ (padding) │
0x078 │ QWORD │ 8B │ reloc_payload_map (sorted container) │
0x080 │ --    │ 32B  │ (reserved, not written by constructor) │
└─────────┴──────┴────────────────────────────────────────┘
Group 4: Function Metadata
┌─────────┬──────┬────────────────────────────────────────┐
0x0A0 │ QWORD │ 8B │ section_flags │
0x0A8 │ DWORD │ 4B │ max_register_count │
0x0AC │ DWORD │ 4B │ extra_section_index │
0x0B0 │ BYTE │ 1B │ has_global_refs │
0x0B1 │ BYTE │ 1B │ has_shared_refs │
0x0B2 │ BYTE │ 1B │ has_exit │
0x0B3 │ BYTE │ 1B │ has_crs │
0x0B4 │ BYTE │ 1B │ uses_atomics │
0x0B5 │ BYTE │ 1B │ uses_shared_atomics │
0x0B6 │ BYTE │ 1B │ uses_global_atomics │
0x0B7 │ BYTE │ 1B │ has_texture_refs │
0x0B8 │ -- │ 24B │ (padding) │
└─────────┴──────┴────────────────────────────────────────┘
Group 5: Code Generation Parameters
┌─────────┬──────┬────────────────────────────────────────┐
0x0D0 │ QWORD │ 8B │ knobs_section_desc_ptr → 64B sub-obj │
│ │ │ +0x00 DWORD: knobs_section_index │
│ │ │ +0x08 QWORD: knobs_section_offset │
│ │ │ +0x10 DWORD: knobs_section_size │
│ │ │ +0x18 QWORD: knobs_section_name_ptr │
0x0D8 │ DWORD │ 4B │ stack_frame_size │
0x0DC │ -- │ 4B │ (padding) │
0x0E0 │ DWORD │ 4B │ register_count │
0x0E4 │ -- │ 4B │ (padding) │
0x0E8 │ DWORD │ 4B │ barrier_info_size │
0x0EC │ -- │ 4B │ (padding) │
0x0F0 │ QWORD │ 8B │ barrier_info_data_ptr │
0x0F8 │ -- │ 8B │ (reserved) │
└─────────┴──────┴────────────────────────────────────────┘
Group 6: Constant Bank & Section Info
┌─────────┬──────┬────────────────────────────────────────┐
0x100 │ QWORD │ 8B │ const_bank_offset │
0x108 │ DWORD │ 4B │ const_bank_size │
0x10C │ -- │ 4B │ (padding) │
0x110 │ QWORD │ 8B │ section_name_ptr (".nv.capmerc<func>") │
0x118 │ QWORD │ 8B │ section_alignment (default 16) │
0x120 │ DWORD │ 4B │ const_bank_section_index │
0x124 │ -- │ 4B │ (padding) │
0x128 │ DWORD │ 4B │ text_section_index │
0x12C │ DWORD │ 4B │ text_rela_section_index │
└─────────┴──────┴────────────────────────────────────────┘
Group 7: KNOBS Embedding
┌─────────┬──────┬────────────────────────────────────────┐
0x130 │ QWORD │ 8B │ kv_pair_list (vector) │
0x138 │ QWORD │ 8B │ knobs_pair_list (vector) │
0x140 │ WORD │ 2B │ min_sm_version (default 256 = sentinel) │
0x142 │ BYTE │ 1B │ has_crs_depth │
0x143 │ -- │ 5B │ (padding to 0x148) │
└─────────┴──────┴────────────────────────────────────────┘
Key design observations:
Flag byte block (+0x0B0 to +0x0B7). Eight single-byte flags capture function characteristics that determine which R_MERCURY_* relocation patches the finalizer must apply. The flags are set by type-2, type-3, and type-4 markers in the capmerc stream. Each flag is a boolean (0 or 1), never a bitfield.
KNOBS indirection (+0x0D0). The KNOBS data does not live inline in the descriptor. Instead, +0x0D0 points to a separately allocated 64-byte sub-object carrying the ELF coordinates (section index, file offset, size, and name pointer) of the KNOBS section. This allows the KNOBS data to reside in a dedicated ELF section while the descriptor references it by position. The KNOBS pair list at +0x138 and the generic key-value list at +0x130 store the parsed key-value pairs from marker type 90 data blocks; the "KNOBS" string literal serves as the discriminator between the two lists.
Dual-descriptor pattern. When the merc section mirror is active, the constructor allocates a second 328-byte object for the .merc<funcname> companion section. This companion receives a copy of the SASS data (not a pointer -- an actual memcpy of sass_data_size bytes), the function name with a .merc prefix, and the section flags from the original ELF section header at +0x0A0. The companion's weak_symbol_index (+0x008) is always zero.
Relocation containers. The three sorted containers at +0x060, +0x068, and +0x078 (created via sub_425CA0 with comparator pair sub_427750/sub_427760 and element size 0x20 = 32 bytes) form a three-level relocation index. The reloc_index_set stores symbol indices that appear in relocations. The per_reloc_data_set stores per-symbol relocation metadata. The reloc_payload_map associates symbol indices with the actual payload data that the finalizer patches into instruction bytes. These are populated by marker sub-types 10, 23, 25, 28, 40, 46, 49, 52, 57, 64, 68, 70, 71, 85, and 87.
min_sm_version sentinel. The default value 256 (0x100) at +0x140 acts as a sentinel meaning "no minimum SM constraint." When a target profile is available at construction time, the profile's SM version overwrites this field. Marker sub-type 95 can further override it when CRS depth information constrains the minimum SM.
Capmerc Marker Stream Format
The constructor parses a compact binary marker stream embedded in the capmerc section data. Each marker begins with a type byte followed by a sub-type byte:
| Type | Size | Format | Description |
|---|---|---|---|
| 2 | 4 bytes fixed | [02] [sub] [00] [00] | Boolean flag markers |
| 3 | 4 bytes fixed | [03] [sub] [WORD payload] | Short value markers |
| 4 | variable | [04] [sub] [WORD size] [payload...] | Variable-length data markers |
Selected marker sub-types and the descriptor fields they populate:
| Sub | Type | Descriptor Field | Purpose |
|---|---|---|---|
| 10 | 4 | +0x0B0 has_global_refs | Function accesses global memory |
| 21 | 2 | +0x0B2 has_exit | Function contains EXIT instruction |
| 22 | 2 | +0x0B3 has_crs | Function uses call return stack |
| 23 | 4 | +0x0A8 max_register_count | Register pressure (with max tracking) |
| 25 | 4 | +0x0B1 has_shared_refs | Function accesses shared memory |
| 27 | 3 | +0x000 desc_version | Descriptor format version stamp |
| 47 | 4 | +0x002 instr_format_version | Instruction encoding format version |
| 50 | 4 | +0x0E0 register_count | Allocated register count |
| 54 | 4 | +0x0B7 has_texture_refs | Function uses texture/sampler units |
| 67 | 4 | +0x0E8, +0x0F0 barrier_info | Barrier count and data |
| 72 | 4 | +0x0D0 knobs_section_desc_ptr | KNOBS section ELF binding |
| 73 | 3 | +0x0D8 stack_frame_size | Per-thread stack frame bytes |
| 74 | 2 | +0x070 sampling_mode | Interpolation/sampling mode |
| 88 | 3 | +0x0B4/B5/B6 | Atomic usage (plain/shared/global) |
| 90 | 4 | +0x138 knobs_pair_list | KNOBS key-value data block |
| 95 | 3 | +0x140, +0x142 | Min SM version + CRS depth flag |
.nv.merc.* Section Builder Pipeline
Four functions cooperate to construct the .nv.merc.* section namespace:
sub_1C9F280 (97KB, Master ELF emitter)
│
├─ sub_1C9B110 (23KB) ── Mercury capsule builder
│ Creates .nv.merc namespace, reads symtab entry count,
│ allocates mapping arrays, duplicates sections into merc space
│
├─ sub_1CA2E40 (18KB) ── Mercury section cloner
│ Iterates all sections, clones constant/global/shared/local
│ into .nv.merc.* namespace, creates .nv.merc.rela sections,
│ handles .nv.global.init and .nv.shared.reserved
│
├─ sub_1C9C300 (24KB) ── Capsule descriptor processor
│ Processes .nv.capmerc and .merc markers, embeds KNOBS,
│ handles constant bank replication and rela duplication
│
├─ sub_1CA3A90 (45KB) ── Section merger
│ Merge/combine pass for sections with both merc and non-merc
│ copies; processes .nv.constant bank sections, handles section
│ linking and rela association
│
└─ sub_1C99BB0 (25KB) ── Section index remap
Reindexes sections after dead elimination, handles
.symtab_shndx / .nv.merc.symtab_shndx mapping
The section classifiers sub_1C9D1F0 (16KB) and sub_1C98C60 (9KB) map section names to internal type IDs. The former handles both SASS and merc debug section variants; the latter is merc-specific and recognizes all 15 .nv.merc.debug_* names.
R_MERCURY_* Relocation Types
Capsule Mercury defines its own relocation type namespace for references within .nv.merc.rela sections and the capsule descriptor. These are distinct from standard CUDA ELF relocations (R_NV_32, etc.) and are processed during finalization rather than at link time.
| Type | Description |
|---|---|
R_MERCURY_ABS64 | 64-bit absolute address |
R_MERCURY_ABS32 | 32-bit absolute address |
R_MERCURY_ABS16 | 16-bit absolute address |
R_MERCURY_PROG_REL | PC-relative reference |
R_MERCURY_8_0 | Sub-byte patch: bits [7:0] of target word |
R_MERCURY_8_8 | Sub-byte patch: bits [15:8] |
R_MERCURY_8_16 | Sub-byte patch: bits [23:16] |
R_MERCURY_8_24 | Sub-byte patch: bits [31:24] |
R_MERCURY_8_32 | Sub-byte patch: bits [39:32] |
R_MERCURY_8_40 | Sub-byte patch: bits [47:40] |
R_MERCURY_8_48 | Sub-byte patch: bits [55:48] |
R_MERCURY_8_56 | Sub-byte patch: bits [63:56] |
R_MERCURY_FUNC_DESC | Function descriptor reference |
R_MERCURY_UNIFIED | Unified address space reference |
R_MERCURY_TEX_HEADER_INDEX | Texture header table index |
R_MERCURY_SAMP_HEADER_INDEX | Sampler header table index |
R_MERCURY_SURF_HEADER_INDEX | Surface header table index |
Sub-Byte Relocation Design
The eight R_MERCURY_8_* types enable patching individual bytes within a 64-bit instruction word. Mercury instruction encodings pack multiple fields into single 8-byte QWORDs (the 1280-bit instruction buffer at a1+544 is organized as 20 QWORDs). During finalization for a different SM, only certain bit-fields within an instruction word may need updating -- for example, the opcode variant bits or register class encoding -- while neighboring fields remain unchanged. The sub-byte types let the finalizer patch exactly one byte at a specific position within the word without a read-modify-write cycle on the entire QWORD.
Relocation Resolution
The master relocation resolver sub_1CD48C0 (22KB) handles both standard and capmerc relocations. For R_MERCURY_UNIFIED, it converts internal relocation type 103 to type 1 (standard absolute). The resolver iterates relocation entries and handles: alias redirections, dead-function relocation skipping, __UFT_OFFSET / __UDT_OFFSET pseudo-relocations, PC-relative branch validation, NVRS (register spill) relocations, and YIELD-to-NOP conversion for forward progress guarantees.
Mercury Section Binary Layouts
Section Classifier Algorithm -- sub_1C98C60
The 9KB classifier uses a two-stage guard-then-waterfall pattern to identify .nv.merc.* sections from their ELF section headers.
Stage 1: sh_type range check (fast rejection). The section's sh_type is tested against two NVIDIA processor-specific ranges:
| Range | sh_type span | Decimal | Qualifying types |
|---|---|---|---|
| A | 0x70000006..0x70000014 | SHT_LOPROC+6..+20 | Filtered by bitmask 0x5D05 |
| B | 0x70000064..0x7000007E | SHT_LOPROC+100..+126 | All accepted (memory-space data) |
| Special | 1 | SHT_PROGBITS | Accepted (generic debug data) |
Within Range A, the bitmask 0x5D05 (binary 0101_1101_0000_0101) selects seven specific types:
| Bit | sh_type | Hex | Section types |
|---|---|---|---|
| 0 | SHT_LOPROC+6 | 0x70000006 | Memory-space clones |
| 2 | SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| 8 | SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| 10 | SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| 11 | SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| 12 | SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| 14 | SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_* |
Stage 2: Name-based disambiguation (expensive path). When sh_flags bit 28 (0x10000000, SHF_NV_MERC) is set, the classifier calls sub_1CB9E50() to retrieve the section name and performs sequential strcmp() against 15 names, returning 1 on the first match. The check order matches the declaration order in the ELF structure table above. For .nv.merc.nv_debug_ptx_txt, sub_4279D0 performs a prefix match rather than an exact match.
SHF_NV_MERC Flag (0x10000000)
Bit 28 of sh_flags is an NVIDIA extension: SHF_NV_MERC. All .nv.merc.* sections carry this flag. It serves two purposes:
- Fast filtering -- the classifier checks this bit before string comparisons, giving O(1) rejection for the common case of non-merc sections.
- Namespace separation -- during section index remapping (sub_1C99BB0), sections with SHF_NV_MERC are remapped into a separate merc section index space. The finalizer uses this flag to identify which sections require relocation patching during off-target finalization.
.nv.capmerc<funcname> -- Capsule Data Layout
The per-function capsule section contains the full marker stream, SASS data, KNOBS block, and optionally replicated constant bank data. The section is created by sub_1C9C300.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 1 (SHT_PROGBITS) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 16 |
Section data is organized as four consecutive regions:
.nv.capmerc<funcname> Section Data
====================================
┌──────────────────────────────────────────────────────┐
│ Marker Stream (variable length) │
│ Repeating TLV records: │
│ [type:1B] [sub:1B] [payload:varies] │
│ │
│ Type 2: 4 bytes total [02] [sub] [00 00] │
│ Boolean flags (has_exit, has_crs, sampling_mode) │
│ │
│ Type 3: 4 bytes total [03] [sub] [WORD:value] │
│ Short values (desc_version, stack_frame_size, │
│ atomic flags, min_sm_version) │
│ │
│ Type 4: variable [04] [sub] [WORD:size] .. │
│ Variable-length blocks (register counts, KNOBS │
│ data, barrier info, relocation payloads) │
│ │
│ Terminal marker: sub-type 95 (min_sm + CRS depth) │
├──────────────────────────────────────────────────────┤
│ SASS Data Block (sass_data_size bytes) │
│ Mercury-encoded instruction bytes identical to │
│ what .text.<func> would contain for the compile │
│ target; byte-for-byte match with phase 122 output │
├──────────────────────────────────────────────────────┤
│ KNOBS Block (knobs_section_size bytes) │
│ Serialized key-value pairs from marker sub-type 90 │
│ "KNOBS" tag separates knob pairs from generic KV │
│ Contains: optimization level, target parameters, │
│ feature flags, all codegen-affecting knob values │
├──────────────────────────────────────────────────────┤
│ Constant Bank Data (const_bank_size bytes, optional) │
│ Replicated .nv.constant0 data for deferred binding │
│ Only present when the function references constant │
│ bank data that the finalizer may need to patch │
└──────────────────────────────────────────────────────┘
.nv.merc.debug_info -- Cloned DWARF Debug Info
The cloner (sub_1CA2E40) produces a byte-for-byte copy of the source .debug_info section, placed into the merc namespace with modified ELF section header properties.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 0x70000011 (SHT_LOPROC + 17) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 1 |
Section data is standard DWARF .debug_info format:
.nv.merc.debug_info Section Data
==================================
┌──────────────────────────────────────────────────────┐
│ Compilation Unit Header │
│ +0x00 unit_length : 4B (DWARF-32) or 12B (-64)│
│ +0x04 version : 2B (typically DWARF 4) │
│ +0x06 debug_abbrev_offset : 4B → .nv.merc.debug_abbrev │
│ +0x0A address_size : 1B (8 for 64-bit GPU) │
├──────────────────────────────────────────────────────┤
│ DIE Tree (Debug Information Entries) │
│ Sequence of entries, each: │
│ abbrev_code : ULEB128 │
│ attributes : per abbreviation definition │
│ │
│ Cross-section references (via relocations): │
│ DW_FORM_strp → .nv.merc.debug_str │
│ DW_FORM_ref_addr → .nv.merc.debug_info │
│ DW_FORM_sec_offset → .nv.merc.debug_line etc. │
└──────────────────────────────────────────────────────┘
The critical difference from standard .debug_info: all cross-section offset references point to other .nv.merc.* sections, not the original .debug_* sections. The .nv.merc.rela.debug_info relocation table handles rebinding these offsets during finalization.
.nv.merc.rela / .nv.merc.rela<secname> -- Mercury Relocations
Mercury relocation sections use standard Elf64_Rela on-disk format (24 bytes per entry) but encode Mercury-specific relocation types with a 0x10000 offset in the type field.
ELF section header:
| Field | Value |
|---|---|
| sh_type | 4 (SHT_RELA) |
| sh_flags | 0x10000000 (SHF_NV_MERC) |
| sh_addralign | 8 |
| sh_entsize | 24 |
| sh_link | symtab section index |
| sh_info | target section index |
Section names are constructed by sub_1C980F0 as ".nv.merc.rela" + suffix (e.g., ".nv.merc.rela.debug_info").
On-disk entry layout (standard Elf64_Rela, 24 bytes):
.nv.merc.rela Entry (24 bytes on disk)
========================================
┌─────────┬──────┬────────────────────────────────────────────┐
0x00 │ QWORD │ 8B │ r_offset — byte position in target section │
0x08 │ DWORD │ 4B │ r_type — relocation type │
│ │ │ Standard: 1=R_NV_ABS64, etc. │
│ │ │ Mercury: r_type > 0x10000 │
│ │ │ Decoded: r_type - 0x10000 → R_MERCURY_* │
0x0C │ DWORD │ 4B │ r_sym — symbol table index │
0x10 │ QWORD │ 8B │ r_addend — signed addend value │
└─────────┴──────┴────────────────────────────────────────────┘
During resolution (sub_1CD48C0), the 24-byte on-disk entries are loaded into a 32-byte in-memory representation that adds two section index fields:
In-Memory Relocation Entry (32 bytes)
=======================================
┌─────────┬──────┬────────────────────────────────────────────┐
0x00 │ QWORD │ 8B │ r_offset — byte position in target section │
0x08 │ DWORD │ 4B │ r_type — relocation type │
0x0C │ DWORD │ 4B │ r_sym — symbol table index │
0x10 │ QWORD │ 8B │ r_addend — signed addend value │
0x18 │ DWORD │ 4B │ r_sec_idx — target section index │
0x1C │ DWORD │ 4B │ r_addend_sec — addend section index │
└─────────┴──────┴────────────────────────────────────────────┘
The extra 8 bytes enable cross-section targeting: r_sec_idx identifies which section r_offset is relative to, and r_addend_sec identifies the section contributing the addend base address. When r_addend_sec != 0, the resolver adds that section's load address to r_offset before patching.
The resolver detects Mercury relocation types via r_type > 0x10000, subtracts 0x10000, then dispatches through a Mercury-specific handler table (off_2407B60) rather than the standard CUDA relocation table (off_2408B60).
Complete sh_type Map
| sh_type | Hex | Section types |
|---|---|---|
| 1 | 0x00000001 | .nv.capmerc<func>, .nv.merc.debug_abbrev (PROGBITS variant), .nv.merc.debug_str, .nv.merc.nv_debug_ptx_txt |
| 4 | 0x00000004 | .nv.merc.rela* (SHT_RELA) |
| SHT_LOPROC+6 | 0x70000006 | .nv.merc.<memory-space> clones |
| SHT_LOPROC+8 | 0x70000008 | .nv.merc.nv.shared.reserved |
| SHT_LOPROC+14 | 0x7000000E | .nv.merc.debug_line |
| SHT_LOPROC+16 | 0x70000010 | .nv.merc.debug_frame |
| SHT_LOPROC+17 | 0x70000011 | .nv.merc.debug_info |
| SHT_LOPROC+18 | 0x70000012 | .nv.merc.nv_debug_line_sass |
| SHT_LOPROC+20 | 0x70000014 | .nv.merc.debug_loc, .nv.merc.debug_ranges, .nv.merc.nv_debug_info_reg_sass, .nv.merc.nv_debug_info_reg_type |
| SHT_LOPROC+100..+126 | 0x70000064..0x7000007E | Memory-space variant sections (constant banks, shared, local, global) |
The .nv.merc.* debug sections reuse the same sh_type values as their non-merc counterparts (.debug_info uses 0x70000011 in both namespaces). The SHF_NV_MERC flag (0x10000000) in sh_flags is the distinguishing marker.
Self-Check Mechanism
The --self-check flag activates a roundtrip verification that validates the capmerc encoding by reconstituting SASS from the capsule data and comparing it against the original:
Phase 122 output (SASS) ──────────────────────────> reference SASS
│
└─ capmerc packaging ─> .nv.capmerc<func>
│
└─ reconstitute ─> reconstituted SASS
│
section-by-section compare
│
pass / fail (error 17/18/19)
The reconstitution pipeline uses sub_720F00 (64KB), a Flex-generated SASS text lexer with thread-safety support (pthread_mutexattr_t), to parse the reconstituted instruction stream. sub_729540 (35KB) performs the actual section-by-section comparison.
Three error codes signal specific self-check failures:
| Error code | Meaning |
|---|---|
| 17 | Section content mismatch (instruction bytes differ) |
| 18 | Section count mismatch (missing or extra sections) |
| 19 | Section metadata mismatch (size, alignment, or flags differ) |
These error codes trigger longjmp-based error recovery in the master ELF emitter (sub_1C9F280), which uses _setjmp at its entry point for non-local error handling.
The --out-sass flag causes ptxas to dump the reconstituted SASS to a file, useful for debugging self-check failures by manual comparison with the original SASS output.
Opportunistic Finalization
The --opportunistic-finalization-lvl flag controls how aggressively capmerc binaries may be finalized for a target SM different from the compilation target:
| Level | Name | Behavior |
|---|---|---|
| 0 | default | Standard finalization for the compile target only |
| 1 | none | No finalization; output stays as capmerc (deferred to driver) |
| 2 | intra-family | Finalize for any SM within the same architectural family |
| 3 | intra+inter | Finalize across SM families |
Level 2 allows a capmerc binary compiled for sm_100 (datacenter Blackwell) to be finalized for sm_103 (Blackwell Ultra / GB300) without recompilation. Level 3 extends this across families -- for example, sm_100 capmerc finalized for sm_120 (consumer RTX 50-series).
The key constraint is instruction encoding compatibility: the sub-byte R_MERCURY_8_* relocations can patch SM-specific encoding bits, but the overall instruction format and register file layout must be compatible between source and target.
Off-Target Finalization
Off-target finalization is the process of converting a capmerc binary compiled for SM X into native SASS for SM Y. The compatibility checker sub_60F290 determines whether the source/target pair is compatible, examining:
- SM version pair and generation compatibility
- Feature flag differences between source and target
- Instruction set compatibility (no target-only instructions used)
- Constant bank layout compatibility
- Register file layout match
When the check passes, the kernel finalizer sub_612DE0 (47KB) applies the "fastpath optimization" -- it directly patches the Mercury-encoded instruction stream using R_MERCURY_* relocations rather than running the full compilation pipeline. On success, ptxas emits the diagnostic:
"applied for off-target %u -> %u finalization"
where the two %u values are the source and target SM numbers.
The fastpath avoids re-running phases 117--122 of the Mercury pipeline. Instead, it:
- Reads the capsule descriptor from .nv.capmerc<func>
- Validates compatibility via sub_60F290
- Applies R_MERCURY_* relocation patches for the target SM
- Regenerates the ELF .text section with patched instruction bytes
- Updates .nv.info EIATTR attributes for the target (register counts, barrier counts)
This is substantially faster than full recompilation, which is why ptxas logs it as a "fastpath."
Pipeline Integration
Capmerc does not modify the Mercury encoder pipeline (phases 113--122). The instruction encoding, pseudo-instruction expansion, WAR hazard resolution, operation expansion (opex), and SASS microcode emission all execute identically regardless of output mode. The divergence happens after phase 122 completes:
| Mode | Post-Pipeline Behavior |
|---|---|
| Mercury | Phase 122 SASS output written directly to .text.<func> ELF section |
| Capmerc | Phase 122 output wrapped in 328-byte capsule descriptor; .nv.merc.* sections cloned; R_MERCURY_* relocations emitted; KNOBS data embedded |
| SASS | Phase 122 output written as raw SASS binary (no ELF wrapper) |
The master ELF emitter sub_1C9F280 (97KB) orchestrates the post-pipeline divergence:
// Simplified from sub_1C9F280
void EmitELF(context) {
// Common: copy ELF header (64 bytes via SSE loadu)
memcpy(output, &elf_header, 64);
// Common: iterate sections, build section headers
for (int i = 0; i < section_count; i++) {
if (section[i].flags & 4) continue; // skip virtual sections
// ... copy section data, patch headers ...
}
if (is_capmerc_mode) {
sub_1C9B110(ctx); // create .nv.merc namespace
sub_1CA2E40(ctx); // clone sections into merc space
sub_1C9C300(ctx); // build capsule descriptors + KNOBS
sub_1CA3A90(ctx); // merge merc/non-merc section copies
}
// Common: remap section indices, build symbol table
sub_1C99BB0(ctx); // section index remap
sub_1CB68D0(ctx); // build .symtab
// Common: resolve relocations
sub_1CD48C0(ctx); // relocation resolver (handles R_MERCURY_*)
// Common: finalize and write
sub_1CD13A0(ctx); // serialize to file
}
Function Map
| Address | Size | Identity |
|---|---|---|
sub_1C9F280 | 97KB | Master ELF emitter (orchestrates full CUBIN output) |
sub_1CA3A90 | 45KB | Section merger / combined section emitter |
sub_1CB68D0 | 49KB | Symbol table builder (handles merc section references) |
sub_1C99BB0 | 25KB | Section index remap (.symtab_shndx / .nv.merc.symtab_shndx) |
sub_1C9C300 | 24KB | Capsule descriptor processor (328-byte object, KNOBS embed) |
sub_1C9B110 | 23KB | Mercury capsule builder (creates .nv.merc namespace) |
sub_1CD48C0 | 22KB | Master relocation resolver (R_MERCURY_* + standard) |
sub_1CA2E40 | 18KB | Mercury section cloner |
sub_1C9D1F0 | 16KB | Debug section classifier (SASS + merc variants) |
sub_1C98C60 | 9KB | Mercury debug section classifier (15 section names) |
sub_720F00 | 64KB | Flex SASS text lexer (self-check reconstitution) |
sub_729540 | 35KB | SASS assembly verification (self-check comparator) |
sub_703AB0 | 10KB | Binary-kind CLI parser |
sub_612DE0 | 47KB | Kernel finalizer / ELF builder (fastpath optimization) |
sub_60F290 | -- | Off-target compatibility checker |
sub_1CD13A0 | 11KB | ELF serialization (final file writer) |
Cross-References
- Mercury Encoder Pipeline -- phases 113--122, the upstream encoding that capmerc wraps
- SASS Instruction Encoding -- bit-level encoding format and 1280-bit instruction buffer
- Code Generation Overview -- high-level codegen pipeline context
- Knobs System -- knob infrastructure that KNOBS embedding serializes
- Phase Manager -- 159-phase pipeline infrastructure
- SM Architecture Map -- SM version numbers and family groupings
Newton-Raphson & Math Templates
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
NVIDIA GPUs lack hardware integer dividers and native FP64 arithmetic units on the SFU. When ptxas encounters PTX operations such as div.s32, div.u64, rcp.f64, sqrt.f64, or rsqrt.f64, it expands them into multi-instruction SASS sequences that synthesize the result from simpler hardware primitives. These expansions are the math templates -- pre-built instruction sequence generators that emit 20--100+ SASS instructions per PTX operation, using the MUFU (Multi-Function Unit) for initial approximations and Newton-Raphson iterations for refinement.
The template subsystem lives at 0x1700000--0x172A090 in the ptxas binary: 36 functions occupying ~180 KB. It is invoked during instruction selection by the master lowering dispatcher sub_AED3C0 whenever the selected instruction requires multi-instruction expansion.
| Address range | 0x1700000--0x172A090 |
| Function count | 36 (4 top-level handlers + 4 coordinators + ~24 sub-expanders + 4 helpers) |
| Binary size | ~180 KB |
| Master lowering dispatcher | sub_AED3C0 (28 KB, vtable-dispatched) |
| Emission primitives | sub_9314F0 (standard), sub_934630 (extended), sub_935130 (branch), sub_9352C0 (wide) |
| Virtual register allocator | sub_91BF30 (535 bytes, allocates 160-byte register descriptors) |
| Immediate encoder | sub_91D160 (318 bytes, encodes constant values into operand descriptors) |
| Operand legalizer | sub_13A6A10 (called before each expansion to widen immediates / fix register classes) |
| Template name strings | __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 |
Architecture
Two-Level Hierarchy
Every math template follows the same structural pattern: a top-level handler performs lazy initialization and operand legalization, then delegates to a coordinator that allocates virtual registers and calls a sequence of sub-expanders, each of which emits a portion of the final SASS instruction sequence.
sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
|
+-- sub_170E8B0 (DDIV handler) -- FP64 division
| +-- sub_170E260 (coordinator) -- 298 vregs, 6 sub-expanders
|
+-- sub_1718D60 (DRCP/DSQRT handler) -- FP64 reciprocal / square root
| +-- sub_1718790 (coordinator) -- 289 vregs, 7 sub-expanders (inc. MUFU.RCP)
|
+-- sub_17276C0 (DRSQRT handler) -- FP64 reciprocal square root
| +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders (MUFU.RSQ path)
| +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
|
+-- sub_1704070 (Inline DDIV handler) -- FP64 division, register-pressure variants
+-- sub_1702990 (>20K regs) -- full unrolled, ~50 instructions
+-- sub_1701F10 (>16K regs) -- partially spilled
+-- sub_1701860 (<=16K regs) -- minimal-register variant
Lazy Initialization
Each top-level handler uses a lazy-init pattern to avoid rebuilding the template for every invocation within a compilation unit:
// sub_170E8B0 -- DDIV handler (simplified from decompilation)
void DDIV_Handler(template_state *a1, instruction *a2) {
if (a1->template_id == -1) { // first invocation
a1->template_id = ctx->next_id++; // allocate unique ID
DDIV_Coordinator(a1, ...); // build template once
}
ctx->insert_point = a2->position;
LegalizeOperand(ctx, a2, 1, ...); // sub_13A6A10
if (a1->use_template_call) {
// Template path: emit BRA-to-template (opcode 168)
EmitExtended(ctx, 168, 0x13, ...); // sub_934630
} else {
// Inline path: emit individual FP ops directly
EmitFP(ctx, 0x86, 0xC, a1->reg[0], ...); // sub_92E800
EmitFP(ctx, 0x85, 0xC, a1->reg[1], ...);
}
}
The a1->use_template_call flag (at offset +8) controls whether the expansion is emitted as a callable template (with BRA to a named code section) or inlined directly at the call site. The template-call path produces three named code objects -- __ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3 -- that are shared across all DDIV call sites in the same function.
Coordinator Pattern
All four coordinators share identical structure. They allocate virtual registers from a static descriptor table, call the shared helper sub_1701140 to build the code object scaffolding, then invoke their sub-expanders in sequence:
// sub_170E260 -- DDIV coordinator (simplified)
void DDIV_Coordinator(template_state *a1, ..., int template_id) {
int *vreg_array = NULL;
int count = 0;
// Allocate 298 virtual registers from static table dword_23993E0
for (int i = 0; i < 298; i++) {
int reg_id = AllocVReg(ctx, dword_23993E0[2*i]); // sub_91BF30
int category = dword_23993E4[2*i]; // 0=output, 1=temp
if (category == 0)
output_regs[out_count++] = reg_id;
else if (category == 1)
temp_regs[temp_count++] = reg_id;
// Mark register as template-owned
*(vreg_table[reg_id] + 48) |= 0x40;
}
// Build code object scaffolding
BuildTemplateScaffold(ctx, template_id, &static_table, 3, ...);
// Name the three code sections
if (a1->use_template_call) {
section[0]->name = intern("__ori_template_DDIV1");
section[1]->name = intern("__ori_template_DDIV2");
section[2]->name = intern("__ori_template_DDIV3");
}
// Allocate 240-byte scratch buffer (zeroed)
void *scratch = arena_alloc(240);
memset(scratch, 0, 232);
// Call 6 sub-expanders in sequence
DDIV_Part1(a1, template_id, scratch, vreg_array, ...); // sub_1704180
DDIV_Part2(a1, template_id, scratch, vreg_array, ...); // sub_1705820
DDIV_Part3(a1, template_id, scratch, vreg_array, ...); // sub_17075A0
DDIV_Part4(a1, template_id, scratch, vreg_array, ...); // sub_1709130
DDIV_Part5(a1, template_id, scratch, vreg_array, ...); // sub_170AE80
DDIV_Part6(a1, template_id, scratch, vreg_array, ...); // sub_170CBD0
// Emit convergence barriers (opcode 0x5D) between code sections
for (each section boundary in static_table) {
EmitBarrier(ctx, 0x5D, pred_reg, ...); // sub_92E1B0
}
// Mark scheduling barriers at section endpoints
*(section[23]->flags + 280) |= 8;
*(section[42]->flags + 280) |= 8;
}
The static descriptor tables (dword_23993E0 for DDIV, dword_2398940 for DRCP/DSQRT, dword_2398000 for DRSQRT, dword_23976E0 for integer div) encode the register type and category for each virtual register used by the template. The category field (second element of each pair) classifies registers as output (0) or temporary (1).
FP64 Division (DDIV)
Double-precision division a / b has no single-instruction implementation on any NVIDIA GPU. ptxas synthesizes it using Newton-Raphson refinement of a single-precision reciprocal seed.
Algorithm
The DDIV template produces three code sections containing the following mathematical steps:
DDIV1 -- Initial reciprocal approximation:
- Extract the high 32 bits of the FP64 divisor b
- Convert to FP32 and compute MUFU.RCP (single-precision reciprocal, ~23 bits of mantissa)
- Convert the FP32 result back to a form suitable for FP64 refinement
- Handle special cases: divisor is zero, infinity, NaN, or denormal
DDIV2 -- Newton-Raphson refinement:
The classical Newton-Raphson iteration for reciprocal is:
x_{n+1} = x_n * (2 - b * x_n)
Each iteration approximately doubles the number of correct bits. Starting from the ~23-bit MUFU.RCP seed:
- Iteration 1: ~46 correct bits (still short of the 52-bit FP64 mantissa)
- A second partial iteration provides the guard bits needed for correct rounding
The SASS instruction sequence uses DFMA (FP64 fused multiply-add) to implement each iteration step. The FSETP/BRA branches handle edge cases where the intermediate result would overflow or underflow the FP64 range.
DDIV3 -- Final multiply and exception handling:
- Compute a * (1/b) using the refined reciprocal
- Apply IEEE 754 rounding (round-to-nearest-even by default)
- Emit the quotient to the destination register pair
- Handle overflow to infinity, underflow to zero, and NaN propagation
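The three code sections map onto a short numeric model. The sketch below is illustrative Python, not ptxas code: f32 stands in for the single-precision MUFU.RCP seed, the refinement runs in Python's native binary64 arithmetic, and the special-case handling (zero, infinity, NaN, denormals) is omitted.

```python
import struct

def f32(x):
    """Round a binary64 value to binary32, standing in for an FP32 seed."""
    return struct.unpack('f', struct.pack('f', x))[0]

def ddiv(a, b):
    # DDIV1: ~23-bit reciprocal seed (models MUFU.RCP on float32(b))
    x = f32(1.0 / f32(b))
    # DDIV2: two Newton-Raphson steps x_{n+1} = x_n * (2 - b*x_n),
    # each roughly doubling the number of correct bits (23 -> 46 -> 52+)
    for _ in range(2):
        x = x * (2.0 - b * x)
    # DDIV3: final multiply a * (1/b)
    return a * x
```

Running ddiv(1.0, 3.0) agrees with 1.0/3.0 to within a few ULPs of binary64, which is why two iterations (plus the guard-bit correction in the real template) suffice.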
SASS Instruction Sequence (sub_1705820)
The DDIV Part 2 sub-expander (sub_1705820, 7,545 bytes, 1,057 lines decompiled) is the largest single sub-expander and emits the core Newton-Raphson loop. The instruction mix from decompilation:
| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| IMAD | 0xC9 | 10 | Integer multiply-add for mantissa manipulation |
| FSETP | 0x97 | 6 | Floating-point set-predicate for branch conditions |
| MOV | 0x82 | 13 | Register-to-register moves |
| MOV (FP) | 0x0A | 10 | FP register moves with type annotation |
| IADD3 | 0x110 | 5 | Three-operand integer add for exponent arithmetic |
| SHR | 0x19 | 1 | Shift right for exponent extraction |
| BRA | 0x5F | 5 | Conditional branches for special-case handling |
| MUFU | 0x3C | 1 | MUFU.RCP -- the initial reciprocal seed |
| DFMA | 0x122 | 2 | FP64 fused multiply-add (Newton-Raphson iteration) |
| FP64 op | 0x8B | 2 | FP64 arithmetic (multiply or add) |
| FP32 hi/lo | 0x86/0x85 | 4+4 | Move FP32 halves of FP64 register pair |
| Total | | ~63 | Per sub-expander (Part 2 of 6) |
The complete DDIV template across all 6 sub-expanders emits approximately 100--120 SASS instructions, using 298 virtual registers.
Register Pressure Variants
The inline DDIV handler (sub_1704070) selects between three implementations based on the target architecture's register file size at *(*(context+1584) + 372):
| Register limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 (5,846 bytes) | Full unrolled -- maximum ILP, 14+ dedicated scratch registers |
| > 16,383 | sub_1701F10 | Partially spilled -- trades some registers for spill/fill |
| <= 16,383 | sub_1701860 | Minimal-register -- reuses registers aggressively, more instructions |
This three-tier approach is a register-pressure/throughput tradeoff: kernels with high register demand (and thus low occupancy) use the minimal variant, while kernels with register headroom use the fully unrolled variant for better instruction-level parallelism.
FP64 Reciprocal and Square Root (DRCP/DSQRT)
The DRCP/DSQRT handler (sub_1718D60) shares the same lazy-init and template-call architecture as DDIV. Its coordinator (sub_1718790) allocates 289 virtual registers from dword_2398940 and calls 7 sub-expanders:
| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_170ED40 | FP64 reciprocal: extract exponent, compute MUFU.RCP seed |
| Part 2 | sub_1710280 | Newton-Raphson iteration 1 for reciprocal refinement |
| Part 3 | sub_17120F0 | Newton-Raphson iteration 2 (second doubling of precision) |
| Part 4 | sub_17139D0 | Rounding and normalization |
| Part 5 | sub_1715910 | Square root path: compute MUFU.RSQ seed, refine |
| Part 6 | sub_1717470 | Final multiply x * rsqrt(x) to get sqrt(x), exception handling |
| (shared) | sub_1701140 | Template scaffolding helper (called by all coordinators) |
The algorithm for DRCP(b) = 1/b:
- MUFU.RCP(float32(b)) provides a ~23-bit seed
- Two Newton-Raphson iterations: x_{n+1} = x_n * (2 - b * x_n), each using DFMA
- Final rounding to FP64 precision
The algorithm for DSQRT(a) = sqrt(a):
- MUFU.RSQ(float32(a)) provides a ~23-bit 1/sqrt(a) seed
- Refine 1/sqrt(a) via Newton-Raphson: y_{n+1} = y_n * (3 - a * y_n^2) / 2
- Compute sqrt(a) = a * (1/sqrt(a)) using the refined reciprocal square root
- Apply IEEE 754 rounding
Both paths share the same coordinator and register pool. The coordinator selects the DRCP path or DSQRT path based on the original PTX operation being lowered.
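Both recipes can be checked numerically. A minimal Python model (ours, not ptxas code; f32 mimics the FP32 MUFU seed, and exception handling is omitted):

```python
import struct

def f32(x):
    """Round to binary32, standing in for a single-precision MUFU seed."""
    return struct.unpack('f', struct.pack('f', x))[0]

def drcp(b):
    x = f32(1.0 / f32(b))                # MUFU.RCP stand-in, ~23-bit seed
    for _ in range(2):                   # two DFMA-based refinement steps
        x = x * (2.0 - b * x)            # x_{n+1} = x_n * (2 - b*x_n)
    return x

def dsqrt(a):
    y = f32(1.0 / f32(a) ** 0.5)         # MUFU.RSQ stand-in, ~23-bit seed
    for _ in range(2):
        y = y * (3.0 - a * y * y) * 0.5  # y_{n+1} = y_n * (3 - a*y_n^2) / 2
    return a * y                         # sqrt(a) = a * (1/sqrt(a))
```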
FP64 Reciprocal Square Root (DRSQRT)
The DRSQRT handler (sub_17276C0) is the most complex top-level handler. It dispatches to one of two coordinators based on a hardware capability flag:
// sub_17276C0 -- DRSQRT handler (simplified)
void DRSQRT_Handler(template_state *a1, instruction *a2) {
int hw_flag = *(*(ctx + 1584) + 1037) & 1;
if (a1->template_id == -1) {
a1->template_id = ctx->next_id++;
if (hw_flag)
Coordinator_IntDiv(a1, ...); // sub_1727130: 59 vregs
else
Coordinator_DRSQRT(a1, ...); // sub_1720D60: 247 vregs
}
// ... operand legalization, template call or inline emission
}
The hardware flag at *(config + 1037) & 1 likely distinguishes architectures with enhanced SFU precision (where fewer refinement iterations are needed) from older architectures requiring the full Newton-Raphson sequence.
Coordinator A (sub_1720D60): allocates 247 virtual registers from dword_2398000 and calls 5 sub-expanders:
| Sub-expander | Address | Role |
|---|---|---|
| Part 1 | sub_1719080 | Initial MUFU.RSQ seed, exponent extraction |
| Part 2 | sub_171A260 | Newton-Raphson iteration 1 |
| Part 3 | sub_171BB80 | Newton-Raphson iteration 2 |
| Part 4 | sub_171D3A0 | Normalization and rounding |
| Part 5 | sub_171EFD0 | Exception handling (NaN, infinity, negative, zero) |
Coordinator B (sub_1727130): allocates only 59 virtual registers from dword_23976E0 and dispatches to the integer division sub-expanders (sub_1724A20 for 32-bit, sub_1728930 for 64-bit unsigned, sub_1727AC0 for 64-bit signed). This path handles the integer division/modulo lowering via sub_1729B50.
Integer Division Lowering
Integer division and modulo by variable (non-constant) values are expanded into multi-instruction SASS sequences during instruction selection. These sequences use the MUFU.RCP hardware approximation as a starting point, then correct the result with integer arithmetic.
32-bit Division -- sub_1724A20
Size: 28,138 bytes decompiled (957 lines), the largest function in the 0x1723000--0x17F8000 range.
Called from: sub_1727130 (coordinator B).
Virtual registers: 59 (allocated by coordinator B from dword_23976E0).
Temporary pool: indices 90--126 of the parameter array, providing 37 dedicated scratch registers.
Algorithm for unsigned 32-bit a / b:
Step 1: float_b = I2F(b) ; convert divisor to FP32
Step 2: rcp = MUFU.RCP(float_b) ; ~23-bit reciprocal approximation
Step 3: int_rcp = F2I(rcp) ; convert back to integer
Step 4: q_est = IMAD.HI(a, int_rcp, 0) ; estimated quotient (high 32 bits of a*rcp)
Step 5: r_est = IMAD(q_est, -b, a) ; estimated remainder = a - q*b
Step 6: if (r_est >= b) { q_est++; r_est -= b } ; correction iteration 1
Step 7: if (r_est >= b) { q_est++; r_est -= b } ; correction iteration 2
Step 8: result = q_est ; (or r_est for modulo)
The correction steps (6--7) are implemented with ISETP (opcode 0xC9) for comparison and BRA (opcode 0x5F) for conditional execution. In the worst case, two correction iterations suffice because the MUFU.RCP approximation is accurate to within 2 ULP of the true reciprocal.
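The eight steps translate into a compact Python model. This is an illustrative sketch, not the emitted SASS: Python's double-precision 1.0/b stands in for the ~23-bit MUFU.RCP seed, and the 0.32 fixed-point scaling is our assumption about what the F2I/IMAD.HI pair effectively computes.

```python
def udiv32(a, b):
    """Unsigned 32-bit divide via reciprocal estimate plus correction steps."""
    assert 0 <= a <= 0xFFFFFFFF and 0 < b <= 0xFFFFFFFF
    rcp = 1.0 / float(b)              # Steps 1-2: reciprocal approximation
    int_rcp = int(rcp * 2.0**32)      # Step 3: F2I as 0.32 fixed-point (assumed)
    q = (a * int_rcp) >> 32           # Step 4: IMAD.HI -- estimated quotient
    r = a - q * b                     # Step 5: estimated remainder
    for _ in range(2):                # Steps 6-7: at most two corrections
        if r >= b:
            q += 1
            r -= b
    return q, r                       # Step 8: quotient (or remainder for mod)
```

Truncating the scaled reciprocal guarantees the estimate never overshoots, so the corrections only ever increment the quotient.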
Key constants allocated via sub_91D160:
| Constant | Purpose |
|---|---|
| 23 | Float exponent bias for mantissa extraction |
| 255 | Exponent mask (8-bit IEEE 754 exponent field) |
| 127 | IEEE 754 single-precision exponent bias |
| 254 | Double-bias for overflow guard (2 * 127) |
| 1, -1 | Correction increments for quotient adjustment |
The complete SASS instruction mix for the 32-bit division template:
| SASS Opcode | Internal ID | Count | Role |
|---|---|---|---|
| I2F | 0xD5 | 2 | Integer-to-float conversion |
| F2I | 0xD6 | 3 | Float-to-integer conversion |
| IMAD | 0x6E | 5 | Integer multiply-add (quotient estimation) |
| IMAD.WIDE | 0x6F | 3 | Wide multiply-add (64-bit intermediate) |
| IADD | 0x02 | ~3 | Integer add (correction) |
| MOV | 0x82 | 10 | Register moves |
| MOV (typed) | 0x0A | 6 | Typed register moves |
| ISETP | 0xC9 | 8 | Integer set-predicate (comparison) |
| FSETP | 0x97 | 3 | Float set-predicate |
| SHL/LEA | 0x24 | 2 | Shift-left / load effective address |
| BRA | 0x5F | 4 | Conditional branch (correction paths) |
| POPC/LOP | 0x93 | 1 | Population count / logic op |
| Total | | ~50 | |
64-bit Division
Two variants handle 64-bit operands, both called from sub_1729B50:
- sub_1728930 (16,545 bytes): unsigned 64-bit division. The algorithm is analogous to 32-bit but requires double-width multiply (IMAD.WIDE), carry propagation, and additional correction iterations. Emits ~80 SASS instructions.
- sub_1727AC0 (13,776 bytes): signed 64-bit division. Wraps the unsigned algorithm with sign extraction, absolute value computation, and sign fixup of the quotient and remainder.
Both allocate from the same 59-register pool managed by coordinator B.
Division by Constant
Division by compile-time constant is handled separately during the GeneralOptimize bundle passes (not by these templates). The classic Granlund-Montgomery magic-number technique converts x / C to MULHI(x, magic) >> shift, producing 2--3 instructions instead of ~50. See Strength Reduction for details.
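For illustration, the magic-number transform can be reproduced in a few lines. This is a textbook Granlund-Montgomery sketch (ours, not recovered from ptxas); when the computed magic exceeds 32 bits, real codegen needs an extra add/shift fixup, but the arithmetic identity below holds either way.

```python
def magic_u32(d):
    """Choose (m, s) with 2^(32+s) <= m*d <= 2^(32+s) + 2^s, so that
    x // d == (x * m) >> (32 + s) for every 32-bit unsigned x."""
    assert 1 < d < 2**32
    s = (d - 1).bit_length()          # s = ceil(log2(d)), hence d <= 2^s
    m = -(-(1 << (32 + s)) // d)      # m = ceil(2^(32+s) / d)
    return m, s

def div_by_const(x, d):
    m, s = magic_u32(d)
    return (x * m) >> (32 + s)        # MULHI(x, m) >> s in hardware terms
```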
MUFU: The Hardware Approximation Engine
All math templates depend on the MUFU (Multi-Function Unit) instruction, which provides low-precision hardware approximations for transcendental and special functions. MUFU is a single SASS instruction (internal opcode 0x3C) with a sub-function selector:
| MUFU Sub-function | Operation | Precision | Latency (typical) |
|---|---|---|---|
| MUFU.RCP | 1/x (reciprocal) | ~23 bits (FP32 mantissa) | ~8 cycles |
| MUFU.RSQ | 1/sqrt(x) (reciprocal square root) | ~23 bits | ~8 cycles |
| MUFU.RCP64H | High-precision 1/x seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.RSQ64H | High-precision 1/sqrt(x) seed for FP64 | ~28 bits (sm_80+) | ~10 cycles |
| MUFU.SIN | sin(x) | ~23 bits | ~8 cycles |
| MUFU.COS | cos(x) | ~23 bits | ~8 cycles |
| MUFU.EX2 | 2^x (base-2 exponential) | ~23 bits | ~8 cycles |
| MUFU.LG2 | log2(x) (base-2 logarithm) | ~23 bits | ~8 cycles |
| MUFU.SQRT | sqrt(x) (sm_89+) | ~23 bits | ~8 cycles |
MUFU executes on the SFU (Special Function Unit), which is separate from the integer and floating-point ALU pipelines. On sm_80 (Ampere) and later, the SFU can execute one MUFU per cycle per SM partition. The key insight is that MUFU provides only FP32-precision seeds; achieving FP64 precision requires the Newton-Raphson refinement implemented by the math templates.
Fast-Math vs. IEEE Precision
For FP32 operations, the PTX modifiers .approx and .ftz control whether ptxas uses MUFU directly or applies refinement:
- div.approx.f32: Emits a single MUFU.RCP followed by FMUL. No Newton-Raphson. Result has ~23-bit precision (not IEEE-correct rounding).
- div.full.f32: Emits MUFU.RCP + one Newton-Raphson iteration via FFMA. Result is IEEE-correct for all normal inputs.
- div.rn.f64: Emits the full DDIV template (~100+ instructions) with two Newton-Raphson iterations. Result is IEEE 754 round-to-nearest-even.
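The FP32 precision ladder can be demonstrated with Python floats rounded through binary32 (a numeric model with our own helper names, not ptxas output; each FFMA is modeled by rounding only the final result of the multiply-add):

```python
import struct

def f32(x):
    """Round to IEEE binary32, standing in for FP32 hardware results."""
    return struct.unpack('f', struct.pack('f', x))[0]

def div_approx_f32(a, b):
    return f32(a * f32(1.0 / b))      # MUFU.RCP + FMUL, no refinement

def div_full_f32(a, b):
    r = f32(1.0 / b)                  # MUFU.RCP-style seed
    e = f32(1.0 - b * r)              # FFMA: residual 1 - b*r
    r = f32(r + r * e)                # FFMA: one Newton-Raphson step
    return f32(a * r)
```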
For transcendental functions (sin, cos, exp, log):
- sin.approx.f32 / cos.approx.f32: Single MUFU.SIN / MUFU.COS. ~23-bit precision over a reduced range.
- sin.f32 (full range): Range reduction to [-pi, pi] via polynomial argument reduction, then MUFU.SIN + polynomial correction. Emitted as a libdevice call or inline sequence depending on optimization level.
- ex2.approx.f32: Single MUFU.EX2.
- lg2.approx.f32: Single MUFU.LG2.
There are no FP64 versions of MUFU.SIN/COS/EX2/LG2. FP64 transcendentals are always implemented by linking against libdevice (the CUDA math library), which provides polynomial approximation sequences compiled from C source code. These are not handled by ptxas's internal templates but by the libdevice bitcode linked during cicc compilation, upstream of ptxas.
Template Instantiation Infrastructure
Emission Primitives
The sub-expanders construct SASS instructions using a family of emission functions:
| Function | Size | Signature | Role |
|---|---|---|---|
| sub_9314F0 | 403 bytes | (scratch, ctx, opcode, type, operand_count, operands, xmm, fp) | Standard SASS instruction emission (2--5 operands) |
| sub_934630 | 1,213 bytes | (scratch, ctx, opcode, type, ?, ?, xmm, fp, operand_buf, count) | Extended emission for control flow and >4 operands |
| sub_935130 | 390 bytes | (scratch, ctx, opcode, count, label_buf, label_count, ...) | Branch emission with label resolution |
| sub_9352C0 | (variant) | (scratch, ctx, opcode, type, operands, count, ..., extra_buf, ...) | Wide emission with extra operand buffer (used for MUFU) |
| sub_92E800 | 70 bytes | (scratch, ctx, opcode, type, reg_id, src_operand, xmm, fp) | Simplified emission for single-source FP ops |
| sub_92E720 | 51 bytes | (scratch, ctx, opcode, type, dest_pair, src_operand, xmm, fp) | Simplified emission wrapper for register pairs |
| sub_92E1B0 | (variant) | (scratch, ctx, opcode, pred_reg, xmm, fp) | Predicated barrier/convergence emission |
Operand Encoding
Each operand in the emission buffer is a 32-bit tagged value:
| Tag (bits 31:24) | Meaning |
|---|---|
| 0x90 | Destination register (bits 23:0 = register ID) |
| 0x10 | Source register |
| 0x20 | Immediate constant (from constant pool via sub_91D160) |
| 0x40 | External constant reference |
| 0x60 | Template call target / sentinel (used for BRA-to-template) |
| 0x80 | Negate modifier (OR'd onto source tag: 0x90 = negated source) |
The 64-bit modifier word 0x6000000500000000 appearing in many emission calls encodes instruction-level flags such as .reuse hints and type specifiers.
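A sketch of the tagged-operand packing (the helper names are ours; the tag values and 24-bit payload layout come from the table above):

```python
# Tag values from the recovered operand-encoding table
TAG_DST, TAG_SRC, TAG_IMM, TAG_NEG = 0x90, 0x10, 0x20, 0x80

def make_operand(tag, value):
    """Pack a tag (bits 31:24) and payload (bits 23:0) into one 32-bit word."""
    assert 0 <= value < (1 << 24)
    return (tag << 24) | value

def split_operand(word):
    """Recover (tag, payload) from a packed operand word."""
    return word >> 24, word & 0xFFFFFF
```

Note a quirk implied by the table: a negated source (TAG_SRC | TAG_NEG = 0x90) is numerically identical to the destination tag, so operand position must disambiguate the two.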
Virtual Register Allocation
Each coordinator allocates its full set of virtual registers in a single loop before any instructions are emitted. The sub_91BF30 allocator creates 160-byte register descriptors and returns a 24-bit register ID. Each register is marked with flags |= 0x40 (bit 6) to indicate it is owned by a template rather than the main register allocation pass. This prevents the register allocator from coalescing or splitting template-internal registers.
The static descriptor tables encode register types as the first element of each (type, category) pair:
| Template | Table address | Register count | Data types |
|---|---|---|---|
| DDIV | dword_23993E0 | 298 | FP64 pairs, FP32, integer, predicate |
| DRCP/DSQRT | dword_2398940 | 289 | FP64 pairs, FP32, integer, predicate |
| DRSQRT | dword_2398000 | 247 | FP64 pairs, FP32, integer, predicate |
| Integer div | dword_23976E0 | 59 | Integer (32/64-bit), predicate |
The register counts explain why these templates dominate register pressure in FP64-heavy kernels: 298 virtual registers for a single DDIV expansion is enormous by GPU standards, where the entire physical register file is 65,536 32-bit registers shared across all active warps.
Template Call vs. Inline
The use_template_call flag at template_state + 8 selects between two emission strategies:
Template-call path (flag set):
- The coordinator builds three named code sections (__ori_template_DDIV1/2/3)
- Each call site emits a BRA (opcode 168 via sub_934630) to the template code
- The template code is shared across all call sites in the same function
- Convergence barriers (opcode 0x5D via sub_92E1B0) ensure correct re-convergence
- A CALL-like instruction (opcode 164) handles the return path
Inline path (flag clear):
- The sub-expander instructions are emitted directly at the call site
- Each call site gets its own copy of the full instruction sequence
- Uses direct IADD3 (opcode 0x110) for control flow instead of BRA
- No named code sections, no convergence barriers
- A JMP/BRA (opcode 0x20 or 0x5F) replaces the template return
The template-call path is preferred for functions with multiple DDIV/DRCP/DSQRT operations because it avoids duplicating the large instruction sequence. The inline path is used when the function has only one such operation, or when the register allocator determines that the overhead of the template call mechanism (saving/restoring registers across the call boundary) exceeds the code-size benefit.
SM-Specific Variants
Register File Size Dispatch
The inline DDIV handler (sub_1704070) reads the register file capacity from *(*(context + 1584) + 372) and selects between three implementation tiers:
int reg_file_size = *(*(ctx + 1584) + 372); // physical register count
if (reg_file_size > 20479)
DDIV_FullUnroll(a1, a2, ...); // sub_1702990: max ILP
else if (reg_file_size > 16383)
DDIV_PartialSpill(a1, a2, ...); // sub_1701F10: balanced
else
DDIV_MinimalRegs(a1, a2, ...); // sub_1701860: min pressure
The thresholds (20,479 and 16,383) correspond to register file sizes across GPU generations:
- sm_50--sm_61 (Maxwell/Pascal): 65,536 registers per SM -> 20,479 threshold met at occupancy < 3 blocks
- sm_70--sm_89 (Volta through Ada): 65,536 registers -> same thresholds
- sm_100+ (Blackwell): 65,536 registers -> same, but wider warp execution changes the pressure calculus
Hardware Capability Flag
The DRSQRT handler checks *(*(context + 1584) + 1037) & 1 to select between coordinator A (full Newton-Raphson, 247 registers) and coordinator B (reduced sequence, 59 registers). This flag likely indicates the presence of MUFU.RCP64H / MUFU.RSQ64H on sm_80+ architectures, which provide higher-precision seeds (~28 bits vs. ~23 bits) and thus require fewer refinement iterations.
SASS Opcode Reference
Internal opcode IDs used by the math templates, mapped to SASS mnemonics:
| Internal ID | SASS Mnemonic | Description |
|---|---|---|
| 0x02 | IADD | Integer add (3-operand) |
| 0x0A | MOV | FP register move (typed) |
| 0x19 | SHR | Shift right (exponent extraction) |
| 0x20 | BRA/JMP | Unconditional branch (inline return path) |
| 0x24 | SHL/LEA | Shift left / load effective address |
| 0x3C | MUFU | Multi-function unit (RCP, RSQ, SIN, COS, EX2, LG2) |
| 0x5D | BSYNC | Barrier synchronization / convergence barrier |
| 0x5F | BRA | Conditional branch |
| 0x6E | IMAD | Integer multiply-add |
| 0x6F | IMAD.WIDE | Wide integer multiply-add (64-bit result) |
| 0x82 | MOV | General register move |
| 0x85 | MOV.LO | Move low 32 bits of FP64 pair |
| 0x86 | MOV.HI | Move high 32 bits of FP64 pair |
| 0x8B | DFMA/DMUL | FP64 fused multiply-add or multiply |
| 0x93 | POPC | Population count / bitwise logic |
| 0x97 | FSETP | FP32 set-predicate (comparison) |
| 0xA8 | SEL | Conditional select |
| 0xC9 | ISETP | Integer set-predicate (comparison) |
| 0xD5 | I2F | Integer-to-float conversion |
| 0xD6 | F2I | Float-to-integer conversion |
| 0x110 | IADD3 | Three-operand integer add |
| 0x122 | DFMA | FP64 fused multiply-add (Newton-Raphson core) |
Function Map
| Address | Size | Function | Role |
|---|---|---|---|
| sub_AED3C0 | 28 KB | Master lowering dispatcher | Vtable-dispatched, calls all template handlers |
| sub_170E8B0 | 1,166 bytes | DDIV top-level handler | Lazy init, operand legalization, template-call/inline dispatch |
| sub_170E260 | 1,615 bytes | DDIV coordinator | 298 vregs, 6 sub-expanders, names __ori_template_DDIV1/2/3 |
| sub_1704180 | ~5 KB | DDIV sub-expander 1 | Initial reciprocal approximation |
| sub_1705820 | 7,545 bytes | DDIV sub-expander 2 | Newton-Raphson core (MUFU.RCP + refinement) |
| sub_17075A0 | ~6 KB | DDIV sub-expander 3 | Second refinement iteration |
| sub_1709130 | ~6 KB | DDIV sub-expander 4 | Final multiply (a * 1/b) |
| sub_170AE80 | ~6 KB | DDIV sub-expander 5 | Rounding and normalization |
| sub_170CBD0 | ~5 KB | DDIV sub-expander 6 | Exception handling |
| sub_1718D60 | 790 bytes | DRCP/DSQRT top-level handler | Lazy init, shares structure with DDIV |
| sub_1718790 | 1,487 bytes | DRCP/DSQRT coordinator | 289 vregs, 7 sub-expanders |
| sub_170ED40 | ~5 KB | DRCP/DSQRT sub-expander 1 | MUFU.RCP seed extraction |
| sub_1710280 | ~5 KB | DRCP/DSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_17120F0 | ~6 KB | DRCP/DSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_17139D0 | ~6 KB | DRCP/DSQRT sub-expander 4 | Rounding/normalization |
| sub_1715910 | ~6 KB | DRCP/DSQRT sub-expander 5 | DSQRT path (MUFU.RSQ) |
| sub_1717470 | ~5 KB | DRCP/DSQRT sub-expander 6 | Final multiply, exceptions |
| sub_17276C0 | 1,011 bytes | DRSQRT top-level handler | HW capability dispatch |
| sub_1720D60 | 1,423 bytes | DRSQRT coordinator A | 247 vregs, 5 sub-expanders (full N-R) |
| sub_1719080 | ~5 KB | DRSQRT sub-expander 1 | MUFU.RSQ seed |
| sub_171A260 | ~6 KB | DRSQRT sub-expander 2 | Newton-Raphson iteration 1 |
| sub_171BB80 | ~6 KB | DRSQRT sub-expander 3 | Newton-Raphson iteration 2 |
| sub_171D3A0 | ~6 KB | DRSQRT sub-expander 4 | Normalization |
| sub_171EFD0 | ~5 KB | DRSQRT sub-expander 5 | Exception handling |
| sub_1727130 | 1,423 bytes | Integer div coordinator (B) | 59 vregs, dispatches to div templates |
| sub_1724A20 | 28,138 bytes | 32-bit integer div/mod | Newton-Raphson via MUFU.RCP + IMAD correction |
| sub_1728930 | 16,545 bytes | 64-bit unsigned div/mod | Double-width Newton-Raphson |
| sub_1727AC0 | 13,776 bytes | 64-bit signed div/mod | Signed wrapper around unsigned |
| sub_1729B50 | ~2 KB | 64-bit div dispatcher | Selects signed vs. unsigned handler |
| sub_1704070 | 263 bytes | Inline DDIV dispatcher | Register-pressure based 3-tier selection |
| sub_1702990 | 5,846 bytes | Inline DDIV (full unroll) | >20K register variant |
| sub_1701F10 | ~4 KB | Inline DDIV (partial spill) | >16K register variant |
| sub_1701860 | ~3 KB | Inline DDIV (minimal regs) | <=16K register variant |
| sub_1701140 | 8,690 bytes | Template scaffolding helper | Code object construction, called by all coordinators |
| sub_172A090 | 3,095 bytes | Conditional move emission | Scheduling barrier fixup |
Cross-References
- Code Generation Overview -- pipeline context showing templates as step 5 of 7
- Strength Reduction -- division-by-constant optimization (Granlund-Montgomery), peephole patterns
- Instruction Selection -- ISel mega-selector and pattern matchers that feed into template expansion
- SASS Instruction Encoding -- sub_7B9B80 bitfield packer used by encoding tables
- Peephole Optimization -- post-template simplification of emitted sequences
- Register Model -- virtual register allocation and the 0x40 template-ownership flag
- Scheduling -- scheduling of template-emitted instruction sequences
SASS Text Generation
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Phases 129 (DumpNVuCodeText) and 130 (DumpNVuCodeHex) convert the internal instruction stream into human-readable SASS assembly text and raw hex dumps respectively. The text output is the same format produced by cuobjdump --dump-sass and is used for --verbose output, DUMPIR diagnostics, --forcetext mode, --out-sass dumps, and the --self-check roundtrip verification pipeline. The subsystem spans two distinct address ranges: a PTX-level text generation system (580 formatter functions at 0x4DA340--0x5A8E40) and a SASS-level disassembly renderer (~123 virtual printer methods at 0x17F8000--0x181FFFF).
| Pipeline phases | 129 (DumpNVuCodeText), 130 (DumpNVuCodeHex) |
| Phase category | Debug (conditionally executed) |
| PTX formatter count | 580 functions at 0x4DA340--0x5A8E40 (~850 KB) |
| PTX dispatcher | sub_5D4190 (12.9 KB, two-level opcode dispatch) |
| SASS printer count | ~123 vtable methods at 0x17F8000--0x181FFFF |
| Builder/visitor vtable | ~520 method slots (4,160+ byte vtable) |
| Format string table | ~1.8 MB monolithic NUL-terminated string block |
| Temp buffer size | 50,000 bytes per formatter invocation |
| Largest formatter | sub_5A8E40 (wmma.load.b, 9,757 bytes) |
| Key helpers | sub_9D12F0 (operand encoder), sub_9DB7E0 (predicate printer) |
Output Format
SASS text generation produces output compatible with cuobjdump --dump-sass. The format includes control information (scheduling metadata), predicate guards, opcode mnemonics, operands with modifiers, and optional annotations.
Instruction Line Format
/*ADDR*/ {CTRL} OPCODE{.MODIFIERS} DST, SRC0{, SRC1{, SRC2}} ; /* LINE */
Concrete examples of the format ptxas produces:
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0004017802 */
/*0010*/ S2R R0, SR_CTAID.X ; /* 0x0000000000007919 */
/*0020*/ @P0 IMAD.MOV.U32 R4, RZ, RZ, c[0x0][0x168] ;
/*0030*/ IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x16c] ;
/*0040*/ ISETP.GE.AND P0, PT, R0, R2, PT ;
/*0050*/ @P0 EXIT ;
/*0060*/ STG.E [R4.64], R0 ;
/*0070*/ EXIT ;
/*0080*/ BRA 0x80 ;
Control Word Format
For architectures with explicit scheduling control (SM 50--SM 70), the control word is printed in a dedicated line before each group of three instructions:
/* 0x001c4400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ;
/*0010*/ S2R R0, SR_TID.X ;
/*0018*/ S2R R2, SR_CTAID.X ;
The 64-bit control word encodes scheduling data for three instructions:
| Field | Bits | Description |
|---|---|---|
| Stall count | 4 bits per instruction | Minimum cycles to wait before issue (0--15) |
| Yield hint | 1 bit per instruction | Suggest warp scheduler switch |
| Write barrier | 3 bits per instruction | Dependency barrier index (0--5, 7 = none) |
| Read barrier | 3 bits per instruction | Read dependency barrier mask |
| Wait barrier mask | 6 bits per instruction | Which barriers to wait on before issue |
For SM 75+ architectures (Turing and later), scheduling information is embedded per-instruction rather than in grouped control words, so the text output places it differently or omits the separate control line.
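For illustration, a hypothetical decoder for the grouped control word. The per-instruction field widths follow the table above plus a 4-bit reuse field, packed 21 bits per instruction as documented by public Maxwell-era reverse-engineering efforts; the exact bit order here is an assumption, not something recovered from ptxas.

```python
def decode_ctrl(word):
    """Split a 64-bit control word into three per-instruction slots
    (assumed layout: stall:4, yield:1, wrtdb:3, readb:3, watdb:6, reuse:4)."""
    slots = []
    for i in range(3):
        c = (word >> (21 * i)) & 0x1FFFFF
        slots.append({
            'stall': c & 0xF,            # cycles to wait before issue
            'yield': (c >> 4) & 1,       # warp-switch hint
            'wrtdb': (c >> 5) & 0x7,     # write dependency barrier (7 = none)
            'readb': (c >> 8) & 0x7,     # read dependency barrier (7 = none)
            'watdb': (c >> 11) & 0x3F,   # wait-on-barrier mask
            'reuse': (c >> 17) & 0xF,    # operand reuse cache flags
        })
    return slots
```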
Hex Dump Format (Phase 130)
Phase 130 (DumpNVuCodeHex) emits the raw encoding bytes as hex values:
/*0000*/ 0x00000a0004017802
/*0008*/ 0x0000000000007919
/*0010*/ 0x000000ff0aff7824
Each line contains the instruction address and its encoded QWORD(s). For 128-bit instructions, two QWORDs are printed.
Architecture
The text generation subsystem has two levels: a PTX-level pretty-printer that formats instructions from the Ori IR representation, and a SASS-level disassembly renderer that decodes binary-encoded SASS instructions back to text.
Level 1: PTX Instruction Text Formatters
This is the primary text generation system. The 580 formatter functions convert internal instruction representations (accessed via the instruction object at *(a1+1096)) into PTX assembly text strings.
sub_5D4190 (12.9 KB, dispatcher)
├─ First: calls sub_5D1660 to initialize intrinsic ID table (608 entries)
├─ Registers 121 named opcodes at a1+808 via sub_426150()
├─ Registers ~400 hash-keyed opcodes at a1+816 via sub_426150()
└─ Dispatches to one of 580 formatters at 0x4DA340-0x5A8E40
└─ Each: alloc 50 KB → sprintf via format table → shrink-copy → free
The dispatcher uses a two-level dispatch strategy:
- Named dispatch (121 opcodes): Direct string-to-function registration for recent or complex instructions. The opcode name string (e.g., "wmma.load.a", "tcgen05.mma", "barrier.cta") is looked up in a hash map at a1+808.
- Hash dispatch (~400 opcodes): Numeric hash values of opcode names are used as keys in a second hash map at a1+816. The hash values are stored as decimal string representations (e.g., "2644314910", "605425506"). This covers the stable ISA core -- arithmetic, logic, loads, stores, branches, conversions.
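In sketch form, the lookup order is: named table first, numeric-hash table second. The entry shapes and linear scans below are stand-ins for the real opaque hash containers registered through sub_426150; only the two-level ordering is taken from the binary.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef char *(*formatter_fn)(int64_t ctx, int64_t fmt_table);

/* Hypothetical entry shapes; the real containers at a1+808 / a1+816
 * are opaque hash maps, modeled here as flat arrays. */
struct named_entry  { const char *name; formatter_fn fn; };
struct hashed_entry { uint64_t    hash; formatter_fn fn; };

/* Placeholder formatter so the sketch is self-contained. */
static char *fmt_stub(int64_t ctx, int64_t fmt_table) {
    (void)ctx; (void)fmt_table;
    return 0;
}

/* Named dispatch first (121 opcodes), then hash dispatch (~400). */
static formatter_fn dispatch(const char *opcode, uint64_t opcode_hash,
                             const struct named_entry *named, int n_named,
                             const struct hashed_entry *hashed, int n_hashed) {
    for (int i = 0; i < n_named; i++)
        if (strcmp(named[i].name, opcode) == 0)
            return named[i].fn;
    for (int i = 0; i < n_hashed; i++)
        if (hashed[i].hash == opcode_hash)
            return hashed[i].fn;
    return 0; /* unrecognized opcode */
}
```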
Level 2: SASS Disassembly Renderer
The SASS printer at 0x17F8000--0x181FFFF operates on binary-encoded SASS instructions and produces text through a builder/visitor pattern. This is used for the --self-check roundtrip verification and --out-sass output.
SASS instruction (binary-encoded)
│
├─ Read opcode at instruction+72, mask BYTE1 &= 0xCF
├─ Switch on canonical opcode ID
│
├─ For each operand:
│ └─ sub_9D12F0(output_128, ctx, instr, operand_idx, stride, mode, flag)
│ → 64-byte operand encoding structure
│
├─ Emit via builder/visitor vtable at *(a1 + 24):
│ ├─ vtable+936: begin_predicate_guard()
│ ├─ vtable+3768: begin_operands()
│ ├─ vtable+16: emit_operand(kind_id, ...)
│ ├─ vtable+272: emit_integer(value)
│ ├─ vtable+1760: set_rounding_mode(mode)
│ ├─ vtable+3952: emit_saturation_flag()
│ ├─ vtable+3960: emit_ftz_flag()
│ ├─ vtable+3968: emit_negate_flag()
│ ├─ vtable+4072: emit_cache_operation()
│ ├─ vtable+4080: emit_eviction_hint()
│ ├─ vtable+944: end_predicate_guard()
│ └─ vtable+4160: end_statement()
│
└─ Predicate guard: sub_9DB7E0 (662 bytes, 19 callers)
The builder/visitor vtable has approximately 520 method slots (vtable spans 4,160+ bytes), making it one of the largest virtual dispatch interfaces in the binary. Different concrete visitor implementations produce different output formats (text, hex, self-check comparison).
Formatter Template
Every PTX formatter function is mechanically generated from instruction definition tables. All 580 follow an identical structure:
char* format_OPCODE(int64_t a1, int64_t a2) {
// a1 = instruction context (instruction data at a1+1096)
// a2 = format string table base pointer (~1.8 MB)
// Phase 1: Allocate temp buffer
int64_t pool = ((int64_t*)sub_4280C0(a1, a2))[3]; // arena_get_pool
char* buf = (char*)sub_424070(pool, 50000); // pool_alloc(50KB)
if (!buf) sub_42BDB0(pool, 50000, ...); // alloc_fail_abort
// Phase 2: Build instruction text via sprintf chain
int pos = sprintf(buf, "%s", (char*)(a2 + OFFSET_A)); // opcode prefix
if (sub_70B6E0(*(a1+1096))) // has_predicate?
pos += sprintf(buf+pos, fmt, sub_70B780(*(a1+1096))); // predicate name
pos += sprintf(buf+pos, "%s", (char*)(a2 + OFFSET_B)); // operand template
// ... more operands via sub_70B8E0, sub_70B910, sub_70B920 ...
strcpy(buf+pos, (char*)(a2 + OFFSET_N)); // trailing text
// Phase 3: Shrink-copy to exact size
size_t len = strlen(buf);
int64_t pool2 = ((int64_t*)sub_4280C0(buf, ...))[3];
char* result = (char*)sub_424070(pool2, len + 1);
strcpy(result, buf);
// Phase 4: Free temp buffer
sub_4248B0(buf); // pool_free
return result;
}
The format string table (a2) is a single monolithic ~1.8 MB block of NUL-terminated strings containing pre-assembled text templates with %s, %llu, %d placeholders. Different formatters access it at different offsets:
| Formatter | Offset into a2 | Approximate position |
|---|---|---|
| wgmma.mma_async | 1,731,609 | ~1.7 MB |
| wmma.mma | 1,731,130 | ~1.7 MB |
| rsqrt | 67,573 | ~67 KB |
| copysign | 110,152 | ~110 KB |
| vavrg4 | 286,309 | ~286 KB |
| guardrails.alloc | ~1,843,620 | ~1.8 MB |
This design trades memory for speed: instead of building instruction text dynamically, ptxas stores the complete format template and fills in operand names at runtime.
Instruction Operand Accessors
All formatters query the instruction object through a uniform set of tiny accessor functions:
| Address | Size | Callers | Identity |
|---|---|---|---|
sub_70B700 | 14 B | 946 | has_predicate() |
sub_70B6E0 | 14 B | 42 | has_predicate_v2() |
sub_70B710 | 111 B | 348 | get_opcode_string() |
sub_70B780 | 151 B | 514 | get_predicate_name() |
sub_70B8E0 | 12 B | 1,449 | get_reg_operand(idx) |
sub_70B910 | 12 B | 1,656 | get_src_part0(idx) |
sub_70B920 | 12 B | 1,296 | get_src_part1(idx) |
sub_70B930 | 7 B | 68 | get_operand_count() |
sub_70B4C0 | 22 B | 46 | get_base_address() |
sub_70CA60 | 11 B | 480 | get_operand_type(idx) |
sub_70CA70 | 427 B | 191 | get_type_suffix() |
sub_70CD20 | 122 B | 158 | get_operand_offset(idx) |
sub_710860 | 39 B | 2,953 | get_data_type(idx, part) |
sub_70FA00 | 10 B | 286 | get_target_sm(idx) |
sub_70FA10 | 66 B | 7 | check_target_sm(idx, str) |
sub_709910 | 14 B | 13 | get_variant_count() |
sub_709A10 | 73 B | 46 | get_variant_string() |
sub_707CE0 | 22 B | 93 | get_address_operand(idx) |
sub_709760 | 127 B | 21 | get_comparison_op() |
sub_709FE0 | 11 B | 17 | get_rounding_mode() |
sub_70A500 | 13 B | 15 | get_saturation_mode() |
sub_70B3F0 | -- | -- | get_ftz_flag() |
sub_707530 | -- | -- | get_precision_string() |
sub_707C80 | -- | -- | get_scope_string() |
sub_7075E0 | -- | -- | get_layout_string() |
sub_707BE0 | -- | -- | get_shape_string() |
sub_70A810 | -- | -- | get_scale_string() |
All accessors read from the instruction object at *(a1+1096). The tiny sizes (7--151 bytes for most) indicate these are simple field extractions from the instruction record.
Memory Allocation
The formatter memory lifecycle uses a pool allocator:
| Function | Size | Callers | Identity |
|---|---|---|---|
sub_4280C0 | 597 B | 3,928 | arena_get_pool(ctx, table) |
sub_424070 | 2,098 B | 3,809 | pool_alloc(pool, size) |
sub_42BDB0 | 14 B | 3,825 | alloc_fail_abort() |
sub_4248B0 | 923 B | 1,215 | pool_free(ptr) |
Every formatter allocates a 50,000-byte temporary buffer, builds the instruction string via sprintf chains, measures the result with strlen, allocates an exact-size copy, and frees the temporary. The 50 KB buffer provides headroom for the largest instructions (WMMA loads produce multi-KB strings) but is wasteful for simple 2-operand instructions that generate ~50-byte strings.
Predicate Guard Printing
Predicate guards (@P0, @!P1, etc.) are printed by checking has_predicate() on the instruction, then formatting the guard via get_predicate_name():
// PTX-level predicate printing (in every formatter)
int pos = sprintf(buf, "%s", opcode_prefix);
if (sub_70B6E0(*(a1+1096))) { // has_predicate?
int64_t pred = sub_70B780(*(a1+1096)); // get_predicate_name
pos += sprintf(buf+pos, guard_fmt, pred); // e.g., "@P0 " or "@!P1 "
}
// SASS-level predicate printing (in disassembly renderer)
// sub_9DB7E0 (662 bytes, 19 callers) — emits guard through builder vtable
// calls builder->begin_predicate_guard() at vtable+936
// emits predicate register name
// calls builder->end_predicate_guard() at vtable+944
Register and Operand Formatting
Register operands are resolved from the instruction's operand array. The formatter accesses operands by index through get_reg_operand(idx), get_src_part0(idx), and get_src_part1(idx). The standard register naming follows NVIDIA conventions:
| Register class | Naming | Examples |
|---|---|---|
| General-purpose | R0--R255 | R0, R4, R255 |
| Zero register | RZ | RZ |
| Predicate | P0--P6, PT | @P0, PT |
| Uniform | UR0--UR63 | UR4, UR16 |
| Uniform predicate | UP0--UP6, UPT | UP0 |
| Constant buffer | c[bank][offset] | c[0x0][0x168] |
| Special | SR_* | SR_CTAID.X, SR_TID.X |
For the SASS disassembly renderer, the register class discriminator sub_91C840 (347 bytes, 232 callers) maps internal type codes 1--0x17 to output class IDs 0--18, covering integer registers, float registers, double registers, predicate registers, condition registers, texture/surface references, and uniform registers.
The operand encoder sub_9D12F0 (1,423 bytes, 289 callers) is the core serializer for SASS-level printing. It takes an instruction and operand index, resolves whether the operand is a register, immediate, or memory reference, handles constant buffer lookups, and fills a 64-byte (4x __m128i) encoding structure that the builder/visitor consumes.
Address and Offset Formatting
Memory operands are formatted with address space qualifiers and offset expressions:
[R4.64] — register indirect, 64-bit
[R4+0x10] — register + offset
c[0x0][0x168] — constant buffer bank 0, offset 0x168
[UR4] — uniform register indirect
The address space qualifier resolver sub_9CEB50 (185 bytes, 57 callers) combines address space information from the operand descriptor with the instruction context. For SASS-level output, the address space emitter sub_9E7B00 and related functions (sub_9E9910, sub_9E9A70) handle data type and memory space qualifiers.
Architecture-Conditional Formatting
86 of the 580 formatters contain architecture-conditional paths that check the target SM version via sub_70FA00 (numeric comparison) or sub_70FA10 (string comparison). Architecture-specific formatting reflects hardware evolution:
| SM | Era | Formatting impact |
|---|---|---|
| sm_20, sm_21 | Fermi (2010) | copysign has different operand layout (7 vs 5 fields) |
| sm_62 | Pascal mobile (2016) | vavrg4 gets per-component register formatting |
| sm_103 | Blackwell Ultra (2025) | rsqrt gains new operand layout for extended precision |
Five formatters additionally use string-based SM comparison via sub_70FA10:
- sub_4DD860 (copysign): checks "sm_20", "sm_21"
- sub_56BA60 (vavrg4): checks "sm_62"
- sub_56C8D0 (dp2a.lo): SM string comparison
- sub_577BA0 (dp2a.hi): SM string comparison
- sub_583190 (rsqrt): checks "sm_103"
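A minimal sketch of one such string-based conditional, using the copysign case from the table above (7 operand fields on Fermi, 5 on later targets). The strcmp stands in for sub_70FA10's string-based SM check; the function name is ours.

```c
#include <string.h>
#include <assert.h>

/* copysign-style arch-conditional path: Fermi targets (sm_20/sm_21)
 * used a 7-field operand layout, later targets use 5. */
static int copysign_field_count(const char *sm_name) {
    if (strcmp(sm_name, "sm_20") == 0 || strcmp(sm_name, "sm_21") == 0)
        return 7;   /* legacy Fermi operand layout */
    return 5;       /* modern layout */
}
```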
SASS Disassembly Renderer
The SASS-level renderer at 0x17F8000--0x181FFFF (~160 KB, ~123 virtual entry points) converts binary-encoded SASS instructions into textual SASS assembly. Unlike the PTX formatters (Level 1) which work from the high-level Ori IR via sprintf chains, the SASS renderer decodes the binary instruction encoding and drives a builder/visitor object through a structured sequence of emit_* calls. The builder's concrete implementation determines the output format -- text for --out-sass, comparison data for --self-check, or binary encoding verification.
Internal Layers
The subsystem splits into five layers by address range and function role:
| Layer | Range | Count | Role |
|---|---|---|---|
| A: Encoding templates | 0x17F8000--0x180FFFF | ~75 | Build per-opcode operand layout descriptors |
| B: Accessor vtable methods | 0x1810700--0x1810BFF | ~15 | ISA version/class discriminator predicates |
| C: Format-class printers | 0x1810D20--0x18167FF | ~50 | Workhorses: decode operands + emit through builder |
| D: Complex multi-format printers | 0x1817000--0x181CFFF | ~15 | Texture, multi-operand, predicated printers |
| E: Post-processing hooks | 0x181E000--0x181FFFF | ~8 | ISA-override detection, fixup dispatch |
All ~123 entry points have zero static callers, confirming they are virtual method overrides dispatched through vtables. The printer dispatch layer at 0xAA8000--0xACA000 (sub_AA9330, sub_AA9860, sub_AAB9C0, sub_AC99D0) invokes them.
Rendering Protocol
Every SASS printer receives (a1, a2) where a1 is the printer context (builder pointer at a1+24) and a2 is the binary-encoded instruction. The rendering follows a fixed protocol:
1. vtable[0](builder, instruction_kind_id) // begin_instruction
2. vtable[3760](builder, sync_mode) // set_sync_type (if applicable)
3. vtable[3768](builder) // begin_operand_list
4. sub_9DB7E0(a1, a2, 1) // emit predicate guard (@Px)
5. For each operand:
a. sub_9D12F0(&buf, a1, a2, idx, stride, mode, flag) // encode -> 64B struct
b. vtable[16](builder, kind_id, buf...) // emit_operand
6. vtable[3952](builder) // emit_saturation (.SAT)
7. vtable[3960](builder) // emit_ftz (.FTZ)
8. vtable[3968](builder) // emit_negate (.NEG)
9. vtable[4072](builder) // emit_cache_operation
10. vtable[4080](builder) // emit_eviction_hint
11. vtable[4160](builder) // end_instruction
The protocol is directly visible in decompiled code. In sub_1812F60 (16-DWORD immediate printer), the function begins with vtable[0](builder, 89) (begin instruction kind 89), calls vtable[3760] for sync type, vtable[3768] for begin operand list, sub_9DB7E0 for predicate guard, then loops 16 times calling vtable[272] (create integer operand) followed by vtable[16] (emit operand) with kind IDs 55 through 70 -- one per DWORD.
In sub_1810D20 (comparison-mode printer), the function first reads the modifier word from the operand array at instruction+84, switches on (modifier >> 4) & 0xF, calls vtable[3528]/vtable[3536] to configure comparison mode and variant, then emits 2--3 operands via the standard sub_9D12F0 + vtable[16] sequence.
Builder/Visitor Vtable
The builder object at *(a1 + 24) exposes a vtable spanning 4,160+ bytes (~520 method slots at 8 bytes each). The complete set of identified methods:
| Offset | Method | Category |
|---|---|---|
| +0 | begin_instruction(kind_id) | Framing |
| +8 | get_current_instruction() | Accessor |
| +16 | emit_operand(kind_id, operand_buf...) | Core emission |
| +24 | post_process_operand() | After-emit hook |
| +112 | get_register_size_32() | Register geometry |
| +120 | get_register_size_64() | Register geometry |
| +128 | create_register_operand() | Operand factory |
| +152 | create_memory_operand() | Operand factory |
| +192 | create_special_operand() | Operand factory |
| +208 | create_literal_operand() | Operand factory |
| +272 | create_integer_operand(value) | Operand factory |
| +304 | create_register_ref_operand() | Operand factory |
| +368 | set_address_space() | Memory qualifier |
| +936 | begin_predicate_guard() | Predicate block |
| +944 | end_predicate_guard() | Predicate block |
| +984 | set_predicate_mode() | Predicate negate/true |
| +1000 | emit_modifier() | Generic modifier |
| +1056 | set_offset_mode() | Address offset |
| +1128 | emit_width_qualifier() | .B32, .B64 |
| +1392 | set_comparison_flag() | Comparison type |
| +1760 | set_rounding_mode() | .RN, .RZ, .RM, .RP |
| +1936 | begin_sync_block() | Sync scope |
| +1944 | end_sync_block() | Sync scope |
| +2016 | set_sync_width() | Sync width |
| +2024 | set_sync_depth() | Sync depth |
| +2584 | set_uniform_flag() | .U modifier |
| +2960 | set_address_mode() | Address mode |
| +2992 | set_cache_level_a() | Cache hint (L1) |
| +3000 | set_cache_level_b() | Cache hint (L2) |
| +3096 | set_comparison_type() | Second comparison slot |
| +3128 | set_source_type_a() | Source type |
| +3136 | set_source_type_b() | Source type |
| +3144 | set_interlock_mode() | Memory ordering |
| +3152 | begin_comparison_block() | Comparison section |
| +3160 | set_comparison_width() | Comparison width |
| +3520 | set_data_width() | Operand width |
| +3528 | set_comparison_mode() | Comparison config |
| +3536 | set_comparison_variant() | Comparison variant |
| +3560 | set_conversion_type() | Conversion modifier |
| +3576 | begin_conversion() | Conversion block |
| +3760 | set_sync_type() | Synchronization type |
| +3768 | begin_operand_list() | Operand section |
| +3776 | emit_rounding_decoration() | Rounding modifier |
| +3824 | emit_texture_header() | Texture header index |
| +3952 | emit_saturation_flag() | .SAT |
| +3960 | emit_ftz_flag() | .FTZ |
| +3968 | emit_negate_flag() | .NEG |
| +4072 | emit_cache_operation() | Cache operation hint |
| +4080 | emit_eviction_hint() | Eviction priority |
| +4160 | end_instruction() | Framing |
Different concrete visitor implementations produce different output formats. The vtable design means adding a new output format (e.g., JSON, binary verification) requires only a new visitor class with no changes to any of the ~123 printer functions.
Encoding Template Builders (Layer A)
~75 functions at 0x17F8000--0x180FFFF build per-opcode instruction format descriptors that define the expected operand signature. Each function:
- Sets the SASS opcode ID: *(a2+12) = opcode_number
- Loads a 128-bit format descriptor: *(a1+8) = xmmword_23Fxxxx (from rodata)
- Fills up to 10 operand slots at a1+24..a1+120 with type codes, register class IDs, and modifier flags
- Writes expected-value constraints at a1+64..a1+160 (-1 = any)
- Writes type constraint modifiers at a1+104..a1+200
From sub_17F8210 (opcode 274):
*(a2+12) = 274; // SASS opcode ID
*(a1+8) = _mm_loadu_si128(&xmmword_23F21B0); // 128-bit descriptor
*(a1+24) = 10; // operand 0: predicate register type
*(a1+64) = -1; // operand 0: any value accepted
*(a1+104) = 0; // operand 0: no modifier constraint
*(a1+28) = 17; // operand 1: specific register class
*(a1+68) = -1; // operand 1: any value
*(a1+108) = 3; // operand 1: modifier constraint 3
// ... remaining operands bulk-copied via SSE from xmmword_23F1C60 table
The 128-bit descriptors at xmmword_23F1xxx--23F2xxx encode canonical operand layouts. The bulk SSE copies (_mm_load_si128/_mm_loadu_si128) fill 4 operand slots per iteration, making the template builders compact despite handling up to 10 operand positions.
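The builder pattern can be restated as a plain-C sketch. The struct layout and names below are ours: field groups mirror the decompiled access pattern (type codes near a1+24, expected values near a1+64, modifier constraints near a1+104), but exact offsets and the rodata contents are not reproduced; memcpy stands in for the SSE copies.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Hypothetical template record modeling a Layer A descriptor. */
typedef struct {
    uint8_t descriptor[16];  /* 128-bit layout template (*(a1+8)) */
    int32_t type_code[10];   /* operand type codes */
    int32_t expected[10];    /* expected values; -1 = any accepted */
    int32_t modifier[10];    /* type constraint modifiers */
} op_template;

/* Sketch of sub_17F8210 (opcode 274), per the decompiled excerpt. */
static void build_template_274(op_template *t, int32_t *opcode_out,
                               const uint8_t rodata_desc[16]) {
    *opcode_out = 274;                       /* *(a2+12) = opcode ID */
    memcpy(t->descriptor, rodata_desc, 16);  /* the _mm_loadu_si128 copy */
    t->type_code[0] = 10; t->expected[0] = -1; t->modifier[0] = 0;
    t->type_code[1] = 17; t->expected[1] = -1; t->modifier[1] = 3;
    for (int i = 2; i < 10; i++) {           /* bulk-copied from a rodata
                                              * table in the real builder */
        t->type_code[i] = 0; t->expected[i] = -1; t->modifier[i] = 0;
    }
}
```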
Format-Class Printers (Layer C)
The instruction's format class at instruction+76 determines which printer handles it. The dispatch computes index = format_class - 11, then looks up dword_23B39E0[index] for the encoding strategy:
| Strategy | Value | Description |
|---|---|---|
| Default | 0 | Standard register fields |
| Wide | 1 | 9-bit register fields, 8 sequential operands |
| Pair | 2 | 2x register fields per operand |
| Extended | 3 | Extra modifier bits |
| Special | 4+ | Texture header, 16-DWORD immediate |
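The index arithmetic can be sketched as follows. Only the format_class - 11 indexing and the 0--4 strategy numbering are recovered facts; the per-index contents of the real 10-entry dword_23B39E0 table are placeholders here.

```c
#include <assert.h>

/* Sketch of the dword_23B39E0 lookup: format class -> strategy. */
static int encoding_strategy(int format_class) {
    static const int strategy_table[10] =
        { 0, 1, 2, 3, 4, 4, 4, 4, 4, 4 };    /* placeholder contents */
    int index = format_class - 11;
    if (index < 0 || index >= 10)
        return -1;                            /* not a printable class */
    return strategy_table[index];
}
```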
Printer functions for each format class:
| Function | Size | Format class | Evidence |
|---|---|---|---|
sub_1810D20 | 8.8 KB | Comparison-mode | Switches on (modifier >> 4) & 0xF: case 4 emits comparison with two variants, case 6 emits single-variant. Calls vtable[3528]/vtable[3536] for comparison config |
sub_18111F0 | 11.6 KB | Wide-operand | 8 sequential sub_9D12F0 calls with indices 0--7 |
sub_1811E20 | 11.6 KB | Wide + special | Both sub_9D12F0 and sub_9CF740 calls |
sub_1812890 | 10.5 KB | Register + constant | sub_9CF8A0 for constant folding |
sub_1812F60 | 15.3 KB | 16-DWORD immediate | sub_7E4CF0 iterator, 16x vtable[272] + vtable[16] with kind IDs 55--70 |
sub_18141C0 | 6.5 KB | Per-operand comparison | Dispatch entry from sub_1820000 |
sub_1814660 | 7.1 KB | Load/store | sub_C49400 + sub_9CEB50 for address space |
sub_1814B10 | 17.6 KB | Load/store + predicated | sub_C49400, sub_91E860, sub_91C840, sub_9CEB50 |
sub_1815810 | 12.7 KB | Wide variant | Similar to sub_1811E20 |
sub_1816000 | 13.1 KB | Data-type qualified | sub_9E9910 for data type emission |
sub_18167F0 | 11.8 KB | Memory-access | sub_9E7B00 for address space qualifier, sub_A3B930 for operand modifier |
sub_1816FC0 | 6.4 KB | Modifier-heavy | Checks bits 6, 7, 14 of operand word for negate/absolute modifiers |
Texture/Surface Printer (Layer D)
The texture/surface printer sub_18189C0 is the largest at 45.2 KB. It handles the complete texture and surface instruction families:
sub_18189C0 (45.2 KB, 1361 lines)
├─ Read opcode at +72, mask to canonical form
├─ Giant switch on opcodes:
│ 18 (FADD/FMUL?), 119 (MUFU?), 186 (TEX),
│ 211 (TLD), 283 (SULD), 315 (SUST)
├─ Check operand modifier bits for predication/negation
├─ dword_23B39E0[format_class-11] → subtype (values 0-4)
├─ word_23B3A58[subtype] → builder kind ID
├─ Emit predicate: vtable[89], begin_operand_list
├─ sub_9DB7E0 → predicate guard
├─ For each operand: sub_9D12F0 → builder->emit_operand
├─ If format > 9: sub_1817C50 (12.8KB) → texture header index
│ └─ Linearizes from 2D bit fields: bits[0:8] x bits[9:17] → index 0-52
├─ Emit cache/eviction: vtable[4080], vtable[4072]
├─ Emit saturation/ftz: vtable[3952], vtable[3960], vtable[3968]
└─ Return 1 on success
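The linearization step performed by sub_1817C50 can be sketched as below. The bit ranges and the 0--52 output range are recovered; the row-major combination and the row_width parameter are assumptions for illustration, not the verified formula.

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: combine two 9-bit fields into one texture header index.
 * ASSUMED row-major formula; only the bit ranges are recovered. */
static int texture_header_index(uint32_t v, unsigned row_width) {
    uint32_t x = v & 0x1FF;          /* bits [0:8]  */
    uint32_t y = (v >> 9) & 0x1FF;   /* bits [9:17] */
    return (int)(y * row_width + x);
}
```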
The multi-operand printer sub_181B370 (27.8 KB) handles instructions with many operand variants (VOTE at opcode 0x7A, multi-op at 0x138), emitting up to 12 sequential operands through sub_9CEF90 (extended operand encoder) and sub_9CF740 (immediate encoder).
ISA-Override Detection (Layer E)
sub_181E1D0 (7.3 KB) is a post-processing fixup dispatcher called from sub_AA9330. It performs ISA-target-aware fixups by comparing vtable method pointers against known discriminator functions:
// If the current ISA target has the DEFAULT implementation:
if (vtable[111] == sub_1810B90) // default comparison handler
apply_default_fixup();
// Otherwise the target has OVERRIDDEN the method:
else
apply_specialized_fixup(); // e.g., sub_1BCBB90 for arch-specific
This mechanism supports 45 opcodes (0x12, 0x16, 0x24, ..., 0x141) and dispatches to architecture-specific post-processors (sub_1BCBB90, sub_1BCC2D0, sub_BCCF80, sub_1BCF120) or re-emits modifiers via sub_9E9910/sub_9E9A70.
The discriminator functions at 0x1810700--0x1810BFF (~15 tiny functions) serve as sentinel values: sub_1810720, sub_1810750, sub_18108A0, sub_18108D0, sub_1810B90. Their identity (which function pointer is stored) determines which specialization path the fixup dispatcher takes.
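The sentinel mechanism reduces to a pointer identity check, sketched below. The handler bodies are empty stand-ins for sub_1810B90 and an architecture-specific override such as sub_1BCBB90; the point is that the stored pointer itself is the discriminator, not anything the function computes.

```c
#include <assert.h>

typedef void (*fixup_fn)(void);

/* Stand-ins for the default handler (sub_1810B90) and an
 * arch-specific override (e.g., sub_1BCBB90). */
static void default_comparison_handler(void) {}
static void arch_specific_handler(void)      {}

/* Identity comparison against the sentinel, not a call: if the vtable
 * slot still holds the default, the default fixup path applies. */
static int needs_default_fixup(fixup_fn vtable_slot) {
    return vtable_slot == default_comparison_handler;
}
```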
Instruction Object Layout
The binary instruction object (a2) used by all SASS printers:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Context/vtable pointer |
| +8 | 8 | ISA context pointer (register file, instruction info table) |
| +24 | 8 | Builder/visitor object pointer |
| +32 | 8 | Operand metadata pointer |
| +40 | 1 | Half-precision flag |
| +48 | 8 | Operand modifier context |
| +72 | 4 | Opcode (bits 12--13 are variant flags, masked via &0xCFFF) |
| +76 | 4 | Format class (subtract 11 for dword_23B39E0[] indexing) |
| +80 | 4 | Operand count |
| +84+ | 8*N | Operand array (N operands, 8 bytes each) |
Each 8-byte operand slot encodes:
| Bits | Word | Field |
|---|---|---|
| 28--30 | 0 | Operand type tag: 1=register, 4=address, 5=constant buffer, 7=special |
| 0--23 | 0 | Register/constant index |
| 24--27 | 0 | Modifier flags |
| 0 | 1 | Negate |
| 1 | 1 | Absolute value |
| 20 | 1 | Constant pool flag (0x100000) |
| 29 | 1 | Sign extension (0x20000000) |
| 30 | 1 | Uniform flag (0x40000000) |
| 31 | 1 | Negation modifier (0x80000000) |
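The slot layout above translates directly into a decoder. The bit positions come from the table; the struct and field names are ours, and only a subset of the word-1 flags is shown.

```c
#include <stdint.h>
#include <assert.h>

/* Decoded view of one 8-byte operand slot (two 32-bit words). */
typedef struct {
    unsigned type_tag;   /* w0 bits 28--30: 1=reg, 4=addr, 5=cbuf, 7=special */
    unsigned index;      /* w0 bits 0--23 : register/constant index */
    unsigned mod_flags;  /* w0 bits 24--27: modifier flags */
    int negate;          /* w1 bit 0 */
    int absolute;        /* w1 bit 1 */
    int uniform;         /* w1 bit 30 (0x40000000) */
} operand_slot;

static operand_slot decode_operand(uint32_t w0, uint32_t w1) {
    operand_slot s;
    s.type_tag  = (w0 >> 28) & 0x7;
    s.index     =  w0        & 0xFFFFFF;
    s.mod_flags = (w0 >> 24) & 0xF;
    s.negate    = (w1 >> 0)  & 1;
    s.absolute  = (w1 >> 1)  & 1;
    s.uniform   = (w1 >> 30) & 1;
    return s;
}
```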
Global Lookup Tables
| Table | Size | Index | Purpose |
|---|---|---|---|
dword_23B39E0[10] | 40 B | format_class - 11 | Format class to encoding strategy (0--4) |
word_23B3A58[4] | 8 B | Subtype from above | Subtype to builder kind_id mapping |
dword_23B3A20[14] | 56 B | register_class - 3 | Register class to comparison type ID |
dword_23B3980[7] | 28 B | width_field - 1 | Encoded width to builder width value |
xmmword_23F1xxx--23F2xxx | ~16 B each | Per-opcode | 128-bit operand layout descriptor templates |
SASS Renderer Function Map
| Address | Size | Callers | Identity | Confidence |
|---|---|---|---|---|
sub_17F8210 | ~1.3 KB | 0 (vtable) | Encoding template builder (opcode 274) | 95% |
sub_1810D20 | 8.8 KB | 0 (vtable) | Comparison-mode format-class printer | 90% |
sub_18111F0 | 11.6 KB | 0 (vtable) | Wide-operand format-class printer | 85% |
sub_1811E20 | 11.6 KB | 0 (vtable) | Wide-operand + special encoding printer | 85% |
sub_1812890 | 10.5 KB | 0 (vtable) | Register + constant operand printer | 85% |
sub_1812F60 | 15.3 KB | 0 (vtable) | 16-DWORD immediate printer | 90% |
sub_18141C0 | 6.5 KB | 0 (vtable) | Per-operand comparison printer | 85% |
sub_1814660 | 7.1 KB | 0 (vtable) | Load/store with address space printer | 85% |
sub_1814B10 | 17.6 KB | 0 (vtable) | Load/store + predication printer | 85% |
sub_1815810 | 12.7 KB | 0 (vtable) | Wide-operand variant printer | 80% |
sub_1816000 | 13.1 KB | 0 (vtable) | Data-type qualified printer | 85% |
sub_18167F0 | 11.8 KB | 0 (vtable) | Memory-access instruction printer | 85% |
sub_1816FC0 | 6.4 KB | 0 (vtable) | Modifier-heavy instruction printer | 85% |
sub_1817C50 | 12.8 KB | ~1 | Texture header index encoder | 90% |
sub_18189C0 | 45.2 KB | 0 (vtable) | Texture/surface instruction printer | 92% |
sub_181B370 | 27.8 KB | 0 (vtable) | Multi-operand instruction printer | 88% |
sub_181CF60 | 14.0 KB | 0 (vtable) | Predicated instruction printer | 85% |
sub_181D9B0 | 12.6 KB | 0 (vtable) | Load/store variant printer | 80% |
sub_181E1D0 | 7.3 KB | ~1 | ISA-override fixup dispatcher | 90% |
sub_181E630 | 14.7 KB | ~1 | Comparison instruction post-processor | 88% |
sub_181F000 | 7.6 KB | ~1 | Data-type specialized printer | 75% |
sub_181F4F0 | 17.3 KB | ~1 | Multi-variant data-type printer | 80% |
CLI Integration
--verbose / -v
Enables printing of code generation statistics after compilation. The statistics printers at sub_ABBA50--sub_ABEB50 (8 SM-variant clones, 7,603 bytes each) emit post-scheduling metrics in "# [...] " comment format.
--forcetext
Forces text-mode SASS output regardless of the default binary output mode. Internal flag: "forcetext=%d".
--out-sass
Generates reconstituted SASS text from the Capsule Mercury representation. Used for debugging the capmerc encode/decode roundtrip. Triggers the SASS text Flex lexer sub_720F00 (64 KB) for parsing in --self-check mode.
--self-check
Roundtrip verification for Capsule Mercury: encodes the instruction stream to capmerc format, decodes it back, renders both original and reconstituted as SASS text, and compares. The Flex lexer at sub_720F00 parses the text output for comparison. The SASS text formatter sub_719D00 (50 KB) builds the output for self-check.
DUMPIR
The DUMPIR environment variable (and related knobs) triggers intermediate representation dumps at named phases. Phase 129 (DumpNVuCodeText) is one of the dump targets, emitting the full instruction stream as formatted text when DUMPIR includes that phase name.
Formatter Size Distribution
Function size directly correlates with PTX instruction complexity:
| Tier | Size range | Count | Description |
|---|---|---|---|
| Tiny | < 500 B | 13 | Simple 2-operand (wgmma.fence: 295 B) |
| Small | 500--1,000 B | 191 | Standard 3--4 operand (copysign: 794 B) |
| Medium | 1,000--2,000 B | 319 | Instructions with modifiers (bfind: 1,130 B) |
| Large | 2,000--4,000 B | 36 | Arch-conditional paths (membar: 2,788 B) |
| Very large | 4,000--6,000 B | 20 | Complex multi-form (tex.grad: 5,636 B) |
| Monster | 6,000--10,000 B | 2 | WMMA matrix loads (wmma.load.b: 9,757 B) |
The WMMA load/store formatters account for 34,423 bytes (roughly 4% of the total formatter code range), reflecting the combinatorial explosion of matrix shapes, data types, layouts, and architectures.
Named Opcode Dispatch Table
The 121 named opcodes registered at a1+808 by sub_5D4190:
| Category | Opcodes |
|---|---|
| Memory fence | membar |
| Conversion | cvt, tensormap.replace |
| Math | div, div.full, rem, rcp, rsqrt, ex2, lg2, sqrt, tanh, copysign |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp |
| Load/store | _ldldu, ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk, prefetch |
| Texture | tex, tex.base, tex.level, tex.grad, tld4, sured.b |
| Video SIMD | vadd--vmad, vadd2--vavrg2, vadd4--vavrg4 |
| Dot product | dp2a.lo, dp2a.hi, dp4a |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta, barrier.cta, + .arrive/.red variants, bar.warp |
| Warp ops | vote, shfl, match, redux |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor |
| Cache policy | createpolicy.range, createpolicy.fractional, createpolicy.cvt |
| Multi-memory | multimem.ld_reduce, multimem.st, multimem.red |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma, mma |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws |
| Guardrails | _tcgen05.guardrails.is_phase_valid, .are_columns_allocated, .is_current_warp_valid_owner, .in_physical_bounds, .allocation_granularity, .datapath_alignment, .sp_consistency_across_idesc_mod, .check_sparse_usage |
The remaining ~400 opcodes (arithmetic, logic, load/store, control flow, etc.) are dispatched through hash values at a1+816.
SASS Printer Key Functions
| Address | Size | Callers | Identity |
|---|---|---|---|
sub_5D4190 | 12.9 KB | 1 | PTX instruction text dispatch + intrinsic registration |
sub_5D1660 | 46 KB | 1 | Intrinsic library registration (608 entries) |
sub_5FF700 | 354 KB | -- | Builtin function declaration emitter (prototype generator) |
sub_4DA340 | 61 B | 1,080 | Builtin declaration lookup helper |
sub_719D00 | 50 KB | -- | SASS text formatter (self-check output builder) |
sub_720F00 | 64 KB | -- | Flex lexer for SASS text parsing (self-check input) |
sub_9D12F0 | 1.4 KB | 289 | Operand encoder (64-byte struct per operand) |
sub_9DB7E0 | 662 B | 19 | Predicate guard printer |
sub_91C840 | 347 B | 232 | Register class discriminator |
sub_9CEB50 | 185 B | 57 | Address space qualifier resolver |
sub_91E860 | 31 B | 214 | Data size accessor |
sub_18189C0 | 45.2 KB | -- | Texture/surface instruction printer (largest SASS printer) |
sub_181B370 | 27.8 KB | -- | Multi-operand instruction printer |
sub_1817C50 | 12.8 KB | -- | Texture header index encoder |
Instruction Data Flow
┌──────────────────────────────────┐
│ Ori IR Instruction Object │
│ (instruction data at *(a1+1096)) │
└────────────────┬─────────────────┘
│
┌─────────────────────┼──────────────────────┐
│ │ │
v v v
sub_70B6E0/B700 sub_70B8E0/B910/B920 sub_70CA60/CA70
has_predicate() get_reg_operand(idx) get_operand_type()
get_predicate_name() get_src_part0/1(idx) get_type_suffix()
│ │ │
└─────────────────────┼──────────────────────┘
│
v
┌─────────────────────┐
│ sprintf() chain │
│ into 50 KB buffer │
│ using format table │
│ at a2+offset │
└──────────┬──────────┘
│
v
┌─────────────────────┐
│ strlen → alloc → │
│ strcpy → free temp │
└──────────┬──────────┘
│
v
┌─────────────────────┐
│ Formatted PTX text │
│ string (exact size) │
└─────────────────────┘
Cross-References
- Code Generation Overview -- pipeline context and subsystem map
- SASS Instruction Encoding -- binary encoding format that this subsystem renders
- Mercury Encoder Pipeline -- source of instructions for text generation
- Capsule Mercury & Finalization -- `--self-check` and `--out-sass` integration
- CLI Options -- `--verbose`, `--forcetext`, `--out-sass` flags
- Knobs System -- DUMPIR knob triggering phase 129/130
- Phase Manager -- phase 129/130 registration and execution
SM Architecture Map
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas validates the --gpu-name target against three sorted lookup tables, constructs a profile object with family metadata and a CUDA_ARCH macro value, then populates seven parallel dispatch tables that drive capability checks, code generation factory selection, performance modeling, and occupancy calculation throughout the compiler. The default target is sm_75 (Turing). Every downstream decision -- instruction legality, encoder selection, register file geometry, scheduling latencies -- routes through the profile object built here.
| SM validation | sub_6765E0 (54KB, profile object construction) |
| Capability dispatch | sub_607DB0 (14KB, 7 parallel hash maps) |
| Default target | sub_6784B0 -- returns sm_75 when --gpu-name is omitted |
| Validation tables | 3 bsearch arrays: base (32 entries at unk_1D16220), f (6 entries at unk_1D16160), a (7 entries at unk_1D161C0) |
| Per-SM accessors | sub_609XXX cluster (24 functions, ~1.2KB each) |
| Per-SM intrinsic init | sub_60AXXX cluster (12 functions, ~1KB each) |
| Profile lookup | sub_608D70 (384 bytes, dispatcher registered via sub_42BEC0) |
Per-SM Deep Dives:
- Turing & Ampere (SM 75--88) -- Baseline feature set, codegen factory 24577/28673
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, codegen factory 32768
- Blackwell (SM 100--121) -- tcgen05, arch/family gating, codegen factory 36864
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
Complete SM Table
23 active SM base targets ship in ptxas v13.0.88 (plus 9 legacy and 2 internal/alias entries retained in the validation table for backward compatibility). Each base target optionally has an a (accelerated) and/or an f (feature-reduced) sub-variant. The __CUDA_ARCH column shows the value the __CUDA_ARCH__ predefined macro expands to.
| SM | __CUDA_ARCH | Family | Product | Codegen Factory | Status | Deep Dive |
|---|---|---|---|---|---|---|
| sm_75 | 750 | Turing | TU10x (RTX 20xx) | 24577 | Production | turing-ampere |
| sm_80 | 800 | Ampere | A100 | 28673 | Production | turing-ampere |
| sm_86 | 860 | Ampere | A40/A10/RTX 30xx | 28673 | Production | turing-ampere |
| sm_87 | 870 | Ampere | Orin (Jetson) | 28673 | Production | turing-ampere |
| sm_88 | 880 | Ampere | -- | 28673 | Production | turing-ampere |
| sm_89 | 890 | Ada Lovelace | AD10x (RTX 40xx) / L40S | 28673 | Production | ada-hopper |
| sm_90 / sm_90a | 900 | Hopper | H100 / H200 | 32768 | Production | ada-hopper |
| sm_100 / sm_100a / sm_100f | 1000 | Blackwell | B200 (datacenter) | 36864 | Production | blackwell |
| sm_103 / sm_103a / sm_103f | 1030 | Blackwell Ultra | GB300 (datacenter) | 36864 | Production | blackwell |
| sm_110 / sm_110a / sm_110f | 1100 | Jetson Thor | Thor SoC (auto/robotics) | 36864 | Production | blackwell |
| sm_120 / sm_120a / sm_120f | 1200 | Blackwell (sm120) | RTX 50xx / RTX Pro | 36864 | Production | blackwell |
| sm_121 / sm_121a / sm_121f | 1210 | Blackwell (sm120) | DGX Spark | 36864 | Production | blackwell |
The family name stored in the profile object (from sub_6765E0) uses NVIDIA's internal naming: "Turing", "Ampere", "Hopper", "Blackwell". Ada Lovelace (sm_89) is stored as Ampere-derived internally despite being a distinct microarchitecture. sm_120/121 use "Blackwell" internally despite being a different consumer microarchitecture from sm_100 datacenter Blackwell.
Suffix Semantics
ptxas uses three suffix modes to control forward compatibility. The distinction is critical: it determines which SASS binary a cubin can execute on.
| Suffix | Meaning | Forward Compatibility | Validation Table |
|---|---|---|---|
| (none) | Base feature set | Full forward-compat across generations | unk_1D16220 (32 entries) |
| a (accelerated) | Architecture-locked, advanced features | No forward compat -- locked to specific silicon | unk_1D161C0 (7 entries) |
| f (feature-reduced) | Same-family forward compat only | Forward-compat within family, not across | unk_1D16160 (6 entries) |
The base variant (no suffix) produces SASS that runs on the named architecture and all later ones: sm_80 code runs on sm_86, sm_89, sm_90, sm_100, etc. The a suffix locks the binary to exact silicon: sm_90a code runs only on H100/H200 hardware and will not execute on Blackwell. The f suffix allows forward compatibility within the same family: sm_100f code runs on sm_100 and sm_103 (both Blackwell datacenter) but not on sm_120 (different family).
Compilation rules from help text:
- sm_90a PTX must be compiled to sm_90a SASS (no cross-arch compilation)
- sm_100f PTX can compile to sm_100f or sm_103f SASS (same family)
- sm_100a PTX must compile to sm_100a SASS only
- Base sm_100 PTX compiles to any sm_100+ SASS
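The three compatibility rules above can be modeled as a small predicate. This is an illustrative sketch, not recovered code: the function name, the family grouping table, and the string parsing are our assumptions, built only from the suffix semantics described in this section.

```python
# Hypothetical model of ptxas suffix compatibility rules (illustrative only).
FAMILY = {  # base SM id -> family group, per the SM table above
    100: "blackwell-dc", 103: "blackwell-dc",
    110: "thor", 120: "blackwell-consumer", 121: "blackwell-consumer",
}

def sass_runs_on(built_for: str, hardware: str) -> bool:
    """Can SASS compiled for `built_for` execute on `hardware`?"""
    def parse(target):
        suffix = target[-1] if target[-1] in "af" else ""
        num = int(target.removeprefix("sm_").rstrip("af"))
        return num, suffix
    b_num, b_suf = parse(built_for)
    h_num, _ = parse(hardware)
    if b_suf == "a":   # architecture-locked: exact silicon only
        return b_num == h_num
    if b_suf == "f":   # forward-compat within the same family only
        return FAMILY.get(b_num) == FAMILY.get(h_num) and h_num >= b_num
    return h_num >= b_num   # base: full forward compatibility
```

Under this model, `sass_runs_on("sm_90a", "sm_100")` is false while `sass_runs_on("sm_100f", "sm_103")` is true, matching the help-text rules.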
Sub-Variant Expansion
| Base | a Variant | f Variant | CUDA_ARCH (a) | CUDA_ARCH (f) |
|---|---|---|---|---|
| sm_90 | sm_90a | -- | 90a0 | -- |
| sm_100 | sm_100a | sm_100f | 100a0 | 100f0 |
| sm_101 | sm_101a | sm_101f | -- | -- |
| sm_103 | sm_103a | sm_103f | 103a0 | 103f0 |
| sm_110 | sm_110a | sm_110f | 110a0 | 110f0 |
| sm_120 | sm_120a | sm_120f | 120a0 | 120f0 |
| sm_121 | sm_121a | sm_121f | 121a0 | 121f0 |
sm_75 through sm_89 have no a or f variants. sm_90 has only the a variant (no f). All Blackwell-era targets (sm_100+) have both a and f. sm_101 is a legacy alias for sm_110 (Jetson Thor, original internal designation); it passes validation but is not registered as a profile object, so its CUDA_ARCH values are not populated.
SM Validation Tables
Target name validation uses three sorted arrays searched via bsearch(). The CLI parser extracts the SM string from --gpu-name, strips any suffix, and searches the appropriate table.
Base Table -- unk_1D16220 (32 entries)
Contains all valid base SM names without suffix, sorted by numeric SM ID. Includes legacy architectures no longer supported for active compilation but retained for validation, plus two internal/alias entries. Each entry is 12 bytes: {uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}. The bsearch comparison (sub_484B70) compares the numeric sm_id extracted from the --gpu-name string via sscanf.
sm_10, sm_11, sm_12, sm_13, // Tesla (legacy, PTX 1.0--1.2)
sm_20, sm_21, // Fermi (legacy, PTX 2.0)
sm_30, sm_32, sm_35, sm_37, // Kepler (legacy, PTX 3.0--4.1)
sm_50, sm_52, sm_53, // Maxwell (legacy, PTX 4.0--4.2)
sm_60, sm_61, sm_62, // Pascal (legacy, PTX 5.0)
sm_70, sm_72, // Volta (legacy, PTX 6.0--6.1)
sm_75, // Turing (active, PTX 6.3)
sm_80, sm_82, sm_86, sm_87, sm_88, sm_89, // Ampere/Ada (active, PTX 6.2--7.8)
sm_90, // Hopper (active, PTX 7.8)
sm_100, sm_101, sm_103, sm_110, sm_120, sm_121 // Blackwell (active, PTX 8.6--9.0)
sm_82 (PTX 6.2): Undocumented internal Ampere target. Not registered in sub_6765E0 (no profile object). Serves as the SASS opcode generation boundary (SM82_FIRST/SM82_LAST, opcode indices 172--193). The anomalously low PTX version requirement (6.2 vs sm_80's 7.0) suggests it was an early development target added before PTX ISA versioning was finalized.
sm_101 (PTX 8.6): Original internal designation for Jetson Thor, renamed to sm_110 in a later CUDA release. Both entries coexist in the validation table for backward compatibility with PTX files referencing the old name. sub_6765E0 registers only sm_110; sm_101 is validation-only.
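The lookup described above can be sketched in a few lines, assuming the 12-byte `{uint32 sm_id, uint32 ptx_major, uint32 ptx_minor}` record layout recovered from unk_1D16220. Only three sample entries are shown, not the full 32-entry table; the function name is ours.

```python
import struct
from bisect import bisect_left

# Three sample records packed the way the binary stores them (little-endian).
raw = b"".join(struct.pack("<III", *e) for e in [
    (75, 6, 3),   # sm_75 -> PTX 6.3
    (80, 7, 0),   # sm_80 -> PTX 7.0
    (90, 7, 8),   # sm_90 -> PTX 7.8
])

def lookup(sm_name: str):
    """Extract the numeric id (as the sscanf in sub_484B70 does), then
    binary-search the sorted 12-byte records, bsearch-style."""
    sm_id = int(sm_name.removeprefix("sm_"))
    ids = [struct.unpack_from("<I", raw, i * 12)[0] for i in range(len(raw) // 12)]
    pos = bisect_left(ids, sm_id)
    if pos < len(ids) and ids[pos] == sm_id:
        return struct.unpack_from("<III", raw, pos * 12)  # (sm_id, major, minor)
    return None
```

`lookup("sm_80")` yields `(80, 7, 0)`; an unknown target such as `sm_42` returns `None`, which in ptxas surfaces as a fatal unknown-gpu-name diagnostic.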
Accelerated Table -- unk_1D161C0 (7 entries)
sm_90a, sm_100a, sm_101a, sm_103a, sm_110a, sm_120a, sm_121a
One Hopper entry, six Blackwell entries. sm_101a is the legacy alias for sm_110a (Jetson Thor, original internal designation).
Feature-Reduced Table -- unk_1D16160 (6 entries)
sm_100f, sm_101f, sm_103f, sm_110f, sm_120f, sm_121f
No Hopper entry (sm_90 has no f variant). All Blackwell-era. sm_101f is the legacy alias for sm_110f.
Architecture Registration -- sub_6765E0
This 54KB function constructs profile objects for every SM version. Each profile contains:
| Field | Content | Example (sm_90) |
|---|---|---|
| SM name | "sm_90" | "sm_90" |
| Compute name | "compute_90" | "compute_90" |
| Family name | "Hopper" | "Hopper" |
| CUDA_ARCH macro | Decimal integer | 900 |
| LTO name | "lto_90" | "lto_90" |
| isaClass | Architecture class ID | -- |
The function registers each profile into three hash maps indexed by sm_XX, compute_XX, and lto_XX strings. This allows lookup by any of the three naming conventions used in different contexts (CLI, PTX .target directive, LTO linking).
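A minimal model of that triple-keyed registration (dict-based; the field names and `register` helper are ours, not recovered identifiers):

```python
# One profile object, three lookup keys -- so the CLI, the PTX .target
# directive, and LTO linking can each resolve the same profile under
# their own naming convention.
profiles = {}

def register(sm_id: int, family: str, cuda_arch: int):
    profile = {"sm": f"sm_{sm_id}", "compute": f"compute_{sm_id}",
               "lto": f"lto_{sm_id}", "family": family, "cuda_arch": cuda_arch}
    for key in (profile["sm"], profile["compute"], profile["lto"]):
        profiles[key] = profile
    return profile

register(90, "Hopper", 900)
assert profiles["sm_90"] is profiles["lto_90"]   # same object, different keys
```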
Family assignment from sub_6765E0:
| SM Range | Family String | Notes |
|---|---|---|
| sm_75 | "Turing" | Single entry |
| sm_80, sm_86, sm_87, sm_88 | "Ampere" | Includes sm_88 (undocumented Ada variant) |
| sm_89 | "Ampere" | Ada Lovelace stored as Ampere-derived internally |
| sm_90/90a | "Hopper" | Single silicon, two feature levels |
| sm_100/100a/100f | "Blackwell" | Datacenter B200 |
| sm_103/103a/103f | "Blackwell" | Blackwell Ultra (GB300) |
| sm_110/110a/110f | "Blackwell" | Jetson Thor -- same family string despite different product |
| sm_120/120a/120f | "Blackwell" | Consumer/enterprise (RTX 50xx) -- different uarch, same string |
| sm_121/121a/121f | "Blackwell" | DGX Spark |
All sm_100 through sm_121 share the "Blackwell" family string internally, even though sm_110 (Jetson Thor) and sm_120 (consumer RTX 50xx) are distinct microarchitectures. The compiler distinguishes them through the capability dispatch tables, not through family name.
Capability Dispatch -- sub_607DB0
The capability dispatch initializer builds 7 parallel hash maps at initialization time, protected by a once-guard (byte_29FE1D8). Each map indexes sm_XX / compute_XX strings to per-architecture values or handler functions. Error recovery uses setjmp/longjmp.
| Map | Global | Purpose | Value Type |
|---|---|---|---|
| 1 | qword_29FE1D0 | Handler A (primary codegen) | Function pointer |
| 2 | qword_29FE1C8 | Handler B (secondary codegen) | Function pointer |
| 3 | qword_29FE1C0 | Intrinsic table initializer | Function pointer |
| 4 | qword_29FE1B8 | Capability flags | Byte value |
| 5 | qword_29FE1B0 | Profile registration | Registered via sub_42BEC0 |
| 6 | qword_29FE1A8 | Perf-stats / occupancy handler E | Function pointer |
| 7 | qword_29FE1A0 | Perf-stats / occupancy handler F | Function pointer |
Handler Function Assignments
Each SM version registers its own handler functions into these maps. Functions within the same suffix group (e.g., sm_100/100a/100f) share all handlers -- they are the same silicon with different feature exposure.
Map 1 -- Handler A (per SM):
| SM | Handler A | SM | Handler A |
|---|---|---|---|
| sm_75 | sub_609B70 | sm_100 | sub_609C30 |
| sm_80 | sub_609CC0 | sm_110 | sub_609F30 |
| sm_86 | sub_609D50 | sm_103 | sub_608F20 |
| sm_87 | sub_609F00 | sm_120 | sub_609E40 |
| sm_88 | sub_609E70 | sm_121 | sub_609ED0 |
| sm_89 | sub_609E10 | | |
| sm_90 | sub_609DB0 | | |
Map 2 -- Handler B (per SM):
| SM | Handler B | SM | Handler B |
|---|---|---|---|
| sm_75 | sub_609B40 | sm_100 | sub_609BD0 |
| sm_80 | sub_609C90 | sm_110 | sub_608F50 |
| sm_86 | sub_609D80 | sm_103 | sub_609D20 |
| sm_87 | sub_609DE0 | sm_120 | sub_609C60 |
| sm_88 | sub_609EA0 | sm_121 | sub_609BA0 |
| sm_89 | sub_609CF0 | | |
| sm_90 | sub_609C00 | | |
Map 3 -- Intrinsic table initializer (per SM):
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | | |
| sm_90 | sub_60A5F0 | | |
Shared Handler Groups
Sub-variants within a base SM share all handler functions, confirming they are identical silicon:
| Group | Members | Shared Handlers |
|---|---|---|
| Hopper | sm_90, sm_90a | All 7 maps |
| Blackwell DC | sm_100, sm_100a, sm_100f | All 7 maps |
| Blackwell Ultra | sm_103, sm_103a, sm_103f | All 7 maps |
| Jetson Thor | sm_110, sm_110a, sm_110f | All 7 maps |
| Consumer | sm_120, sm_120a, sm_120f | All 7 maps |
| DGX Spark | sm_121, sm_121a, sm_121f | All 7 maps |
Codegen Factory Values
The profile object stores an encoded architecture identifier at a known offset (visible as field +348 on the profile pointer chain, e.g., *(_QWORD *)(a1+1584)+348). This value is compared throughout the compiler to gate features:
| Codegen Factory | SM Range | SASS ISA Generation |
|---|---|---|
| 24577 | sm_75 | Turing (SM 7.5) |
| 28673 | sm_80 -- sm_89 | Ampere / Ada (SM 8.x) |
| 32768 | sm_90 | Hopper (SM 9.0) |
| 36864 | sm_100 -- sm_121 | Blackwell (SM 10.x -- 12.x) |
These values appear in feature-gating checks. For example, FMA/DFMA combining in the peephole optimizer checks profile[+372] > 28673 to require sm_70+ capability. The exact encoding formula is (isa_generation << 12) | variant, where the high bits identify the SASS instruction set generation.
Related encoded values seen in the binary:
12288 = sm_30 (Kepler) // 3 << 12
16385 = sm_50 (Maxwell) // 4 << 12 | 1
20481 = sm_50 alt (Maxwell) // 5 << 12 | 1
24576 = sm_60 (Pascal) // 6 << 12
24577 = sm_75 (Turing) // 6 << 12 | 1
28673 = sm_80 (Ampere) // 7 << 12 | 1
28674-28677 = sm_86/87/88/89 // 7 << 12 | 2..5
32768 = sm_90 (Hopper) // 8 << 12
36864 = sm_100 (Blackwell) // 9 << 12
36865-36869 = sm_103..121 // 9 << 12 | 1..5
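The `(isa_generation << 12) | variant` encoding above inverts trivially; a sketch (the helper name is ours):

```python
# Decode a codegen factory value into (SASS ISA generation, variant),
# per the (isa_generation << 12) | variant formula described above.
def decode_factory(value: int) -> tuple[int, int]:
    return value >> 12, value & 0xFFF

assert decode_factory(24577) == (6, 1)   # sm_75 (Turing)
assert decode_factory(28677) == (7, 5)   # sm_89 (Ada)
assert decode_factory(32768) == (8, 0)   # sm_90 (Hopper)
assert decode_factory(36869) == (9, 5)   # top of the sm_103..121 range
```

This also makes the range checks in the scheduler geometry table below cheap: comparing raw factory values against generation boundaries (e.g. `> 36863`) is equivalent to comparing the high nibble.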
Hardware Resource Geometry
ptxas assembles per-SM hardware parameters from three data sources: sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator properties). These parameters control register allocation limits, shared memory partitioning, occupancy calculations, and scheduling decisions throughout the compiler.
Universal Constants (sub_8688F0)
sub_8688F0 sets the baseline hardware profile shared by all SM 75+ targets. These values are architecture-invariant within the ptxas v13.0.88 binary:
| Parameter | Value | Binary Evidence | Profile Offset |
|---|---|---|---|
| Warp size | 32 threads | *(a1+1472) = 32 | +1472 |
| Max registers per thread | 255 | *(a1+612) = 0xFF0000003F | +612 |
| Register file per SM | 65,536 x 32-bit | Derived: max_warps = 65536 / (regcount * 32) | -- |
| Dependency barriers per warp | 6 | *(a1+604) = 6 | +604 |
| Named barriers per CTA | 16 | barrier_arrive_0 through barrier_arrive_15 intrinsics | -- |
| Static shared memory base | 48 KB (49,152 B) | *(a1+1484) = 49152 | +1484 |
| Shared memory config base | 1 MB (1,048,576 B) | *(v6+344) = 0x100000 in all per-SM inits | profile +344 |
The register file size of 65,536 registers is confirmed by the EIATTR_REGCOUNT formula (code 0x2F): max_warps_per_SM = total_registers / (regcount * warp_size), and by explicit reference in codegen/templates.md ("the entire physical register file is 65,536 32-bit registers shared across all active warps").
Per-SM Resource Geometry Table
Combines binary evidence (sub_8E4400 scheduling profile, sub_8688F0 baseline, sub_ABF250 occupancy properties, sub_60AXXX per-SM initializers) with NVIDIA public specifications for parameters not stored as scalar constants in the binary. Confidence column rates how directly the value was extracted from the binary vs. inferred from public documentation.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
| sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
| sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
| sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (`sub_8688F0` offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block). Not stored as a ptxas constant; derived from `warps_per_SM * warp_size / max_CTAs`.
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From `sub_8E4400` offset +18 (packed DWORD) and offset +22 (WORD). The scheduler partition count is the number of warp scheduler units; dispatch slots is the total scheduling capacity.
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by `cudaFuncSetAttribute`. Stored as pointer-to-table at profile offsets +1488/+1496; sm_75 has 3 entries, later architectures have more.
sm_88 note: No known product ships on sm_88. It shares all handler functions with sm_86. Listed parameters are inherited; actual hardware behavior is unverifiable.
Scheduler Partition Geometry (sub_8E4400 Detail)
The packed DWORD at offset +18 of the warp-level profile encodes scheduler partition counts. The WORD at offset +22 is the dispatch slot count -- a scheduling capacity value distinct from the raw warp count.
| Codegen Factory Range | Packed DWORD | Hex | Partitions | Dispatch Slots | SM Era |
|---|---|---|---|---|---|
| <= 20479 | 458,759 | 0x00070007 | 7 | 96 | sm_50 (Maxwell) |
| 20480 -- 24575 | 786,444 | 0x000C000C | 12 | 176 | sm_60 (Pascal) |
| 24576 -- 28672 | 851,981 | 0x000D000D | 13 | 192 | sm_70 (Volta) |
| 28673 -- 32767 | 917,518 | 0x000E000E | 14 | 208 | sm_75 -- sm_89 |
| 32768 -- 36863 | 983,055 | 0x000F000F | 15 | 224 | sm_90 (Hopper) |
| > 36863 | 1,048,592 | 0x00100010 | 16 | 240 | sm_100+ (Blackwell) |
The dispatch slot count increases monotonically across generations, reflecting wider scheduling capacity. All sm_75 through sm_89 targets (Turing, Ampere, Ada Lovelace) share identical scheduling partition geometry despite their hardware differences -- the differentiation occurs in the per-SM latency tables, not in the partition structure.
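Assuming the field layout described above (two mirrored 16-bit fields in the DWORD at +18, and the dispatch slot count as a separate WORD at +22), the unpacking looks like this. Field and function names are ours:

```python
# Unpack the sub_8E4400 scheduler geometry fields (model, not recovered code).
def unpack_sched(packed_dword: int, slot_word: int):
    lo = packed_dword & 0xFFFF           # partition count (low half)
    hi = (packed_dword >> 16) & 0xFFFF   # mirrors lo in every observed value
    return {"partitions": lo, "partitions_hi": hi, "dispatch_slots": slot_word}

g = unpack_sched(0x000E000E, 208)        # sm_75 -- sm_89 row from the table
assert g["partitions"] == 14 and g["dispatch_slots"] == 208
```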
Shared Memory Configuration Tables
ptxas stores configurable shared memory sizes as a pointer + count pair at profile offsets +1488 and +1496. The driver uses this table to validate cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize, ...) calls.
sub_8688F0 sets the sm_75 configuration:
*(a1+1488) = &unk_21D9168 // pointer to shared memory size table
*(a1+1496) = 3 // 3 valid configurations
For sm_75 (Turing), the 3 entries correspond to 32 KB, 48 KB, and 64 KB configurable shared memory. The L1/shared partitioning on Turing splits the 96 KB unified data cache between L1 and shared memory.
For sm_80 (Ampere), the configurable shared memory extends to 164 KB, reflecting the larger combined shared memory/L1 capacity. sub_ABF250 records the maximum as 167,936 bytes (164 KB) for the base sm_60 path and 233,472 bytes (228 KB) for sm_70+ paths, though these values are encoded as xmmword constants that depend on the specific SM variant.
For sm_90+ (Hopper, Blackwell), sub_ABF250 populates a maximum configurable value of 233,472 bytes (228 KB), supporting the opt-in extended shared memory mode added in Hopper.
Register Allocation Mechanics
ptxas allocates registers in units determined by the register allocation granularity stored in sub_ABF250:
| SM Generation | Alloc Granularity | a2[6][1] | a2[6][2] | Notes |
|---|---|---|---|---|
| sm_30 -- sm_60 | 64 registers / warp | 63 | 1 | Allocates in blocks of 2 regs/thread |
| sm_70+ | 256 registers / warp | 255 | 2 | Allocates in blocks of 8 regs/thread |
The register allocation unit directly affects occupancy. With 256-register granularity on sm_75+, a kernel using 33 registers effectively consumes 40 (rounded up to the next multiple of 8), which means each warp uses 40 * 32 = 1280 of the 65,536 available registers, allowing up to 51 warps -- but capped by the hardware limit of 32 warps on sm_75.
The formula the GPU driver uses (from EIATTR_REGCOUNT documentation):
effective_regs = ceil(regcount / alloc_granularity) * alloc_granularity
regs_per_warp = effective_regs * warp_size
max_warps = min(registers_per_SM / regs_per_warp, hw_max_warps)
max_CTAs = min(max_warps / warps_per_CTA, hw_max_CTAs)
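The register-pressure half of that formula transcribes directly; a sketch using the sm_75 parameters from the resource geometry table (the function name and defaults are ours):

```python
import math

def max_warps(regcount, alloc_granularity=8, warp_size=32,
              registers_per_SM=65536, hw_max_warps=32):
    # Round up to the allocation granularity, then divide the register file.
    effective_regs = math.ceil(regcount / alloc_granularity) * alloc_granularity
    regs_per_warp = effective_regs * warp_size
    return min(registers_per_SM // regs_per_warp, hw_max_warps)

# The 33-register example: rounds up to 40 regs/thread -> 1,280 regs/warp
# -> 51 warps by register pressure, capped at sm_75's 32-warp hardware limit.
assert max_warps(33) == 32
assert max_warps(33, hw_max_warps=64) == 51
```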
SM Version Encoding
The raw SM version number stored in profile objects and code object headers uses a packed integer format. This is the value at v4[93] in the code object builder (sub_A465F0):
| Encoded Value | SM Target | Code Object Version | Max Threads/CTA |
|---|---|---|---|
| 12288 | sm_30 | 0x70007 | 96 |
| 20481 | sm_50 | 0xC000C | 176 |
| 24576 | sm_60 | -- | -- |
| 28673 | sm_80 | -- | -- |
| 36864 | sm_90 | 0x100010 | 240 |
The code object builder (sub_A465F0 at 0xA465F0) maps these encoded SM versions to ELF code object version fields and thread-per-CTA limits. The magic number 0x16375564E is written at offset 0 of every code object header, with the SM version at offset +8.
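A sketch of that header prologue as bytes. The magic and the +0/+8 offsets are from the analysis above; emitting both fields as 64-bit little-endian values is an assumption (the magic needs more than 32 bits, but its exact field width was not recovered):

```python
import struct

def build_header(encoded_sm: int) -> bytes:
    # Magic at offset 0; packed SM version at offset +8.
    # 64-bit little-endian widths are assumed, not confirmed.
    return struct.pack("<QQ", 0x16375564E, encoded_sm)

hdr = build_header(36864)   # encoded sm_100 value from the table above
assert hdr[8:16] == (36864).to_bytes(8, "little")
```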
Per-SM Capability Accessors -- sub_609XXX
The 24 functions in the sub_609XXX cluster (range 0x609280--0x609F60, ~1.2KB each) are the per-SM-version capability accessor functions. They are registered into Maps 1 and 2 of the dispatch tables and return architecture-specific values: register file sizes, feature flags, warp geometry, shared memory limits, and similar hardware parameters.
These are the functions that downstream code calls (through the dispatch table) to answer questions like "how many registers does this SM have?" or "is feature X available on this target?"
Profile Layering
Three levels of SM profile information cooperate:
Level 1: sub_607DB0 // Capability dispatch (7 hash maps)
| // -> feature flags, handler functions
v
Level 2: sub_6765E0 // Profile objects (name, family, CUDA_ARCH, lto)
| // -> identity metadata, isaClass
v
Level 3: sub_609XXX / sub_60AXXX // Per-SM accessor functions
// -> concrete hardware parameter values
Level 1 provides the dispatch infrastructure. Level 2 provides identity metadata for diagnostics and linking. Level 3 provides the actual numeric values that drive register allocation, scheduling, and instruction legality.
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for ptxas v13.0.88, returned by sub_6784B0 when no --gpu-name is specified. Codegen factory value: 24577.
- Base tensor core (WMMA m16n16k16 f16/f32, m32n8k16, m8n32k16)
- Integer MMA (IMMA int8/int4), binary MMA (BMMA b1)
- Base warp-level operations (shfl.sync, vote.sync, match.sync, redux.sync)
- Named barrier support (bar 0-15 with arrive/red/sync variants)
Ampere (sm_80 -- sm_88)
Codegen factory value: 28673. Shared with Ada Lovelace (sm_89).
- Extended tensor core: TF32, BF16, FP64 MMA shapes
- `cp.async` for asynchronous shared memory copies
- L2 cache hints on atomic operations
- `createpolicy` instructions for cache management
- 14 additional intrinsics (`__cuda_sm80_*`: bf16/tf32/s4/s8/b1 MMA, createpolicy)
Ada Lovelace (sm_89)
Codegen factory value: 28673 (same as Ampere). Stored as "Ampere" internally despite being a distinct Ada Lovelace microarchitecture.
- Same codegen path as Ampere; differentiated through capability flags, not codegen factory
- 39 additional MMA intrinsics (`__cuda_sm_8x_mma_*`)
Hopper (sm_90 / sm_90a)
Codegen factory value: 32768. sm_90a is architecture-locked (H100/H200 only).
- WGMMA (warpgroup MMA async): `wgmma.mma_async`, `wgmma.fence`, `wgmma.commit_group`, `wgmma.wait_group`
- Cluster operations: `barrier.cluster.arrive/wait`, distributed shared memory
- `setmaxnreg`: Dynamic register allocation limit
- Cluster special registers: `%clusterid`, `%cluster_ctaid`, `%cluster_ctarank`, etc.
- 38 sub-byte MMA intrinsics (`__cuda_sm_9x_mma_sub_byte_internal_*`: s4/u4 sparse)
Blackwell Datacenter (sm_100, sm_103)
Codegen factory value: 36864. Both a and f sub-variants available.
- tcgen05: 5th-generation tensor core ISA (alloc, dealloc, ld, st, commit, cp, shift, mma) -- `a`/`f` sub-variants only
- tcgen05 guardrails: 8 debug validation functions (phase validity, column allocation, bounds checking)
- Extended MMA: 10 Blackwell-specific hmma/imma + bit MMA intrinsics (`__cuda_sm_10x_*`)
- 11 tcgen05 guardrail trap intrinsics (`__cuda_sm10x_tcgen05_guardrail_trap_*`)
- 18 sm_1xx bulk copy intrinsics (`__cuda_sm1xx_*`: cp.async.bulk.tensor 1D-5D tile/im2col)
Jetson Thor (sm_110)
Codegen factory value: 36864 (same as sm_100). Originally sm_101 before rename. Automotive/robotics SoC.
- Retains full tcgen05/TMEM hardware on `a`/`f` sub-variants
- Same Blackwell datacenter feature set for tensor operations
- Differentiated through capability flags for SoC-specific constraints
Blackwell Consumer (sm_120, sm_121)
Codegen factory value: 36864 (same as sm_100). Architecturally a distinct consumer microarchitecture despite sharing the "Blackwell" family string.
- No tcgen05: The entire tcgen05 ISA is absent on sm_120/121 -- gated by SM version checks
- Tensor core falls back to HMMA/IMMA/WGMMA inherited from sm_70--sm_90 path
- sm_120 = RTX 50xx consumer / RTX Blackwell Pro (enterprise)
- sm_121 = DGX Spark
Diagnostic Strings
| String | Context | Function |
|---|---|---|
"Turing" | Family name in profile object | sub_6765E0 |
"Ampere" | Family name in profile object | sub_6765E0 |
"Hopper" | Family name in profile object | sub_6765E0 |
"Blackwell" | Family name in profile object | sub_6765E0 |
"isaClass" | Architecture class reference on profile | sub_6765E0 |
"sm_%d" | SM name formatting | Multiple |
"compute_%d" | Compute name formatting | sub_6765E0 |
"lto_%d" | LTO name formatting | sub_6765E0 |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_607DB0 | 14KB | SM capability dispatch -- builds 7 hash maps | 99% |
| sub_608D70 | 384B | Profile lookup dispatcher | 80% |
| sub_608DF0 | ~1KB | sm_120 intrinsic table initializer | 85% |
| sub_608F20 | ~1.2KB | sm_103 handler A (capability accessor) | 90% |
| sub_608F50 | ~1.2KB | sm_110 handler B (capability accessor) | 90% |
| sub_609280--sub_609F60 | ~1.2KB each | 24 per-SM capability accessor functions (Maps 1+2) | 90% |
| sub_609F60 | 2.8KB | `lds128convert` option handler ("always", "nonconst", "never") | 90% |
| sub_60A2E0--sub_60AD30 | ~1KB each | 12 per-SM intrinsic table initializers (Map 3) | 85% |
| sub_60B040 | 4.5KB | Stress test options ("stress-maxrregcount", etc.) | 85% |
| sub_6765E0 | 54KB | SM profile object construction (family, CUDA_ARCH, lto) | 95% |
| sub_6784B0 | -- | Default architecture -- returns sm_75 | 99% |
| sub_8688F0 | 31 lines | Universal HW profile baseline (warp size, regs, barriers, shmem) | 95% |
| sub_8E4400 | 3.3KB | Warp-level HW profile: scheduler partitions, dispatch slots | 95% |
| sub_ABF250 | ~600B | Occupancy property table: configurable shmem, reg alloc granularity | 90% |
| sub_A95DC0 | ~1.8KB | Extended HW profile: architecture-specific shmem config | 85% |
| sub_A465F0 | 14KB | Code object header builder (SM version -> ELF fields) | 88% |
Profile Object Layout (1936 bytes)
Every SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object that carries target-specific parameters throughout the compiler. This is the compilation unit's target descriptor -- the single structure that downstream code reads to answer "what hardware am I compiling for?"
Construction Sequence
1. sub_71BDE0(1936, a1) heap allocate 1936 bytes
2. sub_C1B7A0(profile) zero-fill + structural defaults (8 SSE blocks, 5 scalars)
3. sub_917990(a3) overlay: codegen factory default, tail constants
4. sub_60AXXX(a1,a2,a3,a4) per-SM: codegen factory, shmem base, capability flags
Key Fields -- Explicitly Initialized
These fields receive non-zero values during construction. Offsets are byte offsets from the profile object base pointer. Type column: D=DWORD(4B), Q=QWORD(8B), O=OWORD(16B), B=BYTE.
| Offset | Type | Default | Set By | Semantic Name | Confidence |
|---|---|---|---|---|---|
| +0 | Q | 0 | sub_C1B7A0 | object_base -- zeroed, likely vtable/class pointer | 75% |
| +112 | Q | 0x500000000 | sub_C1B7A0 | packed_config -- stores DWORD 5 at +112, DWORD 0 at +116 | 85% |
| +120 | D | 5 | sub_C1B7A0 | opt_level_default -- initial optimization level or block dimension | 85% |
| +132 | Q | 0xFFFFFFFF | sub_C1B7A0 | max_register_limit -- -1 sentinel = "no limit" | 85% |
| +340 | D | 1 | sub_C1B7A0 | enable_flag_A -- default-enabled capability | 85% |
| +344 | D | 0x100000 | per-SM init | shared_memory_config_base -- 1 MB for all SM 75+ targets | 95% |
| +348 | D | per-SM | per-SM init | codegen_factory -- ISA generation encoding ((gen << 12) \| variant) | 99% |
| +424 | Q | 0x100000000 | sub_C1B7A0 | packed_enable -- stores DWORD 1 at +424, DWORD 0 at +428 | 90% |
| +428 | D | 0 (cond.) | per-SM init | conditional_feature_flag -- sm_90+ only: set to 0 when *(a2+355) is true | 85% |
| +432 | D | computed | per-SM init | module_base_address -- callback() - 0x6FFFFE64 or -1 if disabled | 95% |
| +588 | D | 0 | sub_917990 | cleared_field -- explicitly re-zeroed; used in 10+ consumer functions | 90% |
| +708 | D | 1 | per-SM init | enable_flag_D -- universally set to 1 by all per-SM initializers | 85% |
| +944 | D | 4 | sub_C1B7A0 | pipeline_depth -- possibly barrier count or pipeline stage limit | 85% |
| +1200 | Q | "NVIDIA" | sub_43A400 | vendor_string_ptr -- pointer to vendor identification string | 95% |
| +1208 | Q | (pointer) | sub_43A400 | associated_data_ptr -- assigned from callback result | 90% |
| +1216 | D | 1 | sub_43A400 | vendor_flag -- set to 1 during ELF builder initialization | 85% |
| +1385 | B | 0 (bits) | runtime | scheduling_feature_flags -- bitfield, 21+ consumer sites | 99% |
| +1536 | Q | 1832 | sub_C1B7A0 | dynamic_region_offset -- points to tail SSE constant region start | 90% |
| +1552 | Q | 0 | runtime | pipeline_progress -- monotonically increasing counter (values 0--21); scoreboard guards check 16--19 | 95% |
| +1584 | Q | nullsub_856 | sub_C1B7A0 | sm_backend_vtable_ptr -- THE central polymorphic pointer; initialized to null stub | 99% |
| +1684 | D | CLI value | per-SM init | cli_option_value -- *(a1+108) passthrough from compiler driver | 90% |
| +1840 | D | 1 | per-SM init | elf_section_data -- initially 1 (enable), later overwritten with ptr | 85% |
| +1880 | Q | 1 | per-SM init | barrier_tracking_ptr -- initially 1, later pointer to scoreboard data | 95% |
| +1892 | D | 2 | sub_917990 | tail_mode_value -- possibly versioning or encoding mode indicator | 85% |
| +1912 | D | 0 (cond.) | per-SM init | conditional_clear -- cleared when *(a2+233) is true (debug mode) | 85% |
| +1928 | D | 1 | per-SM init | output_config_value -- compilation output configuration | 85% |
SSE Constant Blocks
10 blocks of 16 bytes each are loaded from .rodata segments via _mm_load_si128. These likely contain per-register-class sizing parameters, pipeline configuration constants, or default opcode table pointers. Exact values require .rodata dump.
| Offset | Source | Set By |
|---|---|---|
| +184 | xmmword_20206F0 | sub_C1B7A0 |
| +280 | xmmword_2027950 | sub_C1B7A0 |
| +312 | xmmword_2027600 | sub_C1B7A0 |
| +680 | xmmword_2027620 | sub_C1B7A0 |
| +696 | xmmword_22B4ED0 | sub_C1B7A0 |
| +740 | xmmword_22B4EE0 | sub_C1B7A0 |
| +788 | xmmword_22B4EF0 | sub_C1B7A0 |
| +816 | xmmword_22B4F00 | sub_C1B7A0 |
| +1832 | xmmword_21DEBA0 | sub_917990 |
| +1908 | xmmword_21DEBB0 | sub_917990 |
Scheduling Feature Flags (+1385 bitfield)
The byte at offset +1385 is the most heavily accessed bitfield on the profile object (21+ consumer sites in 15+ decompiled functions). Each bit gates a scheduling or codegen behavior.
| Bit | Mask | Meaning | Evidence |
|---|---|---|---|
| 0 | 0x01 | Function has sync barriers | sub_792CD0 sets/clears; sub_75F680, sub_75F580 check |
| 1 | 0x02 | (unknown) | -- |
| 2 | 0x04 | Extended barrier model | sub_796D60 checks jointly with +1382 & 0x20 |
| 3 | 0x08 | Scoreboard tracking enabled | sub_925670, sub_925510, sub_9253C0 check jointly with +1880 and +1552 |
| 4 | 0x10 | (unknown) | -- |
| 5 | 0x20 | Scheduling feature flag | sub_793220, sub_A36360 (scoreboard encoder) check |
| 6 | 0x40 | Temporary analysis flag | sub_752E40 sets; sub_77F0D0 clears |
| 7 | 0x80 | Preserved across resets | sub_7F7DC0: *(a1+1385) &= 0x80 (all others cleared) |
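The set/clear/reset patterns from those consumer sites, written out as plain bit operations (mask names follow the table; this is a model of the decompiled accesses, not recovered code):

```python
# Masks for the +1385 scheduling feature bitfield (names are ours).
SYNC_BARRIERS, EXT_BARRIER, SCOREBOARD = 0x01, 0x04, 0x08
SCHED_FEATURE, TEMP_ANALYSIS, PRESERVED = 0x20, 0x40, 0x80

flags = 0
flags |= SYNC_BARRIERS | TEMP_ANALYSIS | PRESERVED  # e.g. sub_792CD0 / sub_752E40 set bits
flags &= ~TEMP_ANALYSIS                             # sub_77F0D0 clears bit 6
flags &= PRESERVED                                  # sub_7F7DC0 reset: only bit 7 survives
assert flags == 0x80
```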
Per-SM Initializer Differences
All 12 per-SM initializers (one per SM family) are structurally identical. Only two fields differ between families.
| Field | sm_75--sm_88 | sm_89 | sm_90 | sm_100--sm_121 |
|---|---|---|---|---|
| +348 codegen_factory | 24577--28676 | 28677 | 0x8000 (32768) | 36864--36869 |
| +428 conditional_feature_flag | not written | not written | written (if *(a2+355)) | written (if *(a2+355)) |
All other fields (+344, +432, +588, +708, +1684, +1840, +1880, +1912, +1928) are set identically across all SM families.
Critical Distinction: Profile Object vs SM Backend
The profile object (1936 bytes, this layout) is the compilation unit's target descriptor, stored somewhere in the compilation context. The pointer at context+1584 (sm_backend_vtable_ptr) points to a separate polymorphic SM backend object -- not to another profile object. Fields accessed as *(*(ctx+1584) + N) are on the SM backend, not on this profile object.
Commonly accessed SM backend fields (NOT on the 1936-byte profile):
| SM Backend Offset | Semantic | Consumer Count |
|---|---|---|
| +12 | arch_class_id (4=Maxwell, 10=Grace, 11=NVLink-Volta+) | 3+ |
| +372 | codegen_factory (THE feature-gating value, 37+ sites) | 37+ |
| +1037 | hw_capability_flags bitfield (SFU precision, barriers) | 20+ |
| +1312 | vtable dispatch for predicate marking capability | 2+ |
| +1320 | vtable function slot for optimization dispatch | 2+ |
| +1417 | Grace/ARM architecture flag (bit 7) | 1+ |
| +1418 | NVLink-capable flag (bit 0) | 1+ |
The profile's own +348 stores the codegen factory value at construction time. The SM backend's +372 is where all downstream code reads it. These are the same numeric value stored in two different objects.
Object Region Map
| Offset Range | Size | Content |
|---|---|---|
| [0..111] | 112B | Object header, vtable chain, zeroed bulk |
| [112..159] | 48B | Config scalars (opt level, register limit) |
| [160..343] | 184B | Zeroed + 3 SSE constant blocks (184, 280, 312) |
| [344..587] | 244B | Target identity (shmem base, codegen factory, enable flags) |
| [588..879] | 292B | Capability fields, 4 SSE constant blocks (680, 696, 740, 788) |
| [880..1063] | 184B | Scheduling config, barrier config (+944=4 default) |
| [1064..1279] | 216B | Extended config, vendor metadata (+1200="NVIDIA") |
| [1280..1535] | 256B | Architecture feature flags (+1385 bitfield, +1417/+1418) |
| [1536..1663] | 128B | Dynamic region pointer, phase counter, sm_backend ptr (+1584) |
| [1664..1831] | 168B | CLI passthrough, ELF section data, barrier tracking ptr |
| [1832..1935] | 104B | Tail region: 2 SSE constant blocks, mode value (+1892=2) |
Cross-References
- Turing & Ampere (SM 75--88) -- Detailed feature flags for sm_75 through sm_89
- Ada & Hopper (SM 89--90a) -- WGMMA, cluster operations, sm_90a arch-lock
- Blackwell (SM 100--121) -- tcgen05, arch-conditional vs family-conditional gating
- TCGen05 -- 5th Gen Tensor Cores -- tcgen05 instruction set detail
- Intrinsic Table (608 Entries) -- Master intrinsic catalog with per-SM generation ranges
- Tensor Core Intrinsics -- WMMA/MMA/tcgen05 intrinsic lowering
- Latency Model & HW Profiles -- Per-SM scheduling parameters
- SASS Instruction Encoding -- Codegen factory -> encoder selection
- Mercury Encoder -- SM-dependent SASS encoding
- CLI Options -- `--gpu-name` parsing and default target
- Data Structures -- context+1584 polymorphic pointer documentation
Turing & Ampere (SM 75--88)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
SM 75 through SM 88 span two microarchitecture generations that ptxas treats as a contiguous feature band. sm_75 (Turing) is the default target for ptxas v13.0.88 -- when --gpu-name is omitted, sub_6784B0 returns sm_75. The Ampere targets (sm_80, sm_86, sm_87, sm_88) share generation-7 SASS encoding and add incremental tensor core and async-copy capabilities. sm_89 (Ada Lovelace) is architecturally Ampere-derived internally but is covered in Ada & Hopper because it bridges to sm_90 features.
| SM targets | sm_75, sm_80, sm_86, sm_87, sm_88 (+ sm_82 validation-only) |
| Codegen factory range | 24577--28676 |
| ISA generation | 6 (Turing), 7 (Ampere) |
| Encoding format | 128-bit per-instruction control word |
| Scheduler profile | 7 warps, 208 dispatch slots |
| Family strings | "Turing" (sm_75), "Ampere" (sm_80--88) |
| Sub-variants | None (no a or f suffixes) |
| Profile object size | 1,936 bytes (allocated by sub_917990) |
SM Version Table
| SM | Product | Family | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|---|
| sm_75 | TU10x (RTX 20xx, Quadro RTX) | Turing | 750 | 24577 | 0x6001 | 1 (gen 6) |
| sm_80 | GA100 (A100, A30) | Ampere | 800 | 28673 | 0x7001 | 1 (gen 7) |
| sm_86 | GA10x (A40, A10, RTX 30xx) | Ampere | 860 | 28674 | 0x7002 | 2 |
| sm_87 | GA10B (Jetson Orin) | Ampere | 870 | 28675 | 0x7003 | 3 |
| sm_88 | -- (undocumented) | Ampere | 880 | 28676 | 0x7004 | 4 |
Codegen factory encoding: (isa_generation << 12) | sub_variant. Turing is generation 6; Ampere is generation 7. The sub-variant distinguishes silicon cut within a generation. sm_75 and Pascal sm_60 share generation 6 (sm_60 = 24576 = 0x6000), differentiated by sub-variant 0 vs 1.
sm_88 note: Registered in ptxas with CUDA_ARCH=880 and codegen factory 28676, but no public product ships on this SM. It may represent an unreleased Ampere derivative or internal test target.
SM 82 -- Internal Ampere Target
sm_82 is an undocumented internal Ampere target present in the base validation table (unk_1D16220, entry [20]) but not registered in the profile constructor sub_6765E0. It has no capability dispatch handler, no profile object, and no handler functions in any of the 7 hash maps. It exists in ptxas solely as a validation table entry and as the SASS opcode generation boundary.
| Validation table entry | {82, 6, 2} -- sm_82, PTX 6.2 |
| PTX ISA requirement | 6.2 (anomalously low -- see below) |
| Profile object | None -- not registered in sub_6765E0 |
| Capability handlers | None -- not registered in sub_607DB0 |
| SASS opcode role | SM82_FIRST (index 172) through SM82_LAST (index 193) |
PTX 6.2 Anomaly
sm_82 requires PTX ISA version 6.2, which is lower than both its neighbors:
| SM | PTX ISA | CUDA Toolkit |
|---|---|---|
| sm_75 | 6.3 | CUDA 10.0 |
| sm_80 | 7.0 | CUDA 11.0 |
| sm_82 | 6.2 | -- |
| sm_86 | 7.1 | CUDA 11.1 |
PTX 6.2 corresponds to CUDA 10.1 (Turing era). This backward version number strongly suggests sm_82 was created as an early Ampere development target -- a PTX-level placeholder added before the Ampere PTX ISA (7.0) was defined. The validation table entry was never removed, but no profile object was ever created for it.
SASS Opcode Boundary Role
sm_82's primary significance in ptxas is as the opcode generation boundary for Ampere SASS instructions. The opcode hierarchy uses SM-number-based range labels:
SM82_FIRST = index 172 (first Ampere-era SASS opcode)
SM82_LAST = index 193 (last opcode in the sm_82 range)
These 22 opcode slots (indices 172--193) cover the core Ampere SASS additions:
| Opcodes | Category |
|---|---|
| GATHER, GENMETADATA, SPMETADATA | Sparse MMA infrastructure |
| BMMA_88128, BMMA_168128, BMMA_168256 | Binary tensor core MMA shapes |
| DMMA | FP64 tensor core MMA (re-introduced at index 215 for Hopper) |
| HMMA_SP_1688, HFMA2_MMA, HMNMX2 | FP16 sparse/packed operations |
| IMMA_88, IMMA_SP_88, IMMA_16816, IMMA_16832, IMMA_SP_16832 | Integer tensor core MMA shapes |
| ARRIVES, LDGDEPBAR, LDGSTS | Async copy and barrier infrastructure |
| REDUX | Warp-wide reduction |
| CLMAD | Carry-less multiply-add (GF(2) arithmetic) |
The name SM82_FIRST/SM82_LAST is used as the boundary label even though these instructions are available on sm_80+ (any codegen factory >= 28673). The "82" in the label refers to the internal target used during Ampere development, not to a minimum SM requirement for the opcodes themselves.
Why sm_82 Matters
sm_82 is a ghost target: it occupies a validation table slot and lends its name to an opcode range, but cannot be compiled for. Passing --gpu-name sm_82 to ptxas would pass the initial validation check (bsearch succeeds in the base table) but fail during profile construction because sub_6765E0 has no case for SM 82. The practical consequence is that sm_82 is a naming artifact preserved from Ampere development, not a usable compilation target.
Profile Object Construction
Each SM's intrinsic table initializer (Map 3 handler) calls sub_917990 to allocate a 1,936-byte profile object, then populates architecture-specific fields.
Source: decompiled sub_60A2E0 (sm_75), sub_60A3E0 (sm_80), sub_60AC30 (sm_86), sub_60AD30 (sm_87), sub_60AB30 (sm_88).
All five initializers are structurally identical. The only field that differs between them is offset +348 (codegen factory). Common initialization:
v6 = sub_917990(a3);               // allocate 1936-byte profile object
v7 = (_DWORD *)v6;                 // same object viewed as a DWORD array
*(_DWORD *)(v6 + 344) = 0x100000;  // shared memory config base (1 MB)
*(_DWORD *)(v6 + 348) = FACTORY;   // codegen factory (per-SM)
v7[147] = 0;                       // +588 cleared
v7[421] = cli_option_value;        // +1684, from a1+108
v7[460] = 1;                       // +1840 enable flag
v7[482] = 1;                       // +1928 enable flag
v7[470] = 1;                       // +1880 enable flag
v7[177] = 1;                       // +708 enable flag
The sub_917990 base constructor sets the default codegen factory to 0x2000 (8192), which every per-SM initializer immediately overwrites. Key base constructor fields:
| Offset | Default | Content |
|---|---|---|
| +348 | 0x2000 | Codegen factory (overridden) |
| +588 | 0 | Cleared |
| +1892 | 2 | Mode/config value |
| +1832--1848 | xmmword | SSE-loaded constant block |
| +1908--1924 | xmmword | SSE-loaded constant block |
Handler Dispatch
ptxas registers per-SM handler functions into 7 parallel hash maps via sub_607DB0. For sm_75--88, all handlers are thin wrappers around shared codegen infrastructure.
Map 1 (Handler A) and Map 2 (Handler B)
| SM | Handler A | Handler B |
|---|---|---|
| sm_75 | sub_609B70 | sub_609B40 |
| sm_80 | sub_609CC0 | sub_609C90 |
| sm_86 | sub_609D50 | sub_609D80 |
| sm_87 | sub_609F00 | sub_609DE0 |
| sm_88 | sub_609E70 | sub_609EA0 |
Every Handler A function is identical:
bool handler_A(int64_t a1, int64_t a2) {
*(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
return sub_663C30(a2, 0) != 0; // second arg 0: sets *(a1+96) = 1
}
Every Handler B function is identical except for the constant passed as sub_663C30's second argument:
bool handler_B(int64_t a1, int64_t a2) {
*(int32_t*)(a2 + 100) = read_option(a2, "cpf_optx");
return sub_663C30(a2, 1) != 0; // second arg 1: skips the flag set
}
The "cpf_optx" option controls OptiX IR compilation mode. sub_663C30 is the core codegen-factory-aware driver that delegates to sub_662920 (ELF section iteration, 26KB) and sub_7FBB70 (actual compilation pass). The only behavioral difference between Handler A and Handler B: when called as Handler A (arg=0), sub_663C30 sets *(a1 + 96) = 1 before processing -- likely a "primary pass" flag.
Map 3 (Intrinsic Table Initializer)
| SM | Initializer | Notes |
|---|---|---|
| sm_75 | sub_60A2E0 | Factory 24577 |
| sm_80 | sub_60A3E0 | Factory 28673 |
| sm_86 | sub_60AC30 | Factory 28674 |
| sm_87 | sub_60AD30 | Factory 28675 |
| sm_88 | sub_60AB30 | Factory 28676 |
Maps 6--7 (Performance / Occupancy)
| Map | sm_75 Handler | Purpose |
|---|---|---|
| 6 | sub_608D50 | Perf-stats / occupancy handler E |
| 7 | sub_6096E0 | Perf-stats / occupancy handler F |
These handlers return per-SM occupancy parameters used by the driver API's occupancy calculator. Other Ampere SMs have their own handlers registered into the same maps (addresses not fully traced in sweep data).
Scheduler Profile
sub_8E4400 (InitHWProfile_Warp) uses the codegen factory value to select scheduling parameters. The dispatch is a linear threshold cascade:
| Factory Range | Warps | Dispatch Slots | Architecture Class |
|---|---|---|---|
| <= 20479 | 4 | 96 | Kepler (sm_30) |
| <= 24575 | 6 | 176 | Pascal (sm_60) |
| <= 28672 | 7 | 192 | Volta (sm_70) |
| <= 32767 | 7 | 208 | Turing / Ampere (sm_75--88) |
| <= 36863 | 8 | 224 | Hopper (sm_90) |
| > 36863 | 16 | 240 | Blackwell (sm_100+) |
All SM 75--88 targets fall into the 7-warp / 208-slot bucket. After the warp count, a secondary switch maps specific codegen factory values to sub-architecture variants (stored at a1+26):
| Codegen Factory | Variant | SM |
|---|---|---|
| 24576, 32768, 36864 | 0 | sm_60, sm_90, sm_100 (base) |
| 8193, 20481, 28674 | 2 | sm_30, sm_50, sm_86 |
| 28675, 36867 | 3 | sm_87, sm_103 |
| 28676, 36868 | 4 | sm_88, sm_110 |
| 28677, 36869 | 5 | sm_89, sm_121 |
sm_75 (24577) and sm_80 (28673) are absent from the variant table and fall through to the default variant (0 or 1). This means sm_75 and sm_80 use the baseline latency model, while sm_86--88 get tuned sub-architecture parameters.
HW Latency Tables
Each SM has a dedicated latency table function containing per-opcode pipeline stage assignments, stall cycle costs, and functional unit mappings. These are called from the scheduling infrastructure to drive instruction ordering.
| Function | Size | SM | Notes |
|---|---|---|---|
| sub_8E7720 | 3.5 KB | sm_75 | Turing baseline |
| sub_8E7940 | 2.9 KB | sm_80 (base) | Ampere shared layer |
| sub_8E7B40 | 3.3 KB | sm_80 | Ampere full table |
| sub_8E7D80 | 4.4 KB | sm_86 | Largest in Ampere family |
| sub_8E8070 | 3.5 KB | sm_87 | Orin tuning |
| sub_8E8280 | 3.1 KB | sm_89 | Ada Lovelace |
sm_80 uniquely has two latency tables: a "base" table (sub_8E7940, 2.9KB) and a full table (sub_8E7B40, 3.3KB), suggesting a layered lookup where the base provides defaults and the full table overrides specific entries. sm_86's table is the largest at 4.4KB, likely because RTX 30xx consumer GPUs have different pipeline characteristics from A100 datacenter parts.
No separate table entry was found for sm_88 in the sweep data. It may share sm_86 or sm_87's latency profile, or be registered through a path not captured in the sweep.
SASS Instruction Encoding
SM 75--88 all use the 128-bit per-instruction encoding format introduced with Turing. This replaced the Volta/Pascal scheme where scheduling control was packed into a separate 64-bit control header shared by 3 instructions.
Control Word Layout
Each 128-bit SASS instruction carries a 26-bit control word encoding scheduling decisions (the six fields below sum to 26 bits). The control word is generated by sub_A36360 (52KB, the control-word / scoreboard encoder).
bits [0:3] stall count (4 bits, max 15 cycles)
bit [4] yield flag (1 bit, warp scheduler hint)
bits [5:7] write barrier idx (3 bits, selects 1 of 6 barriers)
bits [8:13] read barrier mask (6 bits, one per barrier)
bits [14:19] wait barrier mask (6 bits, one per barrier)
bits [20:25] reuse flags (6 bits, per source operand)
Instruction Word Structure
128-bit instruction word:
bits [0:3] = 0x2 (format code: 128-bit)
bits [4:6] = slot (scheduling group slot)
bits [8:16] = MAJOR (9-bit major opcode, 0x00-0x171)
bits [17:24] = MINOR (8-bit minor opcode / variant)
bits [25:31] = SUBOP (7-bit sub-opcode / format ID)
bits [48+] = MODIFIERS (format-dependent modifier fields)
bits [132:134] = 0x0 (extended opcode flag, at offset 0x84)
The 3-level opcode hierarchy (major/minor/subop) allows up to 102 major opcodes with 48 sub-operations each. Maximum observed variant value: 0x2F (47 decimal).
Scoreboard / Dependency Barriers
SM 75--88 provide 6 hardware dependency barriers per warp. The scoreboard tracker is managed by sub_8E4920 (BuildScoreboardEntries, 6.9KB) and encoded by sub_A36360.
| Resource | Width | Range | Notes |
|---|---|---|---|
| Write barrier index | 3 bits | 0--5 active, 6--7 reserved | Assigns instruction to a barrier |
| Read barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to check before read |
| Wait barrier mask | 6 bits | 1 bit per barrier | Indicates which barriers to wait on |
The scoreboard tracker allocates 952 bytes per function when bit 4 of the flag byte at offset +1385 is set. An additional 856-byte bitset is allocated when bit 8 is also set -- an index past the top of a single byte, so the check most likely reads +1385 as a 16-bit word, placing the second flag in the adjacent byte at +1386 (used for barrier register tracking in writeback mode).
The scoreboard infrastructure is shared across all SM 75--88 targets. The barrier count (6) is constant for this entire range. sm_90 (Hopper) potentially increases this, and sm_100+ (Blackwell) changes the barrier model further.
Intrinsic Table
Intrinsic availability is cumulative. Each generation adds to the previous.
sm_75 Baseline (IDs 0x89--0x1FA, 370 intrinsics)
sm_75 inherits the full sm_70 (Volta) intrinsic set labeled __cuda_sm70_*:
| Category | Intrinsics | PTX Operations |
|---|---|---|
| Named barriers | barrier_arrive/red/sync (0--15) | bar.arrive, bar.red.{and,or,popc}, bar.sync |
| Warp shuffle | shflsync_bfly/down/idx/up | shfl.sync.{bfly,down,idx,up} |
| Warp vote | votesync_all/any/ballot/uni | vote.sync.{all,any,ballot,uni} |
| Warp match | matchsync_all/any_b32/b64 | match.sync.{all,any}.b{32,64} |
| Warp sync | warpsync | bar.warp.sync |
| Redux | reduxsync_* (IDs 0x01--0x11) | redux.sync.{and,or,xor,min,max,add} |
| WMMA | m16n16k16, m32n8k16, m8n32k16 | wmma.{load,store,mma} |
WMMA intrinsics cover all combinations of:
- Shapes: m16n16k16, m32n8k16, m8n32k16
- Operations: load_a, load_b, load_c, store_d, mma
- Layouts: row/col combinations
- Types: f16, f32, with/without satfinite
- Address spaces: generic, global, shared
sm_80 Additions (IDs 0x1FB--0x22F, 53 intrinsics)
sm_80 adds two intrinsic groups:
14 __cuda_sm80_* intrinsics (IDs 0x1FB--0x208):
| Intrinsic | PTX Operation | Notes |
|---|---|---|
| createpolicy_range | createpolicy.range | L2 cache persistence control |
| createpolicy_fractional | createpolicy.fractional | L2 cache fraction control |
| createpolicy_cvt | Cache policy conversion | |
| mma_bf16_* | mma.sync with .bf16 | BF16 tensor core MMA |
| mma_tf32_* | mma.sync with .tf32 | TF32 tensor core MMA |
| mma_s4_* | mma.sync with .s4 | INT4 tensor core MMA |
| mma_s8_* | mma.sync with .s8 | INT8 tensor core MMA |
| mma_b1_* | mma.sync with .b1 | Binary tensor core MMA |
39 __cuda_sm_8x_mma_* intrinsics (IDs 0x209--0x22F):
Extended MMA shapes and sparse variants for the 2nd/3rd generation tensor core.
Intrinsic Gate Mechanism
The per-SM intrinsic table initializer (Map 3 handler) controls which intrinsics are available. sm_75 registers only the sm_70 intrinsic block. sm_80+ additionally registers the sm_80 and sm_8x blocks. The gate is not a runtime check -- it is a registration-time decision: if the intrinsic's handler function is not registered for a given SM, the PTX parser emits an error when it encounters a call to that intrinsic.
Peephole Optimizer Gates
The SASS-level peephole optimizer (sub_83EF00, 4,858 lines decompiled) applies pattern-matching transformations to the instruction stream. Several transformations are gated by codegen factory thresholds.
FMA/DFMA Combining
// sub_83EF00, case 0x50 (opcode 80 = FMA/DFMA)
if (*(_QWORD *)(*(_QWORD *)(a1 + 1584) + 372) > 28673) {
// FMA combining enabled: sm_86+ (codegen factory 28674+)
// Look for CVT -> FMA patterns and combine
}
Gate: codegen_factory > 28673. This means:
- sm_75 (24577): FMA combining disabled
- sm_80 (28673): FMA combining disabled (equality fails the > check)
- sm_86 (28674): FMA combining enabled
- sm_87 (28675): FMA combining enabled
- sm_88 (28676): FMA combining enabled
This is a significant compiler optimization difference between A100 (sm_80) and RTX 30xx (sm_86). A100 code does not get FMA/DFMA combining, possibly because A100's pipeline already has hardware FMA fusion or because the combining transformation is not profitable on GA100 silicon.
Master Gate
All peephole passes check capability 487 ("enable peephole") and capability 350 (per-instruction gate) before applying any transformation. These capabilities are controlled by optimization level and user options, not by SM target.
BB Initialization Flags
The basic block initializer sub_6E8EB0 sets architecture-specific flags using a secondary SM version encoding (distinct from the codegen factory):
| SM Version (secondary) | Hex | Flags Set | Architecture |
|---|---|---|---|
| 20480 | 0x5000 | bits 1, 8 | sm_80 encoding space |
| 20484 | 0x5004 | bits 16, 64 | sm_84 encoding space |
This secondary encoding uses (gen << 12) with generation 5 for Ampere in the BB init context. The specific bit flags control opcode descriptor table population -- each BB gets a set of 40+ (opcode_id, encoding_word) pairs that define which SASS instructions are legal in that basic block.
Feature Comparison
| Feature | sm_75 | sm_80 | sm_86 | sm_87 | sm_88 |
|---|---|---|---|---|---|
| ISA generation | 6 | 7 | 7 | 7 | 7 |
| Codegen factory | 24577 | 28673 | 28674 | 28675 | 28676 |
| WMMA (1st gen TC) | Yes | Yes | Yes | Yes | Yes |
| MMA bf16/tf32 (2nd gen TC) | -- | Yes | Yes | Yes | Yes |
| MMA s4/s8/b1 extended | -- | Yes | Yes | Yes | Yes |
| createpolicy (L2 cache) | -- | Yes | Yes | Yes | Yes |
| cp.async (async copy) | -- | Yes | Yes | Yes | Yes |
| sm_8x MMA intrinsics (39) | -- | Yes | Yes | Yes | Yes |
| FMA/DFMA peephole combining | -- | -- | Yes | Yes | Yes |
| Scheduler: warps / slots | 7/208 | 7/208 | 7/208 | 7/208 | 7/208 |
| Sub-arch variant | default | default | 2 | 3 | 4 |
| Separate HW latency table | Yes | Yes (2) | Yes | Yes | ? |
| Family string | Turing | Ampere | Ampere | Ampere | Ampere |
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_75 | 65,536 | 255 | 1,024 | 32 | 16 | 7 / 208 | 208 | 32 / 48 / 64 KB | 90% |
| sm_80 | 65,536 | 255 | 2,048 | 64 | 32 | 7 / 208 | 208 | 48 / 100 / 132 / 164 KB | 90% |
| sm_86 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_87 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 / 164 KB | 90% |
| sm_88 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | (same as sm_86) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (`sub_8688F0` offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From `sub_8E4400` offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by `cudaFuncSetAttribute`.
All Turing/Ampere targets share the 7-partition / 208-slot scheduling geometry. The major resource difference is sm_80 (A100 datacenter) with 2,048 max threads and 64 warps vs. the consumer/embedded parts (sm_86--88) with 1,536 max threads and 48 warps. sm_75 (Turing) is the most constrained with 1,024 max threads and 32 warps.
Codegen Factory Gating Patterns
Throughout ptxas, the codegen factory value at profile offset +348 (or equivalently *(_DWORD *)(*(_QWORD *)(a1 + 1584) + 372) from the function context) is used to gate features. Common patterns for the Turing/Ampere range:
// Pattern 1: Generation check (factory >> 12)
int gen = codegen_factory >> 12;
if (gen >= 7) { /* Ampere+ feature */ }
// Pattern 2: Exact threshold
if (codegen_factory > 28673) { /* sm_86+ only */ }
if (codegen_factory >= 28673) { /* sm_80+ */ }
// Pattern 3: Range check
if (codegen_factory >= 24577 && codegen_factory < 32768) {
/* Turing/Ampere only, not Hopper+ */
}
The >> 12 shift extracts the ISA generation, allowing coarse checks (Turing = 6, Ampere = 7). Fine-grained checks compare against specific factory values to distinguish sm_80 from sm_86, etc.
Function Map
| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_609B70 | ~48B | Handler A (cpf_optx + compile) | sm_75 | 99% |
| sub_609B40 | ~48B | Handler B (cpf_optx + compile) | sm_75 | 99% |
| sub_60A2E0 | ~300B | Intrinsic table initializer | sm_75 | 95% |
| sub_609CC0 | ~48B | Handler A | sm_80 | 99% |
| sub_609C90 | ~48B | Handler B | sm_80 | 99% |
| sub_60A3E0 | ~300B | Intrinsic table initializer | sm_80 | 95% |
| sub_609D50 | ~48B | Handler A | sm_86 | 99% |
| sub_609D80 | ~48B | Handler B | sm_86 | 99% |
| sub_60AC30 | ~300B | Intrinsic table initializer | sm_86 | 95% |
| sub_609F00 | ~48B | Handler A | sm_87 | 99% |
| sub_609DE0 | ~48B | Handler B | sm_87 | 99% |
| sub_60AD30 | ~300B | Intrinsic table initializer | sm_87 | 95% |
| sub_609E70 | ~48B | Handler A | sm_88 | 99% |
| sub_609EA0 | ~48B | Handler B | sm_88 | 99% |
| sub_60AB30 | ~300B | Intrinsic table initializer | sm_88 | 95% |
| sub_608D50 | ~200B | Perf/occupancy handler E | sm_75 | 85% |
| sub_6096E0 | ~200B | Perf/occupancy handler F | sm_75 | 85% |
| sub_663C30 | ~400B | Core codegen driver (shared) | all | 90% |
| sub_662920 | 26KB | ELF section iteration (shared) | all | 85% |
| sub_917990 | ~300B | Profile object constructor | all | 90% |
| sub_8E7720 | 3.5KB | HW latency table | sm_75 | 90% |
| sub_8E7940 | 2.9KB | HW latency table (base layer) | sm_80 | 90% |
| sub_8E7B40 | 3.3KB | HW latency table (full) | sm_80 | 90% |
| sub_8E7D80 | 4.4KB | HW latency table | sm_86 | 90% |
| sub_8E8070 | 3.5KB | HW latency table | sm_87 | 90% |
| sub_8E4400 | 3.3KB | InitHWProfile_Warp (shared) | all | 90% |
| sub_83EF00 | ~100KB | Primary peephole optimizer | all | 85% |
| sub_A36360 | 52KB | Control word / scoreboard encoder | all | 85% |
| sub_8E4920 | 6.9KB | BuildScoreboardEntries | all | 85% |
Cross-References
- SM Architecture Map -- Overview of all 23 SM targets and the 3-level profile system
- Ada & Hopper (SM 89--90a) -- sm_89 (Ada) shares Ampere codegen factory range but bridges to Hopper features
- Blackwell (SM 100--121) -- Next-generation targets with codegen factory 36864+
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with per-SM generation ranges
- SASS Instruction Encoding -- 128-bit encoding format, bitfield packer, opcode hierarchy
- Peephole Optimization -- FMA combining and other post-scheduling SASS transforms
- 3-Phase Scheduler Architecture -- Scheduler infrastructure that consumes HW latency tables
- CLI Options -- `--gpu-name` parsing, `sm_75` default
Ada Lovelace & Hopper (SM 89--90a)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas handles SM 89 (Ada Lovelace -- RTX 4090, L40S) and SM 90/90a (Hopper -- H100, H200) as adjacent but architecturally distinct targets. Ada shares the Ampere codegen factory generation (7 << 12) and is stored internally as an "Ampere"-family target despite being a different microarchitecture. Hopper gets its own codegen factory (32768), its own family string "Hopper", and introduces the largest single-generation feature addition in ptxas: WGMMA, thread-block clusters, TMA, setmaxnreg, and distributed shared memory.
| | SM 89 (Ada) | SM 90 (Hopper) | SM 90a (Hopper accel) |
|---|---|---|---|
| Products | RTX 4090, RTX 4080, L40S, L4 | H100, H200 | H100, H200 (arch-locked) |
| Family string | "Ampere" | "Hopper" | "Hopper" |
| __CUDA_ARCH__ | 890 | 900 | 90a0 |
| Codegen factory | 28677 (7 << 12 \| 5) | 32768 (8 << 12) | 32768 |
| Handler A | sub_609E10 | sub_609DB0 | sub_609DB0 (shared) |
| Handler B | sub_609CF0 | sub_609C00 | sub_609C00 (shared) |
| Intrinsic init | sub_60A810 | sub_60A5F0 | sub_60A5F0 (shared) |
| HW latency table | sub_8E8280 (3.1KB) | sub_8E8480 (5.2KB) | sub_8E8780 (4.6KB) |
| Suffix variants | None | a only (no f) | -- |
| Forward compat | Full (runs on sm_90+) | Full (base variant) | None (locked to H100/H200) |
The "a" Suffix -- Architecture-Accelerated
SM 90a is the first target to use the a suffix. It appears in the accelerated validation table (unk_1D161C0, 7 entries). The meaning is precise: sm_90a SASS executes only on the specific silicon it was compiled for (H100/H200) and will not run on any future architecture. This trades forward compatibility for access to features that may not survive to the next generation.
In ptxas, sm_90 and sm_90a share all 7 dispatch-table handler functions. The a suffix does not produce different handler code paths -- it produces different compatibility metadata in the output cubin. The ELF header records whether the binary is forward-compatible (base) or arch-locked (accelerated), and the CUDA driver enforces this at load time.
Compilation rules from ptxas help text:
- `sm_90a` PTX must be compiled to `sm_90a` SASS (no cross-arch)
- `sm_90` PTX can compile to `sm_90` or any later SASS target
- No `sm_90f` variant exists; the `f` suffix starts with Blackwell
SM 89 -- Ada Lovelace
Internal Classification
Ada is classified as Ampere-derived in the binary. The profile object constructed by sub_6765E0 stores "Ampere" as the family name and uses a generation-7 codegen factory (28677 = 7 << 12 | 5), the same generation as sm_80 through sm_88. The compiler distinguishes Ada from the earlier Ampere parts through per-SM capability accessor functions and the sub-variant field, not through the generation ID.
The encoded SM version for sm_89 is 28677 (7 << 12 | 5), placing it as the 5th variant in the Ampere generation:
| Encoded Value | SM | Variant |
|---|---|---|
| 28673 | sm_80 | 7 << 12 \| 1 (base Ampere) |
| 28674 | sm_86 | 7 << 12 \| 2 |
| 28675 | sm_87 | 7 << 12 \| 3 |
| 28676 | sm_88 | 7 << 12 \| 4 |
| 28677 | sm_89 | 7 << 12 \| 5 (Ada) |
Ada-Specific Features
Ada introduces 4th-generation tensor cores with FP8 (E4M3, E5M2) support. In the PTX validator, sub_4A6050 explicitly references "%s on sm_89" in its CVT instruction validation, confirming sm_89-specific conversion rules for new data types.
The intrinsic table initializer for sm_89 (sub_60A810) enables 39 additional MMA intrinsics:
| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x209--0x22F | 39 | __cuda_sm_8x_mma_* -- extended MMA shapes and types |
These intrinsics cover FP8 MMA operations, block-scale MMA, and additional type combinations beyond what sm_80--88 support. The MMA validator at sub_49BBA0 checks for "mma with FP8 floating point type" and validates against the target SM version.
Ada Scheduling Profile
The HW latency table for sm_89 (sub_8E8280, 3.1KB) is smaller than Hopper's (5.2KB), reflecting Ada's simpler pipeline structure -- no WGMMA async pipeline, no cluster operations. The register file geometry from sub_8E4400:
| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 8 | Selected for encoded SM 28677 |
| Dispatch slots | 224 | One bucket above the 208 slots of sm_75--88 |
| Sub-architecture variant | 5 | From encoded value 28677 |
SM 90 / SM 90a -- Hopper
Profile Construction
Hopper is the first SM to get its own codegen factory value (32768 = 8 << 12) since the Turing/Ampere split. The profile object stores "Hopper" as the family name. Key profile fields:
| Field | sm_90 | sm_90a |
|---|---|---|
| SM name | "sm_90" | "sm_90a" |
| Compute name | "compute_90" | "compute_90a" |
| LTO name | "lto_90" | "lto_90a" |
| __CUDA_ARCH__ | 900 | 90a0 |
| Family | "Hopper" | "Hopper" |
| Codegen factory | 32768 | 32768 |
Hopper Scheduling Profile
The HW latency tables for Hopper are substantially larger than any previous architecture, reflecting the async pipeline and tensor core scheduling complexity:
| Function | Size | Target | Notes |
|---|---|---|---|
| sub_8E8480 | 5.2KB | sm_90 | Base Hopper latency model |
| sub_8E8780 | 4.6KB | sm_90a | Arch-accelerated variant |
sm_90a gets its own latency table (4.6KB) distinct from sm_90 (5.2KB), even though they share all dispatch handler functions. This is the only architecture where base and a variants have separate scheduling profiles -- all Blackwell variants share their tables within each base SM.
Register file geometry from sub_8E4400:
| Parameter | Value | Notes |
|---|---|---|
| Warps per scheduler | 16 | Selected for codegen factory 32768 (Hopper bucket) |
| Dispatch slots | 240 | Maximum of any family; sm_75--88 get 208 |
| Sub-architecture variant | 0 | From encoded value 32768 (base variant) |
| Max threads/CTA | 240 | From code object builder sub_A465F0 |
The jump from 8 warps / 224 slots (sm_89) to 16 warps / 240 slots (sm_90) is the largest warp geometry change in the binary. This doubling of warp capacity directly corresponds to Hopper's 4-warp warpgroup execution model.
Hopper Intrinsics
The intrinsic initializer for sm_90 (sub_60A5F0) enables 38 sub-byte MMA intrinsics:
| Intrinsic ID Range | Count | Category |
|---|---|---|
| 0x23A--0x25F | 38 | __cuda_sm_9x_mma_sub_byte_internal_* |
These cover sparse sub-byte MMA operations: s4/u4 sparse variants for m16n8k32, m16n8k64, and m16n8k128 shapes. These are Hopper-specific and do not appear in the Ada (sm_89) intrinsic table.
WGMMA -- Warpgroup Matrix Multiply-Accumulate
WGMMA is Hopper's signature instruction. It operates on warpgroups (4 consecutive warps) rather than single warps, and executes asynchronously -- the tensor core operates in parallel with the warp's regular instruction stream. ptxas handles WGMMA through four PTX instructions and a dedicated compiler pass infrastructure.
PTX Instructions
Registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Formatter | Formatter Size |
|---|---|---|---|
| wgmma.mma_async | sub_50AC70 | sub_4DA380 | 295B |
| wgmma.fence | -- | sub_4DA4B0 | 295B |
| wgmma.commit_group | -- | sub_4DA5E0 | 311B |
| wgmma.wait_group | -- | sub_505B00 | 1066B |
At 295 bytes, the wgmma.mma_async and wgmma.fence formatters are the smallest named formatters in ptxas, reflecting WGMMA's compact text representation. The wgmma.wait_group formatter is significantly larger (1066B) because it must also encode the pipeline depth parameter.
GMMA Pipeline Pass Infrastructure
The WGMMA pipeline optimizer is the largest single-architecture compiler subsystem in ptxas, spanning approximately 100KB of code across 15+ functions in the range 0xACE000--0xAE6000. It is active only for SM 90+ targets.
Call chain:
sub_AE4F70 (coordinator -- outside primary range)
+-- sub_ACE480 (22.7KB) WGMMA serialization warning emitter
+-- sub_ADEB40 (43.1KB) warpgroup.arrive/wait fence insertion
+-- sub_AE17C0 (37.9KB) pipeline stage builder
| +-- sub_AE0D20 (16.8KB) live range builder
| +-- sub_AE06F0 GMMA operand classifier
+-- sub_ADDDF0 (20.6KB) pass entry (vtable dispatch)
+-- sub_ADCA60 (21.7KB) scheduling coordinator
+-- sub_ADBD30 (23.9KB) register pressure estimator
| +-- sub_ADAD60 (8.4KB) live range limiter
| +-- sub_AD9C20 (14.4KB) register class allocator
+-- sub_AD70B0 (22.6KB) operand register assignment
Warpgroup Synchronization Injection
The fence insertion pass (sub_ADEB40, 43.1KB) scans for wgmma.mma_async operations and automatically injects warpgroup.arrive and warpgroup.wait instructions. These fences manage register ownership when asynchronous tensor core operations are in flight -- the hardware requires explicit handoff between the warpgroup's register file and the tensor core's accumulator registers.
Diagnostic messages emitted by the compiler:
- "warpgroup.arrive is injected in around line %d by compiler to allow use of registers in GMMA in function '%s'"
- "warpgroup.wait is injected in around line %d by compiler to allow use of registers defined by GMMA in function '%s'"
WGMMA Serialization Warnings
When the compiler cannot pipeline WGMMA operations, sub_ACE480 (22.7KB, 98% confidence) emits detailed performance advisories using codes 0x1D55--0x1D57 (7509--7511 decimal). Nine distinct serialization reasons are enumerated:
| Reason Code | Diagnostic Message |
|---|---|
| 1 | "presence of Extern calls in the function" |
| 2 | "wgmma pipeline crossing function boundary" |
| 3 | "insufficient register resources for the wgmma pipeline" |
| 4 | "program dependence on compiler-inserted warpgroup" |
| 5 | "ill formed pipeline stage in the function" |
| 6 | "non wgmma instructions defining accumulator registers" |
| 7 | "non wgmma instructions reading accumulator registers" |
| 8 | "non wgmma instructions defining input registers" |
| 9 | "insufficient register resources for the function" |
All messages are prefixed with "Potential Performance Loss: wgmma.mma_async instructions are serialized due to ...". The pass reads its configuration from offsets +26280 (1-byte enable) and +26288 (dword threshold) on the compilation context.
GMMA Live Range Management
The live range limiter (sub_ADAD60) enforces a maximum on simultaneously active live ranges within GMMA sequences:
"GMMA sequence has too many active live ranges (%d), reduce it to bring it under (%d)"
When the threshold is exceeded, the system triggers register spilling or sequence splitting through sub_ADBD30 (register pressure estimator). The live range builder uses FNV-1a hashing (constants 16777619 and 0x811C9DC5) for instruction deduplication.
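The recovered constants are the standard 32-bit FNV-1a prime and offset basis, so the deduplication hash itself can be reproduced exactly (a sketch; the byte serialization of instructions fed into the hash was not recovered):

```python
FNV_PRIME_32 = 16777619        # 0x01000193, observed in the live range builder
FNV_OFFSET_32 = 0x811C9DC5     # offset basis, observed alongside it

def fnv1a32(data: bytes) -> int:
    """Standard 32-bit FNV-1a over a byte string."""
    h = FNV_OFFSET_32
    for b in data:
        h = ((h ^ b) * FNV_PRIME_32) & 0xFFFFFFFF
    return h
```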
Thread-Block Clusters
Hopper introduces the concept of a thread-block cluster -- a group of cooperating CTAs that can access each other's shared memory (distributed shared memory). ptxas adds several PTX directives and special registers to support this.
Cluster Directives
The directive validator (sub_4CE6B0, 48KB) enforces mutual exclusivity of cluster configuration:
".reqnctapercluster and .maxclusterrank cannot both be specified"
Two shared-memory state spaces are distinguished:
- `.shared::cta` -- CTA-local shared memory (pre-Hopper behavior)
- `.shared::cluster` -- distributed shared memory accessible across CTAs in a cluster
Cluster Special Registers
Registered in sub_61B850 (special register table initializer):
| Special Register | Purpose |
|---|---|
| %clusterid | Cluster ID within the grid |
| %nclusterid | Number of clusters in the grid |
| %cluster_ctaid | CTA position within the cluster |
| %cluster_nctaid | Number of CTAs in the cluster |
| %cluster_ctarank | Linear rank of CTA within the cluster |
| %cluster_nctarank | Total CTAs in the cluster (linear) |
| %is_explicit_cluster | Whether this launch uses explicit clustering |
| %aggr_smem_size | Aggregate shared memory across cluster |
Distributed Shared Memory Intrinsics
The intrinsic handler OCG_DshmemHandler at sub_6C60B0 validates distributed shared memory operations:
- "Cannot use both the selfcast and the broadcast modifier."
- "Either the selfcast or the broadcast modifier must be used."
TMA -- Tensor Memory Accelerator
The Tensor Memory Accelerator (TMA) provides hardware-accelerated bulk data movement between global and shared memory, using tensor descriptors to specify multi-dimensional copy patterns. In ptxas, TMA is exposed through cp.async.bulk.tensor.
cp.async.bulk.tensor Codegen
The codegen handler (sub_5AB460, 45KB) is one of the largest single-instruction handlers in ptxas:
| Property | Value |
|---|---|
| Handler function | sub_5AB460 |
| Size | 45KB |
| Buffer allocation | 50,000 bytes |
| Registered name | "cp.async.bulk.tensor" |
| Dimensionality | 1D through 5D |
| Modes | tile, im2col |
| Cast variants | unicast, multicast |
The TMA intrinsic handler (OCG_CpAsyncTensorHandler at sub_6C8100) validates operand counts per mode:
- "Must have 1 input with a1t0 and no multicast"
- "Must have 2 inputs with a1t0 and multicast"
- "Must have 2 input with a0tx and no multicast"
- "Must have 3 inputs with a0tx and multicast"
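The four diagnostics imply a simple operand-count rule, sketched here (the `a1t0`/`a0tx` mode names are taken verbatim from the binary strings; the helper name is ours):

```python
def expected_tma_inputs(addr_mode: str, multicast: bool) -> int:
    """Operand counts implied by the OCG_CpAsyncTensorHandler diagnostics:
    a1t0 -> 1 input, a0tx -> 2 inputs, +1 when multicast is present."""
    base = 1 if addr_mode == "a1t0" else 2
    return base + (1 if multicast else 0)
```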
Tensormap Instructions
The tensormap validator (sub_4A73C0, 10.8KB) handles tensor descriptor manipulation:
- ".tile" mode validation
- "Tensormap field with input value >= 13" / "with input value == 4" bounds checking
- ".tensormap::generic" addressing mode
- "Interger Immediate for ordinal" (sic -- typo preserved from binary)
Related Bulk Copy Infrastructure
| Handler | Function | Size |
|---|---|---|
| cp.async.bulk | sub_593210 | 5.1KB (formatter) |
| cp.async.mbarrier.arrive | sub_4DC180 | -- |
| OCG_CpAsyncBulkHandler | sub_6C3470 | 20KB |
| OCG_CpAsyncHandler | sub_6C2AE0 | 10KB |
setmaxnreg -- Dynamic Register Allocation
Hopper introduces setmaxnreg (PTX opcode 315) for dynamic register count adjustment. This allows kernels to change their register footprint at runtime, enabling techniques like CTA reconfiguration and warpgroup-level resource management.
setmaxnreg Pass
The handler (sub_97EC60, ~3.5KB, 90% confidence) walks the instruction list looking for opcode 315 and processes or removes them. Five reasons for ignoring setmaxnreg are enumerated as "Potential Performance Loss" warnings:
| Reason | Message |
|---|---|
| 1 | "unable to determine register count at entry" |
| 2 | "to maintain minimum register requirements" |
| 3 | "to allow debugging" |
| 4 | "to maintain compatibility across compilation units" |
| 5 | "to maintain compatibility into 'extern' call" |
The setmaxnreg handling mode is controlled by knob 653.
CTA Reconfig Pragmas
The pragma validator (sub_97F540, ~4KB) enforces ordering constraints on setmaxreg.alloc and setmaxreg.dealloc:
- "Found an 'alloc' pragma after 'dealloc'"
- "Found incompatible thread count re-specification"
- "Found a 'dealloc' pragma after 'alloc'"
The dealloc validator (sub_98D100, ~4.8KB) enforces register count bounds:
- "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d)"
- "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count"
Async Pipeline Features
Hopper extends the async copy pipeline introduced in Ampere with mbarrier (memory barrier) objects. ptxas implements mbarrier handling through a cluster of detector, classifier, and emitter functions:
| Function | Address | Identity |
|---|---|---|
| MBarrierDetector::isNonTrivialMBarrier | sub_A94440 | Checks "%mbarrier_" prefix |
| MBarrierDetector::classifyMBarrier | sub_A9A5F0 | Returns packed `(type << 32) \| is_mbarrier` |
| MBarrierDetector::resolveMBarrierBaseName | sub_A9A920 | Extracts base name from symbol |
| MBarrierEmitter::rewriteMBarrierOperand | sub_AA33C0 | Constructs "%%mbarrier_%s_%s" |
The mbarrier type classifier returns an enumeration of 0--12 for different operation kinds, covering arrive, arrive_drop, expect_tx, and their counted variants.
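The detector/classifier pair can be sketched as follows (the prefix test and packed-return convention come from the recovered functions; the `kind` parameter stands in for the 0--12 enumeration, whose exact ordering is not documented here):

```python
MBARRIER_PREFIX = "%mbarrier_"

def is_non_trivial_mbarrier(sym: str) -> bool:
    # sub_A94440: prefix test on the symbol name
    return sym.startswith(MBARRIER_PREFIX)

def classify_mbarrier(sym: str, kind: int) -> int:
    # sub_A9A5F0 convention: packed (type << 32) | is_mbarrier
    flag = 1 if is_non_trivial_mbarrier(sym) else 0
    return (kind << 32) | flag
```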
Warpgroup Attribute Management
The warpgroup attribute processor (sub_64BAF0, 30KB) handles kernel-level attributes introduced with Hopper:
| Attribute String | Purpose |
|---|---|
"warpgroup-commit_batch" | WGMMA commit batch configuration |
"warpgroup-wait" | WGMMA wait configuration |
"warpgroup-arrive" | WGMMA arrive configuration |
"setsmemsize-state" | Shared memory size pragma state |
"setmaxreg-state" | Register limit pragma state |
"func_begin" | Function entry marker |
"CC-temp" | Calling convention temporary |
Architecture Version Threshold Checks
The binary uses the encoded SM version (codegen factory value) for feature gating throughout the compiler. Key thresholds observed:
| Check Pattern | Threshold | Meaning |
|---|---|---|
| profile[+372] > 28673 | > sm_80 base | Post-Ampere features |
| a2 <= 32767 | <= sm_89 class | Pre-Hopper geometry (7 warps, 208 slots) |
| a2 <= 36863 | <= sm_89 extended | Ampere/Ada geometry (8 warps, 224 slots) |
| a2 > 36863 | >= sm_90 | Hopper+ geometry (16 warps, 240 slots) |
| Codegen factory == 32768 | sm_90 exactly | Hopper-specific code paths |
| Codegen factory >= 32768 | sm_90+ | WGMMA, cluster, TMA enabled |
The register file descriptor at sub_8E4400 uses the encoded value to select warp geometry. The full cascade:
encoded <= 20479 -> 4 warps, 96 slots (pre-Maxwell)
encoded <= 24575 -> 6 warps, 176 slots (Pascal)
encoded <= 28672 -> 7 warps, 192 slots (Volta)
encoded <= 32767 -> 7 warps, 208 slots (Turing/Ampere/Ada)
encoded <= 36863 -> 8 warps, 224 slots (Ampere extended)
encoded > 36863 -> 16 warps, 240 slots (Hopper+)
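The cascade translates directly into code (a sketch of the recovered comparisons in sub_8E4400; returns a `(warps, dispatch_slots)` pair):

```python
def warp_geometry(encoded_sm: int) -> tuple:
    """Warp-geometry cascade recovered from sub_8E4400."""
    if encoded_sm <= 20479:
        return (4, 96)     # pre-Maxwell
    if encoded_sm <= 24575:
        return (6, 176)    # Pascal
    if encoded_sm <= 28672:
        return (7, 192)    # Volta
    if encoded_sm <= 32767:
        return (7, 208)    # Turing/Ampere/Ada
    if encoded_sm <= 36863:
        return (8, 224)    # Ampere extended
    return (16, 240)       # beyond the 36863 threshold
```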
Note: sm_89 (encoded 28677) falls in the <= 32767 range, giving it 7 warps / 208 slots. But the separate warp geometry lookup uses a different cascade where sm_89 (as Ampere class) gets 8 warps / 224 slots. The dual-cascade structure reflects the fact that different subsystems query different profile fields.
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_89 | 65,536 | 255 | 1,536 | 48 | 16 | 7 / 208 | 208 | 48 / 100 KB | 90% |
| sm_90 | 65,536 | 255 | 1,024 | 64 | 32 | 8 / 224 | 224 | 48 / 100 / 132 / 164 / 228 KB | 90% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.
sm_90a shares sm_90's geometry -- the a suffix affects only compatibility metadata, not hardware resource limits. The jump from sm_89 (7 partitions, 208 slots, 48 warps) to sm_90 (8 partitions, 224 slots, 64 warps) is the largest single-generation scheduling capacity increase in the binary, directly supporting Hopper's 4-warp warpgroup execution model.
MMA Instruction Validators
Several validator functions gate MMA features by SM version:
| Validator | Address | Size | SM Strings Referenced |
|---|---|---|---|
| WMMA/MMA validator | sub_4C2FD0 | 12.2KB | "sm_90", "sm_80", "sm_75" |
| MMA scale/block validator | sub_49BBA0 | 11.4KB | "sm_%d", "mma with FP8 floating point type" |
| WMMA shape validator | sub_4BFED0 | 10.3KB | "sm_80", "sm_75" |
| CVT arch-specific | sub_4A6050 | 5.0KB | "%s on sm_89" |
| Special register validator | sub_49A5A0 | 3.5KB | "sm_90", "%laneid", "%warpid" |
| Instruction fusion validator | sub_4AA3E0 | 7.1KB | "sm_90" |
| Float instruction validator | sub_4A2CA0 | 3.7KB | "sm_90" |
| Function address validator | sub_4B1630 | 4.6KB | "sm_90", "sm_30", "sm_20" |
The WMMA/MMA validator at sub_4C2FD0 performs three-way version checks: sm_75 for base WMMA, sm_80 for extended types (BF16/TF32), sm_90 for WGMMA features. FP8 MMA is gated by both sm_89 (Ada) for the data types and sm_90 (Hopper) for the warpgroup shapes.
Post-Scheduling Statistics
Eight architecture-variant statistics printers (clones at 0x700-byte intervals from sub_ABBA50) emit DUMPIR statistics. The metrics cover Hopper-specific counters:
| Metric | Format String |
|---|---|
| MMA counts | "# [hmma1688=%d]" (and variants for imma, sparse, dmma) |
| Occupancy | "# [Occupancy = %f]" |
| Per-unit throughput | "# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] ..." |
| Issue throughput | "# [issue thru=%f] [adu thru=%f] [alu thru=%f] ..." |
| WGMMA serialization | "Potential Performance Loss: wgmma.mma_async ..." (9 variants) |
| Shared memory | "# [SharedMem Alloc thru=%f]" |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_4A6050 | 5.0KB | CVT validator (sm_89 special cases) | 85% |
sub_4C2FD0 | 12.2KB | WMMA/MMA validator (sm_75/80/90 version checks) | 90% |
| sub_4DA380 | 295B | wgmma.mma_async formatter | 99% |
| sub_4DA4B0 | 295B | wgmma.fence formatter | 99% |
| sub_4DA5E0 | 311B | wgmma.commit_group formatter | 99% |
| sub_505B00 | 1066B | wgmma.wait_group formatter | 99% |
| sub_50AC70 | -- | wgmma.mma_async codegen handler | 99% |
| sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (TMA) | 95% |
| sub_609CF0 | ~1.2KB | sm_89 handler B (capability accessor) | 90% |
| sub_609DB0 | ~1.2KB | sm_90 handler A (capability accessor) | 90% |
| sub_609E10 | ~1.2KB | sm_89 handler A (capability accessor) | 90% |
| sub_60A5F0 | ~1KB | sm_90 intrinsic table initializer | 85% |
| sub_60A810 | ~1KB | sm_89 intrinsic table initializer | 85% |
| sub_61B850 | 10KB | Special register table (cluster regs) | 99% |
| sub_64BAF0 | 30KB | Warpgroup/kernel attribute processor | 80% |
| sub_6C60B0 | -- | Distributed shared memory intrinsic handler | 65% |
| sub_6C8100 | -- | TMA (cp.async.tensor) intrinsic handler | 85% |
| sub_8E4400 | 3.3KB | Register file geometry initializer | 90% |
| sub_8E8280 | 3.1KB | sm_89 (Ada) HW latency table | 85% |
| sub_8E8480 | 5.2KB | sm_90 (Hopper) HW latency table | 85% |
| sub_8E8780 | 4.6KB | sm_90a HW latency table | 85% |
| sub_97EC60 | ~3.5KB | setmaxnreg handler (opcode 315) | 90% |
| sub_97F540 | ~4KB | CTA reconfig pragma validator | 90% |
| sub_98D100 | ~4.8KB | setmaxreg.dealloc validator | 90% |
| sub_A94440 | -- | MBarrierDetector::isNonTrivialMBarrier | 85% |
| sub_A9A5F0 | -- | MBarrierDetector::classifyMBarrier | 85% |
| sub_AA33C0 | -- | MBarrierEmitter::rewriteMBarrierOperand | 85% |
| sub_ACE480 | 22.7KB | WGMMA serialization warning emitter | 98% |
| sub_AD70B0 | 22.6KB | GMMA operand register assignment | 75% |
| sub_AD9C20 | 14.4KB | GMMA register class allocator | 75% |
| sub_ADAD60 | 8.4KB | GMMA live range limiter | 90% |
| sub_ADBD30 | 23.9KB | GMMA register pressure estimator | 80% |
| sub_ADCA60 | 21.7KB | GMMA scheduling coordinator | 85% |
| sub_ADDDF0 | 20.6KB | GMMA pass entry (vtable) | 80% |
| sub_ADEB40 | 43.1KB | WGMMA sync fence insertion | 95% |
| sub_AE0D20 | 16.8KB | GMMA live range builder | 80% |
| sub_AE17C0 | 37.9KB | GMMA pipeline stage builder | 85% |
| sub_AE4F70 | -- | GMMA pass coordinator (outside range) | 90% |
| sub_AE5030 | 15.5KB | GMMA scheduling wrapper (alt entry) | 75% |
Cross-References
- SM Architecture Map -- Validation tables, codegen factory values, suffix semantics
- Turing & Ampere (SM 75--88) -- Ampere baseline that Ada inherits
- Blackwell (SM 100--121) -- Next-generation features (tcgen05, expanded `a`/`f` variants)
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with sm_8x and sm_9x ranges
- Pass Inventory -- GMMA/WGMMA pipeline pass placement in 159-phase schedule
- Scheduling Overview -- HW latency table architecture
- CLI Options -- `--gpu-name sm_90a` parsing
- Knobs System -- Knob 653 (setmaxnreg mode)
Blackwell (SM 100--121)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas v13.0.88 handles five Blackwell-era base targets -- sm_100, sm_103, sm_110, sm_120, sm_121 -- spanning datacenter, automotive, consumer, and DGX product lines. All share the codegen factory value 36864 (generation 9, 9 << 12) and the "Blackwell" family string internally, despite being distinct microarchitectures. The defining Blackwell feature is Capsule Mercury (capmerc) as the default binary output format, automatically enabled for SM numbers exceeding 99. The datacenter variants (sm_100, sm_103, sm_110) support tcgen05 (5th-generation tensor cores with dedicated tensor memory); the consumer variants (sm_120, sm_121) do not.
| SM targets | sm_100, sm_103, sm_110, sm_120, sm_121 (+ a and f sub-variants each) |
| Codegen factory | 36864 (0x9000, generation 9) |
| Family string | "Blackwell" (all five targets) |
| Default binary format | Capsule Mercury (capmerc) -- auto-enabled for SM > 99 |
| SASS encoding | 128-bit per instruction (Mercury-encoded) |
| Warp geometry | 16 warps, 240 dispatch slots (shared with Hopper sm_90) |
| Sub-variants per SM | 3: base, a (accelerated), f (feature-reduced) |
| Profile constructor | sub_6765E0 (54KB) |
| Capability dispatch | sub_607DB0 (7 hash maps, once-guarded) |
SM Version Table
| SM | Product | __CUDA_ARCH__ | Codegen Factory | Hex | Variant |
|---|---|---|---|---|---|
| sm_100 | B100 / B200 (datacenter) | 1000 | 36864 | 0x9000 | 0 (gen 9 base) |
| sm_103 | GB300 (Blackwell Ultra) | 1030 | 36867 | 0x9003 | 3 |
| sm_110 | Jetson Thor SoC | 1100 | 36868 | 0x9004 | 4 |
| sm_120 | RTX 50xx / RTX Pro | 1200 | 36869 | 0x9005 | 5 |
| sm_121 | DGX Spark | 1210 | 36869 | 0x9005 | 5 |
Codegen factory encoding: (9 << 12) | sub_variant. sm_100 is variant 0 (generation base). sm_103 is variant 3. sm_110 is variant 4. sm_120 and sm_121 appear to share variant 5 in the scheduling sub-architecture table at sub_8E4400.
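The encoding is a straightforward pack/unpack, sketched here (function names are ours):

```python
def encode_factory(generation: int, variant: int) -> int:
    # (gen << 12) | sub_variant, per the recovered codegen factory encoding
    return (generation << 12) | variant

def decode_factory(value: int) -> tuple:
    # returns (generation, sub_variant)
    return (value >> 12, value & 0xFFF)
```

Decoding sm_90's factory value 32768 gives generation 8, variant 0, consistent with the Hopper section's "base variant" note.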
Unreleased SM numbers referenced in the binary: The SASS formatter sub_583190 (rsqrt) checks for SM codes 102, 103, 107, 124, 130 in architecture-specific dispatch paths, suggesting internal/future variants beyond the five publicly exposed targets.
Sub-Variant System
Every Blackwell SM has three sub-variants. The base and a/f variants within an SM share all 7 dispatch table handler functions -- they are identical silicon with different compatibility metadata and feature exposure.
Profile Object Fields
The profile constructor sub_6765E0 builds profile objects for each sub-variant with these fields:
| SM | Base | a Variant | f Variant |
|---|---|---|---|
| sm_100 | sm_100 / compute_100 / lto_100 | sm_100a / compute_100a / lto_100a | sm_100f / compute_100f / lto_100f |
| sm_103 | sm_103 / compute_103 / lto_103 | sm_103a / compute_103a / lto_103a | sm_103f / compute_103f / lto_103f |
| sm_110 | sm_110 / compute_110 / lto_110 | sm_110a / compute_110a / lto_110a | sm_110f / compute_110f / lto_110f |
| sm_120 | sm_120 / compute_120 / lto_120 | sm_120a / compute_120a / lto_120a | sm_120f / compute_120f / lto_120f |
| sm_121 | sm_121 / compute_121 / lto_121 | sm_121a / compute_121a / lto_121a | sm_121f / compute_121f / lto_121f |
CUDA_ARCH Macro Values
| Sub-Variant | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Base | -D__CUDA_ARCH__=1000 | =1030 | =1100 | =1200 | =1210 |
| Accelerated | =100a0 | =103a0 | =110a0 | =120a0 | =121a0 |
| Feature-reduced | =100f0 | =103f0 | =110f0 | =120f0 | =121f0 |
Suffix Bit Flags in Profile Objects
From the decompiled profile constructor, suffixed variants set specific byte flags:
| Suffix | Flag Position | Evidence |
|---|---|---|
| a (accelerated) | profile[4] = 1 | `v79->m128i_i8[4] = 1; *(_BYTE *)(v82 + 4) = 1;` |
| f (feature-reduced) | profile[5] = 1 (on all 3 objects: sm, compute, lto) | `v88->m128i_i8[5] = 1; *(_BYTE *)(v91 + 5) = 1; v94[5] = 1;` |
The a flag is set on the SM and compute profile objects only. The f flag is set on all three (sm, compute, lto), reflecting the fact that f-compiled code must retain its feature-reduced metadata through linking.
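A minimal model of the flag stores (a sketch: the three-object tuple and byte offsets mirror the decompiled code, while the bytearray representation is illustrative):

```python
def apply_suffix_flags(sm, compute, lto, suffix):
    """sm/compute/lto are byte-addressable profile objects (bytearrays here).
    'a' sets byte +4 on the sm and compute objects only; 'f' sets byte +5
    on all three, matching the decompiled stores."""
    if suffix == "a":
        sm[4] = 1
        compute[4] = 1
    elif suffix == "f":
        for obj in (sm, compute, lto):
            obj[5] = 1
```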
isaClass Inheritance
Sub-variants reference their base SM's isaClass rather than defining a new one:
- sm_100a and sm_100f reference "(profile_sm_100)->isaClass"
- sm_103a and sm_103f reference "(profile_sm_103)->isaClass"
- sm_120a and sm_120f reference "(profile_sm_120)->isaClass"
- sm_121a and sm_121f reference "(profile_sm_121)->isaClass"
This confirms that sub-variants share the instruction set architecture class with their base. The a/f distinction is purely in compatibility metadata, not in the ISA or codegen.
Capability Dispatch
sub_607DB0 registers handler functions into 7 parallel hash maps. All sub-variants of a given SM register the same function pointers.
Handler Assignments (Maps 1--3)
| SM | Handler A (Map 1) | Handler B (Map 2) | Intrinsic Init (Map 3) |
|---|---|---|---|
| sm_100 / 100a / 100f | sub_609C30 | sub_609BD0 | sub_60A910 |
| sm_103 / 103a / 103f | sub_608F20 | sub_609D20 | sub_60A700 |
| sm_110 / 110a / 110f | sub_609F30 | sub_608F50 | sub_60AA20 |
| sm_120 / 120a / 120f | sub_609E40 | sub_609C60 | sub_608DF0 |
| sm_121 / 121a / 121f | sub_609ED0 | sub_609BA0 | sub_60A4E0 |
Performance / Occupancy Handlers (Maps 6--7)
| SM | Handler E (Map 6) | Handler F (Map 7) |
|---|---|---|
| sm_100 / 100a / 100f | sub_609080 | sub_6098A0 |
| sm_103 / 103a / 103f | sub_609020 | sub_6091A0 |
| sm_110 / 110a / 110f | sub_609000 | sub_609280 |
| sm_120 / 120a / 120f | sub_608FE0 | sub_609520 |
| sm_121 / 121a / 121f | sub_609040 | sub_6097C0 |
Every Blackwell SM has unique handler functions in all 7 maps. This contrasts with Hopper where sm_90 and sm_90a share all handlers. Each Blackwell SM variant is architecturally distinct enough to warrant separate capability accessors, performance models, and intrinsic tables.
Warp Geometry
The warp geometry initializer at sub_8E4400 uses the codegen factory value to select dispatch parameters. All Blackwell targets (codegen factory > 36863) fall into the maximum bucket:
encoded > 36863 -> 16 warps, 240 dispatch slots
This is identical to Hopper (sm_90). The 16-warp / 240-slot geometry supports Blackwell's warpgroup execution model (4 warps per warpgroup, 4 warpgroups per SM partition).
Sub-Architecture Variant Table
The secondary variant assignment at sub_8E4400 maps codegen factory values to sub-architecture indices:
| Codegen Factory | Variant | SM |
|---|---|---|
| 36864 | 0 | sm_100 (base) |
| 36867 | 3 | sm_103 |
| 36868 | 4 | sm_110 |
| 36869 | 5 | sm_120, sm_121 |
These variant indices select different entries within the per-SM latency tables, allowing the scheduler to use silicon-specific pipeline timing.
Hardware Resource Geometry
Per-SM hardware resource limits used by ptxas for register allocation, occupancy calculations, and scheduling decisions. Extracted from sub_8688F0 (universal baseline), sub_8E4400 (scheduler partition geometry), and sub_ABF250 (occupancy calculator). See targets/index.md -- Per-SM Resource Geometry Table for the complete table across all architectures.
| SM | Regs/SM | Max Regs/Thread | Max Threads/CTA | Warps/SM | Max CTAs/SM | Sched Partitions | Dispatch Slots | Configurable Shared Memory | Conf |
|---|---|---|---|---|---|---|---|---|---|
| sm_100 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 90% |
| sm_103 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 88% |
| sm_110 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_100) | 85% |
| sm_120 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | 48 / 100 / 132 / 164 / 228 KB | 88% |
| sm_121 | 65,536 | 255 | 1,024 | 64 | 32 | 16 / 240 | 240 | (same as sm_120) | 85% |
Column definitions:
- Regs/SM: Total 32-bit registers per streaming multiprocessor. 65,536 universally for sm_75+.
- Max Regs/Thread: Maximum registers a single thread can use. 255 universally (sub_8688F0 offset +612).
- Max Threads/CTA: Maximum threads per cooperative thread array (block).
- Warps/SM: Total concurrent warps per SM. Determines peak occupancy.
- Max CTAs/SM: Maximum concurrent CTAs per SM.
- Sched Partitions / Dispatch Slots: From sub_8E4400 offset +18 (packed DWORD) and offset +22 (WORD).
- Configurable Shared Memory: Valid shared memory sizes per CTA, selected by cudaFuncSetAttribute.
All Blackwell targets share the 16-partition / 240-slot geometry (identical to Hopper sm_90). The a and f sub-variants within each SM share the same geometry -- differentiation is in compatibility metadata and feature exposure, not in resource limits. The primary distinction across Blackwell SMs is in the latency tables and tcgen05 availability, not in the scheduling partition structure.
Capsule Mercury (capmerc) -- Default Output Format
Capsule Mercury is automatically enabled for all Blackwell targets. When the SM architecture number exceeds 99, ptxas sets the capmerc flag at offset+81 in the compilation context. This applies to sm_100, sm_103, sm_110, sm_120, and sm_121 uniformly.
Three Output Modes
| Mode | String | Default For | SM Range |
|---|---|---|---|
| mercury | "mercury" | sm_75 -- sm_90 | Turing through Hopper |
| capmerc | "capmerc" | sm_100 -- sm_121 | All Blackwell |
| sass | "sass" | None (explicit only) | Any |
Capsule Mercury vs Mercury
Both modes use the same Mercury encoder pipeline (phases 117--122). The capmerc distinction is at the ELF emission level:
- Mercury produces a fully-resolved SASS binary in `.text.<funcname>` sections
- Capsule Mercury wraps Mercury-encoded instructions in `.nv.capmerc<funcname>` sections with a 328-byte capsule descriptor, plus `.nv.merc.*` debug/metadata sections
The capsule descriptor (constructed by sub_1C9C300, 24KB) contains the Mercury instruction stream, relocation metadata (R_MERCURY_* types), KNOBS compilation configuration snapshot, and function-level metadata (register counts, barriers, shared memory usage).
Opportunistic Finalization
Capsule Mercury enables deferred finalization -- compiling once for one SM and reconstituting SASS for a different SM at link or load time.
| Level | Name | Behavior | Example |
|---|---|---|---|
| 0 | default | Standard finalization (compile-target only) | -- |
| 1 | none | No finalization; output stays as capmerc | -- |
| 2 | intra-family | Finalize within same SM family | sm_100 -> sm_103 |
| 3 | intra+inter | Finalize across SM families | sm_100 -> sm_120 |
The compatibility checker sub_60F290 determines whether a capmerc binary compiled for SM X can be finalized for SM Y. On success, ptxas emits: "applied for off-target %u -> %u finalization".
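The level semantics from the table can be sketched as a predicate (illustrative only -- the real compatibility logic lives in sub_60F290 and is considerably more involved than a family check):

```python
FINALIZE_LEVELS = {0: "default", 1: "none", 2: "intra-family", 3: "intra+inter"}

def may_finalize(level: int, src_sm: int, dst_sm: int, same_family: bool) -> bool:
    if level == 0:
        return src_sm == dst_sm      # standard: compile-target only
    if level == 1:
        return False                 # no finalization; output stays capmerc
    if level == 2:
        return same_family           # e.g. sm_100 -> sm_103
    return True                      # level 3: across families, e.g. sm_100 -> sm_120
```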
Self-Check Mechanism
The --self-check CLI option performs roundtrip verification:
- Generate capmerc output (Mercury encoding + metadata)
- Reconstitute SASS from the capmerc data
- Compare section-by-section; report error codes 17 (content mismatch), 18 (count mismatch), 19 (metadata mismatch)
The reconstituted SASS can be dumped with --out-sass for debugging self-check failures.
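The comparison step can be modeled with the three error codes (a sketch; the section model here -- name mapped to a (payload, metadata) pair -- is an assumption, only the codes come from the binary):

```python
ERR_CONTENT, ERR_COUNT, ERR_META = 17, 18, 19  # self-check error codes

def self_check(original: dict, reconstituted: dict) -> int:
    """Section-by-section roundtrip comparison; 0 means the roundtrip matched."""
    if len(original) != len(reconstituted):
        return ERR_COUNT
    for name, (payload, meta) in original.items():
        if name not in reconstituted:
            return ERR_COUNT
        r_payload, r_meta = reconstituted[name]
        if r_payload != payload:
            return ERR_CONTENT
        if r_meta != meta:
            return ERR_META
    return 0
```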
TCGen05 -- 5th Generation Tensor Cores
TCGen05 is the defining hardware feature of Blackwell datacenter parts. It introduces tensor memory (TMEM) as a dedicated register-like storage directly connected to the tensor core, eliminating the shared-memory bottleneck of previous WGMMA designs.
SM Availability
| SM | tcgen05 Available | Notes |
|---|---|---|
| sm_100 / 100a / 100f | Yes | Full datacenter tcgen05 |
| sm_103 / 103a / 103f | Yes | Blackwell Ultra -- same tcgen05 ISA |
| sm_110 / 110a / 110f | Yes | Jetson Thor -- full tcgen05 hardware |
| sm_120 / 120a / 120f | No | Consumer -- no TMEM, no tcgen05 |
| sm_121 / 121a / 121f | No | DGX Spark -- no TMEM, no tcgen05 |
The tcgen05 ISA is gated by SM version checks (visible as sub_70FA00(*, 29) capability queries). sm_120 and sm_121 fall back to inherited HMMA/IMMA/WGMMA tensor core paths.
PTX Instructions
Registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Formatter | Size |
|---|---|---|---|
| tcgen05.alloc | sub_569180 | sub_526370 | 1287B |
| tcgen05.relinquish_alloc_permit | sub_526370 | -- | -- |
| tcgen05.dealloc | sub_58C7F0 | sub_574050 | 2130B |
| tcgen05.ld | sub_574050 | sub_578DB0 | 2466B |
| tcgen05.ld.red | sub_578DB0 | -- | -- |
| tcgen05.st | sub_571FE0 | sub_56C190 | 1842B |
| tcgen05.commit | sub_56C190 | sub_5427F0 | 1575B |
| tcgen05.cp | sub_5427F0 | sub_4F1A90 | 903B |
| tcgen05.shift | sub_4F1A90 | sub_58FA20 | 4604B |
| tcgen05.mma | sub_5BBC30 (90KB) | -- | -- |
| tcgen05.mma.ws | sub_58FA20 | sub_4DA720 | 343B |
The tcgen05.mma codegen handler at sub_5BBC30 is 90KB -- the largest single-instruction handler in ptxas -- reflecting the complexity of 5th-gen tensor core MMA with tensor memory operands, scale factors, sparsity, and accumulator management.
TCGen05 Guardrail Functions
Eight debug/validation functions provide runtime instrumentation for tensor memory operations when compiled with --g-tensor-memory-access-check (or -g-tmem-access-check):
| Guardrail | Formatter | Size | Validation |
|---|---|---|---|
| _tcgen05.guardrails.is_phase_valid | sub_4DA720 | 775B | Phase lifecycle |
| _tcgen05.guardrails.are_columns_allocated | sub_4DDE70 | 599B | Column allocation |
| _tcgen05.guardrails.is_current_warp_valid_owner | sub_4DBF20 | 791B | Warp ownership |
| _tcgen05.guardrails.in_physical_bounds | sub_4DB050 | 439B | Memory bounds |
| _tcgen05.guardrails.allocation_granularity | sub_4F0960 | 839B | Allocation alignment |
| _tcgen05.guardrails.datapath_alignment | sub_4DD580 | 735B | Data path checks |
| _tcgen05.guardrails.sp_consistency_across_idesc_mod | sub_500FA0 | 970B | Sparse descriptor |
| _tcgen05.guardrails.check_sparse_usage | sub_4DDB80 | 743B | Sparsity validation |
The bounds checker (sub_70E0E0, 296 lines decompiled) generates inline PTX code to validate tensor memory column counts, extracting bitfield positions from tcgen05 descriptors (e.g., and.b32 %s, 0x7E0000, %s; shr.u32 %s, %s, 17; mul.lo.u32 %s, %s, 8).
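That PTX sequence is a plain bitfield extraction, sketched here (the field name is ours; the mask, shift, and scale come from the emitted instructions):

```python
def tmem_column_count(descriptor: int) -> int:
    """Mirrors the inline PTX: and.b32 0x7E0000 -> shr.u32 17 -> mul.lo.u32 8.
    Extracts the 6-bit column field at bit 17 and scales it by 8."""
    return ((descriptor & 0x7E0000) >> 17) * 8
```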
TCGen05 Intrinsics
| ID Range | Count | Category |
|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* (trap on validation failure) |
| 0x230--0x239 | 10 | __cuda_sm_10x_* (hmma/imma mdata + bit MMA) |
Additional tcgen05 helper intrinsics observed in decompiled code:
- `__cuda_sm_100_tcgen05_ld_red_immhalfSplitOff` -- load-reduce with immediate half-split offset
- `__cuda_sm_100_tcgen05_ld_immhalfSplitOff` -- load with immediate half-split offset
- `__cuda_sm_100_tcgen05_st_immhalfSplitOff` -- store with immediate half-split offset
- `__cuda_sm_100_tcgen05_ld_red_funcRetArr` -- load-reduce returning array
- `__cuda_sm_100_tcgen05_ld_funcRetArr` -- load returning array
These helpers (decompiled in sub_70D910 and sub_70DDB0) generate inline PTX for complex tensor memory access patterns including array returns via ld.param.b32 sequences.
Bulk Copy Intrinsics (sm_1xx)
18 intrinsics in the __cuda_sm1xx_* namespace cover cp.async.bulk.tensor 1D--5D in tile and im2col modes. These extend the Hopper TMA infrastructure with Blackwell-specific enhancements.
SM 100 / SM 100a / SM 100f -- Blackwell Datacenter
sm_100 is the reference Blackwell architecture. Codegen factory 36864 (0x9000), CUDA_ARCH 1000.
Products
B100, B200 (datacenter GPU), paired as GB200 NVL72 superchips.
Key Features
- TCGen05: Full 5th-gen tensor core with TMEM (alloc, dealloc, ld, st, commit, cp, shift, mma, mma.ws)
- Capsule Mercury: Default output format (auto-enabled for SM > 99)
- WGMMA inherited: Warpgroup MMA from Hopper carries forward
- Cluster operations: Thread-block clusters, distributed shared memory (from Hopper)
- setmaxnreg: Dynamic register allocation (from Hopper)
- Uniform register ALU: UFADD, UFFMA, UFSEL, UFSETP, UVIADDR (Blackwell uniform register ISA additions)
Handler Functions
| Map | Function | Role |
|---|---|---|
| Handler A | sub_609C30 | Primary codegen capability accessor |
| Handler B | sub_609BD0 | Secondary codegen capability accessor |
| Intrinsic init | sub_60A910 | Intrinsic table population |
| Perf/occupancy E | sub_609080 | Performance statistics |
| Perf/occupancy F | sub_6098A0 | Occupancy calculator |
HW Latency Table
sub_8E8A90 (3.0KB) -- the base Blackwell latency table. Two-part structure: a 3.0KB base table for standard instructions plus a ~949-byte TCGEN05 supplement covering tensor core scheduling classes 745--772+.
Profile Object
From sub_6765E0:
SM name: "sm_100"
Compute name: "compute_100"
LTO name: "lto_100"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1000"
The profile constructor stores dword_29FE2C4 = 100 after constructing all sm_100 sub-variants, likely recording the current highest-registered base SM number.
SM 103 / SM 103a / SM 103f -- Blackwell Ultra
sm_103 is Blackwell Ultra, targeting the GB300 NVL72 platform. Codegen factory 36867 (0x9003), CUDA_ARCH 1030.
Products
GB300 (datacenter, Blackwell Ultra). Incremental silicon revision over sm_100.
Differentiation from sm_100
The SASS formatter sub_583190 (rsqrt instruction) explicitly checks for "sm_103" and applies a Blackwell Ultra-specific operand layout. The formatter dispatch first tests raw SM codes (102, 103, 107, 130, 124), then uses check_target_sm(instr, 0, "sm_103") for string-based validation.
From sweep data on the SM103-specific encoding path:
- Sets encoding flags XOR 0x10, XOR 0x40 for sm_103-specific instruction variants
- New operand layout for transcendental instructions (rsqrt, likely also rcp/sqrt)
sm_103 has a separate 618-byte supplementary latency table -- the smallest in the binary -- suggesting minimal scheduling parameter changes from sm_100.
Handler Functions
All unique from sm_100:
| Map | Function |
|---|---|
| Handler A | sub_608F20 |
| Handler B | sub_609D20 |
| Intrinsic init | sub_60A700 |
| Perf/occupancy E | sub_609020 |
| Perf/occupancy F | sub_6091A0 |
Profile Object
SM name: "sm_103"
Compute name: "compute_103"
LTO name: "lto_103"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1030"
SM 110 / SM 110a / SM 110f -- Jetson Thor
sm_110 targets the Jetson Thor SoC for automotive and robotics applications. Codegen factory 36868 (0x9004), CUDA_ARCH 1100.
Products
Jetson Thor (automotive-grade SoC with integrated GPU). Originally internally designated sm_101 before rename.
sm_101 Legacy Alias
sm_101 (with variants sm_101a and sm_101f) was the original internal name for Jetson Thor. It was renamed to sm_110 in a later CUDA release, but all three validation table entries are retained for backward compatibility:
| Table | Entry | PTX ISA | Purpose |
|---|---|---|---|
| Base (unk_1D16220) | {101, 8, 6} | 8.6 | Accepts --gpu-name sm_101 in existing PTX files |
| Accelerated (unk_1D161C0) | {101, 8, 6} | 8.6 | Accepts sm_101a |
| Feature-reduced (unk_1D16160) | {101, 8, 8} | 8.8 | Accepts sm_101f |
The validation tables use bsearch() (sub_484B70 comparator), so both sm_101 and sm_110 are independently findable. However, sub_6765E0 (the profile constructor) registers only sm_110 / sm_110a / sm_110f -- there is no profile object for sm_101. After passing validation, sm_101 must resolve to the sm_110 profile through an internal aliasing path (likely in sub_4B1080, the target directive parser).
The PTX ISA version difference is notable: sm_101 requires PTX 8.6 (same as sm_100), while sm_110 requires PTX 9.0. This reflects the timeline -- sm_101 was named when the Jetson Thor target was first added alongside sm_100, before the sm_110 numbering and PTX 9.0 specification existed.
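The lookup-then-alias behavior can be sketched as follows. The table contents are an illustrative excerpt (only the sm_101/sm_110 entries come from the recovered tables above), and the aliasing function is a guess at what the suspected path in sub_4B1080 must do, not a decompiled transcript:

```python
import bisect

# Illustrative excerpt of a validation table: (sm_number, ptx_major, ptx_minor),
# sorted by sm_number as the binary's bsearch() comparator (sub_484B70) requires.
BASE_TABLE = [(75, 6, 3), (100, 8, 6), (101, 8, 6), (110, 9, 0)]

def lookup_sm(table, sm):
    """Binary search for an SM entry, mirroring the bsearch() over sorted triples."""
    i = bisect.bisect_left(table, (sm,))
    if i < len(table) and table[i][0] == sm:
        return table[i]
    return None

def resolve_profile(sm: int) -> int:
    """sm_101 passes validation via its retained entry but has no profile
    object, so it must alias to sm_110 before profile selection."""
    return 110 if sm == 101 else sm
```

Both sm_101 and sm_110 are independently findable by the search, but only sm_110 reaches the profile constructor.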
Key Characteristics
- Full tcgen05 hardware: Retains datacenter-class tensor memory and tensor core features
- Same WGMMA/cluster support: Inherits Hopper-era warpgroup and cluster operations
- SoC-specific constraints: Differentiated from sm_100 through capability flags, not through missing features -- the capability accessor functions (sub_609F30, sub_608F50) return SoC-appropriate resource limits
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609F30 |
| Handler B | sub_608F50 |
| Intrinsic init | sub_60AA20 |
| Perf/occupancy E | sub_609000 |
| Perf/occupancy F | sub_609280 |
Profile Object
SM name: "sm_110"
Compute name: "compute_110"
LTO name: "lto_110"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1100"
Note: sm_110 uses different xmmword constants for profile fields [5]--[7] compared to sm_100, visible in the profile constructor where v98[5] = xmmword_2027D70; v98[6] = v103 (from v209); v98[7] = v104 (from v207). This encodes different hardware resource parameters.
SM 120 / SM 120a / SM 120f -- Blackwell Consumer
sm_120 targets consumer and enterprise workstation GPUs. Codegen factory 36869 (0x9005), CUDA_ARCH 1200.
Products
RTX 5090, RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 (consumer). RTX Blackwell Pro (enterprise workstation).
The tcgen05 Gap
sm_120 is architecturally distinct from sm_100 despite sharing the "Blackwell" family string. The critical difference: no tcgen05 support. The entire tensor memory subsystem (alloc, dealloc, ld, st, commit, cp, shift, mma) is absent. Tensor operations fall back to:
- HMMA/IMMA inherited from sm_70--sm_89 (direct MMA path)
- WGMMA inherited from sm_90 (warpgroup async MMA)
This is gated by SM version checks in the capability accessor functions. The intrinsic table initializer (sub_608DF0) does not register tcgen05 intrinsic handlers for sm_120.
HW Latency Table
sm_120 has a distinct latency model, split into two parts:
| Function | Size | Content |
|---|---|---|
| sub_8E9000 | 2.9KB | Base consumer Blackwell table |
| sub_8E92E0 | 5.5KB | Extended table (largest individual table in binary) |
The 5.5KB extended table is larger than any other individual latency table, suggesting that sm_120's consumer pipeline has significantly different scheduling characteristics from datacenter sm_100. The consumer pipeline likely has different functional unit counts, memory latencies, and tensor core throughput profiles.
Handler Functions
| Map | Function |
|---|---|
| Handler A | sub_609E40 |
| Handler B | sub_609C60 |
| Intrinsic init | sub_608DF0 |
| Perf/occupancy E | sub_608FE0 |
| Perf/occupancy F | sub_609520 |
Profile Object
SM name: "sm_120"
Compute name: "compute_120"
LTO name: "lto_120"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1200"
sm_120 uses a third distinct set of xmmword constants for profile fields, including xmmword_2027DC0 at field [6], confirming different hardware resource parameters from both sm_100 and sm_110.
SM 121 / SM 121a / SM 121f -- DGX Spark
sm_121 targets the DGX Spark desktop AI workstation. Codegen factory 36869 (0x9005), CUDA_ARCH 1210.
Products
NVIDIA DGX Spark (desktop AI workstation with Grace CPU + Blackwell GPU).
Relationship to sm_120
sm_121 shares the same codegen factory sub-variant (5) as sm_120 in the scheduling sub-architecture table, and inherits sm_120's xmmword profile constants. This suggests sm_121 is a binned or slightly modified sm_120 die, similar to how sm_86 relates to sm_80 in the Ampere generation.
Like sm_120, sm_121 has no tcgen05 support -- tensor operations use the HMMA/IMMA/WGMMA path.
Handler Functions
All unique from sm_120:
| Map | Function |
|---|---|
| Handler A | sub_609ED0 |
| Handler B | sub_609BA0 |
| Intrinsic init | sub_60A4E0 |
| Perf/occupancy E | sub_609040 |
| Perf/occupancy F | sub_6097C0 |
Profile Object
SM name: "sm_121"
Compute name: "compute_121"
LTO name: "lto_121"
Family: "Blackwell"
CUDA_ARCH: "-D__CUDA_ARCH__=1210"
Blackwell Uniform Register ISA
Blackwell extends the uniform register (UR) ISA introduced in Turing/Ampere with dedicated uniform ALU instructions:
| SASS Instruction | Operation | Notes |
|---|---|---|
| UFADD | Uniform floating-point add | New in Blackwell |
| UFFMA | Uniform fused multiply-add | New in Blackwell |
| UFSEL | Uniform floating-point select | New in Blackwell |
| UFSETP | Uniform FP set-predicate | New in Blackwell |
| UVIADDR | Uniform integer-to-address | New in Blackwell |
These instructions execute on the uniform datapath (UDP, functional unit index 9), allowing floating-point uniform computations to stay in the UR file without round-tripping through the R file. Mercury encoding assigns major opcode 0x0E with 6 variants (sub_10C0550) for uniform ALU.
Architecture Version Threshold Checks
All Blackwell targets share codegen factory 36864+ (0x9000+). The binary uses these thresholds to gate Blackwell-specific features:
| Check Pattern | Threshold | Meaning |
|---|---|---|
| encoded > 36863 | > sm_90 extended | Blackwell warp geometry (16 warps, 240 slots) |
| codegen_factory >= 36864 | >= sm_100 | Blackwell generation features |
| codegen_factory == 36864 | sm_100 exactly | sm_100-specific paths |
| SM_number > 99 | sm_100+ | Capsule Mercury auto-enable |
| sub_70FA00(*, 29) | -- | tcgen05 capability query (SM-specific) |
The tcgen05 gating is not a simple threshold -- it uses a per-SM capability query (sub_70FA00 with argument 29) that returns true for sm_100/103/110 and false for sm_120/121.
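The threshold checks and the capability query can be modeled together. This is a sketch of the recovered comparisons; the function names and the dict keys are ours, only the constants and the true/false split come from the binary:

```python
def has_tcgen05(sm: int) -> bool:
    """Models sub_70FA00(*, 29): true for datacenter/SoC Blackwell,
    false for consumer Blackwell (sm_120/sm_121)."""
    return sm in (100, 103, 110)

def blackwell_gates(codegen_factory: int, sm: int) -> dict:
    """The simple threshold checks recovered from the binary."""
    return {
        "blackwell_warp_geometry": codegen_factory > 36863,   # 16 warps, 240 slots
        "blackwell_generation":    codegen_factory >= 36864,
        "sm100_specific":          codegen_factory == 36864,
        "capsule_mercury_default": sm > 99,
    }
```

Note that the tcgen05 split cuts across the thresholds: sm_120 (factory 36869) passes every numeric gate yet fails the capability query.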
BB Initialization (Secondary Encoding)
The basic block initializer sub_6E8EB0 uses a secondary encoding space for Blackwell:
| Secondary Encoding | SM | Flags |
|---|---|---|
| 20480 (0x5000) | sm_100 | Instruction set flags for datacenter |
| 20484 (0x5004) | sm_103 | XOR 0x10, XOR 0x40 for Ultra variants |
This secondary encoding uses generation 5 in the BB init context (5 << 12), separate from the primary codegen factory's generation 9.
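The two encoding spaces share a `generation << 12 | variant` layout. A small sketch makes the arithmetic explicit; note (as an observation from the recovered constants, not a documented rule) that the sub-variant index differs between spaces -- sm_103 is variant 3 in the primary factory but 4 in the BB-init space:

```python
def primary_factory(variant: int) -> int:
    """Primary codegen factory: generation 9 in bits [15:12]."""
    return (9 << 12) | variant      # sm_100 -> 0x9000, sm_103 -> 0x9003

def secondary_encoding(variant: int) -> int:
    """BB-init secondary space: generation 5 in bits [15:12]."""
    return (5 << 12) | variant      # sm_100 -> 0x5000, sm_103 -> 0x5004
```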
Scheduling Profile Differences
Blackwell targets share the 16-warp / 240-slot geometry with Hopper but have distinct latency tables:
| SM | Latency Table | Size | Structure |
|---|---|---|---|
| sm_100 | sub_8E8A90 | 3.0KB | Base + 949B TCGEN05 supplement |
| sm_103 | (supplementary) | 618B | Smallest table -- minimal delta from sm_100 |
| sm_110 | (shared with sm_100 or dedicated) | -- | Not separately identified in sweep |
| sm_120 | sub_8E9000 + sub_8E92E0 | 2.9KB + 5.5KB | Two-part consumer table |
| sm_121 | (likely shares sm_120's table) | -- | Same sub-variant index as sm_120 |
| Universal | sub_8E97B0 | 8.8KB | Fallback / universal |
The 5.5KB extended table for sm_120 is the largest individual latency table in the binary, reflecting the consumer microarchitecture's distinct pipeline design. The sm_100 table uses a supplement mechanism for TCGEN05-specific scheduling classes that the consumer sm_120 table does not need (since sm_120 lacks tcgen05).
Scheduling Class Assignments
From the opcode-to-scheduling-class mapper sub_89FBA0 (85KB), Blackwell-era opcodes use high-numbered scheduling classes:
| Class Range | Category | Architecture |
|---|---|---|
| 700--772+ | Mercury/Blackwell tensor ops | sm_100+ with tcgen05 |
| 745 | WGMMA primary | sm_90+ (Hopper carry-forward) |
| 744 | WGMMA variant | sm_90+ |
| 765--767 | BGMMA/QMMA (Blackwell-specific MMA types) | sm_100+ |
| 759 | HMMA/BMMA tensor core | sm_100+ |
| 757, 761 | Narrow/wide DP tensor | sm_100+ |
| 600, 604 | Tensor fence / tensor sync | sm_90+ |
Intrinsic Table
Blackwell intrinsic availability is cumulative -- all sm_70, sm_80, sm_8x, and sm_9x intrinsics carry forward. Blackwell adds two new intrinsic groups:
sm_10x Intrinsics (21 entries)
| ID Range | Count | Namespace | Category |
|---|---|---|---|
| 0x20--0x2A | 11 | __cuda_sm10x_tcgen05_guardrail_trap_* | Trap on guardrail validation failure |
| 0x230--0x239 | 10 | __cuda_sm_10x_* | hmma/imma mdata + bit MMA (Blackwell-specific shapes) |
sm_1xx Intrinsics (18 entries)
| Namespace | Count | Category |
|---|---|---|
| __cuda_sm1xx_* | 18 | cp.async.bulk.tensor 1D--5D tile/im2col (extended bulk copy) |
OCG (Optimizing Code Generator) Intrinsics
The OCG builtin name table at sub_6C9EB0 (13KB) contains the master list of Blackwell+ runtime-generated intrinsic names:
cp_async_bulk, cp_red_async_bulk, cp_async_tensor,
cp_async_prefetch_tensor, fence_view_async,
viaddmax, viaddmin, viadd, vimax, vimin, vimax3, vimin3,
write_async, tcbar, mmareadshma, tcmma,
tcshift, tcatomsws, tcldsws, tcstsws,
gdesc, breuse, bkeep, virtcount,
memclear, acqshminit, sparsify, spfactor2to4,
2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit,
16dp32bitt0t15, 16dp32bitt16t31,
selfcast, broadcast, findandset, align
These names represent the Blackwell SASS operations exposed through the OCG intrinsic interface, covering tensor core scheduling (tcbar, tcmma, tcshift), sparse operations (sparsify, spfactor2to4), integer min/max variants (viaddmax, viaddmin), and async memory operations.
New PTX Instructions (Blackwell-Specific)
Beyond tcgen05, Blackwell introduces or extends several instruction families visible in the opcode dispatch:
| Instruction | Category | Evidence |
|---|---|---|
| tcgen05.* (11 instructions) | Tensor memory ops | sub_5D4190 registration |
| fence_view_async | Memory ordering | OCG builtin table |
| write_async | Async writes | OCG builtin table |
| viaddmax / viaddmin | Integer add-with-max/min | OCG builtin table |
| BGMMA / QMMA | Block/quantized MMA | Scheduling classes 765--767 |
CLI Options -- Tensor Memory Checks
Two CLI options control tcgen05 guardrail instrumentation:
| Option | Short Form | Default | Description |
|---|---|---|---|
| --g-tensor-memory-access-check | -g-tmem-access-check | Enabled with -g | Enable tensor memory access checks for tcgen05 operations |
| --gno-tensor-memory-access-check | -gno-tmem-access-check | false | Disable checks (overrides the above) |
When enabled, the compiler inserts inline validation code (the 8 guardrail functions) around tcgen05 operations. These emit trap instructions if tensor memory invariants are violated at runtime -- useful for debugging TMEM allocation errors, bounds violations, and ownership conflicts.
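The interaction of the two flags can be sketched as a small resolver. The precedence (the negative form wins regardless of position, and `-g` alone enables the checks) follows the option descriptions above; the function name and argument handling are our assumptions, not recovered code:

```python
def tmem_checks_enabled(args: list) -> bool:
    """Resolve the tcgen05 guardrail-instrumentation switch from a ptxas
    argument list (sketch of the documented flag semantics)."""
    negative = ("--gno-tensor-memory-access-check", "-gno-tmem-access-check")
    positive = ("--g-tensor-memory-access-check", "-g-tmem-access-check")
    if any(a in negative for a in args):
        return False            # the disable form overrides everything
    if any(a in positive for a in args):
        return True
    return "-g" in args         # default: enabled when debug info is requested
```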
Feature Comparison
| Feature | sm_100 | sm_103 | sm_110 | sm_120 | sm_121 |
|---|---|---|---|---|---|
| Codegen factory | 36864 | 36867 | 36868 | 36869 | 36869 |
| Sub-arch variant | 0 | 3 | 4 | 5 | 5 |
| Family string | Blackwell | Blackwell | Blackwell | Blackwell | Blackwell |
| Capsule Mercury default | Yes | Yes | Yes | Yes | Yes |
| tcgen05 (tensor memory) | Yes | Yes | Yes | No | No |
| WGMMA (from Hopper) | Yes | Yes | Yes | Yes | Yes |
| Cluster operations | Yes | Yes | Yes | Yes | Yes |
| setmaxnreg | Yes | Yes | Yes | Yes | Yes |
| Uniform FP ALU (UFADD etc.) | Yes | Yes | Yes | Yes | Yes |
| BGMMA/QMMA | Yes | Yes | Yes | ? | ? |
| Guardrail instrumentation | Yes | Yes | Yes | N/A | N/A |
| HW latency table | 3.0KB + 949B | 618B (supp.) | -- | 2.9KB + 5.5KB | (shared w/ sm_120) |
| a sub-variant | Yes | Yes | Yes | Yes | Yes |
| f sub-variant | Yes | Yes | Yes | Yes | Yes |
| Products | B100/B200 | GB300 | Jetson Thor | RTX 50xx | DGX Spark |
SASS Instruction Encoding
Blackwell continues the 128-bit per-instruction format introduced in Turing. The Mercury encoder handles the SM100+ instruction set through a dedicated encoding subsystem spanning approximately 851KB in the address range 0xDFC000--0x107B000.
The encoding subsystem covers 16 instruction format groups with full Blackwell ISA support including:
- Standard ALU/FPU/memory operations (inherited from earlier architectures)
- TCGEN05 tensor memory operations (new encoding classes)
- BGMMA/QMMA block-scale and quantized MMA variants
- Extended bulk copy operations (UBLKCP variants)
- Sparse tensor operations
Function Map
| Address | Size | Identity | SM | Confidence |
|---|---|---|---|---|
| sub_607DB0 | 14KB | Capability dispatch (all Blackwell registrations) | all | 99% |
| sub_608DF0 | ~1KB | Intrinsic table initializer | sm_120 | 85% |
| sub_608F20 | ~1.2KB | Handler A (capability accessor) | sm_103 | 90% |
| sub_608F50 | ~1.2KB | Handler B (capability accessor) | sm_110 | 90% |
| sub_609000 | ~200B | Perf/occupancy E | sm_110 | 85% |
| sub_609020 | ~200B | Perf/occupancy E | sm_103 | 85% |
| sub_609040 | ~200B | Perf/occupancy E | sm_121 | 85% |
| sub_609080 | ~200B | Perf/occupancy E | sm_100 | 85% |
| sub_6091A0 | ~200B | Perf/occupancy F | sm_103 | 85% |
| sub_609280 | ~200B | Perf/occupancy F | sm_110 | 85% |
| sub_609520 | ~200B | Perf/occupancy F | sm_120 | 85% |
| sub_6097C0 | ~200B | Perf/occupancy F | sm_121 | 85% |
| sub_6098A0 | ~200B | Perf/occupancy F | sm_100 | 85% |
| sub_609BA0 | ~48B | Handler B | sm_121 | 99% |
| sub_609BD0 | ~48B | Handler B | sm_100 | 99% |
| sub_609C30 | ~48B | Handler A | sm_100 | 99% |
| sub_609C60 | ~48B | Handler B | sm_120 | 99% |
| sub_609D20 | ~48B | Handler B | sm_103 | 99% |
| sub_609E40 | ~48B | Handler A | sm_120 | 99% |
| sub_609ED0 | ~48B | Handler A | sm_121 | 99% |
| sub_609F30 | ~48B | Handler A | sm_110 | 99% |
| sub_60A4E0 | ~1KB | Intrinsic table initializer | sm_121 | 85% |
| sub_60A700 | ~1KB | Intrinsic table initializer | sm_103 | 85% |
| sub_60A910 | ~1KB | Intrinsic table initializer | sm_100 | 85% |
| sub_60AA20 | ~1KB | Intrinsic table initializer | sm_110 | 85% |
| sub_60F290 | est. | Off-target capmerc compatibility checker | all | 75% |
| sub_612DE0 | 47KB | Kernel finalizer / ELF builder | all | 80% |
| sub_6765E0 | 54KB | Profile constructor (Blackwell entries at lines 600--1330) | all | 95% |
| sub_703AB0 | 10KB | CLI option parser (capmerc/mercury/sass) | all | 90% |
| sub_70D910 | 24 lines | tcgen05 immhalfSplitOff helper | sm_100 | 90% |
| sub_70DDB0 | 47 lines | tcgen05 funcRetArr helper | sm_100 | 90% |
| sub_70E0E0 | 296 lines | tcgen05 guardrail bounds checker | sm_100 | 90% |
| sub_8E4400 | 3.3KB | Warp geometry initializer (Blackwell = 16 warps, 240 slots) | all | 90% |
| sub_8E8A90 | 3.0KB | HW latency table (Blackwell datacenter) | sm_100 | 85% |
| sub_8E9000 | 2.9KB | HW latency table (consumer base) | sm_120 | 85% |
| sub_8E92E0 | 5.5KB | HW latency table (consumer extended) | sm_120 | 85% |
| sub_8E97B0 | 8.8KB | Universal fallback latency table | all | 85% |
| sub_89FBA0 | 85KB | Opcode-to-scheduling-class mapper | all | 90% |
| sub_5BBC30 | 90KB | tcgen05.mma codegen handler | sm_100 | 98% |
| sub_5D4190 | -- | Opcode dispatch table (tcgen05 registrations) | all | 99% |
| sub_1C9C300 | 24KB | Capsule Mercury section processor | all | 85% |
| sub_1C9B110 | 23KB | Mercury capsule builder | all | 85% |
Cross-References
- SM Architecture Map -- Validation tables, codegen factory encoding, suffix semantics
- Turing & Ampere (SM 75--88) -- Baseline features that Blackwell inherits
- Ada & Hopper (SM 89--90a) -- WGMMA, clusters, TMA -- carried forward to Blackwell
- TCGen05 -- 5th Gen Tensor Cores -- Detailed tcgen05 ISA, TMEM model, instruction encoding
- Capsule Mercury & Finalization -- Capmerc format, ELF structure, finalization levels
- Mercury Encoder -- Shared encoding pipeline for all Mercury/capmerc output
- Intrinsic Table (608 Entries) -- Full intrinsic catalog with sm_10x and sm_1xx ranges
- Latency Model & HW Profiles -- Per-SM scheduling parameters and functional units
- Uniform Register Optimization -- UR-file passes, Blackwell uniform ALU additions
- CLI Options -- --gpu-name, --binary-kind, tensor memory check flags
- SASS Instruction Encoding -- 128-bit format, Mercury encoder selection
TCGen05 -- 5th Generation Tensor Cores
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
TCGen05 is the Blackwell-generation tensor core instruction family introduced with SM 100. It replaces Hopper's WGMMA with a descriptor-based programming model that operates on Tensor Memory (TMEM) -- a dedicated register-file-like storage visible only to the tensor core. ptxas implements TCGen05 as 13 PTX instruction mnemonics (plus 8 debug guardrails), backed by a 90KB MMA codegen function, 11 SASS opcode groups (28 encoding variants), and a set of compiler-inserted validation hooks. TCGen05 is absent on sm_120/sm_121 (consumer Blackwell).
| Target architectures | sm_100, sm_100a, sm_100f, sm_103, sm_103a, sm_103f, sm_110, sm_110a, sm_110f |
| NOT available | sm_120, sm_121 (consumer/DGX Spark) -- gated by SM version checks |
| Capability check | sub_70FA00(*, 29) -- returns true for tcgen05-capable targets |
| PTX instructions | 13: alloc, dealloc, relinquish_alloc_permit, ld, ld.red, st, commit, cp, shift, fence, wait, mma, mma.ws |
| Guardrail instructions | 8: is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage |
| SASS opcode range | Opcodes 122--139 (TMEM operations), 213--221 (TCGEN05_MMA/FENCE, TMEM extended), 342--372 (TCGEN05 control) |
| Codegen factory | 36864 (9 << 12) -- shared across all Blackwell targets |
| MMA codegen | sub_5BBC30 (90KB) |
| PTX validator | sub_4C5FB0 (28KB -- shared MMA/WMMA/tcgen05 validator) |
| Intrinsic handler | sub_6D7AF0 (19KB -- TCGen05 MMA handler) |
| Intrinsic validator | sub_6D69B0 (12KB -- TCGen05 MMA validator) |
| EIATTR markers | EIATTR_TCGEN05_1CTA_USED, EIATTR_TCGEN05_2CTA_USED |
| Version constraint | Objects using tcgen05 from CUDA 12.x cannot link with 13.0+; must rebuild |
Architecture Overview
Descriptor-Based Model
TCGen05 abandons the register-operand model of previous tensor core generations (WMMA, HMMA, WGMMA) in favor of descriptors. The instruction descriptor (idesc) encodes the matrix operation configuration -- dimensions, data types, data path width, sparsity, and layout. The descriptor is passed as an operand to tcgen05.mma rather than encoded in the instruction mnemonic.
This design decouples the instruction encoding from the operation specification. Where WGMMA required hundreds of distinct intrinsic hash entries to cover every shape/type/layout combination, tcgen05 uses a single instruction with different descriptor values. The ~400 numeric MMA hash entries in the intrinsic dispatch table (at a1+816 in sub_5D4190) map WGMMA variants; tcgen05 replaces that complexity with descriptor-driven dispatch.
Tensor Memory (TMEM)
TMEM is a dedicated storage region private to the tensor core unit. It is not part of the general register file and is not directly addressable by non-tensor-core instructions. TMEM is organized into columns that are allocated, used, and deallocated explicitly by the programmer.
Key properties from binary analysis:
- Column-based allocation: tcgen05.alloc reserves columns; tcgen05.dealloc releases them
- Two CTA granularities: Operations execute at .cta_group::1 (single CTA) or .cta_group::2 (CTA pair) granularity. A function cannot mix both -- ptxas enforces: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
- Allocation tracking: The compiler inserts reserved shared memory variables to track allocation state:
  - __nv_reservedSMEM_tcgen05_partition -- partition identifier
  - __nv_reservedSMEM_allocation_phase -- current allocation phase
  - __nv_reservedSMEM_allocation_mask -- bitmask of allocated columns
  - __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier -- mbarrier for allocation pipeline
  - __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity -- parity tracking
TMEM Address Computation
Tensor memory addresses are computed through a standardized pattern visible in the TMEM address generator functions (sub_70E740, sub_70E940, sub_70EB00):
cvt.u32.u64 __cuda_sm_100_tcgen05_tmem_addr_base, %s;
add.u32 %s, __cuda_sm_100_tcgen05_tmem_addr_base, %s;
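Arithmetically, the two-instruction pattern truncates a 64-bit base to 32 bits and adds a 32-bit offset, both with u32 wraparound. A minimal sketch of that semantics (the function name is ours; the wrap behavior follows PTX u32 arithmetic, not any recovered code path):

```python
def tmem_address(base_u64: int, offset_u32: int) -> int:
    """Models the recovered PTX pattern:
        cvt.u32.u64  base32, base64     (truncating conversion)
        add.u32      addr,  base32, off (32-bit wrapping add)
    """
    base32 = base_u64 & 0xFFFFFFFF
    return (base32 + offset_u32) & 0xFFFFFFFF
```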
Five named TMEM address roles exist for MMA operations:
| Address Role | Intrinsic Name | Purpose |
|---|---|---|
| D (destination) | __cuda_sm10x_tcgen05_mma_tmemD | Accumulator/output matrix |
| A (input) | __cuda_sm10x_tcgen05_mma_tmemA | Left input matrix |
| Scale A | __cuda_sm10x_tcgen05_mma_scaleTmemA | Scale factors for A |
| Scale B | __cuda_sm10x_tcgen05_mma_scaleTmemB | Scale factors for B |
| Sparse Meta | __cuda_sm10x_tcgen05_mma_spMetaTmem | Sparsity metadata |
Constraint from the binary: "URa must be uint32 when URa is TMEM" -- uniform registers addressing TMEM must use 32-bit unsigned integers. When addressing a global descriptor: "URa must be uint64 when URa is GDESC".
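The two operand-width constraints can be expressed as one check. The error texts are verbatim binary strings; the function name and calling convention are our sketch, not the decompiled validator:

```python
def check_ura_width(kind: str, width_bits: int) -> None:
    """Enforce the recovered URa operand-width rules for TMEM and GDESC."""
    if kind == "TMEM" and width_bits != 32:
        raise ValueError("URa must be uint32 when URa is TMEM")
    if kind == "GDESC" and width_bits != 64:
        raise ValueError("URa must be uint64 when URa is GDESC")
```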
PTX Instruction Set
Lifecycle Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.alloc | 0x526370 | 1,287 B | Allocate TMEM columns for tensor core use |
| tcgen05.dealloc | 0x574050 | 2,130 B | Release allocated TMEM columns |
| tcgen05.relinquish_alloc_permit | 0x58C7F0 | 4,282 B | Relinquish allocation permit (multi-CTA coordination) |
The alloc instruction has two CTA-granularity variants visible in the prototype strings:
- __cuda_sm10x_tcgen05_alloc_one_sm -- single-SM allocation (.cta_group::1)
- __cuda_sm10x_tcgen05_alloc_two_sm -- two-SM allocation (.cta_group::2)
Both take a destination pointer argument (__cuda_sm10x_tc_alloc_dst_ptr_arg) and a column count (__cuda_sm10x_tc_alloc_num_cols_arg).
Data Movement Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.ld | 0x578DB0 | 2,466 B | Load data into TMEM from shared/global memory |
| tcgen05.ld.red | 0x571FE0 | 2,066 B | Load with reduction (accumulate into TMEM) |
| tcgen05.st | 0x56C190 | 1,842 B | Store data from TMEM to shared/global memory |
| tcgen05.cp | 0x5427F0 | 903 B | Copy between TMEM regions (intra-tensor-core) |
Three intrinsic helper arrays support the ld/st/ld.red operations:
| Helper | Purpose |
|---|---|
| __cuda_sm_100_tcgen05_ld_funcRetArr | Return array descriptor for loads |
| __cuda_sm_100_tcgen05_ld_red_funcRetArr | Return array descriptor for load-reduce |
| __cuda_sm_100_tcgen05_st_funcInputArr | Input array descriptor for stores |
Each has a corresponding immhalfSplitOff parameter controlling split behavior:
- __cuda_sm_100_tcgen05_ld_immhalfSplitOff
- __cuda_sm_100_tcgen05_ld_red_immhalfSplitOff
- __cuda_sm_100_tcgen05_st_immhalfSplitOff
Synchronization Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.commit | 0x5427F0 | 1,575 B | Commit pending tensor core operations |
| tcgen05.fence | (inline) | -- | Fence preventing reordering of tcgen05 operations |
| tcgen05.wait | (inline) | -- | Wait for committed tcgen05 operations to complete |
| tcgen05.shift | 0x58FA20 | 4,604 B | Shift accumulator data within TMEM (shared formatter with mma) |
Compute Instructions
| PTX Instruction | Formatter Address | Size | Purpose |
|---|---|---|---|
| tcgen05.mma | 0x5BBC30 (codegen) / 0x58FA20 (formatter) | 90KB / 4,604 B | Matrix multiply-accumulate |
| tcgen05.mma.ws | 0x4DA720 (formatter) | 343 B | Warp-specialized MMA variant |
TCGen05.MMA -- Matrix Multiply-Accumulate
Codegen Function: sub_5BBC30 (90KB)
The largest per-instruction codegen function for TCGen05. Registered as the "tcgen05.mma" handler in sub_5D4190 (the intrinsic dispatch table builder). The function:
- Allocates a 50,000-byte working buffer
- Queries sub_70FA00(*, 29) to validate tcgen05 capability on the current target
- Processes the instruction descriptor to determine operation parameters
- Generates tensor memory addressing code for all operands (D, A, scaleA, scaleB, sparsity meta)
- Emits the final MMA instruction encoding
MMA Modifiers
The binary reveals a rich set of MMA modifiers extracted by functions in the sub_70D1F0--sub_70D410 cluster:
| Modifier | String | Purpose |
|---|---|---|
| .o128 | ".o128" | 128-bit output size |
| .transA | ".transA" | Transpose A matrix |
| .transB | ".transB" | Transpose B matrix |
| .negA | "_negA" | Negate A matrix |
| .negB | "_negB" | Negate B matrix |
| _expand16bit | "_expand16bit" | 16-bit expansion mode |
| _pack16bit | "_pack16bit" | 16-bit packing mode |
| _maxabs | "_maxabs" | Maximum absolute value reduction |
| _minabs | "_minabs" | Minimum absolute value reduction |
| _fused | "_fused" | Fused operation mode |
| _blockscale | "_blockscale" | Block scaling (MX format support) |
| _ashift | "_ashift" | A-matrix shift |
| _areuse | "_areuse" | A-matrix register reuse |
| _akeep | "_akeep" | A-matrix keep (preserve for reuse) |
Data Path Configurations
The MMA data path width determines the number of elements processed per cycle and the accumulator layout. Six configurations exist:
| Data Path | Interpretation |
|---|---|
| _4dp256bit | 4 data paths, 256 bits each |
| _16dp32bit | 16 data paths, 32 bits each (two sub-variants: t0t15, t16t31) |
| _32dp32bit | 32 data paths, 32 bits each |
| _16dp256bit | 16 data paths, 256 bits each |
| _128dp256bit | 128 data paths, 256 bits each |
Constraint: "fused and l16dp32bit must be specified together" -- the fused mode requires the 16dp32bit data path.
Block Scaling (MX Format)
TCGen05 adds native block scaling support for microscaling (MX) format operations, visible through the tcmma prefix strings:
"tcmma_*_o must be specified with blockscale"-- output modifier requires blockscale"uri width for tcmma_*_o must be 2"-- output uniform register index width must be 2"tcmma_*_q with blockscale must have uri width of 2"-- quantization with blockscale"tcmma_*_mxq must be specified with blockscale"-- MX quantization requires blockscale
Warp-Specialized MMA (.ws)
The .ws modifier enables warp-specialized execution where different warps in a warpgroup contribute to different phases of the MMA pipeline. Constraints from the binary:
"When using buffer1-3, WS modifier must be specified"-- triple buffering requires.ws"ws opcode modifier not allowed with .2CTA"-- warp specialization is single-CTA only"ws opcode modifier not allowed with areuse or akeep"--.wsincompatible with A-matrix reuse"ws opcode modifier not allowed with ashift"--.wsincompatible with A-matrix shift
Triple-buffer register reuse strings for .ws mode:
| Buffer | Variant |
|---|---|
| _breuse_bkeep_buffer1 | B-reuse + B-keep, buffer 1 |
| _breuse_buffer1 | B-reuse, buffer 1 |
| _breuse_bkeep_buffer2 | B-reuse + B-keep, buffer 2 |
| _breuse_buffer2 | B-reuse, buffer 2 |
| _breuse_bkeep_buffer3 | B-reuse + B-keep, buffer 3 |
| _breuse_buffer3 | B-reuse, buffer 3 |
Sparsity Support
TCGen05 supports structured sparsity through the sparsity metadata TMEM address (spMetaTmem). The _ashift modifier is constrained: "Ashift can only be specific when URa is in TMEM".
SASS Encoding
Opcode Map
TCGen05 SASS instructions span three opcode regions in the SM 100 SASS ISA. The encoding information comes from the latency model tables (sub_8E8A90 for sm_100) and the master instruction encoder (sub_6D9690, 94KB).
TMEM Operations (Opcodes 122--139)
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 122 | 2 | TMEM_OP / new ISA | F1F08, F1C60 | 3-op, reg10 |
| 123 | 6 | TMEM_LD (tensor mem load) | F1F08, F1DF8 | 2--3 op |
| 125 | 6 | TMEM_ST (tensor mem store) | F1F08, F1DF8 | 2--3 op |
| 127 | 9 | TMEM_ALLOC / FENCE | F1F08..F29A8 | 3--6 op |
| 129 | 3 | TMEM extended | F1F08 | 2 op |
| 130 | 26 | EXTENDED_MOV / TMEM_MVA | F1F08..F2678 | 2--9 op |
| 131 | 3 | EXTENDED_ALU / UTMA | F21B0 | 4--5 op |
| 133 | 1 | UTMA variant | F21B0 | 4 op |
| 139 | 4 | TCGEN05 operations | F21B0, F2568 | 4--8 op |
TCGEN05 MMA/FENCE (Opcodes 213--221)
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 213 | 6 | TCGEN05_MMA | F2678 | 5--7 op |
| 216 | 2 | TCGEN05_FENCE | F2678 | 3--4 op |
| 219 | 6 | TMEM_LD extended | F1C60..F2810 | 3--7 op |
| 220 | 1 | TMEM_ST extended | F1C60 | 3 op |
| 221 | 1 | TMEM_PREFETCH | F1C60 | 3 op |
| 255 | 1 | SETSTMEMADDR | F1F08 | 1 op |
| 269 | 4 | TMEM_ALLOC_FENCE ext | F2018, F1DF8 | 2--3 op |
TCGEN05 Control (Opcodes 342--372)
28 encoding variants across 10 opcodes. These are the primary tensor core pipeline control instructions:
| Opcode | Variants | Category | Encoding Class | Operands |
|---|---|---|---|---|
| 342 | 1 | TCGEN05 ctrl A | F1F08 | 0 op (scheduling marker) |
| 343 | 1 | TCGEN05 ctrl B | F1F08 | 0 op (scheduling marker) |
| 344 | 14 | TCGEN05 execute | F1F08..F3008 | 2--7 op |
| 346 | 4 | TCGEN05 commit | F1F08, F2018 | 2--3 op |
| 349 | 1 | TCGEN05 sync | F1D70 | 0 op |
| 359 | 3 | TCGEN05 alloc | F1D70, F1F08 | 0--2 op |
| 369 | 1 | TCGEN05 dealloc | F1F08 | 0 op |
| 370 | 1 | TCGEN05 release A | F1D70 | 0 op |
| 371 | 1 | TCGEN05 release B | F1D70 | 0 op |
| 372 | 1 | TCGEN05 release C | F1D70 | 0 op |
Opcode 344 (TCGEN05 execute) has the most variants (14), spanning encoding classes from F1F08 to F3008 with 2 to 7 operands. This is the actual MMA dispatch instruction -- the wide encoding range reflects the variety of descriptor configurations, operand modes, and data path widths.
Encoding Class Distribution
The encoding classes used by TCGen05 SASS instructions:
| Class (hex) | Role | Used By |
|---|---|---|
| F1D70 | Control/sync | alloc (0-op), sync, release A/B/C |
| F1F08 | General | ctrl markers, execute, commit, alloc, dealloc, TMEM ops |
| F1C60 | Extended | TMEM_LD/ST extended, TMEM_PREFETCH |
| F1DF8 | Standard | TMEM_LD/ST, TMEM_ALLOC_FENCE ext |
| F2018 | Commit ext | TCGEN05 commit, TMEM_ALLOC_FENCE ext |
| F21B0 | ALU | TCGEN05 operations, UTMA |
| F2568 | TCGEN05 ops | TCGEN05 operations |
| F2678 | MMA/FENCE | TCGEN05_MMA, TCGEN05_FENCE |
| F29A8 | TMEM_ALLOC | TMEM_ALLOC/FENCE |
| F2810 | Extended | TMEM_LD extended |
| F3008 | Execute max | TCGEN05 execute (high-operand-count) |
Latency Model
The sm_100 latency table (sub_8E8A90) uses a two-part structure: a 3.0KB base table covering standard instructions and a 949-byte supplement dedicated to TCGEN05 operations. The sm_120 consumer Blackwell table (sub_8E9000 + sub_8E92E0, 5.5KB) is the largest individual table and does not include TCGEN05 entries (confirming the feature's absence on consumer silicon).
CTA Granularity
TCGen05 operations specify whether they execute at single-CTA or CTA-pair granularity through the .cta_group modifier:
| Granularity | Modifier | EIATTR | ELF Marker |
|---|---|---|---|
| Single CTA | .cta_group::1 | EIATTR_TCGEN05_1CTA_USED | TC_1CTA |
| CTA Pair | .cta_group::2 | EIATTR_TCGEN05_2CTA_USED | TC_2CTA |
The compiler emits the appropriate EIATTR marker into the output cubin based on which granularity the kernel uses. The CUDA runtime uses this to configure the CTA launch parameters.
The binary enforces exclusivity: a single function cannot mix .cta_group::1 and .cta_group::2 operations. The error message is explicit: "Function '%s' uses single CTA(.cta_group::1) and CTA pair granularity(.cta_group::2) and that is not allowed."
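The exclusivity rule and marker selection can be sketched as a small checker. This is an inferred model of the behavior described above, not decompiled output; the function name and the op representation are hypothetical, while the error text and EIATTR names are the recovered strings.

```python
# Hypothetical model of the .cta_group exclusivity check and EIATTR
# marker selection. The EIATTR_* names and the error message are the
# recovered strings; the scanning logic is an inference.
def check_cta_groups(func_name, ops):
    groups = {op["cta_group"] for op in ops if "cta_group" in op}
    if groups == {1, 2}:
        raise ValueError(
            f"Function '{func_name}' uses single CTA(.cta_group::1) and "
            "CTA pair granularity(.cta_group::2) and that is not allowed."
        )
    if groups == {1}:
        return "EIATTR_TCGEN05_1CTA_USED"
    if groups == {2}:
        return "EIATTR_TCGEN05_2CTA_USED"
    return None  # kernel uses no tcgen05 operations
```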
ELF/Cubin Markers
EIATTR Entries
| EIATTR Name | Purpose |
|---|---|
EIATTR_TCGEN05_1CTA_USED | Kernel uses tcgen05 at single-CTA granularity |
EIATTR_TCGEN05_2CTA_USED | Kernel uses tcgen05 at CTA-pair granularity |
EICOMPAT Attributes
| EICOMPAT Name | Purpose |
|---|---|
EICOMPAT_ATTR_INST_TCGEN05_MMA | Kernel uses tcgen05.mma instructions |
EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED | Kernel uses deprecated (12.x-era) tcgen05.mma encoding |
Entry Fragment Markers
TMEM usage per-CTA is recorded in entry fragment markers:
| Marker | Version | Purpose |
|---|---|---|
AT_ENTRY_FRAGMENT_TMEM_CTA1 | V1 | TMEM usage for single-CTA kernels |
AT_ENTRY_FRAGMENT_TMEM_CTA2 | V1 | TMEM usage for CTA-pair kernels |
AT_ENTRY_FRAGMENT_TMEM_CTA1_V2 | V2 | TMEM usage V2 format, single-CTA |
AT_ENTRY_FRAGMENT_TMEM_CTA2_V2 | V2 | TMEM usage V2 format, CTA-pair |
Guardrail Debug Instrumentation
When compiling with -g (debug mode), ptxas inserts runtime validation checks around tcgen05 operations. These are controlled by the --g-tensor-memory-access-check / --gno-tensor-memory-access-check CLI options.
Guardrail Check Functions
Eight _tcgen05.guardrails.* pseudo-instructions insert inline validation code:
| Guardrail | Formatter Address | Size | Validates |
|---|---|---|---|
is_phase_valid | 0x4DDE70 | 775 B | Allocation phase is correct for the operation |
are_columns_allocated | 0x4DBF20 | 599 B | Accessed columns are currently allocated |
is_current_warp_valid_owner | 0x4DE180 | 791 B | Current warp owns the accessed TMEM region |
in_physical_bounds | 0x4DB050 | 439 B | Column access is within physical TMEM bounds |
allocation_granularity | 0x4F0960 | 839 B | Column count meets granularity requirements |
datapath_alignment | 0x4DD580 | 735 B | TMEM address is aligned for the data path width |
sp_consistency_across_idesc_mod | 0x500FA0 | 970 B | Sparsity settings in descriptor match modifier |
check_sparse_usage | 0x4DDB80 | 743 B | Sparse mode usage is valid for the environment |
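The simplest of these checks, in_physical_bounds, reduces to a range test. A minimal model follows; the 512-column TMEM extent is an assumption based on public Blackwell tensor-memory documentation, not a value recovered from the formatter itself, which emits inline PTX performing an equivalent check.

```python
# Minimal model of the in_physical_bounds guardrail (sub_4DB050).
# TMEM_PHYSICAL_COLUMNS = 512 is an assumed per-CTA column count.
TMEM_PHYSICAL_COLUMNS = 512

def in_physical_bounds(start_col, num_cols):
    # A column access is valid only if the whole [start, start+num) range
    # lies inside the physical tensor memory.
    return 0 <= start_col and start_col + num_cols <= TMEM_PHYSICAL_COLUMNS
```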
Guardrail Trap Functions
When a guardrail check fails, it calls a trap function that reports the violation and terminates:
| Trap Intrinsic | Parameters |
|---|---|
__cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc | (.reg .b32 phase) |
__cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid | (.reg .b32 tmem_start_lane_accessed, .reg .b32 cur_warp_id, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access | (.reg .b32 col_no_accessed, .reg .b32 alloced_mask, .reg .b32 instr_kind) |
__cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced | (.reg .b32 col_no_being_dealloced, .reg .b32 alloced_mask) |
__cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | (.reg .b32 col_no_being_dealloced_not_returned_by_alloc, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid | (.reg .b32 nCols) |
__cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds | (.reg .b32 oob_access_col_no, .reg .b32 instr_kind) |
__cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment | (.reg .b32 dp_lane, .reg .b32 matrix_kind, .reg .b32 valid_alignment_kind, ...) |
__cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | (.reg .b32 idesc_sp_enabled, .reg .b32 mod_sp_enabled) |
__cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env | (.reg .b32 idesc_sp_enabled, .reg .b32 idesc, .reg .b32 mma_kind, .reg .b32 ptx_target, .reg .b32 is_family_portable) |
These are intrinsic IDs 0x20--0x2A (11 entries total including a mask creation helper) in the intrinsic table.
Guardrail Check Wrappers
The compiler also generates .FORCE_INLINE wrapper functions that combine multiple checks:
| Wrapper | Parameters |
|---|---|
__cuda_sm10x_tcgen05_guardrails_check_phase_validity | (.reg .u32 dummyInp) |
__cuda_sm10x_tcgen05_guardrails_check_column_allocation | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
__cuda_sm10x_tcgen05_guardrails_check_datapath_validity | (.reg .u32 tmem_addr, .reg .u32 ld_or_st) |
__cuda_sm10x_tcgen05_guardrails_check_physical_bounds | (.reg .u32 start_col_num, .reg .u32 num_of_cols, ...) |
__cuda_sm10x_tcgen05_guardrails_check_allocation_granularity | (.reg .u32 num_of_cols) |
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment | (.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group, ...) |
Bulk Copy Operations (cp.async.bulk.tensor)
TCGen05 is complemented by asynchronous bulk copy operations for loading data into tensor memory. These are registered as separate intrinsic IDs (0x2B--0x3C, 18 entries) under the __cuda_sm1xx_* naming convention:
| Operation | Codegen Handler | Size |
|---|---|---|
cp.async.bulk.tensor (1D--5D, tile/im2col, unicast/multicast) | sub_5AB460 | 45KB |
cp.async.bulk | sub_593210 | -- |
cp.async.mbarrier.arrive | sub_4DC180 | -- |
The cp.async.bulk.tensor handler is 45KB and covers all dimensionality variants (1D through 5D), both tile and im2col access patterns, and unicast/multicast delivery modes.
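One plausible decomposition of the 18 intrinsic IDs is tile mode for 1D through 5D and im2col mode for 3D through 5D, each in unicast and multicast form, plus the two scalar operations. This split reproduces the count but the per-variant list was not individually recovered, so treat it as a hypothesis.

```python
# Hypothetical breakdown of the 18 sm1xx bulk-copy intrinsics (IDs
# 0x2B--0x3C). Variant names here are illustrative, not recovered.
tile = [f"tile_{d}d_{m}" for d in range(1, 6) for m in ("uni", "multi")]      # 10
im2col = [f"im2col_{d}d_{m}" for d in range(3, 6) for m in ("uni", "multi")]  # 6
scalar = ["cp_async_bulk", "cp_async_mbarrier_arrive"]                        # 2
assert len(tile) + len(im2col) + len(scalar) == 18
```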
SM Availability Gating
Capability Check
TCGen05 availability is gated by sub_70FA00(*, 29), which checks the target SM version. The check returns true for sm_100, sm_103, and sm_110 (and their a/f sub-variants) and false for sm_120/sm_121.
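The gate reduces to a set-membership test on the base SM number. The sketch below mirrors the stated behavior of sub_70FA00 for capability 29; the table and function name are illustrative.

```python
# Model of the sub_70FA00(*, 29) tcgen05 capability gate. The a/f
# sub-variants share their base SM number, so only bases are listed.
TCGEN05_CAPABLE = {100, 103, 110}

def has_tcgen05(sm_base):
    return sm_base in TCGEN05_CAPABLE
```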
OCG Builtin Names
The OCG (Optimizing Code Generator) layer uses short mnemonic names for tcgen05 operations, visible in the builtin name lookup (sub_6C9EB0):
| OCG Name | Full Operation |
|---|---|
tcmma | tcgen05.mma core multiply-accumulate |
tcshift | tcgen05.shift accumulator data shift |
gdesc | Global descriptor operations |
memclear | Tensor memory clear |
sparsify | Sparsity pattern application |
The .tcgen05op string identifies an Ori IR instruction as belonging to the tcgen05 family during the optimizer pipeline.
Version Compatibility
CUDA 12.x to 13.0 Breaking Change
ptxas v13.0.88 includes a linker-level version check for tcgen05 objects:
"Object '%s' cannot be linked due to version mismatch. Objects using tcgen05 in 12.x cannot be linked with 13.0 or later, they must be rebuilt with latest compiler"
The EICOMPAT_ATTR_INST_TCGEN05_MMA_DEPRECATED attribute tags objects compiled with the 12.x-era tcgen05 encoding, which is binary-incompatible with the 13.0 encoding. The SASS instruction encoding for tcgen05.mma changed between CUDA 12.x and 13.0 -- objects must be recompiled.
SM 100 vs SM 103 Differences
Both sm_100 and sm_103 share the same tcgen05 instruction set and codegen factory (36864). They share all 7 dispatch-table handler functions. The differences between sm_100 and sm_103 are:
- Different Handler A and Handler B capability accessor functions (sm_100: sub_609C30/sub_609BD0; sm_103: sub_608F20/sub_609D20)
- Different intrinsic table initializers (sm_100: sub_60A910; sm_103: sub_60A700)
- sm_103 may expose additional capability flags for GB300-specific features
Both targets produce identical SASS for tcgen05 instructions. The f sub-variants (sm_100f, sm_103f) allow cross-compilation within the family: sm_100f code can run on sm_103 hardware.
Compiler Pipeline
PTX Parsing and Validation
- Lexer (sub_720F00, 64KB): Recognizes tcgen05.* tokens during lexical analysis
- Validator (sub_4C5FB0, 28KB): Shared MMA/WMMA/tcgen05 validation function. Checks instruction legality for the current SM target, validates operand types, descriptor fields, and modifier combinations
- Instruction table (sub_46E000, 93KB): Registers tcgen05 instruction variants with their type combinations (e.g., .tcgen05op)
Intrinsic Dispatch
The intrinsic dispatch table builder (sub_5D4190, 41KB) registers tcgen05 handlers:
| Registration | PTX Instruction | Handler | Size |
|---|---|---|---|
| Line 112 | tcgen05.mma | sub_5BBC30 | 90KB |
| Lifecycle | tcgen05.alloc | sub_569180 | -- |
| Lifecycle | tcgen05.relinquish_alloc_permit | sub_526370 | -- |
| Lifecycle | tcgen05.dealloc | sub_58C7F0 | -- |
| Data | tcgen05.ld | sub_574050 | -- |
| Data | tcgen05.ld.red | sub_578DB0 | -- |
| Data | tcgen05.st | sub_571FE0 | -- |
| Sync | tcgen05.commit | sub_56C190 | -- |
| Copy | tcgen05.cp | sub_5427F0 | -- |
| Compute | tcgen05.shift | sub_4F1A90 | -- |
| Compute | tcgen05.mma.ws | sub_58FA20 | -- |
Intrinsic Lowering
The TCGen05 MMA handler (sub_6D7AF0, 19KB) and validator (sub_6D69B0, 12KB) in the encoding zone handle the lowering from abstract intrinsic operations to concrete SASS encoding. The handler checks modifier consistency:
- "fused and l16dp32bit must be specified together"
- "Inputs vector length is inconsistent with layout and num modifiers"
TMEM Address Generation
The TMEM address generator cluster (sub_70E740, sub_70E940, sub_70EB00) generates PTX parameter passing code for tensor memory addresses:
st.param.b32 [%s + %d], %s;
ld.param.b32 %s, [%s + %d];
SASS Encoding
The master instruction encoder (sub_6D9690, 94KB) handles the final binary encoding. TCGen05 instructions use the Mercury encoding pipeline (encoder factory 36864) with Blackwell-specific opcode tables.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_4C5FB0 | 28KB | PTX validator (MMA/WMMA/tcgen05 shared) | HIGH |
sub_4DA720 | 343 B | tcgen05.mma.ws formatter | HIGH |
sub_4DB050 | 439 B | guardrails.in_physical_bounds formatter | HIGH |
sub_4DBF20 | 599 B | guardrails.are_columns_allocated formatter | HIGH |
sub_4DD580 | 735 B | guardrails.datapath_alignment formatter | HIGH |
sub_4DDB80 | 743 B | guardrails.check_sparse_usage formatter | HIGH |
sub_4DDE70 | 775 B | guardrails.is_phase_valid formatter | HIGH |
sub_4DE180 | 791 B | guardrails.is_current_warp_valid_owner formatter | HIGH |
sub_4F0960 | 839 B | guardrails.allocation_granularity formatter | HIGH |
sub_4F1A90 | 903 B | tcgen05.shift / tcgen05.cp formatter | HIGH |
sub_500FA0 | 970 B | guardrails.sp_consistency_across_idesc_mod formatter | HIGH |
sub_526370 | 1,287 B | tcgen05.alloc / tcgen05.relinquish_alloc_permit formatter | HIGH |
sub_5427F0 | 1,575 B | tcgen05.commit formatter | HIGH |
sub_569180 | -- | tcgen05.alloc codegen handler | HIGH |
sub_56C190 | 1,842 B | tcgen05.st formatter | HIGH |
sub_571FE0 | 2,066 B | tcgen05.ld.red formatter | HIGH |
sub_574050 | 2,130 B | tcgen05.dealloc formatter | HIGH |
sub_578DB0 | 2,466 B | tcgen05.ld formatter | HIGH |
sub_58C7F0 | 4,282 B | tcgen05.relinquish_alloc_permit / tcgen05.dealloc formatter | HIGH |
sub_58FA20 | 4,604 B | tcgen05.shift + tcgen05.mma formatter | HIGH |
sub_593210 | -- | cp.async.bulk codegen | HIGH |
sub_5AB460 | 45KB | cp.async.bulk.tensor codegen (1D--5D) | HIGH |
sub_5BBC30 | 90KB | tcgen05.mma codegen (main) | HIGH |
sub_6D69B0 | 12KB | TCGen05 MMA validator (encoding zone) | MED |
sub_6D7AF0 | 19KB | TCGen05 MMA handler (encoding zone) | HIGH |
sub_70BC30 | -- | TCGen05 parameter helper | MED |
sub_70BCC0 | -- | TCGen05 parameter helper | MED |
sub_70DEF0 | -- | TCGen05 parameter helper | MED |
sub_70E0E0 | -- | SM100 guardrail bounds-check code generator | MED |
sub_70E740 | -- | TMEM address generator (tmemD) | MED |
sub_70E940 | -- | TMEM address generator (tmemA) | MED |
sub_70EB00 | -- | TMEM address generator (scaleTmemA/B, spMetaTmem) | MED |
sub_70FA00 | -- | Instruction capability checker (29 = tcgen05) | HIGH |
sub_8E8A90 | 3.0KB + 949 B | SM 100 latency table (base + TCGEN05 supplement) | HIGH |
Cross-References
- Blackwell (SM 100--121) -- Target-level architecture gating, codegen factory 36864
- SM Architecture Map -- Complete SM table, capability dispatch infrastructure
- GMMA/WGMMA Pipeline -- Predecessor tensor core pipeline (sm_90), same warpgroup execution model
- Intrinsic Table (608 Entries) -- IDs 0x20--0x3C (tcgen05 guardrails + bulk copy)
- Tensor Core Intrinsics -- WMMA/MMA/tcgen05 intrinsic lowering detail
- Late Expansion & Legalization -- tcgen05 guardrail insertion during late expansion
- SASS Instruction Encoding -- Mercury encoder, opcode tables
- Latency Model & HW Profiles -- SM 100 TCGEN05 supplement table
- SASS Text Generation -- TCGen05 instruction formatters
- CLI Options -- --g-tensor-memory-access-check option
Intrinsic Table Architecture (607 Registered Entries)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas maintains two separate intrinsic subsystems that together cover every CUDA runtime helper function, every PTX opcode requiring inline code generation, and every Blackwell+ OCG builtin operation. The first subsystem (sub_5D1660 + sub_5D4190 + sub_5D7430 + sub_5FF700) handles 607 classical CUDA intrinsics and PTX opcode dispatch through a name-to-ID hash map, a body template name table, and a giant prototype generator. The second subsystem (sub_6C9EB0 and its handler cluster at 0x6C0000--0x6CC000) handles OCG (Optimizing Code Generator) builtins for SM100+ targets. Both subsystems use the same hash map infrastructure (sub_425CA0 / sub_426150 / sub_426D60) documented in Hash Tables & Bitvectors.
| Master registration | sub_5D1660 (46KB) -- 607 CUDA intrinsics, name-to-integer-ID hash map (608 table slots, ID 0 = null) |
| Opcode dispatch | sub_5D4190 (41KB) -- ~120 PTX opcodes to codegen handlers + ~400 MMA hash entries |
| Body template names | sub_5D7430 (161KB) -- 1,079 intrinsic names constructed from .rodata prefixes + type suffixes, stored in hash map at +824 |
| Prototype generator | sub_5FF700 (354KB) -- switch generating .weak .func PTX declarations |
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
| Hash map create | sub_425CA0 (initial capacity 0x80) |
| Hash map insert | sub_426150(map, name, value) |
| Hash map lookup | sub_426D60 |
Per-Family Deep Dives:
- OCG Intrinsic System -- SM100+ OCG builtins (44 operations), lowering pipeline, SASS handler map
- Math Intrinsics -- IEEE math software emulation (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 lowering
- Sync & Warp Intrinsics -- Barriers, vote, shuffle, match, redux
System Overview
sub_451730 (intrinsic lowering context constructor)
│
├── sub_5D4190(ctx) ── register PTX opcode & MMA handlers ──────────────┐
│ │ (1) Calls sub_5D1660 to populate intrinsic ID table (607 entries) │
│ │ (2) Registers ~120 PTX opcode -> codegen handler mappings │
│ │ (3) Registers ~400 MMA hash -> codegen handler mappings │
│ │ │
│ ├─ Hash map at +808 ── PTX opcode name -> codegen function ptr │
│ │ "div" -> sub_5B76D0 (64KB) │
│ │ "sqrt" -> sub_5B4040 (49KB) │
│ │ "wmma.mma"-> sub_5C7A50 (173KB) │
│ │ "mma" -> sub_5C10A0 (120KB) │
│ │ ... ~116 more │
│ │ │
│ ├─ Hash map at +816 ── numeric MMA hash -> codegen handler ptr │
│ │ "2644314910" -> sub_4DDB80 │
│ │ ... ~399 more (shape/type/layout combinations) │
│ │ │
│ └─ ID table at +1056 ── 9728-byte array (memcpy from unk_1D4D940) │
│ Hash map at +1064 ── name -> integer ID (sub_5D1660, 607) │
│ Count at +1072 = 608 (includes null ID 0 slot) │
│ │
├── sub_4CE230(ctx) ── register modifier keywords (GUARD, PRED, ...) │
│ │
├── sub_5D7430(ctx, sregs) ── body template name table (161KB) ──────────┤
│ │ 1,079 entries, each constructed from: │
│ │ 16-byte .rodata prefix (e.g. "__cuda_sm20_div_") │
│ │ + 4-byte type suffix (e.g. "s16\0", "u64\0", "rn_f") │
│ │ → registered into hash map at +824 with sequential integer IDs │
│ │ │
│ └─ Hash map at +824 ── intrinsic name -> body template ID │
│ "__cuda_sm20_div_s16" -> 0 │
│ "__cuda_sm20_div_u16" -> 1 │
│ ... 1,079 total entries │
│ │
└── sub_451330("<fermi macros>", ...) ── load Fermi macro library │
│
sub_5FF700 (354KB) ─────────────────────────────────────────────────────────┘
│ switch(body_template_id) with hundreds of cases
│ Each case: allocate buffer via sub_4DA340, strcpy() PTX prototype
│
│ case 0: ".weak .func (.reg .s32 %d) __cuda_sm20_div_s16
│ (.reg .s32 %a0, .reg .s32 %a1)"
│ case 4: ".weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
│ (.reg .u64 %rda1, .reg .u64 %rda2)"
│ case 9: ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
│ (.reg .f32 %fa1, .reg .f32 %fa2)"
│ case 25: ".weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
│ (...)"
│ ... hundreds more for rcp, sqrt, dsqrt, barrier, wmma, mma, etc.
v
Emitted into PTX output as .weak .func declarations
(linker resolves calls to runtime helper functions)
Master Registration -- sub_5D1660
This 46KB function is the master catalog. It allocates a 9728-byte table (memcpy from unk_1D4D940, 0x2600 bytes = 608 x 16B slots), creates a hash map with initial capacity 0x80 via sub_425CA0, then calls sub_426150(hashmap, "name", (char*)ID) exactly 607 times to register every CUDA runtime helper function with an integer ID (IDs 1--607, contiguous). The hash map is stored at a1+1064, the table at a1+1056, and the count 608 at a1+1072 (includes the unused null ID 0 slot).
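The registration pattern reduces to a name-to-ID map with contiguous IDs plus a fixed-size side table. A toy model, with structure sizes taken from the description above and everything else illustrative:

```python
# Toy model of the sub_5D1660 registration pattern: name -> integer ID
# with contiguous IDs starting at 1, plus a 608-slot x 16-byte table
# (0x2600 bytes) whose slot 0 is the unused null sentinel.
class IntrinsicCatalog:
    def __init__(self):
        self.ids = {}                      # mirrors the hash map at a1+1064
        self.table = bytearray(608 * 16)   # mirrors the table at a1+1056

    def register(self, name):
        self.ids[name] = len(self.ids) + 1  # IDs 1..607; ID 0 = null
        return self.ids[name]
```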
Complete ID Allocation
607 intrinsics are registered with contiguous IDs from 0x01 through 0x25F. The binary stores count=608 at a1+1072 because the pre-built 9,728-byte table (608 x 16B slots) includes a null ID 0 sentinel. The ID ranges partition cleanly by SM generation and functional category.
| ID Range | Count | Prefix | Category | SM Floor |
|---|---|---|---|---|
0x001--0x011 | 17 | __cuda_reduxsync_* | Redux sync (b32 and/or/xor, f32 max/min/abs/NaN, s32/u32 add/max/min) | sm_70 |
0x012--0x018 | 7 | __cuda_sanitizer_memcheck_* | Compute-sanitizer hooks (free, generic, global, local, malloc, readmetadata, shared) | -- |
0x019--0x01F | 7 | __cuda_scalar_video_emulation_* | Video instruction emulation helpers | sm_20 |
0x020--0x02A | 11 | __cuda_sm10x_* | Blackwell tcgen05 guardrail traps + create_mask helper | sm_100 |
0x02B--0x03C | 18 | __cuda_sm1xx_* | Bulk copy + cp.async.bulk.tensor 1D--5D tile/im2col uni/multicast | sm_100+ |
0x03D--0x082 | 70 | __cuda_sm20_* | IEEE math: bfe, bfi, div, rcp, sqrt, dsqrt, drsqrt, rem (all rounding modes + slowpaths) | sm_20 |
0x083--0x086 | 4 | __cuda_sm3x_div_* | Optimized division variants (rn_ftz_f32, rn_noftz_f32 + slowpaths) | sm_30 |
0x087--0x088 | 2 | __cuda_sm62_dp2a/dp4a | Integer dot product emulation | sm_62 |
0x089--0x211 | 393 | __cuda_sm70_* | Volta+ intrinsics (barriers, shuffle, vote, match, WMMA -- all shapes, layouts, address spaces) | sm_70 |
0x212--0x214 | 3 | __cuda_sm80_* | Ampere: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode | sm_80 |
0x215--0x21E | 10 | __cuda_sm_10x_* | Blackwell hmma/imma mdata + bit MMA (and/xor m8n8k128/m16n8k128/m16n8k256) | sm_100 |
0x21F--0x22C | 14 | __cuda_sm_8x_* | Direct MMA operations (f16/f32 accum, 4 layout combos) + mma_shfl_f16/f32 | sm_80+ |
0x22D--0x25F | 51 | __cuda_sm_9x_* | Hopper sub-byte + bit MMA: s4/u4 dense m16n8k32/k64 + sparse m16n8k64/k128, bit xor (m8n8k128/m16n8k128/m16n8k256) | sm_90 |
Total: 607 registered intrinsics across 13 prefix groups. Table has 608 slots (ID 0 unused).
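The partition above can be checked arithmetically: the thirteen contiguous ranges must tile 0x001 through 0x25F with the listed counts.

```python
# Sanity check of the ID-range table: each (lo, hi) pair is inclusive,
# and the counts must sum to the 607 registered intrinsics.
ranges = [(0x001, 0x011), (0x012, 0x018), (0x019, 0x01F), (0x020, 0x02A),
          (0x02B, 0x03C), (0x03D, 0x082), (0x083, 0x086), (0x087, 0x088),
          (0x089, 0x211), (0x212, 0x214), (0x215, 0x21E), (0x21F, 0x22C),
          (0x22D, 0x25F)]
counts = [hi - lo + 1 for lo, hi in ranges]
assert counts == [17, 7, 7, 11, 18, 70, 4, 2, 393, 3, 10, 14, 51]
assert sum(counts) == 607 and ranges[-1][1] == 0x25F
```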
sm_70 Intrinsic Breakdown (IDs 0x89--0x211)
The sm_70 block is by far the largest at 393 entries. It covers every Volta-era warp synchronous intrinsic plus the complete WMMA API. The explosion in count comes from the combinatorial product of shapes, layouts, data types, address spaces, and predicate/satfinite variants.
| Sub-Category | Examples | Combinatorial Source |
|---|---|---|
barrier_arrive | 0--15, with/without count | 16 barrier IDs x 2 count variants |
barrier_red_and/or/popc | 0--15, with/without count | 3 reduction ops x 16 IDs x 2 count |
barrier_sync | 0--15, with/without count | 16 IDs x 2 count variants |
matchsync_all/any_b32/b64 | with predicate variants | 2 match modes x 2 types x pred |
shflsync_bfly/down/idx/up | with predicate variants | 4 shuffle modes x pred |
votesync_all/any/ballot/uni | -- | 4 vote modes |
warpsync | -- | 1 entry |
wmma_* | m16n16k16, m32n8k16, m8n32k16 | 3 shapes x {load_a, load_b, load_c, store_d, mma} x {row, col} x {f16, f32} x {generic, global, shared} x {satfinite} |
The WMMA entries dominate the count. Each combination of shape (m16n16k16/m32n8k16/m8n32k16), operation (load_a/load_b/load_c/store_d/mma), layout (row/col for each matrix), data type (f16/f32), address space (generic/global/shared), and optional satfinite flag produces a separate intrinsic registration.
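The combinatorial explosion is easy to reproduce. The cross product below uses only the axes named in the text; the per-axis legality rules that ptxas actually applies were not enumerated, so the resulting count is illustrative rather than the recovered 204.

```python
# Illustration of the combinatorial source of the sm_70 WMMA entries.
# A naive cross product of the named axes already yields 90 distinct
# names; the type (f16/f32) and satfinite axes, with legality pruning,
# account for the real count of 204.
from itertools import product

shapes = ["m16n16k16", "m32n8k16", "m8n32k16"]
ops = ["load_a", "load_b", "load_c", "store_d", "mma"]
layouts = ["row", "col"]
spaces = ["generic", "global", "shared"]

names = {f"__cuda_sm70_wmma_{s}_{op}_{l}_{sp}"
         for s, op, l, sp in product(shapes, ops, layouts, spaces)}
assert len(names) == 3 * 5 * 2 * 3  # 90
```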
Opcode Dispatch -- sub_5D4190
This 41KB function first calls sub_5D1660(a1) to populate the intrinsic ID table, then builds two more hash maps for PTX opcode dispatch.
Named Opcode Table (at a1+808)
~120 PTX instruction names mapped to codegen handler function pointers. Each handler allocates a 50,000-byte buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.
| Category | Opcodes | Codegen Handlers |
|---|---|---|
| Math | div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh | sub_573860, sub_5B76D0 (64KB), sub_589810, sub_5B0CD0 (44KB), sub_57BFC0, sub_5B4040 (49KB), sub_583190, sub_52A5C0, sub_505B00 |
| Memory | membar, _ldldu, prefetch | sub_4DB410, sub_4DD860, sub_507FB0 |
| Conversion | cvt | sub_59F630 |
| Bit manipulation | bfind, brev, bfe, bfi, clz, popc, testp, copysign | sub_590C20, sub_50B5A0, sub_578470, sub_52E100, sub_4DBCC0, sub_4DB210, sub_581A10, sub_50B180 |
| Texture | tex, tex.base, tex.level, tld4, tex.grad | sub_584D10, sub_5879B0, sub_58B6A0, sub_56D700, sub_5ADDC0 (50KB) |
| Video (SIMD) | vadd/vsub/vmin/vmax/vabsdiff/vshl/vshr/vset/vmad (scalar), vadd2/vmax2/vmin2/vabsdiff2/vset2/vsub2/vavrg2 (packed 2x16), vadd4/vmin4/vmax4/vabsdiff4/vset4/vsub4/vavrg4 (packed 4x8) | per-instruction handlers |
| Dot product | dp2a.lo, dp2a.hi, dp4a | sub_56BA60, sub_56C8D0, sub_577BA0 |
| Barriers | bar, barrier, bar.arrive, barrier.arrive, bar.red, barrier.red, bar.cta/barrier.cta (.arrive/.red variants), bar.warp | sub_524FB0, sub_570290, sub_500BF0, sub_570940, sub_52D590, sub_5889B0, sub_56A5A0 |
| Warp | vote, shfl, match, redux | sub_580E50, sub_5801D0, sub_58A730, sub_567680 |
| Async copy | cp.async.mbarrier.arrive, cp.async.bulk, cp.async.bulk.tensor | sub_4DC180, sub_593210, sub_5AB460 (45KB) |
| Matrix | ldmatrix, movmatrix, stmatrix, st.async, red.async, st.bulk | sub_50D4B0, sub_4DAEA0, sub_4F05D0, sub_58E9B0, sub_5825A0, sub_549430 |
| Cache | createpolicy.range, createpolicy.fractional, createpolicy.cvt | per-instruction handlers |
| WMMA | wmma.load.a, wmma.load.b, wmma.load.c, wmma.store.d, wmma.mma | sub_5A2D10, sub_5A0EA0, sub_5A8E40, sub_5A6BD0, sub_5C7A50 (173KB) |
| MMA | mma | sub_5C10A0 (120KB) |
| WGMMA | wgmma.mma_async, wgmma.fence, wgmma.commit_group, wgmma.wait_group | sub_50AC70, sub_4DA380, sub_4DA4B0, sub_4DA5E0 |
| Multimem | multimem.ld_reduce, multimem.st, multimem.red | sub_58D8B0, sub_57B4C0, sub_50A850 |
| Tensormap | tensormap.replace | sub_57F6E0 |
| TCGen05 | tcgen05.alloc, tcgen05.relinquish_alloc_permit, tcgen05.dealloc, tcgen05.ld, tcgen05.ld.red, tcgen05.st, tcgen05.commit, tcgen05.cp, tcgen05.shift, tcgen05.mma, tcgen05.mma.ws | sub_569180, sub_526370, sub_58C7F0, sub_574050, sub_578DB0, sub_571FE0, sub_56C190, sub_5427F0, sub_4F1A90, sub_5BBC30 (90KB), sub_58FA20 |
| TCGen05 guardrails | _tcgen05.guardrails.is_phase_valid, are_columns_allocated, is_current_warp_valid_owner, in_physical_bounds, allocation_granularity, datapath_alignment, sp_consistency_across_idesc_mod, check_sparse_usage | per-instruction handlers |
Numeric MMA Hash Table (at a1+816)
~400 entries where the key is a numeric string representation of a hash value (e.g., "2644314910") that encodes a specific MMA shape/type/layout combination. The hash encodes the instruction variant completely: matrix dimensions (m16n8k16, m16n8k32, etc.), data type (f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1), and layout (row/col combinations). Each entry maps to a codegen handler function pointer. This avoids a multi-dimensional lookup by collapsing the full variant space into a single hash probe.
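The single-probe dispatch idea can be sketched as follows. The binary's actual mixing function was not recovered, so CRC32 stands in for it here, and the handler value is illustrative.

```python
# Sketch of the numeric MMA hash dispatch: collapse the full variant
# tuple into one deterministic integer, stringify it, and use that as
# the hash-map key. zlib.crc32 is a stand-in for the real hash.
import zlib

def mma_variant_key(shape, dtype, layout):
    return str(zlib.crc32(f"{shape}.{dtype}.{layout}".encode()))

# Illustrative table with a single registered variant.
handlers = {mma_variant_key("m16n8k16", "f16", "row.col"): "sub_5C10A0"}

def dispatch(shape, dtype, layout):
    # One hash probe replaces a multi-dimensional lookup.
    return handlers.get(mma_variant_key(shape, dtype, layout))
</```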
Body Template Name Table -- sub_5D7430
At 161KB of machine code (0x5D7430--0x5FF700), this is the largest function in the intrinsic infrastructure by code size and the 6th largest function in the entire ptxas binary. IDA failed to decompile it; all analysis comes from raw x86-64 disassembly. The function constructs a third hash map (at context offset +824 / 0x338) containing 1,079 entries that map dynamically constructed __cuda_* intrinsic names to sequential body template IDs (0--1078).
Why 1,079 Body Templates for 607 Logical Intrinsics
The master registration table (sub_5D1660) maps 607 intrinsic names to logical IDs. The body template table (sub_5D7430) maps 1,079 variant-specific names to prototype generator case numbers. The 1.78x expansion has one dominant cause: WMMA template proliferation across GPU generations.
The 204 logical WMMA entries in sub_5D1660 cover only the original sm_70 Volta shapes (m16n16k16/m32n8k16/m8n32k16 with f16/f32 types). But the body template table includes all later-generation WMMA variants -- sm7x sub-byte/bit, sm72 integer, sm8x tf32/bf16/f64 -- that were added as hardware evolved. These ~416 extra WMMA templates have no matching entry in the 607 logical ID table; they exist only in the body template hash map and the prototype generator switch.
Non-WMMA intrinsics map approximately 1:1 between logical IDs and body templates. The math operations (div, rcp, sqrt) are already fully type-specialized at the logical level -- each rounding-mode/type combination is a separate logical intrinsic.
Three sources of expansion beyond the 607 logical entries:
- Later-generation WMMA variants (~416 template-only entries):
  - sm7x sub-byte WMMA (s4/u4 m8n8k32) + bit WMMA (m8n8k128): ~231 templates
  - sm72 integer WMMA (m16n16k16/m32n8k16/m8n32k16 integer types): ~105 templates
  - sm8x tf32 WMMA (m16n16k8) + bf16/f64 WMMA: ~80 templates
- Aligned warp sync variants (~13 extra templates): matchsync_aligned, votesync_aligned, votesync_ballot_groupwise, query_activemask/query_activemask_groupwise for cooperative group support
- Additional SM100 specializations (~8 extra templates): tcgen05_alloc_two_sm, extra guardrails check variants, get_warp_rank
Conversely, 18 sm1xx bulk copy intrinsics have logical IDs but zero body templates -- they bypass the template/prototype mechanism entirely and are lowered directly to inline PTX by the opcode dispatch handlers (sub_593210, sub_5AB460).
Template Distribution Table
| Logical Group | Logical | Template | Factor |
|---|---|---|---|
| SM20 IEEE math (div, rem, rcp, sqrt, bfe/bfi) | 70 | 70 | 1.0x |
| SM3x optimized division | 4 | 4 | 1.0x |
| SM62 integer dot product | 2 | 2 | 1.0x |
| SM70 barriers | 170 | 170 | 1.0x |
| SM70 warp sync (match, vote, shfl, query) | 19 | 32 | 1.7x |
| SM70 WMMA (f16/f32 original Volta) | 204 | 249 | 1.2x |
| SM7x WMMA extended (sub-byte, bit) | 0 | 231 | tmpl-only |
| SM72 WMMA (integer) | 0 | 105 | tmpl-only |
| SM8x WMMA (tf32, bf16, f64) | 0 | 80 | tmpl-only |
| SM80 cache policy | 3 | 4 | 1.3x |
| SM8x direct MMA | 14 | 14 | 1.0x |
| SM9x Hopper sub-byte/bit MMA | 51 | 52 | 1.0x |
| SM10x Blackwell MMA metadata | 10 | 10 | 1.0x |
| SM100 tcgen05 + guardrails | 11 | 19 | 1.7x |
| SM100+ bulk copy / TMA | 18 | 0 | (no templates) |
| Redux sync primitives | 17 | 17 | 1.0x |
| Compute-sanitizer hooks | 7 | 7 | 1.0x |
| Video instruction emulation | 7 | 7 | 1.0x |
| Total | 607 | 1,073 | 1.78x |
WMMA subtotal: 204 logical entries expand to 665 body templates (3.3x). Non-WMMA: 403 logical entries map to 408 templates (~1.0x). The remaining 7 templates (1,080 prototype switch cases minus 1,073 classified) are sanitizer/cache variants where IDA produced qmemcpy instead of strcpy, preventing exact name extraction.
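The accounting in the distribution table is internally consistent, which can be checked directly from its numbers:

```python
# Cross-check of the template distribution table's totals.
wmma_templates = 249 + 231 + 105 + 80   # sm70 + sm7x + sm72 + sm8x rows
non_wmma_logical = 607 - 204
non_wmma_templates = 1073 - wmma_templates
assert wmma_templates == 665            # 204 logical -> 665 templates (~3.3x)
assert (non_wmma_logical, non_wmma_templates) == (403, 408)
assert 1080 - 1073 == 7                 # unclassified prototype switch cases
```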
The sm70 WMMA group itself expands from 204 to 249 templates because the prototype generator includes update_ptr and desc (descriptor-based addressing) variants of certain load/store operations that the logical table does not separate.
The three "tmpl-only" WMMA rows (sm7x/sm72/sm8x) are the single largest contributor to the expansion. They represent ~416 templates with zero logical ID counterparts. These families use .FORCE_INLINE .func linkage in their prototypes instead of the .weak .func used by the original sm70 WMMA entries:
sm70 (original): .weak .func (...) __cuda_sm70_wmma_m16n16k16_load_a_col (...)
sm72 (integer): .FORCE_INLINE .func (...) __cuda_sm72_Integer_wmma_m16n16k16_load_a_row (...)
sm7x (sub-byte): .FORCE_INLINE .func (...) __cuda_sm7x_sub_byte_wmma_m8n8k32_load_a_row (...)
sm8x (tf32): .FORCE_INLINE .func (...) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (...)
The .FORCE_INLINE directive forces inlining at every call site rather than emitting a separate callable function. The later-gen WMMA implementations are more complex and performance-sensitive, making per-call-site specialization profitable.
Name Construction Algorithm
The function contains zero string references because it constructs all 1,079 names at runtime. For each entry:
- Allocate a 20-byte buffer via sub_424070(allocator, 20)
- Copy prefix (16 bytes) from .rodata via SSE movdqa + movups (e.g., "__cuda_sm20_div_")
- Append suffix (4 bytes) via movl immediate at offset +16 (e.g., "s16\0", "u64\0", "rn_f")
- Register via sub_426150(context+824, buffer, template_id) with sequential integer IDs
The 533 unique .rodata prefix addresses fan out through multiple suffixes per prefix:
.rodata prefix (16B) suffix (4B) result (20B buffer)
─────────────────────── ─────────── ──────────────────────
"__cuda_sm20_div_" + "s16\0" = "__cuda_sm20_div_s16"
"__cuda_sm20_div_" + "u16\0" = "__cuda_sm20_div_u16"
"__cuda_sm20_div_" + "u64\0" = "__cuda_sm20_div_u64"
"__cuda_sm20_div_" + "s64\0" = "__cuda_sm20_div_s64"
"__cuda_sm20_div_" + "rn_f" = "__cuda_sm20_div_rn_f" (truncated)
"__cuda_sm20_rem_" + "s16\0" = "__cuda_sm20_rem_s16"
"__cuda_sm20_rcp_" + "rn_f" = "__cuda_sm20_rcp_rn_f" (truncated)
"__cuda_sm70_barr" + "ier_" = "__cuda_sm70_barrier_" (prefix chain)
Names truncated at the 20-byte buffer limit are still sufficient for hash map lookup -- the full untruncated name appears only inside the prototype string in sub_5FF700.
Worked Example: Division (Cases 0--26)
The __cuda_sm20_div operation illustrates the template-to-prototype mapping. Division has 19 logical IDs and 19 body templates (1:1 ratio) because each type/rounding/precision variant is already a separate logical intrinsic. The suffix encodes the type specialization:
| Case | Body Template Name | Type Suffix | PTX Signature |
|---|---|---|---|
| 0 | __cuda_sm20_div_s16 | s16 | (.reg .s32 %d) ... (.reg .s32 %a0, .reg .s32 %a1) |
| 1 | __cuda_sm20_div_u16 | u16 | (.reg .u32 %d) ... (.reg .u32 %a0, .reg .u32 %a1) |
| 4 | __cuda_sm20_div_u64 | u64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 5 | __cuda_sm20_div_s64 | s64 | (.reg .u64 %rdv1) ... (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | __cuda_sm20_div_rn_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 10 | __cuda_sm20_div_rd_f32 | rd_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 14 | __cuda_sm20_div_rn_ftz_f32 | rn_f | (.reg .f32 %fv1) ... (.reg .f32 %fa1, .reg .f32 %fa2) |
| 22 | __cuda_sm20_div_ru_f64_v2 | ru_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
| 25 | __cuda_sm20_div_rn_f64_full | rn_f | (.reg .f64 %fdv1) ... (.reg .f64 %fda1, .reg .f64 %fda2) |
Cases 2--3 (rem s16/u16), 6--7 (rem s64/u64) are interleaved between the division entries. Cases 8, 13 are _slowpath variants that implement Newton-Raphson refinement fallbacks. Cases 18--21 are the sm3x-optimized division variants with the same suffix scheme. Note: s16/u16 division uses .s32/.u32 register types because PTX has no 16-bit register class; the 16-bit operation is performed by 32-bit hardware with appropriate sign/zero extension.
Statistics
| Metric | Value |
|---|---|
| Machine code size | 164,560 bytes (0x5D7430--0x5FF700) |
| `sub_426150` calls | 1,079 |
| Unique .rodata prefix addresses | 533 |
| Hash map destination | context+824 (0x338) |
| Buffer size per entry | 20 bytes |
| IDA decompilation | Failed (function too large/repetitive) |
Context Hash Map Summary
The intrinsic lowering context object holds five hash maps and one flat table:
| Offset | Field | Builder | Contents | Entries |
|---|---|---|---|---|
| +808 | opcode handlers | sub_5D4190 | PTX opcode name -> codegen fn ptr | ~120 |
| +816 | MMA hash handlers | sub_5D4190 | numeric hash -> codegen fn ptr | ~400 |
| +824 | body templates | sub_5D7430 | intrinsic name -> template ID | 1,079 |
| +1056 | descriptor table | sub_5D1660 | 608 x 16B intrinsic descriptor slots | 608 |
| +1064 | ID map | sub_5D1660 | intrinsic name -> logical ID (1-607) | 607 |
| +1072 | count | sub_5D1660 | 608 (includes null slot 0) | -- |
Instruction Property Accessors
All codegen handlers query instruction properties through accessor functions on the instruction object at a1+1096. These are the same accessors used by WMMA, MMA, and tcgen05 codegen.
| Accessor | Purpose | Usage Example |
|---|---|---|
| `sub_70B6E0` | Check if feature enabled | `sub_70B6E0(obj)` -- boolean feature gate |
| `sub_70B780` | Get feature parameter | Numeric feature parameter |
| `sub_70FA00` | Check instruction capability for SM | `sub_70FA00(*, 23)` = texture, `sub_70FA00(*, 29)` = tcgen05 |
| `sub_70E940` | Get operand count | Number of operands |
| `sub_70E6E0` | Get data type | Operand data type enumeration |
| `sub_70ACC0` | Get accumulator type | MMA accumulator data type |
| `sub_709860` | Get register type/size | Register class and width |
| `sub_70F460` | Get layout variant | row/col matrix layout |
| `sub_707D60` | Check MMA shape variant | m16n16k16 vs m32n8k16, etc. |
| `sub_709910` | Check sparse mode | Sparse MMA variant flag |
| `sub_70F650` | Get matrix dimension (M/N) | Matrix size parameter |
| `sub_70F600` | Get matrix dimension (K) | Alternate dimension parameter |
| `sub_70CA60` | Get operand type by index | `sub_70CA60(*, 0)` -- type of first operand (21 = specific type, 58 = f32, 59 = f64) |
| `sub_70BA40` | Texture mode query | Texture sampling mode |
| `sub_70BD50` | Sampler mode query | Texture sampler configuration |
| `sub_70BB20` | Bulk tensor mode | cp.async.bulk.tensor transfer mode |
| `sub_70F0A0` | Get sparse metadata | Sparse matrix metadata parameter |
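As an illustration of how a handler gates on these accessors, here is a small Python stub. The `Instr` class and the wrapper function are invented for the example; only the accessor name and the capability indices (23 = texture, 29 = tcgen05) come from the table above:

```python
TEXTURE, TCGEN05 = 23, 29          # capability indices passed to sub_70FA00

class Instr:
    """Stand-in for the instruction object at a1+1096 (modeled, not real)."""
    def __init__(self, caps):
        self.caps = set(caps)

    def sub_70FA00(self, idx):
        # Capability query accessor: true if the target SM supports idx.
        return idx in self.caps

def can_emit_tcgen05(instr) -> bool:
    # Mirrors the guard the tcgen05 codegen performs before emitting.
    return instr.sub_70FA00(TCGEN05)

assert can_emit_tcgen05(Instr([TCGEN05]))
assert not can_emit_tcgen05(Instr([TEXTURE]))
```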
Prototype Generator -- sub_5FF700
At 354KB, this is the single largest function in the intrinsic infrastructure and the 2nd largest function in the entire ptxas binary. It takes a body template ID (a1, range 0--1079) and an allocator context (a2), allocates a buffer via sub_4DA340(size, a2), fills it with a PTX prototype string via strcpy(), and returns the result. The output is a complete .weak .func or .FORCE_INLINE .func PTX declaration that gets emitted into the PTX output stream so the linker can resolve calls to CUDA runtime helper functions.
The function is a single switch(a1) with 1,080 case labels (0--1079) plus a default case that returns an empty string "". Each case allocates an exact-sized buffer (72--1,200 bytes), copies a hardcoded PTX prototype string into it, and returns the pointer.
Prototype Generator Architecture
sub_5FF700(template_id, allocator)
│
│ switch(template_id) ← 1,080 cases, 0--1079
│
├── case N:
│ buf = sub_4DA340(byte_count, allocator) ← allocate exact-fit buffer
│ strcpy(buf, ".weak .func (...) name (...)") ← copy PTX prototype
│ return buf
│
├── case M: (45 large WMMA mma cases)
│ buf = sub_4DA340(byte_count, allocator) ← up to 1,200 bytes
│ *(u64*)buf = 0x662E206B6165772E ← ".weak .f" (inline store)
│ *(u64*)(buf+N-8) = <trailer> ← last 8 bytes inline
│ qmemcpy(buf+8, .rodata_addr, size-16) ← bulk copy middle
│ return buf
│
└── default:
return ""
Three copy strategies appear in the decompilation, all producing the same result:
| Strategy | Cases | Trigger | Max Size |
|---|---|---|---|
| `strcpy()` with inline string literal | 1,035 | Prototype fits in decompiler string threshold | ~520 bytes |
| `qmemcpy()` with QWORD bookend stores | 45 | Prototype too long for IDA to reproduce as literal | 1,200 bytes |
| Indirect variable assignment + copy | ~130 | IDA SSA split (subset of strcpy) | ~120 bytes |
The qmemcpy cases are the 45 WMMA mma operations with the largest parameter lists (3--4 fragment matrices of 8 elements each). IDA stores the first and last 8 bytes as inline immediates (0x662E206B6165772E = ".weak .f", trailer varies per case) and bulk-copies the middle from .rodata. The prototype content is structurally identical to the strcpy cases.
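The bookend pattern is easy to verify: decoding the inline immediate little-endian yields the prototype's first eight characters. A Python sketch; the reassembly logic is a model of what the compiled code does, and the sample prototype string is made up for the round-trip:

```python
import struct

# The inline first-QWORD immediate from the 45 large WMMA cases:
assert struct.pack("<Q", 0x662E206B6165772E) == b".weak .f"

def assemble(first_q: int, middle: bytes, last_q: int) -> bytes:
    """Reassemble a prototype: first/last 8 bytes as inline QWORD stores,
    middle bulk-copied from .rodata (the qmemcpy(buf+8, ..., size-16) part)."""
    return struct.pack("<Q", first_q) + middle + struct.pack("<Q", last_q)

# Round-trip on an invented prototype string of the right shape:
proto = b".weak .func (.param .align 16 .b32 d[8]) __example (...) ;"
first, = struct.unpack("<Q", proto[:8])
last,  = struct.unpack("<Q", proto[-8:])
assert assemble(first, proto[8:-8], last) == proto
```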
Linkage Directives
Two PTX linkage types are emitted, controlling how the linker handles the declared function:
| Directive | Count | Meaning | Used By |
|---|---|---|---|
.weak | 616 | Overridable by user code; linker uses user version if present | SM20 math, SM70 barriers/sync/WMMA (original Volta), SM80 cache policy, SM8x/9x/10x MMA, redux sync, sanitizer hooks, video emulation, dp2a/dp4a |
.FORCE_INLINE | 464 | Inlined at every call site; no separate callable function | SM70 aligned vote/match/query_activemask, SM7x sub-byte/bit WMMA, SM72 integer WMMA, SM8x tf32/bf16/f64 WMMA, SM10x tcgen05 alloc/guardrails, SM80 createpolicy_fractional |
The .weak linkage supports user-supplied replacements: if the user provides their own implementation of __cuda_sm20_div_s16, the linker will use that instead of the built-in runtime version. The .FORCE_INLINE directive forces per-call-site specialization -- the later-generation WMMA implementations are more complex and performance-sensitive, making inlining profitable.
A subset of .weak prototypes (~410) carry the .unique qualifier:
.weak .func (.reg .b32 dst) __cuda_sm70_barrier_sync (.reg .b32 arg0) .unique ;
.unique instructs the PTX linker to keep exactly one copy of the function body even if multiple compilation units reference it. All barriers, redux sync, warpsync, non-aligned vote/match/shuffle use .unique.
Prototype Format
Every emitted prototype follows one of these structural patterns:
<linkage> .func (<return_params>) <name> (<input_params>) [.unique] ;
| Case | Prototype |
|---|---|
| 0 | .weak .func (.reg .s32 %d) __cuda_sm20_div_s16 (.reg .s32 %a0, .reg .s32 %a1) |
| 4 | .weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64 (.reg .u64 %rda1, .reg .u64 %rda2) |
| 9 | .weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) |
| 25 | .weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full (.reg .f64 %fda1, .reg .f64 %fda2) |
| 76 | .weak .func (.reg .b32 dst) __cuda_reduxsync_b32_xor (.reg .b32 src, .reg .b32 mask) .unique |
| 303 | .weak .func (.param .align 16 .b32 dst[8]) __cuda_sm70_wmma_m16n16k16_load_a_row (.reg .u64 ptr, .reg .u32 ldm) .unique |
| 666 | .FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3) __cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) |
| 890 | .weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32 (.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, ...) |
| 1055 | .FORCE_INLINE .func (.reg .b32 warp_rank) __cuda_sm10x_get_warp_rank () |
| 1073 | .weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_readmetadata (.param .b64 ..._param_0, .param .b64 ..._param_1) |
Parameter Passing Conventions
Five distinct parameter-passing ABIs appear across the 1,080 prototypes:
Convention A -- Register-only (.reg): Used by math operations, barriers, warp sync, redux sync, video emulation. Return and input parameters are individual .reg declarations with typed names. This is the simplest and most common convention.
.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 (.reg .f32 %fa1, .reg .f32 %fa2) ;
Convention B -- Param-array with alignment (.param .align N .b32 name[K]): Used by WMMA load/mma, MMA, Hopper sub-byte MMA, Blackwell MMA. Returns an aligned array of .b32 elements. Array sizes: dst[2], dst[3], dst[4], dst[5], dst[8], mma_dst[2], mma_dst[4], mma_dst[8], ret_dst[3], ret_dst[5]. 326 prototypes use .align 16; 1 prototype (mma_shfl_f16) uses .align 8.
.weak .func (.param .align 16 .b32 d[8]) __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32
(.param .align 16 .b32 a[8], .param .align 16 .b32 b[8], .param .align 16 .b32 c[8]) ;
Convention C -- Param-scalar (.param .b64): Used exclusively by the 7 compute-sanitizer hooks. Parameters use fully-qualified names (__cuda_sanitizer_memcheck_malloc_param_0).
.weak .func (.param .b64 func_retval0) __cuda_sanitizer_memcheck_malloc
(.param .b64 __cuda_sanitizer_memcheck_malloc_param_0,
.param .b64 __cuda_sanitizer_memcheck_malloc_param_1) ;
Convention D -- Void return (): Used by WMMA store_d, tcgen05 guardrail traps, sanitizer_free. ~140 prototypes (45 .weak + 95 .FORCE_INLINE).
.weak .func () __cuda_sm70_wmma_m16n16k16_store_d_row_f32
(.reg .b64 ptr, .reg .b32 ldm, .reg .b32 sreg0, .reg .b32 sreg1, ...) ;
Convention E -- Multi-register return (.FORCE_INLINE only): Used by extended WMMA load operations (SM7x/SM72/SM8x). Returns 1--4 registers in the return position (never 8 -- 8-element returns use Convention B's .param arrays instead).
.FORCE_INLINE .func (.reg .b32 dst0, .reg .b32 dst1, .reg .b32 dst2, .reg .b32 dst3)
__cuda_sm8x_tf32_wmma_m16n16k8_load_a_row (.reg .u64 ptr, .reg .u32 ldm) ;
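For concreteness, a small Python formatter reproduces Convention A's shape from a parameter list. The helper is illustrative (not part of ptxas); the spacing matches the examples quoted above:

```python
def proto_reg(linkage, name, rets, args, unique=False):
    """Format a register-only (Convention A) PTX prototype string."""
    fmt = lambda ps: ", ".join(f".reg {ty} {reg}" for ty, reg in ps)
    tail = " .unique" if unique else ""
    return f"{linkage} .func ({fmt(rets)}) {name} ({fmt(args)}){tail} ;"

assert proto_reg(".weak", "__cuda_sm20_div_rn_f32",
                 [(".f32", "%fv1")],
                 [(".f32", "%fa1"), (".f32", "%fa2")]) == (
    ".weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32 "
    "(.reg .f32 %fa1, .reg .f32 %fa2) ;")
```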
PTX Register Types
Eight PTX register types appear across the prototypes:
| Type | Approx. Count | Usage |
|---|---|---|
| `.reg .b32` | ~2,925 | Dominant: barrier args, WMMA/MMA fragments, guardrail params |
| `.reg .u64` | ~520 | Pointers (WMMA/MMA base addresses) |
| `.reg .u32` | ~341 | Integer params (leading dimension, counts, offsets) |
| `.reg .b64` | ~246 | 64-bit bitwise (match bitmask, shuffle predicates, retval) |
| `.reg .f32` | ~106 | Float math (div, rcp, sqrt) |
| `.reg .f64` | ~70 | Double math (div, rcp, sqrt, dsqrt) |
| `.reg .pred` | ~10 | Predicate (vote output, matchsync predicate out) |
| `.reg .s32` | ~6 | Signed 32-bit (SM20 div/rem s16 return values only) |
Note: .b32 is used instead of .s32/.u32/.f32 for operations where the type interpretation is determined by the instruction rather than the register declaration (WMMA fragments, MMA accumulators, barrier IDs). The .s32 type appears only in the 4 oldest SM20 div/rem_s16/u16 prototypes (cases 0--3).
Register Naming Convention
The prototype register names encode the data type and role:
| Prefix | Meaning |
|---|---|
| `%d` | 32-bit integer return value (SM20 div/rem s16/u16 only) |
| `%a0`, `%a1` | 32-bit integer input parameters |
| `%rdv1` | 64-bit integer return value |
| `%rda1`, `%rda2` | 64-bit integer input parameters |
| `%fv1` | f32 return value |
| `%fa1`, `%fa2` | f32 input parameters |
| `%fdv1` | f64 return value |
| `%fda1`, `%fda2` | f64 input parameters |
| `%fdnum`, `%fdden` | f64 numerator/denominator (div_f64_v2 variants) |
| `dst`, `dst0..dst7` | Generic output registers (WMMA load, barriers) |
| `src`, `sreg0..sreg7` | Generic input registers (WMMA store) |
| `ptr`, `base` | 64-bit pointer registers |
| `ldm` | Leading dimension parameter (WMMA) |
| `mask` | Warp participation mask |
| `cnt` | Thread count (barrier_sync_count, barrier_arrive_count) |
| `arg0..arg3` | Generic numbered arguments |
| `parg` | Predicate argument (vote) |
| `retVal`, `dummy` | Return/placeholder (tcgen05 guardrails) |
| `activemask`, `warp_rank` | Cooperative group queries |
Buffer Allocation Sizes
sub_4DA340(size, allocator) allocates an exact-fit buffer per prototype:
| Metric | Value |
|---|---|
| Minimum allocation | 72 bytes |
| Maximum allocation | 1,200 bytes |
| Median allocation | ~130 bytes |
| Most common sizes | 132 (37x), 182 (31x), 192 (30x), 125 (29x), 118 (28x) |
| Total allocations | 1,080 |
The 45 qmemcpy cases have the largest buffers: 386--1,200 bytes. These are WMMA mma operations whose prototypes enumerate all 3--4 fragment matrices (a, b, c, d) with 4--8 elements each, producing prototype strings that exceed 900 bytes.
Case Range Layout
The 1,080 cases follow the body template registration order from sub_5D7430, roughly grouped by SM generation:
| Case Range | Count | Category | Linkage |
|---|---|---|---|
| 0--69 | 70 | SM20 IEEE math (div, rem, rcp, sqrt, bfe, bfi, dsqrt, drsqrt) | .weak |
| 70--73 | 4 | SM3x optimized division (rn_ftz/noftz f32 + slowpaths) | .weak |
| 74--75 | 2 | SM62 dp2a/dp4a | .weak |
| 76--92 | 17 | Redux sync (b32/s32/u32/f32 add/max/min/xor/and/or/abs/NaN) | .weak .unique |
| 93--~274 | ~182 | SM70 barriers (sync/arrive/red, 16 IDs x with/without count) | .weak .unique |
| ~275--~302 | ~28 | SM70 vote, shuffle, match (bfly/down/idx/up, all/any/b32/b64) | .weak .unique / .FORCE_INLINE |
| ~303--~665 | ~363 | SM70 WMMA load/store (m16n16k16, m32n8k16, m8n32k16, all types/spaces) | .weak .unique |
| ~666--~889 | ~224 | SM7x/SM72/SM8x extended WMMA (sub-byte, integer, tf32, bf16, f64) | .FORCE_INLINE |
| ~890--~964 | ~75 | SM70 WMMA store_d (all shapes/layouts/spaces/types) | .weak |
| ~965--~1048 | ~84 | SM70 WMMA mma + SM8x/SM9x/SM10x MMA (f16/f32, sub-byte, bit, sparse) | .weak |
| ~1049--~1055 | ~7 | SM10x tcgen05 guardrail traps | .weak |
| ~1056--~1060 | ~5 | SM8x direct MMA (mma_shfl, row/col f16/f32 combos) | .weak |
| ~1061--~1072 | ~12 | SM10x tcgen05 alloc/guardrails check functions + get_warp_rank + create_mask | .FORCE_INLINE / .weak |
| 1073--1079 | 7 | Compute-sanitizer hooks (readmetadata, generic, global, local, shared, malloc, free) | .weak |
Statistics
| Metric | Value |
|---|---|
| Machine code size | 362,496 bytes (0x5FF700--0x658B00) |
| Decompiled lines | 9,414 |
| Switch cases | 1,080 (case 0 through case 1079 + default) |
| Local variables declared | ~716 (IDA SSA artifacts) |
| `.weak` prototypes | 616 (571 strcpy + 45 qmemcpy) |
| `.FORCE_INLINE` prototypes | 464 |
| `.unique`-qualified prototypes | ~410 |
| `.param .align` prototypes | 327 (326 align-16, 1 align-8) |
| Void-return prototypes | ~140 |
| Predicate-using prototypes | ~10 |
Major Codegen Handlers
The four largest codegen handlers together represent ~500KB of code and cover the tensor core instruction families.
sub_5C7A50 -- WMMA.MMA Codegen (173KB)
The largest codegen handler. Generates inline PTX code for wmma.mma instructions across all variant combinations.
- Allocates a 50,000-byte buffer for code generation
- Covers shapes: m16n16k16, m32n8k16, m8n32k16
- Data types: f16, f32, bf16, tf32, s8, u8, s4, u4, b1
- Layouts: row/col for each of the A, B, C, D matrices (4 layout combinations)
- Satfinite variants for each configuration
- Address spaces: generic, global, shared
sub_5C10A0 -- MMA Codegen (120KB)
Handles the newer mma.sync API (non-WMMA). Covers the post-Volta PTX MMA instructions.
- Shapes: m8n8k4, m16n8k8, m16n8k16, m16n8k32, m16n8k64, m16n8k128, m16n8k256
- Types: f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1
- Sparse variants for sm_80+ and sm_90+ (structured sparsity 2:4)
sub_5BBC30 -- TCGen05.MMA Codegen (90KB)
Blackwell 5th-generation tensor core MMA code generation. Handles the tcgen05.mma instruction family introduced in sm_100.
- Allocates a 50,000-byte buffer
- Queries `sub_70FA00(*, 29)` to validate tcgen05 capability
- Handles standard, sparse, and warp-shared (`.ws`) variants
- Uses `sub_70F0A0` for sparse metadata parameter extraction
- Generates code for tcgen05-specific tensor memory addressing
sub_5B76D0 -- Division Codegen (64KB)
Generates inline PTX code for all div variants.
- Integer division: s16, s64, u16, u64
- Floating-point division: f32, f64 with all rounding modes (rn, rd, ru, rz)
- Flush-to-zero (ftz) variants for f32
- Checks operand type via `sub_70CA60(*(_QWORD *)(a1+1096), 0) == 21`
- Emits both fastpath and slowpath (Newton-Raphson) code sequences
OCG Intrinsic System -- sub_6C9EB0
The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical system at sub_5D1660 maps CUDA runtime helper names to integer IDs and generates inline PTX, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handlers that validate parameters and emit SASS instructions directly -- bypassing PTX entirely. The OCG table contains 44 operations across 9 categories: arithmetic, packed float, vector integer, async copy/TMA, load/store/cache, reduction/fence, tensor core, tensor memory, and synchronization.
See OCG Intrinsic System (44 Operations) for the complete builtin name table, handler functions, validation strings, SASS-level handlers, and the full five-stage lowering pipeline with operand buffer layout.
Intrinsic Families by SM Generation
Each SM generation introduces new intrinsic families while preserving all earlier ones. The per-SM intrinsic table initializer functions (sub_60AXXX cluster, registered in Map 3 of the capability dispatch) control which intrinsics are available on each target.
sm_20 -- Software IEEE Math (70 entries)
The foundation layer. 70 intrinsics providing IEEE-754-compliant software implementations of math operations that either lack hardware support or need exact rounding guarantees. All later SM targets inherit these.
- Division: `div_s16`, `div_u64`, `div_rn_f32`, `div_rn_f64_full`, etc. -- all rounding modes (rn/rd/ru/rz) and types (s16/s64/u16/u64/f32/f64)
- Reciprocal: `rcp_rn_f32`, `rcp_rn_f64`, etc. -- all rounding modes
- Square root: `sqrt_rn_f32`, `sqrt_rn_f64`, etc. -- all rounding modes
- Double-precision sqrt: `dsqrt_rn`, `dsqrt_rd`, `dsqrt_ru`, `dsqrt_rz`
- Double-precision reciprocal sqrt: `drsqrt_rn`
- Bit extract/insert: `bfe` (bit field extract), `bfi` (bit field insert)
- Remainder: `rem_s32`, `rem_u32`, `rem_s64`, `rem_u64`
Codegen handlers: sub_5B76D0 (div, 64KB), sub_5B0CD0 (rcp, 44KB), sub_5B4040 (sqrt, 49KB).
sm_3x -- Optimized Division (4 entries)
Four optimized division variants introduced on Kepler to improve throughput on common division patterns.
sm_62 -- Integer Dot Product (2 entries)
dp2a and dp4a integer dot product intrinsics introduced on Pascal (GP10x). Software emulation of the hardware instructions added in sm_61/sm_62.
sm_70 -- Volta Warp-Synchronous + WMMA (393 entries)
The largest single block. Volta introduced mandatory warp-synchronous programming with explicit sync masks and the first generation of tensor core (WMMA) instructions.
Synchronization primitives:
- `barrier_arrive`/`barrier_sync`/`barrier_red` (0--15, with/without count)
- `matchsync_all/any_b32/b64` with predicate variants
- `shflsync_bfly/down/idx/up` with predicate variants
- `votesync_all/any/ballot/uni`
- `warpsync`
WMMA (Warp Matrix Multiply-Accumulate):
- Shapes: m16n16k16, m32n8k16, m8n32k16
- Operations per shape: `load_a`, `load_b`, `load_c`, `store_d`, `mma`
- Layouts: row/col combinations for A and B matrices
- Types: f16, f32 (with satfinite optional)
- Address spaces: generic, global, shared
sm_80 -- Ampere Cache Policy (3 entries)
Three createpolicy intrinsics for L2 cache management: createpolicy_fractional, createpolicy_fractional_encode, createpolicy_range_encode.
sm_10x -- Blackwell MMA Metadata + Bit MMA (10 entries)
10 hmma_mdata/imma_mdata + bit MMA intrinsics for sm_100: metadata variants at m16n8k16/k32/k64 shapes, and 1-bit AND/XOR MMA at m8n8k128/m16n8k128/m16n8k256.
sm_8x -- Direct MMA (14 entries)
14 mma_* intrinsics for sm_8x: 12 direct MMA operations (f16/f32 accumulator x 4 layout combinations of col/row for A and B) plus mma_shfl_f16 and mma_shfl_f32 for register-to-register MMA shuffle.
sm_9x -- Sub-Byte + Bit MMA (51 entries)
51 Hopper-era intrinsics: 3 bit-XOR MMA (m8n8k128/m16n8k128/m16n8k256), 24 dense sub-byte MMA (s4/u4 at m16n8k32/m16n8k64/m8n8k32, with satfinite), 8 sparse m16n8k128, and 16 sparse m16n8k64 (with _0/_1 split variants and satfinite).
sm_10x (via __cuda_sm10x_*) -- Blackwell Tensor Memory + Guardrails (11 entries)
- 1 `create_mask_from_bit_idx_and_alloc_size_v1` helper
- 10 `tcgen05_guardrail_trap_*` intrinsics for debug validation of tensor memory operations
sm_1xx -- Bulk Copy (18 entries)
18 bulk copy and cp.async.bulk.tensor intrinsics covering 1D through 5D tensor copies with tile and im2col addressing modes, both unicast and multicast variants.
Intrinsic Lookup Flow
The lookup path from a function call in PTX source to the codegen handler follows this sequence:
PTX source: call.uni __cuda_sm70_warpsync, (%mask);
|
v
sub_5D1660 hash map (a1+1064)
key: "__cuda_sm70_warpsync"
value: integer ID (within 0x89..0x211 range)
|
v
sub_5FF700 switch(ID)
Emits: .weak .func __cuda_sm70_warpsync (.reg .u32 %a0)
|
v
sub_5D4190 named opcode hash map (a1+808)
key: PTX opcode (e.g., "shfl", "vote", "barrier")
value: codegen handler function pointer
|
v
Codegen handler (e.g., sub_5801D0 for "shfl")
Queries instruction properties via sub_70XXXX accessors
Generates inline PTX code into 50KB buffer
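The chain above can be modeled as two dictionary lookups. This is a toy model: the real maps live at a1+1064 and a1+808, the warpsync ID shown here is a placeholder within the documented 0x89..0x211 range, and only the `shfl` handler address comes from this page:

```python
# Stage 1 map (sub_5D1660, a1+1064): intrinsic name -> logical ID.
name_to_id = {"__cuda_sm70_warpsync": 0x89}       # ID is a placeholder value
# Stage 2 map (sub_5D4190, a1+808): PTX opcode -> codegen handler.
opcode_to_handler = {"shfl": "sub_5801D0"}

def lower(call_name, opcode):
    tid = name_to_id[call_name]           # sub_5FF700(tid) emits the prototype
    handler = opcode_to_handler[opcode]   # handler generates the inline PTX
    return tid, handler

assert lower("__cuda_sm70_warpsync", "shfl") == (0x89, "sub_5801D0")
```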
For OCG intrinsics on SM100+, the path bypasses PTX entirely: sub_6A97B0 matches call nodes to SASS instructions via an RB-tree, sub_6C9BC0 parses the __nv_ptx_builtin_ocg_* name into an operation enum + sub-op array, sub_6CC690 routes to type-specific handlers and assembles operands, and sub_6CB8A0 emits the final SASS instruction. See the OCG Intrinsic System page for the full five-stage pipeline breakdown with operand buffer layout and internal SASS opcode enum values.
Per-SM Intrinsic Initializers
Each SM target has its own intrinsic table initializer function registered in Map 3 of the capability dispatch (sub_607DB0). These functions control which subset of the 607 intrinsics are available on each target.
| SM | Initializer | SM | Initializer |
|---|---|---|---|
| sm_75 | sub_60A2E0 | sm_100 | sub_60A910 |
| sm_80 | sub_60A3E0 | sm_110 | sub_60AA20 |
| sm_86 | sub_60AC30 | sm_103 | sub_60A700 |
| sm_87 | sub_60AD30 | sm_120 | sub_608DF0 |
| sm_88 | sub_60AB30 | sm_121 | sub_60A4E0 |
| sm_89 | sub_60A810 | ||
| sm_90 | sub_60A5F0 |
Sub-variants (e.g., sm_100a, sm_100f) share the same initializer as their base SM since they represent the same silicon with different feature exposure levels.
Instruction Description Loader -- sub_9EE390
sub_9EE390 (3,584 bytes, 0x9EE390--0x9EF190) is the constructor for an instruction description object that feeds the register allocator's pre-coloring pass. Despite the diagnostic string "IntrinsicDescrFile=%s", the function loads instruction descriptions broadly -- not just intrinsic operations. It determines which instructions exist for the target SM, what register classes they use, and what scheduling properties apply. The sole caller is sub_991790 (pre-coloring pass, 12KB).
Invocation pattern: The pre-coloring pass checks context+1936 before calling sub_9EE390. If the descriptor for the current SM class already exists, it is reused. This means the expensive initialization happens once per SM architecture per ptxas process lifetime.
Initialization Sequence
1. Extract target properties. Read the target descriptor from `context+1584`. Compute the SM architecture class: `v111 = target_descriptor[+372] >> 12`. Read resource descriptors from option interface slots 41--44.
2. Check option 404 (IntrinsicDescrFile). Query the option interface at `context[208]`. If option 404 is set, extract the file path and log `" IntrinsicDescrFile=%s"`. This CI-internal mechanism supplies an external description file that overrides or extends the built-in instruction table. When absent, the built-in database is used.
3. Determine instruction format class. Call `sub_7DDB50(context)` (GetSmVersionIndex), subtract 1, and index into `dword_21E6330`:

   | sub_7DDB50 return | v114 | Format |
   |---|---|---|
   | 1 | 0 | basic 64-bit |
   | 2 | 1 | 128-bit |
   | 3 | 3 | extended |
   | 4 | 2 | 192-bit |
   | 5+ | 3 | extended (default) |

4. Determine SM generation class. Read `context+12` (sm_version_id), subtract 1, and index into `dword_21E5C80`. The table is an identity mapping (1--11), one entry per SM generation.
5. Construct instruction table (648 bytes). Call `sub_10AFF80` with 32 parameters including memory pool, register count, format class, description file path, architecture descriptor (16 bytes from `context+1888`), SM generation class, instruction count limits, and context flags. Follow with `sub_10B1A90` (init pass 2) and `sub_10AEF10` (finalization).
6. Apply option overrides. Options 497, 738, 739 from the option interface set register limits and allocation budget values on the instruction table sub-object at `+312`.
7. Select SM-specific instruction set descriptor. Based on v111 (SM architecture class):

   | v111 | SM range (inferred) | Alloc size | Constructor | Vtable |
   |---|---|---|---|---|
   | 5 | sm_50--sm_62 | 200 B | `sub_9CDF90` | off_23F3B00 |
   | 6 | sm_70--sm_75 | 216 B | `sub_9CE030` | off_22BB738 |
   | 7 | sm_80--sm_89 | 232 B | `sub_9CE120` | off_22B5150 |
   | 8+ | sm_90--sm_121 | 240 B | `sub_9CE190` | off_22AD230 |
   | <5 | (reuse existing) | -- | -- | -- |

   Each successor inherits the previous class and extends it with generation-specific instructions. The descriptor is stored at `context+1936` and `this+48`.
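The format-class and generation-class selections amount to small table lookups; a Python sketch with the `dword_21E6330` values as recovered above (function and table names are this page's labels):

```python
# dword_21E6330 modeled: (sub_7DDB50 return value) -> v114 format class.
FORMAT_CLASS = {1: 0, 2: 1, 3: 3, 4: 2}
FORMAT_NAME = {0: "basic 64-bit", 1: "128-bit", 2: "192-bit", 3: "extended"}

def instruction_format(sm_version_index: int) -> str:
    # Indices 5 and above fall through to the extended-format default.
    v114 = FORMAT_CLASS.get(sm_version_index, 3)
    return FORMAT_NAME[v114]

assert instruction_format(1) == "basic 64-bit"
assert instruction_format(4) == "192-bit"
assert instruction_format(9) == "extended"
```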
Object Layout
| Offset | Size | Contents |
|---|---|---|
| +0 | 8 | Vtable (off_21E6818 -> [sub_9DAA40, sub_9CADF0, sub_9CAE10, sub_9DDEE0]) |
| +8 | 8 | Back-pointer to compilation context |
| +16 | 8 | Instruction table object (648 B, built by sub_10AFF80) |
| +24 | 8 | Scheduling metadata (from sub_1BBBA60) |
| +32 | 8 | Scratch area pointer (context[198]) |
| +40 | 1 | Dirty flag (0 = clean) |
| +48 | 8 | SM-specific instruction set descriptor |
| +56--136 | -- | Resource descriptors, memory pool, sentinel, sub-allocator |
Diagnostic Strings
| String | Location | Context |
|---|---|---|
"__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
"instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
".weak .func" | sub_5FF700 (354KB) | Prototype declaration prefix |
"__cuda_sm20_*", "__cuda_sm70_*", etc. | sub_5D1660 | Intrinsic name patterns in registration |
"__cuda_sanitizer_memcheck_*" | sub_5D1660 | Compute-sanitizer integration hooks |
"__cuda_sm10x_tcgen05_guardrail_trap_*" | sub_5D1660 | Blackwell debug trap intrinsics |
" IntrinsicDescrFile=%s" | sub_9EE390 (0x9EEC9B) | Instruction description loader -- logs external description file path (option 404) |
".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| `sub_5D1660` | 46KB | Master intrinsic registration -- 607 name-to-ID entries (608 table slots) | 99% |
| `sub_5D4190` | 41KB | Opcode dispatch -- ~120 named + ~400 MMA hash entries | 99% |
| `sub_5FF700` | 354KB | Prototype generator -- .weak .func PTX declarations | 99% |
| `sub_5C7A50` | 173KB | wmma.mma codegen (all shapes/types/layouts) | 98% |
| `sub_5C10A0` | 120KB | mma codegen (mma.sync API, post-Volta) | 98% |
| `sub_5BBC30` | 90KB | tcgen05.mma codegen (Blackwell 5th-gen tensor core) | 98% |
| `sub_5B76D0` | 64KB | div codegen (integer + FP, all rounding modes) | 95% |
| `sub_5ADDC0` | 50KB | tex.grad codegen (1D/2D/3D gradient textures) | 95% |
| `sub_5B4040` | 49KB | sqrt codegen (f32/f64, all rounding modes) | 95% |
| `sub_5AB460` | 45KB | cp.async.bulk.tensor codegen (1D--5D, tile/im2col) | 95% |
| `sub_5B0CD0` | 44KB | rcp codegen (f32/f64 reciprocal, all rounding modes) | 95% |
| `sub_6C9EB0` | 13KB | OCG intrinsic table init -- see OCG Intrinsic System for full function map (27 entries) | 95% |
| `sub_6BDE20` | 7KB | Intrinsic operand expansion | 88% |
| `sub_6BEC60` | 5.8KB | LDC/S2R intrinsic handlers | 90% |
| `sub_9EE390` | 3.5KB | Instruction description loader -- builds per-SM instruction table for pre-coloring ("IntrinsicDescrFile=%s") | 92% |
| `sub_9CDF90` | 156B | SM class 5 instruction set descriptor (200B, vtable off_23F3B00) | 85% |
| `sub_9CE030` | 115B | SM class 6 instruction set descriptor (216B, extends sub_9CDF90) | 85% |
| `sub_9CE120` | 112B | SM class 7 instruction set descriptor (232B, vtable off_22B5150) | 85% |
| `sub_9CE190` | 114B | SM class 8+ instruction set descriptor (240B, vtable off_22AD230) | 85% |
| `sub_9EF190` | 1.1KB | Error handler for instruction description loader (ICE on invalid option type) | 88% |
Cross-References
- OCG Intrinsic System -- SM100+ OCG builtin table (44 operations), lowering pipeline, SASS handler map
- SM Architecture Map -- Per-SM capability dispatch tables and intrinsic initializer assignments
- Math Intrinsics -- Detailed coverage of sm_20 IEEE math intrinsic codegen (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 instruction lowering
- Sync & Warp Intrinsics -- Barrier, vote, shuffle, match, redux intrinsics
- Newton-Raphson Templates -- Software math slowpath sequences used by div/rcp/sqrt
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
- Hash Tables & Bitvectors -- Hash map infrastructure (`sub_425CA0`/`sub_426150`/`sub_426D60`)
- Mercury Encoder -- Master SASS encoder `sub_6D9690` (94KB) that encodes validated intrinsics
- SASS Instruction Encoding -- Instruction encoding infrastructure
- Pipeline Overview -- OCG-time measurement covers intrinsic lowering
Appendix: Complete Intrinsic Name Catalog (607 Entries)
Every intrinsic registered by sub_5D1660, extracted from the decompiled source. IDs are contiguous 1--607 (0x001--0x25F). The suffix after stripping the prefix encodes the operation, data type, rounding mode, address space, and optional modifiers.
__cuda_reduxsync_* -- Redux sync (17 entries, 0x001--0x011, sm_70)
| ID | Hex | Name |
|---|---|---|
| 1 | 0x001 | __cuda_reduxsync_b32_and |
| 2 | 0x002 | __cuda_reduxsync_b32_or |
| 3 | 0x003 | __cuda_reduxsync_b32_xor |
| 4 | 0x004 | __cuda_reduxsync_f32_max |
| 5 | 0x005 | __cuda_reduxsync_f32_max_NaN |
| 6 | 0x006 | __cuda_reduxsync_f32_max_abs |
| 7 | 0x007 | __cuda_reduxsync_f32_max_abs_NaN |
| 8 | 0x008 | __cuda_reduxsync_f32_min |
| 9 | 0x009 | __cuda_reduxsync_f32_min_NaN |
| 10 | 0x00A | __cuda_reduxsync_f32_min_abs |
| 11 | 0x00B | __cuda_reduxsync_f32_min_abs_NaN |
| 12 | 0x00C | __cuda_reduxsync_s32_add |
| 13 | 0x00D | __cuda_reduxsync_s32_max |
| 14 | 0x00E | __cuda_reduxsync_s32_min |
| 15 | 0x00F | __cuda_reduxsync_u32_add |
| 16 | 0x010 | __cuda_reduxsync_u32_max |
| 17 | 0x011 | __cuda_reduxsync_u32_min |
__cuda_sanitizer_memcheck_* -- Compute-sanitizer hooks (7 entries, 0x012--0x018, --)
| ID | Hex | Name |
|---|---|---|
| 18 | 0x012 | __cuda_sanitizer_memcheck_free |
| 19 | 0x013 | __cuda_sanitizer_memcheck_generic |
| 20 | 0x014 | __cuda_sanitizer_memcheck_global |
| 21 | 0x015 | __cuda_sanitizer_memcheck_local |
| 22 | 0x016 | __cuda_sanitizer_memcheck_malloc |
| 23 | 0x017 | __cuda_sanitizer_memcheck_readmetadata |
| 24 | 0x018 | __cuda_sanitizer_memcheck_shared |
__cuda_scalar_video_emulation_* -- Video emulation (7 entries, 0x019--0x01F, sm_20)
| ID | Hex | Name |
|---|---|---|
| 25 | 0x019 | __cuda_scalar_video_emulation_operandExtractAndSignExtend01 |
| 26 | 0x01A | __cuda_scalar_video_emulation_operandExtractAndSignExtend11 |
| 27 | 0x01B | __cuda_scalar_video_emulation_operandExtractAndSignExtend12 |
| 28 | 0x01C | __cuda_scalar_video_emulation_operandExtractAndSignExtend22 |
| 29 | 0x01D | __cuda_scalar_video_emulation_optionalMerge32 |
| 30 | 0x01E | __cuda_scalar_video_emulation_saturate64 |
| 31 | 0x01F | __cuda_scalar_video_emulation_secondOp64 |
__cuda_sm10x_* -- Blackwell tcgen05 guardrails + mask (11 entries, 0x020--0x02A, sm_100)
| ID | Hex | Name |
|---|---|---|
| 32 | 0x020 | __cuda_sm10x_create_mask_from_bit_idx_and_alloc_size_v1 |
| 33 | 0x021 | __cuda_sm10x_tcgen05_guardrail_trap_access_out_of_physical_bounds |
| 34 | 0x022 | __cuda_sm10x_tcgen05_guardrail_trap_allocation_granularity_invalid |
| 35 | 0x023 | __cuda_sm10x_tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc |
| 36 | 0x024 | __cuda_sm10x_tcgen05_guardrail_trap_current_warp_owner_invalid |
| 37 | 0x025 | __cuda_sm10x_tcgen05_guardrail_trap_invalid_datapath_alignment |
| 38 | 0x026 | __cuda_sm10x_tcgen05_guardrail_trap_phase_invalid_during_alloc |
| 39 | 0x027 | __cuda_sm10x_tcgen05_guardrail_trap_sp_used_in_unsupported_env |
| 40 | 0x028 | __cuda_sm10x_tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod |
| 41 | 0x029 | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_access |
| 42 | 0x02A | __cuda_sm10x_tcgen05_guardrail_trap_unallocated_columns_being_dealloced |
__cuda_sm1xx_* -- Bulk copy + cp.async.bulk.tensor (18 entries, 0x02B--0x03C, sm_100+)
| ID | Hex | Name |
|---|---|---|
| 43 | 0x02B | __cuda_sm1xx_bulk_copy_multicast |
| 44 | 0x02C | __cuda_sm1xx_bulk_copy_unicast |
| 45 | 0x02D | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_multicast |
| 46 | 0x02E | __cuda_sm1xx_cp_async_bulk_tensor_1d_tile_unicast |
| 47 | 0x02F | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_multicast |
| 48 | 0x030 | __cuda_sm1xx_cp_async_bulk_tensor_2d_tile_unicast |
| 49 | 0x031 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_multicast |
| 50 | 0x032 | __cuda_sm1xx_cp_async_bulk_tensor_3d_im2col_unicast |
| 51 | 0x033 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_multicast |
| 52 | 0x034 | __cuda_sm1xx_cp_async_bulk_tensor_3d_tile_unicast |
| 53 | 0x035 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_multicast |
| 54 | 0x036 | __cuda_sm1xx_cp_async_bulk_tensor_4d_im2col_unicast |
| 55 | 0x037 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_multicast |
| 56 | 0x038 | __cuda_sm1xx_cp_async_bulk_tensor_4d_tile_unicast |
| 57 | 0x039 | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_multicast |
| 58 | 0x03A | __cuda_sm1xx_cp_async_bulk_tensor_5d_im2col_unicast |
| 59 | 0x03B | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_multicast |
| 60 | 0x03C | __cuda_sm1xx_cp_async_bulk_tensor_5d_tile_unicast |
__cuda_sm20_* -- IEEE math (70 entries, 0x03D--0x082, sm_20)
| ID | Hex | Name |
|---|---|---|
| 61 | 0x03D | __cuda_sm20_bfe_s64_ |
| 62 | 0x03E | __cuda_sm20_bfe_u64_ |
| 63 | 0x03F | __cuda_sm20_bfi_u64_ |
| 64 | 0x040 | __cuda_sm20_dblrcp_rn_slowpath_v3 |
| 65 | 0x041 | __cuda_sm20_div_rd_f32 |
| 66 | 0x042 | __cuda_sm20_div_rd_f64_v2 |
| 67 | 0x043 | __cuda_sm20_div_rd_ftz_f32 |
| 68 | 0x044 | __cuda_sm20_div_rn_f32 |
| 69 | 0x045 | __cuda_sm20_div_rn_f64_fast |
| 70 | 0x046 | __cuda_sm20_div_rn_f64_full |
| 71 | 0x047 | __cuda_sm20_div_rn_ftz_f32 |
| 72 | 0x048 | __cuda_sm20_div_rn_ftz_f32_slowpath |
| 73 | 0x049 | __cuda_sm20_div_rn_noftz_f32_slowpath |
| 74 | 0x04A | __cuda_sm20_div_ru_f32 |
| 75 | 0x04B | __cuda_sm20_div_ru_f64_v2 |
| 76 | 0x04C | __cuda_sm20_div_ru_ftz_f32 |
| 77 | 0x04D | __cuda_sm20_div_rz_f32 |
| 78 | 0x04E | __cuda_sm20_div_rz_f64_v2 |
| 79 | 0x04F | __cuda_sm20_div_rz_ftz_f32 |
| 80 | 0x050 | __cuda_sm20_div_s16 |
| 81 | 0x051 | __cuda_sm20_div_s64 |
| 82 | 0x052 | __cuda_sm20_div_u16 |
| 83 | 0x053 | __cuda_sm20_div_u64 |
| 84 | 0x054 | __cuda_sm20_drsqrt_f64_slowpath_v2 |
| 85 | 0x055 | __cuda_sm20_drsqrt_f64_v2 |
| 86 | 0x056 | __cuda_sm20_dsqrt_rd_f64 |
| 87 | 0x057 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 |
| 88 | 0x058 | __cuda_sm20_dsqrt_rn_f64_v3 |
| 89 | 0x059 | __cuda_sm20_dsqrt_ru_f64 |
| 90 | 0x05A | __cuda_sm20_dsqrt_rz_f64 |
| 91 | 0x05B | __cuda_sm20_rcp_f64_v3 |
| 92 | 0x05C | __cuda_sm20_rcp_rd_f32 |
| 93 | 0x05D | __cuda_sm20_rcp_rd_f32_slowpath |
| 94 | 0x05E | __cuda_sm20_rcp_rd_f64 |
| 95 | 0x05F | __cuda_sm20_rcp_rd_ftz_f32 |
| 96 | 0x060 | __cuda_sm20_rcp_rd_ftz_f32_slowpath |
| 97 | 0x061 | __cuda_sm20_rcp_rn_f32 |
| 98 | 0x062 | __cuda_sm20_rcp_rn_f32_slowpath |
| 99 | 0x063 | __cuda_sm20_rcp_rn_ftz_f32 |
| 100 | 0x064 | __cuda_sm20_rcp_rn_ftz_f32_slowpath |
| 101 | 0x065 | __cuda_sm20_rcp_ru_f32 |
| 102 | 0x066 | __cuda_sm20_rcp_ru_f32_slowpath |
| 103 | 0x067 | __cuda_sm20_rcp_ru_f64 |
| 104 | 0x068 | __cuda_sm20_rcp_ru_ftz_f32 |
| 105 | 0x069 | __cuda_sm20_rcp_ru_ftz_f32_slowpath |
| 106 | 0x06A | __cuda_sm20_rcp_rz_f32 |
| 107 | 0x06B | __cuda_sm20_rcp_rz_f32_slowpath |
| 108 | 0x06C | __cuda_sm20_rcp_rz_f64 |
| 109 | 0x06D | __cuda_sm20_rcp_rz_ftz_f32 |
| 110 | 0x06E | __cuda_sm20_rcp_rz_ftz_f32_slowpath |
| 111 | 0x06F | __cuda_sm20_rem_s16 |
| 112 | 0x070 | __cuda_sm20_rem_s64 |
| 113 | 0x071 | __cuda_sm20_rem_u16 |
| 114 | 0x072 | __cuda_sm20_rem_u64 |
| 115 | 0x073 | __cuda_sm20_sqrt_rd_f32 |
| 116 | 0x074 | __cuda_sm20_sqrt_rd_f32_slowpath |
| 117 | 0x075 | __cuda_sm20_sqrt_rd_ftz_f32 |
| 118 | 0x076 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath |
| 119 | 0x077 | __cuda_sm20_sqrt_rn_f32 |
| 120 | 0x078 | __cuda_sm20_sqrt_rn_f32_slowpath |
| 121 | 0x079 | __cuda_sm20_sqrt_rn_ftz_f32 |
| 122 | 0x07A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath |
| 123 | 0x07B | __cuda_sm20_sqrt_ru_f32 |
| 124 | 0x07C | __cuda_sm20_sqrt_ru_f32_slowpath |
| 125 | 0x07D | __cuda_sm20_sqrt_ru_ftz_f32 |
| 126 | 0x07E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath |
| 127 | 0x07F | __cuda_sm20_sqrt_rz_f32 |
| 128 | 0x080 | __cuda_sm20_sqrt_rz_f32_slowpath |
| 129 | 0x081 | __cuda_sm20_sqrt_rz_ftz_f32 |
| 130 | 0x082 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath |
__cuda_sm3x_* -- Optimized division (4 entries, 0x083--0x086, sm_30)
| ID | Hex | Name |
|---|---|---|
| 131 | 0x083 | __cuda_sm3x_div_rn_ftz_f32 |
| 132 | 0x084 | __cuda_sm3x_div_rn_ftz_f32_slowpath |
| 133 | 0x085 | __cuda_sm3x_div_rn_noftz_f32 |
| 134 | 0x086 | __cuda_sm3x_div_rn_noftz_f32_slowpath |
__cuda_sm62_* -- Integer dot product (2 entries, 0x087--0x088, sm_62)
| ID | Hex | Name |
|---|---|---|
| 135 | 0x087 | __cuda_sm62_dp2a |
| 136 | 0x088 | __cuda_sm62_dp4a |
__cuda_sm70_* -- Volta sync/warp/WMMA (393 entries, 0x089--0x211, sm_70)
| ID | Hex | Name |
|---|---|---|
| 137 | 0x089 | __cuda_sm70_barrier_arrive |
| 138 | 0x08A | __cuda_sm70_barrier_arrive_0 |
| 139 | 0x08B | __cuda_sm70_barrier_arrive_0_count |
| 140 | 0x08C | __cuda_sm70_barrier_arrive_1 |
| 141 | 0x08D | __cuda_sm70_barrier_arrive_10 |
| 142 | 0x08E | __cuda_sm70_barrier_arrive_10_count |
| 143 | 0x08F | __cuda_sm70_barrier_arrive_11 |
| 144 | 0x090 | __cuda_sm70_barrier_arrive_11_count |
| 145 | 0x091 | __cuda_sm70_barrier_arrive_12 |
| 146 | 0x092 | __cuda_sm70_barrier_arrive_12_count |
| 147 | 0x093 | __cuda_sm70_barrier_arrive_13 |
| 148 | 0x094 | __cuda_sm70_barrier_arrive_13_count |
| 149 | 0x095 | __cuda_sm70_barrier_arrive_14 |
| 150 | 0x096 | __cuda_sm70_barrier_arrive_14_count |
| 151 | 0x097 | __cuda_sm70_barrier_arrive_15 |
| 152 | 0x098 | __cuda_sm70_barrier_arrive_15_count |
| 153 | 0x099 | __cuda_sm70_barrier_arrive_1_count |
| 154 | 0x09A | __cuda_sm70_barrier_arrive_2 |
| 155 | 0x09B | __cuda_sm70_barrier_arrive_2_count |
| 156 | 0x09C | __cuda_sm70_barrier_arrive_3 |
| 157 | 0x09D | __cuda_sm70_barrier_arrive_3_count |
| 158 | 0x09E | __cuda_sm70_barrier_arrive_4 |
| 159 | 0x09F | __cuda_sm70_barrier_arrive_4_count |
| 160 | 0x0A0 | __cuda_sm70_barrier_arrive_5 |
| 161 | 0x0A1 | __cuda_sm70_barrier_arrive_5_count |
| 162 | 0x0A2 | __cuda_sm70_barrier_arrive_6 |
| 163 | 0x0A3 | __cuda_sm70_barrier_arrive_6_count |
| 164 | 0x0A4 | __cuda_sm70_barrier_arrive_7 |
| 165 | 0x0A5 | __cuda_sm70_barrier_arrive_7_count |
| 166 | 0x0A6 | __cuda_sm70_barrier_arrive_8 |
| 167 | 0x0A7 | __cuda_sm70_barrier_arrive_8_count |
| 168 | 0x0A8 | __cuda_sm70_barrier_arrive_9 |
| 169 | 0x0A9 | __cuda_sm70_barrier_arrive_9_count |
| 170 | 0x0AA | __cuda_sm70_barrier_arrive_count |
| 171 | 0x0AB | __cuda_sm70_barrier_red_and |
| 172 | 0x0AC | __cuda_sm70_barrier_red_and_0 |
| 173 | 0x0AD | __cuda_sm70_barrier_red_and_0_count |
| 174 | 0x0AE | __cuda_sm70_barrier_red_and_1 |
| 175 | 0x0AF | __cuda_sm70_barrier_red_and_10 |
| 176 | 0x0B0 | __cuda_sm70_barrier_red_and_10_count |
| 177 | 0x0B1 | __cuda_sm70_barrier_red_and_11 |
| 178 | 0x0B2 | __cuda_sm70_barrier_red_and_11_count |
| 179 | 0x0B3 | __cuda_sm70_barrier_red_and_12 |
| 180 | 0x0B4 | __cuda_sm70_barrier_red_and_12_count |
| 181 | 0x0B5 | __cuda_sm70_barrier_red_and_13 |
| 182 | 0x0B6 | __cuda_sm70_barrier_red_and_13_count |
| 183 | 0x0B7 | __cuda_sm70_barrier_red_and_14 |
| 184 | 0x0B8 | __cuda_sm70_barrier_red_and_14_count |
| 185 | 0x0B9 | __cuda_sm70_barrier_red_and_15 |
| 186 | 0x0BA | __cuda_sm70_barrier_red_and_15_count |
| 187 | 0x0BB | __cuda_sm70_barrier_red_and_1_count |
| 188 | 0x0BC | __cuda_sm70_barrier_red_and_2 |
| 189 | 0x0BD | __cuda_sm70_barrier_red_and_2_count |
| 190 | 0x0BE | __cuda_sm70_barrier_red_and_3 |
| 191 | 0x0BF | __cuda_sm70_barrier_red_and_3_count |
| 192 | 0x0C0 | __cuda_sm70_barrier_red_and_4 |
| 193 | 0x0C1 | __cuda_sm70_barrier_red_and_4_count |
| 194 | 0x0C2 | __cuda_sm70_barrier_red_and_5 |
| 195 | 0x0C3 | __cuda_sm70_barrier_red_and_5_count |
| 196 | 0x0C4 | __cuda_sm70_barrier_red_and_6 |
| 197 | 0x0C5 | __cuda_sm70_barrier_red_and_6_count |
| 198 | 0x0C6 | __cuda_sm70_barrier_red_and_7 |
| 199 | 0x0C7 | __cuda_sm70_barrier_red_and_7_count |
| 200 | 0x0C8 | __cuda_sm70_barrier_red_and_8 |
| 201 | 0x0C9 | __cuda_sm70_barrier_red_and_8_count |
| 202 | 0x0CA | __cuda_sm70_barrier_red_and_9 |
| 203 | 0x0CB | __cuda_sm70_barrier_red_and_9_count |
| 204 | 0x0CC | __cuda_sm70_barrier_red_and_count |
| 205 | 0x0CD | __cuda_sm70_barrier_red_or |
| 206 | 0x0CE | __cuda_sm70_barrier_red_or_0 |
| 207 | 0x0CF | __cuda_sm70_barrier_red_or_0_count |
| 208 | 0x0D0 | __cuda_sm70_barrier_red_or_1 |
| 209 | 0x0D1 | __cuda_sm70_barrier_red_or_10 |
| 210 | 0x0D2 | __cuda_sm70_barrier_red_or_10_count |
| 211 | 0x0D3 | __cuda_sm70_barrier_red_or_11 |
| 212 | 0x0D4 | __cuda_sm70_barrier_red_or_11_count |
| 213 | 0x0D5 | __cuda_sm70_barrier_red_or_12 |
| 214 | 0x0D6 | __cuda_sm70_barrier_red_or_12_count |
| 215 | 0x0D7 | __cuda_sm70_barrier_red_or_13 |
| 216 | 0x0D8 | __cuda_sm70_barrier_red_or_13_count |
| 217 | 0x0D9 | __cuda_sm70_barrier_red_or_14 |
| 218 | 0x0DA | __cuda_sm70_barrier_red_or_14_count |
| 219 | 0x0DB | __cuda_sm70_barrier_red_or_15 |
| 220 | 0x0DC | __cuda_sm70_barrier_red_or_15_count |
| 221 | 0x0DD | __cuda_sm70_barrier_red_or_1_count |
| 222 | 0x0DE | __cuda_sm70_barrier_red_or_2 |
| 223 | 0x0DF | __cuda_sm70_barrier_red_or_2_count |
| 224 | 0x0E0 | __cuda_sm70_barrier_red_or_3 |
| 225 | 0x0E1 | __cuda_sm70_barrier_red_or_3_count |
| 226 | 0x0E2 | __cuda_sm70_barrier_red_or_4 |
| 227 | 0x0E3 | __cuda_sm70_barrier_red_or_4_count |
| 228 | 0x0E4 | __cuda_sm70_barrier_red_or_5 |
| 229 | 0x0E5 | __cuda_sm70_barrier_red_or_5_count |
| 230 | 0x0E6 | __cuda_sm70_barrier_red_or_6 |
| 231 | 0x0E7 | __cuda_sm70_barrier_red_or_6_count |
| 232 | 0x0E8 | __cuda_sm70_barrier_red_or_7 |
| 233 | 0x0E9 | __cuda_sm70_barrier_red_or_7_count |
| 234 | 0x0EA | __cuda_sm70_barrier_red_or_8 |
| 235 | 0x0EB | __cuda_sm70_barrier_red_or_8_count |
| 236 | 0x0EC | __cuda_sm70_barrier_red_or_9 |
| 237 | 0x0ED | __cuda_sm70_barrier_red_or_9_count |
| 238 | 0x0EE | __cuda_sm70_barrier_red_or_count |
| 239 | 0x0EF | __cuda_sm70_barrier_red_popc |
| 240 | 0x0F0 | __cuda_sm70_barrier_red_popc_0 |
| 241 | 0x0F1 | __cuda_sm70_barrier_red_popc_0_count |
| 242 | 0x0F2 | __cuda_sm70_barrier_red_popc_1 |
| 243 | 0x0F3 | __cuda_sm70_barrier_red_popc_10 |
| 244 | 0x0F4 | __cuda_sm70_barrier_red_popc_10_count |
| 245 | 0x0F5 | __cuda_sm70_barrier_red_popc_11 |
| 246 | 0x0F6 | __cuda_sm70_barrier_red_popc_11_count |
| 247 | 0x0F7 | __cuda_sm70_barrier_red_popc_12 |
| 248 | 0x0F8 | __cuda_sm70_barrier_red_popc_12_count |
| 249 | 0x0F9 | __cuda_sm70_barrier_red_popc_13 |
| 250 | 0x0FA | __cuda_sm70_barrier_red_popc_13_count |
| 251 | 0x0FB | __cuda_sm70_barrier_red_popc_14 |
| 252 | 0x0FC | __cuda_sm70_barrier_red_popc_14_count |
| 253 | 0x0FD | __cuda_sm70_barrier_red_popc_15 |
| 254 | 0x0FE | __cuda_sm70_barrier_red_popc_15_count |
| 255 | 0x0FF | __cuda_sm70_barrier_red_popc_1_count |
| 256 | 0x100 | __cuda_sm70_barrier_red_popc_2 |
| 257 | 0x101 | __cuda_sm70_barrier_red_popc_2_count |
| 258 | 0x102 | __cuda_sm70_barrier_red_popc_3 |
| 259 | 0x103 | __cuda_sm70_barrier_red_popc_3_count |
| 260 | 0x104 | __cuda_sm70_barrier_red_popc_4 |
| 261 | 0x105 | __cuda_sm70_barrier_red_popc_4_count |
| 262 | 0x106 | __cuda_sm70_barrier_red_popc_5 |
| 263 | 0x107 | __cuda_sm70_barrier_red_popc_5_count |
| 264 | 0x108 | __cuda_sm70_barrier_red_popc_6 |
| 265 | 0x109 | __cuda_sm70_barrier_red_popc_6_count |
| 266 | 0x10A | __cuda_sm70_barrier_red_popc_7 |
| 267 | 0x10B | __cuda_sm70_barrier_red_popc_7_count |
| 268 | 0x10C | __cuda_sm70_barrier_red_popc_8 |
| 269 | 0x10D | __cuda_sm70_barrier_red_popc_8_count |
| 270 | 0x10E | __cuda_sm70_barrier_red_popc_9 |
| 271 | 0x10F | __cuda_sm70_barrier_red_popc_9_count |
| 272 | 0x110 | __cuda_sm70_barrier_red_popc_count |
| 273 | 0x111 | __cuda_sm70_barrier_sync |
| 274 | 0x112 | __cuda_sm70_barrier_sync_0 |
| 275 | 0x113 | __cuda_sm70_barrier_sync_0_count |
| 276 | 0x114 | __cuda_sm70_barrier_sync_1 |
| 277 | 0x115 | __cuda_sm70_barrier_sync_10 |
| 278 | 0x116 | __cuda_sm70_barrier_sync_10_count |
| 279 | 0x117 | __cuda_sm70_barrier_sync_11 |
| 280 | 0x118 | __cuda_sm70_barrier_sync_11_count |
| 281 | 0x119 | __cuda_sm70_barrier_sync_12 |
| 282 | 0x11A | __cuda_sm70_barrier_sync_12_count |
| 283 | 0x11B | __cuda_sm70_barrier_sync_13 |
| 284 | 0x11C | __cuda_sm70_barrier_sync_13_count |
| 285 | 0x11D | __cuda_sm70_barrier_sync_14 |
| 286 | 0x11E | __cuda_sm70_barrier_sync_14_count |
| 287 | 0x11F | __cuda_sm70_barrier_sync_15 |
| 288 | 0x120 | __cuda_sm70_barrier_sync_15_count |
| 289 | 0x121 | __cuda_sm70_barrier_sync_1_count |
| 290 | 0x122 | __cuda_sm70_barrier_sync_2 |
| 291 | 0x123 | __cuda_sm70_barrier_sync_2_count |
| 292 | 0x124 | __cuda_sm70_barrier_sync_3 |
| 293 | 0x125 | __cuda_sm70_barrier_sync_3_count |
| 294 | 0x126 | __cuda_sm70_barrier_sync_4 |
| 295 | 0x127 | __cuda_sm70_barrier_sync_4_count |
| 296 | 0x128 | __cuda_sm70_barrier_sync_5 |
| 297 | 0x129 | __cuda_sm70_barrier_sync_5_count |
| 298 | 0x12A | __cuda_sm70_barrier_sync_6 |
| 299 | 0x12B | __cuda_sm70_barrier_sync_6_count |
| 300 | 0x12C | __cuda_sm70_barrier_sync_7 |
| 301 | 0x12D | __cuda_sm70_barrier_sync_7_count |
| 302 | 0x12E | __cuda_sm70_barrier_sync_8 |
| 303 | 0x12F | __cuda_sm70_barrier_sync_8_count |
| 304 | 0x130 | __cuda_sm70_barrier_sync_9 |
| 305 | 0x131 | __cuda_sm70_barrier_sync_9_count |
| 306 | 0x132 | __cuda_sm70_barrier_sync_count |
| 307 | 0x133 | __cuda_sm70_matchsync_all_b32 |
| 308 | 0x134 | __cuda_sm70_matchsync_all_b32_p |
| 309 | 0x135 | __cuda_sm70_matchsync_all_b64 |
| 310 | 0x136 | __cuda_sm70_matchsync_all_b64_p |
| 311 | 0x137 | __cuda_sm70_matchsync_any_b32 |
| 312 | 0x138 | __cuda_sm70_matchsync_any_b64 |
| 313 | 0x139 | __cuda_sm70_shflsync_bfly |
| 314 | 0x13A | __cuda_sm70_shflsync_bfly_p |
| 315 | 0x13B | __cuda_sm70_shflsync_down |
| 316 | 0x13C | __cuda_sm70_shflsync_down_p |
| 317 | 0x13D | __cuda_sm70_shflsync_idx |
| 318 | 0x13E | __cuda_sm70_shflsync_idx_p |
| 319 | 0x13F | __cuda_sm70_shflsync_up |
| 320 | 0x140 | __cuda_sm70_shflsync_up_p |
| 321 | 0x141 | __cuda_sm70_votesync_all |
| 322 | 0x142 | __cuda_sm70_votesync_any |
| 323 | 0x143 | __cuda_sm70_votesync_ballot |
| 324 | 0x144 | __cuda_sm70_votesync_uni |
| 325 | 0x145 | __cuda_sm70_warpsync |
| 326 | 0x146 | __cuda_sm70_wmma_m16n16k16_load_a_col |
| 327 | 0x147 | __cuda_sm70_wmma_m16n16k16_load_a_col_global |
| 328 | 0x148 | __cuda_sm70_wmma_m16n16k16_load_a_col_shared |
| 329 | 0x149 | __cuda_sm70_wmma_m16n16k16_load_a_row |
| 330 | 0x14A | __cuda_sm70_wmma_m16n16k16_load_a_row_global |
| 331 | 0x14B | __cuda_sm70_wmma_m16n16k16_load_a_row_shared |
| 332 | 0x14C | __cuda_sm70_wmma_m16n16k16_load_b_col |
| 333 | 0x14D | __cuda_sm70_wmma_m16n16k16_load_b_col_global |
| 334 | 0x14E | __cuda_sm70_wmma_m16n16k16_load_b_col_shared |
| 335 | 0x14F | __cuda_sm70_wmma_m16n16k16_load_b_row |
| 336 | 0x150 | __cuda_sm70_wmma_m16n16k16_load_b_row_global |
| 337 | 0x151 | __cuda_sm70_wmma_m16n16k16_load_b_row_shared |
| 338 | 0x152 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16 |
| 339 | 0x153 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_global |
| 340 | 0x154 | __cuda_sm70_wmma_m16n16k16_load_c_col_f16_shared |
| 341 | 0x155 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32 |
| 342 | 0x156 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_global |
| 343 | 0x157 | __cuda_sm70_wmma_m16n16k16_load_c_col_f32_shared |
| 344 | 0x158 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16 |
| 345 | 0x159 | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_global |
| 346 | 0x15A | __cuda_sm70_wmma_m16n16k16_load_c_row_f16_shared |
| 347 | 0x15B | __cuda_sm70_wmma_m16n16k16_load_c_row_f32 |
| 348 | 0x15C | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_global |
| 349 | 0x15D | __cuda_sm70_wmma_m16n16k16_load_c_row_f32_shared |
| 350 | 0x15E | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16 |
| 351 | 0x15F | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f16_satfinite |
| 352 | 0x160 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32 |
| 353 | 0x161 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f16_f32_satfinite |
| 354 | 0x162 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16 |
| 355 | 0x163 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f16_satfinite |
| 356 | 0x164 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32 |
| 357 | 0x165 | __cuda_sm70_wmma_m16n16k16_mma_col_col_f32_f32_satfinite |
| 358 | 0x166 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16 |
| 359 | 0x167 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f16_satfinite |
| 360 | 0x168 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32 |
| 361 | 0x169 | __cuda_sm70_wmma_m16n16k16_mma_col_row_f16_f32_satfinite |
| 362 | 0x16A | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16 |
| 363 | 0x16B | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f16_satfinite |
| 364 | 0x16C | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32 |
| 365 | 0x16D | __cuda_sm70_wmma_m16n16k16_mma_col_row_f32_f32_satfinite |
| 366 | 0x16E | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16 |
| 367 | 0x16F | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f16_satfinite |
| 368 | 0x170 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32 |
| 369 | 0x171 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f16_f32_satfinite |
| 370 | 0x172 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16 |
| 371 | 0x173 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f16_satfinite |
| 372 | 0x174 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32 |
| 373 | 0x175 | __cuda_sm70_wmma_m16n16k16_mma_row_col_f32_f32_satfinite |
| 374 | 0x176 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16 |
| 375 | 0x177 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f16_satfinite |
| 376 | 0x178 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32 |
| 377 | 0x179 | __cuda_sm70_wmma_m16n16k16_mma_row_row_f16_f32_satfinite |
| 378 | 0x17A | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16 |
| 379 | 0x17B | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f16_satfinite |
| 380 | 0x17C | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32 |
| 381 | 0x17D | __cuda_sm70_wmma_m16n16k16_mma_row_row_f32_f32_satfinite |
| 382 | 0x17E | __cuda_sm70_wmma_m16n16k16_store_d_col_f16 |
| 383 | 0x17F | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_global |
| 384 | 0x180 | __cuda_sm70_wmma_m16n16k16_store_d_col_f16_shared |
| 385 | 0x181 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32 |
| 386 | 0x182 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_global |
| 387 | 0x183 | __cuda_sm70_wmma_m16n16k16_store_d_col_f32_shared |
| 388 | 0x184 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16 |
| 389 | 0x185 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_global |
| 390 | 0x186 | __cuda_sm70_wmma_m16n16k16_store_d_row_f16_shared |
| 391 | 0x187 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32 |
| 392 | 0x188 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_global |
| 393 | 0x189 | __cuda_sm70_wmma_m16n16k16_store_d_row_f32_shared |
| 394 | 0x18A | __cuda_sm70_wmma_m32n8k16_load_a_col |
| 395 | 0x18B | __cuda_sm70_wmma_m32n8k16_load_a_col_global |
| 396 | 0x18C | __cuda_sm70_wmma_m32n8k16_load_a_col_shared |
| 397 | 0x18D | __cuda_sm70_wmma_m32n8k16_load_a_row |
| 398 | 0x18E | __cuda_sm70_wmma_m32n8k16_load_a_row_global |
| 399 | 0x18F | __cuda_sm70_wmma_m32n8k16_load_a_row_shared |
| 400 | 0x190 | __cuda_sm70_wmma_m32n8k16_load_b_col |
| 401 | 0x191 | __cuda_sm70_wmma_m32n8k16_load_b_col_global |
| 402 | 0x192 | __cuda_sm70_wmma_m32n8k16_load_b_col_shared |
| 403 | 0x193 | __cuda_sm70_wmma_m32n8k16_load_b_row |
| 404 | 0x194 | __cuda_sm70_wmma_m32n8k16_load_b_row_global |
| 405 | 0x195 | __cuda_sm70_wmma_m32n8k16_load_b_row_shared |
| 406 | 0x196 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16 |
| 407 | 0x197 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_global |
| 408 | 0x198 | __cuda_sm70_wmma_m32n8k16_load_c_col_f16_shared |
| 409 | 0x199 | __cuda_sm70_wmma_m32n8k16_load_c_col_f32 |
| 410 | 0x19A | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_global |
| 411 | 0x19B | __cuda_sm70_wmma_m32n8k16_load_c_col_f32_shared |
| 412 | 0x19C | __cuda_sm70_wmma_m32n8k16_load_c_row_f16 |
| 413 | 0x19D | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_global |
| 414 | 0x19E | __cuda_sm70_wmma_m32n8k16_load_c_row_f16_shared |
| 415 | 0x19F | __cuda_sm70_wmma_m32n8k16_load_c_row_f32 |
| 416 | 0x1A0 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_global |
| 417 | 0x1A1 | __cuda_sm70_wmma_m32n8k16_load_c_row_f32_shared |
| 418 | 0x1A2 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16 |
| 419 | 0x1A3 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f16_satfinite |
| 420 | 0x1A4 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32 |
| 421 | 0x1A5 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f16_f32_satfinite |
| 422 | 0x1A6 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16 |
| 423 | 0x1A7 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f16_satfinite |
| 424 | 0x1A8 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32 |
| 425 | 0x1A9 | __cuda_sm70_wmma_m32n8k16_mma_col_col_f32_f32_satfinite |
| 426 | 0x1AA | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16 |
| 427 | 0x1AB | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f16_satfinite |
| 428 | 0x1AC | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32 |
| 429 | 0x1AD | __cuda_sm70_wmma_m32n8k16_mma_col_row_f16_f32_satfinite |
| 430 | 0x1AE | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16 |
| 431 | 0x1AF | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f16_satfinite |
| 432 | 0x1B0 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32 |
| 433 | 0x1B1 | __cuda_sm70_wmma_m32n8k16_mma_col_row_f32_f32_satfinite |
| 434 | 0x1B2 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16 |
| 435 | 0x1B3 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f16_satfinite |
| 436 | 0x1B4 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32 |
| 437 | 0x1B5 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f16_f32_satfinite |
| 438 | 0x1B6 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16 |
| 439 | 0x1B7 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f16_satfinite |
| 440 | 0x1B8 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32 |
| 441 | 0x1B9 | __cuda_sm70_wmma_m32n8k16_mma_row_col_f32_f32_satfinite |
| 442 | 0x1BA | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16 |
| 443 | 0x1BB | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f16_satfinite |
| 444 | 0x1BC | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32 |
| 445 | 0x1BD | __cuda_sm70_wmma_m32n8k16_mma_row_row_f16_f32_satfinite |
| 446 | 0x1BE | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16 |
| 447 | 0x1BF | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f16_satfinite |
| 448 | 0x1C0 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32 |
| 449 | 0x1C1 | __cuda_sm70_wmma_m32n8k16_mma_row_row_f32_f32_satfinite |
| 450 | 0x1C2 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16 |
| 451 | 0x1C3 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_global |
| 452 | 0x1C4 | __cuda_sm70_wmma_m32n8k16_store_d_col_f16_shared |
| 453 | 0x1C5 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32 |
| 454 | 0x1C6 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_global |
| 455 | 0x1C7 | __cuda_sm70_wmma_m32n8k16_store_d_col_f32_shared |
| 456 | 0x1C8 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16 |
| 457 | 0x1C9 | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_global |
| 458 | 0x1CA | __cuda_sm70_wmma_m32n8k16_store_d_row_f16_shared |
| 459 | 0x1CB | __cuda_sm70_wmma_m32n8k16_store_d_row_f32 |
| 460 | 0x1CC | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_global |
| 461 | 0x1CD | __cuda_sm70_wmma_m32n8k16_store_d_row_f32_shared |
| 462 | 0x1CE | __cuda_sm70_wmma_m8n32k16_load_a_col |
| 463 | 0x1CF | __cuda_sm70_wmma_m8n32k16_load_a_col_global |
| 464 | 0x1D0 | __cuda_sm70_wmma_m8n32k16_load_a_col_shared |
| 465 | 0x1D1 | __cuda_sm70_wmma_m8n32k16_load_a_row |
| 466 | 0x1D2 | __cuda_sm70_wmma_m8n32k16_load_a_row_global |
| 467 | 0x1D3 | __cuda_sm70_wmma_m8n32k16_load_a_row_shared |
| 468 | 0x1D4 | __cuda_sm70_wmma_m8n32k16_load_b_col |
| 469 | 0x1D5 | __cuda_sm70_wmma_m8n32k16_load_b_col_global |
| 470 | 0x1D6 | __cuda_sm70_wmma_m8n32k16_load_b_col_shared |
| 471 | 0x1D7 | __cuda_sm70_wmma_m8n32k16_load_b_row |
| 472 | 0x1D8 | __cuda_sm70_wmma_m8n32k16_load_b_row_global |
| 473 | 0x1D9 | __cuda_sm70_wmma_m8n32k16_load_b_row_shared |
| 474 | 0x1DA | __cuda_sm70_wmma_m8n32k16_load_c_col_f16 |
| 475 | 0x1DB | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_global |
| 476 | 0x1DC | __cuda_sm70_wmma_m8n32k16_load_c_col_f16_shared |
| 477 | 0x1DD | __cuda_sm70_wmma_m8n32k16_load_c_col_f32 |
| 478 | 0x1DE | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_global |
| 479 | 0x1DF | __cuda_sm70_wmma_m8n32k16_load_c_col_f32_shared |
| 480 | 0x1E0 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16 |
| 481 | 0x1E1 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_global |
| 482 | 0x1E2 | __cuda_sm70_wmma_m8n32k16_load_c_row_f16_shared |
| 483 | 0x1E3 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32 |
| 484 | 0x1E4 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_global |
| 485 | 0x1E5 | __cuda_sm70_wmma_m8n32k16_load_c_row_f32_shared |
| 486 | 0x1E6 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16 |
| 487 | 0x1E7 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f16_satfinite |
| 488 | 0x1E8 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32 |
| 489 | 0x1E9 | __cuda_sm70_wmma_m8n32k16_mma_col_col_f16_f32_satfinite |
| 490 | 0x1EA | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16 |
| 491 | 0x1EB | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f16_satfinite |
| 492 | 0x1EC | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32 |
| 493 | 0x1ED | __cuda_sm70_wmma_m8n32k16_mma_col_col_f32_f32_satfinite |
| 494 | 0x1EE | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16 |
| 495 | 0x1EF | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f16_satfinite |
| 496 | 0x1F0 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32 |
| 497 | 0x1F1 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f16_f32_satfinite |
| 498 | 0x1F2 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16 |
| 499 | 0x1F3 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f16_satfinite |
| 500 | 0x1F4 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32 |
| 501 | 0x1F5 | __cuda_sm70_wmma_m8n32k16_mma_col_row_f32_f32_satfinite |
| 502 | 0x1F6 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16 |
| 503 | 0x1F7 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f16_satfinite |
| 504 | 0x1F8 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32 |
| 505 | 0x1F9 | __cuda_sm70_wmma_m8n32k16_mma_row_col_f16_f32_satfinite |
| 506 | 0x1FA | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16 |
| 507 | 0x1FB | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f16_satfinite |
| 508 | 0x1FC | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32 |
| 509 | 0x1FD | __cuda_sm70_wmma_m8n32k16_mma_row_col_f32_f32_satfinite |
| 510 | 0x1FE | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16 |
| 511 | 0x1FF | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f16_satfinite |
| 512 | 0x200 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32 |
| 513 | 0x201 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f16_f32_satfinite |
| 514 | 0x202 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16 |
| 515 | 0x203 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f16_satfinite |
| 516 | 0x204 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32 |
| 517 | 0x205 | __cuda_sm70_wmma_m8n32k16_mma_row_row_f32_f32_satfinite |
| 518 | 0x206 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16 |
| 519 | 0x207 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_global |
| 520 | 0x208 | __cuda_sm70_wmma_m8n32k16_store_d_col_f16_shared |
| 521 | 0x209 | __cuda_sm70_wmma_m8n32k16_store_d_col_f32 |
| 522 | 0x20A | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_global |
| 523 | 0x20B | __cuda_sm70_wmma_m8n32k16_store_d_col_f32_shared |
| 524 | 0x20C | __cuda_sm70_wmma_m8n32k16_store_d_row_f16 |
| 525 | 0x20D | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_global |
| 526 | 0x20E | __cuda_sm70_wmma_m8n32k16_store_d_row_f16_shared |
| 527 | 0x20F | __cuda_sm70_wmma_m8n32k16_store_d_row_f32 |
| 528 | 0x210 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_global |
| 529 | 0x211 | __cuda_sm70_wmma_m8n32k16_store_d_row_f32_shared |
__cuda_sm80_* -- Ampere createpolicy (3 entries, 0x212--0x214, sm_80)
| ID | Hex | Name |
|---|---|---|
| 530 | 0x212 | __cuda_sm80_createpolicy_fractional |
| 531 | 0x213 | __cuda_sm80_createpolicy_fractional_encode |
| 532 | 0x214 | __cuda_sm80_createpolicy_range_encode |
__cuda_sm_10x_* -- Blackwell hmma/imma/bit MMA (10 entries, 0x215--0x21E, sm_100)
| ID | Hex | Name |
|---|---|---|
| 533 | 0x215 | __cuda_sm_10x_hmma_mdata_m16n8k16 |
| 534 | 0x216 | __cuda_sm_10x_hmma_mdata_m16n8k32 |
| 535 | 0x217 | __cuda_sm_10x_imma_mdata_m16n8k32 |
| 536 | 0x218 | __cuda_sm_10x_imma_mdata_m16n8k64 |
| 537 | 0x219 | __cuda_sm_10x_mma_bit_internal_and_m16n8k128 |
| 538 | 0x21A | __cuda_sm_10x_mma_bit_internal_and_m16n8k256 |
| 539 | 0x21B | __cuda_sm_10x_mma_bit_internal_and_m8n8k128 |
| 540 | 0x21C | __cuda_sm_10x_mma_bit_internal_xor_m16n8k128 |
| 541 | 0x21D | __cuda_sm_10x_mma_bit_internal_xor_m16n8k256 |
| 542 | 0x21E | __cuda_sm_10x_mma_bit_internal_xor_m8n8k128 |
__cuda_sm_8x_* -- Direct MMA + shfl (14 entries, 0x21F--0x22C, sm_80+)
| ID | Hex | Name |
|---|---|---|
| 543 | 0x21F | __cuda_sm_8x_mma_col_col_f16_f16_f16_f16 |
| 544 | 0x220 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f16 |
| 545 | 0x221 | __cuda_sm_8x_mma_col_col_f32_f16_f16_f32 |
| 546 | 0x222 | __cuda_sm_8x_mma_col_row_f16_f16_f16_f16 |
| 547 | 0x223 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f16 |
| 548 | 0x224 | __cuda_sm_8x_mma_col_row_f32_f16_f16_f32 |
| 549 | 0x225 | __cuda_sm_8x_mma_row_col_f16_f16_f16_f16 |
| 550 | 0x226 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f16 |
| 551 | 0x227 | __cuda_sm_8x_mma_row_col_f32_f16_f16_f32 |
| 552 | 0x228 | __cuda_sm_8x_mma_row_row_f16_f16_f16_f16 |
| 553 | 0x229 | __cuda_sm_8x_mma_row_row_f32_f16_f16_f16 |
| 554 | 0x22A | __cuda_sm_8x_mma_row_row_f32_f16_f16_f32 |
| 555 | 0x22B | __cuda_sm_8x_mma_shfl_f16 |
| 556 | 0x22C | __cuda_sm_8x_mma_shfl_f32 |
__cuda_sm_9x_* -- Hopper sub-byte/bit MMA (51 entries, 0x22D--0x25F, sm_90)
| ID | Hex | Name |
|---|---|---|
| 557 | 0x22D | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| 558 | 0x22E | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
| 559 | 0x22F | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| 560 | 0x230 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4 |
| 561 | 0x231 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_s4_satfinite |
| 562 | 0x232 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4 |
| 563 | 0x233 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_s4_u4_satfinite |
| 564 | 0x234 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4 |
| 565 | 0x235 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_s4_satfinite |
| 566 | 0x236 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4 |
| 567 | 0x237 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k32_u4_u4_satfinite |
| 568 | 0x238 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4 |
| 569 | 0x239 | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_s4_satfinite |
| 570 | 0x23A | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4 |
| 571 | 0x23B | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_s4_u4_satfinite |
| 572 | 0x23C | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4 |
| 573 | 0x23D | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_s4_satfinite |
| 574 | 0x23E | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4 |
| 575 | 0x23F | __cuda_sm_9x_mma_sub_byte_internal_m16n8k64_u4_u4_satfinite |
| 576 | 0x240 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4 |
| 577 | 0x241 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_s4_satfinite |
| 578 | 0x242 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4 |
| 579 | 0x243 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_s4_u4_satfinite |
| 580 | 0x244 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4 |
| 581 | 0x245 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_s4_satfinite |
| 582 | 0x246 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4 |
| 583 | 0x247 | __cuda_sm_9x_mma_sub_byte_internal_m8n8k32_u4_u4_satfinite |
| 584 | 0x248 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4 |
| 585 | 0x249 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4_satfinite |
| 586 | 0x24A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4 |
| 587 | 0x24B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_u4_satfinite |
| 588 | 0x24C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4 |
| 589 | 0x24D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_s4_satfinite |
| 590 | 0x24E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4 |
| 591 | 0x24F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_u4_u4_satfinite |
| 592 | 0x250 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_0 |
| 593 | 0x251 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_1 |
| 594 | 0x252 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_0 |
| 595 | 0x253 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_s4_satfinite_1 |
| 596 | 0x254 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_0 |
| 597 | 0x255 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_1 |
| 598 | 0x256 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_0 |
| 599 | 0x257 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_s4_u4_satfinite_1 |
| 600 | 0x258 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_0 |
| 601 | 0x259 | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_1 |
| 602 | 0x25A | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_0 |
| 603 | 0x25B | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_s4_satfinite_1 |
| 604 | 0x25C | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_0 |
| 605 | 0x25D | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_1 |
| 606 | 0x25E | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_0 |
| 607 | 0x25F | __cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k64_u4_u4_satfinite_1 |
OCG Intrinsic System (44 Builtin Operations, SM100+)
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The OCG (Optimizing Code Generator) intrinsic subsystem is a separate, parallel dispatch mechanism for SM100+ builtin operations. While the classical intrinsic system at sub_5D1660 maps __cuda_* runtime helper names to integer IDs and emits inline PTX code via body templates, the OCG system maps __nv_ptx_builtin_ocg_* function names to type-specific handler functions that validate parameters and emit SASS instructions directly -- bypassing the PTX intermediate step entirely.
| OCG intrinsic table | sub_6C9EB0 (13KB) -- __nv_ptx_builtin_ocg_* dispatch for SM100+ |
| OCG router | sub_6CC690 (22KB) -- routes OCG calls to type-specific handlers |
| OCG name resolver | sub_6C9BC0 -- resolves operation names to internal enums |
Initialization -- sub_6C9EB0
sub_6C9EB0 initializes a 10,664-byte (0x29A8) lookup table and sets the vtable pointer to off_202CF48. The operation name prefix is stored at *(_QWORD *)(a1 + 120) = "__nv_ptx_builtin_ocg_". The table contains 44 operations in 248-byte slots starting at offset 128. Each slot holds the operation name followed by up to 30 sub-operation/modifier string pointers (unused slots are NULL from the memset).
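The recovered layout constants can be restated as a small sketch. The constants come from the decompiled initializer; the helper name and docstring are ours, not recovered symbols:

```python
# Sketch of the OCG intrinsic table layout recovered from sub_6C9EB0.
TABLE_SIZE = 0x29A8      # 10,664 bytes, zeroed by memset
SLOT_BASE = 128          # first operation slot follows the table header
SLOT_STRIDE = 248        # name pointer + up to 30 sub-op pointers (31 * 8 bytes)

def slot_offset(slot: int) -> int:
    """Byte offset of an operation slot within the OCG table."""
    return SLOT_BASE + slot * SLOT_STRIDE

# e.g. slot 10 (cp_async_bulk) lands at offset 2608, matching the tables below
```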
OCG Builtin Name Table -- Complete (44 Operations)
The complete OCG builtin table extracted from sub_6C9EB0. Thirty numeric string pointers that IDA left unresolved were recovered by reading null-terminated strings from the ptxas binary at addr - 0x400000 (ELF LOAD virtual address base). The table size 0x29A8 and 248-byte slot stride are verified against the memset in the decompiled code.
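The recovery step maps a pointer value seen in decompiled code to a file offset by subtracting the ELF LOAD base. A minimal sketch, assuming a single LOAD segment at 0x400000 as described above; the function is ours:

```python
# Read a null-terminated ASCII string from the ptxas image at a given
# virtual address, using file offset = vaddr - 0x400000 (ELF LOAD base).
ELF_BASE = 0x400000

def read_cstring(binary: bytes, vaddr: int, max_len: int = 256) -> str:
    off = vaddr - ELF_BASE
    end = binary.index(b"\x00", off, off + max_len)  # find the terminator
    return binary[off:end].decode("ascii")
```

This is how the thirty unresolved numeric pointers were turned back into sub-operation strings.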
Arithmetic and ALU Operations
| Slot | Offset | OCG Name | Sub-Operations / Types | SASS Equivalent |
|---|---|---|---|---|
| 0 | 128 | add | s32, f32, s64, f64, sat | IADD3 / FADD |
| 15 | 3848 | viadd | 32, f16x2 | VIADD |
| 28 | 7072 | mnmx | s32, u32, s64, u64 | IMNMX / FMNMX |
Vector Integer Operations (SM100+ VIMNMX family)
All six vector integer operations share the same type set (s32, u32, s16x2, u16x2), with an optional relu modifier for ReLU clamping.
| Slot | Offset | OCG Name | SASS Equivalent | Description |
|---|---|---|---|---|
| 16 | 4096 | viaddmax | VIADDMNMX | fused add + max |
| 17 | 4344 | viaddmin | VIADDMNMX | fused add + min |
| 18 | 4592 | vimax | VIMNMX | vector integer max |
| 19 | 4840 | vimin | VIMNMX | vector integer min |
| 20 | 5088 | vimax3 | VIMNMX3 | 3-way vector integer max |
| 21 | 5336 | vimin3 | VIMNMX3 | 3-way vector integer min |
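Assuming the '_'-joined naming convention used by the OCG name parser, the VIMNMX family above expands to 48 candidate intrinsic names. This enumeration is an inference from the table and the shared type set, not a list of extracted strings:

```python
# Hypothetical surface-name expansion for the six vector integer operations.
# The prefix and sub-op spellings come from the recovered table; the idea
# that every (op, type, relu) combination exists is an assumption.
PREFIX = "__nv_ptx_builtin_ocg_"
OPS = ["viaddmax", "viaddmin", "vimax", "vimin", "vimax3", "vimin3"]
TYPES = ["s32", "u32", "s16x2", "u16x2"]

def expand(op: str):
    for ty in TYPES:
        yield f"{PREFIX}{op}_{ty}"
        yield f"{PREFIX}{op}_{ty}_relu"   # optional relu modifier

names = [n for op in OPS for n in expand(op)]
# 6 ops x 4 types x (plain / relu) -> 48 candidate names
```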
Packed Float Operations (f16x2 arithmetic)
The three packed operations (fadd2, ffma2, fmul2) share the same modifier set: ftz (flush-to-zero) and rounding modes rn, rm, rp, rz. The 3-way fmax3/fmin3 operations listed with them take ftz and nan modifiers instead.
| Slot | Offset | OCG Name | SASS Equivalent | Description |
|---|---|---|---|---|
| 25 | 6328 | fadd2 | HADD2 / FADD.PACKED | packed f16 addition |
| 26 | 6576 | ffma2 | HFMA2 / FFMA.PACKED | packed f16 fused multiply-add |
| 27 | 6824 | fmul2 | HMUL2 / FMUL.PACKED | packed f16 multiplication |
| 29 | 7320 | fmax3 | FMNMX3 | 3-way float max (ftz, nan modifiers) |
| 30 | 7568 | fmin3 | FMNMX3 | 3-way float min (ftz, nan modifiers) |
Async Copy and TMA Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 1 | 376 | cp_async_commit | mem, bulk, shared, global | LDGDEPBAR |
| 2 | 624 | cp_async_wait | mem, bulk, shared, global, read, write | DEPBAR |
| 10 | 2608 | cp_async_bulk | mbarrier, counted, shared, global, multicast, sequenced, bytemask | UBLKCP |
| 11 | 2856 | cp_red_async_bulk | mbarrier, counted, shared, global; types: u32/s32/u64/s64/f16/f32/f32ftz/f64/bf16; ops: add/min/max/inc/dec/and/or/xor | UBLKCP.RED |
| 12 | 3104 | cp_async_tensor | mbarrier, shared, global, 1d/2d/3d/4d/5d, im2col, multicast | UTMAKCP |
| 13 | 3352 | cp_async_prefetch_tensor | global, 1d/2d/3d/4d/5d, im2col | UTMAPF |
Note: The SASS mnemonics UBLKCP and UTMAKCP do not appear as strings in the ptxas binary. These are SASS assembler-level names visible only in cuobjdump output; the OCG names (cp_async_bulk, cp_async_tensor) are the canonical internal form.
Load, Store, and Cache Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 3 | 872 | cache | tensor, pf (prefetch), iv (invalidate), ivall (invalidate all) | CCTL / PREFETCH |
| 4 | 1120 | ld_mc | ops: add/min/max/f32add/and/or/xor; types: f16x2/f16x4/f16x8/bf16x2/bf16x4/bf16x8/f32/f32x2/f32x4/f64/u32/s32/s64/u64 | LDG.MC |
| 5 | 1368 | ldc | u32, u64 | LDC |
| 6 | 1616 | s2r | (none -- register 0-255) | S2R |
| 22 | 5584 | write_async | release; shared/global; gpu/sys/mmio; v2/v4; u8/s8/u16/s16/b32/b64/u32/f64 | STG.ASYNC |
| 23 | 5832 | cctl_c | ldc/ldcu, shallow/deep, iv/ivall | CCTL |
Async Reduction and Fence Operations
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 9 | 2360 | red_async | release; shared/global; gpu/sys/mmio; v2/v4; u32/s32/u64; add/min/max/inc/dec/and/or/xor | RED.ASYNC |
| 14 | 3600 | fence_view_async | all, global, shared, dshared, tensor | FENCE.VIEW.ASYNC |
Tensor Core Operations (Blackwell TC family)
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 31 | 7816 | tcbar | cta1/cta2, a1t0/a0tx, flush, multicast, b32 | TCBAR |
| 32 | 7880 | mmareadshma | (none) | LDSM variant |
| 33 | 8064 | tccp | 128dp256bit/4dp256bit/128dp128bit/2x64dp128bitlw02lw13/2x64dp128bitlw01lw23/4x32dp128bit/u4x16p64/u6x16p32; cta1/cta2; b32/b64 | TCCP |
| 34 | 8312 | tcmma | gdesc/tmem; h/i/q/o/mxq; cta1/cta2; ashift/scale/lutb; areuse/akeep/breuse/bkeep; ws; buffer0-3; 2x/4x/blockscale/impl; b32/b64/u32 | TCMMA |
| 35 | 8560 | tcshift | cta1/cta2, b32 | TCSHIFT |
| 37 | 9056 | tcatomsws | and/or/findandset/align/cas; cta1/cta2; b32/b64 | TCATOM.SWS |
| 38 | 9304 | tcldsws | cta1/cta2 | TCLD.SWS |
| 39 | 9552 | tcstsws | cta1/cta2; b32/b64 | TCST.SWS |
The tcmma operation at slot 34 is the primary Blackwell MMA instruction, successor to HMMA/IMMA/DMMA. Its sub-operations encode:
- Descriptor mode: gdesc (global descriptor via UR), tmem (tensor memory direct)
- Input formats: h (half/f16), i (integer), q (quarter/fp8), o (output descriptor), mxq (MX-format quarter for microscaled block-scaling)
- Operand reuse: areuse/akeep (A matrix), breuse/bkeep (B matrix) -- register reuse hints
- Warp-shared: ws -- warp-shared execution across 2 warps
- Block scaling: blockscale with 2x/4x multipliers and impl (implementation-defined) -- FP4/FP6 microscaled format support
- Buffers: buffer0--buffer3 -- double/quad buffering for pipelined execution
The SWS (Software Scoreboard) operations (tcatomsws, tcldsws, tcstsws) are a Blackwell synchronization mechanism for tensor core pipelines that replaces hardware scoreboards with software-managed tracking.
Tensor Memory Load/Store (Blackwell native)
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 42 | 10296 | ldtm | formats: 16dp128bit/16dp256bit/32dp32bit/16dp64bit/16dp32bitt0t15/16dp32bitt16t31/16dp32bit; scale: x1-x128; pack16bit; fused/stat; statistics: nan/max/maxabs/min/minabs; types: u32/s32/f32/b32; sparsity: sparsify/u2/spfactor2to4 | LDTM |
| 43 | 10544 | sttm | formats: (same 7 as ldtm); scale: x1-x128; expand16bit; fused; b32 | STTM |
The ldtm/sttm format strings encode the tensor memory data layout:
- 16dp128bit -- 16 data-points, 128-bit total (e.g., 16x fp8)
- 16dp256bit -- 16 data-points, 256-bit total (e.g., 16x fp16)
- 32dp32bit -- 32 data-points, 32-bit total (e.g., 32x 1-bit)
- 16dp32bitt0t15 / 16dp32bitt16t31 -- 16 data-points in thread groups 0--15 / 16--31
- Scale factors x1 through x128 control the number of consecutive elements loaded
- sparsify and spfactor2to4 enable structured 2:4 sparsity metadata generation
- stat with nan/max/maxabs/min/minabs enables online statistics collection during load
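A decoder for these format strings, assuming the grammar `<N>dp<M>bit[t<A>t<B>]` inferred from the recovered names; the function and return shape are ours:

```python
import re

# Illustrative decoder for the ldtm/sttm format strings listed above.
FMT = re.compile(r"^(\d+)dp(\d+)bit(?:t(\d+)t(\d+))?$")

def decode_format(s: str) -> dict:
    m = FMT.match(s)
    if not m:
        raise ValueError(f"not an ldtm/sttm format: {s}")
    dp, bits, lo, hi = m.groups()
    # thread-group range is present only in the t0t15 / t16t31 variants
    threads = (int(lo), int(hi)) if lo is not None else None
    return {"datapoints": int(dp), "bits": int(bits), "threads": threads}
```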
Synchronization and Control
| Slot | Offset | OCG Name | Sub-Operations | SASS Equivalent |
|---|---|---|---|---|
| 7 | 1864 | acqblk | (none) | barrier acquire block |
| 8 | 2112 | preexit | (none) | EXIT.KEEPREFCOUNT |
| 24 | 6080 | getnextworkid | selfcast, broadcast | work distribution primitive |
| 36 | 8808 | virtcount | u32 | virtual warp counter |
| 40 | 9800 | memclear | b32, b64 | MEMCLEAR |
| 41 | 10048 | acqshminit | (none) | shared memory init barrier |
Category Summary
| Category | Count | Operations |
|---|---|---|
| Arithmetic / ALU | 3 | add, mnmx, viadd |
| Packed float | 5 | fadd2, ffma2, fmul2, fmax3, fmin3 |
| Vector integer | 6 | viaddmax, viaddmin, vimax, vimin, vimax3, vimin3 |
| Async copy / TMA | 6 | cp_async_commit, cp_async_wait, cp_async_bulk, cp_red_async_bulk, cp_async_tensor, cp_async_prefetch_tensor |
| Load / store / cache | 6 | ld_mc, ldc, s2r, write_async, cctl_c, cache |
| Async reduction / fence | 2 | red_async, fence_view_async |
| Tensor core (TC) | 8 | tcbar, mmareadshma, tccp, tcmma, tcshift, tcatomsws, tcldsws, tcstsws |
| Tensor memory (TM) | 2 | ldtm, sttm |
| Sync / control | 6 | acqblk, preexit, getnextworkid, virtcount, memclear, acqshminit |
| Total | 44 | |
Handler Functions
The OCG handler cluster at 0x6C0000--0x6CC000 contains ~25--30 specialized handler/validator functions. Each validates parameters, types, sub-operations, and memory domains before delegating to the SASS encoding engine.
| Address | Size | Handler | Confidence |
|---|---|---|---|
| sub_6C0D90 | 19KB | Atomic reduction (atom.add/min/max/cas, scope, memory order, vector width) | 90% |
| sub_6C1CF0 | 16KB | Mbarrier (arrive, wait, test, counted, bytemask variants) | 88% |
| sub_6C2AE0 | 10KB | cp.async (basic async copy) | 85% |
| sub_6C3470 | 20KB | cp.async.bulk (bulk async copy with type validation) | 85% |
| sub_6C46B0 | -- | cp.red.async.bulk (bulk async reduction) | 85% |
| sub_6C4DA0 | 15KB | Load/store (scope, memory order, domain validation) | 85% |
| sub_6C5A40 | 8KB | Cache control (CCTL: shallow/deep, iv/ivall, ldc/ldcu) | 85% |
| sub_6C60B0 | 7KB | Distributed shared memory (selfcast/broadcast) | 80% |
| sub_6C8100 | 9KB | cp.async.tensor / TMA (1--5D, multicast, tile/im2col) | 85% |
| sub_6C9BC0 | -- | Name resolver (operation name -> internal enum) | 80% |
| sub_6CC690 | 22KB | Router (dispatches to type-specific handlers via vtable) | 80% |
OCG Validation Strings
The OCG handlers share a consistent validation pattern. Notable error messages are listed below (NVIDIA consistently misspells "intrinsic" as "instrinsic" throughout the codebase):
| Error String | Handler | Meaning |
|---|---|---|
| "Op {add, min, max, inc, dec, and, or, xor} not specified" | Atomic | Missing reduction operation |
| "Domain param '_shared' or '_global' required" | Atomic/LS | No memory domain specified |
| "Unsupported non _add global memory reduction" | Atomic | Only add supported for global reductions |
| "Deprecated scope without memory order semantics" | Memory order | Legacy scope usage |
| "Required scope with memory order semantics" | Memory order | Missing scope on memory-ordered op |
| "byte mask not allowed with counted" | Mbarrier | Conflicting mbarrier modifiers |
| "Exactly one of the 'shallow' or 'deep' modifiers must be used." | CCTL | Missing cache depth modifier |
| "Cannot use both the selfcast and the broadcast modifier." | Dshmem | Conflicting multicast mode |
| "Unexpected instrinsic name (%s)" | Name resolver | Unknown OCG operation name |
| "Unexpected instrinsic subop (%s)" | Name resolver | Unknown sub-operation |
| "Unexpected instrinsic type (%s) instead of (%s) in param (%d)" | Type validator | Parameter type mismatch |
| "LDC requires a constant/immediate bank number" | LDC/S2R | Missing constant bank operand |
| "S2R register must be between 0 and 255 inclusive" | LDC/S2R | System register out of range |
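The checks behind two of these messages can be restated as a sketch. The real handlers are decompiled C++; the function names here are ours, and the error texts are the recovered strings:

```python
# Modifier checks implied by the CCTL and distributed-shared-memory error
# strings above: exactly-one-of for shallow/deep, at-most-one-of for
# selfcast/broadcast.
def check_cctl_c(mods: set) -> None:
    if len(mods & {"shallow", "deep"}) != 1:
        raise ValueError(
            "Exactly one of the 'shallow' or 'deep' modifiers must be used.")

def check_dshmem(mods: set) -> None:
    if {"selfcast", "broadcast"} <= mods:
        raise ValueError(
            "Cannot use both the selfcast and the broadcast modifier.")
```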
OCG SASS-Level Handlers
Separate from the validation layer, the SASS encoding zone at 0x6D0000--0x6E0000 contains MMA-specific handlers that operate during final instruction encoding:
| Address | Size | Handler | Confidence |
|---|---|---|---|
| sub_6D4350 | 30KB | MMA intrinsic lowering (HMMA, IMMA, DMMA) | 90% |
| sub_6D5CB0 | 16KB | MMA operand encoder (matrix fragments, accumulator registers) | 80% |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (SM100 5th-gen tensor core encoding) | 90% |
| sub_6D69B0 | 12KB | TCGen05 MMA validator (parameter validation only) | 80% |
Notable validation strings from the tcgen05 MMA handler:
- "fused and l16dp32bit must be specified together"
- "Inputs vector length is inconsistent with layout and num modifiers"
OCG Intrinsic Lowering Pipeline -- sub_6A97B0 + sub_6CC690
The full end-to-end flow that takes a PTX call.uni __nv_ptx_builtin_ocg_* intrinsic and produces a binary SASS instruction passes through five stages. Three are data-structure manipulation (matching, cleanup), two are instruction encoding (operand assembly, SASS emission).
sub_6B5F30 (intrinsic lowering driver)
|
├─ sub_6B40C0 ── pre-processing
|
├─ sub_6A97B0 (LowerIntrinsicOp, 26KB) ──────────────────────────────┐
| │ |
| │ Phase 1: SASS instruction matching |
| │ For each intrinsic call node in linked list [a1+32..a1+40): |
| │ Walk operand tree at node+288 |
| │ For each leaf: read instruction ID at leaf+24 |
| │ Search RB-tree at context+760 for matching SASS defn |
| │ On match: store ptr at node+464, back-link at SASS+440 |
| │ |
| │ Phase 2: Unmatched node garbage collection |
| │ If node+464 == 0 (no SASS match): |
| │ Walk use-def chain at node+40..48 |
| │ Delete matching RB-tree entries (full rebalance via |
| │ sub_6A92E0) |
| │ Unlink node from work list |
| │ Release internal resources (operands, types) |
| │ Return node to free pool at a1+80 |
| │ |
| │ Phase 3: Secondary cleanup (re-scan remaining nodes) |
| │ Nodes with SASS match but no definition link: |
| │ Clear back-pointer, clean up, recycle to free pool |
| │ |
| │ Key data: context+760 = RB-tree root (SASS instruction defs) |
| │ context+768/776 = min/max pointers |
| │ context+784 = tree node count |
| │ context+792 = tree free list |
| |
├─ (post-processing: sub_693D00 per remaining node) ─────────────────┘
|
v
sub_6D9690 (master SASS encoder, 94KB)
|
├─ sub_6D9290 (OCG vtable entry point) ────────────────────────────────┐
| │ |
| │ 1. Extract intrinsic name from IR node |
| │ 2. Call sub_6C9BC0(this+120, name) ── ParseOCGBuiltinName |
| │ Strips "__nv_ptx_builtin_ocg_" prefix |
| │ Iterates 43 operation slots (248B each) in OCG table |
| │ Matches operation name, then parses '_'-delimited sub-ops |
| │ Output: this+10688 = operation enum (0..42) |
| │ this+10704 = int[] of sub-op indices |
| │ this+10712 = sub-op count |
| │ 3. Fall through to sub_6D8B20 for secondary dispatch |
| |
├─ sub_6CC690 (OCGRouter, 22KB) ──────────────────────────────────────┘
| │
| │ Input: (self, instruction_node, sass_descriptor)
| │
| │ 1. Read SASS opcode from descriptor+8
| │ 2. Read target profile from context+1584
| │ Key profile fields:
| │ +503 = operand decode flag
| │ +1012 = target SM enum
| │ +1020 = extended address mode
| │ +1021 = barrier mode
| │ +1041 = memory order capabilities bitmask
| │
| │ 3. Vtable dispatch (off_202CF48):
| │ vtable[2] = OpcodeValidator (default: sub_6BC1D0)
| │ vtable[24] = ScopeValidator (default: sub_6BCE50)
| │ vtable[25] = MemOrderValidator (default: sub_6BBEC0)
| │ Each is compared by address; if overridden, calls the
| │ custom validator; if default, uses inline fast-path.
| │
| │ 4. Opcode-range dispatch (descriptor+8):
| │ 178..189: Memory ops (ld_mc, st) -> SASS enum 243/245/247
| │ 416..420, 434: Reduction/atomic -> SASS enum 243/246/261
| │ 445..448: Barrier/fence -> memory op path
| │ 467: cp.async.tensor/special -> SASS enum 70 or 257
| │ default: zero-init modifiers, use raw descriptor
| │
| │ 5. Operand assembly into v134[] (312-byte buffer):
| │ sub_6CAFD0: decode src/dst registers -> v134[8..10]
| │ sub_6CAE80: encode uniform operands -> v134[16]
| │ sub_6CAF50: encode scope/mem-order -> v134[13]
| │ sub_6CBA50: encode barrier level -> v134[26..28]
| │
| │ 6. Build control words:
| │ v134[26] = 0x60000000 | modifier_bits
| │ v134[27] = 0x60000000 | ordering | barrier | write_mask
| │ v134[28] = 0x60000000 | scope_flags | 0x81000
| │
| v
sub_6CB8A0 (EmitSASS)
│
│ Input: (self, sass_opcode_enum, instr_node, v134[], flags...)
│ 1. Read SM version from profile+372 (>> 12)
│ 2. sub_6CB4B0: final operand validation
│ 3. sub_C3F490(opcode, ...): look up SASS encoding template
│ 4. Encode instruction word from template + v134[] operands
│ 5. sub_9253C0: commit encoded instruction to output
Internal SASS opcode enum values assigned by the router (not binary SASS opcodes -- these are routing keys that sub_C3F490 maps to encoding templates):
| Enum | Hex | Meaning |
|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) |
| 243 | 0xF3 | Default memory operation |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) |
| 246 | 0xF6 | Reduction/atomic default |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) |
| 257 | 0x101 | Async copy without memory order |
| 261 | 0x105 | Atomic with pre-existing value read |
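The router's opcode-range dispatch (step 4 in the diagram above) can be restated as a lookup. The ranges and targets come from the decompiled sub_6CC690; the category labels and function name are ours:

```python
# Opcode-range dispatch performed by the OCG router (sub_6CC690) on the
# SASS descriptor opcode at descriptor+8.
def route_opcode(desc_opcode: int) -> str:
    if 178 <= desc_opcode <= 189:
        return "memory"          # ld_mc, st -> SASS enum 243/245/247
    if desc_opcode in (416, 417, 418, 419, 420, 434):
        return "reduction"       # reduction/atomic -> SASS enum 243/246/261
    if 445 <= desc_opcode <= 448:
        return "barrier_fence"   # barrier/fence -> memory op path
    if desc_opcode == 467:
        return "async_tensor"    # cp.async.tensor -> SASS enum 70 or 257
    return "default"             # zero-init modifiers, use raw descriptor
```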
Operand buffer layout (v134[], 39 QWORDs passed to sub_6CB8A0):
| Slot | Content |
|---|---|
| 0--3 | Reserved (zero) |
| 4 | Barrier register (0x90000000 | reg) |
| 5--7 | Extra source operands (from instruction node) |
| 8--10 | Primary operands (from sub_6CAFD0 decode) |
| 11 | Secondary operand (LDC, conditional loads) |
| 12 | Predicate thread operand |
| 13 | Scope / memory-order (from sub_6CAF50) |
| 14 | Cache mode operand |
| 15 | Memory fence operand |
| 16 | Uniform / extended operand (from sub_6CAE80) |
| 17 | Memory ordering constant / barrier tracking |
| 19--21 | Source address (bulk/tensor ops) |
| 22--24 | Destination address (bulk/tensor ops) |
| 25 | Extra predicate (opcode 187 only) |
| 26 | Control word 0: 0x60000000 | modifier_bits |
| 27 | Control word 1: 0x60000000 | ordering | flags |
| 28 | Control word 2: 0x60000000 | scope | 0x81000 |
| 29 | Write mask operand (conditional) |
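The control-word construction for slots 26--28 can be restated as a sketch. The 0x60000000 tag and the 0x81000 constant are recovered values; the function and parameter names are descriptive, not recovered symbols:

```python
# Control-word construction for v134[26..28], as recovered from the OCG
# router (step 6 of sub_6CC690). The real code builds these inline.
CW_TAG = 0x60000000

def build_control_words(modifier_bits: int, ordering: int, barrier: int,
                        write_mask: int, scope_flags: int):
    cw0 = CW_TAG | modifier_bits                      # v134[26]
    cw1 = CW_TAG | ordering | barrier | write_mask    # v134[27]
    cw2 = CW_TAG | scope_flags | 0x81000              # v134[28]
    return cw0, cw1, cw2
```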
OCG Lookup Flow
PTX source: call.uni __nv_ptx_builtin_ocg_tcmma, (%args...);
|
v
sub_6A97B0 (LowerIntrinsicOp, 26KB)
Matches call node to SASS instruction via RB-tree at ctx+760
Garbage-collects unmatched nodes
|
v
sub_6D9290 -> sub_6C9BC0 (ParseOCGBuiltinName)
Strips "__nv_ptx_builtin_ocg_" prefix
Parses op name + sub-ops from 43-slot table (sub_6C9EB0)
|
v
sub_6CC690 (OCGRouter, 22KB)
Vtable dispatch: validates opcode, scope, memory order
Decodes operands into 312-byte buffer via sub_6CAFD0 cluster
Builds control words (0x60000000 | modifier_bits)
|
v
sub_6CB8A0 (EmitSASS)
Looks up encoding template via sub_C3F490(sass_opcode_enum)
Encodes instruction word, commits to output via sub_9253C0
See OCG Intrinsic Lowering Pipeline for the full five-stage breakdown with operand buffer layout and internal SASS opcode enum values.
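A minimal sketch of the name-parsing step, assuming the prefix-strip and '_'-delimited sub-op convention described above. The two-entry operation table and the longest-match rule are illustrative (operation names themselves contain underscores, so some disambiguation rule is required):

```python
# Sketch of ParseOCGBuiltinName (sub_6C9BC0): strip the prefix, match an
# operation name from the table, then split the remainder into sub-ops.
PREFIX = "__nv_ptx_builtin_ocg_"
OPS = {"cp_async_bulk": 10, "tcmma": 34}  # name -> slot (tiny subset)

def parse_name(sym: str):
    if not sym.startswith(PREFIX):
        raise ValueError(f"Unexpected instrinsic name ({sym})")  # sic
    rest = sym[len(PREFIX):]
    # longest-match first, so cp_async_bulk beats a hypothetical cp_async
    for op in sorted(OPS, key=len, reverse=True):
        if rest == op or rest.startswith(op + "_"):
            subops = rest[len(op):].lstrip("_")
            return OPS[op], subops.split("_") if subops else []
    raise ValueError(f"Unexpected instrinsic name ({sym})")
```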
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_6C9EB0 | 13KB | OCG intrinsic table init (__nv_ptx_builtin_ocg_*) | 95% |
| sub_6CC690 | 22KB | OCG router -- vtable-dispatched operand assembly and SASS emission | 90% |
| sub_6C9BC0 | -- | OCG name parser -- decomposes __nv_ptx_builtin_ocg_X_Y_Z into enum + sub-op array | 95% |
| sub_6C0D90 | 19KB | OCG atomic/reduction handler | 90% |
| sub_6C1CF0 | 16KB | OCG mbarrier handler | 88% |
| sub_6C3470 | 20KB | OCG cp.async.bulk handler | 85% |
| sub_6C4DA0 | 15KB | OCG load/store handler | 85% |
| sub_6C5A40 | 8KB | OCG cache control handler | 85% |
| sub_6C60B0 | 7KB | OCG distributed shared memory handler | 80% |
| sub_6C8100 | 9KB | OCG cp.async.tensor / TMA handler | 85% |
| sub_6D4350 | 30KB | MMA intrinsic lowering (SASS encoding) | 90% |
| sub_6D7AF0 | 19KB | TCGen05 MMA handler (SASS encoding) | 90% |
| sub_6D5CB0 | 16KB | MMA operand encoder | 80% |
| sub_6D69B0 | 12KB | TCGen05 MMA validator | 80% |
| sub_6D9290 | -- | OCG vtable entry point (calls sub_6C9BC0 then sub_6D8B20) | 85% |
| sub_6CB8A0 | -- | SASS instruction emitter (template lookup via sub_C3F490) | 80% |
| sub_6CAFD0 | -- | Operand decoder (registers into v134[] slots) | 85% |
| sub_6CAE80 | -- | Uniform operand encoder | 85% |
| sub_6CAF50 | -- | Scope / memory-order encoder | 85% |
| sub_6CBA50 | -- | Barrier-level encoder | 85% |
| sub_6CB4B0 | -- | Operand validator (called by sub_6CB8A0) | 80% |
| sub_6A97B0 | 26KB | LowerIntrinsicOp -- SASS matching and unmatched-node GC | 90% |
| sub_6B5F30 | -- | Intrinsic lowering driver (calls sub_6B40C0 then sub_6A97B0) | 90% |
| sub_6A92E0 | -- | RB-tree fixup (rotation/recolor after deletion) | 90% |
| sub_6BC1D0 | -- | Default opcode validator (vtable[2] of off_202CF48) | 90% |
| sub_6BCE50 | -- | Default scope validator (vtable[24]) | 90% |
| sub_6BBEC0 | -- | Default memory-order validator (vtable[25]) | 90% |
Diagnostic Strings
| String | Location | Context |
|---|---|---|
| "__nv_ptx_builtin_ocg_" | sub_6C9EB0 (0x6c9ecf) | OCG builtin name prefix |
| "instrinsic" (sic) | Multiple OCG handlers | Consistent NVIDIA typo for "intrinsic" |
| ".RELU not allowed with unsigned type" | sub_6BEC60 | OCG LDC/S2R handler |
Cross-References
- Intrinsic Table Architecture -- Classical 607-entry intrinsic system and body template tables
- Math Intrinsics -- IEEE math software emulation (div, rcp, sqrt, rem)
- Tensor Core Intrinsics -- WMMA, MMA, WGMMA, tcgen05 lowering
- Sync & Warp Intrinsics -- Barriers, vote, shuffle, match, redux
- SM Architecture Map -- Per-SM capability dispatch tables
- TCGen05 -- 5th Gen Tensor Cores -- Blackwell tensor core ISA detail
- Mercury Encoder -- Master SASS encoder sub_6D9690 (94KB)
- SASS Instruction Encoding -- Instruction encoding infrastructure
- ISel Pattern Matching -- Internal SASS routing values
- Pipeline Overview -- OCG-time measurement covers intrinsic lowering
Math Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas provides built-in IEEE-compliant math functions for operations that have no single-instruction hardware implementation. When a PTX instruction like div.rn.f64, rcp.rn.f32, sqrt.rn.f64, or rsqrt.approx.f64 is encountered, ptxas either emits a single MUFU (Multi-Function Unit) instruction for approximate results, or generates a multi-instruction SASS sequence using Newton-Raphson refinement for IEEE-compliant precision. For operations too complex to inline, ptxas emits a call to one of 70 registered __cuda_sm20_* helper functions.
| Intrinsic ID range | 0x3D--0x86 (70 math entries + 4 sm_3x division variants) |
| Math codegen handlers | 9 functions: div.full, div, rem, rcp, rsqrt, sqrt, ex2, lg2, tanh |
| Newton-Raphson templates | 4 top-level handlers at 0x1700000--0x172A090 (~180 KB) |
| MUFU internal opcode | 0x3C (60 decimal), Ori mnemonic ZHSH |
| MUFU Mercury major | 0x58, minor sub-function encoded in operand fields |
| SFU functional unit | Index 8 in the latency model (RCP, RSQ, SIN, COS, EX2, LG2) |
| MUFU encoding variants | 14 (reg/reg, reg/pred, reg/ureg, bar operands) |
MUFU -- Multi-Function Unit
The MUFU instruction is a single-cycle-issue instruction that computes transcendental approximations on the SFU (Special Function Unit). Each SM has a dedicated SFU pipe that executes MUFU operations independently of the ALU pipes.
Sub-Function Table
The MUFU sub-function is encoded in the instruction's modifier field (not a separate operand). The following sub-functions are available, with minimum SM requirements noted per row:
| Sub-Function | Operation | Input | Output | Precision |
|---|---|---|---|---|
| MUFU.COS | cos(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.SIN | sin(x * 2pi) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.EX2 | 2^x | FP32 | FP32 | ~22 bits mantissa |
| MUFU.LG2 | log2(x) | FP32 | FP32 | ~22 bits mantissa |
| MUFU.RCP | 1/x | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RSQ | 1/sqrt(x) | FP32 | FP32 | ~23 bits mantissa |
| MUFU.RCP64H | 1/x (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.RSQ64H | 1/sqrt(x) (FP64 high-word seed) | FP32 | FP32 | ~23 bits, sm_80+ |
| MUFU.TANH | tanh(x) | FP32 | FP32 | ~22 bits, sm_75+ |
MUFU.RCP and MUFU.RSQ produce results accurate to approximately 1 ULP of the true FP32 value (23 mantissa bits). The trigonometric and exponential sub-functions (SIN, COS, EX2, LG2) are slightly less precise at approximately 22 bits. MUFU.TANH was added in Turing (sm_75).
MUFU in the Ori IR
In the ptxas internal representation, MUFU uses Ori opcode 0x3C (decimal 60) with the mnemonic ZHSH. During instruction selection, PTX operations like sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32 are each lowered to a single ZHSH (MUFU) instruction with the appropriate sub-function selector.
The lowering pass responsible for MUFU emission is at sub_80E9B0 (LowerSpecialFunctions), called from the master lowering dispatcher sub_8380A0. It converts Ori-level special function opcodes into MUFU SASS instructions with appropriate sub-function encoding.
MUFU Encoding (sm_100+)
In the Mercury/Blackwell encoding, MUFU is major opcode 0x58 with a single variant at the basic encoding level (sub_10C0170). The encoding signature:
| Field | Value |
|---|---|
| Major opcode | 0x58 |
| f19 | 0xB |
| Format | 1 (single-operand class) |
| Operand count | 1 (destination implicit, source = register) |
| Encoding function | sub_10C0170 |
The variant table at 0xF7CEB0--0xF80760 defines 14 encoding patterns for MUFU, supporting combinations of:
- reg2, reg2 -- standard register source and destination
- reg2, pred3 -- predicated source
- reg2, reg10 -- extended register class
- reg2, ureg4 -- uniform register source (sm_100+ addition)
- reg2, bar6 -- barrier operand (scheduling)
Uniform register support (ureg4) in MUFU is a Blackwell-specific addition, allowing MUFU to consume values directly from the uniform register file without a prior UMOV to a general-purpose register.
Pre-Assignment Constraints
The register allocator applies pre-assignment constraints for MUFU at sub_93F000. MUFU (internal opcode 22 in the constraint check, mapped from Ori opcode 0x3C) requires its operands in specific register classes. The constraint handler calls sub_93E9D0 with constraint type 1 (early) for MUFU operands.
Precision Levels
ptxas implements two distinct precision tiers for every math operation, selected by PTX instruction modifiers:
Approximate (.approx)
A single MUFU instruction. This is the default for sin.approx.f32, cos.approx.f32, ex2.approx.f32, lg2.approx.f32, rcp.approx.f32, and rsqrt.approx.f32. The MUFU hardware provides approximately 22--23 bits of mantissa precision in a single instruction dispatch on the SFU pipe.
PTX to SASS mapping (approximate):
| PTX Instruction | SASS | Latency |
|---|---|---|
| sin.approx.f32 | MUFU.SIN | SFU pipe, ~4 cycles |
| cos.approx.f32 | MUFU.COS | SFU pipe, ~4 cycles |
| ex2.approx.f32 | MUFU.EX2 | SFU pipe, ~4 cycles |
| lg2.approx.f32 | MUFU.LG2 | SFU pipe, ~4 cycles |
| rcp.approx.f32 | MUFU.RCP | SFU pipe, ~4 cycles |
| rsqrt.approx.f32 | MUFU.RSQ | SFU pipe, ~4 cycles |
| tanh.approx.f32 | MUFU.TANH | SFU pipe, ~4 cycles (sm_75+) |
IEEE-Compliant (.rn, .rd, .ru, .rz)
Multi-instruction sequences that use MUFU as a seed and refine with Newton-Raphson iterations using FMA instructions. These produce results that are correctly rounded to the specified IEEE 754 rounding mode (round-to-nearest-even, round-down, round-up, round-toward-zero). The instruction count ranges from ~15 for FP32 operations to ~120 for FP64 operations.
The IEEE-compliant paths are implemented in two ways:
- Inline templates -- Multi-instruction SASS sequences emitted directly at the call site by the Newton-Raphson template subsystem (0x1700000--0x172A090). Used for FP64 division, reciprocal, sqrt, and rsqrt.
- Callable helpers -- Calls to __cuda_sm20_* functions whose bodies are pre-compiled PTX routines linked from libdevice. Used for FP32 operations with directed rounding modes and all slowpath variants.
PTX Math Codegen Handlers
When sub_5D4190 builds the opcode dispatch table, it registers 9 math-related PTX instruction names to codegen handler functions. Each handler allocates a 50,000-byte temporary buffer, queries instruction properties through accessor functions on the instruction object at a1+1096, and generates inline PTX code via sequential sprintf() calls.
| PTX Opcode | Handler | Size | Description |
|---|---|---|---|
div.full | sub_573860 | ~7 KB | FP64 full-precision division (calls Newton-Raphson template) |
div | sub_5B76D0 | 64 KB | General division: dispatches by type (s16/u16/s64/u64/f32/f64) and rounding mode |
rem | sub_589810 | ~13 KB | Integer remainder (s16/u16/s64/u64) |
rcp | sub_5B0CD0 | 44 KB | Reciprocal: dispatches by type (f32/f64) and rounding mode |
rsqrt | sub_57BFC0 | ~10 KB | Reciprocal square root |
sqrt | sub_5B4040 | 49 KB | Square root: dispatches by type (f32/f64) and rounding mode |
ex2 | sub_583190 | ~14 KB | Base-2 exponential |
lg2 | sub_52A5C0 | ~5 KB | Base-2 logarithm |
tanh | sub_505B00 | ~5 KB | Hyperbolic tangent |
Handler Dispatch Logic
All math codegen handlers follow the same structural pattern. The div handler (sub_5B76D0, 1,466 lines decompiled, 64 KB) is the largest because it covers the most type/rounding/precision combinations. The handler:
- Allocates a 50,000-byte output buffer via `sub_424070`
- Queries the operand type via `sub_70CA60(*(a1+1096), 0)`:
  - Type 58 = FP32 (`f32`)
  - Type 59 = FP64 (`f64`)
  - Type 54 = signed 16-bit (`s16`)
  - Type 56 = unsigned 16-bit / other integer
- Queries rounding mode, FTZ flag, and precision modifier via additional accessors (`sub_707BC0`, `sub_70B820`, `sub_70B8E0`, `sub_70B710`)
- Selects the appropriate intrinsic function name and emits a PTX call via `sprintf()` into the output buffer
- Copies the result to a final allocation and frees the temporary buffer
For the tanh handler (sub_505B00, 121 lines), the dispatch is simpler:
- Type 56: emits a short-form call to a hardware-supported path
- Type 54: emits a multi-operand call querying register info via `sub_70B8E0` and `sub_70B710`
- Default: emits a generic call with 6 operand parameters
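The shared handler structure can be sketched as follows. The dispatch keys (58/59/54/56) are the recovered type codes; the emitted call text and function shape are illustrative -- the real handlers build the string in C with sequential `sprintf()` calls into the 50,000-byte buffer.

```python
# Hypothetical sketch of the common codegen-handler pattern: query the operand
# type code, choose an intrinsic name, and append PTX text to a growing buffer.
TYPE_F32, TYPE_F64, TYPE_S, TYPE_U = 58, 59, 54, 56

def emit_div_call(type_code, rounding, dst, a, b):
    buf = []                                        # stands in for the 50 KB buffer
    if type_code == TYPE_F32:
        name = "__cuda_sm20_div_%s_f32" % rounding
    elif type_code == TYPE_F64:
        name = "__cuda_sm20_div_%s_f64_full" % rounding
    elif type_code == TYPE_S:
        name = "__cuda_sm20_div_s16"
    else:
        name = "__cuda_sm20_div_u16"
    buf.append("call.uni (%s), %s, (%s, %s);" % (dst, name, a, b))
    return "\n".join(buf)
```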
Type Codes in Math Handlers
The operand type returned by sub_70CA60(instr, 0) maps to PTX data types:
| Code | PTX Type | Used By |
|---|---|---|
| 54 | .s16 / .s32 | Integer div, rem |
| 56 | .u16 / .u32 | Integer div, rem, tanh variant |
| 58 | .f32 | Float div, rcp, sqrt, rsqrt, ex2, lg2, tanh |
| 59 | .f64 | Double div, rcp, sqrt, rsqrt |
Registered Math Intrinsics (IDs 0x3D--0x86)
The master registration function sub_5D1660 registers 70 math helper functions with IDs 0x3D through 0x82, plus 4 sm_3x-optimized division variants at 0x83--0x86. These are the __cuda_sm20_* functions whose PTX prototypes are emitted by the prototype generator sub_5FF700.
Division Intrinsics (23 entries)
| ID | Name | Category |
|---|---|---|
0x41 | __cuda_sm20_div_rd_f32 | FP32 div, round-down |
0x42 | __cuda_sm20_div_rd_f64_v2 | FP64 div, round-down |
0x43 | __cuda_sm20_div_rd_ftz_f32 | FP32 div, round-down, flush-to-zero |
0x44 | __cuda_sm20_div_rn_f32 | FP32 div, round-to-nearest |
0x45 | __cuda_sm20_div_rn_f64_fast | FP64 div, round-to-nearest (fast path) |
0x46 | __cuda_sm20_div_rn_f64_full | FP64 div, round-to-nearest (full IEEE) |
0x47 | __cuda_sm20_div_rn_ftz_f32 | FP32 div, round-to-nearest, FTZ |
0x48 | __cuda_sm20_div_rn_ftz_f32_slowpath | FP32 div RN FTZ (denormal handler) |
0x49 | __cuda_sm20_div_rn_noftz_f32_slowpath | FP32 div RN no-FTZ (denormal handler) |
0x4A | __cuda_sm20_div_ru_f32 | FP32 div, round-up |
0x4B | __cuda_sm20_div_ru_f64_v2 | FP64 div, round-up |
0x4C | __cuda_sm20_div_ru_ftz_f32 | FP32 div, round-up, FTZ |
0x4D | __cuda_sm20_div_rz_f32 | FP32 div, round-toward-zero |
0x4E | __cuda_sm20_div_rz_f64_v2 | FP64 div, round-toward-zero |
0x4F | __cuda_sm20_div_rz_ftz_f32 | FP32 div, round-toward-zero, FTZ |
0x50 | __cuda_sm20_div_s16 | Signed 16-bit integer div |
0x51 | __cuda_sm20_div_s64 | Signed 64-bit integer div |
0x52 | __cuda_sm20_div_u16 | Unsigned 16-bit integer div |
0x53 | __cuda_sm20_div_u64 | Unsigned 64-bit integer div |
0x83 | __cuda_sm3x_div_rn_ftz_f32 | sm_30+ optimized FP32 div RN FTZ |
0x84 | __cuda_sm3x_div_rn_ftz_f32_slowpath | sm_30+ FP32 div RN FTZ slowpath |
0x85 | __cuda_sm3x_div_rn_noftz_f32 | sm_30+ optimized FP32 div RN |
0x86 | __cuda_sm3x_div_rn_noftz_f32_slowpath | sm_30+ FP32 div RN slowpath |
Reciprocal Intrinsics (21 entries)
| ID | Name | Category |
|---|---|---|
0x40 | __cuda_sm20_dblrcp_rn_slowpath_v3 | FP64 reciprocal slowpath |
0x5B | __cuda_sm20_rcp_f64_v3 | FP64 reciprocal (default rounding) |
0x5C | __cuda_sm20_rcp_rd_f32 | FP32 rcp, round-down |
0x5D | __cuda_sm20_rcp_rd_f32_slowpath | FP32 rcp RD slowpath |
0x5E | __cuda_sm20_rcp_rd_f64 | FP64 rcp, round-down |
0x5F | __cuda_sm20_rcp_rd_ftz_f32 | FP32 rcp RD FTZ |
0x60 | __cuda_sm20_rcp_rd_ftz_f32_slowpath | FP32 rcp RD FTZ slowpath |
0x61 | __cuda_sm20_rcp_rn_f32 | FP32 rcp, round-to-nearest |
0x62 | __cuda_sm20_rcp_rn_f32_slowpath | FP32 rcp RN slowpath |
0x63 | __cuda_sm20_rcp_rn_ftz_f32 | FP32 rcp RN FTZ |
0x64 | __cuda_sm20_rcp_rn_ftz_f32_slowpath | FP32 rcp RN FTZ slowpath |
0x65 | __cuda_sm20_rcp_ru_f32 | FP32 rcp, round-up |
0x66 | __cuda_sm20_rcp_ru_f32_slowpath | FP32 rcp RU slowpath |
0x67 | __cuda_sm20_rcp_ru_f64 | FP64 rcp, round-up |
0x68 | __cuda_sm20_rcp_ru_ftz_f32 | FP32 rcp RU FTZ |
0x69 | __cuda_sm20_rcp_ru_ftz_f32_slowpath | FP32 rcp RU FTZ slowpath |
0x6A | __cuda_sm20_rcp_rz_f32 | FP32 rcp, round-toward-zero |
0x6B | __cuda_sm20_rcp_rz_f32_slowpath | FP32 rcp RZ slowpath |
0x6C | __cuda_sm20_rcp_rz_f64 | FP64 rcp, round-toward-zero |
0x6D | __cuda_sm20_rcp_rz_ftz_f32 | FP32 rcp RZ FTZ |
0x6E | __cuda_sm20_rcp_rz_ftz_f32_slowpath | FP32 rcp RZ FTZ slowpath |
Square Root Intrinsics (21 entries)
| ID | Name | Category |
|---|---|---|
0x56 | __cuda_sm20_dsqrt_rd_f64 | FP64 sqrt, round-down |
0x57 | __cuda_sm20_dsqrt_rn_f64_mediumpath_v1 | FP64 sqrt RN (medium-complexity path) |
0x58 | __cuda_sm20_dsqrt_rn_f64_v3 | FP64 sqrt, round-to-nearest |
0x59 | __cuda_sm20_dsqrt_ru_f64 | FP64 sqrt, round-up |
0x5A | __cuda_sm20_dsqrt_rz_f64 | FP64 sqrt, round-toward-zero |
0x73 | __cuda_sm20_sqrt_rd_f32 | FP32 sqrt, round-down |
0x74 | __cuda_sm20_sqrt_rd_f32_slowpath | FP32 sqrt RD slowpath |
0x75 | __cuda_sm20_sqrt_rd_ftz_f32 | FP32 sqrt RD FTZ |
0x76 | __cuda_sm20_sqrt_rd_ftz_f32_slowpath | FP32 sqrt RD FTZ slowpath |
0x77 | __cuda_sm20_sqrt_rn_f32 | FP32 sqrt, round-to-nearest |
0x78 | __cuda_sm20_sqrt_rn_f32_slowpath | FP32 sqrt RN slowpath |
0x79 | __cuda_sm20_sqrt_rn_ftz_f32 | FP32 sqrt RN FTZ |
0x7A | __cuda_sm20_sqrt_rn_ftz_f32_slowpath | FP32 sqrt RN FTZ slowpath |
0x7B | __cuda_sm20_sqrt_ru_f32 | FP32 sqrt, round-up |
0x7C | __cuda_sm20_sqrt_ru_f32_slowpath | FP32 sqrt RU slowpath |
0x7D | __cuda_sm20_sqrt_ru_ftz_f32 | FP32 sqrt RU FTZ |
0x7E | __cuda_sm20_sqrt_ru_ftz_f32_slowpath | FP32 sqrt RU FTZ slowpath |
0x7F | __cuda_sm20_sqrt_rz_f32 | FP32 sqrt, round-toward-zero |
0x80 | __cuda_sm20_sqrt_rz_f32_slowpath | FP32 sqrt RZ slowpath |
0x81 | __cuda_sm20_sqrt_rz_ftz_f32 | FP32 sqrt RZ FTZ |
0x82 | __cuda_sm20_sqrt_rz_ftz_f32_slowpath | FP32 sqrt RZ FTZ slowpath |
Reciprocal Square Root Intrinsics (2 entries)
| ID | Name | Category |
|---|---|---|
0x54 | __cuda_sm20_drsqrt_f64_slowpath_v2 | FP64 rsqrt slowpath |
0x55 | __cuda_sm20_drsqrt_f64_v2 | FP64 rsqrt default |
Remainder Intrinsics (4 entries)
| ID | Name | Category |
|---|---|---|
0x6F | __cuda_sm20_rem_s16 | Signed 16-bit remainder |
0x70 | __cuda_sm20_rem_s64 | Signed 64-bit remainder |
0x71 | __cuda_sm20_rem_u16 | Unsigned 16-bit remainder |
0x72 | __cuda_sm20_rem_u64 | Unsigned 64-bit remainder |
Bit-Field Intrinsics (3 entries)
| ID | Name | Category |
|---|---|---|
0x3D | __cuda_sm20_bfe_s64_ | 64-bit signed bit-field extract |
0x3E | __cuda_sm20_bfe_u64_ | 64-bit unsigned bit-field extract |
0x3F | __cuda_sm20_bfi_u64_ | 64-bit unsigned bit-field insert |
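The arithmetic these three helpers implement is plain bit manipulation. A simplified model follows (Python; it ignores the PTX clamping rules for out-of-range `pos`/`len`):

```python
MASK64 = (1 << 64) - 1

def bfe_u64(x, pos, length):
    """Unsigned 64-bit bit-field extract (simplified: pos/len assumed in range)."""
    if length == 0:
        return 0
    return (x >> pos) & ((1 << length) - 1)

def bfe_s64(x, pos, length):
    """Signed variant: sign-extend from the top bit of the extracted field."""
    v = bfe_u64(x, pos, length)
    if length and v & (1 << (length - 1)):
        v -= 1 << length
    return v

def bfi_u64(base, field, pos, length):
    """Insert the low `length` bits of `field` into `base` at bit `pos`."""
    mask = ((1 << length) - 1) << pos
    return (base & ~mask & MASK64) | ((field << pos) & mask)
```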
Naming Conventions
The intrinsic name encodes the complete variant specification:
__cuda_sm{gen}_{op}_{rounding}_{ftz}_{type}_{suffix}
| Component | Values | Meaning |
|---|---|---|
sm{gen} | sm20, sm3x | Minimum SM architecture |
{op} | div, rcp, sqrt, dsqrt, drsqrt, rem, bfe, bfi | Mathematical operation |
{rounding} | rn, rd, ru, rz | IEEE 754 rounding mode |
{ftz} | ftz, noftz | Flush-to-zero denormal behavior |
{type} | f32, f64, s16, s64, u16, u64 | Operand data type |
{suffix} | slowpath, mediumpath, full, fast, v2, v3 | Implementation variant |
The slowpath suffix indicates a handler for denormalized inputs or edge cases (NaN, infinity, zero) that the fast path branches around. The v2/v3 suffixes mark successive revisions of the implementation (each version may use different Newton-Raphson step counts or algorithm improvements).
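A name of this shape can be decomposed mechanically. The decoder below is a sketch for the common patterns, not a parser recovered from the binary; a few irregular names (e.g. `__cuda_sm20_dblrcp_rn_slowpath_v3`, which omits the type field) fall outside it.

```python
import re

# Sketch of the naming scheme: __cuda_sm{gen}_{op}[_{rounding}][_{ftz}]_{type}[_{suffix}]
_PAT = re.compile(
    r"__cuda_(sm\w+?)_(dblrcp|drsqrt|dsqrt|div|rcp|sqrt|rem|bfe|bfi)"
    r"(?:_(rn|rd|ru|rz))?(?:_(ftz|noftz))?"
    r"_(f32|f64|s16|s64|u16|u64)(?:_(.+))?$"
)

def parse_intrinsic(name):
    """Split an intrinsic name into (gen, op, rounding, ftz, type, suffix)."""
    m = _PAT.match(name)
    return m.groups() if m else None
```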
Prototype Format
The prototype generator sub_5FF700 (354 KB) emits .weak .func PTX declarations for every registered intrinsic. Example prototypes:
.weak .func (.reg .s32 %d) __cuda_sm20_div_s16
(.reg .s32 %a0, .reg .s32 %a1)
.weak .func (.reg .u64 %rdv1) __cuda_sm20_div_u64
(.reg .u64 %rda1, .reg .u64 %rda2)
.weak .func (.reg .f32 %fv1) __cuda_sm20_div_rn_f32
(.reg .f32 %fa1, .reg .f32 %fa2)
.weak .func (.reg .f64 %fdv1) __cuda_sm20_div_rn_f64_full
(.reg .f64 %fda1, .reg .f64 %fda2)
The .weak linkage allows user-provided implementations to override the built-in versions at link time.
Newton-Raphson Refinement Templates
For FP64 operations, ptxas emits multi-instruction SASS sequences inline rather than calling helper functions. These sequences are generated by the template subsystem at 0x1700000--0x172A090 (36 functions, ~180 KB). The templates use MUFU hardware as the initial seed and iterate Newton-Raphson to achieve full FP64 precision. See Newton-Raphson & Math Templates for complete details.
Template Hierarchy
sub_AED3C0 (Master Lowering Dispatcher, 28 KB)
|
+-- sub_170E8B0 (DDIV handler) -- FP64 division
| +-- sub_170E260 (coordinator) -- 298 vregs, 6 sub-expanders
|
+-- sub_1718D60 (DRCP/DSQRT handler) -- FP64 reciprocal / square root
| +-- sub_1718790 (coordinator) -- 289 vregs, 7 sub-expanders
|
+-- sub_17276C0 (DRSQRT handler) -- FP64 reciprocal square root
| +-- sub_1720D60 (coordinator A) -- 247 vregs, 5 sub-expanders
| +-- sub_1727130 (coordinator B) -- 59 vregs, integer div/mod path
|
+-- sub_1704070 (Inline DDIV handler) -- Register-pressure variants
FP64 Division (DDIV)
Algorithm for a / b:
- Extract the high 32 bits of the FP64 divisor `b`
- Convert to FP32 and compute `MUFU.RCP` -- ~23-bit seed for `1/b`
- Newton-Raphson iteration 1: `x1 = x0 * (2 - b * x0)` via DFMA -- ~46 bits
- Newton-Raphson iteration 2 (partial): guard bits for correct rounding
- Compute `a * (1/b)` using the refined reciprocal
- Apply IEEE 754 rounding, handle overflow/underflow/NaN
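The numeric core of steps 2--5 can be modeled in Python, with a correctly rounded float32 reciprocal standing in for the MUFU.RCP seed and float64 arithmetic standing in for DFMA. This is a model of the algorithm's shape only -- no special-case handling, in-range inputs assumed.

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def ddiv(a, b):
    """Numeric model of the DDIV template: f32 seed + two NR steps + multiply."""
    x = f32(1.0 / f32(b))          # MUFU.RCP stand-in: ~23-bit seed for 1/b
    x = x * (2.0 - b * x)          # NR iteration 1 (DFMA) -> ~46 bits
    x = x * (2.0 - b * x)          # NR iteration 2 -> full float64 precision
    return a * x                   # a * (1/b)
```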
The complete DDIV template emits ~100--120 SASS instructions across 3 named code sections (__ori_template_DDIV1, __ori_template_DDIV2, __ori_template_DDIV3), using 298 virtual registers. Three register-pressure variants are available:
| Register Limit | Handler | Strategy |
|---|---|---|
| > 20,479 | sub_1702990 | Full unrolled, maximum ILP |
| > 16,383 | sub_1701F10 | Partially spilled |
| <= 16,383 | sub_1701860 | Minimal-register, more instructions |
FP64 Reciprocal (DRCP)
Algorithm for 1/b:
- `MUFU.RCP(float32(b))` -- ~23-bit seed
- Newton-Raphson iteration 1: `x1 = x0 * (2 - b * x0)` via DFMA
- Newton-Raphson iteration 2: doubles precision to ~52+ bits
- Final rounding to FP64 precision
Implemented by sub_1718D60 (coordinator at sub_1718790, 289 vregs, 7 sub-expanders: sub_170ED40 through sub_1717470).
FP64 Square Root (DSQRT)
Algorithm for sqrt(a):
- `MUFU.RSQ(float32(a))` -- ~23-bit seed for `1/sqrt(a)`
- Newton-Raphson refinement: `y1 = y0 * (3 - a * y0^2) / 2`
- Compute `sqrt(a) = a * (1/sqrt(a))`
- Apply IEEE 754 rounding
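The same seed-and-refine flow can be modeled numerically (a float32 rsqrt stands in for MUFU.RSQ, float64 for DFMA; illustrative only, no special-case handling):

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def dsqrt(a):
    """Numeric model of the DSQRT template: MUFU.RSQ seed + NR + final multiply."""
    y = f32(1.0 / f32(a) ** 0.5)      # MUFU.RSQ stand-in: ~23-bit seed for 1/sqrt(a)
    y = y * (3.0 - a * y * y) / 2.0   # NR refinement -> ~46 bits
    y = y * (3.0 - a * y * y) / 2.0   # second refinement -> full float64 precision
    return a * y                      # sqrt(a) = a * (1/sqrt(a))
```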
Shares the coordinator with DRCP (sub_1718790), selecting the DSQRT sub-expanders (sub_1715910, sub_1717470) based on the original PTX operation.
FP64 Reciprocal Square Root (DRSQRT)
The most complex template handler (sub_17276C0). Dispatches based on a hardware capability flag at *(*(ctx+1584)+1037) & 1:
- Flag set (sm_80+ with enhanced SFU): `sub_1727130` -- 59 vregs, fewer refinement iterations due to `MUFU.RSQ64H` providing better initial precision
- Flag clear (older architectures): `sub_1720D60` -- 247 vregs, full Newton-Raphson with 5 sub-expanders
Integer Division via MUFU
Integer division by variable values also uses MUFU.RCP as a starting point. The algorithm for unsigned 32-bit a / b:
I2F(b) -> MUFU.RCP -> F2I -> IMAD.HI -> correction
Specifically:
- `float_b = I2F(b)` -- convert divisor to FP32
- `rcp = MUFU.RCP(float_b)` -- ~23-bit reciprocal approximation
- `int_rcp = F2I(rcp)` -- convert back to integer
- `q_est = IMAD.HI(a, int_rcp, 0)` -- estimated quotient
- `r_est = IMAD(q_est, -b, a)` -- estimated remainder
- Correction: up to 2 iterations of `if (r_est >= b) q_est++; r_est -= b`
The correction steps (at most 2) are needed because MUFU.RCP is accurate to within 2 ULP. This sequence emits ~50 SASS instructions for 32-bit (sub_1724A20, 28 KB decompiled) and ~80 for 64-bit unsigned (sub_1728930, 16.5 KB).
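The quotient-estimation flow can be modeled in Python. Because a correctly rounded float32 reciprocal is used here instead of the raw MUFU seed, the sketch adds one fixed-point refinement step before the correction loop; the binary's exact instruction sequence differs.

```python
import struct

def f32(x):
    """Round a Python float to IEEE binary32 -- stand-in for a 32-bit GPU value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def udiv32(a, b):
    """Model of reciprocal-based unsigned 32-bit division (assumes b != 0)."""
    r = int(f32(1.0 / f32(b)) * 2**32)    # I2F -> MUFU.RCP -> F2I: fixed-point 2^32/b
    r = r + (r * (2**32 - r * b) >> 32)   # one fixed-point NR step tightens the seed
    q = (a * r) >> 32                     # IMAD.HI analogue: estimated quotient
    rem = a - q * b                       # IMAD analogue: estimated remainder
    while rem >= b:                       # final correction, a couple of steps at most
        q, rem = q + 1, rem - b
    while rem < 0:
        q, rem = q - 1, rem + b
    return q
```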
FP32 Math Paths
Approximate vs Full-Range
For FP32 operations, the codegen handler selects between:
- Single MUFU -- for the `.approx` modifier. One instruction, ~23-bit precision.
- MUFU + correction -- for `.rn`/`.rd`/`.ru`/`.rz` with FTZ. MUFU seed plus 1--2 FMA correction steps, inline.
- Helper function call -- for directed rounding modes (RD/RU/RZ) without FTZ, or when denormal handling is required (slowpath variants). Calls to `__cuda_sm20_*` or `__cuda_sm3x_*` functions.
Flush-to-Zero (FTZ)
The .ftz modifier on FP32 operations flushes denormalized inputs and outputs to zero, which simplifies the math sequence:
- Eliminates denormal input handling branches
- Eliminates denormal output rounding logic
- Allows a shorter inline sequence instead of a function call
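The flush itself is a simple bit-level operation on the binary32 encoding, sketched here for reference:

```python
import struct

def ftz_f32(x):
    """Flush-to-zero on an IEEE binary32 value: denormals become signed zero."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    if bits & 0x7F800000 == 0:        # biased exponent 0 -> zero or denormal
        bits &= 0x80000000            # keep only the sign bit
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```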
Each FP32 math intrinsic exists in both FTZ and non-FTZ variants (e.g., __cuda_sm20_rcp_rn_ftz_f32 vs __cuda_sm20_rcp_rn_f32), and many also have a slowpath variant for edge cases.
sm_3x Optimized Division
Four additional division intrinsics at IDs 0x83--0x86 provide sm_30+ optimized paths for FP32 round-to-nearest division. The __cuda_sm3x_div_rn_ftz_f32 and __cuda_sm3x_div_rn_noftz_f32 variants (plus their slowpath counterparts) take advantage of Kepler+ hardware improvements to produce shorter instruction sequences than the sm_20 versions.
FP16 Math Handling
FP16 (half) math operations do not use MUFU directly. Instead, ptxas:
- Promotes FP16 inputs to FP32 via `H2F` (half-to-float conversion)
- Performs the FP32 MUFU operation
- Converts the result back to FP16 via `F2H` (float-to-half conversion)
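The promote/operate/demote flow can be modeled with Python's binary16 codec. This is illustrative only; `ex2` is chosen as the example operation, and wider-precision Python floats stand in for the FP32 MUFU step.

```python
import struct

def h2f(h16):
    """Widen a raw 16-bit half pattern to a Python float (H2F stand-in)."""
    return struct.unpack('<e', struct.pack('<H', h16))[0]

def f2h(f):
    """Round a float back to a raw 16-bit half pattern (F2H stand-in)."""
    return struct.unpack('<H', struct.pack('<e', f))[0]

def ex2_f16(h16):
    # promote -> operate at wider precision -> demote, as described above
    return f2h(2.0 ** h2f(h16))
```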
For HMMA (half-precision matrix multiply-accumulate) operations, the tensor core path is used instead -- see Tensor Core Intrinsics.
The HADD2, HMUL2, HFMA2 instructions operate on packed FP16x2 values and are separate from the MUFU path. These are direct hardware instructions dispatched to the ALU pipe, not the SFU.
Codegen Handler Deep Dive
sub_5B76D0 -- Division Codegen (64 KB)
The largest math codegen handler at 1,466 decompiled lines. Its dispatch tree:
sub_70CA60(instr, 0) -> operand type
|
+-- type 58 (f32)
| +-- sub_707BC0(instr) -> rounding mode check
| | +-- mode 1 -> short-form call (approx)
| | +-- mode > 39 -> full Newton-Raphson inline sequence
| | +-- else -> helper function call
| +-- sub_70B820(instr) -> precision modifier
| +-- <= 39 -> 3-operand compact call
| +-- > 39 -> multi-segment inline expansion
|
+-- type 59 (f64)
| +-- full/fast path selection
| +-- rounding mode -> specific __cuda_sm20_div_r{n,d,u,z}_f64 call
|
+-- type 54 (s16/s32)
| +-- __cuda_sm20_div_s{16,64} call
|
+-- type 56 (u16/u32)
+-- __cuda_sm20_div_u{16,64} call
The FP32 path at rounding mode > 39 generates a multi-segment inline PTX sequence with ~20 sprintf() calls, each appending a PTX instruction to the output buffer. This is the full-range IEEE-compliant FP32 division path that uses MUFU.RCP as a seed followed by FMA-based correction.
sub_5B0CD0 -- Reciprocal Codegen (44 KB)
Similar structure to the division handler. Dispatches by type (f32/f64) and rounding mode. For FP64, calls __cuda_sm20_rcp_f64_v3. For FP32, selects between 4 rounding modes x 2 FTZ variants x 2 paths (fast/slowpath) = up to 16 different intrinsic calls.
sub_5B4040 -- Square Root Codegen (49 KB)
Handles both FP32 (__cuda_sm20_sqrt_*) and FP64 (__cuda_sm20_dsqrt_*) variants. For FP64, the dsqrt_rn_f64_mediumpath_v1 variant provides an intermediate-complexity path between the fast approximation and the full Newton-Raphson template.
sub_583190 -- Base-2 Exponential (ex2)
Dispatches by operand type:
- FP32 with mode 1: short-form approximate path (single MUFU.EX2)
- FP32 with rounding mode > 39: full-range inline sequence with ~18 `sprintf()` segments generating a PTX sequence that includes range reduction, MUFU.EX2, and polynomial correction
- FP64: multi-operand call to a helper function
sub_57BFC0 -- Reciprocal Square Root (rsqrt)
Dispatches by type:
- FP64 with mode 1: short-form call to `__cuda_sm20_drsqrt_f64_v2`
- FP64 with rounding mode > 39: full inline sequence with ~35 `sprintf()` segments -- the longest inline math expansion for a single-precision-equivalent operation. The sequence implements range reduction, MUFU.RSQ, Newton-Raphson correction, and renormalization
- FP32: `MUFU.RSQ` for approximate, helper call for IEEE-compliant
Scheduling and Latency
MUFU instructions are scheduled on the SFU (Special Function Unit), which is functional unit index 8 in the ptxas latency model. Key scheduling properties:
| Property | Value |
|---|---|
| Functional unit | SFU (index 8) |
| Issue latency | 1 cycle (can issue every cycle) |
| Result latency | ~4 cycles (pipeline depth) |
| Throughput | 1 per 4 cycles per SM partition (16 per SM for 4 partitions) |
| Dual-issue | Cannot dual-issue with ALU on same warp |
The scheduler (sub_815820) places MUFU instructions to maximize overlap with ALU operations from other warps. The Newton-Raphson sequences interleave MUFU, DFMA, IMAD, and MOV instructions to hide the SFU pipeline latency behind ALU computation.
Fast-Math vs IEEE-Compliant Summary
| PTX Operation | Fast-Math (-use_fast_math) | IEEE-Compliant |
|---|---|---|
div.f32 | MUFU.RCP + FMUL (2 instr) | __cuda_sm20_div_rn_f32 call (~15 instr) |
div.f64 | N/A (no FP64 fast-math) | DDIV template (~100--120 instr) |
rcp.f32 | MUFU.RCP (1 instr) | __cuda_sm20_rcp_rn_f32 call (~10 instr) |
rcp.f64 | N/A | DRCP template (~90 instr) |
sqrt.f32 | MUFU.RSQ + FMUL (2 instr) | __cuda_sm20_sqrt_rn_f32 call (~12 instr) |
sqrt.f64 | N/A | DSQRT template (~80 instr) |
rsqrt.f32 | MUFU.RSQ (1 instr) | __cuda_sm20_drsqrt_f64_v2 (for f64) |
sin.f32 | MUFU.SIN (1 instr) | Range reduction + MUFU.SIN + correction |
cos.f32 | MUFU.COS (1 instr) | Range reduction + MUFU.COS + correction |
ex2.f32 | MUFU.EX2 (1 instr) | Range reduction + MUFU.EX2 + correction |
lg2.f32 | MUFU.LG2 (1 instr) | Range reduction + MUFU.LG2 + correction |
Key Function Reference
| Address | Size | Identity |
|---|---|---|
sub_5B76D0 | 64 KB | div codegen handler -- dispatches all division variants |
sub_5B0CD0 | 44 KB | rcp codegen handler -- reciprocal for f32/f64 |
sub_5B4040 | 49 KB | sqrt codegen handler -- square root for f32/f64 |
sub_57BFC0 | ~10 KB | rsqrt codegen handler -- reciprocal square root |
sub_583190 | ~14 KB | ex2 codegen handler -- base-2 exponential |
sub_52A5C0 | ~5 KB | lg2 codegen handler -- base-2 logarithm |
sub_505B00 | ~5 KB | tanh codegen handler -- hyperbolic tangent |
sub_573860 | ~7 KB | div.full codegen handler -- FP64 full-precision division |
sub_589810 | ~13 KB | rem codegen handler -- integer remainder |
sub_5D1660 | 46 KB | Master intrinsic registration (608 entries) |
sub_5FF700 | 354 KB | Prototype generator (PTX .weak .func declarations) |
sub_80E9B0 | ~1.5 KB | LowerSpecialFunctions -- MUFU emission pass |
sub_170E8B0 | -- | DDIV top-level handler |
sub_1718D60 | 790 B | DRCP/DSQRT coordinator wrapper |
sub_17276C0 | 1,011 B | DRSQRT coordinator wrapper |
sub_1704070 | 263 B | DDIV register-pressure dispatcher |
sub_1724A20 | 28 KB | 32-bit integer division via MUFU.RCP |
sub_1728930 | 16.5 KB | 64-bit unsigned integer division |
sub_1727AC0 | 13.8 KB | 64-bit signed integer division |
sub_AED3C0 | 28 KB | Master lowering dispatcher (invokes all templates) |
sub_10C0170 | ~5 KB | MUFU Mercury encoding function (sm_100+) |
Tensor Core Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas supports five generations of tensor core operations spanning SM 70 through SM 100. The binary contains three major codegen handlers -- sub_5C7A50 (173KB, WMMA), sub_5C10A0 (120KB, MMA), and sub_5BBC30 (90KB, tcgen05.mma) -- plus four WGMMA handlers, eleven tcgen05 instruction handlers, and ~400 numeric MMA hash table entries. Together these constitute ~500KB of code generation logic dedicated to tensor core instructions, making this the single largest functional subsystem in ptxas.
| WMMA codegen | sub_5C7A50 (173KB) -- wmma.mma instruction code generation |
| MMA codegen | sub_5C10A0 (120KB) -- mma.sync instruction code generation |
| TCGen05 MMA codegen | sub_5BBC30 (90KB) -- tcgen05.mma instruction code generation |
| WMMA load/store | sub_5A0EA0 (7.8KB), sub_5A8E40 (9.8KB), sub_5A6BD0 (8.8KB), sub_5A2D10 (8.1KB) |
| WGMMA handlers | sub_50AC70 (1.3KB), sub_4DA380 (295B), sub_4DA4B0 (295B), sub_4DA5E0 (311B) |
| MMA validator | sub_4C2FD0 (12.2KB), sub_49BBA0 (11.4KB), sub_4BFED0 (10.3KB) |
| Numeric MMA hash | ~400 entries at compilation context offset a1+816 |
| Prototype generator | sub_5FF700 (354KB) -- generates .weak .func PTX declarations |
| SASS MMA encoders | sub_6D4350, sub_6D7AF0, sub_6D5CB0, sub_6D69B0 |
Tensor Core Generations
ptxas tracks five distinct tensor core generations, each introducing new SASS opcodes, data types, and matrix shapes. The internal scheduling counters (visible in DUMPIR statistics output) reveal the SASS-level taxonomy.
| Gen | SM | SASS Opcodes | PTX API | Key Addition |
|---|---|---|---|---|
| 1st | 75 (Turing) | HMMA (m16n8k8, m8n8k16) | wmma.mma | FP16 tensor cores; INT8/INT4/B1 WMMA |
| 2nd | 80 (Ampere) | HMMA (m16n8k16), IMMA, DMMA, BMMA | wmma.mma, mma.sync | BF16, TF32, FP64, structured sparsity |
| 3rd | 89 (Ada) | HMMA (extended) | mma.sync | FP8 (E4M3, E5M2), block-scale MMA |
| 4th | 90 (Hopper) | WGMMA | wgmma.mma_async | Warpgroup MMA, async pipeline, sub-byte sparse |
| 5th | 100 (Blackwell) | UTCHMMA, UTCIMMA, tcmma | tcgen05.mma, tcgen05.mma.ws | Tensor memory (TMEM), warp-shared, block-scale |
Scheduling Unit Names
The binary's statistics printer functions (clones at 0x700-byte intervals from sub_ABBA50) emit per-unit throughput counters using the internal SASS operation classification:
| Counter | SASS Operation | Matrix Shape | Description |
|---|---|---|---|
hmma1688 | HMMA m16n8k8 | 16x8x8 | 1st-gen FP16 MMA (Turing) |
hmma1688f16 | HMMA m16n8k8 (f16 accum) | 16x8x8 | FP16 accumulation variant |
hmma16816 | HMMA m16n8k16 | 16x8x16 | 2nd-gen FP16 MMA (Ampere+) |
hmma16816f16 | HMMA m16n8k16 (f16 accum) | 16x8x16 | FP16 accumulation variant |
hmmaSp1688 | HMMA.SP m16n8k8 | 16x8x(8*2) | Sparse FP16 (2:4 sparsity) |
hmmaSp1688f16 | HMMA.SP m16n8k8 (f16 accum) | 16x8x(8*2) | Sparse FP16, f16 accum |
imma16816 | IMMA m16n8k16 | 16x8x16 | Integer MMA (INT8) |
imma16832 | IMMA m16n8k32 | 16x8x32 | Integer MMA (INT4/sub-byte) |
immaSp8832 | IMMA.SP m8n8k32 | 8x8x(32*2) | Sparse integer (m8 variant) |
immaSp16832 | IMMA.SP m16n8k32 | 16x8x(32*2) | Sparse integer (m16 variant) |
dmma | DMMA | 8x8x4 | FP64 tensor MMA |
fma64 | FMA64 | -- | FP64 FMA (non-tensor) |
Format strings from the binary:
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [est hmma16816=%d] [est hmma16816f16=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f]
These counters are emitted by the post-scheduling statistics pass. The scheduler treats each counter as a separate functional unit throughput class, with dedicated latency table entries:
| Ori Opcode | Latency Class | Description |
|---|---|---|
0x144 | 600 | Tensor fence |
0x145--0x146 | 759 | HMMA/BMMA tensor core operations |
0x147--0x148 | 757 or 761 | Narrow/wide DP tensor operations |
0x149 | 604 | Tensor sync |
PTX-Level Instruction Lowering
WMMA Instructions
The WMMA (Warp Matrix Multiply-Accumulate) API is the oldest tensor core interface. Five PTX instructions are registered in the opcode dispatch table at sub_5D4190:
| PTX Instruction | Codegen Handler | Size | Purpose |
|---|---|---|---|
wmma.load.a | sub_5A0EA0 | 7,779B | Load matrix fragment A |
wmma.load.b | sub_5A8E40 | 9,757B | Load matrix fragment B |
wmma.load.c | sub_5A6BD0 | 8,813B | Load accumulator fragment C |
wmma.store.d | sub_5A2D10 | 8,074B | Store result fragment D |
wmma.mma | sub_5C7A50 | 173KB | Matrix multiply-accumulate |
All five handlers allocate a 50,000-byte code generation buffer. The load/store handlers are ~8--10KB each and cover the combinatorial product of shape, layout, data type, and address space. The wmma.mma handler at 173KB is the largest single codegen handler in ptxas.
Instruction property accessors used by WMMA codegen:
| Accessor | Purpose |
|---|---|
sub_7075C0 | Get instruction flag A (layout/type encoding) |
sub_707BC0 | Get instruction flag B (variant/mode encoding) |
sub_7075E0 | Get layout string (row/col) |
sub_707BE0 | Get shape string (m16n16k16 etc.) |
sub_70A810 | Get scale string (satfinite etc.) |
WMMA Shapes and Types
sm_70 (Volta/Turing) -- 1st generation:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m32n8k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
| m8n32k16 | f16 | f16 or f32 | 8 x b32 | 8 x b32 | 4 x b32 (f16) or 8 x b32 (f32) |
sm_72 (Turing) -- integer WMMA extension:
| Shape | Data Types | Accumulator | Regs (A) | Regs (B) | Regs (C/D) |
|---|---|---|---|---|---|
| m16n16k16 | s8, u8 | s32 | 2 x b32 | 2 x b32 | 8 x b32 |
| m32n8k16 | s8, u8 | s32 | 4 x b32 | 1 x b32 | 8 x b32 |
| m8n32k16 | s8, u8 | s32 | 1 x b32 | 4 x b32 | 8 x b32 |
| m8n8k32 | s4, u4 (sub-byte) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
| m8n8k128 | b1 (1-bit) | s32 | 1 x b32 | 1 x b32 | 2 x b32 |
sm_80 (Ampere) -- 2nd generation extensions:
| Shape | Data Types | Accumulator | Notes |
|---|---|---|---|
| m16n16k16 | bf16 | f32 | New BF16 support, 4 x b32 per fragment |
| m16n16k8 | tf32 | f32 | New TF32 support, 4 x b32 per fragment |
| m8n8k4 | f64 | f64 | Double-precision MMA, 1 x f64 per A/B, 2 x f64 per C/D |
All WMMA shapes support three address spaces (generic, global, shared) and optional descriptor-based addressing (_desc variants). The binary contains separate intrinsic registrations for each combination, with the full prototype available in the string table.
MMA (mma.sync) Instructions
The mma.sync API uses a single PTX opcode mma dispatched to sub_5C10A0 (120KB). Unlike WMMA, it uses asymmetric matrix shapes with M and N decoupled from K, and operates at a single-warp granularity.
The numeric MMA hash table at a1+816 collapses the full variant space (shape + type + layout) into ~400 hash entries. Each entry maps a numeric string key (e.g., "2644314910") to a specific codegen handler function pointer, avoiding multi-dimensional dispatch.
sm_80+ MMA intrinsics (IDs 0x209--0x22F):
The 39 intrinsics registered under __cuda_sm_8x_mma_* cover:
| Intrinsic Pattern | Layout | Types (D, C, A, B) | Regs |
|---|---|---|---|
mma_row_col_f16_f16_f16_f16 | row x col | f16, f16, f16, f16 | D: 4 x b32 |
mma_row_col_f32_f16_f16_f16 | row x col | f32, f16, f16, f16 | D: 8 x b32 |
mma_row_col_f32_f16_f16_f32 | row x col | f32, f32, f16, f16 | D: 8 x b32, C: 8 x b32 |
mma_col_col_* | col x col | same set | same |
mma_row_row_* | row x row | same set | same |
mma_col_row_* | col x row | same set | same |
mma_shfl_f16 | -- | shuffle f16 | D: 2 x b32 |
mma_shfl_f32 | -- | shuffle f32 | D: 4 x b32 |
Prototype examples from the binary:
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_8x_mma_row_col_f16_f16_f16_f16
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3);
.weak .func (.param .align 16 .b32 mma_dst[8])
__cuda_sm_8x_mma_row_col_f32_f16_f16_f32
(.reg .b32 a0, .reg .b32 a1, .reg .b32 b0, .reg .b32 b1,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 c4, .reg .b32 c5, .reg .b32 c6, .reg .b32 c7);
Note the .param .align 16 .b32 mma_dst[N] return convention -- MMA results are returned through aligned parameter space, not registers, because the warp-cooperative nature of the operation means each thread holds only a fragment.
MMA Shape Summary Across Generations
| Shape | Types | SM Floor | SASS Opcode | Notes |
|---|---|---|---|---|
| m8n8k4 | f64 | 80 | DMMA | Double-precision |
| m8n8k16 | f16 | 75 | HMMA | Original Turing shape |
| m8n8k32 | s4/u4 | 75 | IMMA | Sub-byte integer |
| m8n8k128 | b1 | 75 | BMMA | 1-bit (XOR/AND pop) |
| m16n8k8 | f16 | 75 | HMMA | Asymmetric Turing shape |
| m16n8k16 | f16, bf16, s8/u8 | 80 | HMMA, IMMA | Primary Ampere shape |
| m16n8k32 | s8/u8, s4/u4 | 80/90 | IMMA | Integer, sub-byte at sm_90 |
| m16n8k64 | s4/u4 | 90 | IMMA | Hopper sub-byte extension |
| m16n8k128 | s4/u4 (sparse), b1 | 90 | IMMA, BMMA | Hopper sparse sub-byte |
| m16n8k256 | b1 | 90/100 | BMMA | Extended 1-bit MMA |
MMA Validation
Three validator functions gate MMA features by SM version:
sub_4C2FD0 (12.2KB) -- WMMA/MMA master validator:
Performs three-way version checks:
- SM 75: base WMMA (f16)
- SM 80: extended types (BF16, TF32, FP64 -- "MMA with double types")
- SM 90: WGMMA features
- FP8: "mma with FP8 floating point type" (gated by sm_89+)
sub_49BBA0 (11.4KB) -- MMA type/scale validator:
Validates FP8 and block-scale configurations:
"mma with FP8 floating point type"-- sm_89+ gate"mma with FP8 floating point type and FP16 accumulation"-- additional FP16 accum check"mma with FP8 floating point type and .m16n8k16 shape"-- shape/type cross-validation"Sparse mma with block scale"-- block-scale + sparsity interaction.block_scalemodifier validation
sub_4BFED0 (10.3KB) -- WMMA shape validator:
Validates WMMA-specific shapes and the .aligned modifier:
".aligned modifier for wmma"-- alignment enforcement- SM 75/80 version checks for shape legality
sub_490F90 -- Integer MMA validator:
Checks integer MMA shape validity: "Integer MMA with shape " -- validates m/n/k dimensions against the SM-level capability set.
sub_494210 (2.3KB) -- Sparse GMMA validator:
Validates sparse MMA metadata: "Sparse GMMA with " -- checks 2:4 sparsity pattern encoding.
sub_495900 -- WMMA floating-point validator:
Checks: "'wmma.mma' with floating point type" -- validates FP type compatibility with the target shape.
sub_4428E0 -- FP64 MMA validator:
Validates: "mma with .f64 type" -- gates double-precision MMA on sm_80+.
sm_90+ Sub-Byte MMA Intrinsics (IDs 0x23A--0x25F)
Hopper introduces 38 sub-byte MMA intrinsics covering s4/u4 sparse operations at warp granularity. These are distinct from the WGMMA API and provide backward-compatible sub-byte operations through the classical mma.sync interface.
Dense sub-byte (m8n8k32, m16n8k32, m16n8k64):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m8n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k32 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
| m16n8k64 | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
Sparse sub-byte (m16n8k64, m16n8k128):
| Shape | Type Combinations | Variants | Count |
|---|---|---|---|
| m16n8k64 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite, split _0/_1 | 16 |
| m16n8k128 (sparse) | s4xs4, s4xu4, u4xs4, u4xu4 | plain + satfinite | 8 |
The _0 and _1 suffixes on sparse m16n8k64 represent the two halves of a split operation -- the K dimension is decomposed into two steps for the sparsity pattern. The sparse variants take an additional e (metadata) operand encoding the 2:4 sparsity pattern.
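The 2:4 pattern behind the `e` operand can be sketched with a small model. This is an illustrative helper (not ptxas code): for each group of four elements it keeps two and packs their positions as two 2-bit indices, which is the information the metadata operand carries per group.

```python
def compress_2to4(values):
    """Compress a dense vector into 2:4 structured-sparse form.

    For every group of 4 elements, keep the 2 largest-magnitude entries
    and record their positions as two 2-bit indices -- a model of the
    per-group information carried by the sparse metadata operand e.
    """
    assert len(values) % 4 == 0
    kept, meta = [], []
    for g in range(0, len(values), 4):
        group = values[g:g + 4]
        # Indices of the two largest-magnitude entries, ascending order.
        idx = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        kept.extend(group[i] for i in idx)
        # Pack the two 2-bit indices into a 4-bit metadata nibble.
        meta.append(idx[0] | (idx[1] << 2))
    return kept, meta

kept, meta = compress_2to4([5, 0, 0, 7,   0, 3, 1, 0])
# First group keeps indices 0 and 3; second keeps indices 1 and 2.
```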
Bit-operations (b1):
| Shape | Operation | SM | Intrinsic |
|---|---|---|---|
| m8n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m8n8k128 |
| m16n8k128 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k128 |
| m16n8k256 | XOR | 90 | __cuda_sm_9x_mma_bit_internal_xor_m16n8k256 |
Prototype example (sparse m16n8k128):
.weak .func (.param .align 16 .b32 mma_dst[4])
__cuda_sm_9x_mma_sub_byte_internal_sparse_m16n8k128_s4_s4
(.reg .b32 a0, .reg .b32 a1, .reg .b32 a2, .reg .b32 a3,
.reg .b32 b0, .reg .b32 b1, .reg .b32 b2, .reg .b32 b3,
.reg .b32 c0, .reg .b32 c1, .reg .b32 c2, .reg .b32 c3,
.reg .b32 e);
The sparse m16n8k128 variant takes 4 A-operands, 4 B-operands (double-width due to sparsity encoding), 4 C-operands, and 1 metadata operand e.
sm_100 Blackwell MMA (IDs 0x230--0x239)
Blackwell introduces 10 intrinsics for extended MMA operations:
HMMA/IMMA metadata helpers (__cuda_sm_10x_*_mdata_*):
| Intrinsic | Shape | Returns | Inputs |
|---|---|---|---|
__cuda_sm_10x_hmma_mdata_m16n8k16 | m16n8k16 | ret_dst[3] | a0, a1, e, f_temp |
__cuda_sm_10x_hmma_mdata_m16n8k32 | m16n8k32 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
__cuda_sm_10x_imma_mdata_m16n8k32 | m16n8k32 | ret_dst[3] | a0, a1, e, f_temp |
__cuda_sm_10x_imma_mdata_m16n8k64 | m16n8k64 | ret_dst[5] | a0, a1, a2, a3, e, f_temp |
These mdata functions compute sparse metadata for Blackwell's 5th-gen tensor cores. The e parameter is the sparsity selector, f_temp is a scratch register. The return array includes the transformed A-operands plus the computed metadata word.
Bit MMA (AND + XOR for sm_100):
| Intrinsic | Shape | Operation | Regs |
|---|---|---|---|
__cuda_sm_10x_mma_bit_internal_and_m8n8k128 | m8n8k128 | AND | D: 2, A: 1, B: 1, C: 2 |
__cuda_sm_10x_mma_bit_internal_and_m16n8k128 | m16n8k128 | AND | D: 4, A: 2, B: 1, C: 4 |
__cuda_sm_10x_mma_bit_internal_and_m16n8k256 | m16n8k256 | AND | D: 4, A: 4, B: 2, C: 4 |
__cuda_sm_10x_mma_bit_internal_xor_m8n8k128 | m8n8k128 | XOR | same as AND |
__cuda_sm_10x_mma_bit_internal_xor_m16n8k128 | m16n8k128 | XOR | same as AND |
__cuda_sm_10x_mma_bit_internal_xor_m16n8k256 | m16n8k256 | XOR | same as AND |
Blackwell adds the AND reduction mode for 1-bit MMA (sm_90 only had XOR).
WGMMA -- Warpgroup MMA (SM 90+)
WGMMA (Warp Group Matrix Multiply-Accumulate) operates at warpgroup granularity (4 warps, 128 threads) and uses an asynchronous pipeline protocol. Four PTX instructions are registered:
| PTX Instruction | Handler | Size | Role |
|---|---|---|---|
wgmma.mma_async | sub_50AC70 | 1,282B | Dispatch asynchronous MMA operation |
wgmma.fence | sub_4DA380 | 295B | Open pipeline stage |
wgmma.commit_group | sub_4DA4B0 | 295B | Close pipeline stage |
wgmma.wait_group | sub_4DA5E0 | 311B | Wait for N committed groups |
WGMMA Pipeline Protocol
The hardware requires strict sequencing:
wgmma.fence -- open pipeline stage
wgmma.mma_async (1..N) -- asynchronous MMA operations sharing accumulators
wgmma.commit_group -- close pipeline stage
wgmma.wait_group N -- wait for N outstanding groups to complete
Between fence and wait, strict register constraints apply:
- No non-WGMMA definitions of accumulator registers
- No non-WGMMA reads of accumulator registers
- No non-WGMMA definitions of WGMMA input registers (including descriptors)
Violation of any constraint triggers pipeline serialization via sub_AE47B0, which collapses the pipeline to individual fence/mma/commit/wait per operation. The serialization reason is reported through warning codes 0x1D55--0x1D5E (see GMMA Pipeline).
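The three constraints amount to a scan over the fence..wait window. The sketch below is a simplified model of that check, not the sub_AE47B0 implementation; instruction and register names are hypothetical.

```python
def needs_serialization(window, accum_regs, input_regs):
    """Check a fence..wait instruction window against the three WGMMA
    register constraints. `window` is a list of (opcode, defs, uses)
    triples; any opcode not starting with 'wgmma' counts as a
    non-WGMMA instruction. Returns the violated rule, or None.
    Illustrative model only."""
    for op, defs, uses in window:
        if op.startswith("wgmma"):
            continue
        if defs & accum_regs:
            return "non-WGMMA def of accumulator"
        if uses & accum_regs:
            return "non-WGMMA read of accumulator"
        if defs & input_regs:
            return "non-WGMMA def of WGMMA input"
    return None

window = [
    ("wgmma.mma_async", {"acc0"}, {"a0", "desc_b"}),
    ("mov",             {"acc0"}, set()),   # clobbers an accumulator
]
print(needs_serialization(window, {"acc0"}, {"a0", "desc_b"}))
```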
WGMMA Descriptor Format
The wgmma.mma_async handler (sub_50AC70, 1,282 bytes) encodes the operation's matrix dimensions, data types, layout, and scale factors into the instruction. The A operand can be either a register operand or a descriptor -- a 64-bit value encoding the matrix base address, leading dimension, stride, and swizzle pattern. The B operand is always descriptor-based.
The descriptor format allows the hardware to fetch matrix data directly from shared memory via the TMA (Tensor Memory Accelerator), bypassing register file involvement for the B matrix operand entirely.
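A descriptor of this kind can be sketched as a 64-bit bitfield pack. The field layout below follows NVIDIA's public PTX ISA description of wgmma shared-memory descriptors (addresses and offsets stored in 16-byte units) and is an assumption for illustration; it was not recovered from the binary.

```python
def make_smem_descriptor(base_addr, lead_off, stride_off, swizzle_mode):
    """Pack a 64-bit shared-memory matrix descriptor (assumed layout:
    bits 0-13 base address, 16-29 leading-dim offset, 32-45 stride
    offset, 62-63 swizzle mode; all address fields >> 4)."""
    d = 0
    d |= (base_addr >> 4) & 0x3FFF           # bits 0-13: matrix base
    d |= ((lead_off >> 4) & 0x3FFF) << 16    # bits 16-29: leading-dim offset
    d |= ((stride_off >> 4) & 0x3FFF) << 32  # bits 32-45: stride offset
    d |= (swizzle_mode & 0x3) << 62          # bits 62-63: swizzle pattern
    return d

desc = make_smem_descriptor(0x400, lead_off=0x80, stride_off=0x100,
                            swizzle_mode=1)
```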
Ori Internal Encoding
| Constant | Value | Meaning |
|---|---|---|
| WGMMA opcode | 309 | Ori opcode for wgmma.mma_async |
| Arrive opcode (masked) | 271 | opcode & 0xFFFFCFFF for _warpgroup.arrive/wait |
| Commit opcode | 323 | Ori opcode for _warpgroup.commit_batch |
| Accum reg_type | 6 | vreg+64 value for tensor/accumulator registers |
| Accum src tag | 0x90000000 | High nibble tag for source accumulator set |
| Accum dst tag | 0x10000000 | High nibble tag for destination accumulator set |
Compiler-Inserted Warpgroup Instructions
The GMMA pipeline passes (phases 85 and 87) insert three compiler-internal pseudo-operations prefixed with underscore:
| Pseudo-Op | SASS Output | Purpose |
|---|---|---|
_warpgroup.arrive | WARPSYNC / BAR.ARRIVE | Warpgroup synchronization (arrive) |
_warpgroup.wait | WARPSYNC / BAR.WAIT | Warpgroup synchronization (wait) |
_warpgroup.commit_batch | DEPBAR variant | Warpgroup dependency barrier |
These are not directly written by the programmer. The compiler inserts them to manage register ownership transfer between the warpgroup's register file and the tensor core's accumulator pipeline.
TCGen05 -- 5th Generation Tensor Cores (SM 100+)
Blackwell introduces TCGen05 (Tensor Core Generation 5), which operates through tensor memory (TMEM) -- a dedicated on-chip memory visible only to the tensor core engine, separate from the register file and shared memory. Eleven PTX instructions are registered:
| PTX Instruction | Handler | Size | Purpose |
|---|---|---|---|
tcgen05.mma | sub_5BBC30 | 90KB | Tensor core MMA from TMEM |
tcgen05.mma.ws | sub_58FA20 | 4,604B | Warp-shared MMA variant |
tcgen05.ld | sub_574050 | -- | Load data into TMEM |
tcgen05.ld.red | sub_578DB0 | -- | Load-reduce into TMEM |
tcgen05.st | sub_571FE0 | -- | Store from TMEM |
tcgen05.cp | sub_5427F0 | -- | Copy within TMEM |
tcgen05.commit | sub_56C190 | -- | Commit pending operations |
tcgen05.shift | sub_4F1A90 | -- | Shift TMEM contents |
tcgen05.alloc | sub_569180 | -- | Allocate TMEM columns |
tcgen05.dealloc | sub_58C7F0 | -- | Deallocate TMEM columns |
tcgen05.relinquish_alloc_permit | sub_526370 | -- | Release allocation rights |
TCGen05 MMA Codegen
The tcgen05.mma handler (sub_5BBC30, 90KB) is the third-largest codegen handler in ptxas. It:
- Allocates a 50,000-byte code generation buffer
- Validates tcgen05 capability via sub_70FA00(*, 29)
- Handles standard, sparse (.sp), and warp-shared (.ws) variants
- Extracts sparse metadata via sub_70F0A0
- Generates TMEM address computation code
The tcgen05.mma.ws formatter (sub_58FA20, 4,604B, also used for tcgen05.shift) handles the warp-shared variant where multiple warps contribute to a single MMA operation.
TCGen05 SASS Encoding
At the SASS level, TCGen05 operations are encoded by four specialized Mercury encoder functions:
| Address | Handler | Purpose |
|---|---|---|
sub_6D4350 | SASS MMA encoder | Primary MMA SASS emission |
sub_6D7AF0 | SASS MMA encoder | Alternate MMA variant |
sub_6D5CB0 | SASS MMA encoder | Additional MMA mode |
sub_6D69B0 | SASS MMA encoder | Additional MMA mode |
The SASS encoder at sub_6D4350 references the tcmma operation namespace and validates block-scale configurations:
"tcmma_*_o must be specified with blockscale"-- output operand requires block-scale modifier"uri width for tcmma_*_o must be 2"-- output URI width constraint"tcmma_*_q with blockscale must have uri width of 2"-- scale factor operand constraint"tcmma_*_mxq must be specified with blockscale"-- MX quantization operand constraint"For UTCHMMA, #scaleU4 must be 0 in SPA 10.1."-- SM 100 vs 103 compatibility
The string "UTCHMMA" (Unified Tensor Core HMMA) and "tcmma" (Tensor Core MMA) are the internal SASS-level names for Blackwell's tensor core operations.
TCGen05 Guardrails
Blackwell includes a debug/validation mode activated by --g-tensor-memory-access-check. When enabled, ptxas wraps TMEM operations with guardrail trap functions. Ten guardrail intrinsics are registered (IDs 0x20--0x2A):
| Intrinsic | Trap Condition |
|---|---|
tcgen05_guardrail_trap_phase_invalid_during_alloc | TMEM allocation during invalid phase |
tcgen05_guardrail_trap_current_warp_owner_invalid | Warp accessing TMEM it does not own |
tcgen05_guardrail_trap_unallocated_columns_access | Access to unallocated TMEM columns |
tcgen05_guardrail_trap_unallocated_columns_being_dealloced | Deallocation of unallocated columns |
tcgen05_guardrail_trap_col_being_dealloced_not_returned_by_alloc | Dealloc of column not from alloc |
tcgen05_guardrail_trap_allocation_granularity_invalid | Invalid allocation granularity |
tcgen05_guardrail_trap_access_out_of_physical_bounds | Out-of-bounds TMEM access |
tcgen05_guardrail_trap_invalid_datapath_alignment | TMEM datapath misalignment |
tcgen05_guardrail_trap_sparse_mismatch_between_idesc_mod | Sparsity mismatch in instruction descriptor |
tcgen05_guardrail_trap_sp_used_in_unsupported_env | Sparsity in unsupported config |
Eight guardrail PTX instructions are registered in the opcode dispatch table:
| PTX Instruction | Check |
|---|---|
_tcgen05.guardrails.is_phase_valid | Phase validity for alloc |
_tcgen05.guardrails.is_current_warp_valid_owner | Warp ownership |
_tcgen05.guardrails.are_columns_allocated | Column allocation status |
_tcgen05.guardrails.in_physical_bounds | Physical bounds check |
_tcgen05.guardrails.allocation_granularity | Granularity validation |
_tcgen05.guardrails.datapath_alignment | Alignment validation |
_tcgen05.guardrails.sp_consistency_across_idesc_mod | Sparsity consistency |
_tcgen05.guardrails.check_sparse_usage | Sparse usage validation |
The guardrail check functions are .FORCE_INLINE and return a boolean retVal:
.FORCE_INLINE .func (.reg .b32 retVal)
__cuda_sm10x_tcgen05_guardrails_check_datapath_alignment
(.reg .u32 tmemAddr, .reg .u32 iDesc, .reg .u32 cta_group,
.reg .u32 hasWS, .reg .u32 hasSP, .reg .u32 matrix_kind);
The parameters reveal TMEM addressing structure: tmemAddr (TMEM base address), iDesc (instruction descriptor), cta_group (CTA group for cluster operations), hasWS (warp-shared flag), hasSP (sparse flag), matrix_kind (operand role).
Block-Scale MMA
Block-scale MMA allows per-block scaling factors for mixed-precision computation. In ptxas, this is gated by the .block_scale modifier on the PTX mma instruction:
- Validator sub_49BBA0 checks ".block_scale" and "Sparse mma with block scale" (sparsity + block-scale interaction)
- Additional intrinsic suffix __cuda_sm_100_tcgen05_ld_immhalfSplitOff and variants handle block-scale-aware loads
- The bf16x2.ue8m0x2 type string in the binary indicates the UE8M0 (unsigned exponent-only) scale factor format for MX (microscaling) quantization
Intrinsic Registration Summary
Full Tensor Core Intrinsic ID Map
| ID Range | Count | SM | Category | SASS Target |
|---|---|---|---|---|
0x89--0x1FA (subset) | ~200 | 70+ | __cuda_sm70_wmma_* -- WMMA load/store/mma (f16) | HMMA |
0x1FB--0x208 | 14 | 80 | __cuda_sm80_* -- bf16/tf32/s4/s8/b1 MMA, createpolicy | HMMA, IMMA, DMMA, BMMA |
0x209--0x22F | 39 | 80+ | __cuda_sm_8x_mma_* -- direct MMA operations | HMMA, IMMA |
0x230--0x239 | 10 | 100 | __cuda_sm_10x_* -- hmma/imma mdata + bit MMA | UTCHMMA, UTCIMMA |
0x23A--0x25F | 38 | 90 | __cuda_sm_9x_mma_sub_byte_internal_* -- sub-byte sparse | IMMA |
TCGen05 Intrinsics (Not in Master ID Table)
TCGen05 operations are dispatched through the named opcode table, not the numeric ID table. The 11 tcgen05.* instructions are registered directly in the opcode hash map at a1+808:
| PTX Opcode | Codegen Handler |
|---|---|
tcgen05.alloc | sub_569180 |
tcgen05.relinquish_alloc_permit | sub_526370 |
tcgen05.dealloc | sub_58C7F0 |
tcgen05.ld | sub_574050 |
tcgen05.ld.red | sub_578DB0 |
tcgen05.st | sub_571FE0 |
tcgen05.commit | sub_56C190 |
tcgen05.cp | sub_5427F0 |
tcgen05.shift | sub_4F1A90 |
tcgen05.mma | sub_5BBC30 |
tcgen05.mma.ws | sub_58FA20 |
OCG-Level MMA Operations
The OCG system at sub_6C9EB0 registers additional SASS-level MMA operations for SM100+:
tcmma (2x64dp128bitlw02lw13, 2x64dp128bitlw01lw23, 4x32dp128bit)
tcbar, mmareadshma
16dp32bitt0t15, 16dp32bitt16t31
sparsify, spfactor2to4
tcshift, tcatomsws, tcldsws, tcstsws
These are internal Mercury-level operations that do not correspond 1:1 to PTX instructions. The tcmma variants encode the specific SASS datapath configuration: 2x64dp128bit means 2 datapaths, 64-element wide, 128-bit lane width.
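The datapath naming rule above is regular enough to parse mechanically. This is a purely string-level illustration of the stated convention; `parse_dp_config` is a hypothetical helper, not a recovered function.

```python
import re

def parse_dp_config(name):
    """Split a Mercury datapath suffix like '2x64dp128bitlw02lw13'
    into (num_datapaths, elements, lane_bits), per the naming rule
    stated above. Trailing lane-wiring suffixes (lw...) are ignored."""
    m = re.fullmatch(r"(\d+)x(\d+)dp(\d+)bit\w*", name)
    if not m:
        return None
    return tuple(int(g) for g in m.groups())

assert parse_dp_config("2x64dp128bitlw02lw13") == (2, 64, 128)
assert parse_dp_config("4x32dp128bit") == (4, 32, 128)
```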
Data Type Matrix
Complete data type support across tensor core generations as visible in the binary:
| Data Type | PTX Type | Width | SM Floor | Use |
|---|---|---|---|---|
| FP16 | .f16 | 16b | 75 | Primary HMMA operand |
| BF16 | .bf16 | 16b | 80 | HMMA alternate format |
| TF32 | .tf32 | 19b (stored as 32b) | 80 | Reduced-precision FP32 |
| FP32 | .f32 | 32b | 75 | HMMA accumulator |
| FP64 | .f64 | 64b | 80 | DMMA (double-precision) |
| FP8 E4M3 | .e4m3 | 8b | 89 | Ada/Hopper FP8 MMA |
| FP8 E5M2 | .e5m2 | 8b | 89 | Ada/Hopper FP8 MMA |
| INT8 | .s8, .u8 | 8b | 72 | IMMA integer MMA |
| INT4 | .s4, .u4 | 4b | 72 | IMMA sub-byte MMA |
| INT1 | .b1 | 1b | 75 | BMMA 1-bit MMA (XOR/AND) |
| UE8M0 | .ue8m0x2 | 8b (packed) | 100 | Block-scale exponent factor |
| B1024 | .b1024 | 1024b | 100 | TMEM-width operand |
ELF Metadata
The cubin output includes tensor-core-specific EIATTR attributes:
| EIATTR | Purpose |
|---|---|
EIATTR_SPARSE_MMA_MASK | Records which MMA operations use structured sparsity |
EIATTR_TCGEN05_1CTA_USED | Kernel uses 1-CTA tcgen05 operations |
EIATTR_TCGEN05_2CTA_USED | Kernel uses 2-CTA tcgen05 operations |
Knobs and Options
| Knob / Option | Purpose |
|---|---|
suppress-sparse-mma-advisory-info | Suppress advisory info for mma.sp operations |
--g-tensor-memory-access-check | Enable tcgen05 guardrail instrumentation |
Key Function Table
| Address | Size | Identity | Confidence |
|---|---|---|---|
0x5C7A50 | 173KB | WMMA.MMA codegen (all shapes/types/layouts) | 98% |
0x5C10A0 | 120KB | MMA.sync codegen (post-Volta shapes) | 98% |
0x5BBC30 | 90KB | TCGen05.MMA codegen (Blackwell TMEM-based) | 98% |
0x5A0EA0 | 7,779B | wmma.load.a formatter | 95% |
0x5A8E40 | 9,757B | wmma.load.b formatter | 95% |
0x5A6BD0 | 8,813B | wmma.load.c formatter | 95% |
0x5A2D10 | 8,074B | wmma.store.d formatter | 95% |
0x50AC70 | 1,282B | wgmma.mma_async handler | 99% |
0x4DA380 | 295B | wgmma.fence handler | 99% |
0x4DA4B0 | 295B | wgmma.commit_group handler | 99% |
0x4DA5E0 | 311B | wgmma.wait_group handler | 99% |
0x58FA20 | 4,604B | tcgen05.mma.ws / tcgen05.shift formatter | 95% |
0x4DA720 | 343B | tcgen05.mma.ws formatter | 90% |
0x569180 | -- | tcgen05.alloc handler | 90% |
0x526370 | -- | tcgen05.relinquish_alloc_permit handler | 90% |
0x58C7F0 | -- | tcgen05.dealloc handler | 90% |
0x574050 | -- | tcgen05.ld handler | 90% |
0x578DB0 | -- | tcgen05.ld.red handler | 90% |
0x571FE0 | -- | tcgen05.st handler | 90% |
0x56C190 | -- | tcgen05.commit handler | 90% |
0x5427F0 | -- | tcgen05.cp handler | 90% |
0x4F1A90 | -- | tcgen05.shift handler | 90% |
0x4C2FD0 | 12.2KB | WMMA/MMA master validator (sm_75/80/90) | 90% |
0x49BBA0 | 11.4KB | MMA type/scale validator (FP8, block-scale) | 90% |
0x4BFED0 | 10.3KB | WMMA shape validator | 90% |
0x490F90 | -- | Integer MMA shape validator | 85% |
0x494210 | 2,276B | Sparse GMMA validator | 85% |
0x495900 | -- | WMMA floating-point type validator | 85% |
0x496570 | -- | FP8 MMA shape validator | 85% |
0x4961F0 | -- | FP8 MMA accumulation validator | 85% |
0x4428E0 | -- | FP64 MMA type validator | 85% |
0x6D4350 | -- | SASS TCGen05 MMA encoder (UTCHMMA/tcmma) | 85% |
0x6D7AF0 | -- | SASS MMA encoder variant | 85% |
0x6D5CB0 | -- | SASS MMA encoder variant | 85% |
0x6D69B0 | -- | SASS MMA encoder variant | 85% |
0x50D4B0 | 1,187B | ldmatrix formatter | 90% |
0x4DAEA0 | -- | movmatrix formatter | 90% |
0x4F05D0 | -- | stmatrix formatter | 90% |
0x5D1660 | 46KB | Master intrinsic registration (608 entries) | 99% |
0x5D4190 | 41KB | Opcode dispatch (MMA hash table builder) | 99% |
0x5FF700 | 354KB | Prototype generator (.weak .func declarations) | 99% |
Cross-References
- Intrinsic Table Architecture -- Master registration, ID ranges, opcode dispatch
- GMMA/WGMMA Pipeline -- Phases 85/87, pipeline constraints, serialization
- Ada & Hopper Targets -- SM 89/90 feature gates, WGMMA details
- Turing & Ampere Targets -- SM 75--88 tensor core introduction
- Latency Model -- HMMA/IMMA/DMMA functional unit scheduling
- Register Model -- reg_type 6 (tensor/accumulator, allocator class 6)
- Mercury Encoder -- SASS encoding of MMA instructions
- ELF Output -- EIATTR_SPARSE_MMA_MASK, EIATTR_TCGEN05_*
Synchronization & Warp Intrinsics
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
This page documents how ptxas v13.0.88 handles the lowering of synchronization primitives and warp-level intrinsic operations from PTX source through Ori IR to final SASS instructions. The coverage spans warp vote, shuffle, match, and redux operations; thread-block barriers; memory barriers and fences; warp-level synchronization; asynchronous barriers (mbarrier); and atomic/reduction intrinsic lowering.
| PTX codegen handlers | sub_580E50 (vote), sub_5801D0 (shfl), sub_58A730 (match), sub_567680 (redux), sub_524FB0 (bar), sub_570290 (barrier), sub_500BF0 (bar.arrive), sub_570940 (barrier.arrive), sub_52D590 (bar.red), sub_5889B0 (barrier.red), sub_56A5A0 (bar.warp), sub_4DB410 (membar), sub_6C0D90 (atom/red) |
| Intrinsic IDs | 0x01--0x11 (reduxsync, 17), 0x89--0x1FA (sm70 warp/barrier/wmma, 370) |
| Ori IR opcodes | 96 (barrier), 119 (vote), 130 (HSET2 in ROT13; sync internal / shared-mem LEA), 314 (atom/red) |
| SASS opcodes | VOTE, VOTEU, BAR, BAR_INDEXED, MATCH, MEMBAR, WARPSYNC, BSYNC, BSSY, DEPBAR, ERRBAR, ELECT, NANOSLEEP, ATOM, ATOMG, ATOMS, RED |
| Blackwell additions | FENCE_G, FENCE_S, FENCE_T, CGABAR_ARV, CGABAR_GET, CGABAR_SET, CGABAR_WAIT, CGAERRBAR, SYNCS_BASIC, SYNCS_LD_UNIFM |
| Optimizer phases | 25 (StageAndFence), 26 (OriRemoveRedundantBarriers), 42 (ExpandMbarrier), 71 (OptimizeSyncInstructions), 72 (LateExpandSyncInstructions), 99, 100, 114 |
| Intrinsic detection | sub_A9A410 (sm70 warp-sync prefix matcher), sub_A94440 (mbarrier classifier) |
| Related EIATTR | EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_SYNC_STACK, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Related pages | Synchronization & Barriers (passes), Intrinsic Table |
Instruction Classification
ptxas groups synchronization and warp operations into three Ori IR opcode categories that govern scheduling, dependency tracking, and optimization treatment throughout the pipeline.
| Ori Opcode | Category | Instructions | WAR Resource Mask |
|---|---|---|---|
| 96 | Barrier | bar.sync, bar.red, bar.arrive, barrier.* | 0x200001 (bit 0 + bit 21) |
| 119 | Vote | vote.{all,any,uni,ballot}, match.*, redux.*, activemask, elect.sync | 0x1 (bit 0 only) |
130 (HSET2) | Sync internal | BAR/MEMBAR pseudo-ops during optimization (actual SASS BAR = 61, MEMBAR = 111) | Used by phases 26, 71 for redundancy analysis |
The WAR resource mask at opcode 96 (2097153 = 0x200001) uniquely identifies barrier instructions to the scoreboard tracker. For all other opcodes, the base mask is 1. The scheduler uses this to insert appropriate stall cycles between barrier producers and consumers.
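The mask arithmetic is easy to verify directly. A minimal sketch of decomposing the barrier mask into its set bit positions:

```python
BARRIER_WAR_MASK = 0x200001  # 2097153, the opcode-96 barrier mask

def war_bits(mask):
    """List the set bit positions of a WAR resource mask."""
    return [i for i in range(32) if mask >> i & 1]

# Bit 0 is the base resource every opcode carries; bit 21 is the
# barrier-specific scoreboard resource that makes barriers unique.
assert war_bits(BARRIER_WAR_MASK) == [0, 21]
assert war_bits(0x1) == [0]  # base mask for all other opcodes
```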
Warp Vote
Warp vote operations evaluate a per-lane predicate across the active threads in a warp and return a collective result. In ptxas these are lowered through the vote codegen handler at sub_580E50 (approximately 3.2KB of decompiled code).
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Result | Membermask |
|---|---|---|---|
vote.sync.all.pred p, q, membermask | VOTE.ALL | Predicate: true iff all active lanes have q=true | Explicit 32-bit mask |
vote.sync.any.pred p, q, membermask | VOTE.ANY | Predicate: true iff any active lane has q=true | Explicit 32-bit mask |
vote.sync.uni.pred p, q, membermask | VOTE.UNI | Predicate: true iff q is uniform across active lanes | Explicit 32-bit mask |
vote.sync.ballot.b32 d, q, membermask | VOTE.BALLOT | R: 32-bit ballot mask of lanes where q=true | Explicit 32-bit mask |
activemask.b32 d | CS2R (read SR_LANEMASK_ACTIVE) | R: current active lane mask | Implicit (all active) |
elect.sync d, membermask | ELECT.SYNC | Predicate: true for exactly one active lane | Explicit 32-bit mask (sm75+) |
On sm100+ (Blackwell), VOTEU is available as a uniform-register variant of VOTE for cases where the result feeds only uniform consumers.
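The vote result semantics in the table can be modeled over a 32-lane warp. This is an illustrative simulation of the documented behavior, not ptxas code.

```python
def vote(mode, preds, membermask):
    """Model warp-vote semantics: `preds` is the per-lane predicate,
    `membermask` selects participating lanes. Mirrors the results of
    VOTE.{ALL,ANY,UNI,BALLOT} as described in the table above."""
    lanes = [i for i in range(32) if membermask >> i & 1]
    vals = [bool(preds[i]) for i in lanes]
    if mode == "all":
        return all(vals)
    if mode == "any":
        return any(vals)
    if mode == "uni":
        return len(set(vals)) <= 1      # predicate uniform across lanes
    if mode == "ballot":
        return sum(1 << i for i in lanes if preds[i])
    raise ValueError(mode)

preds = [i % 2 for i in range(32)]       # odd lanes true
assert vote("ballot", preds, 0xF) == 0xA # lanes 1 and 3 set
```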
Codegen Handler Structure -- sub_580E50
The vote handler follows the standard intrinsic codegen pattern: allocates a 50,000-byte scratch buffer via sub_424070, then builds an inline PTX function body through sequential sprintf() calls. The handler dispatches on three architecture tiers:
sub_580E50(ctx, string_table):
instr = *(ctx + 1096)
buf = alloc(50000)
// Common prologue: function header + parameter declarations
sprintf(buf, string_table[309261...]) // ".reg .pred %p; ..."
// Feature check: sub_70B6E0(instr) -- has membermask operand?
if has_membermask:
mask_val = sub_70B780(instr)
sprintf(buf, format_string, mask_val)
// Architecture dispatch:
if (sub_70FA00(instr, 11) || SM > 89) && sub_7081E0(instr) == 1:
// Path 1: sm90+ with sync variant
// Emits: vote.sync.{mode} with full operand set
// Reads operands 0,1,2 via sub_70B8E0, sub_70CA70, sub_709E80, sub_70B4F0
else if SM > 69 && sub_7081E0(instr) == 1:
if sub_70FA00(instr, 10) || sub_709E60(instr) == 1:
// Path 2a: sm70+ with explicit sync
// Reads operands via sub_70B510, sub_70B8E0
else:
// Path 2b: sm70+ standard path
// Checks sub_70FA00(instr, 17) -- has predicate output
// Checks sub_70BA10(instr) -- ballot mode
// SM > 75 branch: different register conventions
// sub_70A910(instr) == 1: uniform result path
// Epilogue: closing braces, return buffer
The accessor sub_70FA00(instr, 0) returns the SM architecture level (e.g., 70, 75, 80, 89, 90). The value at parameter 11 checks for a specific feature flag (cluster/sync extension). sub_7081E0(instr) returns the instruction variant (1 = sync form).
Intrinsic Registration
Vote intrinsics are registered as part of the sm70 block (0x89--0x1FA) with four entries:
| Intrinsic Name | PTX Equivalent |
|---|---|
__cuda_sm70_votesync_all | vote.sync.all.pred |
__cuda_sm70_votesync_any | vote.sync.any.pred |
__cuda_sm70_votesync_ballot | vote.sync.ballot.b32 |
__cuda_sm70_votesync_uni | vote.sync.uni.pred |
Detection of these intrinsics at the IR level is handled by sub_A9A410 (194 bytes binary, 908 bytes decompiled), which performs prefix matching against three string patterns:
// sub_A9A410 -- IntrinsicDetector::isSM70WarpSync
static const char* prefixes[] = {
"__cuda_sm70_warpsync",
"__cuda_sm70_votesync_",
"__cuda_sm70_matchsync_",
};
for (each prefix) {
name = getSymbolName(instr.symbol_id);
if (!strncmp(prefix, name, strlen(prefix)))
return 1;
}
return 0;
This function is called during instruction lowering (subsystem 6 at 0xA9F000--0xAA8000) to identify warp-synchronous call sites that need special handling during barrier optimization.
Warp Shuffle
Warp shuffle moves data between lanes within a warp. The codegen handler sub_5801D0 (approximately 3.3KB decompiled) generates inline PTX for the four shuffle modes.
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Data Movement |
|---|---|---|
shfl.sync.idx.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[b] (direct indexed)
shfl.sync.up.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid - b] (shift up)
shfl.sync.down.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid + b] (shift down)
shfl.sync.bfly.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid ^ b] (butterfly XOR)
The c operand packs the clamp value and width: c = ((width - 1) << 8) | clamp. The optional predicate output p indicates whether the source lane was within bounds.
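The source-lane computation for the four modes can be sketched as follows. This is a simplified model of the table's data-movement column (segment clamping shown for power-of-two widths); `shfl_src_lane` is a hypothetical helper, not ptxas code.

```python
def shfl_src_lane(mode, laneid, b, width=32):
    """Compute the source lane for each shfl mode, keeping the lane's
    own value when the source falls outside its width-aligned segment
    (this out-of-bounds case is what the optional predicate reports).
    Simplified model: width must be a power of two."""
    seg_base = laneid & ~(width - 1)     # start of this lane's segment
    if mode == "idx":
        src = seg_base + (b & (width - 1))
    elif mode == "up":
        src = laneid - b
    elif mode == "down":
        src = laneid + b
    elif mode == "bfly":
        src = laneid ^ b
    else:
        raise ValueError(mode)
    in_bounds = seg_base <= src < seg_base + width
    return (src, True) if in_bounds else (laneid, False)

assert shfl_src_lane("bfly", 5, 1) == (4, True)
assert shfl_src_lane("up", 2, 4) == (2, False)  # out of range: keep own value
```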
Codegen Handler -- sub_5801D0
The shuffle handler is structurally similar to vote. It reads up to 5 operands (source value, lane offset, clamp/width, membermask, and optional predicate output) through the accessor chain:
sub_5801D0(ctx, string_table):
// string_table offsets start at 311376
// Prologue with 4 sprintf calls for parameter declarations
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: reads operands 0..4
// sub_70B960(instr) -- gets shuffle mode enum
// sub_70B450(instr) -- gets data type
// Emits shfl.sync.{mode}.b32 with full 8-operand format
else if SM > 69 && variant == 1:
if feature_10 || sub_709E60(instr) == 1:
// sm70+ explicit sync path
// 7-operand format
else:
// Standard sm70 path
// Checks sub_70BA10 for predicate output
// SM > 75: different operand packing
The shuffle mode is obtained via sub_70B960(instr), returning an enum: 0=idx, 1=up, 2=down, 3=bfly. sub_70B450(instr) returns the data type (b32 for standard shuffles).
Intrinsic Registration
| Intrinsic Name | PTX Equivalent |
|---|---|
__cuda_sm70_shflsync_idx | shfl.sync.idx.b32 |
__cuda_sm70_shflsync_up | shfl.sync.up.b32 |
__cuda_sm70_shflsync_down | shfl.sync.down.b32 |
__cuda_sm70_shflsync_bfly | shfl.sync.bfly.b32 |
Each has a variant with predicate output (e.g., __cuda_sm70_shflsync_idx_pred).
Warp Match
Warp match instructions compare a value across lanes and return which lanes hold matching values. The codegen handler sub_58A730 (approximately 4.5KB decompiled) is the largest in this group.
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Result |
|---|---|---|
match.sync.any.b32 d, a, membermask | MATCH.ANY | d: mask of lanes holding the same value as the calling lane |
match.sync.all.b32 d\|p, a, membermask | MATCH.ALL | d: mask of lanes holding the same value; p: true if ALL active lanes match
match.sync.any.b64 d, a, membermask | MATCH.ANY | 64-bit value comparison variant
match.sync.all.b64 d\|p, a, membermask | MATCH.ALL | 64-bit value comparison variant
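The MATCH.ANY result can be modeled directly from the table's description. An illustrative simulation, not ptxas code:

```python
def match_any(values, membermask, laneid):
    """Model MATCH.ANY for one calling lane: return the mask of
    participating lanes holding the same value as `laneid`."""
    lanes = [i for i in range(32) if membermask >> i & 1]
    return sum(1 << i for i in lanes if values[i] == values[laneid])

vals = [7, 7, 3, 7] + [0] * 28
assert match_any(vals, 0xF, 0) == 0b1011  # lanes 0, 1, 3 share value 7
assert match_any(vals, 0xF, 2) == 0b0100  # lane 2 matches only itself
```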
Codegen Handler -- sub_58A730
The match handler has three architecture tiers and handles both b32 and b64 operand widths:
sub_58A730(ctx, string_table):
// string_table offsets start at 323786
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: 7-operand format
// Reads: sub_70B4F0, sub_709E80, sub_70CA70, sub_70B8E0(0..2)
else if SM > 69 && variant == 1:
// sm70+ path
// Checks feature_10 and sub_709E60 for explicit sync
// sub_70B940(instr) -- has match predicate output?
// sub_70D1F0(instr, 0) -- gets operand by index
// sub_70B950(instr) -- gets comparison mode
Intrinsic Registration
| Intrinsic Name | Variants |
|---|---|
__cuda_sm70_matchsync_any_b32 | Standard |
__cuda_sm70_matchsync_any_b64 | 64-bit |
__cuda_sm70_matchsync_all_b32 | With predicate output |
__cuda_sm70_matchsync_all_b64 | With predicate output, 64-bit |
Detection uses the same sub_A9A410 prefix matcher with "__cuda_sm70_matchsync_".
Warp Redux
Warp redux performs a warp-wide reduction operation and returns the result to all participating lanes. The codegen handler sub_567680 (approximately 2.0KB decompiled) is relatively compact.
PTX to SASS Mapping
| PTX Instruction | SASS Functional Unit | Operation |
|---|---|---|
redux.sync.add.s32 d, a, membermask | redux | Warp-wide signed integer addition |
redux.sync.min.s32 d, a, membermask | redux | Warp-wide signed integer minimum |
redux.sync.max.s32 d, a, membermask | redux | Warp-wide signed integer maximum |
redux.sync.min.u32 d, a, membermask | redux | Warp-wide unsigned integer minimum |
redux.sync.max.u32 d, a, membermask | redux | Warp-wide unsigned integer maximum |
redux.sync.add.u32 d, a, membermask | redux | Warp-wide unsigned integer addition |
redux.sync.and.b32 d, a, membermask | redux | Warp-wide bitwise AND |
redux.sync.or.b32 d, a, membermask | redux | Warp-wide bitwise OR |
redux.sync.xor.b32 d, a, membermask | redux | Warp-wide bitwise XOR |
redux.sync.min.f32.NaN d, a, membermask | redux | Warp-wide float minimum (NaN-propagating) |
redux.sync.max.f32.NaN d, a, membermask | redux | Warp-wide float maximum (NaN-propagating) |
redux.sync.min.f32.abs d, a, membermask | redux | Warp-wide float absolute minimum |
The scheduler tracks redux operations on the dedicated redux functional unit pipeline, alongside adu, alu, cbu, fma2x, fma, half, transcendental, ipa, lsu, schedDisp, tex, ttu, udp, and the various MMA pipelines.
Codegen Handler -- sub_567680
sub_567680(ctx, string_table):
// Prologue: function header
if (SM >= 90 || feature_flag_11) && variant == 1:
// sm90+ path: 8-operand format
// Reads: sub_709E80, sub_70CA70, sub_707530, sub_7087C0,
// sub_707630, sub_70B8E0(0..2)
// sub_707630 -- gets reduction operation type
// sub_7087C0 -- gets data type qualifier
else if SM > 79 && variant == 1:
// sm80+ path
// Two sub-branches:
// - feature_10 || explicit_sync || feature_19: simplified format
// - Standard: full format with sub_707650, sub_7087F0,
// sub_707540, sub_70D1F0
The accessor sub_707630(instr) returns the reduction operation enum (add/min/max/and/or/xor), while sub_7087C0(instr) returns the data type qualifier (s32/u32/b32/f32). Note that redux requires sm80+ in the hardware; the sm70 block in the intrinsic table registers redux-sync intrinsics as software emulation routines.
Redux Sync Intrinsic Registration (IDs 0x01--0x11)
The earliest 17 intrinsic IDs are dedicated to software-emulated redux-sync operations for pre-sm80 targets:
| ID | Intrinsic Name | Operation |
|---|---|---|
0x01 | __cuda_reduxsync_b32_and | Bitwise AND reduction |
0x02 | __cuda_reduxsync_b32_or | Bitwise OR reduction |
0x03 | __cuda_reduxsync_b32_xor | Bitwise XOR reduction |
0x04 | __cuda_reduxsync_f32_max | Float maximum |
0x05 | __cuda_reduxsync_f32_min | Float minimum |
0x06 | __cuda_reduxsync_f32_max_abs | Float absolute maximum |
0x07 | __cuda_reduxsync_f32_min_abs | Float absolute minimum |
0x08 | __cuda_reduxsync_f32_max_NaN | Float maximum (NaN-propagating) |
0x09 | __cuda_reduxsync_f32_min_NaN | Float minimum (NaN-propagating) |
0x0A | __cuda_reduxsync_s32_add | Signed integer sum |
0x0B | __cuda_reduxsync_s32_max | Signed integer maximum |
0x0C | __cuda_reduxsync_s32_min | Signed integer minimum |
0x0D | __cuda_reduxsync_u32_add | Unsigned integer sum |
0x0E | __cuda_reduxsync_u32_max | Unsigned integer maximum |
0x0F | __cuda_reduxsync_u32_min | Unsigned integer minimum |
0x10--0x11 | (additional variants) | Extended operations |
On sm80+, redux PTX instructions lower directly to hardware SASS instructions and bypass the software emulation path.
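For the pre-sm80 emulation path, the classic software pattern is a log2-depth butterfly of shfl.bfly exchanges. The sketch below models that pattern over a list of lane values; it is illustrative of the technique, since the actual __cuda_reduxsync_* bodies are not reproduced here.

```python
def redux_add_emulated(values, width=32):
    """Emulate redux.sync.add with a log2(width)-step butterfly of
    lane exchanges (the shfl.bfly pattern). After the loop, every
    lane holds the full sum. Illustrative model of the software
    fallback, not the registered intrinsic bodies."""
    vals = list(values)
    offset = width // 2
    while offset:
        # Each lane adds the value of its butterfly partner (lane ^ offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals

result = redux_add_emulated(range(32))
assert result[0] == sum(range(32))  # 496, and identical in every lane
```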
Thread-Block Barriers
Thread-block barriers synchronize all threads within a CTA (Cooperative Thread Array). ptxas provides codegen handlers for three PTX barrier families plus their PTX 8.0 cluster-aware equivalents.
PTX Barrier Family
| PTX Handler | Address | Size | PTX Instructions |
|---|---|---|---|
| sub_524FB0 | 0x524FB0 | 1.8KB | bar.sync, bar |
| sub_500BF0 | 0x500BF0 | 1.2KB | bar.arrive |
| sub_52D590 | 0x52D590 | 1.5KB | bar.red.{and,or,popc} |
| sub_570290 | 0x570290 | 2.5KB | barrier.sync, barrier |
| sub_570940 | 0x570940 | -- | barrier.arrive |
| sub_5889B0 | 0x5889B0 | 4.8KB | barrier.red |
| sub_56A5A0 | 0x56A5A0 | 1.9KB | bar.warp.sync |
PTX to SASS Mapping
| PTX Instruction | SASS Opcode | Behavior |
|---|---|---|
| bar.sync N | BAR.SYNC | Block until all CTA threads arrive at barrier N |
| bar.sync N, count | BAR.SYNC | Block until count threads arrive at barrier N |
| bar.arrive N | BAR.ARV | Signal arrival at barrier N without blocking |
| bar.arrive N, count | BAR.ARV | Signal arrival with thread count |
| bar.red.and N, p | BAR.RED.AND | Barrier + warp-level AND reduction on predicate |
| bar.red.or N, p | BAR.RED.OR | Barrier + warp-level OR reduction |
| bar.red.popc N, d | BAR.RED.POPC | Barrier + warp-level population count |
| barrier.cta.sync N | BAR.SYNC | PTX 8.0 cluster-aware CTA barrier |
| barrier.cta.arrive N | BAR.ARV | PTX 8.0 cluster-aware CTA arrive |
| barrier.cta.red N | BAR.RED | PTX 8.0 cluster-aware CTA reduction |
The hardware provides 16 named barriers (indices 0--15). The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used per kernel, which the driver uses to partition the convergence barrier file at launch.
Codegen Handler Details -- sub_524FB0
The bar.sync / bar handler dispatches across three architecture generations:
sub_524FB0(ctx, string_table):
// string_table offsets start at 294205
if feature_flag_11 || SM > 89:
// sm90+ path: 4 format strings for prologue
// Checks sub_70B930(instr) for count variant:
// count != 2: single-operand format (barrier index only)
// count == 2: two-operand format (barrier index + thread count)
else if SM > 69:
if !feature_13 || sub_70B860(instr) > 69:
// sm70+ standard path
// Same count dispatch as sm90
else:
// sm70 with feature_13: extended format
// Reads sub_709400 (barrier scope) and sub_708200 (barrier type)
else: // SM <= 69 (pre-Volta)
// Legacy path
// sub_709400 -- barrier scope identifier
// sub_708200 -- barrier operation type
// count dispatch with 3-operand format strings
The accessor sub_70B930(instr) returns the operand count mode: 1 for single-operand (barrier index only), 2 for two-operand (barrier index + thread count). sub_70C180(instr) returns a special value (-1 for default thread count).
Named Barrier Intrinsic Registration
The sm70 intrinsic block registers barrier operations combinatorially:
| Sub-Category | Count | Combinatorial Source |
|---|---|---|
| __cuda_sm70_barrier_arrive_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_arrive_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_sync_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_sync_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_red_and_{0..15} | 16 | AND reduction per barrier |
| __cuda_sm70_barrier_red_and_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_or_{0..15} | 16 | OR reduction per barrier |
| __cuda_sm70_barrier_red_or_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_popc_{0..15} | 16 | POPC reduction per barrier |
| __cuda_sm70_barrier_red_popc_{0..15}_cnt | 16 | With thread count |
This combinatorial explosion produces 160 intrinsic entries for barriers alone (16 indices x 2 count variants x 5 operation types).
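The combinatorial registration above can be reproduced mechanically. A short sketch for illustration, assuming the naming pattern __cuda_sm70_barrier_{op}_{index}[_cnt] recovered from the strings table is exhaustive (the function name is ours, not the binary's):

```python
# Enumerate the sm70 named-barrier intrinsic names exactly as the
# registration table describes: 5 operations x 16 indices x 2 count variants.
OPS = ["arrive", "sync", "red_and", "red_or", "red_popc"]

def sm70_barrier_intrinsics():
    names = []
    for op in OPS:
        for idx in range(16):            # 16 hardware barrier indices
            for suffix in ("", "_cnt"):  # without / with explicit thread count
                names.append(f"__cuda_sm70_barrier_{op}_{idx}{suffix}")
    return names

names = sm70_barrier_intrinsics()
assert len(names) == 160                 # matches the 160-entry count above
```

This mirrors how the intrinsic table is evidently generated at ptxas build time rather than hand-written.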
Barrier Codegen Pattern -- sub_570290
The barrier (PTX 8.0 form) handler at sub_570290 (2.5KB) is the most complex barrier handler. It adds cluster-awareness for sm90+ and handles the barrier.cta.* variants. The handler has an elaborate multi-level dispatch:
sub_570290(ctx, string_table):
// sm90+ path: additional CTA scope parameters
// sub_709E80(instr) -- barrier scope enum
// sub_70B8E0(instr, 0) -- barrier index operand
// sub_70B8E0(instr, 1) -- thread count operand
// sm70+ with feature_10 or explicit sync:
// Separate code paths for count=1 vs count=2
// sm70+ standard (no explicit sync, no feature_10):
// Six format strings building an elaborate inline PTX body
// Handles sub_70B930 count modes 1 and 2
// Checks sub_70C180 for default-count (-1) vs explicit count
// Generates bar.sync N [, count]; or bar.arrive N [, count];
Memory Barriers and Fences
Memory Barriers -- sub_4DB410
The membar codegen handler at sub_4DB410 (84 lines decompiled) is the smallest sync handler. It generates memory barrier instructions across three scope levels.
| PTX Instruction | SASS Opcode | Scope |
|---|---|---|
| membar.cta | MEMBAR.CTA | Thread block (CTA) |
| membar.gpu | MEMBAR.GPU | Device (GPU) |
| membar.sys | MEMBAR.SYS | System (all agents) |
sub_4DB410(ctx, string_table):
mode = sub_709FE0(instr)
if mode != 2 && mode != 4:
// Standard 3-operand format
// sub_70B710(instr) -- scope qualifier
// sub_709FF0(instr) -- barrier type
// sub_70A530(instr) -- additional qualifier
else: // mode 2 or 4 (cta or sys)
if SM > 49:
// sm50+ uses 3-operand format with scope
else:
// Pre-sm50: 2-operand membar + separate scope annotation
The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS metadata records the byte offset of every membar.sys instruction in the output SASS. This allows the driver to apply software workarounds (WAR patches) at specific membar.sys locations for known hardware errata.
Fence Operations
Fence operations enforce ordering between different memory proxy domains. These are not exposed as separate PTX codegen handlers but are inserted by the compiler's synchronization passes (phases 25 and 72).
| PTX Instruction | SASS Opcode (sm100+) | Purpose |
|---|---|---|
| fence.proxy.alias | (expanded inline) | Orders generic/alias memory accesses |
| fence.proxy.async | (expanded inline) | Orders async copy completion visibility |
| fence.proxy.async.global | (expanded inline) | Global memory async fence |
| fence.sc.cta | FENCE_S | Sequentially-consistent fence, CTA scope |
| fence.sc.gpu | FENCE_G | Sequentially-consistent fence, GPU scope |
| fence.acq_rel.cta | FENCE_T | Acquire-release fence, CTA scope |
On Blackwell (sm100+), dedicated FENCE_G, FENCE_S, and FENCE_T SASS opcodes replace the older pattern of MEMBAR + proxy annotation sequences.
StageAndFence (Phase 25)
The StageAndFence pass (sub_1392E30, 166 bytes entry, sub_1390B30 8,956 bytes core) inserts fence instructions after loop unrolling to re-establish memory ordering correctness. When loop unrolling replicates memory operations that crossed a synchronization boundary in the original loop body, this pass inserts fence.proxy or MEMBAR pseudo-instructions at the boundaries of unrolled iterations.
The core function takes floating-point parameters (double/__m128d), indicating it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay them to overlap with independent computation.
Warp-Level Synchronization
WARPSYNC
WARPSYNC mask synchronizes the threads in a warp identified by the lane mask. This is the fundamental warp-level sync primitive on sm70+ (Volta and later).
| PTX | SASS | Purpose |
|---|---|---|
| bar.warp.sync membermask | WARPSYNC mask | Synchronize warp lanes specified by mask |
The intrinsic __cuda_sm70_warpsync (single entry in the sm70 block) is the simplest warp-sync intrinsic, and is detected by the same sub_A9A410 prefix matcher that handles vote and match.
BSSY / BSYNC (Convergence Barriers)
The BSSY / BSYNC instruction pair replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points:
| SASS Opcode | Purpose |
|---|---|
| BSSY B, target | Push a synchronization barrier; target is the reconvergence point |
| BSYNC B | Pop and wait at the convergence barrier B |
These are not directly exposed as PTX instructions -- they are inserted by the compiler during phase 72 (LateExpandSyncInstructions) and the architecture-specific sync expansion passes (phases 99, 100, 114). The EIATTR_SYNC_STACK metadata records the convergence barrier stack depth.
ELECT
ELECT.SYNC (sm75+) elects a single active lane from the warp, setting a predicate register to true for exactly one thread.
In the SASS opcode table, ELECT appears among the Blackwell-era additions alongside ENDCOLLECTIVE, PREXIT, SETMAXREG, and SETSMEMSIZE. The ELECT opcode is used for leader-based algorithms where only one thread per warp performs a shared operation.
Asynchronous Barriers (MBARRIER)
Introduced with sm90 (Hopper), mbarrier provides hardware-accelerated asynchronous barriers in shared memory. These are critical for async copy (cp.async.bulk), TMA operations, and warpgroup-level synchronization.
MBARRIER Operation Classification
ptxas classifies mbarrier operations through sub_A94440 (MBarrierDetector, 412 bytes binary) and sub_A9A5F0 (MBarrierClassifier). The classifier resolves the mbarrier symbol name (prefix %mbarrier_) and returns an operation type enum:
| Type ID | Suffix | Operation |
|---|---|---|
| 0 | INIT | Initialize barrier object in shared memory |
| 1 | ARRIVE | Signal arrival (non-blocking) |
| 2 | ARRIVE_NOCOMPLETE | Arrive without completing the phase |
| 3 | ARRIVE_DROP | Arrive and decrement expected count |
| 4 | ARRIVE_DROP_NOCOMPLETE | Arrive, drop, no completion |
| 5 | TEST_WAIT | Test if barrier phase is complete |
| 6 | TEST_WAIT_PARITY | Phase-parity-based completion test |
| 7 | CP_ASYNC_ARRIVE | cp.async arrival notification |
| 8 | INVAL | Invalidate barrier |
| 9 | TRY_WAIT | Wait with timeout |
| 10 | TRY_WAIT_PARITY | Phase-parity-based wait with timeout |
| 11 | EXPECT_TX | Set expected transaction byte count |
| 12 | COMPLETE_TX | Mark transaction bytes as complete |
The "trivial" mbarrier operations (types 0--8) are handled inline; "non-trivial" operations (types 9--12, including EXPECT_TX and the extended TRY_WAIT variants) require more complex lowering.
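The trivial/non-trivial split can be expressed directly from the type-ID table. The enum values and names below are taken from the table; the Python function itself is an illustrative sketch of the check that sub_A94440 (isNonTrivialMBarrier) performs:

```python
# Mbarrier operation type IDs as recovered from the classifier (sub_A9A5F0).
MBARRIER_OPS = {
    0: "INIT", 1: "ARRIVE", 2: "ARRIVE_NOCOMPLETE", 3: "ARRIVE_DROP",
    4: "ARRIVE_DROP_NOCOMPLETE", 5: "TEST_WAIT", 6: "TEST_WAIT_PARITY",
    7: "CP_ASYNC_ARRIVE", 8: "INVAL", 9: "TRY_WAIT", 10: "TRY_WAIT_PARITY",
    11: "EXPECT_TX", 12: "COMPLETE_TX",
}

def is_non_trivial(type_id: int) -> bool:
    # Types 0-8 are handled inline; types 9-12 take the complex lowering path.
    return type_id > 8
```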
Mbarrier Symbol Naming
Mbarrier objects are tracked through shared memory symbols following the pattern:
%mbarrier_{basename}_{operation}
The resolver at sub_A9A920 extracts the base name from the full symbol (e.g., %mbarrier_pipeline0_ARRIVE yields base name pipeline0). The format string "%%mbarrier_%s_%s" at sub_AA33C0 is used for symbol construction during mbarrier expansion.
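A sketch of the naming scheme: construction mirrors the recovered sprintf format "%%mbarrier_%s_%s" (the doubled '%' is sprintf escaping for a literal '%'), and base-name extraction mirrors what sub_A9A920 does. The Python helper names are ours:

```python
PREFIX = "%mbarrier_"

def make_mbarrier_symbol(base: str, op: str) -> str:
    # Python's %-formatting treats "%%" the same way sprintf does: literal '%'.
    return "%%mbarrier_%s_%s" % (base, op)

def mbarrier_base_name(symbol: str) -> str:
    # Strip the fixed prefix, then drop the trailing _OPERATION component.
    assert symbol.startswith(PREFIX)
    return symbol[len(PREFIX):].rsplit("_", 1)[0]
```

Note that base names containing underscores survive round-tripping because only the last `_`-separated component is treated as the operation suffix.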
Reserved shared memory regions for TMA pipeline mbarriers:
__nv_reservedSMEM_tmem_allocation_pipeline_mbarrier
__nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity
ExpandMbarrier (Phase 42)
Phase 42 expands mbarrier pseudo-instructions into native barrier sequences through architecture vtable dispatch:
mov rdi, [rsi+0x630] ; rdi = ctx->arch_backend (offset 1584)
mov rax, [rdi] ; rax = arch_backend->vtable
jmp [rax+0x168] ; call vtable[45] -- ExpandMbarrier impl
The expansion is architecture-specific:
- Pre-sm90: No mbarrier pseudo-ops exist; the phase is a no-op
- sm90 (Hopper): Expands to hardware mbarrier instruction sequences, resolves shared memory addresses, inserts fence.proxy for coherence
- sm100+ (Blackwell): Extended semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations
Expansion pattern:
Before (Ori pseudo-ops): After (native SASS):
MBARRIER_INIT %mbar, count MBARRIER.INIT [smem], count
MBARRIER_ARRIVE_EXPECT_TX MBARRIER.ARRIVE.EXPECT_TX [smem], bytes
%mbar, bytes
CP.ASYNC.BULK.TENSOR CP.ASYNC.BULK.TENSOR [dst], [src], [smem]
dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY MBARRIER.TRY_WAIT.PARITY pred, [smem],
%mbar, parity, pred parity
EIATTR Metadata
| EIATTR | Purpose |
|---|---|
EIATTR_NUM_MBARRIERS | Count of mbarrier objects used by the kernel |
EIATTR_MBARRIER_INSTR_OFFSETS | Byte offsets of mbarrier instructions for driver patching |
Blackwell CGA Barriers
On sm100+ (Blackwell), a new class of CGA (Cooperative Grid Array) barriers extends the mbarrier concept to cluster-level synchronization:
| SASS Opcode | Purpose |
|---|---|
| CGABAR_ARV | CGA barrier arrive |
| CGABAR_GET | Query CGA barrier state |
| CGABAR_SET | Set CGA barrier parameters |
| CGABAR_WAIT | Wait on CGA barrier |
| CGAERRBAR | CGA error barrier |
Atomic/Reduction Intrinsic Lowering
The OCG atomic/reduction handler at sub_6C0D90 (19KB, 813 decompiled lines) lowers atom.* and red.* intrinsic calls into Ori IR opcode 314 instructions. Unlike the warp-level sync handlers (which generate inline PTX via sprintf), this function works at the Ori IR level directly: it parses a sub-op parameter array, validates all qualifier combinations, resolves operands to register references, and emits the final instruction through sub_92C240. All diagnostics use error code 7308 and the prefix "Instrinsic - \"%s\"" (the typo is in the binary).
Parameter Array Parsing
The intrinsic name is decomposed by the OCG name parser (sub_6C9BC0) into an integer token array stored at *(ctx+10704). Each token encodes one qualifier dimension of the atomic operation. The handler reads tokens sequentially through a switch-case loop:
| Token | Variable | Semantic | Decoded Value |
|---|---|---|---|
| 0 | memory_order=4 | Memory ordering | Relaxed |
| 1 | domain=12, is_mmio=1 | Memory domain | MMIO (global, mapped) |
| 2 | domain=5 | Memory domain | Shared (_shared) |
| 3 | scope=5 | Visibility scope | .cta |
| 4 | scope=6 | Visibility scope | .gpu |
| 5 | is_noreturn=1 | Return behavior | Reduction (fire-and-forget, no return value) |
| 6 | data_size=2 | Operand width | 64-bit (u64) |
| 7 | data_size=4 | Operand width | Vector (v2/v4) |
| 8 | data_type=12 | Data type | .f32 |
| 9 | data_type=11 | Data type | .s32 |
| 10 | data_type=10 | Data type | .u32 (also .u64 when size=2) |
| 11 | op=0 | Operation | .add |
| 12 | op=1 | Operation | .min |
| 13 | op=2 | Operation | .max |
| 14 | op=3 | Operation | .inc |
| 15 | op=4 | Operation | .dec |
| 16 | op=5 | Operation | .and |
| 17 | op=6 | Operation | .or |
| 18 | op=7 | Operation | .xor |
Tokens not matching any case are silently skipped. If the parameter array is empty (no tokens), all values take defaults: data_type=1 (unspecified), op=-1 (unspecified), data_size=1 (32-bit), and all flags zero.
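The switch-case loop can be sketched as a table-driven decoder. The token-to-field assignments and the defaults come from the table and text above; the dictionary representation and function name are illustrative, not the binary's actual data structures:

```python
# Default qualifier state when the parameter array is empty (per the text above).
DEFAULTS = dict(memory_order=0, domain=0, is_mmio=0, scope=0,
                is_noreturn=0, data_size=1, data_type=1, op=-1)

# token -> list of (field, value) assignments, from the parameter-array table.
TOKEN_TABLE = {
    0: [("memory_order", 4)],
    1: [("domain", 12), ("is_mmio", 1)],
    2: [("domain", 5)],
    3: [("scope", 5)], 4: [("scope", 6)],
    5: [("is_noreturn", 1)],
    6: [("data_size", 2)], 7: [("data_size", 4)],
    8: [("data_type", 12)], 9: [("data_type", 11)], 10: [("data_type", 10)],
    11: [("op", 0)], 12: [("op", 1)], 13: [("op", 2)], 14: [("op", 3)],
    15: [("op", 4)], 16: [("op", 5)], 17: [("op", 6)], 18: [("op", 7)],
}

def decode_tokens(tokens):
    state = dict(DEFAULTS)
    for tok in tokens:
        for field, value in TOKEN_TABLE.get(tok, []):  # unknown tokens are skipped
            state[field] = value
    return state
```

For example, the token sequence [2, 3, 9, 11] decodes to a shared-domain, .cta-scope, s32, .add atomic.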
Modifier Word Encoding
The parsed qualifiers are packed into a single 32-bit modifier word that accompanies the Ori instruction through the pipeline to ISel:
Bit [14:13] = type encoding: 00=u32 01=s32 10=u64/f32 11=invalid
Bit [12:10] = operation: 0=add 1=min 2=max 3=inc 4=dec 5=and 6=or 7=xor
Bit [8] = no-return: 1=reduction (red.*) 0=atomic (atom.*)
Bit [7:5] = memory order: 4=relaxed (only value supported here)
Bit [4:2] = scope: 5=cta 6=gpu
Bit [1:0] = operand flags: bit0=(addr_type==u32) bit1=(data_type==u32)
Top nibble = 0x6 (constant marker: 0x60000000)
The type encoding bits [14:13] are set during cross-validation: s32 with 32-bit width sets 0x2000, u32/f32 with 64-bit width sets 0x4000, and invalid combinations set 0xE000 (with an error).
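The packing can be cross-checked with a short sketch. The bit positions and the 0x60000000 marker are taken verbatim from the layout above; the helper function and its parameter names are ours:

```python
def pack_modifier(type_bits, op, no_return, mem_order, scope, flags):
    # Field positions per the modifier-word layout above.
    word = 0x60000000                 # constant top-nibble marker
    word |= (type_bits & 0x3) << 13   # bits [14:13] type encoding
    word |= (op & 0x7) << 10          # bits [12:10] operation
    word |= (no_return & 1) << 8      # bit  [8]     red.* vs atom.*
    word |= (mem_order & 0x7) << 5    # bits [7:5]   memory order (4 = relaxed)
    word |= (scope & 0x7) << 2        # bits [4:2]   scope (5 = cta, 6 = gpu)
    word |= flags & 0x3               # bits [1:0]   operand-type flags
    return word

# red.shared.add.u32 with relaxed order and .cta scope, both operand flags set:
word = pack_modifier(type_bits=0, op=0, no_return=1, mem_order=4, scope=5, flags=3)
assert word == 0x60000197
```

The s32 cross-validation case (0x2000) corresponds to type_bits=1, i.e. bits [14:13] = 01.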
Validation Chain
The handler enforces a strict 10-phase validation sequence. Each failure emits error 7308 with a descriptive message:
| Phase | Check | Error Message |
|---|---|---|
| 1 | Domain must be _shared (5) or _mmio (12); otherwise global is assumed but errors if explicitly set to something else | "Domain param '_shared' or '_global' required" |
| 2 | Vector type operand count must match expected operand count from data_size | "Vector type does not match number of subops" |
| 3 | Data type must be explicitly set (not default 1) | "Type {u32, s32, u64} not specified" |
| 4 | Vector width (v12>1) requires u32 (10) or f32 (12) type | "Vector supported only for {u32, u64}" |
| 5 | Operation must be explicitly set (not default -1); emitted twice | "Op {add, min, max, inc, dec, and, or, xor} not specified" |
| 6 | Shared-domain reductions only support .add | "Unsupported non _add global memory reduction" |
| 7a | Scope without memory order is deprecated | "Deprecated scope without memory order semantics" (warning) |
| 7b | Memory order requires scope | "Required scope with memory order semantics" |
| 8 | MMIO semantics require global domain | "Domain param '_global' required for mmio semantics" |
| 9 | s32 requires 32-bit; f32+64-bit only with add; otherwise invalid | "Invalid data type / op combination" or "Invalid vector / data type combination" |
| 10 | Each data operand's type field must match declared type; address operand must be u32 (10) or f32 (12) | "Operand type does not match specified type" / "Unexpected instrinsic type (%s) in param (%d)" |
Operand Resolution and Shared Memory Address Materialization
After validation, the handler resolves up to three operand slots:
- Destination/address: Resolved via sub_926370 into a 24-bit register ID, then tagged with 0x50000000 (output register class marker).
- Source data operand: Read from the operand descriptor array at *(ctx+10728). Routing depends on bits [30:28] of the operand word:
  - Value 5 (shared memory pointer): Allocates a temporary register in class 6 via sub_91BF30, then emits an Ori opcode 130 pseudo-instruction via sub_934630 to materialize the shared memory address into a general register. This extra LEA/MOV is necessary because ATOMS requires an explicit register operand, not a symbolic shared-memory reference.
  - Value 1 with !(operand[1] & 0x1000000): Direct register reference (24-bit register ID from the low bits).
  - Otherwise: Full register resolution through sub_91D150 plus operand legalization through sub_7DEFA0.
- Second data operand (MMIO only, v109=1): Same three-way resolution for the second source, reading from operand descriptor offset +12.
The final instruction is emitted as:
sub_92C240(output, ctx, 314, data_type, operand_count, operand_buffer, 1)
where Ori opcode 314 represents the unified ATOM/RED operation.
SASS Opcode Selection
The Ori opcode 314 instruction flows through the optimizer pipeline and reaches ISel (sub_C0EB10), which selects the final SASS opcode based on the domain and no-return flag encoded in the modifier word:
| Modifier Bits | SASS Opcode | ROT13 | Table Entry | PTX Equivalent |
|---|---|---|---|---|
| domain=global, no-return=0 | ATOMG | NGBZT | 103 | atom.global.* |
| domain=shared, no-return=0 | ATOMS | NGBZF | 105 | atom.shared.* |
| domain=generic, no-return=0 | ATOM | NGBZ | 102 | atom.* |
| no-return=1 | RED | ERQ | 104 | red.* |
The operation bits [12:10] further select the SASS sub-opcode qualifier (.ADD, .MIN, .MAX, .INC, .DEC, .AND, .OR, .XOR). The type bits [14:13] determine the data type suffix (.32, .64, .S32, .U32, .F32).
Scope and Memory Order (sm70+)
When scope and memory order are both present, the modifier word carries them through to ISel where they become SASS instruction modifiers:
- Scope .cta (token 3, value 5): Atomic is visible only within the CTA
- Scope .gpu (token 4, value 6): Atomic is visible to all thread blocks on the device
- Memory order relaxed (token 0, value 4): No ordering guarantees beyond atomicity
The handler does not encode acquire, release, or acq_rel memory orders -- these are handled by the separate memory fence/order handler at sub_6C1CF0. The deprecation warning for scope-without-order indicates ptxas is transitioning toward requiring explicit memory order qualifiers for all scoped atomics.
Limitations and Notable Behavior
- No CAS/EXCH tokens: The parameter array parser has no tokens for .cas (compare-and-swap) or .exch (exchange). These operations are either encoded through a different OCG intrinsic or use a distinct sub-op encoding not visible in this function's switch-case.
- Shared-memory restriction: Only atom.shared.add is supported as a reduction. All other shared-memory reduction operations (red.shared.{min,max,and,or,xor}) are rejected.
- MMIO path: Token 1 (domain=MMIO) enables a separate code path that processes two data operands instead of one. This supports the MMIO atomic semantics where both the address and a data value must be explicitly resolved.
- Error message bug: The message "Unsupported non _add global memory reduction" fires for shared-memory non-add reductions despite saying "global". This is likely a copy-paste artifact in the ptxas source.
Synchronization Pipeline Summary
The complete synchronization processing pipeline spans 8 optimizer phases:
| Phase | Name | Category | Purpose |
|---|---|---|---|
| 25 | StageAndFence | Lowering | Insert fences after loop unrolling |
| 26 | OriRemoveRedundantBarriers | Optimization | Dataflow-driven barrier elimination (multi-function) |
| 42 | ExpandMbarrier | Lowering | Expand mbarrier pseudo-ops via arch vtable |
| 71 | OptimizeSyncInstructions | Optimization | Sync instruction redundancy elimination |
| 72 | LateExpandSyncInstructions | Lowering | Final sync pseudo-op expansion to SASS |
| 99 | (Architecture-specific) | Lowering | Post-RA sync expansion |
| 100 | (Architecture-specific) | Lowering | Architecture vtable dispatch for sync |
| 114 | (Post-scheduling) | Fixup | Post-scheduling dependency barrier fixup |
The progression is: early fence insertion (25) -> cross-function barrier elimination (26) -> mbarrier expansion (42) -> optimization within partial-SSA (71) -> final expansion (72) -> post-RA architecture hooks (99, 100) -> post-scheduling fixup (114).
Ori IR Opcode 130 -- Sync Analysis Target
The optimizer phases 26 and 71 identify synchronization instructions by checking for Ori opcode 130 (HSET2 in the ROT13 name table; used as an internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111). For each barrier instruction found:
- Extract the destination operand from field+84
- Resolve the register through the register table at context+88
- Test whether the register's use-count (reg+24) indicates consumption
- If the barrier result is dead (no thread observes it before the next dominating barrier), eliminate the barrier
- At block boundaries, attempt to merge barriers with compatible scopes
Knobs and Gating
| Knob / Flag | Effect |
|---|---|
| Knob 487 | Master gate for sync optimization (checked by phases 25, 71, 72) |
| Knob 358 | Sync mode selector (0=none, 1=conservative, 2=aggressive, 3+=full multi-function) |
| Knob 472 | Barrier liveness analysis mode |
| context+1368 bit 0 | Global sync optimization enable |
| context+1397 bits[6:7] | Architecture-specific sync configuration |
| context+1398 bits[3:4] | Sync expansion mode (architecture-dependent) |
| DisableErrbarAfterMembar | Suppress error barrier insertion after membar |
SASS Opcode Summary Table
Complete mapping of all synchronization and warp SASS opcodes, with their ROT13-encoded internal names as found in the ptxas binary:
| ROT13 (Binary) | SASS Opcode | Table Offset | Category |
|---|---|---|---|
| IBGR | VOTE | 4600 | Warp vote |
| IBGRH | VOTEU | -- | Uniform warp vote (sm100+) |
| ONE | BAR | 5160 | Thread-block barrier |
| ONE_VAQRKRQ | BAR_INDEXED | -- | Indexed barrier variant |
| REEONE | ERRBAR | 4184 | Error barrier |
| QRCONE | DEPBAR | -- | Dependency barrier (scoreboard) |
| ZNGPU | MATCH | -- | Warp match |
| ZRZONE | MEMBAR | -- | Memory barrier |
| JNECFLAP | WARPSYNC | -- | Warp synchronize |
| OFLAP | BSYNC | -- | Convergence barrier sync |
| OFFL | BSSY | -- | Convergence barrier set |
| E2O | R2B | 5128 | Register to barrier transfer |
| -- | ELECT | -- | Warp lane election (sm75+) |
| -- | NANOSLEEP | -- | Nanosleep hint |
| -- | FENCE_G | -- | Global fence (sm100+) |
| -- | FENCE_S | -- | Shared fence (sm100+) |
| -- | FENCE_T | -- | Texture fence (sm100+) |
| -- | CGABAR_ARV | -- | CGA barrier arrive (sm100+) |
| -- | CGABAR_GET | -- | CGA barrier query (sm100+) |
| -- | CGABAR_SET | -- | CGA barrier set (sm100+) |
| -- | CGABAR_WAIT | -- | CGA barrier wait (sm100+) |
| -- | CGAERRBAR | -- | CGA error barrier (sm100+) |
| -- | SYNCS_BASIC | -- | Basic sync (sm100+) |
| -- | SYNCS_LD_UNIFM | -- | Sync with uniform load (sm100+) |
| NGBZ | ATOM | 102 | Atomic (generic address space) |
| NGBZT | ATOMG | 103 | Atomic (global memory) |
| ERQ | RED | 104 | Reduction (fire-and-forget) |
| NGBZF | ATOMS | 105 | Atomic (shared memory) |
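The internal names are plain ROT13 over the ASCII letters; digits and underscores pass through unchanged, which is why E2O decodes to R2B and ONE_VAQRKRQ to BAR_INDEXED. A minimal decoder:

```python
import string

# ROT13 over uppercase ASCII only; all other characters are left as-is.
_UPPER = string.ascii_uppercase
_ROT13 = str.maketrans(_UPPER, _UPPER[13:] + _UPPER[:13])

def decode_opcode_name(name: str) -> str:
    return name.translate(_ROT13)

assert decode_opcode_name("IBGR") == "VOTE"
assert decode_opcode_name("JNECFLAP") == "WARPSYNC"
assert decode_opcode_name("NGBZF") == "ATOMS"
```

ROT13 is an involution, so the same function also re-encodes a SASS mnemonic back into its binary-internal form.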
Key Function Reference
| Address | Identity | Size | Purpose |
|---|---|---|---|
| sub_580E50 | VoteCodegen | ~3.2KB | PTX vote.* to inline PTX body |
| sub_5801D0 | ShflCodegen | ~3.3KB | PTX shfl.* to inline PTX body |
| sub_58A730 | MatchCodegen | ~4.5KB | PTX match.* to inline PTX body |
| sub_567680 | ReduxCodegen | ~2.0KB | PTX redux.* to inline PTX body |
| sub_524FB0 | BarSyncCodegen | ~1.8KB | PTX bar.sync / bar |
| sub_570290 | BarrierCodegen | ~2.5KB | PTX barrier.* (PTX 8.0) |
| sub_500BF0 | BarArriveCodegen | ~1.2KB | PTX bar.arrive |
| sub_570940 | BarrierArriveCodegen | -- | PTX barrier.arrive |
| sub_52D590 | BarRedCodegen | ~1.5KB | PTX bar.red.{and,or,popc} |
| sub_5889B0 | BarrierRedCodegen | ~4.8KB | PTX barrier.red |
| sub_56A5A0 | BarWarpCodegen | ~1.9KB | PTX bar.warp.sync |
| sub_4DB410 | MembarCodegen | ~0.8KB | PTX membar.* |
| sub_A9A410 | isSM70WarpSync | 194B | Intrinsic prefix detection |
| sub_A94440 | isNonTrivialMBarrier | 412B | Mbarrier operation classifier |
| sub_A9A5F0 | classifyMBarrier | -- | Mbarrier type enum resolver |
| sub_A9A920 | resolveMBarrierBaseName | -- | Extract mbarrier base name from symbol |
| sub_AA33C0 | constructMBarrierSymbol | -- | Build %%mbarrier_%s_%s symbol |
| sub_1392E30 | StageAndFence (phase 25) | 166B entry | Post-unroll fence insertion |
| sub_1390B30 | StageAndFence core | 8,956B | Main fence insertion logic |
| sub_790A40 | RemoveRedundantBarriers | 2,288B | Cross-function barrier elimination |
| sub_90A340 | OptimizeSyncInstructions | 1,670B | Sync instruction optimization |
| sub_1381DA0 | LateExpandSync | 1,517B | Final sync expansion |
| sub_6C0D90 | LowerAtomicRedIntrinsic | ~19KB | OCG atom.*/red.* to Ori opcode 314 |
| sub_C0EB10 | InstructionSelector | 185KB | Main ISel (handles all sync opcodes) |
Custom ELF Emitter
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas builds its ELF/cubin output without libelf or any external ELF library. The entire ELF construction pipeline is a custom implementation spread across approximately 20 functions in the 0x1C99--0x1CD1 address range, totaling roughly 300 KB of binary code. At the center is a 672-byte in-memory object called the "ELF world" (ELFW), which owns all sections, symbols, and string tables. The emitter writes standard ELF headers with NVIDIA extensions: EM_CUDA (0xBE / 190) as the machine type, NVIDIA-specific section types (SHT_CUDA_INFO = 0x70000064), and CUDA-specific ELF flags encoding the SM architecture version. The design handles both 32-bit and 64-bit ELF classes, with a standard class byte at e_ident[4] (1 = ELFCLASS32, 2 = ELFCLASS64) and a non-standard OSABI byte at e_ident[7] set to '3' (0x33) for 32-bit cubins or 'A' (0x41) for 64-bit ones. Finalization is a single-pass algorithm that orders sections into 8 priority buckets, assigns file offsets with alignment, and handles the ELF extended section index mechanism (SHN_XINDEX) when the section count exceeds 65,280.
| ELFW constructor | sub_1CB53A0 (3,480 bytes, 672-byte object) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 callers) |
| Text section creator | sub_1CB42D0 (SHT_PROGBITS, SHF_ALLOC|SHF_EXECINSTR) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes, ~1,700 lines) |
| Master ELF emitter | sub_1C9F280 (15,263 bytes, 97 KB decompiled) |
| Section layout calculator | sub_1C9DC60 (5,663 bytes, 29 KB decompiled) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes) |
| Section index remap | sub_1C99BB0 (4,900 bytes) |
| ELF structure dumper | sub_1CB91C0 (2,668 bytes) |
| File serializer | sub_1CD13A0 (2,541 bytes) |
| Cubin entry point | sub_612DE0 (47 KB, called from sub_446240) |
| ELF machine type | EM_CUDA = 0xBE (190) |
| CUDA section type | SHT_CUDA_INFO = 0x70000064 |
| ELF magic | 0x464C457F (\x7fELF) |
| Memory pool | "elfw memory space" (4,096-byte initial) |
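The single-pass finalization described above (order sections into priority buckets, then assign aligned file offsets) can be sketched as follows. Only the bucket-then-align structure is recovered from sub_1C9DC60; the bucket keys, section names, and the 64-byte header base in this sketch are illustrative assumptions:

```python
def align_up(offset: int, alignment: int) -> int:
    # Round offset up to the next multiple of alignment (0 or 1 means none).
    return (offset + alignment - 1) & ~(alignment - 1) if alignment > 1 else offset

def layout_sections(sections, base=64):
    # sections: list of dicts with 'bucket' (0-7 priority), 'size', 'align'.
    offset = base  # file offset after the ELF header
    for sec in sorted(sections, key=lambda s: s["bucket"]):
        sec["offset"] = align_up(offset, sec["align"])
        offset = sec["offset"] + sec["size"]
    return offset  # total payload size before the section header table

secs = [
    {"name": ".nv.info", "bucket": 2, "size": 100, "align": 4},
    {"name": ".text.k0", "bucket": 1, "size": 384, "align": 128},
]
layout_sections(secs)
assert secs[1]["offset"] == 128        # .text.k0 placed first, aligned to 128
assert secs[0]["offset"] == 128 + 384  # .nv.info follows immediately
```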
Architecture
sub_446240 (real main -- compilation driver)
|
v
sub_612DE0 (47KB -- cubin generation entry)
| Parses: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
| Sets up setjmp/longjmp error recovery
| Handles recursive self-call for nested finalization
|
+-- sub_1CB53A0 -------- ELFW constructor (672-byte object)
| | Creates "elfw memory space" pool (4096 initial)
| | Writes ELF magic 0x464C457F into header
| | Sets e_machine = 0xBE (EM_CUDA)
| | Sets EI_CLASS = 1 (32-bit) or 2 (64-bit); OSABI byte = '3' or 'A'
| | Creates 7 standard sections:
| | .shstrtab, .strtab, .symtab, .symtab_shndx,
| | .note.nv.tkinfo, .note.nv.cuinfo, .nv.uft.entry
| v
+-- sub_1CB3570 x N ---- Section creator (44 call sites)
| | Creates section: name, type, flags, link, info, align, entsize
| | Auto-creates .rela/.rel companion for executable sections
| v
+-- sub_1CB42D0 --------- .text.<funcname> section creator
| | SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR
| | One section per kernel/function
| v
+-- sub_1CB68D0 --------- Symbol table builder (~1700 lines)
| | Iterates internal symbol list
| | Filters deleted symbols
| | Handles __cuda_syscall special symbol
| | Manages SHN_XINDEX overflow (>= SHN_LORESERVE)
| | Builds .symtab_shndx extended index table
| v
+-- sub_1CB2CA0 --------- Symbol fixup
| | Renumbers symbols after section deletion
| | Creates missing section symbols
| v
+-- sub_1C99BB0 --------- Section index remap
| | Reindexes sections after dead elimination
| | Remaps .symtab_shndx / .nv.merc.symtab_shndx
| v
+-- sub_1C9DC60 --------- Section layout calculator (29KB)
| | Computes section offsets and sizes
| | Skips .nv.constant0, .nv.reservedSmem
| | Handles .debug_line special padding
| v
+-- sub_1C9F280 --------- Master ELF emitter (97KB -- largest function)
| | Copies ELF header (64 bytes via SSE loadu)
| | Iterates sections via sub_1CB9FF0 / sub_1CB9C40
| | Skips virtual sections (flag & 4)
| | Patches ELF flags (SM version, ELFCLASS32/64)
| | Handles program headers
| | Embeds Mercury capsule if capmerc mode
| | Processes debug sections
| v
+-- sub_1CD13A0 --------- File serializer
| Iterates sections, writes with alignment padding
| Validates sizes: "section size mismatch"
| Handles 32-bit and 64-bit ELF formats
v
OUTPUT: .cubin / .o file on disk
ELFW Object -- sub_1CB53A0
The ELFW constructor allocates and initializes a 672-byte object that serves as the central data structure for the entire ELF construction pipeline. Every section, symbol, and string table lives under this object. The constructor is called exactly once per compilation unit.
Construction Sequence
// sub_1CB53A0 -- ELFW constructor (simplified)
void* elfw_init(int elf_class, int sm_version) {
// 1. Allocate 672-byte ELFW object from pool allocator
void* elfw = sub_424070(672);
// 2. Create dedicated memory pool
sub_4258D0("elfw memory space", 0, 4096);
// 3. Write ELF header
*(uint32_t*)(elfw + 0) = 0x464C457F; // e_ident[EI_MAG0..3] = "\x7fELF"
*(uint8_t*)(elfw + 4) = (elf_class == 64) ? 2 : 1; // EI_CLASS = ELFCLASS64/32
*(uint8_t*)(elfw + 7) = (elf_class == 64) ? 'A' : '3'; // EI_OSABI (non-standard)
*(uint16_t*)(elfw + 18) = 0xBE; // e_machine = EM_CUDA (190)
// 4. Initialize section/symbol/string containers
init_section_table(elfw);
init_symbol_table(elfw);
init_string_table(elfw);
// 5. Create 7 mandatory sections
add_section(elfw, ".shstrtab", SHT_STRTAB, 0, ...);
add_section(elfw, ".strtab", SHT_STRTAB, 0, ...);
add_section(elfw, ".symtab", SHT_SYMTAB, 0, ...);
add_section(elfw, ".symtab_shndx", SHT_SYMTAB_SHNDX, 0, ...);
add_section(elfw, ".note.nv.tkinfo", SHT_NOTE, 0, ...);
add_section(elfw, ".note.nv.cuinfo", SHT_NOTE, 0, ...);
add_section(elfw, ".nv.uft.entry", SHT_PROGBITS, 0, ...);
return elfw;
}
The ELFW object stores:
- The ELF header (first 64 bytes for 64-bit class, 52 for 32-bit)
- A section table (dynamic array of section descriptors)
- A symbol table (dynamic array of symbol entries)
- String tables for section names (
.shstrtab) and symbol names (.strtab) - Metadata for relocation processing and section ordering
ELFW Object Layout (672 bytes)
The 672-byte ELFW object divides into 13 regions. Offsets 0--63 overlay a standard ELF header (whose internal layout depends on ELF class). All pointer-sized fields are 8 bytes (the allocator returns 8-byte-aligned memory). The v17 variable in the decompilation is a uint64_t*, so v17[N] = byte offset N * 8.
Region 1 -- ELF Header Embed (bytes 0--63)
The ELF header is stored inline at the start of the ELFW object. Field positions within it vary by class (32-bit vs 64-bit), matching the standard Elf32_Ehdr / Elf64_Ehdr layout, except that EI_CLASS and EI_OSABI use non-standard CUDA values.
| Offset | Size | Name | Evidence |
|---|---|---|---|
| 0 | 4B | e_ident[EI_MAG0..3] | *(_DWORD*)v17 = 0x464C457F |
| 4 | 1B | e_ident[EI_CLASS] | (v11 != 0) + 1: 1 = ELFCLASS32, 2 = ELFCLASS64 |
| 5 | 1B | e_ident[EI_DATA] | Hardcoded 1 (little-endian) |
| 6 | 1B | e_ident[EI_VERSION] | Hardcoded 1 (EV_CURRENT) |
| 7 | 1B | e_ident[EI_OSABI] | 0x41 ('A') for 64-bit cubin, 0x33 ('3') for 32-bit |
| 8 | 1B | e_ident[EI_ABIVERSION] | Constructor parameter a3 |
| 16 | 2B | e_type | Constructor parameter a1 (cast to uint16) |
| 18 | 2B | e_machine | Hardcoded 0x00BE (EM_CUDA = 190) |
| 62 | 2B | e_shstrndx | *(_WORD*)(v17 + 31) -- set to .shstrtab section index |
For 32-bit class (EI_CLASS = 1):
| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 20 | 4B | e_version | *(_DWORD*)(v17 + 5) |
| 28 | 4B | e_phoff | *(_DWORD*)(a1 + 28) |
| 32 | 4B | e_shoff | *(_DWORD*)(a1 + 32) |
| 36 | 4B | e_flags | *(_DWORD*)(a1 + 36) -- dumper prints "flags=%x" |
| 44 | 2B | e_phnum | *(uint16*)(a1 + 44) -- dumper prints "phnum" |
| 48 | 2B | e_shnum | *(uint16*)(a1 + 48) -- dumper prints "shnum" |
For 64-bit class (EI_CLASS = 2):
| Offset | Size | Name | Dumper accessor |
|---|---|---|---|
| 32 | 8B | e_phoff | *(_QWORD*)(a1 + 32) -- dumper prints "phoff=%llx" |
| 40 | 8B | e_shoff | *(_QWORD*)(a1 + 40) -- dumper prints "shoff=%llx" |
| 48 | 4B | e_flags | *(_DWORD*)(a1 + 48) -- dumper prints "flags=%x" |
| 56 | 2B | e_phnum | *(uint16*)(a1 + 56) -- dumper prints "phnum" |
| 60 | 2B | e_shnum | *(uint16*)(a1 + 60) -- dumper prints "shnum" |
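For the 64-bit class, the dumper accessors above line up with the standard Elf64_Ehdr layout. A sketch with compile-time offset checks (field names are the standard ELF ones; this is an illustration, not recovered source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes 0-63 of the ELFW object for EI_CLASS = 2, matching the
   dumper accessors: e_phoff at 32, e_shoff at 40, e_flags at 48,
   e_phnum at 56, e_shnum at 60, e_shstrndx at 62. */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type;       /* byte 16: constructor parameter a1 */
    uint16_t e_machine;    /* byte 18: 0x00BE (EM_CUDA) */
    uint32_t e_version;    /* byte 20 */
    uint64_t e_entry;      /* byte 24 */
    uint64_t e_phoff;      /* byte 32 */
    uint64_t e_shoff;      /* byte 40 */
    uint32_t e_flags;      /* byte 48 */
    uint16_t e_ehsize;     /* byte 52 */
    uint16_t e_phentsize;  /* byte 54 */
    uint16_t e_phnum;      /* byte 56 */
    uint16_t e_shentsize;  /* byte 58 */
    uint16_t e_shnum;      /* byte 60 */
    uint16_t e_shstrndx;   /* byte 62 */
} ElfwHeader64;
```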
Region 2 -- Metadata and Flags (bytes 64--107)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 64 | 1B | debugMode | a8 parameter: deviceDebug |
| 68 | 4B | compilationFlags | rawOptions & 0x70000 -- preserved option bits 16--18 |
| 72 | 4B | smVersion | a4 parameter: SM architecture number (e.g., 100 for Blackwell) |
| 76 | 4B | rawOptions | Full options bitmask a9, possibly OR'd with 0x80000 for relocatable |
| 80 | 1B | lineInfoMode | a6 parameter: lineInfo |
| 82 | 1B | is32bit | Dumper gate: controls 32-bit vs 64-bit format in sub_1CB91C0 |
| 83 | 1B | hasSymbolRemap | Set to *(WORD*)(v17 + 42) != 0 |
| 84 | 1B | flag_relocatable | rawOptions & 1 |
| 85 | 1B | flag_executable | (rawOptions & 2) != 0 |
| 86 | 1B | flag_PIC | (rawOptions & 0x200) != 0 -- position-independent code |
| 87 | 1B | flag_bit2 | (rawOptions & 4) != 0 |
| 88 | 1B | flag_bit3 | (rawOptions & 8) != 0 |
| 89 | 1B | flag_relocOrBit4 | (rawOptions >> 4) & 1, forced to 1 if relocatable mode |
| 90 | 1B | flag_bit5 | (rawOptions & 0x20) != 0 |
| 91 | 1B | flag_bit14 | (rawOptions & 0x4000) != 0 |
| 92 | 1B | flag_bit6 | (rawOptions & 0x40) != 0 |
| 93 | 1B | flag_byte1_bit0 | BYTE1(rawOptions) & 1 -- bit 8 of options |
| 94 | 1B | flag_archGuard | (a5 > 0x45) & (rawOptions >> 7) -- arch-gated feature |
| 96 | 1B | flag_bit11 | (rawOptions & 0x800) != 0 |
| 99 | 1B | flag_notBit12 | ((rawOptions >> 12) ^ 1) & 1 -- inverted bit 12 |
| 100 | 1B | flag_bit13 | (rawOptions & 0x2000) != 0 |
| 101 | 1B | highClass | (a9 & 0x8000) != 0 -- selects 64-bit header variant with wider ELF fields |
Region 3 -- Inline String Tables (bytes 108--171)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 108 | 32B | shstrtab | Section header string table, initialized via sub_1CB0530(v17 + 108, 1000) |
| 140 | 32B | strtab | Symbol name string table, initialized via sub_1CB0530(v17 + 140, 2000) |
These are 32-byte inline structures (not heap pointers). sub_1CB0530 initializes them with the given initial capacity (1000 and 2000 bytes respectively). The .shstrtab is also referenced by sub_1CA6650 during .note.nv.cuinfo attribute injection.
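A minimal sketch of the string-table behavior these structures must provide (the fixed buffer and function names are illustrative, not the recovered sub_1CB0530 layout, which is certainly a growable variant):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative ELF string table: names are appended NUL-terminated
   and the byte offset of each name is returned for use as sh_name
   or st_name. Index 0 is the empty string, per ELF convention. */
typedef struct {
    char   data[1000];  /* initial capacity, as passed to sub_1CB0530 */
    size_t used;
} StrTab;

static void strtab_init(StrTab *t) {
    t->data[0] = '\0';
    t->used = 1;
}

static size_t strtab_add(StrTab *t, const char *name) {
    size_t off = t->used;
    size_t len = strlen(name) + 1;
    memcpy(t->data + off, name, len);
    t->used += len;
    return off;  /* stored in the section/symbol header */
}
```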
Region 4 -- Section Index Cache (bytes 196--215)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 200 | 2B | strtabSecIdx | .strtab section index -- *(_WORD*)(v17 + 101) = v54 |
| 202 | 2B | symtabSecIdx | .symtab section index -- *(_WORD*)(v17 + 102) = v56 |
| 204 | 2B | xindexSecIdx | .symtab_shndx section index -- *(_WORD*)(v17 + 103) |
| 206 | 2B | cuinfoSecIdx | .note.nv.cuinfo section index -- *(_WORD*)(v17 + 104) |
| 208 | 2B | tkinfoSecIdx | .note.nv.tkinfo section index -- *(_WORD*)(v17 + 105) |
These cached indices avoid repeated linear scans of the section table when cross-referencing sections (e.g., .symtab's sh_link must point to .strtab).
Region 5 -- Sorted Maps and Counters (bytes 288--327)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 288 | 8B | sortedMap_A | Red-black tree via sub_425CA0, initial capacity 512 |
| 296 | 8B | sortedMap_B | Red-black tree via sub_425CA0, initial capacity 512 |
| 304 | 4B | mapCount | Counter for sorted maps, cleared to 0 |
| 312 | 8B | countPair | Packed 0x100000000 = high DWORD 1, low DWORD 0 |
| 320 | 4B | activeFlag | Set to 1 during initialization |
Region 6 -- Section and Symbol Containers (bytes 344--419)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 344 | 8B | sectionList_A | Indexed vector (cap=64) via sub_1CD2F90. Dumper: sub_1CD3060(*(a1+344)) returns symbol count |
| 352 | 8B | sectionList_B | Indexed vector (cap=64). Dumper: sub_1CD3060(*(a1+352)) returns secondary count |
| 360 | 8B | sectionList_C | Indexed vector (cap=64). Dumper iterates *(a1+360) for section dump loop |
| 368 | 8B | secIndexMap | Virtual-to-real section index map. Dumper: *(a1+368) + 4*v11 for reverse lookup |
| 376 | 8B | relocList | Linked list head for relocations. Dumper: for (k = *(a1+376); k; k = *k) |
| 392 | 8B | nvinfoList | Linked list head for .nv.info entries. Dumper: for (j = *(a1+392); j; j = *j) |
| 408 | 8B | auxVector | Indexed vector (cap=32) via sub_1CD2F90 |
| 416 | 4B | auxCount | Counter for auxVector, cleared to 0 |
sub_1CD2F90 creates an indexed vector (growable array with count/capacity tracking). sub_1CD3060 returns the element count; sub_1CD31F0 returns the element at the current iteration index.
Region 7 -- Deletion Remap Tables (bytes 456--479)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 456 | 8B | symDeleteMap | Symbol index remap table (positive indices). Dumper: *(a1+456) + 4*idx |
| 464 | 8B | symDeleteMapNeg | Symbol index remap table (negative indices). Dumper: *(a1+464) + 4*(-idx) |
| 472 | 8B | secDeleteMap | Section index remap table. Dumper: *(a1+472) + 4*idx |
After dead code elimination deletes sections and symbols, these tables map old indices to new indices. The negative-index variant handles symbols stored with inverted sign conventions (a ptxas-internal encoding for unresolved forward references).
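A sketch of how such a pair of remap tables would be consulted (function and table names are hypothetical; only the `4*idx` / `4*(-idx)` indexing comes from the dumper evidence):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lookup over the two remap tables. Positive old
   indices go through symDeleteMap (*(a1+456) + 4*idx); negative
   ones -- the unresolved-forward-reference encoding -- go through
   symDeleteMapNeg (*(a1+464) + 4*(-idx)). */
static int32_t remap_symbol_index(const int32_t *pos_map,
                                  const int32_t *neg_map,
                                  int32_t old_idx) {
    return (old_idx >= 0) ? pos_map[old_idx] : neg_map[-old_idx];
}
```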
Region 8 -- Architecture State (bytes 488--495)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 488 | 8B | archState | Architecture descriptor pointer. Initialized via sub_1CD04F0 (relocatable) or sub_1CCEEE0 (non-relocatable). Fatal error "couldn't initialize arch state" on failure |
Region 9 -- Name Sets and Input Tracking (bytes 496--519)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 496 | 8B | sectionNameSet | Sorted set of well-known section name strings. Populated from static table off_2403A60 (22 entries ending at dword_2403B70) |
| 512 | 8B | inputFileList | Indexed vector (cap=8). First entry is a 16-byte descriptor: {ptr="<input>", arch_version, ...} |
Region 10 -- Hash Maps (bytes 520--567)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 520 | 8B | hashMap_A | Hash map via sub_42D150, initial capacity 16 |
| 528 | 8B | hashMap_B | Hash map (cap=16) |
| 536 | 8B | hashMap_C | Hash map (cap=16) |
| 544 | 8B | hashMap_D | Hash map (cap=16) |
| 552 | 8B | hashMap_E | Hash map (cap=16) |
| 560 | 8B | hashMap_F | Hash map (cap=16) |
Six hash maps initialized identically with sub_42D150(sub_427630, sub_4277B0, 0x10). The two function pointers are the hash function and equality comparator. These maps serve section/symbol lookups during the construction pipeline. The specific role of each map (by-name, by-type, etc.) requires tracing callers of the hash map accessors.
Region 11 -- Extended Index and Miscellaneous (bytes 576--607)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 576 | 8B | smallSortedSet | Sorted set via sub_425CA0, element size 8 |
| 592 | 8B | symtabShndxVec | .symtab_shndx data vector. Dumper: sub_1CD31F0(*(a1+592)) for SHN_XINDEX resolution |
| 600 | 8B | mercSymtabShndx | .nv.merc.symtab_shndx data vector. Dumper: *(a1+600) for Mercury SHN_XINDEX |
Region 12 -- Memory Pool and Tail (bytes 608--671)
| Offset | Size | Name | Purpose |
|---|---|---|---|
| 608 | 8B | memoryPool | "elfw memory space" pool pointer. Only set when (a9 & 0x400) != 0 |
| 616 | 8B | memoryPoolCursor | Pool allocation cursor |
| 624 | 4B | elfFormatVersion | sub_1C97990() return value -- ELF format version from global config |
| 664 | 8B | tailSentinel | v17[83] = 0 -- zeroed during init, marks end of object |
Visual Layout
ELFW Object (672 bytes = 0x2A0)
+---------+------+--------------------------------------------------+
| 0x000 | 64B | ELF Header Embed (Elf32_Ehdr or Elf64_Ehdr) |
+---------+------+--------------------------------------------------+
| 0x040 | 44B | Metadata + Option Flags (debugMode, smVersion, |
| | | rawOptions, 16 boolean flags) |
+---------+------+--------------------------------------------------+
| 0x06C | 64B | Inline String Tables |
| | | +0x06C: shstrtab (32B, cap=1000) |
| | | +0x08C: strtab (32B, cap=2000) |
+---------+------+--------------------------------------------------+
| 0x0AC | 36B | Section Index Cache + Padding |
| | | 5 x uint16 section indices (.strtab, .symtab, |
| | | .symtab_shndx, .cuinfo, .tkinfo) |
+---------+------+--------------------------------------------------+
| 0x120 | 40B | Sorted Maps + Counters |
+---------+------+--------------------------------------------------+
| 0x158 | 76B | Section/Symbol Containers |
| | | 3 indexed vectors (sections), index map, |
| | | relocation list, nvinfo list, aux vector |
+---------+------+--------------------------------------------------+
| 0x1C8 | 24B | Deletion Remap Tables (sym+, sym-, sec) |
+---------+------+--------------------------------------------------+
| 0x1E8 | 8B | Architecture State Pointer |
+---------+------+--------------------------------------------------+
| 0x1F0 | 24B | Name Sets + Input Tracking |
+---------+------+--------------------------------------------------+
| 0x208 | 48B | Six Hash Maps (16-entry initial) |
+---------+------+--------------------------------------------------+
| 0x240 | 32B | Extended Index Vectors + Small Sorted Set |
+---------+------+--------------------------------------------------+
| 0x260 | 64B | Memory Pool + Format Version + Tail |
+---------+------+--------------------------------------------------+
ELF Class and OSABI Selection
Two e_ident bytes control the output format. The class byte at offset 4 carries the standard values (1 = ELFCLASS32, 2 = ELFCLASS64), as shown in the Region 1 evidence table. The CUDA-specific marker lives in the OSABI byte at offset 7:
| EI_OSABI byte | Standard ELF | ptxas meaning |
|---|---|---|
| '3' (0x33) | n/a | 32-bit CUDA ELF |
| 'A' (0x41) | n/a | 64-bit CUDA ELF |
Standard ELF reserves this byte for the OS/ABI identifier. The values '3' and 'A' are a CUDA-specific convention that identifies the binary as a cubin rather than a generic ELF; the CUDA driver recognizes them during cubin loading.
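Putting the identification bytes together, a hypothetical checker for a 64-bit cubin (per the Region 1 evidence table: standard class byte at offset 4, CUDA marker in the OSABI byte at offset 7):

```c
#include <assert.h>
#include <stdint.h>

/* is_cubin64 is an illustrative checker, not recovered code. */
static int is_cubin64(const uint8_t *e_ident) {
    return e_ident[0] == 0x7F && e_ident[1] == 'E'
        && e_ident[2] == 'L'  && e_ident[3] == 'F'
        && e_ident[4] == 2        /* ELFCLASS64 */
        && e_ident[7] == 0x41;    /* 'A' -- 64-bit CUDA marker */
}
```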
Section Creator -- sub_1CB3570
The generic section creation function, called from 44 sites throughout the ELF construction pipeline. It accepts the full set of ELF section header parameters and optionally creates a companion relocation section.
// sub_1CB3570 -- add section to ELFW (simplified)
int add_section(elfw_t* elfw, const char* name, uint32_t type, uint64_t flags,
                uint32_t link, uint32_t info, uint64_t align, uint64_t entsize) {
// 1. Add name to .shstrtab, get string table offset
int name_idx = strtab_add(elfw->shstrtab, name);
// 2. Allocate section descriptor, fill ELF section header fields
section_t* sec = alloc_section(elfw);
sec->sh_name = name_idx;
sec->sh_type = type;
sec->sh_flags = flags;
sec->sh_link = link;
sec->sh_info = info;
sec->sh_addralign = align;
sec->sh_entsize = entsize;
// 3. For executable sections, auto-create relocation companion
if (flags & SHF_EXECINSTR) {
char rela_name[256];
snprintf(rela_name, sizeof(rela_name), ".rela%s", name);
// -- or ".rel%s" depending on ELF class --
section_t* rela = alloc_section(elfw);
rela->sh_type = SHT_RELA; // or SHT_REL for the 32-bit class
rela->sh_link = symtab_index; // cached .symtab section index (ELFW offset 202)
rela->sh_info = sec->index; // section the relocations apply to
}
return sec->index;
}
The assertion "adding function section after callgraph completed" fires if a section is added after the call graph analysis phase has already run. This enforces the ordering constraint: all .text.<funcname> sections must exist before dead code elimination and call graph construction begin.
Text Section Creator -- sub_1CB42D0
Creates a per-function code section with the naming convention .text.<funcname>:
| Field | Value |
|---|---|
| sh_type | SHT_PROGBITS (1) |
| sh_flags | SHF_ALLOC | SHF_EXECINSTR (0x6) |
| Section name | .text.<funcname> |
| Companion | .rela.text.<funcname> (auto-created) |
Each kernel entry point and each device function gets its own .text section. This per-function section layout enables the linker (nvlink) to perform function-level dead code elimination and allows the CUDA driver to load individual kernels.
Symbol Table Builder -- sub_1CB68D0
One of the largest functions in the ELFW subsystem at 9,578 bytes (approximately 1,700 decompiled lines); only the master emitter sub_1C9F280 is bigger. It constructs the .symtab section from the internal symbol representation, handling several CUDA-specific concerns.
Processing Steps
- Iterate internal symbol list -- walks the ELFW symbol container
- Filter deleted symbols -- skips entries marked deleted, emits the "reference to deleted symbol" warning (12 occurrences of this check in the function)
- Handle __cuda_syscall -- special-cases the CUDA syscall dispatcher symbol, which serves as the entry point for device-side system calls (vprintf, malloc, __assertfail, etc.)
- Compute symbol values/sizes -- resolves virtual addresses from section offsets
- Create section symbols -- ensures every section has a corresponding STT_SECTION symbol
- Handle SHN_XINDEX overflow -- when the section index exceeds SHN_LORESERVE (0xFF00 = 65,280), the symbol's st_shndx field is set to SHN_XINDEX (0xFFFF) and the real index is stored in the .symtab_shndx table
- Build .symtab_shndx -- populates the extended section index table for overflow cases
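The overflow convention in the last two steps is the standard ELF one; reader-side, the real section index is recovered like this (a sketch, not recovered code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHN_XINDEX 0xFFFF

/* Resolve a symbol's section index: ordinary values come straight
   from st_shndx; SHN_XINDEX redirects to the parallel
   .symtab_shndx array, indexed by the symbol's position. */
static uint32_t real_section_index(uint16_t st_shndx,
                                   const uint32_t *symtab_shndx,
                                   size_t sym_idx) {
    return (st_shndx == SHN_XINDEX) ? symtab_shndx[sym_idx]
                                    : (uint32_t)st_shndx;
}
```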
Error Messages
| String | Condition |
|---|---|
"reference to deleted symbol" | Symbol was deleted but still referenced (12 checks) |
"ignore symbol %s in unused section" | Symbol in dead-eliminated section |
"missing sec strtab" | String table not initialized |
"missing std sections" | Standard sections (.shstrtab, .strtab, .symtab) missing |
"overflow number of sections %d" | Section count exceeds ELF limits |
CUDA Syscall Functions
The __cuda_syscall symbol is the dispatcher for device-side system calls. The known syscall functions referenced throughout the ptxas binary:
| Syscall | Purpose |
|---|---|
| vprintf | Device-side formatted output |
| malloc | Device-side dynamic memory allocation |
| free | Device-side memory deallocation |
| __assertfail | Device assertion failure handler |
| __profile | Profiling counter increment |
| cnpGetParameterBuffer | Cooperative launch parameter access |
These are compiled as indirect calls through the __cuda_syscall dispatch mechanism. The symbol __cuda_syscall_32f3056bbb (observed in binary strings) is a hash-mangled variant used for linking.
Section Layout Calculator -- sub_1C9DC60
Computes file offsets and virtual addresses for all sections in the ELF. This is a multi-pass algorithm that respects alignment constraints and handles several special cases.
Special Section Handling
| Section | Treatment |
|---|---|
| .nv.constant0 | Skipped (handled separately by OCG constant bank allocation) |
| .nv.reservedSmem | Skipped (shared memory layout computed by master allocator sub_1CABD60) |
| .debug_line | Receives special alignment padding for DWARF line table requirements |
The layout calculator assigns offsets in section-table order, which itself is determined by the 8-bucket priority sort performed during finalization.
ELF Finalization -- sub_1C9F280
The master ELF emitter at 15,263 binary bytes (97 KB decompiled) is the single largest function in the post-codegen address range. It assembles the complete ELF output from the ELFW internal representation.
Execution Flow
- Copy ELF header -- 64 bytes transferred via SSE loadu (128-bit unaligned loads) for performance
- Iterate sections -- uses the accessor pair sub_1CB9FF0 (section count) / sub_1CB9C40 (get section by index)
- Skip virtual sections -- sections with flag & 4 set have no file data (metadata-only)
- Filter .nv.constant0 -- detected via strstr(".nv.constant0"), handled by separate constant bank logic
- Copy section headers -- SIMD-width stride memcpy of section header entries
- Patch ELF flags -- mask 0x7FFFBFFF clears CUDA-specific flag bits, then sets SM version and relocatable/executable mode
- Emit program headers -- creates PT_LOAD segments for loadable sections
- Build symbol table -- delegates to sub_1CB68D0
- Resolve section indices -- handles cross-references between sections
- Embed Mercury capsule -- if capmerc mode, embeds the .nv.merc.* sections
- Process debug sections -- maps .debug_info, .debug_line, .debug_frame sections
- Error recovery -- uses _setjmp for non-local error exit on fatal corruption
Section Ordering -- 8 Priority Buckets
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF. The bucket assignment ensures the correct layout for the CUDA driver's section scanner:
| Bucket | Typical Contents |
|---|---|
| 0 (highest) | ELF header pseudo-section, .shstrtab |
| 1 | .strtab, .symtab, .symtab_shndx |
| 2 | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | .text.<funcname> (code sections) |
| 4 | .nv.constant0.*, .nv.shared.*, .nv.local.* (data sections) |
| 5 | .rela.*, .rel.* (relocation sections) |
| 6 | .nv.info.*, EIATTR sections |
| 7 (lowest) | .debug_*, .nv.merc.* (debug/mercury metadata) |
Offset Assignment and Alignment
Each section's file offset is aligned to its sh_addralign value. The algorithm walks the sorted section list, advancing a running offset counter with alignment padding:
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
    section_t* sec = sorted_sections[i];
    if (sec->sh_addralign > 1)  // sh_addralign is 0 or a power of two per the ELF spec
        offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
    sec->sh_offset = offset;
    offset += sec->sh_size;
}
Extended Section Index Handling (SHN_XINDEX)
When the total section count exceeds 65,280 (SHN_LORESERVE = 0xFF00), standard ELF e_shnum cannot hold the value. The emitter activates the SHN_XINDEX mechanism:
- Sets e_shnum = 0 in the ELF header
- Stores the real section count in sh_size of section header index 0 (the null section)
- Sets e_shstrndx = SHN_XINDEX (0xFFFF)
- Stores the real .shstrtab index in sh_link of section header index 0
This is the standard ELF extension for large section counts, and it is necessary for large CUDA programs with many kernels (each kernel generates at minimum a .text, .rela.text, .nv.info, and .nv.constant0 section).
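The four steps can be condensed into one writer-side helper (structures trimmed to the relevant fields; names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define SHN_LORESERVE 0xFF00
#define SHN_XINDEX    0xFFFF

/* Only the two fields of the null section header that matter here. */
typedef struct { uint32_t sh_link; uint64_t sh_size; } NullShdr;

/* Patch e_shnum/e_shstrndx for large section counts, spilling the
   real values into section header index 0 (the null section). */
static void patch_section_counts(uint16_t *e_shnum, uint16_t *e_shstrndx,
                                 NullShdr *null_sec,
                                 uint64_t shnum, uint32_t shstrndx) {
    if (shnum >= SHN_LORESERVE) {
        *e_shnum = 0;
        null_sec->sh_size = shnum;      /* real section count */
        *e_shstrndx = SHN_XINDEX;
        null_sec->sh_link = shstrndx;   /* real .shstrtab index */
    } else {
        *e_shnum = (uint16_t)shnum;
        *e_shstrndx = (uint16_t)shstrndx;
    }
}
```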
Cubin Generation Entry -- sub_612DE0
The top-level ELF/cubin generation function at 47 KB. Called from the compilation driver sub_446240 after all per-kernel OCG passes complete. This function orchestrates the entire output pipeline.
Key Behaviors
- Option parsing -- reads compilation flags: deviceDebug, lineInfo, optLevel, IsCompute, IsPIC
- Fastpath optimization -- the "Finalizer fastpath optimization" string indicates a fast path for cross-target finalization when no complex linking is needed
- Version embedding -- writes "Cuda compilation tools, release 13.0, V13.0.88" and build ID "Build cuda_13.0.r13.0/compiler.36424714_0" into the cubin
- Error recovery -- establishes its own setjmp/longjmp frame independent of the top-level driver's
- Recursive self-call -- handles nested finalization for scenarios where the output pipeline must invoke itself (e.g., generating both a primary cubin and an embedded Mercury cubin simultaneously)
Symbol Fixup -- sub_1CB2CA0
Adjusts symbol indices after dead code elimination removes sections from the ELFW. When sections are deleted, all symbol references to those sections become stale and must be renumbered.
For each section in ELFW:
If section lacks a STT_SECTION symbol:
Create one
If section has multiple STT_SECTION symbols:
Emit "found multiple section symbols for %s"
Renumber all symbol st_shndx values to match new section indices
Called from 4 sites, indicating it runs at multiple points during the output pipeline (after dead function elimination, after mercury section cloning, etc.).
Section Index Remap -- sub_1C99BB0
Handles the .symtab_shndx and .nv.merc.symtab_shndx extended index tables when section indices change. This is the companion to sub_1CB2CA0: while that function fixes symbol st_shndx fields, this one fixes the extended index tables that hold the real indices when SHN_XINDEX is in use.
ELF Structure Dumper -- sub_1CB91C0
Debug-mode function that prints a formatted dump of the ELFW internal state. Triggered by internal debug flags, not by any user-visible CLI option.
Output format:
elfw structure:
header: size=%d type=%d abiv=%d, flags=%x
shnum=%d shoff=%d
phnum=%d phoff=%d
section <v/r>:
[idx] name type flags offset size link info align entsize
symbol <v/r>:
[idx] name value size bind type shndx
The <v/r> suffix indicates virtual (v) or real (r) mode, corresponding to whether the dump shows the in-memory intermediate state or the final file-ready values. Both 32-bit and 64-bit format strings are present.
File Serializer -- sub_1CD13A0
The final step: writes the assembled ELF binary to disk. Called from 2 sites (main cubin and Mercury capsule cubin).
Validation Checks
| Check | Error String |
|---|---|
| Section data size negative | "Negative size encountered" |
| Computed size != declared size | "section size mismatch" |
| Write failure | "writing file" (logged 12 times across write operations) |
The serializer handles both 32-bit and 64-bit ELF formats, adjusting section header entry sizes (40 bytes for 32-bit, 64 bytes for 64-bit) and symbol table entry sizes (16 bytes for 32-bit, 24 bytes for 64-bit).
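The class-dependent entry sizes quoted above are the standard Elf32/Elf64 ones; as tiny helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Standard ELF entry sizes, per class:
   Elf32_Shdr = 40 B, Elf64_Shdr = 64 B;
   Elf32_Sym  = 16 B, Elf64_Sym  = 24 B. */
static uint32_t shdr_entsize(int is64) { return is64 ? 64 : 40; }
static uint32_t sym_entsize(int is64)  { return is64 ? 24 : 16; }

/* e.g. a 64-bit .symtab with 100 entries occupies 100 * 24 bytes */
```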
ELF Header Layout
The ELF header written by the ELFW constructor follows the standard ELF format with CUDA-specific overrides:
| Offset | Size | Field | Value |
|---|---|---|---|
| 0x00 | 4B | e_ident[EI_MAG0..3] | 0x7F 'E' 'L' 'F' (magic 0x464C457F) |
| 0x04 | 1B | e_ident[EI_CLASS] | 0x01 (ELFCLASS32) or 0x02 (ELFCLASS64) |
| 0x05 | 1B | e_ident[EI_DATA] | 0x01 (little-endian) |
| 0x06 | 1B | e_ident[EI_VERSION] | 0x01 (EV_CURRENT) |
| 0x07 | 1B | e_ident[EI_OSABI] | 0x33 ('3', 32-bit) or 0x41 ('A', 64-bit) -- CUDA cubin marker |
| 0x12 | 2B | e_machine | 0x00BE (EM_CUDA = 190) |
| 0x14 | 4B | e_version | 0x00000001 |
| 0x24 | 4B | e_flags | SM version + CUDA flags (masked via 0x7FFFBFFF) |
The e_flags field encodes the target SM architecture (e.g., sm_100 for Blackwell) and several CUDA-specific flags including relocatable object mode vs executable mode. The mask 0x7FFFBFFF clears bits 14 and 31, which are reserved CUDA control bits.
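The exact bit layout of e_flags is not recovered in this section. As an illustration only, assume the SM number occupies the low byte (consistent with cubins observed in the wild) and that the 0x7FFFBFFF mask is applied first; helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative e_flags patching: clear reserved CUDA control bits
   14 and 31, then write the SM version into the low byte.
   ASSUMPTION: the low-byte placement of the SM number is an
   observation about real cubins, not a recovered constant. */
static uint32_t patch_e_flags(uint32_t flags, uint32_t sm) {
    uint32_t f = flags & 0x7FFFBFFF;    /* clears bits 14 and 31 */
    return (f & ~0xFFu) | (sm & 0xFFu);
}

static uint32_t e_flags_sm(uint32_t flags) { return flags & 0xFFu; }
```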
NVIDIA-Specific Section Types
Beyond standard ELF section types, the emitter uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range:
| Type Constant | Value | Section |
|---|---|---|
| SHT_CUDA_INFO | 0x70000000 | .nv.info.* -- per-entry EIATTR attributes |
| SHT_CUDA_CALLGRAPH | 0x70000064 | .nv.callgraph -- inter-function call edges |
| SHT_CUDA_COMPAT | 0x70000086 | .nv.compat -- forward-compatibility attributes |
The magic constant 1879048292 appearing in the emitter decompilation is 0x70000064 -- the SHT_CUDA_CALLGRAPH value, which also serves as the base of the CUDA-specific type range check in the section creator. The full inventory of these types appears in the Section Catalog & EIATTR page.
Cross-References
- Section Catalog & EIATTR -- complete inventory of section types and EIATTR attributes
- Relocations & Symbols -- relocation resolution and UFT/UDT management
- Debug Information -- DWARF generation and .debug_* section handling
- Mercury Encoder -- Mercury/capmerc encoding that feeds into the ELF emitter
- Pipeline Overview -- where the ELF phase fits in the compilation pipeline
- Memory Pool Allocator -- the sub_424070 pool allocator used by ELFW
Function Map
| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
| sub_1CB53A0 | 3,480 B | 13 KB | 1 | 25 | ELFW constructor (672-byte object) |
| sub_1CB3570 | 1,963 B | 10 KB | 44 | 13 | Section creator (.rela/.rel auto-create) |
| sub_1CB42D0 | -- | -- | -- | -- | .text.<funcname> section creator |
| sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
| sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
| sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
| sub_1C9DC60 | 5,663 B | 29 KB | 1 | 24 | Section layout calculator |
| sub_1C9F280 | 15,263 B | 97 KB | 1 | 42 | Master ELF emitter (assembles complete output) |
| sub_1CB91C0 | 2,668 B | 13 KB | 1 | 5 | ELF structure dumper (debug) |
| sub_1CD13A0 | 2,541 B | 11 KB | 2 | 6 | File serializer (final write to disk) |
| sub_612DE0 | ~12,000 B | 47 KB | 1 | -- | Cubin generation entry point |
| sub_1CB9FF0 | -- | -- | -- | -- | Section count accessor |
| sub_1CB9C40 | -- | -- | -- | -- | Get section by index |
Section Catalog & EIATTR
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
A CUDA cubin is a standard ELF container with NVIDIA-proprietary extensions. ptxas v13.0.88 populates it with approximately 4*(N+M) sections minimum for a program with N entry kernels and M device functions. Each section carries a specific kind of data -- SASS instructions, constant bank contents, relocation entries, per-kernel resource metadata (EIATTR), shared memory layout, debug information, or Mercury-encoded streams for deferred finalization. This page catalogs every section type ptxas can emit, the NVIDIA-specific ELF section types used, the section ordering rules, and the complete EIATTR attribute encoding.
| Section attribute builder | sub_60FBF0 (76 KB decompiled -- per-kernel section config + codegen launch) |
| Section creator | sub_1CB3570 (1,963 bytes, 44 call sites) |
| Text section creator | sub_1CB42D0 (SHF_ALLOC | SHF_EXECINSTR) |
| nvinfo section creator | sub_1CC7FB0 (creates .nv.info / .nv.info.<func>) |
| EIATTR record emitter | sub_1CC85F0 (emits one TLV record) |
| EIATTR builder | sub_1CC9800 (14,764 bytes, 90 KB decompiled, 2,786 lines) |
| EIATTR propagator | sub_1CC8950 (2,634 bytes -- barrier/register propagation) |
| .nv.compat handler | sub_1CC93A0 (.nv.compat attribute processor) |
| Call graph builder | sub_1CBE1B0 (.nv.callgraph section) |
| Layout calculator | sub_1C9DC60 (5,663 bytes -- offset assignment) |
| Master section allocator | sub_1CABD60 (11,856 bytes -- shared/constant/local addresses) |
| SHT_CUDA_INFO | 0x70000000 (1,879,048,192) |
| SHT_CUDA_CALLGRAPH | 0x70000064 (1,879,048,292) |
| .nv.compat section type | 0x70000086 (1,879,048,326) |
NVIDIA-Specific Section Types
Beyond standard ELF section types (SHT_PROGBITS, SHT_STRTAB, SHT_SYMTAB, SHT_RELA, SHT_NOTE), ptxas uses NVIDIA-defined types in the SHT_LOPROC--SHT_HIPROC range (0x70000000--0x7FFFFFFF):
| Constant | Value | Decimal | Used by |
|---|---|---|---|
| SHT_CUDA_INFO | 0x70000000 | 1,879,048,192 | .nv.info, .nv.info.<func> |
| SHT_CUDA_CALLGRAPH | 0x70000064 | 1,879,048,292 | .nv.callgraph |
| SHT_CUDA_COMPAT | 0x70000086 | 1,879,048,326 | .nv.compat |
The section creator sub_1CB3570 contains a range check on CUDA-specific types:
// sub_1CB3570 -- section type validation
if (elf_mode != 1 && is_relocatable
&& ((sh_type - 0x70000064) <= 0x1A || sh_type == 0x70000006)) {
// These CUDA section types require special handling in relocatable mode
}
This tells us that NVIDIA reserves the range 0x70000064--0x7000007E (27 types) plus 0x70000006 for CUDA-specific sections that receive special treatment in relocatable object mode.
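Restated as a predicate (the unsigned subtract-and-compare idiom is exactly what the decompiler shows):

```c
#include <assert.h>
#include <stdint.h>

/* True for the 27 reserved types 0x70000064..0x7000007E plus
   0x70000006 -- the CUDA section types that get special handling
   in relocatable mode. */
static int is_special_cuda_type(uint32_t sh_type) {
    return (sh_type - 0x70000064u) <= 0x1Au || sh_type == 0x70000006u;
}
```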
Complete Section Catalog
Standard ELF Infrastructure Sections
Created unconditionally by the ELFW constructor (sub_1CB53A0). These form the skeleton of every cubin.
| Section | Type | Flags | Purpose |
|---|---|---|---|
| (null) | SHT_NULL | -- | Required ELF null section (index 0) |
| .shstrtab | SHT_STRTAB | -- | Section name string table |
| .strtab | SHT_STRTAB | -- | Symbol name string table |
| .symtab | SHT_SYMTAB | -- | Symbol table |
| .symtab_shndx | SHT_SYMTAB_SHNDX | -- | Extended section indices (when section count > 65,280) |
NVIDIA Note Sections
Created unconditionally. Carry module-level metadata the CUDA driver reads before launching any kernel.
| Section | Type | Flags | Purpose |
|---|---|---|---|
| .note.nv.tkinfo | SHT_NOTE | -- | Toolkit info: version string, build ID, CLI arguments |
| .note.nv.cuinfo | SHT_NOTE | -- | CUDA info: SM version, feature flags |
| .note.nv.cuver | SHT_NOTE | -- | CUDA version note |
Per-Kernel Code Sections
Created by sub_1CB42D0, one set per kernel entry and device function:
| Section | Type | Flags | sh_link | Purpose |
|---|---|---|---|---|
| .text.<func> | SHT_PROGBITS | SHF_ALLOC | SHF_EXECINSTR (0x6) | -- | SASS instruction bytes |
| .rela.text.<func> | SHT_RELA | -- | .symtab index | Relocations for the code section |
The .rela companion is auto-created by the section creator when SHF_EXECINSTR is set. The assertion "adding function section after callgraph completed" fires if a code section is added after call graph analysis.
Per-Kernel Metadata Sections
| Section | Type | Flags | sh_link | Purpose |
|---|---|---|---|---|
| .nv.info.<func> | SHT_CUDA_INFO | SHF_LINK_ORDER (0x40) | .text.<func> symbol | EIATTR TLV records for this kernel |
| .nv.constant0.<func> | SHT_PROGBITS | SHF_ALLOC | -- | Constant bank 0: kernel params + literal constants |
| .nv.shared.<func> | SHT_NOBITS | SHF_ALLOC | SHF_WRITE | -- | Shared memory layout (size only, no file data) |
| .nv.local.<func> | SHT_NOBITS | SHF_ALLOC | SHF_WRITE | -- | Local (spill) memory layout |
The .nv.info.<func> section uses SHF_LINK_ORDER (flag 0x40) to declare its association with the function's symbol. The SHT_CUDA_INFO type value 0x70000000 is used; note that the nvlink wiki previously documented 0x70000064 for this -- the discrepancy arises because nvlink uses a different constant in its own emitter. Binary evidence from ptxas shows sub_1CC7FB0 consistently passes 1879048192 (0x70000000).
Global Metadata Sections
| Section | Type | Flags | Purpose |
|---|---|---|---|
| .nv.info | SHT_CUDA_INFO | -- | Global EIATTR attributes (sh_link = 0, not per-function) |
| .nv.compat | SHT_CUDA_COMPAT | -- | Forward-compatibility attributes (sm version negotiation) |
| .nv.metadata | SHT_PROGBITS | -- | Module-level metadata |
| .nv.callgraph | SHT_CUDA_CALLGRAPH | -- | Inter-function call edges (relocatable mode, -c) |
| .nv.prototype | SHT_PROGBITS | -- | Prototype information for cross-module linking |
| .nv.rel.action | SHT_PROGBITS | -- | Relocation action table |
| .nv.resolvedrela | SHT_PROGBITS | -- | Resolved relocations (post-linking) |
| .nv.host | SHT_PROGBITS | -- | Host-side interop data |
Constant Banks
CUDA supports up to 18 numbered constant banks (0--17) plus named constant sections:
| Section | Purpose |
|---|---|
| .nv.constant0 | Merged constant bank 0 (whole-program mode) |
| .nv.constant0.<func> | Per-function constant bank 0 (kernel params + compiler constants) |
| .nv.constant1 -- .nv.constant17 | User-declared __constant__ variables |
| .nv.constant.entry_params | Entry point parameter block |
| .nv.constant.entry_image_header_indices | Texture/surface header index table |
| .nv.constant.driver | Driver-injected constants |
| .nv.constant.optimizer | Optimizer-generated constants (OCG) |
| .nv.constant.user | User-specified constants |
| .nv.constant.pic | Position-independent code constants |
| .nv.constant.tools_data | Tools/debugger-injected data |
The layout calculator sub_1C9DC60 skips .nv.constant0 sections during offset assignment because their addresses are managed by the OCG constant bank allocator, not the ELF layout engine.
Shared Memory Sections
| Section | Purpose |
|---|---|
.nv.shared.<func> | Per-kernel shared memory (size declaration, no data) |
.nv.shared.reserved. | Reserved shared memory for runtime allocation |
.nv.reservedSmem | Reserved shared memory master section |
.nv.reservedSmem.begin | Start offset of reserved region |
.nv.reservedSmem.cap | Capacity of reserved region |
.nv.reservedSmem.offset0 | Offset within reserved region 0 |
.nv.global.init | Initialized global variables |
The master section allocator sub_1CABD60 assigns addresses to shared, constant, and local memory sections. The layout calculator skips .nv.reservedSmem for the same reason it skips .nv.constant0 -- its address comes from the shared memory master allocator.
Unified Function/Data Tables
| Section | Purpose |
|---|---|
.nv.uft | Unified Function Table (indirect call dispatch) |
.nv.uft.entry | UFT entry point table |
.nv.udt | Unified Data Table |
.nv.udt.entry | UDT entry point table |
The error "Number of .nv.uft jump slots != Number of entries" fires when the UFT and entry tables are inconsistent. "missing nv.uft.entry" fires when the required entry table section was never created.
DWARF Debug Sections
Generated when --device-debug or --generate-line-info is active:
| Section | Content |
|---|---|
.debug_info | DWARF DIE tree (compilation units, types, variables) |
.debug_abbrev | DWARF abbreviation table |
.debug_line | Source-to-address line number mapping |
.debug_frame | Call frame information for unwinding |
.debug_loc | Location lists for variables |
.debug_str | DWARF string table |
.debug_ranges | Address ranges |
.debug_aranges | Address range lookup table |
.debug_pubnames | Public name index |
.debug_pubtypes | Public type index |
NVIDIA Debug Extensions
| Section | Content |
|---|---|
.nv_debug_ptx_txt | Embedded PTX source text |
.nv_debug_line_sass | SASS-level line number mapping |
.nv_debug_info_reg_sass | Register allocation debug info |
.nv_debug_info_reg_type | Register type information |
.nv_debug.shared | Shared memory debug layout |
Mercury / Capsule Mercury Sections (SM 100+)
For Capsule Mercury output (Blackwell and later), the cubin contains a parallel set of .nv.merc.* sections carrying Mercury-encoded instruction streams plus all metadata needed for deferred finalization:
| Section | Purpose |
|---|---|
.nv.capmerc | Capsule Mercury descriptor |
.nv.merc.symtab_shndx | Extended section index table (Mercury copy) |
.nv.merc.nv.shared.reserved | Shared memory reservation metadata |
.nv.merc.rela<secname> | Per-section relocation tables |
.nv.merc.debug_abbrev | Cloned DWARF abbreviation table |
.nv.merc.debug_info | Cloned DWARF info |
.nv.merc.debug_line | Cloned DWARF line table |
.nv.merc.debug_frame | Cloned DWARF frame info |
.nv.merc.debug_loc | Cloned DWARF locations |
.nv.merc.debug_str | Cloned DWARF string table |
.nv.merc.debug_ranges | Cloned DWARF ranges |
.nv.merc.debug_aranges | Cloned DWARF address ranges |
.nv.merc.debug_pubnames | Cloned DWARF public names |
.nv.merc.debug_pubtypes | Cloned DWARF public types |
.nv.merc.debug_macinfo | Cloned DWARF macro info |
.nv.merc.nv_debug_ptx_txt | Embedded PTX source text |
.nv.merc.nv_debug_line_sass | SASS-level line mapping |
.nv.merc.nv_debug_info_reg_sass | Register allocation debug info |
.nv.merc.nv_debug_info_reg_type | Register type debug info |
The Mercury section cloner (sub_1CA2E40) iterates all sections and duplicates constant, global, shared, and local sections into the .nv.merc.* namespace, creating corresponding .nv.merc.rela sections for relocations.
Global vs Per-Kernel Sections
The .nv.info / .nv.info.<func> split is the primary distinction between global and per-kernel metadata:
Global .nv.info (one per cubin):
- sh_link = 0 (no associated symbol)
- Contains module-wide EIATTR records: EIATTR_CUDA_API_VERSION, EIATTR_STATISTICS, EIATTR_HAS_PRE_V10_OBJECT, EIATTR_MERCURY_ISA_VERSION
- Created by sub_1CC7FB0(elfw, 0) -- the zero argument selects global mode
Per-kernel .nv.info.<func> (one per kernel):
- Section name: sprintf(".nv.info.%s", func_name) (visible in sub_1CC7FB0)
- sh_link points to the symbol table entry for the function
- sh_flags includes SHF_LINK_ORDER (0x40) to declare its association
- Contains per-kernel EIATTR records: EIATTR_REGCOUNT, EIATTR_NUM_BARRIERS, EIATTR_FRAME_SIZE, etc.
- Created by sub_1CC7FB0(elfw, sym_idx) when sym_idx != 0
The .nv.info section creator (sub_1CC7FB0) first searches for an existing section of type 0x70000000 with the appropriate name. If none exists, it creates one. The per-function variant links the new section to the function's .text section via sub_1CB4180.
Section Ordering
During finalization, sections are sorted into 8 priority buckets that determine their order in the output ELF:
| Bucket | Priority | Contents |
|---|---|---|
| 0 | Highest | ELF header pseudo-section, .shstrtab |
| 1 | | .strtab, .symtab, .symtab_shndx |
| 2 | | .note.nv.tkinfo, .note.nv.cuinfo |
| 3 | | .text.<func> code sections |
| 4 | | .nv.constant0.*, .nv.shared.*, .nv.local.* data sections |
| 5 | | .rela.*, .rel.* relocation sections |
| 6 | | .nv.info.* EIATTR metadata sections |
| 7 | Lowest | .debug_*, .nv.merc.* debug and Mercury metadata |
Within each bucket, sections appear in creation order. Section file offsets are assigned by sub_1C9DC60 walking the sorted list with alignment padding. The .debug_line section receives special alignment padding for DWARF line table requirements.
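The bucket scheme above can be sketched as a classification function. This is an illustrative reconstruction, not recovered code: `bucket_of` is our name, and the prefix matching is a guess at how the real comparator distinguishes sections; a stable sort on (bucket, creation index) then reproduces the documented ordering.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the 8-bucket section classification (names and the exact
 * prefix tests are ours, inferred from the table above). Order of the
 * checks matters: ".rela" must be tested before the shorter ".rel". */
static int bucket_of(const char *name) {
    if (!strcmp(name, ".shstrtab")) return 0;
    if (!strcmp(name, ".strtab") || !strncmp(name, ".symtab", 7)) return 1;
    if (!strncmp(name, ".note.nv.", 9)) return 2;
    if (!strncmp(name, ".text", 5)) return 3;
    if (!strncmp(name, ".nv.constant", 12) || !strncmp(name, ".nv.shared", 10)
        || !strncmp(name, ".nv.local", 9)) return 4;
    if (!strncmp(name, ".rela", 5) || !strncmp(name, ".rel", 4)) return 5;
    if (!strncmp(name, ".nv.info", 8)) return 6;
    return 7;  /* .debug_*, .nv.merc.* and everything else */
}
```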
Offset Assignment
// sub_1C9DC60 -- simplified layout algorithm
uint64_t offset = elf_header_size;
for (int i = 0; i < section_count; i++) {
section_t* sec = sorted_sections[i];
if (is_virtual(sec)) continue; // flag & 4 -> no file data
if (is_nv_constant0(sec)) continue; // OCG allocator manages these
if (is_nv_reservedSmem(sec)) continue; // shared memory allocator manages these
if (sec->sh_addralign > 1)
offset = (offset + sec->sh_addralign - 1) & ~(sec->sh_addralign - 1);
sec->sh_offset = offset;
offset += sec->sh_size;
}
Three section types are skipped during offset assignment:
- Virtual sections (flag bit 2 set) -- have no file data, only metadata
- .nv.constant0 -- address assigned by the OCG constant bank allocator
- .nv.reservedSmem -- address assigned by the shared memory master allocator sub_1CABD60
EIATTR Encoding
Each .nv.info section contains a flat sequence of EIATTR (Entry Information Attribute) records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially. The EIATTR builder is sub_1CC9800 (14,764 binary bytes, 90 KB decompiled) -- one of the three largest functions in the output pipeline.
TLV Record Format
Offset Size Field
------ ---- -----
0x00 1 format Format byte (determines payload structure)
0x01 1 attr_code EIATTR type code (identifies the attribute)
0x02 2 size Payload size in bytes (little-endian uint16)
0x04 var payload Attribute-specific data (size bytes)
Total record size = 4 + ALIGN_UP(size, 4). Records are 4-byte aligned.
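The framing arithmetic is simple enough to state as code. This is a minimal sketch of the record-size rule just described (the macro and function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* EIATTR record framing: a 4-byte TLV header followed by the payload,
 * padded out to the next 4-byte boundary. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

static uint32_t eiattr_record_size(uint16_t payload_size) {
    return 4 + ALIGN_UP((uint32_t)payload_size, 4);
}
```

So an indexed record (format 0x04) with its 8-byte payload occupies 12 bytes, and a 5-byte free-form payload also rounds up to a 12-byte record.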
Format Byte
| Format | Name | Payload structure |
|---|---|---|
0x01 | Free | Raw bytes, attribute-specific layout |
0x02 | Value | Single 32-bit value (no symbol index) |
0x03 | Sized | 16-bit value + padding |
0x04 | Indexed | [sym_index:4][value:4] -- per-symbol attribute |
Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to, enabling the linker to remap symbol indices during merge.
Parsing Pseudocode
uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;
while (ptr < end) {
    uint8_t format = ptr[0];
    uint8_t attr_code = ptr[1];
    uint16_t size = *(uint16_t *)(ptr + 2);
    switch (format) {
    case 0x04: {                      // Indexed: [sym_index:4][value:4]
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value = *(uint32_t *)(ptr + 8);
        process_indexed(attr_code, sym_idx, value);
        break;
    }
    case 0x02: {                      // Value: single 32-bit payload
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global(attr_code, value);
        break;
    }
    default:                          // Free / Sized: raw payload bytes
        process_raw(attr_code, ptr + 4, size);
        break;
    }
    ptr += 4 + ALIGN_UP(size, 4);     // records are 4-byte aligned
}
EIATTR Record Emitter -- sub_1CC85F0
The low-level function that writes one EIATTR TLV record. Called from the builder and propagator with parameters:
// sub_1CC85F0 -- emit one EIATTR record
void emit_eiattr(
ELFW* elfw, // a1: ELFW object
uint8_t attr_code, // a2: EIATTR type code (e.g., 0x2F for REGCOUNT)
int16_t size, // a3: payload size in bytes
void* payload, // a4: pointer to payload data
uint32_t sym_idx // a5: symbol index (0 = global)
);
Before emitting, it calls sub_1C97840 to check whether the attribute code is supported on the current SM architecture. If not supported, the record is silently skipped. It then calls sub_1CC7FB0 to obtain or create the appropriate .nv.info section, allocates a 16-byte record descriptor, fills the format byte and attribute code, and appends it to the section's linked list (offset +392 in the ELFW object).
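The byte layout that emission ultimately produces can be illustrated with a hypothetical serializer for the common indexed format. This is not the recovered emitter (which builds a 16-byte descriptor and defers serialization); it only shows the on-disk bytes, assuming the little-endian layout documented in the TLV table:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical serializer for a format-0x04 (indexed) EIATTR record.
 * memcpy of native integers matches the wire format on a little-endian
 * host, which is what the x86-64 emitter effectively does. */
static size_t emit_indexed_record(uint8_t *out, uint8_t attr_code,
                                  uint32_t sym_idx, uint32_t value) {
    uint16_t size = 8;           /* payload = [sym_index:4][value:4] */
    out[0] = 0x04;               /* format byte: indexed */
    out[1] = attr_code;          /* e.g. 0x2F = EIATTR_REGCOUNT */
    memcpy(out + 2, &size, 2);   /* payload size, little-endian uint16 */
    memcpy(out + 4, &sym_idx, 4);
    memcpy(out + 8, &value, 4);
    return 12;                   /* 4-byte header + 8-byte payload */
}
```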
EIATTR Attribute Catalog
ptxas v13.0.88 defines 97 EIATTR codes numbered 0 through 96 (plus the EIATTR_ERROR_LAST sentinel at 96). The complete catalog below is cross-referenced against the nvlink v13.0.88 name table (extracted from pointer table at VA 0x1D37D60) and verified against EIATTR codes observed in the ptxas EIATTR builder (sub_1CC9800 switch cases and sub_1CC85F0 call sites).
Complete Code Table
| Code | Hex | Name | Fmt | Category |
|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | Sentinel |
| 1 | 0x01 | EIATTR_PAD | -- | Sentinel |
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Idx | Texture |
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | Metadata |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Idx | Metadata |
| 5 | 0x05 | EIATTR_MAX_THREADS | Idx | Resource |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Idx | Texture |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Idx | Texture |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Idx | Texture |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Idx | Texture |
| 10 | 0x0A | EIATTR_PARAM_CBANK | Idx | Param |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | Param |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | Param |
| 13 | 0x0D | EIATTR_SYNC_STACK | Idx | Metadata |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | Texture |
| 15 | 0x0F | EIATTR_EXTERNS | Free | Metadata |
| 16 | 0x10 | EIATTR_REQNTID | Idx | Resource |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Idx | Resource |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Idx | Resource |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Idx | Texture |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | Texture |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Idx | Texture |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Idx | Texture |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | Param |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Idx | Param |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | Param |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Idx | Metadata |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | Resource |
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | Offsets |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | Offsets |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Idx | Resource |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Idx | Metadata |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Idx | Metadata |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Idx | Metadata |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Idx | Metadata |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Idx | Resource |
| 36 | 0x24 | EIATTR_SUQ_USED | Idx | Metadata |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | Offsets |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Idx | Metadata |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | Offsets |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | Offsets |
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Idx | Cluster |
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | WAR |
| 43 | 0x2B | EIATTR_WMMA_USED | Idx | Metadata |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Val | Metadata |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | Offsets |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | Offsets |
| 47 | 0x2F | EIATTR_REGCOUNT | Idx | Resource |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | WAR |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | Offsets |
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Idx | Shared |
| 51 | 0x33 | EIATTR_STATISTICS | Free | Metadata |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | Offsets |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | WAR |
| 54 | 0x36 | EIATTR_SW_WAR | Free | WAR |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Idx | Metadata |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Idx | Resource |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | Offsets |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | Offsets |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Idx | Resource |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | Metadata |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Idx | Cluster |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Idx | Cluster |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Idx | Cluster |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | Metadata |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Idx | Shared |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Idx | Shared |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | Metadata |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | Offsets |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | Param |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | Offsets |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | WAR |
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Idx | Graphics |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Idx | Graphics |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Idx | Graphics |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Idx | Metadata |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Idx | Resource |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Idx | Texture |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | Metadata |
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGMENTS | Free | Blackwell |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | Blackwell |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Idx | Blackwell |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Idx | Blackwell |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Idx | Blackwell |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Idx | Blackwell |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | Metadata |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | Sentinel |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | Offsets |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Idx | Metadata |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | Offsets |
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | Mercury |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Idx | Cluster |
| 92 | 0x5C | EIATTR_SANITIZE | Idx | Blackwell |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | Metadata |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | Metadata |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | Mercury |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | Sentinel |
Fmt column: Idx = format 0x04 (indexed, per-symbol), Free = format 0x01 (raw bytes), Val = format 0x02 (single 32-bit value), Sized = format 0x03 (16-bit value).
EIATTR Codes Confirmed in ptxas Builder
The following codes appear as explicit case labels in the sub_1CC9800 switch statement or as arguments to sub_1CC85F0:
| Code | Hex | Confirmed via |
|---|---|---|
| 4 | 0x04 | case 0x4 in builder -- CTAIDZ_USED |
| 13 | 0x0D | case 0xD -- SYNC_STACK |
| 15 | 0x0F | case 0xF + sub_1CC85F0(_, 0xF, ...) -- EXTERNS |
| 17 | 0x11 | case 0x11 -- FRAME_SIZE |
| 18 | 0x12 | case 0x12 + sub_1CC85F0(_, 0x12, ...) -- MIN_STACK_SIZE |
| 27 | 0x1B | case 0x1B -- MAXREG_COUNT |
| 30 | 0x1E | case 0x1E + sub_1CC85F0(_, 0x1E, ...) -- CRS_STACK_SIZE |
| 35 | 0x23 | case 0x23 -- MAX_STACK_SIZE |
| 38 | 0x26 | case 0x26 -- LOAD_CACHE_REQUEST |
| 47 | 0x2F | case 0x2F + sub_1CC85F0(_, 0x2F, ...) -- REGCOUNT |
| 56 | 0x38 | case 0x38 -- NUM_MBARRIERS |
| 59 | 0x3B | case 0x3B + sub_1CC85F0(_, 0x3B, ...) -- SAM_REGION_STACK_SIZE |
| 65 | 0x41 | case 0x41 -- RESERVED_SMEM_USED |
| 74 | 0x4A | case 0x4A -- VRC_CTA_INIT_COUNT |
| 76 | 0x4C | case 0x4C -- NUM_BARRIERS |
| 79 | 0x4F | case 0x4F + sub_1CC85F0(_, 0x4F, ...) -- AT_ENTRY_FRAGMENTS |
| 80 | 0x50 | case 0x50 + sub_1C97840(0x50, ...) -- SPARSE_MMA_MASK |
| 81 | 0x51 | case 0x51 -- TCGEN05_1CTA_USED |
| 82 | 0x52 | case 0x52 -- TCGEN05_2CTA_USED |
| 84 | 0x54 | case 0x54 -- REG_RECONFIG |
The builder's first pass uses a switch with cases 0x04, 0x0D, 0x0F, 0x11, 0x12, 0x1B, 0x1E, 0x23, 0x26, 0x2F, 0x38, 0x3B, 0x41, 0x4A, 0x4C, 0x4F, 0x50, 0x51, 0x52, 0x54 to sort EIATTR records into per-entry arrays. A second pass emits the final records via sub_1CC85F0 and sub_1CC86D0.
Symbol Index Resolution Pass
Before the main builder runs, the EIATTR builder performs a symbol index resolution pass (lines 700--884 in the decompiled builder). This pass walks all pre-existing EIATTR records and resolves symbol indices through the linker's mapping tables:
// Simplified from sub_1CC9800 lines ~716-824
for (record in eiattr_list) {
switch (record->attr_code) {
case 0x02: case 0x06: case 0x07: case 0x08: case 0x09:
case 0x0A: case 0x11: case 0x12: case 0x13: case 0x14:
case 0x17: case 0x23: case 0x26: case 0x2F: case 0x3B:
case 0x45:
// Indexed format: resolve sym_idx through mapping table
int32_t *sym_ptr = (int32_t *)record->payload;
if (mapping_table && *sym_ptr != 0) {
if (*sym_ptr < 0)
*sym_ptr = negative_mapping[-(*sym_ptr)];
else
*sym_ptr = mapping_table[*sym_ptr];
}
      if (*sym_ptr == 0 && record->attr_code != 0x45 && record->attr_code != 0x17) {
record->attr_code = 0; // disable record
}
break;
case 0x0F:
// EXTERNS: resolve each 4-byte symbol index in the array
int count = record->size / 4;
for (int i = 0; i < count; i++) {
resolve_sym(&payload[i], mapping_table, negative_mapping);
}
break;
}
}
The bitmask 0x800800060000 (seen at line 716) is consulted alongside this switch. Note that the bits actually set in that constant are 17, 18, 35, and 47 -- FRAME_SIZE, MIN_STACK_SIZE, MAX_STACK_SIZE, and REGCOUNT -- so it selects only a subset of the indexed codes handled by the switch cases above, presumably as a fast path.
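Membership in such a mask reduces to a single shift-and-test, which is presumably how the decompiled code uses the constant. A quick check of which attribute codes the mask actually covers (function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Test whether an EIATTR code is selected by a 64-bit code bitmask. */
static int code_in_mask(uint64_t mask, unsigned code) {
    return code < 64 && ((mask >> code) & 1) != 0;
}
```

Evaluating `code_in_mask(0x800800060000ULL, c)` for all c confirms that exactly codes 17, 18, 35, and 47 are set.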
Barrier and Register Propagation -- sub_1CC8950
When a device function uses barriers or a high register count, those requirements must propagate upward through the call graph to each entry kernel. The propagator sub_1CC8950 handles this:
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s"
"Propagating higher barcount %d to the section flags
of %s of entry symbol %s"
"regcount %d for %s propagated to entry %s"
The propagator emits EIATTR_REGCOUNT (0x2F) records via sub_1CC85F0(_, 0x2F, 8, ...) and handles EIATTR_NUM_BARRIERS (0x4C) through the sub_1CC7FB0 path. Barrier counts are extracted from the section flags field at bit offset 20 (7-bit field, mask 0x7F), then cleared from the section flags (&= 0xF80FFFFF) after being moved into an EIATTR record.
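The bitfield arithmetic here is worth making concrete. A minimal sketch of the barcount extraction and clearing described above (function names are ours; the masks are the recovered constants):

```c
#include <assert.h>
#include <stdint.h>

/* Barrier count lives in a 7-bit field at bit offset 20 of the section
 * flags word. */
static uint32_t barcount_get(uint32_t sh_flags) {
    return (sh_flags >> 20) & 0x7F;
}

/* Clearing mask 0xF80FFFFF zeroes exactly bits 20-26 (0x07F00000),
 * i.e. the barcount field, after it has been moved into an EIATTR record. */
static uint32_t barcount_clear(uint32_t sh_flags) {
    return sh_flags & 0xF80FFFFF;
}
```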
EIATTR Categories by Function
Resource allocation (GPU driver reads these to configure hardware at launch):
| Code | Name | Description |
|---|---|---|
0x2F | REGCOUNT | Physical register count per thread (primary occupancy determinant) |
0x05 | MAX_THREADS | Maximum threads per block (.maxntid) |
0x10 | REQNTID | Required block dimensions (.reqntid, 3x uint32) |
0x11 | FRAME_SIZE | Per-thread local memory frame size (bytes) |
0x12 | MIN_STACK_SIZE | Minimum call stack (non-recursive) |
0x23 | MAX_STACK_SIZE | Maximum call stack (recursive) |
0x1E | CRS_STACK_SIZE | Call-Return-Sync stack |
0x3B | SAM_REGION_STACK_SIZE | SAM (Streaming Async Memory) region stack |
0x4C | NUM_BARRIERS | Named barrier count (0--16) |
0x38 | NUM_MBARRIERS | Memory barrier (mbarrier) object count |
0x1B | MAXREG_COUNT | Register count hint (--maxrregcount / .maxnreg) |
Parameter bank:
| Code | Name | Description |
|---|---|---|
0x0A | PARAM_CBANK | Constant bank number + offset for parameters |
0x19 | CBANK_PARAM_SIZE | Parameter constant bank size |
0x18 | SMEM_PARAM_SIZE | Shared memory parameter region size |
0x0B | SMEM_PARAM_OFFSETS | Per-parameter shared memory offsets |
0x0C | CBANK_PARAM_OFFSETS | Per-parameter constant bank offsets |
0x17 | KPARAM_INFO | Per-parameter metadata (v1) |
0x45 | KPARAM_INFO_V2 | Per-parameter metadata (v2, extended) |
Instruction offset tables (driver/tools locate and patch instructions at load time):
| Code | Name | Description |
|---|---|---|
0x1C | EXIT_INSTR_OFFSETS | Byte offsets of EXIT instructions |
0x1D | S2RCTAID_INSTR_OFFSETS | Offsets of S2R SR_CTAID.* instructions |
0x25 | LD_CACHEMOD_INSTR_OFFSETS | Load instructions with cache modifier |
0x27 | ATOM_SYS_INSTR_OFFSETS | Atomic instructions with .sys scope |
0x28 | COOP_GROUP_INSTR_OFFSETS | Cooperative group instructions |
0x2D | ATOMF16_EMUL_INSTR_OFFSETS | Emulated FP16 atomics |
0x2E | ATOM16_EMUL_INSTR_REG_MAP | Register map for 16-bit atomic emulation |
0x31 | INT_WARP_WIDE_INSTR_OFFSETS | Integer warp-wide instructions |
0x34 | INDIRECT_BRANCH_TARGETS | Valid indirect branch targets (CFI) |
0x39 | MBARRIER_INSTR_OFFSETS | MBAR memory barrier instructions |
0x3A | COROUTINE_RESUME_OFFSETS | Device coroutine resume points |
0x44 | UNUSED_LOAD_BYTE_OFFSET | Unused load instruction byte offset |
0x46 | SYSCALL_OFFSETS | __cuda_syscall invocation offsets |
0x57 | STACK_CANARY_TRAP_OFFSETS | Stack canary trap instructions |
0x59 | LOCAL_CTA_ASYNC_STORE_OFFSETS | CTA-local async store instructions |
Texture and surface:
| Code | Name | Description |
|---|---|---|
0x02 | IMAGE_SLOT | Texture/surface image slot assignment |
0x06 | IMAGE_OFFSET | Image descriptor table offset |
0x07 | IMAGE_SIZE | Image descriptor size |
0x08 | TEXTURE_NORMALIZED | Normalized texture coordinates flag |
0x09 | SAMPLER_INIT | Sampler initialization data |
0x0E | TEXID_SAMPID_MAP | Texture-to-sampler mapping table |
0x13 | SAMPLER_FORCE_UNNORMALIZED | Force unnormalized sampler |
0x14 | BINDLESS_IMAGE_OFFSETS | Bindless texture/surface offsets |
0x15 | BINDLESS_TEXTURE_BANK | Constant bank for bindless textures |
0x16 | BINDLESS_SURFACE_BANK | Constant bank for bindless surfaces |
0x4D | TEXMODE_INDEPENDENT | Independent texture mode |
Cluster and cooperative launch (sm_90+):
| Code | Name | Description |
|---|---|---|
0x29 | COOP_GROUP_MASK_REGIDS | Cooperative group mask register IDs |
0x3D | CTA_PER_CLUSTER | CTAs per cluster (Hopper+) |
0x3E | EXPLICIT_CLUSTER | Explicit cluster dimensions |
0x3F | MAX_CLUSTER_RANK | Maximum cluster rank |
0x5B | BLOCKS_ARE_CLUSTERS | CTA blocks are clusters flag |
Shared memory:
| Code | Name | Description |
|---|---|---|
0x32 | SHARED_SCRATCH | Shared memory scratch for register spilling |
0x41 | RESERVED_SMEM_USED | Reserved shared memory in use |
0x42 | RESERVED_SMEM_0_SIZE | Reserved shared memory partition 0 size |
Hardware workarounds:
| Code | Name | Description |
|---|---|---|
0x2A | SW1850030_WAR | HW bug 1850030 workaround |
0x30 | SW2393858_WAR | HW bug 2393858 workaround |
0x35 | SW2861232_WAR | HW bug 2861232 workaround |
0x36 | SW_WAR | Generic workaround container |
0x47 | SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | MEMBAR.SYS workaround offsets |
Blackwell+ (sm_100+):
| Code | Name | Description |
|---|---|---|
0x4F | AT_ENTRY_FRAGMENTS | Fragment descriptors at function entry |
0x50 | SPARSE_MMA_MASK | Structured sparsity mask for MMA |
0x51 | TCGEN05_1CTA_USED | 5th-gen tensor core (single-CTA mode) |
0x52 | TCGEN05_2CTA_USED | 5th-gen tensor core (two-CTA mode) |
0x53 | GEN_ERRBAR_AT_EXIT | Generate error barrier at kernel exit |
0x54 | REG_RECONFIG | Dynamic register reconfiguration (setmaxnreg) |
0x5C | SANITIZE | Address sanitizer instrumentation present |
Mercury:
| Code | Name | Description |
|---|---|---|
0x5A | MERCURY_FINALIZER_OPTIONS | Options for Mercury FNLZR post-link pass |
0x5F | MERCURY_ISA_VERSION | Mercury ISA version for shader binary |
Graphics-specific:
| Code | Name | Description |
|---|---|---|
0x48 | GRAPHICS_GLOBAL_CBANK | Global constant bank for graphics shaders |
0x49 | SHADER_TYPE | Shader type (vertex, fragment, compute, etc.) |
0x4A | VRC_CTA_INIT_COUNT | Virtual Register Count CTA init count |
.nv.compat Section
The .nv.compat section (SHT_CUDA_COMPAT = 0x70000086) stores forward-compatibility attributes. Its records use a different format from EIATTR -- each is a small TLV with:
Offset Size Field
------ ---- -----
0x00 1 format (always 0x02 = value)
0x01 1 compat_code
0x02 1 value
The sub_1CC93A0 handler processes these with a switch over compat codes 2--6:
| Code | Behavior |
|---|---|
| 2 | Max of existing and new value (keeps higher) |
| 3 | OR existing with new value (accumulate flags) |
| 4 | Reset to zero |
| 5 | Per-nibble max (two 2-bit fields) |
| 6 | Set to 1 if values differ (conflict detection) |
The guard *(_DWORD *)(a1 + 72) <= 0x59 (SM version <= 89 decimal) skips compat processing, so it only runs for SM 90 (Hopper) and later. Unknown compat codes trigger: "unknown .nv.compat attribute (%x) encoutered with value %x." (note the typo "encoutered" in the binary string).
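The merge behaviors in the table above can be sketched as a switch. This is an illustrative reconstruction, not the recovered sub_1CC93A0: the function and parameter names are ours, and we interpret "per-nibble max" as two independent 4-bit fields.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the .nv.compat merge semantics for compat codes 2-6. */
static uint8_t compat_merge(int code, uint8_t oldv, uint8_t newv) {
    switch (code) {
    case 2:  return oldv > newv ? oldv : newv;   /* keep the higher value  */
    case 3:  return oldv | newv;                 /* accumulate flag bits   */
    case 4:  return 0;                           /* reset to zero          */
    case 5: {                                    /* per-nibble max         */
        uint8_t lo = (oldv & 0x0F) > (newv & 0x0F) ? (oldv & 0x0F) : (newv & 0x0F);
        uint8_t hi = (oldv & 0xF0) > (newv & 0xF0) ? (oldv & 0xF0) : (newv & 0xF0);
        return hi | lo;
    }
    case 6:  return oldv != newv;                /* 1 on conflict          */
    default: return oldv;                        /* unknown: leave as-is   */
    }
}
```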
Architecture-Gated EIATTR Emission
Not all EIATTR codes are valid on all SM architectures. The function sub_1C97840 performs architecture checks before emitting a record. Observed gates:
| EIATTR Code | Gate | Meaning |
|---|---|---|
0x04 (CTAIDZ_USED) | Always emitted | |
0x41 (RESERVED_SMEM_USED) | sub_1C97840(0x41, sm) | SM-version dependent |
0x4C (NUM_BARRIERS) | sub_1C97840(0x4C, sm) | SM-version dependent |
0x50 (SPARSE_MMA_MASK) | sub_1C97840(0x50, sm) | SM 100+ (Blackwell) |
0x51 (TCGEN05_1CTA) | sub_1C97840(0x51, sm) implicit | SM 100+ |
0x52 (TCGEN05_2CTA) | sub_1C97840(0x52, sm) implicit | SM 100+ |
0x54 (REG_RECONFIG) | sub_1C97840(0x54, sm) implicit | SM 100+ |
The sub_1C97840 function takes an EIATTR code and the SM version from the ELFW object's field at offset 624, returning a boolean. This prevents older EIATTR codes from appearing in Blackwell cubins and prevents Blackwell-only codes from appearing in Hopper cubins.
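In the spirit of sub_1C97840, the gating can be pictured as a code-to-minimum-SM predicate. This sketch encodes only the gates listed in the table above; the real function's full table has not been recovered, and the SM-version-dependent cases (0x41, 0x4C) are left as always-true placeholders:

```c
#include <assert.h>

/* Illustrative architecture gate: Blackwell-only EIATTR codes require
 * SM >= 100; everything else passes in this simplified model. */
static int eiattr_supported(int code, int sm) {
    switch (code) {
    case 0x50:  /* SPARSE_MMA_MASK  */
    case 0x51:  /* TCGEN05_1CTA     */
    case 0x52:  /* TCGEN05_2CTA     */
    case 0x54:  /* REG_RECONFIG     */
        return sm >= 100;
    default:    /* e.g. 0x04 CTAIDZ_USED: always emitted */
        return 1;
    }
}
```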
Constant Bank Optimization
The master section allocator sub_1CABD60 (11,856 bytes) performs two major space optimizations during address assignment: constant value deduplication within .nv.constant0 banks, and shared memory interference-graph coloring for extern shared variables. Both run before final offset assignment.
Constant Value Deduplication -- sub_1CA6890
When multiple kernels in the same compilation unit use identical constant values, the OCG constant bank can contain duplicates. sub_1CA6890 (454 lines decompiled) eliminates them by value-matching, reducing .nv.constant0 section size.
The algorithm dispatches on constant value width:
| Value Width | Dedup Strategy | Data Structure |
|---|---|---|
| 4 bytes | Hash map lookup (sub_426D60) | Hash table keyed on 32-bit value |
| 8 bytes | Hash map lookup (separate table) | Hash table keyed on 64-bit value |
| 12, 16, 20, 24, 32, 48, 64 bytes | Linear scan with memcmp (sub_1CA6760) | Per-width linked list |
| Other | No deduplication | Direct append |
For each constant data node in the section's linked list (at section+72):
- Extract the value bytes (node+0), alignment (node+16), and size (node+24).
- Look up the value in the appropriate dedup structure.
- If a duplicate is found: alias the current symbol's offset to the existing symbol's offset. Debug output: "found duplicate value 0x%x, alias %s to %s" (32-bit), "found duplicate 64bit value 0x%llx, alias %s to %s" (64-bit), or "found duplicate %d byte value, alias %s to %s" (N-byte via sub_1CA6760).
- If not found: align the section cursor to the required alignment, append the data via sub_1CA6650, and insert the value into the dedup structure.
After aliasing, the function rewrites pending relocations that targeted the now-eliminated range:
// Simplified relocation rewriting after dedup alias
for (reloc in pending_relocs) {
if (reloc.section == target_section
&& reloc.offset >= old_data_offset
&& reloc.offset < old_data_offset + old_data_size) {
reloc.offset = reloc.offset + alias_target_offset - old_data_offset;
// "optimize ocg constant reloc offset from %lld to %lld"
unlink(reloc); // remove from pending list
}
}
Special cases:
- Zero-valued constants: A "seen set" (parameter a15) prevents distinct zero-valued symbols from being aliased to each other, since different __constant__ variables may legitimately hold zero but need separate addresses.
- Redirect mode: When parameter a13 is set and sub_1CB15C0 returns true for a symbol, the constant is redirected to its defining section rather than deduplicated.
The caller sub_1CABD60 wraps this in an optimization check: "optimize OCG constants for %s, old size = %lld". If dedup does not reduce the section size, it reverts: "ocg const optimization didn't help so give up".
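The 32-bit dedup path can be illustrated with a toy model. This is not the recovered implementation: a linear table stands in for the hash map (sub_426D60), and all names are ours. It shows the essential contract -- a hit returns the existing offset to alias to, a miss appends at the section cursor:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CONSTS 64

/* Toy dedup table for 4-byte constants in a constant bank. */
typedef struct {
    uint32_t value[MAX_CONSTS];
    uint64_t offset[MAX_CONSTS];
    int      count;
    uint64_t cursor;   /* next free byte in the section */
} dedup32_t;

static uint64_t dedup32_place(dedup32_t *d, uint32_t v) {
    for (int i = 0; i < d->count; i++)
        if (d->value[i] == v)
            return d->offset[i];        /* duplicate: alias to existing */
    d->value[d->count]  = v;            /* miss: append 4 bytes */
    d->offset[d->count] = d->cursor;
    d->count++;
    d->cursor += 4;
    return d->offset[d->count - 1];
}
```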
Shared Memory Interference Graph -- sub_1CA92F0
When a CUDA program declares multiple extern __shared__ variables used by different kernels, they can potentially share the same memory if no single kernel uses both simultaneously. sub_1CA92F0 (585 lines decompiled) builds an interference graph and performs greedy graph coloring to pack shared objects into minimum total space.
Phase 1 -- Build usage sets (which kernels reference each shared object):
For each global shared object, walk all referencing functions. A kernel "uses" a shared object if it directly references it or transitively calls a device function that does (traced via sub_1CBD800). Objects used by exactly one kernel are privatized -- moved into a per-entry .nv.shared.<func> section. Unused objects are removed entirely (symbol flags set to mark deleted).
"global shared %s only used in entry %d" -- privatize
"remove unused global shared %s" -- delete
Phase 2 -- Build interference edges:
For each pair of remaining shared objects (i, j), test whether their usage sets intersect (via sub_42E460 set membership). If any kernel uses both, they interfere -- they cannot overlap in memory. Edges are stored as linked lists per object.
Phase 3 -- Greedy graph coloring:
Objects are processed in sorted order. For each object:
- Mark all colors used by interfering neighbors as unavailable.
- Assign the lowest available color (starting from 1).
- Update the color's alignment requirement (max of all objects in that color group).
- Update the color's size requirement (max of all objects in that color group).
" allocate to group %d" -- color assignment
Phase 4 -- Compute group offsets:
Groups are laid out sequentially with alignment padding:
group_offset[1] = align_up(base, group_align[1]);
for (g = 2; g <= num_groups; g++)
group_offset[g] = align_up(group_offset[g-1] + group_size[g-1], group_align[g]);
total_size = group_offset[last] + group_size[last];
Each shared object's final offset is group_offset[its_color]. The total extern shared size is written to the section descriptor. Per-entry shared sections are expanded if a referenced object's offset + size exceeds their current size.
"esh %s size = %lld"
"for shared object (%d) %s:"
" offset = 0x%llx, size = 0x%llx"
" edge to %d"
" allocate to group %d"
Constant Bank Optimization Functions
| Address | Size | Purpose |
|---|---|---|
sub_1CA6890 | 454 lines | Constant value deduplication (32/64-bit hash, N-byte memcmp) |
sub_1CA6760 | 57 lines | N-byte value dedup helper (12--64 byte constants) |
sub_1CA6650 | 65 lines | Constant data node appender (40-byte node, alignment + append) |
sub_1CA92F0 | 585 lines | Shared memory interference graph + greedy coloring |
sub_1CA91A0 | -- | Per-entry shared section creator (.nv.shared.<func>) |
sub_1CA5360 | -- | Shared object comparison function (sort key) |
sub_1CA5A00 | -- | Shared memory data copier (offset overlap check) |
Section Attribute Builder -- sub_60FBF0
The per-kernel section attribute builder sub_60FBF0 (76 KB decompiled, 2,541 lines, VA 0x60FBF0) runs once for each kernel entry point and device function. It assembles the full per-function section configuration object (648 bytes), parses compile option overrides, remaps PTX memory space codes to ELF section type IDs, conditionally creates three section types, then invokes the Mercury codegen pipeline (sub_6F52F0, DecodePipeline::RunStages).
Inputs
The function takes three parameters:
| Parameter | Content |
|---|---|
a1 | Per-function descriptor: SM version (a1[0..1]), key-value option list (a1+38), assembler flags (a1+39), global/extern symbol lists (a1+6, a1+7), boolean flags (a1+180..182) |
a2 | Compilation context: config base (a2+248), function list (a2+136), optional symbol tables for textures (a2+112), surfaces (a2+120), globals (a2+72), sass_map flag (a2+232), mutex (a2+240), ELFW object (a2+32), target descriptor (a2+56) |
a3 | Output handle (released and reallocated at function entry) |
Option Parsing
The function iterates the key-value list at a1+38 and matches five string keys by character-by-character comparison:
// Simplified from sub_60FBF0 lines ~638-812
for (int i = 0; i < list_length(a1->options); i++) {
const char** kv = list_get(a1->options, i);
if (strcmp(kv[0], "deviceDebug") == 0)
config->deviceDebug = 1; // config+24
else if (strcmp(kv[0], "lineInfo") == 0)
config->lineInfo = 1; // config+25
else if (strcmp(kv[0], "optLevel") == 0) {
if (!config->optLevel_locked) // config+108
config->optLevel = strtol(kv[1], ...); // config+104
}
else if (strcmp(kv[0], "IsCompute") == 0)
config->isCompute = (strcmp(kv[1], "True") == 0); // config+184
else if (strcmp(kv[0], "IsPIC") == 0)
config->isPIC = (strcmp(kv[1], "True") == 0); // config+191
}
Additional inputs read from a1 and a2:
| Source | Config field | Description |
|---|---|---|
| a2+408 | config+44 | Optimization level from compilation context |
| a1+180 | config+190 | Negated boolean flag |
| a1+181 | config+188 | Boolean flag |
| a1+182 | config+189 | Boolean flag |
| word_2020620[a2+64 - 20] | v31+4 | SM version lookup table (SM 20..121 range) |
Memory Space Type Remapping
PTX internal memory space type codes in the 0x10000 range are remapped to compact ELF section type IDs by the helper sub_60DA40 (and inline copies). This remapping is applied to every symbol in the global, extern, texture, surface, and per-function symbol lists:
| PTX Code | Hex | Section Type ID | Memory Space |
|---|---|---|---|
| 65538 | 0x10002 | 83 | .nv.shared (per-kernel) |
| 65539 | 0x10003 | 80 | .nv.constant0 (kernel params) |
| 65540 | 0x10004 | 84 | .nv.local (spill memory) |
| 65541 | 0x10005 | 81 | .nv.constant1 |
| 65542 | 0x10006 | 82 | .nv.constant2 |
| 65544 | 0x10008 | 85 | .nv.constant3 |
| 65545 | 0x10009 | 86 | .nv.constant4 |
| 65546 | 0x1000A | 87 | .nv.constant5 |
| 65576 | 0x10028 | 88 | .nv.global.init |
| 65577 | 0x10029 | 89 | .nv.global |
| 65586 | 0x10032 | 93 | High constant bank |
| 65587 | 0x10033 | 90 | High constant bank |
| 65598 | 0x1003E | 91 | Texture/surface descriptor |
| 65599 | 0x1003F | 92 | Texture/surface descriptor |
Special handling: when the space code is 0x10003 (constant0) and the compilation mode is relocatable (*(a3+48) == 12), the descriptor's needs_reloc flag (byte 33) is set to 1, indicating the constant0 section requires special relocation handling during linking.
The value 65596 (0x1003C) serves as a threshold -- symbols with (space_type - 0x1003C) < 2 are counted into the texture/surface allocation arrays.
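The remapping can be sketched as a plain switch over the recovered table; the function names and the `-1` fallback are illustrative conventions, and only the code/ID pairs come from the table above.

```c
#include <assert.h>

/* Map a PTX memory space code (0x10000 range) to its compact ELF
   section type ID, per the recovered table. Returns -1 for codes the
   table does not cover (illustrative convention, not the binary's). */
static int remap_space_code(int space) {
    switch (space) {
    case 0x10002: return 83;  /* .nv.shared (per-kernel)       */
    case 0x10003: return 80;  /* .nv.constant0 (kernel params) */
    case 0x10004: return 84;  /* .nv.local (spill memory)      */
    case 0x10005: return 81;  /* .nv.constant1                 */
    case 0x10006: return 82;  /* .nv.constant2                 */
    case 0x10008: return 85;  /* .nv.constant3                 */
    case 0x10009: return 86;  /* .nv.constant4                 */
    case 0x1000A: return 87;  /* .nv.constant5                 */
    case 0x10028: return 88;  /* .nv.global.init               */
    case 0x10029: return 89;  /* .nv.global                    */
    case 0x10032: return 93;  /* high constant bank            */
    case 0x10033: return 90;  /* high constant bank            */
    case 0x1003E: return 91;  /* texture/surface descriptor    */
    case 0x1003F: return 92;  /* texture/surface descriptor    */
    default:      return -1;
    }
}

/* Threshold check from the text: (space - 0x1003C) < 2 as an unsigned
   comparison, i.e. only codes 0x1003C and 0x1003D are counted into the
   texture/surface allocation arrays. */
static int is_tex_surf_counted(int space) {
    return (unsigned)(space - 0x1003C) < 2;
}
```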
Conditional Section Creation
Three per-kernel sections are conditionally created:
.sass_map.<func> -- created when *(a2+232) != 0 (sass_map generation enabled):
if (context->sass_map_enabled) { // a2+232
descriptor = alloc(64); // 64-byte section descriptor
memset(descriptor, 0, 64);
pthread_mutex_lock(context->mutex); // a2+240
// Allocate instruction tree node and connect to codegen state
name = sprintf(".sass_map%s", func_name); // ".sass_map" + func_name
descriptor->name = name;
pthread_mutex_unlock(context->mutex);
}
.nv.local.<func> -- created when the register spill size (config+112) is non-zero:
if (config->spill_size != 0) { // config+112
descriptor = alloc(64);
descriptor->size = config->spill_size;
// Name: ".nv.local." + bare_func_name (skip ".text." prefix)
name = sprintf(".nv.local.%s", func_name + 6);
}
The spill size at config+112 is set from the sum of the register spill count and frame size when the spill flag is non-zero.
.nv.constant<N>.<func> -- created when:
- The compilation mode field equals 2 (*(a1->target+48) == 2)
- No pre-existing constant section exists (*(a1+172) == 0)
- The function's symbol list is empty
if (mode == 2 && !has_constant_section && no_symbols) {
descriptor = alloc(64);
int bank = target->get_section_type() - 0x70000064;
int size;
if (func_const_size <= target->get_min_const_size())
size = target->get_min_const_size(); // vtable+80
else
size = target->get_max_const_size(); // vtable+88
descriptor->size = size + func_const_size;
name = sprintf(".nv.constant%d.%s", bank, bare_func_name);
descriptor->data = calloc(descriptor->size);
}
Assembler Flag Processing
The assembler flag list at a1+39 is iterated. Each entry's value string (at offset +8) is split on spaces via strtok_r. Each token is validated by sub_60F790, which constructs a temporary 656-byte object to test the flag. Valid tokens are concatenated with spaces and appended to config+48 (the toolkit info string that ends up in .note.nv.tkinfo).
Codegen Pipeline Invocation
After configuration, the function calls sub_6F52F0 (DecodePipeline::RunStages) with 18 parameters including the configuration object, all 7 descriptor arrays, the ELFW context, and the function name. The return code is mapped:
| sub_6F52F0 return | sub_60FBF0 return | Meaning |
|---|---|---|
| 0 | 0 | Success |
| 1 | 14 | Mercury encode failure |
| 2 | 22 | Mercury decode failure |
Post-Pipeline Section Registration
After the Mercury pipeline returns successfully:
- Calls sub_60DD30 twice for pre/post code region finalization
- Calls sub_60DBE0 for each optional symbol table (texture, surface, global) to register their sections with the ELFW emitter
- Calls sub_1CB9C30 on the ELFW object (a2+32) to commit all sections
- If SM version <= 0x45 (SM 69): creates UFT/UDT entries (section types 68/69) for each resolved symbol
- Under mutex lock, ORs the per-function WAR bitmask (config+232..240) into the global accumulator at a2+504
Thread Safety
All shared state modifications are protected by the mutex at a2+240:
- String length accumulator updates (a2+296, a2+304) for string table pre-allocation
- WAR bitmask accumulation (a2+504)
- .sass_map section setup (instruction tree access)
- Instruction merge from secondary codegen contexts (a2+80, a2+88)
Key Functions
| Address | Size | Purpose |
|---|---|---|
| sub_60FBF0 | ~76 KB decompiled | Per-kernel section attribute builder (section above) |
| sub_1CC9800 | 14,764 B (90 KB decompiled) | EIATTR builder -- master nvinfo section constructor |
| sub_1CC8950 | 2,634 B | EIATTR propagator -- barrier/register cross-function propagation |
| sub_1CC85F0 | ~200 B | EIATTR record emitter -- writes one TLV record |
| sub_1CC86D0 | ~500 B | Per-entry EIATTR emission (MIN_STACK_SIZE, CRS_STACK_SIZE, SAM_REGION_STACK_SIZE) |
| sub_1CC7FB0 | ~200 B | .nv.info section creator/finder |
| sub_1CC93A0 | ~500 B | .nv.compat attribute processor |
| sub_1CB3570 | 1,963 B | Generic section creator (44 call sites) |
| sub_1CB42D0 | -- | .text.<func> section creator |
| sub_1C9DC60 | 5,663 B | Section layout calculator (offset assignment) |
| sub_1CABD60 | 11,856 B | Master section allocator (shared/constant/local addresses) |
| sub_1CBE1B0 | ~10 KB | .nv.callgraph section builder |
| sub_1C97840 | -- | Architecture-gated EIATTR check |
| sub_1CA6890 | 454 lines | Constant bank value deduplication |
| sub_1CA92F0 | 585 lines | Shared memory interference graph + coloring |
Cross-References
- Custom ELF Emitter -- ELFW object, section ordering, ELF header
- Relocations & Symbols -- relocation resolution, UFT/UDT management
- Debug Information -- DWARF generation and .debug_* sections
- Mercury Encoder -- Mercury encoding that feeds .nv.merc.* sections
- Capsule Mercury -- SM 100+ capsule and .nv.capmerc sections
- Pipeline Overview -- where section emission fits in the pipeline
Debug Information
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas generates DWARF-based debug information for cuda-gdb and other GPU debuggers. The debug subsystem spans three distinct code regions: an early-pipeline DWARF line table generator at 0x45A--0x45C that encodes PTX-level source mappings, a mid-pipeline SASS-level emitter at 0x860--0x868 that produces both .debug_line and .nv_debug_line_sass sections along with register mapping tables, and a late-stage DWARF processor/dumper cluster at 0x1CBF--0x1CC9 that handles .debug_info, .debug_abbrev, .debug_loc, and .debug_frame parsing and emission. The design follows a two-tier model: PTX-level debug info records source file/line to PTX instruction mappings, while SASS-level debug info records the final PTX-to-SASS address correspondence after all optimizations. NVIDIA extends standard DWARF with proprietary sections (.nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type, .nv_debug_info_ptx) and Mercury-namespace variants (.nv.merc.debug_*) for Capsule Mercury binaries.
| DWARF line generator (PTX) | sub_45C3A0 (9,041 bytes) -- PTX source line to address mapping |
| LEB128 encoder | sub_45A870 (5,293 bytes) -- variable-length integer encoding for DWARF |
| Debug line table (SASS) | sub_866BB0 (3,273 bytes) -- .debug_line / .nv_debug_line_sass |
| Debug top-level entry | sub_867880 (100 bytes) -- calls line generator twice (PTX + SASS) |
| Reg info SASS emitter | sub_8679F0 (225 bytes) -- .nv_debug_info_reg_sass |
| Reg type emitter | sub_867B00 (230 bytes) -- .nv_debug_info_reg_type |
| Post-RA debug annotator | sub_88D870 (2,656 bytes) -- final source line annotation |
| DWARF form name table | sub_1CBF820 (400 bytes) -- DW_FORM_* ID-to-string |
| DWARF attribute name table | sub_1CBF9B0 (1,600 bytes) -- DW_AT_* ID-to-string |
| .debug_abbrev parser | sub_1CC0850 (3,704 bytes) -- abbreviation table handler |
| .debug_info parser | sub_1CC4A40 (5,218 bytes) -- DIE tree walker |
| CU header parser | sub_1CC5EB0 (2,023 bytes) -- compilation unit headers |
| Location expression printer | sub_1CC34E0 (3,094 bytes) -- DW_OP_* decoder |
| DWARF info processor | sub_1CC24C0 (3,993 bytes) -- non-dump emission mode |
| Debug section classifier | sub_1C9D1F0 (2,667 bytes) -- section name to type ID mapper |
| Mercury debug classifier | sub_1C98C60 (1,755 bytes) -- .nv.merc.debug_* classifier |
| SASS debug classifier | sub_1C99340 -- .debug_* standard section classifier |
| Debug section type mapper | sub_1C998D0 -- maps section name to internal buffer pointer |
| DWARF attribute emitter | sub_66A0B0 (28 KB) -- emits DWARF attributes during IR lowering |
| DWARF debug info builder | sub_66F4E0 (59 KB) -- main DWARF debug info section builder |
| DWARF line table builder | sub_66E250 (33 KB) -- builds .debug_line during IR phase |
| Debug line number formatter | sub_671C00 (11 KB) -- formats line number records |
CLI Flags
Two flags control debug information generation:
| Flag | Internal field | Effect |
|---|---|---|
--device-debug / -g | deviceDebug at ELFW context + 432 | Full debug info: all DWARF sections (.debug_info, .debug_abbrev, .debug_frame, .debug_line, .debug_loc, .debug_str, .debug_aranges), plus NVIDIA extensions. Disables most optimizations, preserving source-level variable correspondence. |
--lineinfo / -ln | lineInfo in option context | Line tables only: generates .debug_line and .nv_debug_line_sass without full DWARF DIE trees. Preserves optimization levels. Sufficient for cuda-memcheck and profiler source correlation. |
--suppress-debug-info | suppresses emission | Strips all debug sections from output, even if debug input was provided. |
The cubin entry point sub_612DE0 reads both deviceDebug and lineInfo flags and passes them through the ELF output pipeline. The section classifier at sub_1C9D1F0 checks the byte at context offset +432 (deviceDebug) to decide whether to emit .debug_frame and .debug_line sections -- when this byte is zero (no -g), those sections are conditionally suppressed.
The --lineinfo flag is described in the CLI as "Generate debug line table information" and is orthogonal to -g. When only --lineinfo is active, ptxas generates the two line table sections but omits the heavyweight .debug_info/.debug_abbrev/.debug_loc sections. The string "device-debug or lineinfo" appears in a validation check that prevents --extensible-whole-program from being combined with either debug mode.
Debug Section Catalog
ptxas generates three tiers of debug sections depending on compilation mode. Standard DWARF sections use the conventional .debug_* namespace. NVIDIA extensions use .nv_debug_* names. Capsule Mercury binaries additionally carry .nv.merc.debug_* clones.
Standard DWARF Sections
| Section | DWARF standard | Content |
|---|---|---|
| .debug_abbrev | Yes (DWARF 2+) | Abbreviation table defining DIE tag/attribute schemas |
| .debug_aranges | Yes | Address range table mapping compilation units to code ranges |
| .debug_frame | Yes | Call frame information (CFA rules for unwinding) |
| .debug_info | Yes | DIE tree: compilation units, subprograms, variables, types |
| .debug_line | Yes | Line number program (source file/line to PTX address mapping) |
| .debug_loc | Yes | Location lists for variables with multiple storage locations |
| .debug_macinfo | Yes | Macro information |
| .debug_pubnames | Yes | Public name lookup table |
| .debug_str | Yes | String table referenced by DW_FORM_strp |
NVIDIA Extension Sections
| Section | Content |
|---|---|
| .nv_debug_line_sass | SASS-level line table: maps SASS instruction addresses to PTX source lines. Parallel to .debug_line but at the machine code level. |
| .nv_debug_info_reg_sass | Register-to-variable mapping for SASS. Records which physical GPU register(s) hold each source variable at each program point. |
| .nv_debug_info_reg_type | Type information for register mappings. Associates register locations with DWARF type descriptions. |
| .nv_debug_info_ptx | PTX-level debug info section. Created by sub_1CC5EB0 as a PTX-namespace mirror of .debug_info. |
| .nv_debug.shared | Debug metadata for shared memory variables. |
Mercury Namespace Variants
For Capsule Mercury binaries (SM 100+), every debug section is cloned into the .nv.merc.* namespace. The Mercury debug classifier sub_1C98C60 recognizes 15 Mercury-namespaced debug sections:
.nv.merc.debug_abbrev .nv.merc.debug_aranges
.nv.merc.debug_frame .nv.merc.debug_info
.nv.merc.debug_loc .nv.merc.debug_macinfo
.nv.merc.debug_pubnames .nv.merc.debug_str
.nv.merc.debug_line (and additional variants)
These sections carry the PTX-level debug information that travels inside the Mercury capsule, enabling deferred finalization to produce debug-capable SASS without re-invoking the full compiler.
Debug Information Pipeline
Debug data flows through three pipeline stages. Each stage operates on a different intermediate representation and produces output at a different abstraction level.
STAGE 1: PTX PARSING (0x45A-0x45C)
PTX source --> .loc directives --> DWARF line number program
sub_45C3A0: Reads .loc directives from PTX input
Builds file/directory tables
Generates DWARF line number program (LEB128-encoded)
Creates "$LDWend" end-of-program label
Uses function-to-index map ("function index not found
in debug function map" on error)
sub_45A870: LEB128 encoder for all numeric fields:
file number, prologue size, address advance,
line advance, context, function offset
STAGE 2: IR LOWERING (0x66A-0x672)
Ori IR instructions carry debug info at instruction node offset +20
sub_66F4E0 (59KB): Main DWARF debug info builder
sub_66E250 (33KB): DWARF .debug_line builder
sub_66A0B0 (28KB): DWARF attribute emitter (directory id, time stamp, file size)
sub_671C00 (11KB): Debug line number formatter (context, functionOffset, line number)
STAGE 3: POST-RA + ELF EMISSION (0x860-0x868, 0x1CBF-0x1CC9)
After register allocation, physical register assignments are known.
sub_88D870: PostRA debug info annotator -- finalizes source line
mappings after all code motion/scheduling
sub_867880: Top-level debug entry -- calls sub_866BB0 twice:
once with a3=0 for .debug_line (PTX-level)
once with a3=1 for .nv_debug_line_sass (SASS-level)
sub_866BB0: DWARF .debug_line section generator
sub_8679F0: .nv_debug_info_reg_sass emitter
sub_867B00: .nv_debug_info_reg_type emitter
Line Number Table Generation
The DWARF .debug_line section generator sub_866BB0 is the central function for line table construction. It produces standard DWARF 2 line number programs that map addresses to source locations.
Parameters
// sub_866BB0 -- DebugLineTableGenerator
// a1: debug_line_context (pointer to ~460-byte state structure)
// a2: ELF output context (ELFW object)
// a3: section index (0 = .debug_line, nonzero = .nv_debug_line_sass)
// a4: unused
// a5: source file path (const char*, used for .nv_debug_line_sass only)
Algorithm
- Section selection: Based on a3, selects the target section name. For a3 == 0, looks up or creates .debug_line via sub_1CB2C60 / sub_1CA7AB0. For a3 != 0, uses .nv_debug_line_sass.
- Source file collection: If a5 provides a source file path and the section is the SASS variant, copies the filename into the debug context at offset +272. Iterates source files via sub_4271E0 (directory iterator), sorts entries using sub_866B80 (file comparison callback).
- Directory table construction: For each source file, splits the path into directory and filename components. Records unique directories via a hash table (sub_426150 / sub_426D60). Each directory gets a sequential index.
- File table construction: Each source file entry is a 40-byte record:
  - File name string
  - Directory index (LEB128)
  - Modification timestamp from stat() (st_mtim.tv_sec)
  - File size from stat() (st_size)
- Line number program generation: Generates the DWARF line number state machine program using standard opcodes (DW_LNS_copy, DW_LNS_advance_pc, DW_LNS_advance_line, DW_LNS_set_file, etc.) and special opcodes for compact address/line delta encoding.
- Finalization: Writes the complete section via sub_1CA7180 (ELF section write).
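The special opcodes mentioned in the line-number-program step fold an (address advance, line advance) pair into a single byte using the standard DWARF 2 formula. The header parameters below are common defaults used here for illustration, not values recovered from the ptxas binary.

```c
#include <assert.h>

/* Standard DWARF 2 special-opcode computation (not ptxas-specific).
   Illustrative header parameters; real values come from the line
   program header the producer writes. */
enum { OPCODE_BASE = 10, LINE_BASE = -5, LINE_RANGE = 14 };

/* Returns the one-byte special opcode for the given deltas, or -1 if
   the pair does not fit and the producer must fall back to explicit
   DW_LNS_advance_pc / DW_LNS_advance_line opcodes. */
static int special_opcode(int addr_delta, int line_delta) {
    if (line_delta < LINE_BASE || line_delta >= LINE_BASE + LINE_RANGE)
        return -1;
    int op = (line_delta - LINE_BASE) + LINE_RANGE * addr_delta + OPCODE_BASE;
    return (op <= 255) ? op : -1;
}
```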
Debug Line Context Structure
debug_line_context (at a1, ~460 bytes):
+0: vtable pointer
+16: SASS line info pointer (nonzero triggers second pass)
+64: per-section context base (160-byte stride, indexed by a3)
+96: filename buffer pointer
+104: filename buffer size
+108: file_count
+112: directory buffer pointer
+120: directory_count
+216: source file count (.debug_line variant)
+256: raw filename pointer
+268: raw filename flag
+272: filename copy buffer (for .nv_debug_line_sass)
+280: filename copy length
+376: source file count (.nv_debug_line_sass variant)
+408: .nv_debug_info_reg_sass buffer chain (linked list head)
+416: .nv_debug_info_reg_sass final buffer pointer
+424: .nv_debug_info_reg_sass total size
+432: .nv_debug_info_reg_type buffer chain (linked list head)
+440: .nv_debug_info_reg_type final buffer pointer
+448: .nv_debug_info_reg_type total size
+456: memory arena / pool allocator
Top-Level Entry
The top-level debug emitter sub_867880 is minimal -- it calls the line table generator twice:
// sub_867880 -- DebugInfoTopLevel (simplified)
void emit_debug_info(ctx, elf, aux, source_path) {
sub_866BB0(ctx, elf, 0, aux, source_path); // .debug_line
if (ctx->sass_line_info) // offset +16
sub_866BB0(ctx, elf, 1, aux, source_path); // .nv_debug_line_sass
}
LEB128 Encoding
The DWARF standard uses LEB128 (Little-Endian Base 128) variable-length encoding for integers throughout debug sections. ptxas implements this in sub_45A870, which handles encoding for multiple fields. Error strings in this function reveal the field types being encoded:
| Field | Error string on overflow |
|---|---|
| File number | "when generating LEB128 number for file number" |
| Prologue marker | "when generating LEB128 number for setting prologue" |
| Address advance | "when generating LEB128 number for address advance" |
| Line advance | "when generating LEB128 number for line advance" |
| Context | "when generating LEB128 number for setting context" |
| Function offset | "when generating LEB128 number for setting function Offset" |
| File timestamp | "when generating LEB128 number for timestamp" (in sub_866BB0) |
| File size | "when generating LEB128 number for file size" (in sub_866BB0) |
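All of the fields above feed the same core encoding. A minimal standard ULEB128 encoder, equivalent to what sub_45A870 must produce (the interface is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard unsigned LEB128: emit 7 bits per byte, low bits first,
   setting bit 7 on every byte except the last. Returns the number of
   bytes written to out. */
static size_t encode_uleb128(uint64_t value, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value)
            byte |= 0x80;   /* more bytes follow */
        out[n++] = byte;
    } while (value);
    return n;
}
```

For example, the classic DWARF test value 624485 encodes as the three bytes 0xE5 0x8E 0x26.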
Register-to-Variable Mapping
After register allocation, ptxas knows the physical GPU registers assigned to each source variable. Two NVIDIA extension sections capture this:
.nv_debug_info_reg_sass
Emitted by sub_8679F0. Records which physical registers (R0--R255, P0--P7, UR0--UR63) hold which source variables at each SASS instruction address. The data is accumulated during code generation into a linked list of buffer chunks at debug context offsets +408/+416/+424. At emission time, the chunks are concatenated into a contiguous buffer and written as an ELF section:
// sub_8679F0 -- simplified
void emit_reg_sass(debug_ctx, elf) {
// Collect linked list of buffer chunks from debug_ctx+408
chunks = linked_list_to_array(debug_ctx->reg_sass_chain);
buf = allocate(debug_ctx->reg_sass_total_size);
offset = 0;
for (chunk in chunks) {
memcpy(buf + offset, chunk->data, chunk->size);
offset += chunk->size;
}
section = create_section(elf, ".nv_debug_info_reg_sass", 0, 1, 0);
write_section(elf, section, buf, 1, total_size);
}
.nv_debug_info_reg_type
Emitted by sub_867B00. Structurally identical to the reg_sass emitter but operates on offsets +432/+440/+448. Associates register locations with DWARF type information, enabling the debugger to interpret register contents correctly (e.g., distinguishing a 32-bit float in R5 from a 32-bit integer in R5).
DWARF Processing Subsystem
The DWARF processing cluster at 0x1CBF--0x1CC9 handles both generation and diagnostic dumping of DWARF sections. The code can operate in two modes: a dump mode that prints human-readable representations (for --dump-debug-info or internal diagnostics), and an emission mode that processes raw DWARF bytes for the final binary.
DWARF Form Table -- sub_1CBF820
Maps DWARF form IDs to string names. Supports DWARF 2 forms:
| ID | Form | Encoding |
|---|---|---|
| 1 | DW_FORM_addr | Target address |
| 3 | DW_FORM_block2 | 2-byte length block |
| 4 | DW_FORM_block4 | 4-byte length block |
| 5 | DW_FORM_data2 | 2-byte unsigned |
| 6 | DW_FORM_data4 | 4-byte unsigned |
| 7 | DW_FORM_data8 | 8-byte unsigned |
| 8 | DW_FORM_string | Null-terminated inline |
| 9 | DW_FORM_block | ULEB128 length block |
| 10 | DW_FORM_block1 | 1-byte length block |
| 11 | DW_FORM_data1 | 1-byte unsigned |
| 12 | DW_FORM_flag | Boolean byte |
| 13 | DW_FORM_sdata | Signed LEB128 |
| 14 | DW_FORM_strp | 4-byte offset into .debug_str |
| 15 | DW_FORM_udata | Unsigned LEB128 |
| 16 | DW_FORM_ref_addr | Address-sized reference |
| 17 | DW_FORM_ref1 | 1-byte CU-relative reference |
| 18 | DW_FORM_ref2 | 2-byte CU-relative reference |
| 19 | DW_FORM_ref4 | 4-byte CU-relative reference |
| 20 | DW_FORM_ref8 | 8-byte CU-relative reference |
| 21 | DW_FORM_ref_udata | ULEB128 CU-relative reference |
| 22 | DW_FORM_indirect | Form specified inline |
The absence of DWARF 4/5 forms (e.g., DW_FORM_sec_offset, DW_FORM_exprloc, DW_FORM_flag_present) indicates ptxas targets DWARF version 2, consistent with the pointer size and CU header format observed in sub_1CC5EB0.
DWARF Attribute Table -- sub_1CBF9B0
Maps DWARF attribute IDs to string names. The function recognizes a comprehensive set of standard attributes. Notable entries include:
- Location attributes: DW_AT_location (2), DW_AT_frame_base (64), DW_AT_data_member_location (56)
- Name/type: DW_AT_name (3), DW_AT_type (73), DW_AT_encoding (62)
- Scope: DW_AT_low_pc (17), DW_AT_high_pc (18), DW_AT_stmt_list (16)
- Producer: DW_AT_producer (37), DW_AT_comp_dir (27), DW_AT_language (19)
- Subprogram: DW_AT_inline (32), DW_AT_prototyped (39), DW_AT_artificial (52)
- Array: DW_AT_lower_bound (34), DW_AT_upper_bound (47), DW_AT_count (55)
- Calling: DW_AT_calling_convention (54), DW_AT_return_addr (42)
- Accessibility: DW_AT_accessibility (50), DW_AT_external (63)
- C++ support: DW_AT_vtable_elem_location (77), DW_AT_containing_type (29)
.debug_abbrev Parser -- sub_1CC0850
Parses the abbreviation table that defines the schema for each DIE tag. The dump mode output header is:
Contents of the .debug_abbrev section:
Number TAG
Each entry includes:
- Abbreviation number
- TAG name (e.g., DW_TAG_compile_unit, DW_TAG_subprogram)
- Children indicator: [has children] or [has no children]
- Attribute-form pairs
The function includes a safety check: "unexpectedly too many dwarf attributes for any DW_TAG entry!" -- a guard against malformed or corrupt abbreviation tables.
.debug_info Parser -- sub_1CC4A40
Walks the DIE tree, printing entries with nesting depth indentation:
<%d><%x>: Abbrev Number: %d (0x%02x %s)
Format: <nesting_depth><byte_offset>: Abbrev Number: <n> (<tag_hex> <tag_name>). Null DIEs are printed as " (nill) ". Attribute values are formatted by sub_1CC4100 (the attribute value printer) which dispatches on form type.
Compilation Unit Header -- sub_1CC5EB0
Parses and prints CU headers, and creates the NVIDIA extension .nv_debug_info_ptx section:
Compilation Unit @ offset 0x%zx:
Length: %d
Version: %d
Abbrev Offset: %d
Pointer Size: %d
The pointer size field is significant -- it determines the size of DW_FORM_addr values and DW_FORM_ref_addr references throughout the CU.
Location Expression Decoder -- sub_1CC34E0
Decodes DWARF location expressions (DW_OP_* operations) used in DW_AT_location and related attributes. The supported operations reveal how ptxas encodes GPU variable locations:
| Operation | String | GPU usage |
|---|---|---|
| DW_OP_addr | "DW_OP_addr: 0x%x" | Absolute memory address (global/shared/local) |
| DW_OP_const4u | "DW_OP_const4u: %d" | 4-byte unsigned constant |
| DW_OP_xderef | "DW_OP_xderef" | Cross-address-space dereference (GPU memory spaces) |
| DW_OP_plus_uconst | "DW_OP_plus_uconst: %llu" | Add unsigned constant to stack top |
| DW_OP_lit0--DW_OP_lit31 | "DW_OP_lit%u" | Push literal 0--31 |
| DW_OP_reg0--DW_OP_reg31 | "DW_OP_reg%d" | Variable in register N |
| DW_OP_breg0--DW_OP_breg31 | "DW_OP_breg%d %lld" | Register N + signed offset |
| DW_OP_fbreg | "DW_OP_fbreg: %lld" | Frame base + signed offset (stack variables) |
| DW_OP_nop | "DW_OP_nop" | No operation |
| DW_OP_stack_value | "DW_OP_stack_value" | Value is on DWARF expression stack, not in memory |
The presence of DW_OP_xderef is particularly noteworthy -- this is a DWARF operation rarely used in CPU debuggers but essential for GPU debugging, where variables may reside in different memory spaces (global, shared, local, constant) that require address-space-qualified access.
Debug Section Classification
Three classifier functions map section names to internal type IDs. The type IDs route sections to the correct processing pipeline during ELF assembly.
SASS Classifier -- sub_1C99340
Recognizes standard DWARF sections by comparing the section name (obtained via sub_1CB9E50) against hardcoded strings. Returns 1 (is-debug-section) for:
.debug_abbrev .debug_aranges .debug_frame
.debug_info .debug_loc .debug_macinfo
.debug_pubnames .debug_str .debug_line
Plus the NVIDIA extension:
.nv_debug_info_reg_sass
Mercury Classifier -- sub_1C98C60
The Mercury classifier sub_1C98C60 checks for the .nv.merc. prefix on the same set of debug section names. It uses a strcmp chain against 15 Mercury-namespaced section names. This classifier is called from 4 sites, primarily during Capsule Mercury construction when debug sections need to be cloned into the merc namespace.
Unified Classifier -- sub_1C9D1F0
The master debug section classifier sub_1C9D1F0 (2,667 bytes, 13 callees) handles both SASS and Mercury variants. It:
- Checks whether the section has the Mercury flag (bit 0x10 in section flags byte at offset +11)
- For Mercury sections, dispatches to the .nv.merc.debug_* name check
- For standard sections, dispatches to the .debug_* name check
- Recognizes additional NVIDIA extensions: .nv_debug_line_sass, .nv_debug_info_reg_sass, .nv_debug_info_reg_type
- Uses setjmp/longjmp error recovery for malformed section handling
The function also checks the deviceDebug flag at context offset +432 to suppress .debug_frame and .debug_line when debug info is not requested. This is the gate that prevents line tables from appearing in release builds.
Section Type Mapper -- sub_1C998D0
Maps debug section names to internal buffer pointers within the debug context object:
| Section name | Returns pointer at offset |
|---|---|
| .debug_line | context + 80 (a1[10]) |
| .debug_frame | context + 72 (a1[9]) |
| .nv_debug_line_sass | context + 88 (a1[11]) |
| .debug_info | (subsequent check) |
| .debug_loc | (subsequent check) |
This enables the ELF emitter to route section data to the correct output buffer during final assembly.
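A sketch of that strcmp chain, modeling the debug context as an array of pointer-sized slots so that context+80 is slot a1[10]; the NULL fallback standing in for the subsequent .debug_info/.debug_loc checks is illustrative.

```c
#include <assert.h>
#include <string.h>

/* Sketch of sub_1C998D0: route a debug section name to the matching
   output-buffer slot in the debug context. The context is modeled as
   an array of 8-byte slots, so context+80 is ctx[10], etc. */
static void **debug_buffer_slot(void **ctx, const char *name) {
    if (strcmp(name, ".debug_line") == 0)
        return &ctx[10];                    /* context + 80 */
    if (strcmp(name, ".debug_frame") == 0)
        return &ctx[9];                     /* context + 72 */
    if (strcmp(name, ".nv_debug_line_sass") == 0)
        return &ctx[11];                    /* context + 88 */
    return 0;  /* .debug_info / .debug_loc handled by subsequent checks */
}
```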
Instruction-Level Debug Metadata
Each internal instruction node in the Ori IR carries debug metadata at offset +20 (a pointer or encoded value for source line information) and offset +0 (a pointer to the PTX source location). This metadata travels through the entire optimization pipeline:
- PTX parsing: The parser records .loc directives and attaches source file/line/column to each instruction as it is lowered from PTX to the Ori IR.
- Optimization passes: Most optimization passes preserve or propagate debug metadata. When instructions are cloned (e.g., loop unrolling), the clone inherits the original's debug info. When instructions are deleted, their debug info is lost -- the debugger will map those addresses to the nearest surviving instruction's source line.
- Post-RA annotation (sub_88D870): After register allocation and scheduling, this pass finalizes the source-line-to-SASS-address correspondence. It walks all instructions and records the final mapping that will be encoded into the .nv_debug_line_sass section.
- ELF emission: The debug line table generator sub_866BB0 reads the finalized mappings and encodes them as a DWARF line number program.
The PTX-level line generator sub_45C3A0 uses the label $LDWend as an end-of-debug-range marker. The error message "function index not found in debug function map" indicates a function-to-index mapping that translates internal function identifiers to DWARF subprogram indices.
DWARF Version and Extensions
ptxas generates DWARF version 2 debug information. Evidence:
- The form table (sub_1CBF820) covers exactly forms 1--22, which is the DWARF 2 form set. No DWARF 3+ forms (DW_FORM_sec_offset = 0x17, DW_FORM_exprloc = 0x18) are present.
- The CU header parser (sub_1CC5EB0) prints "Version: %d" as a field, consistent with the 11-byte DWARF 2 CU header format.
- The attribute table includes DWARF 2 attributes only.
CUDA-Specific DWARF Extensions
NVIDIA extends standard DWARF for GPU debugging through:
- Address space encoding: DW_OP_xderef is used with address space qualifiers to distinguish GPU memory spaces. The DW_AT_address_class attribute (recognized in the attribute table at ID 51) encodes CUDA memory space identifiers.
- Parallel execution model: GPU warps execute 32 threads simultaneously. Debug info must account for the fact that each "program counter" corresponds to 32 concurrent threads, and divergent threads may be at different source locations.
- Register mapping: The .nv_debug_info_reg_sass section provides a CUDA-specific register-to-variable mapping that goes beyond standard DWARF location lists. GPU register files are much larger (up to 255 general-purpose registers) and have different allocation semantics than CPU registers.
- PTX/SASS duality: The two-tier line table (.debug_line for PTX, .nv_debug_line_sass for SASS) reflects the unique compilation model where PTX is an intermediate representation with its own debug significance.
Section Layout Considerations
The ELF section layout calculator sub_1C9DC60 applies special handling to .debug_line:
- .debug_line sections receive special padding during layout to ensure proper alignment
- Debug sections are placed after code and data sections in the ELF file
- The .debug_line section name is one of only three section names explicitly checked by the layout calculator (alongside .nv.constant0 and .nv.reservedSmem)
Key Address Summary
| Address range | Subsystem | Function count |
|---|---|---|
| 0x45A000--0x45F000 | PTX-level DWARF line generator + LEB128 | ~6 |
| 0x660000--0x675000 | IR-level DWARF builder/emitter | ~8 |
| 0x860000--0x869000 | SASS-level debug line + register info | ~8 |
| 0x88D000--0x88E000 | Post-RA debug annotator | 1 |
| 0x1C98000--0x1C9A000 | Section classifiers (merc + SASS) | ~6 |
| 0x1C9D000--0x1C9E000 | Unified section classifier | 1 |
| 0x1CBF000--0x1CC7000 | DWARF processor/dumper cluster | ~12 |
Relocations & Symbols
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas defines two parallel relocation type systems for CUBIN ELF files: R_CUDA_* (117 types, ordinals 0--116) for SASS-encoded cubins targeting SM 30--90a, and R_MERCURY_* (65 types, ordinals 0--64) for Mercury-encoded cubins targeting SM 100+ (Blackwell and later). Both systems use standard Elf64_Rela relocation entries in .rela.text.<funcname> sections, with a custom resolution algorithm that handles alias redirection, dead function filtering, UFT/UDT pseudo-relocations, PC-relative branch validation, and sub-byte instruction patching. The symbol table (.symtab) follows standard ELF Elf64_Sym format with CUDA-specific symbol types and an extended section index mechanism (.symtab_shndx) for programs exceeding 65,280 sections.
| Relocation resolver | sub_1CD48C0 (4,184 bytes binary, 22 KB decompiled, 17 callees) |
| Relocation writer | sub_1CD5920 (1,985 bytes binary, 11 KB decompiled) |
| Relocation creator (SASS) | sub_1CD4510 (860 bytes binary) |
| Relocation creator (Mercury) | sub_1CD46B0 (540 bytes binary) |
| Relocation pre-scan | sub_1CD43A0 (560 bytes binary) |
| Bit-field patcher | sub_1CD34E0 (3,700 bytes binary, sub_1CD33F0/sub_1CD3330 helpers) |
| Symbol table builder | sub_1CB68D0 (9,578 bytes binary, 49 KB decompiled, 36 callees) |
| Symbol fixup | sub_1CB2CA0 (2,038 bytes binary, 4 call sites) |
| Section index remap | sub_1C99BB0 (4,900 bytes binary) |
| UFT manager | sub_1CD22E0 (1,979 bytes binary, 10 KB decompiled) |
| UFT slot validator | sub_1CD2AA0 (~800 bytes binary) |
| Bindless handler | sub_1CAB300 (2,157 bytes binary, 12 KB decompiled) |
| R_CUDA table address | off_2408B60 (117 entries x 64 bytes) |
| R_MERCURY table address | off_2407B60 (65 entries x 64 bytes) |
Relocation Type Systems
Table Selection Logic
The ELFW object stores the ELF class byte at offset 7 and a flags word at offset 48. The relocation subsystem selects between the two tables based on the IsPIC flag combined with the ELF class:
// Table selection (reconstructed from sub_1CD48C0, sub_1CD4510, sub_1CD5920)
uint32_t test_bit = (elfw->ei_class == 'A') ? 1 : 0x80000000;
bool is_mercury = (test_bit & elfw->flags) != 0;
if (is_mercury) {
// SM 100+ Mercury encoding: off_2407B60
// Type codes start at 0x10000; subtract to index the table
table = &R_MERCURY_table; // off_2407B60
index = raw_type - 0x10000; // range check: index <= 0x3F (63)
} else {
// SM 30-90a SASS encoding: off_2408B60
table = &R_CUDA_table; // off_2408B60
index = raw_type; // range check: index <= 0x73 (115)
}
Mercury relocation type codes are stored with a 0x10000 offset in the internal relocation entry's type field. This lets a single code path handle both systems -- the table selection just subtracts the offset for Mercury types.
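The selection-plus-normalization logic can be restated as a single helper. The following is a minimal sketch of the reconstructed behavior; the function and type names (`select_reloc_table`, `reloc_table_ref`) are ours, not recovered from the binary:

```c
#include <stdbool.h>
#include <stdint.h>

#define MERCURY_TYPE_BASE 0x10000u  /* Mercury ordinals are stored offset by this */

/* Hypothetical normalized result: which table, and the descriptor index. */
typedef struct {
    bool is_mercury;
    uint32_t index;   /* ordinal into the R_CUDA or R_MERCURY descriptor table */
    bool valid;       /* false if the raw type code is out of range */
} reloc_table_ref;

/* Normalize a raw relocation type code into a (table, index) pair,
 * mirroring the reconstructed logic of sub_1CD48C0/sub_1CD4510. */
reloc_table_ref select_reloc_table(uint32_t raw_type, bool mercury_flag) {
    reloc_table_ref r = {0};
    if (mercury_flag) {
        r.is_mercury = true;
        r.index = raw_type - MERCURY_TYPE_BASE;
        r.valid = raw_type >= MERCURY_TYPE_BASE && r.index <= 0x3F;
    } else {
        r.is_mercury = false;
        r.index = raw_type;
        r.valid = r.index <= 0x73;
    }
    return r;
}
```

Because Mercury type codes carry the 0x10000 bias in every internal entry, downstream code never needs a separate "which ISA" flag per relocation; the bias itself disambiguates.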
Relocation Descriptor Table Format
Each entry in the relocation type descriptor table is 64 bytes (8 qwords). The layout is accessed through pointer arithmetic patterns like table[8 * index + N] where the table pointer type is char** (8-byte stride):
// Relocation type descriptor -- 64 bytes per entry (reconstructed)
struct reloc_type_desc {
const char* name; // +0: R_CUDA_* or R_MERCURY_* name string
uint32_t unknown_08; // +8: unknown field
uint32_t unknown_0c; // +12: unknown field
uint32_t bit_start; // +16: starting bit position in instruction
uint32_t bit_width; // +20: field width in bits
uint32_t patch_mode; // +24: patching mode (0=none, 1=direct, 6/7=split)
uint32_t flags_hi; // +28: high flags (value 12-15 triggers callgraph)
// ... remaining 32 bytes: additional patching parameters
};
The patch_mode field at offset +24 drives the bit-field patching logic in sub_1CD34E0. The switch statement handles these modes:
| Mode | Description | Types |
|---|---|---|
| 0 | No-op (sentinel/terminator) | R_CUDA_NONE, R_CUDA_NONE_LAST |
| 1, 0x12, 0x2E | Direct bit-field write (full or partial 64-bit word) | Most absolute/PC-relative types |
| 6, 0x37 | Split low-word patching (handles cross-qword boundaries) | LO types, sub-byte 8_N types |
| 7, 0x38 | Split high-word patching (uses HIDWORD of value) | HI types |
When flags_hi (at descriptor offset +28) is in the range 12--15, the relocation creator calls sub_1CBD0D0 to register the relocation's target section in the call graph. This triggers call graph edge creation for function descriptors and branch targets.
R_CUDA_* Relocation Types
117 types from R_CUDA_NONE (ordinal 0) to R_CUDA_NONE_LAST (ordinal 116). String addresses span 0x23FBE0E--0x23FC6B6 in the ptxas binary, confirming these are contiguous in the read-only data section. Ordinals are assigned by string table order.
Absolute Address Relocations
| Ordinal | Name | Bit Field | Purpose |
|---|---|---|---|
| 0 | R_CUDA_NONE | -- | Sentinel / no relocation |
| 1 | R_CUDA_32 | 32-bit | Absolute 32-bit address |
| 2 | R_CUDA_64 | 64-bit | Absolute 64-bit address |
| 5 | R_CUDA_ABS32_26 | 32-bit at bit 26 | Absolute address, 26-bit encoding |
| 10 | R_CUDA_ABS32_LO_26 | low 32 at bit 26 | Low half of 64-bit address |
| 11 | R_CUDA_ABS32_HI_26 | high 32 at bit 26 | High half of 64-bit address |
| 12 | R_CUDA_ABS32_23 | 32-bit at bit 23 | Absolute address, 23-bit encoding |
| 13 | R_CUDA_ABS32_LO_23 | low 32 at bit 23 | Low half, 23-bit encoding |
| 14 | R_CUDA_ABS32_HI_23 | high 32 at bit 23 | High half, 23-bit encoding |
| 15 | R_CUDA_ABS24_26 | 24-bit at bit 26 | 24-bit absolute address |
| 16 | R_CUDA_ABS24_23 | 24-bit at bit 23 | 24-bit absolute, 23-bit encoding |
| 17 | R_CUDA_ABS16_26 | 16-bit at bit 26 | 16-bit absolute address |
| 18 | R_CUDA_ABS16_23 | 16-bit at bit 23 | 16-bit absolute, 23-bit encoding |
| 42 | R_CUDA_ABS32_20 | 32-bit at bit 20 | Volta+ encoding format |
| 43 | R_CUDA_ABS32_LO_20 | low 32 at bit 20 | Low half, 20-bit encoding |
| 44 | R_CUDA_ABS32_HI_20 | high 32 at bit 20 | High half, 20-bit encoding |
| 45 | R_CUDA_ABS24_20 | 24-bit at bit 20 | 24-bit, 20-bit encoding |
| 46 | R_CUDA_ABS16_20 | 16-bit at bit 20 | 16-bit, 20-bit encoding |
| 55 | R_CUDA_ABS32_32 | 32-bit at bit 32 | Ampere+ encoding format |
| 56 | R_CUDA_ABS32_LO_32 | low 32 at bit 32 | Low half, 32-bit position |
| 57 | R_CUDA_ABS32_HI_32 | high 32 at bit 32 | High half, 32-bit position |
| 58 | R_CUDA_ABS47_34 | 47-bit at bit 34 | 47-bit wide field |
| 59 | R_CUDA_ABS16_32 | 16-bit at bit 32 | 16-bit, 32-bit position |
| 60 | R_CUDA_ABS24_32 | 24-bit at bit 32 | 24-bit, 32-bit position |
| 74 | R_CUDA_ABS24_40 | 24-bit at bit 40 | 24-bit at offset 40 |
| 75 | R_CUDA_ABS55_16_34 | 55-bit, 16+34 split | Split wide field |
| 100 | R_CUDA_ABS20_44 | 20-bit at bit 44 | 20-bit at offset 44 |
| 114 | R_CUDA_ABS56_16_34 | 56-bit, 16+34 split | Split wide field |
| 70 | R_CUDA_32_LO | low 32 | Low half of 64-bit |
| 71 | R_CUDA_32_HI | high 32 | High half of 64-bit |
The naming convention encodes the bit-field geometry: R_CUDA_ABS<width>_<start_bit> indicates that <width> bits of the resolved address are patched into the instruction at bit position <start_bit>. The LO/HI suffix indicates low or high 32 bits of a 64-bit value. The different start positions (20, 23, 26, 32, 34, 40, 44) correspond to different SASS instruction encoding formats across SM generations: Kepler (26), Maxwell/Pascal (23), Volta/Turing (20), Ampere/Ada/Hopper (32).
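Because the convention is mechanical, the geometry can be recovered from the name itself. A small sketch (the helper name is ours; note that LO/HI forms are rejected by the trailing-suffix mismatch, while split forms like ABS55_16_34 would yield only their first two numbers):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse the bit-field geometry out of an R_CUDA_ABS<width>_<start> name.
 * Returns false when the name does not follow the two-number convention
 * (e.g. R_CUDA_ABS32_LO_26, where "LO" breaks the second %u). */
bool parse_abs_geometry(const char* name, unsigned* width, unsigned* start) {
    const char* p = strstr(name, "ABS");
    if (!p) return false;
    return sscanf(p + 3, "%u_%u", width, start) == 2;
}
```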
Global Address Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 3 | R_CUDA_G32 | Global-space 32-bit address |
| 4 | R_CUDA_G64 | Global-space 64-bit address |
| 84 | R_CUDA_G8_0 | Global-space byte 0 of 64-bit instruction |
| 85 | R_CUDA_G8_8 | Global-space byte 1 |
| 86 | R_CUDA_G8_16 | Global-space byte 2 |
| 87 | R_CUDA_G8_24 | Global-space byte 3 |
| 88 | R_CUDA_G8_32 | Global-space byte 4 |
| 89 | R_CUDA_G8_40 | Global-space byte 5 |
| 90 | R_CUDA_G8_48 | Global-space byte 6 |
| 91 | R_CUDA_G8_56 | Global-space byte 7 |
Global address relocations target .nv.global and .nv.global.init sections. The G8_* sub-byte variants patch individual bytes within a 64-bit instruction word, used when the instruction encoding requires the address to be spread across non-contiguous bit fields.
PC-Relative Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 40 | R_CUDA_PCREL_IMM24_26 | PC-relative 24-bit immediate at bit 26 |
| 41 | R_CUDA_PCREL_IMM24_23 | PC-relative 24-bit immediate at bit 23 |
PC-relative relocations resolve branch and call targets. The resolver enforces a critical constraint:
"PC relative branch address should be in the same section"
This means intra-function branches use PC-relative relocations, but cross-function calls use absolute or function descriptor relocations. The 24-bit immediate provides a +/-8 MB range from the instruction address, sufficient for any single kernel.
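The range arithmetic for a signed 24-bit displacement can be sketched directly (helper name ours): 2^23 bytes in each direction is 8 MB.

```c
#include <stdbool.h>
#include <stdint.h>

/* A signed 24-bit immediate covers displacements in [-2^23, 2^23 - 1] bytes,
 * i.e. roughly +/-8 MB from the instruction address. */
bool pcrel24_in_range(uint64_t insn_addr, uint64_t target_addr) {
    int64_t disp = (int64_t)(target_addr - insn_addr);
    return disp >= -(1 << 23) && disp < (1 << 23);
}
```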
Constant Field Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 24 | R_CUDA_CONST_FIELD19_28 | 19-bit constant bank offset at bit 28 |
| 25 | R_CUDA_CONST_FIELD19_23 | 19-bit constant bank offset at bit 23 |
| 36 | R_CUDA_CONST_FIELD21_26 | 21-bit constant bank offset at bit 26 |
| 38 | R_CUDA_CONST_FIELD19_26 | 19-bit constant bank offset at bit 26 |
| 39 | R_CUDA_CONST_FIELD21_23 | 21-bit constant bank offset at bit 23 |
| 50 | R_CUDA_CONST_FIELD19_20 | 19-bit constant bank offset at bit 20 |
| 54 | R_CUDA_CONST_FIELD21_20 | 21-bit constant bank offset at bit 20 |
| 64 | R_CUDA_CONST_FIELD19_40 | 19-bit constant bank offset at bit 40 |
| 66 | R_CUDA_CONST_FIELD21_38 | 21-bit constant bank offset at bit 38 |
| 115 | R_CUDA_CONST_FIELD22_37 | 22-bit constant bank offset at bit 37 |
Constant field relocations patch .nv.constant0.<func> bank offsets into load constant (LDC) instructions. The field width (19, 21, or 22 bits) determines the maximum addressable constant bank size: 19-bit supports 512 KB, 21-bit supports 2 MB, 22-bit supports 4 MB. During resolution, the constant bank deduplication pass (sub_1CA6890) may adjust the relocation offset:
"optimize ocg constant reloc offset from %lld to %lld"
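The width-to-capacity relationship stated above is a straight power of two; a sketch (helper names ours) for checking that a bank offset fits the chosen field width:

```c
#include <stdbool.h>
#include <stdint.h>

/* Maximum constant-bank size addressable by a CONST_FIELD of the given width:
 * 19 bits -> 512 KB, 21 bits -> 2 MB, 22 bits -> 4 MB. */
uint64_t const_field_capacity(unsigned width_bits) {
    return 1ull << width_bits;
}

bool const_offset_fits(uint64_t offset, unsigned width_bits) {
    return offset < const_field_capacity(width_bits);
}
```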
Function Descriptor Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 31 | R_CUDA_FUNC_DESC32_23 | 32-bit function descriptor at bit 23 |
| 32 | R_CUDA_FUNC_DESC32_LO_23 | Low 32 of descriptor at bit 23 |
| 33 | R_CUDA_FUNC_DESC32_HI_23 | High 32 of descriptor at bit 23 |
| 34 | R_CUDA_FUNC_DESC_32 | Full 32-bit function descriptor |
| 35 | R_CUDA_FUNC_DESC_64 | Full 64-bit function descriptor |
| 47 | R_CUDA_FUNC_DESC32_20 | 32-bit function descriptor at bit 20 |
| 48 | R_CUDA_FUNC_DESC32_LO_20 | Low 32 of descriptor at bit 20 |
| 49 | R_CUDA_FUNC_DESC32_HI_20 | High 32 of descriptor at bit 20 |
| 61 | R_CUDA_FUNC_DESC32_32 | 32-bit function descriptor at bit 32 |
| 62 | R_CUDA_FUNC_DESC32_LO_32 | Low 32 of descriptor at bit 32 |
| 63 | R_CUDA_FUNC_DESC32_HI_32 | High 32 of descriptor at bit 32 |
| 92--99 | R_CUDA_FUNC_DESC_8_0 -- R_CUDA_FUNC_DESC_8_56 | Sub-byte function descriptor patches |
Function descriptors are used for indirect calls through function pointers. The descriptor contains the target function's entry point address and is loaded by the GPU's indirect call mechanism. The sub-byte FUNC_DESC_8_* variants patch individual bytes of the descriptor into instruction encoding slots, used in wide instruction formats where the descriptor address is spread across multiple fields. When the relocation creator detects a flags_hi value of 12--15 in the descriptor table entry, it calls sub_1CBD0D0 to register the call edge in the call graph.
Texture, Sampler, and Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 6 | R_CUDA_TEX_HEADER_INDEX | Texture header table index |
| 7 | R_CUDA_SAMP_HEADER_INDEX | Sampler header table index |
| 8 | R_CUDA_SURF_HW_DESC | Surface hardware descriptor |
| 9 | R_CUDA_SURF_HW_SW_DESC | Surface hardware+software descriptor |
| 19 | R_CUDA_TEX_SLOT | Texture binding slot |
| 20 | R_CUDA_SAMP_SLOT | Sampler binding slot |
| 21 | R_CUDA_SURF_SLOT | Surface binding slot |
| 26 | R_CUDA_TEX_SLOT9_49 | 9-bit texture slot at bit 49 |
| 52 | R_CUDA_SURF_HEADER_INDEX | Surface header table index |
| 101 | R_CUDA_SAMP_HEADER_INDEX_0 | Sampler header index variant |
These relocations connect texture/sampler/surface operations to their runtime-allocated descriptor table entries. The CUDA driver fills in the actual descriptor indices at launch time based on the kernel's resource binding.
Bindless Texture/Surface Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 22 | R_CUDA_TEX_BINDLESSOFF13_32 | Bindless texture offset, 13-bit at bit 32 |
| 23 | R_CUDA_TEX_BINDLESSOFF13_47 | Bindless texture offset, 13-bit at bit 47 |
| 29 | R_CUDA_TEX_BINDLESSOFF13_41 | Bindless texture offset, 13-bit at bit 41 |
| 30 | R_CUDA_TEX_BINDLESSOFF13_45 | Bindless texture offset, 13-bit at bit 45 |
| 51 | R_CUDA_BINDLESSOFF13_36 | Bindless offset, 13-bit at bit 36 |
| 65 | R_CUDA_BINDLESSOFF14_40 | Bindless offset, 14-bit at bit 40 |
Bindless texture/surface relocations are handled by sub_1CAB300, which creates $NVLINKBINDLESSOFF_<name> symbols for each bindless reference. During resolution:
"change reloc symbol from %d to %d"
"no bindless ref in section %s"
"unexpected usage of non-unified surface descriptors"
Sub-Byte Patch Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 76--83 | R_CUDA_8_0 -- R_CUDA_8_56 | Patch byte 0--7 of 64-bit instruction |
These relocations patch a single byte at a specific 8-bit-aligned position within a 64-bit instruction word. They are used when the resolved value must be inserted into a non-standard bit position that does not align with the instruction encoding's immediate field boundaries.
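The byte-lane arithmetic behind the 8_N family is plain masking. A minimal sketch (helper name ours), where `bit_pos` is the N in R_CUDA_8_N:

```c
#include <stdint.h>

/* Patch one byte of a 64-bit instruction word, as the R_CUDA_8_<N> family does:
 * R_CUDA_8_0 writes bits [0,8), R_CUDA_8_56 writes bits [56,64). */
uint64_t patch_byte(uint64_t insn, unsigned bit_pos, uint8_t value) {
    uint64_t mask = 0xFFull << bit_pos;
    return (insn & ~mask) | ((uint64_t)value << bit_pos);
}
```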
Miscellaneous Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 27 | R_CUDA_6_31 | 6-bit field at bit 31 |
| 28 | R_CUDA_2_47 | 2-bit field at bit 47 |
| 37 | R_CUDA_QUERY_DESC21_37 | Query descriptor, 21-bit at bit 37 |
| 53 | R_CUDA_INSTRUCTION64 | Whole 64-bit instruction replacement |
| 67 | R_CUDA_INSTRUCTION128 | Whole 128-bit instruction replacement |
| 68 | R_CUDA_YIELD_OPCODE9_0 | YIELD opcode, 9-bit at bit 0 |
| 69 | R_CUDA_YIELD_CLEAR_PRED4_87 | Clear YIELD predicate, 4-bit at bit 87 |
| 72 | R_CUDA_UNUSED_CLEAR32 | Zero out 32-bit unused field |
| 73 | R_CUDA_UNUSED_CLEAR64 | Zero out 64-bit unused field |
| 116 | R_CUDA_NONE_LAST | Sentinel marking end of relocation table |
The R_CUDA_INSTRUCTION64 and R_CUDA_INSTRUCTION128 types replace entire instruction words, used for instruction-level patching by the linker when the instruction encoding changes based on the final resolved address.
The R_CUDA_YIELD_* types handle YIELD-to-NOP conversion. When a kernel has forward-progress requirements that prevent yielding, the resolver converts YIELD instructions to NOPs:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
The R_CUDA_UNUSED_CLEAR* types zero out instruction fields that are unused in the final encoding, ensuring deterministic output.
Unified Address Space Relocations
| Ordinal | Name | Purpose |
|---|---|---|
| 102 | R_CUDA_UNIFIED | Unified address (generic pointer) |
| 103 | R_CUDA_UNIFIED_32 | 32-bit unified address |
| 104--111 | R_CUDA_UNIFIED_8_0 -- R_CUDA_UNIFIED_8_56 | Unified address sub-byte patches |
| 112 | R_CUDA_UNIFIED32_LO_32 | Low 32 of unified at bit 32 |
| 113 | R_CUDA_UNIFIED32_HI_32 | High 32 of unified at bit 32 |
Unified address relocations resolve generic pointers that can point to global, shared, or constant memory. During final resolution, the resolver performs a type conversion from unified (type 103) to absolute (type 1):
// In sub_1CD48C0: unified reloc replacement
if (reloc_type == 103) // R_CUDA_UNIFIED_32
reloc_type = 1; // R_CUDA_32
R_MERCURY_* Relocation Types
65 types from R_MERCURY_NONE (ordinal 0) to R_MERCURY_NONE_LAST (ordinal 64). String addresses span 0x23FB8C5--0x23FBDFA. Mercury relocations serve the same purpose as R_CUDA types but are designed for the Mercury intermediate representation used on SM 100+ targets.
Mercury Type Categories
| Category | Types | Purpose |
|---|---|---|
| Address | R_MERCURY_G64, R_MERCURY_ABS64, R_MERCURY_ABS32, R_MERCURY_ABS16 | Memory addresses |
| Split address | R_MERCURY_ABS32_LO, R_MERCURY_ABS32_HI | 64-bit address halves |
| Program-relative | R_MERCURY_PROG_REL64, R_MERCURY_PROG_REL32, R_MERCURY_PROG_REL32_LO, R_MERCURY_PROG_REL32_HI | Offsets from program base |
| Tex/samp/surf | R_MERCURY_TEX_HEADER_INDEX, R_MERCURY_SAMP_HEADER_INDEX, R_MERCURY_SURF_HEADER_INDEX | Resource descriptors |
| Function | R_MERCURY_FUNC_DESC_64 | Function descriptor |
| Sub-byte | R_MERCURY_8_0 -- R_MERCURY_8_56 (8 types) | Byte-level patches |
| Global sub-byte | R_MERCURY_G8_0 -- R_MERCURY_G8_56 (8 types) | Global-space byte patches |
| Func desc sub-byte | R_MERCURY_FUNC_DESC_8_0 -- R_MERCURY_FUNC_DESC_8_56 (8 types) | Function descriptor byte patches |
| Abs-program-relative | R_MERCURY_ABS_PROG_REL32_LO, R_MERCURY_ABS_PROG_REL32_HI, R_MERCURY_ABS_PROG_REL32, R_MERCURY_ABS_PROG_REL64 | Absolute program-relative |
| Program-relative sub-byte | R_MERCURY_PROG_REL8_0 -- R_MERCURY_PROG_REL8_56 (8 types) | Program-relative byte patches |
| Unified | R_MERCURY_UNIFIED, R_MERCURY_UNIFIED_32, R_MERCURY_UNIFIED_8_0 -- R_MERCURY_UNIFIED_8_56, R_MERCURY_UNIFIED32_LO, R_MERCURY_UNIFIED32_HI | Unified address space |
| Cleanup | R_MERCURY_UNUSED_CLEAR64 | Zero out unused fields |
| Sentinels | R_MERCURY_NONE, R_MERCURY_NONE_LAST | Table boundaries |
Mercury introduces program-relative relocations (PROG_REL*) that do not exist in the R_CUDA set. These compute offsets relative to the program base address rather than absolute virtual addresses, enabling position-independent code for the Mercury deferred finalization model. The Mercury finalizer (running at link or load time) resolves these program-relative relocations after the final code layout is known.
Relocation Encoding
ELF Relocation Entry Format
Cubin relocations use standard Elf64_Rela entries in .rela.text.<funcname> sections:
typedef struct {
Elf64_Addr r_offset; // Byte offset within the section
Elf64_Xword r_info; // Symbol index (high 32) | Type (low 32)
Elf64_Sxword r_addend; // Addend for the relocation computation
} Elf64_Rela; // 24 bytes
The r_info field packs the symbol table index in the upper 32 bits and the R_CUDA/R_MERCURY type code in the lower 32 bits:
#define ELF64_R_SYM(info) ((info) >> 32)
#define ELF64_R_TYPE(info) ((info) & 0xFFFFFFFF)
For Mercury types, the type code stored in r_info is the ordinal plus 0x10000. The resolver subtracts 0x10000 before indexing the R_MERCURY descriptor table.
Internal Relocation Entry
The ELFW object maintains relocations in an internal linked list at offset +376 of the ELFW structure. Each internal entry is a 32-byte node:
// Internal relocation entry (reconstructed from sub_1CD4510, sub_1CD46B0)
struct elfw_reloc {
uint64_t offset; // +0: byte offset in target section
uint64_t type_and_section; // +8: (target_section << 32) | reloc_type
uint64_t addend; // +16: relocation addend
uint32_t symbol_index; // +24: index into ELFW symbol table
uint32_t alias_index; // +28: original symbol if aliased, else 0
};
The type_and_section field encodes both the relocation type code (low 32 bits) and the target section index (high 32 bits) in a single 64-bit field.
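The packing is the same high/low split used by `r_info`. A sketch of the accessors (names ours, matching the reconstructed layout):

```c
#include <stdint.h>

/* Pack/unpack the internal entry's type_and_section field:
 * target section index in the high 32 bits, relocation type in the low 32. */
uint64_t pack_type_section(uint32_t section, uint32_t type) {
    return ((uint64_t)section << 32) | type;
}
uint32_t unpack_type(uint64_t ts)    { return (uint32_t)ts; }
uint32_t unpack_section(uint64_t ts) { return (uint32_t)(ts >> 32); }
```

Note that Mercury type codes keep their 0x10000 bias in the low half, so the packed value remains unambiguous across both relocation systems.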
Resolved Relocation Output
Resolved relocations are written by sub_1CD5920 to .nv.resolvedrela sections. Additionally, .nv.rel.action sections carry relocation action metadata for the CUDA driver's runtime linker.
Symbol Table Structure
.symtab Format
The symbol table uses standard Elf64_Sym entries (24 bytes each for 64-bit, 16 bytes for 32-bit):
typedef struct {
Elf32_Word st_name; // String table offset
unsigned char st_info; // Type (low 4 bits) | Binding (high 4 bits)
unsigned char st_other; // Visibility (low 2 bits) | Flags
Elf64_Half st_shndx; // Section index (or SHN_XINDEX = 0xFFFF)
Elf64_Addr st_value; // Symbol value (section offset)
Elf64_Xword st_size; // Symbol size
} Elf64_Sym;
Internal Symbol Representation
The ELFW maintains an internal symbol structure (40+ bytes) with additional metadata:
| Offset | Size | Field | Description |
|---|---|---|---|
| +4 | 1 | st_info | Low nibble = type (STT_*), high nibble = binding strength |
| +5 | 1 | st_other | Bits 0-1 = visibility, bits 4-7 = CUDA-specific flags |
| +6 | 2 | st_shndx | Section index (0xFFFF = use extended index) |
| +8 | 8 | st_value | Symbol address; -1 = unallocated |
| +24 | 4 | section_link | Internal section reference |
| +28 | 4 | extra_index | Secondary symbol link |
| +32 | 8 | name_ptr | Pointer to symbol name string |
Symbol Types
| ELF Type | Value | CUDA Usage |
|---|---|---|
| STT_NOTYPE | 0 | Undefined/external symbols |
| STT_OBJECT | 1 | Global/constant/shared variables |
| STT_FUNC | 2 | Kernel entry points, device functions |
| STT_SECTION | 3 | Section symbols (one per section) |
| STT_COMMON | 5 | Common symbols (.common symbol) |
| STT_CUDA_TEXTURE | 10 | Texture reference symbols |
| STT_CUDA_SURFACE | 11 | Surface reference symbols |
| STT_CUDA_SAMPLER | 12 | Sampler reference symbols |
| STT_CUDA_FUNC_DESC | 13 | Function descriptor (indirect call target) |
The internal type field at offset +4 uses the low nibble for ELF standard types and the high nibble for binding/scope information. The resolver checks st_info & 0xF throughout its processing.
Function descriptor symbols (type 13) receive special handling in the relocation resolver. When the resolver encounters a type-13 symbol, it checks whether the symbol is allocated:
// sub_1CD48C0: function descriptor symbol handling
if ((sym->st_info & 0xF) == 13) { // STT_CUDA_FUNC_DESC
shndx = get_section_index(elfw, sym);
if (shndx == 0) {
// Unresolved -- check binding and ELFW flags
if ((sym->st_other & 0xE0) == 0x20 // STB_GLOBAL
|| (sym->st_other & 0x10)) // CUDA-specific extern flag
{
// External function descriptor: keep relocation for linker
}
}
}
Symbol Binding and Visibility
The st_other byte encodes both ELF visibility (bits 0-1) and CUDA-specific binding flags (bits 4-7):
| Bits | Field | Values |
|---|---|---|
| 0-1 | ELF visibility | 0 = STV_DEFAULT, 1 = STV_INTERNAL, 2 = STV_HIDDEN, 3 = STV_PROTECTED |
| 4 | Extern flag | 1 = external linkage (for nvlink) |
| 5-7 | Binding strength | 0x20 = STB_GLOBAL, 0x80 = STB_WEAK (the resolver masks with st_other & 0xE0) |
The 2-bit binding value checked via st_other & 3 in the resolver maps to:
| Value | Meaning | Resolution |
|---|---|---|
| 1 | STB_LOCAL / dead | Skip relocation ("ignore reloc on dead func %s") |
| 2 | STB_GLOBAL | Normal resolution |
| 3 | STB_WEAK | Resolve if available, otherwise use default |
Symbol Table Builder -- sub_1CB68D0
The symbol table builder (9,578 bytes, approximately 1,700 decompiled lines) processes the ELFW internal symbol list in these steps:
- Iterate symbols -- walks the symbol list from the ELFW object
- Filter deleted symbols -- 12 separate checks for "reference to deleted symbol" guard against stale entries from dead code elimination
- Handle __cuda_syscall -- special-cases the device-side syscall dispatcher symbol
- Resolve aliases -- follows alias chains to find the canonical symbol
- Compute values -- resolves st_value from section base + offset
- Create section symbols -- ensures every section has an STT_SECTION symbol; emits "found multiple section symbols for %s" if duplicates exist
- Handle SHN_XINDEX overflow -- when section index >= SHN_LORESERVE (0xFF00 = 65,280), sets st_shndx = SHN_XINDEX (0xFFFF) and stores the real index in .symtab_shndx
- Build .symtab_shndx -- populates the extended index table for overflow sections
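The SHN_XINDEX overflow step is a standard ELF mechanism and can be sketched compactly (helper name ours, mirroring the builder's step, not its literal code):

```c
#include <stdint.h>

#define SHN_LORESERVE 0xFF00u
#define SHN_XINDEX    0xFFFFu

/* Decide what to store in st_shndx for a given real section index.
 * Indices at or above SHN_LORESERVE overflow into the .symtab_shndx
 * side table; in-range indices are stored directly. */
uint16_t encode_shndx(uint32_t real_index, uint32_t* shndx_table_entry) {
    if (real_index >= SHN_LORESERVE) {
        *shndx_table_entry = real_index;  /* goes to .symtab_shndx */
        return SHN_XINDEX;
    }
    *shndx_table_entry = 0;               /* unused for in-range indices */
    return (uint16_t)real_index;
}
```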
Error strings observed in the builder:
| String | Condition |
|---|---|
| "reference to deleted symbol" | Symbol marked deleted but still referenced (12 checks) |
| "ignore symbol %s in unused section" | Symbol in eliminated section |
| "ignore symbol string %s for sym %d" | Skipping symbol name for unnamed/internal symbol |
| "found multiple section symbols for %s" | Duplicate STT_SECTION entries |
| "symbol already assigned" | Duplicate assignment attempt |
| "adding global symbols of same name" | Name collision |
| "alias to unknown symbol" | Alias target not found |
| "unallocated symbol" | Symbol value is -1 (never assigned an address) |
| "missing sec strtab" | String table not initialized |
Symbol Fixup -- sub_1CB2CA0
After dead code elimination removes sections, symbol indices become stale. The fixup pass (2,038 bytes, called from 4 sites) renumbers all symbol st_shndx values:
- For each section in the ELFW:
  - If the section lacks an STT_SECTION symbol, create one
  - If the section has multiple STT_SECTION symbols, warn
- Walk the symbol table and remap st_shndx values through the section index mapping
The fixup runs at multiple pipeline points: after dead function elimination, after Mercury section cloning, and after any section deletion.
Section Index Remap -- sub_1C99BB0
The companion to sub_1CB2CA0 for the extended index mechanism. When section indices change, this function updates both .symtab_shndx and .nv.merc.symtab_shndx to keep the extended index tables consistent.
Relocation Resolution Algorithm
The master resolver sub_1CD48C0 implements a 7-step algorithm that processes every relocation entry in the ELFW's linked list:
Step 1: Symbol Address Computation
For each relocation entry, compute the symbol's resolved address by adding the symbol's st_value (from the section base) to the relocation offset:
if (reloc->alias_index) {
sym = lookup_symbol(elfw, reloc->alias_index);
reloc->offset += sym->st_value;
}
For Mercury cubins (64-bit ELF class 'A' with Mercury flag set), the resolver applies an additional address transformation that accounts for the Mercury instruction stride:
if (is_mercury && sym_value != 0) {
int stride = 2 * arch_vtable->get_merc_stride();
reloc->offset += stride * (sym_value >> 7);
}
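Restated as a standalone helper (a sketch under the assumption that `merc_stride` stands in for `arch_vtable->get_merc_stride()`): every 128 bytes of symbol value contribute one doubled instruction stride to the offset.

```c
#include <stdint.h>

/* Mercury address transformation from Step 1 (reconstructed sketch):
 * offset += (2 * merc_stride) * (sym_value >> 7), skipped for zero values. */
uint64_t merc_adjust_offset(uint64_t offset, uint64_t sym_value, int merc_stride) {
    if (sym_value == 0)
        return offset;
    return offset + (uint64_t)(2 * merc_stride) * (sym_value >> 7);
}
```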
Step 2: Alias Resolution
If the relocation targets an alias symbol (ELF type STT_NOTYPE with section index pointing to another symbol), redirect the relocation to the canonical target:
"change alias reloc %s to %s"
The resolver follows the alias chain through sub_1CB1E00 (get section index) and sub_1CB3D20 (get section by index), replacing the alias with its real target.
Step 3: Dead Function Filtering
If the relocation's target symbol has local binding (st_other & 3 == 1) and is in a deleted section, the relocation is zeroed out:
"ignore reloc on dead func %s"
The relocation's type is set to 0 (R_CUDA_NONE), effectively removing it. When the output mode is not 2 (relocatable output), dead relocations on STT_NOTYPE symbols with a binding prefix of 2 are also removed.
Step 4: UFT/UDT Pseudo-Relocation Handling
Relocations targeting special synthetic symbols are intercepted:
| Symbol | Action |
|---|---|
| __UFT_OFFSET | Record for UFT slot assignment, zero the relocation |
| __UFT_CANONICAL | Map to canonical UFT entry |
| __UDT_OFFSET | Record for UDT slot assignment |
| __UDT_CANONICAL | Map to canonical UDT entry |
| __UFT, __UFT_END | UFT boundary markers |
| __UDT, __UDT_END | UDT boundary markers |
The resolver checks if a symbol name starts with "__UFT_OFFSET" (a 13-byte comparison in the decompiled code, covering the 12 characters plus the NUL terminator). If matched:
"ignore reloc on UFT_OFFSET"
The relocation entry is then processed by the UFT manager (sub_1CD22E0) which maps UUIDs to UFT slot indices.
Step 5: PC-Relative Branch Validation
For relocations whose descriptor table entry has *(table + 8*index + 5) == 16 (a flag qword at descriptor byte offset +40, in the additional-parameter area, marking PC-relative types), the resolver validates that the source and target sections are identical:
if (desc_qword5 == 16 && reloc->section != target_section)
fatal("PC relative branch address should be in the same section");
Step 6: YIELD-to-NOP Conversion
If the relocation type is R_CUDA_YIELD_OPCODE9_0 or R_CUDA_YIELD_CLEAR_PRED4_87, and the kernel has forward-progress requirements, the resolver skips the NOP conversion:
"Ignoring the reloc to convert YIELD to NOP due to forward progress requirement."
Step 7: Bit-Field Patching
The final step delegates to sub_1CD34E0, the bit-field patcher. This function uses the relocation descriptor table entry's parameters to extract the current field value (via sub_1CD33F0), add the resolved address, and write the result back (via sub_1CD3330):
// sub_1CD34E0 -- bit-field patching (simplified)
bool apply_reloc(reloc_desc_table, index, is_addend, instruction_data,
symbol_value, reloc_offset, sym_addr, sym_shndx,
section_type_offset, old_value_out) {
entry = (uint32_t*)((char*)reloc_desc_table + index * 64 + 12); // operations start at byte 12
end = (uint32_t*)((char*)reloc_desc_table + index * 64 + 60); // up to 3 x 16-byte operations
while (entry < end) {
uint32_t bit_start = entry[0];
uint32_t bit_width = entry[1];
uint32_t mode = entry[2];
switch (mode) {
case 0: // NOP
break;
case 1: // Direct write: place value at [bit_start, bit_start+bit_width)
case 0x12: case 0x2E:
old = extract_bits(instruction_data, bit_start, bit_width);
insert_bits(instruction_data, resolved_value, bit_start, bit_width);
break;
case 6: // Split low-word write (cross-qword boundary handling)
case 0x37:
// Write low portion, advance to next qword if needed
break;
case 7: // Split high-word write
case 0x38:
// Write HIDWORD of value
break;
}
entry += 4; // Next 16-byte operation
}
return true;
}
If the NVRS (NVIDIA Register Spill) check fails during patching, the resolver emits:
"unexpected NVRS"
NVRS relocations are special-purpose relocations for register spill slot references. When the bit-field patcher returns false, the relocation is invalid for the current context.
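The extract/insert helpers can be sketched over a two-qword instruction model. This is our reconstruction of the general shape of sub_1CD33F0/sub_1CD3330 (little-endian, fields may cross the qword boundary, which is what the split modes handle), not their literal code:

```c
#include <stdint.h>

/* Read `width` bits starting at bit `start` of a 128-bit instruction
 * modeled as two little-endian qwords. */
uint64_t extract_bits(const uint64_t insn[2], unsigned start, unsigned width) {
    unsigned q = start / 64, s = start % 64;
    uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
    uint64_t v = insn[q] >> s;
    if (s + width > 64)                     /* field crosses into the next qword */
        v |= insn[q + 1] << (64 - s);
    return v & mask;
}

/* Write `width` bits of `value` at bit `start`, splitting across qwords
 * when the field straddles the boundary. */
void insert_bits(uint64_t insn[2], uint64_t value, unsigned start, unsigned width) {
    unsigned q = start / 64, s = start % 64;
    uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
    value &= mask;
    insn[q] = (insn[q] & ~(mask << s)) | (value << s);
    if (s + width > 64) {
        unsigned hi = s + width - 64;       /* bits spilling into the high qword */
        uint64_t himask = (1ull << hi) - 1;
        insn[q + 1] = (insn[q + 1] & ~himask) | (value >> (64 - s));
    }
}
```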
Post-Resolution
Successfully resolved relocations are either:
- Removed from the linked list (the relocation was fully applied to the instruction bytes)
- Kept for the output
.nv.resolvedrelasection (the relocation needs runtime resolution by the CUDA driver)
The relocation writer sub_1CD5920 validates every remaining relocation before serializing it:
| Check | Error |
|---|---|
| Symbol value == -1 | "symbol never allocated" |
| Offset >= section size | "relocation is past end of offset" |
| Target section unallocated | "rela section never allocated" |
| Address not found in section data | "reloc address not found" |
Unified Function Table (UFT) and Unified Data Table (UDT)
Purpose
UFT and UDT support indirect function calls and generic data references across compilation units. When nvcc compiles a program using function pointers, virtual functions, or __device__ function addresses taken in host code, the compiler generates UFT/UDT entries that the runtime linker resolves at load time.
Sections
| Section | Purpose |
|---|---|
.nv.uft | Jump slot table (one slot per indirect-callable function) |
.nv.uft.entry | UFT entry metadata (UUID, offset pairs) |
.nv.udt | Data slot table (one slot per externally-referenced data object) |
.nv.udt.entry | UDT entry metadata |
.nv.uft.rel | UFT relocation table |
UFT Entry Structure
Each UFT entry contains a 128-bit UUID and a 64-bit offset:
struct uft_entry {
uint64_t uuid_lo; // Low 64 bits of UUID
uint64_t uuid_hi; // High 64 bits of UUID
uint64_t offset; // Offset into the jump slot table
}; // 24 bytes per entry
UFT Manager -- sub_1CD22E0
The UFT manager (1,979 bytes, 10 KB decompiled) processes UFT/UDT entries across all compilation units:
- Build UID-to-key map -- hashes uuid_lo ^ uuid_hi as the lookup key
- Detect conflicts -- reports "uft map conflict: 0x%llx" when two entries hash to the same key
- Detect duplicates -- reports "duplicate ids in uft.entry" when identical UUIDs appear
- Reorder entries -- "Re-ordering UFT entries" / "Re-ordering UDT entries" sorts entries for deterministic output
- Match UUIDs -- cross-references UUIDs against the existing UFT for linking: "matching uuid not found" if a referenced UUID does not exist
- Align UDT -- "udt size %lld needs aligning" pads UDT entries to required alignment
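The keying and conflict-detection steps can be sketched as follows. The fold of the 128-bit UUID to a 64-bit key (uuid_lo ^ uuid_hi) and the diagnostic wording come from the recovered strings; the data layout and control flow here are illustrative assumptions.

```python
# Sketch of the UFT manager's (sub_1CD22E0) UUID-to-key mapping.
# Conflicts = different UUIDs folding to the same key; duplicates =
# the same UUID registered twice. Structure is hypothetical.
def build_uft_map(entries):
    uft_map, diagnostics = {}, []
    for uuid_lo, uuid_hi, offset in entries:
        key = uuid_lo ^ uuid_hi  # fold 128-bit UUID to 64-bit lookup key
        if key in uft_map:
            prev_lo, prev_hi, _ = uft_map[key]
            if (prev_lo, prev_hi) == (uuid_lo, uuid_hi):
                diagnostics.append("duplicate ids in uft.entry")
            else:
                diagnostics.append("uft map conflict: 0x%x" % key)
            continue
        uft_map[key] = (uuid_lo, uuid_hi, offset)
    return uft_map, diagnostics
```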
UFT Slot Validator -- sub_1CD2AA0
Validates consistency between .nv.uft (jump slots) and .nv.uft.entry (metadata):
"missing nv.uft.entry"
"Number of .nv.uft jump slots != Number of entries in .nv.uft.entry"
"size of uidx window != nv.uft"
Synthetic Symbols
The resolver recognizes these synthetic symbol names:
| Symbol | Purpose |
|---|---|
__UFT_OFFSET | Points to a UFT jump slot |
__UFT_CANONICAL | Canonical UUID entry for a UFT slot |
__UDT_OFFSET | Points to a UDT data slot |
__UDT_CANONICAL | Canonical UUID entry for a UDT slot |
__UFT / __UFT_END | UFT table start/end boundaries |
__UDT / __UDT_END | UDT table start/end boundaries |
$NVLINKBINDLESSOFF_<name> | Bindless texture/surface offset symbol |
__cuda_syscall | Device-side syscall dispatcher |
Extern Shared Memory Relocations
Extern shared memory variables (declared with extern __shared__) are handled specially because their addresses are not known until kernel launch. The resolver tracks these through dedicated strings:
"extern shared variable %s at offset %lld"
"reloc of extern shared %d replaced with symbol %d"
"new extern shared instance %d"
Multiple kernels may reference the same extern shared variable. The linker creates separate instances when necessary and patches the relocation to point to the correct instance.
Weak Symbol Handling
When nvlink encounters a weak symbol that conflicts with a strong definition:
"Could not replace weak symbol '%s'"
This occurs during the relocation pre-scan (sub_1CD43A0) when processing relocations that reference weak symbols. The pre-scan walks all relocations and checks the symbol binding at sym->st_other & 0xE0:
- 0x80 = weak: eligible for replacement by a strong definition
- 0x20 = global: normal binding
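The binding test from the pre-scan reduces to a single mask-and-compare. Note this is the document's recovered convention (binding bits in st_other & 0xE0), which differs from standard ELF where binding lives in the high nibble of st_info; the helper name is hypothetical.

```python
# Sketch of the weak-symbol eligibility check in the relocation pre-scan
# (sub_1CD43A0): mask the binding bits out of st_other and compare.
WEAK, GLOBAL = 0x80, 0x20
BINDING_MASK = 0xE0

def is_replaceable_weak(st_other: int) -> bool:
    """True if the symbol is weak and eligible for strong replacement."""
    return (st_other & BINDING_MASK) == WEAK
```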
Linking Model
Relocatable Object Mode (-c)
When ptxas produces a relocatable object (.o), all relocations are preserved in .rela.text.<func> sections. The call graph is written to .nv.callgraph. Symbols retain their binding information for nvlink to resolve.
"No relocatable objects found. Did not generate callgraph."
"Generate relocatable object"
The --preserve-relocs flag additionally preserves relocations that would normally be resolved internally:
"This option will make PTXAS to generate relocatable references for variables and preserve ..."
Executable Mode (default)
In the default mode, ptxas resolves all internal relocations and writes .nv.resolvedrela for any relocations that require runtime resolution. External references and function descriptors for indirect calls are preserved as unresolved relocations for the CUDA driver's runtime linker.
PIC Mode
Position-independent code mode (IsPIC flag) changes the relocation encoding. The ELF flags word at ELFW offset +48 encodes this mode. PIC cubins use additional program-relative relocations and avoid absolute addresses where possible.
Cross-References
- Custom ELF Emitter -- ELFW object, header construction, file serialization
- Section Catalog & EIATTR -- complete section type inventory, EIATTR encoding
- Debug Information -- DWARF section generation
- Pipeline Overview -- where relocation resolution fits in the 11-phase pipeline
- Capsule Mercury -- Mercury-specific relocation handling
Function Map
| Address | Size (binary) | Decompiled | Callers | Callees | Purpose |
|---|---|---|---|---|---|
sub_1CD48C0 | 4,184 B | 22 KB | 1 | 17 | Master relocation resolver (7-step algorithm) |
sub_1CD5920 | 1,985 B | 11 KB | 1 | 9 | Relocation writer (.nv.resolvedrela) |
sub_1CD4510 | ~860 B | 4 KB | -- | -- | Relocation creator (SASS) |
sub_1CD46B0 | ~540 B | 4 KB | -- | -- | Relocation creator (Mercury) |
sub_1CD43A0 | ~560 B | 3 KB | -- | -- | Relocation pre-scan (weak/extern) |
sub_1CD34E0 | 3,700 B | 17 KB | 1 | 2 | Bit-field patcher (sub_1CD33F0 extract, sub_1CD3330 insert) |
sub_1CD33F0 | ~300 B | 2 KB | 7 | 1 | Extract bits from instruction word |
sub_1CD3330 | ~200 B | 1 KB | 5 | 0 | Insert bits into instruction word |
sub_1CD22E0 | 1,979 B | 10 KB | 2 | 20 | UFT manager (UUID-to-slot mapping) |
sub_1CD2AA0 | ~800 B | 3 KB | -- | -- | UFT slot validator |
sub_1CB68D0 | 9,578 B | 49 KB | 1 | 36 | Symbol table builder (.symtab) |
sub_1CB2CA0 | 2,038 B | 8 KB | 4 | 11 | Symbol fixup (post-deletion renumbering) |
sub_1C99BB0 | 4,900 B | 25 KB | 1 | 18 | Section index remap (.symtab_shndx) |
sub_1CB64A0 | ~500 B | 2 KB | -- | -- | Symbol resolver (checks .nv.* special names) |
sub_1CAB300 | 2,157 B | 12 KB | 1 | 19 | Bindless texture/surface handler |
sub_1CA6890 | 2,286 B | 15 KB | 2 | 11 | Constant bank deduplication |
sub_1CBD0D0 | -- | -- | -- | -- | Call graph edge registration |
CLI Options
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas v13.0.88 accepts approximately 160 command-line options: 51 documented in --help output and roughly 109 internal/undocumented options discovered through binary analysis. All option names are registered via sub_432A00 (6,427 bytes at 0x432A00) using a generic option framework shared with other NVIDIA tools. The framework library (sub_1C960C0--sub_1C97640) supports short options (-X), long options (--name), and four value types: boolean toggle, list append, scalar value, and multi-value. Internal option names are stored ROT13-encoded in the binary.
| Total options | ~160 (51 documented + ~109 internal) |
| Option registration | sub_432A00 (0x432A00, 6,427 bytes) |
| Option parser | sub_434320 (0x434320, 10,289 bytes) |
| Framework constructor | sub_1C960C0 |
| Argv processor | sub_1C96680 |
| Help printer | sub_1C97640 |
| Options block | 1,352 bytes on stack in compilation driver |
| Name obfuscation | ROT13 for internal option names |
Architecture
argv
│
▼
┌───────────────────┐ ┌─────────────────────────┐
│ sub_1C960C0 │ │ sub_432A00 │
│ Parser ctor │◄─────│ Register ~160 options │
│ (56-byte context)│ │ (name, type, default, │
└───────┬───────────┘ │ help text, callback) │
│ └─────────────────────────┘
▼
┌───────────────────┐
│ sub_1C96680 │
│ Process argv │
│ Detect - and -- │
│ Type dispatch: │
│ 1=bool toggle │
│ 2=list append │
│ 3=scalar value │
│ 4=multi-value │
└───────┬───────────┘
│
▼
┌───────────────────┐
│ sub_434320 │
│ Validate combos │
│ Populate 1352B │
│ options block │
└───────┬───────────┘
│
▼
Compilation driver
(sub_446240)
Quick Start
The most common ptxas invocations and essential options, ordered by frequency of use:
# 1. Basic compilation: PTX -> cubin for a specific GPU
ptxas -arch sm_90 -o kernel.cubin kernel.ptx
# 2. Compilation with optimization control
ptxas -arch sm_100 -O3 -o kernel.cubin kernel.ptx
# 3. Debug build with line info
ptxas -arch sm_90 -g -lineinfo -o kernel.cubin kernel.ptx
# 4. Register-limited compilation (occupancy tuning)
ptxas -arch sm_90 -maxrregcount 64 -o kernel.cubin kernel.ptx
# 5. Verbose output with resource statistics
ptxas -arch sm_90 -v -o kernel.cubin kernel.ptx
# 6. Relocatable object for separate linking
ptxas -arch sm_90 -c -o kernel.o kernel.ptx
# 7. Fast-compile mode (trade codegen quality for build speed)
ptxas -arch sm_100 -Ofc max -o kernel.cubin kernel.ptx
# 8. Parallel compilation with multiple threads
ptxas -arch sm_90 -split-compile 0 -o kernel.cubin kernel.ptx
# 9. Internal knob override (developer/debugging)
ptxas -arch sm_90 -knob DUMPIR=AllocateRegisters -o kernel.cubin kernel.ptx
# 10. Discover all 1,294 internal knob values
DUMP_KNOBS_TO_FILE=/tmp/knobs.txt ptxas -arch sm_90 -o kernel.cubin kernel.ptx
| Goal | Options |
|---|---|
| Maximize performance | -O3 -allow-expensive-optimizations -fmad |
| Maximize occupancy | -maxrregcount N (N = 32, 64, 128, ...) |
| Minimize compile time | -Ofc max -split-compile 0 |
| Debug build | -g -lineinfo -sp-bounds-check |
| Spill diagnostics | -v -warn-spills -warn-lmem-usage |
| Internal tuning | -knob NAME=VALUE (see Knobs System) |
Option Discovery Methodology
Options were extracted from four independent sources:
- Official --help output -- 51 options with full metadata.
- Binary string extraction -- strings(1) reveals plaintext option names used in error messages and format strings.
- ROT13 decode -- internal option names are stored as ROT13 in the registration function. Decoding fj4575628 yields sw4575628, pbzcvyre-fgngf yields compiler-stats, etc.
- Decompiled code cross-reference -- string references in option processing functions (sub_434320, sub_4428E0, sub_43A400) confirm option semantics.
Tables below use these markers:
- Unmarked rows = documented in --help
- Rows marked (internal) = discovered through RE, not in --help
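The ROT13 decode step is trivial to reproduce: Python's built-in rot13 codec rotates letters and passes digits and hyphens through unchanged, which matches how ptxas obfuscates internal option names.

```python
import codecs

# Decode a ROT13-obfuscated internal option name as stored in the
# ptxas option registration function (sub_432A00).
def decode_option(name: str) -> str:
    return codecs.decode(name, "rot13")
```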
Core Compilation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--opt-level | -O | int | 3 | Optimization level (0--4) |
--output-file | -o | file | elf.o | Output file name and location |
--gpu-name | -arch | enum | sm_75 | Target GPU architecture (sm_XX, compute_XX, lto_XX) |
--compile-only | -c | bool | false | Generate relocatable object |
--entry | -e | list | (all) | Entry function name(s) to compile |
--verbose | -v | bool | false | Print code generation statistics |
--version | -V | bool | -- | Print version information |
--help | -h | bool | -- | Print help text |
--machine | -m | int | 64 | Host architecture bitness (only 64 supported) |
--input-as-string | -ias | list | -- | PTX modules as strings instead of files |
--options-file | -optf | list | -- | Include CLI options from file |
--compile-functions (internal) | -- | list | -- | Restrict compilation to named functions |
--ptx-length (internal) | -- | int | -- | PTX input length for --input-as-string mode |
--tool-name (internal) | -- | string | -- | Tool name for diagnostics (nvcc integration) |
--cuda-api-version (internal) | -- | int | (auto) | CUDA API version for compatibility |
--abi-compile (internal) | -- | bool | false | Compile using strict ABI conventions |
Debug and Instrumentation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--device-debug | -g | bool | false | Generate debug information for device code |
--generate-line-info | -lineinfo | bool | false | Generate line-number information |
--sp-bounds-check | -sp-bounds-check | bool | false | Stack-pointer bounds checking; auto-enabled with -g or -O0 |
--suppress-debug-info | -suppress-debug-info | bool | false | Suppress debug sections in output; ignored without -g or -lineinfo |
--device-stack-protector | -device-stack-protector | bool | false | Stack canaries; heuristic per-function risk assessment |
--sanitize | -sanitize | enum | -- | Instrumented code: memcheck or threadsteer |
--g-tensor-memory-access-check | -g-tmem-access-check | bool | (with -g) | Tensor memory access checks for tcgen05 |
--gno-tensor-memory-access-check | -gno-tmem-access-check | bool | false | Override: disable tensor memory access checks |
--dont-merge-basicblocks | -no-bb-merge | bool | false | Prevent basic block merging (debuggable code) |
--return-at-end | -ret-end | bool | false | Preserve last return instruction for breakpoints |
--make-errors-visible-at-exit | -make-errors-visible-at-exit | bool | false | Generate instructions to surface memory faults at exit |
--trap-into-debugger (internal) | -- | bool | false | Insert trap instructions for debugger attachment |
--device-stack-protector-size (internal) | -- | int | (varies) | Stack protector canary size |
--device-stack-protector-frame-size-threshold (internal) | -- | int | (varies) | Frame size threshold for canary insertion |
Register and Occupancy Control
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--maxrregcount | -maxrregcount | int/enum | (unlimited) | Max registers per function; accepts N, archmax, archmin |
--minnctapersm | -minnctapersm | int | -- | Min CTAs per SM; ignored if -maxrregcount is set |
--maxntid | -maxntid | list | -- | Max thread-block dimensions; ignored if -maxrregcount is set |
--device-function-maxrregcount | -func-maxrregcount | int/enum | (unlimited) | Max registers for device functions (with -c); overrides --maxrregcount for non-entry functions |
--register-usage-level | -regUsageLevel | int | 5 | Register-usage optimization aggressiveness (0--10); BETA |
--override-directive-values | -override-directive-values | bool | false | CLI values override PTX directives for minnctapersm, maxntid, maxrregcount |
--first-reserved-rreg (internal) | -- | int | -- | First reserved register number (tools integration) |
--reg-fatpoint (internal) | -- | string | -- | Fatpoint register allocation mode selector |
--no-fastreg (internal) | -- | bool | false | Disable fast register allocation path |
--no-spill (internal) | -- | bool | false | Disable register spilling (debug/stress) |
Performance and Optimization
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--Ofast-compile | -Ofc | enum | 0 | Fast-compile level: 0 (disabled), min, mid, max |
--fast-compile (internal) | -- | bool | false | Internal fast-compile flag (predecessor of --Ofast-compile) |
--allow-expensive-optimizations | -allow-expensive-optimizations | bool | (auto at O2+) | Allow max resources for expensive optimizations |
--split-compile | -split-compile | int | -- | Max concurrent threads for optimizer; 0 = num CPUs |
--fmad | -fmad | bool | true | Contract FP multiply + add into FMA (FMAD/FFMA/DFMA) |
--optimize-float-atomics | -opt-fp-atomics | bool | false | FP atomic optimizations (may affect precision) |
--disable-optimizer-constants | -disable-optimizer-consts | bool | false | Disable optimizer constant bank |
--cloning (internal) | -- | enum | (auto) | Inline function cloning control (yes/no) |
--perf-per-watt-opt-level (internal) | -- | int | -- | Performance-per-watt optimization level |
--lds128convert (internal) | -- | enum | (auto) | LDS.128 conversion: always, nonconst, never |
--opt-pointers (internal) | -- | bool | (varies) | Enable pointer optimization passes |
--fastpath-off (internal) | -- | bool | false | Disable fast-path optimizations |
--full-double-div (internal) | -- | bool | (varies) | Full-precision double division |
--limit-fold-fp (internal) | -- | bool | (varies) | Limit floating-point constant folding |
--shift-right (internal) | -- | bool | false | Shift-right optimization control |
--dont-reserve-null-pointer (internal) | -- | bool | false | Do not reserve null pointer in address space |
Cache Control
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--def-load-cache | -dlcm | enum | (arch-dep) | Default cache modifier on global/generic load |
--def-store-cache | -dscm | enum | (arch-dep) | Default cache modifier on global/generic store |
--force-load-cache | -flcm | enum | -- | Force cache modifier on global/generic load |
--force-store-cache | -fscm | enum | -- | Force cache modifier on global/generic store |
Warnings and Diagnostics
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--warning-as-error | -Werror | bool | false | Promote all warnings to errors |
--disable-warnings | -w | bool | false | Inhibit all warnings |
--warn-on-spills | -warn-spills | bool | false | Warn when registers spill to local memory |
--warn-on-local-memory-usage | -warn-lmem-usage | bool | false | Warn when local memory is used |
--warn-on-double-precision-use | -warn-double-usage | bool | false | Warn when doubles are used |
--suppress-stack-size-warning | -suppress-stack-size-warning | bool | false | Suppress undetermined-stack-size warning |
--suppress-double-demote-warning | -suppress-double-demote-warning | bool | false | Suppress double demotion warning on SM without double support |
--suppress-async-bulk-multicast-advisory-warning | -suppress-async-bulk-multicast-advisory-warning | bool | false | Suppress .multicast::cluster advisory |
--suppress-sparse-mma-advisory-info | -suppress-sparse-mma-advisory-info | bool | false | Suppress mma.sp advisory |
--print-potentially-overlapping-membermasks (internal) | -- | bool | false | Diagnostic for overlapping member masks |
--no-membermask-overlap (internal) | -- | bool | false | Disable member mask overlap checks |
Output Format and Relocation
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--preserve-relocs | -preserve-relocs | bool | false | Preserve relocations in linked executable |
--position-independent-code | -pic | bool | false (whole-prog: true) | Generate PIC; default on for whole-program compilation |
--compiler-annotations | -annotate | bool | false | Annotate compiler-internal information in binary |
--binary-kind (internal) | -- | enum | (arch-dep) | Target binary format: mercury, capmerc, sass |
--force-rela (internal) | -- | bool | false | Force RELA-style relocations |
--gen-std-elf (internal) | -- | bool | false | Generate standard ELF (vs NVIDIA custom format) |
--link-info (internal) | -- | string | -- | Link information for assembler |
--force-externals (internal) | -- | bool | false | Force functions as external |
--forcetext (internal) | -- | bool | false | Force text-mode SASS output |
--emit-internal-clo (internal) | -- | bool | false | Emit internal compiler-level object metadata |
--hide-user-functions (internal) | -- | bool | false | Hide user function symbols in output |
Workaround Flags
Hardware and software bug workarounds tied to internal NVIDIA bug-tracking IDs. All names are ROT13-encoded in the binary (e.g., fj2614554 decodes to sw2614554). These flags toggle specific code paths that avoid known errata or compiler defects. New workarounds appear (and old ones become permanent) with each ptxas release. The validator in sub_434320 enforces architecture restrictions: a flag set on an unsupported architecture is cleared and a diagnostic warning is emitted.
| Long Name | Short Name | Type | Default | Arch Gate | Description |
|---|---|---|---|---|---|
--sw2614554 (internal) | -- | bool | false | all | Thread-safety workaround; incompatible with --split-compile. When set, forces single-threaded compilation -- validator emits "'--sw2614554' ignored because of '--split-compile'" and disables split-compile. Addresses a race condition in the parallel optimizer. |
--sw2837879 (internal) | -- | bool | false | all | Backend codegen workaround. No architecture gating or validator logic; consumed directly in DAG/OCG pipeline phases. Specific behavioral effect not traced beyond registration. |
--sw1729687 (internal) | -- | bool | false | sm_50--sm_53 | Maxwell-era hardware errata workaround. Validator checks (arch_ordinal - 14) > 2 and clears the flag with a warning on any architecture beyond sm_53. Activates an alternate codegen path on Maxwell GPUs. |
--sw200428197 (internal) | -- | bool | false | sm_80+ | Sanitizer-compatible ABI workaround. Forces scratch register reservation for CUDA sanitizer instrumentation state and applies ABI-minimum register counts. Consumed in function/ABI setup (sub_43F400, sub_441780) alongside --compile-as-tools-patch. Validator clears it with "-arch=X ignored because of --sw200428197" on sm_75 and earlier. |
--sw200387803 (internal) | -- | bool | false | deprecated | Retired workaround. Setting it triggers a deprecation advisory (dword_29FBDB0) but no behavioral change -- the underlying fix has been permanently integrated. |
--sw200764156 (internal) | -- | bool | true | sm_90 only | Hopper-specific hardware errata. Default is true (unique among all sw* flags). Help text reads "Enable/Disable sw200764156", confirming it is a toggle that can be turned off. On any architecture other than sm_90, the user-set value is discarded: "option -arch=X ignored because of --sw200764156". |
--sw4575628 (internal) | -- | bool | false | sm_100+ | Cache and texturing mode workaround. Validator clears it with a warning on architectures sm_100 and earlier. In target configuration (sub_43A400), the target profile at offset +2465 independently determines whether the workaround is needed; if both the profile and the CLI flag are set simultaneously, the CLI flag is cleared with "--sw4575628 conflicts with specified texturing mode". |
--sw200531531 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200531531). No help text, no validator cross-references, no decompiled consumption. Consumed in backend passes not covered by available decompiled functions. |
--sw200380282 (internal) | -- | bool | (varies) | unknown | Known only from ROT13 decode (fj200380282). Same as --sw200531531 -- registered but with no traceable validator or target configuration logic. |
--sw4915215 (internal) | -- | bool | false | all (behavior varies) | Generation-dependent workaround. On Blackwell (sm_100+, generation=100), when enabled alongside non-PIC mode, emits informational "sw4915215=true". On other architectures, emits a different informational. Behavioral effect is in backend codegen. |
--sw4936628 (internal) | -- | bool | false | all | Stored at options block offset +503, adjacent to --blocks-are-clusters in the registration sequence. No architecture gating in the validator. Specific behavioral effect requires deeper backend tracing; registration proximity suggests cluster/CTA-level code generation relevance. |
EIATTR-Level Workarounds
Three EIATTR attributes encode workaround metadata directly in the output ELF. These are set by target architecture rather than CLI flags -- ptxas emits them unconditionally when the target requires it, and the GPU driver applies fixups at load time.
| EIATTR Code | Name | Knob Name | Description |
|---|---|---|---|
42 (0x2A) | EIATTR_SW1850030_WAR | OneFlapJne1850030 | Instruction offsets requiring driver-side fixup for HW bug 1850030. |
48 (0x30) | EIATTR_SW2393858_WAR | OneFlapJne2393858 | Instruction offsets requiring driver-side fixup for HW bug 2393858. |
53 (0x35) | EIATTR_SW2861232_WAR | -- | Instruction offsets for HW bug 2861232 workaround. |
54 (0x36) | EIATTR_SW_WAR | -- | Generic software workaround container (variable payload). |
71 (0x47) | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | -- | Offsets of MEMBAR.SYS instructions needing software workaround. |
Tool and Patch Modes
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--compile-as-tools-patch | -astoolspatch | bool | false | Compile patch code for CUDA tools; forces ABI-minimum regcount |
--extensible-whole-program | -ewp | bool | false | Extensible whole-program mode |
--compile-as-at-entry-patch (internal) | -asatentrypatch | bool | false | Compile as at-entry instrumentation patch |
--compile-as-entry-exit-patch (internal) | -- | bool | false | Compile as entry/exit instrumentation patch |
--compile-device-func-without-entry (internal) | -- | bool | false | Allow device function compilation without entry point |
--assyscall (internal) | -- | bool | false | System-call instrumentation mode |
--fdcmpt (internal) | -- | bool | false | Forward-compatibility mode |
--enable-syscall-abi (internal) | -- | bool | false | Enable syscall ABI for device functions |
--assume-extern-functions-do-not-sync (internal) | -- | bool | false | Assume external functions do not synchronize |
--function-pointer-is-function-pointer (internal) | -- | bool | false | Treat function pointers as true function pointers |
Statistics and Profiling
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--compiler-stats (internal) | -- | bool | false | Print per-phase timing (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo) and peak memory |
--compiler-stats-file (internal) | -- | file | -- | Write statistics to JSON file |
--fdevice-time-trace (internal) | -- | file | -- | Chrome DevTools trace format (JSON) for time profiling |
--ftrace-phase-after (internal) | -- | string | -- | Trace/dump IR state after named optimization phase |
--perf-stats (internal) | -- | bool | false | Print performance statistics |
--dump-perf-stats (internal) | -- | bool | false | Dump performance statistics to output |
--phase-wise (internal) | -- | bool | false | Per-phase statistics breakdown |
--use-trace-pid (internal) | -- | bool | false | Include process ID in trace output |
--verbose-tkinfo (internal) | -- | bool | false | Verbose token/parse information |
Mercury and Capsule Mercury
These options control the Mercury intermediate encoding and Capsule Mercury format, which is the default output format on sm_100+ (Blackwell).
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--cap-merc (internal) | -- | bool | (arch-dep) | Generate Capsule Mercury format |
--self-check (internal) | -- | bool | false | Validate capmerc by comparing reconstituted SASS with original |
--out-sass (internal) | -- | bool | false | Output reconstituted SASS from capmerc |
--opportunistic-finalization-lvl (internal) | -- | int | -- | Opportunistic finalization level for Mercury pipeline |
Threading and Parallelism
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--jobserver | -jobserver | bool | false | Enable GNU Make jobserver support (make -j<N>) |
--threads-dynamic-scheduling (internal) | -- | bool | (varies) | Dynamic scheduling for thread pool tasks |
--threads-min-section-size (internal) | -- | int | (varies) | Minimum section size for thread pool partitioning |
Texture and Memory Modes
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--legacy-bar-warp-wide-behavior | -legacy-bar-warp-wide-behavior | bool | false | Legacy PTX bar semantics; deprecated, ignored for sm_70+ |
--set-texmode-independent (internal) | -- | bool | false | Set texture mode to independent |
--set-texmode-raw (internal) | -- | bool | false | Set texture mode to raw |
--disable-fast-video-emulation (internal) | -- | bool | false | Disable fast video emulation path |
--treat-bf16-as-e6m9 (internal) | -- | bool | false | Treat BF16 as E6M9 format |
--legacy-cvtf64 (internal) | -- | bool | false | Legacy cvt.f64 conversion behavior |
--use-gmem-for-func-addr (internal) | -- | bool | false | Global memory for function addresses |
--blocks-are-clusters (internal) | -- | bool | false | Treat blocks as clusters (sm_90a+ TBC) |
--enable-extended-smem (internal) | -- | bool | false | Extended shared memory support |
--disable-smem-reservation (internal) | -- | bool | false | Disable shared memory reservation |
--membermask-overlap (internal) | -- | bool | (varies) | Member mask overlap control |
--ld-prefetch-random-seed (internal) | -- | int | -- | Random seed for load prefetch heuristic |
--max-stack-size (internal) | -- | int | (auto) | Max kernel stack size |
Constant Bank Allocation
NVIDIA GPUs provide 18 hardware constant banks (c[0] through c[17]), each a 64 KB read-only memory segment accessible by all threads in a warp with uniform-address broadcast -- loads from constant banks cost a single memory transaction when all threads in the warp read the same address. The compiler assigns different data categories (kernel parameters, driver state, user constants, PIC tables, etc.) to separate banks to avoid address-space collisions. These options override the default bank assignments; all are ROT13-encoded.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--sw-kernel-params-bank (internal) | -- | int | (varies) | Constant bank for kernel parameters |
--sw-driver-bank (internal) | -- | int | (varies) | Constant bank for driver data |
--sw-compiler-bank (internal) | -- | int | (varies) | Constant bank for compiler-generated constants |
--sw-user-bank (internal) | -- | int | (varies) | Constant bank for user constants |
--sw-pic-bank (internal) | -- | int | (varies) | Constant bank for PIC data |
--sw-ocl-param1-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 1 |
--sw-ocl-param2-bank (internal) | -- | int | (varies) | Constant bank for OpenCL parameter set 2 |
--sw-devtools-data-bank (internal) | -- | int | (varies) | Constant bank for developer tools data |
--sw-bindless-tex-surf-table-bank (internal) | -- | int | (varies) | Constant bank for bindless texture/surface table |
Stress Testing
Internal options for compiler stress testing and regression verification.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--stress-no-crp (internal) | -- | bool | false | Disable CRP (Caller/callee Register Partitioning) |
--stress-maxrregcount (internal) | -- | int | -- | Override maxrregcount for stress testing |
--stress-noglobalregalloc (internal) | -- | bool | false | Disable global register allocation |
Query and Control Interface
Internal options for the query/control interface used by nvcc and other tools.
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--ext-desc-file (internal) | -- | file | -- | External description file for instruction metadata |
--ext-desc-string (internal) | -- | string | -- | External description string for instruction metadata |
--query-controls (internal) | -- | string | -- | Query control parameters |
--query-schema (internal) | -- | string | -- | Query schema definition |
--apply-controls (internal) | -- | string | -- | Apply control parameters to compilation |
--profile-options (internal) | -- | string | -- | Pass profiling options to backend |
--knob (internal) | -knob | list | -- | Set internal knob: -knob NAME=VALUE; repeatable; see Knobs System |
--omega-knob (internal) | -- | string | -- | Pass omega-subsystem knob settings |
--expand-macros-in-omega (internal) | -- | bool | false | Expand macros in omega (instruction expansion) phase |
--force-expand-macros-after-errors (internal) | -- | bool | false | Force macro expansion after errors |
--enable-func-clone-sc (internal) | -- | bool | false | Enable function cloning for self-check |
--use-alternate-query-implementation (internal) | -- | bool | false | Alternate query implementation |
--use-alternate-const-ptr-implementation (internal) | -- | bool | false | Alternate constant pointer implementation |
Syscall Integration
Internal options for system-call based operations (texturing, bulk copy).
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--use-tex-grad-syscall (internal) | -- | bool | false | Syscall for texture gradient operations |
--use-tex-surf-syscall (internal) | -- | bool | false | Syscall for texture/surface operations |
--use-bulk-copy-syscall (internal) | -- | bool | false | Syscall for bulk copy operations |
Knobs Configuration
The -knob flag is the primary CLI mechanism for setting internal knob values -- the 1,294 tuning parameters documented in Knobs System. It is not listed in --help output and uses a single-dash prefix (not --knob).
Syntax
-knob NAME=VALUE Set a typed knob (int, float, double, string, range)
-knob NAME Set a boolean knob (presence = true)
-knob "A=1~B=2~C=3" Multiple knobs in one argument, separated by ~ (tilde)
Multiple -knob flags are accumulated (list-append semantics):
ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE -knob RegAllocBudget=5000 \
-arch sm_90 -o out.cubin input.ptx
Knob names are case-insensitive. The name is resolved via ROT13-encoded lookup tables in GetKnobIndex (sub_6F0820 for DAG knobs, sub_79B240 for OCG knobs). An unrecognized name produces warning 7203: "Invalid knob specified (%s)".
Value Types
The value after = is parsed according to the knob's registered type:
| Type | Syntax | Example |
|---|---|---|
| Boolean | (no value) | -knob DisableCSE |
| Integer | decimal, 0x hex, 0 octal | -knob SchedNumBB_Limit=100 |
| Float | decimal with . | -knob CostWeight=0.75 |
| Double | decimal with . | -knob PriorityScale=1.5 |
| String | raw text | -knob DUMPIR=AllocateRegisters |
| Int-range | low..high | -knob AllowedRange=100..200 |
| Int-list | comma-separated | -knob TargetOpcodes=1,2,3,4 |
Conditional Overrides (WHEN=)
Knobs can be set conditionally based on shader or instruction hash, applied only when a specific function is compiled:
# Apply knob only when shader hash matches
ptxas -knob "WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200" -arch sm_90 -o out.cubin input.ptx
# Multiple conditional overrides separated by ~
ptxas -knob "WHEN=SH=0xDEAD;DisableCSE~WHEN=IH=0x1234;RegAllocBudget=1000" ...
Condition prefixes: SH= (shader hash), IH= (instruction hash), K= (direct knob, no condition).
Interaction with Other Knob Sources
KnobsInit (sub_79D990) processes knob sources in this order -- later sources override earlier ones for the same knob index:
| Priority | Source | Mechanism |
|---|---|---|
| 1 (lowest) | Environment variables | KnobsInitFromEnv (sub_79C9D0), comma-separated name=value pairs |
| 2 | Knobs file | ReadKnobsFile (sub_79D070), plain-text with [knobs] header |
| 3 | -knob CLI flags | Accumulated list-append from argv processing |
| 4 | PTX .pragma | Per-function; disabled by DisablePragmaKnobs knob |
| 5 (highest) | WHEN= overrides | Per-function conditional, matched by shader/instruction hash |
Environment Variable: DUMP_KNOBS_TO_FILE
The DUMP_KNOBS_TO_FILE environment variable causes ptxas to write all 1,294 knob names and their resolved values to a file:
DUMP_KNOBS_TO_FILE=/tmp/all_knobs.txt ptxas -arch sm_90 -o out.cubin input.ptx
This is the primary mechanism for discovering which knobs exist, their current defaults for a given architecture, and verifying that CLI overrides took effect.
Commonly Used Knobs
| Knob | Type | Purpose |
|---|---|---|
DUMPIR | string | Dump IR after a named phase (e.g., AllocateRegisters) |
DisableCSE | bool | Disable common subexpression elimination |
DisablePhases | string | +-delimited list of phases to skip |
SchedNumBB_Limit | int | Basic block limit for scheduling heuristic |
RegAllocBudget | int | Budget for register allocation cost model |
EmitLDCU | bool | Emit LDCU instructions (SM90: requires -forcetext -sso) |
IgnorePotentialMixedSizeProblems | bool | Suppress mixed-size register warnings |
DisablePragmaKnobs | bool | Ignore all .pragma knob directives in PTX |
For the complete knob type system, file format, and all 1,294 knob categories, see Knobs System.
Version and Architecture Queries
| Long Name | Short Name | Type | Default | Description |
|---|---|---|---|---|
--list-arch | -arch-ls | bool | -- | Print supported GPU architectures |
--list-version | -version-ls | bool | -- | Print supported PTX ISA versions |
Option Interaction Rules
Several options interact in non-obvious ways, as revealed by the validation logic in sub_434320:
- `--maxrregcount` dominance -- When `--maxrregcount` is specified, `--minnctapersm` and `--maxntid` are ignored. The register constraint calculator (sub_43B660) enforces this precedence.
- `--override-directive-values` -- Only affects `--minnctapersm`, `--maxntid`, and `--maxrregcount`. Without this flag, PTX directives (`.maxnreg`, `.minnctapersm`, `.maxntid`) take precedence over CLI values.
- `--device-function-maxrregcount` vs `--maxrregcount` -- The former overrides the latter for device functions only, and only under `--compile-only` mode. For whole-program compilation, `--device-function-maxrregcount` is ignored.
- `--Ofast-compile` vs `--fast-compile` -- The documented `--Ofast-compile` supersedes the internal `--fast-compile`. Both may conflict with `--allow-expensive-optimizations` (the validator in sub_434320 checks for this).
- `--device-debug` auto-enables checks -- Setting `-g` auto-enables `--sp-bounds-check` and `--g-tensor-memory-access-check`. The flag `--gno-tensor-memory-access-check` explicitly overrides this regardless of ordering.
- `--suppress-debug-info` requires debug info -- Has no effect unless `--device-debug` or `--generate-line-info` is also specified.
- `--compile-as-tools-patch` forces ABI minimum -- Automatically sets maxrregcount to the ABI minimum. Interacts with the `--sw200428197` workaround in the function/ABI setup path (sub_43F400).
- `--split-compile` and `--allow-expensive-optimizations` -- Both activate the thread pool (sub_1CB18B0). The jobserver client (sub_1CC7300) integrates with GNU Make's `--jobserver-auth=` to respect parallel build limits.
Function Map
| Address | Size | Identity |
|---|---|---|
0x403588 | 75 B | Usage printer (calls sub_1C97640) |
0x432A00 | 6,427 B | Option registration (~160 options) |
0x434320 | 10,289 B | Option parser and validator |
0x439880 | 2,935 B | Chrome trace JSON parser (--fdevice-time-trace) |
0x43A400 | 4,696 B | Target configuration (cache defaults, --sw4575628) |
0x43B660 | 3,843 B | Register/resource constraint calculator |
0x4428E0 | 13,774 B | PTX input setup (--compile-only, --extensible-whole-program)
0x446240 | 11,064 B | Compilation driver (options block consumer)
0x60B040 | 4,500 B | Stress test option handler |
0x703AB0 | 10,000 B | Binary-kind / capmerc CLI parser |
0x1C960C0 | ~1,500 B | Option parser constructor |
0x1C96680 | ~2,000 B | Argv processor |
0x1C97210 | ~1,500 B | Option value validator |
0x1C97640 | -- | Options help printer |
Knobs System
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The knobs system is ptxas's internal configuration mechanism -- a separate layer beneath the public CLI flags that exposes 1,294 tuning parameters to NVIDIA developers. Every significant compiler heuristic (register allocation thresholds, scheduling priorities, pass enable/disable, peephole rules) has a corresponding knob. The system is shared with cicc via a common header (generic_knobs_impl.h), but ptxas instantiates it twice: once for the DAG scheduler pipeline (99 knobs) and once for the OCG (Optimizing Code Generator) backend (1,195 knobs). All knob names are stored ROT13-encoded in the binary -- a lightweight obfuscation that defeats casual discovery with the `strings` utility while remaining trivially reversible.
The knobs infrastructure lives primarily in two address regions: 0x6F0000--0x6F8000 (DAG knob instantiation, shared with the Mercury SASS pipeline) and 0x797000--0x7A2000 (OCG knob instantiation, the larger set). Both regions are compiled from the same template in generic_knobs_impl.h.
| Total knobs | 1,294 (99 DAG + 1,195 OCG) |
| Source header | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h |
| DAG GetKnobIndex | sub_6F0820 (2,782 bytes) |
| OCG GetKnobIndex | sub_79B240 (518 bytes) |
| ParseKnobValue | sub_6F7360 / sub_79F540 (DAG: 18KB, OCG: 18KB) |
| ReadKnobsFile | sub_79D070 (9,879 bytes) |
| KnobsInit (master) | sub_79D990 (40,817 bytes) |
| KnobInit (per-knob) | sub_7A0C10 (13,874 bytes) |
| Knob descriptor | 64 bytes per entry |
| Knob runtime value | 72 bytes per slot |
| Name obfuscation | ROT13 with case-insensitive comparison |
| Setting mechanisms | -knob NAME=VALUE, knobs file ([knobs] header), PTX pragma, env var |
| Debug dump | DUMP_KNOBS_TO_FILE environment variable |
Architecture
┌──────────────────────────────────────────┐
│ KnobsInit (sub_79D990) │
│ Called once from global init sub_662920 │
└─────┬──────────┬──────────┬──────────────┘
│ │ │
┌─────────▼──┐ ┌───▼──────┐ ┌▼───────────────┐
│ ReadKnobsFile│ │ -knob CLI│ │ PTX pragma │
│ sub_79D070 │ │ parsing │ │ (unless │
│ [knobs] fmt │ │ │ │ DisablePragma) │
└─────────┬───┘ └───┬──────┘ └┬───────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ ParseKnobsString (sub_79B530) │
│ Handles WHEN=, INJECTSTRING, ~-delimited │
└──────────────────┬──────────────────────────┘
│
┌────────────▼────────────┐
│ GetKnobIndex │
│ sub_6F0820 (DAG) │
│ sub_79B240 (OCG) │
│ ROT13 decode + lookup │
└────────────┬─────────────┘
│
┌────────────▼────────────┐
│ ParseKnobValue │
│ sub_6F7360 (DAG) │
│ sub_79F540 (OCG) │
│ Type-specific parsing │
└────────────┬─────────────┘
│
┌────────────▼────────────┐
│ Runtime knob array │
│ 72 bytes per slot │
│ Accessed by index │
└──────────────────────────┘
ROT13 Name Obfuscation
Every knob name in the binary is stored as a ROT13-encoded string. The GetKnobIndex function decodes each character inline during comparison, without ever materializing the cleartext name in memory. The decode is combined with a case-insensitive tolower() comparison against the user-supplied query.
The inline ROT13 decode from sub_6F0820:
// For each character in the stored ROT13 name:
char c = stored_name[i];
if ((unsigned char)((c & 0xDF) - 65) <= 12)
c += 13; // A-M (or a-m) -> N-Z (or n-z)
else if ((unsigned char)((c & 0xDF) - 78) < 13)
c -= 13; // N-Z (or n-z) -> A-M (or a-m)
// Then compare case-insensitively:
if (tolower(query_char) != tolower(c))
goto mismatch;
The & 0xDF trick converts lowercase to uppercase before range-checking, so both 'a'-'m' and 'A'-'M' hit the first branch. Non-alphabetic characters pass through unchanged. This means knob names like SchedNumBB_Limit with underscores and digits are handled correctly -- only the alphabetic portion rotates.
To reverse-engineer knob names from the binary: extract the ROT13 strings from the knob definition table (64-byte stride at the table base pointer), apply ROT13, and you get the cleartext name.
Knob Descriptor Layout
Each knob is described by a 64-byte entry in the knob definition table. The table is an array at (knob_state + 16) with count at (knob_state + 24).
Offset Size Field
────── ──── ─────────────────────────────────────
+0 8 name_ptr Pointer to ROT13-encoded primary name
+8 8 name_len Length of primary name
+16 1 type_tag Knob type (OKT_* enum, 1-12)
+17 7 (padding)
+24 16 (reserved)
+40 8 alias_ptr Pointer to ROT13-encoded alias name
+48 8 alias_len Length of alias name
+56 8 (reserved)
────── ────
64 Total
Both primary and alias names are checked during lookup. A knob matches if either its primary name or alias decodes to the query string (case-insensitive). The alias mechanism allows backward-compatible renaming of knobs across toolkit versions.
Knob Value Layout
Runtime knob values are stored in a flat array of 72-byte slots at (knob_state + 72 * index). The slot layout depends on the type:
Offset Size Field
────── ──── ─────────────────────────────────────
+0 1 type_tag Runtime type (0=unset, 1-10)
+1 7 (padding)
+8 8 value / pointer Primary value (int32, int64, float, double, or pointer)
+16 8 list_begin For list types: first element pointer
+24 8 list_sentinel For list types: sentinel node
+32 4 aux_value Secondary value (per-type bookkeeping; see slot usage below)
+36 4 (padding)
+40 8 list_tail For list types: last element pointer
+48 8 list_head For list types: head pointer
+56 4 element_count For list types: number of elements
+60 4 (padding)
+64 8 allocator Arena allocator pointer for list/range types
────── ────
72 Total
The type tag at runtime differs from the definition-table type tag. The definition type drives parsing; the runtime type reflects what was actually stored:
| Runtime Type | Meaning | Payload |
|---|---|---|
| 0 | Unset / invalid | None |
| 1 | int32 | *(int32*)(slot + 8) |
| 2 | float | *(float*)(slot + 8) |
| 3 | double / int64 | *(int64*)(slot + 8) |
| 4 | boolean (true) | No payload; presence = true |
| 5 | string | *(char**)(slot + 8) |
| 6 | when-condition list | Doubly-linked list at +16..+48, count at +56 |
| 7 | int32 with secondary | *(int32*)(slot + 8), *(int32*)(slot + 12) |
| 8 | int-range | *(int32*)(slot + 8) = low, *(int32*)(slot + 12) = high |
| 9 | opcode-string-list | Doubly-linked list (same structure as type 6) |
| 10 | int-list (dynamic) | Growable array at +16, count at +24 |
Per-Type Slot Usage (confirmed from decompilation)
Types 1, 2, 3, 4, 5, 7, 8 -- scalar types using only bytes +0 through +15:
Type 1 (int32): +0 = 0x01, +8 = int32 value (4 bytes)
Type 2 (float): +0 = 0x02, +8 = float value (4 bytes, upper 4 undefined)
Type 3 (double): +0 = 0x03, +8 = double value (8 bytes)
Type 4 (boolean): +0 = 0x04 (no payload -- presence = true)
Type 5 (string): +0 = 0x05, +8 = char* pointer (8 bytes, NOT owned)
Type 7 (budget): +0 = 0x07, +8 = int32 primary, +12 = int32 secondary
Type 8 (int-range): +0 = 0x08, +8 = int32 low, +12 = int32 high
Types 6 and 9 -- doubly-linked list types using the full 72 bytes:
+0: byte type tag (6 or 9)
+8: ptr next pointer (initially 0)
+16: ptr → slot+24 (sentinel backward link)
+24: ptr → slot+8 (sentinel forward link)
+32: int64 (unused, set to 0)
+40: ptr tail of list
+48: ptr head of list
+56: int32 element count (starts at 2 for sentinel nodes)
+64: ptr arena allocator (for node allocation)
Each list node is 24 bytes, allocated from the arena at +64:
Type 6 node: [next(8), prev(8), string_ptr(8)]
Type 9 node: [next(8), prev(8), opcode_id(4) | int_value(4)]
Type 10 -- dynamic growable array:
+0: byte = 0x0A
+8: ptr arena allocator
+16: ptr array base (int32 elements, grown via sub_6EFD20)
+24: int32 element count (initialized to 0xFFFFFFFF = -1; first insert sets to 0)
The array grows by calling sub_6EFD20(slot+8, count+2) before each insertion, which reallocates if capacity is exceeded. Elements are 4-byte int32 values stored contiguously starting at the base pointer.
Knob Type System
The definition-table type tag (at descriptor offset +16) determines how ParseKnobValue interprets the value string. There are 10 logical knob types with 1,294 total registrations:
| Type Tag | Name | Count | Parse Rule |
|---|---|---|---|
| 1 | OKT_NONE | 139 | Boolean flag -- presence = true, no value needed |
| 2 | OKT_INT | 616 | strtol(value, NULL, 0) -- accepts decimal, hex (0x), octal (0) |
| 3 | OKT_BDGT | 88 | Same as INT but stores with secondary field zeroed (budget type) |
| 4 | OKT_IRNG | 8 | "lo..hi" range -- two integers separated by .. |
| 5 | OKT_ILIST | 3 | Comma-separated integers: "1,2,3,4" |
| 6 | OKT_FLOAT | 12 | sscanf(value, "%f", &result) |
| 7 | OKT_DBL | 100 | sscanf(value, "%lf", &result) |
| 8 | OKT_STR | 28 | Direct string assignment (pointer copy) |
| 9 | OKT_WHEN | 2 | When-condition string; parsed into linked list of condition nodes |
| 10 | OKT_OPCODE_STR_LIST | 4 | Opcode-name,integer pairs: "FADD,3,FMUL,2" |
| 11 | OKT_STR (variant) | — | Same as type 8 (alternate string slot) |
| 12 | OKT_ILIST (variant) | — | Int-list with pre-initialized allocator |
The INT type (616 knobs, 47.6%) dominates. These control thresholds, limits, and numeric heuristic parameters across the entire compiler. BDGT (budget) knobs (88) are semantically similar to INT but carry a secondary field used for budget-tracking in cost models. The 100 DBL knobs control floating-point heuristic weights (scheduling priorities, cost ratios, etc.).
Definition-Type to Runtime-Type Mapping
The definition-table type tag drives parsing; ParseKnobValue writes a different runtime type tag into the 72-byte slot. The mapping is not 1:1 -- several definition types collapse into the same runtime type, and compound types undergo a pre-initialization phase before the main parse:
| Def Type | Definition Name | Runtime Type | Runtime Name | Pre-init? |
|---|---|---|---|---|
| 1 | OKT_NONE | 4 | boolean (true) | No |
| 2 | OKT_INT | 1 | int32 | No |
| 3 | OKT_BDGT | 7 | int32 + secondary | No |
| 4 | OKT_IRNG | 8 | int-range (low, high) | No |
| 5 | OKT_ILIST | 10 | int-list (dynamic array) | No |
| 6 | OKT_FLOAT | 2 | float (single precision) | No |
| 7 | OKT_DBL | 3 | double (8-byte) | No |
| 8 | OKT_STR | 5 | string (pointer) | No |
| 9 | OKT_WHEN | 6 | linked list (when-condition) | Yes |
| 10 | OKT_OPCODE_STR_LIST | 9 | linked list (opcode-string) | Yes |
| 11 | OKT_STR (variant) | 5 | string (pointer) | No |
| 12 | OKT_ILIST (variant) | 10 | int-list (dynamic array) | Yes |
Types 11 and 12 are aliases: type 11 shares the exact handler with type 8 (both produce runtime type 5), and type 12 shares parsing logic with type 5 but its pre-switch initializes the allocator from the knob state object instead of inline.
ParseKnobValue Dispatch Algorithm
ParseKnobValue (sub_79F540, source lines 435--551 of generic_knobs_impl.h) implements a two-phase dispatch. The first switch pre-initializes compound types; the second switch parses the value string.
Phase 1 -- Pre-initialization (compound types only):
// v15 = definition type tag at (knob_descriptor + 16)
// v14 = runtime slot at (knob_state[9] + 72 * index)
switch (v15) {
case 9: // OKT_WHEN -> runtime type 6
KnobValueReset(v14);
v14[0] = 6;
// Initialize doubly-linked list with two sentinel nodes:
// +8 = 0 (next), +16 -> +24, +24 -> +8 (circular sentinels)
// +40 = tail, +48 = head, +56 = count (starts at 2)
// +64 = allocator from knob_state[1]
break;
case 10: // OKT_OPCODE_STR_LIST -> runtime type 9
KnobValueReset(v14);
v14[0] = 9;
// Same linked-list initialization as case 9
break;
case 12: // OKT_ILIST variant -> runtime type 10
KnobValueReset(v14);
v14[0] = 10;
*(ptr*)(v14 + 16) = NULL; // growable array base
*(ptr*)(v14 + 8) = allocator; // from knob_state[1]
*(int32*)(v14 + 24) = 0xFFFFFFFF; // sentinel count (-1)
break;
}
Phase 2 -- Value parsing (all types):
Type 1 (OKT_NONE, boolean): No value string needed. Stores runtime type 4 (boolean true). Presence alone indicates the knob is set.
Type 2 (OKT_INT, integer): Calls sub_6F71D0(value, NULL) -- a strtol wrapper with base 0, which auto-detects decimal, hex (0x prefix), and octal (0 prefix). Stores runtime type 1, value at slot+8 as int32.
Type 3 (OKT_BDGT, budget): Same integer parsing as type 2. Stores runtime type 7 with the primary value at slot+8 and the secondary (budget counter) at slot+12 zeroed. Cost models decrement the secondary field as optimization budget is consumed.
Type 4 (OKT_IRNG, integer range): Parses "low..high" format with these edge cases:
"100..200" -> low=100, high=200 Standard range
"100.." -> low=100, high=0x7FFFFFFF Open upper bound
"..200" -> low=0x80000000, high=200 Open lower bound
".." -> low=0x80000000, high=0x7FFFFFFF Full range
"42" -> low=42, high=42 Degenerate (single value)
"" -> error "Empty integer range value"
The .. separator is detected by checking *endptr == '.' && endptr[1] == '.'. Default bounds are INT_MIN (0x80000000) and INT_MAX (0x7FFFFFFF). Stores runtime type 8 with low at slot+8, high at slot+12.
Type 5 (OKT_ILIST, integer list): Parses comma-separated integers. Validation requires each element to start with a digit or -. Uses a growable array (runtime type 10) at slot+16, grown via sub_6EFD20(slot+8, count+2) before each insertion. Elements are 4-byte int32 values stored contiguously. Example: "1,2,3,4" produces a 4-element array.
Type 6 (OKT_FLOAT, float): Calls sscanf(value, "%f", &result). Stores runtime type 2, value at slot+8 as a 4-byte IEEE 754 single. Returns error "Invalid floating point value" if sscanf does not return 1.
Type 7 (OKT_DBL, double): Calls sscanf(value, "%lf", &result). Stores runtime type 3, value at slot+8 as an 8-byte IEEE 754 double. Returns error "Invalid double value" if sscanf does not return 1.
Type 8/11 (OKT_STR, string): Both handled identically. Stores runtime type 5 with a direct pointer copy: *(char**)(slot+8) = value. The string is NOT duplicated -- the pointer references the original buffer, so the caller must ensure the string's lifetime exceeds the knob's.
Type 9 (OKT_WHEN, when-condition): Pre-switch already initialized the linked list (runtime type 6). Allocates a 24-byte node via the allocator's vtable (allocator_vtable[3](allocator, 24)). Node layout: [next_ptr(8), prev_ptr(8), string_ptr(8)]. The condition string pointer is stored at node+16. Nodes are inserted at the tail of the doubly-linked list. Error if value is NULL; empty string is permitted.
Type 10 (OKT_OPCODE_STR_LIST, value-pair list): Pre-switch already initialized the linked list (runtime type 9). Parsing loop:
- Call `vtable+40` to split the next comma-delimited token into opcode name and integer value strings
- If the opcode name is NULL: error `"Empty opcode string"` (line 520)
- If the integer value is NULL: error `"Empty integer value"` (line 522)
- Parse the integer via `strtol(nptr, 0, 10)` (base 10 only, unlike OKT_INT)
- Resolve the opcode name to an internal ID via `vtable+56` (SASS opcode table lookup)
- Allocate a 24-byte node: `[next(8), prev(8), opcode_id(4) | int_value(4)]`
- Insert into the linked list; loop until the input is exhausted
Format: "FADD,3,FMUL,2" produces two nodes: (FADD_id, 3) and (FMUL_id, 2). The opcode resolution uses the same 11,240-byte opcode recognition table as the peephole optimizer.
Type 12 (OKT_ILIST variant, opcode list): Pre-switch already initialized the growable array (runtime type 10). Parsing loop:
- Call `vtable+64` to extract the next comma-delimited opcode name
- Resolve it to an internal ID via `vtable+56`
- Grow the array via `sub_6EFD20(slot+8, count+2)`
- Store the opcode ID as an `int32` in the array
Format: "FADD,FMUL,IADD3" -- opcode names only, no integers. Each is resolved to its internal opcode ID.
Default: Error "Invalid knob type" (line 551).
Parse Error Messages
ParseKnobValue (sub_79F540 / sub_6F7360) produces these diagnostic strings on parse failure:
| Error String | Source Line | Def Type | Condition |
|---|---|---|---|
"Empty when-string" | 435 | 9 | WHEN knob with NULL value |
"Empty integer range value" | 445 | 4 | IRNG knob with NULL or empty value |
"Empty integer list value" | 451 | 5 | ILIST knob with NULL or empty value |
"Integer list value is not an integer" | 453 | 5 | First char not digit or - |
"End of integer range value is not ',' or null character" | 457 | 5 | ILIST terminator not , or \0 |
"Empty integer value" | 470 | 2 | INT knob with NULL or empty value |
"Empty integer value" | 478 | 3 | BDGT knob with NULL or empty value |
"Empty floating point value" | 491 | 6 | FLOAT knob with NULL or empty value |
"Invalid floating point value" | 496 | 6 | sscanf returns != 1 |
"Empty double value" | 502 | 7 | DBL knob with NULL or empty value |
"Invalid double value" | 506 | 7 | sscanf returns != 1 |
"Empty value pair list" | 515 | 10 | OPCODE_STR_LIST with NULL value |
"Empty opcode string" | 520 | 10 | Opcode name resolves to NULL |
"Empty integer value" | 522 | 10 | Integer after opcode resolves to NULL |
"Empty opcode list" | 536 | 12 | Opcode-list variant with NULL value |
"Invalid knob type" | 551 | — | Unrecognized type tag in definition table |
"Invalid knob identifier" | 395 | — | GetKnobIndex -- name not found |
All errors carry source attribution: generic_knobs_impl.h with a line number and function name ("GetKnobIndex", "ParseKnobValue", "ReadKnobsFile"). Error constructors: sub_79CDB0 (simple format string) and sub_79AED0 (format with knob name and value context).
Setting Knobs
Method 1: -knob CLI Flag
ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE=1 input.ptx -o output.cubin
Multiple -knob flags accumulate. Each is parsed by KnobsInit (sub_79D990) during startup. The knob name is looked up via GetKnobIndex, then the value is parsed according to the knob's type.
Method 2: Knobs File
A knobs file is a plain-text file with a required [knobs] section header:
; Comments or metadata can appear before the header.
; ReadKnobsFile ignores everything until [knobs] is found.
[knobs]
SchedNumBB_Limit=100
DisableCSE=1
RegAllocBudget=5000
; WHEN= syntax is also supported inside the file:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200
ReadKnobsFile (sub_79D070, source lines 1060--1090 of generic_knobs_impl.h) processes the file:
1. fopen(path, "r") line ~1060
2. fseek(file, 0, SEEK_END) line 1075
3. size = ftell(file) line 1075
4. fseek(file, 0, SEEK_SET) line 1075
5. buffer = allocator->vtable[2](allocator, size+1) (heap alloc)
6. bytes = fread(buffer, 1, size, file) line 1070
7. buffer[bytes] = '\0' (null-terminate)
8. marker = strstr(buffer, "[knobs]") line 1065
9. if (!marker) error "Knobs header not found"
10. content = marker + 7 (skip "[knobs]")
11. vtable[4](result, knob_state, content, 0) (parse callback)
12. fclose(file) line 1085
Key implementation details:
- **Entire file read at once.** The file is `fseek`/`ftell`-measured, then `fread` into a single buffer of `size + 1` bytes. No line-by-line streaming.
- **`strstr`-based header detection.** The `[knobs]` marker is located via `strstr`, so it can appear anywhere in the file -- not necessarily on the first line. Everything before it (comments, version metadata, other INI sections) is silently ignored.
- **Parsing starts at marker+7.** Exactly 7 characters (`[knobs]`) are skipped. The parse callback is ParseKnobsString (sub_79B530), which processes newline-delimited `key=value` pairs. The `~` separator and `WHEN=` conditional syntax are supported.
- **Result/Expected monad.** Every I/O operation has a corresponding error path. Errors are accumulated via sub_79A3D0 (ErrorChainAppend) and propagated through a tagged result object. Multiple errors from a single file are chained, not short-circuited.
Error strings with source line numbers:
| Error String | Source Line | Condition |
|---|---|---|
"fseek() error knobsfile %s" | 1075 | fseek(SEEK_END) or fseek(SEEK_SET) fails |
"fseek() error for knobsfile %s" | 1080 | fseek(SEEK_END) fails (alternate path) |
"fread() error knobsfile %s" | 1070 | fread returns <= 0 |
"Knobs header not found in %s" | 1065 | strstr(buffer, "[knobs]") returns NULL |
"fclose() error for knobsfile %s" | 1085 | fclose returns non-zero |
Method 3: PTX Pragma
Knobs can be set from PTX source via .pragma directives, unless the DisablePragmaKnobs knob is set. The pragma string is copied into a temporary buffer and parsed by ParseKnobsString (sub_79B530), following the same key=value syntax.
Method 4: WHEN= Conditional Overrides
The most powerful mechanism allows setting knobs conditionally, based on shader hash or instruction hash. The override string uses ~ (tilde) as a record separator:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200~WHEN=IH=0x12345;DisableCSE=1
ParseKnobsString (sub_79B530) recognizes these prefixes (case-insensitive):
- `WHEN=` -- conditional knob application
- `SH=` -- match by shader hash (decimal, hex with `0x`, or range with `..`)
- `IH=` -- match by instruction hash
- `K=` -- direct knob setting (no condition)
- `INJECTSTRING` -- special directive terminated by `;;` (double semicolon)
The full conditional override system is parsed by ParseKnobOverrides (sub_79C210), which iterates a linked list of override entries at knob_state + 68904. Each entry carries the condition (hash match criterion) and the knob assignment to apply when matched.
Hash matching uses FNV-1a (magic 0x811C9DC5, prime 16777619) for the per-function override table lookup at ctx+120 → +1128. See IsPassDisabledFull (sub_7992A0).
Priority Order
When the same knob is set by multiple mechanisms, the last write wins. KnobsInit (sub_79D990) processes sources in this order:
1. Environment variable overrides (`getenv`)
2. Knobs file (if specified via `-knobs-file` or equivalent)
3. `-knob` CLI flags
4. PTX pragma knobs (applied per-function at compile time)
5. `WHEN=` conditional overrides (applied per-function when hash matches)
Later sources override earlier ones for the same knob index.
Two Instantiations: DAG and OCG
The knob system is a C++ template instantiated twice with different knob definition tables:
DAG Knobs (sub_6F0820)
The DAG (Directed Acyclic Graph) scheduler knob table contains 99 entries. These control the Mercury SASS pipeline: instruction expansion, WAR hazard handling, scoreboard configuration, and the decode/expand/opex pipeline stages.
| Property | Value |
|---|---|
| GetKnobIndex | sub_6F0820 |
| ParseKnobValue | sub_6F7360 |
| InitializeKnobs | sub_6F68C0 (9KB, 24 references to generic_knobs_impl.h) |
| Table size | 99 entries x 64 bytes = 6,336 bytes |
DAG knobs referenced in the binary include knob indices 8 and 17 (pipeline options in sub_6F52F0), 16 (WAR generation options in sub_6FBC20), and 743/747 (expansion options in sub_6FFDC0).
OCG Knobs (sub_79B240)
The OCG (Optimizing Code Generator) knob table contains 1,195 entries -- the vast majority of all knobs. These control the optimization passes, register allocation, instruction scheduling, and code generation.
| Property | Value |
|---|---|
| GetKnobIndex | sub_79B240 |
| ParseKnobValue | sub_79F540 |
| KnobsInit | sub_79D990 (40,817 bytes, master initializer) |
| KnobInit | sub_7A0C10 (per-knob state constructor) |
| Table size | 1,195 entries x 64 bytes = 76,480 bytes |
| Runtime values | 1,195 entries x 72 bytes = 86,040 bytes |
OCG knob indices referenced across the codebase include: 185 (pass-disable string, offset 13320), 294 (epilogue instruction count, used in tepid scheduling), 487 (LoopMakeSingleEntry enablement), 956-957 (shader hint settings at offsets 68832/68904).
Knob State Object
The master knob state object is constructed by KnobInit (sub_7A0C10):
Offset Size Field
──────── ────── ──────────────────────────────
+0 8 vtable pointer (off_21C0738)
+8 8 arena allocator
+16 8 knob definition table pointer
+24 8 knob count
+32 40 (zero-initialized control fields)
+72 var knob value array (72 * count bytes)
+80 4 max knob index (initially 0xFFFFFFFF)
+88 16 DUMP_KNOBS_TO_FILE path (growable string)
The vtable at off_21C0738 provides virtual methods for knob access:
- `vtable+72`: `IsKnobSet(index)` -- check whether a knob has a value
- `vtable+152`: `GetKnobIntValue(index)` -- retrieve the int32 value
- Additional slots cover bool, string, and double retrieval
Knob Access Helpers
Throughout the codebase, knobs are accessed by index via small helper functions:
| Function | Address | Purpose |
|---|---|---|
GetKnobIntValue | sub_7A1B80 | Returns *(int32*)(state + 72*idx + 8) |
GetKnobBoolValue | sub_7A1CC0 | Checks type == 4, returns presence |
GetKnobStringValue | sub_7A1E10 | Returns string pointer (type 5/8) |
SetKnobValue | sub_7A2860 | Writes value with optional WHEN=SH= condition |
IsKnobSet | (inlined) | Checks *(byte*)(state + 72*idx) != 0 |
Access is O(1) by index -- no hash lookup or name comparison at runtime. The GetKnobIndex name-to-index translation happens only during initialization.
Pass Disable Mechanism
The knobs system provides a string-based pass disable mechanism through knob index 185 (OCG offset 13320). The string contains +-delimited pass names:
-knob DisablePhases=LoopMakeSingleEntry+SinkCodeIntoBlock
Two check functions consult this string:
IsPassDisabled (sub_799250)
Simple version. Reads the disable flag byte at ctx+13320:
- If byte == 0: no pass-disable configured, returns false
- If byte == 5: string pointer at
ctx+13328, performs substring match viasub_6E1520(strcasestr-like)
Called from 16+ sites across the codebase: sub_78B430 (LoopMakeSingleEntry), sub_78DB70 (SinkCodeIntoBlock), sub_8236B0, sub_8D0640, sub_8F45E0, and others.
IsPassDisabledFull (sub_7992A0)
Full version with per-function overrides. First checks a per-function hash table at ctx+120 → +1128 using FNV-1a on the function identifier. If the function has a specific override entry, reads the disable string from there. Otherwise falls back to the global disable string at ctx+72 → +13320.
// FNV-1a hash for per-function lookup
uint32_t hash = 0x811C9DC5;
for (each byte b in function_id)
hash = 16777619 * (hash ^ b);
uint32_t bucket = hash & (table_size - 1);
The + character is used as a delimiter between alternative phase names in the disable string, allowing "phaseA+phaseB" to match either name.
NamedPhases Parser (sub_798B60)
Parses a comma-separated list of name=value pairs into parallel arrays (max 256 entries). Used by KnobsInitFromEnv (sub_79C9D0) to process environment variable-based knob overrides.
Input: "knob1=value1,knob2=value2,knob3=value3"
Output: names[256], values[256], full_strings[256]
Knob Categories
The 1,294 knobs cluster into functional categories. Prefix analysis of decoded knob names reveals these major groups:
| Prefix | Count | Domain |
|---|---|---|
| Sched* / PostSched* / Sb* | 89 | Instruction scheduling heuristics and thresholds |
| RegAlloc* / Reg* | 87 | Register allocation parameters, spill cost model, target selection |
| Disable* | 75 | Pass/feature disable switches (boolean) |
| Remat* / SinkRemat* | 35 | Rematerialization cost model, enable switches, placement control |
| Mercury* / Merc* | 21 | Mercury encoder configuration |
| URF* | 24 | Uniform Register File optimization |
| Enable* | 19 | Pass/feature enable switches (boolean) |
| Dump* | 15 | Debug dump controls (DUMPIR, DumpSched, etc.) |
| Peephole* | ~20 | Peephole optimization rules |
| Loop* | ~15 | Loop optimization parameters |
| Sync* / Barrier* | ~12 | Synchronization and barrier handling |
| WAR* | ~8 | Write-after-read hazard parameters |
| GMMA* / MMA* | ~10 | Matrix multiply-accumulate configuration |
| Spill* | ~8 | Spill code generation parameters |
| Budget* | ~10 | Cost model budgets (BDGT type knobs) |
| Copy* / CSE* | ~8 | Copy propagation and CSE parameters |
| (other) | ~577 | Miscellaneous per-pass tuning knobs |
Notable Individual Knobs
Selected knobs referenced by address in the binary:
| Index | Name (decoded) | Type | Referenced At | Purpose |
|---|---|---|---|---|
| 8 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 16 | (WAR generation) | INT | sub_6FBC20 | WAR pass behavior |
| 17 | (DAG pipeline) | INT | sub_6F52F0 | Pipeline option flag |
| 185 | (pass-disable string) | STR | sub_799250, sub_7992A0 | DisablePhases string |
| 294 | (epilogue count) | INT | sub_7A46E0 | Tepid scheduling divisor |
| 487 | (loop single-entry) | BOOL | sub_78B430 | LoopMakeSingleEntry enable |
| 743 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 747 | (expansion option) | INT | sub_6FFDC0 | Mercury expansion control |
| 956 | (shader hint) | — | sub_79C210 | Shader hint knob (offset 68832) |
| 957 | (shader hint) | — | sub_79C210 | Shader hint linked list (offset 68904) |
Register Allocation Knobs (87 knobs, indices 613--699)
The register allocator is the most heavily parameterized subsystem in ptxas. Its 87 knobs span indices 613 through 699 in the OCG knob table, registered in ctor_005 at addresses 0x4197F0--0x41B2E0. The knobs cluster into seven functional sub-categories. All names decoded from ROT13 strings at 0x21B9730--0x21BA6C0.
A. Spill Cost Model (26 knobs)
The spill guidance engine (sub_96D940, 84 KB) uses these knobs to compute per-candidate spill costs. The model multiplies hardware-specific latency and resource metrics by configurable scale factors, then applies threshold-based activation logic.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 658 | RegAllocSpillBarriersAcrossSuspend | NONE | Enable spill barriers across suspend points |
| 659 | RegAllocSpillBit | INT | Master spill-bit mode selector |
| 660 | RegAllocSpillBitHighRegCountHeur | INT | High register count heuristic for spill-bit decisions |
| 661 | RegAllocSpillBitHighRegScale | DBL | Scale factor for high-register-count spill cost |
| 662 | RegAllocSpillBitInfPerRegThreshold | INT | Interference-per-register threshold for spill-bit activation |
| 663 | RegAllocSpillBitLowRegCountHeur | INT | Low register count heuristic for spill-bit decisions |
| 664 | RegAllocSpillBitLowRegScale | DBL | Scale factor for low-register-count spill cost |
| 665 | RegAllocSpillBitMediumRegScale | DBL | Scale factor for medium-register-count spill cost |
| 666 | RegAllocSpillBitNonRematSpillThreshold | INT | Threshold for non-rematerializable spill-bit activation |
| 667 | RegAllocSpillBitRLivePerRegThreshold | INT | Live-per-register threshold for R-type spill decisions |
| 668 | RegAllocSpillBitRLiveThreshold | INT | Global R-live threshold for spill activation |
| 669 | RegAllocSpillForceXBlockHoistRefill | INT | Force cross-block hoisting of refill instructions |
| 670 | RegAllocSpillLatencyScale | DBL | Scale factor for latency in spill cost model |
| 671 | RegAllocSpillLatencyScale2 | DBL | Secondary latency scale (nested loops) |
| 672 | RegAllocSpillMemResScale | DBL | Scale factor for memory resource pressure in spill cost |
| 673 | RegAllocSpillMioHeavyThreshold | DBL | Threshold for MIO-heavy (memory-intensive) spill classification |
| 674 | RegAllocSpillOptBudget | BDGT | Budget for spill optimization passes |
| 675 | RegAllocSpillResourceScale | DBL | Scale factor for resource usage in spill cost |
| 676 | RegAllocSpillResCostsScale | DBL | Scale factor for resource costs (secondary weighting) |
| 677 | RegAllocSpillReturnRegister | INT | Spill handling mode for return-value registers |
| 678 | RegAllocSpillSmemFlatMode | INT | Shared memory spill: flat addressing mode selector |
| 679 | RegAllocSpillSmemLatencyScale | DBL | Scale factor for shared-memory spill latency |
| 680 | RegAllocSpillTexDepScale | DBL | Scale factor for texture dependency in spill cost |
| 681 | RegAllocSpillValidateDebug | INT | Debug: validate spill correctness (0=off, >0=level) |
| 682 | RegAllocSpillXBlock | INT | Cross-block spill mode (hoist/refill strategy) |
| 683 | RegAllocSpillXBlock2 | INT | Secondary cross-block spill mode |
The cost model uses three register-count tiers (low/medium/high), each with independent scale factors (664, 665, 661). The tier boundaries are set by the heuristic knobs (663, 660). Latency scales (670, 671) multiply the estimated stall cycles, while resource scales (672, 675, 676) multiply memory bandwidth consumption. The MIO-heavy threshold (673) triggers a separate cost path when the basic block is already saturated with memory operations.
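A hedged sketch of how the tiered composition plausibly fits together (the knob indices are real; the formula structure is inferred from the knob semantics, not recovered from sub_96D940):

```c
/* Illustrative spill-cost composition using the tier knobs above. */
typedef struct {
    int    low_count_heur;   /* knob 663: low tier boundary */
    int    high_count_heur;  /* knob 660: high tier boundary */
    double low_scale;        /* knob 664 */
    double medium_scale;     /* knob 665 */
    double high_scale;       /* knob 661 */
    double latency_scale;    /* knob 670: multiplies stall cycles */
    double mem_res_scale;    /* knob 672: multiplies memory pressure */
} SpillCostKnobs;

static double spill_cost(const SpillCostKnobs *k, int live_regs,
                         double stall_cycles, double mem_pressure) {
    double tier_scale =
        live_regs < k->low_count_heur  ? k->low_scale  :
        live_regs < k->high_count_heur ? k->medium_scale :
                                         k->high_scale;
    return tier_scale * (k->latency_scale * stall_cycles +
                         k->mem_res_scale * mem_pressure);
}
```

The MIO-heavy path (knob 673) would branch to a separate cost function before this computation; it is omitted here.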
B. Rematerialization (12 knobs)
Rematerialization recomputes values instead of spilling them. The allocator treats remat as a first-class spill alternative with its own budget and candidate ordering.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 619 | RegAllocCtxSensitiveRemat | INT | Enable context-sensitive rematerialization |
| 622 | RegAllocEnableOptimizedRemat | INT | Enable optimized rematerialization pass |
| 627 | RegAllocLiveRemat | INT | Enable live-range-aware rematerialization |
| 632 | RegAllocMaxRematHeight | INT | Max expression DAG height for remat candidates |
| 633 | RegAllocMaxRematInst | INT | Max instructions in a remat sequence |
| 635 | RegAllocMultiRegclassRemat | INT | Enable remat across multiple register classes |
| 636 | RegAllocMultiRegRemat | INT | Enable multi-register rematerialization |
| 637 | RegAllocMultiRegRematBudget | BDGT | Budget for multi-register remat attempts |
| 650 | RegAllocRematDisableRange | IRNG | Disable remat for instruction index range lo..hi |
| 651 | RegAllocRematEnable | INT | Master enable for rematerialization (0=off) |
| 652 | RegAllocRematReuseBudget | BDGT | Budget for remat-reuse optimization attempts |
| 654 | RegAllocOrderRematCandHeuristic | INT | Heuristic for ordering remat candidates |
Knob 650 (RegAllocRematDisableRange) is unique as the only IRNG-type knob in the set, accepting "lo..hi" to disable rematerialization for a range of instruction indices -- a debugging aid for bisecting remat-related miscompiles.
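The "lo..hi" syntax can be handled as sketched below (illustrative; the actual IRNG parser in the knob subsystem has not been identified by address):

```c
#include <stdio.h>

/* Parse an OKT_IRNG value of the form "lo..hi", e.g. "10..42". */
static int parse_irange(const char *s, int *lo, int *hi) {
    return sscanf(s, "%d..%d", lo, hi) == 2;
}

/* With the range set, remat would be skipped for any instruction
 * index i satisfying lo <= i && i <= hi. */
```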
C. Pre-Assignment / MAC (8 knobs)
MAC (Machine-level Allocation with Constraints) pre-assigns physical registers to high-priority operands before the main Fatpoint allocator runs. Entry: sub_94A020 (331 lines).
| Index | Name | Type | Purpose |
|---|---|---|---|
| 613 | RegAllocAvoidBankConflictMac | INT | Enable bank-conflict-aware MAC pre-assignment |
| 614 | RegAllocAvoidBankConflictMacPenalty | INT | Penalty weight for bank conflicts during MAC pre-assignment |
| 615 | RegAllocAvoidBankConflictMacWindowSize | INT | Instruction window size for bank conflict analysis |
| 628 | RegAllocMacForce | NONE | Force MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 630 | RegAllocMacVregAllocOrderCompileTime | INT | Compile-time variant of MAC vreg allocation order |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 647 | RegAllocPrefMacOperandsMaxDepth | INT | Max operand chain depth for MAC preference propagation |
D. Coalescing (3 knobs)
Register coalescing eliminates unnecessary register-to-register copies by merging live ranges.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 617 | RegAllocCoalesceBudget | BDGT | Budget limit for coalescing iterations |
| 618 | RegAllocCoalescing | NONE | Enable register coalescing |
| 634 | RegAllocMmaCoalescing | NONE | Enable MMA-specific coalescing |
E. Performance-Difference Backoff (5 knobs)
Progressive constraint relaxation: on retry iteration N, if the performance difference exceeds a limit, constraints relax between the begin and end iterations.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 641 | RegAllocPerfDiffBackoff | NONE | Enable perf-diff based constraint backoff |
| 642 | RegAllocPerfDiffBackoffBegin | INT | Iteration at which backoff begins |
| 643 | RegAllocPerfDiffBackoffEnd | INT | Iteration at which full relaxation is reached |
| 644 | RegAllocPerfDiffConflictWeight | INT | Weight factor for conflicts in perf-diff calculation |
| 645 | RegAllocPerfDiffLimit | INT | Performance difference limit triggering relaxation |
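The Begin/End pair suggests a linear ramp between the two iterations; a sketch of that model (the actual interpolation inside the allocator retry loop is not recovered):

```c
/* 0.0 = full constraints, 1.0 = fully relaxed. `begin` corresponds
 * to knob 642, `end` to knob 643. */
static double backoff_relaxation(int iteration, int begin, int end) {
    if (iteration <= begin) return 0.0;
    if (iteration >= end)   return 1.0;
    return (double)(iteration - begin) / (double)(end - begin);
}
```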
F. Register Target Selection (13 knobs)
The target selection phase determines how many physical registers to aim for -- the occupancy/performance tradeoff. More registers per thread means fewer warps can execute concurrently.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 687 | RegTargetList | ILIST | Comma-separated list of target register counts to try |
| 688 | RegTgtLowerLimitMMASlack | INT | Slack added to MMA lower register limit |
| 689 | RegTgtLowerLimitTCGENSlack | INT | Slack added to TCGEN lower register limit |
| 690 | RegTgtLowerLimitSPARSIFYSlack | INT | Slack added to SPARSIFY lower register limit |
| 691 | RegTgtLowerLimitDECOMPRESSSlack | INT | Slack added to DECOMPRESS lower register limit |
| 692 | RegTgtSelHigherWarpCntHeur | INT | Heuristic mode for higher-warp-count target selection |
| 693 | RegTgtSelHigherWarpCntHeurValue | DBL | Weight value for higher-warp-count heuristic |
| 694 | RegTgtSelHighLiveRangeHeurValue | DBL | Weight for high-live-range target selection heuristic |
| 695 | RegTgtSelLowerWarpCntHeur | INT | Heuristic mode for lower-warp-count target selection |
| 696 | RegTgtSelLowerWarpCntHeurValue | DBL | Weight value for lower-warp-count heuristic |
| 697 | RegTgtSelLowLiveRangeHeurValue | DBL | Weight for low-live-range target selection heuristic |
| 698 | RegTgtSelWithSMemSpillHeur | INT | Heuristic mode when shared-memory spilling is active |
| 699 | RegUsageLevel | INT | Register usage reporting level |
The four "Slack" knobs (688--691) fine-tune lower register limits for specific architectural features that have minimum register requirements: MMA (matrix multiply), TCGEN (tensor core generation), SPARSIFY (structured sparsity), DECOMPRESS (decompression).
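The occupancy arithmetic driving target selection follows the standard register-file constraint; a simplified sketch (the real selector additionally applies the warp-count and live-range heuristic weights above, and hardware limits vary per sm_XX target):

```c
/* Warps resident per SM as limited by the register file. The 64K
 * registers / 64 warps / 32 threads-per-warp figures are a common
 * SM configuration, used here only for illustration; real targets
 * also round allocations to a hardware granularity. */
static int warps_by_regs(int regs_per_thread) {
    const int regfile = 65536, threads_per_warp = 32, max_warps = 64;
    if (regs_per_thread <= 0) return max_warps;
    int w = regfile / (regs_per_thread * threads_per_warp);
    return w < max_warps ? w : max_warps;
}
```

Fewer registers per thread raises resident warp count (latency hiding) at the cost of more spilling, which is exactly the tradeoff RegTargetList enumerates candidate points along.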
G. General Allocation Control (12 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 616 | RegAllocCacheSize | INT | Cache size parameter for interference graph |
| 620 | RegAllocDebugConflictDetails | INT | Debug: print conflict graph details (verbosity level) |
| 621 | RegAllocDepDistanceThresholdForHighConflicts | INT | Dep-distance threshold above which high-conflict registers are deprioritized |
| 624 | RegAllocIndexAbiScratchRegs | INT | Index into ABI scratch register set |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 640 | RegAllocOptLevel | INT | Regalloc optimization level (controls aggressiveness) |
| 648 | RegAllocPrintDetails | NONE | Enable detailed regalloc diagnostic printing |
| 649 | RegAllocRefineInf | INT | Refine interference graph iteration limit |
| 653 | RegAllocOptimizeABI | INT | Enable ABI-aware register optimization (setmaxnreg handling) |
| 655 | RegAllocReportMaxRegsAllowed | INT | Report maximum registers allowed per thread (diagnostic) |
| 656 | RegAllocCudaSmemSpillEnable | INT | Enable CUDA shared memory spill path |
| 685 | RegAllocUserSmemBytesPerCTA | INT | User-specified shared memory bytes per CTA (overrides computed) |
H. Miscellaneous (8 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 623 | RegAllocEstimatedLoopIterations | STR | String hint providing estimated loop iteration counts for spill cost weighting |
| 625 | RegAllocL1SpillRegThres | INT | Register count threshold for L1 spill mode activation |
| 626 | RegAllocL1SpillScale | DBL | Scale factor for L1 cache spill cost |
| 631 | RegAllocMaxGmmaDisallowedReg | INT | Max registers disallowed during GMMA (warp group MMA) allocation |
| 638 | RegAllocNoRetargetPrefs | NONE | Disable retarget-preference optimization |
| 657 | RegAllocSortRegs | INT | Sorting order for register candidates during allocation |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference count above which conflicts are discarded (default 50) |
| 686 | RegAttrReuseVectorBudget | BDGT | Budget for register-attribute vector reuse optimization |
Scheduling Knobs (89 knobs, indices 229--978)
The instruction scheduler is the second most heavily parameterized subsystem after register allocation. Its 89 knobs span two contiguous blocks (indices 738--811 for the core Sched* set, and 569--574 for the PostSched* set) plus 11 scattered entries for scheduling-adjacent features. All names decoded from ROT13 strings at 0x21B6CB0--0x21BE100, registered in ctor_005 at code addresses 0x411FF0--0x420A00.
The knobs control every aspect of the list scheduler: how latencies are modeled, which functional units are treated as busy, how aggressively cross-block motion is attempted, and how register pressure feedback loops interact with the priority function. Three Blackwell-era SchedResBusy* knobs (QMMA at 964, OMMA at 977, MXQMMA at 978) sit outside the main block because they were appended in a later toolkit version for new MMA unit types.
A. Resource Busy Overrides (23 knobs)
The SchedResBusy* knobs override the hardware-profile resource busy times for individual functional units. Each knob sets the number of cycles the named unit is considered occupied after issuing an instruction to it. When unset, the scheduler uses the value from the latency model's per-SM hardware profile. Setting a SchedResBusy* knob to 0 effectively makes the unit appear always free to the scheduler.
Two knobs accept string values instead of integers: SchedResBusyOp and SchedResBusyMachineOpcode take a string identifying a specific opcode or machine opcode to override, enabling per-instruction busy-time tuning.
| Index | Name | Type | Functional Unit |
|---|---|---|---|
| 781 | SchedResBusyADU | INT | Address divergence unit |
| 782 | SchedResBusyALU | INT | Arithmetic logic unit |
| 783 | SchedResBusyCBU | INT | Convergence barrier unit |
| 784 | SchedResBusyDMMA | INT | Double-precision MMA unit |
| 785 | SchedResBusyFMA | INT | Fused multiply-add unit |
| 786 | SchedResBusyFMAWide | INT | Wide FMA unit (multi-cycle) |
| 787 | SchedResBusyFP16 | INT | Half-precision FP unit |
| 788 | SchedResBusyFP64 | INT | Double-precision FP unit |
| 789 | SchedResBusyGMMA | INT | Warp group MMA (WGMMA) unit |
| 790 | SchedResBusyHMMA16 | INT | Half-precision MMA, 16-wide |
| 791 | SchedResBusyHMMA16816 | INT | Half-precision MMA, 16x8x16 shape |
| 792 | SchedResBusyHMMA1688 | INT | Half-precision MMA, 16x8x8 shape |
| 793 | SchedResBusyHMMA32 | INT | Half-precision MMA, 32-wide |
| 794 | SchedResBusyIMMA | INT | Integer MMA unit |
| 795 | SchedResBusyLSU | INT | Load/store unit |
| 796 | SchedResBusyLSUL1 | INT | Load/store unit (L1 path) |
| 797 | SchedResBusyOp | STR | Per-opcode override (string: opcode name) |
| 798 | SchedResBusyMachineOpcode | STR | Per-machine-opcode override (string) |
| 799 | SchedResBusyUDP | INT | Uniform datapath unit |
| 800 | SchedResBusyXU64 | INT | Extended-precision (64-bit) unit |
| 964 | SchedResBusyQMMA | INT | Quarter-precision MMA unit (Blackwell) |
| 977 | SchedResBusyOMMA | INT | Octal MMA unit (Blackwell) |
| 978 | SchedResBusyMXQMMA | INT | MX-quantized MMA unit (Blackwell) |
The four HMMA variants (790--793) correspond to different tensor core shapes: HMMA16 for 16-wide half-precision, HMMA1688 for the 16x8x8 tile used on Volta/Turing, HMMA16816 for the 16x8x16 tile used on Ampere+, and HMMA32 for 32-wide half-precision operations. IMMA (794) handles integer tensor operations (INT8/INT4).
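The override semantics described above reduce to a simple resolution rule; a sketch (the "unset" encoding of -1 is illustrative, not recovered):

```c
/* Resolve the busy time for a functional unit issue: a set
 * SchedResBusy* knob replaces the hardware-profile value, and an
 * override of 0 makes the unit appear always free. */
static int busy_cycles(int knob_override, int profile_cycles) {
    return (knob_override >= 0) ? knob_override : profile_cycles;
}
```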
B. Latency Overrides (12 knobs)
These override the default latency values the scheduler uses for dependency edges. The SchedRead* prefix indicates read-after-write latencies; the SchedTex* and SchedLDS* variants target texture and shared-memory operations specifically.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 757 | SchedLDSLatency | INT | Shared memory (LDS) load latency in cycles |
| 770 | SchedReadAvailTarget | INT | Target availability delay for read operands |
| 771 | SchedReadLatency | INT | Default read-after-write latency |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
| 774 | SchedReadSbDmmaLatency | INT | Scoreboard read latency for DMMA operations |
| 775 | SchedReadSbLdgstsLatency | INT | Scoreboard read latency for LDGSTS (async copy) operations |
| 802 | SchedSyncsLatency | INT | Synchronization barrier latency |
| 803 | SchedSyncsPhasechkLatency | INT | Phase-check synchronization latency |
| 804 | SchedTex2TexIssueRate | INT | Minimum cycles between back-to-back texture issues |
| 808 | SchedTexLatency | INT | Texture fetch latency in cycles |
| 811 | SchedXU64Latency | INT | Extended 64-bit unit latency |
C. Register Pressure Feedback (8 knobs)
The scheduler's priority function incorporates register pressure awareness through these knobs. They control how aggressively the scheduler tries to reduce live register count: SchedMaxRTarget sets the target register count, while the SchedMaxRLive* knobs define slack bands around that target. SchedReduceIncLimit* throttles how quickly the scheduler increases its pressure-reduction efforts.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 758 | SchedLocalRefRatio | DBL | Local reference ratio weight in priority function |
| 760 | SchedMaxRLiveCarefulSlack | INT | Slack before aggressive register pressure reduction |
| 761 | SchedMaxRLiveOKslack | INT | Slack band where register pressure is acceptable |
| 762 | SchedMaxRLiveOKslackColdBlocks | INT | OK-slack for cold (infrequently executed) blocks |
| 763 | SchedMaxRTarget | INT | Target maximum register count for scheduling |
| 776 | SchedReduceIncLimit | INT | Limit on incremental register pressure reduction steps |
| 778 | SchedReduceIncLimitHigh | INT | Upper bound on incremental reduction |
| 779 | SchedReduceRegBudget | BDGT | Budget for register-pressure-reduction iterations |
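The slack bands around SchedMaxRTarget suggest a three-state classification; a sketch of how the priority function might consume these knobs (structure inferred, not recovered):

```c
/* Pressure band around the scheduling register target (knob 763),
 * using the OK slack (761) and Careful slack (760). */
typedef enum { PRESSURE_OK, PRESSURE_CAREFUL, PRESSURE_AGGRESSIVE } Pressure;

static Pressure classify_pressure(int live, int target,
                                  int ok_slack, int careful_slack) {
    if (live <= target + ok_slack)      return PRESSURE_OK;
    if (live <= target + careful_slack) return PRESSURE_CAREFUL;
    return PRESSURE_AGGRESSIVE;  /* bias priority toward reducing liveness */
}
```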
D. Cross-Block Scheduling (8 knobs)
Cross-block motion allows the scheduler to move instructions across basic block boundaries for better latency hiding. These knobs control the scope and cost limits of cross-block speculation.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 742 | SchedCrossBlock | INT | Master cross-block scheduling mode selector |
| 743 | SchedCrossBlockInstsToSpeculate | INT | Max instructions to speculate across block boundary |
| 744 | SchedCrossBlockLimit | INT | Overall cross-block motion limit |
| 745 | SchedCrossBlockSpeculate | INT | Speculation mode for cross-block motion |
| 746 | SchedCrossBlockSpeculateBudget | BDGT | Budget for cross-block speculation attempts |
| 747 | SchedCrossBlockTexToSpeculate | INT | Max texture instructions to speculate across blocks |
| 288 | EnableXBlockSchedInMultiBlockInMMALoop | INT | Enable cross-block scheduling within multi-block MMA loops |
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
E. Texture Batching (7 knobs)
Texture operations have high latency, so the scheduler groups them into batches to maximize memory-level parallelism. These knobs control batch formation and target selection.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 741 | SchedCountLoadsPerTex | INT | Max loads to count per texture operation |
| 755 | SchedLastHybridInBBWithIssueRate | INT | Last hybrid scheduler position in BB with issue rate |
| 756 | SchedLDGBatchDelayBias | INT | Delay bias for global load batching |
| 805 | SchedTexBatchTargetSelectRegisterTarget | INT | Batch formation: prefer register-target-aware grouping |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | INT | Batch formation: prefer scheduler-target grouping |
| 807 | SchedTexBatchTargetTexReadTogether | INT | Batch formation: prefer grouping tex reads together |
| 931 | UseGroupOpexesForResourceScheduling | INT | Use grouped opexes for resource scheduling decisions |
F. Dependency Modeling (6 knobs)
These control how the scheduler builds and refines the dependency graph between instructions.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 753 | SchedAddDepFromGlobalMembarToCB | INT | Add dependency edge from global membar to CB unit |
| 759 | SchedMaxMemDep | INT | Max memory dependencies per instruction |
| 764 | SchedMemNoAlias | NONE | Assume no memory aliasing (aggressive scheduling) |
| 777 | SchedReduceRefPsuedoDepLimit | INT | Limit on reducing reference pseudo-dependencies |
| 780 | SchedRefineMemDepBudget | BDGT | Budget for memory dependency refinement iterations |
| 801 | SchedSymmetricAntiDepConflictWindow | BOOL | Enable symmetric anti-dependency conflict window |
G. Post-Scheduler (6 knobs)
The post-scheduler runs after register allocation (phase 103) and adjusts the schedule to account for actual register assignments. It primarily inserts stall cycles and adjusts issue delays.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 569 | PostSchedAdvLatencyHiding | BOOL | Enable advanced latency hiding in post-scheduler |
| 570 | PostSchedBudget | BDGT | Budget for post-scheduler iterations |
| 571 | PostSchedEarlyStall | INT | Early stall insertion mode |
| 572 | PostSchedForceReverseOrder | INT | Force reverse traversal order in post-scheduler |
| 573 | PostSchedIssueDelay | BOOL | Enable issue delay computation |
| 574 | PostSchedIssueDelayForNoWBStalls | BOOL | Compute issue delays for no-writeback stalls |
H. Ordering and Preservation (5 knobs)
These control whether the scheduler preserves the original instruction order (from the optimizer or PTX source) versus reordering freely.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 229 | ForcePreserveSchedOrderSameNvOpt | INT | Force preserve scheduling order from NvOpt pass |
| 594 | PreserveSchedOrder | NONE | Preserve source scheduling order (boolean) |
| 595 | PreserveSchedOrderSame | BOOL | Preserve scheduling order for same-priority instructions |
| 751 | SchedForceReverseOrder | INT | Force reverse scheduling order (bottom-up) |
| 769 | SchedPrefFurthestDep | BOOL | Prefer instructions with furthest dependency |
I. Scoreboard (4 knobs)
The hardware scoreboard tracks instruction completion. These knobs tune how the scheduler predicts scoreboard occupancy to avoid stalls.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
| 739 | SbXBlockLLSB | INT | Cross-block long-latency scoreboard tracking |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
Note: SbXBlock appears in both cross-block (D) and scoreboard (I) categories because it serves both purposes -- it controls whether the scoreboard state propagates across block boundaries, which is a prerequisite for cross-block scheduling correctness.
J. MMA Coupling (3 knobs)
Matrix multiply-accumulate instructions on certain architectures share functional unit resources. These knobs control how the scheduler models coupled execution.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 752 | SchedFP16CoupledMaxellPascal | INT | FP16 coupled execution mode on Maxwell/Pascal |
| 754 | SchedHmmaImmaBmmaCoupledAmperePlus | INT | HMMA/IMMA/BMMA coupled execution on Ampere+ |
| 366 | GroupOpexesForResourceSchedulingThreshold | DBL | Threshold for grouping opexes in resource scheduling |
K. Scheduler Model (4 knobs)
These control how the scheduler models the hardware pipeline and instruction movement costs.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 765 | SchedModelIdentityMove | INT | Model identity moves as zero-latency |
| 766 | SchedModelSharedPhysicalPipe | INT | Model shared physical pipe contention |
| 767 | SchedMultiRefDeltaLive | INT | Delta-live threshold for multi-reference instructions |
| 768 | SchedMultiRefDeltaLiveMinRefs | INT | Minimum reference count for delta-live calculation |
L. Budget, Scale, and Control (7 knobs)
General scheduling control knobs covering budgets, loop iteration estimates, the master disable switch, and validation.
| Index | Name | Type | Purpose |
|---|---|---|---|
| 740 | SchedBumpScaleAugmentFactor | DBL | Augment factor for priority bump scaling |
| 748 | SchedDisableAll | INT | Master disable for all scheduling passes |
| 749 | SchedDynBatchBudget | BDGT | Budget for dynamic batching iterations |
| 750 | SchedEstimatedLoopIterations | STR | Estimated loop iterations (string: per-loop hints) |
| 809 | ScheduleKILs | INT | Schedule KIL (kill/discard) instructions |
| 810 | SchedValidateLiveness | INT | Enable liveness validation after scheduling |
| 811 | SchedXU64Latency | INT | XU64 unit latency override |
Disable Switches (75 knobs)
The disable switches are boolean knobs that turn off specific passes, optimizations, or workarounds. All 75 knobs containing "Disable" were decoded from ROT13 strings at 0x21BDE30--0x21BFA10. Nearly all are OKT_NONE (boolean) type -- setting them with no value or any value disables the corresponding feature. The single exception is RegAllocRematDisableRange, which is OKT_IRNG and accepts a "lo..hi" instruction index range.
The bare Disable knob at 0x21BE860 appears to be a master pass-disable switch. SchedDisableAll is the master scheduler disable. DisablePragmaKnobs prevents PTX .pragma directives from setting knobs -- a meta-level control that protects the knob system itself.
A. Workaround (WAR) Switches (9 knobs)
These disable hardware or compiler bug workarounds. Each War_SW* knob corresponds to an NVIDIA internal bug tracker ID. Disabling a WAR reverts to the unpatched behavior -- useful for bisecting whether a WAR is causing a regression.
| Name | Feature Disabled |
|---|---|
| DisableWar_SW200655588 | Workaround for bug SW-200655588 |
| DisableWar_SW2549067 | Workaround for bug SW-2549067 |
| DisableWar_SW2789503 | Workaround for bug SW-2789503 |
| DisableWar_SW2965144 | Workaround for bug SW-2965144 |
| DisableWar_SW3093632 | Workaround for bug SW-3093632 |
| DisableForwardProgressWar1842954 | Forward-progress guarantee workaround (bug 1842954) |
| DisableForwardProgressWar1842954ForDeferBlocking | Same WAR, variant for defer-blocking scheduling |
| DisableHMMARegAllocWar | HMMA (half-precision MMA) register allocation workaround |
| DisableMultiViewPerfWAR | Multi-view rendering performance workaround |
B. Memory and Addressing (11 knobs)
These control address computation, memory access conversion, and shared-memory optimizations.
| Name | Feature Disabled |
|---|---|
| DisableCvtaForGenmemToSmem | Generic-to-shared address space conversion via cvta |
| DisableDoubleIndexedAddress | Double-indexed addressing mode optimization |
| DisableErrbarAfterMembar | Error barrier (BAR.SYNC 15) insertion after membar.sys |
| DisableForceLDCTOLDCUConv | LDC to LDCU (constant uniform load) conversion |
| DisableImplicitMemDesc | Implicit memory descriptor inference |
| DisableLDCU256 | LDCU.256 -- 256-bit constant uniform load |
| DisableLDCUWithURb | LDCU with uniform register base addressing |
| DisableLongIntArithAddressFolding | Long integer arithmetic folding into address computation |
| DisableRemoveSmemLea | Shared memory LEA (load effective address) removal |
| DisableSmemSizePerCTACheck | Shared memory size per CTA validation check |
| DisableStrideOnAddr | Stride-on-address optimization (base+stride*index folding) |
C. Register Allocation and Uniform Registers (9 knobs)
These control uniform register (UR) file usage, live range management, and remat-related disable ranges.
| Name | Type | Feature Disabled |
|---|---|---|
| DisableConvergentWriteUR | NONE | Convergent write-to-UR optimization |
| DisableExtendedLiveRange | NONE | Extended live range optimization |
| DisableU128 | NONE | 128-bit uniform register support |
| DisableURLiveAcrossConvBound | NONE | UR liveness across convergence boundaries |
| DisableURLivenessTradeOff | NONE | UR liveness trade-off heuristic |
| DisableUreg | NONE | Uniform register file usage entirely |
| MercuryDisableLegalizationOfTexToURBound | NONE | Mercury tex-to-UR-bound legalization |
| RegAllocRematDisableRange | IRNG | Rematerialization for instruction index range lo..hi |
| RematDisableTexThrottleRegTgt | NONE | Texture throttle register target during remat |
D. Loop Optimization (6 knobs)
| Name | Feature Disabled |
|---|---|
| DisableAlignHotLoops | Hot loop alignment (NOP padding for fetch efficiency) |
| DisableDeadLoopElimination | Dead loop elimination pass |
| DisableLoopLevelVaryingAnalysis | Loop-level varying/invariant analysis |
| DisableLoopPrecheckForYields | Loop pre-check insertion for yield points (cooperative groups) |
| DisableMeshVCTALoop | Mesh shader virtual CTA loop optimization |
| DisablePartialUnrollOverflowCheck | Overflow check during partial loop unrolling |
E. Code Motion and Scheduling (6 knobs)
| Name | Feature Disabled |
|---|---|
| DisableLatTransitivity | Latency transitivity in scheduling dependency chains |
| DisableMoveCommoning | MOV-based equivalence propagation (commoning walker) |
| DisableNestedHoist | Nested code hoisting (loop-invariant-like motion) |
| DisableOffDeck | Off-deck scheduling (prefetch to off-deck buffer) |
| DisableSourceOrder | Source-order scheduling constraint |
| SchedDisableAll | Master switch: all scheduling passes |
F. Vectorization (4 knobs)
| Name | Feature Disabled |
|---|---|
| DisableFastvecEnhancement | Fast vectorization enhancement pass |
| DisableHalfPartialVectorWrites | Half-precision partial vector write coalescing |
| DisableReadVectorization | Load vectorization (coalescing scalar reads into vector loads) |
| DisableWriteVectorization | Store vectorization (coalescing scalar writes into vector stores) |
G. Predication and Branching (4 knobs)
| Name | Feature Disabled |
|---|---|
| CmpToMovPredCrossBlockDisable | CMP-to-MOV predicate propagation across basic blocks |
| DisableBranchPredInput | Branch predicate input optimization |
| DisableCmpToPred | CMP-to-predicate conversion |
| DisablePredication | Predication pass (phase 63, OriDoPredication) |
H. Synchronization and Barriers (2 knobs)
| Name | Feature Disabled |
|---|---|
| DisableRedundantBarrierRemoval | Redundant barrier removal pass |
| DisableStageAndFence | Stage-and-fence synchronization insertion |
I. Dead Code and Store Elimination (2 knobs)
| Name | Feature Disabled |
|---|---|
| DisableDeadStoreElimination | Dead store elimination pass |
| DisableStraightenInSimpleLiveDead | Straightening within simple live/dead analysis |
J. Control Flow Merging (5 knobs)
| Name | Feature Disabled |
|---|---|
| DisableEarlyExtractBCO | Early extraction of BCO (branch code optimization objects) |
| DisableMergeEquivalentConditionalFlow | Phase 133: tail merging of equivalent conditional branches |
| DisableMergeFp16MovPhi | FP16 MOV-PHI merge optimization |
| DisableMergeSamRamBlocks | SAM/RAM block merging (surface/texture access coalescing) |
| DisableOptimizeHotColdFlow | Hot/cold flow optimization (code layout splitting) |
K. Pass Control (2 knobs)
| Name | Feature Disabled |
|---|---|
| Disable | Master disable switch (bare name) |
| DisablePragmaKnobs | PTX .pragma-based knob overrides |
L. Sanitizer (3 knobs)
These control the address sanitizer instrumentation for different memory spaces. When the sanitizer is active, these knobs can selectively disable checking for one space while keeping the others.
| Name | Feature Disabled |
|---|---|
| SanitizeDisableGlobal | Address sanitizer for global memory accesses |
| SanitizeDisableLocal | Address sanitizer for local memory accesses |
| SanitizeDisableShared | Address sanitizer for shared memory accesses |
M. Floating Point (2 knobs)
| Name | Feature Disabled |
|---|---|
FPFoldDisable | Floating-point constant folding |
FPRefactoringDisable | Floating-point expression refactoring |
N. Miscellaneous (10 knobs)
| Name | Feature Disabled |
|---|---|
DisableBW225LongIntArith | BW225 (Blackwell) long integer arithmetic optimization |
DisableBptTrapNoReturn | BPT.TRAP no-return semantics (debugger breakpoint trap) |
DisableDependentConstExpr | Dependent constant expression optimization |
DisableISBESharing | ISBE (indexed set buffer entry) sharing for bindless textures |
DisableMarkF2FPackbTo16Bit | Marking F2F.PACKB as 16-bit operation |
DisableNonUniformQuadDerivatives | Non-uniform quad derivative computation |
DisablePadding | NOP padding insertion (alignment and scheduling) |
DisablePicCodeGen | Position-independent code generation |
DisableSopSr | SOP (scalar operation) on special registers (SR) |
DisableSuperUdp | Super-UDP (enhanced uniform datapath) optimization |
Rematerialization Knobs (35 knobs)
Rematerialization knobs control the three dedicated remat pipeline phases (Phase 28: SinkRemat, Phase 54: OriDoRematEarly, Phase 69: OriDoRemat) and the cost model that decides whether recomputing a value is cheaper than keeping it live in a register. These are separate from the 12 RegAlloc*Remat* knobs documented above in section B, which control allocator-integrated rematerialization. The distinction matters: allocator-integrated remat fires during register allocation itself (sub_93AC90), while these knobs tune the standalone pre-allocation and post-predication remat passes.
The 35 knobs form two contiguous blocks in the descriptor table; one related knob (index 475, documented at the end of this page) sits outside them:
- Remat* (27 knobs, indices 702--728): Late rematerialization (Phase 69) and shared cost model
- SinkRemat* (8 knobs, indices 824--831): Early sink+remat (Phase 28)
A. Remat Enable/Disable (5 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 709 | RematDisableTexThrottleRegTgt | INT | Disable texture-throttle register targeting during remat |
| 710 | RematEarlyEnable | INT | Enable Phase 54 early remat mode activation |
| 711 | RematEnable | INT | Master enable for Phase 69 late rematerialization |
| 712 | RematEnablePReg | NONE | Enable predicate register rematerialization (boolean flag) |
| 726 | RematStressTest | NONE | Force all remat candidates to be rematerialized (debug, boolean flag) |
Knob 711 (RematEnable) is the master switch. When zeroed via -knob RematEnable=0, Phase 69 skips its core loop entirely. Knob 710 (RematEarlyEnable) independently controls Phase 54's mode flag write (ctx+1552 = 4). Knob 726 (RematStressTest) is a debug-only boolean that forces every candidate to be rematerialized regardless of profitability -- useful for stress-testing correctness.
B. Remat Cost Model (10 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 702 | RematAbsCostFactor | DBL | Absolute cost scaling factor for remat profitability |
| 703 | RematBackOffRegTargetFactor | DBL | Back-off factor for register pressure target during remat |
| 705 | RematColdBlockRatio | DBL | Cost discount ratio for cold (rarely executed) blocks |
| 713 | RematGlobalCostFactor | DBL | Global cost multiplier for cross-block rematerialization |
| 714 | RematGlobalLowCostFactor | DBL | Cost factor for low-cost (cheap ALU: MOV, IADD, LOP3) remat |
| 716 | RematLdcCost | DBL | Cost weight assigned to LDC (load-from-constant-bank) remat |
| 719 | RematMemCost | DBL | Cost weight for memory-sourced (LD/ST) rematerialization |
| 722 | RematReadUAsLdc | INT | Treat uniform address reads as LDC for cost classification |
| 727 | RematTexInstRatioThreshold | DBL | Texture instruction ratio threshold for throttle activation |
| 728 | RematTexThrottleRegTgtScale | DBL | Scale factor for register target when texture throttle is active |
These 10 knobs parameterize the remat profitability function (sub_90B790). The cost model computes remat_cost = instruction_cost * factor and compares against register savings. The DBL-typed knobs (9 of 10) are floating-point multipliers that allow fine-grained tuning. The texture-specific knobs (727, 728) implement a throttle: when the ratio of texture instructions exceeds the threshold, the register target is scaled to avoid excessive register use that would harm texture unit throughput.
C. Register Pressure Control (5 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 706 | RematConservativeRegSlack | INT | Extra registers to reserve beyond target (conservative mode) |
| 708 | RematCostRegLimit | INT | Max register count considered during cost analysis |
| 718 | RematMaxRegCount | INT | Absolute ceiling on registers for remat decisions |
| 723 | RematRegTargetFactor | DBL | Scaling factor for computing the register pressure target |
| 724 | RematRegTargetTrialLimit | INT | Max iterations when searching for optimal register target |
The register target is the pressure level below which rematerialization becomes profitable. RematRegTargetFactor (723) scales the occupancy-derived target. RematRegTargetTrialLimit (724) caps the binary-search iterations in the target-finding loop. RematMaxRegCount (718) is a hard ceiling -- if current pressure exceeds this value, the remat pass operates in aggressive mode.
D. Instruction and Code Limits (2 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 707 | RematCostInstLimit | INT | Max instruction count for inclusion in cost model |
| 715 | RematInflationSlack | INT | Allowed code-size inflation slack (extra instructions from remat) |
RematCostInstLimit (707) prevents the cost model from analyzing extremely large remat sequences. RematInflationSlack (715) limits how many extra instructions rematerialization may introduce before the pass backs off.
E. Placement Control (4 knobs)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 717 | RematLowCostPlacementLimit | DBL | Max placement distance for low-cost remat candidates |
| 720 | RematMinDistance | INT | Minimum def-to-remat distance (instructions) before remat is attempted |
| 721 | RematPlacementLookback | INT | Lookback window size for placement-site search |
| 725 | RematSortRematChain | INT | Sort remat chain by priority before placement (0=off, 1=on) |
These knobs control where rematerialized instructions are placed relative to their uses. RematMinDistance (720) ensures remat is not attempted for short live ranges where the original definition is close enough. RematPlacementLookback (721) limits how far back the placement algorithm scans when searching for a profitable insertion point.
F. Remat Budget (1 knob)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 704 | RematBudget | BDGT | Optimization budget for the late remat pass (phase 69) |
BDGT-typed knobs carry a primary value and a secondary counter. The budget is decremented as each remat decision is committed. When exhausted (secondary reaches zero), the pass stops processing further candidates. This provides a deterministic cap on compile-time cost.
G. SinkRemat (Phase 28) Knobs (8 knobs, indices 824--831)
| Index | Name | Type | Purpose |
|---|---|---|---|
| 824 | SinkRematAbsCostLimit | DBL | Absolute cost ceiling for sinking+remat decisions |
| 825 | SinkRematBudget | BDGT | Optimization budget for the sink+remat pass |
| 826 | SinkRematDeltaRegsRatio | DBL | Register pressure delta ratio threshold for sink profitability |
| 827 | SinkRematEnable | INT | Master enable for Phase 28 SinkRemat |
| 828 | SinkRematMinDefPlaceDist | INT | Minimum definition-to-placement distance for sinking |
| 829 | SinkRematMinPlaceRefDist | INT | Minimum placement-to-reference distance for sinking |
| 830 | SinkRematMultiRefXBlkUsesPenaltyFactor | DBL | Penalty multiplier for multi-reference cross-block uses |
| 831 | SinkRematPredPenaltyFactor | DBL | Penalty multiplier for sinking predicated instructions |
Phase 28's SinkRemat pass (entry: sub_913A30, core: sub_A0F020) sinks instructions closer to their uses and marks remat candidates. Knob 827 (SinkRematEnable) is the master switch. The distance knobs (828, 829) prevent unprofitable micro-sinks. The penalty factors (830, 831) make the cost model more conservative for predicated instructions and for instructions with multiple cross-block uses, where sinking may duplicate code along multiple paths.
Related Knob Outside the Remat Block
| Index | Name | Type | Purpose |
|---|---|---|---|
| 475 | MovWeightForRemat | DBL | MOV instruction weight in remat profitability scoring |
This knob sits in the general MOV-weight family (indices 474--476) rather than the Remat block. It tunes how MOV instructions contribute to the scheduling cost model's remat profitability calculation. When the remat candidate is a MOV chain, this weight determines the per-MOV cost used to decide whether rematerialization beats keeping the value live.
DUMP_KNOBS_TO_FILE
The DUMP_KNOBS_TO_FILE environment variable triggers a full dump of all knob values to a file. Checked during KnobInit (sub_7A0C10) via getenv("DUMP_KNOBS_TO_FILE"):
char* dump_path = getenv("DUMP_KNOBS_TO_FILE");
if (dump_path) {
size_t len = strlen(dump_path);
// Store into SSO string at knob_state+88..104
}
The path is stored in a small-string-optimized (SSO) buffer at knob_state offsets +88 through +104:
Offset Size Field
────── ──── ─────────────────────────────────────
+88 8 data pointer (or first 8 inline bytes if len <= 15)
+96 8 string length
+104 8 capacity (or remaining inline bytes)
Paths of 15 bytes or fewer are stored inline without heap allocation. Longer paths allocate via the arena allocator at knob_state+8. The dump is produced later during compilation -- KnobInit only stores the path; the actual file write happens after all knobs are resolved.
This is the primary mechanism for discovering which knobs exist and what their current values are. Setting it produces a text file with all 1,294 knob names and their resolved values.
Error Handling
The knob system uses structured error descriptors (96 bytes each) allocated from an arena:
Offset Size Field
────── ──── ─────────────────────────────────────
+0 8 formatted message string pointer
+8 8 message length
+16 8 source file path pointer
+24 8 source file path length
+32 8 line number
+40 8 function name pointer
+48 48 (additional context fields)
Three error-handling functions (two constructors plus a merge combinator):
| Function | Address | Purpose |
|---|---|---|
FormatKnobError | sub_79CDB0 | General knob error with vsnprintf formatting |
FormatKnobErrorWithContext | sub_79AED0 | Error with additional context (knob name, value) |
KnobError::Merge | sub_79A780 | Chains multiple errors for accumulated reporting |
Errors propagate through a tagged result: bit 0 of *(result + 16) is set on error, cleared on success. The GetKnobIndex return protocol:
// Success:
*(byte*)(result + 16) &= ~1; // clear error bit
*(int32*)(result) = knob_index; // store index
// Failure:
*(byte*)(result + 16) |= 1; // set error bit
*(result + 0..15) = error_desc; // store error descriptor
KnobValue Lifecycle
Destruction
KnobValue::Destroy (sub_797790) resets a 72-byte value slot before writing a new value. It switches on the type tag:
| Type | Destruction Action |
|---|---|
| 0-5, 7, 8 | No-op (POD types, no heap allocation) |
| 6 (int-list) | Walk doubly-linked list, free each node via allocator+32 |
| 9 (opcode-list) | Walk doubly-linked list, free each node via allocator+32 |
| 10 (int-list dynamic) | Free the growable array block |
Deep Copy
KnobValue::CopyFrom (sub_7978F0) handles deep copy of value slots, switching on type to properly duplicate linked lists and allocated buffers.
KnobInit (sub_7A0C10) constructs a new knob state object by allocating 72 * count bytes for the value array, then deep-copying each slot from a source state if one exists.
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
sub_6F04B0 | 6,824 | ReportKnobError (DAG) | HIGH |
sub_6F0820 | 2,782 | GetKnobIndex (DAG) | CERTAIN |
sub_6F0A30 | 8,700 | RegisterKnob (DAG) | HIGH |
sub_6F0FF0 | 13,000 | GetKnobValue (DAG) | HIGH |
sub_6F1B10 | 13,000 | BuildKnobTable (DAG) | HIGH |
sub_6F2380 | 14,000 | ParseKnobString (DAG) | HIGH |
sub_6F68C0 | 9,000 | InitializeKnobs (DAG) | HIGH |
sub_6F7360 | 18,306 | ParseKnobValue (DAG) | CERTAIN |
sub_6F83C0 | — | ParseWhenShorthand (DAG) | MEDIUM |
sub_797790 | 385 | KnobValue::Destroy | HIGH |
sub_7978F0 | 240 | KnobValue::CopyFrom | MEDIUM |
sub_7973E0 | 400 | KnobType::GetSize | MEDIUM |
sub_798280 | 900 | ParsePhaseNameFragment | MEDIUM |
sub_798B60 | 1,776 | NamedPhases::ParsePhaseList | CERTAIN |
sub_799250 | 68 | IsPassDisabled | HIGH |
sub_7992A0 | 894 | IsPassDisabledFull | HIGH |
sub_79A490 | 600 | KnobError::AppendContext | MEDIUM |
sub_79A5D0 | 800 | KnobError::Format | MEDIUM |
sub_79A780 | 2,200 | KnobError::Merge | MEDIUM |
sub_79AED0 | 1,000 | FormatKnobErrorWithContext | HIGH |
sub_79B240 | 518 | GetKnobIndex (OCG) | CERTAIN |
sub_79B450 | 200 | GetKnobIndexWithValidation | HIGH |
sub_79B530 | 3,296 | ParseKnobsString | HIGH |
sub_79C210 | 2,200 | ParseKnobOverrides | HIGH |
sub_79C9D0 | 1,600 | KnobsInitFromEnv | HIGH |
sub_79CDB0 | 1,400 | FormatKnobError | HIGH |
sub_79D070 | 2,312 | ReadKnobsFile | CERTAIN |
sub_79D990 | 7,073 | KnobsInit (master) | HIGH |
sub_79F540 | 3,640 | ParseKnobValue (OCG) | CERTAIN |
sub_7A0A90 | 350 | KnobValue::CopyListValue | MEDIUM |
sub_7A0C10 | 1,745 | KnobInit (per-knob) | HIGH |
sub_7A1B80 | 400 | GetKnobIntValue | MEDIUM |
sub_7A1CC0 | 350 | GetKnobBoolValue | MEDIUM |
sub_7A1E10 | 400 | GetKnobStringValue | MEDIUM |
sub_7A2860 | 2,100 | SetKnobValue | MEDIUM |
sub_7ACEA0 | 3,700 | OCGKnobSetup | MEDIUM |
Reimplementation Notes
To reimplement the knobs system:
1. Define the knob table as a compile-time array of descriptors (name, alias, type). No need for ROT13 -- that is purely obfuscation. Use an enum for knob indices so call sites reference `KNOB_SchedNumBB_Limit` instead of magic index 294.
2. Parse order matters. Process sources in the documented priority order (env, file, CLI, pragma, WHEN). Last-write-wins semantics.
3. The WHEN= system is the complex part. You need FNV-1a hashing of function identifiers and a per-function override table. The hash table at `ctx+120 → +1128` uses open addressing with linear probing.
4. Budget knobs (OKT_BDGT) are just integers with a secondary tracking field. The secondary starts at 0 and is used by cost models to track how much "budget" remains during optimization.
5. Int-range knobs (OKT_IRNG) use `..` as the range separator: `"100..200"` means [100, 200]. Missing bounds default to `INT_MIN` (0x80000000) / `INT_MAX` (0x7FFFFFFF).
6. The opcode-string-list type (OKT_OPCODE_STR_LIST) carries pairs of (opcode_name, integer). The opcode name is resolved to an internal opcode ID via the SASS opcode table. Used for per-instruction tuning overrides.
Cross-References
- CLI Options -- public command-line flags, the user-facing layer above knobs
- Optimization Levels -- O-levels set specific knob presets
- DUMPIR & NamedPhases -- DUMPIR knob and phase-level dump control
- Phase Manager -- pass disable mechanism consumes the DisablePhases knob
- Scheduling Algorithm -- consumes Sched* knobs
- Allocator Architecture -- consumes RegAlloc* knobs
- Mercury Encoder -- consumes Mercury* knobs and DAG knob table
Optimization Levels
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The --opt-level (-O) flag controls how aggressively ptxas optimizes during the 159-phase pipeline. The option is parsed into a 32-bit integer at options block offset +148 by sub_434320 (line 216: sub_1C96470(v10, "opt-level", a3 + 148, 4)). The default value is 3. The documented range is 0--4, but the internal NvOpt recipe system supports levels 0--5, and the scheduler and rematerialization passes distinguish level 5 from lower values.
The optimization level propagates through the compilation context and is read by individual passes via sub_7DDB50 (232 bytes at 0x7DDB50), which combines the opt-level check with a knob 499 guard. Passes that call sub_7DDB50 receive the opt-level value (stored at compilation context offset +2104) only if knob 499 is enabled; otherwise, the function returns 1 (effectively clamping the level to O1 behavior).
| CLI option | --opt-level / -O |
| Options block offset | +148 (int32) |
| Default | 3 |
| Documented range | 0--4 |
| Internal range | 0--5 (NvOpt levels) |
| Accessor | sub_7DDB50 (0x7DDB50, 232 bytes) |
| Knob guard | 499 (if disabled, accessor returns 1) |
| Parse location | sub_434320 line 216 |
| Debug override | sub_431A40 forces level to 0 when -g is set |
| Ofast-compile override | sub_434320 lines 635--679 |
Level Summary
| Level | Name | Use Case |
|---|---|---|
| 0 | No optimization | Debug builds (-g), maximum source fidelity |
| 1 | Minimal optimization | Fast compile, basic folding/DCE |
| 2 | Standard optimization | Balanced speed/compile-time (previous default) |
| 3 | Full optimization | Default; all standard passes enabled |
| 4 | Aggressive optimization | Extra loop peeling, speculative hoisting |
| 5 | Maximum optimization | Full SinkRemat+Cutlass, highest compile time |
Options Block Fields Affected by Opt Level
The option parser (sub_434320) and debug handler (sub_431A40) modify several options block fields based on the optimization level. Key interactions discovered from decompiled code:
| Offset | Field | O0 | O1 | O2+ | Source |
|---|---|---|---|---|---|
+148 | opt_level | 0 | 1 | 2, 3, 4 | Direct from -O |
+160 | register_usage_level | Forced to 5 if set with -O0 (warning issued) | 5 (default) | 5 (default) | sub_434320 line 359--363 |
+235 | cloning_disabled | 0 (disabled) | per-CLI | per-CLI | sub_434320 line 776 |
+288 | device_debug | (from -g) | (from -g) | (from -g) | CLI only |
+292 | sp_bounds_check | 1 (auto-enabled) | per-CLI | per-CLI | sub_434320 line 775 |
+326 | no_cloning | 1 (when -g) | per-CLI | per-CLI | sub_431A40 line 42 |
+408 | allow_expensive_optimizations | false | false | true | sub_434320 line 768 |
+477 | fast_compile | forced 0 with -g | per-CLI | per-CLI | sub_431A40 line 28 |
The critical line at sub_434320 offset 768:
// allow-expensive-optimizations defaults to (opt_level > 1)
if (!user_set_allow_expensive)
options->allow_expensive_optimizations = (options->opt_level > 1);
Debug Mode Override (-g)
When --device-debug (-g) is active, sub_431A40 (at 0x431A40) forces the optimization level to 0 and disables most optimization features:
// sub_431A40 pseudocode
void ApplyDebugOverrides(options_block* opts, bool suppress_warning) {
if (suppress_warning) {
opts->device_debug = 1;
opts->sp_bounds_check_pair = 0x0101; // +16 = {1, 1}
}
opts->sp_bounds_check = 1; // +292
// Warn if user explicitly set opt-level with -g
if (was_set("opt-level") && opts->opt_level != 0)
warn("ignoring -O with -g");
// Warn about incompatible options
if (was_set("register-usage-level"))
warn("'--device-debug' overrides '--register-usage-level'");
// Force register_usage_level to {5, 5} (pair at +160)
*(int64*)(opts + 160) = 0x500000005LL;
if (opts->fast_compile)
warn("'--device-debug' overrides '--fast-compile'");
opts->fast_compile = 0;
if (opts->ofast_compile is "max" or "mid" or "min")
warn("'--device-debug' overrides '--Ofast-compile'");
opts->ofast_compile = "0";
opts->opt_level = 0; // +148
// Handle cloning
if (was_set("cloning") && opts->device_debug && !opts->no_cloning)
warn("-cloning=yes incompatible with -g");
opts->cloning_disabled = 0; // +235
}
The 0x500000005LL write to offset +160 sets both the 32-bit register-usage-level and the adjacent 32-bit field to 5, resetting any user override.
Ofast-Compile Interaction
The --Ofast-compile (-Ofc) option provides a compile-time vs code-quality tradeoff orthogonal to -O. It has four settings: 0 (disabled), min, mid, max. Each setting overrides the opt-level and related flags:
| Ofast-compile | Forces opt_level to | Cloning | Split-compile | Expensive opts | Notes |
|---|---|---|---|---|---|
0 | (no change) | (no change) | (no change) | (no change) | Default |
min | 1 | disabled | (no change) | true (forced) | Warns if -O set to a value other than 1 |
mid | 1 | disabled | (no change) | (no change) | Disables cloning when no split-compile |
max | 0 | disabled | (no change) | (no change) | Most aggressive compile-time reduction |
From sub_434320 lines 635--679:
if (ofast_compile == "max") {
if (was_set("cloning") && !no_cloning)
warn("-cloning=yes incompatible with --Ofast-compile=max");
no_cloning = 1;
if (was_set("opt-level") && opt_level != 0)
warn("-opt-level=<1,2,3> incompatible with --Ofast-compile=max");
opt_level = 0;
}
if (ofast_compile == "mid") {
no_cloning = 1;
if (!split_compile) cloning_disabled = 1;
if (was_set("opt-level") && opt_level != 1)
warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=mid");
opt_level = 1;
fast_compile_mode = 1;
}
if (ofast_compile == "min") {
no_cloning = 1;
if (!split_compile) cloning_disabled = 1;
if (was_set("opt-level") && opt_level != 1)
warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=min");
opt_level = 1;
fast_compile_mode = 1;
if (was_set("allow-expensive-optimizations") && !allow_expensive)
warn("-allow-expensive-optimizations=false incompatible with --Ofast-compile=min");
allow_expensive_optimizations = 1;
}
Per-Phase Gating by Optimization Level
Optimization levels control the pipeline through two mechanisms:
- Static `isNoOp()` overrides -- the AdvancedPhase gate vtables are overridden at pipeline construction time based on the target architecture and opt-level.
- Runtime opt-level checks -- individual pass execute functions call `sub_7DDB50` (the opt-level accessor) and early-return when the level is below their threshold.
Gate Accessor: sub_7DDB50
// sub_7DDB50 pseudocode (232 bytes at 0x7DDB50)
int getOptLevel(compilation_context* ctx) {
knob_vtable* kv = ctx->knob_state; // ctx + 1664
query_func qf = kv->vtable[19]; // vtable + 152
if (qf == sub_67EB60) { // fast-path: known vtable
check_func cf = kv->vtable[9]; // vtable + 72
bool knob_499;
if (cf == sub_6614A0)
knob_499 = (kv->state[35928] != 0); // direct field read
else
knob_499 = cf(ctx->knob_state, 499); // indirect query
if (!knob_499)
return ctx->opt_level; // ctx + 2104
int64 state = kv->state;
int iteration = state[35940];
if (state[35936] > iteration) {
state[35940] = iteration + 1; // increment pass counter
return ctx->opt_level;
}
} else if (qf(ctx->knob_state, 499, 1)) {
return ctx->opt_level;
}
return 1; // fallback: treat as O1
}
When knob 499 is disabled (or its iteration budget is exhausted), sub_7DDB50 returns 1 regardless of the actual opt-level. This provides a master kill-switch: setting knob 499 to false effectively caps all opt-level-gated behavior at O1.
Phase Activation Table
The following table lists every phase where the optimization level has been confirmed to affect behavior, based on decompiled isNoOp() methods and execute-function guard checks.
Threshold notation: > N means the phase requires opt_level > N (i.e., level N+1 and above).
| Phase | Name | Threshold | Effect at threshold | Source |
|---|---|---|---|---|
| 14 | DoSwitchOptFirst | > 0 | Branch/switch optimization enabled | isNoOp returns true at O0 |
| 15 | OriBranchOpt | > 0 | Branch folding enabled | isNoOp returns true at O0 |
| 18 | OriLoopSimplification | 4--5 | Aggressive loop peeling enabled at O4+ | sub_78B430 checks opt_level |
| 22 | OriLoopUnrolling | > 1 | Loop unrolling requires at least O2 | Execute guard via sub_7DDB50 |
| 24 | OriPipelining | > 1 | Software pipelining requires at least O2 | Execute guard |
| 26 | OriRemoveRedundantBarriers | > 1 | Barrier optimization at O2+ | Gating: opt_level > 1 |
| 28 | SinkRemat | > 1 / > 4 | O2+: basic path; O5: full cutlass mode | Two-tier guard in sub_913A30 |
| 30 | DoSwitchOptSecond | > 0 | Second switch pass at O1+ | isNoOp returns true at O0 |
| 38 | OptimizeNestedCondBranches | > 0 | Nested branch simplification at O1+ | isNoOp returns true at O0 |
| 49 | GvnCse | > 1 | GVN-CSE requires at least O2 | Execute guard |
| 54 | OriDoRematEarly | > 1 | Early rematerialization at O2+ | sub_7DDB50 check |
| 63 | OriDoPredication | > 1 | If-conversion at O2+ | Execute guard |
| 69 | OriDoRemat | > 1 | Late rematerialization at O2+ | sub_7DDB50 check |
| 71 | OptimizeSyncInstructions | > 1 | Sync optimization at O2+ | Gating: opt_level > 1 |
| 72 | LateExpandSyncInstructions | > 2 | Late sync expansion at O3+ | Gating: opt_level > 2 |
| 95 | SetAfterLegalization | > 1 | Post-legalization flag at O2+ | sub_7DDB50 check |
| 99 | OriDoSyncronization | > 1 | Sync insertion at O2+ | Gating: opt_level > 1 |
| 100 | ApplyPostSyncronizationWars | > 1 | WAR fixup at O2+ | Gating: opt_level > 1 |
| 110 | PostSchedule | > 0 | Full post-schedule at O1+ | Mode selection |
| 115 | AdvancedScoreboardsAndOpexes | > 0 | Full scoreboard generation at O1+ | Hook activated at O1+ |
| 116 | ProcessO0WaitsAndSBs | == 0 | Conservative scoreboards at O0 only | Active only at O0 |
O-Level Feature Matrix
| Feature | O0 | O1 | O2 | O3 | O4 | O5 |
|---|---|---|---|---|---|---|
| Basic block merging | off | on | on | on | on | on |
| Branch/switch optimization | off | on | on | on | on | on |
| Copy propagation + const folding | off | on | on | on | on | on |
| Dead code elimination | partial | on | on | on | on | on |
| Loop canonicalization | basic | basic | full | full | aggressive | aggressive |
| Loop unrolling | off | off | on | on | on | on |
| Software pipelining | off | off | on | on | on | on |
| Strength reduction | off | on | on | on | on | on |
| GVN-CSE | off | off | on | on | on | on |
| Predication (if-conversion) | off | off | on | on | on | on |
| Rematerialization (early) | off | off | on | on | on | on |
| Rematerialization (late) | off | off | on | on | on | on |
| SinkRemat (full) | off | off | off | off | off | on |
| Cutlass iterative remat | off | off | off | off | off | on |
| Loop peeling (aggressive) | off | off | off | off | on | on |
| Barrier optimization | off | off | on | on | on | on |
| Sync instruction optimization | off | off | on | on | on | on |
| Late sync expansion | off | off | off | on | on | on |
| Post-legalization mark | off | off | on | on | on | on |
| Allow expensive optimizations | off | off | on | on | on | on |
| Speculative hoisting | off | off | on | on | on | on |
| Hot/cold partitioning | off | on | on | on | on | on |
| Full scoreboard analysis (115) | off | on | on | on | on | on |
| Conservative scoreboards (116) | on | off | off | off | off | off |
| Stack-pointer bounds check | auto-on | off | off | off | off | off |
| Cloning | disabled | on | on | on | on | on |
Notes:
- "partial" DCE at O0: `EarlyOriSimpleLiveDead` (phase 10) still runs for basic cleanup even at O0.
- O4 and O5 are not documented in `--help` but are accepted internally. O4 is equivalent to O3 plus aggressive loop peeling. O5 adds the full SinkRemat pass with cutlass iteration support.
Scoreboard Path Selection
The scoreboard generation subsystem has two mutually exclusive paths, selected by optimization level:
O0: Conservative Path (Phase 116)
Phase 116 (ProcessO0WaitsAndSBs) inserts maximum-safety scheduling metadata:
For every instruction:
stall_count = 15 // maximum stall (15 cycles)
wait_mask = 0x3F // wait on all 6 barriers
write_barrier = 7 // no barrier assignment (7 = none)
read_mask = 0 // no read barriers
yield = 1 // yield after every instruction
This eliminates all instruction-level parallelism. Every instruction stalls for the maximum count and waits on all six dependency barriers before issuing. The result is correct but extremely slow code -- suitable only for debugging.
O1+: Full Analysis Path (Phase 115)
Phase 115 (AdvancedScoreboardsAndOpexes) runs the complete dependency analysis pipeline:
sub_A36360(52 KB) -- Master control word generator with per-opcode dispatchsub_A23CF0(54 KB) -- DAG list scheduler heuristic for barrier assignment- Per-field encoders for stall, yield, barrier, and scoreboard dependency fields
The full path computes precise stall counts based on actual instruction latencies from the hardware profile, assigns the minimum necessary dependency barriers (6 available per SM), and inserts yield hints only where the warp scheduler benefits from switching.
Scheduling Direction
The scheduling infrastructure (sub_8D0640) selects scheduling direction based on opt-level:
| Condition | Direction | Strategy |
|---|---|---|
opt_level <= 2 | Forward-pass | Register-pressure-reducing: prioritizes freeing registers |
opt_level > 2 | Reverse-pass | Latency-hiding: prioritizes ILP and memory latency overlap |
At the default O3, the scheduler uses the reverse-pass strategy, which hides memory latencies at the cost of potentially higher register pressure. At O1--O2, the forward-pass strategy minimizes peak register usage.
The direction selection happens in PreScheduleSetup (sub_8CBAD0), called from the scheduling orchestrator with the boolean opt_level > 2:
PreScheduleSetup(sched, opt_level > 2); // sub_8CBAD0
Additionally, the ScheduleInstructionsReduceReg phase (mode 0x39) is enabled by default at O3 and above, providing a dedicated register-pressure-reduction scheduling pass before the ILP pass.
Register Allocation Differences
The register allocator itself (fat-point greedy at sub_957160) does not directly branch on the optimization level. However, the opt-level affects register allocation indirectly through:
- `--register-usage-level` (offset +160, range 0--10, default 5): At O0 with `-g`, this is forced to 5 regardless of user setting. The value modulates the per-class register budget at `alloc + 32*class + 884`.
- `allow-expensive-optimizations` (offset +408): Defaults to `true` when `opt_level > 1`. When true, the allocator and related passes are permitted to spend more compile time on better solutions (e.g., more spill-retry iterations, more aggressive coalescing).
- Phase gating: At O0, passes that reduce register pressure (rematerialization, predication, loop optimizations) are disabled, so the allocator receives un-optimized IR with higher register demand. This typically results in more spills at O0.
NvOpt Recipe System
The NvOpt recipe system (Phase 1: ApplyNvOptRecipes, option 391) provides an additional optimization-level axis. When enabled, the PhaseManager allocates a 440-byte NvOptRecipe sub-manager that configures per-phase aggressiveness:
| NvOpt level | Behavior |
|---|---|
| 0 | Minimal optimization (fast-compile path, many phases set to isNoOp()) |
| 1--2 | Standard optimization |
| 3--4 | Aggressive optimization (loop unrolling, speculative hoisting enabled) |
| 5 | Maximum optimization (may significantly increase compile time) |
The NvOpt level is validated in sub_C173E0 via the string "Invalid nvopt level : %d.", confirming the range 0--5. Recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes.
The NvOpt level is distinct from the -O CLI level. The -O level controls which phases run at all (via isNoOp() and sub_7DDB50 guards); the NvOpt level controls how aggressively the phases that do run behave (via recipe parameters).
Knob Defaults That Change Per Level
Several knobs have default values that vary by optimization level. The most significant:
| Knob | O0 Default | O1 Default | O2+ Default | Effect |
|---|---|---|---|---|
| 487 | enabled | enabled | enabled | Master optimization enable (checked by many passes) |
| 499 | (varies) | (varies) | (varies) | Guard knob for sub_7DDB50 opt-level accessor |
| 595 | true | true | true | Scheduling enable (but O0 uses conservative path) |
| 419 | -- | -- | -- | Forward scheduling mode (bit 3 in scheduler flags) |
Knob 487 is the most pervasive: it is checked by loop simplification, barrier optimization, sync optimization, predication, rematerialization, and scheduling passes. Disabling it overrides the opt-level and turns off the corresponding pass regardless of -O setting.
Key Decompiled Evidence
Options block opt_level field (offset +148)
// sub_434320 line 216: parse opt-level from CLI
sub_1C96470(v10, "opt-level", a3 + 148, 4);
allow-expensive-optimizations defaults to (opt_level > 1)
// sub_434320 line 768
if (!user_set_allow_expensive)
*(a3 + 408) = *(a3 + 148) > 1;
O0 forces sp-bounds-check and disables cloning
// sub_434320 lines 773-776
if (opt_level == 0) {
*(a3 + 292) = 1; // sp_bounds_check = true
*(a3 + 292) = 1; // sp_bounds_check = true
}
Debug mode forces O0
// sub_431A40 line 33
*(a3 + 148) = 0; // opt_level = 0
Ofast-compile=max forces O0
// sub_434320 line 646
*(a3 + 148) = 0; // opt_level = 0 for Ofast=max
Ofast-compile=mid/min forces O1
// sub_434320 lines 659, 674
*(a3 + 148) = 1; // opt_level = 1 for Ofast=mid and min
*(a3 + 572) = 1; // fast_compile_mode = 1
Scoreboard mutually exclusive paths
// Phase 115 (AdvancedScoreboardsAndOpexes): isNoOp() returns true at O0
// Phase 116 (ProcessO0WaitsAndSBs): isNoOp() returns true at O1+
SinkRemat two-tier gating
// sub_913A30 (SinkRemat core, phase 28)
if (getOptLevel(ctx) <= 1) return; // O0/O1: skip entirely
if (getOptLevel(ctx) <= 4) return; // O2-O4: skip full sink+remat
// O5: proceed to cutlass iterative mode
Sync barrier gating
// Phase 99 (OriDoSyncronization), Phase 71 (OptimizeSyncInstructions):
call sub_7DDB50 ; get opt_level
cmp eax, 1
jle return ; skip if opt_level <= 1
// Phase 72 (LateExpandSyncInstructions):
// Requires opt_level > 2
Cross-References
- CLI Options -- `--opt-level`, `--device-debug`, `--Ofast-compile`, `--allow-expensive-optimizations`
- Knobs System -- Knob 487 (master enable), knob 499 (opt-level guard)
- Pass Inventory -- Complete 159-phase table with per-phase descriptions
- Phase Manager -- AdvancedPhase hooks and dispatch loop
- Scoreboards -- Phase 115/116 scoreboard generation
- Scheduling -- Forward vs reverse scheduling direction
- Rematerialization -- O-level-dependent remat behavior
- Branch & Switch -- O0 disables all branch/switch optimization
- Sync & Barriers -- Per-pass opt_level thresholds
- Loop Passes -- O4/O5 aggressive loop peeling
- Optimization Pipeline -- Pipeline-level O0 vs O1+ gating
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_434320 | -- | CLI option parser; parses --opt-level at line 216, handles --Ofast-compile at lines 635--679, sets allow-expensive-optimizations default at line 768 | 0.95 |
sub_431A40 | -- | Debug mode override; forces opt-level to 0, disables cloning, resets register-usage-level when -g is active | 0.95 |
sub_7DDB50 | 232B | Opt-level accessor; returns ctx+2104 opt-level if knob 499 is enabled, otherwise returns 1 (O1 fallback). Called by 20+ passes as the runtime opt-level gate | 0.95 |
sub_1C96470 | -- | Generic CLI argument reader; called by sub_434320 to read --opt-level into options block offset +148 | 0.85 |
sub_67EB60 | -- | Fast-path knob query vtable function; identified inside sub_7DDB50 for knob 499 check | 0.80 |
sub_6614A0 | -- | Knob state direct-read function; used by sub_7DDB50 to read knob 499 via direct field access at offset 35928 | 0.80 |
sub_78B430 | -- | OriLoopSimplification execute function; checks opt_level for aggressive loop peeling (O4+) | 0.85 |
sub_913A30 | -- | SinkRemat core (phase 28); two-tier opt-level guard: skips at O0--O1, limited at O2--O4, full cutlass mode at O5 | 0.90 |
sub_8D0640 | -- | Scheduling infrastructure; selects forward (O1--O2) vs reverse (O3+) scheduling direction | 0.85 |
sub_8CBAD0 | -- | PreScheduleSetup; called with opt_level > 2 boolean to configure scheduling direction | 0.85 |
sub_957160 | -- | Fat-point greedy register allocator; does not directly branch on opt-level but is affected indirectly | 0.90 |
sub_A36360 | 52KB | Master scoreboard control word generator (phase 115, O1+ path) | 0.90 |
sub_A23CF0 | 54KB | DAG list scheduler heuristic for barrier assignment (phase 115, O1+ path) | 0.90 |
sub_C173E0 | -- | NvOpt level validator; emits "Invalid nvopt level : %d." for out-of-range values | 0.85 |
DUMPIR & NamedPhases
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The DUMPIR knob and NamedPhases option are the two primary mechanisms for inspecting ptxas's internal IR at arbitrary points in the 159-phase optimization pipeline. DUMPIR is an OCG string knob that triggers an IR dump after a named phase completes. NamedPhases is a separate OCG string knob (index 298) that restricts the pipeline to execute only the specified phases, effectively allowing selective phase execution and reordering. Both knobs accept phase names resolved through a case-insensitive binary search over a sorted table of 144 phase names (sub_C641D0, 305 bytes).
| DUMPIR knob | OCG string knob (ROT13: QhzcVE), registered in ctor_005 at 0x412B80 |
| NamedPhases knob | OCG knob index 298, runtime offset 21456 in knob value array |
| Phase name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Table sort | sub_C63FA0 (on-demand iterative quicksort via sub_C639A0) |
| Name table | 144 entries at off_22BD0C0 + 5 arch-specific additions |
| NamedPhases parser | sub_798B60 (1,776 bytes) |
| Phase fragment parser | sub_798280 (900 bytes) |
| Report passes | Phases 9, 96, 102, 126, 129, 130 |
| Sentinel return | 158 (NOP phase, returned on lookup failure) |
DUMPIR Knob
The DUMPIR knob is a string-valued OCG knob that takes one or more phase names. When set, the compiler dumps the Ori IR state after the named phase executes. This is the primary IR inspection mechanism for NVIDIA developers debugging the optimization pipeline.
Usage
ptxas -knob DUMPIR=AllocateRegisters input.ptx -o output.cubin
The knob value is a phase name string. The name is resolved through the phase name lookup function (sub_C641D0) using case-insensitive comparison, so allocateregisters, ALLOCATEREGISTERS, and AllocateRegisters all match.
The DUMPIR knob exists in two instantiations:
- OCG instance (ROT13: `QhzcVE` at `0x21BDBAD`): registered in `ctor_005` at `0x412B80`. This is the primary instance for the optimization pipeline.
- DAG instance (ROT13: `QhzcVE` at `0x21DCC95`): registered in `ctor_007` at `0x421920`. This controls IR dumps in the Mercury SASS/DAG pipeline.
Diagnostic Reference
The DUMPIR knob is referenced in register allocation error diagnostics. When a register allocation verification failure occurs, sub_A55D80 and sub_A76030 emit:
Please use -knob DUMPIR=AllocateRegisters for debugging
This tells the developer to re-run with the DUMPIR knob set to AllocateRegisters to inspect the IR state entering register allocation, which helps diagnose mismatches between pre- and post-allocation reaching definitions.
Related Dump Knobs
DUMPIR is part of a family of 17 dump-related OCG knobs across two constructor registrations. The OCG pipeline registers 11 dump knobs in ctor_005 (0x412A40--0x412D60); the Mercury/DAG pipeline registers 6 in ctor_007 (0x421880--0x421A10). All knob names and their definition-table offsets are ROT13-encoded in the binary (e.g. 0k14q0 decodes to 0x14D0).
OCG Pipeline Dump Knobs (ctor_005)
| Knob Name | ROT13 | Reg Address | Def Offset | Purpose |
|---|---|---|---|---|
DumpCallGraph | QhzcPnyyTencu | 0x412A40 | 0x1490 | Dump the inter-procedural call graph |
DumpCFG | QhzcPST | 0x412A90 | 0x14A0 | Dump the control flow graph |
DumpFlow | QhzcSybj | 0x412AE0 | 0x14B0 | Dump data flow information (reaching defs, live sets) |
DumpInstPhase | QhzcVafgCunfr | 0x412B30 | 0x14C0 | Dump per-instruction phase annotations |
DumpIR | QhzcVE | 0x412B80 | 0x14D0 | Dump the Ori IR after a named phase |
DumpIRInfoAsInteger | QhzcVEVasbNfVagrtre | 0x412BD0 | 0x14E0 | Dump IR with integer-format operand info |
DumpKnobs | QhzcXabof | 0x412C20 | 0x14F0 | Dump all knob values to stderr |
DumpPerfMetricsForBlock | QhzcCresZrgevpfSbeOybpx | 0x412C70 | 0x1500 | Dump per-basic-block performance metrics |
DumpPerfStats | QhzcCresFgngf | 0x412CC0 | 0x1510 | Dump performance statistics |
DumpSASS | QhzcFNFF | 0x412D10 | 0x1520 | Dump generated SASS assembly |
DumpSBInstInfo | QhzcFOVafgVasb | 0x412D60 | 0x1530 | Dump scoreboard per-instruction info |
The "Def Offset" column is the byte offset into the 16-byte-stride knob definition table. Dividing by 16 gives the definition-table index: DumpCallGraph is index 329, DumpSBInstInfo is index 339. These indices are distinct from the 72-byte runtime knob slot indices used by GetKnobIntValue.
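The offset-to-index arithmetic is trivial but worth pinning down; `def_table_index` is our name for it, not a recovered symbol:

```c
/* Definition-table index from a ROT13-decoded def offset, per the
   16-byte stride described above. */
static int def_table_index(int def_offset)
{
    return def_offset / 16;
}
```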
Adjacent knobs in ctor_005 (for boundary context):
- `0x4129F0`: `DoYieldInsertionWAR_SW2491854` (offset `0x1480`) -- immediately before DumpCallGraph
- `0x412DB0`: `EmitLDCU` (offset `0x1540`) -- immediately after DumpSBInstInfo
Mercury/DAG Pipeline Dump Knobs (ctor_007)
| Knob Name | ROT13 | Reg Address | Purpose |
|---|---|---|---|
DumpAnnot | QhzcNaabg | 0x421880 | Dump instruction annotations |
DumpCFG | QhzcPST | 0x4218D0 | Dump DAG pipeline CFG |
DumpIR | QhzcVE | 0x421920 | Dump DAG pipeline IR |
DumpMercOpCounts | QhzcZrepBcPbhagf | 0x421970 | Dump Mercury opcode distribution |
DumpReconstitutedBinary | QhzcErpbafgvghgrqOvanel | 0x4219C0 | Dump reconstituted binary output |
DumpRPO | QhzcECB | 0x421A10 | Dump reverse post-order traversal |
Two knob names appear in both pipelines, DumpCFG and DumpIR, but the instances are distinct: their string addresses differ (0x21BDBF0 vs 0x21DCCA0 for DumpCFG), and setting one does not affect the other.
NamedPhases Knob
The NamedPhases knob (OCG index 298) provides a mechanism to restrict the optimization pipeline to execute only specific phases. Unlike DUMPIR which passively observes, NamedPhases actively controls which phases run.
Knob Location
NamedPhases is at OCG knob index 298. The runtime byte offset is 298 * 72 = 21456 from the knob state base. This is confirmed by the decompiled code in sub_798B60:
// sub_798B60 (NamedPhases parser)
v11 = *(ctx + 72); // knob state base pointer
v12 = *(byte*)(v11 + 21456); // type tag at knob index 298
if (!v12) return 0; // knob not set => no filtering
if (v12 == 5) // type 5 = string
v14 = *(ptr*)(v11 + 21464); // string value at +8 from type tag
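A minimal C rendering of this access pattern, assuming a 72-byte slot whose first byte is the type tag and whose string payload sits at +8. The struct and all names here are ours for illustration; only the offsets and the type-tag values are recovered:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative 72-byte runtime knob slot as accessed in sub_798B60. */
typedef struct {
    uint8_t     type_tag;   /* 0 = unset, 5 = string */
    uint8_t     pad[7];
    const char *str_value;  /* valid only when type_tag == 5 */
    uint8_t     rest[56];   /* remainder of the 72-byte slot */
} KnobSlot;

/* NamedPhases is index 298, so its slot starts at byte 298 * 72 = 21456. */
static const char *read_string_knob(const KnobSlot *slots, int index)
{
    return slots[index].type_tag == 5 ? slots[index].str_value : NULL;
}
```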
Parser -- sub_798B60
The NamedPhases parser (sub_798B60, 1,776 bytes) reads the knob value string and parses it into parallel arrays of up to 256 entries. It is called from two sites:
- OCG pipeline (`sub_798B60` direct): parses the NamedPhases string from OCG knob index 298, referenced at address `0x798E90` where the string "NamedPhases" (`0x21B64C8`) appears in an error/diagnostic message.
- Mercury pipeline (`sub_9F4040`): the Mercury encoder's phase reordering mechanism also references the "NamedPhases" string at `0x9F42B0`, using the same knob to control Mercury-side phase execution.
The parser operates as follows:
- Reads knob value at offset 21456 from the knob state
- If the knob is unset (type byte == 0), returns immediately (no filtering)
- If the knob is a string (type byte == 5), extracts the string pointer
- Copies the string into a pool-allocated buffer
- Tokenizes using `strtok_r` with comma (`,`) as delimiter
- For each token, calls `sub_798280` (ParsePhaseNameFragment) to split the phase name from optional parameters
- Stores results in parallel arrays: `names[]`, `values[]`, `full_strings[]` (max 256 entries)
Phase Name Fragment Parser -- sub_798280
Each comma-separated token in the NamedPhases string is parsed by sub_798280 into two components:
- Phase name: characters up to the first `,` separator, uppercased during parsing
- Parameter suffix: characters after the `,`, up to the next `+` delimiter or end-of-string
The + character acts as an entry separator (analogous to how the DisablePhases string uses + to delimit multiple phase names). This allows:
-knob NamedPhases=PhaseA,param1+PhaseB,param2+PhaseC
Mercury NamedPhases -- sub_9F4040
The Mercury encoder pipeline (sub_9F4040, 1,850 lines decompiled) uses the NamedPhases knob to support phase reordering within the Mercury backend. In addition to standard pipeline phase names, it recognizes Mercury-specific pseudo-phases:
| Name | Decompiled Line | Match Method | Purpose |
|---|---|---|---|
shuffle | 843 | strlen + byte compare (8 chars) | Mercury instruction shuffle pass |
swap1 | 950 | strlen + byte compare (6 chars) | Mercury register swap level 1 |
swap2 | 1007 | strlen + byte compare (6 chars) | Mercury register swap level 2 |
swap3 | 1061 | strlen + byte compare (6 chars) | Mercury register swap level 3 |
swap4 | 1119 | strlen + byte compare (6 chars) | Mercury register swap level 4 |
swap5 | 1162 | strlen + byte compare (6 chars) | Mercury register swap level 5 |
swap6 | 1202 | strcmp() | Mercury register swap level 6 |
OriPerformLiveDead | 1556 | sub_C641D0() lookup | Liveness analysis within Mercury context |
OriCopyProp | 1648 | sub_C641D0() lookup | Copy propagation within Mercury context |
shuffle and swap1--swap6 are pure Mercury pseudo-phases: they do not exist in the main 144-entry phase name table at off_22BD0C0. Their name matching is done inline with strlen-guarded character comparison (not strcmp -- except swap6 which uses a full strcmp call, likely because it is the last in a fallthrough chain).
OriPerformLiveDead and OriCopyProp resolve through sub_C641D0 (the standard binary search), meaning they ARE in the main phase table. They are special in that Mercury conditionally inserts them into its own phase sequence rather than inheriting them from the standard pipeline ordering. The insertion is guarded by state flags (v234, v252, v240 for OriPerformLiveDead; v222, v236, v257 for OriCopyProp), suggesting they are injected only when the Mercury encoder detects certain register-pressure or correctness conditions.
Phase Name Lookup -- sub_C641D0
The binary search function sub_C641D0 (305 bytes) resolves a phase name string to a phase index. It is the core name resolution used by both DUMPIR and NamedPhases.
Algorithm
int PhaseManager::lookup_phase(const char* query) {
ensure_sorted(); // sub_C63FA0
// Binary search over sorted {name_ptr, index} pairs
// Each entry is 16 bytes: [8-byte name pointer, 4-byte phase index, 4-byte padding]
int lo = 0, hi = sorted_count;
while (hi > 0) {
int mid = hi / 2;
// Case-insensitive string comparison via tolower()
int cmp = strcasecmp(table[lo + mid].name, query);
if (cmp < 0) {
hi -= mid + 1;
lo += mid + 1;
} else if (cmp == 0) {
return table[lo + mid].index; // found
} else {
hi = mid;
}
}
// Verify final position (handles edge case)
if (lo < sorted_count && strcasecmp(table[lo].name, query) == 0)
return table[lo].index;
return 158; // sentinel: NOP phase
}
The comparison uses tolower() on each character individually, making the search fully case-insensitive. On lookup failure, the function returns 158 (the sentinel NOP phase), not an error code. This means misspelled phase names silently resolve to a no-op rather than producing an error.
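The same semantics in a compact runnable form: a case-insensitive binary search returning the 158 sentinel on failure. The lo/hi formulation here is equivalent to the decompiled count-based loop above, and any table contents a caller supplies are illustrative, not the real 145-entry table:

```c
#include <strings.h>

/* Minimal rendering of the lookup semantics: sorted {name, index}
   table, strcasecmp comparison, NOP sentinel 158 on failure. */
typedef struct {
    const char *name;
    int index;
} PhaseEntry;

static int lookup_phase(const PhaseEntry *table, int count, const char *query)
{
    int lo = 0, hi = count - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcasecmp(table[mid].name, query);

        if (cmp < 0)
            lo = mid + 1;
        else if (cmp > 0)
            hi = mid - 1;
        else
            return table[mid].index;
    }
    return 158; /* NOP sentinel: misspelled names silently no-op */
}
```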
Sorted Table Construction -- sub_C63FA0
The sorted name table is lazily constructed. sub_C63FA0 checks whether the current sorted count matches the expected count (stored at PhaseManager+104). If they differ, it:
- Grows the sorted table array if needed (1.5x growth policy)
- Copies name pointers from the raw phase name table (`off_22BD0C0`)
- Each entry is 16 bytes: `{char* name, int phase_index}`, where `phase_index` is the array position
- Sorts using iterative quicksort (`sub_C639A0`) with median-of-three pivot selection
The sort is performed once and cached. Subsequent lookups reuse the sorted table without re-sorting.
Report Passes
Six phases in the pipeline are dedicated diagnostic/dump passes. They are no-ops by default and activate only when specific debug options are enabled:
| Phase | Name | Trigger | Output |
|---|---|---|---|
| 9 | ReportInitialRepresentation | DUMPIR knob, --keep | Ori IR after initial lowering (pre-optimization) |
| 96 | ReportBeforeScheduling | DUMPIR knob, --keep | Ori IR entering scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | DUMPIR knob, --keep | Ori IR after register allocation |
| 126 | ReportFinalMemoryUsage | --stat=phase-wise | Memory pool consumption summary |
| 129 | DumpNVuCodeText | --keep, DUMPIR | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | --keep, DUMPIR | Raw SASS hex dump |
Additionally, ReportBeforeRegisterAllocation (at 0x22BD068) is a phase name in the table but is handled as an arch-specific phase (index >= 139), providing an IR dump point immediately before register allocation in backends that override it.
Report Pass Activation
Report passes check their activation condition in the isNoOp() virtual method. When the DUMPIR knob is set to a phase name, the report pass compares the current phase name against the DUMPIR value. If they match, isNoOp() returns false and the pass executes its dump logic.
The dispatch loop in sub_C64F70 constructs diagnostic context strings around each phase execution:
// Before execution (line 117 of sub_C64F70):
*(_QWORD *)buffer = 0x2065726F666542LL; // "Before " as 8-byte LE literal
memcpy(buffer + 7, phase_name, len + 1); // append phase name after "Before "
// After execution (line 196 of sub_C64F70):
strcpy(buffer, "After "); // 6-byte prefix
memcpy(buffer + 6, phase_name, len + 1); // append phase name after "After "
The literal 0x2065726F666542 decomposes as bytes 42 65 66 6F 72 65 20 = ASCII "Before " (7 bytes including trailing space, plus a null in the 8th byte that gets overwritten by memcpy). The "After " path uses strcpy instead of a literal store because it is only 6 bytes and the code path is post-execution (not latency-critical).
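A quick sanity check of that decomposition (on a little-endian host, matching ptxas's x86-64 target); the helper name is ours:

```c
#include <stdint.h>
#include <string.h>

/* Storing the recovered 64-bit literal little-endian writes the bytes
   42 65 66 6F 72 65 20 00, i.e. "Before " plus a trailing NUL. */
static void write_before_prefix(char buf[8])
{
    uint64_t lit = 0x2065726F666542ULL;
    memcpy(buf, &lit, 8);
}
```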
These strings appear in diagnostic output when --stat=phase-wise is enabled:
Before GeneralOptimize :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
After GeneralOptimize :: [Total 1456 KB] [Freeable 789 KB] [Freeable Leaked 23 KB] (3%)
The string addresses in the binary are:
- `"Before "` at `0x22BC3D3`
- `"After "` at `0x22BC3DB`
- `" :: "` at `0x22BC3E2` (separator between phase name and stats)
- `"[Total "` at `0x22BC3E9`
- `"[Freeable "` at `0x22BC3F6`
- `"[Freeable Leaked "` at `0x22BC401`
- `"All Phases Summary"` at `0x22BC416` (final summary label)
Phase-Wise Statistics -- --stat=phase-wise
The --stat CLI option (processed in sub_432A00 at 0x432E5A) accepts a comma-separated list of report modes:
ptxas --stat=phase-wise input.ptx -o output.cubin
| Mode | Short | Description |
|---|---|---|
time | t | Print compilation time |
memory | m | Print peak memory usage |
phase-wise | p | Print per-phase time and memory delta |
detailed | d | Print all of the above |
When phase-wise is enabled (string comparison at 0x4460F8 in sub_445EB0), the dispatch loop's timing flag (PhaseManager+72) is set, and sub_C64310 runs after every phase to print memory deltas.
IR Output Format
The DUMPIR dump emits a per-function statistics header (using # comment prefix) followed by the Ori IR listing. The statistics header is emitted by sub_A3A7E0 and contains hardware performance estimates computed from the current IR state.
Per-Function Statistics Header
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [instHint=142] [instPairs=4]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]
# [SharedMem Alloc thru=0.000000]
# [Precise inst=0]
The format strings are at two locations in rodata:
| Address Range | Context | Notes |
|---|---|---|
0x21EBF76--0x21EC3B0 | Pre-register-allocation stats | Commas between some [SSpillB=%d], [SRefillB=%d] fields |
0x21FA008--0x21FA0A0 | Post-register-allocation stats | No commas: [SSpillB=%d] [SRefillB=%d] |
The typo "trancedental" (for "transcendental") in the estimate and throughput lines is present in the binary itself and matches NVIDIA's original source.
Statistics Field Glossary
| Field | Meaning |
|---|---|
inst | Total instruction count |
texInst | Texture/surface instruction count |
tepid | Texture instruction count (alternate metric) |
rregs | R-register (GPR) count |
LSpillB / LRefillB | Local-memory spill/refill byte counts |
SSpillB / SRefillB | Shared-memory spill/refill byte counts |
LowLmemSpillSize | Low local-memory spill total |
FrameLmemSpillSize | Frame-level local-memory spill total |
LNonSpillB / LNonRefillB | Local-memory non-spill traffic bytes |
Occupancy | Estimated warp occupancy (0.0--1.0) |
numDivergentBranches | Estimated divergent branch count |
attributeMemUsage | Attribute memory usage (shader inputs) |
programSize | Total program size in bytes |
issue thru | Issue throughput (instructions per cycle) |
fp thru / half thru | FP32 / FP16 throughput |
trancedental thru | Transcendental (SFU) throughput |
ipa thru | Interpolation throughput |
shared thru | Shared memory throughput |
texLoadStore thru | Texture + load/store throughput |
reg thru | Register throughput |
warp thru | Warp-level throughput |
CB-Bound Tex | Constant-bank-bound texture references |
UR-Bound Tex | Uniform-register-bound texture references |
Bindless Tex | Bindless texture references |
UDP inst | Uniform datapath instruction count |
numVecToURConverts | Vector-to-uniform-register conversion count |
maxNumLiveValuesAtSuspend | Peak live values at suspension point |
instHint / instPairs | Instruction hint count / instruction pair count |
worstcaseLat / avgcaseLat | Worst-case / average-case latency estimates |
SharedMem Alloc thru | Shared memory allocation throughput |
Precise inst | Precise (non-relaxed) instruction count |
Mercury Pipeline Dump Points
The Mercury/DAG pipeline emits its own "After" labels at fixed points in the encode-decode-expand flow. These labels are used by both the DumpIR (DAG) knob and the DumpAnnot knob:
| Label | String Address | Pipeline Stage |
|---|---|---|
After Decode | 0x202D5CE | After initial SASS decode |
After Expansion | 0x202D5DB | After instruction expansion |
After WAR post-expansion | 0x202D604 | After WAR insertion (post-expansion) |
After Opex | 0x202D60F | After operand expansion |
After WAR post-opexing | 0x202DCD0 | After WAR insertion (post-opex) |
After MercWARs | 0x202DCDF | After Mercury WAR pass |
After MercOpex | 0x21E5C33 | After Mercury operand expansion |
After MercConverter | 0x22B7B38 | After Mercury format conversion |
After MercExpand | 0x22BC3DB | After Mercury instruction expansion |
After EncodeAndDecode | 0x23D1A60 | After encode-decode round-trip |
Memory Statistics Format
The sub_C64310 (ReportPhaseStats) function formats memory sizes using three thresholds:
| Size Range | Format | Example |
|---|---|---|
| < 1024 bytes | %d (raw integer) | 512 |
| < 10 MB | %.3lf KB (kilobytes, 3 decimals) | 1234.567 KB |
| >= 10 MB | %.3lf MB (megabytes, 3 decimals) | 12.345 MB |
The memory format reuses the suffix from "PeakMemoryUsage = %.3lf KB" (at 0x1CE7BB6) by referencing the string at offset +24 to extract just " KB". The pool-consumption variant uses "[Pool Consumption = " at 0x22BC3B3.
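The three-threshold formatting can be sketched as follows, assuming "10 MB" means 10 × 1024 × 1024 bytes (the exact cutoff constant was not separately recovered); the function name is ours:

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Sketch of sub_C64310's size formatting: raw bytes under 1 KB,
   "%.3lf KB" under the assumed 10 MB cutoff, "%.3lf MB" above it. */
static void format_mem_size(char *out, size_t cap, unsigned long bytes)
{
    if (bytes < 1024)
        snprintf(out, cap, "%lu", bytes);
    else if (bytes < 10UL * 1024 * 1024)
        snprintf(out, cap, "%.3lf KB", bytes / 1024.0);
    else
        snprintf(out, cap, "%.3lf MB", bytes / (1024.0 * 1024.0));
}
```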
Phase Name Table
The static phase name table at off_22BD0C0 contains 145 entries: 1 sentinel ("All Phases Summary") plus 144 phase names. After sorting by sub_C63FA0, the binary search in sub_C641D0 provides O(log n) lookup -- approximately 8 comparisons for 145 entries.
The 144 non-sentinel entries include:
- 139 base pipeline phases (indices 0--138) with fixed names
- 5 arch-specific phase aliases that map to indices >= 139:
`LateEnforceArgumentRestrictions`, `UpdateAfterScheduleInstructions`, `UpdateAfterOriDoSyncronization`, `ReportBeforeRegisterAllocation`, `UpdateAfterOriAllocateRegisters`
The AllocateRegisters string (0x21F0229) also appears as a phase name referenced by the register allocation subsystem (sub_A55D80, sub_A76030) and is present in the name table at 0x22BD490.
Interaction with --keep
The --keep flag triggers output file retention and activates certain report passes. When --keep is set:
- Phase 129 (`DumpNVuCodeText`) writes a human-readable SASS disassembly to a `.sass` file
- Phase 130 (`DumpNVuCodeHex`) writes raw SASS binary as hex
- Report phases 9, 96, and 102 may produce `.ori` intermediate representation dumps
The --keep flag is processed in the CLI option handler (sub_43CC70 at 0x43D850) which generates the .sass file extension.
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
sub_798280 | 900 | ParsePhaseNameFragment -- splits NAME,PARAM from NamedPhases token | MEDIUM |
sub_798B60 | 1,776 | NamedPhases::ParsePhaseList -- tokenizes NamedPhases knob string | CERTAIN |
sub_9F4040 | ~7,400 | MercuryNamedPhases -- Mercury pipeline phase selection/reordering | HIGH |
sub_A3A7E0 | ~2,000 | CodeObject::EmitStats -- per-function statistics header printer | HIGH |
sub_C639A0 | ~800 | QuicksortNameTable -- iterative quicksort for phase name table | MEDIUM |
sub_C63FA0 | ~600 | EnsureSortedNameTable -- lazy sorted table construction | MEDIUM |
sub_C641D0 | 305 | PhaseManager::LookupPhase -- case-insensitive binary search | CERTAIN |
sub_C64310 | 3,168 | PhaseManager::ReportPhaseStats -- per-phase timing/memory reporter | HIGH |
sub_C64F70 | 1,455 | PhaseManager::Dispatch -- main phase execution loop | CERTAIN |
sub_A55D80 | ~2,000 | RegAlloc::VerifyReachingDefs -- references DUMPIR in error message | HIGH |
sub_A76030 | ~1,000 | RegAlloc::VerifyMismatch -- references DUMPIR in error message | HIGH |
Reimplementation Notes
- DUMPIR is a string knob, not a boolean. The value is a phase name that triggers a dump after that specific phase. To dump at multiple points, run separate compilations with different DUMPIR values. There is no comma-separated multi-phase dump syntax for DUMPIR itself.
- NamedPhases uses comma+plus syntax. Commas separate name-from-parameter within a single entry; `+` separates multiple entries. The phase name portion is uppercased during parsing. Parameters are preserved as-is.
- Lookup failure is silent. An unrecognized phase name in DUMPIR or NamedPhases resolves to phase index 158 (NOP sentinel), not an error. The compiler does not warn about misspelled phase names.
- The sorted table is 16 bytes per entry: `{char* name, int32 index, int32 padding}`. The sort is stable only within the quicksort's three-way partitioning -- duplicate names (which do not occur in practice) would have undefined ordering.
- Two DumpIR knob instances exist (OCG and DAG). They are independent -- setting one does not affect the other. The OCG instance controls the 159-phase optimization pipeline; the DAG instance controls the Mercury SASS pipeline. `DumpCFG` and `DumpIR` each have separate OCG and DAG instances with distinct ROT13 string addresses; `DumpAnnot` and `DumpRPO` exist only on the DAG side.
- Memory statistics format uses three thresholds: bytes (< 1 KB), kilobytes with 3 decimals (< 10 MB), megabytes with 3 decimals (>= 10 MB). The reporter is `sub_C64310`.
- NamedPhases in Mercury (`sub_9F4040`) supports 7 pure pseudo-phases (`shuffle`, `swap1`--`swap6`) that do not exist in the main phase table. These use inline strlen-guarded byte comparison, not `strcmp` (except `swap6`). Two additional names (`OriPerformLiveDead`, `OriCopyProp`) ARE in the main table but are conditionally injected into Mercury's phase sequence based on register-pressure/correctness state flags.
- The "Before" string is a raw 8-byte LE literal store, not a `strcpy`. The dispatch loop writes `0x2065726F666542` directly to the buffer, which is `"Before "` in ASCII. This is a micro-optimization for the hot path (pre-phase execution). The "After" path uses `strcpy` since it is post-execution.
- Statistics header has two variants. The pre-register-allocation format strings (at `0x21EC050`) use commas between some spill fields: `[SSpillB=%d], [SRefillB=%d]`. The post-register-allocation variant (at `0x21FA008`) drops those commas: `[SSpillB=%d] [SRefillB=%d]`. A reimplementation should match whichever variant is appropriate for the dump point.
- The "trancedental" typo is canonical. Both the format string and the stats output use "trancedental" (for "transcendental"). A reimplementation should preserve this spelling for compatibility with tools that parse the output.
Cross-References
- Knobs System -- DUMPIR and NamedPhases are OCG knobs; ROT13 encoding, type system, access patterns
- CLI Options -- `--stat=phase-wise`, `--keep` flags that activate report passes
- Phase Manager -- dispatch loop, phase factory, name table infrastructure
- Pass Inventory -- complete 159-phase table with report pass positions
- Register Allocator -- DUMPIR=AllocateRegisters diagnostic reference
- Mercury Encoder -- Mercury-side NamedPhases and DAG DumpIR knob
Memory Pool Allocator
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas replaces malloc/free with a custom hierarchical pool allocator for the vast majority of allocations. The allocator (sub_424070, 3,809 callers) is the single most-used allocation function in the binary. Every IR node, hash map, linked list, phase object, and temporary buffer flows through pools. The design serves two goals: fast allocation via size-class free lists, and per-compilation-unit lifetime management via hierarchical pool ownership.
| Allocator | sub_424070 (2,098 bytes, 3,809 callers) |
| Deallocator | sub_4248B0 (923 bytes, 1,215 callers) |
| Reallocator | sub_424C50 (488 bytes, 27 callers) |
| OOM handler | sub_42BDB0 (14 bytes, 3,825 callers) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers) |
| Stats header | sub_423A10 (323 bytes) -- prints "Memory space statistics for ..." banner |
| Stats detail | sub_425020 (~1,500 bytes) -- full per-pool metrics, recursive into children |
| Stats entry | sub_425AB0 (80 bytes) -- mutex-wrapped entry point for stats dump |
| OCG stats | sub_6936B0 (120 bytes) -- OCG mem space fixed-format stats to stderr |
| Pool teardown | sub_4234D0 (258 bytes) |
| Pool accounting | sub_423600 (922 bytes) |
| Slab registration | sub_423E50 (544 bytes) |
| Size-class index | sub_42BE50 (floor-log2, 64 bytes) |
| Slab growth | sub_423B60 / sub_423C70 |
| Global fallback | sub_427A10 (raw malloc wrapper) |
| System free | sub_427B30 (raw free wrapper) |
| Pool reporter | sub_C62200 (888 bytes) |
| Consumption query | sub_8DAE60 (32 bytes) |
| Snapshot | sub_8DADE0 (48 bytes) |
Pool Object Layout
The pool object is at least 7,136 bytes. It contains pool metadata at low offsets, large-block free lists indexed by power-of-2 order in the middle range, small-block free lists indexed by size class starting at offset +2128, and a mutex pointer at the end.
Pool Object (~7136 bytes)
+0 ptr large_block_list singly-linked list of large-block slab descriptors
+32 u32 min_slab_size minimum slab allocation (default from pool creator)
+44 u32 slab_count number of slabs allocated for this pool
+48 ptr large_free_list free list head for large blocks
+56 u32 fragmentation_count decremented on block split
+60 u32 max_order highest power-of-2 order currently tracked
+64.. ptr[] order_free_lists per-order free list: *(pool + 32*(order+2)) = head
+2112 ptr tracking_map hash map for allocation metadata (when enabled)
+2128.. ptr[] small_free_lists 625 bins: *(pool + 8*(size>>3) + 2128) = head
+7128 mutex* pool_mutex pthread_mutex_t* for thread safety
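The layout above can be rendered as an illustrative C struct. All field names are ours, and the 64-order count is an inference from the 32-byte per-order stride between +64 and the tracking map at +2112 ((2112 - 64) / 32 = 64); only the offsets themselves are recovered:

```c
#include <stdint.h>
#include <stddef.h>
#include <pthread.h>

/* One per-order slot: a free-list head plus unrecovered metadata,
   giving the observed 32-byte stride (*(pool + 32*(order+2))). */
typedef struct PoolOrderSlot {
    void   *head;
    uint8_t meta[24];
} PoolOrderSlot;

typedef struct Pool {
    void         *large_block_list;      /* +0    slab descriptor list */
    uint8_t       pad0[24];
    uint32_t      min_slab_size;         /* +32  */
    uint8_t       pad1[8];
    uint32_t      slab_count;            /* +44  */
    void         *large_free_list;       /* +48  */
    uint32_t      fragmentation_count;   /* +56  */
    uint32_t      max_order;             /* +60  */
    PoolOrderSlot order_lists[64];       /* +64   32-byte stride/order */
    void         *tracking_map;          /* +2112 */
    uint8_t       pad2[8];
    void         *small_free_lists[625]; /* +2128 625 size-class bins */
    pthread_mutex_t *pool_mutex;         /* +7128 */
} Pool;                                  /* sizeof(Pool) == 7136 */
```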
Size-Class Bins (Small Path)
Small allocations (up to 4,999 bytes) are served from 625 free-list bins. Each bin holds blocks of exactly one size class. The bin index is computed from the 8-byte-aligned allocation size:
aligned_size = max(16, (requested + 7) & ~7)
bin_index = aligned_size >> 3
bin_head = *(pool + 8 * bin_index + 2128)
This gives bins for sizes 16, 24, 32, 40, ... up to 4,992 bytes (the largest multiple of 8 that is <= 4,999). The minimum allocation is 16 bytes because each free block stores a next-pointer (8 bytes) and a slab-descriptor back-pointer (8 bytes).
Order Free Lists (Large Path)
Large allocations (above 4,999 bytes) use power-of-2 order free lists. The order is computed by sub_42BE50, which returns floor(log2(size)) by clearing all bits except the highest set bit, then using _BitScanForward64. The free list for order k is at pool offset 32*(k+2), i.e., within the order_free_lists array that starts at +64 (see the pool layout above). The pool tracks max_order at +60 to avoid scanning empty higher-order lists.
Allocation Algorithm -- sub_424070
The allocator takes two arguments: a pool pointer (a1) and a size (a2). When a1 is NULL, it falls through to the global allocator (sub_427A10) which wraps malloc. Otherwise, it acquires the pool mutex and dispatches to one of two paths based on the aligned size.
// Pseudocode for sub_424070
void* pool_alloc(Pool* pool, size_t size) {
if (!pool)
return global_alloc(size); // sub_427A10 -> malloc
pthread_mutex_lock(pool->mutex); // pool + 7128
size_t aligned = (size + 7) & ~7;
if (aligned <= 4999) {
// --- Small path ---
if (aligned < 16) aligned = 16;
size_t bin = aligned >> 3;
FreeNode** head = &pool->small_free_lists[bin];
if (!*head) {
// Bin empty: allocate a new slab from parent pool
if (!can_grow(pool->min_slab_size)) // sub_423B60
goto oom;
// 1. Allocate 56-byte slab descriptor from parent pool
Pool* parent = get_tls_context()->parent_pool;
SlabDesc* desc = pool_alloc(parent, 56);
// 2. Compute slab memory: aligned * ceil(min_slab_size / aligned)
size_t slab_bytes = aligned * ((aligned + pool->min_slab_size - 1) / aligned);
// 3. Allocate slab memory from parent
void* slab_mem = pool_alloc(parent, slab_bytes);
// 4. Initialize slab descriptor
desc->total_size = slab_bytes; // +8
desc->available_size = slab_bytes; // +16
desc->owning_pool = pool; // +24
desc->memory_base = slab_mem; // +32
desc->is_small_slab = 1; // +40
desc->slab_id = atomic_inc(&global_slab_counter);
desc->bin_size = aligned; // +48
// 5. Carve slab into free-list nodes
char* cursor = slab_mem + slab_bytes;
FreeNode* list = NULL;
while (cursor > slab_mem) {
cursor -= aligned;
((FreeNode*)cursor)->next = list;
((FreeNode*)cursor)->slab = desc;
list = (FreeNode*)cursor;
}
*head = list;
// 6. Register slab in tracking structures
register_slab(desc); // sub_423E50
pool->slab_count++;
}
// Pop from free list
FreeNode* block = *head;
*head = block->next;
block->slab->available_size -= aligned;
pthread_mutex_unlock(pool->mutex);
return block;
}
// --- Large path ---
size_t total = aligned + 32; // 32 bytes for boundary tag header
// Search order free lists starting from floor(log2(total))
int order = floor_log2(total); // sub_42BE50
while (order <= pool->max_order) {
BoundaryTag* block = pool->order_lists[order];
while (block) {
if (block->payload_size >= total) {
// Found a fit: unlink from free list
unlink_free_block(block);
block->sentinel = -1; // mark allocated
// Split remainder if >= 40 bytes
size_t remainder = block->payload_size - total;
if (remainder > 39) {
split_block(block, total, remainder);
pool->fragmentation_count--;
}
// Update slab accounting
slab_desc->available_size -= block->tag_offset;
pthread_mutex_unlock(pool->mutex);
return (char*)block + 32; // skip header
}
block = block->next_free;
}
order++;
}
// No fit found: allocate new large slab from parent
// (allocates 88-byte slab descriptor + slab memory + 64 bytes for
// header/footer boundary tags)
allocate_large_slab(pool, total);
// retry search...
pthread_mutex_unlock(pool->mutex);
return result;
}
Critical Constants
| Constant | Value | Meaning |
|---|---|---|
| 0x1387 | 4,999 | Small/large allocation threshold |
| 16 | Minimum allocation | Free node: 8-byte next + 8-byte slab pointer |
| 32 | Boundary tag header size | Sentinel + prev + tag_offset + payload_size |
| 39 (0x27) | Minimum split remainder | Must hold a full boundary tag + at least 8 bytes |
| 56 | Slab descriptor size (small) | 7 fields |
| 88 | Slab descriptor size (large) | Extended with boundary-tag metadata |
| 64 | Overhead for large slab | Header (32) + footer (32) boundary tags |
Deallocation Algorithm -- sub_4248B0
The deallocator takes a single pointer argument. It locates the owning pool through the slab descriptor back-pointer (stored either inline for small blocks, or recoverable from boundary tags for large blocks), then returns the memory to the appropriate free list.
// Pseudocode for sub_4248B0
void pool_free(void* ptr) {
  if (!ptr) { system_free(ptr); return; } // sub_427B30; free(NULL) is a no-op
// Locate slab descriptor via tracking map
SlabDesc* desc = find_slab(ptr);
if (!desc) { system_free(ptr); return; }
Pool* pool = desc->owning_pool;
pthread_mutex_lock(pool->mutex);
if (desc->is_small_slab) {
// Small block: push back onto size-class free list
size_t bin_size = desc->bin_size;
size_t bin = bin_size & ~7;
FreeNode** head = &pool->small_free_lists[bin >> 3];
((FreeNode*)ptr)->slab = desc;
((FreeNode*)ptr)->next = *head;
*head = (FreeNode*)ptr;
desc->available_size += bin_size;
} else {
// Large block: coalesce with adjacent free blocks
BoundaryTag* header = (BoundaryTag*)((char*)ptr - 32);
size_t block_size = header->payload_size;
// Validate sentinel (must be -1 = allocated)
assert(header->sentinel == -1);
desc->available_size += block_size;
// Check next block's sentinel
BoundaryTag* next = (BoundaryTag*)((char*)ptr - 32 + block_size);
if (next->sentinel != -1) {
// Next block is free: unlink and merge
unlink_free_block(next);
header->payload_size += next->payload_size;
// Update footer
}
// Check prev block via footer tag
BoundaryTag* prev_footer = (BoundaryTag*)((char*)header - header->prev_free);
if (prev_footer->sentinel != -1) {
// Prev block is free: merge into prev
prev_footer->payload_size += header->payload_size;
// Update footer
} else {
// Header becomes free: insert into order free list
header->sentinel = 0; // mark free
int order = floor_log2(header->payload_size);
insert_free_block(pool, order, header);
}
}
pthread_mutex_unlock(pool->mutex);
}
Small Block Free-List Node
Each free block in a small bin stores two pointers in the returned memory region itself (since the block is not in use):
Small Free Node (aligned_size bytes, minimum 16)
+0 ptr next next free node in this bin, or NULL
+8 ptr slab_desc back-pointer to owning slab descriptor
On allocation, the node is popped from the head. On deallocation, the node is pushed back to the head. This is a classic LIFO (stack) free list with O(1) alloc and free.
Boundary Tag Format (Large Blocks)
Large blocks use a classic Knuth-style boundary tag scheme. Every allocated or free block has a 32-byte header before the user payload and a 32-byte footer at the end. The sentinel field distinguishes allocated blocks (-1) from free blocks (pointer to next free block, or 0).
Large Block Layout
┌──────────────────────────────────────────────────────────────────┐
│ Header (32 bytes) │
│ +0 i64 sentinel -1 = allocated, else next_free ptr │
│ +8 ptr prev_free previous in order free list │
│ +16 u64 tag_offset always 32 (header size) │
│ +24 u64 payload_size user allocation size │
├──────────────────────────────────────────────────────────────────┤
│ User Payload (payload_size - 64 bytes) │
│ ... returned to caller ... │
├──────────────────────────────────────────────────────────────────┤
│ Footer (32 bytes, at end of block) │
│ +0 i64 sentinel mirrors header sentinel │
│ +8 ptr prev_free (unused in footer) │
│ +16 u64 footer_tag always 32 │
│ +24 u64 block_size total block size including headers │
└──────────────────────────────────────────────────────────────────┘
The footer allows the deallocator to coalesce with the preceding block by reading block_size from the footer of the previous block, then checking whether that block's header sentinel is -1 (allocated) or a free-list pointer. This enables bidirectional coalescing in O(1) without maintaining a separate block-address data structure.
Block Splitting
When a large free block is larger than needed, the allocator splits it if the remainder exceeds 39 bytes (enough for a header + footer + at least 8 bytes of payload). The split creates a new free block from the remainder and inserts it into the appropriate order free list. The pool's fragmentation_count is decremented on each split.
Slab Descriptor
Every slab (contiguous memory region backing allocations) is tracked by a descriptor. Small slabs use 56-byte descriptors; large slabs use 88-byte descriptors with additional boundary-tag metadata.
Small Slab Descriptor (56 bytes)
SlabDesc (56 bytes)
+0 ptr chain_link next descriptor in pool's slab chain
+8 u64 total_size total slab memory in bytes
+16 u64 available_size bytes currently free (decremented on alloc)
+24 ptr owning_pool back-pointer to the pool that owns this slab
+32 ptr memory_base base address of the contiguous slab memory
+40 u8 is_small_slab 1 = small-alloc slab, 0 = large-alloc slab
+44 u32 slab_id global atomic sequence number
+48 u32 bin_size size class this slab serves
Large Slab Descriptor (88 bytes)
Large slab descriptors extend the base 56 bytes with fields for boundary-tag free-list management. The memory base at +32 points to the raw allocation, which begins with a 32-byte header boundary tag. The descriptor at +48 points to the final footer boundary tag.
Hierarchical Pool Model
Pools form a tree. The root is a global fallback that wraps malloc/free. Below it are named pools created by the compilation driver. Each named pool allocates its slab memory from its parent pool.
┌─────────────────────────────────┐
│ Global Fallback (a1 = NULL) │
│ sub_427A10 -> malloc │
│ sub_427B30 -> free │
└─────────┬───────────────────────┘
│
┌─────────▼───────────────────────┐
│ "Top level ptxas memory pool" │
│ Created in sub_446240 (driver) │
│ Lifetime: entire compilation │
└─────┬───────────┬───────────────┘
│ │
┌─────▼─────┐ ┌─▼──────────────────────────┐
│ "Command │ │ Per-compilation-unit pool │
│ option │ │ (from compilation_ctx +16) │
│ parser" │ └──┬──────────┬───────────────┘
└───────────┘ │ │
┌───────▼──┐ ┌──▼───────────────────┐
│ "PTX │ │ "Permanent OCG │
│ parsing │ │ memory pool" │
│ state" │ │ per-kernel OCG state │
└──────────┘ └───┬───────────────────┘
│
┌───▼───────────────┐
│ "elfw memory │
│ space" (4096 init)│
│ ELF output buffer │
└───────────────────┘
Known Named Pools
| Name | Creator | Lifetime | Purpose |
|---|---|---|---|
"Top level ptxas memory pool" | sub_446240 | Entire process | Root of all sub-pools |
"Command option parser" | sub_446240 | Entire process | CLI option storage |
"Permanent OCG memory pool" | 0x1CE7B2B ref | Per-kernel | OCG IR and pass state |
"PTX parsing state" | sub_451730 | Per-parse | Lexer/parser temporaries |
"elfw memory space" | sub_1CB53A0 / sub_4258D0 | Per-ELF-output | ELF world (672-byte object, 4096 initial) |
Parent Pool Resolution
When the allocator needs a new slab, it calls sub_4280C0 to get the thread-local context, which holds a parent pool pointer at byte offset +192 (qword offset 24). This TLS context is a 280-byte (0x118) struct allocated via raw malloc on first access per thread, initialized with pthread_cond_t at +128, pthread_mutex_t at +176, and sem_t at +216.
// TLS context layout (280 bytes = 0x118)
struct TLSContext {
uint64_t error_flags; // +0
uint64_t has_error; // +8
// ... diagnostic fields ...
void* parent_pool; // +192 (qword index 24)
// ...
pthread_cond_t cond; // +128 (48 bytes)
pthread_mutex_t mutex; // +176 (40 bytes)
sem_t sem; // +216
// ... diagnostic suppression ... // +384-416
};
The parent pool pointer determines where slab memory is allocated from. For the top-level pool, the parent is the global allocator (NULL pool, i.e., malloc). For sub-pools, the parent is the enclosing pool.
Thread Safety
Every pool operation acquires a per-pool mutex at offset +7128. The mutex is lazily initialized: on first use, sub_4286A0 (a once-init guard) creates the mutex via sub_428320 (pthread_mutex_init). The initialization itself is serialized through a separate global once-init mechanism (sub_42BDD0 saves/restores some state around the initialization).
There is also a global mutex at qword_29FDC08 that protects the outstanding slab growth counter (dword_29FDBF4) and the global emergency-reclaim state (qword_29FDC00). The allocator acquires this global mutex briefly after creating new slabs to decrement the outstanding-growth counter.
Locking Sequence
1. Lock pool->mutex (per-pool, offset +7128)
2. Perform allocation or deallocation
3. If new slab was created:
a. Lock global_mutex (qword_29FDC08)
b. Decrement dword_29FDBF4 (outstanding growth count)
c. Unlock global_mutex
4. Unlock pool->mutex
The locking is strictly ordered (pool mutex first, then global mutex if needed), preventing deadlock between pool operations. There is no lock-free fast path -- every allocation takes the pool mutex.
OOM Handling
The OOM handler sub_42BDB0 is a 14-byte stub that forwards to sub_42F590 (the central diagnostic/fatal-error emitter) with the error descriptor at unk_29FA530. This triggers a longjmp to abort the current compilation.
// sub_42BDB0 -- 14 bytes, 3825 callers
void alloc_fail_abort(void* pool, size_t size, ...) {
fatal_error(&internal_error_descriptor, size, ...);
// does not return -- longjmp
}
Every allocation site in ptxas follows the same pattern:
void* p = pool_alloc(pool, size);
if (!p) alloc_fail_abort(pool, size); // sub_42BDB0
The 3,825 call sites for sub_42BDB0 closely track the 3,809 callers of sub_424070 (the difference comes from realloc and a few indirect call sites). This is an unconditional abort -- there is no graceful degradation or fallback allocation strategy.
Emergency Reclaim
Before aborting, the allocator at a1 = NULL (global path) checks for a reclaimable cache at qword_29FDC00. If present, it locks the global mutex, calls sub_427B30 to free the cached block, zeroes the cache pointer, then retries the allocation. This provides a one-shot emergency reserve for the global allocator only.
Per-Phase Memory Reporting
When --stat=phase-wise is enabled (option 17928), the phase manager takes memory snapshots before and after each phase, then reports deltas.
Memory Snapshot
sub_8DADE0 captures a 48-byte snapshot from the pool state:
// sub_8DADE0 -- take_snapshot(snapshot, pool_state)
void take_snapshot(Snapshot* snap, PoolState* ps) {
snap->pool_state = ps; // +0
  snap->total_alloc   = ((uint64_t*)ps)[80]; // +8  (qword offset 80 = byte +640)
  snap->freeable      = ((uint64_t*)ps)[78]; // +16 (qword offset 78 = byte +624)
  snap->freeable_leak = ((uint64_t*)ps)[79]; // +24 (qword offset 79 = byte +632)
  snap->metric4       = ((uint64_t*)ps)[76]; // +32 (qword offset 76 = byte +608)
snap->current_usage = ps->vtable->get_usage(ps); // +40
}
Memory Delta Queries
Three helper functions compute deltas between the current pool state and a saved snapshot:
| Function | Computation | Metric |
|---|---|---|
| sub_8DAE20 | pool[632] - snap[3] | Total memory delta |
| sub_8DAE30 | pool[624] - snap[2] | Freeable memory delta |
| sub_8DAE40 | snap[1] + pool[624] - snap[2] - pool[640] | Freeable leaked delta |
Pool Consumption Query
sub_8DAE60 returns the current pool consumption as a single integer:
// sub_8DAE60 -- pool_consumption(pool_state)
int64_t pool_consumption(PoolState* ps) {
return *(ps->vtable->field_at_32) - ps[5];
// i.e., total allocated from parent minus some baseline
}
Reporter Output
The pool reporter (sub_C62200) prints to stderr:
[Pool Consumption = 45.678 MB]
Size formatting follows the same thresholds used throughout ptxas:
- 0--1023 bytes: raw integer with B suffix
- 1,024--10,485,760 bytes: %.3lf KB
- Above 10 MB: %.3lf MB
The per-phase reporter (sub_C64310) prints one line per phase:
<phase_name> :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
The leak percentage is computed only when both freeable and freeable-leaked are positive.
Memory Space Statistics Dump
ptxas contains a detailed memory-space statistics subsystem for debugging the pool allocator. The output is gated by a byte flag at context+404 (initialized to 0 in sub_434320; not exposed as a user-facing knob). When the flag is non-zero, the compilation driver calls into the statistics printers at two points: after each per-kernel compilation (sub_436DF0, sub_4428E0) and on error-path exit from the main driver (sub_446240).
Generic Pool Statistics -- sub_425020
The entry point is sub_425AB0, which acquires the pool mutex, builds a stack-local stats-context struct, and calls sub_425020. The stats context is 28 bytes:
StatsContext (28 bytes, on stack)
+0 ptr output_stream FILE* for sub_42BB30 (formatted output)
+8 u8 verbosity_flag enables/disables output
+12 u32 detail_level 0 = compact, 1 = standard, 2 = per-page
+16 u8 recurse_flag walk child pools if set
+20 u32 indent_level current tab depth
+24 u32 indent_step tabs added per recursion level
sub_425020 first calls sub_423A10 to print the banner, then walks two structures to compute totals:
- Large-block slab chain (the pool+48 linked list): for each slab descriptor, accumulates total_size and available_size, and counts free blocks within each slab.
- Small-block bin scan (the pool+2112 hash map, via sub_426D60): iterates all 625 size classes (0..4992 in steps of 8), summing per-bucket total_size and available_size.
The three output metrics (total allocated, total available, and in_use = total_available - total_allocated) are all formatted as hex strings via sprintf("0x%llx", ...).
Detail level 1 (standard) output:
Memory space statistics for 'Top level ptxas memory pool'
==========================================================
Page size : 0x10000 bytes
Total allocated : 0x1a2b3c4 bytes
Total available : 0x1ffffff bytes
Total in use : 0x05d4c3b bytes
Nrof small block pages : 42
Nrof large block pages : 7
Longest free list size : 3
Average free list size : 0
Detail level 2 adds per-page breakdowns:
@@ large block page 0 : 0x1234/0x10000, #=2 max=0x5000
@@ small block size 24: 0x600/0x1800 (64/128 blocks) 3 pages
Detail level 0 (compact) prints a single line:
available= 0x1ffffff, allocated= 0x1a2b3c4, used= 0x05d4c3b
When recurse_flag is set, sub_425020 calls sub_42D4C0(child_chain, sub_425020, stats_context) to recursively walk and print statistics for all child pools, incrementing the indentation at each level.
OCG Memory Space Statistics -- sub_6936B0
The OCG (Optimizing Code Generator) uses a separate fixed-page allocator tracked in a 1048-byte hash-table object with 128 buckets. sub_6936B0 prints its statistics to stderr via sub_427540:
Memory space statistics for 'OCG mem space'
===========================================
Page size : 0x100000 bytes
Total allocated : 0x340000 bytes
Total available : 0x400000 bytes
The page size is hardcoded at 0x100000 (1 MB). The counters are read from the OCG state object at offsets +1032 (total_allocated) and +1040 (total_available) of the hash-table structure at OCG-context+24.
After printing, sub_693630 tears down the OCG allocator: it walks all 128 hash buckets freeing every linked-list entry, frees the overflow list at +1024, then frees the hash table object and the parent allocation via sub_4248B0.
Trigger
Both statistics paths are gated by the same flag: *(uint8_t*)(context + 404). This flag defaults to 0 and is not registered as a CLI knob. It is an internal debug mechanism, likely set only by NVIDIA-internal debug builds or environment variables not present in the release binary.
Pool Reset and Reuse
The pool system does not expose an explicit "reset" operation that returns all allocations without freeing slabs. Instead, pool lifetime is managed through the hierarchical ownership model:
- Per-parse pool ("PTX parsing state"): created before parsing, destroyed after parsing is complete. All lexer/parser temporaries are freed in bulk when the pool is torn down.
- Per-kernel pool ("Permanent OCG memory pool"): created before the 159-phase pipeline runs on a kernel, destroyed afterward. All IR nodes, analysis results, and phase-local data die with this pool.
- ELF output pool ("elfw memory space"): scoped to the ELF emission phase.
The teardown helper sub_4234D0 walks the pool's slab chain and returns each slab's memory to the parent pool via sub_4248B0 (free), then frees the slab descriptors themselves. Because slabs are allocated from the parent pool, this cascades upward -- destroying a child pool returns memory to the parent without touching the system heap.
Allocation Pattern: The 50KB Buffer
A pervasive allocation pattern across ptxas is the "alloc-format-shrink" idiom, observed in all PTX text formatters:
// ~100+ call sites follow this exact pattern
Pool* pool = get_arena_pool(ctx, table); // sub_4280C0 -> offset 24
char* buf = pool_alloc(pool, 50000); // 50KB temp buffer
if (!buf) alloc_fail_abort(pool, 50000);
int len = snprintf(buf, 50000, format, ...);
char* result = pool_alloc(pool2, len + 1); // exact-size copy
memcpy(result, buf, len + 1);
pool_free(buf); // return 50KB to pool
return result;
The 50,000-byte temporary buffer is a "one size fits all" strategy. Because it exceeds the 4,999-byte small-path threshold, every format operation takes the large-block path. However, because the buffer is freed immediately after use, it is typically coalesced back and reused by the next formatter call, making this effectively a per-thread scratch buffer recycled through the pool.
Global State
The allocator uses several global variables for cross-pool coordination:
| Address | Type | Purpose |
|---|---|---|
| dword_29FDBF4 | u32 | Outstanding slab growth count (decremented after slab creation) |
| dword_29FDBF8 | u32 | Emergency cache flag (zeroed when cache is reclaimed) |
| qword_29FDC00 | ptr | Emergency reclaimable cache block pointer |
| qword_29FDC08 | mutex* | Global mutex protecting the above three fields |
| dword_29FDBE8 | u32 | Global slab sequence number (atomic increment) |
| qword_29FDBE0 | ptr | Global slab tracking map (for cross-pool slab lookup) |
| qword_29FDBD8 | mutex* | Mutex protecting qword_29FDBE0 |
| byte_29FA4C0 | u8 | Flag enabling per-pool slab tracking maps |
The slab tracking map (qword_29FDBE0) is a hash map keyed by address >> 3 that maps any allocated pointer to its owning slab descriptor. When the per-pool tracking flag (byte_29FA4C0) is enabled, the deallocator (sub_4248B0) consults the owning pool's own tracking map (pool +2112); when per-pool tracking is disabled, it falls back to this global map.
Key Functions Reference
| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_424070 | 2,098 | 3,809 | pool_alloc(pool, size) -- main allocator |
| sub_4248B0 | 923 | 1,215 | pool_free(ptr) -- main deallocator |
| sub_424C50 | 488 | 27 | pool_realloc(ptr, new_size) -- alloc+copy+free |
| sub_42BDB0 | 14 | 3,825 | alloc_fail_abort() -- fatal OOM via longjmp |
| sub_4280C0 | 597 | 3,928 | get_tls_context() -- per-thread state accessor |
| sub_427A10 | -- | -- | global_alloc(size) -- malloc wrapper for NULL pool |
| sub_427B30 | -- | -- | global_free(ptr) -- free wrapper for non-pool memory |
| sub_423A10 | 323 | 1 | pool_stats_header() -- prints "Memory space statistics for ..." banner |
| sub_425020 | ~1,500 | 1 | pool_stats_detail() -- full metrics dump, recursive child walk |
| sub_425AB0 | 80 | 2 | pool_stats_entry() -- mutex-wrapped entry point |
| sub_6936B0 | 120 | 2 | ocg_memspace_stats() -- OCG allocator stats to stderr |
| sub_693630 | 166 | 2 | ocg_memspace_teardown() -- free OCG hash-table allocator |
| sub_4234D0 | 258 | 1 | pool_teardown() -- recursive slab deallocation |
| sub_423600 | 922 | 3 | pool_accounting_init() -- accounting/hash-set setup |
| sub_423E50 | 544 | 2 | register_slab() -- slab tracking insertion |
| sub_423B60 | -- | -- | can_grow() -- checks whether slab expansion is permitted |
| sub_423C70 | 480 | 2 | pool_grow() -- slab expansion handler |
| sub_42BE50 | 64 | -- | floor_log2(size) -- clear-to-highest-bit + BSF |
| sub_42B990 | -- | -- | slab_lookup(map, addr>>3) -- find slab for address |
| sub_4258D0 | -- | -- | create_named_pool(name, flags, init_size) |
| sub_8DADE0 | 48 | -- | take_snapshot(snap, pool_state) |
| sub_8DAE20 | 16 | -- | delta_total(snap) |
| sub_8DAE30 | 16 | -- | delta_freeable(snap) |
| sub_8DAE40 | 32 | -- | delta_freeable_leaked(snap) |
| sub_8DAE60 | 32 | -- | pool_consumption(pool_state) |
| sub_C62200 | 888 | 1 | Pool consumption reporter (stderr) |
| sub_C64310 | 3,168 | -- | Per-phase timing/memory reporter |
Hash Tables & Bitvectors
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas contains two independent hash map implementations and a dedicated bitvector library. These three container abstractions underpin nearly every subsystem in the compiler -- from the PTX parser's symbol tables to the optimizer's liveness analysis to the code generator's instruction deduplication cache. This page documents their binary-level layouts, hash algorithms, SIMD acceleration, and usage patterns.
Overview
| Container | Object Size | Hash Algorithm | Address Range | Callers |
|---|---|---|---|---|
| General hash map | 112 bytes | MurmurHash3 (strings), pointer-shift, or identity | 0x425B20--0x42D850 | 2800+ (sub_426150) |
| CFG hash map | 40 bytes (header) | FNV-1a (32-bit) | 0xBDED20--0xBDFB10 | ~80 |
| Bitvector | 20 bytes (header) | N/A | 0xBDBA60--0xBDE150 | 500+ |
The general hash map is a self-contained 112-byte object used for string-keyed lookups (intrinsic tables, PTX directive dispatch, ELF section names), pointer-keyed caches (instruction deduplication), and integer-keyed registries (opcode tables). The CFG hash map is a separate, purpose-built implementation for graph edge storage with embedded sub-hash tables for multi-edge blocks. The bitvector library provides 17+ SSE2-accelerated operations optimized for the iterative dataflow workloads that dominate ptxas compile time.
General Hash Map (112 bytes)
Construction
The constructor sub_425CA0 takes three arguments: a hash function pointer, a compare function pointer, and an initial capacity hint. It delegates allocation to sub_425B20, which allocates the 112-byte map object, a bucket pointer array, an entries array, and a used-bits bitmap.
HashMap* HashMap_create(HashFn hash, CmpFn compare, uint32_t capacity_hint);
// sub_425CA0(hash_fn, cmp_fn, initial_capacity)
After allocation, the constructor detects two well-known function-pointer pairs and sets fast-path mode flags in the flags field at offset +84 (bits 4-7):
| Mode | Bits 4-7 | Hash Function | Compare Function | Key Semantics |
|---|---|---|---|---|
| 0 (custom) | 0x00 | User-supplied | User-supplied | String or arbitrary (MurmurHash3 typical) |
| 1 (pointer) | 0x10 | sub_4277F0 | sub_427810 | Raw pointer identity |
| 2 (integer) | 0x20 | sub_427750 | sub_427760 | 32/64-bit integer identity |
Mode detection is a construction-time optimization: the insert and lookup fast paths test the mode bits and branch directly to specialized hash/compare code, avoiding the overhead of indirect function calls through the hash/compare pointers.
Object Layout (112 bytes)
GeneralHashMap (112 bytes, allocated by sub_425B20)
+0 ptr hash_fn // Hash function pointer (or NULL for mode 1/2)
+8 ptr compare_fn // Compare function pointer
+16 ptr (reserved) // Additional function pointer slot
+24 ptr (reserved) // Additional function pointer slot
+32 u32 has_custom_cmp // Non-zero if custom compare supplied
+40 u32 bucket_mask // (bucket_count - 1); power-of-2 mask
+48 u32 entry_count // Number of occupied entries
+52 u32 (padding)
+56 u32 (reserved)
+60 u32 (reserved)
+64 u32 load_threshold // entry_count threshold triggering resize
+68 u32 (reserved)
+72 u32 first_free // Free-slot tracking index
+76 u32 entry_capacity // Allocated capacity of entries array
+80 u32 bitmap_capacity // Capacity of used-bits bitmap (in uint32_t words)
+84 u16 flags // Bits 4-7: hash mode (0=custom, 1=pointer, 2=integer)
// Bits 0-1: bitmap/entry state flags
+86 u16 (padding)
+88 ptr entries // Array of 16-byte entries [key (8B), value (8B)]
+96 ptr used_bitmap // Bitmap tracking which entry slots are occupied
+104 ptr bucket_array // Array of pointers to index lists per bucket
Entry Format (16 bytes)
Each entry in the entries array at +88 is 16 bytes:
Entry (16 bytes, at entries + 16 * index)
+0 ptr/u64 key // Key (string pointer, raw pointer, or integer)
+8 ptr/u64 value // Associated value
Bucket chains are stored as arrays of uint32_t indices terminated by the sentinel value 0xFFFFFFFF (-1). Each bucket pointer at bucket_array[hash & bucket_mask] points to a sentinel-terminated index list. Collision resolution is therefore separate chaining through index arrays into the entries table, not linked-list chaining through the entries themselves.
Hash Functions
Mode 0 (Custom / String) -- MurmurHash3:
String-keyed maps use sub_427630 (273 bytes, 73 callers), which implements MurmurHash3_x86_32 with the standard mixing constants:
#define ROTL32(x, r) (((x) << (r)) | ((x) >> (32 - (r))))

uint32_t MurmurHash3_x86_32(const char* key, int len) {
    // sub_427630 (seed value supplied per call site)
    uint32_t h = seed;
    const uint32_t c1 = 0xCC9E2D51; // -862048943
    const uint32_t c2 = 0x1B873593; // 461845907
    // Body: process 4 bytes at a time
    for (int i = 0; i < len / 4; i++) {
        uint32_t k = ((const uint32_t*)key)[i];
        k *= c1;
        k = ROTL32(k, 15);
        k *= c2;
        h ^= k;
        h = ROTL32(h, 13);
        h = h * 5 - 430675100; // 0xE6546B64
    }
    // Tail: remaining 1-3 bytes (standard MurmurHash3 tail)
    const uint8_t* tail = (const uint8_t*)key + (len & ~3);
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= tail[2] << 16; // fall through
    case 2: k ^= tail[1] << 8;  // fall through
    case 1: k ^= tail[0];
            k *= c1; k = ROTL32(k, 15); k *= c2;
            h ^= k;
    }
    // Finalization
    h ^= len;
    h ^= h >> 16;
    h *= 0x85EBCA6B; // -2048144789
    h ^= h >> 13;
    h *= 0xC2B2AE35; // -1028477387
    h ^= h >> 16;
    return h;
}
Mode 1 (Pointer):
Pointer-keyed maps use a shift-XOR hash that destroys the alignment pattern common in heap pointers:
uint32_t pointer_hash(void* key) {
uintptr_t p = (uintptr_t)key;
return (p >> 11) ^ (p >> 8) ^ (p >> 5);
}
The right-shifts at 5, 8, and 11 bits fold the significant bits of a 64-bit heap address into a 32-bit range. The compare function (sub_427810) performs raw pointer equality.
Mode 2 (Integer):
Integer-keyed maps use the key value directly as the hash (identity hash). The compare function (sub_427760) performs integer equality. Bucket selection is key & bucket_mask.
Insert Operation (sub_426150)
sub_426150 (2534 bytes, 2800 callers) is the most heavily used hash map function. Its signature:
void* HashMap_put(HashMap* map, void* key, void* value);
// Returns: previous value if key existed, NULL if new insertion
Algorithm:
- Hash computation. Branch on the mode bits at +84 to select the hash function.
- Bucket lookup. Compute bucket = hash & *(map+40). Load the index list from bucket_array[bucket].
- Key scan. Walk the index list. For each index i where i != 0xFFFFFFFF, compare entries[i].key with the target key using the mode-specific compare.
- Hit: swap entries[i].value with the new value. Return the old value.
- Miss: find a free slot via the used-bits bitmap at +96. Set entries[free].key = key and entries[free].value = value. Append the free index to the bucket's index list. Increment entry_count at +48.
- Resize check. If entry_count > load_threshold, double the capacity: allocate a new entries array (2 * old_capacity) and a new bucket array, then rehash all entries.
Lookup Operation (sub_426D60)
sub_426D60 (345 bytes, 422 callers) is the read-only lookup:
void* HashMap_get(HashMap* map, void* key);
// Returns: value if found, NULL (0) if not found
Same hash and scan logic as insert, but no modification path. The three mode fast paths avoid indirect calls on the hot path.
Contains Operation (sub_426EC0)
sub_426EC0 (349 bytes, 29 callers) returns 1 if the key exists, 0 otherwise. Nearly identical to lookup but returns a boolean rather than the value.
Resize Policy
The resize doubles capacity when entry_count exceeds load_threshold. The threshold is set to 4 * bucket_count during initialization (from sub_425B20: v6[8] = 4 << log2(capacity)), meaning the default load factor limit is approximately 4.0 entries per bucket. Initial bucket count is rounded up to the next power of two from the capacity hint (minimum 1). The constructor at sub_425B20 calls sub_42BDF0 to compute ceil(log2(capacity)).
Destructor (sub_425D20)
sub_425D20 (121 bytes, 63 callers) frees the entries array, used-bits bitmap, bucket array, and the 112-byte map object itself. All four allocations are returned to the pool allocator.
Iteration
Hash map iteration (used by sub_425DB0, 9 callers) walks the entries array linearly, testing the used-bits bitmap to identify occupied slots. There is no guaranteed iteration order -- the order depends on insertion history and resize operations.
Usage Examples
| Subsystem | Hash Mode | Key Type | Purpose |
|---|---|---|---|
| Intrinsic table (sub_5AB660) | Custom (MurmurHash3) | String | 608 intrinsic name lookups (capacity 0x80) |
| PTX opcode dispatch | Custom (MurmurHash3) | String | Named PTX opcodes at ctx+808 |
| SM version backends | Custom (MurmurHash3) | String | sm_XX, compute_XX, lto_XX registries |
| Instruction dedup (sub_737760) | Pointer | Pointer | Avoid re-encoding identical instructions |
| Opcode hash tables | Integer | Integer | Fast opcode-to-handler dispatch |
| File offset cache | Custom | String | Cache file offsets for #line directives |
| Symbol table (sub_621480) | Custom (MurmurHash3) | String | Named symbol lookups at ctx+30016 |
| Per-function disable list | Custom (FNV-1a) | String | Function-specific pass disable at ctx+120->+1128 |
CFG Hash Map (Graph Edges)
The CFG edge hash map is a completely separate implementation from the general hash map, located in the address range 0xBDED20--0xBDFB10. It is designed specifically for graph edge storage -- mapping block indices to successor/predecessor sets -- and has a different object layout, hash function, and collision resolution strategy.
Object Layout
CFGHashMap (40 bytes header)
+0 ptr first_free_node // Free list for node recycling
+8 ptr node_arena // Pool allocator for new nodes
+16 ptr bucket_array // Array of 24-byte bucket headers
+24 u64 num_buckets // Power of two, initial = 8
+32 i32 total_elements // Total entries across all buckets
+36 i32 num_unique_keys // Distinct keys inserted
Two distinct hash map configurations exist for different node sizes:
Full Node (64 bytes) -- Successor Edge Map
Used by sub_BDED20 (12KB) for the successor edge hash map at Code Object +648. Each node represents a block's successor edge set with an optional sub-hash table for multi-successor blocks:
FullNode (64 bytes)
+0 ptr next // Chain link within bucket
+8 i32 key // Block index (bix)
+12 i32 value_info // Edge count or flags
+16 ptr value_array // Pointer to sub-array of successor indices
+24 i32 value_count // Number of successors in sub-array
+28 i32 (padding)
+32 ptr sub_hash_data // Embedded sub-hash for multi-edge blocks
+40 u64 sub_hash_size // Sub-hash capacity
+48 u64 (reserved)
+56 u32 cached_hash // Cached FNV-1a hash of key
+60 u32 (padding)
Simple Node (16 bytes) -- Backedge Set
Used by sub_BDF480 (10KB) for the backedge hash map at Code Object +680. Each node is a minimal key-hash pair for set membership testing:
SimpleNode (16 bytes)
+0 ptr next // Chain link within bucket
+8 i32 key // Block index
+12 u32 cached_hash // Cached hash
Bucket Header (24 bytes)
Both full and simple node maps use the same bucket header:
Bucket (24 bytes)
+0 ptr head // First node in collision chain
+8 ptr tail // Last node in collision chain
+16 i32 count // Number of nodes in this bucket
+20 i32 (padding)
FNV-1a Hash Function
All CFG hash lookups use FNV-1a (32-bit) with standard parameters, confirmed across 50+ call sites:
uint32_t fnv1a_hash_u32(uint32_t key) {
uint32_t hash = 0x811C9DC5; // FNV offset basis
hash = 16777619 * (hash ^ (key & 0xFF)); // byte 0
hash = 16777619 * (hash ^ ((key >> 8) & 0xFF)); // byte 1
hash = 16777619 * (hash ^ ((key >> 16) & 0xFF)); // byte 2
hash = 16777619 * (hash ^ ((key >> 24) & 0xFF)); // byte 3
return hash;
}
Bucket selection: bucket = hash & (num_buckets - 1).
This appears inline in the decompiled code as:
v9 = 16777619 * (HIBYTE(*a3)
^ (16777619 * ((unsigned __int8)BYTE2(*a3)
^ (16777619 * (BYTE1(v8)
^ (16777619 * ((unsigned __int8)v8 ^ 0x811C9DC5)))))));
v11 = v9 & (a2[3] - 1); // bucket index
Resize Policy
The CFG hash map resizes when total_elements > num_unique_keys (load factor > 1.0). New capacity is 4 * old_bucket_count. Both sub_BDED20 and sub_BDF480 allocate a new 192-byte bucket array (8 buckets * 24 bytes/bucket = 192) during initial creation, then redistribute nodes during resize.
The resize algorithm:
- Allocate new bucket array (192 bytes for 8 buckets).
- Walk all old buckets, removing each node from old chains.
- Re-hash each node: new_bucket = cached_hash & (new_num_buckets - 1).
- Insert at head or tail of new bucket chain.
- Free old bucket array via pool allocator (vtable dispatch at allocator +32 for free).
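The rehash step above is cheap because each node carries its cached hash; no key is re-hashed. A minimal sketch (Node and cfg_resize are hypothetical names, with the bucket header simplified to a bare head pointer and the pool allocator replaced by malloc/free):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Node { struct Node *next; uint32_t key, cached_hash; } Node;

/* Redistribute all nodes into a larger power-of-two bucket array. */
static Node **cfg_resize(Node **old_buckets, uint64_t old_n, uint64_t new_n) {
    Node **nb = calloc(new_n, sizeof(Node *));
    for (uint64_t b = 0; b < old_n; b++) {
        Node *n = old_buckets[b];
        while (n) {                                   /* unlink from old chain */
            Node *next = n->next;
            uint64_t nbix = n->cached_hash & (new_n - 1);  /* cached hash reused */
            n->next = nb[nbix];                       /* head insertion */
            nb[nbix] = n;
            n = next;
        }
    }
    free(old_buckets);                                /* old bucket array released */
    return nb;
}
```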
Edge Map Operations
Insert/Find (sub_BDED20, sub_BDF480):
- Compute FNV-1a hash of the block index key.
- Index into bucket array: bucket = hash & (num_buckets - 1).
- Walk the collision chain comparing keys.
- If found: return existing node.
- If not found: allocate node from arena, initialize, insert into bucket, increment counters.
Erase (sub_BDE6C0, 3KB):
Removes an entry from the successor edge map and recursively erases successor edges for blocks that become unreachable (single-predecessor blocks whose only predecessor was removed). This recursive cleanup maintains CFG consistency during block deletion.
Print (sub_BDE8B0, 2KB):
Iterates a block's successors and prints "\tbix%d -> bix%d\n" for each edge. Used by the CFG debug dump infrastructure.
Multi-Edge Sub-Hash
For blocks with many successors (switch statements, computed branches), the full node at +32 contains a pointer to an embedded sub-hash table. This sub-hash maps successor block indices within a single node's edge set, enabling O(1) lookup of whether a specific successor edge exists. The sub-hash uses the same 24-byte bucket structure as the outer hash map.
Storage Locations
| Code Object Offset | Field | Map Type | Node Size |
|---|---|---|---|
| +648 | succ_map | Successor edges | 64 bytes (full) |
| +680 | backedge_map | Backedge set | 16 bytes (simple) |
Bitvector Library
The bitvector library at 0xBDBA60--0xBDE150 is the most performance-critical infrastructure in ptxas. It supports 17+ operations, all SSE2-accelerated with manual alignment handling. The library is the backbone of liveness analysis (6 dedicated phases), dominance computation, register interference detection, and dead code elimination.
Object Layout (20 bytes)
struct BitVector { // 20 bytes
uint32_t* data; // +0: pointer to word array (heap-allocated)
int32_t word_count; // +8: number of 32-bit words in use
int32_t capacity; // +12: allocated words (>= word_count)
int32_t bit_count; // +16: number of valid bits
};
Word count derivation: word_count = (bit_count + 31) >> 5.
Memory is allocated through the pool allocator (vtable dispatch at allocator +24 for alloc, +32 for free). The allocate function (sub_BDBA60) implements grow-only semantics: if the new word count exceeds the current capacity, the old buffer is freed and a new one allocated. The buffer is never shrunk.
Bit Addressing
Individual bits are addressed by word index and bit position within the word:
// Set bit i: data[i >> 5] |= (1 << (i & 0x1F))
// Clear bit i: data[i >> 5] &= ~(1 << (i & 0x1F))
// Test bit i: (data[i >> 5] >> (i & 0x1F)) & 1
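A runnable form of these identities (bv_set, bv_clear, bv_test are hypothetical helper names; the binary inlines the operations exactly as shown above):

```c
#include <assert.h>
#include <stdint.h>

static void bv_set(uint32_t *d, int i)   { d[i >> 5] |= (1u << (i & 0x1F)); }
static void bv_clear(uint32_t *d, int i) { d[i >> 5] &= ~(1u << (i & 0x1F)); }
static int  bv_test(const uint32_t *d, int i) { return (d[i >> 5] >> (i & 0x1F)) & 1; }
```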
SSE2 Acceleration Pattern
All bulk operations follow a three-phase structure:
- Alignment prologue. Process scalar words until the destination pointer is 16-byte aligned. The alignment count is (-(uintptr_t)dst_ptr >> 2) & 3, yielding 0-3 scalar iterations.
- SSE2 main loop. Process 4 words (128 bits) per iteration using _mm_load_si128 (aligned destination), _mm_loadu_si128 (potentially unaligned source), and the appropriate SSE2 intrinsic (_mm_or_si128, _mm_and_si128, _mm_andnot_si128, _mm_xor_si128). The loop count is remaining_words >> 2.
- Scalar epilogue. Process the remaining 0-6 words individually. The decompiler shows an unrolled epilogue handling up to 6 trailing words with sequential if-chains rather than a loop.
Example from sub_BDCF40 (orIfChanged):
// Prologue: align dst
int align = (int)((-(uintptr_t)dst >> 2) & 3);
for (int i = 0; i < align && i < word_count; i++)
    dst[i] |= src[i];
// SSE2 loop: 4 words per iteration
int sse_iters = (word_count - align) >> 2;
for (int i = 0; i < sse_iters; i++) {
    __m128i d = _mm_load_si128((__m128i *)&dst[align + 4*i]);
    __m128i s = _mm_loadu_si128((const __m128i *)&src[align + 4*i]);
    _mm_store_si128((__m128i *)&dst[align + 4*i], _mm_or_si128(d, s));
}
// Epilogue: 0-6 remaining scalar words
Operation Catalog
Allocation and Lifecycle
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBA60 | allocate | (bv*, alloc*, num_bits) | Grow-only allocation. Sets bit_count, recomputes word_count, reallocates if capacity insufficient. |
| sub_BDDC00 | clear | (bv*, start_bit) | Zeroes all words from start_bit >> 5 to end. When start_bit = 0, equivalent to memset(data, 0, ...). |
| sub_BDCA60 | operator= | (dst*, src*) | Copy assignment. Reallocates dst if capacity insufficient, then memcpy. |
Bit Manipulation
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBFB0 | setBit | (bv*, bit_index) | data[i>>5] \|= (1 << (i&31)) |
| sub_BDC0E0 | clearBit | (bv*, bit_index) | data[i>>5] &= ~(1 << (i&31)) |
| sub_BDC200 | testBit | (bv*, bit_index) -> bool | (data[i>>5] >> (i&31)) & 1 |
Bulk Boolean Operations (SSE2)
| Address | Operation | Signature | SSE2 Intrinsic | Description |
|---|---|---|---|---|
| sub_BDCDE0 | operator\|= | (dst*, src*) | _mm_or_si128 | dst \|= src |
| sub_BDC5F0 | operator&= | (dst*, src*) | _mm_and_si128 | dst &= src; zeroes dst tail beyond src length |
| sub_BDDAA0 | operator^= | (dst*, src*) | _mm_xor_si128 | dst ^= src |
Fixed-Point Detection (IfChanged variants)
These return 1 if any bit changed, 0 if the result is identical to the previous state. They are the core of iterative dataflow convergence detection.
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDCF40 | orIfChanged | (dst*, src*) -> bool | Scans for (~dst & src) != 0 first, then applies dst \|= src from the first differing word. Returns whether any bit was newly set. |
| sub_BDC790 | andIfChanged | (dst*, src*) -> bool | Scans for (~src & dst) != 0 first, then applies dst &= src. Zeroes trailing dst words beyond src. Returns whether any bit was cleared. |
The IfChanged early-exit optimization: the function first scans word-by-word for the first position where a change would occur (~dst & src for OR, ~src & dst for AND). If no such position exists, it returns 0 immediately without touching memory. If found, it applies the operation only from that position forward, reducing cache pollution when most blocks have already converged.
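A scalar sketch of the scan-then-apply pattern for the OR variant (or_if_changed is a hypothetical name; the real sub_BDCF40 applies the same logic with the three-phase SSE2 loop):

```c
#include <assert.h>
#include <stdint.h>

static int or_if_changed(uint32_t *dst, const uint32_t *src, int words) {
    int first = -1;
    for (int i = 0; i < words; i++)            /* phase 1: find first change */
        if (~dst[i] & src[i]) { first = i; break; }
    if (first < 0) return 0;                   /* fixed point: no memory writes */
    for (int i = first; i < words; i++)        /* phase 2: apply from there on */
        dst[i] |= src[i];
    return 1;
}
```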
Three-Input Operations
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDC3F0 | assignAND | (dst*, a*, b*) | dst = a & b (SSE2 _mm_and_si128) |
| sub_BDD8C0 | assignANDNOT | (dst*, a*, b*) | dst = a & ~b (SSE2 _mm_andnot_si128) |
| sub_BDCC20 | assignOR | (dst*, a*, b*) | dst = a \| b (SSE2 _mm_or_si128) |
| sub_BDD140 | orWithAND | (dst*, a*, b*) | dst \|= a & b (SSE2 _mm_and_si128 + _mm_or_si128) |
Liveness-Specific Fused Operations
The most important bitvector operations for compiler performance are the fused transfer functions used by liveness analysis:
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDD300 | orWithAndNot | (dst*, gen*, in*, kill*) | dst \|= gen \| (in & ~kill) |
| sub_BDD560 | orWithAndNotIfChanged | (dst*, gen*, in*, kill*) -> bool | Same as above + returns change flag |
These implement the liveness transfer function LiveIn(B) = gen(B) | (LiveOut(B) - kill(B)) in a single SIMD pass, avoiding materialization of intermediate bitvectors.
orWithAndNot inner loop (sub_BDD300):
// SSE2: process 4 words per iteration
*(__m128i *)(dst + offset) = _mm_or_si128(
    _mm_or_si128(
        _mm_loadu_si128((const __m128i *)(gen + offset)),   // gen
        *(__m128i *)(dst + offset)),                        // existing dst
    _mm_andnot_si128(
        _mm_loadu_si128((const __m128i *)(kill + offset)),  // ~kill
        _mm_loadu_si128((const __m128i *)(in + offset))));  // in & ~kill
orWithAndNotIfChanged (sub_BDD560):
This function first scans for any word position where the new value would differ from the current dst:
// Phase 1: scan for first change
for (int i = 0; i < word_count; i++) {
uint32_t new_bits = gen[i] | (in[i] & ~kill[i]);
if (~dst[i] & new_bits)
goto apply_from_i; // Found change at position i
}
return 0; // No change -- already at fixed point
// Phase 2: apply from first change position
apply_from_i:
// SSE2 loop: dst[j] |= gen[j] | (in[j] & ~kill[j])
// ... (same three-phase SSE2 pattern) ...
return 1;
This two-phase design is critical for liveness analysis performance: in the late iterations of the fixed-point solver, most blocks have already converged and the scan returns 0 without writing memory.
Query Operations
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDDD40 | popcount | (bv*) -> int | Count set bits. Uses __popcountdi2 intrinsic per word. |
| sub_BDBFB0 | findLastSetWord | (bv*) -> int | Scans from high word to low, returns index of last non-zero word. Returns 0xFFFFFFFF if all zero. |
| sub_BDDC00 | nextSetBit | (bv*, start) -> int | Find next set bit at or after start. Uses tzcnt (trailing zero count) for intra-word scanning. Returns 0xFFFFFFFF if none found. |
| sub_BDDCB0 | prevSetBit | (bv*, start) -> int | Find previous set bit at or before start. Uses bsr (bit scan reverse). Returns 0xFFFFFFFF if none found. |
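A scalar sketch of nextSetBit under the semantics listed above (next_set_bit is a hypothetical name; the GCC/Clang builtin __builtin_ctz stands in for the tzcnt instruction):

```c
#include <assert.h>
#include <stdint.h>

static uint32_t next_set_bit(const uint32_t *d, int words, uint32_t start) {
    uint32_t wi = start >> 5;
    if ((int)wi >= words) return 0xFFFFFFFFu;
    uint32_t w = d[wi] & (~0u << (start & 0x1F));  /* drop bits below start */
    for (;;) {
        if (w) return (wi << 5) + (uint32_t)__builtin_ctz(w);  /* tzcnt */
        if ((int)++wi >= words) return 0xFFFFFFFFu;
        w = d[wi];
    }
}
```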
Extract Operation
| Address | Operation | Signature | Description |
|---|---|---|---|
| sub_BDBD60 | extractBits | (bv*, out[], start, end) | Extract bit range [start, end] into output array. Handles cross-word boundaries with shift-and-mask. Supports extraction spans > 32 bits by filling multiple output words. |
The extract function handles three cases:
- Same word: Both start and end are in the same 32-bit word. Single mask-and-shift.
- Small span (<=32 bits, two words): Combine partial words with shift and OR.
- Large span (>32 bits): Loop over full words with optional shift for non-aligned start.
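The small-span case can be sketched as a shift-and-OR over at most two words (extract_span is a hypothetical name; the recovered routine additionally handles the multi-word case by looping):

```c
#include <assert.h>
#include <stdint.h>

/* Extract `len` bits (1..32) starting at bit `start`, possibly straddling
 * a 32-bit word boundary. */
static uint32_t extract_span(const uint32_t *d, uint32_t start, uint32_t len) {
    uint32_t wi = start >> 5, sh = start & 0x1F;
    uint64_t lo = d[wi] >> sh;
    if (sh && sh + len > 32)                       /* straddles into next word */
        lo |= (uint64_t)d[wi + 1] << (32 - sh);
    return len == 32 ? (uint32_t)lo : (uint32_t)(lo & ((1u << len) - 1));
}
```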
Bitvector Subset Test
The isSubsetOf operation tests whether bitvector A is a subset of B, i.e. (A & ~B) == 0. This is used by dominance queries (checking if block X is dominated by block Y). The test at sub_1245740 referenced in the copy-prop-CSE pass performs a single-bit indexed test against the dominator bitvector: testBit(dom_set[block], dominator_id).
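The subset test reduces to a word-by-word check that no bit of A falls outside B (is_subset_of is a hypothetical name for the recovered operation):

```c
#include <assert.h>
#include <stdint.h>

/* A is a subset of B iff (A & ~B) == 0 in every word. */
static int is_subset_of(const uint32_t *a, const uint32_t *b, int words) {
    for (int i = 0; i < words; i++)
        if (a[i] & ~b[i]) return 0;   /* bit set in A but not in B */
    return 1;
}
```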
Usage Across Subsystems
Liveness Analysis
The iterative fixed-point solver runs in reverse post-order, computing LiveIn and LiveOut bitvectors per basic block:
LiveOut(B) |= LiveIn(S) -- for each successor S: orIfChanged
LiveIn(B) = gen(B) | (LiveOut(B) - kill(B)) -- orWithAndNotIfChanged
Six dedicated phases (10, 16, 19, 33, 61, 84) perform liveness + DCE. The orWithAndNotIfChanged fusion and early-exit optimization minimize cache traffic in later iterations when most sets have stabilized. Bitvectors for liveness are stored at Code Object +832 (R registers) and +856 (UR registers).
Dominance
The iterative dominator computation (sub_BE2330, 4KB) uses bitvector intersection:
dom[entry] = {entry}
dom[b] = {b} union (intersection of dom[p] for all predecessors p)
Each block's dominator set is a bitvector indexed by block ID. Convergence uses andIfChanged to detect when the intersection no longer removes any dominators.
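A minimal model of this iteration, assuming at most 32 blocks so each dominator set fits one word (compute_dom and the preds bitmask encoding are hypothetical; the real solver uses full bitvectors and andIfChanged):

```c
#include <assert.h>
#include <stdint.h>

#define NB 4   /* blocks in the test CFG */

/* preds[b] is a bitmask of b's predecessors; block 0 is the entry. */
static void compute_dom(uint32_t dom[NB], const uint32_t preds[NB]) {
    dom[0] = 1u << 0;                            /* dom[entry] = {entry} */
    for (int b = 1; b < NB; b++) dom[b] = 0xFFFFFFFFu;
    int changed = 1;
    while (changed) {                            /* iterate to fixed point */
        changed = 0;
        for (int b = 1; b < NB; b++) {
            uint32_t meet = 0xFFFFFFFFu;
            for (int p = 0; p < NB; p++)         /* intersect predecessor sets */
                if (preds[b] & (1u << p)) meet &= dom[p];
            uint32_t nv = (1u << b) | meet;      /* {b} union intersection */
            if (nv != dom[b]) { dom[b] = nv; changed = 1; }
        }
    }
}
```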
Register Allocation
The interference graph builder (sub_926A30, 155KB decompiled) uses bitvector intersection to detect live range overlaps. Two registers interfere if their liveness bitvectors have overlapping set bits at any program point. The allocator at sub_9721C0 explicitly rebuilds liveness before allocation.
GVN-CSE
The GVN-CSE pass (phase 49) uses the general hash map with FNV-1a hashing for the value numbering table. Each hash entry is 24 bytes: [next_ptr (8B), key (8B), value/metadata (8B)] with chained collision resolution. The hash incorporates opcode, type, and recursively resolved value numbers of all operands.
CFG Analysis
The CFG builder (sub_BE0690, 54KB) populates successor and backedge hash maps during phase 3 (AnalyzeControlFlow). RPO computation (sub_BDE150, 9KB) reads from the successor map. The DOT dumper (sub_BE21D0) iterates edge sets for visualization.
Key Function Table
General Hash Map
| Address | Size | Identity | Callers | Confidence |
|---|---|---|---|---|
| sub_425B20 | 0.5KB | HashMap::allocate -- allocates 112-byte map + arrays | 127 (via sub_425CA0) | HIGH |
| sub_425CA0 | 114B | HashMap::create -- constructor with hash/compare fn ptrs | 127 | HIGH |
| sub_425D20 | 121B | HashMap::destroy -- frees all internal arrays + map | 63 | MEDIUM |
| sub_425DB0 | 292B | HashMap::clear / iterator init | 9 | MEDIUM |
| sub_426150 | 2.5KB | HashMap::put -- insert or update key/value pair | 2800 | HIGH |
| sub_426D60 | 345B | HashMap::get -- lookup key, return value or 0 | 422 | HIGH |
| sub_426EC0 | 349B | HashMap::contains -- test key existence | 29 | HIGH |
| sub_427630 | 273B | MurmurHash3_x86_32 -- string hash | 73 | HIGH |
| sub_427750 | -- | Integer hash function (identity) | -- | HIGH |
| sub_4277F0 | -- | Pointer hash function (shift-XOR) | -- | HIGH |
| sub_42D850 | 2.5KB | HashMap::insertSet -- set-mode insert with auto-resize | 282 | HIGH |
CFG Hash Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDED20 | 12KB | CFGHashMap::insertOrFind -- full 64-byte node | HIGH |
| sub_BDF480 | 10KB | CFGHashMap::insertOrFind_simple -- 16-byte node | HIGH |
| sub_BDE6C0 | 3KB | CFGHashMap::erase -- recursive edge removal | MEDIUM |
| sub_BDE8B0 | 2KB | CFGHashMap::printEdges -- "bix%d -> bix%d" | HIGH |
| sub_BDEA50 | 4KB | CFGHashMap::dumpRPOAndBackedges -- debug dump | HIGH |
| sub_BDFB10 | 24KB | CFGHashMap::buildBlockMap -- block array init, RPO resize | MEDIUM |
| sub_BDDDF0 | ~2KB | CFGHashMap::destroyAll -- walk and free all nodes | MEDIUM |
Bitvector Library
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_BDBA60 | ~120B | BitVector::allocate | HIGH (0.90) |
| sub_BDBFB0 | ~120B | BitVector::findLastSetWord | HIGH (0.90) |
| sub_BDBD60 | ~370B | BitVector::extractBits | HIGH (0.88) |
| sub_BDBFB0 | ~120B | BitVector::setBit | HIGH (0.90) |
| sub_BDC0E0 | ~120B | BitVector::clearBit | HIGH (0.90) |
| sub_BDC200 | ~140B | BitVector::testBit | HIGH (0.90) |
| sub_BDC3F0 | ~520B | BitVector::assignAND | HIGH (0.90) |
| sub_BDC5F0 | ~480B | BitVector::operator&= | HIGH (0.95) |
| sub_BDC790 | ~800B | BitVector::andIfChanged | HIGH (0.95) |
| sub_BDCA60 | ~280B | BitVector::operator= (copy) | MEDIUM (0.85) |
| sub_BDCC20 | ~320B | BitVector::assignOR | HIGH (0.90) |
| sub_BDCDE0 | ~400B | BitVector::operator\|= | HIGH (0.95) |
| sub_BDCF40 | ~560B | BitVector::orIfChanged | HIGH (0.95) |
| sub_BDD140 | ~480B | BitVector::orWithAND | HIGH (0.90) |
| sub_BDD300 | ~490B | BitVector::orWithAndNot | HIGH (0.92) |
| sub_BDD560 | ~650B | BitVector::orWithAndNotIfChanged | HIGH (0.92) |
| sub_BDD8C0 | ~320B | BitVector::assignANDNOT | HIGH (0.88) |
| sub_BDDAA0 | ~400B | BitVector::operator^= | HIGH (0.95) |
| sub_BDDC00 | ~200B | BitVector::nextSetBit / clear | HIGH (0.90) |
| sub_BDDCB0 | ~180B | BitVector::prevSetBit | MEDIUM (0.80) |
| sub_BDDD40 | ~160B | BitVector::popcount | MEDIUM (0.80) |
Design Notes
Why two hash map implementations? The general hash map is optimized for string and pointer lookups with three mode-specific fast paths that avoid indirect calls. The CFG hash map is optimized for integer-keyed graph edge storage with chained nodes and embedded sub-hashes for multi-edge blocks. The general map uses open addressing with index arrays; the CFG map uses linked-list chaining with explicit node allocation. These are different design tradeoffs for different access patterns.
Why 32-bit bitvector words (not 64-bit)? The SSE2 instructions operate on 128-bit vectors regardless of word size, so 32-bit and 64-bit words yield the same SIMD throughput. The 32-bit choice minimizes wasted bits in the trailing word when bit_count is not a multiple of 64, and simplifies the alignment prologue calculation ((-(uintptr_t)ptr >> 2) & 3 vs. (-(uintptr_t)ptr >> 3) & 1). It also matches the tzcnt/bsr operand size used in the bit-scan operations, avoiding the need for 64-bit variants.
Why orWithAndNotIfChanged as a single function? This fusion eliminates three intermediate bitvectors that would be needed if the liveness transfer function were decomposed into separate AND-NOT, OR, and change-detection operations. On a function with 200 registers and 50 basic blocks, this saves ~60KB of intermediate allocations per iteration. More importantly, the two-phase scan-then-apply design avoids cache writes for converged blocks, which is the common case in later iterations.
Cross-References
- Data Structure Layouts -- Compilation context, Code Object field map, pool allocator
- Basic Blocks & CFG -- CFG edge hash maps, RPO computation, backedge detection
- Liveness Analysis -- Bitvector usage in iterative dataflow, SSE2 loop details
- Copy Propagation & CSE -- GVN value table hash map, FNV-1a in expression hashing
- Memory Pool Allocator -- Pool allocator used by both hash maps and bitvectors
Thread Pool & Concurrency
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
ptxas compiles multiple entry functions (kernels) in a single PTX input file. When --split-compile or --allow-expensive-optimizations is active, the compilation driver dispatches each kernel to a worker thread for parallel DAGgen+OCG+ELF+DebugInfo processing. The threading infrastructure lives in two address ranges: the TLS system at 0x4280xx--0x4286xx (front-end) and the thread pool at 0x1CB17xx--0x1CB1Axx (shared infrastructure).
| TLS accessor | sub_4280C0 (597 bytes, 3,928 callers, 280-byte struct) |
| TLS key init | ctor_001 at 0x4094C0 (static constructor) |
| TLS destructor | destr_function at 0x427F10 |
| Thread pool ctor | sub_1CB18B0 (184-byte pool struct) |
| Task submit | sub_1CB1A50 (24-byte task node) |
| Worker thread | start_routine at 0x1CB1780 |
| Wait-all | sub_1CB1AE0 |
| Pool destroy | sub_1CB1970 |
| CPU count | sub_1CB1890 (sysconf(_SC_NPROCESSORS_CONF)) |
| Mutex init helper | sub_428620 (recursive mutex factory) |
| Global mutex lock | sub_4286A0 (lazy-init + lock) |
| Jobserver client | sub_1CC7300 (GNU Make integration) |
Architecture
┌─────────────────────────────────────────┐
│ Compilation Driver │
│ sub_446240 │
│ │
│ if (thread_count > 0): │
│ pool = sub_1CB18B0(thread_count) │
│ for each kernel: │
│ snapshot config → 360-byte buffer │
│ copy hash maps for isolation │
│ sub_1CB1A50(pool, sub_436DF0, buf) │
│ sub_1CB1AE0(pool) // wait-all │
│ sub_1CB1970(pool) // destroy │
└────────────┬────────────────────────────┘
│ enqueue tasks
▼
┌────────────────────────────────────┐
│ Thread Pool (184 bytes) │
│ │
│ mutex ──────── guards all fields │
│ cond_work ──── wake workers │
│ cond_done ──── signal completion │
│ work_queue ─── priority heap │
│ pending ────── pending task count │
│ shutdown ───── termination flag │
└───┬──────┬──────┬──────────────────┘
│ │ │
┌─────┘ │ └─────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker N │
│(detached)│ │(detached)│ │(detached)│
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ TLS ctx │ │ TLS ctx │ │ TLS ctx │
│ 280 bytes│ │ 280 bytes│ │ 280 bytes│
│ + pool │ │ + pool │ │ + pool │
└──────────┘ └──────────┘ └──────────┘
CLI Activation
Two options activate the thread pool:
| Option | Type | Behavior |
|---|---|---|
| --split-compile N | int | Set thread count directly. 0 = CPU count via sysconf(83). 1 = serial (pool not created). N > 1 = N worker threads. |
| --allow-expensive-optimizations | bool | Auto-enabled at -O2 and above. Enables the thread pool with an automatically determined thread count. |
Both flags ultimately result in a non-zero value at offset +668 in the compilation driver's state block. The driver checks this field to decide between sequential per-kernel iteration and thread pool dispatch.
Two internal options fine-tune pool behavior:
| Option | Type | Purpose |
|---|---|---|
| --threads-dynamic-scheduling | bool | Enable dynamic scheduling of thread pool tasks (work stealing vs static partitioning). |
| --threads-min-section-size | int | Minimum section size for thread pool work partitioning; prevents excessive task granularity. |
CPU Count Detection
// sub_1CB1890 -- returns the configured CPU count
__int64 sub_1CB1890() {
return sysconf(83); // _SC_NPROCESSORS_CONF on Linux
}
When --split-compile 0 is specified, the pool constructor receives the return value of sub_1CB1890() as its nmemb argument.
Thread-Local Storage (TLS)
The TLS system is the foundation of ptxas's concurrency model. Every thread -- main and workers alike -- gets its own 280-byte context struct, accessed through the single most-called function in the binary: sub_4280C0 (3,928 call sites).
Initialization: ctor_001 (0x4094C0)
The TLS key is created in a static constructor that runs before main:
// ctor_001 -- .init_array entry
int ctor_001() {
if (!qword_29FE0A0) { // one-time guard
pthread_key_create(&key, destr_function);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
qword_29FE0A0 = &sentinel_node; // marks "initialized"
// ... initialize linked list sentinels
}
__cxa_atexit(cleanup_func, ...);
}
The SCHED_RR priority range is queried but never used for thread creation (all threads are created with default attributes). This appears to be vestigial infrastructure for priority-based scheduling that was never activated.
TLS Accessor: sub_4280C0
char *sub_4280C0() {
if (!qword_29FE0A0)
goto init_once; // fallback init (race protection)
char *result = pthread_getspecific(key);
if (result)
return result; // fast path: already allocated
char *ctx = malloc(0x118); // 280 bytes
memset(ctx, 0, 0x118);
pthread_cond_init(ctx + 128, NULL);
pthread_mutex_init(ctx + 176, NULL);
sem_init(ctx + 216, 0, 0);
// Insert into global doubly-linked list (under global mutex)
pthread_mutex_lock(&mutex);
ctx->prev = sentinel;
ctx->next = sentinel->next;
sentinel->next->prev = ctx;
sentinel->next = ctx;
pthread_mutex_unlock(&mutex);
pthread_setspecific(key, ctx);
return ctx;
}
The global doubly-linked list at offsets +256 (prev) and +264 (next) allows the system to enumerate all live TLS contexts. This is used during cleanup to destroy all thread contexts at program exit.
TLS Context Layout (280 bytes)
The full byte-level layout, verified against the constructor (sub_4280C0), destructor (destr_function), and the two diagnostic reporters (sub_42F590, sub_42FBA0):
| Offset | Size | Type | Purpose | Accessor |
|---|---|---|---|---|
| 0 | 1 | byte | has_warning_or_error flag -- set to 1 when severity > 2 | sub_42F590 (direct write) |
| 1 | 1 | byte | has_fatal_error flag -- set to 1 when severity > 4 | sub_42F590 (direct write) |
| 2 | 6 | -- | padding (zeroed by memset) | -- |
| 8 | 8 | jmp_buf* | Error recovery longjmp target (installed by sub_431ED0 before _setjmp) | sub_431ED0 (save/restore) |
| 16 | 8 | void* | Error descriptor pointer -- set to the faulting error descriptor on longjmp | sub_42F590 (write on fatal), sub_431ED0 (propagate on re-throw) |
| 24 | 8 | void* | Per-thread memory pool pointer (used by sub_424070) | sub_42BDD0 (swap), sub_4287D0 (read) |
| 32 | 8 | char* | Program name string (e.g. "ptxas") -- prepended to diagnostic messages | sub_430590 (set), sub_430570 (get) |
| 40 | 8 | char* | Diagnostic suffix string -- appended to message body when non-NULL | sub_42F590 (read, format as " %s") |
| 48 | 1 | byte | Info suppression flag -- suppresses severity-2 (info) messages | sub_42F590 (check) |
| 49 | 1 | byte | Diagnostic suppression flag -- suppresses severity-3 (warning) messages | sub_430560 (set) |
| 50 | 1 | byte | Werror promotion flag (--Werror) -- promotes warnings to errors | sub_430550 (set) |
| 51 | 1 | byte | Machine-readable annotation flag -- adds @E@/@W@/@O@/@I@ severity tags | sub_4305A0 (get) |
| 52 | 1 | byte | Multi-line continuation flag -- suppresses ". " prefix on wrapped lines | sub_4305C0 (set) |
| 53 | 75 | -- | padding (zeroed by memset) | -- |
| 128 | 48 | pthread_cond_t | Per-thread condition variable | constructor: pthread_cond_init |
| 176 | 40 | pthread_mutex_t | Per-thread mutex | constructor: pthread_mutex_init |
| 216 | 32 | sem_t | Semaphore for cross-thread signaling | constructor: sem_init(0, 0) |
| 248 | 8 | void* | Saved semaphore pointer (from pool, for destr_function to sem_post) | destr_function (read at qword[31]) |
| 256 | 8 | void* | Linked list prev pointer (global TLS chain) | constructor (qword[32]) |
| 264 | 8 | void* | Linked list next pointer (global TLS chain) | constructor (qword[33]) |
| 272 | 1 | byte | Destroyed flag (prevents double-destroy) | destr_function (set to 1) |
| 273 | 7 | -- | padding (zeroed by memset) | -- |
The fields fall into seven functional groups:
Error state (offsets 0--1). Two byte flags set by the diagnostic reporter sub_42F590. Byte 0 records whether any error-or-above diagnostic was emitted; byte 1 records fatal errors specifically. The compilation driver reads these after each kernel compilation to determine the process exit code.
Error recovery (offsets 8--16). A setjmp/longjmp mechanism for non-local error exits. sub_431ED0 saves the current jmp_buf* and error byte flags, installs a fresh jmp_buf, then enters the compiler. On a fatal diagnostic, sub_42F590 stores the error descriptor at offset +16 and calls longjmp to the target at offset +8. If no jmp_buf is installed, the handler calls sub_4275E0 (abort).
Per-thread allocator (offset 24). The most performance-critical field. The pool allocator sub_424070 reads this pointer on every allocation (accessed as sub_4280C0()[3]). When non-NULL, allocations go to the calling thread's own slab without any locking. sub_42BDD0 provides an atomic swap primitive that replaces the pool pointer and returns the old value -- used during pool migration at compilation boundaries. This is used pervasively: 3,928 call sites to sub_4280C0 are predominantly pool allocator calls that need the current thread's arena.
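The swap pattern can be sketched as follows. This is a hypothetical model: tls_ctx stands in for sub_4280C0, swap_thread_pool for sub_42BDD0, and a static array stands in for the real 280-byte TLS struct; since the field is thread-local, a plain read-modify-write suffices in the sketch.

```c
#include <assert.h>
#include <stddef.h>

static void *g_fake_tls[35];                 /* test stand-in for the 280-byte ctx */
static void **tls_ctx(void) { return g_fake_tls; }

/* Replace the calling thread's pool pointer (ctx offset +24, i.e. pointer
 * slot [3]) and return the previous value -- the sub_42BDD0 semantics. */
static void *swap_thread_pool(void *new_pool) {
    void **ctx = tls_ctx();
    void *old = ctx[3];
    ctx[3] = new_pool;
    return old;
}
```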
Diagnostic context (offsets 32, 40). The program name at +32 (e.g. "ptxas") is prepended to all diagnostic messages. The suffix at +40 is appended after the message body. Both are set per-thread to support library mode where multiple tool names coexist in the same process.
Diagnostic flags (offsets 48--52). Five single-byte flags controlling diagnostic formatting and filtering. The info suppression flag (+48) silences informational messages. The diagnostic suppression flag (+49) silences warnings entirely. The Werror flag (+50) promotes warnings to errors. The annotation flag (+51) enables machine-readable severity tags (@E@, @W@, @O@, @I@). The continuation flag (+52) enables multi-line continuation mode where wrapped lines omit the ". " prefix.
Synchronization primitives (offsets 128--248). The condvar, mutex, and semaphore are used by the thread pool for task coordination and cross-thread signaling. The saved semaphore pointer at +248 is set by the pool when assigning work to a thread; on thread exit, the destructor calls sem_post on it to notify the pool's shutdown logic.
Global linked list (offsets 256--272). A doubly-linked list threading through all live TLS contexts, protected by the global mutex at 0x29FE0xx. Used by the atexit handler to enumerate and destroy all contexts. The destroyed flag at +272 prevents double-destroy when contexts move to the free list for recycling.
TLS Destructor: destr_function (0x427F10)
When a worker thread terminates, the POSIX TLS destructor fires:
void destr_function(char *ctx) {
if (!ctx) return;
pthread_mutex_lock(&global_mutex);
if (ctx[272]) { // already destroyed?
pthread_mutex_unlock(&global_mutex);
return;
}
sem_t *sem = ctx->saved_semaphore; // offset +248
// Unlink from global TLS chain
ctx->prev->next = ctx->next;
ctx->next->prev = ctx->prev;
// Destroy sync primitives
pthread_cond_destroy(ctx + 128);
pthread_mutex_destroy(ctx + 176);
sem_destroy(ctx + 216);
// Move to free list (recycling, not freed)
ctx[272] = 1; // mark destroyed
ctx->next = free_list_sentinel;
ctx->prev = free_list_head;
free_list_head->next = ctx;
free_list_head = ctx;
pthread_mutex_unlock(&global_mutex);
if (sem)
sem_post(sem); // notify pool that worker exited
}
The destructor does not free() the 280-byte struct. Instead, it moves it to a free list rooted at a second sentinel node (unk_29FDC40 / unk_29FDD60). This is a deliberate choice: the destructor runs during pthread_exit() or thread detachment, and ptxas reuses TLS structs across multiple pool lifetimes within a single process invocation (library mode).
The sem_post at the end notifies the pool shutdown code (sub_1CB1970) that a worker has fully terminated.
Thread Pool Implementation
Pool Struct (184 bytes)
struct ThreadPool { // calloc(1, 0xB8) = 184 bytes
void *thread_handles; // +0: array of (pthread_t, 16 bytes each)
void *work_queue; // +8: priority heap (from sub_1CBEC10)
int32_t pending; // +16: count of tasks awaiting execution
// padding
pthread_mutex_t mutex; // +24: guards all mutable pool state (40 bytes)
pthread_cond_t cond_work; // +64: broadcast when new work arrives (48 bytes)
pthread_cond_t cond_done; // +112: signaled when all work completes (48 bytes)
int64_t active_count; // +160: workers currently executing tasks
int64_t max_threads; // +168: total worker count (= nmemb)
int8_t shutdown; // +176: set to 1 to terminate all workers
};
Constructor: sub_1CB18B0
ThreadPool *sub_1CB18B0(size_t nmemb) {
ThreadPool *pool = calloc(1, 0xB8); // 184 bytes, zero-init
pool->thread_handles = calloc(nmemb, 0x10); // 16 bytes per thread
pool->max_threads = nmemb;
pool->pending = 0;
pthread_mutex_init(&pool->mutex, NULL); // default (non-recursive)
pthread_cond_init(&pool->cond_work, NULL);
pthread_cond_init(&pool->cond_done, NULL);
// Create priority heap for task ordering
pool->work_queue = sub_1CBEC10(sub_1CB1770, 0);
// Spawn nmemb detached worker threads
for (int i = 0; i < nmemb; i++) {
pthread_create(&pool->thread_handles[i].tid, NULL,
start_routine, pool);
pthread_detach(pool->thread_handles[i].tid);
}
return pool;
}
Workers are detached immediately after creation. This means the pool does not use pthread_join for cleanup -- instead, it relies on the cond_done / max_threads countdown protocol in sub_1CB1970.
Work Queue: Priority Heap
The work queue at pool offset +8 is not a simple linked list -- it is a binary min-heap (priority queue) backed by a dynamically-resized array.
Heap structure (32 bytes):
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 8 | void** | Array of element pointers |
| 8 | 8 | int64 | Current element count |
| 16 | 8 | int64 | Allocated capacity |
| 24 | 8 | fn_ptr | Comparator function |
Constructor (sub_1CBEC10): Allocates the heap struct from the pool allocator, sets the comparator to sub_1CB1770 (which always returns 1, making the heap behave as a FIFO -- every parent "beats" every child, so new elements sink to the end).
Enqueue (sub_1CBECC0): Standard heap push with sift-up. Appends element, then bubbles up through parent comparisons. Auto-grows the backing array (doubles capacity) when full via sub_424C50 (realloc equivalent).
Dequeue (sub_1CBEDD0): Standard heap pop with sift-down. Extracts root, moves last element to root, then percolates down via comparator-guided child selection.
Since the comparator sub_1CB1770 unconditionally returns 1, the heap degenerates into FIFO order. Tasks are dispatched in submission order.
Task Nodes (24 bytes)
struct TaskNode { // malloc(0x18) = 24 bytes
void (*func)(void *arg); // +0: task function pointer
void *arg; // +8: argument passed to func
void *reserved; // +16: zeroed (unused)
};
Task Submit: sub_1CB1A50
int sub_1CB1A50(ThreadPool *pool, void (*func)(void*), void *arg) {
if (!func || !pool)
return 0;
TaskNode *task = malloc(0x18); // 24 bytes
task->func = func;
task->arg = arg;
task->reserved = NULL;
pthread_mutex_lock(&pool->mutex);
sub_1CBECC0(task, pool->work_queue); // heap push
++pool->pending;
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
return 1;
}
The broadcast wakes all sleeping workers, not just one. This is correct for the use case: multiple tasks are typically submitted in a burst (one per kernel), and all idle workers should attempt to pick up work immediately.
Worker Thread: start_routine (0x1CB1780)
void *start_routine(ThreadPool *pool) {
pthread_mutex_t *mtx = &pool->mutex;
pthread_cond_t *done = &pool->cond_done;
while (1) {
pthread_mutex_lock(mtx);
// Wait for work or shutdown
while (pool->pending == 0 && !pool->shutdown)
pthread_cond_wait(&pool->cond_work, mtx);
if (pool->shutdown)
goto exit;
// Dequeue task
TaskNode *task = (TaskNode *)sub_1CBEDD0(pool->work_queue);
--pool->pending;
++pool->active_count;
pthread_mutex_unlock(mtx);
// Execute task outside the lock
if (task) {
task->func(task->arg);
free(task);
}
// Post-execution bookkeeping
pthread_mutex_lock(mtx);
--pool->active_count;
if (!pool->shutdown && pool->active_count == 0 && pool->pending == 0)
pthread_cond_signal(done); // all work complete
pthread_mutex_unlock(mtx);
}
exit:
--pool->max_threads; // decrement live worker count
pthread_cond_signal(done); // notify shutdown waiter
pthread_mutex_unlock(mtx);
return NULL;
}
The completion signal on cond_done fires only when both active_count == 0 and pending == 0. This is the condition that sub_1CB1AE0 (wait-all) blocks on.
Wait-All: sub_1CB1AE0
Blocks until all submitted tasks complete:
void sub_1CB1AE0(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
}
In the non-shutdown case, it waits for pending == 0 && active_count == 0. During shutdown, it waits for max_threads == 0 (all workers have exited).
Destroy: sub_1CB1970
Initiates graceful shutdown:
void sub_1CB1970(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
// Drain any remaining queued tasks
sub_1CBEBF0(pool->work_queue); // free heap contents
pool->pending = 0;
pool->shutdown = 1;
// Wake all workers so they see the shutdown flag
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
// Wait for all workers to exit
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
// Destroy synchronization primitives
pthread_mutex_destroy(&pool->mutex);
pthread_cond_destroy(&pool->cond_work);
pthread_cond_destroy(&pool->cond_done);
free(pool->thread_handles);
free(pool);
}
The shutdown sequence is: (1) drain the queue, (2) set the shutdown flag, (3) broadcast on cond_work so all sleeping workers wake up and check the flag, (4) wait on cond_done until max_threads reaches zero (each exiting worker decrements it), (5) destroy synchronization primitives and free memory.
Multi-Kernel Parallel Compilation
The compilation driver (sub_446240) decides between serial and parallel execution based on the thread count at offset +668:
Serial Path (default)
for each kernel in compile_unit:
sub_43A400(kernel) // target configuration
sub_43CC70(kernel) // DAGgen → OCG → ELF → DebugInfo
Parallel Path (--split-compile / --allow-expensive-optimizations)
pool = sub_1CB18B0(thread_count)
for each kernel in compile_unit:
sub_43A400(kernel) // target config (still serial)
buf = allocate 360-byte work buffer from pool
snapshot 15 config vectors into buf
deep-copy hash maps for thread isolation
sub_1CB1A50(pool, sub_436DF0, buf)
sub_1CB1AE0(pool) // block until all kernels done
sub_1CB1970(pool) // destroy pool
Each task runs sub_436DF0, which performs the per-kernel backend pipeline:
- Set the thread-local program name via sub_430590
- Acquire a jobserver token (if --jobserver is active): sub_1CC6EC0()
- Record the start time
- sub_432500 -- run the full DAGgen+OCG pipeline
- Record the end time and write it to the timing array at a1->timing[112 * cu_index]
- Update the peak wall-clock counter (under lock via sub_607D70/sub_607D90)
- Release the jobserver token: sub_1CC7040()
- Free the 360-byte work buffer
Thread Isolation Strategy
Each worker thread operates on an isolated copy of compilation state:
| Resource | Isolation Mechanism |
|---|---|
| Memory pool | Per-thread pool pointer at TLS offset +24. Each thread's allocations go to a separate arena, eliminating heap contention. |
| Error state | Per-thread flags at TLS offsets 0--1 (error bytes), 8 (longjmp target), 16 (error descriptor), 48--52 (diagnostic control). Each thread tracks its own errors independently. |
| Hash maps | Deep-copied from the master compilation context before task submission. Workers never share mutable lookup tables. |
| Config vectors | Snapshot of 15 configuration vectors into a 360-byte per-task buffer. Workers read their own copy. |
| Timing data | Per-kernel slots in a pre-allocated timing array (112 bytes per kernel). Each worker writes only to its own kernel's slot. |
The only shared mutable state during parallel compilation is the peak wall-clock counter at offset +224 in the compilation driver's state block, protected by a global lock (lock index 6, via sub_607D70/sub_607D90). This lock is acquired briefly at the end of each per-kernel task to compare-and-update the maximum observed wall-clock time.
GNU Make Jobserver Integration
When both --jobserver and --split-compile are active, ptxas participates in GNU Make's parallel job token protocol. The compilation driver creates the jobserver client object before spawning the thread pool, and each per-kernel worker task must acquire a token before starting and release it when done. This throttles ptxas to never exceed the make -j N budget, even when --split-compile would otherwise use more threads.
Jobserver Object (296 bytes)
The jobserver state is a 296-byte heap object allocated once per compilation, stored at global qword_29FE128. The constructor (sub_1CC7AF0) is called from the compilation driver (sub_4428E0) when *(_BYTE*)(context + 993) is set (the --jobserver CLI flag).
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | int32 (atomic) | State code (0=OK; see state table below) |
| 4 | 4 | int32 | Saved errno from last failed syscall |
| 8 | 1 | byte | Implicit token available (1=unconsumed) |
| 16 | 8 | int64 | Pending waiters (threads blocked in acquire) |
| 24 | 8 | int64 | Active count (tokens currently held) |
| 32 | 8 | void* | Token buffer base (std::vector<char> data) |
| 40 | 8 | void* | Token buffer cursor (stack top) |
| 48 | 8 | void* | Token buffer capacity end |
| 56 | 40 | pthread_mutex_t | Inner mutex (guards token accounting) |
| 96 | 40 | pthread_mutex_t | Write mutex (guards write() to Make pipe) |
| 136 | 48 | pthread_cond_t | Condition variable (wakes acquire waiters and reader thread) |
| 184 | 1 | byte | Token-ready flag (set by reader thread / release handoff) |
| 185 | 1 | byte | Last byte read from Make pipe |
| 188 | 4 | int32 | Read fd (Make pipe/FIFO read end) |
| 192 | 4 | int32 | Write fd (Make pipe/FIFO write end) |
| 196 | 4 | int32 | Internal pipe read fd (shutdown wakeup) |
| 200 | 4 | int32 | Internal pipe write fd (shutdown wakeup) |
| 204 | 1 | byte | Opened-fds flag (1=ptxas opened the Make fds itself) |
| 205 | 1 | byte | Shutdown flag |
| 208 | 8 | void* | Reader thread handle (std::thread) |
| 216 | 80 | bytes | Outer mutexes (serializing full acquire/release operations) |
MAKEFLAGS Parser: sub_1CC7300
Called during object construction to detect the Make jobserver channel:
// sub_1CC7300 -- parse MAKEFLAGS, open pipe/FIFO
void sub_1CC7300(JobserverObject *obj) {
char *flags = getenv("MAKEFLAGS");
if (!flags) {
CAS(&obj->state, 5, 0); // no MAKEFLAGS
return;
}
std::string s(flags);
size_t pos = s.find("--jobserver-auth=");
if (pos == npos) {
CAS(&obj->state, 6, 0); // no --jobserver-auth=
return;
}
size_t val = pos + 17; // skip "--jobserver-auth="
if (s.substr(val, 5) == "fifo:") {
// --- FIFO mode ---
std::string path = s.substr(val + 5, next_space);
int fd = open(path.c_str(), O_RDWR | O_NONBLOCK); // 0x802
if (fd == -1) { CAS(&obj->state, 7, 0); return; }
obj->read_fd = fd; // same fd for both
obj->write_fd = fd;
obj->opened_fds = 1;
} else {
// --- Pipe mode ---
// parse "R,W" -- e.g. "3,4"
std::string r_str = s.substr(val, comma_pos - val);
std::string w_str = s.substr(comma_pos + 1, ...);
// validate: digits only
if (r_str.find_first_not_of("0123456789") != npos ||
w_str.find_first_not_of("0123456789") != npos) {
CAS(&obj->state, 7, 0); return;
}
int rd = dup(stoi(r_str)); // private copy
if (fcntl(rd, F_SETFD, FD_CLOEXEC) == -1) {
CAS(&obj->state, 7, 0); return;
}
int wd = dup(stoi(w_str));
if (fcntl(wd, F_SETFD, FD_CLOEXEC) == -1) {
close(rd);
CAS(&obj->state, 7, 0); return;
}
obj->read_fd = rd;
obj->write_fd = wd;
obj->opened_fds = 1;
}
}
| Protocol | --jobserver-auth= value | Detection | fd Setup |
|---|---|---|---|
| FIFO | fifo:/path/to/fifo | Prefix match on fifo: | open(path, O_RDWR|O_NONBLOCK) -- single fd for both read and write |
| Pipe | R,W (e.g. 3,4) | Comma-separated integers after auth= | dup() each fd + fcntl(F_SETFD, FD_CLOEXEC) -- prevents fd leak to children |
Object Construction: sub_1CC7AF0
After sub_1CC7300 succeeds (state == 0), the constructor continues:
- Creates an internal wakeup pipe via pipe() -- fds stored at +196/+200
- Spawns the reader thread (sub_1CC6720) -- passed as a std::thread functor via off_2406838
- Pre-allocates the token buffer vector to hold thread_count bytes
If state is 5 or 6 (no MAKEFLAGS, no auth string), the caller (sub_4428E0) emits: "GNU Jobserver support requested, but no compatible jobserver found. Ignoring '--jobserver'" and proceeds without throttling.
Reader Thread: sub_1CC6720
A dedicated background thread that reads tokens from the Make pipe/FIFO and buffers them for the acquire function:
loop:
if state != 0 or shutdown → exit
lock(mutex_inner)
while pending_waiters == 0 and not shutdown:
cond_wait(cond, mutex_inner) // sleep until someone needs a token
unlock(mutex_inner)
fd_set = { read_fd, internal_pipe_read }
select(max_fd + 1, &fd_set, NULL, NULL, NULL) // block
if shutdown → exit
n = read(read_fd, &byte, 1)
if n == 1:
if pending_waiters > 0:
lock(mutex_inner + mutex_write)
push byte onto token_buffer
token_ready = 1
unlock(mutex_write)
cond_signal(cond) // wake one acquire waiter
else:
write(write_fd, &byte, 1) // no waiter → return token immediately
else if errno == EAGAIN:
continue // expected for non-blocking fd
else:
state = 11; exit // I/O error
The select() monitors two fds simultaneously: the Make pipe (for incoming tokens) and the internal wakeup pipe (for shutdown notification). The internal pipe avoids a race between checking the shutdown flag and blocking in select().
Token Acquire: sub_1CC6EC0
Called by each per-kernel worker before compilation begins. Returns 0 on success.
int sub_1CC6EC0(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_0);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
if (obj->implicit_token_available) {
// Fast path: consume the implicit token (no pipe I/O)
obj->implicit_token_available = 0;
obj->active_count++;
unlock_all;
return 0;
}
// Slow path: wait for reader thread to supply a token
obj->pending_waiters++;
if (obj->pending_waiters == 1)
cond_signal(cond); // wake reader thread
while (!obj->token_ready && !obj->shutdown)
cond_wait(cond, mutex_inner);
if (obj->shutdown) { unlock_all; return 3; }
obj->token_ready = 0;
obj->pending_waiters--;
obj->active_count++;
unlock_all;
return 0;
}
The implicit token is the standard GNU Make convention: the parent Make gives the first child an implicit token (no byte in the pipe). The first worker to call acquire consumes it for free; subsequent workers must wait for bytes to be read from the pipe.
Token Release: sub_1CC7040
Called by each per-kernel worker after compilation completes. Returns 0 on success.
int sub_1CC7040(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_1);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
lock(mutex_write);
if (token_buffer not empty) {
// Path A: write a buffered byte back to Make pipe
byte = pop(token_buffer);
if (write(obj->write_fd, &byte, 1) == 1) {
obj->active_count--;
unlock_all;
return 0;
}
// write error → set state 11 or 2
}
unlock(mutex_write);
if (obj->pending_waiters > 0) {
// Path B: hand off directly to a waiting acquirer
obj->token_ready = 1;
obj->active_count--;
cond_signal(cond);
unlock_all;
return 0;
}
if (!obj->implicit_token_available && obj->active_count == 1) {
// Path C: return the implicit token
obj->implicit_token_available = 1;
obj->active_count = 0;
unlock_all;
return 0;
}
// Protocol error (double-free)
CAS(&obj->state, 12, 0);
unlock_all;
return 12;
}
Release has three paths, in priority order:
| Path | Condition | Action |
|---|---|---|
| A | Token buffer non-empty | Pop byte, write() back to Make pipe |
| B | No buffered token but waiters exist | Set token_ready, signal condvar (avoids pipe round-trip) |
| C | No buffered token, no waiters, last active | Restore implicit token flag |
Per-Kernel Worker Integration
In sub_436DF0 (the per-kernel compilation task submitted to the thread pool):
void sub_436DF0(int64_t *task) {
sub_430590("ptxas", kernel_name); // set TLS program name
if (task[5] && sub_1CC6EC0(task[5])) // acquire token if jobserver active
sub_42F590(FATAL); // acquire failed → fatal error
// ... sub_432500(): full DAGgen + OCG pipeline ...
if (!task[5] || !sub_1CC7040(task[5])) // release token
return; // normal return
sub_42F590(FATAL); // release failed → fatal error
}
task[5] is populated from qword_29FE128 during task dispatch in sub_4428E0. When --jobserver is not active, task[5] == 0 and both acquire/release calls are skipped.
Destroy: sub_1CC6C20
Called after sub_1CB1AE0 (wait-all) and sub_1CB1970 (pool destroy) complete:
- Set the shutdown flag (+205) via _InterlockedCompareExchange8
- Lock the inner mutex, signal the condvar (wake the reader thread), unlock
- Write 1 byte to the internal pipe write end (+200) -- unblocks select() in the reader thread
- Join the reader thread
- Lock the inner mutex, drain all buffered tokens by writing each byte back to write_fd
- Unlock the inner mutex
- Close the Make fds if opened_fds is set (for FIFO: close once, since read_fd == write_fd; for pipe: close both if different)
- Close the internal pipe fds (+196, +200)
- Destroy the condvar, free the token buffer, free the 296-byte object
State Machine
All state transitions use _InterlockedCompareExchange(state, new_value, 0) -- only the first error sticks; subsequent errors are silently dropped.
| State | Meaning | Set by |
|---|---|---|
| 0 | OK (operational) | Constructor |
| 2 | Unexpected I/O (short write/read) | Release, reader thread |
| 5 | No MAKEFLAGS environment variable | sub_1CC7300 |
| 6 | No --jobserver-auth= in MAKEFLAGS | sub_1CC7300 |
| 7 | open()/dup()/fcntl() failed | sub_1CC7300 |
| 11 | I/O error (errno recorded at +4) | Reader thread, release, constructor |
| 12 | Protocol error (double-free of token) | Release |
Throttling Semantics
With make -jN and --split-compile M where M > N:
ptxas creates M worker threads in the pool
but only N-1 pipe tokens exist + 1 implicit token = N total
workers that cannot acquire a token block in cond_wait
→ at most N kernels compile simultaneously, matching Make's budget
→ as each kernel finishes and releases its token, a blocked worker wakes
Without --jobserver, all M workers run freely with no external throttling.
Pool Allocator Thread Safety
The pool allocator (sub_424070) achieves thread safety through a combination of per-thread arenas and a global mutex:
- Per-thread arena: The TLS context at offset +24 holds a pointer to the current thread's memory pool. sub_424070 reads sub_4280C0()[3] to obtain this pointer. When non-NULL, allocations come from the thread-local slab without any locking.
- Global pool mutex: The pool struct contains a pthread_mutex_t at offset +7128 (within the ~7,200-byte pool descriptor). This mutex is acquired only for operations that modify the global pool state: slab acquisition from the parent pool, large-block allocation via mmap/malloc, and pool destruction.
- Slab-level lock freedom: Within a thread-local slab (56-byte descriptor), bump-pointer allocation requires no synchronization. The allocator advances a pointer and returns; only when the slab is exhausted does it acquire the global lock to request a new slab.
Recursive Mutex Pattern
All mutexes created by sub_428620 (the mutex factory used throughout ptxas) are PTHREAD_MUTEX_RECURSIVE:
bool sub_428620(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    return pthread_mutex_init(mutex, &attr) == 0;
}
This is necessary because ptxas functions may re-enter locking code paths through the diagnostic emitter (sub_42FBA0) or pool allocator (sub_424070), both of which are called from virtually everywhere.
Global Synchronization Objects
Global TLS Mutex (mutex at 0x29FE0xx)
Protects the global doubly-linked list of TLS contexts. Acquired during:
- TLS context allocation (sub_4280C0)
- TLS context destruction (destr_function)
- sub_4286A0 (explicit lock for cross-thread operations)
This is a recursive mutex (initialized in ctor_001).
Global Lock Array (sub_607D70 / sub_607D90)
A global array of locks accessed by index. Lock index 6 is used to protect the peak wall-clock counter during parallel compilation. The total number of lock indices and their complete purpose map is not fully recovered; index 6 is the only one observed in the threading hot path.
sub_607D70(6); // acquire lock 6
// update peak wall-clock
sub_607D90(6); // release lock 6
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x4094C0 | 204 B | 0 | ctor_001 -- TLS key + global mutex init (.init_array) |
| 0x427F10 | 376 B | 0 | destr_function -- TLS destructor (via pthread_key_create) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct, lazy alloc) |
| 0x428600 | 27 B | -- | Mutex destroy + free wrapper |
| 0x428620 | 62 B | -- | Recursive mutex init factory |
| 0x428670 | 6 B | -- | pthread_mutex_destroy PLT thunk |
| 0x428680 | 6 B | -- | pthread_mutex_lock PLT thunk |
| 0x428690 | 6 B | -- | pthread_mutex_unlock PLT thunk |
| 0x4286A0 | 163 B | -- | Global mutex lazy-init + lock |
| 0x1CB1770 | 8 B | 1 | Priority comparator (always returns 1 = FIFO) |
| 0x1CB1780 | 202 B | 0 | start_routine -- worker thread main loop |
| 0x1CB1890 | 11 B | -- | CPU count via sysconf(_SC_NPROCESSORS_CONF) |
| 0x1CB18B0 | 159 B | -- | Thread pool constructor (184-byte struct) |
| 0x1CB1970 | 168 B | -- | Thread pool graceful destroy |
| 0x1CB1A50 | 90 B | -- | Task submit (24-byte task node, heap push, broadcast) |
| 0x1CB1AE0 | 109 B | -- | Wait-all (block until pending=0, active=0) |
| 0x1CBEBF0 | -- | 1 | Heap drain (free all queued elements) |
| 0x1CBEC10 | -- | 1 | Priority heap constructor (32-byte struct) |
| 0x1CBECC0 | -- | -- | Priority heap push (sift-up) |
| 0x1CBEDD0 | -- | -- | Priority heap pop (sift-down) |
| 0x1CC6720 | ~700 B | 1 | Jobserver reader thread (select loop, pushes tokens to buffer) |
| 0x1CC6C20 | ~300 B | 1 | Jobserver destroy (drain tokens, close fds, free 296-byte object) |
| 0x1CC6EC0 | 384 B | 1 | Jobserver token acquire (consume implicit or wait for pipe token) |
| 0x1CC7040 | ~280 B | 1 | Jobserver token release (write byte back or hand off to waiter) |
| 0x1CC7300 | 2,027 B | 1 | Jobserver MAKEFLAGS parser (FIFO vs pipe detection, fd setup) |
| 0x1CC7AF0 | ~700 B | 1 | Jobserver constructor (alloc 296B, spawn reader thread) |
Cross-References
- Entry Point & CLI -- ctor_001 TLS init, sub_446240 serial vs parallel dispatch
- CLI Options -- --split-compile, --allow-expensive-optimizations, --jobserver
- Memory Pool Allocator -- per-thread arena via TLS offset +24, global pool mutex at +7128
- Pipeline Overview -- per-kernel compilation phases run as pool tasks
- Code Generation Overview -- sub_436DF0 per-kernel worker, timing lock 6
SASS Opcode Catalog
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Complete reference table of all SASS opcode mnemonics known to ptxas v13.0.88. Extracted from the ROT13-encoded opcode name table in the InstructionInfo constructor (sub_7A5D10, vtable off_233ADC0). The table stores exactly 322 named entries (indices 0--321) at object offset +0x1058, with each entry occupying 16 bytes (8-byte string pointer + 8-byte length). A parallel constructor sub_BE7390 initializes an identical table. Immediately after the name table, a 322-element identity-mapped index array (0x508 bytes of 4-byte integers 0..321) is bulk-copied from unk_21C0E00 to object offset +0x2478; this is a separate data structure (encoding category map), not additional opcode names.
All SASS mnemonic strings in the ptxas binary are ROT13-obfuscated. The cleartext names shown here are the result of applying ROT13 decoding to the stored strings.
Table Organization
Opcodes are partitioned by SM generation through explicit boundary markers embedded in the table:
| Index | Entry | Meaning |
|---|---|---|
| 0--135 | Base ISA | sm_70 (Volta) and all later architectures |
| 136 | SM70_LAST | End of sm_70 range |
| 137--170 | sm_73+ | Volta extensions (uniform registers, tensor shapes) |
| 171 | SM73_LAST | End of sm_73 range |
| 172--192 | sm_82+ | Ampere additions (MMA shapes, gather, REDUX) |
| 193 | SM82_LAST | End of sm_82 range |
| 194--198 | sm_86+ | Ampere+ additions (conversion packed, SUQUERY) |
| 199 | SM86_LAST | End of sm_86 range |
| 200--204 | sm_89+ | Ada Lovelace additions (QMMA shapes) |
| 205 | SM89_LAST | End of sm_89 range |
| 206--251 | sm_90+ | Hopper additions (GMMA, CGA barriers, fences, TMA) |
| 252 | SM90_LAST | End of sm_90 range |
| 253--279 | sm_100+ | Blackwell datacenter additions (UTC, QFMA4, MEMSET) |
| 280 | SM100_LAST | End of sm_100 range |
| 281--319 | sm_104+ | Blackwell Ultra additions (uniform FP, new conversions) |
| 320 | SM104_LAST | End of sm_104 range |
| 321 | LAST | Sentinel (end of table) |
Each SM generation only adds opcodes; no base opcodes are removed. The Ori IR uses the 12-bit index into this table as the base opcode field (instruction offset +72, lower 12 bits). Bits 12--13 of the opcode word encode sub-operation modifiers (.HI, .WIDE, etc.) and are stripped by the 0xFFFFCFFF mask to recover the base index.
Encoding Format Summary
SASS instructions use three widths, selected per opcode during encoding:
| Format Code | Width | Usage |
|---|---|---|
| 0x1 | 64-bit | Simple moves, branches, barriers, NOPs, short-form ALU |
| 0x2 | 128-bit | Most ALU, load/store, texture, tensor core, atomics |
| 0x8 | 256-bit | IMAD.WIDE variants with 16 constant-bank operand slots |
The 3-level opcode hierarchy within the encoded instruction word is: major (9 bits, at bits [8:16]) / minor (8 bits, at bits [17:24]) / sub-opcode (7 bits, at bits [25:31]). See the encoding page for full details.
Duplicate Mnemonic Entries
Five entries in the table share a SASS mnemonic with an earlier index. These are not errors in the table -- they are distinct IR opcodes that happen to produce the same assembly mnemonic but with different binary encodings, operand widths, or functional-unit routing. The duplicates fall into two categories:
Category A -- SM-generation re-introduction. The same operation is re-implemented for a newer GPU generation with a different SASS major opcode and encoding path, typically because the tensor core or ALU microarchitecture changed:
| Later Index | Earlier Index | Mnemonic | Why re-introduced |
|---|---|---|---|
| 215 (sm_90) | 180 (sm_82) | DMMA | Hopper warpgroup-aware TC path (enc. cat. 515 vs 434) |
| 220 (sm_90) | 14 (sm_70) | FMNMX | Hopper adds 5-entry operand sub-mode table (enc. cat. 534 vs 510) |
Category B -- Operand-width extension. Blackwell Ultra (sm_104) adds 64-bit operand variants of existing integer ALU instructions. The SASS printer appends a .64 suffix at render time; the IR name table stores the same base mnemonic for both widths:
| Later Index | Earlier Index | Mnemonic | What the later index adds |
|---|---|---|---|
| 284 (sm_104) | 37 (sm_70) | IMNMX | 32-bit form, new encoding path |
| 285 (sm_104) | 37 (sm_70) | IMNMX | 64-bit form (IMNMX.64, .64.UI, .64.LO) |
| 288 (sm_104) | 7 (sm_70) | ISETP | 64-bit comparison (ISETP.64, .64.UI, .64.LO) |
Binary evidence: in the constructor sub_7A5D10, indices 284 and 285 store identical "VZAZK" string pointers at adjacent 16-byte slots (v2+8728 and v2+8744). The SASS printer (sub_7CB560) maps them to IMNMX vs IMNMX.64 based on operand metadata.
Base ISA -- sm_70 (Volta) and Later (Indices 0--135)
These opcodes are available on all SM architectures supported by ptxas v13.0.
Integer Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 1 | VZNQ | IMAD | Integer multiply-add (32-bit) |
| 2 | VZNQ_JVQR | IMAD_WIDE | Integer multiply-add, 32x32->64 result |
| 3 | VNQQ3 | IADD3 | Three-input integer add with carry |
| 4 | OZFX | BMSK | Generate bitmask from position and width |
| 5 | FTKG | SGXT | Sign-extend from specified bit position |
| 6 | YBC3 | LOP3 | Three-input logic operation (arbitrary LUT) |
| 7 | VFRGC | ISETP | Integer compare and set predicate (32-bit; re-introduced at index 288 for sm_104 with 64-bit support) |
| 8 | VNOF | IABS | Integer absolute value |
| 9 | YRN | LEA | Load effective address (shift-add) |
| 10 | FUS | SHF | Funnel shift (concatenate two regs, shift) |
| 33 | VQC | IDP | Integer dot product (4-element) |
| 34 | VQR | IDE | Integer dot expand |
| 37 | VZAZK | IMNMX | Integer min/max (32-bit only; re-introduced at indices 284--285 for sm_104 with 32/64-bit split) |
| 38 | CBCP | POPC | Population count (count set bits) |
| 39 | SYB | FLO | Find leading one (bit scan) |
| 53 | OERI | BREV | Bit reverse |
FP32 Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 11 | SSZN | FFMA | FP32 fused multiply-add |
| 12 | SNQQ | FADD | FP32 add |
| 13 | SZHY | FMUL | FP32 multiply |
| 14 | SZAZK | FMNMX | FP32 min/max (base encoding cat. 510; re-introduced at index 220 for sm_90 with extended operand modes) |
| 15 | SFJMNQQ | FSWZADD | FP32 swizzle add (cross-lane partial reduction) |
| 16 | SFRG | FSET | FP32 compare and set result register |
| 17 | SFRY | FSEL | FP32 select (conditional move) |
| 18 | SFRGC | FSETP | FP32 compare and set predicate |
| 40 | SPUX | FCHK | FP check for NaN/Inf/denorm |
| 42 | ZHSH | MUFU | Multi-function unit: RCP, RSQ, SIN, COS, EX2, LG2, RCP64H, RSQ64H |
FP64 Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 122 | QSZN | DFMA | FP64 fused multiply-add |
| 123 | QNQQ | DADD | FP64 add |
| 124 | QZHY | DMUL | FP64 multiply |
| 125 | QFRGC | DSETP | FP64 compare and set predicate |
FP16 Packed Arithmetic
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 126 | UNQQ2 | HADD2 | Packed FP16x2 add |
| 127 | UNQQ2_S32 | HADD2_F32 | Packed FP16x2 add with FP32 accumulator |
| 128 | USZN2 | HFMA2 | Packed FP16x2 fused multiply-add |
| 129 | UZHY2 | HMUL2 | Packed FP16x2 multiply |
| 130 | UFRG2 | HSET2 | Packed FP16x2 compare and set |
| 131 | UFRGC2 | HSETP2 | Packed FP16x2 compare and set predicate |
Type Conversion
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 35 | V2V | I2I | Integer to integer conversion (width/sign change) |
| 36 | V2VC | I2IP | Integer to integer, packed variant |
| 43 | S2S | F2F | Float to float conversion (precision change) |
| 44 | S2S_K | F2F_X | Float to float, extended (with carry chain) |
| 45 | S2V | F2I | Float to integer |
| 46 | S2V_K | F2I_X | Float to integer, extended |
| 47 | V2S | I2F | Integer to float |
| 48 | V2S_K | I2F_X | Integer to float, extended |
| 49 | SEAQ | FRND | FP round to integer (within FP format) |
| 50 | SEAQ_K | FRND_X | FP round, extended |
Data Movement
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 19 | ZBI | MOV | Move register to register |
| 20 | FRY | SEL | Predicated select (ternary conditional) |
| 21 | C2E | P2R | Pack predicate registers into GPR |
| 22 | E2C | R2P | Unpack GPR bits into predicate registers |
| 24 | CEZG | PRMT | Byte-level permute (4-byte shuffle) |
| 41 | VCN | IPA | Interpolate pixel attribute (fragment shader) |
| 57 | F2E | S2R | Read special register to GPR |
| 27 | PF2E_32 | CS2R_32 | Control/status register to GPR (32-bit) |
| 28 | PF2E_64 | CS2R_64 | Control/status register to GPR (64-bit) |
Predicate Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 23 | CYBC3 | PLOP3 | Three-input predicate logic (arbitrary LUT) |
| 26 | IBGR | VOTE | Warp-wide vote (ballot/any/all/unanimity) |
| 31 | INOFQVSS | VABSDIFF | Vector absolute difference |
| 32 | INOFQVSS4 | VABSDIFF4 | Vector absolute difference, 4-way |
Memory -- Load/Store
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 89 | YQP | LDC | Load from constant memory bank c[bank][offset] |
| 90 | NYQ | ALD | Attribute load (vertex/fragment attributes) |
| 91 | NFG | AST | Attribute store |
| 94 | YQF | LDS | Load from shared memory |
| 95 | FGF | STS | Store to shared memory |
| 96 | YQT | LDG | Load from global memory |
| 97 | FGT | STG | Store to global memory |
| 98 | YQY | LDL | Load from local memory (per-thread stack) |
| 99 | FGY | STL | Store to local memory |
| 100 | YQ | LD | Load, generic address space |
| 101 | FG | ST | Store, generic address space |
Atomic and Reduction
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 102 | NGBZ | ATOM | Atomic operation (generic address space) |
| 103 | NGBZT | ATOMG | Atomic operation (global memory) |
| 104 | ERQ | RED | Reduction (global memory, fire-and-forget) |
| 105 | NGBZF | ATOMS | Atomic operation (shared memory) |
Cache and Memory Control
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 106 | DFCP | QSPC | Query address space type |
| 107 | PPGY_AB_FO | CCTL_NO_SB | Cache control, no scoreboard wait |
| 108 | PPGY | CCTL | Cache control (invalidate/writeback/etc.) |
| 109 | PPGYY | CCTLL | Cache control, L2 level |
| 110 | PPGYG | CCTLT | Cache control, texture cache |
| 111 | ZRZONE | MEMBAR | Memory barrier (fence) |
Texture Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 83 | GRK | TEX | Texture fetch (filtered sample) |
| 84 | GYQ | TLD | Texture load (unfiltered, integer coords) |
| 85 | GYQ4 | TLD4 | Texture gather (fetch 4 texels for bilinear) |
| 86 | GZZY | TMML | Query texture mip-map level |
| 87 | GKQ | TXD | Texture fetch with explicit derivatives |
| 88 | GKD | TXQ | Texture query (dimensions, levels, format) |
Surface Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 112 | FHYQ | SULD | Surface load |
| 113 | FHFG | SUST | Surface store |
| 114 | FHNGBZ | SUATOM | Surface atomic |
| 115 | FHERQ | SURED | Surface reduction |
Graphics Pipeline
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 51 | NY2C | AL2P | Attribute location to patch offset |
| 52 | NY2C_VAQRKRQ | AL2P_INDEXED | Attribute to patch, indexed variant |
| 92 | BHG | OUT | Tessellation output emit |
| 93 | BHG_SVANY | OUT_FINAL | Tessellation output emit (final, cut primitive) |
| 116 | CVKYQ | PIXLD | Pixel information load (coverage, sample mask) |
| 117 | VFOREQ | ISBERD | Indexed set buffer for read (bindless) |
| 118 | VFORJE | ISBEWR | Indexed set buffer for write (bindless) |
Control Flow
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 67 | OEN | BRA | Branch (relative) |
| 68 | OEK | BRX | Branch indirect (register target) |
| 69 | WZC | JMP | Jump (absolute) |
| 70 | WZK | JMX | Jump indirect |
| 71 | PNYY | CALL | Function call |
| 72 | ERG | RET | Return from function |
| 73 | OFFL | BSSY | Push convergence point onto branch sync stack |
| 74 | OERNX | BREAK | Break out of convergence region |
| 77 | RKVG | EXIT | Thread exit |
| 76 | XVYY | KILL | Kill thread (discard fragment) |
| 75 | OCG | BPT | Breakpoint trap (debugger) |
| 78 | EGG | RTT | Return to trap handler |
| 79 | OFLAP | BSYNC | Branch sync (pop convergence stack, reconverge) |
Synchronization and Warp
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 54 | OZBI_O | BMOV_B | Barrier move (barrier register, B variant) |
| 55 | OZBI_E | BMOV_R | Barrier move (barrier register, R variant) |
| 56 | OZBI | BMOV | Barrier move |
| 58 | O2E | B2R | Barrier register to GPR |
| 59 | E2O | R2B | GPR to barrier register |
| 61 | ONE | BAR | Named barrier synchronization |
| 62 | ONE_VAQRKRQ | BAR_INDEXED | Barrier, indexed variant |
| 66 | QRCONE | DEPBAR | Dependency barrier (wait for scoreboard) |
| 80 | ZNGPU | MATCH | Warp match (find lanes with same value) |
| 119 | FUSY | SHFL | Warp shuffle (cross-lane data exchange) |
| 120 | JNECFLAP | WARPSYNC | Warp-wide synchronization barrier |
| 81 | ANABFYRRC | NANOSLEEP | Thread sleep for specified nanoseconds |
| 82 | ANABGENC | NANOTRAP | Nano trap (lightweight trap) |
System and Miscellaneous
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 0 | REEONE | ERRBAR | Error barrier (internal pseudo-instruction) |
| 25 | ABC | NOP | No-operation |
| 29 | CZGEVT | PMTRIG | Performance monitor trigger |
| 30 | PFZGRFG | CSMTEST | CSM (compute shader model) test |
| 60 | YRCP | LEPC | Load effective PC (get current instruction address) |
| 63 | FRGPGNVQ | SETCTAID | Set CTA (thread block) ID |
| 64 | FRGYZRZONFR | SETLMEMBASE | Set local memory base address |
| 65 | TRGYZRZONFR | GETLMEMBASE | Get local memory base address |
| 121 | LVRYQ | YIELD | Yield execution (internal, scheduler hint) |
| 135 | VAGEVAFVP | INTRINSIC | Compiler intrinsic (pseudo-opcode, lowered before encoding) |
Tensor Core (Base)
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 132 | UZZN_16 | HMMA_16 | FP16 matrix multiply-accumulate, 16-wide |
| 133 | UZZN_32 | HMMA_32 | FP16 matrix multiply-accumulate, 32-wide |
| 134 | VZZN | IMMA | Integer matrix multiply-accumulate |
sm_73 Extensions (Indices 137--171)
Volta+ additions. Primarily uniform register variants and additional tensor core shapes.
Uniform Register Operations
Uniform registers (UR0--UR63) hold values shared across the warp, enabling scalar execution of warp-uniform computations.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 138 | HOERI | UBREV | Uniform bit reverse |
| 139 | HOZFX | UBMSK | Uniform bitmask |
| 140 | HPYRN | UCLEA | Uniform clear address |
| 141 | HVFRGC | UISETP | Uniform integer set-predicate |
| 142 | HYQP | ULDC | Uniform load constant |
| 143 | HYRN | ULEA | Uniform load effective address |
| 144 | HC2HE | UP2UR | Uniform predicate to uniform register |
| 145 | HYBC3 | ULOP3 | Uniform three-input logic |
| 146 | HCYBC3 | UPLOP3 | Uniform predicate three-input logic |
| 147 | HFRY | USEL | Uniform select |
| 148 | HFTKG | USGXT | Uniform sign-extend |
| 149 | HSYB | UFLO | Uniform find leading one |
| 150 | HVNQQ3 | UIADD3 | Uniform three-input integer add |
| 151 | HVZNQ | UIMAD | Uniform integer multiply-add |
| 152 | HZBI | UMOV | Uniform move |
| 153 | HCEZG | UPRMT | Uniform byte permute |
| 154 | IBGRH | VOTEU | Uniform vote |
| 155 | HCBCP | UPOPC | Uniform population count |
| 156 | HFUS | USHF | Uniform funnel shift |
Additional sm_73 Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 157 | FPNGGRE | SCATTER | Scatter write |
| 158 | S2SC | F2FP | Float to float, packed conversion |
| 159 | UZZN_1688 | HMMA_1688 | FP16 MMA, 16x8x8 shape |
| 160 | UZZN_16816 | HMMA_16816 | FP16 MMA, 16x8x16 shape |
| 161 | OZZN | BMMA | Binary (1-bit) matrix multiply-accumulate |
| 162 | GGHPPGY | TTUCCTL | Tensor texture unit cache control |
| 163 | GGHZNPEB | TTUMACRO | Tensor texture unit macro |
| 164 | E2HE | R2UR | GPR to uniform register |
| 165 | ZBIZ | MOVM | Move with mask |
| 166 | YQFZ | LDSM | Load from shared memory to matrix register |
| 167 | YQGENZ | LDTRAM | Load from TRAM (transposed shared memory) |
| 168 | SBBGCEVAG | FOOTPRINT | Texture footprint query |
| 169 | F2HE | S2UR | Special register to uniform register |
| 170 | OEKH | BRXU | Branch indirect, uniform target |
sm_82 Extensions (Indices 172--193)
Ampere additions. New MMA shapes, gather/scatter metadata, and reduction variants.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 173 | TNGURE | GATHER | Gather (multi-address load) |
| 174 | TRAZRGNQNGN | GENMETADATA | Generate metadata (for sparse MMA) |
| 175 | FCZRGNQNGN | SPMETADATA | Sparse metadata |
| 176 | OZZN_88128 | BMMA_88128 | Binary MMA, 8x8x128 shape |
| 177 | OZZN_168128 | BMMA_168128 | Binary MMA, 16x8x128 shape |
| 178 | OZZN_168256 | BMMA_168256 | Binary MMA, 16x8x256 shape |
| 179 | PYZNQ | CLMAD | Carry-less multiply-add (GF(2) arithmetic) |
| 180 | QZZN | DMMA | FP64 matrix multiply-accumulate (Ampere; encoding category 434; re-introduced at index 215 for Hopper with different TC path) |
| 181 | UZZN_FC_1688 | HMMA_SP_1688 | FP16 sparse MMA, 16x8x8 |
| 182 | USZN2_ZZN | HFMA2_MMA | FP16 FMA2, MMA variant |
| 183 | UZAZK2 | HMNMX2 | Packed FP16x2 min/max |
| 184 | VZZN_88 | IMMA_88 | Integer MMA, 8x8 shape |
| 185 | VZZN_FC_88 | IMMA_SP_88 | Integer sparse MMA, 8x8 |
| 186 | VZZN_16816 | IMMA_16816 | Integer MMA, 16x8x16 |
| 187 | VZZN_16832 | IMMA_16832 | Integer MMA, 16x8x32 |
| 188 | VZZN_FC_16832 | IMMA_SP_16832 | Integer sparse MMA, 16x8x32 |
| 189 | NEEVIRF | ARRIVES | Async barrier arrive signal |
| 190 | YQTQRCONE | LDGDEPBAR | Load-global dependency barrier |
| 191 | YQTFGF | LDGSTS | Load-global, store-to-shared (async copy) |
| 192 | ERQHK | REDUX | Warp-wide reduction (uniform result) |
sm_86 Extensions (Indices 194--199)
Ampere+ (GA106/GA107) additions.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 195 | S2VC | F2IP | Float to integer, packed |
| 196 | HS2SC | UF2FP | Uniform float to float, packed |
| 197 | V2SC | I2FP | Integer to float, packed |
| 198 | FHDHREL | SUQUERY | Surface query (dimensions, format) |
sm_89 Extensions (Indices 200--205)
Ada Lovelace additions. Quarter-precision MMA shapes for FP8/INT4.
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 201 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA, 16x8x16 (FP8) |
| 202 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA, 16x8x32 |
| 203 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA, 16x8x32 |
| 204 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA, 128x64 |
sm_90 Extensions (Indices 206--252)
Hopper additions. Major expansion: CGA (Cooperative Grid Array) barriers, fences, GMMA (Group MMA), TMA (Tensor Memory Accelerator), and collective operations.
CGA Barriers and Synchronization
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 207 | NPDOYX | ACQBLK | Acquire block (CTA resource acquisition) |
| 208 | PTNONE_NEI | CGABAR_ARV | CGA barrier arrive |
| 209 | PTNONE_TRG | CGABAR_GET | CGA barrier get (query state) |
| 210 | PTNONE_FRG | CGABAR_SET | CGA barrier set |
| 211 | PTNONE_JNVG | CGABAR_WAIT | CGA barrier wait |
| 212 | PTNREEONE | CGAERRBAR | CGA error barrier |
Collective and Election
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 213 | PERNGRCBYVPL | CREATEPOLICY | Create scheduling/cache policy |
| 214 | PIGN | CVTA | Convert address space (generic to specific) |
| 215 | QZZN | DMMA | FP64 matrix multiply-accumulate (Hopper re-introduction; encoding category 515 vs 434 for index 180; uses warpgroup-aware tensor core path, shared dispatch with CVTA at case 0xD6/0xD7 in sub_6575D0) |
| 216 | RYRPG | ELECT | Elect a leader lane in warp |
| 217 | RAQPBYYRPGVIR | ENDCOLLECTIVE | End collective operation scope |
Fences
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 218 | SRAPR_T | FENCE_G | Fence, global scope |
| 219 | SRAPR_F | FENCE_S | Fence, shared/CTA scope |
| 220 | SZAZK | FMNMX | FP32 min/max (Hopper re-introduction; encoding category 534 vs 510 for index 14; adds 5-entry operand sub-mode table via dword_2026FC0 for extended rounding/precision modes not in base encoding) |
GMMA (Group Matrix Multiply-Accumulate)
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 221 | TZZN | GMMA | Group (warpgroup) matrix multiply-accumulate |
Memory Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 222 | YQPH | LDCU | Load constant, uniform (warp-coherent constant load) |
| 223 | YRCP | LEPC | Load effective PC (sm_90 variant) |
| 224 | ZNCN | MAPA | Map address (for TMA address translation) |
| 225 | CERRKVG | PREEXIT | Pre-exit (cleanup before thread exit) |
| 226 | E2HE_U | R2UR_H | Register to uniform register, high half |
| 227 | ERQNF | REDAS | Reduction, async (fire-and-forget with arrive) |
Configuration
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 228 | FRGZNKERT | SETMAXREG | Set maximum register count for dynamic partitioning |
| 229 | FRGFZRZFVMR | SETSMEMSIZE | Set shared memory size dynamically |
| 230 | FGNF | STAS | Store async (to shared, with barrier) |
| 231 | FGFZ | STSM | Store to shared memory, matrix layout |
Synchronization Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 232 | FLAPF_ONFVP | SYNCS_BASIC | Sync scope, basic |
| 233 | FLAPF_YQ_HAVSZ | SYNCS_LD_UNIFM | Sync scope with uniform load |
Uniform Block Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 234 | HOYXPC | UBLKCP | Uniform block copy |
| 235 | HOYXERQ | UBLKRED | Uniform block reduction |
| 236 | HOYXCS | UBLKPF | Uniform block prefetch |
| 237 | HPIGN | UCVTA | Uniform convert address space |
| 238 | HYRCP | ULEPC | Uniform load effective PC |
| 239 | HZNCN | UMAPA | Uniform map address |
TMA (Tensor Memory Accelerator) Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 240 | HGZNPPGY | UTMACCTL | TMA cache control |
| 241 | HGZNPZQSYHFU | UTMACMDFLUSH | TMA command flush |
| 242 | HGZNYQT | UTMALDG | TMA load global |
| 243 | HGZNCS | UTMAPF | TMA prefetch |
| 244 | HGZERQT | UTMREDG | TMA reduction global |
| 245 | HGZNYFG | UTMALST | TMA load/store |
Vector Min/Max Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 246 | IUZAZK | VHMNMX | Vector half min/max (FP16x2) |
| 247 | IVNQQ | VIADD | Vector integer add |
| 248 | IVNQQZAZK | VIADDMNMX | Vector integer add with min/max |
| 249 | IVZAZK | VIMNMX | Vector integer min/max |
| 250 | IVZAZK3 | VIMNMX3 | Vector integer three-input min/max |
| 251 | JNECTEBHC | WARPGROUP | Warpgroup collective operation |
sm_100 Extensions (Indices 253--280)
Blackwell datacenter additions. UTC (Unified Tensor Core) operations, quad-precision FP, FP32x2 packed operations, and tensor core swizzle load/store.
Packed FP32 and Reduction
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 254 | PERQHK | CREDUX | CTA-scope reduction (cross-warp) |
| 255 | SNQQ2 | FADD2 | Packed FP32x2 add |
| 256 | SSZN2 | FFMA2 | Packed FP32x2 fused multiply-add |
| 257 | SZAZK3 | FMNMX3 | FP32 three-input min/max |
| 258 | SZHY2 | FMUL2 | Packed FP32x2 multiply |
Tensor Memory
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 259 | YQGZ | LDTM | Load via tensor memory (5th-gen tensor core) |
| 260 | HTRGARKGJBEXVQ | UGETNEXTWORKID | Uniform get next work ID (dynamic scheduling) |
UTC (Unified Tensor Core) Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 261 | HGPONE_1PGN | UTCBAR_1CTA | UTC barrier, 1 CTA scope |
| 262 | HGPONE_2PGN | UTCBAR_2CTA | UTC barrier, 2 CTA scope |
| 263 | HGPPC_1PGN | UTCCP_1CTA | UTC copy, 1 CTA scope |
| 264 | HGPPC_2PGN | UTCCP_2CTA | UTC copy, 2 CTA scope |
| 265 | HGPZZN_1PGN | UTCMMA_1CTA | UTC MMA, 1 CTA scope |
| 266 | HGPZZN_2PGN | UTCMMA_2CTA | UTC MMA, 2 CTA scope |
| 267 | HGPFUVSG_1PGN | UTCSHIFT_1CTA | UTC shift, 1 CTA scope |
| 268 | HGPFUVSG_2PGN | UTCSHIFT_2CTA | UTC shift, 2 CTA scope |
Tensor Core Swizzle
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 269 | IVEGPBHAG | VIRTCOUNT | Virtual thread count query |
| 270 | GPNGBZFJF | TCATOMSWS | Tensor core atomic with swizzle |
| 271 | GPYQFJF | TCLDSWS | Tensor core load with swizzle |
| 272 | GPFGFJF | TCSTSWS | Tensor core store with swizzle |
Quad-Precision FP
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 273 | DSZN4 | QFMA4 | Quad-element FP fused multiply-add |
| 274 | DNQQ4 | QADD4 | Quad-element FP add |
| 275 | DZHY4 | QMUL4 | Quad-element FP multiply |
Additional sm_100
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 276 | ZRZFRG | MEMSET | Memory set (block fill) |
| 277 | NPDFUZVAVG | ACQSHMINIT | Acquire shared memory and initialize |
| 278 | FGGZ | STTM | Store via tensor memory |
| 279 | SRAPR_G | FENCE_T | Fence, tensor scope |
sm_104 Extensions (Indices 281--320)
Blackwell Ultra additions. Uniform FP operations, additional integer widths, conversion variants, MMA shape extensions, and MXQMMA sparse variants.
Integer Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 282 | VNQQ | IADD | Integer add (two-input, distinct from IADD3) |
| 283 | HIVNQQ | UVIADD | Uniform vector integer add |
| 284 | VZAZK | IMNMX | Integer min/max, 32-bit operands (sm_104 re-introduction; new Blackwell Ultra encoding path distinct from base index 37) |
| 285 | VZAZK | IMNMX | Integer min/max, 64-bit operands (SASS prints as IMNMX.64; consecutive with 284 to form the 32/64-bit pair; .64.UI and .64.LO sub-modifiers select unsigned/low-half comparison modes) |
| 286 | HVZAZK | UIMNMX | Uniform integer min/max |
| 287 | HIVZAZK | UVIMNMX | Uniform vector integer min/max |
| 288 | VFRGC | ISETP | Integer set-predicate (sm_104 re-introduction; supports 64-bit operand comparison as ISETP.64 with .64.UI/.64.LO sub-modifiers; new encoding path, case 0x120 in sub_7482B0 and sub_8380A0) |
| 289 | HVFRGC | UISETP | Uniform integer set-predicate (sm_104 re-introduction of index 141; pairs with ISETP index 288 for 64-bit uniform comparison) |
Data Movement Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 290 | ZBI | MOV | Move (sm_104 variant) |
| 291 | HZBI | UMOV | Uniform move (sm_104 variant) |
| 292 | FRY | SEL | Select (sm_104 variant) |
| 293 | HFRY | USEL | Uniform select (sm_104 variant) |
Uniform FP Operations
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 294 | HSNQQ | UFADD | Uniform FP add |
| 295 | HSFRY | UFSEL | Uniform FP select |
| 296 | HSSZN | UFFMA | Uniform FP fused multiply-add |
| 297 | HSZHY | UFMUL | Uniform FP multiply |
| 298 | HSFRG | UFSET | Uniform FP compare and set |
| 299 | HSFRGC | UFSETP | Uniform FP compare and set predicate |
Uniform Conversion
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 300 | HV2V | UI2I | Uniform integer to integer conversion |
| 301 | HV2VC | UI2IP | Uniform integer to integer, packed |
| 302 | HS2S | UF2F | Uniform float to float |
| 303 | HSEAQ | UFRND | Uniform FP round |
| 304 | HS2V | UF2I | Uniform float to integer |
| 305 | HS2VC | UF2IP | Uniform float to integer, packed |
| 306 | HV2S | UI2F | Uniform integer to float |
| 307 | HV2SC | UI2FP | Uniform integer to float, packed |
| 308 | HVNOF | UIABS | Uniform integer absolute value |
| 309 | PF2HE | CS2UR | Control/status register to uniform register |
| 310 | HS2SC | UF2FP | Uniform float to float, packed (sm_104 variant) |
MMA Extensions
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 311 | ZKDZZN_FS_16832 | MXQMMA_SF_16832 | Mixed-quantized structured-sparse MMA, 16x8x32 |
| 312 | BZZN_16864 | OMMA_16864 | Operand MMA, 16x8x64 shape |
| 313 | BZZN_FC_168128 | OMMA_SP_168128 | Operand sparse MMA, 16x8x128 |
| 314 | DZZN_16816 | QMMA_16816 | Quarter-precision MMA (sm_104 variant) |
| 315 | DZZN_16832 | QMMA_16832 | Quarter-precision MMA (sm_104 variant) |
| 316 | DZZN_FC_16832 | QMMA_SP_16832 | Quarter-precision sparse MMA (sm_104 variant) |
| 317 | DZZN_FC_12864 | QMMA_SP_12864 | Quarter-precision sparse MMA (sm_104 variant) |
| 318 | DZZN_FS_16832 | QMMA_SF_16832 | Quarter-precision structured sparse MMA |
| 319 | DZZN_FS_FC_16864 | QMMA_SF_SP_16864 | Quarter-precision structured+unstructured sparse MMA |
Boundary Markers
| Idx | ROT13 | Mnemonic | Description |
|---|---|---|---|
| 136 | FZ70_YNFG | SM70_LAST | End of sm_70 base ISA |
| 137 | FZ73_SVEFG | SM73_FIRST | Start of sm_73 extensions |
| 171 | FZ73_YNFG | SM73_LAST | End of sm_73 |
| 172 | FZ82_SVEFG | SM82_FIRST | Start of sm_82 extensions |
| 193 | FZ82_YNFG | SM82_LAST | End of sm_82 |
| 194 | FZ86_SVEFG | SM86_FIRST | Start of sm_86 extensions |
| 199 | FZ86_YNFG | SM86_LAST | End of sm_86 |
| 200 | FZ89_SVEFG | SM89_FIRST | Start of sm_89 extensions |
| 205 | FZ89_YNFG | SM89_LAST | End of sm_89 |
| 206 | FZ90_SVEFG | SM90_FIRST | Start of sm_90 extensions |
| 252 | FZ90_YNFG | SM90_LAST | End of sm_90 |
| 253 | FZ100_SVEFG | SM100_FIRST | Start of sm_100 extensions |
| 280 | FZ100_YNFG | SM100_LAST | End of sm_100 |
| 281 | FZ104_SVEFG | SM104_FIRST | Start of sm_104 extensions |
| 320 | FZ104_YNFG | SM104_LAST | End of sm_104 |
| 321 | YNFG | LAST | End-of-table sentinel |
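All table names above are stored ROT13-encoded; only letters rotate, while digits, dots, and underscores pass through unchanged (FZ104_YNFG decodes to SM104_LAST). A minimal decoder equivalent to the inline decode loop in the lookup path — the function name is illustrative, not a recovered symbol:

```c
#include <string.h>

/* Decode a ROT13-obfuscated table name in place. Only A-Z/a-z rotate;
   digits, '.', and '_' pass through, matching the entries above.
   (Function name is illustrative, not a recovered symbol.) */
static char *rot13_decode(char *s)
{
    for (char *p = s; *p; ++p) {
        if (*p >= 'A' && *p <= 'Z')
            *p = 'A' + (*p - 'A' + 13) % 26;
        else if (*p >= 'a' && *p <= 'z')
            *p = 'a' + (*p - 'a' + 13) % 26;
    }
    return s;
}
```

ROT13 is an involution, so the same routine also turns a SASS mnemonic back into its table key (IMAD encodes to VZNQ).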
Encoding Category Map at unk_21C0E00
The 0x508 bytes (1288 bytes) at unk_21C0E00 are not additional opcode names. They are a 322-element int32 array mapping each opcode index to an encoding category number -- a level of indirection between opcode indices and binary encoding format descriptors.
Binary Evidence
- RSI is loaded with 0x21C0E00 (at 0x7A5D9F: mov $0x21c0e00, %esi)
- RDI is set to obj+0x2478 (at 0x7A5D82: lea 0x2478(%rbx), %rdi)
- RCX is set to 161 (at 0x7A5D22: mov $0xa1, %r13d; 0x7A5D69: mov %r13, %rcx)
- The rep movsq at 0x7A791D copies 161 quadwords = 1288 bytes = 322 x 4 bytes
The destination offset +0x2478 (decimal 9336) is immediately after the 322-entry name table (+4184 through +9328). Three arch-specific constructors each populate this array from a different static source table:
| Constructor | Source Table | Map Content |
|---|---|---|
| sub_7A5D10 (base) | unk_21C0E00 | Identity: map[i] = i for all i in 0..321 |
| sub_7C5410 | unk_21C3600 | Arch-remapped (selected entries differ) |
| sub_BE7390 | unk_22B2320 | Arch-remapped (selected entries differ) |
Reader: sub_1377C60 (SASS Mnemonic Lookup)
The SASS mnemonic lookup function at sub_1377C60 reads this map at line 292:
v84 = *(_DWORD *)(a1 + 4 * v18 + 9336); // encoding_category_map[opcode_index]
After matching an input mnemonic string against the ROT13 name table (with inline decoding at lines 264-273), the function reads encoding_category_map[opcode_index] and uses the result as a hash key -- combined with a 24-bit architecture discriminator via FNV-1a -- to look up the encoding format descriptor in the hash table at InstructionInfo+10672.
This is why duplicate mnemonics (e.g. DMMA at indices 180 and 215, or FMNMX at indices 14 and 220) can have different encoding categories (434 vs 515, 510 vs 534): the category map provides the indirection needed to select different binary encoders for the same mnemonic across architectures. The opcode name table has exactly 322 entries and no more.
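The encode-side path therefore reduces to: index the category map at +0x2478, then hash (category, arch) into the descriptor table. The sketch below is a reconstruction — the identity initialization matches sub_7A5D10, the FNV-1a constants are the standard 32-bit ones, and the key byte order is an assumption for illustration, not recovered from the binary:

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_OPCODES 322

/* Encoding category map as built by the base constructor sub_7A5D10:
   identity. The arch constructors (sub_7C5410, sub_BE7390) instead copy
   remapped static tables, changing selected entries. */
static int32_t category_map[NUM_OPCODES];

static void init_base_category_map(void)
{
    for (int i = 0; i < NUM_OPCODES; ++i)
        category_map[i] = i;
}

/* Standard 32-bit FNV-1a over an arbitrary byte string. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;
    while (len--) {
        h ^= *data++;
        h *= 16777619u;
    }
    return h;
}

/* Descriptor hash key: encoding category combined with a 24-bit
   architecture discriminator. Little-endian byte order is assumed. */
static uint32_t descriptor_hash_key(int opcode_index, uint32_t arch24)
{
    uint32_t cat = (uint32_t)category_map[opcode_index];
    uint8_t key[7] = {
        (uint8_t)cat, (uint8_t)(cat >> 8),
        (uint8_t)(cat >> 16), (uint8_t)(cat >> 24),
        (uint8_t)arch24, (uint8_t)(arch24 >> 8), (uint8_t)(arch24 >> 16),
    };
    return fnv1a(key, sizeof key);
}
```

Even under the identity map, FMNMX at indices 14 and 220 hash to different keys; the arch-remapped maps push those indices to categories 510 and 534, selecting different binary encoders for the same mnemonic.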
Opcode Category Summary
| Category | Base ISA | sm_73+ | sm_82+ | sm_86+ | sm_89+ | sm_90+ | sm_100+ | sm_104+ | Total |
|---|---|---|---|---|---|---|---|---|---|
| Integer ALU | 16 | 10 | 1 | 0 | 0 | 2 | 0 | 5 | 34 |
| FP32 | 10 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 15 |
| FP64 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 |
| FP16 | 6 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 8 |
| Conversion | 10 | 1 | 0 | 3 | 0 | 0 | 0 | 10 | 24 |
| Data Movement | 9 | 5 | 0 | 0 | 0 | 2 | 0 | 5 | 21 |
| Predicate/Vote | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| Load/Store | 11 | 3 | 2 | 0 | 0 | 5 | 2 | 0 | 23 |
| Atomic/Reduce | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 6 |
| Cache/Fence | 6 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 11 |
| Texture | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| Surface | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Control Flow | 13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 15 |
| Sync/Warp | 10 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 14 |
| Tensor Core | 3 | 3 | 10 | 0 | 4 | 1 | 9 | 9 | 39 |
| TMA | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 6 |
| Uniform Block | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 6 | 10 |
| CGA/Collective | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 5 |
| Graphics | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| System/Misc | 7 | 0 | 1 | 0 | 0 | 4 | 2 | 0 | 14 |
| Boundaries | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 16 |
Encoding Format Correlation
From the encoding page analysis, the approximate distribution of 64-bit vs 128-bit formats for the base ISA:
64-bit format (format code 0x1): NOP, BRA, BRX, JMP, JMX, CALL, RET, EXIT, BREAK, BSSY, BSYNC, BPT, KILL, RTT, BAR, DEPBAR, WARPSYNC, BMOV, B2R, R2B, S2R, CS2R, MOV (short form), YIELD, ERRBAR, NANOSLEEP, NANOTRAP, SHFL. These are primarily control-flow, barriers, and simple data movement instructions that need fewer operand bits.
128-bit format (format code 0x2): All ALU operations (IMAD, IADD3, FFMA, FADD, FMUL, LOP3, ISETP, FSETP, etc.), all memory operations (LDG, STG, LDS, STS, LDL, STL, LD, ST, LDC), all atomics (ATOM, ATOMG, ATOMS, RED), all texture operations (TEX, TLD, TLD4, TMML, TXD, TXQ), all surface operations, tensor core operations (HMMA, IMMA, BMMA, GMMA, etc.), conversion instructions, and most uniform register operations.
256-bit format (format code 0x8): IMAD.WIDE variants with 16 constant-bank operand slots. Extremely rare -- only 2 encoder functions use this format.
The 64-bit short-form encoders cover 27 opcode classes across 174 encoder functions total. The 128-bit encoders cover the remaining ~75+ opcode classes across 912+ encoder functions.
SM100 Encoding Variant Counts
Per-opcode variant counts for the SM100 (Blackwell datacenter) SASS encoder, extracted from the 683 concrete encoding handler functions at 0xED1520--0xFA5F10. Each function encodes one (opcode, operand-form) pair -- e.g., FFMA reg,reg,reg vs FFMA reg,reg,imm vs FFMA reg,reg,pred. The "Enc ID" column is the numeric value written to *(WORD*)(a2+12) by each handler, which maps to the SASS binary major opcode through the encoding dispatch megafunctions. The "SASS Mnemonic" column gives the canonical name from the 322-entry ROT13 opcode name table in InstructionInfo. Where two encoder IDs map to the same mnemonic (e.g. IADD3 IDs 0+1, LOP3 IDs 4+10), both are listed; the "Combined" column gives the merged count for that instruction.
Source: sweep report p1.14-sweep-0xED1000-0xFA6000.txt, ptxas v13.0.88.
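Each of the 683 handlers follows the same skeleton in the decompilation: store the encoder ID at offset +12 of the instruction object, then pack operand bitfields per the handler's format descriptor. A schematic stub — function names and buffer layout are reconstructions, not recovered symbols; only the +12 store is the recovered behavior:

```c
#include <stdint.h>

/* Schematic SM100 encoding handler: writes its encoder ID to
   *(WORD *)(a2 + 12), as recovered from the sweep. Real handlers then
   pack operand bitfields according to their format descriptor. */
static void encode_iadd3_rrr(void *a2)
{
    *(uint16_t *)((uint8_t *)a2 + 12) = 0;   /* Enc ID 0: IADD3 reg,reg,reg */
    /* ... operand packing per format 23F1DF8 would follow ... */
}

static void encode_mov(void *a2)
{
    *(uint16_t *)((uint8_t *)a2 + 12) = 18;  /* Enc ID 18: MOV */
}
```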
Integer ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 0 | 8 | IADD3 | 13 (IDs 0+1) | 23F1DF8, 23F1F08 |
| 1 | 5 | IADD3 | | 23F1DF8, 23F1F08 |
| 15 | 19 | IMAD | 19 | 23F1DF8, 23F2018 |
| 40 | 23 | IMAD (wide) | 23 | 23F1DF8, 23F21B0 |
| 42 | 34 | IMAD (extended) | 34 | 23F1DF8, 23F21B0 |
| 4 | 4 | LOP3 | 12 (IDs 4+10) | 23F2018 |
| 10 | 8 | LOP3 | | 23F2018 |
| 34 | 33 | ISETP | 33 | 23F1DF8, 23F29A8 |
| 30 | 2 | IMNMX | 2 | 23F1D70 |
| 43 | 13 | FLO | 13 | 23F1D70, 23F1DF8 |
| 44 | 4 | IABS | 4 | 23F1F08, 23F1F90 |
| 47 | 5 | POPC | 5 | 23F1F08, 23F1F90 |
| 49 | 2 | BREV | 2 | 23F1DF8 |
| 21 | 5 | SHF | 5 | 23F1DF8, 23F1F08 |
| 84 | 6 | SHF | 6 | 23F1F08, 23F1F90 |
| Subtotal | 171 |
FP32 ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 13 | 30 | FFMA | 30 | 23F2018..23F2EF8 |
| 14 | 11 | FADD | 11 | 23F1F90, 23F2E70 |
| 22 | 18 | FMUL | 18 | 23F1DF8..23F2678 |
| 31 | 2 | FMNMX | 2 | 23F1D70 |
| 35 | 30 | FSETP | 30 | many formats |
| 33 | 2 | FSET/CSET | 2 | 23F2238 |
| 38 | 2 | FSWZADD | 2 | 23F2128 |
| 103 | 9 | extended FMA | 9 | 23F1DF8..23F2678 |
| Subtotal | 104 |
FP64 ALU
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 59 | 6 | DFMA | 6 | 23F2678, 23F2EF8 |
| 91 | 2 | DADD | 2 | 23F1DF8 |
| 57 | 5 | DMUL | 5 | 23F1F08 |
| 65 | 6 | DSETP | 6 | 23F2678, 23F2EF8 |
| Subtotal | 19 |
FP16 / Half-Precision
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 23 | 18 | HFMA2/HMUL2 | 18 | 23F1DF8..23F2678 |
| 37 | 34 | HSETP2/DSETP | 34 | 23F1DF8, 23F21B0 |
| Subtotal | 52 |
Data Movement
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 18 | 78 | MOV | 78 | many formats |
| 32 | 28 | SEL | 28 | 23F1D70, 23F1DF8 |
| 71 | 45 | P2R/R2P | 45 | many formats |
| 19 | 3 | PRMT | 3 | 23F1C60, 23F1D70 |
| 20 | 3 | LEA | 3 | 23F1DF8, 23F1F08 |
| 6 | 5 | S2R | 5 | 23F1F08, 23F1F90 |
| 7 | 2 | CS2R | 2 | 23F2018 |
| Subtotal | 164 |
Memory
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 27 | 24 | LDG/STG | 24 | 23F1F08, 23F29A8 |
| 77 | 18 | LDS/STS | 18 | 23F29A8 |
| 94 | 16 | LDL/STL | 16 | 23F29A8 |
| 74 | 6 | ST | 6 | 23F1DF8, 23F1F08 |
| 50 | 5 | ATOM/ATOMG | 5 | 23F1DF8, 23F1F08 |
| 81 | 6 | RED | 6 | 23F1F08, 23F1F90 |
| 100 | 3 | SULD | 3 | 23F1DF8, 23F1F08 |
| Subtotal | 78 |
Tensor Core
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 78 | 35 | HMMA/IMMA | 35 | 23F1DF8, 23F29A8 |
| 90 | 5 | BMMA/QMMA | 5 | 23F2678 |
| Subtotal | 40 |
Texture
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 5 | 1 | TLD | 1 | 23F1F08 |
| 8 | 2 | TEX | 2 | 23F1DF8, 23F1F90 |
| 9 | 1 | TLD4 | 1 | 23F1F08 |
| 88 | 2 | TEX (variant) | 2 | 23F1F08 |
| Subtotal | 6 |
Predicate / Warp
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 79 | 7 | PLOP3 | 7 | 23F1F08..23F2018 |
| 82 | 6 | VOTE | 6 | 23F1F08, 23F1F90 |
| 48 | 7 | SHFL | 7 | 23F1D70, 23F1DF8 |
| Subtotal | 20 |
Control Flow / Sync
| Enc ID | Variants | SASS Mnemonic | Combined | Formats |
|---|---|---|---|---|
| 17 | 1 | BRA | 1 | 23F1F08 |
| 73 | 10 | BAR | 10 | 23F1F08, 23F2238 |
| 92 | 1 | DEPBAR | 1 | 23F1F08 |
| 98 | 1 | MEMBAR | 1 | 23F1F08 |
| 11 | 14 | MUFU | 14 | 23F1F08, 23F1F90 |
| 45 | 1 | NOP | 1 | 23F1D70 |
| 46 | 1 | YIELD/EXIT | 1 | 23F2238 |
| Subtotal | 29 |
Totals
| Category | Encoder Functions | Distinct Opcodes |
|---|---|---|
| Integer ALU | 171 | 15 (across 10 mnemonics) |
| FP32 ALU | 104 | 8 |
| FP64 ALU | 19 | 4 |
| FP16 | 52 | 2 |
| Data Movement | 164 | 7 |
| Memory | 78 | 7 |
| Tensor Core | 40 | 2 |
| Texture | 6 | 4 |
| Predicate/Warp | 20 | 3 |
| Control/Sync | 29 | 7 |
| Total | 683 | 59 |
The top 5 instructions by variant count -- MOV (78), P2R/R2P (45), HMMA/IMMA (35), IMAD extended (34), HSETP2/DSETP (34) -- account for 226 of 683 encoders (33%). MOV alone accounts for 11.4% of all encoder functions because every possible source type (GPR, uniform reg, immediate, constant bank, predicate, special reg) and every destination type requires a separate encoder with a distinct operand signature and bitfield extraction sequence.
The 21 encoding format descriptors (xmmword groups) cluster into three tiers by usage: heavy (165+141+101 = 407 functions across 3 formats), medium (87+47+36 = 170 across 3 formats), and light (106 functions across 15 formats). The heavy-tier formats (23F1F08, 23F1DF8, 23F29A8) are the simple/compact, primary ALU, and memory/load-store formats respectively -- these three alone cover 60% of all SM100 encoders.
Internal Index vs. Numeric Opcode
The index in this table (the position within the ROT13 name array) is the value stored in the Ori IR instruction's opcode field at offset +72 (lower 12 bits). However, this index is distinct from the encoded SASS major opcode in the binary instruction word. The mapping between IR opcode index and SASS binary major opcode is performed by the encoding dispatch tables (the "six megafunctions" at 0x10C0B20--0x10E32E0, which switch on up to 370 opcode category values from 0x0 through 0x171). A single IR opcode index may map to multiple SASS major opcodes depending on operand types and modifier bits, and vice versa.
Known IR-index-to-numeric correlations (confirmed from switch statements across multiple independent functions):
| IR Index | Numeric (encoding switch) | Mnemonic |
|---|---|---|
| 1 | 0x59 | IMAD |
| 3 | 0x29 | IADD3 |
| 25 | (64-bit, no major) | NOP |
| 52 | (pseudo) | BB boundary |
| 77 | (64-bit, no major) | EXIT |
| 91 | 0x1E | ATOM |
| 95 | (64-bit, no major) | EXIT/RET |
| 96 | 0x38 | LDG |
| 221 | 0xDF | GMMA |
Extended Mnemonic Table (sub_896D50)
A second, much larger mnemonic table is constructed by sub_896D50 (21KB, vtable off_21DA9F8). This "extended" table serves a different purpose from the primary 322-entry table: it is used during SASS disassembly input parsing (string-to-index lookup), whereas the primary table is used during encoding (index-to-string). The two tables share the same base class (sub_A2B110) but have different vtables and different object layouts.
Table Dimensions
| Property | Primary (sub_7A5D10) | Extended (sub_896D50) |
|---|---|---|
| Entry count | 322 (indices 0--321) | 773 (indices 0--772) |
| Effective mnemonics | 306 (excl. 16 boundary markers) | 772 (excl. NONE sentinel) |
| Entry size | 16 bytes (8B ptr + 8B len) | 16 bytes (8B ptr + 8B len) |
| Object offset | +0x1058 (+4184) | +0x2C60 (+11360) |
| Ordering | By IR opcode index | Alphabetical by ROT13 name |
| Encoding category map | 322 x int32 at +0x2478 | 772 x int32 at +0x5CB0 (+23728), from unk_21D92E0 |
| Vtable | off_233ADC0 | off_21DA9F8 |
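The recovered offsets are internally consistent: the 773 name-table entries (16 bytes each) starting at +0x2C60 end exactly at +0x5CB0, where the 772-entry category map begins. A quick arithmetic cross-check:

```python
NAME_TABLE_OFF = 0x2C60     # +11360, start of the 773 x 16-byte entry array
CATEGORY_MAP_OFF = 0x5CB0   # +23728, start of the 772 x int32 category map
ENTRY_COUNT = 773           # indices 0-772, including the NONE sentinel
ENTRY_SIZE = 16             # 8-byte string pointer + 8-byte length

assert NAME_TABLE_OFF == 11360 and CATEGORY_MAP_OFF == 23728
assert NAME_TABLE_OFF + ENTRY_COUNT * ENTRY_SIZE == CATEGORY_MAP_OFF
```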
Why 772 Entries?
The extended table is 2.4x larger because it expands each base mnemonic into its modifier-qualified SASS forms. For example, the primary table stores one IMAD entry (index 1), but the extended table stores eight (the base form plus seven dot-qualified variants):
| Extended entry | ROT13 | Description |
|---|---|---|
| IMAD | VZNQ | Base form |
| IMAD.HI | VZNQ.UV | High-half variant |
| IMAD.WIDE | VZNQ.JVQR | 32x32->64 |
| IMAD.WIDE.READ.AB | VZNQ.JVQR.ERNQ.NO | Paired read, A+B |
| IMAD.WIDE.READ.CH | VZNQ.JVQR.ERNQ.PU | Paired read, C high |
| IMAD.WIDE.READ.CL | VZNQ.JVQR.ERNQ.PY | Paired read, C low |
| IMAD.WIDE.WRITE.DH | VZNQ.JVQR.JEVGR.QU | Paired write, D high |
| IMAD.WIDE.WRITE.DL | VZNQ.JVQR.JEVGR.QY | Paired write, D low |
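The ROT13 column is trivially reversible; Python's codecs module regenerates the cleartext mnemonics directly:

```python
import codecs

rot13_names = ["VZNQ", "VZNQ.UV", "VZNQ.JVQR.ERNQ.NO"]
cleartext = [codecs.decode(n, "rot13") for n in rot13_names]
print(cleartext)   # ['IMAD', 'IMAD.HI', 'IMAD.WIDE.READ.AB']
# ROT13 only rotates A-Z/a-z; dots and digits pass through unchanged,
# which is why suffixes like .32I survive the obfuscation intact.
```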
Entry Composition
The 771 populated entries (from the decompiled string assignments at a1+11360 through a1+23712) break down as:
| Category | Count | Examples |
|---|---|---|
| SASS base mnemonics (also in primary table) | 244 | IMAD, FADD, LDG, BRA, MOV, ... |
| SASS dot-modified variants | 125 | FENCE.G, ISETP.64, BAR.SYNC.DEFER_BLOCKING, HMMA.SP.16832.F16.* |
| SASS new base names (not in primary) | 81 | BGMMA, RPCMOV, SYNCS, MOV32I, SHL, SHR, LOP, BITEXTRACT |
| Mercury internal descriptors | 321 | MERCURY_addmin_srcs_r_ur_0, MERCURY_mbarrier_try_wait_... |
| Total SASS | 450 | |
| Total (SASS + Mercury) | 771 | |
Of the 450 SASS entries, 7 carry annotation text in parentheses: F2F (not F64), F2I (not *64), FRND (not F64), I2F (not F64), NANOSLEEP (with Rb), NANOTRAP (with Rb), WARPSYNC (with Rb). These annotations indicate operand-type restrictions or register-variant qualifiers used by the SASS parser to disambiguate instruction forms.
32-Bit Immediate Forms
These mnemonics represent SASS instructions with a 32-bit immediate operand packed directly into the instruction word. They do not appear as separate entries in the primary IR opcode table because the immediate form is selected during encoding based on operand type, not during IR construction:
| ROT13 | Mnemonic | Description |
|---|---|---|
| SNQQ32V | FADD32I | FP32 add with 32-bit immediate |
| SSZN32V | FFMA32I | FP32 FMA with 32-bit immediate |
| SZHY32V | FMUL32I | FP32 multiply with 32-bit immediate |
| UNQQ2_32V | HADD2_32I | FP16x2 add with 32-bit immediate |
| USZN2_32V | HFMA2_32I | FP16x2 FMA with 32-bit immediate |
| UZHY2_32V | HMUL2_32I | FP16x2 multiply with 32-bit immediate |
| VNQQ32V | IADD32I | Integer add with 32-bit immediate |
| VNQQ2 | IADD2 | Two-input integer add (32I related) |
| VZHY32V | IMUL32I | Integer multiply with 32-bit immediate |
| VZHY32V.JVQR | IMUL32I.WIDE | Integer multiply-wide with 32-bit immediate |
| VFPNQQ32V | ISCADD32I | Integer scaled-add with 32-bit immediate |
| YBC32V | LOP32I | Logic operation with 32-bit immediate |
| ZBI32V | MOV32I | Move 32-bit immediate to register |
| ZBI64VHE | MOV64IUR | Move 64-bit immediate to uniform register |
| HYBC32V | ULOP32I | Uniform logic with 32-bit immediate |
Mercury Pseudo-Instructions (321 Entries)
The single largest category. These are not real SASS instructions -- they are internal pseudo-instructions representing Mercury IR operations that need mnemonic-string identity for diagnostic and dump output. They follow a rigid naming convention:
MERCURY_{operation}_{srcs|dests}_{regclass}_{variant_index}
Register class codes in the mnemonic:
- `r` = GPR (R0--R255)
- `ur` = Uniform register (UR0--UR63)
- `p` = Predicate register (P0--P6)
- `simm` = Signed immediate
- `uimm` = Unsigned immediate
- `r2` / `ur2` = Register pair
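A sketch of a parser for this convention follows; the regular expression and field names are inferred from the recovered strings, not from any confirmed internal structure, and names without a srcs/dests segment (e.g. MERCURY__intr) deliberately fall through:

```python
import re

# Register-class tokens from the naming convention above; longer
# alternatives first so "ur2" is not consumed as "ur" plus a stray "2"
_CLS = r"(?:simm|uimm|ur2|r2|ur|r|p)"
MERCURY_RE = re.compile(
    rf"^MERCURY_(?P<op>.+?)_(?P<dir>srcs|dests)"
    rf"_(?P<classes>{_CLS}(?:_{_CLS})*)_(?P<variant>\d+)$"
)

m = MERCURY_RE.match("MERCURY_addmin_srcs_r_ur_0")
print(m.group("op"), m.group("classes").split("_"), m.group("variant"))
```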
Representative entries (decoded from ROT13):
| ROT13 | Cleartext | Operation |
|---|---|---|
| ZREPHEL__vage | MERCURY__intr | Generic intrinsic placeholder |
| ZREPHEL_nqqzva_fepf_e_he_0 | MERCURY_addmin_srcs_r_ur_0 | Fused add-min, GPR + uniform |
| ZREPHEL_nqqznk_fepf_he_e_0 | MERCURY_addmax_srcs_ur_r_0 | Fused add-max, uniform + GPR |
| ZREPHEL_ngbz_pnf_vag_npd_ery_... | MERCURY_atom_cas_int_acq_rel_... | Atomic CAS with acquire-release |
| ZREPHEL_flapf_neevir_n1g0_n0g1_... | MERCURY_syncs_arrive_a1t0_a0t1_... | Sync arrive with token spec |
New Base Mnemonics
Mnemonics that appear in the extended table but have no base-name match in the primary 322-entry table at all. Some are legacy forms (pre-Volta mnemonics preserved for disassembly compatibility), others are specialized operations:
| ROT13 | Mnemonic | Category |
|---|---|---|
| NPDOHYX | ACQBULK | CGA bulk resource acquire |
| OVGRKGENPG | BITEXTRACT | Bitfield extract |
| QRPBZCERFF | DECOMPRESS | Data decompression |
| VQC4N | IDP4A | Integer dot-product accumulate (4-element) |
| VZHY | IMUL | Integer multiply (non-fused, legacy) |
| VFPNQQ | ISCADD | Integer scaled-add (legacy LEA form) |
| YQTZP | LDGMC | Load global with memory consistency |
| YQG | LDT | Load from texture memory |
| YBC | LOP | Two-input logic operation (legacy) |
| CFRGC | PSETP | Predicate set-predicate |
| ERQT | REDG | Reduction, global (explicit address space) |
| FUY | SHL | Shift left (legacy, replaced by SHF) |
| FUE | SHR | Shift right (legacy, replaced by SHF) |
| FCNEFVSL | SPARSIFY | Convert dense to sparse format |
| FGG | STT | Store to texture memory |
| GNGBZT | TATOMG | Texture atomic, global scope |
| IVFRG | VISET | Vector integer set |
| JNECTEBHCFRG | WARPGROUPSET | Configure warpgroup parameters |
Modifier Suffix Patterns
Five distinct modifier suffix patterns are used in the extended table's dot-separated SASS mnemonics:
Pattern 1 -- Sub-operation mode. The suffix selects a functional sub-operation within a single hardware instruction. CCTL has the most variants (7):
| Extended Mnemonic | Sub-operation |
|---|---|
| CCTL.C | Clean |
| CCTL.C.LDC | Clean via constant cache |
| CCTL.C.LDC.IVALL | Clean constant cache, invalidate all |
| CCTL.E.LDC | Evict via constant cache |
| CCTL.I | Invalidate |
| CCTL.LDCU | Load constant, uniform path |
| CCTL.QFAULT | Query fault status |
Also: SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK (8 variants); and BPT.DRAIN, BPT.PAUSE.
Pattern 2 -- Operand width. The .64 suffix (with optional .HI/.LO half-selectors) indicates 64-bit operand mode. Added for sm_104 (Blackwell Ultra):
| Extended Mnemonic | Base Opcode |
|---|---|
| ISETP.64, ISETP.64.HI, ISETP.64.LO | ISETP (idx 288) |
| IMNMX.64, IMNMX.64.HI, IMNMX.64.LO | IMNMX (idx 285) |
| IADD.64, IADD.64.HI, IADD.64.LO | IADD (idx 282) |
| IADD2.64, IADD2.64.HI, IADD2.64.LO | IADD2 |
| MOV.64, MOV.64.HI, MOV.64.LO | MOV (idx 290) |
| SEL.64, SEL.64.HI, SEL.64.LO | SEL (idx 292) |
| UMOV.64, USEL.64, UIADD3.64, UIMNMX.64, UISETP.64 | Uniform 64-bit variants |
Pattern 3 -- Data access direction. IMAD.WIDE has 5 sub-variants controlling which 32-bit half of the 64-bit accumulator is read or written. These correspond to the 256-bit instruction format (format code 0x8) with 16 constant-bank operand slots:
| Extended Mnemonic | Meaning |
|---|---|
| IMAD.WIDE | Default wide multiply-add |
| IMAD.WIDE.READ.AB | Read both A and B input halves |
| IMAD.WIDE.READ.CL / .CH | Read accumulator low / high half |
| IMAD.WIDE.WRITE.DL / .DH | Write result low / high half |
| IMAD.HI | High-half result only |
Pattern 4 -- Scope qualifier. Fences, barriers, UTC operations, and synchronization carry scope suffixes:
| Extended Mnemonic | Scope |
|---|---|
| FENCE.G | Global (GPU-wide) |
| FENCE.S | Shared/CTA |
| FENCE.T | Tensor (sm_100+) |
| UTCBAR.1CTA, UTCBAR.2CTA | 1-CTA / 2-CTA scope |
| UTCBAR.1CTA.FLUSH | 1-CTA with flush |
| BAR.SYNC.DEFER_BLOCKING | Deferred blocking sync |
| USETMAXREG.RELEASE | Release variant |
| USETSHMSZ.FLUSH | Flush variant |
Pattern 5 -- Shape and type descriptor. Tensor core operations carry shape geometry and data type. Brace-delimited alternation syntax indicates a single encoder handling multiple shapes:
| Extended Mnemonic | Meaning |
|---|---|
| HMMA.F32.{16816.F16|16816.E8M7|1688.E8M10} | FP16 MMA with FP32 accum, multiple shapes |
| HMMA.SP.16832.F16.* | Sparse FP16 MMA, 16x8x32 |
| IMMA.{8816.*|8832.*} | Integer MMA, 8x8x16 or 8x8x32 |
| IMMA.SP.{16832.*|16864.*4.*4} | Sparse integer MMA |
| QMMA.SF.SP | Structured + unstructured sparse |
| MUFU.EX2.LOW_ACC.{F16x2, BF16x2} | Low-accuracy EX2 for half types |
Top Opcodes by Dot-Variant Count
| Base Opcode | Variants | Category |
|---|---|---|
| HMMA | 8 | Tensor core shape + sparse + FP type |
| SYNCS | 8 | Scope-aware synchronization modes |
| CCTL | 7 | Cache control sub-operations |
| IMAD | 7 | .HI, .WIDE, .WIDE.READ., .WIDE.WRITE. |
| IMMA | 6 | Tensor core shape + sparse |
| QMMA | 6 | Shape + structured/unstructured sparse |
| USYNCS | 6 | Uniform sync scope modes |
| MUFU | 5 | .EX2, .RCP, .RSQ, plus half-precision and low-accuracy .EX2 forms |
| IADD | 4 | .64, .64.HI, .64.LO, .XOR |
| WARPGROUP | 3 | .ARRIVE, .DEPBAR, .WAIT |
| RPCMOV | 3 | .32, .32.READ, .64 |
| UTCBAR | 3 | .1CTA, .1CTA.FLUSH, .2CTA |
Complete New SASS Mnemonics by Category
The following 206 SASS mnemonics appear only in the extended table -- they have no corresponding entry in the base 322-entry name table. Many represent modifier-suffixed forms of base opcodes; others are entirely new operations.
GMMA type-specialized (8): BGMMA, BGMMA_GSB, HGMMA, HGMMA_GSB, IGMMA, IGMMA_GSB, QGMMA, QGMMA_GSB
UTC type-specialized (20): UTCHMMA.1CTA, UTCHMMA.2CTA, UTCIMMA.1CTA, UTCIMMA.2CTA, UTCMXQMMA.1CTA, UTCMXQMMA.2CTA, UTCOMMA.1CTA, UTCOMMA.2CTA, UTCQMMA.1CTA, UTCQMMA.2CTA, UTCBAR.1CTA.FLUSH, UTCATOMSWS, UTCLDSWS, UTCSTSWS, UTCBAR.1CTA, UTCBAR.2CTA, UTCCP.1CTA, UTCCP.2CTA, UTCSHIFT.1CTA, UTCSHIFT.2CTA
DLC/DPC operations (13): UDLCBAR, UDLCCP, UDLCHMMA, UDLCIMMA, UDLCQMMA, UDPCBLKCP, UDPCBLKL2CCTL, UDPCBLKRED, UDPCTMACCTL, UDPCTMAL2CCTL, UDPCTMALDG, UDPCTMAREDG, UDPCTMASTG
Synchronization (17): SYNCS.ARRIVE.A1T0.A0T1, SYNCS.ARRIVE.A1TR.ART0.A0TR.A0TX, SYNCS.CAS.EXCH, SYNCS.CCTL, SYNCS.FLUSH, SYNCS.LD.NON_UNIFORM, SYNCS.LD.UNIFORM, SYNCS.PHASECHK, SYNCSU.ARRIVE.A1T0, SYNCSU.ARRIVE.MULTICAST.A1T0, WARPGROUP.ARRIVE, WARPGROUP.DEPBAR, WARPGROUP.WAIT, WARPGROUPSET, BAR.SYNC.DEFER_BLOCKING, BPT.DRAIN, BPT.PAUSE
Uniform sync (6): USYNCS.ARRIVE, USYNCS.ARRIVE.MULTICAST, USYNCS.CAS.EXCH, USYNCS.CCTL, USYNCS.LD, USYNCS.PHASECHK
Integer 64-bit variants (20): IADD.64, IADD.64.HI, IADD.64.LO, IADD.XOR, IADD2, IADD2.64, IADD2.64.HI, IADD2.64.LO, IMNMX.64, IMNMX.64.HI, IMNMX.64.LO, ISETP.64, ISETP.64.HI, ISETP.64.LO, MOV.64, MOV.64.HI, MOV.64.LO, SEL.64, SEL.64.HI, SEL.64.LO
Uniform scalar extended (27): UIADD3.64, UIMNMX.64, UISETP.64, UMOV.64, USEL.64, ULOP, ULOP32I, UMEMSETS.64, UPSETP, UR2UP, USHL, USHR, UCCTL, UBLKL2CCTL, UCGABAR_ARV, UCGABAR_GET, UCGABAR_SET, UCGABAR_WAIT, USETMAXREG, USETMAXREG.RELEASE, USETSHMSZ, USETSHMSZ.FLUSH, UREDGR, UREGPRERELEASE, USTGR, UTRACEEVENT, UVIRTCOUNT
IMAD/IMUL variants (8): IMAD.HI, IMAD.WIDE.READ.AB, IMAD.WIDE.READ.CH, IMAD.WIDE.READ.CL, IMAD.WIDE.WRITE.DH, IMAD.WIDE.WRITE.DL, IMUL.WIDE, IMUL32I.WIDE
Tensor core shapes (28): HMMA.16816.F16.*, HMMA.1688.F16.*, HMMA.F32.{...} (4 entries), HMMA.SP.{...} (4 entries), IMMA.{...} (3 entries), IMMA.SP.{...} (3 entries), DMMA.1684, DMMA.1688, DMMA.16816, BMMA.88128, BMMA.168128, BMMA.168256, QMMA.16816, QMMA.16832, QMMA.SF, QMMA.SF.SP, QMMA.SP.16832, QMMA.SP.16864, OMMA.SP
FP extensions (16): FADD32I, FFMA32I, FMUL32I, FHADD, FHADD2, FHFMA, FHFMA2, FHMUL2, UFHADD, UFHFMA, UFMNMX, MUFU.EX2, MUFU.RCP, MUFU.RSQ, MUFU.EX2.{F16x2, BF16x2}, MUFU.EX2.LOW_ACC.{F16x2, BF16x2}
Cache control (7): CCTL.C, CCTL.C.LDC, CCTL.C.LDC.IVALL, CCTL.E.LDC, CCTL.I, CCTL.LDCU, CCTL.QFAULT
Texture extensions (8): TATOMG, TTUCLOSE, TTUGO, TTULD, TTULD_CLOSE, TTUMACROFUSE, TTUOPEN, TTUST
Fence/scope (3): FENCE.G, FENCE.S, FENCE.T
Data movement (8): MOV32I, MOV64IUR, RPCMOV, RPCMOV.32, RPCMOV.32.READ, RPCMOV.64, CS2R (base without size), DECOMPRESS
Memory (4): LDGMC, LDT, STT, REDG
Other new (13): ACQBULK, BRA_IMM, JMP_IMM, JMXU, NONE, PSETP, HADD2_32I, HFMA2_32I, HMUL2_32I, IADD32I, IMUL, LOP, LOP32I
Parallel Constructor Regions
The ROT13 string data for the extended table exists in two parallel, near-identical regions:
| Region | Address Range | SASS Entries | MERCURY Entries |
|---|---|---|---|
| 1 | 0x2039000--0x203A500 | 139 unique | 32 |
| 2 | 0x21CA000--0x21CB100 | 139 unique | 40 |
Region 2 has 8 additional MERCURY entries not in region 1, all for sm_100/sm_104 cluster barrier and atomic operations: MERCURY_barrier_cluster_arrive_sync_unaligned_* (4), MERCURY_atom_shared_cta_popc_inc_* (3), MERCURY_atom_shared_cta_int_acq_rel_* (1). This indicates at least two InstructionInfo variant objects for different target architectures, where the newer variant gains additional Mercury instruction templates.
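Under the assumption that each region is just a flat run of ROT13 strings, the additions in the newer variant fall out of a simple set difference. The toy inputs below stand in for full region dumps:

```python
import codecs

def mercury_entries(strings):
    """Decode ROT13 names and keep only the Mercury pseudo-instructions."""
    decoded = (codecs.decode(s, "rot13") for s in strings)
    return {d for d in decoded if d.startswith("MERCURY_")}

# Toy stand-ins for the strings dumped from each region
region1 = mercury_entries(["ZREPHEL_nqqzva_fepf_e_he_0", "VZNQ"])
region2 = mercury_entries(["ZREPHEL_nqqzva_fepf_e_he_0",
                           "ZREPHEL_oneevre_pyhfgre_neevir", "VZNQ"])
print(sorted(region2 - region1))   # entries only the newer variant registers
```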
Hash Table for O(1) Lookup
After populating the flat sorted array, sub_896D50 constructs a hash table for O(1) mnemonic lookup during SASS parsing. The hash table is allocated as a 488-byte header object with three backing arrays:
| Array | Slot size | Slots | Total bytes | Purpose |
|---|---|---|---|---|
| 1 | 64 bytes | 772 | 49,408 | Open-addressing hash (key prefix + metadata) |
| 2 | 36 bytes | 772 | 27,792 | Auxiliary data per mnemonic |
| 3 | 16 bytes | 35 | 560 | Overflow / collision chain |
Array 1 slots are initialized to 0xFF (empty sentinel). The hash function used for lookup is the same FNV-1a variant used by sub_1377C60 for the primary table.
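The exact parameterization inside sub_1377C60 (width, seed, any post-mix) has not been fully characterized here; for reference, the textbook 64-bit FNV-1a, which such variants are typically derived from, looks like this:

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """Textbook 64-bit FNV-1a; the binary's variant may differ in details."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

# Hypothetical probe-start computation for the 772-slot open-addressing array
slot = fnv1a_64(b"IMAD") % 772
```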
Object Tail Configuration
After building the tables and hash structure, the constructor:
- Queries 15 knobs via `context+1664` (knobs 1, 2, 5, 11, 14, 18, 22, 25, 28, 273, 774, 775, 803, 983, 998) to conditionally register feature-gated instruction families at `context+1728`
- Stores knob 803's value at `obj+108`
- Sets the vtable to `off_21DA9F8` (line 2438 in decompiled source)
- Writes feature bitmask `0x48018BA65` at `obj+26856`
- Stores the hash table pointer at `obj+26832` and the arena pointer at `obj+26840`
Related Pages
- Instructions & Opcodes -- Ori IR instruction layout, opcode encoding, full ROT13 table
- SASS Encoding -- Instruction encoding pipeline, format groups, encoder templates
- Instruction Selection -- Pattern matching from IR to SASS
- SM Architecture Map -- SM version numbering and feature sets
- Scheduling -- How opcodes are assigned to functional units
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_7A5D10 | -- | InstructionInfo constructor; initializes the 322-entry ROT13 opcode name table at object offset +0x1058 and the 322-entry encoding category identity map at +0x2478 (vtable off_233ADC0) | 0.92 |
| sub_BE7390 | -- | Parallel InstructionInfo constructor; initializes an identical 322-entry name table | 0.90 |
| sub_7CB560 | -- | SASS printer; maps duplicate opcode indices (e.g., 284 vs 285) to distinct mnemonic strings (IMNMX vs IMNMX.64) based on operand metadata | 0.85 |
| sub_6575D0 | 49KB | Register-class-to-opcode dispatch; handles DMMA (index 215) shared dispatch with CVTA at cases 0xD6/0xD7 | 0.85 |
| sub_7482B0 | -- | Encoding path for ISETP (index 288, sm_104); handles case 0x120 for 64-bit integer set-predicate | 0.80 |
| sub_8380A0 | -- | Encoding path for ISETP (index 288, sm_104); second handler for case 0x120 | 0.80 |
| sub_896D50 | 21KB | Extended mnemonic table constructor; builds the 772-entry alphabetically-sorted SASS mnemonic lookup table at object offset +11360, with parallel 772-entry encoding category map from unk_21D92E0, plus 3-array hash table for O(1) string lookup during disassembly parsing (vtable off_21DA9F8) | 0.90 |
| sub_A2B110 | -- | Base class constructor shared by both primary (sub_7A5D10) and extended (sub_896D50) mnemonic table objects | 0.85 |
PTX Instruction Table
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Complete catalog of PTX instructions recognized by ptxas v13.0.88 (CUDA 13.0). All entries are verified against the binary: instruction names come from the Flex lexer's 552 token rules, type signatures from the instruction table builder's 1,141 descriptor registrations (sub_46E000 calling sub_46BED0), and formatter names from the 580 PTX text generation functions dispatched by sub_5D4190. Internal-only instructions (prefixed with _) are included where they appear in the binary but are marked accordingly.
| Instruction table builder | sub_46E000 (93 KB, 1,141 calls to sub_46BED0) |
| Instruction lookup | sub_46C690 (entry) / sub_46C6E0 (6.4 KB matcher) |
| PTX text formatter dispatch | sub_5D4190 (12.9 KB, 81 string + 473-entry hash) |
| Formatter functions | 0x4DA340--0x5A8E40 (580 functions) |
| Semantic validators | 0x460000--0x4D5000 (~20 validator functions) |
| Operand type encoding | Single-char codes: F=float, H=half, I=int, B=bits, N=imm, P=pred, E=bf16, Q=fp8, R=fp4 |
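A hypothetical decoder for these codes; the letter-plus-width token shape is an assumption based on signatures like F32 and E16 that appear in the cvt table later on:

```python
import re

# Single-char operand type codes as recovered from the binary
TYPE_CODES = {"F": "float", "H": "half", "I": "int", "B": "bits",
              "N": "imm", "P": "pred", "E": "bf16", "Q": "fp8", "R": "fp4"}

def decode_signature(sig: str):
    """Expand a packed operand signature such as 'F32I64' into readable types."""
    return [f"{TYPE_CODES[c]}{w}"
            for c, w in re.findall(r"([FHIBNPEQR])(\d+)", sig)]

print(decode_signature("F32I64"))   # ['float32', 'int64']
```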
Organization
Instructions are grouped by functional category following NVIDIA's PTX ISA documentation structure. Each table entry lists:
- Mnemonic: the PTX instruction name as recognized by the lexer
- Type suffixes: legal type qualifiers (from instruction table builder encoding strings)
- Operands: operand pattern (`d`=dest, `a`/`b`/`c`=source, `p`=predicate, `[a]`=memory)
- SM req: minimum SM architecture (from `sub_489390` version gates in validators)
- PTX req: minimum PTX ISA version (from `sub_489050` version gates)
- Description: brief functional description
Type abbreviations in the suffix column: s=signed int, u=unsigned int, f=float, b=bits, pred=predicate. Widths: 8/16/32/64/128. Packed: f16x2, bf16x2.
Integer Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer addition |
sub | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer subtraction |
mul.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, low half of result |
mul.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Multiply, high half of result |
mul.wide | .s16 .s32 .u16 .u32 | d, a, b | all | 1.0 | Widening multiply (16->32 or 32->64) |
mul24.lo | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, low half (deprecated sm_20+) |
mul24.hi | .s32 .u32 | d, a, b | all | 1.0 | 24-bit multiply, high half (deprecated sm_20+) |
mad.lo | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, low half |
mad.hi | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b, c | all | 1.0 | Multiply-add, high half |
mad.wide | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Widening multiply-add |
mad24.lo | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, low (deprecated) |
mad24.hi | .s32 .u32 | d, a, b, c | all | 1.0 | 24-bit multiply-add, high (deprecated) |
mad.cc | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-out |
madc.lo | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, low |
madc.hi | .s32 .u32 .s64 .u64 | d, a, b, c | all | 1.0 | Multiply-add with carry-in, high |
mad.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add, high half |
madc.fused.hi | .s32 .u32 | d, a, b, c | all | 1.0 | Fused multiply-add with carry, high |
div | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer division |
rem | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Integer remainder |
abs | .s16 .s32 .s64 | d, a | all | 1.0 | Absolute value |
neg | .s16 .s32 .s64 | d, a | all | 1.0 | Negate |
min | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Minimum |
max | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Maximum |
popc | .b32 .b64 | d, a | 20+ | 2.0 | Population count (count set bits) |
clz | .b32 .b64 | d, a | 20+ | 2.0 | Count leading zeros |
bfind | .s32 .s64 .u32 .u64 | d, a | 20+ | 2.0 | Find most significant set bit |
brev | .b32 .b64 | d, a | 20+ | 2.0 | Bit reverse |
bfe | .s32 .s64 .u32 .u64 | d, a, b, c | 20+ | 2.0 | Bit field extract |
bfi | .b32 .b64 | d, f, a, b, c | 20+ | 2.0 | Bit field insert |
dp4a | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 4-element dot product accumulate |
dp2a.lo | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, low |
dp2a.hi | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a, b, c | 61+ | 5.0 | 2-element dot product accumulate, high |
sad | .s16 .s32 .u16 .u32 | d, a, b, c | all | 1.0 | Sum of absolute differences |
Floating-Point Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .f32 .f64 | d, a, b | all | 1.0 | FP addition (.rn .rz .rm .rp rounding) |
sub | .f32 .f64 | d, a, b | all | 1.0 | FP subtraction |
mul | .f32 .f64 | d, a, b | all | 1.0 | FP multiplication |
fma | .f32 .f64 | d, a, b, c | 20+ | 2.0 | Fused multiply-add |
mad | .f32 .f64 | d, a, b, c | all | 1.0 | Multiply-add (non-fused on sm_20+) |
mad.rnd.f32 | .f32 | d, a, b, c | all | 1.0 | Multiply-add with explicit rounding |
div | .f32 .f64 | d, a, b | all | 1.0 | FP division (.approx .full .rn .rz .rm .rp) |
div.full | .f32 | d, a, b | all | 1.0 | Full-range division (specialized formatter) |
div.rnd.f32 | .f32 | d, a, b | all | 1.0 | Division with explicit rounding |
div.rn.f64 | .f64 | d, a, b | all | 1.0 | Double-precision division, round-nearest |
abs | .f32 .f64 | d, a | all | 1.0 | FP absolute value |
neg | .f32 .f64 | d, a | all | 1.0 | FP negate |
min | .f32 .f64 | d, a, b | all | 1.0 | FP minimum |
max | .f32 .f64 | d, a, b | all | 1.0 | FP maximum |
rcp | .f32 .f64 | d, a | all | 1.0 | Reciprocal (.approx .rn .rz .rm .rp) |
rcp.approx.f64 | .f64 | d, a | all | 1.0 | Approximate double reciprocal |
rcp.rnd.f32 | .f32 | d, a | all | 1.0 | Reciprocal with explicit rounding |
rcp.rn.f64 | .f64 | d, a | all | 1.0 | Double reciprocal, round-nearest |
sqrt | .f32 .f64 | d, a | all | 1.0 | Square root (.approx .rn .rz .rm .rp) |
rsqrt | .f32 .f64 | d, a | all | 1.0 | Reciprocal square root (.approx) |
sin | .f32 | d, a | all | 1.0 | Sine (approximate) |
cos | .f32 | d, a | all | 1.0 | Cosine (approximate) |
lg2 | .f32 | d, a | all | 1.0 | Log base 2 (approximate) |
ex2 | .f32 | d, a | all | 1.0 | Exp base 2 (approximate) |
tanh | .f32 | d, a | 75+ | 6.5 | Hyperbolic tangent (approximate) |
testp | .f32 .f64 | p, a | 20+ | 2.0 | Test FP property (.finite .infinite .number .notanumber .normal .subnormal) |
copysign | .f32 .f64 | d, a, b | 20+ | 2.0 | Copy sign from b to a |
fma.f32 | .f32 | d, a, b, c | 20+ | 2.0 | FP32 fused multiply-add (rounding modes) |
fma.f64 | .f64 | d, a, b, c | 20+ | 2.0 | FP64 fused multiply-add |
Half-Precision and BFloat16 Arithmetic
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
add | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision addition |
sub | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision subtraction |
mul | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 53+ | 4.2 | Half-precision multiplication |
fma | .f16 .f16x2 .bf16 .bf16x2 | d, a, b, c | 53+ | 4.2 | Half-precision fused multiply-add |
neg | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision negate |
abs | .f16 .f16x2 .bf16 .bf16x2 | d, a | 53+ | 4.2 | Half-precision absolute value |
min | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision minimum |
max | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Half-precision maximum |
min.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Min with NaN propagation |
max.ftz.NaN | .f16 .f16x2 .bf16 .bf16x2 | d, a, b | 80+ | 7.0 | Max with NaN propagation |
ex2.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision exp2 |
tanh.approx | .f16 .f16x2 | d, a | 75+ | 6.5 | Half-precision tanh |
fma.rn.relu | .f16 .bf16 | d, a, b, c | 80+ | 7.0 | Fused multiply-add with ReLU |
Comparison and Selection
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
setp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .f16 .bf16 .b16 .b32 .b64 | p[|q], a, b | all | 1.0 | Set predicate on comparison |
selp | .s16 .s32 .s64 .u16 .u32 .u64 .f32 .f64 .b16 .b32 .b64 | d, a, b, p | all | 1.0 | Select on predicate |
slct | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b, c | all | 1.0 | Select on comparison |
set | .s32 .u32 .f32 .s64 .u64 .f64 | d, a, b | all | 1.0 | Compare and set |
Comparison operators for setp/set: .eq .ne .lt .le .gt .ge .lo .ls .hi .hs .equ .neu .ltu .leu .gtu .geu .num .nan
Logic and Shift
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
and | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise AND |
or | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise OR |
xor | .b16 .b32 .b64 .pred | d, a, b | all | 1.0 | Bitwise XOR |
not | .b16 .b32 .b64 .pred | d, a | all | 1.0 | Bitwise NOT |
cnot | .b16 .b32 .b64 | d, a | all | 1.0 | C-style logical NOT |
lop3 | .b32 | d, a, b, c, immLut | 50+ | 4.0 | 3-input logic operation (LUT-encoded) |
shl | .b16 .b32 .b64 | d, a, b | all | 1.0 | Shift left |
shr | .s16 .s32 .s64 .u16 .u32 .u64 | d, a, b | all | 1.0 | Shift right (arithmetic for .s, logical for .u) |
shf.l | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift left |
shf.r | .b32 | d, a, b, c | 32+ | 3.2 | Funnel shift right (.clamp .wrap) |
prmt | .b32 | d, a, b, c | 20+ | 2.0 | Byte permute (.f4e .b4e .rc8 .ecl .ecr .rc16) |
Data Movement
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mov | .b16 .b32 .b64 .b128 .u16 .u32 .u64 .s16 .s32 .s64 .f16 .f32 .f64 .pred | d, a | all | 1.0 | Move register-to-register |
shfl | .b32 | d|p, a, b, c | 30+ | 3.0 | Warp shuffle (.up .down .bfly .idx) |
shfl.sync | .b32 | d|p, a, b, c, membermask | 70+ | 6.0 | Warp shuffle with sync |
vote | .pred | d, {p}, a | 20+ | 2.0 | Warp vote (.all .any .uni .ballot) |
vote.sync | .pred .b32 | d, {p}, a, membermask | 70+ | 6.0 | Warp vote with sync |
match | .b32 .b64 | d, a | 70+ | 6.0 | Warp match (.any .all) |
match.sync | .b32 .b64 | d, a, membermask | 70+ | 6.0 | Warp match with sync |
redux | .s32 .u32 | d, a | 80+ | 7.0 | Warp reduction (.add .min .max .and .or .xor) |
redux.sync | .s32 .u32 | d, a, membermask | 80+ | 7.0 | Warp reduction with sync |
activemask | .b32 | d | 70+ | 6.2 | Get active thread mask |
elect | .pred | p | 90+ | 8.0 | Elect one leader thread |
elect.one | -- | d, {p} | 90+ | 8.0 | Elect one thread, return success |
Load, Store, and Memory
Global, Local, Shared, Const
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
ld | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | d, [a] | all | 1.0 | Load from memory (.global .shared .local .const .param) |
ld.nc | .b32 .b64 .b128 .f32 .f64 | d, [a] | 35+ | 3.2 | Non-coherent load (read-only cache) |
ld.param | .b8 .b16 .b32 .b64 .b128 | d, [a] | all | 1.0 | Load from kernel parameter space |
st | .b8 .b16 .b32 .b64 .b128 .u8 .u16 .u32 .u64 .s8 .s16 .s32 .s64 .f16 .f32 .f64 | [a], b | all | 1.0 | Store to memory |
ldu | .b32 .b64 .b128 .f32 .f64 | d, [a] | 20+ | 2.0 | Load via uniform cache (deprecated) |
prefetch | .L1 .L2 | [a] | 20+ | 2.0 | Prefetch to cache level |
prefetchu | .L1 | [a] | 20+ | 2.0 | Prefetch uniform |
isspacep | .global .shared .local .const | p, a | 20+ | 2.0 | Test address space |
cvta | -- | d, a | 20+ | 2.0 | Convert address space (generic <-> specific) |
cvta.to | .global .shared .local .const | d, a | 20+ | 2.0 | Convert to specific state space |
Cache qualifiers for ld/st: .ca .cg .cs .cv .lu .wb .wt
Eviction policy (PTX 7.4+): .L2::evict_first .L2::evict_last .L2::evict_normal .L2::cache_hint
Async Copy
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
cp.async | .ca .cg | [dst], [src], size | 80+ | 7.0 | Async copy (4/8/16 bytes, global->shared) |
cp.async.commit_group | -- | -- | 80+ | 7.0 | Commit outstanding async copies |
cp.async.wait_group | -- | N | 80+ | 7.0 | Wait for async copy group completion |
cp.async.wait_all | -- | -- | 80+ | 7.0 | Wait for all async copies |
cp.async.mbarrier.arrive | -- | [mbar] | 80+ | 7.0 | Arrive at mbarrier on async copy completion |
cp.async.bulk | -- | [dst], [src], size | 90+ | 8.0 | Bulk async copy (TMA) |
cp.async.bulk.tensor | -- | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy (TMA, 1-5D tiles) |
cp.async.bulk.prefetch | -- | [src], size | 90+ | 8.0 | Prefetch via TMA |
cp.async.bulk.prefetch.tensor | -- | [src], dims... | 90+ | 8.0 | Tensor prefetch via TMA |
cp.async.bulk.commit_group | -- | -- | 90+ | 8.0 | Commit bulk async group |
cp.async.bulk.wait_group | -- | N | 90+ | 8.0 | Wait for bulk group completion |
cp.reduce.async.bulk | .add .min .max .and .or .xor .inc .dec | [dst], [src], size | 90+ | 8.0 | Bulk async copy with reduction |
cp.reduce.async.bulk.tensor | .add .min .max .and .or .xor .inc .dec | [dst], [src], dims... | 90+ | 8.0 | Tensor async copy with reduction |
st.async | .b32 .b64 .b128 | [a], b | 90+ | 8.1 | Async store |
st.bulk | -- | [a], b, size | 90+ | 8.0 | Bulk store |
red.async | .add .min .max .and .or .xor .inc .dec | [a], b | 90+ | 8.1 | Async reduction |
Multimem
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
multimem.ld_reduce | .f16 .bf16 .f32 .u32 .s32 .u64 | d, [a] | 90+ | 8.1 | Multicast memory load with reduction |
multimem.st | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory store |
multimem.red | .f16 .bf16 .f32 .u32 .s32 .u64 | [a], b | 90+ | 8.1 | Multicast memory reduction |
Cache Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
discard | .global .L2 | [a], size | 80+ | 7.4 | Discard data (hint: no writeback) |
applypriority | .global .L2 | [a], size, prio | 80+ | 7.4 | Set cache eviction priority |
createpolicy.cvt | .L2 | d, imm | 80+ | 7.4 | Create cache policy from immediate |
createpolicy.fractional | .L2 | d, fraction | 80+ | 7.4 | Create fractional cache policy |
createpolicy.range | .L2 | d, lo, hi, hit, miss | 80+ | 7.4 | Create cache policy for address range |
Conversion
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
cvt | all int/float combinations | d, a | all | 1.0 | Type conversion (rounding: .rn .rz .rm .rp .rna) |
cvt.pack | .b16.b32 .b32.b32 | d, a, b | 80+ | 7.1 | Pack two values into one register |
cvt type combinations (from instruction table encoding strings):
The instruction table builder registers extensive type-pair combinations for cvt. Representative signatures:
| Source | Destination | Notes |
|---|---|---|
F[16|32|64] | F[16|32|64] | Float-to-float, rounding modes apply |
F[16|32|64] | I[8|16|32|64] | Float-to-integer, rounding + saturation |
I[8|16|32|64] | F[16|32|64] | Integer-to-float, rounding modes |
I[8|16|32|64] | I[8|16|32|64] | Integer-to-integer, sign extend / truncate |
E16 | F[16|64] / I[8|16|32|64] | bf16 source conversions (sm_80+) |
F[16|64] / I[8|16|32|64] | E16 | bf16 destination conversions (sm_80+) |
H32 (tf32) | various | TensorFloat-32 conversions (sm_80+) |
Q16 (fp8 e5m2) | F32 / H32 / E32 | FP8 conversions (sm_89+) |
R8 (fp8 e4m3) | F32 / H32 / E32 | FP8 e4m3 conversions (sm_89+) |
R16 (fp4) | F32 | FP4 conversions (sm_100+, PTX 8.6) |
Q32 (fp8 e5m2) | F32 | Extended FP8 conversions (sm_100+) |
The modifiers .ftz (flush-to-zero), .sat (saturation), and .relu (clamp negative values to 0) are recognized. The .rna rounding mode (round-to-nearest-away, PTX 8.3+) is registered for cvt.tf32.f32.
szext -- Sign/Zero Extend
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
szext | .b32 | d, a, pos | 90+ | 8.0 | Sign- or zero-extend at bit position |
Texture and Surface
Texture
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tex | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord] | all | 1.0 | Texture fetch with sampler |
tex.base | .1d .2d .3d | d, [tex, coord] | 60+ | 5.0 | Texture fetch base level |
tex.level | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, lod] | all | 1.0 | Texture fetch at explicit LOD |
tex.grad | .1d .2d .3d .a1d .a2d .cube .acube | d, [tex, sampler, coord, dPdx, dPdy] | all | 1.0 | Texture fetch with explicit gradients |
tld4 | .2d .a2d | d, [tex, sampler, coord] | 20+ | 2.0 | Texture gather (4 texels) |
txq | .width .height .depth .channel_data_type .channel_order .normalized_coords .filter_mode .addr_mode_0 .addr_mode_1 .addr_mode_2 .samp_pos .num_mip_levels .num_samples | d, [tex] | 20+ | 2.0 | Texture query |
Return types for texture ops: .v4.s32 .v4.u32 .v4.f32 (4-component return).
Surface
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
suld.b | .1d .2d .3d .a1d .a2d | d, [surf, coord] | 20+ | 2.0 | Surface load (bindless) |
sust.b | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (bindless) |
sust.p | .1d .2d .3d .a1d .a2d | [surf, coord], a | 20+ | 2.0 | Surface store (packed format) |
sured.b | .1d .2d .3d | d, [surf, coord], a | 20+ | 2.0 | Surface reduction (.add .min .max .and .or) |
suq | .width .height .depth .channel_data_type .channel_order .array_size | d, [surf] | 20+ | 2.0 | Surface query |
Surface clamp modes: .trap .clamp .zero
Tensormap
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tensormap.replace | .tile .im2col | d, [tmap], field, value | 90+ | 8.0 | Replace tensormap field at runtime |
tensormap.cp_fenceproxy | -- | [tmap] | 90+ | 8.0 | Tensormap copy fence proxy |
Control Flow
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bra | -- | target | all | 1.0 | Branch (unconditional or predicated) |
bra.uni | -- | target | all | 1.0 | Uniform branch (all threads take same direction) |
brx.idx | -- | a, [targets] | 70+ | 6.0 | Indexed branch (jump table) |
call | -- | (ret), func, (params) | 20+ | 2.0 | Function call (with ABI) |
ret | -- | -- | 20+ | 2.0 | Return from function |
exit | -- | -- | all | 1.0 | Exit kernel / terminate thread |
trap | -- | -- | all | 1.0 | Trigger error |
brkpt | -- | -- | 11+ | 1.0 | Breakpoint (debugger halt) |
pmevent | -- | imm | 20+ | 2.0 | Performance monitor event |
pmevent.mask | -- | imm | 20+ | 3.0 | Performance monitor event with mask |
nanosleep | -- | t | 70+ | 6.3 | Sleep for t nanoseconds |
Synchronization and Barriers
Legacy Barrier
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bar.sync | -- | a{, b} | all | 1.0 | Barrier synchronize (CTA-level) |
bar.arrive | -- | a, b | all | 1.0 | Barrier arrive (non-blocking) |
bar.red | .and .or .popc | d, a, {b}, p | all | 1.0 | Barrier with reduction |
bar.warp | .sync | membermask | 70+ | 6.0 | Warp-level barrier |
Named Barrier (PTX 6.0+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
barrier | -- | a{, b} | 70+ | 6.0 | Named barrier synchronize |
barrier.arrive | -- | a, b | 70+ | 6.0 | Named barrier arrive |
barrier.red | .and .or .popc | d, a, {b}, p | 70+ | 6.0 | Named barrier with reduction |
CTA-Cluster Barrier (PTX 7.8+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
bar.cta | -- | -- | 90+ | 7.8 | CTA-level barrier sync |
bar.cta.arrive | -- | -- | 90+ | 7.8 | CTA-level barrier arrive |
bar.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA barrier with reduction |
barrier.cta | -- | -- | 90+ | 7.8 | CTA named barrier sync |
barrier.cta.arrive | -- | -- | 90+ | 7.8 | CTA named barrier arrive |
barrier.cta.red | .and .or .popc | d, p | 90+ | 7.8 | CTA named barrier with reduction |
barrier.cluster.arrive | -- | -- | 90+ | 7.8 | Cluster-level barrier arrive |
barrier.cluster.wait | -- | -- | 90+ | 7.8 | Cluster-level barrier wait |
Asynchronous Barriers (mbarrier)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mbarrier.init | .shared.b64 | [mbar], count | 80+ | 7.0 | Initialize mbarrier with expected count |
mbarrier.inval | .shared.b64 | [mbar] | 80+ | 7.0 | Invalidate mbarrier |
mbarrier.arrive | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive at mbarrier |
mbarrier.arrive_drop | .shared.b64 | state, [mbar] | 80+ | 7.0 | Arrive and drop expected count |
mbarrier.test_wait | .shared.b64 | p, [mbar], state | 80+ | 7.0 | Test if mbarrier phase complete |
mbarrier.test_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.1 | Test mbarrier parity |
mbarrier.try_wait | .shared.b64 | p, [mbar], state | 80+ | 7.8 | Try-wait on mbarrier (with timeout) |
mbarrier.try_wait.parity | .shared.b64 | p, [mbar], parity | 80+ | 7.8 | Try-wait on mbarrier parity |
mbarrier.pending_count | .b64 | d, state | 80+ | 7.0 | Get pending arrival count |
mbarrier.complete_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Complete transaction at mbarrier |
mbarrier.expect_tx | .shared.b64 | [mbar], count | 90+ | 8.0 | Set expected transaction count |
mbarrier.tx | -- | [mbar], count | 90+ | 8.0 | Transaction mbarrier arrive |
mbarrier.arrive.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive with expected tx count |
mbarrier.arrive_drop.expect_tx | .shared.b64 | state, [mbar], count | 90+ | 8.0 | Arrive-drop with expected tx count |
Memory Fence
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
membar | .cta .gl .sys | -- | all | 1.0 | Memory barrier (scope) |
membar.proxy | .alias | -- | 75+ | 6.4 | Proxy memory barrier (alias scope) |
fence.proxy | .alias .async .async.global .async.shared::cta | -- | 70+ | 6.0 | Fence proxy (alias/async) |
fence.proxy.tensormap | -- | [addr] | 90+ | 8.0 | Fence tensormap proxy |
Grid Dependency Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
griddepcontrol | .launch_dependents .wait | -- | 90+ | 7.8 | Grid dependency control |
Atomic and Reduction
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
atom | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | d, [a], b | all | 1.1 | Atomic RMW (.add .min .max .and .or .xor .exch .cas .inc .dec) |
atom.global | (same as atom) | d, [a], b | all | 1.1 | Atomic on global memory |
atom.shared | (same as atom) | d, [a], b | all | 1.1 | Atomic on shared memory |
red | .s32 .u32 .u64 .f32 .f64 .b32 .b64 .f16x2 .bf16x2 | [a], b | all | 1.1 | Reduction (no return value) |
red.global | (same as red) | [a], b | all | 1.1 | Reduction on global memory |
Atom/red scope modifiers (PTX 6.0+): .cta .gpu .sys .cluster
Matrix (MMA / Tensor Core)
WMMA (Warp Matrix Multiply-Accumulate, PTX 6.0+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
wmma.load.a | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix A fragment |
wmma.load.b | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load matrix B fragment |
wmma.load.c | .sync .aligned | d, [ptr], stride | 70+ | 6.0 | Load accumulator C fragment |
wmma.store.d | .sync .aligned | [ptr], d, stride | 70+ | 6.0 | Store result D fragment |
wmma.mma | .sync .aligned | d, a, b, c | 70+ | 6.0 | Matrix multiply-accumulate |
WMMA shapes (from validator sub_4BFED0): .m16n16k16 .m32n8k16 .m8n32k16 .m16n16k8 etc.
WMMA type combinations (from string table): F16F16F16F16, F32F16F16F32, F32F32 (TF32), I32I8I8I32, I32I4I4I32, I32B1B1I32, F64F64F64F64
MMA (PTX 6.5+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
mma | .sync .aligned | d, a, b, c | 75+ | 6.5 | Matrix multiply-accumulate (Turing+) |
MMA type combinations (verified from instruction table builder strings):
| D type | A type | B type | C type | SM | Notes |
|---|---|---|---|---|---|
F16 | F16 | F16 | F16 | 75+ | Native FP16 |
F32 | F16 | F16 | F32 | 75+ | Mixed-precision |
F32 | F32 | F32 | -- | 80+ | TF32 Tensor Core |
F32 | E16 | E16 | F32 | 80+ | BFloat16 |
F32 | T32 | T32 | F32 | 80+ | TF32 path (string: F32T32T32F32) |
I32 | I8 | I8 | I32 | 75+ | INT8 |
I32 | I4 | I4 | I32 | 75+ | INT4 |
I32 | B1 | B1 | I32 | 75+ | Binary (1-bit) |
F64 | F64 | F64 | F64 | 80+ | Double-precision |
F16 | Q8 | Q8 | F16 | 89+ | FP8 (e5m2) |
F32 | Q8 | Q8 | F32 | 89+ | FP8 mixed |
F16 | R4 | Q8 | F16 | 100+ | FP4 x FP8 |
F32 | R4 | Q8 | F32 | 100+ | FP4 x FP8 mixed |
F32 | R4 | R4 | F32 | 100+ | FP4 x FP4 |
F32 | R4 | R4 | F32.Q8 | 100+ | FP4 with scale (string: F32R4R4F32Q8) |
F32 | Q8 | Q8 | F32.Q8 | 100+ | FP8 with block scale |
MMA shapes: .m16n8k16 .m16n8k32 .m16n8k64 .m16n8k128 .m16n8k256
Sparse MMA modifiers: .sp with metadata selector and sparsity pattern
WGMMA (Warp-Group MMA, PTX 7.8+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
wgmma.mma_async | .aligned | d, a_desc, b_desc | 90+ | 7.8 | Warp-group async matrix multiply |
wgmma.fence | .aligned | -- | 90+ | 7.8 | WGMMA fence (ordering) |
wgmma.commit_group | .aligned | -- | 90+ | 7.8 | Commit WGMMA group |
wgmma.wait_group | .aligned | N | 90+ | 7.8 | Wait for WGMMA group completion |
WGMMA operand encoding strings (from instruction table, selection):
hUUhP, fUUfP, hUhhP, fUhfP, hhUhP, fhUfP (H=half dest, F=float dest, U=desc operand, P=pred)
With accumulator: hUUhdC, fUUfdC, hUhhdC, fUhfdC
With scale: hUUhdCP, fUUfdCP (P=pred control for scale)
TCGen05 (5th Generation Tensor Core, sm_100+)
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
tcgen05.mma | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen tensor core MMA |
tcgen05.mma.ws | -- | d, a_desc, b_desc | 100+ | 8.6 | 5th-gen MMA with warpgroup scale |
tcgen05.ld | -- | d, [desc] | 100+ | 8.6 | TC load from descriptor |
tcgen05.ld.red | -- | d, [desc], src | 100+ | 8.6 | TC load with reduction |
tcgen05.st | -- | [desc], src | 100+ | 8.6 | TC store to descriptor |
tcgen05.cp | -- | [desc], [src] | 100+ | 8.6 | TC copy |
tcgen05.commit | -- | [mbar] | 100+ | 8.6 | TC commit |
tcgen05.shift | -- | [desc] | 100+ | 8.6 | TC shift accumulator |
tcgen05.alloc | -- | d, nCols | 100+ | 8.6 | Allocate TC columns |
tcgen05.dealloc | -- | nCols | 100+ | 8.6 | Deallocate TC columns |
tcgen05.relinquish_alloc_permit | -- | -- | 100+ | 8.6 | Relinquish TC allocation permit |
tcgen05.fence | -- | -- | 100+ | 8.6 | TC fence |
tcgen05.wait | -- | -- | 100+ | 8.6 | TC wait |
TCGen05 MMA operand encodings (from instruction table):
MUUuP, MUUMuP, MMUuP, MMUMuP (M=matrix, U=desc, u=uniform, P=pred)
With accumulator descriptors: MUUudP, MUUuPC, MMUudP, MMUuPC, MUUMudP, MMUMudPC
With metadata: MUUuMMP, MUUMuMMP, MMUuMMP, MMUMuMMP (sparse)
TCGen05 Guardrails (internal)
These are internal debug/verification instructions, not user-facing PTX:
| Mnemonic | SM | Description |
|---|---|---|
_tcgen05.guardrails.is_phase_valid | 100+ | Validate TC phase |
_tcgen05.guardrails.are_columns_allocated | 100+ | Check column allocation |
_tcgen05.guardrails.is_current_warp_valid_owner | 100+ | Check warp ownership |
_tcgen05.guardrails.in_physical_bounds | 100+ | Check physical bounds |
_tcgen05.guardrails.allocation_granularity | 100+ | Validate allocation granularity |
_tcgen05.guardrails.datapath_alignment | 100+ | Validate datapath alignment |
_tcgen05.guardrails.sp_consistency_across_idesc_mod | 100+ | Sparse consistency check |
_tcgen05.guardrails.check_sparse_usage | 100+ | Validate sparse usage |
Matrix Utility Instructions
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
ldmatrix | .sync .aligned .m8n8 .num | d, [ptr] | 75+ | 6.5 | Load matrix from shared memory |
stmatrix | .sync .aligned .m8n8 .num | [ptr], a | 90+ | 7.8 | Store matrix to shared memory |
movmatrix | .aligned | d, a | 80+ | 7.1 | Move/transform matrix fragment |
SIMD Video Instructions
These 8/16-bit SIMD instructions operate on packed sub-word elements within 32-bit registers.
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
vadd | .s32.s32 .s32.u32 .u32.s32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD add with secondary op |
vsub | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD subtract |
vmad | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD multiply-add |
vmin | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD minimum |
vmax | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD maximum |
vabsdiff | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD absolute difference |
vset | (same) | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD set on compare |
vshl | .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift left |
vshr | .s32.u32 .u32.u32 | d, a.asel, b.bsel, c | 20+ | 2.0 | SIMD shift right |
vadd2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD add |
vsub2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD subtract |
vmin2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD minimum |
vmax2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit SIMD maximum |
vabsdiff2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit absolute difference |
vset2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit set on compare |
vavrg2 | (packed-pairs) | d, a, b, c | 30+ | 3.0 | Dual 16-bit average |
vadd4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD add |
vsub4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD subtract |
vmin4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD minimum |
vmax4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit SIMD maximum |
vabsdiff4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit absolute difference |
vset4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit set on compare |
vavrg4 | (packed-quads) | d, a, b, c | 30+ | 3.0 | Quad 8-bit average |
Element selectors: .b0 .b1 .b2 .b3 (byte), .h0 .h1 (half-word)
Secondary ops: .add .min .max
Saturation: .sat
Miscellaneous
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
getctarank | .shared .global | d, a | 90+ | 7.8 | Get CTA rank in cluster from address |
istypep | .texref .samplerref .surfref | p, a | all | 1.0 | Test if variable is a type |
preexit | -- | -- | all | 1.0 | Pre-exit notification |
stacksave | -- | d | 20+ | 2.0 | Save stack pointer |
stackrestore | -- | a | 20+ | 2.0 | Restore stack pointer |
alloca | -- | d, size | 20+ | 2.0 | Dynamic stack allocation |
clusterlaunchcontrol.try_cancel.async | -- | [mbar], d | 100+ | 8.7 | Cluster launch cancel (async) |
clusterlaunchcontrol.query_cancel | -- | d, [mbar] | 100+ | 8.7 | Query cluster launch cancel status |
Register and Shared Memory Control
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
setmaxnreg.inc | -- | N | 90+ | 7.8 | Increase max register count |
setmaxnreg.dec | -- | N | 90+ | 7.8 | Decrease max register count |
setmaxreg.alloc | -- | N | 100+ | 8.6 | Allocate registers to max |
setmaxreg.dealloc | -- | N | 100+ | 8.6 | Deallocate from max registers |
setmaxreg.try_alloc | -- | d, N | 100+ | 8.6 | Try-allocate registers |
setsmemsize | -- | N | 90+ | 7.8 | Set dynamic shared memory size |
setsmemsize.flush | -- | N | 100+ | 8.6 | Set shared memory size with flush |
getnextworkid | -- | d | 90+ | 8.0 | Get next dynamic work unit ID |
Warp-Group Management
| Mnemonic | Type suffixes | Operands | SM | PTX | Description |
|---|---|---|---|---|---|
_warpgroup.arrive | -- | -- | 90+ | 7.8 | Warpgroup arrive (internal) |
_warpgroup.wait | -- | N | 90+ | 7.8 | Warpgroup wait |
_warpgroup.commit_batch | -- | -- | 90+ | 7.8 | Warpgroup commit batch |
Internal Instructions
These underscore-prefixed instructions are not part of the public PTX ISA. They are generated internally by ptxas during lowering, stub synthesis, or as pre-codegen IR representations. All are registered in the instruction table builder sub_46E000 and appear in --dumpir output, but users never write them directly.
Internal Memory
| Mnemonic | Type suffixes | Operands | String addr | Handler / Formatter | Description |
|---|---|---|---|---|---|
_ldldu | (varies) | d, [a] | 0x1d080ee | formatter sub_4DD860 | Unified load-uniform; combines ld+ldu semantics for uniform-cache-path loads |
_ldsm | .b8 .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, [M] | 0x1d076c2 | handlers sub_46B0C0--sub_46B160, validator sub_4AEB60 | Load shared matrix; loads matrix tiles from shared memory into registers for MMA. Opcode ID 28 |
_movm | .b16 .s8.s4 .u8.u4 .s4.s2 .u4.u2 | d, a | 0x1d076da | handlers sub_46B1B0--sub_46B260 | Move matrix; register-to-register matrix data movement with optional format conversion. Opcode ID 29 |
Internal Cache Control
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_createpolicy.fractional | .L2 | d, fraction | 0x1d0813a | 0x47752f | Internal form of createpolicy.fractional; creates fractional L2 cache eviction policy |
_createpolicy.range | .L2 | d, lo, hi, hit, miss | 0x1d08158 | 0x477579 | Internal form of createpolicy.range; creates L2 policy for address range |
Internal Surface
| Mnemonic | Type suffixes | Operands | String addr | Table builder xrefs | Description |
|---|---|---|---|---|---|
_sulea.b | (varies) | d, [surf, coord] | 0x1d088bc | 0x4815cd, 0x48166b | Surface load effective address, bindless; computes address for suld.b without performing the load |
_sulea.p | (varies) | d, [surf, coord] | 0x1d088c5 | 0x48161c, 0x4816ba | Surface load effective address, packed; computes address for sust.p-mode surface access |
Internal FP / Guard
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_checkfp.divide | (varies) | d, a, b | 0x1d088d2 | 0x481709 | FP division guard; inserted during lowering to validate divisor (handles division-by-zero, denormals) before SASS div emission |
Internal Control Flow / ABI
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_gen_proto | -- | (opaque) | 0x1d08903 | 0x48189a | Generate function prototype; synthesizes call prototypes for indirect / device-runtime calls during ABI resolution |
_jcall | -- | target | 0x1d0890e | 0x4818df | Internal jump-call; used inside auto-generated unified-function-stub (UFT) wrappers synthesized by sub_451680 (.func .attribute(.unified_func_stub) __cuda_uf_stub_%s() { _jcall %s; }) |
Internal Warp
| Mnemonic | Type suffixes | Operands | String addr | Table builder xref | Description |
|---|---|---|---|---|---|
_match | (varies) | d, a | 0x1d08a24 | 0x483404 | Internal match; pre-sync lowered form of warp match instruction, distinct from the public match.sync |
Internal MMA / Tensor Core
| Mnemonic | Type suffixes | Operands | String addr | Handlers | Description |
|---|---|---|---|---|---|
_mma.warpgroup | 135 type combos (F16, BF16, TF32, FP8, INT8) | d, a, b, c | 0x1d072e3 | 135 handlers sub_4668A0--sub_469FD0 | Warp-group MMA; pre-codegen form of WGMMA. Each handler registers one (src, dst, acc) type triple via sub_465030. Lowers to MERCURY_warpgroup_mma_* SASS opcodes (sm_90+) |
_zzn.z8a8x4 | (sub-byte int) | d, a, b, c | 0x1cfdc03 | data table at 0x1cfe678 | ROT13-obfuscated _mma.m8n8k4; sub-byte integer MMA with tile shape m8n8k4 for INT4/INT2 and bit-level XOR MMA (sm_75+) |
Handler address summary for internal instructions:
| Range | Contents |
|---|---|
sub_46B0C0--sub_46B260 | _ldsm (3) + _movm (3) type-variant handlers |
sub_4668A0--sub_469FD0 | _mma.warpgroup 135 type-variant handlers |
sub_4AEB60 | _ldsm validator (3.7 KB) -- handles .s8.s4/.u8.u4 format rules |
sub_451680 | _jcall UFT stub generator |
sub_4DD860 | _ldldu PTX text formatter |
Instruction Table Builder Internals
Registration Mechanism
The 93 KB function sub_46E000 runs once during parser initialization (sub_451730). Each of its 1,141 calls to sub_46BED0 has the form:
sub_46BED0(table, operand_encoding, opcode_id, opcode_name,
type_flags, sm_requirement, xmm_data, extra);
Where:
- operand_encoding is a compact string like "F32F32", "I32I8I8I32", or "MUUuP"
- opcode_id is the internal opcode integer (mapped 1:1 to Ori IR opcodes from ctor_003)
- type_flags is a bitfield encoding which .sNN / .uNN / .fNN / .bNN qualifiers are legal
- sm_requirement gates the instruction to architectures >= this SM version
The operand encoding characters are:
| Char | ID | Meaning | Type bits registered |
|---|---|---|---|
F | 1 | Float operand | .f16 .f32 .f64 |
H | 2 | Half-precision | .f16 .f16x2 |
N | 3 | Immediate / numeric | (no type suffix) |
I | 4 | Integer operand | .s8--.s64 .u8--.u64 |
B | 5 | Bitwise operand | .b8--.b128 |
P | 6 | Predicate | .pred |
O | 7 | Optional operand | (no type suffix) |
E | 8 | Extended type | .bf16 .e4m3 .e5m2 |
Q | 10 | FP8 type | .e5m2 .e4m3 (fp8) |
R | 11 | FP4/narrow type | .e2m1 (fp4) |
M | -- | Matrix descriptor | Tensor core descriptor |
U | -- | Uniform descriptor | TMA/TC uniform register |
C | -- | Carry/accumulator | MMA accumulator control |
When a letter is followed by a digit, that digit constrains the bit-width: F32 means only .f32, I16 means only .s16/.u16. The function sub_1CB0850 registers each valid width into a bitset.
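The width-constraint rule above can be illustrated with a short sketch. This is a hypothetical reimplementation of the decode step, not recovered code: the type names, width bitset values, and the function name are assumptions; only the encoding grammar (class letter, optional width digits, at most 4 operand slots) comes from the analysis above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical sketch of how an operand-encoding string such as "F32F32" or
 * "I32I8I8I32" might be decoded into per-operand width bitsets (the role the
 * text attributes to sub_46BED0 / sub_1CB0850). Bit assignments are
 * illustrative, not recovered constants. */

#define W8   (1u << 0)
#define W16  (1u << 1)
#define W32  (1u << 2)
#define W64  (1u << 3)
#define WALL (W8 | W16 | W32 | W64)

typedef struct {
    char kind;       /* class letter: F, H, I, B, P, M, U, ... */
    uint32_t widths; /* bitset of legal bit-widths for this slot */
} operand_slot;

/* Returns the number of operand slots decoded (max 4, per descriptor layout). */
int decode_encoding(const char *enc, operand_slot out[4]) {
    int n = 0;
    while (*enc && n < 4) {
        out[n].kind = *enc++;
        if (isdigit((unsigned char)*enc)) {
            /* A trailing digit constrains the width: F32 -> .f32 only. */
            int w = 0;
            while (isdigit((unsigned char)*enc))
                w = w * 10 + (*enc++ - '0');
            out[n].widths = (w == 8)  ? W8  :
                            (w == 16) ? W16 :
                            (w == 32) ? W32 :
                            (w == 64) ? W64 : 0;
        } else {
            out[n].widths = WALL; /* no digit: all widths the class allows */
        }
        n++;
    }
    return n;
}
```

Under this reading, "F32F32" yields two slots locked to 32-bit, while a digitless letter such as the P in "MUUuP" leaves the slot unconstrained.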
Descriptor Layout
Each registered instruction creates a 368-byte descriptor node via sub_424070. Key fields:
| Offset | Field | Description |
|---|---|---|
| +0 | opcode_id | Internal opcode identifier |
| +8 | type_flags | Bitfield of legal type qualifiers |
| +12 | xmm_data | 128-bit SIMD data (rounding, modifiers) |
| +28 | extra_flags | Architecture / behavior flags |
| +36 | operand_count | Number of operands |
| +40+ | operand_slots | Per-operand type bitsets (4 slots max, each 8 bytes) |
| +232 | name_length | Length of the opcode name string |
The two hash tables at offsets 2472 and 2480 in the instruction table provide dual-path lookup -- the first table is the primary lookup; if the opcode is not found, the second table is checked. This two-table scheme separates core instructions from extended/variant instructions.
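The dual-path lookup can be sketched as follows. ptxas uses two hash tables; flat arrays and linear search stand in here so the example is self-contained, and the function names and the "add"/"mov"/"cvt" entries are illustrative (the _ldsm/_movm opcode IDs 28/29 are from the internal-instruction tables above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the two-table lookup scheme: primary (core) table first, then
 * the extended/variant table on a miss. Real ptxas hashes; arrays stand in. */

typedef struct { const char *name; int opcode_id; } descriptor;

static const descriptor core_table[] = {
    { "add", 1 }, { "mov", 2 }, { "cvt", 3 },      /* illustrative IDs */
};
static const descriptor variant_table[] = {
    { "_ldsm", 28 }, { "_movm", 29 },              /* IDs per tables above */
};

static const descriptor *find_in(const descriptor *t, size_t n,
                                 const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(t[i].name, name) == 0)
            return &t[i];
    return NULL;
}

/* Dual-path lookup: check the primary table, fall back to the second. */
const descriptor *lookup_opcode(const char *name) {
    const descriptor *d = find_in(core_table,
                                  sizeof core_table / sizeof *core_table,
                                  name);
    if (!d)
        d = find_in(variant_table,
                    sizeof variant_table / sizeof *variant_table, name);
    return d;
}
```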
Instruction Count by Category
Based on the 1,141 registration calls and the 431 decoded Ori IR opcodes (from ctor_003):
| Category | Approximate descriptor count |
|---|---|
| Integer arithmetic (add/sub/mul/mad/div/rem/...) | ~120 |
| Float arithmetic (fadd/fmul/fma/div/rcp/sqrt/...) | ~100 |
| Half/BF16 arithmetic | ~60 |
| Comparison and selection (setp/selp/set/slct) | ~50 |
| Logic and shift (and/or/xor/shl/shr/lop3/shf) | ~40 |
| Conversion (cvt + 100+ type combinations) | ~180 |
| Load/store/atomic (ld/st/atom/red + variants) | ~150 |
| Texture/surface (tex/suld/sust/sured/txq) | ~80 |
| Control flow (bra/call/ret/exit/bar/barrier) | ~50 |
| MMA/WMMA/WGMMA/TCGen05 | ~160 |
| SIMD video (vadd/vsub/vmin/vmax/...) | ~60 |
| Miscellaneous (prmt/bfe/bfi/shfl/vote/...) | ~91 |
| Total | 1,141 |
PTX 8.x New Instructions
Instructions introduced in PTX ISA 8.0--8.7 (CUDA 12.0--12.8), verified from SM architecture gates and PTX version checks in the ptxas v13.0.88 binary:
| PTX | Instruction | SM | Description |
|---|---|---|---|
| 8.0 | cp.async.bulk | 90+ | TMA bulk async copy |
| 8.0 | cp.async.bulk.tensor | 90+ | TMA tensor async copy |
| 8.0 | cp.reduce.async.bulk | 90+ | Bulk async copy with reduction |
| 8.0 | cp.reduce.async.bulk.tensor | 90+ | Tensor async copy with reduction |
| 8.0 | cp.async.bulk.prefetch | 90+ | TMA prefetch |
| 8.0 | cp.async.bulk.prefetch.tensor | 90+ | TMA tensor prefetch |
| 8.0 | cp.async.bulk.commit_group | 90+ | Commit TMA group |
| 8.0 | cp.async.bulk.wait_group | 90+ | Wait for TMA group |
| 8.0 | tensormap.replace | 90+ | Runtime tensormap modification |
| 8.0 | elect / elect.one | 90+ | Elect leader thread |
| 8.0 | mbarrier.complete_tx | 90+ | Transaction mbarrier completion |
| 8.0 | mbarrier.expect_tx | 90+ | Set expected transaction count |
| 8.0 | st.bulk | 90+ | Bulk store |
| 8.0 | szext | 90+ | Sign/zero extend at bit position |
| 8.0 | getnextworkid | 90+ | Dynamic work unit |
| 8.1 | st.async | 90+ | Asynchronous store |
| 8.1 | red.async | 90+ | Asynchronous reduction |
| 8.1 | multimem.ld_reduce | 90+ | Multicast load-reduce |
| 8.1 | multimem.st | 90+ | Multicast store |
| 8.1 | multimem.red | 90+ | Multicast reduction |
| 8.6 | tcgen05.mma | 100+ | 5th-gen tensor core MMA |
| 8.6 | tcgen05.mma.ws | 100+ | 5th-gen MMA with warp scale |
| 8.6 | tcgen05.ld / tcgen05.st | 100+ | TC load/store |
| 8.6 | tcgen05.ld.red | 100+ | TC load with reduction |
| 8.6 | tcgen05.cp | 100+ | TC copy |
| 8.6 | tcgen05.commit | 100+ | TC commit |
| 8.6 | tcgen05.shift | 100+ | TC shift accumulator |
| 8.6 | tcgen05.alloc / tcgen05.dealloc | 100+ | TC column allocation |
| 8.6 | tcgen05.relinquish_alloc_permit | 100+ | Relinquish TC permit |
| 8.6 | tcgen05.fence / tcgen05.wait | 100+ | TC fence/wait |
| 8.6 | setmaxreg.alloc / .dealloc / .try_alloc | 100+ | Register management |
| 8.6 | setsmemsize.flush | 100+ | Shared memory with flush |
| 8.7 | clusterlaunchcontrol.try_cancel.async | 100+ | Cluster launch cancel |
| 8.7 | clusterlaunchcontrol.query_cancel | 100+ | Query cancel status |
Cross-References
- PTX Parser (Flex + Bison) -- scanner, parser, instruction table builder details
- PTX Directive Handling -- .version, .target, .entry, etc.
- PTX-to-Ori Lowering -- how PTX instructions become Ori IR
- Ori IR Instructions -- the 431 Ori IR opcodes derived from PTX
- Intrinsic Table -- 608 built-in helper functions
- SM Architecture Map -- architecture version requirements
- SASS Opcode Catalog -- SASS equivalents of PTX instructions
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
sub_46E000 | 93KB | Instruction table builder; runs once during parser init, makes 1,141 calls to sub_46BED0 to register all PTX instruction descriptors | 0.95 |
sub_46BED0 | -- | Instruction descriptor registrar; called 1,141 times with operand encoding, opcode ID, type flags, SM requirement | 0.95 |
sub_46C690 | -- | Instruction lookup entry point; dispatches to sub_46C6E0 for name-to-descriptor matching | 0.90 |
sub_46C6E0 | 6.4KB | Instruction name matcher; resolves PTX mnemonic strings to instruction table descriptors | 0.90 |
sub_5D4190 | 12.9KB | PTX text formatter dispatch; 81-string + 473-entry hash table routing to per-instruction formatters | 0.90 |
sub_489390 | -- | SM version gate; validates minimum SM architecture for each instruction | 0.85 |
sub_489050 | -- | PTX ISA version gate; validates minimum PTX version for each instruction | 0.85 |
sub_4BFED0 | -- | WMMA shape validator; checks legal WMMA shape combinations (.m16n16k16, .m32n8k16, etc.) | 0.85 |
sub_451680 | -- | UFT stub generator; synthesizes _jcall wrappers for unified-function-stub entries | 0.90 |
sub_451730 | -- | Parser initialization; calls sub_46E000 to build the instruction table | 0.85 |
sub_4DD860 | -- | _ldldu PTX text formatter; handles internal unified load-uniform instruction | 0.85 |
sub_4AEB60 | 3.7KB | _ldsm validator; validates .s8.s4/.u8.u4 format rules for shared matrix loads | 0.85 |
sub_465030 | -- | MMA type triple registrar; called by 135 _mma.warpgroup handlers to register (src, dst, acc) type combinations | 0.85 |
sub_424070 | -- | Instruction descriptor allocator; creates 368-byte descriptor nodes for each registered instruction | 0.85 |
sub_1CB0850 | -- | Width bitset registrar; registers valid bit-widths (e.g., F32 -> .f32 only) into per-operand bitsets | 0.80 |
EIATTR Attribute Catalog
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
EIATTR (ELF Info ATTRibute) is NVIDIA's proprietary metadata system embedded in .nv.info ELF sections within CUBIN files. Every CUDA kernel carries EIATTR records that tell the GPU driver how many registers to allocate, how much shared memory to reserve, what barriers the kernel uses, and dozens of other resource descriptors. Without this metadata, the driver cannot launch the kernel -- it has no way to determine the kernel's hardware resource footprint.
ptxas v13.0.88 defines 97 EIATTR codes, numbered 0 through 96 (0x00--0x60). The code-to-name mapping was extracted from the pointer table at VA 0x23FDC20 in the ptxas binary (16-byte entries: 8-byte string pointer + 8-byte metadata word, indexed by code number). The string names reside at 0x23FC6C7--0x23FD040. Code assignments were cross-verified against the nvlink v13.0.88 pointer table at 0x1D37D60, confirming identical enumeration across both tools.
| ELF section type | SHT_CUDA_INFO = 0x70000064 |
| Section name (global) | .nv.info |
| Section name (per-function) | .nv.info.<function_name> |
| Record format | Type-Length-Value (TLV), 4-byte aligned |
| Known attribute count | 97 codes: 0--96 (v13.0.88) |
| Name table VA | 0x23FDC20 (97 entries x 16 bytes = 1,552 bytes) |
| EIATTR builder function | sub_1CC9800 (14,764 bytes, 90 KB decompiled -- third largest in output range) |
| Barrier/register propagator | sub_1CC8950 (2,634 bytes, propagates counts across call graph) |
| TLV record emitter | sub_1CC85F0 (44 lines, writes individual EIATTR records) |
| SM-version gating | sub_1C97840 (checks whether an EIATTR code is valid for a given SM version) |
TLV Record Format
Each .nv.info section contains a flat sequence of 4-byte-aligned TLV records. There is no section header or record count -- the parser walks from byte 0 to sh_size, consuming records sequentially.
Record Layout
Offset Size Field
------ ---- -----
0x00 1 format Format byte (determines payload structure)
0x01 1 attr_code EIATTR type code (0x00--0x60)
0x02 2 size Payload size in bytes (little-endian uint16)
0x04 var payload Attribute-specific data (size bytes)
Total record size = 4 + size, padded up to 4-byte alignment. The minimum record is 4 bytes (format + code + size=0, no payload).
Format Byte
The format byte at offset 0 controls how the payload is interpreted:
| Format | Name | Payload structure | Typical use |
|---|---|---|---|
0x01 | Free | Raw bytes, attribute-specific layout | Offset tables, parameter info |
0x02 | Value | Single 32-bit value (no symbol index) | Global flags |
0x03 | Sized | 16-bit value + padding | Counts, sizes |
0x04 | Indexed | [sym_index:4][value:4] -- per-symbol attribute | Per-kernel resources |
Format 0x04 (indexed) is the most common for per-function attributes. The 4-byte symbol index at payload offset 0 identifies which function the attribute applies to. The linker uses this index for symbol remapping during merge and for per-function property extraction during finalization.
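A minimal encoder for this format, as a sketch: the helper name and buffer API are assumptions (ptxas itself builds records through sub_1CC85F0 and a linked list, shown below); only the byte layout comes from the record-layout table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: serialize one format-0x04 (Indexed) EIATTR record into a caller
 * buffer. Payload is [sym_index:4][value:4]; all multi-byte fields are
 * little-endian. Returns the padded record size. */
size_t emit_indexed_record(uint8_t *buf, uint8_t attr_code,
                           uint32_t sym_idx, uint32_t value) {
    const uint16_t size = 8;              /* payload: [sym_index:4][value:4] */
    buf[0] = 0x04;                        /* format = Indexed */
    buf[1] = attr_code;                   /* EIATTR type code */
    buf[2] = (uint8_t)(size & 0xff);      /* payload size, little-endian */
    buf[3] = (uint8_t)(size >> 8);
    for (int i = 0; i < 4; i++) {         /* sym index + value, little-endian */
        buf[4 + i] = (uint8_t)(sym_idx >> (8 * i));
        buf[8 + i] = (uint8_t)(value   >> (8 * i));
    }
    return 4 + ((size + 3u) & ~3u);       /* header + payload, 4-byte aligned */
}
```

With an 8-byte payload the record is already aligned, so the returned size is 12; odd payload sizes would be padded up to the next multiple of 4, matching the walker pseudocode later in this page.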
Binary Evidence -- sub_1CC85F0
The TLV record emitter function directly confirms the encoding:
// sub_1CC85F0 -- simplified from decompilation
// a2 = attr_code, a3 = 16-bit payload size, a4 = payload data, a5 = symbol index
void emit_eiattr(void* elfw, uint8_t attr_code, uint16_t size, void* data, uint32_t sym_idx) {
    if (!is_valid_for_sm(attr_code, elfw->sm_version))   // SM gate, cf. sub_1C97840
        return;
    int section_index = get_nvinfo_section(elfw, sym_idx);
    // Allocate a 16-byte record node: 4-byte TLV header, 4-byte symbol
    // index, 8-byte pointer to the payload (serialized at section write-out)
    uint8_t* record = pool_alloc(16);
    record[0] = 0x04;                                    // format = Indexed
    record[1] = attr_code;                               // EIATTR type code
    *(uint16_t*)(record + 2) = size;                     // payload size
    *(uint32_t*)(record + 4) = section_index;            // symbol index
    *(void**)(record + 8) = data;                        // payload data pointer
    // Append to the .nv.info section's linked list
    list_append(record, &elfw->nvinfo_list);
}
Parsing Pseudocode
uint8_t *ptr = section_data;
uint8_t *end = section_data + section_size;
while (ptr < end) {
    uint8_t  format    = ptr[0];
    uint8_t  attr_code = ptr[1];
    uint16_t size      = *(uint16_t *)(ptr + 2);
    if (format == 0x04) {
        // Indexed: first 4 bytes of payload = symbol index
        uint32_t sym_idx = *(uint32_t *)(ptr + 4);
        uint32_t value   = *(uint32_t *)(ptr + 8);
        process_indexed_attribute(attr_code, sym_idx, value);
    } else if (format == 0x03) {
        // Sized: the 16-bit size field IS the value; no payload bytes follow
        process_global_attribute(attr_code, size);
        ptr += 4;
        continue;
    } else if (format == 0x02) {
        // Value: single 32-bit immediate
        uint32_t value = *(uint32_t *)(ptr + 4);
        process_global_attribute(attr_code, value);
    } else {
        // Free: attribute-specific handling
        process_raw_attribute(attr_code, ptr + 4, size);
    }
    ptr += 4 + ALIGN_UP(size, 4);
}
Section Variants
A cubin contains two kinds of .nv.info sections:
Global .nv.info -- A single section named .nv.info with sh_link = 0 (no associated symbol). Contains attributes that apply to the entire compilation unit: CUDA API version, compatibility flags, and shared metadata not specific to any one kernel.
Per-function .nv.info.<name> -- One section per kernel or device function, named .nv.info.<function_name> with sh_link pointing to the corresponding symbol table entry. Carries per-kernel resource descriptors: register count, barrier count, stack sizes, parameter bank layout, and instruction-offset tables.
Both section variants use sh_type = SHT_CUDA_INFO (0x70000064). The ELF section type is the authoritative way to identify .nv.info sections; the name is only a convention.
Complete Code Table
All 97 EIATTR codes in numeric order. Extracted from the ptxas pointer table at VA 0x23FDC20. The "Format" column reflects the typical TLV format byte used when emitting that attribute. The "Meta" column shows the metadata word from the pointer table (lo word encodes minimum toolkit version compatibility, hi word encodes flags).
| Code | Hex | Name | Format | Meta | Category |
|---|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | 1 | Sentinel |
| 1 | 0x01 | EIATTR_PAD | -- | 1 | Sentinel |
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Indexed | 1 | Texture |
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | 1 | Metadata |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Indexed | 1 | Metadata |
| 5 | 0x05 | EIATTR_MAX_THREADS | Indexed | 1 | Resource |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Indexed | 1 | Texture |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Indexed | 1 | Texture |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Indexed | 1 | Texture |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Indexed | 1 | Texture |
| 10 | 0x0A | EIATTR_PARAM_CBANK | Indexed | 1 | Param |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | 1 | Param |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | 1 | Param |
| 13 | 0x0D | EIATTR_SYNC_STACK | Indexed | 1 | Metadata |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | 1 | Texture |
| 15 | 0x0F | EIATTR_EXTERNS | Free | 1 | Metadata |
| 16 | 0x10 | EIATTR_REQNTID | Indexed | 1 | Resource |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Indexed | 1 | Resource |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Indexed | 1 | Resource |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Indexed | 1 | Texture |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | 1 | Texture |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Indexed | 1 | Texture |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Indexed | 1 | Texture |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | 1 | Param |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Indexed | 1 | Param |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | 1 | Param |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Indexed | 1 | Metadata |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | 1 | Resource |
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | 1 | Offsets |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | 1 | Offsets |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Indexed | 1 | Resource |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Indexed | 1 | Metadata |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Indexed | 1 | Metadata |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Indexed | 1 | Metadata |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Indexed | 1 | Metadata |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Indexed | 1 | Resource |
| 36 | 0x24 | EIATTR_SUQ_USED | Indexed | 1 | Metadata |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | 1 | Offsets |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Indexed | 1 | Metadata |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | 1 | Offsets |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | 1 | Offsets |
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Indexed | 1 | Cluster |
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | 1 | WAR |
| 43 | 0x2B | EIATTR_WMMA_USED | Indexed | 2 | Metadata |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Value | 3 | Metadata |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | 3 | Offsets |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | 5 | Offsets |
| 47 | 0x2F | EIATTR_REGCOUNT | Indexed | 5 | Resource |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | 5 | WAR |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | 5 | Offsets |
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Indexed | 5 | Shared |
| 51 | 0x33 | EIATTR_STATISTICS | Free | 5 | Metadata |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | 5 | Offsets |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | 5 | WAR |
| 54 | 0x36 | EIATTR_SW_WAR | Free | 5 | WAR |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Indexed | 5 | Metadata |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Indexed | 5 | Resource |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | 5 | Offsets |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | 5 | Offsets |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Indexed | 5 | Resource |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | 5 | Metadata |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Indexed | 5 | Cluster |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Indexed | 5 | Cluster |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Indexed | 5 | Cluster |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | 5 | Metadata |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Indexed | 5 | Shared |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Indexed | 5 | Shared |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | 5 | Metadata |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | 5 | Offsets |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | 5 | Param |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | 5 | Offsets |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | 5 | WAR |
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Indexed | 5 | Graphics |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Indexed | 5 | Graphics |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Indexed | 5 | Graphics |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Indexed | 5 | Metadata |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Indexed | 5 | Resource |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Indexed | 5 | Texture |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | 5 | Metadata |
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGEMENTS | Free | 5 | Blackwell |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | 5 | Blackwell |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Indexed | 5 | Blackwell |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Indexed | 5 | Blackwell |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Indexed | 5 | Blackwell |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | 5 | Blackwell |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | 5 | Metadata |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | 5 | Sentinel |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | 5 | Offsets |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Indexed | 5 | Metadata |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | 5 | Offsets |
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | 5 | Mercury |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Indexed | 5 | Cluster |
| 92 | 0x5C | EIATTR_SANITIZE | Indexed | 5 | Blackwell |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | 5 | Metadata |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | 5 | Metadata |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | 5 | Mercury |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | 5 | Sentinel |
Metadata Word Encoding
Each entry in the pointer table carries an 8-byte metadata word alongside the string pointer. The low 32 bits encode the minimum toolkit version required to parse this attribute. The high 32 bits encode flags (0 = legacy, 1 = internal-only, 2 = standard).
| Meta lo | Interpretation |
|---|---|
| 1 | Legacy attribute, present since earliest CUDA versions |
| 2 | Introduced in CUDA ~9.0 era (Volta) |
| 3 | Introduced in CUDA ~10.0 era (Turing) |
| 5 | Introduced in CUDA ~11.0+ era (Ampere and later) |
Codes 0--42 all carry meta=1 (legacy). The boundary at code 43 (EIATTR_WMMA_USED) marks the Volta-era expansion. Codes 46+ carry meta_lo=5, indicating the major expansion that happened with Ampere and continued through Blackwell.
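The lo/hi split of the metadata word can be sketched directly (hypothetical helper; the field semantics are as described above):

```python
def decode_meta(meta_word: int):
    """Split an 8-byte pointer-table metadata word into (meta_lo, meta_hi)."""
    return meta_word & 0xFFFFFFFF, meta_word >> 32

# Hypothetical Ampere-era entry: meta_lo = 5 (toolkit floor), meta_hi = 2 (standard)
assert decode_meta((2 << 32) | 5) == (5, 2)

# Legacy entries (codes 0--42) carry meta_lo = 1 with no flags set
assert decode_meta(1) == (1, 0)
```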
Attribute Categories
Resource Allocation (GPU Driver Critical)
These attributes directly control how the GPU driver allocates hardware resources for kernel launch. Incorrect values cause silent performance degradation or launch failure.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 47 | 0x2F | EIATTR_REGCOUNT | Indexed | Physical register count per thread. The GPU driver computes max_warps_per_SM = total_registers / (regcount * warp_size). Single most important occupancy-determining attribute. |
| 5 | 0x05 | EIATTR_MAX_THREADS | Indexed | Maximum threads per block (from .maxntid PTX directive). |
| 16 | 0x10 | EIATTR_REQNTID | Indexed | Required thread count per dimension (from .reqntid). |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Indexed | Per-thread local memory frame size in bytes. |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Indexed | Minimum stack size per thread (non-recursive case). |
| 35 | 0x23 | EIATTR_MAX_STACK_SIZE | Indexed | Maximum stack size per thread (recursive case, computed via call graph propagation). |
| 30 | 0x1E | EIATTR_CRS_STACK_SIZE | Indexed | Call-Return-Stack size for nested function calls. |
| 59 | 0x3B | EIATTR_SAM_REGION_STACK_SIZE | Indexed | SAM (Streaming Asynchronous Memory) region stack size. |
| 76 | 0x4C | EIATTR_NUM_BARRIERS | Indexed | Number of named barriers used (max 16 on most architectures). Propagated from callees to entry points by sub_1CC8950. |
| 56 | 0x38 | EIATTR_NUM_MBARRIERS | Indexed | Number of memory barriers (mbarrier objects) used. |
| 27 | 0x1B | EIATTR_MAXREG_COUNT | Sized | Maximum register count hint (from --maxrregcount or .maxnreg). |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | Dynamic register reconfiguration support (setmaxnreg instruction, sm_100+). |
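The occupancy relation quoted for EIATTR_REGCOUNT can be worked through numerically. This sketch implements only the stated formula; real drivers additionally clamp by shared memory, warp slots, and register-allocation granularity:

```python
def max_warps_per_sm(total_registers: int, regcount: int, warp_size: int = 32) -> int:
    # max_warps_per_SM = total_registers / (regcount * warp_size), rounded down
    return total_registers // (regcount * warp_size)

# 65,536 registers per SM at 64 registers/thread -> 32 resident warps
assert max_warps_per_sm(65536, 64) == 32
# Doubling regcount to 128 halves the warp bound
assert max_warps_per_sm(65536, 128) == 16
```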
Parameter Bank Layout
Describes how kernel parameters are laid out in constant memory bank 0 (c[0x0]).
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 10 | 0x0A | EIATTR_PARAM_CBANK | Indexed | Constant bank number and offset for kernel parameters. |
| 25 | 0x19 | EIATTR_CBANK_PARAM_SIZE | Sized | Size of the parameter constant bank in bytes. |
| 24 | 0x18 | EIATTR_SMEM_PARAM_SIZE | Indexed | Size of shared memory parameter region. |
| 11 | 0x0B | EIATTR_SMEM_PARAM_OFFSETS | Free | Offsets of parameters within shared memory. |
| 12 | 0x0C | EIATTR_CBANK_PARAM_OFFSETS | Free | Offsets of parameters within constant bank. |
| 23 | 0x17 | EIATTR_KPARAM_INFO | Free | Kernel parameter metadata (types, sizes, alignments). |
| 69 | 0x45 | EIATTR_KPARAM_INFO_V2 | Free | Extended kernel parameter info (v2 format with additional fields, no metadata version constraint). |
Instruction Offset Tables
Record byte offsets of specific instruction types within the kernel's .text section, enabling the driver and tools to locate and patch instructions at load time.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 28 | 0x1C | EIATTR_EXIT_INSTR_OFFSETS | Free | Byte offsets of all EXIT instructions. |
| 29 | 0x1D | EIATTR_S2RCTAID_INSTR_OFFSETS | Free | Offsets of S2R instructions reading SR_CTAID (CTA ID). Used for cluster launch CTA-ID remapping. |
| 37 | 0x25 | EIATTR_LD_CACHEMOD_INSTR_OFFSETS | Free | Offsets of load instructions with explicit cache modifier. |
| 39 | 0x27 | EIATTR_ATOM_SYS_INSTR_OFFSETS | Free | Offsets of atomic instructions with .sys scope. |
| 40 | 0x28 | EIATTR_COOP_GROUP_INSTR_OFFSETS | Free | Offsets of cooperative group instructions. |
| 45 | 0x2D | EIATTR_ATOMF16_EMUL_INSTR_OFFSETS | Free | Offsets of emulated FP16 atomic instructions. |
| 46 | 0x2E | EIATTR_ATOM16_EMUL_INSTR_REG_MAP | Free | Register map for 16-bit atomic emulation. |
| 49 | 0x31 | EIATTR_INT_WARP_WIDE_INSTR_OFFSETS | Free | Offsets of integer warp-wide instructions. |
| 52 | 0x34 | EIATTR_INDIRECT_BRANCH_TARGETS | Free | Valid targets of indirect branches (for control flow integrity). |
| 57 | 0x39 | EIATTR_MBARRIER_INSTR_OFFSETS | Free | Offsets of MBAR (memory barrier) instructions. |
| 58 | 0x3A | EIATTR_COROUTINE_RESUME_OFFSETS | Free | Resume point offsets for device-side coroutines. Variant name EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8. |
| 68 | 0x44 | EIATTR_UNUSED_LOAD_BYTE_OFFSET | Free | Byte offset of unused load instruction. |
| 70 | 0x46 | EIATTR_SYSCALL_OFFSETS | Free | Offsets of __cuda_syscall invocations. |
| 87 | 0x57 | EIATTR_STACK_CANARY_TRAP_OFFSETS | Free | Offsets of stack canary trap instructions (stack protector). |
| 89 | 0x59 | EIATTR_LOCAL_CTA_ASYNC_STORE_OFFSETS | Free | Offsets of CTA-local async store instructions. |
Texture and Surface Binding
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Indexed | Texture/surface image slot assignment. |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Indexed | Offset within the image descriptor table. |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Indexed | Size of the image descriptor. |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | Indexed | Whether texture coordinates are normalized. |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Indexed | Sampler initialization parameters. |
| 14 | 0x0E | EIATTR_TEXID_SAMPID_MAP | Free | Texture ID to sampler ID mapping table. |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Indexed | Force unnormalized sampler coordinates. |
| 20 | 0x14 | EIATTR_BINDLESS_IMAGE_OFFSETS | Free | Offsets for bindless image references. |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Indexed | Constant bank used for bindless texture descriptors. |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Indexed | Constant bank used for bindless surface descriptors. |
| 77 | 0x4D | EIATTR_TEXMODE_INDEPENDENT | Indexed | Independent texture mode flag. |
Cluster and Cooperative Launch (sm_90+)
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 41 | 0x29 | EIATTR_COOP_GROUP_MASK_REGIDS | Indexed | Register IDs used for cooperative group masks. |
| 61 | 0x3D | EIATTR_CTA_PER_CLUSTER | Indexed | Number of CTAs per cluster (Hopper cluster launch). |
| 62 | 0x3E | EIATTR_EXPLICIT_CLUSTER | Indexed | Kernel uses explicit cluster dimensions. |
| 63 | 0x3F | EIATTR_MAX_CLUSTER_RANK | Indexed | Maximum cluster rank for scheduling. |
| 91 | 0x5B | EIATTR_BLOCKS_ARE_CLUSTERS | Indexed | CTA blocks are clusters flag. |
Shared Memory and Reserved Resources
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 50 | 0x32 | EIATTR_SHARED_SCRATCH | Indexed | Shared memory scratch space for register spilling. |
| 65 | 0x41 | EIATTR_RESERVED_SMEM_USED | Indexed | Whether reserved shared memory is used. |
| 66 | 0x42 | EIATTR_RESERVED_SMEM_0_SIZE | Indexed | Size of reserved shared memory partition 0. |
Software Workarounds
Hardware errata requiring instruction-level patching by the driver. Each WAR attribute carries a list of instruction byte offsets that the driver must modify at kernel load time.
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 42 | 0x2A | EIATTR_SW1850030_WAR | Free | Workaround for HW bug 1850030. |
| 48 | 0x30 | EIATTR_SW2393858_WAR | Free | Workaround for HW bug 2393858. |
| 53 | 0x35 | EIATTR_SW2861232_WAR | Free | Workaround for HW bug 2861232. |
| 54 | 0x36 | EIATTR_SW_WAR | Free | Generic software workaround container. |
| 71 | 0x47 | EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | Free | Offsets of MEMBAR.SYS instructions needing software workaround. |
Graphics-Specific
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 72 | 0x48 | EIATTR_GRAPHICS_GLOBAL_CBANK | Indexed | Global constant bank for graphics shaders. |
| 73 | 0x49 | EIATTR_SHADER_TYPE | Indexed | Shader type (vertex, fragment, compute, etc.). |
| 74 | 0x4A | EIATTR_VRC_CTA_INIT_COUNT | Indexed | Virtual Register Count CTA init count. |
Blackwell+ Features (sm_100+)
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 79 | 0x4F | EIATTR_AT_ENTRY_FRAGEMENTS | Free | Fragment descriptors at function entry. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1. |
| 80 | 0x50 | EIATTR_SPARSE_MMA_MASK | Free | Sparsity mask for structured-sparse MMA operations. |
| 81 | 0x51 | EIATTR_TCGEN05_1CTA_USED | Indexed | tcgen05 (5th-gen tensor core) single-CTA mode used. |
| 82 | 0x52 | EIATTR_TCGEN05_2CTA_USED | Indexed | tcgen05 two-CTA mode used. |
| 83 | 0x53 | EIATTR_GEN_ERRBAR_AT_EXIT | Indexed | Generate error barrier at kernel exit. |
| 84 | 0x54 | EIATTR_REG_RECONFIG | Indexed | Dynamic register reconfiguration (setmaxnreg). |
| 92 | 0x5C | EIATTR_SANITIZE | Indexed | Address sanitizer instrumentation present. |
Mercury-Specific
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 90 | 0x5A | EIATTR_MERCURY_FINALIZER_OPTIONS | Free | Options for the Mercury FNLZR post-link pass. |
| 95 | 0x5F | EIATTR_MERCURY_ISA_VERSION | Sized | Mercury ISA version for the shader binary. |
Compilation Metadata
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 3 | 0x03 | EIATTR_JUMPTABLE_RELOCS | Free | Jump table relocation entries. |
| 4 | 0x04 | EIATTR_CTAIDZ_USED | Indexed | Whether kernel uses %ctaid.z (3D grid). |
| 13 | 0x0D | EIATTR_SYNC_STACK | Indexed | Synchronization stack depth. |
| 15 | 0x0F | EIATTR_EXTERNS | Free | External symbol references list. |
| 26 | 0x1A | EIATTR_QUERY_NUMATTRIB | Indexed | Number of queryable attributes. |
| 31 | 0x1F | EIATTR_NEED_CNP_WRAPPER | Indexed | Kernel needs CUDA Nested Parallelism wrapper. |
| 32 | 0x20 | EIATTR_NEED_CNP_PATCH | Indexed | Kernel needs CNP patching at load time. |
| 33 | 0x21 | EIATTR_EXPLICIT_CACHING | Indexed | Explicit cache control directives present. |
| 34 | 0x22 | EIATTR_ISTYPEP_USED | Indexed | isspacep instruction used. |
| 36 | 0x24 | EIATTR_SUQ_USED | Indexed | Surface query instruction used. |
| 38 | 0x26 | EIATTR_LOAD_CACHE_REQUEST | Indexed | Load cache request configuration. |
| 43 | 0x2B | EIATTR_WMMA_USED | Indexed | Warp Matrix Multiply-Accumulate instructions used. |
| 44 | 0x2C | EIATTR_HAS_PRE_V10_OBJECT | Value | Object contains pre-CUDA 10 compiled code. |
| 51 | 0x33 | EIATTR_STATISTICS | Free | Compilation statistics (instruction counts, etc.). |
| 55 | 0x37 | EIATTR_CUDA_API_VERSION | Indexed | CUDA API version the kernel was compiled for. |
| 60 | 0x3C | EIATTR_PER_REG_TARGET_PERF_STATS | Free | Per-register-target performance statistics. |
| 64 | 0x40 | EIATTR_INSTR_REG_MAP | Free | Instruction-to-register mapping for profiling. |
| 67 | 0x43 | EIATTR_UCODE_SECTION_DATA | Free | Microcode section data (internal). |
| 75 | 0x4B | EIATTR_TOOLS_PATCH_FUNC | Indexed | Function patching descriptor for CUDA tools (cuda-gdb, Nsight). |
| 78 | 0x4E | EIATTR_PERF_STATISTICS | Free | Performance statistics for the profiler. |
| 85 | 0x55 | EIATTR_ANNOTATIONS | Free | General-purpose annotation data. |
| 88 | 0x58 | EIATTR_STUB_FUNCTION_KIND | Indexed | Stub function classification. |
| 93 | 0x5D | EIATTR_SYSCALLS_FALLBACK | Free | Syscall fallback mechanism offsets. |
| 94 | 0x5E | EIATTR_CUDA_REQ | Free | CUDA requirements descriptor. |
Sentinel and Error
| Code | Hex | Name | Format | Description |
|---|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | -- | Invalid/error sentinel. Never emitted in valid cubins. |
| 1 | 0x01 | EIATTR_PAD | -- | Padding record (ignored by parser). |
| 86 | 0x56 | EIATTR_UNKNOWN | -- | Unknown attribute placeholder. |
| 96 | 0x60 | EIATTR_ERROR_LAST | -- | Upper bound sentinel for the enum range. Code 96 is never emitted; it serves as a bound check (if (attr_code > 0x2F) at line 760 of the builder). |
Payload Format Reference (Codes 0--32)
Per-attribute wire-format documentation derived from sub_1CC9800 (master EIATTR builder), sub_1CC86D0 (per-entry stack emitter), sub_1CC8950 (barrier/register propagator), and sub_1CC85F0 (TLV record emitter). Payload layouts describe the bytes that follow the 4-byte TLV header.
For Indexed-format (0x04) attributes the first 4 payload bytes are always a u32 symbol index. The remaining bytes (if any) carry the value. For Sized-format (0x03) attributes the value is encoded directly in the 16-bit size field of the TLV header -- there are no additional payload bytes.
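The two conventions can be captured in a pair of decoders (Python sketch; helper names are illustrative, the byte layout follows the text above):

```python
import struct

def decode_sized(record: bytes) -> int:
    """Format 0x03: the 16-bit size field IS the value; the record is 4 bytes total."""
    fmt, _code, value = struct.unpack_from("<BBH", record, 0)
    assert fmt == 0x03
    return value

def decode_indexed(record: bytes):
    """Format 0x04: payload = [sym_index:4][value:4] after the 4-byte header."""
    fmt, _code, size = struct.unpack_from("<BBH", record, 0)
    assert fmt == 0x04 and size >= 8
    return struct.unpack_from("<II", record, 4)

# EIATTR_MAXREG_COUNT (0x1B), Sized: maxreg = 64 lives in the size field
assert decode_sized(struct.pack("<BBH", 0x03, 0x1B, 64)) == 64
# EIATTR_REGCOUNT (0x2F), Indexed: symbol 3 uses 40 registers
assert decode_indexed(struct.pack("<BBHII", 0x04, 0x2F, 8, 3, 40)) == (3, 40)
```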
Sentinel Codes (0--1)
| Code | Hex | Name | Payload |
|---|---|---|---|
| 0 | 0x00 | EIATTR_ERROR | None. Never emitted. |
| 1 | 0x01 | EIATTR_PAD | None. Padding, ignored by parser. |
Texture and Image Binding (2, 6--9, 14, 19--22)
All Indexed attributes in this group share the same 8-byte payload layout: [sym_index:4][value:4]. The builder's first switch (line 722) routes all of these through the same symbol-index resolution path.
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol table index
0x04 4 value Attribute-specific (see per-code table)
| Code | Hex | Name | value field semantics |
|---|---|---|---|
| 2 | 0x02 | EIATTR_IMAGE_SLOT | Image slot number (texture unit binding point) |
| 6 | 0x06 | EIATTR_IMAGE_OFFSET | Byte offset within image descriptor table |
| 7 | 0x07 | EIATTR_IMAGE_SIZE | Image descriptor size in bytes |
| 8 | 0x08 | EIATTR_TEXTURE_NORMALIZED | 0 = unnormalized, 1 = normalized coordinates |
| 9 | 0x09 | EIATTR_SAMPLER_INIT | Packed sampler initialization parameters |
| 19 | 0x13 | EIATTR_SAMPLER_FORCE_UNNORMALIZED | Sampler ID to force unnormalized |
| 21 | 0x15 | EIATTR_BINDLESS_TEXTURE_BANK | Constant bank ID for bindless texture descriptors |
| 22 | 0x16 | EIATTR_BINDLESS_SURFACE_BANK | Constant bank ID for bindless surface descriptors |
Code 14 (0x0E) -- EIATTR_TEXID_SAMPID_MAP: Free format. Variable-length array of u32 pairs mapping texture IDs to sampler IDs.
Payload: repeating [tex_id:4][samp_id:4] pairs
Size: N * 8 bytes (N = number of tex-sampler bindings)
Code 20 (0x14) -- EIATTR_BINDLESS_IMAGE_OFFSETS: Free format. Array of u32 byte offsets for bindless image descriptor references in the kernel's constant bank. Each u32 is a symbol index that gets resolved during link.
Payload: u32[] symbol indices (resolved to byte offsets at link)
Size: N * 4 bytes
Jump Table Relocations (3)
Code 3 (0x03) -- EIATTR_JUMPTABLE_RELOCS: Free format. Array of u32 byte offsets into the .text section where jump table relocations are needed.
Payload: u32[] byte offsets into .text
Size: N * 4 bytes
CTAIDZ Flag (4)
Code 4 (0x04) -- EIATTR_CTAIDZ_USED: Indexed format, zero-value flag attribute. Presence of the record signals the kernel reads %ctaid.z. SM-version gated via sub_1C97840(0x04, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol
(no value field -- presence is the signal)
The builder creates this record with two different format bytes depending on context: 0x04 (Indexed) via the TLV emitter, or 0x01 (Free) via inline construction (magic 0x0401). Both encode the same semantic: flag-only, no value.
Resource Allocation (5, 16--18, 25, 27, 30)
Codes 5, 16, 17, 18 -- Indexed, 8-byte payload [sym_index:4][value:4]:
| Code | Hex | Name | value field semantics |
|---|---|---|---|
| 5 | 0x05 | EIATTR_MAX_THREADS | Maximum threads per block (from .maxntid) |
| 16 | 0x10 | EIATTR_REQNTID | Required thread count per dimension (from .reqntid) |
| 17 | 0x11 | EIATTR_FRAME_SIZE | Per-thread local memory frame size in bytes |
| 18 | 0x12 | EIATTR_MIN_STACK_SIZE | Minimum per-thread stack size in bytes |
EIATTR_FRAME_SIZE is weak-symbol filtered: dropped when a weak function is replaced by a stronger definition (bitmask 0x800800020000).
EIATTR_MIN_STACK_SIZE is emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x12, 8, buf, 0) where buf is [sym_index:4][min_stack:4]. A sentinel value of -1 in min_stack means "not yet computed." When sm_version == 0xFF00 (Mercury), the record is suppressed.
Code 25 (0x19) -- EIATTR_CBANK_PARAM_SIZE: Sized format (0x03). Value encoded directly in the 16-bit size field. No separate payload bytes.
TLV header: [fmt=0x03][code=0x19][param_bank_size:2]
Total record: 4 bytes (header only)
Code 27 (0x1B) -- EIATTR_MAXREG_COUNT: Sized format (0x03). Value encoded in the low byte of the 16-bit size field (range 0--255). Per-compilation-unit hint, not per-function. Set by --maxrregcount CLI flag or .maxnreg PTX directive.
TLV header: [fmt=0x03][code=0x1B][maxreg:2]
Total record: 4 bytes (header only)
Effective range: low byte only (0--255), high byte 0
Binary evidence: second switch case 0x1B (line 1094) reads *(u8*)(v150+2) -- the low byte of the size field -- as the register count value.
Code 30 (0x1E) -- EIATTR_CRS_STACK_SIZE: Indexed format, 4-byte value payload. Emitted by sub_1CC86D0 with sub_1CC85F0(a1, 0x1E, 4, buf, sym_index).
Offset Size Field
------ ---- -----
0x00 4 sym_index Per-function symbol
0x04 4 crs_bytes Call-Return-Stack size in bytes
Total record: 12 bytes (4 header + 8 payload). Diagnostic "conflicting crs_stack attribute" fires when two records target the same function.
Parameter Bank Layout (10--12, 23--24)
Code 10 (0x0A) -- EIATTR_PARAM_CBANK: Indexed format, packed value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cbank_desc lo16 = bank number, hi16 = byte offset
Typical value: bank=0, offset=0x160 (standard CUDA kernel parameter ABI).
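The packed cbank_desc word splits as a lo16/hi16 pair (pack/unpack helpers are illustrative, not ptxas code):

```python
def pack_cbank_desc(bank: int, offset: int) -> int:
    """lo16 = constant bank number, hi16 = byte offset of the parameter area."""
    return (offset << 16) | (bank & 0xFFFF)

def unpack_cbank_desc(desc: int):
    return desc & 0xFFFF, desc >> 16

# Standard CUDA kernel parameter ABI: bank 0, offset 0x160
assert unpack_cbank_desc(pack_cbank_desc(0, 0x160)) == (0, 0x160)
```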
Codes 11 (0x0B) and 12 (0x0C) -- Free format, variable-length u32 arrays:
EIATTR_SMEM_PARAM_OFFSETS (0x0B):
Payload: u32[] byte offsets within shared memory, one per parameter
Size: N * 4 bytes
EIATTR_CBANK_PARAM_OFFSETS (0x0C):
Payload: u32[] packed entries, one per parameter
Each u32: lo16 = byte offset in cbank, hi16 = parameter size
Size: N * 4 bytes
Code 23 (0x17) -- EIATTR_KPARAM_INFO: Free format, complex per-parameter descriptors. This is the only attribute in codes 0--32 with a multi-field sub-record structure.
Payload: repeating 12-byte per-parameter entries:
Offset Size Field
------ ---- -----
0x00 4 param_index Ordinal position (0-based)
0x04 4 param_offset Byte offset in constant bank
0x08 2 param_size Size in bytes
0x0A 1 log_alignment log2(alignment)
0x0B 1 flags Bit flags (pointer, ordinal, etc.)
Size: N * 12 bytes
Special behavior: the builder exempts KPARAM_INFO from being zeroed when its symbol index resolves to 0 (line 755: (_BYTE)v5 == 23 check). This allows global-scope parameter info records.
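The repeating 12-byte entry layout can be walked with a small parser (sketch; field names follow the table above, and the sample entry is hypothetical):

```python
import struct

def parse_kparam_info(payload: bytes):
    """Decode the repeating 12-byte EIATTR_KPARAM_INFO entries."""
    entries = []
    for off in range(0, len(payload), 12):
        idx, p_off, size, log_align, flags = struct.unpack_from("<IIHBB", payload, off)
        entries.append({"index": idx, "offset": p_off, "size": size,
                        "align": 1 << log_align, "flags": flags})
    return entries

# Hypothetical 8-byte pointer parameter at cbank offset 0x160, 8-byte aligned
entry = parse_kparam_info(struct.pack("<IIHBB", 0, 0x160, 8, 3, 0x1))[0]
assert entry["align"] == 8 and entry["offset"] == 0x160
```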
Code 24 (0x18) -- EIATTR_SMEM_PARAM_SIZE: Indexed, [sym_index:4][smem_param_bytes:4].
Synchronization (13)
Code 13 (0x0D) -- EIATTR_SYNC_STACK: Indexed format.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 sync_depth lo16 = stack depth (u16), hi16 = 0
Binary evidence: case 0x0D (line 1038) reads *(u16**)(v150+8) as a pointer to a u16 value. The depth value (v343) is a 16-bit unsigned integer. Used with sub_1CBD8F0 for sync stack tracking.
External Symbol References (15)
Code 15 (0x0F) -- EIATTR_EXTERNS: Free format, most complex processing of any attribute in the 0--32 range.
Payload: u32[] symbol table indices
Size: N * 4 bytes (N = size_field / 4)
The builder handles EXTERNS in both switches:
- First switch (line 779): iterates the u32 array, resolving each symbol index through the link-time symbol table. Dead symbols (resolved to 0) are zeroed in-place.
- Second switch (line 1054): collects extern refs into a set (v643) for the current function.
- Emission (line 1706): sub_1CC85F0(a1, 0x0F, 4*count, buf, sym_index) emits the final record.
- The size field encodes N * 4 and the element count is recovered as size >> 2.
Metadata Query (26)
Code 26 (0x1A) -- EIATTR_QUERY_NUMATTRIB: Indexed, [sym_index:4][num_attributes:4].
Instruction Offset Tables (28--29)
Both attributes are Free format carrying arrays of u32 byte offsets into the .text section.
Code 28 (0x1C) -- EIATTR_EXIT_INSTR_OFFSETS:
Payload: u32[] byte offsets of EXIT instructions
Size: N * 4 bytes
Confirmed by the builder's loop (line 2011): code 28 is explicitly checked and routed past the symbol-resolution path, confirming the payload is a plain offset array with no embedded symbol indices.
Code 29 (0x1D) -- EIATTR_S2RCTAID_INSTR_OFFSETS:
Payload: u32[] byte offsets of S2R SR_CTAID.* instructions
Size: N * 4 bytes
At line 2001, code 29 triggers CNP (CUDA Nested Parallelism) wrapper generation. The symbol index from the record is added to the CNP wrapper list, driving emission of NEED_CNP_WRAPPER (code 31) and NEED_CNP_PATCH (code 32) records.
CUDA Nested Parallelism Flags (31--32)
Both are Indexed-format flag attributes with no value payload, driven by the same S2RCTAID-based CNP analysis (their emission sets differ slightly; see each code below).
Code 31 (0x1F) -- EIATTR_NEED_CNP_WRAPPER:
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
SM-version gated: sub_1C97840(0x1F, sm_version). Builder constructs with internal format 0x01 (magic 0x1F01 = 7937). Emitted for every function that the S2RCTAID analysis identified as needing a CNP wrapper.
Code 32 (0x20) -- EIATTR_NEED_CNP_PATCH:
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
SM-version gated: sub_1C97840(0x20, sm_version). Builder constructs with internal format 0x01 (magic 0x2001 = 8193). Emitted for every function in the CNP call tree.
Payload Format Summary (Codes 0--32)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 0 | ERROR | -- | 0 | none |
| 1 | PAD | -- | 0 | none |
| 2 | IMAGE_SLOT | 0x04 | 8 | [sym:4][slot_id:4] |
| 3 | JUMPTABLE_RELOCS | 0x01 | N*4 | u32[] byte offsets |
| 4 | CTAIDZ_USED | 0x04 | 4 | [sym:4] flag-only |
| 5 | MAX_THREADS | 0x04 | 8 | [sym:4][max_threads:4] |
| 6 | IMAGE_OFFSET | 0x04 | 8 | [sym:4][offset:4] |
| 7 | IMAGE_SIZE | 0x04 | 8 | [sym:4][size:4] |
| 8 | TEXTURE_NORMALIZED | 0x04 | 8 | [sym:4][normalized:4] |
| 9 | SAMPLER_INIT | 0x04 | 8 | [sym:4][params:4] |
| 10 | PARAM_CBANK | 0x04 | 8 | [sym:4][lo16=bank,hi16=off:4] |
| 11 | SMEM_PARAM_OFFSETS | 0x01 | N*4 | u32[] param offsets |
| 12 | CBANK_PARAM_OFFSETS | 0x01 | N*4 | u32[] lo16=off,hi16=size |
| 13 | SYNC_STACK | 0x04 | 8 | [sym:4][depth_u16:4] |
| 14 | TEXID_SAMPID_MAP | 0x01 | N*8 | [tex_id:4][samp_id:4] pairs |
| 15 | EXTERNS | 0x01 | N*4 | u32[] symbol indices |
| 16 | REQNTID | 0x04 | 8 | [sym:4][reqntid:4] |
| 17 | FRAME_SIZE | 0x04 | 8 | [sym:4][frame_bytes:4] |
| 18 | MIN_STACK_SIZE | 0x04 | 8 | [sym:4][stack_bytes:4] |
| 19 | SAMPLER_FORCE_UNNORM | 0x04 | 8 | [sym:4][sampler_id:4] |
| 20 | BINDLESS_IMAGE_OFFSETS | 0x01 | N*4 | u32[] sym indices |
| 21 | BINDLESS_TEXTURE_BANK | 0x04 | 8 | [sym:4][bank_id:4] |
| 22 | BINDLESS_SURFACE_BANK | 0x04 | 8 | [sym:4][bank_id:4] |
| 23 | KPARAM_INFO | 0x01 | N*12 | 12B per-param descriptors |
| 24 | SMEM_PARAM_SIZE | 0x04 | 8 | [sym:4][size_bytes:4] |
| 25 | CBANK_PARAM_SIZE | 0x03 | 0 | value in TLV size field |
| 26 | QUERY_NUMATTRIB | 0x04 | 8 | [sym:4][count:4] |
| 27 | MAXREG_COUNT | 0x03 | 0 | value in TLV size field (u8) |
| 28 | EXIT_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 29 | S2RCTAID_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 30 | CRS_STACK_SIZE | 0x04 | 8 | [sym:4][crs_bytes:4] |
| 31 | NEED_CNP_WRAPPER | 0x04 | 4 | [sym:4] flag-only |
| 32 | NEED_CNP_PATCH | 0x04 | 4 | [sym:4] flag-only |
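Taken together, the wire formats in the table above admit a compact decoder. The following is an illustrative sketch (not recovered code; `walk_nv_info` is a hypothetical name) that walks a raw `.nv.info` byte blob using the documented TLV layout, including the special case of records that reuse the size field as a value while still carrying a `[sym:4]` payload:

```python
import struct

# Codes documented in this wiki that reuse the 16-bit size field as a value
# while still carrying a fixed [sym_index:4] payload (VRC_CTA_INIT_COUNT,
# NUM_BARRIERS). This set is reconstructed from the tables above.
VALUE_IN_SIZE_WITH_SYM = {0x4A, 0x4C}

def walk_nv_info(data: bytes):
    """Return (fmt, code, size_field, payload) records from a raw .nv.info blob."""
    pos, records = 0, []
    while pos + 4 <= len(data):
        fmt, code, size = struct.unpack_from("<BBH", data, pos)
        pos += 4
        if fmt == 0x03:
            nbytes = 0          # Sized: the value lives in the size field itself
        elif fmt == 0x02 and code in VALUE_IN_SIZE_WITH_SYM:
            nbytes = 4          # size field is a value; payload is [sym:4]
        else:
            nbytes = size       # Free/Value/Indexed: size counts payload bytes
        records.append((fmt, code, size, data[pos:pos + nbytes]))
        pos += nbytes
    return records
```

This is a sketch of the common-case framing only; attributes with structured payloads (KPARAM_INFO, ANNOTATIONS, etc.) need per-code decoding on top of it.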
Payload Format Reference (Codes 33--64)
Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--32 section above.
Metadata Flags (33--34, 36, 43)
Code 33 (0x21) -- EIATTR_EXPLICIT_CACHING: Indexed format, flag-only. Signals the kernel uses explicit cache control directives (ld.ca, ld.cg, etc.). SM-gated via sub_1C97840(0x21, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x2101 (line 1733). Emitted when cache-on flag (v648) is set. When both cache-on and cache-off flags are set simultaneously (conflicting directives), sub_1CC8100 (cache conflict resolver) is called instead of emitting this record. The diagnostic "Turning caching %s for entry '%s' as per its request" logs cache resolution decisions.
Code 34 (0x22) -- EIATTR_ISTYPEP_USED: Indexed format, flag-only. Signals the kernel uses isspacep (type predicate) instructions.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic -- passes through the default path.
Code 36 (0x24) -- EIATTR_SUQ_USED: Indexed format, flag-only. Signals the kernel uses surface query instructions.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic.
Code 43 (0x2B) -- EIATTR_WMMA_USED: Indexed format, flag-only. Signals the kernel uses Warp Matrix Multiply-Accumulate instructions. First attribute introduced in the Volta era (meta=2).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No special builder logic.
Resource Allocation (35, 47, 50, 55--56, 59)
Code 35 (0x23) -- EIATTR_MAX_STACK_SIZE: Indexed format, 4-byte value. Maximum per-thread stack size for recursive call chains, computed via call-graph propagation in sub_1CC8950.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 max_stack_bytes Maximum stack size in bytes
Binary evidence: second switch case 0x23 (line 1128) reads v354[1] as the stack size value and stores it in the per-entry array s[]. Weak-symbol filtered: bitmask 0x800800020000 includes this code. Mercury suppression: when sm_version == 0xFF00, the code byte is zeroed, dropping the record.
Code 47 (0x2F) -- EIATTR_REGCOUNT: Indexed format, 4-byte value. Physical register count per thread. The single most important attribute for GPU occupancy: max_warps_per_SM = total_registers / (regcount * warp_size).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 regcount Physical registers per thread
Binary evidence: second switch case 0x2F (line 1176) resolves the symbol and stores the record pointer in v642[] (per-entry regcount array). Diagnostic "invalid index" (line 1180) fires if the symbol resolves to null. Weak-symbol filtered: bitmask 0x800800020000 includes this code.
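As a worked example of the occupancy formula above (the 64K-registers-per-SM and warp-size-32 constants are illustrative hardware parameters, not values read from the binary):

```python
def max_warps_per_sm(regcount: int, total_registers: int = 65536,
                     warp_size: int = 32) -> int:
    """Occupancy bound from EIATTR_REGCOUNT, per the formula documented above.
    Defaults are illustrative Turing-class numbers, not recovered from ptxas."""
    return total_registers // (regcount * warp_size)
```

For example, a kernel compiled to 32 registers per thread caps at 64 warps per SM, while one using 64 registers caps at 32.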
Code 50 (0x32) -- EIATTR_SHARED_SCRATCH: Indexed format, 4-byte value. Shared memory scratch space allocated for register spilling.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 scratch_bytes Shared scratch size in bytes
No special builder logic.
Code 55 (0x37) -- EIATTR_CUDA_API_VERSION: Indexed format, 4-byte value. Records the CUDA API version the kernel was compiled for.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 api_version CUDA API version number
No special builder logic -- passes through the default path.
Code 56 (0x38) -- EIATTR_NUM_MBARRIERS: Sized format (0x03), value encoded in the TLV size field. Number of memory barrier (mbarrier) objects used by the kernel.
TLV header: [fmt=0x03][code=0x38][mbar_count:2]
Total record: 4 bytes (header only)
Binary evidence: magic 0x3803 (14339) at lines 1664 and 2446. The mbarrier count is stored in the 16-bit size field: *((_WORD *)v511 + 1) = v651 (line 1669). SM-gated via sub_1C97840(0x38, sm_version) at lines 1654 and 2436.
Accumulative semantics: the builder sums mbarrier counts from callees during call-graph propagation (second switch case 0x38 at line 1183, falling through to LABEL_331). If any callee reports -1 (unknown), the sum stays -1 (lines 1255--1256). The emission loop at lines 2407--2454 propagates the count to all entry points that call the function.
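The accumulative semantics can be sketched as follows (hypothetical helper name; it mirrors the -1 poisoning described above, not the recovered code verbatim):

```python
def merge_mbarrier_counts(counts):
    """Sum mbarrier counts across callees; a -1 ("unknown") from any callee
    poisons the sum, mirroring the fall-through at LABEL_331."""
    total = 0
    for c in counts:
        total = -1 if (c == -1 or total == -1) else total + c
    return total
```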
Code 59 (0x3B) -- EIATTR_SAM_REGION_STACK_SIZE: Indexed format, 8-byte payload. SAM (Streaming Asynchronous Memory) region stack size.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 sam_stack_bytes SAM region stack size in bytes
Binary evidence: emitted by sub_1CC86D0 at line 114: sub_1CC85F0(a1, 0x3B, 8, buf, 0) where buf is [sym_index:4][sam_stack:4]. Only emitted when sub_1CBD9E0(a1, a2) returns nonzero, indicating the kernel actually uses SAM regions. Second switch case 0x3B (line 1186) calls sub_1CBD940(a1, sym, value) to record the SAM stack size.
Cache Control (38)
Code 38 (0x26) -- EIATTR_LOAD_CACHE_REQUEST: Indexed format, 4-byte value. Per-kernel cache mode configuration. Controls whether the driver enables explicit caching for this kernel.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cache_mode 0 = off, nonzero = on
Binary evidence: second switch case 0x26 (line 1134) is the most complex handler in this range. The builder first checks the function kind: if (byte & 3) == 1 (device function), the record is dropped by zeroing the code byte (line 1141). For entry-point kernels, the verbose trace "Turning caching %s for entry '%s' as per its request" is emitted (line 1153), where %s is either "OFF" or "ON". When cache_mode is nonzero: adds the symbol to the caching-on list (v639[]) and sets the per-entry status to 2. When cache_mode is zero: sets status to 1 (off). The v648 and v655 flags track the presence of on/off requests for conflict detection.
Global Flags (44)
Code 44 (0x2C) -- EIATTR_HAS_PRE_V10_OBJECT: Value format (0x02), global scope. Signals the compilation unit contains pre-CUDA 10 compiled code.
TLV header: [fmt=0x02][code=0x2C][size:2]
Payload: [flags:4]
Total record: 8 bytes
Binary evidence: top-level gating at line 686--688 checks three conditions: link mode (v609 == 2), toolkit version (> 0x63), and SM compatibility (sub_1C97840(0x2C, sm_version)). The magic 0x2C01 at line 709 constructs the record with internal format byte 0x01, which the emitter translates to Value format (0x02) for the wire encoding since the record is global scope. This is the only Value-format attribute in the 33--64 range.
Instruction Offset Tables (37, 39--40, 45--46, 48--49, 52, 57--58)
All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section. None have explicit switch cases in the builder -- they pass through the default path. The payload layout for all is identical:
Payload: u32[] byte offsets into .text section
Size: N * 4 bytes (N = size_field / 4)
| Code | Hex | Name | Offset semantics |
|---|---|---|---|
| 37 | 0x25 | LD_CACHEMOD_INSTR_OFFSETS | Load instructions with explicit cache modifier |
| 39 | 0x27 | ATOM_SYS_INSTR_OFFSETS | Atomic instructions with .sys scope |
| 40 | 0x28 | COOP_GROUP_INSTR_OFFSETS | Cooperative group instructions |
| 45 | 0x2D | ATOMF16_EMUL_INSTR_OFFSETS | Emulated FP16 atomic instructions |
| 48 | 0x30 | SW2393858_WAR | HW bug 2393858 patch locations |
| 49 | 0x31 | INT_WARP_WIDE_INSTR_OFFSETS | Integer warp-wide instructions |
| 52 | 0x34 | INDIRECT_BRANCH_TARGETS | Valid targets of indirect branches |
| 57 | 0x39 | MBARRIER_INSTR_OFFSETS | Memory barrier instructions |
| 58 | 0x3A | COROUTINE_RESUME_OFFSETS | Coroutine resume point offsets |
Code 46 (0x2E) -- EIATTR_ATOM16_EMUL_INSTR_REG_MAP: Free format, but NOT a simple offset array. Carries a register map for 16-bit atomic emulation with a structured per-entry layout rather than flat offsets. The exact sub-record layout is not fully determined from the builder alone (constructed by a separate pass).
Payload: structured register-map entries (not flat u32[] offsets)
Size: variable
Software Workarounds (42, 48, 53--54)
All use Free format (0x01) with u32 offset arrays. The driver patches the instructions at the listed byte offsets during kernel load.
| Code | Hex | Name |
|---|---|---|
| 42 | 0x2A | SW1850030_WAR |
| 48 | 0x30 | SW2393858_WAR |
| 53 | 0x35 | SW2861232_WAR |
| 54 | 0x36 | SW_WAR |
SW_WAR (0x36) is a generic container -- unlike the numbered WAR attributes, its payload format may include sub-type discriminators, though the builder treats it as a flat pass-through.
Cluster and Cooperative Launch (41, 61--63)
Code 41 (0x29) -- EIATTR_COOP_GROUP_MASK_REGIDS: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 mask_regids Register IDs for cooperative group masks
Code 61 (0x3D) -- EIATTR_CTA_PER_CLUSTER: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 ctas_per_cluster Number of CTAs per cluster (Hopper sm_90+)
Code 62 (0x3E) -- EIATTR_EXPLICIT_CLUSTER: Indexed, flag-only.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence signals explicit cluster dimensions)
Code 63 (0x3F) -- EIATTR_MAX_CLUSTER_RANK: Indexed, 4-byte value.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 max_rank Maximum cluster rank for scheduling
Compilation Metadata (51, 60, 64)
Code 51 (0x33) -- EIATTR_STATISTICS: Free format. Variable-length compilation statistics (instruction counts, etc.). Internal diagnostic data not consumed by the GPU driver.
Payload: structured statistics data (format varies)
Size: variable
Code 60 (0x3C) -- EIATTR_PER_REG_TARGET_PERF_STATS: Free format. Per-register-target performance statistics for the profiler.
Payload: structured performance data (format varies)
Size: variable
Code 64 (0x40) -- EIATTR_INSTR_REG_MAP: Free format. Instruction-to-register mapping for profiling and debugging tools.
Payload: structured register-map data
Size: variable
Payload Format Summary (Codes 33--64)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 33 | EXPLICIT_CACHING | 0x04 | 4 | [sym:4] flag-only |
| 34 | ISTYPEP_USED | 0x04 | 4 | [sym:4] flag-only |
| 35 | MAX_STACK_SIZE | 0x04 | 8 | [sym:4][max_stack_bytes:4] |
| 36 | SUQ_USED | 0x04 | 4 | [sym:4] flag-only |
| 37 | LD_CACHEMOD_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 38 | LOAD_CACHE_REQUEST | 0x04 | 8 | [sym:4][cache_mode:4] |
| 39 | ATOM_SYS_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 40 | COOP_GROUP_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 41 | COOP_GROUP_MASK_REGIDS | 0x04 | 8 | [sym:4][mask_regids:4] |
| 42 | SW1850030_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 43 | WMMA_USED | 0x04 | 4 | [sym:4] flag-only |
| 44 | HAS_PRE_V10_OBJECT | 0x02 | 4 | [flags:4] global |
| 45 | ATOMF16_EMUL_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 46 | ATOM16_EMUL_INSTR_REG_MAP | 0x01 | var | structured register map |
| 47 | REGCOUNT | 0x04 | 8 | [sym:4][regcount:4] |
| 48 | SW2393858_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 49 | INT_WARP_WIDE_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 50 | SHARED_SCRATCH | 0x04 | 8 | [sym:4][scratch_bytes:4] |
| 51 | STATISTICS | 0x01 | var | structured stats data |
| 52 | INDIRECT_BRANCH_TARGETS | 0x01 | N*4 | u32[] .text byte offsets |
| 53 | SW2861232_WAR | 0x01 | N*4 | u32[] .text byte offsets |
| 54 | SW_WAR | 0x01 | var | generic WAR data |
| 55 | CUDA_API_VERSION | 0x04 | 8 | [sym:4][api_version:4] |
| 56 | NUM_MBARRIERS | 0x03 | 0 | value in TLV size field (u16) |
| 57 | MBARRIER_INSTR_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 58 | COROUTINE_RESUME_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 59 | SAM_REGION_STACK_SIZE | 0x04 | 8 | [sym:4][sam_stack_bytes:4] |
| 60 | PER_REG_TARGET_PERF_STATS | 0x01 | var | structured perf data |
| 61 | CTA_PER_CLUSTER | 0x04 | 8 | [sym:4][ctas:4] |
| 62 | EXPLICIT_CLUSTER | 0x04 | 4 | [sym:4] flag-only |
| 63 | MAX_CLUSTER_RANK | 0x04 | 8 | [sym:4][max_rank:4] |
| 64 | INSTR_REG_MAP | 0x01 | var | structured register map |
Payload Format Reference (Codes 65--96)
Continuation of the per-attribute wire-format documentation. Same sources and conventions as the 0--64 sections above. Codes 65--96 represent the newest EIATTR additions (Ampere through Blackwell era). All require SM-version gating via sub_1C97840 before emission. Many have dedicated switch cases in the master builder for call-graph propagation.
Shared Memory (65--66)
Code 65 (0x41) -- EIATTR_RESERVED_SMEM_USED: Indexed format, flag-only. Signals the kernel uses reserved shared memory. SM-gated via sub_1C97840(0x41, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only, presence is the signal)
Binary evidence: magic 0x4101 (16641) at lines 1511 and 2219 of sub_1CC9800. The builder tracks this attribute in the v615[] per-entry array and propagates it to callee entry points during the second pass (lines 2186--2229). When an entry point does not already have this record, the builder creates one using sub_1CC7FB0 for symbol resolution.
Code 66 (0x42) -- EIATTR_RESERVED_SMEM_0_SIZE: Indexed format, 4-byte value. Size of reserved shared memory partition 0 in bytes.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 rsmem_bytes Reserved shared memory size in bytes
No explicit switch case in the builder -- passes through the default path.
Microcode Section (67)
Code 67 (0x43) -- EIATTR_UCODE_SECTION_DATA: Free format. Opaque microcode section data for internal use. Payload format is architecture-specific and not decoded by the builder.
Payload: opaque byte array
Size: variable
Instruction Offset Tables (68, 70--71, 87, 89)
All attributes in this group use Free format (0x01) carrying variable-length arrays of u32 byte offsets into the kernel's .text section.
Payload: u32[] byte offsets into .text section
Size: N * 4 bytes (N = size_field / 4)
| Code | Hex | Name | Offset semantics | Emitter |
|---|---|---|---|---|
| 68 | 0x44 | UNUSED_LOAD_BYTE_OFFSET | Unused load instructions | sub_60BCF0 (code 70 pattern) |
| 70 | 0x46 | SYSCALL_OFFSETS | __cuda_syscall invocations | sub_60BCF0 |
| 71 | 0x47 | SW_WAR_MEMBAR_SYS_INSTR_OFFSETS | MEMBAR.SYS instructions needing WAR | sub_60BDC0 |
| 87 | 0x57 | STACK_CANARY_TRAP_OFFSETS | Stack canary trap instructions | sub_60BEA0 |
| 89 | 0x59 | LOCAL_CTA_ASYNC_STORE_OFFSETS | CTA-local async store instructions | default path |
Binary evidence for sub_60BCF0 (code 70): allocates 4 * count bytes, copies offsets from the instruction table at struct+40, then calls sub_1CC85F0(a2, 70, (unsigned __int16)count, buf, a4). Emission gated by *(a1+25) flag and count > 0.
Binary evidence for sub_60BDC0 (code 71) and sub_60BEA0 (code 87): identical structure to sub_60BCF0, differing only in the attribute code passed to sub_1CC85F0.
Kernel Parameter Info V2 (69)
Code 69 (0x45) -- EIATTR_KPARAM_INFO_V2: Free format, 12-byte per-parameter entries. Extended version of KPARAM_INFO (code 23) with additional type encoding. Emitted by sub_7FD2B0.
Payload: repeating 12-byte per-parameter entries:
Offset Size Field
------ ---- -----
0x00 4 param_index Ordinal position (0-based)
0x04 4 param_offset Byte offset in constant bank
0x08 2 param_size Size in bytes
0x0A 1 log_alignment log2(alignment)
0x0B 1 flags Packed nibbles:
lo4 = param_type (from lookup table at 0x21D2E60)
bit4 = is_pointer flag
hi3 = reserved
Size: N * 12 bytes
Binary evidence: sub_7FD2B0 at line 116 calls sub_1CC85F0(a3, 69, 12, v16, a4). The flags byte at offset 0x0B is assembled from two sources: the low nibble is looked up from dword_21D2E60 indexed by param_type - 1 (line 110), and bit 4 is set when the parameter is a pointer (line 115: 16 * (*(_BYTE *)(v20 + 25) & 1)).
First-switch handling: code 69 (0x45) appears in the first switch at line 737 alongside texture and resource codes, meaning KPARAM_INFO_V2 records undergo symbol-index resolution during the first pass.
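A sketch of decoding the 12-byte descriptors (hypothetical parser, assuming the little-endian field layout documented above):

```python
import struct

def parse_kparam_info_v2(payload: bytes):
    """Decode KPARAM_INFO_V2 payloads: repeating 12-byte per-param entries."""
    params = []
    for off in range(0, len(payload) - 11, 12):
        idx, p_off, size, log_align, flags = struct.unpack_from(
            "<IIHBB", payload, off)
        params.append({
            "index": idx,
            "offset": p_off,
            "size": size,
            "align": 1 << log_align,           # stored as log2(alignment)
            "param_type": flags & 0x0F,        # lo4 nibble
            "is_pointer": bool(flags & 0x10),  # bit 4
        })
    return params
```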
Graphics-Specific (72--74)
Code 72 (0x48) -- EIATTR_GRAPHICS_GLOBAL_CBANK: Indexed format, 4-byte value. Global constant bank descriptor for graphics shaders.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 cbank_desc Global constant bank descriptor
Code 73 (0x49) -- EIATTR_SHADER_TYPE: Indexed format, 4-byte value. Shader type classification.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 shader_type Shader type enum (vertex, fragment, compute, etc.)
Code 74 (0x4A) -- EIATTR_VRC_CTA_INIT_COUNT: Constructed with internal format byte 0x02 (magic 0x4A02 = 18946), but the value is stored in the TLV size field byte, making the wire behavior Sized-like. The builder takes the maximum across all callees.
TLV header: [fmt=0x02][code=0x4A][vrc_count:2]
Payload: [sym_index:4]
Total record: 8 bytes
Binary evidence: magic 18946 at lines 1532 and 2344. The maximum-across-callees logic at lines 1214--1215: if (v675 < *(v150+2)) v328 = *(v150+2); v675 = v328. The final value is written back at line 1538: *((_BYTE *)v196 + 2) = v675. The v617[] per-entry array tracks this attribute for propagation. SM-gated via sub_1C97840(0x4A, sm_version).
Tools Patching (75)
Code 75 (0x4B) -- EIATTR_TOOLS_PATCH_FUNC: Indexed format, 4-byte value. Function patching descriptor for CUDA debugging tools (cuda-gdb, Nsight Compute).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 patch_info Patch descriptor for tool instrumentation
No explicit switch case -- passes through the default path.
Barrier Count (76)
Code 76 (0x4C) -- EIATTR_NUM_BARRIERS: Constructed with internal format byte 0x02 (magic 0x4C02 = 19458), with the barrier count stored in the TLV size field. This is one of the most complex attributes in the 65--96 range, with two distinct code paths.
TLV header: [fmt=0x02][code=0x4C][bar_count:2]
Payload: [sym_index:4]
Total record: 8 bytes
Dual-path behavior controlled by `*(a1+101)`:
- Per-SM tracking mode (when `*(a1+101)` is set, line 1223): reads the barrier count from the size field byte and takes the maximum across all callees: `if (n < *(v150+2)) v323 = *(v150+2); n = v323`. The `v628[]` per-entry array tracks records. SM-gated via `sub_1C97840(0x4C, sm_version)`.
- Accumulative mode (when `*(a1+101)` is clear, falls through to `LABEL_331`): sums barrier counts from callees with -1 sentinel handling (lines 1251--1257): `v298 = v297 + v651; if (v297 == -1) v298 = -1`. The sentinel `-1` means "unknown count" and poisons the sum.
Propagation in sub_1CC8950: the barrier/register propagator (2,634 bytes) also creates NUM_BARRIERS records during barrier count migration from section flags to .nv.info records.
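The record construction implied by the header layout above can be sketched as follows (hypothetical builder; symbol-index resolution is out of scope):

```python
import struct

def build_num_barriers_record(sym_index: int, bar_count: int) -> bytes:
    """Build an 8-byte NUM_BARRIERS record per the layout documented above:
    internal fmt 0x02, code 0x4C, count in the 16-bit size field, then a
    [sym_index:4] payload."""
    return struct.pack("<BBHI", 0x02, 0x4C, bar_count & 0xFFFF, sym_index)
```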
Texture Mode (77)
Code 77 (0x4D) -- EIATTR_TEXMODE_INDEPENDENT: Indexed format, flag-only. Signals the kernel uses independent texture mode.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Performance Statistics (78)
Code 78 (0x4E) -- EIATTR_PERF_STATISTICS: Free format. Performance statistics for the profiler.
Payload: structured performance data
Size: variable
No explicit switch case -- passes through the default path. Internal profiler data, not consumed by the GPU driver.
Fragment Descriptors at Entry (79)
Code 79 (0x4F) -- EIATTR_AT_ENTRY_FRAGEMENTS: Free format. The most complex handler in the 65--96 range. Carries fragment offset arrays that describe function entry point fragments. Note: "FRAGEMENTS" is a typo preserved in the binary; corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1.
Payload: u32[] fragment offsets
Size: N * 4 bytes
Binary evidence: emitted via sub_1CC85F0(a1, 0x4F, 4*count, buf, sym) at lines 1774 and 2539. The builder uses a set data structure (v644) to collect fragment offsets from callees, then merges and deduplicates them:
- Line 1749: collects the total fragment count from the `v644` set.
- Lines 1762--1772: iterates the set entries, extracting each offset via `sub_42F060`.
- Line 1774: emits the merged offset array.
- Lines 2460--2548: callee propagation loop. For each callee, if an existing entry has fragments, the builder extends the array and deduplicates offsets; if not, it creates a new record.
The deduplication logic (lines 2503--2525) does an O(N*M) scan: for each new offset, checks all existing offsets for duplicates before appending.
Cross-function ownership: when *(a1+568) != srca (the current entry's symbol differs from the fragment source), the code byte is zeroed (line 1290: *(_BYTE *)(v150+1)=0), suppressing the record for non-owning functions.
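The dedup scan described above amounts to an append-if-absent merge (hypothetical helper; same O(N*M) shape as the recovered loop):

```python
def merge_fragment_offsets(existing, new_offsets):
    """O(N*M) merge mirroring lines 2503--2525: each candidate offset is
    checked against every existing offset before being appended."""
    merged = list(existing)
    for off in new_offsets:
        if not any(off == e for e in merged):
            merged.append(off)
    return merged
```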
Sparse MMA Mask (80)
Code 80 (0x50) -- EIATTR_SPARSE_MMA_MASK: Sized format (0x03). Sparsity bitmask for structured-sparse MMA (Matrix Multiply-Accumulate) operations on Blackwell. SM-gated via sub_1C97840(0x50, sm_version).
TLV header: [fmt=0x03][code=0x50][mask_bits:2]
Total record: 4 bytes (header only)
Binary evidence: magic 0x5003 (20483) at lines 2085 and 1433. The mask value is stored in the TLV size field. During propagation, the builder OR's mask bits from all callees (line 1407: v158 |= *(_WORD *)(v162 + 2)). New entry-point records are initialized with bit 15 set (line 1436: *((_WORD *)v598 + 1) = 0x8000; line 1438: v158 |= 0x8000u). The v632[] per-entry array tracks records.
The .nv.uft section emission (lines 2068--2090) also creates SPARSE_MMA_MASK records, gated on *(a1+240) (UFT presence flag).
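The OR-merge with bit-15 initialization can be sketched as (hypothetical helper, reconstructed from the propagation behavior described above):

```python
def merge_sparse_mma_masks(callee_masks):
    """OR-merge of callee mask bits; new entry-point records start with
    bit 15 set, per lines 1436--1438."""
    mask = 0x8000  # initialization observed for new records
    for m in callee_masks:
        mask |= m & 0xFFFF
    return mask
```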
Tensor Core Gen05 (81--82)
These two codes are mutually exclusive. The builder enforces that a function cannot use both 1-CTA and 2-CTA tensor core modes simultaneously.
Code 81 (0x51) -- EIATTR_TCGEN05_1CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in single-CTA mode. SM-gated via sub_1C97840(0x51, sm_version) AND requires v673 > 0x81 (SM code > 129, i.e., sm_130+ / Blackwell).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x5101 (20737) at lines 1559 and 2259. Tracked in v614[] per-entry array. The v668 flag indicates any tcgen05_1CTA record was seen. The SM architecture threshold v673 > 0x81 (line 1543) gates emission: only architectures above 0x81 support tcgen05.
Code 82 (0x52) -- EIATTR_TCGEN05_2CTA_USED: Indexed format, flag-only. Signals the kernel uses 5th-generation tensor cores in two-CTA collaborative mode. SM-gated via sub_1C97840(0x52, sm_version) AND requires v673 > 0x81.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
Binary evidence: magic 0x5201 (20993) at lines 1582 and 2300. Tracked in v610[] per-entry array. The v674 flag indicates any tcgen05_2CTA record was seen.
Mutual exclusion enforcement: during callee propagation (lines 2264--2266 and 2304--2307), if a function already has a TCGEN05_1CTA record and the builder attempts to add a TCGEN05_2CTA record (or vice versa), sub_42F590 fires a diagnostic warning with the function name. This catches conflicting tensor core mode usage across the call graph.
Error Barrier at Exit (83)
Code 83 (0x53) -- EIATTR_GEN_ERRBAR_AT_EXIT: Indexed format, flag-only. Instructs the driver to generate an error barrier at kernel exit.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case in the builder -- passes through the default path.
Register Reconfiguration (84)
Code 84 (0x54) -- EIATTR_REG_RECONFIG: Indexed format, flag-only with optional value. Signals the kernel uses dynamic register reconfiguration (setmaxnreg instruction, sm_100+). SM-gated via sub_1C97840(0x54, sm_version).
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x02 1 reconfig_value (in TLV size field lo byte, optional)
Binary evidence: magic 0x5401 (21505) at lines 1637 and 2395. Tracked in v616[] per-entry array with the v666 flag. During callee propagation (lines 2364--2405), if a callee has a reconfig value (ii = *(v230+2)), it is written into the target record's size field byte: *(_BYTE *)(v417 + 2) = ii (line 2403). The value propagates from callee to entry point.
Annotations (85)
Code 85 (0x55) -- EIATTR_ANNOTATIONS: Free format with nested TLV-within-TLV sub-records. Emitted by sub_60C580. General-purpose annotation container for arbitrary metadata.
Payload: sequence of sub-records, each starting with a type byte:
Type 0: [type:4] -- 4 bytes
Type 1: [type:4][value:4] -- 8 bytes
Type 2: [type:4][key:4][len:4][data:len] -- 12+len bytes, 4-byte aligned
Type 3: [type:4][len:4][data:len] -- 8+len bytes, 4-byte aligned
Size: sum of all sub-record sizes
Binary evidence from sub_60C580:
- Line 47: type 2 records copy `key` (4 bytes) + `len` (4 bytes) + `len` bytes of data (lines 51--53: `memcpy(v17+3, v7+3, v22)`). Alignment: `((len + 11) & ~3) + 4` (line 55).
- Line 63: type 3 records copy `len` (4 bytes) + `len` bytes (lines 66--67: `memcpy(v17+2, v7+2, v26)`). Alignment: `((len + 7) & ~3) + 4` (line 68).
- Line 71: type 1 records are 8 bytes (`v19 = 8; v17[1] = v7[1]`).
- Line 79: type 0 (default) records are 4 bytes.
Total allocation: 257 * entry_count dwords (line 29: `v8 = 257LL * count`), providing generous headroom for variable-length sub-records.
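Putting the four sub-record layouts and the recovered alignment arithmetic together, a parser sketch (hypothetical names, not recovered code):

```python
import struct

def parse_annotations(payload: bytes):
    """Walk ANNOTATIONS sub-records. Record sizes round up to 4-byte
    alignment, matching the ((len+11)&~3)+4 / ((len+7)&~3)+4 arithmetic
    recovered from sub_60C580."""
    out, pos = [], 0
    while pos + 4 <= len(payload):
        (rtype,) = struct.unpack_from("<I", payload, pos)
        if rtype == 1:                       # [type:4][value:4]
            (value,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((1, value))
            pos += 8
        elif rtype == 2:                     # [type:4][key:4][len:4][data:len]
            key, dlen = struct.unpack_from("<II", payload, pos + 4)
            out.append((2, key, payload[pos + 12:pos + 12 + dlen]))
            pos += ((dlen + 11) & ~3) + 4
        elif rtype == 3:                     # [type:4][len:4][data:len]
            (dlen,) = struct.unpack_from("<I", payload, pos + 4)
            out.append((3, payload[pos + 8:pos + 8 + dlen]))
            pos += ((dlen + 7) & ~3) + 4
        else:                                # type 0 / default: [type:4]
            out.append((rtype,))
            pos += 4
    return out
```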
Sentinel (86)
Code 86 (0x56) -- EIATTR_UNKNOWN: Never emitted. Placeholder in the enum, analogous to EIATTR_ERROR (code 0).
Stub Function Kind (88)
Code 88 (0x58) -- EIATTR_STUB_FUNCTION_KIND: Indexed format, 4-byte value. Classifies the type of stub function.
Offset Size Field
------ ---- -----
0x00 4 sym_index
0x04 4 stub_kind Stub function classification enum
No explicit switch case -- passes through the default path.
Mercury Finalizer Options (90)
Code 90 (0x5A) -- EIATTR_MERCURY_FINALIZER_OPTIONS: Free format. Options for the Mercury FNLZR post-link pass. Emitted by sub_462220. Contains null-terminated key-value string pairs with a trailing CRC hash.
Payload: sequence of key-value entries followed by a hash:
Per-entry:
Offset        Size     Field
------        ----     -----
0x00          2        key_len   strlen(key) + 1 (includes null terminator)
0x02          2        val_len   strlen(val) + 1 (includes null terminator)
0x04          key_len  key_str   Null-terminated key string
0x04+key_len  val_len  val_str   Null-terminated value string
Trailer: CRC/hash (computed by sub_4305D0)
Size: sum of all entries + hash
Binary evidence: sub_462220 at line 656 calls sub_1CC85F0(v7, 90, v234, v225, *a5). Lines 640--647 show the key-value pair packing: strlen of key and value, packed as u16 lengths, followed by strcpy of both strings. The hash is computed at line 653 via sub_4305D0(0x123456, ...).
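A sketch of the per-entry packing described above (hypothetical helper; the trailing CRC from sub_4305D0 is not modeled):

```python
import struct

def pack_finalizer_kv(key: str, val: str) -> bytes:
    """Pack one key-value entry: two u16 lengths (strlen + 1, counting the
    NUL terminator) followed by both NUL-terminated strings."""
    kb = key.encode() + b"\x00"
    vb = val.encode() + b"\x00"
    return struct.pack("<HH", len(kb), len(vb)) + kb + vb
```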
Cluster Configuration (91)
Code 91 (0x5B) -- EIATTR_BLOCKS_ARE_CLUSTERS: Indexed format, flag-only. Signals that CTA blocks are clusters (every block is its own cluster).
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Address Sanitizer (92)
Code 92 (0x5C) -- EIATTR_SANITIZE: Indexed format, flag-only. Signals the kernel has been instrumented with address sanitizer.
Offset Size Field
------ ---- -----
0x00 4 sym_index
(no value -- flag only)
No explicit switch case -- passes through the default path.
Syscall Fallback (93)
Code 93 (0x5D) -- EIATTR_SYSCALLS_FALLBACK: Free format. Syscall fallback mechanism data.
Payload: structured syscall fallback data
Size: variable
No explicit switch case -- passes through the default path.
CUDA Requirements (94)
Code 94 (0x5E) -- EIATTR_CUDA_REQ: Free format. CUDA requirements descriptor specifying minimum runtime capabilities.
Payload: structured requirements data
Size: variable
No explicit switch case -- passes through the default path.
Mercury ISA Version (95)
Code 95 (0x5F) -- EIATTR_MERCURY_ISA_VERSION: Sized format (0x03). Mercury ISA version encoded in the TLV size field.
TLV header: [fmt=0x03][code=0x5F][isa_version:2]
Total record: 4 bytes (header only)
Error Last Sentinel (96)
Code 96 (0x60) -- EIATTR_ERROR_LAST: Never emitted. Upper bound sentinel for the enum range. Used for bound checks in the builder: if (attr_code > 0x2F) at line 760.
Payload Format Summary (Codes 65--96)
| Code | Name | Wire Fmt | Payload size | Payload layout |
|---|---|---|---|---|
| 65 | RESERVED_SMEM_USED | 0x04 | 4 | [sym:4] flag-only |
| 66 | RESERVED_SMEM_0_SIZE | 0x04 | 8 | [sym:4][rsmem_bytes:4] |
| 67 | UCODE_SECTION_DATA | 0x01 | var | opaque byte array |
| 68 | UNUSED_LOAD_BYTE_OFFSET | 0x01 | N*4 | u32[] .text byte offsets |
| 69 | KPARAM_INFO_V2 | 0x01 | N*12 | 12B per-param descriptors |
| 70 | SYSCALL_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 71 | SW_WAR_MEMBAR_SYS_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 72 | GRAPHICS_GLOBAL_CBANK | 0x04 | 8 | [sym:4][cbank_desc:4] |
| 73 | SHADER_TYPE | 0x04 | 8 | [sym:4][shader_type:4] |
| 74 | VRC_CTA_INIT_COUNT | 0x02 | 4 | [sym:4] count in TLV size byte |
| 75 | TOOLS_PATCH_FUNC | 0x04 | 8 | [sym:4][patch_info:4] |
| 76 | NUM_BARRIERS | 0x02 | 4 | [sym:4] count in TLV size byte |
| 77 | TEXMODE_INDEPENDENT | 0x04 | 4 | [sym:4] flag-only |
| 78 | PERF_STATISTICS | 0x01 | var | structured perf data |
| 79 | AT_ENTRY_FRAGEMENTS | 0x01 | N*4 | u32[] fragment offsets |
| 80 | SPARSE_MMA_MASK | 0x03 | 0 | bitmask in TLV size field (u16) |
| 81 | TCGEN05_1CTA_USED | 0x04 | 4 | [sym:4] flag-only |
| 82 | TCGEN05_2CTA_USED | 0x04 | 4 | [sym:4] flag-only |
| 83 | GEN_ERRBAR_AT_EXIT | 0x04 | 4 | [sym:4] flag-only |
| 84 | REG_RECONFIG | 0x04 | 4 | [sym:4] value in TLV size byte |
| 85 | ANNOTATIONS | 0x01 | var | nested TLV sub-records |
| 86 | UNKNOWN | -- | 0 | none (never emitted) |
| 87 | STACK_CANARY_TRAP_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 88 | STUB_FUNCTION_KIND | 0x04 | 8 | [sym:4][stub_kind:4] |
| 89 | LOCAL_CTA_ASYNC_STORE_OFFSETS | 0x01 | N*4 | u32[] .text byte offsets |
| 90 | MERCURY_FINALIZER_OPTIONS | 0x01 | var | key-value pairs + hash |
| 91 | BLOCKS_ARE_CLUSTERS | 0x04 | 4 | [sym:4] flag-only |
| 92 | SANITIZE | 0x04 | 4 | [sym:4] flag-only |
| 93 | SYSCALLS_FALLBACK | 0x01 | var | structured syscall data |
| 94 | CUDA_REQ | 0x01 | var | structured requirements |
| 95 | MERCURY_ISA_VERSION | 0x03 | 0 | value in TLV size field (u16) |
| 96 | ERROR_LAST | -- | 0 | none (never emitted) |
Generation Pipeline
EIATTR attributes are generated during Phase 6 of the ELF output pipeline, after all per-kernel SASS encoding and memory allocation have completed. The generation is orchestrated by two functions working in sequence.
Barrier/Register Propagation -- sub_1CC8950
Before per-entry attribute emission begins, sub_1CC8950 (2,634 bytes, called once per entry point) propagates resource requirements from callees to entry kernels via the call graph:
- Register count propagation: walks the call graph depth-first, finding the maximum register count among all callees. The verbose trace `"regcount %d for %s propagated to entry %s"` logs this.
- Barrier count creation: when a kernel's section flags contain a barrier count (bits 20--26 of `section_header + 8`) but no `EIATTR_NUM_BARRIERS` record exists, creates one and clears the section flag bits:
  Creating new EIATTR_NUM_BARRIERS and moving barcount %d
  from section flags of %s to nvinfo for entry symbol %s
- SM-version gating: uses `sub_1C97840` to check whether `EIATTR_NUM_BARRIERS` (0x4C) and `EIATTR_NUM_MBARRIERS` (0x38) are valid for the target SM version before emitting.
Master EIATTR Builder -- sub_1CC9800
The main builder function (14,764 bytes binary, 90 KB decompiled -- third largest function in the output range) constructs the complete set of .nv.info.<func> sections. It has 51 callees and is called once per compilation unit.
The builder iterates over every entry point and device function, emitting the applicable EIATTR records for each. The SM-version gating function sub_1C97840 is called before emitting each attribute to check compatibility. Observed EIATTR code checks in the builder:
| Hex code | EIATTR name | Gating condition |
|---|---|---|
| 0x04 | CTAIDZ_USED | SM-version check |
| 0x21 | EXPLICIT_CACHING | SM-version check |
| 0x1F | NEED_CNP_WRAPPER | SM-version check |
| 0x20 | NEED_CNP_PATCH | SM-version check |
| 0x2C | HAS_PRE_V10_OBJECT | SM-version check |
| 0x38 | NUM_MBARRIERS | SM-version check |
| 0x41 | RESERVED_SMEM_USED | SM-version check |
| 0x4A | VRC_CTA_INIT_COUNT | SM-version check |
| 0x4C | NUM_BARRIERS | SM-version check |
| 0x50 | SPARSE_MMA_MASK | SM-version check |
| 0x51 | TCGEN05_1CTA_USED | SM-version check |
| 0x52 | TCGEN05_2CTA_USED | SM-version check |
| 0x54 | REG_RECONFIG | SM-version check |
The SM version comes from offset +624 of the compilation state object, consistent with the SM version field at a1 + 624 observed throughout ptxas.
Weak Symbol Filtering
During linking (nvlink), three specific EIATTR codes are treated specially during weak symbol resolution. When a weak function is replaced by a stronger definition, records for these three codes are dropped using the bitmask 0x800800020000:
- Code 17 (`0x11`) -- `EIATTR_FRAME_SIZE`
- Code 35 (`0x23`) -- `EIATTR_MAX_STACK_SIZE`
- Code 47 (`0x2F`) -- `EIATTR_REGCOUNT`
The rationale: when a weak function is replaced, its resource descriptors must not contaminate the replacement's resource accounting.
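The recovered bitmask can be reproduced arithmetically: setting exactly bits 17, 35, and 47 yields the constant observed in the binary. A small sketch verifying this, with a helper name (`is_dropped_on_weak_override`) that is illustrative, not recovered:

```python
# Reconstruct the weak-symbol filter mask from the three dropped codes.
DROPPED_CODES = (0x11, 0x23, 0x2F)          # 17, 35, 47
mask = 0
for code in DROPPED_CODES:
    mask |= 1 << code
assert mask == 0x800800020000               # matches the recovered constant

def is_dropped_on_weak_override(eiattr_code: int) -> bool:
    """True if this record is discarded when a weak symbol is replaced."""
    return bool(mask >> eiattr_code & 1)
```

This confirms the mask is a straightforward one-bit-per-code set, indexed by EIATTR code number.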
Consumer Tools
cuobjdump
`cuobjdump --dump-elf-section=.nv.info` dumps raw hex bytes of the global `.nv.info` section. With `--dump-resource-usage`, it decodes EIATTR records into human-readable resource summaries (register count, shared memory, stack sizes).
nvdisasm
`nvdisasm -nvi` decodes `.nv.info` sections into named EIATTR records with decoded values. This is the primary tool for inspecting EIATTR content without writing a custom parser.
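For cases where a custom parser is wanted anyway, a minimal TLV walker can be sketched. The record layout assumed here (u8 format byte, u8 attribute code, u16 value-or-size, with `EIFMT_SVAL` payloads following inline, padded to 4 bytes) matches what nvdisasm prints for `.nv.info` sections, but it is an assumption of this sketch, not recovered from this binary:

```python
import struct

EIFMT_NVAL, EIFMT_BVAL, EIFMT_HVAL, EIFMT_SVAL = 1, 2, 3, 4

def parse_nv_info(data: bytes):
    """Walk raw .nv.info bytes, yielding (attr_code, format, value) tuples."""
    records, off = [], 0
    while off + 4 <= len(data):
        fmt, attr, val = struct.unpack_from("<BBH", data, off)
        off += 4
        if fmt == EIFMT_SVAL:                 # u16 field is a byte count
            payload = data[off:off + val]
            off += (val + 3) & ~3             # payloads are 4-byte aligned
        else:                                 # u16 field is the value itself
            payload = val
        records.append((attr, fmt, payload))
    return records
```

For example, an `EIATTR_REGCOUNT`-style HVAL record followed by a 4-byte SVAL record decodes into two tuples without consuming each other's bytes.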
cuda-gdb
The debugger uses EIATTR_TOOLS_PATCH_FUNC (code 75, 0x4B) to locate patchable function entry points for breakpoint insertion and instrumentation.
How EIATTR Drives GPU Resource Allocation
The .nv.info section is not just metadata for tools -- it is the primary input to the GPU driver's kernel launch resource allocator:
- **Register allocation:** `EIATTR_REGCOUNT` (0x2F) tells the driver how many registers each thread needs. The driver computes `max_warps_per_SM = total_registers / (regcount * warp_size)`.
- **Shared memory reservation:** `EIATTR_SMEM_PARAM_SIZE` (0x18) and `EIATTR_RESERVED_SMEM_0_SIZE` (0x42) determine how much shared memory to carve out before dynamic shared memory allocation.
- **Stack allocation:** `EIATTR_CRS_STACK_SIZE` (0x1E) and `EIATTR_MAX_STACK_SIZE` (0x23) determine per-thread stack allocation. Too small causes memory corruption; too large reduces occupancy.
- **Barrier reservation:** `EIATTR_NUM_BARRIERS` (0x4C) reserves named barrier slots. Hardware supports 16 barriers per CTA on most architectures.
- **Instruction patching:** Offset tables (`EXIT_INSTR_OFFSETS`, `S2RCTAID_INSTR_OFFSETS`, `SW*_WAR`) tell the driver which instruction words to patch at load time. This enables hardware workarounds and CTA-ID remapping for cluster launch without recompilation.
- **Cluster configuration:** `EIATTR_CTA_PER_CLUSTER` (0x3D) and `EIATTR_EXPLICIT_CLUSTER` (0x3E) control the cluster launch hardware on sm_90+, determining how many CTAs share distributed shared memory.
- **Tensor core mode:** `EIATTR_TCGEN05_1CTA_USED` (0x51) and `EIATTR_TCGEN05_2CTA_USED` (0x52) inform the driver about 5th-gen tensor core usage modes on sm_100+.
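The register-limited occupancy bound from the first bullet can be made concrete. The sketch below uses sm_75-like example figures (65,536 32-bit registers per SM, warp size 32) as assumed inputs; the real driver applies further limits (shared memory, barriers, hardware warp caps) on top of this:

```python
# Illustration of the register-limited warp bound the driver derives
# from EIATTR_REGCOUNT alone.
def max_warps_per_sm(regcount: int,
                     total_registers: int = 65536,
                     warp_size: int = 32) -> int:
    """Upper bound on resident warps imposed by per-thread register use."""
    return total_registers // (regcount * warp_size)

# A kernel using 64 registers per thread:
# 65536 // (64 * 32) == 32 resident warps at most.
```

Halving `regcount` from 128 to 64 doubles this bound, which is why the allocator's spill-vs-pressure trade-offs feed directly into launch-time occupancy.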
Binary Artifacts
Pointer Table Layout
The EIATTR name table at VA 0x23FDC20 consists of 97 entries of 16 bytes each (1,552 bytes total):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0x00 | 8 | `name_ptr` | Pointer to null-terminated EIATTR name string |
| 0x08 | 4 | `meta_lo` | Minimum toolkit version compatibility |
| 0x0C | 4 | `meta_hi` | Flags (0=legacy, 1=internal, 2=standard) |
The table is indexed directly by EIATTR code number: `entry = table_base + code * 16`.
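Given a raw dump of the mapped binary, a table entry can be decoded straight from this layout. In the sketch, `image` and `image_base` are hypothetical inputs (a byte dump of the loaded binary and the address it starts at); only the table address and entry layout come from the analysis above:

```python
import struct

ENTRY_SIZE = 16
TABLE_BASE = 0x23FDC20          # VA of the EIATTR name table in v13.0.88

def read_entry(image: bytes, image_base: int, code: int):
    """Decode one 16-byte table entry: (name_ptr, meta_lo, meta_hi)."""
    off = TABLE_BASE - image_base + code * ENTRY_SIZE
    name_ptr, meta_lo, meta_hi = struct.unpack_from("<QII", image, off)
    return name_ptr, meta_lo, meta_hi
```

`name_ptr` would then be dereferenced the same way (pointer minus `image_base`) to recover the name string.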
Typos Preserved in the Binary
| String in binary | Correct spelling | Address |
|---|---|---|
| `EIATTR_AT_ENTRY_FRAGEMENTS` | `EIATTR_AT_ENTRY_FRAGMENTS` | 0x23FCCBD (code 79 name) |
A corrected variant EIATTR_AT_ENTRY_FRAGMENTS exists at 0x2405DA1, and EIATTR_COROUTINE_RESUME_ID_OFFSETS at 0x24064D8 is an alternate name for code 58, both outside the main table.
Diagnostic Strings
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for entry symbol %s" (0x2406960)
"Creating new EIATTR_NUM_BARRIERS and moving barcount %d
from section flags of %s to nvinfo for non-entry symbol %s" (0x24069D0)
"Creating new EIATTR_NUM_BARRIERS and propagating higher
barcount %d from section flags of %s to nvinfo
for entry symbol %s" (0x2406B10)
"conflicting crs_stack attribute" (sub_1CC9800 evidence)
"Turning caching %s for entry '%s' as per its request" (sub_1CC9800 evidence)
"regcount %d for %s propagated to entry %s" (sub_1CC8950 evidence)
"no regcount?" (sub_1CC8950 evidence)
Key Functions
| Address | Size | Identity | Role |
|---|---|---|---|
| sub_1CC9800 | 14,764 B | Master EIATTR builder | Constructs all .nv.info.<func> sections (90 KB decompiled, 51 callees) |
| sub_1CC8950 | 2,634 B | Barrier/register propagator | Propagates resource counts across call graph |
| sub_1CC85F0 | ~180 B | TLV record emitter | Writes individual EIATTR records to the nvinfo linked list |
| sub_1C97840 | ~100 B | SM-version gate | Checks if an EIATTR code is valid for a given SM target |
| sub_1CC86D0 | ~600 B | Per-entry stack emitter | Emits MIN_STACK_SIZE (0x12), CRS_STACK_SIZE (0x1E), SAM_REGION_STACK_SIZE (0x3B) per function |
| sub_1CC84A0 | ~400 B | EIATTR helper | Attribute lookup helper |
| sub_1CC83F0 | ~200 B | EIATTR helper | Section flag extractor |
| sub_1CC8100 | ~1 KB | Cache conflict resolver | Resolves conflicting cache preference attributes |
Cross-References
- ELF/Cubin Output -- Phase 6 in the 11-phase output pipeline
- Custom ELF Emitter -- Section creation and layout
- Synchronization & Barriers -- Barrier count source
- Register Allocation -- REGCOUNT source
Glossary
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Quick-reference for terms used throughout this wiki. Each entry links to the primary page with full details.
| Term | Definition |
|---|---|
| Barrier | Hardware synchronization primitive that blocks threads until a condition is met. PTXAS inserts and optimizes barriers via dedicated passes. See Synchronization & Barriers. |
| BMMA | Binary Matrix Multiply-Accumulate — tensor core operation on 1-bit inputs. Part of the WMMA/GMMA family. See Tensor Core Intrinsics. |
| BSSY | Barrier Set Synchronization — SASS instruction that sets a convergence barrier for divergent control flow. Paired with BSYNC. See Scoreboards & Dependency Barriers. |
| BSYNC | Barrier Synchronize — SASS instruction that waits on a convergence barrier set by BSSY. See Scoreboards & Dependency Barriers. |
| Capmerc | Capsule Mercury — an ELF section (.nv.capmerc) embedding a secondary Mercury-encoded representation of the kernel for debug metadata and binary patching. See Capsule Mercury & Finalization. |
| CGA | Cooperative Grid Array — Hopper+ hardware grouping of thread blocks that can synchronize and share distributed shared memory. See Ada & Hopper. |
| Convergence | The point where divergent warp threads rejoin a common execution path, marked by BSSY/BSYNC pairs in SASS. See Predication. |
| Cubin | CUDA Binary — the ELF-based output format produced by ptxas, containing .text (SASS), .nv.info, .nv.constant0, and other NVIDIA-specific sections. See ELF/Cubin Output. |
| DAG | Directed Acyclic Graph — the core data structure within Ori IR basic blocks; instructions form a DAG of def-use edges rather than a flat list. See IR Overview & Design. |
| DEPBAR | Dependency Barrier — SASS instruction (DEPBAR) that stalls until a scoreboard counter reaches a threshold, enforcing producer-consumer ordering. See Scoreboards & Dependency Barriers. |
| Divergence | When threads within a warp take different control-flow paths, requiring the hardware to serialize execution. PTXAS manages divergence through predication and BSSY/BSYNC insertion. See Predication. |
| DMMA | Double-precision Matrix Multiply-Accumulate — FP64 tensor core operation available on sm_80+. See Tensor Core Intrinsics. |
| DynBatch | Dynamic Batch — one of the instruction scheduler's two modes (alongside ReduceReg), which batches independent instructions to maximize ILP. See Scheduler Architecture. |
| EIATTR | Extended Info Attributes — per-kernel metadata in .nv.info sections: tag-length-value records carrying register counts, barrier usage, shared memory sizes, and other properties consumed by the CUDA runtime and driver. See EIATTR Attribute Catalog. |
| ELFW | PTXAS's custom ELF writer (sub_1C9F280, 97 KB) — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions. See Custom ELF Emitter. |
| Fatpoint | The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers; the allocator maps these sets to physical registers. See Fatpoint Algorithm. |
| HMMA | Half-precision Matrix Multiply-Accumulate — FP16 tensor core operation, the original WMMA instruction class from Volta/Turing. See Tensor Core Intrinsics. |
| IMMA | Integer Matrix Multiply-Accumulate — INT8/INT4 tensor core operation. See Tensor Core Intrinsics. |
| Knob | An internal tuning parameter (1,294 total) stored as a ROT13-obfuscated string in the binary, read from environment variables or INI-format knob files. Controls per-pass thresholds, feature toggles, and scheduler behavior. See Knobs System. |
| MEMBAR | Memory Barrier — SASS instruction that enforces memory ordering across threads, CTAs, or the GPU. See Synchronization & Barriers. |
| MercConverter | The subsystem that converts abstract Ori IR instructions into Mercury-compatible instruction objects for SASS encoding. Part of instruction selection. See Instruction Selection. |
| Mercury | The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words via ~4,000 per-variant handler functions. See Mercury Encoder. |
| MovPhi | A pseudo-instruction in the Ori IR that represents SSA phi-node moves — parallel copies resolved during register allocation and out-of-SSA conversion. See IR Overview & Design. |
| NvOptRecipe | NVIDIA Optimization Recipe — a predefined sequence of optimization phases selected by optimization level. The PhaseManager reads the recipe to determine which phases run and in what order. See Optimization Levels. |
| Occupancy | The ratio of active warps to the maximum warps a streaming multiprocessor can support, determined by register count, shared memory usage, and barrier count. Higher occupancy helps hide memory latency. See Allocator Architecture. |
| OCG | Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization and codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings. See Optimization Pipeline. |
| Opex | Operand Expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields (virtual registers, immediates, address modes to bit patterns). See SASS Code Generation. |
| Ori IR | PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym. See IR Overview & Design. |
| PhaseManager | The infrastructure class (sub_C62720) that drives the 159-phase optimization pipeline: a factory, vtable dispatch, execute/isNoOp/getName interface. See Phase Manager Infrastructure. |
| Pipeline progress counter | A hardware counter (Hopper+) that tracks the stage of an asynchronous pipeline operation, used by cp.async and TMA to overlap compute with memory transfers. See Ada & Hopper. |
| PTX | Parallel Thread Execution — NVIDIA's virtual ISA for GPU compute. The textual input format consumed by ptxas. See PTX Instruction Table. |
| QMMA | Quarter-precision Matrix Multiply-Accumulate — FP8 (E4M3/E5M2) tensor core operation available on Hopper+. See Tensor Core Intrinsics. |
| Register pressure | The number of live virtual registers at a program point relative to the physical register file capacity. High pressure causes spilling. See Allocator Architecture. |
| Remat | Rematerialization — recomputing a value instead of spilling and reloading it, trading ALU cycles for register file pressure reduction. See Rematerialization. |
| ROT13 | The trivial Caesar cipher (rotate-13) used to obfuscate all 1,294 knob name strings in the ptxas binary. Decoded at lookup time by GetKnobIndex. See Knobs System. |
| SASS | Shader Assembly — NVIDIA's native GPU machine code. The binary output produced by ptxas, encoded as 128-bit instruction words. See SASS Opcode Catalog. |
| Scoreboard | A hardware dependency-tracking mechanism (6 barriers on pre-Hopper, more on Hopper+) that enforces ordering between long-latency producers and their consumers. Managed by DEPBAR instructions. See Scoreboards & Dependency Barriers. |
| sm_backend | The per-architecture codegen backend selected by --gpu-name. Each SM family (Turing, Ampere, Ada, Hopper, Blackwell) has distinct encoding tables, latency profiles, and feature gates. See SM Architecture Map. |
| Spill | Storing a live register value to local memory when the allocator cannot fit all live values into the physical register file. Spills degrade performance significantly on GPUs. See Spilling. |
| tcgen05 | Fifth-generation tensor core instruction set on Blackwell (sm_100+). Replaces WMMA/GMMA with a new ISA for matrix operations. See TCGen05. |
| TMA | Tensor Memory Accelerator — Hopper+ hardware unit that performs bulk asynchronous copies between global and shared memory with address generation offloaded from the SM. See Ada & Hopper. |
| UFT | Uniform Function Table — a data structure in the CUBIN that maps function indices to code offsets, used by the driver for indirect call dispatch. See ELF/Cubin Output. |
| UDT | Uniform Data Table — a companion to UFT that maps data indices to constant bank offsets within the CUBIN. See ELF/Cubin Output. |
| Warpgroup | A Hopper+ scheduling unit consisting of 4 warps (128 threads) that execute WGMMA and other warpgroup-level instructions collectively. See Ada & Hopper. |
| WGMMA | Warpgroup Matrix Multiply-Accumulate — Hopper+ tensor core instruction that operates at warpgroup granularity (4 warps), supporting asynchronous execution with pipeline progress counters. See GMMA/WGMMA Pipeline. |