PTXAS v13.0 — Reverse Engineering Reference

Purpose: reimplementation-grade documentation of NVIDIA's PTX-to-SASS assembler, recovered entirely from static analysis of the stripped x86-64 binary.

PTX (Parallel Thread Execution) is NVIDIA's virtual ISA for GPU compute. SASS (Shader Assembly) is the native machine code executed by GPU hardware. PTXAS is the binary that transforms PTX into SASS. At 37.7 MB stripped, it is a fully proprietary compiler with no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, every data structure, and every encoding table was built in-house by NVIDIA. This wiki documents its internal architecture using IDA Pro 8.x and Hex-Rays decompilation.

Version note: All addresses and binary offsets in this wiki apply to ptxas v13.0.88 (CUDA Toolkit 13.0). Other versions will have different addresses.

Binary:            ptxas v13.0.88, 37,741,528 bytes, x86-64, stripped
Build:             cuda_13.0.r13.0/compiler.36424714_0 (Aug 20 2025)
Decompilation:     40,185 functions, IDA Pro 8.x + Hex-Rays
Strings:           30,632 extracted
Call graph:        548,693 edges
Version string:    Cuda compilation tools, release 13.0, V13.0.88 (sub_612DE0)
LLVM code:         None — fully proprietary compiler
Default target:    sm_75 (Turing)
Supported SMs:     sm_75 through sm_121f (Turing through DGX Spark)
Internal codenames: OCG (Optimizing Code Generator), Mercury (SASS encoder)

Glossary

Ori IR: PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym.
Mercury: The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words. Named in NVIDIA source paths and error strings.
OCG: Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization+codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings.
Fatpoint: The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers. The allocator works by computing these sets and mapping them to physical registers.
Opex: Operand expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields. Converts virtual register references, immediates, and address modes into the bit patterns Mercury expects.
Capmerc: Capsule Mercury — an ELF section (.nv.capmerc) that embeds a secondary Mercury-encoded representation of the kernel alongside the primary .text section. Used for debug metadata and binary patching support.
ELFW: PTXAS's custom ELF writer (sub_1C9F280, 97 KB). Not a standard library — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions.
EIATTR: Extended Info Attributes — per-kernel metadata encoded in .nv.info sections. Each attribute is a tag-length-value record carrying register counts, barrier usage, shared memory sizes, CRS stack depth, and other kernel properties consumed by the CUDA runtime and driver.
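The EIATTR tag-length-value encoding lends itself to a compact reader. The sketch below assumes the record layout commonly reported for .nv.info sections (a format byte, an attribute byte, and a 16-bit little-endian value-or-size field); the EIFMT_* constants and the sample attribute codes in the test are illustrative, not recovered from this binary:

```python
import struct

# Illustrative format codes, following the layout commonly reported
# for .nv.info sections; not recovered from this binary.
EIFMT_NVAL = 0x01   # no payload; the 16-bit field is ignored
EIFMT_HVAL = 0x03   # the 16-bit field is the value itself
EIFMT_SVAL = 0x04   # the 16-bit field is a byte count; payload follows

def parse_nv_info(data: bytes):
    """Decode (format, attr, value) TLV records from a .nv.info blob."""
    records, off = [], 0
    while off + 4 <= len(data):
        fmt, attr, val = struct.unpack_from("<BBH", data, off)
        off += 4
        if fmt == EIFMT_SVAL:        # val counts the payload bytes
            records.append((fmt, attr, data[off:off + val]))
            off += val
        else:                        # NVAL/HVAL carry at most 16 bits
            records.append((fmt, attr, val))
    return records
```

Each record is self-describing, so a consumer (or a reimplementation of the emitter) can skip attributes it does not recognize.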

Three Subsystems

PTXAS is not a monolithic assembler. It decomposes into three largely independent subsystems with distinct coding conventions, data structures, and lineages:

1. PTX Frontend (~3 MB, 0x400000--0x5AA000) — A Flex-generated DFA scanner (sub_720F00, 64 KB, ~552 rules) feeds tokens into a Bison-generated LALR(1) parser (sub_4CE6B0, 48 KB). The parser is driven from sub_446240 (the real main, 11 KB), which orchestrates the full pipeline: parse, DAGgen, OCG, ELF, DebugInfo. The frontend also contains 1,141 instruction descriptors registered via sub_46E000 (93 KB) that define accepted type combinations for every PTX opcode, 608 CUDA runtime intrinsics registered in sub_5D1660 (46 KB), and a suite of per-instruction semantic validators (0x460000--0x4D5000) that check architecture requirements, type compatibility, and operand constraints before lowering. See PTX Parser and Entry Point & CLI.

2. Ori Optimizer (~8 MB, 0x5AA000--0xC52000) — A proprietary 159-phase optimization pipeline managed by the PhaseManager (sub_C62720). The phase factory at sub_C60D30 is a 159-case switch that allocates polymorphic phase objects from a table of vtables at off_22BD5C8. Each phase has virtual methods for execute(), isNoOp(), and getName(). Major subsystems include: a fatpoint-based register allocator (sub_957160 core, sub_95DC10 driver, sub_926A30 interference graph builder), a 3-phase instruction scheduler (sub_688DD0 with ReduceReg/DynBatch modes and 9 register pressure counters), copy propagation, strength reduction, predication (if-conversion), rematerialization, and GMMA/WGMMA pipelining. The pipeline reads its default phase ordering from a 159-entry table at 0x22BEEA0. See Optimization Pipeline and Phase Manager.

3. SASS Backend (~14 MB, 0xC52000--0x1CE3000) — The Mercury encoder generates native SASS binary code. Instruction encoding is handled by ~4,000 per-variant handler functions (683 + 678 = 1,361 in the SM100 Blackwell encoding tables alone at 0xED1000--0x107B000, with additional tables for other SM generations). Each handler follows a rigid template: set opcode ID, load a 128-bit encoding format descriptor via SIMD, initialize a 10-slot register class map, register operand descriptors via sub_7BD3C0/sub_7BD650/sub_7BE090, finalize with sub_7BD260, then extract bitfields from the packed instruction word. The backend also contains 3 peephole optimizers (the PeepholeOptimizer class at 0x7A5D10 with Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, and SchedulingAwarePatterns methods), a capsule Mercury ELF embedder for debug metadata (sub_1CB53A0, section .nv.capmerc), and a custom ELF emitter (sub_1C9F280, 97 KB) that builds the final CUBIN output. See SASS Code Generation, Mercury Encoder, and Peephole Optimization.
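To make the fatpoint idea from the optimizer summary concrete: a fatpoint is a program point paired with its set of simultaneously live virtual registers, which over a straight-line block falls out of a backward liveness walk. The following is an illustration of the concept only, not the recovered sub_957160 algorithm:

```python
def fatpoints(block, live_out):
    """Backward liveness over a straight-line block.  Each instruction
    is a (defs, uses) pair over virtual register names; the result pairs
    each instruction index with the set live just before it:
    live_in(i) = (live_out(i) - defs(i)) | uses(i)."""
    live = set(live_out)
    points = []
    for idx in range(len(block) - 1, -1, -1):
        defs, uses = block[idx]
        live = (live - set(defs)) | set(uses)
        points.append((idx, frozenset(live)))
    points.reverse()
    return points

def max_pressure(points):
    """Peak simultaneous liveness: a lower bound on registers needed."""
    return max((len(s) for _, s in points), default=0)
```

Mapping each fatpoint's live set to physical registers, subject to the peak pressure staying within the target's register budget, is what the allocator proper then has to solve.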

Additionally, the binary embeds a custom pool allocator (sub_424070, 3,809 callers), MurmurHash3-based hash maps (sub_426150 insert / sub_426D60 lookup), a thread pool with pthread-based parallel compilation support, and a GNU Make jobserver client for integration with build systems.
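For reference, the public MurmurHash3 x86 32-bit algorithm that those hash maps are based on can be transcribed as follows; whether the binary uses this exact variant, seed, or finalizer is not confirmed here:

```python
MASK32 = 0xFFFFFFFF

def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Public-domain MurmurHash3_x86_32, transcribed to Python."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    # Body: mix each full 4-byte little-endian chunk into the state.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = (_rotl32(h, 13) * 5 + 0xE6546B64) & MASK32
    # Tail: up to 3 trailing bytes (mixing k == 0 is a no-op, so the
    # guard is behavior-preserving).
    k = 0
    for b in reversed(data[n - n % 4:]):
        k = (k << 8) | b
    if k:
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Finalizer (fmix32): xor in the length, then avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h
```

A reimplementation that needs bit-identical hash table layouts would have to confirm the seed and variant against sub_426150/sub_426D60 directly.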

Compilation Pipeline

Both standalone and library-mode invocations converge on the same pipeline, visible in the timing strings emitted by sub_446240:

PTX text (.ptx file or string)
  |
  +-- Flex Scanner (sub_720F00, 64KB)
  |     552-rule DFA, off_203C020 transition table
  |     Tokens: 340+ terminal symbols for Bison grammar
  |
  +-- Bison LALR(1) Parser (sub_4CE6B0, 48KB)
  |     Semantic validators: 0x460000-0x4D5000
  |     1,141 instruction descriptors via sub_46E000
  |
  +-- Ori IR Construction (DAGgen phase)
  |     Internal representation: basic blocks + instruction DAG
  |     608 CUDA runtime intrinsics (sub_5D1660)
  |
  +-- 159-Phase Optimization Pipeline (PhaseManager, sub_C62720)
  |     Phase factory: sub_C60D30 (159-case switch)
  |     Fatpoint register allocator (sub_957160)
  |     3-phase instruction scheduler (sub_688DD0)
  |     Copy propagation, CSE, strength reduction, predication,
  |     rematerialization, GMMA pipelining, late legalization
  |
  +-- Mercury SASS Encoder
  |     Instruction encoding: ~4000 per-variant handlers
  |     3 peephole optimizers (PeepholeOptimizer at 0x7A5D10)
  |     WAR hazard resolution (sub_6FC240)
  |     Operand expansion (Opex pipeline)
  |
  +-- ELF/CUBIN Output (sub_1C9F280, 97KB)
        Sections: .text, .nv.constant0, .nv.info, .symtab
        Capsule Mercury: .nv.capmerc (debug metadata)
        DWARF: .debug_line, .debug_info, .debug_frame

The driver at sub_446240 reports per-stage timing: Parse-time, CompileUnitSetup-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time, plus PeakMemoryUsage in KB. For multi-entry PTX files, each compile unit is processed independently with the header "\nCompile-unit with entry %s".
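The Mercury encoding step in the diagram ultimately reduces to packing and extracting bitfields of a 128-bit instruction word, which Python's arbitrary-precision integers model directly. A minimal sketch; the field positions below are hypothetical placeholders, the real per-SM layouts live in the encoding tables at 0xED1000--0x107B000:

```python
def extract_field(word: int, lo: int, width: int) -> int:
    """Read `width` bits starting at bit `lo` of a packed 128-bit word."""
    return (word >> lo) & ((1 << width) - 1)

def insert_field(word: int, lo: int, width: int, value: int) -> int:
    """Write `value` into the same field, leaving all other bits intact."""
    mask = ((1 << width) - 1) << lo
    return (word & ~mask) | ((value << lo) & mask)

# Hypothetical (lo, width) field positions, for illustration only.
OPCODE  = (0, 12)
PRED    = (12, 4)
DST_REG = (16, 8)
```

Each of the ~4,000 per-variant handlers effectively performs a fixed sequence of such insertions driven by its 128-bit format descriptor.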

Dual Compilation Modes

PTXAS operates in two modes selected at invocation:

               Standalone CLI                             Library Mode
Invocation     ptxas [options] file.ptx                   Called from nvcc/nvlink as a subprocess
Entry          main at 0x409460                           sub_9F63D0 (library/ftrace entry)
Real driver    sub_446240 (11 KB)                         Same pipeline, alternate setup
Input          PTX file on disk                           PTX string via --input-as-string
Output         .cubin / .o file                           Binary blob returned to caller
Usage string   "Usage : %s [options] <ptx file>,...\n"    N/A

The main function (0x409460, 84 bytes) is a thin wrapper: it stores argv[0], sets stdout/stderr to unbuffered via setvbuf, and delegates to sub_446240. The --input-as-string flag allows PTX source to be passed directly as a CLI argument rather than read from a file.

Configuration

PTXAS exposes three layers of configuration:

CLI Options (~100 flags) — Registered in sub_432A00 and parsed by sub_434320. Key options include --gpu-name (target SM), --maxrregcount (register limit), --opt-level (0--4), --verbose, --warn-on-spills, --warn-on-local-memory-usage, --fast-compile, --fdevice-time-trace (Chrome trace JSON output), --compile-as-tools-patch (sanitizer mode), and --extensible-whole-program. Help is printed by sub_403588, which calls sub_1C97640 to enumerate all registered options.

Internal Knobs (1,294 ROT13-encoded entries) — A separate configuration system implemented in generic_knobs_impl.h (source path recovered: /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h). The knob table is populated by two massive static constructors: ctor_005 at 0x40D860 (80 KB, ~2,000 general OCG knobs) and ctor_007 at 0x421290 (8 KB, 98 Mercury scheduler knobs). All knob names are ROT13-obfuscated in the binary. Examples after decoding: MercuryUseActiveThreadCollectiveInsts, MercuryTrackMultiReadsWarLatency, MercuryPresumeXblockWaitBeneficial, ScavInlineExpansion, ScavDisableSpilling. Knobs are read from environment variables and knob files via ReadKnobsFile (sub_79D070) which parses [knobs]-header INI files. Lookup is performed by GetKnobIndex (sub_79B240) with inline ROT13 decoding and case-insensitive comparison. See Knobs System.
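The knob-name obfuscation and file format described above are simple enough to sketch end to end. The following mirrors the recovered behavior in spirit only: the real GetKnobIndex decodes ROT13 inline during comparison, and the real ReadKnobsFile surely accepts more syntax than this minimal [knobs]-header INI reader:

```python
import codecs

def rot13(s: str) -> str:
    """Decode (or encode; ROT13 is its own inverse) a knob name."""
    return codecs.encode(s, "rot_13")

def get_knob_index(table, name):
    """Case-insensitive lookup of a plaintext knob name against a list
    of ROT13-obfuscated table entries.  Returns -1 if absent."""
    target = name.lower()
    for i, enc in enumerate(table):
        if rot13(enc).lower() == target:
            return i
    return -1

def read_knobs_file(text: str) -> dict:
    """Minimal [knobs]-header INI reader: name -> value strings."""
    knobs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        name, sep, value = line.partition("=")
        if sep:
            knobs[name.strip()] = value.strip()
    return knobs
```

For example, rot13("ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf") recovers MercuryUseActiveThreadCollectiveInsts, one of the decoded knob names listed above.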

SM Profile Tables — Per-architecture capability maps initialized by sub_607DB0 (14 KB) which creates 7 hash maps indexing sm_XX / compute_XX strings to handler functions. Profile objects are constructed by sub_6765E0 (54 KB) with architecture-to-family mappings (sm_75 -> Turing, sm_80/86/87/88 -> Ampere, sm_89 -> Ada Lovelace, sm_90/90a -> Hopper, sm_100/100a/100f -> Blackwell, sm_103/103a/103f -> Blackwell Ultra, sm_110/110a/110f -> Jetson Thor, sm_120/120a/120f -> RTX 50xx, sm_121/121a/121f -> DGX Spark). See SM Architecture Map.
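The architecture-to-family mapping above is a straightforward table keyed on the numeric part of the target string. A minimal sketch, transcribing the sub_6765E0 mappings listed above (the suffix-stripping rule for the a/f variants is an assumption about string handling, not recovered code):

```python
# Family table transcribed from the sub_6765E0 mappings.
SM_FAMILY = {
    75: "Turing",
    80: "Ampere", 86: "Ampere", 87: "Ampere", 88: "Ampere",
    89: "Ada Lovelace",
    90: "Hopper",
    100: "Blackwell", 103: "Blackwell Ultra",
    110: "Jetson Thor", 120: "RTX 50xx", 121: "DGX Spark",
}

def sm_family(target: str) -> str:
    """Map an sm_XX / compute_XX[a|f] target string to its family name.
    Suffix letters (arch-specific 'a', family 'f') are stripped first."""
    num = target.split("_", 1)[1].rstrip("af")
    return SM_FAMILY.get(int(num), "unknown")
```

The real profile objects of course carry far more than a name: capability bits, encoder table selection, and feature gates all hang off the resolved family.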

Reading This Wiki

The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with GPU compiler experience.

Section Index

Overview

Compilation Pipeline

Ori IR — Internal Representation

Optimization Passes

Register Allocation

Instruction Scheduling

SASS Code Generation

GPU Architecture Targets

CUDA Intrinsics

ELF/Cubin Output

Configuration

Infrastructure

Reference

Reading Path 1: End-to-End Pipeline Understanding

Goal: understand how PTX text becomes SASS binary, what each stage does, and how control flows between subsystems.

  1. Pipeline Overview — The complete flow diagram. Establishes all stages and their address ranges.
  2. Entry Point & CLI — How ptxas is invoked, the ~100 CLI flags, and the sub_446240 driver function.
  3. PTX Parser — The Flex scanner and Bison parser. How PTX text becomes an internal parse tree.
  4. PTX-to-Ori Lowering — How the parse tree is lowered to Ori IR (basic blocks + instruction DAG).
  5. Optimization Pipeline — The 159-phase PhaseManager. Phase factory, ordering, timing infrastructure.
  6. SASS Code Generation — Mercury encoder, instruction selection, operand expansion, peephole.
  7. ELF/Cubin Output — Custom ELF emitter, section layout, DWARF debug info, capsule Mercury.

Reading Path 2: Reimplementing a Specific Pass

Goal: reproduce the exact behavior of one optimization phase deeply enough to write a compatible replacement.

  1. Pass Inventory & Ordering — Locate the phase in the 159-entry table. Note its index, vtable address, and pipeline position.
  2. The phase's dedicated page (e.g., Copy Propagation & CSE, Predication). Every dedicated page contains the function address, decompiled algorithm, data flow, and controlling knobs.
  3. Knobs System — Find which ROT13 knobs control the phase's behavior (enable/disable toggles, thresholds).
  4. Ori IR Overview — Understand the IR data structures the phase operates on.
  5. Register Model — The R/UR/P/UP register classes and their constraints.
  6. Function Map — Cross-reference internal function addresses with the master function map.

Reading Path 3: Debugging Correctness

Goal: diagnose a miscompilation, crash, or incorrect SASS output by tracing the problem to a specific phase.

  1. DUMPIR & NamedPhases — How to dump IR at specific pipeline points. Use DUMPIR to observe the IR before and after each phase.
  2. Optimization Levels — Compare phase pipelines at different O-levels. If a bug appears at -O2 but not -O1, the diff identifies suspect phases.
  3. Pipeline Overview — The pipeline is linear: Parse -> DAGgen -> OCG (159 phases) -> Mercury -> ELF. The stage where output first goes wrong narrows the search.
  4. Knobs System — Check whether the suspect phase has enable/disable knobs. Toggle them to confirm or rule out the phase.
  5. Instruction Scheduling and Scoreboards & Dependency Barriers — If the generated SASS hangs or produces wrong results under specific warp configurations, the scheduler or barrier insertion may be at fault.

Reading Path 4: Tuning Performance

Goal: understand what ptxas does at each optimization level and what knobs control aggressiveness.

  1. Optimization Levels — The O-level to phase mapping, including --fast-compile tiers.
  2. Knobs System — The 1,294 ROT13-encoded internal tuning parameters. The primary mechanism for fine-grained control.
  3. Register Allocation — The fatpoint allocator directly determines register count, which determines maximum occupancy.
  4. Instruction Scheduling — The scheduler's ReduceReg and DynBatch modes, WAR hazard resolution, and interaction with register pressure.
  5. Peephole Optimization — The 3 peephole dispatchers that perform late SASS-level rewrites.
  6. SM Architecture Map — Per-SM feature gates that influence code generation decisions.