PTXAS v13.0 — Reverse Engineering Reference

Purpose: reimplementation-grade documentation of NVIDIA's PTX-to-SASS assembler, recovered entirely from static analysis of the stripped x86-64 binary.

PTX (Parallel Thread Execution) is NVIDIA's virtual ISA for GPU compute. SASS (Shader Assembly) is the native machine code executed by GPU hardware. PTXAS is the binary that transforms PTX into SASS. At 37.7 MB stripped, it is a fully proprietary compiler with no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, every data structure, and every encoding table was built in-house by NVIDIA. This wiki documents its internal architecture using IDA Pro 8.x and Hex-Rays decompilation.

Version note: All addresses and binary offsets in this wiki apply to ptxas v13.0.88 (CUDA Toolkit 13.0). Other versions will have different addresses.

Binary:            ptxas v13.0.88, 37,741,528 bytes, x86-64, stripped
Build:             cuda_13.0.r13.0/compiler.36424714_0 (Aug 20 2025)
Decompilation:     40,185 functions, IDA Pro 8.x + Hex-Rays
Strings:           30,632 extracted
Call graph:        548,693 edges
Version string:    Cuda compilation tools, release 13.0, V13.0.88 (sub_612DE0)
LLVM code:         None — fully proprietary compiler
Default target:    sm_75 (Turing)
Supported SMs:     sm_75 through sm_121f (Turing through DGX Spark)
Internal codenames: OCG (Optimizing Code Generator), Mercury (SASS encoder)

Glossary

Ori IR: PTXAS's internal intermediate representation — basic blocks containing an instruction DAG with typed virtual registers. Named after recovered debug strings; not an acronym.
Mercury: The SASS binary encoder subsystem. Converts abstract instruction objects into 128-bit packed machine words. Named in NVIDIA source paths and error strings.
OCG: Optimizing Code Generator — NVIDIA's internal name for the ptxas optimization+codegen pipeline (the 159-phase core). Appears in knob prefixes and timing strings.
Fatpoint: The register allocation algorithm used by ptxas. A fatpoint is a program point annotated with the set of simultaneously live virtual registers. The allocator works by computing these sets and mapping them to physical registers.
Opex: Operand expansion — a late pipeline stage that expands abstract operands into concrete SASS encoding fields. Converts virtual register references, immediates, and address modes into the bit patterns Mercury expects.
Capmerc: Capsule Mercury — an ELF section (.nv.capmerc) that embeds a secondary Mercury-encoded representation of the kernel alongside the primary .text section. Used for debug metadata and binary patching support.
ELFW: PTXAS's custom ELF writer (sub_1C9F280, 97 KB). Not a standard library — a bespoke emitter that builds CUBIN files with NVIDIA-specific sections, relocations, and symbol conventions.
EIATTR: Extended Info Attributes — per-kernel metadata encoded in .nv.info sections. Each attribute is a tag-length-value record carrying register counts, barrier usage, shared memory sizes, CRS stack depth, and other kernel properties consumed by the CUDA runtime and driver.
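The EIATTR tag-length-value encoding lends itself to a compact reader. The sketch below assumes the record layout commonly reported for .nv.info sections (a format byte, an attribute byte, and a 16-bit little-endian value-or-size field); the EIFMT_* constants and the sample attribute codes in the test are illustrative, not recovered from this binary:

```python
import struct

# Illustrative format codes, following the layout commonly reported
# for .nv.info sections; not recovered from this binary.
EIFMT_NVAL = 0x01   # no payload; the 16-bit field is ignored
EIFMT_HVAL = 0x03   # the 16-bit field is the value itself
EIFMT_SVAL = 0x04   # the 16-bit field is a byte count; payload follows

def parse_nv_info(data: bytes):
    """Decode (format, attr, value) TLV records from a .nv.info blob."""
    records, off = [], 0
    while off + 4 <= len(data):
        fmt, attr, val = struct.unpack_from("<BBH", data, off)
        off += 4
        if fmt == EIFMT_SVAL:        # val counts the payload bytes
            records.append((fmt, attr, data[off:off + val]))
            off += val
        else:                        # NVAL/HVAL carry at most 16 bits
            records.append((fmt, attr, val))
    return records
```

Each record is self-describing, so a consumer (or a reimplementation of the emitter) can skip attributes it does not recognize.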

Three Subsystems

PTXAS is not a monolithic assembler. It decomposes into three largely independent subsystems with distinct coding conventions, data structures, and lineages:

1. PTX Frontend (~3 MB, 0x400000--0x5AA000) — A Flex-generated DFA scanner (sub_720F00, 64 KB, ~552 rules) feeds tokens into a Bison-generated LALR(1) parser (sub_4CE6B0, 48 KB). The parser is driven from sub_446240 (the real main, 11 KB), which orchestrates the full pipeline: parse, DAGgen, OCG, ELF, DebugInfo. The frontend also contains 1,141 instruction descriptors registered via sub_46E000 (93 KB) that define accepted type combinations for every PTX opcode, 608 CUDA runtime intrinsics registered in sub_5D1660 (46 KB), and a suite of per-instruction semantic validators (0x460000--0x4D5000) that check architecture requirements, type compatibility, and operand constraints before lowering. See PTX Parser and Entry Point & CLI.

2. Ori Optimizer (~8 MB, 0x5AA000--0xC52000) — A proprietary 159-phase optimization pipeline managed by the PhaseManager (sub_C62720). The phase factory at sub_C60D30 is a 159-case switch that allocates polymorphic phase objects from a table of vtables at off_22BD5C8. Each phase has virtual methods for execute(), isNoOp(), and getName(). Major subsystems include: a fatpoint-based register allocator (sub_957160 core, sub_95DC10 driver, sub_926A30 interference graph builder), a 3-phase instruction scheduler (sub_688DD0 with ReduceReg/DynBatch modes and 9 register pressure counters), copy propagation, strength reduction, predication (if-conversion), rematerialization, and GMMA/WGMMA pipelining. The pipeline reads its default phase ordering from a 159-entry table at 0x22BEEA0. See Optimization Pipeline and Phase Manager.

3. SASS Backend (~14 MB, 0xC52000--0x1CE3000) — The Mercury encoder generates native SASS binary code. Instruction encoding is handled by ~4,000 per-variant handler functions (683 + 678 = 1,361 in the SM100 Blackwell encoding tables alone at 0xED1000--0x107B000, with additional tables for other SM generations). Each handler follows a rigid template: set opcode ID, load a 128-bit encoding format descriptor via SIMD, initialize a 10-slot register class map, register operand descriptors via sub_7BD3C0/sub_7BD650/sub_7BE090, finalize with sub_7BD260, then extract bitfields from the packed instruction word. The backend also contains 3 peephole optimizers (the PeepholeOptimizer class at 0x7A5D10 with Init, RunOnFunction, RunOnBB, RunPatterns, SpecialPatterns, ComplexPatterns, and SchedulingAwarePatterns methods), a capsule Mercury ELF embedder for debug metadata (sub_1CB53A0, section .nv.capmerc), and a custom ELF emitter (sub_1C9F280, 97 KB) that builds the final CUBIN output. See SASS Code Generation, Mercury Encoder, and Peephole Optimization.
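To make the fatpoint idea from the optimizer summary concrete: a fatpoint is a program point paired with its set of simultaneously live virtual registers, which over a straight-line block falls out of a backward liveness walk. The following is an illustration of the concept only, not the recovered sub_957160 algorithm:

```python
def fatpoints(block, live_out):
    """Backward liveness over a straight-line block.  Each instruction
    is a (defs, uses) pair over virtual register names; the result pairs
    each instruction index with the set live just before it:
    live_in(i) = (live_out(i) - defs(i)) | uses(i)."""
    live = set(live_out)
    points = []
    for idx in range(len(block) - 1, -1, -1):
        defs, uses = block[idx]
        live = (live - set(defs)) | set(uses)
        points.append((idx, frozenset(live)))
    points.reverse()
    return points

def max_pressure(points):
    """Peak simultaneous liveness: a lower bound on registers needed."""
    return max((len(s) for _, s in points), default=0)
```

Mapping each fatpoint's live set to physical registers, subject to the peak pressure staying within the target's register budget, is what the allocator proper then has to solve.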

Additionally, the binary embeds a custom pool allocator (sub_424070, 3,809 callers), MurmurHash3-based hash maps (sub_426150 insert / sub_426D60 lookup), a thread pool with pthread-based parallel compilation support, and a GNU Make jobserver client for integration with build systems.
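For reference, the public MurmurHash3 x86 32-bit algorithm that those hash maps are based on can be transcribed as follows; whether the binary uses this exact variant, seed, or finalizer is not confirmed here:

```python
MASK32 = 0xFFFFFFFF

def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Public-domain MurmurHash3_x86_32, transcribed to Python."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    # Body: mix each full 4-byte little-endian chunk into the state.
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = (_rotl32(h, 13) * 5 + 0xE6546B64) & MASK32
    # Tail: up to 3 trailing bytes (mixing k == 0 is a no-op, so the
    # guard is behavior-preserving).
    k = 0
    for b in reversed(data[n - n % 4:]):
        k = (k << 8) | b
    if k:
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Finalizer (fmix32): xor in the length, then avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h
```

A reimplementation that needs bit-identical hash table layouts would have to confirm the seed and variant against sub_426150/sub_426D60 directly.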

Compilation Pipeline

Both standalone and library-mode invocations converge on the same pipeline, visible in the timing strings emitted by sub_446240:

PTX text (.ptx file or string)
  |
  +-- Flex Scanner (sub_720F00, 64KB)
  |     552-rule DFA, off_203C020 transition table
  |     Tokens: 340+ terminal symbols for Bison grammar
  |
  +-- Bison LALR(1) Parser (sub_4CE6B0, 48KB)
  |     Semantic validators: 0x460000-0x4D5000
  |     1,141 instruction descriptors via sub_46E000
  |
  +-- Ori IR Construction (DAGgen phase)
  |     Internal representation: basic blocks + instruction DAG
  |     608 CUDA runtime intrinsics (sub_5D1660)
  |
  +-- 159-Phase Optimization Pipeline (PhaseManager, sub_C62720)
  |     Phase factory: sub_C60D30 (159-case switch)
  |     Fatpoint register allocator (sub_957160)
  |     3-phase instruction scheduler (sub_688DD0)
  |     Copy propagation, CSE, strength reduction, predication,
  |     rematerialization, GMMA pipelining, late legalization
  |
  +-- Mercury SASS Encoder
  |     Instruction encoding: ~4000 per-variant handlers
  |     3 peephole optimizers (PeepholeOptimizer at 0x7A5D10)
  |     WAR hazard resolution (sub_6FC240)
  |     Operand expansion (Opex pipeline)
  |
  +-- ELF/CUBIN Output (sub_1C9F280, 97KB)
        Sections: .text, .nv.constant0, .nv.info, .symtab
        Capsule Mercury: .nv.capmerc (debug metadata)
        DWARF: .debug_line, .debug_info, .debug_frame

The driver at sub_446240 reports per-stage timing: Parse-time, CompileUnitSetup-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time, plus PeakMemoryUsage in KB. For multi-entry PTX files, each compile unit is processed independently with the header "\nCompile-unit with entry %s".
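The Mercury encoding step in the diagram ultimately reduces to packing and extracting bitfields of a 128-bit instruction word, which Python's arbitrary-precision integers model directly. A minimal sketch; the field positions below are hypothetical placeholders, the real per-SM layouts live in the encoding tables at 0xED1000--0x107B000:

```python
def extract_field(word: int, lo: int, width: int) -> int:
    """Read `width` bits starting at bit `lo` of a packed 128-bit word."""
    return (word >> lo) & ((1 << width) - 1)

def insert_field(word: int, lo: int, width: int, value: int) -> int:
    """Write `value` into the same field, leaving all other bits intact."""
    mask = ((1 << width) - 1) << lo
    return (word & ~mask) | ((value << lo) & mask)

# Hypothetical (lo, width) field positions, for illustration only.
OPCODE  = (0, 12)
PRED    = (12, 4)
DST_REG = (16, 8)
```

Each of the ~4,000 per-variant handlers effectively performs a fixed sequence of such insertions driven by its 128-bit format descriptor.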

Dual Compilation Modes

PTXAS operates in two modes selected at invocation:

               Standalone CLI                             Library Mode
Invocation     ptxas [options] file.ptx                   Called from nvcc/nvlink as a subprocess
Entry          main at 0x409460                           sub_9F63D0 (library/ftrace entry)
Real driver    sub_446240 (11 KB)                         Same pipeline, alternate setup
Input          PTX file on disk                           PTX string via --input-as-string
Output         .cubin / .o file                           Binary blob returned to caller
Usage string   "Usage : %s [options] <ptx file>,...\n"    N/A

The main function (0x409460, 84 bytes) is a thin wrapper: it stores argv[0], sets stdout/stderr to unbuffered via setvbuf, and delegates to sub_446240. The --input-as-string flag allows PTX source to be passed directly as a CLI argument rather than read from a file.

Configuration

PTXAS exposes three layers of configuration:

CLI Options (~100 flags) — Registered in sub_432A00 and parsed by sub_434320. Key options include --gpu-name (target SM), --maxrregcount (register limit), --opt-level (0--4), --verbose, --warn-on-spills, --warn-on-local-memory-usage, --fast-compile, --fdevice-time-trace (Chrome trace JSON output), --compile-as-tools-patch (sanitizer mode), and --extensible-whole-program. Help is printed by sub_403588, which calls sub_1C97640 to enumerate all registered options.

Internal Knobs (1,294 ROT13-encoded entries) — A separate configuration system implemented in generic_knobs_impl.h (source path recovered: /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h). The knob table is populated by two massive static constructors: ctor_005 at 0x40D860 (80 KB, ~2,000 general OCG knobs) and ctor_007 at 0x421290 (8 KB, 98 Mercury scheduler knobs). All knob names are ROT13-obfuscated in the binary. Examples after decoding: MercuryUseActiveThreadCollectiveInsts, MercuryTrackMultiReadsWarLatency, MercuryPresumeXblockWaitBeneficial, ScavInlineExpansion, ScavDisableSpilling. Knobs are read from environment variables and knob files via ReadKnobsFile (sub_79D070) which parses [knobs]-header INI files. Lookup is performed by GetKnobIndex (sub_79B240) with inline ROT13 decoding and case-insensitive comparison. See Knobs System.
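The knob-name obfuscation and file format described above are simple enough to sketch end to end. The following mirrors the recovered behavior in spirit only: the real GetKnobIndex decodes ROT13 inline during comparison, and the real ReadKnobsFile surely accepts more syntax than this minimal [knobs]-header INI reader:

```python
import codecs

def rot13(s: str) -> str:
    """Decode (or encode; ROT13 is its own inverse) a knob name."""
    return codecs.encode(s, "rot_13")

def get_knob_index(table, name):
    """Case-insensitive lookup of a plaintext knob name against a list
    of ROT13-obfuscated table entries.  Returns -1 if absent."""
    target = name.lower()
    for i, enc in enumerate(table):
        if rot13(enc).lower() == target:
            return i
    return -1

def read_knobs_file(text: str) -> dict:
    """Minimal [knobs]-header INI reader: name -> value strings."""
    knobs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        name, sep, value = line.partition("=")
        if sep:
            knobs[name.strip()] = value.strip()
    return knobs
```

For example, rot13("ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf") recovers MercuryUseActiveThreadCollectiveInsts, one of the decoded knob names listed above.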

SM Profile Tables — Per-architecture capability maps initialized by sub_607DB0 (14 KB) which creates 7 hash maps indexing sm_XX / compute_XX strings to handler functions. Profile objects are constructed by sub_6765E0 (54 KB) with architecture-to-family mappings (sm_75 -> Turing, sm_80/86/87/88 -> Ampere, sm_89 -> Ada Lovelace, sm_90/90a -> Hopper, sm_100/100a/100f -> Blackwell, sm_103/103a/103f -> Blackwell Ultra, sm_110/110a/110f -> Jetson Thor, sm_120/120a/120f -> RTX 50xx, sm_121/121a/121f -> DGX Spark). See SM Architecture Map.
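The architecture-to-family mapping above is a straightforward table keyed on the numeric part of the target string. A minimal sketch, transcribing the sub_6765E0 mappings listed above (the suffix-stripping rule for the a/f variants is an assumption about string handling, not recovered code):

```python
# Family table transcribed from the sub_6765E0 mappings.
SM_FAMILY = {
    75: "Turing",
    80: "Ampere", 86: "Ampere", 87: "Ampere", 88: "Ampere",
    89: "Ada Lovelace",
    90: "Hopper",
    100: "Blackwell", 103: "Blackwell Ultra",
    110: "Jetson Thor", 120: "RTX 50xx", 121: "DGX Spark",
}

def sm_family(target: str) -> str:
    """Map an sm_XX / compute_XX[a|f] target string to its family name.
    Suffix letters (arch-specific 'a', family 'f') are stripped first."""
    num = target.split("_", 1)[1].rstrip("af")
    return SM_FAMILY.get(int(num), "unknown")
```

The real profile objects of course carry far more than a name: capability bits, encoder table selection, and feature gates all hang off the resolved family.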

Reading This Wiki

The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with GPU compiler experience.

Section Index

Overview

Compilation Pipeline

Ori IR — Internal Representation

Optimization Passes

Register Allocation

Instruction Scheduling

SASS Code Generation

GPU Architecture Targets

CUDA Intrinsics

ELF/Cubin Output

Configuration

Infrastructure

Reference

Reading Path 1: End-to-End Pipeline Understanding

Goal: understand how PTX text becomes SASS binary, what each stage does, and how control flows between subsystems.

  1. Pipeline Overview — The complete flow diagram. Establishes all stages and their address ranges.
  2. Entry Point & CLI — How ptxas is invoked, the ~100 CLI flags, and the sub_446240 driver function.
  3. PTX Parser — The Flex scanner and Bison parser. How PTX text becomes an internal parse tree.
  4. PTX-to-Ori Lowering — How the parse tree is lowered to Ori IR (basic blocks + instruction DAG).
  5. Optimization Pipeline — The 159-phase PhaseManager. Phase factory, ordering, timing infrastructure.
  6. SASS Code Generation — Mercury encoder, instruction selection, operand expansion, peephole.
  7. ELF/Cubin Output — Custom ELF emitter, section layout, DWARF debug info, capsule Mercury.

Reading Path 2: Reimplementing a Specific Pass

Goal: reproduce the exact behavior of one optimization phase deeply enough to write a compatible replacement.

  1. Pass Inventory & Ordering — Locate the phase in the 159-entry table. Note its index, vtable address, and pipeline position.
  2. The phase's dedicated page (e.g., Copy Propagation & CSE, Predication). Every dedicated page contains the function address, decompiled algorithm, data flow, and controlling knobs.
  3. Knobs System — Find which ROT13 knobs control the phase's behavior (enable/disable toggles, thresholds).
  4. Ori IR Overview — Understand the IR data structures the phase operates on.
  5. Register Model — The R/UR/P/UP register classes and their constraints.
  6. Function Map — Cross-reference internal function addresses with the master function map.

Reading Path 3: Debugging Correctness

Goal: diagnose a miscompilation, crash, or incorrect SASS output by tracing the problem to a specific phase.

  1. DUMPIR & NamedPhases — How to dump IR at specific pipeline points. Use DUMPIR to observe the IR before and after each phase.
  2. Optimization Levels — Compare phase pipelines at different O-levels. If a bug appears at -O2 but not -O1, the diff identifies suspect phases.
  3. Pipeline Overview — The pipeline is linear: Parse -> DAGgen -> OCG (159 phases) -> Mercury -> ELF. The stage where output first goes wrong narrows the search.
  4. Knobs System — Check whether the suspect phase has enable/disable knobs. Toggle them to confirm or rule out the phase.
  5. Instruction Scheduling and Scoreboards & Dependency Barriers — If the generated SASS hangs or produces wrong results under specific warp configurations, the scheduler or barrier insertion may be at fault.

Reading Path 4: Tuning Performance

Goal: understand what ptxas does at each optimization level and what knobs control aggressiveness.

  1. Optimization Levels — The O-level to phase mapping, including --fast-compile tiers.
  2. Knobs System — The 1,294 ROT13-encoded internal tuning parameters. The primary mechanism for fine-grained control.
  3. Register Allocation — The fatpoint allocator directly determines register count, which determines maximum occupancy.
  4. Instruction Scheduling — The scheduler's ReduceReg and DynBatch modes, WAR hazard resolution, and interaction with register pressure.
  5. Peephole Optimization — The 3 peephole dispatchers that perform late SASS-level rewrites.
  6. SM Architecture Map — Per-SM feature gates that influence code generation decisions.