Methodology

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents how the reverse engineering of ptxas v13.0.88 was performed. It serves as a transparency record so readers can assess the confidence of any claim in this wiki, and as a practical guide for anyone who wants to reproduce or extend the analysis.

Scope and Scale

PTXAS is a 37.7 MB stripped x86-64 ELF binary with no debug symbols, no DWARF information, and no export table beyond 146 PLT stubs for libc, libpthread, and related system libraries. Unlike NVIDIA's cicc (which is an LLVM fork), ptxas contains no LLVM code, no EDG frontend, and no third-party optimizer components. Every pass, data structure, and encoding table is proprietary NVIDIA code. This makes the analysis harder than it is for LLVM-derived binaries -- there is no upstream source to compare against.

| Metric | Value |
|---|---|
| Binary size | 37,741,528 bytes |
| Build string | cuda_13.0.r13.0/compiler.36424714_0 |
| Total functions detected | 40,185 |
| Functions decompiled | 39,881 (99.2%) |
| Strings extracted | 30,632 |
| Call graph edges | 548,693 |
| Cross-references | 7,427,044 |
| IDA comments recovered | 66,598 |
| IDA auto-names recovered | 16,019 |
| Control flow graphs exported | 80,078 |
| PLT imports | 146 (libc, libpthread, libm, libgcc) |
| Functions with 0 static callers | 15,907 (39.6%) -- vtable-dispatched |
| Functions < 100 bytes | 11,532 (28.7%) |
| Functions > 10 KB | 86 (0.2%) |
| Named functions (not sub_*) | 319 (0.8%) |
| Internal codenames | OCG (Optimizing Code Generator), Mercury (SASS encoder), Ori (IR) |

The 304 functions that Hex-Rays could not decompile are predominantly PLT stubs, computed-jump trampolines in the Flex DFA scanner, and the four mega-dispatch functions exceeding 200 KB (too large for Hex-Rays to handle within default limits). None are in critical analysis paths -- the dispatch functions are understood from their callee lists and the PLT stubs from their import names.

Why PTXAS Is Harder Than LLVM-Based Binaries

Reverse engineering cicc (NVIDIA's LLVM-based CUDA compiler) benefits from extensive prior art: LLVM's open-source codebase provides structural templates, pass names are registered in predictable patterns, and cl::opt strings directly name their global variables. PTXAS offers none of these advantages:

  • No upstream source. Every function is identified from first principles -- string evidence, callgraph position, structural fingerprinting, or decompiled algorithm analysis. There is no reference implementation to compare against.
  • ROT13 obfuscation. Internal names for tuning knobs and PTX opcode mnemonics are ROT13-encoded in the binary, requiring decoding before they become useful anchors.
  • Vtable-heavy architecture. 39.6% of functions have zero static callers because they are dispatched through vtable pointers or function pointer tables. The call graph alone cannot reach them.
  • Template-generated code. The SASS backend contains approximately 4,000 encoding handler functions generated from templates, each structurally near-identical. These dominate the function count but carry almost no unique identifying features.
  • No pass registration infrastructure. LLVM passes register themselves via PassInfo objects with name strings. PTXAS phases are allocated by a factory switch (sub_C60D30) and their names are only visible through the NamedPhases registry and AdvancedPhase* timing strings -- far fewer anchors than LLVM's registration system.

Toolchain

All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. The entire effort is static analysis of the binary at rest -- no dynamic analysis (debugging, tracing, instrumentation) was used for function identification. Runtime tools (ptxas --stat, DUMPIR knob, --keep) were used only for validation and cross-referencing.

| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, vtable reconstruction |
| Hex-Rays decompiler | Pseudocode generation for 39,881 recovered functions |
| IDA Python scripting | Complete database extraction: all 8 JSON artifact exports |
| Custom Python script | analyze_ptxas.py: batch string, function, graph, xref, and decompilation export |
| ptxas CLI | --stat, --verbose, --compiler-stats, --fdevice-time-trace for runtime validation |
| ptxas DUMPIR knob | -knob DUMPIR=<phase> to dump IR at specific pipeline points |
| ROT13 decoder | Standard codecs.decode(s, "rot_13") for 2,000+ obfuscated knob/opcode names |

IDA Pro Setup and Initial Analysis

Loading the Binary

PTXAS is a dynamically-linked ELF with 146 PLT imports but no symbol table beyond those imports. IDA auto-analysis settings:

  1. Processor: Meta PC (x86-64)
  2. Analysis options: default. IDA correctly identifies the Flex DFA scanner tables, Bison parser tables, and the .ctors/.dtors sections.
  3. Auto-analysis time: approximately 8-10 minutes on a modern machine for the 37.7 MB binary.
  4. Compiler detection: IDA identifies GCC as the compiler. The binary uses the Itanium C++ ABI (confirmed by the embedded C++ name demangler at sub_1CDC780, 93 KB).

Post-Auto-Analysis Steps

After auto-analysis completes:

  1. Run string extraction. IDA's auto-analysis finds 30,632 strings. All are exported via the analyze_ptxas.py IDA Python script.
  2. Force function creation. Some address ranges, particularly the template-generated encoding handlers, are not automatically recognized as functions. IDA's "Create function" (P key) was applied selectively in the 0xD27000--0x1579000 range where encoding handler stubs are tightly packed.
  3. Batch decompile. The IDA Python script iterates all 40,185 detected functions and calls ida_hexrays.decompile() on each, saving per-function .c files. 39,881 succeeded; 304 failed (PLT stubs, computed-jump trampolines, and 4 mega-functions exceeding decompiler limits).
  4. Export control flow graphs. For each function, the script extracts the FlowChart (basic blocks, edges, per-instruction disassembly) as JSON. 80,078 graph files were produced.
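The batch-decompile step above can be sketched as a small driver loop. This is a minimal illustration, not the contents of analyze_ptxas.py: the function name `batch_decompile`, the `sub_<ADDR>.c` file naming, and the injected `decompile` callable are assumptions made here so the loop can also be exercised outside IDA.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def batch_decompile(function_eas: Iterable[int],
                    decompile: Callable[[int], object],
                    out_dir: Path) -> Tuple[int, List[int]]:
    """Decompile each address, write one .c file per success, collect failures."""
    out_dir.mkdir(parents=True, exist_ok=True)
    succeeded = 0
    failed: List[int] = []
    for ea in function_eas:
        try:
            pseudocode = decompile(ea)
        except Exception:
            # The decompiler raises on functions it cannot handle
            # (for example, functions that exceed its size limits).
            pseudocode = None
        if pseudocode is None:
            failed.append(ea)
            continue
        # Hypothetical naming scheme: one file per function start address.
        (out_dir / f"sub_{ea:X}.c").write_text(str(pseudocode))
        succeeded += 1
    return succeeded, failed
```

Inside IDA, the same loop would presumably be driven with `idautils.Functions()` as the address source and `ida_hexrays.decompile` as the callable. The returned failure list corresponds to the 304 functions that did not decompile.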

Type Recovery

PTXAS uses no C++ RTTI (no typeid, no dynamic_cast -- the binary has no .data.rel.ro RTTI structures). Type recovery relies on:

  • Vtable layout analysis. Each vtable is a contiguous array of function pointers in .data.rel.ro (4,256 bytes total). The table at off_22BD5C8 contains 159 entries, one per optimization phase. Each entry points to that phase's vtable, whose slots hold the phase's virtual methods.
  • Structure offset patterns. The pool allocator struct has free-list bins at offset +2128 and a mutex at +7128. The thread-local context is a 280-byte struct accessed via pthread_getspecific. These offsets were recovered from the decompiled code of sub_424070 (pool alloc, 3,809 callers) and sub_4280C0 (TLS accessor, 3,928 callers).
  • Parameter/return type propagation. Once a function's signature is established (e.g., pool_alloc(pool*, size_t) -> void*), Hex-Rays propagates types to all 3,809 call sites, improving decompilation quality throughout the binary.

String-Driven Analysis

Strings are the single most productive source of function identification in ptxas. Of the 30,632 strings extracted, several categories are particularly valuable.

ROT13-Encoded Knob Names (2,000+ entries)

PTXAS uses ROT13 encoding as a light obfuscation layer on internal configuration names. Two massive static constructors populate these tables at startup:

  • ctor_005 at 0x40D860 (80 KB) registers approximately 2,000 general OCG tuning knobs
  • ctor_007 at 0x421290 (8 KB) registers 98 Mercury scheduler knobs

Each entry pairs a ROT13-encoded name with a hex-encoded default value. Decoding examples:

| ROT13 in binary | Decoded name |
|---|---|
| ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts |
| ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency |
| ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial |
| ZrephelZretrCebybthrOybpxf | MercuryMergePrologueBlocks |
| ZrephelTraFnffHPbqr | MercuryGenSassUCode |
| FpniVayvarRkcnafvba | ScavInlineExpansion |
| FpniQvfnoyrFcvyyvat | ScavDisableSpilling |

The knob names directly reveal subsystem organization. Names prefixed with Mercury* belong to the SASS encoder. Names prefixed with Scav* belong to the register allocator's scavenger. Names like XBlockWait* and WarDeploy* belong to the instruction scheduler. The knob lookup function GetKnobIndex at sub_79B240 performs inline ROT13 decoding and case-insensitive comparison, which was itself identified by tracing the xrefs from the ROT13-encoded strings.
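The decoding in the table above is easy to reproduce with the standard library. The helper name `decode_rot13` is introduced here for illustration. ROT13 only rotates letters, so digits and punctuation pass through unchanged.

```python
import codecs


def decode_rot13(encoded: str) -> str:
    """Decode a ROT13-obfuscated knob or opcode name."""
    return codecs.decode(encoded, "rot_13")


# Encoded strings taken from the tables on this page.
for encoded in ("ZrephelTraFnffHPbqr", "FpniQvfnoyrFcvyyvat", "SZN"):
    print(f"{encoded} -> {decode_rot13(encoded)}")
# prints:
# ZrephelTraFnffHPbqr -> MercuryGenSassUCode
# FpniQvfnoyrFcvyyvat -> ScavDisableSpilling
# SZN -> FMA
```

Because ROT13 is its own inverse, the same call also produces the encoded form of a known name, which is handy when searching the string table for a knob whose plain name is already known.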

ROT13-Encoded PTX Opcode Names (~900 entries)

A third static constructor, ctor_003 at 0x4095D0 (17 KB), populates a table of ~900 ROT13-encoded PTX opcode mnemonics. Decoding examples:

| ROT13 | Decoded |
|---|---|
| NPDOHYX | ACQBULK |
| OFLAP | BSYNC |
| SZN | FMA |
| FRGC | SETP |
| ERGHEA | RETURN |
| RKVG | EXIT |

These strings are used by the PTX parser to match instruction mnemonics. Each xref from one of these strings leads to a parser action or instruction validator function.

Timing and Phase Name Strings

The compilation driver at sub_446240 emits per-stage timing via format strings:

Parse-time            : %.3f ms (%.2f%%)
CompileUnitSetup-time : %.3f ms (%.2f%%)
DAGgen-time           : %.3f ms (%.2f%%)
OCG-time              : %.3f ms (%.2f%%)
ELF-time              : %.3f ms (%.2f%%)
DebugInfo-time        : %.3f ms (%.2f%%)
PeakMemoryUsage = %.3lf KB

Tracing the xrefs from these format strings identifies the code that brackets each pipeline stage, revealing the stage boundaries within sub_446240.

The NamedPhases registry (string at 0x21B64C8, xrefs to sub_9F4040) and the AdvancedPhase* timing strings provide phase-level anchors within the 159-phase optimization pipeline:

  • AdvancedPhaseBeforeConvUnSup, AdvancedPhaseAfterConvUnSup
  • AdvancedPhaseEarlyEnforceArgs, AdvancedPhaseLateConvUnSup
  • AdvancedPhasePreSched, AdvancedPhaseAllocReg, AdvancedPhasePostSched
  • AdvancedPhaseOriPhaseEncoding, AdvancedPhasePostFixUp
  • GeneralOptimizeEarly, GeneralOptimize, GeneralOptimizeMid, GeneralOptimizeMid2
  • GeneralOptimizeLate, GeneralOptimizeLate2
  • OriPerformLiveDead, OriPerformLiveDeadFirst through OriPerformLiveDeadFourth

Each AdvancedPhase* string xrefs to exactly one call site, which is a boundary marker in the phase pipeline. These 15 markers divide the 159-phase pipeline into named segments whose boundaries were used to identify the phases between each pair of markers.

Error and Diagnostic Strings

The central diagnostic emitter sub_42FBA0 (2,350 callers) prints error messages whose text reveals the calling function's purpose. Examples:

  • "Please use -knob DUMPIR=AllocateRegisters for debugging" -- identifies the register allocator failure path at sub_9714E0
  • "SM does not support LDCU" -- identifies SM capability checking in the instruction legalizer
  • "Invalid knob identifier", "Invalid knob specified (%s)" -- identifies the knob parsing infrastructure around sub_79D070
  • "fseek() error knobsfile %s", "[knobs]" -- identifies ReadKnobsFile at sub_79D070

Source File Path

One recovered source path provides a structural anchor:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h

This string (at 0x202D4D8, 66 xrefs) is referenced from assertion checks throughout the knobs infrastructure, confirming that the knob system is a shared utility component (generic_knobs_impl.h) used across NVIDIA's compiler drivers.

Build and Version Strings

Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

The version string, referenced from sub_612DE0, identifies both the exact build and the version reporting function. The Usage : string at 0x1CE3666 identifies the usage printer. The "\nCompile-unit with entry %s" string identifies the per-kernel compilation loop within the driver.

Vtable-Driven Discovery

The Phase Vtable Table

The most productive vtable discovery was the phase vtable table at off_22BD5C8 in .rodata. This is an array of 159 pointers, each pointing to a vtable for one optimization phase class. The phase factory function at sub_C60D30 is a 159-case switch statement that allocates a 16-byte phase object and assigns the corresponding vtable from this table:

// Simplified from decompiled sub_C60D30
switch (phase_index) {
    case 0:  obj->vtable = off_22BD5C8[0];  break;
    case 1:  obj->vtable = off_22BD5C8[1];  break;
    ...
    case 158: obj->vtable = off_22BD5C8[158]; break;
}
return obj;

Each vtable contains pointers to the phase's virtual methods. The virtual method at slot 0 is execute() (the phase body). The virtual method at slot 1 is isNoOp() (returns whether the phase should be skipped). The virtual method at slot 2 is getName() (returns the phase name string).

By following each of the 159 vtable entries to their execute() slot, every optimization phase's main function was identified. The getName() slot provided the phase name for phases that implement it. For phases that return a constant empty string, the name was inferred from the NamedPhases registry or from the AdvancedPhase* timing strings that bracket the phase in the pipeline.
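Walking the phase table amounts to reading an array of 64-bit pointers and then reading the first three slots of each vtable it points to. The sketch below assumes a flat memory image in which every address maps to `address - image_base`; a real ELF needs a section-aware address translation instead. The helper names are hypothetical.

```python
import struct
from typing import Dict, List

PHASE_COUNT = 159
SLOT_NAMES = ("execute", "isNoOp", "getName")  # vtable slots 0, 1, 2


def read_qwords(image: bytes, image_base: int, addr: int, count: int) -> List[int]:
    """Read `count` little-endian 64-bit values starting at virtual address `addr`."""
    return list(struct.unpack_from(f"<{count}Q", image, addr - image_base))


def phase_methods(image: bytes, image_base: int, table_addr: int,
                  count: int = PHASE_COUNT) -> List[Dict[str, int]]:
    """Follow each table entry to its vtable and record the first three slots."""
    phases = []
    vtables = read_qwords(image, image_base, table_addr, count)
    for index, vtable in enumerate(vtables):
        slots = read_qwords(image, image_base, vtable, len(SLOT_NAMES))
        entry = {"phase": index, "vtable": vtable}
        entry.update(zip(SLOT_NAMES, slots))
        phases.append(entry)
    return phases
```

Called with the table address 0x22BD5C8 and a suitable image, this would yield one record per phase whose `execute` field is the address of the phase body.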

Encoding Handler Vtables

The SASS backend uses vtable dispatch for instruction encoding. Each SASS opcode variant has its own encoding handler function, registered in dispatch tables rather than called directly. This explains why 15,907 functions (39.6%) have zero static callers -- they are reached exclusively through indirect calls via function pointer tables.

The encoding handler vtables were identified by their structural uniformity: every handler in the 0xD27000--0x1579000 range follows an identical template:

  1. Set opcode ID via bitfield insert into the instruction word at a1+544
  2. Load a 128-bit format descriptor from .rodata via SSE (movaps xmm0, xmmword_XXXXXX)
  3. Initialize a 10-slot register class map
  4. Register operand descriptors via sub_7BD3C0 / sub_7BD650 / sub_7BE090
  5. Finalize encoding via sub_7BD260
  6. Extract bitfields from the packed instruction word

The uniformity of this template allowed batch identification: once the template was recognized in a few handlers, the remaining ~4,000 were identified by structural matching alone.

Peephole Optimizer Vtable

The PeepholeOptimizer class at 0x7A5D10 has a reconstructed vtable with 7 virtual methods:

| Slot | Method | Purpose |
|---|---|---|
| 0 | Init | Initialize peephole state for a compilation unit |
| 1 | RunOnFunction | Entry point for per-function peephole optimization |
| 2 | RunOnBB | Per-basic-block dispatch |
| 3 | RunPatterns | Standard pattern matching pass |
| 4 | SpecialPatterns | Architecture-specific pattern pass |
| 5 | ComplexPatterns | Multi-instruction pattern pass |
| 6 | SchedulingAwarePatterns | Schedule-preserving pattern pass |

The three peephole dispatch mega-functions (sub_143C440 at 233 KB, sub_18A2CA0 at 231 KB, sub_198BCD0 at 239 KB) each serve a different SM generation family and call 1,100--1,336 pattern matcher functions. These dispatchers were identified by their enormous callee counts and their position in the pipeline after instruction encoding.

Callgraph Analysis

The 548,693-edge call graph, exported from IDA, reveals the binary's module structure and function relationships. Several callgraph properties were systematically exploited.

Hub Function Identification

Functions with extreme callee or caller counts serve as structural anchors:

Functions with the most callees (hub functions -- "fan-out" nodes):

| Address | Name | Size | Callees | Role |
|---|---|---|---|---|
| sub_169B190 | ISel master dispatch | 280 KB | 15,870 | The single largest function in the binary. Dispatches to all ISel pattern matchers. |
| sub_143C440 | SM120 peephole dispatch | 233 KB | 13,425 | SM120 (RTX 50-series) peephole optimization |
| sub_198BCD0 | Peephole dispatch (variant 2) | 239 KB | 13,391 | Peephole optimization for another SM family |
| sub_18A2CA0 | Peephole dispatch (variant 1) | 231 KB | 12,974 | Peephole optimization for another SM family |
| sub_BA9D00 | Bitvector/CFG analysis | 204 KB | 11,335 | Dataflow framework core |

Functions with the most callers (utility functions -- "fan-in" nodes):

| Address | Name | Size | Callers | Role |
|---|---|---|---|---|
| sub_B28F30 | (unknown leaf) | 12 B | 31,399 | Tiny utility, likely a type tag or opcode check |
| sub_10AE5C0 | (unknown leaf) | 60 B | 30,768 | Small encoding helper |
| .sprintf | libc sprintf | 6 B | 20,398 | String formatting (PLT stub) |
| sub_7B9B80 | Bitfield insert | 216 B | 18,347 | Inserts bits into the 1280-bit instruction word |
| sub_424070 | Pool allocator | 2,098 B | 3,809 | Custom memory allocator |
| sub_4280C0 | TLS context accessor | 597 B | 3,928 | Thread-local storage via pthread_getspecific |
| sub_42FBA0 | Diagnostic emitter | 2,388 B | 2,350 | Central error/warning reporter |

The fan-out nodes identify the mega-dispatch functions: ISel, peephole, and dataflow. The fan-in nodes identify the shared infrastructure layer: memory allocation, encoding primitives, string formatting, and error reporting.
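Ranking functions by fan-out and fan-in is a simple sort over the function records. The sketch below assumes each record carries the `name`, `callers[]`, and `callees[]` fields of the ptxas_functions.json schema; whether those lists hold distinct functions or one entry per call site is not specified on this page, so the counts it produces may not match the tables exactly.

```python
from typing import Dict, List, Tuple


def rank_hubs(functions: List[Dict], top: int = 5) -> Tuple[List[str], List[str]]:
    """Return (fan-out names, fan-in names), each sorted by descending count."""
    by_callees = sorted(functions, key=lambda f: len(f["callees"]), reverse=True)
    by_callers = sorted(functions, key=lambda f: len(f["callers"]), reverse=True)
    fan_out = [f["name"] for f in by_callees[:top]]
    fan_in = [f["name"] for f in by_callers[:top]]
    return fan_out, fan_in


# Tiny synthetic example (names are made up, not taken from the binary).
sample = [
    {"name": "dispatch", "callers": ["entry"], "callees": ["a", "b", "c"]},
    {"name": "alloc", "callers": ["a", "b", "c", "dispatch"], "callees": []},
    {"name": "entry", "callers": [], "callees": ["dispatch"]},
]
print(rank_hubs(sample, top=1))  # (['dispatch'], ['alloc'])
```

For the real artifact, the list would come from `json.load` on ptxas_functions.json, assuming the file's top level is an array of such records.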

Module Boundary Detection

The call graph reveals clear module boundaries. Functions in the 0x400000--0x67F000 range (PTX frontend) rarely call functions in 0xC52000--0x1CE3000 (SASS backend) directly, and vice versa. The optimizer region (0x67F000--0xC52000) bridges the two, calling into both the frontend (for IR construction) and the backend (for encoding).

The call graph was used to validate the three-subsystem decomposition:

| Call direction | Edge count | Interpretation |
|---|---|---|
| Frontend -> Frontend | ~8,000 | Internal frontend cohesion |
| Frontend -> Optimizer | ~1,200 | IR construction handoff |
| Optimizer -> Optimizer | ~15,000 | Phase-to-phase internal calls |
| Optimizer -> Backend | ~3,500 | Scheduling, encoding setup |
| Backend -> Backend | ~18,000 | Encoding handler internal calls |
| Backend -> Frontend | ~500 | Shared infrastructure (allocator, hash) |
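A table like the one above can be derived by classifying both endpoints of every call-graph edge by address range. The sketch uses the three ranges given in this section and the `from_addr` / `to_addr` fields of the call-graph schema. It accepts addresses as integers or hex strings, since the stored representation is an assumption here.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple, Union

# Address ranges as stated in this section (end is exclusive).
REGIONS = (
    ("Frontend", 0x400000, 0x67F000),
    ("Optimizer", 0x67F000, 0xC52000),
    ("Backend", 0xC52000, 0x1CE3000),
)


def as_int(addr: Union[int, str]) -> int:
    """Accept either an integer or a hex string such as '0x446240'."""
    return addr if isinstance(addr, int) else int(addr, 16)


def subsystem(addr: int) -> Optional[str]:
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    return None  # PLT stubs and anything else outside the three ranges


def count_edges(edges: Iterable[Dict]) -> Counter:
    """Count edges per (caller subsystem, callee subsystem) pair."""
    counts: Counter = Counter()
    for edge in edges:
        pair: Tuple = (subsystem(as_int(edge["from_addr"])),
                       subsystem(as_int(edge["to_addr"])))
        if None not in pair:
            counts[pair] += 1
    return counts
```

Edges with an endpoint outside all three ranges are dropped rather than guessed at, which keeps the counts limited to the three-subsystem decomposition.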

Propagation from Known Functions

Once a high-confidence function is identified, its callees and callers gain contextual identity. The most productive propagation chains:

  1. sub_446240 (real main, CERTAIN) -> calls stage entry points for Parse, DAGgen, OCG, ELF, DebugInfo. Each stage's entry point was identified by following the timing format string pattern.

  2. sub_C62720 (PhaseManager constructor) -> allocates 159 phase objects via sub_C60D30 (factory). The factory's 159 case targets are the phase constructors. Each constructor installs a vtable whose slot 0 points to the phase's execute() method.

  3. sub_79B240 (GetKnobIndex) -> called from every function that reads a tuning knob. The first argument to GetKnobIndex is the ROT13-encoded knob name, so every call site reveals which knob a function checks.

  4. sub_42FBA0 (diagnostic emitter) -> the format string argument at each of the 2,350 call sites reveals the error context. A call with "Cannot take address of texture/surface variable (%s)" identifies a PTX semantic checker.

Pattern Recognition

16-Byte Phase Objects

All 159 optimization phases share a uniform object layout:

Offset 0: vtable pointer (8 bytes) -- points to phase-specific vtable
Offset 8: phase data pointer or inline data (8 bytes)

The phase factory (sub_C60D30) allocates each phase as a 16-byte object from the pool allocator, sets the vtable pointer from the vtable table at off_22BD5C8, and returns the object. The PhaseManager stores these 159 objects in its internal array and iterates them to execute the pipeline.

Pool Allocator Usage Pattern

The custom pool allocator (sub_424070, 3,809 callers) is the dominant allocation mechanism. Its usage pattern is recognizable throughout the binary:

ptr = sub_424070(pool, size);   // Allocate
if (!ptr) sub_42BDB0();         // Fatal OOM -- never returns
// ... use ptr ...
sub_4248B0(ptr);                // Free (1,215 callers)

The OOM handler sub_42BDB0 (14 bytes, 3,825 callers) is a tiny wrapper that calls sub_42F590 (fatal internal error). Because every allocation site checks for failure and calls the same handler, the allocator usage pattern is a reliable structural marker. Finding sub_42BDB0 in a function's callee list confirms that function performs heap allocation.

SASS Encoding Handler Template

Every encoding handler in the backend follows a rigid 6-step template (described in the vtable section above). The key identification markers:

  • Calls to sub_7B9B80 (bitfield insert, 18,347 callers)
  • SSE movaps loading a 128-bit constant from .rodata
  • Calls to sub_7BD3C0, sub_7BD650, or sub_7BE090 (operand registrars)
  • Final call to sub_7BD260 (encoding finalize)

Any function matching this pattern is a SASS encoding handler. This template recognition identified approximately 4,000 handlers spanning 6 SM architecture generations.
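The callee-based markers can be checked mechanically against a function record. The sketch below assumes the `callees[]` field holds callee names such as `sub_7B9B80`. It covers only the three call markers, because the SSE `movaps` load has to be confirmed from the disassembly and is outside what a callee list can show.

```python
from typing import Dict

BITFIELD_INSERT = "sub_7B9B80"
OPERAND_REGISTRARS = frozenset({"sub_7BD3C0", "sub_7BD650", "sub_7BE090"})
ENCODING_FINALIZE = "sub_7BD260"


def looks_like_encoding_handler(func: Dict) -> bool:
    """True if the callee list contains all call-based template markers."""
    callees = set(func["callees"])
    return (
        BITFIELD_INSERT in callees            # sets opcode ID / bitfields
        and bool(callees & OPERAND_REGISTRARS)  # at least one operand registrar
        and ENCODING_FINALIZE in callees      # final encoding step
    )


# Synthetic records for illustration only.
handler = {"name": "sub_EXAMPLE1",
           "callees": ["sub_7B9B80", "sub_7BD650", "sub_7BD260"]}
other = {"name": "sub_EXAMPLE2", "callees": ["sub_7B9B80", "sub_424070"]}
print(looks_like_encoding_handler(handler), looks_like_encoding_handler(other))
# prints: True False
```

A match from this check is best treated as a candidate to confirm against the disassembly, not as a final identification.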

Hash Map Infrastructure Pattern

The MurmurHash3-based hash map infrastructure (sub_426150 insert, sub_426D60 lookup, sub_427630 MurmurHash3) appears throughout the binary with a consistent usage pattern:

map = sub_425CA0(hash_fn, cmp_fn, initial_capacity);  // Create
sub_426150(map, key, value);                           // Insert (2,800 callers)
result = sub_426D60(map, key);                         // Lookup (422 callers)
sub_425D20(map);                                       // Destroy

The MurmurHash3 constants (0xcc9e2d51, 0x1b873593) in sub_427630 confirmed the hash algorithm. The hash map supports three modes (custom function pointers, pointer hash, integer hash) selected by flags at struct offset 84.
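For readers who want to recognize the algorithm in decompiled output, the following is a plain Python rendering of the public MurmurHash3 x86 32-bit variant, which is where the two constants above appear as the block multipliers. It is a reference sketch of the published algorithm, not a transcription of sub_427630. Whether the binary uses exactly this variant, seed, or finalizer is not established here.

```python
MASK32 = 0xFFFFFFFF
C1 = 0xCC9E2D51  # first block multiplier
C2 = 0x1B873593  # second block multiplier


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k: int) -> int:
    k = (k * C1) & MASK32
    k = _rotl32(k, 15)
    return (k * C2) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    h = seed & MASK32
    nblocks = len(data) // 4
    # Body: process 4-byte little-endian blocks.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining 1-3 bytes, also little-endian.
    tail = data[4 * nblocks:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: fold in the length, then avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h
```

In decompiled code, the pairing of a multiply by 0xcc9e2d51, a rotate by 15, and a multiply by 0x1b873593 is the distinctive signature to look for.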

Data Artifacts

The complete IDA database was exported via analyze_ptxas.py into 8 JSON artifacts. These artifacts are the foundation for all subsequent analysis.

| Artifact | File | Size | Entries | Schema |
|---|---|---|---|---|
| Functions | ptxas_functions.json | 92 MB | 40,185 | {addr, end, name, size, insn_count, is_library, is_thunk, callers[], callees[]} |
| Strings | ptxas_strings.json | 4.8 MB | 30,632 | {addr, value, type, xrefs[{from, func, type}]} |
| Call graph | ptxas_callgraph.json | 64 MB | 548,693 | {from, from_addr, to, to_addr} -- one edge per call site |
| Cross-references | ptxas_xrefs.json | 978 MB | 7,427,044 | Complete xref database (code, data, string references) |
| Comments | ptxas_comments.json | 5.9 MB | 66,598 | {addr, type, text} -- IDA auto-comments and analyst annotations |
| Names | ptxas_names.json | 972 KB | 16,019 | {addr, name} -- IDA auto-generated and analyst-assigned names |
| Imports | ptxas_imports.json | 17 KB | 146 | {module, name, addr, ordinal} -- PLT import stubs |
| Segments | ptxas_segments.json | 3 KB | 24 | {name, start, end, size, type, perm} -- ELF segment map |

Total artifact storage: 1.14 GB (dominated by the 978 MB xref database).

What Each Artifact Reveals

Functions (ptxas_functions.json): The master index. Every function's address, size, instruction count, caller list, and callee list. The caller/callee lists are the basis for callgraph analysis. The is_thunk flag identifies PLT stubs (exclude from analysis). The is_library flag identifies functions IDA tagged as library code (CRT startup, jemalloc-like allocator internals).

Strings (ptxas_strings.json): The primary identification tool. Each string's xref list shows which functions reference it. Searching for "AdvancedPhase" returns 15 strings, each xref pointing to a pipeline boundary in the PhaseManager. Searching for strings starting with "Zrephel" (the ROT13 encoding of "Mercury") returns the Mercury subsystem's knob names. The 2,035 hex-encoded default value strings ("0k..." / "0x...", where "0k" is the ROT13 form of "0x") are paired 1:1 with knob name strings in the constructors.
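A prefix search over the strings artifact is a few lines of Python. The sketch assumes each record follows the `{addr, value, type, xrefs[{from, func, type}]}` schema listed above. The sample records are synthetic and their function names are placeholders.

```python
import codecs
from typing import Dict, List


def functions_referencing(strings: List[Dict], prefix: str) -> Dict[str, List[str]]:
    """Map each string value starting with `prefix` to the functions that reference it."""
    hits: Dict[str, List[str]] = {}
    for record in strings:
        if record["value"].startswith(prefix):
            hits[record["value"]] = sorted({x["func"] for x in record["xrefs"]})
    return hits


# ROT13 is symmetric, so encoding a known prefix yields the search key.
mercury_prefix = codecs.encode("Mercury", "rot_13")  # "Zrephel"

# Synthetic sample data (placeholder addresses and function names).
sample_strings = [
    {"addr": 0x1, "value": "AdvancedPhasePreSched", "type": "C",
     "xrefs": [{"from": 0x10, "func": "sub_AAAAAA", "type": "data"}]},
    {"addr": 0x2, "value": "ZrephelTraFnffHPbqr", "type": "C",
     "xrefs": [{"from": 0x20, "func": "sub_BBBBBB", "type": "data"}]},
]
print(functions_referencing(sample_strings, "AdvancedPhase"))
print(functions_referencing(sample_strings, mercury_prefix))
```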

Call graph (ptxas_callgraph.json): The structural backbone. Each edge records a direct call from one function to another. Indirect calls (vtable dispatch, function pointer callbacks) are not captured, which is the primary limitation -- the 15,907 zero-caller functions are almost all vtable-dispatched. The call graph is used for module boundary detection, propagation from known functions, and entry/exit point analysis.

Cross-references (ptxas_xrefs.json): The most comprehensive artifact. Contains all code-to-code, code-to-data, and data-to-data references detected by IDA. At 7.4 million entries, it is too large to load into memory on machines with less than 16 GB RAM. Used for deep analysis of specific functions: finding all references to a particular .rodata constant, tracing data flow through global variables, and identifying vtable consumers.

Comments (ptxas_comments.json): IDA's auto-generated comments (e.g., "File format: \\x7FELF") plus analyst-added annotations. The auto-comments on function prologues identify calling conventions and stack frame layouts. Analyst comments record identification rationale for reviewed functions.

Names (ptxas_names.json): IDA's auto-generated names for data and code addresses. Of 16,019 entries, approximately 9,670 are auto-generated string reference names (aLib64LdLinuxX8, aGnu, etc.) and ~6,349 are analyst-assigned or IDA-recovered names (PLT stubs, constructors, etc.). These names appear in the callgraph edges as from/to identifiers.

Imports (ptxas_imports.json): The 146 PLT imports. Key imports include pthread_* (13 functions), malloc/free/realloc, _setjmp/longjmp (used by the error recovery system), select/fcntl (used by the GNU Make jobserver client), and clock (used by the timing infrastructure).

Segments (ptxas_segments.json): The 24 ELF segments/sections. Used to establish the address space layout and map code/data boundaries. The .ctors section (104 bytes, 12 entries) is particularly important -- it lists the static constructors that initialize the ROT13 tables and the knob registry.

The 30-Region Sweep Approach

The primary analysis was conducted as a systematic address-range sweep of the entire .text section, divided into 30 contiguous regions. Each region was analyzed independently in a single session, producing a raw sweep report. The 40 report files (including sub-region splits) total 34,880 lines of working notes.

Region Partitioning

The .text section (0x403520--0x1CE2DE2, 26.2 MB) was divided into regions of approximately 853 KB each (0xD5000 bytes of address space, with a shorter final region). The partitioning was not arbitrary -- region boundaries were chosen to align with subsystem boundaries where possible, so that each sweep report covers a coherent functional area.

| Report | Address Range | Size | Functions | Subsystem |
|---|---|---|---|---|
| p1.01 | 0x400000--0x4D5000 | 853 KB | 1,383 | Runtime infra + CLI + PTX validators |
| p1.02 | 0x4D5000--0x5AA000 | 853 KB | 581 | PTX text generation (580 formatters) |
| p1.03 | 0x5AA000--0x67F000 | 853 KB | 628 | Intrinsics + SM profiles |
| p1.04 | 0x67F000--0x754000 | 469 KB | ~500 | Mercury core + scheduling engine |
| p1.05 | 0x754000--0x829000 | 853 KB | 1,545 | Knobs + peephole optimizer class |
| p1.06 | 0x829000--0x8FE000 | 853 KB | 1,069 | Debug tables + scheduler + HW profiles |
| p1.07 | 0x8FE000--0x9D3000 | 853 KB | 1,090 | Register allocator (fatpoint) |
| p1.08 | 0x9D3000--0xAA8000 | 853 KB | 1,218 | Post-RA pipeline + NamedPhases |
| p1.09 | 0xAA8000--0xB7D000 | 853 KB | 4,493 | GMMA/WGMMA + ISel + emission |
| p1.10 | 0xB7D000--0xC52000 | 853 KB | 1,086 | CFG analysis + bitvectors |
| p1.11 | 0xC52000--0xD27000 | 853 KB | 1,053 | PhaseManager + phase factory |
| p1.12 | 0xD27000--0xDFC000 | 853 KB | 592 | SM100 SASS encoders (set 1) |
| p1.13 | 0xDFC000--0xED1000 | 853 KB | 591 | SM100 SASS encoders (set 2) + decoders |
| p1.14 | 0xED1000--0xFA6000 | 853 KB | 683 | SM100 SASS encoders (set 3) |
| p1.15 | 0xFA6000--0x107B000 | 853 KB | 678 | SM100 SASS encoders (set 4) |
| p1.16 | 0x107B000--0x1150000 | 853 KB | 3,396 | SM100 codec + 2,095 bitfield accessors |
| p1.17 | 0x1150000--0x1225000 | 853 KB | 733 | SM89/90 codec (decoders + encoders) |
| p1.18 | 0x1225000--0x12FA000 | 853 KB | 1,552 | Reg-pressure scheduling + ISel + encoders |
| p1.19 | 0x12FA000--0x13CF000 | 853 KB | 1,282 | Operand legalization + peephole |
| p1.20 | 0x13CF000--0x14A4000 | 853 KB | 1,219 | SM120 peephole pipeline |
| p1.21 | 0x14A4000--0x1579000 | 853 KB | 606 | Blackwell ISA encode/decode |
| p1.22 | 0x1579000--0x164E000 | 853 KB | 1,324 | Encoding + peephole matchers |
| p1.23 | 0x164E000--0x1723000 | 853 KB | 899 | ISel pattern matching core |
| p1.24 | 0x1723000--0x17F8000 | 853 KB | 631 | ISA description database |
| p1.25 | 0x17F8000--0x18CD000 | 853 KB | 1,460 | SASS printer + peephole dispatch |
| p1.26 | 0x18CD000--0x19A2000 | 853 KB | 1,598 | Scheduling + peephole dispatchers |
| p1.27 | 0x19A2000--0x1A77000 | 853 KB | 1,393 | GPU ABI + SM89/90 encoders |
| p1.28 | 0x1A77000--0x1B4C000 | 853 KB | 1,518 | SASS emission backend |
| p1.29 | 0x1B4C000--0x1C21000 | 853 KB | 1,974 | SASS emission + format descriptors |
| p1.30 | 0x1C21000--0x1CE3000 | 780 KB | 1,628 | ELF emitter + infra library layer |

Several regions were further split into sub-reports (p1.04a/b, p1.05a/b, p1.06a/b, p1.07a/b, p1.08a/b) when the initial analysis revealed that a region contained multiple distinct subsystems requiring separate treatment.
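The address ranges in the table follow a uniform stride of 0xD5000 bytes starting at 0x400000, with the last region cut off at 0x1CE3000. That makes it straightforward to regenerate the 30 ranges and to look up which sweep report covers a given address. The helper names below are introduced for illustration, and the sub-report splits (a/b) are not modelled.

```python
from typing import List, Optional, Tuple

SWEEP_START = 0x400000
SWEEP_END = 0x1CE3000
STRIDE = 0xD5000


def sweep_regions() -> List[Tuple[str, int, int]]:
    """Regenerate (report, start, end) for every region; end is exclusive."""
    regions = []
    start, number = SWEEP_START, 1
    while start < SWEEP_END:
        end = min(start + STRIDE, SWEEP_END)
        regions.append((f"p1.{number:02d}", start, end))
        start, number = end, number + 1
    return regions


def region_for(addr: int) -> Optional[str]:
    """Return the report name covering `addr`, or None if outside the sweep."""
    for name, start, end in sweep_regions():
        if start <= addr < end:
            return name
    return None


print(len(sweep_regions()))   # 30
print(region_for(0xC60D30))   # p1.11 (phase factory)
print(region_for(0x169B190))  # p1.23 (ISel master dispatch)
```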

Sweep Report Structure

Each sweep report follows a consistent format:

================================================================================
P1.XX SWEEP: Functions in address range 0xAAAA000 - 0xBBBB000
================================================================================
Range: 0xAAAA000 - 0xBBBB000
Files found: NNN decompiled .c files (of which ~MMM are > 1KB)
Total decompiled size: X,XXX,XXX bytes
Functions in range (from DB): NNN
Named functions: NNN (or 0 if all are sub_XXXXXX)
Functions with identified callers: NNN

CONTEXT: [1-paragraph summary of the region's purpose]

================================================================================
SECTION 1: [Subsystem name]
================================================================================

### 0xAAAAAA -- sub_AAAAAA (NNNN bytes / NNN lines)
**Identity**: [Function identification]
**Confidence**: [CERTAIN / HIGH / MEDIUM]
**Evidence**:
  - [String evidence]
  - [Structural evidence]
  - [Callgraph evidence]
**Key code**:
  [Relevant decompiled excerpts]
**Note**: [Additional observations]

Each function entry records the address, size, decompiled line count, proposed identity, confidence level, evidence citations, and key code excerpts. The reports are raw working notes -- they contain false starts, corrections, and evolving hypotheses that were resolved as more context became available.

Analysis Ordering

The sweep was not performed in address order. The analysis followed an information-maximizing sequence:

  1. p1.01 (infrastructure + CLI) first -- establishes the allocator, hash map, TLS, and diagnostic patterns that appear throughout the binary.
  2. p1.11 (PhaseManager) second -- identifies all 159 phases and their vtable entries, providing the skeleton of the optimization pipeline.
  3. p1.07 (register allocator) and p1.06 (scheduler) third -- these are the highest-complexity subsystems with the richest string evidence.
  4. p1.12--p1.15 (SASS encoders) in batch -- once the encoding template was recognized, all encoder regions were swept rapidly with template matching.
  5. p1.30 (library layer) late -- identifies shared infrastructure (ELF emitter, demangler, thread pool) referenced by earlier regions.
  6. Remaining regions filled in by decreasing information density.

Cross-Referencing with PTXAS CLI

Several ptxas command-line features and internal mechanisms provide runtime validation of static analysis findings.

--stat and --verbose

Running ptxas --stat input.ptx prints per-kernel resource usage (register count, shared memory, stack frame size). This output is generated by sub_A3A7E0 (the IR statistics printer), which was identified from format strings such as:

ptxas info    : Used %d registers, %d bytes smem, %d bytes cmem[0]

Comparing the --stat output against the decompiled statistics printer confirms the register counting and resource tracking logic.
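To compare many kernels at once, the resource line can be parsed back into numbers with a regular expression built from the format string above. The sketch handles only that exact layout; real output may contain additional or differently ordered fields, which this pattern does not attempt to cover. The sample line is synthetic.

```python
import re
from typing import Dict, Optional

# Built from: "ptxas info    : Used %d registers, %d bytes smem, %d bytes cmem[0]"
USED_RE = re.compile(
    r"ptxas info\s*:\s*Used (\d+) registers, (\d+) bytes smem, (\d+) bytes cmem\[0\]"
)


def parse_resource_line(line: str) -> Optional[Dict[str, int]]:
    """Extract register and memory counts, or None if the line does not match."""
    match = USED_RE.search(line)
    if match is None:
        return None
    registers, smem, cmem0 = (int(g) for g in match.groups())
    return {"registers": registers, "smem_bytes": smem, "cmem0_bytes": cmem0}


sample_line = "ptxas info    : Used 32 registers, 4096 bytes smem, 380 bytes cmem[0]"
print(parse_resource_line(sample_line))
# prints: {'registers': 32, 'smem_bytes': 4096, 'cmem0_bytes': 380}
```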

--compiler-stats

Enables the timing output (Parse-time, DAGgen-time, OCG-time, etc.) from sub_446240. This confirms the pipeline stage ordering and the stage boundary functions identified by string xrefs.

--fdevice-time-trace

Generates Chrome trace JSON output showing per-phase timing. The trace parser at sub_439880 and the ftracePhaseAfter string at 0x1CE383F confirm the per-phase instrumentation infrastructure. The trace output lists phase names that can be cross-referenced against the 159-entry phase table.

DUMPIR Knob

The internal DUMPIR knob (accessed via -knob DUMPIR=<phase_name>) dumps the Ori IR at specified pipeline points. The string "Please use -knob DUMPIR=AllocateRegisters for debugging" at 0x21EFBD0 confirms this mechanism. The NamedPhases registry at sub_9F4040 maps phase names to pipeline positions. Available DUMPIR points include:

  • OriPerformLiveDead, OriPerformLiveDeadFirst through OriPerformLiveDeadFourth
  • AllocateRegisters (the register allocation phase)
  • swap1 through swap6 (swap elimination phases)
  • shuffle (instruction scheduling)

The DUMPIR output format reveals the IR structure: basic block headers, instruction opcodes, register names (R0--R255, UR0--UR63, P0--P7, UP0--UP7), and operand encodings. This runtime output was used to validate the IR format reconstructed from static analysis.

--keep Flag

The --keep flag preserves intermediate files. ptxas does not emit intermediate text files the way nvcc does, but passing --keep through the overall CUDA compilation pipeline (nvcc -> cicc -> ptxas) leaves the PTX input that reaches ptxas on disk. Inspecting that PTX confirms the PTX grammar and instruction format expectations.

Confidence Levels

Every function identification in this wiki carries one of three confidence levels:

| Level | Meaning | Basis |
|---|---|---|
| CERTAIN | Identity is certain | Direct string evidence naming the function, or the function is a PLT import with a known name |
| HIGH | Strong identification (>90%) | Multiple corroborating indicators: string xrefs, callgraph position, structural fingerprint, decompiled algorithm match |
| MEDIUM | Probable identification (70--90%) | Single indicator (vtable position, size fingerprint, callgraph context) or inferred from surrounding identified functions |

The distribution across the ~200 key identified functions in the Function Map:

  • CERTAIN: ~30 functions (PLT imports, main, functions with unique identifying strings)
  • HIGH: ~130 functions (string evidence + structural confirmation)
  • MEDIUM: ~40 functions (inferred from callgraph context or structural similarity)

The remaining ~39,985 functions are either unidentified (template-generated encoding handlers, small utility stubs) or identified at subsystem level only (e.g., "this is an SM100 SASS encoding handler" without knowing which specific opcode it encodes).

Reproducing the Analysis

To reproduce this analysis from scratch:

  1. Obtain the binary. Install CUDA Toolkit 13.0. The binary is at <cuda>/bin/ptxas. Verify: ptxas --version should report V13.0.88 and the binary should be 37,741,528 bytes. Build string: cuda_13.0.r13.0/compiler.36424714_0.

  2. Run IDA auto-analysis. Open ptxas in IDA Pro 8.x with default x86-64 settings. Allow auto-analysis to complete (8-10 minutes). Accept GCC as the detected compiler.

  3. Run the extraction script. Load analyze_ptxas.py in IDA's Python console. The script exports all 8 JSON artifacts plus per-function decompiled C files, disassembly files, and control flow graph JSON files. Expected runtime: 4-8 hours for the full export (the xref export dominates).

  4. Decode ROT13 strings. Apply codecs.decode(s, "rot_13") to all strings in the knob constructors (ctor_003, ctor_005, ctor_007). This decodes ~3,000 obfuscated names into readable English identifiers.

  5. Identify anchor functions. Start with the highest-confidence identifications:

    • main at 0x409460 (named in symbol table)
    • sub_446240 (real main -- called from main, contains timing format strings)
    • sub_C60D30 (phase factory -- 159-case switch)
    • sub_C62720 (PhaseManager constructor -- references phase vtable table)
    • sub_79B240 (GetKnobIndex -- inline ROT13 decoding)
    • sub_42FBA0 (diagnostic emitter -- 2,350 callers, severity dispatch)
  6. Sweep the address space. Work through the .text section in regions of ~870 KB. For each region:

    • Count functions and decompiled file sizes
    • Identify string anchors (search for region-specific strings)
    • Classify functions by structural template (encoding handler, phase body, utility, etc.)
    • Propagate identities from known callers/callees
    • Record findings in the sweep report format
  7. Cross-reference with runtime. Compile a simple CUDA kernel and run ptxas --stat --verbose --compiler-stats to observe runtime behavior. Use -knob DUMPIR=<phase> to dump IR at specific pipeline points. Compare the dumped IR format against the IR structure reconstructed from decompiled code.
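Step 4 above can be sketched in a few lines. The `codecs.decode(s, "rot_13")` call is the one named in the step; the helper name and the encoded sample are illustrative (the sample is simply the ROT13 form of a knob name quoted elsewhere on this page).

```python
import codecs

def decode_knob(name):
    """Decode one ROT13-obfuscated identifier; digits and underscores pass through."""
    return codecs.decode(name, "rot_13")

# ROT13 is its own inverse, so the same call also re-encodes a plain name.
encoded = "FpniVayvarRkcnafvba"
decoded = decode_knob(encoded)
print(encoded, "->", decoded)
```

Because the transform is self-inverse, searching the binary for the ROT13 form of a suspected knob name is as easy as decoding the strings that are already there.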

Dependencies

The extraction script (analyze_ptxas.py) requires IDA Pro 8.x with Hex-Rays decompiler and Python 3.x. No external Python packages are needed -- only the IDA Python API (idautils, idc, idaapi, ida_bytes, ida_funcs, ida_segment, ida_nalt, ida_gdl, ida_hexrays).

Post-export analysis requires only the Python 3.8+ standard library (json, codecs, collections).
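As an illustration of what standard-library-only post-processing looks like, the sketch below computes the share of functions with no static caller from a function list and a call-edge list. The JSON field names (`functions`, `edges`) are hypothetical stand-ins, not the documented schema of the exported artifacts.

```python
import json
from collections import Counter

# Stand-in for exported data; field names and contents are made up.
EXPORT_TEXT = """
{"functions": ["sub_A", "sub_B", "sub_C", "sub_D"],
 "edges": [["sub_A", "sub_B"], ["sub_A", "sub_C"], ["sub_B", "sub_C"]]}
"""

def zero_caller_share(functions, edges):
    """Fraction of functions that never appear as the target of a call edge."""
    incoming = Counter(callee for _caller, callee in edges)
    orphans = [name for name in functions if incoming[name] == 0]
    return len(orphans) / len(functions) if functions else 0.0

export = json.loads(EXPORT_TEXT)
share = zero_caller_share(export["functions"], export["edges"])
print(f"{share:.1%} of functions have no static caller")
```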

Debug Infrastructure: bugspec.txt

ptxas contains an internal fault injection framework that deliberately corrupts the Mercury IR to test compiler verification passes. The mechanism is entirely file-driven: if a file named ./bugspec.txt exists in the current working directory when ptxas runs, the function sub_A83AC0 reads it and injects controlled mutations into the post-register-allocation instruction stream. No CLI flag activates this -- file presence alone is sufficient. If the file is absent, a diagnostic is printed to stdout (Cannot open file with bug specification) and compilation proceeds normally.

File Format

The file contains a single line of six integers:

COUNT0,COUNT1,COUNT2,COUNT3 COUNT4 COUNT5

The first four values are comma-separated, followed by a space and then two space-separated values. Each integer specifies the number of faults to inject for that bug category. A zero or negative value disables the category.

| Field | Variable | Category | Target |
|---|---|---|---|
| COUNT0 | v78 | Register bugs | General (R) and uniform (UR) register operands |
| COUNT1 | v79 | Predicate bugs | Predicated instruction operands |
| COUNT2 | v80 | Offset/spill bugs | Memory offsets in spill/refill instructions |
| COUNT3 | v81 | Remat bugs | Rematerialized value operands |
| COUNT4 | v82 | R2P/P2R bugs | Register-to-predicate conversion instructions |
| COUNT5 | v83 | Bit-spill bugs | Bit-level spill storage operands |

Example: 3,2,1,0 0 1 injects 3 register bugs, 2 predicate bugs, 1 offset bug, and 1 bit-spill bug.
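The line format can be made concrete with a small parser. This is a sketch of the format as described above, not a reconstruction of the parsing code inside sub_A83AC0; the category keys are hypothetical labels for the six fields.

```python
CATEGORIES = ("register", "predicate", "offset_spill", "remat", "r2p_p2r", "bit_spill")

def parse_bugspec(line):
    """Parse 'C0,C1,C2,C3 C4 C5' into per-category fault counts."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError("expected one comma group followed by two integers")
    first_four = parts[0].split(",")
    if len(first_four) != 4:
        raise ValueError("expected four comma-separated integers")
    counts = [int(tok) for tok in first_four + parts[1:]]
    # Zero or negative disables a category, so clamp to zero for reporting.
    return {name: max(count, 0) for name, count in zip(CATEGORIES, counts)}

spec = parse_bugspec("3,2,1,0 0 1")
print(spec)
```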

Bug Kind String Table

Each injected fault record carries a kind code (1--10) mapped to a string table at 0x21F0500:

| Kind | String | Meaning |
|---|---|---|
| 1 | r-ur register | General or uniform register replaced with wrong register |
| 2 | p-up register | Predicate or uniform predicate register corrupted |
| 3 | any reg | Any register class operand corrupted |
| 4 | offset | Memory offset shifted by +16 bytes |
| 5 | regular bug | Generic operand value replacement |
| 6 | predicated bug | Predicate source operand corrupted |
| 7 | remat bug | Rematerialization value corrupted |
| 8 | spill-regill bug | Spill or refill path value corrupted |
| 9 | r2p-p2r bug | Register-predicate conversion operand corrupted |
| 10 | bit-spill bug | Bit-level spill storage operand corrupted |

Injection Algorithm

The injection proceeds in four phases:

1. Candidate collection. The function walks the Mercury IR instruction linked list (from context[0]+272). For each instruction, it checks which bug categories are active and whether the instruction qualifies:

  • Register bugs (field0): Scans operands for type-tag 1 (register) with register class 6 (general) or 3 (predicate), excluding opcodes 41--44. Eligible instructions are collected into a candidate list.
  • Predicate bugs (field1): Checks flag byte at instruction+73 for bit 0x10 (predicated). Eligible instructions are collected separately.
  • Offset/spill bugs (field2): Calls sub_A56DE0 / sub_A56CE0 against the register allocator state (context[133]) to identify spill/refill instructions.
  • Remat bugs (field3): Queries the rematerialization hash table (context+21 via sub_A54200) for instructions with remat entries.
  • R2P/P2R bugs (field4): Checks instruction opcode (offset +72) for values 268, 155, 267, 173 (the R2P and P2R conversion opcodes, with bit-masked variants).
  • Bit-spill bugs (field5): Checks operand count > 2, flag bit 0x10 at offset +28, and calls sub_A53DB0 / sub_A53C40 / sub_A56880 for bit-spill eligibility.

2. Random selection. Seeds the RNG with time(0) via srand(). For each active category, sub_A83490 randomly selects N instruction indices from the candidate list, where N is the count from bugspec.txt. The selector tracks already-chosen instructions in a hash set keyed by FNV-1a hashes of their addresses and re-rolls whenever it draws a duplicate.

3. Mutation application. For register and predicate categories, sub_A5EC40 iterates over selected instructions and calls sub_A5E9E0, which finds the last register operand, allocates a new register of the same class via sub_91BF30, and replaces the operand value. For offset bugs, the mutation adds +16 to the signed 24-bit offset field directly: *operand = (sign_extend_24(*operand) + 16) & 0xFFFFFF | (*operand & 0xFF000000).

4. Reporting. Prints to stdout:

Num forced bugs N
Created a bug at index I : kind K inst # ID [OFF] in operand OP correct val V replaced with W
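The offset mutation from step 3 can be checked by hand with a direct transcription of the expression into Python. The function names are illustrative; the arithmetic follows the formula quoted above, with the low 24 bits treated as a signed offset and the top 8 bits preserved.

```python
def sign_extend_24(value):
    """Interpret the low 24 bits of value as a signed integer."""
    value &= 0xFFFFFF
    return value - 0x1000000 if value & 0x800000 else value

def mutate_offset(operand, delta=16):
    """Add delta to the 24-bit offset field, keeping the upper 8 bits unchanged."""
    low = (sign_extend_24(operand) + delta) & 0xFFFFFF
    return low | (operand & 0xFF000000)

# Positive offset 0x10 becomes 0x20; negative offset -8 becomes +8.
print(hex(mutate_offset(0x12000010)))
print(hex(mutate_offset(0xABFFFFF8)))
```

The second call shows why the sign extension matters: an offset of -8 crosses zero cleanly instead of carrying into the tag bits in the top byte.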

Fault Record Structure (40 bytes)

| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Kind (1--10) |
| +8 | 8 | Pointer to Mercury instruction node |
| +16 | 4 | Operand index within instruction |
| +20 | 4 | Original operand value |
| +24 | 4 | Replacement operand value |
| +28 | 4 | Selection index (position in candidate list) |
| +32 | 4 | Instruction ID (from instruction+16) |

Records are stored in a dynamic array at context[135].
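The layout above maps onto a fixed `struct` format, which is convenient when inspecting a memory dump. The sketch assumes little-endian x86-64, treats the undocumented bytes at +4..+7 and +36..+39 as padding, and treats all 4-byte fields as unsigned; those three points are assumptions, as are the field names.

```python
import struct
from collections import namedtuple

FaultRecord = namedtuple(
    "FaultRecord",
    "kind inst_ptr operand_index original replacement selection_index inst_id",
)

# u32 kind, 4 pad bytes, u64 pointer, five u32 fields, 4 pad bytes = 40 bytes.
RECORD = struct.Struct("<I4xQ5I4x")

def unpack_record(buf, index=0):
    """Decode the index-th 40-byte record from a raw buffer."""
    return FaultRecord(*RECORD.unpack_from(buf, index * RECORD.size))

# Synthetic record: an offset bug (kind 4) that changed 0x10 into 0x20.
raw = RECORD.pack(4, 0x7F0000001000, 2, 0x10, 0x20, 5, 1234)
rec = unpack_record(raw)
print(RECORD.size, rec)
```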

Function Map

| Address | Function | Role | Confidence |
|---|---|---|---|
| 0xA83AC0 | sub_A83AC0 | bugspec.txt reader and injection coordinator | CERTAIN (string: ./bugspec.txt) |
| 0xA83490 | sub_A83490 | Random index selector with FNV-1a dedup | HIGH |
| 0xA5E9E0 | sub_A5E9E0 | Register operand mutation (allocates new register) | HIGH |
| 0xA5EC40 | sub_A5EC40 | Batch mutation applicator (iterates selected instructions) | HIGH |
| 0xA832D0 | sub_A832D0 | Hash table resize for dedup tracking | MEDIUM |

Significance

This is NVIDIA's internal compiler testing infrastructure for stochastic fault injection. It targets specific vulnerability surfaces in the register allocator and post-allocation pipeline: wrong-register assignments, address calculation errors, predicate propagation failures, rematerialization correctness, spill code integrity, and register-predicate conversion accuracy. The time(0)-seeded RNG produces different fault patterns on each run for the same bugspec.txt, enabling randomized stress testing of verification passes.

Embedded C++ Name Demangler

PTXAS statically embeds an Itanium ABI C++ name demangler rather than dynamically linking against libc++abi or libstdc++. The demangler is a self-contained 41-function cluster spanning 0x1CD8B00--0x1CE1E60 in .text, with a single external entry point. The core recursive-descent parser at sub_1CDC780 (93 KB decompiled, 3,442 lines) handles the full Itanium mangling grammar: nested names, template arguments, substitutions, function types, and special names.

API and Integration

The public-facing function is sub_1CE23F0, whose signature matches __cxa_demangle exactly: it takes a mangled name string, an optional output buffer with length pointer, and a status pointer; it returns a malloc-allocated demangled string or NULL with a status code (-1 = memory failure, -3 = invalid arguments). The only caller of this function is the embedded terminate handler at sub_1CD7850, which prints the standard "terminate called after throwing an instance of '...'" diagnostic to stderr, demangling the exception type name before display.

Why Embedded

PTXAS imports only libc, libpthread, libm, and libgcc_s (146 PLT stubs total). It has no dependency on any C++ runtime library. The only C++ ABI symbol in the PLT is __cxa_atexit (at 0x401989), used to register the terminate handler. By embedding the demangler and terminate handler directly, NVIDIA avoids a runtime dependency on libstdc++ or libc++abi, which would otherwise be required solely for exception type name display in fatal error messages. This is consistent with the binary's overall strategy of minimizing external dependencies.

Function Map

| Address | Function | Size | Role | Confidence |
|---|---|---|---|---|
| sub_1CDC780 | Demangler core (recursive-descent parser) | 93 KB | Parses Itanium-mangled names via large switch dispatch | HIGH (size, structure, callgraph isolation) |
| sub_1CE0600 | Recursive dispatch wrapper | 580 B | Re-enters the parser for nested name components (76 call sites from core) | HIGH (mutual recursion with sub_1CDC780) |
| sub_1CE23F0 | __cxa_demangle-compatible API | 340 B | Public entry: mangled string in, demangled string out, malloc-allocated | CERTAIN (API shape, status codes, free/memcpy/strlen callees) |
| sub_1CE1E60 | Parse entry point | ~200 B | Initializes parse state and invokes the core | HIGH (bridge between API and parser) |
| sub_1CD7850 | Terminate handler (__cxa_terminate) | 280 B | Prints "terminate called after throwing..." to stderr | CERTAIN (string: "terminate called after throwing an instance of '") |

Version Update Procedure

All addresses, function counts, and structural offsets in this wiki are specific to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0, 37,741,528 bytes). When a new CUDA toolkit ships a different ptxas binary, the wiki must be updated. This section documents the procedure.

Version-Stable vs Version-Fragile Findings

Not everything changes between versions. Understanding what is stable dramatically reduces update effort.

Version-stable (survives across minor and most major releases unchanged):

| Category | Examples | Why stable |
|---|---|---|
| Algorithm logic | Copy propagation worklist walk, fatpoint pressure computation, MurmurHash3 constants | Algorithms are rarely rewritten between releases |
| Data structure layouts | Pool allocator bins at +2128, Mercury instruction node at 112 bytes, 16-byte phase objects | Struct layouts change only when fields are added or reordered |
| Knob names | MercuryUseActiveThreadCollectiveInsts, ScavInlineExpansion, all 2,000+ ROT13 names | Knob names are API-like -- changing them breaks internal test harnesses |
| ROT13 encoding | The ROT13 obfuscation layer itself, decoded by codecs.decode(s, "rot_13") | Obfuscation scheme has been consistent across observed versions |
| Phase count and ordering | 159 phases in the OCG pipeline, ordered by the PhaseManager vtable table | Phase count may grow but existing phases retain their relative order |
| Pipeline stage names | Parse-time, DAGgen-time, OCG-time, ELF-time, DebugInfo-time | Stage names are embedded in format strings unlikely to change |
| Subsystem names | OCG, Mercury, Ori, Scav | Internal codenames are stable across releases |
| Encoding handler template | 6-step pattern: opcode ID, movaps format descriptor, register class map, operand registration, finalize, bitfield extract | Template structure is generated from a stable code generator |
| Error message text | "SM does not support LDCU", "Invalid knob identifier" | Diagnostic strings are rarely reworded |

Version-fragile (changes with every recompilation):

| Category | Examples | Why fragile |
|---|---|---|
| Function addresses | Every sub_XXXXXX reference, vtable addresses like off_22BD5C8 | Addresses shift whenever any preceding code or data changes size |
| Address ranges | Sweep boundaries 0x400000--0x4D5000, subsystem regions | Functions move when preceding code grows or shrinks |
| Function sizes | sub_446240 at 12,345 bytes | Inlining decisions change, optimizer improvements add/remove code |
| Caller/callee counts | sub_424070 at 3,809 callers | New call sites added, old ones removed |
| Struct offsets | context[133], context+1584 | New fields inserted into context structs |
| .rodata addresses | String locations like 0x202D4D8, encoding table addresses | Data layout shifts with code changes |
| Call graph edge counts | 548,693 edges | New functions and call sites |
| Total function count | 40,185 | New SM targets add encoding handlers |

Identifying Function Address Changes

When loading a new ptxas version into IDA:

  1. Extract the same 8 JSON artifacts using analyze_ptxas.py (or equivalent). The critical artifacts for diffing are ptxas_functions.json (address, size, callee list) and ptxas_strings.json (string content, xref locations).

  2. Match functions by invariant properties. Functions cannot be matched by address alone. Use these matching criteria in priority order:

    • String anchors. Functions containing unique string references (e.g., the function referencing "Please use -knob DUMPIR=AllocateRegisters") can be matched across versions by searching for the same string in the new binary. This is the highest-confidence matching method.
    • Size + callee signature. For functions without string anchors, match by (approximate size, sorted callee list). A function of ~2,100 bytes calling the pool allocator, OOM handler, and hash map insert is almost certainly the same function even if its address shifted by megabytes.
    • Callgraph position. Functions identified by their caller/callee topology: the phase factory is the function called from the PhaseManager constructor with 159+ case targets. The diagnostic emitter is the function with 2,000+ callers that calls vfprintf.
    • Vtable slot position. Phase execute() methods are at vtable slot 0. If the vtable table address changes but still contains 159 entries, the slot positions identify each phase.
    • Template fingerprinting. Encoding handlers matching the 6-step template (bitfield insert via the highest-caller utility, movaps from .rodata, operand registrars, finalize call) are encoding handlers in any version.
  3. Diff the function lists. Produce a mapping {old_addr -> new_addr} for all matched functions. Functions present in the new binary but absent in the old are new (likely new SM target support). Functions absent in the new binary are removed (dropped legacy SM support) or merged.
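The first two matching criteria can be sketched as a two-pass matcher. The record layout (`addr`, `size`, `callees`, `strings`) is a hypothetical simplification of ptxas_functions.json and ptxas_strings.json, and the 10% size tolerance is an arbitrary illustrative choice.

```python
def unique_string_owners(funcs):
    """Map each string referenced by exactly one function to that function's address."""
    owners = {}
    for f in funcs:
        for s in set(f["strings"]):
            owners.setdefault(s, []).append(f["addr"])
    return {s: addrs[0] for s, addrs in owners.items() if len(addrs) == 1}

def match_functions(old, new, size_tolerance=0.10):
    mapping = {}
    old_owner, new_owner = unique_string_owners(old), unique_string_owners(new)
    # Pass 1: string anchors (unique string present in both binaries).
    for s, old_addr in old_owner.items():
        if s in new_owner:
            mapping.setdefault(old_addr, new_owner[s])
    # Pass 2: approximate size + callee signature, translated via pass-1 matches.
    taken = set(mapping.values())
    by_sig = {}
    for g in new:
        by_sig.setdefault(tuple(sorted(g["callees"])), []).append(g)
    for f in old:
        if f["addr"] in mapping or not f["callees"]:
            continue
        if not all(c in mapping for c in f["callees"]):
            continue  # signature cannot be translated yet
        sig = tuple(sorted(mapping[c] for c in f["callees"]))
        hits = [g for g in by_sig.get(sig, [])
                if g["addr"] not in taken
                and abs(g["size"] - f["size"]) <= size_tolerance * f["size"]]
        if len(hits) == 1:  # accept only unambiguous matches
            mapping[f["addr"]] = hits[0]["addr"]
            taken.add(hits[0]["addr"])
    return mapping

# Synthetic data: two string-anchored leaf functions and one caller of both.
old = [
    {"addr": 0x1000, "size": 300, "callees": [], "strings": ["Invalid knob identifier"]},
    {"addr": 0x2000, "size": 500, "callees": [], "strings": ["SM does not support LDCU"]},
    {"addr": 0x3000, "size": 2100, "callees": [0x1000, 0x2000], "strings": []},
]
new = [
    {"addr": 0x1100, "size": 310, "callees": [], "strings": ["Invalid knob identifier"]},
    {"addr": 0x2400, "size": 500, "callees": [], "strings": ["SM does not support LDCU"]},
    {"addr": 0x3900, "size": 2150, "callees": [0x1100, 0x2400], "strings": []},
]
addr_map = match_functions(old, new)
print({hex(k): hex(v) for k, v in addr_map.items()})
```

Pass 2 can be repeated until no new matches appear, since each round of matches makes more callee signatures translatable.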

Updating Sweep Reports

The 30-region sweep reports in ptxas/raw/ are version-locked historical records -- they document the analysis of v13.0.88 and should not be overwritten. For a new version:

  1. Re-run the sweep with new address ranges derived from the new binary's function list. The region partitioning should follow the same subsystem-aligned strategy: infrastructure first, then PhaseManager, then high-complexity subsystems, then batch encoding handlers.

  2. Name new reports with a version suffix: p2.01-sweep-v13.1-0xNNN-0xMMM.txt (or whatever scheme distinguishes the version).

  3. Cross-reference against old reports. For each region, note which functions moved, which are new, and which disappeared. The old sweep reports provide the expected function identities; the new sweep validates whether those identities still hold at the new addresses.

Pages Most Sensitive to Version Changes

These wiki pages require immediate updates when the binary changes:

| Page | Sensitivity | What changes |
|---|---|---|
| function-map.md | Critical | Every address in every table row. The entire page is address-indexed. |
| binary-layout.md | Critical | Section addresses, subsystem boundaries, address-range diagram. |
| VERSIONS.md | Critical | Binary size, build string, function count, version number. |
| pipeline/overview.md | High | Phase factory address, PhaseManager constructor address, vtable table address. |
| scheduling/algorithm.md | High | Scheduler function addresses, priority function addresses. |
| regalloc/algorithm.md | High | Allocator function addresses, fatpoint computation address. |
| codegen/encoding.md | High | Encoding handler address ranges, format descriptor addresses. |
| config/knobs.md | Medium | Knob constructor addresses (content of knob names is stable). |
| ir/instructions.md | Medium | Opcode numbers may shift if new instructions are added. |
| targets/index.md | Medium | New SM targets may appear, changing validation table sizes. |
| methodology.md | Low | The methodology itself is version-stable; only the "Scope and Scale" table needs updating. |

The update follows a five-step sequence. Steps 1-2 are mechanical; steps 3-5 require analyst judgment.

Step 1: Extract new IDA artifacts.

Load the new ptxas binary into IDA Pro 8.x. Run analyze_ptxas.py to produce the 8 JSON artifacts and per-function decompiled .c files. Store them in a version-specific directory (e.g., ptxas/ida-v13.1/ or alongside the existing artifacts with clear version labeling).

Step 2: Diff against the old artifacts.

Write or use a diff script that:

  • Compares ptxas_functions.json (old vs new) by matching on string anchors, size+callee signature, and callgraph position.
  • Produces a {old_addr -> new_addr} mapping for matched functions.
  • Lists unmatched functions in both directions (new functions, removed functions).
  • Compares ptxas_strings.json to detect new strings, removed strings, and strings whose xref functions changed.
  • Reports total function count delta, binary size delta, and new section addresses.
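The string-comparison bullet can be sketched as follows. The input shape (string content mapped to the set of referencing function addresses) is a hypothetical reduction of ptxas_strings.json, and `addr_map` stands for the old-to-new mapping produced by the function matching.

```python
def diff_strings(old, new):
    """Return (added, removed) string contents between two versions."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    return added, removed

def moved_xrefs(old, new, addr_map):
    """Strings present in both versions whose referencing functions changed.

    Old addresses are translated through addr_map first, so a function that
    merely moved does not count as a change.
    """
    changed = []
    for s in sorted(set(old) & set(new)):
        translated = {addr_map.get(addr) for addr in old[s]}
        if translated != set(new[s]):
            changed.append(s)
    return changed

# Synthetic example data.
old_strings = {"Invalid knob identifier": {0x1000}, "old message": {0x2000}}
new_strings = {"Invalid knob identifier": {0x1500}, "new message": {0x2400}}
added, removed = diff_strings(old_strings, new_strings)
moved = moved_xrefs(old_strings, new_strings, {0x1000: 0x1100})
print(added, removed, moved)
```

A string that shows up in `moved` is a hint that either the function match is wrong or the string's owner really changed, and both cases deserve a manual look.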

Step 3: Update address-sensitive pages.

Using the address mapping from Step 2:

  • Update every sub_XXXXXX reference in function-map.md, binary-layout.md, and all pages listed in the sensitivity table above.
  • Update the "Scope and Scale" table in methodology.md with new function counts, string counts, binary size, and build string.
  • Update VERSIONS.md with the new binary metadata.
  • For pages with address ranges (sweep boundaries, subsystem regions), recompute the ranges from the new function list.
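A mechanical first pass over the sub_XXXXXX references might look like the sketch below. It assumes IDA-style names (uppercase hex after `sub_`) and an integer-keyed address mapping; references with no mapping are left untouched and reported for manual review rather than guessed.

```python
import re

SUB_RE = re.compile(r"\bsub_([0-9A-F]+)\b")

def remap_refs(text, addr_map):
    """Rewrite sub_<hex> names through addr_map; return (new_text, unmatched_names)."""
    unmatched = []

    def replace(match):
        new_addr = addr_map.get(int(match.group(1), 16))
        if new_addr is None:
            unmatched.append(match.group(0))
            return match.group(0)
        return "sub_%X" % new_addr

    return SUB_RE.sub(replace, text), unmatched

# The target address 0x44A100 is invented for the example.
page = "sub_446240 calls sub_42FBA0 for diagnostics."
updated, leftovers = remap_refs(page, {0x446240: 0x44A100})
print(updated)
print(leftovers)
```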

Step 4: Verify key struct layouts.

Struct offset changes are the most dangerous kind of version drift because they silently invalidate decompiled code analysis. For each documented struct:

  • Re-decompile the struct's primary accessor function (e.g., sub_424070 for the pool allocator, sub_4280C0 for the TLS context).
  • Compare field offsets against the documented layout.
  • If offsets shifted, update the struct documentation and propagate the change to all pages that reference those offsets.

Priority structs to verify: pool allocator (free-list bins at +2128, mutex at +7128), TLS context (280 bytes), Mercury instruction node (112 bytes), scheduler context (~1000 bytes), allocator state (1590+ bytes), phase objects (16 bytes).

Step 5: Validate phase pipeline.

  • Re-extract the phase vtable table (find the new address of the 159-entry pointer array in .data.rel.ro).
  • Verify all 159 phases are present and in the expected order.
  • Check for new phases (count > 159) or removed phases (count < 159).
  • Re-run ptxas --fdevice-time-trace on a test kernel and cross-reference the phase names in the trace output against the wiki's phase list.
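The presence and ordering checks can be sketched as a comparison of two name lists. It assumes phase names are unique within a version; the sample ordering below is invented for illustration and does not reflect the real pipeline order.

```python
def compare_phase_lists(old, new):
    """Report added/removed phases and whether surviving phases kept their order."""
    old_set, new_set = set(old), set(new)
    added = [p for p in new if p not in old_set]
    removed = [p for p in old if p not in new_set]
    # Relative order is preserved if the phases common to both lists
    # appear in the same sequence in each.
    common_old = [p for p in old if p in new_set]
    common_new = [p for p in new if p in old_set]
    return {"added": added, "removed": removed,
            "order_preserved": common_old == common_new}

old_phases = ["OriCopyProp", "Vectorization", "AllocateRegisters", "MercConverter"]
new_phases = ["OriCopyProp", "HypotheticalNewPhase", "AllocateRegisters", "MercConverter"]
report = compare_phase_lists(old_phases, new_phases)
print(report)
```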

Raw Data Locations

All raw analysis artifacts for the current version (v13.0.88) live in the repository under ptxas/:

| Directory | Contents |
|---|---|
| ptxas/raw/ | 40 sweep reports (p1.01--p1.30 plus sub-region splits), per-task investigation reports (P0_*, P1_*, P2_*, etc.) |
| ptxas/decompiled/ | Per-function Hex-Rays decompiled C files (sub_XXXXXX.c, named functions like ctor_003_0x4095d0.c) |
| ptxas/disasm/ | Per-function disassembly files |
| ptxas/graphs/ | Per-function control flow graph JSON files (80,078 files) |
| ptxas/ (root) | The 8 JSON artifacts (ptxas_functions.json, ptxas_strings.json, ptxas_callgraph.json, ptxas_xrefs.json, ptxas_comments.json, ptxas_names.json, ptxas_imports.json, ptxas_segments.json), the IDA database (ptxas.i64), the extraction script (analyze_ptxas.py), and the binary itself (ptxas) |
| ptxas/wiki/src/ | The wiki source pages (this document and all others) |

When updating to a new version, preserve the existing artifacts for v13.0.88 (rename or move to a versioned subdirectory) and store new artifacts alongside them. The sweep reports in ptxas/raw/ are historical records and should never be overwritten.

Limitations and Known Gaps

  • No dynamic validation of optimization correctness. All findings are from static analysis. The identified phase algorithms have not been tested against runtime inputs to verify they produce correct output for all corner cases.

  • 39.6% of functions are vtable-dispatched. Functions with zero static callers can only be reached by finding the vtable or function pointer table that references them. Some vtables in deep .rodata may have been missed, leaving some functions orphaned.

  • No upstream reference for any code. Unlike cicc (LLVM fork) or nvcc (EDG frontend), ptxas has no open-source analog. Every identification is from first principles. This limits confidence for functions where string evidence is absent and structural analysis is the only basis.

  • Template-generated code is indistinguishable. The ~4,000 SASS encoding handlers are generated from internal templates. Without the template source, mapping individual handlers to specific opcodes requires tracing the dispatch table entries, which has only been done for select handlers.

  • Mega-functions are partially opaque. The four functions exceeding 200 KB (sub_169B190 at 280 KB, sub_143C440 at 233 KB, sub_198BCD0 at 239 KB, sub_18A2CA0 at 231 KB) could not be decompiled by Hex-Rays. Their behavior is understood from their callee lists (13,000--15,870 callees each) and their position in the pipeline, but the internal dispatch logic is known only at the disassembly level.

  • ROT13 decoding is necessary but not sufficient. Decoding the 2,000+ knob names reveals the existence of tuning parameters but not their semantics. A knob named MercuryPresumeXblockWaitBeneficial can be decoded from ROT13, but understanding what "xblock wait beneficial" means requires analyzing the code paths that read the knob.

  • Version-specific addresses. All addresses in this wiki apply to ptxas v13.0.88 (build cuda_13.0.r13.0/compiler.36424714_0). Other CUDA toolkit versions will have different addresses, different function counts, and potentially different phase orderings. However, the analysis methodology (string-driven, vtable-driven, callgraph propagation) applies to any version.

  • Indirect calls are undercounted. The 548,693-edge call graph captures only direct call instructions resolved by IDA. Virtual calls through vtable pointers, function pointer callbacks, and computed jumps are not fully captured. The true call graph is significantly denser than what is recorded.

Corrections Log

This section documents every factual error discovered and corrected during the wiki improvement pass. Each entry records the error, the correction, affected pages, and the agent task that performed the fix. The full detail for each correction is in ptxas/raw/P5_11_corrections_log_report.txt.

Summary

| Metric | Count |
|---|---|
| Distinct factual errors corrected | 22 |
| Wiki pages with at least one fix | 30+ |
| Agent tasks that discovered errors | 15 |
| Agent tasks that propagated fixes | 5 |

Corrections by Severity

Systematic errors (affected 5+ pages each)

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 01 | Opcode numbering: wiki assumed two numbering systems; "Selected Opcode Values" table had wrong SASS mnemonic labels (e.g., 93=CALL, 95=EXIT, 97=MOV, 130=BAR) | One numbering system: ROT13 name table index IS the instruction opcode. Correct labels: 93=OUT_FINAL, 95=STS, 97=STG, 130=HSET2 | 15 pages (ir/instructions, ir/cfg, passes/predication, passes/sync-barriers, passes/liveness, passes/general-optimize, passes/rematerialization, passes/copy-prop-cse, passes/strength-reduction, regalloc/abi, regalloc/spilling, intrinsics/sync-warp, codegen/isel, scheduling/latency-model, scheduling/algorithm) | P0-01, P4-02, P5-01 |
| 02 | Register class 6 = UB (Uniform Barrier); classes 2-6 all wrong | Class 6 = Tensor/Accumulator (MMA/WGMMA). Correct table: 2=R(alt), 3=UR, 4=UR(ext), 5=P/UP, 6=Tensor/Acc. Barrier regs use reg_type 9, outside the 7-class system | 7 pages (ir/registers, regalloc/overview, regalloc/algorithm, regalloc/spilling, passes/gmma-pipeline, intrinsics/tensor, ir/overview) | P0-02 |
| 03 | context+1584 had 5 conflicting names: code_object, sched_ctx, arch_backend, optimizer_state, function manager | Single object: SM-specific architecture backend ("sm_backend"), constructed per-compilation-unit in sub_662920 via SM version switch | 3 pages corrected (ir/data-structures, ir/overview, passes/copy-prop-cse); 14 pages acceptable as-is | P0-03 |

Identity misattributions

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 06 | sub_83EF00 (29KB) listed as "Top-level unrolling driver" | sub_83EF00 is MainPeepholeOptimizer (opcode switch on 2, 134, 133, 214, 213, 210). Actual unrolling driver: sub_1390B30 via Phase 22 entry sub_1392E30 | passes/loop-passes.md | P1-04, P5-03 |
| 07 | sub_926A30 (22KB) listed as "Main pipelining engine (modulo scheduling)" | sub_926A30 is the operand-level latency annotator and interference weight builder, called by sub_92C0D0 per-instruction | passes/loop-passes.md | P1-06 |
| 08 | sub_7E7380 described as "full structural equivalence" (opcode, type, all operands, register class comparison) | sub_7E7380 is 30 lines / 150 bytes: narrow predicate-operand compatibility check (predicate bit parity + last operand 24-bit ID + penultimate 8-byte encoding). Full structural comparison done by the 21 callers | passes/copy-prop-cse.md, passes/general-optimize.md | P1-07, P5-06 |

Inverted semantics

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 05 | isNoOp()=1 "means it executes unconditionally" | isNoOp()=1 means the dispatch loop SKIPS execute(). Code: if (!phase->isNoOp()) { phase->execute(ctx); } | passes/rematerialization.md | P0-05 |
| 09 | Hot-cold priority: "1 = cold, 0 = hot" | 1 = hot = higher priority, 0 = cold = lower priority. sub_A9CDE0 (hot detector) returns true -> bit 5 set -> higher priority | passes/hot-cold.md | P1-09, P5-06 |
| 10 | "Fatpoint" implied to be maximum-pressure point | Fatpoint scans for MINIMUM-cost slot. The name refers to the exhaustive (fat) scan evaluating all slots, not to picking the maximum | (verified correct across all pages -- 0 fixes needed) | P1-10, P5-06 |

Wrong numeric values

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 04 | context+1552 = "Legalization stage counter" with 3 values (3, 7, 12) | Pipeline progress counter with 22 values (0-21) spanning all pipeline categories | 4 pages (ir/data-structures, passes/late-legalization, passes/rematerialization, passes/copy-prop-cse) | P0-04 |
| 12 | 5 SASS opcode mnemonic typos: PSMTEST, LGDEPBAR, LGSTS, UBLKPC, UTMAREDG | CSMTEST, LDGDEPBAR, LDGSTS, UBLKCP, UTMREDG | reference/sass-opcodes.md | P2-11 |
| 14 | WGMMA case 9 = 0x1D5D (7517), case 10 = 0x1D5E (7518) | Case 9 = 0x1D5E (7518), case 10 = 0x1D60 (7520). Codes 0x1D5D/0x1D5F are advisory (non-serialization) warnings | passes/gmma-pipeline.md | P3-25 |
| 15 | ABI minimum: gen 5 (sm_60-sm_89) = 16 regs, gen 9+ = 24 regs | gen 3-4 (sm_35-sm_53) = 16, gen 5-9 (sm_60-sm_100) = 24. Binary: (generation - 5) < 5 ? 24 : 16 | regalloc/abi.md | P3-26 |
| 17 | Unrolling rejection table at 0x21D1980 with 36-byte structures | Rejection string pointer array at 0x21D1EA0 with simple integer indices 7-24. The 0x21D1980 table is for peephole operand range lookups | passes/loop-passes.md | P1-04 |
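The expression quoted in correction 15 only yields 16 for generations 3-4 if the comparison is unsigned, so that values below 5 wrap around to a large number. That reading is an inference from the stated result, not something the table says outright. A short sketch under that assumption, emulating 32-bit unsigned arithmetic:

```python
def abi_min_regs(generation):
    """Evaluate (generation - 5) < 5 ? 24 : 16 with 32-bit unsigned wraparound."""
    wrapped = (generation - 5) & 0xFFFFFFFF  # generations below 5 become huge values
    return 24 if wrapped < 5 else 16

for gen in range(3, 10):
    print(gen, abi_min_regs(gen))
```

With a signed comparison, generation 3 would give -2 < 5 and therefore 24, which contradicts the corrected table; the single unsigned compare is the usual compiler idiom for a range check of the form 5 <= generation <= 9.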

Phantom data and scope errors

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 11 | "Approximately 80 additional entries bulk-copied from unk_21C0E00" at SASS opcode indices 322-401, "totaling roughly 402 named opcodes" | Table has exactly 322 entries. The 1288-byte block at unk_21C0E00 is a 322-element identity map {0,1,...,321} copied to a different data structure (encoding category map at obj+0x2478) | reference/sass-opcodes.md | P2-11 |
| 13 | "139 explicitly named phases and 20 architecture-specific unnamed phases" | All 159 phases have names in the static table at off_22BD0C0. The original 139-phase inventory missed 20 phases (e.g., OriCopyProp, Vectorization, MercConverter, AllocateRegisters) | pipeline/overview.md, passes/index.md | P2-14, P4-03 |
| 16 | Warning 7018 (0x1B6A) attributed to SUSPEND/preserved scratch diagnostic | Code 0x1B6A does not exist in the binary. The actual code is 7011 (0x1B63) | regalloc/abi.md | P3-26 |
| 18 | Unrolling rejection codes listed as 0x80000001-0x80000018 | Those hex values appear in diagnostic message STRINGS, not as internal codes. Internal codes are simple integers 7-24 | passes/loop-passes.md | P1-04 |

Minor corrections

| # | Error | Correction | Pages | Agent |
|---|---|---|---|---|
| 19 | sub_80B700/sub_80BC80 listed as unrolling functions | Both are peephole optimizer functions (called through sub_83EF00), not unrolling | passes/loop-passes.md | P1-04 |
| 22 | general-optimize.md called sub_7E7380 "instruction_equivalent" / "structural instruction equivalence" in 6 locations | Renamed to "predicate_operand_compatible" / "predicate-operand compatibility check" | passes/general-optimize.md | P5-06 |

Error Categories

| Category | Count | Examples |
|---|---|---|
| Identity misattribution | 5 | Wrong function-to-role mappings, wrong names for context fields |
| Wrong numeric values | 5 | Wrong opcode labels, wrong hex codes, wrong thresholds, wrong addresses |
| Inverted semantics | 3 | isNoOp skip-vs-execute, hot-cold bit polarity, fatpoint min-vs-max |
| Conflicting definitions | 3 | Register class contradictions across pages |
| Phantom data | 2 | Nonexistent SASS entries 322-401, nonexistent warning 7018 |
| Scope mischaracterization | 2 | context+1552 scope too narrow, phase naming scope too narrow |
| Encoding confusion | 2 | Hex-in-message-string vs internal code, wrong address for lookup table |

Lessons Learned

  1. Behavioral inference is unreliable for opcode identity. Observing that an opcode appears in branch contexts does not make it BRA. Always check the authoritative ROT13 name table.

  2. Cross-page consistency checks catch conflicting speculations. Five pages independently naming the same field (context+1584) is a strong signal that at least four are wrong.

  3. Counts from partial analysis are systematically low. The "3 values" for context+1552 and "139 named phases" both resulted from stopping the search too early. Exhaustive binary sweeps consistently reveal more entries.

  4. Function size is not a reliable identity signal. sub_83EF00 (29KB) was large enough to seem like a major driver, but size alone does not distinguish a peephole optimizer from a loop unroller.

  5. ROT13 decoding + binary cross-validation is the gold standard. Every correction that replaced speculative labels with ROT13-decoded names has held up under subsequent audits.