Embedded ptxas: Architecture Overview
The single most important structural fact about nvlink v13.0.88 is that approximately 95% of its 25.2 MB .text section is not linker code -- it is a complete, statically embedded copy of the ptxas assembler/compiler backend. The actual device linker (ELF merge, symbol resolution, relocation, layout, output) occupies roughly 1.2 MB in the address range 0x400000--0x530000. Everything from 0x530000 through the end of .text at 0x1D32172 (~24 MB, ~38,000 functions) is the ptxas compiler backend: IR primitives, instruction selection, register allocation, instruction scheduling, SASS binary encoding, PTX parsing, and ELF/cubin output generation.
This page documents the evidence for this claim, the complete address map of the embedded ptxas subsystems, the five mega-hub instruction selector dispatch functions, and the ROT13 obfuscation applied to SASS mnemonics.
Evidence for Embedded ptxas
The embedded compiler is not a stripped-down stub -- it is a full-featured PTX-to-SASS compilation pipeline identical in capability to the standalone ptxas binary shipped in the CUDA toolkit. Key evidence:
- Named memory pools. The linker creates "nvlink option parser" and "nvlink memory space" arenas at startup. The embedded compiler creates its own arenas with ptxas-specific names. Memory pool diagnostics at 0x1AEE070 report pool usage statistics (total, freeable, leaked) for the compiler's internal allocations.
- Full option parser. sub_1103030 (29,803 bytes) registers the complete ptxas command-line option set via sub_42F130: --maxrregcount, --opt-level, --gpu-name, --device-debug, --fast-compile, --register-usage-level, --compile-only, --minnctapersm, --warn-spills, --lineinfo, --sp-bounds-check, --device-stack-protector, --sanitize, --position-independent-code, and approximately 50 more. These are forwarded from nvlink's LTO pipeline into the embedded compiler.
- Full compilation pipeline. sub_1112F30 (65,018 bytes) at 0x1112F30 is the top-level per-module compilation driver. It writes PTX headers (.version, .target, .entry __cuda_dummy_entry__ { ret; }), selects codegen callbacks based on mode flags (--compile-as-tools-patch, --extensible-whole-program, --compile-only), validates SM version compatibility, and dispatches to per-function codegen initialization.
- Multi-architecture support. sub_15C0CE0 (14,517 bytes) initializes 7 dispatch hash maps covering sm_75, sm_80, sm_86, sm_87, sm_88, sm_89, sm_90/90a, sm_100/100a/100f, sm_103/103a/103f, sm_110/110a/110f, sm_120/120a/120f, and sm_121/121a/121f. Each architecture gets 7 registered callbacks (nv.info emitter, resource usage table, instruction encoding table, compute capability array, perf-stats handler, cpf_optx handler, codegen options).
- Register allocation and instruction scheduling. The range 0x1850000--0x1A00000 contains the full backend compiler core: ScheduleInstructions (85 KB), ScheduleInstructionsReduceReg, DynBatch, HoistInvariants, ConvertMemoryToRegister, spilling regalloc, SMEM spilling, multi-class register allocation (R-regs, UR-regs, predicates), setmaxnreg CTA-reconfig for Blackwell+, and codegen verification passes.
- ISel mega-hubs. Five functions exceed 160 KB each. These are the top-level instruction selector dispatch functions, too large for Hex-Rays to decompile. Each calls hundreds of pattern matchers, selects the highest-priority match, and dispatches to the corresponding emitter.
Relationship to Standalone ptxas
The standalone ptxas binary in the CUDA toolkit and the compiler backend embedded in nvlink share the same codebase. They differ in how they are invoked:
- Standalone ptxas: invoked as a separate process by nvcc. Reads .ptx files from disk, writes .cubin files.
- Embedded ptxas in nvlink: invoked in-process during LTO (-lto) and PTX JIT compilation. The entry point is sub_4BD760 (called from main() when a PTX input file is detected) or sub_4BC6F0 (called for LTO IR compilation after libnvvm produces PTX output). Options are forwarded programmatically rather than via argc/argv.
The embedded copy supports thread-pool parallelism for split compilation (sub_43FDB0 creates the pool, sub_4264B0 dispatches per-function work items). This is the same --split-compile-extended feature available in standalone ptxas.
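The create/enqueue/barrier/destroy sequence maps onto a standard worker-pool pattern. A minimal Python sketch of that control flow (the `compile_one` helper and the list-of-results return value are illustrative assumptions, not behavior recovered from the binary):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_functions(funcs, compile_one, thread_count):
    """Sketch of the split-compile flow: thread_count == 0 takes the
    sequential path; otherwise per-function work items go to a pool and
    the caller barrier-waits before tearing the pool down.
    `compile_one` stands in for the per-function codegen entry point."""
    if thread_count == 0:
        # Sequential path: compile each function in order.
        return [compile_one(f) for f in funcs]
    # Parallel path: create pool, enqueue one work item per function,
    # wait for all results, then destroy the pool on context exit.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        futures = [pool.submit(compile_one, f) for f in funcs]
        return [fut.result() for fut in futures]
```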
Embedded ptxas Address Map
The following table maps the full address range of the embedded ptxas backend. All addresses are within the .text section of nvlink v13.0.88.
IR Primitives (0x530000 -- 0x620000, ~960 KB)
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x530E80--0x530FD0 | <1 KB | IR node accessors | 22 | sub_530FB0 has 31,399 callers -- universal getOperand(idx) |
| 0x530FE0--0x5B1AB0 | 523 KB | ISel pattern matchers (SM50-7x) | 1,293 | 152 target opcodes, 36 priority levels |
| 0x5B1D80--0x5E4470 | 204 KB | MercExpand mega-hub | 1 | MercExpand dispatch + CFG analysis (too large for Hex-Rays) |
| 0x5E4470--0x600260 | 114 KB | MercExpand engine | ~50 | Bitvector ops, FNV-1a hash maps, register constraint propagation |
| 0x603F60--0x61FA60 | 112 KB | SM50 instruction encoders | 79 | Per-instruction binary encoding functions |
The IR node structure is accessed through 22 leaf functions that constitute the most-called code in the entire binary: sub_530FB0 (get operand by index, 31,399 callers) and sub_A49150 (get instruction attribute, 30,768 callers) form the universal accessor layer. The IR node layout:

| Offset | Size | Field |
|---|---|---|
| 0 | 1 B | operand type tag (1=immediate, 2=register, 6=memref, ...) |
| 4 | 4 B | register class / encoding field (1023 = wildcard "any") |
| 14 | 1 B | flag A |
| 15 | 1 B | flag B |
| 20 | 4 B | data type / secondary encoding |
| 28 | 2 B | IR opcode |
| 32 | 8 B | pointer to operand array (each operand = 32 bytes) |
| 40 | 4 B | total operand count |
| 92 | 4 B | first source operand index |
Number of source operands = *(off+40) + 1 - *(off+92). Number of destination operands = *(off+92).
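As a sanity check on the layout, a small Python sketch that decodes a raw node and applies the operand-count formulas (field names are invented; offsets and formulas are taken from the table above):

```python
import struct

def parse_ir_node(buf: bytes) -> dict:
    """Decode the documented IR node layout from a raw byte buffer.
    Field names are illustrative; offsets follow the layout table."""
    node = {
        "type_tag":      buf[0],                                # 1=imm, 2=reg, 6=memref
        "reg_class":     struct.unpack_from("<I", buf, 4)[0],   # 1023 = wildcard "any"
        "flag_a":        buf[14],
        "flag_b":        buf[15],
        "data_type":     struct.unpack_from("<I", buf, 20)[0],
        "opcode":        struct.unpack_from("<H", buf, 28)[0],
        "operand_ptr":   struct.unpack_from("<Q", buf, 32)[0],  # each operand = 32 bytes
        "operand_count": struct.unpack_from("<I", buf, 40)[0],
        "first_src":     struct.unpack_from("<I", buf, 92)[0],
    }
    # Formulas from the text: sources = total + 1 - first_src, dests = first_src.
    node["num_src"] = node["operand_count"] + 1 - node["first_src"]
    node["num_dst"] = node["first_src"]
    return node
```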
ISA Encoding Tables (0x620000 -- 0xA70000, ~4.3 MB)
This is the largest contiguous subsystem -- 4.3 MB of template-instantiated functions defining the complete NVIDIA GPU instruction set encoding and metadata.
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x620000--0x84DD70 | 2.2 MB | SM100+ SASS encoders | 1,537 | 128-bit instruction encoders for Blackwell ISA |
| 0x84DD70--0xA48290 | 1.7 MB | InstrDesc init table | 1,613 | Instruction descriptor initializers (operand types, latencies) |
| 0xA49010--0xA4AB10 | 4 KB | NVInst accessors | ~30 | IR instruction class hierarchy |
| 0xA4AB10 | 11 KB | NVInst constructor | 1 | Allocates and initializes instruction IR node |
| 0xA4B5E0--0xA4C7C0 | 5 KB | FNV-1a hash tables | 4 | Instruction lookup by hash |
| 0xA5B6B0 | 180 KB | setOperandField dispatch | 1 | Giant switch: sets operand fields by opcode class |
| 0xA62220 | 65 KB | setOperandImm dispatch | 1 | Giant switch: sets immediate operand values |
| 0xA65900 | 67 KB | getOperandField dispatch | 1 | Giant switch: reads operand fields |
| 0xA67910 | 141 KB | getDefaultOperandValue | 1 | Giant switch: returns default operand values per opcode |
The 1,537 SM100+ encoders each translate one instruction variant into a 128-bit SASS instruction word via the core primitive sub_4C28B0(buf, bit_offset, width, value). Opcode breakdown: major=1 (ALU/Scalar) 37.2%, major=2 (Vector/Memory/Control) 62.7%, major=3 (Special) 0.1%, across 118 instruction families.
The 1,613 InstrDesc initializers populate per-instruction metadata: operand count, operand types/constraints, scheduling hints, latency estimates, and execution unit assignments. Combined, the encoder + descriptor tables define the complete NVIDIA GPU ISA from SM50 through SM121.
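A plausible scalar model of the sub_4C28B0(buf, bit_offset, width, value) primitive, treating the 128-bit instruction word as a little-endian byte buffer (the bit ordering within the word is an assumption, not confirmed from the binary):

```python
def set_bitfield(buf: bytearray, bit_offset: int, width: int, value: int) -> None:
    """Pack `value` into `width` bits starting at `bit_offset` of `buf`.

    Sketch of the role sub_4C28B0 plays for the SASS encoders: every
    encoder reduces to a sequence of these field writes into a 16-byte
    (128-bit) instruction word. Little-endian bit numbering is assumed.
    """
    assert 0 <= value < (1 << width), "value does not fit in field"
    word = int.from_bytes(buf, "little")
    mask = ((1 << width) - 1) << bit_offset
    word = (word & ~mask) | (value << bit_offset)  # clear field, then set
    buf[:] = word.to_bytes(len(buf), "little")
```

An encoder would call this repeatedly: once for the opcode field, once per operand field, once per modifier.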
Instruction Codecs (0xA70000 -- 0xCA0000, ~2.2 MB)
Multi-architecture instruction encoding and decoding, organized per-SM.
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0xA709F0 | 54 KB | Field offset query | 1 | 6,491-line switch: (opcode_class, field_id) -> bit_offset |
| 0xA7DE70 | 50 KB | Field presence query | 1 | Mirror: returns hasField boolean |
| 0xA87CE0--0xB25D50 | 630 KB | SM90/100 encoders | ~164 | Per-opcode binary instruction encoders |
| 0xACECF0--0xB77B60 | 700 KB | SM90/100 decoders | ~139 | Binary-to-IR instruction decoders |
| 0xB9FDE0--0xBC2CC0 | 142 KB | SM7x (Volta/Turing) codecs | ~60 | Encoders + decoders for SM70/SM75 |
| 0xBC3FC0--0xBFEC10 | 236 KB | SM75 extended codecs | ~80 | Turing-specific instruction variants |
| 0xC00070--0xC2FB60 | 193 KB | SM80 (Ampere) codecs | ~70 | Ampere instruction encoders |
| 0xC3D540--0xC50970 | 83 KB | SM80 decoders | ~15 | HMMA tensor core, SHF, memory decoders |
| 0xC7EC90--0xC9EE60 | 131 KB | SM86/89 (Ada) codecs | ~40 | GA10x / AD10x encoders + decoders |
Each encoder packs IR operands into a 128-bit SASS instruction word at *(a1+40). Each decoder unpacks a 128-bit word back into IR form. The sentinel value 1023 (register field) maps to RZ (zero register), and 31 (predicate field) maps to PT (true predicate). Architecture-specific encoder variants are differentiated by the helper functions they call: sub_A5A000 (SM70 Volta), sub_A5AB30 (SM75 Turing), sub_A59D80 (SM80 Ampere), etc.
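The sentinel handling can be captured in two one-line helpers (names are illustrative, not recovered from the binary):

```python
def reg_name(idx: int) -> str:
    # 1023 in a register field denotes RZ, the hardwired zero register.
    return "RZ" if idx == 1023 else f"R{idx}"

def pred_name(idx: int) -> str:
    # 31 in a predicate field denotes PT, the always-true predicate.
    return "PT" if idx == 31 else f"P{idx}"
```

A decoder printing disassembly would route every register and predicate field through helpers like these.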
Per-Arch ISel Backends
Instruction selection is implemented as a linear-scan architecture: for each IR instruction, every pattern matcher is called in sequence, and the match with the highest priority wins. Each backend has its own set of pattern matchers, a mega-hub dispatch function (too large for Hex-Rays), and instruction emitters.
SM80 (Ampere) ISel Backend (0xCA0000 -- 0xDA0000, ~1 MB)
| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xCA0000--0xCDC000 | 240 KB | Operand emission + packing | 137 |
| 0xCDD5F0--0xCDD690 | <1 KB | Operand predicates | 15 |
| 0xCE2000--0xD5FD70 | 510 KB | ISel pattern matchers | 259 |
| 0xD5FD70 | 239 KB | SM80 ISel mega-hub | 1 |
| 0xD9A400--0xDA0000 | 23 KB | Binary encoding | 17 |
Three-phase pipeline: (1) pattern match on IR attributes/operand types, (2) operand emission into instruction descriptor, (3) binary encoding into 128-bit SASS word.
SM100+ (Blackwell) SASS Codec -- Second Table (0xDA0000 -- 0xF16000, ~1.5 MB)
| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xDA0310--0xE436D0 | 669 KB | Blackwell encoders | 438 |
| 0xE43C20 | 1 KB | Encoder dispatch | 1 |
| 0xE43DC0--0xF15A50 | 847 KB | Blackwell decoders | 648 |
| 0xEFE6C0 | 1 KB | Decoder dispatch | 1 |
Format 1 instructions: 147. Format 2 (extended with modifiers): 290. Format 3 (special wide): 1. Every encoder has a mirror decoder; decoders outnumber encoders because they also handle architecture-variant forms.
SM75 (Turing) ISel Backend (0xF16000 -- 0x100C000, ~984 KB)
| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xF16030--0xF160F0 | <1 KB | Operand predicates | 15 |
| 0xF10080--0xF15A50 | 22 KB | Instruction emitters | 18 |
| 0xF16150--0xFBB780 | 678 KB | ISel pattern matchers | 276 |
| 0xFBB810 | 280 KB | SM75 ISel mega-hub | 1 |
| 0xFFFDF0--0x100BBF0 | 48 KB | Post-ISel emit+encode | 38 |
This is the largest single-architecture ISel backend. sub_FBB810 at 280 KB is the largest function in the binary.
SM89/90 (Ada/Hopper) Backend (0x100C000 -- 0x11EA000, ~1.9 MB)
| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0x100C000--0x10FFFFF | 1.0 MB | Shared instruction encoders | ~750 |
| 0x1100000--0x1120000 | 128 KB | Backend driver | ~30 |
| 0x1104950 | 38 KB | ptxas option parser | 1 |
| 0x1112F30 | 65 KB | Compilation driver main | 1 |
| 0x1116890 | 60 KB | ELF output + metadata gen | 1 |
| 0x1120000--0x119BF40 | 496 KB | ISel pattern matchers | ~160 |
| 0x119BF40 | 231 KB | SM89/90 ISel mega-hub | 1 |
| 0x11D4680--0x11EA000 | 90 KB | Scheduler + emission | ~16 |
PTX Frontend (0x11EA000 -- 0x15C0000, ~3.5 MB)
The PTX frontend parses PTX assembly text, validates instructions against SM version constraints, and lowers them to the internal IR consumed by the per-architecture ISel backends.
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x11EA000--0x126C000 | 520 KB | ISel pattern-match predicates | ~160 | Shared across all SM targets |
| 0x126CA30 | 239 KB | PTX ISel mega-hub | 1 | Shared PTX-level instruction selector |
| 0x12A7000--0x12B0000 | 36 KB | PTX type system + operand builders | ~20 | Type constructors, operand IR building |
| 0x12B0000--0x12BA000 | 40 KB | Special register name table | ~20 | %ntid, %laneid, %smid, %clock64, %ctaid, ... |
| 0x12BA000--0x12D0000 | 88 KB | ISel lowering passes | ~30 | LTO-path instruction lowering |
| 0x12D0000--0x12D5000 | 20 KB | DWARF debug line info gen | ~5 | Line table emission for LTO-compiled code |
| 0x12D5000--0x1400000 | 1.2 MB | ISel pattern clones | ~500 | Parametric clones per SM (sm_5x through sm_10x) |
| 0x1400000--0x1430000 | 192 KB | LTO pipeline + ELF emit | ~20 | Top-level LTO pipeline, MMA lowering |
| 0x1430000--0x1442000 | 72 KB | PTX version/SM gates | ~30 | Version-gated instruction validators |
| 0x1442000--0x146BEC0 | 156 KB | Instruction emission handlers | ~80 | Per-instruction PTX code generators |
| 0x146BEC0 | 206 KB | ptx_load_store_validator | 1 | Memory operation validator with SM checks |
| 0x147EF50 | 288 KB | ptx_instruction_semantic_analyzer | 1 | Master validator: all SM version requirements |
| 0x1487650 | 240 KB | ptx_statement_processor | 1 | Top-level PTX statement handler |
| 0x14932E0--0x15B86A0 | 700 KB | Instruction handlers + builtins | ~250 | Code-template generators for CUDA builtins |
| 0x15B86A0 | 345 KB | cuda_builtin_prototype_generator | 1 | 608-case switch covering sm20 through sm10x builtins |
The cuda_builtin_prototype_generator is the second-largest function in the binary at 345 KB. It maps builtin index numbers to PTX prototype strings of the form .weak .func (...) __cuda_smXX_foo (...). Function families include div, rem, rcp, sqrt, dsqrt, barrier, wmma, shfl, vote, matchsync, warpsync, reduxsync, sanitizer_memcheck, tcgen05, bulk_copy, and cp_async_bulk_tensor.
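The structure of this generator can be pictured as a table-driven lookup from builtin index to prototype string. A toy Python sketch -- the indices and signatures below are invented for illustration; only the .weak .func prototype shape follows the text above:

```python
# Toy model of the builtin-index -> PTX prototype mapping.
# Indices and signatures are hypothetical; the real function is a
# 608-case switch producing strings of this general shape per SM.
BUILTIN_PROTOTYPES = {
    0: ".weak .func (.reg .f32 %ret) __cuda_sm{sm}_rcp (.reg .f32 %x)",
    1: ".weak .func (.reg .f64 %ret) __cuda_sm{sm}_dsqrt (.reg .f64 %x)",
}

def builtin_prototype(index: int, sm: int) -> str:
    """Return the PTX prototype string for a builtin index and SM version."""
    try:
        return BUILTIN_PROTOTYPES[index].format(sm=sm)
    except KeyError:
        raise ValueError(f"unknown builtin index {index}")
```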
Compilation Pipeline (0x15C0000 -- 0x1A00000, ~4.2 MB)
This region contains the per-function compilation pipeline from SM dispatch through code generation to backend verification.
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x15C0CE0 | 15 KB | SM dispatch tables | 1 | 7 callback maps for sm_75 through sm_121 |
| 0x15C44D0--0x15CA450 | 348 KB | nv.info attribute emitters | ~10 | Per-SM EIATTR record generation (largest: 78 KB) |
| 0x1610000--0x163FFFF | 192 KB | PTX compilation frontend | ~40 | Operand handling, control flow, symbol management |
| 0x1640000--0x165FFFF | 128 KB | Codegen operand lowering | ~30 | Atom formatting, offset calculation |
| 0x1660000--0x169FFFF | 256 KB | ISel/scheduling + DWARF | ~40 | Instruction scheduling, peephole, debug emission |
| 0x16A0000--0x16DFFFF | 256 KB | OCG intrinsic lowering | ~80 | builtin_ocg_* handlers, tcmma/tensor operations |
| 0x16E0000--0x16E3AB0 | 12 KB | tcgen05 intrinsic codegen | ~10 | SM100 tensor memory address setup, guardrails |
| 0x16E4D60--0x16F6000 | 70 KB | PTX instruction builder | ~20 | Instruction construction, operand insert |
| 0x16F6000--0x1740000 | 296 KB | Tepid instruction scheduler | ~50 | Full instruction scheduling pipeline |
| 0x175D000--0x1768000 | 44 KB | Knobs/config infrastructure | ~15 | Runtime tuning parameters |
| 0x1769000--0x1850000 | 924 KB | SASS opcode tables | ~150 | SM70-SM120 opcode encoding/emission with ROT13 mnemonics |
| 0x1850000--0x186F000 | 124 KB | Instruction scheduling | ~15 | ScheduleInstructions (85 KB), ReduceReg, DynBatch, Cutlass-aware |
| 0x1878000--0x189C000 | 144 KB | ConvertMemoryToRegister | ~20 | Shared-memory to register promotion |
| 0x189C000--0x18FC000 | 384 KB | Register allocation | ~40 | Spilling, SMEM spilling, multi-class regalloc |
| 0x18FC000--0x1920000 | 144 KB | setmaxnreg / CTA-reconfig | ~20 | Blackwell+ register budget negotiation |
| 0x1916000--0x1960000 | 296 KB | mbarrier + ORI passes | ~30 | Copy propagation, dead-code elimination |
| 0x1960000--0x19E0000 | 512 KB | Codegen verification | ~40 | Uninitialized register detection, remat verify |
| 0x19A0000--0x1A00000 | 384 KB | Metrics + scheduling guidance | ~35 | Occupancy estimation, loop analysis, regalloc guidance |
SASS Emission (0x1A00000 -- 0x1D32172, ~3.2 MB)
The final segment of .text handles SASS instruction lowering, ABI enforcement, ELF/cubin output, name demangling, and DWARF debug info.
| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x1A009C0--0x1A0B180 | 6 KB | Bug injection framework | ~5 | Testing hooks for intentional bug injection |
| 0x1A0B180--0x1A20000 | 84 KB | Instruction operand analysis | ~30 | Operand lowering, constant buffer encoding |
| 0x1A1A000--0x1A2A000 | 64 KB | Warp sync / mbarrier | ~15 | %%mbarrier_%s_%s instruction generation |
| 0x1A4B000--0x1A61090 | 88 KB | WGMMA pipeline analysis | ~20 | Warpgroup MMA live ranges, sync injection |
| 0x1A61090--0x1A6A480 | 38 KB | Scoreboard management | ~10 | Instruction scheduling scoreboard |
| 0x1A6A480--0x1AA2090 | 352 KB | ISel/lowering + encoding | ~80 | Instruction selection, SASS emission |
| 0x1AA2090--0x1ABF000 | 124 KB | Regalloc + ABI | ~30 | Register allocation, ABI handling |
| 0x1AEAA90--0x1AEE070 | 14 KB | Instruction vtable factory | ~10 | SASS instruction vtable construction |
| 0x1AEE070--0x1B00000 | 70 KB | Memory pool diagnostics | ~10 | Pool tracking, encoding passes |
| 0x1B00000--0x1B20000 | 128 KB | Register liveness | ~30 | Interference graph construction |
| 0x1B19750--0x1B40000 | 160 KB | Machine scheduling + CFG | ~40 | Basic block management |
| 0x1B40000--0x1B60000 | 128 KB | Dependency tracking | ~30 | Scoreboard / dependency graph |
| 0x1B60000--0x1B9FFFF | 256 KB | ISel + lowering (tail) | ~200 | PTX-to-SASS ISel, tail-call optimization |
| 0x1BA0000--0x1BFFFFF | 384 KB | ABI / calling convention | ~150 | Return address mgmt, convergent boundary, coroutine regs |
| 0x1C00000--0x1CDFFFF | 896 KB | ELF section builder | ~120 | .nv.constant, .nv.shared, cubin/fatbin container |
| 0x1CE0000--0x1CEDFFF | 56 KB | C++ name demangler | ~40 | Itanium ABI + MSVC demangler |
| 0x1CF0000--0x1D32172 | 265 KB | DWARF + LEB128 + KNOBS | ~140 | Debug info generation, SSE-accelerated LEB128, config system |
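The LEB128 routine mentioned in the last row reduces, in scalar form, to the standard DWARF algorithm: seven payload bits per byte, high bit set while more bytes follow. The binary's SSE version processes many bytes per iteration; this Python sketch is the reference loop:

```python
def uleb128(value: int) -> bytes:
    """Unsigned LEB128 encoding as used throughout DWARF: emit the value
    7 bits at a time, low bits first, setting bit 7 on every byte except
    the last to mark continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)
```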
The Five Mega-Hub Functions
Five functions exceed 160 KB each. They are the top-level instruction selector dispatch functions for different SM architecture generations. Each contains a massive jump table that calls hundreds of ISel pattern matchers in sequence, selects the highest-priority match, then dispatches to the corresponding emitter. All five are too large for Hex-Rays to decompile.
| Address | Size | Target | Description |
|---|---|---|---|
| sub_FBB810 | 280 KB | SM75 (Turing) | Calls 276+ pattern matchers. Largest function in the binary |
| sub_126CA30 | 239 KB | SM50-7x (shared) | Covers Maxwell/Pascal/Volta backends |
| sub_D5FD70 | 239 KB | SM80 (Ampere) | Calls 259 pattern matchers for Ampere-class GPUs |
| sub_119BF40 | 231 KB | SM89/90 (Ada/Hopper) | Calls ~160 pattern matchers |
| sub_5B1D80 | 204 KB | SM50-7x (MercExpand) | MercExpand instruction expansion dispatch |
The ISel protocol is uniform across all backends:
for each pattern_matcher in pattern_table:
    matched = pattern_matcher(ctx, ir_node, &pattern_id, &priority)
    if matched && priority > best_priority:
        best_priority = priority
        best_id = pattern_id
emitter_table[best_id](ctx, ir_node)  // emit selected instruction
Each pattern matcher queries IR node attributes via sub_A49150, checks operand counts via sub_530FD0/sub_530FC0, retrieves operands via sub_530FB0, validates operand types and register classes, and writes (pattern_id, priority) if all constraints are satisfied.
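A runnable Python rendering of this linear-scan protocol (matcher and emitter tables here are illustrative; the real matchers write their results through out-parameters as described above):

```python
def select_and_emit(ctx, ir_node, pattern_table, emitter_table):
    """Linear-scan instruction selection: run every matcher against the
    IR node, keep the highest-priority match, then dispatch its emitter.
    Matchers return None on no-match, or a (pattern_id, priority) pair."""
    best_priority, best_id = -1, None
    for matcher in pattern_table:
        result = matcher(ctx, ir_node)
        if result is not None:
            pattern_id, priority = result
            if priority > best_priority:
                best_priority, best_id = priority, pattern_id
    if best_id is None:
        raise LookupError("no ISel pattern matched")
    return emitter_table[best_id](ctx, ir_node)
```

Note that every matcher runs even after a match is found; a later matcher can only win by reporting a strictly higher priority.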
ROT13 Obfuscation of SASS Mnemonics
NVIDIA applies ROT13 encoding to SASS instruction mnemonic strings stored in the binary. The decoder function sub_1A40AC0 uses SSE/SIMD vectorization for bulk ROT13 processing (loading 16 bytes at a time via _mm_load_si128). The SASS opcode table initializer at 0x1A85E40 stores all mnemonics in ROT13-encoded form; they are decoded at runtime before use.
Known decoded mnemonics:
| ROT13 | Decoded | Instruction |
|---|---|---|
| VZNQ | IMAD | Integer multiply-add |
| SZHY | FMUL | Float multiply |
| SNQQ | FADD | Float add |
| SRAPR | FENCE | Memory fence |
| ZREPHEL | MERCURY | Blackwell codename prefix |
| PPGY | CCTL | Cache control |
| OFLAP | BSYNC | Barrier synchronization |
| ERZBAR | REMOVE | Instruction removal tag |
The "MERCURY" prefix (ZREPHEL in ROT13) corresponds to sm_100+ (Blackwell) and appears throughout the compilation pipeline as a codename. ROT13 is also applied to some internal ELF section names: .sync_restrict::shared::read::mma::a is stored as its ROT13 equivalent, .acc::f16 as .npp::s16, and .sp::2to4 as .fc::2gb4.
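Because ROT13 is its own inverse, the decoding performed by sub_1A40AC0 can be reproduced in a few lines of Python (a scalar version of the binary's 16-bytes-at-a-time SSE loop):

```python
def rot13(s: str) -> str:
    """Rotate each ASCII letter by 13 positions; digits and punctuation
    (e.g. the '.' and ':' in section names) pass through unchanged.
    Applying the function twice returns the original string."""
    out = []
    for ch in s:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + 13) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + 13) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)
```

Running the table entries through this function (e.g. rot13("VZNQ")) recovers the plaintext mnemonics.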
Size Summary
| Subsystem | Address Range | Size | Functions | % of .text |
|---|---|---|---|---|
| Linker core (not ptxas) | 0x400000--0x530000 | 1.2 MB | ~600 | 5% |
| IR primitives + SM50-7x ISel | 0x530000--0x620000 | 960 KB | ~1,450 | 4% |
| ISA encoding tables | 0x620000--0xA70000 | 4.3 MB | ~3,150 encoders + ~1,613 descriptors | 17% |
| Instruction codecs (multi-arch) | 0xA70000--0xCA0000 | 2.2 MB | ~700 | 9% |
| SM80 ISel backend | 0xCA0000--0xDA0000 | 1.0 MB | ~430 | 4% |
| SM100+ codec (second table) | 0xDA0000--0xF16000 | 1.5 MB | ~1,090 | 6% |
| SM75 ISel backend | 0xF16000--0x100C000 | 984 KB | ~350 | 4% |
| SM89/90 backend | 0x100C000--0x11EA000 | 1.9 MB | ~980 | 8% |
| PTX frontend | 0x11EA000--0x15C0000 | 3.5 MB | ~1,100 | 14% |
| Compilation pipeline | 0x15C0000--0x1A00000 | 4.2 MB | ~700 | 17% |
| SASS emission + ABI + ELF | 0x1A00000--0x1D32172 | 3.2 MB | ~1,300 | 13% |
| Total embedded ptxas | 0x530000--0x1D32172 | ~24 MB | ~38,000 | ~95% |
Cross-Reference: Key Functions
| Function | Size | Identity | Role |
|---|---|---|---|
| sub_530FB0 | <1 KB | IRNode_GetOperand | Universal operand accessor (31,399 callers) |
| sub_A49150 | <1 KB | IRInstr_GetAttribute | Universal attribute accessor (30,768 callers) |
| sub_4C28B0 | <1 KB | setBitfield | Core encoding primitive for all SASS encoders |
| sub_1112F30 | 65 KB | ptxas_main_compilation_driver | Top-level per-module compilation entry |
| sub_1103030 | 30 KB | ptxas_option_definition_table_builder | Full option parser (~60 options) |
| sub_1104950 | 38 KB | ptxas_command_option_parser | Option processing and validation |
| sub_15C0CE0 | 15 KB | init_sm_dispatch_tables | SM architecture callback registration |
| sub_1A40AC0 | <1 KB | rot13_string_decoder | SIMD-vectorized ROT13 decoder |
| sub_4BD760 | varies | ptxas_jit_compile | Entry point for PTX JIT compilation |
| sub_4BC6F0 | varies | compile_linked_lto_ir | Entry point for LTO compilation |
| sub_15B86A0 | 345 KB | cuda_builtin_prototype_generator | 608-case builtin switch (second-largest function) |
| sub_147EF50 | 288 KB | ptx_instruction_semantic_analyzer | Master instruction validator |
Compilation Pipeline: sub_1112F30 Reconstructed
sub_1112F30 (65,018 bytes at 0x1112F30) is the top-level per-module compilation driver. It receives a module context a1 and a PTX module descriptor a2, then orchestrates the full PTX-to-SASS compilation in a sequence of clearly delineated phases. The following ASCII pipeline diagram and reconstructed pseudocode are derived from the decompiled output of this function.
Pipeline Diagram
nvlink main / LTO pipeline
|
v
+--------------------------+
| sub_1112F30 | per-module compilation driver (65 KB)
| "ptxas_compile_module" |
+--------------------------+
|
Phase 1: Option Query & Cache Config
| query "def-load-cache", "force-load-cache",
| "def-store-cache", "force-store-cache"
| from option map at a1+904
v
Phase 2: Cancellation Check
| if a1+288 (cancel flag): call cancel_callback(a1+296)
| if returns 1 -> longjmp to error handler
v
Phase 3: Timing Gate
| check profiling flags a1+104..107, a1+402
| start wall-clock timer (sub_45CCD0)
| start high-res timer (sub_44EF30) if enabled
v
Phase 4: Callback Registration
| register per-instruction callback: sub_1108860 -> a1+408
| register per-function callback: sub_1101EB0 -> a1+416
| initialize PTX version tables: sub_12B30E0, sub_12B31D0
v
Phase 5: SM Version Validation
| sscanf .target string -> extract SM version number
| sscanf a1+576 (max supported SM) -> compare
| if module SM > max supported -> fatal error
| validate PTX version compatibility (sub_12A8360)
v
Phase 6: Mode Flag Dispatch (codegen callback selection)
| select (init_callback, begin_callback) pair:
|
| if --compile-only OR --assyscall OR --compile-as-tools-patch:
| init = sub_110CD20 ("compile_only_init")
| begin = sub_11089E0 ("compile_only_begin")
|
| elif --extensible-whole-program AND NOT --device-debug:
| init = sub_110D110 ("ewp_init")
| begin = sub_1107F10 ("ewp_begin")
|
| elif --extensible-whole-program AND --device-debug:
| init = sub_110CD20
| begin = sub_11089E0
|
| else (normal LTO / standard compilation):
| if --extensible-whole-program flag:
| init = sub_110CBA0 ("standard_init_ewp")
| else:
| init = sub_110D0B0 ("standard_init")
| begin = sub_1109180 ("standard_begin")
v
Phase 7: PTX Header Emission (dummy entry generation)
| if no explicit entry functions AND not tools-patch mode:
| if output-to-memory (sub_464740 returns true):
| sub_12AF550("__cuda_dummy_entry__", ptx_header, ...)
| else (output to file):
| fopen(output_path, "w")
| fprintf: .version <ptx_version>
| fprintf: .target <sm_name>
| fprintf: .entry __cuda_dummy_entry__ { ret; }
| fclose
| sub_12AF200(output_path, ...)
v
Phase 8: Tools-Patch Resource Allocation Warnings
| if --compile-as-tools-patch:
| warn if allocating: textures, surfaces, samplers, constants
| if --assyscall:
| warn if allocating: textures, surfaces, samplers
v
Phase 9: Compilation Flags Setup
| disable --fast-compile for ABI-less calls
| disable --extensible-whole-program for ABI-less compilation
| process --position-independent-code
| check texmode_independent vs texmode_unified
| check --preserve-relocs compatibility
| check --legacy-bar-warp-wide-behavior (SM70 only)
| check --g-tensor-memory-access-check (SM100+ only)
v
Phase 10: Hash Map + Codegen Context Allocation
| allocate 8 hash maps via sub_4489C0/sub_465020:
| [0] instruction map (cap 0x100)
| [1] function list (cap 0x400)
| [2] basic-block map (cap 0x100), conflict map (cap 0x40)
| [3] symbol map (cap 0x100), label map (cap 0x100)
| [4] directive map (cap 0x40), auxiliary map (cap 0x40)
| [5] register table (cap 0x20), operand table (arch-size)
| allocate per-function resource array:
| sub_12AE300(a2) entries x 48 bytes each -> a1+336
| allocate per-function compilation result array:
| count entries x 112 bytes each -> a1+256
v
Phase 11: Register Callbacks on Module IR
| sub_1102AC0 -> per-function callback on module functions list
| sub_1101E90 -> per-symbol callback on symbol table
| sub_1111DB0 -> per-function dispatch on func IR list
| sub_1101DE0 -> per-global callback (unless --compile-only)
| sub_110F5E0 -> per-section callback on section list
| sub_1101F60 -> per-symbol post-process callback
v
Phase 12: Address Width + Register Budget
| determine address width: 32-bit or 64-bit
| SM <= 13 -> 32-bit (maxnreg=32)
| SM > 13 -> read from module metadata
| PIC validation: if PIC address > threshold -> disable PIC
| maxrregcount validation: warn on mismatch
| 32-bit register check: SM > 90 -> fatal
v
Phase 13: Entry Point Collection
| if explicit -e entries:
| resolve each name through module's symbol table
| build ordered entry list -> v322
| elif explicit -E entries:
| resolve through module, build list
| else:
| use module's own entry list (a2+88)
v
Phase 14: Transfer Compilation State to Context
| copy hash maps, flags, and tables into a1+1072..1296
| create alias tracking map (sub_4489C0, cap 0x100)
| create callee usage map (sub_4489C0, cap 0x418)
v
Phase 15: init_callback(a1, entry_list) -- Codegen Initialization
| calls the selected init callback (from Phase 6)
| sub_110CD20: builds per-function codegen descriptors
| for each function: sub_110BC90 -> allocate codegen record
| stores in a1+1192 (register usage map)
| returns ordered list of functions needing compilation
v
Phase 16: Load/Store Cache Mode Assignment
| for each function in compilation list:
| if force-load-cache -> set cache_mode = 2
| elif def-load-cache -> set cache_mode = 1
| else -> set cache_mode based on call graph analysis
v
Phase 17: Indirect Call + MMA Validation
| for each function:
| check for indirect calls with .f64 MMA -> warn
| check for mutual recursion markers -> flag
v
Phase 18: Scheduling Class Assignment
| for each function: assign scheduling class (0, 1, or 2)
| class 0 = no scheduling needed
| class 1 = standard scheduling
| class 2 = aggressive scheduling (callee analysis)
| propagate class upward through call graph if needed
v
Phase 19: Debug Info Setup
| if --device-debug: init DWARF context (sub_1672520)
| check .debug_abbrev / .debug_info availability
v
Phase 20: Reserved Register Configuration
| if --first-reserved-rreg: validate (min=4)
| compute total reserved = first_reserved + count
v
Phase 21: Build Per-Function Codegen Configuration
| pack ~50 compilation flags into struct at v334..v358:
| device_debug, lineinfo, fast_compile, maxrregcount,
| opt_level, compile_only, tools_patch, ewp, preserve_relocs,
| sm_version, address_width, default caches, PIC, ...
| call sub_16257C0 to create codegen pipeline config object
v
Phase 22: Output File Setup
| if --output specified: create/truncate output file
v
+------+------+
| |
| thread_count == 0 thread_count > 0
| (sequential) (parallel via thread pool)
| |
v v
Phase 23a: Sequential Phase 23b: Parallel
Per-Function Loop Per-Function Loop
| |
for each func: sub_43FDB0(thread_count)
| = create thread pool
| |
+---> sub_110AA30 for each func:
| "codegen_init" build work item (48 bytes):
| - create OCG context [0..15] = timing state
| - set "NVIDIA" [24] = per-func codegen ctx
| - set "ptxocg.0.0" [32] = compilation driver ref
| - configure 30+ fields [40] = optional sync state
| - set opt level |
| - set SM-specific flags sub_43FF50(pool, sub_1107420, item)
| - invoke vtable->init = enqueue work item
| to map symbol names |
| sub_43FFE0(pool) = barrier wait
+---> sub_1655A60 sub_43FE70(pool) = destroy pool
| "codegen_per_func" |
| - initialize 48 pass sub_1107420 (thread worker):
| enable/disable flags sub_1102B30 -> setjmp + compile
| (passes 0..47) record timing per function
| - register lowering record peak memory
| callbacks on IR
| - set up UDT/UFT
| relocations
| - process entry
| function list
| - ISel dispatch
| - register allocation
|
+---> sub_1102B30
| "codegen_compile"
| - setjmp for error
| recovery
| - invoke compilation
| via vtable callback
| at a1+96
| - on error: longjmp,
| record failure
|
+---> timing measurement
| record compile time
| per-function stats
|
+---> sub_110D2A0
"codegen_finalize"
- emit ELF metadata
- write nv.info records
- output SASS binary
- write register usage
- cleanup per-func state
|
v
Phase 24: Post-Compilation Cleanup
| clean up hash maps, free temp allocations
| validate register usage across functions
| if --compile-only: cross-check register budgets
v
Phase 25: Pipeline Config Teardown (sub_1626480)
|
v
Phase 26: Final Cleanup
| destroy hash maps (sub_4650A0)
| free per-function resource array
| free codegen config
| return
Reconstructed Pseudocode for sub_1112F30
// sub_1112F30 -- per-module compilation driver
// Address: 0x1112F30, Size: 65,018 bytes
// Parameters:
// a1 = compilation driver context (opaque struct, ~1300 bytes)
// a2 = PTX module descriptor (parsed PTX representation)
// Returns: 0 on success
int ptxas_compile_module(CompilerCtx *ctx, PtxModule *mod) {
// ---- Phase 1: Query cache configuration options ----
bool def_load_cache = option_get_bool(ctx->option_map, "def-load-cache");
bool force_load_cache = option_get_bool(ctx->option_map, "force-load-cache");
bool def_store_cache = option_get_bool(ctx->option_map, "def-store-cache");
bool force_store_cache= option_get_bool(ctx->option_map, "force-store-cache");
// ---- Phase 2: Cancellation check ----
if (ctx->cancel_flag) {
if (ctx->cancel_callback(ctx->cancel_handle, ctx->cancel_arg) == 1)
longjmp_to_error_handler();
}
// ---- Phase 3: Timing infrastructure ----
timing_gate_start(ctx); // sub_45CCD0 on a1+128/a1+144
if (ctx->high_res_timing)
ctx->wall_start = get_hires_time(); // sub_44EF30
// ---- Phase 4: Register per-instruction and per-function callbacks ----
list_foreach(ctx->instr_callbacks, per_instr_callback, ctx); // sub_1108860
list_foreach(ctx->func_callbacks, per_func_callback, ctx); // sub_1101EB0
ptx_version_table_init(mod); // sub_12B30E0
ptx_version_table_validate(mod); // sub_12B31D0
// ---- Phase 5: SM version validation ----
unsigned sm_module, sm_max;
sscanf(mod->target_string, "%*[^0-9]%d", &sm_module);
sscanf(ctx->max_sm_string, "%*[^0-9]%d", &sm_max);
if (sm_max < sm_module)
fatal_error(ERR_SM_MISMATCH);
if (mod->ptx_version_flag) {
if (!ptx_version_compatible(sm_max, sm_module)) // sub_12A8360
fatal_error(ERR_PTX_VERSION);
if (!ctx->allow_unsupported_sm)
fatal_error(ERR_UNSUPPORTED_SM, mod->target_string, ctx->max_sm_string);
}
// ---- Phase 6: Select codegen callback pair based on mode flags ----
CodegenInitFn init_fn;
CodegenBeginFn begin_fn;
if (ctx->compile_only || ctx->assyscall || ctx->tools_patch) {
init_fn = compile_only_init; // sub_110CD20
begin_fn = compile_only_begin; // sub_11089E0
} else if (ctx->ewp_mode) {
if (ctx->device_debug) {
init_fn = compile_only_init; // sub_110CD20
begin_fn = compile_only_begin; // sub_11089E0
} else {
init_fn = ewp_init; // sub_110D110
begin_fn = ewp_begin; // sub_1107F10
}
} else {
begin_fn = standard_begin; // sub_1109180
if (ctx->ewp_flag)
init_fn = standard_init_ewp; // sub_110CBA0
else
init_fn = standard_init; // sub_110D0B0
}
// ---- Phase 7: Emit PTX dummy entry if no explicit entries ----
if (!mod->entry_list && !ctx->assyscall && !ctx->tools_patch) {
    int saved_flag = mod->flag_236;
if (output_is_memory(ctx->func_callbacks)) {
ptx_emit_entry_inline("__cuda_dummy_entry__", ptx_header_text, mod);
} else {
char *path = get_output_path(ctx->func_callbacks);
FILE *fp = fopen(path, "w");
if (mod->ptx_version_str)
fprintf(fp, "\t.version %s\n", mod->ptx_version_str);
if (mod->target_str)
fprintf(fp, "\t.target %s\n", mod->target_str);
fprintf(fp, "\t.entry __cuda_dummy_entry__ { ret; }\n");
fclose(fp);
ptx_parse_file(path, mod);
}
mod->flag_236 = saved_flag; // restore flag modified by parse
}
// ---- Phase 8-9: Compilation flags, warnings, compatibility ----
if (ctx->tools_patch) {
if (mod->alloc_textures) warn("Allocating additional textures");
if (mod->alloc_surfaces) warn("Allocating additional surfaces");
if (mod->alloc_samplers) warn("Allocating additional samplers");
if (mod->alloc_constants) warn("Allocating additional constants");
ctx->fast_compile = 0;
}
ctx->abi_mode = 0;
if (ctx->ewp_mode) ctx->ewp_mode = 0; // one-shot
if (mod->has_entry_funcs || mod->has_extern_funcs) {
if (ctx->fast_compile)
warn("'--fast-compile' incompatible with calls without ABI");
ctx->fast_compile = 0;
if (ctx->ewp_flag)
warn("'--extensible-whole-program' incompatible with compilation without ABI");
ctx->ewp_flag = 0;
}
// PIC handling
bool pic = option_get_bool(ctx->option_map, "position-independent-code");
if (!ctx->compile_only && !ctx->device_debug && !ctx->tools_patch
&& !ctx->ewp_mode && !ctx->ewp_flag) {
if (!mod->has_relo && !pic)
ctx->enable_pic = 1; // auto-enable PIC for normal compilation
}
if (ctx->ewp_flag && pic) {
ctx->enable_pic = 0;
warn("'--position-independent-code' incompatible with '--extensible-whole-program'");
}
if (ctx->early_exit)
return 0; // --dry-run equivalent
// ---- Phase 10: Allocate codegen data structures ----
HashMaps maps;
maps.instr_map = hashmap_create(cmp_fn, free_fn, 0x100);
maps.func_list = hashmap_create(cmp_fn, free_fn, 0x400);
maps.bb_map = hashmap_create(cmp_fn, free_fn, 0x100);
maps.conflict = hashmap_create(cmp_fn, free_fn, 0x40);
maps.sym_map = hashmap_create(cmp_fn, free_fn, 0x100);
maps.label_map = hashmap_create(cmp_fn, free_fn, 0x100);
maps.dir_map = hashmap_create(cmp_fn, free_fn, 0x40);
maps.aux_map = hashmap_create(cmp_fn, free_fn, 0x40);
maps.reg_table = sorted_map_create(int_cmp, free_fn, 0x20);
maps.operand_tbl = sorted_map_create(int_cmp, free_fn, arch_operand_count(mod));
unsigned func_count = get_function_count(mod); // sub_12AE300
ctx->per_func_resources = arena_alloc(func_count * 48);
memset(ctx->per_func_resources, 0, func_count * 48);
// ---- Phase 11: Traverse module IR, register callbacks ----
list_foreach(mod->functions.list, register_function_cb, ctx); // sub_1102AC0
foreach_symbol(mod->symbol_table, register_symbol_cb, ctx); // sub_1101E90
list_foreach(mod->functions.list, dispatch_function_ir, &maps); // sub_1111DB0
if (!ctx->compile_only)
list_foreach(mod->functions.globals, register_global_cb, &maps);
list_foreach(mod->functions.sections, register_section_cb, &maps);
foreach_symbol(mod->symbol_table, postprocess_symbol_cb, &maps);
// ---- Phase 12: Address width + register budget ----
int addr_width = determine_address_width(ctx, mod);
ctx->addr_width = addr_width; // 32 or 64
validate_maxrregcount(ctx, mod);
// ---- Phase 13: Collect entry points ----
FuncList *entries;
if (ctx->explicit_entries) {
entries = resolve_entries(ctx->explicit_entries, ctx->module_reader);
} else if (ctx->explicit_globals) {
entries = resolve_entries(ctx->explicit_globals, ctx->module_reader);
} else {
entries = mod->entry_list; // default: module's own entry list
}
// ---- Phase 14: Transfer state into codegen context ----
ctx->maps = maps;
ctx->func_count = list_count(entries);
ctx->alias_map = sorted_map_create(cmp, free, 0x100);
ctx->callee_map = sorted_map_create(cmp, free, 0x418);
// ... copy ~30 more fields ...
// ---- Phase 15: Call init_callback to prepare codegen descriptors ----
FuncList *compile_list = begin_fn(ctx, entries, ctx->per_func_resources);
// ---- Phase 16: Load/store cache mode per function ----
for (FuncNode *f = compile_list; f; f = f->next) {
FuncDesc *desc = f->data;
int func_idx = desc->func_ir->header->index;
    bool needs_caching = ctx->per_func_resources[func_idx].has_cached_callees;
if (sm_dispatch->supports_caching(sm_class)) {
if (force_load_cache)
desc->cache_mode = 2; // force all loads cached
else if (def_load_cache)
desc->cache_mode = 1; // default cached
else if (needs_caching || force_store_cache)
desc->cache_mode = 2;
else
desc->cache_mode = (def_store_cache != 0);
} else {
desc->cache_mode = 2 * (force_load_cache || def_load_cache);
}
}
// ---- Phase 17: Validate indirect calls + MMA .f64 ----
for (FuncNode *f = compile_list; f; f = f->next) {
FuncDesc *desc = f->data;
if (desc->mma_info && desc->mma_info->has_f64) {
warn_once(ERR_MMA_F64, desc->name, "mma with .f64 type");
}
if (desc->has_mutual_recursion)
fatal_error(ERR_MUTUAL_RECURSION, desc->entry_name);
}
// ---- Phase 18: Assign scheduling class ----
for (FuncNode *f = compile_list; f; f = f->next) {
FuncDesc *desc = f->data;
if (!desc->needs_scheduling) {
desc->sched_class = 0;
} else {
// analyze call graph for callee scheduling requirements
int callee_sched_count = count_callees_with_sched_class_2(desc);
if (ctx->force_aggressive_sched && callee_sched_count > 0) {
desc->sched_class = 2;
} else {
desc->sched_class = sm_dispatch->determine_sched_class(desc);
}
}
}
// ---- Phase 19: Debug info + Phase 20: Reserved regs ----
if (ctx->device_debug)
dwarf_init(ctx->dwarf_ctx, mod);
int reserved_rreg = -1;
if (ctx->has_reserved_rreg) {
if (!ctx->first_reserved_rreg)
ctx->first_reserved_rreg = 4; // minimum
reserved_rreg = ctx->first_reserved_rreg + ctx->reserved_count;
}
// ---- Phase 21: Build codegen config struct ----
CodegenConfig cfg;
cfg.module = mod;
cfg.module_reader = ctx->module_reader;
cfg.device_debug = ctx->device_debug;
cfg.lineinfo = ctx->lineinfo;
cfg.opt_level = ctx->opt_level;
// ... pack ~50 flags from ctx into cfg ...
CodegenPipeline *pipeline = create_codegen_pipeline(cfg); // sub_16257C0
// ---- Phase 22: Create output file if needed ----
if (ctx->output_path && (ctx->dump_sass || ctx->dump_ptx))
fopen_and_truncate(ctx->output_path, "wt");
// ---- Phase 23: Per-function compilation ----
unsigned total_funcs = list_count(compile_list);
ctx->result_array = arena_alloc(total_funcs * 112);
memset(ctx->result_array, 0, total_funcs * 112);
if (ctx->thread_count == 0) {
// ---- Phase 23a: SEQUENTIAL per-function compilation ----
for (FuncNode *f = compile_list; f; f = f->next) {
FuncDesc *desc = f->data;
unsigned idx = desc->func_index;
char *name = desc->func_ir->name;
ctx->result_array[idx].name = name;
ctx->result_array[idx].start_time = timer_read(ctx->timer);
// Allocate 360-byte per-function codegen state
PerFuncState *pfs = arena_alloc_zeroed(360);
// Phase 23a-i: Initialize codegen context for this function
codegen_init(ctx, mod, desc, pipeline, pfs); // sub_110AA30:
// - create OCG (Optimizing Code Generator) context
// - set producer = "NVIDIA", tool = "ptxocg.0.0"
// - configure SM-specific codegen flags
// - set optimization level, maxrregcount, address width
// - initialize instruction vtable from SM dispatch table
// - set up DWARF state if debug enabled
// - call vtable->init to resolve symbol names
// Phase 23a-ii: Invoke the compilation pipeline
// sub_1655A60 (called from within codegen_init flow):
// - initialize 48 pass-enable flags (passes 0..47)
// - register IR lowering callbacks
// - set up UDT/UFT relocations for Blackwell+
// - for each pass in sequence:
// pass 1: IR canonicalization
// pass 2: instruction count estimation
    //       pass 3-20: SM-gated optimization passes
// (each enabled/disabled by SM dispatch vtable)
// pass 21: address width query
// pass 22: register class initialization
// pass 23-38: ISel, regalloc, scheduling (SM-gated)
// pass 39: ABI frame setup
// pass 40-42: final lowering
// pass 43: peephole cleanup
// pass 46: binary encoding query
// pass 47: final verification
// - emit SASS instructions via ISel mega-hub
// - perform register allocation (graph coloring)
// - schedule instructions (ScheduleInstructions, 85 KB)
// - run peephole optimizations
// - encode final SASS binary (128-bit words)
// Phase 23a-iii: Error-wrapped compile invocation
codegen_compile(ctx, desc, pfs); // sub_1102B30:
// - setjmp for error recovery
// - call vtable->compile(ctx, func_ir, ctx->codegen_config)
// - on error: longjmp, free resources, report failure
// Phase 23a-iv: Record timing
timing_record(ctx, desc, pfs);
// Phase 23a-v: Finalize and emit
codegen_finalize(ctx, mod, desc, pfs); // sub_110D2A0:
// - emit ELF section content (.text, .nv.info, .nv.constant)
// - write register usage records (EIATTR)
// - write SASS binary to output
// - handle PTX re-emission if --output-ptx
// - cleanup per-function OCG state
arena_free(pfs);
// cancellation check between functions
if (ctx->cancel_flag && cancel_callback() == 1)
longjmp_to_error_handler();
}
} else {
// ---- Phase 23b: PARALLEL per-function compilation ----
ThreadPool *pool = create_thread_pool(ctx->thread_count); // sub_43FDB0
IndexArray *sync = index_array_create(total_funcs);
for (FuncNode *f = compile_list; f; f = f->next) {
FuncDesc *desc = f->data;
// Allocate extended per-function state (360 bytes + 3 maps)
PerFuncState *pfs = arena_alloc_zeroed(360);
pfs->local_map_a = sorted_map_create(cmp, free, 8);
pfs->local_map_b = sorted_map_create(cmp, free, 8);
pfs->local_map_c = sorted_map_create(cmp, free, 8);
pfs->pipeline = pipeline;
pfs->driver_ctx = ctx;
pfs->module = mod;
pfs->func_desc = desc;
// Initialize codegen (same as sequential)
codegen_init(ctx, mod, desc, pipeline, pfs);
// Copy 15 x 16-byte blocks of driver state into pfs
// (compiler flags, maps, timing state)
memcpy(&pfs->snapshot, &ctx->compilation_state, 15 * 16);
// Allocate per-function DWARF state (216 bytes)
pfs->dwarf_state = arena_alloc_zeroed(216);
dwarf_register(ctx->dwarf_ctx, pfs->dwarf_state);
// Snapshot pipeline state
pipeline_snapshot(pipeline, &pfs->pipeline_state);
// Copy function-local maps from shared -> per-thread
copy_maps(pfs->local_map_a, ctx->shared_maps);
// Enqueue work item
index_array_push(pfs, sync);
}
// Wait for all per-function compilations to complete
// Each thread runs sub_1107420:
// 1. sub_1102B30 -- setjmp + vtable->compile()
// 2. record timing (wall-clock + per-function)
// 3. record peak memory usage
// 4. free per-function OCG state
for_each(compile_list, sync, dispatch_work_item, pool);
thread_pool_barrier(pool); // sub_43FFE0
thread_pool_destroy(pool); // sub_43FE70
// Single-threaded finalization pass
report_function_index(ctx, -1); // sub_1107720
if (ctx->cancel_flag && cancel_callback() == 1)
cleanup_and_longjmp();
// ---- Sequential post-compilation merge ----
    bool first = true;
    unsigned shared_register_budget = 0;
    for (FuncNode *f = compile_list; f; f = f->next) {
      FuncDesc *desc = f->data;
      PerFuncState *pfs = sync_get(sync, desc->func_index);
// Restore driver state from per-function snapshot
memcpy(&ctx->compilation_state, &pfs->snapshot, 15 * 16);
// Replay pipeline state
pipeline_restore(pipeline, pfs->pipeline_state);
// Merge per-thread maps back into shared maps
merge_maps(ctx->shared_maps, pfs->local_map_a);
merge_maps(ctx->shared_maps, pfs->local_map_b);
merge_maps(ctx->shared_maps, pfs->local_map_c);
// Restore DWARF state
dwarf_merge(ctx->dwarf_ctx, pfs->dwarf_state);
// Finalize this function
codegen_finalize(ctx, mod, desc, pfs);
// Track first-function state for register budget
if (first) {
shared_register_budget = ctx->register_budget;
first = false;
}
ctx->register_budget = shared_register_budget;
// Cleanup
cleanup_per_func_maps(pfs);
arena_free(pfs);
}
index_array_destroy(sync);
}
// ---- Phase 24: Post-compilation validation ----
pipeline_finalize(pipeline); // sub_1626480
if (ctx->compile_only && ctx->register_budget_map) {
// Cross-validate register usage: each callee must not exceed
// the register budget of its caller
for (int i = 0; i < func_count; i++) {
ResourceEntry *re = &ctx->per_func_resources[i];
if (re->entry_name && re->callee_list) {
unsigned caller_budget = map_lookup(ctx->register_budget_map,
re->entry_name);
for (FuncNode *c = re->callee_list; c; c = c->next) {
unsigned callee_regs = map_lookup(ctx->alias_map, c->name);
          if (callee_regs > caller_budget)
warn("register budget exceeded: %s uses %d, caller %s allows %d",
c->name, callee_regs, re->entry_name, caller_budget);
}
}
}
map_destroy(ctx->register_budget_map);
}
// ---- Phase 25-26: Cleanup ----
hashmap_destroy(maps.bb_map);
hashmap_destroy(maps.sym_map);
arena_free(ctx->per_func_resources);
if (ctx->tensor_check_map)
set_destroy(ctx->tensor_check_map);
return 0;
}
Key Subroutine Reference
| Address | Name (reconstructed) | Role in Pipeline |
|---|---|---|
| sub_1112F30 | ptxas_compile_module | Top-level per-module driver (this function) |
| sub_110AA30 | codegen_init | Per-function OCG context creation + field setup |
| sub_1655A60 | codegen_per_func | 48-pass codegen pipeline (ISel, regalloc, sched, encode) |
| sub_1102B30 | codegen_compile | Error-wrapped compilation (setjmp + vtable dispatch) |
| sub_110D2A0 | codegen_finalize | ELF emission, nv.info, SASS output, cleanup |
| sub_1107420 | thread_worker | Thread pool work item: compile + timing + cleanup |
| sub_110CD20 | compile_only_init | Init for --compile-only / --compile-as-tools-patch |
| sub_110D0B0 | standard_init | Init for normal LTO compilation |
| sub_110D110 | ewp_init | Init for --extensible-whole-program mode |
| sub_110CBA0 | standard_init_ewp | Init for standard mode with EWP flag |
| sub_11089E0 | compile_only_begin | Begin callback for compile-only modes |
| sub_1107F10 | ewp_begin | Begin callback for EWP mode |
| sub_1109180 | standard_begin | Begin callback for standard compilation |
| sub_110BC90 | alloc_codegen_record | Allocate per-function codegen descriptor |
| sub_16257C0 | create_codegen_pipeline | Build the codegen pipeline configuration object |
| sub_1626480 | pipeline_finalize | Tear down the codegen pipeline |
| sub_43FDB0 | create_thread_pool | Create split-compilation thread pool |
| sub_43FF50 | enqueue_work_item | Submit per-function work to thread pool |
| sub_43FFE0 | thread_pool_barrier | Wait for all enqueued work to complete |
| sub_43FE70 | thread_pool_destroy | Destroy thread pool |
| sub_12AE300 | get_function_count | Return number of functions in module |
| sub_12AF550 | ptx_emit_entry_inline | Emit PTX entry point to in-memory buffer |
| sub_12AF200 | ptx_parse_file | Parse a PTX file into module representation |
| sub_12B30E0 | ptx_version_table_init | Initialize PTX version compatibility tables |
| sub_12A8360 | ptx_version_compatible | Check PTX/SM version compatibility |
| sub_15C3DD0 | sm_name_to_ordinal | Convert SM target string to ordinal index |
| sub_1672520 | dwarf_init | Initialize DWARF debug info context |
Compilation Mode Matrix
The mode-flag dispatch in Phase 6 selects one of four distinct init/begin callback pairs (five rows below; the EWP + debug row falls back to the compile-only pair). The choice is determined by command-line flags forwarded into the embedded compiler:
| Mode | Condition | init_fn | begin_fn | Behavior |
|---|---|---|---|---|
| Compile-only | --compile-only or --compile-as-tools-patch or --assyscall | sub_110CD20 | sub_11089E0 | Compile all functions independently. No cross-function optimization. Used for tools patches (Nsight Compute, Nsight Systems) |
| EWP (no debug) | --extensible-whole-program and NOT --device-debug | sub_110D110 | sub_1107F10 | Whole-program optimization. Functions compiled with global visibility into callee register usage. Enables aggressive inlining decisions |
| EWP + debug | --extensible-whole-program AND --device-debug | sub_110CD20 | sub_11089E0 | Falls back to compile-only pathway because whole-program optimization conflicts with debug info fidelity |
| Standard | Normal LTO compilation (default) | sub_110D0B0 | sub_1109180 | Standard per-function compilation with cross-function register budget propagation. Used for typical nvlink LTO |
| Standard + EWP flag | Standard with --extensible-whole-program hint | sub_110CBA0 | sub_1109180 | Same as standard but with EWP-aware init (reserves additional register space for potential future extensibility) |
The 48-Pass Codegen Pipeline (sub_1655A60)
The per-function codegen entry point sub_1655A60 runs a 48-pass pipeline (passes numbered 0--47). Each pass is enabled or disabled by querying the SM dispatch vtable at a1[3757] (the architecture-specific callback table registered by sub_15C0CE0). The pipeline initializes 48 boolean flags in a1[160..207] and then iterates:
Pass 0: Zero (placeholder)
Pass 1: Initial IR canonicalization
Pass 2: Instruction count estimation (query vtable+120)
Pass 3-20: SM-gated optimization passes
Each pass queries vtable+72 for SM capability.
If SM supports the pass, the pass enable flag is set to 1.
If not supported, the flag remains 0.
Pass 21: address-width-dependent setup
Pass 22: register class initialization
Pass 23-38: Core backend passes (ISel, register allocation, scheduling)
These are universally enabled for all SM >= sm_50.
Passes 23-38 include:
- Instruction selection (ISel mega-hub dispatch)
- Register allocation (graph coloring + spilling)
- Instruction scheduling (ScheduleInstructions)
- Peephole optimization
- SASS encoding
Pass 39: Initial ABI frame setup
Pass 40-42: Final lowering passes
Pass 43: Peephole cleanup
Pass 44-45: Reserved
Pass 46: Binary encoding query (vtable+488)
Pass 47: Final verification + pass-count teardown
After the pass loop, sub_1655A60 registers additional IR lowering callbacks (sub_161F1C0, sub_161F800, sub_1620460) on the function's basic block list, sets up UDT/UFT relocations for Blackwell+ (SM > 26 in ordinal = sm_100+), and processes the function's call graph for register pressure analysis.
Sequential vs. Parallel Compilation
The compilation driver supports two execution models, selected by ctx->thread_count (offset a1+668):
Sequential mode (thread_count == 0): Each function is compiled in the main thread with a simple loop: codegen_init -> codegen_compile (error-wrapped) -> codegen_finalize. Timing is recorded between stages. The cancel callback is checked between functions.
Parallel mode (thread_count > 0): A thread pool is created via sub_43FDB0. For each function, a work item containing the full per-function state snapshot (360 bytes + 3 local hash maps + pipeline snapshot) is allocated and enqueued. Each thread executes sub_1107420, which calls sub_1102B30 (the error-wrapped compile) and records timing. After the barrier wait (sub_43FFE0), the main thread performs a sequential merge pass: it restores each function's snapshot, merges local maps back into shared maps, and calls codegen_finalize. This is the same --split-compile-extended mechanism available in standalone ptxas.
The thread pool implementation uses a producer-consumer work queue. sub_43FF50 enqueues items, and worker threads dequeue and execute them. The barrier at sub_43FFE0 blocks until all items complete. The pool is destroyed by sub_43FE70. If an optional synchronization state (qword_2A64430) is non-null, each worker checks for compilation errors after completing its work item via sub_1D1E060 / sub_1D1E300.
Cross-References
nvlink Internal
- IR Nodes -- IR node structure and universal accessor functions
- ISel Hubs -- the five mega-hub instruction selector dispatch functions
- Peephole -- peephole optimization passes (ORI, scheduling-phase, linker-level)
- PTX Parsing -- the embedded PTX assembler frontend
- Register Allocation -- graph-coloring register allocator with spilling
- Scheduling -- pre-RA and tepid (post-RA) instruction schedulers
- Architecture Dispatch -- per-SM vtable dispatch system
- Mercury Overview -- Mercury ISA encoding pipeline
- FNLZR -- post-link binary rewriter for Mercury targets
- LTO Overview -- how the LTO pipeline invokes the embedded compiler
Sibling Wikis
- ptxas: Pipeline Overview -- standalone ptxas 159-phase compilation pipeline
- ptxas: Entry Point -- standalone ptxas main() and option parsing
- ptxas: Optimizer -- standalone ptxas optimization passes
- ptxas: Codegen Overview -- standalone ptxas code generation