Embedded ptxas: Architecture Overview

The single most important structural fact about nvlink v13.0.88 is that approximately 95% of its 25.2 MB .text section is not linker code -- it is a complete, statically embedded copy of the ptxas assembler/compiler backend. The actual device linker (ELF merge, symbol resolution, relocation, layout, output) occupies roughly 1.2 MB in the address range 0x400000--0x530000. Everything from 0x530000 through the end of .text at 0x1D32172 (~24 MB, ~38,000 functions) is the ptxas compiler backend: IR primitives, instruction selection, register allocation, instruction scheduling, SASS binary encoding, PTX parsing, and ELF/cubin output generation.

This page documents the evidence for this claim, the complete address map of the embedded ptxas subsystems, the five mega-hub instruction selector dispatch functions, and the ROT13 obfuscation applied to SASS mnemonics.

Evidence for Embedded ptxas

The embedded compiler is not a stripped-down stub -- it is a full-featured PTX-to-SASS compilation pipeline identical in capability to the standalone ptxas binary shipped in the CUDA toolkit. Key evidence:

  1. Named memory pools. The linker creates "nvlink option parser" and "nvlink memory space" arenas at startup. The embedded compiler creates its own arenas with ptxas-specific names. Memory pool diagnostics at 0x1AEE070 report pool usage statistics (total, freeable, leaked) for the compiler's internal allocations.

  2. Full option parser. sub_1103030 (29,803 bytes) registers the complete ptxas command-line option set via sub_42F130: --maxrregcount, --opt-level, --gpu-name, --device-debug, --fast-compile, --register-usage-level, --compile-only, --minnctapersm, --warn-spills, --lineinfo, --sp-bounds-check, --device-stack-protector, --sanitize, --position-independent-code, and approximately 50 more. These are forwarded from nvlink's LTO pipeline into the embedded compiler.

  3. Full compilation pipeline. sub_1112F30 (65,018 bytes) at 0x1112F30 is the top-level per-module compilation driver. It writes PTX headers (.version, .target, .entry __cuda_dummy_entry__ { ret; }), selects codegen callbacks based on mode flags (--compile-as-tools-patch, --extensible-whole-program, --compile-only), validates SM version compatibility, and dispatches to per-function codegen initialization.

  4. Multi-architecture support. sub_15C0CE0 (14,517 bytes) initializes 7 dispatch hash maps covering sm_75, sm_80, sm_86, sm_87, sm_88, sm_89, sm_90/90a, sm_100/100a/100f, sm_103/103a/103f, sm_110/110a/110f, sm_120/120a/120f, and sm_121/121a/121f. Each architecture gets 7 registered callbacks (nv.info emitter, resource usage table, instruction encoding table, compute capability array, perf-stats handler, cpf_optx handler, codegen options).

  5. Register allocation and instruction scheduling. The range 0x1850000--0x1A00000 contains the full backend compiler core: ScheduleInstructions (85 KB), ScheduleInstructionsReduceReg, DynBatch, HoistInvariants, ConvertMemoryToRegister, spilling regalloc, SMEM spilling, multi-class register allocation (R-regs, UR-regs, predicates), setmaxnreg CTA-reconfig for Blackwell+, and codegen verification passes.

  6. ISel mega-hubs. Five functions exceed 160 KB each. These are the top-level instruction selector dispatch functions, too large for Hex-Rays to decompile. Each calls hundreds of pattern matchers, selects the highest-priority match, and dispatches to the corresponding emitter.

Relationship to Standalone ptxas

The standalone ptxas binary in the CUDA toolkit and the compiler backend embedded in nvlink share the same codebase. They differ in how they are invoked:

  • Standalone ptxas: Invoked as a separate process by nvcc. Reads .ptx files from disk, writes .cubin files.
  • Embedded ptxas in nvlink: Invoked in-process during LTO (-lto) and PTX JIT compilation. The entry point is sub_4BD760 (called from main() when a PTX input file is detected) or sub_4BC6F0 (called for LTO IR compilation after libnvvm produces PTX output). Options are forwarded programmatically rather than via argc/argv.

The embedded copy supports thread-pool parallelism for split compilation (sub_43FDB0 creates the pool, sub_4264B0 dispatches per-function work items). This is the same --split-compile-extended feature available in standalone ptxas.
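A minimal Python model of this split-compile dispatch (the worker logic and all names here are illustrative stand-ins; in the binary the pool is created by sub_43FDB0 and a separate barrier call waits for completion):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_function(name):
    # Stand-in for the per-function worker: compile one function,
    # return its result. The real worker runs full codegen under setjmp.
    return (name, f"SASS[{name}]")

def split_compile(functions, thread_count):
    # thread_count == 0 takes the sequential path, mirroring the
    # sequential/parallel fork described later in the pipeline diagram.
    if thread_count == 0:
        return [compile_function(f) for f in functions]
    # Pool create -> enqueue one work item per function -> barrier wait.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(compile_function, functions))

results = split_compile(["kernel_a", "kernel_b", "kernel_c"], thread_count=2)
```

`pool.map` preserves input order, so parallel and sequential runs produce identical result lists.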

Embedded ptxas Address Map

The following table maps the full address range of the embedded ptxas backend. All addresses are within the .text section of nvlink v13.0.88.

IR Primitives (0x530000 -- 0x620000, ~960 KB)

Range | Size | Subsystem | Functions | Key Finding
0x530E80--0x530FD0 | <1 KB | IR node accessors | 22 | sub_530FB0 has 31,399 callers -- universal getOperand(idx)
0x530FE0--0x5B1AB0 | 523 KB | ISel pattern matchers (SM50-7x) | 1,293 | 152 target opcodes, 36 priority levels
0x5B1D80--0x5E4470 | 204 KB | MercExpand mega-hub | 1 | MercExpand dispatch + CFG analysis (too large for Hex-Rays)
0x5E4470--0x600260 | 114 KB | MercExpand engine | ~50 | Bitvector ops, FNV-1a hash maps, register constraint propagation
0x603F60--0x61FA60 | 112 KB | SM50 instruction encoders | 79 | Per-instruction binary encoding functions

The IR node structure is accessed through 22 leaf functions that constitute the most-called code in the entire binary. sub_530FB0 (get operand by index), with 31,399 callers, and sub_A49150 (get instruction attribute), with 30,768 callers, form the universal accessor layer. The IR node layout:

Offset  Size   Field
  0     1B     operand type tag (1=immediate, 2=register, 6=memref, ...)
  4     4B     register class / encoding field (1023 = wildcard "any")
 14     1B     flag A
 15     1B     flag B
 20     4B     data type / secondary encoding
 28     2B     IR opcode
 32     8B     pointer to operand array (each operand = 32 bytes)
 40     4B     total operand count
 92     4B     first source operand index

Number of source operands = *(off+40) + 1 - *(off+92). Number of destination operands = *(off+92).
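Assuming a little-endian layout and the offsets in the table above, the node can be decoded with a short Python sketch (the dict keys are descriptive names of my own, not symbols recovered from the binary):

```python
import struct

def parse_ir_node(buf):
    # Field offsets come from the layout table above; this is a reading
    # aid for the structure, not the binary's actual accessor code.
    node = {
        "type_tag":      buf[0],                                # 1=imm, 2=reg, 6=memref
        "reg_class":     struct.unpack_from("<I", buf, 4)[0],   # 1023 = wildcard "any"
        "data_type":     struct.unpack_from("<I", buf, 20)[0],
        "opcode":        struct.unpack_from("<H", buf, 28)[0],
        "operand_ptr":   struct.unpack_from("<Q", buf, 32)[0],  # 32 bytes per operand
        "operand_count": struct.unpack_from("<I", buf, 40)[0],
        "first_src_idx": struct.unpack_from("<I", buf, 92)[0],
    }
    # Operand-count arithmetic exactly as stated in the text.
    node["num_src"] = node["operand_count"] + 1 - node["first_src_idx"]
    node["num_dst"] = node["first_src_idx"]
    return node
```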

ISA Encoding Tables (0x620000 -- 0xA70000, ~4.3 MB)

This is the largest contiguous subsystem -- 4.3 MB of template-instantiated functions defining the complete NVIDIA GPU instruction set encoding and metadata.

Range | Size | Subsystem | Functions | Key Finding
0x620000--0x84DD70 | 2.2 MB | SM100+ SASS encoders | 1,537 | 128-bit instruction encoders for Blackwell ISA
0x84DD70--0xA48290 | 1.7 MB | InstrDesc init table | 1,613 | Instruction descriptor initializers (operand types, latencies)
0xA49010--0xA4AB10 | 4 KB | NVInst accessors | ~30 | IR instruction class hierarchy
0xA4AB10 | 11 KB | NVInst constructor | 1 | Allocates and initializes instruction IR node
0xA4B5E0--0xA4C7C0 | 5 KB | FNV-1a hash tables | 4 | Instruction lookup by hash
0xA5B6B0 | 180 KB | setOperandField dispatch | 1 | Giant switch: sets operand fields by opcode class
0xA62220 | 65 KB | setOperandImm dispatch | 1 | Giant switch: sets immediate operand values
0xA65900 | 67 KB | getOperandField dispatch | 1 | Giant switch: reads operand fields
0xA67910 | 141 KB | getDefaultOperandValue | 1 | Giant switch: returns default operand values per opcode

The 1,537 SM100+ encoders each translate one instruction variant into a 128-bit SASS instruction word via the core primitive sub_4C28B0(buf, bit_offset, width, value). Opcode breakdown: major=1 (ALU/Scalar) 37.2%, major=2 (Vector/Memory/Control) 62.7%, major=3 (Special) 0.1%, across 118 instruction families.
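The bit-packing primitive behind these encoders, sub_4C28B0(buf, bit_offset, width, value), can be modeled in a few lines of Python. A 128-bit integer stands in for the instruction word (the real code writes into a byte buffer), and the field positions in the usage lines are invented for illustration:

```python
def set_bitfield(word, bit_offset, width, value):
    # Sketch of the sub_4C28B0 primitive: insert `value` into the
    # instruction word at bits [bit_offset, bit_offset + width).
    mask = (1 << width) - 1
    assert value <= mask, "value does not fit in the field"
    return (word & ~(mask << bit_offset)) | ((value & mask) << bit_offset)

word = 0
word = set_bitfield(word, 0, 12, 0x24A)   # e.g. opcode field (illustrative)
word = set_bitfield(word, 16, 8, 0x05)    # e.g. destination register (illustrative)
```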

The 1,613 InstrDesc initializers populate per-instruction metadata: operand count, operand types/constraints, scheduling hints, latency estimates, and execution unit assignments. Combined, the encoder + descriptor tables define the complete NVIDIA GPU ISA from SM50 through SM121.

Instruction Codecs (0xA70000 -- 0xCA0000, ~2.2 MB)

Multi-architecture instruction encoding and decoding, organized per-SM.

Range | Size | Subsystem | Functions | Key Finding
0xA709F0 | 54 KB | Field offset query | 1 | 6,491-line switch: (opcode_class, field_id) -> bit_offset
0xA7DE70 | 50 KB | Field presence query | 1 | Mirror: returns hasField boolean
0xA87CE0--0xB25D50 | 630 KB | SM90/100 encoders | ~164 | Per-opcode binary instruction encoders
0xACECF0--0xB77B60 | 700 KB | SM90/100 decoders | ~139 | Binary-to-IR instruction decoders
0xB9FDE0--0xBC2CC0 | 142 KB | SM7x (Volta/Turing) codecs | ~60 | Encoders + decoders for SM70/SM75
0xBC3FC0--0xBFEC10 | 236 KB | SM75 extended codecs | ~80 | Turing-specific instruction variants
0xC00070--0xC2FB60 | 193 KB | SM80 (Ampere) codecs | ~70 | Ampere instruction encoders
0xC3D540--0xC50970 | 83 KB | SM80 decoders | ~15 | HMMA tensor core, SHF, memory decoders
0xC7EC90--0xC9EE60 | 131 KB | SM86/89 (Ada) codecs | ~40 | GA10x / AD10x encoders + decoders

Each encoder packs IR operands into a 128-bit SASS instruction word at *(a1+40). Each decoder unpacks a 128-bit word back into IR form. The sentinel value 1023 (register field) maps to RZ (zero register), and 31 (predicate field) maps to PT (true predicate). Architecture-specific encoder variants are differentiated by the helper functions they call: sub_A5A000 (SM70 Volta), sub_A5AB30 (SM75 Turing), sub_A59D80 (SM80 Ampere), etc.
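The sentinel mapping can be expressed directly (a trivial sketch; RZ and PT are the standard SASS spellings for the zero register and the always-true predicate):

```python
def decode_register_field(value):
    # 1023 in a register field denotes RZ, the zero register.
    return "RZ" if value == 1023 else f"R{value}"

def decode_predicate_field(value):
    # 31 in a predicate field denotes PT, the always-true predicate.
    return "PT" if value == 31 else f"P{value}"
```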

Per-Arch ISel Backends

Instruction selection is implemented as a linear scan: for each IR instruction, every pattern matcher is called in sequence, and the match with the highest priority wins. Each backend has its own set of pattern matchers, a mega-hub dispatch function (too large for Hex-Rays), and its own instruction emitters.

SM80 (Ampere) ISel Backend (0xCA0000 -- 0xDA0000, ~1 MB)

Range | Size | Subsystem | Functions
0xCA0000--0xCDC000 | 240 KB | Operand emission + packing | 137
0xCDD5F0--0xCDD690 | <1 KB | Operand predicates | 15
0xCE2000--0xD5FD70 | 510 KB | ISel pattern matchers | 259
0xD5FD70 | 239 KB | SM80 ISel mega-hub | 1
0xD9A400--0xDA0000 | 23 KB | Binary encoding | 17

Three-phase pipeline: (1) pattern match on IR attributes/operand types, (2) operand emission into instruction descriptor, (3) binary encoding into 128-bit SASS word.

SM100+ (Blackwell) SASS Codec -- Second Table (0xDA0000 -- 0xF16000, ~1.5 MB)

Range | Size | Subsystem | Functions
0xDA0310--0xE436D0 | 669 KB | Blackwell encoders | 438
0xE43C20 | 1 KB | Encoder dispatch | 1
0xE43DC0--0xF15A50 | 847 KB | Blackwell decoders | 648
0xEFE6C0 | 1 KB | Decoder dispatch | 1

Format 1 instructions: 147. Format 2 (extended with modifiers): 290. Format 3 (special wide): 1. Every encoder has a mirror decoder; decoders outnumber encoders because they also handle architecture-variant forms.

SM75 (Turing) ISel Backend (0xF16000 -- 0x100C000, ~984 KB)

Range | Size | Subsystem | Functions
0xF16030--0xF160F0 | <1 KB | Operand predicates | 15
0xF10080--0xF15A50 | 22 KB | Instruction emitters | 18
0xF16150--0xFBB780 | 678 KB | ISel pattern matchers | 276
0xFBB810 | 280 KB | SM75 ISel mega-hub | 1
0xFFFDF0--0x100BBF0 | 48 KB | Post-ISel emit+encode | 38

This is the largest single-architecture ISel backend. sub_FBB810 at 280 KB is the largest function in the binary.

SM89/90 (Ada/Hopper) Backend (0x100C000 -- 0x11EA000, ~1.9 MB)

Range | Size | Subsystem | Functions
0x100C000--0x10FFFFF | 1.0 MB | Shared instruction encoders | ~750
0x1100000--0x1120000 | 128 KB | Backend driver | ~30
0x1104950 | 38 KB | ptxas option parser | 1
0x1112F30 | 65 KB | Compilation driver main | 1
0x1116890 | 60 KB | ELF output + metadata gen | 1
0x1120000--0x119BF40 | 496 KB | ISel pattern matchers | ~160
0x119BF40 | 231 KB | SM89/90 ISel mega-hub | 1
0x11D4680--0x11EA000 | 90 KB | Scheduler + emission | ~16

PTX Frontend (0x11EA000 -- 0x15C0000, ~3.5 MB)

The PTX frontend parses PTX assembly text, validates instructions against SM version constraints, and lowers them to the internal IR consumed by the per-architecture ISel backends.

Range | Size | Subsystem | Functions | Key Finding
0x11EA000--0x126C000 | 520 KB | ISel pattern-match predicates | ~160 | Shared across all SM targets
0x126CA30 | 239 KB | PTX ISel mega-hub | 1 | Shared PTX-level instruction selector
0x12A7000--0x12B0000 | 36 KB | PTX type system + operand builders | ~20 | Type constructors, operand IR building
0x12B0000--0x12BA000 | 40 KB | Special register name table | ~20 | %ntid, %laneid, %smid, %clock64, %ctaid, ...
0x12BA000--0x12D0000 | 88 KB | ISel lowering passes | ~30 | LTO-path instruction lowering
0x12D0000--0x12D5000 | 20 KB | DWARF debug line info gen | ~5 | Line table emission for LTO-compiled code
0x12D5000--0x1400000 | 1.2 MB | ISel pattern clones | ~500 | Parametric clones per SM (sm_5x through sm_10x)
0x1400000--0x1430000 | 192 KB | LTO pipeline + ELF emit | ~20 | Top-level LTO pipeline, MMA lowering
0x1430000--0x1442000 | 72 KB | PTX version/SM gates | ~30 | Version-gated instruction validators
0x1442000--0x146BEC0 | 156 KB | Instruction emission handlers | ~80 | Per-instruction PTX code generators
0x146BEC0 | 206 KB | ptx_load_store_validator | 1 | Memory operation validator with SM checks
0x147EF50 | 288 KB | ptx_instruction_semantic_analyzer | 1 | Master validator: all SM version requirements
0x1487650 | 240 KB | ptx_statement_processor | 1 | Top-level PTX statement handler
0x14932E0--0x15B86A0 | 700 KB | Instruction handlers + builtins | ~250 | Code-template generators for CUDA builtins
0x15B86A0 | 345 KB | cuda_builtin_prototype_generator | 1 | 608-case switch covering sm20 through sm10x builtins

The cuda_builtin_prototype_generator is the second-largest function in the binary at 345 KB. It maps builtin index numbers to PTX prototype strings of the form .weak .func (...) __cuda_smXX_foo (...). Function families include div, rem, rcp, sqrt, dsqrt, barrier, wmma, shfl, vote, matchsync, warpsync, reduxsync, sanitizer_memcheck, tcgen05, bulk_copy, and cp_async_bulk_tensor.
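As an illustration of the output shape only (the real index-to-name mapping lives inside the 608-case switch and is not reproduced here; the parameter strings below are invented), a hypothetical generator:

```python
def builtin_prototype(sm, name, ret="", params=""):
    # Shape of the emitted prototype strings as quoted above:
    #   .weak .func (<ret>) __cuda_smXX_<name> (<params>)
    # The sm/name/ret/params values in the usage line are hypothetical.
    return f".weak .func ({ret}) __cuda_sm{sm}_{name} ({params})"

proto = builtin_prototype(90, "sqrt", ".reg .f32 r", ".reg .f32 x")
```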

Compilation Pipeline (0x15C0000 -- 0x1A00000, ~4.2 MB)

This region contains the per-function compilation pipeline from SM dispatch through code generation to backend verification.

Range | Size | Subsystem | Functions | Key Finding
0x15C0CE0 | 15 KB | SM dispatch tables | 1 | 7 callback maps for sm_75 through sm_121
0x15C44D0--0x15CA450 | 348 KB | nv.info attribute emitters | ~10 | Per-SM EIATTR record generation (largest: 78 KB)
0x1610000--0x163FFFF | 192 KB | PTX compilation frontend | ~40 | Operand handling, control flow, symbol management
0x1640000--0x165FFFF | 128 KB | Codegen operand lowering | ~30 | Atom formatting, offset calculation
0x1660000--0x169FFFF | 256 KB | ISel/scheduling + DWARF | ~40 | Instruction scheduling, peephole, debug emission
0x16A0000--0x16DFFFF | 256 KB | OCG intrinsic lowering | ~80 | builtin_ocg_* handlers, tcmma/tensor operations
0x16E0000--0x16E3AB0 | 12 KB | tcgen05 intrinsic codegen | ~10 | SM100 tensor memory address setup, guardrails
0x16E4D60--0x16F6000 | 70 KB | PTX instruction builder | ~20 | Instruction construction, operand insert
0x16F6000--0x1740000 | 296 KB | Tepid instruction scheduler | ~50 | Full instruction scheduling pipeline
0x175D000--0x1768000 | 44 KB | Knobs/config infrastructure | ~15 | Runtime tuning parameters
0x1769000--0x1850000 | 924 KB | SASS opcode tables | ~150 | SM70-SM120 opcode encoding/emission with ROT13 mnemonics
0x1850000--0x186F000 | 124 KB | Instruction scheduling | ~15 | ScheduleInstructions (85 KB), ReduceReg, DynBatch, Cutlass-aware
0x1878000--0x189C000 | 144 KB | ConvertMemoryToRegister | ~20 | Shared-memory to register promotion
0x189C000--0x18FC000 | 384 KB | Register allocation | ~40 | Spilling, SMEM spilling, multi-class regalloc
0x18FC000--0x1920000 | 144 KB | setmaxnreg / CTA-reconfig | ~20 | Blackwell+ register budget negotiation
0x1916000--0x1960000 | 296 KB | mbarrier + ORI passes | ~30 | Copy propagation, dead-code elimination
0x1960000--0x19E0000 | 512 KB | Codegen verification | ~40 | Uninitialized register detection, remat verify
0x19A0000--0x1A00000 | 384 KB | Metrics + scheduling guidance | ~35 | Occupancy estimation, loop analysis, regalloc guidance

SASS Emission (0x1A00000 -- 0x1D32172, ~3.2 MB)

The final segment of .text handles SASS instruction lowering, ABI enforcement, ELF/cubin output, name demangling, and DWARF debug info.

Range | Size | Subsystem | Functions | Key Finding
0x1A009C0--0x1A0B180 | 6 KB | Bug injection framework | ~5 | Testing hooks for intentional bug injection
0x1A0B180--0x1A20000 | 84 KB | Instruction operand analysis | ~30 | Operand lowering, constant buffer encoding
0x1A1A000--0x1A2A000 | 64 KB | Warp sync / mbarrier | ~15 | %%mbarrier_%s_%s instruction generation
0x1A4B000--0x1A61090 | 88 KB | WGMMA pipeline analysis | ~20 | Warpgroup MMA live ranges, sync injection
0x1A61090--0x1A6A480 | 38 KB | Scoreboard management | ~10 | Instruction scheduling scoreboard
0x1A6A480--0x1AA2090 | 352 KB | ISel/lowering + encoding | ~80 | Instruction selection, SASS emission
0x1AA2090--0x1ABF000 | 124 KB | Regalloc + ABI | ~30 | Register allocation, ABI handling
0x1AEAA90--0x1AEE070 | 14 KB | Instruction vtable factory | ~10 | SASS instruction vtable construction
0x1AEE070--0x1B00000 | 70 KB | Memory pool diagnostics | ~10 | Pool tracking, encoding passes
0x1B00000--0x1B20000 | 128 KB | Register liveness | ~30 | Interference graph construction
0x1B19750--0x1B40000 | 160 KB | Machine scheduling + CFG | ~40 | Basic block management
0x1B40000--0x1B60000 | 128 KB | Dependency tracking | ~30 | Scoreboard / dependency graph
0x1B60000--0x1B9FFFF | 256 KB | ISel + lowering (tail) | ~200 | PTX-to-SASS ISel, tail-call optimization
0x1BA0000--0x1BFFFFF | 384 KB | ABI / calling convention | ~150 | Return address mgmt, convergent boundary, coroutine regs
0x1C00000--0x1CDFFFF | 896 KB | ELF section builder | ~120 | .nv.constant, .nv.shared, cubin/fatbin container
0x1CE0000--0x1CEDFFF | 56 KB | C++ name demangler | ~40 | Itanium ABI + MSVC demangler
0x1CF0000--0x1D32172 | 265 KB | DWARF + LEB128 + KNOBS | ~140 | Debug info generation, SSE-accelerated LEB128, config system
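The LEB128 encoding in the DWARF subsystem is the standard DWARF variable-length integer format. A scalar Python sketch (the binary's version is SSE-vectorized but byte-compatible):

```python
def uleb128_encode(value):
    # Unsigned LEB128: 7 payload bits per byte, low bits first;
    # bit 7 is the continuation flag.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert uleb128_encode(624485) == b"\xe5\x8e\x26"   # classic DWARF-spec example
```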

The Five Mega-Hub Functions

Five functions exceed 160 KB each. They are the top-level instruction selector dispatch functions for different SM architecture generations. Each contains a massive jump table that calls hundreds of ISel pattern matchers in sequence, selects the highest-priority match, then dispatches to the corresponding emitter. All five are too large for Hex-Rays to decompile.

Address | Size | Target | Description
sub_FBB810 | 280 KB | SM75 (Turing) | Calls 276+ pattern matchers. Largest function in the binary
sub_126CA30 | 239 KB | SM50-7x (shared) | Covers Maxwell/Pascal/Volta backends
sub_D5FD70 | 239 KB | SM80 (Ampere) | Calls 259 pattern matchers for Ampere-class GPUs
sub_119BF40 | 231 KB | SM89/90 (Ada/Hopper) | Calls ~160 pattern matchers
sub_5B1D80 | 204 KB | SM50-7x (MercExpand) | MercExpand instruction expansion dispatch

The ISel protocol is uniform across all backends:

for each pattern_matcher in pattern_table:
    matched = pattern_matcher(ctx, ir_node, &pattern_id, &priority)
    if matched && priority > best_priority:
        best_priority = priority
        best_id = pattern_id
emitter_table[best_id](ctx, ir_node)  // emit selected instruction

Each pattern matcher queries IR node attributes via sub_A49150, checks operand counts via sub_530FD0/sub_530FC0, retrieves operands via sub_530FB0, validates operand types and register classes, and writes (pattern_id, priority) if all constraints are satisfied.
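The same protocol as a runnable Python sketch; the matchers, priorities, and emitters below are invented stand-ins for the real pattern tables:

```python
def select_instruction(ir_node, pattern_table, emitter_table):
    # Linear scan: call every matcher, keep the highest-priority match,
    # then dispatch to the corresponding emitter.
    best_priority, best_id = -1, None
    for matcher in pattern_table:
        result = matcher(ir_node)          # None, or (pattern_id, priority)
        if result is not None:
            pattern_id, priority = result
            if priority > best_priority:
                best_priority, best_id = priority, pattern_id
    return emitter_table[best_id](ir_node)

# Invented example: a generic integer add, and a higher-priority fused
# multiply-add pattern that only fires when the add feeds a multiply.
matchers = [
    lambda n: (0, 10) if n["op"] == "add" else None,
    lambda n: (1, 36) if n["op"] == "add" and n.get("fused_mul") else None,
]
emitters = {0: lambda n: "IADD", 1: lambda n: "IMAD"}

sel = select_instruction({"op": "add", "fused_mul": True}, matchers, emitters)
```

The priority field is what lets a specialized pattern (IMAD) win over a generic one (IADD) without any ordering constraint on the table.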

ROT13 Obfuscation of SASS Mnemonics

NVIDIA applies ROT13 encoding to SASS instruction mnemonic strings stored in the binary. The decoder function sub_1A40AC0 uses SSE/SIMD vectorization for bulk ROT13 processing (loading 16 bytes at a time via _mm_load_si128). The SASS opcode table initializer at 0x1A85E40 stores all mnemonics in ROT13-encoded form; they are decoded at runtime before use.

Known decoded mnemonics:

ROT13 | Decoded | Instruction
VZNQ | IMAD | Integer multiply-add
SZHY | FMUL | Float multiply
SNQQ | FADD | Float add
SRAPR | FENCE | Memory fence
ZREPHEL | MERCURY | Blackwell codename prefix
PPGY | CCTL | Cache control
OFLAP | BSYNC | Barrier synchronization
ERZBIR | REMOVE | Instruction removal tag

The "MERCURY" prefix (ZREPHEL in ROT13) corresponds to sm_100+ (Blackwell) and appears throughout the compilation pipeline as a codename. ROT13 is also applied to some internal ELF section names: .sync_restrict::shared::read::mma::a is stored as its ROT13 equivalent, .acc::f16 as .npp::s16, and .sp::2to4 as .fc::2gb4.
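Because ROT13 is its own inverse, these mappings can be spot-checked with Python's built-in codec (the same call both encodes and decodes):

```python
import codecs

def rot13(s):
    # Letters rotate by 13 positions; digits and punctuation pass through,
    # which is why "f16" becomes "s16" but the digits survive.
    return codecs.encode(s, "rot13")

# Spot-check entries from the mnemonic and section-name lists above.
assert rot13("VZNQ") == "IMAD"
assert rot13("SZHY") == "FMUL"
assert rot13("ZREPHEL") == "MERCURY"
assert rot13(".acc::f16") == ".npp::s16"
assert rot13(".sp::2to4") == ".fc::2gb4"
```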

Size Summary

Subsystem | Address Range | Size | Functions | % of .text
Linker core (not ptxas) | 0x400000--0x530000 | 1.2 MB | ~600 | 5%
IR primitives + SM50-7x ISel | 0x530000--0x620000 | 960 KB | ~1,450 | 4%
ISA encoding tables | 0x620000--0xA70000 | 4.3 MB | ~3,150 (1,537 encoders + 1,613 descriptors) | 17%
Instruction codecs (multi-arch) | 0xA70000--0xCA0000 | 2.2 MB | ~700 | 9%
SM80 ISel backend | 0xCA0000--0xDA0000 | 1.0 MB | ~430 | 4%
SM100+ codec (second table) | 0xDA0000--0xF16000 | 1.5 MB | ~1,090 | 6%
SM75 ISel backend | 0xF16000--0x100C000 | 984 KB | ~350 | 4%
SM89/90 backend | 0x100C000--0x11EA000 | 1.9 MB | ~980 | 8%
PTX frontend | 0x11EA000--0x15C0000 | 3.5 MB | ~1,100 | 14%
Compilation pipeline | 0x15C0000--0x1A00000 | 4.2 MB | ~700 | 17%
SASS emission + ABI + ELF | 0x1A00000--0x1D32172 | 3.2 MB | ~1,300 | 13%
Total embedded ptxas | 0x530000--0x1D32172 | ~24 MB | ~38,000 | ~95%

Cross-Reference: Key Functions

Function | Size | Identity | Role
sub_530FB0 | <1 KB | IRNode_GetOperand | Universal operand accessor (31,399 callers)
sub_A49150 | <1 KB | IRInstr_GetAttribute | Universal attribute accessor (30,768 callers)
sub_4C28B0 | <1 KB | setBitfield | Core encoding primitive for all SASS encoders
sub_1112F30 | 65 KB | ptxas_main_compilation_driver | Top-level per-module compilation entry
sub_1103030 | 30 KB | ptxas_option_definition_table_builder | Full option parser (~60 options)
sub_1104950 | 38 KB | ptxas_command_option_parser | Option processing and validation
sub_15C0CE0 | 15 KB | init_sm_dispatch_tables | SM architecture callback registration
sub_1A40AC0 | <1 KB | rot13_string_decoder | SIMD-vectorized ROT13 decoder
sub_4BD760 | varies | ptxas_jit_compile | Entry point for PTX JIT compilation
sub_4BC6F0 | varies | compile_linked_lto_ir | Entry point for LTO compilation
sub_15B86A0 | 345 KB | cuda_builtin_prototype_generator | 608-case builtin switch (second-largest function)
sub_147EF50 | 288 KB | ptx_instruction_semantic_analyzer | Master instruction validator

Compilation Pipeline: sub_1112F30 Reconstructed

sub_1112F30 (65,018 bytes at 0x1112F30) is the top-level per-module compilation driver. It receives a module context a1 and a PTX module descriptor a2, then orchestrates the full PTX-to-SASS compilation in a sequence of clearly delineated phases. The following ASCII pipeline diagram and reconstructed pseudocode are derived from the decompiled output of this function.

Pipeline Diagram

 nvlink main / LTO pipeline
         |
         v
 +--------------------------+
 | sub_1112F30              |    per-module compilation driver (65 KB)
 | "ptxas_compile_module"   |
 +--------------------------+
         |
    Phase 1: Option Query & Cache Config
         |  query "def-load-cache", "force-load-cache",
         |  "def-store-cache", "force-store-cache"
         |  from option map at a1+904
         v
    Phase 2: Cancellation Check
         |  if a1+288 (cancel flag): call cancel_callback(a1+296)
         |  if returns 1 -> longjmp to error handler
         v
    Phase 3: Timing Gate
         |  check profiling flags a1+104..107, a1+402
         |  start wall-clock timer (sub_45CCD0)
         |  start high-res timer (sub_44EF30) if enabled
         v
    Phase 4: Callback Registration
         |  register per-instruction callback: sub_1108860 -> a1+408
         |  register per-function callback:    sub_1101EB0 -> a1+416
         |  initialize PTX version tables:     sub_12B30E0, sub_12B31D0
         v
    Phase 5: SM Version Validation
         |  sscanf .target string -> extract SM version number
         |  sscanf a1+576 (max supported SM) -> compare
         |  if module SM > max supported -> fatal error
         |  validate PTX version compatibility (sub_12A8360)
         v
    Phase 6: Mode Flag Dispatch (codegen callback selection)
         |  select (init_callback, begin_callback) pair:
         |
         |  if --compile-only OR --assyscall OR --compile-as-tools-patch:
         |      init = sub_110CD20   ("compile_only_init")
         |      begin = sub_11089E0  ("compile_only_begin")
         |
         |  elif --extensible-whole-program AND NOT --device-debug:
         |      init = sub_110D110   ("ewp_init")
         |      begin = sub_1107F10  ("ewp_begin")
         |
         |  elif --extensible-whole-program AND --device-debug:
         |      init = sub_110CD20
         |      begin = sub_11089E0
         |
         |  else (normal LTO / standard compilation):
         |      if --extensible-whole-program flag:
         |          init = sub_110CBA0  ("standard_init_ewp")
         |      else:
         |          init = sub_110D0B0  ("standard_init")
         |      begin = sub_1109180    ("standard_begin")
         v
    Phase 7: PTX Header Emission (dummy entry generation)
         |  if no explicit entry functions AND not tools-patch mode:
         |    if output-to-memory (sub_464740 returns true):
         |      sub_12AF550("__cuda_dummy_entry__", ptx_header, ...)
         |    else (output to file):
         |      fopen(output_path, "w")
         |      fprintf:  .version <ptx_version>
         |      fprintf:  .target  <sm_name>
         |      fprintf:  .entry __cuda_dummy_entry__ { ret; }
         |      fclose
         |      sub_12AF200(output_path, ...)
         v
    Phase 8: Tools-Patch Resource Allocation Warnings
         |  if --compile-as-tools-patch:
         |    warn if allocating: textures, surfaces, samplers, constants
         |  if --assyscall:
         |    warn if allocating: textures, surfaces, samplers
         v
    Phase 9: Compilation Flags Setup
         |  disable --fast-compile for ABI-less calls
         |  disable --extensible-whole-program for ABI-less compilation
         |  process --position-independent-code
         |  check texmode_independent vs texmode_unified
         |  check --preserve-relocs compatibility
         |  check --legacy-bar-warp-wide-behavior (SM70 only)
         |  check --g-tensor-memory-access-check (SM100+ only)
         v
    Phase 10: Hash Map + Codegen Context Allocation
         |  allocate 8 hash maps via sub_4489C0/sub_465020:
         |    [0] instruction map (cap 0x100)
         |    [1] function list (cap 0x400)
         |    [2] basic-block map (cap 0x100), conflict map (cap 0x40)
         |    [3] symbol map (cap 0x100), label map (cap 0x100)
         |    [4] directive map (cap 0x40), auxiliary map (cap 0x40)
         |    [5] register table (cap 0x20), operand table (arch-size)
         |  allocate per-function resource array:
         |    sub_12AE300(a2) entries x 48 bytes each -> a1+336
         |  allocate per-function compilation result array:
         |    count entries x 112 bytes each -> a1+256
         v
    Phase 11: Register Callbacks on Module IR
         |  sub_1102AC0 -> per-function callback on module functions list
         |  sub_1101E90 -> per-symbol callback on symbol table
         |  sub_1111DB0 -> per-function dispatch on func IR list
         |  sub_1101DE0 -> per-global callback (unless --compile-only)
         |  sub_110F5E0 -> per-section callback on section list
         |  sub_1101F60 -> per-symbol post-process callback
         v
    Phase 12: Address Width + Register Budget
         |  determine address width: 32-bit or 64-bit
         |    SM <= 13 -> 32-bit (maxnreg=32)
         |    SM > 13  -> read from module metadata
         |  PIC validation: if PIC address > threshold -> disable PIC
         |  maxrregcount validation: warn on mismatch
         |  32-bit register check: SM > 90 -> fatal
         v
    Phase 13: Entry Point Collection
         |  if explicit -e entries:
         |    resolve each name through module's symbol table
         |    build ordered entry list -> v322
         |  elif explicit -E entries:
         |    resolve through module, build list
         |  else:
         |    use module's own entry list (a2+88)
         v
    Phase 14: Transfer Compilation State to Context
         |  copy hash maps, flags, and tables into a1+1072..1296
         |  create alias tracking map (sub_4489C0, cap 0x100)
         |  create callee usage map (sub_4489C0, cap 0x418)
         v
    Phase 15: init_callback(a1, entry_list) -- Codegen Initialization
         |  calls the selected init callback (from Phase 6)
         |  sub_110CD20: builds per-function codegen descriptors
         |    for each function: sub_110BC90 -> allocate codegen record
         |    stores in a1+1192 (register usage map)
         |    returns ordered list of functions needing compilation
         v
    Phase 16: Load/Store Cache Mode Assignment
         |  for each function in compilation list:
         |    if force-load-cache -> set cache_mode = 2
         |    elif def-load-cache -> set cache_mode = 1
         |    else -> set cache_mode based on call graph analysis
         v
    Phase 17: Indirect Call + MMA Validation
         |  for each function:
         |    check for indirect calls with .f64 MMA -> warn
         |    check for mutual recursion markers -> flag
         v
    Phase 18: Scheduling Class Assignment
         |  for each function: assign scheduling class (0, 1, or 2)
         |    class 0 = no scheduling needed
         |    class 1 = standard scheduling
         |    class 2 = aggressive scheduling (callee analysis)
         |  propagate class upward through call graph if needed
         v
    Phase 19: Debug Info Setup
         |  if --device-debug: init DWARF context (sub_1672520)
         |  check .debug_abbrev / .debug_info availability
         v
    Phase 20: Reserved Register Configuration
         |  if --first-reserved-rreg: validate (min=4)
         |  compute total reserved = first_reserved + count
         v
    Phase 21: Build Per-Function Codegen Configuration
         |  pack ~50 compilation flags into struct at v334..v358:
         |    device_debug, lineinfo, fast_compile, maxrregcount,
         |    opt_level, compile_only, tools_patch, ewp, preserve_relocs,
         |    sm_version, address_width, default caches, PIC, ...
         |  call sub_16257C0 to create codegen pipeline config object
         v
    Phase 22: Output File Setup
         |  if --output specified: create/truncate output file
         v
  +------+------+
  |             |
  | thread_count == 0            thread_count > 0
  | (sequential)                 (parallel via thread pool)
  |             |
  v             v
    Phase 23a: Sequential          Phase 23b: Parallel
    Per-Function Loop              Per-Function Loop
         |                              |
    for each func:                 sub_43FDB0(thread_count)
      |                              = create thread pool
      |                              |
      +---> sub_110AA30            for each func:
      |     "codegen_init"           build work item (48 bytes):
      |     - create OCG context       [0..15] = timing state
      |     - set "NVIDIA"             [24] = per-func codegen ctx
      |     - set "ptxocg.0.0"         [32] = compilation driver ref
      |     - configure 30+ fields     [40] = optional sync state
      |     - set opt level            |
      |     - set SM-specific flags    sub_43FF50(pool, sub_1107420, item)
      |     - invoke vtable->init      = enqueue work item
      |       to map symbol names      |
      |                              sub_43FFE0(pool) = barrier wait
      +---> sub_1655A60             sub_43FE70(pool) = destroy pool
      |     "codegen_per_func"       |
      |     - initialize 48 pass     sub_1107420 (thread worker):
      |       enable/disable flags     sub_1102B30 -> setjmp + compile
      |       (passes 0..47)           record timing per function
      |     - register lowering        record peak memory
      |       callbacks on IR
      |     - set up UDT/UFT
      |       relocations
      |     - process entry
      |       function list
      |     - ISel dispatch
      |     - register allocation
      |
      +---> sub_1102B30
      |     "codegen_compile"
      |     - setjmp for error
      |       recovery
      |     - invoke compilation
      |       via vtable callback
      |       at a1+96
      |     - on error: longjmp,
      |       record failure
      |
      +---> timing measurement
      |     record compile time
      |     per-function stats
      |
      +---> sub_110D2A0
            "codegen_finalize"
            - emit ELF metadata
            - write nv.info records
            - output SASS binary
            - write register usage
            - cleanup per-func state
         |
         v
    Phase 24: Post-Compilation Cleanup
         |  clean up hash maps, free temp allocations
         |  validate register usage across functions
         |  if --compile-only: cross-check register budgets
         v
    Phase 25: Pipeline Config Teardown (sub_1626480)
         |
         v
    Phase 26: Final Cleanup
         |  destroy hash maps (sub_4650A0)
         |  free per-function resource array
         |  free codegen config
         |  return

Reconstructed Pseudocode for sub_1112F30

// sub_1112F30 -- per-module compilation driver
// Address: 0x1112F30, Size: 65,018 bytes
// Parameters:
//   a1 = compilation driver context (opaque struct, ~1300 bytes)
//   a2 = PTX module descriptor (parsed PTX representation)
// Returns: 0 on success

int ptxas_compile_module(CompilerCtx *ctx, PtxModule *mod) {

    // ---- Phase 1: Query cache configuration options ----
    bool def_load_cache   = option_get_bool(ctx->option_map, "def-load-cache");
    bool force_load_cache = option_get_bool(ctx->option_map, "force-load-cache");
    bool def_store_cache  = option_get_bool(ctx->option_map, "def-store-cache");
    bool force_store_cache = option_get_bool(ctx->option_map, "force-store-cache");

    // ---- Phase 2: Cancellation check ----
    if (ctx->cancel_flag) {
        if (ctx->cancel_callback(ctx->cancel_handle, ctx->cancel_arg) == 1)
            longjmp_to_error_handler();
    }

    // ---- Phase 3: Timing infrastructure ----
    timing_gate_start(ctx);   // sub_45CCD0 on a1+128/a1+144
    if (ctx->high_res_timing)
        ctx->wall_start = get_hires_time();  // sub_44EF30

    // ---- Phase 4: Register per-instruction and per-function callbacks ----
    list_foreach(ctx->instr_callbacks, per_instr_callback, ctx);   // sub_1108860
    list_foreach(ctx->func_callbacks, per_func_callback, ctx);     // sub_1101EB0
    ptx_version_table_init(mod);       // sub_12B30E0
    ptx_version_table_validate(mod);   // sub_12B31D0

    // ---- Phase 5: SM version validation ----
    int sm_module = 0, sm_max = 0;   // %d requires int, not unsigned
    sscanf(mod->target_string, "%*[^0-9]%d", &sm_module);
    sscanf(ctx->max_sm_string, "%*[^0-9]%d", &sm_max);
    if (sm_max < sm_module)
        fatal_error(ERR_SM_MISMATCH);

    if (mod->ptx_version_flag) {
        if (!ptx_version_compatible(sm_max, sm_module)) {   // sub_12A8360
            // reconstructed control flow: the two fatal paths are
            // mutually exclusive, selected by ctx->allow_unsupported_sm
            if (ctx->allow_unsupported_sm)
                fatal_error(ERR_PTX_VERSION);
            else
                fatal_error(ERR_UNSUPPORTED_SM, mod->target_string, ctx->max_sm_string);
        }
    }

    // ---- Phase 6: Select codegen callback pair based on mode flags ----
    CodegenInitFn   init_fn;
    CodegenBeginFn  begin_fn;

    if (ctx->compile_only || ctx->assyscall || ctx->tools_patch) {
        init_fn  = compile_only_init;     // sub_110CD20
        begin_fn = compile_only_begin;    // sub_11089E0
    } else if (ctx->ewp_mode) {
        if (ctx->device_debug) {
            init_fn  = compile_only_init;     // sub_110CD20
            begin_fn = compile_only_begin;    // sub_11089E0
        } else {
            init_fn  = ewp_init;              // sub_110D110
            begin_fn = ewp_begin;             // sub_1107F10
        }
    } else {
        begin_fn = standard_begin;            // sub_1109180
        if (ctx->ewp_flag)
            init_fn = standard_init_ewp;      // sub_110CBA0
        else
            init_fn = standard_init;          // sub_110D0B0
    }

    // ---- Phase 7: Emit PTX dummy entry if no explicit entries ----
    if (!mod->entry_list && !ctx->assyscall && !ctx->tools_patch) {
        int saved_flag = mod->flag_236;
        if (output_is_memory(ctx->func_callbacks)) {
            ptx_emit_entry_inline("__cuda_dummy_entry__", ptx_header_text, mod);
        } else {
            char *path = get_output_path(ctx->func_callbacks);
            FILE *fp = fopen(path, "w");
            if (mod->ptx_version_str)
                fprintf(fp, "\t.version %s\n", mod->ptx_version_str);
            if (mod->target_str)
                fprintf(fp, "\t.target  %s\n", mod->target_str);
            fprintf(fp, "\t.entry __cuda_dummy_entry__ { ret; }\n");
            fclose(fp);
            ptx_parse_file(path, mod);
        }
        mod->flag_236 = saved_flag;  // restore flag modified by parse
    }

    // ---- Phase 8-9: Compilation flags, warnings, compatibility ----
    if (ctx->tools_patch) {
        if (mod->alloc_textures)  warn("Allocating additional textures");
        if (mod->alloc_surfaces)  warn("Allocating additional surfaces");
        if (mod->alloc_samplers)  warn("Allocating additional samplers");
        if (mod->alloc_constants) warn("Allocating additional constants");
        ctx->fast_compile = 0;
    }

    ctx->abi_mode = 0;
    if (ctx->ewp_mode)  ctx->ewp_mode = 0;   // one-shot
    if (mod->has_entry_funcs || mod->has_extern_funcs) {
        if (ctx->fast_compile)
            warn("'--fast-compile' incompatible with calls without ABI");
        ctx->fast_compile = 0;
        if (ctx->ewp_flag)
            warn("'--extensible-whole-program' incompatible with compilation without ABI");
        ctx->ewp_flag = 0;
    }

    // PIC handling
    bool pic = option_get_bool(ctx->option_map, "position-independent-code");
    if (!ctx->compile_only && !ctx->device_debug && !ctx->tools_patch
        && !ctx->ewp_mode && !ctx->ewp_flag) {
        if (!mod->has_relo && !pic)
            ctx->enable_pic = 1;   // auto-enable PIC for normal compilation
    }
    if (ctx->ewp_flag && pic) {
        ctx->enable_pic = 0;
        warn("'--position-independent-code' incompatible with '--extensible-whole-program'");
    }

    if (ctx->early_exit)
        return 0;  // --dry-run equivalent

    // ---- Phase 10: Allocate codegen data structures ----
    HashMaps maps;
    maps.instr_map   = hashmap_create(cmp_fn, free_fn, 0x100);
    maps.func_list   = hashmap_create(cmp_fn, free_fn, 0x400);
    maps.bb_map      = hashmap_create(cmp_fn, free_fn, 0x100);
    maps.conflict    = hashmap_create(cmp_fn, free_fn, 0x40);
    maps.sym_map     = hashmap_create(cmp_fn, free_fn, 0x100);
    maps.label_map   = hashmap_create(cmp_fn, free_fn, 0x100);
    maps.dir_map     = hashmap_create(cmp_fn, free_fn, 0x40);
    maps.aux_map     = hashmap_create(cmp_fn, free_fn, 0x40);
    maps.reg_table   = sorted_map_create(int_cmp, free_fn, 0x20);
    maps.operand_tbl = sorted_map_create(int_cmp, free_fn, arch_operand_count(mod));

    unsigned func_count = get_function_count(mod);   // sub_12AE300
    ctx->per_func_resources = arena_alloc(func_count * 48);
    memset(ctx->per_func_resources, 0, func_count * 48);

    // ---- Phase 11: Traverse module IR, register callbacks ----
    list_foreach(mod->functions.list, register_function_cb, ctx);    // sub_1102AC0
    foreach_symbol(mod->symbol_table, register_symbol_cb, ctx);      // sub_1101E90
    list_foreach(mod->functions.list, dispatch_function_ir, &maps);  // sub_1111DB0
    if (!ctx->compile_only)
        list_foreach(mod->functions.globals, register_global_cb, &maps);
    list_foreach(mod->functions.sections, register_section_cb, &maps);
    foreach_symbol(mod->symbol_table, postprocess_symbol_cb, &maps);

    // ---- Phase 12: Address width + register budget ----
    int addr_width = determine_address_width(ctx, mod);
    ctx->addr_width = addr_width;  // 32 or 64
    validate_maxrregcount(ctx, mod);

    // ---- Phase 13: Collect entry points ----
    FuncList *entries;
    if (ctx->explicit_entries) {
        entries = resolve_entries(ctx->explicit_entries, ctx->module_reader);
    } else if (ctx->explicit_globals) {
        entries = resolve_entries(ctx->explicit_globals, ctx->module_reader);
    } else {
        entries = mod->entry_list;  // default: module's own entry list
    }

    // ---- Phase 14: Transfer state into codegen context ----
    ctx->maps         = maps;
    ctx->func_count   = list_count(entries);
    ctx->alias_map    = sorted_map_create(cmp, free, 0x100);
    ctx->callee_map   = sorted_map_create(cmp, free, 0x418);
    // ... copy ~30 more fields ...

    // ---- Phase 15: Call init_callback to prepare codegen descriptors ----
    FuncList *compile_list = begin_fn(ctx, entries, ctx->per_func_resources);

    // ---- Phase 16: Load/store cache mode per function ----
    for (FuncNode *f = compile_list; f; f = f->next) {
        FuncDesc *desc = f->data;
        int func_idx = desc->func_ir->header->index;
        bool needs_caching = ctx->per_func_resources[func_idx].has_cached_callees;

        if (sm_dispatch->supports_caching(sm_class)) {
            if (force_load_cache)
                desc->cache_mode = 2;  // force all loads cached
            else if (def_load_cache)
                desc->cache_mode = 1;  // default cached
            else if (needs_caching || force_store_cache)
                desc->cache_mode = 2;
            else
                desc->cache_mode = (def_store_cache != 0);
        } else {
            desc->cache_mode = 2 * (force_load_cache || def_load_cache);
        }
    }

    // ---- Phase 17: Validate indirect calls + MMA .f64 ----
    for (FuncNode *f = compile_list; f; f = f->next) {
        FuncDesc *desc = f->data;
        if (desc->mma_info && desc->mma_info->has_f64) {
            warn_once(ERR_MMA_F64, desc->name, "mma with .f64 type");
        }
        if (desc->has_mutual_recursion)
            fatal_error(ERR_MUTUAL_RECURSION, desc->entry_name);
    }

    // ---- Phase 18: Assign scheduling class ----
    for (FuncNode *f = compile_list; f; f = f->next) {
        FuncDesc *desc = f->data;
        if (!desc->needs_scheduling) {
            desc->sched_class = 0;
        } else {
            // analyze call graph for callee scheduling requirements
            int callee_sched_count = count_callees_with_sched_class_2(desc);
            if (ctx->force_aggressive_sched && callee_sched_count > 0) {
                desc->sched_class = 2;
            } else {
                desc->sched_class = sm_dispatch->determine_sched_class(desc);
            }
        }
    }

    // ---- Phase 19: Debug info + Phase 20: Reserved regs ----
    if (ctx->device_debug)
        dwarf_init(ctx->dwarf_ctx, mod);
    int reserved_rreg = -1;
    if (ctx->has_reserved_rreg) {
        if (!ctx->first_reserved_rreg)
            ctx->first_reserved_rreg = 4;  // minimum
        reserved_rreg = ctx->first_reserved_rreg + ctx->reserved_count;
    }

    // ---- Phase 21: Build codegen config struct ----
    CodegenConfig cfg;
    cfg.module         = mod;
    cfg.module_reader  = ctx->module_reader;
    cfg.device_debug   = ctx->device_debug;
    cfg.lineinfo       = ctx->lineinfo;
    cfg.opt_level      = ctx->opt_level;
    // ... pack ~50 flags from ctx into cfg ...
    CodegenPipeline *pipeline = create_codegen_pipeline(cfg);  // sub_16257C0

    // ---- Phase 22: Create output file if needed ----
    if (ctx->output_path && (ctx->dump_sass || ctx->dump_ptx))
        fopen_and_truncate(ctx->output_path, "wt");

    // ---- Phase 23: Per-function compilation ----
    unsigned total_funcs = list_count(compile_list);
    ctx->result_array = arena_alloc(total_funcs * 112);
    memset(ctx->result_array, 0, total_funcs * 112);

    if (ctx->thread_count == 0) {
        // ---- Phase 23a: SEQUENTIAL per-function compilation ----
        for (FuncNode *f = compile_list; f; f = f->next) {
            FuncDesc *desc = f->data;
            unsigned idx = desc->func_index;
            char *name = desc->func_ir->name;

            ctx->result_array[idx].name = name;
            ctx->result_array[idx].start_time = timer_read(ctx->timer);

            // Allocate 360-byte per-function codegen state
            PerFuncState *pfs = arena_alloc_zeroed(360);

            // Phase 23a-i: Initialize codegen context for this function
            codegen_init(ctx, mod, desc, pipeline, pfs);  // sub_110AA30:
            //   - create OCG (Optimizing Code Generator) context
            //   - set producer = "NVIDIA", tool = "ptxocg.0.0"
            //   - configure SM-specific codegen flags
            //   - set optimization level, maxrregcount, address width
            //   - initialize instruction vtable from SM dispatch table
            //   - set up DWARF state if debug enabled
            //   - call vtable->init to resolve symbol names

            // Phase 23a-ii: Invoke the compilation pipeline
            // sub_1655A60 (called from within codegen_init flow):
            //   - initialize 48 pass-enable flags (passes 0..47)
            //   - register IR lowering callbacks
            //   - set up UDT/UFT relocations for Blackwell+
            //   - for each pass in sequence:
            //       pass 1:  IR canonicalization
            //       pass 2:  instruction count estimation
            //       pass 3-22: SM-gated optimization passes
            //           (each enabled/disabled by SM dispatch vtable)
            //       pass 21: address width query
            //       pass 22: register class initialization
            //       pass 23-38: ISel, regalloc, scheduling (SM-gated)
            //       pass 39: ABI frame setup
            //       pass 40-42: final lowering
            //       pass 43: peephole cleanup
            //       pass 46: binary encoding query
            //       pass 47: final verification
            //   - emit SASS instructions via ISel mega-hub
            //   - perform register allocation (graph coloring)
            //   - schedule instructions (ScheduleInstructions, 85 KB)
            //   - run peephole optimizations
            //   - encode final SASS binary (128-bit words)

            // Phase 23a-iii: Error-wrapped compile invocation
            codegen_compile(ctx, desc, pfs);  // sub_1102B30:
            //   - setjmp for error recovery
            //   - call vtable->compile(ctx, func_ir, ctx->codegen_config)
            //   - on error: longjmp, free resources, report failure

            // Phase 23a-iv: Record timing
            timing_record(ctx, desc, pfs);

            // Phase 23a-v: Finalize and emit
            codegen_finalize(ctx, mod, desc, pfs);  // sub_110D2A0:
            //   - emit ELF section content (.text, .nv.info, .nv.constant)
            //   - write register usage records (EIATTR)
            //   - write SASS binary to output
            //   - handle PTX re-emission if --output-ptx
            //   - cleanup per-function OCG state

            arena_free(pfs);

            // cancellation check between functions
            if (ctx->cancel_flag && cancel_callback() == 1)
                longjmp_to_error_handler();
        }
    } else {
        // ---- Phase 23b: PARALLEL per-function compilation ----
        ThreadPool *pool = create_thread_pool(ctx->thread_count);  // sub_43FDB0
        IndexArray *sync = index_array_create(total_funcs);

        for (FuncNode *f = compile_list; f; f = f->next) {
            FuncDesc *desc = f->data;

            // Allocate extended per-function state (360 bytes + 3 maps)
            PerFuncState *pfs = arena_alloc_zeroed(360);
            pfs->local_map_a = sorted_map_create(cmp, free, 8);
            pfs->local_map_b = sorted_map_create(cmp, free, 8);
            pfs->local_map_c = sorted_map_create(cmp, free, 8);
            pfs->pipeline = pipeline;
            pfs->driver_ctx = ctx;
            pfs->module = mod;
            pfs->func_desc = desc;

            // Initialize codegen (same as sequential)
            codegen_init(ctx, mod, desc, pipeline, pfs);

            // Copy 15 x 16-byte blocks of driver state into pfs
            // (compiler flags, maps, timing state)
            memcpy(&pfs->snapshot, &ctx->compilation_state, 15 * 16);

            // Allocate per-function DWARF state (216 bytes)
            pfs->dwarf_state = arena_alloc_zeroed(216);
            dwarf_register(ctx->dwarf_ctx, pfs->dwarf_state);

            // Snapshot pipeline state
            pipeline_snapshot(pipeline, &pfs->pipeline_state);

            // Copy function-local maps from shared -> per-thread
            copy_maps(pfs->local_map_a, ctx->shared_maps);

            // Enqueue work item
            index_array_push(pfs, sync);
        }

        // Wait for all per-function compilations to complete
        // Each thread runs sub_1107420:
        //   1. sub_1102B30 -- setjmp + vtable->compile()
        //   2. record timing (wall-clock + per-function)
        //   3. record peak memory usage
        //   4. free per-function OCG state
        for_each(compile_list, sync, dispatch_work_item, pool);
        thread_pool_barrier(pool);   // sub_43FFE0
        thread_pool_destroy(pool);   // sub_43FE70

        // Single-threaded finalization pass
        report_function_index(ctx, -1);  // sub_1107720
        if (ctx->cancel_flag && cancel_callback() == 1)
            cleanup_and_longjmp();

        // ---- Sequential post-compilation merge ----
        bool first = true;
        unsigned shared_register_budget = 0;
        unsigned idx = 0;
        for (FuncNode *f = compile_list; f; f = f->next, idx++) {
            FuncDesc *desc = f->data;
            PerFuncState *pfs = sync_get(sync, idx);

            // Restore driver state from per-function snapshot
            memcpy(&ctx->compilation_state, &pfs->snapshot, 15 * 16);

            // Replay pipeline state
            pipeline_restore(pipeline, pfs->pipeline_state);

            // Merge per-thread maps back into shared maps
            merge_maps(ctx->shared_maps, pfs->local_map_a);
            merge_maps(ctx->shared_maps, pfs->local_map_b);
            merge_maps(ctx->shared_maps, pfs->local_map_c);

            // Restore DWARF state
            dwarf_merge(ctx->dwarf_ctx, pfs->dwarf_state);

            // Finalize this function
            codegen_finalize(ctx, mod, desc, pfs);

            // Track first-function state for register budget
            if (first) {
                shared_register_budget = ctx->register_budget;
                first = false;
            }
            ctx->register_budget = shared_register_budget;

            // Cleanup
            cleanup_per_func_maps(pfs);
            arena_free(pfs);
        }
        index_array_destroy(sync);
    }

    // ---- Phase 24: Post-compilation validation ----
    pipeline_finalize(pipeline);  // sub_1626480
    if (ctx->compile_only && ctx->register_budget_map) {
        // Cross-validate register usage: each callee must not exceed
        // the register budget of its caller
        for (int i = 0; i < func_count; i++) {
            ResourceEntry *re = &ctx->per_func_resources[i];
            if (re->entry_name && re->callee_list) {
                unsigned caller_budget = map_lookup(ctx->register_budget_map,
                                                    re->entry_name);
                for (FuncNode *c = re->callee_list; c; c = c->next) {
                    unsigned callee_regs = map_lookup(ctx->alias_map, c->name);
                    if (callee_regs > caller_budget)
                        warn("register budget exceeded: %s uses %d, caller %s allows %d",
                             c->name, callee_regs, re->entry_name, caller_budget);
                }
            }
        }
        map_destroy(ctx->register_budget_map);
    }

    // ---- Phase 25-26: Cleanup ----
    hashmap_destroy(maps.bb_map);
    hashmap_destroy(maps.sym_map);
    arena_free(ctx->per_func_resources);
    if (ctx->tensor_check_map)
        set_destroy(ctx->tensor_check_map);

    return 0;
}
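The Phase 5 target parsing above relies on a compact `sscanf` idiom: `%*[^0-9]` consumes the non-digit prefix (`sm_`, `compute_`) without assigning it, then `%d` reads the SM ordinal. A runnable sketch (`sm_ordinal` is a hypothetical helper name, not a symbol from the binary):

```c
#include <stdio.h>

/* Skip the non-digit prefix of a target string like "sm_90" and
   return the numeric SM ordinal, or -1 if no digits are found.
   The suppressed "%*[^0-9]" conversion does not count toward the
   sscanf return value, so a successful parse returns exactly 1. */
static int sm_ordinal(const char *target) {
    int v;
    if (sscanf(target, "%*[^0-9]%d", &v) != 1)
        return -1;
    return v;
}
```

The driver then applies the `sm_max < sm_module` comparison from Phase 5 to the two extracted ordinals.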

Key Subroutine Reference

| Address | Name (reconstructed) | Role in Pipeline |
|---------|----------------------|------------------|
| sub_1112F30 | ptxas_compile_module | Top-level per-module driver (this function) |
| sub_110AA30 | codegen_init | Per-function OCG context creation + field setup |
| sub_1655A60 | codegen_per_func | 48-pass codegen pipeline (ISel, regalloc, sched, encode) |
| sub_1102B30 | codegen_compile | Error-wrapped compilation (setjmp + vtable dispatch) |
| sub_110D2A0 | codegen_finalize | ELF emission, nv.info, SASS output, cleanup |
| sub_1107420 | thread_worker | Thread pool work item: compile + timing + cleanup |
| sub_110CD20 | compile_only_init | Init for --compile-only / --compile-as-tools-patch |
| sub_110D0B0 | standard_init | Init for normal LTO compilation |
| sub_110D110 | ewp_init | Init for --extensible-whole-program mode |
| sub_110CBA0 | standard_init_ewp | Init for standard mode with EWP flag |
| sub_11089E0 | compile_only_begin | Begin callback for compile-only modes |
| sub_1107F10 | ewp_begin | Begin callback for EWP mode |
| sub_1109180 | standard_begin | Begin callback for standard compilation |
| sub_110BC90 | alloc_codegen_record | Allocate per-function codegen descriptor |
| sub_16257C0 | create_codegen_pipeline | Build the codegen pipeline configuration object |
| sub_1626480 | pipeline_finalize | Tear down the codegen pipeline |
| sub_43FDB0 | create_thread_pool | Create split-compilation thread pool |
| sub_43FF50 | enqueue_work_item | Submit per-function work to thread pool |
| sub_43FFE0 | thread_pool_barrier | Wait for all enqueued work to complete |
| sub_43FE70 | thread_pool_destroy | Destroy thread pool |
| sub_12AE300 | get_function_count | Return number of functions in module |
| sub_12AF550 | ptx_emit_entry_inline | Emit PTX entry point to in-memory buffer |
| sub_12AF200 | ptx_parse_file | Parse a PTX file into module representation |
| sub_12B30E0 | ptx_version_table_init | Initialize PTX version compatibility tables |
| sub_12A8360 | ptx_version_compatible | Check PTX/SM version compatibility |
| sub_15C3DD0 | sm_name_to_ordinal | Convert SM target string to ordinal index |
| sub_1672520 | dwarf_init | Initialize DWARF debug info context |

Compilation Mode Matrix

The mode flag dispatch at Phase 6 selects one of four codegen pathways. The choice is determined by command-line flags forwarded into the embedded compiler:

| Mode | Condition | init_fn | begin_fn | Behavior |
|------|-----------|---------|----------|----------|
| Compile-only | --compile-only, --compile-as-tools-patch, or --assyscall | sub_110CD20 | sub_11089E0 | Compile all functions independently. No cross-function optimization. Used for tools patches (Nsight Compute, Nsight Systems) |
| EWP (no debug) | --extensible-whole-program and NOT --device-debug | sub_110D110 | sub_1107F10 | Whole-program optimization. Functions compiled with global visibility into callee register usage. Enables aggressive inlining decisions |
| EWP + debug | --extensible-whole-program AND --device-debug | sub_110CD20 | sub_11089E0 | Falls back to the compile-only pathway because whole-program optimization conflicts with debug-info fidelity |
| Standard | Normal LTO compilation (default) | sub_110D0B0 | sub_1109180 | Standard per-function compilation with cross-function register budget propagation. Used for typical nvlink LTO |
| Standard + EWP flag | Standard with --extensible-whole-program hint | sub_110CBA0 | sub_1109180 | Same as standard, but with EWP-aware init (reserves additional register space for potential future extensibility) |

The 48-Pass Codegen Pipeline (sub_1655A60)

The per-function codegen entry point sub_1655A60 runs a 48-pass pipeline (passes numbered 0--47). Each pass is enabled or disabled by querying the SM dispatch vtable at a1[3757] (the architecture-specific callback table registered by sub_15C0CE0). The pipeline initializes 48 boolean flags in a1[160..207] and then iterates:

Pass  0:   Zero (placeholder)
Pass  1:   Initial IR canonicalization
Pass  2:   Instruction count estimation (query vtable+120)
Pass  3-20: SM-gated optimization passes
             Each pass queries vtable+72 for SM capability.
             If the SM supports the pass, its enable flag is set to 1;
             otherwise the flag remains 0.
Pass 21:   Address-width-dependent setup
Pass 22:   Register class initialization
Pass 23-38: Core backend passes (ISel, register allocation, scheduling)
             These are universally enabled for all SM >= sm_50.
             Passes 23-38 include:
               - Instruction selection (ISel mega-hub dispatch)
               - Register allocation (graph coloring + spilling)
               - Instruction scheduling (ScheduleInstructions)
               - Peephole optimization
               - SASS encoding
Pass 39:   Initial ABI frame setup
Pass 40-42: Final lowering passes
Pass 43:   Peephole cleanup
Pass 44-45: Reserved
Pass 46:   Binary encoding query (vtable+488)
Pass 47:   Final verification + pass-count teardown

After the pass loop, sub_1655A60 registers additional IR lowering callbacks (sub_161F1C0, sub_161F800, sub_1620460) on the function's basic block list, sets up UDT/UFT relocations for Blackwell+ (SM > 26 in ordinal = sm_100+), and processes the function's call graph for register pressure analysis.
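The flag-initialization pattern described above can be sketched in plain C. Everything here is a hypothetical illustration of the shape (the type names, `supports_pass`, and the gating thresholds are assumptions, not recovered symbols); only the pass-number ranges come from the analysis above:

```c
#include <stdbool.h>

#define NUM_PASSES 48

/* Stand-in for the SM dispatch vtable: one capability query per pass,
   mirroring the vtable+72 query described in the text. */
typedef struct {
    bool (*supports_pass)(int sm_ordinal, int pass_id);
} SmDispatch;

/* Initialize the 48 pass-enable flags:
   - pass 0 (placeholder) and passes 44-45 (reserved) stay disabled
   - passes 3-20 are SM-gated through the dispatch vtable
   - core backend passes 23-38 are enabled for all sm_50+
   - everything else (canonicalization, ABI, encoding, verify) is on */
static void init_pass_flags(bool flags[NUM_PASSES],
                            const SmDispatch *disp, int sm) {
    for (int p = 0; p < NUM_PASSES; p++) {
        if (p == 0 || p == 44 || p == 45)
            flags[p] = false;
        else if (p >= 3 && p <= 20)
            flags[p] = disp->supports_pass(sm, p);
        else if (p >= 23 && p <= 38)
            flags[p] = (sm >= 50);
        else
            flags[p] = true;
    }
}

/* Example capability stub: pretend the gated passes need sm_70+. */
static bool demo_supports(int sm, int pass) { (void)pass; return sm >= 70; }
```

The real binary stores these flags in `a1[160..207]` and consults them in the pass loop; the sketch only shows the enable/disable decision.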

Sequential vs. Parallel Compilation

The compilation driver supports two execution models, selected by ctx->thread_count (offset a1+668):

Sequential mode (thread_count == 0): Each function is compiled in the main thread with a simple loop: codegen_init -> codegen_compile (error-wrapped) -> codegen_finalize. Timing is recorded between stages. The cancel callback is checked between functions.
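The error-wrapped compile step (sub_1102B30) follows a classic setjmp/longjmp recovery pattern. A minimal sketch, with all names hypothetical (the real function dispatches through the vtable slot at a1+96 and also frees resources on the error path):

```c
#include <setjmp.h>

/* Jump target armed before invoking the compile callback; a failing
   pass longjmps here instead of unwinding the C call stack. */
static jmp_buf compile_err;

/* A compile callback signals failure by longjmp'ing out. */
static void failing_compile(void) { longjmp(compile_err, 1); }
static void ok_compile(void)      { /* compiles fine */ }

/* Returns 0 on success, 1 if the callback longjmp'd out.
   Stand-in for the setjmp + vtable->compile() shape of sub_1102B30. */
static int compile_wrapped(void (*compile_cb)(void)) {
    if (setjmp(compile_err) != 0)
        return 1;     /* error path: record the failure, clean up */
    compile_cb();
    return 0;
}
```

One wrapper invocation per function keeps a single compile failure from tearing down the whole module's compilation loop.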

Parallel mode (thread_count > 0): A thread pool is created via sub_43FDB0. For each function, a work item containing the full per-function state snapshot (360 bytes + 3 local hash maps + pipeline snapshot) is allocated and enqueued. Each thread executes sub_1107420, which calls sub_1102B30 (the error-wrapped compile) and records timing. After the barrier wait (sub_43FFE0), the main thread performs a sequential merge pass: it restores each function's snapshot, merges local maps back into shared maps, and calls codegen_finalize. This is the same --split-compile-extended mechanism available in standalone ptxas.

The thread pool implementation uses a producer-consumer work queue. sub_43FF50 enqueues items, and worker threads dequeue and execute them. The barrier at sub_43FFE0 blocks until all items complete. The pool is destroyed by sub_43FE70. If an optional synchronization state (qword_2A64430) is non-null, each worker checks for compilation errors after completing its work item via sub_1D1E060 / sub_1D1E300.
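The enqueue/barrier/destroy lifecycle can be sketched with POSIX threads. This is a minimal illustration of the producer-consumer shape described above, under the assumption of a mutex-plus-condvar queue; every name (`Pool`, `pool_create`, `pool_enqueue`, `pool_barrier`, `pool_destroy`) is hypothetical, not a recovered symbol:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct WorkItem {
    void (*fn)(void *);
    void *arg;
    struct WorkItem *next;
} WorkItem;

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv_work, cv_done;
    WorkItem  *head;      /* pending queue (LIFO for brevity) */
    int        pending;   /* queued + in-flight items */
    int        shutdown;
    pthread_t *threads;
    int        nthreads;
} Pool;

static void *worker(void *p) {
    Pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->mu);
        while (!pool->head && !pool->shutdown)
            pthread_cond_wait(&pool->cv_work, &pool->mu);
        if (!pool->head) {              /* shutdown and queue drained */
            pthread_mutex_unlock(&pool->mu);
            return NULL;
        }
        WorkItem *it = pool->head;
        pool->head = it->next;
        pthread_mutex_unlock(&pool->mu);

        it->fn(it->arg);                /* run the compile work item */
        free(it);

        pthread_mutex_lock(&pool->mu);
        if (--pool->pending == 0)
            pthread_cond_signal(&pool->cv_done);  /* wake the barrier */
        pthread_mutex_unlock(&pool->mu);
    }
}

static Pool *pool_create(int n) {                 /* cf. sub_43FDB0 */
    Pool *p = calloc(1, sizeof *p);
    pthread_mutex_init(&p->mu, NULL);
    pthread_cond_init(&p->cv_work, NULL);
    pthread_cond_init(&p->cv_done, NULL);
    p->threads = calloc((size_t)n, sizeof *p->threads);
    p->nthreads = n;
    for (int i = 0; i < n; i++)
        pthread_create(&p->threads[i], NULL, worker, p);
    return p;
}

static void pool_enqueue(Pool *p, void (*fn)(void *), void *arg) {
    WorkItem *it = malloc(sizeof *it);            /* cf. sub_43FF50 */
    it->fn = fn; it->arg = arg;
    pthread_mutex_lock(&p->mu);
    it->next = p->head;
    p->head = it;
    p->pending++;
    pthread_cond_signal(&p->cv_work);
    pthread_mutex_unlock(&p->mu);
}

static void pool_barrier(Pool *p) {               /* cf. sub_43FFE0 */
    pthread_mutex_lock(&p->mu);
    while (p->pending != 0)
        pthread_cond_wait(&p->cv_done, &p->mu);
    pthread_mutex_unlock(&p->mu);
}

static void pool_destroy(Pool *p) {               /* cf. sub_43FE70 */
    pthread_mutex_lock(&p->mu);
    p->shutdown = 1;
    pthread_cond_broadcast(&p->cv_work);
    pthread_mutex_unlock(&p->mu);
    for (int i = 0; i < p->nthreads; i++)
        pthread_join(p->threads[i], NULL);
    free(p->threads);
    free(p);
}

/* demo work item for the usage below */
static _Atomic int done_count;
static void demo_task(void *arg) { (void)arg; atomic_fetch_add(&done_count, 1); }
```

Usage mirrors Phase 23b: enqueue one item per function, wait on the barrier, then destroy the pool before the sequential merge pass.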

Cross-References

  • IR Nodes -- IR node structure and universal accessor functions
  • ISel Hubs -- the five mega-hub instruction selector dispatch functions
  • Peephole -- peephole optimization passes (ORI, scheduling-phase, linker-level)
  • PTX Parsing -- the embedded PTX assembler frontend
  • Register Allocation -- graph-coloring register allocator with spilling
  • Scheduling -- pre-RA and tepid (post-RA) instruction schedulers
  • Architecture Dispatch -- per-SM vtable dispatch system
  • Mercury Overview -- Mercury ISA encoding pipeline
  • FNLZR -- post-link binary rewriter for Mercury targets
  • LTO Overview -- how the LTO pipeline invokes the embedded compiler

Sibling Wikis