Code Generation Overview
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS code generation subsystem converts optimized Ori IR into executable GPU machine code. It is the largest subsystem in ptxas by every metric: approximately 12,000 functions, 9 MB of binary code, and nine functions so large that Hex-Rays cannot decompile them. The pipeline spans phases 112--158 of the 159-phase PhaseManager and comprises seven interlinked subsystems -- instruction selection, SASS binary encoding, peephole optimization, the Mercury encoding pipeline, Newton-Raphson math templates, SASS text generation, and ELF output packaging. Every subsystem dispatches through per-SM-family tables, so the same high-level flow produces correct output for targets from Kepler (sm_30) through Blackwell Ultra (sm_121).
| Pipeline phases | 112--158 (code generation spans the final third of the pipeline) |
| Total functions | ~12,000 (ISel, encoding, peephole, Mercury, formatters, ELF) |
| Total binary size | ~9 MB of machine code |
| Non-decompilable functions | 9 (3 peephole + 6 encoding megadispatchers) |
| Core primitive | sub_7B9B80 -- bitfield insert (216 bytes, 18,347 callers) |
| Architecture selector | *(int*)(config+372) >> 12 -- SM generation ID |
| Largest function | sub_169B190 -- generic peephole dispatcher (280 KB) |
| Output modes | mercury (SM 75--99), capmerc (SM 100+), sass (explicit) |
| CLI option | --binary-kind mercury,capmerc,sass |
Pipeline
Optimized Ori IR (register-allocated, scheduled)
|
v
┌─────────────────────────────────────────────────────────────┐
│ SASS CODE GENERATION │
│ │
│ 1. Instruction Selection (ISel) ───────> [isel.md] │
│ │ DAG pattern matching: ~750 matchers │
│ │ Mega-selector: sub_C0EB10 (185 KB) │
│ │ 4 arch-variant dispatch tables │
│ v │
│ 2. SASS Binary Encoding ───────────────> [encoding.md] │
│ │ ~4,000 template-generated handlers │
│ │ 6 megadispatchers (750 KB total) │
│ │ sub_7B9B80 bitfield packer (18,347 callers) │
│ v │
│ 3. Peephole Optimization ──────────────> [peephole.md] │
│ │ 3 mega-dispatchers: 280+233+233 KB = 746 KB │
│ │ ~3,185 pattern matchers │
│ v │
│ 4. Mercury Pipeline (phases 117-122) ──> [mercury.md] │
│ │ Encode/Decode → Expand → WAR → Opex → WAR → SASS │
│ │ sub_6D9690 master encoder (94 KB) │
│ v │
│ 5. Newton-Raphson Templates ───────────> [templates.md] │
│ │ DDIV/DRCP/DSQRT/DRSQRT software sequences │
│ │ 36 functions, up to 298 virtual registers each │
│ v │
│ 6. SASS Text Generation (phase 129) ──> [sass-printing.md] │
│ │ 580 formatter functions + 12.9 KB dispatcher │
│ v │
│ 7. ELF/Cubin Output ──────────────────> [../output/…] │
│ sub_612DE0 finalizer → sub_1C9F280 ELF emitter │
└─────────────────────────────────────────────────────────────┘
|
v
.cubin / .o (NVIDIA custom ELF)
Phase-to-Subsystem Map
The code generation pipeline occupies phases 112--158. This table maps each phase to its subsystem and documents the six-stage Mercury core that is the dominant path for SM 75+ targets.
| Phase | Name | Subsystem | Detail page |
|---|---|---|---|
| 112 | PlaceBlocksInSourceOrder | Block layout | cfg.md |
| 113 | PostFixForMercTargets | Mercury pre-fixup | mercury.md |
| 114 | FixUpTexDepBarAndSync | Scoreboard / sync | scoreboards.md |
| 115 | AdvancedScoreboardsAndOpexes | Scoreboard hook | scoreboards.md |
| 116 | ProcessO0WaitsAndSBs | Scoreboard (-O0) | scoreboards.md |
| 117 | MercEncodeAndDecode | Mercury core | mercury.md |
| 118 | MercExpandInstructions | Mercury core | mercury.md |
| 119 | MercGenerateWARs1 | Mercury core | mercury.md |
| 120 | MercGenerateOpex | Mercury core | mercury.md |
| 121 | MercGenerateWARs2 | Mercury core | mercury.md |
| 122 | MercGenerateSassUCode | Mercury core | mercury.md |
| 123 | ComputeVCallRegUse | Post-Mercury bookkeeping | -- |
| 124 | CalcRegisterMap | Post-Mercury bookkeeping | -- |
| 125 | UpdateAfterPostRegAlloc | Post-Mercury bookkeeping | -- |
| 126 | ReportFinalMemoryUsage | Reporting | dumpir.md |
| 127 | AdvancedPhaseOriPhaseEncoding | Encoding hook | -- |
| 128 | UpdateAfterFormatCodeList | Post-Mercury bookkeeping | -- |
| 129 | DumpNVuCodeText | SASS text output | sass-printing.md |
| 130 | DumpNVuCodeHex | SASS hex output | dumpir.md |
| 131 | DebuggerBreak | Debug | -- |
| 132 | UpdateAfterConvertUnsupportedOps | Late cleanup | -- |
| 133 | MergeEquivalentConditionalFlow | Late cleanup | -- |
| 134 | AdvancedPhaseAfterMidExpansion | Late cleanup hook | -- |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Late cleanup hook | -- |
| 136 | LateMergeEquivalentConditionalFlow | Late cleanup | -- |
| 137 | LateExpansionUnsupportedOpsMid | Late lowering | -- |
| 138 | OriSplitHighPressureLiveRanges | Late regalloc fixup | -- |
| 139--158 | (architecture-specific) | Arch backends | phase-manager.md |
Subsystem grouping summary:
| Subsystem | Phases | Key property |
|---|---|---|
| Block layout | 112 | Restores source-order block placement |
| Scoreboard / sync | 113--116 | Pre-Mercury texture and dependency bar fixups |
| Mercury core | 117--122 | Six-stage encode-expand-WAR-opex-WAR-emit pipeline |
| Post-Mercury bookkeeping | 123--128 | Register maps, data structure refresh |
| SASS output + debug | 129--131 | Text/hex dumps and debugger hook |
| Late cleanup | 132--138 | Conditional merging, late lowering, live-range splits |
| Arch-specific | 139--158 | 20 backend-overridable phases (no-op by default) |
Scale
| Subsystem | Functions | Binary size | Key entry point |
|---|---|---|---|
| ISel pattern matchers | ~750 | ~1.3 MB | sub_B285D0 (ISel driver, 9 KB) |
| ISel mega-selector | 1 | 185 KB | sub_C0EB10 |
| SASS encoding handlers | ~4,000 | ~2.5 MB | sub_7B9B80 (bitfield packer) |
| Encoding megadispatchers | 6 | ~750 KB | sub_10C0B20 (setField, 180 KB) |
| Peephole mega-dispatchers | 3 | ~746 KB | sub_169B190 (generic, 280 KB) |
| Peephole pattern matchers | ~3,185 | ~1.5 MB | (individual matchers) |
| Mercury pipeline | ~50 | ~400 KB | sub_6F52F0 (orchestrator, 23 KB) |
| Mercury encode tables | 530 | ~500 KB | format initializers at 0xC66000 |
| Encoding vtable methods | ~2,735 | ~450 KB | tiny dispatchers at 0xAF0000 |
| Newton-Raphson templates | 36 | ~180 KB | sub_170E260 (DDIV coordinator) |
| SASS text formatters | 580 | ~850 KB | sub_5D4190 (dispatcher, 12.9 KB) |
| ELF emitter | ~60 | ~300 KB | sub_1C9F280 (master, 97 KB) |
| Total | ~12,000 | ~9 MB |
Nine functions exceed the decompilation threshold: the three peephole mega-dispatchers (280 + 233 + 233 KB) and the six encoding megadispatchers (180 + 197 + 187 + 142 + 68 + 65 KB). All analysis of these functions derives from disassembly, call graphs, and the smaller functions they invoke.
Instruction Selection
ISel converts abstract Ori IR operations into concrete SASS instruction forms using SelectionDAG-style pattern matching. Unlike upstream LLVM's TableGen-driven ISel, ptxas uses handwritten C++ matchers compiled into ~750 functions invoked from the ISel driver via per-opcode dispatch tables. The ISel driver (sub_B285D0, 9 KB, 66 callees) selects architecture-variant builders based on the SM version. The mega-selector (sub_C0EB10, 185 KB) handles the full IR-to-SASS mapping through a giant switch over instruction opcodes. Four nearly identical dispatch functions (15,049 bytes each) at sub_B128E0--sub_B12920 provide architecture-variant opcode routing, all jumping to shared handler code at 0x1C39xxx.
See Instruction Selection for the full DAG matcher protocol, helper function table, architecture dispatch tables, and operand variant selectors.
SASS Binary Encoding
The encoding subsystem translates ISel output into packed binary SASS machine code. Each instruction is encoded into a 1280-bit (160-byte, 20-QWORD) buffer via the universal bitfield packer sub_7B9B80. The full architecture is documented in SASS Instruction Encoding; the key facts for the overview:
- ~4,000 encoding handler functions -- each follows an identical 10-phase template, differing only in constants and modifier helpers
- 6 megadispatchers (750 KB total) route field-level queries by instruction category:
setField(180 KB),getFieldOffset(197 KB),hasField(187 KB),setFieldDefault(142 KB),getOperandFieldOffset(68 KB),setOperandField(65 KB) - 2,095 bitfield accessor functions at
0x10B0000--0x10BF2C0(1,661 under 200 bytes) - 530 encoding table initializers at
0xC66000--0xD27000, each populating one instruction format row - 3-level opcode hierarchy: major (9 bits), minor (8 bits), sub-opcode (7 bits)
- Instruction widths: 64-bit (format code 1), 128-bit (format code 2), 256-bit (format code 8)
Peephole Optimization
Three monolithic dispatch functions implement brute-force pattern-match-and-rewrite. The full architecture is documented in Peephole Optimization. The key positioning facts:
| Dispatcher | Size | Matchers | Entry trampoline | Runs when |
|---|---|---|---|---|
sub_169B190 | 280 KB | 762 | sub_B12930 | Pre-scheduling (all SM) |
sub_143C440 | 233 KB | 1,087 | sub_B12940 | Pre-scheduling (SM 120 only) |
sub_198BCD0 | 233 KB | 1,336 | sub_B12960 | Post-scheduling (all SM) |
All three use identical architecture: a 373-case primary switch on the 16-bit opcode at instruction+0x0C, per-case pattern matcher invocations with priority tracking, and a secondary switch for rewrite actions. The SM 120 dispatcher (sub_143C440) is architecture-gated and runs only when compiling for consumer RTX 50-series or enterprise Pro GPUs.
Mercury Pipeline
Mercury is NVIDIA's intermediate encoding layer between the optimizer's Ori IR and native SASS machine code. It occupies phases 113--122 and forms a six-stage sub-pipeline. Three output modes are controlled by --binary-kind: mercury (SM 75--99), capmerc (SM 100+, with embedded PTX source and relocation metadata), and sass (explicit direct SASS output). The master encoder sub_6D9690 (94 KB) is the largest backend function, with the orchestrator sub_6F52F0 (23 KB, 18 parameters) driving the full stage sequence.
See Mercury Encoder Pipeline for the six-stage architecture, key function table, and output mode details. See Capsule Mercury & Finalization for the SM 100+ variant.
Newton-Raphson Templates
Double-precision operations lacking dedicated hardware (DDIV, DRCP, DSQRT, DRSQRT) are lowered into multi-instruction SASS sequences implementing Newton-Raphson iterative refinement. The template system at 0x1700000--0x1722D60 comprises 36 functions organized in a two-level hierarchy: a top-level handler per operation delegates to a coordinator that allocates up to 298 virtual registers and chains 5--7 sub-expander functions. The register-count dispatcher sub_1704070 selects between full inline, partial inline, and template-based expansion paths based on register file pressure (thresholds: 20,479 / 16,383).
See Newton-Raphson Templates for the complete template hierarchy, register-count dispatch logic, and sub-expander details.
SASS Text Generation
Phase 129 (DumpNVuCodeText) converts the internal instruction stream into human-readable SASS assembly text for --verbose output and --out-sass dumps. The dispatcher sub_5D4190 (12.9 KB) routes 81 named opcodes via direct string comparison and 473 via hash-based switch to 580 template-generated formatter functions at 0x4DA340--0x5A8E40 (~850 KB). All formatters use a monolithic 1.8 MB format string table -- an unusual design that trades memory for formatting speed.
See SASS Text Generation for the full formatter architecture and opcode routing details.
ELF/Cubin Output
The final stage packages the encoded SASS binary into NVIDIA's custom ELF format (.cubin/.o). The kernel finalizer sub_612DE0 (47 KB) feeds the master ELF emitter sub_1C9F280 (97 KB), which delegates to symbol table emission (sub_713710), relocation generation (sub_7163C0), string table construction (sub_7122C0), and section layout finalization (sub_716DC0).
See ELF/Cubin Output for section catalog, relocation format, and EIATTR attribute encoding.
Intrinsic Lowering
The OCG (On-Chip Global) intrinsic system at 0x6C0000--0x6D0000 handles PTX builtin operations for SM 100+ targets. The master intrinsic table at sub_6C9EB0 (13 KB) initializes a 10,664-byte dispatch table with prefix "__nv_ptx_builtin_ocg_", covering operations from basic add/load/store through SM 100 tensor core (tcgen05) and bulk async copy:
| Handler | Size | Operations |
|---|---|---|
sub_6C0D90 | 19 KB | Atomic reduce (atom.add/min/max/cas -- 54 validation strings) |
sub_6C3470 | 20 KB | cp.async.bulk (bulk async copy) |
sub_6C1CF0 | 16 KB | mbarrier (arrive, wait, test, counted variants) |
sub_6C4DA0 | 15 KB | Load/store with scope, memory order, domain validation |
sub_6D4350 | 30 KB | MMA intrinsics (HMMA, IMMA, DMMA variants) |
sub_6D7AF0 | 19 KB | TCGen05 MMA (SM 100, 5th generation tensor core) |
Intrinsic parameter validators at sub_6BDB60--sub_6BF910 enforce type, sub-operation, and memory domain constraints. NVIDIA consistently misspells "intrinsic" as "instrinsic" in all validation error strings.
Post-Scheduling Statistics
Eight SM-variant statistics printers at sub_ABBA50--sub_ABEB50 (7,603 bytes each, spaced 0x700 apart) generate "# [...] " comments with comprehensive post-codegen metrics: instruction counts, register usage, spill/refill bytes, estimated latency and occupancy, per-functional-unit instruction estimates, MMA counts, and throughput figures. The per-unit instruction counter sub_ABF590 (17 KB) uses SSE2 operations for batch updates.
Operand Legalization
Post-register-allocation operand legalization rewrites instructions that cannot be directly encoded in SASS:
| Address | Size | Purpose |
|---|---|---|
sub_AB3C30 | 32 KB | Post-RA instruction legalization (opcodes 288, 167, 185, 241, 299, 300, 317) |
sub_AB2D50 | 18 KB | Per-class operand legalization (opcode 307 = ternary/FMA-like) |
sub_ACF4D0 | 14 KB | Constraint solver -- splits instructions when direct encoding fails |
sub_AB8940 | 19 KB | Register move coalescing / copy elimination |
sub_AC2750 | 36 KB | Operand-to-encoding converter (36-byte operand records) |
When legalization requires instruction splitting, sub_ACF4D0 creates new instructions via sub_934630 (instruction constructor). The constraint solver tries alternative encodings before resorting to splits.
WGMMA Pipeline (SM 90+)
The WGMMA (Warp Group Matrix Multiply-Accumulate) pipeline optimizer at 0xACE000--0xAE6000 manages asynchronous tensor core execution for Hopper and later. It automatically inserts warpgroup.arrive and warpgroup.wait fences to ensure correct register handoff. The warning emitter (sub_ACE480) issues "Potential Performance Loss" advisories (codes 7509--7511) when pipelining fails due to extern calls, insufficient registers, or ill-formed pipeline stages.
See WGMMA Pipeline Optimizer for the full call tree, register pressure estimator, and serialization warning details.
Per-SM Architecture Dispatch
Every code generation subsystem dispatches through architecture-specific tables. The SM generation is determined by *(int*)(config+372) >> 12:
config+372 >> 12 | Generation | SM versions |
|---|---|---|
| 3 | Kepler | sm_30--sm_37 |
| 5 | Maxwell | sm_50--sm_53 |
| 6 | Pascal | sm_60--sm_62 |
| 7 | Volta / Turing | sm_70--sm_75 |
| 8 | Ampere | sm_80--sm_89 |
| 9 | Hopper | sm_90--sm_90a |
| 10+ | Blackwell | sm_100--sm_121 |
Architecture-specific dispatch points across the codegen pipeline:
| Subsystem | Dispatch mechanism | Evidence |
|---|---|---|
| ISel | 4 arch-variant dispatch tables at sub_B128E0--sub_B12910 | All JUMPOUT to shared code at 0x1C39xxx |
| Encoding | vtable at *(context+416) with ~200 virtual methods | Per-opcode encoding, latency, hazard rules |
| Peephole | 3 mega-dispatchers with per-SM case logic | SM 120 dispatcher (sub_143C440) is arch-gated |
| Mercury | sub_6E8EB0 sets arch-specific flags in opcode descriptor table | SM 80: bits 1, 8; SM 84: bits 16, 64 |
| Statistics | 8 SM-variant printer clones at sub_ABBA50--sub_ABEB50 | 7,603 bytes each, 0x700 spacing |
| NR templates | Register-count-based dispatch at sub_1704070 | Thresholds: 20479 / 16383 |
Function Map (Top 10)
| Address | Size | Identity |
|---|---|---|
sub_169B190 | 280 KB | Generic peephole dispatcher (all SM, 762 matchers) |
sub_10D5E60 | 197 KB | Encoding getFieldOffset megadispatcher (961 callers) |
sub_10E32E0 | 187 KB | Encoding hasField megadispatcher (72 callers) |
sub_C0EB10 | 185 KB | Main instruction selector (500+ locals, giant switch) |
sub_10C0B20 | 180 KB | Encoding setField megadispatcher (3,109 callers) |
sub_10CCD80 | 142 KB | Encoding setFieldDefault megadispatcher (4 callers) |
sub_1C9F280 | 97 KB | Master ELF emitter |
sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) |
sub_6FFDC0 | 66 KB | Mercury opex body (scoreboard generation) |
sub_6E8EB0 | 64 KB | BasicBlock::Initialize (encoder state, opcode descriptors) |
See function-map.md for the complete table (~30 entries with all codegen functions).
Cross-References
- Instruction Selection -- DAG pattern matching, builder variants, operand validation
- SASS Instruction Encoding -- bit-level encoding format, 10-phase template, opcode hierarchy
- Peephole Optimization -- 3 mega-dispatchers, 3,185 matchers, priority-based rewrite
- Mercury Encoder Pipeline -- 6-stage sub-pipeline, WAR resolution, opex
- Capsule Mercury & Finalization -- SM 100+ variant with embedded PTX + relocations
- Newton-Raphson Templates -- DDIV/DRCP/DSQRT/DRSQRT software sequences
- SASS Text Generation -- 580 formatters, format string table
- Pipeline Overview -- full PTX-to-SASS compilation flow
- Phase Manager -- 159-phase pipeline infrastructure
- Scheduling Architecture -- 3-phase scheduler (pre-codegen)
- Register Allocation -- Fatpoint algorithm (pre-codegen)
- ELF/Cubin Output -- custom ELF emitter, section catalog
- Knobs System -- knobs controlling codegen behavior