Pass Inventory & Ordering
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- phase skipping is handled per-phase via isNoOp() virtual method overrides, not by reordering the table. This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.
All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |
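The object model and fixed-order traversal described above can be illustrated with a minimal sketch. The class and function names (`Phase`, `run_pipeline`, `slot_offset`) are hypothetical and do not come from the binary; only the slot order, the slot offsets, and the rule that skipping is decided per phase are taken from the text. The 8-byte pointer size is an assumption consistent with the +0/+8/+16/+24/+32 offsets.

```python
POINTER_SIZE = 8  # assumption: 64-bit pointers, consistent with the slot offsets

# Slot names in vtable order; the last two slots are NULL in the binary.
VTABLE_SLOTS = ["execute", "getIndex", "isNoOp", None, None]


def slot_offset(name):
    """Byte offset of a named vtable slot."""
    return VTABLE_SLOTS.index(name) * POINTER_SIZE


class Phase:
    """Hypothetical stand-in for one polymorphic phase object."""

    def __init__(self, index, name, no_op=False):
        self._index = index
        self.name = name
        self._no_op = no_op

    def execute(self, ctx):
        ctx.append(self.name)  # record that the phase ran

    def get_index(self):
        return self._index

    def is_no_op(self):
        return self._no_op


def run_pipeline(phases, ctx):
    """Walk the table in its fixed order; skipping is decided per phase."""
    for phase in phases:
        if phase.is_no_op():
            continue
        phase.execute(ctx)
    return ctx
```

In this sketch the table is never reordered or filtered ahead of time: a skipped phase still occupies its slot and is simply passed over when its `is_no_op()` returns true.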
Phase Categories
Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.
| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |
Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.
Numbering Discrepancy
Warning: The phase numbers 0--138 on this page use a compressed numbering scheme established before the full 159-entry name table was discovered (P2-14). The true static name table at off_22BD0C0 contains 159 entries indexed 0--158, and 16 of the 20 newly-discovered names occupy indices within the 0--138 range. In the true table, these 16 entries sit at their listed indices, and all subsequent phases shift up. The wiki's compressed numbering diverges from the true binary indices starting around phase 8.
Phases 139--158 are correctly numbered (they match the true static table indices). A full renumbering of phases 0--138 to match the true binary indices is deferred as a separate task because it would affect cross-references across 40+ wiki pages.
The 16 omitted name table entries (with their true static table indices) are:
| True Index | Name | Category | Relationship to Wiki |
|---|---|---|---|
| 22 | OriCopyProp | Optimization | Sub-pass within all 6 GeneralOptimize bundles; also injected into Mercury pipeline |
| 32 | OptimizeNaNOrZero | Optimization | Standalone NaN/zero folding pass; not documented under current wiki numbering |
| 37 | ConvertMemoryToRegisterOrUniform | Optimization | Sub-pass of GeneralOptimizeMid; gated by knob 487; sub_910840 |
| 41 | Vectorization | Optimization | Load/store vectorization; gated by DisableReadVectorization/DisableWriteVectorization knobs |
| 57 | OriCommoning | Optimization | Commoning sub-pass; related to LateOriCommoning (wiki phase 64) |
| 69 | OriSimpleLiveDead | Optimization | Liveness/DCE sub-pass; related to EarlyOriSimpleLiveDead (wiki phase 10) |
| 73 | LateVectorization | Optimization | Late vectorization (2nd instance, after optimization exposes new opportunities) |
| 77 | SinkCodeIntoBlock | Optimization | Code sinking; sub_78DB70; DisablePhases=SinkCodeIntoBlock gate |
| 103 | LateEnforceArgumentRestrictions | Lowering | Late counterpart to EnforceArgumentRestrictions (wiki phase 48) |
| 114 | ScheduleInstructions | Scheduling | Worker for AdvancedPhasePreSched; sub_8D0640 (22 KB) |
| 115 | UpdateAfterScheduleInstructions | Cleanup | IR metadata refresh after scheduling completes |
| 118 | UpdateAfterOriDoSyncronization | Cleanup | IR metadata refresh after sync insertion (wiki phase 99) |
| 120 | ReportBeforeRegisterAllocation | Reporting | DUMPIR target; diagnostic dump before register allocation |
| 122 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg; canonical allocator entry |
| 124 | UpdateAfterOriAllocateRegisters | Cleanup | IR metadata refresh after register allocation |
| 127 | PostExpansion | Lowering | Worker for AdvancedPhasePostExpansion; post-RA expansion |
All 16 are valid DUMPIR targets (resolvable through sub_C641D0 binary search over the phase name table). Several are also valid DisablePhases targets.
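As a rough illustration of name-based target resolution, the sketch below looks up a phase name by binary search. It is not a reconstruction of sub_C641D0: the function name `resolve_phase`, the case-sensitive comparison, and the use of a separate name-sorted array are all assumptions (binary search needs some ordering, and the index-ordered table is not alphabetical). The indices are the true table indices listed above.

```python
from bisect import bisect_left

# A few names from the static table with their true indices (from the table above).
NAME_TO_INDEX = {
    "OriCopyProp": 22,
    "Vectorization": 41,
    "ScheduleInstructions": 114,
    "AllocateRegisters": 122,
    "PostExpansion": 127,
}

# Assumption: the lookup runs over a name-sorted array.
SORTED_NAMES = sorted(NAME_TO_INDEX)


def resolve_phase(name):
    """Return the phase index for `name`, or None if it is not a known phase."""
    pos = bisect_left(SORTED_NAMES, name)
    if pos < len(SORTED_NAMES) and SORTED_NAMES[pos] == name:
        return NAME_TO_INDEX[name]
    return None
```

A name that is absent from the table resolves to `None` here; how the real lookup reports an unknown DUMPIR target is not covered by this sketch.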
Gate Passes (AdvancedPhase)
Seventeen phases are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.
Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.
The naming convention is consistent: AdvancedPhase prefix followed by the pipeline position or action name. One exception is AdvancedScoreboardsAndOpexes (phase 115), which uses Advanced without Phase.
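The gate mechanism can be pictured as a default no-op object whose slot in the table is taken over by a backend-specific implementation. The sketch below models the vtable override as a subclass swapped into the same table position; `GatePhase`, `BackendAllocReg`, `install_override`, and `run` are hypothetical names used only for illustration.

```python
class GatePhase:
    """Default gate: reports itself as a no-op and does nothing."""

    def __init__(self, index, name):
        self.index = index
        self.name = name

    def is_no_op(self):
        return True

    def execute(self, ctx):
        pass


class BackendAllocReg(GatePhase):
    """Hypothetical backend override that supplies a real execute()."""

    def is_no_op(self):
        return False

    def execute(self, ctx):
        ctx.append("AllocateRegisters")  # stands in for the real allocator


def install_override(table, phase):
    """Swap a phase in at its own index; table order and length are unchanged."""
    return [phase if p.index == phase.index else p for p in table]


def run(table):
    """Execute every phase that does not report itself as a no-op."""
    ctx = []
    for p in table:
        if not p.is_no_op():
            p.execute(ctx)
    return ctx
```

With only the default gate installed, `run` does nothing at position 101; after `install_override`, the same position drives the allocator, which mirrors the statement that the base pipeline contains no hardcoded allocator.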
Gate Pass Worker Correspondence
Several gate passes dispatch to named worker functions when activated by a backend. The worker names appear in the static name table and are valid DUMPIR/NamedPhases targets:
| Gate Pass (Wiki #) | Worker Function (True Table Index) | Evidence |
|---|---|---|
| AdvancedPhasePreSched (97) | ScheduleInstructions [114] | sub_8D0640, string "ScheduleInstructions" |
| AdvancedPhaseAllocReg (101) | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvancedPhasePostExpansion (104) | PostExpansion [127] | Post-RA expansion dispatch |
| AdvancedPhasePostFixUp (111) | PostFixUp [140] | Target vtable+0x148 dispatch |
See Optimization Levels for per-gate activation rules.
Update Passes
Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):
| Phase | Name | Refreshes |
|---|---|---|
| 76 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after late unsupported-op expansion |
| 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |
These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.
Report Passes
Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):
| Phase | Name | Output |
|---|---|---|
| 9 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |
Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).
GeneralOptimize Bundles
The GeneralOptimize* passes are compound optimization bundles that run multiple small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until no further changes occur. They appear at 6 positions throughout the pipeline to re-clean the IR after major transformations:
| Phase | Name | Position |
|---|---|---|
| 13 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | GeneralOptimize | After early loop/branch optimizations |
| 37 | GeneralOptimizeMid | After mid-level transformations |
| 46 | GeneralOptimizeMid2 | After VTG/CTA/mbarrier expansion |
| 58 | GeneralOptimizeLate | After late expansion |
| 65 | GeneralOptimizeLate2 | After predication and late commoning |
See GeneralOptimize Bundles for the sub-pass decomposition.
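The fixed-point structure of a bundle can be sketched on a toy IR. Everything here is illustrative: the IR is a plain dictionary, the three sub-passes (`propagate`, `fold`, `dce`) are simplified stand-ins rather than the real sub-pass list, algebraic simplification is omitted, and the `max_rounds` safety cap is an assumption.

```python
def propagate(ir):
    """Copy/constant propagation: substitute operands defined as a copy or constant."""
    defs, changed = ir["defs"], False

    def resolve(operand):
        # One step only: a name whose definition is a constant or another name.
        if isinstance(operand, str) and isinstance(defs.get(operand), (int, str)):
            return defs[operand]
        return operand

    for name, expr in list(defs.items()):
        if isinstance(expr, tuple):
            new = (expr[0],) + tuple(resolve(o) for o in expr[1:])
        else:
            new = resolve(expr)
        if new != expr:
            defs[name] = new
            changed = True
    return changed


def fold(ir):
    """Constant folding: evaluate an add whose operands are both constants."""
    defs, changed = ir["defs"], False
    for name, expr in list(defs.items()):
        if (isinstance(expr, tuple) and expr[0] == "add"
                and all(isinstance(o, int) for o in expr[1:])):
            defs[name] = expr[1] + expr[2]
            changed = True
    return changed


def dce(ir):
    """Dead code elimination: drop definitions unreachable from the roots."""
    defs = ir["defs"]
    live, work = set(), list(ir["roots"])
    while work:
        name = work.pop()
        if name in live or name not in defs:
            continue
        live.add(name)
        expr = defs[name]
        operands = expr[1:] if isinstance(expr, tuple) else (expr,)
        work.extend(o for o in operands if isinstance(o, str))
    dead = [n for n in defs if n not in live]
    for n in dead:
        del defs[n]
    return bool(dead)


def run_to_fixed_point(ir, sub_passes, max_rounds=100):
    """Repeat all sub-passes until a full round reports no change."""
    rounds, changed = 0, True
    while changed and rounds < max_rounds:
        changed = False
        for sub_pass in sub_passes:
            if sub_pass(ir):
                changed = True
        rounds += 1
    return rounds


ir = {
    "defs": {"a": 2, "b": "a", "c": ("add", "b", 3),
             "d": ("add", "c", "c"), "unused": 7},
    "roots": {"d"},
}
rounds = run_to_fixed_point(ir, [propagate, fold, dce])
```

Each sub-pass on its own makes little progress; the loop is what lets folding `c` expose the fold of `d` in the next round, which is the reason for bundling small transformations behind a single iteration driver.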
O-Level Gating
Twenty-two phases have confirmed optimization-level gates. The O-Level column in the table below annotates every phase where the activation threshold has been verified from decompiled isNoOp() methods or execute-function guards. Phases without an O-Level annotation run at all optimization levels (O0--O5). Threshold notation: > N means the phase requires opt_level > N; == 0 means the phase is active only at O0.
See Optimization Levels for the complete per-phase activation table, the O-level accessor (sub_7DDB50), and the NvOpt recipe system.
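The two threshold notations defined above can be read as a small predicate over the optimization level. The helper below is a hypothetical sketch for interpreting the O-Level column, not code recovered from the binary; it handles only the `> N` and `== 0` forms plus the empty (always active) case, and rejects compound annotations such as `> 1 / > 4`.

```python
def is_active(gate, opt_level):
    """Evaluate an O-Level annotation ('', '> N', or '== N') for one opt level."""
    gate = gate.strip()
    if not gate:
        return True  # no annotation: runs at every optimization level
    parts = gate.split()
    if len(parts) != 2:
        raise ValueError(f"unsupported gate notation: {gate!r}")
    op, threshold = parts[0], int(parts[1])
    if op == ">":
        return opt_level > threshold
    if op == "==":
        return opt_level == threshold
    raise ValueError(f"unsupported gate operator: {op!r}")
```

For example, `is_active("> 0", 0)` is false and `is_active("== 0", 0)` is true, so a pair of phases annotated `> 0` and `== 0` can never both run at the same level.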
Complete 159-Phase Table
Stage 1 -- Initial Setup (Phases 0--13)
Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | Validation | | Validates structural correctness of the initial Ori IR after PTX lowering | |
| 1 | ApplyNvOptRecipes | Optimization | | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | |
| 2 | PromoteFP16 | Lowering | | Promotes FP16 operations to FP32 where hardware lacks native support | |
| 3 | AnalyzeControlFlow | Analysis | | Builds the CFG: identifies loops, dominators, back edges | |
| 4 | AdvancedPhaseBeforeConvUnSup | Gate | | Hook before unsupported-op conversion; no-op by default | |
| 5 | ConvertUnsupportedOps | Lowering | | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization |
| 6 | SetControlFlowOpLastInBB | Cleanup | | Ensures control flow instructions are the final instruction in each basic block | |
| 7 | AdvancedPhaseAfterConvUnSup | Gate | | Hook after unsupported-op conversion; no-op by default | |
| 8 | OriCreateMacroInsts | Lowering | | Expands PTX-level macro instructions into Ori instruction sequences | |
| 9 | ReportInitialRepresentation | Reporting | | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | |
| 10 | EarlyOriSimpleLiveDead | Optimization | | Quick early dead code elimination pass | Liveness |
| 11 | ReplaceUniformsWithImm | Optimization | | Replaces uniform register reads with immediate constants where value is known | Uniform Regs |
| 12 | OriSanitize | Validation | | Validates IR consistency after initial setup transformations | |
| 13 | GeneralOptimizeEarly | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |
Stage 2 -- Early Optimization (Phases 14--32)
Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 14 | DoSwitchOptFirst | Optimization | > 0 | Optimizes switch statements: jump table generation, case clustering (1st pass) | Branch & Switch |
| 15 | OriBranchOpt | Optimization | > 0 | Branch folding, unreachable block elimination, conditional branch simplification | Branch & Switch |
| 16 | OriPerformLiveDeadFirst | Analysis | | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness |
| 17 | OptimizeBindlessHeaderLoads | Optimization | | Hoists and deduplicates bindless texture header loads | |
| 18 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | OriSplitLiveRanges | Optimization | | Splits live ranges at loop boundaries to reduce register pressure | Liveness |
| 20 | PerformPGO | Optimization | | Applies profile-guided optimization data (block weights, branch probabilities) | |
| 21 | OriStrengthReduce | Optimization | | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction |
| 22 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | GenerateMovPhi | Lowering | | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | |
| 24 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | StageAndFence | Lowering | | Inserts memory fence and staging instructions for coherence | Sync & Barriers |
| 26 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | AnalyzeUniformsForSpeculation | Analysis | | Identifies uniform values safe for speculative execution | Uniform Regs |
| 28 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | GeneralOptimize | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize |
| 30 | DoSwitchOptSecond | Optimization | > 0 | Second switch optimization pass after loop/branch transformations | Branch & Switch |
| 31 | OriLinearReplacement | Optimization | | Replaces branch-heavy patterns with linear (branchless) sequences | |
| 32 | CompactLocalMemory | Optimization | | Compacts local memory allocations by eliminating dead slots and reordering | |
Stage 3 -- Mid-Level Optimization (Phases 33--52)
GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 33 | OriPerformLiveDeadSecond | Analysis | | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness |
| 34 | ExtractShaderConstsFirst | Optimization | | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | |
| 35 | OriHoistInvariantsEarly | Optimization | | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes |
| 36 | EmitPSI | Lowering | | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | |
| 37 | GeneralOptimizeMid | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Optimization | > 0 | Simplifies nested conditional branches into flatter control flow | Branch & Switch |
| 39 | ConvertVTGReadWrite | Lowering | | Converts vertex/tessellation/geometry shader read/write operations | |
| 40 | DoVirtualCTAExpansion | Lowering | | Expands virtual CTA operations into physical CTA primitives | |
| 41 | MarkAdditionalColdBlocks | Analysis | | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold |
| 42 | ExpandMbarrier | Lowering | | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers |
| 43 | ForwardProgress | Lowering | | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | |
| 44 | OptimizeUniformAtomic | Optimization | | Converts thread-uniform atomic operations into warp-level reductions | |
| 45 | MidExpansion | Lowering | | Target-dependent mid-level expansion of operations before register allocation | Late Legalization |
| 46 | GeneralOptimizeMid2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Gate | | Hook before argument enforcement; no-op by default | |
| 48 | EnforceArgumentRestrictions | Lowering | | Enforces ABI restrictions on function arguments (register classes, alignment) | |
| 49 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | OriReassociateAndCommon | Optimization | | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE |
| 51 | ExtractShaderConstsFinal | Optimization | | Final shader constant extraction pass (after GVN may expose new constants) | |
| 52 | OriReplaceEquivMultiDefMov | Optimization | | Eliminates redundant multi-definition move instructions with equivalent sources | |
Stage 4 -- Late Optimization (Phases 53--77)
Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 53 | OriPropagateVaryingFirst | Optimization | | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | |
| 54 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | LateExpansion | Lowering | | Expands operations that must be lowered after high-level optimizations | Late Legalization |
| 56 | SpeculativeHoistComInsts | Optimization | | Speculatively hoists common instructions above branches | |
| 57 | RemoveASTToDefaultValues | Cleanup | | Removes AST (address space type) annotations that have been lowered to defaults | |
| 58 | GeneralOptimizeLate | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize |
| 59 | OriLoopFusion | Optimization | | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes |
| 60 | DoVTGMultiViewExpansion | Lowering | | Expands multi-view operations for vertex/tessellation/geometry shaders | |
| 61 | OriPerformLiveDeadThird | Analysis | | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness |
| 62 | OriRemoveRedundantMultiDefMov | Optimization | | Removes dead multi-definition move instructions | |
| 63 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | LateOriCommoning | Optimization | | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE |
| 65 | GeneralOptimizeLate2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize |
| 66 | OriHoistInvariantsLate | Optimization | | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes |
| 67 | DoKillMovement | Optimization | | Moves kill annotations closer to last use to improve register pressure | |
| 68 | DoTexMovement | Optimization | | Moves texture fetch instructions to minimize latency exposure | |
| 69 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | OriPropagateVaryingSecond | Optimization | | Propagates varying annotations (2nd pass, after predication changes control flow) | |
| 71 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | ConvertAllMovPhiToMov | Lowering | | Destroys SSA form: converts MOV.PHI instructions into plain MOV | |
| 74 | ConvertToUniformReg | Optimization | | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs |
| 75 | LateArchOptimizeFirst | Optimization | | Architecture-specific late optimizations (1st pass) | |
| 76 | UpdateAfterOptimize | Cleanup | | Rebuilds IR metadata invalidated by the late optimization group | |
| 77 | AdvancedPhaseLateConvUnSup | Gate | | Hook at the late unsupported-op boundary; no-op by default | |
Stage 5 -- Legalization (Phases 78--96)
Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 78 | LateExpansionUnsupportedOps | Lowering | | Expands remaining unsupported operations after all optimizations | Late Legalization |
| 79 | OriHoistInvariantsLate2 | Optimization | | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes |
| 80 | ExpandJmxComputation | Lowering | | Expands JMX (jump with index computation) pseudo-instructions | |
| 81 | LateArchOptimizeSecond | Optimization | | Architecture-specific late optimizations (2nd pass) | |
| 82 | AdvancedPhaseBackPropVReg | Gate | | Hook before backward copy propagation; no-op by default | |
| 83 | OriBackCopyPropagate | Optimization | | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE |
| 84 | OriPerformLiveDeadFourth | Analysis | | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness |
| 85 | OriPropagateGmma | Optimization | | Propagates WGMMA accumulator values through the IR | GMMA Pipeline |
| 86 | InsertPseudoUseDefForConvUR | Lowering | | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs |
| 87 | FixupGmmaSequence | Lowering | | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline |
| 88 | OriHoistInvariantsLate3 | Optimization | | LICM (late 3rd pass) after GMMA fixup | Loop Passes |
| 89 | AdvancedPhaseSetRegAttr | Gate | | Hook before register attribute setting; no-op by default | |
| 90 | OriSetRegisterAttr | Analysis | | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling |
| 91 | OriCalcDependantTex | Analysis | | Computes texture instruction dependencies for scheduling | |
| 92 | AdvancedPhaseAfterSetRegAttr | Gate | | Hook after register attribute setting; no-op by default | |
| 93 | LateExpansionUnsupportedOps2 | Lowering | | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization |
| 94 | FinalInspectionPass | Validation | | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | |
| 95 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | ReportBeforeScheduling | Reporting | | Dumps IR before scheduling (no-op unless diagnostic options enabled) | |
Stage 6 -- Scheduling & Register Allocation (Phases 97--103)
Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 97 | AdvancedPhasePreSched | Gate | | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling |
| 98 | BackPropagateVEC2D | Optimization | | Backward-propagates 2D vector register assignments | |
| 99 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | AdvancedPhaseAllocReg | Gate | | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture |
| 102 | ReportAfterRegisterAllocation | Reporting | | Dumps IR after register allocation (no-op unless diagnostic options enabled) | |
| 103 | Get64bRegComponents | RegAlloc | | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |
Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)
Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 104 | AdvancedPhasePostExpansion | Gate | | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | |
| 105 | ApplyPostRegAllocWars | RegAlloc | | Fixes write-after-read hazards exposed by register allocation | |
| 106 | AdvancedPhasePostSched | Gate | | Hook after post-scheduling; no-op by default | |
| 107 | OriRemoveNopCode | Cleanup | | Removes NOP instructions and dead code inserted as placeholders | |
| 108 | OptimizeHotColdInLoop | Optimization | | Separates hot and cold paths within loops for cache locality | Hot/Cold |
| 109 | OptimizeHotColdFlow | Optimization | | Separates hot and cold paths at the function level | Hot/Cold |
| 110 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | AdvancedPhasePostFixUp | Gate | | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | |
| 112 | PlaceBlocksInSourceOrder | Cleanup | | Determines final basic block layout in the emitted binary | |
| 113 | PostFixForMercTargets | Encoding | | Fixes up instructions for Mercury encoding requirements | Mercury |
| 114 | FixUpTexDepBarAndSync | Scheduling | | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards |
| 115 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |
Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
Stage 8 -- Mercury Backend (Phases 117--122)
SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 117 | MercEncodeAndDecode | Encoding | | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury |
| 118 | MercExpandInstructions | Encoding | | Expands pseudo-instructions into final SASS instruction sequences | Mercury |
| 119 | MercGenerateWARs1 | Encoding | | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury |
| 120 | MercGenerateOpex | Encoding | | Generates "opex" (operation extension) annotations for each instruction | Mercury |
| 121 | MercGenerateWARs2 | Encoding | | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury |
| 122 | MercGenerateSassUCode | Encoding | | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |
"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121) because instruction expansion in phase 118 can introduce new write-after-read hazards. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).
Stage 9 -- Post-Mercury (Phases 123--131)
Register map computation, diagnostics, debug output.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 123 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites | |
| 124 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture |
| 125 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing | |
| 126 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary to stderr | |
| 127 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding hook; no-op by default | |
| 128 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds the code list after Mercury encoding reformats instructions | |
| 129 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly | |
| 130 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex | |
| 131 | DebuggerBreak | Cleanup | | Development hook: triggers a debugger breakpoint at this pipeline position | |
Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)
Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.
| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Cleanup | | Rebuilds IR metadata after late unsupported-op conversion | |
| 133 | MergeEquivalentConditionalFlow | Optimization | | Merges basic blocks with equivalent conditional flow (tail merging) | |
| 134 | AdvancedPhaseAfterMidExpansion | Gate | | Hook after mid-level expansion; no-op by default | |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Gate | | Hook for late sync instruction expansion; no-op by default | |
| 136 | LateMergeEquivalentConditionalFlow | Optimization | | Second conditional flow merge pass (catches cases exposed by late transforms) | |
| 137 | LateExpansionUnsupportedOpsMid | Lowering | | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization |
| 138 | OriSplitHighPressureLiveRanges | RegAlloc | | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture |
| 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0; inserts maximum wait counts at every hazard | Scoreboards |
| 140 | PostFixUp | Cleanup | | Target-specific post-fixup dispatch (calls target vtable+0x148) | |
| 141 | MercConverter | Encoding | | Initial Mercury conversion: translates Ori instructions to Mercury format (sub_9F3760) | Mercury |
| 142 | MercEncodeAndDecode | Encoding | | Encode/decode round-trip verification of SASS binary encoding (sub_18F21F0) | Mercury |
| 143 | MercExpandInstructions | Encoding | | Expands Mercury pseudo-instructions into final SASS sequences; gated by ctx+0x570 bit 5 | Mercury |
| 144 | MercGenerateWARs1 | Encoding | | WAR hazard annotation (1st pass, pre-expansion); gated by ctx+0x570 sign bit | Mercury |
| 145 | MercGenerateOpex | Encoding | | Generates operation extension annotations per instruction; gated by ctx+0x570 bit 6 | Mercury |
| 146 | MercGenerateWARs2 | Encoding | | WAR hazard annotation (2nd pass, covers hazards from expansion in phase 143) | Mercury |
| 147 | MercGenerateSassUCode | Encoding | | Final SASS microcode emission: produces the binary bytes for the ELF; gated by ctx+0x571 bit 0 | Mercury |
| 148 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites (EIATTR metadata for indirect calls) | |
| 149 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping; gated by ctx+0x590 bit 1 | RegAlloc Architecture |
| 150 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing (no-op by default, isNoOp=1) | |
| 151 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary (no-op by default, isNoOp=1) | |
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding gate; when active, sets ctx+0x610 (pipeline_progress) = 0x15 (21) to mark encoding boundary | |
| 153 | FormatCodeList | Encoding | | Formats the instruction list for ELF output; dispatches through ctx+0x648 vtable+0x10 | Mercury |
| 154 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds IR data structures after FormatCodeList reformats instructions (no-op by default, isNoOp=1) | |
| 155 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly; guarded by ctx+0x598 > 0 and ctx+0x740 non-null | |
| 156 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex; same guard as phase 155 | |
| 157 | DebuggerBreak | Cleanup | | Development hook: convenient breakpoint location for pipeline debugging (empty body in release) | |
| 158 | NOP | Cleanup | | Terminal no-op sentinel; final phase in the 159-phase pipeline | |
Phases 139--158 are the 20 late-pipeline phases. Their vtables run from off_22BEB80 to off_22BEE78 at a 40-byte stride (5 slots of 8 bytes each, so 19 strides span the 20 vtables). All 20 have names in the static table at off_22BD0C0, which holds 159 entries rather than 139. The vtable slot at +16 is isNoOp() (returns 0 for active phases, 1 for phases skipped by default); name resolution goes through the static table, indexed by the value that getIndex() at +8 returns.
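As a rough aid for navigating the binary, the vtable range quoted above can be turned into a lookup. The sketch below assumes the 20 vtables sit in phase-index order, which the matching range and stride suggest but which this page does not state outright. The function and constant names are illustrative, not symbols from ptxas.

```python
# Addresses apply to ptxas v13.0.88 only.
FIRST_LATE_PHASE = 139
LAST_LATE_PHASE = 158
FIRST_LATE_VTABLE = 0x22BEB80   # off_22BEB80
LAST_LATE_VTABLE = 0x22BEE78    # off_22BEE78
VTABLE_STRIDE = 40              # 5 slots * 8 bytes

# Slot offsets from the phase object model described on this page.
SLOTS = {"execute": 0, "getIndex": 8, "isNoOp": 16}


def late_phase_vtable(index: int) -> int:
    """Return the assumed vtable address for a late-pipeline phase (139..158)."""
    if not FIRST_LATE_PHASE <= index <= LAST_LATE_PHASE:
        raise ValueError(f"phase {index} is outside the late-pipeline range")
    # Assumption: vtables are laid out in ascending phase-index order.
    return FIRST_LATE_VTABLE + (index - FIRST_LATE_PHASE) * VTABLE_STRIDE


def slot_address(index: int, slot: str) -> int:
    """Return the address of one vtable slot (the pointer cell, not its target)."""
    return late_phase_vtable(index) + SLOTS[slot]


if __name__ == "__main__":
    for idx in (139, 147, 158):
        print(idx, hex(late_phase_vtable(idx)), hex(slot_address(idx, "isNoOp")))
```

The last phase landing exactly on off_22BEE78 is a useful sanity check on the stride: 19 * 40 = 0x2F8, and 0x22BEB80 + 0x2F8 = 0x22BEE78.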
The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing backends to selectively enable/disable encoding passes. WAR generation runs in two passes (144, 146) bracketing instruction expansion (143) because expansion can introduce new write-after-read hazards.
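The gate bits listed in the Stage 10 table can be decoded from a raw dump of the context object with a few lines of code. This is a minimal sketch under two assumptions: the field at ctx+0x570 is one byte wide, so its "sign bit" is bit 7, and the dump is available as a Python bytes-like object. The page does not say whether a set bit enables or disables the phase, so the sketch reports only the raw bit state. `MERCURY_GATES` and `read_gate_bits` are hypothetical names.

```python
# phase index -> (phase name, byte offset into ctx, bit number)
MERCURY_GATES = {
    143: ("MercExpandInstructions", 0x570, 5),
    144: ("MercGenerateWARs1", 0x570, 7),   # "sign bit": bit 7 if the field is one byte (assumption)
    145: ("MercGenerateOpex", 0x570, 6),
    147: ("MercGenerateSassUCode", 0x571, 0),
}


def read_gate_bits(ctx) -> dict:
    """Return {phase index: raw gate bit} for the gated Mercury phases.

    The result says only whether each bit is set; mapping that to
    "phase runs" or "phase is skipped" needs the polarity from the binary.
    """
    return {
        phase: bool((ctx[offset] >> bit) & 1)
        for phase, (_name, offset, bit) in MERCURY_GATES.items()
    }


if __name__ == "__main__":
    # Synthetic context dump, large enough to cover offset 0x571.
    ctx = bytearray(0x600)
    ctx[0x570] = 0b1010_0000   # bits 7 and 5 set, bit 6 clear
    ctx[0x571] = 0b0000_0001   # bit 0 set
    for phase, is_set in read_gate_bits(ctx).items():
        print(phase, MERCURY_GATES[phase][0], "set" if is_set else "clear")
```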
Pipeline Ordering Notes
Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.
Identity ordering. The default ordering table at 0x22BEEA0 is an identity mapping: exec[N] = factory[N] for all 159 phases. The factory index IS the execution order. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below). In the true 159-phase table, phases execute in strict index order 0--158.
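The identity mapping and per-phase skipping can be modelled in a few lines. This is a simplified sketch, not the actual dispatch loop (see Optimization Pipeline for that): the `Phase` class stands in for the 16-byte polymorphic objects, and `NO_OP_BY_DEFAULT` holds only the three phases the Stage 10 table marks isNoOp=1, so it is a sample rather than the complete set.

```python
PHASE_COUNT = 159

# Sample only: phases the Stage 10 table marks as isNoOp=1.
NO_OP_BY_DEFAULT = {150, 151, 154}


class Phase:
    """Stand-in for a phase object with execute/getIndex/isNoOp slots."""

    def __init__(self, index: int):
        self.index = index

    def execute(self, trace: list) -> None:
        trace.append(self.index)

    def get_index(self) -> int:
        return self.index

    def is_no_op(self) -> bool:
        return self.index in NO_OP_BY_DEFAULT


def run_pipeline(order: list) -> list:
    """Walk the ordering table; skip no-op phases instead of reordering."""
    trace = []
    for factory_index in order:
        phase = Phase(factory_index)
        if phase.is_no_op():
            continue
        phase.execute(trace)
    return trace


# Identity ordering: exec[N] = factory[N] for all 159 phases.
DEFAULT_ORDER = list(range(PHASE_COUNT))

if __name__ == "__main__":
    executed = run_pipeline(DEFAULT_ORDER)
    print(len(executed), "phases executed; skipped:", sorted(NO_OP_BY_DEFAULT))
```

Because the table is never permuted, the executed indices are always strictly increasing. Skipped phases leave gaps in the sequence but do not change the relative order of the rest.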
Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:
| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |
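A table like this can be cross-checked mechanically against an index-to-name listing. The sketch below groups phases by exact name, using only the Stage 9 and Stage 10 rows from this page as input. `PHASES` and `repeated_names` are illustrative names, and wildcard families such as GeneralOptimize* would need prefix matching that is not shown here.

```python
from collections import defaultdict

# Index -> name, copied from the Stage 9 and Stage 10 tables on this page.
PHASES = {
    123: "ComputeVCallRegUse", 124: "CalcRegisterMap",
    125: "UpdateAfterPostRegAlloc", 126: "ReportFinalMemoryUsage",
    127: "AdvancedPhaseOriPhaseEncoding", 128: "UpdateAfterFormatCodeList",
    129: "DumpNVuCodeText", 130: "DumpNVuCodeHex", 131: "DebuggerBreak",
    148: "ComputeVCallRegUse", 149: "CalcRegisterMap",
    150: "UpdateAfterPostRegAlloc", 151: "ReportFinalMemoryUsage",
    152: "AdvancedPhaseOriPhaseEncoding", 153: "FormatCodeList",
    154: "UpdateAfterFormatCodeList", 155: "DumpNVuCodeText",
    156: "DumpNVuCodeHex", 157: "DebuggerBreak",
}


def repeated_names(phases: dict) -> dict:
    """Return {name: [indices]} for every name listed at more than one index."""
    by_name = defaultdict(list)
    for index in sorted(phases):
        by_name[phases[index]].append(index)
    return {name: idxs for name, idxs in by_name.items() if len(idxs) > 1}


if __name__ == "__main__":
    for name, indices in repeated_names(PHASES).items():
        print(f"{name}: {indices}")
```

On this input the grouping also reports AdvancedPhaseOriPhaseEncoding at 127 and 152, a pair the table above does not list. Differences like that are worth checking by hand.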
Cross-References
- Optimization Pipeline -- pipeline infrastructure, PhaseManager data structures, dispatch loop
- Phase Manager Infrastructure -- PhaseManager object layout, constructor, destructor, factory switch
- GeneralOptimize Bundles -- sub-pass decomposition of compound optimization passes
- Branch & Switch Optimization -- phases 14, 15, 30, 38
- Loop Passes -- phases 18, 22, 24, 35, 59, 66, 79, 88
- Strength Reduction -- phase 21
- Copy Propagation & CSE -- phases 49, 50, 64, 83
- Predication -- phase 63
- Rematerialization -- phases 28, 54, 69
- Liveness Analysis -- phases 10, 16, 19, 33, 61, 84
- Synchronization & Barriers -- phases 25, 26, 42, 71, 72, 99, 100, 114
- Hot/Cold Partitioning -- phases 41, 108, 109
- GMMA/WGMMA Pipeline -- phases 85, 87
- Uniform Register Optimization -- phases 11, 27, 74, 86
- Late Expansion & Legalization -- phases 5, 45, 55, 78, 93, 137
- Register Allocator Architecture -- phases 101, 103, 105, 123, 124, 138, 148, 149
- Scheduler Architecture -- phases 90, 97--100, 110
- Scoreboards & Dependency Barriers -- phases 114, 115, 116
- Mercury Encoder -- phases 113, 117--122, 141--147, 153
- Optimization Levels -- O-level gating of gate passes
- DUMPIR & NamedPhases -- user-specified phase targeting and reordering
Key Functions
| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
| sub_7DDB50 | 232B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
| sub_A36360 | 52KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
| sub_A23CF0 | 54KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
| sub_9F1A90 | 35KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
| sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
| sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
| sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |