Pass Inventory & Ordering

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- phase skipping is handled per-phase via isNoOp() virtual method overrides, not by reordering the table. This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.

All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.
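The dispatch model described above can be sketched in a few lines. This is a minimal Python model, not decompiled code: the names `Phase` and `run_pipeline` are hypothetical, and only the three populated vtable slots are modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    """Toy stand-in for one phase object (hypothetical model, not the real layout)."""
    index: int           # value getIndex() (vtable +8) would return
    name: str            # entry in the static name table
    noop: bool = False   # value isNoOp() (vtable +16) would return

    def get_index(self) -> int:
        return self.index

    def is_noop(self) -> bool:
        return self.noop

    def execute(self, log: List[str]) -> None:
        # Stands in for execute() (vtable +0); here it only records the phase name.
        log.append(self.name)


def run_pipeline(phases: List[Phase]) -> List[str]:
    """Walk the table in its fixed order; skipping is decided per phase."""
    executed: List[str] = []
    for phase in phases:          # the sequence itself never changes
        if phase.is_noop():
            continue              # the phase keeps its slot but does nothing
        phase.execute(executed)
    return executed


demo = [
    Phase(3, "AnalyzeControlFlow"),
    Phase(4, "AdvancedPhaseBeforeConvUnSup", noop=True),  # gate, no-op by default
    Phase(5, "ConvertUnsupportedOps"),
]
print(run_pipeline(demo))
```

The skipped gate still occupies index 4, which is the point of the design: table positions stay stable across targets and optimization levels.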

| Metric | Value |
|---|---|
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |

Phase Categories

Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.

| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |

Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.

Numbering Discrepancy

Warning: The phase numbers 0--138 on this page use a compressed numbering scheme established before the full 159-entry name table was discovered (P2-14). The true static name table at off_22BD0C0 contains 159 entries indexed 0--158, and 16 of the 20 newly-discovered names occupy indices within the 0--138 range. In the true table, these 16 entries sit at their listed indices, and all subsequent phases shift up. The wiki's compressed numbering diverges from the true binary indices starting around phase 8.

Phases 139--158 are correctly numbered (they match the true static table indices). A full renumbering of phases 0--138 to match the true binary indices is deferred as a separate task because it would affect cross-references across 40+ wiki pages.

The 16 omitted name table entries (with their true static table indices) are:

| True Index | Name | Category | Relationship to Wiki |
|---|---|---|---|
| 22 | OriCopyProp | Optimization | Sub-pass within all 6 GeneralOptimize bundles; also injected into Mercury pipeline |
| 32 | OptimizeNaNOrZero | Optimization | Standalone NaN/zero folding pass; not documented under current wiki numbering |
| 37 | ConvertMemoryToRegisterOrUniform | Optimization | Sub-pass of GeneralOptimizeMid; gated by knob 487; sub_910840 |
| 41 | Vectorization | Optimization | Load/store vectorization; gated by DisableReadVectorization/DisableWriteVectorization knobs |
| 57 | OriCommoning | Optimization | Commoning sub-pass; related to LateOriCommoning (wiki phase 64) |
| 69 | OriSimpleLiveDead | Optimization | Liveness/DCE sub-pass; related to EarlyOriSimpleLiveDead (wiki phase 10) |
| 73 | LateVectorization | Optimization | Late vectorization (2nd instance, after optimization exposes new opportunities) |
| 77 | SinkCodeIntoBlock | Optimization | Code sinking; sub_78DB70; DisablePhases=SinkCodeIntoBlock gate |
| 103 | LateEnforceArgumentRestrictions | Lowering | Late counterpart to EnforceArgumentRestrictions (wiki phase 48) |
| 114 | ScheduleInstructions | Scheduling | Worker for AdvancedPhasePreSched; sub_8D0640 (22 KB) |
| 115 | UpdateAfterScheduleInstructions | Cleanup | IR metadata refresh after scheduling completes |
| 118 | UpdateAfterOriDoSyncronization | Cleanup | IR metadata refresh after sync insertion (wiki phase 99) |
| 120 | ReportBeforeRegisterAllocation | Reporting | DUMPIR target; diagnostic dump before register allocation |
| 122 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg; canonical allocator entry |
| 124 | UpdateAfterOriAllocateRegisters | Cleanup | IR metadata refresh after register allocation |
| 127 | PostExpansion | Lowering | Worker for AdvancedPhasePostExpansion; post-RA expansion |

All 16 are valid DUMPIR targets (resolvable through sub_C641D0 binary search over the phase name table). Several are also valid DisablePhases targets.
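A name-to-index lookup of the kind described above can be sketched with a binary search. The names and true indices below come from the table of omitted entries; everything else is an assumption, in particular that the search runs over a name-sorted view with exact, case-sensitive comparison. How sub_C641D0 actually orders and compares names is not established here, and `resolve_phase` is a hypothetical helper.

```python
from bisect import bisect_left

# (name, true static-table index) pairs taken from the table above.
OMITTED = [
    ("OriCopyProp", 22), ("OptimizeNaNOrZero", 32),
    ("ConvertMemoryToRegisterOrUniform", 37), ("Vectorization", 41),
    ("OriCommoning", 57), ("OriSimpleLiveDead", 69),
    ("LateVectorization", 73), ("SinkCodeIntoBlock", 77),
    ("LateEnforceArgumentRestrictions", 103), ("ScheduleInstructions", 114),
    ("UpdateAfterScheduleInstructions", 115),
    ("UpdateAfterOriDoSyncronization", 118),
    ("ReportBeforeRegisterAllocation", 120), ("AllocateRegisters", 122),
    ("UpdateAfterOriAllocateRegisters", 124), ("PostExpansion", 127),
]

# Assumption: the lookup works on a name-sorted copy of the table.
_SORTED = sorted(OMITTED)
_NAMES = [name for name, _ in _SORTED]
_INDICES = [index for _, index in _SORTED]


def resolve_phase(name):
    """Return the true table index for `name`, or None when it is unknown."""
    pos = bisect_left(_NAMES, name)          # O(log n) probe
    if pos < len(_NAMES) and _NAMES[pos] == name:
        return _INDICES[pos]
    return None


print(resolve_phase("AllocateRegisters"))    # index a DUMPIR target would resolve to
```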

Gate Passes (AdvancedPhase)

Seventeen phases are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.

Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.
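The override mechanism can be illustrated with a small model. This is a hedged sketch: `GatePhase`, `BackendAllocReg`, and `run` are hypothetical names, and the real override happens by swapping vtables rather than by Python subclassing.

```python
class GatePhase:
    """Default gate: holds a table slot but reports itself as a no-op."""

    def __init__(self, index, name):
        self.index = index
        self.name = name

    def is_noop(self):
        return True

    def execute(self, trace):
        pass  # nothing to do in the default vtable


class BackendAllocReg(GatePhase):
    """Hypothetical backend override for AdvancedPhaseAllocReg (phase 101)."""

    def is_noop(self):
        return False

    def execute(self, trace):
        # Stands in for dispatching to the AllocateRegisters worker.
        trace.append("AllocateRegisters")


def run(phases):
    trace = []
    for phase in phases:
        if not phase.is_noop():
            phase.execute(trace)
    return trace


default_table = [GatePhase(101, "AdvancedPhaseAllocReg")]
backend_table = [BackendAllocReg(101, "AdvancedPhaseAllocReg")]
print(run(default_table), run(backend_table))
```

Both tables hold the same phase at the same index; only the behaviour behind the slot differs, so the fixed phase order is untouched.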

The naming convention is consistent: AdvancedPhase prefix followed by the pipeline position or action name. One exception is AdvancedScoreboardsAndOpexes (phase 115), which uses Advanced without Phase.

Gate Pass Worker Correspondence

Several gate passes dispatch to named worker functions when activated by a backend. The worker names appear in the static name table and are valid DUMPIR/NamedPhases targets:

| Gate Pass (Wiki #) | Worker Function (True Table Index) | Evidence |
|---|---|---|
| AdvancedPhasePreSched (97) | ScheduleInstructions [114] | sub_8D0640, string "ScheduleInstructions" |
| AdvancedPhaseAllocReg (101) | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvancedPhasePostExpansion (104) | PostExpansion [127] | Post-RA expansion dispatch |
| AdvancedPhasePostFixUp (111) | PostFixUp [140] | Target vtable+0x148 dispatch |

See Optimization Levels for per-gate activation rules.

Update Passes

Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):

| Phase | Name | Refreshes |
|---|---|---|
| 76 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after late unsupported-op expansion |
| 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |

These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three (true N) entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.

Report Passes

Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):

| Phase | Name | Output |
|---|---|---|
| 9 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |

Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).

GeneralOptimize Bundles

The GeneralOptimize* passes are compound optimization bundles that run multiple small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until no further changes occur. They appear at 6 positions throughout the pipeline to re-clean the IR after major transformations:

| Phase | Name | Position |
|---|---|---|
| 13 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | GeneralOptimize | After early loop/branch optimizations |
| 37 | GeneralOptimizeMid | After mid-level transformations |
| 46 | GeneralOptimizeMid2 | After VTA/CTA/mbarrier expansion |
| 58 | GeneralOptimizeLate | After late expansion |
| 65 | GeneralOptimizeLate2 | After predication and late commoning |

See GeneralOptimize Bundles for the sub-pass decomposition.
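The fixed-point structure of such a bundle can be shown on a toy IR. This is a sketch under stated assumptions, not the ptxas implementation: the tuple-based IR, the sub-pass order, and every function name are invented for illustration, each register is assumed to be defined once, and algebraic simplification is left out for brevity.

```python
def operands(ins):
    """Registers read by an instruction."""
    op = ins[0]
    if op == "copy":
        return [ins[2]]
    if op == "add":
        return [ins[2], ins[3]]
    if op == "ret":
        return [ins[1]]
    return []  # "const" reads no registers


def copy_prop(ir):
    """Replace uses of a copied register with the copy's source (one step)."""
    copies = {ins[1]: ins[2] for ins in ir if ins[0] == "copy"}
    out = []
    for ins in ir:
        if ins[0] == "add":
            ins = ("add", ins[1], copies.get(ins[2], ins[2]), copies.get(ins[3], ins[3]))
        elif ins[0] == "ret":
            ins = ("ret", copies.get(ins[1], ins[1]))
        out.append(ins)
    return out


def const_fold(ir):
    """Turn an add of two known constants into a constant."""
    consts = {ins[1]: ins[2] for ins in ir if ins[0] == "const"}
    out = []
    for ins in ir:
        if ins[0] == "add" and ins[2] in consts and ins[3] in consts:
            ins = ("const", ins[1], consts[ins[2]] + consts[ins[3]])
        out.append(ins)
    return out


def dce(ir):
    """Drop definitions nobody reads (single sweep, so repetition is needed)."""
    used = {reg for ins in ir for reg in operands(ins)}
    return [ins for ins in ir if ins[0] == "ret" or ins[1] in used]


def general_optimize(ir):
    """Repeat the sub-passes until a full round changes nothing."""
    rounds = 0
    while True:
        rounds += 1
        new_ir = dce(const_fold(copy_prop(ir)))
        if new_ir == ir:
            return ir, rounds
        ir = new_ir


demo_ir = [
    ("const", "a", 2),
    ("const", "b", 3),
    ("copy", "c", "a"),
    ("add", "d", "c", "b"),
    ("ret", "d"),
]
optimized, rounds = general_optimize(demo_ir)
print(optimized, rounds)
```

The second round is what removes the now-dead constant `a`: each sub-pass exposes work for the others, which is why the bundle iterates rather than running each transformation once.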


O-Level Gating

Twenty-two phases have confirmed optimization-level gates. The O-Level column in the table below annotates every phase where the activation threshold has been verified from decompiled isNoOp() methods or execute-function guards. Phases without an O-Level annotation run at all optimization levels (O0--O5). Threshold notation: > N means the phase requires opt_level > N; == 0 means the phase is active only at O0.
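The threshold notation can be read mechanically. The helper below is a hypothetical illustration (`gate_is_active` is not a ptxas function) that handles only the two forms defined above; compound annotations in the table such as `4--5` or `> 1 / > 4` are deliberately rejected rather than guessed at.

```python
import re


def gate_is_active(gate, opt_level):
    """Evaluate an O-Level annotation such as '> 1' or '== 0' for one level."""
    if not gate:
        return True  # no annotation: the phase runs at every level
    match = re.fullmatch(r"\s*(>|==)\s*(\d+)\s*", gate)
    if match is None:
        raise ValueError(f"unsupported gate notation: {gate!r}")
    op, threshold = match.group(1), int(match.group(2))
    return opt_level > threshold if op == ">" else opt_level == threshold


# Annotations copied from the phase table; the empty string means "no gate".
GATES = {
    "GvnCse": "> 1",
    "AdvancedScoreboardsAndOpexes": "> 0",
    "ProcessO0WaitsAndSBs": "== 0",
    "OriStrengthReduce": "",
}

for level in (0, 1, 2):
    active = [name for name, gate in GATES.items() if gate_is_active(gate, level)]
    print(level, active)
```

With these annotations, `> 0` and `== 0` never hold at the same level, which matches the two mutually exclusive scoreboard paths (phases 115 and 116).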

See Optimization Levels for the complete per-phase activation table, the O-level accessor (sub_7DDB50), and the NvOpt recipe system.


Complete 159-Phase Table

Stage 1 -- Initial Setup (Phases 0--13)

Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | Validation | | Validates structural correctness of the initial Ori IR after PTX lowering | |
| 1 | ApplyNvOptRecipes | Optimization | | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | |
| 2 | PromoteFP16 | Lowering | | Promotes FP16 operations to FP32 where hardware lacks native support | |
| 3 | AnalyzeControlFlow | Analysis | | Builds the CFG: identifies loops, dominators, back edges | |
| 4 | AdvancedPhaseBeforeConvUnSup | Gate | | Hook before unsupported-op conversion; no-op by default | |
| 5 | ConvertUnsupportedOps | Lowering | | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization |
| 6 | SetControlFlowOpLastInBB | Cleanup | | Ensures control flow instructions are the final instruction in each basic block | |
| 7 | AdvancedPhaseAfterConvUnSup | Gate | | Hook after unsupported-op conversion; no-op by default | |
| 8 | OriCreateMacroInsts | Lowering | | Expands PTX-level macro instructions into Ori instruction sequences | |
| 9 | ReportInitialRepresentation | Reporting | | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | |
| 10 | EarlyOriSimpleLiveDead | Optimization | | Quick early dead code elimination pass | Liveness |
| 11 | ReplaceUniformsWithImm | Optimization | | Replaces uniform register reads with immediate constants where value is known | Uniform Regs |
| 12 | OriSanitize | Validation | | Validates IR consistency after initial setup transformations | |
| 13 | GeneralOptimizeEarly | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |

Stage 2 -- Early Optimization (Phases 14--32)

Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 14 | DoSwitchOptFirst | Optimization | > 0 | Optimizes switch statements: jump table generation, case clustering (1st pass) | Branch & Switch |
| 15 | OriBranchOpt | Optimization | > 0 | Branch folding, unreachable block elimination, conditional branch simplification | Branch & Switch |
| 16 | OriPerformLiveDeadFirst | Analysis | | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness |
| 17 | OptimizeBindlessHeaderLoads | Optimization | | Hoists and deduplicates bindless texture header loads | |
| 18 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | OriSplitLiveRanges | Optimization | | Splits live ranges at loop boundaries to reduce register pressure | Liveness |
| 20 | PerformPGO | Optimization | | Applies profile-guided optimization data (block weights, branch probabilities) | |
| 21 | OriStrengthReduce | Optimization | | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction |
| 22 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | GenerateMovPhi | Lowering | | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | |
| 24 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | StageAndFence | Lowering | | Inserts memory fence and staging instructions for coherence | Sync & Barriers |
| 26 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | AnalyzeUniformsForSpeculation | Analysis | | Identifies uniform values safe for speculative execution | Uniform Regs |
| 28 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | GeneralOptimize | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize |
| 30 | DoSwitchOptSecond | Optimization | > 0 | Second switch optimization pass after loop/branch transformations | Branch & Switch |
| 31 | OriLinearReplacement | Optimization | | Replaces branch-heavy patterns with linear (branchless) sequences | |
| 32 | CompactLocalMemory | Optimization | | Compacts local memory allocations by eliminating dead slots and reordering | |

Stage 3 -- Mid-Level Optimization (Phases 33--52)

GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 33 | OriPerformLiveDeadSecond | Analysis | | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness |
| 34 | ExtractShaderConstsFirst | Optimization | | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | |
| 35 | OriHoistInvariantsEarly | Optimization | | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes |
| 36 | EmitPSI | Lowering | | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | |
| 37 | GeneralOptimizeMid | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Optimization | > 0 | Simplifies nested conditional branches into flatter control flow | Branch & Switch |
| 39 | ConvertVTGReadWrite | Lowering | | Converts vertex/tessellation/geometry shader read/write operations | |
| 40 | DoVirtualCTAExpansion | Lowering | | Expands virtual CTA operations into physical CTA primitives | |
| 41 | MarkAdditionalColdBlocks | Analysis | | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold |
| 42 | ExpandMbarrier | Lowering | | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers |
| 43 | ForwardProgress | Lowering | | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | |
| 44 | OptimizeUniformAtomic | Optimization | | Converts thread-uniform atomic operations into warp-level reductions | |
| 45 | MidExpansion | Lowering | | Target-dependent mid-level expansion of operations before register allocation | Late Legalization |
| 46 | GeneralOptimizeMid2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Gate | | Hook before argument enforcement; no-op by default | |
| 48 | EnforceArgumentRestrictions | Lowering | | Enforces ABI restrictions on function arguments (register classes, alignment) | |
| 49 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | OriReassociateAndCommon | Optimization | | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE |
| 51 | ExtractShaderConstsFinal | Optimization | | Final shader constant extraction pass (after GVN may expose new constants) | |
| 52 | OriReplaceEquivMultiDefMov | Optimization | | Eliminates redundant multi-definition move instructions with equivalent sources | |

Stage 4 -- Late Optimization (Phases 53--77)

Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 53 | OriPropagateVaryingFirst | Optimization | | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | |
| 54 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | LateExpansion | Lowering | | Expands operations that must be lowered after high-level optimizations | Late Legalization |
| 56 | SpeculativeHoistComInsts | Optimization | | Speculatively hoists common instructions above branches | |
| 57 | RemoveASTToDefaultValues | Cleanup | | Removes AST (address space type) annotations that have been lowered to defaults | |
| 58 | GeneralOptimizeLate | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize |
| 59 | OriLoopFusion | Optimization | | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes |
| 60 | DoVTGMultiViewExpansion | Lowering | | Expands multi-view operations for vertex/tessellation/geometry shaders | |
| 61 | OriPerformLiveDeadThird | Analysis | | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness |
| 62 | OriRemoveRedundantMultiDefMov | Optimization | | Removes dead multi-definition move instructions | |
| 63 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | LateOriCommoning | Optimization | | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE |
| 65 | GeneralOptimizeLate2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize |
| 66 | OriHoistInvariantsLate | Optimization | | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes |
| 67 | DoKillMovement | Optimization | | Moves kill annotations closer to last use to improve register pressure | |
| 68 | DoTexMovement | Optimization | | Moves texture fetch instructions to minimize latency exposure | |
| 69 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | OriPropagateVaryingSecond | Optimization | | Propagates varying annotations (2nd pass, after predication changes control flow) | |
| 71 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | ConvertAllMovPhiToMov | Lowering | | Destroys SSA form: converts MOV.PHI instructions into plain MOV | |
| 74 | ConvertToUniformReg | Optimization | | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs |
| 75 | LateArchOptimizeFirst | Optimization | | Architecture-specific late optimizations (1st pass) | |
| 76 | UpdateAfterOptimize | Cleanup | | Rebuilds IR metadata invalidated by the late optimization group | |
| 77 | AdvancedPhaseLateConvUnSup | Gate | | Hook at the late unsupported-op boundary; no-op by default | |

Stage 5 -- Legalization (Phases 78--96)

Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 78 | LateExpansionUnsupportedOps | Lowering | | Expands remaining unsupported operations after all optimizations | Late Legalization |
| 79 | OriHoistInvariantsLate2 | Optimization | | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes |
| 80 | ExpandJmxComputation | Lowering | | Expands JMX (jump with index computation) pseudo-instructions | |
| 81 | LateArchOptimizeSecond | Optimization | | Architecture-specific late optimizations (2nd pass) | |
| 82 | AdvancedPhaseBackPropVReg | Gate | | Hook before backward copy propagation; no-op by default | |
| 83 | OriBackCopyPropagate | Optimization | | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE |
| 84 | OriPerformLiveDeadFourth | Analysis | | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness |
| 85 | OriPropagateGmma | Optimization | | Propagates WGMMA accumulator values through the IR | GMMA Pipeline |
| 86 | InsertPseudoUseDefForConvUR | Lowering | | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs |
| 87 | FixupGmmaSequence | Lowering | | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline |
| 88 | OriHoistInvariantsLate3 | Optimization | | LICM (late 3rd pass) after GMMA fixup | Loop Passes |
| 89 | AdvancedPhaseSetRegAttr | Gate | | Hook before register attribute setting; no-op by default | |
| 90 | OriSetRegisterAttr | Analysis | | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling |
| 91 | OriCalcDependantTex | Analysis | | Computes texture instruction dependencies for scheduling | |
| 92 | AdvancedPhaseAfterSetRegAttr | Gate | | Hook after register attribute setting; no-op by default | |
| 93 | LateExpansionUnsupportedOps2 | Lowering | | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization |
| 94 | FinalInspectionPass | Validation | | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | |
| 95 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | ReportBeforeScheduling | Reporting | | Dumps IR before scheduling (no-op unless diagnostic options enabled) | |

Stage 6 -- Scheduling & Register Allocation (Phases 97--103)

Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 97 | AdvancedPhasePreSched | Gate | | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling |
| 98 | BackPropagateVEC2D | Optimization | | Backward-propagates 2D vector register assignments | |
| 99 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | AdvancedPhaseAllocReg | Gate | | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture |
| 102 | ReportAfterRegisterAllocation | Reporting | | Dumps IR after register allocation (no-op unless diagnostic options enabled) | |
| 103 | Get64bRegComponents | RegAlloc | | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |

Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)

Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 104 | AdvancedPhasePostExpansion | Gate | | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | |
| 105 | ApplyPostRegAllocWars | RegAlloc | | Fixes write-after-read hazards exposed by register allocation | |
| 106 | AdvancedPhasePostSched | Gate | | Hook after post-scheduling; no-op by default | |
| 107 | OriRemoveNopCode | Cleanup | | Removes NOP instructions and dead code inserted as placeholders | |
| 108 | OptimizeHotColdInLoop | Optimization | | Separates hot and cold paths within loops for cache locality | Hot/Cold |
| 109 | OptimizeHotColdFlow | Optimization | | Separates hot and cold paths at the function level | Hot/Cold |
| 110 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | AdvancedPhasePostFixUp | Gate | | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | |
| 112 | PlaceBlocksInSourceOrder | Cleanup | | Determines final basic block layout in the emitted binary | |
| 113 | PostFixForMercTargets | Encoding | | Fixes up instructions for Mercury encoding requirements | Mercury |
| 114 | FixUpTexDepBarAndSync | Scheduling | | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards |
| 115 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |

Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.

Stage 8 -- Mercury Backend (Phases 117--122)

SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 117 | MercEncodeAndDecode | Encoding | | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury |
| 118 | MercExpandInstructions | Encoding | | Expands pseudo-instructions into final SASS instruction sequences | Mercury |
| 119 | MercGenerateWARs1 | Encoding | | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury |
| 120 | MercGenerateOpex | Encoding | | Generates "opex" (operation extension) annotations for each instruction | Mercury |
| 121 | MercGenerateWARs2 | Encoding | | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury |
| 122 | MercGenerateSassUCode | Encoding | | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |

"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121) because instruction expansion in phase 118 can introduce new write-after-read hazards. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).

Stage 9 -- Post-Mercury (Phases 123--131)

Register map computation, diagnostics, debug output.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 123 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites | |
| 124 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture |
| 125 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing | |
| 126 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary to stderr | |
| 127 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding hook; no-op by default | |
| 128 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds the code list after Mercury encoding reformats instructions | |
| 129 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly | |
| 130 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex | |
| 131 | DebuggerBreak | Cleanup | | Development hook: triggers a debugger breakpoint at this pipeline position | |

Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)

Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.

| # | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Cleanup | | Rebuilds IR metadata after late unsupported-op conversion | |
| 133 | MergeEquivalentConditionalFlow | Optimization | | Merges basic blocks with equivalent conditional flow (tail merging) | |
| 134 | AdvancedPhaseAfterMidExpansion | Gate | | Hook after mid-level expansion; no-op by default | |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Gate | | Hook for late sync instruction expansion; no-op by default | |
| 136 | LateMergeEquivalentConditionalFlow | Optimization | | Second conditional flow merge pass (catches cases exposed by late transforms) | |
| 137 | LateExpansionUnsupportedOpsMid | Lowering | | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization |
| 138 | OriSplitHighPressureLiveRanges | RegAlloc | | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture |
| 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0; inserts maximum wait counts at every hazard | Scoreboards |
| 140 | PostFixUp | Cleanup | | Target-specific post-fixup dispatch (calls target vtable+0x148) | |
| 141 | MercConverter | Encoding | | Initial Mercury conversion: translates Ori instructions to Mercury format (sub_9F3760) | Mercury |
| 142 | MercEncodeAndDecode | Encoding | | Encode/decode round-trip verification of SASS binary encoding (sub_18F21F0) | Mercury |
| 143 | MercExpandInstructions | Encoding | | Expands Mercury pseudo-instructions into final SASS sequences; gated by ctx+0x570 bit 5 | Mercury |
| 144 | MercGenerateWARs1 | Encoding | | WAR hazard annotation (1st pass, pre-expansion); gated by ctx+0x570 sign bit | Mercury |
| 145 | MercGenerateOpex | Encoding | | Generates operation extension annotations per instruction; gated by ctx+0x570 bit 6 | Mercury |
| 146 | MercGenerateWARs2 | Encoding | | WAR hazard annotation (2nd pass, covers hazards from expansion in phase 143) | Mercury |
| 147 | MercGenerateSassUCode | Encoding | | Final SASS microcode emission: produces the binary bytes for the ELF; gated by ctx+0x571 bit 0 | Mercury |
| 148 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites (EIATTR metadata for indirect calls) | |
| 149 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping; gated by ctx+0x590 bit 1 | RegAlloc Architecture |
| 150 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing (no-op by default, isNoOp=1) | |
| 151 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary (no-op by default, isNoOp=1) | |
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | | Phase encoding gate; when active, sets ctx+0x610 (pipeline_progress) = 0x15 (21) to mark encoding boundary | |
| 153 | FormatCodeList | Encoding | | Formats the instruction list for ELF output; dispatches through ctx+0x648 vtable+0x10 | Mercury |
| 154 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds IR data structures after FormatCodeList reformats instructions (no-op by default, isNoOp=1) | |
| 155 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly; guarded by ctx+0x598 > 0 and ctx+0x740 non-null | |
| 156 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex; same guard as phase 155 | |
| 157 | DebuggerBreak | Cleanup | | Development hook: convenient breakpoint location for pipeline debugging (empty body in release) | |
| 158 | NOP | Cleanup | | Terminal no-op sentinel; final phase in the 159-phase pipeline | |

Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride). All 20 have names in the static table at off_22BD0C0 (159 entries, not 139). The vtable slot at +16 is isNoOp() (returns 0 for active phases, 1 for phases skipped by default); name resolution goes through the static table indexed by getIndex() at +8.
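The address range quoted above follows from simple arithmetic, sketched below. The base address, stride, and slot offsets are taken from the text; the helper names are hypothetical, and the sketch assumes the 20 vtables are laid out contiguously in ascending phase order.

```python
LATE_VTABLE_BASE = 0x22BEB80   # vtable of phase 139 (off_22BEB80)
VTABLE_STRIDE = 40             # 5 slots x 8 bytes each
FIRST_LATE, LAST_LATE = 139, 158

SLOT_EXECUTE, SLOT_GET_INDEX, SLOT_IS_NOOP = 0, 8, 16


def late_vtable_address(phase_index):
    """Address of the vtable for a late-pipeline phase (139..158)."""
    if not FIRST_LATE <= phase_index <= LAST_LATE:
        raise ValueError(f"phase {phase_index} is not a late-pipeline phase")
    return LATE_VTABLE_BASE + (phase_index - FIRST_LATE) * VTABLE_STRIDE


def slot_address(phase_index, slot_offset):
    """Address of one vtable slot, e.g. isNoOp() at +16."""
    return late_vtable_address(phase_index) + slot_offset


print(hex(late_vtable_address(158)))            # last vtable in the range
print(hex(slot_address(158, SLOT_IS_NOOP)))     # its isNoOp() slot
```

Under this layout the last vtable starts at 0x22BEE78, matching the stated upper bound, and ends at 0x22BEEA0, which is the address given for the ordering table.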

The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing backends to selectively enable or disable encoding passes. WAR generation runs in two passes (144, 146). In table order both follow instruction expansion (143) and sit on either side of opex generation (145); the second pass covers write-after-read hazards introduced by expansion.


Pipeline Ordering Notes

Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.

Identity ordering. The default ordering table at 0x22BEEA0 is an identity mapping: exec[N] = factory[N] for all 159 phases, so the factory index is the execution order. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see Numbering Discrepancy above). In the true 159-phase table, phases execute in strict index order 0--158.
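A check for the identity property is short enough to state directly. This is an illustrative sketch with hypothetical names; the table contents are modelled as a Python list rather than read from the binary.

```python
PHASE_COUNT = 159


def is_identity_order(order):
    """True when slot N of the ordering table holds factory index N."""
    return len(order) == PHASE_COUNT and all(
        factory == slot for slot, factory in enumerate(order)
    )


default_order = list(range(PHASE_COUNT))    # exec[N] = factory[N]

# Hypothetical counter-example: the same phases with two slots swapped.
swapped = default_order.copy()
swapped[132], swapped[138] = swapped[138], swapped[132]

print(is_identity_order(default_order), is_identity_order(swapped))
```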

Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:

| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
| sub_7DDB50 | 232 B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
| sub_A36360 | 52 KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
| sub_A23CF0 | 54 KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
| sub_9F1A90 | 35 KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
| sub_9ED2D0 | 25 KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
| sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
| sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |