Phase Manager Infrastructure
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The PhaseManager is the central orchestration layer in ptxas. It owns the entire 159-phase optimization and code generation pipeline, constructs each phase as a polymorphic object via an abstract factory, and drives execution through a virtual dispatch loop. Every compilation unit passes through the same PhaseManager sequence: construct all 159 phase objects, iterate the phase index array calling execute() on each, optionally collect per-phase timing and memory statistics, then tear down. The PhaseManager also hosts an optional NvOptRecipe sub-manager (440 bytes) for architecture-specific "advanced phase" hooks that inject additional processing at 16 defined points in the pipeline.
The design is a textbook Strategy + Abstract Factory pattern: a 159-case switch statement maps phase indices to vtable pointers, each vtable provides execute(), isNoOp(), and getName() virtual methods, and the dispatch loop iterates a flat index array that defines execution order. This makes the pipeline fully data-driven -- reordering, disabling, or injecting phases requires only modifying the index array, not the dispatch logic.
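The pattern described above can be illustrated with a toy three-phase pipeline. This is a minimal sketch, not the ptxas implementation: the Context type, the three phase classes, makePhase, and runPipeline are all hypothetical names, and getName returns a string here for brevity even though the real method is documented as returning a name-table index.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the compilation context; records which phases ran.
struct Context {
    std::vector<std::string> trace;
};

// Interface mirroring the three virtual methods named in the text.
struct Phase {
    virtual ~Phase() = default;
    virtual void execute(Context& ctx) = 0;
    virtual bool isNoOp() const { return false; }
    virtual const char* getName() const = 0;
};

struct CheckProgram : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    const char* getName() const override { return "CheckProgram"; }
};

// Reports itself as a no-op, like an AdvancedPhase hook in its default state.
struct HookPoint : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    bool isNoOp() const override { return true; }
    const char* getName() const override { return "HookPoint"; }
};

struct Optimize : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    const char* getName() const override { return "Optimize"; }
};

// Factory: a switch maps a phase index to a concrete type.
std::unique_ptr<Phase> makePhase(int index) {
    switch (index) {
        case 0: return std::make_unique<CheckProgram>();
        case 1: return std::make_unique<HookPoint>();
        case 2: return std::make_unique<Optimize>();
        default: return nullptr;  // invalid phase index
    }
}

// Dispatch: the order array alone decides what runs and when.
// Indices in `order` must be 0..2; the sketch does no bounds checking.
std::vector<std::string> runPipeline(const std::vector<int>& order) {
    std::vector<std::unique_ptr<Phase>> phases;
    for (int i = 0; i < 3; ++i) phases.push_back(makePhase(i));
    Context ctx;
    for (int idx : order) {
        Phase& p = *phases[idx];
        if (!p.isNoOp()) p.execute(ctx);
    }
    return ctx.trace;
}
```

Calling runPipeline with the order {0, 1, 2} executes CheckProgram and Optimize and skips the hook; passing {2, 0} reorders execution without touching the loop.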
| Property | Value |
|---|---|
| Core range | 0xC60000--0xC66000 (13 functions, ~17.5 KB) |
| Constructor | sub_C62720 (4,734 bytes) |
| Destructor | sub_C61B20 (1,753 bytes) |
| Phase factory | sub_C60D30 (3,554 bytes, 159-case switch) |
| Dispatch loop | sub_C64F70 (1,455 bytes) |
| Name lookup | sub_C641D0 (305 bytes, case-insensitive binary search) |
| Timing reporter | sub_C64310 (3,168 bytes) |
| Pool reporter | sub_C62200 (888 bytes) |
| Total phases | 159 (139 explicitly named + 20 arch-specific) |
| AdvancedPhase hooks | 16 no-op-by-default insertion points |
| Default phase table | Static array at 0x22BEEA0 (returned by sub_C60D20) |
| Phase name table | Static array at off_22BD0C0 (159 string pointers) |
| Vtable range | off_22BD5C8 (phase 0) through off_22BEE78 (phase 158) |
| Callers | sub_7FB6C0 (main compilation driver), sub_9F63D0 (library/ftrace entry) |
PhaseManager Object Layout
The PhaseManager is a plain C++ object (no vtable of its own) allocated by the compilation driver. Minimum size is 112 bytes, though the full extent depends on whether timing and NvOptRecipe are enabled.
PhaseManager (112+ bytes)
+0 int64 compilation_unit // back-pointer to owning compilation unit
+8 int64* allocator // pool allocator (from compilation_unit->field_16)
+16 void* sorted_name_table // sorted {name_ptr, index} pairs for binary search
+24 int sorted_name_count
+28 int sorted_name_capacity
+32 int64* allocator2 // copy of allocator (for phase list ops)
+40 void* phase_list // array of 16-byte {phase_ptr, pool_ptr} pairs
+48 int phase_list_count // always 159 after construction
+52 int phase_list_capacity
+56 int64 nvopt_recipe_ptr // NvOptRecipe sub-manager, or NULL
+64 int64 (reserved)
+72 bool timing_enabled // set from compilation_unit->config->options[17928]
+76 int (flags/padding)
+80 bool flag_byte // initialized to 1, reset after first timing report
+88 int64* timing_allocator
+96 void* phase_name_raw_table // 159 name string pointers, copied from off_22BD0C0
+104 int phase_name_raw_count
+108 int phase_name_raw_capacity
The two allocator fields (+8 and +32) both point to the same pool allocator extracted from the compilation unit, but are used in different contexts: +8 for name table operations, +32 for phase list operations.
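One way to sanity-check the offsets listed above is to write them as a C++ struct and let the compiler verify them. This is a sketch under the assumption of a 64-bit target with natural alignment; the struct name PhaseManagerLayout and the field names reserved and flags are illustrative, and pointer types are guesses where the layout only says void* or int64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Field-for-field transcription of the documented layout (names are illustrative).
struct PhaseManagerLayout {
    std::int64_t  compilation_unit;        // +0
    std::int64_t* allocator;               // +8
    void*         sorted_name_table;       // +16
    std::int32_t  sorted_name_count;       // +24
    std::int32_t  sorted_name_capacity;    // +28
    std::int64_t* allocator2;              // +32
    void*         phase_list;              // +40
    std::int32_t  phase_list_count;        // +48
    std::int32_t  phase_list_capacity;     // +52
    std::int64_t  nvopt_recipe_ptr;        // +56
    std::int64_t  reserved;                // +64
    bool          timing_enabled;          // +72 (3 bytes of padding follow)
    std::int32_t  flags;                   // +76
    bool          flag_byte;               // +80 (7 bytes of padding follow)
    std::int64_t* timing_allocator;        // +88
    void*         phase_name_raw_table;    // +96
    std::int32_t  phase_name_raw_count;    // +104
    std::int32_t  phase_name_raw_capacity; // +108
};

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(offsetof(PhaseManagerLayout, allocator2) == 32, "allocator2 offset");
static_assert(offsetof(PhaseManagerLayout, nvopt_recipe_ptr) == 56, "recipe pointer offset");
static_assert(offsetof(PhaseManagerLayout, timing_enabled) == 72, "timing flag offset");
static_assert(offsetof(PhaseManagerLayout, flag_byte) == 80, "flag byte offset");
static_assert(offsetof(PhaseManagerLayout, timing_allocator) == 88, "timing allocator offset");
static_assert(sizeof(PhaseManagerLayout) == 112, "matches the documented minimum size");
```

If any offset in the table were mistyped, the corresponding static_assert would fail at compile time.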
Phase Object Model
Each phase is a 16-byte polymorphic object:
Phase (16 bytes)
+0 vtable* // points to one of 159 vtable instances
+8 void* // pool pointer (memory pool for phase-local allocations)
The vtable provides the interface contract:
| Vtable offset | Method | Signature |
|---|---|---|
| +0 | execute | void execute(phase*, compilation_context*) |
+8 | isNoOp | bool isNoOp(phase*) -- returns true to skip execution |
+16 | getName | int getName(phase*) -- returns index into name table |
+24 | alloc | void* alloc(pool*, size_t) -- pool allocator |
+32 | free | void free(pool*, void*) -- pool deallocator |
The vtable addresses span off_22BD5C8 (phase 0) through off_22BEE78 (phase 158), with a stride of 0x28 (40 bytes) between consecutive entries. All vtables reside in .data.rel.ro.
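The stride relationship can be checked arithmetically. The following sketch uses only the addresses quoted above; the constant and function names (kVtableBase, vtableAddress, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kVtableBase   = 0x22BD5C8;  // off_22BD5C8, phase 0
constexpr std::uint64_t kVtableStride = 0x28;       // 5 slots * 8 bytes
constexpr int           kPhaseCount   = 159;

// Address of the vtable for a given phase index, assuming a uniform stride.
constexpr std::uint64_t vtableAddress(int phase_index) {
    return kVtableBase + static_cast<std::uint64_t>(phase_index) * kVtableStride;
}

static_assert(vtableAddress(1) == 0x22BD5F0, "phase 1 (ApplyNvOptRecipes)");
static_assert(vtableAddress(2) == 0x22BD618, "phase 2 (PromoteFP16)");
static_assert(vtableAddress(kPhaseCount - 1) == 0x22BEE78, "phase 158");
```

The last assertion confirms that 158 strides of 0x28 from the base land exactly on off_22BEE78, so the quoted range holds 159 evenly spaced vtables.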
Phase Factory -- sub_C60D30
The factory is a 159-case switch statement that serves as the sole point of phase instantiation. For each case:
- Extracts the pool allocator from context->field_16
- Allocates 16 bytes via pool_alloc (vtable offset +24)
- Writes the case-specific vtable pointer at offset +0
- Returns a {phase_ptr, pool_ptr} pair
The default case returns {NULL, NULL}, which the caller treats as an invalid phase index.
// Pseudocode for sub_C60D30
pair<phase*, pool*> PhaseFactory(int phase_index, context* ctx) {
pool* p = ctx->allocator;
phase* obj = p->alloc(16);
switch (phase_index) {
case 0: obj->vtable = off_22BD5C8; break; // OriCheckInitialProgram
case 1: obj->vtable = off_22BD5F0; break; // ApplyNvOptRecipes
case 2: obj->vtable = off_22BD618; break; // PromoteFP16
// ... 156 more cases ...
case 158: obj->vtable = off_22BEE78; break; // sentinel/NOP
default: return {NULL, NULL};
}
return {obj, p};
}
Called exclusively by the constructor (sub_C62720).
Construction Sequence -- sub_C62720
The constructor performs 11 steps, building all internal data structures and instantiating every phase; the pseudocode below condenses them into five numbered stages:
// Pseudocode for sub_C62720
bool PhaseManager::construct(compilation_unit* cu) {
this->cu = cu;
this->allocator = cu->field_16; // extract pool allocator
this->allocator2 = cu->field_16;
// 1. Check timing flag
this->timing_enabled = cu->config->options[17928];
// 2. Allocate and copy phase name table (1272 = 159 * 8 bytes)
this->phase_name_raw_table = alloc(1272);
memcpy(this->phase_name_raw_table, off_22BD0C0, 1272);
this->phase_name_raw_count = 159;
this->phase_name_raw_capacity = 159;
// 3. Initialize timing records
resize_timing(/*capacity=*/159); // sub_C62580
cu->timing_count++; // at cu+1576
append_timing({index=-1, name=0x2030007, time=0, flags=0}); // sentinel
// 4. Create all 159 phase objects
resize_phase_list(/*capacity=*/159); // sub_C62640
for (int i = 0; i < 159; i++) {
auto [phase, pool] = PhaseFactory(i, cu); // sub_C60D30
phase_list[i] = {phase, pool};
}
// 5. Optionally create NvOptRecipe sub-manager
if (cu->config->getOption(391)) {
auto* recipe = alloc(440);
// initialize hash table, ref-counted lists, timing arrays (8 entries)
// inherit phase chain from previous execution context
this->nvopt_recipe_ptr = recipe;
}
this->flag_byte = 1;
return true;
}
Key constants:
- 159 -- total phase count, used as loop bound and array capacities
- 1272 -- 159 * 8, phase name pointer table size in bytes
- 440 -- NvOptRecipe sub-manager object size
- 0x2030007 (33,751,047) -- timing sentinel magic value
- Option 17928 -- enables per-phase timing/memory reporting
- Option 391 -- enables NvOptRecipe sub-manager
Destruction Sequence -- sub_C61B20
Teardown mirrors construction in reverse order, with careful handling of the NvOptRecipe's reference-counted shared state:
// Pseudocode for sub_C61B20
void PhaseManager::destroy() {
// 1. Free raw name table
timing_allocator->free(phase_name_raw_table);
// 2. Tear down NvOptRecipe if present
if (nvopt_recipe_ptr) {
auto* r = nvopt_recipe_ptr;
// decrement shared_list ref-count at +432
if (--r->shared_list_refcount == 0)
free_list_nodes(r->shared_list);
free(r->hash_buckets); // +408
free(r->sorted_array); // +376
free(r->timing_records); // +344, stride=584 per entry
free(r->node_pool); // +16
free(r);
}
// 3. Release each phase through the pool deallocator slot (vtable+32)
for (int i = 0; i < phase_list_count; i++) {
auto [phase, pool] = phase_list[i];
pool->free(phase); // invokes vtable+32
}
// 4. Free base arrays
allocator2->free(phase_list);
allocator->free(sorted_name_table);
}
The ref-count decrement-and-destroy pattern on shared_list at +432 follows C++ shared_ptr semantics: the NvOptRecipe may share state across multiple compilation units in library mode.
Phase Dispatch Loop -- sub_C64F70
The dispatch loop is the runtime engine. It takes a slice of the phase index array and executes each phase in order:
// Pseudocode for sub_C64F70
bool PhaseManager::dispatch(int* phase_indices, int count) {
memory_snapshot_t base_snap;
take_snapshot(&base_snap); // sub_8DADE0
for (int i = 0; i < count; i++) {
int idx = phase_indices[i];
phase* p = this->phase_list[idx].phase;
// Resolve phase name
int name_idx = p->getName(); // vtable+16
const char* name = this->phase_name_raw_table[name_idx];
// Record timing entry
append_timing({idx, name, opt_level, flags, metrics});
// Take pre-execution snapshot
memory_snapshot_t pre_snap;
take_snapshot(&pre_snap);
// Execute (unless no-op)
if (!p->isNoOp()) { // vtable+8
p->execute(this->cu); // vtable+0
// Construct diagnostic: "Before <name>" or "After <name>"
}
// Report per-phase stats
if (this->timing_enabled) {
report_phase_stats(name, &pre_snap, false); // sub_C64310
this->flag_byte = 0;
}
}
// Summary after all phases
if (this->timing_enabled) {
report_phase_stats("All Phases Summary", &base_snap, true);
report_pool_consumption(); // sub_C62200
}
return true;
}
The "Before " / "After " diagnostic prefixes are built from immediates rather than copied from a string table. The 7-character string "Before " plus its NUL terminator fits in 8 bytes, and read as a little-endian 64-bit integer it is 0x2065726F666542, so the compiler emits a single 8-byte mov instead of a memcpy.
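The constant can be reproduced by packing the characters of the string into a 64-bit value with the first character in the lowest byte. The helper name packLittleEndian is hypothetical; the sketch needs C++14 or later for the constexpr loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack up to 8 characters into a 64-bit value, first character in the lowest byte.
// Bytes past the end of the string stay zero, which doubles as the NUL terminator.
constexpr std::uint64_t packLittleEndian(const char* s) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8 && s[i] != '\0'; ++i) {
        value |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i])) << (8 * i);
    }
    return value;
}

static_assert(packLittleEndian("Before ") == 0x2065726F666542ULL,
              "matches the immediate quoted in the text");
```

Reading the hex digits in byte pairs from right to left gives 42 65 66 6F 72 65 20, which is "Before " in ASCII.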
Phase Name Lookup -- sub_C641D0
External callers (e.g., --ftrace-phase-after option processing in sub_9F4040) resolve phase names to indices through a case-insensitive binary search:
// Pseudocode for sub_C641D0
int PhaseManager::lookup_phase(const char* name) {
ensure_sorted(); // sub_C63FA0
int lo = 0, hi = sorted_name_count - 1;
while (lo <= hi) {
int mid = (lo + hi) / 2;
int cmp = strcasecmp(sorted_name_table[mid].name, name);
if (cmp == 0) return sorted_name_table[mid].index;
if (cmp < 0) lo = mid + 1;
else hi = mid - 1;
}
return 158; // sentinel: last phase (NOP)
}
The sorted name table is rebuilt on demand by sub_C63FA0 when the raw count differs from the sorted count. Sorting uses an iterative quicksort (sub_C639A0) with median-of-three pivot selection and three-way partitioning. The sort stack is pre-allocated to 33 entries, well above the ceil(log2(160)) = 8 levels that a 159-entry table needs.
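The lookup and its miss behavior can be sketched as follows. This is an illustration, not the decompiled routine: NameEntry, compareNoCase, buildSortedTable, and lookupPhase are hypothetical names, std::sort stands in for the custom quicksort, and a hand-written comparison stands in for the case-insensitive compare.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct NameEntry {
    std::string name;
    int index;  // original phase index
};

constexpr int kNotFoundIndex = 158;  // documented miss value: the last phase

// Case-insensitive three-way comparison (negative, zero, positive).
int compareNoCase(const std::string& a, const std::string& b) {
    std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        int ca = std::tolower(static_cast<unsigned char>(a[i]));
        int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca - cb;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

// Build {name, index} pairs from the raw table and sort them by name.
std::vector<NameEntry> buildSortedTable(const std::vector<std::string>& raw) {
    std::vector<NameEntry> table;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        table.push_back({raw[i], static_cast<int>(i)});
    }
    std::sort(table.begin(), table.end(),
              [](const NameEntry& x, const NameEntry& y) {
                  return compareNoCase(x.name, y.name) < 0;
              });
    return table;
}

// Binary search; returns the original phase index or kNotFoundIndex on a miss.
int lookupPhase(const std::vector<NameEntry>& table, const std::string& name) {
    int lo = 0;
    int hi = static_cast<int>(table.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = compareNoCase(table[mid].name, name);
        if (cmp == 0) return table[mid].index;
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return kNotFoundIndex;
}
```

Note the consequence of the miss value: a caller cannot distinguish an unknown name from a genuine request for phase 158 without a separate check.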
Per-Phase Timing and Memory Reporting
When timing is enabled (option 17928), the dispatch loop calls sub_C64310 after each phase to print memory statistics:
<indent><phase_name> :: [Total 1234 KB] [Freeable 567 KB] [Freeable Leaked 12 KB] (2%)
The reporter computes three memory deltas from snapshot pairs:
| Metric | Helper | Meaning |
|---|---|---|
| Total | sub_8DAE20 | Total memory allocated since snapshot |
| Freeable | sub_8DAE30 | Memory eligible for release |
| Freeable Leaked | sub_8DAE40 | Freeable memory not actually released |
Size formatting thresholds:
- 0--1023: raw bytes (suffix B)
- 1024--10,485,760: kilobytes with 3 decimal places (suffix KB)
- above 10 MB: megabytes with 3 decimal places (suffix MB)
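The thresholds above could be implemented along these lines. This is a sketch under assumptions: the function name formatSize is hypothetical, the divisor is taken to be 1024, the upper KB bound is treated as inclusive, and the exact spacing of the output is a guess rather than a transcription of sub_C64310.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a byte count using the three documented ranges.
std::string formatSize(std::uint64_t bytes) {
    constexpr std::uint64_t kKiB = 1024;
    constexpr std::uint64_t kKbLimit = 10ULL * 1024 * 1024;  // 10,485,760
    char buf[64];
    if (bytes < kKiB) {
        std::snprintf(buf, sizeof buf, "%llu B",
                      static_cast<unsigned long long>(bytes));
    } else if (bytes <= kKbLimit) {
        std::snprintf(buf, sizeof buf, "%.3f KB",
                      static_cast<double>(bytes) / 1024.0);
    } else {
        std::snprintf(buf, sizeof buf, "%.3f MB",
                      static_cast<double>(bytes) / (1024.0 * 1024.0));
    }
    return buf;
}
```

With these assumptions, 1536 bytes formats as 1.500 KB and 20 * 1024 * 1024 bytes formats as 20.000 MB.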
After all phases complete, the loop prints an "All Phases Summary" line using the same reporter, then calls sub_C62200 to print the pool consumption total:
[Pool Consumption = 45.678 MB]
Timing Record Format
Each timing entry is 32 bytes:
Timing Record (32 bytes)
+0 int phase_index // -1 for sentinel
+8 int64 phase_name // string pointer, or 0x2030007 for sentinel
+16 int64 timing_value // elapsed time
+24 int memory_flags // opt level / additional metrics
Records are stored in a growable array at compilation_unit+1560. Growth uses a 1.5x strategy: new_capacity = max(old + old/2 + 1, requested).
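The growth rule is small enough to state as a function. The name growCapacity is hypothetical; the formula is the one given above, with integer division as in the original expression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// new_capacity = max(old + old/2 + 1, requested)
constexpr std::int32_t growCapacity(std::int32_t old_capacity, std::int32_t requested) {
    return std::max(old_capacity + old_capacity / 2 + 1, requested);
}

static_assert(growCapacity(0, 159) == 159, "initial resize to the phase count");
static_assert(growCapacity(159, 160) == 239, "one-past-capacity grows by ~1.5x");
static_assert(growCapacity(8, 9) == 13, "8 + 4 + 1");
```

The +1 term guarantees progress even from a capacity of 0 or 1, where old/2 alone would add nothing.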
NvOptRecipe Sub-Manager (440 bytes)
When option 391 is enabled, the constructor creates a 440-byte NvOptRecipe sub-manager and stores its pointer at PhaseManager+56. This object provides the runtime for "AdvancedPhase" hooks -- the 16 phases that are no-ops by default but can be activated for architecture-specific or optimization-level-specific processing. The NvOpt level (0--5) controls per-phase aggressiveness independently of the -O CLI level: -O gates which phases run at all, while the NvOpt level controls how aggressively active phases behave.
Object Layout
NvOptRecipe (440 bytes)
+0 int64 compilation_unit // back-pointer to owning CU
+8 int64 phase_manager_backref // back-pointer to PhaseManager
+16 void* node_pool // 24-byte ref-counted list node
+24 int64 secondary_bucket_count // secondary hash (migration buffer)
+32 void* secondary_bucket_array // secondary hash bucket array
+40 int64 secondary_total_entries // secondary hash entry count
+48 (264 B) [opaque internal region] // +48..+311 undecoded
+312 int64 recipe_data // from option 391 value (ext. pointer)
+320 int64 (reserved) // zeroed in constructor
+328 (8 B) [alignment gap]
+336 int64 allocator // cu->field_16->field_16
+344 void* timing_records // stride = 584 bytes per entry
+352 int32 timing_count // init -1 (empty sentinel)
+356 int32 timing_flags // init 0
+360 int32 timing_extra // init 0
+364 (4 B) (padding)
+368 int64* timing_allocator // cu->field_16->field_16 copy
+376 void* sorted_array // 4-byte entries, init capacity = 8
+384 int32 sorted_count // init 7 (pre-filled)
+388 int32 sorted_capacity // init 8
+392 void* ref_counted_list_2 // 24-byte ref-counted list node
+400 int32 hash_bucket_count // primary hash table bucket count
+404 (4 B) (padding)
+408 void* hash_buckets // primary hash, 24-byte stride/bucket
+416 int64 hash_size // total entries across all buckets
+424 (8 B) (padding)
+432 void* shared_list_ptr // ref-counted, shared across CUs
Sub-Structures
Ref-Counted List Node (24 bytes) -- used at +16, +392, +432:
RefCountedListNode (24 bytes)
+0 int64 refcount // manual shared_ptr: decrement-and-destroy
+8 void* next // singly-linked list chain
+16 void* allocator // for self-deallocation when refcount → 0
When the refcount reaches zero, the destructor walks the next chain freeing each node, then frees the head node itself through the allocator at +16.
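The decrement-and-destroy behavior described above can be sketched like this. CountingAllocator and releaseList are hypothetical names, the allocator counts frees only so the behavior is observable, and freeing chained nodes through the head's allocator is an assumption (the text only states that the head is freed through the allocator at +16).

```cpp
#include <cassert>
#include <cstdint>

struct CountingAllocator;

// 24-byte node matching the documented field order on a 64-bit target.
struct RefCountedListNode {
    std::int64_t        refcount;   // +0
    RefCountedListNode* next;       // +8
    CountingAllocator*  allocator;  // +16
};
static_assert(sizeof(RefCountedListNode) == 24, "sketch assumes a 64-bit target");

// Hypothetical allocator that counts how many nodes it has freed.
struct CountingAllocator {
    int frees = 0;
    RefCountedListNode* create(RefCountedListNode* next) {
        return new RefCountedListNode{1, next, this};  // refcount starts at 1
    }
    void release(RefCountedListNode* node) {
        delete node;
        ++frees;
    }
};

// Drop one reference. Returns true when this call freed the whole list.
bool releaseList(RefCountedListNode* head) {
    if (--head->refcount != 0) return false;
    CountingAllocator* a = head->allocator;
    for (RefCountedListNode* n = head->next; n != nullptr;) {
        RefCountedListNode* following = n->next;  // read before freeing
        a->release(n);
        n = following;
    }
    a->release(head);  // head goes last, through its own allocator pointer
    return true;
}
```

With a refcount of 2, the first releaseList call frees nothing and the second frees every node in the chain.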
Hash Bucket Entry (24 bytes) -- array at +408:
HashBucketEntry (24 bytes)
+0 void* chain_head // first element in bucket chain
+8 void* chain_sentinel // end-of-chain sentinel
+16 int32 chain_count // number of elements in this bucket
Timing Record (584 bytes) -- array at +344:
TimingRecord (584 bytes)
+0 (40 B) header
+40 void* sub_allocator // allocator for sub-data at +48
+48 void* sub_data // freed during cleanup
+56 int32 sub_count // set to -1 when cleaned
+60 int32 cleanup_flag // if >= 0: sub_data exists, free it
+64 (520 B) timing/metric data
Records are iterated backward during cleanup (base + 584 * (count + 1) - 584 down to base). The sentinel value -1 at offset +56 marks an entry as already cleaned up.
Construction Sequence
The constructor (sub_C62720, lines 356--850 in decompilation) performs these steps:
1. Check option 391 -- fast path: *(config_obj[9] + 28152) != 0; slow path: virtual call with argument 391. If disabled, skip entirely.
2. Read option 391 value -- the value is the recipe_data pointer. Fast path checks type tag 5 (int64) at config offset 28152, reads the 64-bit value at offset 28160. This is an externally-provided pointer, not computed locally.
3. Allocate 440 bytes from the pool allocator at compilation_unit->field_16.
4. Initialize core fields -- back-pointers at +0/+8, node_pool at +16 (24-byte ref-counted node, refcount=1), zero +24/+32/+40, store recipe_data at +312.
5. Initialize timing -- zero +344, set +352 to -1 (empty sentinel), zero +360, copy allocator to +336 and +368.
6. Allocate sorted_array -- initial capacity 8 entries (32 bytes), pre-fill 7 entries, set +384 = 7, +388 = 8.
7. Allocate ref_counted_list_2 at +392 (24-byte node, refcount=1), zero +400/+408/+416.
8. Allocate shared_list at +432 (24-byte node, refcount=1).
9. Inherit from previous recipe -- if PhaseManager+56 already holds an NvOptRecipe from a prior compilation unit:
   - Decrement old shared_list refcount; free if zero
   - Migrate hash bucket chains from old recipe to new ref_counted_list_2
   - Walk old timing records backward (stride 584), freeing sub-allocations
   - Drain old secondary hash table, release old node_pool
   - Free old NvOptRecipe object
10. Install -- set PhaseManager+56 = new recipe, PhaseManager+64 = allocator.
Destruction Sequence
The destructor (sub_C61B20) tears down the recipe in reverse:
- Decrement shared_list_ptr (+432) refcount; free linked nodes if zero
- Walk hash buckets (+408, stride 24, count from +416): for each chain element, clean sub-entries (timing at offsets +56/+60/+64/+76), decrement per-entry refcounts at element[9], append to ref_counted_list_2; zero bucket; reset +400 to 0
- Clean up ref_counted_list_2 (+392); free if refcount zero
- Free sorted_array (+376) if sorted_capacity (+388) >= 0
- Walk timing_records (+344) backward, stride 584, freeing sub-allocations; reset +352 to -1
- Drain secondary hash (+24/+32/+40), move chains to node_pool
- Release node_pool (+16); free if refcount zero
- Free the 440-byte object via PhaseManager+64 allocator
NvOpt Level Validation
The recipe application function sub_C173E0 validates the NvOpt level at each recipe record:
// At sub_C173E0 + 0x2FD9 (line 1431)
int nvopt_level = *(int*)(recipe_record + 344);
if (nvopt_level > 5) {
emit_warning(cu + 1232, 8000, "Invalid nvopt level : %d.", nvopt_level);
// warning 8000 (0x1F40) -- non-fatal, compilation continues
}
Valid levels are 0--5. The level is consumed as a bitmask 1 << nvopt_level, passed to a vtable call that dispatches on a recipe configuration byte at target descriptor offset 35280 (8-case switch: cases 0--5, 7). This byte controls which recipe application mode is used for the target architecture.
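The level check and the bitmask form can be sketched as two small functions. The names are hypothetical, and returning 0 for an out-of-range level is an assumption of this sketch: the text says only that the warning is non-fatal, not what mask an invalid level produces.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kMaxNvOptLevel = 5;

// Mirrors the documented check: only levels above 5 trigger warning 8000.
constexpr bool triggersInvalidLevelWarning(int level) {
    return level > kMaxNvOptLevel;
}

// Bitmask form of a level (1 << level) for levels 0..5.
// Out-of-range input yields 0 here; that choice is an assumption.
constexpr std::uint32_t nvoptLevelMask(int level) {
    return (level >= 0 && level <= kMaxNvOptLevel)
               ? (std::uint32_t{1} << level)
               : 0u;
}

static_assert(nvoptLevelMask(0) == 0x01, "level 0");
static_assert(nvoptLevelMask(5) == 0x20, "level 5");
static_assert(!triggersInvalidLevelWarning(5), "5 is the maximum valid level");
static_assert(triggersInvalidLevelWarning(6), "6 is rejected");
```

Because each level maps to a distinct bit, a recipe consumer can test membership in a set of levels with a single AND.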
Shared State in Library Mode
The shared_list at +432 enables recipe state persistence across compilation units in library mode (multiple .ptx files compiled by one ptxas invocation):
- Each new NvOptRecipe sets its shared_list refcount to 1
- During inheritance (step 9), hash bucket contents are migrated from the old recipe to the new one, accumulating per-kernel recipe decisions
- When a PhaseManager is destroyed, the recipe decrements the shared_list refcount; only the last reference frees the nodes
- This allows the NvOptRecipe to cache per-kernel optimization decisions across compilation passes
Key Constants
| Value | Meaning |
|---|---|
| 440 | NvOptRecipe object size (bytes) |
| 584 | Per-entry timing record stride (bytes) |
| 24 | Hash bucket entry size / ref-counted list node size |
| 8 | Initial sorted_array capacity |
| 7 | Initial sorted_count (pre-filled entries) |
| 391 | Option ID (enables NvOptRecipe; value = recipe data pointer) |
| 28152 | Option 391 type-tag offset in config storage |
| 28160 | Option 391 value offset (8 bytes after type tag) |
| 0x1F40 | Warning code 8000: "Invalid nvopt level" |
| 5 | Maximum valid NvOpt level |
| 35280 | Recipe config byte offset in target descriptor |
Multi-Function Dispatch -- sub_C60BD0
When a compilation unit contains more than one function, sub_C60BD0 redirects to a per-function dispatch path:
// Pseudocode for sub_C60BD0
void PhaseManager::invoke_multi(compilation_unit* cu) {
int func_count = get_function_count(cu); // sub_7DDB50
if (func_count > 1) {
auto list1 = create_refcounted_list();
auto list2 = create_refcounted_list();
this->phase_chain = current_chain; // +88
per_function_dispatch(cu, list1, list2); // sub_790A40
release(list1);
release(list2);
}
}
Complete Phase Table
Stage numbering. The 7 groups below are a coarse summary of the 159-phase OCG pipeline. The authoritative fine-grained grouping is the 10-stage scheme in the Pass Inventory (OCG-Stage 1--10). The 7-group table here collapses several of those stages for brevity; phase boundaries differ slightly. When citing a stage by number, prefer the Pass Inventory's 10-stage numbering.
Group 1: Initial Setup (phases 0--12)
| Index | Phase Name | Purpose |
|---|---|---|
| 0 | OriCheckInitialProgram | Validate initial Ori IR |
| 1 | ApplyNvOptRecipes | Apply NvOptRecipe transformations |
| 2 | PromoteFP16 | Promote FP16 operations where beneficial |
| 3 | AnalyzeControlFlow | Build/analyze control flow graph |
| 4 | AdvancedPhaseBeforeConvUnSup | Hook -- before unsupported op conversion |
| 5 | ConvertUnsupportedOps | Lower unsupported operations to supported sequences |
| 6 | SetControlFlowOpLastInBB | Mark control flow ops as last in basic block |
| 7 | AdvancedPhaseAfterConvUnSup | Hook -- after unsupported op conversion |
| 8 | OriCreateMacroInsts | Create macro instruction patterns |
| 9 | ReportInitialRepresentation | Diagnostic dump of initial IR |
| 10 | EarlyOriSimpleLiveDead | Early dead code elimination |
| 11 | ReplaceUniformsWithImm | Replace uniform register loads with immediates |
| 12 | OriSanitize | IR consistency checks |
Group 2: Early Optimization (phases 13--36)
| Index | Phase Name | Purpose |
|---|---|---|
| 13 | GeneralOptimizeEarly | First GeneralOptimize pass (peephole + simplify) |
| 14 | DoSwitchOptFirst | Switch statement optimization, first pass |
| 15 | OriBranchOpt | Branch simplification and folding |
| 16 | OriPerformLiveDeadFirst | Liveness analysis, first pass |
| 17 | OptimizeBindlessHeaderLoads | Optimize bindless texture header loads |
| 18 | OriLoopSimplification | Canonicalize loop structure |
| 19 | OriSplitLiveRanges | Split long live ranges to reduce pressure |
| 20 | PerformPGO | Apply profile-guided optimizations |
| 21 | OriStrengthReduce | Strength reduction on induction variables |
| 22 | OriLoopUnrolling | Loop unrolling |
| 23 | GenerateMovPhi | Convert phi nodes to MOV-phi representation |
| 24 | OriPipelining | Software pipelining of loops |
| 25 | StageAndFence | Memory staging and fence insertion |
| 26 | OriRemoveRedundantBarriers | Remove unnecessary barrier instructions |
| 27 | AnalyzeUniformsForSpeculation | Identify uniform values for speculative execution |
| 28 | SinkRemat | Sink rematerializable instructions |
| 29 | GeneralOptimize | Main GeneralOptimize pass |
| 30 | DoSwitchOptSecond | Switch optimization, second pass |
| 31 | OriLinearReplacement | Replace complex patterns with linear sequences |
| 32 | CompactLocalMemory | Compact local memory layout |
| 33 | OriPerformLiveDeadSecond | Liveness analysis, second pass |
| 34 | ExtractShaderConstsFirst | Extract shader constants, first pass |
| 35 | OriHoistInvariantsEarly | Early loop-invariant hoisting |
| 36 | EmitPSI | Emit program state information |
Group 3: Mid-Level Optimization (phases 37--58)
| Index | Phase Name | Purpose |
|---|---|---|
| 37 | GeneralOptimizeMid | Mid-pipeline GeneralOptimize |
| 38 | OptimizeNestedCondBranches | Simplify nested conditional branches |
| 39 | ConvertVTGReadWrite | Convert vertex/tessellation/geometry read/write ops |
| 40 | DoVirtualCTAExpansion | Expand virtual CTA operations |
| 41 | MarkAdditionalColdBlocks | Mark additional basic blocks as cold |
| 42 | ExpandMbarrier | Expand mbarrier intrinsics |
| 43 | ForwardProgress | Ensure forward progress guarantees |
| 44 | OptimizeUniformAtomic | Optimize uniform atomic operations |
| 45 | MidExpansion | Mid-pipeline lowering and expansion |
| 46 | GeneralOptimizeMid2 | Second mid-pipeline GeneralOptimize |
| 47 | AdvancedPhaseEarlyEnforceArgs | Hook -- before argument restrictions |
| 48 | EnforceArgumentRestrictions | Enforce ABI argument constraints |
| 49 | GvnCse | Global value numbering and common subexpression elimination |
| 50 | OriReassociateAndCommon | Reassociation and commoning |
| 51 | ExtractShaderConstsFinal | Extract shader constants, final pass |
| 52 | OriReplaceEquivMultiDefMov | Replace equivalent multi-def MOVs |
| 53 | OriPropagateVaryingFirst | Varying propagation, first pass |
| 54 | OriDoRematEarly | Early rematerialization |
| 55 | LateExpansion | Late lowering of complex operations |
| 56 | SpeculativeHoistComInsts | Speculatively hoist common instructions |
| 57 | RemoveASTToDefaultValues | Remove AST nodes set to default values |
| 58 | GeneralOptimizeLate | Late GeneralOptimize |
Group 4: Late Optimization (phases 59--95)
| Index | Phase Name | Purpose |
|---|---|---|
| 59 | OriLoopFusion | Fuse compatible loops |
| 60 | DoVTGMultiViewExpansion | Expand multi-view VTG operations |
| 61 | OriPerformLiveDeadThird | Liveness analysis, third pass |
| 62 | OriRemoveRedundantMultiDefMov | Remove redundant multi-def MOVs |
| 63 | OriDoPredication | If-conversion / predication |
| 64 | LateOriCommoning | Late value commoning |
| 65 | GeneralOptimizeLate2 | Second late GeneralOptimize |
| 66 | OriHoistInvariantsLate | Late invariant hoisting |
| 67 | DoKillMovement | Move kill instructions for better scheduling |
| 68 | DoTexMovement | Move texture instructions for latency hiding |
| 69 | OriDoRemat | Main rematerialization pass |
| 70 | OriPropagateVaryingSecond | Varying propagation, second pass |
| 71 | OptimizeSyncInstructions | Optimize synchronization instructions |
| 72 | LateExpandSyncInstructions | Expand sync instructions to HW sequences |
| 73 | ConvertAllMovPhiToMov | Convert all MOV-phi to plain MOV |
| 74 | ConvertToUniformReg | Promote values to uniform registers |
| 75 | LateArchOptimizeFirst | Architecture-specific late optimization, first pass |
| 76 | UpdateAfterOptimize | Post-optimization bookkeeping |
| 77 | AdvancedPhaseLateConvUnSup | Hook -- before late unsupported op expansion |
| 78 | LateExpansionUnsupportedOps | Late lowering of unsupported operations |
| 79 | OriHoistInvariantsLate2 | Second late invariant hoisting |
| 80 | ExpandJmxComputation | Expand JMX (join/merge) computations |
| 81 | LateArchOptimizeSecond | Architecture-specific late optimization, second pass |
| 82 | AdvancedPhaseBackPropVReg | Hook -- before back-copy propagation |
| 83 | OriBackCopyPropagate | Backward copy propagation |
| 84 | OriPerformLiveDeadFourth | Liveness analysis, fourth pass |
| 85 | OriPropagateGmma | GMMA/WGMMA propagation |
| 86 | InsertPseudoUseDefForConvUR | Insert pseudo use/def for uniform reg conversion |
| 87 | FixupGmmaSequence | Fix up GMMA instruction sequences |
| 88 | OriHoistInvariantsLate3 | Third late invariant hoisting |
| 89 | AdvancedPhaseSetRegAttr | Hook -- before register attribute setting |
| 90 | OriSetRegisterAttr | Set register attributes (types, constraints) |
| 91 | OriCalcDependantTex | Calculate dependent texture operations |
| 92 | AdvancedPhaseAfterSetRegAttr | Hook -- after register attribute setting |
| 93 | LateExpansionUnsupportedOps2 | Second late unsupported op expansion |
| 94 | FinalInspectionPass | Final IR validity checks |
| 95 | SetAfterLegalization | Mark legalization complete |
Group 5: Scheduling and Register Allocation (phases 96--105)
| Index | Phase Name | Purpose |
|---|---|---|
| 96 | ReportBeforeScheduling | Diagnostic dump before scheduling |
| 97 | AdvancedPhasePreSched | Hook -- before scheduling |
| 98 | BackPropagateVEC2D | Back-propagate 2D vector instructions |
| 99 | OriDoSyncronization | Insert synchronization instructions |
| 100 | ApplyPostSyncronizationWars | Apply post-synchronization write-after-read fixes |
| 101 | AdvancedPhaseAllocReg | Hook -- register allocation |
| 102 | ReportAfterRegisterAllocation | Diagnostic dump after regalloc |
| 103 | Get64bRegComponents | Extract 64-bit register components |
| 104 | AdvancedPhasePostExpansion | Hook -- after post-expansion |
| 105 | ApplyPostRegAllocWars | Apply post-regalloc write-after-read fixes |
Group 6: Post-Schedule and Code Generation (phases 106--131)
| Index | Phase Name | Purpose |
|---|---|---|
| 106 | AdvancedPhasePostSched | Hook -- after scheduling |
| 107 | OriRemoveNopCode | Remove NOP instructions |
| 108 | OptimizeHotColdInLoop | Hot/cold partitioning within loops |
| 109 | OptimizeHotColdFlow | Hot/cold partitioning across flow |
| 110 | PostSchedule | Post-scheduling fixups |
| 111 | AdvancedPhasePostFixUp | Hook -- after post-schedule fixup |
| 112 | PlaceBlocksInSourceOrder | Reorder blocks to match source order |
| 113 | PostFixForMercTargets | Mercury target-specific fixups |
| 114 | FixUpTexDepBarAndSync | Fix texture dependency barriers and sync |
| 115 | AdvancedScoreboardsAndOpexes | Hook -- before scoreboard generation |
| 116 | ProcessO0WaitsAndSBs | Process O0-level waits and scoreboards |
| 117 | MercEncodeAndDecode | Mercury encode to SASS and decode-verify |
| 118 | MercExpandInstructions | Expand macro instructions to SASS |
| 119 | MercGenerateWARs1 | Generate write-after-read hazard stalls, pass 1 |
| 120 | MercGenerateOpex | Generate operand exchange stalls |
| 121 | MercGenerateWARs2 | Generate write-after-read hazard stalls, pass 2 |
| 122 | MercGenerateSassUCode | Emit final SASS microcode |
| 123 | ComputeVCallRegUse | Compute virtual call register usage |
| 124 | CalcRegisterMap | Calculate final register map |
| 125 | UpdateAfterPostRegAlloc | Post-regalloc bookkeeping |
| 126 | ReportFinalMemoryUsage | Report final memory consumption |
| 127 | AdvancedPhaseOriPhaseEncoding | Hook -- before final encoding |
| 128 | UpdateAfterFormatCodeList | Update after code list formatting |
| 129 | DumpNVuCodeText | Dump NV microcode as text (debug) |
| 130 | DumpNVuCodeHex | Dump NV microcode as hex (debug) |
| 131 | DebuggerBreak | Debugger breakpoint (debug) |
Group 7: Late Cleanup (phases 132--158)
| Index | Phase Name | Purpose |
|---|---|---|
| 132 | UpdateAfterConvertUnsupportedOps | Bookkeeping after late conversion |
| 133 | MergeEquivalentConditionalFlow | Merge equivalent conditional branches |
| 134 | AdvancedPhaseAfterMidExpansion | Hook -- after mid-expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | Hook -- after late sync expansion |
| 136 | LateMergeEquivalentConditionalFlow | Late merge of equivalent conditionals |
| 137 | LateExpansionUnsupportedOpsMid | Mid-point late unsupported op expansion |
| 138 | OriSplitHighPressureLiveRanges | Split live ranges under high register pressure |
| 139--158 | (architecture-specific) | 20 additional phases with names returned by each phase's getName() virtual method |
Phases 139--158 are not in the static name table at off_22BD0C0. Their names are returned by each phase's getName() virtual method. These are conditionally-enabled phases for specific architecture targets (SM variants) or optimization levels.
AdvancedPhase Hook Points
The 16 AdvancedPhase entries are insertion points for architecture-specific or optimization-level-specific processing. All return isNoOp() == true by default. When activated (typically by NvOptRecipe configuration for a specific SM target), they execute additional transformations at precisely defined points in the pipeline:
| Index | Hook Name | Insertion Context |
|---|---|---|
| 4 | AdvancedPhaseBeforeConvUnSup | Before ConvertUnsupportedOps |
| 7 | AdvancedPhaseAfterConvUnSup | After ConvertUnsupportedOps |
| 47 | AdvancedPhaseEarlyEnforceArgs | Before EnforceArgumentRestrictions |
| 77 | AdvancedPhaseLateConvUnSup | Before LateExpansionUnsupportedOps |
| 82 | AdvancedPhaseBackPropVReg | Before OriBackCopyPropagate |
| 89 | AdvancedPhaseSetRegAttr | Before OriSetRegisterAttr |
| 92 | AdvancedPhaseAfterSetRegAttr | After OriSetRegisterAttr |
| 97 | AdvancedPhasePreSched | Before scheduling pipeline |
| 101 | AdvancedPhaseAllocReg | Register allocation entry point |
| 104 | AdvancedPhasePostExpansion | After post-regalloc expansion |
| 106 | AdvancedPhasePostSched | After scheduling |
| 111 | AdvancedPhasePostFixUp | After post-schedule fixup |
| 115 | AdvancedScoreboardsAndOpexes | Before scoreboard/opex generation |
| 127 | AdvancedPhaseOriPhaseEncoding | Before final instruction encoding |
| 134 | AdvancedPhaseAfterMidExpansion | After mid-level expansion |
| 135 | AdvancedPhaseLateExpandSyncInstructions | After late sync instruction expansion |
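For tooling that walks the phase list, the hook indices from the table above can be kept as a sorted constant array. The names kAdvancedPhaseHooks and isAdvancedPhaseHook are hypothetical; the indices are copied from the table.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// The 16 AdvancedPhase hook indices, in ascending order.
constexpr std::array<int, 16> kAdvancedPhaseHooks = {
    4, 7, 47, 77, 82, 89, 92, 97, 101, 104, 106, 111, 115, 127, 134, 135};

// True if the given phase index is one of the hook points.
// Relies on the array being sorted, which binary_search requires.
bool isAdvancedPhaseHook(int phase_index) {
    return std::binary_search(kAdvancedPhaseHooks.begin(),
                              kAdvancedPhaseHooks.end(), phase_index);
}
```

A trace filter could use this to separate hook activations from regular phases when reading per-phase timing output.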
Mercury Encoding Sub-Pipeline
Phases 113--122 form a self-contained sub-pipeline that transforms the optimized, register-allocated Ori IR into final SASS machine code via the Mercury encoding format:
PostFixForMercTargets (113)
→ FixUpTexDepBarAndSync (114)
→ [AdvancedScoreboardsAndOpexes hook (115)]
→ ProcessO0WaitsAndSBs (116)
→ MercEncodeAndDecode (117) ← encode to SASS + decode for verification
→ MercExpandInstructions (118) ← expand remaining macros
→ MercGenerateWARs1 (119) ← first WAR hazard pass
→ MercGenerateOpex (120) ← operand exchange stalls
→ MercGenerateWARs2 (121) ← second WAR hazard pass
→ MercGenerateSassUCode (122) ← final microcode emission
"Mercury" is NVIDIA's internal name for the SASS encoding format on recent GPU architectures (Blackwell-era SM 100/103/110/120).
Diagnostic Strings
| Address | String | Emitted By | Context |
|---|---|---|---|
| 0x22BC3B3 | "[Pool Consumption = " | sub_C62200 | After all phases summary |
| 0x22BC416 | "All Phases Summary" | sub_C64F70 | End of dispatch loop |
| (inline) | " :: " | sub_C64310 | Phase timing line separator |
| (inline) | "[Total " | sub_C64310 | Total memory delta |
| (inline) | "[Freeable " | sub_C64310 | Freeable memory delta |
| (inline) | "[Freeable Leaked " | sub_C64310 | Leaked memory delta |
| (inline) | "Before " / "After " | sub_C64F70 | Phase execution diagnostic |
Function Map
| Address | Size | Function | Confidence |
|---|---|---|---|
| sub_C60D20 | 16 | Default phase table pointer | HIGH |
sub_C60D30 | 3,554 | Phase factory (159-case switch) | VERY HIGH |
sub_C60BD0 | 334 | Multi-function phase invoker | MEDIUM-HIGH |
sub_C61B20 | 1,753 | PhaseManager destructor | VERY HIGH |
sub_C62200 | 888 | Pool consumption reporter | VERY HIGH |
sub_C62580 | 253 | Timing record array resizer | HIGH |
sub_C62640 | 223 | Phase list resizer | HIGH |
sub_C62720 | 4,734 | PhaseManager constructor | VERY HIGH |
sub_C639A0 | 1,535 | Case-insensitive quicksort | HIGH |
sub_C63FA0 | 556 | Phase name table sort/rebuild | HIGH |
sub_C641D0 | 305 | Phase name-to-index lookup | VERY HIGH |
sub_C64310 | 3,168 | Per-phase timing reporter | VERY HIGH |
sub_C64F70 | 1,455 | Phase dispatch loop | VERY HIGH |
Cross-References
- Pass Inventory & Ordering -- full phase sequence and stage grouping
- GeneralOptimize Bundles -- phases 13, 29, 37, 46, 58, 65
- Synchronization & Barriers -- phases 26, 71, 72, 99, 100
- Liveness Analysis -- phases 10, 16, 33, 61, 84
- Mercury Encoder -- phases 113--122
- Memory Pool Allocator -- pool allocation infrastructure used by PhaseManager
- Optimization Levels -- how opt level controls phase behavior
- DUMPIR & NamedPhases -- phase name resolution for debug output