Phase Manager Infrastructure

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The PhaseManager is the central orchestration layer in ptxas. It owns the entire 159-phase optimization and code generation pipeline, constructs each phase as a polymorphic object via an abstract factory, and drives execution through a virtual dispatch loop. Every compilation unit passes through the same PhaseManager sequence: construct all 159 phase objects, iterate the phase index array calling execute() on each, optionally collect per-phase timing and memory statistics, then tear down. The PhaseManager also hosts an optional NvOptRecipe sub-manager (440 bytes) for architecture-specific "advanced phase" hooks that inject additional processing at 16 defined points in the pipeline.

The design is a textbook Strategy + Abstract Factory pattern: a 159-case switch statement maps phase indices to vtable pointers, each vtable provides execute(), isNoOp(), and getName() virtual methods, and the dispatch loop iterates a flat index array that defines execution order. This makes the pipeline fully data-driven -- reordering, disabling, or injecting phases requires only modifying the index array, not the dispatch logic.
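The data-driven claim can be illustrated with a minimal sketch. It is not the ptxas implementation: the three phase classes, `Context`, `makePhase`, and `runPipeline` are hypothetical names, and `getName()` returns a string here for brevity rather than an index into a name table.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the compilation context; it only records a trace.
struct Context { std::vector<std::string> trace; };

// Interface with the three per-phase virtual methods named in the text.
struct Phase {
    virtual ~Phase() = default;
    virtual void execute(Context& ctx) = 0;
    virtual bool isNoOp() const { return false; }
    virtual const char* getName() const = 0;
};

struct CheckPhase : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    const char* getName() const override { return "Check"; }
};

// Behaves like a hook that is a no-op by default.
struct HookPhase : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    bool isNoOp() const override { return true; }
    const char* getName() const override { return "Hook"; }
};

struct OptimizePhase : Phase {
    void execute(Context& ctx) override { ctx.trace.push_back(getName()); }
    const char* getName() const override { return "Optimize"; }
};

// Factory: one switch case per phase index, nullptr for unknown indices.
std::unique_ptr<Phase> makePhase(int index) {
    switch (index) {
        case 0:  return std::make_unique<CheckPhase>();
        case 1:  return std::make_unique<HookPhase>();
        case 2:  return std::make_unique<OptimizePhase>();
        default: return nullptr;
    }
}

// Dispatch: the order array alone decides which phases run and when.
Context runPipeline(const std::vector<int>& order) {
    std::vector<std::unique_ptr<Phase>> phases;
    for (int i = 0; i < 3; ++i) phases.push_back(makePhase(i));

    Context ctx;
    for (int idx : order) {
        assert(idx >= 0 && idx < static_cast<int>(phases.size()));
        Phase& phase = *phases[idx];
        if (!phase.isNoOp()) phase.execute(ctx);
    }
    return ctx;
}
```

Calling `runPipeline({0, 1, 2})` and `runPipeline({2, 0})` runs the same phase objects in different orders without touching the loop body, which is the property the paragraph above describes.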

Core range            0xC60000--0xC66000 (13 functions, ~17.5 KB)
Constructor           sub_C62720 (4,734 bytes)
Destructor            sub_C61B20 (1,753 bytes)
Phase factory         sub_C60D30 (3,554 bytes, 159-case switch)
Dispatch loop         sub_C64F70 (1,455 bytes)
Name lookup           sub_C641D0 (305 bytes, case-insensitive binary search)
Timing reporter       sub_C64310 (3,168 bytes)
Pool reporter         sub_C62200 (888 bytes)
Total phases          159 (139 explicitly named + 20 arch-specific)
AdvancedPhase hooks   16 no-op-by-default insertion points
Default phase table   Static array at 0x22BEEA0 (returned by sub_C60D20)
Phase name table      Static array at off_22BD0C0 (159 string pointers)
Vtable range          off_22BD5C8 (phase 0) through off_22BEE78 (phase 158)
Callers               sub_7FB6C0 (main compilation driver), sub_9F63D0 (library/ftrace entry)

PhaseManager Object Layout

The PhaseManager is a plain C++ object (no vtable of its own) allocated by the compilation driver. Minimum size is 112 bytes, though the full extent depends on whether timing and NvOptRecipe are enabled.

PhaseManager (112+ bytes)
  +0    int64     compilation_unit      // back-pointer to owning compilation unit
  +8    int64*    allocator             // pool allocator (from compilation_unit->field_16)
  +16   void*     sorted_name_table     // sorted {name_ptr, index} pairs for binary search
  +24   int       sorted_name_count
  +28   int       sorted_name_capacity
  +32   int64*    allocator2            // copy of allocator (for phase list ops)
  +40   void*     phase_list            // array of 16-byte {phase_ptr, pool_ptr} pairs
  +48   int       phase_list_count      // always 159 after construction
  +52   int       phase_list_capacity
  +56   int64     nvopt_recipe_ptr      // NvOptRecipe sub-manager, or NULL
  +64   int64     recipe_allocator      // allocator used to free the NvOptRecipe (set on install)
  +72   bool      timing_enabled        // set from compilation_unit->config->options[17928]
  +76   int       (flags/padding)
  +80   bool      flag_byte             // initialized to 1, reset after first timing report
  +88   int64*    timing_allocator
  +96   void*     phase_name_raw_table  // 159 name string pointers, copied from off_22BD0C0
  +104  int       phase_name_raw_count
  +108  int       phase_name_raw_capacity

The two allocator fields (+8 and +32) both point to the same pool allocator extracted from the compilation unit, but are used in different contexts: +8 for name table operations, +32 for phase list operations.
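As a cross-check of the offsets, the layout can be written as a C++ struct and verified at compile time. This is a sketch that assumes a 64-bit target with natural alignment (for example x86-64 Linux); the struct name, the placeholder names `field_64` and `field_76`, and the helper `timing_enabled_raw` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Field names follow the layout table; padding is left to the compiler.
struct PhaseManagerLayout {
    std::int64_t  compilation_unit;         // +0
    std::int64_t* allocator;                // +8
    void*         sorted_name_table;        // +16
    std::int32_t  sorted_name_count;        // +24
    std::int32_t  sorted_name_capacity;     // +28
    std::int64_t* allocator2;               // +32
    void*         phase_list;               // +40
    std::int32_t  phase_list_count;         // +48
    std::int32_t  phase_list_capacity;      // +52
    std::int64_t  nvopt_recipe_ptr;         // +56
    std::int64_t  field_64;                 // +64
    bool          timing_enabled;           // +72 (3 bytes of padding follow)
    std::int32_t  field_76;                 // +76
    bool          flag_byte;                // +80 (7 bytes of padding follow)
    std::int64_t* timing_allocator;         // +88
    void*         phase_name_raw_table;     // +96
    std::int32_t  phase_name_raw_count;     // +104
    std::int32_t  phase_name_raw_capacity;  // +108
};

static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(PhaseManagerLayout, allocator2) == 32, "allocator2");
static_assert(offsetof(PhaseManagerLayout, nvopt_recipe_ptr) == 56, "nvopt_recipe_ptr");
static_assert(offsetof(PhaseManagerLayout, timing_enabled) == 72, "timing_enabled");
static_assert(offsetof(PhaseManagerLayout, flag_byte) == 80, "flag_byte");
static_assert(offsetof(PhaseManagerLayout, phase_name_raw_table) == 96, "phase_name_raw_table");
static_assert(sizeof(PhaseManagerLayout) == 112, "minimum object size");

// Reads the timing flag by raw byte offset, the way decompiled code addresses it.
bool timing_enabled_raw(const void* manager) {
    assert(manager != nullptr);
    return static_cast<const unsigned char*>(manager)[72] != 0;
}
```

If any field were misplaced, one of the `static_assert` lines would fail to compile, which makes the struct a cheap consistency check on the table.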

Phase Object Model

Each phase is a 16-byte polymorphic object:

Phase (16 bytes)
  +0    vtable*   // points to one of 159 vtable instances
  +8    void*     // pool pointer (memory pool for phase-local allocations)

The vtable provides the interface contract:

Vtable offset  Method   Signature
+0             execute  void execute(phase*, compilation_context*)
+8             isNoOp   bool isNoOp(phase*) -- returns true to skip execution
+16            getName  int getName(phase*) -- returns index into name table
+24            alloc    void* alloc(pool*, size_t) -- pool allocator
+32            free     void free(pool*, void*) -- pool deallocator

The vtable addresses span off_22BD5C8 (phase 0) through off_22BEE78 (phase 158), with a stride of 0x28 (40 bytes) between consecutive entries. All vtables reside in .data.rel.ro.
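Because the stride is constant, a vtable address maps to a phase index by simple arithmetic. The sketch below uses only the base, stride, and phase count quoted above; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kVtableBase   = 0x22BD5C8;  // off_22BD5C8, phase 0
constexpr std::uint64_t kVtableStride = 0x28;       // 5 slots * 8 bytes
constexpr std::uint64_t kPhaseCount   = 159;

// Address of the vtable for a given phase index.
constexpr std::uint64_t vtable_address(std::uint64_t phase_index) {
    return kVtableBase + kVtableStride * phase_index;
}

// Inverse mapping; returns -1 when the address is not a phase vtable.
int phase_index_for_vtable(std::uint64_t address) {
    assert(address % 8 == 0);  // vtable pointers are 8-byte aligned
    if (address < kVtableBase) return -1;
    const std::uint64_t delta = address - kVtableBase;
    if (delta % kVtableStride != 0) return -1;
    const std::uint64_t index = delta / kVtableStride;
    return index < kPhaseCount ? static_cast<int>(index) : -1;
}

static_assert(vtable_address(158) == 0x22BEE78, "last vtable matches the documented range");
```

The inverse is handy when annotating a disassembly: a pointer found at offset +0 of a phase object identifies the phase directly.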

Phase Factory -- sub_C60D30

The factory is a 159-case switch statement that serves as the sole point of phase instantiation. For each case:

  1. Extracts the pool allocator from context->field_16
  2. Allocates 16 bytes via pool_alloc (vtable offset +24)
  3. Writes the case-specific vtable pointer at offset +0
  4. Returns a {phase_ptr, pool_ptr} pair

The default case returns {NULL, NULL}, which the caller treats as an invalid phase index.

// Pseudocode for sub_C60D30
pair<phase*, pool*> PhaseFactory(int phase_index, context* ctx) {
    pool* p = ctx->allocator;
    phase* obj = p->alloc(16);
    switch (phase_index) {
        case 0:   obj->vtable = off_22BD5C8; break;  // OriCheckInitialProgram
        case 1:   obj->vtable = off_22BD5F0; break;  // ApplyNvOptRecipes
        case 2:   obj->vtable = off_22BD618; break;  // PromoteFP16
        // ... 156 more cases ...
        case 158: obj->vtable = off_22BEE78; break;  // sentinel/NOP
        default:  return {NULL, NULL};
    }
    return {obj, p};
}

Called exclusively by the constructor (sub_C62720).

Construction Sequence -- sub_C62720

The constructor builds all internal data structures and instantiates every phase. The pseudocode below condenses that work into five numbered stages:

// Pseudocode for sub_C62720
bool PhaseManager::construct(compilation_unit* cu) {
    this->cu          = cu;
    this->allocator   = cu->field_16;      // extract pool allocator
    this->allocator2  = cu->field_16;

    // 1. Check timing flag
    this->timing_enabled = cu->config->options[17928];

    // 2. Allocate and copy phase name table (1272 = 159 * 8 bytes)
    this->phase_name_raw_table = alloc(1272);
    memcpy(this->phase_name_raw_table, off_22BD0C0, 1272);
    this->phase_name_raw_count    = 159;
    this->phase_name_raw_capacity = 159;

    // 3. Initialize timing records
    resize_timing(/*capacity=*/159);                    // sub_C62580
    cu->timing_count++;                                 // at cu+1576
    append_timing({index=-1, name=0x2030007, time=0, flags=0});  // sentinel

    // 4. Create all 159 phase objects
    resize_phase_list(/*capacity=*/159);                // sub_C62640
    for (int i = 0; i < 159; i++) {
        auto [phase, pool] = PhaseFactory(i, cu);       // sub_C60D30
        phase_list[i] = {phase, pool};
    }

    // 5. Optionally create NvOptRecipe sub-manager
    if (cu->config->getOption(391)) {
        auto* recipe = alloc(440);
        // initialize hash table, ref-counted lists, timing arrays (8 entries)
        // inherit phase chain from previous execution context
        this->nvopt_recipe_ptr = recipe;
    }
    this->flag_byte = 1;
    return true;
}

Key constants:

  • 159 -- total phase count, used as loop bound and array capacities
  • 1272 -- 159 * 8, phase name pointer table size in bytes
  • 440 -- NvOptRecipe sub-manager object size
  • 0x2030007 (33,751,047) -- timing sentinel magic value
  • Option 17928 -- enables per-phase timing/memory reporting
  • Option 391 -- enables NvOptRecipe sub-manager

Destruction Sequence -- sub_C61B20

Teardown mirrors construction in reverse order, with careful handling of the NvOptRecipe's reference-counted shared state:

// Pseudocode for sub_C61B20
void PhaseManager::destroy() {
    // 1. Free raw name table
    timing_allocator->free(phase_name_raw_table);

    // 2. Tear down NvOptRecipe if present
    if (nvopt_recipe_ptr) {
        auto* r = nvopt_recipe_ptr;
        // decrement shared_list ref-count at +432
        if (--r->shared_list_refcount == 0)
            free_list_nodes(r->shared_list);
        free(r->hash_buckets);        // +408
        free(r->sorted_array);        // +376
        free(r->timing_records);      // +344, stride=584 per entry
        free(r->node_pool);           // +16
        free(r);
    }

    // 3. Destroy each phase via virtual destructor (vtable+32)
    for (int i = 0; i < phase_list_count; i++) {
        auto [phase, pool] = phase_list[i];
        pool->free(phase);            // invokes vtable+32
    }

    // 4. Free base arrays
    allocator2->free(phase_list);
    allocator->free(sorted_name_table);
}

The ref-count decrement-and-destroy pattern on shared_list at +432 follows C++ shared_ptr semantics: the NvOptRecipe may share state across multiple compilation units in library mode.

Phase Dispatch Loop -- sub_C64F70

The dispatch loop is the runtime engine. It takes a slice of the phase index array and executes each phase in order:

// Pseudocode for sub_C64F70
bool PhaseManager::dispatch(int* phase_indices, int count) {
    memory_snapshot_t base_snap;
    take_snapshot(&base_snap);                          // sub_8DADE0

    for (int i = 0; i < count; i++) {
        int idx = phase_indices[i];
        phase* p = this->phase_list[idx].phase;

        // Resolve phase name
        int name_idx = p->getName();                    // vtable+16
        const char* name = this->phase_name_raw_table[name_idx];

        // Record timing entry
        append_timing({idx, name, opt_level, flags, metrics});

        // Take pre-execution snapshot
        memory_snapshot_t pre_snap;
        take_snapshot(&pre_snap);

        // Execute (unless no-op)
        if (!p->isNoOp()) {                             // vtable+8
            p->execute(this->cu);                       // vtable+0
            // Construct diagnostic: "Before <name>" or "After <name>"
        }

        // Report per-phase stats
        if (this->timing_enabled) {
            report_phase_stats(name, &pre_snap, false); // sub_C64310
            this->flag_byte = 0;
        }
    }

    // Summary after all phases
    if (this->timing_enabled) {
        report_phase_stats("All Phases Summary", &base_snap, true);
        report_pool_consumption();                      // sub_C62200
    }
    return true;
}

The "Before" / "After" diagnostic strings use an interesting encoding trick: the string "Before " is stored as the 64-bit integer 0x2065726F666542 in little-endian, allowing the compiler to emit a single mov instruction instead of a memcpy.
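The constant can be reproduced by packing the string's bytes into an integer, lowest byte first. The helpers below are hypothetical and written with shifts so the result does not depend on host endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Packs up to 8 characters into a 64-bit value, first character in the low byte.
std::uint64_t pack_le(const char* text, std::size_t len) {
    assert(len <= 8);
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const auto byte = static_cast<unsigned char>(text[i]);
        value |= static_cast<std::uint64_t>(byte) << (8 * i);
    }
    return value;
}

// Recovers the string by reading bytes from low to high until a zero byte.
std::string unpack_le(std::uint64_t value) {
    std::string out;
    for (int i = 0; i < 8; ++i) {
        const char c = static_cast<char>((value >> (8 * i)) & 0xFF);
        if (c == '\0') break;
        out.push_back(c);
    }
    return out;
}
```

Here `pack_le("Before ", 7)` yields 0x2065726F666542: the low byte 0x42 is 'B' and the highest non-zero byte 0x20 is the trailing space, while the eighth byte stays zero and serves as the terminator.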

Phase Name Lookup -- sub_C641D0

External callers (e.g., --ftrace-phase-after option processing in sub_9F4040) resolve phase names to indices through a case-insensitive binary search:

// Pseudocode for sub_C641D0
int PhaseManager::lookup_phase(const char* name) {
    ensure_sorted();                                    // sub_C63FA0

    int lo = 0, hi = sorted_name_count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcasecmp(sorted_name_table[mid].name, name);
        if (cmp == 0) return sorted_name_table[mid].index;
        if (cmp < 0)  lo = mid + 1;
        else           hi = mid - 1;
    }
    return 158;  // sentinel: last phase (NOP)
}

The sorted name table is rebuilt on demand by sub_C63FA0 when the raw count differs from the sorted count. Sorting uses an iterative quicksort (sub_C639A0) with median-of-three pivot selection and three-way partitioning. The sort stack is pre-allocated to 33 entries, well above the ceil(log2(160)) = 8 levels that evenly split partitions of the 159 names would need.
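A portable sketch of the sort-then-search scheme follows. It swaps in `std::sort` for the iterative quicksort and a hand-written comparison for `strcasecmp`, so it illustrates the behaviour rather than the original code; all names are hypothetical, and the fallback value 158 is taken from the pseudocode above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct NameEntry { std::string name; int index; };

// Case-insensitive three-way comparison (negative, zero, positive).
int compare_ci(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}

bool less_ci(const NameEntry& l, const NameEntry& r) {
    return compare_ci(l.name, r.name) < 0;
}

// Builds {name, original index} pairs and sorts them by name, ignoring case.
std::vector<NameEntry> build_sorted(const std::vector<std::string>& raw_names) {
    std::vector<NameEntry> sorted;
    for (std::size_t i = 0; i < raw_names.size(); ++i)
        sorted.push_back({raw_names[i], static_cast<int>(i)});
    std::sort(sorted.begin(), sorted.end(), less_ci);
    return sorted;
}

// Binary search; returns not_found (the last phase index) when nothing matches.
int lookup_phase(const std::vector<NameEntry>& sorted, const std::string& name,
                 int not_found = 158) {
    assert(std::is_sorted(sorted.begin(), sorted.end(), less_ci));
    int lo = 0, hi = static_cast<int>(sorted.size()) - 1;
    while (lo <= hi) {
        const int mid = lo + (hi - lo) / 2;
        const int cmp = compare_ci(sorted[mid].name, name);
        if (cmp == 0) return sorted[mid].index;
        if (cmp < 0) lo = mid + 1;
        else         hi = mid - 1;
    }
    return not_found;
}
```

Note that a failed lookup is indistinguishable from a successful lookup of phase 158, so callers that need to detect typos have to compare the returned phase's name against the request themselves.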

Per-Phase Timing and Memory Reporting

When timing is enabled (option 17928), the dispatch loop calls sub_C64310 after each phase to print memory statistics:

<indent><phase_name>  ::  [Total 1234 KB]  [Freeable 567 KB]  [Freeable Leaked 12 KB] (2%)

The reporter computes three memory deltas from snapshot pairs:

Metric           Helper      Meaning
Total            sub_8DAE20  Total memory allocated since snapshot
Freeable         sub_8DAE30  Memory eligible for release
Freeable Leaked  sub_8DAE40  Freeable memory not actually released

Size formatting thresholds:

  • 0--1023: raw bytes (suffix B)
  • 1024--10,485,760: kilobytes with 3 decimal places (suffix KB)
  • above 10 MB: megabytes with 3 decimal places (suffix MB)
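The thresholds translate into a small formatting routine. This is a sketch: the function name is hypothetical, and the exact spacing between number and suffix is an assumption based on the sample output lines rather than on the decompiled format strings.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a byte count using the three documented ranges.
std::string format_size(std::uint64_t bytes) {
    char buf[64];
    int written = 0;
    if (bytes < 1024) {
        // 0--1023: raw bytes
        written = std::snprintf(buf, sizeof buf, "%llu B",
                                static_cast<unsigned long long>(bytes));
    } else if (bytes <= 10485760ULL) {
        // 1024--10,485,760: kilobytes, 3 decimal places
        written = std::snprintf(buf, sizeof buf, "%.3f KB", bytes / 1024.0);
    } else {
        // above 10 MB: megabytes, 3 decimal places
        written = std::snprintf(buf, sizeof buf, "%.3f MB",
                                bytes / (1024.0 * 1024.0));
    }
    assert(written > 0 && written < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

The upper KB bound is inclusive, so exactly 10,485,760 bytes is still reported in kilobytes and only larger values switch to megabytes.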

After all phases complete, the loop prints an "All Phases Summary" line using the same reporter, then calls sub_C62200 to print the pool consumption total:

[Pool Consumption = 45.678 MB]

Timing Record Format

Each timing entry is 32 bytes:

Timing Record (32 bytes)
  +0    int       phase_index       // -1 for sentinel
  +8    int64     phase_name        // string pointer, or 0x2030007 for sentinel
  +16   int64     timing_value      // elapsed time
  +24   int       memory_flags      // opt level / additional metrics

Records are stored in a growable array at compilation_unit+1560. Growth uses a 1.5x strategy: new_capacity = max(old + old/2 + 1, requested).
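The record layout and the growth rule can be checked with a few lines of C++. The sketch assumes a 64-bit target where `int64` fields are 8-byte aligned; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors the 32-byte timing record; the compiler inserts the padding.
struct TimingRecord {
    std::int32_t phase_index;   // +0  (-1 for the sentinel)
    std::int64_t phase_name;    // +8  (string pointer or 0x2030007)
    std::int64_t timing_value;  // +16
    std::int32_t memory_flags;  // +24
};
static_assert(sizeof(TimingRecord) == 32, "record size on a 64-bit target");

// Growth rule: 1.5x plus one, but never less than what was requested.
std::int32_t next_capacity(std::int32_t old_capacity, std::int32_t requested) {
    assert(old_capacity >= 0 && requested >= 0);
    return std::max(old_capacity + old_capacity / 2 + 1, requested);
}
```

For an empty array and a request for 159 records the result is 159, so the constructor's initial resize allocates exactly once; later single-record appends grow 159 to 239.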

NvOptRecipe Sub-Manager (440 bytes)

When option 391 is enabled, the constructor allocates a 440-byte NvOptRecipe sub-manager and stores its pointer at PhaseManager+56. This object provides the runtime for "AdvancedPhase" hooks -- the 16 phases that are no-ops by default but can be activated for architecture-specific or optimization-level-specific processing. The NvOpt level (0--5) controls per-phase aggressiveness independently of the -O CLI level: -O gates which phases run at all, while the NvOpt level controls how aggressively active phases behave.

Object Layout

NvOptRecipe (440 bytes)
  +0    int64     compilation_unit           // back-pointer to owning CU
  +8    int64     phase_manager_backref      // back-pointer to PhaseManager
  +16   void*     node_pool                  // 24-byte ref-counted list node
  +24   int64     secondary_bucket_count     // secondary hash (migration buffer)
  +32   void*     secondary_bucket_array     // secondary hash bucket array
  +40   int64     secondary_total_entries    // secondary hash entry count
  +48   (264 B)   [opaque internal region]   // +48..+311 undecoded
  +312  int64     recipe_data                // from option 391 value (ext. pointer)
  +320  int64     (reserved)                 // zeroed in constructor
  +328  (8 B)     [alignment gap]
  +336  int64     allocator                  // cu->field_16->field_16
  +344  void*     timing_records             // stride = 584 bytes per entry
  +352  int32     timing_count               // init -1 (empty sentinel)
  +356  int32     timing_flags               // init 0
  +360  int32     timing_extra               // init 0
  +364  (4 B)     (padding)
  +368  int64*    timing_allocator            // cu->field_16->field_16 copy
  +376  void*     sorted_array               // 4-byte entries, init capacity = 8
  +384  int32     sorted_count               // init 7 (pre-filled)
  +388  int32     sorted_capacity            // init 8
  +392  void*     ref_counted_list_2         // 24-byte ref-counted list node
  +400  int32     hash_bucket_count          // primary hash table bucket count
  +404  (4 B)     (padding)
  +408  void*     hash_buckets               // primary hash, 24-byte stride/bucket
  +416  int64     hash_size                  // total entries across all buckets
  +424  (8 B)     (padding)
  +432  void*     shared_list_ptr            // ref-counted, shared across CUs
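The same compile-time cross-check works for this layout. The sketch assumes a 64-bit target with natural alignment; the struct name, the names of the opaque and padding members, and the helper `timing_count_raw` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct NvOptRecipeLayout {
    std::int64_t  compilation_unit;         // +0
    std::int64_t  phase_manager_backref;    // +8
    void*         node_pool;                // +16
    std::int64_t  secondary_bucket_count;   // +24
    void*         secondary_bucket_array;   // +32
    std::int64_t  secondary_total_entries;  // +40
    unsigned char opaque_48[264];           // +48..+311, undecoded
    std::int64_t  recipe_data;              // +312
    std::int64_t  reserved_320;             // +320
    unsigned char gap_328[8];               // +328
    std::int64_t  allocator;                // +336
    void*         timing_records;           // +344
    std::int32_t  timing_count;             // +352
    std::int32_t  timing_flags;             // +356
    std::int32_t  timing_extra;             // +360 (4 bytes of padding follow)
    std::int64_t* timing_allocator;         // +368
    void*         sorted_array;             // +376
    std::int32_t  sorted_count;             // +384
    std::int32_t  sorted_capacity;          // +388
    void*         ref_counted_list_2;       // +392
    std::int32_t  hash_bucket_count;        // +400 (4 bytes of padding follow)
    void*         hash_buckets;             // +408
    std::int64_t  hash_size;                // +416
    unsigned char pad_424[8];               // +424
    void*         shared_list_ptr;          // +432
};

static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(NvOptRecipeLayout, recipe_data) == 312, "recipe_data");
static_assert(offsetof(NvOptRecipeLayout, timing_records) == 344, "timing_records");
static_assert(offsetof(NvOptRecipeLayout, timing_allocator) == 368, "timing_allocator");
static_assert(offsetof(NvOptRecipeLayout, hash_buckets) == 408, "hash_buckets");
static_assert(offsetof(NvOptRecipeLayout, shared_list_ptr) == 432, "shared_list_ptr");
static_assert(sizeof(NvOptRecipeLayout) == 440, "object size");

// Reads timing_count by raw offset, as decompiled code would address it.
std::int32_t timing_count_raw(const void* recipe) {
    assert(recipe != nullptr);
    std::int32_t value = 0;
    std::memcpy(&value, static_cast<const unsigned char*>(recipe) + 352, sizeof value);
    return value;
}
```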

Sub-Structures

Ref-Counted List Node (24 bytes) -- used at +16, +392, +432:

RefCountedListNode (24 bytes)
  +0    int64     refcount        // manual shared_ptr: decrement-and-destroy
  +8    void*     next            // singly-linked list chain
  +16   void*     allocator       // for self-deallocation when refcount → 0

When the refcount reaches zero, the destructor walks the next chain freeing each node, then frees the head node itself through the allocator at +16.
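The decrement-and-destroy behaviour can be sketched as follows. `CountingAllocator`, `make_node`, and `release` are hypothetical names, the allocator is a simple malloc wrapper rather than the ptxas pool, and freeing chained nodes through the head's allocator is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Stand-in allocator that counts live blocks so leaks are observable.
struct CountingAllocator {
    int live = 0;
    void* alloc(std::size_t size) { ++live; return std::malloc(size); }
    void free(void* block) { --live; std::free(block); }
};

struct RefCountedListNode {
    std::int64_t        refcount;   // +0
    RefCountedListNode* next;       // +8
    CountingAllocator*  allocator;  // +16
};
static_assert(sizeof(RefCountedListNode) == 24, "24-byte node on a 64-bit target");

// Allocates a node with refcount 1, optionally chained to an existing node.
RefCountedListNode* make_node(CountingAllocator& a, RefCountedListNode* next = nullptr) {
    void* memory = a.alloc(sizeof(RefCountedListNode));
    assert(memory != nullptr);
    return new (memory) RefCountedListNode{1, next, &a};
}

// Drops one reference; the last owner frees the chain, then the head itself.
void release(RefCountedListNode* head) {
    assert(head != nullptr && head->refcount > 0);
    if (--head->refcount != 0) return;
    CountingAllocator* a = head->allocator;
    for (RefCountedListNode* node = head->next; node != nullptr;) {
        RefCountedListNode* following = node->next;
        a->free(node);
        node = following;
    }
    a->free(head);
}
```

With a head that has two owners, the first `release` only decrements the count and the second frees every node in the chain.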

Hash Bucket Entry (24 bytes) -- array at +408:

HashBucketEntry (24 bytes)
  +0    void*     chain_head      // first element in bucket chain
  +8    void*     chain_sentinel  // end-of-chain sentinel
  +16   int32     chain_count     // number of elements in this bucket

Timing Record (584 bytes) -- array at +344:

TimingRecord (584 bytes)
  +0    (40 B)    header
  +40   void*     sub_allocator   // allocator for sub-data at +48
  +48   void*     sub_data        // freed during cleanup
  +56   int32     sub_count       // set to -1 when cleaned
  +60   int32     cleanup_flag    // if >= 0: sub_data exists, free it
  +64   (520 B)   timing/metric data

Records are iterated backward during cleanup, from base + 584 * count (computed in the binary as base + 584 * (count + 1) - 584) down to base. The sentinel value -1 at offset +56 marks an entry as already cleaned up.

Construction Sequence

The constructor (sub_C62720, lines 356--850 in decompilation) performs these steps:

  1. Check option 391 -- fast path: *(config_obj[9] + 28152) != 0; slow path: virtual call with argument 391. If disabled, skip entirely.

  2. Read option 391 value -- the value is the recipe_data pointer. Fast path checks type tag 5 (int64) at config offset 28152, reads the 64-bit value at offset 28160. This is an externally-provided pointer, not computed locally.

  3. Allocate 440 bytes from the pool allocator at compilation_unit->field_16.

  4. Initialize core fields -- back-pointers at +0/+8, node_pool at +16 (24-byte ref-counted node, refcount=1), zero +24/+32/+40, store recipe_data at +312.

  5. Initialize timing -- zero +344, set +352 to -1 (empty sentinel), zero +360, copy allocator to +336 and +368.

  6. Allocate sorted_array -- initial capacity 8 entries (32 bytes), pre-fill 7 entries, set +384 = 7, +388 = 8.

  7. Allocate ref_counted_list_2 at +392 (24-byte node, refcount=1), zero +400/+408/+416.

  8. Allocate shared_list at +432 (24-byte node, refcount=1).

  9. Inherit from previous recipe -- if PhaseManager+56 already holds an NvOptRecipe from a prior compilation unit:

    • Decrement old shared_list refcount; free if zero
    • Migrate hash bucket chains from old recipe to new ref_counted_list_2
    • Walk old timing records backward (stride 584), freeing sub-allocations
    • Drain old secondary hash table, release old node_pool
    • Free old NvOptRecipe object

  10. Install -- set PhaseManager+56 = new recipe, PhaseManager+64 = allocator.

Destruction Sequence

The destructor (sub_C61B20) tears down the recipe in reverse:

  1. Decrement shared_list_ptr (+432) refcount; free linked nodes if zero
  2. Walk hash buckets (+408, stride 24, count from +416): for each chain element, clean sub-entries (timing at offsets +56/+60/+64/+76), decrement per-entry refcounts at element [9], append to ref_counted_list_2; zero bucket; reset +400 to 0
  3. Clean up ref_counted_list_2 (+392); free if refcount zero
  4. Free sorted_array (+376) if sorted_count (+388) >= 0
  5. Walk timing_records (+344) backward, stride 584, freeing sub-allocations; reset +352 to -1
  6. Drain secondary hash (+24/+32/+40), move chains to node_pool
  7. Release node_pool (+16); free if refcount zero
  8. Free the 440-byte object via PhaseManager+64 allocator

NvOpt Level Validation

The recipe application function sub_C173E0 validates the NvOpt level at each recipe record:

// At sub_C173E0 + 0x2FD9 (line 1431)
int nvopt_level = *(int*)(recipe_record + 344);
if (nvopt_level > 5) {
    emit_warning(cu + 1232, 8000, "Invalid nvopt level : %d.", nvopt_level);
    // warning 8000 (0x1F40) -- non-fatal, compilation continues
}

Valid levels are 0--5. The level is consumed as a bitmask 1 << nvopt_level, passed to a vtable call that dispatches on a recipe configuration byte at target descriptor offset 35280 (a switch spanning values 0--7 that handles cases 0--5 and 7). This byte selects the recipe application mode used for the target architecture.
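The level-to-bitmask step is small enough to show directly. The function names are hypothetical, and rejecting negative levels is an addition of this sketch: the decompiled check only tests for values above 5.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kMaxNvOptLevel = 5;

// True for the documented range 0--5.
bool is_valid_nvopt_level(int level) {
    return level >= 0 && level <= kMaxNvOptLevel;
}

// One bit per level: level 0 -> 0x01, level 5 -> 0x20.
std::uint32_t nvopt_level_mask(int level) {
    assert(is_valid_nvopt_level(level));
    return std::uint32_t{1} << level;
}
```

A one-hot mask lets a recipe record declare the set of levels it applies to as a bitwise OR of such values, which would be tested with a single AND. That reading is an inference from the `1 << nvopt_level` expression, not something confirmed in the binary.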

Shared State in Library Mode

The shared_list at +432 enables recipe state persistence across compilation units in library mode (multiple .ptx files compiled by one ptxas invocation):

  • Each new NvOptRecipe sets its shared_list refcount to 1
  • During inheritance (step 9), hash bucket contents are migrated from the old recipe to the new one, accumulating per-kernel recipe decisions
  • When a PhaseManager is destroyed, the recipe decrements the shared_list refcount; only the last reference frees the nodes
  • This allows the NvOptRecipe to cache per-kernel optimization decisions across compilation passes

Key Constants

Value   Meaning
440     NvOptRecipe object size (bytes)
584     Per-entry timing record stride (bytes)
24      Hash bucket entry size / ref-counted list node size
8       Initial sorted_array capacity
7       Initial sorted_count (pre-filled entries)
391     Option ID (enables NvOptRecipe; value = recipe data pointer)
28152   Option 391 type-tag offset in config storage
28160   Option 391 value offset (8 bytes after type tag)
0x1F40  Warning code 8000: "Invalid nvopt level"
5       Maximum valid NvOpt level
35280   Recipe config byte offset in target descriptor

Multi-Function Dispatch -- sub_C60BD0

When a compilation unit contains more than one function, sub_C60BD0 redirects to a per-function dispatch path:

// Pseudocode for sub_C60BD0
void PhaseManager::invoke_multi(compilation_unit* cu) {
    int func_count = get_function_count(cu);            // sub_7DDB50
    if (func_count > 1) {
        auto list1 = create_refcounted_list();
        auto list2 = create_refcounted_list();
        this->phase_chain = current_chain;              // +88
        per_function_dispatch(cu, list1, list2);        // sub_790A40
        release(list1);
        release(list2);
    }
}

Complete Phase Table

Stage numbering. The 7 groups below are a coarse summary of the 159-phase OCG pipeline. The authoritative fine-grained grouping is the 10-stage scheme in the Pass Inventory (OCG-Stage 1--10). The 7-group table here collapses several of those stages for brevity; phase boundaries differ slightly. When citing a stage by number, prefer the Pass Inventory's 10-stage numbering.

Group 1: Initial Setup (phases 0--12)

Index  Phase Name                     Purpose
0      OriCheckInitialProgram         Validate initial Ori IR
1      ApplyNvOptRecipes              Apply NvOptRecipe transformations
2      PromoteFP16                    Promote FP16 operations where beneficial
3      AnalyzeControlFlow             Build/analyze control flow graph
4      AdvancedPhaseBeforeConvUnSup   Hook -- before unsupported op conversion
5      ConvertUnsupportedOps          Lower unsupported operations to supported sequences
6      SetControlFlowOpLastInBB       Mark control flow ops as last in basic block
7      AdvancedPhaseAfterConvUnSup    Hook -- after unsupported op conversion
8      OriCreateMacroInsts            Create macro instruction patterns
9      ReportInitialRepresentation    Diagnostic dump of initial IR
10     EarlyOriSimpleLiveDead         Early dead code elimination
11     ReplaceUniformsWithImm         Replace uniform register loads with immediates
12     OriSanitize                    IR consistency checks

Group 2: Early Optimization (phases 13--36)

Index  Phase Name                     Purpose
13     GeneralOptimizeEarly           First GeneralOptimize pass (peephole + simplify)
14     DoSwitchOptFirst               Switch statement optimization, first pass
15     OriBranchOpt                   Branch simplification and folding
16     OriPerformLiveDeadFirst        Liveness analysis, first pass
17     OptimizeBindlessHeaderLoads    Optimize bindless texture header loads
18     OriLoopSimplification          Canonicalize loop structure
19     OriSplitLiveRanges             Split long live ranges to reduce pressure
20     PerformPGO                     Apply profile-guided optimizations
21     OriStrengthReduce              Strength reduction on induction variables
22     OriLoopUnrolling               Loop unrolling
23     GenerateMovPhi                 Convert phi nodes to MOV-phi representation
24     OriPipelining                  Software pipelining of loops
25     StageAndFence                  Memory staging and fence insertion
26     OriRemoveRedundantBarriers     Remove unnecessary barrier instructions
27     AnalyzeUniformsForSpeculation  Identify uniform values for speculative execution
28     SinkRemat                      Sink rematerializable instructions
29     GeneralOptimize                Main GeneralOptimize pass
30     DoSwitchOptSecond              Switch optimization, second pass
31     OriLinearReplacement           Replace complex patterns with linear sequences
32     CompactLocalMemory             Compact local memory layout
33     OriPerformLiveDeadSecond       Liveness analysis, second pass
34     ExtractShaderConstsFirst       Extract shader constants, first pass
35     OriHoistInvariantsEarly        Early loop-invariant hoisting
36     EmitPSI                        Emit program state information

Group 3: Mid-Level Optimization (phases 37--58)

Index  Phase Name                     Purpose
37     GeneralOptimizeMid             Mid-pipeline GeneralOptimize
38     OptimizeNestedCondBranches     Simplify nested conditional branches
39     ConvertVTGReadWrite            Convert vertex/tessellation/geometry read/write ops
40     DoVirtualCTAExpansion          Expand virtual CTA operations
41     MarkAdditionalColdBlocks       Mark additional basic blocks as cold
42     ExpandMbarrier                 Expand mbarrier intrinsics
43     ForwardProgress                Ensure forward progress guarantees
44     OptimizeUniformAtomic          Optimize uniform atomic operations
45     MidExpansion                   Mid-pipeline lowering and expansion
46     GeneralOptimizeMid2            Second mid-pipeline GeneralOptimize
47     AdvancedPhaseEarlyEnforceArgs  Hook -- before argument restrictions
48     EnforceArgumentRestrictions    Enforce ABI argument constraints
49     GvnCse                         Global value numbering and common subexpression elimination
50     OriReassociateAndCommon        Reassociation and commoning
51     ExtractShaderConstsFinal       Extract shader constants, final pass
52     OriReplaceEquivMultiDefMov     Replace equivalent multi-def MOVs
53     OriPropagateVaryingFirst       Varying propagation, first pass
54     OriDoRematEarly                Early rematerialization
55     LateExpansion                  Late lowering of complex operations
56     SpeculativeHoistComInsts       Speculatively hoist common instructions
57     RemoveASTToDefaultValues       Remove AST nodes set to default values
58     GeneralOptimizeLate            Late GeneralOptimize

Group 4: Late Optimization (phases 59--95)

Index  Phase Name                     Purpose
59     OriLoopFusion                  Fuse compatible loops
60     DoVTGMultiViewExpansion        Expand multi-view VTG operations
61     OriPerformLiveDeadThird        Liveness analysis, third pass
62     OriRemoveRedundantMultiDefMov  Remove redundant multi-def MOVs
63     OriDoPredication               If-conversion / predication
64     LateOriCommoning               Late value commoning
65     GeneralOptimizeLate2           Second late GeneralOptimize
66     OriHoistInvariantsLate         Late invariant hoisting
67     DoKillMovement                 Move kill instructions for better scheduling
68     DoTexMovement                  Move texture instructions for latency hiding
69     OriDoRemat                     Main rematerialization pass
70     OriPropagateVaryingSecond      Varying propagation, second pass
71     OptimizeSyncInstructions       Optimize synchronization instructions
72     LateExpandSyncInstructions     Expand sync instructions to HW sequences
73     ConvertAllMovPhiToMov          Convert all MOV-phi to plain MOV
74     ConvertToUniformReg            Promote values to uniform registers
75     LateArchOptimizeFirst          Architecture-specific late optimization, first pass
76     UpdateAfterOptimize            Post-optimization bookkeeping
77     AdvancedPhaseLateConvUnSup     Hook -- before late unsupported op expansion
78     LateExpansionUnsupportedOps    Late lowering of unsupported operations
79     OriHoistInvariantsLate2        Second late invariant hoisting
80     ExpandJmxComputation           Expand JMX (join/merge) computations
81     LateArchOptimizeSecond         Architecture-specific late optimization, second pass
82     AdvancedPhaseBackPropVReg      Hook -- before back-copy propagation
83     OriBackCopyPropagate           Backward copy propagation
84     OriPerformLiveDeadFourth       Liveness analysis, fourth pass
85     OriPropagateGmma               GMMA/WGMMA propagation
86     InsertPseudoUseDefForConvUR    Insert pseudo use/def for uniform reg conversion
87     FixupGmmaSequence              Fix up GMMA instruction sequences
88     OriHoistInvariantsLate3        Third late invariant hoisting
89     AdvancedPhaseSetRegAttr        Hook -- before register attribute setting
90     OriSetRegisterAttr             Set register attributes (types, constraints)
91     OriCalcDependantTex            Calculate dependent texture operations
92     AdvancedPhaseAfterSetRegAttr   Hook -- after register attribute setting
93     LateExpansionUnsupportedOps2   Second late unsupported op expansion
94     FinalInspectionPass            Final IR validity checks
95     SetAfterLegalization           Mark legalization complete

Group 5: Scheduling and Register Allocation (phases 96--105)

Index  Phase Name                     Purpose
96     ReportBeforeScheduling         Diagnostic dump before scheduling
97     AdvancedPhasePreSched          Hook -- before scheduling
98     BackPropagateVEC2D             Back-propagate 2D vector instructions
99     OriDoSyncronization            Insert synchronization instructions
100    ApplyPostSyncronizationWars    Apply post-synchronization write-after-read fixes
101    AdvancedPhaseAllocReg          Hook -- register allocation
102    ReportAfterRegisterAllocation  Diagnostic dump after regalloc
103    Get64bRegComponents            Extract 64-bit register components
104    AdvancedPhasePostExpansion     Hook -- after post-expansion
105    ApplyPostRegAllocWars          Apply post-regalloc write-after-read fixes

Group 6: Post-Schedule and Code Generation (phases 106--131)

Index  Phase Name                     Purpose
106    AdvancedPhasePostSched         Hook -- after scheduling
107    OriRemoveNopCode               Remove NOP instructions
108    OptimizeHotColdInLoop          Hot/cold partitioning within loops
109    OptimizeHotColdFlow            Hot/cold partitioning across flow
110    PostSchedule                   Post-scheduling fixups
111    AdvancedPhasePostFixUp         Hook -- after post-schedule fixup
112    PlaceBlocksInSourceOrder       Reorder blocks to match source order
113    PostFixForMercTargets          Mercury target-specific fixups
114    FixUpTexDepBarAndSync          Fix texture dependency barriers and sync
115    AdvancedScoreboardsAndOpexes   Hook -- before scoreboard generation
116    ProcessO0WaitsAndSBs           Process O0-level waits and scoreboards
117    MercEncodeAndDecode            Mercury encode to SASS and decode-verify
118    MercExpandInstructions         Expand macro instructions to SASS
119    MercGenerateWARs1              Generate write-after-read hazard stalls, pass 1
120    MercGenerateOpex               Generate operand exchange stalls
121    MercGenerateWARs2              Generate write-after-read hazard stalls, pass 2
122    MercGenerateSassUCode          Emit final SASS microcode
123    ComputeVCallRegUse             Compute virtual call register usage
124    CalcRegisterMap                Calculate final register map
125    UpdateAfterPostRegAlloc        Post-regalloc bookkeeping
126    ReportFinalMemoryUsage         Report final memory consumption
127    AdvancedPhaseOriPhaseEncoding  Hook -- before final encoding
128    UpdateAfterFormatCodeList      Update after code list formatting
129    DumpNVuCodeText                Dump NV microcode as text (debug)
130    DumpNVuCodeHex                 Dump NV microcode as hex (debug)
131    DebuggerBreak                  Debugger breakpoint (debug)

Group 7: Late Cleanup (phases 132--158)

Index     Phase Name                               Purpose
132       UpdateAfterConvertUnsupportedOps         Bookkeeping after late conversion
133       MergeEquivalentConditionalFlow           Merge equivalent conditional branches
134       AdvancedPhaseAfterMidExpansion           Hook -- after mid-expansion
135       AdvancedPhaseLateExpandSyncInstructions  Hook -- after late sync expansion
136       LateMergeEquivalentConditionalFlow       Late merge of equivalent conditionals
137       LateExpansionUnsupportedOpsMid           Mid-point late unsupported op expansion
138       OriSplitHighPressureLiveRanges           Split live ranges under high register pressure
139--158  (architecture-specific)                  20 additional phases whose names come from each phase's getName() virtual method

Phases 139--158 are not in the static name table at off_22BD0C0. Their names are returned by each phase's getName() virtual method. These are conditionally-enabled phases for specific architecture targets (SM variants) or optimization levels.

AdvancedPhase Hook Points

The 16 AdvancedPhase entries are insertion points for architecture-specific or optimization-level-specific processing. All return isNoOp() == true by default. When activated (typically by NvOptRecipe configuration for a specific SM target), they execute additional transformations at precisely defined points in the pipeline:

Index  Hook Name                                Insertion Context
4      AdvancedPhaseBeforeConvUnSup             Before ConvertUnsupportedOps
7      AdvancedPhaseAfterConvUnSup              After ConvertUnsupportedOps
47     AdvancedPhaseEarlyEnforceArgs            Before EnforceArgumentRestrictions
77     AdvancedPhaseLateConvUnSup               Before LateExpansionUnsupportedOps
82     AdvancedPhaseBackPropVReg                Before OriBackCopyPropagate
89     AdvancedPhaseSetRegAttr                  Before OriSetRegisterAttr
92     AdvancedPhaseAfterSetRegAttr             After OriSetRegisterAttr
97     AdvancedPhasePreSched                    Before scheduling pipeline
101    AdvancedPhaseAllocReg                    Register allocation entry point
104    AdvancedPhasePostExpansion               After post-regalloc expansion
106    AdvancedPhasePostSched                   After scheduling
111    AdvancedPhasePostFixUp                   After post-schedule fixup
115    AdvancedScoreboardsAndOpexes             Before scoreboard/opex generation
127    AdvancedPhaseOriPhaseEncoding            Before final instruction encoding
134    AdvancedPhaseAfterMidExpansion           After mid-level expansion
135    AdvancedPhaseLateExpandSyncInstructions  After late sync instruction expansion

Mercury Encoding Sub-Pipeline

Phases 113--122 form a self-contained sub-pipeline that transforms the optimized, register-allocated Ori IR into final SASS machine code via the Mercury encoding format:

PostFixForMercTargets (113)
  → FixUpTexDepBarAndSync (114)
    → [AdvancedScoreboardsAndOpexes hook (115)]
      → ProcessO0WaitsAndSBs (116)
        → MercEncodeAndDecode (117)      ← encode to SASS + decode for verification
          → MercExpandInstructions (118) ← expand remaining macros
            → MercGenerateWARs1 (119)    ← first WAR hazard pass
              → MercGenerateOpex (120)   ← operand exchange stalls
                → MercGenerateWARs2 (121)← second WAR hazard pass
                  → MercGenerateSassUCode (122) ← final microcode emission

"Mercury" is NVIDIA's internal name for the SASS encoding format on recent GPU architectures (Blackwell-era SM 100/103/110/120).

Diagnostic Strings

Address    String                  Emitted By  Context
0x22BC3B3  "[Pool Consumption = "  sub_C62200  After all phases summary
0x22BC416  "All Phases Summary"    sub_C64F70  End of dispatch loop
(inline)   " :: "                  sub_C64310  Phase timing line separator
(inline)   "[Total "               sub_C64310  Total memory delta
(inline)   "[Freeable "            sub_C64310  Freeable memory delta
(inline)   "[Freeable Leaked "     sub_C64310  Leaked memory delta
(inline)   "Before " / "After "    sub_C64F70  Phase execution diagnostic

Function Map

Address     Size   Function                         Confidence
sub_C60D20  16     Default phase table pointer      HIGH
sub_C60D30  3,554  Phase factory (159-case switch)  VERY HIGH
sub_C60BD0  334    Multi-function phase invoker     MEDIUM-HIGH
sub_C61B20  1,753  PhaseManager destructor          VERY HIGH
sub_C62200  888    Pool consumption reporter        VERY HIGH
sub_C62580  253    Timing record array resizer      HIGH
sub_C62640  223    Phase list resizer               HIGH
sub_C62720  4,734  PhaseManager constructor         VERY HIGH
sub_C639A0  1,535  Case-insensitive quicksort       HIGH
sub_C63FA0  556    Phase name table sort/rebuild    HIGH
sub_C641D0  305    Phase name-to-index lookup       VERY HIGH
sub_C64310  3,168  Per-phase timing reporter        VERY HIGH
sub_C64F70  1,455  Phase dispatch loop              VERY HIGH

Cross-References