Loop Passes

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

Eight phases in the ptxas pipeline transform loops in the Ori IR: one canonicalizer (phase 18), one unroller (phase 22), one software pipeliner (phase 24), four LICM instances (phases 35, 66, 79, 88), and one fusion pass (phase 59). Together they account for the largest category of repeated-pass instances in the pipeline -- the LICM family alone runs four times because intervening transformations (predication, legalization, GMMA fixup) repeatedly expose new invariants.

ptxas is not built on LLVM. Its loop infrastructure is a custom, non-SSA representation operating directly on the Ori IR's basic-block graph. Loop detection is performed by AnalyzeControlFlow (phase 3), which identifies back-edges, computes dominators, and annotates each basic block with a loop nesting depth stored at block offset +144. This nesting depth is the primary loop identity used by all eight passes.

Pass                      Phase  Vtable
OriLoopSimplification     18     off_22BD898
OriLoopUnrolling          22     off_22BD938
OriPipelining             24     off_22BD988
OriHoistInvariantsEarly   35     off_22BDB40
OriLoopFusion             59     off_22BDF00
OriHoistInvariantsLate    66     off_22BE018
OriHoistInvariantsLate2   79     off_22BE220
OriHoistInvariantsLate3   88     off_22BE388

Phase factory      sub_C60D30, cases 18, 22, 24, 35, 59, 66, 79, 88
Phase object size  16 bytes (standard {vtable_ptr, allocator_ptr})
IR level           Ori -- SASS opcodes with virtual registers, pre-RA
Loop detection     AnalyzeControlFlow (phase 3) -- back-edges, dominators, nesting depth
Related passes     3 AnalyzeControlFlow, 19 OriSplitLiveRanges, 21 OriStrengthReduce, 108 OptimizeHotColdInLoop

Pipeline Placement

Phase   3  AnalyzeControlFlow              ── builds CFG, identifies loops, computes dominators
Phase  13  GeneralOptimizeEarly            ── const fold + copy prop (feeds loop analysis)
Phase  15  OriBranchOpt                    ── branch simplification (may change loop shape)
Phase  16  OriPerformLiveDeadFirst         ── DCE removes dead loop bodies
Phase  18  OriLoopSimplification           ── CANONICALIZATION: single entry, preheader insertion
Phase  19  OriSplitLiveRanges              ── splits live ranges at loop boundaries
Phase  21  OriStrengthReduce               ── induction variable strength reduction
Phase  22  OriLoopUnrolling                ── UNROLLING: full/partial based on trip count
Phase  23  GenerateMovPhi                  ── SSA phi insertion (after unrolling changes CFG)
Phase  24  OriPipelining                   ── SOFTWARE PIPELINING: overlaps iterations
    ...
Phase  35  OriHoistInvariantsEarly         ── LICM #1: after GVN, before mid-expansion
    ...
Phase  59  OriLoopFusion                   ── FUSION: merges adjacent compatible loops
    ...
Phase  66  OriHoistInvariantsLate          ── LICM #2: after predication
    ...
Phase  79  OriHoistInvariantsLate2         ── LICM #3: after late unsupported-op expansion
    ...
Phase  88  OriHoistInvariantsLate3         ── LICM #4: after GMMA fixup
    ...
Phase 108  OptimizeHotColdInLoop           ── separates hot/cold paths within loops (post-RA)

Ordering Rationale

The eight loop passes are deliberately spread across the pipeline rather than clustered together. Each occupies a specific position dictated by what has been lowered or optimized upstream:

  1. Phase 18 (simplification) must run before strength reduction (21) and unrolling (22) because both require canonical loop forms.
  2. Phase 22 (unrolling) runs after strength reduction so that induction variable simplifications are already applied, avoiding redundant computation in unrolled copies.
  3. Phase 24 (pipelining) runs after unrolling because pipelining targets loops that were not fully unrolled.
  4. Phase 35 (early LICM) runs after GeneralOptimize at phase 29, which performs partial CSE, giving it common subexpressions to hoist.
  5. Phase 59 (fusion) runs after late expansion (phase 55) because expansion can split a single operation into a loop pair that fusion can reunite.
  6. Phases 66, 79, 88 (late LICM instances) each follow a major transformation that can create new loop-invariant code: predication (63), unsupported-op expansion (78), and GMMA fixup (87), respectively.

Loop Representation in Ori IR

ptxas does not use a dedicated loop descriptor data structure (no LoopInfo object like LLVM's). Instead, loop membership is implicit in the CFG through annotations computed by AnalyzeControlFlow (phase 3):

BB Field          Offset  Type          Meaning
loop_depth        +144    int           Loop nesting depth (0 = not in loop)
loop_depth_equal  +152    int           Copy of loop_depth, used for sibling detection
predecessor_list  +128    linked_list*  List of predecessor block indices
successor_list    +136    linked_list*  List of successor block indices

A loop header is a block whose loop_depth equals its own back-edge source's depth. Back-edge information is stored in the Code Object's back-edge hash map at offset +680. Diagnostic output from sub_BDEA50 prints this information as "bix%d -> backedge's successor BB: %d".

The block iteration order is controlled by a reverse-post-order (RPO) array stored at Code Object offset +512. All loop passes iterate over this array, ensuring they visit headers before inner blocks. The array length is at Code Object offset +520.
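To make the implicit representation concrete, here is a minimal Python model (not ptxas code) of the two properties described above: an RPO array in which headers precede their loop bodies, and header identification via back-edges. The dict-based CFG shape is an assumption for illustration only.

```python
# Illustrative model of the implicit loop representation: blocks carry no
# dedicated loop descriptor; passes walk an RPO array and find headers by
# looking for predecessors that appear later in that order (back-edges).

def reverse_post_order(blocks, entry):
    """Compute an RPO array like the one cached at code_object+512."""
    order, visited = [], set()

    def dfs(b):
        visited.add(b)
        for succ in blocks[b]["successors"]:
            if succ not in visited:
                dfs(succ)
        order.append(b)  # post-order on exit

    dfs(entry)
    return order[::-1]  # reverse post-order: headers before loop bodies

def loop_headers(blocks, rpo):
    """A block is a header if some predecessor appears later in RPO."""
    pos = {b: i for i, b in enumerate(rpo)}
    return [b for b in rpo
            if any(pos.get(p, -1) > pos[b] for p in blocks[b]["predecessors"])]

# Tiny CFG: 0 -> 1 -> 2 -> 1 (back-edge), 2 -> 3 (exit)
cfg = {
    0: {"successors": [1], "predecessors": []},
    1: {"successors": [2], "predecessors": [0, 2]},
    2: {"successors": [1, 3], "predecessors": [1]},
    3: {"successors": [], "predecessors": [2]},
}
```

Running this on the tiny CFG yields the RPO `[0, 1, 2, 3]` and identifies block 1 as the sole loop header.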


Phase 18 -- OriLoopSimplification

Purpose

Canonicalizes loop structure to simplify downstream analysis. Ensures each natural loop has a single entry edge, inserts dedicated preheader blocks where needed, and normalizes back-edge shapes. This is a prerequisite for strength reduction, unrolling, and pipelining, all of which assume canonical loop form.

Entry Point

sub_C5FB00 (34 bytes)          ── vtable execute(), calls sub_7DDB50
  └─ sub_78B430 (1,172 bytes)  ── LoopMakeSingleEntry core
       ├─ sub_7753F0            ── pre-pass: loop peeling setup
       ├─ sub_789BE0            ── canonicalize back-edges
       ├─ sub_781F80            ── rebuild instruction list
       └─ sub_9253C0            ── split edges / insert preheader

Algorithm

function LoopSimplification(code_object):
    if code_object.flags[1368] & 1 == 0:          // optimization disabled
        return

    // Phase 1: optional loop peeling for O4+ or flagged functions
    if opt_level in {4,5} or flags[1382] & 4 set:
        peeled = PeelOuterEdges(code_object, 0)         // sub_7753F0
        canonicalized = CanonicalizeBackEdges(code_object, peeled)  // sub_789BE0
    else:
        canonicalized = CanonicalizeBackEdges(code_object, 0)

    if code_object.flags[1368] & 1 == 0:          // re-check after canon
        return

    // Phase 2: single-entry enforcement
    if not QueryKnob("LoopMakeSingleEntry", knob_487):  // OCG knob 487
        return

    RebuildInstructionList(code_object, 1)               // sub_781F80
    for each block in RPO order:
        if block.loop_depth > 0 and block is loop header:
            // find the deepest-nesting back-edge target
            // if multiple entries exist, split into single-entry form
            // insert preheader block between external predecessors and header
            InsertPreheaderIfNeeded(code_object, block)  // sub_9253C0
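The effect of InsertPreheaderIfNeeded (sub_9253C0) can be sketched as follows. This is a minimal Python illustration of the single-entry transformation, not the decompiled logic; the dict-based CFG and block-numbering scheme are assumptions.

```python
def insert_preheader(cfg, header, back_edge_sources):
    """Redirect all non-back-edge predecessors of `header` through a new
    preheader block, so the loop has exactly one external entry edge."""
    external = [p for p in cfg[header]["predecessors"]
                if p not in back_edge_sources]
    if len(external) <= 1:
        return None  # already in canonical single-entry form
    pre = max(cfg) + 1  # new block id (illustrative numbering)
    cfg[pre] = {"predecessors": list(external), "successors": [header]}
    for p in external:
        # retarget each external predecessor's edge at the preheader
        cfg[p]["successors"] = [pre if s == header else s
                                for s in cfg[p]["successors"]]
    cfg[header]["predecessors"] = [p for p in cfg[header]["predecessors"]
                                   if p in back_edge_sources] + [pre]
    return pre
```

After the transformation the header keeps its back-edge predecessors plus exactly one forward predecessor (the new preheader), which is the invariant strength reduction and unrolling rely on.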

GPU-Specific Considerations

The simplification pass checks the optimization level at offset +896 of the code object. Levels 4 and 5 (-O4, -O5) enable aggressive loop peeling via sub_7753F0 before canonicalization. At the default -O2, peeling is suppressed to avoid code size growth that could cause instruction cache thrashing.

The LoopMakeSingleEntry knob (OCG knob 487) is the master enable. When disabled, only back-edge canonicalization runs -- preheader insertion is skipped. This knob is checked via the standard OCG knob query at offset +152 of the allocator vtable.

The pass also inspects the convergence flag at offset +1380 (bit 7). When set, it indicates a convergent execution context (e.g., warp-synchronous code), and certain edge-splitting transformations are suppressed to avoid disrupting convergence guarantees.

Knob Name                       Default   Description
LoopInversion                   enabled   Enable loop inversion (do-while to while conversion)
LoopInversionBudget             unset     Maximum instruction count for loop inversion
LoopPeelInversion               disabled  Enable loop peeling combined with inversion
EnableSingleThreadPeelingLoops  unset     Enable peeling for single-thread execution paths
GenPeelingLoopsForSyncs         unset     Generate peeling loops around sync instructions
AssertIfPeelingLoopForTexSurf   unset     Assert (debug) if peeling a loop for texture/surface ops

Phase 22 -- OriLoopUnrolling

Purpose

Performs full unrolling of loops with known small trip counts and partial unrolling of larger loops to amortize loop overhead and expose instruction-level parallelism. This is one of the most impactful optimization passes for GPU code, where loops over texture coordinates, reduction accumulators, and matrix tiles dominate execution time.

Function Map

Correction (P1-04): The W023 report incorrectly listed sub_83EF00 as the unrolling driver. That function is the MainPeepholeOptimizer (confirmed by p1.06a sweep). The actual unrolling call chain starts at sub_1392E30.

Function     Size         Role                                                                  Confidence
sub_1392E30  25 lines     Phase 22 execute entry: guards, calls initializer + driver + cleanup  HIGH
sub_1389AF0  593 lines    Unrolling context initializer: reads all knobs from OCG profile       HIGH
sub_1390B30  1,598 lines  Main unrolling driver: per-loop decision, factor selection, dispatch  HIGH
sub_138A6E0  774 lines    Post-unroll cleanup: frees working structures                         HIGH
sub_7E5120   19 lines     Nounroll/skip check: pragma flag, convergence, knob 91                HIGH
sub_7F5D20   99 lines     Rejection recording: indexes string table at 0x21D1EA0                HIGH
sub_138E3E0  125 lines    Loop body scanner: three-pass analysis (header, forward, backward)    HIGH
sub_13858C0  42 lines     Loop back-edge locator                                                HIGH
sub_1385E90  ~200 lines   Trip count bound extractor (init, limit, stride from IV)              MEDIUM
sub_1383620  1,157 lines  Full unroll profitability evaluator (foldable constants, addresses)   MEDIUM
sub_1387C30  ~400 lines   Partial unroll body replicator                                        MEDIUM
sub_13880F0  ~200 lines   Post-unroll CFG fixup                                                 MEDIUM
sub_1385950  ~300 lines   Induction variable analysis                                           MEDIUM
sub_138E9C0  ~400 lines   IV stride/direction verification                                      MEDIUM
sub_1385CC0  ~200 lines   IV constant detection                                                 MEDIUM
sub_13829F0  ~200 lines   Profitability: foldable constant load counting                        MEDIUM
sub_A3A7E0   1,236 lines  Post-unroll statistics (DUMPIR output)                                HIGH

Unrolling Decision Algorithm

The unrolling decision is a multi-stage pipeline implemented in sub_1390B30. The function iterates over loops in reverse RPO order (innermost first, matching the RPO array at code_object+512) and applies a series of eligibility checks, trip count analysis, factor selection, and profitability evaluation before committing to the unroll.

Entry Guard (sub_1392E30)

function OriLoopUnrolling_Execute(code_object):
    if code_object.flags[1368] & 1 == 0:           // optimization disabled
        return
    if code_object.flags[1397] & 0xC0 == 0x40:     // global nounroll override
        return
    if DUMPIR_skip("LoopUnrolling"):                // sub_799250
        return
    if CountBlocks(code_object) <= 2:               // sub_7DDB50
        return
    if not QueryKnob(487, true):                    // master loop pass guard
        return

    ctx = InitializeContext(code_object)             // sub_1389AF0
    RunUnrolling(ctx)                                // sub_1390B30
    Cleanup(ctx)                                     // sub_138A6E0

Context Initialization and Knob Defaults (sub_1389AF0)

The initializer reads unrolling parameters from the OCG profile object. Each knob carries a selector flag: 0 = use hardcoded default, 1 = use integer override, 2 = use float override, 3 = use double override. The defaults recovered from the binary:

Context Field    Profile Offset  Default  Knob Name (inferred)
ctx+168 (int32)  +31320          140      UnrollBudget
ctx+172 (float)  +31032          0.25     UnrollFlexableFullLimit
ctx+176 (int32)  +30960          4        UnrollUnknownCount
ctx+180 (int32)  +30816          4        UnrollSmallLoopLimit
ctx+184 (dbl)    +64656          0.4      LoopUnrollLargePartOfShaderPct
ctx+192 (float)  +31392          20.0     UnrollInstLimit
ctx+196 (int32)  +64872          50       UnrollPregThreshold
ctx+200 (int32)  +31248          2        UnrollExtraInstPerPercentSaving
ctx+204 (int32)  +31176          200      UnrollFullInstLimit
ctx+208 (int32)  +64296          46       LoopUnrollNumExtraInstBase
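The selector scheme above can be sketched in a few lines of Python. This is an interpretive model, not decompiled code; the `(selector, value)` tuple shape and the field names are assumptions for illustration.

```python
# Sketch of the override-selector scheme: each context field is filled from a
# hardcoded default unless the profile carries a typed override for that knob.

HARDCODED_DEFAULTS = {"UnrollBudget": 140, "UnrollInstLimit": 20.0}

def read_knob(profile, name):
    """profile[name] is (selector, value); selector 0 means 'use default',
    while 1/2/3 select an int/float/double override respectively."""
    selector, value = profile.get(name, (0, None))
    if selector == 0:
        return HARDCODED_DEFAULTS[name]
    if selector == 1:
        return int(value)
    return float(value)  # selector 2 or 3
```

With an empty profile every field resolves to its hardcoded default; a profile entry such as `{"UnrollBudget": (1, 64)}` overrides the budget with the integer 64.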

Boolean and integer knobs read via vtable dispatch:

Knob ID  Profile Offset  Default  Knob Name
437      +31464          true     LoopUnroll (master enable)
894      +64368          true     LoopUnrollNonInnermost
897      +64584          true     UnrollMultiBlockLoops
902      +64944          true     UnrollVariableBounds
896      +64512          0        LoopUnrollFactor (INT override; 0 = heuristic)
895      +64440          0        EpilogueLoopUnrollCount
900      +64800          0        LoopUnrollNumInstTex
903      +65016          false    DisablePartialUnrollOverflowCheck

String knob: knob 427 (profile+30744) returns the LoopUnrollFactor per-block override string, with the format "-N-" to skip block N, "+N+" to force-unroll block N, "-" to skip all, "+" to force all.
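A minimal interpreter for that override format, matching the semantics described above (the function name is invented for illustration):

```python
def block_override(override, block_id):
    """Interpret the per-block LoopUnrollFactor override string:
    '-' skips every block, '+' force-unrolls every block,
    '-N-' skips block N, '+N+' force-unrolls block N."""
    if override == "-" or f"-{block_id}-" in override:
        return "skip"
    if override == "+" or f"+{block_id}+" in override:
        return "force"
    return "heuristic"
```

For example, with the string `"-3-+7+"` block 3 is skipped, block 7 is force-unrolled, and every other block falls back to the heuristic.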

Nounroll Pragma Check (sub_7E5120)

Returns true (suppress unrolling) when any of these conditions hold:

  1. Convergence constraint: The back-edge analysis context at code_object+1784 is active, and the loop header's entry in the back-edge table (code_object+1776+16) is valid and within the convergence limit. This suppresses unrolling of warp-synchronous loops.
  2. PTX nounroll pragma: Byte 292 of the block descriptor at (code_object+368 + 8*block_idx) has bit 1 set. This bit is set during PTX-to-Ori lowering when the nounroll pragma string (at 0x1CFE126) is parsed.
  3. Instruction-level marker: Byte 283 of the loop header instruction has bit 0 set.
  4. Per-block knob: OCG knob 91 is set for this block (queried via sub_7A1A90).

Main Decision Flowchart (sub_1390B30)

function RunUnrolling(ctx):
    code_object = ctx.code_object

    // Phase 1: Read master enable and per-block override string
    master_enable = QueryKnob(437)                   // LoopUnroll
    override_string = QueryKnobString(427)           // "-N-" / "+N+" format
    RecomputeRegisterPressure(code_object)            // sub_7E6090
    RebuildInstructionList(code_object)               // sub_781F80

    // Phase 2: Pre-scan -- count inlinable calls and non-unrollable instructions
    for each instruction in code_object.instruction_list:
        if opcode == 97 (BRX):
            if callee.entry_block == callee.exit_block:
                inlinable_calls++
                if trip_count > 1:
                    multi_exit |= AnalyzeMultiExit(ctx, callee)

    // Phase 3: Iterate loops in reverse RPO (innermost first)
    rpo_count = code_object.rpo_count                // offset +520
    for idx = rpo_count-1 downto 0:
        block = code_object.blocks[code_object.rpo[idx]]

        // ── Step A: nounroll annotation propagation ──
        if block.nounroll_annotation:                // byte +246
            propagate nounroll to all blocks at >= same nesting depth

        // ── Step B: eligibility filter ──
        if block.loop_depth == 0:          continue  // not a loop
        if block.loop_depth != block.loop_depth_equal: continue
        if block.nounroll and not ctx.force_all:     continue

        // ── Step C: structure analysis ──
        latch = LocateBackEdge(ctx, block)           // sub_13858C0
        if not latch:                    continue
        exit_inst = latch.last_instruction
        if exit_inst.opcode != 95:                   // not conditional branch
            Reject(block, 13); continue              // indirect jump

        // ── Step D: nounroll / convergence check ──
        if CheckNounroll(block, code_object):        // sub_7E5120
            Reject(block, 11); continue

        // ── Step E: execution frequency analysis ──
        freq_header = code_object.freq_table[header_reg]
        freq_latch  = code_object.freq_table[latch_reg]
        is_hot = (freq_latch > 999) and (freq_header > 0)
                 and (freq_latch / freq_header > 3)

        // ── Step F: body analysis ──
        body_info = ScanLoopBody(ctx, block, latch)  // sub_138E3E0
        // body_info contains: tex_count, body_size, foldable_ldc_count,
        //                     has_cross_edges, mem_count
        if body_info.has_cross_edges:    continue

        // ── Step G: budget computation ──
        budget_scale = QueryKnobDouble(898, 0.5)     // default 0.5
        scaled_body = (int)(budget_scale * body_size)
        remaining = total_budget - body_size - scaled_body - ...

        // ── Step H: per-block override check ──
        if override_string:
            needle = "-{block_id}-"
            if override_string == "-" or strstr(override_string, needle):
                continue                             // skip this block
            needle = "+{block_id}+"
            if override_string == "+" or strstr(override_string, needle):
                force_unroll = true

        // ── Step I: pragma force-unroll ──
        if flags[1397] & 0xC0 == 0x80:              // PTX pragma force
            force_unroll = true

        // ── Step J: non-innermost filter ──
        if not ctx.allow_non_innermost and not force_unroll:
            if 10 * body_info.tex_count < remaining:
                Reject(block, 7); continue

        // ── Step K: factor selection ──
        if force_unroll:
            factor = 1 << ctx.force_factor           // power-of-2 override
        else if known_trip_count:
            factor = trip_count
            // Budget-constrain: while factor * body_cost > UnrollBudget:
            //     factor--
            if factor > 4 and trip_count == 1:
                factor &= ~3                         // round to mult-of-4
            if factor <= 1:
                Reject(block, 12); continue
        else:
            if body_size <= 49 and body_info.tex_count > 0:
                factor = 2                           // conservative default
            else:
                factor = max(1, UnrollBudget / body_cost)

        // ── Step L: knob override ──
        if QueryKnob(429):                           // LoopUnrollFactor INT
            factor = GetKnobInt(429)

        // ── Step M: IV analysis ──
        iv_info = AnalyzeIV(ctx, latch)              // sub_1385950
        if not iv_info:             Reject(block, 14); continue
        if not ValidateIV(ctx, iv_info):             // sub_1387870
                                    Reject(block, 14); continue
        bound = ExtractBound(ctx, iv_info)           // sub_1385E90
        if not bound or bound.opcode != 2:
                                    Reject(block, 16); continue
        if bound.def_block.predecessor_count != 1:
                                    Reject(block, 17); continue
        if bound.init_reg == bound.limit_reg:
                                    Reject(block, 18); continue
        stride_ok = VerifyStride(ctx, block, latch, iv_info, bound)
        if stride_ok & 2:          Reject(block, 17); continue
        if stride_ok & 1:          Reject(block, 18); continue

        // ── Step N: detect constant trip count ──
        const_iv = DetectConstantIV(ctx, iv_info)    // sub_1385CC0

        // ── Step O: profitability for full unroll ──
        if factor == trip_count and single_block_body:
            if CheckFoldableProfitability(ctx, block, iv_info, factor):
                ReplicateFullUnroll(ctx, block, factor) // sub_1383620
                stats.unrolled_count++
                continue

        // ── Step P: partial unroll execution ──
        if factor >= 2:
            remainder = trip_count % factor
            iterations_per_copy = (trip_count - remainder) / factor
            block.iterations_per_copy = iterations_per_copy
            if remainder > 0:
                for r = 0 to remainder-1:
                    DuplicateBody(ctx, block)         // sub_932E40
            ReplicatePartialUnroll(ctx, block, latch,
                factor, remainder)                    // sub_1387C30
            stats.unrolled_count++
        else:
            Reject(block, 24)                         // budget exceeded

    // Phase 4: Post-unroll fixup
    stats.non_unrolled = total_loops - stats.unrolled - stats.failed
    if any_unrolled:
        RebuildBackEdges(code_object)                 // sub_7846F0
        RerunLiveness(code_object)                    // sub_A0F020
        RerunControlFlow(code_object)                 // sub_752E40
        MarkModified(code_object)                     // sub_7B52B0

Unroll Rejection Table

When a loop cannot be unrolled, sub_7F5D20 records the reason by indexing a string pointer array at 0x21D1EA0. The diagnostic strings contain hex codes like "0x80000001 - Not unrolled: Irregular loop" -- these hex values are part of the printed message text, not the internal array index. The W023 report originally described a 36-byte structure table at 0x21D1980; that table belongs to the operand range lookup in the peephole optimizer (sub_7E39B0), not the unrolling pass. The actual internal rejection codes are simple integers indexing the string array:

Code  Category        Reason
7     Performance     Body too large relative to texture savings (10 * tex_count < remaining_budget)
11    Pragma/knob     PTX nounroll pragma, convergence constraint, or per-block knob 91
12    Budget          Partial unroll factor reduced to 1 (no factor >= 2 fits within UnrollBudget)
13    Ineligible      Loop exit contains BRX (indirect jump, opcode 95 with special flags)
14    Unsupported IV  Induction variable analysis failed (sub_1385950 or sub_1387870)
15    Unsupported IV  IV register class is not integer (class 1) or pointer (class 2/3)
16    Trip count      Trip count bound extraction failed (sub_1385E90)
17    Irregular       IV definition block has multiple predecessors, or stride/direction verification failed
18    Trip count      IV initial value register equals IV limit register (degenerate zero-trip loop)
19    Unsupported IV  IV stride sign inconsistent between loop header and induction increment
24    Budget          Catch-all: budget exceeded after all factor reduction attempts

The diagnostic output is gated by flags[1421] & 0x20 (DUMPIR verbose mode). When enabled, the rejection string is recorded in a hash map keyed by the loop header instruction node, using FNV-1a hashing of the node's block index.
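For reference, standard 32-bit FNV-1a looks like this; the hash width and the exact byte encoding of the block index that ptxas feeds into it are assumptions.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte into the hash, then multiply
    by the FNV prime with 32-bit wraparound."""
    h = 0x811C9DC5                             # FNV-1a offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF      # FNV prime, truncated to 32 bits
    return h

# e.g. hashing a loop header's block index rendered as 4 little-endian bytes
key = fnv1a_32((42).to_bytes(4, "little"))
```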

Heuristic Thresholds (Knobs)

The unrolling decision is controlled by a rich set of OCG knobs. All knob names are stored ROT13-encoded in the binary:

Knob Name                           Type    Default  Description
LoopUnroll                          BOOL    true     Master enable for loop unrolling
LoopUnrollFactor                    INT     0        Override unroll factor (0 = heuristic)
UnrollBudget                        INT     140      Maximum total instruction count after unrolling
UnrollInstLimit                     FLOAT   20.0     Maximum instructions in a single unrolled loop body
UnrollFullInstLimit                 INT     200      Maximum body size for full unrolling
UnrollFlexableFullLimit             FLOAT   0.25     Flexible full-unroll limit (adjusted by loop characteristics)
UnrollSmallLoopLimit                INT     4        Body size threshold below which loops are always fully unrolled
UnrollPregThreshold                 INT     50       Maximum predicate register pressure for unrolling
UnrollMultiBlockLoops               BOOL    true     Allow unrolling of multi-basic-block loop bodies
UnrollVariableBounds                BOOL    true     Allow unrolling when trip count is not compile-time constant
UnrollUnknownCount                  INT     4        Default trip count assumption when count is unknown
UnrollUnknownInstLimit              INT     0        Maximum body size for unrolling with unknown trip count
UnrollExtraInstPerPercentSaving     INT     2        Instructions allowed per percent of cycle saving
UnrollTex3DPercentSavedThreshold    INT     0        Minimum savings percent for 3D texture loops
UnrollProfiledColdInstsScale        INT     0        Scale factor for instruction count in profiled-cold blocks
LoopUnrollExtraFoldableLdcWeight    INT     0        Extra weight for foldable constant loads in unroll benefit
LoopUnrollFoldableAddrWeight        INT     0        Weight for foldable address computations
LoopUnrollLargePartOfShaderPct      DOUBLE  0.4      Percentage threshold: loop is "large part of shader"
LoopUnrollNumExtraInstBase          INT     46       Base extra instruction allowance per unroll iteration
LoopUnrollNumInstSmallLoop          INT     0        Instruction count defining "small loop"
LoopUnrollNumInstTex                INT     0        Texture instruction count bonus for unrolling
LoopUnrollSingleLoopSavedPctFactor  INT     0        Savings factor for single-loop shaders
LoopUnrollNonInnermost              BOOL    true     Allow unrolling of non-innermost loops
LoopUnrollUnknownMultiBlock         BOOL    false    Allow multi-block unroll with unknown bounds
EpilogueLoopUnrollCount             INT     0        Unroll count for epilogue (remainder) loops
DisablePartialUnrollOverflowCheck   BOOL    false    Skip overflow check on partial unroll count
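Since ROT13 is its own inverse, the knob names can be recovered (or re-obfuscated) with a few lines of Python:

```python
def rot13(s: str) -> str:
    """Decode a ROT13-obfuscated knob name: letters rotate by 13 places,
    all other characters pass through unchanged."""
    out = []
    for c in s:
        if "a" <= c <= "z":
            out.append(chr((ord(c) - ord("a") + 13) % 26 + ord("a")))
        elif "A" <= c <= "Z":
            out.append(chr((ord(c) - ord("A") + 13) % 26 + ord("A")))
        else:
            out.append(c)
    return "".join(out)

# ROT13 is an involution, so the encoded form of "LoopUnroll" is "YbbcHaebyy"
assert rot13("YbbcHaebyy") == "LoopUnroll"
```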

GPU-Specific Unrolling Concerns

Register pressure. GPU threads share a fixed register file per SM. Unrolling increases live ranges, potentially reducing occupancy (the number of concurrent warps). The unroller queries register pressure estimates and compares against UnrollPregThreshold before committing.
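A back-of-envelope illustration of the occupancy trade-off (not ptxas's actual heuristic; the register file size, warp limit, and allocation granularity are assumed hardware parameters):

```python
def max_warps_per_sm(regs_per_thread,
                     regfile_per_sm=65536,   # assumed: 64K 32-bit registers per SM
                     threads_per_warp=32,
                     hw_warp_limit=48,       # assumed hardware warp cap
                     alloc_granularity=8):   # regs rounded up per thread
    """Occupancy upper bound from register pressure alone (illustrative)."""
    regs = -(-regs_per_thread // alloc_granularity) * alloc_granularity
    regs_per_warp = regs * threads_per_warp
    return min(hw_warp_limit, regfile_per_sm // regs_per_warp)

# Unrolling that doubles live ranges from 32 to 64 registers per thread
# halves the register-limited warp bound:
assert max_warps_per_sm(32) == 48   # 65536 // 1024 = 64, capped at 48
assert max_warps_per_sm(64) == 32   # 65536 // 2048 = 32
```

This is why the unroller compares its pressure estimate against UnrollPregThreshold before committing: each extra live register per thread is multiplied by every thread in every resident warp.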

Instruction cache. GPU instruction caches are small (typically 128KB L1i per SM). Aggressive unrolling of large loop bodies can cause i-cache thrashing. The UnrollBudget knob caps the total instruction growth.

Texture instruction scheduling. Texture fetches have high latency (hundreds of cycles). Unrolling loops containing texture operations is especially profitable because it exposes independent fetches that the scheduler can overlap. The LoopUnrollNumInstTex and UnrollTex3DPercentSavedThreshold knobs give extra weight to texture-heavy loops.

PTX nounroll pragma. The PTX string nounroll at 0x1CFE126 is parsed during PTX-to-Ori lowering and sets bit 1 of byte 292 in the block descriptor at (code_object+368 + 8*block_idx). The check is performed by sub_7E5120, which also tests three additional suppression conditions: the convergence constraint (back-edge table at code_object+1776), an instruction-level marker (byte 283 bit 0), and per-block knob 91. Any single condition is sufficient to suppress unrolling for that loop (rejection code 11).

Convergence constraint. When the back-edge analysis context at code_object+1784 is active (indicating warp-synchronous code), the unroller checks whether the loop header falls within the convergence region. If it does, unrolling is suppressed to avoid breaking warp-level synchronization guarantees. This is particularly important for cooperative groups and ballot-based algorithms.

DUMPIR Statistics

When diagnostics are enabled, the pass outputs:

# [partially unrolled loops=N] [non-unrolled loops=M]

This line appears in eight SM-variant statistics printers (sub_ABBA50 through sub_ABEB50), each a 1,771-byte clone specializing output format for a specific SM generation.


Phase 24 -- OriPipelining

Purpose

Performs modulo software pipelining on loops that were not fully unrolled. The pass overlaps successive loop iterations by interleaving instructions from different iterations within a single loop body, hiding functional unit and memory latency. This is the single most complex loop transformation in ptxas.

Two-Layer Pipelining Architecture

ptxas implements software pipelining in two cooperating layers:

  1. Phase 24 (OriPipelining, pre-RA): Annotates instruction operands with pipeline latency classes, computes the minimum initiation interval (MII), performs the modulo scheduling loop transformation (iteration overlap, prolog/epilog generation). Operates on the Ori IR before register allocation.

  2. Post-RA SoftwarePipeline (sub_8B9390, 23KB): A scheduling algorithm variant within the post-RA instruction scheduler (address range 0x893000--0x8FE000) that performs instruction-level scheduling of already-pipelined loop bodies using physical registers. One of approximately 12 scheduling variants alongside DualIssueScheduler, TensorScheduler, LoopScheduler, PrefetchScheduler, etc.

The two layers cooperate: Phase 24 transforms the loop structure (instruction replication, prolog/epilog construction) before register allocation. The post-RA SoftwarePipeline variant handles the cycle-accurate instruction placement of already-pipelined loops.

Function Map

Function    Size          Role                                                                          Confidence
sub_926A30  22,116 bytes  Per-instruction operand latency annotator and encoding rewriter              HIGH
sub_91A0F0  5,550 bytes   Opcode-to-latency-class classifier (~350 opcodes, 13 distinct classes)       HIGH
sub_9203A0  4,881 bytes   Pipeline stage cost calculator (ResMII computation, FP cost accumulation)    MEDIUM
sub_921820  1,592 bytes   Prolog/epilog code generator                                                 MEDIUM
sub_9202D0  207 bytes     Two-operand pipeline feasibility check (returns 60=reject, 130=accept)       HIGH
sub_91E610  399 bytes     Register-class-based latency lookup (class 4→26, class 5/2→20)               HIGH
sub_91E900  470 bytes     Pipe-assignment-based stall cycle calculator (32/64 cycle caps)              HIGH
sub_92C0D0  358 bytes     Per-instruction annotation wrapper (calls sub_926A30, checks opcode changes) HIGH
sub_92C240  8,033 bytes   Extended GEMM-loop pipeliner (SM90+ TMA pipeline depth management)           MEDIUM
sub_8B9390  22,841 bytes  Post-RA software pipelining scheduling variant (in scheduler subsystem)      MEDIUM

Correction (P1-06): The original function map listed sub_926A30 as the "main pipelining engine (modulo scheduling)." Decompilation reveals it is the per-instruction operand latency annotator -- it iterates over each operand of an instruction, calls sub_91A0F0 to classify the operand's latency class, and rewrites the operand encoding with the latency annotation. The modulo scheduling loop transformation is distributed across the remaining functions, with sub_9203A0 computing stage costs and sub_921820 generating prolog/epilog code.

Software Pipelining Algorithm

Phase 1: Operand Latency Annotation

For each instruction in the loop body, sub_92C0D0 calls sub_926A30 to annotate operands:

function AnnotateOperandLatencies(code_object, instruction):
    opcode = instruction.word & 0xFFFFCFFF      // strip modifier bits (bits 12-13)
    secondary_opcode = instruction.secondary_opcode
    operand_array = instruction.operands         // offset +84
    operand_count = instruction.operand_count    // offset +80

    for i in 0..operand_count-1:
        operand_type = (operand_array[i].word >> 28) & 7
        if operand_type in {2, 3}:               // register or register pair
            // Adjust count for predicated instructions (bit 12)
            adjusted_count = operand_count - 2 * ((opcode >> 11) & 2 != 0)
            if i < adjusted_count:
                latency_class = ClassifyLatency(opcode, secondary_opcode,
                                                operand_array, adjusted_count, i)
                if latency_class != default:
                    RewriteOperandEncoding(operand_array[i], code_object, latency_class)

        // For register operands: call full rewriter sub_922210
        // For non-register operands: call sub_9267C0

Phase 2: Pipeline Feasibility Filtering

Each instruction is checked by sub_9202D0:

function CheckPipelineFeasibility(code_object, instruction):
    // Reject instructions with special operand flags
    if (operand_array[1] & 0x603FFFF) != 0 or (operand_array[3] & 0xF8000000) != 0:
        if optimization_level > 1:
            return REJECT                        // return code 60

    // Reject if pipe assignment class <= 3 (control/barrier pipe)
    pipe_class = PipeAssignment(code_object, primary_opcode)   // vtable+904
    if pipe_class <= 3:
        return REJECT

    // Reject if operand 0 and operand 1 have different latency classes
    lat0 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 0)
    lat1 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 1)
    if lat0 != lat1:
        return REJECT                            // asymmetric latencies

    // Reject if extended operands have blocking flags
    if operand_count > 2 and ((operand_array[4] & 0xF) != 0 or (operand_array[4] >> 4) & 1):
        return REJECT

    // Accept: trim to 2-operand form
    result_operands = &operand_array[2]
    result_count = 2
    return ACCEPT                                // return code 130

Phase 3: MII Computation

The minimum initiation interval is computed as:

MII = max(RecMII, ResMII)

RecMII (recurrence-constrained): The longest data dependence cycle in the DDG divided by the iteration distance it spans. For a cycle of total latency L spanning D iterations: RecMII = ceil(L / D).

ResMII (resource-constrained): Computed by sub_9203A0 using floating-point cost accumulation. The function classifies each instruction's pipe class using a 7-entry pipe class table at code_object+16 and accumulates per-pipe instruction counts:

function ComputeResMII(loop_body, pipe_table):
    pipe_counts[0..6] = {0}
    for each instruction in loop_body:
        lat0 = ClassifyLatency(instruction, operand=0)
        lat1 = ClassifyLatency(instruction, operand=1)
        pipe = MapLatencyToPipe(lat0, pipe_table)    // 7-entry lookup
        pipe_counts[pipe] += cost(instruction)       // FP cost weights

    ResMII = max(pipe_counts[i] / pipe_width[i] for i in 0..6)

The pipe class boundaries stored at code_object+16 define 7 functional unit classes. Each class has a capacity (number of execution slots per cycle). ResMII is the maximum ratio of instruction demand to capacity across all pipe classes.
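
As a rough, runnable model of the MII computation (toy cycle and pipe data; the real implementation reads the DDG and the 7-entry pipe table at code_object+16):

```python
import math

def rec_mii(cycles):
    # cycles: (total_latency, iteration_distance) per DDG dependence cycle
    return max(math.ceil(lat / dist) for lat, dist in cycles)

def res_mii(pipe_counts, pipe_widths):
    # maximum demand-to-capacity ratio across the 7 pipe classes
    return max(math.ceil(cnt / width)
               for cnt, width in zip(pipe_counts, pipe_widths))

# Toy loop: a latency-6 cycle spanning 2 iterations and a latency-5
# cycle spanning 1 iteration; pipe class 4 is the bottleneck resource.
cycles = [(6, 2), (5, 1)]
counts = [4, 0, 2, 0, 6, 0, 0]   # weighted instruction count per pipe class
widths = [2, 1, 1, 1, 2, 1, 1]   # execution slots per cycle per class
mii = max(rec_mii(cycles), res_mii(counts, widths))
```

Here the recurrence bound (ceil(5/1) = 5) dominates the resource bound (ceil(6/2) = 3), so the schedule search starts at II = 5.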

Phase 4: Modulo Schedule Construction

function ModuloSchedule(loop_body, MII):
    II = MII
    while II <= MAX_II:
        MRT = new ModuloReservationTable(II)     // II rows x pipe_classes columns
        success = true

        for each instruction in priority order:
            earliest = max(data_dependency_constraints)
            latest = earliest + II - 1
            placed = false

            for slot in earliest..latest:
                row = slot mod II
                pipe = instruction.pipe_class
                if MRT[row][pipe] has capacity:
                    MRT[row][pipe] -= 1
                    instruction.scheduled_time = slot
                    instruction.stage = slot / II
                    placed = true
                    break

            if not placed:
                success = false
                break

        if success:
            return (II, schedule)
        II += 1

    return FAILURE                               // could not pipeline
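
The search loop above can be sketched as a runnable toy (single reservation counter per pipe per row; the data layout is illustrative, not ptxas's):

```python
def modulo_schedule(instrs, mii, num_pipes, capacity, max_ii=16):
    # instrs: (earliest_slot, pipe_class) pairs in priority order
    for ii in range(mii, max_ii + 1):
        mrt = [[capacity] * num_pipes for _ in range(ii)]   # II rows x pipes
        schedule = []
        for earliest, pipe in instrs:
            for slot in range(earliest, earliest + ii):     # earliest..latest
                if mrt[slot % ii][pipe] > 0:
                    mrt[slot % ii][pipe] -= 1
                    schedule.append((slot, slot // ii))     # (time, stage)
                    break
            else:                # no free slot at this II: try a larger one
                break
        else:
            return ii, schedule
    return None                  # could not pipeline within MAX_II

# Two instructions competing for pipe 0 force II up from 1 to 2
ii, sched = modulo_schedule([(0, 0), (0, 0), (1, 1)], mii=1,
                            num_pipes=2, capacity=1)
```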

Phase 5: Prolog/Epilog Generation

Once a valid schedule is found at initiation interval II with S pipeline stages, sub_921820 generates:

function GeneratePrologEpilog(loop, II, num_stages):
    // Prolog: S-1 partial iterations
    for stage in 0..num_stages-2:
        emit instructions assigned to stages 0..stage
        // Each prolog iteration adds one more stage

    // Kernel: steady-state loop body
    emit all instructions from all stages
    // Trip count adjusted: new_trip = original_trip - (num_stages - 1)

    // Epilog: S-1 drain iterations
    for stage in 0..num_stages-2:
        emit instructions assigned to stages stage+1..num_stages-1
        // Each epilog iteration removes one stage
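
For a concrete S = 3 schedule, the stage sets per phase can be sketched as (illustrative helper, not a ptxas routine):

```python
def pipeline_phases(num_stages):
    # Stage sets executed by each prolog step, the steady-state kernel,
    # and each epilog (drain) step, for a schedule with S pipeline stages.
    prolog = [list(range(s + 1)) for s in range(num_stages - 1)]
    kernel = list(range(num_stages))
    epilog = [list(range(s + 1, num_stages)) for s in range(num_stages - 1)]
    return prolog, kernel, epilog
```

With num_stages = 3 this yields prolog steps [0] and [0, 1], kernel [0, 1, 2], and epilog steps [1, 2] and [2] -- each prolog step fills one more stage and each epilog step drains one.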

Instruction Latency Classifier (sub_91A0F0)

The classifier is a 5.5KB, 1372-line switch statement mapping approximately 350 Ori opcodes to 13 distinct latency class values. It takes five parameters: (opcode, secondary_opcode, operand_array, operand_count, operand_index) and returns a class ID -- not a cycle count. The scheduler maps class IDs to actual cycle counts via the hardware profile.

Latency Class Table

| Class | Typical opcodes | Meaning |
|---|---|---|
| 1 | Past-end operands, invalid indices | Skip / not used |
| 6 | Simple ALU, bitwise, short integer | Short-pipe latency (~80 opcodes) |
| 7 | Paired register operations | Medium-short (~5 opcodes) |
| 8 | Special cases (via lookup table dword_21E1340) | Medium |
| 9 | Type conversions (via lookup table) | Medium |
| 10 | Integer multiply, shifts, IMAD | Medium-long (~40 opcodes) |
| 11 | Address computations, LEA variants | Medium-long (~15 opcodes) |
| 12 | Memory operations, FP32, barriers | Standard long (~100 opcodes) |
| 14 | Wide memory, atomics, FP64 stores | Extended long (~20 opcodes) |
| 16 | FP64 special variants | Extended long (~3 opcodes) |
| 20 | Texture fetches, uniform loads | Very long (~30 opcodes) |
| 26 | Global memory loads, uncached access | Maximum latency (~25 opcodes) |
| 31 | Scoreboard/barrier-related operands | Special handling (~5 opcodes) |

Opcode Family Handling

| Opcode range | Category | Latency behavior |
|---|---|---|
| 0x03--0x24 | Integer ALU | Mostly passthrough default; 0x23 always returns 10 |
| 0x3C, 0x3E, 0x4E, 0x4F | Memory (load/store) | Returns field from operand_array[4] bits for operands 0--1 |
| 0x46, 0xF3--0x106 | Texture | Returns 6 normally; 10 for MIO-dependent with extended flag check |
| 0x49, 0x4A, 0x51, 0x143, 0x15E | Atomic/reduce | Always returns 12 |
| 0x55--0x6F | Floating-point | Complex per-operand logic; 0x55 uses lookup table dword_21E1340 |
| 0x5B, 0x5C, 0x137 | Barriers/sync | Returns 12 for operand 1, else default |
| 0xB7, 0x120 | WGMMA setup | Per-operand latency (10--20) based on accumulator flags |
| 0x135 | HMMA/IMMA | Calls sub_7E39B0/sub_7E3A70/sub_7E3BA0/sub_7E3C30 for matrix latency |
| 0x13D, 0x13E | Extended FP | Accumulator-flag-dependent returns (10 or 12) |

Stall Cycle Calculator (sub_91E900)

sub_91E900 computes the stall penalty for an instruction by mapping latency classes through the pipe assignment function (vtable+904):

function ComputeStallCycles(code_object, instruction):
    lat0 = ClassifyLatency(instruction, operand=0)
    pipe0 = PipeAssignment(code_object, lat0)         // vtable+904

    if pipe0 == 8:                                     // long-latency pipe
        stall = StallTable[instruction.index]          // code_object+440
        return min(stall, 64)                          // cap at 64 cycles

    lat1 = ClassifyLatency(instruction, operand=1)
    pipe1 = PipeAssignment(code_object, lat1)

    if pipe1 == 8:
        stall = StallTable[instruction.index]
        return min(stall, 64)

    // Neither operand on long pipe
    stall = StallTable[instruction.index]
    return min(stall, 32)                              // cap at 32 cycles

The pipe assignment value 8 corresponds to the long-latency functional unit (memory/texture). Instructions on this pipe get a 64-cycle cap; all others are capped at 32 cycles.
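
A minimal sketch of this capping logic (the class-to-pipe mapping here is hypothetical; the real one goes through vtable+904):

```python
def compute_stall_cycles(lat_classes, pipe_of, stall_entry):
    # lat_classes: latency classes of operands 0 and 1
    # pipe_of: maps a latency class to its pipe assignment value
    # stall_entry: the instruction's entry in the stall table
    cap = 64 if any(pipe_of(c) == 8 for c in lat_classes) else 32
    return min(stall_entry, cap)

# Assumed mapping for illustration: classes >= 20 land on the
# long-latency pipe (value 8), everything else on a short pipe.
pipe_of = lambda c: 8 if c >= 20 else 4
```

Note the decompiled code short-circuits (it classifies operand 1 only when operand 0's pipe is not 8); any() models the same result.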

GEMM Pipelining (sub_92C240)

The GemmPipeliner* family of knobs controls a specialized pipelining mode for GEMM (matrix multiply) loops:

| Knob Name | Type | Default | Description |
|---|---|---|---|
| GemmPipelinerEnabled | BOOL | false | Master enable for GEMM-specific pipelining |
| GemmPipelinerPipelineDepthEnforceDeltaFull | INT | 0 | Pipeline depth adjustment for full enforcement |
| GemmPipelinerPipelineDepthEnforceDeltaPartial | INT | 0 | Pipeline depth adjustment for partial enforcement |
| GemmPipelinerDependenciesPopbl | BOOL | false | Dependency resolution policy between DMA and compute stages |
| GemmPipelinerScoreboardHashPopbl | BOOL | false | Scoreboard hash policy for GEMM barrier tracking |
| GemmPipelinerUseRegisterCalculation | INT | 0 | Use register-based calculation for pipeline depth vs. fixed |

The extended pipelining in sub_92C240 (8KB) handles GEMM-like patterns where the loop body contains WGMMA/IMMA instructions. From decompilation:

  1. Activation: The GEMM pipeliner activates when code_object+48 (GEMM mode flag) is set and the pipeline context at code_object+56 has a valid stage range.
  2. Stage iteration: Iterates from context+84 (start stage) to context+88 (end stage), with 96-byte descriptors per stage at context+136.
  3. Pipeline depth management: Uses sub_8A4DA0 to validate stage depth and sub_6E6650 for dynamic array resizing when pipeline depth exceeds the current allocation. Writes stage bitmasks (1 << stage_index) into the stage descriptor arrays.
  4. Hardware model: On SM90+ (Hopper), TMA supports up to 8 outstanding asynchronous copy operations. The GEMM pipeliner matches this hardware depth, staging DMA (memory) and compute (math) operations to fill the pipeline.

The DUMPIR diagnostic output includes For Dma Loop and For Math Loop sections from sub_7A4500, confirming the pipeliner explicitly distinguishes between DMA and compute loop stages.

Other Pipelining Knobs

| Knob Name | Type | Default | Description |
|---|---|---|---|
| OkToPipelineNoUnroll | INT | 0 (disabled) | Allow pipelining even when unrolling was also suppressed |
| PipelineHoistCondLimit | INT | unset | Maximum condition complexity for hoisting in pipelined loops |
| PipelineHoistRRegPressureLimit | INT | unset | R-register pressure limit for hoisting inside pipelined body |
| PipelineHoistPRegPressureLimit | INT | unset | P-register pressure limit for hoisting inside pipelined body |
| PipelineMIOVQToInstRatio | DBL | unset | MIOVQ-to-instruction ratio threshold for pipeline profitability |
| PipelineMultiOutputTex | INT | 0 (disabled) | Enable pipelining of loops with multi-output texture instructions |
| PipelineSpecUsesInHeadOnly | INT | 0 (disabled) | Restrict speculative uses to loop header only |

GPU-Specific Pipeline Concerns

Warp divergence. Pipelined loops assume all threads in a warp execute the same number of iterations. If the trip count is warp-divergent, the prolog/epilog handling must account for early-exit threads. The pass checks the varying analysis (phases 53, 70) to determine divergence.

Barrier placement. Pipelined loops containing BAR.SYNC or MEMBAR instructions are checked by sub_9202D0 -- if the pipe assignment class for a barrier instruction is <= 3, the instruction is rejected from pipelining. The latency classifier (sub_91A0F0) assigns class 12 to barrier operands (opcodes 0x5B, 0x5C, 0x137), but the feasibility check rejects based on pipe class, not latency class.

Memory pipeline depth. The sub_92C240 extended pipeliner for GEMM-like loops manages the hardware memory pipeline on SM90+. It explicitly tracks DMA pipeline depth using 96-byte per-stage descriptors, resizing arrays dynamically when depth exceeds allocation. The stage descriptor at context+136 + 96*stage holds bitmask membership, latency counters, and dependency links.

Pipe class model. The 7-entry pipe class table at code_object+16 partitions the functional units into classes. The post-RA software pipelining variant (sub_8B9390) uses the same table to determine which functional unit class each instruction uses, ensuring resource conflict detection is consistent between the two pipelining layers.


Phases 35, 66, 79, 88 -- OriHoistInvariants (LICM)

Purpose

Hoists computations that produce the same result on every loop iteration out of the loop body and into the preheader. This reduces the dynamic instruction count proportionally to the trip count. The four instances are not redundant -- each targets invariants created by different intervening transformations.

Function Map

All four instances share the same core implementation:

| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_C5FE00 | 34 bytes | Phase 35 execute wrapper | CERTAIN |
| sub_C5FE30 | 34 bytes | Phase 66 execute wrapper | CERTAIN |
| sub_C5FE60 | 34 bytes | Phase 79 execute wrapper | CERTAIN |
| sub_C5FE90 | 34 bytes | Phase 88 execute wrapper | CERTAIN |
| sub_7DDB50 | 156 bytes | Optimization guard: checks knob 499, block count > 2 | HIGH |
| sub_8FFDE0 | 573 bytes | HoistInvariants orchestrator: iterates blocks, queries knob 381, dispatches inner worker | HIGH |
| sub_8FF780 | 1,622 bytes | LICM inner worker: identifies and moves invariant instructions | HIGH |
| sub_8FEAC0 | 2,053 bytes | Invariance marking: forward/backward operand scan per block | HIGH |
| sub_8F76E0 | 90 bytes | Per-instruction invariance test: checks output register def-block | HIGH |
| sub_8F7770 | 810 bytes | Hoisting safety check: operand class + latency analysis | HIGH |
| sub_8F8CB0 | 658 bytes | Profitability check: budget-weighted score vs latency penalty | HIGH |
| sub_8F7DD0 | 374 bytes | Transitive invariance propagation through def-use chains | HIGH |
| sub_8F7AE0 | 558 bytes | Instruction mover: unlinks from loop, inserts at preheader | HIGH |
| sub_8FF2D0 | 1,186 bytes | Budget computation + invariant marking + hoist dispatch | HIGH |
| sub_8F8BC0 | 257 bytes | Instruction counting: header/body weight via isNoOp | HIGH |
| sub_74D720 | 353 bytes | Loop boundary analysis: barrier/jump/predecessor checks | HIGH |
| sub_74F500 | -- | Preheader location finder | MEDIUM |
| sub_7DF3A0 | 88 bytes | Opcode flags table lookup (side-effect classification) | HIGH |
| sub_7E0540 | 156 bytes | Observable side-effect checker (memory, call, barrier) | HIGH |

Execute Flow

sub_C5FExxx(phase_obj)                         // 34-byte vtable dispatch
  └─ sub_8FFDE0(code_object, pass_id)          // orchestrator
       ├─ sub_7DDB50(code_object)              // guard: returns block count, checks knob 499
       ├─ sub_799250(allocator, "HoistInvariants", &skip)  // DUMPIR check
       └─ sub_8FF780(context)                  // per-loop LICM core
            ├─ sub_781F80                       // rebuild instruction list
            ├─ sub_7E6090                       // recompute register pressure
            ├─ sub_773140                       // recompute loop depths
            ├─ sub_74D720                       // analyze loop boundaries
            ├─ sub_74F500                       // find preheader
            ├─ sub_7A1A90 / sub_7A1B80         // query knob 381 per block
             └─ sub_8F8BC0                       // count header/body instructions

Why Four Instances?

| Phase | Pass ID (a2) | Pipeline Position | What Creates New Invariants |
|---|---|---|---|
| 35 (Early) | 0 | After GeneralOptimize (29), ExtractShaderConsts (34) | CSE eliminates redundant expressions, exposing loop-invariant results; shader constant extraction hoists uniform loads |
| 66 (Late) | 1 | After predication (63), GeneralOptimizeLate2 (65) | Predication converts conditional branches to predicated instructions; if the condition is loop-invariant, the entire predicated instruction becomes invariant |
| 79 (Late2) | 2 | After LateExpansionUnsupportedOps (78) | Late expansion splits compound operations into sequences; address computations and constant sub-expressions in expanded sequences are often invariant |
| 88 (Late3) | 3 | After FixupGmmaSequence (87) | GMMA fixup reorders/inserts instructions for wgmma hardware constraints; descriptor loads and accumulator setup become visible as invariants |

Pass ID Controls Aggressiveness

The pass_id parameter (parameter a2 of sub_8FFDE0) affects which loops are processed and how aggressively hoisting is performed. From the decompiled logic at sub_8FFDE0:

// sub_8FFDE0 lines 58-89 (simplified)
v7 = sub_7A1B80(allocator, 381, block);   // query knob 381 for this block
if (v7 == 1) {                             // knob says "inner loops only"
    if (pass_id == 1) goto hoist_block;    // Late pass: proceed
    goto skip_block;                       // Early pass: skip
}
if (v7 == 3) {                             // knob says "never"
    if (pass_id <= 1) goto handle_conservative;
    goto skip_block;
}
if (v7 == 0) {                             // knob says "always"
    if (pass_id == 0) goto hoist_aggressively;
    goto skip_block;
}
  • pass_id = 0 (Early): Hoists aggressively and calls sub_A112C0(code_object, 1) to re-run sub-analyses afterward. This is the most aggressive pass.
  • pass_id = 1 (Late): Includes inner-loop-only blocks, but skips the re-analysis call.
  • pass_id >= 2 (Late2, Late3): Most conservative -- only hoists from blocks where knob 381 returns 0 (always-hoist).

Per-Block Knob 381 Policy

The LICM pass queries OCG knob 381 (sub_7A1A90 / sub_7A1B80) per basic block to determine the hoisting policy:

| Knob 381 Result | Meaning |
|---|---|
| 0 | Always hoist from this block |
| 1 | Hoist from inner loops only |
| 3 | Never hoist from this block |

This per-block granularity allows the knob system to selectively disable hoisting in specific loop nests (e.g., those known to be register-pressure-critical).

Guard Function (sub_7DDB50)

Before the LICM core runs, sub_7DDB50 (156 bytes) and the orchestrator's return-value check together gate execution on three conditions:

  1. Knob 499 enabled. Queries the allocator vtable at +72 for OCG knob 499 (the master LICM switch). If disabled, returns 1 which causes the orchestrator to bail (since 1 <= 2).
  2. Rate limiter. When knob 499 is enabled, the guard checks a pair of counters at allocator[9]+35936 (max invocations) and allocator[9]+35940 (current count). If the current count has reached the maximum, it returns 1. Otherwise it increments the counter and returns the actual basic-block count from code_object+2104. This bounds the number of LICM invocations, keeping compile time under control in functions with many loops.
  3. Block count > 2. The orchestrator (sub_8FFDE0) checks the return value: if <= 2, no hoisting is attempted. Single-block functions have no loops; two-block functions have at most a trivial loop not worth processing.
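
The guard's contract with the orchestrator can be modeled as (a sketch assuming a simple counter dict in place of the allocator fields):

```python
def licm_guard(knob_499_enabled, max_invocations, state, block_count):
    # Returns the basic-block count when LICM may run; returns 1
    # otherwise, which trips the orchestrator's "<= 2" bail-out.
    if not knob_499_enabled:
        return 1
    if state["count"] >= max_invocations:
        return 1                       # rate limit reached
    state["count"] += 1
    return block_count
```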

LICM Invariant Detection Algorithm

The invariance detection pipeline runs inside sub_8FF2D0 (1,186 bytes), which is called from sub_8FF780 once per loop nest level. It executes five stages in sequence: budget computation, forward invariance marking, backward non-invariance marking, transitive propagation, and profitability gating.

Stage 1: Budget Computation (sub_8FF2D0)

function ComputeHoistBudget(context, block, is_simple, num_preds, hoist_mode, is_inner):
    // Base budget from knob 483 (HoistBudget)
    if QueryKnob(483):
        budget = QueryKnobValue(483)                 // 0 = unlimited
    else:
        budget = 10                                  // default

    // CBO budget from knob 482
    if QueryKnob(482):
        cbo_budget = QueryKnobValue(482)
    else:
        cbo_budget = (pass_id == 0) ? 22 : 100

    // Adjust by loop type and depth
    if pass_id > 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 300
    else if pass_id == 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 200

    // Conservative multiplier for Late3
    if pass_id == 3:
        budget *= 100                                // generous once decided to hoist

    // Split budget among back-edge blocks
    if hoist_mode == 3:                              // processing back-edge block
        budget /= num_preds

    // Inner-loop divisor from knob 380
    if is_inner:
        if QueryKnob(380):
            budget /= QueryKnobValue(380)
        else:
            budget /= 10
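
Condensed into runnable form (knob values passed in directly; constants mirror the decompiled defaults above):

```python
def hoist_budget(pass_id, is_simple, hoist_mode, is_inner, num_preds,
                 knob_483=None, knob_482=None, knob_380=None):
    budget = knob_483 if knob_483 is not None else 10           # HoistBudget
    cbo = knob_482 if knob_482 is not None else (22 if pass_id == 0 else 100)
    if is_simple:
        if pass_id > 0:
            budget = cbo if hoist_mode < 2 else 300
        else:
            budget = cbo if hoist_mode < 2 else 200
    if pass_id == 3:
        budget *= 100                  # generous once Late3 decides to hoist
    if hoist_mode == 3:
        budget //= num_preds           # split among back-edge blocks
    if is_inner:
        budget //= knob_380 if knob_380 else 10
    return budget
```

For example, an Early pass on a simple loop header gets the CBO default of 22, while a Late3 back-edge block with two predecessors gets 300 * 100 / 2 = 15000.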

Stage 2: Forward Invariance Marking (sub_8FEAC0, a3=1)

The forward pass iterates every instruction in the basic block and marks each register operand's invariance status based on where it was defined.

function MarkInvariants_Forward(context, block_index):
    block = blocks[block_index]
    header_depth = context.header_depth
    loop_depth_range = [context.header_depth, context.max_depth]

    // Two code paths based on knob 934 (UseNewLoopInvariantRoutineForHoisting)
    if QueryKnob(934):
        // Advanced path: set-based computation via sub_768BF0 + sub_8F7280
        return MarkInvariants_SetBased(context, block_index)

    // Default path: single-pass scan
    for each instruction in block (linked list: block+0 .. sentinel at block+8):
        has_side_effect = isNoOp(instruction)            // vtable+1824
        opcode = instruction+72 (masked: BYTE1 &= 0xCF)
        num_operands = instruction+80

        // Special case: opcode 195 + first dst is reg class 9 (predicate)
        is_predicate_def = (opcode == 195 and dst_reg.class == 9)

        is_invariant = true
        for each operand from LAST to FIRST:             // reverse scan
            operand = instruction + 84 + 8*i
            type = (operand >> 28) & 7

            if type != 1:                                // not a register
                continue                                 // immediates are always invariant

            if IsFixedRegister(operand, code_object):    // sub_7DEB90
                continue                                 // e.g., RZ, PT — always available

            if pass_id == 3:                             // Late3 extra check
                if IsSpecialRegClass(operand, code_object):  // sub_7DA2F0
                    // Exception: IADD3 (opcode 130, flag 0x1000) carry-out
                    if not (opcode == 130 and flag_0x1000 and is_penultimate_operand):
                        continue

            reg = RegisterDescriptor(code_object, operand & 0xFFFFFF)

            if reg.def_block (reg+76) == block_index:
                // Defined in THIS block — not invariant for this loop
                is_invariant = false
            else if context.is_multi_depth:
                def_instr = reg.def_instruction (reg+56)
                if def_instr is null or reg has pinned bit:
                    handle_predicate_invariance()
                else:
                    def_block = blocks[def_instr.block_index]
                    def_depth = def_block.loop_depth (offset +144)
                    if def_depth < header_depth or def_depth > max_depth:
                        reg.use_count (reg+80) = 0       // mark as loop-external
                    else:
                        is_invariant = false
                        reg.def_block (reg+76) = block_index
            else:
                reg.use_count (reg+80) = 0               // simple loop: mark external

        // Side-effect check for the entire instruction
        flags = LookupOpcodeFlags(instruction, code_object)  // sub_7DF3A0
        if (flags & 2) != 0:                             // has memory/control side effect
            is_invariant = false

        if MemoryOverlapsLoopLiveSet(instruction):       // sub_74F5E0
            is_invariant = false

        if is_multi_depth and HasObservableSideEffects(instruction):  // sub_7E0540
            is_invariant = false

        // Mark destination operands
        for each dst_operand (bit 31 set = definition):
            if type == 1 and not pinned:
                if is_invariant:
                    reg.def_block = block_index           // mark for hoisting
                else:
                    reg.use_count += 1                    // count loop-internal uses

The key insight is that invariance is determined by definition site: if every source register was defined outside the loop (or in a block already processed), the instruction is invariant. Immediates and constants are trivially invariant. The check is not purely structural -- it uses the reg+76 field which gets updated as hoisting proceeds, allowing transitive invariance discovery.
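
That transitive discovery amounts to a fixed-point iteration over def-use chains; a minimal sketch with an illustrative data layout (not the Ori register descriptors):

```python
def find_invariants(instrs, loop_defs):
    # loop_defs: register name -> index of the loop instruction defining it.
    # An instruction becomes invariant once every register source is defined
    # outside the loop or by an already-invariant instruction.
    invariant, changed = set(), True
    while changed:
        changed = False
        for i, ins in enumerate(instrs):
            if i not in invariant and all(
                    src not in loop_defs or loop_defs[src] in invariant
                    for src in ins["srcs"]):
                invariant.add(i)
                changed = True
    return invariant

# r1 is defined before the loop; instruction 1 consumes instruction 0's
# result; instruction 2 also reads a per-iteration value (index 99 is a
# stand-in for a loop-variant definition that never becomes invariant).
instrs = [{"srcs": ["r1"], "dst": "r2"},
          {"srcs": ["r2"], "dst": "r3"},
          {"srcs": ["r3", "r_iv"], "dst": "r4"}]
loop_defs = {"r2": 0, "r3": 1, "r4": 2, "r_iv": 99}
```

Instructions 0 and 1 both end up invariant; instruction 1 qualifies only after instruction 0 has been marked, which the fixed-point sweep handles.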

Stage 3: Backward Non-Invariance Marking (sub_8FEAC0, a3=0)

The backward pass uses the same function with a3=0. Instead of marking definitions as external, it marks operands whose definitions are inside the loop as non-invariant by setting reg.def_block = block_index. This clears any false positives from the forward pass where a register appeared invariant but its defining instruction depends on a loop-variant value.

For destination operands, the backward pass increments reg.use_count for all non-pinned register definitions, building the use-count information needed by the profitability check.

Stage 4: Transitive Invariance Propagation (sub_8F7DD0)

After the two marking passes, sub_8F7DD0 propagates invariance transitively through the instruction chain. This handles the case where instruction A is invariant and defines register R, and instruction B uses R and is otherwise invariant -- the forward pass may have marked B as non-invariant because R's definition was in the loop, but A (the definer) is itself invariant.

function PropagateInvariance(context, block_index):
    block = blocks[block_index]
    side_effect_mask = 0

    for each instruction in block:
        aliases_memory = CheckMemoryAlias(code_object, instruction)  // sub_74F5E0

        for each operand (type == 1, register):
            reg = RegisterDescriptor(operand)

            if operand is definition (bit 31 set):
                if isNoOp(instruction):
                    if IsInvariant(instruction, block_index):      // sub_8F76E0
                        side_effect_mask |= reg.flags & 0x3
                    else:
                        reg.flags |= aliases_memory ? 1 : 0
                else:
                    reg.flags |= (has_side_effect ? 1 : 0) | 2
            else:  // use
                if has_side_effect:
                    reg.def_block = block_index            // taint defining register
                else:
                    reg.use_count += 1

    return side_effect_mask

Stage 5: Profitability Check (sub_8F8CB0)

The final gate before hoisting. Computes a cost-benefit ratio and rejects hoisting if the ratio is unfavorable.

function IsProfitable(context, block_index, budget, is_hoist_safe):
    header_weight = context.header_insn_count            // from sub_8F8BC0
    body_weight = context.body_insn_count

    // Scoring weights depend on pass aggressiveness and safety
    if is_hoist_safe:
        noOp_weight = (pass_id == 0) ? 60 : 150
        real_weight = 5
    else:
        noOp_weight = (pass_id == 0) ? 12 : 30
        real_weight = 1

    score = 0
    latency_penalty = 0
    instruction_count = 0

    for each instruction in block:
        instruction_count += 1
        if IsInvariant(instruction, block_index):        // sub_8F76E0
            if isNoOp(instruction):
                score += noOp_weight
            else:
                score += 1
                for each dst_operand with scoreboard flag:
                    score += real_weight
                    latency = GetLatencyClass(instruction)  // sub_91E860
                    latency_penalty += (latency > 4) ? 2 : 1
        else:
            for each high-latency dst_operand:
                latency_penalty += (latency > 4) ? 2 : 1

    // Final decision: weighted score vs latency cost
    if pass_id == 0:                                     // aggressive
        denominator = real_weight * instruction_count
    else:
        denominator = body_weight / 3 + header_weight

    return denominator != 0 and (score * budget) / (real_weight * denominator) >= latency_penalty

The profitability check encodes a fundamental GPU tradeoff: hoisting reduces dynamic instruction count (proportional to trip count) but extends live ranges (increasing register pressure and reducing occupancy). The budget parameter, which varies by 100x between pass_id 0 and 3, controls how aggressively this tradeoff is resolved. Pass_id 0 (Early) uses the smallest denominator, making it easiest to exceed the threshold.
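
A runnable condensation of the decision (parameters correspond to the pseudocode above):

```python
def hoist_is_profitable(pass_id, score, budget, real_weight,
                        instruction_count, header_weight, body_weight,
                        latency_penalty):
    # Early (pass_id 0) normalizes by block size; later passes normalize
    # by loop shape (body weight relative to header weight).
    if pass_id == 0:
        denom = real_weight * instruction_count
    else:
        denom = body_weight // 3 + header_weight
    return denom != 0 and \
        (score * budget) // (real_weight * denom) >= latency_penalty
```

The same score and budget that pass easily for pass_id 0 can fail for a late pass facing a heavy loop body and a large latency penalty.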

Per-Instruction Invariance Test (sub_8F76E0)

The leaf-level invariance test used by stages 4 and 5 is a simple definition-site check:

function IsInvariant(instruction, current_block_index):
    num_operands = instruction.operand_count             // inst+80
    if num_operands == 0:
        return false

    // Find the last "interesting" operand (skip immediates/constants)
    // Immediates have type bits in the 0x70000000 range
    last_operand = scan backwards from operand[num_operands-1]
                   while (operand XOR 0x70000000) & 0x70000000 == 0

    // Check: is this a register definition outside the current block?
    if last_operand is negative (bit 31 = definition)
       and type_bits == 1 (register)
       and not pinned (byte+7 bit 0 == 0):
        reg = RegisterDescriptor(last_operand & 0xFFFFFF)
        return reg.def_block (reg+76) != current_block_index

    return false

This is the most-called function in the LICM pipeline. It checks whether an instruction's primary output register was defined outside the current block -- if so, the instruction is considered invariant (already hoisted or defined in a dominating block).

Side-Effect Blocking Rules

An instruction is blocked from hoisting if any of the following conditions hold, regardless of operand invariance:

| Check | Function | Condition |
|---|---|---|
| Memory store | sub_7DF3A0 | Flags byte bits 2-3 set and bit 5 clear |
| Memory barrier | sub_74D720 | Opcode 159 (BAR.SYNC), 32 (MEMBAR), or 271 (barrier variant) |
| Indirect jump | sub_74D720 | Opcode 236 (BRX) |
| Volatile/atomic access | sub_7DFA80 | Called from sub_7E0540; detects volatile or atomic memory |
| Function call | vtable+1456 | isBarrier() returns true |
| Texture side effect | sub_7DF3A0 | Flags byte bit 6 set with operand modifier flag |
| Address-space effect | sub_7E0540 | Opcodes 85/109 (memory ops) with (flags+20 & 2) != 0 |

The boundary analysis (sub_74D720) also produces a 5-byte result array that gates the entire loop:

| Byte | Meaning | Effect |
|---|---|---|
| 0 | Has external predecessor (outside loop depth range) | Skip loop (not a natural loop) |
| 1 | Non-header block with different nesting | Marks as complex multi-depth loop |
| 2 | Contains barrier instruction | Skip loop entirely |
| 3 | Contains indirect jump | Skip loop entirely |
| 4 | Multi-depth safety flag | AND-ed with sub_7E5120 per inner block |

Instruction Counting (sub_8F8BC0)

Before the profitability check, sub_8F8BC0 counts instructions in the loop header and body separately. It walks the instruction linked list for each block in the loop and classifies each instruction using isNoOp (vtable+1824):

  • No-op instruction (scheduling placeholder, predicate set, etc.): weight 1
  • Real instruction (ALU, memory, branch, etc.): weight 30

The header count is stored at context+64 and the body count at context+68. The profitability formula uses these to normalize the hoisting score: a loop with a heavy header relative to the body benefits less from hoisting.
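
A sketch of the weighting (instructions reduced to an is-no-op flag):

```python
def header_body_weights(header_noops, body_noops):
    # Lists of booleans: True for a no-op (weight 1), False for a real
    # instruction (weight 30), per the sub_8F8BC0 convention above.
    weigh = lambda flags: sum(1 if noop else 30 for noop in flags)
    return weigh(header_noops), weigh(body_noops)
```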

Instruction Movement (sub_8F7AE0)

After all checks pass, sub_8F7AE0 physically moves each invariant instruction from the loop body to the preheader:

  1. Invariance re-check. Calls sub_8F76E0 one final time per instruction. Instructions whose invariance status changed during the marking passes are skipped.
  2. Knob 484 gate. Queries the allocator for knob 484; if disabled, no movement occurs. This provides a fine-grained override separate from the loop-level knob 381.
  3. Preheader creation. On the first hoisted instruction, creates or locates the preheader block:
    • If the loop has an existing preheader block (context+16 non-null): clones it via sub_931920, copies convergence flags from the original preheader's offset+282 bit 3, and links it into the CFG via sub_8F7610.
    • If no preheader exists: creates a new block via sub_92E1F0 and links it.
  4. Unlink and reinsert. For each invariant instruction:
    • sub_9253C0(code_object, instruction, 1): unlinks the instruction from the current block.
    • sub_91E290(code_object, instruction): inserts at the preheader insertion point.
    • Updates the Ori instruction's control word at instruction+32 (not the SchedNode): sets bit 1 at byte offset +13 to mark the instruction as hoisted (prevents the scheduler from reordering it back into the loop).
  5. Destination register tracking. For each output operand, if the defining instruction at reg+56 differs from the current instruction, sets context+44 (hoisted_cbo flag). For pass_id == 2, additionally sets reg+48 bit 26 if the register class is in {2, 3, 4} (GPR classes) and the preheader has the convergence flag.
  6. Special IADD3 handling. For pass_id == 3, instructions with opcode 130 (IADD3), flag 0x1000, and a negative byte at +90 (carry chain) receive special treatment via sub_9232B0 which adjusts the carry-out register linkage before movement.

Multi-Depth Loop Handling

For loops with nesting depth > 1 (inner loops within the hoisting target), sub_8FF780 performs multiple rounds of sub_8FF2D0 calls:

  1. Header block. First call processes the loop header with hoist_mode = 0.
  2. Intermediate blocks. For each depth level between header_depth+1 and max_depth, checks if the block's parent depth (block+148) matches the header depth. If the block is a back-edge predecessor of the loop header, uses hoist_mode = 3. Otherwise, checks a dominance bitmap at block[25] + 4*(depth >> 5): if bit (1 << depth) is set, uses hoist_mode = 1 (dominated); otherwise hoist_mode = 2 (non-dominated).
  3. Back-edge block. Final call with hoist_mode = 3 and the deepest back-edge block index, ensuring the budget is split among back-edge predecessors.

Multi-depth permission is gated by knob 220 (queried at allocator[9]+15840 for the fast path) and the DisableNestedHoist knob. When hoisting from an inner loop to the header of an outer loop, an additional constraint applies:

allow_nested = allow_nested_hoist AND is_simple_loop
               AND body_insn_count > 1
               AND num_predecessors == 1
               AND body_insn_count < header_insn_count * max_iterations

This prevents hoisting from inner loops where the cost (extended live range across multiple loop levels) exceeds the benefit (reduced inner-loop dynamic count).
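
As a predicate, the constraint transcribes directly:

```python
def nested_hoist_allowed(allow_nested_hoist, is_simple_loop, body_insn_count,
                         num_predecessors, header_insn_count, max_iterations):
    # All four structural conditions must hold for a nested hoist.
    return (allow_nested_hoist and is_simple_loop
            and body_insn_count > 1
            and num_predecessors == 1
            and body_insn_count < header_insn_count * max_iterations)
```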

LICM Outer Loop (sub_8FF780)

The complete outer driver that iterates over all loop nests:

function HoistInvariantsCore(context):
    code_object = context.code_object
    pass_id = context.pass_id

    // Read iteration limit from allocator+34632
    config_byte = allocator[34632]
    max_iterations = (config_byte == 0) ? 2
                   : (config_byte == 1) ? allocator[34640]
                   : 0                                   // unlimited

    allow_nested_hoist = (allocator[20016] != 0)

    // Prepare IR
    RebuildInstructionList(code_object, 1)               // sub_781F80
    RecomputeRegisterPressure(code_object, 1, 0, 0, 0)  // sub_7E6090
    RecomputeLoopDepths(code_object, 0)                  // sub_773140

    if code_object.flags[176] & 2 and pass_id > 1:
        RecomputeLoopNesting(code_object)                // sub_789280

    // Clear prior invariance markers
    for each block in instruction list:
        block.marker (offset +76) = 0xFFFFFFFF

    // Iterate from innermost loop outward (last RPO entry first)
    current = blocks[rpo[block_count]]

    while current is valid:
        if current has no predecessors or no first instruction:
            advance; continue

        // Count predecessors at >= current loop depth
        header_depth = current.loop_depth                // offset +144
        for each predecessor:
            if pred.loop_depth >= header_depth:
                num_at_depth++; track deepest back-edge index

        if num_at_depth == 0:                            // not a loop header
            advance; continue

        // Simple vs multi-depth
        if max_depth == header_depth:
            is_simple = true
        else:
            info = AnalyzeBoundaries(code_object, header_depth, max_depth)
            if has_external_pred or has_barrier or has_indirect_jump:
                advance; continue
            if !MultiDepthAllowed(knob_220):
                advance; continue
            context.is_multi_depth = true

        // Find preheader and query knob 381
        context.insert_pt = FindPreheader(code_object, current, ...)
        if !ShouldHoist(QueryKnob381(381, current), pass_id, opt_level):
            advance; continue

        // Count instruction weights
        CountInstructions(context)                       // sub_8F8BC0

        // === CORE HOISTING PIPELINE (per loop) ===
        sub_8FF2D0(context, header_block, ...)           // header block

        if context.is_multi_depth:
            for depth in (header_depth+1 .. max_depth-1):
                sub_8FF2D0(context, block_at_depth, ..., hoist_mode, ...)
            sub_8FF2D0(context, back_edge_block, ..., 3, ...)  // back-edge

        // Post-hoist cleanup
        if context.changed and current.num_back_edge_successors > 1:
            RebuildInstructionList(code_object, 0)

        advance to next loop

Hoisting Knobs

| Knob Name | Type | Default | Description |
|---|---|---|---|
| HoistBudget | FLOAT | 10 | Maximum number of instructions to hoist per loop (0 = unlimited) |
| HoistLoopInvBudget | FLOAT | 22 (early) / 100 (late) | Budget specifically for loop-invariant hoisting; pass_id 0 uses 22, pass_id > 0 uses 100 |
| HoistConservativeScale | INT | 10 (divisor) | Inner-loop budget divisor; budget /= scale when hoisting from inner loops |
| HoistLate | INT | per-block policy | Per-block hoisting policy (0 = always, 1 = inner only, 3 = never) |
| HoistCBOMode | INT | 0 | Constant-buffer-object hoisting mode |
| HoistCBOLoad | INT | unset | Enable hoisting of CBO load instructions |
| HoistCBOFromLoopWithColdNest | INT | 1 (enabled) | Hoist CBO loads even from loops with cold nesting |
| HoistCBOHighCostSBInstRatioThreshold | INT | unset | Scoreboard cost threshold for CBO hoisting |
| HoistCBOLoadIDOMTravseLimit | INT | 4 | IDOM traversal limit for CBO load hoisting |
| HoistCBORRegPressureLimitApplyRate | INT | 80 | R-register pressure limit application rate (percentage) |
| HoistTexToInstRatioHigh | DBL | 0.045 | High texture-to-instruction ratio threshold for aggressive hoisting |
| HoistTexToInstRatioLow | DBL | 0.03 | Low texture-to-instruction ratio threshold for conservative hoisting |
| DisableNestedHoist | BOOL | false | Disable hoisting from nested loops (false = nested hoisting allowed) |
| NestedHoistInnerThreshold | INT | 22 / 100 | Inner-loop instruction threshold for nested hoisting (same value as HoistLoopInvBudget) |
| NestedHoistOuterThreshold | INT | 10 | Outer-loop instruction threshold for nested hoisting (same value as HoistBudget) |
| UseNewLoopInvariantRoutineForHoisting | BOOL | false | Use updated set-based invariance check routine (legacy single-pass is default) |
| MaxMidHeaderSizeRateForAggressiveHoist | INT | 2 | Maximum LICM iteration count (limits repeated hoisting passes) |
| EnableHoistLowLatencyInstMidBlock | BOOL | false | Hoist low-latency instructions from mid-block positions |
| MovWeightForSinkingHoisting | DBL | 0.25 | Weight for MOV instructions in sink/hoist decisions |

GPU-Specific LICM Concerns

Constant buffer loads. GPU shaders frequently load from constant buffers (LDC). These loads are loop-invariant by definition (the buffer is read-only during kernel execution). The HoistCBO* knobs control a specialized path that aggressively hoists these loads, trading register pressure for reduced memory traffic.

Register pressure vs. occupancy. Every hoisted instruction extends its live range from the preheader through the entire loop. On GPUs, this directly reduces occupancy. The four LICM passes use increasingly conservative heuristics (controlled by pass_id) to avoid excessive register growth in later pipeline stages where register allocation is imminent.

Texture instruction hoisting. Texture fetches (TEX, TLD, TLD4) are high-latency and loop-invariant when their coordinates are loop-invariant. The HoistTexToInstRatio* knobs provide thresholds for deciding when to hoist texture instructions -- a tradeoff between reducing loop body latency and increasing preheader register pressure.
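One plausible reading of the two ratio knobs is a three-way policy split around the default thresholds. This is an assumption about how the thresholds partition behavior, not recovered logic; the function name and policy labels are invented:

```python
TEX_RATIO_HIGH = 0.045  # HoistTexToInstRatioHigh default
TEX_RATIO_LOW = 0.03    # HoistTexToInstRatioLow default

def tex_hoist_policy(num_tex_insns, num_total_insns):
    """Classify a loop body by its texture-to-instruction ratio."""
    ratio = num_tex_insns / num_total_insns
    if ratio >= TEX_RATIO_HIGH:
        return "aggressive"    # latency dominates: hoist eagerly
    if ratio >= TEX_RATIO_LOW:
        return "conservative"  # hoist only cheap/safe candidates
    return "default"           # too few texture fetches to matter
```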


Phase 59 -- OriLoopFusion

Purpose

Fuses adjacent loops with compatible bounds and no inter-loop data dependencies into a single loop. This reduces loop overhead (branch, induction variable update) and creates opportunities for the scheduler to overlap instructions from the formerly separate loop bodies.
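The transformation can be illustrated with a toy source-level example (Python standing in for the Ori IR; this is not ptxas code). Note that the fused body reads a[i] at the same iteration index it was written, so the result is unchanged:

```python
def unfused(a, b, n):
    for i in range(n):
        a[i] = a[i] * 2        # loop L1
    for i in range(n):
        b[i] = b[i] + a[i]     # loop L2: reads a[i] at the same index

def fused(a, b, n):
    for i in range(n):         # one induction variable, one branch per iter
        a[i] = a[i] * 2
        b[i] = b[i] + a[i]
```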

Knobs

| Knob Name | Type | Default | Description |
|---|---|---|---|
| PerformLoopFusion | INT | 0 (disabled) | Master enable for loop fusion; must be explicitly set to a nonzero value |
| PerformLoopFusionBudget | FLOAT | unset | Maximum instruction count in fused body |

Fusion Criteria

Two adjacent loops L1 followed by L2 are candidates for fusion when:

  1. Same trip count. Both loops iterate the same number of times (same induction variable bounds and stride, or equivalent after normalization).
  2. No violated inter-loop dependencies. No flow dependence (write in L1, read in L2) that crosses iteration boundaries differently after fusion. Since both loops are sequential pre-fusion, this reduces to: L2 must not read a value written by L1 at a different iteration index.
  3. Compatible loop structure. Both must be single-basic-block bodies (or the fused body must remain within the PerformLoopFusionBudget instruction limit).
  4. No intervening barriers. No BAR.SYNC, MEMBAR, or fence instructions between the two loop bodies.
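The four criteria combine into a single legality predicate. A minimal sketch over scalar inputs (parameter names are illustrative, not from the binary):

```python
def fusion_candidate(trip1, trip2, cross_iter_dep,
                     body1, body2, budget, barrier_between):
    """Check whether adjacent loops L1, L2 may be fused."""
    return (trip1 == trip2              # 1. same trip count
            and not cross_iter_dep      # 2. no cross-iteration flow dependence
            and body1 + body2 <= budget # 3. fused body within the fusion budget
            and not barrier_between)    # 4. no BAR.SYNC/MEMBAR between bodies
```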

Pipeline Position Rationale

Phase 59 runs after GeneralOptimizeLate (phase 58) and before predication (phase 63). This position is chosen because:

  • Late expansion (phase 55) may have split a single operation into a pair of loops (e.g., an atomic-reduce pattern becomes a compare loop followed by an exchange loop).
  • After fusion, the merged loop body gives predication (phase 63) a larger basic block to work with, improving if-conversion opportunities.
  • The subsequent LICM (phase 66) can hoist invariants from the fused loop that were not hoistable from either original loop individually (because they appeared in the "between-loops" region).

Loop Infrastructure Functions

Several utility functions are shared across the loop passes:

| Function | Address | Size | Purpose |
|---|---|---|---|
| sub_781F80 | 0x781F80 | -- | Rebuild instruction linked list after CFG modification |
| sub_789280 | 0x789280 | -- | Recompute loop nesting depths (called when flags[176] & 2 is set) |
| sub_773140 | 0x773140 | -- | Recompute per-block loop depths (RecomputeLoopDepths in the driver pseudocode) |
| sub_7E6090 | 0x7E6090 | 2,614 | Recompute register pressure estimates (RecomputeRegisterPressure in the driver pseudocode) |
| sub_7753F0 | 0x7753F0 | -- | Loop peeling setup (splits first/last iterations) |
| sub_789BE0 | 0x789BE0 | -- | Back-edge canonicalization |
| sub_74D720 | 0x74D720 | -- | Loop boundary analysis (determines header, latch, exit) |
| sub_74F500 | 0x74F500 | -- | Find preheader block for a given loop |
| sub_9253C0 | 0x9253C0 | -- | Edge splitting / preheader block insertion |
| sub_7A1A90 | 0x7A1A90 | -- | OCG knob query (boolean) |
| sub_7A1B80 | 0x7A1B80 | -- | OCG knob query (multi-valued) |
| sub_799250 | 0x799250 | -- | Named-phase DUMPIR check (string match against phase name) |
| sub_A112C0 | 0xA112C0 | -- | Trigger sub-analysis re-run (liveness, CFG refresh) |
| sub_BDEA50 | 0xBDEA50 | -- | Back-edge information printer ("bix%d -> backedge's successor BB: %d") |

| Phase | Name | Relationship |
|---|---|---|
| 3 | AnalyzeControlFlow | Builds the CFG, identifies loops, computes dominators -- prerequisite for all loop passes |
| 19 | OriSplitLiveRanges | Splits live ranges at loop boundaries to reduce register pressure post-simplification |
| 20 | PerformPGO | Applies profile data that informs unrolling and pipelining heuristics |
| 21 | OriStrengthReduce | Reduces induction-variable strength before unrolling |
| 23 | GenerateMovPhi | Inserts SSA phi nodes after unrolling changes the CFG |
| 25 | StageAndFence | Inserts memory fences needed by pipelined loops |
| 56 | SpeculativeHoistComInsts | Speculatively hoists common instructions above branches (related to LICM) |
| 108 | OptimizeHotColdInLoop | Post-RA hot/cold partitioning within loop bodies |
| 138 | OriSplitHighPressureLiveRanges | Last-resort splitter when unrolling or LICM caused excessive register pressure |

Cross-References