Loop Passes
All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.
Eight phases in the ptxas pipeline transform loops in the Ori IR: one canonicalizer (phase 18), one unroller (phase 22), one software pipeliner (phase 24), four LICM instances (phases 35, 66, 79, 88), and one fusion pass (phase 59). Together they account for the largest category of repeated-pass instances in the pipeline -- the LICM family alone runs four times because intervening transformations (predication, legalization, GMMA fixup) continuously expose new invariants.
ptxas is not built on LLVM. Its loop infrastructure is a custom, non-SSA representation operating directly on the Ori IR's basic-block graph. Loop detection is performed by AnalyzeControlFlow (phase 3), which identifies back-edges, computes dominators, and annotates each basic block with a loop nesting depth stored at block offset +144. This nesting depth is the primary loop identity used by all eight passes.
| Item | Details |
|---|---|
| OriLoopSimplification | Phase 18 -- vtable at off_22BD898 |
| OriLoopUnrolling | Phase 22 -- vtable at off_22BD938 |
| OriPipelining | Phase 24 -- vtable at off_22BD988 |
| OriHoistInvariantsEarly | Phase 35 -- vtable at off_22BDB40 |
| OriLoopFusion | Phase 59 -- vtable at off_22BDF00 |
| OriHoistInvariantsLate | Phase 66 -- vtable at off_22BE018 |
| OriHoistInvariantsLate2 | Phase 79 -- vtable at off_22BE220 |
| OriHoistInvariantsLate3 | Phase 88 -- vtable at off_22BE388 |
| Phase factory | sub_C60D30 cases 18, 22, 24, 35, 59, 66, 79, 88 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers, pre-RA |
| Loop detection | AnalyzeControlFlow (phase 3) -- back-edges, dominators, nesting depth |
| Related passes | 3 AnalyzeControlFlow, 19 OriSplitLiveRanges, 21 OriStrengthReduce, 108 OptimizeHotColdInLoop |
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG, identifies loops, computes dominators
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds loop analysis)
Phase 15 OriBranchOpt ── branch simplification (may change loop shape)
Phase 16 OriPerformLiveDeadFirst ── DCE removes dead loop bodies
Phase 18 OriLoopSimplification ── CANONICALIZATION: single entry, preheader insertion
Phase 19 OriSplitLiveRanges ── splits live ranges at loop boundaries
Phase 21 OriStrengthReduce ── induction variable strength reduction
Phase 22 OriLoopUnrolling ── UNROLLING: full/partial based on trip count
Phase 23 GenerateMovPhi ── SSA phi insertion (after unrolling changes CFG)
Phase 24 OriPipelining ── SOFTWARE PIPELINING: overlaps iterations
...
Phase 35 OriHoistInvariantsEarly ── LICM #1: after GVN, before mid-expansion
...
Phase 59 OriLoopFusion ── FUSION: merges adjacent compatible loops
...
Phase 66 OriHoistInvariantsLate ── LICM #2: after predication
...
Phase 79 OriHoistInvariantsLate2 ── LICM #3: after late unsupported-op expansion
...
Phase 88 OriHoistInvariantsLate3 ── LICM #4: after GMMA fixup
...
Phase 108 OptimizeHotColdInLoop ── separates hot/cold paths within loops (post-RA)
Ordering Rationale
The eight loop passes are deliberately spread across the pipeline rather than clustered together. Each occupies a specific position dictated by what has been lowered or optimized upstream:
- Phase 18 (simplification) must run before strength reduction (21) and unrolling (22) because both require canonical loop forms.
- Phase 22 (unrolling) runs after strength reduction so that induction variable simplifications are already applied, avoiding redundant computation in unrolled copies.
- Phase 24 (pipelining) runs after unrolling because pipelining targets loops that were not fully unrolled.
- Phase 35 (early LICM) runs after GeneralOptimize at phase 29, which performs partial CSE, giving it common subexpressions to hoist.
- Phase 59 (fusion) runs after late expansion (phase 55) because expansion can split a single operation into a loop pair that fusion can reunite.
- Phases 66, 79, 88 (late LICM instances) each follow a major transformation that can create new loop-invariant code: predication (63), unsupported-op expansion (78), and GMMA fixup (87), respectively.
Loop Representation in Ori IR
ptxas does not use a dedicated loop descriptor data structure (no LoopInfo object like LLVM's). Instead, loop membership is implicit in the CFG through annotations computed by AnalyzeControlFlow (phase 3):
| BB Field | Offset | Type | Meaning |
|---|---|---|---|
| loop_depth | +144 | int | Loop nesting depth (0 = not in loop) |
| loop_depth_equal | +152 | int | Copy of loop_depth, used for sibling detection |
| predecessor_list | +128 | linked_list* | List of predecessor block indices |
| successor_list | +136 | linked_list* | List of successor block indices |
A loop header is a block whose loop_depth equals its own back-edge source's depth. Back-edge information is stored in the Code Object's back-edge hash map at offset +680. Diagnostic output from sub_BDEA50 prints this information as bix%d -> backedge's successor BB: %d.
The block iteration order is controlled by a reverse-post-order (RPO) array stored at Code Object offset +512. All loop passes iterate over this array, ensuring they visit headers before inner blocks. The array length is at Code Object offset +520.
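The interplay of RPO ordering, back-edge detection, and nesting depth can be illustrated with a small executable sketch. This is not ptxas code: the function names and the adjacency-list CFG encoding are invented for illustration, and the back-edge test used here (target appears no later in RPO) is only sufficient for reducible CFGs.

```python
from collections import defaultdict

def rpo_and_loop_depth(succ, entry):
    """succ: {block: [successors]}. Returns (rpo list, {block: loop_depth})."""
    order, seen = [], set()
    def dfs(b):                          # post-order DFS from the entry block
        seen.add(b)
        for s in succ.get(b, []):
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(entry)
    rpo = order[::-1]                    # reverse post-order
    pos = {b: i for i, b in enumerate(rpo)}

    pred = defaultdict(list)             # invert the edge set
    for b, ss in succ.items():
        for s in ss:
            pred[s].append(b)

    depth = {b: 0 for b in rpo}
    for b in rpo:
        for s in succ.get(b, []):
            if pos[s] <= pos[b]:         # back edge b -> s (reducible CFG)
                body = {s}               # natural loop: blocks that reach the
                work = [b]               # latch without passing the header
                while work:
                    n = work.pop()
                    if n not in body:
                        body.add(n)
                        work.extend(p for p in pred[n] if p in pos)
                for n in body:
                    depth[n] += 1        # each enclosing loop adds one level
    return rpo, depth
```

On a doubly nested loop, the inner header and latch receive depth 2 while the outer header gets depth 1, mirroring the +144 annotation described above.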
Phase 18 -- OriLoopSimplification
Purpose
Canonicalizes loop structure to simplify downstream analysis. Ensures each natural loop has a single entry edge, inserts dedicated preheader blocks where needed, and normalizes back-edge shapes. This is a prerequisite for strength reduction, unrolling, and pipelining, all of which assume canonical loop form.
Entry Point
sub_C5FB00 (34 bytes) ── vtable execute(), calls sub_7DDB50
└─ sub_78B430 (1,172 bytes) ── LoopMakeSingleEntry core
├─ sub_7753F0 ── pre-pass: loop peeling setup
├─ sub_789BE0 ── canonicalize back-edges
├─ sub_781F80 ── rebuild instruction list
└─ sub_9253C0 ── split edges / insert preheader
Algorithm
function LoopSimplification(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
// Phase 1: optional loop peeling for O4+ or flagged functions
if opt_level in {4,5} or flags[1382] & 4 set:
peeled = PeelOuterEdges(code_object, 0) // sub_7753F0
canonicalized = CanonicalizeBackEdges(code_object, peeled) // sub_789BE0
else:
canonicalized = CanonicalizeBackEdges(code_object, 0)
if code_object.flags[1368] & 1 == 0: // re-check after canon
return
// Phase 2: single-entry enforcement
if not QueryKnob("LoopMakeSingleEntry", knob_487): // OCG knob 487
return
RebuildInstructionList(code_object, 1) // sub_781F80
for each block in RPO order:
if block.loop_depth > 0 and block is loop header:
// find the deepest-nesting back-edge target
// if multiple entries exist, split into single-entry form
// insert preheader block between external predecessors and header
InsertPreheaderIfNeeded(code_object, block) // sub_9253C0
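As a concrete illustration of the preheader step, here is a minimal sketch over an adjacency-list CFG. The names (insert_preheader, new_block) are invented; ptxas operates on its block descriptors and linked predecessor/successor lists rather than Python dicts.

```python
def insert_preheader(succ, header, latch, new_block):
    """Redirect every non-back-edge predecessor of `header` to a fresh
    preheader block, which then falls through to the header. After this,
    the header has exactly two predecessors: the preheader and the latch.
    Mutates and returns `succ` ({block: [successors]})."""
    for b in succ:
        if b != latch:                   # keep the back edge latch -> header
            succ[b] = [new_block if s == header else s for s in succ[b]]
    succ[new_block] = [header]           # preheader's sole successor
    return succ
```

With two external predecessors (0 and 2) and latch 3, both external edges are rerouted through the preheader while the back edge is preserved.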
GPU-Specific Considerations
The simplification pass checks the optimization level at offset +896 of the code object. Levels 4 and 5 (-O4, -O5) enable aggressive loop peeling via sub_7753F0 before canonicalization. At the default -O2, peeling is suppressed to avoid code size growth that could cause instruction cache thrashing.
The LoopMakeSingleEntry knob (OCG knob 487) is the master enable. When disabled, only back-edge canonicalization runs -- preheader insertion is skipped. This knob is checked via the standard OCG knob query at offset +152 of the allocator vtable.
The pass also inspects the convergence flag at offset +1380 (bit 7). When set, it indicates a convergent execution context (e.g., warp-synchronous code), and certain edge-splitting transformations are suppressed to avoid disrupting convergence guarantees.
Related Knobs
| Knob Name | Default | Description |
|---|---|---|
| LoopInversion | enabled | Enable loop inversion (do-while to while conversion) |
| LoopInversionBudget | unset | Maximum instruction count for loop inversion |
| LoopPeelInversion | disabled | Enable loop peeling combined with inversion |
| EnableSingleThreadPeelingLoops | unset | Enable peeling for single-thread execution paths |
| GenPeelingLoopsForSyncs | unset | Generate peeling loops around sync instructions |
| AssertIfPeelingLoopForTexSurf | unset | Assert (debug) if peeling a loop for texture/surface ops |
Phase 22 -- OriLoopUnrolling
Purpose
Performs full unrolling of loops with known small trip counts and partial unrolling of larger loops to amortize loop overhead and expose instruction-level parallelism. This is one of the most impactful optimization passes for GPU code, where loops over texture coordinates, reduction accumulators, and matrix tiles dominate execution time.
Function Map
Correction (P1-04): The W023 report incorrectly listed sub_83EF00 as the unrolling driver. That function is the MainPeepholeOptimizer (confirmed by p1.06a sweep). The actual unrolling call chain starts at sub_1392E30.
| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_1392E30 | 25 lines | Phase 22 execute entry: guards, calls initializer + driver + cleanup | HIGH |
| sub_1389AF0 | 593 lines | Unrolling context initializer: reads all knobs from OCG profile | HIGH |
| sub_1390B30 | 1,598 lines | Main unrolling driver: per-loop decision, factor selection, dispatch | HIGH |
| sub_138A6E0 | 774 lines | Post-unroll cleanup: frees working structures | HIGH |
| sub_7E5120 | 19 lines | Nounroll/skip check: pragma flag, convergence, knob 91 | HIGH |
| sub_7F5D20 | 99 lines | Rejection recording: indexes string table at 0x21D1EA0 | HIGH |
| sub_138E3E0 | 125 lines | Loop body scanner: three-pass analysis (header, forward, backward) | HIGH |
| sub_13858C0 | 42 lines | Loop back-edge locator | HIGH |
| sub_1385E90 | ~200 lines | Trip count bound extractor (init, limit, stride from IV) | MEDIUM |
| sub_1383620 | 1,157 lines | Full unroll profitability evaluator (foldable constants, addresses) | MEDIUM |
| sub_1387C30 | ~400 lines | Partial unroll body replicator | MEDIUM |
| sub_13880F0 | ~200 lines | Post-unroll CFG fixup | MEDIUM |
| sub_1385950 | ~300 lines | Induction variable analysis | MEDIUM |
| sub_138E9C0 | ~400 lines | IV stride/direction verification | MEDIUM |
| sub_1385CC0 | ~200 lines | IV constant detection | MEDIUM |
| sub_13829F0 | ~200 lines | Profitability: foldable constant load counting | MEDIUM |
| sub_A3A7E0 | 1,236 lines | Post-unroll statistics (DUMPIR output) | HIGH |
Unrolling Decision Algorithm
The unrolling decision is a multi-stage pipeline implemented in sub_1390B30. The function iterates over loops in reverse RPO order (innermost first, matching the RPO array at code_object+512) and applies a series of eligibility checks, trip count analysis, factor selection, and profitability evaluation before committing to the unroll.
Entry Guard (sub_1392E30)
function OriLoopUnrolling_Execute(code_object):
if code_object.flags[1368] & 1 == 0: // optimization disabled
return
if code_object.flags[1397] & 0xC0 == 0x40: // global nounroll override
return
if DUMPIR_skip("LoopUnrolling"): // sub_799250
return
if CountBlocks(code_object) <= 2: // sub_7DDB50
return
if not QueryKnob(487, true): // master loop pass guard
return
ctx = InitializeContext(code_object) // sub_1389AF0
RunUnrolling(ctx) // sub_1390B30
Cleanup(ctx) // sub_138A6E0
Context Initialization and Knob Defaults (sub_1389AF0)
The initializer reads unrolling parameters from the OCG profile object. Each knob carries a four-valued type selector: 0 = use hardcoded default, 1 = use integer override, 2 = use float override, 3 = use double override. The defaults recovered from the binary:
| Context Field | Profile Offset | Default | Knob Name (inferred) |
|---|---|---|---|
| ctx+168 (int32) | +31320 | 140 | UnrollBudget |
| ctx+172 (float) | +31032 | 0.25 | UnrollFlexableFullLimit |
| ctx+176 (int32) | +30960 | 4 | UnrollUnknownCount |
| ctx+180 (int32) | +30816 | 4 | UnrollSmallLoopLimit |
| ctx+184 (dbl) | +64656 | 0.4 | LoopUnrollLargePartOfShaderPct |
| ctx+192 (float) | +31392 | 20.0 | UnrollInstLimit |
| ctx+196 (int32) | +64872 | 50 | UnrollPregThreshold |
| ctx+200 (int32) | +31248 | 2 | UnrollExtraInstPerPercentSaving |
| ctx+204 (int32) | +31176 | 200 | UnrollFullInstLimit |
| ctx+208 (int32) | +64296 | 46 | LoopUnrollNumExtraInstBase |
Boolean and integer knobs read via vtable dispatch:
| Knob ID | Profile Offset | Default | Knob Name |
|---|---|---|---|
| 437 | +31464 | true | LoopUnroll (master enable) |
| 894 | +64368 | true | LoopUnrollNonInnermost |
| 897 | +64584 | true | UnrollMultiBlockLoops |
| 902 | +64944 | true | UnrollVariableBounds |
| 896 | +64512 | 0 | LoopUnrollFactor (INT override; 0 = heuristic) |
| 895 | +64440 | 0 | EpilogueLoopUnrollCount |
| 900 | +64800 | 0 | LoopUnrollNumInstTex |
| 903 | +65016 | false | DisablePartialUnrollOverflowCheck |
String knob: knob 427 (profile+30744) returns the LoopUnrollFactor per-block override string, with the format "-N-" to skip block N, "+N+" to force-unroll block N, "-" to skip all, "+" to force all.
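The override-string grammar is simple enough to capture in a few lines. This sketch is an interpretation of the format described above (substring matching mirrors the strstr-based checks in Step H of the driver); the function name is hypothetical.

```python
def block_override(override, block_id):
    """Interpret the per-block LoopUnrollFactor override string:
    "-" skips every block, "+" forces every block,
    "-N-" skips block N, "+N+" forces block N.
    Returns "skip", "force", or None (fall back to the heuristic).
    The "-N-" delimiters keep multi-digit ids unambiguous."""
    if not override:
        return None
    if override == "-" or f"-{block_id}-" in override:
        return "skip"
    if override == "+" or f"+{block_id}+" in override:
        return "force"
    return None
```

For example, the string "-12-+3+" would skip block 12, force-unroll block 3, and leave every other block to the heuristic.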
Nounroll Pragma Check (sub_7E5120)
Returns true (suppress unrolling) when any of these conditions hold:
- Convergence constraint: The back-edge analysis context at code_object+1784 is active, and the loop header's entry in the back-edge table (code_object+1776+16) is valid and within the convergence limit. This suppresses unrolling of warp-synchronous loops.
- PTX nounroll pragma: Byte 292 of the block descriptor at (code_object+368 + 8*block_idx) has bit 1 set. This bit is set during PTX-to-Ori lowering when the nounroll pragma string (at 0x1CFE126) is parsed.
- Instruction-level marker: Byte 283 of the loop header instruction has bit 0 set.
- Per-block knob: OCG knob 91 is set for this block (queried via sub_7A1A90).
Main Decision Flowchart (sub_1390B30)
function RunUnrolling(ctx):
code_object = ctx.code_object
// Phase 1: Read master enable and per-block override string
master_enable = QueryKnob(437) // LoopUnroll
override_string = QueryKnobString(427) // "-N-" / "+N+" format
RecomputeRegisterPressure(code_object) // sub_7E6090
RebuildInstructionList(code_object) // sub_781F80
// Phase 2: Pre-scan -- count inlinable calls and non-unrollable instructions
for each instruction in code_object.instruction_list:
if opcode == 97 (BRX):
if callee.entry_block == callee.exit_block:
inlinable_calls++
if trip_count > 1:
multi_exit |= AnalyzeMultiExit(ctx, callee)
// Phase 3: Iterate loops in reverse RPO (innermost first)
rpo_count = code_object.rpo_count // offset +520
for idx = rpo_count-1 downto 0:
block = code_object.blocks[code_object.rpo[idx]]
// ── Step A: nounroll annotation propagation ──
if block.nounroll_annotation: // byte +246
propagate nounroll to all blocks at >= same nesting depth
// ── Step B: eligibility filter ──
if block.loop_depth == 0: continue // not a loop
if block.loop_depth != block.loop_depth_equal: continue
if block.nounroll and not ctx.force_all: continue
// ── Step C: structure analysis ──
latch = LocateBackEdge(ctx, block) // sub_13858C0
if not latch: continue
exit_inst = latch.last_instruction
if exit_inst.opcode != 95: // not conditional branch
Reject(block, 13); continue // indirect jump
// ── Step D: nounroll / convergence check ──
if CheckNounroll(block, code_object): // sub_7E5120
Reject(block, 11); continue
// ── Step E: execution frequency analysis ──
freq_header = code_object.freq_table[header_reg]
freq_latch = code_object.freq_table[latch_reg]
is_hot = (freq_latch > 999) and (freq_header > 0)
and (freq_latch / freq_header > 3)
// ── Step F: body analysis ──
body_info = ScanLoopBody(ctx, block, latch) // sub_138E3E0
// body_info contains: tex_count, body_size, foldable_ldc_count,
// has_cross_edges, mem_count
if body_info.has_cross_edges: continue
// ── Step G: budget computation ──
budget_scale = QueryKnobDouble(898, 0.5) // default 0.5
scaled_body = (int)(budget_scale * body_size)
remaining = total_budget - body_size - scaled_body - ...
// ── Step H: per-block override check ──
if override_string:
needle = "-{block_id}-"
if override_string == "-" or strstr(override_string, needle):
continue // skip this block
needle = "+{block_id}+"
if override_string == "+" or strstr(override_string, needle):
force_unroll = true
// ── Step I: pragma force-unroll ──
if flags[1397] & 0xC0 == 0x80: // PTX pragma force
force_unroll = true
// ── Step J: non-innermost filter ──
if not ctx.allow_non_innermost and not force_unroll:
if 10 * body_info.tex_count < remaining:
Reject(block, 7); continue
// ── Step K: factor selection ──
if force_unroll:
factor = 1 << ctx.force_factor // power-of-2 override
else if known_trip_count:
factor = trip_count
// Budget-constrain: while factor * body_cost > UnrollBudget:
// factor--
if factor > 4 and trip_count == 1:
factor &= ~3 // round to mult-of-4
if factor <= 1:
Reject(block, 12); continue
else:
if body_size <= 49 and body_info.tex_count > 0:
factor = 2 // conservative default
else:
factor = max(1, UnrollBudget / body_cost)
// ── Step L: knob override ──
if QueryKnob(429): // LoopUnrollFactor INT
factor = GetKnobInt(429)
// ── Step M: IV analysis ──
iv_info = AnalyzeIV(ctx, latch) // sub_1385950
if not iv_info: Reject(block, 14); continue
if not ValidateIV(ctx, iv_info): // sub_1387870
Reject(block, 14); continue
bound = ExtractBound(ctx, iv_info) // sub_1385E90
if not bound or bound.opcode != 2:
Reject(block, 16); continue
if bound.def_block.predecessor_count != 1:
Reject(block, 17); continue
if bound.init_reg == bound.limit_reg:
Reject(block, 18); continue
stride_ok = VerifyStride(ctx, block, latch, iv_info, bound)
if stride_ok & 2: Reject(block, 17); continue
if stride_ok & 1: Reject(block, 18); continue
// ── Step N: detect constant trip count ──
const_iv = DetectConstantIV(ctx, iv_info) // sub_1385CC0
// ── Step O: profitability for full unroll ──
if factor == trip_count and single_block_body:
if CheckFoldableProfitability(ctx, block, iv_info, factor):
ReplicateFullUnroll(ctx, block, factor) // sub_1383620
stats.unrolled_count++
continue
// ── Step P: partial unroll execution ──
if factor >= 2:
remainder = trip_count % factor
iterations_per_copy = (trip_count - remainder) / factor
block.iterations_per_copy = iterations_per_copy
if remainder > 0:
for r = 0 to remainder-1:
DuplicateBody(ctx, block) // sub_932E40
ReplicatePartialUnroll(ctx, block, latch,
factor, remainder) // sub_1387C30
stats.unrolled_count++
else:
Reject(block, 24) // budget exceeded
// Phase 4: Post-unroll fixup
stats.non_unrolled = total_loops - stats.unrolled - stats.failed
if any_unrolled:
RebuildBackEdges(code_object) // sub_7846F0
RerunLiveness(code_object) // sub_A0F020
RerunControlFlow(code_object) // sub_752E40
MarkModified(code_object) // sub_7B52B0
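The budget-constrained factor selection (Step K) and remainder peeling (Step P) reduce to short integer arithmetic. The sketch below is a simplification: the real driver's cost model folds in texture weights, register pressure, and the knob overrides listed later, none of which are modeled here.

```python
def partial_unroll_plan(trip_count, body_cost, budget, force_factor=0):
    """Pick an unroll factor that fits the budget (Step K), then split the
    trip count into replicated copies plus peeled remainder iterations
    (Step P). Returns (factor, iterations_per_copy, remainder), or None
    when no factor >= 2 fits (rejection codes 12/24). Sketch only."""
    if force_factor:
        factor = 1 << force_factor           # power-of-2 pragma override
    else:
        factor = max(1, budget // body_cost) # heuristic starting point
        factor = min(factor, trip_count)
    while factor > 1 and factor * body_cost > budget:
        factor -= 1                          # budget-constrain the factor
    if factor < 2:
        return None                          # unrolling not worthwhile
    remainder = trip_count % factor          # iterations peeled before the
    iterations_per_copy = (trip_count - remainder) // factor  # steady state
    return factor, iterations_per_copy, remainder
```

For a 10-iteration loop with body cost 10 and budget 140, the plan is factor 4 with 2 peeled remainder iterations; shrinking the budget to below 20 rejects the loop entirely.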
Unroll Rejection Table
When a loop cannot be unrolled, sub_7F5D20 records the reason by indexing a string pointer array at 0x21D1EA0. The diagnostic strings contain hex codes like "0x80000001 - Not unrolled: Irregular loop" -- these hex values are part of the printed message text, not the internal array index. The W023 report originally described a 36-byte structure table at 0x21D1980; that table belongs to the operand range lookup in the peephole optimizer (sub_7E39B0), not the unrolling pass. The actual internal rejection codes are simple integers indexing the string array:
| Code | Category | Reason |
|---|---|---|
| 7 | Performance | Body too large relative to texture savings (10 * tex_count < remaining_budget) |
| 11 | Pragma/knob | PTX nounroll pragma, convergence constraint, or per-block knob 91 |
| 12 | Budget | Partial unroll factor reduced to 1 (no factor >= 2 fits within UnrollBudget) |
| 13 | Ineligible | Loop exit contains BRX (indirect jump, opcode 95 with special flags) |
| 14 | Unsupported IV | Induction variable analysis failed (sub_1385950 or sub_1387870) |
| 15 | Unsupported IV | IV register class is not integer (class 1) or pointer (class 2/3) |
| 16 | Trip count | Trip count bound extraction failed (sub_1385E90) |
| 17 | Irregular | IV definition block has multiple predecessors, or stride/direction verification failed |
| 18 | Trip count | IV initial value register equals IV limit register (degenerate zero-trip loop) |
| 19 | Unsupported IV | IV stride sign inconsistent between loop header and induction increment |
| 24 | Budget | Catch-all: budget exceeded after all factor reduction attempts |
The diagnostic output is gated by flags[1421] & 0x20 (DUMPIR verbose mode). When enabled, the rejection string is recorded in a hash map keyed by the loop header instruction node, using FNV-1a hashing of the node's block index.
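FNV-1a itself is a standard public hash; a 32-bit reference implementation is shown below for readers unfamiliar with it. How the binary widens or seeds the block index before hashing has not been confirmed.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a with the standard offset basis and prime. Sketch of
    the hashing scheme used for the rejection map; the exact key layout
    in the binary (block index width, seeding) is an assumption."""
    h = 0x811C9DC5                       # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte                        # xor first...
        h = (h * 0x01000193) & 0xFFFFFFFF  # ...then multiply by the prime
    return h
```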
Heuristic Thresholds (Knobs)
The unrolling decision is controlled by a rich set of OCG knobs. All knob names are stored ROT13-encoded in the binary:
| Knob Name | Type | Default | Description |
|---|---|---|---|
| LoopUnroll | BOOL | true | Master enable for loop unrolling |
| LoopUnrollFactor | INT | 0 | Override unroll factor (0 = heuristic) |
| UnrollBudget | INT | 140 | Maximum total instruction count after unrolling |
| UnrollInstLimit | FLOAT | 20.0 | Maximum instructions in a single unrolled loop body |
| UnrollFullInstLimit | INT | 200 | Maximum body size for full unrolling |
| UnrollFlexableFullLimit | FLOAT | 0.25 | Flexible full-unroll limit (adjusted by loop characteristics) |
| UnrollSmallLoopLimit | INT | 4 | Body size threshold below which loops are always fully unrolled |
| UnrollPregThreshold | INT | 50 | Maximum predicate register pressure for unrolling |
| UnrollMultiBlockLoops | BOOL | true | Allow unrolling of multi-basic-block loop bodies |
| UnrollVariableBounds | BOOL | true | Allow unrolling when trip count is not compile-time constant |
| UnrollUnknownCount | INT | 4 | Default trip count assumption when count is unknown |
| UnrollUnknownInstLimit | INT | 0 | Maximum body size for unrolling with unknown trip count |
| UnrollExtraInstPerPercentSaving | INT | 2 | Instructions allowed per percent of cycle saving |
| UnrollTex3DPercentSavedThreshold | INT | 0 | Minimum savings percent for 3D texture loops |
| UnrollProfiledColdInstsScale | INT | 0 | Scale factor for instruction count in profiled-cold blocks |
| LoopUnrollExtraFoldableLdcWeight | INT | 0 | Extra weight for foldable constant loads in unroll benefit |
| LoopUnrollFoldableAddrWeight | INT | 0 | Weight for foldable address computations |
| LoopUnrollLargePartOfShaderPct | DOUBLE | 0.4 | Percentage threshold: loop is "large part of shader" |
| LoopUnrollNumExtraInstBase | INT | 46 | Base extra instruction allowance per unroll iteration |
| LoopUnrollNumInstSmallLoop | INT | 0 | Instruction count defining "small loop" |
| LoopUnrollNumInstTex | INT | 0 | Texture instruction count bonus for unrolling |
| LoopUnrollSingleLoopSavedPctFactor | INT | 0 | Savings factor for single-loop shaders |
| LoopUnrollNonInnermost | BOOL | true | Allow unrolling of non-innermost loops |
| LoopUnrollUnknownMultiBlock | BOOL | false | Allow multi-block unroll with unknown bounds |
| EpilogueLoopUnrollCount | INT | 0 | Unroll count for epilogue (remainder) loops |
| DisablePartialUnrollOverflowCheck | BOOL | false | Skip overflow check on partial unroll count |
GPU-Specific Unrolling Concerns
Register pressure. GPU threads share a fixed register file per SM. Unrolling increases live ranges, potentially reducing occupancy (the number of concurrent warps). The unroller queries register pressure estimates and compares against UnrollPregThreshold before committing.
Instruction cache. GPU instruction caches are small (typically 128KB L1i per SM). Aggressive unrolling of large loop bodies can cause i-cache thrashing. The UnrollBudget knob caps the total instruction growth.
Texture instruction scheduling. Texture fetches have high latency (hundreds of cycles). Unrolling loops containing texture operations is especially profitable because it exposes independent fetches that the scheduler can overlap. The LoopUnrollNumInstTex and UnrollTex3DPercentSavedThreshold knobs give extra weight to texture-heavy loops.
PTX nounroll pragma. The PTX string nounroll at 0x1CFE126 is parsed during PTX-to-Ori lowering and sets bit 1 of byte 292 in the block descriptor at (code_object+368 + 8*block_idx). The check is performed by sub_7E5120, which also tests three additional suppression conditions: the convergence constraint (back-edge table at code_object+1776), an instruction-level marker (byte 283 bit 0), and per-block knob 91. Any single condition is sufficient to suppress unrolling for that loop (rejection code 11).
Convergence constraint. When the back-edge analysis context at code_object+1784 is active (indicating warp-synchronous code), the unroller checks whether the loop header falls within the convergence region. If it does, unrolling is suppressed to avoid breaking warp-level synchronization guarantees. This is particularly important for cooperative groups and ballot-based algorithms.
DUMPIR Statistics
When diagnostics are enabled, the pass outputs:
# [partially unrolled loops=N] [non-unrolled loops=M]
This line appears in eight SM-variant statistics printers (sub_ABBA50 through sub_ABEB50), each a 1,771-byte clone specializing output format for a specific SM generation.
Phase 24 -- OriPipelining
Purpose
Performs modulo software pipelining on loops that were not fully unrolled. The pass overlaps successive loop iterations by interleaving instructions from different iterations within a single loop body, hiding functional unit and memory latency. This is the single most complex loop transformation in ptxas.
Two-Layer Pipelining Architecture
ptxas implements software pipelining in two cooperating layers:
- Phase 24 (OriPipelining, pre-RA): Annotates instruction operands with pipeline latency classes, computes the minimum initiation interval (MII), and performs the modulo scheduling loop transformation (iteration overlap, prolog/epilog generation). Operates on the Ori IR before register allocation.
- Post-RA SoftwarePipeline (sub_8B9390, 23KB): A scheduling algorithm variant within the post-RA instruction scheduler (address range 0x893000--0x8FE000) that performs instruction-level scheduling of already-pipelined loop bodies using physical registers. One of approximately 12 scheduling variants alongside DualIssueScheduler, TensorScheduler, LoopScheduler, PrefetchScheduler, etc.
The two layers cooperate: Phase 24 transforms the loop structure (instruction replication, prolog/epilog construction) before register allocation. The post-RA SoftwarePipeline variant handles the cycle-accurate instruction placement of already-pipelined loops.
Function Map
| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_926A30 | 22,116 bytes | Per-instruction operand latency annotator and encoding rewriter | HIGH |
| sub_91A0F0 | 5,550 bytes | Opcode-to-latency-class classifier (~350 opcodes, 13 distinct classes) | HIGH |
| sub_9203A0 | 4,881 bytes | Pipeline stage cost calculator (ResMII computation, FP cost accumulation) | MEDIUM |
| sub_921820 | 1,592 bytes | Prolog/epilog code generator | MEDIUM |
| sub_9202D0 | 207 bytes | Two-operand pipeline feasibility check (returns 60=reject, 130=accept) | HIGH |
| sub_91E610 | 399 bytes | Register-class-based latency lookup (class 4→26, class 5/2→20) | HIGH |
| sub_91E900 | 470 bytes | Pipe-assignment-based stall cycle calculator (32/64 cycle caps) | HIGH |
| sub_92C0D0 | 358 bytes | Per-instruction annotation wrapper (calls sub_926A30, checks opcode changes) | HIGH |
| sub_92C240 | 8,033 bytes | Extended GEMM-loop pipeliner (SM90+ TMA pipeline depth management) | MEDIUM |
| sub_8B9390 | 22,841 bytes | Post-RA software pipelining scheduling variant (in scheduler subsystem) | MEDIUM |
Correction (P1-06): The original function map listed sub_926A30 as the "main pipelining engine (modulo scheduling)." Decompilation reveals it is the per-instruction operand latency annotator -- it iterates over each operand of an instruction, calls sub_91A0F0 to classify the operand's latency class, and rewrites the operand encoding with the latency annotation. The modulo scheduling loop transformation is distributed across the remaining functions, with sub_9203A0 computing stage costs and sub_921820 generating prolog/epilog code.
Software Pipelining Algorithm
Phase 1: Operand Latency Annotation
For each instruction in the loop body, sub_92C0D0 calls sub_926A30 to annotate operands:
function AnnotateOperandLatencies(code_object, instruction):
opcode = instruction.word & 0xFFFFCFFF // strip modifier bits (bits 12-13)
secondary_opcode = instruction.secondary_opcode
operand_array = instruction.operands // offset +84
operand_count = instruction.operand_count // offset +80
for i in 0..operand_count-1:
operand_type = (operand_array[i].word >> 28) & 7
if operand_type in {2, 3}: // register or register pair
// Adjust count for predicated instructions (bit 12)
adjusted_count = operand_count - 2 * ((opcode >> 11) & 2 != 0)
if i < adjusted_count:
latency_class = ClassifyLatency(opcode, secondary_opcode,
operand_array, adjusted_count, i)
if latency_class != default:
RewriteOperandEncoding(operand_array[i], code_object, latency_class)
// For register operands: call full rewriter sub_922210
// For non-register operands: call sub_9267C0
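The operand filter at the top of the annotator can be expressed compactly. The bit positions below are taken directly from the pseudocode above (operand type in bits 28-30, predication in opcode bit 12) and should be treated as reverse-engineered observations, not a documented encoding.

```python
def is_annotatable_operand(word, index, operand_count, opcode):
    """Return True if operand `index` is a candidate for latency
    annotation: its type field (bits 28-30 of the encoding word) must be
    2 (register) or 3 (register pair), and for predicated instructions
    (opcode bit 12) the last two operands are excluded. Sketch based on
    the decompiled filter in sub_926A30."""
    operand_type = (word >> 28) & 7
    if operand_type not in (2, 3):
        return False                     # immediates etc. are skipped
    predicated = ((opcode >> 11) & 2) != 0   # tests opcode bit 12
    adjusted = operand_count - (2 if predicated else 0)
    return index < adjusted
```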
Phase 2: Pipeline Feasibility Filtering
Each instruction is checked by sub_9202D0:
function CheckPipelineFeasibility(code_object, instruction):
// Reject instructions with special operand flags
if (operand_array[1] & 0x603FFFF) != 0 or (operand_array[3] & 0xF8000000) != 0:
if optimization_level > 1:
return REJECT // return code 60
// Reject if pipe assignment class <= 3 (control/barrier pipe)
pipe_class = PipeAssignment(code_object, primary_opcode) // vtable+904
if pipe_class <= 3:
return REJECT
// Reject if operand 0 and operand 1 have different latency classes
lat0 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 0)
lat1 = ClassifyLatency(opcode, secondary_opcode, operand_array, count, 1)
if lat0 != lat1:
return REJECT // asymmetric latencies
// Reject if extended operands have blocking flags
if operand_count > 2 and (operand_array[4] & 0xF) or (operand_array[4] >> 4) & 1:
return REJECT
// Accept: trim to 2-operand form
result_operands = &operand_array[2]
result_count = 2
return ACCEPT // return code 130
Phase 3: MII Computation
The minimum initiation interval is computed as:
MII = max(RecMII, ResMII)
RecMII (recurrence-constrained): The longest data dependence cycle in the DDG divided by the iteration distance it spans. For a cycle of total latency L spanning D iterations: RecMII = ceil(L / D).
ResMII (resource-constrained): Computed by sub_9203A0 using floating-point cost accumulation. The function classifies each instruction's pipe class using a 7-entry pipe class table at code_object+16 and accumulates per-pipe instruction counts:
function ComputeResMII(loop_body, pipe_table):
pipe_counts[0..6] = {0}
for each instruction in loop_body:
lat0 = ClassifyLatency(instruction, operand=0)
lat1 = ClassifyLatency(instruction, operand=1)
pipe = MapLatencyToPipe(lat0, pipe_table) // 7-entry lookup
pipe_counts[pipe] += cost(instruction) // FP cost weights
ResMII = max(pipe_counts[i] / pipe_width[i] for i in 0..6)
The pipe class boundaries stored at code_object+16 define 7 functional unit classes. Each class has a capacity (number of execution slots per cycle). ResMII is the maximum ratio of instruction demand to capacity across all pipe classes.
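The two bounds combine as MII = max(RecMII, ResMII). A tiny executable rendering of that formula, with invented function names:

```python
import math

def rec_mii(cycles):
    """cycles: list of (total_latency, iteration_distance) pairs, one per
    dependence cycle in the DDG. RecMII = max over cycles of ceil(L/D)."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

def mii(cycles, res_mii):
    """Lower bound on the initiation interval: the schedule can be no
    tighter than either the recurrence or the resource constraint."""
    return max(rec_mii(cycles), res_mii)
```

A cycle with 10 cycles of latency spanning 2 iterations forces RecMII = 5; if the resource bound is only 4, the recurrence dominates and MII = 5.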
Phase 4: Modulo Schedule Construction
function ModuloSchedule(loop_body, MII):
II = MII
while II <= MAX_II:
MRT = new ModuloReservationTable(II) // II rows x pipe_classes columns
success = true
for each instruction in priority order:
earliest = max(data_dependency_constraints)
latest = earliest + II - 1
placed = false
for slot in earliest..latest:
row = slot mod II
pipe = instruction.pipe_class
if MRT[row][pipe] has capacity:
MRT[row][pipe] -= 1
instruction.scheduled_time = slot
instruction.stage = slot / II
placed = true
break
if not placed:
success = false
break
if success:
return (II, schedule)
II += 1
return FAILURE // could not pipeline
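The reservation-table search above is directly runnable once dependence constraints are folded into per-instruction earliest-start times. The sketch below does exactly that; the pipe model (one capacity per class) is a toy stand-in for the 7-entry table at code_object+16, and all names are invented.

```python
def modulo_schedule(instrs, mii, max_ii, pipe_width):
    """instrs: list of (earliest_start, pipe_class) tuples in priority
    order; pipe_width[p] is the per-cycle capacity of pipe class p.
    Try successive IIs from MII upward, placing each instruction in the
    first slot of [earliest, earliest+II-1] whose MRT row/pipe has
    capacity. Returns (II, {index: (time, stage)}) or None."""
    for ii in range(mii, max_ii + 1):
        mrt = [[0] * len(pipe_width) for _ in range(ii)]  # II x pipes
        sched, ok = {}, True
        for idx, (earliest, pipe) in enumerate(instrs):
            for slot in range(earliest, earliest + ii):
                row = slot % ii                  # wrap into the MRT
                if mrt[row][pipe] < pipe_width[pipe]:
                    mrt[row][pipe] += 1
                    sched[idx] = (slot, slot // ii)  # (time, stage)
                    break
            else:
                ok = False                       # no free slot at this II
                break
        if ok:
            return ii, sched
    return None                                  # could not pipeline
```

Three same-pipe instructions on a single-wide pipe cannot share an II of 1 or 2, so the search settles at II = 3 with one instruction per row.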
Phase 5: Prolog/Epilog Generation
Once a valid schedule is found at initiation interval II with S pipeline stages, sub_921820 generates:
function GeneratePrologEpilog(loop, II, num_stages):
    // Prolog: S-1 partial iterations
    for stage in 0..num_stages-2:
        emit instructions assigned to stages 0..stage
        // Each prolog iteration adds one more stage
    // Kernel: steady-state loop body
    emit all instructions from all stages
    // Trip count adjusted: new_trip = original_trip - (num_stages - 1)
    // Epilog: S-1 drain iterations
    for stage in num_stages-2..0:
        emit instructions assigned to stages stage+1..num_stages-1
        // Each epilog iteration removes one stage
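The ramp-up/steady-state/drain structure is easy to see with a sketch that just returns the stage sets each emitted iteration executes. This is an illustration of the standard prolog/kernel/epilog shape (drain ordering follows the "removes one stage" comment in the pseudocode); the function name is made up.

```python
def prolog_kernel_epilog(num_stages):
    """Stage sets executed by each prolog iteration, the kernel body,
    and each epilog iteration, for a pipeline of S = num_stages stages."""
    prolog = [list(range(0, s + 1)) for s in range(num_stages - 1)]   # ramp-up
    kernel = list(range(num_stages))                                  # steady state
    epilog = [list(range(s + 1, num_stages)) for s in range(num_stages - 1)]  # drain
    return prolog, kernel, epilog

p, k, e = prolog_kernel_epilog(3)
print(p)  # -> [[0], [0, 1]]
print(k)  # -> [0, 1, 2]
print(e)  # -> [[1, 2], [2]]
```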
Instruction Latency Classifier (sub_91A0F0)
The classifier is a 5.5KB, 1372-line switch statement mapping approximately 350 Ori opcodes to 13 distinct latency class values. It takes five parameters: (opcode, secondary_opcode, operand_array, operand_count, operand_index) and returns a class ID -- not a cycle count. The scheduler maps class IDs to actual cycle counts via the hardware profile.
Latency Class Table
| Class | Typical opcodes | Meaning |
|---|---|---|
| 1 | Past-end operands, invalid indices | Skip / not used |
| 6 | Simple ALU, bitwise, short integer | Short-pipe latency (~80 opcodes) |
| 7 | Paired register operations | Medium-short (~5 opcodes) |
| 8 | Special cases (via lookup table dword_21E1340) | Medium |
| 9 | Type conversions (via lookup table) | Medium |
| 10 | Integer multiply, shifts, IMAD | Medium-long (~40 opcodes) |
| 11 | Address computations, LEA variants | Medium-long (~15 opcodes) |
| 12 | Memory operations, FP32, barriers | Standard long (~100 opcodes) |
| 14 | Wide memory, atomics, FP64 stores | Extended long (~20 opcodes) |
| 16 | FP64 special variants | Extended long (~3 opcodes) |
| 20 | Texture fetches, uniform loads | Very long (~30 opcodes) |
| 26 | Global memory loads, uncached access | Maximum latency (~25 opcodes) |
| 31 | Scoreboard/barrier-related operands | Special handling (~5 opcodes) |
Opcode Family Handling
| Opcode range | Category | Latency behavior |
|---|---|---|
| 0x03--0x24 | Integer ALU | Mostly passthrough default; 0x23 always returns 10 |
| 0x3C, 0x3E, 0x4E, 0x4F | Memory (load/store) | Returns field from operand_array[4] bits for operands 0--1 |
| 0x46, 0xF3--0x106 | Texture | Returns 6 normally; 10 for MIO-dependent with extended flag check |
| 0x49, 0x4A, 0x51, 0x143, 0x15E | Atomic/reduce | Always returns 12 |
| 0x55--0x6F | Floating-point | Complex per-operand logic; 0x55 uses lookup table dword_21E1340 |
| 0x5B, 0x5C, 0x137 | Barriers/sync | Returns 12 for operand 1, else default |
| 0xB7, 0x120 | WGMMA setup | Per-operand latency (10--20) based on accumulator flags |
| 0x135 | HMMA/IMMA | Calls sub_7E39B0/sub_7E3A70/sub_7E3BA0/sub_7E3C30 for matrix latency |
| 0x13D, 0x13E | Extended FP | Accumulator-flag-dependent returns (10 or 12) |
Stall Cycle Calculator (sub_91E900)
sub_91E900 computes the stall penalty for an instruction by mapping latency classes through the pipe assignment function (vtable+904):
function ComputeStallCycles(code_object, instruction):
    lat0 = ClassifyLatency(instruction, operand=0)
    pipe0 = PipeAssignment(code_object, lat0)   // vtable+904
    if pipe0 == 8:                              // long-latency pipe
        stall = StallTable[instruction.index]   // code_object+440
        return min(stall, 64)                   // cap at 64 cycles
    lat1 = ClassifyLatency(instruction, operand=1)
    pipe1 = PipeAssignment(code_object, lat1)
    if pipe1 == 8:
        stall = StallTable[instruction.index]
        return min(stall, 64)
    // Neither operand on long pipe
    stall = StallTable[instruction.index]
    return min(stall, 32)                       // cap at 32 cycles
The pipe assignment value 8 corresponds to the long-latency functional unit (memory/texture). Instructions on this pipe get a 64-cycle cap; all others are capped at 32 cycles.
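The two-tier capping logic reduces to a few lines. In this sketch `pipe_of` stands in for the vtable+904 pipe assignment hook, and the example mapping (latency classes >= 20 on the long pipe) is an assumption for illustration only.

```python
def compute_stall_cycles(lat_class0, lat_class1, raw_stall, pipe_of):
    """Stall-cap sketch: pipe 8 (long-latency / memory-texture) allows up to
    64 stall cycles; every other pipe is capped at 32."""
    LONG_PIPE = 8
    if pipe_of(lat_class0) == LONG_PIPE or pipe_of(lat_class1) == LONG_PIPE:
        return min(raw_stall, 64)
    return min(raw_stall, 32)

# Hypothetical pipe assignment: classes 20/26 (texture, global loads) -> pipe 8.
pipe_of = lambda c: 8 if c >= 20 else 2
print(compute_stall_cycles(26, 1, 100, pipe_of))  # -> 64
print(compute_stall_cycles(6, 1, 100, pipe_of))   # -> 32
```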
GEMM Pipelining (sub_92C240)
The GemmPipeliner* family of knobs controls a specialized pipelining mode for GEMM (matrix multiply) loops:
| Knob Name | Type | Default | Description |
|---|---|---|---|
| GemmPipelinerEnabled | BOOL | false | Master enable for GEMM-specific pipelining |
| GemmPipelinerPipelineDepthEnforceDeltaFull | INT | 0 | Pipeline depth adjustment for full enforcement |
| GemmPipelinerPipelineDepthEnforceDeltaPartial | INT | 0 | Pipeline depth adjustment for partial enforcement |
| GemmPipelinerDependenciesPopbl | BOOL | false | Dependency resolution policy between DMA and compute stages |
| GemmPipelinerScoreboardHashPopbl | BOOL | false | Scoreboard hash policy for GEMM barrier tracking |
| GemmPipelinerUseRegisterCalculation | INT | 0 | Use register-based calculation for pipeline depth vs. fixed |
The extended pipelining in sub_92C240 (8KB) handles GEMM-like patterns where the loop body contains WGMMA/IMMA instructions. From decompilation:
- Activation: The GEMM pipeliner activates when `code_object+48` (GEMM mode flag) is set and the pipeline context at `code_object+56` has a valid stage range.
- Stage iteration: Iterates from `context+84` (start stage) to `context+88` (end stage), with 96-byte descriptors per stage at `context+136`.
- Pipeline depth management: Uses `sub_8A4DA0` to validate stage depth and `sub_6E6650` for dynamic array resizing when pipeline depth exceeds the current allocation. Writes stage bitmasks (`1 << stage_index`) into the stage descriptor arrays.
- Hardware model: On SM90+ (Hopper), TMA supports up to 8 outstanding asynchronous copy operations. The GEMM pipeliner matches this hardware depth, staging DMA (memory) and compute (math) operations to fill the pipeline.
The DUMPIR diagnostic output includes `For Dma Loop` and `For Math Loop` sections from sub_7A4500, confirming that the pipeliner explicitly distinguishes between DMA and compute loop stages.
Other Pipelining Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| OkToPipelineNoUnroll | INT | 0 (disabled) | Allow pipelining even when unrolling was also suppressed |
| PipelineHoistCondLimit | INT | unset | Maximum condition complexity for hoisting in pipelined loops |
| PipelineHoistRRegPressureLimit | INT | unset | R-register pressure limit for hoisting inside pipelined body |
| PipelineHoistPRegPressureLimit | INT | unset | P-register pressure limit for hoisting inside pipelined body |
| PipelineMIOVQToInstRatio | DBL | unset | MIOVQ-to-instruction ratio threshold for pipeline profitability |
| PipelineMultiOutputTex | INT | 0 (disabled) | Enable pipelining of loops with multi-output texture instructions |
| PipelineSpecUsesInHeadOnly | INT | 0 (disabled) | Restrict speculative uses to loop header only |
GPU-Specific Pipeline Concerns
Warp divergence. Pipelined loops assume all threads in a warp execute the same number of iterations. If the trip count is warp-divergent, the prolog/epilog handling must account for early-exit threads. The pass checks the varying analysis (phases 53, 70) to determine divergence.
Barrier placement. Pipelined loops containing BAR.SYNC or MEMBAR instructions are checked by sub_9202D0 -- if the pipe assignment class for a barrier instruction is <= 3, the instruction is rejected from pipelining. The latency classifier (sub_91A0F0) assigns class 12 to barrier operands (opcodes 0x5B, 0x5C, 0x137), but the feasibility check rejects based on pipe class, not latency class.
Memory pipeline depth. The sub_92C240 extended pipeliner for GEMM-like loops manages the hardware memory pipeline on SM90+. It explicitly tracks DMA pipeline depth using 96-byte per-stage descriptors, resizing arrays dynamically when depth exceeds allocation. The stage descriptor at context+136 + 96*stage holds bitmask membership, latency counters, and dependency links.
Pipe class model. The 7-entry pipe class table at code_object+16 partitions the functional units into classes. The post-RA software pipelining variant (sub_8B9390) uses the same table to determine which functional unit class each instruction uses, ensuring resource conflict detection is consistent between the two pipelining layers.
Phases 35, 66, 79, 88 -- OriHoistInvariants (LICM)
Purpose
Hoists computations that produce the same result on every loop iteration out of the loop body and into the preheader. This reduces the dynamic instruction count proportionally to the trip count. The four instances are not redundant -- each targets invariants created by different intervening transformations.
Function Map
All four instances share the same core implementation:
| Function | Size | Role | Confidence |
|---|---|---|---|
| sub_C5FE00 | 34 bytes | Phase 35 execute wrapper | CERTAIN |
| sub_C5FE30 | 34 bytes | Phase 66 execute wrapper | CERTAIN |
| sub_C5FE60 | 34 bytes | Phase 79 execute wrapper | CERTAIN |
| sub_C5FE90 | 34 bytes | Phase 88 execute wrapper | CERTAIN |
| sub_7DDB50 | 156 bytes | Optimization guard: checks knob 499, block count > 2 | HIGH |
| sub_8FFDE0 | 573 bytes | HoistInvariants orchestrator: iterates blocks, queries knob 381, dispatches inner worker | HIGH |
| sub_8FF780 | 1,622 bytes | LICM inner worker: identifies and moves invariant instructions | HIGH |
| sub_8FEAC0 | 2,053 bytes | Invariance marking: forward/backward operand scan per block | HIGH |
| sub_8F76E0 | 90 bytes | Per-instruction invariance test: checks output register def-block | HIGH |
| sub_8F7770 | 810 bytes | Hoisting safety check: operand class + latency analysis | HIGH |
| sub_8F8CB0 | 658 bytes | Profitability check: budget-weighted score vs latency penalty | HIGH |
| sub_8F7DD0 | 374 bytes | Transitive invariance propagation through def-use chains | HIGH |
| sub_8F7AE0 | 558 bytes | Instruction mover: unlinks from loop, inserts at preheader | HIGH |
| sub_8FF2D0 | 1,186 bytes | Budget computation + invariant marking + hoist dispatch | HIGH |
| sub_8F8BC0 | 257 bytes | Instruction counting: header/body weight via isNoOp | HIGH |
| sub_74D720 | 353 bytes | Loop boundary analysis: barrier/jump/predecessor checks | HIGH |
| sub_74F500 | -- | Preheader location finder | MEDIUM |
| sub_7DF3A0 | 88 bytes | Opcode flags table lookup (side-effect classification) | HIGH |
| sub_7E0540 | 156 bytes | Observable side-effect checker (memory, call, barrier) | HIGH |
Execute Flow
sub_C5FExxx(phase_obj)                              // 34-byte vtable dispatch
└─ sub_8FFDE0(code_object, pass_id)                 // orchestrator
   ├─ sub_7DDB50(code_object)                       // guard: returns block count, checks knob 499
   ├─ sub_799250(allocator, "HoistInvariants", &skip)  // DUMPIR check
   └─ sub_8FF780(context)                           // per-loop LICM core
      ├─ sub_781F80                                 // rebuild instruction list
      ├─ sub_7E6090                                 // recompute register pressure
      ├─ sub_773140                                 // recompute loop depths
      ├─ sub_74D720                                 // analyze loop boundaries
      ├─ sub_74F500                                 // find preheader
      ├─ sub_7A1A90 / sub_7A1B80                    // query knob 381 per block
      └─ sub_8F7AE0                                 // move instruction to preheader
Why Four Instances?
| Phase | Pass ID (a2) | Pipeline Position | What Creates New Invariants |
|---|---|---|---|
| 35 (Early) | 0 | After GeneralOptimize (29), ExtractShaderConsts (34) | CSE eliminates redundant expressions, exposing loop-invariant results; shader constant extraction hoists uniform loads |
| 66 (Late) | 1 | After predication (63), GeneralOptimizeLate2 (65) | Predication converts conditional branches to predicated instructions; if the condition is loop-invariant, the entire predicated instruction becomes invariant |
| 79 (Late2) | 2 | After LateExpansionUnsupportedOps (78) | Late expansion splits compound operations into sequences; address computations and constant sub-expressions in expanded sequences are often invariant |
| 88 (Late3) | 3 | After FixupGmmaSequence (87) | GMMA fixup reorders/inserts instructions for wgmma hardware constraints; descriptor loads and accumulator setup become visible as invariants |
Pass ID Controls Aggressiveness
The pass_id parameter (parameter a2 of sub_8FFDE0) affects which loops are processed and how aggressively hoisting is performed. From the decompiled logic at sub_8FFDE0:
// sub_8FFDE0 lines 58-89 (simplified)
v7 = sub_7A1B80(allocator, 381, block);      // query knob 381 for this block
if (v7 == 1) {                               // knob says "inner loops only"
    if (pass_id == 1) goto hoist_block;      // Late pass: proceed
    goto skip_block;                         // Early pass: skip
}
if (v7 == 3) {                               // knob says "never"
    if (pass_id <= 1) goto handle_conservative;
    goto skip_block;
}
if (v7 == 0) {                               // knob says "always"
    if (pass_id == 0) goto hoist_aggressively;
    goto skip_block;
}
- pass_id = 0 (Early): Hoists aggressively and calls `sub_A112C0(code_object, 1)` to re-run sub-analyses afterward. This is the most aggressive instance.
- pass_id = 1 (Late): Includes inner-loop-only blocks, but skips the re-analysis call.
- pass_id >= 2 (Late2, Late3): Most conservative -- only hoists from blocks where knob 381 returns 0 (always-hoist).
Per-Block Knob 381 Policy
The LICM pass queries OCG knob 381 (sub_7A1A90 / sub_7A1B80) per basic block to determine the hoisting policy:
| Knob 381 Result | Meaning |
|---|---|
| 0 | Always hoist from this block |
| 1 | Hoist from inner loops only |
| 3 | Never hoist from this block |
This per-block granularity allows the knob system to selectively disable hoisting in specific loop nests (e.g., those known to be register-pressure-critical).
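The interaction of the per-block policy with pass_id can be captured in a small decision function. This is a sketch that mirrors the simplified decompiled logic shown earlier; the string return values are illustrative labels, not ptxas names.

```python
def should_hoist(knob381, pass_id):
    """Per-block LICM decision: combine knob 381 policy with pass_id.
    'conservative' corresponds to the handle_conservative path."""
    if knob381 == 1:                       # inner loops only
        return "hoist" if pass_id == 1 else "skip"
    if knob381 == 3:                       # never hoist from this block
        return "conservative" if pass_id <= 1 else "skip"
    if knob381 == 0:                       # always hoist
        return "aggressive" if pass_id == 0 else "skip"
    return "skip"

print(should_hoist(0, 0))  # -> aggressive
print(should_hoist(1, 1))  # -> hoist
print(should_hoist(3, 2))  # -> skip
```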
Guard Function (sub_7DDB50)
Before the LICM core runs, sub_7DDB50 (156 bytes) gates execution on two conditions:
- Knob 499 enabled. Queries the allocator vtable at +72 for OCG knob 499 (the master LICM switch). If disabled, returns 1, which causes the orchestrator to bail (since 1 <= 2).
- Rate limiter. When knob 499 is enabled, the guard checks a pair of counters at `allocator[9]+35936` (max invocations) and `allocator[9]+35940` (current count). If the current count has reached the maximum, returns 1. Otherwise it increments the counter and returns the actual basic-block count from `code_object+2104`. This bounds the number of LICM invocations for compile-time control in functions with many loops.
- Block count > 2. The orchestrator (`sub_8FFDE0`) checks the return value: if <= 2, no hoisting is attempted. Single-block functions have no loops; two-block functions have at most a trivial loop not worth processing.
LICM Invariant Detection Algorithm
The invariance detection pipeline runs inside sub_8FF2D0 (1,186 bytes), which is called from sub_8FF780 once per loop nest level. It executes five stages in sequence: budget computation, forward invariance marking, backward non-invariance marking, transitive propagation, and profitability gating.
Stage 1: Budget Computation (sub_8FF2D0)
function ComputeHoistBudget(context, block, is_simple, num_preds, hoist_mode, is_inner):
    // Base budget from knob 483 (HoistBudget)
    if QueryKnob(483):
        budget = QueryKnobValue(483)        // 0 = unlimited
    else:
        budget = 10                         // default
    // CBO budget from knob 482
    if QueryKnob(482):
        cbo_budget = QueryKnobValue(482)
    else:
        cbo_budget = (pass_id == 0) ? 22 : 100
    // Adjust by loop type and depth
    if pass_id > 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 300
    else if pass_id == 0 and is_simple:
        budget = (hoist_mode < 2) ? cbo_budget : 200
    // Conservative multiplier for Late3
    if pass_id == 3:
        budget *= 100                       // generous once decided to hoist
    // Split budget among back-edge blocks
    if hoist_mode == 3:                     // processing back-edge block
        budget /= num_preds
    // Inner-loop divisor from knob 380
    if is_inner:
        if QueryKnob(380):
            budget /= QueryKnobValue(380)
        else:
            budget /= 10
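An executable version of the budget computation makes the interactions between pass_id, loop shape, and the knob defaults easy to probe. This sketch follows the pseudocode above; `None` knob arguments stand for "knob not set", and integer division is an assumption.

```python
def hoist_budget(pass_id, is_simple, hoist_mode, is_inner, num_preds,
                 knob_483=None, knob_482=None, knob_380=None):
    """Stage 1 budget sketch: base budget, CBO override, Late3 multiplier,
    back-edge split, inner-loop divisor."""
    budget = knob_483 if knob_483 is not None else 10
    cbo = knob_482 if knob_482 is not None else (22 if pass_id == 0 else 100)
    if pass_id > 0 and is_simple:
        budget = cbo if hoist_mode < 2 else 300
    elif pass_id == 0 and is_simple:
        budget = cbo if hoist_mode < 2 else 200
    if pass_id == 3:
        budget *= 100                     # generous once Late3 decides to hoist
    if hoist_mode == 3:                   # back-edge block: split among preds
        budget //= num_preds
    if is_inner:
        budget //= knob_380 if knob_380 is not None else 10
    return budget

print(hoist_budget(pass_id=0, is_simple=True, hoist_mode=0,
                   is_inner=False, num_preds=1))   # -> 22
print(hoist_budget(pass_id=3, is_simple=True, hoist_mode=3,
                   is_inner=True, num_preds=2))    # -> 1500
```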
Stage 2: Forward Invariance Marking (sub_8FEAC0, a3=1)
The forward pass iterates every instruction in the basic block and marks each register operand's invariance status based on where it was defined.
function MarkInvariants_Forward(context, block_index):
    block = blocks[block_index]
    header_depth = context.header_depth
    loop_depth_range = [context.header_depth, context.max_depth]
    // Two code paths based on knob 934 (UseNewLoopInvariantRoutineForHoisting)
    if QueryKnob(934):
        // Advanced path: set-based computation via sub_768BF0 + sub_8F7280
        return MarkInvariants_SetBased(context, block_index)
    // Default path: single-pass scan
    for each instruction in block (linked list: block+0 .. sentinel at block+8):
        has_side_effect = isNoOp(instruction)       // vtable+1824
        opcode = instruction+72 (masked: BYTE1 &= 0xCF)
        num_operands = instruction+80
        // Special case: opcode 195 + first dst is reg class 9 (predicate)
        is_predicate_def = (opcode == 195 and dst_reg.class == 9)
        is_invariant = true
        for each operand from LAST to FIRST:        // reverse scan
            operand = instruction + 84 + 8*i
            type = (operand >> 28) & 7
            if type != 1:                           // not a register
                continue                            // immediates are always invariant
            if IsFixedRegister(operand, code_object):   // sub_7DEB90
                continue                            // e.g., RZ, PT -- always available
            if pass_id == 3:                        // Late3 extra check
                if IsSpecialRegClass(operand, code_object):   // sub_7DA2F0
                    // Exception: IADD3 (opcode 130, flag 0x1000) carry-out
                    if not (opcode == 130 and flag_0x1000 and is_penultimate_operand):
                        continue
            reg = RegisterDescriptor(code_object, operand & 0xFFFFFF)
            if reg.def_block (reg+76) == block_index:
                // Defined in THIS block -- not invariant for this loop
                is_invariant = false
            else if context.is_multi_depth:
                def_instr = reg.def_instruction (reg+56)
                if def_instr is null or reg has pinned bit:
                    handle_predicate_invariance()
                else:
                    def_block = blocks[def_instr.block_index]
                    def_depth = def_block.loop_depth (offset +144)
                    if def_depth < header_depth or def_depth > max_depth:
                        reg.use_count (reg+80) = 0  // mark as loop-external
                    else:
                        is_invariant = false
                        reg.def_block (reg+76) = block_index
            else:
                reg.use_count (reg+80) = 0          // simple loop: mark external
        // Side-effect check for the entire instruction
        flags = LookupOpcodeFlags(instruction, code_object)   // sub_7DF3A0
        if (flags & 2) != 0:                        // has memory/control side effect
            is_invariant = false
        if MemoryOverlapsLoopLiveSet(instruction):  // sub_74F5E0
            is_invariant = false
        if is_multi_depth and HasObservableSideEffects(instruction):   // sub_7E0540
            is_invariant = false
        // Mark destination operands
        for each dst_operand (bit 31 set = definition):
            if type == 1 and not pinned:
                if is_invariant:
                    reg.def_block = block_index     // mark for hoisting
                else:
                    reg.use_count += 1              // count loop-internal uses
The key insight is that invariance is determined by definition site: if every source register was defined outside the loop (or in a block already processed), the instruction is invariant. Immediates and constants are trivially invariant. The check is not purely structural -- it uses the reg+76 field which gets updated as hoisting proceeds, allowing transitive invariance discovery.
Stage 3: Backward Non-Invariance Marking (sub_8FEAC0, a3=0)
The backward pass uses the same function with a3=0. Instead of marking definitions as external, it marks operands whose definitions are inside the loop as non-invariant by setting reg.def_block = block_index. This clears any false positives from the forward pass where a register appeared invariant but its defining instruction depends on a loop-variant value.
For destination operands, the backward pass increments reg.use_count for all non-pinned register definitions, building the use-count information needed by the profitability check.
Stage 4: Transitive Invariance Propagation (sub_8F7DD0)
After the two marking passes, sub_8F7DD0 propagates invariance transitively through the instruction chain. This handles the case where instruction A is invariant and defines register R, and instruction B uses R and is otherwise invariant -- the forward pass may have marked B as non-invariant because R's definition was in the loop, but A (the definer) is itself invariant.
function PropagateInvariance(context, block_index):
    block = blocks[block_index]
    side_effect_mask = 0
    for each instruction in block:
        aliases_memory = CheckMemoryAlias(code_object, instruction)   // sub_74F5E0
        for each operand (type == 1, register):
            reg = RegisterDescriptor(operand)
            if operand is definition (bit 31 set):
                if isNoOp(instruction):
                    if IsInvariant(instruction, block_index):   // sub_8F76E0
                        side_effect_mask |= reg.flags & 0x3
                    else:
                        reg.flags |= aliases_memory ? 1 : 0
                else:
                    reg.flags |= (has_side_effect ? 1 : 0) | 2
            else:   // use
                if has_side_effect:
                    reg.def_block = block_index     // taint defining register
                else:
                    reg.use_count += 1
    return side_effect_mask
Stage 5: Profitability Check (sub_8F8CB0)
The final gate before hoisting. Computes a cost-benefit ratio and rejects hoisting if the ratio is unfavorable.
function IsProfitable(context, block_index, budget, is_hoist_safe):
    header_weight = context.header_insn_count   // from sub_8F8BC0
    body_weight = context.body_insn_count
    // Scoring weights depend on pass aggressiveness and safety
    if is_hoist_safe:
        noOp_weight = (pass_id == 0) ? 60 : 150
        real_weight = 5
    else:
        noOp_weight = (pass_id == 0) ? 12 : 30
        real_weight = 1
    score = 0
    latency_penalty = 0
    instruction_count = 0
    for each instruction in block:
        instruction_count += 1
        if IsInvariant(instruction, block_index):   // sub_8F76E0
            if isNoOp(instruction):
                score += noOp_weight
            else:
                score += 1
                for each dst_operand with scoreboard flag:
                    score += real_weight
            latency = GetLatencyClass(instruction)  // sub_91E860
            latency_penalty += (latency > 4) ? 2 : 1
        else:
            for each high-latency dst_operand:
                latency_penalty += (latency > 4) ? 2 : 1
    // Final decision: weighted score vs latency cost
    if pass_id == 0:   // aggressive
        denominator = real_weight * instruction_count
    else:
        denominator = body_weight / 3 + header_weight
    return denominator != 0 and (score * budget) / (real_weight * denominator) >= latency_penalty
The profitability check encodes a fundamental GPU tradeoff: hoisting reduces dynamic instruction count (proportional to trip count) but extends live ranges (increasing register pressure and reducing occupancy). The budget parameter, which varies by 100x between pass_id 0 and 3, controls how aggressively this tradeoff is resolved. Pass_id 0 (Early) uses the smallest denominator, making it easiest to exceed the threshold.
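The score-versus-penalty gate can be exercised with a simplified executable model. This sketch loosely follows the pseudocode above (per-operand details are collapsed into one weight per instruction), so it illustrates the shape of the decision, not the exact ptxas arithmetic; all names are invented.

```python
def is_profitable(pass_id, budget, is_hoist_safe, insns,
                  header_weight, body_weight):
    """insns: list of (invariant, is_noop, latency_class) per instruction.
    Simplified Stage 5 gate: weighted invariant score vs latency penalty."""
    noop_w = (60 if pass_id == 0 else 150) if is_hoist_safe else \
             (12 if pass_id == 0 else 30)
    real_w = 5 if is_hoist_safe else 1
    score = penalty = count = 0
    for invariant, is_noop, lat in insns:
        count += 1
        if invariant:
            score += noop_w if is_noop else 1 + real_w
        penalty += 2 if lat > 4 else 1
    # Early pass uses the smallest denominator, so it passes most easily.
    denom = real_w * count if pass_id == 0 else body_weight // 3 + header_weight
    return denom != 0 and (score * budget) // (real_w * denom) >= penalty

insns = [(True, True, 2), (True, False, 6), (False, False, 6)]
print(is_profitable(0, 22, True, insns, header_weight=30, body_weight=90))
# -> True
```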
Per-Instruction Invariance Test (sub_8F76E0)
The leaf-level invariance test used by stages 4 and 5 is a simple definition-site check:
function IsInvariant(instruction, current_block_index):
    num_operands = instruction.operand_count    // inst+80
    if num_operands == 0:
        return false
    // Find the last "interesting" operand (skip immediates/constants)
    // Immediates have type bits in the 0x70000000 range
    last_operand = scan backwards from operand[num_operands-1]
                   while (operand XOR 0x70000000) & 0x70000000 == 0
    // Check: is this a register definition outside the current block?
    if last_operand is negative (bit 31 = definition)
       and type_bits == 1 (register)
       and not pinned (byte+7 bit 0 == 0):
        reg = RegisterDescriptor(last_operand & 0xFFFFFF)
        return reg.def_block (reg+76) != current_block_index
    return false
This is the most-called function in the LICM pipeline. It checks whether an instruction's primary output register was defined outside the current block -- if so, the instruction is considered invariant (already hoisted or defined in a dominating block).
Side-Effect Blocking Rules
An instruction is blocked from hoisting if any of the following conditions hold, regardless of operand invariance:
| Check | Function | Condition |
|---|---|---|
| Memory store | sub_7DF3A0 | Flags byte bits 2-3 set and bit 5 clear |
| Memory barrier | sub_74D720 | Opcode 159 (BAR.SYNC), 32 (MEMBAR), or 271 (barrier variant) |
| Indirect jump | sub_74D720 | Opcode 236 (BRX) |
| Volatile/atomic access | sub_7DFA80 | Called from sub_7E0540; detects volatile or atomic memory |
| Function call | vtable+1456 | isBarrier() returns true |
| Texture side effect | sub_7DF3A0 | Flags byte bit 6 set with operand modifier flag |
| Address-space effect | sub_7E0540 | Opcodes 85/109 (memory ops) with (flags+20 & 2) != 0 |
The boundary analysis (sub_74D720) also produces a 5-byte result array that gates the entire loop:
| Byte | Meaning | Effect |
|---|---|---|
| 0 | Has external predecessor (outside loop depth range) | Skip loop (not a natural loop) |
| 1 | Non-header block with different nesting | Marks as complex multi-depth loop |
| 2 | Contains barrier instruction | Skip loop entirely |
| 3 | Contains indirect jump | Skip loop entirely |
| 4 | Multi-depth safety flag | AND-ed with sub_7E5120 per inner block |
Instruction Counting (sub_8F8BC0)
Before the profitability check, sub_8F8BC0 counts instructions in the loop header and body separately. It walks the instruction linked list for each block in the loop and classifies each instruction using isNoOp (vtable+1824):
- No-op instruction (scheduling placeholder, predicate set, etc.): weight 1
- Real instruction (ALU, memory, branch, etc.): weight 30
The header count is stored at context+64 and the body count at context+68. The profitability formula uses these to normalize the hoisting score: a loop with a heavy header relative to the body benefits less from hoisting.
Instruction Movement (sub_8F7AE0)
After all checks pass, sub_8F7AE0 physically moves each invariant instruction from the loop body to the preheader:
- Invariance re-check. Calls `sub_8F76E0` one final time per instruction. Instructions whose invariance status changed during the marking passes are skipped.
- Knob 484 gate. Queries the allocator for knob 484; if disabled, no movement occurs. This provides a fine-grained override separate from the loop-level knob 381.
- Preheader creation. On the first hoisted instruction, creates or locates the preheader block:
  - If the loop has an existing preheader block (`context+16` non-null): clones it via `sub_931920`, copies convergence flags from the original preheader's `offset+282 bit 3`, and links it into the CFG via `sub_8F7610`.
  - If no preheader exists: creates a new block via `sub_92E1F0` and links it.
- Unlink and reinsert. For each invariant instruction:
  - `sub_9253C0(code_object, instruction, 1)`: unlinks the instruction from the current block.
  - `sub_91E290(code_object, instruction)`: inserts at the preheader insertion point.
  - Updates the Ori instruction's control word at `instruction+32` (not the SchedNode): sets bit 1 at byte offset +13 to mark the instruction as hoisted (prevents the scheduler from reordering it back into the loop).
- Destination register tracking. For each output operand, if the defining instruction at `reg+56` differs from the current instruction, sets `context+44` (hoisted_cbo flag). For pass_id == 2, additionally sets `reg+48 bit 26` if the register class is in {2, 3, 4} (GPR classes) and the preheader has the convergence flag.
- Special IADD3 handling. For pass_id == 3, instructions with opcode 130 (`IADD3`), flag `0x1000`, and a negative byte at `+90` (carry chain) receive special treatment via `sub_9232B0`, which adjusts the carry-out register linkage before movement.
Multi-Depth Loop Handling
For loops with nesting depth > 1 (inner loops within the hoisting target), sub_8FF780 performs multiple rounds of sub_8FF2D0 calls:
- Header block. First call processes the loop header with `hoist_mode = 0`.
- Intermediate blocks. For each depth level between `header_depth+1` and `max_depth`, checks if the block's parent depth (`block+148`) matches the header depth. If the block is a back-edge predecessor of the loop header, uses `hoist_mode = 3`. Otherwise, checks a dominance bitmap at `block[25] + 4*(depth >> 5)`: if bit `(1 << depth)` is set, uses `hoist_mode = 1` (dominated); otherwise `hoist_mode = 2` (non-dominated).
- Back-edge block. Final call with `hoist_mode = 3` and the deepest back-edge block index, ensuring the budget is split among back-edge predecessors.
Multi-depth permission is gated by knob 220 (queried at allocator[9]+15840 for the fast path) and the DisableNestedHoist knob. When hoisting from an inner loop to the header of an outer loop, an additional constraint applies:
allow_nested = allow_nested_hoist AND is_simple_loop
AND body_insn_count > 1
AND num_predecessors == 1
AND body_insn_count < header_insn_count * max_iterations
This prevents hoisting from inner loops where the cost (extended live range across multiple loop levels) exceeds the benefit (reduced inner-loop dynamic count).
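The nested-hoist gate is a pure predicate over loop statistics, so it translates directly. A minimal sketch of the condition quoted above; the parameter names are illustrative.

```python
def allow_nested_hoist(nested_hoist_enabled, is_simple_loop,
                       body_insn_count, num_predecessors,
                       header_insn_count, max_iterations):
    """Gate for hoisting from an inner loop into an outer loop's header."""
    return (nested_hoist_enabled and is_simple_loop
            and body_insn_count > 1
            and num_predecessors == 1
            and body_insn_count < header_insn_count * max_iterations)

print(allow_nested_hoist(True, True, 40, 1, 30, 2))   # -> True  (40 < 60)
print(allow_nested_hoist(True, True, 80, 1, 30, 2))   # -> False (80 >= 60)
```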
LICM Outer Loop (sub_8FF780)
The complete outer driver that iterates over all loop nests:
function HoistInvariantsCore(context):
    code_object = context.code_object
    pass_id = context.pass_id
    // Read iteration limit from allocator+34632
    config_byte = allocator[34632]
    max_iterations = (config_byte == 0) ? 2
                   : (config_byte == 1) ? allocator[34640]
                   : 0   // unlimited
    allow_nested_hoist = (allocator[20016] != 0)
    // Prepare IR
    RebuildInstructionList(code_object, 1)              // sub_781F80
    RecomputeRegisterPressure(code_object, 1, 0, 0, 0)  // sub_7E6090
    RecomputeLoopDepths(code_object, 0)                 // sub_773140
    if code_object.flags[176] & 2 and pass_id > 1:
        RecomputeLoopNesting(code_object)               // sub_789280
    // Clear prior invariance markers
    for each block in instruction list:
        block.marker (offset +76) = 0xFFFFFFFF
    // Iterate from innermost loop outward (last RPO entry first)
    current = blocks[rpo[block_count]]
    while current is valid:
        if current has no predecessors or no first instruction:
            advance; continue
        // Count predecessors at >= current loop depth
        header_depth = current.loop_depth               // offset +144
        for each predecessor:
            if pred.loop_depth >= header_depth:
                num_at_depth++; track deepest back-edge index
        if num_at_depth == 0:                           // not a loop header
            advance; continue
        // Simple vs multi-depth
        if max_depth == header_depth:
            is_simple = true
        else:
            info = AnalyzeBoundaries(code_object, header_depth, max_depth)
            if has_external_pred or has_barrier or has_indirect_jump:
                advance; continue
            if !MultiDepthAllowed(knob_220):
                advance; continue
            context.is_multi_depth = true
        // Find preheader and query knob 381
        context.insert_pt = FindPreheader(code_object, current, ...)
        if !ShouldHoist(QueryKnob381(381, current), pass_id, opt_level):
            advance; continue
        // Count instruction weights
        CountInstructions(context)                      // sub_8F8BC0
        // === CORE HOISTING PIPELINE (per loop) ===
        sub_8FF2D0(context, header_block, ...)          // header block
        if context.is_multi_depth:
            for depth in (header_depth+1 .. max_depth-1):
                sub_8FF2D0(context, block_at_depth, ..., hoist_mode, ...)
            sub_8FF2D0(context, back_edge_block, ..., 3, ...)   // back-edge
        // Post-hoist cleanup
        if context.changed and current.num_back_edge_successors > 1:
            RebuildInstructionList(code_object, 0)
        advance to next loop
Hoisting Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| HoistBudget | FLOAT | 10 | Maximum number of instructions to hoist per loop (0 = unlimited) |
| HoistLoopInvBudget | FLOAT | 22 (early) / 100 (late) | Budget specifically for loop-invariant hoisting; pass_id 0 uses 22, pass_id > 0 uses 100 |
| HoistConservativeScale | INT | 10 (divisor) | Inner-loop budget divisor; budget /= scale when hoisting from inner loops |
| HoistLate | INT | per-block policy | Per-block hoisting policy (0=always, 1=inner only, 3=never) |
| HoistCBOMode | INT | 0 | Constant-buffer-object hoisting mode |
| HoistCBOLoad | INT | unset | Enable hoisting of CBO load instructions |
| HoistCBOFromLoopWithColdNest | INT | 1 (enabled) | Hoist CBO loads even from loops with cold nesting |
| HoistCBOHighCostSBInstRatioThreshold | INT | unset | Scoreboard cost threshold for CBO hoisting |
| HoistCBOLoadIDOMTravseLimit | INT | 4 | IDOM traversal limit for CBO load hoisting |
| HoistCBORRegPressureLimitApplyRate | INT | 80 | R-register pressure limit application rate (percentage) |
| HoistTexToInstRatioHigh | DBL | 0.045 | High texture-to-instruction ratio threshold for aggressive hoisting |
| HoistTexToInstRatioLow | DBL | 0.03 | Low texture-to-instruction ratio threshold for conservative hoisting |
| DisableNestedHoist | BOOL | false | Disable hoisting from nested loops (false = nested hoisting allowed) |
| NestedHoistInnerThreshold | INT | 22 / 100 | Inner loop instruction threshold for nested hoisting (same value as HoistLoopInvBudget) |
| NestedHoistOuterThreshold | INT | 10 | Outer loop instruction threshold for nested hoisting (same value as HoistBudget) |
| UseNewLoopInvariantRoutineForHoisting | BOOL | false | Use updated set-based invariance check routine (legacy single-pass is default) |
| MaxMidHeaderSizeRateForAggressiveHoist | INT | 2 | Maximum LICM iteration count (limits repeated hoisting passes) |
| EnableHoistLowLatencyInstMidBlock | BOOL | false | Hoist low-latency instructions from mid-block positions |
| MovWeightForSinkingHoisting | DBL | 0.25 | Weight for MOV instructions in sink/hoist decisions |
GPU-Specific LICM Concerns
Constant buffer loads. GPU shaders frequently load from constant buffers (LDC). These loads are loop-invariant by definition (the buffer is read-only during kernel execution). The HoistCBO* knobs control a specialized path that aggressively hoists these loads, trading register pressure for reduced memory traffic.
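The HoistCBOLoadIDOMTravseLimit knob suggests a bounded walk up the dominator tree to find where an LDC can be placed. The sketch below is an illustration of that idea only; the data structures (`idom_of`, `loop_depth_of`) and the shallower-depth stopping criterion are assumptions, not recovered code:

```python
def find_cbo_hoist_block(block, idom_of, loop_depth_of,
                         idom_traverse_limit: int = 4):  # HoistCBOLoadIDOMTravseLimit
    """Hypothetical: starting from the block containing an LDC, walk the
    immediate-dominator chain looking for a block at a shallower loop
    nesting depth, giving up after idom_traverse_limit steps."""
    start_depth = loop_depth_of[block]
    cur = block
    for _ in range(idom_traverse_limit):
        cur = idom_of.get(cur)
        if cur is None:
            return None                      # reached the entry block
        if loop_depth_of[cur] < start_depth:
            return cur                       # candidate hoist destination
    return None                              # limit exhausted
```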
Register pressure vs. occupancy. Every hoisted instruction extends its live range from the preheader through the entire loop. On GPUs, this directly reduces occupancy. The four LICM passes use increasingly conservative heuristics (controlled by pass_id) to avoid excessive register growth in later pipeline stages where register allocation is imminent.
Texture instruction hoisting. Texture fetches (TEX, TLD, TLD4) are high-latency and loop-invariant when their coordinates are loop-invariant. The HoistTexToInstRatio* knobs provide thresholds for deciding when to hoist texture instructions -- a tradeoff between reducing loop body latency and increasing preheader register pressure.
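A plausible reading of the two ratio knobs is a three-way policy keyed on the loop body's texture-to-instruction ratio. The mapping of the bands to "aggressive" / "default" / "conservative" is an assumption based on the knob descriptions:

```python
def tex_hoist_policy(num_tex_insts: int, num_body_insts: int,
                     ratio_high: float = 0.045,  # HoistTexToInstRatioHigh
                     ratio_low: float = 0.03     # HoistTexToInstRatioLow
                     ) -> str:
    """Hypothetical: classify a loop body by its TEX/TLD/TLD4 density.
    Above the high threshold, hoist aggressively; below the low
    threshold, hoist conservatively; in between, use the default policy."""
    ratio = num_tex_insts / num_body_insts
    if ratio >= ratio_high:
        return "aggressive"
    if ratio <= ratio_low:
        return "conservative"
    return "default"
```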
Phase 59 -- OriLoopFusion
Purpose
Fuses adjacent loops with compatible bounds and no inter-loop data dependencies into a single loop. This reduces loop overhead (branch, induction variable update) and creates opportunities for the scheduler to overlap instructions from the formerly separate loop bodies.
Knobs
| Knob Name | Type | Default | Description |
|---|---|---|---|
| PerformLoopFusion | INT | 0 (disabled) | Master enable for loop fusion; must be explicitly set to a nonzero value |
| PerformLoopFusionBudget | FLOAT | unset | Maximum instruction count in the fused body |
Fusion Criteria
Two adjacent loops L1 followed by L2 are candidates for fusion when:
- Same trip count. Both loops iterate the same number of times (same induction variable bounds and stride, or equivalent after normalization).
- No violated inter-loop dependencies. No flow dependence (write in L1, read in L2) that crosses iteration boundaries differently after fusion. Since both loops are sequential pre-fusion, this reduces to: L2 must not read a value written by L1 at a different iteration index.
- Compatible loop structure. Both must be single-basic-block bodies (or the fused body must remain within the PerformLoopFusionBudget instruction limit).
- No intervening barriers. No BAR.SYNC, MEMBAR, or fence instructions between the two loop bodies.
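The four criteria can be summarized as a legality predicate. This is a simplified model of the checks listed above, not ptxas's implementation: the `Loop` record and the `(value, iteration_offset)` encoding of reads are illustrative, and the dependence check reduces to "L2 may only read L1's values at the same iteration index":

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Loop:
    trip_count: int                       # normalized iteration count
    body_insts: int                       # instruction count of the body
    writes: set = field(default_factory=set)   # values defined each iteration
    reads: set = field(default_factory=set)    # (value, iteration_offset) pairs


def can_fuse(l1: Loop, l2: Loop, between_has_barrier: bool,
             budget: Optional[float]) -> bool:
    """Sketch of the fusion legality criteria for adjacent loops l1, l2."""
    if l1.trip_count != l2.trip_count:    # same trip count
        return False
    if between_has_barrier:               # no BAR.SYNC/MEMBAR/fence between
        return False
    if budget is not None and l1.body_insts + l2.body_insts > budget:
        return False                      # PerformLoopFusionBudget limit
    for value, offset in l2.reads:        # inter-loop dependence check
        if value in l1.writes and offset != 0:
            return False                  # cross-iteration flow dependence
    return True
```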
Pipeline Position Rationale
Phase 59 runs after GeneralOptimizeLate (phase 58) and before predication (phase 63). This position is chosen because:
- Late expansion (phase 55) may have split a single operation into a pair of loops (e.g., an atomic-reduce pattern becomes a compare loop followed by an exchange loop).
- After fusion, the merged loop body gives predication (phase 63) a larger basic block to work with, improving if-conversion opportunities.
- The subsequent LICM (phase 66) can hoist invariants from the fused loop that were not hoistable from either original loop individually (because they appeared in the "between-loops" region).
Loop Infrastructure Functions
Several utility functions are shared across the loop passes:
| Function | Address | Size | Purpose |
|---|---|---|---|
| sub_781F80 | 0x781F80 | -- | Rebuild instruction linked list after CFG modification |
| sub_789280 | 0x789280 | -- | Recompute loop nesting depths (called when flags[176] & 2 set) |
| sub_773140 | 0x773140 | -- | Recompute register pressure estimates |
| sub_7E6090 | 0x7E6090 | 2,614 | Create complex multi-operand instruction (used in unroll body duplication) |
| sub_7753F0 | 0x7753F0 | -- | Loop peeling setup (splits first/last iterations) |
| sub_789BE0 | 0x789BE0 | -- | Back-edge canonicalization |
| sub_74D720 | 0x74D720 | -- | Loop boundary analysis (determines header, latch, exit) |
| sub_74F500 | 0x74F500 | -- | Find preheader block for a given loop |
| sub_9253C0 | 0x9253C0 | -- | Edge splitting / preheader block insertion |
| sub_7A1A90 | 0x7A1A90 | -- | OCG knob query (boolean) |
| sub_7A1B80 | 0x7A1B80 | -- | OCG knob query (multi-valued) |
| sub_799250 | 0x799250 | -- | Named-phase DUMPIR check (string match against phase name) |
| sub_A112C0 | 0xA112C0 | -- | Trigger sub-analysis re-run (liveness, CFG refresh) |
| sub_BDEA50 | 0xBDEA50 | -- | Back-edge information printer ("bix%d -> backedge's successor BB: %d") |
Related Passes
| Phase | Name | Relationship |
|---|---|---|
| 3 | AnalyzeControlFlow | Builds the CFG, identifies loops, computes dominators -- prerequisite for all loop passes |
| 19 | OriSplitLiveRanges | Splits live ranges at loop boundaries to reduce register pressure post-simplification |
| 20 | PerformPGO | Applies profile data that informs unrolling and pipelining heuristics |
| 21 | OriStrengthReduce | Reduces induction variable strength before unrolling |
| 23 | GenerateMovPhi | Materializes phi nodes as MOV copies after unrolling changes the CFG (Ori is non-SSA) |
| 25 | StageAndFence | Inserts memory fences needed by pipelined loops |
| 56 | SpeculativeHoistComInsts | Speculatively hoists common instructions above branches (related to LICM) |
| 108 | OptimizeHotColdInLoop | Post-RA hot/cold partitioning within loop bodies |
| 138 | OriSplitHighPressureLiveRanges | Last-resort splitter when unrolling or LICM caused excessive register pressure |
Cross-References
- Pass Inventory & Ordering -- complete 159-phase table
- Strength Reduction -- phase 21, IV simplification before unrolling
- Predication -- phase 63, creates new LICM opportunities for phase 66
- GMMA/WGMMA Pipeline -- phases 85, 87, creates LICM opportunities for phase 88
- Late Legalization -- phase 78, creates LICM opportunities for phase 79
- Hot/Cold Partitioning -- phase 108, loop-interior hot/cold splitting
- Liveness Analysis -- phases 16, 33, 61, 84 -- liveness drives unroll register pressure
- Knobs System -- knob infrastructure, ROT13 encoding
- Scheduling Architecture -- pipelined loops interact with the instruction scheduler