Instruction Scheduling
Prerequisites: Familiarity with Register Allocation, NVPTX register classes, and the codegen pipeline. Understanding of the GPU execution model (warp scheduling, latency hiding) is essential.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/CodeGen/MachineScheduler.cpp (ScheduleDAGMILive) and llvm/lib/CodeGen/MachinePipeliner.cpp (Swing Modulo Scheduler) (LLVM 20.0.0). The MRPA incremental pressure tracker and Texture Group Merge pass are NVIDIA-only with no upstream equivalent.
CICC v13.0 implements three distinct scheduling subsystems: MRPA (Machine Register Pressure Analysis) for incremental pressure tracking during MCSE, a Swing Modulo Scheduling pipeliner for loop bodies, and ScheduleDAGMILive for post-RA instruction ordering. All three maintain per-register-class pressure arrays but differ in granularity and update frequency. A texture group merge pass (sub_2DDE8C0) acts as a scheduling-adjacent optimization that groups texture load instructions for hardware coalescing.
| Subsystem | Functions |
|---|---|
| MRPA incremental tracker | sub_2E5A4E0 (primary), sub_1E00370 (backend variant) |
| MachinePipeliner (SMS) | sub_3563190 |
| ScheduleDAGMILive | sub_355F610 |
| Instruction selection heuristic | sub_3557A10 |
| Texture group merge | sub_2DDE8C0 |
| Scheduling mode switch | sub_21668D0 (post-RA), sub_2165850 (pre-RA) |
MRPA: Incremental Register Pressure Tracking
MRPA (Machine Register Pressure Analysis) provides incremental register pressure tracking for the Machine Common Subexpression Elimination (MCSE) pass. Rather than recomputing pressure from scratch after each instruction move or elimination, MRPA applies delta updates to maintain a running pressure state.
The primary implementation lives at sub_2E5A4E0 (48KB), with a backend variant at sub_1E00370 (78KB). Both use DenseMap hash tables for per-instruction pressure data with the hash function (ptr >> 9) ^ (ptr >> 4), empty sentinel -8, tombstone sentinel -16, minimum 64 buckets, and power-of-two sizing. The sub_1E00370 backend variant calls the pressure computation core at sub_1DF7390 (8 call sites) and sub_1DFB9D0 (6 call sites), plus pressure set queries via sub_1E1C690 / sub_1E15D60.
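A minimal sketch of that DenseMap scheme in Python (illustrative only: the class name, linear probing, and separate value storage are our assumptions -- LLVM's real DenseMap probes quadratically and stores key/value pairs inline):

```python
EMPTY = -8 & 0xFFFFFFFFFFFFFFFF       # empty-bucket sentinel (-8)
TOMBSTONE = -16 & 0xFFFFFFFFFFFFFFFF  # tombstone sentinel (-16)

def ptr_hash(ptr: int) -> int:
    """Hash function used by the MRPA DenseMaps: (ptr >> 9) ^ (ptr >> 4)."""
    return ((ptr >> 9) ^ (ptr >> 4)) & 0xFFFFFFFFFFFFFFFF

class PressureMap:
    def __init__(self, capacity: int = 64):         # minimum 64 buckets
        assert (capacity & (capacity - 1)) == 0     # power-of-two sizing
        self.keys = [EMPTY] * capacity
        self.vals = [None] * capacity

    def _probe(self, key: int) -> int:
        mask = len(self.keys) - 1
        i = ptr_hash(key) & mask
        while self.keys[i] not in (key, EMPTY):     # linear probe (sketch)
            i = (i + 1) & mask
        return i

    def insert(self, key: int, value) -> None:
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, value

    def lookup(self, key: int):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None
```

The sentinel values sit far above any plausible heap pointer, so they never collide with real keys.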
The MRPA pressure cluster spans the address range 0x1DF0000--0x1E0FFFF:
| Function | Role |
|---|---|
| sub_1DF3D00 | Scheduler support (lowest address) |
| sub_1DF4120 | Scheduler support |
| sub_1DF4FB0 | Scheduler support |
| sub_1DF5810 | Machine function pass (pressure-aware scheduling) |
| sub_1DF7390 | Pressure computation core (called 8x from sub_1E00370) |
| sub_1DF76E0 | Register liveness query |
| sub_1DF7A80 | Code motion feasibility check |
| sub_1DF81C0 | Pressure computation core |
| sub_1DF9E90 | Schedule optimization pass |
| sub_1DFB810 | DenseMap (64-bit value variant) |
| sub_1DFB9D0 | DenseMap (32-bit value variant, called 6x) |
| sub_1E00370 | MRPA entry -- backend variant |
Incremental Update Flow
The incremental update is the core algorithm. Rather than performing a full O(n) pressure recomputation after every MCSE transform, it maintains a running pressure state through delta operations. The pseudocode below is reconstructed from sub_2E5A4E0:
function mrpa_incremental_update(context, basicBlock):
    // Phase 1: Build worklist via DFS
    visited = DenseSet()                        // v292--v295
    worklist = []
    dfs_push(worklist, basicBlock, visited)     // standard DFS seed
    while worklist is not empty:
        bb = worklist.pop()
        // Phase 2: Create instruction tracking entries
        tracking = context.densemap[+80..+104]  // DenseMap at context offsets
        for each instr in bb.instructions:
            tracking.insert(instr, PressureEntry{})
            // Phase 3: Filter schedulable instructions
            if not sub_2E501D0(instr):          // schedulability predicate
                continue
            // Phase 4: Scan operands (40-byte entries)
            for i in range(instr.num_operands): // iterated at v69/v70
                operand = instr.operand[i]      // 40-byte stride
                if not operand.isVirtualRegister():
                    continue
                // Phase 5: Virtual register operand processing
                old_reg = sub_2EBEF70(operand)  // find existing rename mapping
                reg_info = sub_2EBEE10(operand) // query register class, constraints
                new_reg = sub_2EBE820(operand)  // attempt rename if profitable
                if new_reg != old_reg:
                    sub_2EBF120(old_reg)        // free old register
                // Phase 6: Register class constraint validation
                sub_reg_list = sub_E922F0(reg_info)  // sub-register list for class
                for each sub_reg in sub_reg_list:
                    validate_class_constraint(sub_reg, context.class_limits)
            // Phase 7: Pressure feasibility check
            bb_pressure = context.per_bb_data[bb]    // at v279[36]
            if not sub_2E4F9C0(bb_pressure):         // exceeds class limits?
                // Rename was unprofitable -- roll back
                context.rename_count_fail++          // *((_DWORD*)v254 + 17)
                sub_2E88E20(instr)                   // erase unprofitable instruction
            else:
                context.rename_count_success++       // *((_DWORD*)v254 + 16)
The key insight is that steps 5--7 form a speculative rename-then-validate loop: MRPA tentatively renames a virtual register, checks whether the rename reduces pressure below the class limit, and rolls back if it does not. The rename counts at *((_DWORD*)v254 + 16) (success) and *((_DWORD*)v254 + 17) (failure) provide a diagnostic ratio of how often speculative renames succeed.
Register Liveness Queries
Register liveness (sub_1DF76E0) checks whether a register is live in an instruction range [a3, a4] using _bittest on register class bitmaps. A compressed alias table at context offset +240 stores sub-register overlap information in 24-byte entries containing alias counts and alias data offsets.
The alias table structure:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Sub-table pointer |
| +8 | 56 | Alias data block |
| +8..+10 (per entry) | 2 | Alias count (uint16) |
| +10.. | variable | Alias register IDs (2 bytes each) |
Sub-register overlap is resolved through an incremental alias walk: for each register in the query range, the alias table is consulted to expand the register into its physical sub-registers, and each sub-register is tested against the liveness bitmap.
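In Python, that alias walk might look like the following (a sketch only: the table and bitmap representations are simplified stand-ins for the 24-byte alias entries and per-class bitmaps described above):

```python
def is_live_in_range(liveness_bitmaps, alias_table, reg, lo, hi):
    """Check whether `reg` (or any physical sub-register aliasing it) is
    live anywhere in the instruction range [lo, hi].

    alias_table maps a register to (alias_count, alias_ids), mirroring
    the count-prefixed alias entries; liveness_bitmaps[pos] is a
    per-position register bitmap (the _bittest target).
    """
    count, alias_ids = alias_table[reg]
    for sub in alias_ids[:count]:                    # incremental alias walk
        for pos in range(lo, hi + 1):
            if (liveness_bitmaps[pos] >> sub) & 1:   # the _bittest analogue
                return True
    return False
```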
Code Motion Feasibility
Code motion feasibility (sub_1DF7A80) validates whether an instruction can be moved between basic blocks:
- Check single-predecessor relationship between source and destination BBs.
- Validate against the allocation bitmask at allocator offset +38.
- Walk an instruction window bounded by offset +296 (configurable window size).
- Count conflicting operands within the window.
- Track affected registers in an rb-tree set (offsets 56--88) with node structure [left(16), right(24), value(32)].
An instruction is movable only if the conflicting operand count within the window is zero and the allocation bitmask permits the move.
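A hedged sketch of the overall predicate (Python; the Instr type, the boolean single-predecessor argument, and the exact conflict rule are our simplifications of the checks listed above):

```python
from dataclasses import dataclass

@dataclass
class Instr:
    uses: list
    defs: list
    reg_class: int = 0

def can_move(instr, has_single_pred, alloc_mask, window):
    """Code-motion feasibility in the spirit of sub_1DF7A80:
    single-predecessor CFG shape, allocation-bitmask permission,
    and a conflict-free instruction window."""
    if not has_single_pred:                        # CFG shape check
        return False
    if not (alloc_mask >> instr.reg_class) & 1:    # bitmask permits this class?
        return False
    touched = set(instr.uses) | set(instr.defs)
    conflicts = 0
    for other in window:                           # scan bounded window
        if touched & set(other.defs) or set(other.uses) & set(instr.defs):
            conflicts += 1
    return conflicts == 0                          # movable iff zero conflicts
```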
MRPA Verification
A debug-only verification path checks incremental update correctness against full recomputation. The trigger path in sub_2E5A4E0 (decompiled lines 1702--1708):
if ( *(_BYTE *)(v7 + 40) // [1] context enable flag -- always ON during MCSE
&& (_BYTE)qword_501F8A8 // [2] verify-update-mcse -- user must enable
&& (_BYTE)qword_501F988 // [3] incremental-update-mcse -- default ON
&& !sub_2E59B70( // [4] full recomputation DISAGREES
*(_QWORD*)(v7+48),
qword_501F7C8, ...) )
{
sub_C64ED0("Incorrect RP info from incremental MRPA update\n", 1u);
}
All four conditions must hold simultaneously:
- Context enable flag (v7 + 40) is set -- always true during MCSE.
- verify-update-mcse is ON -- user must explicitly enable this debug knob.
- incremental-update-mcse is ON -- default is ON.
- sub_2E59B70 returns false -- full recomputation disagrees with the incremental state.
When all conditions hold, the error "Incorrect RP info from incremental MRPA update" fires via sub_C64ED0 (LLVM's report_fatal_error). The print-verify knob controls whether detailed per-register-class mismatch data is printed.
The backend variant (sub_1E00370, decompiled lines 2416--2420) uses byte_4FC6020 as its guard flag, calls sub_1DFF720 for verification, and falls back to byte_4FC62C0 (a cached result) if verification is disabled.
| Knob | Default | Description |
|---|---|---|
| incremental-update-mcse | true | Incrementally update register pressure analysis |
| verify-update-mcse | false | Verify incremental update by full RP analysis |
| print-verify | false | Print problematic RP info if verification failed |
To trigger verification: cicc -Xcuda -verify-update-mcse input.cu. NVIDIA keeps this check off by default since the full rescan is O(n) and expensive.
MachinePipeliner: Swing Modulo Scheduling
Complexity. Let N = the number of instructions in the loop body and E = the number of dependency edges in the DDG.
- DDG construction is O(N + E).
- RecMII computation (computeRecMII) finds the maximum cycle ratio by enumerating elementary circuits in the DDG -- worst-case exponential, but bounded in practice by small loop sizes (N < 100) and sparse dependency graphs.
- ResMII computation is O(N) (sum of resource vectors).
- ASAP/ALAP computation is O(N + E) each (topological traversals).
- The II search probes at most pipeliner-ii-search-range (default 10) candidate IIs. For each II, node placement is O(N * II): each of N nodes probes up to II cycle slots.
- The total scheduling cost is O((N + E) + R * N * II_max), where R is the search range. pipeliner-max-stages (default 3) and pipeliner-max-mii (default 27) provide additional constant-factor bounds.
- For MRPA, the incremental pressure update is O(1) per instruction move (delta update), versus O(N) for a full recomputation -- the key efficiency gain over a naive approach.
The MachinePipeliner (sub_3563190, ~2030 decompiled lines, ~58KB) implements Swing Modulo Scheduling (SMS) for software pipelining of loop bodies. It overlaps iterations of a loop body to improve throughput on pipelined hardware by interleaving instructions from different iterations. The upstream LLVM equivalent is SwingSchedulerDAG::schedule().
Pass discovery: the pipeliner walks an analysis array at this+3456 (offset 3456) looking for vtable unk_4F86530 (the MachinePipeliner analysis pass), then extracts the SwingSchedulerDAG context at offset +176.
Phase 1: Initialization and DDG Construction
The setup chain builds the data dependence graph and computes MII lower bounds:
| Step | Function | Description |
|---|---|---|
| 1 | sub_2F97F60 | initializeDAG -- build data dependence graph (DDG) over the single-BB loop body |
| 2 | sub_3559990 | computeNodeLatencies -- fill latency fields per SUnit from the target scheduling model |
| 3 | sub_3542B20 | addDependencies -- add register/memory/order dependency edges to the DDG |
| 4 | sub_2F90200 | updateRegPressure -- compute initial register pressure state for the loop body |
| 5 | sub_354CBB0 | computeRecMII -- find the maximum cycle length of any recurrence in the DDG |
| 6 | sub_35449F0 | computeResMII -- compute ceil(total_resource_usage / functional_unit_count) |
The context object SwingSchedulerDAG occupies approximately 4100 bytes:
| Offset | Field |
|---|---|
| +32 | MachineFunction* |
| +48..56 | BB range (iterated at 256-byte stride) |
| +944 | DenseMap: pre-existing ordering constraints |
| +3456 | Analysis pass vector |
| +3472 | MII (int32) |
| +3480 | schedulingSucceeded (bool) |
| +3488 | DiagnosticsEngine / remark context |
| +3520 | TargetSubtargetInfo* |
| +3944..3952 | DDG node storage (vector) |
| +4016..4072 | Recurrence DenseMap (24-byte entries) |
Phase 2: MII Computation and II Search
MII computation combines two lower bounds:
- RecMII (Recurrence MII): the longest cycle in the DDG, computed by sub_354CBB0. Each recurrence (loop-carried dependency cycle) constrains the minimum II because the cycle must fit within one initiation interval. If pipeliner-ignore-recmii is set, RecMII is forced to zero so only resource constraints matter.
- ResMII (Resource MII): ceil(sum of resource usage across all instructions / number of available functional units), computed by sub_35449F0. This reflects the throughput bottleneck of the hardware.
function compute_MII():
    recMII = sub_354CBB0()             // max recurrence length
    resMII = sub_35449F0()             // resource throughput limit
    if pipeliner-ignore-recmii:        // qword_503E888
        recMII = 0
    MII = sub_3542AB0(resMII, recMII)  // max(resMII, recMII)
    sub_3542AE0()                      // store MII at this+3472
    return MII
The II search algorithm starts at MII and probes upward:
function ii_search(MII):
    max_ii = MII + pipeliner-ii-search-range     // default: MII + 10
    if pipeliner-force-ii != 0:                  // qword_503EB80
        return try_schedule(pipeliner-force-ii)  // skip search entirely
    for II = MII to max_ii:
        // 1. Compute ASAP/ALAP at this II
        asap = compute_ASAP(DDG, II)             // sub_354BFF0 -> v369
        alap = compute_ALAP(DDG, II)             // sub_354BFF0 -> v373
        // 2. Place all nodes into II-wide modulo reservation table
        success = place_nodes(asap, alap, II)    // sub_354C3A0
        if not success:
            continue                             // try next II
        // 3. Compute stage count
        numStages = (lastCycle - firstCycle) / II    // (v84 - v80) / v88
        // 4. Validate stage count
        if numStages > pipeliner-max-stages:     // default 3
            continue
        // 5. Register pressure check (if enabled)
        if pipeliner-register-pressure:          // qword_503E2C0
            if not verify_pressure(II, pipeliner-register-pressure-margin):
                continue                         // sub_355C7C0
        return (II, schedule)
    return FAILURE                               // "Unable to find schedule"
The pipeliner-force-ii knob (default 0) bypasses the search entirely and forces a specific II value. This is useful for testing or when the compiler team knows the optimal II for a specific loop shape.
Phase 3: ASAP/ALAP Computation
ASAP (As Soon As Possible) and ALAP (As Late As Possible) define the scheduling window for each instruction at a given II:
ASAP computation (sub_354BFF0, first invocation producing v369): traverses the DDG in topological order. For each node, ASAP = max over all predecessors of (predecessor.ASAP + edge.latency). The root nodes (no predecessors) have ASAP = 0. This gives the earliest cycle each instruction can execute without violating data dependencies.
ALAP computation (sub_354BFF0, second invocation producing v373): traverses the DDG in reverse topological order. For each node, ALAP = min over all successors of (successor.ALAP - edge.latency). Leaf nodes (no successors) have ALAP = II - 1 (or the schedule length bound). This gives the latest cycle an instruction can execute.
The scheduling window for instruction i is [ASAP(i), ALAP(i)]. Instructions with narrow windows (ASAP close to ALAP) are more constrained and are typically scheduled first by the node ordering heuristic.
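The two traversals reduce to a forward and a backward sweep over the dependence edges (Python sketch; it assumes the edge list is already topologically sorted, so a single pass in each direction suffices):

```python
def compute_asap_alap(nodes, edges, sched_len):
    """ASAP/ALAP over a DDG.

    edges is a list of (src, dst, latency) triples, assumed sorted so
    every edge's src precedes its dst topologically. sched_len bounds
    the schedule (leaf ALAP = sched_len - 1).
    """
    asap = {n: 0 for n in nodes}                 # roots start at cycle 0
    for src, dst, lat in edges:                  # forward (topological) pass
        asap[dst] = max(asap[dst], asap[src] + lat)
    alap = {n: sched_len - 1 for n in nodes}     # leaves end at the bound
    for src, dst, lat in reversed(edges):        # backward pass
        alap[src] = min(alap[src], alap[dst] - lat)
    return asap, alap
```

For a chain a -(2)-> b -(1)-> c with a schedule length of 5, the windows come out as a:[0,1], b:[2,3], c:[3,4] -- each node's freedom is ALAP - ASAP.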
Phase 4: Node Placement
Node placement (sub_354C3A0) attempts to assign each instruction to a specific cycle in the modulo reservation table (MRT). The MRT has II columns (one per cycle in the initiation interval) and tracks resource usage per cycle.
The placement algorithm follows the Swing Modulo Scheduling strategy:
- Node ordering (sub_35630A0): nodes are prioritized by a combination of critical-path depth, recurrence membership, and scheduling freedom (ALAP - ASAP). Nodes in tight recurrences and on the critical path are placed first.
- Direction selection: for each node, the scheduler decides whether to place it "forward" (from ASAP toward ALAP) or "backward" (from ALAP toward ASAP) based on its dependency relationships. The "swing" refers to alternating direction between predecessor-constrained and successor-constrained nodes.
- Cycle probing: starting from the preferred direction, the scheduler tries each cycle in the node's [ASAP, ALAP] window. At each candidate cycle, it checks resource availability in the MRT (the cycle modulo II must have sufficient functional unit capacity) and verifies that all dependency constraints remain satisfied.
- Conflict resolution: if no cycle in the window is feasible, placement fails for this II and the search continues with II+1.
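The cycle-probing step is a small loop over the modulo reservation table (illustrative sketch; a single resource class is assumed here, whereas the real MRT tracks one usage vector per functional-unit kind):

```python
def place_node(mrt, II, asap, alap, units_per_slot):
    """Probe each cycle in the node's [ASAP, ALAP] window; a cycle is
    feasible when its modulo slot still has functional-unit capacity.

    mrt is a list of II usage counters, one per modulo slot.
    Returns the chosen cycle, or None if placement fails at this II.
    """
    for cycle in range(asap, alap + 1):
        slot = cycle % II                  # column in the modulo table
        if mrt[slot] < units_per_slot:
            mrt[slot] += 1                 # reserve a unit in this slot
            return cycle
    return None                            # no feasible cycle: try II + 1
```

With II = 2 and one unit per slot, the third placement into the same window fails, which is exactly the condition that bumps the search to II + 1.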
Phase 5: Kernel Generation
After a valid schedule is found, the pipeliner builds the kernel, prolog, and epilog. The numStages value ((lastCycle - firstCycle) / II) determines how many iterations overlap.
function build_kernel(schedule, II, numStages):
    // Build instruction-to-stage and instruction-to-cycle DenseMaps
    instrToStage = DenseMap<SUnit*, int>()  // v317/v318/v319
    instrToCycle = DenseMap<SUnit*, int>()  // v320/v321/v322
    // DenseMap config: hash=(key>>9)^(key>>4), empty=-4096, tombstone=-8192
    for stage in range(numStages):
        for each SUnit in schedule.stage_bundle(stage):
            instrToStage[SUnit] = stage
            instrToCycle[SUnit] = SUnit.assigned_cycle
    // Cross-reference recurrence edges with stage assignments
    if this+4064 (recurrence count) != 0:
        for each recurrence_edge in this+4056:
            edge.stage = instrToStage[edge.instruction]
        // Build per-recurrence analysis DenseMap (24-byte entries)
    // Select codegen backend (priority order):
    if pipeliner-annotate-for-testing:   // testing mode: annotate only
        sub_359AD80(schedule)
        return
    if pipeliner-experimental-cg:        // peeling code generator
        if numStages == 0:
            sub_35A5710()                // trivial kernel (no overlap)
        else:
            sub_35A93B0()                // experimental peeling CG
        sub_3598EB0()                    // finalize prolog/epilog
        return
    if pipeliner-mve-cg:                 // MVE code generator (DEFAULT)
        if numStages == 0 and target_supports_mve():
            sub_35A7730()                // MVE compatibility check
            sub_35A76E0()                // MVE code generator
            return
        // else fall through to experimental CG
    // Default fallthrough: experimental CG path
The codegen backend priority is: (1) pipeliner-annotate-for-testing for test infrastructure, (2) pipeliner-experimental-cg for peeling-based generation, (3) pipeliner-mve-cg (default enabled) for the MVE (Modulo Variable Expansion) code generator. The MVE path is gated on numStages == 0 and a target callback at **(this+3520)+72 returning non-default (i.e., not sub_2FDC510).
The SBO (Small Buffer Optimization) pattern is used for nodeInfo arrays: v416 = v418 (inline buffer of 704 bytes = 8 nodes x 88 bytes). When the loop body exceeds 8 instructions, sub_35498F0 sorts and possibly heap-allocates.
Error Conditions
| Condition | Diagnostic | Severity |
|---|---|---|
| MII == 0 | "Invalid Minimal Initiation Interval: 0" | 0x15 (missed) |
| MII > pipeliner-max-mii | "Minimal Initiation Interval too large: MII > SwpMaxMii" | 0x15 (missed) |
| Scheduling failure | "Unable to find schedule" | 0x15 (missed) |
| numStages == 0 | "No need to pipeline - no overlapped iterations in schedule." | 0x15 (missed) |
| numStages > pipeliner-max-stages | "Too many stages in schedule: numStages > SwpMaxStages" | 0x15 (missed) |
| Success | "Pipelined succesfully!" [sic] | 0x13 (passed) |
The typo "succesfully" (single 's') is preserved from upstream LLVM.
Pipeliner Knobs
| Knob | Global | Default | Description |
|---|---|---|---|
| enable-pipeliner | unk_503EE20 | true | Master switch for SMS |
| enable-pipeliner-opt-size | qword_503ED40 | false | Enable SWP at -Os |
| pipeliner-max-mii | qword_503ECE8 | 27 | Maximum allowed MII |
| pipeliner-force-ii | qword_503EB80 | 0 | Force specific II (0 = auto) |
| pipeliner-max-stages | qword_503EB28 | 3 | Maximum pipeline stages |
| pipeliner-prune-deps | qword_503E9C0 | true | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | qword_503E8E0 | true | Prune loop-carried order deps |
| pipeliner-ignore-recmii | qword_503E888 | false | Ignore RecMII (hidden knob) |
| pipeliner-show-mask | qword_503E720 | false | Debug: show scheduling mask |
| pipeliner-dbg-res | qword_503E640 | false | Debug: resource usage |
| pipeliner-annotate-for-testing | qword_503E5E8 | false | Annotate instead of codegen |
| pipeliner-experimental-cg | qword_503E508 | false | Use peeling code generator |
| pipeliner-ii-search-range | qword_503E3A0 | 10 | Range to search for II |
| pipeliner-register-pressure | qword_503E2C0 | false | Consider register pressure |
| pipeliner-register-pressure-margin | qword_503E1E0 | 5 | Margin % for reg pressure limit |
| pipeliner-mve-cg | unk_503E100 | true | Use MVE code generator |
| pipeliner-enable-copytophi | qword_503E020 | true | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | qword_503DF40 | 0 | Force issue width (0 = auto) |
All registered in ctor_676_0_0x5a3430.c.
MachinePipeliner Function Map
| Function | Identity |
|---|---|
| sub_3563190 | Top-level SMS orchestrator (SwingSchedulerDAG::schedule) |
| sub_2F97F60 | initializeDAG -- build DDG |
| sub_3559990 | computeNodeLatencies |
| sub_3542B20 | addDependencies -- register/memory/order edges |
| sub_2F90200 | updateRegPressure |
| sub_354CBB0 | computeRecMII |
| sub_35449F0 | computeResMII |
| sub_3542AB0 | setMII = max(ResMII, RecMII) |
| sub_3542AE0 | validateMII / store at +3472 |
| sub_3556270 | collectNodeInfo -- gather 88-byte per-node records |
| sub_35476E0 | initNodeOrder -- compute scheduling order |
| sub_35523F0 | computeSchedule -- build SUnit ordering |
| sub_35546F0 | orderDependences -- topological sort |
| sub_3543340 | computeStart -- ASAP/ALAP times |
| sub_35630A0 | normalizeSchedule -- adjust cycle numbering |
| sub_35568E0 | scheduleNodes -- core SMS placement |
| sub_35433F0 | adjustSchedule -- post-adjustment |
| sub_3557A10 | computeFinalSchedule -- finalize stage/cycle |
| sub_354A760 | buildStageMap -- iteration-to-stage mapping |
| sub_355F610 | schedule() -- II search loop (2351 lines) |
| sub_354BE50 | getScheduleForStage |
| sub_35498F0 | sortNodeInfo (for >8 nodes) |
| sub_359AD80 | annotateForTesting |
| sub_35A5710 | generateTrivialKernel |
| sub_35A93B0 | experimentalPeelingCG |
| sub_3598EB0 | finalizeExperimentalKernel |
| sub_35A76E0 | mveCG -- MVE code generator |
| sub_35A7730 | mveCompatCheck |
ScheduleDAGMILive: Post-RA Instruction Ordering
ScheduleDAGMILive (sub_355F610, 64KB) is the post-RA machine instruction scheduler. It takes the pipeliner's output (or standalone scheduling regions) and determines the final instruction order while respecting register pressure limits.
Data structures:
- SUnit (Scheduling Unit): 88 bytes per instruction, consistent across both the pipeliner and ScheduleDAGMILive.
- Instruction-to-node hash map: 632-byte entries per instruction. The unusually large entry size suggests extensive caching of per-instruction metadata (RP deltas, latency info, dependency edges) to avoid recomputation.
- RP tracking structure: 112 bytes, with per-register-class pressure arrays at offsets 32--48 (current) and 56--72 (limits).
The scheduling flow:
- Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set).
- Set per-class pressure defaults via sub_2F60A40.
- Walk the BB instruction list and build the instruction-to-node hash map.
- Compute ASAP (earliest cycle) via sub_354BFF0 -> v369.
- Compute ALAP (latest cycle) via sub_354BFF0 -> v373.
- Place instructions via sub_354C3A0 (returns success/failure).
- Calculate stage count: (lastCycle - firstCycle) / II = (v84 - v80) / v88.
- Verify placement via sub_355C7C0.
- Build stage descriptors via sub_355D7E0 (80 bytes per stage, 10 QWORDs each).
Instruction Selection Heuristic
The instruction selection heuristic (sub_3557A10, 47KB) determines which instruction to schedule next from the ready set. It implements a multi-level priority scheme operating on 88-byte SUnit entries:
Level 1 -- Latency/Depth priority (SUnit offset +240): instructions deeper in the dependency graph are scheduled first. Depth is measured as the longest path from the instruction to a sink node in the DDG. This ensures that critical-path instructions are placed early, preventing them from becoming bottlenecks. Latency recomputation occurs via sub_2F8F5D0 during priority comparison to account for any scheduling decisions already made.
Level 2 -- Target priority table (context a1+3944): a table of 16-byte entries, each containing:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | start -- first cycle of priority window |
| +4 | 4 | end -- last cycle of priority window |
| +8 | 4 | priority -- target-assigned priority value |
| +12 | 4 | window_width -- scheduling window size |
The target (NVPTX backend) populates this table to express hardware-specific ordering preferences -- for example, prioritizing memory operations that can be overlapped with computation, or ensuring that warp-synchronous instructions are scheduled in specific relative positions. Instructions that fall within a priority window with a higher priority value are selected first.
Level 3 -- Schedule window width: when levels 1 and 2 are tied, the instruction with the narrower scheduling window (ALAP - ASAP) is preferred. Narrower windows mean fewer legal placement options, so these instructions should be placed before more flexible ones to avoid creating conflicts.
The ready queue is managed by sub_3553D90. Pattern matching on ready instructions proceeds through sub_35540D0 (applicability check) and sub_35543E0 (pattern application), with validation via sub_3546B80. A hash table at a1+3976 maps instructions to schedule nodes for O(1) lookup during priority comparison.
function select_next_instruction(ready_set):
    best = null
    for each candidate in ready_set:
        if best is null:
            best = candidate
            continue
        // Level 1: depth comparison
        if candidate.depth > best.depth:   // offset +240
            best = candidate
            continue
        if candidate.depth < best.depth:
            continue
        // Level 2: target priority table
        cand_prio = lookup_target_priority(candidate, priority_table)
        best_prio = lookup_target_priority(best, priority_table)
        if cand_prio > best_prio:
            best = candidate
            continue
        if cand_prio < best_prio:
            continue
        // Level 3: prefer narrower window
        cand_width = candidate.ALAP - candidate.ASAP
        best_width = best.ALAP - best.ASAP
        if cand_width < best_width:
            best = candidate
    return best
Texture Group Merge
The Texture Group Merge pass (sub_2DDE8C0, 74KB, 2382 decompiled lines) groups texture load instructions that access related memory locations, enabling the hardware texture unit to coalesce them into fewer requests. This is an NVIDIA-specific pass not present in upstream LLVM.
Fibonacci Hashing
The pass uses Fibonacci hashing for candidate bucketing:
hash = (ptr * 0xBF58476D1CE4E5B9) >> shift
The constant 0xBF58476D1CE4E5B9 is best known as the first mixing multiplier of the splitmix64 generator; it plays the same multiplicative-hashing role as the classic Fibonacci multiplier 2^64 / phi = 0x9E3779B97F4A7C15, where phi = (1 + sqrt(5)) / 2 is the golden ratio. Multiplicative hashing of this kind provides near-optimal distribution for pointer-based keys because an irrational-ratio multiplier spreads consecutive multiples maximally across the output range. Constants of this family appear in:
- the Linux kernel's hash_64() in include/linux/hash.h
- LLVM's FoldingSet and DenseMap internals
- CICC's SCEV expression uniquing (sub_DC2B70)
- CICC's OpenMP SPMD region hash table
The shift parameter controls how many high bits are retained, effectively determining the number of hash buckets as 2^(64 - shift). For a 1024-bucket table, shift would be 54.
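The bucketing scheme is easy to reproduce (Python sketch; only the 64-bit multiplier and the shift convention are taken from the binary, the function name is ours):

```python
MULT = 0xBF58476D1CE4E5B9   # 64-bit multiplier observed in the binary
MASK64 = (1 << 64) - 1      # emulate 64-bit wraparound in Python

def bucket(ptr: int, shift: int) -> int:
    """Multiplicative hashing: multiply mod 2**64, keep the high bits.
    The shift determines the bucket count as 2**(64 - shift)."""
    return ((ptr * MULT) & MASK64) >> shift
```

With shift = 54 every result lands in [0, 1024), matching the 1024-bucket example above.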
Algorithm Detail
- Walk the BB instruction list.
- For each instruction, call sub_2DDC600 (candidate identification) to determine if it is a texture load eligible for merging.
- Hash the candidate's key pointer using Fibonacci hashing to assign it to a bucket.
- Insert the candidate into the group table.
Group table entries are 56 bytes (7 QWORDs):
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key pointer (texture base address or descriptor) |
| +8 | 8 | Data pointer (to member array) |
| +16 | 4 | Member count |
| +20 | 4 | Member capacity |
| +24 | 32 | Reserved / padding |
Group members are 32 bytes each:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | MachineInstr* -- the texture load instruction |
| +8 | 8 | Symbol -- the texture symbol reference |
| +16 | 8 | Debug info -- source location |
| +24 | 8 | Scope info -- DWARF scope |
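Under the stated layout, a member record could be decoded as follows (hypothetical Python; the field names are ours, not CICC's):

```python
import struct

# Decode one 32-byte group-member record: four little-endian QWORDs,
# matching the layout in the table above.
MEMBER_FMT = "<4Q"

def decode_member(raw: bytes) -> dict:
    instr, symbol, debug, scope = struct.unpack(MEMBER_FMT, raw)
    return {"instr": instr, "symbol": symbol, "debug": debug, "scope": scope}
```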
Generated group names carry a .Tgm (Texture Group Merge) suffix via sub_2241490. This suffix appears in debug output and internal symbol tables.
4-Callback Framework
The pass operates through a general instruction grouper framework (sub_3147BA0) that supports multiple types of instruction grouping through a common callback interface. Four callbacks are registered for texture group merge:
| # | Callback | Function | Purpose |
|---|---|---|---|
| 1 | Candidate identification | sub_2DDC600 | Examines each MachineInstr and returns true if it is a texture load eligible for grouping. Checks opcode, address space (texture memory), and operand constraints. |
| 2 | Group formation | sub_2DDBF40 | After candidates are identified and hashed into buckets, this callback decides which candidates within a bucket should form a group. It checks address proximity, common base registers, and compatible access patterns. |
| 3 | Merge execution | sub_2DDB3F0 | Applies the actual merge transformation. Replaces individual texture loads with a single grouped load instruction, rewrites operands, and updates dependency edges. |
| 4 | Cleanup | sub_2DDB400 | Frees temporary data structures (group tables, member arrays, hash buckets) after merging is complete. |
Additional helper functions in the texture group merge:
| Function | Role |
|---|---|
| sub_2DDD850 | Node insertion into group table |
| sub_2DDDD70 | Resize/grow scheduling data |
| sub_2DDD530 | Scheduling iteration over groups |
| sub_2DDDAB0 | Node analysis (profitability check) |
| sub_2DDB710 | Data dependency edge creation |
| sub_2DDE490 | Grouping operation (merge groups) |
| sub_2DDBC50 | Constraint application |
| sub_2DDBBA0 | Constraint application (secondary) |
| sub_2DDBA80 | Finalize group (seal and emit) |
The grouper framework is designed to be reusable: by registering different callback tuples, the same framework can group surface loads, shared memory accesses, or other coalescing-friendly instruction patterns.
Scheduling Mode: The usedessa Knob
The usedessa knob (dword_4FD26A0, default 2) controls the scheduling pass pipeline configuration despite its name suggesting deSSA (de-Static Single Assignment) method selection. Pre-RA scheduling dispatches through sub_2165850; post-RA through sub_21668D0.
Mode 1 (simple): Pre-RA scheduling is skipped entirely. Post-RA runs only unk_4FCE24C (the post-RA scheduler). This minimal configuration is useful for debugging or when scheduling is harmful to performance.
Mode 2 (full, default): Pre-RA scheduling runs unk_4FC8A0C. Post-RA scheduling runs three passes sequentially:
- unk_4FC8A0C -- pre-RA pass (disabled/noop in post-RA context).
- unk_4FCE24C -- post-RA scheduler.
- unk_4FC9D8C -- extra scheduling pass.
After scheduling completes, the framework prints "After Machine Scheduling", optionally runs sub_21F9D90, then runs unk_4FCAC8C and prints "After StackSlotColoring".
The "disabled" passes in mode 2 are registered but gated internally, allowing the framework to maintain a uniform pass list while selectively activating passes based on the current compilation phase.
Cross-Cutting Observations
Register pressure tracking appears in three distinct places within the scheduling infrastructure, each serving a different consumer:
| Tracker | Consumer | Update Frequency |
|---|---|---|
| MRPA incremental (sub_2E5A4E0) | MCSE decisions | Per instruction move/elimination |
| ScheduleDAGMILive (sub_355F610) | Scheduling decisions | Per scheduling region |
| MachinePipeliner stage tracking | II feasibility | Per pipeline stage |
All three maintain per-register-class pressure arrays but with different granularities. The MRPA tracker uses incremental delta updates for efficiency; the scheduler computes ASAP/ALAP bounds per region; the pipeliner tracks pressure per modulo stage.
The DenseMap hash function (ptr >> 9) ^ (ptr >> 4) is shared across both the 32-bit value variant (sub_1DFB9D0) and 64-bit value variant (sub_1DFB810), indicating a common template instantiation pattern consistent with LLVM's DenseMap<K, V> template.
Contrast with ptxas scheduling: ptxas has its own instruction scheduling subsystem with 195 knobs (including scoreboard-aware scheduling via the AdvancedSB* family, SchedDisableAll, SchedForceReverseOrder, and the GemmPipeliner* family of 8 knobs for matrix multiply detection and pipelining). CICC's scheduling operates at the MachineInstr level before PTX emission; ptxas re-schedules at the SASS level after PTX assembly. The two scheduling layers are independent but complementary.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's instruction scheduling framework was designed for CPU cores with out-of-order execution, branch prediction, and deep reorder buffers. On a GPU SM, these hardware features do not exist:
- Upstream assumes out-of-order hardware will hide scheduling mistakes. Modern CPUs have 200+ entry reorder buffers that dynamically reorder instructions, making compiler scheduling a second-order optimization. GPU SMs execute instructions in-order within each warp -- every scheduling decision is final. A poorly ordered instruction stream on GPU means stalls that no hardware can recover from.
- Upstream optimizes for pipeline hazards and port pressure. CPU schedulers model execution port contention (e.g., port 0 vs. port 1 on Intel), dispatch group rules, and pipeline bubble avoidance. GPU scheduling targets register pressure minimization (nvptx-sched4reg) because the SM's warp scheduler handles instruction-level parallelism through warp interleaving, not through instruction reordering within a single thread.
- Upstream assumes a single scheduling pass produces the final order. On CPU, LLVM's ScheduleDAGMILive emits the final instruction sequence. On NVPTX, cicc's scheduling is the first of two layers -- ptxas re-schedules the entire program at the SASS level with its own 195-knob subsystem (including scoreboard-aware scheduling via the AdvancedSB* family). CICC's scheduler optimizes for ptxas consumption, not for direct hardware execution.
- Upstream has no concept of texture instruction grouping. CPU scheduling never considers grouping memory operations for hardware coalescing units. NVIDIA adds a dedicated Texture Group Merge pass (sub_2DDE8C0, 74KB) that groups texture load instructions by base address for the hardware texture unit -- an entirely GPU-specific optimization absent from upstream.
- Upstream does not track register pressure incrementally during CSE. Upstream LLVM recomputes register pressure from scratch after each Machine CSE transform. NVIDIA's MRPA subsystem (sub_2E5A4E0, 48KB) maintains running pressure state through delta updates, because on GPU the pressure-to-occupancy relationship makes every CSE decision a potential occupancy cliff crossing that must be evaluated cheaply.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Scheduling subsystems | ScheduleDAGMILive + optional MachinePipeliner; no incremental pressure tracker | Three distinct subsystems: MRPA incremental tracker, Swing Modulo Scheduler, ScheduleDAGMILive; plus texture group merge pass |
| MRPA (incremental pressure) | Not present; pressure recomputed from scratch after each CSE transform | sub_2E5A4E0 (48 KB) + backend variant sub_1E00370 (78 KB) maintain running pressure state through delta operations during MCSE |
| Texture group merge | No concept of texture instruction grouping | Dedicated pass (sub_2DDE8C0) groups texture load instructions for hardware coalescing; scheduling-adjacent optimization absent from upstream |
| Scheduling target | Optimize for hardware pipeline hazards and port pressure | Optimize the MachineInstr stream for ptxas consumption; focus on register pressure reduction (nvptx-sched4reg) rather than hardware pipeline timing |
| Two-level scheduling | Single scheduling pass produces final instruction order | CICC scheduling is first layer; ptxas re-schedules at SASS level with its own 195-knob subsystem |
| Register pressure model | Per-register-class pressure sets from TRI | Same model but with GPU occupancy awareness; pressure arrays used to detect occupancy cliff crossings |
| Scheduling mode switch | Configured at pipeline construction time | Runtime mode switch between pre-RA (sub_2165850) and post-RA (sub_21668D0) with different heuristic weights |
ptxas Interaction
cicc's instruction scheduling operates at the MachineInstr level and produces a PTX instruction order that is not final. ptxas re-schedules the entire program at the SASS level using its own 195-knob scheduling subsystem, including scoreboard-aware scheduling (AdvancedSB* family), the GemmPipeliner* family for matrix multiply detection and software pipelining, and SchedForceReverseOrder for debugging. cicc's scheduler therefore optimizes for ptxas consumption rather than direct hardware execution: its primary goal is minimizing register pressure (nvptx-sched4reg) so that ptxas starts from a low-pressure baseline. The two scheduling layers are independent but complementary -- cicc controls the virtual register count visible to ptxas, and ptxas maps the resulting instruction stream onto the SM's hardware pipeline with full knowledge of scoreboard latencies and functional unit availability.