Priority List Scheduling Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The scheduling engine implements a classical priority list scheduling algorithm extended with GPU-specific heuristics for register pressure management, functional unit contention avoidance, yield hint generation, and barrier-aware instruction ordering. A single unified engine (sub_688DD0, 20 KB) serves all three scheduling phases -- ReduceReg, ILP/Latency, and DynBatch -- differentiated only by a mode byte that selects different priority weight configurations. The algorithm iterates basic blocks, builds a ready list of zero-dependency instructions, selects the highest-priority candidate via an 8-bit packed heuristic, emits it into the final schedule, updates the dependency DAG, and repeats until all instructions in the block are placed.
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized core loop |
| Priority function | sub_8C9320 (47 KB, ~1300 lines) -- 8-bit packed heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency pre-scan | sub_8CF880 (28 KB) -- reverse BB iteration |
| Edge builder | sub_8D9930 (19 KB) -- RAW/WAR/WAW/memory/barrier edges |
| Instruction mover | sub_925510 (341 bytes) -- doubly-linked list relink |
| Resource tracker | sub_A09530 (365 bytes) -- per-instruction stall update |
| Stall/barrier encoder | sub_8D7760 (41 KB) -- control word generation |
| Alternative loop | sub_68B9C0 (46 KB) -- combined DAG + scheduling |
| BB size limit | 4095 instructions (split via sub_931920) |
| Large function limit | 16383 instructions (chunk-based via sub_A9DDD0) |
Core Algorithm
The unified scheduling engine executes the following sequence for each basic block. All three phases (ReduceReg mode 0x39, ILP mode 0x49, DynBatch mode 0x41) follow this identical structure; only the priority weight selection differs.
function ScheduleEngine(sched, mode, arg3, rebuild):
if rebuild:
InitScheduleRegion(sched) // sub_6833F0
// Allocates 72-byte per-BB records
// Queries knobs 595, 743, 747
// Calls sub_7E5120 for instruction characterization
for each bb in sched.basic_blocks: // 72-byte stride
InitResourceTracking(sched, bb) // sub_A091C0
// Zeroes 40-byte resource records (one per register class + 1)
BuildReadyList(sched) // sub_6820B0
while ready_list is not empty:
best = SelectHighestPriority(sched) // via priority vtable
UnlinkFromReadyList(sched, best) // sub_682200
MoveInstruction(sched, best, ref) // sub_925510
UpdateStallCycles(sched, best) // sub_A09530
UpdateWARTracking(sched, best) // sub_A09D40
for each successor of best:
successor.dep_count -= 1
if successor.dep_count == 0:
ready_list.insert(successor)
// Sorted insertion using priority value
PostBBCleanup(sched, bb) // sub_BDC200 / sub_BDCDE0
The outer loop iterates basic blocks via an array of 72-byte records (v112 = 72 * bb_index). The inner loop is a standard worklist algorithm: remove the highest-priority ready instruction, schedule it, and propagate readiness to its successors in the dependency DAG by decrementing their predecessor counts.
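The worklist loop above can be condensed into a short runnable sketch. A Python heap stands in for the sorted ready list; all names (`schedule`, `succs`, the priority values) are illustrative, not taken from the binary:

```python
# Minimal priority-list scheduling sketch: pop the highest-priority
# ready instruction, emit it, then propagate readiness by decrementing
# successor dependency counts. Illustrative names throughout.
import heapq

def schedule(instrs, succs, dep_count, priority):
    """instrs: list of ids; succs: id -> dependent ids;
    dep_count: id -> unsatisfied predecessor count;
    priority: id -> packed priority (higher = better)."""
    # Ready list: every instruction with zero unsatisfied predecessors.
    ready = [(-priority[i], i) for i in instrs if dep_count[i] == 0]
    heapq.heapify(ready)               # max-priority via negated key
    order = []
    while ready:
        _, best = heapq.heappop(ready) # highest-priority candidate
        order.append(best)             # emit into final schedule
        for s in succs.get(best, []):  # update the dependency DAG
            dep_count[s] -= 1
            if dep_count[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return order
```

On a diamond DAG (a feeds b and c, both feed d) with b given higher priority than c, the sketch emits a, b, c, d.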
Mode Selection
The mode byte stored at *(DWORD*)(scheduler+60) controls which priority weight set the engine uses:
| Mode | Value | Mode byte | Objective |
|---|---|---|---|
| ReduceReg | 1 | 0x39 | Minimize register pressure. Prioritizes instructions that release registers (last-use operands). |
| ILP/Latency | 0 | 0x49 | Maximize instruction-level parallelism. Prioritizes critical-path and long-latency instructions. |
| DynBatch | 2 | 0x41 | Batch-aware tensor scheduling. Groups GMMA/WGMMA operations for warpgroup cooperation. |
The engine uses vtable dispatch at *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks. This allows each mode to inject custom behavior at scheduling boundaries without modifying the core loop.
Ready List Construction
sub_6820B0 (1.5 KB) scans the instruction linked list and collects every instruction with zero unsatisfied dependencies into a sorted ready list.
function BuildReadyList(sched):
for instr in sched.instruction_list: // linked list at sched[20]
if instr.opcode == 52: // NOP / BB boundary marker
continue // follow to real instruction
metadata = *(QWORD*)(instr + 40) // SchedNode pointer
if *(DWORD*)(metadata + 8) == 0: // depCount == 0
*(QWORD*)(metadata + 16) = sched[5] // link to current head
sched[5] = instr // new head
vtable_callback(sched, 104, instr) // insertion hook
*(DWORD*)(metadata + 28) = 0 // reset latency counter
The ready list is a singly-linked list threaded through SchedNode offset +16 (the nextReady field). Sort order is maintained at insertion time by the priority function -- each new instruction is inserted at its correct position so that the head of the list is always the highest-priority candidate. All metadata+N offsets throughout the scheduling pages refer to fields within the SchedNode block pointed to by instr+40 (sched_slot), not offsets from the instruction object itself. See the SchedNode layout for the complete field map.
Opcode 52 instructions are phantom BB boundary markers. The builder skips them but follows their linked-list successors to reach real instructions beyond the boundary.
The vtable+104 callback provides a polymorphic insertion hook. Different scheduling strategies can override this to implement custom ready-list policies (e.g., the dual-issue scheduler uses it to pair co-issuable instructions).
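Priority-ordered insertion through a nextReady-style pointer can be sketched as follows; the `Node` class and field names are illustrative, not the binary's SchedNode layout:

```python
# Sorted insertion into a singly-linked ready list so that the head is
# always the highest-priority candidate. Illustrative structure only.
class Node:
    def __init__(self, name, prio):
        self.name, self.prio, self.next_ready = name, prio, None

def insert_sorted(head, node):
    """Insert node keeping the list sorted by descending priority."""
    if head is None or node.prio > head.prio:
        node.next_ready = head
        return node                    # node becomes the new head
    cur = head
    while cur.next_ready and cur.next_ready.prio >= node.prio:
        cur = cur.next_ready           # walk past higher/equal entries
    node.next_ready = cur.next_ready
    cur.next_ready = node
    return head
```

Equal priorities insert after existing entries, keeping the order stable across repeated inserts.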
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority as an integer with 8-bit packed fields. Each bit encodes a different heuristic criterion. Because priority comparison reduces to a single integer comparison, the ready list maintains sort order without multi-key sorting overhead.
8-Bit Priority Encoding
| Bit | Name | Meaning | Notes |
|---|---|---|---|
| 7 (MSB) | yield-related | Instruction is near a yield boundary | Higher priority ensures yield hints align with scheduling boundaries |
| 6 | yield flag | Instruction triggers or participates in a yield sequence | Controls warp scheduler round-robin interaction |
| 5 | hot-cold | Memory access temperature (1 = hot, 0 = cold) | Hot = global/texture/surface loads with long latencies; cold = constant/shared |
| 4 | hot-cold / pressure | Packed byte holds hot-cold flag; pressure overflow acts as comparison override | See Priority Function Internals for the dual mechanism |
| 3 | same-BB preference | Instruction belongs to the currently-scheduled BB | Discourages cross-BB instruction motion |
| 2 | stall-free | Scheduling this instruction introduces zero stall cycles | All producer latencies have completed |
| 1 | latency-bound | Instruction is on the DAG critical path | Prioritizes latency-sensitive dependency chains |
| 0 (LSB) | tiebreaker | Additional ordering factor | Encodes instruction position, operand count, or FU preference |
Higher numeric value = higher priority. Bit 7 is the most significant criterion: yield-boundary instructions always schedule before non-yield instructions when both are ready.
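Because all eight criteria pack into one byte with the strongest criterion in the most significant bit, candidate ordering reduces to a plain integer comparison. A minimal sketch (field names follow the table; the function itself is illustrative):

```python
# Pack the eight 1-bit criteria into a single byte; comparing two
# packed bytes then orders candidates by the table's bit significance.
def pack_priority(yield_related, yield_flag, hot, hot_flag,
                  same_bb, stall_free, critical, tiebreak):
    return (yield_related << 7) | (yield_flag << 6) | (hot << 5) \
         | (hot_flag << 4) | (same_bb << 3) | (stall_free << 2) \
         | (critical << 1) | tiebreak

# A yield-boundary instruction (bit 7 alone) outranks a candidate that
# wins on every other criterion: 0x80 > 0x7F.
boundary = pack_priority(1, 0, 0, 0, 0, 0, 0, 0)
everything_else = pack_priority(0, 1, 1, 1, 1, 1, 1, 1)
```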
Hot-Cold Classification
The hot-cold flag (bit 5) classifies memory operations by expected latency:
- Hot (bit 5 = 1, higher priority): global memory loads (LDG), texture fetches (TEX, TLD), surface operations. These have high latency (hundreds of cycles) and benefit most from early scheduling to overlap with computation. Detected by sub_A9CDE0.
- Cold (bit 5 = 0, lower priority): constant memory loads (LDC), shared memory operations (LDS, STS). These have low latency and do not need early scheduling. Detected by sub_A9CF90, which also suppresses the pressure-overflow and critical-path extension signals.
Classification uses sub_A9CDE0 (hot detection) and sub_A9CF90 (cold detection). Memory space type is determined by sub_693BC0, which returns space codes: 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant, 1 = generic. Hot-cold tracking is gated by scheduler+523 and scheduler+532; when the hot-cold budget (scheduler+532) reaches zero, the feature deactivates for the remainder of the BB.
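A toy classifier over the space codes listed above makes the split concrete. The hot/cold assignment follows the text; treating local and generic as neutral is an assumption for illustration, not confirmed behavior of sub_A9CDE0/sub_A9CF90:

```python
# Illustrative hot/cold split keyed on the documented space codes:
# 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant,
# 1 = generic.
HOT_SPACES = {16, 11}    # global, surface: hundreds of cycles latency
COLD_SPACES = {3, 7}     # shared, constant: low latency

def classify(space_code):
    if space_code in HOT_SPACES:
        return "hot"
    if space_code in COLD_SPACES:
        return "cold"
    return "neutral"     # local/generic: neither signal (assumption)
```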
Pressure Overflow
Bit 4 in the packed byte holds the hot-cold flag (see above). The pressure overflow signal is a separate Boolean computed by checking all four register classes (GPR, predicate, address, UGP) against their respective limits. When any class exceeds its budget, the pressure overflow flag activates and acts as a comparison override: the candidate wins regardless of the packed priority byte, forcing the scheduler to select the instruction that relieves register pressure. This is the primary mechanism by which the ReduceReg phase achieves its objective: the mode sets a tight register budget via scheduler+178, causing pressure overflow to activate frequently and driving the scheduler toward pressure-reducing orderings. See the Priority Function Internals section for the exact per-class threshold checks.
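The override semantics can be modeled as a two-level comparison; this is an illustrative sketch, not the decompiled comparison code:

```python
# Pressure overflow acts as a comparison override: a candidate whose
# overflow flag is set wins regardless of its packed priority byte.
# Only when both candidates agree on overflow does the byte decide.
def beats(cand_prio, cand_overflow, best_prio, best_overflow):
    if cand_overflow != best_overflow:
        return cand_overflow        # overflow candidate always wins
    return cand_prio > best_prio    # otherwise packed-byte ordering
```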
Priority Evaluation Sequence
The priority function evaluates criteria in this order for each candidate instruction:
- sub_8C7290: extract 4-class register deltas, same-BB flag, and per-BB resource vector (SSE-optimized)
- Compute yield saturation: check write-barrier counters for predicate, GPR, and UGP register classes against their ceilings (7, 7, and target_desc+624 respectively)
- sub_8C67A0: compute per-instruction resource cost if BB slot not yet committed
- sub_8C7120: update barrier tracking state (if metadata+111 bit 7 set)
- Evaluate register pressure: compute per-class overflow against budget (scheduler+432) and per-class limits; derive pressure-overflow Boolean
- Evaluate stall-free: compare earliest cycle (metadata+32) vs current cycle (scheduler+480)
- Evaluate critical path: compare barrier-target count vs depth threshold (scheduler+464)
- Evaluate yield bits: opcode 39 (yield-related) and opcode 96 (yield flag from scheduler+524)
- Pack 8 bits into priority byte
- Evaluate hot/cold: sub_A9CDE0 / sub_A9CF90 (only when scheduler+523 active)
- Multi-stage comparison against running best: resource vectors, then XOR-based bit scan, then secondary tiebreakers
The function scans the full ready list in a single pass (not limited by knob 770 for the scan itself). Knob 770 (priority queue depth, default 4) controls the depth threshold mechanism for critical-path activation, not the number of candidates evaluated.
Key Internal Variables
| Variable | Source | Content |
|---|---|---|
| budget_hw | sub_6818D0(scheduler, scheduler[432] - scheduler[412]) | Register budget in HW register units |
| reduced_hw | sub_6818D0(scheduler, budget - budget/16) | Tighter budget for critical-path threshold (or knob 760 override) |
| queue_depth | knob 770 | Depth threshold parameter (default 4); controls critical-path activation |
| per_bb_flag | knob 769 | Per-BB scheduling flag; when set, resets yield state between BBs |
| scheduler+420 | state | Spill-mode countdown; when > 0, forces aggressive scheduling with bit 1 = 1 |
| scheduler+464 | state | Depth threshold -- number of barrier targets that must be ready before critical-path activates |
| scheduler+480 | state | Current scheduling cycle; used for stall-free evaluation |
| scheduler+523 | state | Hot-cold tracking enable flag; gated by knob |
| scheduler+524 | state | Current yield state; propagated to CONTROL instructions via bit 6 |
| scheduler+532 | state | Hot-cold budget counter; decremented per cold instruction, disables tracking at zero |
| scheduler+672 | allocation | Per-BB resource cost table (84 bytes per slot) |
Support Subroutines
| Address | Size | Purpose |
|---|---|---|
| sub_8C67A0 | 3.7 KB | Compute per-instruction resource cost. Calls sub_A08A00 (resource model) three times: mode 1 = instruction's own cost, mode 2 = operand release cost, mode 3 = combined BB-level impact. Uses SSE _mm_add_epi32 for vector accumulation. |
| sub_8C7290 | 5.1 KB | Copy 10-element int32 resource vector from per-BB table at scheduler+672. SSE _mm_loadu_si128 bulk copy. Special case: opcode 97 (STG in ROT13; used as control/boundary marker) returns base scheduler state with zeroed deltas. |
| sub_8C7720 | 20 KB | Red-black tree operations for instruction reordering within BB. Maintains a balanced BST of scheduling candidates for O(log N) insertion, removal, and priority update. |
| sub_8C7120 | -- | Barrier tracking state update. |
| sub_693BC0 | -- | Memory space classification and latency query. |
| sub_6818D0 | -- | Register count to hardware-aligned unit conversion. |
Priority Function Internals
The full logic of sub_8C9320 divides into three phases: (1) pre-scan the ready list to collect aggregate BB statistics, (2) iterate the ready list a second time evaluating each candidate and maintaining a running best, and (3) update scheduler state and return the winner. The function signature is (scheduler, &second_best) -> best_instruction.
Phase 1: Pre-Scan Statistics
Before priority evaluation begins, the function iterates the entire ready list (linked via metadata+16) and accumulates per-BB statistics that feed into the per-instruction priority decisions:
| Variable | Init | Accumulation | Meaning |
|---|---|---|---|
| shared_mem_count | 0 | ++ when opcode 183 and sub_693BC0 returns space 3 | Count of shared-memory operations in ready list |
| neg_reg_deficit | 0 | += delta when register delta < 0 | Total register pressure reduction from ready instructions |
| max_dep_cycle | -1 | max(current, metadata+92) | Highest dependency cycle among all ready instructions |
| max_pred_cycle | 0 | max(current, metadata+88) | Highest predecessor cycle among all ready instructions |
| barrier_count | 0 | ++ when metadata+108 & 1 | Count of barrier-target instructions in ready list |
| dep_flag_count | 0 | ++ when metadata+108 & 2 | Count of instructions with dependency-set flag |
| pos_pressure_sum | 0 | += delta when register delta > 0 | Total register pressure increase from ready instructions |
| filtered_pressure | 0 | += delta when within depth threshold | Pressure increase from depth-eligible instructions |
| max_barrier_slot | -1 | max(current, metadata+24) for barrier targets | Latest BB slot among barrier-target instructions |
| min_barrier_latency | 99999 | min(current, metadata+28) for barrier targets | Shortest latency counter among barrier-target instructions |
| max_nonbarrier_cycle | -1 | max(current, metadata+32) for non-barrier | Latest earliest-available-cycle for non-barrier instructions |
| any_stall_free | 0 | \|= (metadata+32 >= 0) | Whether any instruction can issue without stalling |
| total_ready | 0 | ++ for every instruction | Total instructions in ready list |
| preferred_instr | NULL | non-barrier instr with max metadata+24 | The program-order-latest non-barrier instruction |
The pre-scan also maintains a depth-threshold table: an array of up to 32 barrier-target instruction pointers sorted by their latency counter (metadata+28). This table is scanned to compute scheduler+464 (depth threshold) and scheduler+380 (latency cutoff), which control when the critical-path bit activates.
Phase 2: Register Budget Prologue
Before the main loop, the function computes two register budgets from scheduler+432 (target register count):
budget_base = scheduler[432] - scheduler[412] // target minus committed
if ReduceReg_mode (scheduler+178): // ReduceReg tightens budget
if scheduler[416] < 0:
budget_base -= (scheduler[432] / 8) + 3 // reduce by ~12.5% + 3
else:
budget_base -= scheduler[416] // explicit reduction
budget_hw = RegToHWUnits(scheduler, budget_base) // sub_6818D0
reduced_hw = RegToHWUnits(scheduler, budget_base - budget_base/16)
// ~6.25% tighter
if knob_760_active:
reduced_hw = RegToHWUnits(scheduler, budget_base - knob_760_value)
queue_depth = 4 // default
if knob_770_active:
queue_depth = knob_770_value // override
budget_hw sets the threshold for bit 4 (pressure overflow). reduced_hw provides a tighter threshold used in the critical-path assessment. queue_depth (knob 770) feeds the depth-threshold mechanism that gates critical-path activation; it does not limit how many candidates receive full evaluation -- the full ready list is scanned each pass.
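The budget prologue translates directly into runnable form; `reg_to_hw_units` is a stand-in for sub_6818D0 and simply rounds up to an assumed 8-register hardware granule:

```python
# Runnable transcription of the register budget prologue. The 8-register
# granule in reg_to_hw_units is an assumption for illustration; the real
# sub_6818D0 conversion is target-dependent.
def reg_to_hw_units(regs, granule=8):
    return (max(regs, 0) + granule - 1) // granule   # round up

def compute_budgets(target, committed, reduce_reg=False,
                    explicit_reduction=-1, knob_760=None):
    base = target - committed                  # target minus committed
    if reduce_reg:                             # ReduceReg tightens budget
        if explicit_reduction < 0:
            base -= target // 8 + 3            # reduce by ~12.5% + 3
        else:
            base -= explicit_reduction         # explicit reduction
    budget_hw = reg_to_hw_units(base)
    # reduced budget: ~6.25% tighter, or knob 760 override
    reduction = knob_760 if knob_760 is not None else base // 16
    return budget_hw, reg_to_hw_units(base - reduction)
```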
Phase 3: Per-Bit Computation
For each instruction in the ready list, sub_8C7290 extracts its per-register-class deltas (4 classes: GPR, predicate, address, UGP) and the same-BB flag. Then each priority bit is computed:
Bit 7 -- Yield-related. Determined by opcode. Only opcode 39 (YIELD instruction variant) can set this bit. The condition checks the last operand's low 2 bits:
if opcode_masked == 39:
operand_index = operand_count - 1 - ((opcode >> 11) & 2)
yield_related = (instr[84 + 8*operand_index] & 3) == 0
else:
yield_related = 0
When set, the instruction is a yield boundary marker and receives absolute highest priority regardless of all other heuristics.
Bit 6 -- Yield flag. Set only for opcode 96 (CONTROL instruction):
if opcode_masked == 96:
yield_flag = scheduler[524] // current yield state
else:
yield_flag = 0
// Post-adjustment: suppress when hot/pressure bits dominate
if (bit5_set || bit4_set):
yield_flag = 0
if metadata[32] < scheduler[480]: // behind schedule
yield_flag = scheduler[396] ? original_yield : 0
The yield flag propagates the scheduler's warp yield state only through CONTROL instructions, ensuring yield hints align with scheduling barriers.
Bit 5 -- Hot-cold classification. Requires hot-cold tracking to be active (scheduler+523 set, gated by scheduler+532 > 0):
if hot_cold_active:
is_hot = sub_A9CDE0(target_desc, context, instruction)
else:
is_hot = 0
// Cold detection suppresses priority
if sub_A9CF90(target_desc, context, instruction): // is_cold?
pressure_overflow = 0 // suppress bit 4
critical_extension = 0 // suppress lookahead
sub_A9CDE0 returns true for global memory loads (LDG), texture fetches (TEX, TLD), and surface operations -- instructions with latencies in the hundreds of cycles. sub_A9CF90 returns true for constant loads (LDC), shared memory operations (LDS/STS) -- low-latency operations. Hot instructions (bit 5 = 1) get higher priority to schedule early and overlap their long latencies with computation. Cold instructions (bit 5 = 0) are deprioritized.
Bit 4 -- Pressure overflow. This bit does NOT appear directly in the initial packing as a single variable. Instead, the pressure overflow signal (v81 in decompiled source) feeds into the candidate comparison logic as an override. The mechanism:
// For barrier-target instructions:
budget_in_units = RegToHWUnits(scheduler, scheduler[432])
headroom = RegToHWUnits(scheduler, 8)
if budget_in_units > headroom + scheduler[72]: // plenty of headroom
pressure_overflow = 0
elif latency_counter > min_barrier_latency + 9: // far from ready
pressure_overflow = 0
else:
// Check all 4 register classes against their limits:
overflow = false
overflow |= (scheduler[72] + gpr_delta > budget_hw)
overflow |= (scheduler[68] + pred_delta > 7)
overflow |= (scheduler[56] + addr_delta > 7)
overflow |= (scheduler[60] + ugp_delta >= target_desc[624])
pressure_overflow = overflow
When pressure_overflow = 1, the candidate wins the comparison regardless of other bits -- it is the scheduler's mechanism for emergency register pressure relief. In the packed byte's bit 4 position, the hot-cold flag occupies the slot. The pressure overflow signal operates at a higher level: it can force the candidate to win even when its packed priority byte is lower.
Bit 3 -- Same-BB preference. Output parameter from sub_8C7290:
same_bb = sub_8C7290.output_param_5 // boolean from resource copy
Set when the instruction belongs to the currently-scheduled basic block. Instructions imported from other BBs by global scheduling get same_bb = 0, reducing their priority relative to local instructions.
Bit 2 -- Stall-free. Computed from the earliest-available-cycle field:
if countdown_active (scheduler[420] != 0):
if metadata[32] < scheduler[480] AND instr != preferred_instr:
stall_free = 0
if pressure_plus_reg_sum > 0:
goto full_evaluation // positive pressure = needs analysis
else:
stall_free = 1
else:
// Normal mode: stall-free when producers have completed
if metadata[32] >= scheduler[480]:
stall_free = 1
elif instr == preferred_instr:
stall_free = 1
else:
stall_free = 0
metadata+32 is the instruction's earliest available cycle -- the latest completion time among all its producer instructions. scheduler+480 is the current scheduling cycle. When earliest >= current, all producers have retired and the instruction can issue with zero pipeline stalls.
Bit 1 -- Critical-path / latency-bound. Complex multi-path computation:
if countdown_active (scheduler[420] != 0):
// Spill mode: almost always critical
if !(barrier_bits_set_in_priority):
if slot_limit_exceeded:
critical = 1
else:
critical = !(pressure_sum <= 0 && max_reg_class == 0)
else:
critical = 0
else:
// Normal mode: depth threshold comparison
if barrier_count >= scheduler[464]:
critical = 1 // enough barriers ready -> critical path active
else:
critical = 0
In spill mode (active when scheduler+420 > 0), the critical-path bit is set for nearly all instructions to maximize scheduling throughput. In normal mode, it activates when the number of barrier-target instructions in the ready list meets or exceeds the depth threshold computed during the pre-scan, indicating that the scheduler is processing a latency-critical dependency chain.
Bit 0 -- Tiebreaker (barrier-target). Read directly from instruction metadata:
tiebreaker = metadata[108] & 1 // barrier-target flag
Barrier-target instructions (those waiting on a hardware barrier) get bit 0 = 1. Since this is the lowest-priority bit, it only affects ordering when all higher bits are identical. Scheduling barrier targets promptly allows the barrier resource to be retired sooner, freeing scoreboard entries for other instructions.
Packed Byte Assembly
The 8 bits are packed into a single byte using shift-and-mask arithmetic:
priority = (yield_related << 7) // bit 7
| (yield_flag << 6) & 0x7F // bit 6
| (hot_cold << 5) & 0x3F // bit 5 (initially yield copy)
| (hot_flag << 4) & 0x3F // bit 4
| (same_bb << 3) & 0x0F // bit 3
| (stall_free << 2) & 0x0F // bit 2
| (critical_path << 1) & 0x03 // bit 1
| (tiebreaker << 0) & 0x03 // bit 0
The & 0xNN masks ensure each bit occupies exactly one position. In the initial packing, bit 5 and bit 6 both derive from the yield variable; the hot-cold flag (sub_A9CDE0 result) overwrites bit 5 in subsequent repackings that occur during the spill-mode and comparison paths.
Candidate Comparison
The comparison between the current candidate and the running best is NOT a simple integer comparison of the packed bytes. The function performs a multi-stage refinement:
- Resource vector comparison: If knob-gated architecture checks pass (SM index > 5 at context+1704), a 4-tuple lexicographic comparison of per-register-class resource vectors occurs first. The four classes are compared in order: GPR delta, predicate delta, address delta, UGP delta. The first class that differs determines the winner.
- Priority byte XOR scan: When resource vectors are equal, the function XORs the current and best packed bytes and checks differing bits in this order:
  - Bit 4 (0x10) -- pressure: winner has bit 4 set (higher pressure need)
  - Bit 6 (0x40) -- yield: winner has bit 6 set (yield participation)
  - Bit 1 (0x02) -- critical: winner has bit 1 set
  - Bit 2 (0x04) -- stall-free: winner has bit 2 set
  - Bit 5 (0x20) -- hot-cold: winner has bit 5 set (hot memory op)
- Secondary tiebreakers (when all checked bits match):
  - Barrier group index (v213 vs v253)
  - Latency counter comparison (v223 vs v248)
  - Bit 7 yield-related (only when shared-memory count > 0)
  - Contention score (a derived value incorporating register overflow penalty: contention + 2 * RegToHWUnits(pressure_delta) - pressure_sum_sign)
  - Slot manager cycles (scheduling cost estimate from sub_682490)
  - Earliest available cycle (metadata+32)
  - Dependency cycle (metadata+92)
  - Latest deadline (metadata+40)
  - Register delta magnitude
- Positional fallback: When all heuristic comparisons are tied, the instruction with the higher BB slot (metadata+24) wins, preserving original program order.
The multi-stage comparison explains why the packed byte uses non-obvious bit ordering. Bits 4, 6, 1, 2, 5 are checked before bit 7 in the refinement path, even though bit 7 is the MSB. The packed byte enables fast ready-list insertion sort (integer comparison), while the full comparison function provides nuanced selection for the actual scheduling decision.
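The XOR-based bit scan can be modeled as a first-differing-bit search over the documented check order; a sketch with illustrative names:

```python
# XOR the two packed bytes, then walk the documented check order
# (bits 4, 6, 1, 2, 5). The first differing bit decides: whoever has
# that bit set wins. If no checked bit differs, fall through to the
# secondary tiebreakers.
CHECK_ORDER = (0x10, 0x40, 0x02, 0x04, 0x20)   # bits 4, 6, 1, 2, 5

def xor_scan_winner(cand, best):
    """True if cand wins, False if best wins, None if undecided."""
    diff = cand ^ best
    for mask in CHECK_ORDER:
        if diff & mask:
            return bool(cand & mask)   # holder of the bit wins
    return None                        # secondary tiebreakers decide
```

Note that bit 7 is not in the check order, so two candidates differing only in bit 7 fall through to the tiebreakers, matching the text above.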
Scheduler State Updates
After selecting the best candidate, the function updates scheduler state:
// Spill mode countdown
if winner is barrier-target:
scheduler[420] = computed_countdown - 1
scheduler[396] -= 1 // spill sequence counter
if metadata[32] >= 0:
scheduler[400] -= 1 // stall-free counter
if stall_free_count==0 AND remaining>0 AND countdown>1:
scheduler[420] = 0 // force exit spill mode
scheduler[464] = -1 // reset depth threshold
else:
// Non-barrier winner in countdown mode
if !(barrier_bits in priority) AND slot_cost within budget:
// do nothing, continue countdown
else:
scheduler[420] = 0 // exit spill mode
scheduler[464] = -1 // reset depth threshold
// Slot manager update (when winner has positive scheduling cost)
if best_cost > 0 AND slotManager[76] > 0:
if slotManager[140]:
slotManager[28] += slotManager[44] // advance base
slotManager[76] = 0 // reset count
slotManager[80] = NULL // reset anchor
best.metadata[28] = sub_682490(...) // recompute latency
// Hot-cold counter update
if hot_cold_active AND winner is cold (sub_A9CF90 returns true):
scheduler[532] -= 1 // decrement hot-cold budget
elif hot_flag was set for winner:
scheduler[523] = 0 // disable hot-cold tracking
The function returns the best instruction pointer and writes the second-best to *a2 for lookahead scheduling.
Dependency DAG Construction
The dependency graph is built in two stages before the scheduling loop begins. The DAG is a directed acyclic graph where nodes are instructions and edges represent ordering constraints with associated latency values.
Stage 1: Pre-Scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order (bb[N-1] to bb[0]) using the BB ordering array at func+512.
For each BB:
- Check knobs 314/313 for per-BB scheduling skip flags
- Walk the instruction linked list, identifying NOP/control instructions
- Set bb->next pointers and configure BB scheduling state
- Delegate to Stage 2 (sub_8D9930) for edge construction
- Manage memory arenas with SSE-optimized copies for metadata arrays
Contains approximately 14 nested loops for edge construction. The reverse iteration order ensures that when the scheduler processes a BB, all of its successors have already been characterized.
Stage 2: Edge Construction (sub_8D9930, 19 KB)
For each pair of instructions within a BB, checks for five dependency types:
| Type | Abbreviation | Condition | Edge Latency |
|---|---|---|---|
| True | RAW | Read-after-write on same register | Producer's pipeline latency |
| Anti | WAR | Write-after-read on same register | 0 (ordering constraint only) |
| Output | WAW | Write-after-write on same register | 1 (minimum separation) |
| Memory | -- | Store before load to same memory space | Conservative; full ordering |
| Barrier | -- | Instruction depends on barrier/sync result | Barrier completion latency |
Operand analysis is performed by sub_894290 (27 KB), which processes 16-bit operand descriptors encoding:
| Bits | Field |
|---|---|
| 12--15 | Register class |
| 8--11 | Bank number |
| 0--7 | Dependency type |
Memory dependencies are conservative: all stores are ordered before subsequent loads to the same memory space. The scheduler does not perform alias analysis -- it relies on the memory space classification from sub_693BC0 to determine whether two operations might conflict.
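A sketch of the descriptor decode from the bitfield table above, paired with a deliberately naive RAW/WAR/WAW classifier; sub_894290's real logic is far richer, and the `classify_edge` helper is purely illustrative:

```python
# Decode the documented 16-bit operand descriptor layout, and classify
# the register dependency of a later instruction on an earlier one.
def decode_descriptor(desc):
    return {
        "reg_class": (desc >> 12) & 0xF,   # bits 12-15
        "bank":      (desc >> 8)  & 0xF,   # bits 8-11
        "dep_type":  desc & 0xFF,          # bits 0-7
    }

def classify_edge(first_writes, first_reads, second_writes, second_reads):
    """Dependency of `second` on `first`, given register sets."""
    if first_writes & second_reads:
        return "RAW"   # true dependency: producer pipeline latency
    if first_writes & second_writes:
        return "WAW"   # output dependency: minimum separation of 1
    if first_reads & second_writes:
        return "WAR"   # anti dependency: ordering only, latency 0
    return None
```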
Supplementary Dependency Builders
These functions handle specific aspects of dependency construction in the 0x6A0000--0x6B0000 range:
| Address | Size | Purpose |
|---|---|---|
| sub_68A690 | 31 KB | BuildDependencies -- walks instruction lists and creates producer-consumer dependency edges from def-use chains |
| sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level data dependency edges |
| sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints (ordering edges between memory operations even without explicit data deps) |
| sub_6A78F0 | 23 KB | ProcessOperands -- iterates operand arrays at instruction +84, extracts register file pressure and dependency distance information |
Instruction Emission
sub_925510 (341 bytes, 57 lines) is the universal instruction relocation primitive. It moves an instruction to a new position in the doubly-linked instruction list.
function MoveInstruction(block, instr, insert_before):
// 1. Unlink from current position
instr.prev.next = instr.next
instr.next.prev = instr.prev
// 2. Insert before reference instruction
instr.next = insert_before
instr.prev = insert_before.prev
insert_before.prev.next = instr
insert_before.prev = instr
// 3. Update block boundaries if needed
if instr was block.head (block+272):
block.head = instr.next
if instr was block.tail (block+280):
block.tail = instr.prev
// 4. Notify subsystems
UpdateDependencyGraph(block, instr) // sub_7EEC10
UpdateBlockTimestamp(block) // sub_7DDCA0
This function has 13 callers across the codebase. It serves as the shared instruction movement primitive for the scheduler, register allocator, and peephole optimizer.
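The relink sequence mirrors directly into runnable form; sentinel head/tail nodes replace the block+272/block+280 boundary updates, so this sketch covers steps 1 and 2 of the pseudocode and omits the boundary and notification steps:

```python
# Doubly-linked list relink: unlink an instruction, then splice it in
# before a reference instruction. Sentinels remove the edge cases.
class Instr:
    def __init__(self, name):
        self.name, self.prev, self.next = name, None, None

def link(items):
    head, tail = Instr("<head>"), Instr("<tail>")
    prev = head
    for it in items:
        prev.next, it.prev = it, prev
        prev = it
    prev.next, tail.prev = tail, prev
    return head, tail

def move_before(instr, insert_before):
    # 1. Unlink from current position
    instr.prev.next = instr.next
    instr.next.prev = instr.prev
    # 2. Insert before the reference instruction
    instr.prev = insert_before.prev
    instr.next = insert_before
    insert_before.prev.next = instr
    insert_before.prev = instr

def to_list(head):
    out, cur = [], head.next
    while cur.name != "<tail>":
        out.append(cur.name)
        cur = cur.next
    return out
```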
Resource Tracking
The scheduler maintains 10 functional unit resource counters per basic block, tracking pipeline utilization to avoid saturating any single execution unit.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per functional unit |
| 40--76 | 10 x int32 | Resource pressure delta (change since last step) |
| 80 | int32 | BB-entered flag and auxiliary bits |
Functional Unit Pipes
| Index | Pipe | Typical Instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS (overlaps with pipe 4 for LDS/STS) |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
Resource Tracking Helpers
| Address | Size | Purpose |
|---|---|---|
| sub_A091C0 | -- | Initialize per-BB resource arrays to zero |
| sub_A09530 | 365 bytes | Update stall cycle counters after scheduling an instruction. Decrements pending latency counters for all tracked resources. |
| sub_A09D40 | -- | Update WAR (anti-dependency) resource tracking for register operands |
| sub_A08A00 | -- | Resource model query (called in 3 modes by sub_8C67A0) |
The resource model sub_A08A00 is called three times per instruction by sub_8C67A0:
- Mode 1: instruction's own execution cost (FU assignment + pipeline latency)
- Mode 2: operand release costs (freed resources when an operand reaches last-use)
- Mode 3: combined instruction + BB-level impact (aggregate pressure)
SSE intrinsics (_mm_add_epi32, _mm_loadu_si128) are used throughout for vectorized resource accumulation and copying.
Register Pressure Tracking
The scheduler tracks register liveness via a bitvector at scheduler+832. Each bit represents one register; the pressure is the popcount of the live set.
function UpdateRegisterPressure(sched, instr):
for each operand in instr.operands:
if operand.is_def:
set_bit(sched.live_bv, operand.reg) // DEF: mark live
if operand.is_last_use:
clear_bit(sched.live_bv, operand.reg) // LAST-USE: mark dead
sched.current_pressure = popcount(sched.live_bv)
The bitvector is sized to (numRegs + 1) words, or (2 * numRegs + 2) when knob 420 (dual-register tracking) is active. Dual-register tracking separately tracks register pairs for instructions that consume or produce 64-bit values.
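A minimal model of the mechanism, using a Python integer in place of the bitvector at scheduler+832; `update_pressure` is an illustrative name:

```python
# Liveness bitvector sketch: set a bit on DEF, clear it on last use,
# and take the popcount of the live set as the current pressure.
def update_pressure(live_bv, defs, last_uses):
    for reg in defs:
        live_bv |= 1 << reg            # DEF: mark live
    for reg in last_uses:
        live_bv &= ~(1 << reg)         # last use: mark dead
    return live_bv, bin(live_bv).count("1")   # pressure = popcount
```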
Pressure state fields:
| Offset | Content |
|---|---|
| scheduler+432 | Target register count (from budget computation) |
| scheduler+324 | Committed register target |
| scheduler+316 | Minimum register count |
| scheduler+320 | Register pressure slack (headroom) |
When current_pressure > scheduler+432, the priority function sets bit 4 (pressure overflow) in the encoding, biasing the scheduler toward instructions that release registers.
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 to a heap-allocated SchedNode block. The offsets below are relative to the SchedNode base, not the 296-byte Ori instruction. See the SchedNode layout for the authoritative field map.
| Offset | Type | Content |
|---|---|---|
| +8 | int32 | dep_count -- unsatisfied predecessor count (0 = ready) |
| +16 | QWORD | next_ready -- linked-list pointer in ready list |
| +24 | int32 | bbSlot -- 1-based BB position (-1 = unscheduled) |
| +28 | int32 | latency_counter -- current stall counter |
| +32 | int32 | earliestCycle -- earliest available cycle |
| +40 | int32 | latestDeadline -- latest deadline cycle |
| +44 | int32 | Barrier group index |
| +88 | int32 | maxPredecessorCycle |
| +92 | int32 | maxDependencyCycle |
| +108 | byte | Flags: bit 0 = barrier target, bit 1 = has dependency, bit 2 = early schedulable, bit 3 = late schedulable, bit 4 = has register operand |
| +111 | byte | Flags: bit 7 = uses expensive register file |
The scheduling loop also reads Ori instruction fields directly (not via the SchedNode): instr+72 (opcode), instr+80 (operand count), instr+84 (operand descriptors).
Sentinel values: bbSlot -1 (unscheduled), latency 0x1869F (99999 = infinity).
The dep_count field at +8 is the key scheduling control: it counts unsatisfied predecessors in the dependency DAG. When a predecessor is scheduled, the engine decrements every successor's dep_count. When dep_count reaches zero, the instruction becomes ready and is inserted into the ready list.
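The dep_count mechanics can be captured in a few lines of list scheduling. This is a sketch under assumed names (`preds`, `prio` are hypothetical inputs); the real engine uses the SchedNode fields and the 8-bit packed priority described elsewhere.

```python
# Sketch of the dep_count / ready-list mechanics: instructions with zero
# unsatisfied predecessors are ready; scheduling one decrements each
# successor's count, and successors that reach zero become ready.
def list_schedule(preds, prio):
    """preds: {instr: set of predecessors}; prio: {instr: priority int}."""
    dep_count = {i: len(p) for i, p in preds.items()}
    succs = {i: [] for i in preds}
    for i, ps in preds.items():
        for p in ps:
            succs[p].append(i)
    ready = [i for i, c in dep_count.items() if c == 0]
    order = []
    while ready:
        best = max(ready, key=lambda i: prio[i])  # highest priority wins
        ready.remove(best)
        order.append(best)
        for s in succs[best]:
            dep_count[s] -= 1
            if dep_count[s] == 0:                 # became ready
                ready.append(s)
    return order
```

On a diamond DAG (a before b and c, both before d), the priority only decides the b/c order; dependencies force a first and d last.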
Stall and Barrier Insertion
After the scheduling loop determines instruction order, sub_8D7760 (41 KB) converts the abstract schedule into SASS control words.
For each instruction in the scheduled order:
| Field | Computation | Range |
|---|---|---|
| Stall count | Distance in cycles to the nearest dependent consumer | 0--15 (capped by knob 805, max 16) |
| Yield hint | Warp scheduling hint -- should the HW scheduler switch to another warp? | 0 or 1 |
| Barrier assignment | Which of the 6 available barriers this instruction writes/waits on | 0--5, or none |
| Scoreboard deps | Read/write dependency tracking for the hardware scoreboard | Bitmask |
The function contains architecture-variant switches for different barrier models (sm_70 vs sm_80 vs sm_90+). It manages a 32-entry barrier table for tracking active barrier assignments.
See Scoreboards & Barriers for the control word encoding format.
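The stall-count rule in the table can be approximated as "producer latency minus the instructions already separating producer and consumer, clamped to the 4-bit range". This is a hedged sketch of that arithmetic only; positions and latencies are hypothetical, and the real sub_8D7760 also accounts for barriers and scoreboards.

```python
# Sketch of the stall-count computation: cycles the consumer must wait
# after accounting for instructions already issued between the pair,
# clamped to the control word's 0-15 range.
def stall_count(producer_pos, consumer_positions, latency, cap=15):
    if not consumer_positions:
        return 0                       # no dependent consumer: no stall
    nearest = min(consumer_positions)  # nearest dependent consumer
    gap = nearest - producer_pos       # issue slots already between them
    return max(0, min(latency - gap, cap))
```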
Alternative Scheduling Loop
sub_68B9C0 (46 KB) is a monolithic function that combines dependency graph construction with the scheduling loop. It serves as an alternative entry point for scheduling passes that need to build the DAG inline rather than using the pre-built graph from Stage 1.
Internal structure:
- Initialize scheduling state (sub_685700)
- Initialize ready-list management (sub_687080)
- Check resource conflicts (sub_687410)
- Inner loop (while(2) infinite loop with break conditions):
  - Check if the ready list is empty -- break if so
  - Check opcode 97 (STG in ROT13; used as scheduling barrier/control marker) -- special handling
  - Select the best instruction from the ready list
  - Schedule it: assign cycle, update resources, process edges
  - For each successor: decrement dep_count, add to the ready list if it reaches zero
  - Check the boundary condition (v236) -- break if done
- Track first-pass initialization via v215
This function accesses the Ori instruction's opcode at instr+72, plus SchedNode fields (via instr+40 pointer): +24 (bbSlot), +144 (scheduling slot), +164 (resource class), and +236 (latency).
Specialized Scheduling Strategies
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies. These are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level. Each strategy implements a variation of the core list scheduling algorithm with different heuristics or search strategies.
| Address | Size | Strategy | Description |
|---|---|---|---|
| sub_8B1190 | 16 KB | Backtracking | Undo and retry on scheduling conflicts. Rolls back the last N steps and tries alternative orderings. Bounded depth prevents exponential blowup. |
| sub_8B2D90 | 18 KB | Global optimization | Cross-BB scheduling. Moves instructions across BB boundaries when safe (no side effects, dominance preserved). |
| sub_8B4590 | 13 KB | Permutation search | Exhaustive permutation of instruction orderings for small BBs. Falls back to heuristic for larger blocks. |
| sub_8B5400 | 14 KB | Latency-optimized | Maximizes memory latency hiding by aggressive interleaving of independent operations. |
| sub_8B6D60 | 12 KB | Pressure-optimized | Minimizes live range overlap by scheduling defs as close to their uses as possible. |
| sub_8B77C0 | 15 KB | Dual-issue | Pairs co-issuable instructions for dual-issue architectures (sm_50/Maxwell). Uses sub_A9CDE0 and sub_A9CF90 for compatibility checks. |
| sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA/BGMMA grouping for warpgroup tensor operations. |
| sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping -- interleaves iterations to fill pipeline bubbles. |
| sub_8BAAE0 | 15 KB | Loop-aware | Trip count + register awareness for loop bodies. |
| sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Inserts and schedules memory prefetch instructions. |
| sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merges adjacent barrier instructions to reduce overhead. |
| sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimizes scoreboard entries by reusing barrier registers. |
| sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning for multi-warp scheduling. |
| sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection for dual-issue slots. |
| sub_8BE320 | 25 KB | Complex combined pass | Multi-strategy combined pass for complex code patterns. |
| sub_8A9D80 | 21 KB | Depth-first | DFS-based instruction ordering for deep dependency chains. |
| sub_8AB750 | 9.8 KB | Critical path | DAG analysis for priority weight computation. |
Backtracking Scheduler
The backtracking strategy (sub_8B1190) is notable because it breaks from the greedy nature of standard list scheduling. When a scheduling decision leads to excessive stalls or resource conflicts, it can undo the last N steps (where N is bounded by a configurable depth), re-insert the affected instructions into the ready list, and try a different selection. This provides limited but effective lookahead without the full cost of optimal scheduling.
Dual-Issue Scheduling
For sm_50 (Maxwell), sub_8B77C0 pairs instructions that can execute simultaneously on different functional units. Eligibility is checked by sub_8CF5D0 (3.5 KB), which verifies architecture support and computes a dual-issue benefit score at scheduler+328. Pairing compatibility uses sub_A9CDE0 (is instruction dual-issuable?) and sub_A9CF90 (can this pair with the next instruction?).
Size Limits and Chunking
Two mechanisms prevent the scheduling algorithm from hitting quadratic complexity on large inputs:
BB Size Limit
sub_8CBAD0 scans all basic blocks during pre-scheduling setup. Any BB exceeding 4095 instructions is split by inserting scheduling barriers (sub_931920). This caps the per-BB scheduling problem size, ensuring the O(n^2) dependency graph construction remains tractable. The maximum BB size is tracked at scheduler+388.
Large Function Chunking
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is partitioned into chunks that are scheduled independently, then the results are merged. This avoids the full O(n^2) DAG construction for very large kernels. The chunk boundary selection respects BB boundaries and dependency chains to minimize cross-chunk constraint violations.
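The chunk-boundary rule can be illustrated with a greedy partition that packs whole basic blocks until the cap is reached. This is a sketch of the stated "respects BB boundaries" constraint only; the real sub_A9DDD0 also considers dependency chains, and the sizes here are hypothetical.

```python
# Greedy chunk partition: pack whole BBs until the instruction cap would
# be exceeded, then close the chunk at the BB boundary (never mid-BB).
def chunk_blocks(bb_sizes, cap=16383):
    chunks, cur, cur_n = [], [], 0
    for bb, n in enumerate(bb_sizes):
        if cur and cur_n + n > cap:   # close chunk at a BB boundary
            chunks.append(cur)
            cur, cur_n = [], 0
        cur.append(bb)
        cur_n += n
    if cur:
        chunks.append(cur)
    return chunks
```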
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan | HIGH |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps | HIGH |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas | HIGH |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup, knob query | HIGH |
| sub_685700 | -- | InitSchedulingState -- loop initialization | MEDIUM |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis | HIGH |
| sub_687080 | -- | ReadyListManagementHelper | MEDIUM |
| sub_687410 | -- | ResourceConflictCheck | MEDIUM |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry | HIGH |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode core loop | HIGH |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG | HIGH |
| sub_68B9C0 | 46 KB | MainSchedulingLoop -- combined DAG + scheduling | HIGH |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring | HIGH |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation | HIGH |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering | HIGH |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path | HIGH |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis | HIGH |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges | HIGH |
| sub_89C550 | 14 KB | InnerScheduleLoop -- inner scheduling iteration | HIGH |
| sub_89EFC0 | 16 KB | ReadyListManager -- BST management | HIGH |
| sub_8A9D80 | 21 KB | DepthFirstSchedule | MEDIUM |
| sub_8AB750 | 9.8 KB | CriticalPathCompute | MEDIUM |
| sub_8B1190 | 16 KB | ScheduleWithBacktrack | MEDIUM |
| sub_8B2D90 | 18 KB | GlobalScheduleOpt -- cross-BB scheduling | MEDIUM |
| sub_8B4590 | 13 KB | PermuteSchedule -- permutation search | MEDIUM |
| sub_8B5400 | 14 KB | ScheduleForLatency | MEDIUM |
| sub_8B6D60 | 12 KB | ScheduleForPressure | MEDIUM |
| sub_8B77C0 | 15 KB | DualIssueScheduler | MEDIUM |
| sub_8B8900 | 12 KB | TensorScheduler | MEDIUM |
| sub_8B9390 | 23 KB | SoftwarePipeline | MEDIUM |
| sub_8BAAE0 | 15 KB | LoopScheduler | MEDIUM |
| sub_8BB9C0 | 8.2 KB | PrefetchScheduler | MEDIUM |
| sub_8BC0B0 | 6.1 KB | BarrierCoalescence | MEDIUM |
| sub_8BC990 | 7.6 KB | ScoreboardOpt | MEDIUM |
| sub_8BCFA0 | 6.8 KB | WarpScheduleOpt | MEDIUM |
| sub_8BDC40 | 7.9 KB | DualIssuePairing | MEDIUM |
| sub_8BE320 | 25 KB | ComplexSchedulePass | MEDIUM |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
| sub_8C7120 | -- | BarrierTrackingUpdate | MEDIUM |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree | HIGH |
| sub_8C9320 | 47 KB | ComputePriority -- 8-bit packed heuristic | HIGH |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit | HIGH |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop | HIGH |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scan stage 1 | HIGH |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator | HIGH |
| sub_8D1730 | 19 KB | ExecuteSchedulePass | HIGH |
| sub_8D2510 | 3.6 KB | UpdateDependencies -- post-schedule dep update | HIGH |
| sub_8D3150 | 2.0 KB | CheckResourceConflict | MEDIUM |
| sub_8D32D0 | 14 KB | ScheduleInstruction -- schedule single instruction | HIGH |
| sub_8D3D60 | 1.4 KB | InsertStall | HIGH |
| sub_8D3E20 | 2.1 KB | ComputeStallCycles | HIGH |
| sub_8D4000 | 3.0 KB | InsertBarrier | HIGH |
| sub_8D5E00 | 38 KB | MainSchedulingLoop -- workhorse | HIGH |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- control word generation | HIGH |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW/memory/barrier | HIGH |
| sub_925510 | 341 bytes | MoveInstruction -- doubly-linked list relink | HIGH |
| sub_A08A00 | -- | ResourceModel -- FU cost query (3 modes) | HIGH |
| sub_A091C0 | -- | InitResourceTracking | MEDIUM |
| sub_A09530 | 365 bytes | UpdateStallCycles -- per-instruction latency update | HIGH |
| sub_A09D40 | -- | UpdateWARTracking -- anti-dependency tracking | MEDIUM |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling | MEDIUM |
Per-SM Scheduling Backends
Everything documented above describes the main scheduler (Backend A), which covers approximately 436 KB at 0x680000--0x8FE000. ptxas contains two additional complete scheduling implementations activated for newer SM architectures. The three backends coexist in the binary; SM-version-gated dispatch selects which combination runs.
Architecture Dispatch
The function sub_7DDB50 reads an SM architecture index from context+2104 and returns it as an integer. Four dispatch stubs in the 0xC5FE00--0xC61000 range use this value to select the scheduling backend:
| Dispatch Stub | Condition | Backend Selected | Pipeline Stage |
|---|---|---|---|
| sub_C5FEF0 | SmVersion > 1 | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C60910 | SmVersion > 1 && (context+1392 & 1) | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C5FFC0 | SmVersion > 1 | Backend C (RBT List), mode 1 | Pre-scheduling |
| sub_C5FFF0 | SmVersion > 1 | Backend C (RBT List), mode 0 | Post-scheduling |
When SmVersion <= 1 (sm_50 through sm_75 -- Maxwell through Turing), control falls through to the main Backend A documented in the preceding sections. When SmVersion >= 2 (sm_80+ -- Ampere, Ada Lovelace, Hopper, Blackwell), Backends B and C replace Backend A entirely.
sub_C60910 has a secondary activation path: if *(options + 23544) == 1 && *(options + 23552) != 0, Backend B activates regardless of SM version, providing a knob override for testing the codec scheduler on older architectures.
Backends B and C are complementary, not competing. Backend C handles pre-scheduling and post-scheduling (the same pipeline stages as Backend A's 3-phase ReduceReg/ILP/DynBatch), while Backend B handles a separate codec/ISel scheduling step that has no equivalent in the legacy path.
Backend B -- SM89/90 Codec Scheduler (0x1225000)
Backend B is a forward-then-backward scheduling pass with continuous floating-point priority weighting. It replaces Backend A's discrete 8-bit packed heuristic with a configurable pressure/ILP tradeoff expressed as doubles.
| Entry | sub_1233D70 (1,527 B, 321 lines) -- pass phase 5 |
| Forward scheduler | sub_122AD60 (17.5 KB, 4,118 lines) -- largest function in range |
| Backward scheduler | sub_122F650 (18.2 KB, 3,917 lines) |
| Preparation | sub_123E0D0 -- instruction characterization |
| Post-fixup | sub_A112C0 -- scheduling result finalization |
| Priority structure | BST with FNV-1a hash tracking |
| Code region | 0x1225000--0x1240000 (132 functions, 111 KB) |
Float Weighting System
The entry point sub_1233D70 initializes two pairs of floating-point weights from the options object at *(context+1664) + 72:
Pair 1 -- Pressure/ILP tradeoff (options offsets 7200/7208):
| Weight | Default | Meaning |
|---|---|---|
| pressure_weight | 1.8 | Contribution of register pressure to scheduling priority. Positive = favors orderings that reduce live register count. |
| ilp_weight | -0.8 | Contribution of instruction-level parallelism. Negative = penalizes moves that reduce available parallelism. |
The two weights sum to 1.0 and form a weighted combination on a unit scale. The default 1.8/-0.8 split heavily favors register pressure reduction, accepting moderate ILP degradation -- appropriate for register-hungry Ada Lovelace and Hopper kernels.
Pair 2 -- Secondary scoring axis (options offsets 7560/7568):
| Weight | Default | Meaning |
|---|---|---|
| forward_weight | 3.2 | Forward-looking scheduling contribution |
| backward_penalty | -2.2 | Backward-looking penalty factor |
Both pairs are overridable. When the configuration byte at the respective offset equals 3, the weight is read from the adjacent double field and the complement is computed as 1.0 - weight:
if (*(BYTE*)(options + 7200) == 3):
pressure_weight = *(double*)(options + 7208)
ilp_weight = 1.0 - pressure_weight
After loading, both weight pairs are normalized by dividing by the register range (float)(max_regs - min_regs), producing per-register slopes:
range = (float)(max_regs) - (float)(min_regs)
pressure_slope = ilp_weight / range
secondary_slope = backward_penalty / range
This normalization ensures the scheduling heuristic scales consistently regardless of the target architecture's register file size.
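The override and normalization steps above fit in one small helper. This is a sketch with a hypothetical options dict in place of the binary's options object; the defaults, the 1.0 complement, and the per-register slope mirror the text.

```python
# Sketch of weight loading: knob-override check, complement computation,
# and normalization by register range (all mirroring the text above).
def load_pressure_weights(options, max_regs, min_regs):
    if options.get("mode_7200") == 3:             # config byte at +7200 == 3
        pressure_weight = options["weight_7208"]  # double at +7208
    else:
        pressure_weight = 1.8                     # documented default
    ilp_weight = 1.0 - pressure_weight            # complement; pair sums to 1.0
    rng = float(max_regs - min_regs)
    pressure_slope = ilp_weight / rng             # per-register slope
    return pressure_weight, ilp_weight, pressure_slope
```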
Forward Pass (sub_122AD60)
The forward scheduler implements list scheduling with a BST priority queue, iterating basic blocks front-to-back. It uses FNV-1a hash tables (seed 0x811C9DC5, multiplier 16777619) for tracking scheduled instruction mappings. Instruction properties are queried via sub_7DF3A0. The function manages a ref-counted working set with proper cleanup at function exit. At 4,118 decompiled lines, it is the largest function in the 0x1225000 scheduling range.
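The seed 0x811C9DC5 and multiplier 16777619 quoted above are the standard 32-bit FNV-1a constants, so the hash the forward pass uses can be reproduced exactly:

```python
# Reference 32-bit FNV-1a: XOR each byte into the hash, then multiply
# by the FNV prime with 32-bit wraparound.
def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                       # offset basis (the quoted seed)
    for b in data:
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h
```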
Backward Pass (sub_122F650)
The backward scheduler receives the floating-point weights as direct parameters and processes basic blocks in reverse order. It calls into the barrier/scoreboard system (sub_BDC080, sub_BDBA60, sub_BDC0A0) and performs register liveness analysis via sub_A0EDE0. The function uses BST operations with left/right/parent pointer traversal and explicit rebalancing, then performs iterative tree cleanup at exit.
Backend C -- RBT List Scheduler (0x18CD000)
Backend C is a complete reimplementation of the list scheduling algorithm using a red-black tree priority queue, double-precision scoring, and an evaluate-then-commit model with hash-table solution caching. It replaces Backend A for all sm_80+ targets.
| Orchestrator | sub_1908D90 -- pre/post mode dispatch |
| Driver | sub_1906090 -- per-block scheduling loop |
| Core scheduler | sub_1902B70 (19 KB) -- RBT-based list scheduling |
| Solution evaluator | sub_1904B70 (26 KB) -- constraint check + commit |
| Constraint validator | sub_19043F0 (10 KB) -- feasibility testing |
| Pressure cost model | sub_18F3CB0 (16 KB) -- SIMD register pressure |
| Recursive cost propagation | sub_18FFD70 (23 KB) -- call-graph-aware scoring |
| Dependency update | sub_1902100 (15 KB) -- post-scheduling DAG update |
| RBT insert | sub_18FD370 -- balanced insertion with 3-key comparison |
| RBT extract | sub_18FCDA0 -- pop highest-priority node |
| RBT reset | sub_18F7EC0 -- tree cleanup |
| Score computation | sub_18FDAF0 -- double-precision weighted score |
| Hash table | sub_1906510 (14 KB) -- FNV-1a instruction ID lookup |
| Code region | 0x18CD000--0x190FFFF (392 functions, 275 KB) |
Red-Black Tree Priority Queue
The critical difference from Backend A is the priority queue data structure. Backend A uses a sorted singly-linked list (O(N) insertion per instruction). Backend C uses a red-black tree that maintains balance through rotation operations in sub_18FD170 (called at the end of every insertion).
Each RBT node is 40 bytes allocated from a pool, with node+24 pointing to the instruction's scheduling entry. The tree ordering uses a three-key comparison in sub_18FD370:
- Priority integer at scheduling_entry + 384 (descending -- higher-priority nodes are left children)
- Latency double at scheduling_entry + 368 (descending -- higher latency scheduled first among equal-priority instructions)
- Instruction ID at *(scheduling_entry + 16) + 12 (ascending -- deterministic tiebreaker)
This three-key comparison provides O(log N) insertion and extraction, a significant improvement for basic blocks with hundreds of instructions where Backend A's O(N) sorted insertion becomes a bottleneck.
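The three-key ordering can be modeled as a composite sort key: priority and latency descending, instruction ID ascending. The field names below are hypothetical stand-ins for the scheduling-entry offsets; a sorted list stands in for the red-black tree, which yields the same extraction order.

```python
# Composite key modeling the 3-key RBT comparison: negate the two
# descending keys so Python's ascending sort gives the tree's order.
def rbt_key(entry):
    return (-entry["priority"], -entry["latency"], entry["instr_id"])

nodes = [
    {"priority": 2, "latency": 4.0, "instr_id": 7},
    {"priority": 3, "latency": 1.0, "instr_id": 5},
    {"priority": 3, "latency": 9.0, "instr_id": 9},
    {"priority": 3, "latency": 9.0, "instr_id": 2},
]
extraction_order = sorted(nodes, key=rbt_key)
```

Among the priority-3 nodes, the two latency-9.0 entries tie and fall back to the ID, so extraction proceeds 2, 9, 5, then the lone priority-2 node.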
Core Scheduling Loop (sub_1902B70)
function RBTListSchedule(context, block, dep_info, bound, constraint):
InitRegisterPressure(context, block) // sub_18F8580
InitRBTree(tree) // sub_18F7EC0
for each instruction in block.instruction_list:
node = AllocPoolNode(40 bytes)
node.scheduling_entry = instruction
RBTreeInsert(tree, node) // sub_18FD370
ResizeScheduleArray(block, tree.count) // sub_18F9CC0
while tree is not empty:
best_node = RBTreeExtractMax(tree) // sub_18FCDA0
ReturnNodeToPool(best_node)
instruction = best_node.scheduling_entry
valid = vtable_check(context, block, instruction)
*(instruction + 365) = valid
UpdateDependencies(context, instruction, tree) // sub_1902100
if not valid:
InsertRejection(block + 112, instruction_id)
continue
// Record scheduling decision
position = ++(block + 360)
entry = *(block + 352) + 24 * position
entry[0] = instruction_id
entry[1] = instruction + 2 // scheduling state
entry[2] = instruction // back-pointer
// Compute and accumulate scores
latency = LookupLatency(context, instruction) // sub_18F5460
*(block + 96) += (priority - 2) * latency
*(block + 88) += latency * *(instruction + 376)
// Process successors, check conflicts (binary search on
// sorted 12-byte conflict array with 0xAAAAAAAAAAAAAAAB
// division-by-3 trick for index computation)
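The 0xAAAAAAAAAAAAAAAB constant in the conflict-array search is the standard magic number compilers emit for unsigned 64-bit division by 3: take the high 64 bits of the 128-bit product, then shift right by 1. A quick demonstration:

```python
# Reciprocal-multiplication division by 3: 3 * MAGIC == 2**65 + 1, so
# (x * MAGIC) >> 65 == x // 3 exactly for all unsigned 64-bit x.
MAGIC = 0xAAAAAAAAAAAAAAAB

def div3_u64(x: int) -> int:
    hi = (x * MAGIC) >> 64   # high half of the 128-bit product (umulh)
    return hi >> 1           # final shift by 1
```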
Evaluate-Then-Commit Model
Backend A uses a greedy approach: each scheduling decision is final. Backend C introduces a two-phase model where sub_1904B70 evaluates a proposed schedule against constraints before committing it:
- Build a candidate schedule (408-byte evaluation nodes with def/use/pred-deps/succ-deps lists)
- Validate via sub_19043F0 (checks that the scheduling mode at +64 is 5 or 6)
- Run the architecture-specific check via the vtable at *context + 16
- Verify register pressure via sub_19016E0
- Compute the score via sub_18FDAF0 (returns a double)
- If the score exceeds the threshold at context + 360, insert into the solution hash table at block + 304
This allows Backend C to explore multiple scheduling alternatives and commit only the best-scoring solution, a capability Backend A's greedy model lacks.
Recursive Cost Propagation (sub_18FFD70)
Backend C uniquely supports cross-function scheduling awareness through recursive cost propagation. sub_18FFD70 walks the call graph:
- For a given schedule entry, iterate predecessor blocks (linked list at +12/+13)
- Look up each predecessor in the scheduling map via sub_18F4E70
- If the predecessor is live (byte at +365), recursively process it
- After recursion, scan the instruction operand lists (offsets 80, 84), identifying register operands by the type tag (operand >> 28) & 7 == 1
- Clear register usage bits at reg_entry + 28
- Update the double-precision scores at offsets 88 and 96
This propagation allows scheduling decisions in callee functions to influence caller scheduling priorities -- a form of interprocedural scheduling absent from both Backend A and Backend B.
Backend Comparison
| Feature | Backend A (Main) | Backend B (SM89/90 Codec) | Backend C (RBT List) |
|---|---|---|---|
| Address range | 0x680000--0x8FE000 | 0x1225000--0x1240000 | 0x18CD000--0x190FFFF |
| Code size | ~436 KB | ~111 KB | ~275 KB |
| SM gate | SmVersion <= 1 | SmVersion >= 2 | SmVersion >= 2 |
| Pipeline stage | Pre + post scheduling | Codec/ISel scheduling | Pre + post scheduling |
| Priority encoding | 8-bit packed integer | Float-weighted BST | RB-tree (int + double + ID) |
| Priority function size | 47 KB monolithic | Distributed across weights | 3-key comparison |
| Ready list structure | Sorted singly-linked list | Binary search tree | Red-black tree |
| Insertion complexity | O(N) per instruction | O(log N) | O(log N) |
| Scheduling passes | 3 (ReduceReg / ILP / DynBatch) | 2 (Forward / Backward) | 2 (Pre / Post) |
| Pressure tracking | Bitvector + popcount | Float slope per register | SIMD bitmap + cost model |
| Weight configuration | Knobs 769--805 (integer) | Options 7200/7560 (double) | Vtable dispatch |
| Score type | Integer (packed bits) | Double (weighted sum) | Double (accumulated) |
| Solution search | Greedy (single pass) | Forward + backward | Evaluate + commit |
| Cross-function awareness | None | None | Recursive cost propagation |
| Hash infrastructure | None | FNV-1a | FNV-1a |
| Backtracking | Optional (sub_8B1190) | None | Rejection set + retry |
Backend B + C Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_1233D70 | 1.5 KB | SM89/90 CodecScheduleEntry -- pass phase 5, float weight init | HIGH |
| sub_122AD60 | 17.5 KB | ForwardCodecScheduler -- BST list scheduling, FNV-1a hash tracking | HIGH |
| sub_122F650 | 18.2 KB | BackwardCodecScheduler -- reverse pass, barrier/scoreboard integration | HIGH |
| sub_123ADD0 | 5.8 KB | CodecDependencyGraphBuilder -- dispatched via vtable | MEDIUM |
| sub_12371D0 | 3.8 KB | CodecInstructionClassifier -- convergence-based property testing | MEDIUM |
| sub_123E0D0 | -- | CodecSchedulePreparation -- instruction characterization | MEDIUM |
| sub_A112C0 | -- | CodecSchedulePostFixup -- result finalization | MEDIUM |
| sub_1908D90 | -- | RBTScheduleOrchestrator -- pre/post mode dispatch | HIGH |
| sub_1906090 | -- | RBTScheduleDriver -- per-block loop, 368-byte block stride | HIGH |
| sub_1902B70 | 19 KB | RBTCoreListScheduler -- RB-tree priority queue loop | HIGH |
| sub_1904B70 | 26 KB | RBTSolutionEvaluator -- constraint check, score threshold, hash commit | HIGH |
| sub_19043F0 | 10 KB | RBTConstraintValidator -- mode 5/6 feasibility | HIGH |
| sub_19038E0 | 15 KB | RBTInitialEvaluation -- per-block constraint bootstrapping | MEDIUM |
| sub_18F3CB0 | 16 KB | RBTPressureCostModel -- SIMD register pressure computation | HIGH |
| sub_18FFD70 | 23 KB | RBTRecursiveCostPropagation -- call-graph-aware scoring | HIGH |
| sub_1902100 | 15 KB | RBTDependencyUpdate -- post-scheduling DAG maintenance | HIGH |
| sub_18FD370 | -- | RBTreeInsert -- 3-key balanced insertion + fix-up | HIGH |
| sub_18FCDA0 | -- | RBTreeExtractMax -- pop highest-priority node | HIGH |
| sub_18F7EC0 | -- | RBTreeReset -- tree cleanup | HIGH |
| sub_18F8580 | -- | RBTRegisterPressureInit -- pressure state initialization | MEDIUM |
| sub_18F5460 | -- | RBTLatencyLookup -- vtable-dispatched latency query | MEDIUM |
| sub_18FDAF0 | -- | RBTScoreComputation -- double-precision weighted score | HIGH |
| sub_1906510 | 14 KB | RBTHashLookup -- FNV-1a instruction ID hash table | HIGH |
| sub_18FB850 | -- | RBTHashResize -- power-of-2 growth, 0.5 load factor | HIGH |
| sub_1901200 | -- | RBTScorePropagationDriver -- calls sub_18FFD70 | MEDIUM |
| sub_19081F0 | 17 KB | RBTBlockDependencyGraphBuild -- per-block DAG construction | HIGH |
| sub_19072F0 | 14 KB | RBTInterBlockScheduling -- cross-BB register dependency | MEDIUM |
| sub_18FEE60 | -- | RBTScheduleStateCreate -- 528-byte state construction | MEDIUM |
| sub_18FE320 | -- | RBTScheduleDataPrepare -- pre-scheduling data setup | MEDIUM |
| sub_18F94C0 | -- | RBTCleanup -- state teardown | MEDIUM |
| sub_C5FFC0 | -- | DispatchPreSchedule -- SM gate -> Backend C (mode 1) | CERTAIN |
| sub_C5FFF0 | -- | DispatchPostSchedule -- SM gate -> Backend C (mode 0) | CERTAIN |
| sub_C5FEF0 | -- | DispatchCodecSchedule -- SM gate -> Backend B | CERTAIN |
| sub_C60910 | -- | DispatchConditionalCodecSchedule -- SM gate + knob override | CERTAIN |
| sub_7DDB50 | -- | GetSmVersionIndex -- reads context+2104 | HIGH |
Scheduling Guidance Output
After scheduling completes, ptxas can emit statistics comments into the SASS output and DUMPIR stream. Three emitter functions produce scheduling guidance in different contexts, all reading from a shared ~1400-byte statistics object. sub_A46CE0 controls the "SCHEDULING GUIDANCE:" header that wraps per-block scheduling output. sub_A3A7E0 emits per-function statistics as # [field=value] comment lines during DUMPIR. Eight post-regalloc clones at sub_ABBA50--sub_ABEB50 emit a variant with hardware pipe names.
Verbosity Controls
Two independent verbosity mechanisms gate the output:
Scheduling guidance level at *(DWORD*)(vtable + 992):
| Level | Behavior |
|---|---|
| 0 | No scheduling guidance output |
| 1+ | "SCHEDULING GUIDANCE:" header emitted; per-block scheduling dispatched |
| 2+ | Pre-formatting hook called via vtable+816 before header emission |
| 4+ | "LOOP STATIC METRICS : " sub-header appended |
DUMPIR detail bits at context+1416:
| Bit | Mask | Behavior |
|---|---|---|
| 3 | 0x08 | Enable detailed statistics (FP16 vectorization, functional unit breakdown, throughput estimates) |
| 4 | 0x10 | Show worst-case latency: # [worstcaseLat=%f] |
| 5 | 0x20 | Show average-case latency: # [avgcaseLat=%f] |
Bits 4 and 5 are mutually exclusive -- only one latency variant is emitted.
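The bit-4/bit-5 selection can be expressed as an exact match on the two-bit field, mirroring the [COND] annotations in the format listing below (only 0x10 or 0x20 emits a latency line). The function name is hypothetical.

```python
# Select the latency comment variant from the DUMPIR detail bits:
# bits 4-5 must equal exactly 0x10 or 0x20 for a line to be emitted.
def latency_variant(detail_bits: int) -> str:
    sel = detail_bits & 0x30       # isolate bits 4-5
    if sel == 0x10:
        return "worstcaseLat"
    if sel == 0x20:
        return "avgcaseLat"
    return "none"                  # 0x00 or 0x30: no latency line
```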
Emitter Functions
| Address | Size | Identity | Confidence | Context |
|---|---|---|---|---|
| sub_A3A7E0 | 1,236 B | Statistics::emitFunctionStats | CERTAIN | Pre-regalloc DUMPIR statistics. 20+ format strings at 0x21EBF76--0x21EC3B0. Uses abstract FU names (fp, half, shared, controlFlow, loadStore). |
| sub_A46CE0 | 1,793 B | SchedulingGuidance::buildAndEmit | HIGH | Scheduling guidance header + BB classification. Walks BB array at context+296, dispatches schedulable blocks to vtable+336. |
| sub_A4B8F0 | 248 B | StatsEmitter::emitInstrRegStats | HIGH | Binary-embedded metadata. Writes record type 3 (string) into SASS code object at *(a1+1000) + *(a1+996). |
| sub_ABBA50--sub_ABEB50 | 8 x 1,771 B | PostSchedStats::emit (SM-variant) | CERTAIN | Post-regalloc statistics. 8 clones at 0x700 spacing. Format strings at 0x21FA008--0x21FA400. Uses hardware pipe names (adu, alu, cbu, fma, lsu). |
Pre-Regalloc Output Format (sub_A3A7E0)
Emitted during DUMPIR. All lines prefixed with # . Lines marked [COND] are gated by the stated condition.
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [urregs=8] [COND: SM > 0x5FFF]
# [_lat2inst=0.0]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00] [COND: +1416 bit 3]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [SharedMem Alloc thru=0.125000] [COND: value != 0.0]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [Precise inst=0]
# [worstcaseLat=87.000000] [COND: +1416 bits 4-5 == 0x10]
# [avgcaseLat=52.500000] [COND: +1416 bits 4-5 == 0x20]
# [instHint=142] [instPairs=4] [COND: instPairs != 0]
# <custom annotation> [COND: linked list at stats[55] != NULL]
Key format details: pre-regalloc uses commas between some bracket groups ([SSpillB=%d], [SRefillB=%d],) and abstract functional unit names (fp, half, trancedental, shared, controlFlow, loadStore, texLoadStore). The typo "trancedental" (missing "s") is present in the binary.
Post-Regalloc Output Format (sub_ABBA50 clones)
Emitted after scheduling by SM-variant clones dispatched via vtable. Same # prefix. Differs from the pre-regalloc format in three ways:
- No commas between bracket groups
- SpillSize replaces LowLmemSpillSize + FrameLmemSpillSize
- Hardware pipe names replace abstract unit names; MMA variant breakdown added

The unique lines are shown below; lines shared with the pre-regalloc format keep the same structure, minus the commas:
# [est latency = %d] [LSpillB=%d] [LRefillB=%d] [SSpillB=%d] [SRefillB=%d] [SpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d] [NonSpillSize=%d]
# [Occupancy = %f] [est numDivergentBranches=%d] [attributeMemUsage=%d] [programSize=%d]
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] [est fma=%d] [est half=%d]
# [est trancedental=%d] [est ipa=%d] [est lsu=%d] [est redux=%d]
# [est schedDisp=%d] [est tex=%d] [est ttu=%d] [est udp=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d] [est hmma16816=%d] [est hmma16816f16=%d]
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [issue thru=%f] [adu thru=%f] [alu thru=%f] [cbu thru=%f] [fma2x thru=%f] [fma thru=%f]
# [trancedental thru=%f] [ipa thru=%f] [lsu thru=%f] [redux thru=%f]
# [schedDisp thru=%f] [tex thru=%f] [ttu thru=%f] [udp thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f] [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [reg thru=%f] [warp thru=%f]
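The bracket-group syntax above is regular enough to parse mechanically. As a minimal sketch (not part of ptxas; the helper name is made up), a single regex handles both formats, since the pre-regalloc commas fall between groups and can simply be ignored. COND annotations and other groups without a `key=value` shape are not handled here.

```python
import re

# Matches one "[key=value]" bracket group; COND groups are not handled.
STAT_RE = re.compile(r"\[([^=\]]+?)\s*=\s*([^\]]+)\]")

def parse_stat_line(line: str) -> dict:
    """Parse a '#'-prefixed statistics line into {key: value}.

    Keys keep the binary's spelling (e.g. "trancedental thru").
    Values with a '.' become floats, otherwise ints; anything
    unparseable is kept as a string.
    """
    out = {}
    for key, val in STAT_RE.findall(line):
        val = val.strip()
        try:
            out[key.strip()] = float(val) if "." in val else int(val)
        except ValueError:
            out[key.strip()] = val
    return out

line = "# [issue thru=0.888889] [fp thru=0.111111], [trancedental thru=0.000000]"
stats = parse_stat_line(line)
```

The trailing-comma difference between the two formats never reaches the parser, because the regex only looks inside the brackets.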
Hardware Pipe Name Mapping
The post-regalloc format maps abstract functional unit names to hardware execution pipe identifiers:
| Post-Regalloc Pipe | Pre-Regalloc Equivalent | Description |
|---|---|---|
adu | -- | Address Divergence Unit (address computation) |
alu | fp | Arithmetic Logic Unit (integer + FP32 combined) |
cbu | controlFlow | Control/Branch Unit (branch, exit, barrier) |
fma2x | -- | Double-precision FMA (separate pipe on sm_80+) |
fma | fp | Fused Multiply-Add (FP32) |
half | half | FP16 operations |
lsu | loadStore + shared | Load/Store Unit (unified) |
redux | -- | Reduction Unit (warp-level reductions) |
schedDisp | -- | Scheduler Dispatch (internal overhead) |
tex | tex | Texture Unit |
ttu | -- | Tensor Texture Unit (Ada Lovelace+) |
udp | -- | Uniform Data Path operations |
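For tooling that needs to correlate the two output formats, the table above can be expressed directly as a lookup (an illustrative dict, not something present in the binary). Pipes with no pre-regalloc counterpart map to None, and lsu merges two abstract units.

```python
# Post-regalloc hardware pipe -> pre-regalloc abstract unit name(s).
# None means the pipe has no pre-regalloc equivalent; "lsu" unifies
# the abstract loadStore and shared units.
PIPE_TO_ABSTRACT = {
    "adu": None,
    "alu": "fp",
    "cbu": "controlFlow",
    "fma2x": None,
    "fma": "fp",
    "half": "half",
    "lsu": ("loadStore", "shared"),
    "redux": None,
    "schedDisp": None,
    "tex": "tex",
    "ttu": None,
    "udp": None,
}
```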
Binary-Embedded Statistics Record (sub_A4B8F0)
Separate from the DUMPIR comment output, sub_A4B8F0 writes a compact binary record into the SASS code object during emission:
Format string: "instr/R-regs: %d instructions, %d R-regs"
instructions = stats[335] - stats[341] (total minus removed)
R-regs = stats[159] + stats[102] (extra + base allocation)
Record layout in output buffer:
+0 DWORD type = 3 (string record type)
+4 DWORD string_length
+8 char[] string_content (formatted text)
The companion function sub_A4B9F0 writes record type 2 for undefined register warnings: "Referencing undefined register: %s%d".
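Putting the field arithmetic and the record layout together, a sketch of the type-3 record construction looks like the following. This is a reconstruction under assumptions (little-endian DWORDs as in the layout above; whether the string carries a trailing NUL is not established here), not the actual emitter code.

```python
import struct

def build_instr_record(stats: list) -> bytes:
    """Build the type-3 string record as sub_A4B8F0 appears to.

    stats is the DWORD-indexed statistics object viewed as ints.
    """
    instructions = stats[335] - stats[341]   # total minus removed
    r_regs = stats[159] + stats[102]         # extra + base allocation
    text = f"instr/R-regs: {instructions} instructions, {r_regs} R-regs"
    payload = text.encode("ascii")
    # +0 type=3, +4 string_length, +8 string_content
    return struct.pack("<II", 3, len(payload)) + payload
```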
Scheduling Guidance Header (sub_A46CE0)
sub_A46CE0 emits the scheduling guidance wrapper, then walks the BB array to classify and dispatch blocks for scheduling. The header is emitted into the output stream via sub_7FE930 (string builder) at context + 1440.
BB classification algorithm:
For each BB in the array at context+296 (index 0 through the count at context+304):

- Schedulable: sub_7544D0(context, bb) returns true AND sub_754510(context, bb) returns false. Dispatched immediately to scheduling via vtable+336.
- Type-8 (deferred): *(bb+16) == 8. Added to a dynamically-grown src array for second-pass processing.
- Loop back-edge: When *(bb+148) != 0 and *(bb+128) != NULL, the function walks the predecessor linked list at bb+128. For each predecessor, it checks whether the predecessor's iteration order (bb+144) exceeds the current block's, and whether the predecessor's terminal instruction is a branch (opcode 0x5D after masking with 0xFFFFCFFD) with a matching program counter at (instr+84) & 0xFFFFFF. If a back-edge is detected, the scheduling dispatch includes the back-edge source instruction as a hint parameter.
After the first pass, deferred type-8 blocks are processed in a second loop with the same back-edge detection logic.
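The two-pass control flow can be sketched structurally as follows. This is a Python stand-in for orientation only; the predicate names (is_schedulable, is_excluded, and so on) are hypothetical placeholders for sub_7544D0, sub_754510, the type-8 check, and the back-edge walk described above.

```python
def dispatch_blocks(blocks, is_schedulable, is_excluded, block_type,
                    find_back_edge, schedule):
    """Two-pass BB dispatch, mirroring the structure of sub_A46CE0."""
    deferred = []
    for bb in blocks:                       # first pass over context+296
        if is_schedulable(bb) and not is_excluded(bb):
            # dispatched immediately; back-edge source passed as a hint
            schedule(bb, hint=find_back_edge(bb))
        elif block_type(bb) == 8:
            deferred.append(bb)             # dynamically-grown src array
    for bb in deferred:                     # second pass, same back-edge logic
        schedule(bb, hint=find_back_edge(bb))
```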
Statistics Object Field Map
Both emitter families read from the same ~1400-byte statistics object. The object is accessed as a float* array; integer fields use the same DWORD index but are read as int32.
| Index | Type | Field | Description |
|---|---|---|---|
| 8 | int32 | est_latency | Estimated schedule length in cycles |
| 9 | float | FP16_vectorization_pct | Fraction of FP16 instructions vectorized |
| 10 | int32 | worstcase_latency | Worst-case latency (cast to float for output) |
| 11 | int32 | avgcase_latency | Average-case latency (cast to float for output) |
| 12 | int32 | LSpillB | Long spill byte count |
| 13 | int32 | LRefillB | Long refill byte count |
| 14 | int32 | SRefillB | Short refill byte count |
| 15 | int32 | SSpillB | Short spill byte count |
| 16 | int32 | LowLmemSpillSize | Local-memory low spill allocation |
| 17 | int32 | FrameLmemSpillSize | Frame local-memory spill allocation |
| 18 | int32 | LNonSpillB | Long non-spill byte count |
| 19 | int32 | LNonRefillB | Long non-refill byte count |
| 20 | int32 | NonSpillSize | Non-spill allocation size |
| 26 | float | Occupancy | Occupancy ratio (0.0--1.0) |
| 27 | int32 | numDivergentBranches | Estimated divergent branch count |
| 28 | int32 | attributeMemUsage | Attribute memory usage in bytes |
| 29 | int32 | programSize | Program binary size in bytes |
| 42 | int32 | preciseInst | Count of precise (non-approximate) instructions |
| 44 | int32 | UDPinst | Uniform data-path instruction count |
| 45 | int32 | vecToURConverts | Vector-to-uniform-register conversion count |
| 49 | int32 | maxLiveAtSuspend | Max live register values at suspend points |
| 50 | float | issue_thru | Overall issue throughput (fraction of peak) |
| 54 | float | fp_thru | FP32 pipe throughput |
| 57 | float | half_thru | FP16 pipe throughput |
| 58 | float | transcendental_thru | Transcendental function pipe throughput |
| 59 | float | ipa_thru | Interpolation pipe throughput |
| 61 | float | shared_thru | Shared memory pipe throughput |
| 62 | float | controlFlow_thru | Control flow pipe throughput |
| 65 | float | texLoadStore_thru | Texture and load/store pipe throughput |
| 84 | float | reg_thru | Register throughput |
| 85 | float | warp_thru | Warp throughput |
| 86 | float | sharedMemAlloc_thru | Shared memory allocation throughput |
| 87 | int32 | partiallyUnrolledLoops | Partially unrolled loop count |
| 88 | int32 | nonUnrolledLoops | Non-unrolled loop count |
| 89 | int32 | CBBoundTex | Constant-bank-bound texture count |
| 90 | int32 | PartiallyBoundTex | Partially bound texture count |
| 91 | int32 | BindlessTex | Bindless texture count |
| 92 | int32 | URBoundTex | Uniform-register-bound texture count |
| 93 | int32 | SM_architecture_enum | SM version discriminator (>0x5FFF enables UR stats) |
| 99 | int32 | uniform_reg_count | Uniform register count |
| 102 | int32 | R_reg_base | Base R-register allocation |
| 159 | int32 | R_reg_extra | Extra R-register allocation |
| 303 | int32 | est_fp | Estimated FP32 instruction count |
| 306 | int32 | est_half | Estimated FP16 instruction count |
| 307 | int32 | est_transcendental | Estimated transcendental instruction count |
| 308 | int32 | est_ipa | Estimated IPA instruction count |
| 310 | int32 | est_shared | Estimated shared memory operation count |
| 311 | int32 | est_controlFlow | Estimated control flow operation count |
| 315 | int32 | est_loadStore | Estimated load/store instruction count |
| 316 | int32 | est_tex | Estimated texture instruction count |
| 334 | int32 | est_pairs | Estimated co-issuable instruction pairs |
| 335 | int32 | total_inst | Total instruction count (before removal) |
| 336 | int32 | texInst | Texture instruction count |
| 337 | int32 | FP16inst | FP16 instruction count |
| 338 | int32 | FP16VectInst | FP16 vectorized instruction count |
| 339 | int32 | instHint | Instruction hint value |
| 340 | int32 | instPairs | Instruction pair count (also output gate) |
| 341 | int32 | removed_inst | Removed instruction count |
| 342 | int32 | tepid_inst | TEPID (texture-pending) instruction count |
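The mixed float/int32 indexing above amounts to reinterpreting the same DWORD slot two ways. A minimal sketch (assuming little-endian layout; the accessor names are made up):

```python
import struct

def read_int(blob: bytes, index: int) -> int:
    """Read statistics field `index` as a little-endian int32."""
    return struct.unpack_from("<i", blob, index * 4)[0]

def read_float(blob: bytes, index: int) -> float:
    """Read the same DWORD slot reinterpreted as a float."""
    return struct.unpack_from("<f", blob, index * 4)[0]

# Per the table: index 26 (Occupancy) is a float, index 27
# (numDivergentBranches) is an int32 at the adjacent DWORD.
blob = bytearray(1400)
struct.pack_into("<f", blob, 26 * 4, 0.5)
struct.pack_into("<i", blob, 27 * 4, 3)
```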
Cross-References
- Scheduler Overview -- 3-phase architecture, register budget, scheduling knobs
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- the allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
- DUMPIR Configuration -- DUMPIR levels that trigger statistics output
- Spilling -- spill metrics (LSpillB, SSpillB) referenced in guidance output