Instruction Scheduling
Note: This page documents the embedded ptxas copy within nvlink v13.0.88. The standalone ptxas binary has its own comprehensive wiki -- see the ptxas Reverse Engineering Reference for the full compiler reference. For the standalone ptxas scheduling pipeline, see the [ptxas Scheduling overview](../../ptxas/scheduling/overview.html), [algorithm](../../ptxas/scheduling/algorithm.html), latency model, and scoreboards.
The embedded ptxas backend in nvlink v13.0.88 contains two complete instruction scheduling subsystems: the pre-register-allocation scheduler (three named strategy modes operating on IR-level instructions) and the tepid scheduler (a post-register-allocation pipeline simulator that assigns stall counts, yield hints, and scoreboard barriers to the final SASS instruction stream). Both subsystems run per-function and per-basic-block. Together they span approximately 1.2 MB of code across three address ranges: 0x1680000--0x16E0000, 0x16F6000--0x1740000, and 0x1850000--0x186F000, plus scoreboard/dependency tracking at 0x1B40000--0x1B60000.
Overview of the Two Schedulers
The compilation pipeline invokes scheduling at two distinct points:
- Pre-RA scheduling (ScheduleInstructions and variants). Runs before register allocation on the internal IR. Its goal is to maximize instruction-level parallelism (ILP) and hide memory latency while respecting a register pressure budget. The three strategy modes -- ScheduleInstructions, ScheduleInstructionsReduceReg, and ScheduleInstructionsDynBatch -- are selected per-function based on workload characteristics and knob configuration.
- Post-RA scheduling (the "tepid" scheduler). Runs after register allocation on the SASS instruction stream with physical register assignments finalized. It models the GPU hardware pipeline, computes concrete stall counts, assigns scoreboard barriers, determines dual-issue pairing, inserts yield hints, and sets the scheduling control words that appear every 4th instruction in the SASS binary.
The two schedulers communicate indirectly through the register allocation pass: the pre-RA scheduler's instruction ordering influences register pressure, which in turn determines spill/fill counts that the post-RA scheduler must accommodate.
Pre-RA Scheduling: ScheduleInstructions
Entry Point and Driver Hierarchy
The main entry point is sub_1851DC0 (85 KB, 2,938 lines) -- ScheduleInstructions_main_driver. This is one of the largest single functions in the scheduling subsystem. It takes a compilation context, an IR module, and a function pointer, and orchestrates the entire pre-RA scheduling pass for one function.
The driver hierarchy is:
ScheduleInstructions_main_driver (0x1851DC0, 85 KB)
-> ScheduleInstructions_per_function_driver (0x1860A40, 47 KB)
-> ScheduleInstructions_per_block_driver (0x185B870, 28 KB)
-> ScheduleInstructions_block_scheduler_core (0x1867D60, 22 KB)
-> schedule_list_scheduler_core (0x1864ED0, 17 KB)
The main driver also conditionally invokes three companion passes that are tightly integrated with scheduling:
- HoistInvariants -- loop-invariant code motion, lifted out of loops before scheduling.
- OptimizeNaNOrZero -- a peephole pass that simplifies NaN-producing and zero-producing instruction sequences.
- ConvertMemoryToRegisterOrUniform -- promotes shared memory accesses to register or uniform register operations where safe.
Three Scheduling Modes
The strategy selection function at 0x1857990 (13 KB) chooses between three modes:
| Mode | Pass Name | Description | When Selected |
|---|---|---|---|
| Default | ScheduleInstructions | Maximize ILP and latency hiding | General-purpose kernels |
| ReduceReg | ScheduleInstructionsReduceReg | Minimize register pressure | When register pressure exceeds budget |
| DynBatch | ScheduleInstructionsDynBatch | Dynamic batching for throughput | CUTLASS workloads and GEMM patterns |
The mode name appears as a literal string in the binary for each pass invocation. The strategy selector at 0x1857990 checks multiple configuration knobs (10 vtable calls observed) and returns a boolean indicating whether ReduceReg mode should be used. DynBatch selection is driven by CUTLASS pattern detection.
ScheduleInstructions (default mode) uses a standard list-scheduling algorithm. The core loop at 0x1864ED0 (17 KB) maintains a ready queue, selects the highest-priority instruction from the ready set, issues it, and updates the dependency DAG. Priority is computed by the critical-path calculator and takes into account instruction latencies, functional unit pressure, and memory-hierarchy distance.
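As a rough illustration of the list-scheduling loop described above (a sketch, not the binary's actual code -- names and data layout are invented), the ready-queue-driven core can be written as:

```python
import heapq

def list_schedule(insts, preds, priority):
    """Greedy list scheduling: repeatedly issue the highest-priority
    instruction whose predecessors have all been scheduled.
    `insts` is a list of node ids, `preds` maps node -> set of
    predecessor nodes, `priority` maps node -> int (higher = sooner)."""
    remaining = {n: len(preds[n]) for n in insts}
    succs = {n: [] for n in insts}
    for n in insts:
        for p in preds[n]:
            succs[p].append(n)
    # initial ready set: instructions with no predecessors
    ready = [(-priority[n], n) for n in insts if remaining[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)   # highest priority first
        order.append(n)
        for s in succs[n]:            # release dependents
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return order
```

In the real pass the priority function is the critical-path height adjusted by latency and unit pressure; here it is an opaque integer.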
ScheduleInstructionsReduceReg reorders instructions to reduce the number of simultaneously live registers. The reordering pass at 0x185D760 (9 KB) copies instruction buffers via memcpy and rearranges them to shorten live ranges. This mode is engaged when the pre-RA scheduler detects that the default ordering would cause register allocation to spill.
ScheduleInstructionsDynBatch is specialized for dense linear-algebra kernels (CUTLASS GEMM patterns). The DynBatch heuristic at 0x185F980 (6 KB) groups instructions into dynamic batches that can be issued as a unit, maximizing throughput on the tensor core and memory pipelines. The CUTLASS workload detector at 0x1866CF0 (3.5 KB) and the CUTLASS pattern handler at 0x1868E50 (19 KB) identify and special-case these patterns.
List Scheduling Data Structures
The scheduler operates on a per-basic-block dependency DAG. Key data structures:
184-byte per-BB scheduling records. Stored in a growable array at context offset +832. Each record contains:
| Offset | Size | Field |
|---|---|---|
| +0 | 8B | basic block pointer |
| +4 | 4B | scheduling latency / timing info |
| +8 | 128B | constraint arrays for instruction positions |
| +136 | 4B | instruction count |
| +140 | 4B | scheduled instruction count |
| +144 | 4B | max register pressure seen |
| +148 | 4B | current cycle count |
| +152 | 32B | scoreboard state snapshot |
Capacity tracking at offset +840. Overflow entries use a hash-table/linked-list at offset +864. Scheduling contexts (192 bytes each) are arena-allocated at offset +848.
DAG construction. The DAG builder at 0x1858730 (12 KB) constructs dependency edges via schedule_build_dependency_edge at 0x1850760 (5 KB). Edges encode four hazard types: RAW (read-after-write), WAR (write-after-read), WAW (write-after-write), and memory ordering dependencies. The DAG is compacted by 0x1858FA0 (4.6 KB) after construction to remove redundant transitive edges.
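The four edge kinds can be sketched as a classification over register def/use sets (an illustrative model, not the decompiled logic; the memory-ordering rule here is a deliberate simplification):

```python
def classify_hazard(earlier, later):
    """Classify the dependency between two instructions as RAW, WAR,
    WAW, and/or MEM (memory ordering). Each instruction is a dict with
    'defs' and 'uses' register sets and 'mem' in ('load', 'store', None)."""
    kinds = []
    if earlier['defs'] & later['uses']:
        kinds.append('RAW')   # read-after-write
    if earlier['uses'] & later['defs']:
        kinds.append('WAR')   # write-after-read
    if earlier['defs'] & later['defs']:
        kinds.append('WAW')   # write-after-write
    # conservative memory ordering: any pair involving a store is ordered
    if earlier['mem'] and later['mem'] and 'store' in (earlier['mem'], later['mem']):
        kinds.append('MEM')
    return kinds
```

A real builder would consult alias analysis before emitting the MEM edge; the conservative rule above orders every load/store pair that includes a store.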
Ready queue. The ready-list selector at 0x18592C0 (6 KB) picks the next instruction to schedule from among all instructions whose predecessors have been scheduled. Selection priority factors include: critical-path height, instruction latency, functional unit availability, and (in ReduceReg mode) the register-pressure delta.
Register Pressure Tracking
Register pressure is tracked incrementally during scheduling by the pressure tracker at 0x185C40 (12 KB). The delta function at 0x1859F10 (4.3 KB) computes the net register pressure change from issuing a given instruction (positive for definitions, negative for last uses). In ReduceReg mode, the scheduler penalizes instructions that increase pressure beyond a target threshold.
The live-range interference checker at 0x185D4B0 (4.3 KB) detects cases where scheduling two instructions back-to-back would create an interference that register allocation cannot resolve without spilling.
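The delta computation described above ("positive for definitions, negative for last uses") reduces to a small function; this is a hedged sketch over an assumed instruction representation, not the binary's data layout:

```python
def pressure_delta(inst, live_after):
    """Net register-pressure change from issuing `inst`: +1 per new
    definition, -1 per operand whose last use this is (i.e. the
    register is not in the live-after set)."""
    delta = len(inst['defs'])          # new values become live
    for u in inst['uses']:
        if u not in live_after:
            delta -= 1                 # last use: the register dies here
    return delta
```

In ReduceReg mode, candidates whose delta pushes pressure past the target threshold are penalized in the ready-queue priority.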
CUTLASS-Aware Scheduling
Six functions specifically handle CUTLASS workloads:
| Address | Size | Function |
|---|---|---|
| 0x1866CF0 | 3.5 KB | schedule_check_cutlass_workload -- detects CUTLASS pattern |
| 0x1866FA0 | 28 KB | schedule_optimize_nan_or_zero -- NaN/zero peephole with CUTLASS awareness |
| 0x1868E50 | 19 KB | schedule_handle_cutlass_pattern -- CUTLASS GEMM scheduling |
| 0x186A9F0 | 14 KB | schedule_reorder_memory_ops -- memory operation reordering for CUTLASS |
| 0x186BE40 | 14 KB | schedule_optimize_texture_ops -- texture op scheduling for CUTLASS |
| 0x185F980 | 6 KB | schedule_dynbatch_heuristic -- DynBatch mode heuristic |
The CUTLASS detection mechanism checks function names and instruction patterns for the characteristic GEMM structure: interleaved tensor-core MMA instructions with global memory loads and shared memory stores. When detected, the scheduler applies specialized reordering that overlaps MMA computation with memory transfers, a pattern critical for achieving peak throughput on tensor cores.
HoistInvariants Pass
The HoistInvariants pass is invoked from the scheduling driver before the main scheduling loop. It is a loop-invariant code motion (LICM) pass operating at the SASS IR level.
Driver Hierarchy
HoistInvariants_analysis_driver (0x186EE80, 41 KB)
-> HoistInvariants_per_function (0x186D520, 38 KB)
-> HoistInvariants_core (0x186C7A0, 24 KB)
-> hoist_analyze_loop_body (0x1871050, 19 KB)
-> hoist_perform_transformation (0x1873580, 12 KB)
The analysis driver at 0x186EE80 (41 KB, 1,603 lines) identifies hoistable instructions and performs the actual transformation. It takes a context, an instruction, a function, and an output count pointer.
Hoisting Pipeline
1. Candidate identification. hoist_collect_candidates (0x1882F20) gathers instructions that may be loop-invariant. An instruction is a candidate if all its operands are defined outside the loop or are themselves loop-invariant.
2. Side-effect check. hoist_is_side_effect_free (0x1883590) checks whether the instruction has observable side effects (memory writes, barriers, etc.). Instructions with side effects are not hoisted.
3. Alias analysis. hoist_analyze_memory_aliasing (0x1886360, 13 KB, 13 vtable calls) queries the alias analysis infrastructure to determine whether hoisting a load past a store is safe. hoist_check_alias_safety (0x1886B70) provides the final safety verdict.
4. Cost-benefit analysis. hoist_compute_cost_benefit (0x1873030) weighs the benefit of removing an instruction from the loop body against the cost of increasing register pressure in the loop preheader. The cost model uses floating-point arithmetic.
5. Transformation. hoist_perform_transformation (0x1873580, 12 KB) moves the instruction from its original basic block to the loop preheader via hoist_insert_at_preheader (0x1884B80). Phi nodes are updated by hoist_update_phi_nodes (0x1882A70).
6. Liveness update. hoist_update_liveness (0x1876BC0) recomputes live-in/live-out sets for affected blocks.
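The invariance criterion from the candidate-identification step ("all operands defined outside the loop or themselves loop-invariant") naturally iterates to a fixed point, since hoisting one instruction can make its dependents invariant. A minimal sketch, assuming side-effect-free single-def instructions (not the decompiled implementation):

```python
def is_hoistable(inst, loop_defs, side_effect_free):
    """An instruction may be hoisted when it is side-effect free and
    every operand is defined outside the loop (or already hoisted)."""
    return side_effect_free(inst) and all(u not in loop_defs for u in inst['uses'])

def hoist_invariants(loop_body, side_effect_free):
    """Move invariant instructions to a preheader list, iterating to a
    fixed point. Assumes each register has a single definition."""
    loop_defs = {d for i in loop_body for d in i['defs']}
    preheader, changed = [], True
    while changed:
        changed = False
        for inst in list(loop_body):
            if is_hoistable(inst, loop_defs, side_effect_free):
                loop_body.remove(inst)
                preheader.append(inst)
                loop_defs -= inst['defs']   # its results are now loop-external
                changed = True
    return preheader
```

The real pass additionally runs the alias, cost-benefit, phi-update, and liveness steps before and after the move.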
The pass is CUTLASS-aware: multiple helper functions (hoist_handle_shared_memory at 0x1874B40, hoist_handle_texture_ops at 0x1877200, hoist_handle_special_instructions at 0x1877BF0) contain special handling for shared-memory, texture, and other CUTLASS-relevant instruction patterns.
OptimizeNaNOrZero Pass
The OptimizeNaNOrZero peephole runs immediately after invariant hoisting. Its entry point is 0x1866FA0 (28 KB) with a core implementation at 0x187C80 (18 KB). The pass identifies floating-point instruction sequences that produce NaN or zero results and simplifies them.
Key sub-passes:
- nan_zero_check_operand_pattern (0x187AA90) -- matches known NaN-producing patterns (e.g., 0.0 * x where x may be infinity).
- nan_zero_propagate_through_phi (0x187DDD0, 13 KB) -- propagates NaN/zero knowledge through phi nodes across basic block boundaries.
- nan_zero_speculative_elimination (0x187EB20) -- speculatively eliminates NaN checks when the producer is known to be non-NaN.
- nan_zero_transform_branch (0x187C0C0, 9 KB) -- simplifies conditional branches that test for NaN or zero when the condition is statically determinable.
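The 0.0 * x pattern illustrates why the pass needs operand knowledge: the fold to zero is only sound when the other operand is provably finite and non-NaN, since 0 * inf is NaN under IEEE 754. A sketch of that one rule (function name and representation are invented):

```python
def simplify_zero_mul(op, lhs, rhs, known_finite):
    """Fold an fmul with a 0.0 operand to +0.0, but only when the
    other operand is in the set of values known finite and non-NaN.
    Returns the folded constant, or None if the multiply must stay."""
    if op != 'fmul':
        return None
    for a, b in ((lhs, rhs), (rhs, lhs)):
        if a == 0.0 and b in known_finite:
            return 0.0
    return None   # other operand may be inf/NaN: 0 * inf == NaN
```

The phi-propagation sub-pass exists precisely to grow the known-finite set across basic block boundaries, enabling more of these folds.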
Tepid Scheduler (Post-RA)
The tepid scheduler is NVIDIA's post-register-allocation instruction scheduler. It operates on the final SASS instruction stream with physical register assignments, modeling the actual hardware pipeline to produce optimal scheduling control words. The name "tepid" appears in internal string references ("TepidMacUtil", "TepidTime") and refers to a "warm" scheduling approach -- not as aggressive as full software pipelining, but more than simple in-order emission.
Address Ranges
The tepid scheduler spans two regions:
- 0x16F6000--0x1740000 (~296 KB, ~80 functions): the core tepid scheduling engine, including the main loop, pipeline model, resource tracking, scoreboard management, software pipelining, and statistics.
- 0x1B40000--0x1B60000 (~128 KB, ~10 functions): scoreboard dependency tracking and optimization, control-word building.
Main Loop and Pipeline Model
The tepid scheduler's main loop is sub_17027F0 (38 KB, 1,216 lines) -- tepid_scheduler_main_loop. It walks basic-block instruction lists from begin to end, simulating the GPU execution pipeline.
For each basic block:
- Count instructions using opcode classification (sub_17662F0, opcode 443).
- Initialize min/max timing values (min initialized to 0x7FFFFFFF).
- Iterate through instructions, tracking:
  - v447/v448: instruction counts
  - v439/v445: min/max execution times
  - v436/v444: LDS (shared memory load) timing
  - v438/v440: LDG (global memory load) timing
  - v437/v443: Xu64 (64-bit extended) timing
- Call __popcountdi2 for population count on register masks (register set analysis).
- Use vtable callbacks at a1+16 for ISA-specific scheduling decisions.
The main loop references four distinct loop categories in its string context:
- "For Mac Loop" -- multiply-accumulate dominated loops (tensor/GEMM)
- "For Dma Loop" -- DMA/memory-transfer dominated loops
- "For Math Loop" -- general ALU-dominated loops
- "For Epilogue" -- loop epilogue regions
Pipeline Simulation Components
| Address | Size | Function | Role |
|---|---|---|---|
0x16F7370 | 5 KB | scheduler_latency_calculator | Compute per-instruction latencies |
0x16F7830 | 5 KB | scheduler_resource_tracker | Track functional unit availability |
0x16F7BB0 | 4 KB | scheduler_dependency_checker | Check data dependencies between instructions |
0x16F7F70 | 4 KB | scheduler_stall_detector | Detect pipeline stalls from resource conflicts |
0x16F8640 | 5 KB | scheduler_reuse_tracker | Track register reuse opportunities |
0x16F8B80 | 4 KB | scheduler_issue_slot_manager | Manage instruction issue slot assignment |
0x16FF350 | 8 KB | scheduler_cycle_counter | Count execution cycles |
0x16FF8F0 | 13 KB | scheduler_pipeline_modeler | Model GPU pipeline state (539 lines) |
The pipeline modeler at 0x16FF8F0 is the core simulation engine. It models the GPU's execution pipeline as a set of functional unit queues with known latencies. Each instruction is placed into the appropriate queue based on its functional unit class, and the pipeline state is advanced cycle-by-cycle.
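The queue-per-functional-unit idea can be sketched as a tiny cycle model (an assumption-laden illustration, not the decompiled modeler; the class and field names are invented):

```python
class PipelineModel:
    """Each functional unit issues one instruction per cycle and has a
    fixed result latency; results gate dependent issues."""
    def __init__(self, latencies):
        self.latencies = latencies                   # unit -> completion latency
        self.busy_until = {u: 0 for u in latencies}  # next free issue cycle
        self.result_ready = {}                       # dest register -> ready cycle

    def issue(self, cycle, unit, srcs, dst):
        """Issue at the earliest cycle when the unit is free and all
        source operands are ready; return that issue cycle."""
        start = max([cycle, self.busy_until[unit]] +
                    [self.result_ready.get(s, 0) for s in srcs])
        self.busy_until[unit] = start + 1            # one issue slot per cycle
        self.result_ready[dst] = start + self.latencies[unit]
        return start
```

The gap between `cycle` and the returned issue cycle is exactly the stall count the post-RA scheduler would need to encode for that instruction.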
Functional Unit Models
The tepid scheduler contains dedicated handlers for each major GPU functional unit:
| Address | Size | Handler |
|---|---|---|
0x1722910 | 6 KB | Math/ALU unit modeler |
0x1722C40 | 12 KB | Texture unit handler |
0x1724230 | 6 KB | Shared memory handler |
0x1724F10 | 7 KB | Global memory handler |
0x17271C0 | 7 KB | Special function unit (SFU) handler |
0x1727560 | 5 KB | Tensor core (MMA) handler |
0x1727BC0 | 11 KB | Tensor core latency model |
0x1728320 | 5 KB | DMA engine handler |
0x17287E0 | 11 KB | DMA latency model |
Each handler models the specific pipeline depth, throughput, and resource constraints of its target unit. The tensor core and DMA handlers have separate latency models (11 KB each), reflecting the complexity of modeling asynchronous, multi-cycle operations.
Knob 610: Scheduling Aggressiveness
The tepid scheduler's behavior is controlled by knob 610, queried via vtable dispatch. The block processor at 0x16F35A0 (36 KB) checks this knob to select scheduling aggressiveness:
- Knob 610 = 0: Scheduling disabled (early return).
- Knob 610 = 1: Standard tepid scheduling.
- Knob 610 = 2+: Progressively more aggressive scheduling with deeper pipeline lookahead.
The block processor also checks the architecture capability at knob-table offset +43920 to gate SM-version-specific scheduling features.
Math-to-DMA Ratio Balancing
A critical heuristic in the tepid scheduler is the math-to-DMA ratio balancer. Two named ratios appear in string references:
- MathToDmaWaitRatio -- the ratio of math instruction cycles to DMA wait cycles. When this ratio is too low, the scheduler has insufficient math work to hide DMA latency.
- MathToDmaTepidRatio -- the tepid scheduler's target ratio. The scheduler attempts to interleave math and DMA instructions to approach this target.
Additional ratios for epilogue regions:
- MathToEpilogueWaitRatio
- MathToEpilogueTepidRatio
The DMA-math balancer at 0x1717A00 (12 KB) and the epilogue optimizer at 0x1729F10 (19 KB) compute these ratios and adjust instruction ordering to maximize latency hiding.
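How the wait ratio might be computed is a guess from the string names alone; the sketch below assumes it is simply math cycles over DMA wait cycles, with a low value meaning there is not enough math to cover the transfers:

```python
def math_to_dma_wait_ratio(insts, latency):
    """Hypothetical MathToDmaWaitRatio: cycles of math work available
    to overlap with DMA wait cycles. Instruction class is inferred
    from a name prefix purely for the sketch."""
    math_cycles = sum(latency[i] for i in insts if i.startswith('math'))
    dma_wait = sum(latency[i] for i in insts if i.startswith('dma'))
    return math_cycles / dma_wait if dma_wait else float('inf')
```

A ratio below the tepid target would push the balancer to pull more math instructions between a DMA issue and its first consumer.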
Software Pipelining
The tepid scheduler includes software pipelining support for loop bodies:
| Address | Size | Function |
|---|---|---|
0x17130F0 | 11 KB | scheduler_software_pipeline -- main software pipelining |
0x1713930 | 8 KB | scheduler_software_pipeline_helper |
0x1714870 | 4 KB | scheduler_modulo_schedule_helper |
0x17151D0 | 6 KB | scheduler_iteration_interval -- compute initiation interval |
0x17157F0 | 5 KB | scheduler_stage_assignment -- assign pipeline stages |
0x1712B70 | 4 KB | scheduler_loop_rotation -- loop rotation for pipelining |
The iteration interval calculator at 0x17151D0 computes the minimum initiation interval (II) based on resource constraints and recurrence constraints. The stage assignment at 0x17157F0 assigns each instruction to a pipeline stage for modulo scheduling. Loop rotation at 0x1712B70 transforms the loop structure to enable overlapped execution of iterations.
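The minimum initiation interval is the classic lower bound from modulo-scheduling theory: the maximum of the resource-constrained MII and the recurrence-constrained MII. A textbook sketch (the binary's exact formula is not recovered):

```python
import math

def min_initiation_interval(uses_per_unit, unit_count, recurrences):
    """Lower bound on the II for modulo scheduling.
    uses_per_unit: unit -> number of instructions needing it per iteration
    unit_count:    unit -> number of copies of that unit
    recurrences:   list of (total_latency, iteration_distance) per
                   dependence cycle in the loop."""
    res_mii = max(math.ceil(n / unit_count[u]) for u, n in uses_per_unit.items())
    rec_mii = max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)
```

The scheduler then tries to find a modulo schedule at this II, incrementing it on failure before assigning pipeline stages.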
Dual-Issue Optimization
The dual-issue checker at 0x16FB800 (6 KB) and dual-issue optimizer at 0x170EB20 (7 KB) identify instruction pairs that can be issued simultaneously on the GPU's dual-issue-capable pipelines. Constraints include:
- Instructions must use different functional units.
- No data dependencies between the pair.
- No register bank conflicts (checked by
0x1725C10, 8 KB). - Instruction formats must be compatible with the dual-issue encoding.
Latency Hiding Statistics
The latency hiding analyzer at 0x16F9980 (15 KB) computes and reports scheduling quality metrics:
LDS latency hiding: Num=..., Avg=..., Min=...
LDG latency hiding: Num=..., Avg=..., Min=...
Xu64 latency hiding: Num=..., Avg=..., Min=...
Antidep latency hiding: Num=..., Avg=..., Min=...
These statistics measure how effectively the scheduler has hidden the latency of shared memory loads (LDS), global memory loads (LDG), 64-bit extended operations (Xu64), and anti-dependency stalls. The anti-dependency resolver at 0x16F9080 (8 KB) specifically handles anti-dependency (WAR) latency hiding by inserting register renaming where possible.
Per-Block Statistics
The block statistics collector at 0x16FAD00 (10 KB) tracks the following per-basic-block scheduling metrics:
| Field | Description |
|---|---|
tSubBb | Number of scheduling sub-regions within the basic block |
HeaderBb | Whether this block is a loop header |
Nvopts | Number of nvopt-level optimizations applied |
LsuResBusy | LSU (load-store unit) resource busy cycles |
Time | Total estimated execution time |
TepidTime | Time spent in tepid scheduling |
MacInsts | Number of multiply-accumulate instructions |
MacReuses | Number of MAC register reuse opportunities exploited |
Scoreboard and Dependency Tracking
NVIDIA GPUs use a hardware scoreboard mechanism to track instruction dependencies. Each in-flight instruction is assigned a scoreboard barrier (an integer ID from a limited pool). Subsequent instructions that depend on a pending result must wait on the corresponding barrier. The scheduler must assign barriers efficiently to avoid both correctness violations and unnecessary stalls.
Hardware Scoreboard Model
The SM target configuration at 0x1A83FB0 sets the maximum scoreboard count:
- SM70--SM89: up to 63 scoreboard entries (offset +616).
- SM100+ (Blackwell): up to 255 scoreboard entries (offset +616).
Each scoreboard entry tracks one in-flight instruction and its expected completion cycle. The DEPBAR instruction in SASS encodes which barriers to wait on, and the control word's stall count encodes how many additional cycles to stall before issuing the next instruction.
Scoreboard Management Functions
The scoreboard subsystem at 0x1B40000--0x1B60000 contains:
| Address | Size | Function |
|---|---|---|
0x1B40920 | 38 KB | scoreboard_dependency_tracker -- main dependency tracking (1,256 lines) |
0x1B41E10 | 23 KB | wait_barrier_optimizer -- reduce unnecessary waits |
0x1B42E30 | 22 KB | yield_optimization_pass -- optimize yield hints |
0x1B43E30 | 14 KB | stall_count_propagation -- propagate stall counts |
0x1B44940 | 13 KB | control_word_builder -- build SASS control words |
The dependency tracker at 0x1B40920 (38 KB, 1,256 lines) is the core of scoreboard management. It maintains read/write barrier state, tracks instruction completion status, and updates the scoreboard through simulated instruction execution.
Barrier Assignment
The barrier assignment pass at 0x1A63610 (14 KB) assigns scoreboard barrier IDs to instructions. The algorithm:
- Maintain a pool of free barrier IDs.
- When an instruction with a long-latency result is issued, allocate a barrier ID and record the expected completion cycle.
- When a dependent instruction is reached, insert a wait on the allocated barrier.
- When the barrier is no longer needed (all dependents have waited), return the ID to the free pool.
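The allocate/wait/free cycle above maps onto a small ID-pool data structure; this is an illustrative model of the described algorithm, not the decompiled pass:

```python
class BarrierPool:
    """Scoreboard barrier IDs: allocate one per long-latency result,
    wait on it at each dependent, return it to the free pool once
    every dependent has waited."""
    def __init__(self, n_barriers):
        self.free = list(range(n_barriers))
        self.pending = {}                  # barrier id -> dependents left

    def allocate(self, n_dependents):
        bid = self.free.pop(0)             # take the lowest free ID
        self.pending[bid] = n_dependents
        return bid

    def wait(self, bid):
        self.pending[bid] -= 1
        if self.pending[bid] == 0:         # all dependents satisfied
            del self.pending[bid]
            self.free.append(bid)
```

When the pool is exhausted, a real scheduler must either stall the new instruction or merge its wait onto an existing barrier, which is the situation the scoreboard pressure diagnostics flag.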
The barrier optimizer at 0x1A64080 (15 KB) post-processes assignments to reduce the total number of active barriers and eliminate waits that are already satisfied by earlier waits.
SASS Control Word Encoding
SASS instructions are grouped into bundles of three instructions plus one control word. The control word encodes scheduling information for each of the three instructions in the bundle:
Control word (64 bits per instruction slot):
Bits [3:0] = stall count (0-15 cycles)
Bit [4] = yield hint (1 = suggest warp switch)
Bits [9:5] = write barrier index (which scoreboard to signal on completion)
Bits [14:10] = read barrier mask (which scoreboards to wait on)
Bits [20:15] = barrier count (dual-issue marker, reserved bits)
The control-word builder at 0x1B44940 (13 KB) assembles these fields. The stall-count optimizer at 0x1B1CB00 (11 KB) reduces stall counts by analyzing actual dependency distances -- if the dependent instruction is far enough away in the instruction stream, the stall can be reduced or eliminated.
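Packing one instruction slot per the bit layout documented above is straightforward; the function below follows that layout exactly (the field semantics are as stated in the text, and the name is invented):

```python
def encode_slot(stall, yield_hint, wb_idx, rb_mask, extra):
    """Pack one control-word slot:
    bits [3:0]   stall count (0-15)
    bit  [4]     yield hint
    bits [9:5]   write barrier index
    bits [14:10] read barrier mask
    bits [20:15] barrier count / dual-issue marker"""
    assert 0 <= stall <= 15 and 0 <= wb_idx <= 31 and 0 <= rb_mask <= 31
    return (stall
            | (yield_hint & 1) << 4
            | wb_idx << 5
            | rb_mask << 10
            | (extra & 0x3F) << 15)
```

For example, a 4-cycle stall with a yield hint, signalling barrier 3 and waiting on barriers 0 and 2 (mask 0b00101), encodes as 0x1474.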
Scoreboard Pressure Analysis
The scoreboard pressure analyzer at 0x1A8A5B0 (11 KB) produces diagnostic output under the heading "SCOREBOARD PRESSURE GUIDANCE":
SCOREBOARD PRESSURE GUIDANCE (N SBs):
All Insts: ...
All Unordered-VQ Insts: ...
Unordered-VQ INST Stat: ...
Unordered VQ Stat: ...
The companion reporter at 0x1A8ABC0 (11 KB) produces ": SbOverload" and ", SbStallDiff" metrics. These diagnostics identify situations where scoreboard pressure is causing performance loss -- when the number of in-flight instructions exceeds the available scoreboard entries, the scheduler must serialize operations.
Scheduling Guidance Output
The scheduling guidance system at 0x19C1A70 (10 KB) produces the "SCHEDULING GUIDANCE:" output section in the compiler's diagnostic output. This includes:
- Estimated latency per function and per loop.
- Bottleneck identification -- which functional unit is the throughput bottleneck.
- Resource utilization estimates for all major functional units.
- LOOP STATIC METRICS -- per-loop scheduling statistics.
The guidance computation at 0x19C2740 (5 KB) identifies the bottleneck unit, and the resource usage calculator at 0x19C2B50 (5 KB) estimates utilization of each functional unit category: ADU, ALU, CBU, FMA, FMA2x, HALF, transcendental, IPA, LSU, REDUX, TEX, TTU, UDP.
Interaction with Register Pressure
The scheduling and register allocation passes form a feedback loop. The key interaction points:
1. Pre-RA scheduler produces an instruction ordering. The ScheduleInstructions pass reorders instructions to maximize ILP. This ordering determines which virtual registers are simultaneously live.
2. Register allocator allocates physical registers. If pressure exceeds the available register budget, the allocator spills to local memory (or shared memory on SM50+). The spilling report at 0x18F8D80 records the outcome.
3. ReduceReg retry. If the default scheduling produces too many spills, the scheduler can be re-invoked with ScheduleInstructionsReduceReg to reduce pressure at the cost of ILP. The strategy selector at 0x1857990 makes this decision.
4. Tepid scheduler works with final register assignments. The post-RA tepid scheduler sees the actual physical register numbers and can optimize register reuse flags and bank conflict avoidance.
5. Scheduling guidance feeds back to the user. The SCHEDULING GUIDANCE and REGALLOC GUIDANCE output sections at 0x19C0000--0x1A00000 report the combined effect of scheduling and register allocation decisions.
Reconstructed Pre-RA Scheduling Algorithm
The following pseudocode is reconstructed from the decompiled binary. Addresses are given for cross-reference with the decompiled sources.
Initialization: DAG Setup and Instruction Classification (sub_1864ED0, lines 128--338)
Before the scheduling loop runs, the list-scheduler core performs per-basic-block initialization. Each instruction in the block is assigned a scheduling record, classified, and linked into the dependency DAG.
function list_scheduler_init(sched_ctx):
// sched_ctx is the 840+ byte scheduling context
max_reg_pressure_threshold = knob_table[624]
sched_ctx.has_barrier = false
sched_ctx.barrier_count = 0
sched_ctx.pending_list = null
sched_ctx.completed_flag = false
sched_ctx.reduce_reg_mode = false
// Clear the ready-set bitset
clear_bitset(sched_ctx.ready_set) // offset +248
// Max instructions before ready-set overflow (knob gated):
// if arch_capability <= 28671: max_in_flight = 64
// else: max_in_flight = INT_MAX (0x7FFFFFFF)
// Further overridden by knob at offset +55944
if knob_table[55944] == 1:
max_in_flight = knob_table[55952]
// Phase 1: walk every instruction in the basic block
inst_index = 0
has_many_calls = false
has_many_textures = false
has_many_shared = false
call_count = 0
for each instruction inst in basic_block:
dag_node = inst.dag_info // 40-byte node at inst+40
dag_node.block_position = inst_index
inst.schedule_order = -1 // unscheduled sentinel
// Allocate 84-byte per-instruction scheduling record (offset +672)
sched_rec = sched_ctx.records[inst_index] // 48-byte record array at +280
clear(sched_rec) // zero all 20 DWORDs
// Build dependency edges: call sub_185D760 (register pressure tracker)
track_register_operands(sched_ctx, inst, &pressure_lo, &pressure_hi)
// Classify instruction
opcode = inst.opcode & 0xFFFFCFFF // mask out irrelevant bits
if opcode == 96 and sched_ctx.has_cutlass:
// CUTLASS-relevant instruction: add to pending list
node = arena_alloc(16)
node.inst = inst
node.next = sched_ctx.pending_list // offset +264
sched_ctx.pending_list = node
// Check if instruction is a call/barrier via vtable dispatch
is_barrier = vtable[45](sched_ctx.target, inst)
dag_node.flags.is_barrier = is_barrier
if is_barrier:
call_count++
sched_ctx.has_barrier = true
has_many_calls |= ((inst.flags >> 2) ^ 1) & 1
// Walk operands to classify destination registers
for each operand op in inst.operands:
tag = (op >> 28) & 7
if tag != 1: // not a register reference
continue
if op < 0: // definition (negative tag)
reg = lookup_register(op & 0xFFFFFF)
if reg.type == 4: // call-clobbered register class
sched_ctx.barrier_count++
dag_node.max_pred_height = max(dag_node.max_pred_height,
block_position)
if reg.type == 5: has_many_calls++
if reg.type == 2: has_many_textures++
if reg.type == 3: has_many_shared++
else: // use (positive tag)
reg = lookup_register(op & 0xFFFFFF)
if reg.live_range > max_in_flight and reg.type == 6:
dag_node.flags.has_long_dep = true
inst_index++
DAG Construction and Critical-Path Computation (sub_1864ED0, lines 340--683)
After initialization, the scheduler computes scheduling heights (critical path from each instruction to the block exit), builds predecessor/successor bitsets, and propagates dependency weights.
function compute_dag_heights(sched_ctx):
// Phase 2: compute scheduling height for each instruction.
// Height = (target_max_latency - instruction_latency) accounting for
// memory hierarchy distance.
//
// The base latency comes from the ISA model (vtable call),
// adjusted by register-class-specific penalties.
max_latency_in_block = 0
for each instruction inst in basic_block (forward):
dag_node = inst.dag_info
sched_rec = sched_ctx.records[dag_node.block_position]
// Union predecessor dependency bitsets into this node
for each predecessor pred of inst:
bitset_union(sched_rec.dep_set, pred_sched_rec.dep_set)
// Compute base instruction latency via ISA model
isa_info = inst.isa_descriptor // from inst+32
base_latency = isa_info[2] // latency field
if base_latency > max_latency_in_block:
max_latency_in_block = base_latency
// Add memory-hierarchy adjustment based on instruction class flags:
// has_many_textures -> add isa_info[4] (texture penalty)
// has_many_shared -> add isa_info[6] (shared mem penalty)
// has_many_calls -> add isa_info[5] (call penalty)
adjusted_latency = base_latency
if sched_ctx.has_many_textures:
adjusted_latency += isa_info[4]
if sched_ctx.has_many_shared:
adjusted_latency += isa_info[6]
if sched_ctx.has_many_calls:
adjusted_latency += isa_info[5]
adjusted_latency = scale_latency(sched_ctx, adjusted_latency)
// sub_167AA60
// Scheduling height = max_latency - adjusted_latency
// (earlier instructions get higher heights -> higher priority)
sched_rec.height = max_latency_in_block - adjusted_latency
// Check predecessors for "schedulable" status
num_preds = count_predecessors(inst)
if num_preds == 0:
// Instruction has no unscheduled predecessors -> initially ready
sched_rec.flags |= READY
sched_rec.successor_pressure = 0
else:
// Count how many pred operand types force serialization
single_pred = (count of non-register predecessors <= 1)
sched_rec.flags.single_pred = single_pred
// Accumulate successor pressure:
// For each predecessor that is itself ready and has
// a compatible dependency, add its height + successor weight
successor_pressure = 0
for each predecessor pred:
if pred is READY and dependency_is_compatible(inst, pred):
pred_rec = sched_ctx.records[pred.block_position]
successor_pressure += pred_rec.height + pred_rec.weight
pred_rec.flags |= HAS_READY_SUCCESSOR
sched_rec.successor_pressure = successor_pressure
// Net register delta: positive if this instruction is a net producer,
// negative if net consumer.
// height + successor_pressure < 0 => instruction is a net consumer
// -> mark as UNREADY (needs predecessors to complete first)
// -> add to blocked set
if sched_rec.height + successor_pressure < 0:
add_to_blocked_set(sched_ctx, dag_node)
// Also compute total_blocked_height for scheduling priority
sched_rec.total_blocked_height = sum of all successors' heights
else:
// Ready: add to ready bitset
clear_bit(sched_rec.flags, UNREADY)
set_bit(sched_ctx.ready_set, dag_node.block_position)
set_bit(sched_rec.dep_set, dag_node.block_position)
// Backward cross-block dependency edges:
// Walk operand definitions backwards; if a def's block_position
// is earlier, add a dependency edge from that def to this use.
for each operand op in inst.operands (backwards):
if op is register_ref and op.def_block_position < inst.block_position:
bitset_union(records[op.def_block_position].dep_set,
sched_rec.dep_set)
// Store final priority metrics
sched_rec.weight = successor_pressure (negated for consumers)
sched_rec.tiebreak = 0
sched_rec.cumulative_height = 0
// Phase 3: propagate heights through the DAG
// For instructions with positive height, propagate to successors.
// For instructions with negative height (consumers), propagate
// the negated height as a "pull" to predecessors.
for each instruction inst in basic_block:
sched_rec = sched_ctx.records[inst.block_position]
if sched_rec.height != 0:
for each successor succ in sched_rec.dep_set:
succ_rec = sched_ctx.records[succ]
if sched_rec.height >= 0:
succ_rec.tiebreak += sched_rec.height
else:
succ_rec.weight -= sched_rec.height
// Phase 4: handle CUTLASS pending instructions
if sched_ctx.pending_list is not null:
bitset_subtract(sched_ctx.cutlass_set, sched_ctx.ready_set)
for each inst in sched_ctx.pending_list:
inst_rec = sched_ctx.records[inst.block_position]
bitset_union(sched_ctx.ready_set, inst_rec.dep_set)
// Phase 5: finalize the scheduling window
// Mark the last instruction as the scheduling boundary.
sched_ctx.schedule_limit = -1
for inst = sched_ctx.last_inst; inst != null; inst = inst.prev:
if not is_pseudo_instruction(inst):
break
retire_instruction(sched_ctx, inst) // sub_185DC90
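The Phase-2 readiness rule above (height plus accumulated successor pressure below zero marks a net register consumer) can be sketched in a few lines of Python. The record fields and function name here are illustrative, not the binary's actual 48-byte record layout:

```python
def classify_ready(records):
    """records: list of dicts with 'height' and 'successor_pressure'."""
    ready, blocked = set(), set()
    for pos, rec in enumerate(records):
        if rec["height"] + rec["successor_pressure"] < 0:
            blocked.add(pos)   # net consumer: must wait on predecessors
        else:
            ready.add(pos)     # net producer / neutral: issue-eligible
    return ready, blocked

ready, blocked = classify_ready([
    {"height": 2, "successor_pressure": 1},   # producer  -> ready
    {"height": -3, "successor_pressure": 1},  # consumer  -> blocked
    {"height": 0, "successor_pressure": 0},   # neutral   -> ready
])
```

A consumer only becomes ready once retirement of its predecessors has pushed its accumulated weight back to non-negative.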
Ready-Queue Selection and Batch Window (sub_18592C0)
The ready-queue selector picks the next instruction to issue from the ready set. It uses a batch-window heuristic to bound how many candidates are examined per selection step.
function select_next_instruction(sched_ctx, schedule_state):
// Reset batch window state
sched_ctx.min_batch_height = INT_MAX // +488
sched_ctx.best_pressure = -1 // +516
sched_ctx.max_latency_seen = 0 // +496
sched_ctx.batch = empty // +536 array of pointers
// Copy the sorted ready-list snapshot into a traversal iterator
copy_sorted_readyset(sched_ctx.iterator, schedule_state.sorted_set)
// Compute batch window size:
// target_batch = min(max_batch, num_ready_instructions)
// If no anti-dep pressure and target < num_ready:
// batch_window = num_ready / ceil(num_ready / target) [balanced]
// or num_ready / 2 if slightly over
target_batch = sched_ctx.max_batch_size // +404
num_ready = sched_ctx.num_ready // +396
batch_window = min(target_batch, num_ready)
if target_batch < num_ready and not sched_ctx.has_anti_dep_pressure:
if num_ready >= 2 * target_batch:
batch_window = num_ready / ceil(num_ready / target_batch)
else:
batch_window = num_ready / 2
// Walk the ready set in priority order (sorted by height descending)
batch_count = 0
for each candidate in ready_set (priority order):
latency = scale_latency(sched_ctx, sched_ctx.current_max_pressure)
// Skip candidates that would violate scoreboard constraints
if batch_count > 0:
opcode = candidate.opcode & 0xFFFFCFFF
if opcode == 96 and sched_ctx.has_cutlass:
// CUTLASS barrier: stop batching, yield to next block
sched_ctx.yield_flag = true
break
// Check if candidate would cause a scoreboard stall
if candidate.dag_info.min_height <= sched_ctx.min_batch_height:
for each scoreboard entry sb in sched_ctx.scoreboard_window:
sb_slot = sb.dag_info.slot_id
if sb_slot < (candidate.dep_mask & 0x7FFFFFFF)
and bit_is_set(candidate.dep_bitset, sb_slot):
// Candidate would stall on this scoreboard entry
break to COMMIT_CANDIDATE
// Check if candidate has no outstanding dependencies
if candidate has no live predecessors:
goto SKIP_TO_NEXT
// Record highest latency seen
if candidate.isa_info.latency > sched_ctx.max_latency_seen:
sched_ctx.max_latency_seen = candidate.isa_info.latency
goto COMMIT_CANDIDATE
SKIP_TO_NEXT:
// Detect register-pressure excursion and truncate batch
cumulative_pressure = sched_ctx.current_pressure
+ sched_ctx.max_latency_seen
- sched_ctx.pressure_baseline
if cumulative_pressure + sched_ctx.best_pressure >= latency:
if batch_count > 0:
sched_ctx.pressure_exceeded = true
// Truncate batch if pressure would blow budget
break
sched_ctx.max_latency_seen = 0
sched_ctx.pressure_baseline = candidate.isa_info.latency
// Add candidate to batch
sched_ctx.min_batch_height = candidate.dag_info.min_height
sched_ctx.batch[batch_count] = candidate
batch_count++
sched_ctx.last_batch_position = candidate.block_position
if batch_count == batch_window:
break
// Track max latency among batch members for tiebreaking
if candidate.dag_info.max_height > sched_ctx.best_pressure:
sched_ctx.best_pressure = candidate.dag_info.max_height
COMMIT_CANDIDATE:
advance iterator to next candidate
// Post-selection: if the batch did not consume the entire ready set
// and pressure was not exceeded, trim the batch further to balance
// ILP against register pressure.
if batch_count < num_ready:
if 2 * target_batch > num_ready and not pressure_exceeded:
// Adaptive trimming: walk batch backwards, remove
// entries that are in the same scheduling group
// (same min_height and negative max_height)
trim_target = (num_ready + 1) / 2
or (num_ready - sched_ctx.unpressured_count) / 2
while batch_count > trim_target:
last = sched_ctx.batch[batch_count - 1]
if last.dag_info.min_height > trim_target
or last.dag_info.max_height >= 0:
break
batch_count--
batch_window = min(batch_window, trim_target)
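The batch-window sizing arithmetic above is compact enough to sketch directly; names are hypothetical but the formula mirrors the pseudocode: when anti-dependency pressure is absent and the target batch is smaller than the ready set, the window is balanced across ceil(num_ready / target) batches, or simply halved when the overshoot is less than 2x:

```python
import math

def batch_window(target_batch, num_ready, has_anti_dep_pressure):
    # Baseline: never batch more than is ready.
    window = min(target_batch, num_ready)
    if target_batch < num_ready and not has_anti_dep_pressure:
        if num_ready >= 2 * target_batch:
            # Balance evenly across the number of batches needed.
            window = num_ready // math.ceil(num_ready / target_batch)
        else:
            # Slight overshoot: split the ready set in half.
            window = num_ready // 2
    return window
```

For example, with a target of 16 and 40 ready instructions the balanced form yields 40 // ceil(40/16) = 13 per batch rather than two full batches of 16 plus a ragged tail of 8.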
Instruction Issue and Dependency Retirement (sub_185DC90)
When an instruction is issued, its scheduling record is updated and its height contribution is propagated to the priority accumulators of all its successors.
function issue_instruction(sched_ctx, instruction):
// Look up the instruction's scheduling record
dag_node = instruction.dag_info
block_pos = dag_node.block_position
sched_rec = sched_ctx.records[block_pos] // 48-byte record
if sched_rec.height == 0:
goto REMOVE_FROM_READY
if sched_rec.height < 0:
// This instruction was a net register consumer.
// Propagate its negative height to all successors, reducing
// their "weight" (pending predecessor contribution).
for each successor succ_pos in sched_rec.dep_set:
succ_rec = sched_ctx.records[succ_pos]
succ_rec.weight += sched_rec.height // height is negative
else:
// Net producer: propagate positive height to successors,
// reducing their tiebreak/priority accumulator.
for each successor succ_pos in sched_rec.dep_set:
succ_rec = sched_ctx.records[succ_pos]
succ_rec.tiebreak -= sched_rec.height
REMOVE_FROM_READY:
// Remove instruction from the ready bitset
clear_bit(sched_ctx.ready_set, block_pos)
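The retirement rule splits on the sign of the instruction's height: a consumer's negative height is added to each successor's weight, while a producer's positive height is subtracted from each successor's tiebreak accumulator. A minimal sketch, with illustrative field names:

```python
def retire(rec, successors):
    h = rec["height"]
    for succ in successors:
        if h < 0:
            succ["weight"] += h      # consumer: shrink pending weight
        elif h > 0:
            succ["tiebreak"] -= h    # producer: drain priority credit

succs = [{"weight": 5, "tiebreak": 7}]
retire({"height": -2}, succs)   # consumer retires: weight 5 -> 3
retire({"height": 3}, succs)    # producer retires: tiebreak 7 -> 4
```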
Instruction Latency Calculation (sub_1850760)
The latency calculator computes the expected completion time for an instruction, considering the target architecture's pipeline model.
function compute_latency(sched_ctx, instruction):
dag_node = instruction.dag_info
isa_encoding = dag_node.encoding_bits & 0x1FF // 9-bit opcode class
latency_table_ptr = dag_node.latency_ptr // ISA-model pointer
base_latency = dag_node.hardware_latency // field at +184
if base_latency is valid (not 0x80000000):
goto APPLY_MODIFIERS
// Opcode-specific latency lookup
switch isa_encoding:
case 215: // LDGSTS (async global-to-shared copy)
base_latency = sched_ctx.ldgsts_latency // offset +26824
if base_latency == -1: goto FALLBACK
goto APPLY_MODIFIERS
case 221: // LDSM (load shared matrix)
base_latency = sched_ctx.ldsm_latency // offset +26828
if base_latency == -1:
// Compute from matrix dimensions:
operand_size = extract_matrix_size(instruction)
is_transposed = extract_transpose_flag(instruction)
return lookup_matrix_latency(sched_ctx.latency_model,
operand_size, is_transposed, isa_encoding) / 4
goto APPLY_MODIFIERS
case 2: // IMAD with special predicate
if is_special_predicate(instruction):
base_latency = sched_ctx.imad_special_latency // +10644
if base_latency != -1: goto APPLY_MODIFIERS
case 94, 166: // TEX / SULD (texture / surface load)
if not is_texture_opcode_442(instruction):
base_latency = sched_ctx.tex_base_latency // +10640
if base_latency != -1: goto APPLY_MODIFIERS
base_latency = sched_ctx.default_latency // +228
goto APPLY_MODIFIERS
case 191: // TLD4 (texture gather)
if is_texture_opcode_442(instruction):
operand_shift = extract_operand_size(instruction)
base_latency = lookup_banked_latency(sched_ctx,
sched_ctx.default_latency, 4 << (operand_shift & 3))
goto APPLY_MODIFIERS
default: // Memory instructions with known opcode class
if is_texture_opcode_442(instruction):
// Compute strided latency for memory ops
element_size = lookup_element_size(instruction)
num_elements = extract_element_count(instruction) + 1
base_latency = lookup_banked_latency(sched_ctx,
sched_ctx.default_latency, element_size * num_elements)
if base_latency != -1: goto APPLY_MODIFIERS
FALLBACK:
if latency_table_ptr is null:
if isa_encoding == 152 or isa_encoding == 142:
base_latency = 1 // NOP-like instructions
else:
base_latency = lookup_from_isa_model(sched_ctx, instruction)
else:
base_latency = hash_table_lookup(sched_ctx.latency_hash,
latency_table_ptr)
APPLY_MODIFIERS:
// Apply warp-scheduling modifier if enabled
if sched_ctx.warp_sched_enabled: // vtable offset +232
if is_eligible_for_warp_discount(sched_ctx, instruction):
base_latency -= sched_ctx.warp_discount // offset +2400
return base_latency
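Stripped of the opcode-specific cases, the lookup reduces to a fallback chain: a cached hardware latency (anything other than the 0x80000000 sentinel) short-circuits; otherwise an opcode-class table is consulted, with -1 meaning "fall through to the default". The sentinels match the pseudocode above; the table contents below are invented for illustration:

```python
INVALID = 0x80000000  # "no cached hardware latency" sentinel

def lookup_latency(cached, class_table, opcode_class, default_latency):
    if cached != INVALID:
        return cached                       # cached value wins
    lat = class_table.get(opcode_class, -1) # per-class knob/table
    if lat != -1:
        return lat
    return default_latency                  # ISA-model fallback

# Hypothetical class table: 215 = LDGSTS with a configured latency.
table = {215: 20, 221: -1}
```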
Per-Block Scheduling Pass (sub_1681A70)
The per-block scheduling pass is the workhorse called for each basic block. It drives the list-scheduling loop that picks instructions from the ready set and emits them in scheduled order.
function schedule_basic_block(sched_ctx, strategy_callback, bb_offset, init_flag):
// Initialize timing infrastructure
reset_timing(sched_ctx.compilation_ctx)
if init_flag:
initialize_liveness(sched_ctx)
initialize_register_state(sched_ctx)
sched_ctx.max_latency = 0
sched_ctx.total_instructions = 0
sched_ctx.pressure_overflow = false
// Allocate per-function register tracking arrays
num_registers = compilation_ctx.register_count + 1
alloc_tracking_array(sched_ctx, num_registers)
// Query knobs for scheduling parameters
batch_size = query_knob(743) // max batch size
anti_dep_weight = query_knob(747) // anti-dependency weight
// Iterate over each sub-function (for split compilation)
for each sub_function in reverse order:
is_recursive = check_recursion(sub_function)
sched_ctx.current_function = sub_function
// Select strategy based on compilation mode
if sched_ctx.mode == 1: // ReduceReg
check_pass_gate("ScheduleInstructionsReduceReg", sub_function)
elif sched_ctx.mode == 2: // DynBatch
check_pass_gate("ScheduleInstructionsDynBatch", sub_function)
// Resolve scheduling parameters from ISA model
pressure_info = lookup_pressure(sched_ctx.isa_model, sub_function)
// Set up register-pressure snapshot
subtract_pressure_baseline(sched_ctx, pressure_info)
update_pressure_from_live_in(sched_ctx)
// Walk each basic block in the sub-function
cursor = sub_function.last_block
while cursor is not null:
// Get next block to schedule via strategy callback
// strategy_callback dispatches to one of:
// - default list scheduler
// - ReduceReg pressure-minimizing scheduler
// - DynBatch throughput scheduler
next_block = strategy_callback(sched_ctx, cursor, bb_offset)
if next_block is null:
break
// Track max instruction height for this block
if next_block.dag_info.height > sched_ctx.max_latency:
sched_ctx.max_latency = next_block.dag_info.height
// Insert scheduled block into output sequence
insert_scheduled_block(sched_ctx, next_block, cursor)
// Update register pressure tracking after scheduling
update_register_pressure(sched_ctx, next_block)
// Handle call instructions (opcode 97):
// On a call boundary, the scheduler crosses to a new
// sub-function. Register state is snapshotted and
// the pressure tracker walks forward into the callee.
if next_block.opcode == 97:
push_call_context(sched_ctx, next_block)
cursor = next_block
continue
// Record per-block scheduling statistics
block_max_latency = next_block.isa_info.latency
if block_max_latency > sched_ctx.total_instructions:
sched_ctx.total_instructions = block_max_latency
if not is_recursive:
if block_max_latency > sched_ctx.non_recursive_max:
sched_ctx.non_recursive_max = block_max_latency
cursor = next_block
// Record final pressure statistics
compilation_ctx.peak_pressure = sched_ctx.total_instructions
compilation_ctx.non_recursive_peak = sched_ctx.non_recursive_max
// Cleanup
free_tracking_arrays(sched_ctx)
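The driver loop's shape, abstracted from the bookkeeping, is: ask the strategy callback for the next scheduled node, stop on null, and track the maximum block height seen. Node shape and callback signature below are illustrative assumptions:

```python
def drive_blocks(first_block, strategy):
    max_height = 0
    cursor = first_block
    while cursor is not None:
        nxt = strategy(cursor)   # dispatches to default/ReduceReg/DynBatch
        if nxt is None:
            break                # strategy signals end of schedulable blocks
        max_height = max(max_height, nxt["height"])
        cursor = nxt
    return max_height

# Toy strategy that just walks a linked chain of blocks:
chain = {"height": 3, "next": {"height": 7, "next": None}}
result = drive_blocks(chain, lambda b: b["next"])
```

Note that, as in the pseudocode, only callback-returned blocks update the running maximum; the starting cursor's own height is not counted.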
Three-Mode Strategy Dispatch (sub_1867D60)
The block scheduler core selects between the three scheduling strategies and wraps the per-block pass.
function block_scheduler_core(sched_ctx):
compilation_ctx = sched_ctx.compilation_ctx
isa_model = compilation_ctx.isa_model_ptr // offset [198]
// Run pre-scheduling peepholes
optimize_nan_or_zero(sched_ctx, compilation_ctx) // sub_1866FA0
invoke_vtable_232(sched_ctx) // pre-scheduling hook
// Phase 1: Insert barrier-crossing dependency edges
// For each instruction in the function, if the ISA model says
// it crosses a barrier (vtable[183]), insert a serialization
// edge via sub_18B91C0 (add dependency with kind=8).
if compilation_ctx.flags & 0x10:
for each instruction inst in function:
if isa_model.crosses_barrier(inst):
add_serialization_edge(compilation_ctx, inst)
// Phase 2: Gate the pass on the "ScheduleInstructions" knob
if pass_is_disabled("ScheduleInstructions"):
return
// Phase 3: Analyze function structure
max_block_height = compute_max_block_height(compilation_ctx)
initialize_scheduling_infrastructure(sched_ctx, max_block_height)
// Phase 4: Configure for ReduceReg mode
if compilation_ctx.flags & 0x10: // extended register file
register_budget = 2 * (compilation_ctx.register_count + 1)
else:
register_budget = compilation_ctx.register_count + 1
allocate_pressure_tracking(sched_ctx, register_budget)
// Phase 5: Determine scheduling mode
// Query knobs 776 (ReduceReg ILP budget), 778 (ReduceReg latency
// budget), scale them through the latency model, then run
// "ScheduleInstructionsReduceReg" pass if not gated.
ilp_budget = query_knob(776)
latency_budget = query_knob(778)
ilp_budget = scale_latency(sched_ctx, ilp_budget)
latency_budget = scale_latency(sched_ctx, latency_budget)
if pass_not_gated("ScheduleInstructionsReduceReg"):
// Run ReduceReg mode: sched_ctx.mode = 1
set_mode(sched_ctx, REDUCE_REG)
sched_ctx.max_batch = 0x39 // 57 instructions
run_scheduling_pass(sched_ctx, mode=0x39) // sub_1681A70
// Reset per-instruction state for next pass
clear_instruction_state(compilation_ctx)
// Phase 6: Configure default scheduling mode
set_mode(sched_ctx, DEFAULT)
sched_ctx.max_batch = query_knob(805) // batch size
if sched_ctx.max_batch > 16: sched_ctx.max_batch = 16
sched_ctx.anti_dep_limit = query_knob(741) // default: 3
sched_ctx.pressure_limit = query_knob(761) // default: 3 or 6 or 12
// Phase 7: Check CUTLASS workload pattern
sched_ctx.is_cutlass = check_cutlass_workload(sched_ctx) // sub_1866CF0
// Phase 8: Run DynBatch mode if applicable
if pass_not_gated("ScheduleInstructionsDynBatch"):
// Check knob 742 or auto-detect based on instruction mix
if query_knob(742) or auto_detect_dynbatch(compilation_ctx):
set_mode(sched_ctx, DYNBATCH)
run_scheduling_pass(sched_ctx, mode=0x41) // 65 instructions
// Phase 9: Final default scheduling pass
set_mode(sched_ctx, DEFAULT)
run_scheduling_pass(sched_ctx, mode=0x49) // 73 instructions
// Cleanup
free_pressure_tracking(sched_ctx)
Register Pressure Delta Computation (sub_185D760)
The register pressure tracker scans each instruction's operands and records which physical register ranges are defined or consumed. This drives the pressure-aware scheduling decisions.
function track_register_operands(sched_ctx, instruction, out_lo, out_hi):
// Walk operands in reverse order (last operand to first)
num_ops = instruction.operand_count - 1
if num_ops < 0:
return
arena = sched_ctx.arena // offset +840
for op_index = num_ops downto 0:
operand = instruction.operands[op_index]
tag = (operand >> 28) & 7
if tag != 1: // not a register
continue
if (operand & 0xFFFFFF) - 41 <= 3: // pseudo-register (ids 41..44, unsigned compare)
continue
if operand < 0: // definition operand
continue
// Look up the register descriptor
reg = register_table[operand & 0xFFFFFF]
// Allocate a 16-byte tracking node
node = arena_alloc(arena, 16)
node.instruction = instruction
// Link into the register's use chain
prev_head = reg.use_chain_head // offset +104
if prev_head is null:
// First use of this register in this block
add_to_active_list(sched_ctx, reg) // link at +680
reg.use_chain_next = null // +112
node.next = prev_head
reg.use_chain_head = node
// Check if register is "wide" (exceeds scheduling horizon)
if sched_ctx.pressure_limit < reg.live_range
and reg.type == 6
and reg is not pseudo:
// Wide register: track in the register-file bitmap
// The bitmap at offset +728 records which register
// groups are actively tracked for pressure.
phys_reg = reg.physical_number
stride = phys_reg * (has_extended_regfile ? 2 : 1)
bitmap_word = stride >> 6
bitmap_bit = 1 << (stride & 63) // bit index taken modulo the 64-bit word
// Grow bitmap if necessary
ensure_bitmap_capacity(sched_ctx, bitmap_word + 1)
sched_ctx.register_bitmap[bitmap_word] |= bitmap_bit
// If extended register file (paired mode),
// also set the paired register bit
if has_extended_regfile:
paired_stride = stride + 1
sched_ctx.register_bitmap[paired_stride >> 6] |= 1 << (paired_stride & 63)
// Update output pressure counters
*out_lo += definitions_in_lo_class
*out_hi += definitions_in_hi_class
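The bitmap indexing above is the standard word/bit split over 64-bit words: the stride (doubled in extended-register-file mode) selects a word via `stride >> 6` and a bit via `stride mod 64`. A self-contained sketch, with the explicit `& 63` mask made visible:

```python
def mark_register(bitmap, phys_reg, extended):
    """bitmap: growable list of 64-bit words; returns the updated list."""
    stride = phys_reg * (2 if extended else 1)
    word, bit = stride >> 6, stride & 63
    while len(bitmap) <= word:          # ensure_bitmap_capacity
        bitmap.append(0)
    bitmap[word] |= 1 << bit
    if extended:                        # paired-register bit
        pstride = stride + 1
        pw, pb = pstride >> 6, pstride & 63
        while len(bitmap) <= pw:
            bitmap.append(0)
        bitmap[pw] |= 1 << pb
    return bitmap
```

Physical register 70, for instance, lands in word 1 (70 >> 6) at bit 6 (70 & 63); in paired mode register 3 sets bits 6 and 7 of word 0.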
Register Pressure Heuristic in the Priority Function (sub_1859F10)
In ReduceReg mode, the priority function penalizes instructions that would increase register pressure beyond a configurable threshold.
function pressure_aware_priority(sched_ctx, instruction, reg_info):
// Query knob 780 for pressure tracking granularity
// (auto-incremented on each call)
granularity = query_knob_780_autoincrement()
opcode = instruction.opcode & 0xFFFFCFFF
if opcode != 288 and opcode != 183: // not a memory or barrier op
return 0 // no pressure adjustment
// Look up the register class and bank that this instruction writes to
reg_class = reg_info[3] // register class index
num_elements = reg_info[2] // number of registers written
// Check current pressure in this class
current_bank_pressure = sched_ctx.class_pressure[reg_class]
if current_bank_pressure != 0:
// Compute maximum pressure across all active banks
max_bank_pressure = 0
if num_elements == 1:
// Single-element: check both base and paired register
for each active bank in (current_bank_pressure bitmask):
p = max(sched_ctx.base_pressure[bank],
sched_ctx.paired_pressure[bank])
max_bank_pressure = max(max_bank_pressure, p)
else:
// Multi-element: check only base pressure
for each active bank in (current_bank_pressure bitmask):
max_bank_pressure = max(max_bank_pressure,
sched_ctx.base_pressure[bank])
// If max pressure exceeds threshold, penalize this instruction
if max_bank_pressure >= sched_ctx.pressure_threshold: // offset +1036
// For definitions that write the first bank, set class=0
// to force it to schedule first (relieve pressure)
if (current_bank_pressure & 1) != 0:
reg_info[3] = 0
return // signal: instruction is pressure-hot
else:
// No active pressure in this class
// Look up paired class pressure
paired_pressure = sched_ctx.class_pressure[reg_class + 33]
if num_elements == 1:
paired_pressure = max(paired_pressure,
sched_ctx.paired_pressure[reg_class])
if paired_pressure >= sched_ctx.pressure_threshold:
return // still pressure-hot
// Extract destination operand information for latency-based
// priority: wider destinations get higher priority to free
// registers sooner.
dest_operand = instruction.operands[dest_index]
dest_reg = register_table[dest_operand & 0xFFFFFF]
num_dest_regs = compute_dest_width(instruction)
latency_penalty = compute_latency(sched_ctx, instruction) * num_dest_regs
return latency_penalty // higher = schedule sooner
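The decision rule above can be condensed: once the maximum pressure across active banks reaches the threshold, the instruction is flagged pressure-hot; otherwise its priority is its latency scaled by destination width, so wide producers issue sooner and free registers earlier. The `None` return convention and names here are assumptions for illustration:

```python
def pressure_priority(bank_pressures, threshold, latency, dest_width):
    if bank_pressures and max(bank_pressures) >= threshold:
        return None                   # pressure-hot: do not raise priority
    return latency * dest_width       # higher value = schedule sooner

# A bank at pressure 9 against a threshold of 8 flags the instruction;
# otherwise a 6-cycle, 2-register write gets priority 12.
```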
Key Address Summary
Pre-RA Scheduling (0x1850000--0x186F000)
| Address | Size | Identity |
|---|---|---|
| 0x1851DC0 | 85 KB | ScheduleInstructions_main_driver |
| 0x1857990 | 13 KB | Strategy selector (default vs. ReduceReg vs. DynBatch) |
| 0x185B870 | 28 KB | Per-block scheduling driver |
| 0x1860A40 | 47 KB | Per-function scheduling driver |
| 0x1864ED0 | 17 KB | List-scheduler core algorithm |
| 0x1867D60 | 22 KB | Block scheduler core with strategy dispatch |
| 0x1866FA0 | 28 KB | OptimizeNaNOrZero peephole |
| 0x186C7A0 | 24 KB | HoistInvariants core |
| 0x186D520 | 38 KB | HoistInvariants per-function |
| 0x186EE80 | 41 KB | HoistInvariants analysis driver |
| 0x1868E50 | 19 KB | CUTLASS pattern handler |
Tepid Scheduler (0x16F6000--0x1740000)
| Address | Size | Identity |
|---|---|---|
| 0x16F35A0 | 36 KB | Block processor (knob 610 gated) |
| 0x17027F0 | 38 KB | Main scheduling loop (1,216 lines) |
| 0x1704010 | 12 KB | Instruction selector |
| 0x1704CB0 | 12 KB | Priority calculator |
| 0x1710840 | 25 KB | Global optimizer (854 lines) |
| 0x172FED0 | 30 KB | Main driver (1,101 lines) |
| 0x17130F0 | 11 KB | Software pipelining |
| 0x16F9980 | 15 KB | Latency hiding statistics |
| 0x1717A00 | 12 KB | DMA-math ratio balancer |
| 0x1729F10 | 19 KB | Epilogue optimizer |
Scoreboard Tracking (0x1B40000--0x1B60000)
| Address | Size | Identity |
|---|---|---|
| 0x1B40920 | 38 KB | Scoreboard dependency tracker |
| 0x1B41E10 | 23 KB | Wait barrier optimizer |
| 0x1B42E30 | 22 KB | Yield optimization pass |
| 0x1B43E30 | 14 KB | Stall count propagation |
| 0x1B44940 | 13 KB | Control word builder |
| 0x1A8A5B0 | 11 KB | Scoreboard pressure analyzer |
| 0x1A63610 | 14 KB | Barrier assignment pass |
| 0x1A64080 | 15 KB | Barrier optimizer |
Cross-References
nvlink Internal
- Embedded ptxas Overview -- scheduling in the 48-pass pipeline (passes 23--38)
- Register Allocation -- runs before the tepid scheduler
- ISel Hubs -- runs before scheduling
- Peephole -- peephole passes at 0x1866FA0 interleaved with scheduling
- Mercury Compiler Passes -- Mercury-specific scheduling passes (MercWARs, MercOpex)
Sibling Wikis
- ptxas: Scheduling Overview -- standalone ptxas scheduling infrastructure
- ptxas: Algorithm -- scheduling algorithm details
- ptxas: Latency Model -- per-instruction latency tables
- ptxas: Scoreboards -- dependency barrier assignment