Rematerialization
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
Rematerialization is the compiler technique of recomputing a value near its use instead of keeping the original definition live across a long range. In ptxas, rematerialization is implemented through three cooperating pipeline phases and tightly integrated with the register allocator's spill-vs-remat decision logic. On GPUs, where register pressure directly determines occupancy and therefore throughput, aggressive rematerialization is one of the most performance-critical optimizations in the entire pipeline.
| Property | Value |
|---|---|
| Phase 28 | SinkRemat -- sinks instructions closer to uses, marks remat candidates |
| Phase 54 | OriDoRematEarly -- sets remat mode flag (ctx+1552 = 4) |
| Phase 69 | OriDoRemat -- late rematerialization after predication and fusion |
| Address range (phase 28) | Execute: sub_C5FC20, core: sub_913A30 -> sub_A0F020 |
| Address range (phase 69) | Execute: sub_C5F910, core: sub_A112C0 -> sub_A11060 -> sub_A107B0 |
| Minimum opt level | Phase 28: requires level > 4 (knob 487); Phase 69: requires level > 1 |
| Operand kind 7 | "Remat" marker in the Ori IR operand classification |
| Vreg flags (offset +80) | 0x80000001 = remat candidate; 0x80000007 = remat with predication; 0x80000008 = remat committed |
| Regalloc integration | sub_93AC90 (remat check), sub_99A9D0/sub_99AA50 (range remat cost) |
| DUMPIR name | SinkRemat, OriDoRematEarly, OriDoRemat |
Why Rematerialization Matters on GPUs
On NVIDIA GPUs, the per-thread register count limits the number of concurrent warps (occupancy): the more registers each thread uses, the fewer warps can be resident on an SM. Since GPU performance depends on hiding memory latency through massive parallelism, even a single extra register can measurably degrade throughput when it pushes the kernel across an occupancy boundary.
Rematerialization trades instruction count for register pressure reduction. Instead of keeping a computed value alive in a register from its definition to its last use, the compiler recomputes it where needed. This is profitable when:
- The original instruction is cheap (single-cycle ALU: IADD, IMAD, MOV, SEL, LOP3, SHF)
- All source operands are still available at the use point (not overwritten)
- The live range of the result is long enough to actually cause register pressure
- The instruction has no side effects (no memory writes, no barrier interactions)
On GPUs, the cost-benefit tradeoff is skewed much further toward remat than on CPUs. A single spill/refill pair (STL + LDL) costs 20--100 cycles of local memory latency, while a rematerialized IADD costs 1 cycle. More importantly, the spill itself consumes a register for the address computation, potentially cascading into more spills.
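The trade-off can be made concrete with a back-of-the-envelope model. The sketch below uses only the cycle figures quoted above (20 to 100 cycles per spill/refill pair, 1 cycle per rematerialized IADD). The function name and the assumption that every use needs its own recomputation are illustrative and are not taken from ptxas.

```python
def remat_saves_cycles(num_uses, remat_cycles=1, spill_pair_cycles=20):
    """Compare recomputing a value at every use against one spill/refill pair.

    Assumption (not from ptxas): each use needs its own recomputation and the
    spilled value needs exactly one STL + LDL pair.
    """
    return num_uses * remat_cycles < spill_pair_cycles


# Even with the most optimistic spill latency (20 cycles), a 1-cycle IADD
# can be recomputed at up to 19 use sites before remat stops winning.
break_even = next(n for n in range(1, 1000) if not remat_saves_cycles(n))
print(break_even)
```

This ignores the occupancy effect of the freed register, which the surrounding text identifies as the larger benefit, so it understates the case for remat.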
Pipeline Position
Phase 23 GenerateMovPhi SSA phi nodes -> MOV instructions
Phase 24 OriPipelining Software pipelining
Phase 25 StageAndFence Memory fence insertion
Phase 26 OriRemoveRedundantBarriers
Phase 27 AnalyzeUniformsForSpeculation
Phase 28 SinkRemat *** Sink + remat candidate marking ***
Phase 29 GeneralOptimize Bundled mid-level optimizations
...
Phase 53 OriPropagateVaryingFirst
Phase 54 OriDoRematEarly *** Sets remat mode flag ***
Phase 55 LateExpansion
...
Phase 63 OriDoPredication If-conversion (creates new opportunities)
...
Phase 66 OriHoistInvariantsLate
Phase 67 DoKillMovement
Phase 68 DoTexMovement
Phase 69 OriDoRemat *** Late rematerialization ***
Phase 70 OriPropagateVaryingSecond
The three-phase design is deliberate:
- Phase 28 (early): Runs after SSA construction and pipelining but before the main optimization passes. Sinks instructions closer to their uses and identifies candidates. This is the most complex of the three phases.
- Phase 54 (mode setter): A trivial phase that writes 4 to ctx+1552 (the pipeline progress counter), signaling to downstream passes that rematerialization mode is active. Its isNoOp() returns 1 in the default vtable, meaning the dispatch loop skips its execute() by default. The phase is only active when an architecture backend overrides the vtable to return 0, at which point the single-store execute body runs.
- Phase 69 (late): Runs after predication (phase 63) and loop fusion (phase 59), which restructure control flow and create new rematerialization opportunities that did not exist at phase 28 time. Also runs after OriHoistInvariantsLate (phase 66), which may have extended live ranges by hoisting invariants.
Phase 28: SinkRemat
Entry and Guard Logic
The execute function (sub_C5FC20) applies two layers of gating:
function SinkRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_913A30(ctx) // actual implementation
sub_913A30 (131 lines) performs additional checks before invoking the core:
- Optimization level >= 5: Required for the full sink+remat pass
- Knob 487: Must be enabled (queried via vtable+152 dispatch on ctx+1664)
- Cutlass detection (sub_8F47E0): Checks if the function name contains "cutlass" via strstr(). Cutlass kernels receive special treatment
- Flag check (ctx+1368 bit 0): Must be set (compilation is in SSA window)
- Feature flags (ctx+1376): Must have bit 26 set (0x4000000) but NOT bit 53 (0x20000000000000) simultaneously
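The feature-flag condition is a mask-and-compare on a 64-bit word: both bits are isolated with one mask, and the result must equal the bit-26 value alone. A minimal sketch follows. The constant and function names are made up for illustration, and only the bit positions come from the list above.

```python
FEATURE_BIT_26 = 1 << 26                         # 0x4000000, must be set
FEATURE_BIT_53 = 1 << 53                         # 0x20000000000000, must be clear
FEATURE_MASK = FEATURE_BIT_26 | FEATURE_BIT_53   # 0x20000004000000


def feature_gate_passes(features):
    """True when bit 26 is set and bit 53 is clear in the 64-bit flag word."""
    return (features & FEATURE_MASK) == FEATURE_BIT_26
```

Comparing the masked value against a single constant checks "set" and "clear" in one step, and all other bits in the word are ignored.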
When the cutlass flag (ctx+1381 bit 6) is set, the pass enters an iterative mode:
function sub_913A30(ctx):
if opt_level <= 4:
return
if not knob_enabled(487):
return
is_cutlass = function_name_contains("cutlass")
if not (flag_byte(ctx+1368) & 1):
return
if not is_cutlass and not (flag_byte(ctx+1381) & 0x40):
return
// Feature flag gating
features = *(ctx+1376) & 0x20000004000000
if features != 0x4000000:
return
// Cutlass iterative mode
if flag_byte(ctx+1381) & 0x40:
max_iters = 5 // default
if hw_config->field_62064: // architecture-specific override
max_iters = getKnob(862) // configurable iteration limit
if max_iters <= 0: goto sinkRemat_core
for iter in 0..max_iters:
sub_8F5220(&state, ctx) // initialize iteration state
changed = sub_911030(&state, iter) // core sink+remat
if not changed or sub_8F59C0(&state): // convergence check
break
sub_8F5AD0(&state) // update state for next iter
sub_909A20(&state) // propagate changes
// clean up 4 bitvectors + 2 hash tables
return
// Non-cutlass path: single invocation
sinkRemat_core:
if is_cutlass:
// Instruction count limit check
if *(ctx+1584)->field_372 > 0x7FFF:
// Warn via vtable dispatch (diagnostic knob 356, severity 2)
sub_A0F020(ctx) // CORE: sink + remat driver
vtable_callback() // post-processing hook
sub_781F80(ctx, 1) // rebuild liveness
sub_8F4820(ctx, &worklist) // build remat worklist
// Process worklist in reverse order
for item in worklist (descending):
sub_8F4F90(ctx, &item) // apply remat decisions
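The cutlass path above is a bounded fixed-point iteration: repeat a sweep until it reports no change or the iteration budget (5 by default) runs out. The sketch below shows only that control structure. The step callback is a hypothetical stand-in for one sink+remat sweep, and the secondary convergence check (sub_8F59C0) is omitted.

```python
def run_bounded(step, max_iters=5):
    """Call step(i) until it reports no change or the budget is spent.

    Returns the number of iterations executed. `step` is a hypothetical
    stand-in for one sink+remat sweep; it returns True if it changed anything.
    """
    executed = 0
    for i in range(max_iters):
        executed += 1
        if not step(i):
            break
    return executed
```

For example, a step that changes something only on its first two sweeps stops after the third call, while a step that always reports a change is cut off at the budget.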
Core Sink+Remat Driver: sub_A0F020
sub_A0F020 (494 lines) is the main workhorse of phase 28. It operates on the entire function body, processing basic blocks in reverse postorder through the dominator tree.
The algorithm has two main stages:
Stage 1: Per-block sinking analysis (via sub_A06A60 calling sub_A08250)
For each basic block in reverse postorder:
- Walk the instruction list backward
- For each instruction, check if it has a single use in a dominated block
- If so, sink the instruction to the use block (moves the instruction node in the linked list)
- Track whether any changes were made for convergence
Stage 2: Cross-block rematerialization (via sub_A06A60 calling sub_A07DA0)
For each basic block in reverse postorder:
- Walk the instruction list
- For each rematerialization-eligible instruction, check if the cost model approves duplication
- If profitable, clone the instruction at the use site and mark the original's result register with the remat flag
The pass alternates between sinking and rematerialization in a fixed-point loop, repeating until no more changes occur. The two worklist callbacks (sub_A08250 for sinking, sub_A07DA0 for remat) operate on a per-block basis through a generic block visitor (sub_A06A60).
The block visitor manages per-block liveness bitvectors:
- block+16: live-in bitvector
- block+40: live-out bitvector
- block+64: kill set
- block+112: live-through set (computed as intersection of live-in and live-out)
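The live-through computation is a plain bitwise intersection. The sketch below models each bitvector as a Python integer with one bit per virtual register; this representation and the helper names are assumptions for illustration, not the ptxas bitvector layout.

```python
def live_through(live_in, live_out):
    """Registers live on entry and still live on exit (bitwise intersection)."""
    return live_in & live_out


def bits(bv):
    """Indices of the set bits, lowest first."""
    return [i for i in range(bv.bit_length()) if (bv >> i) & 1]


live_in = 0b10110    # vregs 1, 2, 4 live on entry
live_out = 0b01110   # vregs 1, 2, 3 live on exit
print(bits(live_through(live_in, live_out)))   # [1, 2]
```

Values that are live-through occupy a register for the whole block without being used in it, which is why they are natural sinking and remat targets.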
Phase 69: OriDoRemat (Late Rematerialization)
Entry Logic
The execute function (sub_C5F910) checks the optimization level and delegates:
function OriDoRemat_execute(phase, ctx):
opt_level = getOptLevel(ctx) // sub_7DDB50
if opt_level <= 1:
return
return sub_A112C0(ctx, -1) // -1 = process all blocks
Remat State Initialization: sub_A0C310
sub_A112C0 first initializes a rematerialization state object (at least 93 bytes, given the byte field at +92) via sub_A0C310:
struct RematState {
+0 compilation_ctx* ctx
+8 RefCountedObj* block_worklist // refcounted allocator object
+16 int hash_capacity // hash table for seen blocks
+20 int hash_count
+24 HashNode** hash_buckets
+32 int hash_bucket_count
+72 int iteration_count // convergence counter
+80 RefCountedObj* instruction_pool // for cloned instructions
+88 int pass_id // -1 for late remat
+92 byte requires_tex_check // hw_config->field_17064 == 1
};
The pass ID -1 distinguishes OriDoRemat from OriDoRematEarly. When the hardware configuration at hw_config+17064 is 1 and hw_config+17072 != 0, the requires_tex_check flag is set, enabling additional texture-instruction awareness.
Iterative Remat Loop: sub_A112C0 + sub_A11060
The late remat pass runs in a convergence loop:
function sub_A112C0(ctx, pass_id):
init_remat_state(&state, ctx, pass_id)
// Iterative convergence loop
while sub_A11060(&state) and getOptLevel(ctx) != 1
and sub_785E20(ctx, 0): // instruction budget check
continue
// Cleanup: drain worklist, release refcounted objects
...
Per-Iteration Worker: sub_A11060
Each iteration of sub_A11060 (155 lines) processes the entire instruction list:
function sub_A11060(state):
ctx = state->ctx
sub_7E6090(ctx, 0, 1, 0, 0) // rebuild use-def chains
// Reset all basic block depth markers to 0x80000000 (unvisited)
for bb in ctx->block_list:
bb->field_76 = 0x80000000
// Drain hash table back into instruction pool
drain_hash_table(state)
first_pass = !state->requires_tex_check
changed = false
// Walk instructions in program order
instr = ctx->first_instruction // ctx+280
while instr:
if first_pass:
first_pass = false
while instr:
opcode = instr->opcode & 0xFFFFCFFF
if opcode == 97: // STG in ROT13; used as definition anchor/label marker
changed |= sub_A10DF0(state, instr)
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
else:
// Non-first-pass: skip MOV processing
while instr:
next = instr->next
sub_A107B0(state, instr, &sink_flag, &changed_flag,
&remat_flag, true)
instr = next
if not changed_flag:
goto check_second_pass
// Decrement iteration counter, check convergence
if --state->iteration_count == 0:
return sink_flag
check_second_pass:
if remat_flag and *(ctx+1552) > 4:
// Second pass: walk block list for cross-block opportunities
for bb in ctx->block_list:
if (bb->field_20 & 1) == 0 or bb->size <= 0
or (bb->field_20 & 6) == 6:
continue // skip empty/dead/cold blocks
instr = ctx->first_block_instruction
while instr:
instr = sub_A0C540(state, instr, &changed, ...)
if changed:
// Reset depth markers and loop
continue
--state->iteration_count
return sink_flag
Per-Instruction Remat Worker: sub_A107B0
sub_A107B0 (316 lines) is the core per-instruction decision function called from both phases 28 and 69. It determines whether a specific instruction should be sunk, rematerialized, or left alone.
function sub_A107B0(state, instr, sink_flag_out, changed_out, remat_flag_out,
allow_remat):
// Quick rejection: check if instruction is sinkable
result = sub_A105F0(state, instr, sink_flag_out, changed_out)
if result:
return result // already sunk, done
num_operands = instr->operand_count // at instr+80
if num_operands <= 0:
return 0
// Walk destination operands
for i in 0..num_operands:
operand = instr->operands[i] // at instr+84 + 8*i
operand_type = (operand >> 28) & 7
if operand_type == 7: // barrier register
// Track barrier liveness
continue
if operand_type != 1: // not a GPR destination
continue
// GPR destination operand
if operand < 0: // bit 31 set = definition
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
vreg->flags |= 0x80000001 // mark as remat candidate
if has_predication_flag and last_operand_is_0x20:
vreg->flags |= 0x80000007 // enhanced remat with predication
if sub_A0C410(state, vreg, instr, allow_remat):
// Remat is profitable: clear depth flag, update block assignment
vreg->field_76 = ~instr->block_id
else:
// Not profitable: process as regular definition
// Check for multi-use definitions
...
else:
// Source operand: track liveness contribution
...
return result
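The operand tests in the pseudocode above all read fields of one 32-bit word. The sketch below decodes those fields. The field positions come from the pseudocode, while the function name and dictionary keys are illustrative.

```python
def decode_operand(word):
    """Split a 32-bit operand word into the fields used in the pseudocode."""
    word &= 0xFFFFFFFF
    return {
        "is_def": bool(word & 0x80000000),   # bit 31; "operand < 0" as a signed int
        "type": (word >> 28) & 7,            # 1 = GPR, 7 = barrier register
        "vreg_index": word & 0xFFFFFF,       # low 24 bits
    }


print(decode_operand(0x9000002A))
```

The example word 0x9000002A decodes as a GPR definition of virtual register 42: bit 31 is set, the type field is 1, and the low 24 bits are 0x2A.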
Sinkability Check: sub_A105F0
sub_A105F0 (77 lines) determines if an instruction can be sunk to a single-use block. It enforces strict criteria:
- Opcode filter: Only opcode 0x5F (95; STS in the ROT13 name table, used here as a constant/immediate load variant marker) with state->byte_92 clear
- Single-use check via sub_A07940: The instruction must have exactly one use
- Dominator check: The use must be in a block dominated by the definition block
- MOV chain check: If the instruction feeds opcode 93 (OUT_FINAL in ROT13; used here as a MOV-like chain link), verifies through an FNV-1a hash table that the definition matches the expected pattern
- Cost check via sub_A0C4A0: Verifies that sinking reduces pressure (returns the pressure delta)
When sinking succeeds, the instruction is physically moved in the linked list via sub_92E1B0 (insert at new position) and sub_9253C0 (remove from old position).
Rematerialization Eligibility Criteria
The eligibility check spans multiple functions. An instruction is rematerializable if it passes ALL of these filters:
Opcode Whitelist
From sub_911030 and sub_A11060, the eligible opcode set (after masking opcode & 0xFFFFCFFF) is:
| Opcode | Identity | Category |
|---|---|---|
| 22 | IADD/IADD3 | Integer add (1 cycle) |
| 50 | SHF | Funnel shift (1 cycle) |
| 77 | IMAD | Integer multiply-add (1 cycle on modern SM) |
| 83 | ISETP | Integer set-predicate (1 cycle) |
| 93 | OUT_FINAL in ROT13; used as MOV-like marker | Register move (0--1 cycles, often eliminated). Actual SASS MOV is opcode 19. |
| 95 | STS in ROT13; used as constant-load marker | Constant materialization |
| 297 | LOP3 | 3-input logic (1 cycle) |
| 352 | SEL | Conditional select (1 cycle) |
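For opcodes in the range [22, 83], eligibility is encoded in the 64-bit constant 0x2080000010000001: an opcode is eligible when bit (opcode - 22) is set, tested as (0x2080000010000001 >> (opcode - 22)) & 1. The set bits 0, 28, 55, and 61 correspond to opcodes 22, 50, 77, and 83. Opcodes 297 and 352 are handled by explicit checks. Opcodes 93 and 95 lie above the [22, 83] window, so they cannot be covered by this mask. This is a compile-time-constant bitmask covering single-cycle ALU instructions.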
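The bitmask test can be checked by decoding the constant directly. The sketch below models the opcode masking, the [22, 83] bitmask window, and the explicit checks for 297 and 352. Opcodes 93 and 95 appear in the table but fall outside the mask window; how they are matched is not modeled here, and the function names are illustrative.

```python
ELIGIBILITY_MASK = 0x2080000010000001   # bit i covers opcode 22 + i
EXPLICIT_OPCODES = {297, 352}           # checked outside the mask


def base_opcode(raw):
    """Clear bits 12 and 13 before comparing, as in opcode & 0xFFFFCFFF."""
    return raw & 0xFFFFCFFF


def in_mask_or_explicit(raw):
    """True for opcodes accepted by the bitmask or by the explicit checks."""
    op = base_opcode(raw)
    if 22 <= op <= 83:
        return bool((ELIGIBILITY_MASK >> (op - 22)) & 1)
    return op in EXPLICIT_OPCODES


mask_opcodes = [op for op in range(22, 84) if in_mask_or_explicit(op)]
print(mask_opcodes)   # [22, 50, 77, 83]
```

Decoding the constant reproduces the first four rows of the table (IADD/IADD3, SHF, IMAD, ISETP).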
Operand Source Liveness
sub_90C010 (70 lines) checks that all source operands are still available (live) at the proposed remat point:
function check_sources_available(state, instr, operand_idx, cost_out):
operand = &instr->operands[operand_idx]
// Immediate operand: always available
if sub_7DEB90(operand, state->ctx):
return 1
// Must be a GPR (type 1) and not a definition (bit 31 clear)
type = (operand->value >> 28) & 7
if type != 1 or (operand->value_high & 1):
return 0
// Check if the source vreg has a single reaching definition
vreg = lookup_vreg(ctx, operand->value & 0xFFFFFF)
single_def = vreg->field_56
if single_def:
return sub_90B790(state, single_def, cost_out, false)
// Multiple definitions: walk the def-use chain
min_cost = UINT_MAX
for def in vreg->def_chain: // at instr->field_64 + 8*operand_idx
cost = sub_90B790(state, def->instruction, cost_out, false)
if cost == 0:
return 0 // any unavailable source -> reject
// For rematerializable defs, add depth cost
if def is rematerializable:
cost += (def->block_depth <= instr->block_depth) ? 1 : 0
min_cost = min(min_cost, cost)
return min_cost
Cost Model: sub_90B790
sub_90B790 (large function, ~350 lines) implements the core cost/benefit analysis. It returns a non-negative integer cost where:
- 0 = not profitable, do not rematerialize
- 1+ = profitable, higher values indicate cheaper remat
The function considers:
- Opcode-specific register consumption: Different opcodes produce different register-type results. sub_7E36C0, sub_7E40E0, sub_7E3790, sub_7E3800, sub_7E3640 extract per-operand register class (R/P/UR/UP) and width
- Live range length: Longer live ranges benefit more from remat
- Use count: Multiple uses may require multiple remat copies -- still profitable if the live range is long enough
- Block depth: Instructions in deeper loop nests get higher remat cost thresholds since the duplicated instruction executes more frequently
- Predication state: Predicated instructions have additional constraints on remat safety
- Pre-existing flags: If vreg+80 already has 0x80000001 set, the register is already a remat candidate
Cross-Block Rematerialization: sub_A0C540
sub_A0C540 (228 lines) handles rematerialization across basic block boundaries, invoked in the second pass of sub_A11060. It processes definitions that are used in different blocks:
function cross_block_remat(state, instr, changed_out):
// Walk operands in reverse order (destinations first)
for i in (instr->operand_count - 1) downto 0:
operand = instr->operands[i]
if (operand >> 28) & 7 != 1: // not a GPR
continue
if (operand_high & 1): // skip source operands
continue
vreg = lookup_vreg(ctx, operand & 0xFFFFFF)
if (vreg->flags_48 & 0x22) != 0: // skip special vregs
continue
if vreg->reg_index in [41..44]: // skip architectural predicates
continue
flags80 = vreg->field_80
if not (flags80 & 1): // not a remat candidate
continue
if vreg->use_count <= 0:
continue
if (flags80 & 2) and (flags80 & 4): // already fully processed
continue
// Compute instruction-level remat cost
cost = sub_91E860(ctx, instr, i)
if operand < 0: // definition
if cost <= 3:
vreg->field_80 |= 0x80000008 // commit remat
continue
// Remat profitable: insert remat copy
adjust_pressure(state, instr, -1) // sub_A0C4A0
duplicate_at_use(ctx, instr) // vtable dispatch +1280
adjust_pressure(state, instr, +1)
vreg->field_80 |= 0x80000008
vreg->flags_48 &= ~0x300000 // clear live-through bits
// Rebuild interference for affected ranges
adjust_pressure(state, instr, -1)
sub_92C0D0(ctx, instr, 0, ...) // clone instruction at use
adjust_pressure(state, instr, +1)
*changed_out = 1
Interaction with Register Allocator
The rematerialization flags set during phases 28 and 69 are consumed by the fat-point register allocator in several ways:
Remat Detection During Assignment: sub_93AC90
During per-instruction register assignment (sub_9680F0, 3722 lines), the allocator calls sub_93AC90 (29 lines) to check if a virtual register is a rematerialization candidate:
function check_remat_opportunity(alloc, vreg_index, reg_class):
if alloc->vreg_count == 0:
BUG()
entry = hash_lookup(alloc->remat_table, vreg_index)
cost = entry->field_144[reg_class]
if cost < entry->threshold:
return true
return (cost == entry->threshold) and (reg_class == entry->field_12)
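The decision above is a threshold comparison with a tie-break on register class. A minimal Python rendering is shown below. The RematEntry type, its field names, and the use of a dict for the per-class cost array are hypothetical stand-ins for entry->field_144, entry->threshold, and entry->field_12.

```python
from dataclasses import dataclass


@dataclass
class RematEntry:
    costs: dict            # hypothetical stand-in for field_144, keyed by reg class
    threshold: int
    preferred_class: int   # hypothetical name for field_12


def check_remat_opportunity(entry, reg_class):
    """Accept when the class cost is below threshold, or ties on the stored class."""
    cost = entry.costs[reg_class]
    if cost < entry.threshold:
        return True
    # Tie: accept only for the class recorded in the entry.
    return cost == entry.threshold and reg_class == entry.preferred_class
```

With a threshold of 5, a class costing 2 is accepted outright, a class costing exactly 5 is accepted only if it matches the stored class, and anything above 5 is rejected.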
Range Remat Cost: sub_99AA50
The live-range infrastructure at 0x994000--0x9A1000 includes remat-aware cost functions. sub_99AA50 (51 lines) inserts a rematerialization cost node into a range's cost linked list, enabling the allocator to compare spill cost against remat cost when choosing between spilling and rematerializing a value.
Spill-vs-Remat Decision
The allocator's main iteration driver (sub_9AEF60, 1415 lines) uses remat information to guide the spill-vs-remat tradeoff:
- During interference analysis, remat candidates get lower interference weights (they can be killed and recreated)
- When a spill is triggered, the allocator first checks if the value is rematerializable. If so, it inserts a remat copy instead of a spill/refill pair
- Remat linked lists are maintained at alloc+161..+175 in the per-class allocator state
Verification: sub_A55D80
The post-allocation verifier (sub_A55D80, referenced by "REMATERIALIZATION PROBLEM..." string) validates that rematerialization was applied correctly. Error case 7 in the verifier specifically checks that:
- The rematerialized instruction produces the same value as the original
- The reaching definitions before and after allocation match (modulo known-safe remat transformations)
- No rematerialized instruction references a register that was invalidated by the allocation
Operand Kind 7: Remat Markers
The Ori IR operand classification includes a dedicated "Remat" kind (value 7) that marks operands participating in rematerialization. This marker is orthogonal to the vreg+80 flags -- it exists in the instruction's operand descriptors and tells downstream passes that this particular use was created by rematerialization rather than by the original program.
The 10 operand kinds in the Ori IR:
| Kind | Name | Description |
|---|---|---|
| 0 | R register | General-purpose register |
| 1 | Offset | Memory offset |
| 2 | P/UP register | Predicate register |
| 3 | Any register | Wildcard |
| 4 | Regular | Immediate or constant |
| 5 | Predicated | Guard predicate |
| 6 | -- | (reserved) |
| 7 | Remat | Rematerialization marker |
| 8 | Spill-refill | Spill/refill pair |
| 9 | R2P/P2R | Register-to-predicate conversion |
Vreg Flags at Offset +80
The virtual register's field at offset +80 encodes rematerialization state through a bitmask:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Remat candidate -- this value CAN be recomputed |
| 1 | 0x2 | Remat source processed -- cross-block analysis done |
| 2 | 0x4 | Remat committed -- the allocator should prefer remat over spill |
| 31 | 0x80000000 | Depth marker / unvisited sentinel |
Common flag combinations:
- 0x80000001: Candidate identified by sub_A107B0, pending cost analysis
- 0x80000007: Candidate with predication awareness (stronger guarantee for predicated code paths)
- 0x80000008: Remat committed by cross-block analysis (sub_A0C540), allocator should use remat
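The combinations can be decoded mechanically against the bit table. The sketch below names only the four bits listed in the table and returns any other bits separately instead of guessing their meaning. In particular, 0x80000008 sets bit 3 (mask 0x8), which has no row in the table. The flag names and function name are illustrative.

```python
FLAG_NAMES = {
    0x1: "candidate",
    0x2: "source_processed",
    0x4: "committed",
    0x80000000: "depth_marker",
}


def decode_vreg_flags(value):
    """Return (named flags, leftover bits not named in the table)."""
    known = 0
    for bit in FLAG_NAMES:
        known |= bit
    named = [name for bit, name in FLAG_NAMES.items() if value & bit]
    leftover = value & ~known & 0xFFFFFFFF
    return named, leftover


for combo in (0x80000001, 0x80000007, 0x80000008):
    print(hex(combo), decode_vreg_flags(combo))
```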
Knobs and Configuration
| Knob ID | Role | Default | Notes |
|---|---|---|---|
| 487 | Gate for SinkRemat pass | (enabled) | Must be true for phase 28 to execute |
| 862 | Cutlass iteration limit | 5 | Max iterations in cutlass-specific iterative mode |
| 356 | Instruction count diagnostic | -- | Severity-2 warning when instruction count exceeds 32767 |
The optimization level gating:
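- Level <= 1 (-O0/-O1): Phases 28 and 69 both return immediately from their execute functions; phase 54 is skipped by default through isNoOp()
- Level 2 to 4: Phase 28 exits at the opt_level <= 4 guard in sub_913A30, so only phase 69 runs
- Level >= 5 (-O3+): Full sink+remat in phase 28, including the cutlass iterative mode, followed by phase 69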
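The level guards can be summarized as a small function. This sketch follows the guards shown in the SinkRemat_execute, sub_913A30, and OriDoRemat_execute pseudocode earlier on this page. It models only the optimization level and knob 487, and ignores the flag and feature checks inside sub_913A30 as well as phase 54. The function name is illustrative.

```python
def remat_phases_reached(opt_level, knob_487=True):
    """Which remat phases get past their level/knob guards.

    Only the opt-level and knob 487 guards are modeled; the flag and feature
    checks inside sub_913A30 are ignored here.
    """
    phases = []
    if opt_level > 1:
        if opt_level > 4 and knob_487:
            phases.append("SinkRemat")
        phases.append("OriDoRemat")
    return phases
```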
Function Map
Phase 28 (SinkRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5FC20 | sub_C5FC20 | 12 | Phase execute dispatcher |
| 0xC5F2E0 | sub_C5F2E0 | 7 | getName() -> returns 28 |
| 0xC5F2F0 | sub_C5F2F0 | 7 | isNoOp() -> returns 0 (always runs) |
| 0x913A30 | sub_913A30 | 131 | SinkRemat entry with knob/feature gating |
| 0xA0F020 | sub_A0F020 | 494 | Core sink+remat driver (block visitor loop) |
| 0x911030 | sub_911030 | 2408 | Per-block promotion/sinking engine |
| 0x90C010 | sub_90C010 | 70 | Source operand liveness check for remat |
| 0x90B790 | sub_90B790 | ~350 | Cost model: remat profitability analysis |
| 0x8F47E0 | sub_8F47E0 | 12 | Cutlass detection (strstr("cutlass")) |
| 0x8F4820 | sub_8F4820 | -- | Build remat worklist |
| 0x8F4F90 | sub_8F4F90 | -- | Apply remat decisions from worklist |
Phase 54 (OriDoRematEarly)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5EF30 | sub_C5EF30 | 7 | Phase execute: writes ctx+1552 = 4 |
| 0xC5EF40 | sub_C5EF40 | 7 | getName() -> returns 54 |
| 0xC5EF50 | sub_C5EF50 | 7 | isNoOp() -> returns 1 |
Phase 54 is a degenerate phase. Its execute body is a single store: *(ctx + 1552) = 4. Its isNoOp() returns 1, so the dispatch loop skips execute() by default -- the phase does nothing unless an architecture backend overrides the vtable to activate it. When active, the store sets the pipeline progress counter at ctx+1552 to 4. sub_A11060 later tests this counter with *(ctx+1552) > 4 before running its cross-block second pass; because that comparison is strict, the counter must have advanced beyond the value written here for the second pass to run.
Phase 69 (OriDoRemat)
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0xC5F910 | sub_C5F910 | 24 | Phase execute dispatcher |
| 0xC5ED50 | sub_C5ED50 | 7 | getName() -> returns 69 |
| 0xC5ED60 | sub_C5ED60 | 7 | isNoOp() -> returns 0 (always runs) |
| 0xA112C0 | sub_A112C0 | 245 | Late remat entry + cleanup |
| 0xA0C310 | sub_A0C310 | 45 | RematState initialization |
| 0xA11060 | sub_A11060 | 155 | Per-iteration remat worker |
| 0xA107B0 | sub_A107B0 | 316 | Per-instruction remat decision |
| 0xA105F0 | sub_A105F0 | 77 | Sinkability check (opcode 0x5F) |
| 0xA10DF0 | sub_A10DF0 | 138 | MOV chain analysis (FNV-1a hash table) |
| 0xA0C540 | sub_A0C540 | 228 | Cross-block rematerialization |
| 0xA0C4A0 | sub_A0C4A0 | -- | Pressure adjustment (+1 or -1) |
| 0xA0C410 | sub_A0C410 | -- | Remat profitability check for a vreg |
Register Allocator Integration
| Address | Function | Size (lines) | Role |
|---|---|---|---|
| 0x93AC90 | sub_93AC90 | 29 | Remat opportunity check during assignment |
| 0x99A9D0 | sub_99A9D0 | 38 | Range rematerialization cost cleanup |
| 0x99AA50 | sub_99AA50 | 51 | Range rematerialization cost insertion |
| 0x9AEF60 | sub_9AEF60 | 1415 | Main allocation driver with remat support |
| 0xA55D80 | sub_A55D80 | ~800 | Post-allocation remat verification |
Sinking vs. Rematerialization
The SinkRemat pass (phase 28) and the late OriDoRemat pass (phase 69) both move instructions closer to their uses, but through fundamentally different mechanisms:
Sinking moves the original instruction. The definition is physically relocated from its original position to a dominated block closer to the use. This does not increase instruction count but may change schedule. Sinking is legal only when:
- The instruction has exactly one use
- The use is in a block dominated by the current definition block
- Moving the instruction does not cross any barrier or synchronization point
- All source operands remain available at the new position
Rematerialization duplicates the instruction. The original definition remains in place (or is deleted if dead), and a fresh copy is inserted near each use. This increases instruction count but can dramatically reduce register pressure. Remat is legal for any instruction in the opcode whitelist, subject to:
- All source operands available at the use point
- The cost model approves the duplication
- The instruction has no side effects
The sub_A105F0 sinkability check runs first in sub_A107B0. Only if sinking fails does the function proceed to the rematerialization path. This prioritizes the cheaper transformation (sinking = zero instruction overhead) before falling back to the more expensive one (remat = duplicated instructions).
Architectural Notes
The three-phase structure with an interleaved flag-setter (phase 54) suggests the rematerialization infrastructure evolved over multiple ptxas generations. Phase 54's isNoOp() = 1 default means its execute() is skipped unless an architecture backend activates it by overriding the vtable. This indicates the phase was likely once a full pass that was later simplified to a flag write, with its analysis logic migrated into phase 69.
The CUTLASS-specific iterative mode in phase 28 (sub_913A30) reveals that NVIDIA's matrix-multiply library is important enough to warrant dedicated compiler heuristics. The strstr("cutlass") check is a name-based pattern match on the function name, not a property of the IR itself. This coupling between compiler optimization and library naming conventions is a pragmatic choice for a production compiler targeting known workloads.
The FNV-1a hash (constant 0x811C9DC5, prime 16777619) appears in both the rematerialization infrastructure (sub_A10DF0 for MOV chain tracking) and the register allocator (sub_926A30 for interference). This shared hash implementation is one of ptxas's standard infrastructure components (see Hash Tables & Bitvectors).
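For reference, the two constants quoted above are the standard 32-bit FNV-1a offset basis and prime. The sketch below is the textbook algorithm over a byte string. What ptxas actually feeds into the hash (key layout, key width) is not described on this page, so the byte-string input is an assumption.

```python
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 16777619   # 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


print(hex(fnv1a_32(b"a")))
```

The XOR-then-multiply order is what distinguishes FNV-1a from FNV-1, which multiplies first.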