Sinking2 (NVIDIA Code Sinking)
sinking2 is an NVIDIA-proprietary instruction sinking pass that moves instructions closer to their uses, with specific awareness of GPU texture and surface memory operations. It is entirely distinct from LLVM's stock sink pass: while both perform code sinking, Sinking2 is tailored for NVIDIA's memory hierarchy and iterates to a fixed point rather than making a single pass. The primary motivation is reducing register pressure by deferring computation of values until just before they are consumed, which is especially impactful on GPUs where register files are shared across hundreds of concurrent threads.
The pass is particularly focused on sinking instructions into texture load blocks. Texture operations on NVIDIA GPUs have high latency but are served by a dedicated cache; by sinking the address computation and other operands into the block that performs the texture fetch, the compiler reduces the live range of those values and frees registers for other warps. This directly improves occupancy -- the number of warps that can execute simultaneously on an SM.
Pipeline Position
| Field | Value |
|---|---|
| Pass name (pipeline) | sinking2 |
| Pass ID | sink2 |
| Display name | Code sinking |
| Pass type | FunctionPass (NVIDIA-custom) |
| Class | llvm::Sinking2Pass |
| Registration | New PM #390, line 2282 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #81 (NVVMSinking2 via sub_1CC60B0, gated by opts[3328] && !opts[2440]); see Pipeline |
| Legacy PM entry | sub_1CCA270 |
| New PM entry | sub_2D1C160 (19KB) |
| Legacy PM registration | sub_1CC7010 |
| New PM registration | sub_2D1B410 |
| Knob constructor | ctor_275 at 0x4F7750 |
| Vtable (Legacy) | off_49F8BC0 |
| Vtable (New PM) | off_4A260F0 |
Relationship to All Sink Passes in cicc
CICC v13.0 contains five distinct sinking mechanisms. Understanding which is which is essential when reading the pipeline or debugging register pressure issues:
| Pass ID / Factory | Class | Origin | Key Difference |
|---|---|---|---|
| sink / sub_1A634D0 | LLVM SinkingPass | Upstream LLVM | Stock single-pass sinking, uses MemorySSA for alias safety |
| sink2 / sub_1CCA270 | llvm::Sinking2Pass | NVIDIA | Texture-aware, iterative fixpoint, custom AA layer |
| sink&lt;rp-aware&gt; | Parameterized variant | LLVM + NVIDIA | Register-pressure-aware sinking (stock sink with rp-aware-sink=true) |
| NVVMSinking2 / sub_1CC60B0 | NVIDIA late sinking | NVIDIA | Late-pipeline SM-specific sinking, gated by opts[3328] |
| MachineSink | LLVM MachineSinking | LLVM | MIR-level sinking, opt-in for NVPTX via nvptx-enable-machine-sink |
The stock LLVM sink (sub_1869C50, called with params (1,0,1)) uses MemorySSA for alias queries and makes a single pass. Sinking2 uses its own alias analysis layer routed through sub_13575E0 and iterates to convergence. NVVMSinking2 (sub_1CC60B0) is a separate NVIDIA pass that runs late in the pipeline after barrier lowering and warp-level optimizations, gated by the SM-specific pass group flag opts[3328].
IR Before/After Example
The pass sinks address computation closer to texture/surface use sites, reducing register pressure by shortening live ranges.
Before (address computation in preheader, live across loop body):
preheader:
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions using registers, %base and %addr are live ...
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %val = call <4 x float> @llvm.nvvm.tex.unified.1d.v4f32.f32(ptr addrspace(1) %tex_addr)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
After (address computation sunk into loop, next to texture use):
preheader:
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions, but %base and %addr are no longer live here ...
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %val = call <4 x float> @llvm.nvvm.tex.unified.1d.v4f32.f32(ptr addrspace(1) %tex_addr)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
The GEP instructions now execute inside the loop (higher execution count) but free registers in the rest of the loop body. This is a deliberate tradeoff: extra ALU work for reduced register pressure, which typically improves occupancy and net throughput.
Algorithm
Entry Point
The legacy PM entry sub_1CCA270 performs these steps:
- Fetches the DominatorTree analysis (via DominatorTreeWrapperPass at unk_4F9E06C)
- Fetches the LoopInfo analysis (via LoopInfoWrapperPass at unk_4F96DB4)
- Reads the sink-into-texture knob (qword_4FBF2C0[20]) -- must be non-zero (enabled)
- Reads the sink-limit knob (qword_4FBF1E0[20]) -- must be greater than zero
- Calls the main worklist driver sub_1CC9110
The New PM entry sub_2D1C160 (19KB) performs the same logic using AnalysisManager to fetch analyses, then dispatches to sub_2D1CFB0 (13KB).
The pass does not require ScalarEvolution (SCEV), MemorySSA, or PostDominatorTree, keeping it simpler and cheaper than loop-oriented or MemorySSA-dependent passes.
Main Worklist Driver (sub_1CC9110, 22KB)
The core algorithm is a fixpoint iteration over the dominator tree:
function SinkingWorklist(F, DT, LI, textureLevel, sinkLimit):
changed = false
do:
roundChanged = false
sinkCount = 0
// Walk dominator tree in DFS preorder
for BB in DT.dfs_preorder():
// Skip loop headers to avoid creating loop-carried deps
if LI.isLoopHeader(BB):
continue
// Process instructions bottom-up within each block
for I in reverse(BB.instructions()):
if sinkCount >= sinkLimit:
break // complexity limiter
if I.mayHaveSideEffects() or I.isTerminator():
continue // unsinkable
if I.use_empty():
continue // dead, leave for DCE
// Level 3: consider instructions used only outside BB
if textureLevel < 3 and allUsesInSameBlock(I, BB):
continue
targetBB = findBestSinkTarget(I, DT, LI) // sub_1CC7510
if targetBB == BB:
continue // already in best position
// Profitability: prefer texture/surface blocks
if textureLevel >= 1:
if not blockContainsTextureOps(targetBB):
if not dominatesTextureBlock(targetBB, DT):
continue // not profitable
// Safety: alias analysis check
if not isSafeToSink(I, BB, targetBB): // sub_1CC8920
continue
// Safety: memory dependency check
if not checkMemDep(I, BB, targetBB): // sub_1CC8CA0
continue
I.moveBefore(targetBB.firstNonPHI())
roundChanged = true
sinkCount++
changed |= roundChanged
while roundChanged // iterate until no more changes
return changed
Key design points:
- DFS preorder ensures parent blocks are processed before children. Instructions sunk from a parent into a child on one iteration may expose further sinking opportunities for grandchild blocks on the next iteration -- hence the fixpoint loop.
- Bottom-up within each block processes the last instruction first. This is important because sinking an instruction may make an earlier instruction's operands dead, which DCE will clean up later.
- Loop headers are skipped to prevent creating loop-carried dependencies (a value defined in the header, consumed in the latch, sunk into the latch would create a cycle).
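The need for the fixpoint loop can be seen in a toy model: if i2 feeds i1, and i1's only user lives in another block, the first round can only sink i1 (i2's user is still local); the second round then sinks i2 as well. The records below are illustrative, not the binary's actual data structures:

```python
# Toy worklist: each instruction has a defining block and a user list.
insts = {
    "i2": {"block": "A", "users": ["i1"]},
    "i1": {"block": "A", "users": ["use"]},
    "use": {"block": "C", "users": []},
}

def sink_round(insts):
    changed = False
    for name, inst in insts.items():
        if not inst["users"]:
            continue
        user_blocks = {insts[u]["block"] for u in inst["users"]}
        if len(user_blocks) == 1:
            (target,) = user_blocks
            if target != inst["block"]:
                inst["block"] = target  # sink toward the sole user block
                changed = True
    return changed

rounds = 0
while sink_round(insts):
    rounds += 1
# Convergence takes two rounds: i1 moves in round 1, i2 in round 2.
```

A single-pass sinker (stock LLVM sink) would leave i2 behind in this situation; the fixpoint iteration is what lets whole address-computation chains follow their consumer.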
Instruction Processing (sub_1CC7510, 16KB)
For each candidate instruction, this function:
- Walks the use chain to find all consumers (via sub_15F4D60, multi-use check)
- For each user, determines the containing basic block
- Computes the lowest common dominator (LCD) of all user blocks using the dominator tree
- If LCD == current block, no benefit from sinking -- the instruction is already as close to its uses as possible while dominating all of them
- Builds a sink mapping: instruction to target block
- Checks memory safety via alias analysis (sub_13575E0)
- Validates that sinking does not violate memory ordering constraints
- Respects PHI nodes (LLVM opcode PHI) as sink boundaries -- an instruction cannot be sunk past a PHI insertion point
The target block selection algorithm effectively finds the nearest common dominator of all uses that is strictly dominated by the current block. If the instruction has a single use, the target is trivially the use's block (or its immediate dominator if the use is a PHI operand).
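The LCD computation itself reduces to intersecting dominator chains. A minimal sketch over an explicit immediate-dominator map (the tree shape here is made up for illustration; the real pass queries LLVM's DominatorTree):

```python
# idom maps each block to its immediate dominator; entry is its own idom.
idom = {"entry": "entry", "A": "entry", "B": "A", "C": "A", "D": "B"}

def dom_chain(b):
    # Returns [b, idom(b), ..., entry].
    chain = [b]
    while idom[b] != b:
        b = idom[b]
        chain.append(b)
    return chain

def nearest_common_dominator(blocks):
    # Intersect the dominator chains of all user blocks, then pick the
    # deepest block in the intersection: it dominates every use.
    common = set(dom_chain(blocks[0]))
    for b in blocks[1:]:
        common &= set(dom_chain(b))
    return max(common, key=lambda b: len(dom_chain(b)))
```

For uses in C and D this yields A, the lowest block that still dominates both; a single-use instruction degenerates to the use's own block, matching the trivial case described above.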
Dominance Ordering (sub_1CC8170, 13KB)
Implements a hash-based ordering of basic blocks for comparing sink profitability. Uses DFS numbering from the dominator tree to determine which block comes "earlier" in the program. This ordering ensures:
- Instructions are only sunk toward uses, never away from them
- When multiple sink targets exist (multi-use instruction), the lowest common dominator is chosen
- The ordering is consistent across iterations of the fixpoint loop
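The standard way to get such an ordering is DFS interval numbering on the dominator tree: A dominates B iff A's interval contains B's. A sketch with an illustrative tree shape:

```python
# Dominator tree as parent -> children; numbering is DFS entry/exit time.
children = {"entry": ["A"], "A": ["B", "C"], "B": [], "C": []}

def number(root):
    tin, tout, clock = {}, {}, [0]
    def dfs(n):
        tin[n] = clock[0]; clock[0] += 1
        for c in children.get(n, []):
            dfs(c)
        tout[n] = clock[0]; clock[0] += 1
    dfs(root)
    return tin, tout

tin, tout = number("entry")

def dominates(a, b):
    # A dominates B iff A's DFS interval encloses B's.
    return tin[a] <= tin[b] and tout[b] <= tout[a]
```

This makes every dominance query O(1) after one tree walk, which is what keeps repeated profitability comparisons cheap inside a fixpoint loop.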
Alias Checking (sub_1CC8920, 4KB)
Validates that moving instruction I from block From to block To does not reorder I past any conflicting memory access:
function isSafeToSink(I, From, To):
if not I.mayReadOrWriteMemory():
return true // pure computation, always safe
// Walk all instructions on the domtree path From -> To
for BB in pathBlocks(From, To):
for J in BB.instructions():
if J == I: continue
if AA.getModRefInfo(I, J) != NoModRef:
return false // conflict: I aliases with J
return true
This is not MemorySSA-based (unlike stock LLVM sink). The pass invokes the traditional AliasAnalysis query interface through sub_13575E0. This is less precise than MemorySSA but avoids the cost of building and maintaining the MemorySSA graph, which matters because Sinking2 iterates to fixpoint and would need to update MemorySSA on every move.
Memory Dependency Checking (sub_1CC8CA0, 6KB)
Additional memory safety layer beyond alias checking:
- Store-load forwarding: if I is a load and there is a store between From and To that may alias the loaded location, sinking would change the value loaded
- Store ordering: if I is a store, moving it past another store to a potentially-aliasing location changes program semantics
- Volatile/atomic barrier: volatile loads/stores and atomic operations are never sunk (treated as having side effects)
- Synchronization intrinsics: barrier calls (__syncthreads, bar.sync) are treated as memory fences; no instruction may be sunk past them
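These rules can be sketched as a predicate over a candidate instruction `inst` moving past an instruction `other` on the sink path. The record shapes and the `loc`-equality alias test are hypothetical simplifications (the real pass queries alias analysis); the store-past-load case is included as the natural conservative companion to the listed store-ordering rule:

```python
def may_alias(a, b):
    # Toy alias model: identical abstract locations conservatively alias.
    return a.get("loc") is not None and a.get("loc") == b.get("loc")

def may_sink_past(inst, other):
    if inst.get("volatile") or inst.get("atomic"):
        return False  # volatile/atomic operations are never sunk
    if other["kind"] == "barrier":
        return False  # synchronization acts as a memory fence
    if inst["kind"] == "load" and other["kind"] == "store" and may_alias(inst, other):
        return False  # would change the loaded value (store-load forwarding)
    if inst["kind"] == "store" and other["kind"] in ("load", "store") and may_alias(inst, other):
        return False  # would reorder aliasing accesses (store ordering)
    return True
```

A load can thus freely pass a store to a provably different location, but never a barrier or an aliasing store.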
Texture/Surface Awareness
The pass identifies "texture blocks" -- basic blocks containing calls to texture/surface intrinsics (the tex.*, suld.*, sust.* family). Address computations that feed these intrinsic calls are the primary sink candidates, because texture address computation chains (GEP + index arithmetic) produce intermediate values that are consumed only at the texture fetch site. Without sinking, these intermediates occupy registers across potentially many instructions.
The sink-into-texture knob controls aggressiveness:
| Level | Behavior |
|---|---|
| 0 | Disabled -- no texture-aware sinking |
| 1 | Cross-block only: move instructions across block boundaries into texture blocks |
| 2 | Cross-block + intra-block: also reorder instructions within a block to position them immediately before their texture use |
| 3 (default) | All of the above + outside-only: consider instructions whose only uses are in blocks other than where the instruction is defined |
Level 3 catches the important case where a GEP in a preheader feeds a texture load inside a loop -- the GEP has no uses in its own block, only "outside" uses.
Address space checks for NVPTX (see reference/address-spaces):
- AS 1 (global): may alias with texture reads in some configurations
- AS 3 (shared): texture operations never access shared memory, so shared-space stores are not barriers to texture sinking
- AS 4 (const): texture/surface descriptors typically live in constant memory
- AS 5 (local): thread-local, no cross-thread interference
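Condensed into a filter, the rules above say only global-space stores on the sink path can ever conflict with a texture fetch (a hypothetical helper name; AS numbers are NVPTX's):

```python
# NVPTX address-space IDs relevant to the check.
GLOBAL, SHARED, CONST, LOCAL = 1, 3, 4, 5

def store_may_block_texture_sink(addr_space):
    # Texture hardware never reads shared or local memory, and constant
    # memory is read-only, so only global-space stores can conflict
    # (and even then only in some configurations).
    return addr_space == GLOBAL
```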
Loop Considerations
Sinking2 is loop-aware but conservative:
- Never sinks OUT of a loop: moving an instruction from a loop body to an exit block would change its execution count. The pass skips this entirely.
- May sink INTO loop bodies: when an instruction in a loop preheader feeds only uses inside the loop (particularly texture fetches), sinking it into the loop is profitable despite increasing execution count -- the register pressure reduction from shorter live ranges outweighs the extra computation.
- Skips loop headers: prevents creating loop-carried dependencies.
- Runs after LoopSimplify: the early instance (sub_18B1DE0) runs after LoopSimplify/LCSSA have canonicalized loop structure, so preheaders, latches, and exit blocks are well-formed.
This creates a deliberate tension with LICM:
- LICM hoists loop-invariant code into the preheader (reducing execution count)
- Sinking2 sinks non-invariant address computation out of the preheader and into the loop body (reducing register pressure)
The two passes run at different pipeline positions and balance each other. LICM runs first; Sinking2 runs after GVN and CGSCC inlining, when texture patterns are fully exposed.
Barrier Awareness
Sinking2 itself does not contain explicit __syncthreads / bar.sync detection logic. Instead, it relies on the LLVM side-effect model:
- Barrier intrinsics are marked as having side effects, so they are never sunk
- Barrier intrinsics are treated as memory fences by alias analysis, so no memory instruction may be sunk past them
The late NVVMSinking2 (sub_1CC60B0) runs after barrier lowering (sub_1CB73C0) and warp-level optimization passes. By that point, barriers have been lowered to their final form. The pipeline ordering is:
NVVMBranchDist -> NVVMWarpShuffle -> NVVMReduction -> NVVMSinking2
This sequence ensures NVVMSinking2 can sink past warp-level operations that are no longer opaque barriers, while still respecting the lowered barrier representation.
Multi-Run Pipeline Pattern
Sinking2 appears at three to four pipeline positions. Each run has different context and different opportunities:
| Position | Factory | Mode | Context |
|---|---|---|---|
| Early (pass ~39) | sub_18B1DE0() | Standard | After stock Sink, GVN, and CGSCC inlining. Texture patterns are exposed. |
| Post-peephole | sub_18B3080(1) | Fast (flag=1) | After NVVMPeephole. Peephole may create new sinking opportunities. Reduced iteration budget. |
| Late SM-specific | sub_1CC60B0() | SM-gated | After barrier lowering and warp shuffle. Gated by opts[3328] && !opts[2440]. |
For fast-compile mode (Ofcmax), only sub_18B3080(1) runs -- the single Sinking2 in fast mode with reduced iteration budget. No stock Sink, no NVVMSinking2.
The rationale for multiple runs:
- Run 1 (stock Sink) handles straightforward cases using MemorySSA's precise alias information
- Run 2 (Sinking2 early) performs texture-aware sinking now that GVN/CGSCC have simplified the IR
- Run 3 (Sinking2 fast) cleans up opportunities created by peephole optimization
- Run 4 (NVVMSinking2) performs SM-specific late sinking after barrier and warp-level transforms
NVVMPassOptions Gating
| Offset | Type | Effect |
|---|---|---|
| opts[1040] | bool | Disable stock Sink/MemSSA |
| opts[2440] | bool | Disable NVVMSinking2 (sub_1CC60B0) |
| opts[3328] | bool | Enable SM-specific warp/reduction/sinking pass group (gates NVVMSinking2) |
Cost Model (New PM)
The New PM object (176 bytes) contains floating-point thresholds at offsets +88 and +144, both initialized to 1065353216 (IEEE 754 1.0f). These thresholds suggest the New PM implementation has a more sophisticated cost model than the Legacy PM version:
- Profitability threshold (+88): minimum benefit score for a sink to be accepted. A value of 1.0 means the benefit must at least equal the cost.
- Cost threshold (+144): maximum acceptable cost for the sinking motion itself. A value of 1.0 means the movement cost must not exceed the baseline.
The Legacy PM version uses a simpler boolean profitability model (is the target a texture block? yes/no).
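The raw value 1065353216 observed in the object can be decoded by reinterpreting the 32-bit integer as an IEEE 754 single:

```python
import struct

def bits_to_float(bits):
    # Reinterpret a 32-bit integer as an IEEE 754 single-precision float,
    # the way the threshold fields at +88/+144 are stored in the object.
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

`bits_to_float(1065353216)` (0x3F800000) yields 1.0, confirming the default thresholds.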
Configuration Knobs
Sinking2-Specific (ctor_275 at 0x4F7750)
| Knob | Type | Default | Storage | Description |
|---|---|---|---|---|
| sink-into-texture | int | 3 | qword_4FBF2C0 | Texture sinking aggressiveness (0=off, 1=cross-block, 2=+intra, 3=+outside-only) |
| sink-limit | int | 20 | qword_4FBF1E0 | Max instructions to sink per invocation (complexity limiter) |
| dump-sink2 | bool | false | qword_4FBF100 | Dump debug information during sinking |
Related Sinking Knobs (other passes, NOT Sinking2)
| Knob | Type | Default | Owner | Description |
|---|---|---|---|---|
| sink-check-sched | bool | true | stock Sink | Check scheduling effects of sinking |
| sink-single-only | bool | true | stock Sink | Only sink single-use instructions |
| rp-aware-sink | bool | false | stock Sink | Consider register pressure (controls sink&lt;rp-aware&gt; variant) |
| max-uses-for-sinking | int | (default) | stock Sink | Don't sink insts with too many uses |
| sink-ld-param | bool | (default) | NVPTX backend | Sink one-use ld.param to use point |
| hoist-load-param | bool | (default) | NVPTX backend | Hoist all ld.param to entry block (counterpart to sink-ld-param) |
| enable-andcmp-sinking | bool | (default) | CodeGenPrepare | Sink and/cmp into branches |
| aggressive-no-sink | bool | (default) | (unknown) | Sink all generated instructions |
| instcombine-code-sinking | bool | (default) | InstCombine | Enable code sinking within instcombine |
| nvptx-enable-machine-sink | bool | (default) | NVPTX backend | Enable MIR-level MachineSink |
| SinkRematEnable | bool | (default) | ptxas | Enable sink+rematerialization in ptxas |
Analysis Dependencies
| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (unk_4F9E06C) | DominatorTreeAnalysis (sub_CF6DB0) | Dominator tree for sink legality and ordering |
| LoopInfoWrapperPass (unk_4F96DB4) | LoopAnalysis (sub_B1A2E0) | Avoid sinking out of loops; skip loop headers |
Does not require: SCEV, MemorySSA, PostDominatorTree, BranchProbabilityInfo.
This is a key difference from stock LLVM SinkingPass, which requires MemorySSAAnalysis. Sinking2 uses its own alias analysis queries through helpers sub_1CC8920 and sub_1CC8CA0, routed through the traditional AA interface at sub_13575E0. This avoids the overhead of building/maintaining MemorySSA across fixpoint iterations.
Pass Object Layout
Legacy PM (160 bytes):
| Offset | Type | Content |
|---|---|---|
| +0 | ptr | Vtable pointer (off_49F8BC0) |
| +8 | ptr | Pass link (next pass in chain) |
| +16 | ptr | Pass ID pointer (&unk_4FBF0F4) |
| +24 | int32 | Mode (default=3, from sink-into-texture) |
| +28 | int32 | Sink limit (default=20, from sink-limit) |
| +32--48 | ptr[3] | Worklist data (head, tail, size) |
| +56 | ptr | DominatorTree* (set during runOnFunction) |
| +64 | ptr | List head 1 (self-referential sentinel) |
| +72--80 | ptr[2] | List next/prev 1 |
| +96 | int64 | Counter (sink count for current iteration) |
| +104 | ptr | LoopInfo* (set during runOnFunction) |
| +112 | ptr | List head 2 (self-referential sentinel) |
| +120--128 | ptr[2] | List next/prev 2 |
| +144 | int64 | Data field |
| +152 | byte | Changed flag (for fixpoint termination) |
New PM (176 bytes): two embedded worklists and float thresholds at offsets +88 and +144 (value 1065353216 = 1.0f IEEE 754).
Differences from Upstream LLVM
| Aspect | Upstream LLVM sink | NVIDIA sinking2 |
|---|---|---|
| Alias analysis backend | MemorySSA | Custom AA layer (sub_13575E0) |
| Iteration strategy | Single pass | Fixpoint iteration |
| Texture awareness | None | 3-level configurable |
| Address space awareness | Generic | NVPTX-specific (AS 1,3,4,5) |
| Complexity limiter | None | sink-limit knob (default=20) |
| Intra-block reordering | No | Level >= 2 |
| Outside-only pattern | No | Level == 3 |
| Debug dump | Standard LLVM debug | dump-sink2 knob |
| Cost model | Boolean (profitable or not) | Float thresholds in New PM |
| Pipeline occurrences | 1 | 3--4 (multi-run strategy) |
| Fast-compile variant | Same pass | Dedicated fast=1 mode |
Diagnostic Strings
| String | Context |
|---|---|
| "llvm::Sinking2Pass]" | RTTI name at sub_2315E20 |
| "sink2" | Pipeline parser ID |
| "Code sinking" | Display name (shared with stock LLVM sink) |
| "sinking2" | New PM pipeline string match |
Function Map
| Function | Size | Role |
|---|---|---|
| sub_1CC7010 | -- | Legacy PM pass registration |
| sub_1CC7100 | -- | Legacy PM factory |
| sub_1CC71E0 | -- | Legacy PM alternate factory |
| sub_1CC7510 | 16KB | processInstruction: sink candidate evaluation, use-chain walk, LCD computation |
| sub_1CC8170 | 13KB | Dominance ordering: DFS numbering for block comparison |
| sub_1CC8920 | 4KB | Alias checking helper: validates no conflicting memory accesses on path |
| sub_1CC8CA0 | 6KB | Memory dependency helper: store-load forwarding, store ordering, volatile |
| sub_1CC9110 | 22KB | Main worklist driver: fixpoint iteration over dominator tree |
| sub_1CCA270 | -- | Legacy PM runOnFunction entry |
| sub_2D1B410 | -- | New PM pass registration |
| sub_2D1BC50 | -- | New PM factory |
| sub_2D1C160 | 19KB | New PM run() entry |
| sub_2D1CFB0 | 13KB | New PM core logic |
| sub_2D1D770 | 7KB | New PM helper |
| sub_2D1DCF0 | 7KB | New PM helper |
| sub_2315E20 | -- | RTTI name printer |
| 0x4F7750 | -- | Knob constructor (ctor_275) |
Related pipeline factories:
| Address | Role |
|---|---|
| sub_18B1DE0 | Sinking2 early-pipeline factory |
| sub_18B3080 | Sinking2 fast-mode factory (accepts fast flag parameter) |
| sub_1CC60B0 | NVVMSinking2 late-pipeline factory |
| sub_1A634D0 | Stock LLVM Sink legacy PM registration |
| sub_29776B0 | Stock LLVM Sink New PM registration |
| sub_1B51110 | Stock Sink core (51KB, creates .sink.split / .sink blocks) |
| sub_1869C50 | Stock Sink pipeline factory (called with params 1,0,1) |
Total code size: ~80KB (Legacy PM) + ~65KB (New PM) = ~145KB
GPU-Specific Motivation
Register pressure directly determines occupancy -- each additional live register per thread reduces the number of warps available for latency hiding, with discrete cliff boundaries where a single register can drop an entire warp group.
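The cliff behavior falls out of a simple budget model. The numbers below (64K-entry register file, 48-warp cap) are illustrative of one SM generation, not universal, and real hardware also rounds allocations up to a granularity, which sharpens the cliffs further:

```python
def warps_per_sm(regs_per_thread, regfile=65536, warp_size=32, max_warps=48):
    # One register file shared by all resident warps: each warp consumes
    # regs_per_thread * warp_size registers, and the SM caps warp count.
    return min(max_warps, regfile // (regs_per_thread * warp_size))
```

At 32 or 40 registers per thread the warp cap binds (48 warps); at 64 registers occupancy halves to 32 warps; and going from 64 to 65 registers alone drops it to 31, so shaving even one live register at the pressure peak can recover warps.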
Sinking instructions closer to their uses shortens live ranges and reduces the peak number of simultaneously live registers. This is especially valuable for texture load sequences, which typically involve address computation (GEP chains, index arithmetic) that produces values consumed only at the texture fetch site. Without sinking, these intermediate values occupy registers across potentially many instructions, bloating register pressure unnecessarily.
The three-level sink-into-texture design reflects a graduated approach to this optimization: level 1 handles the common case (cross-block sinking), level 2 adds intra-block reordering for tighter packing, and level 3 (the default) handles the edge case where an instruction's only uses are in blocks other than where it is defined, enabling more aggressive motion.
The multi-run pattern (early Sinking2, post-peephole fast Sinking2, late NVVMSinking2) ensures that sinking opportunities created by other optimization passes are captured throughout the pipeline, rather than relying on a single sinking point that may miss opportunities not yet exposed.
Cross-References
- Dead Synchronization Elimination -- runs earlier, removes barriers that Sinking2 would otherwise treat as memory fences
- LICM -- counterpart: hoists loop-invariant code into preheaders; Sinking2 sinks address computation out of preheaders
- NVVMPeephole -- runs before late Sinking2, may create new sinking opportunities
- Rematerialization -- runs after all sinking; rematerialization + sinking together minimize register pressure (ptxas SinkRematEnable knob)
- MemorySpaceOpt -- changes address spaces, which affects sinking profitability
- NVVMPassOptions -- opts[1040] disables stock Sink; opts[2440] disables NVVMSinking2
- Register Allocation -- ultimate consumer of the register pressure reduction that sinking provides
- Optimization Levels -- Ofcmax runs only fast-mode Sinking2; O2/O3 run full multi-run pattern