
Sinking2 (NVIDIA Code Sinking)

sinking2 is an NVIDIA-proprietary instruction sinking pass that moves instructions closer to their uses, with specific awareness of GPU texture and surface memory operations. It is entirely distinct from LLVM's stock sink pass: while both perform code sinking, Sinking2 is tailored for NVIDIA's memory hierarchy and iterates to a fixed point rather than making a single pass. The primary motivation is reducing register pressure by deferring computation of values until just before they are consumed, which is especially impactful on GPUs where register files are shared across hundreds of concurrent threads.

The pass is particularly focused on sinking instructions into texture load blocks. Texture operations on NVIDIA GPUs have high latency but are served by a dedicated cache; by sinking the address computation and other operands into the block that performs the texture fetch, the compiler reduces the live range of those values and frees registers for other warps. This directly improves occupancy -- the number of warps that can execute simultaneously on an SM.

Pipeline Position

| Field | Value |
|---|---|
| Pass name (pipeline) | sinking2 |
| Pass ID | sink2 |
| Display name | Code sinking |
| Pass type | FunctionPass (NVIDIA-custom) |
| Class | llvm::Sinking2Pass |
| Registration | New PM #390, line 2282 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #81 (NVVMSinking2 via sub_1CC60B0, gated by opts[3328] && !opts[2440]); see Pipeline |
| Legacy PM entry | sub_1CCA270 |
| New PM entry | sub_2D1C160 (19KB) |
| Legacy PM registration | sub_1CC7010 |
| New PM registration | sub_2D1B410 |
| Knob constructor | ctor_275 at 0x4F7750 |
| Vtable (Legacy) | off_49F8BC0 |
| Vtable (New PM) | off_4A260F0 |

Relationship to All Sink Passes in cicc

CICC v13.0 contains five distinct sinking mechanisms. Understanding which is which is essential when reading the pipeline or debugging register pressure issues:

| Pass ID / Factory | Class | Origin | Key Difference |
|---|---|---|---|
| sink / sub_1A634D0 | LLVM SinkingPass | Upstream LLVM | Stock single-pass sinking, uses MemorySSA for alias safety |
| sink2 / sub_1CCA270 | llvm::Sinking2Pass | NVIDIA | Texture-aware, iterative fixpoint, custom AA layer |
| sink<rp-aware> | Parameterized variant | LLVM + NVIDIA | Register-pressure-aware sinking (stock sink with rp-aware-sink=true) |
| NVVMSinking2 / sub_1CC60B0 | NVIDIA late sinking | NVIDIA | Late-pipeline SM-specific sinking, gated by opts[3328] |
| MachineSink | LLVM MachineSinking | LLVM | MIR-level sinking, opt-in for NVPTX via nvptx-enable-machine-sink |

The stock LLVM sink (sub_1869C50, called with params (1,0,1)) uses MemorySSA for alias queries and makes a single pass. Sinking2 uses its own alias analysis layer routed through sub_13575E0 and iterates to convergence. NVVMSinking2 (sub_1CC60B0) is a separate NVIDIA pass that runs late in the pipeline after barrier lowering and warp-level optimizations, gated by the SM-specific pass group flag opts[3328].

IR Before/After Example

The pass sinks address computation closer to texture/surface use sites, reducing register pressure by shortening live ranges.

Before (address computation in preheader, live across loop body):

preheader:
  %base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
  %addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
  br label %loop

loop:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
  ; ... many instructions using registers, %base and %addr are live ...
  %tex_gep = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %tex_addr = ptrtoint ptr addrspace(1) %tex_gep to i64
  %val = call { float, float, float, float } @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex_addr, float 0.0)
  %i.next = add i64 %i, 1
  %cmp = icmp slt i64 %i.next, %n
  br i1 %cmp, label %loop, label %exit

After (address computation sunk into loop, next to texture use):

preheader:
  br label %loop

loop:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
  ; ... many instructions, but %base and %addr are no longer live here ...
  %base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
  %addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
  %tex_gep = getelementptr float, ptr addrspace(1) %addr, i64 %i
  %tex_addr = ptrtoint ptr addrspace(1) %tex_gep to i64
  %val = call { float, float, float, float } @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %tex_addr, float 0.0)
  %i.next = add i64 %i, 1
  %cmp = icmp slt i64 %i.next, %n
  br i1 %cmp, label %loop, label %exit

The GEP instructions now execute inside the loop (higher execution count) but free registers in the rest of the loop body. This is a deliberate tradeoff: extra ALU work for reduced register pressure, which typically improves occupancy and net throughput.

Algorithm

Entry Point

The legacy PM entry sub_1CCA270 performs these steps:

  1. Fetches DominatorTree analysis (via DominatorTreeWrapperPass at unk_4F9E06C)
  2. Fetches LoopInfo analysis (via LoopInfoWrapperPass at unk_4F96DB4)
  3. Reads sink-into-texture knob (qword_4FBF2C0[20]) -- must be non-zero (enabled)
  4. Reads sink-limit knob (qword_4FBF1E0[20]) -- must be greater than zero
  5. Calls the main worklist driver sub_1CC9110

The New PM entry sub_2D1C160 (19KB) performs the same logic using AnalysisManager to fetch analyses, then dispatches to sub_2D1CFB0 (13KB).

The pass does not require ScalarEvolution (SCEV), MemorySSA, or PostDominatorTree, keeping it simpler and cheaper than loop-oriented or MemorySSA-dependent passes.

Main Worklist Driver (sub_1CC9110, 22KB)

The core algorithm is a fixpoint iteration over the dominator tree:

function SinkingWorklist(F, DT, LI, textureLevel, sinkLimit):
    changed = false
    do:
        roundChanged = false
        sinkCount = 0

        // Walk dominator tree in DFS preorder
        for BB in DT.dfs_preorder():
            // Skip loop headers to avoid creating loop-carried deps
            if LI.isLoopHeader(BB):
                continue

            // Process instructions bottom-up within each block
            for I in reverse(BB.instructions()):
                if sinkCount >= sinkLimit:
                    break                  // complexity limiter

                if I.mayHaveSideEffects() or I.isTerminator():
                    continue               // unsinkable
                if I.use_empty():
                    continue               // dead, leave for DCE

                // Below level 3, skip instructions whose uses are all inside BB
                if textureLevel < 3 and allUsesInSameBlock(I, BB):
                    continue

                targetBB = findBestSinkTarget(I, DT, LI)  // sub_1CC7510
                if targetBB == BB:
                    continue               // already in best position

                // Profitability: prefer texture/surface blocks
                if textureLevel >= 1:
                    if not blockContainsTextureOps(targetBB):
                        if not dominatesTextureBlock(targetBB, DT):
                            continue       // not profitable

                // Safety: alias analysis check
                if not isSafeToSink(I, BB, targetBB):     // sub_1CC8920
                    continue
                // Safety: memory dependency check
                if not checkMemDep(I, BB, targetBB):       // sub_1CC8CA0
                    continue

                I.moveBefore(targetBB.firstNonPHI())
                roundChanged = true
                sinkCount++

        changed |= roundChanged
    while roundChanged            // iterate until no more changes

    return changed

Key design points:

  • DFS preorder ensures parent blocks are processed before children. Instructions sunk from a parent into a child on one iteration may expose further sinking opportunities for grandchild blocks on the next iteration -- hence the fixpoint loop.
  • Bottom-up within each block processes the last instruction first. This is important because sinking an instruction may make an earlier instruction's operands dead, which DCE will clean up later.
  • Loop headers are skipped to prevent creating loop-carried dependencies (a value defined in the header, consumed in the latch, sunk into the latch would create a cycle).
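The interaction between DFS preorder and the fixpoint loop can be seen in a toy model. The three-block chain, instruction names, and the deliberately simplified single-user sinking rule below are all invented for illustration; this is not the decompiled logic:

```python
# Toy model: x (block A) feeds y (block B), which feeds z (block C).
# Each round moves an instruction into its users' block; a def-use
# chain spanning several blocks needs several rounds -- the fixpoint.
def sink_to_fixpoint(instrs):
    """instrs: mutable list of (name, block, operands). Returns the
    number of rounds that actually moved an instruction."""
    moving_rounds = 0
    changed = True
    while changed:
        changed = False
        # rebuild the use map each round
        users = {}
        for name, _, ops in instrs:
            for op in ops:
                users.setdefault(op, []).append(name)
        block_of = {name: blk for name, blk, _ in instrs}
        for i, (name, blk, ops) in enumerate(instrs):
            use_blocks = {block_of[u] for u in users.get(name, [])}
            if len(use_blocks) == 1:       # single target block
                (target,) = use_blocks
                if target != blk:
                    instrs[i] = (name, target, ops)
                    changed = True
        if changed:
            moving_rounds += 1
    return moving_rounds
```

With `instrs = [("x","A",[]), ("y","B",["x"]), ("z","C",["y"])]`, round one sinks x into B and y into C; only round two can then sink x into C. This mirrors why Sinking2's outer loop reruns until no instruction moves.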

Instruction Processing (sub_1CC7510, 16KB)

For each candidate instruction, this function:

  1. Walks the use chain to find all consumers (via sub_15F4D60, multi-use check)
  2. For each user, determines the containing basic block
  3. Computes the lowest common dominator (LCD) of all user blocks using the dominator tree
  4. If LCD == current block, no benefit from sinking -- the instruction is already as close to its uses as possible while dominating all of them
  5. Builds a sink mapping: instruction to target block
  6. Checks memory safety via alias analysis (sub_13575E0)
  7. Validates that sinking does not violate memory ordering constraints
  8. Respects PHI nodes (LLVM opcode PHI) as sink boundaries -- an instruction cannot be sunk past a PHI insertion point

The target block selection algorithm effectively finds the nearest common dominator of all uses that is strictly dominated by the current block. If the instruction has a single use, the target is trivially the use's block (or its immediate dominator if the use is a PHI operand).
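A minimal sketch of step 3, assuming an idom map (child -> immediate dominator) rather than cicc's actual data structures:

```python
def dominator_chain(block, idom):
    """block plus all of its dominators, innermost first."""
    chain = [block]
    while block in idom:
        block = idom[block]
        chain.append(block)
    return chain

def lowest_common_dominator(use_blocks, idom):
    """Deepest block that dominates every block in use_blocks."""
    common = set(dominator_chain(use_blocks[0], idom))
    for b in use_blocks[1:]:
        common &= set(dominator_chain(b, idom))
    # walking one chain innermost-first, the first common entry is deepest
    for d in dominator_chain(use_blocks[0], idom):
        if d in common:
            return d
```

For a diamond with `idom = {"then": "entry", "else": "entry", "merge": "entry"}`, uses in `then` and `else` yield `entry`; a single use yields the use's own block, which corresponds to step 4's LCD == current-block early exit.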

Dominance Ordering (sub_1CC8170, 13KB)

Implements a hash-based ordering of basic blocks for comparing sink profitability. Uses DFS numbering from the dominator tree to determine which block comes "earlier" in the program. This ordering ensures:

  • Instructions are only sunk toward uses, never away from them
  • When multiple sink targets exist (multi-use instruction), the lowest common dominator is chosen
  • The ordering is consistent across iterations of the fixpoint loop
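The DFS numbering presumably follows the standard interval trick on the dominator tree, sketched here with an invented tree (the decompiled hashing details are not reproduced):

```python
def dfs_number(domtree, root):
    """Preorder entry/exit timestamps over a dominator tree
    (dict: node -> list of children)."""
    num, clock = {}, 0
    def walk(node):
        nonlocal clock
        enter = clock
        clock += 1
        for child in domtree.get(node, []):
            walk(child)
        num[node] = (enter, clock)
        clock += 1
    walk(root)
    return num

def dominates(a, b, num):
    # a dominates b iff a's interval encloses b's
    return num[a][0] <= num[b][0] and num[b][1] <= num[a][1]
```

With `domtree = {"entry": ["A", "C"], "A": ["B"]}`, `dominates("A", "B")` holds while `dominates("A", "C")` does not; the numbering is deterministic, so it stays consistent across fixpoint iterations.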

Alias Checking (sub_1CC8920, 4KB)

Validates that moving instruction I from block From to block To does not reorder I past any conflicting memory access:

function isSafeToSink(I, From, To):
    if not I.mayReadOrWriteMemory():
        return true                    // pure computation, always safe

    // Walk all instructions on the domtree path From -> To
    for BB in pathBlocks(From, To):
        for J in BB.instructions():
            if J == I: continue
            if AA.getModRefInfo(I, J) != NoModRef:
                return false           // conflict: I aliases with J

    return true

This is not MemorySSA-based (unlike stock LLVM sink). The pass invokes the traditional AliasAnalysis query interface through sub_13575E0. This is less precise than MemorySSA but avoids the cost of building and maintaining the MemorySSA graph, which matters because Sinking2 iterates to fixpoint and would need to update MemorySSA on every move.

Memory Dependency Checking (sub_1CC8CA0, 6KB)

Additional memory safety layer beyond alias checking:

  • Store-load forwarding: if I is a load and there is a store between From and To that may alias the loaded location, sinking would change the value loaded
  • Store ordering: if I is a store, moving it past another store to a potentially-aliasing location changes program semantics
  • Volatile/atomic barrier: volatile loads/stores and atomic operations are never sunk (treated as having side effects)
  • Synchronization intrinsics: barrier calls (__syncthreads, bar.sync) are treated as memory fences; no instruction may be sunk past them
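These rules amount to a conservative predicate. The sketch below models them with invented instruction records and a caller-supplied may-alias oracle (not cicc's types); both operands are assumed to be memory operations, since pure computation is filtered out earlier as in isSafeToSink:

```python
def may_sink_past(inst, other, may_alias):
    """May memory op `inst` legally move below memory op `other`?
    Conservative model of the rules above."""
    if inst.get("volatile") or inst.get("atomic"):
        return False                 # pinned: treated as side-effecting
    if other.get("barrier"):
        return False                 # __syncthreads / bar.sync fence
    kinds = (inst["kind"], other["kind"])
    if "store" not in kinds:
        return True                  # two loads always commute
    # load-past-store, store-past-load, store-past-store all need a
    # no-alias proof, or the observed/stored value could change
    return not may_alias(inst, other)
```

A trivially simple oracle (same symbolic address means may-alias) is enough to exercise the rules.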

Texture/Surface Awareness

The pass identifies "texture blocks" -- basic blocks containing calls to texture/surface intrinsics (the tex.*, suld.*, sust.* family). Address computations that feed these intrinsic calls are the primary sink candidates, because texture address computation chains (GEP + index arithmetic) produce intermediate values that are consumed only at the texture fetch site. Without sinking, these intermediates occupy registers across potentially many instructions.

The sink-into-texture knob controls aggressiveness:

| Level | Behavior |
|---|---|
| 0 | Disabled -- no texture-aware sinking |
| 1 | Cross-block only: move instructions across block boundaries into texture blocks |
| 2 | Cross-block + intra-block: also reorder instructions within a block to position them immediately before their texture use |
| 3 (default) | All of the above + outside-only: consider instructions whose only uses are in blocks other than where the instruction is defined |

Level 3 catches the important case where a GEP in a preheader feeds a texture load inside a loop -- the GEP has no uses in its own block, only "outside" uses.
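One plausible reading of the level table, expressed as a candidate filter (a model of the documented behavior, not the decompiled condition):

```python
def is_texture_sink_candidate(def_block, use_blocks, level):
    """Does sink-into-texture level `level` consider this instruction?
    use_blocks: blocks containing the users; def_block: defining block."""
    if level == 0 or not use_blocks:
        return False
    all_local = all(b == def_block for b in use_blocks)
    none_local = all(b != def_block for b in use_blocks)
    if all_local:
        return level >= 2       # intra-block reordering only at level 2+
    if none_local:
        return level >= 3       # outside-only uses need level 3
    return True                 # mixed local/remote uses: level 1+
```

Under this model, the preheader GEP whose only user sits inside the loop is rejected below level 3 and accepted at the default level.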

Address space checks for NVPTX (see reference/address-spaces):

  • AS 1 (global): may alias with texture reads in some configurations
  • AS 3 (shared): texture operations never access shared memory, so shared-space stores are not barriers to texture sinking
  • AS 4 (const): texture/surface descriptors typically live in constant memory
  • AS 5 (local): thread-local, no cross-thread interference
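These rules reduce to a small screen over the store's address space. The sketch below uses the NVPTX numbering from the list; it is a reading of the rules above, not decompiled code:

```python
# NVPTX generic address-space numbers
GLOBAL, SHARED, CONST, LOCAL = 1, 3, 4, 5

def store_may_block_texture_sink(store_addrspace):
    """May a store in this address space conflict with a texture read?"""
    if store_addrspace == SHARED:
        return False    # texture units never read shared memory
    if store_addrspace == LOCAL:
        return False    # thread-private: no cross-thread interference
    if store_addrspace == CONST:
        return False    # read-only at kernel scope
    return True         # global (AS 1) may alias texture data in some configs
```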

Loop Considerations

Sinking2 is loop-aware but conservative:

  1. Never sinks OUT of a loop: moving an instruction from a loop body to an exit block would change its execution count. The pass skips this entirely.
  2. May sink INTO loop bodies: when an instruction in a loop preheader feeds only uses inside the loop (particularly texture fetches), sinking it into the loop is profitable despite increasing execution count -- the register pressure reduction from shorter live ranges outweighs the extra computation.
  3. Skips loop headers: prevents creating loop-carried dependencies.
  4. Runs after LoopSimplify: the early instance (sub_18B1DE0) runs after LoopSimplify/LCSSA have canonicalized loop structure, so preheaders, latches, and exit blocks are well-formed.
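Rules 1-3 reduce to a small legality check over loop nesting. The toy LoopInfo shape below (block -> (loop depth, is-header)) is invented for illustration:

```python
def may_sink_between(from_block, to_block, loop_info):
    """Legality of sinking from from_block to to_block under the
    loop rules above. loop_info: block -> (depth, is_loop_header)."""
    from_depth, _ = loop_info.get(from_block, (0, False))
    to_depth, to_is_header = loop_info.get(to_block, (0, False))
    if to_is_header:
        return False    # rule 3: never target a loop header
    if to_depth < from_depth:
        return False    # rule 1: never sink OUT of a loop
    return True         # rule 2: same depth, or deeper (into the loop)
```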

This creates a deliberate tension with LICM:

  • LICM hoists loop-invariant code into the preheader (reducing execution count)
  • Sinking2 sinks non-invariant address computation out of the preheader and into the loop body (reducing register pressure)

The two passes run at different pipeline positions and balance each other. LICM runs first; Sinking2 runs after GVN and CGSCC inlining, when texture patterns are fully exposed.

Barrier Awareness

Sinking2 itself does not contain explicit __syncthreads / bar.sync detection logic. Instead, it relies on the LLVM side-effect model:

  • Barrier intrinsics are marked as having side effects, so they are never sunk
  • Barrier intrinsics are treated as memory fences by alias analysis, so no memory instruction may be sunk past them

The late NVVMSinking2 (sub_1CC60B0) runs after barrier lowering (sub_1CB73C0) and warp-level optimization passes. By that point, barriers have been lowered to their final form. The pipeline ordering is:

NVVMBranchDist -> NVVMWarpShuffle -> NVVMReduction -> NVVMSinking2

This sequence ensures NVVMSinking2 can sink past warp-level operations that are no longer opaque barriers, while still respecting the lowered barrier representation.

Multi-Run Pipeline Pattern

Sinking2 appears at three to four pipeline positions. Each run has different context and different opportunities:

| Position | Factory | Mode | Context |
|---|---|---|---|
| Early (pass ~39) | sub_18B1DE0() | Standard | After stock Sink, GVN, and CGSCC inlining. Texture patterns are exposed. |
| Post-peephole | sub_18B3080(1) | Fast (flag=1) | After NVVMPeephole. Peephole may create new sinking opportunities. Reduced iteration budget. |
| Late SM-specific | sub_1CC60B0() | SM-gated | After barrier lowering and warp shuffle. Gated by opts[3328] && !opts[2440]. |

For fast-compile mode (Ofcmax), only sub_18B3080(1) runs -- the single Sinking2 in fast mode with reduced iteration budget. No stock Sink, no NVVMSinking2.

The rationale for multiple runs:

  • Run 1 (stock Sink) handles straightforward cases using MemorySSA's precise alias information
  • Run 2 (Sinking2 early) performs texture-aware sinking now that GVN/CGSCC have simplified the IR
  • Run 3 (Sinking2 fast) cleans up opportunities created by peephole optimization
  • Run 4 (NVVMSinking2) performs SM-specific late sinking after barrier and warp-level transforms

NVVMPassOptions Gating

| Offset | Type | Effect |
|---|---|---|
| opts[1040] | bool | Disable stock Sink/MemSSA |
| opts[2440] | bool | Disable NVVMSinking2 (sub_1CC60B0) |
| opts[3328] | bool | Enable SM-specific warp/reduction/sinking pass group (gates NVVMSinking2) |

Cost Model (New PM)

The New PM object (176 bytes) contains floating-point thresholds at offsets +88 and +144, both initialized to 1065353216 (IEEE 754 1.0f). These thresholds suggest the New PM implementation has a more sophisticated cost model than the Legacy PM version:

  • Profitability threshold (+88): minimum benefit score for a sink to be accepted. A value of 1.0 means the benefit must at least equal the cost.
  • Cost threshold (+144): maximum acceptable cost for the sinking motion itself. A value of 1.0 means the movement cost must not exceed the baseline.

The Legacy PM version uses a simpler boolean profitability model (is the target a texture block? yes/no).
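The raw threshold word is easy to check; this is plain IEEE-754 bit reinterpretation, independent of any cicc detail:

```python
import struct

def bits_to_float(word):
    """Reinterpret a 32-bit integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", word))[0]
```

`bits_to_float(1065353216)` returns 1.0 (bit pattern 0x3F800000), confirming both thresholds initialize to 1.0f.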

Configuration Knobs

Sinking2-Specific (ctor_275 at 0x4F7750)

| Knob | Type | Default | Storage | Description |
|---|---|---|---|---|
| sink-into-texture | int | 3 | qword_4FBF2C0 | Texture sinking aggressiveness (0=off, 1=cross-block, 2=+intra, 3=+outside-only) |
| sink-limit | int | 20 | qword_4FBF1E0 | Max instructions to sink per invocation (complexity limiter) |
| dump-sink2 | bool | false | qword_4FBF100 | Dump debug information during sinking |

Related Knobs (other owners)

| Knob | Type | Default | Owner | Description |
|---|---|---|---|---|
| sink-check-sched | bool | true | stock Sink | Check scheduling effects of sinking |
| sink-single-only | bool | true | stock Sink | Only sink single-use instructions |
| rp-aware-sink | bool | false | stock Sink | Consider register pressure (controls sink<rp-aware> variant) |
| max-uses-for-sinking | int | (default) | stock Sink | Don't sink insts with too many uses |
| sink-ld-param | bool | (default) | NVPTX backend | Sink one-use ld.param to use point |
| hoist-load-param | bool | (default) | NVPTX backend | Hoist all ld.param to entry block (counterpart to sink-ld-param) |
| enable-andcmp-sinking | bool | (default) | CodeGenPrepare | Sink and/cmp into branches |
| aggressive-no-sink | bool | (default) | (unknown) | Sink all generated instructions |
| instcombine-code-sinking | bool | (default) | InstCombine | Enable code sinking within instcombine |
| nvptx-enable-machine-sink | bool | (default) | NVPTX backend | Enable MIR-level MachineSink |
| SinkRematEnable | bool | (default) | ptxas | Enable sink+rematerialization in ptxas |

Analysis Dependencies

| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (unk_4F9E06C) | DominatorTreeAnalysis (sub_CF6DB0) | Dominator tree for sink legality and ordering |
| LoopInfoWrapperPass (unk_4F96DB4) | LoopAnalysis (sub_B1A2E0) | Avoid sinking out of loops; skip loop headers |

Does not require: SCEV, MemorySSA, PostDominatorTree, BranchProbabilityInfo.

This is a key difference from stock LLVM SinkingPass, which requires MemorySSAAnalysis. Sinking2 uses its own alias analysis queries through helpers sub_1CC8920 and sub_1CC8CA0, routed through the traditional AA interface at sub_13575E0. This avoids the overhead of building/maintaining MemorySSA across fixpoint iterations.

Pass Object Layout

Legacy PM (160 bytes):

| Offset | Type | Content |
|---|---|---|
| +0 | ptr | Vtable pointer (off_49F8BC0) |
| +8 | ptr | Pass link (next pass in chain) |
| +16 | ptr | Pass ID pointer (&unk_4FBF0F4) |
| +24 | int32 | Mode (default=3, from sink-into-texture) |
| +28 | int32 | Sink limit (default=20, from sink-limit) |
| +32--48 | ptr[3] | Worklist data (head, tail, size) |
| +56 | ptr | DominatorTree* (set during runOnFunction) |
| +64 | ptr | List head 1 (self-referential sentinel) |
| +72--80 | ptr[2] | List next/prev 1 |
| +96 | int64 | Counter (sink count for current iteration) |
| +104 | ptr | LoopInfo* (set during runOnFunction) |
| +112 | ptr | List head 2 (self-referential sentinel) |
| +120--128 | ptr[2] | List next/prev 2 |
| +144 | int64 | Data field |
| +152 | byte | Changed flag (for fixpoint termination) |

New PM (176 bytes): two embedded worklists and float thresholds at offsets +88 and +144 (value 1065353216 = 1.0f IEEE 754).

Differences from Upstream LLVM

| Aspect | Upstream LLVM sink | NVIDIA sinking2 |
|---|---|---|
| Alias analysis backend | MemorySSA | Custom AA layer (sub_13575E0) |
| Iteration strategy | Single pass | Fixpoint iteration |
| Texture awareness | None | 3-level configurable |
| Address space awareness | Generic | NVPTX-specific (AS 1,3,4,5) |
| Complexity limiter | None | sink-limit knob (default=20) |
| Intra-block reordering | No | Level >= 2 |
| Outside-only pattern | No | Level == 3 |
| Debug dump | Standard LLVM debug | dump-sink2 knob |
| Cost model | Boolean (profitable or not) | Float thresholds in New PM |
| Pipeline occurrences | 1 | 3--4 (multi-run strategy) |
| Fast-compile variant | Same pass | Dedicated fast=1 mode |

Diagnostic Strings

| String | Context |
|---|---|
| "llvm::Sinking2Pass]" | RTTI name at sub_2315E20 |
| "sink2" | Pipeline parser ID |
| "Code sinking" | Display name (shared with stock LLVM sink) |
| "sinking2" | New PM pipeline string match |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1CC7010 | -- | Legacy PM pass registration |
| -- | sub_1CC7100 | -- | Legacy PM factory |
| -- | sub_1CC71E0 | -- | Legacy PM alternate factory |
| -- | sub_1CC7510 | 16KB | processInstruction: sink candidate evaluation, use-chain walk, LCD computation |
| -- | sub_1CC8170 | 13KB | Dominance ordering: DFS numbering for block comparison |
| -- | sub_1CC8920 | 4KB | Alias checking helper: validates no conflicting memory accesses on path |
| -- | sub_1CC8CA0 | 6KB | Memory dependency helper: store-load forwarding, store ordering, volatile |
| -- | sub_1CC9110 | 22KB | Main worklist driver: fixpoint iteration over dominator tree |
| -- | sub_1CCA270 | -- | Legacy PM runOnFunction entry |
| -- | sub_2D1B410 | -- | New PM pass registration |
| -- | sub_2D1BC50 | -- | New PM factory |
| -- | sub_2D1C160 | 19KB | New PM run() entry |
| -- | sub_2D1CFB0 | 13KB | New PM core logic |
| -- | sub_2D1D770 | 7KB | New PM helper |
| -- | sub_2D1DCF0 | 7KB | New PM helper |
| -- | sub_2315E20 | -- | RTTI name printer |
| -- | 0x4F7750 | -- | Knob constructor (ctor_275) |

Related pipeline factories:

| Address | Role |
|---|---|
| sub_18B1DE0 | Sinking2 early-pipeline factory |
| sub_18B3080 | Sinking2 fast-mode factory (accepts fast flag parameter) |
| sub_1CC60B0 | NVVMSinking2 late-pipeline factory |
| sub_1A634D0 | Stock LLVM Sink legacy PM registration |
| sub_29776B0 | Stock LLVM Sink New PM registration |
| sub_1B51110 | Stock Sink core (51KB, creates .sink.split / .sink blocks) |
| sub_1869C50 | Stock Sink pipeline factory (called with params 1,0,1) |

Total code size: ~80KB (Legacy PM) + ~65KB (New PM) = ~145KB

GPU-Specific Motivation

Register pressure directly determines occupancy -- each additional live register per thread reduces the number of warps available for latency hiding, with discrete cliff boundaries where a single register can drop an entire warp group.
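The cliff is plain arithmetic. The sketch assumes a 65,536-register SM file, 32-thread warps, and a 48-warp cap, and ignores the allocation-granularity rounding real GPUs apply; the numbers are only for shape:

```python
def warps_per_sm(regs_per_thread, reg_file=65536, warp_size=32, max_warps=48):
    """Warps resident on an SM under the register limit alone."""
    return min(max_warps, reg_file // (regs_per_thread * warp_size))
```

At 64 registers per thread the SM fits 32 warps; one more register (65) drops it to 31. Freeing even a single register via sinking can step back over such a cliff.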

Sinking instructions closer to their uses shortens live ranges and reduces the peak number of simultaneously live registers. This is especially valuable for texture load sequences, which typically involve address computation (GEP chains, index arithmetic) that produces values consumed only at the texture fetch site. Without sinking, these intermediate values occupy registers across potentially many instructions, bloating register pressure unnecessarily.

The three-level sink-into-texture design reflects a graduated approach to this optimization: level 1 handles the common case (cross-block sinking), level 2 adds intra-block reordering for tighter packing, and level 3 (the default) handles the edge case where an instruction's only uses are in blocks other than where it is defined, enabling more aggressive motion.

The multi-run pattern (early Sinking2, post-peephole fast Sinking2, late NVVMSinking2) ensures that sinking opportunities created by other optimization passes are captured throughout the pipeline, rather than relying on a single sinking point that may miss opportunities not yet exposed.

Cross-References

  • Dead Synchronization Elimination -- runs earlier, removes barriers that Sinking2 would otherwise treat as memory fences
  • LICM -- counterpart: hoists loop-invariant code into preheaders; Sinking2 sinks address computation out of preheaders
  • NVVMPeephole -- runs before late Sinking2, may create new sinking opportunities
  • Rematerialization -- runs after all sinking; rematerialization + sinking together minimize register pressure (ptxas SinkRematEnable knob)
  • MemorySpaceOpt -- changes address spaces which affects sinking profitability
  • NVVMPassOptions -- opts[1040] disables stock Sink; opts[2440] disables NVVMSinking2
  • Register Allocation -- ultimate consumer of the register pressure reduction that sinking provides
  • Optimization Levels -- Ofcmax runs only fast-mode Sinking2; O2/O3 run full multi-run pattern