CodeGenPrepare and SCEV-CGP

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Upstream CodeGenPrepare is stock LLVM 20.0.0 CodeGenPrepare.cpp with all 20+ cl::opt knobs unchanged. SCEV-CGP is a fully proprietary NVIDIA pass with no upstream equivalent; it is disabled by default (nv-disable-scev-cgp = true).

cicc v13.0 contains two distinct passes that prepare LLVM IR for the NVPTX backend's instruction selection. The first is upstream LLVM's CodeGenPreparePass, registered as "codegenprepare" in the New PM pipeline (line 216 of sub_2342890), which sinks address computations, creates PHI nodes for sunk values, and splits critical edges. The second is NVIDIA's proprietary SCEV-CGP (Scalar-Evolution-based Code Generation Preparation), a fully custom pass that uses SCEV analysis to rewrite address expressions with GPU thread ID as an induction variable.

Both passes operate at the LLVM IR level, immediately before SelectionDAG construction. They share the goal of making address expressions cheap for the backend to lower, but they work at different abstraction levels: CodeGenPrepare operates syntactically on individual memory instructions; SCEV-CGP operates semantically on entire address expression families using scalar evolution. NVIDIA disables SCEV-CGP by default (nv-disable-scev-cgp defaults to true), relying on upstream CodeGenPrepare plus the downstream Base Address Strength Reduction and Common Base Elimination passes to handle GPU address optimization.

Key Facts

| Property | Value |
|---|---|
| Pass name (upstream) | codegenprepare (New PM) |
| Pass name (NVIDIA) | SCEV-CGP (no formal New PM pass name found in binary) |
| Binary range (v12.x) | 0x1D60000--0x1D7FFFF (helpers + main transforms) |
| Binary range (v13.0) | 0x2D75700--0x2D88660 (Cluster 6 in 0x2D sweep) |
| Address sinking | sub_1D73760 / sub_2D75700 (65--72 KB), string "sunkaddr" |
| PHI sinking | sub_1D706F0 / sub_2D784F0 (64--68 KB), string "sunk_phi" |
| Block splitting | sub_1D7AA30 / sub_2D88660 (54--74 KB), strings ".unlikely", ".cond.split" |
| Main transform | sub_2D80050 (54 KB) -- orchestrates address mode lowering |
| SCEV-CGP knob ctor | ctor_263_0 at 0x4F36F0 (9.9 KB, 44 option strings) |
| CGP knob ctor | ctor_288_0 at 0x4FA950 (8.6 KB, 44 option strings) |
| Master disable | nv-disable-scev-cgp (default: true -- SCEV-CGP is disabled) |
| Upstream source | llvm/lib/CodeGen/CodeGenPrepare.cpp |
| Pipeline position | Late IR, immediately before SelectionDAG ISel |

Upstream CodeGenPrepare

Purpose

CodeGenPrepare is the last IR-level pass before instruction selection. Its job is to transform the IR into a form that the SelectionDAG builder can lower efficiently: address computations should be adjacent to their memory uses (reducing live ranges), complex addressing modes should be materialized as GEP chains that ISel can pattern-match, and unlikely branches should be split into cold blocks so that block placement can isolate them.

On NVPTX this pass is less critical than on x86 because PTX has simpler addressing modes (base + offset, no scaled index), but it still performs three important transforms.

Transform 1: Address Sinking (sunkaddr)

The address sinking logic lives in sub_1D73760 (v12.x) / sub_2D75700 (v13.0). It identifies memory instructions whose address operand is computed in a dominating block, then sinks the computation to the block containing the memory instruction. The sunk address is named "sunkaddr" in the IR, appearing as a GEP, inttoptr, or bitcast chain:

Before:
  entry:
    %addr = getelementptr float, ptr %base, i64 %idx
    br label %loop

  loop:
    %val = load float, ptr %addr          ; addr live across loop

After:
  entry:
    br label %loop

  loop:
    %sunkaddr0 = getelementptr float, ptr %base, i64 %idx
    %val = load float, ptr %sunkaddr0     ; addr local to use

The naming convention "sunkaddr" with a numeric suffix (20+ occurrences in binary string references) is the standard LLVM naming. Each sunk address gets a unique suffix: sunkaddr0, sunkaddr1, etc.

The sinking decision is controlled by a cache called ValueToSunkAddr (a DenseMap at sub_2CE7CF0 in the v13.0 build). Before sinking a value, the pass checks whether the same address expression has already been sunk into the target block. If so, it reuses the existing sunk copy rather than creating a duplicate.

The core sinking algorithm:

for each basic block BB in function:
    for each instruction I in BB:
        if I is a memory instruction (load/store/atomic):
            addr = I.getPointerOperand()
            if addr.getParent() != BB:
                // addr defined in a dominating block
                addr_mode = matchAddressMode(addr)           // sub_2D67BB0
                if addr_mode.isFoldable():
                    sunk = materializeAddrMode(addr_mode, BB) // sub_2D68450
                    I.setPointerOperand(sunk)
                    mark changed
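The role of the ValueToSunkAddr cache in this loop can be modeled in a few lines of Python. This is a toy sketch of the dedup behavior only, not the binary's C++ implementation; the IR encoding and all names below are invented for illustration:

```python
# Toy model of CodeGenPrepare address sinking with the ValueToSunkAddr cache.
# The IR model and names are illustrative, not extracted from the binary.

def sink_addresses(blocks):
    """blocks: {block_name: [(opcode, addr_expr, def_block), ...]} memory insts."""
    value_to_sunk_addr = {}   # (use_block, addr_expr) -> name of sunk copy
    sunk = []
    counter = 0
    for bb, insts in blocks.items():
        for opcode, addr_expr, def_block in insts:
            if def_block == bb:
                continue                       # address already local to its use
            key = (bb, addr_expr)
            if key not in value_to_sunk_addr:  # first use in this block: sink a copy
                value_to_sunk_addr[key] = f"sunkaddr{counter}"
                counter += 1
                sunk.append((bb, value_to_sunk_addr[key], addr_expr))
            # later uses of the same expression in this block reuse the cached copy
    return sunk

# Two loads of the same address in %loop sink one shared copy, not two duplicates.
blocks = {"loop": [("load", "gep %base, %idx", "entry"),
                   ("load", "gep %base, %idx", "entry")]}
print(sink_addresses(blocks))
```

The cache key pairs the target block with the address expression, so the same expression sunk into two different blocks still gets two copies, matching the per-block dedup described above.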

Key helpers in the v13.0 build:

| Function | Role |
|---|---|
| sub_2D749D0 | Address mode cache lookup |
| sub_2D67BB0 | Address mode legality test |
| sub_2D6E640 | Address mode cache insert |
| sub_2D68450 | Address mode materialization |
| sub_2CE7CF0 | ValueToSunkAddr DenseMap |

Transform 2: PHI Sinking (sunk_phi)

When an address computation has multiple uses in successor blocks of a conditional branch, the pass creates a PHI node in the merge block rather than sinking independent copies into each successor. The resulting PHI is named "sunk_phi":

Before:
  entry:
    %addr = getelementptr float, ptr %base, i64 %idx
    br i1 %cond, label %then, label %else

  then:
    %v1 = load float, ptr %addr
    br label %merge

  else:
    %v2 = load float, ptr %addr
    br label %merge

After (conceptual):
  then:
    %sunkaddr0 = getelementptr float, ptr %base, i64 %idx
    %v1 = load float, ptr %sunkaddr0
    br label %merge

  else:
    %sunkaddr1 = getelementptr float, ptr %base, i64 %idx
    %v2 = load float, ptr %sunkaddr1
    br label %merge

When the two sunk copies would be identical and the value is needed in the merge block for other uses, the pass instead creates:

  merge:
    %sunk_phi = phi ptr [ %sunkaddr0, %then ], [ %sunkaddr1, %else ]

The PHI creation calls sub_B44260 (PHI node setup), with naming via sub_BD6B50. The addr-sink-new-phis cl::opt knob (registered at ctor_288_0) controls whether the pass is allowed to create new PHIs during address sinking. The addr-sink-new-select knob similarly controls creation of new select instructions.
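The choice between duplicated sunk copies and a single sunk_phi reduces to a small predicate, gated by addr-sink-new-phis. A toy Python sketch of that decision (block and value names invented; not the actual pass logic):

```python
# Toy model of the sunk_phi decision: if the per-predecessor sunk copies are
# identical and the merge block itself needs the address, create one PHI instead
# of leaving disconnected duplicates. Purely illustrative.

def plan_merge(sunk_copies, merge_needs_value, allow_new_phis=True):
    """sunk_copies: {pred_block: addr_expr}. allow_new_phis mirrors addr-sink-new-phis."""
    exprs = set(sunk_copies.values())
    if allow_new_phis and merge_needs_value and len(exprs) == 1:
        incoming = ", ".join(f"[ %sunkaddr{i}, %{p} ]"
                             for i, p in enumerate(sorted(sunk_copies)))
        return f"%sunk_phi = phi ptr {incoming}"
    return None  # keep the independent per-predecessor copies

print(plan_merge({"then": "gep %base, %idx", "else": "gep %base, %idx"}, True))
```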

Transform 3: Block Splitting

sub_1D7AA30 (v12.x) / sub_2D88660 (v13.0) splits basic blocks to isolate unlikely paths. The pass creates blocks with suffixes ".unlikely" and ".cond.split", allowing MachineBlockPlacement to push cold code away from the hot path. This is driven by branch probability metadata and profile-guided section prefix hints.

On NVPTX, block splitting interacts with StructurizeCFG: the split blocks must still form reducible control flow, otherwise StructurizeCFG will have to insert additional flow blocks to restore structure. The profile-guided-section-prefix knob controls whether section prefix metadata (.hot, .unlikely, .unknown) is attached to split blocks.

Upstream CodeGenPrepare Knobs

All registered at ctor_288_0 (0x4FA950, 8.6 KB, 44 strings). These are standard LLVM cl::opt knobs, unchanged from upstream:

| Knob | Type | Effect |
|---|---|---|
| disable-cgp-branch-opts | bool | Disable CodeGenPrepare branch optimizations |
| disable-cgp-gc-opts | bool | Disable CodeGenPrepare GC optimizations |
| disable-cgp-select2branch | bool | Disable select-to-branch conversion |
| addr-sink-using-gep | bool | Use GEP instructions for address sinking (vs. inttoptr) |
| enable-andcmp-sinking | bool | Sink and/cmp instruction pairs into branches |
| disable-cgp-store-extract | bool | Disable store-extractvalue optimization |
| stress-cgp-store-extract | bool | Stress-test the store-extractvalue path |
| disable-cgp-ext-ld-promotion | bool | Disable extension-load promotion |
| disable-preheader-prot | bool | Disable loop preheader protection |
| profile-guided-section-prefix | bool | Attach section prefix based on profile data |
| cgp-freq-ratio-to-skip-merge | int | Block frequency ratio threshold to skip block merging |
| force-split-store | bool | Force store splitting |
| cgp-type-promotion-merge | bool | Merge type promotions |
| disable-complex-addr-modes | bool | Disable complex addressing mode optimization |
| addr-sink-new-phis | bool | Allow creating new PHIs during address sinking |
| addr-sink-new-select | bool | Allow creating new selects during address sinking |
| addr-sink-combine-base-reg | bool | Combine base register in address sink |
| addr-sink-combine-gv | bool | Combine global value in address sink |
| addr-sink-combine-offs | bool | Combine offset in address sink |
| addr-sink-combine-scaled-reg | bool | Combine scaled register in address sink |
| cgp-split-large-offset-gep | bool | Split GEPs with large offsets |

GPU Relevance of Upstream Knobs

Most of these knobs are effectively no-ops on NVPTX because the target's addressing modes are simple (base + immediate offset, no scaled index register). However, a few matter:

  • addr-sink-using-gep: Controls whether sunk addresses use GEP or inttoptr chains. On NVPTX, GEP chains are preferred because they preserve address space information through lowering. The inttoptr path strips address space, forcing the backend to re-derive it.

  • cgp-split-large-offset-gep: Relevant for large array accesses where the constant offset exceeds the PTX immediate encoding width (±2^31 for 64-bit addressing). Splitting the GEP allows the backend to use a base register plus a small offset rather than a 64-bit constant.

  • addr-sink-new-phis: On GPU, creating new PHIs can increase divergent live ranges. If the condition driving the PHI is thread-divergent, the PHI result will be divergent, potentially requiring a wider (per-lane) register allocation.
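The split in the second bullet is plain integer arithmetic: a too-large constant offset is divided into a register-materialized high part plus an immediate-encodable low part. A hedged sketch, assuming a signed 32-bit immediate width for illustration (the real threshold comes from the target, not from this constant):

```python
# Sketch of splitting a large GEP offset into (base adjustment, small immediate),
# in the spirit of cgp-split-large-offset-gep. IMM_BITS is an assumed immediate
# width for illustration only.

IMM_BITS = 32
IMM_MAX = (1 << (IMM_BITS - 1)) - 1
IMM_MIN = -(1 << (IMM_BITS - 1))

def split_offset(offset):
    """Return (high, low): offset == high + low, low fits a signed IMM_BITS-bit imm."""
    if IMM_MIN <= offset <= IMM_MAX:
        return (0, offset)              # already encodable as an immediate
    low = offset % (1 << IMM_BITS)
    if low > IMM_MAX:
        low -= 1 << IMM_BITS            # keep the low part signed
    return (offset - low, low)

big = 5 * (1 << 32) + 12
high, low = split_offset(big)
assert high + low == big and IMM_MIN <= low <= IMM_MAX
```

The high part is folded into the base register once; every access then encodes only the small low part, which is the "base register plus a small offset" form described above.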

NVIDIA SCEV-CGP

What Is It?

SCEV-CGP is a fully custom NVIDIA pass that uses LLVM's ScalarEvolution analysis to optimize address mode expressions at the function level, with specific awareness of GPU thread ID as an induction variable. Where upstream CodeGenPrepare operates syntactically (pattern-matching individual instructions), SCEV-CGP operates semantically: it analyzes address expressions as SCEV recurrences, factors out common base computations, and rewrites them to minimize register pressure.

The pass is registered in ctor_263_0 at 0x4F36F0 alongside Base Address Strength Reduction knobs. The 44 strings registered in this single constructor cover both SCEV-CGP and BASR, confirming they are part of the same address optimization subsystem.

Why NVIDIA Disables It By Default

The nv-disable-scev-cgp knob defaults to true (the description reads "Disable optimize addr mode with SCEV pass" and the raw data at ctor_609_0 marks it as def=on meaning disabled). This is a deliberate choice:

  1. Redundancy with BASR/CBE. NVIDIA has invested heavily in Base Address Strength Reduction (62 KB) and Common Base Elimination (39 KB), which handle the most profitable GPU address optimizations (sharing base computations across array accesses in loop bodies). These passes are simpler, more predictable, and better-tested than the general SCEV-CGP framework.

  2. Interaction with LSR. Both SCEV-CGP and Loop Strength Reduction operate on SCEV expressions. If both are active, they can fight over the same address expressions: LSR rewrites IVs for loop-carried efficiency, then SCEV-CGP undoes part of that work to optimize address modes. The result can be worse than either pass alone. By disabling SCEV-CGP, NVIDIA lets LSR (with its full GPU-aware formula solver) handle SCEV-based address optimization without interference.

  3. Compile-time cost. SCEV-CGP with aggressive mode (do-scev-cgp-aggresively [sic]) is expensive. The scev-cgp-inst-limit and scev-cgp-control knobs exist precisely because uncontrolled SCEV-CGP can balloon compile times on large kernels with many address expressions.

  4. Overflow hazards. The ignore-32-bit-overflow and ignore-signed-32-bit-overflow knobs in ctor_263_0 indicate that SCEV-CGP can produce address arithmetic that overflows 32-bit intermediates. On GPU where 32-bit addressing is common (shared memory, constant memory), this is a correctness risk that NVIDIA mitigates by keeping the pass off by default.
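The hazard in point 4 is easy to reproduce with plain arithmetic: an address term whose intermediate is truncated to 32 bits silently wraps where the full 64-bit computation would not. The values below are invented for illustration:

```python
# Why ignore-32-bit-overflow is dangerous: tid * stride can exceed 2^32 even
# when both factors individually look small. Truncating the intermediate to
# 32 bits (as a rewritten 32-bit address computation would) yields a wrong offset.

def offset64(tid, stride):
    return tid * stride                     # full-width arithmetic

def offset32(tid, stride):
    return (tid * stride) & 0xFFFFFFFF      # 32-bit intermediate wraps

tid, stride = 1023, 8_388_608               # 8 MB stride per thread (illustrative)
assert offset64(tid, stride) != offset32(tid, stride)   # the 32-bit form wrapped
```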

When SCEV-CGP Would Be Beneficial

Despite being disabled by default, the pass has 14 dedicated knobs -- NVIDIA clearly uses it selectively:

  • Kernels with complex strided access patterns where thread ID participates in multi-dimensional address calculations (e.g., base + tid.x * stride_x + tid.y * stride_y + tid.z * stride_z). BASR handles the case where multiple accesses share a base, but it does not factor thread ID expressions across dimensions.

  • Register-pressure-critical kernels at occupancy cliffs where SCEV-based address strength reduction can save enough registers to cross an occupancy boundary. The scev-cgp-tid-max-value knob lets the pass reason about the bounded range of thread IDs, enabling tighter value range analysis.

  • Function-level address optimization (enabled by do-function-scev-cgp) where cross-loop base sharing matters more than per-loop IV optimization.
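The first bullet's cross-dimension factoring can be sketched numerically: the partial sum base + tid.y * stride_y + tid.x * stride_x is computed once and shared across a family of accesses, instead of being rebuilt per access. Strides and counts below are made up:

```python
# Sketch of cross-dimension base factoring: several accesses share the partial
# address base + tid_y*sy + tid_x*sx, so the multiplies happen once instead of
# once per access. Purely illustrative; not the pass's actual rewrite.

def naive(base, tid_x, tid_y, sx, sy, n_accesses):
    addrs, mults = [], 0
    for i in range(n_accesses):
        addrs.append(base + tid_x * sx + tid_y * sy + i * 4)
        mults += 2                          # two multiplies per access
    return addrs, mults

def factored(base, tid_x, tid_y, sx, sy, n_accesses):
    row = base + tid_y * sy + tid_x * sx    # shared partial address, computed once
    addrs = [row + i * 4 for i in range(n_accesses)]
    return addrs, 2                         # two multiplies total

a1, m1 = naive(0x1000, 3, 7, 4, 512, 8)
a2, m2 = factored(0x1000, 3, 7, 4, 512, 8)
assert a1 == a2 and m2 < m1                 # same addresses, fewer multiplies
```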

Thread ID Max Value Knob

The scev-cgp-tid-max-value knob deserves special attention. It provides SCEV analysis with the maximum possible value of a GPU thread ID, which is architecture-dependent:

  • threadIdx.x: blockDim.x ≤ 1024 on all supported architectures (sm_70+), so threadIdx.x ≤ 1023
  • threadIdx.y: blockDim.y ≤ 1024, so threadIdx.y ≤ 1023
  • threadIdx.z: blockDim.z ≤ 64, so threadIdx.z ≤ 63
  • blockIdx.x: gridDim.x ≤ 2^31 - 1, so blockIdx.x ≤ 2^31 - 2

By telling SCEV that threadIdx.x is bounded by 1024, the analysis can prove that threadIdx.x * element_size fits in 32 bits for element sizes up to ~2 million bytes. This enables 32-bit address arithmetic where the expression would otherwise be widened to 64 bits. The knob links to the Known Bits analysis documented in Known Bits, where the nvvm-intr-range pass provides similar bounded-range information for special registers.
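The bound can be checked directly: with threadIdx.x strictly below 1024, the product tid * element_size stays within signed 32 bits exactly when the element size is at most 2 MB. This is a sketch of the range reasoning, not the pass's actual SCEV query:

```python
# Range reasoning enabled by scev-cgp-tid-max-value: with threadIdx.x bounded,
# prove tid * element_size fits in signed 32 bits, permitting 32-bit arithmetic.

TID_MAX = 1023                              # threadIdx.x < blockDim.x <= 1024

def fits_i32(element_size):
    """Worst-case product for any legal threadIdx.x fits a signed 32-bit int."""
    return TID_MAX * element_size <= (1 << 31) - 1

assert fits_i32(2 * 1024 * 1024)            # 2 MB elements: worst case fits
assert not fits_i32(4 * 1024 * 1024)        # 4 MB elements would overflow
```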

SCEV-CGP Knobs (Complete Reference)

All registered in ctor_263_0 at 0x4F36F0. These are NVVMPassOptions values, stored in the 222-slot pass option registry.

| Knob | Type | Default | Effect |
|---|---|---|---|
| do-scev-cgp | bool | true [MEDIUM confidence] | Master enable for SCEV-based CodeGenPrepare transforms. Default inferred from the fact that nv-disable-scev-cgp exists as an override, implying this defaults to enabled. |
| do-scev-cgp-aggresively [sic] | bool | false [MEDIUM confidence] | Enable aggressive SCEV-CGP mode with expanded search. Default inferred from naming convention (aggressive modes are typically off by default). |
| do-function-scev-cgp | bool | false [MEDIUM confidence] | Enable function-level (cross-loop) SCEV-CGP. Default inferred from naming convention. |
| nv-disable-scev-cgp | bool | true | Master disable switch in the NVPTX backend (overrides do-scev-cgp) |
| scev-cgp-control | int | unknown | Limit on the total number of SCEV-CGP transformations per function |
| scev-cgp-cross-block-limit | int | unknown | Max number of common base expressions from a single block |
| scev-cgp-idom-level-limit | int | unknown | Max dominator tree depth for hoisting base computations |
| scev-cgp-inst-limit | int | unknown | Max instructions analyzed per parameter expression |
| scev-cgp-old-base | bool | unknown | Use the old (legacy) base computation method instead of the new one |
| scev-cgp-tid-max-value | int | arch-dependent | Maximum value of the thread ID for address range analysis |
| scev-cgp-check-latency | int | unknown | Latency threshold for address computation profitability |
| scev-cgp-norm | int | unknown | Normalization control for SCEV expression canonicalization |
| print-after-scev-cgp | bool | false | Dump function IR after SCEV-CGP completes |
| dump-scev-cgp | bool | false | Debug dump during SCEV-CGP execution |

The same constructor also registers these knobs, documented in their respective pages:

| Knob | See / Effect |
|---|---|
| do-base-address-strength-reduce | Base Address Strength Reduction |
| do-base-address-strength-reduce-chain | Base Address Strength Reduction |
| base-address-strength-reduce-iv-limit | Base Address Strength Reduction |
| base-address-strength-reduce-max-iv | Base Address Strength Reduction |
| topo-sort-begin | Topological sort starting point for the address expression graph |
| ignore-bad-base | Bypass validity checks on base pointer classification |
| ignore-32-bit-overflow | Skip 32-bit overflow checks in address arithmetic |
| ignore-signed-32-bit-overflow | Skip signed 32-bit overflow checks |

Interaction with LSR

CodeGenPrepare/SCEV-CGP and Loop Strength Reduction both optimize address expressions, but at different pipeline stages and granularities.

| Aspect | LSR | CodeGenPrepare | SCEV-CGP |
|---|---|---|---|
| Pipeline position | Late IR optimization (loop passes) | Pre-ISel (after all IR opts) | Pre-ISel (NVIDIA custom position) |
| Scope | Per-loop IV rewriting | Per-instruction address sinking | Per-function address expression rewriting |
| SCEV usage | Full: formula generation, stride factoring, chain construction | None (syntactic pattern matching) | Full: base decomposition, range analysis |
| Register pressure | Explicit RP tracking with occupancy ceiling | Implicit (sinking reduces live ranges) | Implicit via scev-cgp-cross-block-limit |
| Address space | Full awareness (shared memory protection, 64-bit IV gating) | No special GPU handling | Thread ID aware (scev-cgp-tid-max-value) |
| Default status | Enabled (with GPU-custom formula solver) | Enabled (standard upstream) | Disabled (nv-disable-scev-cgp = true) |

The key insight is the pipeline ordering: LSR runs first during the optimization phase, rewriting IVs across the loop. CodeGenPrepare runs later, sinking the results into individual use sites. If SCEV-CGP were also enabled, it would run between these two, potentially undoing LSR's IV choices to create "better" address modes -- which may conflict with LSR's register-pressure-informed formula selection.

NVIDIA's solution is pragmatic: keep SCEV-CGP off, let LSR handle SCEV-level optimization, let BASR/CBE handle GPU-specific base sharing, and let upstream CodeGenPrepare handle the final address sinking.

Differences from Upstream LLVM

| Area | Upstream LLVM | cicc v13.0 |
|---|---|---|
| CodeGenPrepare pass | Standard, used as-is | Retained unchanged from LLVM 20.0.0 |
| SCEV-CGP | Does not exist | NVIDIA proprietary, disabled by default |
| Address sinking | Always uses TTI::getAddrModeType | Same, but NVPTX TTI returns simple modes (base+offset only) |
| Block splitting | Hot/cold based on PGO | Same, but must preserve reducibility for StructurizeCFG |
| BASR/CBE | Do not exist | NVIDIA proprietary alternatives to SCEV-CGP for GPU |
| Knob count | ~20 cl::opt for CGP | 20 upstream CGP + 14 SCEV-CGP + 8 BASR = 42 total |

Function Map

CodeGenPrepare (v12.x Addresses)

| Function | Size | Role |
|---|---|---|
| sub_1D73760 | 65 KB | optimizeMemoryInst -- address sinking, creates "sunkaddr" |
| sub_1D706F0 | 68 KB | PHI optimization, creates "sunk_phi" |
| sub_1D7AA30 | 74 KB | Block splitting, creates ".unlikely", ".cond.split" |
| sub_1D779D0 | 71 KB | IR transform (DAG combine-level, possibly optimizeInst) |
| sub_1D765D0 | 34 KB | Select lowering ("cond.false", "cond.end") |
| sub_1D7F9D0 | 31 KB | Deque-based worklist processor |

CodeGenPrepare (v13.0 Addresses)

| Function | Size | Role |
|---|---|---|
| sub_2D75700 | 72 KB | Address sinking with "sunk_phi", ValueToSunkAddr DenseMap |
| sub_2D784F0 | 64 KB | Address mode lowering orchestrator, calls sub_2D75700 |
| sub_2D80050 | 54 KB | Main CodeGenPrepare transform, calls TTI and address mode logic |
| sub_2D82850 | 62 KB | Late lowering/expansion (type widening, custom lowering) |
| sub_2D88660 | 70 KB | Block splitting with branch weights ("hot", "unlikely", "unknown") |
| sub_2D749D0 | -- | Address mode cache lookup |
| sub_2D67BB0 | -- | Address mode legality test |
| sub_2D6E640 | -- | Address mode cache insert |
| sub_2D68450 | -- | Address mode materialization |
| sub_2D6DEE0 | -- | Address mode matching |
| sub_2D69E90 | -- | Cleanup/init |

Helper Range (0x1D60000--0x1D6FFFF)

This 64 KB sub-range contains CodeGenPrepare helper functions. The sweep identifies it as "CodeGenPrepare helpers" but no individual functions are called out with string evidence. These likely include address mode computation utilities, operand analysis, and GEP canonicalization.

SCEV-CGP Option Registration

| Function | Size | Role |
|---|---|---|
| ctor_263_0 (0x4F36F0) | 9.9 KB | Registers 44 cl::opt strings for SCEV-CGP + BASR |
| ctor_288_0 (0x4FA950) | 8.6 KB | Registers 44 cl::opt strings for upstream CodeGenPrepare |
| ctor_591 (0x57C1A0) | 9.3 KB | Additional CodeGenPrepare sink/split options |
| ctor_544_0 (0x56C190) | 13.1 KB | CodeGenPrepare options (v13.0 duplicate registration) |
| ctor_609_0 (0x585D30) | 37.3 KB | NVPTX backend mega-block, includes nv-disable-scev-cgp |

Cross-References