
Loop Optimization Passes

Loop optimization is the single most performance-sensitive area of the cicc pipeline. On an NVIDIA GPU, the constraints are fundamentally different from those a CPU compiler faces: register pressure dominates (every additional register per thread reduces SM occupancy), memory coalescing replaces cache locality as the primary memory optimization target, and warp divergence caused by loop-carried control flow destroys SIMT efficiency. NVIDIA's cicc v13.0 addresses these constraints by shipping a mix of stock LLVM loop passes, LLVM passes with GPU-specific threshold overrides, and fully proprietary loop transformations -- all orchestrated through a carefully ordered pipeline in which the position of each pass reflects hard-won engineering tradeoffs between register pressure, instruction count, and memory access patterns.

This page provides the big-picture view of loop optimization in cicc: what passes exist, how they are ordered, what analyses they share, and why the ordering matters for GPU targets. Each pass links to a dedicated sub-page with full algorithmic detail.

Why Loop Optimization Is Different on GPU

Four properties of the GPU execution model distinguish GPU loop optimization from the CPU case that upstream LLVM targets:

Register pressure is the primary constraint. Every loop transformation that increases live values (unrolling, vectorization, LICM hoisting) must be evaluated against the SM's register budget and its discrete occupancy cliffs -- adding one register can drop occupancy by a full warp group. CPU compilers never face this tradeoff.

Memory coalescing replaces cache line optimization. Loop transformations that improve stride-1 access patterns (interchange, vectorization) improve coalescing; transformations that increase the number of live pointers (unrolling, distribution) may degrade it by interleaving access streams.

No out-of-order execution. Warps execute instructions in program order; the only latency-hiding mechanism is warp-level multithreading. Unrolling creates ILP within a single warp by exposing independent instructions that the ptxas backend can interleave, but the benefit is bounded by the register pressure cost.

Address space semantics. GPU memory is partitioned into address spaces with different pointer widths, hardware addressing modes, and performance characteristics. Loop passes that rewrite address computations (LSR, IndVarSimplify) must respect these distinctions -- strength-reducing a 32-bit shared memory pointer into 64-bit generic form defeats the backend's ability to emit efficient .shared:: instructions.
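The occupancy-cliff behavior described above can be made concrete with a small model. This is an illustrative sketch only: the constants (register file size, warp limit, allocation granularity) are representative values for a recent SM, not figures taken from cicc, and the function is invented for this example.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative occupancy model: how many warps fit on one SM given
// per-thread register usage. All constants are assumptions chosen to
// show the cliff effect, not values read out of cicc or any GPU spec.
constexpr int kRegFilePerSM = 65536;     // 32-bit registers per SM
constexpr int kThreadsPerWarp = 32;
constexpr int kMaxWarpsPerSM = 48;
constexpr int kRegAllocGranularity = 8;  // regs rounded up per thread

int warpsPerSM(int regsPerThread) {
    // Hardware allocates registers in fixed-size chunks per thread.
    int rounded = ((regsPerThread + kRegAllocGranularity - 1) /
                   kRegAllocGranularity) * kRegAllocGranularity;
    int byRegs = kRegFilePerSM / (rounded * kThreadsPerWarp);
    return std::min(byRegs, kMaxWarpsPerSM);
}
```

Under these assumed parameters, 40 registers per thread sustains the full 48 warps, but a single extra register rounds allocation up to 48 registers and drops residency to 42 warps -- the discrete cliff that makes register-increasing loop transforms risky.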

Pipeline Ordering

The loop passes execute within the main optimization pipeline assembled by sub_12E54A0. The ordering below reflects the Tier 1/2/3 optimization path (the normal path for -O1 and above). Passes marked with (N) are NVIDIA-specific or have significant NVIDIA modifications; unmarked passes are stock LLVM with at most threshold overrides.

LoopSimplify + LCSSA                   (canonicalization)
    |
    v
LoopRotate                             (do-while canonical form)
    |
    v
LICM (hoist)                           (move invariants out)
    |
    v
LoopIndexSplit **(N)**                 (split index-dependent branches)
    |
    v
IndVarSimplify **(N)**                 (canonicalize IVs, LFTR)
    |
    v
LoopIdiomRecognize                     (memcpy/memset/mismatch idioms)
    |
    v
LoopDistribute                         (fission for vectorization)
    |
    v
LoopVectorize **(N)**                  (widen scalar loops to v2/v4)
    |
    v
LoopUnroll **(N)**                     (replicate body, GPU-tuned)
    |
    v
LoopInterchange                        (swap nest levels for coalescing)
    |
    v
IRCE                                   (range check elimination)
    |
    v
NVLoopStrengthReduce **(N)**           (NVIDIA custom LSR solver)
    |
    v
LoopDeletion                           (remove dead loops)
    |
    v
LoopSink / LICM (sink)                 (demote unprofitable hoists)

Several passes appear more than once. LICM runs in both hoist and sink mode. LoopUnroll has an early invocation in the main pipeline and a late invocation gated by opts[1360] (nv-disable-loop-unrolling). IndVarSimplify runs before vectorization to canonicalize induction variables, then again after unrolling to clean up newly exposed IVs. LoopSimplify and LCSSA are implicit -- they run as required analyses whenever any loop pass requests them, ensuring loops remain in canonical form throughout.

The ordering reflects a deliberate strategy: canonicalize first (LoopSimplify, LoopRotate, IndVarSimplify), transform for parallelism (LoopDistribute, LoopVectorize, LoopInterchange), replicate for ILP (LoopUnroll), and clean up addressing (LSR, LoopDeletion, LoopSink). Reordering these passes produces measurably different code: running LSR before LoopVectorize would pollute the cost model with strength-reduced IVs that confuse SCEV; running LoopUnroll before LoopVectorize would prevent vectorization of unrolled-but-still-vectorizable loops.

LoopPassManager Structure

cicc uses the LLVM New Pass Manager's LoopPassManager infrastructure. Loop passes are grouped inside a FunctionPassManager that contains a LoopToFunctionPassAdaptor wrapping the LoopPassManager. The adaptor iterates over all loops in the function in reverse post-order of the loop forest (innermost first), running the full sequence of loop passes on each loop before moving to the next.

The LoopStandardAnalysisResults struct is threaded through all loop passes, providing shared access to:

| Analysis | Typical Accessor | Purpose |
|---|---|---|
| ScalarEvolution | AR.SE | Trip counts, strides, value ranges |
| LoopInfo | AR.LI | Loop structure, nesting depth |
| DominatorTree | AR.DT | Dominance queries for code motion |
| AssumptionCache | AR.AC | __builtin_assume facts |
| TargetTransformInfo | AR.TTI | Cost model, addressing modes |
| MemorySSA | AR.MSSA | Memory alias queries for LICM/DSE |
| AAResults | AR.AA | Alias analysis chain |

Passes that structurally modify loops (LoopUnroll, LoopDistribute, IRCE) call LPMUpdater::markLoopAsDeleted() or LPMUpdater::addSiblingLoops() to inform the pass manager of changes. SCEV is invalidated per-loop via SE.forgetLoop() after any transformation that changes the loop's backedge-taken count.
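The innermost-first traversal performed by the adaptor can be sketched as a post-order walk of the loop forest. The struct and names below are illustrative, not LLVM's actual types:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal sketch of the LoopToFunctionPassAdaptor's visitation order:
// subloops run before their parent, so innermost loops are optimized
// first. Loop structure and names here are invented for illustration.
struct Loop {
    std::string name;
    std::vector<Loop> subloops;
};

void runOnLoopPostOrder(const Loop &L, std::vector<std::string> &order) {
    for (const Loop &Sub : L.subloops)
        runOnLoopPostOrder(Sub, order);  // recurse into children first
    order.push_back(L.name);             // then visit the parent loop
}
```

For a three-deep nest, this yields the order inner, middle, outer -- the order in which the loop pass sequence is applied to each loop.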

Complete Pass Inventory

The table below lists every loop pass present in cicc v13.0 with its pipeline position, NVIDIA modification status, and primary function address.

| Pass Name | Pipeline Position | NVIDIA Modified | Entry Address | Status |
|---|---|---|---|---|
| loop-simplify | Infrastructure (on demand) | No | stock LLVM | Canonicalizes loop form |
| lcssa | Infrastructure (on demand) | No | stock LLVM | Ensures loop-closed SSA |
| loop-rotate | Early, before LICM | No | stock LLVM | Converts to do-while form |
| licm | Early (hoist) + Late (sink) | Threshold only | stock LLVM | Invariant code motion |
| loop-index-split | After LICM, before IndVars | Yes (proprietary) | sub_2CBEC60 (New PM) | Splits index-dependent branches |
| indvars | Before vectorize | Yes (3 knobs) | sub_19489B0 | IV canonicalization + LFTR |
| loop-idiom | Before distribute | No | stock LLVM | Memcpy/memset/mismatch recognition |
| loop-distribute | Before vectorize | Threshold only | sub_1A8CD80 | Loop fission for vectorization |
| loop-vectorize | Main loop slot | Yes (cost model) | sub_2AF1970 | Vectorize inner loops to v2/v4 |
| loop-unroll | After vectorize (x2) | Yes (decision engine) | sub_19BE360 | Replicate loop body |
| loop-interchange | After unroll | Threshold only | sub_1979A90 | Swap loop nest levels |
| irce | After interchange | No | sub_194D450 | Range check elimination |
| loop-reduce | Late, after unroll | Yes (complete rewrite) | sub_19CE990 (NV wrapper) | Strength reduction for GPU |
| loop-deletion | Late | No | stock LLVM | Remove dead/empty loops |
| loop-sink | Late | No | stock LLVM | Sink invariants back into loops |
| loop-instsimplify | Utility | No | stock LLVM | Simplify instructions in loops |
| loop-flatten | Utility | No | stock LLVM | Flatten nested counted loops |
| loop-guard-widening | Utility | No | stock LLVM | Widen loop guards |
| loop-predication | Utility | No | stock LLVM | Predicate unswitched loops |
| loop-reroll | Utility | No | stock LLVM | Reverse unrolling (rarely used) |

Passes marked "Utility" are registered in the pipeline infrastructure but are not part of the default optimization sequence -- they are available for explicit pipeline specification via -mllvm -passes=....

Canonicalization Passes

LoopSimplify and LCSSA run on demand before any loop transformation pass executes. LoopSimplify ensures each loop has a single preheader, a single backedge (latch), and dedicated exit blocks. LCSSA (Loop-Closed SSA) ensures that values defined inside a loop and used outside it pass through PHI nodes at loop exit blocks. These are stock LLVM utilities with no NVIDIA modifications. Together they establish the invariants that all subsequent loop passes depend on.

LoopRotate converts a loop from while-form (while (cond) { body }) to do-while form (do { body } while (cond)). This creates a single-entry loop body and moves the exit test to the latch, which is the canonical form expected by SCEV, LoopVectorize, and LoopUnroll. Stock LLVM, no NVIDIA modifications.

NVIDIA-Custom Loop Passes

Loop Index Split is a revived and heavily reworked version of a pass removed from upstream LLVM 3.0. It splits loops when the loop body contains a condition that depends on the induction variable (e.g., if (i == K)), producing two or three loops where each has a uniform body. On GPU, this eliminates warp divergence caused by index-dependent branches. The pass implements three transformation modes: all-but-one peel (for i == K), only-one collapse (for nearly-empty special iterations), and full range split (for i < K vs i >= K). Proprietary, no upstream equivalent.
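The "all-but-one peel" mode can be shown at source level. The sketch below is an invented example (functions and data are not from cicc): the original loop branches on i == K every iteration, while the split form peels iteration K so the two remaining loops have uniform, divergence-free bodies.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Original form: an index-dependent branch executes on every iteration,
// which on GPU causes warp divergence whenever lanes disagree on i == K.
int sumOriginal(const std::vector<int> &a, int K) {
    int s = 0;
    for (int i = 0; i < (int)a.size(); ++i) {
        if (i == K)
            s += 100;       // special-case iteration
        else
            s += a[i];      // common-case iteration
    }
    return s;
}

// Split form (all-but-one peel): two uniform loops plus one peeled
// iteration, with no branch inside either loop body.
int sumSplit(const std::vector<int> &a, int K) {
    int n = (int)a.size(), s = 0;
    for (int i = 0; i < std::min(K, n); ++i) s += a[i];  // [0, K)
    if (K >= 0 && K < n) s += 100;                       // peeled i == K
    for (int i = K + 1; i < n; ++i) s += a[i];           // (K, n)
    return s;
}
```

Both forms compute the same value for any K, including K outside the iteration range, where the peeled iteration simply never fires.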

IndVarSimplify (NVIDIA) is upstream LLVM's induction variable canonicalization pass with three NVIDIA-specific extensions: Disable-unknown-trip-iv (bool, qword_4FAF520) -- bypasses the pass entirely when SCEV cannot compute the trip count, preventing aggressive IV transforms on warp-divergent loops; iv-loop-level (int, default 1, qword_4FAF440) -- restricts the pass to loops at a maximum nesting depth to control compile time on deeply nested stencil kernels; and disable-lftr (bool, byte_4FAF6A0) -- disables Linear Function Test Replace when the IV canonicalization would increase register pressure.

LoopVectorize (GPU-Adapted) is the largest single pass in the cicc loop pipeline (88 KB). On GPU, vectorization means generating ld.v2/ld.v4 wide loads rather than filling SIMD lanes. The pass builds VPlans, selects VF through a GPU-aware cost model that penalizes register pressure, and caps VF at 4 for most GPU targets. Scalable vectors are always disabled. The pass includes an outer-loop vectorization path (rarely triggered on GPU) and an inner-loop path (the main code path).
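What VF=4 vectorization means at source level can be sketched as follows; the function is illustrative (the four contiguous accesses in the main body are what the backend can fuse into a single ld.v4/st.v4 pair, and the scalar epilogue covers the remainder):

```cpp
#include <cassert>

// Sketch of inner-loop vectorization at VF = 4: the main loop touches
// four contiguous elements per iteration, and a scalar epilogue handles
// n % 4 leftovers. Illustrative only; not generated cicc output.
void scaleVF4(float *dst, const float *src, int n, float k) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {        // vector body, VF = 4
        dst[i + 0] = src[i + 0] * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    for (; i < n; ++i)                   // scalar epilogue
        dst[i] = src[i] * k;
}
```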

Loop Unrolling (GPU-Tuned) ships a substantially reworked computeUnrollCount decision engine with GPU heuristics: a local-array threshold multiplier that aggressively unrolls loops over __shared__ arrays, power-of-two factor enforcement, a pragma threshold 200x larger than stock LLVM, and a register-pressure-aware cost model. The transformation engine is lightly modified upstream UnrollLoop. The pass runs twice: once in the main pipeline, once as a late cleanup.
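A decision engine with the shape described above can be sketched as follows. This is a hypothetical reconstruction: the function name, the base budget of 150, and the 4x local-array multiplier are invented for illustration and are not cicc's actual values; only the structure (size budget, local-array multiplier, power-of-two growth, trip-count cap) mirrors the description.

```cpp
#include <cassert>

// Hypothetical sketch of a GPU-flavored unroll-count decision:
// grow the factor by powers of two while the unrolled body fits a
// size budget, with a larger budget for loops over local/__shared__
// arrays. All constants are assumptions made for this example.
int computeUnrollCountSketch(int loopSizeInsts, int tripCount,
                             bool accessesLocalArray) {
    int threshold = 150;                     // assumed base size budget
    if (accessesLocalArray) threshold *= 4;  // assumed local-array bonus
    int count = 1;
    // Power-of-two enforcement: only double, never grow by odd factors.
    while (count * 2 <= tripCount &&
           loopSizeInsts * count * 2 <= threshold)
        count *= 2;
    return count;
}
```

With these assumed numbers, a 20-instruction loop with trip count 16 unrolls 4x normally but fully (16x) when it indexes a local array, and a 200-instruction body is not unrolled at all.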

NVLoopStrengthReduce (NVIDIA Custom) is the most GPU-specific LLVM pass in cicc. NVIDIA ships a complete replacement formula solver (160 KB, 2688 lines) with 11 custom knobs controlling register pressure checking, address-space-aware formula selection, sign-extension optimization, and 64-bit IV handling. The stock LLVM LSR remains in the binary but the NVIDIA overlay replaces the formula generation and selection phases.
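The core strength-reduction transformation, independent of NVIDIA's solver, is easy to show at source level. The example below is illustrative: the per-iteration multiply in the address computation is replaced by an induction variable advanced by the stride, which a backend can fold into an addressing mode.

```cpp
#include <cassert>

// Before: the address expression a + i * stride contains a multiply
// that executes on every iteration.
long long sumStrided(const int *a, int n, int stride) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i * stride];          // multiply each iteration
    return s;
}

// After strength reduction: a pointer induction variable replaces the
// multiply with a constant-stride increment.
long long sumStridedLSR(const int *a, int n, int stride) {
    long long s = 0;
    const int *p = a;                // strength-reduced IV
    for (int i = 0; i < n; ++i, p += stride)
        s += *p;                     // no multiply in the loop body
    return s;
}
```

The GPU-specific difficulty the NVIDIA solver addresses is not this rewrite itself but choosing which of many candidate IV formulas minimizes register pressure per address space.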

Standard Loop Passes (Threshold Overrides Only)

LICM (Loop-Invariant Code Motion) hoists loop-invariant computations above the loop and sinks them below it. On GPU, LICM's hoist mode must be conservative: hoisting increases register pressure in the loop preheader, which may push past occupancy cliffs. The sink mode (running later) undoes unprofitable hoists. Stock LLVM with NVIDIA-tuned thresholds.
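The hoist transformation and its register cost can be seen in a minimal example (illustrative code, not cicc output): the invariant product k * k moves to the preheader, trading one multiply per iteration for one register held live across the whole loop -- exactly the tradeoff the later sink pass can reverse.

```cpp
#include <cassert>

// Before LICM: k * k is recomputed on every iteration.
int sumOriginalLICM(const int *a, int n, int k) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i] * (k * k);
    return s;
}

// After LICM hoisting: the invariant lives in the preheader. The
// temporary kk now occupies a register for the loop's entire lifetime,
// which is the occupancy cost LICM must weigh on GPU.
int sumHoisted(const int *a, int n, int k) {
    int kk = k * k;                  // hoisted invariant
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i] * kk;
    return s;
}
```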

LoopInterchange swaps the nesting order of a perfectly-nested loop pair when doing so improves memory access locality. In cicc, the threshold loop-interchange-threshold (dword_4FB07E0) defaults to 0, meaning interchange is only performed when the net locality benefit is non-negative AND parallelism improves. The pass has a 100-pair dependence limit (0x960 bytes) as a compile-time safety valve. There is no visible CUDA-specific memory space awareness -- the standard LLVM stride-1 locality model applies uniformly. See the standard loop passes page for details.
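A minimal illustration of the transformation on a row-major matrix (invented example): the original nest makes stride-cols inner accesses, the interchanged nest makes stride-1 inner accesses, and the result is unchanged because the reduction carries no order dependence across the swap.

```cpp
#include <cassert>
#include <vector>

// Before interchange: the inner loop walks down a column of a
// row-major matrix, so consecutive inner iterations are `cols`
// elements apart -- uncoalescable on GPU.
long long sumColMajor(const std::vector<int> &m, int rows, int cols) {
    long long s = 0;
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            s += m[i * cols + j];    // stride-cols inner access
    return s;
}

// After interchange: the inner loop walks along a row, giving the
// stride-1 pattern the locality model rewards.
long long sumRowMajor(const std::vector<int> &m, int rows, int cols) {
    long long s = 0;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            s += m[i * cols + j];    // stride-1 inner access
    return s;
}
```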

IRCE (Inductive Range Check Elimination) splits a loop into preloop/mainloop/postloop regions, eliminating range checks from the mainloop where the induction variable is provably within bounds. The implementation is stock LLVM with no visible NVIDIA modifications. Configuration globals include a block count threshold (dword_4FB0000), a debug flag (byte_4FAFE40), and a "constrained" relaxation mode (byte_4FAFBA0) that handles slightly non-canonical range checks common in GPU thread-coarsened loops.
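The preloop/mainloop/postloop shape can be sketched in source form. This is an illustrative reconstruction of the idea, not IRCE's actual output: with a zero lower bound the preloop is empty, the main loop runs over the range where the check is provably true, and the postloop here degenerates to nothing because out-of-range iterations have no effect.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Before IRCE: the range check i < bound executes inside the loop.
int countInRangeOriginal(const std::vector<int> &a, int n, int bound) {
    int hits = 0;
    for (int i = 0; i < n; ++i)
        if (i < bound)               // per-iteration range check
            hits += a[i];
    return hits;
}

// After IRCE: the main loop covers [0, min(n, bound)), where the check
// is provably true and therefore deleted. The preloop is empty (lower
// bound is 0) and the postloop is empty (failed checks do nothing).
int countInRangeIRCE(const std::vector<int> &a, int n, int bound) {
    int hits = 0;
    int mainEnd = std::max(0, std::min(n, bound));
    for (int i = 0; i < mainEnd; ++i)
        hits += a[i];                // check-free main loop
    return hits;
}
```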

LoopDistribute (loop fission) splits a single loop into multiple loops to separate unsafe memory dependences from safe ones, enabling LoopVectorize to vectorize the safe partition. Stock LLVM algorithm. The SCEV runtime check threshold (qword_4FB5480) is likely GPU-tuned. The pass runs before LoopVectorize in the pipeline.

LoopIdiomRecognize detects loops that implement common patterns (byte-by-byte copy, memset, mismatch search, string search) and replaces them with optimized multi-block IR or library calls. The expansion routines generate vectorized mismatch detection (sub_2AA00B0, 48 KB) and vectorized first-occurrence string search (sub_2AA3190, 40 KB), both with page-boundary-safe masked vector loads. Stock LLVM pass; the expansion quality benefits GPU targets where wide loads are profitable.

LoopDeletion removes loops proven dead (no observable side effects). Stock LLVM. LoopSink moves loop-invariant operations that were hoisted by LICM back into the loop body when doing so reduces register pressure -- particularly valuable on GPU where the register pressure tradeoff is acute.

Loop Analysis Infrastructure

All loop passes share three core analysis frameworks.

ScalarEvolution (SCEV)

SCEV models how values evolve across loop iterations. Every loop pass depends on it for trip count computation, stride analysis, and value range queries. cicc ships an LLVM 20.0.0-based SCEV with three NVIDIA extensions: a complexity control system (simple_mode) that prevents unbounded analysis time, GPU-specific SCEV sources that inject thread index bounds, and recognition of CUDA loop idioms (warp-stride, grid-stride). See ScalarEvolution Overview, Range Analysis & Trip Counts, and Invalidation & Delinearization.
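The central SCEV object for loop passes is the affine add recurrence {start,+,step}. The sketch below models only that concept -- its value at iteration i, and the backedge-taken count of a canonical iv < limit loop with positive step -- and is not cicc's SCEV implementation.

```cpp
#include <cassert>

// Toy affine add recurrence {start,+,step}<loop>: the value of an
// induction variable as a function of the iteration number.
struct AddRec {
    long long start, step;
    long long at(long long i) const { return start + step * i; }
};

// Backedge-taken count of `for (iv = start; iv < limit; iv += step)`,
// assuming step > 0 and the loop executes at least once: the number of
// iterations minus one (iterations = ceil((limit - start) / step)).
long long backedgeTakenCount(const AddRec &iv, long long limit) {
    return (limit - iv.start + iv.step - 1) / iv.step - 1;
}
```

For {0,+,1} with limit 10 the loop runs 10 times, so the backedge is taken 9 times; for {0,+,4} it runs at 0, 4, 8, taking the backedge twice.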

LoopInfo

LoopInfo provides the loop forest structure: which basic blocks belong to which loops, nesting depth, header/latch/exit identification. It is the primary structural query interface for all loop passes. Stock LLVM, no NVIDIA modifications.

DependenceInfo

DependenceInfo computes memory dependence direction vectors between instruction pairs across loop iterations. LoopInterchange and LoopDistribute are its primary consumers. The analysis uses SCEV to classify dependences as forward (<), backward (>), equal (=), scalar (S), independent (I), or unknown (*). Direction vectors drive the legality checks for loop interchange (no reversed backward-carried dependences after swap) and loop distribution (which instructions must stay in the same partition).
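The interchange legality check over direction vectors can be sketched as follows. This is a simplified illustration handling only '<', '>', '=', and '*': swap the two levels being interchanged, then the dependence is violated if the leftmost non-'=' direction is '>' (the dependence would run backward under the new loop order); '*' must be treated conservatively as possibly '>'. The scalar (S) and independent (I) classifications are ignored in this sketch.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified legality test for interchanging two loop levels of a
// dependence with the given direction vector (one char per loop level,
// outermost first). Illustrative, not LLVM's DependenceInfo logic.
bool interchangeLegal(std::string dir, int outer, int inner) {
    std::swap(dir[outer], dir[inner]);  // apply the interchange
    for (char d : dir) {
        if (d == '=') continue;         // equal: inspect next level
        return d == '<';                // '<' legal; '>' or '*' illegal
    }
    return true;                        // all '=': loop-independent
}
```

For example, a (<,=) dependence survives the swap, while a (<,>) dependence becomes (>,<) and blocks the interchange.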

Configuration Knob Reference

The following table consolidates all loop-pass-specific configuration knobs discovered in cicc v13.0. These are controllable via -mllvm -<knob>=<value>.

| Knob | Pass | Type | Default | Effect |
|---|---|---|---|---|
| Disable-unknown-trip-iv | IndVarSimplify | bool | false | Skip IV canonicalization for unknown-trip loops |
| iv-loop-level | IndVarSimplify | int | 1 | Max nesting depth for IV simplification |
| disable-lftr | IndVarSimplify | bool | false | Disable Linear Function Test Replace |
| replexitval | IndVarSimplify | enum | 1 (cheap) | Exit value replacement strategy: 0=never, 1=cheap, 2=always |
| indvars-widen-indvars | IndVarSimplify | bool | true | Allow IV widening to eliminate sign/zero extension |
| loop-interchange-threshold | LoopInterchange | int | 0 | Minimum net locality improvement for interchange |
| vectorize-loops | LoopVectorize | bool | true | Master vectorization enable |
| enable-early-exit-vectorization | LoopVectorize | bool | false | Allow vectorization of early-exit loops |
| force-vector-width-outer | LoopVectorize | bool | false | Force VF=4 for outer loops |
| nv-disable-loop-unrolling | LoopUnroll | bool | false | Disable the late unroll invocation |
| disable-unknown-trip-lsr | NV LSR | bool | false | Skip LSR for unknown-trip loops |
| lsr-check-rp | NV LSR | bool | true | Enable register pressure checking in LSR |
| lsr-rp-limit | NV LSR | int | ~32-64 | Register pressure ceiling for LSR |
| filter-bad-formula | NV LSR | bool | true | NVIDIA custom formula filtering |
| do-lsr-64-bit | NV LSR | bool | arch-dep | Enable LSR for 64-bit IVs (false on sm_3x-5x) |
| count-sxt-opt-for-reg-pressure | NV LSR | bool | true | Credit sign-ext savings in cost model |
| lsr-sxtopt | NV LSR | bool | true | Fold sign-extensions into IV expressions |
| lsr-loop-level | NV LSR | int | 0 (all) | Restrict LSR to specific loop nesting depth |
| lsr-skip-outer-loop | NV LSR | bool | false | Skip outer loop IVs in nested loops |
| disable-lsr-for-sharedmem32-ptr | NV LSR | bool | false | Disable LSR for addrspace(3) pointers |
| disable-lsr-complexity-discount | NV LSR | bool | false | Disable complexity discount in cost model |
| irce-block-threshold | IRCE | int | varies | Max basic blocks before IRCE bails |
| enable-loop-distribute | LoopDistribute | bool | false | Force-enable distribution |
| loop-distribute-scev-check-threshold | LoopDistribute | int | varies | Max SCEV runtime checks allowed |

Cross-References