SCEV Invalidation & Delinearization
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
SCEV analysis results are expensive to compute and are cached aggressively. When the IR mutates -- a loop is unrolled, a value is replaced, a block is deleted -- cached SCEV expressions, range information, and backedge-taken counts can become stale. The invalidation subsystem (forgetLoop, forgetValue, forgetAllLoops) determines exactly which cache entries must be discarded after each transformation. Get it wrong in either direction and the compiler either produces incorrect code (stale data) or wastes time recomputing everything (over-invalidation).
Delinearization is the complementary recovery problem: given a flat pointer expression like base + i*N*M + j*M + k, recover the original multi-dimensional subscripts [i][j][k]. This is critical for GPU code because memory coalescing analysis needs to know whether adjacent threads in a warp are accessing adjacent addresses -- a question that can only be answered by examining per-dimension subscripts against the thread index structure.
In cicc v13.0, both subsystems carry NVIDIA-specific modifications. The invalidation engine has an extended exit-analysis depth threshold and an early-out for simple two-operand AddRec expressions common in GPU loops. The delinearization engine has a polymorphic predicate collector that supports GPU-aware strategies for shared memory bank conflict detection and coalescing analysis, plus at least 9 configuration knobs not present in upstream LLVM.
| Property | Value |
|---|---|
forgetLoop address | sub_DE2750 (0xDE2750) |
forgetLoop size | 10,051 bytes (~2,271 asm lines) |
forgetValue address | sub_D9EE30 (0xD9EE30) |
forgetValue size | ~9 KB |
forgetAllLoops address | sub_D9D700 (0xD9D700) |
forgetAllLoops size | ~8 KB |
delinearize address | sub_DE9D10 (0xDE9D10) |
delinearize size | 3,614 bytes (~849 asm lines) |
collectParametricTerms address | sub_DE8D20 (0xDE8D20) |
| Hash function | (key >> 9) ^ (key >> 4) & (capacity - 1) |
| Empty sentinel | 0xFFFFFFFFFFFFF000 |
| Tombstone sentinel | 0xFFFFFFFFFFFFE000 |
Cache Invalidation
The Seven Caches
SCEV maintains seven distinct cache tables that must be kept consistent. Each has its own eviction path inside forgetLoop:
| # | Cache | Entry size | Key | Value | Context offset |
|---|---|---|---|---|---|
| 1 | ValueExprMap (primary) | 16 bytes | Value* | SCEV* | main SE object |
| 2 | Unsigned range cache | 40 bytes | SCEV* | ConstantRange | +976 |
| 3 | Signed range cache | 40 bytes | SCEV* | ConstantRange | +1008 |
| 4 | BTC cache | 168 bytes (0xA8) | loop SCEV* | BackedgeTakenInfo | +0x290 |
| 5 | Per-exit BTC cache | 16 bytes | exit SCEV* | exit count | +0x490 |
| 6 | AddRec folding cache | per-expression | AddRec pair | folded form | per-expression |
| 7 | Predicated BTC cache | 16 bytes | loop SCEV* | predicated count | secondary table |
All hash tables use the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.
forgetLoop: The 8-Phase Algorithm
sub_DE2750 is the largest invalidation function -- 10 KB of machine code organized into eight sequential phases. It is called after every loop transformation that might invalidate SCEV data.
Signature:
void forgetLoop(
ScalarEvolution *SE, // rdi -- the SE context
Loop *L, // rsi -- loop being forgotten
BasicBlock *Header, // rdx -- loop header block
ExitInfo *Exits, // rcx -- exit block info (nullable)
int DepthFlag, // r8d -- 0=shallow, 1=deep, >1=bounded
int ExtraFlag, // r9d -- controls AddRec early-out
SmallDenseSet *Visited // stack -- prevents cycles in nested loops
);
Phase 1 -- Block value collection (0xDE27C9). Iterates the loop's basic blocks and collects all Values that have cached SCEV entries. The block array is at loop[+0x20] -> [+0x10] (pointer) / [+0x18] (count), stored as 32-byte entries. For each Value, a dominance check (sub_B19D00) confirms it belongs to the loop, then the SCEV index is extracted from a 27-bit field at value[+4] & 0x7FFFFFF. Collected pointers are stored in a SmallVector (inline capacity 6) with bit 2 set as a tag.
Phase 1B -- Scope chain collection (0xDE28A7). Walks a scope chain obtained via sub_B6AC80(SE[0][+0x28], 0x99), where 0x99 is the SCEV scope identifier. Filters to SCEVUnknown entries (type byte 0x55) with specific flag conditions (byte [+0x21] & 0x20), verifying loop membership and dominance. This captures values not directly in the loop's blocks but semantically part of its analysis scope.
Phase 2 -- Exit block processing (0xDE29D9). Enumerates exit blocks via sub_AE6EC0 and processes their AddRec chains. For each exit, reads the chain at [exit+0x30] & ~7 (stripping tag bits), checks expression kind byte (range 0x1E--0x28), and extracts operands. For the common case of simple {start, +, step} two-operand recurrences, an early-out stops after processing 2 operands when ExtraFlag != 0. If the loop has exactly 2 exits and ExtraFlag >= qword_4F88DC8 (a global threshold for maximum exit analysis depth), deep exit analysis is skipped entirely.
Phase 3 -- Expression dependency analysis (0xDE2BC5). The core invalidation loop. Iterates the collected values in reverse order and builds a transitive closure of all dependent SCEV expressions. Uses a stack-based worklist (SmallVector, inline capacity 8) and a SmallDenseSet for visited tracking. The dependency walk dispatches on expression type:
Type 0x52 ('R' = AddRec): Follow Start and Step operands via getSCEV,
compare ranges with getRangeRef
Type 0x56 ('V' = variant): Check function pointer equality at [-0x60]
and [+0x08], follow if simple recurrence
Type 0x39/0x3A (flagged): Check bit 6 of flags byte, follow base pointer
or compute canonical form from 27-bit index
General: Follow underlying object, check for pointer
types (0x11/0x12), verify integer type
Phase 4 -- Primary cache eviction (0xDE2DFF). For each expression identified by Phase 3, looks it up in the ValueExprMap, computes both unsigned and signed ranges via getRangeRef (sub_DBB9F0), compares old and new ranges via ConstantRange::contains (sub_AB1BB0), and clears validity bits in the range cache ([entry+0x20] for unsigned, [entry+0x21] for signed). Wide APInt buffers (>64 bits) are freed through __libc_free.
Phase 5 -- BTC eviction (0xDE3D2F). For each collected value, looks it up in the BTC hash table. On hit: writes TOMBSTONE, decrements entry count, increments tombstone count, then calls forgetMemoizedResults (sub_DE2690) to recursively invalidate any expressions that depended on this backedge-taken count. Also evicts the corresponding predicated BTC entry from the secondary table.
Phase 6 -- AddRec folding cache cleanup (0xDE3230). For AddRec expressions (type 0x52), invalidates pre-computed folding results. Extracts the 6-bit opcode from [expr+2] & 0x3F and dispatches:
- Opcode
0x20(shift/power-of-two multiply): checks viacountPopulationwhether the step is a power of two, then callstryFoldAddRecWithStep(sub_DCFD50) - Opcodes
0x22--0x29(binary operations): constructs the appropriate folded expression per operation type and marks it for invalidation - Opcode
0x24with pointer type (0x0E): skips pointer-integer cast invalidation
Phase 7 -- Predicate and assumption cleanup (0xDE3856). Processes the predicate hash table via the loop object's fields. Performs range intersection (sub_AB0910), union (sub_AB0A00), and emptiness/fullness checks (sub_AAFBB0, sub_AAF760). If the resulting range is neither empty nor full, stores the updated BTC in the loop's entry.
Phase 8 -- Final output (0xDE3CCD). Writes 0x0101 to loop->flags[+0x20], marking the loop as SCEV-forgotten (bit 0 = primary cache invalidated, bit 8 = secondary cache invalidated). Frees heap-allocated collection and output buffers.
forgetValue and forgetAllLoops
forgetValue (sub_D9EE30, ~9 KB) performs single-value eviction. It removes the value's entry from the ValueExprMap, then walks all expressions that transitively depend on it and evicts those as well. Used when a single instruction is replaced (RAUW) or deleted.
forgetAllLoops (sub_D9D700, ~8 KB) iterates every loop in the function's LoopInfo and calls forgetLoop for each one. Used when the entire function's loop structure changes (e.g., after inlining or full function cloning).
Which Passes Trigger Invalidation
forgetLoop is called after these loop transformations:
| Pass | Why invalidation is needed |
|---|---|
| LoopUnroll | Trip counts change; unrolled body has different IVs |
| LoopVectorize | Widened types; vector IVs replace scalar ones |
| LoopPeeling | Peeled iterations change the start value of recurrences |
| LoopUnswitching | Exit conditions change; control flow restructured |
| LICM | Hoisted values have new SCEV forms outside the loop |
| LoopSimplify | Preheader/exit block insertion changes loop structure |
| LoopRotate | Header/latch swap requires BTC recomputation |
| LoopDistribute | Original loop split into multiple loops |
| LoopIdiomRecognize | Pattern replacement changes loop body |
| LoopIndexSplit (NVIDIA) | IV range split into subranges |
| MemorySpaceOpt (NVIDIA) | Address space changes invalidate pointer SCEVs |
The DepthFlag parameter controls the aggressiveness of invalidation: 0 does shallow invalidation (only direct loop values), 1 follows all dependency chains, and values >1 impose a bounded depth useful for performance in deeply nested loops. The Visited parameter (a SmallDenseSet*) prevents infinite cycles when nested loops have mutual SCEV dependencies.
The forget-scev-loop-unroll knob (boolean) controls whether SCEV cache is invalidated after unrolling -- disabling it is unsound but can be used for compile-time experimentation.
Delinearization
The Problem
CUDA kernels routinely access multi-dimensional arrays:
float val = A[blockIdx.x * BLOCK_H + threadIdx.y][blockIdx.y * BLOCK_W + threadIdx.x];
By the time this reaches LLVM IR, the address computation has been flattened:
%addr = getelementptr float, ptr %A, i64 %flat_idx
; where %flat_idx = (blockIdx.x * BLOCK_H + threadIdx.y) * N + (blockIdx.y * BLOCK_W + threadIdx.x)
SCEV sees this as a single polynomial. Delinearization recovers the per-dimension subscripts, which are essential for:
- Coalescing analysis: determining whether adjacent threads (
threadIdx.x,threadIdx.x+1, ...) access adjacent memory addresses (coalesced) or strided addresses (uncoalesced). This requires isolating the dimension wherethreadIdx.xappears. - Shared memory bank conflict detection: 32 banks, 4-byte stride. Knowing whether the innermost subscript is
threadIdx.x(conflict-free) vs.threadIdx.x * stride(potential conflicts) requires dimensional decomposition. - Dependence analysis: per-dimension dependence tests (Banerjee, GCD, MIV) are more precise than whole-expression tests. Delinearized subscripts feed
DependenceInfofor vectorization legality.
Delinearization Context
The delinearizer (sub_DE9D10) operates on a context object:
| Offset | Type | Field | Purpose |
|---|---|---|---|
+0x00 | ScalarEvolution* | SE | Parent SCEV context |
+0x08 | SCEV* | ElementSize | Innermost element size |
+0x10 | uint8_t | Flags | Bit 0: inline cache mode |
+0x18 | 64 bytes | InlineCache | 4-slot direct-mapped table (inline mode) |
+0x20 | uint32_t | Capacity | Heap table capacity (heap mode) |
+0x58 | SCEV* | TargetArrayPtr | Array being delinearized |
+0x60 | void* | PredicateCollector | Nullable; collects validity predicates |
+0x68 | SCEV* | StepRecurrence | AddRec step for innermost dimension |
The inline cache (4 slots of 16 bytes at +0x18) is a small-buffer optimization sized for the overwhelmingly common GPU case of 1D or 2D array accesses. Cache entries use the same (key >> 9) ^ (key >> 4) hash as all other SCEV tables.
The Recursive Delinearization Algorithm
sub_DE9D10 is a recursive function dispatching on 17 SCEV expression kinds via a jump table:
| Kind | Expression type | Handling |
|---|---|---|
| 0, 1, 16 | Constant, TruncateExpr (ident), Unknown | Leaf -- return unchanged |
| 2 | TruncateExpr | Recurse into inner, rebuild with getTruncateExpr |
| 3 | SignExtendExpr | Recurse; dimension discovery on AddRec step match |
| 4 | ZeroExtendExpr | Recurse; dimension discovery on AddRec step match |
| 5 | AddExpr | N-ary: delinearize each operand, rebuild with getAddExpr |
| 6 | MulExpr | N-ary: delinearize each factor, rebuild with getMulExpr |
| 7 | UDivExpr | Delinearize both operands, rebuild with getUDivExpr |
| 8 | AddRecExpr | N-ary with wrap flag preservation; critical path |
| 9--13 | SMax/UMax/SMin/UMin/SeqUMin | N-ary: delinearize operands, rebuild |
| 14 | PtrToIntExpr | Recurse into pointer, rebuild |
| 15 | GEP | Primary dimension discovery entry point |
The N-ary pattern. Cases 5, 6, 8--13 share a common template:
SmallVector<const SCEV*, 2> NewOps; // inline capacity 2
bool Changed = false;
for (auto *Op : Expr->operands()) {
const SCEV *NewOp = delinearize(Ctx, Op); // recursive
NewOps.push_back(NewOp);
if (NewOp != Op) Changed = true;
}
if (!Changed) return Expr; // pointer identity optimization
return rebuildExpr(SE, Kind, NewOps);
The "changed" flag enables pointer identity short-circuiting: if no operand was modified during recursion, the original expression pointer is returned without allocation.
AddRecExpr (case 8) is the most critical case for GPU code. Multi-dimensional array accesses manifest as nested AddRec expressions: {A[0][0], +, dim1}<outer_loop> wrapping {init, +, 1}<inner_loop>. The delinearizer preserves wrap flags (NSW/NUW/NW from bits [+0x1C] & 7) and the step value ([+0x30]) when reconstructing via getAddRecExpr (sub_DBFF60).
ZeroExtend/SignExtend (cases 3, 4) are secondary dimension discovery points. When the inner operand is an AddRec whose step matches Ctx->StepRecurrence (+0x68) and the AddRec has exactly 2 operands (the common {start, +, step} form), the handler extracts dimension information: it calls getElementSize (sub_D33D80) and getConstant (sub_DA4270) to compute the element count, then pushes a new term into the term collector at Ctx[+0x58]. This identifies a dimension boundary -- the extend operation wrapping a matching-step AddRec indicates the point where one array dimension ends and another begins.
GEP (case 15) is the primary entry for actual dimension discovery. It first checks the predicate collector (Ctx[+0x60]). If present, it searches the collector's table for a matching GEP index entry (type field == 1, matching scev_expr, operation == 0x20). If no predicate collector or no match, it falls back to structural delinearization via sub_DE97B0, which analyzes the GEP's index computation structure, iterates discovered terms, and classifies them by dimension type. Terms matching Ctx->StepRecurrence go to the direct collector; others go through the predicate collector's virtual dispatch (vtable[+0x10]).
Fixed-Point Iteration
The function itself is a single recursive pass, but its callers implement a fixed-point loop:
- Initialize the context with an initial guess for dimension sizes
- Call
sub_DE9D10to delinearize using those dimensions - During recursion, the GEP and extend handlers collect new dimension information into
Ctx[+0x58](term collector) andCtx[+0x60](predicate collector) - If collected dimensions differ from the initial guess, update and repeat from step 2
- Terminate when dimensions stabilize or a maximum iteration count is exceeded
The memoization cache ensures unchanged sub-expressions are not recomputed across iterations.
Parametric vs Fixed-Size Arrays
Upstream LLVM has the delinearize-use-fixed-size-array-heuristic knob (default: false). When the standard parametric delinearization fails -- typically because dimension sizes are runtime values with no SCEV relationship -- the fixed-size heuristic uses compile-time-known array dimensions from type metadata to guide decomposition.
cicc extends this with an alternative delinearization entry point at sub_147EE30 (25 KB), which applies additional heuristics controlled by at least 3 of the delinearization config globals (dword_4F9AB60, dword_4F9AE00, dword_4F9B340). This second path is likely NVIDIA-enhanced for cases common in GPU code, such as dynamically-allocated shared memory with dimensions derived from kernel launch parameters.
The dependence analysis subsystem has its own entry points into delinearization (sub_146F1B0 at 40 KB for delinearizeAccess, sub_146B5E0 at 18 KB for tryDelinearize) that combine delinearization with per-dimension dependence testing in a single pass.
GPU-Specific Delinearization Patterns
Thread grid indexing. The canonical GPU pattern threadIdx.x + blockIdx.x * blockDim.x produces an AddRec with step = blockDim.x (grid stride). The delinearizer recognizes this by matching the step recurrence against Ctx[+0x68]. When the step corresponds to a grid dimension, the subscript identifies which dimension of a multi-dimensional array is parallelized across the thread grid.
Shared memory bank conflicts. For shared memory arrays, the delinearizer feeds into bank conflict analysis. Shared memory has 32 banks with 4-byte interleaving. If delinearization reveals A[threadIdx.y][threadIdx.x] with row stride 32 (or any multiple of 32), every thread in a warp hits the same bank -- a 32-way conflict. If the stride is relatively prime to 32, accesses are conflict-free. This analysis requires knowing per-dimension subscripts, which only delinearization can provide from the flat pointer arithmetic.
Predicate collector polymorphism. The PredicateCollector at Ctx[+0x60] uses virtual dispatch (vtable[+0x10]), allowing different delinearization strategies to be plugged in:
- Standard delinearization for host code
- GPU-aware delinearization that considers shared memory bank geometry
- Coalescing-aware delinearization that checks whether the innermost subscript varies with
threadIdx.x
High-dimensional tensors. The term collector at Ctx[+0x58] is a growable SmallVector, supporting arrays with arbitrary dimensionality. This matters for tensor operations in CUDA (e.g., CUTLASS library patterns, which cicc special-cases elsewhere -- see the cutlass substring check in the dependence analysis region).
SCEV Term Collection
Before delinearization runs, collectParametricTerms (sub_DE8D20) walks the SCEV expression tree to extract candidate terms:
SCEVAddRecExproperands yield stride candidates (the step of each AddRec)SCEVUnknownandSCEVMulExprnodes yield dimension-size candidatesSCEVSignExtendExprnodes are also collected (they often wrap dimension-related terms)
These candidates are passed to findArrayDimensions (sub_147B0D0) which uses product decomposition to determine which terms correspond to array dimensions. The resulting dimension list seeds the delinearization context before sub_DE9D10 is invoked.
Configuration
SCEV Invalidation Knobs
| Knob | Default | Effect |
|---|---|---|
forget-scev-loop-unroll | true | Enable SCEV invalidation after loop unrolling |
verify-scev | false | Verify SCEV consistency after transformations |
verify-scev-strict | false | Stricter verification (compare old/new trip counts) |
verify-scev-maps | false | Verify SCEV map consistency |
qword_4F88DC8 (max exit analysis depth) | unknown | Threshold beyond which deep exit analysis is skipped |
SCEV Analysis Depth Limits (shared with invalidation)
| Knob | Default | Effect |
|---|---|---|
scalar-evolution-max-iterations | 100 | Maximum loop iterations for constant evaluation |
scalar-evolution-max-scev-compare-depth | 32 | Maximum SCEV comparison recursion depth |
scalar-evolution-max-arith-depth | 32 | Maximum SCEV arithmetic simplification depth |
scalar-evolution-max-ext-depth | 8 | Maximum sign/zero-extend nesting depth |
scalar-evolution-max-cast-depth | 8 | Maximum cast chain depth |
scalar-evolution-max-constant-evolving-depth | 32 | Maximum constant evolution depth |
scalar-evolution-max-expr-size | 384 | Maximum expression node count |
scalar-evolution-max-expr-failures | 100 | Maximum SCEV creation failures before bailout |
scalar-evolution-max-scev-operations-implication-depth | 2 | Maximum depth for implications |
scalar-evolution-max-value-compare-depth | 2 | Maximum value comparison depth |
NVIDIA-Specific SCEV Knobs
| Knob | Effect |
|---|---|
aggressive-positive-stride-analysis | More aggressive positive-stride IV analysis (nvbug 3972412) |
do-sign-ext-simplify | Simplify SCEV sign-extend expressions |
do-sign-ext-expand | Expand sign-extends during SCEV construction |
track-trip-count-more | Track loop trip counts more aggressively |
scev-mulops-inline-threshold (32) | Max MulExpr operands before out-of-line |
scev-addops-inline-threshold (500) | Max AddExpr operands before out-of-line |
Delinearization Knobs
| Global | Likely identity | Notes |
|---|---|---|
byte_4F9A8C0 | Delinearization enable flag | Master enable for the delinearization subsystem |
dword_4F9A620 | Config 1 | Referenced by combined delinearize-and-test |
dword_4F9A700 | Config 2 | Referenced by delinearizeAccess core |
dword_4F9A7E0 | Config 3 | Referenced by delinearizeAccess core |
dword_4F9AB60 | Config 4 | Referenced by alternative delinearization v2 |
dword_4F9AC40 | Config 5 | Referenced by dependence distance with delinearization |
dword_4F9AE00 | Config 6 (shared) | Referenced by both combined-test and v2 paths |
dword_4F9B260 | Config 7 | Referenced by combined delinearize-and-test |
dword_4F9B340 | Config 8 | Referenced by alternative delinearization v2 |
da-delinearize | Try to delinearize array references | DependenceAnalysis pass knob (upstream LLVM) |
da-miv-max-level-threshold | MIV test depth limit | DependenceAnalysis pass knob (upstream LLVM) |
Function Map
Invalidation Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::forgetLoop | sub_DE2750 | 10,051 B | 8-phase loop invalidation |
ScalarEvolution::forgetValue | sub_D9EE30 | ~9 KB | Single-value eviction |
ScalarEvolution::forgetAllLoops | sub_D9D700 | ~8 KB | Invalidate all loops |
forgetMemoizedResults | sub_DE2690 | small | Recursive BTC invalidation helper |
ScalarEvolution::verify | sub_DE5FA0 | ~52 KB | Debug verification (old/new trip count comparison) |
| Loop invalidation helper | sub_DE5640 | ~178 lines | Helper for forgetLoop |
| SCEV expression invalidator | sub_DCE1C0 | small | Callback for AddRec folding cleanup |
Delinearization Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::delinearize | sub_DE9D10 | 3,614 B | Recursive delinearizer (17-case switch) |
collectParametricTerms | sub_DE8D20 | ~521 lines | Term extraction before delinearization |
| Structural GEP delinearization | sub_DE97B0 | small | Sub-analysis called from GEP case |
canonicalizeExpr | sub_D9ABD0 | small | SCEV normalization |
computeAccessFunctions | sub_D94080 | ~12 KB | Access function computation |
SCEV_delinearize (dependence region) | sub_CF5550 | 6,276 B | Alternate copy in dependence analysis |
Dependence Analysis Delinearization
| Function | Address | Size | Role |
|---|---|---|---|
delinearizeAccess | sub_146F1B0 | 40 KB | Core delinearization for dependence analysis |
tryDelinearize | sub_146B5E0 | 18 KB | Delinearization attempt with fallback |
| Delinearize subscript | sub_1472640 | 10 KB | Per-subscript extraction |
| Array dimension inference | sub_1473850 | 12 KB | Infers dimensions from access patterns |
collectSubscripts | sub_1476060 | 22 KB | Multi-dimensional GEP subscript collection |
| Dependence distance with delinearization | sub_14747F0 | 15 KB | Computes dependence vectors using delinearized subscripts |
findArrayDimensions | sub_147B0D0 | 11 KB | Dimension sizes from SCEV product decomposition |
| Combined delinearize-and-test | sub_147C070 | 34 KB | Delinearize + per-dimension dependence test |
| Alternative delinearization v2 | sub_147EE30 | 25 KB | NVIDIA-enhanced heuristics |
| Partial result combiner | sub_147DF40 | 11 KB | Combines partial delinearization results |
Key SCEV Callees (shared by both subsystems)
| Function | Address |
|---|---|
getRangeRef -- range computation | sub_DBB9F0 |
ConstantRange::contains | sub_AB1BB0 |
ConstantRange::intersectWith | sub_AB0910 |
ConstantRange::unionWith | sub_AB0A00 |
ConstantRange::isEmptySet | sub_AAFBB0 |
ConstantRange::isFullSet | sub_AAF760 |
getSCEV -- expression resolution | sub_DD8400 |
tryFoldAddRecWithStep | sub_DCFD50 |
getAddExpr (N-ary) | sub_DC7EB0 |
getMulExpr (N-ary) | sub_DC8BD0 |
getAddRecExpr | sub_DBFF60 |
getUDivExpr | sub_DCB270 |
getZeroExtendExpr | sub_DC5000 |
getSignExtendExpr | sub_DC2B70 |
getTruncateExpr | sub_DC5200 |
getPtrToIntExpr | sub_DD3A70 |
DominatorTree::dominates | sub_B19D00 |
SmallDenseSet::insert | sub_C8CC70 |
| Cache insert (delinearization result memoization) | sub_DB11F0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Delinearization purpose | Optimize for cache locality; multi-dimensional subscript recovery for polyhedral analysis | Optimize for memory coalescing: recover subscripts to determine whether adjacent warp threads access adjacent addresses |
| Invalidation triggers | Standard loop transformations (unroll, vectorize, simplify) | Additional triggers from NVIDIA-specific passes: MemorySpaceOpt (address space transformations), IV Demotion (narrowing changes SCEV types), NVLoopStrengthReduce |
| Delinearization result caching | No explicit memoization in upstream | Memoization cache via sub_DB11F0 prevents redundant delinearization of the same GEP across multiple consumers |
| Thread index awareness | No concept of thread-index-based access patterns | Delinearized subscripts are analyzed against threadIdx dimensions to determine coalescing quality; feeds into vectorization and LSR decisions |
forget-scev-loop-unroll knob | Present in upstream LLVM | Same knob, but more critical on GPU because over-invalidation forces expensive SCEV recomputation on deeply nested kernel loops |
| Range source diversity | Profile data, programmer assertions (__builtin_assume) | Additional sources: !range metadata from nvvm-intr-range, __launch_bounds__, warpSize constant, special register bounded ranges |
Cross-References
- ScalarEvolution Overview & Construction -- SCEV expression creation, the ValueExprMap, and the expression DAG structure that invalidation walks
- SCEV Range Analysis & Trip Counts -- range caches and BTC caches that invalidation must clear; the
getRangeRefand BTC computation functions called during eviction - LoopVectorize & VPlan -- primary consumer of delinearization results for vectorization legality; calls
forgetLoopafter vectorizing - Loop Unrolling -- calls
forgetLoopafter unrolling; theforget-scev-loop-unrollknob controls this - Loop Strength Reduction (NVIDIA) -- uses SCEV for IV analysis; its transformations trigger
forgetValuecalls - MemorySpaceOpt -- NVIDIA-specific pass that triggers SCEV invalidation after address space transformations
- Alias Analysis & NVVM AA -- delinearization results feed into alias analysis for disambiguating multi-dimensional array accesses