EarlyCSE (Early Common Subexpression Elimination)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 EarlyCSE.cpp. Evidence: the iterative (non-recursive) dominator-tree walk matches the LLVM 16+ refactoring; the MemorySSA-backed variant with the early-cse-memssa pipeline parameter matches LLVM 14+. NVIDIA adds four GPU extensions (barrier-aware versioning, AS 7 handling, NVVM call CSE, PHI limit) and a fourth scoped hash table not present in any upstream version.
EarlyCSE is a fast dominator-tree-walk pass that eliminates redundant computations, loads, and calls within a function. Cicc's version is not stock LLVM 20.0.0 -- the binary contains four CUDA-specific extensions that handle GPU memory model semantics: barrier-aware memory versioning with hardcoded NVVM intrinsic ID checks, shared memory address space 7 protection against unsafe store-to-load forwarding, a dedicated NVVM intrinsic call CSE handler with a fast-path for thread-invariant special register reads, and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream lacks.
Key Facts
| Property | Value |
|---|---|
| Pass name | "early-cse" (standard), "early-cse-memssa" (MemorySSA variant) |
| Pipeline parser params | memssa (selects MemorySSA-backed variant) |
| Entry point (standard) | sub_2778270 |
| Entry point (MemorySSA) | sub_27783D0 |
| Core function | sub_2780B00 (12,350 bytes) |
| NVVM call CSE handler | sub_2780450 (1,142 bytes, ~263 decompiled lines) |
| Pipeline slot | 525, 593 (tier 2); 245, 291, ~370 (tier 3) |
| Disable flag | NVVMPassOptions offset +1440 |
| Pipeline assembler | sub_18E4A00 (MemorySSA variant), sub_196A2B0 (standard) |
| Upstream LLVM file | llvm/lib/Transforms/Scalar/EarlyCSE.cpp |
| NVIDIA modifications | Barrier generation tracking, AS 7 handling, NVVM call CSE, PHI limit, store-fwd table |
Algorithm Overview
The pass performs a stack-driven iterative DFS over the dominator tree. At each basic block it scans instructions linearly, attempting three forms of elimination:
- Expression CSE -- arithmetic, casts, comparisons, and GEPs with identical operands are looked up in a scoped hash table. If a matching canonical instruction exists, the redundant one is replaced via RAUW and erased.
- Load CSE and store-to-load forwarding -- loads from the same address and type as a prior load (or a prior store) are replaced with the already-available value. This is gated by a CurrentGeneration counter that invalidates stale entries whenever a memory-writing instruction or barrier intrinsic is encountered.
- Call CSE -- readonly/readnone calls with identical targets and arguments are deduplicated. The NVVM-specific handler sub_2780450 provides a fast-path for thread-invariant NVVM intrinsics (llvm.nvvm.read.ptx.sreg.*).
The dominator tree walk is not recursive. It uses an explicit growable stack (initial capacity 8 entries, 64 bytes) with DomTreeScope nodes that record per-scope hash table insertions. On scope exit all insertions are tombstoned. This matters for deeply-nested GPU kernel CFGs where stack overflow from recursion is a real risk.
function EarlyCSE(ctx):
    root = ctx.Function.DomTree.root
    stack.push(DomTreeScope(root))
    while stack is not empty:
        scope = stack.top()
        ctx.CurrentGeneration = scope.generation_begin
        if not scope.visited:
            for inst in scope.bb.instructions:
                processNode(ctx, inst)      // CSE logic below
            scope.visited = true
            scope.generation_end = ctx.CurrentGeneration
        else:
            if scope has unvisited children:
                child = scope.children.pop_front()
                stack.push(DomTreeScope(child))
                continue
            else:
                unwindScope(ctx, scope)     // tombstone entries, free node
                stack.pop()
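The walk above can be reproduced with a small runnable model (a Python sketch; the dict-based scope records are illustrative stand-ins for the 160-byte DomTreeScope nodes, and `walk_domtree` is our name):

```python
# Toy model of the iterative (non-recursive) dominator-tree walk.
# Field names are illustrative; the real pass operates on LLVM DomTreeNodes.

def walk_domtree(root):
    """Visit every node of a dom tree without recursion, recording
    enter/exit events the way the scope stack in EarlyCSE does."""
    events = []
    # Each stack entry mirrors a DomTreeScope: node, pending children, visited flag
    stack = [{"node": root, "children": list(root["children"]), "visited": False}]
    while stack:
        scope = stack[-1]
        if not scope["visited"]:
            events.append(("enter", scope["node"]["name"]))  # process instructions here
            scope["visited"] = True
        elif scope["children"]:
            child = scope["children"].pop(0)
            stack.append({"node": child, "children": list(child["children"]), "visited": False})
        else:
            events.append(("exit", scope["node"]["name"]))   # tombstone scope insertions here
            stack.pop()
    return events
```

The enter events form a pre-order traversal (instructions are processed before any dominated block), and the matching exit events mark where each scope's hash-table insertions would be tombstoned.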
DomTreeScope Structure
Each scope node is 160 bytes (0xA0), allocated via sub_22077B0:
| Offset | Type | Field |
|---|---|---|
| +0x00 | u32 | generation_begin -- snapshot of CurrentGeneration at scope entry |
| +0x04 | u32 | generation_end -- value at scope exit (after processing all instructions) |
| +0x08 | BasicBlock* | The basic block for this domtree node |
| +0x10 | DomTreeNode** | children_begin |
| +0x18 | DomTreeNode** | children_end |
| +0x20 | scope link | Expression ScopedHT chain -> ctx+0x78 |
| +0x38 | scope link | Load ScopedHT chain -> ctx+0x108 |
| +0x50 | scope link | Call ScopedHT chain -> ctx+0x198 |
| +0x68 | scope link | Call-values ScopedHT chain -> ctx+0x228 |
| +0x80 | scope link | Store-fwd ScopedHT chain -> ctx+0x250 |
| +0x98 | u8 | visited flag (0 = not yet processed, 1 = instructions scanned) |
Each chain entry is a triplet [link_fwd, link_back, insertion_list_head] occupying 24 bytes. On scope exit, the pass walks each insertion list and tombstones the corresponding hash table entries, then frees the scope node.
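A minimal sketch of the scoped-hash-table discipline, using a Python dict in place of the open-addressing table (class and method names are ours, not the binary's):

```python
# Sketch of a scoped hash table: each scope records its insertions so
# they can be undone (tombstoned) when the scope exits, restoring any
# value that an outer scope had under the same key.

class ScopedHT:
    def __init__(self):
        self.table = {}      # key -> value (real code: open addressing)
        self.scopes = []     # stack of per-scope insertion lists

    def enter_scope(self):
        self.scopes.append([])

    def insert(self, key, value):
        # Save the shadowed outer-scope value (None if the key was absent;
        # fine for a sketch that never stores None as a real value).
        self.scopes[-1].append((key, self.table.get(key)))
        self.table[key] = value

    def lookup(self, key):
        return self.table.get(key)

    def exit_scope(self):
        # Undo this scope's insertions in reverse order.
        for key, old in reversed(self.scopes.pop()):
            if old is None:
                del self.table[key]
            else:
                self.table[key] = old
```

This mirrors LLVM's ScopedHashTable semantics: a value inserted while visiting a dominator-tree node is visible in all dominated blocks, and disappears (or is un-shadowed) once the walk leaves that subtree.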
Four Scoped Hash Tables
Upstream LLVM EarlyCSE has three scoped hash tables (expression, load, call). Cicc adds a fourth dedicated to store-to-load forwarding.
| Table | Context offset | Hash function | Equality | Key | Value |
|---|---|---|---|---|---|
| Expression | +0xE8 / +0xF8 | sub_277F590 | sub_277AC50 | Opcode + operand value-numbers | Canonical instruction pointer |
| Load | +0x178 / +0x188 | sub_277CF80 | sub_27792F0 | Load address + type | Previously loaded value |
| Call | +0x230 / +0x240 | sub_277CF80 | sub_27792F0 | Call target + arguments | Return value |
| Store-fwd | +0x2C0 / +0x2D0 | sub_277C800 | sub_27781D0 | Store address + type | Stored value |
All four use open-addressing with linear probing. Sentinel values: 0xFFFFFFFFFFFFF000 = empty, 0xFFFFFFFFFFFFE000 = tombstone. Resize triggers at 75% load factor (4 * (count + 1) >= 3 * bucket_count) or when tombstones exceed 12.5% of capacity. Bucket counts are always a power of two.
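The resize policy can be stated compactly (constants and thresholds are those given above; the function name is ours):

```python
# Sentinels and resize policy from the text above (helper name is ours).

EMPTY = 0xFFFFFFFFFFFFF000
TOMBSTONE = 0xFFFFFFFFFFFFE000

def needs_resize(live_entries, tombstones, bucket_count):
    """True when the table must be rehashed (bucket counts are powers of two)."""
    # 75% load factor: 4 * (count + 1) >= 3 * bucket_count
    if 4 * (live_entries + 1) >= 3 * bucket_count:
        return True
    # Tombstones exceeding 12.5% of capacity also force a rehash
    if tombstones * 8 > bucket_count:
        return True
    return False
```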
The store-forwarding table is the NVIDIA addition. Upstream EarlyCSE performs store-to-load forwarding through the load table by inserting the stored value when a store is processed. Cicc separates this into a dedicated table, which enables more aggressive dead-store detection within the early pipeline -- two stores to the same address with no intervening load or barrier can be recognized without polluting the load table's namespace.
CUDA Extension 1: Barrier-Aware Memory Versioning
The context structure holds a CurrentGeneration counter at offset +0x2E0 (type u32). This counter acts as a memory version number. Every load and call CSE lookup checks whether the cached entry's generation matches the current generation -- a mismatch means an intervening memory-modifying operation invalidated the entry.
Generation is incremented when:
- A trivially dead instruction is skipped (minor bump at 0x2781950)
- sub_B46490 (hasMemoryWriteSideEffects) returns true for a call instruction
- Any of four hardcoded NVVM barrier intrinsic IDs is encountered
The barrier intrinsic checks are explicit cmp dword ptr [rax+24h], IMM instructions at specific addresses in the binary:
| Address | Encoding | Intrinsic ID | Decimal | Identity |
|---|---|---|---|---|
| 0x2781B30 | cmp ..., 9Bh | 0x9B | 155 | llvm.nvvm.barrier0 (__syncthreads) |
| 0x27812AF | cmp ..., CDh | 0xCD | 205 | llvm.nvvm.membar.* (device/system memory barrier) |
| 0x2781F4D | cmp ..., 123h | 0x123 | 291 | llvm.nvvm.bar.sync (named barrier sync) |
| 0x2781F40 | cmp ..., 144h | 0x144 | 324 | NVVM cluster barrier (SM 90+ cluster-scope fence) |
These checks are a safety net on top of the intrinsics' declared memory-effect attributes. Upstream LLVM relies solely on the memory-effect modeling to determine whether a call clobbers memory. Cicc adds the explicit ID checks because the barrier intrinsics' memory effects, as declared in the NVVM tablegen files, may not fully capture the GPU-specific semantics: a bar.sync does not just write memory from the perspective of one thread -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees at the IR level, so the explicit ID checks are the correctness backstop.
When any of these four intrinsics appears between two memory operations, EarlyCSE refuses to forward the earlier value. This prevents optimizations like:
;; INCORRECT optimization that barriers prevent:
%v1 = load i32, ptr addrspace(3) %p ;; load from shared memory
call void @llvm.nvvm.barrier0() ;; __syncthreads()
%v2 = load i32, ptr addrspace(3) %p ;; CANNOT be replaced with %v1
;; Another thread may have written to %p between the barrier and this load
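The invalidation rule can be modeled directly. This is a hypothetical Python sketch; only the four intrinsic IDs come from the binary, everything else is illustrative:

```python
# Minimal model of barrier-aware load CSE with generation versioning.

BARRIER_IDS = {155, 205, 291, 324}  # barrier0, membar.*, bar.sync, cluster

def run_early_cse(instrs):
    """instrs: list of ('load', addr) | ('call', intrinsic_id).
    Returns the set of load indices whose value was forwarded from an
    earlier load (i.e. the loads EarlyCSE would eliminate)."""
    generation = 0
    avail = {}                       # addr -> (defining_index, generation)
    eliminated = set()
    for i, inst in enumerate(instrs):
        if inst[0] == 'call':
            if inst[1] in BARRIER_IDS:
                generation += 1      # barrier: invalidate all cached entries
        elif inst[0] == 'load':
            entry = avail.get(inst[1])
            if entry and entry[1] == generation:
                eliminated.add(i)    # same address, same generation: forward
            else:
                avail[inst[1]] = (i, generation)
    return eliminated
```

A load after a barrier fails the generation check and is re-executed, exactly the behavior the IR example above requires.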
CUDA Extension 2: Shared Memory Address Space 7 Handling
Stores targeting NVPTX address space 7 (the internal representation for __shared__ memory) receive special treatment that prevents unsafe store-to-load forwarding.
At address 0x2781BB6, the pass checks byte [rdx+8] == 7 on the store's pointer operand type. When this matches, the store is routed through sub_B49E20 (isSharedMemoryStore), which calls sub_B43CB0 (getCalledFunction) and sub_B2D610 (hasIntrinsicID) to confirm the target is a shared memory variable (string ID 0x31 = "shared").
The motivation: shared memory is written by one thread and potentially read by a different thread after a barrier. Forwarding a stored value to a subsequent load in the same thread is only safe if no barrier intervenes -- but even then, a reimplementor must be careful because the CUDA memory model permits a thread to read its own store without a barrier, while other threads cannot. The shared-memory path in EarlyCSE conservatively disables forwarding for shared-memory stores to avoid the case where a load is CSE'd to the stored value, but the actual runtime value has been modified by another thread's post-barrier store to the same location.
processStore(ctx, store_inst):
    ptr_type = store_inst.pointer_operand.type
    if ptr_type.address_space == 7:             // NVPTX shared memory
        if isSharedMemoryStore(store_inst):     // sub_B49E20
            ctx.CurrentGeneration++             // invalidate load/call tables
            return                              // do NOT insert into store-fwd table
    // Normal path: insert stored value into store-fwd table for later forwarding
    insertStoreForwarding(ctx, store_inst)
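The same logic as a runnable check (Python; the dict fields and return strings are illustrative, not the binary's representation):

```python
# Runnable restatement of the processStore sketch above.

SHARED_AS = 7  # NVPTX shared memory address space in this pass

def process_store(ctx, store):
    """ctx: {'generation': int, 'store_fwd': {addr: (value, generation)}}.
    store: {'addr_space', 'is_shared_var', 'addr', 'value'}."""
    if store['addr_space'] == SHARED_AS and store['is_shared_var']:
        ctx['generation'] += 1          # invalidate load/call tables
        return 'invalidated'            # not inserted into store-fwd table
    ctx['store_fwd'][store['addr']] = (store['value'], ctx['generation'])
    return 'inserted'
```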
CUDA Extension 3: NVVM Intrinsic Call CSE (sub_2780450)
The dedicated function sub_2780450 (1,142 bytes, ~263 decompiled lines) handles CSE for calls to NVVM builtin intrinsics. It is entered when the main instruction loop detects a single-use-by-call pattern: the instruction's result has exactly one user, that user is a CallInst (opcode 0x1F), and the operand index is 3.
The function provides a fast-path for thread-invariant special register reads. Many NVVM intrinsics return values that are constant for the lifetime of a kernel invocation from a given thread's perspective:
- llvm.nvvm.read.ptx.sreg.tid.x/y/z -- threadIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.ntid.x/y/z -- blockDim.x/y/z
- llvm.nvvm.read.ptx.sreg.ctaid.x/y/z -- blockIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.nctaid.x/y/z -- gridDim.x/y/z
- llvm.nvvm.read.ptx.sreg.warpsize
- llvm.nvvm.read.ptx.sreg.laneid
Upstream LLVM would model these as readnone and CSE them through the generic call table. The NVVM-specific handler recognizes these intrinsic IDs directly via sub_987FE0 (getIntrinsicID), avoiding the overhead of the general readonly-call analysis. For a kernel that references threadIdx.x twenty times, the fast-path eliminates nineteen redundant intrinsic calls in a single pass.
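The fast-path amounts to a first-occurrence cache keyed on the intrinsic identity. A hypothetical sketch (intrinsic names stand in for the numeric IDs the binary checks; function name is ours):

```python
# Sketch of the thread-invariant special-register fast-path: repeated
# reads of the same sreg collapse to the first call's result.

THREAD_INVARIANT = {
    "llvm.nvvm.read.ptx.sreg.tid.x", "llvm.nvvm.read.ptx.sreg.ntid.x",
    "llvm.nvvm.read.ptx.sreg.ctaid.x", "llvm.nvvm.read.ptx.sreg.warpsize",
}

def cse_sreg_calls(calls):
    """calls: list of intrinsic names in program order.
    Returns (kept_calls, eliminated_count)."""
    seen = set()
    kept, eliminated = [], 0
    for name in calls:
        if name in THREAD_INVARIANT and name in seen:
            eliminated += 1      # reuse the first call's result (RAUW + erase)
        else:
            seen.add(name)
            kept.append(name)
    return kept, eliminated
```

For twenty references to threadIdx.x, one call survives and nineteen are eliminated, matching the claim above.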
The function also handles two additional NVVM intrinsic IDs:
| ID | Decimal | Identity | CSE behavior |
|---|---|---|---|
| 0xE4 | 228 | NVVM load intrinsic | CSE-able if same address and no intervening clobber |
| 0xE6 | 230 | NVVM store intrinsic | Blocks CSE (generation bump) |
The check at 0x2783890 tests for intrinsic ID 228 and at 0x27839BC for intrinsic ID 230. The store intrinsic (230) triggers a generation bump, while the load intrinsic (228) is treated as a CSE candidate.
CUDA Extension 4: PHI Operand Limit
At address 0x2781BED, the pass checks:
if PHINode.getNumIncomingValues() > 5:
    skip CSE analysis for this PHI
This is a compile-time heuristic absent from upstream LLVM. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence becomes quadratic in the operand count (each pair of values must be checked for dominance and equivalence), and the benefit for wide PHIs is marginal -- they rarely represent true common subexpressions.
The threshold of 5 is hardcoded with no cl::opt override.
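The guard and its compile-time payoff can be sketched in a few lines (function names are ours; the limit of 5 is the binary's):

```python
# Sketch of the hardcoded PHI operand guard: wide PHIs skip the
# quadratic pairwise equivalence comparison entirely.

PHI_OPERAND_LIMIT = 5

def should_cse_phi(incoming_values):
    return len(incoming_values) <= PHI_OPERAND_LIMIT

def count_pairwise_checks(incoming_values):
    """Pairwise dominance/equivalence checks performed: n*(n-1)/2 for
    PHIs under the limit, zero for PHIs that are skipped."""
    n = len(incoming_values)
    return n * (n - 1) // 2 if should_cse_phi(incoming_values) else 0
```

A 32-operand PHI would otherwise require 496 pairwise checks; the guard reduces that to zero at the cost of rarely missing a real redundancy.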
Instruction Classification
The inner processing loop at 0x2780EB5--0x2781110 classifies each instruction by its opcode byte at [instr-0x18]:
| Opcode (hex) | Name | Instruction | EarlyCSE action |
|---|---|---|---|
| 0x55 | Store | StoreInst | Store-to-load forwarding path; shared memory check |
| 0x3D | Call | CallInst | Call CSE or generation bump (if memory effects) |
| 0x3E | Invoke | InvokeInst | Same as CallInst |
| 0x3F | Select | SelectInst | Expression CSE with type-size check |
| 0x40 | PHI | PHINode | Expression CSE if operand count <= 5 |
| <= 0x1C | -- | Constants/args | Skip (not instructions) |
| 0x29 | Return | ReturnInst | Skip |
| 0x43--0x4F | Casts | Cast instructions | Expression CSE |
The classification dispatches to these helper predicates:
| Helper | Address | Purpose |
|---|---|---|
| sub_AA54C0 | 0x2780EC6 | isTriviallyDead -- if true, bump generation and skip |
| sub_D222C0 | 0x2780F97 | isSimpleExpression -- arithmetic, casts, comparisons, GEPs |
| sub_F50EE0 | 0x2780F7A | canCSE / doesNotAccessMemory |
| sub_1020E10 | 0x2781967 | getCallCSEValue -- readonly/readnone call check |
| sub_B46420 | 0x2781B95 | isLoadCSECandidate |
| sub_B46490 | 0x2781CC6 | hasMemoryWriteSideEffects -- triggers generation bump |
Load-Store Forwarding Detailed Flow
The most complex code path (0x2781B48--0x2781F32) handles load CSE and store-to-load forwarding:
processLoad(ctx, load_inst):
    key = computeLoadCSEKey(load_inst, ctx.DataLayout)    // sub_2779A20
    if key.status != 0:
        // Cannot form clean key -- check if call/invoke returns equivalent value
        if load_inst is CallInst (0x3D) or InvokeInst (0x3E):
            tryCallValueForwarding(ctx, load_inst)
        return
    // Check for preceding store to same address
    store_entry = lookupStoreTable(ctx, key)
    if store_entry and store_entry.generation == ctx.CurrentGeneration:
        // Forward stored value to this load
        salvageDebugInfo(load_inst, store_entry.value)    // sub_BD84D0
        replaceAllUsesWith(load_inst, store_entry.value)  // sub_11C4E30
        eraseInstruction(load_inst)                       // sub_B43D60
        return CHANGED
    // Check for preceding load from same address
    load_entry = lookupLoadTable(ctx, key)
    if load_entry and load_entry.generation == ctx.CurrentGeneration:
        // Replace with previously loaded value
        replaceAllUsesWith(load_inst, load_entry.value)
        eraseInstruction(load_inst)
        return CHANGED
    // Not found -- insert into load table for future lookups
    insertLoadTable(ctx, key, load_inst, ctx.CurrentGeneration)
For stores, the pass also performs dead-store detection within the same scope: if two stores target the same address with no intervening load or barrier, the earlier store is dead. The barrier check uses the same four intrinsic ID comparisons described above.
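The dead-store rule can be modeled with a runnable sketch (a simplification assuming straight-line code within one scope; the function name and tuple encoding are ours):

```python
# Sketch of within-scope dead-store detection: a store is dead if a later
# store to the same address occurs with no intervening load of that
# address and no barrier.

BARRIER_IDS = {155, 205, 291, 324}

def find_dead_stores(instrs):
    """instrs: ('store', addr) | ('load', addr) | ('call', id).
    Returns indices of stores overwritten before being observed."""
    last_store = {}                  # addr -> index of most recent unobserved store
    dead = []
    for i, inst in enumerate(instrs):
        if inst[0] == 'store':
            if inst[1] in last_store:
                dead.append(last_store[inst[1]])   # earlier store never observed
            last_store[inst[1]] = i
        elif inst[0] == 'load':
            last_store.pop(inst[1], None)          # store was observed locally
        elif inst[0] == 'call' and inst[1] in BARRIER_IDS:
            last_store.clear()                     # another thread may observe it
    return dead
```

Note how the barrier clears the tracking map: after a barrier, an earlier store may have been read by another thread, so it can no longer be considered dead.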
Type Compatibility and Bitwidth Handling
At 0x27829C3--0x2782B87, for expression CSE of SelectInst and PHINode:
- sub_AE43F0 computes the type size in bits via the DataLayout
- If size <= 64 bits: use a u64 bitmask as the CSE key
- If size > 64 bits: allocate a BitVector via sub_C43690 and use bit-level comparison
At 0x2782F72--0x2782FD5, integer constant range analysis computes leading zeros/ones to determine the effective bit-width. If the value fits in fewer bits, EarlyCSE allows CSE across different integer types (e.g., an i32 value zero-extended to i64 against a native i64). This is an NVIDIA extension that upstream LLVM does not perform -- upstream requires exact type matches for expression CSE.
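A hypothetical sketch of the effective bit-width test (our interpretation of the leading zeros/ones analysis; function names are ours):

```python
# Sketch of the effective bit-width check: a value held in a wider
# integer type can still match a narrower type's CSE key when the bits
# it actually uses fit in the narrower width.

def effective_bits(value):
    """Bits needed to represent `value` as a signed integer, derived
    from counting leading zeros (non-negative) or leading ones (negative)."""
    if value >= 0:
        return value.bit_length() + 1   # +1 for the sign bit
    return (~value).bit_length() + 1

def can_cse_across_widths(value, bits_a, bits_b):
    """True if the same constant can be CSE'd between integer types of
    widths bits_a and bits_b (e.g. 32 and 64)."""
    return effective_bits(value) <= min(bits_a, bits_b)
```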
Context Structure Layout
The EarlyCSEContext structure passed to sub_2780B00 in rdi:
| Offset | Field | Size |
|---|---|---|
| +0x00 | Current instruction pointer | 8 |
| +0x08 | DataLayout* / TargetData* | 8 |
| +0x10 | Function* (-> [+0x60] = DomTree root) | 8 |
| +0x18 | TargetLibraryInfo* | 8 |
| +0x20 | AssumptionCache* | 8 |
| +0x68 | MemDep result tracking | 8 |
| +0x70 | MemDep analysis reference | 8 |
| +0xE8--+0x110 | Expression hash table (buckets, count, ScopedHT, free list, allocator) | 40 |
| +0x170--+0x198 | Load hash table + ScopedHT | 40 |
| +0x200--+0x258 | Call hash table + ScopedHT | 88 |
| +0x2B8--+0x2D8 | Store-fwd hash table + ScopedHT | 32 |
| +0x2E0 | CurrentGeneration (u32) | 4 |
Stack frame: 0x1D0 bytes (sub rsp, 0x1A8 + 5 callee-saved pushes).
Scope Page Management
The scoped hash tables use 512-byte (0x200) scope pages chained together. When a page fills:
- At 0x2781328: fetch the previous page via [stack.end - 8] and advance by 0x200 to the next chained page.
- At 0x2782260: when reclaiming, free the current page and pop it from the page pointer array.
The initial worklist stack is 64 bytes (8 entries of 8 bytes each). The scope-page-pointer array is 8-byte aligned via lea rbx, [rdx*4 - 4]; and rbx, ~7; add rbx, rax.
memssa Pipeline Parameter
The pipeline parser registers "early-cse" at slot 394 with the parameter keyword memssa. When memssa is specified, the pass uses the MemorySSA-backed variant (sub_27783D0, pass name "Early CSE w/ MemorySSA") instead of the standard variant (sub_2778270, pass name "Early CSE"). Both variants call the same core function sub_2780B00; the difference is that the MemorySSA variant receives a pre-built MemorySSA graph in the context structure and uses it for more precise clobber queries, avoiding the O(n^2) scanning that the non-MSSA path falls back to for load CSE.
Knobs
| Knob | Default | Description |
|---|---|---|
| enable-earlycse-memoryssa | true | Master switch for MemorySSA integration |
| earlycse-debug-hash | false | Debug: log hash function inputs/outputs |
| earlycse-mssa-optimization-cap | 500 | Max MemorySSA queries per block before falling back to conservative behavior |
| enable-earlycse-imprecision | false | Allow approximate analysis in pathological cases (huge blocks, deep PHI nests) |
No dedicated cl::opt flags exist for any of the four NVIDIA extensions. The PHI operand limit of 5, the four barrier intrinsic IDs, and the shared-memory address space 7 check are all hardcoded in the binary.
Pipeline Positions and Tier Gating
| Tier | Position(s) | Notes |
|---|---|---|
| Tier 1 (O1) | Skipped | sub_12DE8F0 explicitly gates EarlyCSE with tier != 1 |
| Tier 2 (O2) | 525, 593 | Two invocations: early function simplification and post-loop-optimization |
| Tier 3 (O3) | 245, 291, ~370 | Three invocations; additional late-pipeline run |
| Ofcmid | After Sinking2 | Single invocation in the moderate-optimization path |
The pass is independently disableable via NVVMPassOptions at offset +1440. The same offset gates the standard and MemorySSA variants identically.
Key Constants
| Value | Hex | Meaning |
|---|---|---|
| 160 | 0xA0 | DomTreeScope node size |
| 512 | 0x200 | Scope page size |
| 64 | 0x40 | Initial stack capacity (8 entries) |
| 48 | 0x30 | Hash table entry node size |
| 40 | 0x28 | Insertion record size |
| 0xFFFFFFFFFFFFF000 | -- | Hash table EMPTY sentinel |
| 0xFFFFFFFFFFFFE000 | -- | Hash table TOMBSTONE sentinel |
| 155 | 0x9B | llvm.nvvm.barrier0 intrinsic ID |
| 205 | 0xCD | llvm.nvvm.membar.* intrinsic ID |
| 291 | 0x123 | NVVM bar.sync intrinsic ID |
| 324 | 0x144 | NVVM cluster barrier intrinsic ID |
| 228 | 0xE4 | NVVM load intrinsic ID |
| 230 | 0xE6 | NVVM store intrinsic ID |
| 5 | -- | PHI operand limit for CSE |
Differences from Upstream LLVM 20.0.0
| Feature | Upstream | Cicc |
|---|---|---|
| Scoped hash tables | 3 (expression, load, call) | 4 (+ store-forwarding) |
| Barrier intrinsic checks | Relies on memory-effect attributes only | Explicit ID checks for IDs 155, 205, 291, 324 |
| Shared memory handling | No address-space-specific logic | AS 7 stores skip store-fwd insertion, bump generation |
| NVVM intrinsic call CSE | Generic readonly-call path | Dedicated sub_2780450 with fast-path for sreg.* reads |
| PHI operand limit | None | Skip CSE for PHI nodes with >5 incoming values |
| Cross-type expression CSE | Exact type match required | Allows CSE across integer widths when value range fits |
| Dominator tree walk | Recursive in many LLVM builds | Always iterative (explicit stack) |
Function Map
| Role | Function | Size (bytes) |
|---|---|---|
| EarlyCSEPass::run (standard variant entry) | sub_2778270 | -- |
| EarlyCSEPass::run (MemorySSA variant entry) | sub_27783D0 | -- |
| Core pass body (domtree walk + instruction processing) | sub_2780B00 | 12,350 |
| handleNVVMCallCSE (NVVM intrinsic call CSE) | sub_2780450 | 1,142 |
| Expression hash function | sub_277F590 | -- |
| Expression equality check | sub_277AC50 | -- |
| Load/call key hash | sub_277CF80 | -- |
| Load/call key equality | sub_27792F0 | -- |
| Store key hash | sub_277C800 | -- |
| Store key equality | sub_27781D0 | -- |
| isSimpleExpression | sub_D222C0 | -- |
| canCSE / doesNotAccessMemory | sub_F50EE0 | -- |
| isSharedMemoryStore (AS 7 check) | sub_B49E20 | -- |
| isSharedMemoryAccess | sub_B49E00 | -- |
| getCallCSEValue (readonly/readnone check) | sub_1020E10 | -- |
| isLoadCSECandidate | sub_B46420 | -- |
| hasMemoryWriteSideEffects | sub_B46490 | -- |
| computeCSEHash / isVolatile | sub_B46500 | -- |
| getIntrinsicID (NVVM intrinsic ID from call) | sub_987FE0 | -- |
| isTriviallyDead | sub_AA54C0 | -- |
| replaceAllUsesWith (RAUW) | sub_11C4E30 | -- |
| salvageDebugInfo | sub_BD84D0 | -- |
| eraseInstruction | sub_B43D60 | -- |
| removeFromParent | sub_27793B0 | -- |
| computeLoadCSEKey | sub_2779A20 | -- |
| insertStoreForwarding | sub_27808D0 | -- |
| insertExprIntoScopedHT | sub_27801B0 | -- |
| lookupScope (find value by generation) | sub_277D510 | -- |
| lookupCallTable | sub_277D3C0 | -- |
| lookupInScopedHT | sub_2778110 | -- |
| shouldInsertIntoTable | sub_27785B0 | -- |
| growTable (double hash table size) | sub_277C980 | -- |
| insertIntoTable (post-grow insert) | sub_277C8A0 | -- |
| cleanupLoadTable (compact after scope exit) | sub_277FFC0 | -- |
| cleanupCallTable (compact after scope exit) | sub_277A110 | -- |
| compareLoadTypes (type compatibility) | sub_277A9A0 | -- |
| TargetData::getTypeSizeInBits | sub_AE43F0 | -- |
| getCalledFunction | sub_B43CB0 | -- |
| hasIntrinsicID | sub_B2D610 | -- |
Common Pitfalls
These are mistakes a reimplementor is likely to make when extending EarlyCSE for a GPU target with barrier semantics.
1. Relying solely on LLVM memory-effect attributes to model barrier semantics. Upstream LLVM models barrier intrinsics as memory-writing calls, which triggers a generation bump through the standard hasMemoryWriteSideEffects path. This is insufficient for GPU barriers: a bar.sync does not just write memory from one thread's perspective -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees. Cicc adds explicit hardcoded checks for four intrinsic IDs (155, 205, 291, 324) as a safety net. A reimplementation that trusts the declared memory effects alone will forward values across barriers, producing load CSE that reads stale pre-barrier data written by a different thread.
2. Forwarding stores to loads across barriers in shared memory (AS 7). When thread T0 stores to smem[0], a barrier fires, and thread T1 loads from smem[0], the load must see T1's own value (if it wrote) or the value written by whichever thread last stored before the barrier. Forwarding T0's stored value to T0's subsequent load is only safe if no barrier intervenes and no other thread could have written to the same location. Cicc's AS 7 handling conservatively disables store-to-load forwarding for all shared memory stores by bumping the generation counter. A reimplementation that allows shared memory store forwarding without barrier awareness will produce reads that return the local thread's stale value instead of the globally-visible post-barrier value.
3. Missing one or more of the four barrier intrinsic IDs. Cicc checks for IDs 155 (barrier0 / __syncthreads), 205 (membar.*), 291 (bar.sync), and 324 (cluster barrier for SM 90+). A reimplementation that only handles __syncthreads (ID 155) will fail to invalidate the load/call tables when a bar.sync or cluster barrier is encountered. The result: loads before and after a named barrier or cluster-scope fence are incorrectly CSE'd, producing silent data corruption in multi-CTA cooperative kernels.
4. Applying expression CSE to PHI nodes with more than 5 incoming values. Cicc hardcodes a PHI operand limit of 5 for CSE analysis. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence is quadratic in operand count, and the benefit for wide PHIs is negligible -- they rarely represent true common subexpressions. A reimplementation without this threshold will experience severe compile-time regressions on heavily unrolled GPU kernels.
5. Not adding a dedicated store-forwarding hash table. Upstream LLVM uses three scoped hash tables (expression, load, call). Cicc adds a fourth table dedicated to store-to-load forwarding. Without this separation, inserting stored values into the load table pollutes the load namespace, making dead-store detection within the same scope unreliable. Two stores to the same address with no intervening load or barrier should trigger dead-store elimination of the earlier store; mixing stores into the load table obscures this pattern.
Cross-References
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- MemorySSA Builder for GPU -- the MemorySSA infrastructure consumed by the early-cse-memssa variant
- Hash Infrastructure -- the universal DenseMap mechanics shared by all four hash tables
- Barriers & Sync -- the barrier builtins whose intrinsic IDs trigger generation bumps
- Dead Synchronization Elimination -- the 96KB pass that removes dead barriers; interacts with EarlyCSE's barrier-aware generation tracking
- GVN -- the more expensive redundancy elimination pass that complements EarlyCSE later in the pipeline
- DSE -- Dead Store Elimination, which complements EarlyCSE's within-scope store-to-load forwarding with cross-block analysis
- Pipeline & Ordering -- tier-dependent scheduling and NVVMPassOptions gating
- Alias Analysis & NVVM AA -- address-space-aware alias analysis that feeds into MemorySSA clobber queries