DSE (Dead Store Elimination)
NVIDIA-modified pass. See Differences from Upstream LLVM for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp (LLVM 20.0.0)
CICC v13.0 contains a heavily modified Dead Store Elimination pass totaling approximately 91 KB of decompiled code across three major functions: the core DSE::runOnFunction at sub_19DA750 (33 KB), the overwrite detection engine at sub_19DDCB0 (28 KB), and the partial overwrite tracking system at sub_19DF5F0 (30 KB). This substantially exceeds the size of upstream LLVM DSE, primarily due to NVIDIA's additions for partial store forwarding with type conversion, cross-store dependency tracking, store-chain decomposition for aggregates, and native CUDA vector type awareness.
IR Before/After Example
DSE removes stores that are overwritten before any load reads them. The NVIDIA extension handles partial overwrites common in CUDA vector code.
Before (dead store followed by overwrite):
```llvm
define void @f(ptr addrspace(1) %p, float %x, float %y) {
  store float %x, ptr addrspace(1) %p, align 4      ; dead: overwritten below before any load
  %other = fadd float %x, %y
  store float %other, ptr addrspace(1) %p, align 4  ; overwrites the first store completely
  ret void
}
```
After:
```llvm
define void @f(ptr addrspace(1) %p, float %x, float %y) {
  ; first store removed -- overwritten by second store, no intervening load
  %other = fadd float %x, %y
  store float %other, ptr addrspace(1) %p, align 4
  ret void
}
```
NVIDIA's DSE also handles partial overwrite patterns with CUDA vector types. When a float4 store partially overwrites a previous float4 store, the pass decomposes via GEP to determine which elements are dead. This is a key GPU extension that upstream LLVM DSE does not handle.
Analysis Dependencies
DSE requires five analysis passes, resolved through the pass manager at registration time (sub_19DD1D0):
| Analysis | Pass ID Global | Description |
|---|---|---|
| MemorySSA | unk_4F9E06C | Memory SSA graph |
| DominatorTree | unk_4F9A488 | Dominator tree |
| MemoryDependence | unk_4F9B6E8 | Memory dependence queries |
| PostDominatorTree | unk_4F9D764 | Post-dominator tree |
| AliasAnalysis | unk_4F9D3C0 | NVVM-aware alias analysis |
Core Algorithm
The main entry point DSE::runOnFunction (sub_19DA750) processes a function by iterating over store instructions and checking whether each store is dead (fully or partially overwritten by a later store to the same location before any intervening load).
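The basic idea of "a store is dead if a later store overwrites it before any load" can be illustrated with a toy backward scan over one basic block. This is a minimal sketch and not the decompiled algorithm: the `(opcode, address, value)` tuple format and the function name are hypothetical, and it assumes that equal address names always alias and different names never do.

```python
from typing import List, Tuple

# Hypothetical toy instruction: (opcode, address, value_or_result_name)
Instr = Tuple[str, str, str]

def eliminate_dead_stores(block: List[Instr]) -> List[Instr]:
    """Drop a store when a later store hits the same address with no load in between."""
    kept_reversed: List[Instr] = []
    overwritten = set()  # addresses fully overwritten by a later store
    for op, addr, val in reversed(block):
        if op == "store":
            if addr in overwritten:
                continue  # dead: a later store covers it and nothing read it
            overwritten.add(addr)
        elif op == "load":
            overwritten.discard(addr)  # the load keeps earlier stores to addr alive
        kept_reversed.append((op, addr, val))
    return list(reversed(kept_reversed))

block = [("store", "p", "x"), ("load", "q", "t"), ("store", "p", "other")]
optimized = eliminate_dead_stores(block)
```

Here the first store to `p` is removed because only an unrelated load of `q` sits between the two stores; a load of `p` in that position would keep it alive.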
Early Exit and Setup
The pass begins with an early exit check via sub_1636880() to determine whether the function should be skipped entirely. It then retrieves MemoryDependence and AliasAnalysis from the pass manager and calls sub_14A4050 / sub_14A2F00 to verify the function contains stores worth analyzing. If no stores are present, the pass returns immediately.
Store Instruction Identification
Store instructions are identified by checking byte +16 of the instruction structure for value 77. The operand count is read from offset +20 (masked with 0xFFFFFFF), and the "has-operand-list-pointer" flag at byte +23, bit 0x40, indicates indirect operand storage for instructions with many operands.
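The field layout described above can be modelled on a raw byte buffer. This is an illustrative sketch only: the function and constant names are hypothetical, and it assumes a little-endian host layout in which byte +23 is the top byte of the 32-bit word at +20.

```python
import struct

STORE_OPCODE = 77                # value checked at byte +16
OPERAND_COUNT_MASK = 0x0FFFFFFF  # low 28 bits of the word at +20
OPERAND_LIST_PTR_FLAG = 0x40     # bit tested in byte +23

def decode_header(raw: bytes):
    """Return (is_store, operand_count, has_operand_list_pointer)."""
    opcode = raw[16]
    (word,) = struct.unpack_from("<I", raw, 20)
    operand_count = word & OPERAND_COUNT_MASK
    has_operand_list_ptr = bool(raw[23] & OPERAND_LIST_PTR_FLAG)
    return opcode == STORE_OPCODE, operand_count, has_operand_list_ptr

# Build a fake 24-byte header: store opcode, two operands, flag set.
example = bytearray(24)
example[16] = STORE_OPCODE
struct.pack_into("<I", example, 20, 2 | (OPERAND_LIST_PTR_FLAG << 24))
decoded = decode_header(bytes(example))
```

Under the little-endian assumption the flag bit lies above the 28-bit count, so the two fields do not overlap.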
Type Size Computation
DSE computes store sizes through a type-walker switch on byte +8 of the type structure. This logic is shared between the core pass and the overwrite detector:
| Type Code | Size | Notes |
|---|---|---|
| 1 | 16 bits | Half-precision float |
| 2 | 32 bits | Float / int32 |
| 3, 9 | 64 bits | Double / int64 |
| 4 | 80 bits | x86 long double / PTX f80 |
| 5, 6 | 128 bits | Quad precision / int128 |
| 7 | pointer-sized | Resolved via sub_15A9520 |
| 0xB | immediate | Size from upper bits of type word |
| 0xD | struct | Layout computed by sub_15A9930 |
| 0xE | vector | element_size * num_elements with alignment |
| 0xF | integer | Arbitrary-width integer |
| 0x10 | array | Recurses into element type, multiplies by count |
| 0, 8, 0xA, 0xC | array-like | Follows pointer chain |
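The vector type formula (case 0xE) rounds each element's byte size up to a multiple of its alignment before multiplying by the element count: 8 * num_elements * element_alignment * ((element_alignment + ceil(element_bits / 8) - 1) / element_alignment), where the outer division is integer (truncating) division. This handles CUDA native vector types (float2, float4, int4); for float4 (four 32-bit elements with 4-byte alignment) it yields 128 bits.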
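As a worked example, the vector case can be sketched as follows. This reads the formula as the usual round-up-to-alignment idiom with integer division, which is an assumption; the function name and parameters are hypothetical.

```python
def vector_size_bits(num_elements: int, element_bits: int, element_alignment: int) -> int:
    """Size in bits of a vector whose elements are padded to their alignment."""
    element_bytes = (element_bits + 7) // 8  # ceil(element_bits / 8)
    # Round the element's byte size up to a multiple of its alignment.
    padded_bytes = element_alignment * (
        (element_alignment + element_bytes - 1) // element_alignment
    )
    return 8 * num_elements * padded_bytes

float4_bits = vector_size_bits(4, 32, 4)
float2_bits = vector_size_bits(2, 32, 4)
```

Under this reading, float4 and int4 both come out at 128 bits and float2 at 64 bits, matching the sizes of the CUDA native vector types.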
Overwrite Detection
The overwrite analysis engine at sub_19DDCB0 (28 KB) determines whether one store completely or partially covers another. It receives the instruction, an operand index, alias analysis results, and address-space information.
Alias Queries
The function calls sub_14C2730 to perform alias queries with full parameters: (target_ptr, data_layout, 0, instruction, store_address, alias_analysis). This returns whether two memory locations may alias. The alias analysis already incorporates CUDA address-space separation (shared=3, global=1, local=5, constant=4), so DSE itself does not need explicit address-space checks.
Partial Store Forwarding
When store sizes do not match, NVIDIA's DSE creates truncation or extension casts to extract the relevant portion. This is a critical GPU-specific extension:
- If the source is smaller than the destination: creates an extension (opcode 36 = zext).
- If the source is larger than the destination: creates a truncation (opcode 38 = trunc).
- Alignment requirements are verified through sub_16431D0.
- Complex types use sub_15FDBD0 for cast creation; simple types use sub_15A46C0.
Standard LLVM DSE bails on size mismatches. NVIDIA's version handles the common CUDA pattern of a float4 store followed by a scalar float load by extracting the relevant component via GEP + load.
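The cast selection rule can be sketched as below. Only the opcode numbers 36 and 38 come from the analysis above; the function names are hypothetical, and the value model treats everything as an unsigned integer bit pattern purely for illustration.

```python
ZEXT_OPCODE = 36   # extension, per the list above
TRUNC_OPCODE = 38  # truncation, per the list above

def select_cast(source_bits: int, dest_bits: int):
    """Return the cast opcode needed to go from source to destination width."""
    if source_bits < dest_bits:
        return ZEXT_OPCODE
    if source_bits > dest_bits:
        return TRUNC_OPCODE
    return None  # equal sizes: no cast needed

def apply_cast(value: int, source_bits: int, dest_bits: int) -> int:
    """Model the selected cast on an unsigned bit pattern."""
    if select_cast(source_bits, dest_bits) == TRUNC_OPCODE:
        return value & ((1 << dest_bits) - 1)  # keep the low dest_bits
    return value & ((1 << source_bits) - 1)    # zext or no cast: upper bits are zero

low_word = apply_cast(0x123456789, 64, 32)
```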
Store Size Ratio Check
At labels LABEL_25 / LABEL_29 in the core function, DSE performs a ratio check:
- Computes v159 = aligned size of destination type.
- Computes v48 = aligned size of source type.
- Calculates v148 = v48 / v159 (how many destination elements fit in source).
- If v48 % v159 != 0, bails (partial overlap that cannot be forwarded).
- If sizes differ, creates a GEP + load to extract the relevant portion.
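The steps above amount to a divisibility test on the two aligned sizes. A minimal sketch, with hypothetical names standing in for the decompiler variables (v48 as `source_bits`, v159 as `dest_bits`, v148 as the returned ratio); the zero-size guard is an addition for safety, not something observed in the binary:

```python
def size_ratio(source_bits: int, dest_bits: int):
    """Return how many destination-sized elements fit in the source, or None to bail."""
    if dest_bits == 0 or source_bits % dest_bits != 0:
        return None  # partial overlap that cannot be forwarded
    return source_bits // dest_bits

def needs_extraction(source_bits: int, dest_bits: int) -> bool:
    """True when the sizes differ but divide evenly, i.e. the GEP + load path applies."""
    ratio = size_ratio(source_bits, dest_bits)
    return ratio is not None and ratio != 1

float4_to_float = size_ratio(128, 32)
```

For a float4 store (128 bits) and a float access (32 bits) the ratio is 4, so one of four components is extracted; a 96-bit source against a 64-bit destination fails the modulo test and the pass bails.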
Metadata Preservation
After creating a replacement instruction, the pass preserves metadata:
- Debug location via sub_157E9D0.
- Use-chain linkage by updating prev/next pointers at offsets +24/+32.
- Basic block insertion via sub_164B780.
- TBAA metadata propagation through sub_1623A60 / sub_1623210.
- nonnull attribute copying via sub_15FA300 / sub_15FA2E0.
- Use replacement via sub_164B7C0.
Partial Overwrite Tracking
The function-level partial overwrite pass at sub_19DF5F0 (30 KB) maintains a hash table of all stores in a function and tracks which stores partially overwrite each other.
Hash Table Structure
Each hash table entry is 72 bytes:
| Offset | Content |
|---|---|
| +0 | Key (store instruction pointer; -8 = empty, -16 = tombstone) |
| +8 | Operand list pointer |
| +16 | Operand count |
| +24 | Inline storage (when count <= small threshold) |
| +48 | Additional metadata |
The hash function, probing strategy, and growth/compaction thresholds follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure. This instance uses NVVM-layer sentinels (-8 / -16) and a minimum table size of 64 entries.
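To make the sentinel convention concrete, here is a small open-addressing table keyed by pointer values. It is a sketch under assumptions: the pointer hash and triangular probing are those of upstream LLVM's DenseMap and are assumed, not confirmed, for this instance; the class name is hypothetical; and growth is omitted, so the table must not be filled completely.

```python
EMPTY, TOMBSTONE = -8, -16  # sentinel keys described above
MIN_BUCKETS = 64            # minimum table size described above

def pointer_hash(ptr: int) -> int:
    # Upstream DenseMap pointer hash; assumed to apply here.
    return ((ptr >> 4) ^ (ptr >> 9)) & 0xFFFFFFFF

class StoreTable:
    def __init__(self):
        self.keys = [EMPTY] * MIN_BUCKETS  # bucket count stays a power of two

    def _probe(self, key: int):
        """Return (found, bucket_index); the index is the insertion slot if not found."""
        mask = len(self.keys) - 1
        idx = pointer_hash(key) & mask
        first_tombstone = None
        step = 1
        while True:
            current = self.keys[idx]
            if current == key:
                return True, idx
            if current == EMPTY:
                slot = idx if first_tombstone is None else first_tombstone
                return False, slot
            if current == TOMBSTONE and first_tombstone is None:
                first_tombstone = idx  # reuse the first tombstone on insert
            idx = (idx + step) & mask  # triangular probing
            step += 1

    def insert(self, key: int) -> int:
        found, idx = self._probe(key)
        if not found:
            self.keys[idx] = key
        return idx

    def erase(self, key: int) -> bool:
        found, idx = self._probe(key)
        if found:
            self.keys[idx] = TOMBSTONE  # keep probe chains intact
        return found

    def contains(self, key: int) -> bool:
        return self._probe(key)[0]
```

Erasing writes a tombstone rather than the empty marker so that lookups for keys inserted later in the same probe chain still succeed.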
Cross-Store Dependency Records
When a new store aliases an existing entry, DSE records both stores in a 6-element record: {store1, store2, operand1, operand2, ptr1, ptr2}. This enables tracking stores that partially overwrite each other even when the overwritten value has been modified between stores. Reference counting is managed through sub_1649AC0 / sub_1649B30, and per-entry operand lists grow via sub_170B450.
Store-Chain Decomposition
In the LABEL_47 region of the core function, DSE walks store chains through struct/array GEPs and decomposes aggregate stores into element-level dead store checks. sub_19D94E0 handles chain-level elimination, while sub_19D91E0 builds the comparison set for overlap detection.
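One way to picture element-level checking is as a byte-range coverage test: an aggregate store is dead if later element stores jointly overwrite every byte it wrote. The sketch below illustrates that idea only; it is not the logic of sub_19D94E0 or sub_19D91E0, and the `(offset, size)` representation and function name are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (byte offset, size in bytes)

def covered_by(aggregate: Range, later_stores: List[Range]) -> bool:
    """True if the later stores together overwrite every byte of the aggregate store."""
    start, size = aggregate
    end = start + size
    covered_to = start  # every byte in [start, covered_to) is known to be overwritten
    for offset, length in sorted(later_stores):
        if offset > covered_to:
            break  # gap: at least one byte of the aggregate survives
        covered_to = max(covered_to, offset + length)
        if covered_to >= end:
            return True
    return covered_to >= end

# A 16-byte aggregate store followed by four 4-byte element stores.
fully_dead = covered_by((0, 16), [(0, 4), (4, 4), (8, 4), (12, 4)])
partly_live = covered_by((0, 16), [(0, 4), (4, 4), (12, 4)])
```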
Address-Space Handling
DSE does not contain explicit CUDA address-space comparisons. Address-space separation is handled entirely by the underlying NVVM alias analysis (unk_4F9D3C0), which knows that different address spaces cannot alias. The alias query function sub_14C2730 receives the full instruction context including address space, so query results already incorporate this constraint.
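The address-space rule that the alias analysis is described as applying can be sketched in a few lines. The numeric values for global, shared, constant, and local are the ones listed under Alias Queries; treating the generic address space (0) as possibly aliasing everything is an assumption based on the NVPTX address-space model, and the function name is hypothetical.

```python
GENERIC, GLOBAL, SHARED, CONSTANT, LOCAL = 0, 1, 3, 4, 5

def may_alias_by_address_space(as_a: int, as_b: int) -> bool:
    """Conservative address-space filter applied before any pointer-level reasoning."""
    if as_a == GENERIC or as_b == GENERIC:
        return True  # assumption: a generic pointer may point into any space
    return as_a == as_b  # distinct non-generic spaces are disjoint

shared_vs_global = may_alias_by_address_space(SHARED, GLOBAL)
```

With such a filter inside the alias query, a store to shared memory never blocks elimination of a store to global memory, which is why DSE itself needs no address-space comparisons.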
Store Forwarding to Loads
The function sub_19DBD20 (20 KB) attempts store-to-load forwarding. When sub_19DD7C0 finds a store feeding into a load, it constructs a replacement using sub_12815B0. Sign/zero extension matching uses type byte 15 (float types) and type byte 11 (integer types), with opcodes 45 (float-to-int truncation), 46 (int-to-float), and 47 (generic cast).
Related Passes
Two related passes are registered alongside DSE in the same code region:
- MergedLoadStoreMotion (sub_19DCD20, pass name mldst-motion): Shares the same alias infrastructure and is registered with the same analysis dependencies.
- NaryReassociate (sub_19DD420 / sub_19DD530): N-ary reassociation pass factory, registered at sub_19DD1D0 with its own analysis set.
Key Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| DSE::runOnFunction | 0x19DA750 | 33 KB | Main dead store elimination |
| DSE::analyzeOverwrite | 0x19DDCB0 | 28 KB | Complete/partial overwrite detection |
| DSE::runPartialOverwritePass | 0x19DF5F0 | 30 KB | Function-level partial tracking |
| DSE::tryForwardStoresToLoad | 0x19DBD20 | 20 KB | Store-to-load forwarding |
| DSE::buildOverwriteRecord | 0x19D8AF0 | -- | Overlap record construction |
| DSE::buildComparisonSet | 0x19D91E0 | -- | Set of stores to compare |
| DSE::eliminateStoreChain | 0x19D94E0 | -- | Chain-level elimination |
| DSE::scanLoopForDeadStores | 0x19DCB70 | -- | Loop-level DSE |
| DSE::runOnBasicBlock | 0x19DCC90 | -- | Block-level entry point |
| DSE::extractStoreOperands | 0x19DD690 | -- | Get base pointer and stored value |
| DSE::lookupDeadStoreCandidate | 0x19DD7C0 | -- | Hash table lookup |
| DSE::decomposeGEPStore | 0x19DD950 | -- | GEP-based store decomposition |
| DSE::collectPartialOperands | 0x19DEFC0 | -- | Partial overwrite operand collection |
| DSE::checkPartialOverwrite | 0x19DEE70 | -- | Individual partial overwrite check |
| DSE::tryEliminateStore | 0x19DF200 | -- | Attempt store elimination |
| DSE::rehashStoreTable | 0x19DF220 | -- | Hash table resize |
Differences from Upstream LLVM
- Partial store forwarding with type conversion. Standard LLVM DSE bails when store and load sizes differ. NVIDIA's version creates GEP + load sequences to extract relevant portions, handling float4->float patterns.
- 72-byte hash table entries with cross-store tracking. Upstream uses simpler data structures. NVIDIA tracks which stores partially overwrite each other through 6-element dependency records.
- Store-chain decomposition. Aggregate stores are decomposed through struct/array GEPs into element-level checks, enabling elimination of stores that are collectively dead.
- Vector type awareness. The type walker includes a dedicated case for CUDA vector types with proper alignment computation.
- Total code size. At ~91 KB across three functions, NVIDIA's DSE is roughly 3x the size of upstream LLVM's equivalent.