InstCombine

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/InstCombine/InstructionCombining.cpp, llvm/lib/Transforms/InstCombine/InstCombine*.cpp (LLVM 20.0.0). The upstream is split across ~15 files by instruction category; cicc inlines them into a single monolithic visitor.

NVIDIA's InstCombine in CICC v13.0 is approximately twice the size of upstream LLVM's, weighing in at roughly 405 KB for the main visitor alone. The monolithic visitor function at sub_10EE7A0 dispatches across 80 unique opcode cases through a three-level switch structure, handling standard LLVM instructions, NVIDIA-extended vector and FMA operations, and three high-opcode NVVM intrinsic dead-code elimination patterns. A separate 87 KB intrinsic folding function (sub_1169C30) handles NVVM-specific canonicalization, and a 127 KB computeKnownBits implementation (sub_11A7600) provides the dataflow backbone. This page covers the visitor architecture, the per-instruction-type visitors recovered from the binary, and the NVIDIA-specific extensions that distinguish this implementation from upstream.

| Field | Value |
|---|---|
| Registration | New PM #398, parameterized: no-aggressive-aggregate-splitting;...;max-iterations=N |
| Runtime positions | Tier 0 #28 (via sub_19401A0); Tier 1/2/3 #42 (gated by !opts[1000]); see Pipeline |
| Main visitor | sub_10EE7A0 (0x10EE7A0, ~405 KB, 9,258 lines) |
| Intrinsic folding | sub_1169C30 (0x1169C30, ~87 KB, 2,268 lines) |
| computeKnownBits | sub_11A7600 (0x11A7600, ~127 KB, 4,156 lines) |
| SimplifyDemandedBits | sub_11AE870 / sub_11AE3E0 (wrapper + hash table) |
| Opcode cases | 80 unique case labels across 3 switch blocks |
| NVIDIA extra size | ~200 KB beyond upstream (~87 KB intrinsic fold + ~113 KB expanded cases) |

Visitor Architecture

The main visitor sub_10EE7A0 receives an NVVM IR node pointer (__m128i* a2) and attempts to simplify it. A persistent local v1612 aliases the instruction being visited. The function has four structural regions:

Preamble (lines ~1760--2000) performs pre-dispatch checks: validating call-site attributes (opcode 41 for bitwise-assert), handling ternary FMA instructions (opcodes 238--245), checking for constant-foldable select patterns, canonicalizing operand ordering (constant to RHS), and running SimplifyDemandedBits via sub_11A3F30 on the result type.

Opcode dispatch reads the NVVM opcode via sub_987FE0 (getOpcode) and uses a three-level switch:

| Switch Level | Opcode Range | Description |
|---|---|---|
| Level 1 | 0x99--0x2A5 (main) | Standard LLVM instructions (GEP, select, stores, casts, compares, calls, vectors) |
| Level 2 | 0x01--0x42 (low) | Binary operations, casts, early comparisons |
| Level 3 | > 0x13CF (high) | NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) |

Additional if-else chains handle intermediate ranges: opcodes 0xC7E (3198), 0x2E2 (738), 0x827 (2087), 0x2CC (716), 0xE07--0xE08, 0xE4F--0xE51, 0x13C6--0x13C7, and 0x13CD--0x13CE.

The fallback path at LABEL_95 calls sub_F0C430 for generic simplification. The no-change return path at LABEL_155 is referenced 101 times throughout the function.
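The overall dispatch shape can be sketched as follows. This is a simplified model: opcode boundaries follow the table above, but the handler bodies, the intermediate if-else ranges, and the function name are placeholders, not recovered code.

```c
#include <stdint.h>

typedef enum { NOT_HANDLED, HANDLED } DispatchResult;

/* Hypothetical sketch of the three-level opcode dispatch. The
 * intermediate ranges (0xC7E, 0x2E2, etc.) handled by separate
 * if-else chains are omitted here for brevity. */
static DispatchResult dispatch(uint32_t opcode) {
    if (opcode > 0x13CF)                    /* Level 3: NVIDIA intrinsic IDs */
        return (opcode == 9549 || opcode == 9553 || opcode == 9567)
                   ? HANDLED : NOT_HANDLED;
    if (opcode >= 0x99 && opcode <= 0x2A5)  /* Level 1: standard LLVM insts */
        return HANDLED;
    if (opcode >= 0x01 && opcode <= 0x42)   /* Level 2: binops, casts, cmps */
        return HANDLED;
    return NOT_HANDLED;                     /* falls through to LABEL_95 */
}
```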

Per-Instruction Visitors

Each major instruction type is handled by a dedicated visitor function called from the main dispatch. The following table summarizes the recovered visitors with their sizes and key characteristics.

visitBinaryOperator -- sub_10D8BB0

| Field | Value |
|---|---|
| Address | 0x10D8BB0 (102 KB, 2,078 lines) |
| Dispatch case | 0x3A in the master dispatcher |
| Sibling cases | 0x39 (NSW/NUW-focused), 0x3B (associative/commutative) |

This is the second-largest visitor. It implements approximately 25 cascading simplification phases for all binary arithmetic (Add, Sub, Mul, Div, Rem, Shl, LShr, AShr, And, Or, Xor, and their floating-point counterparts). The phases execute in a strict try-and-return order:

Phase 0 runs quick exits: pattern-matched constant fold (sub_101E960), SimplifyBinOp (sub_F29CA0), algebraic identities (sub_F0F270), NSW/NUW simplification (sub_F11DB0), and critically the NVIDIA-specific intrinsic handler sub_11AE870 which runs before any standard LLVM folds.

Phases 1--9 handle associative/commutative factoring, cross-operand Mul-of-Add matching, delegated simplification, overflow detection, and multiply-shift strength reduction. Phase 5 detects multiply-by-power-of-2 and converts to shift; sub_10BA120 builds the full strength reduction for patterns like x * (2^n + 1) into (x << n) + x.
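The Phase 5 rewrite attributed to sub_10BA120 can be illustrated with a small model; the function name and IR mechanics are as recovered, but this scalar version only demonstrates the arithmetic identity being exploited.

```c
#include <stdint.h>

/* x * (2^n + 1) rewritten as (x << n) + x: one shift and one add
 * replace a multiply. Wrap-around behavior matches unsigned multiply. */
static uint64_t mul_pow2_plus_1(uint64_t x, unsigned n) {
    return (x << n) + x;   /* equivalent to x * ((1ull << n) + 1) */
}
```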

Phases 10--25 cover Add-of-Mul factoring, shift chains, linear expression folding, subtraction of multiplied constants, demanded-bits masking, reciprocal elimination, overflow intrinsic decomposition, and division/remainder folding. The division constant folder uses sub_C46BD0 (APInt::udiv), sub_C499A0 (APInt::urem), sub_C45F70 (APInt::sdiv), and sub_C49AB0 (APInt::srem).

Four template-instantiated helpers at sub_10D2680--sub_10D2D70 (2,767 bytes each, identical structure) implement matchBinOpReduction parameterized by NVVM intrinsic ID (329, 330, 365, 366) and acceptable opcode range. These detect NVVM horizontal reduction intrinsics (e.g., horizontal add/mul across vector lanes) and simplify them to scalar binary operations.

visitICmpInst -- sub_1136650 + sub_113CA70

| Field | Value |
|---|---|
| Comprehensive folder | sub_1136650 (0x1136650, 149 KB, 3,697 lines) |
| Per-opcode dispatch | sub_113CA70 (0x113CA70) -- 12 case labels |

The ICmp folder is the single largest function in InstCombine. It runs before the per-opcode dispatch table and handles 15 major fold categories: all-ones/sign-bit constant folds, Mul-with-constant strength reduction (NUW-gated), nested Mul decomposition, common sub-operand cancellation, NUW/NSW flag-gated predicate conversion, known-nonnegativity folds, ConstantRange intersection, shared sub-operand elimination, Sub sign-bit analysis, min/max pattern recognition, computeKnownBits sign-bit analysis, power-of-2 optimizations, remainder pattern matching, XOR/shift decomposition, and Or/And decomposition with type width folding.

NVVM uses a custom predicate encoding stored at ICmpInst+2 as a 6-bit field (*(_WORD*)(inst+2) & 0x3F):

| Value | Predicate | Value | Predicate |
|---|---|---|---|
| 32 | EQ | 33 | NE |
| 34 | UGT | 35 | UGE |
| 36 | ULT | 37 | ULE |
| 38 | SGT | 39 | SGE |
| 40 | SLT | 41 | SLE |
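Extracting the predicate from the 16-bit word at instruction offset +2 is a simple mask, per the layout described above; the enum names below are illustrative, not recovered symbols.

```c
#include <stdint.h>

/* NVVM predicate encoding: low 6 bits of *(uint16_t*)(inst+2). */
enum NvvmPred { PRED_EQ = 32, PRED_NE = 33, PRED_UGT = 34, PRED_UGE = 35,
                PRED_ULT = 36, PRED_ULE = 37, PRED_SGT = 38, PRED_SGE = 39,
                PRED_SLT = 40, PRED_SLE = 41 };

static unsigned decode_pred(uint16_t word_at_offset_2) {
    return word_at_offset_2 & 0x3F;  /* 6-bit predicate field */
}
```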

The per-opcode dispatch at sub_113CA70 routes based on the non-constant operand's opcode tag:

| Tag | Instruction | Handler | Size |
|---|---|---|---|
| * (42) | Mul | sub_1128290 | 1,178 lines |
| , (44) | Add | sub_1119FB0 | 413 lines |
| . (46) | Trunc | sub_1115510 | -- |
| 0 (48) | SExt | sub_11164F0 | -- |
| 1 (49) | ZExt | sub_1122A30 | -- |
| 4 (52) | Select | sub_1115C10 | 428 lines |
| 6 (54) | And | sub_1120680 | 911 lines |
| 7 (55) | Or | sub_1126B10 | 786 lines |
| 8 (56) | Xor | sub_1126B10 | shared with Or |
| 9 (57) | Shl | sub_112C930 | 664 lines |
| : (58) | LShr | sub_1133500 | -- |
| ; (59) | Sub | sub_111CED0 + sub_113BFE0 | 519 lines |

visitCastInst -- sub_110CA10

| Field | Value |
|---|---|
| Address | 0x110CA10 (93 KB, 2,411 lines) |
| Cast chain helper | sub_110B960 (22 KB, 833 lines) |

Handles all cast simplification: same-type identity elimination, bool-to-float chains, integer-to-integer narrowing/widening, FP-to-int special cases, FP narrowing, cast-through-select/PHI, and the major cast-of-cast chain folding. The helper sub_110B960 implements deep cast chain folding for aggregate types using a worklist with a DenseMap for O(1) deduplication, preventing exponential blowup on diamond-shaped use-def graphs. The function is conservative about side effects: sub_B46500 (isVolatile) is called before every fold.

visitSelectInst -- sub_1012FB0

| Field | Value |
|---|---|
| Address | 0x1012FB0 (74 KB, 1,801 lines) |
| Local variables | 190 total |

Implements 18 prioritized select simplifications: constant fold, undef arm elimination, both-same identity, PHI-through-select, KnownBits sign analysis, ConstantRange analysis, full-range analysis, KnownBits cross-validation, ICmpInst arm synthesis, ExtractValue decomposition, implied condition, canonicalization (delegated to sub_1015760, 27 KB), min/max pattern detection (smin/smax/umin/umax/abs/nabs via four helpers), select-in-comparison chains, PHI-select worklist scan (DenseMap with hash (ptr >> 9) ^ (ptr >> 4)), ValueTracking classification, pointer-null folding, and load/trunc delegation.

visitPHINode -- sub_1175E90

| Field | Value |
|---|---|
| Address | 0x1175E90 (~57 KB, ~2,130 lines) |

Implements 16 PHI optimization strategies tried in sequence: SimplifyInstruction constant fold, foldPHIArgOpIntoPHI (binary/cast with one varying operand), foldPHIArgConstantOp, typed opcode dispatch (GEP via sub_1172510, InsertValue, ExtractValue, CmpInst, BinOp/Cast), GEP incoming deduplication with loop back-edge analysis, single-use PHI user check, GEP-of-PHI transform (sub_1174BB0, 1,033 lines), phi-cycle escape detection, trivial PHI elimination (all-same non-PHI value), recursive PHI cycle resolution (sub_116D410), operand reordering canonicalization, identical-PHI-in-block deduplication, pointer-type struct GEP optimization, all-undef incoming check, and dominator-tree GEP index hoisting using two DenseMaps.

visitCallInst -- sub_1162F40

| Field | Value |
|---|---|
| Address | 0x1162F40 (50 KB, 1,647 lines) |

Processes calls through a 15-step cascade: LibCall simplification (sub_100A740), standard intrinsic folding (sub_F0F270), return attribute analysis (sub_F11DB0), overflow/saturating arithmetic (sub_115C220), inline mul-by-constant folding, generic call combining (sub_115A080), FMA/fneg/fsub canonicalization (the largest block, requiring all of nnan+ninf+nsz+arcp+reassoc on both call and function), constant-argument intrinsic folding, unary intrinsic constant folding, exp/log pair detection (IDs 325 and 63), sqrt/rsqrt folding (IDs 284, 285), min/max folding (IDs 88, 90), nested intrinsic composition, division-to-reciprocal-multiply, and finally the NVIDIA-specific sub_115A4C0 which dispatches to the 87 KB intrinsic folding table.

visitLoadInst -- sub_1152CF0

| Field | Value |
|---|---|
| Address | 0x1152CF0 (~68 KB, ~1,680 lines) |
| Stack frame | 0x4F0 (1,264 bytes) |

Four major paths: constant-address fold (loads from known constant pointers with types <= 64 bits are replaced via symbol table lookup using sub_BCD420), address-space-based elimination (loads from non-AS(32) pointers are replaced with constants, exploiting CUDA's read-only address spaces), the main store-to-load forwarding worklist (BFS over the def-use graph following GEPs, PHIs, and bitcasts, depth-limited by global qword_4F90528), and dominator-based forwarding for non-pointer loads. Alignment is propagated as the maximum of source and destination, with the volatile bit carefully preserved through the *(node+2) 16-bit field (bits [5:0] = log2(alignment), bit [6] = volatile flag).
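The packed 16-bit field at *(node+2) can be modeled directly from the layout described above; the accessor names are illustrative, the bit positions are as recovered.

```c
#include <stdint.h>

/* Bits [5:0] hold log2(alignment); bit [6] is the volatile flag. */
static uint64_t alignment_of(uint16_t field) {
    return 1ull << (field & 0x3F);      /* decode log2 back to bytes */
}
static int is_volatile(uint16_t field) {
    return (field >> 6) & 1;
}
static uint16_t make_field(unsigned log2align, int vol) {
    return (uint16_t)((log2align & 0x3F) | ((vol & 1) << 6));
}
```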

NVIDIA-Specific Extensions

NVVM Intrinsic Folding -- sub_1169C30

This 87 KB function is the core of NVIDIA's additions to InstCombine. Called from the main visitor when the instruction is an NVIDIA intrinsic, it uses a two-layer dispatch:

Layer 1 (primary switch, entered when the uses-list is empty or the "fast" flag at a1+336 is set) dispatches on the node's byte-tag:

| Tag | Char | Fold Type |
|---|---|---|
| 42 | * | FNeg/negation -- pushes negation through arithmetic via the "Negator" chain |
| 55 | 7 | Vector extract from intrinsic result (full-width extract becomes identity) |
| 56 | 8 | Vector insert into intrinsic result (full-width insert becomes And mask) |
| 59 | ; | Multiply-like symmetric intrinsic (folds when one operand is known non-negative) |
| 68 | D | ZExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 69 | E | SExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 85 | U | Call-site fold for llvm.nvvm.* with specific IDs (313, 362) |
| 86 | V | Select-like intrinsic fold (dead select elimination) |

Layer 2 (depth-gated by qword_4F908A8 = instcombine-negator-max-depth) adds aggressive cases:

| Tag | Char | Fold Type |
|---|---|---|
| 46 | . | Dot product fold |
| 54 | 6 | Indexed access / extract with fold |
| 58 | : | Comparison intrinsic fold |
| 67 | C | Type conversion intrinsic fold |
| 84 | T | Tensor / multi-operand intrinsic fold |
| 90 | Z | Zero-extend intrinsic fold |
| 91 | [ | Three-operand fold (e.g., fma) |
| 92 | \ | Four-operand fold (e.g., dp4a) |
| 96 | ` | Unary special intrinsic fold |

The FNeg case (tag 42) is the most complex. It first attempts constant folding: if the operand is all-ones (-1), it creates sub(0, operand) via CSE lookup with opcode 30. When the simple fold fails, it falls through to the Negator chain at LABEL_163: sub_1168D40 collects all negatable sub-expressions, sub_1169800 attempts to fold negation into each operand, and the results are combined with sub_929C50 or sub_929DE0. This pushes negation through chains of arithmetic to find a cheaper representation, depth-gated to prevent exponential blowup. Created replacement instructions carry .neg modifier metadata for PTX emission.
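The algebraic identity the Negator chain exploits can be shown on scalars; this is a toy model only (the real chain rewrites IR nodes and is depth-gated), and the function names below are illustrative.

```c
#include <stdint.h>

/* Pushing negation into a multiply: -(x * c) == x * (-c), so the
 * standalone negation instruction disappears if -c folds to a constant. */
static int64_t neg_of_mul(int64_t x, int64_t c)       { return -(x * c); }
static int64_t mul_by_neg_const(int64_t x, int64_t c) { return x * (-c); }
```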

Three High-Opcode NVIDIA Intrinsics

Opcodes 0x254D (9549), 0x2551 (9553), and 0x255F (9567) are NVIDIA-proprietary intrinsic IDs handled directly in the main visitor. All three share the same pattern: extract the commuted-operand index via v1612->m128i_i32[1] & 0x7FFFFFF, verify the other operand has byte-tag 12 or 13 (ConstantInt/ConstantFP), query metadata via sub_10E0080 with mask 0xFFFFFFFFFFFFFFFF, and test specific bit patterns:

| Opcode | Test | Fold Condition |
|---|---|---|
| 0x2551 (9553) | ((result >> 40) & 0x1E) == 0x10 | Fold when bit pattern mismatches |
| 0x255F (9567) | (result & 0x10) != 0 | Fold when bit 4 is clear |
| 0x254D (9549) | (result & 0x200) != 0 | Fold when bit 9 is clear |

When the filter passes, the shared epilogue calls sub_F207A0(v6, v1612->m128i_i64) (eraseInstFromFunction), deleting the instruction entirely. These implement dead-code elimination for NVIDIA intrinsics with constant arguments matching known-safe-to-remove criteria.
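A sketch of the three filters, reading the table's Test column as the bail-out check (the fold fires when the test fails, matching the Fold Condition column); this interpretation and the wrapper names are assumptions, the bit patterns are as recovered.

```c
#include <stdint.h>

/* Each predicate returns nonzero when the instruction may be erased. */
static int fold_9553(uint64_t meta) { return ((meta >> 40) & 0x1E) != 0x10; }
static int fold_9567(uint64_t meta) { return (meta & 0x10) == 0; }   /* bit 4 clear */
static int fold_9549(uint64_t meta) { return (meta & 0x200) == 0; }  /* bit 9 clear */
```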

Separate Storage Assume Bundles

At lines 6557--6567 of the main visitor, the code iterates over operand bundles on llvm.assume calls (opcode 0x0B). For each bundle with a tag of exactly 16 bytes matching "separate_storage" (verified by memcmp), it calls sub_10EA360 on both bundle operands. This implements NVIDIA's separate_storage alias analysis hint, allowing InstCombine to exploit non-aliasing assumptions for pairs of pointers declared to reside in separate memory spaces.
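The tag check is a length-gated bytewise compare; "separate_storage" is exactly 16 characters, which is why the length must match before the memcmp. The helper name is illustrative.

```c
#include <string.h>

/* Match an llvm.assume operand-bundle tag against "separate_storage". */
static int is_separate_storage(const char *tag, unsigned len) {
    return len == 16 && memcmp(tag, "separate_storage", 16) == 0;
}
```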

Expanded GEP Handling

The GEP case (opcode 0x99 = 153) is significantly expanded compared to upstream. The global dword_4F901A8 controls a depth-limited chain walk for nested GEP simplification:

```c
v729 = getOperand(0);               // base pointer of the GEP
if (dword_4F901A8) {
    v730 = 0;
    do {
        if (!isConstantGEP(v729)) break;
        ++v730;
        v729 = getOperand(0, v729);  // walk up one GEP level
    } while (v730 < dword_4F901A8);
}
if (*(_BYTE*)v729 != 85)  // 85 = CallInst
    goto LABEL_155;        // bail
```

This walks backward through constant-index GEP chains up to dword_4F901A8 steps, looking for a CallInst base pointer. The knob controls how many GEP levels to look through when simplifying GEP(GEP(GEP(..., call_result))).

Ternary/FMA Support

The preamble handles 3-operand instructions (opcodes 238--245) representing fused multiply-add variants. This includes checking whether the third operand is a zero-constant, converting between FMA opcode variants (238 vs. 242), and handling address space mismatches on FMA operand types -- entirely NVIDIA-specific for CUDA's FMA intrinsics.

computeKnownBits -- sub_11A7600

The 127 KB computeKnownBits implementation dispatches on the first byte of the NVVM IR node (the type tag):

| Tag | Char | Node Type |
|---|---|---|
| 42 | * | Truncation (extracts low bits) |
| 44 | , | GEP (computes known bits through pointer arithmetic) |
| 46 | . | Comparison (known result bits) |
| 48 | 0 | Select (intersection of known bits from both arms) |
| 52 | 4 | Branch-related |
| 54 | 6 | Vector shuffle |
| 55 | 7 | Vector extract |
| 56 | 8 | Vector insert |
| 57 | 9 | PHI node (intersection across incoming values) |
| 58 | : | Comparison variant |
| 59 | ; | Invoke / call |
| 67 | C | Cast chain |
| 68 | D | Binary op path 1 |
| 69 | E | Binary op path 2 |
| 85 | U | CallInst (sub-dispatch: 0x0F=abs, 0x42=ctpop, 0x01=bitreverse) |
| 86 | V | LoadInst |

A debug assertion at lines 2204--2212 fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results, printing both APInt values and calling abort(). This invariant check (known_zero & known_one == 0, plus consistency with the demanded mask) is compiled in for debug/checked builds.

SimplifyDemandedBits -- sub_11AE870

The wrapper sub_11AE870 gets the bit-width via sub_BCB060 (or sub_AE43A0 for non-integer types), allocates two APInts sized to the width, delegates to sub_11AE3E0, and frees any heap-allocated storage. The core implementation at sub_11AE3E0 (235 lines) calls computeKnownBits, then if the instruction was simplified, walks the use-chain and inserts each user into a hash table (open-addressing with quadratic probing, hash = (ptr >> 9) ^ (ptr >> 4)) at offset +2064 from the InstCombiner context. This "seen instructions" set prevents infinite recursion during demanded-bits propagation.
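The recovered hash and probing scheme can be sketched as a minimal seen-set; the table size, zero-sentinel, and API are assumptions for illustration, while the hash formula and quadratic probing are as described above.

```c
#include <stdint.h>

#define TABLE_SIZE 64  /* power of two so masking works as modulo */

static uintptr_t table[TABLE_SIZE];  /* 0 = empty slot, so ptr 0 is unusable */

/* Returns 1 if newly inserted, 0 if already present, -1 if full.
 * Note quadratic probing over a power-of-two table visits only a
 * subset of slots; fine for a sketch. */
static int insert_seen(uintptr_t ptr) {
    uintptr_t h = (ptr >> 9) ^ (ptr >> 4);   /* recovered hash */
    for (unsigned i = 0; i < TABLE_SIZE; ++i) {
        unsigned slot = (unsigned)((h + (uintptr_t)i * i) & (TABLE_SIZE - 1));
        if (table[slot] == ptr) return 0;                     /* seen before */
        if (table[slot] == 0)  { table[slot] = ptr; return 1; } /* inserted */
    }
    return -1;
}
```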

Configuration Knobs

| Global | CLI Flag | Default | Used In |
|---|---|---|---|
| dword_4F901A8 | (GEP chain look-through depth) | unknown | GEP handler (case 0x99) |
| qword_4F908A8 | instcombine-negator-max-depth | -1 | sub_1169C30 (depth gate) |
| qword_4F90988 | instcombine-negator-enabled | 1 | ctor_090 |
| qword_4F8B4C0 | instcombine-split-gep-chain | -- | ctor_068 |
| qword_4F8B340 | instcombine-canonicalize-geps-i8 | -- | ctor_068 |
| qword_4F909E0 | instcombine-max-num-phis | -- | ctor_091 |
| qword_4F90120 | instcombine-guard-widening-window | 3 | ctor_087 |
| qword_4F90528 | (load forwarding search depth) | -- | sub_1152CF0 |

Key Helper Functions

| Address | Recovered Name | Purpose |
|---|---|---|
| sub_987FE0 | getOpcode() | Reads NVVM opcode from IR node |
| sub_B46B10 | getOperand(idx) | Operand access |
| sub_B44E20 | eraseFromParent() | Unlink instruction |
| sub_F207A0 | eraseInstFromFunction() | Delete instruction from worklist |
| sub_F162A0 | replaceInstUsesWith() | RAUW and return replacement |
| sub_F20660 | setOperand(i, val) | Replace operand in-place |
| sub_B33BC0 | CreateBinOp() | IRBuilder binary op creation |
| sub_B504D0 | CreateBinOp(no-flags) | Binary op without flags |
| sub_B51D30 | CreateCast() | Cast instruction creation |
| sub_AD8D80 | ConstantInt::get(type, APInt) | Constant integer factory |
| sub_AD64C0 | ConstantInt::get(type, val, signed) | Constant integer factory (scalar) |
| sub_BCB060 | getScalarSizeInBits() | Type bit-width query |
| sub_10E0080 | getKnownBitsProperty() | Metadata property query |
| sub_B43CB0 | getFunction() | Get parent function |
| sub_B43CA0 | getParent() | Get parent basic block |
| sub_10A0170 | extractFlags() | Read fast-math, exact, etc. |
| sub_B44900 | isCommutative() | Check commutativity |
| sub_C444A0 | APInt::countLeadingZeros() | Bit analysis |
| sub_986760 | APInt::isZero() | Zero test |
| sub_10EA360 | recordSeparateStorageOperand() | Separate storage alias hint |

Diagnostic Strings

Diagnostic strings recovered from the InstCombine binary region. InstCombine uses assertion-style diagnostics rather than optimization remarks; the computeKnownBits consistency check is the primary runtime diagnostic.

| String | Source | Category | Trigger |
|---|---|---|---|
| "computeKnownBits(): " | sub_904010 in sub_11A7600 line ~2204 | Assertion | Debug build: computeKnownBits and SimplifyDemandedBits produce inconsistent results (prints both APInt values, then calls abort()) |
| "SimplifyDemandedBits(): " | sub_904010 in sub_11A7600 line ~2212 | Assertion | Debug build: paired with the computeKnownBits() inconsistency diagnostic above |
| "separate_storage" | Main visitor lines 6557--6567 | Bundle tag | Matched via memcmp (16 bytes) on llvm.assume operand bundles; not a user-visible diagnostic |
| "instcombine-negator-max-depth" | ctor_090 at 0x4F908A8 | Knob | Knob registration (default -1, unlimited) |
| "instcombine-negator-enabled" | ctor_090 at 0x4F90988 | Knob | Knob registration (default 1, enabled) |
| "instcombine-split-gep-chain" | ctor_068 at 0x4F8B4C0 | Knob | Knob registration |
| "instcombine-canonicalize-geps-i8" | ctor_068 at 0x4F8B340 | Knob | Knob registration |
| "instcombine-max-num-phis" | ctor_091 at 0x4F909E0 | Knob | Knob registration |
| "instcombine-guard-widening-window" | ctor_087 at 0x4F90120 | Knob | Knob registration (default 3) |

InstCombine does not emit OptimizationRemark diagnostics. The only runtime-visible diagnostic is the debug assertion that fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results (known_zero & known_one != 0, or results disagree with the demanded mask). This check is compiled into debug/checked builds only and calls abort() after printing both APInt values.

Size Contribution Estimate

| Component | Size | Description |
|---|---|---|
| Upstream visitor baseline | ~200 KB | Standard LLVM visiting ~50 instruction types |
| sub_1169C30 intrinsic folding | ~87 KB | NVVM-specific intrinsic canonicalization |
| NVVM GEP/FMA/vector cases | ~40 KB | Expanded GEP chains, ternary FMA, vector width-changing |
| separate_storage + assume | ~10 KB | Operand bundle handling for alias hints |
| High-opcode NVIDIA intrinsics | ~15 KB | DCE for opcodes 0x254D/0x2551/0x255F |
| Expanded comparator/cast | ~50 KB | Extended ICmp, cast chain, select handling |
| NVIDIA total addition | ~200 KB | Roughly doubles upstream InstCombine |

Optimization Level Behavior

| Level | Scheduled | Instances | Notes |
|---|---|---|---|
| O0 | Not run | 0 | No optimization passes |
| Ofcmax | Runs | 1 | Single instance in fast-compile pipeline |
| Ofcmid | Runs | 2 | Early + post-GVN cleanup |
| O1 | Runs | 3-4 | Early, post-SROA, post-GVN, late cleanup |
| O2 | Runs | 4-5 | Same as O1 + additional Tier 2 instance after loop passes |
| O3 | Runs | 5-6 | Same as O2 + Tier 3 instance; benefits from more aggressive inlining/unrolling |

InstCombine is the most frequently scheduled pass in the CICC pipeline. Each instance runs the full 405 KB visitor but benefits from different preceding transformations: the post-SROA instance cleans up cast chains from aggregate decomposition, the post-GVN instance simplifies expressions exposed by redundancy elimination, and the late instance performs final canonicalization before codegen. The instcombine-negator-max-depth and instcombine-negator-enabled knobs apply uniformly across all instances. Even at Ofcmax, at least one InstCombine run is considered essential for basic IR canonicalization. See Optimization Levels for pipeline tier details.

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Binary size | ~200 KB main visitor | ~405 KB main visitor + 87 KB intrinsic folding (~2x upstream) |
| NVVM intrinsic folding | No NVVM-specific intrinsic canonicalization | Dedicated 87 KB function (sub_1169C30) with two-layer dispatch for negation, vector extract/insert, FMA, tensor, dot product, and 15+ fold types |
| High-opcode DCE | Not present | Three NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) with constant-argument dead-code elimination |
| separate_storage bundles | No separate_storage operand bundle handling | Iterates llvm.assume bundles, extracting "separate_storage" hints for alias-based optimization |
| Ternary FMA opcodes | Standard llvm.fma / llvm.fmuladd folding | Extended preamble handles opcodes 238--245 for CUDA FMA variants with address-space mismatch handling |
| GEP chain look-through | Single-level GEP simplification | Depth-limited chain walk (dword_4F901A8 steps) backward through constant-index GEP chains to find CallInst base pointers |
| Horizontal reduction | Standard intrinsic-based reduction fold | Four template-instantiated matchBinOpReduction helpers for NVVM horizontal reduction intrinsics (IDs 329, 330, 365, 366) |
| KnownBits integration | Separate computeKnownBits in ValueTracking | Fused 127 KB computeKnownBits + SimplifyDemandedBits with GPU special-register range oracle |