Base Address Strength Reduction
Address computation is a disproportionately expensive category of work on NVIDIA GPUs. The integer ALU units that compute memory addresses are a scarce resource relative to the FP/tensor throughput the hardware is designed to maximize. A typical unrolled loop body touching four arrays at A[tid + i], B[tid + i], C[tid + i], D[tid + i] -- where tid is a function of threadIdx.x, blockIdx.x, and blockDim.x -- may emit four independent 64-bit multiply-add chains per iteration, each recomputing the same base expression base_ptr + tid * element_size. Reducing those four chains to one base computation plus three cheap constant-offset additions can halve the integer instruction count in the loop body and free address registers that would otherwise stay live across the entire loop.
Base Address Strength Reduction (BASR) is an NVIDIA-proprietary IR-level pass that performs exactly this transformation. It scans loop bodies for memory operations that share a common base pointer expression, finds the one with the minimum constant offset (the "anchor"), hoists the anchor computation, and rewrites all remaining addresses as (anchor + relative_offset). The pass is confirmed by the string "BaseAddressStrengthReduce" at decompiled line 457 of sub_1C67780.
Key Facts
| Property | Value |
|---|---|
| Pass name | BaseAddressStrengthReduce |
| Entry point | sub_1C67780 (Legacy PM), sub_2CA4A10 (New PM) |
| Binary size | 58 KB (~1,400 decompiled lines) |
| Pass type | NVIDIA-proprietary, IR-level, loop body transform |
| Primary knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
| Chain variant | do-base-address-strength-reduce-chain (separate boolean toggle) |
| Negative offset control | dword_4FBCAE0 (aggressiveness for negative-offset patterns) |
| IV limit | base-address-strength-reduce-iv-limit (parametric) |
| Max IV | base-address-strength-reduce-max-iv (parametric) |
| Debug dump | dump-base-address-strength-reduce |
| Required analyses | LoopInfo (sub_1632FA0), DataLayout |
| Option registration | ctor_263_0 at 0x4F36F0 (shared with SCEV-CGP, 44 strings total) |
| Companion pass | Common Base Elimination (sub_1C5DFC0) |
| Helper | Bitcast helper at sub_1C637F0 (28 KB, strings "baseValue", "bitCastEnd") |
Algorithm
The pass operates in six phases, executing once per function. It processes all loop bodies simultaneously using worklists seeded from LoopInfo.
Phase 1 -- Initialization (lines 452-497)
The entry function retrieves LoopInfo via sub_1632FA0 and extracts the module's DataLayout from the function object (path: (a1+184)->field+24->field+40). It then allocates bookkeeping state:
- Eight hash maps at stack offsets
v374-v399, keyed byValue*(the base pointer). Each map entry holds a linked list of memory instructions that share that base. - Multiple worklists for basic blocks containing loads vs. stores.
- Threshold:
v429 = 2-- the minimum number of uses of the same base before the pass considers strength reduction worthwhile. - Pass counter:
v438 = 1-- the initial pass number (the pass may iterate).
Phase 2 -- Address Pattern Collection (lines 518-600)
For each instruction in the target basic blocks (drawn from the a4 worklist):
sub_1C57390classifies the address expression, extracting its structural form.sub_1CCB2B0computes alignment information from the DataLayout.sub_1456040extracts the base pointer from the address expression.
The base pointer is then categorized into one of two buckets:
| Category | Condition | Hash map | Worklist | Description |
|---|---|---|---|---|
| Non-pointer-type base | type_id != 15 | v382 | v363 | Integer/GEP-derived base addresses |
| Pointer-type base | type_id == 15 | v378 | v360 | Bases that are raw pointers to globals |
For pointer-type bases, sub_1CCDC20 further extracts the underlying global variable, allowing grouping of addresses to the same global even when accessed through different local pointer variables.
Hash map insertion uses sub_1C50900. If the base pointer is new (not yet in the map), the instruction list is initialized and the base is appended to the corresponding worklist. Otherwise, the instruction is appended to the existing list for that base.
for each instruction I in target BBs:
addr_info = classify_address(I) // sub_1C57390
alignment = compute_alignment(addr_info) // sub_1CCB2B0
base_ptr = extract_base(addr_info) // sub_1456040
if type_of(base_ptr) != POINTER_TYPE:
map_insert(hash_map_v382, base_ptr, I) // sub_1C50900
if is_new_entry:
worklist_v363.append(base_ptr)
else:
global = extract_global(base_ptr) // sub_1CCDC20
map_insert(hash_map_v378, global, I)
if is_new_entry:
worklist_v360.append(global)
Phase 3 -- Anchor Finding (lines 430-470)
For each base pointer that has accumulated at least v429 (2) uses, the pass determines the "anchor" -- the use with the minimum constant offset. This is the instruction whose address computation will be hoisted and shared.
For each candidate base:
sub_1C53170decomposes each address expression into a(base, constant_offset)pair.- The pass iterates over all uses and finds the one with the smallest constant offset:
- For offsets that fit in 64 bits: direct integer comparison via sign-extended values.
- For offsets wider than 64 bits: reads from extended-precision word arrays and compares word-by-word.
- The minimum-offset use becomes the anchor.
function find_anchor(base_ptr, use_list):
min_offset = +INF
anchor = null
for each use U in use_list:
(base, offset) = decompose_address(U) // sub_1C53170
if bit_width(offset) <= 64:
val = sign_extend_64(offset)
else:
val = read_extended_precision(offset)
if val < min_offset:
min_offset = val
anchor = U
return (anchor, min_offset)
Phase 4 -- Address Rewriting (lines 578-600)
Once the anchor is identified:
sub_13A5B00creates a new base address instruction from the anchor's address computation. This instruction is placed at the loop preheader or the dominating point of all uses.- For every other instruction sharing the same base, the pass computes the relative offset:
relative_offset = original_offset - anchor_offset. sub_14806B0creates a new address expression(new_base + relative_offset)and replaces the original address operand.
function rewrite_addresses(anchor, anchor_offset, use_list):
new_base = create_base_instruction(anchor) // sub_13A5B00
for each use U in use_list:
if U == anchor:
replace_address(U, new_base)
else:
(_, orig_offset) = decompose_address(U)
rel_offset = orig_offset - anchor_offset
new_addr = create_offset_add(new_base, rel_offset) // sub_14806B0
replace_address(U, new_addr)
After this transformation, a loop body that previously contained:
load (base + tid*stride + 0) // original: full GEP chain
load (base + tid*stride + 16) // original: full GEP chain
store (base + tid*stride + 32) // original: full GEP chain
store (base + tid*stride + 48) // original: full GEP chain
Becomes:
anchor = base + tid*stride + 0 // hoisted once
load anchor // offset 0: use anchor directly
load (anchor + 16) // cheap add
store (anchor + 32) // cheap add
store (anchor + 48) // cheap add
The three 64-bit multiply-add chains are replaced by three 64-bit immediate additions.
Phase 5 -- Negative Offset Handling (lines 512-520)
When dword_4FBCAE0 > 1 (the aggressiveness knob is set above default), the pass also considers address groups where the maximum offset has a negative sign bit. These represent patterns like:
load (base + tid*stride - 32)
load (base + tid*stride + 0)
load (base + tid*stride + 32)
Without this phase, the anchor would be the instruction at offset -32, producing negative relative offsets for the first use. Some hardware addressing modes handle negative offsets less efficiently, so this phase is gated separately.
For negative-offset candidates, the pass:
- Checks whether the base is loop-invariant via
sub_1C51340. - If loop-invariant, creates a separate common base via
sub_1C55CE0that absorbs the negative component.
Phase 6 -- Red-Black Tree Tracking
The pass uses a red-black tree infrastructure (sub_220F040 for insertion, sub_220EF80 for lookup) shared with other NVIDIA passes. This provides O(log n) sorted-set operations for maintaining collections of instruction pointers and efficiently checking membership during the rewriting phase.
Hash Map Implementation
The address pattern hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). The resize/rehash logic lives in sub_1C54050 -- the same function used by Common Base Elimination. Hash keys are Value* pointers with linear probing. See Hash Table and Collection Infrastructure for the hash function and probing strategy.
Relationship with Common Base Elimination
BASR and Common Base Elimination (sub_1C5DFC0) attack the same problem -- redundant address computation -- but at different scopes and with different strategies:
| Dimension | Base Address Strength Reduction | Common Base Elimination |
|---|---|---|
| Scope | Intra-loop: operates within a single loop body | Inter-block: operates across the CFG using dominance |
| Grouping | Groups addresses by shared induction-variable-based base | Groups addresses by shared base pointer to the same global |
| Placement | Anchor placed at loop preheader | Anchor placed at common dominator of all uses |
| Offset model | Constant offsets relative to IV-derived base | Constant offsets relative to global-derived base |
| Entry point | sub_1C67780 | sub_1C5DFC0 |
| Size | 58 KB | 38 KB |
The two-pass approach is deliberate. Common Base Elimination runs first at the IR level, hoisting shared base expressions across control flow boundaries. BASR then runs within loop bodies, strength-reducing the IV-dependent address chains that CBE cannot handle because the IV changes each iteration.
Both passes share the same address decomposition helper (sub_1C53170), the same hash map infrastructure (sub_1C50900, sub_1C54050), and the same instruction creation routines (sub_13A5B00, sub_14806B0).
Relationship with SCEV-CGP
The BASR knobs are registered together with SCEV-CGP (Scalar-Evolution-based CodeGenPrepare) in ctor_263_0 at 0x4F36F0. This constructor registers 44 option strings total, covering both SCEV-CGP and BASR. The do-base-address-strength-reduce and do-scev-cgp knobs are stored in the same ctor_526_0 option block.
SCEV-CGP is a broader pass that performs SCEV-based address optimization using thread ID as an induction variable (scev-cgp-tid-max-value controls the maximum thread ID value for analysis). BASR is a sub-transformation within this address optimization framework -- it handles the specific case of multiple memory operations sharing a base, while SCEV-CGP handles the broader case of rewriting address expressions using scalar evolution.
Related SCEV-CGP knobs that interact with BASR:
| Knob | Purpose |
|---|---|
scev-cgp-old-base | Controls whether SCEV-CGP creates new base expressions |
ignore-bad-base | Bypasses validity checks on base pointer classification |
ignore-32-bit-overflow | Skips 32-bit overflow checks in address arithmetic |
ignore-signed-32-bit-overflow | Skips signed 32-bit overflow checks |
topo-sort-begin | Controls topological sort start point for address chains |
special-reassociate-for-threadid | Prevents reassociation from moving threadId-dependent expressions |
Configuration
Boolean Knobs
| Knob | Default | Description |
|---|---|---|
do-base-address-strength-reduce | Enabled (level 2) | Master enable. Level 1 = unconditional; level 2 = with conditions (default). 0 = disabled. |
do-base-address-strength-reduce-chain | Enabled | Enables the chain variant, which strength-reduces chains of dependent address computations |
dump-base-address-strength-reduce | false | Prints diagnostic output when set |
Parametric Knobs
| Knob | Description |
|---|---|
base-address-strength-reduce-iv-limit | Maximum number of induction variables to consider per loop |
base-address-strength-reduce-max-iv | Maximum IV value for strength reduction eligibility |
Global Variables
| Global | Purpose |
|---|---|
dword_4FBCAE0 | Negative offset aggressiveness. When > 1, enables strength reduction of address groups with negative offsets. Also used as a special minimum-selection mode in MemorySpaceOpt. |
Diagnostic Strings
"BaseAddressStrengthReduce" -- Pass identification (line 457)
"baseValue" -- Bitcast helper: base value operand name (sub_1C637F0)
"bitCastEnd" -- Bitcast helper: end-of-chain marker (sub_1C637F0)
When dump-base-address-strength-reduce is enabled, the pass emits additional diagnostic output showing which base pointers were grouped, which anchor was selected, and which addresses were rewritten.
Key Functions
| Function | Address (Legacy) | Size | Role |
|---|---|---|---|
| Main entry | sub_1C67780 | 58 KB | Pass driver: initialization, collection, anchor finding, rewriting |
| Main entry (New PM) | sub_2CA4A10 | 62 KB | New Pass Manager variant |
| Address classifier | sub_1C57390 | -- | Classifies address expression structure |
| Address decomposer | sub_1C53170 | -- | Decomposes address into (base, constant_offset) pairs |
| Hash map insert | sub_1C50900 | -- | Inserts base pointer into pattern hash map |
| Hash map resize | sub_1C54050 | -- | Load-factor-based resize/rehash |
| Loop invariance check | sub_1C51340 | -- | Tests whether a value is loop-invariant |
| Negative offset handler | sub_1C55CE0 | -- | Creates common base for negative-offset patterns |
| Base instruction creation | sub_13A5B00 | -- | Creates the hoisted anchor address instruction |
| Offset rewriting | sub_14806B0 | -- | Creates (base + relative_offset) replacement |
| Base extraction | sub_1456040 | -- | Extracts base pointer from address expression |
| Global extraction | sub_1CCDC20 | -- | Extracts underlying global variable from pointer chains |
| Alignment computation | sub_1CCB2B0 | -- | Computes alignment from DataLayout |
| Bitcast helper | sub_1C637F0 | 28 KB | Handles bitcast chains in base address expressions |
| RB-tree insert | sub_220F040 | -- | Red-black tree insertion (shared infrastructure) |
| RB-tree lookup | sub_220EF80 | -- | Red-black tree membership check |
| LoopInfo retrieval | sub_1632FA0 | -- | Gets LoopInfo analysis for the function |
Cross-References
- Common Base Elimination -- the complementary inter-block pass
- Pass Overview & Inventory -- master pass listing
- Optimizer Pipeline -- pipeline position and option registration
- Rematerialization -- another pass trading computation for register pressure
- SCEV -- the scalar evolution analysis that SCEV-CGP (and indirectly BASR) depends on