Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Base Address Strength Reduction

Address computation is a disproportionately expensive category of work on NVIDIA GPUs. The integer ALU units that compute memory addresses are a scarce resource relative to the FP/tensor throughput the hardware is designed to maximize. A typical unrolled loop body touching four arrays at A[tid + i], B[tid + i], C[tid + i], D[tid + i] -- where tid is a function of threadIdx.x, blockIdx.x, and blockDim.x -- may emit four independent 64-bit multiply-add chains per iteration, each recomputing the same base expression base_ptr + tid * element_size. Reducing those four chains to one base computation plus three cheap constant-offset additions can halve the integer instruction count in the loop body and free address registers that would otherwise stay live across the entire loop.

Base Address Strength Reduction (BASR) is an NVIDIA-proprietary IR-level pass that performs exactly this transformation. It scans loop bodies for memory operations that share a common base pointer expression, finds the one with the minimum constant offset (the "anchor"), hoists the anchor computation, and rewrites all remaining addresses as (anchor + relative_offset). The pass is confirmed by the string "BaseAddressStrengthReduce" at decompiled line 457 of sub_1C67780.

Key Facts

PropertyValue
Pass nameBaseAddressStrengthReduce
Entry pointsub_1C67780 (Legacy PM), sub_2CA4A10 (New PM)
Binary size58 KB (~1,400 decompiled lines)
Pass typeNVIDIA-proprietary, IR-level, loop body transform
Primary knobsdo-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions)
Chain variantdo-base-address-strength-reduce-chain (separate boolean toggle)
Negative offset controldword_4FBCAE0 (aggressiveness for negative-offset patterns)
IV limitbase-address-strength-reduce-iv-limit (parametric)
Max IVbase-address-strength-reduce-max-iv (parametric)
Debug dumpdump-base-address-strength-reduce
Required analysesLoopInfo (sub_1632FA0), DataLayout
Option registrationctor_263_0 at 0x4F36F0 (shared with SCEV-CGP, 44 strings total)
Companion passCommon Base Elimination (sub_1C5DFC0)
HelperBitcast helper at sub_1C637F0 (28 KB, strings "baseValue", "bitCastEnd")

Algorithm

The pass operates in six phases, executing once per function. It processes all loop bodies simultaneously using worklists seeded from LoopInfo.

Phase 1 -- Initialization (lines 452-497)

The entry function retrieves LoopInfo via sub_1632FA0 and extracts the module's DataLayout from the function object (path: (a1+184)->field+24->field+40). It then allocates bookkeeping state:

  • Eight hash maps at stack offsets v374-v399, keyed by Value* (the base pointer). Each map entry holds a linked list of memory instructions that share that base.
  • Multiple worklists for basic blocks containing loads vs. stores.
  • Threshold: v429 = 2 -- the minimum number of uses of the same base before the pass considers strength reduction worthwhile.
  • Pass counter: v438 = 1 -- the initial pass number (the pass may iterate).

Phase 2 -- Address Pattern Collection (lines 518-600)

For each instruction in the target basic blocks (drawn from the a4 worklist):

  1. sub_1C57390 classifies the address expression, extracting its structural form.
  2. sub_1CCB2B0 computes alignment information from the DataLayout.
  3. sub_1456040 extracts the base pointer from the address expression.

The base pointer is then categorized into one of two buckets:

CategoryConditionHash mapWorklistDescription
Non-pointer-type basetype_id != 15v382v363Integer/GEP-derived base addresses
Pointer-type basetype_id == 15v378v360Bases that are raw pointers to globals

For pointer-type bases, sub_1CCDC20 further extracts the underlying global variable, allowing grouping of addresses to the same global even when accessed through different local pointer variables.

Hash map insertion uses sub_1C50900. If the base pointer is new (not yet in the map), the instruction list is initialized and the base is appended to the corresponding worklist. Otherwise, the instruction is appended to the existing list for that base.

for each instruction I in target BBs:
    addr_info = classify_address(I)          // sub_1C57390
    alignment = compute_alignment(addr_info)  // sub_1CCB2B0
    base_ptr  = extract_base(addr_info)       // sub_1456040

    if type_of(base_ptr) != POINTER_TYPE:
        map_insert(hash_map_v382, base_ptr, I)    // sub_1C50900
        if is_new_entry:
            worklist_v363.append(base_ptr)
    else:
        global = extract_global(base_ptr)         // sub_1CCDC20
        map_insert(hash_map_v378, global, I)
        if is_new_entry:
            worklist_v360.append(global)

Phase 3 -- Anchor Finding (lines 430-470)

For each base pointer that has accumulated at least v429 (2) uses, the pass determines the "anchor" -- the use with the minimum constant offset. This is the instruction whose address computation will be hoisted and shared.

For each candidate base:

  1. sub_1C53170 decomposes each address expression into a (base, constant_offset) pair.
  2. The pass iterates over all uses and finds the one with the smallest constant offset:
    • For offsets that fit in 64 bits: direct integer comparison via sign-extended values.
    • For offsets wider than 64 bits: reads from extended-precision word arrays and compares word-by-word.
  3. The minimum-offset use becomes the anchor.
function find_anchor(base_ptr, use_list):
    min_offset = +INF
    anchor = null

    for each use U in use_list:
        (base, offset) = decompose_address(U)  // sub_1C53170

        if bit_width(offset) <= 64:
            val = sign_extend_64(offset)
        else:
            val = read_extended_precision(offset)

        if val < min_offset:
            min_offset = val
            anchor = U

    return (anchor, min_offset)

Phase 4 -- Address Rewriting (lines 578-600)

Once the anchor is identified:

  1. sub_13A5B00 creates a new base address instruction from the anchor's address computation. This instruction is placed at the loop preheader or the dominating point of all uses.
  2. For every other instruction sharing the same base, the pass computes the relative offset: relative_offset = original_offset - anchor_offset.
  3. sub_14806B0 creates a new address expression (new_base + relative_offset) and replaces the original address operand.
function rewrite_addresses(anchor, anchor_offset, use_list):
    new_base = create_base_instruction(anchor)  // sub_13A5B00

    for each use U in use_list:
        if U == anchor:
            replace_address(U, new_base)
        else:
            (_, orig_offset) = decompose_address(U)
            rel_offset = orig_offset - anchor_offset
            new_addr = create_offset_add(new_base, rel_offset)  // sub_14806B0
            replace_address(U, new_addr)

After this transformation, a loop body that previously contained:

load (base + tid*stride + 0)    // original: full GEP chain
load (base + tid*stride + 16)   // original: full GEP chain
store (base + tid*stride + 32)  // original: full GEP chain
store (base + tid*stride + 48)  // original: full GEP chain

Becomes:

anchor = base + tid*stride + 0  // hoisted once
load anchor                     // offset 0: use anchor directly
load (anchor + 16)              // cheap add
store (anchor + 32)             // cheap add
store (anchor + 48)             // cheap add

The three 64-bit multiply-add chains are replaced by three 64-bit immediate additions.

Phase 5 -- Negative Offset Handling (lines 512-520)

When dword_4FBCAE0 > 1 (the aggressiveness knob is set above default), the pass also considers address groups where the maximum offset has a negative sign bit. These represent patterns like:

load (base + tid*stride - 32)
load (base + tid*stride + 0)
load (base + tid*stride + 32)

Without this phase, the anchor would be the instruction at offset -32, producing negative relative offsets for the first use. Some hardware addressing modes handle negative offsets less efficiently, so this phase is gated separately.

For negative-offset candidates, the pass:

  1. Checks whether the base is loop-invariant via sub_1C51340.
  2. If loop-invariant, creates a separate common base via sub_1C55CE0 that absorbs the negative component.

Phase 6 -- Red-Black Tree Tracking

The pass uses a red-black tree infrastructure (sub_220F040 for insertion, sub_220EF80 for lookup) shared with other NVIDIA passes. This provides O(log n) sorted-set operations for maintaining collections of instruction pointers and efficiently checking membership during the rewriting phase.

Hash Map Implementation

The address pattern hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). The resize/rehash logic lives in sub_1C54050 -- the same function used by Common Base Elimination. Hash keys are Value* pointers with linear probing. See Hash Table and Collection Infrastructure for the hash function and probing strategy.

Relationship with Common Base Elimination

BASR and Common Base Elimination (sub_1C5DFC0) attack the same problem -- redundant address computation -- but at different scopes and with different strategies:

DimensionBase Address Strength ReductionCommon Base Elimination
ScopeIntra-loop: operates within a single loop bodyInter-block: operates across the CFG using dominance
GroupingGroups addresses by shared induction-variable-based baseGroups addresses by shared base pointer to the same global
PlacementAnchor placed at loop preheaderAnchor placed at common dominator of all uses
Offset modelConstant offsets relative to IV-derived baseConstant offsets relative to global-derived base
Entry pointsub_1C67780sub_1C5DFC0
Size58 KB38 KB

The two-pass approach is deliberate. Common Base Elimination runs first at the IR level, hoisting shared base expressions across control flow boundaries. BASR then runs within loop bodies, strength-reducing the IV-dependent address chains that CBE cannot handle because the IV changes each iteration.

Both passes share the same address decomposition helper (sub_1C53170), the same hash map infrastructure (sub_1C50900, sub_1C54050), and the same instruction creation routines (sub_13A5B00, sub_14806B0).

Relationship with SCEV-CGP

The BASR knobs are registered together with SCEV-CGP (Scalar-Evolution-based CodeGenPrepare) in ctor_263_0 at 0x4F36F0. This constructor registers 44 option strings total, covering both SCEV-CGP and BASR. The do-base-address-strength-reduce and do-scev-cgp knobs are stored in the same ctor_526_0 option block.

SCEV-CGP is a broader pass that performs SCEV-based address optimization using thread ID as an induction variable (scev-cgp-tid-max-value controls the maximum thread ID value for analysis). BASR is a sub-transformation within this address optimization framework -- it handles the specific case of multiple memory operations sharing a base, while SCEV-CGP handles the broader case of rewriting address expressions using scalar evolution.

Related SCEV-CGP knobs that interact with BASR:

KnobPurpose
scev-cgp-old-baseControls whether SCEV-CGP creates new base expressions
ignore-bad-baseBypasses validity checks on base pointer classification
ignore-32-bit-overflowSkips 32-bit overflow checks in address arithmetic
ignore-signed-32-bit-overflowSkips signed 32-bit overflow checks
topo-sort-beginControls topological sort start point for address chains
special-reassociate-for-threadidPrevents reassociation from moving threadId-dependent expressions

Configuration

Boolean Knobs

KnobDefaultDescription
do-base-address-strength-reduceEnabled (level 2)Master enable. Level 1 = unconditional; level 2 = with conditions (default). 0 = disabled.
do-base-address-strength-reduce-chainEnabledEnables the chain variant, which strength-reduces chains of dependent address computations
dump-base-address-strength-reducefalsePrints diagnostic output when set

Parametric Knobs

KnobDescription
base-address-strength-reduce-iv-limitMaximum number of induction variables to consider per loop
base-address-strength-reduce-max-ivMaximum IV value for strength reduction eligibility

Global Variables

GlobalPurpose
dword_4FBCAE0Negative offset aggressiveness. When > 1, enables strength reduction of address groups with negative offsets. Also used as a special minimum-selection mode in MemorySpaceOpt.

Diagnostic Strings

"BaseAddressStrengthReduce"   -- Pass identification (line 457)
"baseValue"                   -- Bitcast helper: base value operand name (sub_1C637F0)
"bitCastEnd"                  -- Bitcast helper: end-of-chain marker (sub_1C637F0)

When dump-base-address-strength-reduce is enabled, the pass emits additional diagnostic output showing which base pointers were grouped, which anchor was selected, and which addresses were rewritten.

Key Functions

FunctionAddress (Legacy)SizeRole
Main entrysub_1C6778058 KBPass driver: initialization, collection, anchor finding, rewriting
Main entry (New PM)sub_2CA4A1062 KBNew Pass Manager variant
Address classifiersub_1C57390--Classifies address expression structure
Address decomposersub_1C53170--Decomposes address into (base, constant_offset) pairs
Hash map insertsub_1C50900--Inserts base pointer into pattern hash map
Hash map resizesub_1C54050--Load-factor-based resize/rehash
Loop invariance checksub_1C51340--Tests whether a value is loop-invariant
Negative offset handlersub_1C55CE0--Creates common base for negative-offset patterns
Base instruction creationsub_13A5B00--Creates the hoisted anchor address instruction
Offset rewritingsub_14806B0--Creates (base + relative_offset) replacement
Base extractionsub_1456040--Extracts base pointer from address expression
Global extractionsub_1CCDC20--Extracts underlying global variable from pointer chains
Alignment computationsub_1CCB2B0--Computes alignment from DataLayout
Bitcast helpersub_1C637F028 KBHandles bitcast chains in base address expressions
RB-tree insertsub_220F040--Red-black tree insertion (shared infrastructure)
RB-tree lookupsub_220EF80--Red-black tree membership check
LoopInfo retrievalsub_1632FA0--Gets LoopInfo analysis for the function

Cross-References