
SROA (Scalar Replacement of Aggregates)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 SROA.cpp. Evidence: preserve-cfg / modify-cfg pipeline parser parameters match LLVM 16+ new PM integration; two-pass analysis mode (qword_50055E8) matches LLVM 17+ pre-analysis path. Core splitting algorithm is stock LLVM with no CUDA-specific modifications detected.

SROA is the single most important early-pipeline optimization for NVIDIA GPU compilation. Every alloca instruction that survives into code generation is lowered to .local memory (NVPTX address space 5) -- physically backed by device DRAM and accessed through the L1/L2 cache hierarchy. A .local access that misses L1 costs 200-400 cycles; a register read costs zero. A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA's job is to decompose aggregate allocas (structs, arrays, unions) into individual scalar SSA values that the register allocator can place in registers, eliminating the memory traffic entirely.

| Property | Value |
|---|---|
| Pass name | "sroa" |
| Pipeline parser params | preserve-cfg, modify-cfg |
| Entry function | sub_2935C30 (runOnAlloca) |
| Core function | sub_2930B90 (splitAlloca) |
| Binary footprint | ~138 KB primary (80 KB + 58 KB), ~200 KB secondary (legacy PM) |
| Binary address range | 0x2910000-0x293FFFF (178 functions) |
| Pipeline positions | Position 4 (early, after NVVMReflect) and post-sinking (late) |
| Disable flag | NVVMPassOptions offset +1400 |
| Size threshold knob | qword_50056C8 (max alloca size in bits) |
| Two-pass flag | qword_50055E8 (enables pre-analysis for new PM) |
| NVIDIA modifications | None to core algorithm |
| Upstream source | llvm/lib/Transforms/Scalar/SROA.cpp |

Why SROA Is Existential on GPU

On a CPU, an alloca that cannot be promoted to a register lives on the stack -- a cached, low-latency memory region with typical access times of 1-4 cycles. On an NVIDIA GPU there is no hardware stack cache: every surviving alloca becomes a .local allocation backed by DRAM with 200-800 cycle latency on cache miss versus zero for a register. See the GPU Execution Model memory hierarchy table for per-tier latencies.

Every alloca that survives SROA becomes a .local allocation. The NVPTX backend emits these as frame objects in the NVPTXFrameLowering::emitPrologue path, and ptxas maps them to per-thread local memory. Because occupancy is bounded by register count per SM, and .local spills effectively consume both registers (for the address) and memory bandwidth, the performance impact compounds.

The pipeline runs SROA twice: once early (position 4, immediately after NVVMReflect) to eliminate allocas before any other transform sees them, and once late (after NVVMCustomSinking2 and BreakCriticalEdges) to catch allocas created or exposed by loop unrolling, inlining, and other mid-pipeline transforms. The early invocation handles the common case (byval parameter copies, local struct variables); the late invocation cleans up whatever the loop optimizer and sinking passes left behind.

The isAllocaPromotable Fast Path

Before performing any splitting, runOnAlloca checks whether the alloca is trivially promotable via sub_B4CE70 (isAllocaPromotable). An alloca is promotable if every use is a simple load or store with no address-taken escape -- the same criterion as mem2reg. When this returns true, SROA marks the alloca for mem2reg and returns without performing any slice analysis or splitting. This fast path avoids the O(n) slice-building cost for the vast majority of CUDA local variables (scalar int, float, simple pointers), which are already simple enough for mem2reg to handle directly.
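A minimal model of that criterion, with a hypothetical `UseKind` classification standing in for the real walk over LLVM use-lists in `isAllocaPromotable`:

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of one alloca's uses; the real check walks
// llvm::Use lists inside isAllocaPromotable.
enum class UseKind { SimpleLoad, SimpleStore, VolatileAccess, AddressEscape };

// Promotable iff every use is a plain load or store and the address never
// escapes -- the same criterion as mem2reg.
bool isTriviallyPromotable(const std::vector<UseKind>& uses) {
    for (UseKind u : uses) {
        if (u != UseKind::SimpleLoad && u != UseKind::SimpleStore)
            return false;  // volatile access or address-taken escape
    }
    return true;
}
```

A scalar `float` local with one store and one load passes this test and never reaches the slice builder.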

Algorithm: runOnAlloca (sub_2935C30)

The top-level per-alloca entry point. Validates the alloca as a candidate, builds the partition/slice table, and delegates to splitAlloca for the actual transformation.

Phase 1: Candidate Validation

runOnAlloca(state, alloca):
    if alloca has no users:
        eraseFromParent(alloca)
        return

    if isAllocaPromotable(alloca):
        defer to mem2reg
        return

    type = getAllocatedType(alloca)
    type_byte = getTypeID(type)

    // Accept: integers(3), half(4), bfloat(5), float(6),
    //         pointers(10), vectors(11), arrays(12), structs(15-18, 20)
    // Reject structs/composites unless isVectorType returns true
    if type_byte not in {3,4,5,6,10,11,12,15,16,17,18,20}:
        return
    if type_byte in {15,16,17,18,20} and not isVectorType(type):
        return  // function types, labels, etc.

    size = getTypeSizeInBits(type)   // sub_BDB740
    if size > qword_50056C8:         // SROA size threshold
        return  // alloca too large, leave for backend

The size threshold at qword_50056C8 is a global tuning knob, likely controlled by the sroa<preserve-cfg> / sroa<modify-cfg> pipeline parameter. Allocas larger than this threshold are left untouched; the backend will lower them to .local memory. The exact default is not exposed in the binary's constructor initializers, but upstream LLVM uses a default of 128 bytes (1024 bits) for the sroa-threshold flag.

Phase 2: Use Analysis and Slice Building

    metadata = buildMetadataTable(alloca)   // sub_D5F1F0

    if qword_50055E8:                       // two-pass mode
        buildSlices(state, alloca, 1)       // sub_2927160 — pre-analysis
        slices = buildPartitions(state)     // sub_2924690
    else:
        slices = buildPartitions(state)     // single-pass

buildSlices (sub_2927160) walks all users of the alloca, classifying each use as a "slice" -- a byte range [start, end) with associated flags. Each slice is a 24-byte entry:

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | start (byte offset into alloca) |
| +8 | 8 | end (byte offset, exclusive) |
| +16 | 8 | flags -- bit 2 = splittable, bits [63:3] = user instruction metadata pointer |

buildPartitions (sub_2924690) groups non-overlapping slices into partitions. Each partition represents a contiguous byte range that can be replaced by a single sub-alloca. Overlapping slices are merged; slices that cross partition boundaries are marked as "unsplittable."

The two-pass flag (qword_50055E8) enables a pre-analysis pass that runs buildSlices first in a "dry-run" mode to count slices and pre-allocate arrays, then runs the actual partition builder. This is the new pass manager (PM) code path; the legacy PM code at 0x1A10000 does a single pass.
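The grouping step can be sketched as follows -- a simplified model (not the binary's actual layout) that merges overlapping slices from a start-sorted list into contiguous partitions:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified mirror of the 24-byte slice entry (flags omitted).
struct Slice { uint64_t start, end; };  // half-open byte range [start, end)

// Sketch of buildPartitions: given slices sorted by start offset, collapse
// every run of overlapping slices into one partition; disjoint slices open
// a new partition. Overlap merging is why cross-boundary accesses end up
// in the same partition.
std::vector<Slice> buildPartitions(const std::vector<Slice>& sorted) {
    std::vector<Slice> parts;
    for (const Slice& s : sorted) {
        if (!parts.empty() && s.start < parts.back().end)
            parts.back().end = std::max(parts.back().end, s.end);  // overlap
        else
            parts.push_back(s);                                    // disjoint
    }
    return parts;
}
```

For example, slices [0,4), [2,8), [12,16) yield two partitions: [0,8) and [12,16).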

Phase 3: Contiguous Slice Merging

After building slices, runOnAlloca scans for contiguous ranges that share the same base type and can be merged:

    for each group of contiguous slices:
        if all loads/stores in group use the same type:
            if none are volatile (isVolatile check via sub_B46500):
                if all are in-bounds (byte +2, bit 0):
                    mergeSlices(group)   // sub_11D2BF0 + sub_11D3120 + sub_11D7E80

This optimizer/merger reduces redundant slices before the splitting phase. For example, if a 16-byte struct has four contiguous 4-byte i32 loads, the merger can combine them into a single slice covering the full struct, which may then map to a single <4 x i32> register rather than four separate scalar registers.
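A sketch of that merge rule under the stated conditions (same type, non-volatile, exactly adjacent); the `TypedSlice` fields are illustrative, not the binary's layout:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative slice with a type tag standing in for the LLVM access type.
struct TypedSlice { uint64_t start, end; int typeId; bool isVolatile; };

// Collapse runs of exactly adjacent, same-type, non-volatile slices into
// one wider slice -- e.g. four contiguous 4-byte i32 loads over a 16-byte
// struct become a single 16-byte slice.
std::vector<TypedSlice> mergeContiguous(const std::vector<TypedSlice>& in) {
    std::vector<TypedSlice> out;
    for (const TypedSlice& s : in) {
        if (!out.empty() && out.back().end == s.start &&
            out.back().typeId == s.typeId &&
            !out.back().isVolatile && !s.isVolatile)
            out.back().end = s.end;   // extend the previous slice
        else
            out.push_back(s);
    }
    return out;
}
```

Any volatile access or type mismatch in the run breaks the merge, matching the isVolatile and same-type checks above.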

Phase 4: Dead Instruction Processing

    for each dead instruction found during analysis:
        for each operand:
            addToWorklist(operand)         // sub_29220F0
        replaceAllUsesWith(undef)          // sub_BD84D0 + sub_ACADE0
        eraseFromParent(instruction)       // sub_BD60C0

Dead instructions identified during slice building (stores to never-loaded ranges, loads of write-only ranges) are removed immediately, before the splitting phase begins.

Phase 5: Recursive Splitting

    if slices is non-empty:
        splitAlloca(state, alloca, slices)  // sub_2930B90 — recursive

This is the key: splitAlloca may create new sub-allocas that are themselves candidates for further splitting. The newly created sub-allocas are added to the worklist and processed in stack order (LIFO).

Phase 6-8: Post-Split Processing

After splitting, runOnAlloca processes newly created sub-allocas (56-byte records stored in a SmallVector with 2-element inline buffer), rewrites per-sub-alloca slice lists, and returns a two-byte result: byte 0 = changed flag, byte 1 = re-run needed flag.

Algorithm: splitAlloca (sub_2930B90)

The core splitting function. Given a partitioned alloca and its use-slices, it creates new sub-allocas and rewrites all users.

Phase 1: Pre-Filter Slices

Iterates the 24-byte slice array. For slices whose instruction is a load (opcode 61) or store (opcode 62) of a simple scalar type that fits entirely within the alloca boundary, clears the "splittable" bit (flag & 4). This prevents unnecessary splitting of trivial accesses -- a scalar i32 load from an i32 alloca does not need splitting. If any slices were de-flagged, calls sortSlices (sub_2912200) and compactSlices (sub_2915A90 / sub_2914CE0) to remove the now-redundant entries.

Phase 2: Partition Iteration

buildPartitionTable (sub_2913C40) produces a partition list from the sorted slices. Each partition is a local tuple [start, end, first_slice_ptr, last_slice_ptr]. The main loop advances through partitions via sub_2912870 (advancePartitionIterator).

Phase 3: Find Rewrite Target

For each partition [start, end):

  1. Get the DataLayout via sub_B43CC0 (getDL).
  2. If the partition contains only unsplittable slices, call findExistingValue (sub_291A860) to search for an existing SSA value that already covers [start, end). If found, reuse it instead of creating a new alloca.
  3. Otherwise, scan slices for a single dominating load or store. Dispatch on opcode:
    • 61 (load): extract the loaded type.
    • 62 (store): extract the stored value type from the store's value operand.
    • 85 (intrinsic): memcpy/memset/memmove -- follow the pointer chain to determine the affected type.
  4. Compare type sizes via getTypeSizeInBits (sub_BDB740).
  5. If no suitable existing value, create a new alloca via CreateAlloca (sub_BCD420) or CreateBitCast (sub_BCD140).

Phase 4: Size and Alignment Check

    alloc_size = getTypeAllocSize(partition_type)    // sub_9208B0
    if alloc_size > 0x800000:                        // 8 MB sanity limit
        skip partition

    // Verify rewrite target matches partition size (8-byte aligned)
    if match:
        checkTypeCompatibility(both_directions)      // sub_29191E0
        validateUnsplittableSlices(partition)         // sub_291A4D0

The 8 MB sanity limit prevents SROA from creating absurdly large sub-allocas from pathological input.

Phase 5: Slice Classification

For each slice in the partition, classifySlice (sub_29280E0) sorts it into one of two lists:

| List | Variable | Contents |
|---|---|---|
| splittable-inside | v446 | Slices fully contained within [start, end) |
| splittable-outside | v452 | Slices that reference bytes outside the partition (integer widening) |

The classification also tracks:

  • v413 (sameType flag): whether all slices in the partition use the same LLVM type.
  • v415 (common type): the shared type if sameType is true.
  • v412 (hasPointerType): whether any slice involves a pointer type.
  • Integer types (type byte == 14) are routed to the outside list for special handling (widening/narrowing may be needed).

Then rewritePartition (sub_29197E0) is called twice: first for inside slices with callback sub_2919EF0, then for outside slices if the first call produced nothing.

Phase 6: New Sub-Alloca Creation

    // Compute alignment
    align_log2 = _BitScanReverse64(alloca_alignment)
    abi_align = getABITypeAlignment(type)            // sub_AE5020
    pref_align = getPrefTypeAlignment(type)          // sub_AE5260

    // Build name: original_name + ".sroa." + index
    name = getName(alloca) + ".sroa."                // sub_BD5D20

    // Create the new alloca (80-byte AllocaInst object)
    new_alloca = AllocaInst::Create(type, size, alignment, name)
                                                     // sub_BD2C40 + sub_B4CCA0
    // Insert before the original alloca
    insertBefore(new_alloca, alloca)

    // Copy debug metadata
    copyDebugInfo(alloca, new_alloca)                // sub_B96E90 + sub_B976B0

Each sub-alloca is an 80-byte AllocaInst object with the .sroa. name prefix. The insertion point is always directly before the original alloca in the entry block, maintaining the invariant that all allocas are grouped at the function entry.

Phase 7: Instruction Rewriting

The visitUse function (sub_292A4F0) rewrites each user of the original alloca to reference the appropriate sub-alloca:

  • GEP chains: retargeted to the new sub-alloca with adjusted offsets (sub_29348F0).
  • Loads: rewritten with type-casts if the sub-alloca type differs from the original load type (sub_F38250).
  • Stores: same treatment as loads (sub_F38250).
  • Memcpy/memset: split into smaller operations covering only the sub-alloca's byte range (sub_F38330).

Each rewritten instruction is validated via sub_291F660 (validateRewrite).

Phase 8: Worklist Management

Dead instructions are removed from the pass's open-addressing hash table (at pass state offset +432, mask at +896). New sub-allocas are added to the worklist (sub_2928360) for re-processing. Allocas that cannot be split are recorded via sub_2916C30 (recordNonSplitAlloca).

Phase 9: Result Recording

For each partition that produced a new alloca, the result is stored as a 24-byte entry [new_alloca, bit_offset, bit_size] in the output array. Hash table capacity is computed using the classic 4n/3 + 1 formula (next power of 2), and entries are stored via open-addressing with linear probing (sub_29222D0 handles resizing).
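The capacity computation can be written out directly -- a small sketch of the 4n/3 + 1, next-power-of-two rule, which keeps the open-addressing table at no more than roughly 75% load:

```cpp
#include <cassert>
#include <cstdint>

// Capacity for n entries: at least 4n/3 + 1 slots, rounded up to the next
// power of two so that masking (rather than modulo) can index the table.
uint64_t tableCapacity(uint64_t n) {
    uint64_t need = 4 * n / 3 + 1;
    uint64_t cap = 1;
    while (cap < need) cap <<= 1;
    return cap;
}
```

For 6 entries: 4*6/3 + 1 = 9, rounded up to 16 slots.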

Phase 10: Post-Split Use Rewriting

The most complex phase. For every use of the original alloca:

  1. getOperandNo (sub_B59530) determines which operand references the alloca.
  2. getAccessRange (sub_AF47B0) computes the byte range [begin, end) within the alloca that this use touches.
  3. For each new sub-alloca in the result array, checkSubAllocaOverlap (sub_AF4D30) tests whether the sub-alloca's range overlaps the use's range.
  4. If overlap: computeRewrittenValue (sub_2916270) produces the replacement value by combining reads from multiple sub-allocas if the original use spans a partition boundary.
  5. Dead uses are identified by isDeadUse (sub_291D8F0) and erased.
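The overlap test in step 3 reduces to the standard half-open interval check; a minimal sketch (illustrative names, not the binary's):

```cpp
#include <cassert>
#include <cstdint>

// Half-open byte ranges [begin, end), as used for both the use's access
// range and each sub-alloca's covered range.
struct Range { uint64_t begin, end; };

// Two half-open ranges intersect iff each starts before the other ends.
bool overlaps(Range a, Range b) {
    return a.begin < b.end && b.begin < a.end;
}
```

Note that adjacent ranges such as [0,4) and [4,8) do not overlap, so a use touching only the second field is rewritten against exactly one sub-alloca.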

The use-list implementation uses a tagged-pointer scheme: bit 2 indicates "heap-allocated list" vs. "inline single element," bits [63:3] are the actual pointer. Lists are freed via _libc_free after extracting the data pointer.

Phase 11-12: Lifetime and Debug Info

Lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are rewritten via sub_291E540 to cover only the sub-alloca's byte range. Debug declarations (dbg.declare, dbg.value) are similarly rewritten: each debug-info entry pointing to the original alloca is retargeted to the sub-alloca whose byte range covers the relevant fragment, using the debug expression's DW_OP_LLVM_fragment to indicate the piece.

Speculative Loads Through Select

When a load reaches its pointer through a select instruction, SROA hoists the load into both branches:

; Before SROA:
%p = select i1 %cond, ptr %a, ptr %b
%v = load float, ptr %p, align 4

; After SROA:
%vt = load float, ptr %a, align 4          ; .sroa.speculate.load.true
%vf = load float, ptr %b, align 4          ; .sroa.speculate.load.false
%v  = select i1 %cond, float %vt, float %vf ; .sroa.speculated

This is significant on GPU for two reasons:

  1. SIMT execution model. A select on a GPU maps to a predicated move, which executes in a single cycle without divergence. The two speculative loads execute unconditionally and in parallel (both issue to the memory pipeline regardless of the predicate). This is cheaper than a control-dependent load that would require branch divergence handling.

  2. Alloca elimination. The original pattern requires the select to produce a pointer, which means the alloca must remain in memory (the pointer must be materializable). After speculation, both pointers are consumed directly by loads, and if %a and %b are themselves sub-allocas that can be promoted to registers, the entire chain collapses to register-only operations.

The implementation (Kind 3, lines 1024-1235 of splitAlloca) creates:

  • Two BitCastInst with names .sroa.speculate.cast.true and .sroa.speculate.cast.false.
  • Two LoadInst with names .sroa.speculate.load.true and .sroa.speculate.load.false, preserving alignment from the original load.
  • One SelectInst with name .sroa.speculated via sub_B36550 (SelectInst::Create).
  • Metadata copied from the original load via sub_B91FC0 (copyMetadata).

Interaction with .param Space

Function parameters passed by value in CUDA/PTX use the .param address space (NVPTX address space 101). The EDG frontend generates an alloca to hold a copy of each byval parameter, then loads fields from it. Consider:

struct Vec3 { float x, y, z; };

__device__ float sum(Vec3 v) {
    return v.x + v.y + v.z;
}

The IR before SROA contains:

define float @sum(%struct.Vec3* byval(%struct.Vec3) align 4 %v) {
  %v.addr = alloca %struct.Vec3, align 4           ; byval copy
  %x = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 0
  %0 = load float, ptr %x, align 4
  %y = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 1
  %1 = load float, ptr %y, align 4
  %z = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 2
  %2 = load float, ptr %z, align 4
  %add = fadd float %0, %1
  %add1 = fadd float %add, %2
  ret float %add1
}

SROA splits %v.addr into three scalar allocas (%v.addr.sroa.0, .sroa.1, .sroa.2), each holding a single float. Because each sub-alloca has only simple loads and stores, mem2reg (which runs in the next pipeline iteration) promotes all three to SSA registers. The final IR has no allocas and no memory traffic -- the three float values live entirely in registers.

Without SROA, the byval copy would persist as a .local allocation, and every field access would be a .local load. For a kernel that calls sum() in a tight loop, this difference is the difference between register-speed and DRAM-speed execution.

The NVPTXTargetLowering::LowerCall function (sub_3040BF0) emits DeclareParam (opcode 505) and StoreV1/V2/V4 (opcodes 571-573) for the .param writes on the caller side; SROA's job is to ensure the callee's reads never touch memory.

Auxiliary SROA Functions (Secondary Instance)

The binary contains a second SROA instance at 0x1A10000-0x1A3FFFF (~200 KB), corresponding to the legacy pass manager code path. This instance contains additional rewriting functions not visible in the primary (new PM) instance:

| Function | Size | Role | Key strings |
|---|---|---|---|
| sub_1A3B290 | 58 KB | rewritePartition (memcpy/memset) | "memcpy.load.fca", "memcpy.store.fca", "memset.store.fca", ".fca" |
| sub_1A2D070 | 35 KB | presplitLoadsAndStores | "select.gep.sroa", "select.sroa", "phi.sroa", "phi.gep.sroa" |
| sub_1A2C2F0 | 9 KB | Select speculation | ".sroa.speculate.load.true", ".sroa.speculate.load.false" |
| sub_1A2FFA0 | 12 KB | Vector splat handling | "vsplat", ".splatinsert", ".splat" |
| sub_1A30D10 | 16 KB | Load rewriting | "copyload", "oldload" |
| sub_1A31B60 | 9 KB | Extract/load patterns | "extract", "load.ext", "endian_shift", "load.trunc" |
| sub_1A23B30 | 11 KB | Type casting | "sroa_raw_cast", "sroa_raw_idx", "sroa_cast" |
| sub_1A3A670 | 13 KB | Speculative load promotion | ".sroa.speculated", ".sroa.speculate.load." |
| sub_1A13B30 | 36 KB | Alloca analysis / slice building | -- |
| sub_1A15E70 | 34 KB | Partition computation | -- |
| sub_1A18770 | 38 KB | Use analysis | -- |
| sub_1A3DCD0 | 15 KB | Cleanup | -- |

The .fca suffix stands for "first-class aggregate" -- LLVM's term for structs and arrays passed by value. The presplitLoadsAndStores function handles a special case where loads and stores of aggregates can be split before the main SROA algorithm runs, decomposing load { i32, i32 } into separate load i32 instructions and store { i32, i32 } into separate store i32 instructions. The select.gep.sroa and phi.gep.sroa strings indicate that this pre-split phase also handles GEP chains through PHI nodes and selects, a pattern common in CUDA code after inlining.

Data Structures

Slice Entry (24 bytes)

struct SROASlice {
    uint64_t start;     // +0:  byte offset into alloca (inclusive)
    uint64_t end;       // +8:  byte offset into alloca (exclusive)
    uint64_t flags;     // +16: bit 2 = splittable, bits [63:3] = user metadata ptr
};

The splittable bit indicates whether the slice can be split across partition boundaries. Loads and stores of simple scalars that fit entirely within the alloca have this bit cleared in Phase 1 of splitAlloca.

Sub-Alloca Record (56 bytes)

struct SubAllocaRecord {
    void* alloca_ptr;       // +0:  pointer to the new AllocaInst
    void* slice_list;       // +8:  pointer to slice list for this sub-alloca
    uint64_t slice_list_cap; // +16: capacity of slice list
    // ... additional fields through +55
};

Stored in a SmallVector<SubAllocaRecord, 2> -- the inline buffer holds two elements (common case: a struct with two fields), spilling to heap for larger aggregates.

Pass State Hash Table

The SROA pass state object (parameter a1 to both main functions) contains an open-addressing hash table at offsets +432 through +896. It uses LLVM-layer sentinels (-4096 / -8192) with instruction pointer keys. This table tracks which instructions have already been processed or are pending in the worklist. See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.

Tagged Pointer Scheme

Use-lists and debug-info lists use a tagged-pointer encoding for memory efficiency:

  • Bit 2 clear: the "pointer" field directly contains a single element (inline storage for the common case of one use).
  • Bit 2 set: bits [63:3] are a heap-allocated pointer to a variable-length list. Freed via _libc_free after masking off the tag bits.

This avoids heap allocation for the overwhelmingly common case where an alloca field has exactly one load or one store.
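The encoding can be sketched directly -- an assumed model of the tag scheme described above, relying on heap allocations being at least 8-byte aligned so the low three bits of a list pointer are free for tags:

```cpp
#include <cassert>
#include <cstdint>

// Bit 2 tags "heap-allocated list"; bits [63:3] hold the pointer itself
// (valid because 8-byte alignment leaves the low 3 bits zero).
constexpr uint64_t kHeapListBit = 1ull << 2;

uint64_t tagHeapList(void* list) {
    return reinterpret_cast<uint64_t>(list) | kHeapListBit;
}

bool isHeapList(uint64_t word) { return (word & kHeapListBit) != 0; }

void* heapListPtr(uint64_t word) {
    return reinterpret_cast<void*>(word & ~7ull);  // mask off tag bits
}
```

Decoding masks off the tag bits before the pointer is dereferenced or handed to the allocator for freeing.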

IR Before/After Example

Consider a CUDA kernel that uses a local struct:

__global__ void kernel(float* out, int n) {
    struct { float a; int b; float c; } local;
    local.a = 1.0f;
    local.b = n;
    local.c = 2.0f;
    out[0] = local.a + local.c;
    out[1] = (float)local.b;
}

Before SROA:

define void @kernel(ptr %out, i32 %n) {
entry:
  %local = alloca { float, i32, float }, align 4
  %a = getelementptr { float, i32, float }, ptr %local, i32 0, i32 0
  store float 1.0, ptr %a, align 4
  %b = getelementptr { float, i32, float }, ptr %local, i32 0, i32 1
  store i32 %n, ptr %b, align 4
  %c = getelementptr { float, i32, float }, ptr %local, i32 0, i32 2
  store float 2.0, ptr %c, align 4
  %v0 = load float, ptr %a, align 4
  %v2 = load float, ptr %c, align 4
  %sum = fadd float %v0, %v2
  store float %sum, ptr %out, align 4
  %v1 = load i32, ptr %b, align 4
  %conv = sitofp i32 %v1 to float
  %idx = getelementptr float, ptr %out, i64 1
  store float %conv, ptr %idx, align 4
  ret void
}

After SROA (three sub-allocas, then mem2reg promotes to registers):

define void @kernel(ptr %out, i32 %n) {
entry:
  ; No allocas remain -- all promoted to SSA values
  %sum = fadd float 1.0, 2.0          ; constant-folded later by InstCombine
  store float %sum, ptr %out, align 4
  %conv = sitofp i32 %n to float
  %idx = getelementptr float, ptr %out, i64 1
  store float %conv, ptr %idx, align 4
  ret void
}

SROA splits %local into %local.sroa.0 (float), %local.sroa.1 (i32), %local.sroa.2 (float). Each sub-alloca has trivial load/store patterns, so mem2reg promotes all three. The stores and loads collapse, GEPs disappear, and the kernel runs entirely from registers.

Name Suffixes Created During Splitting

| Suffix | Purpose |
|---|---|
| .sroa. | New sub-alloca name prefix |
| .sroa.speculate.cast.true | Bitcast for true branch of select |
| .sroa.speculate.cast.false | Bitcast for false branch of select |
| .sroa.speculate.load.true | Speculative load from true branch |
| .sroa.speculate.load.false | Speculative load from false branch |
| .sroa.speculated | Final select combining speculative loads |
| .cont | Continuation block (after branch splitting) |
| .then | Then-branch block |
| .else | Else-branch block |
| .val | Value extracted from split load/store |
| .fca | First-class aggregate decomposition |
| select.gep.sroa | GEP through select, pre-split |
| select.sroa | Select pointer, pre-split |
| phi.sroa | PHI pointer, pre-split |
| phi.gep.sroa | GEP through PHI, pre-split |
| sroa_raw_cast | Raw bitcast during type rewriting |
| sroa_raw_idx | Raw index computation during rewriting |
| sroa_cast | Generic SROA type cast |
| vsplat | Vector splat element |
| .splatinsert | Splat insert element |
| .splat | Splat shuffle |
| copyload | Copy of a load during rewriting |
| oldload | Original load being replaced |
| extract | Extracted sub-value |
| load.ext | Load with extension |
| endian_shift | Endianness-adjustment shift |
| load.trunc | Load with truncation |
| memcpy.load.fca | Memcpy load of first-class aggregate |
| memcpy.store.fca | Memcpy store of first-class aggregate |
| memset.store.fca | Memset store of first-class aggregate |

Differences from Upstream LLVM

The core SROA algorithm in cicc v13.0 is stock LLVM SROA. No CUDA-specific modifications to the splitting logic, slice building, or partition computation were detected. The NVIDIA-specific elements are limited to:

  1. Pass state object layout. The offsets within the pass state structure (worklist at +432, hash table at +824-+864, sub-alloca records at +1080-+1096) reflect NVIDIA's PassManager integration, not upstream's.

  2. IR node encoding. Opcode numbers (61 = load, 62 = store, 85 = intrinsic, 55 = phi) and operand layout (32-byte basic blocks, tagged pointers) follow NVIDIA's modified IR format.

  3. Debug metadata system. The metadata kind for debug info uses MD_dbg = 38 (NVIDIA assignment), queried via sub_B91C10.

  4. Global threshold knob. The value at qword_50056C8 may have an NVIDIA-specific default different from upstream's 128-byte / 1024-bit default. The knob is likely settable via the pipeline text sroa<preserve-cfg> or sroa<modify-cfg>.

  5. Pipeline positioning. The early-pipeline placement (position 4, before NVVMLowerArgs and NVVMLowerAlloca) is NVIDIA-specific. Upstream LLVM typically places SROA after InstCombine and SimplifyCFG; cicc places it before those passes to eliminate byval parameter copies as early as possible.

Configuration

| Knob | Global | Description |
|---|---|---|
| SROA size threshold | qword_50056C8 | Maximum alloca size (in bits) that SROA will attempt to split. Allocas exceeding this are left for the backend. |
| Two-pass analysis flag | qword_50055E8 | When set, enables a pre-analysis pass before slice building (new PM integration). |
| Disable flag | NVVMPassOptions offset +1400 | Setting this byte disables SROA entirely. |
| Pipeline param preserve-cfg | -- | Runs SROA without modifying the CFG (no block splitting for speculative loads across PHIs). |
| Pipeline param modify-cfg | -- | Allows SROA to modify the CFG (enables full speculative load hoisting including PHI/select decomposition). |

Function Map

| Function | Address | Size |
|---|---|---|
| **Primary instance (new PM)** | | |
| SROAPass::runOnAlloca | sub_2935C30 | 58 KB |
| SROAPass::splitAlloca | sub_2930B90 | 80 KB |
| buildSlices (use analysis) | sub_2927160 | -- |
| buildPartitions (group slices) | sub_2924690 | -- |
| buildPartitionTable | sub_2913C40 | -- |
| sortSlices | sub_2912200 | -- |
| compactSlices (with filter) | sub_2915A90 | -- |
| compactSlices (simple) | sub_2914CE0 | -- |
| findExistingValue | sub_291A860 | -- |
| rewritePartition | sub_29197E0 | -- |
| rewriteCallback | sub_2919EF0 | -- |
| visitUse (rewrite one use) | sub_292A4F0 | 54 KB |
| validateRewrite | sub_291F660 | -- |
| analyzeSlice | sub_29150D0 | -- |
| addToNewAllocaWorklist | sub_2929FB0 | -- |
| addToWorklist | sub_2928360 | -- |
| addOperandToWorklist | sub_29220F0 | -- |
| clearPendingQueue | sub_2921860 | -- |
| classifySlice | sub_29280E0 | -- |
| recordNonSplitAlloca | sub_2916C30 | -- |
| computeRewrittenValue | sub_2916270 | -- |
| advancePartitionIterator | sub_2912870 | -- |
| rewriteGEPChain | sub_29348F0 | -- |
| replaceAndErase | sub_2914800 | -- |
| collectUsesForRewrite (variant) | sub_2914380 | -- |
| collectUsesForRewrite (original) | sub_2914550 | -- |
| Hash table resize | sub_29222D0 | -- |
| Alloca rewriting helper | sub_292D810 | 67 KB |
| SROA pass metadata | sub_2912100 | -- |
| SROA pass registration ("Scalar Replacement Of Aggregates", "sroa") | sub_2912340 | -- |
| **Secondary instance (legacy PM)** | | |
| SROAPass::runOnAlloca (legacy) | sub_1A33E80 | 61 KB |
| SROAPass::splitAlloca (legacy) | sub_1A37040 | 46 KB |
| rewritePartition (memcpy/memset) | sub_1A3B290 | 58 KB |
| presplitLoadsAndStores | sub_1A2D070 | 35 KB |
| Select speculation | sub_1A2C2F0 | 9 KB |
| Vector splat handling | sub_1A2FFA0 | 12 KB |
| Load rewriting | sub_1A30D10 | 16 KB |
| Extract/load patterns | sub_1A31B60 | 9 KB |
| Type casting | sub_1A23B30 | 11 KB |
| Speculative load promotion | sub_1A3A670 | 13 KB |
| Alloca analysis / slice building | sub_1A13B30 | 36 KB |
| Partition computation | sub_1A15E70 | 34 KB |
| Use analysis | sub_1A18770 | 38 KB |
| Cleanup | sub_1A3DCD0 | 15 KB |
| **Shared helpers** | | |
| isAllocaPromotable | sub_B4CE70 | -- |
| getDL (DataLayout) | sub_B43CC0 | -- |
| getTypeSizeInBits | sub_BDB740 | -- |
| getTypeAllocSize | sub_9208B0 | -- |
| getType | sub_BD5C60 | -- |
| getName | sub_BD5D20 | -- |
| AllocaInst::Create | sub_BD2C40 | -- |
| PHINode::Create | sub_BD2DA0 | -- |
| AllocaInst constructor | sub_B4CCA0 | -- |
| CreateBitCast | sub_BCD140 | -- |
| CreateAlloca | sub_BCD420 | -- |
| replaceAllUsesWith | sub_BD84D0 | -- |
| eraseFromParent | sub_B43D60 | -- |
| SelectInst::Create | sub_B36550 | -- |
| UndefValue::get | sub_ACADE0 | -- |
| getABITypeAlignment | sub_AE5020 | -- |
| getPrefTypeAlignment | sub_AE5260 | -- |
| copyMetadata | sub_B91FC0 | -- |
| isVolatile | sub_B46500 | -- |
| isVectorType | sub_BCEBA0 | -- |
| rewriteLoadStoreOfSlice | sub_F38250 | -- |
| rewriteMemTransferOfSlice | sub_F38330 | -- |
| collectAllUses | sub_AE74C0 | -- |
| getAccessRange | sub_AF47B0 | -- |
| checkSubAllocaOverlap | sub_AF4D30 | -- |
| buildMetadataTable | sub_D5F1F0 | -- |
| addToErasedSet | sub_D6B260 | -- |
| Slice optimizer init | sub_11D2BF0 | -- |
| Slice optimizer run | sub_11D3120 | -- |
| Slice optimizer finalize | sub_11D7E80 | -- |

Test This

The following kernel allocates a local struct and accesses its fields. SROA should completely eliminate the alloca, promoting all fields to registers.

struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

__global__ void sroa_test(float* out, int n) {
    Particle p;
    p.x  = (float)threadIdx.x;
    p.y  = (float)threadIdx.y;
    p.z  = 0.0f;
    p.vx = 1.0f;
    p.vy = 2.0f;
    p.vz = 3.0f;

    float energy = 0.5f * (p.vx*p.vx + p.vy*p.vy + p.vz*p.vz);
    out[threadIdx.x] = p.x + p.y + p.z + energy;
}

What to look for in PTX:

  • Absence of .local memory declarations. If SROA succeeds, there should be no .local .align directives in the PTX for the Particle struct. All six fields (x, y, z, vx, vy, vz) should live in %f (float) registers.
  • No st.local or ld.local instructions. These indicate that the struct survived into .local memory -- a 200-400 cycle penalty per access versus zero cycles for a register.
  • The PTX should show direct register arithmetic: mov.f32, fma.rn.f32, add.f32 -- no memory traffic at all for the struct fields.
  • To see the failure case, add volatile to the struct declaration (volatile Particle p;). This prevents SROA from promoting the alloca, and ld.local/st.local instructions will appear in the PTX, demonstrating the performance cliff that SROA normally prevents.
  • At -O0, SROA still runs (it is correctness-relevant for address space resolution), but with a more conservative threshold. Compare the .local frame size between -O0 and -O2.

Cross-References

  • Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
  • Pipeline & Ordering -- pipeline positions 4 and post-sinking
  • Register Allocation -- surviving allocas become .local spills, directly increasing register pressure
  • Rematerialization -- recomputes cheap values to reduce register pressure; operates downstream of SROA
  • StructSplitting -- NVIDIA custom pass that splits struct arguments at the call boundary; complements SROA's intra-procedural splitting
  • MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs after SROA
  • Hash Infrastructure -- the open-addressing hash table used by the SROA pass state