Struct/Aggregate Splitting

GPU register files are typed and scalar. An SM has no concept of loading a struct, storing a struct, or passing a struct through a register -- every value that survives past IR lowering must reduce to a set of individually-named scalar registers. LLVM's standard SROA pass handles alloca-based aggregates by promoting them to scalars, but a large class of aggregate operations never touch an alloca: return values, call arguments, PHI nodes carrying struct types, and aggregate load/store patterns from memcpy lowering. NVIDIA's struct-splitting pass operates on these non-alloca aggregate operations at the NVVM IR level, decomposing every struct-typed value into its constituent scalar fields so that downstream register allocation sees only scalar types.

The pass exists in two binary instances. The primary implementation at sub_1C86CA0 (72KB, ~1,200 lines, 500+ locals) lives in the aggregate-splitting cluster at 0x1C80000--0x1CBFFFF and operates on NVVM IR using NVIDIA-proprietary type IDs. A second, closely related implementation at sub_2CCF450 (58KB) handles the lower-aggr-copies pipeline pass and shares the same string constants ("splitStruct", "srcptr", "dstptr", "remsrc", "remdst", "split", "vld"). Both instances produce the same fundamental transformation: aggregate operations become sequences of scalar operations on individual struct elements.

Key Facts

Property            Value
Entry point         sub_1C86CA0
Size                72KB (~1,200 lines decompiled), 500+ local variables
Binary cluster      0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops)
Second instance     sub_2CCF450 (58KB, lower-aggr-copies pass)
Pipeline pass name  lower-aggr-copies (parameterized: lower-aggr-func-args)
Related pass        lower-struct-args (parameterized: opt-byval)
IR level            NVVM IR (NVIDIA-proprietary type IDs, not LLVM Type::TypeID)
Key opcode          32 (splitStruct instruction)
Use replacement     sub_164D160 (RAUW -- Replace All Uses With)
LLVM upstream       No equivalent -- entirely NVIDIA-proprietary

Algorithm

The pass walks every instruction in a function, looking for operations whose result type or operand type is an aggregate (struct or array). For each such operation, it decomposes the aggregate into its scalar elements, creates a splitStruct multi-output instruction, and rewires all uses to reference individual element extractions.
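The overall driver loop can be sketched as follows. This is an illustrative skeleton only -- the function and field names are ours, and aggregates are modeled as tuples of element types rather than real IR objects.

```python
# Illustrative skeleton of the pass driver: scan instructions, detect
# aggregate-typed results, and record one scalar extraction per element.
def is_aggregate(ty):
    return isinstance(ty, tuple)  # model structs/arrays as tuples of types

def split_function(instructions):
    """instructions: list of (name, type); returns per-value scalar replacements."""
    replacements = {}
    for name, ty in instructions:
        if is_aggregate(ty):
            # one scalar extraction per flattened element
            replacements[name] = [f"{name}.elem{i}" for i in range(len(ty))]
    return replacements

repl = split_function([("%agg", ("i32", "f32")), ("%x", "i64")])
print(repl)  # {'%agg': ['%agg.elem0', '%agg.elem1']}
```

Scalar-typed values pass through untouched; only aggregate-typed results are rewritten, matching the pass's selective walk described above.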

Step 1: Type Decomposition

For each struct type encountered, the pass retrieves the struct layout from the DataLayout and enumerates its elements:

function decomposeStructType(struct_type, data_layout):
    layout = sub_1643350(data_layout, struct_type)  // GetStructLayout
    element_types = []
    for each element in struct_type.elements:
        scalar_ty = sub_159C470(element)            // getScalarType
        element_types.append(scalar_ty)
    return element_types

sub_1643350 retrieves the StructLayout from the DataLayout, giving byte offsets and sizes for each field. sub_159C470 maps each element to its scalar type -- for nested structs, this recurses; for arrays, it yields the element type; for scalars, it returns the type directly.

The element types accumulate in a local array v505[] with the count tracked in v506. This flattened type list drives all subsequent instruction creation.

Step 2: splitStruct Instruction Creation (Opcode 32)

The pass creates a new multi-output instruction with NVVM opcode 32:

function createSplitStruct(original_inst, element_types, count):
    composite_ty = sub_15F9F50(element_types, count)     // ComputeCompositeType
    aligned_ty   = sub_1646BA0(composite_ty, data_layout) // SetAlignmentFromDL

    // If original was a vector type (type_id == 16), wrap in vector
    if getTypeId(original_inst.type) == 16:
        aligned_ty = sub_16463B0(aligned_ty)              // WrapInVectorType

    split_inst = sub_15F1EA0(aligned_ty, 32, parent, nops, flags)
                                                          // InitInstruction(opcode=32)
    // Store original type info at inst+56, composite at inst+64
    split_inst[+56] = original_type_info
    split_inst[+64] = sub_15F9F50(composite_ty)
    return split_inst

The splitStruct instruction is the NVVM-specific multi-result node that represents the decomposition. It produces N outputs, one per struct element. The instruction stores both the original aggregate type (at offset +56) and the composite element type (at offset +64) for later phases that may need to reconstruct type information.

Step 3: Element Pointer Extraction

For each element of the decomposed struct, the pass creates an indexed load from the splitStruct result:

for i in 0..count:
    ptr = sub_15FD590(split_inst, element_types[i],
                      operand=i, name="ptr", insertion_point)
    // Creates opcode 56 (extractvalue-like) with type=1

sub_15FD590 creates an instruction with opcode 56 that extracts the i-th element from the multi-output splitStruct node. The "ptr" name prefix appears in debug output. Each extraction yields a scalar-typed value that downstream passes can assign to an individual PTX register.
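The element offsets that the StructLayout query (sub_1643350) supplies can be sketched with standard LLVM-style layout rules: each field is placed at the next offset satisfying its natural alignment, and the total size is padded to the largest field alignment. The helper below is a hypothetical illustration under those assumptions, not the decompiled code.

```python
# Hypothetical sketch of DataLayout-style struct layout: place each field
# at the next offset that satisfies its natural alignment, then pad the
# total size to a multiple of the largest field alignment.
def struct_layout(fields):
    """fields: list of (size_bytes, align_bytes); returns (offsets, total_size)."""
    offsets, offset, max_align = [], 0, 1
    for size, align in fields:
        offset = (offset + align - 1) & -align  # round up to field alignment
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) & -max_align
    return offsets, total

# {i32, f32, i64}: a typical aggregate input to the pass
offsets, size = struct_layout([(4, 4), (4, 4), (8, 8)])
print(offsets, size)  # [0, 4, 8] 16
```

Each extraction at index i then corresponds to a scalar access at the i-th offset in this layout.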

Step 4: Split Load with Alignment Preservation

For the actual memory access that feeds the splitStruct, the pass creates a split load instruction:

function createSplitLoad(original_load, element_types):
    alignment = computeAlignment(original_load)
    split_load = sub_15F90A0(element_types, alignment, ...)
    additional_align = sub_1CCB4A0(data_layout, element_types)
    final_align = alignment & (-additional_align)  // min power-of-2
    return split_load

The resulting instruction carries the "split" name prefix. The alignment computation is described in detail in the next section.

Step 5: Use Replacement

After creating all scalar operations, sub_164D160 (RAUW -- Replace All Uses With) replaces every use of the original aggregate operation with the corresponding scalar element extraction:

sub_164D160(original_aggregate_inst, split_inst)

This is the same RAUW infrastructure used across CICC (also called from GlobalOpt, DSE, the inliner, and other passes). After replacement, the original aggregate instruction has zero uses and is eligible for dead code elimination.
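The rewiring mechanic can be modeled with a minimal SSA-style use list. This is an illustrative model of RAUW in general, not CICC's actual data structures.

```python
# Minimal model of Replace-All-Uses-With: every value tracks its users,
# and RAUW rewrites each user's operand list in place.
class Value:
    def __init__(self, name):
        self.name, self.users = name, []

class Inst(Value):
    def __init__(self, name, operands):
        super().__init__(name)
        self.operands = list(operands)
        for op in operands:
            op.users.append(self)

def replace_all_uses_with(old, new):
    for user in list(old.users):
        user.operands = [new if op is old else op for op in user.operands]
        new.users.append(user)
    old.users.clear()  # old value now has zero uses and is DCE-eligible

agg = Value("aggregate")
use1 = Inst("use1", [agg])
split = Value("splitStruct")
replace_all_uses_with(agg, split)
print(use1.operands[0].name, len(agg.users))  # splitStruct 0
```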

Alignment Preservation

The pass must preserve memory alignment when splitting aggregate loads/stores into per-element accesses. GPU memory transactions have strict alignment requirements: a misaligned access can silently produce wrong results or trap, depending on the address space and SM architecture.

The Alignment Formula

The decompiled alignment calculation is:

aligned_value = (1 << (alignment_field >> 1)) >> 1

Breaking this down:

  1. alignment_field >> 1 -- the alignment is stored in a compressed encoding where the field value is approximately 2 * log2(alignment) + bias.
  2. 1 << (result) -- converts back to a power-of-two alignment value.
  3. >> 1 -- corrects for the encoding's bias: the field stores roughly 2 * (log2(alignment) + 1), so the decoded power of two is double the real alignment and the final shift halves it.

For example, if alignment_field = 9, then 9 >> 1 = 4, 1 << 4 = 16, 16 >> 1 = 8, yielding 8-byte alignment. This encoding is compact and used throughout NVVM's type system to store alignment in a single byte.
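The decode can be checked directly. The function name below is ours; the expression follows the decompiled formula (shifts are left-associative).

```python
# Decode the compressed alignment field, as decompiled:
# aligned = (1 << (field >> 1)) >> 1
def decode_alignment(field):
    return (1 << (field >> 1)) >> 1

print(decode_alignment(9))  # 8, matching the worked example above
```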

Additional Alignment Computation

sub_1CCB4A0 provides a DataLayout-aware alignment computation for the element type. The final alignment is the minimum of the original alignment and the element's natural alignment, computed via:

final_align = original_align & (-element_natural_align)

For power-of-two inputs, the bitwise AND with the negation of the element alignment masks the original alignment to a multiple of the element's natural alignment; the net effect is that each per-element access is naturally aligned for its type without exceeding the original aggregate's alignment guarantee.

NVVM Type ID System

The pass operates on NVVM's proprietary type ID system, not LLVM's Type::TypeID. The size classification logic (decompiled lines 997--1030) reveals the mapping:

NVVM Type ID  Type                                  Bit Width
1             BFloat16 (i8 pair with padding)       16
2             Float                                 32
3             Double / i32 (context-dependent)      64
4             i64                                   80 (with padding to 10 bytes)
5, 6          FP128 / PPC FP128                     128
7             Pointer                               8 * DataLayout::getPointerSizeInBits(0)
9             Float (alternate, possibly metadata)  64
0xB (11)      Integer (arbitrary width)             element_encoding >> 8
0xD (13)      Array                                 8 * DataLayout::getStructLayout(type) total size
0xE (14)      Struct                                Recursive sum of element sizes
16            Vector                                Triggers vector-type wrapping via sub_16463B0

For struct types (ID 0xE), the size computation is recursive: the pass sums the sizes of all elements, each resolved through the same type-ID dispatch table. Array types (ID 0xD) use sub_15A9930 to look up the total allocation size from the DataLayout's StructLayout cache (which also handles arrays despite the name).

Nested Struct and Array Handling

When a struct element is itself a struct or an array, the pass recurses. The sub_159C470 (getScalarType) call during type decomposition flattens nested aggregates: a struct {i32, {f32, f64}, i16} decomposes not into three elements but into four scalars: i32, f32, f64, i16. The flattening continues until every element is a primitive scalar or a pointer.
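The flattening can be sketched recursively; nested aggregates are modeled here as tuples of element-type names, an assumption of this illustration rather than the pass's real representation.

```python
# Sketch of getScalarType-style flattening: nested structs and arrays
# flatten recursively until only primitive scalars remain.
def flatten(ty):
    """ty is either a scalar type name (str) or a tuple of element types."""
    if isinstance(ty, tuple):
        out = []
        for elem in ty:
            out.extend(flatten(elem))
        return out
    return [ty]

# {i32, {f32, f64}, i16} flattens to four scalars, not three elements
print(flatten(("i32", ("f32", "f64"), "i16")))  # ['i32', 'f32', 'f64', 'i16']
```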

Arrays within structs are handled differently depending on their size. Small arrays may be fully unrolled into individual element accesses. The size threshold is governed by the max-aggr-copy-size and large-aggr-store-limit knobs. Arrays that exceed the threshold are not decomposed into per-element loads but instead lowered to byte-copy loops (the "remsrc" / "remdst" / "i8dst" paths correspond to this remainder-byte handling when the aggregate cannot be evenly split into typed elements).

The remainder path:

  1. Computes the number of whole elements that can be extracted as typed loads.
  2. For any trailing bytes that do not fill a complete element, generates an i8 byte loop: "remsrc" is the source pointer for the remainder, "remdst" is the destination, and "i8dst" is the byte-typed destination pointer.
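The whole-element/remainder split reduces to simple integer arithmetic; the helper below is an illustrative sketch of that computation, not the decompiled code.

```python
# Sketch of the remainder-byte computation: a copy of `total` bytes
# splits into whole typed elements of `elem_size` bytes plus an i8 tail
# loop for the leftover bytes (the remsrc/remdst/i8dst path).
def split_copy(total, elem_size):
    """Returns (whole_elements, remainder_bytes)."""
    return total // elem_size, total % elem_size

whole, tail = split_copy(10, 4)
print(whole, tail)  # 2 2 -- two i32-sized loads plus a 2-byte i8 tail
```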

Relationship with SROA

LLVM's SROA (Scalar Replacement of Aggregates) and NVIDIA's struct splitting are complementary, not overlapping:

Aspect             LLVM SROA                           NVIDIA Struct Splitting
Target             alloca instructions in entry block  Non-alloca aggregate operations
Scope              Stack-allocated structs             Return values, call args, PHI nodes, memcpy results
IR level           LLVM IR (standard Type::TypeID)     NVVM IR (proprietary type IDs)
Pipeline position  Early scalar optimization passes    After LLVM optimization, NVVM lowering phase
Output             SSA scalars replacing alloca uses   splitStruct (opcode 32) multi-output nodes
Upstream           Standard LLVM pass                  No upstream equivalent

SROA runs during the standard LLVM optimization pipeline and eliminates alloca-based aggregates. By the time struct splitting runs, all remaining aggregate operations are those SROA could not handle: function return values carrying struct types, call sites passing or receiving struct-typed parameters, and aggregate-typed PHI nodes at control flow merges. Struct splitting is the final lowering step that ensures no aggregate-typed values survive into register allocation.

PTX Register Mapping

After struct splitting, every value in the IR is scalar-typed. During instruction selection and register allocation, each scalar maps to a PTX virtual register of the corresponding type:

// Before struct splitting:
%result = load {i32, f32, i64}, ptr %p, align 8

// After struct splitting:
%split = splitStruct {i32, f32, i64}   // opcode 32, multi-output
%r0 = extractelement %split, 0         // i32 -> %r1 (32-bit register)
%r1 = extractelement %split, 1         // f32 -> %f1 (32-bit FP register)
%r2 = extractelement %split, 2         // i64 -> %rd1 (64-bit register)

In PTX, register types are explicit:

  • %r registers: 32-bit integers
  • %rd registers: 64-bit integers
  • %f registers: 32-bit floats
  • %fd registers: 64-bit floats
  • %h registers: 16-bit values (half/bfloat)
  • %p registers: predicates (1-bit)

Without struct splitting, the register allocator would need to handle aggregate-typed live ranges, which is impossible on GPU hardware where the register file has no concept of a "struct register." The pass is therefore a hard prerequisite for correct register allocation.
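The scalar-type-to-register-class mapping above can be written out as a table. The register classes are standard PTX; the dictionary and function names are ours for illustration.

```python
# Illustrative mapping from flattened scalar types to PTX register
# classes, following the register list above.
PTX_REG_CLASS = {
    "i32": "%r",   "u32": "%r",    # 32-bit integer registers
    "i64": "%rd",  "u64": "%rd",   # 64-bit integer registers
    "f32": "%f",                   # 32-bit float registers
    "f64": "%fd",                  # 64-bit float registers
    "f16": "%h",   "bf16": "%h",   # 16-bit half/bfloat registers
    "i1":  "%p",                   # predicate registers
}

def reg_class(scalar_ty):
    return PTX_REG_CLASS[scalar_ty]

# The {i32, f32, i64} aggregate from the example above:
print([reg_class(t) for t in ("i32", "f32", "i64")])  # ['%r', '%f', '%rd']
```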

Pipeline Position

The pass runs as part of the NVVM lowering phase, after the main LLVM optimization pipeline has completed. It is registered as lower-aggr-copies in the New PM pipeline parser at index 417 (sub_2342890), with parameter lower-aggr-func-args controlling whether function argument aggregates are also lowered.

Pipeline position:
  LLVM Optimizer (SROA, GVN, DSE, etc.)
    -> NVIDIA NVVM Lowering Phase
      -> lower-struct-args (opt-byval)     [lower struct function args]
      -> lower-aggr-copies (lower-aggr-func-args)  [struct splitting]
      -> memory-space-opt                   [address space resolution]
      -> register allocation preparation

The companion pass lower-struct-args (pass index 418) handles byval-attributed function parameters specifically, converting struct-typed byval parameters into explicit copy + scalar access patterns. It runs before lower-aggr-copies to ensure that byval struct arguments are already decomposed when the main splitting pass encounters them.

Configuration

Knobs (ctor_265 at 0x4F48E0)

Knob                              Default  Description
devicefn-param-always-local       --       Treat parameter space as local in device functions
skiploweraggcopysafechk           false    Skip safety check in aggregate copy lowering
large-aggr-store-limit            --       Threshold for large aggregate store unrolling
max-aggr-copy-size                --       Maximum aggregate size for full decomposition
lower-aggr-unrolled-stores-limit  --       Limit on unrolled stores per aggregate copy

InstCombine Aggregate Knobs (ctor_086 at 0x49E670)

Knob                                Default  Description
max-aggr-lower-size                 128      Size threshold (bytes) below which InstCombine lowers aggregates
aggressive-max-aggr-lower-size      256      Aggressive threshold for aggregate lowering
instcombine-merge-stores-from-aggr  true     Merge stores originating from aggregate decomposition

lower-struct-args and Backend Knobs

Knob                               Scope              Description
lsa-opt                            lower-struct-args  Controls struct argument lowering
lower-read-only-devicefn-byval     lower-struct-args  Lower read-only device function byval params
hoist-load-param                   lower-struct-args  Hoist parameter loads
nvptx-force-min-byval-param-align  backend            Force 4-byte minimum alignment for byval params
nvptx-early-byval-copy             backend            Copy byval arguments early in the pipeline

Diagnostic Strings

"splitStruct"     -- Name prefix for the opcode-32 multi-output node
"srcptr"          -- Source pointer in aggregate copy lowering
"dstptr"          -- Destination pointer in aggregate copy lowering
"remsrc"          -- Remainder source pointer (byte-copy tail loop)
"remdst"          -- Remainder destination pointer (byte-copy tail loop)
"i8dst"           -- Byte-typed destination for remainder copies
"split"           -- Name prefix for the per-element split load
"ptr"             -- Name prefix for element pointer extractions
"vld"             -- Vector load variant in the second instance

Function Map

Primary Instance (sub_1C86CA0, 72KB)

Function                 Address      Role
Main driver              sub_1C86CA0  Top-level struct splitting pass
StructLayout query       sub_1643350  DataLayout::getStructLayout
Scalar type query        sub_159C470  Get scalar element type (recursive for nested structs)
Composite type creation  sub_15F9F50  Build composite type from element array
Alignment from DL        sub_1646BA0  Set type alignment from DataLayout
Vector type wrapping     sub_16463B0  Wrap in vector type if original was vector
Instruction creation     sub_15F1EA0  InitInstruction(type, opcode=32, parent, nops, flags)
Element extraction       sub_15FD590  Create indexed load from multi-output node
Split load creation      sub_15F90A0  Create load with alignment preservation
Alignment computation    sub_1CCB4A0  DataLayout-aware alignment for element type
Use replacement          sub_164D160  RAUW (Replace All Uses With)
Pointer size query       sub_15A9520  DataLayout::getPointerSizeInBits(AS)
Struct size query        sub_15A9930  DataLayout::getStructLayout for size lookup

Second Instance (sub_2CCF450, 58KB)

Function            Address      Role
Aggregate lowering  sub_2CCF450  lower-aggr-copies pass implementation

Pipeline Registration

Function                  Address      Role
New PM registration       sub_2342890  Pass index 417 (lower-aggr-copies)
Parameter parser          sub_233A3B0  Parses lower-aggr-func-args parameter
lower-struct-args parser  sub_233A370  Parses opt-byval parameter

Test This

The following kernel returns a struct from a device function. Struct splitting should decompose the aggregate return value into individual scalar registers.

struct Result {
    float value;
    int   index;
    float confidence;
};

__device__ Result compute(const float* data, int tid) {
    Result r;
    r.value      = data[tid] * 2.0f;
    r.index      = tid;
    r.confidence = 0.95f;
    return r;
}

__global__ void struct_split_test(const float* in, float* out_val,
                                   int* out_idx, float* out_conf, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Result r = compute(in, tid);
    out_val[tid]  = r.value;
    out_idx[tid]  = r.index;
    out_conf[tid] = r.confidence;
}

What to look for in PTX:

  • The compute function should be inlined, but even if it is not, the struct return should be decomposed. Look for the absence of .local memory for the Result struct -- all three fields (value, index, confidence) should live in individual PTX registers (%f for floats, %r for int).
  • No ld.local/st.local pairs for passing the struct between compute and the kernel. If the struct survives unsplit, the caller allocates local memory for the return value, the callee stores into it, and the caller loads from it -- a 200+ cycle penalty per field.
  • In the PTX, the three stores to out_val, out_idx, out_conf should use values directly from registers without any intermediate local memory traffic. Look for st.global.f32 and st.global.u32 with register operands, not loaded-from-local operands.
  • To see the unsplit case, make compute a __noinline__ function and compile at -O0. The struct will be passed through .param space with explicit st.param/ld.param sequences, showing the overhead that struct splitting eliminates.

Cross-References