PrologEpilogInserter & Frame Layout
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA GPUs have no hardware stack pointer. There is no push, no pop, no %rsp — the entire concept of a "stack frame" is a compiler fiction. When a CUDA kernel needs local storage (spill slots, alloca, local arrays), cicc allocates a byte array called __local_depot in PTX .local address space and computes all offsets at compile time. The PrologEpilogInserter (PEI) pass is responsible for this: it takes abstract MachineFrameInfo frame indices produced by register allocation and earlier lowering, assigns concrete byte offsets within the depot, emits the two-instruction prologue that sets up the %SP/%SPL pseudo-registers, and rewrites every frame-index operand in the MachineFunction to [%SP + offset] form. At 68 KB and ~2,400 decompiled lines, cicc's PEI is a heavily modified monolith — the upstream open-source NVPTX backend replaces LLVM's standard PEI with a stripped-down 280-line NVPTXPrologEpilogPass that handles only offset calculation and frame-index elimination. cicc restores and extends nearly all of the standard PEI's functionality: callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering, and a stack-size diagnostic system.
| Property | Value |
|---|---|
| Binary address | sub_35B1110 (0x35B1110) |
| Binary size | 68,332 bytes (~2,388 decompiled lines) |
| Pass identity | PrologEpilogInserter::runOnMachineFunction |
| Pass position | Post-register-allocation, before NVPTXPeephole |
| Stack frame | 0x490 bytes of local state (~400 variables) |
| Upstream equivalent | NVPTXPrologEpilogPass (280 lines) + NVPTXFrameLowering (101 lines) |
| Key strings | "warn-stack-size", "stack frame size" |
| Knobs | warn-stack-size (function attribute), nvptx-short-ptr, nv-disable-mem2reg |
The GPU "Stack" Model
__local_depot: The Frame Array
Every PTX function that needs local storage declares a .local byte array:
.local .align 16 .b8 __local_depot0[256];
This is the entire "stack frame." The alignment value is the maximum alignment of any object in the frame. The size is the total frame size computed by PEI. The suffix number (0, 1, ...) is the function index within the module.
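As a minimal sketch of how the three pieces of that declaration fit together, the helper below formats the directive from a function index, an alignment, and a size. The function name `formatLocalDepot` and its parameter names are hypothetical and are not taken from cicc.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a cicc symbol): builds the PTX depot declaration
// for one function from the values PEI is described as computing.
std::string formatLocalDepot(unsigned fnIndex, unsigned alignBytes,
                             std::uint64_t sizeBytes) {
    return ".local .align " + std::to_string(alignBytes) +
           " .b8 __local_depot" + std::to_string(fnIndex) +
           "[" + std::to_string(sizeBytes) + "];";
}
```

Calling `formatLocalDepot(0, 16, 256)` reproduces the example line shown above.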
There is no call stack in the CPU sense. Each GPU thread has a bounded local memory allocation (up to 512 KB per thread on modern architectures). The .local directive reserves a region within this per-thread memory. Recursive functions and dynamic allocations are legal in PTX, but the driver and ptxas resolve their addresses; cicc only needs to produce a statically sized depot for each function's fixed-size locals.
%SP and %SPL: The Two Frame Pseudo-Registers
PTX declares two pseudo-registers for frame access:
.reg .b64 %SP; // generic address space pointer to the frame
.reg .b64 %SPL; // local address space (AS 5) pointer to the frame
In 32-bit mode these are .reg .b32. The distinction exists because NVIDIA GPUs use address space qualification:
- %SPL (Stack Pointer Local): points directly into the .local address space (PTX address space 5). Loads/stores using %SPL emit ld.local/st.local instructions, which ptxas can optimize for the L1 cache local-memory path. This is the efficient pointer.
- %SP (Stack Pointer): a generic address space pointer obtained by converting %SPL via cvta.local. Loads/stores using %SP go through generic address resolution, which adds a TLB lookup to determine the address space at runtime. This is slower but required when the address escapes to code that expects generic pointers (e.g., passing a local variable's address to a called function).
The prologue sequence is:
mov.u64 %SPL, __local_depot0; // MOV_DEPOT_ADDR_64
cvta.local.u64 %SP, %SPL; // cvta_local_64
The cvta.local (Convert Address) instruction is the key: it takes a .local pointer and produces the equivalent generic-space pointer. When nvptx-short-ptr is enabled, %SPL is 32 bits (sufficient for the per-thread local memory window, always < 4 GB) while %SP may still be 64 bits on 64-bit targets.
Upstream's NVPTXFrameLowering::emitPrologue implements this directly. It checks MachineRegisterInfo::use_empty for each register — if %SP has no uses, it skips the cvta.local; if %SPL has no uses, it skips the mov.depot. The NVPTXPeephole pass runs immediately after PEI and rewrites LEA_ADDRi64 %VRFrame64, offset followed by cvta_to_local_64 into LEA_ADDRi64 %VRFrameLocal64, offset, eliminating the generic-to-local conversion when the address stays in local space.
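The skip logic described for upstream can be sketched as a small function that returns the PTX prologue lines. This is an illustration only: `FrameRegUses` and `emitProloguePtx` are hypothetical names, the 64-bit opcodes are assumed, and the sketch treats the `cvta.local` as a use of %SPL, so the `mov` is kept whenever %SP is needed.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of MachineRegisterInfo::use_empty results.
struct FrameRegUses {
    bool spUsed;   // some instruction reads %SP
    bool splUsed;  // some instruction reads %SPL
};

// Returns the prologue lines in emission order (mov first, then cvta).
std::vector<std::string> emitProloguePtx(unsigned fnIndex, FrameRegUses uses) {
    std::vector<std::string> out;
    // The cvta.local reads %SPL, so %SPL is live whenever %SP is.
    const bool needSpl = uses.splUsed || uses.spUsed;
    if (needSpl)
        out.push_back("mov.u64 %SPL, __local_depot" +
                      std::to_string(fnIndex) + ";");
    if (uses.spUsed)
        out.push_back("cvta.local.u64 %SP, %SPL;");
    return out;
}
```

With both registers unused the function returns no lines, which matches the case where no prologue is needed at all.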
Frame Index Resolution
During instruction selection and register allocation, local memory references use abstract frame indices: %stack.0, %stack.1, etc. Each maps to a MachineFrameInfo frame object with a size, alignment, and (after PEI) a byte offset.
Frame-index elimination in upstream is simple — NVPTXRegisterInfo::eliminateFrameIndex replaces the frame-index operand with VRFrame (which prints as %SP) and sets the immediate offset:
MI.getOperand(FIOperandNum).ChangeToRegister(getFrameRegister(MF), false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
The VRDepot physical register (prints as %Depot internally) serves as the canonical frame base in getFrameIndexReference. For debug info, %Depot is remapped to %SP since cuda-gdb resolves stack frames via the generic pointer.
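The operand rewrite can be modeled on a toy operand list. The sketch below is not LLVM code: `Operand`, `OpKind`, and `kFrameReg` are stand-ins invented for illustration, and the final offset is assumed to be the object's frame offset plus the immediate already present on the instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class OpKind { Register, Immediate, FrameIndex };

struct Operand {
    OpKind kind;
    std::int64_t value;  // register id, immediate, or frame index
};

constexpr std::int64_t kFrameReg = 1;  // stand-in for VRFrame (%SP)

// Rewrites the (FrameIndex, Immediate) operand pair at fiOperandNum into
// (Register %SP, Immediate offset), mirroring the two calls shown above.
void eliminateFrameIndex(std::vector<Operand>& ops, std::size_t fiOperandNum,
                         const std::vector<std::int64_t>& objectOffsets) {
    const std::size_t fi = static_cast<std::size_t>(ops[fiOperandNum].value);
    const std::int64_t offset =
        objectOffsets[fi] + ops[fiOperandNum + 1].value;
    ops[fiOperandNum] = Operand{OpKind::Register, kFrameReg};
    ops[fiOperandNum + 1] = Operand{OpKind::Immediate, offset};
}
```

For an instruction that references %stack.1 with immediate 4, and a frame where object 1 sits at offset 16, the operand pair becomes %SP and 20.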
Frame Layout Algorithm
cicc's PEI executes in ten sequential phases within a single monolithic function. The algorithm is significantly more sophisticated than upstream's linear scan.
Phase 1–2: Setup and Callee-Saved Registers (lines 443–566)
Retrieves the TargetFrameLowering and TargetRegisterInfo from the MachineFunction's subtarget. If callee-saved registers exist (determined by vtable(FrameLowering, +480)), allocates a 0xA8-byte callee-save info structure at PEI state offset +200 containing two inline SmallVectors for register indices.
On GPU targets, callee-saved registers are unusual — PTX functions use a fully virtual register file, so there is no hardware register saving in the CPU sense. However, cicc models device-function calling conventions that may require preserving certain virtual registers across calls, and this mechanism handles that.
Phase 3: Fixed Object Collection (lines 567–730)
Initializes a chunk table (deque-like structure) with -4096 sentinel values. Collects prolog/epilog insertion points from the PEI state arrays at offsets +216 (prolog points, count at +224) and +264 (epilog points, count at +272).
When callee-saves exist and optimization level is not 20 (a special threshold), manually inserts save/restore instructions:
- Simple saves: storeRegToStackSlot(MBB, MI, reg, kill=1, FI, RC, TRI)
- Compound saves: handles sub-register decomposition via sub_2F26260 when byte+9 == 1 in the callee-save info.
Phase 4: Offset Assignment — The Core Layout Engine (lines 733–1070)
This is the heart of PEI. It assigns byte offsets within __local_depot to every frame object.
MachineFrameInfo layout:
StackDirection: 1 = grows-negative (toward lower addresses)
0 = grows-positive (toward higher addresses)
LocalFrameSize: initial offset base
NumFixedObjects: count of pre-positioned objects
MaxAlignment: tracks largest alignment seen
Fixed objects are laid out first. Each frame object is a 40-byte record:
| Offset | Type | Field |
|---|---|---|
| +0 | i64 | Byte offset (written by PEI) |
| +8 | i64 | Object size in bytes |
| +16 | u8 | Alignment (log2) |
| +20 | u8 | isDead flag |
| +32 | u8 | isSpillSlot flag |
| +36 | u8 | Category (0–3) |
The alignment formula appears ~20 times throughout the pass:
// Round up 'value' to next multiple of (1 << align_log2):
aligned = -(1 << align_log2) & (value + (1 << align_log2) - 1);
// Equivalent to: aligned = (value + mask) & ~mask where mask = (1<<n) - 1
For grows-negative direction, offsets are stored as negative values; for grows-positive, they accumulate upward.
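The rounding formula and the two growth directions can be combined in a short sketch. The direction handling here follows the common LLVM convention (grow-down reserves the size first and then aligns, grow-up aligns first and then reserves); that cicc orders the steps the same way is an assumption, and `alignUpLog2` and `placeObject` are hypothetical names.

```cpp
#include <cstdint>

// Rounds value up to the next multiple of (1 << alignLog2), using the
// same expression quoted above.
std::int64_t alignUpLog2(std::int64_t value, unsigned alignLog2) {
    const std::int64_t a = std::int64_t{1} << alignLog2;
    return -a & (value + a - 1);
}

// Places one object and returns the offset stored in its record.
// `running` is the number of bytes allocated so far and is updated.
std::int64_t placeObject(std::int64_t& running, std::int64_t size,
                         unsigned alignLog2, bool growsNegative) {
    if (growsNegative) {
        // Reserve the bytes, align the far edge, store a negative offset.
        running = alignUpLog2(running + size, alignLog2);
        return -running;
    }
    // Align the start, store it, then reserve the bytes.
    running = alignUpLog2(running, alignLog2);
    const std::int64_t offset = running;
    running += size;
    return offset;
}
```

Starting from 4 bytes already allocated, an 8-byte object with 8-byte alignment lands at offset -16 when growing negative and at offset 8 when growing positive; both cases leave the running total at 16.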
Callee-saved region is laid out next, iterating frame indices in range [PEI+208 .. PEI+212]. Each CSR object gets an aligned offset using the same formula.
Separate stack area: if MachineFrameInfo+665 flag is set, NVIDIA supports a physically separate stack region with its own alignment at +664 and total size at +656. This likely corresponds to a distinct .local segment for shared-memory scratch or ABI-reserved zones.
Phase 5: Categorized Local Variable Layout (lines 1060–1600)
This is cicc's most significant divergence from upstream PEI. Objects are classified into three priority buckets by a category byte at frame-object offset +36:
| Category | Bucket | Typical contents | Layout order |
|---|---|---|---|
| 3 | v427 | Vector/tensor spills (high alignment) | First |
| 2 | v419 | Medium-aligned objects | Second |
| 1 | v412 | General locals | Third |
| 0 | — | Skip (already placed or dead) | — |
Each bucket is processed by sub_35B0830 which assigns aligned offsets. The ordering minimizes alignment waste: laying out large-alignment objects first avoids padding gaps.
Objects are skipped if:
- They are spill slots in a separate stack area
- They fall within the callee-saved index range
- Their size is -1 (sentinel for dynamic-size objects)
- They are the frame-pointer object
- They are dead
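The bucket ordering can be sketched as three passes over the object list, highest category first. This is a simplified model, not a reconstruction of sub_35B0830: it assumes the grows-positive direction, applies only the dead and dynamic-size skip rules, and uses a hypothetical `FrameObject` struct with just the fields it needs.

```cpp
#include <cstdint>
#include <vector>

struct FrameObject {
    std::int64_t offset;   // written by the layout
    std::int64_t size;     // -1 marks a dynamic-size object
    unsigned alignLog2;
    bool dead;
    unsigned category;     // 0 = skip, 1..3 = bucket
};

// Lays out category 3, then 2, then 1, and returns the new running offset.
std::int64_t layoutByCategory(std::vector<FrameObject>& objs,
                              std::int64_t running) {
    for (unsigned cat = 3; cat >= 1; --cat) {
        for (FrameObject& o : objs) {
            if (o.category != cat || o.dead || o.size < 0)
                continue;  // wrong bucket, dead, or dynamic-size
            const std::int64_t a = std::int64_t{1} << o.alignLog2;
            running = (running + a - 1) & -a;  // align the start
            o.offset = running;
            running += o.size;
        }
    }
    return running;
}
```

With a 16-byte-aligned category-3 object, an 8-byte-aligned category-2 object, and a 4-byte-aligned category-1 object, the objects land at 0, 16, and 24 with no padding between them, which is the effect the ordering is meant to achieve.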
Bitmap-Based Packing — When register count is nonzero and canUseStackBitmap returns true (frame size <= 0x7FFFFFFF), cicc builds a bitset representing every byte of the frame:
// Bitmap size in qwords:
bitmap_size = (frame_size + 63) >> 6;
// Mark all bytes as free (bits set to 1)
// Then clear bits for fixed objects and CSR objects
for each placed_object:
clear bits [offset .. offset + size)
For each unassigned general object, the algorithm scans the bitmap using tzcnt (trailing zero count) to find contiguous runs of set bits that match the object's size and alignment:
for each unassigned_obj in v412:
    candidate = tzcnt_scan(bitmap, obj.size);
    if (candidate != NOT_FOUND
        and aligned(candidate, obj.alignment)                          // verify alignment
        and all_bits_set(bitmap, candidate, candidate + obj.size)):    // verify every byte is free
        assign_offset(obj, candidate);
        clear_bits(bitmap, candidate, candidate + obj.size);
        continue;
    // Fallback: linear allocation at end of frame
    offset = align(running_offset, obj.alignment);
    assign_offset(obj, offset);
    running_offset = offset + obj.size;
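This is substantially more aggressive than the upstream NVPTX PrologEpilogPass, which has no packing at all and places objects in a single linear pass. Stock LLVM PEI contains a comparable but optional mechanism (computeFreeStackSlots and scavengeStackSlot over a StackBytesFree bit vector, gated by TargetFrameLowering::enableStackSlotScavenging), which the open-source NVPTX backend never reaches because it replaces PEI entirely. The bitmap enables reuse of "holes" left by fixed objects, callee-saves, and dead objects.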
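A minimal sketch of the hole-finding idea follows. It keeps one bit per frame byte in 64-bit words, as the text describes, but it searches with a plain first-fit loop rather than tzcnt, so it illustrates the data structure and not the binary's exact scan. `FrameBitmap` and its methods are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per frame byte; a set bit means the byte is free.
class FrameBitmap {
public:
    explicit FrameBitmap(std::size_t frameSize)
        : size_(frameSize),
          words_((frameSize + 63) >> 6, ~std::uint64_t{0}) {}

    bool isFree(std::size_t i) const {
        return ((words_[i >> 6] >> (i & 63)) & 1) != 0;
    }

    // Marks bytes [begin, end) as occupied.
    void clearRange(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            words_[i >> 6] &= ~(std::uint64_t{1} << (i & 63));
    }

    // First fit: lowest aligned offset whose [offset, offset + size) is
    // entirely free, or -1 when no hole exists (caller falls back to
    // linear allocation at the end of the frame).
    std::int64_t findHole(std::size_t size, unsigned alignLog2) const {
        const std::size_t step = std::size_t{1} << alignLog2;
        for (std::size_t start = 0; start + size <= size_; start += step) {
            std::size_t i = start;
            while (i < start + size && isFree(i)) ++i;
            if (i == start + size)
                return static_cast<std::int64_t>(start);
        }
        return -1;
    }

private:
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};
```

In a 64-byte frame with bytes 0..7 and 12..31 occupied, a 4-byte object with 4-byte alignment fits into the hole at offset 8, while an 8-byte object with 8-byte alignment has to go to offset 32.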
Phase 6: Final Alignment and Frame Size (lines 1688–1795)
After all objects are laid out:
- If targetHandlesStackFrameRounding returns true, skip to finalization.
- Add MaxCallFrameSize to the running offset if the function adjusts the stack.
- Choose alignment: StackAlign (from TFI.getStackAlign()) for functions with calls or alloca, or TransientStackAlign for leaf functions. The subtarget stores these at FrameLowering[12] and [13] respectively.
- Round up: final = align(running_offset, max(StackAlign, MaxAlignment)).
- If alignment changed the total and direction is grows-negative, shift all callee-save offsets by the delta to maintain correct relative positions.
- Write FrameInfo.StackSize = final_offset - initial_offset.
This value becomes the SIZE in .local .align ALIGN .b8 __local_depotN[SIZE].
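The rounding steps can be condensed into one function. This sketch covers only the arithmetic: the struct and field names are hypothetical, alignments are plain byte counts assumed to be powers of two, and the callee-save shift for the grows-negative case is left out.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical bundle of the values the final-alignment phase reads.
struct FrameSizeInputs {
    std::int64_t initialOffset;
    std::int64_t runningOffset;
    std::int64_t maxCallFrameSize;
    bool adjustsStack;
    bool hasCallsOrAlloca;
    unsigned stackAlign;           // bytes, power of two
    unsigned transientStackAlign;  // bytes, power of two
    unsigned maxObjectAlign;       // bytes, power of two
};

// Returns the value that would be written to FrameInfo.StackSize.
std::int64_t computeStackSize(const FrameSizeInputs& in) {
    std::int64_t offset = in.runningOffset;
    if (in.adjustsStack)
        offset += in.maxCallFrameSize;
    const std::int64_t base =
        in.hasCallsOrAlloca ? in.stackAlign : in.transientStackAlign;
    const std::int64_t align =
        std::max<std::int64_t>(base, in.maxObjectAlign);
    offset = (offset + align - 1) & -align;  // round up
    return offset - in.initialOffset;
}
```

A leaf function with 52 bytes of locals, an 8-byte transient alignment, and one 16-byte-aligned object ends up with a 64-byte depot.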
Phase 7: Prologue/Epilogue Insertion (lines 1803–1872)
Executed when optimization level is not at threshold 20. For each prolog insertion point, calls emitPrologue(MF, MBB) via RegisterInfo vtable at +96. For each epilog point, calls emitEpilogue(MF, MBB) at +104.
Post-fixup via sub_35AC7B0, then a second pass over prolog points for insertPrologueSaveCode (vtable +152, if not a null stub).
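Architecture-specific extension: checks ((*(Module+2) >> 4) & 0x3FF) == 0xB, which this analysis reads as SM arch code 11. When matched, calls an additional prolog handler at vtable +176. This likely targets an early or internal SM variant, although the same shift-and-mask pattern is how upstream LLVM decodes a function's calling convention, where the value 11 is CallingConv::HiPE, so the check may simply be inherited from stock PEI.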
Phase 8–9: Frame Index Elimination (lines 1873–2268)
Two strategies selected by vtable(FrameLowering, +616):
Forward elimination (Path A): walks each MBB's instruction list forward. For each instruction, checks the opcode against FRAME_SETUP and FRAME_DESTROY pseudos — these adjust the SP offset tracker. For other instructions, scans operands for type-5 (FrameIndex), then calls sub_35ABF20 to attempt elimination or falls back to the target-specific handler.
Backward elimination (Path B): same logic but iterates instructions in reverse order. Handles FRAME_SETUP/FRAME_DESTROY with different SP adjustment accumulation.
This dual-path approach is unique to cicc — upstream NVPTX PrologEpilogPass only does a single backward walk. The forward path may be needed for instructions where the SP adjustment at a given point depends on preceding pseudo-ops.
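The forward path's SP-adjustment tracking can be illustrated on a toy instruction list. Everything here is a simplification: `Inst` and `Opcode` are invented for the example, the sign convention (setup adds, destroy subtracts) is an assumption, and the real pass delegates the actual operand rewrite to sub_35ABF20 or the target handler.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Opcode { FrameSetup, FrameDestroy, Other };

struct Inst {
    Opcode op;
    std::int64_t amount;    // bytes for FrameSetup/FrameDestroy, else 0
    int frameIndex;         // -1 when there is no frame-index operand
    std::int64_t resolved;  // filled in by the walk
};

// Walks one block front to back, tracking the pending SP adjustment and
// resolving each frame index against it.
void eliminateForward(std::vector<Inst>& block,
                      const std::vector<std::int64_t>& objectOffsets) {
    std::int64_t spAdj = 0;
    for (Inst& mi : block) {
        if (mi.op == Opcode::FrameSetup) { spAdj += mi.amount; continue; }
        if (mi.op == Opcode::FrameDestroy) { spAdj -= mi.amount; continue; }
        if (mi.frameIndex >= 0) {
            const std::size_t fi = static_cast<std::size_t>(mi.frameIndex);
            mi.resolved = objectOffsets[fi] + spAdj;
        }
    }
}
```

An access to the same frame index resolves to a different offset inside a FRAME_SETUP/FRAME_DESTROY bracket than outside it, which is why the walk order matters when the adjustment depends on preceding pseudo-ops.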
Phase 10: Diagnostics and Cleanup (lines 2270–2388)
Stack size warning: default threshold is 0xFFFFFFFF (4 GB, effectively disabled). If the function has a "warn-stack-size" attribute, it parses the value via strtoul(str, &end, 10). When the total frame size (plus optional regspill area at MF+86*wordsize if opt-level flag 55 is set) exceeds the threshold, emits a "stack frame size" diagnostic.
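The threshold logic can be sketched with the same `strtoul` call the text mentions. The function names are hypothetical, and the handling of an unparsable attribute value (keeping the default) is an assumption, since the text does not say what cicc does in that case.

```cpp
#include <cstdint>
#include <cstdlib>

// Parses the "warn-stack-size" attribute value; a null pointer means the
// attribute is absent. Returns the default 0xFFFFFFFF when absent or,
// by assumption, when no digits could be parsed.
std::uint64_t parseWarnStackSize(const char* attrValue) {
    const std::uint64_t kDisabled = 0xFFFFFFFFull;
    if (attrValue == nullptr)
        return kDisabled;
    char* end = nullptr;
    const unsigned long v = std::strtoul(attrValue, &end, 10);
    if (end == attrValue)
        return kDisabled;
    return v;
}

// True when the "stack frame size" diagnostic should be emitted.
bool shouldWarn(std::uint64_t frameSize, std::uint64_t regSpillArea,
                bool includeRegSpill, std::uint64_t threshold) {
    const std::uint64_t total =
        frameSize + (includeRegSpill ? regSpillArea : 0);
    return total > threshold;
}
```

With a threshold of 1024, a 1000-byte frame stays silent on its own but triggers the diagnostic once a 100-byte regspill area is counted.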
Stack annotation: if annotation output is enabled (checked via sub_B6EA50/sub_B6F970), formats and writes stack-size metadata to the analysis output for the NVVM container.
Cleanup frees the 0xA8 callee-save info structure, resets prolog/epilog point counts, resets frame metadata, and walks the chunk table to free non-inline instruction arrays.
Dynamic Stack Allocation (alloca)
PTX supports alloca semantics at the LLVM IR level — the alloca instruction lowers to a local memory reservation. However, truly dynamic-sized allocations (variable-length arrays, runtime alloca(N)) are constrained:
- MachineFrameInfo.hasVarSizedObjects (flag at +36) tracks whether the function contains VLA-style allocations.
- When present, PEI selects StackAlign (the full stack alignment) rather than TransientStackAlign for final frame rounding.
- ptxas ultimately resolves dynamic allocations at JIT time, not cicc. cicc's role is to set up the frame pointer correctly so that dynamic objects can be addressed relative to it.
- The FramePointerIndex (at MachineFrameInfo+68) is laid out last among general objects, ensuring the frame pointer anchors the top of the fixed frame with dynamic objects growing beyond it.
For fixed-size allocas, SROA (Scalar Replacement of Aggregates) typically promotes them to SSA registers before PEI ever runs. When SROA succeeds for all allocas, MachineFrameInfo has no stack objects and PEI emits no __local_depot at all — the function runs entirely in registers.
Spill Slots
Register spills are the primary consumer of __local_depot space. When the register allocator cannot fit a virtual register's live range into the available physical registers, it creates a spill slot — a frame object marked with isSpillSlot = 1 (byte at frame-object +32).
Spill-slot frame objects are created during register allocation. PEI does not create them; it only assigns their offsets. In cicc, spill slots interact with the categorized layout:
- Spill slots in a separate stack area (when hasSeparateStackArea is set) are excluded from the general layout and handled in Phase 4's separate-area processing.
- Remaining spill slots are classified into categories 1–3 based on their alignment requirements and register class: vector register spills (e.g., 128-bit %rq registers) end up in category 3, scalar spills in category 1.
After PEI assigns offsets, the spill loads/stores reference [%SP + offset] or [%SPL + offset] directly. The post-PEI NVPTXPeephole pass optimizes these: when a LEA_ADDRi64 %VRFrame64, offset feeds directly into cvta_to_local_64, the peephole collapses this to LEA_ADDRi64 %VRFrameLocal64, offset, saving the generic address conversion.
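The peephole rewrite can be sketched as a pattern match over adjacent instructions. `MInst` is a toy record invented for this example, opcodes and registers are plain strings, and the sketch assumes the LEA result has no other users, so it can be removed once the pair is collapsed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct MInst {
    std::string opcode;   // e.g. "LEA_ADDRi64", "cvta_to_local_64"
    int def;              // virtual register defined
    std::string base;     // base register for LEA, empty otherwise
    std::int64_t offset;  // immediate for LEA, 0 otherwise
    int use;              // virtual register consumed, -1 if none
};

// Collapses "LEA %VRFrame64, off" + "cvta_to_local_64" into a single
// "LEA %VRFrameLocal64, off". Returns the number of pairs rewritten.
std::size_t combineCvtaToLocal(std::vector<MInst>& insts) {
    std::size_t rewritten = 0;
    for (std::size_t i = 0; i + 1 < insts.size(); ++i) {
        const MInst& lea = insts[i];
        const MInst& cvta = insts[i + 1];
        if (lea.opcode != "LEA_ADDRi64" || lea.base != "%VRFrame64" ||
            cvta.opcode != "cvta_to_local_64" || cvta.use != lea.def)
            continue;
        // The replacement defines the register the cvta used to define.
        insts[i + 1] =
            MInst{"LEA_ADDRi64", cvta.def, "%VRFrameLocal64", lea.offset, -1};
        insts.erase(insts.begin() + static_cast<std::ptrdiff_t>(i));
        ++rewritten;
    }
    return rewritten;
}
```

After the rewrite, consumers of the cvta result read an address formed directly from the local-space frame register, so no generic-to-local conversion remains.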
Interaction with SROA
SROA runs early in the optimization pipeline (see SROA) and aggressively promotes alloca instructions to SSA values. For many GPU kernels — especially those that avoid taking addresses of locals — SROA eliminates all allocas, resulting in an empty MachineFrameInfo. In this case:
- PEI's frame size computes to 0.
- The PTX emitter (sub_2158E80) checks FrameInfo.StackSize; if zero, it emits no .local directive and no %SP/%SPL declarations.
- The function runs entirely in the virtual register file, the ideal case for GPU performance.
When SROA cannot promote (address-taken locals, aggregates too large for SROA's threshold controlled by sroa-size-limit, or when sroa-skip-mem2reg is set), PEI becomes essential. Additionally, cicc has a custom MI Mem2Reg pass (nv-disable-mem2reg controls it) that runs post-register-allocation and promotes MachineIR local-memory accesses back to registers — effectively a second chance at eliminating __local_depot usage after regalloc.
Comparison with Upstream
| Aspect | Upstream NVPTXPrologEpilogPass | cicc sub_35B1110 |
|---|---|---|
| Size | 280 lines | ~2,400 lines |
| Callee-saved regs | Not handled | Full save/restore infrastructure |
| Register scavenging | Not used | Both forward and backward paths |
| Layout algorithm | Single linear pass over all objects | Categorized 3-bucket layout + bitmap packing |
| Frame packing | None — objects placed sequentially | tzcnt-accelerated bitmap hole-finding |
| Stack direction | Supports both, simple | Supports both, with per-direction callee-save adjustment |
| Diagnostics | None | warn-stack-size attribute + annotation output |
| Separate stack area | Not supported | Full support (flag at MFI+665) |
| Arch-specific prolog | None | SM arch code 0xB extension |
| Optimization gating | None | opt-level 20 skips prolog/epilog emission |
| Frame-index elimination | Single backward walk | Dual forward/backward strategies |
The upstream pass explicitly disables LLVM's standard PrologEpilogCodeInserterID and replaces it. cicc's version is closer to the full standard LLVM PEI but with GPU-specific extensions — it re-enables callee-saved handling, register scavenging, and the frame-rounding logic that upstream strips out.
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| warn-stack-size | Function attribute (string→int) | 0xFFFFFFFF (disabled) | Emit diagnostic when frame size exceeds threshold |
| nvptx-short-ptr | cl::opt<bool> | false | Use 32-bit pointers for local/const/shared address spaces; affects %SPL width |
| nv-disable-mem2reg | cl::opt<bool> | false | Disable post-regalloc MI Mem2Reg pass (more objects remain for PEI to lay out) |
| sroa-size-limit | cl::opt<int> | (varies) | Max aggregate size SROA will promote; larger values reduce PEI workload |
| Opt-level flag 20 | Internal | — | Skips prolog/epilog instruction emission and callee-save handling |
| Opt-level flag 55 | Internal | — | Includes regspill area in stack-size diagnostic total |
| FrameLowering[12] | Subtarget | arch-dependent | Stack alignment for functions with calls/alloca |
| FrameLowering[13] | Subtarget | arch-dependent | Stack alignment for leaf functions (TransientStackAlign) |
Key Data Structures
MachineFrameInfo (at MachineFunction+48)
Offset Type Field
+8 ptr Objects array base pointer (40-byte records)
+16 ptr Objects array end pointer
+32 i32 NumFixedObjects
+36 u8 hasVarSizedObjects
+48 i64 StackSize ← WRITTEN by PEI
+64 u8 MaxAlignment (log2)
+65 u8 hasCalls / needsStackAlignment
+68 i32 FramePointerIndex (-1 if none)
+80 i64 MaxCallFrameSize (-1 if unknown)
+96 ptr Separate-area array base
+104 ptr Separate-area array end
+120 u8 hasCalleeSaves ← SET by PEI
+128 ptr Extra-area array pointer
+136 i64 Extra-area count
+656 i64 Separate area total size
+664 u8 Separate area alignment
+665 u8 hasSeparateStackArea flag
PEI State (pass object, offset from a1)
Offset Type Field
+8 ptr Analysis list (tagged analysis pointers)
+200 ptr Callee-save info (0xA8-byte struct, or null)
+208 u32 First CSR frame index
+212 u32 Last CSR frame index
+216 ptr Prolog insertion points array
+224 u32 Prolog point count
+264 ptr Epilog insertion points array
+272 u32 Epilog point count
+312 u8 hasReservedCallFrame flag
+313 u8 requiresRegisterScavenging flag
+320 ptr Stack-size annotation analysis pointer
Frame Object Record (40 bytes each)
Offset Type Field
+0 i64 Byte offset in __local_depot (assigned by PEI)
+8 i64 Object size in bytes
+16 u8 Alignment (log2 encoding)
+20 u8 isDead flag
+32 u8 isSpillSlot flag
+36 u8 Category: 0=skip, 1=general, 2=medium, 3=large
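The record layout above can be written as a C++ struct with compile-time checks on every documented offset. The field names follow the table; the padding arrays stand for bytes whose meaning the table does not describe, and treating them as padding is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative mirror of the 40-byte frame object record.
struct FrameObjectRecord {
    std::int64_t offset;       // +0  byte offset in __local_depot
    std::int64_t size;         // +8  object size in bytes
    std::uint8_t alignLog2;    // +16 alignment, log2 encoded
    std::uint8_t unknown17[3]; // +17 not described in the table
    std::uint8_t isDead;       // +20
    std::uint8_t unknown21[11];// +21 not described in the table
    std::uint8_t isSpillSlot;  // +32
    std::uint8_t unknown33[3]; // +33 not described in the table
    std::uint8_t category;     // +36 0=skip, 1=general, 2=medium, 3=large
    std::uint8_t unknown37[3]; // +37 not described in the table
};

static_assert(sizeof(FrameObjectRecord) == 40, "record must be 40 bytes");
static_assert(offsetof(FrameObjectRecord, size) == 8, "size at +8");
static_assert(offsetof(FrameObjectRecord, alignLog2) == 16, "align at +16");
static_assert(offsetof(FrameObjectRecord, isDead) == 20, "isDead at +20");
static_assert(offsetof(FrameObjectRecord, isSpillSlot) == 32, "spill at +32");
static_assert(offsetof(FrameObjectRecord, category) == 36, "category at +36");
```

A struct like this is convenient when annotating decompiler output, because a mistaken offset fails at compile time instead of silently shifting every later field.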
Diagnostic Strings
| String | When emitted |
|---|---|
| "warn-stack-size" | Function attribute name, read and parsed as an integer threshold |
| "stack frame size" | Diagnostic message when total frame size exceeds the warn-stack-size threshold |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| PrologEpilogInserter::runOnMachineFunction, main entry (68 KB) | sub_35B1110 | -- | -- |
| PEI pre-setup: initialize frame object tracking | sub_35AC440 | -- | -- |
| Record frame object into chunk table | sub_35AFAD0 | -- | -- |
| Determine CSR frame index range (writes PEI+208, +212) | sub_35AEEB0 | -- | -- |
| Post-save fixup | sub_35AE230 | -- | -- |
| Insert restore instructions at epilog points | sub_35ADBC0 | -- | -- |
| Assign offsets to categorized frame object bucket | sub_35B0830 | -- | -- |
| Push frame object index into categorized bucket | sub_35B0B10 | -- | -- |
| Post-prolog/epilog fixup | sub_35AC7B0 | -- | -- |
| Try to eliminate a single frame index operand | sub_35ABF20 | -- | -- |
| Initialize register scavenger for a MBB | sub_35C5BD0 | -- | -- |
| Advance register scavenger | sub_35C5C00 | -- | -- |
| Post-scavenging callee-save cleanup | sub_35C6D20 | -- | -- |
| Format stack-size annotation | sub_35AE7D0 | -- | -- |
| PTX emitter: emitFunctionFrameSetup() (__local_depot + %SP/%SPL) | sub_2158E80 | -- | -- |
| Local depot helper | sub_214C040 | -- | -- |
| Local depot helper | sub_2154370 | -- | -- |
| Collect callee-saved registers | sub_2E77EA0 | -- | -- |
| Get register class for physical register | sub_2FF6500 | -- | -- |
| Build sub-register decomposition list | sub_2F26260 | -- | -- |
| Insert compound save instruction | sub_2E8EAD0 | -- | -- |
| Check optimization level flag | sub_B2D610 | -- | -- |
| Check function attribute existence | sub_B2D620 | -- | -- |
| Get function attribute value | sub_B2D7E0 | -- | -- |
| Build stack-size diagnostic message | sub_B15960 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM (NVPTX open-source) | CICC v13.0 |
|---|---|---|
| Implementation | Stripped-down NVPTXPrologEpilogPass (~280 lines); handles only offset calculation and frame-index elimination | Full 68 KB PEI monolith with callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering |
| Stack concept | No hardware stack; minimal __local_depot offset assignment | Same __local_depot model but with full-featured offset assignment: categorized frame objects, alignment-based bucketing, dead frame object elimination |
| Callee-saved registers | Skipped entirely (no function calls in typical kernels) | Restored: full callee-saved register scan, compound save/restore instruction insertion for non-inlined device function calls |
| Register scavenging | Absent | Included: sub_35C5BD0/sub_35C5C00 initialize and advance a register scavenger per MBB for emergency spill resolution |
| Frame packing | Sequential offset assignment | Bitmap-based packing with categorized buckets; objects sorted by alignment to minimize padding waste |
| Stack-size diagnostics | No diagnostic system | Annotation system (sub_35AE7D0) formats stack-size remarks; integrates with -Rpass-analysis for occupancy tuning |
| Prologue emission | Two-instruction %SP/%SPL setup | Same two-instruction prologue (sub_2158E80) but with additional __local_depot sizing logic for complex frame layouts |
Cross-References
- Register Allocation: creates spill slots that PEI lays out; the number and alignment of spills directly determines frame size.
- Register Coalescing: reduces register pressure, which reduces spills, which reduces frame size.
- SROA: eliminates allocas before they reach MachineIR; when fully successful, PEI has nothing to do.
- AsmPrinter & PTX Body Emission: sub_2158E80 emits the .local directive and %SP/%SPL declarations that PEI computed.
- Instruction Scheduling: runs before PEI; scheduling decisions affect register pressure and thus spill count.
- Pipeline & Ordering: PEI runs post-regalloc, followed immediately by NVPTXPeephole for %VRFrame to %VRFrameLocal optimization.