Loop Unrolling
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (decision engine), llvm/lib/Transforms/Utils/LoopUnroll.cpp (transformation engine), llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp (runtime unrolling) (LLVM 20.0.0)
Loop unrolling in cicc is one of the most heavily tuned transformations in the entire pipeline. On a GPU, unrolling directly trades register pressure against instruction-level parallelism: every additional copy of the loop body increases the live register count, which reduces SM occupancy and the number of concurrent warps available to hide memory latency. Conversely, too little unrolling leaves performance on the table by failing to expose independent instructions that the hardware scheduler can overlap. NVIDIA's unroller resolves this tension through a priority-based decision cascade with GPU-specific heuristics that have no upstream equivalent -- most notably a local-array threshold multiplier, power-of-two factor enforcement, and a pragma threshold double that of stock LLVM. The transformation engine itself is a lightly modified version of upstream llvm::UnrollLoop, but the decision engine (computeUnrollCount) is substantially reworked.
The pass appears twice in the cicc pipeline. The first invocation (sub_197E720) runs early, interleaved with loop vectorization in the main optimization sequence. The second invocation (sub_19C1680) runs later as a cleanup pass, gated by opts[1360] (the nv-disable-loop-unrolling flag). Both share the same decision engine; the second invocation operates on loops that were created or exposed by intervening passes (InstCombine, SROA, EarlyCSE).
| Property | Value |
|---|---|
| Decision engine | sub_19BB5C0 / computeUnrollCount (50 KB, ~1681 lines) |
| Transformation engine | sub_2A15A20 / UnrollLoop (85 KB, ~2434 lines) |
| Top-level driver | sub_19BE360 / tryToUnrollLoop |
| Runtime-check unroller | sub_2A25260 / UnrollLoopWithRuntimeChecks (91 KB) |
| Pipeline slot (early) | sub_197E720 -- runs once in main opt pipeline |
| Pipeline slot (late) | sub_19C1680 -- conditional on !opts[1360] |
| Disable knob | -Xcicc "-disable-LoopUnrollPass" or opts[1360] |
| LLVM base | LoopUnrollPass from LLVM 20.0.0 |
Why Unrolling Matters More on GPU
On a CPU, the primary benefit of unrolling is reducing branch overhead and enabling wider SIMD scheduling. On a GPU, the calculus is different in three ways that all trace back to the GPU execution model:
First, unrolling increases register pressure, and register pressure determines occupancy. If unrolling pushes a kernel from 64 to 96 registers per thread, the SM drops from 32 to 21 resident warps -- a 34% reduction. Fewer warps means less latency hiding, so the unroll factor selection must be conservative in ways that a CPU unroller never needs to be.
Second, there is no out-of-order execution within a warp; the hardware issues instructions in program order. Unrolling creates independent instructions that the compiler (ptxas) can interleave, particularly independent loads that can overlap with arithmetic. This is the ILP benefit, and it is the primary argument for aggressive unrolling.
Third, GPU loops often access shared memory (__shared__) or local memory arrays indexed by threadIdx. Unrolling these loops enables the backend to promote array elements to registers and to rearrange memory accesses to avoid bank conflicts. NVIDIA's local-array heuristic (see below) exists specifically to exploit this opportunity.
The unroller's job is to find the sweet spot: enough copies to saturate the instruction pipeline, few enough to keep register pressure within occupancy targets.
The Decision Engine: computeUnrollCount
The decision engine at sub_19BB5C0 implements a strict six-level priority cascade. Each level is tried in order; the first level that produces a valid unroll factor wins. Every decision is logged through optimization remarks, making the logic traceable from -Rpass-analysis=loop-unroll.
UnrollParams Struct Layout
The decision communicates its result through a struct passed by pointer (a12 / v14):
| Offset | Field | Type | Description |
|---|---|---|---|
| +0 | Threshold | u32 | Cost budget for full unroll |
| +4 | MaxPercentThresholdBoost | u32 | Max boost percentage (default 400) |
| +12 | PartialThreshold | u32 | Cost budget for partial unroll |
| +20 | Count | u32 | Chosen unroll factor (primary output) |
| +24 | PeelCount | u32 | Loop peel iteration count |
| +28 | DefaultUnrollCount | u32 | Fallback count when no factor found |
| +32 | MaxCount | u32 | Hard cap on unroll factor |
| +36 | FullUnrollMaxCount | u32 | Max trip count for full unroll |
| +40 | FixedCost | u32 | Non-scaling cost (IV increments, branches) |
| +44 | AllowPartial | u8 | Partial unrolling permitted |
| +45 | AllowRemainder | u8 | Remainder loop generation permitted |
| +46 | UserProvidedCount | u8 | True when pragma supplies count |
| +48 | (reserved) | u8 | -- |
| +49 | AllowUpperBound | u8 | Use max-trip-count when exact unknown |
The Cost Model
Every decision in the cascade uses the same linear cost model to estimate unrolled loop size:
estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)
LoopBodySize is the instruction cost of one iteration (parameter a11, computed by LLVM's CodeMetrics). FixedCost captures instructions that do not replicate with unrolling -- induction variable increments, the backedge branch, loop overhead. The difference (LoopBodySize - FixedCost) is the per-copy marginal cost.
For full unrolls, an additional dynamic cost simulation (sub_19B9A90) constant-folds through the unrolled body. If the loop contains iteration-dependent simplifications (constant array indices, strength-reduced expressions), the simulation reports a cost lower than worst-case. The effective budget for this check is boosted:
dynamic_budget = Threshold * MaxPercentThresholdBoost / 100
With the default boost of 400%, this means a loop whose body simplifies substantially after unrolling gets 4x the normal cost budget.
Priority Cascade (Pseudocode)
int computeUnrollCount(Loop *L, SE, TTI, TripCount, MaxTripCount,
BodySize, UnrollParams *UP, bool *AllowRuntime) {
// PRIORITY 1: Local array threshold multiplier (NVIDIA-specific)
int localSize = computeLocalArraySize(L); // scans for AS5 allocas
int multiplier = min(max(localSize, 1), 6);
int effectiveThreshold = multiplier * UP->Threshold;
// PRIORITY 2: #pragma unroll N
int pragmaCount = getMetadataCount(L, "llvm.loop.unroll.count");
if (pragmaCount != 0) {
if (pragmaCount == 1) {
UP->Count = 1; // disable unrolling
return UNROLL_DISABLED;
}
UP->Count = pragmaCount;
int estSize = UP->FixedCost + pragmaCount * (BodySize - UP->FixedCost);
if (estSize > multiplier * PragmaUnrollThreshold) {
// too large -- try to find smaller factor
searchSmallerDivisibleFactor(UP, TripCount);
}
if (TripMultiple % pragmaCount != 0)
emitRemark("remainder loops not allowed");
return UNROLL_PRAGMA;
}
// PRIORITY 3: #pragma unroll (full, no count)
if (hasMetadata(L, "llvm.loop.unroll.full")) {
if (TripCount > 0 && TripCount <= UP->FullUnrollMaxCount) {
int estSize = UP->FixedCost + TripCount * (BodySize - UP->FixedCost);
if (estSize <= effectiveThreshold) {
if (simulateLoopBody(L, TripCount, dynamicBudget))
{ UP->Count = TripCount; return FULL_UNROLL; }
}
}
// fallthrough to lower priorities
}
// PRIORITY 4: Loop peeling
int peelCount = computePeelCount(L, SE, UP);
if (peelCount > 0) {
UP->PeelCount = peelCount;
UP->Count = 1;
return PEEL;
}
// PRIORITY 5: Static partial unrolling (known trip count)
if (TripCount > 0 && (UP->AllowPartial || pragmaOversize) && isInnermost(L)) {
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// Size clamp
if (UP->PartialThreshold < UP->FixedCost + count * (BodySize - UP->FixedCost))
count = (UP->PartialThreshold - UP->FixedCost) / (BodySize - UP->FixedCost);
count = min(count, UP->MaxCount);
// Power-of-two + trip-divisible search
while (count > 0) {
if (TripCount % count == 0 && isPowerOfTwo(count))
break;
count--;
}
// Fallback: halve DefaultUnrollCount until it fits
if (count == 0 && UP->UserProvidedCount) {
count = UP->DefaultUnrollCount;
while (UP->PartialThreshold <
UP->FixedCost + count * (BodySize - UP->FixedCost))
count >>= 1;
}
if (count > 1) { UP->Count = count; return PARTIAL_UNROLL; }
}
// PRIORITY 6: Runtime unrolling (unknown trip count)
if (!hasMetadata(L, "llvm.loop.unroll.runtime.disable")
&& RuntimeUnrollThreshold >= BodySize
&& isInnermost(L)) {
int rtTripCount = computeRuntimeTripCount(L, SE);
if (rtTripCount < FlatLoopTripCountThreshold) return NO_UNROLL;
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// same halving + threshold logic as Priority 5
    while (count > 1 && UP->PartialThreshold <
               UP->FixedCost + count * (BodySize - UP->FixedCost))
      count >>= 1;
count = min(count, UP->MaxCount);
if (count > 1) {
UP->Count = count;
*AllowRuntime = true;
return RUNTIME_UNROLL;
}
}
// Small-function override (tiny kernels get aggressive unrolling)
if (functionInstructionCount < SmallFunctionThreshold)
return handleSmallFunction(L, UP, BodySize);
return NO_UNROLL;
}
Local Array Heuristic
The function sub_19B5DD0 (computeLocalArraySize) is entirely NVIDIA-specific. It scans every basic block in the loop for load/store instructions that access address space 5 (GPU local memory). For each such access, it traces back to the underlying alloca, determines the array type, and computes the product of array dimensions. If any dimension is unknown at compile time, it substitutes the unroll-assumed-size knob (default 4). The returned value is the maximum local-array size found across all accesses.
This value becomes a threshold multiplier, capped at 6:
int computeLocalArraySize(Loop *L) {
int maxSize = 0;
for (BasicBlock *BB : L->blocks()) {
for (Instruction &I : *BB) {
if (!isLoadOrStore(I) || getAddressSpace(I) != 5) continue;
Value *base = getUnderlyingAlloca(I);
if (!base || !isArrayType(base->getType())) continue;
int size = 1;
for (int dim : getArrayDimensions(base))
size *= (dim > 0) ? dim : UnrollAssumedSize; // default 4
maxSize = max(maxSize, size);
}
}
return maxSize;
}
The rationale: GPU kernels frequently use __shared__ or local arrays indexed by threadIdx. Unrolling such loops by a factor proportional to the array size enables register promotion of individual array elements and eliminates bank-conflict-prone access patterns. The cap at 6 prevents pathological explosion when arrays are large.
Power-of-Two Factor Enforcement
The partial-unroll factor search at Priority 5 requires the chosen count to satisfy two constraints simultaneously: it must evenly divide the trip count and must be a power of two. The implementation uses the classic bitmask test:
while (count > 0) {
if (tripCount % count == 0 && (count & (count - 1)) == 0)
break;
count--;
}
This is a GPU-specific requirement. Warp size is 32 (a power of two), and many GPU memory access patterns, shared-memory bank calculations, and reduction operations assume power-of-two alignment. An unroll factor of, say, 6 would create asymmetric loop bodies that interact poorly with warp-level execution.
Pragma Handling
The frontend (sub_9305A0 / emitUnrollPragma) translates CUDA pragmas to LLVM metadata during codegen:
| CUDA Source | LLVM Metadata |
|---|---|
| #pragma unroll (bare) | !{!"llvm.loop.unroll.full"} |
| #pragma unroll N (N > 1) | !{!"llvm.loop.unroll.count", i32 N} |
| #pragma unroll 1 | Disables unrolling at Priority 2 |
The metadata is attached to the backedge branch as a self-referential !llvm.loop node. A guard flag (dword_4D046B4) skips pragma processing entirely in fast-codegen mode.
The pragma threshold is 32768 (0x8000), compared to upstream LLVM's 16384 (0x4000). This means #pragma unroll succeeds on loop bodies up to approximately 32K cost units -- covering virtually any realistic GPU kernel loop. When even this generous budget is exceeded, the decision engine falls through to lower priorities and attempts partial unrolling.
The __launch_bounds__ attribute does not directly feed the unroll decision. Instead, it constrains register allocation downstream, which indirectly limits the benefit of aggressive unrolling. There is no feedback loop from register pressure estimation back into the unroll factor at this stage of the pipeline; that coordination happens implicitly through the PartialThreshold provided by TTI.
Runtime Unrolling
Runtime unrolling (Priority 6) handles loops whose trip count is unknown at compile time. cicc enables it by default (unroll-runtime = true), with several GPU-specific twists:
Convergent instruction support. The knob unroll-runtime-convergent (default true, NVIDIA-specific) allows unrolling loops that contain convergent operations like warp-level primitives (__shfl_sync, __ballot_sync). Upstream LLVM refuses to unroll such loops because it cannot guarantee all threads in the warp execute the same iterations. cicc overrides this, relying on the waterfall-epilogue mechanism to preserve convergence.
Epilog vs. prolog remainder. The choice is controlled by a cascade:
- If waterfall-unrolling-force-epilogue is true (default, NVIDIA-specific) and the loop has a runtime trip count: epilog mode is selected.
- If the loop body contains function calls (hasCallInLoop / sub_2A10B40 checks for opcode 17): epilog mode is forced. This preserves the property that all threads in a warp participate in calls, which matters for convergent operations.
- Otherwise, unroll-runtime-epilog (default false) determines the mode.
In practice, GPU loops almost always use epilog-style remainders.
Flat-loop exclusion. If the estimated runtime trip count is below flat-loop-tripcount-threshold (default 5), runtime unrolling is skipped. The overhead of generating the modulo check and epilog loop is not worth it for loops that iterate fewer than 5 times.
Body size gate. Runtime unrolling only proceeds if runtime-unroll-threshold (default 95) is greater than or equal to the loop body size. This is more conservative than the static partial-unroll threshold, preventing code explosion for large loop bodies when the trip count is unknown.
Thresholds: NVIDIA vs. Upstream LLVM
| Parameter | Upstream LLVM (O3) | Upstream LLVM (NVPTX TTI) | cicc v13.0 |
|---|---|---|---|
| Threshold | 300 | 300 | From TTI (300), then multiplied by local-array factor (1-6x) |
| PartialThreshold | 150 | 75 (Threshold / 4) | From TTI (75), plus local-array scaling |
| MaxPercentThresholdBoost | 400% | 400% | 400% (same) |
| PragmaUnrollThreshold | 16384 | 16384 | 32768 |
| RuntimeUnrollThreshold | -- | -- | 95 (NVIDIA addition) |
| FlatLoopTripCountThreshold | 5 | 5 | 5 (same) |
| MaxUpperBound | 8 | 8 | 8 (same) |
| MaxPragmaUpperBound | -- | -- | 64 (NVIDIA addition) |
| DefaultUnrollRuntimeCount | 8 | 8 | From TTI |
| AllowPartial | false | true | true (from TTI) |
| Runtime | false | true | true (from TTI) |
| AllowRemainder | true | true | true |
| MaxIterationsCountToAnalyze | 10 | 10 | 10 (same) |
| UnrollAssumedSize | -- | -- | 4 (NVIDIA addition) |
The critical differences: cicc doubles the pragma threshold, introduces a body-size gate for runtime unrolling (95), adds the local-array multiplier (up to 6x on base thresholds), and enforces power-of-two partial factors. The upstream NVPTX TTI enables partial and runtime unrolling but leaves thresholds at modest CPU-oriented values; cicc's decision engine applies substantial additional logic on top.
Interaction with Loop Vectorization
In the cicc pipeline, loop vectorization (LoopVectorizePass) runs before the first unroll invocation. Specifically, sub_197E720 combines both vectorization and unrolling decisions in the early pipeline slot. The vectorizer decides the vector width first (VF), and if it applies a transformation, the resulting loop (possibly with a scalar epilog) is then presented to the unroller.
This means vectorization and unrolling do not "coordinate" in the planning sense -- the vectorizer runs to completion before the unroller sees the loop. However, the vectorizer's interleave count (IC) serves a similar role to unrolling: it replicates the vectorized loop body to increase ILP. When the vectorizer chooses IC > 1, the subsequent unroller typically finds the loop body too large to unroll further, producing a de facto coordination through cost thresholds.
The second unroll invocation (sub_19C1680) runs much later, after InstCombine, SROA, and EarlyCSE have had a chance to simplify the vectorized code. Loops that were too large to unroll earlier may become eligible after dead code elimination within the unrolled-and-vectorized body.
The Transformation Engine: UnrollLoop
The transformation at sub_2A15A20 takes a loop and an unroll factor and physically duplicates the loop body. It is structurally close to upstream llvm::UnrollLoop with the following entry guards:
- Loop must have a preheader (sub_D4B130)
- Loop must have a single latch (sub_D47930)
- Loop must be in LCSSA form (sub_D49210)
- Header flags must be clean (no special bits set)
The duplication proceeds by iterating Count - 1 times, each iteration cloning every basic block in the loop body, remapping instructions through a value map, and rewiring PHI nodes so that iteration i's latch feeds iteration i+1's header. After all copies, the backedge of the last copy is reconnected to the first copy's header (for partial unroll) or removed entirely (for full unroll).
For partial unrolls where TripCount % Count != 0, a remainder loop is generated by sub_2A23640. If remainder generation fails (e.g., multi-exit loops), the engine delegates to sub_2A25260 which generates the runtime-check variant with prologue/epilogue.
The return value encodes the result: 0 = no change, 1 = partial unroll, 2 = full unroll.
Configuration Knobs
Standard LLVM Knobs (with NVIDIA defaults)
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-threshold | From TTI | sub_19B7760 struct | Base cost budget for full unroll |
| unroll-partial-threshold | From TTI | 0x4FB3140 area | Cost budget for partial unroll |
| unroll-max-percent-threshold-boost | 400 | dword_4FB3100 | Max dynamic cost boost (%) |
| unroll-max-iteration-count-to-analyze | 10 | dword_4FB3020 | Max iterations for cost simulation |
| unroll-count | Unset | dword_4FB2EA8 | Force specific unroll factor |
| unroll-max-count | Unset | sub_19B7760 struct | Hard cap on unroll factor |
| unroll-full-max-count | Unset | 0x4FB2CE0 area | Max trip count for full unroll |
| unroll-peel-count | Unset | 0x4FB2C00 area | Force specific peel count |
| unroll-allow-partial | false | 0x4FB2B20 area | Enable partial unrolling override |
| unroll-allow-remainder | false | 0x4FB2A40 area | Enable remainder loop generation |
| unroll-runtime | true | 0x4FB2960 area | Enable runtime (dynamic TC) unrolling |
| unroll-max-upperbound | 8 | dword_4FB2920 | Max trip count for upper-bound unroll |
| pragma-unroll-threshold | 32768 | dword_4FB2760 | Cost budget for pragma-directed unrolls |
| flat-loop-tripcount-threshold | 5 | 0x4FB2680 area | Min estimated TC for runtime unroll |
| runtime-unroll-threshold | 95 | dword_4FB3560 | Max body size for runtime unroll |
| max-pragma-upperbound-unroll | 64 | dword_4FB2840 | Max upper-bound factor for pragma |
| unroll-assumed-size | 4 | dword_4FB33A0 | Assumed array size for unknown dims |
NVIDIA-Specific Knobs
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-runtime-convergent | true | 0x500A440 area | Allow unrolling loops with convergent ops |
| unroll-runtime-epilog | false | qword_500A3E8 | Force epilog-style remainder (override) |
| waterfall-unrolling-force-epilogue | true | qword_500A148 | Force epilog for waterfall patterns |
Knobs are registered in two constructors: standard LLVM knobs in ctor_216_0 at 0x4E5C30, NVIDIA-specific knobs in ctor_501 at 0x559890.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| emitUnrollPragma | 0x09305A0 | -- | Frontend: #pragma unroll to metadata |
| parseUnrollMetadata | 0x19B4C50 | -- | Reads llvm.loop.unroll.* metadata |
| computeLocalArraySize | 0x19B5DD0 | -- | NVIDIA: local array threshold heuristic |
| handleSmallFunction | 0x19B6500 | -- | Special aggressive unroll for tiny kernels |
| selectUnrollFactor | 0x19B6690 | -- | Trip count analysis helper |
| emitRemainderNotAllowedRemark | 0x19B78B0 | -- | Diagnostic emission |
| simulateLoopBody | 0x19B9A90 | -- | Dynamic cost simulation with constant folding |
| computeUnrollCount | 0x19BB5C0 | -- | Main decision engine |
| tryToUnrollLoop | 0x19BE360 | -- | Top-level driver |
| computePeelCount | 0x1B0B080 | -- | Loop peeling logic |
| computeRuntimeTripCount | 0x1B18810 | -- | Runtime trip count estimation |
| hasCallInLoop | 0x2A10B40 | -- | Checks for call/invoke in loop body |
| createSideExitPHI | 0x2A10DD0 | -- | PHI nodes for side-exit unrolled loops |
| cloneInstructionsInBlock | 0x2A12AD0 | -- | Instruction-level cloning |
| reconcileLoopAfterUnroll | 0x2A13F00 | -- | Post-unroll SCEV/LoopInfo fixup |
| UnrollLoop | 0x2A15A20 | -- | Main transformation engine |
| unrollCostModel | 0x2A1AA10 | -- | Cost estimation helper |
| UnrollAndJamLoop | 0x2A1CF00 | -- | Unroll-and-jam variant |
| generateRemainderLoop | 0x2A23640 | -- | Remainder loop construction |
| UnrollLoopWithRuntimeChecks | 0x2A25260 | -- | Prologue/epilogue generation |
Pass Factory and Object Layout
The following section documents the LoopUnroll pass factory at sub_19B73C0, which was originally misidentified as LICM in the P2C.3 sweep due to binary adjacency with the actual LICM pass. The pass ID unk_4FB224C, the 7-parameter constructor signature, and diagnostic function strings all confirm LoopUnroll identity.
The pass factory at sub_19B73C0 allocates a 184-byte pass object and accepts seven parameters that control unroll behavior. When a parameter is -1, the pass uses its compiled-in default.
Constructor Parameters
| Parameter | Offset | Enable Flag | Semantics |
|---|---|---|---|
| a1 (optimization level) | +156 | -- | 2 = standard, 3 = aggressive |
| a2 (unroll threshold) | +168 | +172 | Trip count threshold; -1 = use default |
| a3 (unroll count) | +160 | +164 | Explicit unroll factor; -1 = use default |
| a4 (allow partial) | +176 | +177 | 0 = disable partial unroll, 1 = enable |
| a5 (runtime unroll) | +178 | +179 | 0 = disable runtime unroll, 1 = enable |
| a6 (upper bound) | +180 | +181 | 0 = disable upper-bound unroll, 1 = enable |
| a7 (profile-based) | +182 | +183 | 0 = disable profile-guided unroll, 1 = enable |
Object Construction
The factory allocates 184 bytes via sub_22077B0, sets the vtable to off_49F45F0 (loop-unroll pass vtable), stores pass ID unk_4FB224C at offset +16, initializes self-referential linked-list pointers at offsets +80/+88 and +128/+136, sets pass type 2 (FunctionPass) at offset +24, and calls sub_163A1D0 / sub_19B71A0 for pass registration.
Pipeline Invocation Configurations
CICC invokes LoopUnroll with six distinct configurations at different pipeline stages, reflecting NVIDIA's careful tuning of unroll aggressiveness per compilation phase. These are the factory-level parameter sets passed to sub_19B73C0; see also the decision engine's per-invocation behavior in The Decision Engine above.
Configuration A: Standard Pipeline (O1/O2)
Call site: sub_12DE330
LoopUnroll(2, -1, -1, -1, -1, -1, -1)
All parameters at defaults. Standard unrolling with default thresholds at optimization level 2.
Configuration B: Code-Size Mode
Call site: sub_12DE8F0, when *(a3+4480) < 0 (NVIDIA code-size flag set)
LoopUnroll(a2, -1, -1, 0, 0, 0, 0)
All unrolling features disabled: partial, runtime, upper-bound, and profile-based are all zeroed. The pass only unrolls when the trip count is statically known and the benefit is certain. This reflects the constraint that GPU register pressure makes speculative unrolling expensive when code size matters.
Configuration C: Normal Optimizer
Call site: sub_12DE8F0, when *(a3+4480) >= 0 (normal mode)
LoopUnroll(a2, -1, -1, -1, -1, -1, -1)
Fully aggressive unrolling with all defaults. The optimization level is passed through from the caller.
Configuration D: Late Pipeline (Conservative)
Call site: sub_12DE8F0, late pipeline position
LoopUnroll(a2, -1, -1, 0, 0, -1, -1)
Partial and runtime unrolling disabled, but upper-bound and profile-based unrolling retain their defaults. This conservative late-pipeline configuration avoids creating new runtime overhead in code that has already been substantially optimized.
Configuration E: Aggressive Pipeline (O3)
Call site: sub_12E54A0
LoopUnroll(3, -1, -1, 0, 0, -1, 0)
Optimization level 3 with aggressive thresholds, but partial, runtime, and profile-based unrolling are disabled. Only upper-bound unrolling retains its default. The rationale is that at O3, the higher thresholds already capture most profitable unrolling opportunities without needing speculative runtime checks.
Configuration F: User-Configured
Call site: sub_12EA3A0
LoopUnroll(a1[4], a1[5], a1[6], a1[7], a1[8], a1[9], a1[10])
All seven parameters are read from a stored configuration object, enabling user-specified unroll behavior via command-line flags or pragmas.
Threshold Initialization (Pass-Level)
The function sub_19B6690 (17 KB) configures unroll thresholds based on optimization level and LLVM knobs at pass construction time. These values feed into the UnrollParams struct consumed by the decision engine.
Default Threshold Values
| Offset | Field | Default (O2+) | Default (O1) |
|---|---|---|---|
| +0 | OptThreshold | 405 | 150 |
| +4 | Threshold | 400 | 400 |
| +12 | SmallTripCountThreshold | 150 | 150 |
| +56 | MaxIterationsCountToAnalyze | 60 | 60 |
Function-Attribute-Aware Override
The threshold initializer queries function attributes via sub_1560180:
- Attribute ID 34 (minsize): reduces OptThreshold to SmallTripCountThreshold (150).
- Attribute ID 17 (optsize): same reduction.
This means kernels annotated with size constraints get conservative unroll thresholds regardless of the global optimization level.
Per-Function Knob Override via BST
The function queries the LLVM option registry (dword_4FA0208 BST) ten times, each time looking up a different knob address. For each knob, it searches the BST rooted at dword_4FA0208[2], compares the current function hash (sub_16D5D50) against node ranges, and applies the override if the knob value meets the threshold. The knob-to-field mapping:
| Knob Address | Override Address | Field |
|---|---|---|
| dword_4FB3228 | dword_4FB32C0 | OptThreshold (+0) |
| dword_4FB3148 | dword_4FB31E0 | SmallTripCountThreshold (+12) |
| dword_4FB3068 | dword_4FB3100 | Threshold (+4) |
| dword_4FB2DC8 | dword_4FB2E60 | field +32 |
| dword_4FB2CE8 | dword_4FB2D80 | field +36 |
| dword_4FB2C08 | dword_4FB2CA0 | field +24 |
| dword_4FB2B28 | (next value) | field +40 |
The per-function BST lookup keyed by function hash enables fine-grained tuning of unroll behavior per kernel, a capability not present in upstream LLVM.
Diagnostic Functions
Three diagnostic emission functions produce optimization remarks:
| Function | Address | Diagnostic |
|---|---|---|
| emitPragmaCountDiag | sub_19B78B0 | Reports when pragma unroll count conflicts with trip multiple |
| emitThresholdDiag | sub_19B7B10 | Reports when unrolled size exceeds threshold |
| emitLoopSizeDiag | sub_19B7D80 | Reports when loop body is too large to unroll |
Main Loop Processing and Hash Infrastructure
The primary analysis function sub_19B7FA0 (11 KB) analyzes each candidate loop. The pass uses hash table infrastructure shared with other CICC LLVM passes:
| Function | Address | Size | Role |
|---|---|---|---|
| rehashSmallTable | sub_19B60B0 | 5 KB | Small hash table resize |
| rehashTable | sub_19B8820 | 4 KB | Key-value hash table resize |
| rehashSet | sub_19B89E0 | 7 KB | Set hash table resize |
| insertIntoSet | sub_19B8DA0 | -- | Set insert with growth |
All hash tables use the same (value >> 9) ^ (value >> 4) hash function and linear probing strategy found throughout CICC's LLVM passes. See Hash Infrastructure for the common implementation.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pragma threshold | UnrollThreshold default 150; PragmaUnrollThreshold 16384 | PragmaUnrollThreshold = 32768, double upstream; enables aggressive pragma-directed unrolling for GPU kernels |
| Power-of-two enforcement | No power-of-two requirement; any profitable factor accepted | Enforces power-of-two unroll factors; non-power-of-two factors are rounded down to avoid irregular loop tails |
| Local array multiplier | No concept of local array bonus | Dedicated local-array threshold multiplier boosts unroll budget when loop body accesses alloca/.local arrays indexed by IV, enabling register promotion |
| Decision engine | ~20 KB computeUnrollCount | Substantially reworked 50 KB computeUnrollCount (sub_19BB5C0) with 6-level priority cascade and GPU-specific occupancy heuristics |
| Register pressure model | Generic TTI-based unroll cost; no occupancy concept | Occupancy-aware cost model considers register pressure cliffs where one additional register per thread drops warp occupancy |
| Pipeline invocations | Single invocation in optimization pipeline | Two invocations: early (interleaved with vectorization) and late (cleanup, gated by opts[1360] / nv-disable-loop-unrolling) |
| Transformation engine | Stock llvm::UnrollLoop | Lightly modified UnrollLoop (sub_2A15A20, 85 KB); decision engine is where the changes concentrate |
Test This
The following kernel contains a simple counted loop that is a prime candidate for full unrolling. Compile and compare PTX output with and without #pragma unroll.
__global__ void unroll_test(float* out, const float* in) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
#pragma unroll
for (int i = 0; i < 8; i++) {
sum += in[tid + i * 128];
}
out[tid] = sum;
}
What to look for in PTX:
- With #pragma unroll: the loop should be fully unrolled into 8 sequential ld.global.f32 + add.f32 sequences with no backedge branch. Look for the absence of bra instructions targeting a loop header and the presence of 8 distinct ld.global.f32 instructions with addresses offset by 128*sizeof(float).
- Without #pragma unroll (remove the pragma): the compiler may still unroll if the trip count (8) times body size fits within the threshold (default 300). Check whether the PTX has a loop or is fully unrolled -- this exercises the automatic decision engine.
- With #pragma unroll 1: the loop must remain a counted loop with a backedge branch. This tests that pragma disabling works.
- Compare .nreg values across the three variants. Full unrolling increases register pressure (8 loads live simultaneously); the partial or no-unroll variant uses fewer registers at the cost of loop overhead.
- Power-of-two enforcement is visible when the trip count is not a power of two: change the loop bound to 6 and check whether the compiler picks a power-of-two factor (2, which divides 6, or 4 with a remainder loop) rather than unrolling by 6.
Cross-References
- Loop Optimization Passes -- pipeline context and pass ordering
- LICM -- runs before second unroll invocation, feeds hoisted invariants
- Loop Strength Reduction -- runs after unrolling, reduces IV expressions
- Register Allocation -- occupancy-driven allocation consumes what unrolling produces
- StructurizeCFG -- runs after all loop transforms, restructures divergent control flow
- InstCombine -- simplifies unrolled loop bodies between invocations