
Loop Unrolling

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp (decision engine), llvm/lib/Transforms/Utils/LoopUnroll.cpp (transformation engine), llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp (runtime unrolling) (LLVM 20.0.0)

Loop unrolling in cicc is one of the most heavily tuned transformations in the entire pipeline. On a GPU, unrolling directly trades register pressure against instruction-level parallelism: every additional copy of the loop body increases the live register count, which reduces SM occupancy and the number of concurrent warps available to hide memory latency. Conversely, too little unrolling leaves performance on the table by failing to expose independent instructions that the hardware scheduler can overlap. NVIDIA's unroller resolves this tension through a priority-based decision cascade with GPU-specific heuristics that have no upstream equivalent -- most notably a local-array threshold multiplier, power-of-two factor enforcement, and a pragma threshold double that of stock LLVM. The transformation engine itself is a lightly modified version of upstream llvm::UnrollLoop, but the decision engine (computeUnrollCount) is substantially reworked.

The pass appears twice in the cicc pipeline. The first invocation (sub_197E720) runs early, interleaved with loop vectorization in the main optimization sequence. The second invocation (sub_19C1680) runs later as a cleanup pass, gated by opts[1360] (the nv-disable-loop-unrolling flag). Both share the same decision engine; the second invocation operates on loops that were created or exposed by intervening passes (InstCombine, SROA, EarlyCSE).

| Property | Value |
|---|---|
| Decision engine | sub_19BB5C0 / computeUnrollCount (50 KB, ~1681 lines) |
| Transformation engine | sub_2A15A20 / UnrollLoop (85 KB, ~2434 lines) |
| Top-level driver | sub_19BE360 / tryToUnrollLoop |
| Runtime-check unroller | sub_2A25260 / UnrollLoopWithRuntimeChecks (91 KB) |
| Pipeline slot (early) | sub_197E720 -- runs once in main opt pipeline |
| Pipeline slot (late) | sub_19C1680 -- conditional on !opts[1360] |
| Disable knob | -Xcicc "-disable-LoopUnrollPass" or opts[1360] |
| LLVM base | LoopUnrollPass from LLVM 20.0.0 |

Why Unrolling Matters More on GPU

On a CPU, the primary benefit of unrolling is reducing branch overhead and enabling wider SIMD scheduling. On a GPU, the calculus is different in three ways that all trace back to the GPU execution model:

First, unrolling increases register pressure, and register pressure determines occupancy. If unrolling pushes a kernel from 64 to 96 registers per thread, the SM drops from 32 to 21 resident warps -- a 34% reduction. Fewer warps means less latency hiding, so the unroll factor selection must be conservative in ways that a CPU unroller never needs to be.

Second, there is no out-of-order execution within a warp; the hardware issues instructions in program order. Unrolling creates independent instructions that the compiler (ptxas) can interleave, particularly independent loads that can overlap with arithmetic. This is the ILP benefit, and it is the primary argument for aggressive unrolling.

Third, GPU loops often access shared memory (__shared__) or local memory arrays indexed by threadIdx. Unrolling these loops enables the backend to promote array elements to registers and to rearrange memory accesses to avoid bank conflicts. NVIDIA's local-array heuristic (see below) exists specifically to exploit this opportunity.

The unroller's job is to find the sweet spot: enough copies to saturate the instruction pipeline, few enough to keep register pressure within occupancy targets.
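The occupancy arithmetic behind the first point can be made concrete. A minimal sketch, assuming a 65,536-register file per SM, 32 threads per warp, and a 64-warp slot cap (typical of recent NVIDIA architectures; real occupancy also depends on shared memory, block size, and allocation granularity, all ignored here):

```c
#include <assert.h>

/* Resident warps per SM when registers are the only limiter.
   Hardware parameters are assumptions, not values from the text:
   64K registers per SM, 32 threads per warp, 64 warp slots. */
static int resident_warps(int regs_per_thread) {
    const int regfile_size = 65536;
    const int warp_size    = 32;
    const int max_warps    = 64;
    int warps = regfile_size / (regs_per_thread * warp_size);
    return warps > max_warps ? max_warps : warps;
}
```

Under these assumptions, 64 registers per thread allows 32 resident warps while 96 registers allows only 21 -- the ~34% drop described above.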

The Decision Engine: computeUnrollCount

The decision engine at sub_19BB5C0 implements a strict six-level priority cascade. Each level is tried in order; the first level that produces a valid unroll factor wins. Every decision is logged through optimization remarks, making the logic traceable from -Rpass-analysis=loop-unroll.

UnrollParams Struct Layout

The decision engine communicates its result through a struct passed by pointer (a12 / v14):

| Offset | Field | Type | Description |
|---|---|---|---|
| +0 | Threshold | u32 | Cost budget for full unroll |
| +4 | MaxPercentThresholdBoost | u32 | Max boost percentage (default 400) |
| +12 | PartialThreshold | u32 | Cost budget for partial unroll |
| +20 | Count | u32 | Chosen unroll factor (primary output) |
| +24 | PeelCount | u32 | Loop peel iteration count |
| +28 | DefaultUnrollCount | u32 | Fallback count when no factor found |
| +32 | MaxCount | u32 | Hard cap on unroll factor |
| +36 | FullUnrollMaxCount | u32 | Max trip count for full unroll |
| +40 | FixedCost | u32 | Non-scaling cost (IV increments, branches) |
| +44 | AllowPartial | u8 | Partial unrolling permitted |
| +45 | AllowRemainder | u8 | Remainder loop generation permitted |
| +46 | UserProvidedCount | u8 | True when pragma supplies count |
| +48 | (reserved) | u8 | -- |
| +49 | AllowUpperBound | u8 | Use max-trip-count when exact unknown |

The Cost Model

Every decision in the cascade uses the same linear cost model to estimate unrolled loop size:

estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)

LoopBodySize is the instruction cost of one iteration (parameter a11, computed by LLVM's CodeMetrics). FixedCost captures instructions that do not replicate with unrolling -- induction variable increments, the backedge branch, loop overhead. The difference (LoopBodySize - FixedCost) is the per-copy marginal cost.

For full unrolls, an additional dynamic cost simulation (sub_19B9A90) constant-folds through the unrolled body. If the loop contains iteration-dependent simplifications (constant array indices, strength-reduced expressions), the simulation reports a cost lower than worst-case. The effective budget for this check is boosted:

dynamic_budget = Threshold * MaxPercentThresholdBoost / 100

With the default boost of 400%, this means a loop whose body simplifies substantially after unrolling gets 4x the normal cost budget.
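Plugging in the defaults makes the numbers concrete. A sketch of both formulas (input values illustrative, not taken from a real compilation):

```c
#include <assert.h>

/* Linear size estimate for an unrolled loop: fixed overhead plus
   Count copies of the marginal per-iteration cost. */
static int estimated_size(int fixed_cost, int count, int body_size) {
    return fixed_cost + count * (body_size - fixed_cost);
}

/* Boosted budget used by the dynamic cost simulation for full unrolls. */
static int dynamic_budget(int threshold, int boost_pct) {
    return threshold * boost_pct / 100;
}
```

With Threshold = 300 and the default 400% boost, the simulation may spend up to 1200 cost units; a body of size 30 with FixedCost = 4 unrolled 8x estimates to 4 + 8 * 26 = 212.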

Priority Cascade (Pseudocode)

int computeUnrollCount(Loop *L, SE, TTI, TripCount, MaxTripCount,
                       BodySize, UnrollParams *UP, bool *AllowRuntime) {

    // PRIORITY 1: Local array threshold multiplier (NVIDIA-specific)
    int localSize = computeLocalArraySize(L);  // scans for AS5 allocas
    int multiplier = min(max(localSize, 1), 6);
    int effectiveThreshold = multiplier * UP->Threshold;

    // PRIORITY 2: #pragma unroll N
    int pragmaCount = getMetadataCount(L, "llvm.loop.unroll.count");
    if (pragmaCount != 0) {
        if (pragmaCount == 1) {
            UP->Count = 1;  // disable unrolling
            return UNROLL_DISABLED;
        }
        UP->Count = pragmaCount;
        int estSize = UP->FixedCost + pragmaCount * (BodySize - UP->FixedCost);
        if (estSize > multiplier * PragmaUnrollThreshold) {
            // too large -- try to find smaller factor
            searchSmallerDivisibleFactor(UP, TripCount);
        }
        if (TripMultiple % pragmaCount != 0)
            emitRemark("remainder loops not allowed");
        return UNROLL_PRAGMA;
    }

    // PRIORITY 3: #pragma unroll (full, no count)
    if (hasMetadata(L, "llvm.loop.unroll.full")) {
        if (TripCount > 0 && TripCount <= UP->FullUnrollMaxCount) {
            int estSize = UP->FixedCost + TripCount * (BodySize - UP->FixedCost);
            if (estSize <= effectiveThreshold) {
                if (simulateLoopBody(L, TripCount, dynamicBudget))
                    { UP->Count = TripCount; return FULL_UNROLL; }
            }
        }
        // fallthrough to lower priorities
    }

    // PRIORITY 4: Loop peeling
    int peelCount = computePeelCount(L, SE, UP);
    if (peelCount > 0) {
        UP->PeelCount = peelCount;
        UP->Count = 1;
        return PEEL;
    }

    // PRIORITY 5: Static partial unrolling (known trip count)
    if (TripCount > 0 && (UP->AllowPartial || pragmaOversize) && isInnermost(L)) {
        int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;

        // Size clamp
        if (UP->PartialThreshold < UP->FixedCost + count * (BodySize - UP->FixedCost))
            count = (UP->PartialThreshold - UP->FixedCost) / (BodySize - UP->FixedCost);
        count = min(count, UP->MaxCount);

        // Power-of-two + trip-divisible search
        while (count > 0) {
            if (TripCount % count == 0 && isPowerOfTwo(count))
                break;
            count--;
        }

        // Fallback: halve DefaultUnrollCount until it fits
        if (count == 0 && UP->UserProvidedCount) {
            count = UP->DefaultUnrollCount;
            while (UP->PartialThreshold <
                   UP->FixedCost + count * (BodySize - UP->FixedCost))
                count >>= 1;
        }

        if (count > 1) { UP->Count = count; return PARTIAL_UNROLL; }
    }

    // PRIORITY 6: Runtime unrolling (unknown trip count)
    if (!hasMetadata(L, "llvm.loop.unroll.runtime.disable")
        && RuntimeUnrollThreshold >= BodySize
        && isInnermost(L)) {

        int rtTripCount = computeRuntimeTripCount(L, SE);
        if (rtTripCount < FlatLoopTripCountThreshold) return NO_UNROLL;

        int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
        // same halving + threshold logic as Priority 5
        while (UP->PartialThreshold <
               UP->FixedCost + count * (BodySize - UP->FixedCost))
            count >>= 1;
        count = min(count, UP->MaxCount);

        if (count > 1) {
            UP->Count = count;
            *AllowRuntime = true;
            return RUNTIME_UNROLL;
        }
    }

    // Small-function override (tiny kernels get aggressive unrolling)
    if (functionInstructionCount < SmallFunctionThreshold)
        return handleSmallFunction(L, UP, BodySize);

    return NO_UNROLL;
}

Local Array Heuristic

The function sub_19B5DD0 (computeLocalArraySize) is entirely NVIDIA-specific. It scans every basic block in the loop for load/store instructions that access address space 5 (GPU local memory). For each such access, it traces back to the underlying alloca, determines the array type, and computes the product of array dimensions. If any dimension is unknown at compile time, it substitutes the unroll-assumed-size knob (default 4). The returned value is the maximum local-array size found across all accesses.

This value becomes a threshold multiplier, capped at 6:

int computeLocalArraySize(Loop *L) {
    int maxSize = 0;
    for (BasicBlock *BB : L->blocks()) {
        for (Instruction &I : *BB) {
            if (!isLoadOrStore(I) || getAddressSpace(I) != 5) continue;
            Value *base = getUnderlyingAlloca(I);
            if (!base || !isArrayType(base->getType())) continue;
            int size = 1;
            for (int dim : getArrayDimensions(base))
                size *= (dim > 0) ? dim : UnrollAssumedSize;  // default 4
            maxSize = max(maxSize, size);
        }
    }
    return maxSize;
}

The rationale: GPU kernels frequently use __shared__ or local arrays indexed by threadIdx. Unrolling such loops by a factor proportional to the array size enables register promotion of individual array elements and eliminates bank-conflict-prone access patterns. The cap at 6 prevents pathological explosion when arrays are large.
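The clamp can be shown with a worked example: a hypothetical loop over a local float a[3][5] yields an element count of 15, which the cap reduces to 6. A sketch of the clamp (not the binary's exact code):

```c
#include <assert.h>

/* Threshold multiplier from the largest local-array element count,
   clamped into [1, 6] as described above. */
static int threshold_multiplier(int max_local_array_size) {
    int m = max_local_array_size < 1 ? 1 : max_local_array_size;
    return m > 6 ? 6 : m;
}
```

A base Threshold of 300 thus becomes an effective budget of up to 1800 for loops touching sizable local arrays.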

Power-of-Two Factor Enforcement

The partial-unroll factor search at Priority 5 requires the chosen count to satisfy two constraints simultaneously: it must evenly divide the trip count and must be a power of two. The implementation uses the classic bitmask test:

while (count > 0) {
    if (tripCount % count == 0 && (count & (count - 1)) == 0)
        break;
    count--;
}

This is a GPU-specific requirement. Warp size is 32 (a power of two), and many GPU memory access patterns, shared-memory bank calculations, and reduction operations assume power-of-two alignment. An unroll factor of, say, 6 would create asymmetric loop bodies that interact poorly with warp-level execution.
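A standalone version of the search makes its behavior easy to check (a sketch mirroring the loop above; find_partial_factor is a made-up name):

```c
#include <assert.h>

static int is_pow2(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }

/* Largest count <= start that both divides the trip count and is a
   power of two; degenerates to 1 (no unrolling) in the worst case. */
static int find_partial_factor(int trip_count, int start) {
    for (int count = start; count > 0; count--)
        if (trip_count % count == 0 && is_pow2((unsigned)count))
            return count;
    return 0;
}
```

For a trip count of 24 starting at 8 the search keeps 8; for a trip count of 6 it falls all the way to 2, since 4 does not divide 6 and neither 6 nor 3 is a power of two.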

Pragma Handling

The frontend (sub_9305A0 / emitUnrollPragma) translates CUDA pragmas to LLVM metadata during codegen:

| CUDA Source | LLVM Metadata |
|---|---|
| #pragma unroll (bare) | !{!"llvm.loop.unroll.full"} |
| #pragma unroll N (N > 1) | !{!"llvm.loop.unroll.count", i32 N} |
| #pragma unroll 1 | Disables unrolling at Priority 2 |

The metadata is attached to the backedge branch as a self-referential !llvm.loop node. A guard flag (dword_4D046B4) skips pragma processing entirely in fast-codegen mode.

The pragma threshold is 32768 (0x8000), compared to upstream LLVM's 16384 (0x4000). This means #pragma unroll succeeds on loop bodies up to approximately 32K cost units -- covering virtually any realistic GPU kernel loop. When even this generous budget is exceeded, the decision engine falls through to lower priorities and attempts partial unrolling.

The __launch_bounds__ attribute does not directly feed the unroll decision. Instead, it constrains register allocation downstream, which indirectly limits the benefit of aggressive unrolling. There is no feedback loop from register pressure estimation back into the unroll factor at this stage of the pipeline; that coordination happens implicitly through the PartialThreshold provided by TTI.

Runtime Unrolling

Runtime unrolling (Priority 6) handles loops whose trip count is unknown at compile time. cicc enables it by default (unroll-runtime = true), with several GPU-specific twists:

Convergent instruction support. The knob unroll-runtime-convergent (default true, NVIDIA-specific) allows unrolling loops that contain convergent operations like warp-level primitives (__shfl_sync, __ballot_sync). Upstream LLVM refuses to unroll such loops because it cannot guarantee all threads in the warp execute the same iterations. cicc overrides this, relying on the waterfall-epilogue mechanism to preserve convergence.

Epilog vs. prolog remainder. The choice is controlled by a cascade:

  1. If waterfall-unrolling-force-epilogue is true (default, NVIDIA-specific) and the loop has runtime trip count: epilog mode is selected.
  2. If the loop body contains function calls (hasCallInLoop / sub_2A10B40 checks for opcode 17): epilog mode is forced. This preserves the property that all threads in a warp participate in calls, which matters for convergent operations.
  3. Otherwise, unroll-runtime-epilog (default false) determines the mode.

In practice, GPU loops almost always use epilog-style remainders.
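At the source level, epilog-style runtime unrolling by 4 has the following shape (an illustrative scalar sketch, not compiler output; saxpy_unrolled is a made-up name):

```c
#include <assert.h>

/* y[i] += a * x[i], runtime-unrolled by 4 with an epilog remainder:
   the main loop runs whole chunks of 4; the trailing loop handles
   the n % 4 leftover iterations. */
void saxpy_unrolled(float *y, const float *x, float a, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* unrolled-by-4 main loop */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)            /* epilog remainder loop */
        y[i] += a * x[i];
}
```

A prolog-style remainder would instead run the n % 4 iterations before the main loop; epilog style keeps the hot path first, which is the shape GPU loops get by default.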

Flat-loop exclusion. If the estimated runtime trip count is below flat-loop-tripcount-threshold (default 5), runtime unrolling is skipped. The overhead of generating the modulo check and epilog loop is not worth it for loops that iterate fewer than 5 times.

Body size gate. Runtime unrolling only proceeds if runtime-unroll-threshold (default 95) is greater than or equal to the loop body size. This is more conservative than the static partial-unroll threshold, preventing code explosion for large loop bodies when the trip count is unknown.
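Together, the two gates reduce to a simple predicate (a sketch using the defaults stated above, not the binary's actual control flow):

```c
#include <assert.h>

/* Runtime-unroll gating: skip short loops (flat-loop exclusion) and
   large bodies (body size gate). Assumed defaults from the text:
   flat-loop-tripcount-threshold = 5, runtime-unroll-threshold = 95. */
static int runtime_unroll_allowed(int est_trip_count, int body_size) {
    if (est_trip_count < 5) return 0;  /* flat-loop exclusion */
    if (body_size > 95)     return 0;  /* body size gate */
    return 1;
}
```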

Thresholds: NVIDIA vs. Upstream LLVM

| Parameter | Upstream LLVM (O3) | Upstream LLVM (NVPTX TTI) | cicc v13.0 |
|---|---|---|---|
| Threshold | 300 | 300 | From TTI (300), then multiplied by local-array factor (1-6x) |
| PartialThreshold | 150 | 75 (Threshold / 4) | From TTI (75), plus local-array scaling |
| MaxPercentThresholdBoost | 400% | 400% | 400% (same) |
| PragmaUnrollThreshold | 16384 | 16384 | 32768 |
| RuntimeUnrollThreshold | -- | -- | 95 (NVIDIA addition) |
| FlatLoopTripCountThreshold | 5 | 5 | 5 (same) |
| MaxUpperBound | 8 | 8 | 8 (same) |
| MaxPragmaUpperBound | -- | -- | 64 (NVIDIA addition) |
| DefaultUnrollRuntimeCount | 8 | 8 | From TTI |
| AllowPartial | false | true | true (from TTI) |
| Runtime | false | true | true (from TTI) |
| AllowRemainder | true | true | true |
| MaxIterationsCountToAnalyze | 10 | 10 | 10 (same) |
| UnrollAssumedSize | -- | -- | 4 (NVIDIA addition) |

The critical differences: cicc doubles the pragma threshold, introduces a body-size gate for runtime unrolling (95), adds the local-array multiplier (up to 6x on base thresholds), and enforces power-of-two partial factors. The upstream NVPTX TTI enables partial and runtime unrolling but leaves thresholds at modest CPU-oriented values; cicc's decision engine applies substantial additional logic on top.

Interaction with Loop Vectorization

In the cicc pipeline, loop vectorization (LoopVectorizePass) runs before the first unroll invocation. Specifically, sub_197E720 combines both vectorization and unrolling decisions in the early pipeline slot. The vectorizer decides the vector width first (VF), and if it applies a transformation, the resulting loop (possibly with a scalar epilog) is then presented to the unroller.

This means vectorization and unrolling do not "coordinate" in the planning sense -- the vectorizer runs to completion before the unroller sees the loop. However, the vectorizer's interleave count (IC) serves a similar role to unrolling: it replicates the vectorized loop body to increase ILP. When the vectorizer chooses IC > 1, the subsequent unroller typically finds the loop body too large to unroll further, producing a de facto coordination through cost thresholds.

The second unroll invocation (sub_19C1680) runs much later, after InstCombine, SROA, and EarlyCSE have had a chance to simplify the vectorized code. Loops that were too large to unroll earlier may become eligible after dead code elimination within the unrolled-and-vectorized body.

The Transformation Engine: UnrollLoop

The transformation at sub_2A15A20 takes a loop and an unroll factor and physically duplicates the loop body. It is structurally close to upstream llvm::UnrollLoop with the following entry guards:

  1. Loop must have a preheader (sub_D4B130)
  2. Loop must have a single latch (sub_D47930)
  3. Loop must be in LCSSA form (sub_D49210)
  4. Header flags must be clean (no special bits set)

The duplication proceeds by iterating Count - 1 times, each iteration cloning every basic block in the loop body, remapping instructions through a value map, and rewiring PHI nodes so that iteration i's latch feeds iteration i+1's header. After all copies, the backedge of the last copy is reconnected to the first copy's header (for partial unroll) or removed entirely (for full unroll).

For partial unrolls where TripCount % Count != 0, a remainder loop is generated by sub_2A23640. If remainder generation fails (e.g., multi-exit loops), the engine delegates to sub_2A25260 which generates the runtime-check variant with prologue/epilogue.

The return value encodes the result: 0 = no change, 1 = partial unroll, 2 = full unroll.

Configuration Knobs

Standard LLVM Knobs (with NVIDIA defaults)

| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-threshold | From TTI | sub_19B7760 struct | Base cost budget for full unroll |
| unroll-partial-threshold | From TTI | 0x4FB3140 area | Cost budget for partial unroll |
| unroll-max-percent-threshold-boost | 400 | dword_4FB3100 | Max dynamic cost boost (%) |
| unroll-max-iteration-count-to-analyze | 10 | dword_4FB3020 | Max iterations for cost simulation |
| unroll-count | Unset | dword_4FB2EA8 | Force specific unroll factor |
| unroll-max-count | Unset | sub_19B7760 struct | Hard cap on unroll factor |
| unroll-full-max-count | Unset | 0x4FB2CE0 area | Max trip count for full unroll |
| unroll-peel-count | Unset | 0x4FB2C00 area | Force specific peel count |
| unroll-allow-partial | false | 0x4FB2B20 area | Enable partial unrolling override |
| unroll-allow-remainder | false | 0x4FB2A40 area | Enable remainder loop generation |
| unroll-runtime | true | 0x4FB2960 area | Enable runtime (dynamic TC) unrolling |
| unroll-max-upperbound | 8 | dword_4FB2920 | Max trip count for upper-bound unroll |
| pragma-unroll-threshold | 32768 | dword_4FB2760 | Cost budget for pragma-directed unrolls |
| flat-loop-tripcount-threshold | 5 | 0x4FB2680 area | Min estimated TC for runtime unroll |
| runtime-unroll-threshold | 95 | dword_4FB3560 | Max body size for runtime unroll |
| max-pragma-upperbound-unroll | 64 | dword_4FB2840 | Max upper-bound factor for pragma |
| unroll-assumed-size | 4 | dword_4FB33A0 | Assumed array size for unknown dims |

NVIDIA-Specific Knobs

| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-runtime-convergent | true | 0x500A440 area | Allow unrolling loops with convergent ops |
| unroll-runtime-epilog | false | qword_500A3E8 | Force epilog-style remainder (override) |
| waterfall-unrolling-force-epilogue | true | qword_500A148 | Force epilog for waterfall patterns |

Knobs are registered in two constructors: standard LLVM knobs in ctor_216_0 at 0x4E5C30, NVIDIA-specific knobs in ctor_501 at 0x559890.

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| emitUnrollPragma | 0x09305A0 | -- | Frontend: #pragma unroll to metadata |
| parseUnrollMetadata | 0x19B4C50 | -- | Reads llvm.loop.unroll.* metadata |
| computeLocalArraySize | 0x19B5DD0 | -- | NVIDIA: local array threshold heuristic |
| handleSmallFunction | 0x19B6500 | -- | Special aggressive unroll for tiny kernels |
| selectUnrollFactor | 0x19B6690 | -- | Trip count analysis helper |
| emitRemainderNotAllowedRemark | 0x19B78B0 | -- | Diagnostic emission |
| simulateLoopBody | 0x19B9A90 | -- | Dynamic cost simulation with constant folding |
| computeUnrollCount | 0x19BB5C0 | -- | Main decision engine |
| tryToUnrollLoop | 0x19BE360 | -- | Top-level driver |
| computePeelCount | 0x1B0B080 | -- | Loop peeling logic |
| computeRuntimeTripCount | 0x1B18810 | -- | Runtime trip count estimation |
| hasCallInLoop | 0x2A10B40 | -- | Checks for call/invoke in loop body |
| createSideExitPHI | 0x2A10DD0 | -- | PHI nodes for side-exit unrolled loops |
| cloneInstructionsInBlock | 0x2A12AD0 | -- | Instruction-level cloning |
| reconcileLoopAfterUnroll | 0x2A13F00 | -- | Post-unroll SCEV/LoopInfo fixup |
| UnrollLoop | 0x2A15A20 | -- | Main transformation engine |
| unrollCostModel | 0x2A1AA10 | -- | Cost estimation helper |
| UnrollAndJamLoop | 0x2A1CF00 | -- | Unroll-and-jam variant |
| generateRemainderLoop | 0x2A23640 | -- | Remainder loop construction |
| UnrollLoopWithRuntimeChecks | 0x2A25260 | -- | Prologue/epilogue generation |

Pass Factory and Object Layout

The LoopUnroll pass factory at sub_19B73C0 was originally misidentified as LICM in the P2C.3 sweep due to binary adjacency with the actual LICM pass. The vtable at unk_4FB224C, the 7-parameter constructor signature, and diagnostic function strings all confirm LoopUnroll identity.

The pass factory at sub_19B73C0 allocates a 184-byte pass object and accepts seven parameters that control unroll behavior. When a parameter is -1, the pass uses its compiled-in default.

Constructor Parameters

| Parameter | Offset | Enable Flag | Semantics |
|---|---|---|---|
| a1 (optimization level) | +156 | -- | 2 = standard, 3 = aggressive |
| a2 (unroll threshold) | +168 | +172 | Trip count threshold; -1 = use default |
| a3 (unroll count) | +160 | +164 | Explicit unroll factor; -1 = use default |
| a4 (allow partial) | +176 | +177 | 0 = disable partial unroll, 1 = enable |
| a5 (runtime unroll) | +178 | +179 | 0 = disable runtime unroll, 1 = enable |
| a6 (upper bound) | +180 | +181 | 0 = disable upper-bound unroll, 1 = enable |
| a7 (profile-based) | +182 | +183 | 0 = disable profile-guided unroll, 1 = enable |

Object Construction

The factory allocates 184 bytes via sub_22077B0, sets the vtable to off_49F45F0 (loop-unroll pass vtable), stores pass ID unk_4FB224C at offset +16, initializes self-referential linked-list pointers at offsets +80/+88 and +128/+136, sets pass type 2 (FunctionPass) at offset +24, and calls sub_163A1D0 / sub_19B71A0 for pass registration.

Pipeline Invocation Configurations

CICC invokes LoopUnroll with six distinct configurations at different pipeline stages, reflecting NVIDIA's careful tuning of unroll aggressiveness per compilation phase. These are the factory-level parameter sets passed to sub_19B73C0; see also the decision engine's per-invocation behavior in The Decision Engine above.

Configuration A: Standard Pipeline (O1/O2)

Call site: sub_12DE330

LoopUnroll(2, -1, -1, -1, -1, -1, -1)

All parameters at defaults. Standard unrolling with default thresholds at optimization level 2.

Configuration B: Code-Size Mode

Call site: sub_12DE8F0, when *(a3+4480) < 0 (NVIDIA code-size flag set)

LoopUnroll(a2, -1, -1, 0, 0, 0, 0)

All unrolling features disabled: partial, runtime, upper-bound, and profile-based are all zeroed. The pass only unrolls when the trip count is statically known and the benefit is certain. This reflects the constraint that GPU register pressure makes speculative unrolling expensive when code size matters.

Configuration C: Normal Optimizer

Call site: sub_12DE8F0, when *(a3+4480) >= 0 (normal mode)

LoopUnroll(a2, -1, -1, -1, -1, -1, -1)

Fully aggressive unrolling with all defaults. The optimization level is passed through from the caller.

Configuration D: Late Pipeline (Conservative)

Call site: sub_12DE8F0, late pipeline position

LoopUnroll(a2, -1, -1, 0, 0, -1, -1)

Partial and runtime unrolling disabled, but upper-bound and profile-based unrolling retain their defaults. This conservative late-pipeline configuration avoids creating new runtime overhead in code that has already been substantially optimized.

Configuration E: Aggressive Pipeline (O3)

Call site: sub_12E54A0

LoopUnroll(3, -1, -1, 0, 0, -1, 0)

Optimization level 3 with aggressive thresholds, but partial, runtime, and profile-based unrolling are disabled. Only upper-bound unrolling retains its default. The rationale is that at O3, the higher thresholds already capture most profitable unrolling opportunities without needing speculative runtime checks.

Configuration F: User-Configured

Call site: sub_12EA3A0

LoopUnroll(a1[4], a1[5], a1[6], a1[7], a1[8], a1[9], a1[10])

All seven parameters are read from a stored configuration object, enabling user-specified unroll behavior via command-line flags or pragmas.

Threshold Initialization (Pass-Level)

The function sub_19B6690 (17 KB) configures unroll thresholds based on optimization level and LLVM knobs at pass construction time. These values feed into the UnrollParams struct consumed by the decision engine.

Default Threshold Values

| Offset | Field | Default (O2+) | Default (O1) |
|---|---|---|---|
| +0 | OptThreshold | 405 | 150 |
| +4 | Threshold | 400 | 400 |
| +12 | SmallTripCountThreshold | 150 | 150 |
| +56 | MaxIterationsCountToAnalyze | 60 | 60 |

Function-Attribute-Aware Override

The threshold initializer queries function attributes via sub_1560180:

  • Attribute ID 34 (minsize): Reduces OptThreshold to SmallTripCountThreshold (150).
  • Attribute ID 17 (optsize): Same reduction.

This means kernels annotated with size constraints get conservative unroll thresholds regardless of the global optimization level.

Per-Function Knob Override via BST

The function queries the LLVM option registry (dword_4FA0208 BST) ten times, each time looking up a different knob address. For each knob, it searches the BST rooted at dword_4FA0208[2], compares the current function hash (sub_16D5D50) against node ranges, and applies the override if the knob value meets the threshold. The knob-to-field mapping:

| Knob Address | Override Address | Field |
|---|---|---|
| dword_4FB3228 | dword_4FB32C0 | OptThreshold (+0) |
| dword_4FB3148 | dword_4FB31E0 | SmallTripCountThreshold (+12) |
| dword_4FB3068 | dword_4FB3100 | Threshold (+4) |
| dword_4FB2DC8 | dword_4FB2E60 | field +32 |
| dword_4FB2CE8 | dword_4FB2D80 | field +36 |
| dword_4FB2C08 | dword_4FB2CA0 | field +24 |
| dword_4FB2B28 | (next value) | field +40 |

The per-function BST lookup keyed by function hash enables fine-grained tuning of unroll behavior per kernel, a capability not present in upstream LLVM.

Diagnostic Functions

Three diagnostic emission functions produce optimization remarks:

| Function | Address | Diagnostic |
|---|---|---|
| emitPragmaCountDiag | sub_19B78B0 | Reports when pragma unroll count conflicts with trip multiple |
| emitThresholdDiag | sub_19B7B10 | Reports when unrolled size exceeds threshold |
| emitLoopSizeDiag | sub_19B7D80 | Reports when loop body is too large to unroll |

Main Loop Processing and Hash Infrastructure

The primary analysis function sub_19B7FA0 (11 KB) analyzes each candidate loop. The pass uses hash table infrastructure shared with other CICC LLVM passes:

| Function | Address | Size | Role |
|---|---|---|---|
| rehashSmallTable | sub_19B60B0 | 5 KB | Small hash table resize |
| rehashTable | sub_19B8820 | 4 KB | Key-value hash table resize |
| rehashSet | sub_19B89E0 | 7 KB | Set hash table resize |
| insertIntoSet | sub_19B8DA0 | -- | Set insert with growth |
All hash tables use the same (value >> 9) ^ (value >> 4) hash function and linear probing strategy found throughout CICC's LLVM passes. See Hash Infrastructure for the common implementation.
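The shared hash can be reproduced in a few lines (a sketch; reducing the hash to a bucket index with a mask assumes a power-of-two table size, as linear-probing tables conventionally use):

```c
#include <assert.h>

/* The (value >> 9) ^ (value >> 4) hash shared by CICC's pass-local
   hash tables, reduced to a bucket index for linear probing. */
static unsigned bucket_index(unsigned long value, unsigned table_size) {
    unsigned h = (unsigned)((value >> 9) ^ (value >> 4));
    return h & (table_size - 1);  /* table_size assumed a power of two */
}
```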

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pragma threshold | PragmaUnrollThreshold default 16384 | PragmaUnrollThreshold = 32768, double upstream's default; enables aggressive pragma-directed unrolling for GPU kernels |
| Power-of-two enforcement | No power-of-two requirement; any profitable factor accepted | Enforces power-of-two unroll factors; non-power-of-two factors are rounded down to avoid irregular loop tails |
| Local array multiplier | No concept of local array bonus | Dedicated local-array threshold multiplier boosts unroll budget when loop body accesses alloca/.local arrays indexed by IV, enabling register promotion |
| Decision engine | ~20 KB computeUnrollCount | Substantially reworked 50 KB computeUnrollCount (sub_19BB5C0) with 6-level priority cascade and GPU-specific occupancy heuristics |
| Register pressure model | Generic TTI-based unroll cost; no occupancy concept | Occupancy-aware cost model considers register pressure cliffs where one additional register per thread drops warp occupancy |
| Pipeline invocations | Single invocation in optimization pipeline | Two invocations: early (interleaved with vectorization) and late (cleanup, gated by opts[1360] / nv-disable-loop-unrolling) |
| Transformation engine | Stock llvm::UnrollLoop | Lightly modified UnrollLoop (sub_2A15A20, 85 KB); decision engine is where the changes concentrate |

Test This

The following kernel contains a simple counted loop that is a prime candidate for full unrolling. Compile and compare PTX output with and without #pragma unroll.

__global__ void unroll_test(float* out, const float* in) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll
    for (int i = 0; i < 8; i++) {
        sum += in[tid + i * 128];
    }
    out[tid] = sum;
}

What to look for in PTX:

  • With #pragma unroll: the loop should be fully unrolled into 8 sequential ld.global.f32 + add.f32 sequences with no backedge branch. Look for the absence of bra instructions targeting a loop header and the presence of 8 distinct ld.global.f32 instructions with addresses offset by 128*sizeof(float).
  • Without #pragma unroll (remove the pragma): the compiler may still unroll if the estimated unrolled size (FixedCost plus 8 copies of the per-iteration marginal cost) fits within the threshold (default 300, scaled by any local-array multiplier). Check whether the PTX has a loop or is fully unrolled -- this exercises the automatic decision engine.
  • With #pragma unroll 1: the loop must remain as a counted loop with a backedge branch. This tests that pragma disabling works.
  • Compare .nreg values across the three variants. Full unrolling increases register pressure (8 loads live simultaneously); the partial or no-unroll variant uses fewer registers at the cost of loop overhead.
  • The power-of-two enforcement is visible when the trip count is not a power of two: change the loop bound to 6 and check whether the compiler partially unrolls by 2 (the largest power of two that divides the trip count) rather than 6.

Cross-References