
LoopVectorize and VPlan (GPU-Adapted)

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Upstream source: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, llvm/lib/Transforms/Vectorize/VPlan*.cpp (LLVM 20.0.0). VPlan infrastructure lives in llvm/lib/Transforms/Vectorize/VPlan.cpp, VPlanRecipes.cpp, VPlanTransforms.cpp, and related files.

LLVM version note: CICC v13.0 is based on LLVM 20.0.0 trunk. Evidence includes histogram-pattern support (merged in LLVM 19), early-exit vectorization (LLVM 20 experimental feature, gated by byte_500CDA8), and the VPlan-native path. The VPlan object size (656 bytes) is consistent with LLVM 17/18+ layout. Scalable vectors are always disabled for NVPTX.

NVIDIA's cicc ships a heavily modified copy of LLVM's LoopVectorizePass, the single largest pass in the vectorization pipeline at 88 KB of decompiled output (2,612 lines in sub_2AF1970). The modifications do not change the pass's fundamental architecture -- it still builds VPlans, selects a vectorization factor (VF) through cost modeling, and transforms IR through VPlan execution -- but the cost model, VF selection heuristics, interleave count logic, and legality checker are all tuned for a target where "vectorization" means something fundamentally different from what it means on a CPU.

On a CPU, loop vectorization fills SIMD lanes: a VF of 4 on SSE processes four float elements per vector instruction. On an NVIDIA GPU, there are no SIMD lanes in the CPU sense -- each thread already executes scalar code, and the warp executes 32 threads in lockstep. The reasons to vectorize on GPU are:

  • Memory coalescing -- adjacent threads issuing adjacent loads produce 128-byte cache line transactions, and vectorizing a per-thread loop body with VF=2 or VF=4 produces ld.v2/ld.v4 wide loads that maximize bytes-per-transaction.
  • Instruction count reduction -- a single ld.global.v4.f32 replaces four ld.global.f32 instructions, saving fetch/decode/issue bandwidth.
  • Register-to-memory width matching -- PTX supports 32-, 64-, and 128-bit load/store widths, and vectorization widens narrow scalar accesses to fill these naturally.

Key Facts

| Property | Value |
| --- | --- |
| Registration | New PM #400, parameterized: no-interleave-forced-only;... |
| Runtime positions | Not in Tier 0/1/2/3 tables; invoked via LLVM standard sub-pipeline sub_1A62BF0 when vectorization is enabled (see Pipeline) |
| Main entry point | sub_2AF1970 (0x2AF1970) -- LoopVectorizePass::processLoop() |
| Binary size | 88 KB decompiled, 2,612 lines |
| VPlan builder | sub_2AEE460 (0x2AEE460) -- tryToBuildVPlanWithVPRecipes(), 56 KB |
| VPlan object size | 656 bytes (0x290), consistent with LLVM 17/18 layout |
| LLVM base | LLVM 20 trunk (evidence: histogram-pattern support, early-exit vectorization, VPlan-native path) |
| Scalable vectors | Always disabled -- sub_DFE610 returns false for NVPTX |
| Register bit width (TTI) | 32 bits fixed (TypeSize::getFixed(32) in upstream NVPTXTTIImpl) |
| Pass name string | "vectorize-loops" at 0x439F095 |
| Address cluster | 0x2AA0000--0x2C20000 (loop vectorizer + VPlan infrastructure) |

Why Vectorize on GPU

GPU vectorization is not about filling SIMD lanes -- the SIMT model already replicates scalar code across 32 threads. Vectorization targets three orthogonal benefits related to memory coalescing and instruction throughput:

Memory coalescing width. The GPU memory subsystem services requests in 128-byte transactions. If a single thread's inner loop accesses 4 consecutive floats in sequence, those 4 accesses become 4 separate scalar loads issued over 4 iterations. Vectorizing with VF=4 converts them into one ld.global.v4.f32, which the memory subsystem can service in a single wider transaction per thread. Across the warp, this multiplies the effective memory bandwidth.

Instruction count reduction. PTX's ld.v2 and ld.v4 instructions load 2 or 4 elements with a single instruction. The instruction issue pipeline has finite throughput (typically 1-2 instructions per clock per scheduler), so halving instruction count directly improves throughput-bound kernels.

Register-width matching. PTX has 32-bit typed registers. A 128-bit ld.v4.f32 loads directly into four consecutive registers via a single instruction, which is strictly better than four separate 32-bit loads (each requiring its own address computation).

These benefits are bounded by register pressure -- the primary constraint that does not exist on CPU. On a GPU, every additional register per thread can cross an occupancy cliff, potentially losing an entire warp group. A VF=4 vectorization that quadruples the live register count may halve occupancy and lose net throughput.
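The three benefits can be put in rough numbers. The following toy model (assumed 32-thread warps and fully coalesced accesses; `load_stats` and its field names are invented for this sketch, not recovered from the binary) counts load instructions per thread and the bytes one warp-wide request covers:

```python
WARP_SIZE = 32

def load_stats(elem_bytes: int, vf: int) -> dict:
    """Toy model: load instructions issued per thread and bytes covered by
    one warp-wide request, for scalar vs. vectorized access (assumes fully
    coalesced, consecutive per-thread elements)."""
    return {
        "scalar": {
            "instructions": vf,                           # one ld per element
            "bytes_per_request": WARP_SIZE * elem_bytes,  # e.g. 128 B for f32
        },
        "vector": {
            "instructions": 1,                            # single ld.v{vf}
            "bytes_per_request": WARP_SIZE * elem_bytes * vf,
        },
    }
```

For float (4-byte) elements at VF=4, the scalar form issues four loads per thread while the vector form issues one ld.global.v4.f32 whose warp-wide request covers 512 bytes.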

The 8-Phase Pipeline

The main function sub_2AF1970 implements eight phases, closely following upstream LLVM's structure but with GPU-specific decision points at each stage.

Phase 1: Legality Pre-Check

sub_31A4FD0(legalityCtx, Loop, Function, ORE, SE)    // init legality scratch
TTI = *(**(Loop+32) + 72)                             // Loop->getHeader()->getParent()->getTTI()
if (!sub_31A91F0(legalityCtx, TTI, Loop, LoopInfo))   // canVectorize() quick check
    return false

sub_31AF060(costCtx, ForceVectorization)              // canVectorize() full check
// ForceVectorization = qword_500D340[17] ("vectorize-loops" knob)

The legality checker (sub_31AF060) performs standard LLVM legality analysis: loop simplify form, single exit, computable backedge-taken count, no irreducible control flow. The NVIDIA-specific addition is early-exit loop handling:

if (hasUncountableEarlyExit && !byte_500CDA8)         // -enable-early-exit-vectorization
    emit "UncountableEarlyExitLoopsDisabled"
    return false

This knob (byte_500CDA8) gates an LLVM 20 feature that NVIDIA includes but disables by default. Early-exit vectorization requires predicated execution, which on GPU means divergent warps -- typically unprofitable.

Phase 2: Outer vs Inner Loop Dispatch

if (Loop->getSubLoops().size() > 0)
    goto outerLoopPath                                // PATH A (rarely taken on GPU)
else
    goto innerLoopPath                                // PATH B (the main path)

Outer loop vectorization is controlled by byte_500D208 (-force-vector-width-outer). When enabled and TTI-based VF selection returns VF <= 1, the pass forces VF=4 -- a hardcoded NVIDIA override for kernel patterns where outer-loop vectorization benefits warp-level memory access patterns. In practice, inner loop vectorization (Path B) handles the vast majority of GPU kernels.

Phase 3: Trip Count and Safety Checks

tripCount = getSmallBestKnownTC(PSE, Loop)            // sub_2AA7EC0
if (tripCount < VectorizerMinTripCount                // dword_500EAE8
    && !isForceVectorize(legalCtx)
    && !(exactTC >= userVF))
    emit "LowTripCount"
    reduce hint to interleave-only

if (hasAttribute(TTI, NoImplicitFloat))               // attribute 30
    bail "NoImplicitFloat"

if (hasUnsafeFPOps && !canReorderFP(override))
    bail "UnsafeFP" / "CantReorderFPOps"

The FP reorder safety check has an override mechanism: dword_500D508 selects whether the override is active, and byte_500D588 provides the override value. This lets NVIDIA force-allow FP reordering for specific compilation modes (e.g., -ffast-math propagated from nvcc).

Phase 4: VF Selection

This is where NVIDIA diverges most from upstream. The upstream algorithm queries TTI::getRegisterBitWidth() which returns the vector register width (256 for AVX2, 512 for AVX-512), then computes VF = registerWidth / elementSize. On NVPTX, getRegisterBitWidth() returns 32 -- a single scalar register width. This means the upstream formula would always produce VF=1 for 32-bit types.
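The upstream formula can be sketched in a few lines (`upstream_vf` is a hypothetical helper illustrating why the plain formula pins 32-bit types at VF=1 on NVPTX):

```python
def upstream_vf(register_bits: int, elem_bits: int) -> int:
    """Upstream-style VF selection: how many elements fit in one vector
    register (floored, never below 1)."""
    return max(1, register_bits // elem_bits)

# AVX2 (256-bit registers): eight floats per register
assert upstream_vf(256, 32) == 8
# NVPTX reports a 32-bit register width, so 32-bit elements never widen
assert upstream_vf(32, 32) == 1
# Only sub-word types widen under the plain formula
assert upstream_vf(32, 8) == 4
```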

NVIDIA's VF selection (sub_2AB8AC0 for outer loops, sub_2AE08E0 for inner loops via VPlan cost) works differently:

// sub_2AB8AC0 — outer loop VF selection (simplified)
elementBits = getWideningElementSize(CostModel)       // sub_2AB4370: top 32 bits
regWidth    = TTI.getRegisterBitWidth(Vector)          // sub_DFE640: returns 32
VF          = regWidth / (elementBits / 8)

if (!isScalable && VF <= 1 && forceOuterMode)          // byte_500D208
    VF = 4                                             // NVIDIA hardcoded override

For inner loops (the common path), VF selection goes through the full VPlan cost model:

// sub_2AE08E0 — selectBestVF() from VPlan candidates
bestCost = INT64_MAX
for each VPlan in candidatePlans:
    for each VF in VPlan.VFRange:
        cost = computeCostForVF(VPlan, VF)             // sub_2AE0750
        if isBetterThan(cost, bestCost):               // sub_2AB3FE0
            bestVF = VF
            bestCost = cost
return {bestVF, isScalable, bestIC}

The cost accumulation uses saturating arithmetic -- __OFADD__ overflow detection clamping to INT64_MAX/INT64_MIN -- preventing wrap-around in cost comparisons. This is defensive engineering for GPU kernels with very large loop bodies where naive summation could overflow.
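A minimal sketch of that clamping behavior (Python integers do not wrap, so the comparison against the 64-bit bounds is made explicit; this mirrors the described behavior, not the binary's code):

```python
INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)

def sat_add(a: int, b: int) -> int:
    """Saturating 64-bit add: clamp to the representable range instead of
    wrapping on overflow, as the __OFADD__ check above does."""
    s = a + b
    if s > INT64_MAX:
        return INT64_MAX
    if s < INT64_MIN:
        return INT64_MIN
    return s
```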

Phase 5: Cost Model Construction

The cost model object (sub_2AB2780, 16 parameters) assembles all analysis results into a single context:

CostModel = {
    Loop*, DominatorTree*, LoopBlocksRPO*, ScalarEvolution*,
    TargetLibraryInfo*, AssumptionCache*, PredicatedScalarEvolution*,
    ValuesToIgnore=0, ORE*,
    /* additional context fields */
}

The VPlan planner (sub_2AF13F0) generates VPlans for all candidate VFs, then sub_2AE08E0 selects the best one. Each VPlan recipe provides its own cost through the virtual getVPCost(VF, CostCtx) method, which delegates to NVPTXTargetTransformInfo for GPU-specific instruction costs.

Phase 6: Profitability Decision and Interleave Selection

After VF selection, the pass evaluates a decision matrix:

| Condition | Result |
| --- | --- |
| VF=1, not scalable | VectorizationNotBeneficial -- bail |
| IC=1 but user wanted more | InterleavingNotBeneficial |
| IC>1 but user disabled | InterleavingBeneficialButDisabled |
| Histogram loop + scalar interleave | HistogramPreventsScalarInterleaving -- bail |
| VF=1, IC>1 | Interleave-only path: executeVPlan(VF=1, IC) |
| VF>1 | Full vectorization path |

The histogram diagnostic (HistogramPreventsScalarInterleaving) is an NVIDIA addition not present in upstream LLVM. It blocks scalar interleaving of histogram-pattern loops where reduction ordering constraints make interleaving incorrect without vectorization.

Interleave count selection (sub_2AED330) is register-pressure-bounded on GPU:

// sub_2AED330 — selectInterleaveCount() (simplified)
maxIC = TTI.getMaxInterleaveFactor(VF)                 // sub_DFB120(TTI+448)
// Override knobs:
if (VF.isScalar() && ForceTargetMaxScalarInterleave)   // dword_500E148
    maxIC = ForceTargetMaxScalarInterleave
if (VF.isVector() && ForceTargetMaxVectorInterleave)   // dword_500E068
    maxIC = ForceTargetMaxVectorInterleave

tripCount = getSmallBestKnownTC(PSE, Loop)
IC = bit_floor(tripCount / (VF * 2))                  // conservative: vector loop runs >= 2x
IC = min(IC, maxIC)

// Small loop boost
if (loopCost < SmallLoopCost)                          // qword_500DC88
    smallIC = min(IC, bit_floor(SmallLoopCost / loopCost))
    IC = max(IC, smallIC)

// Scheduling-based cap (NVIDIA-specific TTI path)
issueWidth = *(TTI + 56 + 32)                          // scheduling info at TTI+88
latency    = *(TTI + 56 + 36)                          // scheduling info at TTI+92
IC = IC / max(issueWidth, latency)                     // cap by scheduling model

// Aggressive interleave mode
if (byte_500D908)                                      // AggressiveInterleave
    IC = maxIC                                         // bypass all heuristics

IC = clamp(IC, 1, maxIC)
return powerOf2Floor(IC)

On CPU, the interleave count is bounded by vector register count (e.g., 16 YMM registers / registers per iteration). On GPU, it is bounded by register pressure impact on occupancy -- the TTI scheduling info encodes this constraint. The AggressiveInterleave knob (byte_500D908) bypasses all heuristics and sets IC to the maximum, useful for benchmarking or known-good kernels.
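The heuristic above can be condensed into a runnable sketch (simplified: the override knobs are folded into parameters, the small-loop clamp follows the upstream shape, and all names are illustrative):

```python
def select_interleave_count(trip_count: int, vf: int, max_ic: int,
                            loop_cost: int, small_loop_cost: int = 20,
                            issue_width: int = 1, latency: int = 1,
                            aggressive: bool = False) -> int:
    """Sketch of the IC selection described above (not the binary's code)."""
    def bit_floor(n: int) -> int:
        # Largest power of two <= n, with a floor of 1.
        return 1 << (n.bit_length() - 1) if n > 0 else 1

    ic = bit_floor(trip_count // (vf * 2))   # vector loop should run >= 2x
    ic = min(ic, max_ic)

    if loop_cost and loop_cost < small_loop_cost:   # small-loop boost
        ic = max(ic, min(max_ic, bit_floor(small_loop_cost // loop_cost)))

    ic //= max(issue_width, latency, 1)             # scheduling-model cap
    if aggressive:                                  # AggressiveInterleave
        ic = max_ic                                 # bypass all heuristics
    return bit_floor(max(1, min(ic, max_ic)))
```

For example, a trip count of 64 at VF=4 with max IC 4 yields IC=4, while a trip count of 4 collapses to IC=1 because the vector loop could not run twice.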

Phase 7: VPlan Execution and Epilogue Vectorization

mainVPlan = getBestPlanFor(bestVF)                     // sub_2BF1320
executeVPlan(mainVPlan, bestVF, IC)                    // sub_2AE3460

// Epilogue vectorization (when byte_500ED88 is set)
epilogueVF = selectEpilogueVectorizationFactor()       // sub_2ABBD40
if (epilogueVF > 1):
    clonedPlan = cloneVPlan(mainVPlan)                 // sub_2BF7CB0
    epiloguePlan = getBestPlanFor(epilogueVF)
    mergeVPlans(clonedPlan, epiloguePlan)              // sub_2AB0350
    // Remap operands between main and epilogue plans:
    //   recipe types 29 (load/store), 36 (phi), 17 (GEP)
    //   types 19-20 (inttoptr/ptrtoint casts)
    executeVPlan(merged, epilogueVF, epilogueIC, isEpilogue=true)

Epilogue vectorization is particularly relevant on GPU: the scalar remainder loop after vectorization forces warp divergence (some threads in the warp execute the epilogue while others are masked off), which is expensive. A vectorized epilogue with a smaller VF reduces the scalar remainder to fewer iterations, minimizing divergence overhead.

The epilogue VF selection (sub_2ABBD40) can be forced via qword_500ECA8 (-epilogue-vectorization-force-VF). When not forced, it uses SCEV range analysis (sub_DC3A60, sub_DBB9F0) to prove the epilogue trip count is sufficient for the candidate VF.
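The divergence argument is simple arithmetic: the scalar remainder after the main vector loop is the trip count modulo VF*IC, and a vectorized epilogue shrinks it further (`remainder_after` is an illustrative helper, not from the binary):

```python
def remainder_after(trip_count: int, main_vf: int, ic: int,
                    epi_vf: int = 1) -> int:
    """Scalar iterations left after the main vector loop and, when
    epi_vf > 1, a vectorized epilogue loop. Illustrative arithmetic only."""
    rem = trip_count % (main_vf * ic)   # left over from the main vector loop
    if epi_vf > 1:
        rem %= epi_vf                   # epilogue consumes epi_vf per step
    return rem
```

With trip count 103, VF=4, IC=2, the scalar remainder is 7 iterations; a VF=2 epilogue cuts it to 1, so only one divergent scalar iteration remains.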

Phase 8: Post-Vectorization Metadata

The pass applies follow-up loop metadata (llvm.loop.vectorize.followup_all, llvm.loop.vectorize.followup_epilogue) and emits optimization remarks through sub_2AC2B40. Generated basic blocks use naming conventions vec.epilog.middle.block and vec.epilog.vector.body.

VPlan Construction (sub_2AEE460)

The VPlan builder allocates a 656-byte VPlan object and iterates over candidate VFs in powers of 2 (VF *= 2 each iteration, visible as add r15d, r15d in the binary). For each VF, it calls sub_2AA9E60 (tryToBuildRecipesForVF).
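The VF doubling loop can be mirrored in a couple of lines (a sketch of the enumeration only; `candidate_vfs` is an invented name):

```python
def candidate_vfs(max_vf: int):
    """Yield power-of-two VF candidates 1, 2, 4, ... up to max_vf,
    mirroring the VF *= 2 iteration (the add r15d, r15d in the binary)."""
    vf = 1
    while vf <= max_vf:
        yield vf
        vf *= 2
```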

Recipe type tags observed in the binary:

| Tag | Recipe Type |
| --- | --- |
| 0x04 | VPWidenMemoryInstructionRecipe |
| 0x0F | VPWidenRecipe |
| 0x1D | VPReplicateRecipe |
| 0x21 | VPWidenSelectRecipe |
| 0x43 | VPWidenCallRecipe |

Interleave group recipes are built from LoopAccessInfo at [Planner+0x28]+0x150. The builder removes individual load/store recipes and replaces them with interleave group recipes via sub_2AB9570 (replaceAllUsesWith), using a hash map with the pointer-hash function (ptr >> 4) ^ (ptr >> 9) & mask -- identical to LLVM's DenseMap hash.
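That hash can be reproduced directly (sketch; `dense_map_ptr_hash` is an illustrative name for the function recovered from the binary):

```python
def dense_map_ptr_hash(ptr: int, mask: int) -> int:
    """DenseMap-style pointer hash: (ptr >> 4) ^ (ptr >> 9), masked to the
    table size (a power of two minus one). Shifting by 4 first discards
    alignment bits that are typically zero for heap pointers."""
    return ((ptr >> 4) ^ (ptr >> 9)) & mask

# Pointers differing only in their low 4 bits hash identically:
assert dense_map_ptr_hash(0x1000, 0xFF) == dense_map_ptr_hash(0x1008, 0xFF)
```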

Cost annotation happens in Phase 6 of VPlan construction via sub_2C2E3C0, which walks all recipes and annotates them with TTI-derived costs. This is where NVPTXTargetTransformInfo shapes the cost model: it prices ld.v4 cheaper than 4x ld.f32, making vectorization profitable even with register pressure increase.

The VPlan verification flag at 0x500D2E8 enables VPlan dump/verify paths -- useful for debugging vectorization decisions with -mllvm -vplan-verify-or-dont.

NVPTXTargetTransformInfo Hooks

The loop vectorizer reaches NVIDIA's TTI through Loop->getHeader()->getParent()->getTTI() (recovered as *(**(Loop+32)+72)). Key hooks:

| TTI Method | Address | GPU Behavior |
| --- | --- | --- |
| getRegisterBitWidth(Vector) | sub_DFE640 | Returns 32 (fixed) -- single scalar register width |
| supportsScalableVectors() | sub_DFE610 | Returns false -- no SVE/RVV equivalent |
| getMaxInterleaveFactor() | sub_DFB120 | Queried at TTI+448; register-pressure-bounded |
| getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops |
| hasAttribute(47) | sub_B2D610 | "alwaysvectorize" check |
| hasAttribute(30) | sub_B2D610 | "noimplicitfloat" check |

The 32-bit register width return is the critical difference from CPU targets. It means the standard VF formula (regWidth / elemSize) produces VF=1 for 32-bit types, VF=2 for 16-bit types, and VF=4 for 8-bit types. Wider vectorization (VF=4 for float) must come from the cost model determining that ld.v4.f32 is profitable despite the VF exceeding the "register width."

The scheduling info at TTI+56 (with issue width at offset +32 and latency at +36 within that sub-structure) feeds interleave count capping. This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the issue pipeline may saturate at IC=4.

Knobs and Thresholds

| Knob | Global Address | CLI Name | Default | Effect |
| --- | --- | --- | --- | --- |
| ForceVectorization | qword_500D340[17] | vectorize-loops | true | Master switch for loop vectorization |
| EnableEarlyExitVectorization | byte_500CDA8 | -enable-early-exit-vectorization | false | Gates LLVM 20 early-exit loop vectorization |
| ForceOuterLoopVectorization | byte_500D208 | -force-vector-width-outer | false | Forces VF=4 for outer loops when TTI returns VF<=1 |
| ForceCanReorderFP (selector) | dword_500D508 | -- | 0 | Whether FP reorder override is active |
| ForceCanReorderFP (value) | byte_500D588 | -- | -- | FP reorder override value |
| ForceScalarEpilogue (selector) | dword_500E308 | -- | 0 | Whether scalar epilogue is forced |
| ForceScalarEpilogue (value) | byte_500E388 | -- | -- | Scalar epilogue override value |
| VectorizerMinTripCount | dword_500EAE8 | vectorizer-min-trip-count | 16 (upstream) | Minimum trip count to attempt vectorization |
| CostThreshold | qword_500EA08 | -- | -- | Maximum cost for memory reorder safety check |
| EnableEpilogueVectorization | byte_500ED88 | -enable-epilogue-vectorization | true (upstream) | Enables vectorized epilogue loop |
| EpilogueVectorizationForceVF | qword_500ECA8 | -epilogue-vectorization-force-VF | 0 | Forces specific epilogue VF |
| AggressiveInterleave | byte_500D908 | -- | false | Bypasses IC heuristics, sets IC=max |
| PreferPredicateOverEpilogue | byte_500DAC8 | prefer-predicate-over-epilogue | -- | Uses predication instead of scalar epilogue |
| SmallLoopCost | qword_500DC88 | small-loop-cost | 20 (upstream) | Threshold below which loops get boosted IC |
| ForceTargetMaxScalarInterleave | dword_500E148 | force-target-max-scalar-interleave | 0 | Overrides max IC for scalar loops |
| ForceTargetMaxVectorInterleave | dword_500E068 | force-target-max-vector-interleave | 0 | Overrides max IC for vectorized loops |

NVIDIA vs upstream defaults: The upstream vectorizer-min-trip-count default is 16. The upstream small-loop-cost default is 20. The upstream enable-epilogue-vectorization default is true. NVIDIA preserves these defaults from the knob registration code, but the TTI hooks (particularly getRegisterBitWidth returning 32 and getMaxInterleaveFactor being register-pressure-bounded) shift the effective behavior dramatically. Where a CPU target with AVX-512 might select VF=16 for float, NVPTX typically selects VF=2 or VF=4 -- just enough to use ld.v2/ld.v4 instructions without excessive register pressure.

Diagnostic Strings

All diagnostic strings are embedded in the binary with OptimizationRemarkAnalysis tags. Source: p2-E01-loop-vectorize.txt.

| Tag | Message | Trigger |
| --- | --- | --- |
| UncountableEarlyExitLoopsDisabled | "Auto-vectorization of loops with uncountable early exit is not enabled" | Early-exit loop + byte_500CDA8 knob off |
| LowTripCount | "The trip count is below the minial threshold value." ("minial" is a typo [sic] in the NVIDIA binary) | TC < dword_500EAE8 min threshold |
| NoImplicitFloat | "Can't vectorize when the NoImplicitFloat attribute is used" | Function attribute 30 check |
| UnsafeFP | "Potentially unsafe FP op prevents vectorization" | FP safety check failure |
| CantReorderFPOps | "loop not vectorized: cannot prove it is safe to reorder floating-point operations" | FP reorder proof failure |
| CantReorderMemOps | "loop not vectorized: cannot prove it is safe to reorder memory operations" | Memory reorder proof failure |
| VectorizationNotBeneficial | "the cost-model indicates that vectorization is not beneficial" | Cost model: VF=1 wins |
| InterleavingNotBeneficial | "the cost-model indicates that interleaving is not beneficial" | Cost model: IC=1 wins |
| InterleavingNotBeneficialAndDisabled | (appended: " and is explicitly disabled or interleave count is set to 1") | IC=1 + explicitly disabled |
| InterleavingBeneficialButDisabled | (tag only, no message body recovered) | IC>1 but user disabled interleaving |
| InterleavingAvoided | "Ignoring UserIC, because interleaving was avoided up front" | User-specified IC overridden |
| HistogramPreventsScalarInterleaving | "Unable to interleave without vectorization due to constraints on the order of histogram operations" | NVIDIA-specific: histogram loop + scalar IC |
| ScalableVFUnfeasible | "Scalable vectorization requested but not supported by the target" | Scalable VF on NVPTX |
| UncountableEarlyExitUnsupported | "Auto-vectorization of early exit loops requiring a scalar epilogue is unsupported" | Early-exit + epilogue |
| (success remark) | "interleaved loop (interleaved count: N)" | Vectorization/interleaving succeeded via sub_2AC2B40 |
| (metadata) | "llvm.loop.vectorize.followup_all" | Post-vectorization loop metadata tag |
| (metadata) | "llvm.loop.vectorize.followup_epilogue" | Post-vectorization epilogue metadata tag |
| (block name) | "vec.epilog.middle.block" | Epilogue vectorization middle block |
| (block name) | "vec.epilog.vector.body" | Epilogue vectorization body block |
| (block name) | "scev.check" | Runtime SCEV overflow check block (sub_27C1C30) |
| (VPlan debug) | "Initial VPlan" | VPlan builder debug output at 0x2AEFC7B |

Function Map

| Function | Address | Size | Role |
| --- | --- | --- | --- |
| LoopVectorizePass::processLoop() | sub_2AF1970 | 88 KB | -- |
| tryToBuildVPlanWithVPRecipes() | sub_2AEE460 | 56 KB | -- |
| Planner::plan() | sub_2AF13F0 | -- | Generate VPlans for candidate VFs |
| selectBestVF() | sub_2AE08E0 | -- | Iterate VPlans, pick lowest cost |
| computeCostForVF() | sub_2AE0750 | -- | Per-VF cost query |
| isBetterThan() | sub_2AB3FE0 | -- | VF cost comparator |
| executeVPlan() | sub_2AE3460 | -- | IR transformation from VPlan |
| selectInterleaveCount() | sub_2AED330 | -- | IC heuristic |
| selectEpilogueVectorizationFactor() | sub_2ABBD40 | -- | -- |
| LoopVectorizationCostModel constructor (16 params) | sub_2AB2780 | -- | -- |
| selectVectorizationFactor() | sub_2AB8AC0 | -- | Outer loop path |
| selectVectorizationFactor() | sub_2AAEAB0 | -- | Hint/pre-check |
| computeExpectedScalarCost() | sub_2AAD640 | -- | -- |
| LoopVectorizationLegality::init() | sub_31A4FD0 | -- | -- |
| canVectorize() | sub_31A91F0 | -- | Pre-check |
| canVectorize() | sub_31AF060 | -- | Full check |
| getBestPlanFor(VF) | sub_2BF1320 | -- | VPlan lookup |
| cloneVPlan() | sub_2BF7CB0 | -- | -- |
| mergeVPlans() | sub_2AB0350 | -- | Main + epilogue merge |
| buildInterleaveGroupRecipes() | sub_2C06CE0 | -- | -- |
| VPlan cost annotation pass | sub_2C2E3C0 | -- | -- |
| VPlan simplification / recipe combining | sub_2C32950 | -- | -- |
| VPlan legality re-verification | sub_2C2A390 | -- | -- |
| getSmallBestKnownTC() | sub_2AA7EC0 | -- | Trip count upper bound |
| tryToBuildRecipesForVF() | sub_2AA9E60 | -- | Per-VF body builder |
| finalizeRecipesForVF() | sub_2AD9850 | -- | Scaling/widening |
| TTI::getMaxInterleaveFactor() | sub_DFB120 | -- | -- |
| TTI::getRegisterBitWidth(Vector) | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() | sub_DFE610 | -- | -- |
| Emit vectorization success remarks | sub_2AC2B40 | -- | -- |
| VPlan fixup/finalize | sub_ABDAE0 | -- | -- |

Interactions with Other Passes

  • Loop Strength Reduction -- LSR runs after vectorization and must handle the wider induction variables and address expressions that vectorization introduces. NVIDIA's custom LSR is occupancy-aware and interacts with the same register pressure model.
  • Register Allocation -- The register pressure that bounds VF and IC decisions is ultimately resolved by the register allocator. VF=4 with IC=2 may request 8x the base register count; the allocator must either accommodate this or spill to local memory.
  • Scheduling -- The TTI scheduling info (issue width and latency at TTI+56) that caps interleave count comes from the same target model used by instruction scheduling.
  • SelectionDAG -- Vectorized IR produces vector types (<4 x float>) that SelectionDAG must lower to PTX ld.v4/st.v4 instructions.
  • SLP Vectorizer -- SLP vectorization (sub_2BD1C50) handles straight-line code and horizontal reductions; loop vectorization handles loop bodies. Both share the same TTI cost model.

What Upstream LLVM Gets Wrong for GPU

Upstream LLVM's LoopVectorize pass was built for CPU SIMD: fill wider vector registers to process more data elements per instruction. On a GPU, every foundational assumption is inverted:

  • Upstream assumes SIMD lanes need filling. The CPU vectorizer exists to pack 4/8/16 scalar operations into one vector instruction (SSE/AVX/NEON). On GPU, there are no SIMD lanes in the CPU sense -- the SIMT model already executes 32 threads in lockstep per warp. "Vectorization" on GPU means widening per-thread memory accesses to ld.v2/ld.v4 for coalescing, not filling SIMD lanes.
  • Upstream computes VF from vector register width. The standard formula is VF = registerWidth / elementSize (e.g., AVX-512 gives VF=16 for float). NVPTX's getRegisterBitWidth() returns 32 bits -- a single scalar register width -- so this formula always produces VF=1 for 32-bit types. Wider VFs must come entirely from the cost model deciding that ld.v4.f32 is profitable, bypassing the standard VF selection path.
  • Upstream ignores register pressure when selecting VF. On CPU, VF=16 using 16 ZMM registers has no throughput penalty -- there is no occupancy concept. On GPU, VF=4 that quadruples live registers can cross an occupancy cliff, losing an entire warp group and halving net throughput. Every VF and IC decision must be bounded by register pressure impact on occupancy.
  • Upstream assumes scalable vectors are desirable. LLVM supports SVE/RISC-V V scalable vector types. NVPTX disables them entirely (supportsScalableVectors() = false) because PTX has no scalable vector model -- only fixed-width ld.v2/ld.v4 instructions exist.
  • Upstream's interleave count is bounded by CPU port pressure. CPU IC selection considers execution port contention and register file depth (e.g., 16 YMM registers). GPU IC selection is capped by the TTI scheduling model's issue width and latency at TTI+56, reflecting the SM's instruction issue pipeline saturation -- a completely different bottleneck.
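
The occupancy-cliff argument can be made concrete with a toy model (the 64K-entry register file, 48-warp cap, and linear allocation rule are assumed round numbers; real SMs allocate registers in granules and have additional occupancy limits):

```python
def warps_per_sm(regs_per_thread: int, reg_file: int = 65536,
                 warp_size: int = 32, max_warps: int = 48) -> int:
    """Resident warps per SM under a purely register-limited toy model."""
    return min(max_warps, reg_file // (regs_per_thread * warp_size))

# A widening that doubles live registers per thread (32 -> 64)
# drops resident warps from the 48-warp cap to 32 in this model:
assert warps_per_sm(32) == 48
assert warps_per_sm(64) == 32
```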

Optimization Level Behavior

| Level | Scheduled | Max VF | Interleave | Notes |
| --- | --- | --- | --- | --- |
| O0 | Not run | N/A | N/A | No optimization passes |
| Ofcmax | Not run | N/A | N/A | Fast-compile skips vectorization entirely |
| Ofcmid | Not run | N/A | N/A | Vectorization not in medium fast-compile tier |
| O1 | Runs (Tier 1) | 4 | Enabled | Single instance after loop canonicalization |
| O2 | Runs (Tier 1) | 4 | Enabled | Same scheduling as O1; benefits from more aggressive scalar optimization preceding it |
| O3 | Runs (Tier 1) | 4 | Enabled | Same as O2; additional Tier 3 loop passes (interchange, distribution) may create more vectorization opportunities |

Loop vectorization is a Tier 1 pass, meaning it runs at O1 and above but not in any fast-compile tier. The maximum VF is effectively capped at 4 by the GPU register pressure constraint -- higher VFs would multiply live registers past occupancy cliffs. The ForceVectorization hint (qword_500D340[17], the "vectorize-loops" knob) can push vectorization through even when the cost model says it is unprofitable; forcing is typically used only for debugging. Early-exit vectorization (byte_500CDA8) is gated separately and defaults to disabled. See Optimization Levels for the complete tier structure.

Differences from Upstream LLVM

| Aspect | Upstream LLVM | CICC v13.0 |
| --- | --- | --- |
| Vectorization purpose | Fill SIMD lanes (SSE/AVX/NEON) for data parallelism | Memory coalescing (ld.v2/ld.v4), instruction count reduction, and register-to-memory width matching; no SIMD lanes on GPU |
| Scalable vectors | Supported (SVE, RISC-V V) | Always disabled -- sub_DFE610 returns false for NVPTX; only fixed-width VF=2/4 |
| Register bit width (TTI) | Target-dependent (128/256/512 for x86) | Fixed 32 bits (TypeSize::getFixed(32)) reflecting PTX's 32-bit register model |
| VF selection cost model | SIMD-width-driven: higher VF fills wider vector registers | Occupancy-bounded: VF must not increase register pressure past warp occupancy cliffs; VF=4 is typically the maximum |
| Interleave count | Profile-guided or port-pressure-based (2--8 typical) | Capped by TTI scheduling info at TTI+56; conservative due to register pressure cost per interleaved iteration |
| Early-exit vectorization | Experimental (behind flag) | Present, gated by byte_500CDA8 (-enable-early-exit-vectorization) |
| Convergent call handling | Standard legality rejection | Additional barrier-aware legality: convergent intrinsics (__syncthreads, warp shuffles) block vectorization of the containing loop body |