
NVVM Intrinsic Lowering

The NVVMIntrinsicLowering pass is a pattern-matching rewrite engine that transforms NVVM intrinsic calls into equivalent sequences of standard LLVM IR operations. NVVM IR uses hundreds of target-specific intrinsics (llvm.nvvm.*) for GPU-specific operations -- texture/surface access, warp shuffles, type conversions, wide vector manipulations, barrier synchronization, and tensor core primitives. These intrinsics encode NVIDIA-specific semantics that have no direct LLVM IR equivalent. This pass bridges the gap: it matches each intrinsic call against a database of lowering rules and, when a match is found, replaces the call with a combination of standard LLVM instructions (shufflevector, extractelement, insertelement, bitcast, arithmetic) that express the same semantics in a form amenable to standard LLVM optimization passes.

The pass runs repeatedly throughout the pipeline -- up to 10 times in the "mid" compilation path -- because other optimization passes (NVVMReflect, InstCombine, inlining) can expose new intrinsic calls or simplify existing ones into forms that become lowerable. Two distinct invocation levels exist: level 0 for basic intrinsic lowering, and level 1 for barrier-related intrinsic lowering that must happen after barrier analysis infrastructure is in place.

| Property | Value |
|---|---|
| Pass factory | sub_1CB4E40 (creates pass instance with level parameter) |
| Core engine | sub_2C63FB0 (140KB, 2,460 lines) |
| Pass type | FunctionPass (Legacy PM) |
| Registration | Legacy PM only (not separately registered in New PM); invoked from pipeline assembler |
| Runtime positions | Tier 1/2/3 #1, #3, #28, #50, #64 (level 1); "mid" path has 4 level-0 invocations (see Pipeline) |
| NVVMPassOptions slot | 99 (offset 2000, BOOL_COMPACT, default = 0 = enabled) |
| Disable flag | opts[2000] = 1 disables all invocations |
| Level parameter | 0 = basic lowering, 1 = barrier-aware lowering |
| Iteration limit | 30 (global qword_5010AC8) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Address range | 0x2C4D000--0x2C66000 (lowering engine cluster) |

Algorithm

Entry and Dispatch

The pass factory sub_1CB4E40 takes a single integer parameter -- the lowering level. Level 0 performs basic intrinsic lowering (type conversions, vector decomposition, shuffle lowering). Level 1 adds barrier-related intrinsic lowering that depends on barrier analysis having already run. The factory allocates a pass object and stores the level in it; the pass entry point reads this level to filter which intrinsics are candidates for lowering.

At the core engine (sub_2C63FB0), the entry check validates that the instruction is an intrinsic call: the byte at call->arg_chain->offset_8 must equal 17 (intrinsic call marker), and call->offset_16 must be non-null (the callee exists). If either check fails, the function returns 0 (no lowering performed).
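The guard can be sketched as follows. The dictionary field names (`arg_chain`, `kind`, `callee`) are illustrative stand-ins for the decompiled structure offsets, not names recovered from the binary:

```python
def is_lowerable_call(call):
    """Entry guard: only intrinsic calls with a resolved callee proceed.

    `call` models the decompiled instruction record; "kind" stands in
    for the byte at call->arg_chain->offset_8, and "callee" for the
    pointer at call->offset_16.
    """
    INTRINSIC_CALL_MARKER = 17
    if call.get("arg_chain", {}).get("kind") != INTRINSIC_CALL_MARKER:
        return 0  # not an intrinsic call
    if call.get("callee") is None:
        return 0  # no resolved callee
    return 1  # candidate for lowering
```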

Pattern-Matching Rewrite Loop

The algorithm operates as a worklist-driven rewrite system:

function lowerIntrinsic(ctx, call, level):
    if not isIntrinsicCall(call): return 0
    if not hasCallee(call): return 0

    operands = collectOperands(call)        // v285/v286 arrays
    worklist_direct = []                     // v288: direct operand replacements
    worklist_typed  = []                     // v294: type-changed operands
    worklist_shuf   = []                     // v300: shuffle/reorganized operands

    iterations = 0
    while iterations < qword_5010AC8:       // default 30
        iterations++

        // Phase 1: build candidate lowerings
        candidates = buildCandidates(operands)   // sub_2C4D470
        for each candidate in candidates:
            pattern = extractPattern(candidate)  // sub_2C4D5A0

            // Phase 2: type compatibility check
            if not checkTypeCompat(pattern, operands):  // sub_AD7630
                continue

            // Phase 3: operand matching
            if not matchOperands(pattern, operands):
                continue

            // Phase 4: additional pattern checks
            if not additionalChecks(pattern):    // sub_2C50020
                continue

            // Phase 5: core lowering -- create replacement
            replacement = buildReplacement(      // sub_2C515C0
                ctx, operands,
                worklist_direct, worklist_typed, worklist_shuf)

            // Phase 6: substitute
            replaceAllUses(call, replacement)    // sub_BD84D0
            transferMetadata(call, replacement)  // sub_BD6B90
            queueForDeletion(call)               // sub_F15FC0
            return 1

    return 0  // no lowering found within iteration limit

The iteration limit of 30 (stored in qword_5010AC8) exists because lowering one intrinsic can produce new intrinsic calls that themselves need lowering. For example, lowering a wide vector intrinsic into narrower operations may produce calls to narrower intrinsics. Without the limit, pathological patterns could cause infinite expansion. In practice, most intrinsics lower in a single iteration; the limit is a safety net.
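The expansion dynamic can be modeled with a toy rewrite loop: each step replaces a "wide" pseudo-intrinsic of width w with two of width w/2, and the iteration cap bounds the expansion. All names and rules here are illustrative, not the binary's:

```python
MAX_ITERATIONS = 30  # mirrors the default in qword_5010AC8

def lower_wide(intrinsic):
    """Toy rule: a wide op of width w lowers to two ops of width w/2;
    width-1 ops are terminal (no rule matches)."""
    name, width = intrinsic
    if width > 1:
        return [(name, width // 2), (name, width // 2)]
    return None  # no lowering rule applies

def run_to_fixpoint(worklist):
    """Repeatedly lower until nothing changes or the cap is hit."""
    iterations = 0
    while iterations < MAX_ITERATIONS:
        iterations += 1
        progress = False
        next_list = []
        for item in worklist:
            lowered = lower_wide(item)
            if lowered is not None:
                next_list.extend(lowered)  # lowering exposed new intrinsics
                progress = True
            else:
                next_list.append(item)
        worklist = next_list
        if not progress:
            break  # fixpoint reached
    return worklist, iterations
```

A width-4 op settles into four width-1 ops after two productive iterations plus one confirming pass; a rule that grew the worklist without shrinking widths would instead run until the cap stops it.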

Three Worklist Structures

The rewrite engine maintains three parallel worklist structures that categorize how operands are transformed:

| Worklist | Variable | Purpose |
|---|---|---|
| Direct | v288 | Operands that pass through unchanged -- same value, same type |
| Type-changed | v294 | Operands that need a type conversion (e.g., NVVM-specific type to standard LLVM type) |
| Shuffle/reorganized | v300 | Operands that need positional rearrangement (vector lane reordering, element extraction) |

When sub_2C515C0 builds the replacement instruction, it reads all three worklists to assemble the final operand list: direct operands are copied verbatim, type-changed operands go through a bitcast or type conversion, and shuffle operands are processed through a shufflevector or extractelement/insertelement sequence.
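A sketch of that assembly step, with the three-category tagging as the only structure; the tuple encodings (`"bitcast"`, `"shufflevector"`) are assumptions for illustration, not the binary's representation:

```python
def assemble_operands(direct, typed, shuffled):
    """Model of sub_2C515C0's reported behavior: merge three worklists
    into one final operand list. Each worklist entry is (position,
    payload); positions index into the replacement's operand list."""
    final = {}
    for pos, value in direct:
        final[pos] = value  # copied verbatim
    for pos, (value, target_ty) in typed:
        final[pos] = ("bitcast", value, target_ty)  # type conversion
    for pos, (vec, mask) in shuffled:
        final[pos] = ("shufflevector", vec, mask)  # lane rearrangement
    return [final[i] for i in sorted(final)]
```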

Lowering Categories

Vector Operation Decomposition

Wide vector NVVM intrinsics (operating on v4f32, v2f64, v4i32, etc.) are decomposed into sequences of narrower operations. The NVVM IR frontend emits vector intrinsics to express data-parallel GPU operations, but the NVPTX backend's instruction selector handles scalar or narrow-vector operations more efficiently.

The decomposition pattern:

// Before: single wide-vector intrinsic call
%result = call <4 x float> @llvm.nvvm.wide.op(<4 x float> %a, <4 x float> %b)

// After: four scalar operations + vector reconstruction
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
...
%r0 = call float @llvm.nvvm.narrow.op(float %a0, float %b0)
%r1 = call float @llvm.nvvm.narrow.op(float %a1, float %b1)
...
%v0 = insertelement <4 x float> undef, float %r0, i32 0
%v1 = insertelement <4 x float> %v0,   float %r1, i32 1
...

This decomposition enables scalar optimizations (constant folding, CSE) to work on individual lanes, and the narrower intrinsics may themselves lower in subsequent iterations -- hence the iteration limit.
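The per-lane benefit can be modeled concretely: once a wide op is split into lanes, lanes whose operands are both constants fold immediately, while the rest remain as narrow calls. A toy sketch (the real engine operates on LLVM IR values, not Python values):

```python
def decompose_and_fold(op, lhs, rhs):
    """Split a wide pseudo-intrinsic into per-lane operations; fold
    lanes with two constant operands, leave the rest symbolic."""
    result = []
    for a, b in zip(lhs, rhs):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            result.append(op(a, b))  # constant-folded lane
        else:
            result.append(("narrow.op", a, b))  # residual narrow call
    return result
```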

Shuffle Vector Lowering

When an NVVM intrinsic performs pure data reorganization -- lane permutation, broadcast, or subvector extraction -- without any arithmetic, the pass replaces it with an LLVM shufflevector instruction. The core lowering for this path goes through sub_DFBC30, which takes:

sub_DFBC30(context, operation=6, type_info, shuffle_indices, count, flags)

The operation=6 constant identifies this as a shufflevector creation. The shuffle_indices array encodes the lane mapping: for a warp shuffle that broadcasts lane 0 to all lanes, the mask would be <0, 0, 0, 0, ...>. For a rotation, it might be <1, 2, 3, 0>.

Shuffle lowering handles several NVVM intrinsic families:

  • Warp-level shuffle operations (__shfl_sync, __shfl_up_sync, etc.) when the shuffle amount is a compile-time constant
  • Subvector extraction from wider types (e.g., extracting the low v2f16 from a v4f16)
  • Lane broadcast patterns used in matrix fragment loading
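Applying such a mask is pure index selection; this sketch shows the single-input `shufflevector` semantics for the broadcast and rotation masks mentioned above:

```python
def apply_shuffle_mask(vec, mask):
    """Single-input shufflevector semantics: lane i of the result
    is vec[mask[i]]."""
    return [vec[i] for i in mask]

lanes = [10.0, 20.0, 30.0, 40.0]
broadcast = apply_shuffle_mask(lanes, [0, 0, 0, 0])  # lane 0 to all lanes
rotation = apply_shuffle_mask(lanes, [1, 2, 3, 0])   # rotate left by one
```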

Type Conversion Lowering

NVVM defines intrinsic-based type conversions for types that LLVM's standard type system does not directly support, such as:

  • BF16 (bfloat16) to/from FP32 -- intrinsic ID 0x2106, gated by sm_72+
  • TF32 (tensorfloat32) conversions -- intrinsic ID 0x2106 with conversion type 3+, gated by sm_75+
  • FP8 (E4M3/E5M2) conversions -- intrinsic ID 0x2107, gated by sm_89+ (Ada)
  • Extended type conversions with saturate, rounding mode control

The lowering replaces these intrinsic calls with sequences of:

  • bitcast for reinterpretation between same-width types
  • fptrunc / fpext for standard floating-point width changes
  • trunc / zext / sext for integer width changes
  • Arithmetic sequences for rounding mode emulation when the hardware rounding mode is not directly expressible

The sub_2C52B30 helper ("get canonical type") resolves NVVM-specific type encodings to their standard LLVM Type* equivalents during this process.
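The BF16 case is the simplest to illustrate: bfloat16 is the top 16 bits of an IEEE-754 float32, so widening is an exact shift and narrowing (here with round-to-nearest-even) is bit arithmetic. This is a generic sketch of the format, independent of the binary's actual lowering sequence:

```python
import struct

def f32_bits(x):
    """Reinterpret a float32 as its 32-bit pattern (the bitcast step)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def bf16_to_f32(bf16_bits):
    """Widening is exact: bf16 is the high half of an f32."""
    return bits_f32(bf16_bits << 16)

def f32_to_bf16(x):
    """Narrow with round-to-nearest-even on the 16 discarded bits."""
    b = f32_bits(x)
    rounding_bias = 0x7FFF + ((b >> 16) & 1)
    return ((b + rounding_bias) >> 16) & 0xFFFF
```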

Multi-Run Pattern

NVVMIntrinsicLowering appears more times in the compilation pipeline than any other NVIDIA custom pass. In the "mid" path (standard CUDA compilation), it runs approximately 10 times across the main path and the Tier 1/2/3 sub-pipelines. The pattern reveals a deliberate interleaving strategy.

"mid" Path Invocations (Level 0)

All four invocations in the main "mid" path use level 0 and are guarded by !opts[2000]:

| Position | Context | Preceding Pass | Following Pass | Purpose |
|---|---|---|---|---|
| 1st | Early pipeline | ConstantMerge | MemCpyOpt | Lower intrinsics from the original IR before SROA/GVN operate |
| 2nd | After InstCombine + standard pipeline #5 | LLVM standard #5 | DeadArgElim | Re-lower intrinsics that InstCombine may have simplified or inlining may have exposed |
| 3rd | After NVVMReflect + standard pipeline #8 | LLVM standard #8 | IPConstProp | Lower intrinsics whose arguments became constant after NVVMReflect resolved __nvvm_reflect() calls |
| 4th | Late pipeline | LICM | NVVMBranchDist | Final cleanup of any remaining lowerable intrinsics before register-pressure-sensitive passes |

Tier 1/2/3 Invocations (Level 1)

Within the sub_12DE8F0 tier sub-pipeline, the pass runs with level 1 at five distinct points:

| Position | Context | Notes |
|---|---|---|
| 1st | Tier entry | Immediately at tier start -- lower barrier-related intrinsics before barrier analysis |
| 2nd | After 1st NVVMIRVerification | Re-lower after verification may have canonicalized IR |
| 3rd | After CVP + NVVMVerifier + NVVMIRVerification | Post-optimization cleanup of barrier intrinsics |
| 4th | After LoopUnswitch + standard pipeline #1 | Re-lower intrinsics exposed by loop transformations |
| 5th | After DSE + DCE + standard pipeline #1 | Final tier cleanup before MemorySpaceOpt |

Each tier (1, 2, and 3) runs this same sequence independently, so in a full compilation with all three tiers active, level-1 lowering executes up to 15 times total.

Level Parameter Semantics

The level parameter partitions the intrinsic lowering rules into two sets:

Level 0 -- Basic lowering. Handles intrinsics whose lowering depends only on the intrinsic's operands and types. This includes vector decomposition, shuffle lowering, and standard type conversions. These are safe to run at any point in the pipeline because they have no dependencies on analysis results. The "mid" path runs level 0 exclusively.

Level 1 -- Barrier-aware lowering. Handles intrinsics related to synchronization barriers (__syncthreads, __syncwarp, barrier-guarded memory operations) whose lowering must coordinate with the barrier analysis infrastructure. In the tier sub-pipeline, level 1 runs at the entry point before NVVMBarrierAnalysis and NVVMLowerBarriers, and again after those passes have run. This two-phase pattern within the tier ensures that:

  1. Barrier intrinsics are lowered to a canonical form that the barrier analysis can recognize
  2. After barrier analysis and lowering, any residual barrier-related intrinsics are cleaned up

Level 1 is restricted to the tiers rather than the main "mid" path because the tier sub-pipeline (sub_12DE8F0) sets up the barrier analysis state (via sub_18E4A00 / sub_1C98160) that level-1 lowering depends on. Running level 1 in the main path, before this state exists, would produce incorrect results.
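Since level 1 adds barrier-aware rules on top of the basic set, the partition can be modeled as a rule table filtered by level. The rule names below are illustrative placeholders, not identifiers from the binary:

```python
# Hypothetical rule names for illustration only.
LEVEL_0_RULES = {"llvm.nvvm.wide.op", "llvm.nvvm.shuffle.op", "llvm.nvvm.cvt.bf16"}
LEVEL_1_RULES = {"llvm.nvvm.barrier.sync", "llvm.nvvm.barrier.arrive"}

def candidate_rules(level):
    """Level 0 sees only analysis-free rules; level 1 adds the
    barrier-aware set on top of them."""
    rules = set(LEVEL_0_RULES)
    if level >= 1:
        rules |= LEVEL_1_RULES
    return rules
```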

Interaction with NVVMReflect

NVVMReflect resolves compile-time queries about the target GPU architecture:

%arch = call i32 @llvm.nvvm.reflect(metadata !"__CUDA_ARCH__")
; After NVVMReflect: %arch = i32 900  (for sm_90)

This resolution has a cascading effect on intrinsic lowering. Many NVVM intrinsics are conditionally emitted by the frontend behind architecture checks:

if (__CUDA_ARCH__ >= 900) {
    // Hopper-specific intrinsic
    __nvvm_tma_load_async(...);
} else {
    // Fallback path using standard loads
}

After NVVMReflect replaces the architecture query with a constant, and nvvm-reflect-pp (SimplifyConstantConditionalsPass) eliminates the dead branch, the surviving path may contain intrinsics that were previously unreachable. The pipeline runs NVVMIntrinsicLowering after NVVMReflect specifically to catch these newly-exposed intrinsics. This is why the 3rd invocation in the "mid" path immediately follows NVVMReflect + LLVM standard pipeline #8.
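The cascade can be modeled in three steps: substitute the reflect query with a constant, fold the now-constant branch, and collect the intrinsics that survive for the next lowering run. A toy sketch; the intrinsic name mirrors the `__nvvm_tma_load_async` example above but is used here purely as a placeholder:

```python
def resolve_reflect_and_fold(arch, branches):
    """branches: ordered (min_arch, intrinsics) pairs guarding code by a
    __CUDA_ARCH__ threshold; the first satisfied guard wins. Returns
    the intrinsic calls left on the surviving path."""
    for min_arch, intrinsics in branches:
        if arch >= min_arch:  # the branch condition is now a constant
            return intrinsics  # surviving path; the rest is dead code
    return []

surviving = resolve_reflect_and_fold(
    900,  # sm_90, as NVVMReflect would substitute
    [(900, ["llvm.nvvm.tma.load.async"]),  # Hopper path (placeholder name)
     (0, [])],                             # fallback using standard loads
)
```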

Configuration

NVVMPassOptions

| Slot | Offset | Type | Default | Semantics |
|---|---|---|---|---|
| 98 | 1976 | STRING | (empty) | Paired string parameter for the pass (unused or reserved) |
| 99 | 2000 | BOOL_COMPACT | 0 | Disable flag: 0 = enabled, 1 = disabled |

Setting slot 99 to 1 disables all invocations of NVVMIntrinsicLowering across the entire pipeline -- both level 0 and level 1. There is no mechanism to disable one level independently.

Global Variables

| Variable | Default | Purpose |
|---|---|---|
| qword_5010AC8 | 30 | Maximum iterations per invocation of the rewrite loop |

This global is not exposed as a user-facing knob. It is initialized at program startup and is constant for the lifetime of the process.

Key Helper Functions

Pattern Matching

| Function | Role |
|---|---|
| sub_2C4D470 | Build candidate lowering list from intrinsic operands |
| sub_2C4D5A0 | Extract pattern from candidate -- returns the lowering rule |
| sub_2C50020 | Additional pattern compatibility checks beyond type matching |
| sub_2C52B30 | Get canonical LLVM type for an NVVM-specific type encoding |
| sub_AD7630 | Type-lowering query -- checks if source type can lower to target type |

Instruction Construction

| Function | Role |
|---|---|
| sub_2C515C0 | Build replacement instruction from three worklist structures |
| sub_2C4FB60 | Opcode dispatch -- selects the LLVM opcode for the lowered operation |
| sub_DFBC30 | Create shufflevector or similar vector IR construct (operation=6) |

IR Mutation

| Function | Role |
|---|---|
| sub_BD84D0 | Replace all uses of old instruction with new value |
| sub_BD6B90 | Transfer metadata from old instruction to replacement |
| sub_F15FC0 | Queue old instruction for deletion |

Pass Infrastructure

| Function | Address | Role |
|---|---|---|
| sub_1CB4E40 | 0x1CB4E40 | Pass factory -- creates pass with level parameter |
| sub_2C63FB0 | 0x2C63FB0 | Core lowering engine (140KB, 2,460 lines) |

Diagnostic Strings

The core engine at sub_2C63FB0 contains no user-visible diagnostic strings. This is unusual for a 140KB function and reflects the fact that intrinsic lowering is a mechanical pattern-matching operation: either a lowering rule matches (silently applied) or it does not (silently skipped). Failures are not reported because an unlowered intrinsic is not necessarily an error -- it may be handled by a later pass (NVVMLowerBarriers, GenericToNVVM) or by the NVPTX instruction selector directly.

The pass factory sub_1CB4E40 similarly contains no diagnostic strings.

Pipeline Position Summary

sub_12E54A0 (Master Pipeline Assembly)
  │
  ├─ "mid" path (level 0, 4 invocations):
  │    ├─ #1: After ConstantMerge, before MemCpyOpt/SROA
  │    ├─ #2: After InstCombine + LLVM standard #5
  │    ├─ #3: After NVVMReflect + LLVM standard #8
  │    └─ #4: After LICM, before NVVMBranchDist/Remat
  │
  ├─ "ptx" path (level 0, 0 invocations):
  │    └─ (not present -- PTX input already has intrinsics lowered)
  │
  ├─ default path (level 0, 1 invocation):
  │    └─ #1: After NVVMReflect, before NVVMPeephole
  │
  └─ Tier 1/2/3 sub-pipeline (level 1, 5 invocations per tier):
       ├─ #1: Tier entry
       ├─ #2: After NVVMIRVerification
       ├─ #3: After CVP + NVVMVerifier
       ├─ #4: After LoopUnswitch + LLVM standard #1
       └─ #5: After DSE + DCE + LLVM standard #1

Cross-References