
NVVM Intrinsic Lowering

The NVVMIntrinsicLowering pass is a pattern-matching rewrite engine that transforms NVVM intrinsic calls into equivalent sequences of standard LLVM IR operations. NVVM IR uses hundreds of target-specific intrinsics (llvm.nvvm.*) for GPU-specific operations -- texture/surface access, warp shuffles, type conversions, wide vector manipulations, barrier synchronization, and tensor core primitives. These intrinsics encode NVIDIA-specific semantics that have no direct LLVM IR equivalent. This pass bridges the gap: it matches each intrinsic call against a database of lowering rules and, when a match is found, replaces the call with a combination of standard LLVM instructions (shufflevector, extractelement, insertelement, bitcast, arithmetic) that express the same semantics in a form amenable to standard LLVM optimization passes.

The pass runs repeatedly throughout the pipeline -- up to 10 times in the "mid" compilation path -- because other optimization passes (NVVMReflect, InstCombine, inlining) can expose new intrinsic calls or simplify existing ones into forms that become lowerable. Two distinct invocation levels exist: level 0 for basic intrinsic lowering, and level 1 for barrier-related intrinsic lowering that must happen after barrier analysis infrastructure is in place.

| Property | Value |
|---|---|
| Pass factory | sub_1CB4E40 (creates pass instance with level parameter) |
| Core engine | sub_2C63FB0 (140KB, 2,460 lines) |
| Pass type | FunctionPass (Legacy PM) |
| Registration | Legacy PM only (not separately registered in New PM); invoked from pipeline assembler |
| Runtime positions | Tier 1/2/3 #1, #3, #28, #50, #64 (level 1); "mid" path has 4 level-0 invocations (see Pipeline) |
| NVVMPassOptions slot | 99 (offset 2000, BOOL_COMPACT, default = 0 = enabled) |
| Disable flag | opts[2000] = 1 disables all invocations |
| Level parameter | 0 = basic lowering, 1 = barrier-aware lowering |
| Iteration limit | 30 (global qword_5010AC8) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Address range | 0x2C4D000--0x2C66000 (lowering engine cluster) |

Algorithm

Entry and Dispatch

The pass factory sub_1CB4E40 takes a single integer parameter -- the lowering level. Level 0 performs basic intrinsic lowering (type conversions, vector decomposition, shuffle lowering). Level 1 adds barrier-related intrinsic lowering that depends on barrier analysis having already run. The factory allocates a pass object and stores the level in it; the pass entry point reads this level to filter which intrinsics are candidates for lowering.

At the core engine (sub_2C63FB0), the entry check validates that the instruction is an intrinsic call: the byte at call->arg_chain->offset_8 must equal 17 (intrinsic call marker), and call->offset_16 must be non-null (the callee exists). If either check fails, the function returns 0 (no lowering performed).
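The guard can be sketched as follows. The dictionary field names (`arg_chain`, `kind`, `callee`) are illustrative stand-ins for the decompiled structure offsets, not names recovered from the binary:

```python
def is_lowerable_call(call):
    """Entry guard: only intrinsic calls with a resolved callee proceed.

    `call` models the decompiled instruction record; "kind" stands in
    for the byte at call->arg_chain->offset_8, and "callee" for the
    pointer at call->offset_16.
    """
    INTRINSIC_CALL_MARKER = 17
    if call.get("arg_chain", {}).get("kind") != INTRINSIC_CALL_MARKER:
        return 0  # not an intrinsic call
    if call.get("callee") is None:
        return 0  # no resolved callee
    return 1  # candidate for lowering
```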

Pattern-Matching Rewrite Loop

The algorithm operates as a worklist-driven rewrite system:

function lowerIntrinsic(ctx, call, level):
    if not isIntrinsicCall(call): return 0
    if not hasCallee(call): return 0

    operands = collectOperands(call)        // v285/v286 arrays
    worklist_direct = []                     // v288: direct operand replacements
    worklist_typed  = []                     // v294: type-changed operands
    worklist_shuf   = []                     // v300: shuffle/reorganized operands

    iterations = 0
    while iterations < qword_5010AC8:       // default 30
        iterations++

        // Phase 1: build candidate lowerings
        candidates = buildCandidates(operands)   // sub_2C4D470
        for each candidate in candidates:
            pattern = extractPattern(candidate)  // sub_2C4D5A0

            // Phase 2: type compatibility check
            if not checkTypeCompat(pattern, operands):  // sub_AD7630
                continue

            // Phase 3: operand matching
            if not matchOperands(pattern, operands):
                continue

            // Phase 4: additional pattern checks
            if not additionalChecks(pattern):    // sub_2C50020
                continue

            // Phase 5: core lowering -- create replacement
            replacement = buildReplacement(      // sub_2C515C0
                ctx, operands,
                worklist_direct, worklist_typed, worklist_shuf)

            // Phase 6: substitute
            replaceAllUses(call, replacement)    // sub_BD84D0
            transferMetadata(call, replacement)  // sub_BD6B90
            queueForDeletion(call)               // sub_F15FC0
            return 1

    return 0  // no lowering found within iteration limit

The iteration limit of 30 (stored in qword_5010AC8) exists because lowering one intrinsic can produce new intrinsic calls that themselves need lowering. For example, lowering a wide vector intrinsic into narrower operations may produce calls to narrower intrinsics. Without the limit, pathological patterns could cause infinite expansion. In practice, most intrinsics lower in a single iteration; the limit is a safety net.
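The expansion dynamic can be modeled with a toy rewrite loop: each step replaces a "wide" pseudo-intrinsic of width w with two of width w/2, and the iteration cap bounds the expansion. All names and rules here are illustrative, not the binary's:

```python
MAX_ITERATIONS = 30  # mirrors the default in qword_5010AC8

def lower_wide(intrinsic):
    """Toy rule: a wide op of width w lowers to two ops of width w/2;
    width-1 ops are terminal (no rule matches)."""
    name, width = intrinsic
    if width > 1:
        return [(name, width // 2), (name, width // 2)]
    return None  # no lowering rule applies

def run_to_fixpoint(worklist):
    """Repeatedly lower until nothing changes or the cap is hit."""
    iterations = 0
    while iterations < MAX_ITERATIONS:
        iterations += 1
        progress = False
        next_list = []
        for item in worklist:
            lowered = lower_wide(item)
            if lowered is not None:
                next_list.extend(lowered)  # lowering exposed new intrinsics
                progress = True
            else:
                next_list.append(item)
        worklist = next_list
        if not progress:
            break  # fixpoint reached
    return worklist, iterations
```

A width-4 op settles into four width-1 ops after two productive iterations plus one confirming pass; a rule that grew the worklist without shrinking widths would instead run until the cap stops it.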

Three Worklist Structures

The rewrite engine maintains three parallel worklist structures that categorize how operands are transformed:

| Worklist | Variable | Purpose |
|---|---|---|
| Direct | v288 | Operands that pass through unchanged -- same value, same type |
| Type-changed | v294 | Operands that need a type conversion (e.g., NVVM-specific type to standard LLVM type) |
| Shuffle/reorganized | v300 | Operands that need positional rearrangement (vector lane reordering, element extraction) |

When sub_2C515C0 builds the replacement instruction, it reads all three worklists to assemble the final operand list: direct operands are copied verbatim, type-changed operands go through a bitcast or type conversion, and shuffle operands are processed through a shufflevector or extractelement/insertelement sequence.
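A sketch of that assembly step, with the three-category tagging as the only structure; the tuple encodings (`"bitcast"`, `"shufflevector"`) are assumptions for illustration, not the binary's representation:

```python
def assemble_operands(direct, typed, shuffled):
    """Model of sub_2C515C0's reported behavior: merge three worklists
    into one final operand list. Each worklist entry is (position,
    payload); positions index into the replacement's operand list."""
    final = {}
    for pos, value in direct:
        final[pos] = value  # copied verbatim
    for pos, (value, target_ty) in typed:
        final[pos] = ("bitcast", value, target_ty)  # type conversion
    for pos, (vec, mask) in shuffled:
        final[pos] = ("shufflevector", vec, mask)  # lane rearrangement
    return [final[i] for i in sorted(final)]
```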

Lowering Categories

Vector Operation Decomposition

Wide vector NVVM intrinsics (operating on v4f32, v2f64, v4i32, etc.) are decomposed into sequences of narrower operations. The NVVM IR frontend emits vector intrinsics to express data-parallel GPU operations, but the NVPTX backend's instruction selector handles scalar or narrow-vector operations more efficiently.

The decomposition pattern:

// Before: single wide-vector intrinsic call
%result = call <4 x float> @llvm.nvvm.wide.op(<4 x float> %a, <4 x float> %b)

// After: four scalar operations + vector reconstruction
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
...
%r0 = call float @llvm.nvvm.narrow.op(float %a0, float %b0)
%r1 = call float @llvm.nvvm.narrow.op(float %a1, float %b1)
...
%v0 = insertelement <4 x float> undef, float %r0, i32 0
%v1 = insertelement <4 x float> %v0,   float %r1, i32 1
...

This decomposition enables scalar optimizations (constant folding, CSE) to work on individual lanes, and the narrower intrinsics may themselves lower in subsequent iterations -- hence the iteration limit.
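The per-lane benefit can be modeled concretely: once a wide op is split into lanes, lanes whose operands are both constants fold immediately, while the rest remain as narrow calls. A toy sketch (the real engine operates on LLVM IR values, not Python values):

```python
def decompose_and_fold(op, lhs, rhs):
    """Split a wide pseudo-intrinsic into per-lane operations; fold
    lanes with two constant operands, leave the rest symbolic."""
    result = []
    for a, b in zip(lhs, rhs):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            result.append(op(a, b))  # constant-folded lane
        else:
            result.append(("narrow.op", a, b))  # residual narrow call
    return result
```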

Shuffle Vector Lowering

When an NVVM intrinsic performs pure data reorganization -- lane permutation, broadcast, or subvector extraction -- without any arithmetic, the pass replaces it with an LLVM shufflevector instruction. The core lowering for this path goes through sub_DFBC30, which takes:

sub_DFBC30(context, operation=6, type_info, shuffle_indices, count, flags)

The operation=6 constant identifies this as a shufflevector creation. The shuffle_indices array encodes the lane mapping: for a warp shuffle that broadcasts lane 0 to all lanes, the mask would be <0, 0, 0, 0, ...>. For a rotation, it might be <1, 2, 3, 0>.

Shuffle lowering handles several NVVM intrinsic families:

  • Warp-level shuffle operations (__shfl_sync, __shfl_up_sync, etc.) when the shuffle amount is a compile-time constant
  • Subvector extraction from wider types (e.g., extracting the low v2f16 from a v4f16)
  • Lane broadcast patterns used in matrix fragment loading
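Applying such a mask is pure index selection; this sketch shows the single-input `shufflevector` semantics for the broadcast and rotation masks mentioned above:

```python
def apply_shuffle_mask(vec, mask):
    """Single-input shufflevector semantics: lane i of the result
    is vec[mask[i]]."""
    return [vec[i] for i in mask]

lanes = [10.0, 20.0, 30.0, 40.0]
broadcast = apply_shuffle_mask(lanes, [0, 0, 0, 0])  # lane 0 to all lanes
rotation = apply_shuffle_mask(lanes, [1, 2, 3, 0])   # rotate left by one
```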

Type Conversion Lowering

NVVM defines intrinsic-based type conversions for types that LLVM's standard type system does not directly support, such as:

  • BF16 (bfloat16) to/from FP32 -- intrinsic ID 0x2106, gated by sm_72+
  • TF32 (tensorfloat32) conversions -- intrinsic ID 0x2106 with conversion type 3+, gated by sm_75+
  • FP8 (E4M3/E5M2) conversions -- intrinsic ID 0x2107, gated by sm_89+ (Ada)
  • Extended type conversions with saturate, rounding mode control

The lowering replaces these intrinsic calls with sequences of:

  • bitcast for reinterpretation between same-width types
  • fptrunc / fpext for standard floating-point width changes
  • trunc / zext / sext for integer width changes
  • Arithmetic sequences for rounding mode emulation when the hardware rounding mode is not directly expressible

The sub_2C52B30 helper ("get canonical type") resolves NVVM-specific type encodings to their standard LLVM Type* equivalents during this process.
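The BF16 case is the simplest to illustrate: bfloat16 is the top 16 bits of an IEEE-754 float32, so widening is an exact shift and narrowing (here with round-to-nearest-even) is bit arithmetic. This is a generic sketch of the format, independent of the binary's actual lowering sequence:

```python
import struct

def f32_bits(x):
    """Reinterpret a float32 as its 32-bit pattern (the bitcast step)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def bf16_to_f32(bf16_bits):
    """Widening is exact: bf16 is the high half of an f32."""
    return bits_f32(bf16_bits << 16)

def f32_to_bf16(x):
    """Narrow with round-to-nearest-even on the 16 discarded bits."""
    b = f32_bits(x)
    rounding_bias = 0x7FFF + ((b >> 16) & 1)
    return ((b + rounding_bias) >> 16) & 0xFFFF
```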

Multi-Run Pattern

NVVMIntrinsicLowering appears more times in the compilation pipeline than any other NVIDIA custom pass. In the "mid" path (standard CUDA compilation), it runs approximately 10 times across the main path and the Tier 1/2/3 sub-pipelines. The pattern reveals a deliberate interleaving strategy.

"mid" Path Invocations (Level 0)

All four invocations in the main "mid" path use level 0 and are guarded by !opts[2000]:

| Position | Context | Preceding Pass | Following Pass | Purpose |
|---|---|---|---|---|
| 1st | Early pipeline | ConstantMerge | MemCpyOpt | Lower intrinsics from the original IR before SROA/GVN operate |
| 2nd | After InstCombine + standard pipeline #5 | LLVM standard #5 | DeadArgElim | Re-lower intrinsics that InstCombine may have simplified or inlining may have exposed |
| 3rd | After NVVMReflect + standard pipeline #8 | LLVM standard #8 | IPConstProp | Lower intrinsics whose arguments became constant after NVVMReflect resolved __nvvm_reflect() calls |
| 4th | Late pipeline | LICM | NVVMBranchDist | Final cleanup of any remaining lowerable intrinsics before register-pressure-sensitive passes |

Tier 1/2/3 Invocations (Level 1)

Within the sub_12DE8F0 tier sub-pipeline, the pass runs with level 1 at five distinct points:

| Position | Context | Notes |
|---|---|---|
| 1st | Tier entry | Immediately at tier start -- lower barrier-related intrinsics before barrier analysis |
| 2nd | After 1st NVVMIRVerification | Re-lower after verification may have canonicalized IR |
| 3rd | After CVP + NVVMVerifier + NVVMIRVerification | Post-optimization cleanup of barrier intrinsics |
| 4th | After LoopUnswitch + standard pipeline #1 | Re-lower intrinsics exposed by loop transformations |
| 5th | After DSE + DCE + standard pipeline #1 | Final tier cleanup before MemorySpaceOpt |

Each tier (1, 2, and 3) runs this same sequence independently, so in a full compilation with all three tiers active, level-1 lowering executes up to 15 times total.

Level Parameter Semantics

The level parameter partitions the intrinsic lowering rules into two sets:

Level 0 -- Basic lowering. Handles intrinsics whose lowering depends only on the intrinsic's operands and types. This includes vector decomposition, shuffle lowering, and standard type conversions. These are safe to run at any point in the pipeline because they have no dependencies on analysis results. The "mid" path runs level 0 exclusively.

Level 1 -- Barrier-aware lowering. Handles intrinsics related to synchronization barriers (__syncthreads, __syncwarp, barrier-guarded memory operations) whose lowering must coordinate with the barrier analysis infrastructure. In the tier sub-pipeline, level 1 runs at the entry point before NVVMBarrierAnalysis and NVVMLowerBarriers, and again after those passes have run. This two-phase pattern within the tier ensures that:

  1. Barrier intrinsics are lowered to a canonical form that the barrier analysis can recognize
  2. After barrier analysis and lowering, any residual barrier-related intrinsics are cleaned up

Level 1 is restricted to the tiers rather than the main "mid" path because the tier sub-pipeline (sub_12DE8F0) sets up the barrier analysis state (via sub_18E4A00 / sub_1C98160) that level-1 lowering depends on. Running level 1 in the main path, before this state exists, would produce incorrect results.
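Since level 1 adds barrier-aware rules on top of the basic set, the partition can be modeled as a rule table filtered by level. The rule names below are illustrative placeholders, not identifiers from the binary:

```python
# Hypothetical rule names for illustration only.
LEVEL_0_RULES = {"llvm.nvvm.wide.op", "llvm.nvvm.shuffle.op", "llvm.nvvm.cvt.bf16"}
LEVEL_1_RULES = {"llvm.nvvm.barrier.sync", "llvm.nvvm.barrier.arrive"}

def candidate_rules(level):
    """Level 0 sees only analysis-free rules; level 1 adds the
    barrier-aware set on top of them."""
    rules = set(LEVEL_0_RULES)
    if level >= 1:
        rules |= LEVEL_1_RULES
    return rules
```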

Interaction with NVVMReflect

NVVMReflect resolves compile-time queries about the target GPU architecture:

%arch = call i32 @llvm.nvvm.reflect(metadata !"__CUDA_ARCH__")
; After NVVMReflect: %arch = i32 900  (for sm_90)

This resolution has a cascading effect on intrinsic lowering. Many NVVM intrinsics are conditionally emitted by the frontend behind architecture checks:

if (__CUDA_ARCH__ >= 900) {
    // Hopper-specific intrinsic
    __nvvm_tma_load_async(...);
} else {
    // Fallback path using standard loads
}

After NVVMReflect replaces the architecture query with a constant, and nvvm-reflect-pp (SimplifyConstantConditionalsPass) eliminates the dead branch, the surviving path may contain intrinsics that were previously unreachable. The pipeline runs NVVMIntrinsicLowering after NVVMReflect specifically to catch these newly-exposed intrinsics. This is why the 3rd invocation in the "mid" path immediately follows NVVMReflect + LLVM standard pipeline #8.
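The cascade can be modeled in three steps: substitute the reflect query with a constant, fold the now-constant branch, and collect the intrinsics that survive for the next lowering run. A toy sketch; the intrinsic name mirrors the `__nvvm_tma_load_async` example above but is used here purely as a placeholder:

```python
def resolve_reflect_and_fold(arch, branches):
    """branches: ordered (min_arch, intrinsics) pairs guarding code by a
    __CUDA_ARCH__ threshold; the first satisfied guard wins. Returns
    the intrinsic calls left on the surviving path."""
    for min_arch, intrinsics in branches:
        if arch >= min_arch:  # the branch condition is now a constant
            return intrinsics  # surviving path; the rest is dead code
    return []

surviving = resolve_reflect_and_fold(
    900,  # sm_90, as NVVMReflect would substitute
    [(900, ["llvm.nvvm.tma.load.async"]),  # Hopper path (placeholder name)
     (0, [])],                             # fallback using standard loads
)
```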

Configuration

NVVMPassOptions

| Slot | Offset | Type | Default | Semantics |
|---|---|---|---|---|
| 98 | 1976 | STRING | (empty) | Paired string parameter for the pass (unused or reserved) |
| 99 | 2000 | BOOL_COMPACT | 0 | Disable flag: 0 = enabled, 1 = disabled |

Setting slot 99 to 1 disables all invocations of NVVMIntrinsicLowering across the entire pipeline -- both level 0 and level 1. There is no mechanism to disable one level independently.

Global Variables

| Variable | Default | Purpose |
|---|---|---|
| qword_5010AC8 | 30 | Maximum iterations per invocation of the rewrite loop |

This global is not exposed as a user-facing knob. It is initialized at program startup and is constant for the lifetime of the process.

Key Helper Functions

Pattern Matching

| Function | Role |
|---|---|
| sub_2C4D470 | Build candidate lowering list from intrinsic operands |
| sub_2C4D5A0 | Extract pattern from candidate -- returns the lowering rule |
| sub_2C50020 | Additional pattern compatibility checks beyond type matching |
| sub_2C52B30 | Get canonical LLVM type for an NVVM-specific type encoding |
| sub_AD7630 | Type-lowering query -- checks if source type can lower to target type |

Instruction Construction

| Function | Role |
|---|---|
| sub_2C515C0 | Build replacement instruction from three worklist structures |
| sub_2C4FB60 | Opcode dispatch -- selects the LLVM opcode for the lowered operation |
| sub_DFBC30 | Create shufflevector or similar vector IR construct (operation=6) |

IR Mutation

| Function | Role |
|---|---|
| sub_BD84D0 | Replace all uses of old instruction with new value |
| sub_BD6B90 | Transfer metadata from old instruction to replacement |
| sub_F15FC0 | Queue old instruction for deletion |

Pass Infrastructure

| Function | Address | Role |
|---|---|---|
| sub_1CB4E40 | 0x1CB4E40 | Pass factory -- creates pass with level parameter |
| sub_2C63FB0 | 0x2C63FB0 | Core lowering engine (140KB, 2,460 lines) |

Diagnostic Strings

The core engine at sub_2C63FB0 contains no user-visible diagnostic strings. This is unusual for a 140KB function and reflects the fact that intrinsic lowering is a mechanical pattern-matching operation: either a lowering rule matches (silently applied) or it does not (silently skipped). Failures are not reported because an unlowered intrinsic is not necessarily an error -- it may be handled by a later pass (NVVMLowerBarriers, GenericToNVVM) or by the NVPTX instruction selector directly.

The pass factory sub_1CB4E40 similarly contains no diagnostic strings.

Pipeline Position Summary

sub_12E54A0 (Master Pipeline Assembly)
  │
  ├─ "mid" path (level 0, 4 invocations):
  │    ├─ #1: After ConstantMerge, before MemCpyOpt/SROA
  │    ├─ #2: After InstCombine + LLVM standard #5
  │    ├─ #3: After NVVMReflect + LLVM standard #8
  │    └─ #4: After LICM, before NVVMBranchDist/Remat
  │
  ├─ "ptx" path (level 0, 0 invocations):
  │    └─ (not present -- PTX input already has intrinsics lowered)
  │
  ├─ default path (level 0, 1 invocation):
  │    └─ #1: After NVVMReflect, before NVVMPeephole
  │
  └─ Tier 1/2/3 sub-pipeline (level 1, 5 invocations per tier):
       ├─ #1: Tier entry
       ├─ #2: After NVVMIRVerification
       ├─ #3: After CVP + NVVMVerifier
       ├─ #4: After LoopUnswitch + LLVM standard #1
       └─ #5: After DSE + DCE + LLVM standard #1

Cross-References