Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

nvvm-peephole-optimizer

The NVVM Peephole Optimizer is an NVIDIA-proprietary function-level IR pass that performs NVVM-specific pattern matching and instruction simplification. It is distinct from both LLVM's standard InstCombine pass (which handles general-purpose peephole optimization across ~600 functions in the 0x1700000--0x17B0000 range) and the machine-level nvptx-peephole pass (sub_21DB090) that operates on MachineInstrs after instruction selection.

This page documents all three peephole layers in cicc, their pipeline positions, their transformations, and the satellite machine-level peephole passes that complement them.

Pass namenvvm-peephole-optimizer
Classllvm::NVVMPeepholeOptimizerPass
ScopeFunction pass (IR level)
RegistrationNew PM slot 382 in sub_2342890
Serializersub_2314DA0
Factory (legacy PM)sub_1CEF8F0
Pipeline parser line3534 in sub_233C410
Enable knobenable-nvvm-peephole (bool, default = true)
Knob addressctor_358_0 @ 0x50E8D0
NVVMPassOptions slotnvvm-peephole-optimizer in 4512-byte options struct
Pipeline positionFunction-level, runs after NVVMReflect + NVVMIntrinsicLowering

Purpose

CUDA programs produce IR patterns that standard LLVM optimizations do not recognize or cannot legally transform. The NVVM peephole pass fills this gap by matching NVVM-specific idioms -- address space casts, intrinsic call sequences, convergent operation patterns, and GPU-specific type conversions -- and rewriting them into simpler, cheaper forms. It operates at the LLVM IR level before code generation, complementing the machine-level nvptx-peephole pass that runs later in the pipeline.

The pass is always paired with sub_215D9D0 (NVVMAnnotationsProcessor), which runs immediately after the peephole in every pipeline path. This companion pass processes NVVM annotations (e.g., tcgen05 tensor annotation metadata) on the IR that the peephole has just simplified.

Three Peephole Layers

CICC contains three distinct peephole optimization layers, each operating at a different abstraction level and targeting different pattern classes.

LayerPassLevelAddress / SlotTargets
LLVM InstCombineinstcombineIR0x1700000+ (~600 funcs)General-purpose: algebraic simplification, constant folding, dead instruction removal
NVVM Peepholenvvm-peephole-optimizerIRslot 382, factory sub_1CEF8F0NVVM-specific: address space casts, intrinsic sequences, GPU type conversions
NVPTX Peepholenvptx-peepholeMachineInstrsub_21DB090PTX-specific: redundant cvta folding, predicate optimization, move elimination

The NVVM peephole pass handles transformations that require knowledge of NVVM's address space model, intrinsic semantics, or GPU-specific type system -- patterns that InstCombine cannot match because they depend on NVPTX target information not available to target-independent passes. The machine-level NVPTX peephole then handles patterns that only emerge after instruction selection has lowered IR to MachineInstrs.

Pipeline Positions

IR-Level: nvvm-peephole-optimizer

The IR-level peephole (sub_1CEF8F0) is invoked from the legacy pipeline assembler (sub_12E54A0) in all three language-specific code paths. Its companion sub_215D9D0 always follows immediately.

Path A -- "ptx" language (lines 580--638 in sub_12E54A0):

sub_1CEF8F0()    NVVMPeephole
sub_215D9D0()    NVVMAnnotationsProcessor
sub_1857160()    NVVMReflect (conditional)
sub_1A62BF0(1)   LLVM standard pipeline #1
sub_1B26330()    MemCpyOpt
sub_18DEFF0()    DCE
...

Path B -- "mid" language (Ofcmid, lines 814--1075):

sub_184CD60()    ConstantMerge / GlobalDCE
sub_1CB4E40(0)   NVVMIntrinsicLowering
sub_1B26330()    MemCpyOpt
sub_198E2A0()    SROA / CorrelatedValuePropagation
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_17060B0(1,0) PrintModulePass
sub_198DF00(-1)  JumpThreading / CVP
sub_1C6E800()    GVN / LICM
...

Path C -- default/general (O2/O3, lines 1077--1371):

sub_1A62BF0(4)   LLVM standard pipeline #4
sub_1857160()    NVVMReflect
sub_1CB4E40(0)   NVVMIntrinsicLowering
sub_1857160()    NVVMReflect (second pass)
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_1A7A9F0()    InstructionSimplify
sub_1A62BF0(5)   LLVM standard pipeline #5
...

Late position (O3 tier finalization):

sub_1B7FDF0(n)   BranchFolding / CFGSimplify
sub_1CEF8F0()    NVVMPeephole                   <<<
sub_215D9D0()    NVVMAnnotationsProcessor
sub_18B3080(f)   Sinking2Pass (fast mode)
sub_1CC60B0()    NVVMSinking
sub_18A3430()    AggressiveInstCombine
...

In every path, the peephole runs after NVVMIntrinsicLowering (sub_1CB4E40) and NVVMReflect (sub_1857160) have resolved intrinsics and reflect calls. This ensures the peephole sees simplified IR where previously-opaque intrinsic call patterns have been reduced to simpler forms amenable to pattern matching.

Machine-Level: nvptx-peephole

The machine-level peephole (sub_21DB090) runs in addPreRegAlloc() (sub_2166ED0):

EarlyTailDuplicate
codegen DCE
Machine LICM + CSE + Sinking        (conditional on enable-mlicm, enable-mcse)
PeepholeOptimizerPass                (stock LLVM, slot 492, disable-peephole)
NVPTXPeephole (sub_21DB090)         <<<
DeadMachineInstrElim
MachineCopyPropagation

The string "After codegen peephole optimization pass" in sub_2166ED0 marks the checkpoint after both the stock LLVM peephole and the NVPTX peephole have completed.

New PM Registration

The pass is registered as a function-level pass in the New Pass Manager at registration line 2242 in sub_2342890. It sits in the mid-optimization phase alongside other NVIDIA function passes:

SlotPassClass
376basic-dbeBasicDeadBarrierEliminationPass
377branch-distBranchDistPass
378byval-mem2regByValMem2RegPass
379bypass-slow-divisionBypassSlowDivisionPass
380normalize-gepNormalizeGepPass
381nvvm-reflect-ppSimplifyConstantConditionalsPass
382nvvm-peephole-optimizerNVVMPeepholeOptimizerPass
383old-load-store-vectorizerOldLoadStoreVectorizerPass
384print<merge-sets>MergeSetsAnalysisPrinterPass
385rematRematerializationPass

IR-Level Transformation Categories

Based on pipeline position (after NVVMReflect + NVVMIntrinsicLowering, before sinking and rematerialization) and the patterns visible in NVVM IR, the peephole optimizer targets several categories.

Address Space Cast Simplification

After memory-space-opt and ipmsp resolve generic pointers to specific address spaces, redundant addrspacecast chains remain in the IR. The peephole rewrites these:

; Before:
%p1 = addrspacecast ptr addrspace(3) %src to ptr        ; shared -> generic
%p2 = addrspacecast ptr %p1 to ptr addrspace(3)         ; generic -> shared
store i32 %val, ptr addrspace(3) %p2

; After:
store i32 %val, ptr addrspace(3) %src                   ; chain eliminated
; Before:
%p = addrspacecast ptr addrspace(1) %src to ptr addrspace(1)  ; identity cast

; After:
; (use %src directly — identity addrspacecast removed)

The validation function sub_21BEE70 ("Bad address space in addrspacecast", 4.1KB) ensures the peephole does not create illegal address space transitions. NVPTX address spaces are:

ASNameLegal cast targets
0GenericAll
1GlobalGeneric
3SharedGeneric
4ConstantGeneric
5LocalGeneric

Intrinsic Call Folding

After NVVMIntrinsicLowering has expanded NVVM intrinsics, some expansion sequences can be further simplified:

; Before (after intrinsic lowering, launch_bounds known):
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cmp = icmp ult i32 %tid, 256       ; blockDim.x known = 256

; After (when nvvm-intr-range has set !range {0, 256}):
%cmp = i1 true                      ; always true for valid threads
; Before:
call void @llvm.nvvm.barrier0()
; (no shared memory operations between barriers)
call void @llvm.nvvm.barrier0()

; After:
call void @llvm.nvvm.barrier0()     ; redundant barrier removed

Type Conversion Cleanup

GPU-specific type representations (bf16, tf32, fp8) produce conversion chains not present in standard LLVM IR:

; Before (roundtrip through wider type):
%wide = fpext half %x to float
%back = fptrunc float %wide to half

; After:
; (use %x directly — roundtrip eliminated when no precision loss)
; Before (bf16 roundtrip):
%f32 = call float @llvm.nvvm.bf16.to.f32(i16 %bf)
%bf2 = call i16 @llvm.nvvm.f32.to.bf16(float %f32)

; After:
; (use %bf directly)

Post-Reflect Dead Code Cleanup

The companion pass nvvm-reflect-pp (SimplifyConstantConditionalsPass) runs immediately before the peephole in the pipeline. It resolves __nvvm_reflect() calls and simplifies constant conditionals:

; Before (after nvvm-reflect-pp resolves __nvvm_reflect("__CUDA_FTZ") = 1):
%ftz = call i32 @__nvvm_reflect(ptr @"__CUDA_FTZ")     ; resolved to 1
%cmp = icmp ne i32 %ftz, 0                              ; always true
br i1 %cmp, label %ftz_path, label %no_ftz_path

; After nvvm-reflect-pp:
br label %ftz_path                   ; unconditional

; The peephole then cleans up dead instructions in %no_ftz_path
; and simplifies any resulting phi nodes at merge points

Convergent Operation Canonicalization

CUDA's convergent operations (__syncwarp, __ballot_sync, etc.) have specific semantic constraints that standard InstCombine cannot reason about because it must treat convergent calls as opaque. The peephole, with knowledge of NVVM semantics, can simplify convergent call sequences when the mask or participating threads can be determined at compile time.

Machine-Level NVPTXPeephole (sub_21DB090)

The machine-level peephole operates on MachineInstr objects after instruction selection has converted LLVM IR to PTX pseudo-instructions. It targets patterns specific to the PTX instruction set.

Redundant cvta Folding

The cvta (convert address) instruction converts between generic and specific address spaces. Address space lowering often inserts redundant conversions:

// Before:
cvta.to.global %rd1, %rd2        ; convert global -> generic
cvta.global %rd3, %rd1           ; convert generic -> global (redundant pair)

// After:
mov.b64 %rd3, %rd2               ; direct copy, cvta pair eliminated

The companion pass sub_21DA810 ("NVPTX optimize redundant cvta.to.local instruction") handles the remaining cvta.to.local instructions that survive to late post-RA:

// Before (late pipeline):
cvta.to.local %rd1, %rd2         ; redundant when %rd2 is already local-space

// After (sub_21DA810 removes it):
; (use %rd2 directly)

Predicate Pattern Optimization

PTX uses predicate registers for conditional execution. The peephole simplifies predicate sequences:

// Before:
setp.ne.s32 %p1, %r1, 0;
@%p1 bra target;

// After (folds setp into branch when pattern is recognized):
// Combined compare-and-branch

PTX Move Elimination

sub_2204E60 ("Remove redundant moves") eliminates identity moves:

// Before:
mov.b32 %r5, %r5;               ; identity move

// After:
; (deleted)

Satellite Machine Peephole Passes

Three additional machine-level passes perform specialized peephole transformations adjacent to the main NVPTXPeephole:

param-opt (sub_2203290)

Pass nameparam-opt
Entry pointsub_2203290
Description"Optimize NVPTX ld.param"

Optimizes parameter load patterns. In PTX, kernel parameters are loaded via ld.param instructions into registers. When the same parameter is loaded multiple times (e.g., after inlining or loop unrolling), param-opt consolidates them:

// Before:
ld.param.u32 %r1, [_param_0];
...
ld.param.u32 %r7, [_param_0];    ; redundant reload of same parameter

// After:
ld.param.u32 %r1, [_param_0];
...
mov.b32 %r7, %r1;                ; reuse previous load

nvptx-trunc-opts (sub_22058E0)

Pass namenvptx-trunc-opts
Entry pointsub_22058E0
Description"Optimize redundant ANDb16ri instrunctions" [sic]

Eliminates redundant AND operations on b16 (16-bit) registers. When type legalization widens a sub-16-bit value to 16 bits, it inserts an AND with a mask to preserve the original width. If the value is already correctly masked (e.g., from a load that zero-extends), the AND is redundant:

// Before:
ld.u8 %rs1, [%rd1];              ; loads 8-bit, zero-extended to 16
and.b16 %rs2, %rs1, 0xFF;        ; redundant mask — already 8-bit clean

// After:
ld.u8 %rs1, [%rd1];
// (AND deleted, use %rs1 directly)

The binary contains the string with a typo: "instrunctions" instead of "instructions".

Remove Redundant Moves (sub_2204E60)

Entry pointsub_2204E60
Description"Remove redundant moves"

Eliminates move instructions where source and destination are the same register, or where the move is immediately dead. This complements the stock LLVM MachineCopyPropagation pass with PTX-specific move patterns.

Knobs

KnobTypeDefaultScopeEffect
enable-nvvm-peepholebooltrueIR + MachineMaster switch for both the IR-level nvvm-peephole-optimizer and the machine-level nvptx-peephole. Registered at ctor_358_0 (0x50E8D0).
disable-peepholeboolfalseMachine onlyDisables the stock LLVM PeepholeOptimizerPass (slot 492). Does not affect the NVIDIA-specific passes. Registered at ctor_314 (0x502360).
aggressive-ext-optbool(varies)Machine onlyControls aggressive extension optimization in stock LLVM peephole.
disable-adv-copy-optboolfalseMachine onlyDisables advanced copy optimization in stock LLVM peephole.
rewrite-phi-limitint(varies)Machine onlyLimits PHI rewriting in stock LLVM peephole.
recurrence-chain-limitint(varies)Machine onlyLimits recurrence chain analysis in stock LLVM peephole.

The enable-nvvm-peephole description string recovered from the binary is: "Enable NVVM Peephole Optimizer". Its default-on status indicates the pass is mature and does not require opt-in behavior.

Optimization Level Behavior

The IR-level peephole runs in all optimization paths except -O0:

LevelPathNVVMPeephole invocations
Ofcmin"ptx" path1 (early)
Ofcmid"mid" path1 (after SROA/CVP)
O2/O3"default" path1 (after NVVMReflect + IntrinsicLowering)
O3 (late)Tier finalization1 (after BranchFolding/CFGSimplify)

At -O0, the peephole is likely skipped along with most optimization passes. The factory function sub_1CEF8F0 appears only in code paths that are active at O1 and above.

End-to-End Peephole Pipeline

The complete peephole optimization flow through cicc, from IR to PTX:

Source CUDA
    |
    v
[LLVM IR after clang/EDG frontend]
    |
    v
InstCombine (0x1700000+)           General algebraic simplification
    |                               ~600 functions, target-independent
    v
NVVMReflect (sub_1857160)          Resolve __nvvm_reflect() calls
    |
    v
nvvm-reflect-pp                    Simplify constant conditionals from reflect
    |
    v
NVVMIntrinsicLowering (sub_1CB4E40) Expand NVVM intrinsics
    |
    v
nvvm-peephole-optimizer             NVVM-specific IR patterns:
  (sub_1CEF8F0 factory)              - addrspacecast chain folding
    |                                - intrinsic sequence simplification
    v                                - type conversion roundtrip elimination
NVVMAnnotationsProcessor             - post-reflect dead code cleanup
  (sub_215D9D0 companion)
    |
    v
[Further IR optimization: GVN, LICM, Sinking2, etc.]
    |
    v
[Instruction Selection: DAGToDAG (sub_2200150, 78KB)]
    |     Hash-table pattern matching: hash = (37*idx) & (tableSize-1)
    v
PeepholeOptimizerPass (slot 492)    Stock LLVM machine peephole:
    |                                - redundant copy folding
    v                                - compare-and-branch simplification
NVPTXPeephole (sub_21DB090)         PTX-specific machine peephole:
    |                                - cvta pair elimination
    v                                - predicate folding
param-opt (sub_2203290)              - ld.param consolidation
    |
    v
nvptx-trunc-opts (sub_22058E0)      - ANDb16ri elimination
    |
    v
Remove Redundant Moves (sub_2204E60) - identity move deletion
    |
    v
[Register Allocation]
    |
    v
ProxyRegErasure (sub_21DA810)       Late cvta.to.local removal
    |
    v
[PTX Emission]

Function Map

FunctionAddressSizeRole
--sub_1CEF8F0smallNVVMPeephole factory (legacy PM)
--sub_215D9D0--NVVMAnnotationsProcessor (companion, always paired)
--sub_2314DA0smallNVVMPeepholeOptimizerPass serializer (New PM)
--sub_2342890--New PM registration function (slot 382)
--sub_233C410--Pipeline text parser (line 3534)
--sub_21DB090smallNVPTXPeephole machine pass registration
--sub_2166ED01.6KBaddPreRegAlloc() -- hosts NVPTXPeephole
--sub_21DA810--ProxyRegErasure (cvta.to.local removal)
--sub_2203290smallparam-opt (ld.param optimization)
--sub_2204E60smallRemove Redundant Moves
--sub_22058E0smallnvptx-trunc-opts (ANDb16ri elimination)
--sub_21BEE704.1KB"Bad address space in addrspacecast" validation
--sub_20DA7F030KBDAG combine / peephole on MachineInstrs
--sub_37E1AE018KBLate-stage machine optimization (peephole or copy prop)

Differences from Upstream LLVM

Upstream LLVM (as of LLVM 17/18) contains NVPTXPeephole.cpp in llvm/lib/Target/NVPTX/, which implements a small machine-level pass that:

  1. Folds cvta address-space-conversion pseudo-instructions
  2. Removes NVPTX::PROXY_REG pseudo-instructions (now split into a separate NVPTXProxyRegErasure pass in cicc)

CICC v13.0 extends this significantly:

  • The IR-level pass (nvvm-peephole-optimizer) has no upstream counterpart. It is entirely NVIDIA-proprietary, filling a gap between target-independent InstCombine and target-specific machine peephole.
  • Three satellite machine passes (param-opt, nvptx-trunc-opts, Remove Redundant Moves) have no upstream equivalents.
  • The machine-level nvptx-peephole is larger than upstream, likely incorporating additional pattern rules for newer PTX features (tensor core operations, cluster operations, etc.).
  • ProxyRegErasure is separated from NVPTXPeephole into its own pass (sub_21DA810) and runs late post-RA rather than inline with the peephole.

Evidence Summary

The pass's existence and classification are confirmed through multiple independent sources:

SourceAddress / LocationEvidence
Pipeline parsersub_233C410 line 3534Registers "nvvm-peephole-optimizer" as function-level NVIDIA custom pass
New PM registrationsub_2342890 slot 382Maps string to llvm::NVVMPeepholeOptimizerPass
Serializersub_2314DA0Produces "nvvm-peephole-optimizer" text for pipeline printing
Legacy PM factorysub_1CEF8F0Called 2x from sub_12E54A0 (pipeline assembler)
Companion pairingsub_215D9D0Always immediately follows sub_1CEF8F0 in all paths
Knob sweep0x50E8D0 (ctor_358_0)enable-nvvm-peephole = "Enable NVVM Peephole Optimizer", default true
Knob duplicate0x560000 sweep line 292Confirmed with identical description
NVVMPassOptionsp2a.3-03-passoptions.txtListed as nvvm-peephole-optimizer in option table
Machine passsub_21DB090"NVPTX Peephole" / "nvptx-peephole" registration string
Machine pipelinesub_2166ED0"After codegen peephole optimization pass" checkpoint string

Confidence note. The pass registration, knobs, pipeline position, and factory function are confirmed at HIGH confidence from binary evidence. The specific transformation patterns described above are at MEDIUM confidence -- inferred from pipeline position (runs after NVVMReflect + NVVMIntrinsicLowering), NVVM IR semantics, and address space validation code, but the actual NVVMPeepholeOptimizerPass::run() body has not been individually decompiled. The factory sub_1CEF8F0 creates the pass object; the run method is dispatched through the object's vtable.

Cross-References