NVIDIA Custom Passes

25+ proprietary optimization passes not found in upstream LLVM. Registered into the New PM pipeline at sub_2342890 and into the pipeline assembler at sub_12E54A0.

Module-level custom: 16 passes
Function-level custom: 9 passes
Loop-level custom: 1 pass
Custom analyses: 2 analyses
Machine-level custom: 13 passes
Registration: sub_2342890 (New PM) + sub_12E54A0 (pipeline builder)
Dedicated deep-dive pages: 22

IR-Level Module Passes

| Pass Name | Class / Function | Size | Description |
| --- | --- | --- | --- |
| memory-space-opt | sub_1C70910 / sub_1CA2920 | cluster | Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: first-time, second-time, no-warnings, warnings |
| printf-lowering | sub_1CB1E60 | 31KB | Lowers printf to vprintf + local buffer. Validates format string is a literal. "vprintfBuffer.local", "bufIndexed" |
| nvvm-verify | sub_2C80C90 | 230KB | Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability |
| nvvm-pretreat | PretreatPass | | IR pre-treatment before optimization |
| check-kernel-functions | NVPTXSetFunctionLinkagesPass | | Kernel function linkage validation |
| check-gep-index | | | GEP index validation |
| cnp-launch-check | CNPLaunchCheckPass | | Cooperative launch validation |
| ipmsp | IPMSPPass | | Inter-procedural memory space propagation |
| nv-early-inliner | | | NVIDIA early inlining pass |
| nv-inline-must | InlineMustPass | | Force-inline functions marked __forceinline__ |
| select-kernels | SelectKernelsPass | | Kernel selection for compilation |
| set-global-array-alignment | | | Parameterized: modify-shared-mem, skip-shared-mem, modify-global-mem, skip-global-mem |
| lower-aggr-copies | | 72KB+58KB | Lower aggregate copies: struct splitting, memmove unrolling. Param: lower-aggr-func-args |
| lower-struct-args | | | Lower structure arguments. Param: opt-byval |
| process-restrict | | | Process __restrict__ annotations. Param: propagate-only |
| lower-ops | LowerOpsPass | | Lower special operations. Includes FP128/I128 emulation via 48 __nv_* library calls |
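
The printf-lowering entry describes rewriting a printf call into a vprintf call that receives a pointer to a local argument buffer. The sketch below models only the buffer packing, in Python. It assumes C-style default vararg promotion (integers to 32 bits, floats to 64-bit doubles) and natural alignment for each argument. The helper name pack_vprintf_args and the exact layout are illustrative assumptions, not something recovered from sub_1CB1E60.

```python
import struct

def pack_vprintf_args(args):
    """Pack promoted varargs into one byte buffer, each at its natural alignment.

    Hypothetical model: ints become 4-byte little-endian values and floats
    become 8-byte doubles. Returns (buffer, offsets).
    """
    buf = bytearray()
    offsets = []
    for arg in args:
        if isinstance(arg, int):
            fmt, size = "<i", 4
        elif isinstance(arg, float):
            fmt, size = "<d", 8
        else:
            raise TypeError(f"unsupported vararg type: {type(arg).__name__}")
        # Pad so the argument starts at a multiple of its own size.
        padding = (-len(buf)) % size
        buf.extend(b"\x00" * padding)
        offsets.append(len(buf))
        buf.extend(struct.pack(fmt, arg))
    return bytes(buf), offsets

# printf("%d %f %d", 7, 2.5, 3) would pass this buffer as vprintf's second argument.
buffer, offsets = pack_vprintf_args([7, 2.5, 3])
print(offsets, len(buffer))  # [0, 8, 16] 20
```

The 4 bytes of padding after the first integer show why the pass needs a typed buffer rather than a flat array of words.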

IR-Level Function Passes

| Pass Name | Function | Size | Description |
| --- | --- | --- | --- |
| branch-dist | sub_1C47810 cluster | | Branch distribution optimization. Knobs: branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm |
| nvvm-reflect | sub_1857160 | | Resolves __nvvm_reflect() calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls |
| nvvm-reflect-pp | | | NVVM reflect preprocessor |
| nvvm-intrinsic-lowering | sub_2C63FB0 | 140KB | Lowers llvm.nvvm.* intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline |
| nvvm-peephole-optimizer | | | NVVM-specific peephole optimizations |
| remat | sub_1CE7DD0 | 67KB | IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (75KB) |
| reuse-local-memory | | | Local memory reuse optimization |
| set-local-array-alignment | | | Set alignment for local arrays |
| sinking2 | | | NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass) |
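
To make the nvvm-reflect entry concrete, the following Python sketch folds reflect calls into constants. It models only the two query strings documented for upstream LLVM's NVVMReflect pass (__CUDA_FTZ and __CUDA_ARCH). Whether sub_1857160 recognizes further queries is not established on this page, and the tuple-based instruction format is a hypothetical stand-in for real IR.

```python
def resolve_nvvm_reflect(query, sm_version, ftz_enabled):
    """Return the integer a __nvvm_reflect(query) call would fold to.

    Assumption: only the upstream-documented queries are modeled.
    """
    if query == "__CUDA_FTZ":
        return 1 if ftz_enabled else 0
    if query == "__CUDA_ARCH":
        return sm_version * 10  # e.g. sm_75 -> 750
    return 0  # unknown queries fold to 0

def fold_reflect_calls(instructions, sm_version, ftz_enabled):
    """Replace ('reflect', dest, query) tuples with ('const', dest, value)."""
    folded = []
    for inst in instructions:
        if inst[0] == "reflect":
            _, dest, query = inst
            value = resolve_nvvm_reflect(query, sm_version, ftz_enabled)
            folded.append(("const", dest, value))
        else:
            folded.append(inst)
    return folded

body = [("reflect", "%ftz", "__CUDA_FTZ"), ("br_if", "%ftz", "fast", "precise")]
print(fold_reflect_calls(body, sm_version=75, ftz_enabled=True))
```

Once the call is a constant, later passes can delete the untaken branch, which is why rerunning the fold after each inlining round pays off.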

IR-Level Loop Pass

| Pass Name | Function | Size | Description |
| --- | --- | --- | --- |
| loop-index-split | sub_2CC5900 / sub_1C7B2C0 | 69KB | Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM) |
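
The effect of splitting a loop on an index condition can be shown with a small before/after pair. This Python sketch illustrates the general transformation (a branch on the induction variable becomes two loops with no branch). It is not a reconstruction of sub_2CC5900, and the function names are hypothetical.

```python
def original_loop(n, k):
    out = []
    for i in range(n):
        if i < k:  # the condition depends only on the loop index
            out.append(("low", i))
        else:
            out.append(("high", i))
    return out

def split_loop(n, k):
    out = []
    split = max(0, min(k, n))  # clamp the split point into [0, n]
    for i in range(0, split):  # iterations where i < k held
        out.append(("low", i))
    for i in range(split, n):  # remaining iterations
        out.append(("high", i))
    return out

print(original_loop(5, 2) == split_loop(5, 2))  # True
```

On a GPU the benefit is that neither resulting loop body contains the index-dependent branch.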

Custom Analyses

| Analysis Name | Purpose |
| --- | --- |
| rpa | Register Pressure Analysis — feeds into scheduling and rematerialization decisions |
| merge-sets | Merge set computation — used by coalescing and allocation |

Machine-Level Passes

| Pass Name | Function | Pass ID | Size | Description |
| --- | --- | --- | --- | --- |
| Block Remat | sub_2186D90 | nvptx-remat-block | 47KB | Two-phase candidate selection + iterative "pull-in" for register pressure reduction. "Max-Live-Function(", "Really Final Pull-in:" |
| Machine Mem2Reg | sub_21F9920 | nvptx-mem2reg | | Promotes __local_depot stack objects back to registers post-regalloc |
| MRPA | sub_2E5A4E0 | machine-rpa | 48KB | Machine Register Pressure Analysis — incremental tracking, not in upstream LLVM |
| LDG Transform | sub_21F2780 | ldgxform | | Transforms global loads to ldg.* (texture cache) for read-only data |
| GenericToNVVM | sub_215DC20 | generic-to-nvvm | 36KB | Moves globals from generic to global address space |
| Alloca Hoisting | sub_21BC7D0 | alloca-hoisting | | Ensures all allocas are in entry block (PTX requirement) |
| Image Optimizer | sub_21BCF10 | | | Optimizes texture/surface access patterns |
| NVPTX Peephole | sub_21DB090 | nvptx-peephole | | NVPTX-specific peephole optimization |
| Prolog/Epilog | sub_21DB5F0 | | | Custom frame management (PTX has no traditional prolog/epilog) |
| Replace Image Handles | sub_21DBEA0 | | | Replaces IR-level image handles with PTX texture/surface references |
| Extra MI Printer | sub_21E9E80 | extra-machineinstr-printer | | Register pressure statistics reporting |
| Valid Global Names | sub_21BCD80 | nvptx-assign-valid-global-names | | Sanitizes global names to valid PTX identifiers |
| NVVMIntrRange | sub_216F4B0 | nvvm-intr-range | | Adds !range metadata to NVVM intrinsics (e.g., tid.x bounds) |

Major Proprietary Subsystems

Dead Synchronization Elimination — sub_2C84BA0

| Field | Value |
| --- | --- |
| Size | 96KB |
| Purpose | Removes redundant __syncthreads() barriers |

Bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per BB through eight red-black tree maps. Each deletion triggers full restart. Distinct from lightweight basic-dbe. See dedicated page for full algorithm.
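
The delete-then-restart structure can be illustrated on a much simpler model. The Python sketch below works on a straight-line list of shared-memory operations and reduces the four access categories, which this page does not enumerate, to plain read and write. It ignores the CFG, loops, and the red-black tree maps entirely, so treat it as a simplified stand-in under those assumptions rather than the algorithm in sub_2C84BA0.

```python
def barrier_is_needed(before, after):
    """A barrier must stay if the segments on either side can conflict
    across threads: write/any (RAW, WAW) or read/write (WAR)."""
    return ("write" in before and bool(after)) or (
        "read" in before and "write" in after
    )

def eliminate_dead_barriers(ops):
    """ops is a list of 'read', 'write', or 'sync' in program order."""
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        syncs = [i for i, op in enumerate(ops) if op == "sync"]
        for idx, pos in enumerate(syncs):
            prev_sync = syncs[idx - 1] if idx > 0 else -1
            next_sync = syncs[idx + 1] if idx + 1 < len(syncs) else len(ops)
            before = ops[prev_sync + 1:pos]
            after = ops[pos + 1:next_sync]
            if not barrier_is_needed(before, after):
                del ops[pos]
                changed = True
                break  # segments merged, so recompute everything from scratch
    return ops

print(eliminate_dead_barriers(["write", "sync", "sync", "read", "sync", "read"]))
# ['write', 'sync', 'read', 'read']
```

Restarting after each deletion matters because removing a barrier merges two segments, which can change the verdict for its neighbors.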

MemorySpaceOpt — Multi-Function Cluster

| Function | Size | Purpose |
| --- | --- | --- |
| sub_1C70910 | | Pass entry point |
| sub_1C6A6C0 | | Pass variant |
| sub_1CA2920 | 32KB | Address space resolution — "Cannot tell what pointer points to, assuming global memory space" |
| sub_1CA9E90 | 28KB | Secondary resolver |
| sub_1CA5350 | 45KB | Infrastructure |
| sub_2CBBE90 | 71KB | Memory-space-specialized function cloning |
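
As a rough picture of what address space resolution involves, the Python sketch below propagates known spaces through pointer derivations until nothing changes. The data model (a dict of derivations), the rule that conflicting sources stay generic, and the point at which the warning string fires are all assumptions made for illustration. Only the warning text itself comes from sub_1CA2920.

```python
GENERIC = "generic"
WARNING = "Cannot tell what pointer points to, assuming global memory space"

def resolve_address_spaces(known, derived_from):
    """known: pointer -> address space for pointers with an obvious space.
    derived_from: pointer -> list of pointers it is computed from
    (GEP, cast, phi, select). Returns (spaces, warnings)."""
    spaces = dict(known)
    changed = True
    while changed:
        changed = False
        for ptr, sources in derived_from.items():
            if ptr in spaces:
                continue
            source_spaces = {spaces.get(src) for src in sources}
            if None in source_spaces:
                continue  # a source is still unresolved, retry next round
            # All sources agree -> inherit that space, otherwise stay generic.
            spaces[ptr] = source_spaces.pop() if len(source_spaces) == 1 else GENERIC
            changed = True
    warnings = []
    for ptr in derived_from:
        if ptr not in spaces:  # nothing could be inferred for this pointer
            spaces[ptr] = "global"
            warnings.append(f"{ptr}: {WARNING}")
    return spaces, warnings

known = {"smem": "shared", "gbuf": "global"}
derived = {"p": ["smem"], "q": ["p"], "r": ["p", "gbuf"], "s": ["arg0"]}
spaces, warnings = resolve_address_spaces(known, derived)
print(spaces["q"], spaces["r"], spaces["s"])  # shared generic global
```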

NV Rematerialization Cluster

| Function | Size | Role |
| --- | --- | --- |
| sub_1CE7DD0 | 67KB | Main driver — live-in/live-out analysis, skip decisions |
| sub_1CE67D0 | 32KB | Block-level executor — "remat_", "uclone_" prefixes |
| sub_1CE3AF0 | 56KB | Pull-in cost analysis — "Total pull-in cost = %d" |

NLO — Simplify Live Output

| Function | Size | Strings |
| --- | --- | --- |
| sub_1CE10B0 | 48KB | "Simplify Live Output", "nloNewBit", "newBit" |
| sub_1CDC1F0 | 35KB | "nloNewAdd", "nloNewBit" |

Creates new add/bit operations to simplify live-out values at block boundaries.

IV Demotion — sub_1CD74B0

| Field | Value |
| --- | --- |
| Size | 75KB |
| Strings | "phiNode", "demoteIV", "newInit", "newInc", "argBaseIV", "newBaseIV", "iv_base_clone_", "substIV" |

Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, clones IV chains for register pressure reduction. Sub-pass of rematerialization. See dedicated page for full algorithm.
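
One way to picture a 64-bit to 32-bit demotion is to split the wide IV into a loop-invariant 64-bit base plus a narrow IV that starts at zero. The Python sketch below checks that the narrow IV stays within signed 32-bit range and then builds such a split. The legality test and the field names are hypothetical. The actual conditions used by sub_1CD74B0 are on the dedicated page, not here.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def can_demote_to_i32(init, step, trip_count):
    """True if every value of the IV, including the final increment,
    fits in a signed 32-bit integer. The IV is linear, so checking the
    two endpoints is enough."""
    if trip_count <= 0:
        return INT32_MIN <= init <= INT32_MAX
    last = init + step * trip_count
    return all(INT32_MIN <= v <= INT32_MAX for v in (init, last))

def demote_iv(init, step, trip_count):
    """Split a wide IV into a 64-bit base and a 32-bit IV starting at 0."""
    if not can_demote_to_i32(0, step, trip_count):
        return None  # the narrow IV would overflow, keep the 64-bit IV
    return {"base": init, "narrow_init": 0, "narrow_step": step}

def iv_values(init, step, trip_count):
    return [init + step * i for i in range(trip_count)]

plan = demote_iv(init=2**40, step=8, trip_count=100)
narrow = iv_values(plan["narrow_init"], plan["narrow_step"], 100)
print([plan["base"] + v for v in narrow] == iv_values(2**40, 8, 100))  # True
```

Only the base needs a 64-bit register pair, and it is loop-invariant, which is where the register pressure saving comes from.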

RLMCAST — sub_2D13E90

| Field | Value |
| --- | --- |
| Size | 67KB |
| Purpose | Register-level multicast instruction lowering |

Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.

Texture Group Merge (.Tgm) — sub_2DDE8C0

Groups texture load operations to hide latency. Uses the .Tgm suffix in scheduling and a function pointer table (3 predicates) for grouping decisions.

NVVM Intrinsic Verifier — sub_2C7B6A0

| Field | Value |
| --- | --- |
| Size | 143KB |
| Purpose | Validates ALL NVVM intrinsics against SM capabilities |

Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (230KB total).

NVVM Intrinsic Lowering — sub_2C63FB0

| Field | Value |
| --- | --- |
| Size | 140KB |
| Purpose | Lowers NVVM intrinsics to concrete operations |

Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. Two levels (basic + barrier-aware), runs up to 10 times. See dedicated page for full dispatch table.

Base Address Strength Reduction — sub_2CA4A10

| Field | Value |
| --- | --- |
| Size | 58KB |
| Knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |

Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
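
The (anchor + relative_offset) rewrite can be sketched on a list of symbolic addresses. In the Python example below each access is a (base, constant offset) pair. Picking the smallest offset as the anchor is a placeholder choice made for illustration, since the real anchor selection algorithm is described on the dedicated page, and the anchor_ naming is hypothetical.

```python
from collections import defaultdict

def strength_reduce_addresses(accesses):
    """accesses: list of (base_name, constant_offset) seen in a loop body.
    Returns (anchors, rewritten): anchors maps base -> anchor offset, and
    rewritten lists each access as (anchor_or_base, relative_offset)."""
    offsets_by_base = defaultdict(list)
    for base, offset in accesses:
        offsets_by_base[base].append(offset)

    anchors = {}
    for base, offsets in offsets_by_base.items():
        if len(offsets) > 1:  # only worth hoisting when the base is shared
            anchors[base] = min(offsets)  # placeholder anchor choice

    rewritten = []
    for base, offset in accesses:
        if base in anchors:
            rewritten.append((f"anchor_{base}", offset - anchors[base]))
        else:
            rewritten.append((base, offset))
    return anchors, rewritten

accesses = [("A", 1024), ("A", 1028), ("B", 16), ("A", 1040)]
anchors, rewritten = strength_reduce_addresses(accesses)
print(anchors)    # {'A': 1024}
print(rewritten)  # [('anchor_A', 0), ('anchor_A', 4), ('B', 16), ('anchor_A', 16)]
```

After the rewrite the full address A + 1024 is computed once outside the loop, and each access inside adds only a small immediate.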

Common Base Elimination — sub_2CA8B00

| Field | Value |
| --- | --- |
| Size | 39KB |
| Purpose | Hoists shared base address expressions to dominating CFG points |

Operates at inter-block level (vs BASR intra-loop). The two passes form a complementary pair for comprehensive GPU address computation reduction. See dedicated page.

CSSA Transformation — sub_3720740

| Field | Value |
| --- | --- |
| Size | 22KB |
| Purpose | Conventional-SSA for GPU divergent control flow |
| Knobs | do-cssa, cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Debug | "IR Module before CSSA" |

Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.
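
Copy insertion for PHI nodes can be sketched with the textbook conversion to conventional SSA: each incoming value gets its own copy at the end of the predecessor block, and the PHI then reads the copies. The Python example below uses that textbook placement and a toy string-based IR. Where sub_3720740 actually places copies relative to reconvergence points is covered by the dedicated page, so the placement here is an assumption.

```python
def insert_phi_copies(blocks, phis):
    """blocks: block name -> list of non-terminator instruction strings.
    phis: list of (dest, [(pred_block, value), ...]).
    Returns new (blocks, phis) with one copy per incoming PHI operand."""
    new_blocks = {name: list(insts) for name, insts in blocks.items()}
    new_phis = []
    for dest, incoming in phis:
        new_incoming = []
        for pred, value in incoming:
            copy_name = f"{dest}.{pred}.copy"
            # The copy goes at the end of the predecessor, before its terminator.
            new_blocks[pred].append(f"{copy_name} = copy {value}")
            new_incoming.append((pred, copy_name))
        new_phis.append((dest, new_incoming))
    return new_blocks, new_phis

blocks = {"then": ["a = add x, 1"], "else": ["b = add x, 2"], "join": []}
phis = [("v", [("then", "a"), ("else", "b")])]
new_blocks, new_phis = insert_phi_copies(blocks, phis)
print(new_blocks["then"])  # ['a = add x, 1', 'v.then.copy = copy a']
print(new_phis)            # [('v', [('then', 'v.then.copy'), ('else', 'v.else.copy')])]
```

After this step every PHI operand has a live range private to its own edge, which is what lets a later coalescing step (compare the cssa-coalesce knob) merge them safely.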

NVIDIA Codegen Knobs — sub_1C20170

70+ knobs parsed from the NVVM container format:

Graphics Pipeline

VSIsVREnabled, VSIsLastVTGStage, EnableZeroCoverageKill, AllowComputeDerivatives, AllowDerivatives, EnableNonUniformQuadDerivatives, UsePIXBAR, ManageAPICallDepth

Compute / Memory

DisableSAMRAM, DoMMACoalescing, DisablePartialHalfVectorWrites, AssumeConvertMemoryToRegProfitable, MSTSForceOneCTAPerSMForSmemEmu, AddDepFromGlobalMembarToCB

Register Allocation / Scheduling

AdvancedRemat, CSSACoalescing, DisablePredication, DisableXBlockSched, ReorderCSE, ScheduleKils, NumNopsAtStart, DisableERRBARAfterMEMBAR

Type Promotion

PromoteHalf, PromoteFixed, FP16Mode, IgnoreRndFtzOnF32F16Conv, DisableLegalizeIntegers

PGO

PGOProfileKind, PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex

Knob Forwarding

OCGKnobs, OCGKnobsFile, NVVMKnobsString, OmegaKnobs, FinalizerKnobs

Compile Modes — sub_1C21CE0

| Mode | Constant |
| --- | --- |
| Whole-program no-ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI |
| Whole-program ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI |
| Separate ABI | NVVM_COMPILE_MODE_SEPARATE_ABI |
| Extensible WP ABI | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI |

| Opt Level | Constant |
| --- | --- |
| None | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 |
| 3 | NVVM_OPT_LEVEL_3 |

| Debug Info | Constant |
| --- | --- |
| None | NVVM_DEBUG_INFO_NONE |
| Line info | NVVM_DEBUG_INFO_LINE_INFO |
| Full DWARF | NVVM_DEBUG_INFO_DWARF |