Configuration Knobs

Three independent knob systems control compiler behavior: LLVM cl::opt flags (~1,496 unique), NVVMPassOptions (222 slots), and NVIDIA codegen knobs (~70).

| Item | Detail |
| --- | --- |
| LLVM cl::opt | 1,496 unique flags across 353 constructor files |
| NVVMPassOptions | 222 slots, initialized by sub_12D6300 (125KB) |
| Codegen knobs | ~70, parsed by sub_1C20170 / sub_CD9990 from NVVM container |
| BSS storage | 0x4F7FEA0 to 0x4FA5xxx (cl::opt), a1+0 to a1+4464 (PassOptions) |
| Dual PM | Same options registered for both Legacy PM (sub_C53080) and New PM (sub_16B8280) |
| NVIDIA-specific | 172 of 1,496 cl::opt flags (11.5%) are NVIDIA-added |

Knob System 1: LLVM cl::opt

Registration Pattern

Every cl::opt follows this initialization sequence in a global constructor:

// Legacy PM path
InterlockedExchangeAdd64(sub_C523C0(), 1);   // atomic option counter
sub_C53080(&option, "option-name", strlen);   // set name
sub_C53130(&option);                          // finalize registration
__cxa_atexit(destructor, &option, &dso_handle);

// New PM path (parallel registration)
InterlockedExchangeAdd64(&unk_4FA0230, 1);
sub_16B8280(&option, "option-name", strlen);
sub_16B88A0(&option);
__cxa_atexit(destructor, &option, &dso_handle);

Each cl::opt<T> occupies ~224 bytes (0xE0) in BSS. Top constructors by option count: ctor_600 (30), ctor_433 (25), ctor_472 (24), ctor_609 (22), ctor_392 (22).

Category 1: Scalar Optimization (InstCombine + FP)

Constructor: ctor_165_0 at 0x4D0500 (11,731 bytes). Registers 12 NVIDIA-specific flags plus 4 standard LLVM flags.

| Flag | Type | Default | BSS Addr | Purpose |
| --- | --- | --- | --- | --- |
| split-gep-chain | bool | false | 0x4F901A8 | Split GEP chains to independent GEPs for better address mode selection |
| Disable-Add-to-Or | bool | true | | Disable add-to-or transformation (NVIDIA blocks this LLVM combine) |
| opt-use-fast-math | bool | false | | Enable aggressive FP simplification (set by -unsafe-math / -fast-math) |
| opt-use-prec-div | bool | true | | Use precise division (set by -prec-div=1; cleared by -prec-div=0) |
| opt-no-signed-zeros | bool | false | | Ignore signed zero distinction (set by -no-signed-zeros) |
| disable-fp-cast-opt | bool | false | | Disable FP-to-int and int-to-FP cast optimizations |
| reorder-sext-before-cnst-add | bool | false | | sext(add(a,CI)) to add(sext(a),CI) rewrite; hidden flag |
| disable-sink | bool | false | | Disable instruction sinking in InstCombine |
| partial-sink | bool | false | | Enable partial sinking of instructions |
| nvptx-rsqrt-approx-opt | bool | false | | Enable reciprocal sqrt approximation optimization |
| disable-rsqrt-opt | bool | false | | Disable reciprocal sqrt optimization entirely |
| check-vn | bool | false | | Verify value numbers after transformations (debug) |

Standard LLVM flags in same constructor: expensive-combines (bool), instcombine-maxarray-size (int, default 1024), instcombine-visit (int), instcombine-lower-dbg-declare (bool).

Category 2: Inliner Heuristics

Constructor: ctor_186_0 at 0x4DBEC0 (14,109 bytes). Nine NVIDIA-specific flags governing the custom CGSCC inliner at sub_1864060.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| profuseinline | bool | false | Verbose inlining diagnostics (NVIDIA profuse framework, not PGO profuse) |
| inline-total-budget | int | none | Global total budget across all callers; unset = unlimited |
| nv-inline-all | bool | false | Force inline ALL function calls (used by OptiX ray tracing) |
| inline-budget | int | 20000 | Per-caller inlining cost budget; -aggressive-inline sets to 40000 |
| inline-adj-budget1 | int | none | Secondary adjusted per-caller budget |
| inline-switchctrl | int | none | Tune heuristic for switch-containing callees |
| inline-numswitchfunc | int | none | Threshold for switch-heavy function penalty |
| inline-maxswitchcases | int | none | Max switch cases before inlining penalty kicks in |
| disable-inlined-alloca-merging | bool | false | Disable post-inline alloca merging into single frame slot |
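
The budget defaults in the table above can be illustrated with a minimal sketch. The function names and the precedence order (explicit value first, then -aggressive-inline, then the default) are assumptions made for illustration; only the 20000 and 40000 values and the "unset = unlimited" rule for inline-total-budget come from the table.

```python
from typing import Optional

DEFAULT_INLINE_BUDGET = 20000      # inline-budget default from the table
AGGRESSIVE_INLINE_BUDGET = 40000   # value applied by -aggressive-inline


def effective_inline_budget(user_value: Optional[int] = None,
                            aggressive_inline: bool = False) -> int:
    """Hypothetical resolution order: explicit value, -aggressive-inline, default."""
    if user_value is not None:
        return user_value
    if aggressive_inline:
        return AGGRESSIVE_INLINE_BUDGET
    return DEFAULT_INLINE_BUDGET


def within_total_budget(spent: int, total_budget: Optional[int]) -> bool:
    """inline-total-budget is unset by default; None models that unlimited state."""
    return total_budget is None or spent <= total_budget
```

Modeling an unset knob as `None` keeps the "no limit" case distinct from a budget of zero.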

"none" means the knob is unset by default and the heuristic falls back to internal logic.

Category 3: GVN (Global Value Numbering)

Constructor: ctor_201 at 0x4E0990. Eleven knobs (8 NVIDIA-specific + 3 upstream).

| Flag | Type | Default | BSS Addr | Purpose |
| --- | --- | --- | --- | --- |
| profusegvn | bool | true | 0x4FAE7E0 | Verbose GVN diagnostics (unusually, defaults on) |
| gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree nodes; cache size = 32 |
| max-recurse-depth | int | 1000 | 0x4FAE620 | Max recursion during value numbering (safety valve for template-heavy code) |
| enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness: 0=off, 1=trivial only, 2=post-leader substitution |
| dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
| no-split-stores-below | int | -1 | 0x4FAEA80 | Min store width for splitting (bits); -1 = no limit |
| no-split-stores-above | int | -1 | 0x4FAE9A0 | Max store width for splitting (bits); -1 = no limit |
| split-stores | bool | true | 0x4FAE8C0 | Master enable for NVIDIA store-splitting in GVN |
| enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination (upstream LLVM) |
| enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE across edges (upstream LLVM) |
| enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow backedge splitting during load PRE (upstream LLVM) |

Store splitting uses a custom NVIDIA registrar (sub_190BE40) that takes a default-value pointer. Both limit knobs default to -1 = all sizes eligible.
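
The BSS addresses in the GVN table line up with the ~224-byte (0xE0) per-option footprint quoted in the Registration Pattern section. The short check below, using only addresses copied from the table, confirms that the eleven options sit back to back at a 0xE0 stride; the variable names are illustrative.

```python
CL_OPT_SIZE = 0xE0  # ~224 bytes per cl::opt<T>, per the Registration Pattern section

# BSS addresses copied from the GVN table.
GVN_BSS = {
    "max-recurse-depth": 0x4FAE620,
    "gvn-dom-cache": 0x4FAE700,
    "profusegvn": 0x4FAE7E0,
    "split-stores": 0x4FAE8C0,
    "no-split-stores-above": 0x4FAE9A0,
    "no-split-stores-below": 0x4FAEA80,
    "dump-phi-remove": 0x4FAEB60,
    "enable-phi-remove": 0x4FAEC40,
    "enable-split-backedge-in-load-pre": 0x4FAED20,
    "enable-load-pre": 0x4FAEE00,
    "enable-pre": 0x4FAEEE0,
}

addresses = sorted(GVN_BSS.values())
# Distance between each pair of neighbouring options.
strides = {hi - lo for lo, hi in zip(addresses, addresses[1:])}
print(strides == {CL_OPT_SIZE})
```

The same stride test is a quick way to sanity-check other recovered cl::opt addresses that come from a single constructor.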

Category 4: Loop Strength Reduction

Constructor: ctor_214_0 at 0x4E4B00. Eleven NVIDIA-specific LSR flags (69% NVIDIA customization rate).

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| disable-unknown-trip-lsr | bool | false | Disable LSR for loops with unknown trip count |
| lsr-check-rp | bool | true [MEDIUM] | Check register pressure before applying LSR |
| lsr-rp-limit | int | ~32-64 [LOW] | Skip LSR entirely when RP exceeds this limit (occupancy cliff) |
| filter-bad-formula | bool | true [MEDIUM] | Filter out poor-quality LSR formulae early |
| do-lsr-64-bit | bool | arch-dependent | Enable 64-bit loop strength reduction (false on sm_3x-5x, true on sm_70+) |
| count-sxt-opt-for-reg-pressure | bool | true [MEDIUM] | Factor sign-extension elimination savings into RP analysis |
| lsr-sxtopt | bool | true [MEDIUM] | Perform sign-extension elimination within LSR |
| lsr-loop-level | int | 0 | Apply LSR only at specific loop nesting level (0 = all levels) |
| lsr-skip-outer-loop | bool | false | Ignore outer-loop induction variables in LSR |
| disable-lsr-for-sharedmem32-ptr | bool | false | Disable LSR for 32-bit shared memory pointers (GPU-specific) |
| disable-lsr-complexity-discount | bool | false | Disable complexity estimation discount heuristic |

Standard LLVM LSR flags in same constructor: enable-lsr-phielim, lsr-insns-cost, lsr-exp-narrow, lsr-filter-same-scaled-reg, lsr-fix-iv-inc.

Category 5: IndVarSimplify

Constructor: ctor_203_0 at 0x4E1CD0 (7,007 bytes).

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| Disable-unknown-trip-iv | bool | false | Disable IV substitution for unknown-trip-count loops |
| iv-loop-level | int | none | Control which loop nesting levels get IV substitution |

Category 6: SimplifyCFG

Constructor: ctor_243_0 at 0x4ED0C0.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| disable-jump-threading | bool | false | Disable jump threading (for OCG experiments) |
| fold-with-var-cond | bool | false | Fold branches with variance-based conditions |

Category 7: NVPTX Backend Math/Scheduling

Constructor: ctor_607 at 0x584B60 (13,700 bytes). Core numeric precision and FMA controls. Defaults are set by the CLI flag routing in sub_9624D0, not by the cl::opt constructors.

| Flag | Type | CLI Default | Purpose |
| --- | --- | --- | --- |
| nvptx-sched4reg | bool | false | Schedule for register pressure (key NVPTX strategy) |
| nvptx-fma-level | int | 1 | FMA contraction: 0=off, 1=on, 2=aggressive. CLI -fma=1 is default |
| nvptx-prec-divf32 | int | 1 | F32 div precision: 0=approx, 1=full, 2=IEEE rnd+ftz, 3=IEEE no-ftz |
| nvptx-prec-sqrtf32 | int | 1 | Sqrt precision: 0=approx, 1=rn. CLI -prec-sqrt=1 is default |
| nvptx-approx-log2f32 | bool | false | Use lg2.approx for log2 (only set by -unsafe-math) |
| nvptx-force-min-byval-param-align | bool | false | Force 4-byte minimum alignment for byval parameters |
| nvptx-normalize-select | bool | false | Override shouldNormalizeToSelectSequence in TLI |
| enable-bfi64 | bool | false | Enable 64-bit BFI (bit-field insert) instructions |

Note: These cl::opt knobs have no explicit default in their constructor (they initialize to 0/false). The effective defaults come from the CLI flag catalog: -fma=1 routes to -nvptx-fma-level=1, -prec-div=1 routes to -nvptx-prec-divf32=1, and -prec-sqrt=1 routes to -nvptx-prec-sqrtf32=1.

Category 8: NVPTX Backend Passes/Features

Constructor: ctor_609_0 at 0x585D30 (22 options total, largest NVPTX constructor).

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
| disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement (transitional) |
| nvptx-short-ptr | bool | false | 32-bit pointers for const/local/shared address spaces |
| nvptx-enable-machine-sink | bool | false | Enable machine-level instruction sinking |
| enable-new-nvvm-remat | bool | true | Enable new NVVM rematerialization engine |
| nv-disable-remat | bool | false | Disable all rematerialization passes |
| nv-disable-mem2reg | bool | false | Disable machine-IR mem2reg promotion |
| nv-disable-scev-cgp | bool | true | Disable SCEV-based address mode optimization (on = disabled) |
| nvptx-32-bit-smem | bool | false | Use 32-bit pointers for shared address space |
| nvptx-exit-on-unreachable | bool | true | Lower unreachable as PTX exit instruction |
| nvptx-early-byval-copy | bool | false | Create copy of byval function args in entry block |
| enable-nvvm-peephole | bool | true | Enable NVVM peephole optimizer |
| no-reg-target-nvptxremat | bool | false | Only run old remat on kernels without register targets |
| lower-func-args | bool | true | Lower large aggregate function parameters to copies |
| enable-sink | bool | true | Enable LLVM sinking pass |
| disable-post-opt | bool | false | Disable IR optimizations in post-opt phase |
| usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |
| ldg | bool | true | Load-via-texture (ld.global.nc) constant transform |

Category 9: NVPTX Backend Extended

Constructor: ctor_610 at 0x5888A0 (7,400 bytes).

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| unroll-assumed-size | int | 4 | Assumed element count for unknown-size local arrays during unroll analysis |
| enable-loop-peeling | bool | false | Enable loop peeling transformation |
| enable-256-bit-load-store | bool | false | Enable 256-bit (32-byte) vector load/store generation |
| ias-param-always-point-to-global | bool | false | Assume function parameter pointers always point to global memory |
| ias-strong-global-assumptions | bool | false | Stronger assumption: constant-buffer pointers resolve to globals |
| ias-wmma-memory-space-opt | bool | false | Enable MemorySpaceOpt specialization for WMMA/tensor operations |

Category 10: Memory Space Optimization

Scattered across ctor_264, ctor_267_0, ctor_528, ctor_531_0. See MemorySpaceOpt and IPMSP for the full algorithm.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| mem-space-alg | int | 2 | Switch between MSO algorithm variants |
| dump-ir-before-memory-space-opt | bool | false | Dump IR before MSO |
| dump-ir-after-memory-space-opt | bool | false | Dump IR after MSO |
| track-indir-load | bool | true | Track indirect loads during MSO dataflow |
| track-int2ptr | bool | true | Track IntToPtr casts in MSO |
| param-always-point-to-global | bool | true | Kernel parameter pointers always point to global memory |
| devicefn-param-always-local | bool | false | Treat parameter space as local in device functions |
| ignore-address-space-check | bool | false | Ignore address-space checks during branch distribution |
| sink-into-texture | int | 3 | Sink loads into texture blocks: 0=off, 1=cross-block, 2=+intra, 3=+outside-only. See also Category 14 |
| ldg | bool | true | Load Global Constant Transform (ld.global.nc) |
| do-clone-for-ip-msp | int | -1 | Function cloning limit for IP-MSP (-1 = unlimited, 0 = disable) |
| dump-ip-msp | bool | false | Dump interprocedural MSP info |
| lower-read-only-devicefn-byval | bool | false | Handle byval attribute of args to read-only device functions |
| reuse-lmem-very-long-live-range | int | | Threshold for very-long live range in local memory reuse |
| hoist-load-param | bool | false | Generate all ld.param in entry basic block |
| sink-ld-param | bool | false | Sink one-use ld.param to use point |
| process-alloca-always | bool | true | Treat alloca as definite local (AS 5) regardless of context |
| wmma-memory-space-opt | bool | true | Enable memory space optimization for WMMA operations |
| strong-global-assumptions | bool | true | Assume const buffer pointers always point to globals |
| process-builtin-assume | bool | | Process __builtin_assume(__is*(p)) assertions for space deduction |

Category 11: Rematerialization

Scattered across ctor_609_0, ctor_362, ctor_277_0, ctor_361_0, and others. See Rematerialization for full algorithm detail.

IR-Level Knobs (ctor_277_0 at 0x4F7BE0)

| Flag | Type | Default | Global | Purpose |
| --- | --- | --- | --- | --- |
| do-remat | int | 3 | dword_4FC05C0 | Master control. 0=off, 1=conservative, 2=normal, 3=full |
| no-remat | string | (empty) | qword_4FC0440 | Comma-separated function exclusion list |
| remat-iv | int | 4 | dword_4FBFB40 | IV demotion level. 0=off, 4=full |
| remat-load | int | 1 | dword_4FBFA60 | Load rematerialization. 0=off, 1=on |
| remat-add | int | 0 | dword_4FBF980 | Add/GEP factoring. 0=off |
| remat-single-cost-limit | int | 6000 | dword_4FC0080 | Max cost per single live-in reduction |
| remat-loop-trip | int | 20 | dword_4FBFFA0 | Default assumed loop trip count |
| remat-gep-cost | int | 6000 | dword_4FBFEC0 | Max cost for GEP rematerialization |
| remat-use-limit | int | 10 | dword_4FBFDE0 | Max number of uses for a candidate |
| remat-max-live-limit | int | 10 | dword_4FBFD00 | Max live-in limit for rematerialization |
| remat-maxreg-ceiling | int | 0 | dword_4FBF600 | Register ceiling (0 = uncapped) |
| remat-for-occ | int | 120 | dword_4FBF8A0 | Occupancy-driven rematerialization target |
| remat-lli-factor | int | 10 | dword_4FC0320 | Long-latency instruction cost factor |
| remat-ignore-single-cost | bool | false | byte_4FBFC20 | Bypass per-value cost filter |
| remat-move | bool | false | byte_4FC0400 | Remat move instructions |
| simplify-live-out | int | 2 | dword_4FBF520 | NLO level. 0=off, 2=full |
| dump-remat | int | 0 | dword_4FC0240 | Debug dump level (0-4+) |
| dump-remat-iv | int | 0 | dword_4FC0160 | IV remat debug dump |
| dump-remat-load | int | 0 | dword_4FBF720 | Load remat debug dump |
| dump-remat-add | int | 0 | dword_4FBF640 | Add remat debug dump |
| dump-simplify-live-out | bool | false | byte_4FBF400 | NLO debug dump |

Machine-Level Knobs (ctor_361_0 at 0x5108E0)

| Flag | Type | Default | Global | Purpose |
| --- | --- | --- | --- | --- |
| nv-remat-block | int | 14 | dword_4FD3820 | Bitmask controlling remat modes (bits 0-3) |
| nv-remat-max-times | int | 10 | dword_4FD3740 | Max outer loop iterations |
| nv-remat-block-single-cost | int | 10 | dword_4FD3660 | Max cost per single live value pull-in |
| nv-remat-block-map-size-limit | int | 6 | dword_4FD3580 | Map size limit for single pull-in |
| nv-remat-block-max-cost | int | 100 | dword_4FD3040 | Max total clone cost per live value reduction |
| nv-remat-block-liveout-min-percentage | int | 70 | dword_4FD3120 | Min liveout % for special consideration |
| nv-remat-block-loop-cost-factor | int | 20 | unk_4FD3400 | Loop cost multiplier |
| nv-remat-default-max-reg | int | 70 | unk_4FD3320 | Default max register pressure target |
| nv-remat-block-load-cost | int | 10 | unk_4FD2EC0 | Cost assigned to load instructions |
| nv-remat-threshold-for-spec-reg | int | 20 | unk_4FD3860 | Threshold for special register remat |
| nv-dump-remat-block | bool | false | byte_4FD2E80 | Debug dump toggle |
| nv-remat-check-internal-live | bool | false | byte_4FD2DA0 | Check internal liveness during MaxLive |
| max-reg-kind | int | 0 | qword_4FD2C20 | Kind of max register pressure info |
| no-mi-remat | string | (empty) | qword_4FD2BE0 | Skip machine-level remat for named functions |
| load-remat | bool | true | word_4FD32F0 | Enable load rematerialization |
| vasp-fix1 | bool | false | word_4FD3210 | VASP fix for volatile/addsp |
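
Because nv-remat-block is a bitmask over bits 0-3, its default of 14 is easier to read in binary. The helper below is a small sketch that lists which bit positions are set; the function name is illustrative, and the meaning of each individual bit is not documented here, so none is assumed.

```python
from typing import List

NV_REMAT_BLOCK_DEFAULT = 14  # default from the machine-level table


def set_bits(mask: int, width: int = 4) -> List[int]:
    """Return the indices of the set bits among bits 0..width-1."""
    return [bit for bit in range(width) if mask & (1 << bit)]


print(bin(NV_REMAT_BLOCK_DEFAULT), set_bits(NV_REMAT_BLOCK_DEFAULT))
# prints: 0b1110 [1, 2, 3]
```

In other words, the default enables the modes behind bits 1 through 3 and leaves bit 0 clear.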

General Remat Knobs (ctor_609_0, ctor_362, and others)

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| nv-disable-remat | bool | false | Disable all remat passes |
| enable-new-nvvm-remat | bool | true | Enable new NVVM remat engine (disables old) |
| no-reg-target-nvptxremat | bool | false | Only old remat for kernels without register targets |
| fp-remat | bool | false | Allow rematerializing floating-point instructions |
| high-cost-remat | bool | false | Allow rematerializing high-cost instructions |
| cost-threshold-remat | int | | Cost threshold per remat action |
| block-freq-cap-remat | int | | Maximum raw block frequency value |
| block-freq-norm-range-remat | int | | Normalization range for block frequency in remat cost |
| collect-candidate-scale-remat | int | | Scaling ratio for high-RP candidate collection |
| incremental-update-remat | bool | false | Incrementally update RP analysis after each remat |
| verify-update-remat | bool | false | Debug: verify incremental update vs full analysis |
| print-verify-remat | bool | false | Debug: print problematic RP on verification failure |
| rp-remat | int | | Debug: set a target register pressure number |
| late-remat-update-threshold | int | | Threshold for copy with many other copy uses |
| remat-load-param | bool | false | Support rematerializing constant ld.param not in NVVM IR |

Category 12: SCEV-CGP (Address Mode Optimization)

Eleven NVIDIA-specific knobs for SCEV-based CodeGenPrepare. See CodeGenPrepare.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| do-scev-cgp | bool | false | Enable SCEV-based CodeGenPrepare |
| do-scev-cgp-aggresively | bool | false | Aggressive SCEV-CGP mode [sic] |
| do-function-scev-cgp | bool | false | Function-level SCEV-CGP |
| nv-disable-scev-cgp | bool | true | Disable SCEV address mode optimization (master kill switch, on by default) |
| scev-cgp-control | int | | Control max transformations applied |
| scev-cgp-cross-block-limit | int | | Max common-base expressions from a single block |
| scev-cgp-idom-level-limit | int | | Max dominator tree levels to walk |
| scev-cgp-inst-limit | int | | Max instructions for a single parameter |
| scev-cgp-old-base | bool | false | Force SCEV-CGP to create new base (vs reusing old) |
| scev-cgp-tid-max-value | int | | Max value of thread ID in SCEV expressions |
| print-after-scev-cgp | bool | false | Print function after SCEV-CGP phase |

Category 13: Branch Distribution

Seven NVIDIA-specific flags; the table below lists six of them.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| branch-dist-block-limit | int | | Max blocks to apply branch distribution |
| branch-dist-func-limit | int | | Max functions to apply branch distribution |
| branch-dist-norm | int | | Normalization control |
| no-branch-dist | string | | Comma-separated list of functions to skip |
| disable-complex-branch-dist | bool | false | Disable complex branch distribution |
| dump-branch-dist | bool | false | Dump branch distribution info |

Category 14: Sinking / Code Motion

Thirteen knobs across multiple constructors; eleven are listed below. See Sinking2 for the NVIDIA-custom texture-aware sinking pass.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| sink-into-texture | int | 3 | Texture sinking aggressiveness: 0=off, 1=cross-block, 2=+intra, 3=+outside-only |
| sink-limit | int | 20 | Max instructions to sink per Sinking2 invocation (complexity limiter) |
| dump-sink2 | bool | false | Debug dump for Sinking2 pass |
| sink-check-sched | bool | true | Check scheduling effects of sinking (stock Sink) |
| sink-single-only | bool | true | Only sink single-use instructions (stock Sink) |
| enable-andcmp-sinking | bool | false | Sink and/cmp sequences into branches |
| aggressive-no-sink | bool | false | Sink all generated instructions |
| max-uses-for-sinking | int | | Don't sink instructions with too many uses |
| rp-aware-sink | bool | false | Consider register pressure impact when sinking |
| instcombine-code-sinking | bool | false | Enable code sinking within InstCombine |
| hoist-const-stores | bool | false | Hoist loop-invariant stores |

Category 15: Register Pressure / Allocation

NVIDIA-specific knobs plus LLVM greedy allocator knobs. See Register Allocation for the full algorithm.

NVIDIA RP Knobs

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| maxreg | int | none | Maximum register count (--maxrregcount equivalent) |
| register-usage-level | int | | Register usage level control |
| cta-reconfig-aware-mrpa | bool | false | CTA reconfiguration-aware machine RP analysis |
| cta-reconfig-aware-rpa | bool | false | CTA reconfiguration-aware RP analysis |
| pred-aware-mcse | bool | false | Predicate-aware MachineCSE |
| rp-aware-mcse | bool | false | Register-pressure-aware MachineCSE |
| verify-update-mcse | bool | false | Debug: verify incremental RP update in MachineCSE |
| incremental-update-mcse | bool | true | Incrementally update register pressure analysis in MachineCSE |
| print-verify | bool | false | Print problematic RP info if MCSE verification fails |
| pred-target-adjust | int | 0 | Predicate register target adjustment (-10 to +10) |
| donot-insert-dup-copies | bool | false | Skip duplicate copies to predecessor basic block |
| nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |

LLVM Greedy Allocator Knobs

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| split-spill-mode | int | 1 | Spill mode: 0=default, 1=size, 2=speed |
| lcr-max-depth | int | 5 | Last chance recoloring max recursion depth |
| lcr-max-interf | int | 8 | Last chance recoloring max interferences |
| exhaustive-register-search | bool | false | Bypass LCR depth/interference cutoffs |
| enable-deferred-spilling | bool | false | Defer spill code to end of allocation |
| grow-region-complexity-budget | int | 10000 | growRegion() edge budget for live range splitting |
| split-threshold-for-reg-with-hint | int | 75 | Split threshold percentage for hinted registers |

Category 16: Restrict / Aliasing

Five NVIDIA-specific flags. See Alias Analysis.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| process-restrict | bool | false | Process __restrict__ keyword for alias analysis |
| allow-restrict-in-struct | bool | false | Allow __restrict__ inside struct members |
| apply-multi-level-restrict | bool | false | Apply restrict to all pointer levels |
| dump-process-restrict | bool | false | Debug dump during restrict processing |
| strict-aliasing | bool | false | Datatype-based strict aliasing |

Category 17: CSSA / deSSA

Four knobs. See CSSA.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| cssa-coalesce | int | | Control PHI operand coalescing strategy |
| cssa-verbosity | int | 0 | Verbosity level |
| dump-before-cssa | bool | false | Dump specific PHI operands being coalesced |
| usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |

Category 18: Loop / Unrolling

Eight NVIDIA-specific knobs (beyond the 20+ standard LLVM loop-unrolling flags).

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| nv-disable-loop-unrolling | bool | false | Disable loop unrolling in all passes |
| aggressive-runtime-unrolling | bool | false | OCG-style unrolling heuristics |
| aggressive-runtime-unrolling-fixed-factor | int | | Force fixed unroll factor |
| aggressive-runtime-unrolling-max-factor | int | | Maximum unroll factor |
| aggressive-runtime-unrolling-max-filler-instructions-per-batch | int | | Max filler instructions |
| unroll-runtime-nv-expensive | bool | false | NVIDIA heuristics for expensive loops |
| unroll-runtime-convergent | bool | false | Allow unrolling with convergent instructions |
| track-trip-count-more | bool | false | Track loop trip count more aggressively |

Category 19: GEP / Address Strength Reduction

Eight NVIDIA-specific knobs. See Base Address Strength Reduction.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| normalize-gep | bool | false | Normalize 64-bit GEP subscripts |
| dump-normalize-gep | bool | false | Debug dump for GEP normalization |
| do-base-address-strength-reduce | int | 0 | Two levels: 1=unconditional, 2=with conditions |
| dump-base-address-strength-reduce | bool | false | Debug dump |
| do-lsr-64-bit | bool | false | Loop strength reduction for 64-bit (shared with LSR; Category 4 lists the default as arch-dependent) |
| do-sign-ext-expand | bool | false | Expand sign-extension during SCEV build |
| balance-dot-chain | bool | false | Balance chain of dot operations |
| special-reassociate-for-threadid | bool | false | Don't move back expressions containing thread ID |

Category 20: Aggregate / Byval Lowering

Ten knobs.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| aggressive-max-aggr-lower-size | int | | Threshold size for lowering aggregates |
| aggressive-lsv | bool | false | Merge smaller dtypes in aggregate before vectorization |
| vect-split-aggr | bool | false | Split aggregates before vectorization |
| lower-aggr-unrolled-stores-limit | int | | Limit stores in unrolled aggregate lowering |
| large-aggr-store-limit | int | | Create loops for aggregate store exceeding limit |
| lower-func-args | bool | true | Lower large aggregate function parameters |
| lsa-opt | bool | false | Optimize copying of struct args to local memory |
| skiploweraggcopysafechk | bool | false | Skip safety check in loweraggcopy |
| memdep-cache-byval-loads | bool | true | Preprocess byval loads to reduce compile time |
| ldstmemcpy-glue-max | int | | Limit for gluing ld/st of memcpy |

Category 21: Normalization / Canonicalization

Four knobs.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| norm-fold-all | bool | false | Fold all regular instructions |
| norm-preserve-order | bool | false | Preserve original instruction order |
| norm-rename-all | bool | false | Rename all instructions |
| norm-reorder-operands | bool | false | Sort/reorder operands in commutative operations |

Category 22: NVVM Infrastructure

Five knobs.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| nvvm-lower-printf | bool | false | Enable printf lowering |
| nvvm-reflect-enable | bool | true | NVVM reflection (reads __CUDA_FTZ, __CUDA_PREC_DIV, etc.) |
| nvvm-verify-show-info | bool | false | Info messages during NVVM verification |
| enable-nvvm-peephole | bool | true | NVVM peephole optimizer |
| nv-ocl | bool | false | Deprecated OpenCL compatibility flag |

Category 23: Compilation Control

Constructor: ctor_043_0 at 0x48D7F0 + ctor_028_0 at 0x489160.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| debug-compile | bool | false | Compile for debugging (set by -g) |
| generate-line-info | bool | false | Emit line info even without -G |
| nvptx-f32ftz | bool | false | Flush f32 subnormals to zero; hidden |
| w | bool | false | Disable warnings; hidden |
| Werror | bool | false | Treat warnings as errors; hidden |
| Osize | bool | false | Optimize for code size; hidden |
| Om | bool | false | Maximum optimization mode; hidden |
| maxreg | int | none | Maximum register count (no limit if unset) |
| nvptx-nan | bool | false | NaN handling control; hidden |
| jump-table-density | int | 10 | Minimum density (%) for jump table lowering |
| pass-control | int | -1 | Disable all optional passes after pass N; -1 = no limit |
| disable-passno | list | empty | Disable pass(es) by number (comma-separated) |
| sep-comp | bool | false | Separate compilation mode |
| proffile | string | | Filename for PGO profile information |
| R | string | | Resource constraint: name=<int> format |
| lnk-disable-allopts | bool | false | Disable all linker optimization passes |
| disable-peephole | bool | false | Disable peephole optimizer |
| disable-early-taildup | bool | false | Disable pre-regalloc tail duplication |
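
The interaction between pass-control and disable-passno can be sketched as a simple predicate. Everything below is a hypothetical model: the function names are invented for illustration, and treating pass N itself as still enabled under pass-control=N is an assumption about where the "after pass N" boundary falls.

```python
from typing import FrozenSet, Set


def parse_disable_passno(value: str) -> Set[int]:
    """Parse a comma-separated list of pass numbers; an empty string disables nothing."""
    return {int(tok) for tok in value.split(",") if tok.strip()}


def pass_enabled(pass_number: int,
                 pass_control: int = -1,
                 disabled: FrozenSet[int] = frozenset()) -> bool:
    """Hypothetical gate for one optional pass.

    pass_control == -1 means no limit (the documented default); otherwise
    passes numbered above pass_control are skipped (assumed boundary).
    """
    if pass_number in disabled:
        return False
    if pass_control != -1 and pass_number > pass_control:
        return False
    return True
```

A bisection script could, for example, lower `pass_control` step by step to find the first optional pass that changes the generated code.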

Category 24: Divergence / GPU Execution

Three flags.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| spec-exec-only-if-divergent-target | bool | false | Speculative execution only when target is divergent |
| prefer-predicated-reduction-select | bool | false | Prefer predicated reduction over after-loop select |
| openmp-opt-disable-barrier-elimination | bool | false | Disable OpenMP barrier elimination |

Category 25: MachinePipeliner (Swing Modulo Scheduling)

Eighteen LLVM-origin knobs for software pipelining. See Scheduling for the full algorithm.

| Flag | Type | Default | Global | Purpose |
| --- | --- | --- | --- | --- |
| enable-pipeliner | bool | true | unk_503EE20 | Master switch for SMS |
| enable-pipeliner-opt-size | bool | false | qword_503ED40 | Enable SWP at -Os |
| pipeliner-max-mii | int | 27 | qword_503ECE8 | Maximum allowed MII |
| pipeliner-force-ii | int | 0 | qword_503EB80 | Force specific II (0 = auto) |
| pipeliner-max-stages | int | 3 | qword_503EB28 | Maximum pipeline stages |
| pipeliner-prune-deps | bool | true | qword_503E9C0 | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | bool | true | qword_503E8E0 | Prune loop-carried order deps |
| pipeliner-ignore-recmii | bool | false | qword_503E888 | Ignore RecMII; hidden |
| pipeliner-show-mask | bool | false | qword_503E720 | Debug: show scheduling mask |
| pipeliner-dbg-res | bool | false | qword_503E640 | Debug: resource usage |
| pipeliner-annotate-for-testing | bool | false | qword_503E5E8 | Annotate instead of codegen |
| pipeliner-experimental-cg | bool | false | qword_503E508 | Use peeling code generator |
| pipeliner-ii-search-range | int | 10 | qword_503E3A0 | Range to search for II |
| pipeliner-register-pressure | bool | false | qword_503E2C0 | Consider register pressure |
| pipeliner-register-pressure-margin | int | 5 | qword_503E1E0 | Margin % for reg pressure limit |
| pipeliner-mve-cg | bool | true | unk_503E100 | Use MVE code generator |
| pipeliner-enable-copytophi | bool | true | qword_503E020 | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | int | 0 | qword_503DF40 | Force issue width (0 = auto) |

Category 26: LLVM Standard Inliner (Model B)

Seventeen LLVM-origin knobs from ctor_625_0 / ctor_715_0 at 0x58FAD0. These control the upstream InlineCostAnalysis::analyzeCall path; see Inliner Cost Model for why the NVIDIA custom model (Category 2) dominates in practice.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| inline-threshold | int | 225 | Base inlining threshold |
| inlinedefault-threshold | int | 225 | Default when no hint/profile |
| inlinehint-threshold | int | 325 | Threshold for callees carrying the inlinehint attribute |
| inline-cold-callsite-threshold | int | 45 | Threshold for cold callsites |
| inlinecold-threshold | int | 45 | Threshold for functions with cold attribute |
| hot-callsite-threshold | int | 3000 | Threshold for hot callsites (PGO) |
| locally-hot-callsite-threshold | int | 525 | Threshold for locally hot callsites |
| inline-instr-cost | int | 5 | Cost per instruction |
| inline-call-penalty | int | 25 | Penalty per callsite in callee |
| inline-memaccess-cost | int | 0 | Cost per load/store |
| inline-savings-multiplier | int | 8 | Multiplier for cycle savings |
| inline-savings-profitable-multiplier | int | 4 | Multiplier for profitability check |
| inline-size-allowance | int | 100 | Max callee size inlined without savings proof |
| inline-cost-full | bool | false | Compute full cost even when over threshold |
| inline-enable-cost-benefit-analysis | bool | false | Enable cost-benefit analysis |
| inline-deferral | bool | | Defer inlining in cold paths (PGO) |
| inline-remark-attribute | bool | false | Emit inline remarks |

Category 27: New PM CGSCC Inliner (Model C)

Two knobs for the New Pass Manager CGSCC inliner at 0x2613930. See Inliner Cost Model.

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| function-inline-cost-multiplier | int | | Penalize recursive function inlining |
| enable-ml-inliner | enum | default | ML advisory mode: default, development, release |

Knob System 2: NVVMPassOptions

222 pass option slots initialized by sub_12D6300 (125KB). Each slot is accessed by integer index (1 to 221) and stored in a ~4,480-byte struct.

Access Functions

| Function | Purpose |
| --- | --- |
| sub_12D6170(base+120, index) | Fetch pass option descriptor by index |
| sub_1691920(base+8, index) | Fetch pass option value from table |
| sub_12D6090(a1+offset, ...) | Store string-typed option |
| sub_12D6100(a1+offset, ...) | Store integer-typed option |
| sub_12D6240(a1, index, "0") | Get option with default value |

See NVVMPassOptions for the complete 222-slot inventory.

Knob System 3: NVIDIA Codegen Knobs

Parsed from the NVVM container format by sub_1C20170 and sub_CD9990. See NVIDIA Custom Passes for the complete inventory.

Hidden / Obfuscated Flags

Obfuscated Flag (ctor_043_0 at ~0x48EE80)

A 4-byte CLI flag name computed via XOR-based obfuscation from unk_3F6F7C7:

v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));

Stored at qword_4F857C0 with flag bits 0x87 | 0x38 = hidden + really-hidden. NVIDIA deliberately hides this option from static analysis using FNV-1a-like constants.
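
The decompiled expression can be modeled in a few lines. Two facts are certain: 0x811C9DC5 is the 32-bit FNV-1a offset basis, and -109 truncated to a byte is 0x93, the low byte of the 32-bit FNV prime 0x01000193. The rest is a sketch under assumptions: that `offset` counts bytes from 0, that the result is truncated to one byte, and that the same transform is applied to each of the four bytes. The input `b"demo"` is a placeholder, not the real flag name.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # constant taken from the decompiled expression


def key_byte(offset: int) -> int:
    """Per-byte XOR key, assuming the product is truncated to 8 bits."""
    return (-109 * ((offset + 97) ^ FNV_OFFSET_BASIS)) & 0xFF


def xor_transform(data: bytes) -> bytes:
    """XOR each byte with its positional key; applying it twice restores the input."""
    return bytes(b ^ key_byte(i) for i, b in enumerate(data))


placeholder = b"demo"                 # hypothetical 4-byte name, not the hidden flag
encoded = xor_transform(placeholder)
decoded = xor_transform(encoded)
print(decoded == placeholder)
```

Because XOR is its own inverse, the same routine serves for both encoding and decoding once the stored bytes at unk_3F6F7C7 are known.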

Environment Variable Backdoors

| Variable | Purpose | Location |
| --- | --- | --- |
| NVVMCCWIZ | Wizard mode (value 553282); unlocks -v, -keep, -dryrun, -lgenfe, -opt, -llc, -lnk, -libnvvm | sub_8F9C90 |
| bar | Extended debug pass registration | ctor_107_0 at 0x4A64D0 |
| NVVM_IR_VER_CHK | Override IR version check (set to "0" to disable) | sub_12BFF60 |
| LLVM_OVERRIDE_PRODUCER | Override bitcode producer string (default "7.0.1") | ctor_154 at 0x4CE640 |
| MALLOC_CONF | jemalloc allocator tuning | sub_12FCDB0 |
| LIBNVVM_DISABLE_CONCURRENT_API | Force single-threaded NVVM compilation | ctor_104 at 0x4A5810 |
CLI Defaults Set by Flag Routing

These are effective defaults applied by the flag catalog parser (sub_9624D0), not by cl::opt constructors. When no user flag is specified, the parser injects these:

| CLI flag | Default value | Routed cl::opt |
| --- | --- | --- |
| -arch=compute_<N> | compute_75 (SM 75, Turing) | target architecture |
| -opt=<N> | 3 (O3) | optimization level |
| -ftz=<N> | 0 (no flush-to-zero) | nvptx-f32ftz |
| -prec-sqrt=<N> | 1 (precise) | nvptx-prec-sqrtf32=1 |
| -prec-div=<N> | 1 (precise) | nvptx-prec-divf32=1 (CUDA) / =0 (CL) |
| -fma=<N> | 1 (enabled) | nvptx-fma-level=1 |
| -opt-fdiv=<N> | 0 (off) | optimizer fast-div control |
| -Ofast-compile=<level> | 0 (off) | fast-compile pipeline |
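
The default-injection behaviour described above can be approximated as a dictionary merge followed by a routing step. This is an illustrative model only: `route_flags` and both dictionaries are hypothetical names, only the three flags with an explicit cl::opt target in the table are routed, and the CUDA/CL difference for -prec-div is left out.

```python
from typing import Dict

# Defaults injected when the user passes no flag (values from the table).
DEFAULT_CLI_FLAGS: Dict[str, str] = {
    "-arch": "compute_75",
    "-opt": "3",
    "-ftz": "0",
    "-prec-sqrt": "1",
    "-prec-div": "1",   # CUDA default; the table lists 0 for CL
    "-fma": "1",
    "-opt-fdiv": "0",
    "-Ofast-compile": "0",
}

# CLI flag -> cl::opt it is routed to.
ROUTED_CL_OPTS: Dict[str, str] = {
    "-prec-sqrt": "nvptx-prec-sqrtf32",
    "-prec-div": "nvptx-prec-divf32",
    "-fma": "nvptx-fma-level",
}


def route_flags(user_flags: Dict[str, str]) -> Dict[str, str]:
    """Overlay user flags on the defaults, then translate to cl::opt values."""
    effective = {**DEFAULT_CLI_FLAGS, **user_flags}
    return {opt: effective[cli] for cli, opt in ROUTED_CL_OPTS.items()}


print(route_flags({"-fma": "0"}))
```

With no user flags, every routed option resolves to 1, matching the CLI Default column of Category 7 even though the cl::opt constructors themselves initialize to 0.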

NVIDIA Modification Density

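| Subsystem | NVIDIA Knobs | LLVM Knobs | Customization Rate |
| --- | --- | --- | --- |
| LSR | 11 | 5 | 69% |
| InstCombine | 12 | 4 | 75% |
| Inliner (NVIDIA custom) | 9 | 0 | 100% |
| Inliner (LLVM standard) | 0 | 17 | 0% |
| GVN | 8 | 3 | 73% |
| NVPTX Backend | 30+ | 0 | 100% |
| SimplifyCFG | 2 | 8+ | 20% |
| Memory Space Opt | 20 | 0 | 100% |
| Rematerialization (IR) | 21 | 0 | 100% |
| Rematerialization (MI) | 16 | 0 | 100% |
| Rematerialization (General) | 15 | 0 | 100% |
| SCEV-CGP | 11 | 0 | 100% |
| Register Pressure | 12 | 7 | 63% |
| Sinking / Code Motion | 5 | 6 | 45% |
| MachinePipeliner | 0 | 18 | 0% |
| Vectorizer | 0 | 18+ | 0% |
| SCEV | 0 | 10+ | 0% |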
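
The Customization Rate column appears to be the NVIDIA share of all knobs in a subsystem, rounded to a whole percent. That formula is inferred from the numbers rather than stated, so the sketch below simply recomputes a few rows under that assumption.

```python
def customization_rate(nvidia: int, llvm: int) -> int:
    """NVIDIA knobs as a percentage of all knobs, rounded to the nearest integer."""
    return round(100 * nvidia / (nvidia + llvm))


# (NVIDIA knobs, LLVM knobs) for rows with exact counts in the table.
ROWS = {
    "LSR": (11, 5),
    "InstCombine": (12, 4),
    "GVN": (8, 3),
    "Register Pressure": (12, 7),
    "Sinking / Code Motion": (5, 6),
}

for name, (nvidia, llvm) in ROWS.items():
    print(f"{name}: {customization_rate(nvidia, llvm)}%")
```

The recomputed values agree with the table, and the same ratio applied to the global counts (172 of 1,496 flags) gives the 11.5% quoted in the summary.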

Cross-References