Configuration Knobs

Three independent knob systems control compiler behavior: LLVM cl::opt flags (1,496 unique), NVVMPassOptions (222 slots), and NVIDIA codegen knobs (~70).


LLVM cl::opt	1,496 unique flags across 353 constructor files
NVVMPassOptions	222 slots, initialized by `sub_12D6300` (125KB)
Codegen knobs	~70, parsed by `sub_1C20170` / `sub_CD9990` from NVVM container
BSS storage	`0x4F7FEA0`–`0x4FA5xxx` (cl::opt), `a1+0`–`a1+4464` (PassOptions)
Dual PM	Same options registered for both Legacy PM (`sub_C53080`) and New PM (`sub_16B8280`)
NVIDIA-specific	172 of 1,496 cl::opt flags (11.5%) are NVIDIA-added

Knob System 1: LLVM cl::opt

Registration Pattern

Every cl::opt follows this initialization sequence in a global constructor:

// Legacy PM path
InterlockedExchangeAdd64(sub_C523C0(), 1);   // atomic option counter
sub_C53080(&option, "option-name", strlen);   // set name
sub_C53130(&option);                          // finalize registration
__cxa_atexit(destructor, &option, &dso_handle);

// New PM path (parallel registration)
InterlockedExchangeAdd64(&unk_4FA0230, 1);
sub_16B8280(&option, "option-name", strlen);
sub_16B88A0(&option);
__cxa_atexit(destructor, &option, &dso_handle);

Each cl::opt<T> occupies ~224 bytes (0xE0) in BSS. Top constructors by option count: ctor_600 (30), ctor_433 (25), ctor_472 (24), ctor_609 (22), ctor_392 (22).
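One way to sanity-check the ~224-byte figure is to measure the spacing between adjacent option objects. The sketch below does this for the eleven GVN addresses listed in Category 3; the helper name `strides` is illustrative and does not come from the binary.

```python
# BSS addresses copied from the Category 3 (GVN) table in this document.
GVN_OPTION_ADDRS = [
    0x4FAE620,  # max-recurse-depth
    0x4FAE700,  # gvn-dom-cache
    0x4FAE7E0,  # profusegvn
    0x4FAE8C0,  # split-stores
    0x4FAE9A0,  # no-split-stores-above
    0x4FAEA80,  # no-split-stores-below
    0x4FAEB60,  # dump-phi-remove
    0x4FAEC40,  # enable-phi-remove
    0x4FAED20,  # enable-split-backedge-in-load-pre
    0x4FAEE00,  # enable-load-pre
    0x4FAEEE0,  # enable-pre
]


def strides(addrs):
    """Return the distance between each pair of adjacent sorted addresses."""
    ordered = sorted(addrs)
    return [b - a for a, b in zip(ordered, ordered[1:])]


gvn_strides = strides(GVN_OPTION_ADDRS)
print({hex(s) for s in gvn_strides})
```

Every gap in this constructor comes out as 0xE0, which matches the per-option size quoted above. Other constructors may interleave differently sized objects, so the check is only indicative.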

Category 1: Scalar Optimization (InstCombine + FP)

Constructor: ctor_165_0 at 0x4D0500 (11,731 bytes). Registers 12 NVIDIA-specific flags plus 4 standard LLVM flags.

Flag	Type	Default	BSS Addr	Purpose
`split-gep-chain`	bool	false	`0x4F901A8`	Split GEP chains to independent GEPs for better address mode selection
`Disable-Add-to-Or`	bool	true	—	Disable add-to-or transformation (NVIDIA blocks this LLVM combine)
`opt-use-fast-math`	bool	false	—	Enable aggressive FP simplification (set by `-unsafe-math` / `-fast-math`)
`opt-use-prec-div`	bool	true	—	Use precise division (set by `-prec-div=1`; cleared by `-prec-div=0`)
`opt-no-signed-zeros`	bool	false	—	Ignore signed zero distinction (set by `-no-signed-zeros`)
`disable-fp-cast-opt`	bool	false	—	Disable FP-to-int and int-to-FP cast optimizations
`reorder-sext-before-cnst-add`	bool	false	—	`sext(add(a,CI))` to `add(sext(a),CI)` rewrite; hidden flag
`disable-sink`	bool	false	—	Disable instruction sinking in InstCombine
`partial-sink`	bool	false	—	Enable partial sinking of instructions
`nvptx-rsqrt-approx-opt`	bool	false	—	Enable reciprocal sqrt approximation optimization
`disable-rsqrt-opt`	bool	false	—	Disable reciprocal sqrt optimization entirely
`check-vn`	bool	false	—	Verify value numbers after transformations (debug)

Standard LLVM flags in same constructor: expensive-combines (bool), instcombine-maxarray-size (int, default 1024), instcombine-visit (int), instcombine-lower-dbg-declare (bool).

Category 2: Inliner Heuristics

Constructor: ctor_186_0 at 0x4DBEC0 (14,109 bytes). Nine NVIDIA-specific flags governing the custom CGSCC inliner at sub_1864060.

Flag	Type	Default	Purpose
`profuseinline`	bool	false	Verbose inlining diagnostics (NVIDIA profuse framework, not PGO profuse)
`inline-total-budget`	int	none	Global total budget across all callers; unset = unlimited
`nv-inline-all`	bool	false	Force inline ALL function calls (used by OptiX ray tracing)
`inline-budget`	int	20000	Per-caller inlining cost budget; `-aggressive-inline` sets to 40000
`inline-adj-budget1`	int	none	Secondary adjusted per-caller budget
`inline-switchctrl`	int	none	Tune heuristic for switch-containing callees
`inline-numswitchfunc`	int	none	Threshold for switch-heavy function penalty
`inline-maxswitchcases`	int	none	Max switch cases before inlining penalty kicks in
`disable-inlined-alloca-merging`	bool	false	Disable post-inline alloca merging into single frame slot

"none" means the knob is unset by default and the heuristic falls back to internal logic.

Category 3: GVN (Global Value Numbering)

Constructor: ctor_201 at 0x4E0990. Eleven knobs (8 NVIDIA-specific + 3 upstream).

Flag	Type	Default	BSS Addr	Purpose
`profusegvn`	bool	true	`0x4FAE7E0`	Verbose GVN diagnostics (unusually, defaults on)
`gvn-dom-cache`	bool	true	`0x4FAE700`	Cache dominator tree nodes; cache size = 32
`max-recurse-depth`	int	1000	`0x4FAE620`	Max recursion during value numbering (safety valve for template-heavy code)
`enable-phi-remove`	int	2	`0x4FAEC40`	PHI removal aggressiveness: 0=off, 1=trivial only, 2=post-leader substitution
`dump-phi-remove`	int	0	`0x4FAEB60`	Dump PHI removal decisions (debug)
`no-split-stores-below`	int	-1	`0x4FAEA80`	Min store width for splitting (bits); -1 = no limit
`no-split-stores-above`	int	-1	`0x4FAE9A0`	Max store width for splitting (bits); -1 = no limit
`split-stores`	bool	true	`0x4FAE8C0`	Master enable for NVIDIA store-splitting in GVN
`enable-pre`	bool	true	`0x4FAEEE0`	Enable Partial Redundancy Elimination (upstream LLVM)
`enable-load-pre`	bool	true	`0x4FAEE00`	Enable load PRE across edges (upstream LLVM)
`enable-split-backedge-in-load-pre`	bool	false	`0x4FAED20`	Allow backedge splitting during load PRE (upstream LLVM)

Store splitting uses a custom NVIDIA registrar (sub_190BE40) that takes a default-value pointer. Both limit knobs default to -1 = all sizes eligible.
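The interaction of the three store-splitting knobs can be sketched as a simple eligibility predicate. This is a reading of the table, not recovered code: the class and function names are hypothetical, and treating the limits as inclusive bounds is an assumption.

```python
from dataclasses import dataclass

NO_LIMIT = -1  # sentinel used by both limit knobs per the table above


@dataclass
class StoreSplitKnobs:
    split_stores: bool = True       # split-stores (master enable)
    no_split_below: int = NO_LIMIT  # no-split-stores-below, in bits
    no_split_above: int = NO_LIMIT  # no-split-stores-above, in bits


def store_is_split_candidate(width_bits: int, knobs: StoreSplitKnobs) -> bool:
    """Return True if a store of this width may be split.

    Assumption: both limits are inclusive, so a store exactly at a
    limit is still eligible.
    """
    if not knobs.split_stores:
        return False
    if knobs.no_split_below != NO_LIMIT and width_bits < knobs.no_split_below:
        return False
    if knobs.no_split_above != NO_LIMIT and width_bits > knobs.no_split_above:
        return False
    return True
```

With the defaults, every width passes; setting `no_split_below=64` would exclude a 32-bit store while leaving 64-bit and wider stores eligible.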

Category 4: Loop Strength Reduction

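Constructor: ctor_214_0 at 0x4E4B00. Eleven NVIDIA-specific LSR flags (69% NVIDIA customization rate). The `[MEDIUM]` and `[LOW]` tags in the Default column appear to be confidence ratings for the recovered value, not part of the default itself.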

Flag	Type	Default	Purpose
`disable-unknown-trip-lsr`	bool	false	Disable LSR for loops with unknown trip count
`lsr-check-rp`	bool	true `[MEDIUM]`	Check register pressure before applying LSR
`lsr-rp-limit`	int	~32-64 `[LOW]`	Skip LSR entirely when RP exceeds this limit (occupancy cliff)
`filter-bad-formula`	bool	true `[MEDIUM]`	Filter out poor-quality LSR formulae early
`do-lsr-64-bit`	bool	arch-dependent	Enable 64-bit loop strength reduction (false on sm_3x-5x, true on sm_70+)
`count-sxt-opt-for-reg-pressure`	bool	true `[MEDIUM]`	Factor sign-extension elimination savings into RP analysis
`lsr-sxtopt`	bool	true `[MEDIUM]`	Perform sign-extension elimination within LSR
`lsr-loop-level`	int	0	Apply LSR only at specific loop nesting level (0 = all levels)
`lsr-skip-outer-loop`	bool	false	Ignore outer-loop induction variables in LSR
`disable-lsr-for-sharedmem32-ptr`	bool	false	Disable LSR for 32-bit shared memory pointers (GPU-specific)
`disable-lsr-complexity-discount`	bool	false	Disable complexity estimation discount heuristic

Standard LLVM LSR flags in same constructor: enable-lsr-phielim, lsr-insns-cost, lsr-exp-narrow, lsr-filter-same-scaled-reg, lsr-fix-iv-inc.

Category 5: IndVarSimplify

Constructor: ctor_203_0 at 0x4E1CD0 (7,007 bytes).

Flag	Type	Default	Purpose
`Disable-unknown-trip-iv`	bool	false	Disable IV substitution for unknown-trip-count loops
`iv-loop-level`	int	none	Control which loop nesting levels get IV substitution

Category 6: SimplifyCFG

Constructor: ctor_243_0 at 0x4ED0C0.

Flag	Type	Default	Purpose
`disable-jump-threading`	bool	false	Disable jump threading (for OCG experiments)
`fold-with-var-cond`	bool	false	Fold branches with variance-based conditions

Category 7: NVPTX Backend Math/Scheduling

Constructor: ctor_607 at 0x584B60 (13,700 bytes). Core numeric precision and FMA controls. Defaults are set by the CLI flag routing in sub_9624D0, not by the cl::opt constructors.

Flag	Type	CLI Default	Purpose
`nvptx-sched4reg`	bool	false	Schedule for register pressure (key NVPTX strategy)
`nvptx-fma-level`	int	1	FMA contraction: 0=off, 1=on, 2=aggressive. CLI `-fma=1` is default
`nvptx-prec-divf32`	int	1	F32 div precision: 0=approx, 1=full, 2=IEEE rnd+ftz, 3=IEEE no-ftz
`nvptx-prec-sqrtf32`	int	1	Sqrt precision: 0=approx, 1=rn. CLI `-prec-sqrt=1` is default
`nvptx-approx-log2f32`	bool	false	Use `lg2.approx` for log2 (only set by `-unsafe-math`)
`nvptx-force-min-byval-param-align`	bool	false	Force 4-byte minimum alignment for byval parameters
`nvptx-normalize-select`	bool	false	Override `shouldNormalizeToSelectSequence` in TLI
`enable-bfi64`	bool	false	Enable 64-bit BFI (bit-field insert) instructions

Note: These cl::opt knobs have no explicit default in their constructor (they init to 0/false). The effective defaults come from the CLI flag catalog: -fma=1 routes -nvptx-fma-level=1, -prec-div=1 routes -nvptx-prec-divf32=1, -prec-sqrt=1 routes -nvptx-prec-sqrtf32=1.

Category 8: NVPTX Backend Passes/Features

Constructor: ctor_609_0 at 0x585D30 (22 options total, the largest NVPTX constructor). The table below lists 18 of them.

Flag	Type	Default	Purpose
`disable-nvptx-load-store-vectorizer`	bool	false	Disable load/store vectorizer
`disable-nvptx-require-structured-cfg`	bool	false	Turn off structured CFG requirement (transitional)
`nvptx-short-ptr`	bool	false	32-bit pointers for const/local/shared address spaces
`nvptx-enable-machine-sink`	bool	false	Enable machine-level instruction sinking
`enable-new-nvvm-remat`	bool	true	Enable new NVVM rematerialization engine
`nv-disable-remat`	bool	false	Disable all rematerialization passes
`nv-disable-mem2reg`	bool	false	Disable machine-IR mem2reg promotion
`nv-disable-scev-cgp`	bool	true	Disable SCEV-based address mode optimization (on = disabled)
`nvptx-32-bit-smem`	bool	false	Use 32-bit pointers for shared address space
`nvptx-exit-on-unreachable`	bool	true	Lower `unreachable` as PTX `exit` instruction
`nvptx-early-byval-copy`	bool	false	Create copy of byval function args in entry block
`enable-nvvm-peephole`	bool	true	Enable NVVM peephole optimizer
`no-reg-target-nvptxremat`	bool	false	Only run old remat on kernels without register targets
`lower-func-args`	bool	true	Lower large aggregate function parameters to copies
`enable-sink`	bool	true	Enable LLVM sinking pass
`disable-post-opt`	bool	false	Disable IR optimizations in post-opt phase
`usedessa`	int	2	deSSA method: 0=off, 1=basic, 2=full
`ldg`	bool	true	Load-via-texture (ld.global.nc) constant transform

Category 9: NVPTX Backend Extended

Constructor: ctor_610 at 0x5888A0 (7,400 bytes).

Flag	Type	Default	Purpose
`unroll-assumed-size`	int	4	Assumed element count for unknown-size local arrays during unroll analysis
`enable-loop-peeling`	bool	false	Enable loop peeling transformation
`enable-256-bit-load-store`	bool	false	Enable 256-bit (32-byte) vector load/store generation
`ias-param-always-point-to-global`	bool	false	Assume function parameter pointers always point to global memory
`ias-strong-global-assumptions`	bool	false	Stronger assumption: constant-buffer pointers resolve to globals
`ias-wmma-memory-space-opt`	bool	false	Enable MemorySpaceOpt specialization for WMMA/tensor operations

Category 10: Memory Space Optimization

Scattered across ctor_264, ctor_267_0, ctor_528, ctor_531_0. See MemorySpaceOpt and IPMSP for the full algorithm.

Flag	Type	Default	Purpose
`mem-space-alg`	int	2	Switch between MSO algorithm variants
`dump-ir-before-memory-space-opt`	bool	false	Dump IR before MSO
`dump-ir-after-memory-space-opt`	bool	false	Dump IR after MSO
`track-indir-load`	bool	true	Track indirect loads during MSO dataflow
`track-int2ptr`	bool	true	Track IntToPtr casts in MSO
`param-always-point-to-global`	bool	true	Kernel parameter pointers always point to global memory
`devicefn-param-always-local`	bool	false	Treat parameter space as local in device functions
`ignore-address-space-check`	bool	false	Ignore address-space checks during branch distribution
`sink-into-texture`	int	3	Sink loads into texture blocks: 0=off, 1=cross-block, 2=+intra, 3=+outside-only. See also Category 14
`ldg`	bool	true	Load Global Constant Transform (ld.global.nc)
`do-clone-for-ip-msp`	int	-1	Function cloning limit for IP-MSP (-1 = unlimited, 0 = disable)
`dump-ip-msp`	bool	false	Dump interprocedural MSP info
`lower-read-only-devicefn-byval`	bool	false	Handle byval attribute of args to read-only device functions
`reuse-lmem-very-long-live-range`	int	—	Threshold for very-long live range in local memory reuse
`hoist-load-param`	bool	false	Generate all `ld.param` in entry basic block
`sink-ld-param`	bool	false	Sink one-use `ld.param` to use point
`process-alloca-always`	bool	true	Treat `alloca` as definite local (AS 5) regardless of context
`wmma-memory-space-opt`	bool	true	Enable memory space optimization for WMMA operations
`strong-global-assumptions`	bool	true	Assume const buffer pointers always point to globals
`process-builtin-assume`	bool	—	Process `__builtin_assume(__is*(p))` assertions for space deduction

Category 11: Rematerialization

Scattered across ctor_609_0, ctor_362, ctor_277_0, ctor_361_0, and others. See Rematerialization for full algorithm detail.

IR-Level Knobs (ctor_277_0 at `0x4F7BE0`)

Flag	Type	Default	Global	Purpose
`do-remat`	int	3	`dword_4FC05C0`	Master control. 0=off, 1=conservative, 2=normal, 3=full
`no-remat`	string	(empty)	`qword_4FC0440`	Comma-separated function exclusion list
`remat-iv`	int	4	`dword_4FBFB40`	IV demotion level. 0=off, 4=full
`remat-load`	int	1	`dword_4FBFA60`	Load rematerialization. 0=off, 1=on
`remat-add`	int	0	`dword_4FBF980`	Add/GEP factoring. 0=off
`remat-single-cost-limit`	int	6000	`dword_4FC0080`	Max cost per single live-in reduction
`remat-loop-trip`	int	20	`dword_4FBFFA0`	Default assumed loop trip count
`remat-gep-cost`	int	6000	`dword_4FBFEC0`	Max cost for GEP rematerialization
`remat-use-limit`	int	10	`dword_4FBFDE0`	Max number of uses for a candidate
`remat-max-live-limit`	int	10	`dword_4FBFD00`	Max live-in limit for rematerialization
`remat-maxreg-ceiling`	int	0	`dword_4FBF600`	Register ceiling (0 = uncapped)
`remat-for-occ`	int	120	`dword_4FBF8A0`	Occupancy-driven rematerialization target
`remat-lli-factor`	int	10	`dword_4FC0320`	Long-latency instruction cost factor
`remat-ignore-single-cost`	bool	false	`byte_4FBFC20`	Bypass per-value cost filter
`remat-move`	bool	false	`byte_4FC0400`	Remat move instructions
`simplify-live-out`	int	2	`dword_4FBF520`	NLO level. 0=off, 2=full
`dump-remat`	int	0	`dword_4FC0240`	Debug dump level (0-4+)
`dump-remat-iv`	int	0	`dword_4FC0160`	IV remat debug dump
`dump-remat-load`	int	0	`dword_4FBF720`	Load remat debug dump
`dump-remat-add`	int	0	`dword_4FBF640`	Add remat debug dump
`dump-simplify-live-out`	bool	false	`byte_4FBF400`	NLO debug dump

Machine-Level Knobs (ctor_361_0 at `0x5108E0`)

Flag	Type	Default	Global	Purpose
`nv-remat-block`	int	14	`dword_4FD3820`	Bitmask controlling remat modes (bits 0-3)
`nv-remat-max-times`	int	10	`dword_4FD3740`	Max outer loop iterations
`nv-remat-block-single-cost`	int	10	`dword_4FD3660`	Max cost per single live value pull-in
`nv-remat-block-map-size-limit`	int	6	`dword_4FD3580`	Map size limit for single pull-in
`nv-remat-block-max-cost`	int	100	`dword_4FD3040`	Max total clone cost per live value reduction
`nv-remat-block-liveout-min-percentage`	int	70	`dword_4FD3120`	Min liveout % for special consideration
`nv-remat-block-loop-cost-factor`	int	20	`unk_4FD3400`	Loop cost multiplier
`nv-remat-default-max-reg`	int	70	`unk_4FD3320`	Default max register pressure target
`nv-remat-block-load-cost`	int	10	`unk_4FD2EC0`	Cost assigned to load instructions
`nv-remat-threshold-for-spec-reg`	int	20	`unk_4FD3860`	Threshold for special register remat
`nv-dump-remat-block`	bool	false	`byte_4FD2E80`	Debug dump toggle
`nv-remat-check-internal-live`	bool	false	`byte_4FD2DA0`	Check internal liveness during MaxLive
`max-reg-kind`	int	0	`qword_4FD2C20`	Kind of max register pressure info
`no-mi-remat`	string	(empty)	`qword_4FD2BE0`	Skip machine-level remat for named functions
`load-remat`	bool	true	`word_4FD32F0`	Enable load rematerialization
`vasp-fix1`	bool	false	`word_4FD3210`	VASP fix for volatile/addsp
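The default `nv-remat-block` value of 14 is easier to read in binary. The sketch below only decodes which bit positions are set; what each bit means is not stated in the table, and the helper name is illustrative.

```python
NV_REMAT_BLOCK_DEFAULT = 14  # default from the table above


def enabled_mode_bits(mask: int, width: int = 4):
    """List the bit positions (0..width-1) that are set in mask."""
    return [bit for bit in range(width) if mask & (1 << bit)]


print(bin(NV_REMAT_BLOCK_DEFAULT), enabled_mode_bits(NV_REMAT_BLOCK_DEFAULT))
```

The default therefore has bits 1, 2, and 3 set and bit 0 clear, so whatever mode bit 0 selects is off unless the knob is overridden.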

General Remat Knobs (ctor_609_0, ctor_362, and others)

Flag	Type	Default	Purpose
`nv-disable-remat`	bool	false	Disable all remat passes
`enable-new-nvvm-remat`	bool	true	Enable new NVVM remat engine (disables old)
`no-reg-target-nvptxremat`	bool	false	Only old remat for kernels without register targets
`fp-remat`	bool	false	Allow rematerializing floating-point instructions
`high-cost-remat`	bool	false	Allow rematerializing high-cost instructions
`cost-threshold-remat`	int	—	Cost threshold per remat action
`block-freq-cap-remat`	int	—	Maximum raw block frequency value
`block-freq-norm-range-remat`	int	—	Normalization range for block frequency in remat cost
`collect-candidate-scale-remat`	int	—	Scaling ratio for high-RP candidate collection
`incremental-update-remat`	bool	false	Incrementally update RP analysis after each remat
`verify-update-remat`	bool	false	Debug: verify incremental update vs full analysis
`print-verify-remat`	bool	false	Debug: print problematic RP on verification failure
`rp-remat`	int	—	Debug: set a target register pressure number
`late-remat-update-threshold`	int	—	Threshold for copy with many other copy uses
`remat-load-param`	bool	false	Support rematerializing constant `ld.param` not in NVVM IR

Category 12: SCEV-CGP (Address Mode Optimization)

Eleven NVIDIA-specific knobs for SCEV-based CodeGenPrepare. See CodeGenPrepare.

Flag	Type	Default	Purpose
`do-scev-cgp`	bool	false	Enable SCEV-based CodeGenPrepare
`do-scev-cgp-aggresively`	bool	false	Aggressive SCEV-CGP mode [sic]
`do-function-scev-cgp`	bool	false	Function-level SCEV-CGP
`nv-disable-scev-cgp`	bool	true	Disable SCEV address mode optimization (master kill switch, on by default)
`scev-cgp-control`	int	—	Control max transformations applied
`scev-cgp-cross-block-limit`	int	—	Max common-base expressions from a single block
`scev-cgp-idom-level-limit`	int	—	Max dominator tree levels to walk
`scev-cgp-inst-limit`	int	—	Max instructions for a single parameter
`scev-cgp-old-base`	bool	false	Force SCEV-CGP to create new base (vs reusing old)
`scev-cgp-tid-max-value`	int	—	Max value of thread ID in SCEV expressions
`print-after-scev-cgp`	bool	false	Print function after SCEV-CGP phase

Category 13: Branch Distribution

Seven NVIDIA-specific flags are attributed to this pass; the table below lists six.

Flag	Type	Default	Purpose
`branch-dist-block-limit`	int	—	Max blocks to apply branch distribution
`branch-dist-func-limit`	int	—	Max functions to apply branch distribution
`branch-dist-norm`	int	—	Normalization control
`no-branch-dist`	string	—	Comma-separated list of functions to skip
`disable-complex-branch-dist`	bool	false	Disable complex branch distribution
`dump-branch-dist`	bool	false	Dump branch distribution info

Category 14: Sinking / Code Motion

Eleven knobs across multiple constructors. See Sinking2 for the NVIDIA-custom texture-aware sinking pass.

Flag	Type	Default	Purpose
`sink-into-texture`	int	3	Texture sinking aggressiveness: 0=off, 1=cross-block, 2=+intra, 3=+outside-only
`sink-limit`	int	20	Max instructions to sink per Sinking2 invocation (complexity limiter)
`dump-sink2`	bool	false	Debug dump for Sinking2 pass
`sink-check-sched`	bool	true	Check scheduling effects of sinking (stock Sink)
`sink-single-only`	bool	true	Only sink single-use instructions (stock Sink)
`enable-andcmp-sinking`	bool	false	Sink and/cmp sequences into branches
`aggressive-no-sink`	bool	false	Sink all generated instructions
`max-uses-for-sinking`	int	—	Don't sink instructions with too many uses
`rp-aware-sink`	bool	false	Consider register pressure impact when sinking
`instcombine-code-sinking`	bool	false	Enable code sinking within InstCombine
`hoist-const-stores`	bool	false	Hoist loop-invariant stores

Category 15: Register Pressure / Allocation

NVIDIA-specific knobs plus LLVM greedy allocator knobs. See Register Allocation for the full algorithm.

NVIDIA RP Knobs

Flag	Type	Default	Purpose
`maxreg`	int	none	Maximum register count (`--maxrregcount` equivalent)
`register-usage-level`	int	—	Register usage level control
`cta-reconfig-aware-mrpa`	bool	false	CTA reconfiguration-aware machine RP analysis
`cta-reconfig-aware-rpa`	bool	false	CTA reconfiguration-aware RP analysis
`pred-aware-mcse`	bool	false	Predicate-aware MachineCSE
`rp-aware-mcse`	bool	false	Register-pressure-aware MachineCSE
`verify-update-mcse`	bool	false	Debug: verify incremental RP update in MachineCSE
`incremental-update-mcse`	bool	true	Incrementally update register pressure analysis in MachineCSE
`print-verify`	bool	false	Print problematic RP info if MCSE verification fails
`pred-target-adjust`	int	0	Predicate register target adjustment (-10 to +10)
`donot-insert-dup-copies`	bool	false	Skip duplicate copies to predecessor basic block
`nv-disable-mem2reg`	bool	false	Disable machine-level mem2reg

LLVM Greedy Allocator Knobs

Flag	Type	Default	Purpose
`split-spill-mode`	int	1	Spill mode: 0=default, 1=size, 2=speed
`lcr-max-depth`	int	5	Last chance recoloring max recursion depth
`lcr-max-interf`	int	8	Last chance recoloring max interferences
`exhaustive-register-search`	bool	false	Bypass LCR depth/interference cutoffs
`enable-deferred-spilling`	bool	false	Defer spill code to end of allocation
`grow-region-complexity-budget`	int	10000	`growRegion()` edge budget for live range splitting
`split-threshold-for-reg-with-hint`	int	75	Split threshold percentage for hinted registers

Category 16: Restrict / Aliasing

Five NVIDIA-specific flags. See Alias Analysis.

Flag	Type	Default	Purpose
`process-restrict`	bool	false	Process `__restrict__` keyword for alias analysis
`allow-restrict-in-struct`	bool	false	Allow `__restrict__` inside struct members
`apply-multi-level-restrict`	bool	false	Apply restrict to all pointer levels
`dump-process-restrict`	bool	false	Debug dump during restrict processing
`strict-aliasing`	bool	false	Datatype-based strict aliasing

Category 17: CSSA / deSSA

Four knobs. See CSSA.

Flag	Type	Default	Purpose
`cssa-coalesce`	int	—	Control PHI operand coalescing strategy
`cssa-verbosity`	int	0	Verbosity level
`dump-before-cssa`	bool	false	Dump specific PHI operands being coalesced
`usedessa`	int	2	deSSA method: 0=off, 1=basic, 2=full

Category 18: Loop / Unrolling

Eight NVIDIA-specific knobs (beyond the 20+ standard LLVM loop-unrolling flags).

Flag	Type	Default	Purpose
`nv-disable-loop-unrolling`	bool	false	Disable loop unrolling in all passes
`aggressive-runtime-unrolling`	bool	false	OCG-style unrolling heuristics
`aggressive-runtime-unrolling-fixed-factor`	int	—	Force fixed unroll factor
`aggressive-runtime-unrolling-max-factor`	int	—	Maximum unroll factor
`aggressive-runtime-unrolling-max-filler-instructions-per-batch`	int	—	Max filler instructions
`unroll-runtime-nv-expensive`	bool	false	NVIDIA heuristics for expensive loops
`unroll-runtime-convergent`	bool	false	Allow unrolling with `convergent` instructions
`track-trip-count-more`	bool	false	Track loop trip count more aggressively

Category 19: GEP / Address Strength Reduction

Eight NVIDIA-specific knobs. See Base Address Strength Reduction.

Flag	Type	Default	Purpose
`normalize-gep`	bool	false	Normalize 64-bit GEP subscripts
`dump-normalize-gep`	bool	false	Debug dump for GEP normalization
`do-base-address-strength-reduce`	int	0	Two levels: 1=unconditional, 2=with conditions
`dump-base-address-strength-reduce`	bool	false	Debug dump
`do-lsr-64-bit`	bool	arch-dependent	Loop strength reduction for 64-bit (shared with LSR; Category 4 gives false on sm_3x-5x, true on sm_70+)
`do-sign-ext-expand`	bool	false	Expand sign-extension during SCEV build
`balance-dot-chain`	bool	false	Balance chain of dot operations
`special-reassociate-for-threadid`	bool	false	Don't move back expressions containing thread ID

Category 20: Aggregate / Byval Lowering

Ten knobs.

Flag	Type	Default	Purpose
`aggressive-max-aggr-lower-size`	int	—	Threshold size for lowering aggregates
`aggressive-lsv`	bool	false	Merge smaller dtypes in aggregate before vectorization
`vect-split-aggr`	bool	false	Split aggregates before vectorization
`lower-aggr-unrolled-stores-limit`	int	—	Limit stores in unrolled aggregate lowering
`large-aggr-store-limit`	int	—	Create loops for aggregate store exceeding limit
`lower-func-args`	bool	true	Lower large aggregate function parameters
`lsa-opt`	bool	false	Optimize copying of struct args to local memory
`skiploweraggcopysafechk`	bool	false	Skip safety check in `loweraggcopy`
`memdep-cache-byval-loads`	bool	true	Preprocess byval loads to reduce compile time
`ldstmemcpy-glue-max`	int	—	Limit for gluing ld/st of memcpy

Category 21: Normalization / Canonicalization

Four knobs.

Flag	Type	Default	Purpose
`norm-fold-all`	bool	false	Fold all regular instructions
`norm-preserve-order`	bool	false	Preserve original instruction order
`norm-rename-all`	bool	false	Rename all instructions
`norm-reorder-operands`	bool	false	Sort/reorder operands in commutative operations

Category 22: NVVM Infrastructure

Five knobs.

Flag	Type	Default	Purpose
`nvvm-lower-printf`	bool	false	Enable printf lowering
`nvvm-reflect-enable`	bool	true	NVVM reflection (reads `__CUDA_FTZ`, `__CUDA_PREC_DIV`, etc.)
`nvvm-verify-show-info`	bool	false	Info messages during NVVM verification
`enable-nvvm-peephole`	bool	true	NVVM peephole optimizer
`nv-ocl`	bool	false	Deprecated OpenCL compatibility flag

Category 23: Compilation Control

Constructor: ctor_043_0 at 0x48D7F0 + ctor_028_0 at 0x489160.

Flag	Type	Default	Purpose
`debug-compile`	bool	false	Compile for debugging (set by `-g`)
`generate-line-info`	bool	false	Emit line info even without `-G`
`nvptx-f32ftz`	bool	false	Flush f32 subnormals to zero; hidden
`w`	bool	false	Disable warnings; hidden
`Werror`	bool	false	Treat warnings as errors; hidden
`Osize`	bool	false	Optimize for code size; hidden
`Om`	bool	false	Maximum optimization mode; hidden
`maxreg`	int	none	Maximum register count (no limit if unset)
`nvptx-nan`	bool	false	NaN handling control; hidden
`jump-table-density`	int	10	Minimum density (%) for jump table lowering
`pass-control`	int	-1	Disable all optional passes after pass N; -1 = no limit
`disable-passno`	list	empty	Disable pass(es) by number (comma-separated)
`sep-comp`	bool	false	Separate compilation mode
`proffile`	string	—	Filename for PGO profile information
`R`	string	—	Resource constraint: `name=<int>` format
`lnk-disable-allopts`	bool	false	Disable all linker optimization passes
`disable-peephole`	bool	false	Disable peephole optimizer
`disable-early-taildup`	bool	false	Disable pre-regalloc tail duplication
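The `pass-control` and `disable-passno` knobs combine naturally into one gating check. The sketch below models that check for an optional pass; the function names are hypothetical, and treating pass N itself as still enabled under `pass-control=N` is an assumption based on the wording "after pass N".

```python
def parse_disable_passno(value: str):
    """Parse a comma-separated list of pass numbers into a set of ints."""
    return {int(tok) for tok in value.split(",") if tok.strip()}


def pass_is_enabled(pass_no: int, pass_control: int = -1, disable_passno: str = "") -> bool:
    """Decide whether an optional pass with this number should run.

    Assumption: pass-control=N keeps passes 1..N and drops later ones;
    -1 means no limit, as stated in the table.
    """
    if pass_control != -1 and pass_no > pass_control:
        return False
    return pass_no not in parse_disable_passno(disable_passno)


print(pass_is_enabled(5, pass_control=4), pass_is_enabled(2, disable_passno="1,3"))
```

This kind of numeric gating is what makes bisecting a miscompile practical: lower `pass-control` until the problem disappears, then confirm with `disable-passno` on the suspect pass.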

Category 24: Divergence / GPU Execution

Three flags.

Flag	Type	Default	Purpose
`spec-exec-only-if-divergent-target`	bool	false	Speculative execution only when target is divergent
`prefer-predicated-reduction-select`	bool	false	Prefer predicated reduction over after-loop select
`openmp-opt-disable-barrier-elimination`	bool	false	Disable OpenMP barrier elimination

Category 25: MachinePipeliner (Swing Modulo Scheduling)

Eighteen LLVM-origin knobs for software pipelining. See Scheduling for the full algorithm.

Flag	Type	Default	Global	Purpose
`enable-pipeliner`	bool	true	`unk_503EE20`	Master switch for SMS
`enable-pipeliner-opt-size`	bool	false	`qword_503ED40`	Enable SWP at -Os
`pipeliner-max-mii`	int	27	`qword_503ECE8`	Maximum allowed MII
`pipeliner-force-ii`	int	0	`qword_503EB80`	Force specific II (0 = auto)
`pipeliner-max-stages`	int	3	`qword_503EB28`	Maximum pipeline stages
`pipeliner-prune-deps`	bool	true	`qword_503E9C0`	Prune deps between unrelated Phi nodes
`pipeliner-prune-loop-carried`	bool	true	`qword_503E8E0`	Prune loop-carried order deps
`pipeliner-ignore-recmii`	bool	false	`qword_503E888`	Ignore RecMII; hidden
`pipeliner-show-mask`	bool	false	`qword_503E720`	Debug: show scheduling mask
`pipeliner-dbg-res`	bool	false	`qword_503E640`	Debug: resource usage
`pipeliner-annotate-for-testing`	bool	false	`qword_503E5E8`	Annotate instead of codegen
`pipeliner-experimental-cg`	bool	false	`qword_503E508`	Use peeling code generator
`pipeliner-ii-search-range`	int	10	`qword_503E3A0`	Range to search for II
`pipeliner-register-pressure`	bool	false	`qword_503E2C0`	Consider register pressure
`pipeliner-register-pressure-margin`	int	5	`qword_503E1E0`	Margin % for reg pressure limit
`pipeliner-mve-cg`	bool	true	`unk_503E100`	Use MVE code generator
`pipeliner-enable-copytophi`	bool	true	`qword_503E020`	Enable CopyToPhi DAG Mutation
`pipeliner-force-issue-width`	int	0	`qword_503DF40`	Force issue width (0 = auto)

Category 26: LLVM Standard Inliner (Model B)

Seventeen LLVM-origin knobs from ctor_625_0 / ctor_715_0 at 0x58FAD0. These control the upstream InlineCostAnalysis::analyzeCall path; see Inliner Cost Model for why the NVIDIA custom model (Category 2) dominates in practice.

Flag	Type	Default	Purpose
`inline-threshold`	int	225	Base inlining threshold
`inlinedefault-threshold`	int	225	Default when no hint/profile
`inlinehint-threshold`	int	325	Threshold for functions carrying the `inlinehint` attribute (e.g. declared `inline`)
`inline-cold-callsite-threshold`	int	45	Threshold for cold callsites
`inlinecold-threshold`	int	45	Threshold for functions with cold attribute
`hot-callsite-threshold`	int	3000	Threshold for hot callsites (PGO)
`locally-hot-callsite-threshold`	int	525	Threshold for locally hot callsites
`inline-instr-cost`	int	5	Cost per instruction
`inline-call-penalty`	int	25	Penalty per callsite in callee
`inline-memaccess-cost`	int	0	Cost per load/store
`inline-savings-multiplier`	int	8	Multiplier for cycle savings
`inline-savings-profitable-multiplier`	int	4	Multiplier for profitability check
`inline-size-allowance`	int	100	Max callee size inlined without savings proof
`inline-cost-full`	bool	false	Compute full cost even when over threshold
`inline-enable-cost-benefit-analysis`	bool	false	Enable cost-benefit analysis
`inline-deferral`	bool	—	Defer inlining in cold paths (PGO)
`inline-remark-attribute`	bool	false	Emit inline remarks

Category 27: New PM CGSCC Inliner (Model C)

Two knobs for the New Pass Manager CGSCC inliner at 0x2613930. See Inliner Cost Model.

Flag	Type	Default	Purpose
`function-inline-cost-multiplier`	int	—	Penalize recursive function inlining
`enable-ml-inliner`	enum	default	ML advisory mode: `default`, `development`, `release`

Knob System 2: NVVMPassOptions

222 pass option slots initialized by sub_12D6300 (125KB). Each slot is accessed by integer index (1 to 221) and stored in a ~4,480-byte struct.

Access Functions

Function	Purpose
`sub_12D6170(base+120, index)`	Fetch pass option descriptor by index
`sub_1691920(base+8, index)`	Fetch pass option value from table
`sub_12D6090(a1+offset, ...)`	Store string-typed option
`sub_12D6100(a1+offset, ...)`	Store integer-typed option
`sub_12D6240(a1, index, "0")`	Get option with default value

See NVVMPassOptions for the complete 222-slot inventory.

Knob System 3: NVIDIA Codegen Knobs

Parsed from the NVVM container format by sub_1C20170 and sub_CD9990. See NVIDIA Custom Passes for the complete inventory.

Hidden / Obfuscated Flags

Obfuscated Flag (`ctor_043_0` at `~0x48EE80`)

A 4-byte CLI flag name computed via XOR-based obfuscation from unk_3F6F7C7:

v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));

Stored at qword_4F857C0 with flag bits 0x87 | 0x38 = hidden + really-hidden. NVIDIA deliberately hides this option from static analysis; the constant 0x811C9DC5 is the 32-bit FNV offset basis, which is why the scheme looks FNV-1a-like.
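The multiplier in the expression above also has an FNV connection: -109 as an unsigned byte is 0x93, the low byte of the standard 32-bit FNV prime 0x01000193. The sketch below checks that identity and models the XOR step per byte. Treating `offset` as the byte index and truncating the key to 8 bits are assumptions; the function names are hypothetical and the actual encoded bytes at unk_3F6F7C7 are not reproduced here.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # appears verbatim in the expression above
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime

# -109 reinterpreted as an unsigned byte.
MULTIPLIER_BYTE = -109 & 0xFF


def key_byte(offset: int) -> int:
    """Key byte for one position (8-bit truncation is an assumption)."""
    return (-109 * ((offset + 97) ^ FNV_OFFSET_BASIS_32)) & 0xFF


def xor_transform(data: bytes) -> bytes:
    """XOR each byte with its positional key; applying it twice restores data."""
    return bytes(b ^ key_byte(i) for i, b in enumerate(data))


print(hex(MULTIPLIER_BYTE), hex(FNV_PRIME_32 & 0xFF))
```

Because the transform is a plain XOR with a position-derived key, the same routine serves for both encoding and decoding once the stored bytes are known.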

Environment Variable Backdoors

Variable	Purpose	Location
`NVVMCCWIZ`	Wizard mode (value 553282) -- unlocks `-v`, `-keep`, `-dryrun`, `-lgenfe`, `-opt`, `-llc`, `-lnk`, `-libnvvm`	`sub_8F9C90`
`bar`	Extended debug pass registration	`ctor_107_0` at `0x4A64D0`
`NVVM_IR_VER_CHK`	Override IR version check (set to "0" to disable)	`sub_12BFF60`
`LLVM_OVERRIDE_PRODUCER`	Override bitcode producer string (default `"7.0.1"`)	`ctor_154` at `0x4CE640`
`MALLOC_CONF`	jemalloc allocator tuning	`sub_12FCDB0`
`LIBNVVM_DISABLE_CONCURRENT_API`	Force single-threaded NVVM compilation	`ctor_104` at `0x4A5810`

CLI Defaults Set by Flag Routing

These are effective defaults applied by the flag catalog parser (sub_9624D0), not by cl::opt constructors. When no user flag is specified, the parser injects these:

CLI flag	Default value	Routed cl::opt
`-arch=compute_<N>`	`compute_75` (SM 75, Turing)	target architecture
`-opt=<N>`	`3` (O3)	optimization level
`-ftz=<N>`	`0` (no flush-to-zero)	`nvptx-f32ftz`
`-prec-sqrt=<N>`	`1` (precise)	`nvptx-prec-sqrtf32=1`
`-prec-div=<N>`	`1` (precise)	`nvptx-prec-divf32=1` (CUDA) / `=0` (CL)
`-fma=<N>`	`1` (enabled)	`nvptx-fma-level=1`
`-opt-fdiv=<N>`	`0` (off)	optimizer fast-div control
`-Ofast-compile=<level>`	`0` (off)	fast-compile pipeline
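The default-injection behaviour can be sketched as a merge of user flags over a default table. This models only the four rows with a named cl::opt, on the CUDA path; the names `CLI_DEFAULTS`, `ROUTED_OPT`, and `effective_backend_opts` are hypothetical, and the real parser in sub_9624D0 handles many more flags.

```python
# Defaults and routing copied from the table above (CUDA path only).
CLI_DEFAULTS = {"ftz": 0, "prec-sqrt": 1, "prec-div": 1, "fma": 1}

ROUTED_OPT = {
    "ftz": "nvptx-f32ftz",
    "prec-sqrt": "nvptx-prec-sqrtf32",
    "prec-div": "nvptx-prec-divf32",
    "fma": "nvptx-fma-level",
}


def effective_backend_opts(user_flags):
    """Merge user-supplied CLI values over the injected defaults.

    Returns a dict keyed by the routed cl::opt name. Flags without a
    named cl::opt in the table are ignored by this sketch.
    """
    merged = {**CLI_DEFAULTS, **user_flags}
    return {ROUTED_OPT[name]: value
            for name, value in merged.items() if name in ROUTED_OPT}


print(effective_backend_opts({"fma": 0}))
```

Calling it with an empty dict yields the effective defaults, which is the point of the table: the backend knobs read as 1 even though their constructors initialize them to 0.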

NVIDIA Modification Density

Subsystem	NVIDIA Knobs	LLVM Knobs	Customization Rate
LSR	11	5	69%
InstCombine	12	4	75%
Inliner (NVIDIA custom)	9	0	100%
Inliner (LLVM standard)	0	17	0%
GVN	8	3	73%
NVPTX Backend	30+	0	100%
SimplifyCFG	2	8+	20%
Memory Space Opt	20	0	100%
Rematerialization (IR)	21	0	100%
Rematerialization (MI)	16	0	100%
Rematerialization (General)	15	0	100%
SCEV-CGP	11	0	100%
Register Pressure	12	7	63%
Sinking / Code Motion	5	6	45%
MachinePipeliner	0	18	0%
Vectorizer	0	18+	0%
SCEV	0	10+	0%
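The Customization Rate column is consistent with NVIDIA knobs divided by all knobs in the subsystem, rounded to the nearest percent. The sketch below recomputes it for the rows with exact counts on both sides; the function name is illustrative.

```python
def customization_rate(nvidia_knobs: int, llvm_knobs: int) -> int:
    """Percentage of a subsystem's knobs that are NVIDIA-added."""
    total = nvidia_knobs + llvm_knobs
    return round(100 * nvidia_knobs / total) if total else 0


# (NVIDIA knobs, LLVM knobs) copied from the table above.
ROWS = {
    "LSR": (11, 5),
    "InstCombine": (12, 4),
    "GVN": (8, 3),
    "Register Pressure": (12, 7),
    "Sinking / Code Motion": (5, 6),
}

rates = {name: customization_rate(nv, llvm) for name, (nv, llvm) in ROWS.items()}
print(rates)
```

Rows with open-ended counts such as `8+` or `30+` cannot be recomputed exactly, so their percentages are approximate.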

Cross-References

NVVMPassOptions -- 222-slot pass option system
CLI Flags -- complete flag-to-pipeline routing
Environment Variables -- all verified env vars
Optimization Levels -- O0/O1/O2/O3 and fast-compile pipelines
Rematerialization -- multi-pass remat engine
Memory Space Optimization -- address space resolution
Sinking2 -- texture-aware sinking
Register Allocation -- greedy RA with NVIDIA extensions
Scheduling -- SMS and MRPA
IPMSP -- memory space optimization engine
Alias Analysis -- restrict propagation
CodeGenPrepare -- SCEV-CGP pass
Inliner Cost Model -- four parallel inliner models
GVN -- GPU-specific value numbering
LSR -- GPU-aware loop strength reduction

CICC Reverse Engineering Reference