NVVMPassOptions

NVVMPassOptions is NVIDIA's proprietary per-pass configuration system -- a 4,512-byte flat struct containing 221 option slots that controls every aspect of the NVVM optimization pipeline. It has no upstream LLVM equivalent. Where LLVM uses scattered cl::opt<T> globals that each pass reads independently, NVIDIA consolidates all pass configuration into a single contiguous struct that is allocated once and threaded through the entire pipeline assembler as a parameter. This design allows the pipeline to make pass-enable decisions through simple byte reads at known offsets rather than hash-table lookups, and it ensures that the complete configuration state can be copied between Phase I and Phase II of the two-phase compilation model.

The struct is populated by a single 125KB function (sub_12D6300) that reads from a PassOptionRegistry hash table and flattens the results into 221 typed slots. The pipeline assembler (sub_12E54A0) and its sub-pipeline builders (sub_12DE330, sub_12DE8F0) then read individual slots by offset to decide which passes to insert and how to configure them.

Initializer       sub_12D6300 (125KB, 4,786 lines)
Struct size       4,512 bytes (sub_22077B0(4512))
Slot count        221 (1-based index: 1--221)
Slot types        5: STRING (24B), BOOL_COMPACT (16B), BOOL_INLINE (16B), INTEGER (16B), STRING_PTR (28B)
Type breakdown    114 string + 83 bool compact + 17 bool inline + 6 integer + 1 string pointer
Registry lookup   sub_12D6170 (hash table at registry+120)
PassDef resolver  sub_1691920 (64-byte stride table)
Bool parser       sub_12D6240 (triple: lookup + lowercase + char test)
Callers           sub_12E7E70 (Phase orchestrator), sub_12F4060 (TargetMachine creation)
Consumers         sub_12E54A0, sub_12DE330, sub_12DE8F0, sub_12DFE00

Struct Layout

The struct is heap-allocated as a single 4,512-byte block. The first 16 bytes contain header fields, followed by 221 option slots packed contiguously, and a 32-byte zero trailer:

Offset  Size   Field
──────  ────   ─────
0       4      int opt_level (copied from registry+112)
4       4      (padding)
8       8      qword ptr to PassOptionRegistry
16      ~4464  221 option slots (variable-size, packed)
4480    32     zero trailer (4 qwords, sentinel)

Slot offsets are deterministic -- they depend on the type sequence hard-coded into sub_12D6300. String slots consume 24 bytes, boolean and integer slots consume 16 bytes, and the unique string-pointer slot at index 181 consumes 28 bytes. The initializer writes each slot at a compile-time-constant offset; there is no dynamic layout calculation.
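Because every slot size is fixed, an offset can be recomputed by walking the type sequence from the 16-byte header. The sketch below does this for the regular region only. The type sequence it uses is an assumption inferred from the offsets quoted elsewhere on this page (slots 1--6 as strings, slot 7 as a 16-byte slot, then even string / odd 16-byte pairs through slot 159). It is not taken from sub_12D6300, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Slot kinds reduced to the two sizes that occur in the regular region. */
enum slot_kind { SLOT_STRING, SLOT_BOOL_OR_INT };

static size_t slot_size(enum slot_kind k) {
    return k == SLOT_STRING ? 24 : 16;   /* sizes as stated in the text */
}

/* ASSUMPTION: simplified type sequence, valid for slots 1..159 only.
 * Slots 1-6 are strings, slot 7 is a 16-byte slot, and from slot 8 on
 * even slots are strings and odd slots are 16-byte bool/integer slots. */
static enum slot_kind assumed_kind(int slot) {
    if (slot <= 6)
        return SLOT_STRING;
    return (slot % 2 == 0) ? SLOT_STRING : SLOT_BOOL_OR_INT;
}

/* Byte offset of a 1-based slot: the 16-byte header plus all earlier slots. */
static size_t slot_offset(int slot) {
    assert(slot >= 1 && slot <= 159);    /* later anomalies are not modeled */
    size_t off = 16;
    for (int s = 1; s < slot; ++s)
        off += slot_size(assumed_kind(s));
    return off;
}
```

Under these assumptions, slot_offset(9), slot_offset(19), and slot_offset(143) give 200, 400, and 2880, matching the offsets listed for those slots in the default-value tables.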

Slot Types

Type A: String Option (24 bytes) -- sub_12D6090

114 slots. Stores a string value (pass name or parametric value) along with flags, optimization level, and pass ID.

struct StringOption {       // 24 bytes, written by sub_12D6090
    char*    value;         // +0:  pointer to string data
    int32_t  option_index;  // +8:  1-based slot index
    int32_t  flags;         // +12: from PassDef byte 40
    int32_t  opt_level;     // +16: from header opt_level
    int32_t  pass_id;       // +20: resolved via sub_1691920
};

Type B: Boolean Compact (16 bytes) -- sub_12D6100

83 slots. The most common boolean representation. The helper encapsulates the lookup-parse-resolve sequence.

struct BoolCompactOption {  // 16 bytes, written by sub_12D6100
    uint8_t  value;         // +0:  0 or 1
    uint8_t  pad[3];        // +1:  padding
    int32_t  option_index;  // +4:  1-based slot index
    int32_t  flags;         // +8:  from PassDef byte 40
    int32_t  pass_id;       // +12: resolved via sub_1691920
};

Type C: Boolean Inline (16 bytes) -- direct write

17 slots. Identical layout to Type B, but written directly by sub_12D6300 rather than through the sub_12D6100 helper. These correspond to option pairs where the boolean resolution requires checking PassDef+36 (has_overrides byte) and resolving via sub_1691920 inline. The 17 inline boolean slots are: 7, 11, 13, 49, 53, 55, 59, 61, 95, 103, 119, 127, 151, 159, 169, 177, 211.

struct BoolInlineOption {   // 16 bytes, same layout as Type B
    uint8_t  value;         // +0:  0 or 1
    uint8_t  pad[3];        // +1
    int32_t  option_index;  // +4:  high 32 bits of sub_12D6240 return
    int32_t  opt_level;     // +8:  from header
    int32_t  pass_id;       // +12: resolved inline
};

Type D: Integer (16 bytes) -- direct write via sub_16D2BB0

6 slots. The integer value is parsed from the registry string by sub_16D2BB0 (string-to-int64). Layout is identical to boolean compact but the first 4 bytes store a full int32_t rather than a single byte.

struct IntegerOption {      // 16 bytes
    int32_t  value;         // +0:  parsed integer
    int32_t  option_index;  // +4:  1-based slot index
    int32_t  opt_level;     // +8
    int32_t  pass_id;       // +12
};

Type E: String Pointer (28 bytes) -- slot 181 only

Unique. Stores a raw char* plus length rather than a managed string. Likely a file path or regex pattern that requires direct C-string access.

struct StringPtrOption {    // 28 bytes, slot 181 only
    char*    data;          // +0:  raw char pointer
    uint64_t length;        // +8:  string length
    int32_t  option_index;  // +16: 1-based slot index
    int32_t  opt_level;     // +20
    int32_t  pass_id;       // +24
};
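The stated sizes can be sanity-checked by declaring the same field lists in C and asking the compiler for sizes and offsets. This is a sketch assuming an LP64 target (8-byte pointers), with hypothetical helper names. One detail it surfaces: the fields of the string-pointer slot add up to 28 bytes, but a compiler using natural alignment pads the struct to 32, so the 28-byte figure is best read as the payload size. How the binary pads the region after slot 181 is not established here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field lists copied from the slot descriptions; names are illustrative. */
struct StringOption      { char *value; int32_t option_index, flags, opt_level, pass_id; };
struct BoolCompactOption { uint8_t value; uint8_t pad[3]; int32_t option_index, flags, pass_id; };
struct IntegerOption     { int32_t value, option_index, opt_level, pass_id; };
struct StringPtrOption   { char *data; uint64_t length; int32_t option_index, opt_level, pass_id; };

/* Returns 1 when the 24- and 16-byte slot sizes hold on this target
 * (ASSUMPTION: LP64, i.e. 8-byte pointers). */
static int fixed_slot_sizes_match_text(void) {
    return sizeof(struct StringOption) == 24 &&
           sizeof(struct BoolCompactOption) == 16 &&
           sizeof(struct IntegerOption) == 16;
}

/* Bytes occupied by the fields of the string-pointer slot, excluding any
 * tail padding the compiler adds to keep the struct 8-byte aligned. */
static size_t string_ptr_payload_bytes(void) {
    return offsetof(struct StringPtrOption, pass_id) + sizeof(int32_t);
}
```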

Pair Organization Pattern

The 221 slots follow a predominantly paired layout. Slots 1--6 are six standalone STRING options (likely the global compilation parameters: ftz, prec-div, prec-sqrt, fmad, opt-level, sm-arch). Starting at slot 7, slots are organized in (EVEN, ODD) pairs:

  • Even slot N: STRING option -- the pass's parameter value or name
  • Odd slot N+1: BOOLEAN or INTEGER option -- the enable/disable toggle

Each "pass knob" thus gets a string parameter slot and a boolean gate. The pipeline assembler reads the boolean to decide whether to insert the pass, and passes the string value as the pass's configuration parameter.

Exceptions to the pair pattern:

Region            Anomaly
──────            ───────
Slots 160--162    Three consecutive STRING slots with a single boolean at 163
Slots 191--193    Slot 191 STRING, then two consecutive booleans at 192--193
Slot 181          STRING_PTR type instead of normal STRING
Slots 196--207    Alternating STRING + INTEGER instead of STRING + BOOL

Helper Functions

sub_12D6170 -- PassOptionRegistry::lookupOption

Looks up an option by its 1-based slot index in the hash table at registry+120. Returns a pointer to an OptionNode or 0 if the option was not set from the command line:

// Signature: int64 sub_12D6170(void* registry, int option_index)
// Returns: OptionNode* or 0
//
// OptionNode layout:
//   +40   int16   flags
//   +48   char**  value_array_ptr (array of string values)
//   +56   int     value_count

The hash table uses open addressing. The lookup computes hash(option_index) and probes linearly. When an option is not present in the registry (meaning the user did not supply a CLI override), the caller falls back to the hard-coded default in sub_12D6300.
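For readers unfamiliar with the technique, open addressing with linear probing keyed by slot index can be sketched as follows. Everything here is illustrative: the table capacity, the hash function, and the reduced node layout are assumptions, since the real hash and the full OptionNode layout are not documented on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_CAPACITY 256u   /* hypothetical; must be a power of two */
#define EMPTY_KEY      0      /* slot indices are 1-based, so 0 marks an empty bucket */

struct option_node {          /* hypothetical, reduced to what the sketch needs */
    int         key;          /* 1-based option index, or EMPTY_KEY */
    const char *value;        /* string supplied on the command line */
};

struct registry_table {
    struct option_node nodes[TABLE_CAPACITY];
};

/* ASSUMPTION: placeholder multiplicative hash, not the one used by cicc. */
static uint32_t hash_index(int option_index) {
    return (uint32_t)option_index * 2654435761u;
}

/* Insert or overwrite; returns 0 on success, -1 if the table is full. */
static int registry_put(struct registry_table *t, int option_index, const char *value) {
    assert(option_index != EMPTY_KEY);
    uint32_t i = hash_index(option_index) & (TABLE_CAPACITY - 1);
    for (uint32_t probes = 0; probes < TABLE_CAPACITY; ++probes) {
        struct option_node *n = &t->nodes[i];
        if (n->key == EMPTY_KEY || n->key == option_index) {
            n->key = option_index;
            n->value = value;
            return 0;
        }
        i = (i + 1) & (TABLE_CAPACITY - 1);   /* linear probe to the next bucket */
    }
    return -1;
}

/* Returns the node, or NULL when the option was never set. */
static const struct option_node *registry_lookup(const struct registry_table *t,
                                                 int option_index) {
    uint32_t i = hash_index(option_index) & (TABLE_CAPACITY - 1);
    for (uint32_t probes = 0; probes < TABLE_CAPACITY; ++probes) {
        const struct option_node *n = &t->nodes[i];
        if (n->key == EMPTY_KEY)
            return NULL;                      /* empty bucket ends the probe chain */
        if (n->key == option_index)
            return n;
        i = (i + 1) & (TABLE_CAPACITY - 1);
    }
    return NULL;
}
```

A NULL result corresponds to the "return 0" case of sub_12D6170, where the caller falls back to its hard-coded default.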

sub_12D6240 -- PassOptionRegistry::getBoolOption

Resolves a boolean option with a default value. This is the critical function for all 100 boolean slots -- it performs a three-step resolution:

sub_12D6240(registry, option_index, default_string):
    1. Call sub_12D6170(registry, option_index)
    2. If found AND has value:
         lowercase the string via sub_16D2060
         result = (first_char == '1' || first_char == 't')  // matches "1", "true", and any other value starting with '1' or 't'
    3. If not found OR no value:
         result = (default_string[0] == '1')  // "0" -> false, "1" -> true
    Return: packed(bool_value:8, flags:32) in low 40 bits

The packing convention is significant: the boolean value occupies the low 8 bits and the flags occupy bits 8--39. Callers unpack with (result & 0xFF) for the boolean and (result >> 8) for the flags.
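The resolution rule and the packing convention can be expressed as a few small C helpers. This is a sketch with hypothetical function names. It treats "found AND has value" as "non-NULL, non-empty string", which is an assumption, and it lowercases only the first character because that is the only one the test inspects.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* `value` is the registry string, or NULL when the option was not supplied;
 * `default_string` is the hard-coded default, "0" or "1". */
static int parse_bool(const char *value, const char *default_string) {
    assert(default_string != NULL);
    if (value != NULL && value[0] != '\0') {
        int c = tolower((unsigned char)value[0]);
        return c == '1' || c == 't';
    }
    return default_string[0] == '1';
}

/* Boolean in bits 0-7, flags in bits 8-39, as described above. */
static uint64_t pack_bool_result(int bool_value, uint32_t flags) {
    return (uint64_t)(bool_value ? 1u : 0u) | ((uint64_t)flags << 8);
}

static uint8_t unpack_bool(uint64_t packed) {
    return (uint8_t)(packed & 0xFF);
}

static uint32_t unpack_flags(uint64_t packed) {
    return (uint32_t)(packed >> 8);
}
```

Note that a user-supplied value such as "false" or "off" resolves to 0 only because it does not start with '1' or 't'; the default string is consulted solely when no value is present.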

sub_1691920 -- PassDefTable::getPassDef

Resolves a 1-based pass index to its PassDef entry in a table with 64-byte stride:

// sub_1691920(table_ptr, pass_index):
//   return table_ptr[0] + (pass_index - 1) * 64
//
// PassDef layout (64 bytes):
//   +32   int     pass_id
//   +36   byte    has_overrides
//   +40   int16   override_index

The pass_id field is written into every option slot and later used by the pipeline assembler to map configuration back to the pass factory that should receive it.
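The stride arithmetic and the two field reads can be sketched in C as below. The helper names are hypothetical, and the sketch takes the base address of the first entry directly, whereas the decompiled code loads it from table_ptr[0]. Only the fields at +32 and +36 listed above are touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    PASSDEF_STRIDE        = 64,   /* bytes per PassDef entry */
    PASSDEF_PASS_ID       = 32,   /* int  pass_id */
    PASSDEF_HAS_OVERRIDES = 36    /* byte has_overrides */
};

/* `base` points at entry 1 of the table; pass_index is 1-based. */
static const uint8_t *get_pass_def(const uint8_t *base, int pass_index) {
    assert(pass_index >= 1);
    return base + (size_t)(pass_index - 1) * PASSDEF_STRIDE;
}

static int32_t pass_def_id(const uint8_t *def) {
    int32_t id;
    memcpy(&id, def + PASSDEF_PASS_ID, sizeof id);   /* alignment-safe read */
    return id;
}

static uint8_t pass_def_has_overrides(const uint8_t *def) {
    return def[PASSDEF_HAS_OVERRIDES];
}
```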

sub_16D2BB0 -- parseInt

Parses a string to a 64-bit integer. Used for the 6 integer-typed option slots (9, 197, 203, 205, 207, 215).

Default Values

Most boolean slots default to 0 (disabled). 14 slots default to 1 (enabled) -- these represent passes that run by default and must be explicitly disabled:

Confidence note: Pass associations marked [MEDIUM] are inferred from pipeline guard cross-references (a4[offset]). Associations marked [LOW] are based solely on offset proximity or default-value patterns.

Slot  Offset  Likely Pass                     Confidence
────  ──────  ───────────                     ──────────
19    400     Inliner (AlwaysInliner gate)    MEDIUM
25    520     NVIDIA-specific pass A          LOW
93    1880    ConstantMerge                   HIGH
95    1920    NVVMIntrinsicLowering           HIGH
117   2360    NVVMUnreachableBlockElim        HIGH
141   2840    ADCE                            HIGH
143   2880    LICM                            HIGH
151   3040    CorrelatedValuePropagation      MEDIUM
155   3120    MemorySpaceOpt (second pass)    MEDIUM
157   3160    PrintModulePass (dump mode)     HIGH
159   3200    Optimization-level gating       MEDIUM
165   3328    Late-pipeline enable block      LOW
211   4264    (inline bool, late pass)        LOW
219   4424    (compact bool, late pass)       LOW

Integer slot defaults:

Slot  Offset  Default  Likely Meaning
────  ──────  ───────  ──────────────
9     200     1        Optimization threshold / iteration count
197   3984    20       Limit/threshold (e.g., unroll count)
203   4104    -1       Thread count (sentinel for auto-detect via get_nprocs())
205   4144    -1       Thread count fallback
207   4184    -1       Sentinel for unlimited/auto
215   4344    0        Disabled counter

CLI Flag Routing

The path from a user-visible flag to an NVVMPassOptions slot traverses four stages:

nvcc -Xcicc -opt "-do-licm=0"          ← user invocation
    │
    ▼
sub_9624D0 (flag catalog, 75KB)        ← parses -opt flags into opt_argv vector
    │   pushes "-do-licm=0" into v327 (opt vector)
    ▼
PassOptionRegistry (hash table)         ← opt-phase parser populates registry
    │   key = slot_index, value = "0"
    ▼
sub_12D6300 (125KB initializer)         ← flattens registry into 4512-byte struct
    │   sub_12D6240(registry, LICM_SLOT, "1") → returns 0 (overridden)
    │   writes opts[2880] = 0
    ▼
sub_12E54A0 / sub_12DE8F0              ← pipeline assembler reads opts[2880]
    if (opts[2880]) AddPass(LICM);     ← skipped because opts[2880] == 0

The -opt flag prefix is critical: it routes the argument to the optimizer phase vector rather than to the linker, LTO, or codegen phases. The flag catalog (sub_9624D0) recognizes several shorthand patterns:

User Flag              Routes To                                             Effect
─────────              ─────────                                             ──────
--emit-optix-ir        opt "-do-ip-msp=0", opt "-do-licm=0"                  Disables IPMSP and LICM for OptiX
-Ofast-compile=max     opt "-fast-compile=max", opt "-memory-space-opt=0"    Disables MemorySpaceOpt
-memory-space-opt=0    opt "-memory-space-opt=0"                             Direct pass disable
-Xopt "-do-remat=0"    opt "-do-remat=0"                                     Direct pass-through to opt phase

Pipeline Consumer: How Passes Read NVVMPassOptions

The pipeline assembler and its sub-pipeline builders receive the NVVMPassOptions struct as parameter a4 (in sub_12E54A0) or opts (in sub_12DE330/sub_12DE8F0). They read individual boolean slots by dereferencing a byte at a known offset and branching:

// Pattern 1: simple disable guard
if (!*(uint8_t*)(opts + 1760))           // opts[1760] = MemorySpaceOpt disable
    AddPass(PM, sub_1C8E680(0), 1, 0);  // insert MemorySpaceOpt

// Pattern 2: enable guard (inverted logic)
if (*(uint8_t*)(opts + 2880))            // opts[2880] = LICM enabled (default=1)
    AddPass(PM, sub_195E880(0), 1, 0);  // insert LICM

// Pattern 3: combined guard with opt-level gating
if (*(uint8_t*)(opts + 3200) &&          // opts[3200] = opt-level sufficient
    !*(uint8_t*)(opts + 880))            // opts[880] = NVVMReflect not disabled
    AddPass(PM, sub_1857160(), 1, 0);   // insert NVVMReflect

// Pattern 4: integer parameter read
v12 = *(int32_t*)(opts + 200);           // opts[200] = opt threshold (default=1)
// used to configure codegen dispatch in sub_12DFE00

The key insight is that the pipeline assembler never performs string comparison or hash-table lookup at pass-insertion time -- it reads pre-resolved values from the flat struct. Each of the ~150 pass-insertion decisions in sub_12E54A0 therefore costs only a byte load and a branch.
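The guard patterns can be exercised against a plain byte buffer standing in for the struct. The sketch below uses the offsets quoted in the patterns above (1760, 2880, 200); the helper names are hypothetical, and the real assembler reads these bytes inline rather than through functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    OPTS_SIZE         = 4512,
    OFF_MSP_DISABLE   = 1760,   /* disable flag: pass runs when the byte is 0 */
    OFF_LICM_ENABLE   = 2880,   /* enable flag: pass runs when the byte is nonzero */
    OFF_OPT_THRESHOLD = 200     /* integer slot */
};

static int read_flag(const uint8_t *opts, size_t offset) {
    assert(offset < OPTS_SIZE);
    return opts[offset] != 0;
}

static int32_t read_int(const uint8_t *opts, size_t offset) {
    assert(offset + sizeof(int32_t) <= OPTS_SIZE);
    int32_t v;
    memcpy(&v, opts + offset, sizeof v);   /* alignment-safe 32-bit read */
    return v;
}

/* Pattern 1: disable guard, the pass is added when the flag is clear. */
static int should_add_memory_space_opt(const uint8_t *opts) {
    return !read_flag(opts, OFF_MSP_DISABLE);
}

/* Pattern 2: enable guard, the pass is added when the flag is set. */
static int should_add_licm(const uint8_t *opts) {
    return read_flag(opts, OFF_LICM_ENABLE);
}
```

The two guard senses are easy to confuse: zeroing the whole buffer enables every pass behind a disable guard and suppresses every pass behind an enable guard.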

Offset-to-Pass Mapping

The following table maps struct offsets (as seen in pipeline assembler guards opts[OFFSET]) to the passes they control. Offsets are byte offsets from the struct base. "Guard sense" indicates whether the pass runs when the byte is 0 (!opts[X] -- most common, where the option is a disable flag) or when it is nonzero (opts[X] -- the option is an enable flag).

Offset  Slot     Guard Sense  Controlled Pass                                        Factory
──────  ────     ───────────  ───────────────                                        ───────
200     9        value        Optimization threshold (integer, read by sub_12DFE00)  --
280     14-15    !opts        DCE (DeadCodeElimination)                              sub_18DEFF0
320     16-17    !opts        TailCallElim / JumpThreading                           sub_1833EB0
360     18-19    !opts        NVVMLateOpt                                            sub_1C46000
400     20-21    !opts        AlwaysInliner gate A                                   sub_1C4B6F0
440     22-23    !opts        AlwaysInliner gate B                                   sub_1C4B6F0
480     24-25    !opts        Inliner gate C                                         sub_1C4B6F0
520     26-27    !opts        NVIDIA-specific pass A                                 sub_1AAC510
560     28-29    !opts        NVIDIA-specific pass B                                 sub_1AAC510
600     30-31    !opts        NVVMVerifier                                           sub_12D4560
680     34-35    !opts        FunctionAttrs                                          sub_1841180
720     36-37    !opts        SCCP                                                   sub_1842BC0
760     38-39    !opts        DSE (DeadStoreElimination)                             sub_18F5480
880     44-45    !opts        NVVMReflect                                            sub_1857160
920     46-47    !opts        IPConstantPropagation                                  sub_185D600
960     48-49    !opts        SimplifyCFG                                            sub_190BB10
1000    50-51    !opts        InstCombine                                            sub_19401A0
1040    52-53    !opts        Sink / SimplifyCFG (early)                             sub_1869C50
1080    54-55    !opts        PrintModulePass (dump IR)                              sub_17060B0
1120    56-57    !opts        NVVMPredicateOpt                                       sub_18A3430
1160    58-59    !opts        LoopIndexSplit                                         sub_1952F90
1200    60-61    !opts        SimplifyCFG (tier guard)                               sub_190BB10
1240    62-63    !opts        LICM                                                   sub_195E880
1280    64-65    !opts        Reassociate / Sinking                                  sub_1B7FDF0
1320    66-67    !opts        ADCE (AggressiveDeadCodeElimination)                   sub_1C76260
1360    68-69    !opts        LoopUnroll                                             sub_19C1680
1400    70-71    !opts        SROA                                                   sub_1968390
1440    72-73    !opts        EarlyCSE                                               sub_196A2B0
1480    74-75    !opts        ADCE extra guard                                       sub_1C76260
1520    76-77    !opts        LoopSimplify                                           sub_198DF00
1640    82-83    !opts        NVVMWarpShuffle                                        sub_1C7F370
1680    84-85    !opts        NVIDIA pass (early)                                    sub_19CE990
1760    88-89    !opts        MemorySpaceOpt (primary)                               sub_1C8E680
1840    92-93    !opts        ADCE variant                                           sub_1C6FCA0
1960    98-99    !opts        ConstantMerge / GlobalDCE                              sub_184CD60
2000    100-101  !opts        NVVMIntrinsicLowering                                  sub_1CB4E40
2040    102-103  !opts        MemCpyOpt                                              sub_1B26330
2080    104-105  !opts        BranchDist gate A                                      sub_1CB73C0
2120    106-107  !opts        BranchDist gate B                                      sub_1CB73C0
2160    108-109  !opts        NVVMPredicateOpt variant                               sub_18A3090
2200    110-111  !opts        GenericToNVVM                                          sub_1A02540
2240    112-113  !opts        NVVMLowerAlloca gate A                                 sub_1CBC480
2280    114-115  !opts        NVVMLowerAlloca gate B                                 sub_1CBC480
2320    116-117  !opts        NVVMRematerialization                                  sub_1A13320
2360    118-119  !opts        NVVMUnreachableBlockElim                               sub_1CC3990
2400    120-121  !opts        NVVMReduction                                          sub_1CC5E00
2440    122-123  !opts        NVVMSinking2                                           sub_1CC60B0
2560    128-129  !opts        NVVMGenericAddrOpt                                     sub_1CC71E0
2600    130-131  !opts        NVVMIRVerification                                     sub_1A223D0
2640    132-133  !opts        LoopOpt / BarrierOpt                                   sub_18B1DE0
2680    134-135  !opts        MemorySpaceOpt (second invocation)                     sub_1C8E680
2720    136-137  !opts        InstructionSimplify                                    sub_1A7A9F0
2760    138-139  !opts        LoopUnswitch variant                                   sub_19B73C0
2840    141      opts         ADCE (enabled by default, slot 141, default=1)         sub_1C6FCA0
2880    143      opts         LICM (enabled by default, slot 143, default=1)         sub_195E880
2920    145      value        LowerBarriers parameter                                sub_1C98270
3000    150-151  opts         Early pass guard                                       sub_18FD350
3040    151      opts         CorrelatedValuePropagation (default=1)                 sub_18EEA90
3080    153      opts         NVIDIA-specific loop pass                              sub_1922F90
3120    155      opts         MemorySpaceOpt second-pass enable (default=1)          sub_1C8E680
3160    157      opts         PrintModulePass enable (default=1)                     sub_17060B0
3200    159      opts         Optimization-level gate (default=1)                    --
3328    165      opts         Late-pipeline enable block (default=1)                 multiple
3488    174-175  opts         NVVMBarrierAnalysis + LowerBarriers enable             sub_18E4A00
3648    181      string       Language string ("ptx"/"mid")                          path dispatch
3704    185      opts         Late optimization flag                                 sub_1C8A4D0
3904    193      opts         Debug / verification mode                              sub_12D3E60
3944    195      opts         Basic block naming ("F%d_B%d")                         sprintf
3984    197      value        Integer limit (default=20)                             --
4064    201      value        Concurrent compilation override                        sub_12D4250
4104    203      value        Thread count (default=-1, auto-detect)                 sub_12E7E70
4144    205      value        Thread count fallback (default=-1)                     sub_12E7E70
4184    207      value        Integer parameter (default=-1)                         --
4224    209      opts         Optimization enabled flag                              tier dispatch
4304    213      opts         Device-code flag                                       Pipeline B
4344    215      value        Integer counter (default=0)                            --
4384    217      opts         Fast-compile bypass flag                               Pipeline B dispatch
4464    221      !opts        Late CFG cleanup guard                                 sub_1654860

Known Option Names

Option names are stored in the PassOptionRegistry hash table, not in sub_12D6300 itself. The following names are extracted from binary string references in global constructors and pass factories:

Boolean Toggles (do-X / no-X)

Name                                     Likely Slot Region                Default
────                                     ──────────────────                ───────
do-ip-msp                                MemorySpaceOpt area               enabled
do-clone-for-ip-msp                      MemorySpaceOpt variant            --
do-licm                                  offset 2880 (slot 143)            1 (enabled)
do-remat                                 offset 2320 (slot 117)            enabled
do-cssa                                  CSSA pass area                    --
do-scev-cgp                              SCEV-CGP area                     --
do-function-scev-cgp                     function-level SCEV-CGP           --
do-scev-cgp-aggresively [sic]            aggressive SCEV-CGP mode          --
do-base-address-strength-reduce          BaseAddrSR area                   --
do-base-address-strength-reduce-chain    BaseAddrSR chain variant          --
do-comdat-renaming                       COMDAT pass                       --
do-counter-promotion                     PGO counter promotion             --
do-lsr-64-bit                            64-bit loop strength reduction    --
do-sign-ext-expand                       sign extension expansion          --
do-sign-ext-simplify                     sign extension simplification     --

Dump/Debug Toggles

Name                                                            Purpose
────                                                            ───────
dump-ip-msp                                                     Dump IR around MemorySpaceOpt
dump-ir-before-memory-space-opt                                 IR dump pre-MSP
dump-ir-after-memory-space-opt                                  IR dump post-MSP
dump-memory-space-warnings                                      MSP diagnostic warnings
dump-remat / dump-remat-add / dump-remat-iv / dump-remat-load   Rematerialization diagnostics
dump-branch-dist                                                Branch distribution diagnostics
dump-scev-cgp                                                   SCEV-CGP diagnostics
dump-base-address-strength-reduce                               BaseAddrSR diagnostics
dump-sink2                                                      Sinking2 diagnostics
dump-before-cssa                                                CSSA input dump
dump-phi-remove                                                 PHI removal diagnostics
dump-normalize-gep                                              GEP normalization dump
dump-simplify-live-out                                          Live-out simplification dump
dump-process-restrict                                           Process-restrict dump
dump-process-builtin-assume                                     Builtin assume processing dump
dump-conv-dot / dump-conv-func / dump-conv-text                 Convergence analysis dumps
dump-nvvmir                                                     NVVM IR dump
dump-va                                                         Value analysis dump

Parametric Knobs

Name                                     Default  Purpose
────                                     ───────  ───────
remat-for-occ                            120      Occupancy target for rematerialization
remat-gep-cost                           6000     GEP rematerialization cost threshold
remat-lli-factor                         10       Long-latency instruction factor
remat-max-live-limit                     10       Maximum live range limit for remat
remat-single-cost-limit                  --       Single-instruction remat cost limit
remat-loop-trip                          --       Loop trip count for remat decisions
remat-use-limit                          --       Use count limit for remat candidates
remat-maxreg-ceiling                     --       Register ceiling for remat
remat-move                               --       Remat move control
remat-load-param                         --       Parameter load remat control
remat-ignore-single-cost                 --       Ignore single-cost heuristic
branch-dist-block-limit                  -1       Max blocks for branch distribution (-1 = unlimited)
branch-dist-func-limit                   -1       Max functions for branch distribution
branch-dist-norm                         0        Branch distribution normalization mode
scev-cgp-control                         --       SCEV-CGP mode selector
scev-cgp-norm                            --       SCEV-CGP normalization
scev-cgp-check-latency                   --       Latency check threshold
scev-cgp-cross-block-limit               --       Cross-block limit
scev-cgp-idom-level-limit                --       Immediate dominator level limit
scev-cgp-inst-limit                      --       Instruction count limit
scev-cgp-old-base                        --       Old base address mode
scev-cgp-tid-max-value                   --       Thread ID max value
base-address-strength-reduce-iv-limit    --       IV limit for base addr SR
base-address-strength-reduce-max-iv      --       Max IV count
cssa-coalesce                            --       CSSA coalescing mode
cssa-verbosity                           --       CSSA diagnostic verbosity
memory-space-opt-pass                    --       MSP pass variant selector
peephole-opt                             --       Peephole optimizer control
loop-index-split                         --       Loop index split control
va-use-scdg                              --       Value analysis SCDG mode
nvvm-peephole-optimizer                  --       NVVM peephole enable
nvvm-intr-range                          --       Intrinsic range analysis control

Differences from Upstream LLVM

Upstream LLVM has nothing resembling this system. The closest analogue is the cl::opt<T> flag mechanism, but that scatters configuration across hundreds of global variables that each pass reads independently. The differences are architectural:

Aspect              Upstream LLVM                                  cicc NVVMPassOptions
──────              ─────────────                                  ────────────────────
Storage             ~1,689 scattered cl::opt globals in BSS        Single 4,512-byte contiguous struct
Initialization      Global constructors register each flag         One 125KB function flattens all 221 slots
Access pattern      Each pass reads its own globals                Pipeline assembler reads all slots centrally
Copyability         Not designed for copying                       Struct is trivially memcpy-able for Phase I/II
Thread safety       Global cl::opt requires careful coordination   Each thread gets its own struct copy
Override mechanism  cl::opt command-line parser                    PassOptionRegistry hash table with fallback defaults
Pass gating         Pass decides internally whether to run         Pipeline assembler decides before constructing pass

The thread-safety property is crucial for the two-phase concurrent compilation model. When Phase II runs per-function compilation in parallel threads, each thread receives a copy of the NVVMPassOptions struct. If NVIDIA used upstream cl::opt globals for pass configuration, they would need global locks or TLS for every option read during pass execution -- an unacceptable overhead for a GPU compiler that may process hundreds of kernels in a single translation unit.
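The per-thread copy amounts to duplicating one fixed-size block. The sketch below assumes a plain malloc plus memcpy, which is consistent with the struct being described as memcpy-able but is not confirmed as the mechanism cicc uses; the function name is hypothetical. The copy is shallow, so string slots in every copy still point at the same string data.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { NVVM_PASS_OPTIONS_SIZE = 4512 };

/* Returns a heap copy that the caller must free(), or NULL on allocation
 * failure. ASSUMPTION: a flat byte copy is sufficient because readers only
 * consume pre-resolved values and never mutate the shared string data. */
static uint8_t *clone_pass_options(const uint8_t *src) {
    assert(src != NULL);
    uint8_t *copy = malloc(NVVM_PASS_OPTIONS_SIZE);
    if (copy != NULL)
        memcpy(copy, src, NVVM_PASS_OPTIONS_SIZE);
    return copy;
}
```

A worker that flips a byte in its own copy, for example the LICM enable byte at offset 2880, leaves the original block and every other worker's copy unchanged.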

Interaction with Two-Phase Compilation

The NVVMPassOptions struct is allocated and populated before Phase I begins, in the orchestrator sub_12E7E70:

// sub_12E7E70, line ~128
void* opts = malloc(4512);              // allocate NVVMPassOptions (sub_22077B0(4512) in the binary)
sub_12D6300(opts, registry);            // populate from CLI-parsed registry
// ... pass opts to sub_12E54A0 for Phase I ...
// ... pass same opts to sub_12E54A0 for Phase II ...

Both phases receive the same opts pointer. Individual passes within the pipeline assembler check qword_4FBB3B0 (the TLS phase counter) to skip themselves in the wrong phase -- but the NVVMPassOptions struct itself does not change between phases. This means a pass cannot be enabled in Phase I but disabled in Phase II through NVVMPassOptions alone; phase selection is handled by the separate TLS mechanism.

The second caller, sub_12F4060 (TargetMachine creation in the standalone path), performs an identical allocation and initialization sequence, confirming that every compilation path goes through the same NVVMPassOptions infrastructure.

Function Map

Function                            Address       Size    Role
────────                            ───────       ────    ────
NVVMPassOptions::init               sub_12D6300   125KB   Populate 221 slots from registry
PassOptionRegistry::lookupOption    sub_12D6170   ~200B   Hash-table lookup by slot index
PassOptionRegistry::getBoolOption   sub_12D6240   ~300B   Boolean resolution with default
writeStringOption                   sub_12D6090   ~150B   Write 24-byte string slot
writeBoolOption                     sub_12D6100   ~120B   Write 16-byte boolean slot
PassDefTable::getPassDef            sub_1691920   ~80B    64-byte stride table lookup
parseInt                            sub_16D2BB0   ~100B   String-to-int64 parser
toLowercase                         sub_16D2060   ~80B    String lowercasing for bool parse

Cross-References