
LLVM Optimizer

NVIDIA's LLVM optimizer in cicc v13.0 is not a straightforward invocation of the upstream LLVM opt pipeline. Instead, it implements a proprietary two-phase compilation model where the same 49.8KB pipeline assembly function (sub_12E54A0) is called twice with different phase counters, allowing analysis passes to run in Phase I and codegen-oriented passes in Phase II. Individual passes read a TLS variable (qword_4FBB3B0) to determine which phase is active and skip themselves accordingly.

The optimizer also supports concurrent per-function compilation: after Phase I completes on the whole module, Phase II can be parallelized across functions using a thread pool sized to get_nprocs() or a GNU Jobserver token count. This is a significant departure from upstream LLVM, which processes functions sequentially within a single pass manager invocation.

The entire optimization behavior is controlled by the NVVMPassOptions system — a 4,512-byte struct with 221 option slots (114 string + 100 boolean + 6 integer + 1 string-pointer) that provides per-pass enable/disable toggles and parametric knobs. This system is completely proprietary and has no upstream equivalent.

Address range 0x12D0000–0x16FFFFF (~4.2 MB of code).

| Component | Function | Notes |
|---|---|---|
| Pipeline assembler | sub_12E54A0 | 49.8KB, 1,553 lines, ~150 pass insertions |
| Phase orchestrator | sub_12E7E70 | 9.4KB, Phase I / Phase II |
| Concurrent entry | sub_12E1EF0 | 51.3KB, jobserver + split-module + thread pool |
| PassOptions init | sub_12D6300 | 125KB, 4,786 lines, 221 option slots |
| New PM registration | sub_2342890 | 2,816 lines, 35 NVIDIA + ~350 LLVM passes |
| Target creation | sub_12EA530 | 4.1KB, "nvptx" / "nvptx64" |
| AddPass | sub_12DE0B0 | 3.5KB, hash-table-based pass insertion |
| Tier 0 sub-pipeline | sub_12DE330 | 4.8KB, ~40 passes |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 | 17.9KB, phase-conditional |
| Codegen dispatch | sub_12DFE00 | 20.7KB |
| LTO pipeline | sub_12F5F30 | 37.8KB, dead kernel elimination |
| jemalloc | 5.3.x statically linked | ~400 functions at 0x12FC000 |

Architecture

sub_12E1EF0 (51KB, concurrent compilation entry)
  │
  ├─ GNU Jobserver init (sub_16832F0, --jobserver-auth=R,W from MAKEFLAGS)
  ├─ Bitcode reading + verification (sub_153BF40)
  ├─ Function sorting by priority (sub_12E0CA0)
  ├─ Thread pool creation (sub_16D4AB0, min(requested, num_functions) threads)
  │
  └─ sub_12E7E70 (9.4KB, two-phase orchestrator)
       │
       ├─ Phase I: qword_4FBB3B0 = 1
       │    └─ sub_12E54A0 (whole-module analysis + early optimization)
       │
       ├─ Concurrency check: sub_12D4250 (>1 defined function?)
       │    ├─ Yes, threads>1 → per-function Phase II via thread pool
       │    │    └─ sub_12E86C0 per function (qword_4FBB3B0 = 2)
       │    └─ No → sequential Phase II
       │         └─ sub_12E54A0 (qword_4FBB3B0 = 2)
       │
       └─ qword_4FBB3B0 = 3 (done)

sub_12E54A0 (49.8KB, MASTER PIPELINE ASSEMBLY)
  │
  ├─ Top branch: a4[4384] → Pipeline B (fast/codegen-only)
  │                    else → Pipeline A (normal LLVM)
  │
  ├─ Target machine setup
  │    ├─ Triple: "nvptx64" or "nvptx" (based on pointer size)
  │    ├─ sub_16D3AC0 → TargetRegistry::lookupTarget()
  │    ├─ TargetOptions: PIC=1, CodeModel=8, OptLevel=1, ThreadModel=1
  │    └─ DataLayout from qword_4FBB430
  │
  ├─ Phase 0: Infrastructure (TLI, TTI, Verifier, AssumptionCache, ProfileSummary)
  ├─ Phase 1: Language dispatch (a4[3648]: "ptx"/"mid"/default)
  ├─ Phase 2: Pre-optimization passes
  ├─ Phase 3: Main optimization loop (tier threshold dispatch)
  ├─ Phase 4: Post-opt language-specific pipelines
  ├─ Phase 5: Finalization (NVVMLowerBarriers, BreakCriticalEdges, codegen)
  ├─ Phase 6: Phase 2 codegen check (qword_4FBB3B0 == 2)
  ├─ Phase 7: PassManager::run
  └─ Phase 8: Basic block naming ("F%d_B%d" for debug)

Two-Phase Compilation — sub_12E7E70

| Field | Value |
|---|---|
| Address | 0x12E7E70 |
| Size | 9.4KB |
| Strings | "Phase I", "Phase II", "Concurrent=Yes/No" |

The two-phase model exists because certain optimization passes (e.g., inter-procedural memory space propagation, global inlining decisions) require whole-module visibility, while others (register pressure-driven rematerialization, instruction scheduling) operate per-function and benefit from parallelization. Phase I runs the whole-module analysis and early optimization passes; Phase II runs the per-function backend-oriented passes.

Both phases call the same sub_12E54A0. The difference: qword_4FBB3B0 (TLS variable) is set to 1 or 2 before each call. Individual passes read this counter and skip themselves if the current phase doesn't match their intended execution phase. When the module contains only a single defined function, the phase mechanism is bypassed entirely — a single unphased call handles everything.

Phase State Machine:

  START → [phase=1] → sub_12E54A0 (Phase I)
    │
    error? → RETURN
    │
    count_functions()
    ├─ 1 func → [phase=2] → sub_12E54A0 → [phase=3] → DONE
    ├─ N funcs, threads>1 → per-function Phase II (thread pool) → [phase=3] → DONE
    └─ N funcs, threads≤1 → [phase=2] → sub_12E54A0 → [phase=3] → DONE


GNU Jobserver Integration

When cicc is invoked from a parallel make -jN build, it can participate in the GNU Jobserver protocol to limit its own thread count to the available parallelism tokens. This prevents oversubscription — without it, a -j16 build could spawn 16 cicc processes each creating their own thread pool, resulting in hundreds of threads competing for CPU time. The jobserver reads the --jobserver-auth=R,W pipe file descriptors from the MAKEFLAGS environment variable.

In sub_12E1EF0 (lines 833–866), when a4+3288 is set:

v184 = sub_16832F0(&state, 0);   // parse MAKEFLAGS for --jobserver-auth=R,W
if (v184 == 5 || v184 == 6)      // pipe issues
    warning("jobserver pipe problem");
else if (v184 != 0)
    fatal("GNU Jobserver support requested, but an error occurred");

sub_16832F0 allocates a 296-byte state structure, parses MAKEFLAGS, creates a pipe for token management, and spawns a pthread to manage tokens. The token count throttles concurrent per-function compilations to match the build's -j level.

Split-Module Compilation

Split-module compilation is NVIDIA's mechanism for the -split-compile=N flag. It decomposes a multi-function module into individual per-function bitcode blobs, compiles each independently (potentially in parallel), then re-links the results. This trades away inter-procedural optimization opportunities for compilation speed and reduced peak memory usage — a worthwhile tradeoff for large CUDA kernels during development iteration.

When optimization level (a4+4104) is negative, enters split-module mode:

  1. Each function's bitcode is extracted via sub_1AB9F40 with filter callback sub_12D4BD0
  2. Module name: "<split-module>" (14 chars)
  3. After thread pool completes, split modules are re-linked via sub_12F5610
  4. Linkage attributes restored from hash table (external linkage types: bits 0–5, dso_local: bit 6 of byte+33)

Pipeline Assembly — sub_12E54A0

The pipeline assembly function is the heart of the optimizer. At 49.8KB with ~150 AddPass calls, it constructs the complete LLVM pass pipeline at runtime rather than using a static pipeline description. The function first sets up target machine infrastructure (triple, data layout, subtarget features), then dispatches into one of three language-specific paths that determine which passes run and in what order. After the language-specific path completes, a shared finalization phase runs barriers, critical edge breaking, and codegen preparation.

A distinguishing feature of NVIDIA's pipeline is the tier system: passes are organized into Tiers 0–3, each gated by a threshold counter. As compilation progresses through the main loop (which iterates over external plugin/extension pass entries), tiers fire when the accumulated pass count exceeds their threshold. This allows NVIDIA to precisely control where in the pipeline their custom passes interleave with standard LLVM passes.

Language-Specific Paths

The pipeline branches based on a4[3648] (language string). The three paths represent different optimization strategies for different IR maturity levels:

| String | Path | Pass Count | Key Difference |
|---|---|---|---|
| "ptx" | Path A | ~15 | Light: NVVMPeephole → LLVM standard → DCE → MemorySpaceOpt |
| "mid" | Path B | ~45 | Full: SROA → GVN → LICM → LoopIndexSplit → Remat → all NVIDIA passes |
| (default) | Path C | ~40 | General: 4 LLVM standard passes + NVIDIA interleaving |

Tier System

The main loop iterates over entries at a4[4488] (16-byte stride: vtable + phase_id):

if (opt_enabled && phase_id > opt_threshold) → sub_12DE330  // Tier 0 (full)
if (tier1_flag && phase_id > tier1_threshold) → sub_12DE8F0(1) // Tier 1
if (tier2_flag && phase_id > tier2_threshold) → sub_12DE8F0(2) // Tier 2
if (tier3_flag && phase_id > tier3_threshold) → sub_12DE8F0(3) // Tier 3

Each tier fires once (flag cleared after execution). Remaining tiers fire unconditionally after the loop.

Tier 0 — Full Optimization (sub_12DE330)

Tier 0 is the most aggressive optimization sub-pipeline. It runs ~40 passes in a carefully ordered sequence that interleaves standard LLVM passes with NVIDIA-specific ones. The ordering reveals NVIDIA's optimization strategy: start with GVN and SCCP for value simplification, then run NVIDIA's custom NVVMReflect and NVVMVerifier to clean up NVVM-specific constructs, followed by aggressive loop transformations (LoopIndexSplit, LoopUnroll, LoopUnswitch), and finally register-pressure-sensitive passes (Rematerialization, DSE, DCE) to prepare for codegen.

~40 passes in order:

Confidence note: Pass identifications are based on diagnostic strings, factory signatures, and pipeline ordering. Most are HIGH confidence. Entries with [MEDIUM confidence] are inferred from code structure rather than direct string evidence.

| # | Factory | Likely Pass | Guarded By |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | |
| 2 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 3 | sub_1B26330 | MemCpyOpt | |
| 4 | sub_185D600 | IPConstantPropagation | |
| 5 | sub_1C6E800 | GVN | |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | |
| 7 | sub_1857160 | NVVMReflect | |
| 8 | sub_1842BC0 | SCCP | |
| 9 | sub_12D4560 | NVVMVerifier | |
| 10 | sub_18A3090 | NVVMPredicateOpt | |
| 11 | sub_184CD60 | ConstantMerge | |
| 12 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] | !opts[1040] |
| 13 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] | |
| 14 | sub_1952F90(-1) | LoopIndexSplit | |
| 15 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 16 | sub_1A223D0 | NVVMIRVerification | |
| 17 | sub_1A7A9F0 | InstructionSimplify | |
| 18 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 19 | sub_1A02540 | GenericToNVVM | |
| 20 | sub_198DF00(-1) | LoopSimplify | |
| 21 | sub_1C76260 | ADCE | !opts[1320] |
| 22 | sub_195E880(0) | LICM | opts[2880] |
| 23 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 24 | sub_19401A0 | InstCombine | |
| 25 | sub_1968390 | SROA | |
| 26 | sub_196A2B0 | EarlyCSE | |
| 27 | sub_19B73C0(2,...) | LoopUnswitch | |
| 28 | sub_190BB10(0,0) | SimplifyCFG | |
| 29 | sub_1A13320 | NVVMRematerialization | |
| 30 | sub_18F5480 | DSE | |
| 31 | sub_18DEFF0 | DCE | |
| 32 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | |
| 33 | sub_18B1DE0 | NVVMLoopPass [MEDIUM confidence] | |
| 34 | sub_1841180 | FunctionAttrs | |

"mid" Path — Complete Pass Ordering

The "mid" path is the primary optimization pipeline for standard CUDA compilation. At ~45 passes, it is the most comprehensive of the three paths. The key pattern is repeated interleaving of NVIDIA custom passes with standard LLVM passes: NVVMIntrinsicLowering runs 4 times at different points, NVVMReflect runs 3 times, and NVVMIRVerification runs after each major transformation to catch correctness regressions early. The MemorySpaceOpt pass appears once in this sequence (gated by !opts[1760]) — it runs again later via the parameterized <second-time> invocation in Tier 1/2/3.

ConstantMerge → NVVMIntrinsicLowering → MemCpyOpt → SROA → NVVMPeephole → NVVMAnnotations → LoopSimplify → GVN → NVVMIRVerification → SimplifyCFG → InstCombine → LLVM standard #5 → NVVMIntrinsicLowering → DeadArgElim → FunctionAttrs → DCE → ConstantMerge → LICM → NVVMLowerBarriers → MemorySpaceOpt → Reassociate → LLVM standard #8 → NVVMReflect → ADCE → InstructionSimplify → DeadArgElim → TailCallElim → DeadArgElim → CVP → Sink → SimplifyCFG → DSE → NVVMSinking2 → NVVMIRVerification → EarlyCSE → NVVMReflect → LLVM standard #8 → NVVMIntrinsicLowering → IPConstProp → LICM → NVVMIntrinsicLowering → NVVMBranchDist → NVVMRemat

NVVMPassOptions — sub_12D6300

NVVMPassOptions is NVIDIA's proprietary mechanism for fine-grained control over every optimization pass. Unlike LLVM's cl::opt system (which uses global command-line options), NVVMPassOptions stores per-pass configuration in a flat struct that is allocated once and passed through the pipeline by pointer. This design avoids the global-state problems of cl::opt and allows different compilation units to have different pass configurations within the same process — critical for the concurrent per-function compilation model.

The 125KB initialization function is the largest in the optimizer range. Its size comes from the sheer number of option slots: each of the 221 slots requires a hash-table lookup, a default-value resolution, and a type-specific store, with most slots organized in pairs (a string parameter + a boolean enable flag).

| Field | Value |
|---|---|
| Address | 0x12D6300 |
| Size | 125KB (4,786 lines) |
| Output struct | 4,512 bytes (allocated via sub_22077B0(4512)) |
| Slot count | 221 (indices 1–221) |
| Slot types | 114 string + 100 boolean + 6 integer + 1 string-pointer |

Struct Layout

| Region | Offset | Content |
|---|---|---|
| Header | 0–7 | int opt_level (from a2+112) |
| Registry ptr | 8–15 | Pointer to PassOptionRegistry |
| Slot pairs | 16–4479 | 221 option slots (string/bool/int pairs) |
| Sentinel | 4480–4511 | 4 qwords zeroed |

Option Slot Types

| Type | Size | Writer | Count |
|---|---|---|---|
| String | 24B | sub_12D6090 | 114 |
| Bool (compact) | 16B | sub_12D6100 | 83 |
| Bool (inline) | 16B | direct byte write | 17 |
| Integer | 16B | sub_16D2BB0 (parseInt) | 6 |
| String pointer | 28B | direct qword write (slot 181 only) | 1 |

Pair Organization

Slots are organized in pairs: even = string parameter (the pass's configuration value or name), odd = boolean enable/disable toggle (the do-X flag). This consistent pairing means each "pass knob" has both a parametric value and an on/off switch, allowing passes to be individually disabled without removing their configuration — useful for A/B testing optimizations.

Exceptions to the pair pattern: slots 160–162 (3 consecutive strings — a pass with 3 string parameters), slots 192–193 (2 consecutive bools — a pair of binary flags), slot 181 (the only string-pointer type, storing a char* + length directly — likely a file path or regex pattern).

Defaults Enabled (14 of 100 booleans)

Slots: 19, 25, 93, 95, 117, 141, 143, 151, 155, 157, 159, 165, 211, 219. These are passes that run by default and must be explicitly disabled.

Integer Defaults

| Slot | Default | Likely Purpose |
|---|---|---|
| 9 | 1 | Iteration count / threshold |
| 197 | 20 | Limit (e.g., unroll count) |
| 203 | -1 | Sentinel (unlimited/auto) |
| 205 | -1 | Sentinel |
| 207 | -1 | Sentinel |
| 215 | 0 | Disabled counter |

Known Option Names

Boolean toggles (do-X / no-X): do-ip-msp, do-licm, do-remat, do-clone-for-ip-msp, do-cssa, do-scev-cgp, do-function-scev-cgp, do-scev-cgp-aggresively, do-base-address-strength-reduce, do-base-address-strength-reduce-chain, do-comdat-renaming, do-counter-promotion, do-lsr-64-bit, do-sign-ext-expand, do-sign-ext-simplify

Parametric knobs: remat-for-occ, remat-gep-cost, remat-max-live-limit, remat-maxreg-ceiling, remat-move, remat-single-cost-limit, remat-use-limit, branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm, scev-cgp-check-latency, scev-cgp-control, scev-cgp-cross-block-limit, scev-cgp-idom-level-limit, scev-cgp-inst-limit, scev-cgp-norm, cssa-coalesce, cssa-verbosity, base-address-strength-reduce-iv-limit

Dump flags: dump-ip-msp, dump-remat, dump-branch-dist, dump-scev-cgp, dump-sink2, dump-before-cssa, dump-normalize-gep, dump-simplify-live-out

New PM Pass Registration — sub_2342890

NVIDIA maintains both the Legacy Pass Manager and the New Pass Manager in cicc v13.0. The New PM registration lives in a single 2,816-line function that registers every analysis, pass, and printer by calling sub_E41FB0(pm, class_name, len, pass_name, len) for each. Standard LLVM passes use the llvm:: prefix (stripped during registration), while NVIDIA custom passes use their own class names.

The registration function also handles parameterized pass parsing: when the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), it calls a registered parameter-parsing callback that returns a configured pass options struct. This is how MemorySpaceOpt can run twice with different configurations in the same pipeline.

NVIDIA Custom Passes (35 total)

Module passes (12): check-gep-index, check-kernel-functions, cnp-launch-check, ipmsp, nv-early-inliner, nv-inline-must, nvvm-pretreat, nvvm-verify, printf-lowering, select-kernels, lower-ops*, set-global-array-alignment*

Function passes (20): basic-dbe, branch-dist, byval-mem2reg, bypass-slow-division, normalize-gep, nvvm-reflect-pp, nvvm-peephole-optimizer, old-load-store-vectorizer, remat, propagate-alignment, reuse-local-memory, set-local-array-alignment, sinking2, d2ir-scalarizer, sink<rp-aware>, memory-space-opt*, lower-aggr-copies*, lower-struct-args*, process-restrict*

Loop pass (1): loop-index-split

Analyses (2): rpa (RegisterPressureAnalysis), merge-sets (MergeSetsAnalysis)

* = parameterized

Key Discoveries

  • nvvm-reflect-pp is actually SimplifyConstantConditionalsPass, not a reflection pass. It runs after NVVMReflect resolves __nvvm_reflect() calls to constants, cleaning up the resulting dead branches and unreachable code. The misleading name ("pp" = post-processing) obscures what is essentially a targeted dead-code-elimination pass.
  • memory-space-opt runs twice in the pipeline with different parameterizations: <first-time> early in optimization (conservative, uses available alias information) and <second-time> late (aggressive, benefits from earlier optimizations having simplified the IR). This two-pass approach is necessary because address space resolution depends on pointer analysis quality, which improves as other passes simplify the code.
  • d2ir-scalarizer reuses LLVM's ScalarizerPass class under a different name, suggesting NVIDIA added a custom registration point to control when scalarization happens in the NVPTX pipeline without modifying the upstream pass.
  • Legacy PM co-existence: both Legacy PM and New PM registrations exist for the same passes, with slightly different names (e.g., "memory-space-opt-pass" vs "memory-space-opt"). This dual registration is necessary during the LLVM Legacy→New PM migration — cicc v13.0 appears to be in the middle of this transition.

Key Global Variables

VariablePurpose
qword_4FBB3B0Phase counter TLS: 1=Phase I, 2=Phase II, 3=done
qword_4FBB370Feature flag register (value 6 = barrier opt + memspace opt)
qword_4FBB410Tier execution tracker
qword_4FBB430Optimization level store
qword_4FBB510Debug/trace verbosity level
byte_3F871B3NVIDIA global flag byte (empty/null string in .rodata)
byte_4F99740CUTLASS optimization enable flag

NVVMPassOptions Deep Dive

Memory Layout

The 4,512-byte NVVMPassOptions struct is allocated on the heap via sub_22077B0(4512) at the start of each compilation. The layout divides into four regions:

Offset 0x000 [8B]   : int32 opt_level (from config+112) + 4B padding
Offset 0x008 [8B]   : qword ptr to PassOptionRegistry (hash table source)
Offset 0x010 [4464B]: 221 option slots (indices 1–221)
Offset 0x1180 [32B] : 4 qwords zeroed (sentinel/trailer)

The slots start at offset 16 and are packed contiguously. Each slot occupies a fixed size depending on its type, but the stride varies: string options take 24 bytes, boolean options take 16 bytes, integer options take 16 bytes, and the single string-pointer option (slot 181) takes 28 bytes. The overall packing is not uniform-stride; the offset of each slot must be computed from the cumulative widths of all preceding slots.

Slot Type Formats

Five distinct slot types exist, each written by a dedicated helper:

// TYPE A: String option (114 instances)
// Written by sub_12D6090 (writeStringOption)
struct StringSlot {        // 24 bytes
    char*   value_ptr;     // +0: pointer to string value
    int32_t option_index;  // +8: 1-based slot index
    int32_t flags;         // +12: from PassDef byte+40
    int32_t opt_level;     // +16: optimization level context
    int32_t pass_id;       // +20: resolved via sub_1691920
};

// TYPE B: Boolean compact (83 instances)
// Written by sub_12D6100 (writeBoolOption)
struct BoolCompactSlot {   // 16 bytes
    uint8_t value;         // +0: 0 or 1
    uint8_t pad[3];        // +1: padding
    int32_t option_index;  // +4
    int32_t flags;         // +8
    int32_t pass_id;       // +12
};

// TYPE C: Boolean inline (17 instances)
// Written directly as byte + int32 fields
struct BoolInlineSlot {    // 16 bytes
    uint8_t value;         // +0: 0 or 1
    uint8_t pad[3];        // +1
    int32_t option_index;  // +4: from sub_12D6240 return hi32
    int32_t opt_level;     // +8
    int32_t pass_id;       // +12: resolved inline
};

// TYPE D: Integer (6 instances)
// Value parsed by sub_16D2BB0 (parseInt)
struct IntegerSlot {       // 16 bytes
    int32_t value;         // +0: parsed integer
    int32_t option_index;  // +4
    int32_t opt_level;     // +8
    int32_t pass_id;       // +12
};

// TYPE E: String pointer (1 instance, slot 181 only)
struct StringPtrSlot {     // 28 bytes
    char*   char_ptr;      // +0: raw string data pointer
    int64_t str_length;    // +8: length of string
    int32_t option_index;  // +16
    int32_t opt_level;     // +20
    int32_t pass_id;       // +24
};

Helper Function Chain

The initialization function sub_12D6300 populates the struct by iterating all 221 slot indices and calling a chain of helpers for each:

  1. sub_12D6170 (PassOptionRegistry::lookupOption) -- looks up a slot index in the hash table at registry+120. Returns a pointer to an OptionNode struct: [+40] int16 flags, [+48] qword* value_array_ptr, [+56] int value_count. Returns null if the option was not set on the command line.

  2. sub_12D6240 (getBoolOption) -- resolves a boolean option. Calls sub_12D6170 to find the option, then if a string value exists, lowercases it via sub_16D2060 and tests if the first char is '1' (0x31) or 't' (0x74). If the option was not found, defaults to true (enabled). Returns the boolean packed with the flags in the low 40 bits.

  3. sub_1691920 (PassDefTable::getPassDef) -- looks up a PassDef entry in a table where each entry is 64 bytes. Computes: table[0] + (index - 1) * 64. The PassDef at [+32] holds the pass_id, at [+36] a has_overrides flag, and at [+40] an override index.

Initial Slots (1-6): Global Configuration

The first six slots are all string types at a uniform 24-byte stride, starting at offset 16. They do not follow the pair pattern and represent global pipeline parameters rather than per-pass knobs:

| Slot | Offset | Likely Content |
|---|---|---|
| 1 | 16 | ftz (flush-to-zero mode string) |
| 2 | 40 | prec-div (precise division setting) |
| 3 | 64 | prec-sqrt (precise square root setting) |
| 4 | 88 | fmad (fused multiply-add policy) |
| 5 | 112 | opt-level (optimization level string) |
| 6 | 136 | sm-arch (target SM architecture string) |

CLI Interface

Users interact with NVVMPassOptions via the -opt flag, which appends key=value pairs to the PassOptionRegistry before sub_12D6300 flattens them:

cicc -opt "-do-ip-msp=0"            # disable memory space propagation
cicc -opt "-do-licm=0"              # disable LICM
cicc -opt "-remat-max-live-limit=50" # set rematerialization threshold
cicc -opt "-dump-remat"             # enable remat dump output

The registry is a hash table populated from these CLI strings. Each -opt argument is parsed into a key (the option name) and value (the string after =). When sub_12D6300 runs, it queries the registry for each of the 221 slot indices. If a CLI override exists, it takes precedence; otherwise the compiled-in default is used.

Option Anomalies

Several regions break the standard string/boolean pair pattern:

  • Slots 160-162: Three consecutive string slots with no interleaved boolean. [LOW confidence] This represents a pass (likely MemorySpaceOpt or the CSSA pass) that takes three string configuration parameters followed by a single boolean enable flag at slot 163. The pass identity is uncertain because neither MemorySpaceOpt nor CSSA has been confirmed to consume three string parameters; the association is based on pipeline position proximity only.
  • Slots 192-193: Two consecutive boolean slots. One is the main enable toggle; the other appears to be a sub-feature flag (both default to disabled).
  • Slot 181 (offset 3648): The only STRING_PTR type. Its default is byte_3F871B3 (an empty string in .rodata). The raw pointer + length storage suggests this holds a file path or regex pattern for pass filtering.
  • Slots 196-207: Alternating string + integer slots instead of string + boolean. [LOW confidence] This high-numbered region contains all six integer options, likely controlling late-pipeline passes with numeric thresholds (unroll counts, live-variable limits, iteration bounds). The specific pass-to-slot associations are unconfirmed; the "unroll counts, live-variable limits, iteration bounds" interpretation is based on typical LLVM integer-valued pass options, not direct evidence.

Complete Slot-to-Offset Map with Known Consumers

The following table maps NVVMPassOptions slot indices to struct byte offsets, types, defaults, and -- where the cross-reference to the pipeline assembler's a4[offset] guards could be established -- the consuming pass(es). Offsets marked with * are confirmed by cross-referencing a4[offset] guards in sub_12E54A0 and sub_12DE8F0.

| Slot | Offset | Type | Default | Known Knob Name | Consuming Pass |
|---|---|---|---|---|---|
| 1 | 16 | STRING | | ftz | Global: flush-to-zero mode |
| 2 | 40 | STRING | | prec-div | Global: precise division |
| 3 | 64 | STRING | | prec-sqrt | Global: precise sqrt |
| 4 | 88 | STRING | | fmad | Global: fused multiply-add |
| 5 | 112 | STRING | | opt-level | Global: optimization level |
| 6 | 136 | STRING | | sm-arch | Global: target SM architecture |
| 7 | 160 | BOOL | 0 | | |
| 8 | 176 | STRING | | | |
| 9 | 200* | INTEGER | 1 | | Opt level for sub_12DFE00 codegen |
| 10 | 216 | STRING | | | |
| 11 | 240 | BOOL | 0 | | |
| 13 | 280* | BOOL | 0 | no-dce | sub_18DEFF0 (DCE) |
| 15 | 320* | BOOL | 0 | no-tailcallelim | sub_1833EB0 (TailCallElim) |
| 17 | 360* | BOOL | 0 | no-late-opt | sub_1C46000 (NVVMLateOpt) |
| 19 | 400* | BOOL | 1 | no-inline-a | Inlining variant A |
| 21 | 440* | BOOL | 0 | no-inline-b | sub_1C4B6F0 (AlwaysInliner) |
| 23 | 480* | BOOL | 0 | no-inline-c | sub_1C4B6F0 in sub_12DE8F0 |
| 25 | 520* | BOOL | 1 | | sub_1AAC510 (NVIDIA pass A) |
| 27 | 560* | BOOL | 0 | | sub_1AAC510 (NVIDIA pass B) |
| 29 | 600* | BOOL | 0 | no-nvvm-verify | sub_12D4560 (NVVMVerifier) |
| 33 | 680* | BOOL | 0 | no-func-attrs | sub_1841180 (FunctionAttrs) |
| 35 | 720* | BOOL | 0 | no-sccp | sub_1842BC0 (SCCP) |
| 37 | 760* | BOOL | 0 | no-dse | sub_18F5480 (DSE) |
| 43 | 880* | BOOL | 0 | no-nvvm-reflect | sub_1857160 (NVVMReflect) |
| 45 | 920* | BOOL | 0 | no-ipconst | sub_185D600 (IPConstProp) |
| 47 | 960* | BOOL | 0 | no-simplifycfg | sub_190BB10 (SimplifyCFG) |
| 49 | 1000* | BOOL | 0 | no-instcombine | sub_19401A0 (InstCombine) |
| 51 | 1040* | BOOL | 0 | no-sink | sub_1869C50 (Sink/MemSSA) |
| 53 | 1080* | BOOL | 0 | no-dump | sub_17060B0 (PrintModulePass) |
| 55 | 1120* | BOOL | 0 | no-predopt | sub_18A3430 (NVVMPredicateOpt) |
| 57 | 1160* | BOOL | 0 | no-loopindexsplit | sub_1952F90 (LoopIndexSplit) |
| 59 | 1200* | BOOL | 0 | no-simplifycfg-b | SimplifyCFG variant B |
| 61 | 1240* | BOOL | 0 | do-licm (inverted) | sub_195E880 (LICM) |
| 63 | 1280* | BOOL | 0 | no-reassoc | sub_1B7FDF0 (Reassociate) |
| 65 | 1320* | BOOL | 0 | no-adce-a | sub_1C76260 (ADCE variant) |
| 67 | 1360* | BOOL | 0 | no-loopunroll | sub_19C1680 (LoopUnroll) |
| 69 | 1400* | BOOL | 0 | no-sroa | sub_1968390 (SROA) |
| 71 | 1440* | BOOL | 0 | no-earlycse | sub_196A2B0 (EarlyCSE) |
| 73 | 1480* | BOOL | 0 | no-adce-b | ADCE variant B |
| 75 | 1520* | BOOL | 0 | no-loopsimplify | sub_198DF00 (LoopSimplify) |
| 83 | 1680* | BOOL | 0 | | sub_19CE990 (NVIDIA pass) |
| 87 | 1760* | BOOL | 0 | do-ip-msp (inverted) | sub_1C8E680 (MemorySpaceOpt) |
| 91 | 1840* | BOOL | 0 | no-adce-c | sub_1C6FCA0 (ADCE) |
| 93 | 1880 | BOOL | 1 | | NVVMReduction param A |
| 95 | 1920 | BOOL | 1 | | NVVMReduction param B |
| 97 | 1960* | BOOL | 0 | no-constmerge | sub_184CD60 (ConstantMerge) |
| 99 | 2000* | BOOL | 0 | no-intrin-lower | sub_1CB4E40 (NVVMIntrinsicLowering) |
| 101 | 2040* | BOOL | 0 | no-memcpyopt | sub_1B26330 (MemCpyOpt) |
| 105 | 2120* | BOOL | 0 | no-branchdist-b | sub_1CB73C0 (NVVMBranchDist B) |
| 109 | 2200* | BOOL | 0 | no-generic2nvvm | sub_1A02540 (GenericToNVVM) |
| 113 | 2280* | BOOL | 0 | no-loweralloca-b | NVVMLowerAlloca B |
| 115 | 2320* | BOOL | 0 | do-remat (inverted) | sub_1A13320 (NVVMRemat) |
| 117 | 2360 | BOOL | 1 | | sub_1CC3990 (NVVMUnreachBlockElim) |
| 121 | 2440* | BOOL | 0 | no-sinking2 | sub_1CC60B0 (NVVMSinking2) |
| 127 | 2560* | BOOL | 0 | no-genericaddropt | sub_1CC71E0 (NVVMGenericAddrOpt) |
| 129 | 2600* | BOOL | 0 | no-irverify | sub_1A223D0 (NVVMIRVerification) |
| 131 | 2640* | BOOL | 0 | no-loopopt | sub_18B1DE0 (NVVMLoopOpt) |
| 133 | 2680* | BOOL | 0 | no-memspaceopt-b | MemorySpaceOpt in sub_12DE8F0 |
| 135 | 2720* | BOOL | 0 | no-instsimplify | sub_1A7A9F0 (InstructionSimplify) |
| 141 | 2840* | BOOL | 1 | | Enable ADCE (sub_1C6FCA0, reversed) |
| 143 | 2880* | BOOL | 1 | do-licm | Enable LICM (reversed logic) |
| 149 | 3000* | BOOL | 0 | | Extra DeadArgElim trigger |
| 151 | 3040 | BOOL | 1 | | Enable CorrelatedValuePropagation |
| 155 | 3120* | BOOL | 1 | | Address space optimization flag |
| 157 | 3160* | BOOL | 1 | dump-* master | Debug dump mode (PrintModulePass) |
| 159 | 3200* | BOOL | 1 | | Enable advanced NVIDIA passes group |
| 165 | 3328* | BOOL | 1 | | Enable SM-specific warp/reduction/sinking |
| 173 | 3488* | BOOL | 0 | | Enable barrier optimization |
| 175 | 3528* | BOOL | 0 | | Tier 1 optimization enable |
| 177 | 3568* | BOOL | 0 | | Tier 2 optimization enable |
| 179 | 3608* | BOOL | 0 | | Tier 3 optimization enable |
| 181 | 3648* | STR_PTR | "" | | Language string ("ptx"/"mid"/"idn") |
| 183 | 3704* | BOOL | 0 | | Late optimization / address-space mode |
| 193 | 3904* | BOOL | 0 | | Debug: verify after each plugin pass |
| 195 | 3944* | BOOL | 0 | | Debug: rename BBs to "F%d_B%d" |
| 197 | 3984 | INTEGER | 20 | | Limit/threshold (e.g., unroll count) |
| 203 | 4104 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 205 | 4144 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 207 | 4184 | INTEGER | -1 | | Sentinel: unlimited/auto |
| 209 | 4224* | BOOL | 0 | | Master optimization switch |
| 211 | 4264 | BOOL | 1 | | |
| 213 | 4304* | BOOL | 0 | | Device-code / separate-compilation |
| 215 | 4344 | INTEGER | 0 | | Disabled counter |
| 217 | 4384* | BOOL | 0 | | Fast-compile / bypass LLVM pipeline |
| 219 | 4424 | BOOL | 1 | | |
| 221 | 4464* | BOOL | 0 | | Disable late CFG cleanup variant B |

Slots not listed have no confirmed cross-reference to pipeline assembler guards. The full 221-slot table is in the NVVMPassOptions Reference.

Complete Option Name Inventory

The following option names were extracted from binary string references in .rodata. They are set via -opt "-name=value" on the cicc command line (requires NVVMCCWIZ=553282 in non-release builds).

Boolean toggles (do-X / no-X):

| Name | Effect |
|---|---|
| do-ip-msp | Enable inter-procedural memory space propagation |
| do-licm | Enable LICM (loop-invariant code motion) |
| do-remat | Enable NVVMRematerialization |
| do-clone-for-ip-msp | Enable function cloning for IPMSP |
| do-cssa | Enable Conventional SSA construction |
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-function-scev-cgp | Enable function-level SCEV-CGP |
| do-scev-cgp-aggresively | Aggressive SCEV-CGP mode [sic] |
| do-base-address-strength-reduce | Enable base address strength reduction |
| do-base-address-strength-reduce-chain | Enable chained base address SR |
| do-comdat-renaming | Enable COMDAT group renaming |
| do-counter-promotion | Enable counter promotion |
| do-lsr-64-bit | Enable 64-bit loop strength reduction |
| do-sign-ext-expand | Enable sign extension expansion |
| do-sign-ext-simplify | Enable sign extension simplification |

Parametric knobs:

| Name | Type | Default | Purpose |
|---|---|---|---|
| remat-for-occ | string | | Rematerialization occupancy target |
| remat-gep-cost | string | | GEP rematerialization cost |
| remat-ignore-single-cost | string | | Skip single-use cost analysis |
| remat-lli-factor | string | | Live-interval factor |
| remat-load-param | string | | Parameter load remat policy |
| remat-loop-trip | string | | Loop trip count for remat decisions |
| remat-max-live-limit | string | | Maximum live variable count |
| remat-maxreg-ceiling | string | | Register ceiling for remat |
| remat-move | string | | Rematerialization move policy |
| remat-single-cost-limit | string | | Single-value cost limit |
| remat-use-limit | string | | Use count limit for remat |
| branch-dist-block-limit | string | | Block count limit for branch distribution |
| branch-dist-func-limit | string | | Function-level branch dist limit |
| branch-dist-norm | string | | Normalization factor |
| scev-cgp-check-latency | string | | Latency check threshold |
| scev-cgp-control | string | | CGP control mode |
| scev-cgp-cross-block-limit | string | | Cross-block analysis limit |
| scev-cgp-idom-level-limit | string | | Immediate dominator depth limit |
| scev-cgp-inst-limit | string | | Instruction count limit |
| scev-cgp-norm | string | | Normalization factor |
| scev-cgp-old-base | string | | Legacy base address mode |
| scev-cgp-tid-max-value | string | | Thread ID maximum value |
| base-address-strength-reduce-iv-limit | string | | IV count limit for base addr SR |
| base-address-strength-reduce-max-iv | string | | Maximum IV for base addr SR |
| cssa-coalesce | string | | CSSA coalescing mode |
| cssa-verbosity | string | | CSSA debug verbosity |

Dump/debug flags:

Name                                Purpose
dump-ip-msp                         Dump IPMSP analysis results
dump-ir-before-memory-space-opt     Dump IR before MemorySpaceOpt
dump-ir-after-memory-space-opt      Dump IR after MemorySpaceOpt
dump-memory-space-warnings          Dump address space warnings
dump-remat                          Dump rematerialization decisions
dump-remat-add                      Dump remat additions
dump-remat-iv                       Dump remat induction variables
dump-remat-load                     Dump remat load decisions
dump-branch-dist                    Dump branch distribution analysis
dump-scev-cgp                       Dump SCEV-CGP analysis
dump-base-address-strength-reduce   Dump base address SR
dump-sink2                          Dump Sinking2 pass output
dump-before-cssa                    Dump IR before CSSA
dump-phi-remove                     Dump PHI node removal
dump-normalize-gep                  Dump GEP normalization
dump-simplify-live-out              Dump live-out simplification
dump-process-restrict               Dump restrict processing
dump-process-builtin-assume         Dump builtin assume processing
dump-conv-dot                       Dump convergence as DOT graph
dump-conv-func                      Dump convergence per function
dump-conv-text                      Dump convergence as text
dump-nvvmir                         Dump NVVM IR
dump-va                             Dump value analysis
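As a rough illustration of how these tokens decompose, here is a minimal parser sketch. This is not cicc's actual parser; `parse_opt_token` and `classify` are names invented for this example, and the classification merely mirrors the do-/no-/dump- naming convention of the tables above.

```python
# Illustrative model: splitting an -opt "-name=value" token and classifying
# it against the three option families documented above.
def parse_opt_token(token: str):
    """Split an -opt token into (name, value); bare toggles carry no value."""
    assert token.startswith("-")
    body = token[1:]
    if "=" in body:
        name, value = body.split("=", 1)
        return name, value
    return body, None  # bare boolean toggle, e.g. -do-licm

def classify(name: str) -> str:
    """Rough slot class, inferred from the option-name prefix."""
    if name.startswith(("do-", "no-")):
        return "boolean"
    if name.startswith("dump-"):
        return "dump"
    return "string"  # the parametric knobs are all stored as strings

print(parse_opt_token("-remat-max-live-limit=64"))  # ('remat-max-live-limit', '64')
print(classify("do-licm"), classify("dump-remat"))  # boolean dump
```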

Tier-Based Pass Ordering

The Threshold Dispatch Mechanism

NVIDIA's tier system is a priority-driven scheduling mechanism that interleaves optimization sub-pipelines with external plugin passes. The master pipeline function sub_12E54A0 iterates over a pass registration array at a4[4488] (16-byte stride entries: [+0] vtable_ptr, [+8] phase_id). As it processes each entry, it checks whether the entry's phase_id exceeds a threshold. When it does, the corresponding tier sub-pipeline fires once:

// Pseudocode for the main loop in sub_12E54A0
for (entry = a4[4488]; entry < a4[4496]; entry += 16) {
    int phase_id = *(int*)(entry + 8);

    if (opt_enabled && phase_id > opt_threshold) {
        sub_12DE330(PM, opts);      // Tier 0: full optimization
        opt_enabled = 0;            // fire once
    }
    if (tier1_flag && phase_id > tier1_threshold) {
        sub_12DE8F0(PM, 1, opts);   // Tier 1
        tier1_flag = 0;
    }
    if (tier2_flag && phase_id > tier2_threshold) {
        sub_12DE8F0(PM, 2, opts);   // Tier 2
        tier2_flag = 0;
    }
    if (tier3_flag && phase_id > tier3_threshold) {
        sub_12DE8F0(PM, 3, opts);   // Tier 3
        tier3_flag = 0;
    }

    // Insert the plugin/external pass itself
    pass = vtable_call(entry, +72);  // entry->createPass()
    AddPass(PM, pass, 1, 0);
}

// Any tier that didn't fire during the loop fires now
if (opt_enabled)  sub_12DE330(PM, opts);
if (tier1_flag)   sub_12DE8F0(PM, 1, opts);
if (tier2_flag)   sub_12DE8F0(PM, 2, opts);
if (tier3_flag)   sub_12DE8F0(PM, 3, opts);

This design means tier placement is data-driven: the thresholds stored at config offsets 4224/4228 (Tier 0), 3528/3532 (Tier 1), 3568/3572 (Tier 2), and 3608/3612 (Tier 3) determine exactly where in the plugin pass sequence each tier's sub-pipeline gets inserted. Changing the threshold shifts an entire tier of ~40 passes to a different position relative to the external passes. After each tier fires, its flag is cleared so it cannot fire again.
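The data-driven placement can be modeled in a few lines of Python. This is a sketch of the documented mechanism, not the decompiled logic; `assemble` is a hypothetical helper, and plugin entries are reduced to their phase_id values.

```python
# Minimal model of the threshold dispatch: each tier fires the first time a
# plugin entry's phase_id exceeds its threshold, so changing a threshold
# moves the whole tier sub-pipeline relative to the plugin passes.
def assemble(plugin_phase_ids, tier_thresholds):
    pipeline, fired = [], set()
    for phase_id in plugin_phase_ids:
        for tier, threshold in sorted(tier_thresholds.items()):
            if tier not in fired and phase_id > threshold:
                pipeline.append(f"tier{tier}")  # fire-once tier sub-pipeline
                fired.add(tier)
        pipeline.append(f"plugin@{phase_id}")
    for tier in sorted(tier_thresholds):
        if tier not in fired:                   # leftover tiers fire at the end
            pipeline.append(f"tier{tier}")
    return pipeline

plugins = [10, 50, 90]
print(assemble(plugins, {0: 40}))  # tier0 fires just before the phase-50 plugin
print(assemble(plugins, {0: 95}))  # no phase_id exceeds 95: tier0 falls to the end
```

Raising the threshold from 40 to 95 pushes the entire tier past all plugin passes, which is exactly the behavior the config offsets control.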

Tier 0 Ordering Strategy

Tier 0 (sub_12DE330) is the most comprehensive sub-pipeline at ~40 passes. Its ordering reflects NVIDIA's optimization philosophy for GPU code:

Phase A -- Value Simplification (passes 1-8): BreakCriticalEdges normalizes the CFG, then the CGSCC inliner framework runs first to create optimization opportunities. NVVMReflect resolves __nvvm_reflect() calls to compile-time constants (GPU architecture queries), and SCCP propagates those constants. GVN and NewGVN/GVNHoist eliminate redundant computations.

Phase B -- NVIDIA-Specific Cleanup (passes 9-12): NVVMVerifier catches NVVM-specific IR errors early. NVVMPredicateOpt optimizes predicate expressions. ConstantMerge reduces module size.

Phase C -- Loop Transformations (passes 13-27): This is the core loop optimization sequence. Sink/MemSSA moves code out of hot paths. LoopIndexSplit divides loops at index boundaries. LICM hoists invariants. LoopUnroll with factor 3 expands small loops. LoopUnswitch moves conditionals out of loops. ADCE removes dead code exposed by loop transformations.

Phase D -- Register Pressure Management (passes 28-40): InstCombine and SROA simplify the IR further. NVVMRematerialization recomputes values to reduce register pressure -- critical for GPU occupancy. DSE and DCE clean up dead stores and code. The final CGSCC pass and FunctionAttrs prepare for per-function Phase II processing.

Tier 1/2/3 Incremental Additions -- sub_12DE8F0

Address     0x12DE8F0
Size        17,904 bytes
Signature   int64 sub_12DE8F0(int64 passMgr, int tier, int64 opts)

sub_12DE8F0 adds passes incrementally based on the tier value (1, 2, or 3). Its first action stores the tier into qword_4FBB410 (the tier tracker global), then checks qword_4FBB3B0 (phase counter) for phase-dependent behavior. Nearly every pass insertion is gated by a boolean in the NVVMPassOptions struct.

The full pass list for sub_12DE8F0 (all tiers combined, with tier-specific gates):

sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering (level=1)
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering (barrier=1)
sub_18E4A00()  [opts[3488]]             NVVMBarrierAnalysis
sub_1C98160(0) [opts[3488]]             NVVMLowerBarriers
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_185D600()  [opts[3200]&&!opts[920]] IPConstPropagation         [advanced group]
sub_1857160()  [opts[3200]&&!opts[880]] NVVMReflect                [advanced group]
sub_18A3430()  [opts[3200]&&!opts[1120]] NVVMPredicateOpt          [advanced group]
sub_1842BC0()  [opts[3200]&&!opts[720]] SCCP                       [advanced group]
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_18A3090()  [opts[3200]&&!opts[2160]] NVVMPredicateOpt variant  [advanced group]
sub_184CD60()  [opts[3200]&&!opts[1960]] ConstantMerge             [advanced group]
sub_190BB10(1,0)[tier!=1 && guards]     SimplifyCFG                [TIER 2/3 ONLY]
sub_1952F90(-1)[tier!=1 && guards]      LoopIndexSplit             [TIER 2/3 ONLY]
sub_12D4560()  [tier!=1 && !opts[600]]  NVVMVerifier               [TIER 2/3 ONLY]
sub_195E880(0) [opts[3704]&&opts[2880]] LICM
sub_1C8A4D0(v) [v=1 if opts[3704]]     EarlyCSE
sub_1869C50(1,0,1)[tier!=1&&!opts[1040]] Sink                     [TIER 2/3 ONLY]
sub_1833EB0(3) [tier==3 && !opts[320]]  TailCallElim              [TIER 3 ONLY]
sub_1CC3990()  [!opts[2360]]            NVVMUnreachableBlockElim
sub_18EEA90()  [opts[3040]]             CorrelatedValuePropagation
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_1C4B6F0()  [!opts[440]&&!opts[480]] Inliner
sub_1A7A9F0()  [!opts[2720]]            InstructionSimplify
sub_12D4560()  [!opts[600]]             NVVMVerifier
sub_1A02540()  [!opts[2200]]            GenericToNVVM
sub_198DF00(-1)[!opts[1520]]            LoopSimplify
sub_1C76260()  [!opts[1320]&&!opts[1480]] ADCE
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1C98160(v) [opts[3488]]             NVVMLowerBarriers
sub_19C1680(0,1)[!opts[1360]]           LoopUnroll
sub_19401A0()  [!opts[1000]]            InstCombine
sub_196A2B0()  [!opts[1440]]            EarlyCSE
sub_1968390()  [!opts[1400]]            SROA
sub_19B73C0(t,...)[tier!=1]             LoopUnswitch (SM-dependent) [TIER 2/3 ONLY]
sub_1A62BF0(1,...)[!opts[600]]          LLVM standard pipeline #1
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_190BB10(0,0)[!opts[960]]            SimplifyCFG
sub_1922F90()  [opts[3080]]             NVIDIA-specific loop pass
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1A13320()  [!opts[2320]]            NVVMRematerialization
sub_1968390()  [!opts[1400]]            SROA
sub_18EEA90()  [opts[3040]]             CorrelatedValuePropagation
sub_18F5480()  [!opts[760]]             DSE
sub_18DEFF0()  [!opts[280]]             DCE
sub_1A62BF0(1,...)[!opts[600]]          LLVM standard pipeline #1
sub_1AAC510()  [!opts[520]&&!opts[560]] NVIDIA-specific pass
sub_1A223D0()  [!opts[2600]]            NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]]            NVVMIntrinsicLowering
sub_1C8E680()  [!opts[2680]]            MemorySpaceOpt (from opts[3120])
sub_1CC71E0()  [!opts[2560]]            NVVMGenericAddrOpt
sub_1C98270(1,v)[opts[3488]]            NVVMLowerBarriers variant
sub_1C6FCA0()  [opts[2840]&&!opts[1840]] ADCE
sub_18B1DE0()  [opts[3200]&&!opts[2640]] LoopOpt/BarrierOpt        [advanced group]
sub_1857160()  [opts[3200]&&tier==3]    NVVMReflect                [TIER 3 ONLY]
sub_1841180()  [opts[3200]&&!opts[680]] FunctionAttrs              [advanced group]
sub_1C46000()  [tier==3&&!opts[360]]    NVVMLateOpt                [TIER 3 ONLY]
sub_1841180()  [opts[3200]&&!opts[680]] FunctionAttrs (2nd call)   [advanced group]
sub_1CBC480()  [!opts[2240]&&!opts[2280]] NVVMLowerAlloca
sub_1CB73C0()  [!opts[2080]&&!opts[2120]] NVVMBranchDist
sub_1C7F370(1) [opts[3328]&&!opts[1640]] NVVMWarpShuffle           [SM-specific]
sub_1CC5E00()  [opts[3328]&&!opts[2400]] NVVMReduction             [SM-specific]
sub_1CC60B0()  [opts[3328]&&!opts[2440]] NVVMSinking2              [SM-specific]
sub_1CB73C0()  [opts[3328]&&guards]     BranchDist (2nd call)      [SM-specific]
sub_1B7FDF0(3) [opts[3328]&&!opts[1280]] Reassociate               [SM-specific]

Tier 1 (baseline) adds the passes above EXCEPT those gated by tier!=1 (SimplifyCFG, LoopIndexSplit, Sink, and LoopUnswitch are all skipped) and the tier==3-only passes (TailCallElim, the late NVVMReflect, and NVVMLateOpt). This is a conservative set focused on NVIDIA-specific cleanup without the more expensive LLVM optimizations.

Tier 2 adds everything Tier 1 has plus the tier!=1-gated passes. The LoopUnswitch parameters are SM-architecture-dependent: sub_19B73C0 receives different vector widths based on the target subtarget.

Tier 3 adds TailCallElim (gated tier==3), NVVMReflect at a late position (gated tier==3), and NVVMLateOpt (gated tier==3). Critically, it also triggers feature flag escalation (see below).

Feature Flag Escalation

A notable pattern occurs only in Tier 3: if BYTE4(qword_4FBB370[2]) is zero (no advanced features enabled), the tier handler allocates a new integer with value 6 and stores it via sub_16D40E0. The value 6 (binary 110) enables two feature gates used by later passes: barrier optimization and memory-space optimization. This means Tier 3 (O3) automatically enables optimization features that lower tiers leave disabled, without requiring explicit CLI flags.
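A quick check of the bit arithmetic: the specific assignment of each bit position to a gate is an assumption made for this example (the text only establishes that 6 = 0b110 enables barrier optimization and memory-space optimization).

```python
# Sketch of the Tier 3 escalation value 6 == 0b110 read as two feature bits.
# Bit-position names are assumptions, not recovered symbols.
BARRIER_OPT      = 1 << 1   # 0b010
MEMORY_SPACE_OPT = 1 << 2   # 0b100

escalation = 6
assert escalation == BARRIER_OPT | MEMORY_SPACE_OPT
print(bool(escalation & BARRIER_OPT), bool(escalation & MEMORY_SPACE_OPT))  # True True
```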

O-Level Pipeline Comparison

Pipeline Selection

The new-PM driver sub_226C400 selects pipeline name strings based on config flags:

byte[888]  set  →  "nvopt<O0>"
byte[928]  set  →  "nvopt<O1>"
byte[968]  set  →  "nvopt<O2>"
byte[1008] set  →  "nvopt<O3>"

These strings are passed to sub_2277440 (the new-PM text pipeline parser). The nvopt prefix is registered as a pipeline element in both sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 and 0x49E6A58 respectively.

O0: No Optimization

O0 skips the full pipeline entirely. The code falls through to LABEL_159, which calls only sub_1C8A4D0(0) (NVVMFinalCleanup) and then proceeds directly to finalization. No Tier 0/1/2/3 sub-pipeline fires. The result is ~5-8 passes total: TargetLibraryInfo, TargetTransformInfo, Verifier, AssumptionCache, ProfileSummary, NVVMFinalCleanup, and codegen setup.

O1/O2/O3: Full Pipeline with Tier Differentiation

All three levels call sub_12DE330 for the same ~40-pass Tier 0 sub-pipeline. The differences manifest through four mechanisms:

1. Tier sub-pipeline gating. sub_12DE8F0 is called with the tier number corresponding to the O-level. O1 gets tier=1 (conservative, skips several passes). O2 gets tier=2 (full set). O3 gets tier=3 (aggressive + feature flag escalation).

2. CGSCC iteration counts. The CGSCC pass manager wrapper sub_1A62BF0 takes an iteration count as its first argument. In the O1/O2/O3 base pipeline, it is called with 1 (single inliner pass). In the "mid" fast-compile path, it is called with 5 iterations. In the default path, it varies from 1 to 8 depending on pipeline position, allowing more aggressive devirtualization and inlining at higher optimization levels.

3. Loop unroll factor. sub_1833EB0 is called with factor 3 in the standard pipeline. Tier 3 adds an additional call to TailCallElim and more aggressive LoopUnswitch parameters (the sub_19B73C0 call receives SM-arch-dependent vector widths at Tier 2/3).

4. Vectorizer parameters. sub_19B73C0 receives different arguments based on tier:

  • Tier 0: (2, -1, -1, -1, -1, -1, -1) -- conservative vector width 2, all thresholds unlimited
  • "mid" path: (3, -1, -1, 0, 0, -1, 0) -- vector width 3, some thresholds zeroed (disabled)
  • Tier 2/3: Parameters vary by SM architecture via config struct lookups

Fast-Compile Levels vs O-Levels

Pipeline        Entry Path    Passes   LSA       MemSpaceOpt  Key Difference
nvopt<O0>       LABEL_159     ~5-8     off       off          No optimization
nvopt<Ofcmax>   LABEL_196     ~12-15   forced 0  forced 0     Sinking2(fast) + minimal canonicalization
nvopt<Ofcmid>   LABEL_297     ~25-30   normal    enabled      CGSCC(5), LoopVectorize(conservative)
nvopt<Ofcmin>   LABEL_297     ~30-35   normal    enabled      Like Ofcmid but more aggressive loop settings
nvopt<O1>       sub_12DE330   ~35      normal    enabled      Tier 1: conservative set
nvopt<O2>       sub_12DE330   ~35+     normal    enabled      Tier 2: full optimization set
nvopt<O3>       sub_12DE330   ~35+     normal    enabled      Tier 3: aggressive + feature escalation

Ofcmax is architecturally distinct: it forces -lsa-opt=0 and -memory-space-opt=0 in the optimizer flags (confirmed in both sub_9624D0 line 1358 and sub_12CC750 line 2025). This means two of NVIDIA's most important proprietary passes -- LSA optimization and MemorySpaceOpt -- are unconditionally disabled regardless of what the user requests.

Pipeline Text Strings and nvopt<> Dispatch

The nvopt<> Naming Convention

NVIDIA replaces LLVM's standard default<O2> pipeline naming with a proprietary nvopt<> prefix. The new-PM driver sub_226C400 (35KB, at 0x226C400) selects one of exactly seven pipeline name strings based on optimization level and fast-compile flags. These strings are passed verbatim to sub_2277440 (60KB, at 0x2277440) -- NVIDIA's equivalent of LLVM's PassBuilder::buildDefaultPipeline().

nvopt<O0>       Optimization disabled. ~5-8 infrastructure passes only.
nvopt<O1>       Standard optimization, Tier 1 (conservative).
nvopt<O2>       Standard optimization, Tier 2 (full).
nvopt<O3>       Standard optimization, Tier 3 (aggressive + feature escalation).
nvopt<Ofcmax>   Fast-compile maximum speed. Forces -lsa-opt=0, -memory-space-opt=0.
nvopt<Ofcmid>   Fast-compile medium. MemorySpaceOpt enabled, CGSCC(5) iterations.
nvopt<Ofcmin>   Fast-compile minimum. Like Ofcmid but more aggressive loop settings.

Selection Algorithm (sub_226C400)

The config struct encodes O-level flags at fixed byte offsets. The fast-compile level string (if present) is at qwords 131/132 (offset 1048/1056), encoded as a 3-byte sequence compared via 2-byte word + 1-byte suffix:

// sub_226C400, lines 828-874 (pseudocode)
char* select_pipeline_name(Config* cfg) {
    if (cfg->byte[928])   return "nvopt<O1>";     // 9 chars
    if (cfg->byte[968])   return "nvopt<O2>";     // 9 chars
    if (cfg->byte[1008])  return "nvopt<O3>";     // 9 chars

    char* fc = cfg->qword[131];
    int fc_len = cfg->qword[132];
    if (fc_len == 3) {
        // Word comparison: *(uint16_t*)fc, then byte fc[2]
        if (*(uint16_t*)fc == 24941 && fc[2] == 120)  // 0x616D = "ma" (LE) + 'x'
            return "nvopt<Ofcmax>";   // 14 chars
        if (*(uint16_t*)fc == 26989 && fc[2] == 100)  // 0x696D = "mi" (LE) + 'd'
            return "nvopt<Ofcmid>";   // 14 chars
        if (*(uint16_t*)fc == 26989 && fc[2] == 110)  // 0x696D = "mi" (LE) + 'n'
            return "nvopt<Ofcmin>";   // 14 chars
    }
    return "nvopt<O0>";              // 9 chars
}
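The magic constants can be verified independently; this snippet reproduces the little-endian word-plus-suffix comparison (`word_and_suffix` is a helper invented for illustration).

```python
# Verify that the decompiled constants match the fast-compile level strings
# when read as a 2-byte little-endian word plus a trailing byte.
import struct

def word_and_suffix(s: bytes):
    (w,) = struct.unpack("<H", s[:2])   # 2-byte little-endian word
    return w, s[2]

assert word_and_suffix(b"max") == (24941, 120)   # 0x616D + 'x'
assert word_and_suffix(b"mid") == (26989, 100)   # 0x696D + 'd'
assert word_and_suffix(b"min") == (26989, 110)   # 0x696D + 'n'
print("constants verified")
```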

The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM, vtable 0x4A08350) and sub_12C35D0 (legacy PM, vtable 0x49E6A58). Both route into an nvopt pipeline builder class that creates a 512-byte pipeline object via sub_12EC960.

Mutual Exclusion

Combining -O# with --passes= or --foo-pass is an error:

Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass,
use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'

Pipeline Text Parser (sub_2277440)

sub_2277440 (60KB) is the new-PM buildDefaultPipeline() equivalent. It tokenizes the pipeline name string via sub_2352D90, then dispatches to the appropriate pipeline builder based on the nvopt<> parameter. NVIDIA custom passes are injected via extension point callbacks at [PassBuilder+2208] (stride 32 bytes per entry, count at [PassBuilder+2216]). Each callback entry has a guard pointer at [+16] and a callback function at [+24].

Fast-Compile Level Encoding

In the libnvvm config struct, offset 1640 holds an integer encoding:

Value   CLI Source            Pipeline Name    Notes
0       (no -Ofast-compile)   normal O-level   Default
1       -Ofast-compile=0      reset to 0       Treated as "off"
2       -Ofc=max              nvopt<Ofcmax>    Forces -lsa-opt=0, -memory-space-opt=0
3       -Ofc=mid              nvopt<Ofcmid>    MemorySpaceOpt enabled
4       -Ofc=min              nvopt<Ofcmin>    Closest to full optimization

Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".
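The encoding can be sketched as a simple lookup. This is illustrative only; the real check lives inside libnvvm, and `pipeline_for` is an invented name.

```python
# Model of the offset-1640 fast-compile encoding and its error path.
# None means "fall back to the normal O-level pipeline".
PIPELINES = {0: None, 1: None, 2: "nvopt<Ofcmax>", 3: "nvopt<Ofcmid>", 4: "nvopt<Ofcmin>"}

def pipeline_for(level: int):
    if level not in PIPELINES:
        raise ValueError("libnvvm : error: -Ofast-compile called with unsupported "
                         "level, only supports 0, min, mid, or max")
    return PIPELINES[level]

print(pipeline_for(2))   # nvopt<Ofcmax>
```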

Pass Registration Architecture

Dual Pass Manager Support

cicc v13.0 maintains registrations for both the Legacy Pass Manager and the New Pass Manager simultaneously. This dual support is necessary during the LLVM Legacy-to-New PM migration. The Legacy PM path is taken when a4[4384] != 0 (the fast-compile/bypass flag), while the New PM path handles normal compilation.

Legacy PM registration occurs in pass constructor functions scattered throughout the binary. For example, MemorySpaceOpt registers as "memory-space-opt-pass" via sub_1C97F80. Each Legacy PM pass calls RegisterPass<> with a pass ID and description string.

New PM registration is centralized in sub_2342890 -- a single 2,816-line function that registers every analysis, pass, and printer. It calls sub_E41FB0(pm, class_name, len, pass_name, len) for each pass, inserting into a StringMap with open-addressing and linear probing.

New PM Registration Structure

sub_2342890 registers passes in a strict ordering by pipeline level:

Section             Lines       Count   Content
Module analyses     514-596     ~18     CallGraph, ProfileSummary, LazyCallGraph, etc.
Module passes       599-1153    ~95     AlwaysInline, GlobalOpt, NVIDIA module passes
CGSCC analyses      1155-1163   ~5      FunctionAnalysisManagerCGSCC, etc.
CGSCC passes        1170-1206   ~15     Inliner, Attributor, ArgumentPromotion
Function analyses   1208-1415   ~65     DominatorTree, LoopInfo, MemorySSA, rpa, merge-sets
Function passes     1420-2319   ~185    SROA, GVN, LICM, all NVIDIA function passes
LoopNest passes     2320-2339   ~8      LoopInterchange, LoopFlatten
Loop analyses       2340-2362   ~10     LoopAccessAnalysis, IVUsers
Loop passes         2367-2482   ~40     IndVarSimplify, LICM, LoopUnroll, loop-index-split
Machine analyses    2483-2580   ~30     LiveIntervals, SlotIndexes
Machine passes      2581-2815   ~80     ExpandPostRAPseudos, BranchFolding

Parameterized Pass Parsing

When the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), a registered callback parses the parameter string. The parsing flow:

  1. sub_2337DE0 matches the pass name via a starts_with comparison
  2. sub_234CEE0 extracts the <...> parameter string
  3. The parameter-parsing callback (e.g., sub_23331A0 for MemorySpaceOpt) is invoked
  4. The parser splits on ; and matches each token against known parameter names
  5. A configured pass options struct is returned and used to construct the pass

For MemorySpaceOpt, the parameter parser (sub_23331A0) recognizes four tokens:

Token         Length   Effect
first-time    10       Sets first_time = true (default)
second-time   11       Sets first_time = false
warnings      8        Enables address-space warnings
no-warnings   11       Disables warnings

Invalid parameters produce: "invalid MemorySpaceOpt pass parameter '{0}'".
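Modeled from the token table above, a sketch of what the callback's behavior amounts to. The default warnings state is an assumption, and `parse_memory_space_opt` is an invented name; sub_23331A0's actual structure differs.

```python
# Illustrative model of the MemorySpaceOpt parameter-parsing callback:
# split on ';' (step 4 above) and match each token against the known names.
def parse_memory_space_opt(params: str):
    opts = {"first_time": True, "warnings": False}  # defaults are assumptions
    for tok in filter(None, params.split(";")):
        if tok == "first-time":
            opts["first_time"] = True
        elif tok == "second-time":
            opts["first_time"] = False
        elif tok == "warnings":
            opts["warnings"] = True
        elif tok == "no-warnings":
            opts["warnings"] = False
        else:
            raise ValueError(f"invalid MemorySpaceOpt pass parameter '{tok}'")
    return opts

print(parse_memory_space_opt("second-time;warnings"))
```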

Pass Serialization

Each parameterized NVIDIA pass also registers a serializer for pipeline text output (used by --print-pipeline-passes). The serializers write the pass class name followed by the current parameter state:

Pass             Serializer    Output Format
MemorySpaceOpt   sub_2CE0440   MemorySpaceOptPass]<first-time;...>
BranchDist       sub_2311040   BranchDistPass]
Sinking2         sub_2315E20   llvm::Sinking2Pass]
Remat            sub_2311820   RematerializationPass]
NVVMPeephole     sub_2314DA0   NVVMPeepholeOptimizerPass]
LoopIndexSplit   sub_2312380   LoopIndexSplitPass]

Pipeline Construction Flow

The AddPass Mechanism -- sub_12DE0B0

Address      0x12DE0B0
Size         3,458 bytes
Signature    int64 sub_12DE0B0(int64 passMgr, int64 passObj, uint8 flags, char barrier)
Call count   ~137 direct calls from sub_12E54A0, ~40 from sub_12DE330, ~50+ per tier

sub_12DE0B0 is the sole entry point for adding passes to the pipeline: every pass-factory call in the pipeline assembler funnels through it. It performs three operations in a single call: hash-table insertion for O(1) lookup, flag encoding for the pass scheduler, and an append to the ordered pass array.

// Detailed pseudocode for sub_12DE0B0
int64 AddPass(PassManager* PM, Pass* pass, uint8_t flags, char barrier) {
    // --- Step 1: Hash the pass pointer ---
    // Uses a custom shift-XOR hash, NOT a standard hash function.
    // The two shifts (9 and 4) spread pointer bits across the table.
    uint64_t hash = ((uint64_t)pass >> 9) ^ ((uint64_t)pass >> 4);

    // --- Step 2: Open-addressing insert into hash table at PM+80 ---
    // The hash table is a flat array of 16-byte entries at PM+80:
    //   [+0] uint64 pass_pointer (0 = empty slot)
    //   [+8] uint8  combined_flags
    // Table capacity is stored at PM+72 (initial: derived from 0x800000000 mask).
    // Collision resolution: linear probing with step 1.
    uint8_t combined = flags | (barrier ? 2 : 0);
    //   Bit 0 (0x01): 1 = FunctionPass, 0 = ModulePass/AnalysisPass
    //   Bit 1 (0x02): 1 = barrier (scheduling fence)
    //   Remaining bits: reserved

    size_t capacity = PM->ht_capacity;       // at PM+72
    size_t idx = hash & (capacity - 1);      // power-of-2 masking
    Entry* table = (Entry*)(PM + 80);

    while (table[idx].pass != 0) {
        if (table[idx].pass == pass) {
            // Pass already inserted -- update flags only
            table[idx].flags = combined;
            return 0;  // dedup: no second insertion
        }
        idx = (idx + 1) & (capacity - 1);   // linear probe
    }
    table[idx].pass = pass;
    table[idx].flags = combined;

    // --- Step 3: Append to ordered pass array at PM[0] ---
    // PM[0] = pointer to dynamic array of 8-byte pass pointers
    // PM[1] = count of passes (PM+8)
    // Growth: geometric reallocation (not shown here)
    uint64_t* array = (uint64_t*)PM->passes; // PM[0]
    array[PM->count] = (uint64_t)pass;
    PM->count++;                              // PM+8

    return 0;
}

The flags parameter encodes the pass type: 0 for module/analysis passes, 1 for function passes. The barrier parameter (bit 1) is a scheduling fence that tells the pass manager all preceding passes must complete before this pass runs -- used for passes that require the module in a globally consistent state (e.g., after whole-module inlining).

The hash table serves two purposes: (a) deduplication -- if the same pass factory is called twice (which happens for NVVMReflect, NVVMIntrinsicLowering, etc.), the second call updates flags rather than inserting a duplicate; and (b) O(1) flag lookup during the codegen dispatch phase (sub_12DFE00), where each pass's type and barrier status must be queried efficiently.

The pass manager container is initialized at line 390 of sub_12E54A0 with inline storage: v270 = v272 (stack buffer), v271 = 0x800000000 (capacity/flags encoding with 33-bit sentinel).
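The dedup behavior can be modeled directly. This is a Python sketch of the documented algorithm (shift-XOR hash, linear probing over a power-of-two table), not the decompiled code.

```python
# Model of AddPass deduplication: a repeated pass pointer updates its flags
# in place instead of being appended to the ordered pass array a second time.
class PassManagerModel:
    def __init__(self, capacity=16):           # capacity must be a power of two
        self.table = [None] * capacity         # (pass_ptr, flags) slots, None = empty
        self.order = []                        # ordered pass array (PM[0]/PM[1])

    def add_pass(self, pass_ptr: int, flags: int, barrier: bool):
        combined = flags | (2 if barrier else 0)        # bit 0: type, bit 1: barrier
        idx = ((pass_ptr >> 9) ^ (pass_ptr >> 4)) & (len(self.table) - 1)
        while self.table[idx] is not None:
            if self.table[idx][0] == pass_ptr:          # dedup: flags-only update
                self.table[idx] = (pass_ptr, combined)
                return
            idx = (idx + 1) & (len(self.table) - 1)     # linear probe, step 1
        self.table[idx] = (pass_ptr, combined)
        self.order.append(pass_ptr)

pm = PassManagerModel()
pm.add_pass(0x1857160, 1, False)   # NVVMReflect, first insertion
pm.add_pass(0x1857160, 1, True)    # second factory call: flags update only
print(len(pm.order))               # 1
```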

Complete 8-Phase Construction Algorithm

The full pipeline construction in sub_12E54A0 proceeds through eight phases. The pseudocode below is reconstructed from the decompiled 49.8KB function at lines 300-757 of the decompilation output. All a4 offsets refer to the CompilerOptions struct (parameter 4, ~4500 bytes).

Phase 0: Infrastructure (lines 396-420, always runs)

// Phase 0: Analysis infrastructure required by all subsequent passes
#01  TLI = sub_149CCE0(malloc(368), sub_14A04B0(triple));
     AddPass(PM, TLI, 0, 0);     // TargetLibraryInfoWrapperPass [Module]

#02  TTI = sub_1BFB520(malloc(208), sub_1BFB9A0(dataLayout));
     AddPass(PM, TTI, 1, 0);     // TargetTransformInfoWrapperPass [Function]

#03  verifier = sub_14A7550();
     AddPass(PM, verifier, 0, 0); // VerifierPass / BasicAliasAnalysis [Module]

#04  assumptions = sub_1361950();
     AddPass(PM, assumptions, 0, 0); // AssumptionCacheTracker [Module]

#05  profile = sub_1CB0F50();
     AddPass(PM, profile, 1, 0); // ProfileSummaryInfoWrapperPass [Function]

NVIDIA always adds these five analysis passes first, regardless of optimization level, language, or fast-compile mode -- an initialization ordering with no upstream-LLVM equivalent.

Phase 1: Language Dispatch (lines 421-488)

Phase 1 reads the language string at a4[3648] (pointer) with length at a4[3656]. Three language paths exist; each produces a fundamentally different pass sequence. See the Language Path Differences section below for the complete per-path pass lists.

// Phase 1: Language-based pipeline branching
char* lang = *(char**)(a4 + 3648);
int lang_len = *(int*)(a4 + 3656);

bool opt_enabled = *(bool*)(a4 + 4224);
bool fc_max = false, fc_mid = false;
int v238 = *(int*)(a4 + 4304);  // device-code / additional-opt flag

if (lang_len == 3) {
    uint16_t w = *(uint16_t*)lang;
    if (w == 0x7470 && lang[2] == 0x78) {        // "ptx"
        goto PATH_A_PTX;
    }
    if (w == 0x696D && lang[2] == 0x64) {         // "mid"
        goto PATH_B_MID;
    }
    // "min" (w == 0x696D && lang[2] == 0x6E) falls through to the default path
}
// Fall through to PATH_C_DEFAULT

// Fast-compile dispatch (within the language check):
// fc="max" AND !v238 → v244=1, v238=1, goto LABEL_191 (minimal + O0)
// fc="max" AND v238  → goto LABEL_196 → LABEL_188 (Sinking2 + common)
// fc="mid"           → goto LABEL_297 (mid pipeline)
// fc="min"           → goto LABEL_297 (min pipeline, differs via v238)
// no fc, no O-level  → LABEL_159 (O0 minimal pipeline)
// O-level set        → LABEL_38 → LABEL_39 (process pass list + tiers)

Phase 2: Pre-Optimization (lines 442-480)

Only when optimization is not completely skipped. Each pass is gated by a per-pass disable flag in the NVVMPassOptions struct.

// Phase 2: Early passes before the main optimization loop
if (!a4[1960] || a4[3000])                           // not disabled OR extra trigger
    AddPass(PM, sub_1857160(), 1, 0);                // NVVMReflect

if (a4[3000])                                        // extra DeadArgElim trigger
    AddPass(PM, sub_18FD350(0), 1, 0);              // DeadArgElimination

if (!a4[1680])                                       // NVIDIA pass not disabled
    AddPass(PM, sub_19CE990(), 1, 0);               // LoopStrengthReduce (NVIDIA)

AddPass(PM, sub_1CB4E40(0), 1, 0);                  // NVVMIntrinsicLowering(level=0)

if (!a4[2040])
    AddPass(PM, sub_1B26330(), 1, 0);                // MemCpyOpt

AddPass(PM, sub_12D4560(), 1, 0);                    // NVVMVerifier

if (!a4[1960])
    AddPass(PM, sub_184CD60(), 1, 0);                // ConstantMerge

if (!a4[440] && !a4[400])
    AddPass(PM, sub_1C4B6F0(), 1, 0);               // AlwaysInliner

if (a4[3160])                                        // debug dump enabled
    AddPass(PM, sub_17060B0(1, 0), 1, 0);           // PrintModulePass

Phase 3: Main Optimization Loop (lines 481-553)

The tier-threshold-driven loop iterates over the plugin/external pass array at a4[4488]. Each entry is 16 bytes (vtable pointer + phase_id). When a threshold is crossed, the corresponding tier sub-pipeline fires once and never again.

// Phase 3: Tier dispatch within the main plugin pass loop
uint64_t* entry = *(uint64_t**)(a4 + 4488);
uint64_t* end   = *(uint64_t**)(a4 + 4496);

while (entry < end) {
    int phase_id = *(int*)((char*)entry + 8);

    // Tier 0: full optimization sub-pipeline
    if (*(bool*)(a4+4224) && phase_id > *(int*)(a4+4228)) {
        sub_12DE330(PM, opts);        // ~40 passes
        *(bool*)(a4+4224) = false;    // fire once
    }
    // Tier 1: conservative
    if (*(bool*)(a4+3528) && phase_id > *(int*)(a4+3532)) {
        sub_12DE8F0(PM, 1, opts);
        *(bool*)(a4+3528) = false;
    }
    // Tier 2: full
    if (*(bool*)(a4+3568) && phase_id > *(int*)(a4+3572)) {
        sub_12DE8F0(PM, 2, opts);
        *(bool*)(a4+3568) = false;
    }
    // Tier 3: aggressive
    if (*(bool*)(a4+3608) && phase_id > *(int*)(a4+3612)) {
        sub_12DE8F0(PM, 3, opts);
        *(bool*)(a4+3608) = false;
    }

    // Insert the plugin/external pass itself
    Pass* plugin = vtable_call(entry, +72);  // entry->createPass()
    AddPass(PM, plugin, 1, 0);

    // Optional debug verification after each plugin pass
    if (a4[3904]) {
        sub_12D3E60();  // insert verification/print pass
        sub_16E8CB0();
        sub_15E9F00();
    }

    entry = (uint64_t*)((char*)entry + 16);  // next entry (16-byte stride)
}

// Any tier that didn't fire during the loop fires unconditionally now
if (*(bool*)(a4+4224))  sub_12DE330(PM, opts);
if (*(bool*)(a4+3528))  sub_12DE8F0(PM, 1, opts);
if (*(bool*)(a4+3568))  sub_12DE8F0(PM, 2, opts);
if (*(bool*)(a4+3608))  sub_12DE8F0(PM, 3, opts);

Phase 4: Post-Optimization Language Paths (lines 580-1371)

After the main loop, language-specific post-optimization runs. This is where the three paths diverge most significantly. Each path ends by falling through to LABEL_84 (Phase 5). See Language Path Differences for complete pass lists per path.

Phase 5: Finalization (LABEL_84, lines 640-653)

Always runs after the language-specific optimization path completes.

// Phase 5: Finalization -- barriers, cleanup, codegen
if (!v244 && a4[3488])                              // barrier optimization enabled
    AddPass(PM, sub_1C98160(a4[2920]!=0), 1, 0);   // NVVMLowerBarriers

AddPass(PM, sub_1CEBD10(), 1, 0);                   // NVVMFinalLowering (cleanup)

if (!a4[2800] && !a4[4464])                         // late CFG cleanup not disabled
    AddPass(PM, sub_1654860(1), 1, 0);              // BreakCriticalEdges

sub_12DFE00(PM, subtargetInfo, a4);                  // Codegen dispatch (see below)

Phase 6: Phase II Codegen Check (lines 654-693)

Reads the compilation phase counter and conditionally enters a special codegen extension block for multi-phase compilation.

// Phase 6: Phase-II-specific codegen extensions
int phase = sub_16D40F0(qword_4FBB3B0);  // read cl::opt<int> phase counter
if (phase == 2 && (*(int*)(a4 + 4480) & 4)) {
    // Enter special Phase II codegen block
    // Calls vtable at v245+56 (TargetMachine::addPhaseIIPasses)
    // Passes SubtargetInfo (v253) and CodeGenOpt config (v262)
    target_machine->addPhaseIIPasses(subtarget, codegen_config);
}

Phase 7: Pipeline Execution (lines 694-698)

// Phase 7: Run the assembled pipeline
sub_160FB70(PM, *output, output[1]);   // PassManager::run(Module, outputs)
sub_1619BD0(PM, module);               // PassManager::finalize(Module)
free(v274);                            // cleanup allocations
sub_160FE50(PM);                       // PassManager::destroy()

Phase 8: Basic Block Naming (lines 700-757)

Only when a4[3944] (debug/naming mode) is set. Produces deterministic block names for debugging.

// Phase 8: Debug block naming for IR dump readability
if (a4[3944]) {
    int funcIdx = 0;
    for (Function* F = module->functions; F; F = F->next) {
        if (sub_15E4F60(F))  continue;  // skip declarations
        funcIdx++;
        int blockIdx = 0;
        for (BasicBlock* BB = F->blocks; BB; BB = BB->next) {
            blockIdx++;
            char name[32];
            snprintf(name, sizeof name, "F%d_B%d", funcIdx, blockIdx);
            sub_164B780(BB, name);   // BB->setName()
        }
    }
}

Language Path Differences

The three language paths selected in Phase 1 and executed in Phase 4 correspond to fundamentally different IR maturity levels. The string pointer at a4[3648] (length at a4[3656]) determines which path is taken.

Path A: "ptx" -- Light Pipeline (~15 passes)

PTX text input has already been lowered by an earlier compilation stage. This path applies only light cleanup and canonicalization:

sub_1CEF8F0()               NVVMPeephole
sub_215D9D0()               NVVMAnnotationsProcessor
sub_1857160()  [!a4[880]]   NVVMReflect
sub_1A62BF0(1,0,0,1,0,0,1)  LLVM standard pipeline #1
sub_1B26330()  [!a4[2040]]  MemCpyOpt
sub_17060B0(0,0)            PrintModulePass (debug)
sub_18DEFF0()  [!a4[280]]   DCE
sub_1A62BF0(1,0,0,1,0,0,1)  LLVM standard pipeline #1 (repeat)
sub_18B1DE0()  [!a4[2640]]  LoopPass / BarrierOpt
sub_1C8E680(0) [!a4[1760]]  MemorySpaceOptimization
 --> LABEL_84 (finalization)

Key difference: no SROA, no GVN, no loop transformations, no CGSCC inlining. The PTX path trusts that the earlier compilation already optimized the code.

Path B: "mid" -- Full Optimization (~45 passes)

The primary path for standard CUDA compilation. The IR comes from the EDG frontend through IR generation and is at "mid-level" maturity (high-level constructs lowered, but not yet optimized).

sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (1st of 4)
sub_1B26330()  [!a4[2040]]    MemCpyOpt
sub_198E2A0()                  SROA
sub_1CEF8F0()                  NVVMPeephole
sub_215D9D0()                  NVVMAnnotationsProcessor
sub_198DF00(-1)[!a4[1520]]     LoopSimplify
sub_1C6E800()                  GVN
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification (1st of 5+)
sub_190BB10(0,0)               SimplifyCFG
sub_1832270(1)                 InstructionCombining
sub_1A62BF0(5,0,0,1,0,0,1)    CGSCC pipeline (5 iterations)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (2nd)
sub_18FD350(0)                 DeadArgElim
sub_1841180()  [!a4[680]]     FunctionAttrs
sub_18DEFF0()  [!a4[280]]     DCE
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_195E880(0) [!a4[1240]]    LICM
sub_1C98160(0)                 NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]]    MemorySpaceOpt (1st invocation)
sub_1B7FDF0(3) [!a4[1280]]    Reassociate
sub_1A62BF0(8,0,0,1,1,0,1)    CGSCC pipeline (8 iterations)
sub_1857160()  [!a4[880]]     NVVMReflect (2nd of 3)
sub_1C6FCA0()  [!a4[1840]]    ADCE
sub_1A7A9F0()  [!a4[2720]]    InstructionSimplify
sub_18FD350(0)                 DeadArgElim
sub_1833EB0(3) [!a4[320]]     TailCallElim
sub_18FD350(0)                 DeadArgElim
sub_18EEA90()                  CorrelatedValuePropagation
sub_1869C50(1,0,1)             Sink (MemorySSA-based)
sub_190BB10(0,0)[!a4[960]]     SimplifyCFG
sub_18F5480()  [!a4[760]]     DSE
sub_1CC60B0()  [!a4[2440]]    NVVMSinking2
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1C8A4D0(0)                 EarlyCSE
sub_1857160()  [!a4[880]]     NVVMReflect (3rd)
sub_1A62BF0(8,0,0,1,1,0,1)    CGSCC pipeline (8 iterations)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (3rd)
sub_185D600()  [!a4[920]]     IPConstPropagation
sub_195E880(0) [!a4[1240]]    LICM
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering (4th)
sub_1CB73C0()  [!a4[2120]]    NVVMBranchDist
sub_1A13320()  [!a4[2320]]    NVVMRematerialization
 --> LABEL_84 (finalization)

Key pattern: NVVMIntrinsicLowering runs 4 times, NVVMReflect runs 3 times, NVVMIRVerification runs 5+ times. The CGSCC pipeline is called with 5 and 8 iteration counts (aggressive devirtualization).

Path C: Default -- General Pipeline (~40 passes)

Used for bitcode from external sources (not marked as "ptx" or "mid"). Balances optimization breadth with conservative assumptions about IR maturity.

sub_1A62BF0(4,0,0,1,0,0,1)    LLVM standard pipeline #4
sub_1857160()  [!a4[880]]     NVVMReflect (1st)
sub_1CB4E40(0) [!a4[2000]]    NVVMIntrinsicLowering
sub_1857160()  [!a4[880]]     NVVMReflect (2nd)
sub_1CEF8F0()                  NVVMPeephole
sub_215D9D0()                  NVVMAnnotationsProcessor
sub_1A7A9F0()  [!a4[2720]]    InstructionSimplify
sub_1A62BF0(5,0,0,1,0,0,1)    LLVM standard pipeline #5
sub_185D600()  [!a4[920]]     IPConstPropagation
sub_1B26330()  [!a4[2040]]    MemCpyOpt
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_1A13320()  [!a4[2320]]    NVVMRematerialization
sub_1833EB0(3) [!a4[320]]     TailCallElim
sub_1C6E800()                  GVN
sub_1842BC0()  [!a4[720]]     SCCP
sub_18DEFF0()  [!a4[280]]     DCE
sub_184CD60()  [!a4[1960]]    ConstantMerge
sub_18FD350(0)                 DeadArgElim
sub_18EEA90()                  CorrelatedValuePropagation
sub_1A62BF0(1,0,0,1,0,0,1)    LLVM standard pipeline #1
sub_197E720()                  LoopUnroll
sub_19401A0()  [!a4[1000]]    InstCombine
sub_1857160()  [!a4[880]]     NVVMReflect (3rd)
sub_1A62BF0(7,0,0,1,0,0,1)    LLVM standard pipeline #7
sub_1C8A4D0(0)                 EarlyCSE
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1832270(1)                 InstructionCombining
sub_1869C50(1,0,1)             Sink
sub_1A68E70()                  LoopIdiomRecognize
sub_198DF00(-1)[!a4[1520]]     LoopSimplify
sub_195E880(0) [!a4[1240]]    LICM
sub_190BB10(0,0)[!a4[960]]     SimplifyCFG
sub_19B73C0(3,-1,-1,0,0,-1,0)  LoopUnswitch
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_1C98160(0)                 NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]]    MemorySpaceOpt
sub_1B7FDF0(3) [!a4[1280]]    Reassociate
sub_18B1DE0()  [!a4[2640]]    LoopPass
sub_1952F90(-1)[!a4[1160]]     LoopIndexSplit
sub_18FD350(0)                 DeadArgElim
sub_1CC60B0()  [!a4[2440]]    NVVMSinking2
sub_1A62BF0(2,0,0,1,0,0,1)    LLVM standard pipeline #2
sub_1A223D0()  [!a4[2600]]    NVVMIRVerification
sub_18A3430()  [!a4[1120]]    NVVMPredicateOpt
sub_1A62BF0(4,0,0,1,1,0,1)    LLVM standard pipeline #4 (inlining)
 --> LABEL_84 (finalization)

Key difference from "mid": the default path uses the LLVM standard pipeline wrappers (IDs 1, 2, 4, 5, 7) more heavily, runs SCCP explicitly, includes LoopIdiomRecognize, and invokes LoopUnswitch with conservative parameters (3,-1,-1,0,0,-1,0), most thresholds zeroed or disabled.

Codegen Dispatch -- sub_12DFE00

Address: 0x12DFE00
Size: 20,729 bytes
Signature: int64 sub_12DFE00(int64 passMgr, int64 subtargetInfo, int64 opts)
Called from: Phase 5 of sub_12E54A0 (LABEL_84, line 640)

The codegen dispatch does not simply append passes to the pipeline. It performs a full dependency analysis over every pass already inserted, constructs an ordering graph, and then emits codegen passes in topologically-sorted order. This is necessary because machine-level passes (register allocation, instruction scheduling, frame lowering) have strict ordering dependencies that the flat AddPass model cannot express.

// Pseudocode for sub_12DFE00 (codegen dispatch with dependency analysis)
void CodegenDispatch(PassManager* PM, SubtargetInfo* STI, CompilerOpts* opts) {
    // Step 1: Read optimization level to determine analysis depth
    int opt_level = *(int*)(opts + 200);  // opts[200] = optimization level
    bool do_deps = (opt_level > 1);       // dependency tracking for O2+

    // Step 2: Classify existing passes
    // Iterates PM->passes[0..PM->count], calling two vtable methods per pass
    HashTable dep_graph;   // secondary hash table for dependencies (v134..v137)
    init_hashtable(&dep_graph);

    for (int i = 0; i < PM->count; i++) {
        Pass* p = PM->passes[i];

        // 2a. Check if pass is codegen-only (vtable+112)
        bool is_codegen = p->vtable->isCodeGenOnly(p);   // vtable offset +112
        if (is_codegen)
            continue;  // already classified, skip

        // 2b. Check registration status
        int status = sub_163A1D0(p);   // pass registry check
        sub_163A340(p, &status);       // update status

        // 2c. If pass needs codegen support, mark it in the hash table
        if (pass_needs_codegen(p)) {
            // Set flag |= 2 in the AddPass hash table entry
            // This marks the pass as "codegen-interacting"
            Entry* e = hashtable_find(PM + 80, p);
            if (e) e->flags |= 2;
        }

        // 2d. Build dependency edges (getAnalysisUsage)
        if (do_deps) {
            AnalysisUsage AU;
            p->vtable->getAnalysisUsage(p, &AU);   // vtable offset +16

            // For each required analysis, create an ordering edge
            // in the dependency hash table
            for (AnalysisID* req = AU.required; req; req = req->next) {
                dep_graph_add_edge(&dep_graph, p, req->pass);
            }
        }
    }

    // Step 3: Emit codegen passes in dependency-respecting order
    // Calls the SubtargetInfo hook to get the ordered codegen pass list
    // vtable+16 at STI -> STI->emitCodeGenPasses(PM, dep_graph)
    STI->vtable->emitCodeGenPasses(STI, PM, &dep_graph);
    // Each emitted pass gets a flag:
    //   0 = normal pass (no special ordering)
    //   1 = pass with codegen requirement (flag bit 0 from AddPass)
}

The dependency graph construction is what makes this function 20KB: it must handle the full LLVM analysis dependency model, including transitive dependencies and analysis preservation. The getAnalysisUsage calls return Required, RequiredTransitive, and Preserved sets that define the ordering constraints between passes.

For O0 compilation (opt_level == 0), the dependency tracking is skipped entirely -- codegen passes are emitted in a fixed default order since no optimization passes exist that could create ordering conflicts.
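
The dependency-respecting emission order can be sketched with Kahn's algorithm. This is a hypothetical model, not the decompiled code: sub_12DFE00 uses hash tables for its dependency graph, whereas this sketch (function name `topo_order` invented) uses a fixed-size adjacency matrix, which is enough to show the ordering idea.

```c
#include <assert.h>

/* Hypothetical sketch of dependency-ordered pass emission (Kahn's
 * algorithm).  dep[a][b] = 1 means pass a requires pass b to be
 * emitted first. */
#define MAXP 16

int topo_order(int n, int dep[MAXP][MAXP], int out[MAXP]) {
    int indeg[MAXP] = {0};
    char done[MAXP] = {0};
    int emitted = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            if (dep[a][b]) indeg[a]++;
    while (emitted < n) {
        int next = -1;
        for (int i = 0; i < n; i++)
            if (!done[i] && indeg[i] == 0) { next = i; break; }
        if (next < 0) return -1;            /* dependency cycle */
        out[emitted++] = next;
        done[next] = 1;
        for (int a = 0; a < n; a++)         /* release dependents */
            if (dep[a][next]) indeg[a]--;
    }
    return emitted;
}
```

The O0 shortcut follows naturally: with no dependency edges, every pass has in-degree zero and the loop degenerates into emitting the fixed default order.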

Pass Iteration and Convergence

CGSCC Fixed-Point Iteration

The CGSCC (Call Graph Strongly Connected Component) pass manager sub_1A62BF0 wraps a standard LLVM InlinerWrapper with a configurable iteration count. The first parameter controls how many times the CGSCC pipeline iterates over the call graph:

Pipeline Position             Iteration Count      Context
O1/O2/O3 base (sub_12DE330)   1                    Standard inlining: one pass over the call graph
"mid" path (Ofcmid/Ofcmin)    5                    Aggressive: 5 iterations to resolve indirect calls
Default path (general IR)     1, 2, 4, 5, 7, or 8  Varies by position in pipeline

Higher iteration counts allow the CGSCC framework to resolve more indirect calls through devirtualization. After each iteration, newly-inlined code may expose new call targets, which the next iteration can inline. The diminishing returns typically plateau after 3-5 iterations, which explains NVIDIA's choice of 5 for the "mid" fast-compile path (balancing compile time against code quality).
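
A toy model (illustrative, not cicc code) makes the convergence behavior concrete. In a call chain f_0 -> f_1 -> ..., each indirect call's target only becomes visible once the caller one level up has been inlined, so each CGSCC round can devirtualize one more link; the function name and chain encoding below are invented.

```c
#include <assert.h>
#include <string.h>

/* Toy model of iterative devirtualization over a call chain. */
#define NFUNCS 8

int resolved_after_rounds(int chain_depth, int rounds) {
    int direct[NFUNCS] = {0};   /* direct[i]=1: call in f_i devirtualized */
    for (int r = 0; r < rounds; r++) {
        int prev[NFUNCS];
        memcpy(prev, direct, sizeof prev);   /* snapshot of last round */
        for (int i = 0; i < chain_depth; i++)
            if (i == 0 || prev[i - 1])       /* target exposed by prior round */
                direct[i] = 1;
    }
    int resolved = 0;
    for (int i = 0; i < chain_depth; i++) resolved += direct[i];
    return resolved;
}
```

Under this model a depth-5 chain needs exactly 5 rounds, while deeper chains plateau at the iteration budget, matching the diminishing-returns argument above.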

NVVMReflect Multi-Run Pattern

NVVMReflect (sub_1857160) runs multiple times in the pipeline because NVVM IR may contain __nvvm_reflect("__CUDA_ARCH") calls at different nesting depths. The first run resolves top-level reflect calls to constants. Subsequent optimization passes (inlining, constant propagation, loop unrolling) may expose new reflect calls that were hidden inside inlined functions or unrolled loop bodies. Running NVVMReflect again after these transformations catches these newly-exposed calls.

In the "mid" path, NVVMReflect appears at three distinct positions:

  1. Early (before GVN) -- resolves top-level architecture queries
  2. Mid (after CGSCC inlining and DeadArgElim) -- catches reflect calls exposed by inlining
  3. Late (after LoopSimplify and second CGSCC) -- catches reflect calls exposed by loop transformations
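
What each NVVMReflect run does to an individual call site can be modeled as a string-to-constant fold. This is a minimal sketch (function name invented; the real pass rewrites __nvvm_reflect(...) call instructions in the IR): the query string is mapped to the constant it folds to for the target, e.g. __CUDA_ARCH on sm_75 becomes 750.

```c
#include <assert.h>
#include <string.h>

/* Minimal model of reflect folding (illustrative). */
int fold_reflect(const char *query, int sm_version) {
    if (strcmp(query, "__CUDA_ARCH") == 0)
        return sm_version * 10;   /* sm_75 -> 750 */
    if (strcmp(query, "__CUDA_FTZ") == 0)
        return 0;                 /* assume flush-to-zero disabled */
    return 0;                     /* unknown queries fold to 0 */
}
```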

NVVMIntrinsicLowering Repetition

Similarly, NVVMIntrinsicLowering (sub_1CB4E40) runs 4 times in the "mid" path. Each invocation lowers a different subset of NVVM intrinsics based on what the preceding optimization passes have simplified. The pass takes a level parameter (0 or 1) that controls which lowering rules are active. Level 0 handles basic intrinsic lowering; level 1 handles barrier-related lowering that only becomes safe after certain control flow transformations.
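
The level parameter can be modeled as a gate on a rule table. The rule table below is invented for illustration (the intrinsic names are standard NVVM spellings, but their level assignment here is an assumption): each rule records the minimum level at which it is safe, so an invocation with level 0 applies only basic rules and level 1 additionally applies barrier-related rules.

```c
#include <assert.h>

/* Sketch of level-gated intrinsic lowering (rule table invented). */
typedef struct {
    const char *intrinsic;
    int min_level;   /* 0 = basic lowering, 1 = barrier-related */
} LoweringRule;

static const LoweringRule rules[] = {
    {"llvm.nvvm.read.ptx.sreg.tid.x", 0},
    {"llvm.nvvm.ldg.global.i",        0},
    {"llvm.nvvm.barrier0",            1},
};

/* How many rules are active at a given lowering level. */
int count_applicable(int level) {
    int n = 0;
    for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rules[i].min_level <= level)
            n++;
    return n;
}
```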

NVVMIRVerification as a Convergence Check

NVVMIRVerification (sub_1A223D0) runs after every major transformation group -- not for optimization, but as a correctness invariant check. In the "mid" path it appears at 5+ positions. In the tier 1/2/3 sub-pipeline it appears 4 times (after NVVMIntrinsicLowering, after barrier lowering, after GenericToNVVM, and after the late optimization sequence). If any transformation violates NVVM IR constraints (invalid address space usage, malformed intrinsic signatures, broken metadata), this pass reports the error immediately rather than allowing it to propagate to codegen where diagnosis would be much harder.
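
The fail-fast behavior can be sketched with a toy invariant check. This is illustrative only (the real pass validates address spaces, intrinsic signatures, and metadata; types and names here are invented): the sketch checks just one invariant, that every pointer's address space is in the NVVM range (0-5), and reports the first violation immediately.

```c
#include <assert.h>

/* Toy invariant check modeled on a verifier pass (illustrative). */
typedef struct { int addrspace; } PtrValue;

/* Returns the index of the first violating value, or -1 if clean. */
int verify_addrspaces(const PtrValue *vals, int n) {
    for (int i = 0; i < n; i++)
        if (vals[i].addrspace < 0 || vals[i].addrspace > 5)
            return i;   /* fail fast, before codegen sees it */
    return -1;
}
```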

The Repeat-Until-Clean Philosophy

NVIDIA's pipeline does not use explicit fixed-point loops (run passes until IR stops changing). Instead, it achieves convergence through strategic repetition: the same pass appears at multiple carefully-chosen pipeline positions, with different optimization passes running between repetitions. This is more predictable than a true fixed-point approach because compilation time is bounded by the static pipeline length rather than by how many iterations are needed for convergence. The tradeoff is that the pipeline may not reach a true fixed point -- some optimization opportunities exposed by late passes may not be caught -- but in practice, the multi-position placement catches the vast majority of cases.
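
The cost/quality tradeoff above can be made concrete with a toy contrast (illustrative; all names invented). `simplify()` halves a "complexity" score; the fixed-point driver runs until nothing changes, while the static driver runs a count baked into the pipeline regardless of convergence.

```c
#include <assert.h>

/* Toy contrast between fixed-point iteration and static repetition. */
static int simplify(int complexity) { return complexity / 2; }

int run_fixed_point(int c, int *passes_run) {
    *passes_run = 0;
    for (;;) {
        int next = simplify(c);
        if (next == c) return c;   /* converged */
        c = next;
        (*passes_run)++;
    }
}

int run_static(int c, int repetitions, int *passes_run) {
    *passes_run = repetitions;     /* compile-time cost is bounded up front */
    for (int i = 0; i < repetitions; i++)
        c = simplify(c);
    return c;
}
```

The static driver's pass count is known before compilation starts; the fixed-point driver's is not, which is exactly the predictability argument made above.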

LLVM Standard Pass Pipeline Factory -- sub_1A62BF0

The LLVM standard pass pipeline is invoked multiple times throughout the optimizer via sub_1A62BF0. The first parameter is a pipeline ID that selects which LLVM extension point to inject passes at:

Pipeline ID  LLVM Extension Point                Usage Context
1            EP_EarlyAsPossible / basic cleanup  Tier 0, default path
2            EP_LoopOptimizerEnd                 Default path late
4            EP_ScalarOptimizerLate              Default path, Tier sub-pipeline
5            EP_VectorizerStart                  "mid" path, default path
7            EP_OptimizerLast                    Default path
8            EP_CGSCCOptimizerLate               "mid" path (with opt flag = 1 for inlining)

The signature is sub_1A62BF0(pipelineID, 0, 0, 1, optFlag, 0, 1, outBuf), where the fifth argument, optFlag, enables inlining within the CGSCC sub-pipeline. It is observed as 1 for pipeline ID 8 in the "mid" path (sub_1A62BF0(8,0,0,1,1,0,1)) and for pipeline ID 4 in the default path.

Each call potentially returns a cleanup callback stored in v298, invoked as v298[0](s, s, 3) for destructor/finalization. The factory is called 9+ times across the three language paths.

CompilerOptions Struct Flag Map

The a4 parameter to sub_12E54A0 is a ~4500-byte CompilerOptions struct. The following offsets have been confirmed through cross-referencing guards in the pipeline assembler and tier sub-pipelines.

Offset  Type  Purpose                               Cross-Reference
+200    int   Optimization level (0-3)              sub_12DFE00 codegen depth
+280    bool  Disable DCE                           sub_18DEFF0 guard
+320    bool  Disable TailCallElim                  sub_1833EB0 guard
+360    bool  Disable NVVMLateOpt                   sub_1C46000 guard
+400    bool  Disable inlining variant A            --
+440    bool  Disable inlining variant B            sub_1C4B6F0 guard
+480    bool  Disable inlining variant C            sub_12DE8F0 guard
+520    bool  Disable NVIDIA pass A                 sub_1AAC510 guard
+560    bool  Disable NVIDIA pass B                 sub_1AAC510 guard
+600    bool  Disable NVVMVerifier                  sub_12D4560 guard
+680    bool  Disable FunctionAttrs                 sub_1841180 guard
+720    bool  Disable SCCP                          sub_1842BC0 guard
+760    bool  Disable DSE                           sub_18F5480 guard
+880    bool  Disable NVVMReflect                   sub_1857160 guard
+920    bool  Disable IPConstPropagation            sub_185D600 guard
+960    bool  Disable SimplifyCFG                   sub_190BB10 guard
+1000   bool  Disable InstCombine                   sub_19401A0 guard
+1040   bool  Disable Sink/MemSSA                   sub_1869C50 guard
+1080   bool  Disable PrintModulePass               sub_17060B0 guard
+1120   bool  Disable NVVMPredicateOpt              sub_18A3430 guard
+1160   bool  Disable LoopIndexSplit                sub_1952F90 guard
+1240   bool  Disable LICM                          sub_195E880 guard
+1280   bool  Disable Reassociate                   sub_1B7FDF0 guard
+1320   bool  Disable ADCE variant A                sub_1C76260 guard
+1360   bool  Disable LoopUnroll                    sub_19C1680 guard
+1400   bool  Disable SROA                          sub_1968390 guard
+1440   bool  Disable EarlyCSE                      sub_196A2B0 guard
+1520   bool  Disable LoopSimplify                  sub_198DF00 guard
+1680   bool  Disable NVIDIA pass                   sub_19CE990 guard
+1760   bool  Disable MemorySpaceOpt                sub_1C8E680 guard
+1840   bool  Disable ADCE C                        sub_1C6FCA0 guard
+1960   bool  Disable ConstantMerge                 sub_184CD60 guard
+2000   bool  Disable NVVMIntrinsicLowering         sub_1CB4E40 guard
+2040   bool  Disable MemCpyOpt                     sub_1B26330 guard
+2120   bool  Disable NVVMBranchDist B              sub_1CB73C0 guard
+2200   bool  Disable GenericToNVVM                 sub_1A02540 guard
+2320   bool  Disable NVVMRematerialization         sub_1A13320 guard
+2440   bool  Disable NVVMSinking2                  sub_1CC60B0 guard
+2560   bool  Disable NVVMGenericAddrOpt            sub_1CC71E0 guard
+2600   bool  Disable NVVMIRVerification            sub_1A223D0 guard
+2640   bool  Disable NVVMLoopOpt                   sub_18B1DE0 guard
+2720   bool  Disable InstructionSimplify           sub_1A7A9F0 guard
+2840   bool  Enable ADCE (reversed logic)          sub_1C6FCA0
+2880   bool  Enable LICM (reversed logic)          sub_195E880
+2920   bool  NVVMLowerBarriers param               sub_1C98160
+3000   bool  Extra DeadArgElim trigger             sub_18FD350
+3040   bool  Enable CVP                            sub_18EEA90
+3080   bool  Enable NVIDIA loop pass               sub_1922F90
+3120   bool  Address space optimization flag       sub_1C8E680 param
+3160   bool  Debug dump mode                       sub_17060B0 enable
+3200   bool  Enable advanced NVIDIA group          IPConst/Reflect/SCCP/etc.
+3328   bool  Enable SM-specific passes             Warp/Reduction/Sinking2
+3488   bool  Enable barrier optimization           sub_1C98160, sub_18E4A00
+3528   bool  Tier 1 enable                         Phase 3 loop
+3532   int   Tier 1 phase threshold                Phase 3 loop
+3568   bool  Tier 2 enable                         Phase 3 loop
+3572   int   Tier 2 phase threshold                Phase 3 loop
+3608   bool  Tier 3 enable                         Phase 3 loop
+3612   int   Tier 3 phase threshold                Phase 3 loop
+3648   ptr   Language string ("ptx"/"mid"/"idn")   Phase 1 dispatch
+3656   int   Language string length                Phase 1 dispatch
+3704   bool  Late optimization mode                sub_195E880, sub_1C8A4D0
+3904   bool  Debug: verify after plugins           Phase 3 loop
+3944   bool  Debug: BB naming "F%d_B%d"            Phase 8
+4224   bool  Optimization master switch            Tier 0 gate
+4228   int   Optimization phase threshold          Tier 0 gate
+4304   bool  Device-code flag                      Phase 1 v238
+4384   bool  Fast-compile / bypass pipeline        Top branch Pipeline A vs B
+4464   bool  Disable late CFG cleanup B            Phase 5 sub_1654860
+4480   ptr   SM feature capability                 Phase 6: & 4 = codegen ext
+4488   ptr   Plugin pass array start               Phase 3 loop
+4496   ptr   Plugin pass array end                 Phase 3 loop
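
The guard pattern behind this table can be sketched with offset-based accessors. The accessor names are invented and the struct is modeled as an opaque byte blob; only the offsets come from the table above. Guards such as [!a4[880]] in the pipeline listings are plain byte reads at fixed offsets into this block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of offset-based access to the opaque options blob. */
typedef struct { unsigned char bytes[4512]; } CompilerOptions;

static int opt_bool(const CompilerOptions *o, size_t off) {
    return o->bytes[off] != 0;
}

static int32_t opt_int(const CompilerOptions *o, size_t off) {
    int32_t v;
    memcpy(&v, o->bytes + off, sizeof v);   /* avoid unaligned access */
    return v;
}

/* Mirrors the guard "[!a4[880]]  NVVMReflect". */
int nvvm_reflect_enabled(const CompilerOptions *o) {
    return !opt_bool(o, 880);   /* +880: Disable NVVMReflect */
}
```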

Pass Factory Address Inventory

All unique pass factory addresses called from the pipeline assembler and tier sub-pipelines:

Function                           Address       Calls
NVVMVerifier                       sub_12D4560   many (tiers)
AssumptionCacheTracker             sub_1361950   1
TargetLibraryInfoWrapperPass       sub_149CCE0   1
VerifierPass / BasicAA             sub_14A7550   1
BreakCriticalEdges                 sub_1654860   2
PrintModulePass (debug dump)       sub_17060B0   ~30+
InstructionCombining               sub_1832270   2
TailCallElim / JumpThreading       sub_1833EB0   3
FunctionAttrs                      sub_1841180   3
SCCP                               sub_1842BC0   2
NVVMReflect                        sub_1857160   ~8
IPConstantPropagation              sub_185D600   3
Sink (MemorySSA-based)             sub_1869C50   3
NVVMPredicateOpt variant           sub_18A3090   2
NVVMPredicateOpt / SelectionOpt    sub_18A3430   2
NVVMLoopOpt / BarrierOpt           sub_18B1DE0   3
Sinking2Pass (fast=1 for fc mode)  sub_18B3080   1
DCE                                sub_18DEFF0   4
NVVMBarrierAnalysis                sub_18E4A00   1
CorrelatedValuePropagation         sub_18EEA90   3
DSE                                sub_18F5480   2
DeadArgElimination                 sub_18FD350   5
SimplifyCFG                        sub_190BB10   4
NVIDIA-specific loop pass          sub_1922F90   1
LoopIndexSplit                     sub_1952F90   3
LICM / LoopRotate                  sub_195E880   4
SROA                               sub_1968390   2
EarlyCSE                           sub_196A2B0   2
LoopUnroll                         sub_197E720   1
LoopSimplify                       sub_198DF00   3
SROA (variant)                     sub_198E2A0   1
InstCombine                        sub_19401A0   2
LoopUnswitch (7 params)            sub_19B73C0   3
LoopUnroll variant                 sub_19C1680   2
NVIDIA custom pass                 sub_19CE990   1
GenericToNVVM                      sub_1A02540   1
NVVMRematerialization              sub_1A13320   3
NVVMIRVerification                 sub_1A223D0   5+
LLVM StandardPassPipeline          sub_1A62BF0   ~9
LoopIdiomRecognize                 sub_1A68E70   1
InstructionSimplify                sub_1A7A9F0   3
NVIDIA-specific pass               sub_1AAC510   1
MemCpyOpt                          sub_1B26330   4
Reassociate                        sub_1B7FDF0   3
TTIWrapperPass                     sub_1BFB520   1
NVVMLateOpt                        sub_1C46000   1
Inliner / AlwaysInline             sub_1C4B6F0   2
NewGVN / GVNHoist                  sub_1C6E560   1
GVN                                sub_1C6E800   2
ADCE                               sub_1C6FCA0   2
ADCE variant                       sub_1C76260   2
NVVMWarpShuffle                    sub_1C7F370   1
EarlyCSE / GVN variant             sub_1C8A4D0   3
MemorySpaceOptimization            sub_1C8E680   4
NVVMLowerBarriers                  sub_1C98160   4
NVVMLowerBarriers variant          sub_1C98270   1
ProfileSummaryInfo                 sub_1CB0F50   1
NVVMIntrinsicLowering              sub_1CB4E40   ~10
NVVMBranchDist                     sub_1CB73C0   3
NVVMLowerAlloca                    sub_1CBC480   1
NVVMUnreachableBlockElim           sub_1CC3990   1
NVVMReduction                      sub_1CC5E00   1
NVVMSinking2                       sub_1CC60B0   3
NVVMGenericAddrOpt                 sub_1CC71E0   1
NVVMFinalLowering                  sub_1CEBD10   1
NVVMPeephole                       sub_1CEF8F0   2
NVVMAnnotationsProcessor           sub_215D9D0   2

Total unique pass factory addresses: ~55.

Function Map

Function                                   Address       Size    Role
NVVMPassOptions::init                      sub_12D6300   125KB   Populates 4,512-byte options struct
writeStringOption                          sub_12D6090   ~100B   Writes 24-byte string slot
writeBoolOption                            sub_12D6100   ~80B    Writes 16-byte boolean slot
PassOptionRegistry::lookupOption           sub_12D6170   ~200B   Hash table lookup
getBoolOption                              sub_12D6240   ~300B   Boolean resolution with default
PassDefTable::getPassDef                   sub_1691920   ~50B    64-byte stride table lookup
parseInt                                   sub_16D2BB0   ~100B   String to int64
Pipeline assembler (master)                sub_12E54A0   49.8KB  8-phase pipeline construction
AddPass                                    sub_12DE0B0   3.5KB   Hash-table-based insertion
Tier 0 sub-pipeline                        sub_12DE330   4.8KB   ~40 passes, full optimization
Tier 1/2/3 sub-pipeline                    sub_12DE8F0   17.9KB  Phase-conditional, incremental
Codegen dispatch                           sub_12DFE00   20.7KB  Dependency-ordered codegen
Phase I/II orchestrator                    sub_12E7E70   9.4KB   Two-phase state machine
New PM registration                        sub_2342890   ~50KB   2,816 lines, 35 NVIDIA + ~350 LLVM
registerPass (hash insert)                 sub_E41FB0    ~300B   StringMap insertion
Pass name prefix matcher                   sub_2337DE0   ~100B   starts_with comparison
Parameterized pass parser                  sub_234CEE0   ~200B   Extracts <params>
MemorySpaceOpt param parser                sub_23331A0   ~300B   first-time/second-time/warnings
New PM pipeline driver                     sub_226C400   35KB    nvopt<O0/O1/O2/O3/Ofcmax/Ofcmid/Ofcmin> selection
New PM text parser (buildDefaultPipeline)  sub_2277440   60KB    Parses pipeline name strings
nvopt registration (new PM)                sub_225D540   ~32KB   Pipeline element vtable at 0x4A08350
nvopt registration (legacy PM)             sub_12C35D0   ~500B   Pipeline element vtable at 0x49E6A58
nvopt object initializer                   sub_12EC960   ~100B   Creates 512-byte pipeline object
LLVM standard pipeline factory             sub_1A62BF0   varies  Pipeline IDs 1,2,4,5,7,8
Pass registry check                        sub_163A1D0   ~100B   Pass registration status
Pass status update                         sub_163A340   ~100B   Used in codegen dispatch
Pipeline text tokenizer                    sub_2352D90   ~200B   Tokenizes nvopt<> strings

Reimplementation Checklist

  1. Two-phase compilation model. Implement a TLS phase variable (values 1=Phase I, 2=Phase II, 3=done) read by individual passes to skip themselves when the current phase does not match their intended execution phase. Phase I runs whole-module analysis; Phase II runs per-function codegen-oriented passes.
  2. Pipeline assembly function (~150 AddPass calls). Build the master pipeline at runtime using hash-table-based pass insertion (AddPass), with language-specific dispatch (paths for "ptx", "mid", and default), tier-based interleaving (Tiers 0--3 fired by accumulated pass-count thresholds), and phase-conditional pass inclusion.
  3. NVVMPassOptions system (4,512-byte struct, 221 slots). Implement the proprietary per-pass enable/disable and parametric knob system with 114 string + 100 boolean + 6 integer + 1 string-pointer option slots, parsed from CLI flags and routed to individual passes.
  4. Concurrent per-function compilation. After Phase I completes on the whole module, split Phase II across a thread pool sized to get_nprocs() or GNU Jobserver token count, with per-function bitcode extraction, independent compilation, and re-linking of results.
  5. GNU Jobserver integration. Parse --jobserver-auth=R,W from MAKEFLAGS environment variable, create a token management pipe, and spawn a pthread to throttle concurrent compilations to the build system's -j level.
  6. Split-module compilation. Implement the -split-compile=N mechanism: decompose multi-function modules into per-function bitcode blobs via filter callbacks, compile each independently (potentially in parallel), re-link results, and restore linkage attributes from a hash table.
  7. Tier 0 full optimization sub-pipeline. Assemble the ~40-pass Tier 0 sequence: BreakCriticalEdges, GVN, NVVMReflect, SCCP, NVVMVerifier, LoopIndexSplit, ADCE, LICM, LoopUnroll, InstCombine, SROA, EarlyCSE, LoopUnswitch, SimplifyCFG, NVVMRematerialization, DSE, DCE, with per-pass NVVMPassOptions gating.
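
Checklist item 5 can be sketched directly. This handles only the classic "--jobserver-auth=R,W" file-descriptor form that GNU make exports through MAKEFLAGS (newer make versions can also pass a fifo: path, not covered here); the function name is invented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the jobserver pipe descriptors from a MAKEFLAGS string. */
int parse_jobserver_auth(const char *makeflags, int *rfd, int *wfd) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (!p)
        return -1;                   /* no jobserver advertised */
    p += strlen("--jobserver-auth=");
    if (sscanf(p, "%d,%d", rfd, wfd) != 2)
        return -1;                   /* unsupported auth format */
    return 0;
}
```

A compiler would read one token from rfd before starting each extra worker thread and write it back to wfd when the thread finishes, keeping total parallelism within the build's -j budget.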

Cross-References