LTO & Module Optimization
CICC v13.0 implements Link-Time Optimization as a five-pass pipeline that exploits the GPU's closed-world compilation model for optimization opportunities unavailable to CPU compilers. In CPU LTO, the linker merges partially-optimized object files and runs a second round of optimization on the combined module. The fundamental constraint is that shared libraries, dynamic loading, and symbol interposition limit what the optimizer can assume about the complete program. On GPU, none of these constraints exist. Every __device__ function that can execute on the hardware must be statically visible at compile time -- there is no device-side dlopen, no .so files, no PLT/GOT, no symbol preemption. This closed-world guarantee means the LTO pipeline can inline aggressively across translation units, devirtualize every virtual call site against a complete class hierarchy, and promote or split global variables with full knowledge that no external observer will access the original symbols.
The LTO pipeline runs after the main LLVM optimizer (tier 0-3 passes) has performed per-module optimization. It is triggered when cicc processes bitcode from separate compilation (nvcc --device-c / -dc mode), where each .cu file compiles to a relocatable device object containing LLVM bitcode in the NVVM container. The device linker (nvlink) merges these objects and reinvokes cicc in LTO mode, passing the combined bitcode through the LTO pipeline before final PTX emission. In whole-program compilation (the default), the pipeline is still partially active -- GlobalOpt and the inliner run regardless, but the summary-based import machinery is skipped because there is only one module.
| Item | Value |
|---|---|
| LTO pipeline entry | sub_12F5F30 (0x12F5F30, 37.8 KB) |
| NVModuleSummary driver | sub_D81040 (0xD81040, 56 KB) |
| Summary builder | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| Address range (summary cluster) | 0xD60000--0xD82000 |
| Address range (import/inline cluster) | 0x1850000--0x186CA00 |
| NVVM container IRLevel for LTO | NVVM_IR_LEVEL_LTO (value 1) |
| Compile mode for separate compilation | NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) |
| Module flags read | EnableSplitLTOUnit, UnifiedLTO, ThinLTO |
Why LTO Matters for GPU
Three properties of GPU execution make LTO dramatically more valuable than on CPU:
Function calls are expensive. Every GPU function call marshals arguments through the .param calling convention via st.param / ld.param instruction sequences. A function with 8 struct arguments can generate hundreds of cycles of marshaling overhead that inlining eliminates entirely. Cross-module inlining -- which requires LTO -- is the primary mechanism for removing this cost for functions defined in separate translation units. See the inliner cost model for the full cost analysis.
Register pressure determines performance. Occupancy is bounded by per-thread register usage, with discrete cliff boundaries. Call boundaries force the backend to save and restore registers across the call site, often spilling to local memory (device DRAM, 200-800 cycle latency). LTO enables cross-module inlining, which in turn enables cross-function register allocation -- the single most impactful optimization for GPU code.
Indirect calls are catastrophic. An indirect call in PTX (call.uni through a register) prevents backend inlining, forces full register spills, destroys instruction scheduling freedom, and creates warp-divergence hazards. Whole-program devirtualization, which requires LTO-level visibility of the complete type hierarchy, converts indirect calls to direct calls and enables all downstream optimizations.
Regular LTO vs ThinLTO
CICC supports both regular (monolithic) LTO and ThinLTO. The LTO driver at sub_D81040 reads three module flags via sub_BA91D0 to determine which mode is active:
| Module Flag | Effect |
|---|---|
| EnableSplitLTOUnit | Enables the split LTO unit mechanism for type metadata |
| UnifiedLTO | Enables LLVM's unified LTO pipeline (combined thin+regular) |
| ThinLTO | Activates summary-based import and the two-phase declaration merge in sub_D7D4E0 |
Regular LTO merges all translation units into a single LLVM module, then runs the full optimization pipeline on the merged result. This gives the optimizer complete visibility but has O(n) memory cost in the total program size and serializes compilation. For GPU programs this is often acceptable because device code is typically smaller than host code.
ThinLTO builds per-module summaries (via NVModuleSummary), uses the summaries to make import decisions without loading full bitcode, then imports selected functions and optimizes each module independently. The builder's a8 parameter (thinlto_mode flag) activates Phase 2 of the summary builder, which performs a second walk over declarations to merge forward-declared and defined symbol tables. This mode enables parallel per-module optimization at the cost of less global visibility.
In practice, NVIDIA's toolchain (nvcc + nvlink) uses regular LTO as the default for device code, because the closed-world model and relatively small code size (compared to CPU programs) make the memory and compile-time cost acceptable. ThinLTO is available for large CUDA programs where compile time is a concern, activated by passing -dlto to nvcc (device LTO) or -flto=thin through the driver.
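The flag-driven mode selection can be sketched as a small dispatch over the three module flags. The precedence order shown here is an assumption, since the actual check sequence inside sub_D81040 is not documented; the return labels are illustrative:

```python
def select_lto_mode(module_flags):
    """Toy version of the mode check in sub_D81040: read the three
    module flags listed above and pick a pipeline variant.
    Precedence (unified > thin > regular) is an assumption."""
    if module_flags.get("UnifiedLTO"):
        return "unified"   # LLVM's combined thin+regular pipeline
    if module_flags.get("ThinLTO"):
        return "thin"      # summary-based import, two-phase decl merge
    return "regular"       # monolithic merge-then-optimize
```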
LTO Pipeline
The LTO pipeline executes five major passes in a fixed order. Each pass consumes the output of its predecessor:
┌────────────────────────────────────────────────────────────────────────┐
│ NVVM Container (IRLevel=1) │
│ LLVM Bitcode + Module Flags │
└────────────────────┬───────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. NVModuleSummary Builder (sub_D7D4E0, 74 KB) │
│ Build per-function summaries with 4-level import priority, │
│ complexity budget, CUDA attribute flags, call graph edges │
└────────────────────┬──────────────────────────────────────────-┘
│ ModuleSummaryIndex
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ThinLTO Function Import (sub_1854A20, 4.3 KB) │
│ Summary-guided cross-module import with floating-point │
│ threshold computation, priority-class multipliers, │
│ global import budget cap │
└────────────────────┬──────────────────────────────────────────-┘
│ Materialized functions + thinlto_src_module metadata
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. Inliner (sub_1864060 + sub_2613930 + sub_38576C0) │
│ Four parallel cost models: NVIDIA custom (20K budget), │
│ LLVM standard (225), New PM CGSCC + ML, NVPTX target │
└────────────────────┬──────────────────────────────────────────-┘
│ Inlined module
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. GlobalOpt (sub_18612A0, 65 KB) │
│ Small-constant promotion (≤2047 bits), SRA for structs │
│ (≤16 fields), malloc/free elimination, address-space-aware │
└────────────────────┬──────────────────────────────────────────-┘
│ Optimized globals
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. WholeProgramDevirtualization (sub_2703170, 13 KB) │
│ Type-test metadata → vtable resolution → direct calls │
│ Red-black tree for type info lookup, 0x90-byte records │
└────────────────────┬──────────────────────────────────────────-┘
│
▼
Dead Kernel Elimination + GlobalDCE
→ Standard optimizer pipeline (tier 0-3)
→ Code generation + PTX emission
The LTO pipeline entry at sub_12F5F30 (37.8 KB) orchestrates this sequence and also runs dead kernel elimination -- removing __global__ functions that are never referenced by host-side kernel launches. This is a GPU-specific optimization: on CPU, the linker preserves all externally-visible entry points, but in GPU LTO the compiler knows the complete set of kernel launch sites from the host code.
LTO Pipeline Entry -- sub_12F5F30 Algorithm
sub_12F5F30 (0x12F5F30, 37,797 bytes) is the top-level LTO orchestrator. It is called after the CLI parser (sub_12F7D90) has resolved the compilation mode bitmask and the LTO argument vector has been populated from the -Xlto forwarding meta-flag. The function operates in three distinct modes determined by the mode bitmask in a13:
| Mode | Bitmask | CLI Flag | Behavior |
|---|---|---|---|
| gen-lto | 0x21 | -gen-lto | Emit partially-optimized bitcode for later linking. No dead-kernel pass. |
| full LTO | 0x23 | -lto | Full merge + optimize + dead-kernel elimination + emit PTX. |
| link-lto | 0x26 | -link-lto | Link pre-existing LTO bitcode modules, run full pipeline. |
The function's argument list is reconstructed from the LTO output vector v330 (the fourth CLI routing vector, populated by -Xlto and the six -host-ref-* flags). It receives the merged LLVM module, the host reference tables, and the compilation options struct.
Pseudocode: sub_12F5F30 Top-Level
function sub_12F5F30(module, lto_args, options, error_cb):
# ---- Phase A: Parse LTO-specific arguments ----
mode = NONE
trace_enabled = false
optimize_unused_vars = false
has_host_info = false
device_c = false
host_refs = HostRefTable{} # 6-field table: ek, ik, ec, ic, eg, ig
force_device_c = false
for arg in lto_args:
switch arg:
case "-gen-lto": mode = GEN_LTO
case "-lto": mode = FULL_LTO
case "-link-lto": mode = LINK_LTO
case "-olto": lto_opt_level = next_arg()
case "--device-c": device_c = true
case "--force-device-c": force_device_c = true
case "--trace": trace_enabled = true
case "-optimize-unused-variables": optimize_unused_vars = true
case "-has-global-host-info": has_host_info = true
case "-host-ref-ek=*": host_refs.ek = parse_symbol_list(value)
case "-host-ref-ik=*": host_refs.ik = parse_symbol_list(value)
case "-host-ref-ec=*": host_refs.ec = parse_symbol_list(value)
case "-host-ref-ic=*": host_refs.ic = parse_symbol_list(value)
case "-host-ref-eg=*": host_refs.eg = parse_symbol_list(value)
case "-host-ref-ig=*": host_refs.ig = parse_symbol_list(value)
# ---- Phase B: Build preserved-symbol sets ----
# Collect symbols from llvm.used and llvm.metadata named metadata
used_set = collect_named_metadata(module, "llvm.used")
metadata_set = collect_named_metadata(module, "llvm.metadata")
# Merge host reference tables into a unified "referenced from host" set.
# The 6 host-ref flags encode three entity types x two reference modes:
# e = explicit reference (symbol name appears in host launch site)
# i = implicit reference (symbol address taken on host side)
# k = kernel (__global__), c = constant (__constant__), g = global (__device__)
host_referenced_kernels = host_refs.ek UNION host_refs.ik
host_referenced_constants = host_refs.ec UNION host_refs.ic
host_referenced_globals = host_refs.eg UNION host_refs.ig
# ---- Phase C: Decide what to preserve ----
preserved = used_set UNION metadata_set UNION host_referenced_kernels
if NOT optimize_unused_vars:
preserved = preserved UNION host_referenced_constants
UNION host_referenced_globals
# ---- Phase D: Dead kernel/variable elimination ----
if mode == GEN_LTO:
# gen-lto: emit bitcode only, skip elimination
return emit_lto_bitcode(module)
if has_host_info:
dead_kernel_elimination(module, preserved, trace_enabled)
if optimize_unused_vars:
dead_variable_elimination(module, preserved,
host_referenced_constants,
host_referenced_globals,
trace_enabled)
# ---- Phase E: Run the 5-pass LTO pipeline ----
if mode == LINK_LTO or mode == FULL_LTO:
run_module_summary_builder(module) # sub_D7D4E0 via sub_D81040
run_thinlto_import(module) # sub_1854A20 (if ThinLTO)
run_inliner(module) # sub_1864060 + sub_2613930
run_globalopt(module) # sub_18612A0
run_whole_program_devirt(module) # sub_2703170
run_global_dce(module) # final GlobalDCE sweep
# ---- Phase F: Hand off to optimizer pipeline ----
return module # returned to sub_12E7E70 for tier 0-3 passes
Host Reference Flag Encoding
The six -host-ref-* flags are the mechanism by which nvlink communicates host-side symbol usage to cicc's LTO pass. nvlink inspects the host-side relocatable objects and emits a semicolon-separated list of device symbol names for each flag. The two-letter suffix encodes:
| Suffix | Entity Type | Reference Kind |
|---|---|---|
| -host-ref-ek | __global__ kernel | Explicit (launch site in host code) |
| -host-ref-ik | __global__ kernel | Implicit (address taken, e.g. &myKernel) |
| -host-ref-ec | __constant__ variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ic | __constant__ variable | Implicit (address taken) |
| -host-ref-eg | __device__ global variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ig | __device__ global variable | Implicit (address taken) |
The -has-global-host-info flag signals that nvlink has provided complete host reference information. When this flag is absent, sub_12F5F30 conservatively preserves all externally-visible symbols -- the dead kernel/variable elimination pass is skipped entirely.
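To make the encoding concrete, here is a minimal runnable model of the flag parsing and the Phase B/C preservation decision from the pseudocode above. `parse_symbol_list` and `build_preserved_set` are illustrative names, and representing the host reference table as a dict of sets is an assumption (the binary uses a 6-field struct):

```python
def parse_symbol_list(value):
    # nvlink emits a semicolon-separated list of device symbol names
    # for each -host-ref-* flag; drop empty entries.
    return {s for s in value.split(";") if s}

def build_preserved_set(host_refs, used_set, metadata_set,
                        optimize_unused_vars):
    # Mirrors Phases B/C of sub_12F5F30: host-referenced kernels are
    # always preserved; constants and globals are preserved only when
    # -optimize-unused-variables is NOT passed.
    preserved = set(used_set) | set(metadata_set)
    preserved |= host_refs.get("ek", set()) | host_refs.get("ik", set())
    if not optimize_unused_vars:
        for key in ("ec", "ic", "eg", "ig"):
            preserved |= host_refs.get(key, set())
    return preserved
```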
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_12F5F30 | 0x12F5F30 | 37.8 KB | LTO pipeline entry and dead-symbol orchestrator |
| sub_12F5610 | 0x12F5610 | 7.3 KB | LLVM module linker wrapper (Linker::linkModules) |
| sub_12F7D90 | 0x12F7D90 | 14.3 KB | CLI argument parser (architecture, opt level, flags) |
| sub_12F4060 | 0x12F4060 | 15.7 KB | TargetMachine creation with NVIDIA options |
| sub_1C13840 | 0x1C13840 | -- | Global/function iterator used for dead-code sweep |
| sub_12F1650 | 0x12F1650 | 5.2 KB | Bitcode reader variant A |
| sub_12F11C0 | 0x12F11C0 | 5.2 KB | Bitcode reader variant B |
Dead Kernel Elimination Algorithm
Dead kernel elimination is the most impactful GPU-specific optimization in the LTO pipeline. It exploits the closed-world model: every __global__ function that will ever execute must have a corresponding <<<>>> launch site (or cudaLaunchKernel call) in the host code that nvlink has already seen. Any kernel not in the host reference set is dead.
This pass cannot exist on CPU. A CPU linker must preserve all non-hidden external functions because shared libraries loaded at runtime via dlopen could call them. On GPU there is no dlopen, no dynamic symbol resolution, no PLT. The set of reachable kernels is completely determined at link time.
Pseudocode: dead_kernel_elimination
function dead_kernel_elimination(module, preserved_set, trace):
# Walk all functions in the module via sub_1C13840 iterator
worklist = []
for func in module.functions():
if func.isDeclaration():
continue
cc = func.getCallingConv()
# PTX calling convention 71 = __global__ (kernel entry point)
# PTX calling convention 72 = __device__ (device function)
# PTX calling convention 95 = CUDA internal (managed init)
if cc != 71:
continue # only eliminate kernels, not device functions
name = func.getName()
if name in preserved_set:
continue # referenced from host, or in llvm.used -- keep it
# This kernel has no host launch site.
if trace:
emit_diagnostic("no reference to kernel " + name)
worklist.append(func)
# ---- Remove dead kernels ----
removed = 0
for func in worklist:
# Before erasing, check whether any device-side references exist.
# On GPU, device-side function pointers (callback patterns) can reference
# kernels via address-of. Check use_empty():
if NOT func.use_empty():
# Has device-side users -- cannot safely remove.
# (This is rare: kernels are almost never called from device code.)
continue
# use_empty() holds, so no replaceAllUsesWith is needed before erase.
func.eraseFromParent()
removed += 1
return removed
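A minimal executable model of the pass, assuming a toy function representation (dicts carrying name, calling convention, declaration flag, and device-side use count) in place of llvm::Function:

```python
KERNEL_CC = 71  # PTX calling convention for __global__ in this binary

def dead_kernel_elimination(functions, preserved, trace=False):
    """Toy model of the pass above: return the names of kernels that
    would be erased. `functions` is a list of dicts with 'name', 'cc',
    'is_decl', and 'num_uses' (device-side uses)."""
    removed = []
    for f in functions:
        if f["is_decl"] or f["cc"] != KERNEL_CC:
            continue              # only defined kernels are candidates
        if f["name"] in preserved:
            continue              # launched from host, or in llvm.used
        if f["num_uses"] > 0:
            continue              # device-side address taken: keep it
        if trace:
            print("no reference to kernel " + f["name"])
        removed.append(f["name"])
    return removed
```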
Pseudocode: dead_variable_elimination
When -optimize-unused-variables is enabled, the same logic extends to __device__ and __constant__ global variables:
function dead_variable_elimination(module, preserved_set,
host_constants, host_globals, trace):
worklist = []
for gv in module.globals():
if gv.isDeclaration():
continue
name = gv.getName()
if name in preserved_set:
continue
as = gv.getAddressSpace()
# Address space 1 = global, address space 4 = constant
if as == 4 and name NOT in host_constants:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
elif as == 1 and name NOT in host_globals:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
removed = 0
for gv in worklist:
if NOT gv.use_empty():
continue # still referenced from device code
gv.eraseFromParent()
removed += 1
return removed
The --trace-lto CLI flag (which maps to --trace in the LTO argument vector via the flag catalog at line 2394) enables the diagnostic messages. When active, cicc prints one line per eliminated symbol to stderr, enabling build-system integration and debugging of unexpected kernel removal.
Module Merge Process
Before sub_12F5F30 can perform dead-kernel elimination or any LTO optimization, the separate-compilation bitcode modules must be merged into a single LLVM module. This merge happens in two layers: the NVIDIA module linker wrapper sub_12F5610 (7.3 KB) and the underlying LLVM IRLinker at sub_16786A0 (61 KB).
Two-Level Linking Architecture
nvlink extracts .nv_fatbin bitcode sections
|
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Loader (sub_12C06E0, 63 KB) │
│ - Validates LLVM bitcode magic (0xDEC0170B or 0x4243C0DE) │
│ - Checks IR version via sub_12BFF60 │
│ - Validates target triple (must be "nvptx64-*") │
│ - Single-module fast path: return directly if N=1 │
│ - Multi-module: normalize triples, set matching DataLayout │
└─────────────────────┬───────────────────────────────────────┘
│ N validated modules
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Linker Wrapper (sub_12F5610, 7.3 KB) │
│ - Selects primary module (typically the largest) │
│ - For each secondary module: │
│ Copy triple from primary → secondary │
│ Call IRLinker to merge secondary into primary │
│ - Post-link: restore linkage attributes from hash table │
│ Values 7-8: external linkage (low 6 bits) │
│ Other: set low 4 bits + visibility from bits 4-5 │
│ Set dso_local flag (byte+33 |= 0x40) │
└─────────────────────┬───────────────────────────────────────┘
│
v
┌─────────────────────────────────────────────────────────────┐
│ LLVM IRLinker::run (sub_16786A0, 61 KB) │
│ - Allocates 0x2000-byte DenseMap for symbol resolution │
│ - Hash function: (addr >> 9) ^ (addr >> 4) │
│ - Resolves COMDAT groups (sub_167DAB0, 39 KB) │
│ - Links global value prototypes (sub_1675980, 37 KB) │
│ - Links function bodies (sub_143B970, 14 KB) │
│ - Merges named metadata (llvm.dbg.cu, llvm.used, etc.) │
│ - Resolves llvm.global_ctors / llvm.global_dtors ordering │
│ - Maps values across modules via DenseMap<Value*, Value*> │
│ - Tombstone sentinels: empty=-8, deleted=-16 │
└─────────────────────┬───────────────────────────────────────┘
│ single merged module
v
sub_12F5F30 (LTO pipeline entry)
Pseudocode: Module Merge (sub_12F5610 + sub_12C06E0)
function module_merge(module_list, llvm_ctx, options):
# ---- Step 1: Load and validate all modules (sub_12C06E0) ----
modules = []
for entry in module_list:
buf = open_buffer(entry.data, entry.length, entry.name)
# Validate bitcode magic: 0x0B17C0DE is the bitcode wrapper header,
# 0xDEC04342 is raw bitcode ('B','C',0xC0,0xDE read as little-endian u32).
# The loader diagram above lists the same constants in byte-sequence order.
magic = read_u32(buf, 0)
if magic != 0x0B17C0DE and magic != 0xDEC04342:
error("invalid bitcode: " + entry.name)
return NULL
module = parse_bitcode(buf, llvm_ctx) # sub_15099C0
# Check IR version compatibility (sub_12BFF60)
if ir_version_check(module_list, module, flags) != 0:
error(entry.name + ": error: incompatible IR detected. "
"Possible mix of compiler/IR from different releases.")
return NULL
# Validate target triple
triple = module.getTargetTriple()
if NOT triple.startswith("nvptx64-"):
error("Module does not contain a triple, "
"should be 'nvptx64-'")
return NULL
modules.append(module)
# ---- Step 2: Single-module fast path ----
if len(modules) == 1:
return modules[0]
# ---- Step 3: Multi-module linking (sub_12F5610) ----
# Save linkage attributes before linking (they get modified)
linkage_map = DenseMap<StringRef, u8>{}
for module in modules:
for func in module.functions():
linkage_map[func.getName()] = func.getLinkage()
for gv in module.globals():
linkage_map[gv.getName()] = gv.getLinkage()
# Select primary module and link secondaries into it
primary = modules[0]
for i in range(1, len(modules)):
secondary = modules[i]
# Normalize: copy DataLayout from primary to secondary
secondary.setDataLayout(primary.getDataLayout())
secondary.setTargetTriple(primary.getTargetTriple())
# IRLinker::run (sub_16786A0)
# Resolves COMDATs, links globals, maps values, merges metadata
err = Linker::linkModules(primary, secondary)
if err:
error("<module_name>: link error: <details>")
return NULL
# ---- Step 4: Restore linkage attributes ----
# During linking, LLVM may promote linkage (e.g., internal -> external)
# to resolve cross-module references. Restore the original linkage
# where possible, preserving the correct visibility for PTX emission.
for func in primary.functions():
name = func.getName()
if name in linkage_map:
original = linkage_map[name]
if original in [7, 8]: # external linkage variants
func.setLinkage(original & 0x3F)
else:
func.setLinkage(original & 0x0F)
if (original & 0x30) != 0:
func.setVisibility(original >> 4)
func.setDSOLocal(true) # byte+33 |= 0x40
for gv in primary.globals():
# same linkage restoration logic
...
return primary
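The bit manipulation in Step 4 can be modeled directly. The field encoding (low 6 bits kept for external linkage values 7-8, low 4 bits plus a visibility field in bits 4-5 otherwise, dso_local always set) is as reconstructed from the binary and may be incomplete:

```python
def restore_linkage(original):
    """Model of the post-link linkage restore: `original` is the saved
    linkage byte; returns (linkage, visibility, dso_local)."""
    if original in (7, 8):             # external linkage variants
        linkage = original & 0x3F      # keep the low 6 bits
        visibility = 0
    else:
        linkage = original & 0x0F      # low 4 bits = linkage kind
        # visibility lives in bits 4-5 when nonzero
        visibility = (original >> 4) & 0x3 if (original & 0x30) else 0
    return linkage, visibility, True   # dso_local (byte+33 |= 0x40)
```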
Key Data Structures in the Merge
| Structure | Location | Details |
|---|---|---|
| Value map DenseMap | Allocated in sub_16786A0 | 0x2000 bytes (8192), hash: (addr >> 9) ^ (addr >> 4), quadratic probing |
| Linkage hash table | Stack-allocated in sub_12E1EF0 (v362) | Maps StringRef name to original linkage byte |
| Function-to-module map | Stack-allocated in sub_12E1EF0 (v359) | Maps StringRef name to function pointer for split-module dispatch |
| COMDAT group map | Internal to sub_167DAB0 | Tracks COMDAT selection kinds: any / exact-match / largest / no-dup / same-size |
| Named metadata merge list | Internal to sub_1671B40 | Special handling for llvm.dbg.cu, llvm.used, llvm.compiler.used, llvm.global_ctors, llvm.global_dtors, llvm.global.annotations |
| Module config flag | dword_4F99BC0 | Controls linker behavior variant |
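A sketch of the value-map lookup, assuming the hash and tombstone constants from the table and a simplified quadratic probe (the real DenseMap remembers the first tombstone and keeps probing before reusing it; this version reuses it immediately):

```python
EMPTY, DELETED = -8, -16   # tombstone sentinels from sub_16786A0

def dense_hash(addr):
    # Hash used by the IRLinker value map: mixes mid and low pointer bits.
    return (addr >> 9) ^ (addr >> 4)

def find_slot(table, addr):
    """Quadratic probing over a power-of-two table: return the index
    where `addr` lives or would be stored."""
    mask = len(table) - 1
    idx = dense_hash(addr) & mask
    step = 1
    while table[idx] not in (EMPTY, addr):
        if table[idx] == DELETED:
            return idx           # reuse tombstone slot (simplified)
        idx = (idx + step) & mask
        step += 1
    return idx
```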
Split-Module Compilation and Re-Linking
When concurrent compilation is active (thread count > 1 and multiple defined functions), the optimization pipeline uses a split-module strategy: each function is extracted into its own bitcode module, optimized independently in a thread pool, and then re-linked. The split/re-link cycle uses the same sub_12F5610 linker wrapper:
- Split (sub_1AB9F40): extracts per-function bitcode using a filter callback (sub_12D4BD0) that selects a single function by name from the function-to-module hash table.
- Optimize (thread pool via sub_16D5230): each worker runs sub_12E86C0 (Phase II optimizer) with qword_4FBB3B0 = 2.
- Re-link (sub_12F5610): merges all per-function bitcode modules back into a single module.
- Restore linkage (v362 hash table): the linkage attributes saved before the split are written back to prevent linkage promotion artifacts.
This cycle is orchestrated by sub_12E1EF0 (51 KB, the top-level concurrent compilation entry). The GNU Jobserver integration (sub_16832F0) throttles thread pool size to match the build system's -j level when cicc is invoked from make.
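The split/optimize/re-link cycle can be sketched with a thread pool over per-function shards. The function names and the dict-based module representation here are illustrative, not from the binary:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_optimize_relink(module, optimize_fn, max_workers=None):
    """Sketch of the concurrent-compilation cycle: `module` is a dict
    of {function_name: ir}; `optimize_fn` optimizes one single-function
    shard and returns it."""
    max_workers = max_workers or os.cpu_count()
    # Split: one module per defined function (sub_1AB9F40 analogue).
    shards = [{name: ir} for name, ir in module.items()]
    # Optimize each shard in the thread pool (sub_16D5230 analogue).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        optimized = list(pool.map(optimize_fn, shards))
    # Re-link: merge shards back into one module (sub_12F5610 analogue).
    merged = {}
    for shard in optimized:
        merged.update(shard)
    return merged
```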
Separate Compilation and the NVVM Container
When nvcc --device-c compiles a .cu file, cicc produces an NVVM container with CompileMode = NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) and IRLevel = NVVM_IR_LEVEL_LTO (value 1). This container wraps partially-optimized LLVM bitcode -- the per-module optimizer has run, but cross-module optimization has not. The bitcode is embedded in the ELF .nv_fatbin section of the relocatable object file.
At link time, nvlink extracts the bitcode sections from all input objects, concatenates them, and passes the result back to cicc in LTO mode. cicc deserializes each container, links the bitcode modules via LLVM's Linker::linkModules, and then runs the LTO pipeline described above on the merged module. The pipeline sees the complete device program for the first time at this point.
The IRLevel enum controls which optimizations have already been applied:
| IRLevel | Value | Meaning |
|---|---|---|
| NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | 0 | Default: fully optimized, no LTO needed |
| NVVM_IR_LEVEL_LTO | 1 | Partially optimized, awaiting LTO pipeline |
| NVVM_IR_LEVEL_OPTIX | 2 | OptiX pipeline IR (separate optimization model) |
Pass Inventory
| Pass | Entry Point | Size | Pipeline Slot | Type | Sub-page |
|---|---|---|---|---|---|
| NVModuleSummary Builder | sub_D7D4E0 | 74 KB | N/A (called from driver) | Analysis | module-summary.md |
| NVModuleSummary Driver | sub_D81040 | 56 KB | N/A (LTO entry) | Module | module-summary.md |
| ThinLTO Function Import | sub_1854A20 | 4.3 KB | Slot 43 ("function-import") | Module | thinlto-import.md |
| ThinLTO Threshold Engine | sub_1853180 | 5.1 KB | N/A (called from import driver) | Utility | thinlto-import.md |
| NVIDIA Custom Inliner | sub_1864060 | 75 KB | CGSCC pass | CGSCC | inliner-cost.md |
| LLVM Standard InlineCost | sub_30DC7E0 | 51 KB | N/A (library) | Analysis | inliner-cost.md |
| New PM CGSCC Inliner | sub_2613930 | 69 KB | CGSCC pass | CGSCC | inliner-cost.md |
| NVPTX Target Cost Modifier | sub_38576C0 | 58 KB | N/A (target hook) | Target | inliner-cost.md |
| GlobalOpt | sub_18612A0 | 65 KB | Slot 45 ("globalopt") | Module | globalopt.md |
| WholeProgramDevirt | sub_2703170 | 13 KB | Slot 121 ("wholeprogramdevirt") | Module | devirtualization.md |
Key Differences from CPU LTO
| Aspect | CPU LTO | CICC GPU LTO |
|---|---|---|
| Import threshold | 100 instructions (default) | Priority-class multipliers, global budget at dword_4FAB120 |
| Cold import | 0x multiplier (never import cold) | Imports cold functions if priority >= 2 |
| Inline budget | 225 (LLVM default) | 20,000 (NVIDIA custom), 89x larger |
| Devirt conservatism | Must handle DSOs, hidden visibility | Full type hierarchy always visible |
| Code size concern | Bloats .text, impacts cache/pages | No shared libs; size is secondary to register pressure |
| Address spaces | Trivial (flat memory model) | 5+ address spaces; GlobalOpt must preserve AS through splits |
| Dead symbol elimination | Linker GC sections | Dead kernel elimination in sub_12F5F30 |
| Threshold comparison | Integer instruction count | Floating-point threshold with hotness/linkage/priority multipliers |
| ML-guided inlining | Available upstream | Integrated via InlineAdvisor at sub_2609820 with model at sub_29B2CD0 |
LTO Knob Summary
NVModuleSummary Knobs
| Knob | Default | Effect |
|---|---|---|
| dword_4F87C60 (global override) | 0 | When nonzero, forces all symbols to be importable; value 2 = conservative comdat handling |
ThinLTO Import Knobs
Registered in ctor_184_0 (0x4DA920) and ctor_029 (0x489C80):
| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | int | 100 | Base instruction count threshold for import |
| import-hot-multiplier | float | 10.0 | Multiplier applied to threshold for hot callsites |
| import-cold-multiplier | float | 0.0 | Multiplier for cold callsites (0 = never import cold on CPU) |
| dword_4FAB120 | int | -1 | Global import budget; negative = unlimited |
| dword_4FAA770 | int | 0 | Current import count (runtime accumulator) |
| summary-file | string | -- | Path to external summary file for ThinLTO |
| function-import | -- | -- | Pipeline registration string (slot 43) |
| disable-thinlto-funcattrs | bool | false | Disable ThinLTO function attribute propagation |
| thinlto-workload-def | string | -- | Workload definition file for priority-guided import |
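A hedged model of how these knobs might combine into an import decision. The exact combination logic in the threshold engine (sub_1853180) is not documented, so the formula below is an assumption consistent with the defaults in the table:

```python
def import_threshold(base_limit, hotness, hot_mult=10.0, cold_mult=0.0):
    """Floating-point import threshold, scaled by callsite hotness."""
    if hotness == "hot":
        return base_limit * hot_mult
    if hotness == "cold":
        return base_limit * cold_mult
    return float(base_limit)

def may_import(instr_count, hotness, imported_so_far,
               budget=-1, base_limit=100):
    """Decide whether a candidate function may be imported.
    A negative budget means unlimited (dword_4FAB120 semantics)."""
    if budget >= 0 and imported_so_far >= budget:
        return False   # global import budget exhausted
    return instr_count <= import_threshold(base_limit, hotness)
```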
Inliner Knobs
Registered in ctor_186_0 (0x4DBEC0):
| Knob | Type | Default | Effect |
|---|---|---|---|
| inline-budget | int | 20,000 | Per-caller inlining cost budget (NVIDIA custom model) |
| inline-total-budget | int | -- | Global total budget across all callers |
| inline-adj-budget1 | int | -- | Adjusted per-caller budget (secondary) |
| nv-inline-all | bool | off | Force inline every function call |
| profuseinline | bool | off | Verbose inlining diagnostic output |
| inline-switchctrl | int | -- | Heuristic tuning for switch statements |
| inline-threshold | int | 225 | LLVM standard model threshold (separate from NVIDIA's 20K) |
| function-inline-cost-multiplier | float | -- | New PM: penalty multiplier for recursive functions |
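As a rough mental model, the NVIDIA cost model behaves like a greedy per-caller budget rather than LLVM's per-callsite threshold. This sketch is purely illustrative; the real pass interleaves cost estimation with the CGSCC walk:

```python
def inline_decisions(call_sites, budget=20_000):
    """Greedy per-caller budget model: each call site carries an
    estimated inlining cost; sites are inlined in order until the
    caller's budget (inline-budget, default 20,000) is exhausted."""
    spent, inlined = 0, []
    for name, cost in call_sites:
        if spent + cost <= budget:
            spent += cost
            inlined.append(name)
        # else: skip this site, but keep trying cheaper ones
    return inlined, spent
```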
GlobalOpt Knobs
No dedicated cl::opt flags. All thresholds are hardcoded:
| Parameter | Value | Description |
|---|---|---|
| Max bits for promotion | 2,047 (0x7FF) | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Structs with >16 fields are not split |
| Hash table load factor | 75% | Triggers rehash of processed-globals table |
| Pipeline position | Step 30 (tier 2/3) | After GlobalDCE, before LoopVectorize |
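The two hardcoded thresholds imply a simple decision ladder, sketched below (the `"keep"` fallthrough label and the function shape are assumptions for illustration):

```python
def globalopt_strategy(bit_width, num_fields):
    """Choose a GlobalOpt transformation from the hardcoded limits:
    promotion for globals up to 2047 bits, SRA for larger structs
    with at most 16 fields, otherwise leave the global alone."""
    if bit_width <= 2047:                        # 0x7FF bit cap
        return "promote"                         # small-constant promotion
    if num_fields is not None and num_fields <= 16:
        return "sra"                             # scalar replacement
    return "keep"
```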
Devirtualization Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt | -- | -- | Pipeline registration string (slot 121) |
The pass has no NVIDIA-specific tuning knobs. It relies entirely on the completeness of type_test metadata produced by the NVModuleSummary builder.
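Conceptually, the closed-world guarantee reduces devirtualization to a lookup: if a type's vtable slot resolves to exactly one implementation anywhere in the merged module, the indirect call can be rewritten as a direct call. A toy model with illustrative names (the real pass works from type_test metadata and 0x90-byte records, not dicts):

```python
def devirtualize(call_sites, vtables):
    """Toy whole-program devirtualization: `call_sites` maps a site id
    to the static type-id at that call; `vtables` maps a type-id to
    the set of implementations seen in the (closed-world) module.
    A call is rewritten only when exactly one target exists."""
    rewritten = {}
    for site, type_id in call_sites.items():
        impls = vtables.get(type_id, set())
        if len(impls) == 1:
            rewritten[site] = next(iter(impls))  # now a direct call
    return rewritten
```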
Cross-References
- NVModuleSummary Builder -- 4-level import priority, complexity budget, CUDA attribute tracking
- ThinLTO Function Import -- threshold computation, priority-class multipliers, global budget
- Inliner Cost Model -- four parallel models, .param address space cost, ML advisory
- GlobalOpt for GPU -- address-space-aware SRA, small-constant promotion, malloc elimination
- Whole-Program Devirtualization -- closed-world virtual call resolution, type test metadata
- NVVM Container Format -- IRLevel enum, CompileMode, bitcode payload encoding
- LLVM Optimizer -- LTO pipeline entry at sub_12F5F30, tier system
- LazyCallGraph & CGSCC -- call graph infrastructure used by the CGSCC inliner
- Entry Point & CLI -- flag catalog routing to the LTO output vector, -dc mode