
LTO & Module Optimization

CICC v13.0 implements Link-Time Optimization as a five-pass pipeline that exploits the GPU's closed-world compilation model to unlock optimization opportunities unavailable to CPU compilers. In CPU LTO, the linker merges partially-optimized object files and runs a second round of optimization on the combined module, but shared libraries, dynamic loading, and symbol interposition limit what the optimizer can assume about the complete program. On GPU, none of these constraints exist. Every __device__ function that can execute on the hardware must be statically visible at compile time -- there is no device-side dlopen, no .so files, no PLT/GOT, no symbol preemption. This closed-world guarantee means the LTO pipeline can inline aggressively across translation units, devirtualize every virtual call site against a complete class hierarchy, and promote or split global variables with full knowledge that no external observer will access the original symbols.

The LTO pipeline runs after the main LLVM optimizer (tier 0-3 passes) has performed per-module optimization. It is triggered when cicc processes bitcode from separate compilation (nvcc --device-c / -dc mode), where each .cu file compiles to a relocatable device object containing LLVM bitcode in the NVVM container. The device linker (nvlink) merges these objects and reinvokes cicc in LTO mode, passing the combined bitcode through the LTO pipeline before final PTX emission. In whole-program compilation (the default), the pipeline is still partially active -- GlobalOpt and the inliner run regardless, but the summary-based import machinery is skipped because there is only one module.

| Item | Value |
|---|---|
| LTO pipeline entry | sub_12F5F30 (0x12F5F30, 37.8 KB) |
| NVModuleSummary driver | sub_D81040 (0xD81040, 56 KB) |
| Summary builder | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| Address range (summary cluster) | 0xD60000--0xD82000 |
| Address range (import/inline cluster) | 0x1850000--0x186CA00 |
| NVVM container IRLevel for LTO | NVVM_IR_LEVEL_LTO (value 1) |
| Compile mode for separate compilation | NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) |
| Module flags read | EnableSplitLTOUnit, UnifiedLTO, ThinLTO |

Why LTO Matters for GPU

Three properties of GPU execution make LTO dramatically more valuable than on CPU:

Function calls are expensive. Every GPU function call marshals arguments through the .param calling convention via st.param / ld.param instruction sequences. A function with 8 struct arguments can generate hundreds of cycles of marshaling overhead that inlining eliminates entirely. Cross-module inlining -- which requires LTO -- is the primary mechanism for removing this cost for functions defined in separate translation units. See the inliner cost model for the full cost analysis.

Register pressure determines performance. Occupancy is bounded by per-thread register usage, with discrete cliff boundaries. Call boundaries force the backend to save and restore registers across the call site, often spilling to local memory (device DRAM, 200-800 cycle latency). LTO enables cross-module inlining, which in turn enables cross-function register allocation -- the single most impactful optimization for GPU code.
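The register-pressure cliff can be made concrete with a toy occupancy calculation. The 65,536-register file per SM and 64-warp ceiling below are illustrative figures for a Volta/Ampere-class SM, not values taken from CICC:

```python
def occupancy_limit(regs_per_thread, regfile_per_sm=65536,
                    threads_per_warp=32, max_warps=64):
    """Resident warps per SM as bounded by per-thread register usage."""
    # Each resident warp consumes regs_per_thread * 32 registers.
    by_regs = regfile_per_sm // (regs_per_thread * threads_per_warp)
    return min(by_regs, max_warps)

# A few extra registers per thread can drop whole warps of occupancy:
# occupancy_limit(32) -> 64, occupancy_limit(40) -> 51, occupancy_limit(128) -> 16
```

A kernel that spills less because of cross-function register allocation moves up this curve, which is why inlining-enabled allocation dominates GPU performance tuning.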

Indirect calls are catastrophic. An indirect call in PTX (call.uni through a register) prevents backend inlining, forces full register spills, destroys instruction scheduling freedom, and creates warp-divergence hazards. Whole-program devirtualization, which requires LTO-level visibility of the complete type hierarchy, converts indirect calls to direct calls and enables all downstream optimizations.

Regular LTO vs ThinLTO

CICC supports both regular (monolithic) LTO and ThinLTO. The LTO driver at sub_D81040 reads three module flags via sub_BA91D0 to determine which mode is active:

| Module Flag | Effect |
|---|---|
| EnableSplitLTOUnit | Enables the split LTO unit mechanism for type metadata |
| UnifiedLTO | Enables LLVM's unified LTO pipeline (combined thin+regular) |
| ThinLTO | Activates summary-based import and the two-phase declaration merge in sub_D7D4E0 |

Regular LTO merges all translation units into a single LLVM module, then runs the full optimization pipeline on the merged result. This gives the optimizer complete visibility but has O(n) memory cost in the total program size and serializes compilation. For GPU programs this is often acceptable because device code is typically smaller than host code.

ThinLTO builds per-module summaries (via NVModuleSummary), uses the summaries to make import decisions without loading full bitcode, then imports selected functions and optimizes each module independently. The builder's a8 parameter (thinlto_mode flag) activates Phase 2 of the summary builder, which performs a second walk over declarations to merge forward-declared and defined symbol tables. This mode enables parallel per-module optimization at the cost of less global visibility.

In practice, NVIDIA's toolchain (nvcc + nvlink) uses regular LTO as the default for device code, because the closed-world model and relatively small code size (compared to CPU programs) make the memory and compile-time cost acceptable. ThinLTO is available for large CUDA programs where compile time is a concern, activated by passing -dlto to nvcc (device LTO) or -flto=thin through the driver.

LTO Pipeline

The LTO pipeline executes five major passes in a fixed order. Each pass consumes the output of its predecessor:

 ┌────────────────────────────────────────────────────────────────────────┐
 │                    NVVM Container (IRLevel=1)                         │
 │                    LLVM Bitcode + Module Flags                        │
 └────────────────────┬───────────────────────────────────────────────────┘
                      │
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  1. NVModuleSummary Builder  (sub_D7D4E0, 74 KB)              │
 │     Build per-function summaries with 4-level import priority, │
 │     complexity budget, CUDA attribute flags, call graph edges  │
 └────────────────────┬────────────────────────────────────────────┘
                      │  ModuleSummaryIndex
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  2. ThinLTO Function Import  (sub_1854A20, 4.3 KB)            │
 │     Summary-guided cross-module import with floating-point     │
 │     threshold computation, priority-class multipliers,         │
 │     global import budget cap                                   │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Materialized functions + thinlto_src_module metadata
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  3. Inliner  (sub_1864060 + sub_2613930 + sub_38576C0)        │
 │     Four parallel cost models: NVIDIA custom (20K budget),     │
 │     LLVM standard (225), New PM CGSCC + ML, NVPTX target      │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Inlined module
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  4. GlobalOpt  (sub_18612A0, 65 KB)                            │
 │     Small-constant promotion (≤2047 bits), SRA for structs     │
 │     (≤16 fields), malloc/free elimination, address-space-aware │
 └────────────────────┬────────────────────────────────────────────┘
                      │  Optimized globals
                      ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │  5. WholeProgramDevirtualization  (sub_2703170, 13 KB)         │
 │     Type-test metadata → vtable resolution → direct calls      │
 │     Red-black tree for type info lookup, 0x90-byte records     │
 └────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
              Dead Kernel Elimination + GlobalDCE
              → Standard optimizer pipeline (tier 0-3)
              → Code generation + PTX emission

The LTO pipeline entry at sub_12F5F30 (37.8 KB) orchestrates this sequence and also runs dead kernel elimination -- removing __global__ functions that are never referenced by host-side kernel launches. This is a GPU-specific optimization: on CPU, the linker preserves all externally-visible entry points, but in GPU LTO the compiler knows the complete set of kernel launch sites from the host code.

LTO Pipeline Entry -- sub_12F5F30 Algorithm

sub_12F5F30 (0x12F5F30, 37,797 bytes) is the top-level LTO orchestrator. It is called after the CLI parser (sub_12F7D90) has resolved the compilation mode bitmask and the LTO argument vector has been populated from the -Xlto forwarding meta-flag. The function operates in three distinct modes determined by the mode bitmask in a13:

| Mode | Bitmask | CLI Flag | Behavior |
|---|---|---|---|
| gen-lto | 0x21 | -gen-lto | Emit partially-optimized bitcode for later linking. No dead-kernel pass. |
| full LTO | 0x23 | -lto | Full merge + optimize + dead-kernel elimination + emit PTX. |
| link-lto | 0x26 | -link-lto | Link pre-existing LTO bitcode modules, run full pipeline. |

The function's argument list is reconstructed from the LTO output vector v330 (the fourth CLI routing vector, populated by -Xlto and the six -host-ref-* flags). It receives the merged LLVM module, the host reference tables, and the compilation options struct.
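The three observed bitmasks share the 0x20 bit, which plausibly acts as an "LTO active" flag. The sketch below encodes that reading as an assumption, not a confirmed decode of a13:

```python
LTO_ACTIVE = 0x20  # assumed: common flag bit in 0x21 / 0x23 / 0x26

MODES = {0x21: "gen-lto", 0x23: "full-lto", 0x26: "link-lto"}

def decode_mode(mask):
    """Map the mode bitmask to a pipeline mode, or None when LTO is off."""
    if not mask & LTO_ACTIVE:
        return None
    return MODES.get(mask)

def runs_dead_kernel_pass(mask):
    # gen-lto only emits bitcode; the other two run elimination + the full pipeline
    return decode_mode(mask) in ("full-lto", "link-lto")
```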

Pseudocode: sub_12F5F30 Top-Level

function sub_12F5F30(module, lto_args, options, error_cb):
    # ---- Phase A: Parse LTO-specific arguments ----
    mode = NONE
    trace_enabled = false
    optimize_unused_vars = false
    host_refs = HostRefTable{}      # 6-field table: ek, ik, ec, ic, eg, ig
    force_device_c = false

    for arg in lto_args:
        switch arg:
            case "-gen-lto":       mode = GEN_LTO
            case "-link-lto":      mode = LINK_LTO
            case "-olto":          lto_opt_level = next_arg()
            case "--device-c":     device_c = true
            case "--force-device-c": force_device_c = true
            case "--trace":        trace_enabled = true
            case "-optimize-unused-variables": optimize_unused_vars = true
            case "-has-global-host-info":      has_host_info = true
            case "-host-ref-ek=*": host_refs.ek = parse_symbol_list(value)
            case "-host-ref-ik=*": host_refs.ik = parse_symbol_list(value)
            case "-host-ref-ec=*": host_refs.ec = parse_symbol_list(value)
            case "-host-ref-ic=*": host_refs.ic = parse_symbol_list(value)
            case "-host-ref-eg=*": host_refs.eg = parse_symbol_list(value)
            case "-host-ref-ig=*": host_refs.ig = parse_symbol_list(value)

    # ---- Phase B: Build preserved-symbol sets ----
    # Collect symbols from llvm.used and llvm.metadata named metadata
    used_set = collect_named_metadata(module, "llvm.used")
    metadata_set = collect_named_metadata(module, "llvm.metadata")

    # Merge host reference tables into a unified "referenced from host" set.
    # The 6 host-ref flags encode three entity types x two reference modes:
    #   e = explicit reference (symbol name appears in host launch site)
    #   i = implicit reference (symbol address taken on host side)
    #   k = kernel (__global__),  c = constant (__constant__),  g = global (__device__)
    host_referenced_kernels  = host_refs.ek  UNION  host_refs.ik
    host_referenced_constants = host_refs.ec  UNION  host_refs.ic
    host_referenced_globals   = host_refs.eg  UNION  host_refs.ig

    # ---- Phase C: Decide what to preserve ----
    preserved = used_set  UNION  metadata_set  UNION  host_referenced_kernels

    if NOT optimize_unused_vars:
        preserved = preserved  UNION  host_referenced_constants
                               UNION  host_referenced_globals

    # ---- Phase D: Dead kernel/variable elimination ----
    if mode == GEN_LTO:
        # gen-lto: emit bitcode only, skip elimination
        return emit_lto_bitcode(module)

    if has_host_info:
        dead_kernel_elimination(module, preserved, trace_enabled)

        if optimize_unused_vars:
            dead_variable_elimination(module, preserved,
                                     host_referenced_constants,
                                     host_referenced_globals,
                                     trace_enabled)

    # ---- Phase E: Run the 5-pass LTO pipeline ----
    if mode == LINK_LTO or mode == FULL_LTO:
        run_module_summary_builder(module)      # sub_D7D4E0 via sub_D81040
        run_thinlto_import(module)              # sub_1854A20  (if ThinLTO)
        run_inliner(module)                     # sub_1864060 + sub_2613930
        run_globalopt(module)                   # sub_18612A0
        run_whole_program_devirt(module)        # sub_2703170
        run_global_dce(module)                  # final GlobalDCE sweep

    # ---- Phase F: Hand off to optimizer pipeline ----
    return module    # returned to sub_12E7E70 for tier 0-3 passes
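Phases B and C reduce to set unions, so the preservation policy can be modeled directly. This is a runnable restatement of the pseudocode above, with the six host-ref fields passed as a dict:

```python
def build_preserved_set(used, metadata, host_refs, optimize_unused_vars):
    """Which symbols survive dead kernel/variable elimination."""
    kernels   = host_refs["ek"] | host_refs["ik"]
    constants = host_refs["ec"] | host_refs["ic"]
    globals_  = host_refs["eg"] | host_refs["ig"]
    preserved = used | metadata | kernels
    if not optimize_unused_vars:
        # Without -optimize-unused-variables, all host-referenced
        # constants and globals are preserved as well.
        preserved |= constants | globals_
    return preserved
```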

Host Reference Flag Encoding

The six -host-ref-* flags are the mechanism by which nvlink communicates host-side symbol usage to cicc's LTO pass. nvlink inspects the host-side relocatable objects and emits a semicolon-separated list of device symbol names for each flag. The two-letter suffix encodes:

| Suffix | Entity Type | Reference Kind |
|---|---|---|
| -host-ref-ek | __global__ kernel | Explicit (launch site in host code) |
| -host-ref-ik | __global__ kernel | Implicit (address taken, e.g. &myKernel) |
| -host-ref-ec | __constant__ variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ic | __constant__ variable | Implicit (address taken) |
| -host-ref-eg | __device__ global variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ig | __device__ global variable | Implicit (address taken) |

The -has-global-host-info flag signals that nvlink has provided complete host reference information. When this flag is absent, sub_12F5F30 conservatively preserves all externally-visible symbols -- the dead kernel/variable elimination pass is skipped entirely.
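Collecting the six flags is plain string processing. A sketch assuming only what is documented above -- the semicolon separator and the -host-ref-<suffix>=<list> spelling:

```python
HOST_REF_KEYS = ("ek", "ik", "ec", "ic", "eg", "ig")

def parse_host_ref_args(args):
    """Build the six-field host reference table from an LTO argument vector."""
    table = {k: set() for k in HOST_REF_KEYS}
    for arg in args:
        if not arg.startswith("-host-ref-"):
            continue
        key, sep, value = arg[len("-host-ref-"):].partition("=")
        if sep and key in table and value:
            table[key] |= set(value.split(";"))
    return table
```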

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| sub_12F5F30 | 0x12F5F30 | 37.8 KB | LTO pipeline entry and dead-symbol orchestrator |
| sub_12F5610 | 0x12F5610 | 7.3 KB | LLVM module linker wrapper (Linker::linkModules) |
| sub_12F7D90 | 0x12F7D90 | 14.3 KB | CLI argument parser (architecture, opt level, flags) |
| sub_12F4060 | 0x12F4060 | 15.7 KB | TargetMachine creation with NVIDIA options |
| sub_1C13840 | 0x1C13840 | -- | Global/function iterator used for dead-code sweep |
| sub_12F1650 | 0x12F1650 | 5.2 KB | Bitcode reader variant A |
| sub_12F11C0 | 0x12F11C0 | 5.2 KB | Bitcode reader variant B |

Dead Kernel Elimination Algorithm

Dead kernel elimination is the most impactful GPU-specific optimization in the LTO pipeline. It exploits the closed-world model: every __global__ function that will ever execute must have a corresponding <<<>>> launch site (or cudaLaunchKernel call) in the host code that nvlink has already seen. Any kernel not in the host reference set is dead.

This pass cannot exist on CPU. A CPU linker must preserve all non-hidden external functions because shared libraries loaded at runtime via dlopen could call them. On GPU there is no dlopen, no dynamic symbol resolution, no PLT. The set of reachable kernels is completely determined at link time.

Pseudocode: dead_kernel_elimination

function dead_kernel_elimination(module, preserved_set, trace):
    # Walk all functions in the module via sub_1C13840 iterator
    worklist = []

    for func in module.functions():
        if func.isDeclaration():
            continue

        cc = func.getCallingConv()

        # PTX calling convention 71 = __global__ (kernel entry point)
        # PTX calling convention 72 = __device__ (device function)
        # PTX calling convention 95 = CUDA internal (managed init)
        if cc != 71:
            continue    # only eliminate kernels, not device functions

        name = func.getName()

        if name in preserved_set:
            continue    # referenced from host, or in llvm.used -- keep it

        # This kernel has no host launch site.
        if trace:
            emit_diagnostic("no reference to kernel " + name)

        worklist.append(func)

    # ---- Remove dead kernels ----
    for func in worklist:
        # Before erasing, check if any device-side indirect references exist.
        # On GPU, device-side function pointers (callback patterns) can reference
        # kernels via address-of. Check use_empty():
        if NOT func.use_empty():
            # Has device-side users -- cannot safely remove.
            # (This is rare: kernels are almost never called from device code.)
            continue

        func.replaceAllUsesWith(UndefValue)
        func.eraseFromParent()

    return len(worklist)
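The same logic runs below as a self-contained model. `Func` is a stand-in for llvm::Function carrying only the fields the pass inspects, and the use_empty() check is folded into the filter:

```python
from dataclasses import dataclass

PTX_KERNEL_CC = 71  # __global__ entry point (see calling conventions above)

@dataclass
class Func:
    name: str
    calling_conv: int
    is_decl: bool = False
    device_uses: int = 0   # device-side address-of references (use_empty proxy)

def dead_kernel_elimination(funcs, preserved):
    """Return the surviving functions; only unreferenced kernels are dropped."""
    survivors = []
    for f in funcs:
        dead = (not f.is_decl
                and f.calling_conv == PTX_KERNEL_CC   # kernels only
                and f.name not in preserved           # no host launch site
                and f.device_uses == 0)               # no device-side users
        if not dead:
            survivors.append(f)
    return survivors
```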

Pseudocode: dead_variable_elimination

When -optimize-unused-variables is enabled, the same logic extends to __device__ and __constant__ global variables:

function dead_variable_elimination(module, preserved_set,
                                   host_constants, host_globals, trace):
    worklist = []

    for gv in module.globals():
        if gv.isDeclaration():
            continue

        name = gv.getName()

        if name in preserved_set:
            continue

        as = gv.getAddressSpace()

        # Address space 1 = global, address space 4 = constant
        if as == 4 and name NOT in host_constants:
            if trace:
                emit_diagnostic("no reference to variable " + name)
            worklist.append(gv)
        elif as == 1 and name NOT in host_globals:
            if trace:
                emit_diagnostic("no reference to variable " + name)
            worklist.append(gv)

    for gv in worklist:
        if NOT gv.use_empty():
            continue    # still referenced from device code
        gv.eraseFromParent()

    return len(worklist)

The --trace-lto CLI flag (which maps to --trace in the LTO argument vector via the flag catalog at line 2394) enables the diagnostic messages. When active, cicc prints one line per eliminated symbol to stderr, enabling build-system integration and debugging of unexpected kernel removal.

Module Merge Process

Before sub_12F5F30 can perform dead-kernel elimination or any LTO optimization, the separate-compilation bitcode modules must be merged into a single LLVM module. This merge happens in two layers: the NVIDIA module linker wrapper sub_12F5610 (7.3 KB) and the underlying LLVM IRLinker at sub_16786A0 (61 KB).

Two-Level Linking Architecture

nvlink extracts .nv_fatbin bitcode sections
         |
         v
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA Module Loader  (sub_12C06E0, 63 KB)                │
│  - Validates LLVM bitcode magic (0xDEC0170B or 0x4243C0DE) │
│  - Checks IR version via sub_12BFF60                        │
│  - Validates target triple (must be "nvptx64-*")            │
│  - Single-module fast path: return directly if N=1          │
│  - Multi-module: normalize triples, set matching DataLayout │
└─────────────────────┬───────────────────────────────────────┘
                      │  N validated modules
                      v
┌─────────────────────────────────────────────────────────────┐
│  NVIDIA Module Linker Wrapper  (sub_12F5610, 7.3 KB)       │
│  - Selects primary module (typically the largest)           │
│  - For each secondary module:                               │
│      Copy triple from primary → secondary                   │
│      Call IRLinker to merge secondary into primary           │
│  - Post-link: restore linkage attributes from hash table    │
│      Values 7-8: external linkage (low 6 bits)              │
│      Other: set low 4 bits + visibility from bits 4-5       │
│      Set dso_local flag (byte+33 |= 0x40)                  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      v
┌─────────────────────────────────────────────────────────────┐
│  LLVM IRLinker::run  (sub_16786A0, 61 KB)                   │
│  - Allocates 0x2000-byte DenseMap for symbol resolution     │
│  - Hash function: (addr >> 9) ^ (addr >> 4)                 │
│  - Resolves COMDAT groups (sub_167DAB0, 39 KB)              │
│  - Links global value prototypes (sub_1675980, 37 KB)       │
│  - Links function bodies (sub_143B970, 14 KB)               │
│  - Merges named metadata (llvm.dbg.cu, llvm.used, etc.)     │
│  - Resolves llvm.global_ctors / llvm.global_dtors ordering  │
│  - Maps values across modules via DenseMap<Value*, Value*>  │
│  - Tombstone sentinels: empty=-8, deleted=-16               │
└─────────────────────┬───────────────────────────────────────┘
                      │  single merged module
                      v
              sub_12F5F30 (LTO pipeline entry)

Pseudocode: Module Merge (sub_12F5610 + sub_12C06E0)

function module_merge(module_list, llvm_ctx, options):
    # ---- Step 1: Load and validate all modules (sub_12C06E0) ----
    modules = []
    for entry in module_list:
        buf = open_buffer(entry.data, entry.length, entry.name)

        # Validate bitcode magic
        magic = read_u32(buf, 0)
        if magic != 0x0B17C0DE and magic != 0xDEC04342:
            error("invalid bitcode: " + entry.name)
            return NULL

        module = parse_bitcode(buf, llvm_ctx)  # sub_15099C0

        # Check IR version compatibility (sub_12BFF60)
        if ir_version_check(module_list, module, flags) != 0:
            error(entry.name + ": error: incompatible IR detected. "
                  "Possible mix of compiler/IR from different releases.")
            return NULL

        # Validate target triple
        triple = module.getTargetTriple()
        if NOT triple.startswith("nvptx64-"):
            error("Module does not contain a triple, "
                  "should be 'nvptx64-'")
            return NULL

        modules.append(module)

    # ---- Step 2: Single-module fast path ----
    if len(modules) == 1:
        return modules[0]

    # ---- Step 3: Multi-module linking (sub_12F5610) ----
    # Save linkage attributes before linking (they get modified)
    linkage_map = DenseMap<StringRef, u8>{}
    for module in modules:
        for func in module.functions():
            linkage_map[func.getName()] = func.getLinkage()
        for gv in module.globals():
            linkage_map[gv.getName()] = gv.getLinkage()

    # Select primary module and link secondaries into it
    primary = modules[0]
    for i in range(1, len(modules)):
        secondary = modules[i]

        # Normalize: copy DataLayout from primary to secondary
        secondary.setDataLayout(primary.getDataLayout())
        secondary.setTargetTriple(primary.getTargetTriple())

        # IRLinker::run (sub_16786A0)
        # Resolves COMDATs, links globals, maps values, merges metadata
        err = Linker::linkModules(primary, secondary)
        if err:
            error("<module_name>: link error: <details>")
            return NULL

    # ---- Step 4: Restore linkage attributes ----
    # During linking, LLVM may promote linkage (e.g., internal -> external)
    # to resolve cross-module references. Restore the original linkage
    # where possible, preserving the correct visibility for PTX emission.
    for func in primary.functions():
        name = func.getName()
        if name in linkage_map:
            original = linkage_map[name]
            if original in [7, 8]:       # external linkage variants
                func.setLinkage(original & 0x3F)
            else:
                func.setLinkage(original & 0x0F)
                if (original & 0x30) != 0:
                    func.setVisibility(original >> 4)
            func.setDSOLocal(true)       # byte+33 |= 0x40

    for gv in primary.globals():
        # same linkage restoration logic
        ...

    return primary
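Step 4's saved linkage byte packs three properties into one value. The decoder below mirrors the pseudocode's bit tests; the field meanings (linkage in the low bits, visibility in bits 4-5) are reconstructed from the pseudocode, not confirmed from the binary:

```python
def restore_linkage(saved):
    """Decode one saved linkage byte from the pre-link snapshot."""
    if saved in (7, 8):                           # external linkage variants
        return {"linkage": saved & 0x3F, "visibility": 0, "dso_local": True}
    vis = (saved >> 4) if (saved & 0x30) else 0   # visibility from bits 4-5
    return {"linkage": saved & 0x0F,              # low 4 bits
            "visibility": vis,
            "dso_local": True}                    # byte+33 |= 0x40 in the binary
```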

Key Data Structures in the Merge

| Structure | Location | Details |
|---|---|---|
| Value map DenseMap | Allocated in sub_16786A0 | 0x2000 bytes (8192), hash: (addr >> 9) ^ (addr >> 4), quadratic probing |
| Linkage hash table | Stack-allocated in sub_12E1EF0 (v362) | Maps StringRef name to original linkage byte |
| Function-to-module map | Stack-allocated in sub_12E1EF0 (v359) | Maps StringRef name to function pointer for split-module dispatch |
| COMDAT group map | Internal to sub_167DAB0 | Tracks COMDAT selection kinds: any / exact-match / largest / no-dup / same-size |
| Named metadata merge list | Internal to sub_1671B40 | Special handling for llvm.dbg.cu, llvm.used, llvm.compiler.used, llvm.global_ctors, llvm.global_dtors, llvm.global.annotations |
| Module config flag | dword_4F99BC0 | Controls linker behavior variant |
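The value map's behavior can be illustrated with a toy open-addressing table that reuses the hash and sentinel constants from the rows above; this is a simplified model, not LLVM's actual DenseMap:

```python
EMPTY, DELETED = -8, -16        # tombstone sentinels from the table above

def dense_hash(addr):
    return ((addr >> 9) ^ (addr >> 4)) & 0xFFFFFFFF

class PtrMap:
    """Toy quadratic-probing map keyed by integer addresses."""
    def __init__(self, buckets=64):          # power of two, like the 0x2000-byte table
        self.keys = [EMPTY] * buckets
        self.vals = [None] * buckets
        self.mask = buckets - 1
    def _probe(self, key):
        i, step = dense_hash(key) & self.mask, 1
        while self.keys[i] not in (EMPTY, key):
            i = (i + step) & self.mask       # quadratic probe sequence
            step += 1
        return i
    def put(self, key, val):
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, val
    def get(self, key):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None
```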

Split-Module Compilation and Re-Linking

When concurrent compilation is active (thread count > 1 and multiple defined functions), the optimization pipeline uses a split-module strategy: each function is extracted into its own bitcode module, optimized independently in a thread pool, and then re-linked. The split/re-link cycle uses the same sub_12F5610 linker wrapper:

  1. Split (sub_1AB9F40): extracts per-function bitcode using a filter callback (sub_12D4BD0) that selects a single function by name from the function-to-module hash table.
  2. Optimize (thread pool via sub_16D5230): each worker runs sub_12E86C0 (Phase II optimizer) with qword_4FBB3B0 = 2.
  3. Re-link (sub_12F5610): merges all per-function bitcode modules back into a single module.
  4. Restore linkage (v362 hash table): the saved linkage attributes from step 0 are written back to prevent linkage promotion artifacts.

This cycle is orchestrated by sub_12E1EF0 (51 KB, the top-level concurrent compilation entry). The GNU Jobserver integration (sub_16832F0) throttles thread pool size to match the build system's -j level when cicc is invoked from make.

Separate Compilation and the NVVM Container

When nvcc --device-c compiles a .cu file, cicc produces an NVVM container with CompileMode = NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) and IRLevel = NVVM_IR_LEVEL_LTO (value 1). This container wraps partially-optimized LLVM bitcode -- the per-module optimizer has run, but cross-module optimization has not. The bitcode is embedded in the ELF .nv_fatbin section of the relocatable object file.

At link time, nvlink extracts the bitcode sections from all input objects, concatenates them, and passes the result back to cicc in LTO mode. cicc deserializes each container, links the bitcode modules via LLVM's Linker::linkModules, and then runs the LTO pipeline described above on the merged module. The pipeline sees the complete device program for the first time at this point.

The IRLevel enum controls which optimizations have already been applied:

| IRLevel | Value | Meaning |
|---|---|---|
| NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | 0 | Default: fully optimized, no LTO needed |
| NVVM_IR_LEVEL_LTO | 1 | Partially optimized, awaiting LTO pipeline |
| NVVM_IR_LEVEL_OPTIX | 2 | OptiX pipeline IR (separate optimization model) |
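Expressed as code, the container fields look roughly like this; only the values documented above are known, and the member names follow the tables:

```python
from enum import IntEnum

class IRLevel(IntEnum):
    UNIFIED_AFTER_DCI = 0   # fully optimized, no LTO needed
    LTO = 1                 # partially optimized, awaiting the LTO pipeline
    OPTIX = 2               # OptiX pipeline IR

class CompileMode(IntEnum):
    SEPARATE_ABI = 2        # the only documented value; others unknown

def needs_lto_pipeline(level):
    return level == IRLevel.LTO
```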

Pass Inventory

| Pass | Entry Point | Size | Pipeline Slot | Type | Sub-page |
|---|---|---|---|---|---|
| NVModuleSummary Builder | sub_D7D4E0 | 74 KB | N/A (called from driver) | Analysis | module-summary.md |
| NVModuleSummary Driver | sub_D81040 | 56 KB | N/A (LTO entry) | Module | module-summary.md |
| ThinLTO Function Import | sub_1854A20 | 4.3 KB | Slot 43 ("function-import") | Module | thinlto-import.md |
| ThinLTO Threshold Engine | sub_1853180 | 5.1 KB | N/A (called from import driver) | Utility | thinlto-import.md |
| NVIDIA Custom Inliner | sub_1864060 | 75 KB | CGSCC pass | CGSCC | inliner-cost.md |
| LLVM Standard InlineCost | sub_30DC7E0 | 51 KB | N/A (library) | Analysis | inliner-cost.md |
| New PM CGSCC Inliner | sub_2613930 | 69 KB | CGSCC pass | CGSCC | inliner-cost.md |
| NVPTX Target Cost Modifier | sub_38576C0 | 58 KB | N/A (target hook) | Target | inliner-cost.md |
| GlobalOpt | sub_18612A0 | 65 KB | Slot 45 ("globalopt") | Module | globalopt.md |
| WholeProgramDevirt | sub_2703170 | 13 KB | Slot 121 ("wholeprogramdevirt") | Module | devirtualization.md |

Key Differences from CPU LTO

| Aspect | CPU LTO | CICC GPU LTO |
|---|---|---|
| Import threshold | 100 instructions (default) | Priority-class multipliers, global budget at dword_4FAB120 |
| Cold import | 0x multiplier (never import cold) | Imports cold functions if priority >= 2 |
| Inline budget | 225 (LLVM default) | 20,000 (NVIDIA custom), 89x larger |
| Devirt conservatism | Must handle DSOs, hidden visibility | Full type hierarchy always visible |
| Code size concern | Bloats .text, impacts cache/pages | No shared libs; size is secondary to register pressure |
| Address spaces | Trivial (flat memory model) | 5+ address spaces; GlobalOpt must preserve AS through splits |
| Dead symbol elimination | Linker GC sections | Dead kernel elimination in sub_12F5F30 |
| Threshold comparison | Integer instruction count | Floating-point threshold with hotness/linkage/priority multipliers |
| ML-guided inlining | Available upstream | Integrated via InlineAdvisor at sub_2609820 with model at sub_29B2CD0 |

LTO Knob Summary

NVModuleSummary Knobs

| Knob | Default | Effect |
|---|---|---|
| dword_4F87C60 (global override) | 0 | When nonzero, forces all symbols to be importable; value 2 = conservative comdat handling |

ThinLTO Import Knobs

Registered in ctor_184_0 (0x4DA920) and ctor_029 (0x489C80):

| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | int | 100 | Base instruction count threshold for import |
| import-hot-multiplier | float | 10.0 | Multiplier applied to threshold for hot callsites |
| import-cold-multiplier | float | 0.0 | Multiplier for cold callsites (0 = never import cold on CPU) |
| dword_4FAB120 | int | -1 | Global import budget; negative = unlimited |
| dword_4FAA770 | int | 0 | Current import count (runtime accumulator) |
| summary-file | string | -- | Path to external summary file for ThinLTO |
| function-import | -- | -- | Pipeline registration string (slot 43) |
| disable-thinlto-funcattrs | bool | false | Disable ThinLTO function attribute propagation |
| thinlto-workload-def | string | -- | Workload definition file for priority-guided import |
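The knobs compose as a multiplier on the base limit, with the global budget acting as a hard cap. A sketch under the defaults above; the exact combination order inside the threshold engine (sub_1853180) is not confirmed:

```python
def import_threshold(base=100, hotness="none",
                     hot_mult=10.0, cold_mult=0.0):
    """Effective per-callsite threshold (import-instr-limit x multiplier)."""
    mult = {"hot": hot_mult, "cold": cold_mult}.get(hotness, 1.0)
    return base * mult

def may_import(instr_count, imported_so_far, budget=-1, **kwargs):
    """Budget check modeled on dword_4FAB120 semantics (negative = unlimited)."""
    if budget >= 0 and imported_so_far >= budget:
        return False                  # global import budget exhausted
    return instr_count <= import_threshold(**kwargs)
```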

Inliner Knobs

Registered in ctor_186_0 (0x4DBEC0):

| Knob | Type | Default | Effect |
|---|---|---|---|
| inline-budget | int | 20,000 | Per-caller inlining cost budget (NVIDIA custom model) |
| inline-total-budget | int | -- | Global total budget across all callers |
| inline-adj-budget1 | int | -- | Adjusted per-caller budget (secondary) |
| nv-inline-all | bool | off | Force inline every function call |
| profuseinline | bool | off | Verbose inlining diagnostic output |
| inline-switchctrl | int | -- | Heuristic tuning for switch statements |
| inline-threshold | int | 225 | LLVM standard model threshold (separate from NVIDIA's 20K) |
| function-inline-cost-multiplier | float | -- | New PM: penalty multiplier for recursive functions |

GlobalOpt Knobs

No dedicated cl::opt flags. All thresholds are hardcoded:

| Parameter | Value | Description |
|---|---|---|
| Max bits for promotion | 2,047 (0x7FF) | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Structs with >16 fields are not split |
| Hash table load factor | 75% | Triggers rehash of processed-globals table |
| Pipeline position | Step 30 (tier 2/3) | After GlobalDCE, before LoopVectorize |
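The two limits form a simple decision ladder per global. A sketch; the promote-then-SRA ordering follows the table's "falls through to SRA" description:

```python
MAX_PROMO_BITS = 2047    # 0x7FF, hardcoded
MAX_SRA_FIELDS = 16      # hardcoded

def globalopt_strategy(bit_width, struct_fields=None):
    """Pick the GlobalOpt transform for one global, per the hardcoded limits."""
    if bit_width <= MAX_PROMO_BITS:
        return "promote"                         # small-constant promotion
    if struct_fields is not None and struct_fields <= MAX_SRA_FIELDS:
        return "sra"                             # scalar replacement of aggregates
    return "keep"
```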

Devirtualization Knobs

| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt | -- | -- | Pipeline registration string (slot 121) |

The pass has no NVIDIA-specific tuning knobs. It relies entirely on the completeness of type_test metadata produced by the NVModuleSummary builder.

Cross-References